Multi-Core Processor Memory Contention Benchmark Analysis Case Study
NASA Technical Reports Server (NTRS)
Simon, Tyler; McGalliard, James
2009-01-01
Multi-core processors dominate current mainframe, server, and high performance computing (HPC) systems. This paper provides synthetic kernel and natural benchmark results from an HPC system at the NASA Goddard Space Flight Center that illustrate the performance impacts of multi-core (dual- and quad-core) vs. single core processor systems. Analysis of processor design, application source code, and synthetic and natural test results all indicate that multi-core processors can suffer from significant memory subsystem contention compared to similar single-core processors.
NASA Astrophysics Data System (ADS)
Giusi, Giovanni; Liu, Scige J.; Galli, Emanuele; Di Giorgio, Anna M.; Farina, Maria; Vertolli, Nello; Di Lellis, Andrea M.
2016-07-01
In this paper we present the results of a series of performance tests carried out on a prototype board mounting the Cobham Gaisler GR712RC dual-core LEON3FT processor. The aim was to characterize the performance of the dual-core processor when executing a highly demanding lossless compression task on data segments continuously copied from static memory to the processor RAM. Compression was chosen as the test workload because it allows a comparison with earlier tests on the Cobham/Aeroflex Gaisler UT699 LEON3FT SPARC™ V8. The tests showed an improvement by a factor of 1.6 over the previous tests, which can easily be improved further by adopting a faster onboard clock, and indicated the best size for the data chunks used in the compression activity.
Optimization of image processing algorithms on mobile platforms
NASA Astrophysics Data System (ADS)
Poudel, Pramod; Shirvaikar, Mukul
2011-03-01
This work presents a technique to optimize popular image processing algorithms on mobile platforms such as cell phones, net-books and personal digital assistants (PDAs). The increasing demand for video applications like context-aware computing on mobile embedded systems requires the use of computationally intensive image processing algorithms. The system engineer has a mandate to optimize them so as to meet real-time deadlines. A methodology to take advantage of the asymmetric dual-core processor, which includes an ARM and a DSP core supported by shared memory, is presented with implementation details. The target platform chosen is the popular OMAP 3530 processor for embedded media systems. It has an asymmetric dual-core architecture with an ARM Cortex-A8 and a TMS320C64x Digital Signal Processor (DSP). The development platform was the BeagleBoard with 256 MB of NAND RAM and 256 MB SDRAM memory. The basic image correlation algorithm is chosen for benchmarking as it finds widespread application for various template matching tasks such as face-recognition. The basic algorithm prototypes conform to OpenCV, a popular computer vision library. OpenCV algorithms can be easily ported to the ARM core which runs a popular operating system such as Linux or Windows CE. However, the DSP is architecturally more efficient at handling DFT algorithms. The algorithms are tested on a variety of images and performance results are presented measuring the speedup obtained due to dual-core implementation. A major advantage of this approach is that it allows the ARM processor to perform important real-time tasks, while the DSP addresses performance-hungry algorithms.
Initial Performance Results on IBM POWER6
NASA Technical Reports Server (NTRS)
Saini, Subhash; Talcott, Dale; Jespersen, Dennis; Djomehri, Jahed; Jin, Haoqiang; Mehrotra, Piyush
2008-01-01
The POWER5+ processor has a faster memory bus than that of the previous generation POWER5 processor (533 MHz vs. 400 MHz), but the measured per-core memory bandwidth of the latter is better than that of the former (5.7 GB/s vs. 4.3 GB/s). The reason for this is that in the POWER5+, the two cores on the chip share the L2 cache, L3 cache and memory bus. The memory controller is also on the chip and is shared by the two cores. This serializes the path to memory. For consistently good performance on a wide range of applications, the performance of the processor, the memory subsystem, and the interconnects (both latency and bandwidth) should be balanced. Recognizing this, IBM has designed the POWER6 processor so as to avoid the bottlenecks due to the L2 cache, memory controller and buffer chips of the POWER5+. Unlike the POWER5+, each core in the POWER6 has its own L2 cache (4 MB, double that of the POWER5+), memory controller and buffer chips. Each core in the POWER6 runs at 4.7 GHz instead of 1.9 GHz in POWER5+. In this paper, we evaluate the performance of a dual-core POWER6 based IBM p6-570 system, and we compare its performance with that of a dual-core POWER5+ based IBM p575+ system. In this evaluation, we have used the High-Performance Computing Challenge (HPCC) benchmarks, NAS Parallel Benchmarks (NPB), and four real-world applications: three from computational fluid dynamics and one from climate modeling.
GR712RC Dual-Core Processor: Product Status
NASA Astrophysics Data System (ADS)
Sturesson, Fredrik; Habinc, Sandi; Gaisler, Jiri
2012-08-01
The GR712RC System-on-Chip (SoC) is a dual-core LEON3FT system suitable for advanced high-reliability space avionics. Fault tolerance features from Aeroflex Gaisler's GRLIB IP library and an implementation using the Ramon Chips RadSafe cell library enable superior radiation hardness. The GR712RC device has been designed to provide high processing power by including two LEON3FT 32-bit SPARC V8 processors, each with its own high-performance IEEE754 compliant floating-point unit and SPARC reference memory management unit. This high processing power is combined with a large number of serial interfaces, ranging from high-speed links for data transfers to low-speed control buses for commanding and status acquisition.
Enhancing Image Processing Performance for PCID in a Heterogeneous Network of Multi-core Processors
2009-09-01
TFLOPS of Playstation 3 (PS3) nodes with IBM Cell Broadband Engine multi-cores and 15 dual-quad Xeon head nodes. The interconnect fabric includes... 3. INFORMATION MANAGEMENT FOR PARALLELIZATION AND...STREAMING... 4. RESULTS
RTEMS SMP and MTAPI for Efficient Multi-Core Space Applications on LEON3/LEON4 Processors
NASA Astrophysics Data System (ADS)
Cederman, Daniel; Hellstrom, Daniel; Sherrill, Joel; Bloom, Gedare; Patte, Mathieu; Zulianello, Marco
2015-09-01
This paper presents the final result of a European Space Agency (ESA) activity aimed at improving the software support for LEON processors used in SMP configurations. One of the benefits of using a multicore system in an SMP configuration is that in many instances it is possible to better utilize the available processing resources by load balancing between cores. This, however, comes at the cost of having to synchronize operations between cores, leading to increased complexity. While in an AMP system one can use multiple instances of operating systems that are only uni-processor capable, an SMP system requires the operating system to be written to support multicore systems. In this activity we have improved and extended the SMP support of the RTEMS real-time operating system and ensured that it fully supports the multicore-capable LEON processors. The targeted hardware in the activity has been the GR712RC, a dual-core LEON3FT processor, and the functional prototype of ESA's Next Generation Multiprocessor (NGMP), a quad-core LEON4 processor. The final version of the NGMP is now available as a product under the name GR740. An implementation of the Multicore Task Management API (MTAPI) has been developed as part of this activity to aid in the parallelization of applications for RTEMS SMP. It allows for simplified development of parallel applications using the task-based programming model. An existing space application, the Gaia Video Processing Unit, has been ported to RTEMS SMP using the MTAPI implementation to demonstrate the feasibility and usefulness of multicore processors for space payload software. The activity is funded by ESA under contract 4000108560/13/NL/JK. Gedare Bloom is supported in part by NSF CNS-0934725.
Cognitive Medical Wireless Testbed System (COMWITS)
2016-11-01
This testbed merges two ARO grants...bit 64 bit CPU Intel Xeon Processor E5-1650v3 (6C, 3.5 GHz, Turbo, HT, 15M, 140W) Intel Core i7-3770 (3.4 GHz Quad Core, 77W) Dual Intel Xeon
Developing infrared array controller with software real time operating system
NASA Astrophysics Data System (ADS)
Sako, Shigeyuki; Miyata, Takashi; Nakamura, Tomohiko; Motohara, Kentaro; Uchimoto, Yuka Katsuno; Onaka, Takashi; Kataza, Hirokazu
2008-07-01
Real-time capabilities are required for a controller of a large format array to reduce the dead time caused by readout and data transfer. Real-time processing has so far been achieved with dedicated processors including DSP, CPLD, and FPGA devices. However, dedicated processors have problems with memory resources, inflexibility, and high cost. Meanwhile, a recent PC has sufficient CPU and memory resources to control the infrared array and to process a large amount of frame data in real time. In this study, we have developed an infrared array controller with a software real-time operating system (RTOS) instead of dedicated processors. A Linux PC equipped with an RTAI extension and a dual-core CPU is used as the main computer, and one of the CPU cores is allocated to the real-time processing. A digital I/O board with DMA functions is used as the I/O interface. The signal-processing cores are integrated in the OS kernel as a real-time driver module, which is composed of two virtual devices: the clock processor task and the frame processor task. The array controller with the RTOS realizes complicated operations easily, flexibly, and at a low cost.
Parallelization of a Monte Carlo particle transport simulation code
NASA Astrophysics Data System (ADS)
Hadjidoukas, P.; Bousis, C.; Emfietzoglou, D.
2010-05-01
We have developed a high performance version of the Monte Carlo particle transport simulation code MC4. The original application code, developed in Visual Basic for Applications (VBA) for Microsoft Excel, was first rewritten in the C programming language to improve code portability. Several pseudo-random number generators have also been integrated and studied. The new MC4 version was then parallelized for shared and distributed-memory multiprocessor systems using the Message Passing Interface. Two parallel pseudo-random number generator libraries (SPRNG and DCMT) have been seamlessly integrated. The performance speedup of parallel MC4 has been studied on a variety of parallel computing architectures including an Intel Xeon server with 4 dual-core processors, a Sun cluster consisting of 16 nodes of 2 dual-core AMD Opteron processors and a 200 dual-processor HP cluster. For large problem sizes, which are limited only by the physical memory of the multiprocessor server, the speedup results are almost linear on all systems. We have validated the parallel implementation against the serial VBA and C implementations using the same random number generator. Our experimental results on the transport and energy loss of electrons in a water medium show that the serial and parallel codes are equivalent in accuracy. The present improvements allow the study of higher particle energies with more accurate physical models, and improve statistics, as more particle tracks can be simulated with low response time.
High Speed White Dwarf Asteroseismology with the Herty Hall Cluster
NASA Astrophysics Data System (ADS)
Gray, Aaron; Kim, A.
2012-01-01
Asteroseismology is the process of using observed oscillations of stars to infer their interior structure. In high speed asteroseismology, we accomplish this by quickly computing hundreds of thousands of models to match the observed period spectra. Each model takes five to ten seconds to run on a single processor. Therefore, we use a cluster of sixteen Dell workstations with dual-core processors. The computers use the Ubuntu operating system and Apache Hadoop software to manage workloads.
2014-10-01
Table 19: Raspberry Pi Information...boards – These are single board devices targeted to education and embedding, the best known being the Raspberry Pi; and 3. Development boards – These...popular, as it has high performance processor (perhaps 4 times the power of a Raspberry Pi) with dual core processors running at 1.6 GHz and the cost is
DOE Office of Scientific and Technical Information (OSTI.GOV)
Barhen, Jacob; Imam, Neena
2007-01-01
Revolutionary computing technologies are defined in terms of technological breakthroughs, which leapfrog over near-term projected advances in conventional hardware and software to produce paradigm shifts in computational science. For underwater threat source localization using information provided by a dynamical sensor network, one of the most promising computational advances builds upon the emergence of digital optical-core devices. In this article, we present initial results of sensor network calculations that focus on the concept of signal wavefront time-difference-of-arrival (TDOA). The corresponding algorithms are implemented on the EnLight processing platform recently introduced by Lenslet Laboratories. This tera-scale digital optical core processor is optimized for array operations, which it performs in a fixed-point-arithmetic architecture. Our results (i) illustrate the ability to reach the required accuracy in the TDOA computation, and (ii) demonstrate that a considerable speed-up can be achieved when using the EnLight 64a prototype processor as compared to a dual Intel Xeon™ processor.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Bylaska, Eric J.; Jacquelin, Mathias; De Jong, Wibe A.
2017-10-20
Ab-initio Molecular Dynamics (AIMD) methods are an important class of algorithms, as they enable scientists to understand the chemistry and dynamics of molecular and condensed phase systems while retaining a first-principles-based description of their interactions. Many-core architectures such as the Intel® Xeon Phi™ processor are an interesting and promising target for these algorithms, as they can provide the computational power that is needed to solve interesting problems in chemistry. In this paper, we describe the efforts of refactoring the existing AIMD plane-wave method of NWChem from an MPI-only implementation to a scalable, hybrid code that employs MPI and OpenMP to exploit the capabilities of current and future many-core architectures. We describe the optimizations required to get close to optimal performance for the multiplication of the tall-and-skinny matrices that form the core of the computational algorithm. We present strong scaling results on the complete AIMD simulation for a test case that simulates 256 water molecules and that strong-scales well on a cluster of 1024 nodes of Intel Xeon Phi processors. We compare the performance obtained with a cluster of dual-socket Intel® Xeon® E5-2698v3 processors.
Of Ivory and Smurfs: Loxodontan MapReduce Experiments for Web Search
2009-11-01
i.e., index construction may involve multiple flushes to local disk and on-disk merge sorts outside of MapReduce). Once the local indexes have been...contained 198 cores, which, with current dual-processor quad-core configurations, could fit into 25 machines—a far more modest cluster with today's...significant impact on effectiveness. Our simple pruning technique was performed at query time and hence could be adapted to query-dependent
Efficiency of static core turn-off in a system-on-a-chip with variation
Cher, Chen-Yong; Coteus, Paul W; Gara, Alan; Kursun, Eren; Paulsen, David P; Schuelke, Brian A; Sheets, II, John E; Tian, Shurong
2013-10-29
A processor-implemented method for improving efficiency of a static core turn-off in a multi-core processor with variation, the method comprising: conducting via a simulation a turn-off analysis of the multi-core processor at the multi-core processor's design stage, wherein the turn-off analysis of the multi-core processor at the multi-core processor's design stage includes a first output corresponding to a first multi-core processor core to turn off; conducting a turn-off analysis of the multi-core processor at the multi-core processor's testing stage, wherein the turn-off analysis of the multi-core processor at the multi-core processor's testing stage includes a second output corresponding to a second multi-core processor core to turn off; comparing the first output and the second output to determine if the first output is referring to the same core to turn off as the second output; outputting a third output corresponding to the first multi-core processor core if the first output and the second output are both referring to the same core to turn off.
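The comparison step in this claim can be sketched in a few lines of Python. This is only an illustration of the agreement check described above, not the patented method: the function name and the use of integer core indices are assumptions, and the abstract does not say what happens when the two analyses disagree, so the sketch simply returns `None` in that case.

```python
from typing import Optional


def select_core_to_turn_off(design_stage_core: int,
                            testing_stage_core: int) -> Optional[int]:
    """Return the core to turn off only when both analyses name the same core.

    design_stage_core:  core index chosen by the simulated design-stage analysis
                        (the "first output" in the abstract).
    testing_stage_core: core index chosen by the testing-stage analysis
                        (the "second output").
    Both parameter names are hypothetical.
    """
    if design_stage_core == testing_stage_core:
        # The "third output": the agreed-upon core.
        return design_stage_core
    # Disagreement handling is not described in the abstract.
    return None


agreed = select_core_to_turn_off(3, 3)
disagreed = select_core_to_turn_off(3, 5)
```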
A Locality-Based Threading Algorithm for the Configuration-Interaction Method
DOE Office of Scientific and Technical Information (OSTI.GOV)
Shan, Hongzhang; Williams, Samuel; Johnson, Calvin; ...
2017-07-03
The Configuration Interaction (CI) method has been widely used to solve the non-relativistic many-body Schrödinger equation. One great challenge to implementing it efficiently on manycore architectures is its immense memory and data movement requirements. To address this issue, within each node, we exploit a hybrid MPI+OpenMP programming model in lieu of the traditional flat MPI programming model. In this paper, we develop optimizations that partition the workloads among OpenMP threads based on data locality, which is essential in ensuring applications with complex data access patterns scale well on manycore architectures. The new algorithm scales to 256 threads on the 64-core Intel Knights Landing (KNL) manycore processor and 24 threads on dual-socket Ivy Bridge (Xeon) nodes. Compared with the original implementation, the performance has been improved by up to 7× on the Knights Landing processor and 3× on the dual-socket Ivy Bridge node.
Results of SEI Independent Research and Development Projects
2008-12-01
contained there. When laptops with a dual-core processor came out, ITunes fails crashed. ITunes was designed as multi-threaded application, but until...involving product portfolio, in-bound technical marketing, research and development, product engineering, supply chain, and out-bound sales and marketing...of quality and process improvement professionals to the marketing, product engineering, supply chain, product test and sales professionals. 3
2015-06-01
5110P and 16 dx360M4 nodes each with one NVIDIA Kepler K20M/K40M GPU. Each node contained dual Intel Xeon E5-2670 (Sandy Bridge) central processing...kernel and as such does not employ multiple processors. This work makes use of a single processing core and a single NVIDIA Kepler K40 GK110...bandwidth (2 × 16 slot), 7.877 GFloat/s; Kepler K40 peak, 4,290 × 1 billion floating-point operations (GFLOPs), and 288 GB/s Kepler K40 memory
A Parallel Saturation Algorithm on Shared Memory Architectures
NASA Technical Reports Server (NTRS)
Ezekiel, Jonathan; Siminiceanu
2007-01-01
Symbolic state-space generators are notoriously hard to parallelize. However, the Saturation algorithm implemented in the SMART verification tool differs from other sequential symbolic state-space generators in that it exploits the locality of firing events in asynchronous system models. This paper explores whether event locality can be utilized to efficiently parallelize Saturation on shared-memory architectures. Conceptually, we propose to parallelize the firing of events within a decision diagram node, which is technically realized via a thread pool. We discuss the challenges involved in our parallel design and conduct experimental studies on its prototypical implementation. On a dual-processor dual-core PC, our studies show speed-ups for several example models, e.g., of up to 50% for a Kanban model, when compared to running our algorithm only on a single core.
Cognitive and neural foundations of discrete sequence skill: a TMS study.
Ruitenberg, Marit F L; Verwey, Willem B; Schutter, Dennis J L G; Abrahamse, Elger L
2014-04-01
Executing discrete movement sequences typically involves a shift with practice from a relatively slow, stimulus-based mode to a fast mode in which performance is based on retrieving and executing entire motor chunks. The dual processor model explains the performance of (skilled) discrete key-press sequences in terms of an interplay between a cognitive processor and a motor system. In the present study, we tested and confirmed the core assumptions of this model at the behavioral level. In addition, we explored the involvement of the pre-supplementary motor area (pre-SMA) in discrete sequence skill by applying inhibitory 20 min 1-Hz off-line repetitive transcranial magnetic stimulation (rTMS). Based on previous work, we predicted pre-SMA involvement in the selection/initiation of motor chunks, and this was confirmed by our results. The pre-SMA was further observed to be more involved in more complex than in simpler sequences, while no evidence was found for pre-SMA involvement in direct stimulus-response translations or associative learning processes. In conclusion, support is provided for the dual processor model, and for pre-SMA involvement in the initiation of motor chunks.
Embedded Palmprint Recognition System Using OMAP 3530
Shen, Linlin; Wu, Shipei; Zheng, Songhao; Ji, Zhen
2012-01-01
We have proposed in this paper an embedded palmprint recognition system using the dual-core OMAP 3530 platform. An improved algorithm based on palm code was proposed first. In this method, a Gabor wavelet is first convolved with the palmprint image to produce a response image, where local binary patterns are then applied to code the relation between the magnitude of the wavelet response at the central pixel and that of its neighbors. The method is fully tested using the public PolyU palmprint database. While palm code achieves only about 89% accuracy, over 96% accuracy is achieved by the proposed G-LBP approach. The proposed algorithm was then deployed to the DSP processor of the OMAP 3530, where it works together with the ARM processor for feature extraction. When complicated algorithms run on the DSP processor, the ARM processor can focus on image capture, user interface and peripheral control. Integrated with an image sensing module and central processing board, the designed device can achieve accurate and real-time performance. PMID:22438721
Ordering of guarded and unguarded stores for no-sync I/O
Gara, Alan; Ohmacht, Martin
2013-06-25
A parallel computing system processes at least one store instruction. A first processor core issues a store instruction. A first queue, associated with the first processor core, stores the store instruction. A second queue, associated with a first local cache memory device of the first processor core, stores the store instruction. The first processor core updates first data in the first local cache memory device according to the store instruction. A third queue, associated with at least one shared cache memory device, stores the store instruction. The first processor core invalidates second data, associated with the store instruction, in the at least one shared cache memory device. The first processor core invalidates third data, associated with the store instruction, in other local cache memory devices of other processor cores. The first processor core flushes only the first queue.
High Performance Computing Assets for Ocean Acoustics Research
2016-11-18
independently on processing units with access to a typically available amount of memory, say 16 or 32 gigabytes. Our models require each processor to...allow results to be obtained with limited amounts of memory available to individual processing units (with no time frame for successful completion...put into use. One file server computer to store simulation output has also been purchased. The first workstation has 28 CPU cores, dual-thread, (56
Research on SEU hardening of heterogeneous Dual-Core SoC
NASA Astrophysics Data System (ADS)
Huang, Kun; Hu, Keliu; Deng, Jun; Zhang, Tao
2017-08-01
Single-Event Upset (SEU) hardening can be implemented with various schemes. However, some of them require a lot of human, material and financial resources. This paper proposes an easy SEU hardening scheme for a Heterogeneous Dual-core SoC (HD SoC), which combines three techniques. First, the automatic Triple Modular Redundancy (TMR) technique is adopted to harden the register heaps of the processor and the instruction-fetching module. Second, Hamming codes are used to harden the random access memory (RAM). Last, a software signature technique is applied to check the programs running on the CPU. The scheme does not need to consume additional resources and has little influence on the performance of the CPU. These technologies are mature, easy to implement and low in cost. According to the simulation results, the scheme can satisfy the basic demand of SEU hardening.
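The first two techniques named above can be illustrated with a short software model. The abstract gives no implementation details, so everything below is an assumption for illustration: the function names are hypothetical, the TMR voter is the textbook bitwise majority function, and Hamming(7,4) stands in for whatever Hamming code the authors actually applied to the RAM.

```python
def tmr_vote(a: int, b: int, c: int) -> int:
    """Bitwise majority vote over three redundant copies of a register value."""
    return (a & b) | (a & c) | (b & c)


def hamming74_encode(nibble: int) -> int:
    """Encode 4 data bits as a 7-bit Hamming codeword.

    Codeword positions 1..7 hold p1, p2, d0, p4, d1, d2, d3;
    bit i of the returned integer is position i + 1.
    """
    d = [(nibble >> i) & 1 for i in range(4)]
    p1 = d[0] ^ d[1] ^ d[3]  # covers positions 1, 3, 5, 7
    p2 = d[0] ^ d[2] ^ d[3]  # covers positions 2, 3, 6, 7
    p4 = d[1] ^ d[2] ^ d[3]  # covers positions 4, 5, 6, 7
    bits = [p1, p2, d[0], p4, d[1], d[2], d[3]]
    return sum(bit << i for i, bit in enumerate(bits))


def hamming74_decode(word: int) -> int:
    """Correct at most one flipped bit and return the 4 data bits."""
    bits = [(word >> i) & 1 for i in range(7)]
    # The syndrome is the XOR of the 1-based positions of all set bits;
    # it is zero for a valid codeword and equals the position of a single error.
    syndrome = 0
    for position in range(1, 8):
        if bits[position - 1]:
            syndrome ^= position
    if syndrome:
        bits[syndrome - 1] ^= 1
    return bits[2] | (bits[4] << 1) | (bits[5] << 2) | (bits[6] << 3)


# One corrupted register copy is outvoted by the other two.
assert tmr_vote(0b1010, 0b1010, 0b0010) == 0b1010

# A single upset in a stored codeword is corrected on read.
stored = hamming74_encode(0b1011)
upset = stored ^ (1 << 4)
assert hamming74_decode(upset) == 0b1011
```

Both mechanisms mask a single upset per protected word; neither handles two simultaneous upsets in the same word, which is one reason practical designs often add periodic scrubbing.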
Implementation of kernels on the Maestro processor
NASA Astrophysics Data System (ADS)
Suh, Jinwoo; Kang, D. I. D.; Crago, S. P.
Currently, most microprocessors use multiple cores to increase performance while limiting power usage. Some processors use not just a few cores, but tens of cores or even 100 cores. One such many-core microprocessor is the Maestro processor, which is based on Tilera's TILE64 processor. The Maestro chip is a 49-core, general-purpose, radiation-hardened processor designed for space applications. The Maestro processor, unlike the TILE64, has a floating point unit (FPU) in each core for improved floating point performance. The Maestro processor runs at a clock frequency of 342 MHz. On the Maestro processor, we implemented several widely used kernels: matrix multiplication, vector add, FIR filter, and FFT. We measured and analyzed the performance of these kernels. The achieved performance was up to 5.7 GFLOPS, and the speedup compared to a single tile was up to 49 using 49 tiles.
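The figures quoted above can be turned into per-tile numbers with a little arithmetic. The sketch below uses only the values stated in the abstract; note that the two "up to" figures may come from different kernels, so combining them is an assumption, and the variable names are illustrative.

```python
CLOCK_HZ = 342e6      # clock frequency stated in the abstract
TILES = 49            # number of cores (tiles)
PEAK_GFLOPS = 5.7     # best achieved performance, per the abstract
BEST_SPEEDUP = 49.0   # best speedup over a single tile, per the abstract

# Parallel efficiency: speedup divided by the number of tiles used.
parallel_efficiency = BEST_SPEEDUP / TILES

# Average achieved throughput per tile, and per tile per clock cycle.
gflops_per_tile = PEAK_GFLOPS / TILES
flops_per_cycle_per_tile = gflops_per_tile * 1e9 / CLOCK_HZ

print(f"parallel efficiency:      {parallel_efficiency:.2f}")
print(f"GFLOPS per tile:          {gflops_per_tile:.3f}")
print(f"FLOPs per cycle per tile: {flops_per_cycle_per_tile:.2f}")
```

Under these assumptions the best case corresponds to ideal (linear) scaling and roughly one floating-point operation every three cycles per tile.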
Testing and operating a multiprocessor chip with processor redundancy
Bellofatto, Ralph E; Douskey, Steven M; Haring, Rudolf A; McManus, Moyra K; Ohmacht, Martin; Schmunkamp, Dietmar; Sugavanam, Krishnan; Weatherford, Bryan J
2014-10-21
A system and method for improving the yield rate of a multiprocessor semiconductor chip that includes primary processor cores and one or more redundant processor cores. A first tester conducts a first test on one or more processor cores, and encodes results of the first test in an on-chip non-volatile memory. A second tester conducts a second test on the processor cores, and encodes results of the second test in an external non-volatile storage device. An override bit of a multiplexer is set if a processor core fails the second test. In response to the override bit, the multiplexer selects a physical-to-logical mapping of processor IDs according to one of: the encoded results in the memory device or the encoded results in the external storage device. On-chip logic configures the processor cores according to the selected physical-to-logical mapping.
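A rough software model of the selection logic may help in reading this claim. It is a sketch under several assumptions that the abstract does not confirm: test results are modeled as lists of pass/fail flags, the override bit is set when a core that passed the first test fails the second, the override selects the externally stored results, and logical IDs are assigned to passing physical cores in ascending order. All names are hypothetical.

```python
from typing import Dict, List


def build_mapping(passed: List[bool], needed: int) -> Dict[int, int]:
    """Map logical IDs 0..needed-1 onto the lowest-numbered physical cores that passed."""
    good = [physical for physical, ok in enumerate(passed) if ok]
    if len(good) < needed:
        raise ValueError("not enough working cores to fill all logical IDs")
    return {logical: physical for logical, physical in enumerate(good[:needed])}


def select_mapping(first_test: List[bool], second_test: List[bool],
                   needed: int) -> Dict[int, int]:
    """Model of the multiplexer: choose which encoded test results drive the mapping."""
    # Assumed override rule: a core that passed the first test fails the second.
    override_bit = any(a and not b for a, b in zip(first_test, second_test))
    # Assumed polarity: override selects the results kept in external storage.
    chosen = second_test if override_bit else first_test
    return build_mapping(chosen, needed)


# Five physical cores, three logical cores needed, so up to two are redundant.
on_chip = [True, True, False, True, True]     # first test: core 2 failed
external = [True, False, False, True, True]   # second test: core 1 failed as well

mapping_without_new_failure = select_mapping(on_chip, on_chip, needed=3)
mapping_with_new_failure = select_mapping(on_chip, external, needed=3)
```

With this layout the chip remains usable as long as the number of failed cores does not exceed the number of redundant ones.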
NASA Astrophysics Data System (ADS)
Zou, Liang; Fu, Zhuang; Zhao, YanZheng; Yang, JunYan
2010-07-01
This paper proposes a pipelined circuit architecture implemented in an FPGA, a very large scale integrated circuit (VLSI), which efficiently performs the real-time non-uniformity correction (NUC) algorithm for infrared focal plane arrays (IRFPA). Dual Nios II soft-core processors and a DSP with a 64+ core together constitute this image system. Each processor undertakes its own task and coordinates its work with the others. The system on programmable chip (SOPC) in the FPGA works steadily under a global clock frequency of 96 MHz. The ample timing margin lets the FPGA perform the NUC image pre-processing algorithm with ease, which provides a sound basis for the image post-processing in the DSP. The paper also presents a hardware (HW) and software (SW) co-design in the FPGA. The resulting architecture yields a multiprocessor image processing system that meets the system's performance requirements.
Concurrent computation of attribute filters on shared memory parallel machines.
Wilkinson, Michael H F; Gao, Hui; Hesselink, Wim H; Jonker, Jan-Eppo; Meijster, Arnold
2008-10-01
Morphological attribute filters have not previously been parallelized, mainly because they are both global and non-separable. We propose a parallel algorithm that achieves efficient parallelism for a large class of attribute filters, including attribute openings, closings, thinnings and thickenings, based on Salembier's Max-Trees and Min-trees. The image or volume is first partitioned in multiple slices. We then compute the Max-trees of each slice using any sequential Max-Tree algorithm. Subsequently, the Max-trees of the slices can be merged to obtain the Max-tree of the image. A C-implementation yielded good speed-ups on both a 16-processor MIPS 14000 parallel machine, and a dual-core Opteron-based machine. It is shown that the speed-up of the parallel algorithm is a direct measure of the gain with respect to the sequential algorithm used. Furthermore, the concurrent algorithm shows a speed gain of up to 72 percent on a single-core processor, due to reduced cache thrashing.
Multiple core computer processor with globally-accessible local memories
DOE Office of Scientific and Technical Information (OSTI.GOV)
Shalf, John; Donofrio, David; Oliker, Leonid
A multi-core computer processor including a plurality of processor cores interconnected in a Network-on-Chip (NoC) architecture, a plurality of caches, each of the plurality of caches being associated with one and only one of the plurality of processor cores, and a plurality of memories, each of the plurality of memories being associated with a different set of at least one of the plurality of processor cores and each of the plurality of memories being configured to be visible in a global memory address space such that the plurality of memories are visible to two or more of the plurality of processor cores.
Development of an extensible dual-core wireless sensing node for cyber-physical systems
NASA Astrophysics Data System (ADS)
Kane, Michael; Zhu, Dapeng; Hirose, Mitsuhito; Dong, Xinjun; Winter, Benjamin; Häckell, Mortiz; Lynch, Jerome P.; Wang, Yang; Swartz, A.
2014-04-01
The introduction of wireless telemetry into the design of monitoring and control systems has been shown to reduce system costs while simplifying installations. To date, wireless nodes proposed for sensing and actuation in cyberphysical systems have been designed using microcontrollers with one computational pipeline (i.e., single-core microcontrollers). While concurrent code execution can be implemented on single-core microcontrollers, concurrency is emulated by splitting the pipeline's resources to support multiple threads of code execution. For many applications, this approach to multi-threading is acceptable in terms of speed and function. However, some applications such as feedback controls demand deterministic timing of code execution and maximum computational throughput. For these applications, the adoption of multi-core processor architectures represents one effective solution. Multi-core microcontrollers have multiple computational pipelines that can execute embedded code in parallel and can be interrupted independent of one another. In this study, a new wireless platform named Martlet is introduced with a dual-core microcontroller adopted in its design. The dual-core microcontroller design allows Martlet to dedicate one core to standard wireless sensor operations while the other core is reserved for embedded data processing and real-time feedback control law execution. Another distinct feature of Martlet is a standardized hardware interface that allows specialized daughter boards (termed wing boards) to be interfaced to the Martlet baseboard. This extensibility opens opportunity to encapsulate specialized sensing and actuation functions in a wing board without altering the design of Martlet. In addition to describing the design of Martlet, a few example wings are detailed, along with experiments showing the Martlet's ability to monitor and control physical systems such as wind turbines and buildings.
Rapid Damage Assessment. Volume II. Development and Testing of Rapid Damage Assessment System.
1981-02-01
pixels/s Camera Line Rate 732.4 lines/s Pixels per Line 1728 video 314 blank 4 line number (binary) 2 run number (BCD) 2048 total Pixel Resolution 8 bits...sists of an LSI-ll microprocessor, a VDI -200 video display processor, an FD-2 dual floppy diskette subsystem, an FT-I function key-trackball module...COMPONENT LIST FOR IMAGE PROCESSOR SYSTEM IMAGE PROCESSOR SYSTEM VIEWS I VDI -200 Display Processor Racks, Table FD-2 Dual Floppy Diskette Subsystem FT-l
Scheduler for multiprocessor system switch with selective pairing
Gara, Alan; Gschwind, Michael Karl; Salapura, Valentina
2015-01-06
System, method and computer program product for scheduling threads in a multiprocessing system with selective pairing of processor cores for increased processing reliability. A selective pairing facility is provided that selectively connects, i.e., pairs, multiple microprocessor or processor cores to provide one highly reliable thread (or thread group). The method configures the selective pairing facility to use checking to provide one highly reliable thread for high reliability and allocates threads to corresponding processor cores indicating a need for hardware checking. The method configures the selective pairing facility to provide multiple independent cores and allocates threads to corresponding processor cores indicating inherent resilience.
NASA Astrophysics Data System (ADS)
Dave, Gaurav P.; Sureshkumar, N.; Blessy Trencia Lincy, S. S.
2017-11-01
The current trend in processor manufacturing focuses on multi-core architectures rather than on increasing clock speed to improve performance. Graphics processors have become commodity hardware that provides fast co-processing in computer systems. Developments in IoT, social networking web applications, and big data have created a huge demand for data processing, and such throughput-intensive applications inherently contain data-level parallelism, which is well suited to SIMD-based GPUs. This paper reviews the architectural aspects of multi-/many-core processors and graphics processors. Several case studies compare the performance of throughput computing applications using shared-memory programming in OpenMP and CUDA API based programming.
Soft-core processor study for node-based architectures.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Van Houten, Jonathan Roger; Jarosz, Jason P.; Welch, Benjamin James
2008-09-01
Node-based architecture (NBA) designs for future satellite projects hold the promise of decreasing system development time and costs, size, weight, and power and positioning the laboratory to address other emerging mission opportunities quickly. Reconfigurable Field Programmable Gate Array (FPGA) based modules will comprise the core of several of the NBA nodes. Microprocessing capabilities will be necessary with varying degrees of mission-specific performance requirements on these nodes. To enable the flexibility of these reconfigurable nodes, it is advantageous to incorporate the microprocessor into the FPGA itself, either as a hard-core processor built into the FPGA or as a soft-core processor built out of FPGA elements. This document describes the evaluation of three reconfigurable FPGA based processors for use in future NBA systems: two soft cores (MicroBlaze and non-fault-tolerant LEON) and one hard core (PowerPC 405). Two standard performance benchmark applications were developed for each processor. The first, Dhrystone, is a fixed-point operation metric. The second, Whetstone, is a floating-point operation metric. Several trials were run at varying code locations, loop counts, processor speeds, and cache configurations. FPGA resource utilization was recorded for each configuration. Cache configurations impacted the results greatly; for optimal processor efficiency it is necessary to enable caches on the processors. Processor caches carry a penalty; cache error mitigation is necessary when operating in a radiation environment.
Application of Advanced Multi-Core Processor Technologies to Oceanographic Research
2013-09-30
STM32 NXP LPC series No Proprietary Microchip PIC32/DSPIC No > 500 mW; < 5 W ARM Cortex TI OMAP TI Sitara Broadcom BCM2835 Varies FPGA...1 DISTRIBUTION STATEMENT A. Approved for public release; distribution is unlimited. Application of Advanced Multi-Core Processor Technologies...state-of-the-art information processing architectures. OBJECTIVES Next-generation processor architectures (multi-core, multi-threaded) hold the
NASA Technical Reports Server (NTRS)
McGalliard, James
2008-01-01
This viewgraph presentation details the science and systems environments that the NASA High End Computing program serves. It includes a discussion of the workload involved in processing for Global Climate Modeling. The Goddard Earth Observing System Model, Version 5 (GEOS-5) is a system of models integrated using the Earth System Modeling Framework (ESMF). The GEOS-5 system was used for the benchmark tests, and the results of the tests are shown and discussed. Tests were also run for the Cubed Sphere system; results for these tests are also shown.
Low-Power Embedded DSP Core for Communication Systems
NASA Astrophysics Data System (ADS)
Tsao, Ya-Lan; Chen, Wei-Hao; Tan, Ming Hsuan; Lin, Maw-Ching; Jou, Shyh-Jye
2003-12-01
This paper proposes a parameterized digital signal processor (DSP) core for an embedded digital signal processing system designed to achieve demodulation/synchronization with better performance and flexibility. The features of this DSP core include a parameterized data path, a dual MAC unit, a subword MAC, and optional function-specific blocks for accelerating communication system modulation operations. This DSP core also has a low-power structure, which includes the gray-code addressing mode, pipeline sharing, and advanced hardware looping. Users can select the parameters and special functional blocks based on the character of their applications and then generate a DSP core. The DSP core has been implemented via a cell-based design method using synthesizable Verilog code with TSMC 0.35 µm SPQM and 0.25 µm 1P5M libraries. The equivalent gate count of the core area without memory is approximately 50 k. Moreover, the maximum operating frequency of a [InlineEquation not available: see fulltext.] version is 100 MHz (0.35 µm) and 140 MHz (0.25 µm).
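The low-power benefit usually attributed to Gray-code addressing is that consecutive addresses differ in exactly one bit, so fewer address lines toggle per step. The following is a minimal sketch of that general idea only; the abstract does not describe how the core's addressing mode is built, and the function names and the 4-bit address range are assumptions chosen for illustration.

```python
def to_gray(n: int) -> int:
    """Convert a binary-coded integer to its reflected Gray code."""
    return n ^ (n >> 1)


def count_toggles(sequence):
    """Count how many bits change in total between consecutive values."""
    return sum(bin(a ^ b).count("1") for a, b in zip(sequence, sequence[1:]))


# Hypothetical sequential access pattern over a 4-bit address space.
addresses = list(range(16))

binary_toggles = count_toggles(addresses)
gray_toggles = count_toggles([to_gray(a) for a in addresses])

print("bit toggles, binary addressing:", binary_toggles)
print("bit toggles, Gray addressing:  ", gray_toggles)
```

For a purely sequential sweep, the Gray-coded sequence toggles one line per step, while plain binary counting toggles several lines whenever a carry ripples through the address.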
NASA Astrophysics Data System (ADS)
Rahman, P. A.
2018-05-01
This paper deals with a model of the knapsack optimization problem and a method for solving it based on directed combinatorial search in Boolean space. It also discusses the author's specialized mathematical model for decomposing the search-zone into separate search-spheres and an algorithm for distributing the search-spheres among the cores of a multi-core processor. The paper provides an example of decomposing the search-zone into several search-spheres and distributing them among the cores of a quad-core processor. Finally, it gives the author's formula for estimating the theoretical maximum computational speed-up that can be achieved by parallelizing the search-zone into search-spheres on an unlimited number of processor cores.
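The general idea of splitting a Boolean search space into independent pieces and handing each piece to a worker can be sketched as follows. This is not the author's search-sphere decomposition, which the abstract does not define; it substitutes a simpler prefix-based split (fixing the first few selection bits) and plain exhaustive search. The names `search_partition`, `solve_knapsack`, and `prefix_bits` are hypothetical. A thread pool is used to keep the sketch portable; in CPython, real multi-core speed-up would likely require a process pool instead.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def search_partition(prefix, values, weights, capacity):
    """Exhaustively search every selection whose first bits equal `prefix`."""
    k = len(prefix)
    base_weight = sum(w for bit, w in zip(prefix, weights) if bit)
    base_value = sum(v for bit, v in zip(prefix, values) if bit)
    best = (-1, None)  # (value, selection); -1 marks "nothing feasible"
    if base_weight > capacity:
        return best
    for tail in product((0, 1), repeat=len(values) - k):
        weight = base_weight + sum(w for bit, w in zip(tail, weights[k:]) if bit)
        if weight <= capacity:
            value = base_value + sum(v for bit, v in zip(tail, values[k:]) if bit)
            if value > best[0]:
                best = (value, prefix + tail)
    return best


def solve_knapsack(values, weights, capacity, prefix_bits=2, workers=4):
    """Split the search space by bit prefix and search the parts concurrently."""
    prefixes = list(product((0, 1), repeat=min(prefix_bits, len(values))))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(
            pool.map(lambda p: search_partition(p, values, weights, capacity), prefixes)
        )
    return max(results, key=lambda r: r[0])


best_value, best_selection = solve_knapsack([60, 100, 120], [10, 20, 30], capacity=50)
print(best_value, best_selection)
```

With `prefix_bits=2` the space splits into four independent parts, one per core of a quad-core machine; the parts share no state, so the only sequential step is the final `max` over the partial results.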
Network Coding on Heterogeneous Multi-Core Processors for Wireless Sensor Networks
Kim, Deokho; Park, Karam; Ro, Won W.
2011-01-01
While network coding is well known for its efficiency and usefulness in wireless sensor networks, the excessive costs associated with decoding computation and complexity still hinder its adoption into practical use. On the other hand, high-performance microprocessors with heterogeneous multi-cores are expected to be used as processing nodes of wireless sensor networks in the near future. To this end, this paper introduces an efficient network coding algorithm developed for heterogeneous multi-core processors. The proposed idea is fully tested on one of the currently available heterogeneous multi-core processors, the Cell Broadband Engine. PMID:22164053
NASA Astrophysics Data System (ADS)
Coffey, Stephen; Connell, Joseph
2005-06-01
This paper presents a development platform for real-time image processing based on the ADSP-BF533 Blackfin processor and the MicroC/OS-II real-time operating system (RTOS). MicroC/OS-II is a completely portable, ROMable, pre-emptive, real-time kernel. The Blackfin Digital Signal Processors (DSPs), incorporating the Analog Devices/Intel Micro Signal Architecture (MSA), are a broad family of 16-bit fixed-point products with a dual Multiply Accumulate (MAC) core. In addition, they have a rich instruction set with variable instruction length and both DSP and MCU functionality thus making them ideal for media based applications. Using the MicroC/OS-II for task scheduling and management, the proposed system can capture and process raw RGB data from any standard 8-bit greyscale image sensor in soft real-time and then display the processed result using a simple PC graphical user interface (GUI). Additionally, the GUI allows configuration of the image capture rate and the system and core DSP clock rates thereby allowing connectivity to a selection of image sensors and memory devices. The GUI also allows selection from a set of image processing algorithms based in the embedded operating system.
An evaluation of MPI message rate on hybrid-core processors
Barrett, Brian W.; Brightwell, Ron; Grant, Ryan; ...
2014-11-01
Power and energy concerns are motivating chip manufacturers to consider future hybrid-core processor designs that may combine a small number of traditional cores optimized for single-thread performance with a large number of simpler cores optimized for throughput performance. This trend is likely to impact the way in which compute resources for network protocol processing functions are allocated and managed. In particular, the performance of MPI match processing is critical to achieving high message throughput. In this paper, we analyze the ability of simple and more complex cores to perform MPI matching operations for various scenarios in order to gain insight into how MPI implementations for future hybrid-core processors should be designed.
State recovery and lockstep execution restart in a system with multiprocessor pairing
DOE Office of Scientific and Technical Information (OSTI.GOV)
Gara, Alan; Gschwind, Michael K; Salapura, Valentina
System, method and computer program product for a multiprocessing system to offer selective pairing of processor cores for increased processing reliability. A selective pairing facility is provided that selectively connects, i.e., pairs, multiple microprocessor or processor cores to provide one highly reliable thread (or thread group). Each pair of microprocessor or processor cores that provides one highly reliable thread for high reliability connects with system components such as a memory "nest" (or memory hierarchy), an optional system controller, an optional interrupt controller, optional I/O or peripheral devices, etc. The memory nest is attached to a selective pairing facility via a switch or a bus. Each selectively paired processor core includes a transactional execution facility, wherein the system is configured to enable processor rollback to a previous state and reinitialize lockstep execution in order to recover from an incorrect execution when an incorrect execution has been detected by the selective pairing facility.
LOSITAN: a workbench to detect molecular adaptation based on a Fst-outlier method.
Antao, Tiago; Lopes, Ana; Lopes, Ricardo J; Beja-Pereira, Albano; Luikart, Gordon
2008-07-28
Testing for selection is becoming one of the most important steps in the analysis of multilocus population genetics data sets. Existing applications are difficult to use, leaving many non-trivial, error-prone tasks to the user. Here we present LOSITAN, a selection detection workbench based on a well-evaluated Fst-outlier detection method. LOSITAN greatly facilitates correct approximation of model parameters (e.g., genome-wide average, neutral Fst), and provides data import and export functions, iterative contour smoothing and generation of graphics in an easy-to-use graphical user interface. LOSITAN is able to use modern multi-core processor architectures by locally parallelizing fdist, reducing computation time by half on current dual-core machines and with almost linear performance gains on machines with more cores. LOSITAN makes selection detection feasible for a much wider range of users, even for large population genomic datasets, by providing both an easy-to-use interface and the essential functionality to complete the whole selection detection process.
A TMS320-based modem for the aeronautical-satellite core data service
NASA Astrophysics Data System (ADS)
Moher, Michael L.; Lodge, John H.
The International Civil Aviation Organization (ICAO) Future Air Navigation Systems (FANS) committee, the Airlines Electronics Engineering Committee (AEEC), and Inmarsat have been developing standards for an aeronautical satellite communications service. These standards encompass a satellite communications system architecture to provide comprehensive aeronautical communications services. Incorporated into the architecture is a core service capability, providing only low rate data communications, which all service providers and all aircraft earth terminals are required to support. In this paper an implementation of the physical layer of this standard for the low data rate core service is described. This is a completely digital modem (up to a low intermediate frequency). The implementation uses a single TMS320C25 chip for the transmit baseband functions of scrambling, encoding, interleaving, block formatting and modulation. The receiver baseband unit uses a dual processor configuration to implement the functions of demodulation, synchronization, de-interleaving, decoding and de-scrambling. The hardware requirements, the software structure and the algorithms of this implementation are described.
Enhancing Image Processing Performance for PCID in a Heterogeneous Network of Multi-core Processors
NASA Astrophysics Data System (ADS)
Linderman, R.; Spetka, S.; Fitzgerald, D.; Emeny, S.
The Physically-Constrained Iterative Deconvolution (PCID) image deblurring code is being ported to heterogeneous networks of multi-core systems, including Intel Xeons and IBM Cell Broadband Engines. This paper reports results from experiments using the JAWS supercomputer at MHPCC (60 TFLOPS of dual-dual Xeon nodes linked with Infiniband) and the Cell Cluster at AFRL in Rome, NY. The Cell Cluster has 52 TFLOPS of Playstation 3 (PS3) nodes with IBM Cell Broadband Engine multi-cores and 15 dual-quad Xeon head nodes. The interconnect fabric includes Infiniband, 10 Gigabit Ethernet and 1 Gigabit Ethernet to each of the 336 PS3s. The results compare approaches to parallelizing FFT executions across the Xeons and the Cell's Synergistic Processing Elements (SPEs) for frame-level image processing. The experiments included Intel's Performance Primitives and Math Kernel Library, FFTW3.2, and Carnegie Mellon's SPIRAL. Optimization of FFTs in the PCID code led to a decrease in relative processing time for FFTs. Profiling PCID version 6.2, about one year ago, showed the 13 functions that accounted for the highest percentage of processing were all FFT processing functions. They accounted for over 88% of processing time in one run on Xeons. FFT optimizations led to improvement in the current PCID version 8.0. A recent profile showed that only two of the 19 functions with the highest processing time were FFT processing functions. Timing measurements showed that FFT processing for PCID version 8.0 has been reduced to less than 19% of overall processing time. We are working toward a goal of scaling to 200-400 cores per job (1-2 imagery frames/core). Running a pair of cores on each set of frames reduces latency by implementing parallel FFT processing. Our current results show scaling well out to 100 pairs of cores. 
These results support the next higher level of parallelism in PCID, where groups of several hundred frames each producing one resolved image are sent to cliques of several hundred cores in a round robin fashion. Current efforts toward further performance enhancement for PCID are shifting toward using the Playstations in conjunction with the Xeons to take advantage of outstanding price/performance as well as the Flops/Watt cost advantage. We are fine-tuning the PCID parallelization strategy to balance processing over Xeons and Cell BEs to find an optimal partitioning of PCID over the heterogeneous processors. A high performance information management system that exploits native Infiniband multicast is used to improve latency among the head nodes. Using a publication/subscription oriented information management system to implement a unified communications platform makes runs on large HPCs with thousands of intercommunicating cores more flexible and more fault tolerant. It features a loose coupling of publishers to subscribers through intervening brokers. We are also working on enhancing performance for both Xeons and Cell BEs by moving selected operations to single precision. Techniques for adapting the code to single precision and performance results are reported.
NASA Astrophysics Data System (ADS)
Pruhs, Kirk
A particularly important emergent technology is heterogeneous processors (or cores), which many computer architects believe will be the dominant architectural design in the future. The main advantage of a heterogeneous architecture, relative to an architecture of identical processors, is that it allows for the inclusion of processors whose design is specialized for particular types of jobs, and for jobs to be assigned to a processor best suited for that job. Most notably, it is envisioned that these heterogeneous architectures will consist of a small number of high-power high-performance processors for critical jobs, and a larger number of lower-power lower-performance processors for less critical jobs. Naturally, the lower-power processors would be more energy efficient in terms of the computation performed per unit of energy expended, and would generate less heat per unit of computation. For a given area and power budget, heterogeneous designs can give significantly better performance for standard workloads. Moreover, even processors that were designed to be homogeneous, are increasingly likely to be heterogeneous at run time: the dominant underlying cause is the increasing variability in the fabrication process as the feature size is scaled down (although run time faults will also play a role). Since manufacturing yields would be unacceptably low if every processor/core was required to be perfect, and since there would be significant performance loss from derating the entire chip to the functioning of the least functional processor (which is what would be required in order to attain processor homogeneity), some processor heterogeneity seems inevitable in chips with many processors/cores.
Dual-mode self-validating resistance/Johnson noise thermometer system
Shepard, Robert L.; Blalock, Theron V.; Roberts, Michael J.
1993-01-01
A dual-mode Johnson noise and DC resistance thermometer capable of use in control systems where prompt indications of temperature changes and long term accuracy are needed. A resistance-inductance-capacitance (RLC) tuned circuit produces a continuous voltage signal for Johnson noise temperature measurement. The RLC circuit provides a mean-squared noise voltage that depends only on the capacitance used and the temperature of the sensor. The sensor has four leads for simultaneous coupling to a noise signal processor and to a DC resistance signal processor.
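The statement that the mean-squared noise voltage depends only on capacitance and temperature matches the standard equipartition result for a capacitor in thermal equilibrium with the rest of its circuit. The relation below is that textbook result, not a formula quoted from the patent, and the component values in the worked line are illustrative assumptions.

```latex
\begin{align}
  \langle V^2 \rangle &= \frac{k_B T}{C}
  \quad\Longrightarrow\quad
  T = \frac{C\,\langle V^2 \rangle}{k_B} \\
  % Illustrative values (assumed, not from the patent): C = 1 nF, T = 300 K
  \langle V^2 \rangle &= \frac{1.380649\times10^{-23}\,\mathrm{J/K}\times 300\,\mathrm{K}}{1\times10^{-9}\,\mathrm{F}}
  \approx 4.14\times10^{-12}\,\mathrm{V^2},
  \qquad V_{\mathrm{rms}} \approx 2.04\,\mu\mathrm{V}
\end{align}
```

Because neither the resistance nor the inductance appears in the result, a temperature reading derived this way does not depend on drift in the sensor resistance, which is consistent with the long-term accuracy role the abstract assigns to the noise measurement.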
Free-Electron Laser Driven by the NBS (National Bureau of Standards) CW Microtron
1988-03-31
planned over several years. This will begin with the purchase of a 32-bit dual processor system for the yet to be constructed primary station wire scanner ...display subsystem. This 32-bit dual processor system will not only form the wire scanner display system, but has sufficient processing power to...7th hit. Coiif. on FELs, eds., E.T. Scharlemann and D. Prosnitz (North- Holland, Amsterdam, 1986) p. 278. 121 X.K Maruyania and S. Penner, C.M. Tang
NASA Astrophysics Data System (ADS)
Leidi, Tiziano; Scocchi, Giulio; Grossi, Loris; Pusterla, Simone; D'Angelo, Claudio; Thiran, Jean-Philippe; Ortona, Alberto
2012-11-01
In recent decades, finite element (FE) techniques have been extensively used for predicting effective properties of random heterogeneous materials. In the case of very complex microstructures, the choice of numerical methods for the solution of this problem can offer some advantages over classical analytical approaches, and it allows the use of digital images obtained from real material samples (e.g., using computed tomography). On the other hand, having a large number of elements is often necessary for properly describing complex microstructures, ultimately leading to extremely time-consuming computations and high memory requirements. With the final objective of reducing these limitations, we improved an existing freely available FE code for the computation of effective conductivity (electrical and thermal) of microstructure digital models. To allow execution on hardware combining multi-core CPUs and a GPU, we first translated the original algorithm from Fortran to C, and we subdivided it into software components. Then, we enhanced the C version of the algorithm for parallel processing with heterogeneous processors. With the goal of maximizing the obtained performances and limiting resource consumption, we utilized a software architecture based on stream processing, event-driven scheduling, and dynamic load balancing. The parallel processing version of the algorithm has been validated using a simple microstructure consisting of a single sphere located at the centre of a cubic box, yielding consistent results. Finally, the code was used for the calculation of the effective thermal conductivity of a digital model of a real sample (a ceramic foam obtained using X-ray computed tomography). On a computer equipped with dual hexa-core Intel Xeon X5670 processors and an NVIDIA Tesla C2050, the parallel application version features near to linear speed-up progression when using only the CPU cores. It executes more than 20 times faster when additionally using the GPU.
CoNNeCT Baseband Processor Module Boot Code SoftWare (BCSW)
NASA Technical Reports Server (NTRS)
Yamamoto, Clifford K.; Orozco, David S.; Byrne, D. J.; Allen, Steven J.; Sahasrabudhe, Adit; Lang, Minh
2012-01-01
This software provides essential startup and initialization routines for the CoNNeCT baseband processor module (BPM) hardware upon power-up. A command and data handling (C&DH) interface is provided via 1553 and diagnostic serial interfaces to invoke operational, reconfiguration, and test commands within the code. The BCSW has features unique to the hardware it is responsible for managing. In this case, the CoNNeCT BPM is configured with an updated CPU (Atmel AT697 SPARC processor) and a unique set of memory and I/O peripherals that require customized software to operate. These features include configuration of new AT697 registers, interfacing to a new HouseKeeper with a flash controller interface, a new dual Xilinx configuration/scrub interface, and an updated 1553 remote terminal (RT) core. The BCSW is intended to provide a "safe" mode for the BPM when initially powered on or when an unexpected trap occurs, causing the processor to reset. The BCSW allows the 1553 bus controller in the spacecraft or payload controller to operate the BPM over 1553 to upload code; upload Xilinx bit files; perform rudimentary tests; read, write, and copy the non-volatile flash memory; and configure the Xilinx interface. Commands also exist over 1553 to cause the CPU to jump or call a specified address to begin execution of user-supplied code. This may be in the form of a real-time operating system, test routine, or specific application code to run on the BPM.
Realization of a single image haze removal system based on DaVinci DM6467T processor
NASA Astrophysics Data System (ADS)
Liu, Zhuang
2014-10-01
Video monitoring systems (VMS) have been extensively applied in target recognition, traffic management, remote sensing, auto navigation and national defence. However, a VMS depends strongly on the weather: in fog, for instance, the quality of the images it receives is distinctly degraded and its effective range decreases. In short, a VMS performs poorly in bad weather, so research on enhancing fog-degraded images has high theoretical and practical value. This paper presents a design for a fog-degraded image enhancement system based on the TI DaVinci processor. The main function of the system is to take images captured by digital cameras and apply image enhancement processing to obtain a clear image. The processor used in this system is the dual-core TI DaVinci DM6467T (ARM at 500 MHz plus DSP at 1 GHz). A MontaVista Linux operating system runs on the ARM subsystem, which handles I/O and application processing. The DSP handles signal processing, and the results are available to the ARM subsystem in shared memory. Thanks to the DaVinci processor, the system provides image processing capability equivalent to that of an x86 computer at lower power cost and smaller volume. The results show that the system can process images at 25 frames per second at D1 resolution.
A hybrid algorithm for parallel molecular dynamics simulations
NASA Astrophysics Data System (ADS)
Mangiardi, Chris M.; Meyer, R.
2017-10-01
This article describes algorithms for the hybrid parallelization and SIMD vectorization of molecular dynamics simulations with short-range forces. The parallelization method combines domain decomposition with a thread-based parallelization approach. The goal of the work is to enable efficient simulations of very large (tens of millions of atoms) and inhomogeneous systems on many-core processors with hundreds or thousands of cores and SIMD units with large vector sizes. In order to test the efficiency of the method, simulations of a variety of configurations with up to 74 million atoms have been performed. Results are shown that were obtained on multi-core systems with Sandy Bridge and Haswell processors as well as systems with Xeon Phi many-core processors.
Efficient Multicriteria Protein Structure Comparison on Modern Processor Architectures
Manolakos, Elias S.
2015-01-01
Fast increasing computational demand for all-to-all protein structures comparison (PSC) is a result of three confounding factors: rapidly expanding structural proteomics databases, high computational complexity of pairwise protein comparison algorithms, and the trend in the domain towards using multiple criteria for protein structures comparison (MCPSC) and combining results. We have developed a software framework that exploits many-core and multicore CPUs to implement efficient parallel MCPSC in modern processors based on three popular PSC methods, namely, TMalign, CE, and USM. We evaluate and compare the performance and efficiency of the two parallel MCPSC implementations using Intel's experimental many-core Single-Chip Cloud Computer (SCC) as well as Intel's Core i7 multicore processor. We show that the 48-core SCC is more efficient than the latest generation Core i7, achieving a speedup factor of 42 (efficiency of 0.9), making many-core processors an exciting emerging technology for large-scale structural proteomics. We compare and contrast the performance of the two processors on several datasets and also show that MCPSC outperforms its component methods in grouping related domains, achieving a high F-measure of 0.91 on the benchmark CK34 dataset. The software implementation for protein structure comparison using the three methods and combined MCPSC, along with the developed underlying rckskel algorithmic skeletons library, is available via GitHub. PMID:26605332
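The quoted efficiency is consistent with the usual definition of parallel efficiency as speedup divided by the number of cores. A short check, assuming that is the definition the authors used:

```latex
E = \frac{S}{p} = \frac{42}{48} = 0.875 \approx 0.9
```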
Efficient Multicriteria Protein Structure Comparison on Modern Processor Architectures.
Sharma, Anuj; Manolakos, Elias S
2015-01-01
Fast increasing computational demand for all-to-all protein structures comparison (PSC) is a result of three confounding factors: rapidly expanding structural proteomics databases, high computational complexity of pairwise protein comparison algorithms, and the trend in the domain towards using multiple criteria for protein structures comparison (MCPSC) and combining results. We have developed a software framework that exploits many-core and multicore CPUs to implement efficient parallel MCPSC in modern processors based on three popular PSC methods, namely, TMalign, CE, and USM. We evaluate and compare the performance and efficiency of the two parallel MCPSC implementations using Intel's experimental many-core Single-Chip Cloud Computer (SCC) as well as Intel's Core i7 multicore processor. We show that the 48-core SCC is more efficient than the latest generation Core i7, achieving a speedup factor of 42 (efficiency of 0.9), making many-core processors an exciting emerging technology for large-scale structural proteomics. We compare and contrast the performance of the two processors on several datasets and also show that MCPSC outperforms its component methods in grouping related domains, achieving a high F-measure of 0.91 on the benchmark CK34 dataset. The software implementation for protein structure comparison using the three methods and combined MCPSC, along with the developed underlying rckskel algorithmic skeletons library, is available via GitHub.
Multiprocessor switch with selective pairing
Gara, Alan; Gschwind, Michael K; Salapura, Valentina
2014-03-11
System, method and computer program product for a multiprocessing system to offer selective pairing of processor cores for increased processing reliability. A selective pairing facility is provided that selectively connects, i.e., pairs, multiple microprocessor or processor cores to provide one highly reliable thread (or thread group). Each pair of microprocessor or processor cores that provides one highly reliable thread connects with system components such as a memory "nest" (or memory hierarchy), an optional system controller, an optional interrupt controller, optional I/O or peripheral devices, etc. The memory nest is attached to the selective pairing facility via a switch or a bus.
Dual-scale topology optoelectronic processor.
Marsden, G C; Krishnamoorthy, A V; Esener, S C; Lee, S H
1991-12-15
The dual-scale topology optoelectronic processor (D-STOP) is a parallel optoelectronic architecture for matrix algebraic processing. The architecture can be used for matrix-vector multiplication and two types of vector outer product. The computations are performed electronically, which allows multiplication and summation concepts in linear algebra to be generalized to various nonlinear or symbolic operations. This generalization permits the application of D-STOP to many computational problems. The architecture uses a minimum number of optical transmitters, which thereby reduces fabrication requirements while maintaining area-efficient electronics. The necessary optical interconnections are space invariant, minimizing space-bandwidth requirements.
NASA Astrophysics Data System (ADS)
Rakvic, Ryan N.; Ives, Robert W.; Lira, Javier; Molina, Carlos
2011-01-01
General purpose computer designers have recently begun adding cores to their processors in order to increase performance. For example, Intel has adopted a homogeneous quad-core processor as a base for general purpose computing. PlayStation3 (PS3) game consoles contain a multicore heterogeneous processor known as the Cell, which is designed to perform complex image processing algorithms at a high level. Can modern image-processing algorithms utilize these additional cores? On the other hand, modern advancements in configurable hardware, most notably field-programmable gate arrays (FPGAs) have created an interesting question for general purpose computer designers. Is there a reason to combine FPGAs with multicore processors to create an FPGA multicore hybrid general purpose computer? Iris matching, a repeatedly executed portion of a modern iris-recognition algorithm, is parallelized on an Intel-based homogeneous multicore Xeon system, a heterogeneous multicore Cell system, and an FPGA multicore hybrid system. Surprisingly, the cheaper PS3 slightly outperforms the Intel-based multicore on a core-for-core basis. However, both multicore systems are beaten by the FPGA multicore hybrid system by >50%.
Replication of Space-Shuttle Computers in FPGAs and ASICs
NASA Technical Reports Server (NTRS)
Ferguson, Roscoe C.
2008-01-01
A document discusses the replication of the functionality of the onboard space-shuttle general-purpose computers (GPCs) in field-programmable gate arrays (FPGAs) and application-specific integrated circuits (ASICs). The purpose of the replication effort is to enable utilization of proven space-shuttle flight software and software-development facilities to the extent possible during development of software for flight computers for a new generation of launch vehicles derived from the space shuttles. The replication involves specifying the instruction set of the central processing unit and the input/output processor (IOP) of the space-shuttle GPC in a hardware description language (HDL). The HDL is synthesized to form a "core" processor in an FPGA or, less preferably, in an ASIC. The core processor can be used to create a flight-control card to be inserted into a new avionics computer. The IOP of the GPC as implemented in the core processor could be designed to support data-bus protocols other than that of a multiplexer interface adapter (MIA) used in the space shuttle. Hence, a computer containing the core processor could be tailored to communicate via the space-shuttle GPC bus and/or one or more other buses.
Crespo, Alejandro C.; Dominguez, Jose M.; Barreiro, Anxo; Gómez-Gesteira, Moncho; Rogers, Benedict D.
2011-01-01
Smoothed Particle Hydrodynamics (SPH) is a numerical method commonly used in Computational Fluid Dynamics (CFD) to simulate complex free-surface flows. Simulations with this mesh-free particle method far exceed the capacity of a single processor. In this paper, as part of a dual-functioning code for either central processing units (CPUs) or Graphics Processor Units (GPUs), a parallelisation using GPUs is presented. The GPU parallelisation technique uses the Compute Unified Device Architecture (CUDA) of nVidia devices. Simulations with more than one million particles on a single GPU card exhibit speedups of up to two orders of magnitude over using a single-core CPU. It is demonstrated that the code achieves different speedups with different CUDA-enabled GPUs. The numerical behaviour of the SPH code is validated with a standard benchmark test case of dam break flow impacting on an obstacle where good agreement with the experimental results is observed. Both the achieved speed-ups and the quantitative agreement with experiments suggest that CUDA-based GPU programming can be used in SPH methods with efficiency and reliability. PMID:21695185
Benchmarking NWP Kernels on Multi- and Many-core Processors
NASA Astrophysics Data System (ADS)
Michalakes, J.; Vachharajani, M.
2008-12-01
Increased computing power for weather, climate, and atmospheric science has provided direct benefits for defense, agriculture, the economy, the environment, and public welfare and convenience. Today, very large clusters with many thousands of processors are allowing scientists to move forward with simulations of unprecedented size. But time-critical applications such as real-time forecasting or climate prediction need strong scaling: faster nodes and processors, not more of them. Moreover, the need for good cost-performance has never been greater, both in terms of performance per watt and per dollar. For these reasons, the new generations of multi- and many-core processors being mass produced for commercial IT and "graphical computing" (video games) are being scrutinized for their ability to exploit the abundant fine-grain parallelism in atmospheric models. We present results of our work to date identifying key computational kernels within the dynamics and physics of a large community NWP model, the Weather Research and Forecast (WRF) model. We benchmark and optimize these kernels on several different multi- and many-core processors. The goals are to (1) characterize and model performance of the kernels in terms of computational intensity, data parallelism, memory bandwidth pressure, memory footprint, etc.; (2) enumerate and classify effective strategies for coding and optimizing for these new processors; (3) assess difficulties and opportunities for tool or higher-level language support; and (4) establish a continuing set of kernel benchmarks that can be used to measure and compare effectiveness of current and future designs of multi- and many-core processors for weather and climate applications.
Multi-Core Programming Design Patterns: Stream Processing Algorithms for Dynamic Scene Perceptions
2014-05-01
processor developed by IBM and other companies, incorporates the POWER5 processor as the Power Processor Element (PPE), one of the early general...deliver a power-efficient single-precision peak performance of more than 256 GFlops. Substantially more raw power became available later, when nVIDIA ...algorithms, including IBM's Cell/B.E., GPUs from NVidia and AMD and many-core CPUs from Intel. The vast growth of digital video content has been a
The parallel algorithm for the 2D discrete wavelet transform
NASA Astrophysics Data System (ADS)
Barina, David; Najman, Pavel; Kleparnik, Petr; Kula, Michal; Zemcik, Pavel
2018-04-01
The discrete wavelet transform can be found at the heart of many image-processing algorithms. Until now, the transform on general-purpose processors (CPUs) was mostly computed using a separable lifting scheme. As the lifting scheme consists of a small number of operations, it is preferred for processing on single-core CPUs. However, for parallel processing on multi-core processors, this scheme is inappropriate due to its large number of steps. On such architectures, the number of steps corresponds to the number of points at which data must be exchanged. Consequently, these points often form a performance bottleneck. Our approach appropriately rearranges calculations inside the transform, and thereby reduces the number of steps. In other words, we propose a new scheme that is friendly to parallel environments. When evaluated on multi-core CPUs, our scheme consistently outperforms the original lifting scheme. The evaluation was performed on 61-core Intel Xeon Phi and 8-core Intel Xeon processors.
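To make the notion of lifting "steps" concrete, the following is a minimal sketch of one level of a conventional 1D lifting transform. It illustrates the baseline scheme only, not the authors' rearranged scheme. The choice of the CDF 5/3 wavelet, the periodic boundary handling, and the function names are assumptions for illustration; the abstract does not name a specific wavelet. A separable 2D transform would apply this along rows and then along columns.

```python
def lifting_53_forward(x):
    """One level of a CDF 5/3 transform via two lifting steps.

    Assumes an even-length input and periodic boundaries (illustrative choices).
    """
    n = len(x) // 2
    even, odd = x[0::2], x[1::2]
    # Step 1 (predict): detail = odd sample minus the mean of its even neighbours.
    d = [odd[i] - 0.5 * (even[i] + even[(i + 1) % n]) for i in range(n)]
    # Step 2 (update): needs the results of step 1, so in a parallel
    # implementation the two steps are separated by a data-exchange point.
    s = [even[i] + 0.25 * (d[i - 1] + d[i]) for i in range(n)]
    return s, d


def lifting_53_inverse(s, d):
    """Undo the two lifting steps in reverse order."""
    n = len(s)
    even = [s[i] - 0.25 * (d[i - 1] + d[i]) for i in range(n)]
    odd = [d[i] + 0.5 * (even[i] + even[(i + 1) % n]) for i in range(n)]
    x = [0.0] * (2 * n)
    x[0::2], x[1::2] = even, odd
    return x


signal = [3, 1, 4, 1, 5, 9, 2, 6]
smooth, detail = lifting_53_forward(signal)
restored = lifting_53_inverse(smooth, detail)
print(all(abs(a - b) < 1e-12 for a, b in zip(signal, restored)))
```

Each lifting step reads values produced by the previous step on neighbouring samples, which is why the number of steps matters when the samples are split across cores.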
NASA Astrophysics Data System (ADS)
Hayashi, Akihiro; Wada, Yasutaka; Watanabe, Takeshi; Sekiguchi, Takeshi; Mase, Masayoshi; Shirako, Jun; Kimura, Keiji; Kasahara, Hironori
Heterogeneous multicores have been attracting much attention in a wide range of areas because they attain high performance while keeping power consumption low. However, heterogeneous multicores impose very difficult programming on programmers, and the resulting long application development period lowers product competitiveness. To overcome this situation, this paper proposes a compilation framework that bridges the gap between programmers and heterogeneous multicores. In particular, this paper describes the compilation framework based on the OSCAR compiler. It realizes coarse-grain task parallel processing, data transfer using a DMA controller, and power reduction control from user programs with DVFS and clock gating on various heterogeneous multicores from different vendors. This paper also evaluates the processing performance and power reduction achieved by the proposed framework on a newly developed 15-core heterogeneous multicore chip named RP-X, which integrates 8 general-purpose processor cores and 3 types of accelerator cores and was developed by Renesas Electronics, Hitachi, Tokyo Institute of Technology and Waseda University. The framework attains speedups of up to 32x for an optical flow program with eight general-purpose processor cores and four DRP (Dynamically Reconfigurable Processor) accelerator cores against sequential execution by a single processor core, and an 80% power reduction for real-time AAC encoding.
Space Tug Avionics Definition Study. Volume 5: Cost and Programmatics
NASA Technical Reports Server (NTRS)
1975-01-01
The baseline avionics system features a central digital computer that integrates the functions of all the space tug subsystems by means of a redundant digital data bus. The central computer consists of dual central processor units, dual input/output processors, and a fault tolerant memory, utilizing internal redundancy and error checking. Three electronically steerable phased arrays provide downlink transmission from any tug attitude directly to ground or via TDRS. Six laser gyros and six accelerometers in a dodecahedron configuration make up the inertial measurement unit. Both a scanning laser radar and a TV system, employing strobe lamps, are required as acquisition and docking sensors. Primary dc power at a nominal 28 volts is supplied from dual lightweight, thermally integrated fuel cells which operate from propellant grade reactants out of the main tanks.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Learn, Mark Walter
Sandia National Laboratories is currently developing new processing and data communication architectures for use in future satellite payloads. These architectures will leverage the flexibility and performance of state-of-the-art static-random-access-memory-based Field Programmable Gate Arrays (FPGAs). One such FPGA is the radiation-hardened version of the Virtex-5 being developed by Xilinx. However, not all features of this FPGA are being radiation-hardened by design and could still be susceptible to on-orbit upsets. One such feature is the embedded hard-core PPC440 processor. Since this processor is implemented in the FPGA as a hard-core, traditional mitigation approaches such as Triple Modular Redundancy (TMR) are not available to improve the processor's on-orbit reliability. The goal of this work is to investigate techniques that can help mitigate the embedded hard-core PPC440 processor within the Virtex-5 FPGA other than TMR. Implementing various mitigation schemes reliably within the PPC440 offers a powerful reconfigurable computing resource to these node-based processing architectures. This document summarizes the work done on the cache mitigation scheme for the embedded hard-core PPC440 processor within the Virtex-5 FPGAs, and describes in detail the design of the cache mitigation scheme and the testing conducted at the radiation effects facility on the Texas A&M campus.
Interactive high-resolution isosurface ray casting on multicore processors.
Wang, Qin; JaJa, Joseph
2008-01-01
We present a new method for the interactive rendering of isosurfaces using ray casting on multi-core processors. This method consists of a combination of an object-order traversal that coarsely identifies possible candidate 3D data blocks for each small set of contiguous pixels, and an isosurface ray casting strategy tailored for the resulting limited-size lists of candidate 3D data blocks. While static screen partitioning is widely used in the literature, our scheme performs dynamic allocation of groups of ray casting tasks to ensure almost equal loads among the different threads running on multi-cores while maintaining spatial locality. We also make careful use of the memory management environment commonly present in multi-core processors. We test our system on a two-processor Clovertown platform, each consisting of a Quad-Core 1.86-GHz Intel Xeon Processor, for a number of widely different benchmarks. The detailed experimental results show that our system is efficient and scalable, and achieves high cache performance and excellent load balancing, resulting in an overall performance that is superior to any of the previous algorithms. In fact, we achieve an interactive isosurface rendering on a 1024² screen for all the datasets tested up to the maximum size of the main memory of our platform.
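The contrast between static screen partitioning and dynamic task allocation can be sketched with a shared work queue. This is a structural illustration only, not the authors' implementation: the names `render_dynamic` and `cast_tile` are hypothetical, the per-tile work is a placeholder, and the sketch does not model the spatial-locality grouping the abstract mentions.

```python
import queue
import threading


def render_dynamic(tiles, cast_tile, num_threads=4):
    """Worker threads repeatedly pull the next tile from a shared queue.

    A thread that finishes a cheap tile immediately takes another one,
    so load evens out without fixing the tile-to-thread mapping in advance.
    """
    tasks = queue.Queue()
    for tile in tiles:
        tasks.put(tile)

    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            try:
                tile = tasks.get_nowait()
            except queue.Empty:
                return  # no work left for this thread
            value = cast_tile(tile)
            with lock:
                results[tile] = value

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


# Hypothetical usage: an 8 x 8 grid of screen tiles and a stand-in for ray casting.
tiles = [(tx, ty) for tx in range(8) for ty in range(8)]
image = render_dynamic(tiles, cast_tile=lambda tile: tile[0] * tile[1])
print(len(image))
```

With static partitioning each thread would instead receive a fixed subset of `tiles` up front, which can leave some threads idle when tile costs differ widely.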
Energy consumption estimation of an OMAP-based Android operating system
NASA Astrophysics Data System (ADS)
González, Gabriel; Juárez, Eduardo; Castro, Juan José; Sanz, César
2011-05-01
System-level energy optimization of battery-powered multimedia embedded systems has recently become a design goal. The poor operational time of multimedia terminals makes computationally demanding applications impractical in real scenarios. For instance, so-called smart-phones are currently unable to remain in operation longer than several hours. The OMAP3530 processor basically consists of two processing cores, a General Purpose Processor (GPP) and a Digital Signal Processor (DSP). The former, an ARM Cortex-A8 processor, is intended to run a generic Operating System (OS) while the latter, a DSP core based on the C64x+, has an architecture optimized for video processing. The BeagleBoard, a commercial prototyping board based on the OMAP processor, has been used to test the Android Operating System and measure its performance. The board has 128 MB of SDRAM external memory, 256 MB of Flash external memory and several interfaces. Note that the clock frequency of the ARM and DSP OMAP cores is 600 MHz and 430 MHz, respectively. This paper describes the energy consumption estimation of the processes and multimedia applications of an Android v1.6 (Donut) OS on the OMAP3530-based BeagleBoard. In addition, tools for communication between the two processing cores have been employed. A test-bench to profile the OS resource usage has been developed. As far as the energy estimates are concerned, the OMAP processor energy consumption model provided by the manufacturer has been used. The model is basically divided into two energy components. The former, the baseline core energy, describes the energy consumption that is independent of any chip activity. The latter, the module active energy, describes the energy consumed by the active modules depending on resource usage.
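The two-component model described above (a baseline term independent of activity plus a usage-dependent term per active module) can be sketched as follows. This is an illustrative assumption about the model's general shape, not the manufacturer's actual model: the function name, the linear dependence on utilization, and all power figures are hypothetical.

```python
def estimate_energy_joules(duration_s, baseline_power_w, module_power_w, module_utilization):
    """Energy = baseline core energy + sum of module active energies.

    baseline_power_w:   power drawn regardless of chip activity (hypothetical value)
    module_power_w:     full-load power per module, e.g. {"arm": ..., "dsp": ...}
    module_utilization: fraction of the interval each module was active (0.0 to 1.0)
    """
    baseline_energy = baseline_power_w * duration_s
    active_energy = sum(
        power * module_utilization.get(name, 0.0) * duration_s
        for name, power in module_power_w.items()
    )
    return baseline_energy + active_energy


# Hypothetical numbers, chosen only to exercise the formula.
total = estimate_energy_joules(
    duration_s=10.0,
    baseline_power_w=0.1,
    module_power_w={"arm": 0.5, "dsp": 0.25},
    module_utilization={"arm": 0.8, "dsp": 0.5},
)
print(f"{total:.2f} J")
```

In a profiling test-bench, the utilization values would come from the measured OS resource usage over each sampling interval.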
JPRS Report, Science & Technology, Europe.
1991-04-30
processor in collaboration with Intel. The processor, christened Touchstone, will be used as the core of a parallel computer with 2,000 processors. One of...ELECTRONIQUE HEBDO in French 24 Jan 91 pp 14-15 [Article by Claire Remy: "Everything Set for Neural Signal Processors" first paragraph is ELECTRONIQUE...paving the way for neural signal processors in so doing. The principal advantage of this specific circuit over a neuromimetic software program is
An FPGA computing demo core for space charge simulation
DOE Office of Scientific and Technical Information (OSTI.GOV)
Wu, Jinyuan; Huang, Yifei; /Fermilab
2009-01-01
In accelerator physics, space charge simulation requires a large amount of computing power. In a particle system, each calculation requires time- and resource-consuming operations such as multiplications, divisions, and square roots. Because of the flexibility of field programmable gate arrays (FPGAs), we implemented this task with efficient use of the available computing resources and completely eliminated non-calculating operations that are indispensable in regular micro-processors (e.g. instruction fetch, instruction decoding, etc.). We designed and tested a 16-bit demo core for computing Coulomb's force in an Altera Cyclone II FPGA device. To save resources, the inverse square-root cube operation in our design is computed using a memory look-up table addressed with the nine to ten most significant non-zero bits. At a 200 MHz internal clock, our demo core reaches a throughput of 200 M pairs/s/core, faster than a typical 2 GHz micro-processor by about a factor of 10. Temperature and power consumption of FPGAs were also lower than those of micro-processors. Fast and convenient, FPGAs can serve as alternatives to time-consuming micro-processors for space charge simulation.
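The look-up-table idea can be illustrated in software. The following is a rough sketch under stated assumptions, not the authors' hardware design: it assumes the table is addressed by the leading `INDEX_BITS` bits of the integer squared distance starting at its most significant non-zero bit, and that the dropped low bits are compensated by a power-of-two scale. The table layout and all names are hypothetical.

```python
INDEX_BITS = 9  # the abstract mentions nine to ten most significant non-zero bits

# Hypothetical table layout: entry i holds i ** -1.5; entry 0 is unused padding.
TABLE = [0.0] + [i ** -1.5 for i in range(1, 1 << INDEX_BITS)]


def inv_r_cubed(r2: int) -> float:
    """Approximate r**-3 = (r**2)**-1.5 for a positive integer squared distance."""
    if r2 <= 0:
        raise ValueError("squared distance must be positive")
    # Keep only the leading INDEX_BITS bits, starting at the top non-zero bit.
    shift = max(r2.bit_length() - INDEX_BITS, 0)
    index = r2 >> shift
    # r2 ~= index * 2**shift, so r2**-1.5 ~= index**-1.5 * 2**(-1.5 * shift).
    return TABLE[index] * 2.0 ** (-1.5 * shift)


def coulomb_force(q1, q2, dx, dy, dz, k=1.0):
    """Force vector k*q1*q2*(dx, dy, dz)/r**3 using the table approximation."""
    r2 = dx * dx + dy * dy + dz * dz
    scale = k * q1 * q2 * inv_r_cubed(r2)
    return (scale * dx, scale * dy, scale * dz)


print(coulomb_force(1, 1, 3, 4, 0))
```

Truncating the squared distance to nine leading bits bounds the relative error of this sketch below one percent, which is the kind of precision/resource trade-off a small table makes possible.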
A VME-based software trigger system using UNIX processors
NASA Astrophysics Data System (ADS)
Atmur, Robert; Connor, David F.; Molzon, William
1997-02-01
We have constructed a distributed computing platform with eight processors to assemble and filter data from digitization crates. The filtered data were transported to a tape-writing UNIX computer via ethernet. Each processor ran a UNIX operating system and was installed in its own VME crate. Each VME crate contained dual-port memories which interfaced with the digitizers. Using standard hardware and software (VME and UNIX) allows us to select from a wide variety of non-proprietary products and makes upgrades simpler, if they are necessary.
Reconfigurable lattice mesh designs for programmable photonic processors.
Pérez, Daniel; Gasulla, Ivana; Capmany, José; Soref, Richard A
2016-05-30
We propose and analyse two novel mesh design geometries for the implementation of tunable optical cores in programmable photonic processors. These geometries are the hexagonal and the triangular lattice. They are compared here to a previously proposed square mesh topology in terms of a series of figures of merit that account for metrics that are relevant to on-chip integration of the mesh. We find that the hexagonal mesh is the most suitable option of the three considered for the implementation of the reconfigurable optical core in the programmable processor.
Accelerating 3D Elastic Wave Equations on Knights Landing based Intel Xeon Phi processors
NASA Astrophysics Data System (ADS)
Sourouri, Mohammed; Birger Raknes, Espen
2017-04-01
In advanced imaging methods like reverse-time migration (RTM) and full waveform inversion (FWI) the elastic wave equation (EWE) is numerically solved many times to create the seismic image or the elastic parameter model update. Thus, it is essential to optimize the solution time for solving the EWE as this will have a major impact on the total computational cost in running RTM or FWI. From a computational point of view applications implementing EWEs are associated with two major challenges. The first challenge is the amount of memory-bound computations involved, while the second challenge is the execution of such computations over very large datasets. So far, multi-core processors have not been able to tackle these two challenges, which eventually led to the adoption of accelerators such as Graphics Processing Units (GPUs). Compared to conventional CPUs, GPUs are densely populated with many floating-point units and fast memory, a type of architecture that has proven to map well to many scientific computations. Despite its architectural advantages, full-scale adoption of accelerators has yet to materialize. First, accelerators require a significant programming effort imposed by programming models such as CUDA or OpenCL. Second, accelerators come with a limited amount of memory, which also requires explicit data transfers between the CPU and the accelerator over the slow PCI bus. The second generation of the Xeon Phi processor, based on the Knights Landing (KNL) architecture, promises the computational capabilities of an accelerator but requires the same programming effort as traditional multi-core processors. The high computational performance is realized through many integrated cores (the number of cores, tiles and memory varies with the model) organized in tiles that are connected via a 2D mesh-based interconnect. In contrast to accelerators, KNL is a self-hosted system, meaning explicit data transfers over the PCI bus are no longer required.
However, like most accelerators, KNL sports a memory subsystem consisting of low-level caches and 16GB of high-bandwidth MCDRAM memory. For capacity computing, up to 400GB of conventional DDR4 memory is provided. Such a strict hierarchical memory layout means that data locality is imperative if the true potential of this product is to be harnessed. In this work, we study a series of optimizations specifically targeting KNL for our EWE based application to reduce the time-to-solution for the following 3D model sizes in grid points: 128³, 256³ and 512³. We compare the results with an optimized version for multi-core CPUs running on a dual-socket Xeon E5 2680v3 system using OpenMP. Our initial naive implementation on the KNL is roughly 20% faster than the multi-core version, but by using only one thread per core and careful memory placement using the memkind library, we could achieve higher speedups. Additionally, by using the MCDRAM as cache for problem sizes that are smaller than 16 GB further performance improvements were unlocked. Depending on the problem size, our overall results indicate that the KNL based system is approximately 2.2x faster than the 24-core Xeon E5 2680v3 system, with only modest changes to the code.
Fast and Accurate Simulation of the Cray XMT Multithreaded Supercomputer
DOE Office of Scientific and Technical Information (OSTI.GOV)
Villa, Oreste; Tumeo, Antonino; Secchi, Simone
Irregular applications, such as data mining and analysis or graph-based computations, show unpredictable memory/network access patterns and control structures. Highly multithreaded architectures with large processor counts, like the Cray MTA-1, MTA-2 and XMT, appear to address their requirements better than commodity clusters. However, the research on highly multithreaded systems is currently limited by the lack of adequate architectural simulation infrastructures due to issues such as size of the machines, memory footprint, simulation speed, accuracy and customization. At the same time, Shared-memory MultiProcessors (SMPs) with multi-core processors have become an attractive platform to simulate large scale machines. In this paper, we introduce a cycle-level simulator of the highly multithreaded Cray XMT supercomputer. The simulator runs unmodified XMT applications. We discuss how we tackled the challenges posed by its development, detailing the techniques introduced to make the simulation as fast as possible while maintaining a high accuracy. By mapping XMT processors (ThreadStorm with 128 hardware threads) to host computing cores, the simulation speed remains constant as the number of simulated processors increases, up to the number of available host cores. The simulator supports zero-overhead switching among different accuracy levels at run-time and includes a network model that takes into account contention. On a modern 48-core SMP host, our infrastructure simulates a large set of irregular applications 500 to 2000 times slower than real time when compared to a 128-processor XMT, while remaining within 10% of accuracy. Emulation is only from 25 to 200 times slower than real time.
Shared performance monitor in a multiprocessor system
Chiu, George; Gara, Alan G; Salapura, Valentina
2014-12-02
A performance monitoring unit (PMU) and method for monitoring performance of events occurring in a multiprocessor system. The multiprocessor system comprises a plurality of processor device units, each processor device generating signals representing occurrences of events in the processor device, and a single shared counter resource for performance monitoring. The performance monitor unit is shared by all processor cores in the multiprocessor system. The PMU is further programmed to monitor event signals issued from non-processor devices.
Design of a dataway processor for a parallel image signal processing system
NASA Astrophysics Data System (ADS)
Nomura, Mitsuru; Fujii, Tetsuro; Ono, Sadayasu
1995-04-01
Recently, demands for high-speed signal processing have been increasing, especially in the fields of image data compression, computer graphics, and medical imaging. To achieve sufficient power for real-time image processing, we have been developing parallel signal-processing systems. This paper describes a communication processor called the 'dataway processor', designed for a new scalable parallel signal-processing system. The processor has six high-speed communication links (Dataways), a data-packet routing controller, a RISC CORE, and a DMA controller. Each communication link operates at 8-bit parallel in a full duplex mode at 50 MHz. Moreover, data routing, DMA, and CORE operations are processed in parallel. Therefore, sufficient throughput is available for high-speed digital video signals. The processor is designed in a top-down fashion using a CAD system called 'PARTHENON.' The hardware is fabricated using 0.5-micrometer CMOS technology and comprises about 200 K gates.
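The link figures quoted above imply a raw bandwidth that can be worked out directly. This is a back-of-the-envelope sketch that assumes one 8-bit transfer per clock cycle and ignores any packet or protocol overhead; neither assumption is stated in the abstract.

```python
LINK_WIDTH_BITS = 8          # each Dataway is 8 bits wide (from the abstract)
LINK_CLOCK_HZ = 50_000_000   # 50 MHz link clock (from the abstract)
NUM_LINKS = 6                # six Dataways per processor (from the abstract)

# Assumption: one 8-bit word moves per clock cycle in each direction.
per_link_bytes_per_s = LINK_WIDTH_BITS * LINK_CLOCK_HZ // 8
aggregate_bytes_per_s = per_link_bytes_per_s * NUM_LINKS

print(per_link_bytes_per_s // 1_000_000, "MB/s per link, per direction")
print(aggregate_bytes_per_s // 1_000_000, "MB/s across all links, per direction")
```

Because the links are full duplex, the same raw rate would be available simultaneously in the opposite direction under these assumptions.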
DOE Office of Scientific and Technical Information (OSTI.GOV)
Cohen, J; Dossa, D; Gokhale, M
Critical data science applications requiring frequent access to storage perform poorly on today's computing architectures. This project addresses efficient computation of data-intensive problems in national security and basic science by exploring, advancing, and applying a new form of computing called storage-intensive supercomputing (SISC). Our goal is to enable applications that simply cannot run on current systems, and, for a broad range of data-intensive problems, to deliver an order of magnitude improvement in price/performance over today's data-intensive architectures. This technical report documents much of the work done under LDRD 07-ERD-063 Storage Intensive Supercomputing during the period 05/07-09/07. The following chapters describe: (1) a new file I/O monitoring tool iotrace developed to capture the dynamic I/O profiles of Linux processes; (2) an out-of-core graph benchmark for level-set expansion of scale-free graphs; (3) an entity extraction benchmark consisting of a pipeline of eight components; and (4) an image resampling benchmark drawn from the SWarp program in the LSST data processing pipeline. The performance of the graph and entity extraction benchmarks was measured in three different scenarios: data sets residing on the NFS file server and accessed over the network; data sets stored on local disk; and data sets stored on the Fusion I/O parallel NAND Flash array. The image resampling benchmark compared performance of software-only to GPU-accelerated. In addition to the work reported here, an additional text processing application was developed that used an FPGA to accelerate n-gram profiling for language classification. The n-gram application will be presented at SC07 at the High Performance Reconfigurable Computing Technologies and Applications Workshop. The graph and entity extraction benchmarks were run on a Supermicro server housing the NAND Flash 40GB parallel disk array, the Fusion-io.
The Fusion system specs are as follows: SuperMicro X7DBE Xeon Dual Socket Blackford Server Motherboard; 2 Intel Xeon Dual-Core 2.66 GHz processors; 1 GB DDR2 PC2-5300 RAM (2 x 512); 80GB Hard Drive (Seagate SATA II Barracuda). The Fusion board is presently capable of 4X in a PCIe slot. The image resampling benchmark was run on a dual Xeon workstation with NVIDIA graphics card (see Chapter 5 for full specification). An XtremeData Opteron+FPGA was used for the language classification application. We observed that these benchmarks are not uniformly I/O intensive. The only benchmark that showed greater than 50% of the time in I/O was the graph algorithm when it accessed data files over NFS. When local disk was used, the graph benchmark spent at most 40% of its time in I/O. The other benchmarks were CPU dominated. The image resampling benchmark and language classification showed order of magnitude speedup over software by using co-processor technology to offload the CPU-intensive kernels. Our experiments to date suggest that emerging hardware technologies offer significant benefit to boosting the performance of data-intensive algorithms. Using GPU and FPGA co-processors, we were able to improve performance by more than an order of magnitude on the benchmark algorithms, eliminating the processor bottleneck of CPU-bound tasks. Experiments with a prototype solid state nonvolatile memory available today show 10X better throughput on random reads than disk, with a 2X speedup on a graph processing benchmark when compared to the use of local SATA disk.
2015-06-13
The Berkeley Out-of-Order Machine (BOOM): An Industry-Competitive, Synthesizable, Parameterized RISC-V Processor Christopher Celio David A...Synthesizable, Parameterized RISC-V Processor Christopher Celio, David Patterson, and Krste Asanović University of California, Berkeley, California 94720...Order Machine BOOM is a synthesizable, parameterized, superscalar out-of-order RISC-V core designed to serve as the prototypical baseline processor
Options for Parallelizing a Planning and Scheduling Algorithm
NASA Technical Reports Server (NTRS)
Clement, Bradley J.; Estlin, Tara A.; Bornstein, Benjamin D.
2011-01-01
Space missions have a growing interest in putting multi-core processors onboard spacecraft. For many missions, limited processing power significantly slows operations. We investigate how continual planning and scheduling algorithms can exploit multi-core processing and outline different potential design decisions for a parallelized planning architecture. This organization of choices and challenges helps us with an initial design for parallelizing the CASPER planning system for a mesh multi-core processor. This work extends that presented at another workshop with some preliminary results.
Fault-Tolerant, Real-Time, Multi-Core Computer System
NASA Technical Reports Server (NTRS)
Gostelow, Kim P.
2012-01-01
A document discusses a fault-tolerant, self-aware, low-power, multi-core computer for space missions with thousands of simple cores, achieving speed through concurrency. The proposed machine decides how to achieve concurrency in real time, rather than depending on programmers. The driving features of the system are simple hardware that is modular in the extreme, with no shared memory, and software with significant runtime reorganizing capability. The document describes a mechanism for moving ongoing computations and data that is based on a functional model of execution. Because there is no shared memory, the processor connects to its neighbors through a high-speed data link. Messages are sent to a neighbor switch, which in turn forwards that message on to its neighbor until reaching the intended destination. Except for the neighbor connections, processors are isolated and independent of each other. The processors on the periphery also connect chip-to-chip, thus building up a large processor net. There is no particular topology to the larger net, as a function at each processor allows it to forward a message in the correct direction. Some chip-to-chip connections are not necessarily nearest neighbors, providing short cuts for some of the longer physical distances. The peripheral processors also provide the connections to sensors, actuators, radios, science instruments, and other devices with which the computer system interacts.
The mathematical theory of signal processing and compression-designs
NASA Astrophysics Data System (ADS)
Feria, Erlan H.
2006-05-01
The mathematical theory of signal processing, named processor coding, will be shown to inherently arise as the computational time dual of Shannon's mathematical theory of communication, which is also known as source coding. Source coding is concerned with signal source memory space compression while processor coding deals with signal processor computational time compression. Their combination is named compression-designs and referred to as Conde for short. A compelling and pedagogically appealing diagram will be discussed highlighting Conde's remarkably successful application to real-world knowledge-aided (KA) airborne moving target indicator (AMTI) radar.
Methanol tailgas combustor control method
Hart-Predmore, David J.; Pettit, William H.
2002-01-01
A method for controlling the power and temperature and fuel source of a combustor in a fuel cell apparatus to supply heat to a fuel processor where the combustor has dual fuel inlet streams including a first fuel stream, and a second fuel stream of anode effluent from the fuel cell and reformate from the fuel processor. In all operating modes, an enthalpy balance is determined by regulating the amount of the first and/or second fuel streams and the quantity of the first air flow stream to support fuel processor power requirements.
A highly efficient multi-core algorithm for clustering extremely large datasets
2010-01-01
Background: In recent years, the demand for computational power in computational biology has increased due to rapidly growing data sets from microarray and other high-throughput technologies. This demand is likely to increase. Standard algorithms for analyzing data, such as cluster algorithms, need to be parallelized for fast processing. Unfortunately, most approaches for parallelizing algorithms largely rely on network communication protocols connecting and requiring multiple computers. One answer to this problem is to utilize the intrinsic capabilities in current multi-core hardware to distribute the tasks among the different cores of one computer. Results: We introduce a multi-core parallelization of the k-means and k-modes cluster algorithms based on the design principles of transactional memory for clustering gene expression microarray type data and categorical SNP data. Our new shared memory parallel algorithms prove to be highly efficient. We demonstrate their computational power and show their utility in cluster stability and sensitivity analysis employing repeated runs with slightly changed parameters. Computation speed of our Java based algorithm was increased by a factor of 10 for large data sets while preserving computational accuracy compared to single-core implementations and a recently published network based parallelization. Conclusions: Most desktop computers and even notebooks provide at least dual-core processors. Our multi-core algorithms show that using modern algorithmic concepts, parallelization makes it possible to perform even such laborious tasks as cluster sensitivity and cluster number estimation on the laboratory computer. PMID:20370922
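The idea of distributing one k-means iteration among the cores of a single computer can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' Java implementation and not a transactional-memory design: the function names (`nearest`, `assign_chunk`, `kmeans_step`) are hypothetical, each worker computes private partial sums that are merged afterwards, and a thread pool is used only for portability (in CPython a process pool would be needed for a real multi-core speedup).

```python
from concurrent.futures import ThreadPoolExecutor


def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    dists = [sum((p - c) ** 2 for p, c in zip(point, cen)) for cen in centroids]
    return dists.index(min(dists))


def assign_chunk(chunk, centroids):
    """Partial result for one chunk: per-cluster coordinate sums and counts."""
    dim = len(centroids[0])
    sums = [[0.0] * dim for _ in centroids]
    counts = [0] * len(centroids)
    for point in chunk:
        k = nearest(point, centroids)
        counts[k] += 1
        for d in range(dim):
            sums[k][d] += point[d]
    return sums, counts


def kmeans_step(points, centroids, workers=4):
    """One k-means iteration with the assignment step split across workers."""
    chunks = [points[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(lambda ch: assign_chunk(ch, centroids), chunks))
    new_centroids = []
    for k, old in enumerate(centroids):
        total = sum(counts[k] for _, counts in partials)
        if total == 0:
            new_centroids.append(list(old))  # keep an empty cluster's centroid
            continue
        new_centroids.append([
            sum(sums[k][d] for sums, _ in partials) / total
            for d in range(len(old))
        ])
    return new_centroids


if __name__ == "__main__":
    data = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
    print(kmeans_step(data, [(0.0, 0.0), (10.0, 10.0)], workers=2))
```

Because every worker writes only to its own partial sums, no locking is needed during the assignment step; the merge at the end is the only sequential part of the iteration.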
MULTI-CORE AND OPTICAL PROCESSOR RELATED APPLICATIONS RESEARCH AT OAK RIDGE NATIONAL LABORATORY
DOE Office of Scientific and Technical Information (OSTI.GOV)
Barhen, Jacob; Kerekes, Ryan A; ST Charles, Jesse Lee
2008-01-01
High-speed parallelization of common tasks holds great promise as a low-risk approach to achieving the significant increases in signal processing and computational performance required for next generation innovations in reconfigurable radio systems. Researchers at the Oak Ridge National Laboratory have been working on exploiting the parallelization offered by this emerging technology and applying it to a variety of problems. This paper will highlight recent experience with four different parallel processors applied to signal processing tasks that are directly relevant to signal processing required for SDR/CR waveforms. The first is the EnLight Optical Core Processor applied to matched filter (MF) correlation processing via fast Fourier transform (FFT) of broadband Doppler-sensitive waveforms (DSW) using active sonar arrays for target tracking. The second is the IBM CELL Broadband Engine applied to 2-D discrete Fourier transform (DFT) kernel for image processing and frequency domain processing. And the third is the NVIDIA graphical processor applied to document feature clustering. EnLight Optical Core Processor. Optical processing is inherently capable of high-parallelism that can be translated to very high performance, low power dissipation computing. The EnLight 256 is a small form factor signal processing chip (5x5 cm2) with a digital optical core that is being developed by an Israeli startup company. As part of its evaluation of foreign technology, ORNL's Center for Engineering Science Advanced Research (CESAR) had access to a precursor EnLight 64 Alpha hardware for a preliminary assessment of capabilities in terms of large Fourier transforms for matched filter banks and on applications related to Doppler-sensitive waveforms. This processor is optimized for array operations, which it performs in fixed-point arithmetic at the rate of 16 TeraOPS at 8-bit precision. This is approximately 1000 times faster than the fastest DSP available today.
The optical core performs the matrix-vector multiplications, where the nominal matrix size is 256x256. The system clock is 125MHz. At each clock cycle, 128K multiply-and-add operations are carried out, which yields a peak performance of 16 tera-operations per second (TeraOPS). IBM Cell Broadband Engine. The Cell processor is the extraordinary resulting product of 5 years of sustained, intensive R&D collaboration (involving over $400M investment) between IBM, Sony, and Toshiba. Its architecture comprises one multithreaded 64-bit PowerPC processor element (PPE) with VMX capabilities and two levels of globally coherent cache, and 8 synergistic processor elements (SPEs). Each SPE consists of a processor (SPU) designed for streaming workloads, local memory, and a globally coherent direct memory access (DMA) engine. Computations are performed in 128-bit wide single instruction multiple data streams (SIMD). An integrated high-bandwidth element interconnect bus (EIB) connects the nine processors and their ports to external memory and to system I/O. The Applied Software Engineering Research (ASER) Group at the ORNL is applying the Cell to a variety of text and image analysis applications. Research on Cell-equipped PlayStation3 (PS3) consoles has led to the development of a correlation-based image recognition engine that enables a single PS3 to process images at more than 10X the speed of state-of-the-art single-core processors. NVIDIA Graphics Processing Units. The ASER group is also employing the latest NVIDIA graphical processing units (GPUs) to accelerate clustering of thousands of text documents using recently developed clustering algorithms such as document flocking and affinity propagation.
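The peak figure quoted above for the EnLight optical core can be checked with a few lines of arithmetic. The sketch below assumes that each element of the 256x256 matrix contributes one multiply and one add per clock cycle, so that "128K" is read as 2 x 256 x 256 operations per cycle; that reading and the variable names are assumptions, not statements from the paper.

```python
# Back-of-the-envelope check of the EnLight 256 peak rate quoted in the abstract.
# Assumption: one multiply plus one add per matrix element per clock cycle.
MATRIX_DIM = 256          # nominal matrix size is 256x256
CLOCK_HZ = 125e6          # system clock of 125 MHz

macs_per_cycle = MATRIX_DIM * MATRIX_DIM      # 65,536 multiply-accumulates
ops_per_cycle = 2 * macs_per_cycle            # 131,072 = 128K operations
peak_ops_per_second = ops_per_cycle * CLOCK_HZ

print(f"{ops_per_cycle // 1024}K operations per cycle")
print(f"{peak_ops_per_second / 1e12:.2f} TeraOPS peak")
```

Under this reading the product comes to about 16.4 x 10^12 operations per second, consistent with the rounded 16 TeraOPS stated in the abstract.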
NASA Astrophysics Data System (ADS)
Erez, Mattan; Dally, William J.
Stream processors, like other multi-core architectures, partition their functional units and storage into multiple processing elements. In contrast to typical architectures, which contain symmetric general-purpose cores and a cache hierarchy, stream processors have a significantly leaner design. Stream processors are specifically designed for the stream execution model, in which applications have large amounts of explicit parallel computation, structured and predictable control, and memory accesses that can be performed at a coarse granularity. Applications in the streaming model are expressed in a gather-compute-scatter form, yielding programs with explicit control over transferring data to and from on-chip memory. Relying on these characteristics, which are common to many media processing and scientific computing applications, stream architectures redefine the boundary between software and hardware responsibilities with software bearing much of the complexity required to manage concurrency, locality, and latency tolerance. Thus, stream processors have minimal control consisting of fetching medium- and coarse-grained instructions and executing them directly on the many ALUs. Moreover, the on-chip storage hierarchy of stream processors is under explicit software control, as is all communication, eliminating the need for complex reactive hardware mechanisms.
NASA Astrophysics Data System (ADS)
Nishiura, Daisuke; Furuichi, Mikito; Sakaguchi, Hide
2015-09-01
The computational performance of a smoothed particle hydrodynamics (SPH) simulation is investigated for three types of current shared-memory parallel computer devices: many integrated core (MIC) processors, graphics processing units (GPUs), and multi-core CPUs. We are especially interested in efficient shared-memory allocation methods for each chipset, because the efficient data access patterns differ between compute unified device architecture (CUDA) programming for GPUs and OpenMP programming for MIC processors and multi-core CPUs. We first introduce several parallel implementation techniques for the SPH code, and then examine these on our target computer architectures to determine the most effective algorithms for each processor unit. In addition, we evaluate the effective computing performance and power efficiency of the SPH simulation on each architecture, as these are critical metrics for overall performance in a multi-device environment. In our benchmark test, the GPU is found to produce the best arithmetic performance as a standalone device unit, and gives the most efficient power consumption. The multi-core CPU obtains the most effective computing performance. The computational speed of the MIC processor on Xeon Phi approached that of two Xeon CPUs. This indicates that using MICs is an attractive choice for existing SPH codes on multi-core CPUs parallelized by OpenMP, as it gains computational acceleration without the need for significant changes to the source code.
The Brain's Router: A Cortical Network Model of Serial Processing in the Primate Brain
Zylberberg, Ariel; Fernández Slezak, Diego; Roelfsema, Pieter R.; Dehaene, Stanislas; Sigman, Mariano
2010-01-01
The human brain efficiently solves certain operations such as object recognition and categorization through a massively parallel network of dedicated processors. However, human cognition also relies on the ability to perform an arbitrarily large set of tasks by flexibly recombining different processors into a novel chain. This flexibility comes at the cost of a severe slowing down and a seriality of operations (100–500 ms per step). A limit on parallel processing is demonstrated in experimental setups such as the psychological refractory period (PRP) and the attentional blink (AB) in which the processing of an element either significantly delays (PRP) or impedes conscious access (AB) of a second, rapidly presented element. Here we present a spiking-neuron implementation of a cognitive architecture where a large number of local parallel processors assemble together to produce goal-driven behavior. The precise mapping of incoming sensory stimuli onto motor representations relies on a “router” network capable of flexibly interconnecting processors and rapidly changing its configuration from one task to another. Simulations show that, when presented with dual-task stimuli, the network exhibits parallel processing at peripheral sensory levels, a memory buffer capable of keeping the result of sensory processing on hold, and a slow serial performance at the router stage, resulting in a performance bottleneck. The network captures the detailed dynamics of human behavior during dual-task performance, including both mean RTs and RT distributions, and establishes concrete predictions on neuronal dynamics during dual-task experiments in humans and non-human primates. PMID:20442869
DOE Office of Scientific and Technical Information (OSTI.GOV)
Sitaraman, Hariswaran; Grout, Ray W
This work investigates novel algorithm designs and optimization techniques for restructuring chemistry integrators in zero and multidimensional combustion solvers, which can then be effectively used on the emerging generation of Intel's Many Integrated Core/Xeon Phi processors. These processors offer increased computing performance via a large number of lightweight cores at relatively lower clock speeds compared to traditional processors (e.g. Intel Sandybridge/Ivybridge) used in current supercomputers. This style of processor can be productively used for chemistry integrators that form a costly part of computational combustion codes, in spite of their relatively lower clock speeds. Performance commensurate with traditional processors is achieved here through the combination of careful memory layout, exposing multiple levels of fine grain parallelism and through extensive use of vendor supported libraries (Cilk Plus and Math Kernel Libraries). Important optimization techniques for efficient memory usage and vectorization have been identified and quantified. These optimizations resulted in a factor of ~ 3 speed-up using Intel 2013 compiler and ~ 1.5 using Intel 2017 compiler for large chemical mechanisms compared to the unoptimized version on the Intel Xeon Phi. The strategies, especially with respect to memory usage and vectorization, should also be beneficial for general purpose computational fluid dynamics codes.
Performance of the Cell processor for biomolecular simulations
NASA Astrophysics Data System (ADS)
De Fabritiis, G.
2007-06-01
The new Cell processor represents a turning point for computing intensive applications. Here, I show that for molecular dynamics it is possible to reach an impressive sustained performance in excess of 30 Gflops with a peak of 45 Gflops for the non-bonded force calculations, over one order of magnitude faster than a single core standard processor.
SpaceCubeX: A Framework for Evaluating Hybrid Multi-Core CPU FPGA DSP Architectures
NASA Technical Reports Server (NTRS)
Schmidt, Andrew G.; Weisz, Gabriel; French, Matthew; Flatley, Thomas; Villalpando, Carlos Y.
2017-01-01
The SpaceCubeX project is motivated by the need for high performance, modular, and scalable on-board processing to help scientists answer critical 21st century questions about global climate change, air quality, ocean health, and ecosystem dynamics, while adding new capabilities such as low-latency data products for extreme event warnings. These goals translate into on-board processing throughput requirements that are on the order of 100-1,000 times more than those of previous Earth Science missions for standard processing, compression, storage, and downlink operations. To study possible future architectures to achieve these performance requirements, the SpaceCubeX project provides an evolvable testbed and framework that enables a focused design space exploration of candidate hybrid CPU/FPGA/DSP processing architectures. The framework includes ArchGen, an architecture generator tool populated with candidate architecture components, performance models, and IP cores, that allows an end user to specify the type, number, and connectivity of a hybrid architecture. The framework requires minimal extensions to integrate new processors, such as the anticipated High Performance Spaceflight Computer (HPSC), reducing time to initiate benchmarking by months. To evaluate the framework, we leverage a wide suite of high performance embedded computing benchmarks and Earth science scenarios to ensure robust architecture characterization. We report on our project's Year 1 efforts and demonstrate the capabilities across four simulation testbed models, a baseline SpaceCube 2.0 system, a dual ARM A9 processor system, a hybrid quad ARM A53 and FPGA system, and a hybrid quad ARM A53 and DSP system.
Energy-efficient fault tolerance in multiprocessor real-time systems
NASA Astrophysics Data System (ADS)
Guo, Yifeng
The recent progress in multiprocessor/multicore systems has important implications for real-time system design and operation. From vehicle navigation to space applications as well as industrial control systems, the trend is to deploy multiple processors in real-time systems: systems with 4-8 processors are common, and it is expected that many-core systems with dozens of processing cores will be available in the near future. For such systems, in addition to the general temporal requirements common to all real-time systems, two additional operational objectives are seen as critical: energy efficiency and fault tolerance. An intriguing dimension of the problem is that energy efficiency and fault tolerance are typically conflicting objectives, due to the fact that tolerating faults (e.g., permanent/transient) often requires extra resources with high energy consumption potential. In this dissertation, various techniques for energy-efficient fault tolerance in multiprocessor real-time systems have been investigated. First, the Reliability-Aware Power Management (RAPM) framework, which can preserve the system reliability with respect to transient faults when Dynamic Voltage Scaling (DVS) is applied for energy savings, is extended to support parallel real-time applications with precedence constraints. Next, the traditional Standby-Sparing (SS) technique for dual processor systems, which takes both transient and permanent faults into consideration while saving energy, is generalized to support multiprocessor systems with an arbitrary number of identical processors. Observing the inefficient usage of slack time in the SS technique, a Preference-Oriented Scheduling Framework is designed to address the problem where tasks are given preferences for being executed as soon as possible (ASAP) or as late as possible (ALAP).
A preference-oriented earliest deadline (POED) scheduler is proposed and its application in multiprocessor systems for energy-efficient fault tolerance is investigated, where tasks' main copies are executed ASAP while backup copies are executed ALAP to reduce the overlapped execution of main and backup copies of the same task and thus reduce energy consumption. All proposed techniques are evaluated through extensive simulations and compared with other state-of-the-art approaches. The simulation results confirm that the proposed schemes can preserve the system reliability while still achieving substantial energy savings. Finally, for both SS and POED based Energy-Efficient Fault-Tolerant (EEFT) schemes, a series of recovery strategies is designed for cases where more than one (transient and permanent) fault needs to be tolerated.
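The ASAP/ALAP idea behind the preference-oriented scheduler can be illustrated with a single task. The sketch below is a simplification under stated assumptions, not the dissertation's POED algorithm: it considers one task with a hypothetical release time, deadline, and worst-case execution time, places the main copy as early as possible and the backup copy as late as possible (as if on a second processor), and reports how long the two copies would have to run at the same time.

```python
from dataclasses import dataclass


@dataclass
class Task:
    release: float   # earliest start time (hypothetical field names)
    deadline: float  # absolute deadline
    wcet: float      # worst-case execution time


def asap_alap_windows(task: Task):
    """Return (main_window, backup_window, overlap) for one task."""
    if task.release + task.wcet > task.deadline:
        raise ValueError("task cannot meet its deadline")
    main = (task.release, task.release + task.wcet)        # main copy: ASAP
    backup = (task.deadline - task.wcet, task.deadline)    # backup copy: ALAP
    overlap = max(0.0, main[1] - backup[0])
    return main, backup, overlap


if __name__ == "__main__":
    # Enough slack: the backup window starts after the main copy finishes,
    # so the two copies never execute concurrently.
    print(asap_alap_windows(Task(release=0, deadline=10, wcet=3)))
    # Little slack: the copies must run concurrently for 2 time units.
    print(asap_alap_windows(Task(release=0, deadline=10, wcet=6)))
```

The smaller the overlap, the less redundant work is executed when the main copy completes without a fault, which is the energy argument made in the abstract.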
2016-05-07
REPORT DOCUMENTATION PAGE ... OMB No. 0704-0188 The public reporting burden for this collection of...Student Support for Application of Advanced Multi-Core Processor Technologies to Oceanographic Research (N00014-12-1-0298)...communications protocols (i.e. UART, I2C, and SPI), through the handing off of the data to the server APIs. By providing a common set of tools
DOE Office of Scientific and Technical Information (OSTI.GOV)
Mittal, Sparsh; Zhang, Zhao
With each CMOS technology generation, leakage energy consumption has been dramatically increasing and hence, managing leakage power consumption of large last-level caches (LLCs) has become a critical issue in modern processor design. In this paper, we present EnCache, a novel software-based technique which uses dynamic profiling-based cache reconfiguration for saving cache leakage energy. EnCache uses a simple hardware component called profiling cache, which dynamically predicts energy efficiency of an application for 32 possible cache configurations. Using these estimates, system software reconfigures the cache to the most energy efficient configuration. EnCache uses dynamic cache reconfiguration and hence, it does not require offline profiling or tuning the parameter for each application. Furthermore, EnCache optimizes directly for the overall memory subsystem (LLC and main memory) energy efficiency instead of the LLC energy efficiency alone. The experiments performed with an x86-64 simulator and workloads from SPEC2006 suite confirm that EnCache provides larger energy saving than a conventional energy saving scheme. For single core and dual-core system configurations, the average savings in memory subsystem energy over a shared baseline configuration are 30.0% and 27.3%, respectively.
Cheung, Kit; Schultz, Simon R; Luk, Wayne
2015-01-01
NeuroFlow is a scalable spiking neural network simulation platform for off-the-shelf high performance computing systems using customizable hardware processors such as Field-Programmable Gate Arrays (FPGAs). Unlike multi-core processors and application-specific integrated circuits, the processor architecture of NeuroFlow can be redesigned and reconfigured to suit a particular simulation to deliver optimized performance, such as the degree of parallelism to employ. The compilation process supports using PyNN, a simulator-independent neural network description language, to configure the processor. NeuroFlow supports a number of commonly used current or conductance based neuronal models such as integrate-and-fire and Izhikevich models, and the spike-timing-dependent plasticity (STDP) rule for learning. A 6-FPGA system can simulate a network of up to ~600,000 neurons and can achieve a real-time performance of 400,000 neurons. Using one FPGA, NeuroFlow delivers a speedup of up to 33.6 times the speed of an 8-core processor, or 2.83 times the speed of GPU-based platforms. With high flexibility and throughput, NeuroFlow provides a viable environment for large-scale neural network simulation.
Cheung, Kit; Schultz, Simon R.; Luk, Wayne
2016-01-01
NeuroFlow is a scalable spiking neural network simulation platform for off-the-shelf high performance computing systems using customizable hardware processors such as Field-Programmable Gate Arrays (FPGAs). Unlike multi-core processors and application-specific integrated circuits, the processor architecture of NeuroFlow can be redesigned and reconfigured to suit a particular simulation to deliver optimized performance, such as the degree of parallelism to employ. The compilation process supports using PyNN, a simulator-independent neural network description language, to configure the processor. NeuroFlow supports a number of commonly used current or conductance based neuronal models such as integrate-and-fire and Izhikevich models, and the spike-timing-dependent plasticity (STDP) rule for learning. A 6-FPGA system can simulate a network of up to ~600,000 neurons and can achieve a real-time performance of 400,000 neurons. Using one FPGA, NeuroFlow delivers a speedup of up to 33.6 times the speed of an 8-core processor, or 2.83 times the speed of GPU-based platforms. With high flexibility and throughput, NeuroFlow provides a viable environment for large-scale neural network simulation. PMID:26834542
MILC Code Performance on High End CPU and GPU Supercomputer Clusters
NASA Astrophysics Data System (ADS)
DeTar, Carleton; Gottlieb, Steven; Li, Ruizi; Toussaint, Doug
2018-03-01
With recent developments in parallel supercomputing architecture, many-core, multi-core, and GPU processors are now commonplace, resulting in more levels of parallelism, memory hierarchy, and programming complexity. It has been necessary to adapt the MILC code to these new processors starting with NVIDIA GPUs, and more recently, the Intel Xeon Phi processors. We report on our efforts to port and optimize our code for the Intel Knights Landing architecture. We consider performance of the MILC code with MPI and OpenMP, and optimizations with QOPQDP and QPhiX. For the latter approach, we concentrate on the staggered conjugate gradient and gauge force. We also consider performance on recent NVIDIA GPUs using the QUDA library.
Energy Efficient Real-Time Scheduling Using DPM on Mobile Sensors with a Uniform Multi-Cores
Kim, Youngmin; Lee, Chan-Gun
2017-01-01
In wireless sensor networks (WSNs), sensor nodes are deployed for collecting and analyzing data. These nodes use limited-energy batteries for easy deployment and low cost. The use of limited-energy batteries is closely related to the lifetime of the sensor nodes in wireless sensor networks. Efficient energy management is important for extending the lifetime of the sensor nodes. Most efforts to improve power efficiency in tiny sensor nodes have focused mainly on reducing the power consumed during data transmission. However, the recent emergence of sensor nodes equipped with multi-cores requires that attention be given to the problem of reducing power consumption in multi-cores. In this paper, we propose an energy-efficient scheduling method for sensor nodes supporting uniform multi-cores. We extend the proposed T-Ler plane based scheduling for global optimal scheduling of uniform multi-cores and multi-processors to enable power management using dynamic power management. In the proposed approach, a processor selection method for scheduling and a mapping method between the tasks and processors are proposed to efficiently utilize dynamic power management. Experiments show the effectiveness of the proposed approach compared to other existing methods. PMID:29240695
Processor-in-memory-and-storage architecture
DOE Office of Scientific and Technical Information (OSTI.GOV)
DeBenedictis, Erik
A method and apparatus for performing reliable general-purpose computing. Each sub-core of a plurality of sub-cores of a processor core processes a same instruction at a same time. A code analyzer receives a plurality of residues that represents a code word corresponding to the same instruction and an indication of whether the code word is a memory address code or a data code from the plurality of sub-cores. The code analyzer determines whether the plurality of residues are consistent or inconsistent. The code analyzer and the plurality of sub-cores perform a set of operations based on whether the code word is a memory address code or a data code and a determination of whether the plurality of residues are consistent or inconsistent.
Towards a Generic and Adaptive System-On-Chip Controller for Space Exploration Instrumentation
NASA Technical Reports Server (NTRS)
Iturbe, Xabier; Keymeulen, Didier; Yiu, Patrick; Berisford, Dan; Hand, Kevin; Carlson, Robert; Ozer, Emre
2015-01-01
This paper introduces one of the first efforts conducted at NASA’s Jet Propulsion Laboratory (JPL) to develop a generic System-on-Chip (SoC) platform to control science instruments that are proposed for future NASA missions. The SoC platform is named APEX-SoC, where APEX stands for Advanced Processor for space Exploration, and is based on a hybrid Xilinx Zynq that combines an FPGA and an ARM Cortex-A9 dual-core processor on a single chip. The Zynq implements a generic and customizable on-chip infrastructure that can be reused with a variety of instruments, and it has been coupled with a set of off-chip components that are necessary to deal with the different instruments. We have taken JPL’s Compositional InfraRed Imaging Spectrometer (CIRIS), which is proposed for NASA icy moons missions, as a use-case scenario to demonstrate that the entire data processing, control and interface of an instrument can be implemented on a single device using the on-chip infrastructure described in this paper. We show that the performance results achieved in this preliminary version of the instrumentation controller are sufficient to fulfill the science requirements placed on the CIRIS instrument in future NASA missions, such as Europa.
Self-Calibrating and Remote Programmable Signal Conditioning Amplifier System and Method
NASA Technical Reports Server (NTRS)
Medelius, Pedro J. (Inventor); Hallberg, Carl G. (Inventor); Simpson, Howard J., III (Inventor); Thayer, Stephen W. (Inventor)
1998-01-01
A self-calibrating, remote programmable signal conditioning amplifier system employs information read from a memory attached to a measurement transducer for automatic calibration. The signal conditioning amplifier is self-calibrated on a continuous basis through use of a dual input path arrangement, with each path containing a multiplexer and a programmable amplifier. A digital signal processor controls operation of the system such that a transducer signal is applied to one of the input paths, while one or more calibration signals are applied to the second input path. Once the second path is calibrated, the digital signal processor switches the transducer signal to the second path and then calibrates the first path. This process is continually repeated so that each path is calibrated on an essentially continuous basis. Dual output paths are also employed which are calibrated in the same manner. The digital signal processor also allows the implementation of a variety of digital filters which are either programmed into the system or downloaded by an operator, and performs up to eighth order linearization.
Parallel processing approach to transform-based image coding
NASA Astrophysics Data System (ADS)
Normile, James O.; Wright, Dan; Chu, Ken; Yeh, Chia L.
1991-06-01
This paper describes a flexible parallel processing architecture designed for use in real time video processing. The system consists of floating point DSP processors connected to each other via fast serial links; each processor has access to a globally shared memory. A multiple bus architecture in combination with a dual ported memory allows communication with a host control processor. The system has been applied to prototyping of video compression and decompression algorithms. The decomposition of transform based algorithms for decompression into a form suitable for parallel processing is described. A technique for automatic load balancing among the processors is developed and discussed; results are presented with image statistics and data rates. Finally, techniques for accelerating the system throughput are analyzed and results from the application of one such modification are described.
Jang, Yongwon; Noh, Hyung Wook; Lee, I B; Jung, Ji-Wook; Song, Yoonseon; Lee, Sooyeul; Kim, Seunghwan
2012-01-01
A patch type embedded cardiac function monitoring system was developed to detect arrhythmias such as PVC (Premature Ventricular Contraction), pause, ventricular fibrillation, and tachy/bradycardia. The overall system is composed of a main module including a dual processor and a Bluetooth telecommunication module. The dual microprocessor strategy minimizes power consumption and size, and guarantees the resources of embedded software programs. The developed software was verified with standard DB, and showed good performance.
Exact diagonalization of quantum lattice models on coprocessors
NASA Astrophysics Data System (ADS)
Siro, T.; Harju, A.
2016-10-01
We implement the Lanczos algorithm on an Intel Xeon Phi coprocessor and compare its performance to a multi-core Intel Xeon CPU and an NVIDIA graphics processor. The Xeon and the Xeon Phi are parallelized with OpenMP and the graphics processor is programmed with CUDA. The performance is evaluated by measuring the execution time of a single step in the Lanczos algorithm. We study two quantum lattice models with different particle numbers, and conclude that for small systems, the multi-core CPU is the fastest platform, while for large systems, the graphics processor is the clear winner, reaching speedups of up to 7.6 compared to the CPU. The Xeon Phi outperforms the CPU with sufficiently large particle number, reaching a speedup of 2.5.
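The quantity timed in the abstract above, a single Lanczos step, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `matvec` and `lanczos_step` are hypothetical, the Hamiltonian is a small dense list of lists rather than a sparse lattice operator, and no reorthogonalization is performed.

```python
import math

def matvec(H, v):
    """Dense matrix-vector product; a stand-in for the sparse Hamiltonian product."""
    return [sum(h * x for h, x in zip(row, v)) for row in H]

def lanczos_step(H, v_prev, v_curr, beta_prev):
    """One Lanczos iteration for a symmetric matrix H.

    Returns (alpha, beta, v_next), where alpha and beta are the new diagonal
    and off-diagonal entries of the tridiagonal matrix.
    """
    w = matvec(H, v_curr)                               # dominant cost of the step
    alpha = sum(wi * vi for wi, vi in zip(w, v_curr))   # diagonal coefficient
    # Orthogonalize against the current and previous Lanczos vectors.
    w = [wi - alpha * vi - beta_prev * pi
         for wi, vi, pi in zip(w, v_curr, v_prev)]
    beta = math.sqrt(sum(wi * wi for wi in w))          # off-diagonal coefficient
    # A vanishing beta means the Krylov space is exhausted.
    v_next = [wi / beta for wi in w] if beta > 1e-12 else [0.0] * len(w)
    return alpha, beta, v_next
```

In practice the matrix-vector product dominates the cost of each step, which is why it is the natural target for OpenMP or CUDA parallelization.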
Scalable Motion Estimation Processor Core for Multimedia System-on-Chip Applications
NASA Astrophysics Data System (ADS)
Lai, Yeong-Kang; Hsieh, Tian-En; Chen, Lien-Fei
2007-04-01
In this paper, we describe a high-throughput and scalable motion estimation processor architecture for multimedia system-on-chip applications. The number of processing elements (PEs) is scalable according to the variable algorithm parameters and the performance required for different applications. Efficient use of the PE rings, together with an intelligent memory-interleaving organization, increases the efficiency of the architecture. Moreover, using efficient on-chip memories and a data management technique can effectively decrease the power consumption and memory bandwidth. Techniques for reducing the number of interconnections and external memory accesses are also presented. Our results demonstrate that the proposed scalable PE-ringed architecture is a flexible and high-performance processor core in multimedia system-on-chip applications.
Sentinel-2 Level 2A Prototype Processor: Architecture, Algorithms And First Results
NASA Astrophysics Data System (ADS)
Muller-Wilm, Uwe; Louis, Jerome; Richter, Rudolf; Gascon, Ferran; Niezette, Marc
2013-12-01
Sen2Core is a prototype processor for Sentinel-2 Level 2A product processing and formatting. The processor is developed for and with ESA and performs the tasks of Atmospheric Correction and Scene Classification of Level 1C input data. Level 2A outputs are: Bottom-Of-Atmosphere (BOA) corrected reflectance images, Aerosol Optical Thickness-, Water Vapour-, Scene Classification maps and Quality indicators, including cloud and snow probabilities. The Level 2A Product Formatting performed by the processor follows the specification of the Level 1C User Product.
Performance of VPIC on Sequoia
NASA Astrophysics Data System (ADS)
Nystrom, William
2014-10-01
Sequoia is a major DOE computing resource which is characteristic of future resources in that it has many threads per compute node, 64, and the individual processor cores are simpler and less powerful than cores on previous processors like Intel's Sandy Bridge or AMD's Opteron. An effort is in progress to port VPIC to the Blue Gene Q architecture of Sequoia and evaluate its performance. Results of this work will be presented on single node performance of VPIC as well as multi-node scaling.
Active non-volatile memory post-processing
DOE Office of Scientific and Technical Information (OSTI.GOV)
Kannan, Sudarsun; Milojicic, Dejan S.; Talwar, Vanish
A computing node includes an active Non-Volatile Random Access Memory (NVRAM) component which includes memory and a sub-processor component. The memory is to store data chunks received from a processor core, the data chunks comprising metadata indicating a type of post-processing to be performed on data within the data chunks. The sub-processor component is to perform post-processing of said data chunks based on said metadata.
Design of the SLAC RCE Platform: A General Purpose ATCA Based Data Acquisition System
DOE Office of Scientific and Technical Information (OSTI.GOV)
Herbst, R.; Claus, R.; Freytag, M.
2015-01-23
The SLAC RCE platform is a general purpose clustered data acquisition system implemented on a custom ATCA compliant blade, called the Cluster On Board (COB). The core of the system is the Reconfigurable Cluster Element (RCE), which is a system-on-chip design based upon the Xilinx Zynq family of FPGAs, mounted on custom COB daughter-boards. The Zynq architecture couples a dual core ARM Cortex A9 based processor with a high performance 28nm FPGA. The RCE has 12 external general purpose bi-directional high speed links, each supporting serial rates of up to 12Gbps. Eight RCE nodes are included on a COB, each with a 10Gbps connection to an on-board 24-port Ethernet switch integrated circuit. The COB is designed to be used with a standard full-mesh ATCA backplane allowing multiple RCE nodes to be tightly interconnected with minimal interconnect latency. Multiple shelves can be clustered using the front panel 10Gbps connections. The COB also supports local and inter-blade timing and trigger distribution. An experiment specific Rear Transition Module adapts the 96 high speed serial links to specific experiments and allows an experiment-specific timing and busy feedback connection. This coupling of processors with a high performance FPGA fabric in a low latency, multiple node cluster allows high speed data processing that can be easily adapted to any physics experiment. RTEMS and Linux are both ported to the module. The RCE has been used or is the baseline for several current and proposed experiments (LCLS, HPS, LSST, ATLAS-CSC, LBNE, DarkSide, ILC-SiD, etc).
A Real-Time Marker-Based Visual Sensor Based on a FPGA and a Soft Core Processor
Tayara, Hilal; Ham, Woonchul; Chong, Kil To
2016-01-01
This paper introduces a real-time marker-based visual sensor architecture for mobile robot localization and navigation. A hardware acceleration architecture for post video processing system was implemented on a field-programmable gate array (FPGA). The pose calculation algorithm was implemented in a System on Chip (SoC) with an Altera Nios II soft-core processor. For every frame, single pass image segmentation and Feature Accelerated Segment Test (FAST) corner detection were used for extracting the predefined markers with known geometries in FPGA. Coplanar PosIT algorithm was implemented on the Nios II soft-core processor supplied with floating point hardware for accelerating floating point operations. Trigonometric functions have been approximated using Taylor series and cubic approximation using Lagrange polynomials. Inverse square root method has been implemented for approximating square root computations. Real time results have been achieved and pixel streams have been processed on the fly without any need to buffer the input frame for further implementation. PMID:27983714
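The abstract above mentions approximating square roots through an inverse square root method but does not say which variant was used. As an illustration only, the sketch below uses the widely known bit-level initial guess for single-precision floats followed by Newton-Raphson refinement; the function names and the choice of two refinement steps are assumptions, not details taken from the paper.

```python
import struct

def fast_inv_sqrt(x, newton_steps=2):
    """Approximate 1/sqrt(x) for x > 0 using a float32 bit trick plus Newton steps."""
    # Reinterpret the float32 bit pattern of x as an unsigned 32-bit integer.
    i = struct.unpack("<I", struct.pack("<f", x))[0]
    # Magic-constant initial guess (classic single-precision variant).
    i = 0x5F3759DF - (i >> 1)
    y = struct.unpack("<f", struct.pack("<I", i))[0]
    # Newton-Raphson refinement of y ~ 1/sqrt(x): y <- y * (1.5 - 0.5 * x * y^2).
    for _ in range(newton_steps):
        y = y * (1.5 - 0.5 * x * y * y)
    return y

def approx_sqrt(x):
    """Approximate sqrt(x) as x * (1/sqrt(x)), avoiding a division."""
    return x * fast_inv_sqrt(x)
```

The appeal on a soft-core processor is that the refinement uses only multiplications and subtractions, which map well onto simple floating point hardware.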
Fault Mitigation Schemes for Future Spaceflight Multicore Processors
NASA Technical Reports Server (NTRS)
Alexander, James W.; Clement, Bradley J.; Gostelow, Kim P.; Lai, John Y.
2012-01-01
Future planetary exploration missions demand significant advances in on-board computing capabilities over current avionics architectures based on a single-core processing element. The state-of-the-art multi-core processor provides much promise in meeting such challenges while introducing new fault tolerance problems when applied to space missions. Software-based schemes are being presented in this paper that can achieve system-level fault mitigation beyond that provided by radiation-hard-by-design (RHBD). For mission and time critical applications such as the Terrain Relative Navigation (TRN) for planetary or small body navigation, and landing, a range of fault tolerance methods can be adapted by the application. The software methods being investigated include Error Correction Code (ECC) for data packet routing between cores, virtual network routing, Triple Modular Redundancy (TMR), and Algorithm-Based Fault Tolerance (ABFT). A robust fault tolerance framework that provides fail-operational behavior under hard real-time constraints and graceful degradation will be demonstrated using TRN executing on a commercial Tilera(R) processor with simulated fault injections.
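Of the software methods listed above, Triple Modular Redundancy is the simplest to illustrate: the same computation is run three times and a voter keeps the majority result. The sketch below is a generic, hypothetical illustration (the names `tmr_vote` and `run_with_tmr` and the fault-injection hooks are assumptions); it does not reflect the framework described in the paper or the Tilera platform.

```python
from collections import Counter

def tmr_vote(replica_outputs):
    """Return the majority value among exactly three replica outputs.

    Outputs must be hashable. Raises RuntimeError if all three disagree.
    """
    if len(replica_outputs) != 3:
        raise ValueError("TMR expects exactly three replica outputs")
    value, count = Counter(replica_outputs).most_common(1)[0]
    if count < 2:
        raise RuntimeError("no majority: all replicas disagree")
    return value

def run_with_tmr(task, arg, fault_injectors=(None, None, None)):
    """Run `task(arg)` three times, optionally corrupting results, then vote.

    Each entry of `fault_injectors` is either None or a function that maps a
    correct result to a corrupted one (a stand-in for a simulated fault).
    """
    outputs = []
    for inject in fault_injectors:
        result = task(arg)
        if inject is not None:
            result = inject(result)   # simulated single-replica fault
        outputs.append(result)
    return tmr_vote(outputs)
```

A single corrupted replica is masked by the vote; two simultaneous, differing faults are detected but not corrected, which is where complementary methods such as ABFT or ECC would come in.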
Parallel multireference configuration interaction calculations on mini-β-carotenes and β-carotene
NASA Astrophysics Data System (ADS)
Kleinschmidt, Martin; Marian, Christel M.; Waletzke, Mirko; Grimme, Stefan
2009-01-01
We present a parallelized version of a direct selecting multireference configuration interaction (MRCI) code [S. Grimme and M. Waletzke, J. Chem. Phys. 111, 5645 (1999)]. The program can be run either in ab initio mode or as semiempirical procedure combined with density functional theory (DFT/MRCI). We have investigated the efficiency of the parallelization in case studies on carotenoids and porphyrins. The performance is found to depend heavily on the cluster architecture. While the speed-up on the older Intel Netburst technology is close to linear for up to 12-16 processes, our results indicate that it is not favorable to use all cores of modern Intel Dual Core or Quad Core processors simultaneously for memory intensive tasks. Due to saturation of the memory bandwidth, we recommend to run less demanding tasks on the latter architectures in parallel to two (Dual Core) or four (Quad Core) MRCI processes per node. The DFT/MRCI branch has been employed to study the low-lying singlet and triplet states of mini-n-β-carotenes (n = 3, 5, 7, 9) and β-carotene (n = 11) at the geometries of the ground state, the first excited triplet state, and the optically bright singlet state. The order of states depends heavily on the conjugation length and the nuclear geometry. The $^{1}B_{u}^{+}$ state constitutes the $S_{1}$ state in the vertical absorption spectrum of mini-3-β-carotene but switches order with the $2\,^{1}A_{g}^{-}$ state upon excited state relaxation. In the longer carotenes, near degeneracy or even root flipping between the $^{1}B_{u}^{+}$ and $^{1}B_{u}^{-}$ states is observed whereas the $3\,^{1}A_{g}^{-}$ state is found to remain energetically above the optically bright $^{1}B_{u}^{+}$ state at all nuclear geometries investigated here. The DFT/MRCI method is seen to underestimate the absolute excitation energies of the longer mini-β-carotenes but the energy gaps between the excited states are reproduced well. In addition to singlet data, triplet-triplet absorption energies are presented. For β-carotene, where these transition energies are known from experiment, excellent agreement with our calculations is observed.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Sundaram, Sriram; Grenat, Aaron; Naffziger, Samuel
Power management techniques can be effective at extracting more performance and energy efficiency out of mature systems on chip (SoCs). For instance, the peak performance of microprocessors is often limited by worst case technology (Vmax), infrastructure (thermal/electrical), and microprocessor usage assumptions. Performance/watt of microprocessors also typically suffers from guard bands associated with the test and binning processes as well as worst case aging/lifetime degradation. Similarly, on multicore processors, shared voltage rails tend to limit the peak performance achievable in low thread count workloads. In this paper, we describe five power management techniques that maximize the per-part performance under the aforementioned constraints. Using these techniques, we demonstrate a net performance increase of up to 15% depending on the application and TDP of the SoC, implemented on 'Bristol Ridge,' a 28-nm CMOS, dual-core x86 accelerated processing unit.
Shared performance monitor in a multiprocessor system
Chiu, George; Gara, Alan G.; Salapura, Valentina
2012-07-24
A performance monitoring unit (PMU) and method for monitoring performance of events occurring in a multiprocessor system. The multiprocessor system comprises a plurality of processor units, each processor unit generating signals representing occurrences of events in that processor unit, and a single shared counter resource for performance monitoring. The performance monitor unit is shared by all processor cores in the multiprocessor system. The PMU comprises: a plurality of performance counters, each for counting signals representing occurrences of events from one or more of the plurality of processor units in the multiprocessor system; and a plurality of input devices for receiving the event signals from one or more of the plurality of processor units, the plurality of input devices being programmable to select event signals for receipt by one or more of the plurality of performance counters for counting, wherein the PMU is shared between multiple processing units, or within a group of processors in the multiprocessing system. The PMU is further programmed to monitor event signals issued from non-processor devices.
Application of Prognostic Health Management in Digital Electronic Systems
2007-01-01
variable external supply applied the necessary core power to the processor while the motherboard continued to source power from the ATX supply. By...isolating the processor power from the motherboard power, control over the aging profile of the processor was achieved. Once nominal operating...Physics-of-failure RISC – Reduced Instruction Set Computer RUL – Remaining Useful Life
High-Speed Computation of the Kleene Star in Max-Plus Algebraic System Using a Cell Broadband Engine
NASA Astrophysics Data System (ADS)
Goto, Hiroyuki
This research addresses a high-speed computation method for the Kleene star of the weighted adjacency matrix in a max-plus algebraic system. We focus on systems whose precedence constraints are represented by a directed acyclic graph and implement it on a Cell Broadband Engine™ (CBE) processor. Since the resulting matrix gives the longest travel times between two adjacent nodes, it is often utilized in scheduling problem solvers for a class of discrete event systems. This research, in particular, attempts to achieve a speedup by using two approaches: parallelization and SIMDization (Single Instruction, Multiple Data), both of which can be accomplished by a CBE processor. The former refers to a parallel computation using multiple cores, while the latter is a method whereby multiple elements are computed by a single instruction. Using the implementation on a Sony PlayStation 3™ equipped with a CBE processor, we found that the SIMDization is effective regardless of the system's size and the number of processor cores used. We also found that the scalability of using multiple cores is remarkable especially for systems with a large number of nodes. In a numerical experiment where the number of nodes is 2000, we achieved a speedup of 20 times compared with the method without the above techniques.
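For readers unfamiliar with the operation, the Kleene star in max-plus algebra can be sketched as follows. In this algebra "addition" is max, "multiplication" is +, the zero element is negative infinity and the unit element is 0. For a directed acyclic graph with n nodes the series A* = E ⊕ A ⊕ A² ⊕ ... terminates after the power n-1. The code below is a plain, sequential illustration with no parallelization or SIMD; the function names and the convention that `A[i][j]` holds the weight of the arc from node j to node i are assumptions made for this sketch.

```python
EPS = float("-inf")   # max-plus zero element (no path)

def mp_matmul(X, Y):
    """Max-plus matrix product: (X (x) Y)[i][j] = max_k (X[i][k] + Y[k][j])."""
    n = len(X)
    return [[max(X[i][k] + Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def kleene_star(A):
    """Kleene star A* = E (+) A (+) ... (+) A^(n-1) for a DAG adjacency matrix A.

    A[i][j] is the weight of the arc j -> i, or EPS if there is no arc.
    The result holds the longest path weight from j to i (0 on the diagonal).
    """
    n = len(A)
    E = [[0.0 if i == j else EPS for j in range(n)] for i in range(n)]
    star, power = E, E
    for _ in range(n - 1):
        power = mp_matmul(power, A)                      # next power of A
        star = [[max(s, p) for s, p in zip(srow, prow)]  # accumulate with max
                for srow, prow in zip(star, power)]
    return star
```

The inner `max(X[i][k] + Y[k][j] ...)` loop is the part that lends itself to SIMD execution, and the independent rows of the product are the natural unit for distribution across cores.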
Multi-level Hierarchical Poly Tree computer architectures
NASA Technical Reports Server (NTRS)
Padovan, Joe; Gute, Doug
1990-01-01
Based on the concept of hierarchical substructuring, this paper develops an optimal multi-level Hierarchical Poly Tree (HPT) parallel computer architecture scheme which is applicable to the solution of finite element and difference simulations. Emphasis is given to minimizing computational effort, in-core/out-of-core memory requirements, and the data transfer between processors. In addition, a simplified communications network that reduces the number of I/O channels between processors is presented. HPT configurations that yield optimal superlinearities are also demonstrated. Moreover, to generalize the scope of applicability, special attention is given to developing: (1) multi-level reduction trees which provide an orderly/optimal procedure by which model densification/simplification can be achieved, as well as (2) methodologies enabling processor grading that yields architectures with varying types of multi-level granularity.
APRON: A Cellular Processor Array Simulation and Hardware Design Tool
NASA Astrophysics Data System (ADS)
Barr, David R. W.; Dudek, Piotr
2009-12-01
We present a software environment for the efficient simulation of cellular processor arrays (CPAs). This software (APRON) is used to explore algorithms that are designed for massively parallel fine-grained processor arrays, topographic multilayer neural networks, vision chips with SIMD processor arrays, and related architectures. The software uses a highly optimised core combined with a flexible compiler to provide the user with tools for the design of new processor array hardware architectures and the emulation of existing devices. We present performance benchmarks for the software processor array implemented on standard commodity microprocessors. APRON can be configured to use additional processing hardware if necessary and can be used as a complete graphical user interface and development environment for new or existing CPA systems, allowing more users to develop algorithms for CPA systems.
The design of dual-mode complex signal processors based on quadratic modular number codes
NASA Astrophysics Data System (ADS)
Jenkins, W. K.; Krogmeier, J. V.
1987-04-01
It has been known for a long time that quadratic modular number codes admit an unusual representation of complex numbers which leads to complete decoupling of the real and imaginary channels, thereby simplifying complex multiplication and providing error isolation between the real and imaginary channels. This paper first presents a tutorial review of the theory behind the different types of complex modular rings (fields) that result from particular parameter selections, and then presents a theory for a 'dual-mode' complex signal processor based on the choice of augmented power-of-2 moduli. It is shown how a diminished-1 binary code, used by previous designers for the realization of Fermat number transforms, also leads to efficient realizations for dual-mode complex arithmetic for certain augmented power-of-2 moduli. Then a design is presented for a recursive complex filter based on a ROM/ACCUMULATOR architecture and realized in an augmented power-of-2 quadratic code, and a computer-generated example of a complex recursive filter is shown to illustrate the principles of the theory.
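The decoupling described above can be checked numerically. When the modulus p is a prime with p ≡ 1 (mod 4), there is an integer j with j² ≡ -1 (mod p), and a complex residue a + ib can be mapped to the pair (a + jb, a - jb); complex multiplication then becomes two independent modular multiplications. The sketch below uses p = 17 and j = 4 purely as a small illustrative choice; the function names are hypothetical, and the diminished-1 coding and ROM/ACCUMULATOR realization discussed in the paper are not modeled.

```python
P = 17   # illustrative prime modulus with P % 4 == 1
J = 4    # J * J % P == P - 1, so J acts as a square root of -1 modulo P

def to_qrns(a, b):
    """Map the complex residue a + ib to its two decoupled channels."""
    return ((a + J * b) % P, (a - J * b) % P)

def from_qrns(z1, z2):
    """Recover (a, b) from the two channels (requires Python 3.8+ for pow(x, -1, m))."""
    inv2 = pow(2, -1, P)         # modular inverse of 2
    inv2j = pow(2 * J, -1, P)    # modular inverse of 2j
    return ((z1 + z2) * inv2 % P, (z1 - z2) * inv2j % P)

def qrns_mul(x, y):
    """Complex multiplication as two independent modular multiplications."""
    return (x[0] * y[0] % P, x[1] * y[1] % P)

def complex_mul_mod(a, b, c, d):
    """Reference: (a + ib)(c + id) with real and imaginary parts reduced mod P."""
    return ((a * c - b * d) % P, (a * d + b * c) % P)
```

The reference product needs four multiplications and cross-terms between the real and imaginary parts, whereas the mapped form needs two multiplications with no interaction between channels, which is the source of the error isolation mentioned in the abstract.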
Electronic Structure Calculations and Adaptation Scheme in Multi-core Computing Environments
DOE Office of Scientific and Technical Information (OSTI.GOV)
Seshagiri, Lakshminarasimhan; Sosonkina, Masha; Zhang, Zhao
2009-05-20
Multi-core processing environments have become the norm in the generic computing environment and are being considered for adding an extra dimension to the execution of any application. The T2 Niagara processor is an unusual environment in that it consists of eight cores, each capable of running eight threads simultaneously. Applications like General Atomic and Molecular Electronic Structure (GAMESS), used for ab-initio molecular quantum chemistry calculations, can be good indicators of the performance of such machines and would be a guideline for both hardware designers and application programmers. In this paper we try to benchmark the GAMESS performance on a T2 Niagara processor for a couple of molecules. We also show the suitability of using a middleware based adaptation algorithm on GAMESS on such a multi-core environment.
NASA Astrophysics Data System (ADS)
Leamy, Michael J.; Springer, Adam C.
In this research we report a parallel implementation of a Cellular Automata-based simulation tool for computing elastodynamic response on complex, two-dimensional domains. Elastodynamic simulation using Cellular Automata (CA) has recently been presented as an alternative, inherently object-oriented technique for accurately and efficiently computing linear and nonlinear wave propagation in arbitrarily-shaped geometries. The local, autonomous nature of the method should lead to straightforward and efficient parallelization. We address this notion on symmetric multiprocessor (SMP) hardware using a Java-based object-oriented CA code implementing triangular state machines (i.e., automata) and the MPI bindings written in Java (MPJ Express). We use MPJ Express to reconfigure our existing CA code to distribute a domain's automata to cores present on a dual quad-core shared-memory system (eight total processors). We note that this message passing parallelization strategy is directly applicable to cluster computing, which will be the focus of follow-on research. Results on the shared memory platform indicate nearly-ideal, linear speed-up. We conclude that the CA-based elastodynamic simulator is easily configured to run in parallel, and yields excellent speed-up on SMP hardware.
FPGA Acceleration of the phylogenetic likelihood function for Bayesian MCMC inference methods.
Zierke, Stephanie; Bakos, Jason D
2010-04-12
Maximum Likelihood (ML)-based phylogenetic inference has become a popular method for estimating the evolutionary relationships among species based on genomic sequence data. This method is used in applications such as RAxML, GARLI, MrBayes, PAML, and PAUP. The Phylogenetic Likelihood Function (PLF) is an important kernel computation for this method. The PLF consists of a loop with no conditional behavior or dependencies between iterations. As such it contains a high potential for exploiting parallelism using micro-architectural techniques. In this paper, we describe a technique for mapping the PLF and supporting logic onto a Field Programmable Gate Array (FPGA)-based co-processor. By leveraging the FPGA's on-chip DSP modules and the high-bandwidth local memory attached to the FPGA, the resultant co-processor can accelerate ML-based methods and outperform state-of-the-art multi-core processors. We use the MrBayes 3 tool as a framework for designing our co-processor. For large datasets, we estimate that our accelerated MrBayes, if run on a current-generation FPGA, achieves a 10x speedup relative to software running on a state-of-the-art server-class microprocessor. The FPGA-based implementation achieves its performance by deeply pipelining the likelihood computations, performing multiple floating-point operations in parallel, and through a natural log approximation that is chosen specifically to leverage a deeply pipelined custom architecture. Heterogeneous computing, which combines general-purpose processors with special-purpose co-processors such as FPGAs and GPUs, is a promising approach for high-performance phylogeny inference as shown by the growing body of literature in this field. FPGAs in particular are well-suited for this task because of their low power consumption as compared to many-core processors and Graphics Processor Units (GPUs).
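The structure that makes the PLF attractive for pipelining, a loop over alignment sites with no branches and no dependence between iterations, can be seen in a minimal sketch of the standard conditional-likelihood update at one internal tree node. This is a generic textbook-style illustration, not the MrBayes kernel: the function name `plf_node` is hypothetical, and rate categories, likelihood rescaling and the log approximation mentioned in the abstract are omitted.

```python
def plf_node(left_cl, right_cl, p_left, p_right):
    """Conditional likelihoods at a parent node from its two children.

    left_cl, right_cl: per-site lists of conditional likelihoods, one value per
                       character state (e.g. 4 values for A, C, G, T).
    p_left, p_right:   transition probability matrices for the two branches,
                       indexed as p[parent_state][child_state].
    Returns a per-site list of parent conditional likelihoods.
    """
    n_states = len(p_left)
    parent_cl = []
    # Each site is independent of every other site: no conditionals and no
    # loop-carried dependencies, so iterations can be pipelined or parallelized.
    for l_site, r_site in zip(left_cl, right_cl):
        parent_cl.append([
            sum(p_left[s][x] * l_site[x] for x in range(n_states)) *
            sum(p_right[s][y] * r_site[y] for y in range(n_states))
            for s in range(n_states)
        ])
    return parent_cl
```

Every site costs the same fixed number of multiply-accumulate operations, which is the property a deeply pipelined FPGA datapath can exploit.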
Reconfigurable signal processor designs for advanced digital array radar systems
NASA Astrophysics Data System (ADS)
Suarez, Hernan; Zhang, Yan (Rockee); Yu, Xining
2017-05-01
The new challenges originating from Digital Array Radar (DAR) demand a new generation of reconfigurable backend processors in the system. The new FPGA devices can support much higher speed, more bandwidth and greater processing capabilities for the needs of the digital Line Replaceable Unit (LRU). This study focuses on using the latest Altera and Xilinx devices in an adaptive beamforming processor. The field reprogrammable RF devices from Analog Devices are used as analog front end transceivers. Different from other existing Software-Defined Radio transceivers on the market, this processor is designed for distributed adaptive beamforming in a networked environment. The following aspects of the novel radar processor will be presented: (1) a new system-on-chip architecture based on Altera's devices and an adaptive processing module, especially for adaptive beamforming and pulse compression; (2) successful implementation of generation 2 serial RapidIO data links on FPGA, which supports the VITA-49 radio packet format for large distributed DAR processing; (3) demonstration of the feasibility and capabilities of the processor in a Micro-TCA based, SRIO switching backplane to support multichannel beamforming in real-time; (4) application of this processor in ongoing radar system development projects, including OU's dual-polarized digital array radar, the planned new cylindrical array radars, and future airborne radars.
Fast 2D FWI on a multi and many-cores workstation.
NASA Astrophysics Data System (ADS)
Thierry, Philippe; Donno, Daniela; Noble, Mark
2014-05-01
Following the introduction of x86 co-processors (Xeon Phi) and the performance increase of standard 2-socket workstations using the latest 12-core E5-v2 x86-64 CPU, we present here an MPI + OpenMP implementation of an acoustic 2D FWI (full waveform inversion) code which simultaneously runs on the CPUs and on the co-processors installed in a workstation. The main advantage of running a 2D FWI on a workstation is to be able to quickly evaluate new features such as more complicated wave equations, new cost functions, finite-difference stencils or boundary conditions. Since the co-processor is made of 61 in-order x86 cores, each of them having up to 4 threads, this many-core can be seen as a shared memory SMP (symmetric multiprocessing) machine with its own IP address. Depending on the vendor, a single workstation can handle several co-processors, turning the workstation into a personal cluster under the desk. The original Fortran 90 CPU version of the 2D FWI code is just recompiled to get a Xeon Phi x86 binary. This multi and many-core configuration uses standard compilers and associated MPI as well as math libraries under Linux; therefore, the cost of code development remains constant, while computation time improves. We choose to implement the code with the so-called symmetric mode to fully use the capacity of the workstation, but we also evaluate the scalability of the code in native mode (i.e., running only on the co-processor) thanks to the Linux ssh and NFS capabilities. The usual care in optimization and SIMD vectorization is taken to ensure optimal performance, and to analyze the application performance and bottlenecks on both platforms. The 2D FWI implementation uses finite-difference time-domain forward modeling and a quasi-Newton (with L-BFGS algorithm) optimization scheme for the model parameters update. Parallelization is achieved through standard MPI shot gathers distribution and OpenMP for domain decomposition within the co-processor. Taking advantage of the 16 GB of memory available on the co-processor we are able to keep wavefields in memory to achieve the gradient computation by cross-correlation of forward and back-propagated wavefields needed by our time-domain FWI scheme, without heavy traffic on the I/O subsystem and PCIe bus. In this presentation we will also review some simple methodologies to determine performance expectation compared to real performance in order to estimate the optimization effort before starting any huge modification or rewriting of research codes. The key message is the ease of use and development of this hybrid configuration to reach not the absolute peak performance value but the optimal one that ensures the best balance between geophysical and computer developments.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Sancho Pitarch, Jose Carlos; Kerbyson, Darren; Lang, Mike
Increasing the core-count on current and future processors is posing critical challenges to the memory subsystem to efficiently handle concurrent memory requests. The current trend to cope with this challenge is to increase the number of memory channels available to the processor's memory controller. In this paper we investigate the effectiveness of this approach on the performance of parallel scientific applications. Specifically, we explore the trade-off between employing multiple memory channels per memory controller and the use of multiple memory controllers. Experiments conducted on two current state-of-the-art multicore processors, a 6-core AMD Istanbul and a 4-core Intel Nehalem-EP, for a wide range of production applications show that there is a diminishing return when increasing the number of memory channels per memory controller. In addition, we show that this performance degradation can be efficiently addressed by increasing the ratio of memory controllers to channels while keeping the number of memory channels constant. Significant performance improvements, up to 28%, can be achieved in this scheme when using two memory controllers, each with one channel, compared with one controller with two memory channels.
NASA Astrophysics Data System (ADS)
Geng, Ying; Li, Shenping; Li, Ming-Jun; Sutton, Clifford G.; McCollum, Robert L.; McClure, Randy L.; Koklyushkin, Alexander V.; Matthews, Karen I.; Luther, James P.; Butler, Douglas L.
2015-03-01
A complete single mode dual-core fiber system for short-reach optical interconnects is fabricated and tested for high-speed data transmission. It includes dual-core fibers capable of bi-directional data transmission, dual-core simplex LC connectors, and fan-outs. The transmission system offers simplified bi-directional traffic engineering with integrated bidirectional transceivers and compact system design, utilizing simplex dual-core LC connectors that use half the space while increasing the bandwidth density by a factor of two. The fiber has two cores that are compatible with single mode fiber and conforms to the industry standard outer diameter of 125 μm. This reduces operational complexity by reducing the size and number of fibers, cables and connectors. Measured OTDR loss for both cores was 0.34 dB/km at 1310 nm and 0.19 dB/km at 1550 nm. Crosstalk for a piece of 5.8 km long dual-core fiber was measured to be below -75 dB at 1310 nm, and below -40 dB at 1550 nm. Both free-space optics fan-outs and tapered-fiber-coupler based MCF fan-outs were evaluated for the transmission system. Error-free and penalty-free 25 Gb/s bi-directional transmission performance was demonstrated for three different fiber lengths, 200 m, 2 km and 10 km, using the complete all-fiber-based system including connectors and fan-outs. This single mode, dual-core fiber transmission system adds complementary value to systems where additional increases in bandwidth density can come from wavelength division multiplexing and multiple bits per symbol.
NASA Technical Reports Server (NTRS)
2006-01-01
Topics covered include: Measurement and Controls Data Acquisition System; IMU/GPS System Provides Position and Attitude Data; Using Artificial Intelligence to Inform Pilots of Weather; Fast Lossless Compression of Multispectral-Image Data; Developing Signal-Pattern-Recognition Programs; Implementing Access to Data Distributed on Many Processors; Compact, Efficient Drive Circuit for a Piezoelectric Pump; Dual Common Planes for Time Multiplexing of Dual-Color QWIPs; MMIC Power Amplifier Puts Out 40 mW From 75 to 110 GHz; 2D/3D Visual Tracker for Rover Mast; Adding Hierarchical Objects to Relational Database General-Purpose XML-Based Information Managements; Vaporizable Scaffolds for Fabricating Thermoelectric Modules; Producing Quantum Dots by Spray Pyrolysis; Mobile Robot for Exploring Cold Liquid/Solid Environments; System Would Acquire Core and Powder Samples of Rocks; Improved Fabrication of Lithium Films Having Micron Features; Manufacture of Regularly Shaped Sol-Gel Pellets; Regulating Glucose and pH, and Monitoring Oxygen in a Bioreactor; Satellite Multiangle Spectropolarimetric Imaging of Aerosols; Interferometric System for Measuring Thickness of Sea Ice; Microscale Regenerative Heat Exchanger; Protocols for Handling Messages Between Simulation Computers; Statistical Detection of Atypical Aircraft Flights; NASA's Aviation Safety and Modeling Project; Multimode-Guided-Wave Ultrasonic Scanning of Materials; Algorithms for Maneuvering Spacecraft Around Small Bodies; Improved Solar-Radiation-Pressure Models for GPS Satellites; Measuring Attitude of a Large, Flexible, Orbiting Structure.
Takano, Yu; Nakata, Kazuto; Yonezawa, Yasushige; Nakamura, Haruki
2016-05-05
A massively parallel program for quantum mechanical-molecular mechanical (QM/MM) molecular dynamics simulation, called Platypus (PLATform for dYnamic Protein Unified Simulation), was developed to elucidate protein functions. The speedup and the parallelization ratio of Platypus in the QM and QM/MM calculations were assessed for a bacteriochlorophyll dimer in the photosynthetic reaction center (DIMER) on the K computer, a massively parallel computer achieving 10 PetaFLOPs with 705,024 cores. Platypus exhibited the increase in speedup up to 20,000 core processors at the HF/cc-pVDZ and B3LYP/cc-pVDZ, and up to 10,000 core processors by the CASCI(16,16)/6-31G** calculations. We also performed excited QM/MM-MD simulations on the chromophore of Sirius (SIRIUS) in water. Sirius is a pH-insensitive and photo-stable ultramarine fluorescent protein. Platypus accelerated on-the-fly excited-state QM/MM-MD simulations for SIRIUS in water, using over 4000 core processors. In addition, it also succeeded in 50-ps (200,000-step) on-the-fly excited-state QM/MM-MD simulations for the SIRIUS in water. © 2016 The Authors. Journal of Computational Chemistry Published by Wiley Periodicals, Inc.
Impact of Azimuthally Controlled Fluidic Chevrons on Jet Noise
NASA Technical Reports Server (NTRS)
Henderson, Brenda S.; Norum, Thomas D.
2008-01-01
The impact of azimuthally controlled air injection on broadband shock noise and mixing noise for single and dual stream jets was investigated. The single stream experiments focused on noise reduction for low supersonic jet exhausts. Dual stream experiments included high subsonic core and fan conditions and supersonic fan conditions with transonic core conditions. For the dual stream experiments, air was injected into the core stream. Significant reductions in broadband shock noise were achieved in a single jet with an injection mass flow equal to 1.2% of the core mass flow. Injection near the pylon produced greater broadband shock noise reductions than injection at other locations around the nozzle periphery. Air injection into the core stream did not result in broadband shock noise reduction in dual stream jets. Fluidic injection resulted in some mixing noise reductions for both the single and dual stream jets. For subsonic fan and core conditions, the lowest noise levels were obtained when injecting on the side of the nozzle closest to the microphone axis.
Multiple Embedded Processors for Fault-Tolerant Computing
NASA Technical Reports Server (NTRS)
Bolotin, Gary; Watson, Robert; Katanyoutanant, Sunant; Burke, Gary; Wang, Mandy
2005-01-01
A fault-tolerant computer architecture has been conceived in an effort to reduce vulnerability to single-event upsets (spurious bit flips caused by impingement of energetic ionizing particles or photons). As in some prior fault-tolerant architectures, the redundancy needed for fault tolerance is obtained by use of multiple processors in one computer. Unlike prior architectures, the multiple processors are embedded in a single field-programmable gate array (FPGA). What makes this new approach practical is the recent commercial availability of FPGAs that are capable of having multiple embedded processors. A working prototype (see figure) consists of two embedded IBM PowerPC 405 processor cores and a comparator built on a Xilinx Virtex-II Pro FPGA. This relatively simple instantiation of the architecture implements an error-detection scheme. A planned future version, incorporating four processors and two comparators, would correct some errors in addition to detecting them.
Accelerating Climate Simulations Through Hybrid Computing
NASA Technical Reports Server (NTRS)
Zhou, Shujia; Sinno, Scott; Cruz, Carlos; Purcell, Mark
2009-01-01
Unconventional multi-core processors (e.g., IBM Cell B/E and NVIDIA GPU) have emerged as accelerators in climate simulation. However, climate models typically run on parallel computers with conventional processors (e.g., Intel and AMD) using MPI. Connecting accelerators to this architecture efficiently and easily becomes a critical issue. When using MPI for connection, we identified two challenges: (1) an identical MPI implementation is required in both systems, and (2) existing MPI code must be modified to accommodate the accelerators. In response, we have extended and deployed IBM Dynamic Application Virtualization (DAV) in a hybrid computing prototype system (one blade with two Intel quad-core processors and two IBM QS22 Cell blades, connected with InfiniBand), allowing for seamless offloading of compute-intensive functions to remote, heterogeneous accelerators in a scalable, load-balanced manner. Currently, a climate solar radiation model running with multiple MPI processes has been offloaded to multiple Cell blades with approximately 10% network overhead.
Using all of your CPU's in HIPE
NASA Astrophysics Data System (ADS)
Jacobson, J. D.; Fadda, D.
2012-09-01
Modern computer architectures increasingly feature multi-core CPUs. For example, the MacBook Pro features Intel quad-core i7 processors. Through the use of hyper-threading, where each core can execute two threads simultaneously, the quad-core i7 can support eight simultaneous processing threads. All this on your laptop! This CPU power can now be put into service by scientists to perform data reduction tasks, but only if the software has been designed to take advantage of multiple-processor architectures. Up to now, the software for Herschel data reduction (HIPE), written in Jython and Java, has been single-threaded and can utilize only a single processor. Users of HIPE do not get any advantage from the additional processors. Why not put all of the CPU resources to work reducing your data? We present a multi-threaded software application that corrects long-term transients in the signal from the PACS unchopped spectroscopy line scan mode. In this poster, we present a multi-threaded software framework to achieve performance improvements from parallel execution. We will show how a task to correct transients in the PACS Spectroscopy Pipeline for the unchopped line scan mode has been threaded. This computation-intensive task uses either a one-parameter or a three-parameter exponential function to characterize the transient. The task uses a Java implementation of Minpack, translated from the C (Moshier) and IDL (Markwardt) versions by the authors, to optimize the correction parameters. We also explain how to determine whether a task can benefit from threading (Amdahl's Law), and whether it is safe to thread. The design and implementation, using the completion service of the Java concurrency package, are described. Pitfalls, timing bugs, thread safety, resource control, testing and performance improvements are described and plotted.
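The Amdahl's Law check mentioned in this abstract can be illustrated with a short calculation. The following is a minimal sketch in Python rather than the authors' Java code; the function name and the 90% parallel fraction are hypothetical values chosen for illustration, not figures from the poster.

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Upper bound on speedup from Amdahl's Law.

    parallel_fraction: share of the single-threaded run time that can be
                       spread across threads (0.0 to 1.0).
    workers:           number of threads or cores used for that share.
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


if __name__ == "__main__":
    # Hypothetical task: 90% of its run time is parallelizable.
    for n in (1, 2, 4, 8):
        print(f"{n} threads: at most {amdahl_speedup(0.9, n):.2f}x faster")
```

Because the serial share is never divided by the thread count, a task with a large serial share gains little from eight hardware threads, which is one way to judge whether threading it is worth the effort.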
Computational multicore on two-layer 1D shallow water equations for erodible dambreak
NASA Astrophysics Data System (ADS)
Simanjuntak, C. A.; Bagustara, B. A. R. H.; Gunawan, P. H.
2018-03-01
The simulation of an erodible dambreak using the two-layer shallow water equations (SWE) and the SCHR scheme is elaborated in this paper. The results show that the two-layer SWE model is in good agreement with the experimental data obtained at the Université Catholique de Louvain, Louvain-la-Neuve. Moreover, results for the parallel algorithm on multicore architectures are given. They show that Computer I, with an Intel(R) Core(TM) i5-2500 quad-core CPU, has the best performance in accelerating the computational time. Moreover, Computer III, with an AMD A6-5200 quad-core APU, is observed to have higher speedup and efficiency. The speedup and efficiency of Computer III with 3200 grids are 3.716050530 times and 92.9%, respectively.
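The reported speedup and efficiency are consistent with the usual definitions, speedup = serial time / parallel time and efficiency = speedup / number of cores. The sketch below assumes those definitions and a core count of 4 (from "Quad-Core"); the abstract itself does not state the formulas, and the function names are illustrative.

```python
def speedup(serial_time: float, parallel_time: float) -> float:
    """Ratio of serial run time to parallel run time."""
    return serial_time / parallel_time


def efficiency(speedup_value: float, cores: int) -> float:
    """Fraction of ideal linear speedup achieved on `cores` cores."""
    return speedup_value / cores


if __name__ == "__main__":
    reported_speedup = 3.716050530  # value quoted in the abstract
    cores = 4                       # assumption: all four cores were used
    eff = efficiency(reported_speedup, cores)
    print(f"efficiency = {eff * 100:.1f}%")
```

Under these assumptions the quoted speedup reproduces the quoted efficiency of 92.9%.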
NASA Astrophysics Data System (ADS)
Liu, Fenglai; Kong, Jing
2018-07-01
Unique technical challenges and their solutions for implementing semi-numerical Hartree-Fock exchange on the Phi processor are discussed, especially concerning the single-instruction-multiple-data type of processing and the small cache size. Benchmark calculations on a series of buckyball molecules with various Gaussian basis sets on a Phi processor and a six-core CPU show that the Phi processor provides as much as a 12-fold speedup with large basis sets compared with the conventional four-center electron repulsion integration approach performed on the CPU. The accuracy of the semi-numerical scheme is also evaluated and found to be comparable to that of the resolution-of-identity approach.
Present Status and Extensions of the Monte Carlo Performance Benchmark
NASA Astrophysics Data System (ADS)
Hoogenboom, J. Eduard; Petrovic, Bojan; Martin, William R.
2014-06-01
The NEA Monte Carlo Performance benchmark started in 2011 aiming to monitor over the years the abilities to perform a full-size Monte Carlo reactor core calculation with a detailed power production for each fuel pin with axial distribution. This paper gives an overview of the contributed results thus far. It shows that reaching a statistical accuracy of 1 % for most of the small fuel zones requires about 100 billion neutron histories. The efficiency of parallel execution of Monte Carlo codes on a large number of processor cores shows clear limitations for computer clusters with common type computer nodes. However, using true supercomputers the speedup of parallel calculations is increasing up to large numbers of processor cores. More experience is needed from calculations on true supercomputers using large numbers of processors in order to predict if the requested calculations can be done in a short time. As the specifications of the reactor geometry for this benchmark test are well suited for further investigations of full-core Monte Carlo calculations and a need is felt for testing other issues than its computational performance, proposals are presented for extending the benchmark to a suite of benchmark problems for evaluating fission source convergence for a system with a high dominance ratio, for coupling with thermal-hydraulics calculations to evaluate the use of different temperatures and coolant densities and to study the correctness and effectiveness of burnup calculations. Moreover, other contemporary proposals for a full-core calculation with realistic geometry and material composition will be discussed.
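To see why history counts of this size arise, recall that the relative statistical error of a standard Monte Carlo tally typically falls as 1/sqrt(N) with the number of histories N. The sketch below assumes that scaling, which the abstract does not state explicitly, and uses its figure of about 100 billion histories for 1% accuracy as the reference point; the function name is hypothetical.

```python
def histories_for_target(ref_histories: float,
                         ref_error: float,
                         target_error: float) -> float:
    """Histories needed for `target_error`, assuming error ~ 1/sqrt(N).

    ref_histories: history count at which `ref_error` was observed.
    ref_error:     relative statistical error at the reference point.
    target_error:  desired relative statistical error.
    """
    if ref_error <= 0 or target_error <= 0:
        raise ValueError("errors must be positive")
    # error ~ 1/sqrt(N)  =>  N scales with (ref_error / target_error) ** 2
    return ref_histories * (ref_error / target_error) ** 2


if __name__ == "__main__":
    ref_n = 1e11  # about 100 billion histories for 1% (from the abstract)
    for target in (0.02, 0.01, 0.005):
        n = histories_for_target(ref_n, 0.01, target)
        print(f"target error {target:.1%}: about {n:.1e} histories")
```

Under this assumption, halving the error quadruples the required histories, which is why parallel efficiency on many cores matters for full-core calculations.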
Parallelized Kalman-Filter-Based Reconstruction of Particle Tracks on Many-Core Processors and GPUs
DOE Office of Scientific and Technical Information (OSTI.GOV)
Cerati, Giuseppe; Elmer, Peter; Krutelyov, Slava
2017-01-01
For over a decade now, physical and energy constraints have limited clock speed improvements in commodity microprocessors. Instead, chipmakers have been pushed into producing lower-power, multi-core processors such as Graphical Processing Units (GPU), ARM CPUs, and Intel MICs. Broad-based efforts from manufacturers and developers have been devoted to making these processors user-friendly enough to perform general computations. However, extracting performance from a larger number of cores, as well as specialized vector or SIMD units, requires special care in algorithm design and code optimization. One of the most computationally challenging problems in high-energy particle experiments is finding and fitting the charged-particle tracks during event reconstruction. This is expected to become by far the dominant problem at the High-Luminosity Large Hadron Collider (HL-LHC), for example. Today the most common track finding methods are those based on the Kalman filter. Experience with Kalman techniques on real tracking detector systems has shown that they are robust and provide high physics performance. This is why they are currently in use at the LHC, both in the trigger and offline. Previously we reported on the significant parallel speedups that resulted from our investigations to adapt Kalman filters to track fitting and track building on Intel Xeon and Xeon Phi. Here, we discuss our progress toward the understanding of these processors and the new developments to port the Kalman filter to NVIDIA GPUs.
Parallelized Kalman-Filter-Based Reconstruction of Particle Tracks on Many-Core Processors and GPUs
NASA Astrophysics Data System (ADS)
Cerati, Giuseppe; Elmer, Peter; Krutelyov, Slava; Lantz, Steven; Lefebvre, Matthieu; Masciovecchio, Mario; McDermott, Kevin; Riley, Daniel; Tadel, Matevž; Wittich, Peter; Würthwein, Frank; Yagil, Avi
2017-08-01
For over a decade now, physical and energy constraints have limited clock speed improvements in commodity microprocessors. Instead, chipmakers have been pushed into producing lower-power, multi-core processors such as Graphical Processing Units (GPU), ARM CPUs, and Intel MICs. Broad-based efforts from manufacturers and developers have been devoted to making these processors user-friendly enough to perform general computations. However, extracting performance from a larger number of cores, as well as specialized vector or SIMD units, requires special care in algorithm design and code optimization. One of the most computationally challenging problems in high-energy particle experiments is finding and fitting the charged-particle tracks during event reconstruction. This is expected to become by far the dominant problem at the High-Luminosity Large Hadron Collider (HL-LHC), for example. Today the most common track finding methods are those based on the Kalman filter. Experience with Kalman techniques on real tracking detector systems has shown that they are robust and provide high physics performance. This is why they are currently in use at the LHC, both in the trigger and offline. Previously we reported on the significant parallel speedups that resulted from our investigations to adapt Kalman filters to track fitting and track building on Intel Xeon and Xeon Phi. Here, we discuss our progress toward the understanding of these processors and the new developments to port the Kalman filter to NVIDIA GPUs.
Low latency messages on distributed memory multiprocessors
NASA Technical Reports Server (NTRS)
Rosing, Matthew; Saltz, Joel
1993-01-01
Many of the issues in developing an efficient interface for communication on distributed memory machines are described and a portable interface is proposed. Although the hardware component of message latency is less than one microsecond on many distributed memory machines, the software latency associated with sending and receiving typed messages is on the order of 50 microseconds. The reason for this imbalance is that the software interface does not match the hardware. By changing the interface to match the hardware more closely, applications with fine-grained communication can be put on these machines. Based on several tests that were run on the iPSC/860, an interface that will better match current distributed memory machines is proposed. The model used in the proposed interface consists of a computation processor and a communication processor on each node. Communication between these processors and other nodes in the system is done through a buffered network. Information that is transmitted is either data or procedures to be executed on the remote processor. The dual processor system is better suited to handling asynchronous communications efficiently than a single processor system. The ability to send either data or procedures provides flexibility for minimizing message latency, based on the type of communication being performed. The tests performed and the proposed interface are described.
Li, Na; Takagaki, Tomohiro; Sadr, Alireza; Waidyasekera, Kanchana; Ikeda, Masaomi; Chen, Jihua; Nikaido, Toru; Tagami, Junji
2011-12-01
To evaluate the microtensile bond strength (μTBS) and acid-base resistant zone (ABRZ) of two dual-curing core systems to dentin using four curing modes. Sixty-four caries-free human molars were randomly divided into two groups according to two dual-curing resin core systems: (1) Clearfil DC Core Automix; (2) Estelite Core Quick. For each core system, four different curing modes were applied to the adhesive and core resin: (1) dual-cured and dual-cured (DD); (2) chemically cured and dual-cured (CD); (3) dual-cured and chemically cured (DC); (4) chemically cured and chemically cured (CC). The specimens were sectioned into sticks (n = 20 for each group) for the microtensile bond test. μTBS data were analyzed using two-way ANOVA and the Dunnett T3 test. Failure patterns were examined with scanning electron microscopy (SEM) to determine the proportion of each mode. Dentin sandwiches were produced and subjected to an acid-base challenge. After argon-ion etching, the ultrastructure of ABRZ was observed using SEM. For Clearfil DC Core Automix, the μTBS values in MPa were as follows: DD: 29.1 ± 5.4, CD: 21.6 ± 5.6, DC: 17.9 ± 2.8, CC: 11.5 ± 3.2. For Estelite Core Quick, they were: DD: 48.9 ± 5.7, CD: 20.5 ± 4.7, DC: 41.4 ± 8.3, CC: 19.1 ± 6.0. The bond strength was affected by both material and curing mode, and the interaction of the two factors was significant (p < 0.001). Within both systems, there were significant differences among groups, and the DD group showed the highest μTBS (p < 0.05). ABRZ morphology was not affected by curing mode, but it was highly adhesive-material dependent. The curing mode of dual-curing core systems affects bond strength to dentin, but has no significant effect on the formation of ABRZ.
Direct access inter-process shared memory
Brightwell, Ronald B; Pedretti, Kevin; Hudson, Trammell B
2013-10-22
A technique for directly sharing physical memory between processes executing on processor cores is described. The technique includes loading a plurality of processes into the physical memory for execution on a corresponding plurality of processor cores sharing the physical memory. An address space is mapped to each of the processes by populating a first entry in a top level virtual address table for each of the processes. The address space of each of the processes is cross-mapped into each of the processes by populating one or more subsequent entries of the top level virtual address table with the first entry in the top level virtual address table from other processes.
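The cross-mapping step described above can be pictured with a toy model of the top-level virtual address tables. This is only a sketch of the idea in Python, not the patented implementation: the table size, the string placeholders standing in for page-table entries, and the order in which the other processes fill the subsequent slots are all assumptions made for illustration.

```python
from dataclasses import dataclass, field

TOP_LEVEL_ENTRIES = 8  # hypothetical table size for this toy model


@dataclass
class Process:
    pid: int
    # Slot 0 holds this process's own address space; the remaining slots
    # are filled with slot 0 of the other processes.
    top_level: list = field(
        default_factory=lambda: [None] * TOP_LEVEL_ENTRIES)


def map_own_address_space(proc: Process, entry: str) -> None:
    """Populate the first top-level entry with the process's own mapping."""
    proc.top_level[0] = entry


def cross_map(processes: list) -> None:
    """Copy every other process's first entry into subsequent slots."""
    if len(processes) > TOP_LEVEL_ENTRIES:
        raise ValueError("not enough top-level slots for all processes")
    for proc in processes:
        slot = 1
        for other in processes:
            if other is proc:
                continue
            proc.top_level[slot] = other.top_level[0]
            slot += 1


if __name__ == "__main__":
    procs = [Process(pid) for pid in range(3)]
    for p in procs:
        map_own_address_space(p, f"address-space-of-{p.pid}")
    cross_map(procs)
    for p in procs:
        print(p.pid, p.top_level[:3])
```

After the cross-mapping, each process can reach another process's memory through one of its own top-level slots, which is the sense in which the memory is shared directly.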
CMS Readiness for Multi-Core Workload Scheduling
DOE Office of Scientific and Technical Information (OSTI.GOV)
Perez-Calero Yzquierdo, A.; Balcas, J.; Hernandez, J.
In the present run of the LHC, CMS data reconstruction and simulation algorithms benefit greatly from being executed as multiple threads running on several processor cores. The complexity of the Run 2 events requires parallelization of the code to reduce the memory-per-core footprint constraining serial execution programs, thus optimizing the exploitation of present multi-core processor architectures. The allocation of computing resources for multi-core tasks, however, becomes a complex problem in itself. The CMS workload submission infrastructure employs multi-slot partitionable pilots, built on HTCondor and GlideinWMS native features, to enable scheduling of single and multi-core jobs simultaneously. This provides a solution for the scheduling problem in a uniform way across grid sites running a diversity of gateways to compute resources and batch system technologies. This paper presents this strategy and the tools on which it has been implemented. The experience of managing multi-core resources at the Tier-0 and Tier-1 sites during 2015, along with the deployment phase to Tier-2 sites during early 2016 is reported. The process of performance monitoring and optimization to achieve efficient and flexible use of the resources is also described.
CMS readiness for multi-core workload scheduling
NASA Astrophysics Data System (ADS)
Perez-Calero Yzquierdo, A.; Balcas, J.; Hernandez, J.; Aftab Khan, F.; Letts, J.; Mason, D.; Verguilov, V.
2017-10-01
In the present run of the LHC, CMS data reconstruction and simulation algorithms benefit greatly from being executed as multiple threads running on several processor cores. The complexity of the Run 2 events requires parallelization of the code to reduce the memory-per-core footprint constraining serial execution programs, thus optimizing the exploitation of present multi-core processor architectures. The allocation of computing resources for multi-core tasks, however, becomes a complex problem in itself. The CMS workload submission infrastructure employs multi-slot partitionable pilots, built on HTCondor and GlideinWMS native features, to enable scheduling of single and multi-core jobs simultaneously. This provides a solution for the scheduling problem in a uniform way across grid sites running a diversity of gateways to compute resources and batch system technologies. This paper presents this strategy and the tools on which it has been implemented. The experience of managing multi-core resources at the Tier-0 and Tier-1 sites during 2015, along with the deployment phase to Tier-2 sites during early 2016 is reported. The process of performance monitoring and optimization to achieve efficient and flexible use of the resources is also described.
VENTURE/PC manual: A multidimensional multigroup neutron diffusion code system
DOE Office of Scientific and Technical Information (OSTI.GOV)
Shapiro, A.; Huria, H.C.; Cho, K.W.
1991-12-01
VENTURE/PC is a recompilation of part of the Oak Ridge BOLD VENTURE code system, which will operate on an IBM PC or compatible computer. Neutron diffusion theory solutions are obtained for multidimensional, multigroup problems. This manual contains information associated with operating the code system. The purpose of the various modules used in the code system, and the input for these modules are discussed. The PC code structure is also given. Version 2 included several enhancements not given in the original version of the code. In particular, flux iterations can be done in core rather than by reading and writing to disk, for problems which allow sufficient memory for such in-core iterations. This speeds up the iteration process. Version 3 does not include any of the special processors used in the previous versions. These special processors utilized formatted input for various elements of the code system. All such input data is now entered through the Input Processor, which produces standard interface files for the various modules in the code system. In addition, a Standard Interface File Handbook is included in the documentation which is distributed with the code, to assist in developing the input for the Input Processor.
DFT algorithms for bit-serial GaAs array processor architectures
NASA Technical Reports Server (NTRS)
Mcmillan, Gary B.
1988-01-01
Systems and Processes Engineering Corporation (SPEC) has developed an innovative array processor architecture for computing Fourier transforms and other commonly used signal processing algorithms. This architecture is designed to extract the highest possible array performance from state-of-the-art GaAs technology. SPEC's architectural design includes a high performance RISC processor implemented in GaAs, along with a Floating Point Coprocessor and a unique Array Communications Coprocessor, also implemented in GaAs technology. Together, these data processors represent the latest in technology, both from an architectural and implementation viewpoint. SPEC has examined numerous algorithms and parallel processing architectures to determine the optimum array processor architecture. SPEC has developed an array processor architecture with integral communications ability to provide maximum node connectivity. The Array Communications Coprocessor embeds communications operations directly in the core of the processor architecture. A Floating Point Coprocessor architecture has been defined that utilizes Bit-Serial arithmetic units, operating at very high frequency, to perform floating point operations. These Bit-Serial devices reduce the device integration level and complexity to a level compatible with state-of-the-art GaAs device technology.
Parasail: SIMD C library for global, semi-global, and local pairwise sequence alignments
DOE Office of Scientific and Technical Information (OSTI.GOV)
Daily, Jeffrey A.
Sequence alignment algorithms are a key component of many bioinformatics applications. Though various fast Smith-Waterman local sequence alignment implementations have been developed for x86 CPUs, most are embedded into larger database search tools. In addition, fast implementations of Needleman-Wunsch global sequence alignment and its semi-global variants are not as widespread. This article presents the first software library for local, global, and semi-global pairwise intra-sequence alignments and improves the performance of previous intra-sequence implementations. As a result, a faster intra-sequence pairwise alignment implementation is described and benchmarked. Using a 375 residue query sequence a speed of 136 billion cell updates per second (GCUPS) was achieved on a dual Intel Xeon E5-2670 12-core processor system, the highest reported for an implementation based on Farrar’s ’striped’ approach. When using only a single thread, parasail was 1.7 times faster than Rognes’s SWIPE. For many score matrices, parasail is faster than BLAST. The software library is designed for 64 bit Linux, OS X, or Windows on processors with SSE2, SSE41, or AVX2. Source code is available from https://github.com/jeffdaily/parasail under the Battelle BSD-style license. In conclusion, applications that require optimal alignment scores could benefit from the improved performance. For the first time, SIMD global, semi-global, and local alignments are available in a stand-alone C library.
Parasail: SIMD C library for global, semi-global, and local pairwise sequence alignments
Daily, Jeffrey A.
2016-02-10
Sequence alignment algorithms are a key component of many bioinformatics applications. Though various fast Smith-Waterman local sequence alignment implementations have been developed for x86 CPUs, most are embedded into larger database search tools. In addition, fast implementations of Needleman-Wunsch global sequence alignment and its semi-global variants are not as widespread. This article presents the first software library for local, global, and semi-global pairwise intra-sequence alignments and improves the performance of previous intra-sequence implementations. As a result, a faster intra-sequence pairwise alignment implementation is described and benchmarked. Using a 375 residue query sequence a speed of 136 billion cell updates per second (GCUPS) was achieved on a dual Intel Xeon E5-2670 12-core processor system, the highest reported for an implementation based on Farrar’s ’striped’ approach. When using only a single thread, parasail was 1.7 times faster than Rognes’s SWIPE. For many score matrices, parasail is faster than BLAST. The software library is designed for 64 bit Linux, OS X, or Windows on processors with SSE2, SSE41, or AVX2. Source code is available from https://github.com/jeffdaily/parasail under the Battelle BSD-style license. In conclusion, applications that require optimal alignment scores could benefit from the improved performance. For the first time, SIMD global, semi-global, and local alignments are available in a stand-alone C library.
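For readers unfamiliar with the "cell updates" counted by the GCUPS figure, the sketch below shows a plain scalar Smith-Waterman local alignment score in Python. It is not parasail's API or its striped SIMD method; the linear gap penalty and the scoring values are assumptions chosen to keep the example short. Each inner-loop iteration is one cell update.

```python
def smith_waterman_score(a: str, b: str,
                         match: int = 2,
                         mismatch: int = -1,
                         gap: int = -1) -> int:
    """Best local alignment score of a and b (linear gap penalty).

    Keeps only two rows of the dynamic-programming table, since each
    cell depends on its left, upper, and upper-left neighbours.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,                            # start a new local alignment
                prev[j - 1] + substitution,   # align a[i-1] with b[j-1]
                prev[j] + gap,                # gap in b
                curr[j - 1] + gap,            # gap in a
            )
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    print(smith_waterman_score("GGACGTCC", "TTACGTAA"))
```

A comparison of a query of length m against a sequence of length n performs m × n such updates, so dividing the total updates by the run time gives the cell-updates-per-second rate used in the abstract.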
Improving energy efficiency of Embedded DRAM Caches for High-end Computing Systems
DOE Office of Scientific and Technical Information (OSTI.GOV)
Mittal, Sparsh; Vetter, Jeffrey S; Li, Dong
2014-01-01
With increasing system core-count, the size of last level cache (LLC) has increased and since SRAM consumes high leakage power, power consumption of LLCs is becoming a significant fraction of processor power consumption. To address this, researchers have used embedded DRAM (eDRAM) LLCs which consume low leakage power. However, eDRAM caches consume a significant amount of energy in the form of refresh energy. In this paper, we propose ESTEEM, an energy saving technique for embedded DRAM caches. ESTEEM uses dynamic cache reconfiguration to turn off a portion of the cache to save both leakage and refresh energy. It logically divides the cache sets into multiple modules and turns off a possibly different number of ways in each module. Microarchitectural simulations confirm that ESTEEM is effective in improving performance and energy efficiency and provides better results compared to a recently-proposed eDRAM cache energy saving technique, namely Refrint. For single and dual-core simulations, the average energy saving in the memory subsystem (LLC+main memory) on using ESTEEM is 25.8% and 32.6%, respectively, and the average weighted speedups are 1.09X and 1.22X, respectively. Additional experiments confirm that ESTEEM works well for a wide range of system parameters.
Production Level CFD Code Acceleration for Hybrid Many-Core Architectures
NASA Technical Reports Server (NTRS)
Duffy, Austen C.; Hammond, Dana P.; Nielsen, Eric J.
2012-01-01
In this work, a novel graphics processing unit (GPU) distributed sharing model for hybrid many-core architectures is introduced and employed in the acceleration of a production-level computational fluid dynamics (CFD) code. The latest generation graphics hardware allows multiple processor cores to simultaneously share a single GPU through concurrent kernel execution. This feature has allowed the NASA FUN3D code to be accelerated in parallel with up to four processor cores sharing a single GPU. For codes to scale and fully use resources on these and the next generation machines, codes will need to employ some type of GPU sharing model, as presented in this work. Findings include the effects of GPU sharing on overall performance. A discussion of the inherent challenges that parallel unstructured CFD codes face in accelerator-based computing environments is included, with considerations for future generation architectures. This work was completed by the author in August 2010, and reflects the analysis and results of the time.
New Dimensions in Microarchitecture Harnessing 3D Integration Technologies (BRIEFING CHARTS)
2007-03-06
Quad-core bandwidth and latency boundaries: general-purpose processor loads, latency limited vs. bandwidth limited. Processor load trade-off between I...delay. No = number of circuits at 1 V; do = circuit delay at 1 V. From “3D Integration” Special Topic Session, W. Haensch, ISSCC ’07, 2/07. DARPA MTS, March 6, 2007.
NASA Astrophysics Data System (ADS)
Leggett, C.; Binet, S.; Jackson, K.; Levinthal, D.; Tatarkhanov, M.; Yao, Y.
2011-12-01
Thermal limitations have forced CPU manufacturers to shift from simply increasing clock speeds to improve processor performance, to producing chip designs with multi- and many-core architectures. Further, the cores themselves can run multiple threads as a zero overhead context switch allowing low level resource sharing (Intel Hyperthreading). To maximize bandwidth and minimize memory latency, memory access has become non-uniform (NUMA). As manufacturers add more cores to each chip, a careful understanding of the underlying architecture is required in order to fully utilize the available resources. We present AthenaMP and the Atlas event loop manager, the driver of the simulation and reconstruction engines, which have been rewritten to make use of multiple cores, by means of event based parallelism, and final stage I/O synchronization. However, initial studies on 8 and 16 core Intel architectures have shown marked non-linearities as parallel process counts increase, with as much as 30% reductions in event throughput in some scenarios. Since the Intel Nehalem architecture (both Gainestown and Westmere) will be the most common choice for the next round of hardware procurements, an understanding of these scaling issues is essential. Using hardware based event counters and Intel's Performance Tuning Utility, we have studied the performance bottlenecks at the hardware level, and discovered optimization schemes to maximize processor throughput. We have also produced optimization mechanisms, common to all large experiments, that address the extreme nature of today's HEP code, which due to its size, places huge burdens on the memory infrastructure of today's processors.
Reconfigurable Very Long Instruction Word (VLIW) Processor
NASA Technical Reports Server (NTRS)
Velev, Miroslav N.
2015-01-01
Future NASA missions will depend on radiation-hardened, power-efficient processing systems-on-a-chip (SOCs) that consist of a range of processor cores custom tailored for space applications. Aries Design Automation, LLC, has developed a processing SOC that is optimized for software-defined radio (SDR) uses. The innovation implements the Institute of Electrical and Electronics Engineers (IEEE) RazorII voltage management technique, a microarchitectural mechanism that allows processor cores to self-monitor, self-analyze, and self-heal after timing errors, regardless of their cause (e.g., radiation; chip aging; variations in the voltage, frequency, temperature, or manufacturing process). This highly automated SOC can also execute legacy PowerPC 750 binary code instruction set architecture (ISA), which is used in the flight-control computers of many previous NASA space missions. In developing this innovation, Aries Design Automation has made significant contributions to the fields of formal verification of complex pipelined microprocessors and Boolean satisfiability (SAT) and has developed highly efficient electronic design automation tools that hold promise for future developments.
Fault-tolerant corrector/detector chip for high-speed data processing
Andaleon, David D.; Napolitano, Jr., Leonard M.; Redinbo, G. Robert; Shreeve, William O.
1994-01-01
An internally fault-tolerant data error detection and correction integrated circuit device (10) and a method of operating same. The device functions as a bidirectional data buffer between a 32-bit data processor and the remainder of a data processing system and provides a 32-bit datum with a relatively short eight bits of data-protecting parity. The 32 bits of data and eight bits of parity are partitioned into eight 4-bit nibbles and two 4-bit nibbles, respectively. For data flowing towards the processor the data and parity nibbles are checked in parallel and in a single operation employing a dual orthogonal basis technique. The dual orthogonal basis increases the efficiency of the implementation. Any one of ten (eight data, two parity) nibbles is correctable if erroneous, or two different erroneous nibbles are detectable. For data flowing away from the processor the appropriate parity nibble values are calculated and transmitted to the system along with the data. The device regenerates parity values for data flowing in either direction and compares regenerated to generated parity with a totally self-checking equality checker. As such, the device is self-validating and enabled to both detect and indicate an occurrence of an internal failure. A generalization of the device to protect 64-bit data with 16-bit parity to protect against byte-wide errors is also presented.
Fault-tolerant corrector/detector chip for high-speed data processing
Andaleon, D.D.; Napolitano, L.M. Jr.; Redinbo, G.R.; Shreeve, W.O.
1994-03-01
An internally fault-tolerant data error detection and correction integrated circuit device and a method of operating same is described. The device functions as a bidirectional data buffer between a 32-bit data processor and the remainder of a data processing system and provides a 32-bit datum with a relatively short eight bits of data-protecting parity. The 32 bits of data and eight bits of parity are partitioned into eight 4-bit nibbles and two 4-bit nibbles, respectively. For data flowing towards the processor the data and parity nibbles are checked in parallel and in a single operation employing a dual orthogonal basis technique. The dual orthogonal basis increases the efficiency of the implementation. Any one of ten (eight data, two parity) nibbles is correctable if erroneous, or two different erroneous nibbles are detectable. For data flowing away from the processor the appropriate parity nibble values are calculated and transmitted to the system along with the data. The device regenerates parity values for data flowing in either direction and compares regenerated to generated parity with a totally self-checking equality checker. As such, the device is self-validating and enabled to both detect and indicate an occurrence of an internal failure. A generalization of the device to protect 64-bit data with 16-bit parity to protect against byte-wide errors is also presented. 8 figures.
Theorem Proving in Intel Hardware Design
NASA Technical Reports Server (NTRS)
O'Leary, John
2009-01-01
For the past decade, a framework combining model checking (symbolic trajectory evaluation) and higher-order logic theorem proving has been in production use at Intel. Our tools and methodology have been used to formally verify execution cluster functionality (including floating-point operations) for a number of Intel products, including the Pentium® 4 and Core™ i7 processors. Hardware verification in 2009 is much more challenging than it was in 1999: today's CPU chip designs contain many processor cores and significant firmware content. This talk will attempt to distill the lessons learned over the past ten years, discuss how they apply to today's problems, and outline some future directions.
Beyond core count: a look at new mainstream computing platforms for HEP workloads
NASA Astrophysics Data System (ADS)
Szostek, P.; Nowak, A.; Bitzes, G.; Valsan, L.; Jarp, S.; Dotti, A.
2014-06-01
As Moore's Law continues to deliver more and more transistors, the mainstream processor industry is preparing to expand its investments in areas other than simple core count. These new interests include deep integration of on-chip components, advanced vector units, memory, cache and interconnect technologies. We examine these moving trends with parallelized and vectorized High Energy Physics workloads in mind. In particular, we report on practical experience resulting from experiments with scalable HEP benchmarks on the Intel "Ivy Bridge-EP" and "Haswell" processor families. In addition, we examine the benefits of the new "Haswell" microarchitecture and its impact on multiple facets of HEP software. Finally, we report on the power efficiency of new systems.
GPU: the biggest key processor for AI and parallel processing
NASA Astrophysics Data System (ADS)
Baji, Toru
2017-07-01
Two types of processors exist in the market. One is the conventional CPU and the other is the Graphics Processing Unit (GPU). A typical CPU is composed of 1 to 8 cores, while a GPU has thousands of cores. The CPU is good for sequential processing, while the GPU is good at accelerating software with heavily parallel execution. The GPU was initially dedicated to 3D graphics. However, from 2006, when GPUs started to use general-purpose cores, it was noticed that this architecture can be used as a general-purpose massively parallel processor. NVIDIA developed a software framework, Compute Unified Device Architecture (CUDA), that makes it possible to easily program the GPU for these applications. With CUDA, GPUs started to be used widely in workstations and supercomputers. Recently, two key technologies have been highlighted in the industry: Artificial Intelligence (AI) and autonomous driving cars. AI requires massively parallel operations to train many-layer neural networks. With CPUs alone, it was impossible to finish the training in a practical time. The latest multi-GPU system with P100 makes it possible to finish the training in a few hours. For autonomous driving cars, TOPS-class performance is required to implement perception, localization, and path planning processing, and again an SoC with an integrated GPU will play a key role there. In this paper, the evolution of the GPU, which is one of the biggest commercial devices requiring state-of-the-art fabrication technology, will be introduced. An overview of key GPU-demanding applications like the ones described above will also be introduced.
NASA Astrophysics Data System (ADS)
Curilla, L.; Astrauskas, I.; Pugzlys, A.; Stajanca, P.; Pysz, D.; Uherek, F.; Baltuska, A.; Bugar, I.
2018-05-01
We demonstrate ultrafast soliton-based nonlinear balancing of dual-core asymmetry in highly nonlinear photonic crystal fiber at sub-nanojoule pulse energy level. The effect of fiber asymmetry was studied experimentally by selective excitation and monitoring of individual fiber cores at different wavelengths between 1500 nm and 1800 nm. Higher energy transfer rate to non-excited core was observed in the case of fast core excitation due to nonlinear asymmetry balancing of temporal solitons, which was confirmed by the dedicated numerical simulations based on the coupled generalized nonlinear Schrödinger equations. Moreover, the simulation results correspond qualitatively with the experimentally acquired dependences of the output dual-core extinction ratio on excitation energy and wavelength. In the case of 1800 nm fast core excitation, narrow band spectral intensity switching between the output channels was registered with contrast of 23 dB. The switching was achieved by the change of the excitation pulse energy in sub-nanojoule region. The performed detailed analysis of the nonlinear balancing of dual-core asymmetry in solitonic propagation regime opens new perspectives for the development of ultrafast nonlinear all-optical switching devices.
NASA Technical Reports Server (NTRS)
Perez, Christopher E.; Berg, Melanie D.; Friendlich, Mark R.
2011-01-01
Motivation for this work is: (1) Accurately characterize digital signal processor (DSP) core single-event effect (SEE) behavior (2) Test DSP cores across a large frequency range and across various input conditions (3) Isolate SEE analysis to DSP cores alone (4) Interpret SEE analysis in terms of single-event upsets (SEUs) and single-event transients (SETs) (5) Provide flight missions with accurate estimate of DSP core error rates and error signatures.
NASA Astrophysics Data System (ADS)
Bellerby, Tim
2015-04-01
PM (Parallel Models) is a new parallel programming language specifically designed for writing environmental and geophysical models. The language is intended to enable implementers to concentrate on the science behind the model rather than the details of running on parallel hardware. At the same time PM leaves the programmer in control - all parallelisation is explicit and the parallel structure of any given program may be deduced directly from the code. This paper describes a PM implementation based on the Message Passing Interface (MPI) and Open Multi-Processing (OpenMP) standards, looking at issues involved with translating the PM parallelisation model to MPI/OpenMP protocols and considering performance in terms of the competing factors of finer-grained parallelisation and increased communication overhead. In order to maximise portability, the implementation stays within the MPI 1.3 standard as much as possible, with MPI-2 MPI-IO file handling the only significant exception. Moreover, it does not assume a thread-safe implementation of MPI. PM adopts a two-tier abstract representation of parallel hardware. A PM processor is a conceptual unit capable of efficiently executing a set of language tasks, with a complete parallel system consisting of an abstract N-dimensional array of such processors. PM processors may map to single cores executing tasks using cooperative multi-tasking, to multiple cores or even to separate processing nodes, efficiently sharing tasks using algorithms such as work stealing. While tasks may move between hardware elements within a PM processor, they may not move between processors without specific programmer intervention. Tasks are assigned to processors using a nested parallelism approach, building on ideas from Reyes et al. (2009). The main program owns all available processors. 
When the program enters a parallel statement then either processors are divided out among the newly generated tasks (number of new tasks < number of processors) or tasks are divided out among the available processors (number of tasks > number of processors). Nested parallel statements may further subdivide the processor set owned by a given task. Tasks or processors are distributed evenly by default, but uneven distributions are possible under programmer control. It is also possible to explicitly enable child tasks to migrate within the processor set owned by their parent task, reducing load unbalancing at the potential cost of increased inter-processor message traffic. PM incorporates some programming structures from the earlier MIST language presented at a previous EGU General Assembly, while adopting a significantly different underlying parallelisation model and type system. PM code is available at www.pm-lang.org under an unrestrictive MIT license. Reference Ruymán Reyes, Antonio J. Dorta, Francisco Almeida, Francisco de Sande, 2009. Automatic Hybrid MPI+OpenMP Code Generation with llc, Recent Advances in Parallel Virtual Machine and Message Passing Interface, Lecture Notes in Computer Science Volume 5759, 185-195
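The processor/task division rule described in the abstract above can be sketched as follows. This is written in Python rather than PM, and it is not taken from the PM implementation: the function names are hypothetical, only the default even distribution is modelled, and the case where the number of tasks equals the number of processors (not spelled out in the abstract) is assumed to give one processor per task.

```python
def split_evenly(n_items, n_bins):
    """Sizes of n_bins shares of n_items that differ by at most one."""
    base, extra = divmod(n_items, n_bins)
    return [base + (1 if i < extra else 0) for i in range(n_bins)]


def assign(processors, n_tasks):
    """Map each task index to the list of processors it runs on.

    Tasks <= processors: processors are divided among the tasks, so each
    task owns a disjoint subset (which a nested statement could subdivide).
    Tasks > processors: tasks are divided among the processors, so each
    task maps to exactly one processor.
    """
    mapping = {}
    if n_tasks <= len(processors):
        start = 0
        for task, size in enumerate(split_evenly(len(processors), n_tasks)):
            mapping[task] = processors[start:start + size]
            start += size
        return mapping
    task = 0
    for proc, size in zip(processors, split_evenly(n_tasks, len(processors))):
        for _ in range(size):
            mapping[task] = [proc]
            task += 1
    return mapping


print(assign([0, 1, 2, 3, 4], 2))  # two tasks share five processors
print(assign([0, 1], 5))           # five tasks share two processors
```

Applying `assign` again to the processor list owned by one task mirrors the nested subdivision that the abstract describes.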
Parallel Agent-Based Simulations on Clusters of GPUs and Multi-Core Processors
DOE Office of Scientific and Technical Information (OSTI.GOV)
Aaby, Brandon G; Perumalla, Kalyan S; Seal, Sudip K
2010-01-01
An effective latency-hiding mechanism is presented in the parallelization of agent-based model simulations (ABMS) with millions of agents. The mechanism is designed to accommodate the hierarchical organization as well as heterogeneity of current state-of-the-art parallel computing platforms. We use it to explore the computation vs. communication trade-off continuum available with the deep computational and memory hierarchies of extant platforms and present a novel analytical model of the tradeoff. We describe our implementation and report preliminary performance results on two distinct parallel platforms suitable for ABMS: CUDA threads on multiple, networked graphical processing units (GPUs), and pthreads on multi-core processors. Message Passing Interface (MPI) is used for inter-GPU as well as inter-socket communication on a cluster of multiple GPUs and multi-core processors. Results indicate the benefits of our latency-hiding scheme, delivering over 100-fold improvement in runtime for certain benchmark ABMS application scenarios with several million agents. This speed improvement is obtained on our system that is already two to three orders of magnitude faster on one GPU than an equivalent CPU-based execution in a popular simulator in Java. Thus, the overall execution of our current work is over four orders of magnitude faster when executed on multiple GPUs.
NASA Technical Reports Server (NTRS)
Irom, Farokh; Farmanesh, Farhad; Kouba, Coy K.
2006-01-01
SEU from heavy ions is measured for SOI PowerPC microprocessors. SEU results for the 0.13 micron PowerPC with a 1.1 V core voltage show an increase over the 1.3 V versions. This suggests that improvement in SEU for scaled devices may be reversed. In recent years there has been interest in the possible use of unhardened commercial microprocessors in space because of their superior performance compared to hardened processors. However, unhardened devices are susceptible to upset from radiation in space. More information is needed on how they respond to radiation before they can be used in space. Only a limited number of advanced microprocessors have been subjected to radiation tests, and those were designed with lower clock frequencies and higher internal core voltages than recent devices [1-6]. However, the trend for commercial silicon-on-insulator (SOI) microprocessors is to reduce feature size and internal core voltage and increase the clock frequency. Commercial microprocessors with the PowerPC architecture are now available that use partially depleted SOI processes with feature size of 90 nm and internal core voltage as low as 1.0 V and clock frequency in the GHz range. Previously, we reported SEU measurements for SOI commercial PowerPCs with feature size of 0.18 and 0.13 µm [7, 8]. The results showed an order of magnitude reduction in saturated cross section compared to CMOS bulk counterparts. This paper examines SEUs in advanced commercial SOI microprocessors, focusing on SEU sensitivity of D-Cache and hangs with feature size and internal core voltage. Results are presented for the Motorola SOI processor with feature sizes of 0.13 microns and internal core voltages of 1.3 and 1.1 V. These results are compared with results for the Motorola SOI processors with feature size of 0.18 microns and internal core voltage of 1.6 and 1.3 V.
VENTURE/PC manual: A multidimensional multigroup neutron diffusion code system. Version 3
DOE Office of Scientific and Technical Information (OSTI.GOV)
Shapiro, A.; Huria, H.C.; Cho, K.W.
1991-12-01
VENTURE/PC is a recompilation of part of the Oak Ridge BOLD VENTURE code system, which will operate on an IBM PC or compatible computer. Neutron diffusion theory solutions are obtained for multidimensional, multigroup problems. This manual contains information associated with operating the code system. The purpose of the various modules used in the code system, and the input for these modules are discussed. The PC code structure is also given. Version 2 included several enhancements not given in the original version of the code. In particular, flux iterations can be done in core rather than by reading and writing to disk, for problems which allow sufficient memory for such in-core iterations. This speeds up the iteration process. Version 3 does not include any of the special processors used in the previous versions. These special processors utilized formatted input for various elements of the code system. All such input data is now entered through the Input Processor, which produces standard interface files for the various modules in the code system. In addition, a Standard Interface File Handbook is included in the documentation which is distributed with the code, to assist in developing the input for the Input Processor.
Zhang, Zhen; Ma, Cheng; Zhu, Rong
2017-08-23
Artificial Neural Networks (ANNs), including Deep Neural Networks (DNNs), have become the state-of-the-art methods in machine learning and achieved amazing success in speech recognition, visual object recognition, and many other domains. There are several hardware platforms for developing accelerated implementation of ANN models. Since Field Programmable Gate Array (FPGA) architectures are flexible and can provide high performance per watt of power consumption, they have drawn a number of applications from scientists. In this paper, we propose an FPGA-based, granularity-variable neuromorphic processor (FBGVNP). The traits of FBGVNP can be summarized as granularity variability, scalability, integrated computing, and addressing ability: first, the number of neurons is variable rather than constant in one core; second, the multi-core network scale can be extended in various forms; third, the neuron addressing and computing processes are executed simultaneously. These make the processor more flexible and better suited for different applications. Moreover, a neural network-based controller is mapped to FBGVNP and applied in a multi-input, multi-output (MIMO), real-time, temperature-sensing and control system. Experiments validate the effectiveness of the neuromorphic processor. The FBGVNP provides a new scheme for building ANNs, which is flexible, highly energy-efficient, and can be applied in many areas.
Zhang, Zhen; Zhu, Rong
2017-01-01
Artificial Neural Networks (ANNs), including Deep Neural Networks (DNNs), have become the state-of-the-art methods in machine learning and achieved amazing success in speech recognition, visual object recognition, and many other domains. There are several hardware platforms for developing accelerated implementation of ANN models. Since Field Programmable Gate Array (FPGA) architectures are flexible and can provide high performance per watt of power consumption, they have drawn a number of applications from scientists. In this paper, we propose an FPGA-based, granularity-variable neuromorphic processor (FBGVNP). The traits of FBGVNP can be summarized as granularity variability, scalability, integrated computing, and addressing ability: first, the number of neurons is variable rather than constant in one core; second, the multi-core network scale can be extended in various forms; third, the neuron addressing and computing processes are executed simultaneously. These make the processor more flexible and better suited for different applications. Moreover, a neural network-based controller is mapped to FBGVNP and applied in a multi-input, multi-output (MIMO), real-time, temperature-sensing and control system. Experiments validate the effectiveness of the neuromorphic processor. The FBGVNP provides a new scheme for building ANNs, which is flexible, highly energy-efficient, and can be applied in many areas. PMID:28832522
Development of small scale cluster computer for numerical analysis
NASA Astrophysics Data System (ADS)
Zulkifli, N. H. N.; Sapit, A.; Mohammed, A. N.
2017-09-01
In this study, two personal computers were successfully networked together to form a small-scale cluster. Each processor involved is a multicore processor with four cores, giving the cluster eight processors in total. The cluster runs an Ubuntu 14.04 Linux environment with an MPI implementation (MPICH2). Two main tests were conducted on the cluster: a communication test and a performance test. The communication test was done to make sure that the computers can pass the required information without any problem, using a simple MPI Hello program written in C. In addition, a performance test was done to show that the cluster's calculation performance is much better than that of a single-CPU computer. In this performance test, the same code was run four times, using a single node, 2 processors, 4 processors, and 8 processors. The results show that the time required to solve the problem decreases as processors are added, and is halved when the number of processors is doubled. To conclude, we successfully developed a small-scale cluster computer from common hardware that is capable of higher computing power than a single-CPU processor, which can be beneficial for research that requires high computing power, especially numerical analysis such as finite element analysis, computational fluid dynamics, and computational physics analysis.
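The scaling claim in the abstract above (run time halves when the processor count doubles) corresponds to a parallel efficiency of 1. The sketch below shows how speedup and efficiency could be computed from measured wall times. The timings are hypothetical placeholders chosen to show ideal scaling; they are not results from the study, and the function name is an assumption.

```python
def scaling_table(timings):
    """Return (processors, speedup, efficiency) rows.

    timings maps processor count -> wall time in seconds and must
    include an entry for 1 processor, which serves as the baseline.
    """
    t1 = timings[1]
    rows = []
    for p in sorted(timings):
        speedup = t1 / timings[p]
        rows.append((p, speedup, speedup / p))
    return rows


# Hypothetical wall times (seconds) illustrating ideal halving; not measured data.
hypothetical = {1: 80.0, 2: 40.0, 4: 20.0, 8: 10.0}
for p, speedup, efficiency in scaling_table(hypothetical):
    print(f"{p} processors: speedup {speedup:.2f}, efficiency {efficiency:.2f}")
```

With real measurements, an efficiency noticeably below 1 at 8 processors would point to communication overhead between the two networked machines.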
NASA Astrophysics Data System (ADS)
Tanikawa, Ataru; Yoshikawa, Kohji; Nitadori, Keigo; Okamoto, Takashi
2013-02-01
We have developed a numerical software library for collisionless N-body simulations named "Phantom-GRAPE" which highly accelerates force calculations among particles by use of a new SIMD instruction set extension to the x86 architecture, Advanced Vector eXtensions (AVX), an enhanced version of the Streaming SIMD Extensions (SSE). In our library, not only Newton's forces, but also central forces with an arbitrary shape f(r), which has a finite cutoff radius r_cut (i.e. f(r)=0 at r>r_cut), can be quickly computed. In computing such central forces with an arbitrary force shape f(r), we refer to a pre-calculated look-up table. We also present a new scheme to create the look-up table whose binning is optimal to keep good accuracy in computing forces and whose size is small enough to avoid cache misses. Using an Intel Core i7-2600 processor, we measure the performance of our library for both Newton's forces and the arbitrarily shaped central forces. In the case of Newton's forces, we achieve 2×10^9 interactions per second with one processor core (or 75 GFLOPS if we count 38 operations per interaction), which is 20 times higher than the performance of an implementation without any explicit use of SIMD instructions, and 2 times higher than that with the SSE instructions. With four processor cores, we obtain the performance of 8×10^9 interactions per second (or 300 GFLOPS). In the case of the arbitrarily shaped central forces, we can calculate 1×10^9 and 4×10^9 interactions per second with one and four processor cores, respectively. The performance with one processor core is 6 times and 2 times higher than those of the implementations without any use of SIMD instructions and with the SSE instructions. These performances depend only weakly on the number of particles, irrespective of the force shape. This is in good contrast with the fact that the performance of force calculations accelerated by graphics processing units (GPUs) depends strongly on the number of particles. 
Substantially weak dependence of the performance on the number of particles is suitable to collisionless N-body simulations, since these simulations are usually performed with sophisticated N-body solvers such as Tree- and TreePM-methods combined with an individual timestep scheme. We conclude that collisionless N-body simulations accelerated with our library have significant advantage over those accelerated by GPUs, especially on massively parallel environments.
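The conversion from interaction rate to GFLOPS quoted in the abstract above is simple arithmetic: interactions per second times operations per interaction. A small sketch follows; the count of 38 operations per interaction comes from the abstract, while the function name is an assumption.

```python
OPS_PER_INTERACTION = 38  # operation count per interaction, as quoted in the abstract


def gflops(interactions_per_second, ops=OPS_PER_INTERACTION):
    """Convert an interaction rate into giga floating-point operations per second."""
    return interactions_per_second * ops / 1e9


print(gflops(2e9))  # one core, Newton's forces
print(gflops(8e9))  # four cores, Newton's forces
```

The computed values, 76 and 304 GFLOPS, are consistent with the rounded figures of 75 and 300 GFLOPS, given that the interaction rates themselves are quoted to one significant figure.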
The Transition to a Many-core World
NASA Astrophysics Data System (ADS)
Mattson, T. G.
2012-12-01
The need to increase performance within a fixed energy budget has pushed the computer industry to many core processors. This is grounded in the physics of computing and is not a trend that will just go away. It is hard to overestimate the profound impact of many-core processors on software developers. Virtually every facet of the software development process will need to change to adapt to these new processors. In this talk, we will look at many-core hardware and consider its evolution from a perspective grounded in the CPU. We will show that the number of cores will inevitably increase, but in addition, a quest to maximize performance per watt will push these cores to be heterogeneous. We will show that the inevitable result of these changes is a computing landscape where the distinction between the CPU and the GPU is blurred. We will then consider the much more pressing problem of software in a many core world. Writing software for heterogeneous many core processors is well beyond the ability of current programmers. One solution is to support a software development process where programmer teams are split into two distinct groups: a large group of domain-expert productivity programmers and much smaller team of computer-scientist efficiency programmers. The productivity programmers work in terms of high level frameworks to express the concurrency in their problems while avoiding any details for how that concurrency is exploited. The second group, the efficiency programmers, map applications expressed in terms of these frameworks onto the target many-core system. In other words, we can solve the many-core software problem by creating a software infrastructure that only requires a small subset of programmers to become master parallel programmers. This is different from the discredited dream of automatic parallelism. 
Note that productivity programmers still need to define the architecture of their software in a way that exposes the concurrency inherent in their problem. We submit that domain-expert programmers understand "what is concurrent". The parallel programming problem emerges from the complexity of "how that concurrency is utilized" on real hardware. The research described in this talk was carried out in collaboration with the ParLab at UC Berkeley. We use a design pattern language to define the high level frameworks exposed to domain-expert, productivity programmers. We then use tools from the SEJITS project (Selective embedded Just In time Specializers) to build the software transformation tool chains that turn these framework-oriented designs into highly efficient code. The final ingredient is a software platform to serve as a target for these tools. One such platform is the OpenCL industry standard for programming heterogeneous systems. We will briefly describe OpenCL and show how it provides a vendor-neutral software target for current and future many core systems: CPU-based, GPU-based, and heterogeneous combinations of the two.
NASA Astrophysics Data System (ADS)
Olson, Richard F.
2013-05-01
Rendering of point scatterer based radar scenes for millimeter wave (mmW) seeker tests in real-time hardware-in-the-loop (HWIL) scene generation requires efficient algorithms and vector-friendly computer architectures for complex signal synthesis. New processor technology from Intel implements an extended 256-bit vector SIMD instruction set (AVX, AVX2) in a multi-core CPU design providing peak execution rates of hundreds of GigaFLOPS (GFLOPS) on one chip. Real world mmW scene generation code can approach peak SIMD execution rates only after careful algorithm and source code design. An effective software design will maintain high computing intensity emphasizing register-to-register SIMD arithmetic operations over data movement between CPU caches or off-chip memories. Engineers at the U.S. Army Aviation and Missile Research, Development and Engineering Center (AMRDEC) applied two basic parallel coding methods to assess new 256-bit SIMD multi-core architectures for mmW scene generation in HWIL. These include use of POSIX threads built on vector library functions and more portable, high-level parallel code based on compiler technology (e.g. OpenMP pragmas and SIMD autovectorization). Since CPU technology is rapidly advancing toward high processor core counts and TeraFLOPS peak SIMD execution rates, it is imperative that coding methods be identified which produce efficient and maintainable parallel code. This paper describes the algorithms used in point scatterer target model rendering, the parallelization of those algorithms, and the execution performance achieved on an AVX multi-core machine using the two basic parallel coding methods. The paper concludes with estimates for scale-up performance on upcoming multi-core technology.
NASA Astrophysics Data System (ADS)
Tanikawa, Ataru; Yoshikawa, Kohji; Okamoto, Takashi; Nitadori, Keigo
2012-02-01
We present a high-performance N-body code for self-gravitating collisional systems accelerated with the aid of a new SIMD instruction set extension of the x86 architecture: Advanced Vector eXtensions (AVX), an enhanced version of the Streaming SIMD Extensions (SSE). With one processor core of Intel Core i7-2600 processor (8 MB cache and 3.40 GHz) based on Sandy Bridge micro-architecture, we implemented a fourth-order Hermite scheme with individual timestep scheme (Makino and Aarseth, 1992), and achieved the performance of ~20 giga floating point number operations per second (GFLOPS) for double-precision accuracy, which is two times and five times higher than that of the previously developed code implemented with the SSE instructions (Nitadori et al., 2006b), and that of a code implemented without any explicit use of SIMD instructions with the same processor core, respectively. We have parallelized the code by using so-called NINJA scheme (Nitadori et al., 2006a), and achieved ~90 GFLOPS for a system containing more than N = 8192 particles with 8 MPI processes on four cores. We expect to achieve about 10 tera FLOPS (TFLOPS) for a self-gravitating collisional system with N ~ 10^5 on massively parallel systems with at most 800 cores with Sandy Bridge micro-architecture. This performance will be comparable to that of Graphic Processing Unit (GPU) cluster systems, such as the one with about 200 Tesla C1070 GPUs (Spurzem et al., 2010). This paper offers an alternative to collisional N-body simulations with GRAPEs and GPUs.
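The throughput figures quoted in the abstract above can be related to each other with a little arithmetic. The sketch below is only a reading aid: the function name and dictionary keys are hypothetical, and every input is an approximate value taken from the abstract (the "~" figures), so the outputs are rough estimates rather than measurements.

```python
def nbody_throughput_estimates():
    """Derive rough per-core and baseline figures from the quoted numbers."""
    avx_single_core = 20.0      # ~GFLOPS, one core with AVX (from the abstract)
    parallel_total = 90.0       # ~GFLOPS, 8 MPI processes on four cores
    parallel_cores = 4
    projected_total = 10_000.0  # ~10 TFLOPS expressed in GFLOPS
    projected_cores = 800       # "at most 800 cores"

    return {
        # AVX is stated to be two times faster than the SSE code.
        "sse_single_core": avx_single_core / 2,
        # AVX is stated to be five times faster than the non-SIMD code.
        "scalar_single_core": avx_single_core / 5,
        # Average throughput per core in the four-core run.
        "parallel_per_core": parallel_total / parallel_cores,
        # Per-core throughput implied by the projection if all 800 cores are used.
        "projected_per_core": projected_total / projected_cores,
    }


if __name__ == "__main__":
    for name, gflops in nbody_throughput_estimates().items():
        print(f"{name}: ~{gflops:g} GFLOPS")
```

Because the projection says "at most 800 cores", the last figure is a lower bound on the per-core rate that the projection assumes.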
Guan, Xuewei; Hou, Likai; Ren, Yukun; Deng, Xiaokang; Lang, Qi; Jia, Yankai; Hu, Qingming; Tao, Ye; Liu, Jiangwei; Jiang, Hongyuan
2016-05-01
Droplet-based microfluidics has provided a means to generate multi-core double emulsions, which are versatile platforms for microreactors in materials science, synthetic biology, and chemical engineering. To provide new opportunities for double emulsion platforms, here, we report a glass capillary microfluidic approach to first fabricate osmolarity-responsive Water-in-Oil-in-Water (W/O/W) double emulsion containing two different inner droplets/cores and to then trigger the coalescence between the encapsulated droplets precisely. To achieve this, we independently control the swelling speed and size of each droplet in the dual-core double emulsion by controlling the osmotic pressure between the inner droplets and the collection solutions. When the inner two droplets in one W/O/W double emulsion swell to the same size and reach the instability of the oil film interface between the inner droplets, core-coalescence happens and this coalescence process can be controlled precisely. This microfluidic methodology enables the generation of highly monodisperse dual-core double emulsions and the osmolarity-controlled swelling behavior provides new stimuli to trigger the coalescence between the encapsulated droplets. Such swelling-caused core-coalescence behavior in dual-core double emulsion establishes a novel microreactor for nanoliter-scale reactions, which can protect reaction materials and products from being contaminated or released.
Guan, Xuewei; Hou, Likai; Ren, Yukun; Deng, Xiaokang; Lang, Qi; Jia, Yankai; Hu, Qingming; Tao, Ye; Liu, Jiangwei; Jiang, Hongyuan
2016-01-01
Droplet-based microfluidics has provided a means to generate multi-core double emulsions, which are versatile platforms for microreactors in materials science, synthetic biology, and chemical engineering. To provide new opportunities for double emulsion platforms, here, we report a glass capillary microfluidic approach to first fabricate osmolarity-responsive Water-in-Oil-in-Water (W/O/W) double emulsion containing two different inner droplets/cores and to then trigger the coalescence between the encapsulated droplets precisely. To achieve this, we independently control the swelling speed and size of each droplet in the dual-core double emulsion by controlling the osmotic pressure between the inner droplets and the collection solutions. When the inner two droplets in one W/O/W double emulsion swell to the same size and reach the instability of the oil film interface between the inner droplets, core-coalescence happens and this coalescence process can be controlled precisely. This microfluidic methodology enables the generation of highly monodisperse dual-core double emulsions and the osmolarity-controlled swelling behavior provides new stimuli to trigger the coalescence between the encapsulated droplets. Such swelling-caused core-coalescence behavior in dual-core double emulsion establishes a novel microreactor for nanoliter-scale reactions, which can protect reaction materials and products from being contaminated or released. PMID:27279935
Modeling Large Scale Circuits Using Massively Parallel Discrete-Event Simulation
2013-06-01
exascale levels of performance, the smallest elements of a single processor can greatly affect the entire computer system (e.g. its power consumption...grow to exascale levels of performance, the smallest elements of a single processor can greatly affect the entire computer system (e.g. its power...Warp Speed 10.0. 2.0 INTRODUCTION As supercomputer systems approach exascale, the core count will exceed 1024 and number of transistors used in
PDSparc: A Drop-In Replacement for LEON3 Written Using Synopsys Processor Designer
2015-09-24
Kate Thurmer MIT Lincoln Laboratory, Lexington, MA, USA Distribution A: Public Release ABSTRACT Microprocessors are the...enabled appliances has opened a significant new niche: the Application Specific Standard Product (ASSP) microprocessor. These processors usually start...out as soft-cores that are parameterized at design time to realize exclusively the specific needs of the application. The microprocessor is a small
DOE Office of Scientific and Technical Information (OSTI.GOV)
Wickstrom, Gregory Lloyd; Gale, Jason Carl; Ma, Kwok Kee
The Sandia Secure Processor (SSP) is a new native Java processor that has been specifically designed for embedded applications. The SSP's design is a system composed of a core Java processor that directly executes Java bytecodes, on-chip intelligent IO modules, and a suite of software tools for simulation and compiling executable binary files. The SSP is unique in that it provides a way to control real-time IO modules for embedded applications. The system software for the SSP is a 'class loader' that takes Java .class files (created with your favorite Java compiler), links them together, and compiles a binary. The complete SSP system provides very powerful functionality with very light hardware requirements with the potential to be used in a wide variety of small-system embedded applications. This paper gives a detailed description of the Sandia Secure Processor and its unique features.
NASA Astrophysics Data System (ADS)
Wang, Yazhou; Zhang, Yiqiong; Wang, Bochu; Cao, Yang; Yu, Qingsong; Yin, Tieying
2013-06-01
The study aimed at constructing a novel drug delivery system for programmable multiple drug release controlled with core-shell structure. The core-shell structure consisted of chitosan nanoparticles as core and polyvinylpyrrolidone micro/nanocoating as shell to form core-shell micro/nanoparticles, which was fabricated by ionic gelation and emulsion electrospray methods. As model drug agents, Naproxen and rhodamine B were encapsulated in the core and shell regions, respectively. The core-shell micro/nanoparticles thus fabricated were characterized and confirmed by scanning electron microscope, transmission electron microscope, and fluorescence optical microscope. The core-shell micro/nanoparticles showed good release controllability through drug release experiment in vitro. It was noted that a programmable release pattern for dual drug agents was also achieved by adjusting their loading regions in the core-shell structures. The results indicate that emulsion electrospraying technology is a promising approach in fabrication of core-shell micro/nanoparticles for programmable dual drug release. Such a novel multi-drug delivery system has a potential application for the clinical treatment of cancer, tuberculosis, and tissue engineering.
A Real-Time Optical 3D Tracker for Head-Mounted Display Systems
1990-03-01
paper. OPTOTRAK [Nor88] uses one camera with two dual-axis CCD infrared position sensors. Each position sensor has a dedicated processor board to...enhance the usefulness of head-mounted display systems. [Nor88] Northern Digital. Trade literature on Optotrak - Northern Digital’s Three Dimensional
Evaluation of an Adaptive Automation Trigger Based on Task Performance, Priority, and Frequency
2013-06-01
with dual Intel® Xeon® CPU x5550 processors @ 2.67 GHz each, 12.0 GB RAM, and a 1.5 GB PCIe nVidia Quadro FX 4800 graphics card (Microsoft...Cole Publishing Company. Miller, C. A., & Parasuraman, R. (2007). Designing for flexible interaction between humans and automation: Delegation
Solving Coupled Gross-Pitaevskii Equations on a Cluster of PlayStation 3 Computers
NASA Astrophysics Data System (ADS)
Edwards, Mark; Heward, Jeffrey; Clark, C. W.
2009-05-01
At Georgia Southern University we have constructed an 8+1-node cluster of Sony PlayStation 3 (PS3) computers with the intention of using this computing resource to solve problems related to the behavior of ultra-cold atoms in general, with a particular emphasis on studying Bose-Bose and Bose-Fermi mixtures confined in optical lattices. As a first project that uses this computing resource, we have implemented a parallel solver of the coupled time-dependent, one-dimensional Gross-Pitaevskii (TDGP) equations. These equations govern the behavior of dual-species bosonic mixtures. We chose the split-operator/FFT method to solve the coupled 1D TDGP equations. The fast Fourier transform component of this solver can be readily parallelized on the PS3 CPU, known as the Cell Broadband Engine (CellBE). Each CellBE chip contains a single 64-bit PowerPC Processor Element, known as the PPE, and eight "Synergistic Processor Elements", known as SPEs. We report on this algorithm and compare its performance to a non-parallel solver as applied to modeling evaporative cooling in dual-species bosonic mixtures.
A dual-processor multi-frequency implementation of the FINDS algorithm
NASA Technical Reports Server (NTRS)
Godiwala, Pankaj M.; Caglayan, Alper K.
1987-01-01
This report presents a parallel processing implementation of the FINDS (Fault Inferring Nonlinear Detection System) algorithm on a dual processor configured target flight computer. First, a filter initialization scheme is presented which allows the no-fail filter (NFF) states to be initialized using the first iteration of the flight data. A modified failure isolation strategy, compatible with the new failure detection strategy reported earlier, is discussed and the performance of the new FDI algorithm is analyzed using flight recorded data from the NASA ATOPS B-737 aircraft in a Microwave Landing System (MLS) environment. The results show that low level MLS, IMU, and IAS sensor failures are detected and isolated instantaneously, while accelerometer and rate gyro failures continue to take comparatively longer to detect and isolate. The parallel implementation is accomplished by partitioning the FINDS algorithm into two parts: one based on the translational dynamics and the other based on the rotational kinematics. Finally, a multi-rate implementation of the algorithm is presented yielding significantly low execution times with acceptable estimation and FDI performance.
NASA Astrophysics Data System (ADS)
Pillans, Luke; Harmer, Jack; Edwards, Tim; Richardson, Lee
2016-05-01
Geolocation is the process of calculating a target position based on bearing and range relative to the known location of the observer. A high performance thermal imager with integrated geolocation functions is a powerful long range targeting device. Firefly is a software defined camera core incorporating a system-on-a-chip processor running the Android™ operating system. The processor has a range of industry standard serial interfaces which were used to interface to peripheral devices including a laser rangefinder and a digital magnetic compass. The core has a built-in Global Positioning System (GPS) receiver, which provides the third variable required for geolocation. The graphical capability of Firefly allowed flexibility in the design of the man-machine interface (MMI), so the finished system can give access to extensive functionality without appearing cumbersome or over-complicated to the user. This paper covers both the hardware and software design of the system, including how the camera core influenced the selection of peripheral hardware, and the MMI design process which incorporated user feedback at various stages.
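The geolocation step described in the abstract above (observer position plus bearing and range gives a target position) can be illustrated with a short calculation. This is a minimal sketch, not the Firefly implementation: it assumes a spherical Earth and a true (not magnetic) bearing, and ignores elevation angle. The function name `geolocate` and the constant `EARTH_RADIUS_M` are hypothetical names chosen for illustration.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius; spherical-Earth assumption


def geolocate(lat_deg, lon_deg, bearing_deg, range_m):
    """Return (lat, lon) in degrees of a target seen from an observer.

    lat_deg, lon_deg: observer position (e.g. from GPS)
    bearing_deg:      true bearing to the target, clockwise from north
                      (e.g. from a compass, after declination correction)
    range_m:          distance to the target (e.g. from a laser rangefinder)
    """
    lat1 = math.radians(lat_deg)
    lon1 = math.radians(lon_deg)
    theta = math.radians(bearing_deg)
    delta = range_m / EARTH_RADIUS_M  # angular distance along the great circle

    # Standard great-circle "destination point" formulas.
    lat2 = math.asin(
        math.sin(lat1) * math.cos(delta)
        + math.cos(lat1) * math.sin(delta) * math.cos(theta)
    )
    lon2 = lon1 + math.atan2(
        math.sin(theta) * math.sin(delta) * math.cos(lat1),
        math.cos(delta) - math.sin(lat1) * math.sin(lat2),
    )

    # Normalize longitude to the interval [-180, 180).
    lon2_deg = (math.degrees(lon2) + 180.0) % 360.0 - 180.0
    return math.degrees(lat2), lon2_deg


if __name__ == "__main__":
    # Observer at 51.5 N, 0.1 W; target 5 km away on a bearing of 45 degrees.
    print(geolocate(51.5, -0.1, 45.0, 5_000.0))
```

A fielded system would more likely use an ellipsoidal Earth model and account for sensor errors; the spherical formula is used here only because it is compact.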
NASA Astrophysics Data System (ADS)
Abeywickrama, Sandu; Furdek, Marija; Monti, Paolo; Wosinska, Lena; Wong, Elaine
2016-12-01
Core network survivability affects the reliability performance of telecommunication networks and remains one of the most important network design considerations. This paper critically examines the benefits arising from utilizing dual-homing in the optical access networks to provide resource-efficient protection against link and node failures in the optical core segment. Four novel, heuristic-based RWA algorithms that provide dedicated path protection in networks with dual-homing are proposed and studied. These algorithms protect against different failure scenarios (i.e. single link or node failures) and are implemented with different optimization objectives (i.e., minimization of wavelength usage and path length). Results obtained through simulations and comparison with baseline architectures indicate that exploiting dual-homed architecture in the access segment can bring significant improvements in terms of core network resource usage, connection availability, and power consumption.
Katherine: Ethernet Embedded Readout Interface for Timepix3
NASA Astrophysics Data System (ADS)
Burian, P.; Broulím, P.; Jára, M.; Georgiev, V.; Bergmann, B.
2017-11-01
The Timepix3, the latest generation of hybrid particle pixel detectors of the Medipix family, yields a lot of new possibilities, i.e. a high hit-rate, a time resolution of 1.56 ns, event data-driven readout mode, and the capability of measuring the Time-over-Threshold (ToT - energy) and the Time-of-Arrival (ToA) simultaneously. This paper introduces a newly developed readout device for the Timepix3, called "Katherine", featuring a Gigabit Ethernet interface. The primary benefit of the Katherine is the operation of Timepix3 at long distance (up to 100 m) from computer or server, which is advantageous for the installation at beam lines, where the access is difficult or where radiation levels are too high for human interventions. The maximal hit-rate is limited by the bandwidth of the Ethernet connection (peer-to-peer connection; up to 16 Mhit/s). Since the Katherine interface is equipped with a processor of high computational power (ARM Cortex-A9 dual-core processor), it permits the use as a stand-alone (autonomous) radiation detector. The key features of the device are described in detail. These are the implemented high voltage power supply offering both polarities of bias voltage (up to ± 300 V), the automatic data sending to a server via SSH, the automatic compensation of ToA values from columns with shifted matrix clock, etc. Dedicated control software was developed, which can be used for the detector preparation (sensor equalization, the DACs dependency scan, and the THL scan) and measurement control. Measured energy spectra from photon fields are shown.
Faster Smith-Waterman database searches with inter-sequence SIMD parallelisation
2011-01-01
Background: The Smith-Waterman algorithm for local sequence alignment is more sensitive than heuristic methods for database searching, but also more time-consuming. The fastest approach to parallelisation with SIMD technology has previously been described by Farrar in 2007. The aim of this study was to explore whether further speed could be gained by other approaches to parallelisation. Results: A faster approach and implementation is described and benchmarked. In the new tool SWIPE, residues from sixteen different database sequences are compared in parallel to one query residue. Using a 375 residue query sequence a speed of 106 billion cell updates per second (GCUPS) was achieved on a dual Intel Xeon X5650 six-core processor system, which is over six times more rapid than software based on Farrar's 'striped' approach. SWIPE was about 2.5 times faster when the programs used only a single thread. For shorter queries, the increase in speed was larger. SWIPE was about twice as fast as BLAST when using the BLOSUM50 score matrix, while BLAST was about twice as fast as SWIPE for the BLOSUM62 matrix. The software is designed for 64 bit Linux on processors with SSSE3. Source code is available from http://dna.uio.no/swipe/ under the GNU Affero General Public License. Conclusions: Efficient parallelisation using SIMD on standard hardware makes it possible to run Smith-Waterman database searches more than six times faster than before. The approach described here could significantly widen the potential application of Smith-Waterman searches. Other applications that require optimal local alignment scores could also benefit from improved performance. PMID:21631914
Faster Smith-Waterman database searches with inter-sequence SIMD parallelisation.
Rognes, Torbjørn
2011-06-01
The Smith-Waterman algorithm for local sequence alignment is more sensitive than heuristic methods for database searching, but also more time-consuming. The fastest approach to parallelisation with SIMD technology has previously been described by Farrar in 2007. The aim of this study was to explore whether further speed could be gained by other approaches to parallelisation. A faster approach and implementation is described and benchmarked. In the new tool SWIPE, residues from sixteen different database sequences are compared in parallel to one query residue. Using a 375 residue query sequence a speed of 106 billion cell updates per second (GCUPS) was achieved on a dual Intel Xeon X5650 six-core processor system, which is over six times more rapid than software based on Farrar's 'striped' approach. SWIPE was about 2.5 times faster when the programs used only a single thread. For shorter queries, the increase in speed was larger. SWIPE was about twice as fast as BLAST when using the BLOSUM50 score matrix, while BLAST was about twice as fast as SWIPE for the BLOSUM62 matrix. The software is designed for 64 bit Linux on processors with SSSE3. Source code is available from http://dna.uio.no/swipe/ under the GNU Affero General Public License. Efficient parallelisation using SIMD on standard hardware makes it possible to run Smith-Waterman database searches more than six times faster than before. The approach described here could significantly widen the potential application of Smith-Waterman searches. Other applications that require optimal local alignment scores could also benefit from improved performance.
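For readers unfamiliar with the "cell updates" counted in the GCUPS figure above, the following is a minimal scalar sketch of the Smith-Waterman recurrence. It is not the SWIPE implementation: SWIPE is described as using SIMD across sixteen database sequences with score matrices such as BLOSUM50/62, whereas this sketch assumes a simple match/mismatch score and a linear gap penalty. The function name and default scores are hypothetical choices for illustration.

```python
def smith_waterman_score(query, target, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between two sequences.

    Each inner-loop iteration is one "cell update" of the dynamic
    programming matrix; only the previous row is kept in memory.
    """
    prev = [0] * (len(target) + 1)  # row for the previous query residue
    best = 0
    for q in query:
        curr = [0]  # column 0 is always 0 in local alignment
        for j, t in enumerate(target, start=1):
            diag = prev[j - 1] + (match if q == t else mismatch)
            up = prev[j] + gap        # gap in the target
            left = curr[j - 1] + gap  # gap in the query
            cell = max(0, diag, up, left)  # 0 restarts the local alignment
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


if __name__ == "__main__":
    print(smith_waterman_score("TTACGTTT", "GGACGGG"))
```

The inter-sequence idea in the abstract corresponds to evaluating this same cell update for many database sequences at once, one per SIMD lane, rather than vectorising within a single alignment matrix.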
A Future Accelerated Cognitive Distributed Hybrid Testbed for Big Data Science Analytics
NASA Astrophysics Data System (ADS)
Halem, M.; Prathapan, S.; Golpayegani, N.; Huang, Y.; Blattner, T.; Dorband, J. E.
2016-12-01
As increased sensor spectral data volumes from current and future Earth Observing satellites are assimilated into high-resolution climate models, intensive cognitive machine learning technologies are needed to data mine, extract and intercompare model outputs. It is clear today that the next generation of computers and storage, beyond petascale cluster architectures, will be data centric. They will manage data movement and process data in place. Future cluster nodes have been announced that integrate multiple CPUs with high-speed links to GPUs and MICs on their backplanes with massive non-volatile RAM and access to active flash RAM disk storage. Active Ethernet connected key value store disk storage drives with 10 GbE or higher are now available through the Kinetic Open Storage Alliance. At the UMBC Center for Hybrid Multicore Productivity Research, a future state-of-the-art Accelerated Cognitive Computer System (ACCS) for Big Data science is being integrated into the current IBM iDataplex computational system `bluewave'. Based on the next gen IBM 200 PF Sierra processor, an interim two node IBM Power S822 testbed is being integrated with dual Power 8 processors with 10 cores, 1 TB RAM, a PCIe to a K80 GPU and an FPGA Coherent Accelerated Processor Interface card to 20 TB flash RAM. This system is to be updated to the Power 8+, an NVLink 1.0 with the Pascal GPU late in 2016. Moreover, the Seagate 96 TB Kinetic Disk system with 24 Ethernet connected active disks is integrated into the ACCS storage system. A Lightweight Virtual File System developed at the NASA GSFC is installed on bluewave. Since remote access to publicly available quantum annealing computers is available at several government labs, the ACCS will offer an in-line Restricted Boltzmann Machine optimization capability to the D-Wave 2X quantum annealing processor over the campus high speed 100 Gb network to Internet 2 for large files.
As an evaluation test of the cognitive functionality of the architecture, the following studies utilizing all the system components will be presented: (i) a near-real-time climate change study generating CO2 fluxes, (ii) a deep dive capability into an 8000 x 8000 pixel image pyramid display, and (iii) large dense and sparse eigenvalue decomposition.
Rubus: A compiler for seamless and extensible parallelism.
Adnan, Muhammad; Aslam, Faisal; Nawaz, Zubair; Sarwar, Syed Mansoor
2017-01-01
Nowadays, a typical processor may have multiple processing cores on a single chip. Furthermore, a special purpose processing unit called the Graphics Processing Unit (GPU), originally designed for 2D/3D games, is now available for general purpose use in computers and mobile devices. However, traditional programming languages, which were designed for machines with single-core CPUs, cannot efficiently utilize the parallelism available on multi-core processors. Therefore, to exploit the extraordinary processing power of multi-core processors, researchers are working on new tools and techniques to facilitate parallel programming. To this end, languages like CUDA and OpenCL have been introduced, which can be used to write code with parallelism. The main shortcoming of these languages is that the programmer needs to specify all the complex details manually in order to parallelize the code across multiple cores. Therefore, the code written in these languages is difficult to understand, debug and maintain. Furthermore, parallelizing legacy code can require rewriting a significant portion of the code in CUDA or OpenCL, which can consume significant time and resources. Thus, the amount of parallelism achieved is proportional to the skills of the programmer and the time spent in code optimizations. This paper proposes a new open source compiler, Rubus, to achieve seamless parallelism. The Rubus compiler relieves the programmer from manually specifying the low-level details. It analyses and transforms a sequential program into a parallel program automatically, without any user intervention. This achieves massive speedup and better utilization of the underlying hardware without a programmer's expertise in parallel programming. For five different benchmarks, an average speedup of 34.54 times has been achieved by Rubus as compared to Java on a basic GPU having only 96 cores.
For a matrix multiplication benchmark, an average execution speedup of 84 times has been achieved by Rubus on the same GPU. Moreover, Rubus achieves this performance without drastically increasing the memory footprint of a program.
Rubus: A compiler for seamless and extensible parallelism
Adnan, Muhammad; Aslam, Faisal; Sarwar, Syed Mansoor
2017-01-01
Nowadays, a typical processor may have multiple processing cores on a single chip. Furthermore, a special purpose processing unit called the Graphics Processing Unit (GPU), originally designed for 2D/3D games, is now available for general purpose use in computers and mobile devices. However, traditional programming languages, which were designed for machines with single-core CPUs, cannot efficiently utilize the parallelism available on multi-core processors. Therefore, to exploit the extraordinary processing power of multi-core processors, researchers are working on new tools and techniques to facilitate parallel programming. To this end, languages like CUDA and OpenCL have been introduced, which can be used to write code with parallelism. The main shortcoming of these languages is that the programmer needs to specify all the complex details manually in order to parallelize the code across multiple cores. Therefore, the code written in these languages is difficult to understand, debug and maintain. Furthermore, parallelizing legacy code can require rewriting a significant portion of the code in CUDA or OpenCL, which can consume significant time and resources. Thus, the amount of parallelism achieved is proportional to the skills of the programmer and the time spent in code optimizations. This paper proposes a new open source compiler, Rubus, to achieve seamless parallelism. The Rubus compiler relieves the programmer from manually specifying the low-level details. It analyses and transforms a sequential program into a parallel program automatically, without any user intervention. This achieves massive speedup and better utilization of the underlying hardware without a programmer’s expertise in parallel programming. For five different benchmarks, an average speedup of 34.54 times has been achieved by Rubus as compared to Java on a basic GPU having only 96 cores.
For a matrix multiplication benchmark, an average execution speedup of 84 times has been achieved by Rubus on the same GPU. Moreover, Rubus achieves this performance without drastically increasing the memory footprint of a program. PMID:29211758
Adhesion of resin composite core materials to dentin.
O'Keefe, K L; Powers, J M
2001-01-01
This study determined (1) the effect of polymerization mode of resin composite core materials and dental adhesives on the bond strength to dentin, and (2) if dental adhesives perform as well to dentin etched with phosphoric acid as to dentin etched with self-etching primer. Human third molars were sectioned 2 mm from the highest pulp horn and polished. Three core materials (Fluorocore [dual cured], Core Paste [self-cured], and Clearfil Photo Core [light cured]) and two adhesives (Prime & Bond NT Dual Cure and Clearfil SE Bond [light cured]) were bonded to dentin using two dentin etching conditions. After storage, specimens were debonded in microtension and bond strengths were calculated. Scanning electron micrographs of representative bonding interfaces were analyzed. Analysis showed differences among core materials, adhesives, and etching conditions. Among core materials, dual-cured Fluorocore had the highest bond strengths. There were incompatibilities between self-cured Core Paste and Prime & Bond NT in both etched (0 MPa) and nonetched (3.0 MPa) dentin. Among adhesives, in most cases Clearfil SE Bond had higher bond strengths than Prime & Bond NT and bond strengths were higher to self-etched than to phosphoric acid-etched dentin. Scanning electron micrographs did not show a relationship between resin tags and bond strengths. There were incompatibilities between a self-cured core material and a dual-cured adhesive. All other combinations of core materials and adhesives produced strong in vitro bond strengths both in the self-etched and phosphoric acid-etched conditions.
Fast multi-core based multimodal registration of 2D cross-sections and 3D datasets.
Scharfe, Michael; Pielot, Rainer; Schreiber, Falk
2010-01-11
Solving bioinformatics tasks often requires extensive computational power. Recent trends in processor architecture combine multiple cores into a single chip to improve overall performance. The Cell Broadband Engine (CBE), a heterogeneous multi-core processor, provides power-efficient and cost-effective high-performance computing. One application area is image analysis and visualisation, in particular registration of 2D cross-sections into 3D image datasets. Such techniques can be used to put different image modalities into spatial correspondence, for example, 2D images of histological cuts into morphological 3D frameworks. We evaluate the CBE-driven PlayStation 3 as a high performance, cost-effective computing platform by adapting a multimodal alignment procedure to several characteristic hardware properties. The optimisations are based on partitioning, vectorisation, branch reducing and loop unrolling techniques with special attention to 32-bit multiplies and limited local storage on the computing units. We show how a typical image analysis and visualisation problem, the multimodal registration of 2D cross-sections and 3D datasets, benefits from the multi-core based implementation of the alignment algorithm. We discuss several CBE-based optimisation methods and compare our results to standard solutions. More information and the source code are available from http://cbe.ipk-gatersleben.de. The results demonstrate that the CBE processor in a PlayStation 3 accelerates computational intensive multimodal registration, which is of great importance in biological/medical image processing. The PlayStation 3 as a low cost CBE-based platform offers an efficient option to conventional hardware to solve computational problems in image processing and bioinformatics.
A Versatile Multichannel Digital Signal Processing Module for Microcalorimeter Arrays
NASA Astrophysics Data System (ADS)
Tan, H.; Collins, J. W.; Walby, M.; Hennig, W.; Warburton, W. K.; Grudberg, P.
2012-06-01
Different techniques have been developed for reading out microcalorimeter sensor arrays: individual outputs for small arrays, and time-division or frequency-division or code-division multiplexing for large arrays. Typically, raw waveform data are first read out from the arrays using one of these techniques and then stored on computer hard drives for offline optimum filtering, leading not only to requirements for large storage space but also limitations on achievable count rate. Thus, a read-out module that is capable of processing microcalorimeter signals in real time will be highly desirable. We have developed multichannel digital signal processing electronics that are capable of on-board, real time processing of microcalorimeter sensor signals from multiplexed or individual pixel arrays. It is a 3U PXI module consisting of a standardized core processor board and a set of daughter boards. Each daughter board is designed to interface a specific type of microcalorimeter array to the core processor. The combination of the standardized core plus this set of easily designed and modified daughter boards results in a versatile data acquisition module that not only can easily expand to future detector systems, but is also low cost. In this paper, we first present the core processor/daughter board architecture, and then report the performance of an 8-channel daughter board, which digitizes individual pixel outputs at 1 MSPS with 16-bit precision. We will also introduce a time-division multiplexing type daughter board, which takes in time-division multiplexing signals through fiber-optic cables and then processes the digital signals to generate energy spectra in real time.
Vascular system modeling in parallel environment - distributed and shared memory approaches
Jurczuk, Krzysztof; Kretowski, Marek; Bezy-Wendling, Johanne
2011-01-01
The paper presents two approaches in parallel modeling of vascular system development in internal organs. In the first approach, new parts of tissue are distributed among processors and each processor is responsible for perfusing its assigned parts of tissue to all vascular trees. Communication between processors is accomplished by passing messages and therefore this algorithm is perfectly suited for distributed memory architectures. The second approach is designed for shared memory machines. It parallelizes the perfusion process during which individual processing units perform calculations concerning different vascular trees. The experimental results, performed on a computing cluster and multi-core machines, show that both algorithms provide a significant speedup. PMID:21550891
Thermal Hotspots in CPU Die and It's Future Architecture
NASA Astrophysics Data System (ADS)
Wang, Jian; Hu, Fu-Yuan
Owing to increasing core frequency and chip integration and to limited die dimensions, power densities in CPU chips have been increasing rapidly. The high on-chip temperature resulting from these power densities threatens the processor's performance and the chip's reliability. This paper analyzes the thermal hotspots in the die and their properties. A new architecture for the function units in the die, the hot-units distributed architecture, is suggested to cope with the problem of high power densities in future processor chips.
FAST: framework for heterogeneous medical image computing and visualization.
Smistad, Erik; Bozorgi, Mohammadmehdi; Lindseth, Frank
2015-11-01
Computer systems are becoming increasingly heterogeneous in the sense that they consist of different processors, such as multi-core CPUs and graphic processing units. As the amount of medical image data increases, it is crucial to exploit the computational power of these processors. However, this is currently difficult due to several factors, such as driver errors, processor differences, and the need for low-level memory handling. This paper presents a novel FrAmework for heterogeneouS medical image compuTing and visualization (FAST). The framework aims to make it easier to simultaneously process and visualize medical images efficiently on heterogeneous systems. FAST uses common image processing programming paradigms and hides the details of memory handling from the user, while enabling the use of all processors and cores on a system. The framework is open-source, cross-platform and available online. Code examples and performance measurements are presented to show the simplicity and efficiency of FAST. The results are compared to the insight toolkit (ITK) and the visualization toolkit (VTK) and show that the presented framework is faster with up to 20 times speedup on several common medical imaging algorithms. FAST enables efficient medical image computing and visualization on heterogeneous systems. Code examples and performance evaluations have demonstrated that the toolkit is both easy to use and performs better than existing frameworks, such as ITK and VTK.
On VLSI Design of Rank-Order Filtering using DCRAM Architecture
Lin, Meng-Chun; Dung, Lan-Rong
2009-01-01
This paper addresses the VLSI design of rank-order filtering (ROF) with a maskable memory for real-time speech and image processing applications. Based on a generic bit-sliced ROF algorithm, the proposed design uses a specially defined memory, called the dual-cell random-access memory (DCRAM), to realize the major operations of ROF: threshold decomposition and polarization. Using the memory-oriented architecture, the proposed ROF processor can benefit from high flexibility, low cost and high speed. The DCRAM can perform the bit-sliced read, partial write, and pipelined processing. The bit-sliced read and partial write are driven by maskable registers. With recursive execution of the bit-sliced read and partial write, the DCRAM can effectively realize ROF in terms of cost and speed. The proposed design has been implemented using TSMC 0.18 μm 1P6M technology. As shown by the results of the physical implementation, the core size is 356.1 × 427.7 μm² and the VLSI implementation of ROF can operate at 256 MHz with a 1.8 V supply. PMID:19865599
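The bit-sliced idea named in the abstract above can be illustrated in software. The following is a minimal sketch of a generic bit-serial rank-order selection with polarization, not the DCRAM design itself; the function name, the `bits` parameter, and the convention that rank 1 is the minimum are assumptions made for illustration.

```python
def rank_order_filter(samples, rank, bits=8):
    """Return the rank-th smallest sample (rank=1 is the minimum).

    The result is built from the most significant bit down. After each
    bit decision, samples known to be larger or smaller than the result
    are "polarized" so that their remaining bits stay on the same side.
    """
    if not 1 <= rank <= len(samples):
        raise ValueError("rank must be between 1 and len(samples)")
    work = list(samples)              # working copies that get polarized
    result = 0
    for b in range(bits - 1, -1, -1):
        mask = 1 << b
        low = mask - 1                # mask covering the remaining lower bits
        zeros = sum(1 for w in work if not (w & mask))
        if zeros >= rank:
            # The result has a 0 here; samples with a 1 are larger,
            # so force their lower bits to all ones.
            work = [w | low if w & mask else w for w in work]
        else:
            # The result has a 1 here; samples with a 0 are smaller,
            # so force their lower bits to all zeros.
            result |= mask
            work = [w if w & mask else w & ~low for w in work]
    return result
```

For a window of five 8-bit samples, rank 3 gives the median. Each pass only reads one bit slice and rewrites the lower bits of some samples, which is the access pattern that a bit-sliced read and a partial write would serve in hardware.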
iHand: an interactive bare-hand-based augmented reality interface on commercial mobile phones
NASA Astrophysics Data System (ADS)
Choi, Junyeong; Park, Jungsik; Park, Hanhoon; Park, Jong-Il
2013-02-01
The performance of mobile phones has rapidly improved, and they are emerging as a powerful platform. In many vision-based applications, human hands play a key role in natural interaction. However, relatively little attention has been paid to the interaction between human hands and the mobile phone. Thus, we propose a vision- and hand gesture-based interface in which the user holds a mobile phone in one hand but sees the other hand's palm through a built-in camera. The virtual contents are faithfully rendered on the user's palm through palm pose estimation, and reactions to hand and finger movements are achieved through hand shape recognition. Since the proposed interface is based on hand gestures familiar to humans and does not require any additional sensors or markers, the user can freely interact with virtual contents anytime and anywhere without any training. We demonstrate that the proposed interface works at over 15 fps on a commercial mobile phone with a 1.2-GHz dual core processor and 1 GB RAM.
CMOS Image Sensor with a Built-in Lane Detector.
Hsiao, Pei-Yung; Cheng, Hsien-Chein; Huang, Shih-Shinh; Fu, Li-Chen
2009-01-01
This work develops a new current-mode mixed signal Complementary Metal-Oxide-Semiconductor (CMOS) imager, which can capture images and simultaneously produce vehicle lane maps. The adopted lane detection algorithm, which was modified to be compatible with hardware requirements, can achieve a high recognition rate of up to approximately 96% under various weather conditions. Instead of a Personal Computer (PC) based system or embedded platform system equipped with expensive high performance chip of Reduced Instruction Set Computer (RISC) or Digital Signal Processor (DSP), the proposed imager, without extra Analog to Digital Converter (ADC) circuits to transform signals, is a compact, lower cost key-component chip. It is also an innovative component device that can be integrated into intelligent automotive lane departure systems. The chip size is 2,191.4 × 2,389.8 μm, and the package uses 40 pin Dual-In-Package (DIP). The pixel cell size is 18.45 × 21.8 μm and the core size of photodiode is 12.45 × 9.6 μm; the resulting fill factor is 29.7%.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Mittal, Sparsh; Zhang, Zhao; Vetter, Jeffrey S
Recent trends of CMOS scaling and use of large last level caches (LLCs) have led to significant increase in the leakage energy consumption of LLCs and hence, managing their energy consumption has become extremely important in modern processor design. The conventional cache energy saving techniques require offline profiling or provide only coarse granularity of cache allocation. We present FlexiWay, a cache energy saving technique which uses dynamic cache reconfiguration. FlexiWay logically divides the cache sets into multiple (e.g. 16) modules and dynamically turns off suitable and possibly different number of cache ways in each module. FlexiWay has very small implementation overhead and it provides fine-grain cache allocation even with caches of typical associativity, e.g. an 8-way cache. Microarchitectural simulations have been performed using an x86-64 simulator and workloads from SPEC2006 suite. Also, FlexiWay has been compared with two conventional energy saving techniques. The results show that FlexiWay provides largest energy saving and incurs only small loss in performance. For single, dual and quad core systems, the average energy saving using FlexiWay are 26.2%, 25.7% and 22.4%, respectively.
Huang, Kuan-Ju; Shih, Wei-Yeh; Chang, Jui Chung; Feng, Chih Wei; Fang, Wai-Chi
2013-01-01
This paper presents a pipelined VLSI design of a fast singular value decomposition (SVD) processor for a real-time electroencephalography (EEG) system based on on-line recursive independent component analysis (ORICA). Since SVD is used frequently in the computations of the real-time EEG system, a low-latency and high-accuracy SVD processor is essential. During EEG system processing, the proposed SVD processor aims to solve the diagonal, inverse and inverse square root matrices of the target matrices in real time. Generally, SVD requires a huge amount of computation in hardware implementation. Therefore, this work proposes a novel design concept for data flow updating to assist the pipelined VLSI implementation. The SVD processor can greatly improve the feasibility of real-time EEG system applications such as brain computer interfaces (BCIs). The proposed architecture is implemented using TSMC 90 nm CMOS technology. The EEG raw data are sampled at 128 Hz. The core size of the SVD processor is 580 × 580 μm², and the operating frequency is 20 MHz. It consumes 0.774 mW of power during the 8-channel EEG system per execution time.
Energy Efficient Image/Video Data Transmission on Commercial Multi-Core Processors
Lee, Sungju; Kim, Heegon; Chung, Yongwha; Park, Daihee
2012-01-01
In transmitting image/video data over Video Sensor Networks (VSNs), energy consumption must be minimized while maintaining high image/video quality. Although image/video compression is well known for its efficiency and usefulness in VSNs, the excessive costs associated with encoding computation and complexity still hinder its adoption for practical use. However, it is anticipated that high-performance handheld multi-core devices will be used as VSN processing nodes in the near future. In this paper, we propose a way to improve the energy efficiency of image and video compression with multi-core processors while maintaining the image/video quality. We improve the compression efficiency at the algorithmic level or derive the optimal parameters for the combination of a machine and compression based on the tradeoff between the energy consumption and the image/video quality. Based on experimental results, we confirm that the proposed approach can improve the energy efficiency of the straightforward approach by a factor of 2∼5 without compromising image/video quality. PMID:23202181
High performance in silico virtual drug screening on many-core processors.
McIntosh-Smith, Simon; Price, James; Sessions, Richard B; Ibarra, Amaurys A
2015-05-01
Drug screening is an important part of the drug development pipeline for the pharmaceutical industry. Traditional, lab-based methods are increasingly being augmented with computational methods, ranging from simple molecular similarity searches through more complex pharmacophore matching to more computationally intensive approaches, such as molecular docking. The latter simulates the binding of drug molecules to their targets, typically protein molecules. In this work, we describe BUDE, the Bristol University Docking Engine, which has been ported to the OpenCL industry standard parallel programming language in order to exploit the performance of modern many-core processors. Our highly optimized OpenCL implementation of BUDE sustains 1.43 TFLOP/s on a single Nvidia GTX 680 GPU, or 46% of peak performance. BUDE also exploits OpenCL to deliver effective performance portability across a broad spectrum of different computer architectures from different vendors, including GPUs from Nvidia and AMD, Intel's Xeon Phi and multi-core CPUs with SIMD instruction sets.
High performance in silico virtual drug screening on many-core processors
Price, James; Sessions, Richard B; Ibarra, Amaurys A
2015-01-01
Drug screening is an important part of the drug development pipeline for the pharmaceutical industry. Traditional, lab-based methods are increasingly being augmented with computational methods, ranging from simple molecular similarity searches through more complex pharmacophore matching to more computationally intensive approaches, such as molecular docking. The latter simulates the binding of drug molecules to their targets, typically protein molecules. In this work, we describe BUDE, the Bristol University Docking Engine, which has been ported to the OpenCL industry standard parallel programming language in order to exploit the performance of modern many-core processors. Our highly optimized OpenCL implementation of BUDE sustains 1.43 TFLOP/s on a single Nvidia GTX 680 GPU, or 46% of peak performance. BUDE also exploits OpenCL to deliver effective performance portability across a broad spectrum of different computer architectures from different vendors, including GPUs from Nvidia and AMD, Intel’s Xeon Phi and multi-core CPUs with SIMD instruction sets. PMID:25972727
A Down-to-Earth Educational Operating System for Up-in-the-Cloud Many-Core Architectures
ERIC Educational Resources Information Center
Ziwisky, Michael; Persohn, Kyle; Brylow, Dennis
2013-01-01
We present "Xipx," the first port of a major educational operating system to a processor in the emerging class of many-core architectures. Through extensions to the proven Embedded Xinu operating system, Xipx gives students hands-on experience with system programming in a distributed message-passing environment. We expose the software primitives…
Design and implementation of a high performance network security processor
NASA Astrophysics Data System (ADS)
Wang, Haixin; Bai, Guoqiang; Chen, Hongyi
2010-03-01
The last few years have seen significant progress in the field of application-specific processors. One example is network security processors (NSPs) that perform various cryptographic operations specified by network security protocols and help to offload the computation-intensive burdens from network processors (NPs). This article presents a high performance NSP system architecture implementation intended for both internet protocol security (IPSec) and secure socket layer (SSL) protocol acceleration, which are widely employed in virtual private network (VPN) and e-commerce applications. The efficient dual one-way pipelined data transfer skeleton and optimised integration scheme of the heterogeneous parallel crypto engine arrays lead to a Gbps rate NSP, which is programmable with domain specific descriptor-based instructions. The descriptor-based control flow fragments large data packets and distributes them to the crypto engine arrays, which fully utilises the parallel computation resources and improves the overall system data throughput. A prototyping platform for this NSP design is implemented with a Xilinx XC3S5000 based FPGA chip set. Results show that the design gives a peak throughput for the IPSec ESP tunnel mode of 2.85 Gbps with over 2100 full SSL handshakes per second at a clock rate of 95 MHz.
Comparison of neuronal spike exchange methods on a Blue Gene/P supercomputer.
Hines, Michael; Kumar, Sameer; Schürmann, Felix
2011-01-01
For neural network simulations on parallel machines, interprocessor spike communication can be a significant portion of the total simulation time. The performance of several spike exchange methods using a Blue Gene/P (BG/P) supercomputer has been tested with 8-128 K cores using randomly connected networks of up to 32 M cells with 1 k connections per cell and 4 M cells with 10 k connections per cell, i.e., on the order of 4·10^10 connections (K is 1024, M is 1024^2, and k is 1000). The spike exchange methods used are the standard Message Passing Interface (MPI) collective, MPI_Allgather, and several variants of the non-blocking Multisend method either implemented via non-blocking MPI_Isend, or exploiting the possibility of very low overhead direct memory access (DMA) communication available on the BG/P. In all cases, the worst performing method was that using MPI_Isend due to the high overhead of initiating a spike communication. The two best performing methods (the persistent Multisend method using the Record-Replay feature of the Deep Computing Messaging Framework DCMF_Multicast; and a two-phase multisend in which a DCMF_Multicast is used to first send to a subset of phase one destination cores, which then pass it on to their subset of phase two destination cores) had similar performance with very low overhead for the initiation of spike communication. Departure from ideal scaling for the Multisend methods is almost completely due to load imbalance caused by the large variation in number of cells that fire on each processor in the interval between synchronization. Spike exchange time itself is negligible since transmission overlaps with computation and is handled by a DMA controller. We conclude that ideal performance scaling will be ultimately limited by imbalance between incoming processor spikes between synchronization intervals. Thus, counterintuitively, maximization of load balance requires that the distribution of cells on processors should not reflect neural net architecture but be randomly distributed so that sets of cells which are burst firing together should be on different processors with their targets on as large a set of processors as possible.
NASA Astrophysics Data System (ADS)
Zhuravska, Iryna M.; Koretska, Oleksandra O.; Musiyenko, Maksym P.; Surtel, Wojciech; Assembay, Azat; Kovalev, Vladimir; Tleshova, Akmaral
2017-08-01
The article presents basic approaches to developing self-powered information measuring wireless networks (SPIM-WN) using the distribution of tasks within multicore processors for critical applications, based on the interaction of movable components, both for data transmission and for wireless transfer of energy coming from polymetric sensors. The basic mathematical model of task scheduling within multiprocessor systems was modernized to schedule and allocate tasks between the cores of a single-chip computer (SoC) in order to increase the energy efficiency of SPIM-WN objects.
NASA Technical Reports Server (NTRS)
Swift, Gary M.; Allen, Gregory S.; Farmanesh, Farhad; George, Jeffrey; Petrick, David J.; Chayab, Fayez
2006-01-01
Shown in this presentation are recent results for the upset susceptibility of the various types of memory elements in the embedded PowerPC405 in the Xilinx V2P40 FPGA. For critical flight designs where configuration upsets are mitigated effectively through appropriate design triplication and configuration scrubbing, these upsets of processor elements can dominate the system error rate. Data from irradiations with both protons and heavy ions are given and compared using available models.
Zheng, Tingting; Zhang, Rui; Zhang, Qingfeng; Tan, Tingting; Zhang, Kui; Zhu, Jun-Jie; Wang, Hui
2013-09-18
We have developed a robust enzymatic peptide cleavage-based assay for the ultrasensitive dual-channel detection of matrix metalloproteinase-2 (MMP-2) in human serum using gold-quantum dot (Au-QD) core-satellite nanoprobes.
Control of automated behavior: insights from the discrete sequence production task
Abrahamse, Elger L.; Ruitenberg, Marit F. L.; de Kleine, Elian; Verwey, Willem B.
2013-01-01
Work with the discrete sequence production (DSP) task has provided a substantial literature on discrete sequencing skill over the last decades. The purpose of the current article is to provide a comprehensive overview of this literature and of the theoretical progress that it has prompted. We start with a description of the DSP task and the phenomena that are typically observed with it. Then we propose a cognitive model, the dual processor model (DPM), which explains performance of (skilled) discrete key-press sequences. Key features of this model are the distinction between a cognitive processor and a motor system (i.e., motor buffer and motor processor), the interplay between these two processing systems, and the possibility to execute familiar sequences in two different execution modes. We further discuss how this model relates to several related sequence skill research paradigms and models, and we outline outstanding questions for future research throughout the paper. We conclude by sketching a tentative neural implementation of the DPM. PMID:23515430
Fast multi-core based multimodal registration of 2D cross-sections and 3D datasets
2010-01-01
Background Solving bioinformatics tasks often requires extensive computational power. Recent trends in processor architecture combine multiple cores into a single chip to improve overall performance. The Cell Broadband Engine (CBE), a heterogeneous multi-core processor, provides power-efficient and cost-effective high-performance computing. One application area is image analysis and visualisation, in particular registration of 2D cross-sections into 3D image datasets. Such techniques can be used to put different image modalities into spatial correspondence, for example, 2D images of histological cuts into morphological 3D frameworks. Results We evaluate the CBE-driven PlayStation 3 as a high performance, cost-effective computing platform by adapting a multimodal alignment procedure to several characteristic hardware properties. The optimisations are based on partitioning, vectorisation, branch reducing and loop unrolling techniques with special attention to 32-bit multiplies and limited local storage on the computing units. We show how a typical image analysis and visualisation problem, the multimodal registration of 2D cross-sections and 3D datasets, benefits from the multi-core based implementation of the alignment algorithm. We discuss several CBE-based optimisation methods and compare our results to standard solutions. More information and the source code are available from http://cbe.ipk-gatersleben.de. Conclusions The results demonstrate that the CBE processor in a PlayStation 3 accelerates computational intensive multimodal registration, which is of great importance in biological/medical image processing. The PlayStation 3 as a low cost CBE-based platform offers an efficient option to conventional hardware to solve computational problems in image processing and bioinformatics. PMID:20064262
Optimization of the coherence function estimation for multi-core central processing unit
NASA Astrophysics Data System (ADS)
Cheremnov, A. G.; Faerman, V. A.; Avramchuk, V. S.
2017-02-01
The paper considers the use of parallel processing on a multi-core central processing unit to optimize the evaluation of the coherence function arising in digital signal processing. The coherence function, along with other methods of spectral analysis, is commonly used for vibration diagnosis of rotating machinery and its particular nodes. An algorithm is given for evaluating the function for signals represented by digital samples. The algorithm is analyzed for its software implementation and computational problems. Optimization measures are described, including algorithmic, architecture and compiler optimization, and their results are assessed for multi-core processors from different manufacturers. The speedup of parallel execution with respect to sequential execution was studied, and results are presented for Intel Core i7-4720HQ and AMD FX-9590 processors. The results show comparatively high efficiency of the optimization measures taken. In particular, acceleration indicators and average CPU utilization have been significantly improved, showing a high degree of parallelism of the constructed calculating functions. The developed software underwent state registration and will be used as a part of a software and hardware solution for rotating machinery fault diagnosis and pipeline leak location with the acoustic correlation method.
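As an illustration of the quantity discussed in the abstract above, the following is a minimal sketch of a magnitude-squared coherence estimate obtained by averaging spectra over non-overlapping segments. It assumes the common definition |Pxy|^2 / (Pxx * Pyy); the abstract does not state the authors' exact estimator, and the function names and the plain DFT used here are illustrative choices, not the optimized implementation.

```python
import cmath
import math

def dft(x):
    """Plain O(n^2) discrete Fourier transform, kept simple for clarity."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def coherence(x, y, seg_len):
    """Magnitude-squared coherence per frequency bin, averaged over segments."""
    n_seg = min(len(x), len(y)) // seg_len
    if n_seg < 2:
        raise ValueError("need at least two segments for a meaningful estimate")
    pxx = [0.0] * seg_len     # accumulated auto-spectrum of x
    pyy = [0.0] * seg_len     # accumulated auto-spectrum of y
    pxy = [0j] * seg_len      # accumulated cross-spectrum
    for s in range(n_seg):
        xs = dft(x[s * seg_len:(s + 1) * seg_len])
        ys = dft(y[s * seg_len:(s + 1) * seg_len])
        for k in range(seg_len):
            pxx[k] += abs(xs[k]) ** 2
            pyy[k] += abs(ys[k]) ** 2
            pxy[k] += xs[k].conjugate() * ys[k]
    return [abs(pxy[k]) ** 2 / (pxx[k] * pyy[k]) if pxx[k] * pyy[k] > 0 else 0.0
            for k in range(seg_len)]
```

The per-segment transforms are independent of one another, so in a sketch like this they are the natural unit of work to distribute across cores; how the authors actually partition the computation is not stated in the abstract.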
Dual-core optical fiber based strain sensor for remote sensing in hard-to-reach areas
NASA Astrophysics Data System (ADS)
Mąkowska, Anna; Szostkiewicz, Łukasz; Kołakowska, Agnieszka; Budnicki, Dawid; Bieńkowska, Beata; Ostrowski, Łukasz; Murawski, Michał; Napierała, Marek; Mergo, Paweł; Nasiłowski, Tomasz
2017-10-01
We present research on optical fiber sensors based on microstructured multi-core fiber. The developed sensor can be used to advantage in hard-to-reach areas because optical fibers can serve both as sensing elements and as the signal delivery medium. With this sensor, it is possible to increase the level of safety in areas at risk of explosion, e.g. in mine-like objects. As the basis for the remote strain sensor we use dual-core fibers. Multi-core fibers possess a characteristic parameter called crosstalk, which is a measure of the amount of signal that can pass to the adjacent core. The strain-sensitive area is made by creating a tapered section, in which the level of crosstalk is changed. On this basis, we present a broadened concept of fiber optic sensor design. Strain measurement is possible because the power distribution between the cores of the dual-core fiber changes depending on the applied strain. The principle of operation allows measurements in both the wavelength and the power domain.
On-chip programmable ultra-wideband microwave photonic phase shifter and true time delay unit.
Burla, Maurizio; Cortés, Luis Romero; Li, Ming; Wang, Xu; Chrostowski, Lukas; Azaña, José
2014-11-01
We proposed and experimentally demonstrated an ultra-broadband on-chip microwave photonic processor that can operate both as RF phase shifter (PS) and true-time-delay (TTD) line, with continuous tuning. The processor is based on a silicon dual-phase-shifted waveguide Bragg grating (DPS-WBG) realized with a CMOS compatible process. We experimentally demonstrated the generation of delay up to 19.4 ps over 10 GHz instantaneous bandwidth and a phase shift of approximately 160° over the bandwidth 22-29 GHz. The available RF measurement setup ultimately limits the phase shifting demonstration as the device is capable of providing up to 300° phase shift for RF frequencies over a record bandwidth approaching 1 THz.
Aerospace Applications Conference, Steamboat Springs, CO, Feb. 1-8, 1986, Digest
NASA Astrophysics Data System (ADS)
The present conference considers topics concerning the projected NASA Space Station's systems, digital signal and data processing applications, and space science and microwave applications. Attention is given to Space Station video and audio subsystems design, clock error, jitter, phase error and differential time-of-arrival in satellite communications, automation and robotics in space applications, target insertion into synthetic background scenes, and a novel scheme for the computation of the discrete Fourier transform on a systolic processor. Also discussed are a novel signal parameter measurement system employing digital signal processing, EEPROMS for spacecraft applications, a unique concurrent processor architecture for high speed simulation of dynamic systems, a dual polarization flat plate antenna, Fresnel diffraction, and ultralinear TWTs for high efficiency satellite communications.
Hiding the Disk and Network Latency of Out-of-Core Visualization
NASA Technical Reports Server (NTRS)
Ellsworth, David
2001-01-01
This paper describes an algorithm that improves the performance of application-controlled demand paging for out-of-core visualization by hiding the latency of reading data from either local disks or disks on remote servers. The performance improvements come from better overlapping the computation with the page reading process, and from performing multiple page reads in parallel. The paper includes measurements that show that the new multithreaded paging algorithm decreases the time needed to compute visualizations by one third when using one processor and reading data from local disk. The time needed when using one processor and reading data from remote disk decreased by two thirds. Visualization runs using data from remote disk actually ran faster than ones using data from local disk because the remote runs were able to make use of the remote server's high performance disk array.
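The overlap of computation with page reads described in the abstract above can be sketched with a small thread pool that keeps several reads in flight while earlier pages are processed. This is an illustrative sketch only: `read_page`, `process`, and the `lookahead` parameter are hypothetical stand-ins, and the paper's actual paging algorithm is not reproduced here.

```python
import time
from collections import deque
from concurrent.futures import ThreadPoolExecutor

def read_page(page_id):
    """Hypothetical stand-in for a slow read from local or remote disk."""
    time.sleep(0.01)
    return [page_id] * 4

def process(page):
    """Hypothetical stand-in for the visualization computation on one page."""
    return sum(page)

def render(page_ids, lookahead=4):
    """Process pages in order while up to `lookahead` reads run in parallel."""
    ids = list(page_ids)
    results = []
    with ThreadPoolExecutor(max_workers=lookahead) as pool:
        # Start the first batch of reads before any computation begins.
        in_flight = deque(pool.submit(read_page, pid) for pid in ids[:lookahead])
        for nxt in range(lookahead, len(ids) + lookahead):
            page = in_flight.popleft().result()   # wait only for the oldest read
            if nxt < len(ids):
                in_flight.append(pool.submit(read_page, ids[nxt]))
            results.append(process(page))         # overlaps with pending reads
    return results
```

With `lookahead=1` the loop degenerates to one read in flight at a time; larger values let several reads proceed concurrently, which is where a disk array or a remote server with parallel disks would be expected to help.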
Heterogeneous High Throughput Scientific Computing with APM X-Gene and Intel Xeon Phi
NASA Astrophysics Data System (ADS)
Abdurachmanov, David; Bockelman, Brian; Elmer, Peter; Eulisse, Giulio; Knight, Robert; Muzaffar, Shahzad
2015-05-01
Electrical power requirements will be a constraint on the future growth of Distributed High Throughput Computing (DHTC) as used by High Energy Physics. Performance-per-watt is a critical metric for the evaluation of computer architectures for cost-efficient computing. Additionally, future performance growth will come from heterogeneous, many-core, and high computing density platforms with specialized processors. In this paper, we examine the Intel Xeon Phi Many Integrated Cores (MIC) co-processor and Applied Micro X-Gene ARMv8 64-bit low-power server system-on-a-chip (SoC) solutions for scientific computing applications. We report our experience on software porting, performance and energy efficiency and evaluate the potential for use of such technologies in the context of distributed computing systems such as the Worldwide LHC Computing Grid (WLCG).
A pluggable framework for parallel pairwise sequence search.
Archuleta, Jeremy; Feng, Wu-chun; Tilevich, Eli
2007-01-01
The current and near future of the computing industry is one of multi-core and multi-processor technology. Most existing sequence-search tools have been designed with a focus on single-core, single-processor systems. This discrepancy between software design and hardware architecture substantially hinders sequence-search performance by not allowing full utilization of the hardware. This paper presents a novel framework that will aid the conversion of serial sequence-search tools into a parallel version that can take full advantage of the available hardware. The framework, which is based on a software architecture called mixin layers with refined roles, enables modules to be plugged into the framework with minimal effort. The inherent modular design improves maintenance and extensibility, thus opening up a plethora of opportunities for advanced algorithmic features to be developed and incorporated while routine maintenance of the codebase persists.
Fault-Tolerant, Radiation-Hard DSP
NASA Technical Reports Server (NTRS)
Czajkowski, David
2011-01-01
Commercial digital signal processors (DSPs) for use in high-speed satellite computers are challenged by the damaging effects of space radiation, mainly single event upsets (SEUs) and single event functional interrupts (SEFIs). Innovations have been developed for mitigating the effects of SEUs and SEFIs, enabling the use of very-high-speed commercial DSPs with improved SEU tolerances. Time-triple modular redundancy (TTMR) is a method of applying traditional triple modular redundancy on a single processor, exploiting the VLIW (very long instruction word) class of parallel processors. TTMR improves SEU rates substantially. SEFIs are solved by a SEFI-hardened core circuit, external to the microprocessor. It monitors the health of the processor, and if a SEFI occurs, forces the processor to return to performance through a series of escalating events. TTMR and hardened-core solutions were developed for both DSPs and reconfigurable field-programmable gate arrays (FPGAs). This includes advancement of TTMR algorithms for DSPs and reconfigurable FPGAs, plus a rad-hard, hardened-core integrated circuit that services both the DSP and FPGA. Additionally, a combined DSP and FPGA board architecture was fully developed into a rad-hard engineering product. This technology enables use of commercial off-the-shelf (COTS) DSPs in computers for satellite and other space applications, allowing rapid deployment at a much lower cost. Traditional rad-hard space computers are very expensive and typically have long lead times. These computers are either based on traditional rad-hard processors, which have extremely low computational performance, or triple modular redundant (TMR) FPGA arrays, which suffer from power and complexity issues. Even more frustrating is that the TMR arrays of FPGAs require a fixed, external rad-hard voting element, thereby causing them to lose much of their reconfiguration capability and in some cases significant speed reduction.
The benefits of COTS high-performance signal processing include significant increase in onboard science data processing, enabling orders of magnitude reduction in required communication bandwidth for science data return, orders of magnitude improvement in onboard mission planning and critical decision making, and the ability to rapidly respond to changing mission environments, thus enabling opportunistic science and orders of magnitude reduction in the cost of mission operations through reduction of required staff. Additional benefits of COTS-based, high-performance signal processing include the ability to leverage considerable commercial and academic investments in advanced computing tools, techniques, and infrastructure, and the familiarity of the science and IT community with these computing environments.
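The voting idea behind time-triple modular redundancy can be illustrated with a minimal Python sketch. It is not the implementation described in the record above: it only runs the same computation three times in sequence on one processor and takes the majority. The function names (`vote`, `run_with_time_redundancy`, `make_upset_prone_square`) and the simulated bit flip are assumptions made for illustration.

```python
from collections import Counter


def vote(results):
    """Return the value reported by at least two of the redundant runs."""
    value, count = Counter(results).most_common(1)[0]
    if count < 2:
        raise RuntimeError("no majority: every run disagrees")
    return value


def run_with_time_redundancy(compute, x, copies=3):
    """Run compute(x) `copies` times in sequence and vote on the outputs."""
    return vote([compute(x) for _ in range(copies)])


def make_upset_prone_square(bad_call):
    """Hypothetical kernel: squares x, but corrupts one call to mimic an SEU."""
    calls = {"n": 0}

    def square(x):
        calls["n"] += 1
        result = x * x
        if calls["n"] == bad_call:
            result ^= 1 << 4  # flip one bit of the result
        return result

    return square


# The second of the three runs is corrupted; the vote masks it.
masked = run_with_time_redundancy(make_upset_prone_square(bad_call=2), 12)
```

A single corrupted run is outvoted by the other two. If all three runs disagree, the sketch raises an error instead of guessing.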
DOE Office of Scientific and Technical Information (OSTI.GOV)
Sreepathi, Sarat; Sripathi, Vamsi; Mills, Richard T
2013-01-01
Inefficient parallel I/O is known to be a major bottleneck among scientific applications employed on supercomputers as the number of processor cores grows into the thousands. Our prior experience indicated that parallel I/O libraries such as HDF5 that rely on MPI-IO do not scale well beyond 10K processor cores, especially on parallel file systems (like Lustre) with single point of resource contention. Our previous optimization efforts for a massively parallel multi-phase and multi-component subsurface simulator (PFLOTRAN) led to a two-phase I/O approach at the application level where a set of designated processes participate in the I/O process by splitting the I/O operation into a communication phase and a disk I/O phase. The designated I/O processes are created by splitting the MPI global communicator into multiple sub-communicators. The root process in each sub-communicator is responsible for performing the I/O operations for the entire group and then distributing the data to rest of the group. This approach resulted in over 25X speedup in HDF I/O read performance and 3X speedup in write performance for PFLOTRAN at over 100K processor cores on the ORNL Jaguar supercomputer. This research describes the design and development of a general purpose parallel I/O library, SCORPIO (SCalable block-ORiented Parallel I/O) that incorporates our optimized two-phase I/O approach. The library provides a simplified higher level abstraction to the user, sitting atop existing parallel I/O libraries (such as HDF5) and implements optimized I/O access patterns that can scale on larger number of processors. Performance results with standard benchmark problems and PFLOTRAN indicate that our library is able to maintain the same speedups as before with the added flexibility of being applicable to a wider range of I/O intensive applications.
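The two-phase read described in this record can be sketched without MPI as a plain Python simulation. This is an illustrative model only, not SCORPIO's API: `split_ranks` and `two_phase_read` are hypothetical names, the (color, key) pairs mimic the arguments a communicator split would take, and the dataset length is assumed to divide evenly among ranks.

```python
def split_ranks(n_ranks, group_size):
    """Map each rank to (color, key): color picks the group, key orders it."""
    return {rank: (rank // group_size, rank % group_size)
            for rank in range(n_ranks)}


def two_phase_read(dataset, n_ranks, group_size):
    """Simulate a read where one root per group touches the file."""
    chunk = len(dataset) // n_ranks  # assumes an even division
    layout = split_ranks(n_ranks, group_size)
    received = {}
    disk_reads = 0
    for color in sorted({c for c, _ in layout.values()}):
        members = sorted(r for r, (c, _) in layout.items() if c == color)
        # Disk I/O phase: the group root (key 0) reads one contiguous block.
        start = members[0] * chunk
        block = dataset[start:start + chunk * len(members)]
        disk_reads += 1
        # Communication phase: the root hands each member its own slice.
        for i, rank in enumerate(members):
            received[rank] = block[i * chunk:(i + 1) * chunk]
    return received, disk_reads


data = list(range(16))
received, disk_reads = two_phase_read(data, n_ranks=8, group_size=4)
```

With eight simulated ranks in groups of four, only two reads reach the file instead of eight, which is the contention reduction the approach relies on.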
NASA Astrophysics Data System (ADS)
Kajiyama, Shinya; Fujito, Masamichi; Kasai, Hideo; Mizuno, Makoto; Yamaguchi, Takanori; Shinagawa, Yutaka
A novel 300MHz embedded flash memory for dual-core microcontrollers with a shared ROM architecture is proposed. One of its features is a three-stage pipeline read operation, which enables reduced access pitch and therefore reduces performance penalty due to conflict of shared ROM accesses. Another feature is a highly sensitive sense amplifier that achieves efficient pipeline operation with two-cycle latency one-cycle pitch as a result of a shortened sense time of 0.63ns. The combination of the pipeline architecture and proposed sense amplifiers significantly reduces access-conflict penalties with shared ROM and enhances performance of 32-bit RISC dual-core microcontrollers by 30%.
NASA Astrophysics Data System (ADS)
Zhang, Hui; Li, Yu-Hao; Chen, Yang; Wang, Man-Man; Wang, Xue-Sheng; Yin, Xue-Bo
2017-03-01
Phototherapy offers some unique advantages over chemotherapy in clinical application, such as remote controllability, improved selectivity, and low bio-toxicity. To improve safety and therapeutic efficacy, imaging-guided therapy seems particularly important because it uses visual information to infer the distribution and metabolism of the probe. Here we prepare biocompatible core-shell nanocomposites for dual-modality imaging-guided photothermal and photodynamic dual-therapy by the in situ growth of porphyrin-metal organic framework (PMOF) on Fe3O4@C core. Fe3O4@C core was used as T2-weighted magnetic resonance (MR) imaging and photothermal therapy (PTT) agent. The optical properties of porphyrin were well retained in PMOF, and PMOF was therefore selected for photodynamic therapy (PDT) and fluorescence imaging. Fluorescence and MR dual-modality imaging-guided PTT and PDT dual-therapy was confirmed with tumour-bearing mice as model. The high tumour accumulation of Fe3O4@C@PMOF and controllable light excitation at the tumour site achieved efficient cancer therapy, while low toxicity to normal tissues was observed. The results demonstrated that Fe3O4@C@PMOF was a promising dual-imaging guided PTT and PDT dual-therapy platform for tumour diagnosis and treatment with low cytotoxicity and negligible in vivo toxicity.
Improvement and speed optimization of numerical tsunami modelling program using OpenMP technology
NASA Astrophysics Data System (ADS)
Chernov, A.; Zaytsev, A.; Yalciner, A.; Kurkin, A.
2009-04-01
Currently, the basic problem of tsunami modeling is the low speed of calculations, which is unacceptable for operational warning services. Existing algorithms for numerical modeling of the hydrodynamic processes of tsunami waves were developed without taking advantage of modern computer facilities. Considerable acceleration of the calculations is possible using parallel algorithms. We discuss here a new approach to parallelizing tsunami modeling code using OpenMP technology (for multiprocessing systems with shared memory). Nowadays, multiprocessing systems are easily accessible to everyone, and the cost of using such systems is much lower than the cost of clusters. This also allows programmers to apply multithreading algorithms on researchers' desktop computers. Another important advantage of this approach is the shared-memory mechanism: there is no need to send data over slow networks (for example, Ethernet). All memory is common to all computing processes, which leads to almost linear scalability of the program and processes. In the new version of NAMI DANCE, OpenMP technology and a multithreading algorithm provide an 80% gain in speed compared with the single-thread version on a dual-processor unit. The speed increased further, and a 320% gain was attained, on a four-core processor PC. Thus, it was possible to considerably reduce calculation time on scientific workstations (desktops) without a complete change of the program and user interfaces. Further modernization of the algorithms for preparing initial data and processing results using OpenMP looks reasonable. The final version of NAMI DANCE, with its increased computational speed, can be used not only for research purposes but also in real time Tsunami Warning Systems.
Parasail: SIMD C library for global, semi-global, and local pairwise sequence alignments.
Daily, Jeff
2016-02-10
Sequence alignment algorithms are a key component of many bioinformatics applications. Though various fast Smith-Waterman local sequence alignment implementations have been developed for x86 CPUs, most are embedded into larger database search tools. In addition, fast implementations of Needleman-Wunsch global sequence alignment and its semi-global variants are not as widespread. This article presents the first software library for local, global, and semi-global pairwise intra-sequence alignments and improves the performance of previous intra-sequence implementations. A faster intra-sequence local pairwise alignment implementation is described and benchmarked, including new global and semi-global variants. Using a 375 residue query sequence a speed of 136 billion cell updates per second (GCUPS) was achieved on a dual Intel Xeon E5-2670 24-core processor system, the highest reported for an implementation based on Farrar's 'striped' approach. Rognes's SWIPE optimal database search application is still generally the fastest available at 1.2 to at best 2.4 times faster than Parasail for sequences shorter than 500 amino acids. However, Parasail was faster for longer sequences. For global alignments, Parasail's prefix scan implementation is generally the fastest, faster even than Farrar's 'striped' approach, however the opal library is faster for single-threaded applications. The software library is designed for 64-bit Linux, OS X, or Windows on processors with SSE2, SSE4.1, or AVX2. Source code is available from https://github.com/jeffdaily/parasail under the Battelle BSD-style license. Applications that require optimal alignment scores could benefit from the improved performance. For the first time, SIMD global, semi-global, and local alignments are available in a stand-alone C library.
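For readers unfamiliar with the quantities in this record, the following Python sketch shows a scalar Smith-Waterman local alignment score and how a GCUPS figure is derived from the number of dynamic-programming cells. It is not Parasail's code: it uses no SIMD, and the linear gap penalty and the scoring values are assumptions chosen for brevity. The function names are hypothetical.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of a and b (scalar, linear gap penalty)."""
    prev = [0] * (len(b) + 1)  # previous row of the DP table
    best = 0
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            # Local alignment: a cell never drops below zero.
            score = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            cur.append(score)
            best = max(best, score)
        prev = cur
    return best


def gcups(query_len, database_residues, seconds):
    """Billions of DP cell updates per second for one query against a database."""
    return query_len * database_residues / seconds / 1e9


identical = smith_waterman_score("ACGT", "ACGT")
unrelated = smith_waterman_score("AAAA", "TTTT")
rate = gcups(375, 10**9, 3.0)
```

Each cell depends on its left, upper, and diagonal neighbours, which is the dependency pattern that striped and prefix-scan SIMD layouts are designed to work around.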
Ultra-Reliable Digital Avionics (URDA) processor
NASA Astrophysics Data System (ADS)
Branstetter, Reagan; Ruszczyk, William; Miville, Frank
1994-10-01
Texas Instruments Incorporated (TI) developed the URDA processor design under contract with the U.S. Air Force Wright Laboratory and the U.S. Army Night Vision and Electro-Sensors Directorate. TI's approach couples advanced packaging solutions with advanced integrated circuit (IC) technology to provide a high-performance (200 MIPS/800 MFLOPS) modular avionics processor module for a wide range of avionics applications. TI's processor design integrates two Ada-programmable, URDA basic processor modules (BPM's) with a JIAWG-compatible PiBus and TMBus on a single F-22 common integrated processor-compatible form-factor SEM-E avionics card. A separate, high-speed (25-MWord/second 32-bit word) input/output bus is provided for sensor data. Each BPM provides a peak throughput of 100 MIPS scalar concurrent with 400-MFLOPS vector processing in a removable multichip module (MCM) mounted to a liquid-flowthrough (LFT) core and interfacing to a processor interface module printed wiring board (PWB). Commercial RISC technology coupled with TI's advanced bipolar complementary metal oxide semiconductor (BiCMOS) application specific integrated circuit (ASIC) and silicon-on-silicon packaging technologies are used to achieve the high performance in a miniaturized package. A Mips R4000-family reduced instruction set computer (RISC) processor and a TI 100-MHz BiCMOS vector coprocessor (VCP) ASIC provide, respectively, the 100 MIPS of a scalar processor throughput and 400 MFLOPS of vector processing throughput for each BPM. The TI Aladdin ASIC chipset was developed on the TI Aladdin Program under contract with the U.S. Army Communications and Electronics Command and was sponsored by the Advanced Research Projects Agency with technical direction from the U.S. Army Night Vision and Electro-Sensors Directorate.
2008-07-01
generation of process partitioning, a thread pipelining becomes possible. In this paper we briefly summarize the requirements and trends for FADEC based... FADEC environment, presenting a hypothetical realization of an example application. Finally we discuss the application of Time-Triggered...based control applications of the future. 15. SUBJECT TERMS Gas turbine, FADEC, Multi-core processing technology, disturbed based control
Characterization of Three-Stream Jet Flow Fields
NASA Technical Reports Server (NTRS)
Henderson, Brenda S.; Wernet, Mark P.
2016-01-01
Flow-field measurements were conducted on single-, dual- and three-stream jets using two-component and stereo Particle Image Velocimetry (PIV). The flow-field measurements complemented previous acoustic measurements. The exhaust system consisted of externally-plugged, externally-mixed, convergent nozzles. The study used bypass-to-core area ratios equal to 1.0 and 2.5 and tertiary-to-core area ratios equal to 0.6 and 1.0. Axisymmetric and offset tertiary nozzles were investigated for heated and unheated high-subsonic conditions. Centerline velocity decay rates for the single-, dual- and three-stream axisymmetric jets compared well when axial distance was normalized by an equivalent diameter based on the nozzle system total exit area. The tertiary stream had a greater impact on the mean axial velocity for the small bypass-to-core area ratio nozzles than for large bypass-to-core area ratio nozzles. Normalized turbulence intensities were similar for the single-, dual-, and three-stream unheated jets due to the small difference (10 percent) in the core and bypass velocities for the dual-stream jets and the low tertiary velocity (50 percent of the core stream) for the three-stream jets. For heated jet conditions where the bypass velocity was 65 percent of the core velocity, additional regions of high turbulence intensity occurred near the plug tip which were not present for the unheated jets. Offsetting the tertiary stream moved the peak turbulence intensity levels upstream relative to those for all axisymmetric jets investigated.
Atoche, Alejandro Castillo; Castillo, Javier Vázquez
2012-01-01
A high-speed dual super-systolic core for reconstructive signal processing (SP) operations consists of a double parallel systolic array (SA) machine in which each processing element of the array is also conceptualized as another SA in a bit-level fashion. In this study, we addressed the design of a high-speed dual super-systolic array (SSA) core for the enhancement/reconstruction of remote sensing (RS) imaging of radar/synthetic aperture radar (SAR) sensor systems. The selected reconstructive SP algorithms are efficiently transformed into their parallel representation and then mapped into an efficient high performance embedded computing (HPEC) architecture in reconfigurable Xilinx field programmable gate array (FPGA) platforms. As an implementation test case, the proposed approach was aggregated in a HW/SW co-design scheme in order to solve the nonlinear ill-posed inverse problem of nonparametric estimation of the power spatial spectrum pattern (SSP) from a remotely sensed scene. We show how such a dual SSA core drastically reduces the computational load of complex RS regularization techniques, achieving the required real-time operational mode. PMID:22736964
A Wearable Healthcare System With a 13.7 μA Noise Tolerant ECG Processor.
Izumi, Shintaro; Yamashita, Ken; Nakano, Masanao; Kawaguchi, Hiroshi; Kimura, Hiromitsu; Marumoto, Kyoji; Fuchikami, Takaaki; Fujimori, Yoshikazu; Nakajima, Hiroshi; Shiga, Toshikazu; Yoshimoto, Masahiko
2015-10-01
To prevent lifestyle diseases, wearable bio-signal monitoring systems for daily life monitoring have attracted attention. Wearable systems have strict size and weight constraints, which impose significant limitations of the battery capacity and the signal-to-noise ratio of bio-signals. This report describes an electrocardiograph (ECG) processor for use with a wearable healthcare system. It comprises an analog front end, a 12-bit ADC, a robust Instantaneous Heart Rate (IHR) monitor, a 32-bit Cortex-M0 core, and 64 Kbyte Ferroelectric Random Access Memory (FeRAM). The IHR monitor uses a short-term autocorrelation (STAC) algorithm to improve the heart-rate detection accuracy despite its use in noisy conditions. The ECG processor chip consumes 13.7 μA for heart rate logging application.
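The idea of estimating heart rate from a short window by autocorrelation can be sketched as follows. This is a generic autocorrelation-peak estimator written for illustration, not the STAC algorithm implemented on the chip. The function names, the 40 to 200 bpm search range, and the synthetic pulse train are all assumptions.

```python
def dominant_lag(window, min_lag, max_lag):
    """Return the lag in [min_lag, max_lag] with the largest autocorrelation.

    Requires max_lag < len(window).
    """
    mean = sum(window) / len(window)
    x = [v - mean for v in window]  # remove the DC offset
    best_lag, best_val = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        val = sum(x[i] * x[i + lag] for i in range(len(x) - lag))
        if val > best_val:
            best_lag, best_val = lag, val
    return best_lag


def heart_rate_bpm(window, fs, min_bpm=40, max_bpm=200):
    """Estimate beats per minute from one window sampled at fs Hz."""
    min_lag = int(fs * 60 / max_bpm)  # shortest beat interval considered
    max_lag = int(fs * 60 / min_bpm)  # longest beat interval considered
    return 60.0 * fs / dominant_lag(window, min_lag, max_lag)


# Synthetic 4 s window at 100 Hz with one unit pulse every 0.8 s.
fs = 100
window = [1.0 if n % 80 == 0 else 0.0 for n in range(400)]
bpm = heart_rate_bpm(window, fs)
```

Because the estimate comes from the periodicity of the whole window rather than from individual peak detections, isolated noise spikes have less influence on it than they would on a simple threshold detector.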
A subgradient approach for constrained binary optimization via quantum adiabatic evolution
NASA Astrophysics Data System (ADS)
Karimi, Sahar; Ronagh, Pooya
2017-08-01
Outer approximation method has been proposed for solving the Lagrangian dual of a constrained binary quadratic programming problem via quantum adiabatic evolution in the literature. This should be an efficient prescription for solving the Lagrangian dual problem in the presence of an ideally noise-free quantum adiabatic system. However, current implementations of quantum annealing systems demand methods that are efficient at handling possible sources of noise. In this paper, we consider a subgradient method for finding an optimal primal-dual pair for the Lagrangian dual of a constrained binary polynomial programming problem. We then study the quadratic stable set (QSS) problem as a case study. We see that this method applied to the QSS problem can be viewed as an instance-dependent penalty-term approach that avoids large penalty coefficients. Finally, we report our experimental results of using the D-Wave 2X quantum annealer and conclude that our approach helps this quantum processor to succeed more often in solving these problems compared to the usual penalty-term approaches.
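A toy version of the subgradient iteration on a Lagrangian dual is sketched below in Python. It is a sketch under stated assumptions, not the authors' method: the objective is linear rather than polynomial, there is a single equality constraint, and a brute-force enumeration stands in for the unconstrained binary solver (the role played by the quantum annealer in the paper). All function names and the step-size rule are hypothetical.

```python
from itertools import product


def minimize_lagrangian(c, a, b, lam):
    """Brute-force stand-in for the unconstrained binary solver."""
    def lagrangian(x):
        cost = sum(ci * xi for ci, xi in zip(c, x))
        violation = sum(ai * xi for ai, xi in zip(a, x)) - b
        return cost + lam * violation

    x = min(product((0, 1), repeat=len(c)), key=lagrangian)
    return x, lagrangian(x)


def subgradient_dual(c, a, b, steps=20, step0=1.25):
    """Maximize the dual of: min c.x subject to a.x == b, x binary."""
    lam, best_bound, best_feasible = 0.0, float("-inf"), None
    for k in range(1, steps + 1):
        x, value = minimize_lagrangian(c, a, b, lam)
        best_bound = max(best_bound, value)  # every dual value is a lower bound
        residual = sum(ai * xi for ai, xi in zip(a, x)) - b  # a subgradient
        if residual == 0:
            cost = sum(ci * xi for ci, xi in zip(c, x))
            if best_feasible is None or cost < best_feasible[1]:
                best_feasible = (x, cost)
        lam += (step0 / k) * residual  # diminishing-step ascent on the dual
    return lam, best_bound, best_feasible


# Pick exactly one of three items, minimizing total cost.
lam, bound, feasible = subgradient_dual(c=[-1, -2, -3], a=[1, 1, 1], b=1)
```

The multiplier acts as a penalty weight that the iteration tunes to the instance, so it settles at a moderate value instead of being fixed in advance at a large constant.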
NASA Astrophysics Data System (ADS)
Schultz, A.
2010-12-01
3D forward solvers lie at the core of inverse formulations used to image the variation of electrical conductivity within the Earth's interior. This property is associated with variations in temperature, composition, phase, presence of volatiles, and in specific settings, the presence of groundwater, geothermal resources, oil/gas or minerals. The high cost of 3D solutions has been a stumbling block to wider adoption of 3D methods. Parallel algorithms for modeling frequency domain 3D EM problems have not achieved wide scale adoption, with emphasis on fairly coarse grained parallelism using MPI and similar approaches. The communications bandwidth as well as the latency required to send and receive network communication packets is a limiting factor in implementing fine grained parallel strategies, inhibiting wide adoption of these algorithms. Leading Graphics Processor Unit (GPU) companies now produce GPUs with hundreds of GPU processor cores per die. The footprint, in silicon, of the GPU's restricted instruction set is much smaller than the general purpose instruction set required of a CPU. Consequently, the density of processor cores on a GPU can be much greater than on a CPU. GPUs also have local memory, registers and high speed communication with host CPUs, usually through PCIe type interconnects. The extremely low cost and high computational power of GPUs provides the EM geophysics community with an opportunity to achieve fine grained (i.e. massive) parallelization of codes on low cost hardware. The current generation of GPUs (e.g. NVidia Fermi) provides 3 billion transistors per chip die, with nearly 500 processor cores and up to 6 GB of fast (DDR5) GPU memory. This latest generation of GPU supports fast hardware double precision (64 bit) floating point operations of the type required for frequency domain EM forward solutions. Each Fermi GPU board can sustain nearly 1 TFLOP in double precision, and multiple boards can be installed in the host computer system. 
We describe our ongoing efforts to achieve massive parallelization on a novel hybrid GPU testbed machine currently configured with 12 Intel Westmere Xeon CPU cores (or 24 parallel computational threads) with 96 GB DDR3 system memory, 4 GPU subsystems which in aggregate contain 960 NVidia Tesla GPU cores with 16 GB dedicated DDR3 GPU memory, and a second interleaved bank of 4 GPU subsystems containing in aggregate 1792 NVidia Fermi GPU cores with 12 GB dedicated DDR5 GPU memory. We are applying domain decomposition methods to a modified version of Weiss' (2001) 3D frequency domain full physics EM finite difference code, an open source GPL licensed f90 code available for download from www.OpenEM.org. This will be the core of a new hybrid 3D inversion that parallelizes frequencies across CPUs and individual forward solutions across GPUs. We describe progress made in modifying the code to use direct solvers in GPU cores dedicated to each small subdomain, iteratively improving the solution by matching adjacent subdomain boundary solutions, rather than iterative Krylov space sparse solvers as currently applied to the whole domain.
Development of an embedded atmospheric turbulence mitigation engine
NASA Astrophysics Data System (ADS)
Paolini, Aaron; Bonnett, James; Kozacik, Stephen; Kelmelis, Eric
2017-05-01
Methods to reconstruct pictures from imagery degraded by atmospheric turbulence have been under development for decades. The techniques were initially developed for observing astronomical phenomena from the Earth's surface, but have more recently been modified for ground and air surveillance scenarios. Such applications can impose significant constraints on deployment options because they both increase the computational complexity of the algorithms themselves and often dictate a requirement for low size, weight, and power (SWaP) form factors. Consequently, embedded implementations must be developed that can perform the necessary computations on low-SWaP platforms. Fortunately, there is an emerging class of embedded processors driven by the mobile and ubiquitous computing industries. We have leveraged these processors to develop embedded versions of the core atmospheric correction engine found in our ATCOM software. In this paper, we will present our experience adapting our algorithms for embedded systems on a chip (SoCs), namely the NVIDIA Tegra that couples general-purpose ARM cores with their graphics processing unit (GPU) technology and the Xilinx Zynq which pairs similar ARM cores with their field-programmable gate array (FPGA) fabric.
Effect of waist diameter and twist on tapered asymmetrical dual-core fiber MZI filter.
Liu, Yan; Li, Yang; Yan, Xiaojun; Li, Weidong
2015-10-01
A compact in-fiber Mach-Zehnder interferometer (MZI) filter fabricated from custom-designed asymmetrical dual-core fiber is numerically analyzed in detail and experimentally verified. The asymmetrical dual-core fiber has core diameters and a core pitch of 6.9, 6, and 19.9 μm, respectively. The fiber tapering technique is introduced to fuse the originally uncoupled cores into strong coupling tapered regions. The length and diameter of the waist region have a close impact on the splitting ratio, which further affects the spectral properties of the MZI filter. The field evolution with varied waist parameters is characterized by the finite element method and beam propagation method. Repeatable comb filters with ∼15 dB extinction ratio are successfully achieved under the guidance of simulated optimum conditions. The twist-induced circular birefringence gives rise to a retardance that causes the spectral shifts of the MZI filter. The theoretical and experimental results confirm that the relative wavelength shift is proportional to the retardance, which follows a sinc function in the limit of a large twist rate.
A PIV Study of Slotted Air Injection for Jet Noise Reduction
NASA Technical Reports Server (NTRS)
Henderson, Brenda S.; Wernet, Mark P.
2012-01-01
Results from acoustic and Particle Image Velocimetry (PIV) measurements are presented for single and dual-stream jets with fluidic injection on the core stream. The fluidic injection nozzles delivered air to the jet through slots on the interior of the nozzle at the nozzle trailing edge. The investigations include subsonic and supersonic jet conditions. Reductions in broadband shock noise and low frequency mixing noise were obtained with the introduction of fluidic injection on single stream jets. Fluidic injection was found to eliminate shock cells, increase jet mixing, and reduce turbulent kinetic energy levels near the end of the potential core. For dual-stream subsonic jets, the introduction of fluidic injection reduced low frequency noise in the peak jet noise direction and enhanced jet mixing. For dual-stream jets with supersonic fan streams and subsonic core streams, the introduction of fluidic injection in the core stream impacted the jet shock cell structure but had little effect on mixing between the core and fan streams.
NASA Astrophysics Data System (ADS)
Porsezian, K.; Nithyanandan, K.; Vasantha Jayakantha Raja, R.; Ganapathy, R.
2013-07-01
The supercontinuum generation (SCG) in liquid core photonic crystal fiber (LCPCF) with versatile nonlinear response and the spectral broadening in dual core optical fiber are presented. The analysis is presented in two phases: phase I deals with the SCG in LCPCF with the effect of saturable nonlinearity and re-orientational nonlinearity. We identify and discuss the generic nature of the saturable nonlinearity and re-orientational nonlinearity in the SCG, using a suitable model. For the physical explanation, modulational instability (MI) and soliton fission techniques are implemented to investigate the impact of saturable nonlinear response and slow nonlinear response, respectively. It is observed that the saturable nonlinearity inevitably suppresses the MI and the subsequent SCG. On the other hand, the re-orientational nonlinearity contributes to the slow nonlinear response in addition to the conventional fast response due to the electronic contribution. Phase II features the exclusive investigation of the spectral broadening in the dual core optical fiber.
Input-independent, Scalable and Fast String Matching on the Cray XMT
DOE Office of Scientific and Technical Information (OSTI.GOV)
Villa, Oreste; Chavarría-Miranda, Daniel; Maschhoff, Kristyn J
2009-05-25
String searching is at the core of many security and network applications like search engines, intrusion detection systems, virus scanners and spam filters. The growing size of on-line content and the increasing wire speeds push the need for fast, and often real-time, string searching solutions. For these conditions, many software implementations (if not all) targeting conventional cache-based microprocessors do not perform well. They either exhibit overall low performance or exhibit highly variable performance depending on the types of inputs. For this reason, real-time state of the art solutions rely on the use of either custom hardware or Field-Programmable Gate Arrays (FPGAs) at the expense of overall system flexibility and programmability. This paper presents a software based implementation of the Aho-Corasick string searching algorithm on the Cray XMT multithreaded shared memory machine. Our solution relies on the particular features of the XMT architecture and on several algorithmic strategies: it is fast, scalable and its performance is virtually content-independent. On a 128-processor Cray XMT, it reaches a scanning speed of ≈ 28 Gbps with a performance variability below 10%. In the 10 Gbps performance range, variability is below 2.5%. By comparison, an Intel dual-socket, 8-core system running at 2.66 GHz achieves a peak performance which varies from 500 Mbps to 10 Gbps depending on the type of input and dictionary size.
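For reference, a compact single-threaded Python sketch of the Aho-Corasick algorithm named in this record is given below. It shows only the textbook automaton (trie, failure links, merged outputs), not the multithreaded Cray XMT implementation. The function names and the list-of-dicts representation are choices made for this illustration.

```python
from collections import deque


def build_automaton(patterns):
    """Build goto, failure, and output tables for a set of patterns."""
    goto, fail, out = [{}], [0], [[]]
    for p in patterns:
        state = 0
        for ch in p:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(p)
    # Breadth-first pass: depth-1 states keep failure link 0 (the root).
    queue = deque(goto[0].values())
    while queue:
        s = queue.popleft()
        for ch, t in goto[s].items():
            queue.append(t)
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            # Inherit matches that end at the failure target.
            out[t] = out[t] + out[fail[t]]
    return goto, fail, out


def search(text, automaton):
    """Return (start_index, pattern) for every match, in one pass over text."""
    goto, fail, out = automaton
    state, hits = 0, []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for p in out[state]:
            hits.append((i - len(p) + 1, p))
    return hits


automaton = build_automaton(["he", "she", "his", "hers"])
hits = search("ushers", automaton)
```

The text is scanned once, character by character, regardless of how many patterns the dictionary holds. That property is what makes the algorithm's cost largely independent of the input content, although cache behaviour on conventional processors still varies with it.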
Ng, C M
2013-10-01
The development of a population PK/PD model, an essential component for model-based drug development, is both time- and labor-intensive. Graphics processing unit (GPU) computing technology has been proposed and used to accelerate many scientific computations. The objective of this study was to develop a hybrid GPU-CPU implementation of parallelized Monte Carlo parametric expectation maximization (MCPEM) estimation algorithm for population PK data analysis. A hybrid GPU-CPU implementation of the MCPEM algorithm (MCPEMGPU) and an identical algorithm designed for a single CPU (MCPEMCPU) were developed using MATLAB in a single computer equipped with dual Xeon 6-Core E5690 CPU and an NVIDIA Tesla C2070 GPU parallel computing card that contained 448 stream processors. Two different PK models with rich/sparse sampling design schemes were used to simulate population data in assessing the performance of MCPEMCPU and MCPEMGPU. Results were analyzed by comparing the parameter estimation and model computation times. Speedup factor was used to assess the relative benefit of parallelized MCPEMGPU over MCPEMCPU in shortening model computation time. The MCPEMGPU consistently achieved shorter computation time than the MCPEMCPU and can offer more than 48-fold speedup using a single GPU card. The novel hybrid GPU-CPU implementation of parallelized MCPEM algorithm developed in this study holds great promise in serving as the core for the next generation of modeling software for population PK/PD analysis.
Binocular Multispectral Adaptive Imaging System (BMAIS)
2010-07-26
system for pilots that adaptively integrates shortwave infrared (SWIR), visible, near-IR (NIR), off-head thermal, and computer symbology/imagery into...respective areas. BMAIS is a binocular helmet mounted imaging system that features dual shortwave infrared (SWIR) cameras, embedded image processors and...algorithms and fusion of other sensor sites such as forward looking infrared (FLIR) and other aircraft subsystems. BMAIS is attached to the helmet
Heterogeneous high throughput scientific computing with APM X-Gene and Intel Xeon Phi
Abdurachmanov, David; Bockelman, Brian; Elmer, Peter; ...
2015-05-22
Electrical power requirements will be a constraint on the future growth of Distributed High Throughput Computing (DHTC) as used by High Energy Physics. Performance-per-watt is a critical metric for the evaluation of computer architectures for cost-efficient computing. Additionally, future performance growth will come from heterogeneous, many-core, and high computing density platforms with specialized processors. In this paper, we examine the Intel Xeon Phi Many Integrated Cores (MIC) co-processor and Applied Micro X-Gene ARMv8 64-bit low-power server system-on-a-chip (SoC) solutions for scientific computing applications. As a result, we report our experience on software porting, performance and energy efficiency and evaluate the potential for use of such technologies in the context of distributed computing systems such as the Worldwide LHC Computing Grid (WLCG).
NASA Astrophysics Data System (ADS)
Yakushin, Sergey S.; Wolf, Alexey A.; Dostovalov, Alexandr V.; Skvortsov, Mikhail I.; Wabnitz, Stefan; Babin, Sergey A.
2018-07-01
Fiber Bragg gratings (FBGs) with different reflection wavelengths have been inscribed in different cores of a dual-core fiber section. The effect of fiber bending on the FBG reflection spectra has been studied. Various interrogation schemes are presented, including a single-end scheme based on cross-talk between the cores that uses only standard optical components. Simultaneous interrogation of the FBGs in both cores achieves a bending sensitivity of 12.8 pm/m⁻¹ that is free of temperature and strain influence. The technology enables the development of real-time bending sensors with high spatial resolution based on a series of FBGs with different wavelengths inscribed along the multi-core fiber.
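One way to read the temperature- and strain-free claim is as a differential measurement, sketched below in Python. This is an interpretation, not the authors' stated procedure. It assumes that temperature and axial strain shift both Bragg peaks by the same amount, and that the quoted 12.8 pm/m⁻¹ applies to the difference between the two shifts. The function name and the example numbers are hypothetical.

```python
SENSITIVITY_PM_PER_INV_M = 12.8  # figure quoted in the abstract, pm per 1/m


def curvature_from_shifts(shift_core_a_pm, shift_core_b_pm,
                          sensitivity=SENSITIVITY_PM_PER_INV_M):
    """Estimate curvature (1/m) from the two Bragg wavelength shifts in pm.

    Assumes common-mode effects (temperature, axial strain) move both
    peaks equally, so only bending contributes to the difference.
    """
    return (shift_core_a_pm - shift_core_b_pm) / sensitivity


# Example: a 10 pm common-mode shift plus a +/-6.4 pm bending contribution.
bent = curvature_from_shifts(10.0 + 6.4, 10.0 - 6.4)
straight = curvature_from_shifts(10.0, 10.0)
```

Under these assumptions the common 10 pm offset cancels in the subtraction, leaving only the bending term.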
Oto, Tatsuki; Yasuda, Genta; Tsubota, Keishi; Kurokawa, Hiroyasu; Miyazaki, Masashi; Platt, Jeffrey A
2009-01-01
This study examined the influence of power density on dentin bond strength and polymerization behavior of dual-cured direct core foundation resin systems. Two commercially available dual-cured direct core foundation resin systems, Clearfil DC Core Automix with Clearfil DC Bond and UniFil Core with Self-Etching Bond, were studied. Bovine mandibular incisors were mounted in autopolymerizing resin and the facial dentin surfaces were ground wet on 600-grit SiC paper. Dentin surfaces were treated according to manufacturer's recommendations. The resin pastes were condensed into the mold and cured with the power densities of 0 (no irradiation), 100, 200, 400 and 600 mW/cm2. Ten specimens per group were stored in 37 degrees C water for 24 hours, then shear tested at a crosshead speed of 1.0 mm/minute in a universal testing machine. An ultrasonic measurement device was used to measure the ultrasonic velocities through the core foundation resins. The power densities selected were 0 (no irradiation), 200, and 600 mW/cm2, and ultrasonic velocity was calculated. ANOVA and Tukey HSD tests were performed at a level of 0.05. The highest bond strengths were obtained when the resin pastes were cured with the highest power density for both core foundation systems (16.8 +/- 1.9 MPa for Clearfil DC Core Automix, 15.6 +/- 2.9 MPa for UniFil Core). When polymerized with the power densities under 200 mW/cm2, significantly lower bond strengths were observed compared to those obtained with the power density of 600 mW/cm2. As the core foundation resins hardened, the sonic velocities increased and this tendency differed among the power density of the curing unit. 
When the sonic velocities at three minutes after the start of measurements were compared, there were no significant differences among different irradiation modes for UniFil Core, while a significant decrease in sonic velocity was obtained when the resin paste was chemically polymerized compared with dual-polymerization for Clearfil DC Core Automix. The data suggest that the dentin bond strengths and polymerization behavior of the dual-cured, direct core foundation systems are still affected by the power density of the curing unit. With a careful choice of the core foundation systems and power density of the curing unit, the benefit of using resin composites for endodontically treated teeth might be acceptable.
ERIC Educational Resources Information Center
Alfaro, Cristina; Durán, Richard; Hunt, Alexandra; Aragón, María José
2014-01-01
Recent education reforms have begun to reframe academic discussion and teacher practice surrounding bilingual educational approaches for preparing "21st century, college and career ready" citizens. Given this broader context, in this article we examine ways that we might join implementation of dual language programs, Common Core State…
List-mode PET image reconstruction for motion correction using the Intel XEON PHI co-processor
NASA Astrophysics Data System (ADS)
Ryder, W. J.; Angelis, G. I.; Bashar, R.; Gillam, J. E.; Fulton, R.; Meikle, S.
2014-03-01
List-mode image reconstruction with motion correction is computationally expensive, as it requires projection of hundreds of millions of rays through a 3D array. To decrease reconstruction time it is possible to use symmetric multiprocessing computers or graphics processing units. The former can have high financial costs, while the latter can require refactoring of algorithms. The Xeon Phi is a new co-processor card with a Many Integrated Core architecture that can run 4 multiple-instruction, multiple-data threads per core with each thread having a 512-bit single instruction, multiple data vector register. Thus, it is possible to run in the region of 220 threads simultaneously. The aim of this study was to investigate whether the Xeon Phi co-processor card is a viable alternative to an x86 Linux server for accelerating List-mode PET image reconstruction for motion correction. An existing list-mode image reconstruction algorithm with motion correction was ported to run on the Xeon Phi coprocessor with the multi-threading implemented using pthreads. There were no differences between images reconstructed using the Phi co-processor card and images reconstructed using the same algorithm run on a Linux server. However, it was found that the reconstruction runtimes were 3 times greater for the Phi than the server. A new version of the image reconstruction algorithm was developed in C++ using OpenMP for multi-threading and the Phi runtimes decreased to 1.67 times that of the host Linux server. Data transfer from the host to co-processor card was found to be a rate-limiting step; this needs to be carefully considered in order to maximize runtime speeds. When considering the purchase price of a Linux workstation with Xeon Phi co-processor card and top of the range Linux server, the former is a cost-effective computation resource for list-mode image reconstruction. 
A multi-Phi workstation could be a viable alternative to cluster computers at a lower cost for medical imaging applications.
Safeguards Technology Factsheet - Unattended Dual Current Monitor (UDCM)
DOE Office of Scientific and Technical Information (OSTI.GOV)
Newell, Matthew R.
2016-04-13
The UDCM is a low-current measurement device designed to record sub-nano-amp to micro-amp currents from radiation detectors. The UDCM is a two-channel device that incorporates a Commercial-Off-The-Shelf (COTS) processor enabling both serial over USB as well as Ethernet communications. The instrument includes microSD and USB flash memory for data storage as well as a programmable High Voltage (HV) power supply for detector bias. The UDCM is packaged in the same enclosure, employs the same processor and has a similar user interface as the UMSR. A serial over USB communication line to the UDCM allows the use of existing versions of MIC software, while the Ethernet port is compatible with the new IAEA RAINSTORM communication protocol.
A scalable SIMD digital signal processor for high-quality multifunctional printer systems
NASA Astrophysics Data System (ADS)
Kang, Hyeong-Ju; Choi, Yongwoo; Kim, Kimo; Park, In-Cheol; Kim, Jung-Wook; Lee, Eul-Hwan; Gahang, Goo-Soo
2005-01-01
This paper describes a high-performance scalable SIMD digital signal processor (DSP) developed for multifunctional printer systems. The DSP supports a variable number of datapaths to cover a wide range of performance and maintain a RISC-like pipeline structure. Many special instructions suitable for image processing algorithms are included in the DSP. Quad/dual instructions are introduced for 8-bit or 16-bit data, and bit-field extraction/insertion instructions are supported to process various data types. Conditional instructions are supported to deal with complex relative conditions efficiently. In addition, an intelligent DMA block is integrated to align data in the course of data reading. Experimental results show that the proposed DSP outperforms a high-end printer-system DSP by at least two times.
VLSI 'smart' I/O module development
NASA Astrophysics Data System (ADS)
Kirk, Dan
The developmental history, design, and operation of the MIL-STD-1553A/B discrete and serial module (DSM) for the U.S. Navy AN/AYK-14(V) avionics computer are described and illustrated with diagrams. The ongoing preplanned product improvement for the AN/AYK-14(V) includes five dual-redundant MIL-STD-1553 channels based on DSMs. The DSM is a front-end processor for transferring data to and from a common memory, sharing memory with a host processor to provide improved 'smart' input/output performance. Each DSM comprises three hardware sections: three VLSI-6000 semicustomized CMOS arrays, memory units to support the arrays, and buffers and resynchronization circuits. The DSM hardware module design, VLSI-6000 design tools, controlware and test software, and checkout procedures (using a hardware simulator) are characterized in detail.
A light hydrocarbon fuel processor producing high-purity hydrogen
NASA Astrophysics Data System (ADS)
Löffler, Daniel G.; Taylor, Kyle; Mason, Dylan
This paper discusses the design process and presents performance data for a dual fuel (natural gas and LPG) fuel processor for PEM fuel cells delivering between 2 and 8 kW electric power in stationary applications. The fuel processor resulted from a series of design compromises made to address different design constraints. First, the product quality was selected; then, the unit operations needed to achieve that product quality were chosen from the pool of available technologies. Next, the specific equipment needed for each unit operation was selected. Finally, the unit operations were thermally integrated to achieve high thermal efficiency. Early in the design process, it was decided that the fuel processor would deliver high-purity hydrogen. Hydrogen can be separated from other gases by pressure-driven processes based on either selective adsorption or permeation. The pressure requirement made steam reforming (SR) the preferred reforming technology because it does not require compression of combustion air; therefore, steam reforming is more efficient in a high-pressure fuel processor than alternative technologies like autothermal reforming (ATR) or partial oxidation (POX), where the combustion occurs at the pressure of the process stream. A low-temperature pre-reformer reactor is needed upstream of a steam reformer to suppress coke formation; yet, low temperatures facilitate the formation of metal sulfides that deactivate the catalyst. For this reason, a desulfurization unit is needed upstream of the pre-reformer. Hydrogen separation was implemented using a palladium alloy membrane. Packed beds were chosen for the pre-reformer and reformer reactors primarily because of their low cost, relatively simple operation and low maintenance. Commercial, off-the-shelf balance of plant (BOP) components (pumps, valves, and heat exchangers) were used to integrate the unit operations. The fuel processor delivers up to 100 slm hydrogen >99.9% pure with <1 ppm CO, <3 ppm CO2. 
The thermal efficiency is better than 67% operating at full load. This fuel processor has been integrated with a 5-kW fuel cell producing electricity and hot water.
Security Primitives for Reconfigurable Hardware-Based Systems
2010-05-01
work, we propose security primitives using ideas centered around the notion of “moats and drawbridges.” The primitives encompass four design properties...Santa Barbara, CA 93106; email: sherwood@cs.ucsb.edu; R. Kastner, Department of Computer Science and Engineering, University of California, San...fingerprint reader), the other to control the ethernet IP core—and an AES encryption engine used by both of the processor cores. These cores are all implemented
Progress Towards a Rad-Hydro Code for Modern Computing Architectures LA-UR-10-02825
NASA Astrophysics Data System (ADS)
Wohlbier, J. G.; Lowrie, R. B.; Bergen, B.; Calef, M.
2010-11-01
We are entering an era of high performance computing where data movement is the overwhelming bottleneck to scalable performance, as opposed to the speed of floating-point operations per processor. All multi-core hardware paradigms, whether heterogeneous or homogeneous, be it the Cell processor, GPGPU, or multi-core x86, share this common trait. In multi-physics applications such as inertial confinement fusion or astrophysics, one may be solving multi-material hydrodynamics with tabular equation of state data lookups, radiation transport, nuclear reactions, and charged particle transport in a single time cycle. The algorithms are intensely data dependent, e.g., EOS, opacity, nuclear data, and multi-core hardware memory restrictions are forcing code developers to rethink code and algorithm design. For the past two years LANL has been funding a small effort referred to as Multi-Physics on Multi-Core to explore ideas for code design as pertaining to inertial confinement fusion and astrophysics applications. The near term goals of this project are to have a multi-material radiation hydrodynamics capability, with tabular equation of state lookups, on Cartesian and curvilinear block structured meshes. In the longer term we plan to add fully implicit multi-group radiation diffusion and material heat conduction, and block structured AMR. We will report on our progress to date.
MPI parallelization of Vlasov codes for the simulation of nonlinear laser-plasma interactions
NASA Astrophysics Data System (ADS)
Savchenko, V.; Won, K.; Afeyan, B.; Decyk, V.; Albrecht-Marc, M.; Ghizzo, A.; Bertrand, P.
2003-10-01
The simulation of optical mixing driven KEEN waves [1] and electron plasma waves [1] in laser-produced plasmas requires nonlinear kinetic models and massive parallelization. We use Message Passing Interface (MPI) libraries and Appleseed [2] to solve the Vlasov Poisson system of equations on an 8 node dual processor MAC G4 cluster. We use the semi-Lagrangian time splitting method [3]. It requires only row-column exchanges in the global data redistribution, minimizing the total number of communications between processors. Recurrent communication patterns for 2D FFTs involve global transposition. In the Vlasov-Maxwell case, we use splitting into two 1D spatial advections and a 2D momentum advection [4]. Discretized momentum advection equations have a double loop structure with the outer index being assigned to different processors. We adhere to a code structure with separate routines for calculations and data management for parallel computations. [1] B. Afeyan et al., IFSA 2003 Conference Proceedings, Monterey, CA [2] V. K. Decyk, Computers in Physics, 7, 418 (1993) [3] Sonnendrucker et al., JCP 149, 201 (1998) [4] Begue et al., JCP 151, 458 (1999)
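The row-column pattern with a global transposition that this abstract mentions for 2D FFTs can be illustrated with a minimal single-process sketch. It is not the authors' code: the function names are hypothetical, a naive DFT stands in for a real FFT, and the in-memory `transpose` stands in for what would be an all-to-all exchange between processors under MPI.

```python
import cmath

def dft(x):
    """Naive O(n^2) 1D discrete Fourier transform (stand-in for an FFT)."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

def transpose(a):
    """Stand-in for the global transposition (an all-to-all exchange in a distributed run)."""
    return [list(col) for col in zip(*a)]

def fft2_by_transpose(a):
    """2D transform as: row transforms, transpose, row transforms, transpose back."""
    step1 = [dft(row) for row in a]       # each owner transforms its local rows
    step2 = transpose(step1)              # former columns become local rows
    step3 = [dft(row) for row in step2]   # transform along the former columns
    return transpose(step3)               # restore the original layout

def dft2_direct(a):
    """Direct double-sum 2D DFT, used only as a reference for checking."""
    rows, cols = len(a), len(a[0])
    return [[sum(a[r][c] * cmath.exp(-2j * cmath.pi * (u * r / rows + v * c / cols))
                 for r in range(rows) for c in range(cols))
             for v in range(cols)]
            for u in range(rows)]
```

Only the two transposes move data between owners in this arrangement, which is why the pattern keeps the number of communication steps small.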
DOE Office of Scientific and Technical Information (OSTI.GOV)
Smith, W.W.; Layton, J.P.
1976-09-13
The three-volume report describes a dual-mode nuclear space power and propulsion system concept that employs an advanced solid-core nuclear fission reactor coupled via heat pipes to one of several electric power conversion systems. The NUROC3A systems analysis code was designed to provide the user with performance characteristics of the dual-mode system. Volume 3 describes utilization of the NUROC3A code to produce a detailed parameter study of the system.
The Holidays Are Coming! Time to Start Planning for Healthy Holiday Meals
... 1 medium orange, quartered and seeds removed 1 apple, cored ¾ cup to 1 cup sugar (or substitute non-sugar sweetener) Put berries, orange and apple through food processor, blender or food mill until ...
Using Multi-Core Systems for Rover Autonomy
NASA Technical Reports Server (NTRS)
Clement, Brad; Estlin, Tara; Bornstein, Benjamin; Springer, Paul; Anderson, Robert C.
2010-01-01
Task Objectives are: (1) Develop and demonstrate key capabilities for rover long-range science operations using multi-core computing, (a) Adapt three rover technologies to execute on SOA multi-core processor (b) Illustrate performance improvements achieved (c) Demonstrate adapted capabilities with rover hardware, (2) Targeting three high-level autonomy technologies (a) Two for onboard data analysis (b) One for onboard command sequencing/planning, (3) Technologies identified as enabling for future missions, (4) Benefits will be measured along several metrics: (a) Execution time / Power requirements (b) Number of data products processed per unit time (c) Solution quality
Digital Beamforming Scatterometer
NASA Technical Reports Server (NTRS)
Rincon, Rafael F.; Vega, Manuel; Kman, Luko; Buenfil, Manuel; Geist, Alessandro; Hillard, Larry; Racette, Paul
2009-01-01
This paper discusses scatterometer measurements collected with multi-mode Digital Beamforming Synthetic Aperture Radar (DBSAR) during the SMAP-VEX 2008 campaign. The 2008 SMAP Validation Experiment was conducted to address a number of specific questions related to the soil moisture retrieval algorithms. SMAP-VEX 2008 consisted of a series of aircraft-based flights conducted on the Eastern Shore of Maryland and Delaware in the fall of 2008. Several other instruments participated in the campaign including the Passive Active L-Band System (PALS), the Marshall Airborne Polarimetric Imaging Radiometer (MAPIR), and the Global Positioning System Reflectometer (GPSR). This campaign was the first SMAP Validation Experiment. DBSAR is a multimode radar system developed at NASA/Goddard Space Flight Center that combines state-of-the-art radar technologies, on-board processing, and advances in signal processing techniques in order to enable new remote sensing capabilities applicable to Earth science and planetary applications [1]. The instrument can be configured to operate in scatterometer, Synthetic Aperture Radar (SAR), or altimeter mode. The system builds upon the L-band Imaging Scatterometer (LIS) developed as part of the RadSTAR program. The radar is a phased array system designed to fly on the NASA P3 aircraft. The instrument consists of a programmable waveform generator, eight transmit/receive (T/R) channels, a microstrip antenna, and a reconfigurable data acquisition and processor system. Each transmit channel incorporates a digital attenuator and a digital phase shifter that enable amplitude and phase modulation on transmit. The attenuators, phase shifters, and calibration switches are digitally controlled by the radar control card (RCC) on a pulse by pulse basis. The antenna is a corporate fed microstrip patch-array centered at 1.26 GHz with a 20 MHz bandwidth. 
Although only one feed is used with the present configuration, a provision was made for separate corporate feeds for vertical and horizontal polarization. System upgrades to dual polarization are currently under way. The DBSAR processor is a reconfigurable data acquisition and processor system capable of real-time, high-speed data processing. DBSAR uses an FPGA-based architecture to implement digital down-conversion, in-phase and quadrature (I/Q) demodulation, and subsequent radar specific algorithms. The core of the processor board consists of an analog-to-digital (A/D) section, three Altera Stratix field programmable gate arrays (FPGAs), an ARM microcontroller, several memory devices, and an Ethernet interface. The processor also interfaces with a navigation board consisting of a GPS and a MEMS gyro. The processor has been configured to operate in scatterometer, Synthetic Aperture Radar (SAR), and altimeter modes. All the modes are based on digital beamforming, which is a digital process that generates the far-field beam patterns at various scan angles from voltages sampled in the antenna array. This technique allows steering the received beam and controlling its beamwidth and side-lobe level. Several beamforming techniques can be implemented, each characterized by unique strengths and weaknesses, and each applicable to different measurement scenarios. In Scatterometer mode, the radar is capable of generating a wide beam or scanning a narrow beam on transmit, and of steering the received beam on processing while controlling its beamwidth and side-lobe level. Table I lists some important radar characteristics.
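The idea of forming far-field beams at several scan angles from sampled element voltages can be sketched with a basic phase-shift (delay-and-sum) beamformer. This is an illustrative sketch only, not the DBSAR processing chain: the uniform linear geometry, the half-wavelength spacing used in the usage note, and all function names are assumptions; only the count of eight channels comes from the abstract.

```python
import cmath
import math

def steering_vector(n_elements, spacing_wavelengths, angle_deg):
    """Phase of a plane wave arriving from angle_deg at each element of a uniform linear array."""
    s = math.sin(math.radians(angle_deg))
    return [cmath.exp(2j * math.pi * spacing_wavelengths * k * s)
            for k in range(n_elements)]

def beamform(voltages, spacing_wavelengths, scan_angle_deg):
    """Phase-align the sampled element voltages toward scan_angle_deg and average them."""
    w = steering_vector(len(voltages), spacing_wavelengths, scan_angle_deg)
    return sum(v * wk.conjugate() for v, wk in zip(voltages, w)) / len(voltages)

def beam_pattern(voltages, spacing_wavelengths, scan_angles_deg):
    """Magnitude of the beamformer output at each requested scan angle."""
    return [abs(beamform(voltages, spacing_wavelengths, a)) for a in scan_angles_deg]
```

For example, feeding `steering_vector(8, 0.5, 20.0)` in as the sampled voltages and scanning with `beam_pattern` gives a peak at 20 degrees. Amplitude tapering of the weights, which is one common way to trade beamwidth against side-lobe level, is left out to keep the sketch short.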
NASA Astrophysics Data System (ADS)
Szplet, R.; Kalisz, J.; Jachna, Z.
2009-02-01
We present a time digitizer having 45 ps resolution, integrated in a field programmable gate array (FPGA) device. The time interval measurement is based on the two-stage interpolation method. A dual-edge two-phase interpolator is driven by the on-chip synthesized 250 MHz clock with precise phase adjustment. An improved dual-edge double synchronizer was developed to control the main counter. The nonlinearity of the digitizer's transfer characteristic is identified and utilized by the dedicated hardware code processor for the on-the-fly correction of the output data. Application of presented ideas has resulted in the measurement uncertainty of the digitizer below 70 ps RMS over the time interval ranging from 0 to 1 s. The use of the two-stage interpolation and a fast FIFO memory has allowed us to obtain the maximum measurement rate of five million measurements per second.
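How a coarse counter and fine interpolators combine into one time-interval reading can be sketched as below. This follows the classic counter-plus-interpolator scheme and is not the authors' design: the function name and the lookup-table values are hypothetical, and the table only stands in for the kind of nonlinearity correction the abstract describes. The 4 ns clock period follows from the 250 MHz clock quoted in the abstract.

```python
CLOCK_PERIOD_NS = 4.0  # period of the 250 MHz clock quoted in the abstract

def interval_ns(coarse_count, start_code, stop_code, calibration_lut_ns):
    """Combine a coarse clock count with two fine interpolator readings.

    coarse_count: clock periods counted between the first clock edges
                  following the start and stop events.
    start_code / stop_code: raw interpolator codes measuring the time from
                  each event to its next clock edge.
    calibration_lut_ns: maps each raw code to a calibrated time in ns, so a
                  nonuniform transfer characteristic is corrected by lookup.
    """
    fine_start = calibration_lut_ns[start_code]
    fine_stop = calibration_lut_ns[stop_code]
    return coarse_count * CLOCK_PERIOD_NS + fine_start - fine_stop

# Hypothetical calibration table with deliberately uneven steps (values in ns).
EXAMPLE_LUT_NS = [0.0, 0.05, 0.09, 0.14, 0.18, 0.23]
```

With the example table, `interval_ns(3, 5, 2, EXAMPLE_LUT_NS)` combines three clock periods with the calibrated fine times for codes 5 and 2.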
Web-based DAQ systems: connecting the user and electronics front-ends
NASA Astrophysics Data System (ADS)
Lenzi, Thomas
2016-12-01
Web technologies are quickly evolving and are gaining in computational power and flexibility, allowing for a paradigm shift in the field of Data Acquisition (DAQ) systems design. Modern web browsers offer the possibility to create intricate user interfaces and are able to process and render complex data. Furthermore, new web standards such as WebSockets allow for fast real-time communication between the server and the user with minimal overhead. Those improvements make it possible to move the control and monitoring operations from the back-end servers directly to the user and to the front-end electronics, thus reducing the complexity of the data acquisition chain. Moreover, web-based DAQ systems offer greater flexibility, accessibility, and maintainability on the user side than traditional applications which often lack portability and ease of use. As proof of concept, we implemented a simplified DAQ system on a mid-range Spartan6 Field Programmable Gate Array (FPGA) development board coupled to a digital front-end readout chip. The system is connected to the Internet and can be accessed from any web browser. It is composed of custom code to control the front-end readout and of a dual soft-core Microblaze processor to communicate with the client.
Embedded System Implementation on FPGA System With μCLinux OS
NASA Astrophysics Data System (ADS)
Fairuz Muhd Amin, Ahmad; Aris, Ishak; Syamsul Azmir Raja Abdullah, Raja; Kalos Zakiah Sahbudin, Ratna
2011-02-01
Embedded systems are taking on more complicated tasks as the processors involved become more powerful. Embedded systems are widely used in many areas, such as industry, automotive, medical imaging, communications, speech recognition and computer vision. The complexity of current hardware and software requirements calls for a flexible system that can be enhanced without adding new hardware; otherwise, any change in the system design may require the processor to be changed. To overcome this problem, a System On Programmable Chip (SOPC) has been designed and developed using a Field Programmable Gate Array (FPGA). A soft-core processor, the NIOS II 32-bit RISC microprocessor core, was used in the FPGA system together with the embedded operating system (OS) μClinux. In this paper, an example web server is explained and demonstrated.
Dual Purkinje-Image Eyetracker
1996-01-01
Abnormal nystagmus can also be detected through the use of an eyetracker [4]. Through tracking points of eye gaze within a scene, it is possible to...moving, even when gazing. Correcting for these unpredictable micro eye movements would allow corrective procedures in eye surgery to become more accurate...victim with a screen of letters on a monitor. A calibrated eyetracker then provides a processor with information about the location of eye gaze. The
Electrical start-up for diesel fuel processing in a fuel-cell-based auxiliary power unit
NASA Astrophysics Data System (ADS)
Samsun, Remzi Can; Krupp, Carsten; Tschauder, Andreas; Peters, Ralf; Stolten, Detlef
2016-01-01
As auxiliary power units in trucks and aircraft, fuel cell systems with a diesel and kerosene reforming capacity offer the dual benefit of reduced emissions and fuel consumption. In order to be commercially viable, these systems require a quick start-up time with low energy input. In pursuit of this end, this paper reports an electrical start-up strategy for diesel fuel processing. A transient computational fluid dynamics model is developed to optimize the start-up procedure of the fuel processor in the 28 kWth power class. The temperature trend observed in the experiments is reproducible to a high degree of accuracy using a dual-cell approach in ANSYS Fluent. Starting from a basic strategy, different options are considered for accelerating system start-up. The start-up time is reduced from 22 min in the basic case to 9.5 min, at an energy consumption of 0.4 kW h. Furthermore, an electrical wire is installed in the reformer to test the steam generation during start-up. The experimental results reveal that the generation of steam at 450 °C is possible within seconds after water addition to the reformer. As a result, the fuel processor can be started in autothermal reformer mode using the electrical concept developed in this work.
Design and implementation of a random neural network routing engine.
Kocak, T; Seeber, J; Terzioglu, H
2003-01-01
The random neural network (RNN) is an analytically tractable spiked neural network model that has been implemented in software for a wide range of applications for over a decade. This paper presents the hardware implementation of the RNN model. Recently, cognitive packet networks (CPN) have been proposed as an alternative packet network architecture in which there is no routing table; instead, RNN-based reinforcement learning is used to route packets. In particular, we describe implementation details for the RNN-based routing engine of a CPN network processor chip: the smart packet processor (SPP). The SPP is a dual port device that stores, modifies, and interprets the defining characteristics of multiple RNN models. In addition to hardware design improvements over the software implementation such as the dual access memory, output calculation step, and reduced output calculation module, this paper introduces a major modification to the reinforcement learning algorithm used in the original CPN specification such that the number of weight terms is reduced from 2n^2 to 2n. This not only yields significant memory savings, but it also simplifies the calculations for the steady state probabilities (neuron outputs in RNN). Simulations have been conducted to confirm the proper functionality for the isolated SPP design as well as for multiple SPPs in a networked environment.
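For readers unfamiliar with the steady-state probabilities mentioned above, the following sketch iterates the standard fixed-point equations of the random neural network in its original full-matrix form, where the two n-by-n weight matrices account for the 2n^2 weight terms. It is not the paper's reduced 2n-term algorithm or its hardware design; the function name, the iteration count and the assumption that each neuron's firing rate equals the sum of its outgoing weights are illustrative choices.

```python
def rnn_steady_state(w_plus, w_minus, ext_excitation, ext_inhibition, iterations=200):
    """Fixed-point iteration for the neuron excitation probabilities q_i.

    w_plus[i][j] / w_minus[i][j]: excitatory / inhibitory weight from neuron i to j.
    ext_excitation[i] / ext_inhibition[i]: external arrival rates at neuron i.
    Assumes every neuron has a positive total outgoing weight.
    """
    n = len(ext_excitation)
    # Firing rate of each neuron, taken as the sum of its outgoing weights.
    rates = [sum(w_plus[i]) + sum(w_minus[i]) for i in range(n)]
    q = [0.0] * n
    for _ in range(iterations):
        updated = []
        for i in range(n):
            excitation = ext_excitation[i] + sum(q[j] * w_plus[j][i] for j in range(n))
            inhibition = ext_inhibition[i] + sum(q[j] * w_minus[j][i] for j in range(n))
            # Probabilities are capped at 1 (a saturated neuron).
            updated.append(min(1.0, excitation / (rates[i] + inhibition)))
        q = updated
    return q
```

For a network of n neurons this form stores 2n^2 weights; with n = 16, for instance, that is 512 values, against 32 for a 2n-term scheme.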
Many-core computing for space-based stereoscopic imaging
NASA Astrophysics Data System (ADS)
McCall, Paul; Torres, Gildo; LeGrand, Keith; Adjouadi, Malek; Liu, Chen; Darling, Jacob; Pernicka, Henry
The potential benefits of using parallel computing in real-time visual-based satellite proximity operations missions are investigated. Improvements in performance and relative navigation solutions over single thread systems can be achieved through multi- and many-core computing. Stochastic relative orbit determination methods benefit from the higher measurement frequencies, allowing them to more accurately determine the associated statistical properties of the relative orbital elements. More accurate orbit determination can lead to reduced fuel consumption and extended mission capabilities and duration. Inherent to the process of stereoscopic image processing is the difficulty of loading, managing, parsing, and evaluating large amounts of data efficiently, which may result in delays or highly time consuming processes for single (or few) processor systems or platforms. In this research we utilize the Single-Chip Cloud Computer (SCC), a fully programmable 48-core experimental processor, created by Intel Labs as a platform for many-core software research, provided with a high-speed on-chip network for sharing information along with advanced power management technologies and support for message-passing. The results from utilizing the SCC platform for the stereoscopic image processing application are presented in the form of Performance, Power, Energy, and Energy-Delay-Product (EDP) metrics. Also, a comparison between the SCC results and those obtained from executing the same application on a commercial PC is presented, showing the potential benefits of utilizing the SCC in particular, and many-core platforms in general, for real-time processing of visual-based satellite proximity operations missions.
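The energy and Energy-Delay-Product metrics named in this abstract follow directly from average power and runtime. The short sketch below shows the arithmetic; the function names are hypothetical and no measured values from the study are used.

```python
def energy_joules(avg_power_watts, runtime_s):
    """Energy consumed by a run: average power multiplied by runtime."""
    return avg_power_watts * runtime_s

def energy_delay_product(avg_power_watts, runtime_s):
    """EDP: energy multiplied by runtime again, so slow runs are penalized twice."""
    return energy_joules(avg_power_watts, runtime_s) * runtime_s
```

As a made-up illustration, a 25 W platform that needs 4 s uses 100 J (EDP 400 J·s), while a 100 W platform that needs 0.5 s uses 50 J (EDP 25 J·s). EDP weights runtime more heavily than energy alone, which is why it is often reported alongside power and energy when platforms differ greatly in speed.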
CTF Preprocessor User's Manual
DOE Office of Scientific and Technical Information (OSTI.GOV)
Avramova, Maria; Salko, Robert K.
2016-05-26
This document describes how a user should go about using the CTF pre-processor tool to create an input deck for modeling rod-bundle geometry in CTF. The tool was designed to generate input decks in a quick and less error-prone manner for CTF. The pre-processor is a completely independent utility, written in Fortran, that takes a reduced amount of input from the user. The information that the user must supply is basic information on bundle geometry, such as rod pitch, clad thickness, and axial location of spacer grids--the pre-processor takes this basic information and determines channel placement and connection information to be written to the input deck, which is the most time-consuming and error-prone segment of creating a deck. Creation of the model is also more intuitive, as the user can specify assembly and water-tube placement using visual maps instead of having to place them by determining channel/channel and rod/channel connections. As an example of the benefit of the pre-processor, a quarter-core model that contains 500,000 scalar-mesh cells was read into CTF from an input deck containing 200,000 lines of data. This 200,000 line input deck was produced automatically from a set of pre-processor decks that contained only 300 lines of data.
NASA Astrophysics Data System (ADS)
Needham, Perri J.; Bhuiyan, Ashraf; Walker, Ross C.
2016-04-01
We present an implementation of explicit solvent particle mesh Ewald (PME) classical molecular dynamics (MD) within the PMEMD molecular dynamics engine, which forms part of the AMBER v14 MD software package, that makes use of Intel Xeon Phi coprocessors by offloading portions of the PME direct summation and neighbor list build to the coprocessor. We refer to this implementation as pmemd MIC offload and in this paper present the technical details of the algorithm, including basic models for MPI and OpenMP configuration, and analyze the resultant performance. The algorithm provides the best performance improvement for large systems (>400,000 atoms), achieving a ∼35% performance improvement for satellite tobacco mosaic virus (1,067,095 atoms) when 2 Intel E5-2697 v2 processors (2 × 12 cores, 30M cache, 2.7 GHz) are coupled to an Intel Xeon Phi coprocessor (Model 7120P-1.238/1.333 GHz, 61 cores). The implementation utilizes a two-fold decomposition strategy: spatial decomposition using an MPI library and thread-based decomposition using OpenMP. We also present compiler optimization settings that improve the performance on Intel Xeon processors, while retaining simulation accuracy.
Lee, Anthony; Yau, Christopher; Giles, Michael B.; Doucet, Arnaud; Holmes, Christopher C.
2011-01-01
We present a case-study on the utility of graphics cards to perform massively parallel simulation of advanced Monte Carlo methods. Graphics cards, containing multiple Graphics Processing Units (GPUs), are self-contained parallel computational devices that can be housed in conventional desktop and laptop computers and can be thought of as prototypes of the next generation of many-core processors. For certain classes of population-based Monte Carlo algorithms they offer massively parallel simulation, with the added advantage over conventional distributed multi-core processors that they are cheap, easily accessible, easy to maintain, easy to code, dedicated local devices with low power consumption. On a canonical set of stochastic simulation examples including population-based Markov chain Monte Carlo methods and Sequential Monte Carlo methods, we find speedups from 35 to 500 fold over conventional single-threaded computer code. Our findings suggest that GPUs have the potential to facilitate the growth of statistical modelling into complex data rich domains through the availability of cheap and accessible many-core computation. We believe the speedup we observe should motivate wider use of parallelizable simulation methods and greater methodological attention to their design. PMID:22003276
An Augmented Lagrangian Filter Method for Real-Time Embedded Optimization
Chiang, Nai -Yuan; Huang, Rui; Zavala, Victor M.
2017-04-17
We present a filter line-search algorithm for nonconvex continuous optimization that combines an augmented Lagrangian function and a constraint violation metric to accept and reject steps. The approach is motivated by real-time optimization applications that need to be executed on embedded computing platforms with limited memory and processor speeds. The proposed method enables primal–dual regularization of the linear algebra system that in turn permits the use of solution strategies with lower computing overheads. We prove that the proposed algorithm is globally convergent and we demonstrate the developments using a nonconvex real-time optimization application for a building heating, ventilation, and air conditioning system. Our numerical tests are performed on a standard processor and on an embedded platform. Lastly, we demonstrate that the approach reduces solution times by a factor of over 1000.
An Augmented Lagrangian Filter Method for Real-Time Embedded Optimization
DOE Office of Scientific and Technical Information (OSTI.GOV)
Chiang, Nai -Yuan; Huang, Rui; Zavala, Victor M.
We present a filter line-search algorithm for nonconvex continuous optimization that combines an augmented Lagrangian function and a constraint violation metric to accept and reject steps. The approach is motivated by real-time optimization applications that need to be executed on embedded computing platforms with limited memory and processor speeds. The proposed method enables primal–dual regularization of the linear algebra system that in turn permits the use of solution strategies with lower computing overheads. We prove that the proposed algorithm is globally convergent and we demonstrate the developments using a nonconvex real-time optimization application for a building heating, ventilation, and air conditioning system. Our numerical tests are performed on a standard processor and on an embedded platform. Lastly, we demonstrate that the approach reduces solution times by a factor of over 1000.
Combustor air flow control method for fuel cell apparatus
Clingerman, Bruce J.; Mowery, Kenneth D.; Ripley, Eugene V.
2001-01-01
A method for controlling the heat output of a combustor in a fuel cell apparatus to a fuel processor where the combustor has dual air inlet streams including atmospheric air and fuel cell cathode effluent containing oxygen depleted air. In all operating modes, an enthalpy balance is provided by regulating the quantity of the air flow stream to the combustor to support fuel cell processor heat requirements. A control provides a quick fast forward change in an air valve orifice cross section in response to a calculated predetermined air flow, the molar constituents of the air stream to the combustor, the pressure drop across the air valve, and a look up table of the orifice cross sectional area and valve steps. A feedback loop fine tunes any error between the measured air flow to the combustor and the predetermined air flow.
Multicore Education through Simulation
ERIC Educational Resources Information Center
Ozturk, O.
2011-01-01
A project-oriented course for advanced undergraduate and graduate students is described for simulating multiple processor cores. Simics, a free simulator for academia, was utilized to enable students to explore computer architecture, operating systems, and hardware/software cosimulation. Motivation for including this course in the curriculum is…
Dense and Sparse Matrix Operations on the Cell Processor
DOE Office of Scientific and Technical Information (OSTI.GOV)
Williams, Samuel W.; Shalf, John; Oliker, Leonid
2005-05-01
The slowing pace of commodity microprocessor performance improvements combined with ever-increasing chip power demands has become of utmost concern to computational scientists. Therefore, the high performance computing community is examining alternative architectures that address the limitations of modern superscalar designs. In this work, we examine STI's forthcoming Cell processor: a novel, low-power architecture that combines a PowerPC core with eight independent SIMD processing units coupled with a software-controlled memory to offer high FLOP/s/Watt. Since neither Cell hardware nor cycle-accurate simulators are currently publicly available, we develop an analytic framework to predict Cell performance on dense and sparse matrix operations, using a variety of algorithmic approaches. Results demonstrate Cell's potential to deliver more than an order of magnitude better GFLOP/s per watt performance, when compared with the Intel Itanium2 and Cray X1 processors.
GPU accelerated dynamic functional connectivity analysis for functional MRI data.
Akgün, Devrim; Sakoğlu, Ünal; Esquivel, Johnny; Adinoff, Bryon; Mete, Mutlu
2015-07-01
Recent advances in multi-core processors and graphics card based computational technologies have paved the way for an improved and dynamic utilization of parallel computing techniques. Numerous applications have been implemented for the acceleration of computationally-intensive problems in various computational science fields including bioinformatics, in which big data problems are prevalent. In neuroimaging, dynamic functional connectivity (DFC) analysis is a computationally demanding method used to investigate dynamic functional interactions among different brain regions or networks identified with functional magnetic resonance imaging (fMRI) data. In this study, we implemented and analyzed a parallel DFC algorithm based on thread-based and block-based approaches. The thread-based approach was designed to parallelize DFC computations and was implemented in both Open Multi-Processing (OpenMP) and Compute Unified Device Architecture (CUDA) programming platforms. Another approach developed in this study to better utilize CUDA architecture is the block-based approach, where parallelization involves smaller parts of fMRI time-courses obtained by sliding-windows. Experimental results showed that the proposed parallel design solutions enabled by the GPUs significantly reduce the computation time for DFC analysis. Multicore implementation using OpenMP on 8-core processor provides up to 7.7× speed-up. GPU implementation using CUDA yielded substantial accelerations ranging from 18.5× to 157× speed-up once thread-based and block-based approaches were combined in the analysis. Proposed parallel programming solutions showed that multi-core processor and CUDA-supported GPU implementations accelerated the DFC analyses significantly. Developed algorithms make the DFC analyses more practical for multi-subject studies with more dynamic analyses. Copyright © 2015 Elsevier Ltd. All rights reserved.
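The sliding-window computation that this abstract parallelizes can be illustrated with a short serial sketch. This is only an assumption-laden illustration, not the authors' code: the function names `pearson` and `sliding_window_correlation` are hypothetical, and Pearson correlation is assumed as the connectivity measure. Each window is independent of the others, which is what makes the work amenable to OpenMP threads or CUDA blocks.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def sliding_window_correlation(x, y, window, step=1):
    """Correlate two time-courses over successive windows.

    Hypothetical helper: every window is computed independently,
    so the list comprehension below is the loop a parallel version
    would distribute across threads or blocks.
    """
    if len(x) != len(y):
        raise ValueError("time-courses must have equal length")
    if window < 2 or window > len(x):
        raise ValueError("window must be between 2 and the series length")
    starts = range(0, len(x) - window + 1, step)
    return [pearson(x[s:s + window], y[s:s + window]) for s in starts]


if __name__ == "__main__":
    region_a = [1.0, 2.0, 3.0, 4.0, 5.0]
    region_b = [2.0, 4.0, 6.0, 8.0, 10.0]
    print(sliding_window_correlation(region_a, region_b, window=3))
```

With many region pairs and many windows, the number of independent `pearson` calls grows quickly, which is consistent with the abstract's motivation for GPU acceleration.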
Flow solution on a dual-block grid around an airplane
NASA Technical Reports Server (NTRS)
Eriksson, Lars-Erik
1987-01-01
The compressible flow around a complex fighter-aircraft configuration (fuselage, cranked delta wing, canard, and inlet) is simulated numerically using a novel grid scheme and a finite-volume Euler solver. The patched dual-block grid is generated by an algebraic procedure based on transfinite interpolation, and the explicit Runge-Kutta time-stepping Euler solver is implemented with a high degree of vectorization on a Cyber 205 processor. Results are presented in extensive graphs and diagrams and characterized in detail. The concentration of grid points near the wing apex in the present scheme is shown to facilitate capture of the vortex generated by the leading edge at high angles of attack and modeling of its interaction with the canard wake.
Orthorectification by Using GPGPU Method
NASA Astrophysics Data System (ADS)
Sahin, H.; Kulur, S.
2012-07-01
Thanks to the nature of graphics processing, newly released products offer highly parallel processing units with high memory bandwidth and computational power of more than a teraflop. Modern GPUs are not only powerful graphics engines but also highly parallel programmable processors with very fast computing capabilities and high memory bandwidth compared to central processing units (CPUs). Data-parallel computation can be briefly described as mapping data elements to parallel processing threads. The rapid development of GPU programmability and capabilities has attracted the attention of researchers dealing with complex problems that need high-level calculations. This interest has given rise to the concepts of "General Purpose Computation on Graphics Processing Units (GPGPU)" and "stream processing". Graphics processors are powerful hardware that is cheap and affordable, so they have become an alternative to computer processors. Graphics chips, which were once standard application hardware, have been transformed into modern, powerful and programmable processors to meet general needs. Especially in recent years, the use of graphics processing units in general-purpose computation has led researchers and developers to this point. The biggest problem is that graphics processing units use programming models that differ from current programming methods. Efficient GPU programming therefore requires re-coding the current program algorithm with the limitations and structure of the graphics hardware in mind. Currently, multi-core processors cannot be programmed using traditional programming methods, and the event-procedure programming method cannot be used for them. GPUs are especially effective when the same computing steps are repeated for many data elements and high accuracy is needed.
They thus complete the computation more quickly and accurately. CPUs, which perform only one computation at a time under flow control, are slower by comparison. This structure can be exploited in various applications of computer technology. This study covers how general-purpose parallel programming and the computational power of GPUs can be used in photogrammetric applications, especially direct georeferencing. The direct georeferencing algorithm is coded using the GPGPU method and the CUDA (Compute Unified Device Architecture) programming language. The results of this method were compared with traditional CPU programming. In a second application, projective rectification is coded using the GPGPU method and the CUDA programming language. Sample images of various sizes were evaluated and the results of the program compared. The GPGPU method is especially useful where the same computations are repeated on highly dense data, so that the solution is found quickly.
Neural simulations on multi-core architectures.
Eichner, Hubert; Klug, Tobias; Borst, Alexander
2009-01-01
Neuroscience is witnessing increasing knowledge about the anatomy and electrophysiological properties of neurons and their connectivity, leading to an ever increasing computational complexity of neural simulations. At the same time, a rather radical change in personal computer technology emerges with the establishment of multi-cores: high-density, explicitly parallel processor architectures for both high performance as well as standard desktop computers. This work introduces strategies for the parallelization of biophysically realistic neural simulations based on the compartmental modeling technique and results of such an implementation, with a strong focus on multi-core architectures and automation, i.e. user-transparent load balancing.
Neural Simulations on Multi-Core Architectures
Eichner, Hubert; Klug, Tobias; Borst, Alexander
2009-01-01
Neuroscience is witnessing increasing knowledge about the anatomy and electrophysiological properties of neurons and their connectivity, leading to an ever increasing computational complexity of neural simulations. At the same time, a rather radical change in personal computer technology emerges with the establishment of multi-cores: high-density, explicitly parallel processor architectures for both high performance as well as standard desktop computers. This work introduces strategies for the parallelization of biophysically realistic neural simulations based on the compartmental modeling technique and results of such an implementation, with a strong focus on multi-core architectures and automation, i.e. user-transparent load balancing. PMID:19636393
A VHDL Core for Intrinsic Evolution of Discrete Time Filters with Signal Feedback
NASA Technical Reports Server (NTRS)
Gwaltney, David A.; Dutton, Kenneth
2005-01-01
The design of an Evolvable Machine VHDL Core is presented, representing a discrete-time processing structure capable of supporting control system applications. This VHDL Core is implemented in an FPGA and is interfaced with an evolutionary algorithm implemented in firmware on a Digital Signal Processor (DSP) to create an evolvable system platform. The salient features of this architecture are presented. The capability to implement IIR filter structures is presented along with the results of the intrinsic evolution of a filter. The robustness of the evolved filter design is tested and its unique characteristics are described.
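For readers unfamiliar with the IIR filter structures mentioned in this abstract, the sketch below shows the standard discrete-time difference equation with signal feedback. It is a generic textbook form, not the VHDL core described in the record; the function name `iir_filter` and the example coefficients are assumptions chosen for illustration. In an evolvable system, the coefficient lists would presumably be the quantities an evolutionary algorithm searches over.

```python
def iir_filter(b, a, x):
    """Apply y[n] = (sum_k b[k]*x[n-k] - sum_{k>=1} a[k]*y[n-k]) / a[0].

    b: feedforward coefficients, a: feedback coefficients (a[0] != 0),
    x: input samples. Samples before n = 0 are treated as zero.
    """
    if not a or a[0] == 0:
        raise ValueError("a[0] must be non-zero")
    y = []
    for n in range(len(x)):
        # Feedforward part: weighted sum of current and past inputs.
        acc = sum(b[k] * x[n - k] for k in range(len(b)) if n - k >= 0)
        # Feedback part: weighted sum of past outputs (the signal feedback).
        acc -= sum(a[k] * y[n - k] for k in range(1, len(a)) if n - k >= 0)
        y.append(acc / a[0])
    return y


if __name__ == "__main__":
    # One-pole low-pass example: y[n] = 0.5*x[n] + 0.5*y[n-1]
    impulse = [1.0, 0.0, 0.0, 0.0]
    print(iir_filter([0.5], [1.0, -0.5], impulse))
```

The impulse response of this example decays geometrically and never reaches exactly zero in exact arithmetic, which is the defining property of an infinite impulse response filter.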
Benchmarking and tuning the MILC code on clusters and supercomputers
NASA Astrophysics Data System (ADS)
Gottlieb, Steven
2002-03-01
Recently, we have benchmarked and tuned the MILC code on a number of architectures including Intel Itanium and Pentium IV (PIV), dual-CPU Athlon, and the latest Compaq Alpha nodes. Results will be presented for many of these, and we shall discuss some simple code changes that can result in a very dramatic speedup of the KS conjugate gradient on processors with more advanced memory systems such as PIV, IBM SP and Alpha.
Cache Sharing and Isolation Tradeoffs in Multicore Mixed-Criticality Systems
2015-05-01
of lockdown registers, to provide way-based partitioning. These alternatives are illustrated in Fig. 1 with respect to a quad-core ARM Cortex A9...presented a cache-partitioning scheme that allows multiple tasks to share the same cache partition on a single processor (as we do for Level-A and...sets and determined the fraction that were schedulable on our target hardware platform, the quad-core ARM Cortex A9 machine mentioned earlier, the LLC
Graphics Processing Unit (GPU) Acceleration of the Goddard Earth Observing System Atmospheric Model
NASA Technical Reports Server (NTRS)
Putnam, William
2011-01-01
The Goddard Earth Observing System 5 (GEOS-5) is the atmospheric model used by the Global Modeling and Assimilation Office (GMAO) for a variety of applications, from long-term climate prediction at relatively coarse resolution, to data assimilation and numerical weather prediction, to very high-resolution cloud-resolving simulations. GEOS-5 is being ported to a graphics processing unit (GPU) cluster at the NASA Center for Climate Simulation (NCCS). By utilizing GPU co-processor technology, we expect to increase the throughput of GEOS-5 by at least an order of magnitude, and accelerate the process of scientific exploration across all scales of global modeling, including: the large-scale, high-end application of non-hydrostatic, global, cloud-resolving modeling at 10- to 1-kilometer (km) global resolutions; intermediate-resolution seasonal climate and weather prediction at 50- to 25-km on small clusters of GPUs; and long-range, coarse-resolution climate modeling, enabled on a small box of GPUs for the individual researcher. After being ported to the GPU cluster, the primary physics components and the dynamical core of GEOS-5 have demonstrated a potential speedup of 15-40 times over conventional processor cores. Performance improvements of this magnitude reduce the required scalability of 1-km, global, cloud-resolving models from an unfathomable 6 million cores to an attainable 200,000 GPU-enabled cores.
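The core counts quoted in this abstract can be checked against the quoted speedup range with a line of arithmetic. The sketch below assumes, as a simplification not stated in the abstract, that the reduction in required cores is a direct per-core speedup ratio; the function name `implied_speedup` is hypothetical.

```python
def implied_speedup(conventional_cores, gpu_enabled_cores):
    """Ratio of conventional cores to GPU-enabled cores for the same workload.

    Assumption: the workload scales so that this ratio can be read
    as an effective per-core speedup.
    """
    return conventional_cores / gpu_enabled_cores


if __name__ == "__main__":
    ratio = implied_speedup(6_000_000, 200_000)
    # The abstract reports a potential speedup of 15-40 times.
    print(ratio, 15 <= ratio <= 40)
```

Under that assumption the two figures are mutually consistent: the implied ratio of 30 falls inside the reported 15-40 range.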
Multitasking 3-D forward modeling using high-order finite difference methods on the Cray X-MP/416
DOE Office of Scientific and Technical Information (OSTI.GOV)
Terki-Hassaine, O.; Leiss, E.L.
1988-01-01
The CRAY X-MP/416 was used to multitask 3-D forward modeling by the high-order finite difference method. Flowtrace analysis reveals that the most expensive operation in the unitasked program is a matrix vector multiplication. The in-core and out-of-core versions of a reentrant subroutine can perform any fraction of the matrix vector multiplication independently, a pattern compatible with multitasking. The matrix vector multiplication routine can be distributed over two to four processors. The rest of the program utilizes the microtasking feature that lets the system treat independent iterations of DO-loops as subtasks to be performed by any available processor. The availability of the Solid-State Storage Device (SSD) meant the I/O wait time was virtually zero. A performance study determined a theoretical speedup, taking into account the multitasking overhead. Multitasking programs utilizing both macrotasking and microtasking features obtained actual speedups that were approximately 80% of the ideal speedup.
Accelerating Demand Paging for Local and Remote Out-of-Core Visualization
NASA Technical Reports Server (NTRS)
Ellsworth, David
2001-01-01
This paper describes a new algorithm that improves the performance of application-controlled demand paging for the out-of-core visualization of data sets that are on either local disks or disks on remote servers. The performance improvements come from better overlapping the computation with the page reading process, and by performing multiple page reads in parallel. The new algorithm can be applied to many different visualization algorithms since application-controlled demand paging is not specific to any visualization algorithm. The paper includes measurements that show that the new multi-threaded paging algorithm decreases the time needed to compute visualizations by one third when using one processor and reading data from local disk. The time needed when using one processor and reading data from remote disk decreased by up to 60%. Visualization runs using data from remote disk ran about as fast as ones using data from local disk because the remote runs were able to make use of the remote server's high performance disk array.
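The two ideas named in this abstract, issuing several page reads in parallel and overlapping reads with computation, can be sketched with a thread pool. This is a generic illustration, not the paper's algorithm: `read_page` is a hypothetical stand-in for a local or remote disk read, and `process` is a placeholder for the visualization work done on each page.

```python
from concurrent.futures import ThreadPoolExecutor
import time


def read_page(page_id):
    """Hypothetical page read; the sleep stands in for disk or network latency."""
    time.sleep(0.01)
    return [page_id] * 4


def process(page):
    """Placeholder for the per-page visualization computation."""
    return sum(page)


def visualize(page_ids, readers=4):
    """Issue page reads on worker threads and compute as each page arrives.

    While the main thread processes page i, reads of later pages are
    already in flight, so I/O wait and computation overlap.
    """
    with ThreadPoolExecutor(max_workers=readers) as pool:
        pending = [pool.submit(read_page, p) for p in page_ids]
        return [process(f.result()) for f in pending]


if __name__ == "__main__":
    print(visualize([0, 1, 2, 3]))
```

A real demand-paging system would request only the pages the application is about to touch rather than a fixed list; the fixed list here keeps the sketch short.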
MetAlign 3.0: performance enhancement by efficient use of advances in computer hardware.
Lommen, Arjen; Kools, Harrie J
2012-08-01
A new, multi-threaded version of the GC-MS and LC-MS data processing software, metAlign, has been developed which is able to utilize multiple cores on one PC. This new version was tested using three different multi-core PCs with different operating systems. The performance of noise reduction, baseline correction and peak-picking was 8-19 fold faster compared to the previous version on a single core machine from 2008. The alignment was 5-10 fold faster. Factors influencing the performance enhancement are discussed. Our observations show that performance scales with the increase in processor core numbers we currently see in consumer PC hardware development.
Controlled release liquid dosage formulation
Benton, Ben F.; Gardner, David L.
1989-01-01
A liquid dual coated dosage formulation sustained release pharmaceutic having substantial shelf life prior to ingestion is disclosed. A dual coating is applied over controlled release cores to form dosage forms and the coatings comprise fats melting at less than approximately 101 °F overcoated with cellulose acetate phthalate or zein. The dual coated dosage forms are dispersed in a sugar based acidic liquid carrier such as high fructose corn syrup and display a shelf life of up to approximately at least 45 days while still retaining their release profiles following ingestion. Cellulose acetate phthalate coated dosage form cores can in addition be dispersed in aqueous liquids of pH <5.
Early MIMD experience on the CRAY X-MP
NASA Astrophysics Data System (ADS)
Rhoades, Clifford E.; Stevens, K. G.
1985-07-01
This paper describes some early experience with converting four physics simulation programs to the CRAY X-MP, a current Multiple Instruction, Multiple Data (MIMD) computer consisting of two processors each with an architecture similar to that of the CRAY-1. As a multi-processor, the CRAY X-MP together with the high speed Solid-state Storage Device (SSD) is an ideal machine upon which to study MIMD algorithms for solving the equations of mathematical physics because it is fast enough to run real problems. The computer programs used in this study are all FORTRAN versions of original production codes. They range in sophistication from a one-dimensional numerical simulation of collisionless plasma to a two-dimensional hydrodynamics code with heat flow to a couple of three-dimensional fluid dynamics codes with varying degrees of viscous modeling. Early research with a dual processor configuration has shown speed-ups ranging from 1.55 to 1.98. It has been observed that a few simple extensions to FORTRAN allow a typical programmer to achieve a remarkable level of efficiency. These extensions involve the concept of memory local to a concurrent subprogram and memory common to all concurrent subprograms.
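The reported dual-processor speed-ups of 1.55 to 1.98 can be put in context with Amdahl's law. The sketch below is an interpretation aid only: the abstract does not report parallel fractions, and Amdahl's model ignores synchronization and memory-contention overheads. The function names are hypothetical.

```python
def amdahl_speedup(parallel_fraction, processors):
    """Amdahl's law: S = 1 / ((1 - f) + f / p)."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / processors)


def parallel_fraction_from_speedup(speedup, processors):
    """Invert Amdahl's law to estimate the parallel fraction f from S and p."""
    return (1.0 - 1.0 / speedup) / (1.0 - 1.0 / processors)


def parallel_efficiency(speedup, processors):
    """Fraction of the ideal p-fold speedup actually achieved."""
    return speedup / processors


if __name__ == "__main__":
    for s in (1.55, 1.98):
        f = parallel_fraction_from_speedup(s, 2)
        print(f"speedup {s}: efficiency {parallel_efficiency(s, 2):.3f}, "
              f"implied parallel fraction {f:.3f}")
```

Under this idealized model, the lower speed-up corresponds to roughly 71% of the run time being parallelizable and the higher one to roughly 99%.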
Real-Time Data Processing in the muon system of the D0 detector.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Neeti Parashar et al.
2001-07-03
This paper presents a real-time application of the 16-bit fixed point Digital Signal Processors (DSPs), in the Muon System of the D0 detector located at the Fermilab Tevatron, presently the world's highest-energy hadron collider. As part of the Upgrade for a run beginning in the year 2000, the system is required to process data at an input event rate of 10 kHz without incurring significant deadtime in readout. The ADSP21csp01 processor has high I/O bandwidth, single cycle instruction execution and fast task switching support to provide efficient multisignal processing. The processor's internal memory consists of 4K words of Program Memory and 4K words of Data Memory. In addition there is an external memory of 32K words for general event buffering and 16K words of Dual port Memory for input data queuing. This DSP fulfills the requirement of the Muon subdetector systems for data readout. All error handling, buffering, formatting and transferring of the data to the various trigger levels of the data acquisition system is done in software. The algorithms developed for the system complete these tasks in about 20 µs per event.
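The timing margin implied by the two figures in this abstract (a 10 kHz event rate and about 20 µs of processing per event) can be worked out directly. The sketch assumes evenly spaced events, which the abstract does not claim; real arrivals fluctuate, which is presumably why the design includes event buffering. The function names are hypothetical.

```python
def event_period_us(rate_hz):
    """Average time between events, in microseconds."""
    return 1_000_000 / rate_hz


def utilization(processing_us, rate_hz):
    """Fraction of each event period spent processing (evenly spaced events assumed)."""
    return processing_us / event_period_us(rate_hz)


if __name__ == "__main__":
    period = event_period_us(10_000)   # 10 kHz input event rate
    load = utilization(20, 10_000)     # about 20 us of processing per event
    print(f"period {period} us, utilization {load:.0%}")
```

On these assumptions the DSP is busy for about one fifth of each event period, which is consistent with the stated goal of avoiding significant readout deadtime.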
Center for Technology for Advanced Scientific Component Software (TASCS)
DOE Office of Scientific and Technical Information (OSTI.GOV)
Govindaraju, Madhusudhan
Advanced Scientific Computing Research Computer Science FY 2010 Report. Center for Technology for Advanced Scientific Component Software: Distributed CCA. State University of New York, Binghamton, NY, 13902.
Summary: The overall objective of Binghamton's involvement is to work on enhancements of the CCA environment, motivated by the applications and research initiatives discussed in the proposal. This year we are working on re-focusing our design and development efforts to develop proof-of-concept implementations that have the potential to significantly impact scientific components. We worked on developing parallel implementations for non-hydrostatic code and worked on a model coupling interface for biogeochemical computations coded in MATLAB. We also worked on the design and implementation of modules that will be required for the emerging MapReduce model to be effective for scientific applications. Finally, we focused on optimizing the processing of scientific datasets on multi-core processors.
Research Details: We worked on the following research projects, which we are applying to CCA-based scientific applications.
1. Non-Hydrostatic Hydrodynamics: Non-hydrostatic hydrodynamics are significantly more accurate at modeling internal waves that may be important in lake ecosystems. Non-hydrostatic codes, however, are significantly more computationally expensive, often prohibitively so. We have worked with Chin Wu at the University of Wisconsin to parallelize non-hydrostatic code. We have obtained a maximum speedup of about 26 times. Although this is significant progress, we hope to improve the performance further, such that it becomes a practical alternative to hydrostatic codes.
2. Model-coupling for water-based ecosystems: Answering pressing questions about water resources requires that physical models (hydrodynamics) be coupled with biological and chemical models. Most hydrodynamics codes are written in Fortran, however, while most ecologists work in MATLAB. This disconnect creates a great barrier. To address this, we are working on a model coupling interface that will allow biogeochemical computations written in MATLAB to couple with Fortran codes. This will greatly improve the productivity of ecosystem scientists.
3. Low overhead and Elastic MapReduce Implementation Optimized for Memory and CPU-Intensive Applications: Since its inception, MapReduce has frequently been associated with Hadoop and large-scale datasets. Its deployment at Amazon in the cloud, and its applications at Yahoo! for large-scale distributed document indexing and database building, among other tasks, have thrust MapReduce to the forefront of the data processing application domain. The applicability of the paradigm, however, extends far beyond its use with data-intensive applications and disk-based systems, and can also be brought to bear in processing small but CPU-intensive distributed applications. MapReduce, however, carries its own burdens. Through experiments using Hadoop in the context of diverse applications, we uncovered latencies and delay conditions potentially inhibiting the expected performance of a parallel execution in CPU-intensive applications. Furthermore, as it currently stands, MapReduce is favored for data-centric applications, and as such tends to be solely applied to disk-based applications. The paradigm falls short in bringing its novelty to diskless systems dedicated to in-memory applications, and to compute-intensive programs that process much smaller data but require intensive computations. In this project, we focused both on the performance of processing large-scale hierarchical data in distributed scientific applications, and on the processing of smaller but demanding input sizes primarily used in diskless and memory-resident I/O systems. We designed LEMO-MR [1] (low overhead, elastic, configurable for in-memory applications, with on-demand fault tolerance), an optimized implementation of MapReduce for both on-disk and in-memory applications. We conducted experiments to identify not only the necessary components of this model, but also trade-offs and factors to be considered. We have initial results to show the efficacy of our implementation in terms of potential speedup that can be achieved for representative data sets used by cloud applications. We have quantified the performance gains exhibited by our MapReduce implementation over Apache Hadoop in a compute-intensive environment.
4. Cache Performance Optimization for Processing XML and HDF-based Application Data on Multi-core Processors: It is important to design and develop scientific middleware libraries to harness the opportunities presented by emerging multi-core processors. Implementations of scientific middleware and applications that do not adapt to the programming paradigm when executing on emerging processors can severely impact the overall performance. In this project, we focused on the utilization of the L2 cache, which is a critical shared resource on chip multiprocessors (CMP). The access pattern of the shared L2 cache, which is dependent on how the application schedules and assigns processing work to each thread, can either enhance or hurt the ability to hide memory latency on a multi-core processor. Therefore, while processing scientific datasets such as HDF5, it is essential to conduct fine-grained analysis of cache utilization, to inform scheduling decisions in multi-threaded programming. In this project, using the TAU toolkit for performance feedback from dual- and quad-core machines, we conducted performance analysis and developed recommendations on how processing threads can be scheduled on multi-core nodes to enhance the performance of a class of scientific applications that requires processing of HDF5 data. In particular, we quantified the gains associated with the use of the adaptations we have made to the Cache-Affinity and Balanced-Set scheduling algorithms to improve L2 cache performance, and hence the overall application execution time [2].
References:
1. Zacharia Fadika, Madhusudhan Govindaraju, "MapReduce Implementation for Memory-Based and Processing Intensive Applications", accepted in 2nd IEEE International Conference on Cloud Computing Technology and Science, Indianapolis, USA, Nov 30 - Dec 3, 2010.
2. Rajdeep Bhowmik, Madhusudhan Govindaraju, "Cache Performance Optimization for Processing XML-based Application Data on Multi-core Processors", in proceedings of The 10th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, May 17-20, 2010, Melbourne, Victoria, Australia.
Contact Information: Madhusudhan Govindaraju, Binghamton University, State University of New York (SUNY), mgovinda@cs.binghamton.edu, Phone: 607-777-4904
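The in-memory use of the MapReduce paradigm discussed in this report can be illustrated with a minimal single-process sketch. This is not LEMO-MR and makes no claim about its design; the function names `map_reduce`, `word_mapper`, and `count_reducer` are hypothetical, and the whole data set is assumed to fit in memory with no disk or network involved.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Run a map phase, group values by key, then run a reduce phase.

    mapper(record) yields (key, value) pairs;
    reducer(key, values) returns one result per key.
    Everything stays in memory: there is no disk-backed shuffle.
    """
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}


def word_mapper(line):
    """Emit (word, 1) for every whitespace-separated word in a line."""
    for word in line.split():
        yield word.lower(), 1


def count_reducer(_key, values):
    """Sum the counts emitted for one word."""
    return sum(values)


if __name__ == "__main__":
    lines = ["cache affinity", "Cache balanced set", "set"]
    print(map_reduce(lines, word_mapper, count_reducer))
```

In a distributed implementation the map calls and the per-key reduce calls would run on different workers; the grouping step in the middle is where framework overhead tends to concentrate, which relates to the latencies the report describes.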
Chen, Yen-Lin; Chiang, Hsin-Han; Chiang, Chuan-Yen; Liu, Chuan-Ming; Yuan, Shyan-Ming; Wang, Jenq-Haur
2012-01-01
This study proposes a vision-based intelligent nighttime driver assistance and surveillance system (VIDASS system) implemented by a set of embedded software components and modules, and integrates these modules to accomplish a component-based system framework on an embedded heterogeneous dual-core platform. Therefore, this study develops and implements computer vision and sensing techniques of nighttime vehicle detection, collision warning determination, and traffic event recording. The proposed system processes the road-scene frames in front of the host car captured from CCD sensors mounted on the host vehicle. These vision-based sensing and processing technologies are integrated and implemented on an ARM-DSP heterogeneous dual-core embedded platform. Peripheral devices, including image grabbing devices, communication modules, and other in-vehicle control devices, are also integrated to form an in-vehicle-embedded vision-based nighttime driver assistance and surveillance system. PMID:22736956
NASA Technical Reports Server (NTRS)
Jovic, Srboljub
2015-01-01
This document provides the software design description for the two core software components, the LVC Gateway, the LVC Gateway Toolbox, and two participants, the LVC Gateway Data Logger and the SAA Processor (SaaProc).
Parallel processing architecture for H.264 deblocking filter on multi-core platforms
NASA Astrophysics Data System (ADS)
Prasad, Durga P.; Sonachalam, Sekar; Kunchamwar, Mangesh K.; Gunupudi, Nageswara Rao
2012-03-01
Massively parallel computing (multi-core) chips offer outstanding new solutions that satisfy the increasing demand for high resolution and high quality video compression technologies such as H.264. Such solutions not only provide exceptional quality but also efficiency, low power, and low latency, previously unattainable in software based designs. While custom hardware and Application Specific Integrated Circuit (ASIC) technologies may achieve low latency, low power, and real-time performance in some consumer devices, many applications require a flexible and scalable software-defined solution. The deblocking filter in H.264 encoder/decoder poses difficult implementation challenges because of heavy data dependencies and the conditional nature of the computations. Deblocking filter implementations tend to be fixed and difficult to reconfigure for different needs. The ability to scale up for higher quality requirements such as 10-bit pixel depth or a 4:2:2 chroma format often reduces the throughput of a parallel architecture designed for a lower feature set. A scalable architecture for deblocking filtering, created with a massively parallel processor based solution, means that the same encoder or decoder will be deployed in a variety of applications, at different video resolutions, for different power requirements, and at higher bit-depths and better color subsampling patterns like YUV, 4:2:2, or 4:4:4 formats. Low power, software-defined encoders/decoders may be implemented using a massively parallel processor array, like that found in HyperX technology, with 100 or more cores and distributed memory. The large number of processor elements allows the silicon device to operate more efficiently than conventional DSP or CPU technology. This software programming model for massively parallel processors offers a flexible implementation and a power efficiency close to that of ASIC solutions.
This work describes a scalable parallel architecture for an H.264 compliant deblocking filter for multi-core platforms such as HyperX technology. Parallel techniques such as parallel processing at the level of independent macroblocks, sub-blocks, and pixel rows are examined in this work. The deblocking architecture consists of a basic cell called the deblocking filter unit (DFU) and a dependent data buffer manager (DFM). The DFU can be used in several instances, catering to different performance needs; the DFM serves the data required by the different number of DFUs, and also manages all the neighboring data required for future data processing of DFUs. This approach achieves the scalability, flexibility, and performance excellence required in deblocking filters.
Cache Sharing and Isolation Tradeoffs in Multicore Mixed-Criticality Systems
2015-05-01
form of lockdown registers, to provide way-based partitioning. These alternatives are illustrated in Fig. 1 with respect to a quad-core ARM Cortex A9... processor (as we do for Level-A and -B tasks), but they did not consider MC systems. Altmeyer et al. [1] considered uniprocessor scheduling on a system with a...framework. We randomly generated task sets and determined the fraction that were schedulable on our target hardware platform, the quad-core ARM Cortex A9
Floating-point performance of ARM cores and their efficiency in classical molecular dynamics
NASA Astrophysics Data System (ADS)
Nikolskiy, V.; Stegailov, V.
2016-02-01
Supercomputing of the exascale era is going to be inevitably limited by power efficiency. Nowadays different possible variants of CPU architectures are considered. Recently the development of ARM processors has come to the point when their floating point performance can be seriously considered for a range of scientific applications. In this work we present the analysis of the floating point performance of the latest ARM cores and their efficiency for the algorithms of classical molecular dynamics.
Frequency Dependence of Single-event Upset in Advanced Commercial PowerPC Microprocessors
NASA Technical Reports Server (NTRS)
Irom, Frokh; Farmanesh, Farhad F.; Swift, Gary M.; Johnston, Allen H.
2004-01-01
This paper examines single-event upsets in advanced commercial SOI microprocessors in a dynamic mode, studying SEU sensitivity of General Purpose Registers (GPRs) with clock frequency. Results are presented for SOI processors with feature sizes of 0.18 microns and two different core voltages. Single-event upset from heavy ions is measured for advanced commercial microprocessors in a dynamic mode with clock frequency up to 1GHz. Frequency and core voltage dependence of single-event upsets in registers is discussed.
MIL-STD-1553B Marconi LSI chip set in a remote terminal application
NASA Astrophysics Data System (ADS)
Dimarino, A.
1982-11-01
Marconi Avionics is utilizing the MIL-STD-1553B LSI Chip Set in the SCADC Air Data Computer application to perform all of the required remote terminal MIL-STD-1553B protocol functions. Basic components of the RTU are the dual redundant chip set, CT3231 Transceivers, 256 x 16 RAM and a Z8002 microprocessor. Basic transfers are to/from the RAM on command of the bus controller or the Z8002 processor. During transfers from the processor to the RAM, the chip set busy bit is set for a period not exceeding 250 microseconds. When the transfer is complete, the busy bit is released and transfers to the data bus occur on command. The LSI Chip Set word count lines are used to locate each data word in the local memory, and 4 mode codes are used in the application: reset remote terminal, transmit status word, transmitter shut-down, and override transmitter shutdown.
Challenges and Opportunities in Propulsion Simulations
2015-09-24
leverage Nvidia GPU accelerators • Release common computational infrastructure as Distro A for collaboration • Add physics modules as either...Gemini (6.4 GB/s) Dual Rail EDR-IB (23 GB/s) Interconnect Topology 3D Torus Non-blocking Fat Tree Processors AMD Opteron™ NVIDIA Kepler™ IBM...POWER9 NVIDIA Volta™ File System 32 PB, 1 TB/s, Lustre® 120 PB, 1 TB/s, GPFS™ Peak power consumption 9 MW 10 MW Titan vs. Summit Source: R
Ansari, A H; Cherian, P J; Dereymaeker, A; Matic, V; Jansen, K; De Wispelaere, L; Dielman, C; Vervisch, J; Swarte, R M; Govaert, P; Naulaers, G; De Vos, M; Van Huffel, S
2016-09-01
After identifying the most seizure-relevant characteristics by a previously developed heuristic classifier, a data-driven post-processor using a novel set of features is applied to improve the performance. The main characteristics of the outputs of the heuristic algorithm are extracted by five sets of features including synchronization, evolution, retention, segment, and signal features. Then, a support vector machine and a decision making layer remove the falsely detected segments. Four datasets including 71 neonates (1023h, 3493 seizures) recorded in two different university hospitals are used to train and test the algorithm without removing the dubious seizures. The heuristic method resulted in a false alarm rate of 3.81 per hour and a good detection rate of 88% on the entire test databases. The post-processor effectively reduces the false alarm rate by 34% while the good detection rate decreases by 2%. This post-processing technique improves the performance of the heuristic algorithm. The structure of this post-processor is generic, improves our understanding of the core visually determined EEG features of neonatal seizures and is applicable for other neonatal seizure detectors. The post-processor significantly decreases the false alarm rate at the expense of a small reduction of the good detection rate. Copyright © 2016 International Federation of Clinical Neurophysiology. Published by Elsevier Ireland Ltd. All rights reserved.
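The effect of the post-processor on the false alarm rate can be worked out from the two figures quoted in the abstract above (3.81 false alarms per hour, reduced by 34%). The short Python sketch below only does that arithmetic; the function name `reduced_rate` is hypothetical, and the 34% is assumed to be a relative reduction. The abstract does not say whether the 2% drop in detection rate is absolute or relative, so that figure is left out.

```python
def reduced_rate(rate_per_hour, relative_reduction):
    """Apply a relative (fractional) reduction to an hourly rate."""
    return rate_per_hour * (1.0 - relative_reduction)


# Figures quoted in the abstract; treating 34% as a relative reduction
# is an assumption of this sketch.
false_alarms_before = 3.81  # false alarms per hour (heuristic method alone)
false_alarms_after = reduced_rate(false_alarms_before, 0.34)

print(round(false_alarms_after, 2))  # 2.51
```

Under that reading, the post-processor would bring the false alarm rate to roughly 2.5 per hour.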
Reducing Backups by Utilizing DMF
NASA Technical Reports Server (NTRS)
Cardo, Nicholas P.; Woodrow, Thomas (Technical Monitor)
1994-01-01
Although a filesystem may be migratable, for a period of time the data blocks are on disk only. When performing system dumps, these data blocks are backed up to tape. If the data blocks are offline or dual resident, then only the inode is backed up. If all online files are made dual resident prior to performing system dumps, the dump time and the amount of resources required can be significantly reduced. The High Speed Processors group at the Numerical Aerodynamics Simulation (NAS) Facility at NASA Ames Research Center developed a tool to make all online files dual resident. The result is a file whose data blocks are on DMF tape and still assigned to the original inode. Our 150GB filesystem used to take 8 to 12 hours to back up and used 50 to 60 tapes. Now the backup typically uses under 10 tapes and completes in under 2 hours. This paper discusses this new tool and the advantages gained by using it.
Fast Plasma Investigation for Magnetospheric Multiscale
NASA Technical Reports Server (NTRS)
Pollock, C.; Moore, T.; Coffey, V.; Dorelli, J.; Giles, B.; Adrian, M.; Chandler, M.; Duncan, C.; Figueroa-Vinas, A.; Garcia, K.;
2016-01-01
The Fast Plasma Investigation (FPI) was developed for flight on the Magnetospheric Multiscale (MMS) mission to measure the differential directional flux of magnetospheric electrons and ions with unprecedented time resolution to resolve kinetic-scale plasma dynamics. This increased resolution has been accomplished by placing four dual 180-degree top hat spectrometers for electrons and four dual 180-degree top hat spectrometers for ions around the periphery of each of four MMS spacecraft. Using electrostatic field-of-view deflection, the eight spectrometers for each species together provide a 4π-sr field of view with, at worst, 11.25-degree sample spacing. Energy/charge sampling is provided by swept electrostatic energy/charge selection over the range from 10 eV/q to 30000 eV/q. The eight dual spectrometers on each spacecraft are controlled and interrogated by a single block redundant Instrument Data Processing Unit, which in turn interfaces to the observatory's Instrument Suite Central Instrument Data processor. This paper describes the design of FPI, its ground and in-flight calibration, its operational concept, and its data products.
NASA Astrophysics Data System (ADS)
Blume, H.; Alexandru, R.; Applegate, R.; Giordano, T.; Kamiya, K.; Kresina, R.
1986-06-01
In a digital diagnostic imaging department, the majority of operations for handling and processing of images can be grouped into a small set of basic operations, such as image data buffering and storage, image processing and analysis, image display, image data transmission and image data compression. These operations occur in almost all nodes of the diagnostic imaging communications network of the department. An image processor architecture was developed in which each of these functions has been mapped into hardware and software modules. The modular approach has advantages in terms of economics, service, expandability and upgradeability. The architectural design is based on the principles of hierarchical functionality, distributed and parallel processing and aims at real time response. Parallel processing and real time response are facilitated in part by a dual bus system: a VME control bus and a high speed image data bus, consisting of 8 independent parallel 16-bit busses, capable of handling up to 144 MBytes/sec combined. The presented image processor is versatile enough to meet the video rate processing needs of digital subtraction angiography, the large pixel matrix processing requirements of static projection radiography, or the broad range of manipulation and display needs of a multi-modality diagnostic work station. Several hardware modules are described in detail. To illustrate the capabilities of the image processor, processed 2000 x 2000 pixel computed radiographs are shown and estimated computation times for executing the processing operations are presented.
Seed and blanket fuel arrangement for dual-phase nuclear reactors
DOE Office of Scientific and Technical Information (OSTI.GOV)
Congdon, S.P.; Fawcett, R.M.
1992-09-22
This patent describes a fuel management method for a dual-phase nuclear reactor, it comprises: installing a fuel bundle at a first core location accessed by coolant through a relatively small aperture, each of the bundles having a predetermined group of fuel elements; operating the reactor a first time; shutting down the reactor; reinstalling the fuel bundle at a second core location accessed by coolant through a relatively large aperture; and operating the reactor a second time.
Adapting wave-front algorithms to efficiently utilize systems with deep communication hierarchies
DOE Office of Scientific and Technical Information (OSTI.GOV)
Kerbyson, Darren J; Lang, Michael; Pakin, Scott
2009-01-01
Large-scale systems increasingly exhibit a differential between intra-chip and inter-chip communication performance. Processor-cores on the same socket are able to communicate at lower latencies, and with higher bandwidths, than cores on different sockets either within the same node or between nodes. A key challenge is to efficiently use this communication hierarchy and hence optimize performance. We consider here the class of applications that contain wave-front processing. In these applications data can only be processed after their upstream neighbors have been processed. Similar dependencies result between processors in which communication is required to pass boundary data downstream and whose cost is typically impacted by the slowest communication channel in use. In this work we develop a novel hierarchical wave-front approach that reduces the use of slower communications in the hierarchy but at the cost of additional computation and higher use of on-chip communications. This tradeoff is explored using a performance model and an implementation on the Petascale Roadrunner system demonstrates a 27% performance improvement at full system-scale on a kernel application. The approach is generally applicable to large-scale multi-core and accelerated systems where a differential in system communication performance exists.
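The dependency pattern behind wave-front processing, in which a cell can be processed only after its upstream neighbors, can be illustrated with a small serial Python sketch. It is not the hierarchical scheme from the abstract above: the 2D grid, the choice of north and west cells as the upstream neighbors, and the names `wavefront_order` and `sweep` are all assumptions made for illustration.

```python
def wavefront_order(rows, cols):
    """Group grid cells into anti-diagonal fronts.

    Cells within one front do not depend on each other, so they could be
    processed in parallel; the fronts themselves must run in order.
    """
    fronts = []
    for d in range(rows + cols - 1):
        first_row = max(0, d - cols + 1)
        last_row = min(rows, d + 1)
        fronts.append([(i, d - i) for i in range(first_row, last_row)])
    return fronts


def sweep(grid):
    """Accumulate values so each cell uses its north and west neighbors."""
    rows, cols = len(grid), len(grid[0])
    out = [[0] * cols for _ in range(rows)]
    for front in wavefront_order(rows, cols):
        for i, j in front:  # independent work within one front
            north = out[i - 1][j] if i > 0 else 0
            west = out[i][j - 1] if j > 0 else 0
            out[i][j] = grid[i][j] + north + west
    return out


print(sweep([[1, 1], [1, 1]]))  # [[1, 2], [2, 5]]
```

When the grid is split across processors, each front boundary becomes a message to the downstream processor, which is where the cost of the slowest communication channel enters.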
CoreTSAR: Core Task-Size Adapting Runtime
Scogland, Thomas R. W.; Feng, Wu-chun; Rountree, Barry; ...
2014-10-27
Heterogeneity continues to increase at all levels of computing, with the rise of accelerators such as GPUs, FPGAs, and other co-processors into everything from desktops to supercomputers. As a consequence, efficiently managing such disparate resources has become increasingly complex. CoreTSAR seeks to reduce this complexity by adaptively worksharing parallel-loop regions across compute resources without requiring any transformation of the code within the loop. Lastly, our results show performance improvements of up to three-fold over a current state-of-the-art heterogeneous task scheduler as well as linear performance scaling from a single GPU to four GPUs for many codes. In addition, CoreTSAR demonstrates a robust ability to adapt to both a variety of workloads and underlying system configurations.
Interactive collision detection for deformable models using streaming AABBs.
Zhang, Xinyu; Kim, Young J
2007-01-01
We present an interactive and accurate collision detection algorithm for deformable, polygonal objects based on the streaming computational model. Our algorithm can detect all possible pairwise primitive-level intersections between two severely deforming models at highly interactive rates. In our streaming computational model, we consider a set of axis aligned bounding boxes (AABBs) that bound each of the given deformable objects as an input stream and perform massively-parallel pairwise overlap tests on the incoming streams. As a result, we are able to prevent performance stalls in the streaming pipeline that can be caused by the expensive indexing mechanisms required by bounding volume hierarchy-based streaming algorithms. At runtime, as the underlying models deform over time, we employ a novel streaming algorithm to update the geometric changes in the AABB streams. Moreover, in order to get only the computed result (i.e., collision results between AABBs) without reading back the entire output streams, we propose a streaming en/decoding strategy that can be performed in a hierarchical fashion. After determining overlapped AABBs, we perform primitive-level (e.g., triangle) intersection checking on a serial computational model such as CPUs. We implemented the entire pipeline of our algorithm using off-the-shelf graphics processors (GPUs), such as nVIDIA GeForce 7800 GTX, for streaming computations, and Intel Dual Core 3.4G processors for serial computations. We benchmarked our algorithm with different models of varying complexities, ranging from 15K up to 50K triangles, under various deformation motions, and the timings obtained were approximately 30 to 100 FPS depending on the complexity of models and their relative configurations. Finally, we made comparisons with a well-known GPU-based collision detection algorithm, CULLIDE [4], and observed about three times performance improvement over the earlier approach.
We also made comparisons with a SW-based AABB culling algorithm [2] and observed about two times improvement.
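The pairwise AABB overlap test at the heart of the abstract above can be sketched in a few lines of Python. This is a serial, brute-force illustration only, not the authors' GPU streaming pipeline; the box representation (a pair of min and max corner tuples) and the names `aabb_overlap` and `overlapping_pairs` are assumptions.

```python
from itertools import product


def aabb_overlap(a, b):
    """Return True when two axis-aligned boxes intersect.

    Each box is a (min_corner, max_corner) pair of 3-tuples. Two boxes
    overlap only if their intervals overlap on every axis.
    """
    (amin, amax), (bmin, bmax) = a, b
    return all(amin[k] <= bmax[k] and bmin[k] <= amax[k] for k in range(3))


def overlapping_pairs(boxes_a, boxes_b):
    """Test every box of one object against every box of the other."""
    return [
        (i, j)
        for (i, a), (j, b) in product(enumerate(boxes_a), enumerate(boxes_b))
        if aabb_overlap(a, b)
    ]


object_a = [((0, 0, 0), (1, 1, 1)), ((5, 5, 5), (6, 6, 6))]
object_b = [((0.5, 0.5, 0.5), (2, 2, 2)), ((10, 10, 10), (11, 11, 11))]
print(overlapping_pairs(object_a, object_b))  # [(0, 0)]
```

Each pair test is independent of the others, which is what makes the all-pairs formulation a natural fit for massively parallel hardware; only the surviving pairs need the more expensive triangle-level check.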
A Survey of Recent MARTe Based Systems
NASA Astrophysics Data System (ADS)
Neto, André C.; Alves, Diogo; Boncagni, Luca; Carvalho, Pedro J.; Valcarcel, Daniel F.; Barbalace, Antonio; De Tommasi, Gianmaria; Fernandes, Horácio; Sartori, Filippo; Vitale, Enzo; Vitelli, Riccardo; Zabeo, Luca
2011-08-01
The Multithreaded Application Real-Time executor (MARTe) is a data driven framework environment for the development and deployment of real-time control algorithms. The main ideas which led to the present version of the framework were to standardize the development of real-time control systems, while providing a set of strictly bounded standard interfaces to the outside world and also accommodating a collection of facilities which promote the speed and ease of development, commissioning and deployment of such systems. At the core of every MARTe based application, is a set of independent inter-communicating software blocks, named Generic Application Modules (GAM), orchestrated by a real-time scheduler. The platform independence of its core library provides MARTe the necessary robustness and flexibility for conveniently testing applications in different environments including non-real-time operating systems. MARTe is already being used in several machines, each with its own peculiarities regarding hardware interfacing, supervisory control configuration, operating system and target control application. This paper presents and compares the most recent results of systems using MARTe: the JET Vertical Stabilization system, which uses the Real Time Application Interface (RTAI) operating system on Intel multi-core processors; the COMPASS plasma control system, driven by Linux RT also on Intel multi-core processors; ISTTOK real-time tomography equilibrium reconstruction which shares the same support configuration of COMPASS; JET error field correction coils based on VME, PowerPC and VxWorks; FTU LH reflected power system running on VME, Intel with RTAI.
Cost efficient CFD simulations: Proper selection of domain partitioning strategies
NASA Astrophysics Data System (ADS)
Haddadi, Bahram; Jordan, Christian; Harasek, Michael
2017-10-01
Computational Fluid Dynamics (CFD) is one of the most powerful simulation methods, which is used for temporally and spatially resolved solutions of fluid flow, heat transfer, mass transfer, etc. One of the challenges of Computational Fluid Dynamics is the extreme hardware demand. Nowadays super-computers (e.g. High Performance Computing, HPC) featuring multiple CPU cores are applied for solving; the simulation domain is split into partitions for each core. Some of the different methods for partitioning are investigated in this paper. As a practical example, a new open source based solver was utilized for simulating packed bed adsorption, a common separation method within the field of thermal process engineering. Adsorption can for example be applied for removal of trace gases from a gas stream or production of pure gases such as hydrogen. For comparing the performance of the partitioning methods, a 60 million cell mesh for a packed bed of spherical adsorbents was created; one second of the adsorption process was simulated. Different partitioning methods available in OpenFOAM® (Scotch, Simple, and Hierarchical) have been used with different numbers of sub-domains. The effect of the different methods and number of processor cores on the simulation speedup and also energy consumption were investigated for two different hardware infrastructures (Vienna Scientific Clusters VSC 2 and VSC 3). As a general recommendation an optimum number of cells per processor core was calculated. Optimized simulation speed, lower energy consumption and consequently the cost effects are reported here.
NASA Astrophysics Data System (ADS)
Keleshis, C.; Ioannou, S.; Vrekoussis, M.; Levin, Z.; Lange, M. A.
2014-08-01
Continuous advances in unmanned aerial vehicles (UAV) and the increased complexity of their applications raise the demand for improved data acquisition systems (DAQ). These improvements may comprise low power consumption, low volume and weight, robustness, modularity and capability to interface with various sensors and peripherals while maintaining high sampling rates and processing speeds. Such a system has been designed and developed and is currently integrated on the Autonomous Flying Platforms for Atmospheric and Earth Surface Observations (APAESO/NEA-YΠOΔOMH/NEKΠ/0308/09). However, it can be easily adapted to any UAV or any other mobile vehicle. The system consists of a single-board computer with a dual-core processor, rugged surface-mount memory and storage device, analog and digital input-output ports and many other peripherals that enhance its connectivity with various sensors, imagers and on-board devices. The system is powered by a high efficiency power supply board. Additional boards such as frame-grabbers, differential global positioning system (DGPS) satellite receivers, general packet radio service (3G-4G-GPRS) modems for communication redundancy have been interfaced to the core system and are used whenever there is a mission need. The onboard DAQ system can be preprogrammed for automatic data acquisition or it can be remotely operated during the flight from the ground control station (GCS) using a graphical user interface (GUI) which has been developed and will also be presented in this paper. The unique design of the GUI and the DAQ system enables the synchronized acquisition of a variety of scientific and UAV flight data in a single core location. The new DAQ system and the GUI have been successfully utilized in several scientific UAV missions.
In conclusion, the novel DAQ system provides the UAV and the remote-sensing community with a new tool capable of reliably acquiring, processing, storing and transmitting data from any sensor integrated on a UAV.
Tolbert, Jeremy R; Kabali, Pratik; Brar, Simeranjit; Mukhopadhyay, Saibal
2009-01-01
We present a digital system for adaptive data compression for low power wireless transmission of Electroencephalography (EEG) data. The proposed system acts as a base-band processor between the EEG analog-to-digital front-end and RF transceiver. It performs a real-time accuracy-energy trade-off for multi-channel EEG signal transmission by controlling the volume of transmitted data. We propose a multi-core digital signal processor for on-chip processing of EEG signals, to detect signal information of each channel and perform real-time adaptive compression. Our analysis shows that the proposed approach can provide significant savings in transmitter power with minimal impact on the overall signal accuracy.
NASA Astrophysics Data System (ADS)
Hadade, Ioan; di Mare, Luca
2016-08-01
Modern multicore and manycore processors exhibit multiple levels of parallelism through a wide range of architectural features such as SIMD for data parallel execution or threads for core parallelism. The exploitation of multi-level parallelism is therefore crucial for achieving superior performance on current and future processors. This paper presents the performance tuning of a multiblock CFD solver on Intel SandyBridge and Haswell multicore CPUs and the Intel Xeon Phi Knights Corner coprocessor. Code optimisations have been applied on two computational kernels exhibiting different computational patterns: the update of flow variables and the evaluation of the Roe numerical fluxes. We discuss at great length the code transformations required for achieving efficient SIMD computations for both kernels across the selected devices including SIMD shuffles and transpositions for flux stencil computations and global memory transformations. Core parallelism is expressed through threading based on a number of domain decomposition techniques together with optimisations pertaining to alleviating NUMA effects found in multi-socket compute nodes. Results are correlated with the Roofline performance model in order to assert their efficiency for each distinct architecture. We report significant speedups for single thread execution across both kernels: 2-5X on the multicore CPUs and 14-23X on the Xeon Phi coprocessor. Computations at full node and chip concurrency deliver a factor of three speedup on the multicore processors and up to 24X on the Xeon Phi manycore coprocessor.
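The Roofline model mentioned in the abstract above bounds attainable performance by the smaller of the peak compute rate and the product of memory bandwidth and arithmetic intensity. A minimal Python sketch of that bound follows; the function name `roofline` and the machine numbers are hypothetical placeholders, not measurements from the paper.

```python
def roofline(peak_gflops, bandwidth_gbs, arithmetic_intensity):
    """Attainable GFLOP/s under the basic Roofline model.

    arithmetic_intensity is in FLOPs per byte moved to or from memory.
    """
    return min(peak_gflops, bandwidth_gbs * arithmetic_intensity)


# Hypothetical machine: 100 GFLOP/s peak, 50 GB/s memory bandwidth.
peak, bandwidth = 100.0, 50.0

print(roofline(peak, bandwidth, 0.5))  # 25.0  (memory bound)
print(roofline(peak, bandwidth, 4.0))  # 100.0 (compute bound)
```

A kernel whose measured rate sits close to this bound for its arithmetic intensity is using the hardware efficiently; one far below it still has headroom, whether from vectorisation, threading, or better data layout.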
Mini-cavity plasma core reactors for dual-mode space nuclear power/propulsion systems. M.S. Thesis
NASA Technical Reports Server (NTRS)
Chow, S.
1976-01-01
A mini-cavity plasma core reactor is investigated for potential use in a dual-mode space power and propulsion system. In the propulsive mode, hydrogen propellant is injected radially inward through the reactor solid regions and into the cavity. The propellant is heated by both solid driver fuel elements surrounding the cavity and uranium plasma before it is exhausted out the nozzle. The propellant removes only a fraction of the driver power; the remainder is transferred by a coolant fluid to a power conversion system, which incorporates a radiator for heat rejection. Neutronic feasibility of dual mode operation and smaller reactor sizes than those previously investigated are shown to be possible. A heat transfer analysis of one such reactor shows that the dual-mode concept is applicable when power generation mode thermal power levels are within the same order of magnitude as direct thrust mode thermal power levels.
Mamede, Joao I.; Hope, Thomas J.
2016-01-01
Live cell imaging is a valuable technique that allows the characterization of the dynamic processes of the HIV-1 life-cycle. Here, we present a method of production and imaging of dual-labeled HIV viral particles that allows the visualization of two events. Varying release of the intravirion fluid phase marker reveals virion fusion and the loss of the integrity of HIV viral cores with the use of live wide-field fluorescent microscopy. PMID:26714704
Efficiently Distributing Component-Based Applications Across Wide-Area Environments
2002-01-01
Oracle 8.1.7 Enterprise Edition), each running on a dedicated 1GHz dual-processor Pentium III workstation. For the RUBiS tests, we used a MySQL 4.0.12...a variety of sophisticated network-accessible services such as e-mail, banking, on-line shopping, entertainment, and serv - ing as a data exchange...Beans Catalog Handles read-only queries to product database Customer Serves as a façade to Order and Account Stateful Session Beans ShoppingCart
Develop, Build, and Test a Virtual Lab to Support a Vulnerability Training System
2004-09-01
docs.us.dell.com/support/edocs/systems/pe1650/ en /it/index.htm> (20 August 2004) “HOWTO: Installing Web Services with Linux /Tomcat/Apache/Struts...configured as host machines with VMware and VNC running on a Linux RedHat 9 Kernel. An Apache-Tomcat web server was configured as the external interface to...1650, dual processor, blade servers were configured as host machines with VMware and VNC running on a Linux RedHat 9 Kernel. An Apache-Tomcat web
Experimentation and Evaluation of Advanced Integrated System Concepts.
1980-09-26
ART). (b) Selects one of four trunk circuits from each trunk (m) Dual Modem and Loop Interface (DMLI) card. circuit card. (n) Dictation and paging...Arbitrator L Bus - Modems ET _Modems Modems Figure 4-1 Certain Telenet Processor models (see Section 4.3 for details) can be equipped with redundancy to...JMemory Bank B Memory Bank A ArbittrAto Arbitrator A t a i Interface U a Modems $ Figure 4-2 In a system with common logic redundancy all centrally
NASA Astrophysics Data System (ADS)
Jang, Haeyun; Lee, Chaedong; Nam, Gi-Eun; Quan, Bo; Choi, Hyuck Jae; Yoo, Jung Sun; Piao, Yuanzhe
2016-02-01
The difficulty in delineating tumors is a major obstacle to better outcomes in cancer treatment. The use of a single imaging modality is often limited by inadequate sensitivity and resolution. Here, we present the synthesis and the use of monodisperse iron oxide nanoparticles coated with fluorescent silica nano-shells for fluorescence and magnetic resonance dual imaging of tumors. The as-synthesized core-shell nanoparticles were designed to improve the accuracy of diagnosis via simultaneous tumor imaging with dual imaging modalities by a single injection of contrast agent. The iron oxide nanocrystals (approximately 11 nm) were coated with Rhodamine B isothiocyanate-doped silica shells via a reverse microemulsion method. Then, the core-shell nanoparticles (approximately 54 nm) were analyzed to confirm their size distribution by transmission electron microscopy and dynamic laser scattering. Photoluminescence spectroscopy was used to characterize the fluorescent property of the dye-doped silica shell-coated nanoparticles. The cellular compatibility of the as-prepared nanoparticles was confirmed by a trypan blue dye exclusion assay and the potential as a dual-imaging contrast agent was verified by in vivo fluorescence and magnetic resonance imaging. The experimental results show that the uniform-sized core-shell nanoparticles are highly water dispersible and the cellular toxicity of the nanoparticles is negligible. In vivo fluorescence imaging demonstrates the capability of the developed nanoparticles to selectively target tumors by the enhanced permeability and retention effects, and ex vivo tissue analysis corroborated this. In an in vitro phantom test, the core/shell nanoparticles showed a T2 relaxation time comparable to Feridex® with smaller size, indicating that the as-made nanoparticles are suitable for imaging tumors. This new dual-modality nanoparticle approach shows promise for enabling more accurate tumor imaging.
Yoshida, Keiichi; Meng, Xiangfeng
2014-06-01
The optimal luting material for fiber-reinforced posts to ensure the longevity of foundation restorations remains undetermined. The purpose of this study was to evaluate the suitability of 3 dual-polymerizing resin cements and 2 dual-polymerizing foundation composite resins for luting fiber-reinforced posts by assessing their Knoop hardness number. Five specimens of dual-polymerizing resin cements (SA Cement Automix, G-Cem LincAce, and Panavia F2.0) and 5 specimens of dual-polymerizing foundation composite resins (Clearfil DC Core Plus and Unifil Core EM) were polymerized from the top by irradiation for 40 seconds. Knoop hardness numbers were measured at depths of 0.5, 2.0, 4.0, 6.0, 8.0, and 10.0 mm at 0.5 hours and 7 days after irradiation. Data were statistically analyzed by repeated measures ANOVA, 1-way ANOVA, and the Tukey compromise post hoc test (α=.05). At both times after irradiation, the 5 resin materials showed the highest Knoop hardness numbers at the 0.5-mm depth. At 7 days after irradiation, the Knoop hardness numbers of the resin materials did not differ significantly between the 8.0-mm and 10.0-mm depths (P>.05). For all materials, the Knoop hardness numbers at 7 days after irradiation were significantly higher than those at 0.5 hours after irradiation at all depths (P<.05). At 7 days after irradiation, the Knoop hardness numbers of the 5 resin materials were found to decrease in the following order: DC Core Plus, Unifil Core EM, Panavia F2.0, SA Cement Automix, and G-Cem LincAce (P<.05). The Knoop hardness number depends on the depth of the cavity, the length of time after irradiation, and the material brand. Although the Knoop hardness numbers of the 2 dual-polymerizing foundation composite resins were higher than those of the 3 dual-polymerizing resin cements, notable differences were seen among the 5 materials at all depths and at both times after irradiation. Copyright © 2014 Editorial Council for the Journal of Prosthetic Dentistry. 
Published by Elsevier Inc. All rights reserved.
Peregrine System | High-Performance Computing | NREL
) and longer-term (/projects) storage. These file systems are mounted on all nodes. Peregrine has three -2670 Xeon processors and 64 GB of memory. In addition to mounting the /home, /nopt, /projects and # cores/node Memory/node Peak (DP) performance per node 88 Intel Xeon E5-2670 "Sandy Bridge" 8
Investigation of Large Scale Cortical Models on Clustered Multi-Core Processors
2013-02-01
with the bias node ( gray ) denoted as ww and the weights associated with the remaining first layer nodes (black) denoted as W. In forming the overall...Implementation of RBF network on GPU Platform 3.5.1 The Cholesky decomposition algorithm We need to invert the matrix multiplication GTG to
NASA Astrophysics Data System (ADS)
Fang, Juan; Hao, Xiaoting; Fan, Qingwen; Chang, Zeqing; Song, Shuying
2017-05-01
In a heterogeneous multi-core architecture, the CPU and GPU are integrated on the same chip, which poses a new challenge for last-level cache management. In this architecture, CPU and GPU applications execute concurrently and both access the last-level cache (LLC). CPUs and GPUs have different memory access characteristics, so they differ in their sensitivity to LLC capacity. For many CPU applications, a reduced share of the LLC can lead to significant performance degradation. In contrast, GPU applications can tolerate increased memory access latency when there is sufficient thread-level parallelism. Taking this latency tolerance of GPU programs into account, this paper presents a method that lets GPU applications access memory directly, leaving more LLC space for CPU applications; this improves the performance of CPU applications without affecting the performance of GPU applications. When the CPU application is cache sensitive and the GPU application is cache insensitive, the overall performance of the system is improved significantly.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Trędak, Przemysław, E-mail: przemyslaw.tredak@fuw.edu.pl; Rudnicki, Witold R.; Interdisciplinary Centre for Mathematical and Computational Modelling, University of Warsaw, ul. Pawińskiego 5a, 02-106 Warsaw
The second generation Reactive Bond Order (REBO) empirical potential is commonly used to accurately model a wide range of hydrocarbon materials. It is also extensible to other atom types and interactions. The REBO potential assumes a complex multi-body interaction model that is difficult to represent efficiently in the SIMD or SIMT programming model. Hence, despite its importance, no efficient GPGPU implementation has been developed for this potential. Here we present a detailed description of a highly efficient GPGPU implementation of a molecular dynamics algorithm using the REBO potential. The presented algorithm takes advantage of rarely used properties of the SIMT architecture of a modern GPU to solve difficult synchronization issues that arise in computations of multi-body potentials. Techniques developed for this problem may also be used to achieve efficient solutions of different problems. The performance of the proposed algorithm is assessed using a range of model systems. It is compared to the highly optimized CPU implementation (both single core and OpenMP) available in the LAMMPS package. These experiments show up to a 6x improvement in force computation time using a single processor of the NVIDIA Tesla K80 compared to a high-end 16-core Intel Xeon processor.
Parallel halftoning technique using dot diffusion optimization
NASA Astrophysics Data System (ADS)
Molina-Garcia, Javier; Ponomaryov, Volodymyr I.; Reyes-Reyes, Rogelio; Cruz-Ramos, Clara
2017-05-01
In this paper, a novel approach for halftone images is proposed and implemented for images that are obtained by the Dot Diffusion (DD) method. The designed technique is based on an optimization of the so-called class matrix used in the DD algorithm, and it consists of generating new versions of the class matrix that contain no barons or near-barons, in order to minimize inconsistencies during the distribution of the error. The proposed class matrices have different properties and are designed for two different applications: applications where inverse halftoning is necessary, and applications where it is not required. The proposed method has been implemented on a GPU (NVIDIA GeForce GTX 750 Ti) and on multicore processors (AMD FX(tm)-6300 Six-Core Processor and Intel Core i5-4200U), using CUDA and OpenCV on a PC running Linux. Experimental results have shown that the novel framework generates good-quality halftone images and inverse halftone images. The simulation results using parallel architectures have demonstrated the efficiency of the novel technique when it is implemented in real-time processing.
Grazing incidence modeling of a metamaterial-inspired dual-resonance acoustic liner
NASA Astrophysics Data System (ADS)
Beck, Benjamin S.
2014-03-01
To reduce the noise emitted by commercial aircraft turbofan engines, the inlet and aft nacelle ducts are lined with acoustic absorbing structures called acoustic liners. Traditionally, these structures consist of a perforated facesheet bonded on top of a honeycomb core. These traditional perforate over honeycomb core (POHC) liners create an absorption spectrum in which the maximum absorption occurs at a frequency dictated by the depth of the honeycomb core, which acts as a quarter-wave resonator. Recent advances in turbofan engine design have increased the need for thin acoustic liners that are effective at low frequencies. One design that has been developed uses an acoustic metamaterial architecture to improve the low frequency absorption. Specifically, the liner consists of an array of Helmholtz resonators separated by quarter-wave volumes to create a dual-resonance acoustic liner. While previous work investigated the acoustic behavior under normal incidence, this paper outlines the modeling and predicted transmission loss and absorption of a dual-resonance acoustic metamaterial when subjected to grazing incidence sound.
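The dependence of peak absorption on core depth described above can be illustrated with the textbook relation for an ideal quarter-wave resonator, f = c / (4 L). The sketch below is not taken from the paper: the function name, the 343 m/s speed of sound, and the example depths are assumptions chosen for illustration, and real liners shift from this value because of facesheet and grazing-flow effects.

```python
def quarter_wave_frequency(depth_m, sound_speed_m_s=343.0):
    """Fundamental resonance of an ideal quarter-wave tube: f = c / (4 * L).

    depth_m is the cavity (honeycomb core) depth in metres; the default
    sound speed is an assumed room-temperature value for air.
    """
    if depth_m <= 0:
        raise ValueError("depth must be positive")
    return sound_speed_m_s / (4.0 * depth_m)


if __name__ == "__main__":
    # Hypothetical core depths: halving the depth doubles the tuned frequency,
    # which is why thin conventional liners struggle at low frequencies.
    for depth_mm in (25.0, 50.0, 100.0):
        freq = quarter_wave_frequency(depth_mm / 1000.0)
        print(f"{depth_mm:6.1f} mm core -> {freq:7.1f} Hz")
```

Under these assumptions, tuning a plain quarter-wave cavity to a few hundred hertz needs a depth of tens of centimetres, which motivates the thinner dual-resonance design.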
NASA Astrophysics Data System (ADS)
Selvi, N.; Sankar, S.; Dinakaran, K.
2014-12-01
Nanocrystallites of SnO2 core and dual shell (ZnO, SiO2) coated SnO2 core-shell nanospheres were successfully synthesized by the co-precipitation method. The as-prepared and annealed samples were characterized by X-ray diffraction (XRD), Fourier transform infrared spectroscopy (FTIR), high-resolution transmission electron microscopy (HRTEM), and UV-Vis analysis. The XRD pattern confirms that the SnO2 core has a tetragonal rutile crystalline structure and the ZnO shell a hexagonal structure. The FTIR results show the functional groups present in the samples. The spherical morphology and the formation of the core-shell structures have been confirmed by HRTEM measurements. The UV-Vis analysis showed that the band gap is red shifted for the as-prepared and the shell-coated core-shell samples. From this investigation it can be concluded that surface modification with different metal and insulating oxides strongly influences the optical properties of the core-shell materials, which enhances their potential for optical device fabrication.
NEW EPICS/RTEMS IOC BASED ON ALTERA SOC AT JEFFERSON LAB
DOE Office of Scientific and Technical Information (OSTI.GOV)
Yan, Jianxun; Seaton, Chad; Allison, Trent L.
A new EPICS/RTEMS IOC based on the Altera System-on-Chip (SoC) FPGA is being designed at Jefferson Lab. The Altera SoC FPGA integrates a dual ARM Cortex-A9 Hard Processor System (HPS) consisting of processor, peripherals and memory interfaces tied seamlessly with the FPGA fabric using a high-bandwidth interconnect backbone. The embedded Altera SoC IOC has features of remote network boot via U-Boot from SD card or QSPI Flash, 1Gig Ethernet, 1GB DDR3 SDRAM on HPS, UART serial ports, and ISA bus interface. RTEMS for the ARM processor BSP was built with the CEXP shell, which will dynamically load the EPICS applications at runtime. U-Boot is the primary bootloader to remotely load the kernel image into local memory from a DHCP/TFTP server over Ethernet, and automatically run RTEMS and EPICS. The first design of the SoC IOC will be compatible with Jefferson Lab’s current PC104 IOCs, which have been running in CEBAF for 10 years. The next design will be mounted in a chassis and connected to a daughter card via standard HSMC connectors. This standard SoC IOC will become the next generation of low-level IOC for the accelerator controls at Jefferson Lab.
Advanced electronics for the CTF MEG system.
McCubbin, J; Vrba, J; Spear, P; McKenzie, D; Willis, R; Loewen, R; Robinson, S E; Fife, A A
2004-11-30
Development of the CTF MEG system has been advanced with the introduction of a computer processing cluster between the data acquisition electronics and the host computer. The advent of fast processors, memory, and network interfaces has made this innovation feasible for large data streams at high sampling rates. We have implemented tasks including anti-alias filter, sample rate decimation, higher gradient balancing, crosstalk correction, and optional filters with a cluster consisting of 4 dual Intel Xeon processors operating on up to 275 channel MEG systems at 12 kHz sample rate. The architecture is expandable with additional processors to implement advanced processing tasks which may include e.g., continuous head localization/motion correction, optional display filters, coherence calculations, or real time synthetic channels (via beamformer). We also describe an electronics configuration upgrade to provide operator console access to the peripheral interface features such as analog signal and trigger I/O. This allows remote location of the acoustically noisy electronics cabinet and fitting of the cabinet with doors for improved EMI shielding. Finally, we present the latest performance results available for the CTF 275 channel MEG system including an unshielded SEF (median nerve electrical stimulation) measurement enhanced by application of an adaptive beamformer technique (SAM) which allows recognition of the nominal 20-ms response in the unaveraged signal.
Recognition of chromatin by the plant alkaloid, ellipticine as a dual binder
DOE Office of Scientific and Technical Information (OSTI.GOV)
Banerjee, Amrita; Sanyal, Sulagna; Majumder, Parijat
Recognition of core histone components of chromatin along with chromosomal DNA by a class of small molecule modulators is worth examining to evaluate their intracellular mode of action. A plant alkaloid ellipticine (ELP) which is a putative anticancer agent has so far been reported to function via DNA intercalation, association with topoisomerase II and binding to telomere region. However, its effect upon the potential intracellular target, chromatin is hitherto unreported. Here we have characterized the biomolecular recognition between ELP and different hierarchical levels of chromatin. The significant result is that in addition to DNA, it binds to core histone(s) and can be categorized as a ‘dual binder’. As a sequel to binding with histone(s) and core octamer, it alters post-translational histone acetylation marks. We have further demonstrated that it has the potential to modulate gene expression thereby regulating several key biological processes such as nuclear organization, transcription, translation and histone modifications. - Highlights: • Ellipticine acts as a dual binder binding to both DNA and core histone(s). • It induces structural perturbations in chromatin, chromatosome and histone octamer. • It alters histone acetylation and affects global gene expression.
NASA Astrophysics Data System (ADS)
Johnsson, L.; Netzer, G.
2016-10-01
Moore's law, the doubling of transistors per unit area for each CMOS technology generation, is expected to continue throughout the decade, while Dennard voltage scaling resulting in constant power per unit area stopped about a decade ago. The semiconductor industry's response to the loss of Dennard scaling and the consequent challenges in managing power distribution and dissipation has been leveled-off clock rates, a die performance gain reduced from about a factor of 2.8 to 1.4 per technology generation, and multi-core processor dies with increased cache sizes. Increased cache sizes offer performance benefits for many applications as well as energy savings. Accessing data in cache is considerably more energy efficient than main memory accesses. Further, caches consume less power than a corresponding amount of functional logic. As feature sizes continue to be scaled down, an increasing fraction of the die must be “underutilized” or “dark” due to power constraints. With power being a prime design constraint there is a concerted effort to find significantly more energy efficient chip architectures than those dominant in servers today, with chips potentially incorporating several types of cores to cover a range of applications, or different functions in an application, as is already common for the mobile processor market. Digital Signal Processors (DSPs), largely targeting the embedded and mobile processor markets, typically have been designed for a power consumption of 10% or less of a typical x86 CPU, yet with much more than 10% of the floating-point capability of the same technology generation x86 CPUs. Thus, DSPs could potentially offer an energy efficient alternative to x86 CPUs. Here we report an assessment of the Texas Instruments TMS320C6678 DSP in regards to its energy efficiency for two common HPC benchmarks: STREAM (memory system benchmark) and HPL (CPU benchmark).
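For readers unfamiliar with STREAM, its best-known kernel is the "triad", a[i] = b[i] + q * c[i], and the reported bandwidth counts three array elements moved per iteration. The following is only a sketch of that kernel and its byte accounting, not the official benchmark and not the authors' DSP code: the function names are invented here, and a pure-Python loop over lists measures interpreter overhead rather than real memory bandwidth.

```python
import time


def triad(a, b, c, q):
    """STREAM-style triad kernel: a[i] = b[i] + q * c[i] for every i."""
    for i in range(len(a)):
        a[i] = b[i] + q * c[i]


def triad_bytes_moved(n, bytes_per_element=8):
    """Bytes counted per triad pass: read b, read c, write a (3 arrays)."""
    return 3 * bytes_per_element * n


def run_triad(n=100_000, q=3.0):
    """Run one triad pass and return (result array, nominal bytes per second)."""
    a, b, c = [0.0] * n, [1.0] * n, [2.0] * n
    start = time.perf_counter()
    triad(a, b, c, q)
    elapsed = max(time.perf_counter() - start, 1e-12)  # guard against zero
    return a, triad_bytes_moved(n) / elapsed


if __name__ == "__main__":
    result, rate = run_triad()
    print(f"a[0] = {result[0]}, nominal rate = {rate / 1e6:.1f} MB/s")
```

Dividing a rate measured this way by the measured power draw gives the kind of bandwidth-per-watt figure on which such an energy-efficiency comparison rests.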
DOE Office of Scientific and Technical Information (OSTI.GOV)
Orza, Anamaria; Wu, Hui; Li, Yuancheng
Purpose: To develop a core/shell nanodimer of gold (core) and silver iodine (shell) as a dual-modal contrast-enhancing agent for biomarker targeted x-ray computed tomography (CT) and photoacoustic imaging (PAI) applications. Methods: The gold and silver iodine core/shell nanodimer (Au/AgICSD) was prepared by fusing together components of gold, silver, and iodine. The physicochemical properties of Au/AgICSD were then characterized using different optical and imaging techniques (e.g., HR-transmission electron microscope, scanning transmission electron microscope, x-ray photoelectron spectroscopy, energy-dispersive x-ray spectroscopy, Z-potential, and UV-vis). The CT and PAI contrast-enhancing effects were tested and then compared with a clinically used CT contrast agent and Au nanoparticles. To confer biocompatibility and the capability for efficient biomarker targeting, the surface of the Au/AgICSD nanodimer was modified with the amphiphilic diblock polymer and then functionalized with transferrin for targeting transferrin receptor that is overexpressed in various cancer cells. Cytotoxicity of the prepared Au/AgICSD nanodimer was also tested with both normal and cancer cell lines. Results: The characterizations of prepared Au/AgI core/shell nanostructure confirmed the formation of Au/AgICSD nanodimers. Au/AgICSD nanodimer is stable in physiological conditions for in vivo applications. Au/AgICSD nanodimer exhibited higher contrast enhancement in both CT and PAI for dual-modality imaging. Moreover, transferrin functionalized Au/AgICSD nanodimer showed specific binding to the tumor cells that have a high level of expression of the transferrin receptor. Conclusions: The developed Au/AgICSD nanodimer can be used as a potential biomarker targeted dual-modal contrast agent for both or combined CT and PAI molecular imaging.
Study of photon correlation techniques for processing of laser velocimeter signals
NASA Technical Reports Server (NTRS)
Mayo, W. T., Jr.
1977-01-01
The objective was to provide the theory and a system design for a new type of photon counting processor for low level dual scatter laser velocimeter (LV) signals which would be capable of both the first order measurements of mean flow and turbulence intensity and also the second order time statistics: cross correlation, autocorrelation, and related spectra. A general Poisson process model for low level LV signals and noise, valid from the photon-resolved regime all the way to the limiting case of nonstationary Gaussian noise, was used. Computer simulation algorithms and higher order statistical moment analysis of Poisson processes were derived and applied to the analysis of photon correlation techniques. A system design using a unique dual correlate and subtract frequency discriminator technique is postulated and analyzed. Expectation analysis indicates that the objective measurements are feasible.
Design of a search and rescue terminal based on the dual-mode satellite and CDMA network
NASA Astrophysics Data System (ADS)
Zhao, Junping; Zhang, Xuan; Zheng, Bing; Zhou, Yubin; Song, Hao; Song, Wei; Zhang, Meikui; Liu, Tongze; Zhou, Li
2010-12-01
The current goal is to create a set of portable terminals with GPS/BD2 dual-mode satellite positioning, vital signs monitoring and wireless transmission functions. The terminal depends on an ARM processor to collect and combine data related to vital signs and GPS/BD2 location information, and sends the message to headquarters through the military CDMA network. It integrates multiple functions as a whole. The satellite positioning and wireless transmission capabilities are integrated into the motherboard, and the vital signs sensors used in the form of belts communicate with the board through Bluetooth. It can be adjusted according to the headquarters' instructions. This kind of device is of great practical significance for operations during disaster relief, search and rescue of the wounded in wartime, non-war military operations and other special circumstances.
Acceleration of spiking neural network based pattern recognition on NVIDIA graphics processors.
Han, Bing; Taha, Tarek M
2010-04-01
There is currently a strong push in the research community to develop biological scale implementations of neuron based vision models. Systems at this scale are computationally demanding and generally utilize more accurate neuron models, such as the Izhikevich and the Hodgkin-Huxley models, rather than the more popular integrate and fire model. We examine the feasibility of using graphics processing units (GPUs) to accelerate a spiking neural network based character recognition network to enable such large scale systems. Two versions of the network utilizing the Izhikevich and Hodgkin-Huxley models are implemented. Three NVIDIA general-purpose (GP) GPU platforms are examined, including the GeForce 9800 GX2, the Tesla C1060, and the Tesla S1070. Our results show that the GPGPUs can provide significant speedup over conventional processors. In particular, the fastest GPGPU utilized, the Tesla S1070, provided speedups of 5.6 and 84.4 over highly optimized implementations on the fastest central processing unit (CPU) tested, a quad-core 2.67 GHz Xeon processor, for the Izhikevich and the Hodgkin-Huxley models, respectively. The CPU implementation utilized all four cores and the vector data parallelism offered by the processor. The results indicate that GPUs are well suited for this application domain.
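To give a sense of the per-neuron work being accelerated, the sketch below integrates a single Izhikevich neuron using the standard published form of the model: v' = 0.04 v^2 + 5 v + 140 - u + I and u' = a (b v - u), with the reset v = c, u = u + d when v reaches 30 mV. It is an illustration only, not the authors' GPU or CPU code; the function name, the forward-Euler step, and the "regular spiking" parameter values are assumptions made for this example.

```python
def simulate_izhikevich(current, steps=1000, dt=0.5,
                        a=0.02, b=0.2, c=-65.0, d=8.0):
    """Forward-Euler integration of one Izhikevich neuron.

    current is a constant input; dt is in milliseconds. Returns the list
    of spike times (ms). Default parameters are the commonly quoted
    'regular spiking' set, used here as an assumption.
    """
    v = c          # membrane potential (mV)
    u = b * v      # recovery variable
    spike_times = []
    for step in range(steps):
        dv = 0.04 * v * v + 5.0 * v + 140.0 - u + current
        du = a * (b * v - u)
        v += dt * dv
        u += dt * du
        if v >= 30.0:              # spike threshold reached: record and reset
            spike_times.append(step * dt)
            v = c
            u += d
    return spike_times


if __name__ == "__main__":
    print("spikes with input 10:", len(simulate_izhikevich(10.0)))
    print("spikes with no input:", len(simulate_izhikevich(0.0)))
```

Each neuron's update depends only on its own state and input, which is the data-parallel structure that maps well onto GPU threads. The Hodgkin-Huxley model needs considerably more arithmetic per step, which is consistent with the larger speedup reported for it.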
The Goddard Space Flight Center (GSFC) robotics technology testbed
NASA Technical Reports Server (NTRS)
Schnurr, Rick; Obrien, Maureen; Cofer, Sue
1989-01-01
Much of the technology planned for use in NASA's Flight Telerobotic Servicer (FTS) and the Demonstration Test Flight (DTF) is relatively new and untested. To provide the answers needed to design safe, reliable, and fully functional robotics for flight, NASA/GSFC is developing a robotics technology testbed for research of issues such as zero-g robot control, dual arm teleoperation, simulations, and hierarchical control using a high level programming language. The testbed will be used to investigate these high risk technologies required for the FTS and DTF projects. The robotics technology testbed is centered around the dual arm teleoperation of a pair of 7 degree-of-freedom (DOF) manipulators, each with its own 6-DOF mini-master hand controller. Several levels of safety are implemented using the control processor, a separate watchdog computer, and other low level features. High speed input/output ports allow the control processor to interface to a simulation workstation: all or part of the testbed hardware can be used in real time dynamic simulation of the testbed operations, allowing a quick and safe means for testing new control strategies. The NASA/National Bureau of Standards Standard Reference Model for Telerobot Control System Architecture (NASREM) hierarchical control scheme is being used as the reference standard for system design. All software developed for the testbed, excluding some of the simulation workstation software, is being developed in Ada. The testbed is being developed in phases. The first phase, which is nearing completion, is described, and future developments are highlighted.
NASA Astrophysics Data System (ADS)
Boyko, Oleksiy; Zheleznyak, Mark
2015-04-01
The original numerical code TOPKAPI-IMMS of the distributed rainfall-runoff model TOPKAPI (Todini et al., 1996-2014) has been developed and implemented in Ukraine. A parallel version of the code has been developed recently for use on multiprocessor systems: multicore/multiprocessor PCs and clusters. The algorithm is based on a binary-tree decomposition of the watershed to balance the amount of computation across all processors/cores. The Message Passing Interface (MPI) protocol is used as the parallel computing framework. The numerical efficiency of the parallelization algorithms is demonstrated in case studies of flood prediction for mountain watersheds of the Ukrainian Carpathian region. The modeling results are compared with predictions based on lumped-parameter models.
High-performance multiprocessor architecture for a 3-D lattice gas model
NASA Technical Reports Server (NTRS)
Lee, F.; Flynn, M.; Morf, M.
1991-01-01
The lattice gas method has recently emerged as a promising discrete particle simulation method in areas such as fluid dynamics. We present a very high-performance scalable multiprocessor architecture, called ALGE, proposed for the simulation of a realistic 3-D lattice gas model, Henon's 24-bit FCHC isometric model. Each of these VLSI processors is as powerful as a CRAY-2 for this application. ALGE is scalable in the sense that it achieves linear speedup for both fixed and increasing problem sizes with more processors. The core computation of a lattice gas model consists of many repetitions of two alternating phases: particle collision and propagation. Functional decomposition by symmetry group and virtual move are the respective keys to efficient implementation of collision and propagation.
A Tutorial on Parallel and Concurrent Programming in Haskell
NASA Astrophysics Data System (ADS)
Peyton Jones, Simon; Singh, Satnam
This practical tutorial introduces the features available in Haskell for writing parallel and concurrent programs. We first describe how to write semi-explicit parallel programs by using annotations to express opportunities for parallelism and to help control the granularity of parallelism for effective execution on modern operating systems and processors. We then describe the mechanisms provided by Haskell for writing explicitly parallel programs with a focus on the use of software transactional memory to help share information between threads. Finally, we show how nested data parallelism can be used to write deterministically parallel programs which allows programmers to use rich data types in data parallel programs which are automatically transformed into flat data parallel versions for efficient execution on multi-core processors.
Programming for 1.6 Million cores: Early experiences with IBM's BG/Q SMP architecture
NASA Astrophysics Data System (ADS)
Glosli, James
2013-03-01
With the stall in clock cycle improvements a decade ago, the drive for computational performance has continued along a path of increasing core counts on a processor. The multi-core evolution has been expressed in both a symmetric multiprocessor (SMP) architecture and a CPU/GPU architecture. Debates rage in the high performance computing (HPC) community over which architecture best serves HPC. In this talk I will not attempt to resolve that debate but perhaps fuel it. I will discuss the experience of exploiting Sequoia, a 98304 node IBM Blue Gene/Q SMP at Lawrence Livermore National Laboratory. The advantages and challenges of leveraging the computational power of BG/Q will be detailed through the discussion of two applications. The first application is a Molecular Dynamics code called ddcMD. This is a code developed over the last decade at LLNL and ported to BG/Q. The second application is a cardiac modeling code called Cardioid. This is a code that was recently designed and developed at LLNL to exploit the fine scale parallelism of BG/Q's SMP architecture. Through the lenses of these efforts I'll illustrate the need to rethink how we express and implement our computational approaches. This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344.
CanOpen on RASTA: The Integration of the CanOpen IP Core in the Avionics Testbed
NASA Astrophysics Data System (ADS)
Furano, Gianluca; Guettache, Farid; Magistrati, Giorgio; Tiotto, Gabriele; Ortega, Carlos Urbina; Valverde, Alberto
2013-08-01
This paper presents the work done within the ESA Estec Data Systems Division, targeting the integration of the CanOpen IP Core with the existing Reference Architecture Test-bed for Avionics (RASTA). RASTA is the reference testbed system of the ESA Avionics Lab, designed to integrate the main elements of a typical Data Handling system. It aims at simulating a scenario where a Mission Control Center communicates with on-board computers and systems through a TM/TC link, thus providing the data management through qualified processors and interfaces such as Leon2 core processors, CAN bus controllers, MIL-STD-1553 and SpaceWire. This activity aims at the extension of RASTA with two boards equipped with the HurriCANe controller, acting as CANOpen slaves. CANOpen software modules have been ported to the RASTA system I/O boards equipped with the Gaisler GR-CAN controller and act as master, communicating with the CCIPC boards. CanOpen serves as an upper application layer based on CAN, defined within the CAN-in-Automation standard, and can be regarded as the definitive standard for the implementation of CAN-based system solutions. The development and integration of CCIPC, performed by SITAEL S.p.A., is the first application that aims to bring the CANOpen standard to space applications. The definition of CANOpen within the European Cooperation for Space Standardization (ECSS) is under development.
Static and Dynamic Frequency Scaling on Multicore CPUs
DOE Office of Scientific and Technical Information (OSTI.GOV)
Bao, Wenlei; Hong, Changwan; Chunduri, Sudheer
2016-12-28
Dynamic voltage and frequency scaling (DVFS) adapts CPU power consumption by modifying a processor’s operating frequency (and the associated voltage). Typical approaches employing DVFS involve default strategies such as running at the lowest or the highest frequency, or observing the CPU’s runtime behavior and dynamically adapting the voltage/frequency configuration based on CPU usage. In this paper, we argue that many previous approaches suffer from inherent limitations, such as not accounting for the processor-specific impact of frequency changes on energy for different workload types. We first propose a lightweight runtime-based approach to automatically adapt the frequency based on the CPU workload, which is agnostic of the processor characteristics. We then show that further improvements can be achieved for affine kernels in the application, using a compile-time characterization instead of run-time monitoring to select the frequency and number of CPU cores to use. Our framework relies on a one-time energy characterization of CPU-specific DVFS profiles followed by a compile-time categorization of loop-based code segments in the application. These are combined to determine a priori the frequency and the number of cores to use to execute the application so as to optimize energy or energy-delay product, outperforming the runtime approach. Extensive evaluation on 60 benchmarks and five multi-core CPUs shows that our approach systematically outperforms the powersave Linux governor, while improving overall performance.
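The selection step described above, choosing a frequency and core count that minimize energy or energy-delay product (EDP = energy x time), can be sketched as a lookup over a pre-measured profile. This is a hypothetical illustration rather than the authors' framework: the function name, the profile layout, and every number in the example table are invented for the example.

```python
def best_configuration(profiles, objective="edp"):
    """Pick the (frequency_ghz, cores) key that minimizes the objective.

    profiles maps (frequency_ghz, cores) -> (energy_joules, time_seconds).
    objective is "energy" or "edp" (energy-delay product = energy * time).
    """
    if objective not in ("energy", "edp"):
        raise ValueError("objective must be 'energy' or 'edp'")

    def score(item):
        _config, (energy_j, time_s) = item
        return energy_j if objective == "energy" else energy_j * time_s

    config, _measurement = min(profiles.items(), key=score)
    return config


# Made-up characterization of one workload class on one CPU (assumption).
EXAMPLE_PROFILE = {
    (1.2, 4): (80.0, 10.0),   # lowest energy, but slow
    (2.0, 4): (90.0, 6.0),
    (2.6, 4): (110.0, 5.0),
    (2.6, 8): (120.0, 4.0),   # highest energy, but fastest
}

if __name__ == "__main__":
    print("min energy:", best_configuration(EXAMPLE_PROFILE, "energy"))
    print("min EDP:   ", best_configuration(EXAMPLE_PROFILE, "edp"))
```

In this made-up table the two objectives disagree: the lowest-frequency setting minimizes energy, while the fastest setting minimizes EDP, so the choice of objective matters.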
Federal Register 2010, 2011, 2012, 2013, 2014
2010-02-02
... Reference Prices constitute ``non-core data;'' i.e., the Exchange does not require a central processor to... Realtime Reference Prices Service January 22, 2010. I. Introduction On December 1, 2009, NYSE Arca, Inc... thereunder,\\2\\ a proposed rule change to add data elements to its ``NYSE Arca Realtime Reference Prices...
Generic Divide and Conquer Internet-Based Computing
NASA Technical Reports Server (NTRS)
Follen, Gregory J. (Technical Monitor); Radenski, Atanas
2003-01-01
The growth of Internet-based applications and the proliferation of networking technologies have been transforming traditional commercial application areas as well as computer and computational sciences and engineering. This growth stimulates the exploration of Peer to Peer (P2P) software technologies that can open new research and application opportunities not only for the commercial world, but also for the scientific and high-performance computing applications community. The general goal of this project is to achieve better understanding of the transition to Internet-based high-performance computing and to develop solutions for some of the technical challenges of this transition. In particular, we are interested in creating long-term motivation for end users to provide their idle processor time to support computationally intensive tasks. We believe that a practical P2P architecture should provide useful service to both clients with high-performance computing needs and contributors of lower-end computing resources. To achieve this, we are designing a dual-service architecture for P2P high-performance divide-and-conquer computing; we are also experimenting with a prototype implementation. Our proposed architecture incorporates a master server, utilizes dual satellite servers, and operates on the Internet in a dynamically changing large configuration of lower-end nodes provided by volunteer contributors. A dual satellite server comprises a high-performance computing engine and a lower-end contributor service engine. The computing engine provides generic support for divide and conquer computations. The service engine is intended to provide free useful HTTP-based services to contributors of lower-end computing resources. Our proposed architecture is complementary to and accessible from computational grids, such as Globus, Legion, and Condor.
Grids provide remote access to existing higher-end computing resources; in contrast, our goal is to utilize idle processor time of lower-end Internet nodes. Our project is focused on a generic divide and conquer paradigm and on mobile applications of this paradigm that can operate on a loose and ever changing pool of lower-end Internet nodes.
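The "generic divide and conquer paradigm" mentioned above can be pictured as a skeleton parameterized by four user-supplied functions. The sketch below is not the project's architecture or API: all names are hypothetical, and it runs sequentially on one machine, whereas the described system would distribute the independent subproblems to contributor nodes.

```python
def divide_and_conquer(problem, is_base, solve_base, divide, combine):
    """Generic skeleton: split until trivial, solve, then combine results."""
    if is_base(problem):
        return solve_base(problem)
    # Subproblems are independent, so a distributed system could farm each
    # one out to a different node; here they are simply solved in order.
    sub_results = [
        divide_and_conquer(sub, is_base, solve_base, divide, combine)
        for sub in divide(problem)
    ]
    return combine(sub_results)


def merge_sorted(parts):
    """Combine step for merge sort: merge two already-sorted lists."""
    left, right = parts
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged


def merge_sort(values):
    """Merge sort expressed as an instance of the generic skeleton."""
    return divide_and_conquer(
        list(values),
        is_base=lambda xs: len(xs) <= 1,
        solve_base=lambda xs: xs,
        divide=lambda xs: [xs[: len(xs) // 2], xs[len(xs) // 2:]],
        combine=merge_sorted,
    )


if __name__ == "__main__":
    print(merge_sort([5, 2, 9, 1, 5, 6]))
```

Keeping the skeleton separate from the problem-specific functions is what makes the support "generic": a new application supplies only the four callbacks.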
PS3 CELL Development for Scientific Computation and Research
NASA Astrophysics Data System (ADS)
Christiansen, M.; Sevre, E.; Wang, S. M.; Yuen, D. A.; Liu, S.; Lyness, M. D.; Broten, M.
2007-12-01
The Cell processor is one of the most powerful processors on the market, and researchers in the earth sciences may find its parallel architecture to be very useful. A Cell processor, with 7 cores, can easily be obtained for experimentation by purchasing a PlayStation 3 (PS3) and installing Linux and the IBM SDK. Each core of the PS3 is capable of 25 GFLOPS, giving a potential limit of 150 GFLOPS when using all 6 SPUs (synergistic processing units) with vectorized algorithms. We have used the Cell's computational power to create a program which takes simulated tsunami datasets, parses them, and returns a colorized height field image using ray casting techniques. As expected, the time required to create an image is inversely proportional to the number of SPUs used. We believe that this trend will continue when multiple PS3s are chained using OpenMP functionality and are in the process of researching this. By using the Cell to visualize tsunami data, we have found that its greatest feature is its power. This fits well with the needs of the scientific community, where the limiting factor is time. Any algorithm, such as the heat equation, that can be subdivided into multiple parts can take advantage of the PS3 Cell's ability to split the computations across the 6 SPUs, reducing the required run time to one sixth. Further vectorization of the code can allow for 4 simultaneous floating point operations by using the SIMD (single instruction multiple data) capabilities of the SPU, increasing efficiency 24 times.
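The figures quoted in this abstract follow from simple multiplication, which the short sketch below makes explicit. The constants come from the abstract (25 GFLOPS per SPU, 6 usable SPUs, 4-wide SIMD); the function names are illustrative, and the results are ideal upper bounds that ignore data transfer and synchronization costs.

```python
GFLOPS_PER_SPU = 25.0   # per-core figure quoted in the abstract
USABLE_SPUS = 6         # SPUs available to applications on the PS3
SIMD_LANES = 4          # simultaneous floating point operations per SPU


def peak_gflops(spus=USABLE_SPUS, per_spu=GFLOPS_PER_SPU):
    """Aggregate peak rate when every SPU runs vectorized code."""
    return spus * per_spu


def ideal_runtime_fraction(spus=USABLE_SPUS):
    """Run time relative to one SPU if work divides perfectly (time ~ 1/N)."""
    return 1.0 / spus


def ideal_combined_speedup(spus=USABLE_SPUS, lanes=SIMD_LANES):
    """Upper bound from splitting across SPUs and vectorizing within each."""
    return spus * lanes


if __name__ == "__main__":
    print("peak GFLOPS:", peak_gflops())
    print("runtime fraction on 6 SPUs:", ideal_runtime_fraction())
    print("ideal combined speedup:", ideal_combined_speedup())
```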
High-Speed On-Board Data Processing Platform for LIDAR Projects at NASA Langley Research Center
NASA Astrophysics Data System (ADS)
Beyon, J.; Ng, T. K.; Davis, M. J.; Adams, J. K.; Lin, B.
2015-12-01
The project called High-Speed On-Board Data Processing for Science Instruments (HOPS) has been funded by NASA Earth Science Technology Office (ESTO) Advanced Information Systems Technology (AIST) program during April, 2012 - April, 2015. HOPS is an enabler for science missions with extremely high data processing rates. In this three-year effort of HOPS, Active Sensing of CO2 Emissions over Nights, Days, and Seasons (ASCENDS) and 3-D Winds were of interest in particular. As for ASCENDS, HOPS replaces time domain data processing with frequency domain processing while making the real-time on-board data processing possible. As for 3-D Winds, HOPS offers real-time high-resolution wind profiling with 4,096-point fast Fourier transform (FFT). HOPS is adaptable with quick turn-around time. Since HOPS offers reusable user-friendly computational elements, its FPGA IP Core can be modified for a shorter development period if the algorithm changes. The FPGA and memory bandwidth of HOPS is 20 GB/sec while the typical maximum processor-to-SDRAM bandwidth of the commercial radiation tolerant high-end processors is about 130-150 MB/sec. The inter-board communication bandwidth of HOPS is 4 GB/sec while the effective processor-to-cPCI bandwidth of commercial radiation tolerant high-end boards is about 50-75 MB/sec. Also, HOPS offers VHDL cores for the easy and efficient implementation of ASCENDS and 3-D Winds, and other similar algorithms. A general overview of the 3-year development of HOPS is the goal of this presentation.
High-Speed On-Board Data Processing for Science Instruments: HOPS
NASA Technical Reports Server (NTRS)
Beyon, Jeffrey
2015-01-01
The project called High-Speed On-Board Data Processing for Science Instruments (HOPS) has been funded by NASA Earth Science Technology Office (ESTO) Advanced Information Systems Technology (AIST) program during April, 2012 - April, 2015. HOPS is an enabler for science missions with extremely high data processing rates. In this three-year effort of HOPS, Active Sensing of CO2 Emissions over Nights, Days, and Seasons (ASCENDS) and 3-D Winds were of interest in particular. As for ASCENDS, HOPS replaces time domain data processing with frequency domain processing while making the real-time on-board data processing possible. As for 3-D Winds, HOPS offers real-time high-resolution wind profiling with 4,096-point fast Fourier transform (FFT). HOPS is adaptable with quick turn-around time. Since HOPS offers reusable user-friendly computational elements, its FPGA IP Core can be modified for a shorter development period if the algorithm changes. The FPGA and memory bandwidth of HOPS is 20 GB/sec while the typical maximum processor-to-SDRAM bandwidth of the commercial radiation tolerant high-end processors is about 130-150 MB/sec. The inter-board communication bandwidth of HOPS is 4 GB/sec while the effective processor-to-cPCI bandwidth of commercial radiation tolerant high-end boards is about 50-75 MB/sec. Also, HOPS offers VHDL cores for the easy and efficient implementation of ASCENDS and 3-D Winds, and other similar algorithms. A general overview of the 3-year development of HOPS is the goal of this presentation.
Use of Field Programmable Gate Array Technology in Future Space Avionics
NASA Technical Reports Server (NTRS)
Ferguson, Roscoe C.; Tate, Robert
2005-01-01
Fulfilling NASA's new vision for space exploration requires the development of sustainable, flexible and fault tolerant spacecraft control systems. The traditional development paradigm consists of the purchase or fabrication of hardware boards with fixed processor and/or Digital Signal Processing (DSP) components interconnected via a standardized bus system. This is followed by the purchase and/or development of software. This paradigm has several disadvantages for the development of systems to support NASA's new vision. Building a system to be fault tolerant increases the complexity and decreases the performance of included software. Standard bus design and conventional implementation produce natural bottlenecks. Configuring hardware components in systems containing common processors and DSPs is difficult initially and expensive or impossible to change later. The existence of Hardware Description Languages (HDLs), the recent increase in performance, density and radiation tolerance of Field Programmable Gate Arrays (FPGAs), and Intellectual Property (IP) Cores provides the technology for reprogrammable Systems on a Chip (SOC). This technology supports a paradigm better suited for NASA's vision. Hardware and software production are melded for more effective development; they can both evolve together over time. Designers incorporating this technology into future avionics can benefit from its flexibility. Systems can be designed with improved fault isolation and tolerance using hardware instead of software. Also, these designs can be protected from obsolescence problems where maintenance is compromised via component and vendor availability. To investigate the flexibility of this technology, the core of the Central Processing Unit and Input/Output Processor of the Space Shuttle AP101S Computer were prototyped in Verilog HDL and synthesized into an Altera Stratix FPGA.
Towards energy-efficient photonic interconnects
NASA Astrophysics Data System (ADS)
Demir, Yigit; Hardavellas, Nikos
2015-03-01
Silicon photonics have emerged as a promising solution to meet the growing demand for high-bandwidth, low-latency, and energy-efficient on-chip and off-chip communication in many-core processors. However, current silicon-photonic interconnect designs for many-core processors waste a significant amount of power because (a) lasers are always on, even during periods of interconnect inactivity, and (b) microring resonators employ heaters which consume a significant amount of power just to overcome thermal variations and maintain communication on the photonic links, especially in a 3D-stacked design. The problem of high laser power consumption is particularly important as lasers typically have very low energy efficiency, and photonic interconnects often remain underutilized both in scientific computing (compute-intensive execution phases underutilize the interconnect), and in server computing (servers in Google-scale datacenters have a typical utilization of less than 30%). We address the high laser power consumption by proposing EcoLaser+, which is a laser control scheme that saves energy by predicting the interconnect activity and opportunistically turning the on-chip laser off when possible, and also by scaling the width of the communication link based on a runtime prediction of the expected message length. Our laser control scheme can save up to 62 - 92% of the laser energy, and improve the energy efficiency of a manycore processor with negligible performance penalty. We address the high trimming (heating) power consumption of the microrings by proposing insulation methods that reduce the impact of localized heating induced by highly-active components on the 3D-stacked logic die.
Dual redundant core memory systems
NASA Technical Reports Server (NTRS)
Hull, F. E.
1972-01-01
An electronic memory system consisting of series redundant drive switch circuits, triple redundant majority voted memory timing functions, and two data registers to provide functional dual redundancy is described. Signal flow through the circuits is illustrated and the sequence of events which occur within the memory system is explained.
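Triple redundant majority voting, as named in this abstract, can be illustrated in software. The sketch below is a generic bitwise 2-of-3 voter and is an assumption about the general technique only; the report describes hardware timing circuits, and the function name is hypothetical.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Return the bitwise 2-of-3 majority of three redundant words.

    Each output bit takes the value that at least two of the three
    inputs agree on, so a fault in any single copy is masked.
    """
    return (a & b) | (a & c) | (b & c)
```

For example, `majority_vote(0b1010, 0b1010, 0b0110)` returns `0b1010`: the third copy disagrees in two bit positions and is outvoted in both.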
NASA Astrophysics Data System (ADS)
Victor, Rodolfo A.; Prodanović, Maša.; Torres-Verdín, Carlos
2017-12-01
We develop a new Monte Carlo-based inversion method for estimating electron density and effective atomic number from 3-D dual-energy computed tomography (CT) core scans. The method accounts for uncertainties in X-ray attenuation coefficients resulting from the polychromatic nature of X-ray beam sources of medical and industrial scanners, in addition to delivering uncertainty estimates of inversion products. Estimation of electron density and effective atomic number from CT core scans enables direct deterministic or statistical correlations with salient rock properties for improved petrophysical evaluation; this condition is specifically important in media such as vuggy carbonates where CT resolution better captures core heterogeneity that dominates fluid flow properties. Verification tests of the inversion method performed on a set of highly heterogeneous carbonate cores yield very good agreement with in situ borehole measurements of density and photoelectric factor.
Active Flash: Performance-Energy Tradeoffs for Out-of-Core Processing on Non-Volatile Memory Devices
DOE Office of Scientific and Technical Information (OSTI.GOV)
Boboila, Simona; Kim, Youngjae; Vazhkudai, Sudharshan S
2012-01-01
In this abstract, we study the performance and energy tradeoffs involved in migrating data analysis into the flash device, a process we refer to as Active Flash. The Active Flash paradigm is similar to 'active disks', which has received considerable attention. Active Flash allows us to move processing closer to data, thereby minimizing data movement costs and reducing power consumption. It enables true out-of-core computation. The conventional definition of out-of-core solvers refers to an approach to process data that is too large to fit in the main memory and, consequently, requires access to disk. However, in Active Flash, processing outside the host CPU literally frees the core and achieves real 'out-of-core' analysis. Moving analysis to data has long been desirable, not just at this level, but at all levels of the system hierarchy. However, this requires a detailed study on the tradeoffs involved in achieving analysis turnaround under an acceptable energy envelope. To this end, we first need to evaluate if there is enough computing power on the flash device to warrant such an exploration. Flash processors require decent computing power to run the internal logic pertaining to the Flash Translation Layer (FTL), which is responsible for operations such as address translation, garbage collection (GC) and wear-leveling. Modern SSDs are composed of multiple packages and several flash chips within a package. The packages are connected using multiple I/O channels to offer high I/O bandwidth. SSD computing power is also expected to be high enough to exploit such inherent internal parallelism within the drive to increase the bandwidth and to handle fast I/O requests. More recently, SSD devices are being equipped with powerful processing units and are even embedded with multicore CPUs (e.g. ARM Cortex-A9 embedded processor is advertised to reach 2GHz frequency and deliver 5000 DMIPS; OCZ RevoDrive X2 SSD has 4 SandForce controllers, each with 780MHz max frequency Tensilica core). Efforts that take advantage of the available computing cycles on the processors on SSDs to run auxiliary tasks other than actual I/O requests are beginning to emerge. Kim et al. investigate database scan operations in the context of processing on the SSDs, and propose dedicated hardware logic to speed up scans. Also, cluster architectures have been explored, which consist of low-power embedded CPUs coupled with small local flash to achieve fast, parallel access to data. Processor utilization on SSD is highly dependent on workloads and, therefore, they can be idle during periods with no I/O accesses. We propose to use the available processing capability on the SSD to run tasks that can be offloaded from the host. This paper makes the following contributions: (1) We have investigated Active Flash and its potential to optimize the total energy cost, including power consumption on the host and the flash device; (2) We have developed analytical models to analyze the performance-energy tradeoffs for Active Flash, by treating the SSD as a blackbox; this is particularly valuable due to the proprietary nature of the SSD internal hardware; and (3) We have enhanced a well-known SSD simulator (from MSR) to implement 'on-the-fly' data compression using Active Flash. Our results provide a window into striking a balance between energy consumption and application performance.
Video Guidance Sensor System With Integrated Rangefinding
NASA Technical Reports Server (NTRS)
Book, Michael L. (Inventor); Bryan, Thomas C. (Inventor); Howard, Richard T. (Inventor); Roe, Fred Davis, Jr. (Inventor); Bell, Joseph L. (Inventor)
2006-01-01
A video guidance sensor system for use, e.g., in automated docking of a chase vehicle with a target vehicle. The system includes an integrated rangefinder sub-system that uses time of flight measurements to measure range. The rangefinder sub-system includes a pair of matched photodetectors for respectively detecting an output laser beam and return laser beam, a buffer memory for storing the photodetector outputs, and a digitizer connected to the buffer memory and including dual amplifiers and analog-to-digital converters. A digital signal processor processes the digitized output to produce a range measurement.
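The time-of-flight principle mentioned in this abstract reduces to range = c * Δt / 2, where Δt is the delay between the outgoing and returning laser pulses. The sketch below is a minimal illustration of that formula only; the function name and the idea of passing two detection timestamps are assumptions, not details taken from the patent.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458.0  # defined value of c in vacuum


def range_from_time_of_flight(t_out_s: float, t_return_s: float) -> float:
    """Estimate target range in metres from two pulse detection times.

    t_out_s    -- time at which the outgoing pulse was detected (seconds)
    t_return_s -- time at which the returning pulse was detected (seconds)
    The light travels to the target and back, hence the division by two.
    """
    round_trip_s = t_return_s - t_out_s
    if round_trip_s < 0:
        raise ValueError("return pulse cannot precede the outgoing pulse")
    return SPEED_OF_LIGHT_M_PER_S * round_trip_s / 2.0
```

A round-trip delay of one microsecond corresponds to a range of roughly 150 m.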
Modulated Fourier Transform Raman Fiber-Optic Spectroscopy
NASA Technical Reports Server (NTRS)
Jensen, Brian J. (Inventor); Cooper, John B. (Inventor); Wise, Kent L. (Inventor)
2000-01-01
A modification to a commercial Fourier Transform (FT) Raman spectrometer is presented for the elimination of thermal backgrounds in the FT Raman spectra. The modification involves the use of a mechanical optical chopper to modulate the continuous wave laser, remote collection of the signal via fiber optics, and connection of a dual-phase digital-signal-processor (DSP) lock-in amplifier between the detector and the spectrometer's collection electronics to demodulate and filter the optical signals. The resulting Modulated Fourier Transform Raman Fiber-Optic Spectrometer is capable of completely eliminating thermal backgrounds at temperatures exceeding 300 C.
Fast 4-2 Compressor of Booth Multiplier Circuits for High-Speed RISC Processor
NASA Astrophysics Data System (ADS)
Yuan, S. C.
2008-11-01
We use different XOR circuits to optimize the XOR-structure 4-2 compressor, and design a transmission-gate (TG) 4-2 compressor using single-to-dual-rail circuit configurations. The maximum propagation delay, power consumption, and layout area of the designed 4-2 compressors are simulated with 0.35 μm and 0.25 μm CMOS process parameters and compared with those of synthesized 4-2 circuits. The results show that the designed 4-2 compressors are faster and smaller in area than the synthesized ones.
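For readers unfamiliar with the building block, a 4-2 compressor reduces four partial-product bits plus a carry-in to a sum bit and two carry bits of weight two. The sketch below is a behavioural model built from two cascaded full adders, which is the textbook construction and not the optimized XOR or transmission-gate circuits of the paper; all names are illustrative.

```python
from itertools import product


def full_adder(a: int, b: int, c: int) -> tuple[int, int]:
    """Return (sum, carry) of three input bits."""
    return a ^ b ^ c, (a & b) | (a & c) | (b & c)


def compressor_4_2(x1: int, x2: int, x3: int, x4: int, cin: int) -> tuple[int, int, int]:
    """Behavioural 4-2 compressor made of two cascaded full adders.

    Returns (sum, carry, cout). cout depends only on x1..x3, so it can
    feed the neighbouring column without a ripple through cin.
    """
    s1, cout = full_adder(x1, x2, x3)
    total, carry = full_adder(s1, x4, cin)
    return total, carry, cout


def check_all_inputs() -> bool:
    """Verify x1+x2+x3+x4+cin == sum + 2*(carry + cout) for all 32 cases."""
    for bits in product((0, 1), repeat=5):
        s, carry, cout = compressor_4_2(*bits)
        if sum(bits) != s + 2 * (carry + cout):
            return False
    return True
```

`check_all_inputs()` exhaustively confirms the defining arithmetic identity of the compressor.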
Execution of parallel algorithms on a heterogeneous multicomputer
NASA Astrophysics Data System (ADS)
Isenstein, Barry S.; Greene, Jonathon
1995-04-01
Many aerospace/defense sensing and dual-use applications require high-performance computing, extensive high-bandwidth interconnect and realtime deterministic operation. This paper will describe the architecture of a scalable multicomputer that includes DSP and RISC processors. A single chassis implementation is capable of delivering in excess of 10 GFLOPS of DSP processing power with 2 Gbytes/s of realtime sensor I/O. A software approach to implementing parallel algorithms called the Parallel Application System (PAS) is also presented. An example of applying PAS to a DSP application is shown.
NASA Technical Reports Server (NTRS)
Biess, J. J.; Inouye, L. Y.; Shank, J. H.
1974-01-01
A high-voltage, high-power LC series resonant inverter using SCRs has been developed for an Ion Engine Power Processor. The inverter operates within 200-400Vdc with a maximum output power of 2.5kW. The inverter control logic, the screen supply electrical and mechanical characteristics, the efficiency and losses in power components, regulation on the dual feedback principle, the SCR waveforms and the component weight are analyzed. Efficiency of 90.5% and weight density of 4.1kg/kW are obtained.
Generating unstructured nuclear reactor core meshes in parallel
Jain, Rajeev; Tautges, Timothy J.
2014-10-24
Recent advances in supercomputers and parallel solver techniques have enabled users to run large simulation problems using millions of processors. Techniques for multiphysics nuclear reactor core simulations are under active development in several countries. Most of these techniques require large unstructured meshes that can be hard to generate on standalone desktop computers because of high memory requirements, limited processing power, and other complexities. We have previously reported on a hierarchical lattice-based approach for generating reactor core meshes. Here, we describe efforts to exploit coarse-grained parallelism during reactor assembly and reactor core mesh generation processes. We highlight several reactor core examples including a very high temperature reactor, a full-core model of the Korean MONJU reactor, a ¼ pressurized water reactor core, the fast reactor Experimental Breeder Reactor-II core with a XX09 assembly, and an advanced breeder test reactor core. The times required to generate large mesh models, along with speedups obtained from running these problems in parallel, are reported. A graphical user interface to the tools described here has also been developed.
NASA Astrophysics Data System (ADS)
Romano, Paul Kollath
Monte Carlo particle transport methods are being considered as a viable option for high-fidelity simulation of nuclear reactors. While Monte Carlo methods offer several potential advantages over deterministic methods, there are a number of algorithmic shortcomings that would prevent their immediate adoption for full-core analyses. In this thesis, algorithms are proposed both to ameliorate the degradation in parallel efficiency typically observed for large numbers of processors and to offer a means of decomposing large tally data that will be needed for reactor analysis. A nearest-neighbor fission bank algorithm was proposed and subsequently implemented in the OpenMC Monte Carlo code. A theoretical analysis of the communication pattern shows that the expected cost is O(√N) whereas traditional fission bank algorithms are O(N) at best. The algorithm was tested on two supercomputers, the Intrepid Blue Gene/P and the Titan Cray XK7, and demonstrated nearly linear parallel scaling up to 163,840 processor cores on a full-core benchmark problem. An algorithm for reducing network communication arising from tally reduction was analyzed and implemented in OpenMC. The proposed algorithm groups only particle histories on a single processor into batches for tally purposes; in doing so it prevents all network communication for tallies until the very end of the simulation. The algorithm was tested, again on a full-core benchmark, and shown to reduce network communication substantially. A model was developed to predict the impact of load imbalances on the performance of domain decomposed simulations. The analysis demonstrated that load imbalances in domain decomposed simulations arise from two distinct phenomena: non-uniform particle densities and non-uniform spatial leakage. The dominant performance penalty for domain decomposition was shown to come from these physical effects rather than insufficient network bandwidth or high latency.
The model predictions were verified with measured data from simulations in OpenMC on a full-core benchmark problem. Finally, a novel algorithm for decomposing large tally data was proposed, analyzed, and implemented/tested in OpenMC. The algorithm relies on disjoint sets of compute processes and tally servers. The analysis showed that for a range of parameters relevant to LWR analysis, the tally server algorithm should perform with minimal overhead. Tests were performed on Intrepid and Titan and demonstrated that the algorithm did indeed perform well over a wide range of parameters. (Copies available exclusively from MIT Libraries, libraries.mit.edu/docs - docs mit.edu)
Dual-point reflective refractometer based on parallel no-core fiber/FBG structure
NASA Astrophysics Data System (ADS)
Guo, Cuijuan; Niu, Panpan; Wang, Juan; Zhao, Junfa; Zhang, Cheng
2018-01-01
A novel dual-point reflective fiber-optic refractometer based on the multimode interference (MMI) effect and fiber Bragg grating (FBG) reflection is proposed and experimentally demonstrated, which adopts a parallel structure. Each point of the refractometer consists of a single mode-no core-single mode fiber (SNS) structure cascaded with an FBG. Assisted by the reflection of the FBG, refractive index (RI) measurement can be achieved by monitoring the peak power variation of the reflected FBG spectrum. By selecting different lengths of the no core fiber and center wavelengths of the FBG, an independent dual-point refractometer is easily realized. Experimental results show that the refractometer has a nonlinear relationship between the surrounding refractive index (SRI) and the peak power of the reflected FBG spectrum in the RI range of 1.3330-1.4086. A linear relationship can be approximately obtained by dividing the measuring range into 1.3330-1.3611 and 1.3764-1.4086. In the RI range of 1.3764-1.4086, the two sensing points have higher RI sensitivities of 319.34 dB/RIU and 211.84 dB/RIU, respectively.
Low-voltage analog front-end processor design for ISFET-based sensor and H+ sensing applications
NASA Astrophysics Data System (ADS)
Chung, Wen-Yaw; Yang, Chung-Huang; Peng, Kang-Chu; Yeh, M. H.
2003-04-01
This paper presents a modular-based low-voltage analog-front-end processor design in a 0.5 μm double-poly double-metal CMOS technology for Ion Sensitive Field Effect Transistor (ISFET)-based sensor and H+ sensing applications. To meet the potentiometric response of the ISFET that is proportional to various H+ concentrations, the constant-voltage and constant current (CVCS) testing configuration has been used. Low-voltage design techniques such as a bulk-driven input pair, a folded-cascode amplifier, and bootstrap switch control circuits have been designed and integrated for a 1.5V supply and nearly rail-to-rail analog to digital signal processing. Core modules consisting of an 8-bit two-step analog-digital converter and bulk-driven pre-amplifiers have been developed in this research. The experimental results show that the proposed circuitry has an acceptable linearity to 0.1 pH-H+ sensing conversions with the buffer solution in the range of pH2 to pH12. The processor has a potential usage in battery-operated and portable healthcare devices and environmental monitoring applications.
Fault-Tolerant Software-Defined Radio on Manycore
NASA Technical Reports Server (NTRS)
Ricketts, Scott
2015-01-01
Software-defined radio (SDR) platforms generally rely on field-programmable gate arrays (FPGAs) and digital signal processors (DSPs), but such architectures require significant software development. In addition, application demands for radiation mitigation and fault tolerance exacerbate programming challenges. MaXentric Technologies, LLC, has developed a manycore-based SDR technology that provides 100 times the throughput of conventional radiation-hardened general purpose processors. Manycore systems (30-100 cores and beyond) have the potential to provide high processing performance at error rates that are equivalent to current space-deployed uniprocessor systems. MaXentric's innovation is a highly flexible radio, providing over-the-air reconfiguration; adaptability; and uninterrupted, real-time, multimode operation. The technology is also compliant with NASA's Space Telecommunications Radio System (STRS) architecture. In addition to its many uses within NASA communications, the SDR can also serve as a highly programmable research-stage prototyping device for new waveforms and other communications technologies. It can also support noncommunication codes on its multicore processor, collocated with the communications workload, reducing the size, weight, and power of the overall system by aggregating processing jobs to a single board computer.
Single-event upset in highly scaled commercial silicon-on-insulator PowerPC microprocessors
NASA Technical Reports Server (NTRS)
Irom, Farokh; Farmanesh, Farhad H.
2004-01-01
Single event upset effects from heavy ions are measured for Motorola and IBM silicon-on-insulator (SOI) microprocessors with different feature sizes and core voltages. The results are compared with results for similar devices with bulk substrates. The cross sections of the SOI processors are lower than their bulk counterparts, but the threshold is about the same, even though the charge collection depth is more than an order of magnitude smaller in the SOI devices. The scaling of the cross section with reduction of feature size and core voltage dependence for SOI microprocessors is discussed.
CQPSO scheduling algorithm for heterogeneous multi-core DAG task model
NASA Astrophysics Data System (ADS)
Zhai, Wenzheng; Hu, Yue-Li; Ran, Feng
2017-07-01
Efficient task scheduling is critical to achieving high performance in a heterogeneous multi-core computing environment. The paper focuses on the heterogeneous multi-core directed acyclic graph (DAG) task model and proposes a novel task scheduling method based on an improved chaotic quantum-behaved particle swarm optimization (CQPSO) algorithm. A task priority scheduling list was built. The processor with the minimum cumulative earliest finish time (EFT) was selected as the target of the first task assignment. The task precedence relationships were satisfied and the total execution time of all tasks was minimized. The experimental results show that the proposed algorithm offers good optimization ability, is simple and feasible, converges quickly, and can be applied to task scheduling optimization in other heterogeneous and distributed environments.
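The earliest-finish-time (EFT) assignment rule mentioned in this abstract can be sketched as a plain greedy list scheduler. This is not the CQPSO algorithm of the paper: it is a simplified illustration that ignores communication costs, and the function name, argument layout, and example data are all assumptions.

```python
def schedule_by_eft(order, preds, cost):
    """Greedy list scheduling on heterogeneous processors.

    order -- task ids in a priority order that respects precedence
    preds -- dict mapping a task to the list of its predecessor tasks
    cost  -- dict mapping a task to its execution time on each processor
    Returns (assignment, finish_times, makespan).
    """
    n_proc = len(next(iter(cost.values())))
    proc_free = [0.0] * n_proc        # time at which each processor becomes idle
    finish, assignment = {}, {}
    for task in order:
        # A task may start only after all of its predecessors have finished.
        ready = max((finish[p] for p in preds.get(task, [])), default=0.0)
        best_proc, best_eft = None, float("inf")
        for p in range(n_proc):
            eft = max(ready, proc_free[p]) + cost[task][p]
            if eft < best_eft:        # ties keep the lower-numbered processor
                best_proc, best_eft = p, eft
        assignment[task] = best_proc
        finish[task] = best_eft
        proc_free[best_proc] = best_eft
    return assignment, finish, max(finish.values())


# Hypothetical four-task diamond DAG on two processors.
example_order = ["A", "B", "C", "D"]
example_preds = {"B": ["A"], "C": ["A"], "D": ["B", "C"]}
example_cost = {"A": [2, 3], "B": [3, 2], "C": [4, 2], "D": [1, 2]}
```

A metaheuristic such as CQPSO would search over task orderings or assignments, whereas this sketch fixes the order and only picks the processor greedily.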
Investigation of an advanced fault tolerant integrated avionics system
NASA Technical Reports Server (NTRS)
Dunn, W. R.; Cottrell, D.; Flanders, J.; Javornik, A.; Rusovick, M.
1986-01-01
Presented is an advanced, fault-tolerant multiprocessor avionics architecture as could be employed in an advanced rotorcraft such as LHX. The processor structure is designed to interface with existing digital avionics systems and concepts including the Army Digital Avionics System (ADAS) cockpit/display system, navaid and communications suites, integrated sensing suite, and the Advanced Digital Optical Control System (ADOCS). The report defines mission, maintenance and safety-of-flight reliability goals as might be expected for an operational LHX aircraft. Based on use of a modular, compact (16-bit) microprocessor card family, results of a preliminary study examining simplex, dual and standby-sparing architectures are presented. Given the stated constraints, it is shown that the dual architecture is best suited to meet reliability goals with minimum hardware and software overhead. The report presents hardware and software design considerations for realizing the architecture including redundancy management requirements and techniques as well as verification and validation needs and methods.
A preliminary evaluation of a dual crystal positron camera
NASA Astrophysics Data System (ADS)
Holte, S.; Ostertag, H.; Kesselberg, M.
1987-03-01
A dual crystal whole body camera based on Bi4Ge3O12 and Gd2SiO5 was built. Spatial transaxial resolution is better than 5 mm FWHM, with maintained high sensitivity. The system can be equipped with up to four rings to give sufficient coverage of the organs under study. It can perform true dynamic function studies with frame rates of the order of 1 sec or less and can handle high data acquisition rates, encountered in cerebral blood flow studies and in perfusion studies of the heart, with low dead time losses. High sampling redundancy is achieved by wobbling over two detector channels. Fast image reconstruction is achieved by an array processor. Tilting and rotating capabilities of the gantry facilitate the anatomical alignment of the image plane. A rotating line source is used for accurate transmission images with a low scatter level.
NASA Technical Reports Server (NTRS)
Defeo, P.; Doane, D.; Saito, J.
1982-01-01
A Digital Flight Control Systems Verification Laboratory (DFCSVL) has been established at NASA Ames Research Center. This report describes the major elements of the laboratory, the research activities that can be supported in the area of verification and validation of digital flight control systems (DFCS), and the operating scenarios within which these activities can be carried out. The DFCSVL consists of a palletized dual-dual flight-control system linked to a dedicated PDP-11/60 processor. Major software support programs are hosted in a remotely located UNIVAC 1100 accessible from the PDP-11/60 through a modem link. Important features of the DFCSVL include extensive hardware and software fault insertion capabilities, a real-time closed loop environment to exercise the DFCS, an integrated set of software verification tools, and a user-oriented interface to all the resources and capabilities.
Waidyasekera, Kanchana; Nikaido, Toru; Weerasinghe, Dinesh; Nurrohman, Hamid; Tagami, Junji
2012-04-01
This study evaluated a dual-curing composite along with different dentin adhesive systems for 1 year under water storage, as a new bonding method of root fragments in complete vertical root fracture. Bovine root fragments were bonded with the dual-curing resin composite Clearfil DC Core Automix (DCA) and one of three adhesive systems: two-step self-etching adhesive Clearfil SE Bond (SE), one-step self-etching adhesive Tokuyama Bond Force (BF), one-step dual-curing self-etching adhesive Clearfil DC Bond (DC). Microtensile bond strength (µTBS)/ultimate tensile bond strength (UTS), FE-SEM ultramorphology of fracture modes, and adhesive dentin interface were observed after water storage for periods of up to one year. The data were analyzed with two-way ANOVA. µTBS was influenced by "dentin adhesive system" (F = 324.455, p < 0.001) and "length of water storage" (F = 8.470, p < 0.001). SE yielded significantly higher µTBS, regardless of storage period (p < 0.05) and maintained the initial µTBS without a significant change after 1 year of water storage (p > 0.05). From 24 h to 1 month, BF showed significantly higher bond strength than DC. UTS of DCA was influenced only by the curing mode of the material (F = 5.051, p = 0.027), but not by the length of water storage (F = 0.053, p > 0.05). Two-step self-etching adhesive systems and dual-curing composite core material can be considered as a suitable bonding method for complete root fractures.
Dual-mode plasmonic nanorod type antenna based on the concept of a trapped dipole.
Panaretos, Anastasios H; Werner, Douglas H
2015-04-06
In this paper we theoretically investigate the feasibility of creating a dual-mode plasmonic nanorod antenna. The proposed design methodology relies on adapting to optical wavelengths the principles of operation of trapped dipole antennas, which have been widely used in the low MHz frequency range. This type of antenna typically employs parallel LC circuits, also referred to as "traps", which are connected along the two arms of the dipole. By judiciously choosing the resonant frequency of these traps, as well as their position along the arms of the dipole, it is feasible to excite the λ/2 resonance of both the original dipole as well as the shorter section defined by the length of wire between the two traps. This effectively enables the dipole antenna to have a dual-mode of operation. Our analysis reveals that the implementation of this concept at the nanoscale requires that two cylindrical pockets (i.e. loading volumes) be introduced along the length of the nanoantenna, inside which plasmonic core-shell particles are embedded. By properly selecting the geometry and constitution of the core-shell particle as well as the constitution of the host material of the two loading volumes and their position along the nanorod, the equivalent effect of a resonant parallel LC circuit can be realized. This effectively enables a dual-mode operation of the nanorod antenna. The proposed methodology introduces a compact approach for the realization of dual-mode optical sensors while at the same time it clearly illustrates the inherent tuning capabilities that core-shell particles can offer in a practical framework.
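The classical trap referred to in this abstract is a parallel LC circuit, whose resonant frequency is f = 1 / (2π√(LC)). The sketch below evaluates that standard formula for a low-frequency trap; the function name and the component values in the note are illustrative assumptions and do not model the plasmonic core-shell loading studied in the paper.

```python
import math


def trap_resonant_frequency_hz(inductance_h: float, capacitance_f: float) -> float:
    """Resonant frequency of an ideal parallel LC trap, in hertz.

    inductance_h  -- inductance in henries
    capacitance_f -- capacitance in farads
    """
    if inductance_h <= 0 or capacitance_f <= 0:
        raise ValueError("inductance and capacitance must be positive")
    return 1.0 / (2.0 * math.pi * math.sqrt(inductance_h * capacitance_f))
```

As a hypothetical example, a 10 µH inductor with a 50 pF capacitor resonates near 7.1 MHz, which is in the low-MHz range where trapped dipoles are traditionally used.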
NASA Astrophysics Data System (ADS)
Chen, S.; Chen, H.; Hu, J.; Zhang, A.; Min, C.
2017-12-01
More than three years have passed since the launch of the Global Precipitation Measurement (GPM) core satellite on February 27, 2014. The satellite carries two core sensors: the dual-frequency precipitation radar (DPR) and the microwave imager (GMI). Both are state-of-the-art sensors that observe precipitation over the globe. The DPR level-2 product provides both precipitation rates and phases. Precipitation phase information can help advance global hydrological cycle modeling and is particularly crucial for high-altitude and high-latitude regions where solid precipitation is the dominant source of water. However, the reliability and accuracy of the DPR level-2 product are still not well established. An assessment of the performance and uncertainty of precipitation retrievals derived from the DPR on board the satellite is needed by precipitation algorithm developers and by end users in the hydrology, weather, meteorology, and hydro-related communities. In this study, precipitation estimates derived from DPR are compared with those derived from the CSU-CHILL National Weather Radar from March 2014 to October 2017. The CSU-CHILL radar is located in Greeley, CO, and is an advanced, transportable dual-polarized dual-wavelength (S- and X-band) weather radar. The systematic and random errors of DPR in measuring precipitation will be analyzed as a function of the precipitation rate and precipitation type (liquid and solid). This study is expected to offer insights into the performance of the most advanced sensor and thus provide useful feedback to the algorithm developers as well as the GPM data end users.
Bogdán, István A.; Rivers, Jenny; Beynon, Robert J.; Coca, Daniel
2008-01-01
Motivation: Peptide mass fingerprinting (PMF) is a method for protein identification in which a protein is fragmented by a defined cleavage protocol (usually proteolysis with trypsin), and the masses of these products constitute a ‘fingerprint’ that can be searched against theoretical fingerprints of all known proteins. In the first stage of PMF, the raw mass spectrometric data are processed to generate a peptide mass list. In the second stage this protein fingerprint is used to search a database of known proteins for the best protein match. Although current software solutions can typically deliver a match in a relatively short time, a system that can find a match in real time could change the way in which PMF is deployed and presented. In a paper published earlier we presented a hardware design of a raw mass spectra processor that, when implemented in Field Programmable Gate Array (FPGA) hardware, achieves almost 170-fold speed gain relative to a conventional software implementation running on a dual processor server. In this article we present a complementary hardware realization of a parallel database search engine that, when running on a Xilinx Virtex 2 FPGA at 100 MHz, delivers 1800-fold speed-up compared with an equivalent C software routine, running on a 3.06 GHz Xeon workstation. The inherent scalability of the design means that processing speed can be multiplied by deploying the design on multiple FPGAs. The database search processor and the mass spectra processor, running on a reconfigurable computing platform, provide a complete real-time PMF protein identification solution. Contact: d.coca@sheffield.ac.uk PMID:18453553
A 60 GOPS/W, -1.8 V to 0.9 V body bias ULP cluster in 28 nm UTBB FD-SOI technology
NASA Astrophysics Data System (ADS)
Rossi, Davide; Pullini, Antonio; Loi, Igor; Gautschi, Michael; Gürkaynak, Frank K.; Bartolini, Andrea; Flatresse, Philippe; Benini, Luca
2016-03-01
Ultra-low power operation and extreme energy efficiency are strong requirements for a number of high-growth application areas, such as E-health, Internet of Things, and wearable Human-Computer Interfaces. A promising approach to achieve up to one order of magnitude of improvement in energy efficiency over the current generation of integrated circuits is near-threshold computing. However, frequency degradation due to aggressive voltage scaling may not be acceptable across all performance-constrained applications. Thread-level parallelism over multiple cores can be used to overcome the performance degradation at low voltage. Moreover, enabling the processors to operate on demand and over wide supply voltage and body bias ranges makes it possible to achieve the best possible energy efficiency while satisfying a large spectrum of computational demands. In this work we present the first ever implementation of a 4-core cluster fabricated using conventional-well 28 nm UTBB FD-SOI technology. The multi-core architecture we present in this work is able to operate on a wide range of supply voltages, from 0.44 V to 1.2 V. In addition, the architecture allows a wide range of body bias to be applied, from -1.8 V to 0.9 V. The peak energy efficiency of 60 GOPS/W is achieved at 0.5 V supply voltage and 0.5 V forward body bias. Thanks to the extended body bias range of conventional-well FD-SOI technology, high energy efficiency can be guaranteed for a wide range of process and environmental conditions. We demonstrate the ability to compensate for process variation in up to 99.7% of chips with only ±0.2 V of body biasing, and to compensate for temperature variation in the range -40 °C to 120 °C by exploiting -1.1 V to 0.8 V body biasing. When compared to leading-edge near-threshold RISC processors optimized for extremely low power applications, the multi-core architecture we propose has 144× more performance at comparable energy efficiency levels. Even when compared to other low-power processors with comparable performance, including those implemented in 28 nm technology, our platform provides 1.4× to 3.7× better energy efficiency.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Irshad, Muneeb; Siraj, Khurram; Javed, Fayyaz
Nanocomposites of Samarium-doped Ceria (SDC), Gadolinium-doped Ceria (GDC), core-shell SDC/amorphous Na₂CO₃ (SDCC) and GDC/amorphous Na₂CO₃ (GDCC) were synthesized using the co-precipitation method and then compared in order to obtain better solid oxide electrolyte materials for low-temperature Solid Oxide Fuel Cells (SOFCs). The comparison is made in terms of structure, crystallinity, thermal stability, conductivity and cell performance. In the present work, XRD analysis confirmed proper doping of Sm and Gd in both single-phase (SDC, GDC) and dual-phase core-shell (SDCC, GDCC) electrolyte materials. EDX analysis validated the presence of Sm and Gd in both single- and dual-phase electrolyte materials, and also confirmed the presence of amorphous Na₂CO₃ in SDCC and GDCC. In the TGA analysis, a steep weight loss is observed for SDCC and GDCC when the temperature rises above 725 °C, while SDC and GDC do not show any loss. The ionic conductivity and cell performance of the single-phase SDC and GDC nanocomposites were compared with those of the core-shell GDC/amorphous Na₂CO₃ and SDC/amorphous Na₂CO₃ nanocomposites using methane fuel. It is observed that the dual-phase core-shell electrolyte materials (SDCC, GDCC) show better performance in the low-temperature range than their corresponding single-phase electrolyte materials (SDC, GDC) with methane fuel.
Using an ARM Processor to boost data acquisition rates
NASA Astrophysics Data System (ADS)
Brown, Anthony; Seaquest Collaboration
2015-10-01
It has been proposed (Fermilab E-1067) to use the SeaQuest (E906/E1039/1037) dimuon spectrometer to search for the dark photon and dark Higgs. The concept is that the search would run in a parasitic mode with only minor upgrades to the spectrometer. There are various requirements for the upgrades; one of them is to increase the DAQ rate, and one minimal-cost approach to doing so will be discussed. The currently running SeaQuest (E906) experiment has modest rate requirements of around 1 kHz. Since the dark particle search would involve recording particles originating in the first magnet, which is used as a beam dump, the data rate will be higher than when recording events from the target alone. Thus the DAQ rate capability will need to be increased to around 10 kHz. A possible very low cost solution exists, because the TDCs designed by Academia Sinica contain an ARM processor that was not needed to meet the original SeaQuest (E906) needs. Since the 120 GeV beam from the Main Injector is delivered in a 4 second spill once per minute, and the ARM processor on the TDC has two dual-ported memory chips, these could be used to store data during each spill and then read the data out in the time between spills.
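The store-during-spill, read-between-spills idea can be checked with a quick back-of-envelope calculation. The sketch below uses only the rates and timing quoted in the abstract; the per-event size (`EVENT_SIZE_BYTES`) is a hypothetical placeholder, since the abstract gives no event size.

```python
# Back-of-envelope buffering estimate for spill-structured beam delivery.
# Rates and timing come from the abstract; EVENT_SIZE_BYTES is hypothetical.

SPILL_SECONDS = 4.0        # beam delivered in a 4 second spill
CYCLE_SECONDS = 60.0       # one spill per minute
TARGET_RATE_HZ = 10_000.0  # required DAQ rate during the spill (~10 kHz)
EVENT_SIZE_BYTES = 256     # hypothetical per-TDC event fragment size


def spill_budget(rate_hz, spill_s, cycle_s, event_bytes):
    """Return events per spill, bytes to buffer, and the average readout rate."""
    events = rate_hz * spill_s
    buffered_bytes = events * event_bytes
    readout_window_s = cycle_s - spill_s   # time available between spills
    avg_readout_hz = events / readout_window_s
    return events, buffered_bytes, avg_readout_hz


events, buffered, readout_hz = spill_budget(
    TARGET_RATE_HZ, SPILL_SECONDS, CYCLE_SECONDS, EVENT_SIZE_BYTES
)
print(f"events per spill:       {events:.0f}")
print(f"bytes buffered per TDC: {buffered / 1e6:.2f} MB")
print(f"average readout rate:   {readout_hz:.0f} Hz")
```

Under these assumptions, 40,000 events accumulate per spill and can be drained over the remaining 56 seconds at an average of roughly 714 events per second, which is below the ~1 kHz rate the abstract attributes to the existing system.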
Dual annular rotating "windowed" nuclear reflector reactor control system
Jacox, Michael G.; Drexler, Robert L.; Hunt, Robert N. M.; Lake, James A.
1994-01-01
A nuclear reactor control system is provided in a nuclear reactor having a core operating in the fast neutron energy spectrum where criticality control is achieved by neutron leakage. The control system includes dual annular, rotatable reflector rings. There are two reflector rings: an inner reflector ring and an outer reflector ring. The reflectors are concentrically assembled, surround the reactor core, and each reflector ring includes a plurality of openings. The openings in each ring are capable of being aligned or non-aligned with each other. Independent driving means for each of the annular reflector rings is provided so that reactor criticality can be initiated and controlled by rotation of either reflector ring such that the extent of alignment of the openings in each ring controls the reflection of neutrons from the core.
Multicore Considerations for Legacy Flight Software Migration
NASA Technical Reports Server (NTRS)
Vines, Kenneth; Day, Len
2013-01-01
In this paper we will discuss potential benefits and pitfalls when considering a migration from an existing single core code base to a multicore processor implementation. The results of this study present options that should be considered before migrating fault managers, device handlers and tasks with time-constrained requirements to a multicore flight software environment. Possible future multicore test bed demonstrations are also discussed.
Photonic-Networks-on-Chip for High Performance Radiation Survivable Multi-Core Processor Systems
2013-12-01
Loss Spectra” Proceedings of SPIE 8255, (2012) and in a journal publication: M. T. Crowley, D. Murrell, N. Patel, M. Breivik, C.-Y. Lin, Y. Li, B.-O...Crowley, D. Murrell, N. Patel, M. Breivik, C.-Y. Lin, Y. Li, B.-O. Fimland and L. F. Lester, "Analytical Modeling of the Temperature Performance of
Energy challenges in optical access and aggregation networks.
Kilper, Daniel C; Rastegarfar, Houman
2016-03-06
Scalability is a critical issue for access and aggregation networks as they must support the growth in both the size of data capacity demands and the multiplicity of access points. The number of connected devices, the Internet of Things, is growing to the tens of billions. Prevailing communication paradigms are reaching physical limitations that make continued growth problematic. Challenges are emerging in electronic and optical systems and energy increasingly plays a central role. With the spectral efficiency of optical systems approaching the Shannon limit, increasing parallelism is required to support higher capacities. For electronic systems, as the density and speed increase, the total system energy, thermal density and energy per bit are moving into regimes that become impractical to support, for example requiring single-chip processor powers above the 100 W limit common today. We examine communication network scaling and energy use from the Internet core down to the computer processor core and consider implications for optical networks. Optical switching in data centres is identified as a potential model from which scalable access and aggregation networks for the future Internet, with the application of integrated photonic devices and intelligent hybrid networking, will emerge. © 2016 The Author(s).
Efficient implementation of the many-body Reactive Bond Order (REBO) potential on GPU
NASA Astrophysics Data System (ADS)
Trędak, Przemysław; Rudnicki, Witold R.; Majewski, Jacek A.
2016-09-01
The second generation Reactive Bond Order (REBO) empirical potential is commonly used to accurately model a wide range of hydrocarbon materials. It is also extensible to other atom types and interactions. The REBO potential assumes a complex multi-body interaction model that is difficult to represent efficiently in the SIMD or SIMT programming model. Hence, despite its importance, no efficient GPGPU implementation has been developed for this potential. Here we present a detailed description of a highly efficient GPGPU implementation of a molecular dynamics algorithm using the REBO potential. The presented algorithm takes advantage of rarely used properties of the SIMT architecture of a modern GPU to solve difficult synchronization issues that arise in computations of multi-body potentials. Techniques developed for this problem may also be used to achieve efficient solutions of different problems. The performance of the proposed algorithm is assessed using a range of model systems. It is compared to the highly optimized CPU implementation (both single core and OpenMP) available in the LAMMPS package. These experiments show up to 6x improvement in force computation time using a single processor of the NVIDIA Tesla K80 compared to a high-end 16-core Intel Xeon processor.
Efficient algorithms and implementations of entropy-based moment closures for rarefied gases
NASA Astrophysics Data System (ADS)
Schaerer, Roman Pascal; Bansal, Pratyuksh; Torrilhon, Manuel
2017-07-01
We present efficient algorithms and implementations of the 35-moment system equipped with the maximum-entropy closure in the context of rarefied gases. While closures based on the principle of entropy maximization have been shown to yield very promising results for moderately rarefied gas flows, the computational cost of these closures is in general much higher than for closure theories with explicit closed-form expressions of the closing fluxes, such as Grad's classical closure. Following a similar approach as Garrett et al. (2015) [13], we investigate efficient implementations of the computationally expensive numerical quadrature method used for the moment evaluations of the maximum-entropy distribution by exploiting its inherent fine-grained parallelism with the parallelism offered by multi-core processors and graphics cards. We show that using a single graphics card as an accelerator allows speed-ups of two orders of magnitude when compared to a serial CPU implementation. To accelerate the time-to-solution for steady-state problems, we propose a new semi-implicit time discretization scheme. The resulting nonlinear system of equations is solved with a Newton type method in the Lagrange multipliers of the dual optimization problem in order to reduce the computational cost. Additionally, fully explicit time-stepping schemes of first and second order accuracy are presented. We investigate the accuracy and efficiency of the numerical schemes for several numerical test cases, including a steady-state shock-structure problem.
Implementation of Augmented Reality Technology in Sangiran Museum with Vuforia
NASA Astrophysics Data System (ADS)
Purnomo, F. A.; Santosa, P. I.; Hartanto, R.; Pratisto, E. H.; Purbayu, A.
2018-03-01
Archaeological objects are evidence of ancient life dating back millions of years. Objects discovered at Sangiran are preserved by the Sangiran Museum and protected from potential damage. This research develops an Augmented Reality application for the museum that displays virtual information about the ancient objects on display. The content includes information as text, audio, and animation of a 3D model representing the ancient object. This study emphasizes the 3D markerless recognition process using the Vuforia Augmented Reality (AR) system, so that visitors can access the exhibition objects from different viewpoints. Based on the test results, by registering image targets at 25° angle intervals, 3D markerless keypoint features can be detected from different viewpoints. To run the application, the device must meet minimum specifications of a Dual Core 1.2 GHz processor, Power VR SG5X GPU, 8 MP auto focus camera and 1 GB of memory. On average, the AR application successfully detected exhibition objects in 40% of cases with single-view 3D markerless targets, and in 86% (for angles 0° to 180°) and 100% (for angles 0° to 360°) of cases with multiview markerless targets. The detection distance ranges from 23 cm to 540 cm, and the average response time to detect a 3D markerless target is 12 seconds.
Contactless sub-millimeter displacement measurements
NASA Astrophysics Data System (ADS)
Sliepen, Guus; Jägers, Aswin P. L.; Bettonvil, Felix C. M.; Hammerschlag, Robert H.
2008-07-01
Weather effects on foldable domes, as used at the DOT and GREGOR, are investigated, in particular the correlation between the wind field and the stresses caused to both the metal framework and the tent clothing. Camera systems measure the displacement of several dome points without contact. The stresses follow from the measured deformation pattern. The cameras, placed near the dome floor, do not disturb telescope operations. In the set-ups of DOT and GREGOR, these cameras are up to 8 meters away from the measured points and must be able to detect displacements of less than 0.1 mm. The cameras have a FireWire (IEEE1394) interface to eliminate the need for frame grabbers. Each camera captures 15 images of 640 × 480 pixels per second. All data is processed on-site in real-time. In order to get the best estimate for the displacement within the constraints of available processing power, all image processing is done in Fourier-space, with all convolution operations being pre-computed once. A sub-pixel estimate of the peak of the correlation function is made. This makes it possible to process the images of four cameras using only one commodity PC with a dual-core processor, and to achieve an effective sensitivity of up to 0.01 mm. The deformation measurements are well correlated to the simultaneous wind measurements. The results are of high interest for upscaling the dome design (ELTs and solar telescopes).
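The sub-pixel step can be illustrated with a minimal sketch. The abstract does not say which estimator the authors use; the version below assumes a three-point parabolic fit around the highest correlation sample, which is one common choice, and the function name `subpixel_peak` is hypothetical.

```python
def subpixel_peak(corr):
    """Estimate the peak location of a sampled correlation curve.

    Fits a parabola through the maximum sample and its two neighbours
    and returns the fractional index of the parabola's vertex.
    (Assumed estimator; the source abstract does not specify one.)
    """
    k = max(range(len(corr)), key=corr.__getitem__)
    if k == 0 or k == len(corr) - 1:
        return float(k)  # peak at the border: no neighbour on one side
    left, centre, right = corr[k - 1], corr[k], corr[k + 1]
    curvature = left - 2.0 * centre + right
    if curvature == 0.0:
        return float(k)  # flat top: no refinement possible
    return k + 0.5 * (left - right) / curvature


# Synthetic correlation curve whose true peak lies between samples 3 and 4.
true_peak = 3.3
curve = [-(x - true_peak) ** 2 for x in range(7)]
estimate = subpixel_peak(curve)
print(f"integer peak: 3, refined peak: {estimate:.3f}")
```

For an exactly parabolic curve the fit recovers the true peak position; for real correlation peaks it is only an approximation, and its accuracy depends on the peak shape and noise.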
FPGA implementation of image dehazing algorithm for real time applications
NASA Astrophysics Data System (ADS)
Kumar, Rahul; Kaushik, Brajesh Kumar; Balasubramanian, R.
2017-09-01
Weather degradation such as haze, fog, mist, etc. severely reduces the effective range of visual surveillance. This degradation is a spatially varying phenomenon, which makes the problem non-trivial. Dehazing is an essential preprocessing stage in applications such as long range imaging, border security, intelligent transportation systems, etc. However, these applications require low latency from the preprocessing block. Although the conventional single image dark channel prior algorithm is computationally expensive, it yields impressive results. In this work, the single image dark channel prior algorithm is modified and implemented for fast processing with comparable visual quality of the restored image/video. Moreover, a two stage image dehazing architecture is introduced, wherein the dark channel and airlight are estimated in the first stage, whereas the transmission map and intensity restoration are computed in the next stage. The algorithm is implemented using Xilinx Vivado software and validated using a Xilinx zc702 development board, which contains an Artix7 equivalent Field Programmable Gate Array (FPGA) and an ARM Cortex A9 dual core processor. Additionally, a high definition multimedia interface (HDMI) has been incorporated for video feed and display purposes. The results show that the dehazing algorithm attains 29 frames per second for an image resolution of 1920x1080, which is suitable for real time applications. The design utilizes 9 18K_BRAM, 97 DSP_48, 6508 FFs and 8159 LUTs.
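The two-stage split described above can be illustrated with the conventional dark channel prior pipeline. The sketch below is a plain software illustration under stated assumptions, not the authors' modified algorithm or their FPGA design: images are nested lists of RGB tuples in [0, 1], the airlight is taken from the single pixel with the brightest dark-channel value (a simplification), and the parameters `radius`, `omega` and `t_min` are illustrative defaults.

```python
def dark_channel(image, radius):
    """Per-pixel minimum over colour channels and over a square neighbourhood."""
    h, w = len(image), len(image[0])
    channel_min = [[min(px) for px in row] for row in image]
    dark = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            ys = range(max(0, y - radius), min(h, y + radius + 1))
            xs = range(max(0, x - radius), min(w, x + radius + 1))
            dark[y][x] = min(channel_min[j][i] for j in ys for i in xs)
    return dark


def estimate_airlight(image, dark):
    """Simplified airlight: colour of the pixel with the largest dark-channel value."""
    h, w = len(image), len(image[0])
    y, x = max(((j, i) for j in range(h) for i in range(w)),
               key=lambda p: dark[p[0]][p[1]])
    return image[y][x]


def dehaze(image, radius=1, omega=0.95, t_min=0.1):
    # Stage 1: dark channel and airlight.
    dark = dark_channel(image, radius)
    airlight = estimate_airlight(image, dark)
    # Stage 2: transmission map, then intensity restoration.
    normalised = [[tuple(c / max(a, 1e-6) for c, a in zip(px, airlight))
                   for px in row] for row in image]
    norm_dark = dark_channel(normalised, radius)
    restored = []
    for y, row in enumerate(image):
        out_row = []
        for x, px in enumerate(row):
            t = max(1.0 - omega * norm_dark[y][x], t_min)  # clamp transmission
            out_row.append(tuple((c - a) / t + a for c, a in zip(px, airlight)))
        restored.append(out_row)
    return restored


hazy = [[(0.2, 0.5, 0.9), (0.8, 0.7, 0.6)]]
print(dark_channel(hazy, 0))  # channel-wise minimum only
print(dark_channel(hazy, 1))  # minimum also taken over neighbours
```

The neighbourhood minimum is the part that dominates cost in a naive implementation such as this one, which is the kind of operation a hardware pipeline would be expected to restructure. Restored values are not clamped to [0, 1] here.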
HeinzelCluster: accelerated reconstruction for FORE and OSEM3D.
Vollmar, S; Michel, C; Treffert, J T; Newport, D F; Casey, M; Knöss, C; Wienhard, K; Liu, X; Defrise, M; Heiss, W D
2002-08-07
Using iterative three-dimensional (3D) reconstruction techniques for reconstruction of positron emission tomography (PET) is not feasible on most single-processor machines due to the excessive computing time needed, especially so for the large sinogram sizes of our high-resolution research tomograph (HRRT). In our first approach to speed up reconstruction time we transform the 3D scan into the format of a two-dimensional (2D) scan with sinograms that can be reconstructed independently using Fourier rebinning (FORE) and a fast 2D reconstruction method. On our dedicated reconstruction cluster (seven four-processor systems, Intel PIII@700 MHz, switched fast ethernet and Myrinet, Windows NT Server), we process these 2D sinograms in parallel. We have achieved a speedup > 23 using 26 processors and also compared results for different communication methods (RPC, Syngo, Myrinet GM). The other approach is to parallelize OSEM3D (implementation of C Michel), which has produced the best results for HRRT data so far and is more suitable for an adequate treatment of the sinogram gaps that result from the detector geometry of the HRRT. We have implemented two levels of parallelization for our dedicated cluster (a shared memory fine-grain level on each node utilizing all four processors and a coarse-grain level allowing for 15 nodes), reducing the time for one core iteration from over 7 h to about 35 min.
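The reported figures translate into parallel efficiency and speedup as follows. This is a small worked calculation using only numbers quoted in the abstract; because the abstract gives them as bounds or approximations ("> 23", "over 7 h", "about 35 min"), the results are rough estimates.

```python
def parallel_efficiency(speedup, processors):
    """Fraction of ideal linear speedup actually achieved."""
    return speedup / processors


# FORE + 2D reconstruction: speedup > 23 on 26 processors.
fore_efficiency = parallel_efficiency(23.0, 26)

# OSEM3D: one iteration went from over 7 h to about 35 min.
osem_speedup = (7 * 60) / 35

print(f"FORE parallel efficiency: at least {fore_efficiency:.1%}")
print(f"OSEM3D speedup per iteration: roughly {osem_speedup:.0f}x")
```

No efficiency is computed for OSEM3D, since the abstract does not state how many processors were active for the 35 minute figure.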
The design of infrared information collection circuit based on embedded technology
NASA Astrophysics Data System (ADS)
Liu, Haoting; Zhang, Yicong
2013-07-01
The S3C2410 is a 16/32-bit RISC embedded processor based on the ARM920T core and the AMBA bus, intended mainly for handheld devices and cost-sensitive, low-power applications. This paper introduces the design of a PIR sensor system, its circuit, and its assembly and debugging. The application circuit of the passive PIR alarm exploits the invisibility of infrared radiation in the alarm system in order to achieve anti-theft alarm and security purposes. When a human body enters the detection range of the PIR sensor, the sensor detects the heat source and outputs a weak signal. The signal is amplified, compared and delayed; finally, light emitting diodes emit light, acting as the alarm indicator.
Static analysis of the hull plate using the finite element method
NASA Astrophysics Data System (ADS)
Ion, A.
2015-11-01
This paper aims to present the static analysis for two levels of a container ship's construction: the first level is the girder / hull plate and the second level is the entire strength hull of the vessel. This article describes the work for the static analysis of a hull plate. We use the software package ANSYS Mechanical 14.5. The program is run on a computer with four Intel Xeon X5260 CPU processors at 3.33 GHz and 32 GB of installed memory. In terms of software, the shared memory parallel version of ANSYS refers to running ANSYS across multiple cores on an SMP system. The distributed memory parallel version of ANSYS (Distributed ANSYS) refers to running ANSYS across multiple processors on SMP systems or DMP systems.
Real Time Phase Noise Meter Based on a Digital Signal Processor
NASA Technical Reports Server (NTRS)
Angrisani, Leopoldo; D'Arco, Mauro; Greenhall, Charles A.; Schiano Lo Morille, Rosario
2006-01-01
A digital signal-processing meter for phase noise measurement on sinusoidal signals is presented. It uses a special hardware architecture, made up of a core digital signal processor connected to a data acquisition board, and takes advantage of a quadrature demodulation-based measurement scheme already proposed by the authors. Thanks to an efficient measurement process and an optimized implementation of its fundamental stages, the proposed meter exploits all hardware resources effectively enough to achieve high performance and real-time operation. For input frequencies up to some hundreds of kilohertz, the meter is capable both of updating the phase noise power spectrum while seamlessly capturing the analyzed signal into its memory, and of granting a frequency resolution as fine as a few hertz.
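The core of a quadrature demodulation scheme can be sketched in a few lines. The example below is a generic textbook illustration, not the authors' meter: it mixes the sampled signal with cosine and sine references at the nominal carrier frequency, averages over a block (a crude low-pass filter), and recovers the phase offset with `atan2`. The function name and the test signal parameters are assumptions.

```python
import math


def quadrature_phase(samples, carrier_hz, sample_rate_hz):
    """Estimate the phase offset (radians) of a sinusoid relative to a reference.

    Assumes the block spans a whole number of carrier periods so that the
    double-frequency mixing products average out.
    """
    i_sum = 0.0
    q_sum = 0.0
    for n, x in enumerate(samples):
        angle = 2.0 * math.pi * carrier_hz * n / sample_rate_hz
        i_sum += x * math.cos(angle)   # in-phase component
        q_sum -= x * math.sin(angle)   # quadrature component
    return math.atan2(q_sum, i_sum)


# Hypothetical test signal: 100 kHz carrier sampled at 1 MHz, 0.25 rad offset.
FS = 1.0e6
FC = 1.0e5
PHASE = 0.25
signal = [math.cos(2.0 * math.pi * FC * n / FS + PHASE) for n in range(1000)]
estimated = quadrature_phase(signal, FC, FS)
print(f"estimated phase offset: {estimated:.6f} rad")
```

A phase noise measurement would repeat this estimate over successive blocks to obtain a phase time series and then compute its power spectrum; that stage is omitted here.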
NASA Astrophysics Data System (ADS)
Zheng, Jingjing; Mielke, Steven L.; Clarkson, Kenneth L.; Truhlar, Donald G.
2012-08-01
We present a Fortran program package, MSTor, which calculates partition functions and thermodynamic functions of complex molecules involving multiple torsional motions by the recently proposed MS-T method. This method interpolates between the local harmonic approximation in the low-temperature limit, and the limit of free internal rotation of all torsions at high temperature. The program can also carry out calculations in the multiple-structure local harmonic approximation. The program package also includes six utility codes that can be used as stand-alone programs to calculate reduced moment of inertia matrices by the method of Kilpatrick and Pitzer, to generate conformational structures, to calculate, either analytically or by Monte Carlo sampling, volumes for torsional subdomains defined by Voronoi tessellation of the conformational subspace, to generate template input files, and to calculate one-dimensional torsional partition functions using the torsional eigenvalue summation method. Catalogue identifier: AEMF_v1_0 Program summary URL:http://cpc.cs.qub.ac.uk/summaries/AEMF_v1_0.html Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland Licensing provisions: Standard CPC licence, http://cpc.cs.qub.ac.uk/licence/licence.html No. of lines in distributed program, including test data, etc.: 77 434 No. 
of bytes in distributed program, including test data, etc.: 3 264 737 Distribution format: tar.gz Programming language: Fortran 90, C, and Perl Computer: Itasca (HP Linux cluster, each node has two-socket, quad-core 2.8 GHz Intel Xeon X5560 “Nehalem EP” processors), Calhoun (SGI Altix XE 1300 cluster, each node containing two quad-core 2.66 GHz Intel Xeon “Clovertown”-class processors sharing 16 GB of main memory), Koronis (Altix UV 1000 server with 190 6-core Intel Xeon X7542 “Westmere” processors at 2.66 GHz), Elmo (Sun Fire X4600 Linux cluster with AMD Opteron cores), and Mac Pro (two 2.8 GHz Quad-core Intel Xeon processors) Operating system: Linux/Unix/Mac OS RAM: 2 Mbytes Classification: 16.3, 16.12, 23 Nature of problem: Calculation of the partition functions and thermodynamic functions (standard-state energy, enthalpy, entropy, and free energy as functions of temperatures) of complex molecules involving multiple torsional motions. Solution method: The multi-structural approximation with torsional anharmonicity (MS-T). The program also provides results for the multi-structural local harmonic approximation [1]. Restrictions: There is no limit on the number of torsions that can be included in either the Voronoi calculation or the full MS-T calculation. In practice, the range of problems that can be addressed with the present method consists of all multi-torsional problems for which one can afford to calculate all the conformations and their frequencies. Unusual features: The method can be applied to transition states as well as stable molecules. 
The program package also includes the hull program for the calculation of Voronoi volumes and six utility codes that can be used as stand-alone programs to calculate reduced moment-of-inertia matrices by the method of Kilpatrick and Pitzer, to generate conformational structures, to calculate, either analytically or by Monte Carlo sampling, volumes for torsional subdomain defined by Voronoi tessellation of the conformational subspace, to generate template input files, and to calculate one-dimensional torsional partition functions using the torsional eigenvalue summation method. Additional comments: The program package includes a manual, installation script, and input and output files for a test suite. Running time: There are 24 test runs. The running time of the test runs on a single processor of the Itasca computer is less than 2 seconds. J. Zheng, T. Yu, E. Papajak, I.M. Alecu, S.L. Mielke, D.G. Truhlar, Practical methods for including torsional anharmonicity in thermochemical calculations of complex molecules: The internal-coordinate multi-structural approximation, Phys. Chem. Chem. Phys. 13 (2011) 10885-10907.
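The eigenvalue summation mentioned for the one-dimensional torsional partition function amounts to a Boltzmann sum over energy levels, Q = Σ exp(-E_i / k_B T). The sketch below is not taken from MSTor: it uses harmonic oscillator levels as a stand-in for torsional eigenvalues (which a real calculation would obtain by solving the torsional Schrödinger equation), and the 100 cm⁻¹ wavenumber and 300 K temperature are hypothetical.

```python
import math

K_B = 1.380649e-23    # Boltzmann constant, J/K
H = 6.62607015e-34    # Planck constant, J s
C_CM = 2.99792458e10  # speed of light, cm/s


def partition_function(energies_j, temperature_k):
    """Boltzmann sum over a list of energy levels given in joules."""
    beta = 1.0 / (K_B * temperature_k)
    return sum(math.exp(-beta * e) for e in energies_j)


def harmonic_levels(wavenumber_cm, n_levels):
    """Stand-in eigenvalues: harmonic levels measured from the potential minimum."""
    quantum = H * C_CM * wavenumber_cm
    return [quantum * (n + 0.5) for n in range(n_levels)]


def harmonic_closed_form(wavenumber_cm, temperature_k):
    """Analytic harmonic oscillator partition function, used as a check."""
    x = H * C_CM * wavenumber_cm / (K_B * temperature_k)
    return math.exp(-0.5 * x) / (1.0 - math.exp(-x))


q_sum = partition_function(harmonic_levels(100.0, 200), 300.0)
q_exact = harmonic_closed_form(100.0, 300.0)
print(f"summed: {q_sum:.6f}  closed form: {q_exact:.6f}")
```

The sum is truncated at 200 levels, which is more than enough for convergence at these values; a softer mode or a higher temperature would need more levels.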
Wang, Tianqi; Yu, Xiaoyue; Han, Leiqiang; Liu, Tingxian; Liu, Yongjun; Zhang, Na
2017-01-01
As the tumor microenvironment (TME) develops, it is critical to take the alterations of pH value, reduction and various enzymes of the TME into consideration when constructing the desirable co-delivery systems. Herein, TME pH and enzyme dual-responsive core-shell nanoparticles were prepared for the efficient co-delivery of chemotherapy drug and plasmid DNA (pDNA). A novel pH-responsive, positively charged drug loading material, doxorubicin (DOX)-4-hydrazinobenzoic acid (HBA)-polyethyleneimine (PEI) conjugate (DOX-HBA-PEI, DHP), was synthesized to fabricate positively charged polyion complex inner core DHP/DNA nanoparticles (DDN). Hyaluronic acid (HA) was an enzyme-responsive shell which could protect the core and enhance the co-delivery efficiency through CD44-mediated endocytosis. The HA-shielded pH and enzyme dual-responsive nanoparticles (HDDN) were spherical with narrow distribution. The particle size of HDDN was 148.3±3.88 nm and the zeta potential was changed to negative (-18.1±2.03 mV), which led to decreased cytotoxicity. The cumulative release of DOX from DHP at pH 5.0 (66.4%) was higher than that at pH 7.4 (30.1%), which indicated the pH sensitivity of DHP. The transfection efficiency of HDDN in 10% serum was equal to that in the absence of serum, while the transfection of DDN was significantly decreased in the presence of 10% serum. Furthermore, cellular uptake studies and co-localization assay showed that HDDN were internalized effectively through CD44-mediated endocytosis in the tumor cells. The efficient co-delivery of DOX and pEGFP was confirmed by fluorescent image taken by laser confocal microscope. It can be concluded that TME dual-responsive HA-shielded core-shell nanoparticles could be considered as a promising platform for the co-delivery of chemotherapy drug and pDNA.
In-fiber refractive index sensor based on single eccentric hole-assisted dual-core fiber.
Yang, Jing; Guan, Chunying; Tian, Peixuan; Yuan, Tingting; Zhu, Zheng; Li, Ping; Shi, Jinhui; Yang, Jun; Yuan, Libo
2017-11-01
We propose a novel and simple in-fiber refractive index sensor based on resonant coupling, constructed by a short section of single eccentric hole-assisted dual-core fiber (SEHADCF) spliced between two single-mode fibers. The coupling characteristics of the SEHADCF are calculated numerically. The strong resonant coupling occurs when the fundamental mode of the center core phase-matches to that of the suspended core in the air hole. The effective refractive index of the fundamental mode of the suspended core can be obviously changed by injecting solution into the air hole. The responses of the proposed devices to the refractive index and temperature are experimentally measured. The refractive index sensitivity is 627.5 nm/refractive index unit in the refractive index range of 1.335-1.385. The sensor without solution filling is insensitive to temperature in the range of 30-90°C. The proposed refractive index sensor has outstanding advantages, such as simple fabrication, good mechanical strength, and excellent microfluidic channel, and will be of importance in biological detection, chemical analysis, and environment monitoring.
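The quoted sensitivity can be turned into expected wavelength shifts with simple arithmetic. The sketch below assumes the response is linear across the stated range (the abstract quotes a single sensitivity value); the 0.02 nm spectrometer resolution is a hypothetical figure used only to show how an index resolution would follow.

```python
SENSITIVITY_NM_PER_RIU = 627.5   # from the abstract
RI_MIN, RI_MAX = 1.335, 1.385    # measured refractive index range


def wavelength_shift_nm(delta_n):
    """Resonance wavelength shift for a refractive index change, assuming linearity."""
    return SENSITIVITY_NM_PER_RIU * delta_n


def index_resolution(wavelength_resolution_nm):
    """Smallest resolvable index change for a given wavelength resolution."""
    return wavelength_resolution_nm / SENSITIVITY_NM_PER_RIU


full_range_shift = wavelength_shift_nm(RI_MAX - RI_MIN)
resolution = index_resolution(0.02)  # hypothetical 0.02 nm instrument resolution

print(f"shift over full range: {full_range_shift:.2f} nm")
print(f"index resolution at 0.02 nm: {resolution:.2e} RIU")
```

Over the full 0.05 RIU range this corresponds to a resonance shift of about 31 nm.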
A Cost Effective System Design Approach for Critical Space Systems
NASA Technical Reports Server (NTRS)
Abbott, Larry Wayne; Cox, Gary; Nguyen, Hai
2000-01-01
NASA-JSC required an avionics platform capable of serving a wide range of applications in a cost-effective manner. In part, making the avionics platform cost effective means adhering to open standards and supporting the integration of COTS products with custom products. Inherently, operation in space requires low power, mass, and volume while retaining high performance, reconfigurability, scalability, and upgradability. The Universal Mini-Controller project is based on a modified PC/104-Plus architecture while maintaining full compatibility with standard COTS PC/104 products. The architecture consists of a library of building block modules, which can be mixed and matched to meet a specific application. A set of NASA developed core building blocks, processor card, analog input/output card, and a Mil-Std-1553 card, have been constructed to meet critical functions and unique interfaces. The design for the processor card is based on the PowerPC architecture. This architecture provides an excellent balance between power consumption and performance, and has an upgrade path to the forthcoming radiation hardened PowerPC processor. The processor card, which makes extensive use of surface mount technology, has a 166 MHz PowerPC 603e processor, 32 Mbytes of error detected and corrected RAM, 8 Mbytes of Flash, and 1 Mbyte of EPROM, on a single PC/104-Plus card. Similar densities have been achieved with the quad channel Mil-Std-1553 card and the analog input/output cards. The power management built into the processor and its peripheral chip allows the power and performance of the system to be adjusted to meet the requirements of the application, allowing another dimension to the flexibility of the Universal Mini-Controller. Unique mechanical packaging allows the Universal Mini-Controller to accommodate standard COTS and custom oversized PC/104-Plus cards.
This mechanical packaging also provides thermal management via conductive cooling of COTS boards, which are typically designed for convection cooling methods.
Markovic, Goran; Sarabon, Nejc; Greblo, Zrinka; Krizanic, Valerija
2015-01-01
Aging is associated with decline in physical function that could result in the development of physical impairment and disability. Hence, interventions that simultaneously challenge balance ability, trunk (core) and extremity strength of older adults could be particularly effective in preserving and enhancing these physical functions. The purpose of this study was to compare the effects of feedback-based balance and core resistance training utilizing a special computer-controlled device (Huber®) with the conventional Pilates training on balance ability, neuromuscular function and body composition of healthy older women. Thirty-four older women (age: 70±4 years) were randomly assigned to a Huber group (n=17) or Pilates group (n=17). Both groups trained for 8 weeks, 3 times a week. Maximal isometric strength of the trunk flexors, extensors, and lateral flexors, leg power, upper-body strength, single- and dual-task static balance, and body composition were measured before and after the intervention programs. Significant group×time interactions and main effects of time (p<0.05) were found for body composition, balance ability in standard and dual-task conditions, all trunk muscle strength variables, and leg power in favor of the Huber group. The observed improvements in balance ability under both standard and dual-task conditions in the Huber group were mainly the result of enhanced postural control in medial-lateral direction (p<0.05). Feedback-based balance and core resistance training proved to be more effective in improving single- and dual-task balance ability, trunk muscle strength, leg power, and body composition of healthy older women than the traditional Pilates training. Copyright © 2015 Elsevier Ireland Ltd. All rights reserved.
Okwuosa, Tochukwu C; Pereira, Beatriz C; Arafat, Basel; Cieszynska, Milena; Isreb, Abdullah; Alhnan, Mohamed A
2017-02-01
Individualizing gastric-resistant tablets is associated with major challenges for clinical staff in hospitals and healthcare centres. This work aims to fabricate gastric-resistant 3D printed tablets using dual FDM 3D printing. The gastric-resistant tablets were engineered by employing a range of shell-core designs using polyvinylpyrrolidone (PVP) and methacrylic acid co-polymer for core and shell structures respectively. Filaments for both core and shell were compounded using a twin-screw hot-melt extruder (HME). CAD software was utilized to design a capsule-shaped core with a complementary shell of increasing thicknesses (0.17, 0.35, 0.52, 0.70 or 0.87 mm). The physical form of the drug and its integrity following an FDM 3D printing were assessed using x-ray powder diffractometry (XRPD), thermal analysis and HPLC. A shell thickness ≥0.52 mm was deemed necessary in order to achieve sufficient core protection in the acid medium. The technology proved viable for incorporating different drug candidates; theophylline, budesonide and diclofenac sodium. XRPD indicated the presence of theophylline crystals whilst budesonide and diclofenac sodium remained amorphous in the PVP matrix of the filaments and 3D printed tablets. Fabricated tablets demonstrated gastric resistant properties and a pH responsive drug release pattern in both phosphate and bicarbonate buffers. Despite its relatively limited resolution, FDM 3D printing proved to be a suitable platform for a single-process fabrication of delayed release tablets. This work reveals the potential of dual FDM 3D printing as a unique platform for personalising delayed release tablets to suit an individual patient's needs.
T-L Plane Abstraction-Based Energy-Efficient Real-Time Scheduling for Multi-Core Wireless Sensors.
Kim, Youngmin; Lee, Ki-Seong; Pham, Ngoc-Son; Lee, Sun-Ro; Lee, Chan-Gun
2016-07-08
Energy efficiency is considered as a critical requirement for wireless sensor networks. As more wireless sensor nodes are equipped with multi-cores, there are emerging needs for energy-efficient real-time scheduling algorithms. The T-L plane-based scheme is known to be an optimal global scheduling technique for periodic real-time tasks on multi-cores. Unfortunately, there has been a scarcity of studies on extending T-L plane-based scheduling algorithms to exploit energy-saving techniques. In this paper, we propose a new T-L plane-based algorithm enabling energy-efficient real-time scheduling on multi-core sensor nodes with dynamic power management (DPM). Our approach addresses the overhead of processor mode transitions and reduces fragmentations of the idle time, which are inherent in T-L plane-based algorithms. Our experimental results show the effectiveness of the proposed algorithm compared to other energy-aware scheduling methods on T-L plane abstraction.
Research on NC motion controller based on SOPC technology
NASA Astrophysics Data System (ADS)
Jiang, Tingbiao; Meng, Biao
2006-11-01
With the rapid development of digitization and informatization, numerical control (NC) technology is becoming increasingly important in the manufacturing industry. However, conventional numerical control systems usually have shortcomings such as poor openness, real-time performance, tailorability, and reconfigurability. To address these problems, this paper investigates the prospects and advantages of applying System-on-a-Programmable-Chip (SOPC) technology to numerical control, and puts forward a research approach for an NC controller based on SOPC technology. Using the characteristics of SOPC technology, we integrate a high-density FPGA logic device, SRAM memory, and an embedded ARM processor into a single programmable logic device. We also combine the 32-bit RISC processor, which has the computing capability needed for complicated algorithms, with the FPGA, which provides strong reconfigurable logic control. These steps largely resolve the shortcomings of existing numerical control systems described above. For the concrete implementation, we use an FPGA chip with an embedded ARM hard-core processor to construct the control core of the motion controller. We also design the peripheral circuit of the controller according to the requirements of the actual control functions, port a real-time operating system to the ARM processor, design the drivers for the peripheral chips, develop the application program that controls and configures the FPGA, and design IP cores of the logic algorithms for various NC motion control functions and configure them into the FPGA. The whole control system is developed with a modular and structured design approach for both hardware and software. Thus an NC motion controller that is easily tailored, highly open, reconfigurable, and expandable can be implemented.
Thermo-mechanical properties of carbon nanotubes and applications in thermal management
NASA Astrophysics Data System (ADS)
Nguyen, Manh Hong; Thang Bui, Hung; Trinh Pham, Van; Phan, Ngoc Hong; Nguyen, Tuan Hong; Chuc Nguyen, Van; Quang Le, Dinh; Khoi Phan, Hong; Phan, Ngoc Minh
2016-06-01
Thanks to their very high thermal conductivity, high Young’s modulus and unique tensile strength, carbon nanotubes (CNTs) have become one of the most suitable nano additives for heat conductive materials. In this work, we present results obtained for the synthesis of heat conductive materials containing CNT based thermal greases, nanoliquids and lubricating oils. These synthesized heat conductive materials were applied to thermal management for high power electronic devices (CPUs, LEDs) and internal combustion engines. The simulation and experimental results on thermal greases for an Intel Pentium IV processor showed that the thermal conductivity of greases increases 1.4 times and the saturation temperature of the CPU decreased by 5 °C by using thermal grease containing 2 wt% CNTs. Nanoliquids containing CNT based distilled water/ethylene glycol were successfully applied in heat dissipation for an Intel Core i5 processor and a 450 W floodlight LED. The experimental results showed that the saturation temperature of the Intel Core i5 processor and the 450 W floodlight LED decreased by about 6 °C and 3.5 °C, respectively, when using nanoliquids containing 1 g l⁻¹ of CNTs. The CNTs were also effectively utilized as additive materials for the synthesis of lubricating oils to improve the thermal conductivity, heat dissipation efficiency and performance efficiency of engines. The experimental results show that the thermal conductivity of lubricating oils increased by 12.5%, the engine saved 15% fuel consumption, and the longevity of the lubricating oil increased up to 20 000 km by using 0.1% vol. CNTs in the lubricating oils. All of the above results have confirmed the tremendous application potential of heat conductive materials containing CNTs in thermal management for high power electronic devices, internal combustion engines and other high power apparatus.
Systematic Alignment of Dual Teacher Preparation
ERIC Educational Resources Information Center
Anderson, Kelly; Smith, JaneDiane; Olsen, Jacob; Algozzine, Bob
2015-01-01
Given the rapid growth of diversity in schools across the country, teacher educators are turning to innovative ways to redesign their programs. In this article, we describe efforts of a dual licensure program in which undergraduate teachers-in-training acquired knowledge and skills in core content, as well as evidence-based pedagogy and discipline…
Wang, Dongdong; Zhou, Jiajia; Shi, Ruohong; Wu, Huihui; Chen, Ruhui; Duan, Beichen; Xia, Guoliang; Xu, Pengping; Wang, Hui; Zhou, Shu; Wang, Chengming; Wang, Haibao; Guo, Zhen; Chen, Qianwang
2017-01-01
Metal-organic frameworks (MOFs), which possess high porosity, large surface area, and tunable functionality, are promising candidates for synchronous diagnosis and therapy in cancer treatment. Although a large number of MOFs have been discovered, conventional MOF-based nanoplatforms are mainly limited to a single MOF source with a single functionality. In this study, a surfactant-modified Prussian blue (PB) core coated by a compact ZIF-8 shell (core-shell dual-MOFs, CSD-MOFs) is reported, prepared through a versatile stepwise approach. With Prussian blue as the core, CSD-MOFs are able to serve as both magnetic resonance imaging (MRI) and fluorescence optical imaging (FOI) agents. We show that CSD-MOF crystals loaded with the anticancer drug doxorubicin (DOX) are efficient pH and near-infrared (NIR) dual-stimuli responsive drug delivery vehicles. After the degradation of ZIF-8, simultaneous NIR irradiation of the inner PB MOFs continuously generates heat that kills cancer cells. Their efficacy on HeLa cancer cell lines is higher than that of the respective single treatment modalities, achieving synergistic chemo-thermal therapy efficacy. In vivo results indicate that the anti-tumor efficacy of CSD-MOFs@DOX+NIR was enhanced 7.16 and 5.07 times compared to single chemotherapy and single thermal therapy, respectively. Our strategy opens new possibilities to construct multifunctional theranostic systems through integration of two different MOFs. PMID:29158848
Reconfigurable Hardware Adapts to Changing Mission Demands
NASA Technical Reports Server (NTRS)
2003-01-01
A new class of computing architectures and processing systems, which use reconfigurable hardware, is creating a revolutionary approach to implementing future spacecraft systems. With the increasing complexity of electronic components, engineers must design next-generation spacecraft systems with new technologies in both hardware and software. Derivation Systems, Inc., of Carlsbad, California, has been working through NASA's Small Business Innovation Research (SBIR) program to develop key technologies in reconfigurable computing and Intellectual Property (IP) soft cores. Founded in 1993, Derivation Systems has received several SBIR contracts from NASA's Langley Research Center and the U.S. Department of Defense Air Force Research Laboratories in support of its mission to develop hardware and software for high-assurance systems. Through these contracts, Derivation Systems began developing leading-edge technology in formal verification, embedded Java, and reconfigurable computing for its PF3100, Derivational Reasoning System (DRS), FormalCORE IP, FormalCORE PCI/32, FormalCORE DES, and LavaCORE Configurable Java Processor, which are designed for greater flexibility and security on all space missions.
Wang, Haiyang; Yan, Xin; Li, Shuguang; An, Guowen; Zhang, Xuenan
2016-10-08
A refractive index sensor based on dual-core photonic crystal fiber (PCF) with hexagonal lattice is proposed. The effects of the geometrical parameters of the PCF on the performance of the sensor are investigated using the finite element method (FEM). The two fiber cores are separated by two air holes filled with the analyte, whose refractive index is in the range of 1.33-1.41. Numerical simulation results show that the highest sensitivity, 22,983 nm/RIU (refractive index unit), is obtained when the analyte refractive index is 1.41. The lowest sensitivity, 21,679 nm/RIU, is obtained when the analyte refractive index is 1.33. The proposed sensor has significant advantages in the field of biomolecule detection, as it provides a wide detection range with high sensitivity.
Dual annular rotating "windowed" nuclear reflector reactor control system
Jacox, M.G.; Drexler, R.L.; Hunt, R.N.M.; Lake, J.A.
1994-03-29
A nuclear reactor control system is provided in a nuclear reactor having a core operating in the fast neutron energy spectrum where criticality control is achieved by neutron leakage. The control system includes dual annular, rotatable reflector rings. There are two reflector rings: an inner reflector ring and an outer reflector ring. The reflectors are concentrically assembled, surround the reactor core, and each reflector ring includes a plurality of openings. The openings in each ring are capable of being aligned or non-aligned with each other. Independent driving means for each of the annular reflector rings is provided so that reactor criticality can be initiated and controlled by rotation of either reflector ring such that the extent of alignment of the openings in each ring controls the reflection of neutrons from the core. 4 figures.
Wang, Haiyang; Yan, Xin; Li, Shuguang; An, Guowen; Zhang, Xuenan
2016-01-01
A refractive index sensor based on dual-core photonic crystal fiber (PCF) with hexagonal lattice is proposed. The effects of the geometrical parameters of the PCF on the performance of the sensor are investigated using the finite element method (FEM). The two fiber cores are separated by two air holes filled with the analyte, whose refractive index is in the range of 1.33–1.41. Numerical simulation results show that the highest sensitivity, 22,983 nm/RIU (refractive index unit), is obtained when the analyte refractive index is 1.41. The lowest sensitivity, 21,679 nm/RIU, is obtained when the analyte refractive index is 1.33. The proposed sensor has significant advantages in the field of biomolecule detection, as it provides a wide detection range with high sensitivity. PMID:27740607
Development of 3-Year Roadmap to Transform the Discipline of Systems Engineering
2010-03-31
quickly humans could physically construct them. Indeed, magnetic core memory was entirely constructed by human hands until it was superseded by...For their mainframe computers, IBM develops the applications, operating system, computer hardware and microprocessors (off the shelf standard memory ...processor developers work on potential computational and memory pipelines to support the required performance capabilities and use the available transistors
Scaling Support Vector Machines On Modern HPC Platforms
DOE Office of Scientific and Technical Information (OSTI.GOV)
You, Yang; Fu, Haohuan; Song, Shuaiwen
2015-02-01
We designed and implemented MIC-SVM, a highly efficient parallel SVM for x86 based multicore and many-core architectures, such as the Intel Ivy Bridge CPUs and Intel Xeon Phi co-processor (MIC). We propose various novel analysis methods and optimization techniques to fully utilize the multilevel parallelism provided by these architectures and serve as general optimization methods for other machine learning tools.
NASA Astrophysics Data System (ADS)
1995-04-01
Bell Laboratories has developed the world's first optical information processor. Its core device is a self-excited electrooptical effect apparatus array of symmetric operation. After being developed in the United States, this high-technology device was also successfully developed by China's scientists, which shows that China's optoelectronic technology is among the most advanced in the world.
A Pervasive Parallel Processing Framework for Data Visualization and Analysis at Extreme Scale
DOE Office of Scientific and Technical Information (OSTI.GOV)
Ma, Kwan-Liu
Most of today’s visualization libraries and applications are based off of what is known today as the visualization pipeline. In the visualization pipeline model, algorithms are encapsulated as “filtering” components with inputs and outputs. These components can be combined by connecting the outputs of one filter to the inputs of another filter. The visualization pipeline model is popular because it provides a convenient abstraction that allows users to combine algorithms in powerful ways. Unfortunately, the visualization pipeline cannot run effectively on exascale computers. Experts agree that the exascale machine will comprise processors that contain many cores. Furthermore, physical limitations will prevent data movement in and out of the chip (that is, between main memory and the processing cores) from keeping pace with improvements in overall compute performance. To use these processors to their fullest capability, it is essential to carefully consider memory access. This is where the visualization pipeline fails. Each filtering component in the visualization library is expected to take a data set in its entirety, perform some computation across all of the elements, and output the complete results. The process of iterating over all elements must be repeated in each filter, which is one of the worst possible ways to traverse memory when trying to maximize the number of executions per memory access. This project investigates a new type of visualization framework that exhibits a pervasive parallelism necessary to run on exascale machines. Our framework achieves this by defining algorithms in terms of functors, which are localized, stateless operations. Functors can be composited in much the same way as filters in the visualization pipeline. But, functors’ design allows them to be concurrently running on massive amounts of lightweight threads.
Only with such fine-grained parallelism can we hope to fill the billions of threads we expect will be necessary for efficient computation on an exascale computer. This project concludes with a functional prototype containing pervasively parallel algorithms that perform demonstratively well on many-core processors. These algorithms are fundamental for performing data analysis and visualization at extreme scale.
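The contrast between whole-data-set filters and composable per-element functors can be illustrated with a small sketch. This is not the project's framework or API: the helper names (`run_filter_pipeline`, `compose`, `run_fused`) and the three toy operations are hypothetical, and the loops run sequentially here, whereas the abstract describes functors running concurrently on many lightweight threads.

```python
def run_filter_pipeline(data, filters):
    """Pipeline style: every filter traverses the whole data set and
    materializes a complete intermediate result before the next one starts."""
    passes = 0
    for f in filters:
        data = [f(x) for x in data]
        passes += 1
    return data, passes


def compose(functors):
    """Fuse stateless per-element functors into a single per-element functor."""
    def fused(x):
        for f in functors:
            x = f(x)
        return x
    return fused


def run_fused(data, functors):
    """Functor style: one traversal; each element is independent of the others,
    so the loop body could be handed to one thread per element."""
    fused = compose(functors)
    return [fused(x) for x in data], 1


# Hypothetical stateless operations standing in for visualization algorithms.
def scale(x):
    return 2.0 * x


def offset(x):
    return x + 1.0


def clamp(x):
    return min(x, 5.0)


ops = [scale, offset, clamp]
values = [0.0, 1.0, 2.0, 3.0]

pipeline_result, pipeline_passes = run_filter_pipeline(values, ops)
fused_result, fused_passes = run_fused(values, ops)
print(pipeline_result == fused_result, pipeline_passes, fused_passes)
```

Both versions compute the same values; the fused version touches each element once instead of once per operation, which is the memory-access argument made in the abstract.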
Parallel hyperbolic PDE simulation on clusters: Cell versus GPU
NASA Astrophysics Data System (ADS)
Rostrup, Scott; De Sterck, Hans
2010-12-01
Increasingly, high-performance computing is looking towards data-parallel computational devices to enhance computational performance. Two technologies that have received significant attention are IBM's Cell Processor and NVIDIA's CUDA programming model for graphics processing unit (GPU) computing. In this paper we investigate the acceleration of parallel hyperbolic partial differential equation simulation on structured grids with explicit time integration on clusters with Cell and GPU backends. The message passing interface (MPI) is used for communication between nodes at the coarsest level of parallelism. Optimizations of the simulation code at the several finer levels of parallelism that the data-parallel devices provide are described in terms of data layout, data flow and data-parallel instructions. Optimized Cell and GPU performance are compared with reference code performance on a single x86 central processing unit (CPU) core in single and double precision. We further compare the CPU, Cell and GPU platforms on a chip-to-chip basis, and compare performance on single cluster nodes with two CPUs, two Cell processors or two GPUs in a shared memory configuration (without MPI). We finally compare performance on clusters with 32 CPUs, 32 Cell processors, and 32 GPUs using MPI. Our GPU cluster results use NVIDIA Tesla GPUs with GT200 architecture, but some preliminary results on recently introduced NVIDIA GPUs with the next-generation Fermi architecture are also included. This paper provides computational scientists and engineers who are considering porting their codes to accelerator environments with insight into how structured grid based explicit algorithms can be optimized for clusters with Cell and GPU accelerators. It also provides insight into the speed-up that may be gained on current and future accelerator architectures for this class of applications. 
Program summary
Program title: SWsolver
Catalogue identifier: AEGY_v1_0
Program summary URL: http://cpc.cs.qub.ac.uk/summaries/AEGY_v1_0.html
Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland
Licensing provisions: GPL v3
No. of lines in distributed program, including test data, etc.: 59 168
No. of bytes in distributed program, including test data, etc.: 453 409
Distribution format: tar.gz
Programming language: C, CUDA
Computer: Parallel Computing Clusters. Individual compute nodes may consist of x86 CPU, Cell processor, or x86 CPU with attached NVIDIA GPU accelerator.
Operating system: Linux
Has the code been vectorised or parallelized?: Yes. Tested on 1-128 x86 CPU cores, 1-32 Cell Processors, and 1-32 NVIDIA GPUs.
RAM: Tested on Problems requiring up to 4 GB per compute node.
Classification: 12
External routines: MPI, CUDA, IBM Cell SDK
Nature of problem: MPI-parallel simulation of Shallow Water equations using high-resolution 2D hyperbolic equation solver on regular Cartesian grids for x86 CPU, Cell Processor, and NVIDIA GPU using CUDA.
Solution method: SWsolver provides 3 implementations of a high-resolution 2D Shallow Water equation solver on regular Cartesian grids, for CPU, Cell Processor, and NVIDIA GPU. Each implementation uses MPI to divide work across a parallel computing cluster.
Additional comments: Sub-program numdiff is used for the test run.
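To make "explicit time integration on structured grids" concrete, here is a deliberately minimal sketch: one first-order upwind step for the 1D linear advection equation on a periodic grid. It is not the SWsolver scheme (which is a high-resolution 2D shallow water solver); the function name `upwind_step` and the test profile are illustrative assumptions. The point it shows is that each new cell value depends only on old values of that cell and its neighbour, which is what makes such updates data-parallel.

```python
def upwind_step(u, courant):
    """One explicit first-order upwind step for u_t + a*u_x = 0 with a > 0
    on a periodic 1D grid. `courant` is a*dt/dx and must lie in (0, 1]
    for the scheme to be stable."""
    if not 0.0 < courant <= 1.0:
        raise ValueError("Courant number must be in (0, 1]")
    # u[i - 1] wraps around to the last cell when i == 0 (periodic boundary).
    return [u[i] - courant * (u[i] - u[i - 1]) for i in range(len(u))]


# A square pulse; with a Courant number of exactly 1 the pulse moves one cell
# to the right per step, and for any valid Courant number the sum is conserved.
pulse = [0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0]
shifted = upwind_step(pulse, 1.0)
smeared = upwind_step(pulse, 0.5)
print(shifted)
print(sum(smeared))
```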
Dual-stroke heat pump field performance
NASA Astrophysics Data System (ADS)
Veyo, S. E.
1984-11-01
Two nearly identical prototype systems, each employing a unique dual-stroke compressor, were built and tested. One was installed in an occupied residence in Jeannette, Pa. It has provided the heating and cooling required from that time to the present. The system has functioned without failure of any prototypical advanced components, although early field experience did suffer from deficiencies in the software for the breadboard microprocessor control system. Analysis of field performance data indicates a heating performance factor (HSPF) of 8.13 Btu/Wh, and a cooling energy efficiency (SEER) of 8.35 Btu/Wh. Data indicate that the heat pump is oversized for the test house since the observed lower balance point is 3 F whereas 17 F is optimum. Oversizing coupled with the use of resistance heat to maintain delivered air temperature warmer than 90 F results in the consumption of more resistance heat than expected, more unit cycling, and therefore lower than expected energy efficiency. Our analysis indicates that with optimal mixing the dual stroke heat pump will yield an HSPF 30% better than a single capacity heat pump representative of high efficiency units in the market place today for the observed weather profile.
Ray Meta: scalable de novo metagenome assembly and profiling
2012-01-01
Voluminous parallel sequencing datasets, especially metagenomic experiments, require distributed computing for de novo assembly and taxonomic profiling. Ray Meta is a massively distributed metagenome assembler that is coupled with Ray Communities, which profiles microbiomes based on uniquely-colored k-mers. It can accurately assemble and profile a three billion read metagenomic experiment representing 1,000 bacterial genomes of uneven proportions in 15 hours with 1,024 processor cores, using only 1.5 GB per core. The software will facilitate the processing of large and complex datasets, and will help in generating biological insights for specific environments. Ray Meta is open source and available at http://denovoassembler.sf.net. PMID:23259615
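The resource figures quoted above can be combined into aggregate numbers with simple arithmetic. The sketch below uses only the values stated in the abstract; the derived quantities (core-hours, aggregate memory, reads per core-hour) are back-of-the-envelope figures, not results reported by the authors.

```python
READS = 3_000_000_000   # three billion reads
CORES = 1_024           # processor cores
WALL_HOURS = 15         # wall-clock hours
GB_PER_CORE = 1.5       # memory per core, in GB

core_hours = CORES * WALL_HOURS            # total compute budget
total_memory_gb = CORES * GB_PER_CORE      # aggregate memory footprint
reads_per_core_hour = READS / core_hours   # average throughput per core

print(core_hours, total_memory_gb, reads_per_core_hour)
```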
pH- and Temperature-Sensitive Hydrogel Nanoparticles with Dual Photoluminescence for Bioprobes.
Zhao, Yue; Shi, Ce; Yang, Xudong; Shen, Bowen; Sun, Yuanqing; Chen, Yang; Xu, Xiaowei; Sun, Hongchen; Yu, Kui; Yang, Bai; Lin, Quan
2016-06-28
This study demonstrates high contrast and sensitivity by designing a dual-emissive hydrogel particle system, whose two emissions respond to pH and temperature strongly and independently. It describes the photoluminescence (PL) response of poly(N-isopropylacrylamide) (PNIPAM)-based core/shell hydrogel nanoparticles with dual emission, which is obtained by emulsion polymerization with potassium persulfate, consisting of the thermo- and pH-responsive copolymers of PNIPAM and poly(acrylic acid) (PAA). A red-emission rare-earth complex and a blue-emission quaternary ammonium tetraphenylethylene derivative (d-TPE) with similar excitation wavelengths are inserted into the core and shell of the hydrogel nanoparticles, respectively. The PL intensities of the nanoparticles exhibit a linear temperature response in the range from 10 to 80 °C with a change as large as a factor of 5. In addition, the blue emission from the shell exhibits a linear pH response between pH 6.5 and 7.6 with a resolution of 0.1 unit, while the red emission from the core is pH-independent. These stimuli-responsive PL nanoparticles have potential applications in biology and chemistry, including bio- and chemosensors, biological imaging, cancer diagnosis, and externally activated release of anticancer drugs.
The prediction of noise and installation effects of high-subsonic dual-stream jets in flight
NASA Astrophysics Data System (ADS)
Saxena, Swati
Both military and civil aircraft in service generate high levels of noise. One of the major contributors to this noise generated from the aircraft is the jet engine exhaust. This makes the study of jet noise and methods to reduce jet noise an active research area with the aim of designing quieter military and commercial aircraft. The current stringent aircraft noise regulations imposed by the Federal Aviation Administration (FAA) and other international agencies, have further raised the need to perform accurate jet noise calculations for more reliable estimation of the jet noise sources. The main aim of the present research is to perform jet noise simulations of single and dual-stream jets with engineering accuracy and assess forward flight effects on the jet noise. Installation effects such as caused by the pylon are also studied using a simplified pylon nozzle configuration. Due to advances in computational power, it has become possible to perform turbulent flow simulations of high speed jets, which leads to more accurate noise predictions. In the present research, a hybrid unsteady RANS-LES parallel multi-block structured grid solver called EAGLEJet is written to perform the nozzle flow calculations. The far-field noise calculation is performed using solutions to the Ffowcs Williams and Hawkings equation. The present calculations use meshes with 5 to 11 million grid points and require about three weeks of computing time with about 100 processors. A baseline single stream convergent nozzle and a dual-stream coaxial convergent nozzle are used for the flow and noise analysis. Calculations for the convergent nozzle are performed at a high subsonic jet Mach number of Mj = 0.9, which is similar to the operating conditions for commercial aircraft engines. A parallel flow gives the flight effect, which is simulated with a co-flow Mach number, Mcf varying from 0.0 to 0.28. 
The grid resolution effects, statistical properties of the turbulence and the heated jet effects (TTR = 2.7) are studied and related to the noise characteristics of the jet. Both flow and noise predictions show good agreement with PIV and microphone measurements. The potential core lengths and nozzle wall boundary characteristics are studied to understand the differences between the numerical and experimental potential core lengths. The flight velocity exponent, m, is calculated from the noise reduction in overall sound pressure levels (OASPL, dB) and the relative velocity (Vj - Vcf) at all jet inlet angles. The variation of the exponent, m, at lower (50° to 90°) and higher aft inlet angles (120° to 150°) is studied and compared with available measurements. Previous studies have shown a different variation of the exponent with inlet angles, while the current numerical data match well with recent experiments conducted on the same nozzle geometry. Today, turbofans are the most efficient engines in service and are used in almost all major commercial aircraft. Turbofans have a dual-stream exhaust nozzle with primary and secondary flow whose flow and noise characteristics are different from those of single stream jets. A Boeing-designed coaxial nozzle, with area ratio of As/Ap = 3.0, is used to study dual-stream jet noise in the present research. In this configuration, the primary nozzle extends beyond the secondary nozzle, which is representative of large turbofan engines in commercial service. The flow calculations are performed at high subsonic Mach numbers in the primary and secondary nozzles (Mpj = 0.85, Msj = 0.95) with heated core flow, TTRp = 2.26, and unheated fan flow, TTRs = 1.0. A co-flow of Mcf = 0.2 is used. The subscripts p, s and amb denote the primary (core) nozzle, the secondary (fan) nozzle, and the ambient flow conditions, respectively.
The statistical properties in the primary and secondary shear layers are studied and compared with those of the single stream jets. It has been found that the eddy convection velocity is lower in dual-stream jets as compared to the single stream jet operating at a similar jet exit Mach number. The phase velocity is higher in the secondary shear layer as compared to primary shear layer. The noise measurements agree well with the predicted data and noise reduction is observed in the presence of co-flow. The variation of the flight velocity exponent is calculated as a function of nozzle inlet angle. The value of the exponent at higher inlet angles is lower as compared to the single stream jets. This suggests that the noise levels are less affected in the peak noise direction in the presence of co-flow in dual-stream jets as compared to single stream jets. Two reference velocities: primary jet exit velocity Vpj and mixed velocity Vmix are considered which result in different absolute values of the exponents. Scaling of the jet spectra is performed at different inlet angles and good collapse has been obtained between the spectra. The installation effects on jet noise are studied using a simplified pylon structure with a dual-stream nozzle. In the presence of a pylon, the azimuthal symmetry of the nozzle is lost and thus the flow characteristics are different as compared to the baseline nozzle. This will result in different noise characteristics of the installed jet.
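The flight velocity exponent m mentioned in this abstract relates the OASPL reduction to the relative velocity. A minimal sketch of that calculation follows, assuming the commonly used definition ΔOASPL = 10 m log10(Vj / (Vj - Vcf)); the abstract does not state the formula, and the function name and all numerical inputs below are hypothetical illustrations, not values from the study.

```python
import math


def flight_velocity_exponent(delta_oaspl_db, v_jet, v_coflow):
    """Return m from the assumed relation
    delta_OASPL = 10 * m * log10(Vj / (Vj - Vcf)).

    delta_oaspl_db: static OASPL minus in-flight OASPL at one inlet angle (dB)
    v_jet, v_coflow: jet exit and co-flow velocities in the same units
    """
    if not 0.0 < v_coflow < v_jet:
        raise ValueError("co-flow velocity must be positive and below the jet velocity")
    return delta_oaspl_db / (10.0 * math.log10(v_jet / (v_jet - v_coflow)))


# Hypothetical inputs for illustration only (not data from the abstract).
m = flight_velocity_exponent(delta_oaspl_db=5.0, v_jet=300.0, v_coflow=75.0)
print(f"flight velocity exponent m = {m:.2f}")
```

Evaluating the function at each inlet angle gives the variation of m with angle; swapping the reference velocity (for example, a mixed velocity instead of the primary jet exit velocity) changes the absolute value of m, as the abstract notes.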
Foxton, R M; Nakajima, M; Tagami, J; Miura, H
2005-02-01
The regional tensile bond strengths of two dual-cure composite resin core materials to root canal dentine using either a one or two-step self-etching adhesive were evaluated. Extracted premolar teeth were decoronated and their root canals prepared to a depth of 8 mm and a width of 1.4 mm. In one group, a one-step self-etching adhesive (Unifil Self-etching Bond) was applied to the walls of the post-space and light-cured for 10 s. The post-spaces were then filled with a dual-cure composite resin (Unifil Core), after which half the specimens were light-cured for 60 s and the other half placed in darkness for 30 min. In the second group, a self-etching primer (ED Primer II) was applied for 30 s, followed by an adhesive resin (Clearfil Photo Bond), which was light-cured for 10 s. The post-spaces were filled with a dual-cure composite resin (DC Core) and then half the specimens were light-cured for 60 s and the other half placed in darkness for 30 min. Chemical-cure composite resin was placed on the outer surfaces of all the roots, which were then stored in water for 24 h. They were serially sliced perpendicular to the bonded interface into eight 0.6 mm-thick slabs, and then transversely sectioned into beams, approximately 8 x 0.6 x 0.6 mm, for the microtensile bond strength test (muTBS). Data were divided into two (coronal/apical half of post-space) and analysed using three-way ANOVA and Scheffe's test (P < 0.05). Failure modes were observed under a scanning electron microscope (SEM) and statistically analysed. Specimens for observation of the bonded interfaces were prepared in a similar manner as for bond strength testing, cut in half and embedded in epoxy resin. They were then polished to a high gloss, gold sputter coated, and after argon ion etching, observed under an SEM. For both dual-cure composite resins and curing strategies, there were no significant differences in muTBS between the coronal and apical regions (P > 0.05).
In addition, both dual-cure composite resins exhibited no significant differences in muTBS irrespective of whether polymerization was chemically or photoinitiated (P > 0.05). Both dual-cure composite resins exhibited good bonding to root canal dentin, which was not dependent upon region or mode of polymerization.
1990-04-23
developed Ada Real-Time Operating System (ARTOS) for bare machine environments (Target), ACW 1.1I0. ... SUBJECT TERMS Ada programming language, Ada...configuration) Operating System: CSC developed Ada Real-Time Operating System (ARTOS) for bare machine environments Memory Size: 4MB 2.2...Test Method Testing of the MC Ada V1.2.beta/ Concurrent Computer Corporation compiler and the CSC developed Ada Real-Time Operating System (ARTOS) for
NASA Astrophysics Data System (ADS)
Bendayan, Michael; Sabo, Roi; Zolberg, Roee; Mandelbaum, Yaakov; Chelly, Avraham; Karsenty, Avi
2017-02-01
We developed a new type of silicon MOSFET Quantum Well transistor, coupling both electronic and optical properties which should overcome the indirect silicon bandgap constraint, and serve as a future light emitting device in the range 0.8-2μm, as part of a new building block in integrated circuits allowing ultra-high speed processors. Such Quantum Well structure enables discrete energy levels for light recombination. Model and simulations of both optical and electric properties are presented pointing out the influence of the channel thickness and the drain voltage on the optical emission spectrum.
Parallel Multi-Step/Multi-Rate Integration of Two-Time Scale Dynamic Systems
NASA Technical Reports Server (NTRS)
Chang, Johnny T.; Ploen, Scott R.; Sohl, Garett A.; Martin, Bryan J.
2004-01-01
Increasing demands for real-time, high-fidelity simulations are stressing the capacity of modern processors. New integration techniques are required that provide maximum efficiency for systems that are parallelizable. However, many current techniques make assumptions that are at odds with non-cascadable systems. A new serial multi-step/multi-rate integration algorithm for dual-timescale continuous state systems is presented which applies to these systems, and is extended to a parallel multi-step/multi-rate algorithm. The superior performance of both algorithms is demonstrated through a representative example.
A real-time, dual processor simulation of the rotor system research aircraft
NASA Technical Reports Server (NTRS)
Mackie, D. B.; Alderete, T. S.
1977-01-01
A real-time, man-in-the-loop simulation of the rotor system research aircraft (RSRA) was conducted. The unique feature of this simulation was that two digital computers were used in parallel to solve the equations of the RSRA mathematical model. The design, development, and implementation of the simulation are documented. Program validation is discussed, and examples of data recordings are given. This simulation provided an important research tool for the RSRA project in terms of safe and cost-effective design analysis. In addition, valuable knowledge concerning parallel processing and a powerful simulation hardware and software system was gained.
NASA Astrophysics Data System (ADS)
Barr, David; Basden, Alastair; Dipper, Nigel; Schwartz, Noah; Vick, Andy; Schnetler, Hermine
2014-08-01
We present wavefront reconstruction acceleration of high-order AO systems using an Intel Xeon Phi processor. The Xeon Phi is a coprocessor providing many integrated cores and designed for accelerating compute intensive, numerical codes. Unlike other accelerator technologies, it allows virtually unchanged C/C++ to be recompiled to run on the Xeon Phi, giving the potential of making development, upgrade and maintenance faster and less complex. We benchmark the Xeon Phi in the context of AO real-time control by running a matrix vector multiply (MVM) algorithm. We investigate variability in execution time and demonstrate a substantial speed-up in loop frequency. We examine the integration of a Xeon Phi into an existing RTC system and show that performance improvements can be achieved with limited development effort.
NASA Technical Reports Server (NTRS)
Saini, Subhash; Hood, Robert T.; Chang, Johnny; Baron, John
2016-01-01
We present a performance evaluation conducted on a production supercomputer of the Intel Xeon Processor E5-2680v3, a twelve-core implementation of the fourth-generation Haswell architecture, and compare it with the Intel Xeon Processor E5-2680v2, an Ivy Bridge implementation of the third-generation Sandy Bridge architecture. Several new architectural features have been incorporated in Haswell, including improvements in all levels of the memory hierarchy as well as improvements to vector instructions and power management. We critically evaluate these new features of Haswell and compare them with Ivy Bridge using several low-level benchmarks, including a subset of HPCC, HPCG, and four full-scale scientific and engineering applications. We also present a model to predict the performance of HPCG and Cart3D within 5%, and Overflow within 10% accuracy.
NASA Technical Reports Server (NTRS)
Aftosmis, M. J.; Berger, M. J.; Murman, S. M.; Kwak, Dochan (Technical Monitor)
2002-01-01
The proposed paper will present recent extensions in the development of an efficient Euler solver for adaptively-refined Cartesian meshes with embedded boundaries. The paper will focus on extensions of the basic method to include solution adaptation, time-dependent flow simulation, and arbitrary rigid domain motion. The parallel multilevel method makes use of on-the-fly parallel domain decomposition to achieve extremely good scalability on large numbers of processors, and is coupled with an automatic coarse mesh generation algorithm for efficient processing by a multigrid smoother. Numerical results are presented demonstrating parallel speed-ups of up to 435 on 512 processors. Solution-based adaptation may be keyed off truncation error estimates using tau-extrapolation or a variety of feature detection based refinement parameters. The multigrid method is extended to time-dependent flows through the use of a dual-time approach. The extension to rigid domain motion uses an Arbitrary Lagrangian-Eulerian (ALE) formulation, and results will be presented for a variety of two- and three-dimensional example problems with both simple and complex geometry.
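The reported speed-up of 435 on 512 processors can be turned into a parallel efficiency with one division. The sketch below does that, and also back-computes the serial fraction that Amdahl's law would imply; applying Amdahl's law here is an assumption made for illustration, since the abstract does not describe the scaling model, and the function names are hypothetical.

```python
def parallel_efficiency(speedup, processors):
    """Fraction of ideal linear speed-up actually achieved."""
    return speedup / processors


def amdahl_serial_fraction(speedup, processors):
    """Serial fraction f implied by Amdahl's law, S = 1 / (f + (1 - f) / p).

    Rearranged: f = (p / S - 1) / (p - 1). Assumes a fixed problem size.
    """
    return (processors / speedup - 1.0) / (processors - 1.0)


# Figures quoted in the abstract: speed-up of 435 on 512 processors.
efficiency = parallel_efficiency(435, 512)
serial_fraction = amdahl_serial_fraction(435, 512)
print(f"parallel efficiency: {efficiency:.1%}")
print(f"implied serial fraction (Amdahl assumption): {serial_fraction:.2e}")
```

The efficiency works out to roughly 85%, which is consistent with the abstract's description of the scalability as extremely good.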
Architecture of security management unit for safe hosting of multiple agents
NASA Astrophysics Data System (ADS)
Gilmont, Tanguy; Legat, Jean-Didier; Quisquater, Jean-Jacques
1999-04-01
In such growing areas as remote applications in large public networks, electronic commerce, digital signature, intellectual property and copyright protection, and even operating system extensibility, the hardware security level offered by existing processors is insufficient. They lack protection mechanisms that prevent the user from tampering with critical data owned by those applications. Some devices are exceptions, but they have neither enough processing power nor enough memory to support such applications (e.g. smart cards). This paper proposes a secure processor architecture in which the classical memory management unit is extended into a new security management unit. It allows ciphered code execution and ciphered data processing. An internal permanent memory can store cipher keys and critical data for several client agents simultaneously. The ordinary supervisor privilege scheme is replaced by a privilege inheritance mechanism that is more suited to operating system extensibility. The result is a secure processor that has hardware support for extensible multitask operating systems, and can be used for both general applications and critical applications needing strong protection. The security management unit and the internal permanent memory can be added to an existing CPU core without loss of performance, and do not require it to be modified.
A parallel implementation of an off-lattice individual-based model of multicellular populations
NASA Astrophysics Data System (ADS)
Harvey, Daniel G.; Fletcher, Alexander G.; Osborne, James M.; Pitt-Francis, Joe
2015-07-01
As computational models of multicellular populations include ever more detailed descriptions of biophysical and biochemical processes, the computational cost of simulating such models limits their ability to generate novel scientific hypotheses and testable predictions. While developments in microchip technology continue to increase the power of individual processors, parallel computing offers an immediate increase in available processing power. To make full use of parallel computing technology, it is necessary to develop specialised algorithms. To this end, we present a parallel algorithm for a class of off-lattice individual-based models of multicellular populations. The algorithm divides the spatial domain between computing processes and comprises communication routines that ensure the model is correctly simulated on multiple processors. The parallel algorithm is shown to accurately reproduce the results of a deterministic simulation performed using a pre-existing serial implementation. We test the scaling of computation time, memory use and load balancing as more processes are used to simulate a cell population of fixed size. We find approximate linear scaling of both speed-up and memory consumption on up to 32 processor cores. Dynamic load balancing is shown to provide speed-up for non-regular spatial distributions of cells in the case of a growing population.
The CMS Level-1 Calorimeter Trigger for LHC Run II
NASA Astrophysics Data System (ADS)
Sinthuprasith, Tutanon
2017-01-01
The phase-1 upgrades of the CMS Level-1 calorimeter trigger have been completed. The Level-1 trigger has been fully commissioned and it will be used by CMS to collect data starting from the 2016 data run. The new trigger has been designed to improve the performance at high luminosity and with a large number of simultaneous inelastic collisions per crossing (pile-up). For this purpose it uses a novel design, the Time Multiplexed Trigger (TMT) design, which enables the data from an event to be processed by a single trigger processor at full granularity over several bunch crossings. The TMT design is a modular design based on the uTCA standard. The architecture is flexible and the number of trigger processors can be expanded according to the physics needs of CMS. Intelligent, more complex, and innovative algorithms are now the core of the first decision layer of CMS: the upgraded trigger system implements pattern recognition and MVA (Boosted Decision Tree) regression techniques in the trigger processors for pT assignment, pile-up subtraction, and isolation requirements for electrons and taus. The performance of the TMT design, the latency measurements, and the algorithm performance measured using data are also presented here.
Equalizer: a scalable parallel rendering framework.
Eilemann, Stefan; Makhinya, Maxim; Pajarola, Renato
2009-01-01
Continuing improvements in CPU and GPU performance as well as increasing multi-core processor and cluster-based parallelism demand flexible and scalable parallel rendering solutions that can exploit multipipe hardware accelerated graphics. In fact, to achieve interactive visualization, scalable rendering systems are essential to cope with the rapid growth of data sets. However, parallel rendering systems are non-trivial to develop and often only application specific implementations have been proposed. The task of developing a scalable parallel rendering framework is even more difficult if it should be generic to support various types of data and visualization applications, and at the same time work efficiently on a cluster with distributed graphics cards. In this paper we introduce a novel system called Equalizer, a toolkit for scalable parallel rendering based on OpenGL which provides an application programming interface (API) to develop scalable graphics applications for a wide range of systems, from large distributed visualization clusters and multi-processor multipipe graphics systems to single-processor single-pipe desktop machines. We describe the system architecture and the basic API, discuss its advantages over previous approaches, and present example configurations and usage scenarios as well as scalability results.
Energy-aware Thread and Data Management in Heterogeneous Multi-core, Multi-memory Systems
DOE Office of Scientific and Technical Information (OSTI.GOV)
Su, Chun-Yi
By 2004, microprocessor design focused on multicore scaling, increasing the number of cores per die in each generation, as the primary strategy for improving performance. These multicore processors typically equip multiple memory subsystems to improve data throughput. In addition, these systems employ heterogeneous processors such as GPUs and heterogeneous memories like non-volatile memory to improve performance, capacity, and energy efficiency. With the increasing volume of hardware resources and system complexity caused by heterogeneity, future systems will require intelligent ways to manage hardware resources. Early research to improve performance and energy efficiency on heterogeneous, multi-core, multi-memory systems focused on tuning a single primitive or at best a few primitives in the systems. The key limitation of past efforts is their lack of a holistic approach to resource management that balances the tradeoff between performance and energy consumption. In addition, the shift from simple, homogeneous systems to these heterogeneous, multicore, multi-memory systems requires in-depth understanding of efficient resource management for scalable execution, including new models that capture the interchange between performance and energy, smarter resource management strategies, and novel low-level performance/energy tuning primitives and runtime systems. Tuning an application to control available resources efficiently has become a daunting challenge; managing resources in automation is still a dark art since the tradeoffs among programming, energy, and performance remain insufficiently understood. In this dissertation, I have developed theories, models, and resource management techniques to enable energy-efficient execution of parallel applications through thread and data management in these heterogeneous multi-core, multi-memory systems.
I study the effect of dynamic concurrent throttling on the performance and energy of multi-core, non-uniform memory access (NUMA) systems. I use critical path analysis to quantify memory contention in the NUMA memory system and determine thread mappings. In addition, I implement a runtime system that combines concurrent throttling and a novel thread mapping algorithm to manage thread resources and improve energy efficient execution in multi-core, NUMA systems.
NASA Technical Reports Server (NTRS)
1991-01-01
This document constitutes the final report prepared by Proteon, Inc. of Westborough, Massachusetts under contract NAS 5-30629 entitled High-Speed Packet Switching (SBIR 87-1, Phase 2) prepared for NASA-Greenbelt, Maryland. The primary goal of this research project is to use the results of the SBIR Phase 1 effort to develop a sound, expandable hardware and software router architecture capable of forwarding 25,000 packets per second through the router and passing 300 megabits per second on the router's internal busses. The work being delivered under this contract received its funding from three different sources: the SNIPE/RIG contract (Contract Number F30602-89-C-0014, CDRL Sequence Number A002), the SBIR contract, and Proteon. The SNIPE/RIG and SBIR contracts had many overlapping requirements, which allowed the research done under SNIPE/RIG to be applied to SBIR. Proteon funded all of the work to develop new router interfaces other than FDDI, in addition to funding the productization of the router itself. The router being delivered under SBIR will be a fully product-quality machine. The work done during this contract produced many significant findings and results, summarized here and explained in detail in later sections of this report. The SNIPE/RIG contract was completed. That contract had many overlapping requirements with the SBIR contract, and resulted in the successful demonstration and delivery of a high speed router. The development that took place during the SNIPE/RIG contract produced findings that included the choice of processor and an understanding of the issues surrounding inter processor communications in a multiprocessor environment. Many significant speed enhancements to the router software were made during that time. Under the SBIR contract (and with help from Proteon-funded work), it was found that a single processor router achieved a throughput significantly higher than originally anticipated. 
For this reason, a single processor router was developed and the final delivery under this contract will include a single processor CNX-500 router. The router and its interface boards (2 FDDIs and 2 dual-ethernets) are all product-quality components.
Novel processor architecture for onboard infrared sensors
NASA Astrophysics Data System (ADS)
Hihara, Hiroki; Iwasaki, Akira; Tamagawa, Nobuo; Kuribayashi, Mitsunobu; Hashimoto, Masanori; Mitsuyama, Yukio; Ochi, Hiroyuki; Onodera, Hidetoshi; Kanbara, Hiroyuki; Wakabayashi, Kazutoshi; Tada, Munehiro
2016-09-01
Infrared sensor systems are a major concern for interplanetary missions that investigate the nature and the formation processes of planets and asteroids. The infrared sensor system requires signal preprocessing functions that compensate for the intensity of infrared image sensors to obtain high quality data and a high compression ratio through the limited capacity of transmission channels towards ground stations. For those implementations, combinations of Field Programmable Gate Arrays (FPGAs) and microprocessors are employed by AKATSUKI, the Venus Climate Orbiter, and HAYABUSA2, the asteroid probe. On the other hand, much smaller size and lower power consumption are demanded for future missions to accommodate more sensors. To fulfill this future demand, we developed a novel processor architecture which consists of reconfigurable cluster cores and programmable-logic cells with complementary atom switches. The complementary atom switches enable hardware programming without configuration memories, and thus soft errors on logic circuit connections are completely eliminated. This is a noteworthy advantage for space applications which cannot be found in conventional re-writable FPGAs. Power consumption of almost one-tenth that of conventional re-writable FPGAs is expected because of the elimination of configuration memories. The proposed processor architecture can be reconfigured by behavioral synthesis with higher level language specification. Consequently, compensation functions are implemented in a single chip without the program memories that accompany conventional microprocessors, while maintaining comparable performance. This enables us to embed a processor element on each infrared signal detector output channel.
Core-Shell Particles as Building Blocks for Systems with High Duality Symmetry
NASA Astrophysics Data System (ADS)
Rahimzadegan, Aso; Rockstuhl, Carsten; Fernandez-Corbaton, Ivan
2018-05-01
Material electromagnetic duality symmetry requires a system to have equal electric and magnetic responses. Intrinsically dual materials that meet the duality conditions at the level of the constitutive relations do not exist in many frequency bands. Nevertheless, discrete objects like metallic helices and homogeneous dielectric spheres can be engineered to approximate the dual behavior. We exploit the extra degrees of freedom of a core-shell dielectric sphere in a particle optimization procedure. The duality symmetry of the resulting particle is more than 1 order of magnitude better than previously reported nonmagnetic objects. We use T-matrix-based multiscattering techniques to show that the improvement is transferred onto the duality symmetry of composite objects when the core-shell particle is used as a building block instead of homogeneous spheres. These results are relevant for the fashioning of systems with high duality symmetry, which are required for some technologically important effects.
Research on dual-parameter optical fiber sensor based on thin-core fiber and spherical structure
NASA Astrophysics Data System (ADS)
Tong, Zhengrong; Wang, Xue; Zhang, Weihua; Xue, Lifang
2018-04-01
A novel dual-parameter optical fiber sensor is proposed and experimentally demonstrated. The proposed sensor is based on a fiber in-line Mach-Zehnder interferometer, which is fabricated by sandwiching a section of thin-core fiber between two spherical structures made of single-mode fibers. The transmission spectrum exhibits the response of the interference between the core and the different cladding modes. Due to the different wavelength shifts of the two selected dips, the simultaneous measurement of temperature and the surrounding refractive index can be achieved. The measured temperature sensitivities are 0.067 nm/°C and 0.050 nm/°C, and the refractive index sensitivities are -119.9 nm/RIU and -69.71 nm/RIU, respectively. In addition, the compact size, simple fabrication and cost-effectiveness of the fiber sensor are also advantages.
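Simultaneous measurement from two dips with different sensitivities is usually done by inverting a 2x2 sensitivity matrix. The sketch below illustrates that step using the sensitivities quoted in the abstract. It assumes a linear response, and it assumes that the first value in each quoted pair (0.067 nm/°C and -119.9 nm/RIU) belongs to the same dip; the abstract does not confirm either point, and the function name is hypothetical.

```python
# Sensitivities quoted in the abstract; their pairing per dip is an assumption.
K_T1, K_T2 = 0.067, 0.050      # temperature sensitivity of dip 1, dip 2 (nm/°C)
K_N1, K_N2 = -119.9, -69.71    # refractive index sensitivity of dip 1, dip 2 (nm/RIU)


def solve_dual_parameter(shift_dip1_nm, shift_dip2_nm):
    """Recover (delta_T in °C, delta_n in RIU) from two dip wavelength shifts.

    Solves the assumed linear system
        shift1 = K_T1 * dT + K_N1 * dn
        shift2 = K_T2 * dT + K_N2 * dn
    by explicit 2x2 matrix inversion.
    """
    det = K_T1 * K_N2 - K_N1 * K_T2
    if det == 0:
        raise ValueError("sensitivity matrix is singular; parameters cannot be separated")
    d_temp = (K_N2 * shift_dip1_nm - K_N1 * shift_dip2_nm) / det
    d_index = (-K_T2 * shift_dip1_nm + K_T1 * shift_dip2_nm) / det
    return d_temp, d_index


# Round-trip check with hypothetical changes: +10 °C and +0.001 RIU.
s1 = K_T1 * 10.0 + K_N1 * 0.001
s2 = K_T2 * 10.0 + K_N2 * 0.001
d_temp, d_index = solve_dual_parameter(s1, s2)
print(f"recovered dT = {d_temp:.3f} °C, dn = {d_index:.6f} RIU")
```

The two parameters can be separated only because the determinant of the sensitivity matrix is nonzero, which is the practical meaning of the two dips having different wavelength shifts.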
NASA Technical Reports Server (NTRS)
Wilmot, Jonathan
2005-01-01
The contents include the following: High availability. Hardware is in a harsh environment. Flight processors (constraints) vary widely due to power and weight constraints. Software must be remotely modifiable and still operate while changes are being made. Many custom one-of-a-kind interfaces for one-of-a-kind missions. Sustaining engineering. Price of failure is high, tens to hundreds of millions of dollars.
Visual Media Reasoning - Terrain-based Geolocation
2015-06-01
the drawings, specifications, or other data does not license the holder or any other person or corporation; or convey any rights or permission to...3.4 Alternative Metric Investigation This section describes a graphics processor unit (GPU) based implementation in the NVIDIA CUDA programming...utilizing 2 concurrent CPU cores, each controlling a single NVIDIA C2075 Tesla Fermi CUDA card. Figure 22 shows a comparison of the CPU and the GPU powered
2013-05-01
logic to perform control function computations and are connected to the full authority digital engine control (FADEC) via a high-speed data...Digital Engine Control (FADEC) via a high speed data communication bus. The short term distributed engine control configurations will be core...concentrator; and high temperature electronics, high speed communication bus between the data concentrator and the control law processor master FADEC
Multi-threaded ATLAS simulation on Intel Knights Landing processors
NASA Astrophysics Data System (ADS)
Farrell, Steven; Calafiura, Paolo; Leggett, Charles; Tsulaia, Vakhtang; Dotti, Andrea; ATLAS Collaboration
2017-10-01
The Knights Landing (KNL) release of the Intel Many Integrated Core (MIC) Xeon Phi line of processors is a potential game changer for HEP computing. With 72 cores and deep vector registers, the KNL cards promise significant performance benefits for highly-parallel, compute-heavy applications. Cori, the newest supercomputer at the National Energy Research Scientific Computing Center (NERSC), was delivered to its users in two phases with the first phase online at the end of 2015 and the second phase now online at the end of 2016. Cori Phase 2 is based on the KNL architecture and contains over 9000 compute nodes with 96GB DDR4 memory. ATLAS simulation with the multithreaded Athena Framework (AthenaMT) is a good potential use-case for the KNL architecture and supercomputers like Cori. ATLAS simulation jobs have a high ratio of CPU computation to disk I/O and have been shown to scale well in multi-threading and across many nodes. In this paper we will give an overview of the ATLAS simulation application with details on its multi-threaded design. Then, we will present a performance analysis of the application on KNL devices and compare it to a traditional x86 platform to demonstrate the capabilities of the architecture and evaluate the benefits of utilizing KNL platforms like Cori for ATLAS production.
LIBS data analysis using a predictor-corrector based digital signal processor algorithm
NASA Astrophysics Data System (ADS)
Sanders, Alex; Griffin, Steven T.; Robinson, Aaron
2012-06-01
There are many accepted sensor technologies for generating spectra for material classification. Once the spectra are generated, communication bandwidth limitations favor local material classification with its attendant reduction in data transfer rates and power consumption. Transferring sensor technologies such as Cavity Ring-Down Spectroscopy (CRDS) and Laser Induced Breakdown Spectroscopy (LIBS) require effective material classifiers. A result of recent efforts has been emphasis on Partial Least Squares - Discriminant Analysis (PLS-DA) and Principal Component Analysis (PCA). Implementation of these via general purpose computers is difficult in small portable sensor configurations. This paper addresses the creation of a low mass, low power, robust hardware spectra classifier for a limited set of predetermined materials in an atmospheric matrix. Crucial to this is the incorporation of PCA or PLS-DA classifiers into a predictor-corrector style implementation. The system configuration guarantees rapid convergence. Software running on multi-core Digital Signal Processors (DSPs) simulates a stream-lined plasma physics model estimator, reducing Analog-to-Digital Converter (ADC) power requirements. This paper presents the results of a predictor-corrector model implemented on a low power multi-core DSP to perform substance classification. This configuration emphasizes the hardware system and software design via a predictor-corrector model that simultaneously decreases the sample rate while performing the classification.
Real-time machine vision system using FPGA and soft-core processor
NASA Astrophysics Data System (ADS)
Malik, Abdul Waheed; Thörnberg, Benny; Meng, Xiaozhou; Imran, Muhammad
2012-06-01
This paper presents a machine vision system for real-time computation of distance and angle of a camera from reference points in the environment. Image pre-processing, component labeling and feature extraction modules were modeled at Register Transfer (RT) level and synthesized for implementation on field programmable gate arrays (FPGA). The extracted image component features were sent from the hardware modules to a soft-core processor, MicroBlaze, for computation of distance and angle. A CMOS imaging sensor operating at a clock frequency of 27 MHz was used in our experiments to produce a video stream at the rate of 75 frames per second. The image component labeling and feature extraction modules ran in parallel with a total latency of 13 ms. The MicroBlaze was interfaced with the component labeling and feature extraction modules through Fast Simplex Link (FSL). The latency for computing distance and angle of the camera from the reference points was measured to be 2 ms on the MicroBlaze, running at a 100 MHz clock frequency. In this paper, we present the performance analysis, device utilization and power consumption for the designed system. The FPGA based machine vision system that we propose has high frame speed, low latency and a power consumption that is much lower than that of commercially available smart camera solutions.
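The timing figures in this abstract can be checked against the frame rate with a short calculation. The sketch below compares each quoted stage latency with the frame period at 75 frames per second. It assumes the hardware modules and the MicroBlaze form a two-stage pipeline in which each stage only has to finish within one frame period; the abstract does not state this, and the stage labels and function name are hypothetical.

```python
def frame_period_ms(frames_per_second):
    """Time available per frame, in milliseconds."""
    return 1000.0 / frames_per_second


# Figures quoted in the abstract.
period_ms = frame_period_ms(75)
stage_latency_ms = {
    "labeling + feature extraction (FPGA modules)": 13.0,
    "distance and angle computation (MicroBlaze)": 2.0,
}

# End-to-end latency if the two stages run back to back on one frame.
end_to_end_ms = sum(stage_latency_ms.values())

# Under the pipelining assumption, throughput is sustained when every
# individual stage fits inside one frame period.
stage_fits = {name: latency <= period_ms for name, latency in stage_latency_ms.items()}

print(f"frame period: {period_ms:.2f} ms")
print(f"end-to-end latency: {end_to_end_ms:.1f} ms")
for name, fits in stage_fits.items():
    print(f"{name}: {'fits' if fits else 'exceeds'} one frame period")
```

On these numbers the 13 ms hardware stage sits just inside the roughly 13.3 ms frame period, so the frame rate is sustainable only if the 2 ms software stage overlaps with processing of the next frame.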
Powerful conveyer belt real-time online detection system based on x-ray
NASA Astrophysics Data System (ADS)
Rong, Feng; Miao, Chang-yun; Meng, Wei
2009-07-01
Powerful conveyor belts are widely used in mines, docks, and similar settings. After long use, the internal steel ropes of a conveyor belt may fracture or rust, and joints may shift, which creates potential safety problems. A detection system based on x-ray is designed in this paper. A linear array detector (LDA) is used because its cost is low, its response is fast, and the technology is mature. The output charge of the LDA is transformed into a differential voltage signal by an amplifier. This kind of signal has strong noise immunity and is suitable for long-distance transmission. The processor is an FPGA. An IP core controls a 4-channel A/D converter to achieve parallel output data collection. A MicroBlaze soft-core processor, which handles the TCP/IP protocol, is embedded in the FPGA. Sampled data are transferred to a computer via Ethernet. To improve image quality, an algorithm that removes noise from the measurement result and applies gain normalization to pixel values is studied and designed. Experiments show that the system works well, can inspect online and in real time a conveyor belt 2.0 m wide moving at 5 m/s, and does not affect production. The image is clear and visual, making the condition of the conveyor belt easy to judge.
Design and optimization of a portable LQCD Monte Carlo code using OpenACC
NASA Astrophysics Data System (ADS)
Bonati, Claudio; Coscetti, Simone; D'Elia, Massimo; Mesiti, Michele; Negro, Francesco; Calore, Enrico; Schifano, Sebastiano Fabio; Silvi, Giorgio; Tripiccione, Raffaele
The present panorama of HPC architectures is extremely heterogeneous, ranging from traditional multi-core CPU processors, supporting a wide class of applications but delivering moderate computing performance, to many-core Graphics Processor Units (GPUs), exploiting aggressive data-parallelism and delivering higher performances for streaming computing applications. In this scenario, code portability (and performance portability) becomes necessary for easy maintainability of applications; this is very relevant in scientific computing where code changes are very frequent, making it tedious and prone to error to keep different code versions aligned. In this work, we present the design and optimization of a state-of-the-art production-level LQCD Monte Carlo application, using the directive-based OpenACC programming model. OpenACC abstracts parallel programming to a descriptive level, relieving programmers from specifying how codes should be mapped onto the target architecture. We describe the implementation of a code fully written in OpenACC, and show that we are able to target several different architectures, including state-of-the-art traditional CPUs and GPUs, with the same code. We also measure performance, evaluating the computing efficiency of our OpenACC code on several architectures, comparing with GPU-specific implementations and showing that a good level of performance-portability can be reached.
NASA Astrophysics Data System (ADS)
Minamino, Takuya; Mine, Atsushi; Matsumoto, Mariko; Sugawa, Yoshihiko; Kabetani, Tomoshige; Higashi, Mami; Kawaguchi, Asuka; Ohmi, Masato; Awazu, Kunio; Yatani, Hirofumi
2015-10-01
No previous reports have observed inside the root canal using both optical coherence tomography (OCT) and x-ray microcomputed tomography (μCT) for the same sample. The purpose of this study was to clarify both OCT and μCT image properties from observations of the same root canal after resin core build-up treatment. As OCT allows real-time observation of samples, gap formation may be able to be shown in real time. A dual-cure, one-step, self-etch adhesive system bonding agent, and dual-cure resin composite core material were used in root canals in accordance with instructions from the manufacturer. The resulting OCT images were superior for identifying gap formation at the interface, while μCT images were better to grasp the tooth form. Continuous tomographic images from real-time OCT observation allowed successful construction of a video of the resin core build-up procedure. After 10 to 12 s of light curing, a gap with a clear new signal occurred at the root-core material interface, proceeding from the coronal side (6 mm from the cemento-enamel junction) to the apical side of the root.
Li, Xiangyu; Xie, Nijie; Tian, Xinyue
2017-01-01
This paper proposes a scheduling and power management solution for energy harvesting heterogeneous multi-core WSN node SoC such that the system continues to operate perennially and uses the harvested energy efficiently. The solution consists of a heterogeneous multi-core system oriented task scheduling algorithm and a low-complexity dynamic workload scaling and configuration optimization algorithm suitable for light-weight platforms. Moreover, because the power consumption of most WSN applications is data dependent, we also introduce a branch handling mechanism into the solution. The experimental results show that the proposed algorithm can operate in real time on a lightweight embedded processor (MSP430), and that it enables a system to do more valuable work and to use more than 99.9% of the power budget. PMID:28208730
DOE Office of Scientific and Technical Information (OSTI.GOV)
You, Yang; Fu, Haohuan; Song, Shuaiwen
2014-07-18
Wave propagation forward modeling is a widely used computational method in oil and gas exploration. The iterative stencil loops in such problems have broad applications in scientific computing. However, executing such loops can be highly time-consuming, which greatly limits an application's performance and power efficiency. In this paper, we accelerate the forward modeling technique on the latest multi-core and many-core architectures such as Intel Sandy Bridge CPUs, NVIDIA Fermi C2070 GPU, NVIDIA Kepler K20x GPU, and the Intel Xeon Phi Co-processor. For the GPU platforms, we propose two parallel strategies to explore the performance optimization opportunities for our stencil kernels. For Sandy Bridge CPUs and MIC, we also employ various optimization techniques in order to achieve the best.
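To make "iterative stencil loop" concrete, the sketch below advances a 1D scalar wavefield with a second-order, 3-point finite-difference stencil. This is a textbook illustration under our own assumptions (1D grid, zero boundary values, start from rest); it is not the kernel from the paper, which targets much larger stencils on CPUs, GPUs, and the Xeon Phi.

```python
def wave_step(u_prev, u_curr, courant2):
    """Advance a 1D wavefield one time step with a 3-point stencil.

    courant2 is (v*dt/dx)**2 and should not exceed 1 for stability.
    Boundary values are held at zero.
    """
    n = len(u_curr)
    u_next = [0.0] * n
    for i in range(1, n - 1):
        # Discrete second derivative in space.
        laplacian = u_curr[i - 1] - 2.0 * u_curr[i] + u_curr[i + 1]
        # Leapfrog update in time.
        u_next[i] = 2.0 * u_curr[i] - u_prev[i] + courant2 * laplacian
    return u_next


def propagate(u0, steps, courant2=1.0):
    """Run the stencil loop, starting from rest (u_prev equals u_curr)."""
    u_prev, u_curr = list(u0), list(u0)
    for _ in range(steps):
        u_prev, u_curr = u_curr, wave_step(u_prev, u_curr, courant2)
    return u_curr
```

Every output point depends only on a few neighbours from the previous time steps, so the inner loop is data-parallel; that is the property the GPU and many-core strategies in the abstract exploit.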
Li, Xiangyu; Xie, Nijie; Tian, Xinyue
2017-02-08
This paper proposes a scheduling and power management solution for energy harvesting heterogeneous multi-core WSN node SoC such that the system continues to operate perennially and uses the harvested energy efficiently. The solution consists of a heterogeneous multi-core system oriented task scheduling algorithm and a low-complexity dynamic workload scaling and configuration optimization algorithm suitable for light-weight platforms. Moreover, because the power consumption of most WSN applications is data dependent, we also introduce a branch handling mechanism into the solution. The experimental results show that the proposed algorithm can operate in real time on a lightweight embedded processor (MSP430), and that it enables a system to do more valuable work and to use more than 99.9% of the power budget.
Gamma thermometer based reactor core liquid level detector
Burns, Thomas J.
1983-01-01
A system is provided which employs a modified gamma thermometer for determining the liquid coolant level within a nuclear reactor core. The gamma thermometer which normally is employed to monitor local core heat generation rate (reactor power), is modified by thermocouple junctions and leads to obtain an unambiguous indication of the presence or absence of coolant liquid at the gamma thermometer location. A signal processor generates a signal based on the thermometer surface heat transfer coefficient by comparing the signals from the thermocouples at the thermometer location. The generated signal is a direct indication of loss of coolant due to the change in surface heat transfer when coolant liquid drops below the thermometer location. The loss of coolant indication is independent of reactor power at the thermometer location. Further, the same thermometer may still be used for the normal power monitoring function.
Dual Active Bridge based DC Transformer LabVIEW FPGA Control Code
DOE Office of Scientific and Technical Information (OSTI.GOV)
In the area of power electronics control, Field Programmable Gate Arrays (FPGAs) have the capability to outperform their Digital Signal Processor (DSP) counterparts due to the FPGA's ability to implement true parallel processing and therefore facilitate higher switching frequencies, higher control bandwidth, and/or enhanced functionality. National Instruments (NI) has developed two platforms, Compact RIO (cRIO) and Single Board RIO (sbRIO), which combine a real-time processor with an FPGA. The FPGA can be programmed with a subset of the well-known LabVIEW graphical programming language. The candidate software implements complete control algorithms in LabVIEW FPGA for a DC Transformer (DCX) based on a dual active bridge (DAB). A DCX is an isolated bi-directional DC-DC converter designed to operate at unity conversion ratio, M, which is defined in terms of the primary-side DC bus voltage Vin, the secondary-side DC bus voltage Vout, and the turns ratio n of the embedded high frequency transformer (HFX). The DCX based on a DAB incorporates two H-bridges, a resonant inductor, and an HFX to provide this functionality. The candidate software employs phase-shift modulation of the two H-bridges and a feedback loop to regulate the conversion ratio at unity. The software also includes alarm-handling capabilities as well as debugging and tuning tools. The software fits on the Xilinx Virtex V LX110 FPGA embedded in the NI cRIO-9118 FPGA chassis, and with a 40 MHz base clock, supports a modulation update rate of 40 MHz, and user-settable switching frequencies and synchronized control loop update rates of tens of kHz.
ERIC Educational Resources Information Center
Li, Jennifer J.; Steele, Jennifer L.; Slater, Robert; Bacon, Michael; Miller, Trey
2016-01-01
Dual-language immersion programs--in which students learn core subjects (language arts, math, science, and social studies) in both English and a "partner" language--have been gaining in popularity across the United States. Such programs may use a "two-way model," in which roughly half the students are native speakers of the…
Kashyap, Smita; Singh, Nitesh; Surnar, Bapurao; Jayakannan, Manickam
2016-01-11
Dual responsive polymer nanoscaffolds for administering anticancer drugs both at the tumor site and intracellular compartments are made for improving treatment in cancers. The present work reports the design and development of new thermo- and enzyme-responsive amphiphilic copolymer core-shell nanoparticles for doxorubicin delivery at extracellular and intracellular compartments, respectively. A hydrophobic acrylate monomer was tailor-made from 3-pentadecylphenol (PDP, a natural resource) and copolymerized with oligoethylene glycol acrylate (as a hydrophilic monomer) to make new classes of thermo and enzyme dual responsive polymeric amphiphiles. Both radical and reversible addition-fragmentation chain transfer (RAFT) methodologies were adapted for making the amphiphilic copolymers. These amphiphilic copolymers were self-assembled to produce spherical core-shell nanoparticles in water. Upon heating, the core-shell nanoparticles underwent segregation to produce larger sized aggregates above the lower critical solution temperature (LCST). The dual responsive polymer scaffold was found to be capable of loading water insoluble drug, such as doxorubicin (DOX), and fluorescent probe-like Nile Red. The drug release kinetics revealed that DOX was preserved in the core-shell assemblies at normal body temperature (below LCST, ≤ 37 °C). At closer to cancer tissue temperature (above LCST, ∼43 °C), the polymeric scaffold underwent burst release to deliver 90% of loaded drugs within 2 h. At the intracellular environment (pH 7.4, 37 °C) in the presence of esterase enzyme, the amphiphilic copolymer ruptured in a slow and controlled manner to release >95% of the drugs in 12 h. Thus, both burst release of cargo at the tumor microenvironment and control delivery at intracellular compartments were accomplished in a single polymer scaffold. Cytotoxicity assays of the nascent and DOX-loaded polymer were carried out in breast cancer (MCF-7) and cervical cancer (HeLa) cells. 
Among the two cell lines, the DOX-loaded polymers showed enhanced killing in breast cancer cells. Furthermore, the cellular uptake of the DOX was studied by confocal and fluorescence microscopes. The present investigation opens a new enzyme and thermal-responsive polymer scaffold approach for DOX delivery in cancer cells.
Cache Energy Optimization Techniques For Modern Processors
DOE Office of Scientific and Technical Information (OSTI.GOV)
Mittal, Sparsh
2013-01-01
Modern multicore processors are employing large last-level caches; for example, Intel's E7-8800 processor uses a 24MB L3 cache. Further, with each CMOS technology generation, leakage energy has been dramatically increasing and hence, leakage energy is expected to become a major source of energy dissipation, especially in last-level caches (LLCs). The conventional schemes of cache energy saving either aim at saving dynamic energy or are based on properties specific to first-level caches, and thus these schemes have limited utility for last-level caches. Further, several other techniques require offline profiling or per-application tuning and hence are not suitable for product systems. In this book, we present novel cache leakage energy saving schemes for single-core and multicore systems; desktop, QoS, real-time and server systems. Also, we present cache energy saving techniques for caches designed with both conventional SRAM devices and emerging non-volatile devices such as STT-RAM (spin-torque transfer RAM). We present software-controlled, hardware-assisted techniques which use dynamic cache reconfiguration to configure the cache to the most energy efficient configuration while keeping the performance loss bounded. To profile and test a large number of potential configurations, we utilize low-overhead, micro-architecture components, which can be easily integrated into modern processor chips. We adopt a system-wide approach to save energy to ensure that cache reconfiguration does not increase energy consumption of other components of the processor. We have compared our techniques with state-of-the-art techniques and have found that our techniques outperform them in terms of energy efficiency and other relevant metrics. The techniques presented in this book have important applications in improving energy-efficiency of higher-end embedded, desktop, QoS, real-time, server processors and multitasking systems. 
This book is intended to be a valuable guide for both newcomers and veterans in the field of cache power management. It will help graduate students, CAD tool developers and designers in understanding the need of energy efficiency in modern computing systems. Further, it will be useful for researchers in gaining insights into algorithms and techniques for micro-architectural and system-level energy optimization using dynamic cache reconfiguration. We sincerely believe that the "food for thought" presented in this book will inspire the readers to develop even better ideas for designing "green" processors of tomorrow.
Aronoff-Spencer, Eliah; Venkatesh, A G; Sun, Alex; Brickner, Howard; Looney, David; Hall, Drew A
2016-12-15
Yeast cell lines were genetically engineered to display Hepatitis C virus (HCV) core antigen linked to gold binding peptide (GBP) as a dual-affinity biobrick chimera. These multifunctional yeast cells adhere to the gold sensor surface while simultaneously acting as a "renewable" capture reagent for anti-HCV core antibody. This streamlined functionalization and detection strategy removes the need for traditional purification and immobilization techniques. With this biobrick construct, both optical and electrochemical immunoassays were developed. The optical immunoassays demonstrated detection of anti-HCV core antibody down to 12.3pM concentrations while the electrochemical assay demonstrated higher binding constants and dynamic range. The electrochemical format and a custom, low-cost smartphone-based potentiostat ($20 USD) yielded comparable results to assays performed on a state-of-the-art electrochemical workstation. We propose this combination of synthetic biology and scalable, point-of-care sensing has potential to provide low-cost, cutting edge diagnostic capability for many pathogens in a variety of settings. Copyright © 2016 Elsevier B.V. All rights reserved.
NASA Astrophysics Data System (ADS)
Kim, Su Yeon; Jeong, Jong Seok; Mkhoyan, K. Andre; Jang, Ho Seong
2016-05-01
Highly efficient downconversion (DC) green-emitting LiYF4:Ce,Tb nanophosphors have been synthesized for bright dual-mode upconversion (UC) and DC green-emitting core/double-shell (C/D-S) nanophosphors--Li(Gd,Y)F4:Yb(18%),Er(2%)/LiYF4:Ce(15%),Tb(15%)/LiYF4--and the C/D-S structure has been proved by extensive scanning transmission electron microscopy (STEM) analysis. Colloidal LiYF4:Ce,Tb nanophosphors with a tetragonal bipyramidal shape are synthesized for the first time and they show intense DC green light via energy transfer from Ce3+ to Tb3+ under illumination with ultraviolet (UV) light. The LiYF4:Ce,Tb nanophosphors show 65 times higher photoluminescence intensity than LiYF4:Tb nanophosphors under illumination with UV light and the LiYF4:Ce,Tb is adapted into a luminescent shell of the tetragonal bipyramidal C/D-S nanophosphors. The formation of the DC shell on the core significantly enhances UC luminescence from the UC core under irradiation of near infrared light and concurrently generates DC luminescence from the core/shell nanophosphors under UV light. Coating with an inert inorganic shell further enhances the UC-DC dual-mode luminescence by suppressing the surface quenching effect. The C/D-S nanophosphors show 3.8% UC quantum efficiency (QE) at 239 W cm-2 and 73.0 +/- 0.1% DC QE. The designed C/D-S architecture in tetragonal bipyramidal nanophosphors is rigorously verified by an energy dispersive X-ray spectroscopy (EDX) analysis, with the assistance of line profile simulation, using an aberration-corrected scanning transmission electron microscope equipped with a high-efficiency EDX. 
The feasibility of these C/D-S nanophosphors for transparent display devices is also considered.
Electronic supplementary information (ESI) available: XRD patterns, PL and PLE spectra, SEM and HR-TEM images, PL decay times, photographs showing the transparent nanophosphor solutions and their dual-mode luminescence, and additional EDX data. See DOI: 10.1039/c5nr05722a
DOE Office of Scientific and Technical Information (OSTI.GOV)
Kumar, Sameer; Mamidala, Amith R.; Ratterman, Joseph D.
A system and method for enhancing barrier collective synchronization on a computer system comprises a computer system including a data storage device. The computer system includes a program stored in the data storage device and steps of the program being executed by a processor. The system includes providing a plurality of communicators for storing state information for a barrier algorithm. Each communicator designates a master core in a multi-processor environment of the computer system. The system allocates or designates one counter for each of a plurality of threads. The system configures a table with a number of entries equal to the maximum number of threads. The system sets a table entry with an ID associated with a communicator when a process thread initiates a collective. The system determines an allocated or designated counter by searching entries in the table.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Blocksome, Michael; Kumar, Sameer; Mamidala, Amith R.
A system and method for enhancing barrier collective synchronization on a computer system comprises a computer system including a data storage device. The computer system includes a program stored in the data storage device and steps of the program being executed by a processor. The system includes providing a plurality of communicators for storing state information for a barrier algorithm. Each communicator designates a master core in a multi-processor environment of the computer system. The system allocates or designates one counter for each of a plurality of threads. The system configures a table with a number of entries equal to the maximum number of threads. The system sets a table entry with an ID associated with a communicator when a process thread initiates a collective. The system determines an allocated or designated counter by searching entries in the table.
NASA Astrophysics Data System (ADS)
Qiang, Ji
2017-10-01
A three-dimensional (3D) Poisson solver with longitudinal periodic and transverse open boundary conditions can have important applications in beam physics of particle accelerators. In this paper, we present a fast efficient method to solve the Poisson equation using a spectral finite-difference method. This method uses a computational domain that contains the charged particle beam only and has a computational complexity of O(Nu log(Nmode)), where Nu is the total number of unknowns and Nmode is the maximum number of longitudinal or azimuthal modes. This saves both computational time and memory compared with using an artificial boundary condition in a large extended computational domain. The new 3D Poisson solver is parallelized using a message passing interface (MPI) on multi-processor computers and shows a reasonable parallel performance up to hundreds of processor cores.
Modeling of the ground-to-SSFMB link networking features using SPW
NASA Technical Reports Server (NTRS)
Watson, John C.
1993-01-01
This report describes the modeling and simulation of the networking features of the ground-to-Space Station Freedom manned base (SSFMB) link using COMDISCO signal processing work-system (SPW). The networking features modeled include the implementation of Consultative Committee for Space Data Systems (CCSDS) protocols in the multiplexing of digitized audio and core data into virtual channel data units (VCDU's) in the control center complex and the demultiplexing of VCDU's in the onboard baseband signal processor. The emphasis of this work has been placed on techniques for modeling the CCSDS networking features using SPW. The objectives for developing the SPW models are to test the suitability of SPW for modeling networking features and to develop SPW simulation models of the control center complex and space station baseband signal processor for use in end-to-end testing of the ground-to-SSFMB S-band single access forward (SSAF) link.
Multicore Challenges and Benefits for High Performance Scientific Computing
Nielsen, Ida M. B.; Janssen, Curtis L.
2008-01-01
Until recently, performance gains in processors were achieved largely by improvements in clock speeds and instruction level parallelism. Thus, applications could obtain performance increases with relatively minor changes by upgrading to the latest generation of computing hardware. Currently, however, processor performance improvements are realized by using multicore technology and hardware support for multiple threads within each core, and taking full advantage of this technology to improve the performance of applications requires exposure of extreme levels of software parallelism. We will here discuss the architecture of parallel computers constructed from many multicore chips as well as techniques for managing the complexity of programming such computers, including the hybrid message-passing/multi-threading programming model. We will illustrate these ideas with a hybrid distributed memory matrix multiply and a quantum chemistry algorithm for energy computation using Møller–Plesset perturbation theory.
A pervasive parallel framework for visualization: final report for FWP 10-014707
DOE Office of Scientific and Technical Information (OSTI.GOV)
Moreland, Kenneth D.
2014-01-01
We are on the threshold of a transformative change in the basic architecture of high-performance computing. The use of accelerator processors, characterized by large core counts, shared but asymmetrical memory, and heavy thread loading, is quickly becoming the norm in high performance computing. These accelerators represent significant challenges in updating our existing base of software. An intrinsic problem with this transition is a fundamental programming shift from message passing processes to much more fine thread scheduling with memory sharing. Another problem is the lack of stability in accelerator implementation; processor and compiler technology is currently changing rapidly. This report documents the results of our three-year ASCR project to address these challenges. Our project includes the development of the Dax toolkit, which contains the beginnings of new algorithms for a new generation of computers and the underlying infrastructure to rapidly prototype and build further algorithms as necessary.
Parallelization of the preconditioned IDR solver for modern multicore computer systems
NASA Astrophysics Data System (ADS)
Bessonov, O. A.; Fedoseyev, A. I.
2012-10-01
This paper presents the analysis, parallelization and optimization approach for the large sparse matrix solver CNSPACK for modern multicore microprocessors. CNSPACK is an advanced solver successfully used for coupled solution of stiff problems arising in multiphysics applications such as CFD, semiconductor transport, kinetic and quantum problems. It employs the iterative IDR algorithm with ILU preconditioning (user-chosen ILU preconditioning order). CNSPACK has been successfully used during the last decade for solving problems in several application areas, including fluid dynamics and semiconductor device simulation. However, processor architectures and computer system organization have changed dramatically in recent years. Because of this, performance criteria and methods have been revisited, and the solver and preconditioner have been parallelized using the OpenMP environment. Results of the successful implementation for efficient parallelization are presented for the most advanced computer systems (Intel Core i7-9xx or two-processor Xeon 55xx/56xx).
Kumar, Sameer; Mamidala, Amith R.; Ratterman, Joseph D.; Blocksome, Michael; Miller, Douglas
2013-09-03
A system and method for enhancing barrier collective synchronization on a computer system comprises a computer system including a data storage device. The computer system includes a program stored in the data storage device and steps of the program being executed by a processor. The system includes providing a plurality of communicators for storing state information for a barrier algorithm. Each communicator designates a master core in a multi-processor environment of the computer system. The system allocates or designates one counter for each of a plurality of threads. The system configures a table with a number of entries equal to the maximum number of threads. The system sets a table entry with an ID associated with a communicator when a process thread initiates a collective. The system determines an allocated or designated counter by searching entries in the table.
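The table lookup described in this abstract can be sketched roughly as follows. This is a single-threaded illustration under our own assumptions: the class and method names are hypothetical, there is no locking or atomic hardware counter, and the policy of freeing an entry when the last participant arrives is ours rather than something the abstract states.

```python
class CounterTable:
    """Fixed-size table mapping communicator IDs to counters (hypothetical)."""

    def __init__(self, max_threads):
        # One table entry and one counter per thread, as the abstract describes.
        self.entries = [None] * max_threads   # communicator ID, or None if free
        self.counters = [0] * max_threads

    def acquire(self, comm_id):
        """Return the counter index for comm_id, claiming a free entry if needed."""
        free = None
        for idx, owner in enumerate(self.entries):
            if owner == comm_id:
                return idx                    # counter already designated
            if owner is None and free is None:
                free = idx                    # remember the first free slot
        if free is None:
            raise RuntimeError("no free counter")
        self.entries[free] = comm_id
        return free

    def arrive(self, comm_id, participants):
        """Record one arrival; return True when the barrier is complete."""
        idx = self.acquire(comm_id)
        self.counters[idx] += 1
        if self.counters[idx] == participants:
            # Assumed policy: the last arrival resets and frees the entry.
            self.counters[idx] = 0
            self.entries[idx] = None
            return True
        return False
```

Because the table has as many entries as there are threads, a linear search is bounded by the thread count, which keeps the lookup cheap relative to the synchronization itself.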
Signore, Antonio; Benedicenti, Stefano; Kaitsas, Vassilios; Barone, Michele; Angiero, Francesca; Ravera, Giambattista
2009-02-01
This retrospective study investigated the clinical effectiveness over up to 8 years of parallel-sided and of tapered glass-fiber posts, in combination with either hybrid composite or dual-cure composite resin core material, in endodontically treated, maxillary anterior teeth covered with full-ceramic crowns. The study population comprised 192 patients and 526 endodontically treated teeth, with various degrees of hard-tissue loss, restored by the post-and-core technique. Four groups were defined based on post shape and core build-up materials, and within each group post-and-core restorations were assigned randomly with respect to root morphology. Inclusion criteria were symptom-free endodontic therapy, root-canal treatment with a minimum apical seal of 4mm, application of rubber dam, need for post-and-core complex because of coronal tooth loss, and tooth with at least one residual coronal wall. Survival rate of the post-and-core restorations was determined using Kaplan-Meier statistical analysis. The restorations were examined clinically and radiologically; mean observation period was 5.3 years. The overall survival rate of glass-fiber post-and-core restorations was 98.5%. The survival rate for parallel-sided posts was 98.6% and for tapered posts was 96.8%. Survival rates for core build-up materials were 100% for dual-cure composite and 96.8% for hybrid light-cure composite. For both glass-fiber post designs and for both core build-up materials, clinical performance was satisfactory. Survival was higher for teeth retaining four and three coronal walls.
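For readers unfamiliar with the Kaplan-Meier analysis mentioned above, the sketch below shows the product-limit estimator in its standard textbook form. The function name and the input layout are our own choices, and no data from the study are reproduced here.

```python
def kaplan_meier(durations, events):
    """Return [(time, survival)] at each time where a failure occurs.

    durations -- follow-up time per restoration
    events    -- True if the restoration failed, False if censored
    """
    order = sorted(zip(durations, events))
    at_risk = len(order)
    survival = 1.0
    curve = []
    i = 0
    while i < len(order):
        t = order[i][0]
        failures = removed = 0
        # Group all observations that share the same time.
        while i < len(order) and order[i][0] == t:
            failures += order[i][1]
            removed += 1
            i += 1
        if failures:
            # Product-limit update: multiply by the fraction surviving at t.
            survival *= 1.0 - failures / at_risk
            curve.append((t, survival))
        # Failed and censored cases both leave the risk set after time t.
        at_risk -= removed
    return curve
```

Censored restorations (those still intact at their last examination) reduce the number at risk without lowering the survival estimate, which is why the method suits follow-up periods of unequal length.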
Efficient algorithms and implementations of entropy-based moment closures for rarefied gases
DOE Office of Scientific and Technical Information (OSTI.GOV)
Schaerer, Roman Pascal, E-mail: schaerer@mathcces.rwth-aachen.de; Bansal, Pratyuksh; Torrilhon, Manuel
We present efficient algorithms and implementations of the 35-moment system equipped with the maximum-entropy closure in the context of rarefied gases. While closures based on the principle of entropy maximization have been shown to yield very promising results for moderately rarefied gas flows, the computational cost of these closures is in general much higher than for closure theories with explicit closed-form expressions of the closing fluxes, such as Grad's classical closure. Following a similar approach as Garrett et al. (2015), we investigate efficient implementations of the computationally expensive numerical quadrature method used for the moment evaluations of the maximum-entropy distribution by exploiting its inherent fine-grained parallelism with the parallelism offered by multi-core processors and graphics cards. We show that using a single graphics card as an accelerator allows speed-ups of two orders of magnitude when compared to a serial CPU implementation. To accelerate the time-to-solution for steady-state problems, we propose a new semi-implicit time discretization scheme. The resulting nonlinear system of equations is solved with a Newton type method in the Lagrange multipliers of the dual optimization problem in order to reduce the computational cost. Additionally, fully explicit time-stepping schemes of first and second order accuracy are presented. We investigate the accuracy and efficiency of the numerical schemes for several numerical test cases, including a steady-state shock-structure problem.
A high efficiency dual-junction solar cell implemented as a nanowire array.
Yu, Shuqing; Witzigmann, Bernd
2013-01-14
In this work, we present an innovative design of a dual-junction nanowire array solar cell. Using a dual-diameter nanowire structure, the solar spectrum is separated and absorbed in the core wire and the shell wire with respect to the wavelength. This solar cell provides high optical absorptivity over the entire spectrum due to an electromagnetic concentration effect. Microscopic simulations were performed in a three-dimensional setup, and the optical properties of the structure were evaluated by solving Maxwell's equations. The Shockley-Queisser method was employed to calculate the current-voltage relationship of the dual-junction structure. Proper design of the geometrical and material parameters leads to an efficiency of 39.1%.
T-L Plane Abstraction-Based Energy-Efficient Real-Time Scheduling for Multi-Core Wireless Sensors
Kim, Youngmin; Lee, Ki-Seong; Pham, Ngoc-Son; Lee, Sun-Ro; Lee, Chan-Gun
2016-01-01
Energy efficiency is considered as a critical requirement for wireless sensor networks. As more wireless sensor nodes are equipped with multi-cores, there are emerging needs for energy-efficient real-time scheduling algorithms. The T-L plane-based scheme is known to be an optimal global scheduling technique for periodic real-time tasks on multi-cores. Unfortunately, there has been a scarcity of studies on extending T-L plane-based scheduling algorithms to exploit energy-saving techniques. In this paper, we propose a new T-L plane-based algorithm enabling energy-efficient real-time scheduling on multi-core sensor nodes with dynamic power management (DPM). Our approach addresses the overhead of processor mode transitions and reduces fragmentations of the idle time, which are inherent in T-L plane-based algorithms. Our experimental results show the effectiveness of the proposed algorithm compared to other energy-aware scheduling methods on T-L plane abstraction. PMID:27399722
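To illustrate why mode-transition overhead and idle-time fragmentation matter under DPM, the following minimal sketch uses a generic break-even energy model. It is not the T-L plane algorithm proposed in the paper; the function names and the power and energy figures are hypothetical and chosen only for illustration.

```python
def break_even_time(p_idle, p_sleep, e_transition):
    """Shortest idle interval (s) for which sleeping saves energy.

    Sleeping costs e_transition + p_sleep * t, staying idle costs p_idle * t,
    so sleeping wins once t exceeds e_transition / (p_idle - p_sleep).
    """
    if p_idle <= p_sleep:
        raise ValueError("sleep mode must draw less power than idle mode")
    return e_transition / (p_idle - p_sleep)


def idle_energy(intervals, p_idle, p_sleep, e_transition):
    """Energy (J) spent over idle intervals, sleeping only when it pays off."""
    t_be = break_even_time(p_idle, p_sleep, e_transition)
    total = 0.0
    for t in intervals:
        if t > t_be:
            # Long gap: pay the transition cost once, then draw sleep power.
            total += e_transition + p_sleep * t
        else:
            # Short gap: a mode switch would cost more than it saves.
            total += p_idle * t
    return total


# Hypothetical figures: 100 mW idle, 10 mW sleep, 9 mJ per sleep/wake cycle.
P_IDLE, P_SLEEP, E_TRANS = 0.100, 0.010, 0.009

fragmented = [0.05] * 4  # four 50 ms gaps, each below the break-even time
merged = [0.20]          # the same total idle time as a single 200 ms gap

print(idle_energy(fragmented, P_IDLE, P_SLEEP, E_TRANS))
print(idle_energy(merged, P_IDLE, P_SLEEP, E_TRANS))
```

Under these assumed figures the merged interval costs less energy than the four fragments, which is the effect that reducing idle-time fragmentation is meant to exploit.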
IGA-ADS: Isogeometric analysis FEM using ADS solver
NASA Astrophysics Data System (ADS)
Łoś, Marcin M.; Woźniak, Maciej; Paszyński, Maciej; Lenharth, Andrew; Hassaan, Muhamm Amber; Pingali, Keshav
2017-08-01
In this paper we present a fast explicit solver for the solution of non-stationary problems using L2 projections with the isogeometric finite element method. The solver has been implemented within the GALOIS framework. It enables parallel multi-core simulations of different time-dependent problems, in 1D, 2D, or 3D. We have prepared the solver framework in a way that enables direct implementation of the selected PDE and corresponding boundary conditions. In this paper we describe the installation, the implementation of three exemplary PDEs, and the execution of the simulations on multi-core Linux cluster nodes. We consider three case studies, including heat transfer, linear elasticity, as well as non-linear flow in heterogeneous media. The presented package generates output suitable for interfacing with Gnuplot and ParaView visualization software. The exemplary simulations show near perfect scalability on the Gilbert shared-memory node with four Intel® Xeon® CPU E7-4860 processors, each possessing 10 physical cores (for a total of 40 cores).
NASA Astrophysics Data System (ADS)
Mielikainen, Jarno; Huang, Bormin; Huang, Allen H.-L.
2015-05-01
Intel Many Integrated Core (MIC) ushers in a new era of supercomputing speed, performance, and compatibility. It allows developers to run code at trillions of calculations per second using a familiar programming model. In this paper, we present our results of optimizing the updated Goddard shortwave radiation Weather Research and Forecasting (WRF) scheme on Intel Many Integrated Core Architecture (MIC) hardware. The Intel Xeon Phi coprocessor is the first product based on Intel MIC architecture, and it consists of up to 61 cores connected by a high performance on-die bidirectional interconnect. The coprocessor supports all important Intel development tools. Thus, the development environment is a familiar one to a vast number of CPU developers. However, getting maximum performance out of Xeon Phi will require using some novel optimization techniques. Those optimization techniques are discussed in this paper. The results show that the optimizations improved performance of the original code on Xeon Phi 7120P by a factor of 1.3x.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Doerfler, Douglas; Austin, Brian; Cook, Brandon
There are many potential issues associated with deploying the Intel Xeon Phi™ (code named Knights Landing [KNL]) manycore processor in a large-scale supercomputer. One in particular is the ability to fully utilize the high-speed communications network, given that the serial performance of a Xeon Phi™ core is a fraction of a Xeon® core. In this paper, we take a look at the trade-offs associated with allocating enough cores to fully utilize the Aries high-speed network versus cores dedicated to computation, e.g., the trade-off between MPI and OpenMP. In addition, we evaluate new features of Cray MPI in support of KNL, such as internode optimizations. We also evaluate one-sided programming models such as Unified Parallel C. We quantify the impact of the above trade-offs and features using a suite of National Energy Research Scientific Computing Center applications.
First Results from a Hardware-in-the-Loop Demonstration of Closed-Loop Autonomous Formation Flying
NASA Technical Reports Server (NTRS)
Gill, E.; Naasz, Bo; Ebinuma, T.
2003-01-01
A closed-loop system for the demonstration of autonomous satellite formation flying technologies using hardware-in-the-loop has been developed. Making use of a GPS signal simulator with a dual radio frequency outlet, the system includes two GPS space receivers as well as a powerful onboard navigation processor dedicated to the GPS-based guidance, navigation, and control of a satellite formation in real-time. The closed-loop system allows realistic simulations of autonomous formation flying scenarios, enabling research in the fields of tracking and orbit control strategies for a wide range of applications. The autonomous closed-loop formation acquisition and keeping strategy is based on Lyapunov's direct control method as applied to the standard set of Keplerian elements. This approach not only assures global and asymptotic stability of the control but also maintains valuable physical insight into the applied control vectors. Furthermore, the approach can account for system uncertainties and effectively avoids a computationally expensive solution of the two point boundary problem, which renders the concept particularly attractive for implementation in onboard processors. A guidance law has been developed which strictly separates the relative from the absolute motion, thus avoiding the numerical integration of a target trajectory in the onboard processor. Moreover, upon using precise kinematic relative GPS solutions, a dynamical modeling or filtering is avoided which provides for an efficient implementation of the process on an onboard processor. A sample formation flying scenario has been created aiming at the autonomous transition of a Low Earth Orbit satellite formation from an initial along-track separation of 800 m to a target distance of 100 m. 
Assuming a low-thrust actuator which may be accommodated on a small satellite, a typical control accuracy of less than 5 m has been achieved which proves the applicability of autonomous formation flying techniques to formations of satellites as close as 50 m.
Design and implementation of projects with Xilinx Zynq FPGA: a practical case
NASA Astrophysics Data System (ADS)
Travaglini, R.; D'Antone, I.; Meneghini, S.; Rignanese, L.; Zuffa, M.
The main advantage when using FPGAs with embedded processors is the availability of several additional high-performance resources in the same physical device. Moreover, the FPGA programmability allows custom peripherals to be connected. Xilinx have designed a programmable device named Zynq-7000 (simply called Zynq in the following), which integrates programmable logic (identical to the other Xilinx 7 series devices) with a System on Chip (SOC) based on two embedded ARM processors. Since both parts are deeply connected, designers benefit from the performance of a hardware SOC as well as the flexibility of programmable logic. In this paper a design developed by the Electronic Design Department at the Bologna Division of INFN will be presented as a practical case of a project based on the Zynq device. It is developed using a commercial board called ZedBoard hosting an FMC mezzanine with a 12-bit 500 MS/s ADC. The Zynq FPGA on the ZedBoard receives digital outputs from the ADC and sends them to the acquisition PC, after proper formatting, through a Gigabit Ethernet link. The major focus of the paper will be the methodology to develop a Zynq-based design with the Xilinx Vivado software, highlighting how to configure the SOC and connect it with the programmable logic. Firmware design techniques will be presented: in particular, both VHDL and IP core based strategies will be discussed. Further, the procedure to develop software for the embedded processor will be presented. Finally, some debugging tools, like the embedded Logic Analyzer, will be shown. Advantages and disadvantages with respect to adopting FPGAs without embedded processors will be discussed.
NASA Astrophysics Data System (ADS)
Eilbert, Richard F.; Krug, Kristoph D.
1993-04-01
The Vivid Rapid Explosives Detection Systems is a true dual energy x-ray machine employing precision x-ray data acquisition in combination with unique algorithms and massive computation capability. Data from the system's 960 detectors is digitally stored and processed by powerful supermicro-computers organized as an expandable array of parallel processors. The algorithms operate on the dual energy attenuation image data to recognize and define objects in the milieu of the baggage contents. Each object is then systematically examined for a match to a specific effective atomic number, density, and mass threshold. Material properties are determined by comparing the relative attenuations of the 75 kVp and 150 kVp beams and electronically separating the object from its local background. Other heuristic algorithms search for specific configurations and provide additional information. The machine automatically detects explosive materials and identifies bomb components in luggage with high specificity and throughput. X-ray dose is comparable to that of current airport x-ray machines. The machine is also configured to find heroin, cocaine, and US currency by selecting appropriate settings on-site. Since January 1992, production units have been operationally deployed at U.S. and European airports for improved screening of checked baggage.
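The comparison of relative attenuations at two beam energies can be sketched with an idealized Beer-Lambert model. This is an illustrative simplification, not the system's actual algorithm: it assumes monoenergetic beams (real kVp beams are polychromatic), and all function and parameter names are hypothetical.

```python
import math


def log_attenuation(i0, i):
    """Log attenuation ln(I0 / I); equals mu * thickness for a uniform slab."""
    return math.log(i0 / i)


def dual_energy_ratio(i0_low, i_low, i0_high, i_high, bg_low=0.0, bg_high=0.0):
    """Ratio of low- to high-energy log attenuation for one object.

    bg_low and bg_high are the log attenuations of the local background,
    subtracted so that only the object's own contribution remains. In this
    idealized model the ratio is independent of thickness and depends only
    on the material's attenuation coefficients at the two energies.
    """
    low = log_attenuation(i0_low, i_low) - bg_low
    high = log_attenuation(i0_high, i_high) - bg_high
    if high == 0.0:
        raise ValueError("no attenuation at the high energy; ratio undefined")
    return low / high


# Hypothetical material with mu = 0.5 /cm (low energy) and 0.2 /cm (high energy).
for thickness_cm in (1.0, 3.0):
    i_low = math.exp(-0.5 * thickness_cm)
    i_high = math.exp(-0.2 * thickness_cm)
    print(thickness_cm, dual_energy_ratio(1.0, i_low, 1.0, i_high))
```

Because the thickness cancels in the ratio, a quantity of this kind can serve as a material signature, which is the general idea behind matching objects to an effective atomic number.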
Lee, Sang Seok; Seo, Hyeon Jin; Kim, Yun Ho; Kim, Shin-Hyun
2017-06-01
Photonic microcapsules with onion-like topology are microfluidically designed to have cholesteric liquid crystals with opposite handedness in their core and shell. The microcapsules exhibit structural colors caused by dual photonic bandgaps, resulting in a rich variety of color on the optical palette. Moreover, the microcapsules can switch the colors from either core or shell depending on the selection of light-handedness. © 2017 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.
An embedded multi-core parallel model for real-time stereo imaging
NASA Astrophysics Data System (ADS)
He, Wenjing; Hu, Jian; Niu, Jingyu; Li, Chuanrong; Liu, Guangyu
2018-04-01
Real-time processing on embedded systems will enhance the application capability of stereo imaging for LiDAR and hyperspectral sensors. Research on task partitioning and scheduling strategies for embedded multiprocessor systems started relatively late compared with that for PCs. In this paper, aimed at an embedded multi-core processing platform, a parallel model for stereo imaging is studied and verified. After analyzing the computing amount, throughput capacity and buffering requirements, a two-stage pipeline parallel model based on message transmission is established. This model can be applied to fast stereo imaging for airborne sensors with various characteristics. To demonstrate the feasibility and effectiveness of the parallel model, parallel software was designed using test flight data, based on the 8-core DSP processor TMS320C6678. The results indicate that the design performed well in workload distribution and had a speed-up ratio of up to 6.4.
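The structure of a two-stage pipeline linked by message passing can be sketched as follows. This is only a structural illustration using Python threads and a bounded queue, not the authors' DSP implementation; the function names, the stage callables, and the buffer depth are hypothetical.

```python
import queue
import threading

_DONE = object()  # sentinel message marking the end of the data stream


def run_pipeline(blocks, stage1, stage2, depth=4):
    """Run stage1 and stage2 as a two-stage pipeline linked by a bounded queue.

    The queue plays the role of the message channel between the stages, and
    its depth models the buffering available between them.
    """
    channel = queue.Queue(maxsize=depth)
    results = []

    def producer():
        for block in blocks:
            channel.put(stage1(block))  # blocks when the buffer is full
        channel.put(_DONE)

    def consumer():
        while True:
            item = channel.get()
            if item is _DONE:
                break
            results.append(stage2(item))

    workers = [threading.Thread(target=producer), threading.Thread(target=consumer)]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return results


# Toy stages standing in for the two halves of an imaging chain.
print(run_pipeline(range(5), lambda x: x + 1, lambda y: y * 2))
```

With a single producer and a single consumer the FIFO queue preserves block order. On a multi-core DSP each stage would be mapped to its own group of cores; Python threads here show the data flow rather than any real speed-up.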
A programmable computational image sensor for high-speed vision
NASA Astrophysics Data System (ADS)
Yang, Jie; Shi, Cong; Long, Xitian; Wu, Nanjian
2013-08-01
In this paper we present a programmable computational image sensor for high-speed vision. This computational image sensor contains four main blocks: an image pixel array, a massively parallel processing element (PE) array, a row processor (RP) array and a RISC core. The pixel-parallel PE is responsible for transferring, storing and processing raw image data in a SIMD fashion with its own programming language. The RPs are a one-dimensional array of simplified RISC cores that can carry out complex arithmetic and logic operations. The PE array and RP array can finish a great amount of computation in few instruction cycles and therefore satisfy the low- and middle-level high-speed image processing requirements. The RISC core controls the whole system operation and executes some high-level image processing algorithms. We utilize a simplified AHB bus as the system bus to connect our major components. A programming language and corresponding tool chain for this computational image sensor are also developed.
Hybrid Parallelism for Volume Rendering on Large-, Multi-, and Many-Core Systems
DOE Office of Scientific and Technical Information (OSTI.GOV)
Howison, Mark; Bethel, E. Wes; Childs, Hank
2012-01-01
With the computing industry trending towards multi- and many-core processors, we study how a standard visualization algorithm, ray-casting volume rendering, can benefit from a hybrid parallelism approach. Hybrid parallelism provides the best of both worlds: using distributed-memory parallelism across a large number of nodes increases available FLOPs and memory, while exploiting shared-memory parallelism among the cores within each node ensures that each node performs its portion of the larger calculation as efficiently as possible. We demonstrate results from weak and strong scaling studies, at levels of concurrency ranging up to 216,000, and with datasets as large as 12.2 trillion cells. The greatest benefit from hybrid parallelism lies in the communication portion of the algorithm, the dominant cost at higher levels of concurrency. We show that reducing the number of participants with a hybrid approach significantly improves performance.
NASA Astrophysics Data System (ADS)
Jo, Hyunho; Sim, Donggyu
2014-06-01
We present a bitstream decoding processor for entropy decoding of variable length coding-based multiformat videos. Since most of the computational complexity of entropy decoders comes from bitstream accesses and table look-up process, the developed bitstream processing unit (BsPU) has several designated instructions to access bitstreams and to minimize branch operations in the table look-up process. In addition, the instruction for bitstream access has the capability to remove emulation prevention bytes (EPBs) of H.264/AVC without initial delay, repeated memory accesses, and additional buffer. Experimental results show that the proposed method for EPB removal achieves a speed-up of 1.23 times compared to the conventional EPB removal method. In addition, the BsPU achieves speed-ups of 5.6 and 3.5 times in entropy decoding of H.264/AVC and MPEG-4 Visual bitstreams, respectively, compared to an existing processor without designated instructions and a new table mapping algorithm. The BsPU is implemented on a Xilinx Virtex5 LX330 field-programmable gate array. The MPEG-4 Visual (ASP, Level 5) and H.264/AVC (Main Profile, Level 4) are processed using the developed BsPU with a core clock speed of under 250 MHz in real time.
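For context, H.264/AVC inserts an emulation prevention byte (0x03) after any two consecutive zero bytes inside a NAL unit so that start-code patterns cannot appear in the payload; a decoder must drop those bytes before parsing. The sketch below shows a plain byte-by-byte software removal, roughly the kind of conventional approach a dedicated instruction would replace. It is not the BsPU method, and the function name is hypothetical.

```python
def remove_emulation_prevention_bytes(nal_payload: bytes) -> bytes:
    """Strip H.264 emulation prevention bytes (the 0x03 in 0x00 0x00 0x03)."""
    out = bytearray()
    zeros = 0  # number of consecutive 0x00 bytes seen immediately before
    for byte in nal_payload:
        if zeros >= 2 and byte == 0x03:
            zeros = 0  # drop the emulation prevention byte and restart the count
            continue
        out.append(byte)
        zeros = zeros + 1 if byte == 0x00 else 0
    return bytes(out)


escaped = bytes([0x65, 0x00, 0x00, 0x03, 0x01, 0x00, 0x00, 0x03, 0x02])
print(remove_emulation_prevention_bytes(escaped).hex())
```

The per-byte branch in this loop is the kind of overhead that motivates handling EPB removal inside the bitstream access instruction itself.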
QuickProbs—A Fast Multiple Sequence Alignment Algorithm Designed for Graphics Processors
Gudyś, Adam; Deorowicz, Sebastian
2014-01-01
Multiple sequence alignment is a crucial task in a number of biological analyses like secondary structure prediction, domain searching, phylogeny, etc. MSAProbs is currently the most accurate alignment algorithm, but its effectiveness is obtained at the expense of computational time. In the paper we present QuickProbs, a variant of MSAProbs customised for graphics processors. We selected the two most time consuming stages of MSAProbs to be redesigned for GPU execution: the posterior matrices calculation and the consistency transformation. Experiments on three popular benchmarks (BAliBASE, PREFAB, OXBench-X) on a quad-core PC equipped with a high-end graphics card show QuickProbs to be 5.7 to 9.7 times faster than the original CPU-parallel MSAProbs. Additional tests performed on several protein families from the Pfam database give an overall speed-up of 6.7. Compared to other algorithms like MAFFT, MUSCLE, or ClustalW, QuickProbs proved to be much more accurate at similar speed. Additionally we introduce a tuned variant of QuickProbs which is significantly more accurate on sets of distantly related sequences than MSAProbs without exceeding its computation time. The GPU part of QuickProbs was implemented in OpenCL, thus the package is suitable for graphics processors produced by all major vendors. PMID:24586435
Design of the Protocol Processor for the ROBUS-2 Communication System
NASA Technical Reports Server (NTRS)
Torres-Pomales, Wilfredo; Malekpour, Mahyar R.; Miner, Paul S.
2005-01-01
The ROBUS-2 Protocol Processor (RPP) is a custom-designed hardware component implementing the functionality of the ROBUS-2 fault-tolerant communication system. The Reliable Optical Bus (ROBUS) is the core communication system of the Scalable Processor-Independent Design for Enhanced Reliability (SPIDER), a general-purpose fault tolerant integrated modular architecture currently under development at NASA Langley Research Center. ROBUS is a time-division multiple access (TDMA) broadcast communication system with medium access control by means of time-indexed communication schedule. ROBUS-2 is a developmental version of the ROBUS providing guaranteed fault-tolerant services to the attached processing elements (PEs), in the presence of a bounded number of faults. These services include message broadcast (Byzantine Agreement), dynamic communication schedule update, time reference (clock synchronization), and distributed diagnosis (group membership). ROBUS also features fault-tolerant startup and restart capabilities. ROBUS-2 tolerates internal as well as PE faults, and incorporates a dynamic self-reconfiguration capability driven by the internal diagnostic system. ROBUS consists of RPPs connected to each other by a lower-level physical communication network. The RPP has a pipelined architecture and the design is parameterized in the behavioral and structural domains. The design of the RPP enables the bus to achieve a PE-message throughput that approaches the available bandwidth at the physical layer.
NASA Technical Reports Server (NTRS)
Chow, Edward T.; Schatzel, Donald V.; Whitaker, William D.; Sterling, Thomas
2008-01-01
A Spaceborne Processor Array in Multifunctional Structure (SPAMS) can lower the total mass of the electronic and structural overhead of spacecraft, resulting in reduced launch costs, while increasing the science return through dynamic onboard computing. SPAMS integrates the multifunctional structure (MFS) and the Gilgamesh Memory, Intelligence, and Network Device (MIND) multi-core in-memory computer architecture into a single-system super-architecture. This transforms every inch of a spacecraft into a sharable, interconnected, smart computing element to increase computing performance while simultaneously reducing mass. The MIND in-memory architecture provides a foundation for high-performance, low-power, and fault-tolerant computing. The MIND chip has an internal structure that includes memory, processing, and communication functionality. The Gilgamesh is a scalable system comprising multiple MIND chips interconnected to operate as a single, tightly coupled, parallel computer. The array of MIND components shares a global, virtual name space for program variables and tasks that are allocated at run time to the distributed physical memory and processing resources. Individual processor-memory nodes can be activated or powered down at run time to provide active power management and to configure around faults. A SPAMS system is comprised of a distributed Gilgamesh array built into MFS, interfaces into instrument and communication subsystems, a mass storage interface, and a radiation-hardened flight computer.
SU-E-T-25: Real Time Simulator for Designing Electron Dual Scattering Foil Systems.
Carver, R; Hogstrom, K; Price, M; Leblanc, J; Harris, G
2012-06-01
To create a user friendly, accurate, real time computer simulator to facilitate the design of dual foil scattering systems for electron beams on radiotherapy accelerators. The simulator should allow for a relatively quick, initial design that can be refined and verified with subsequent Monte Carlo (MC) calculations and measurements. The simulator consists of an analytical algorithm for calculating electron fluence and a graphical user interface (GUI) C++ program. The algorithm predicts electron fluence using Fermi-Eyges multiple Coulomb scattering theory with a refined Moliere formalism for scattering powers. The simulator also estimates central-axis x-ray dose contamination from the dual foil system. Once the geometry of the beamline is specified, the simulator allows the user to continuously vary primary scattering foil material and thickness, secondary scattering foil material and Gaussian shape (thickness and sigma), and beam energy. The beam profile and x-ray contamination are displayed in real time. The simulator was tuned by comparison of off-axis electron fluence profiles with those calculated using EGSnrc MC. Over the energy range 7-20 MeV and using present foils on the Elekta radiotherapy accelerator, the simulator profiles agreed to within 2% of MC profiles from within 20 cm of the central axis. The x-ray contamination predictions matched measured data to within 0.6%. The calculation time was approximately 100 ms using a single processor, which allows for real-time variation of foil parameters using sliding bars. A real time dual scattering foil system simulator has been developed. The tool has been useful in a project to redesign an electron dual scattering foil system for one of our radiotherapy accelerators. The simulator has also been useful as an instructional tool for our medical physics graduate students. © 2012 American Association of Physicists in Medicine.
Impact of Fluidic Chevrons on Supersonic Jet Noise
NASA Technical Reports Server (NTRS)
Henderson, Brenda; Norum, Thomas
2007-01-01
The impact of fluidic chevrons on broadband shock noise and mixing noise for single stream and coannular jets was investigated. Air was injected into the core flow of a bypass ratio 5 nozzle system using a core fluidic chevron nozzle. For the single stream experiments, the fan stream was operated at the wind tunnel conditions and the core stream was operated at supersonic speeds. For the dual stream experiments, the fan stream was operated at supersonic speeds and the core stream was varied between subsonic and supersonic conditions. For the single stream jet at nozzle pressure ratio (NPR) below 2.0, increasing the injection pressure of the fluidic chevron increased high frequency noise at observation angles upstream of the nozzle exit and decreased mixing noise near the peak jet noise angle. When the NPR increased to a point where broadband shock noise dominated the acoustic spectra at upstream observation angles, the fluidic chevrons significantly decreased this noise. For dual stream jets, the fluidic chevrons reduced broadband shock noise levels when the fan NPR was below 2.3, but had little or no impact on shock noise with further increases in fan pressure. For all fan stream conditions investigated, the fluidic chevron became more effective at reducing mixing noise near the peak jet noise angle as the core pressure increased.
Lin, Qianglu; Makarov, Nikolay S.; Koh, Weon-kyu; ...
2014-11-26
The unique optical properties exhibited by visible emitting core/shell quantum dots with especially thick shells are the focus of widespread study, but have yet to be realized in infrared (IR)-active nanostructures. We apply an effective-mass model to identify PbSe/CdSe core/shell quantum dots as a promising system for achieving this goal. We then synthesize colloidal PbSe/CdSe quantum dots with shell thicknesses of up to 4 nm that exhibit unusually slow hole intra-band relaxation from shell to core states, as evidenced by the emergence of dual emission, i.e., IR photoluminescence from the PbSe core observed simultaneously with visible emission from the CdSe shell. In addition to the large shell thickness, the development of slowed intraband relaxation is facilitated by the existence of a sharp core-shell interface without discernible alloying. Growth of thick shells without interfacial alloying or incidental formation of homogenous CdSe nanocrystals was accomplished using insights attained via a systematic study of the dynamics of the cation-exchange synthesis of both PbSe/CdSe as well as the related system PbS/CdS. Finally, we show that the efficiency of the visible photoluminescence can be greatly enhanced by inorganic passivation.
NASA Astrophysics Data System (ADS)
Sou, In Mei; Calantoni, Joseph; Reed, Allen; Furukawa, Yoko
2012-11-01
A synchronized dual stereo particle image velocimetry (PIV) measurement technique is used to examine the erosion process of a cohesive sediment core in the Small Oscillatory Flow Tunnel (S-OFT) in the Sediment Dynamics Laboratory at the Naval Research Laboratory, Stennis Space Center, MS. The dual stereo PIV windows were positioned on either side of a sediment core inserted along the centerline of the S-OFT allowing for a total measurement window of about 20 cm long by 10 cm high with sub-millimeter spacing on resolved velocity vectors. The period of oscillation ranged from 2.86 to 6.12 seconds with constant semi-excursion amplitude in the test section of 9 cm. During the erosion process, Kelvin-Helmholtz instabilities were observed as the flow accelerated in each direction and eventually were broken down when the flow reversed. The relative concentration of suspended sediments under different flow conditions was estimated using the intensity of light scattered from the sediment particles in suspension. By subtracting the initial light scattered from the core, the residual light intensity was assumed to be scattered from suspended sediments eroded from the core. Results from two different sediment core samples of mud and sand mixtures will be presented.
Solvation and Evolution Dynamics of an Excess Electron in Supercritical CO2
NASA Astrophysics Data System (ADS)
Wang, Zhiping; Liu, Jinxiang; Zhang, Meng; Cukier, Robert I.; Bu, Yuxiang
2012-05-01
We present an ab initio molecular dynamics simulation of the dynamics of an excess electron solvated in supercritical CO2. The excess electron can exist in three types of states: CO2-core localized, dual-core localized, and diffuse states. All these states undergo continuous state conversions via a combination of long lasting breathing oscillations and core switching, as also characterized by highly cooperative oscillations of the excess electron volume and vertical detachment energy. All of these oscillations exhibit a strong correlation with the electron-impacted bending vibration of the core CO2, and the core-switching is controlled by thermal fluctuations.
The use of imprecise processing to improve accuracy in weather & climate prediction
NASA Astrophysics Data System (ADS)
Düben, Peter D.; McNamara, Hugh; Palmer, T. N.
2014-08-01
The use of stochastic processing hardware and low precision arithmetic in atmospheric models is investigated. Stochastic processors allow hardware-induced faults in calculations, sacrificing bit-reproducibility and precision in exchange for improvements in performance and potentially accuracy of forecasts, due to a reduction in power consumption that could allow higher resolution. A similar trade-off is achieved using low precision arithmetic, with improvements in computation and communication speed and savings in storage and memory requirements. As high-performance computing becomes more massively parallel and power intensive, these two approaches may be important stepping stones in the pursuit of global cloud-resolving atmospheric modelling. The impact of both hardware induced faults and low precision arithmetic is tested using the Lorenz '96 model and the dynamical core of a global atmosphere model. In the Lorenz '96 model there is a natural scale separation; the spectral discretisation used in the dynamical core also allows large and small scale dynamics to be treated separately within the code. Such scale separation allows the impact of lower-accuracy arithmetic to be restricted to components close to the truncation scales and hence close to the necessarily inexact parametrised representations of unresolved processes. By contrast, the larger scales are calculated using high precision deterministic arithmetic. Hardware faults from stochastic processors are emulated using a bit-flip model with different fault rates. Our simulations show that both approaches to inexact calculations do not substantially affect the large scale behaviour, provided they are restricted to act only on smaller scales. By contrast, results from the Lorenz '96 simulations are superior when small scales are calculated on an emulated stochastic processor than when those small scales are parametrised. 
This suggests that inexact calculations at the small scale could reduce computation and power costs without adversely affecting the quality of the simulations. This would allow higher resolution models to be run at the same computational cost.
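A bit-flip fault model of the kind mentioned above can be emulated in software along the following lines. This is a minimal sketch, not the authors' emulator: restricting faults to the mantissa of an IEEE 754 double, as well as the function names and the per-operation fault rate, are assumptions made here for illustration.

```python
import random
import struct


def flip_bit(value: float, bit: int) -> float:
    """Return value with one bit of its IEEE 754 double representation inverted."""
    if not 0 <= bit < 64:
        raise ValueError("bit must be in range 0..63")
    (as_int,) = struct.unpack("<Q", struct.pack("<d", value))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", as_int ^ (1 << bit)))
    return flipped


def faulty(value: float, fault_rate: float, rng: random.Random) -> float:
    """With probability fault_rate, flip one randomly chosen mantissa bit.

    Bits 0..51 of a double hold the mantissa; limiting faults to them is an
    assumption of this sketch, not a property of the model in the paper.
    """
    if rng.random() < fault_rate:
        return flip_bit(value, rng.randrange(52))
    return value


rng = random.Random(0)
samples = [faulty(1.0, 0.1, rng) for _ in range(10)]
print(samples)
```

Wrapping only the small-scale terms of a model in a function like `faulty` would mirror the scale-selective use of inexact hardware described in the abstract.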
Yu, Huanan; Xu, Dongdong; Xu, Qun
2015-08-28
A hierarchical meso- and microporous metal-organic framework (MOF) was facilely fabricated in an ionic liquid (IL)/supercritical CO2 (SC CO2)/surfactant emulsion system. Notably, CO2 exerts a dual effect during the synthesis; that is, CO2 droplets act as a template for the cores of nanospheres while CO2-swollen micelles induce mesopores on nanospheres.
Siloxane nanoprobes for labeling and dual modality imaging of neural stem cells
Addington, Caroline P.; Cusick, Alex; Shankar, Rohini Vidya; Agarwal, Shubhangi; Stabenfeldt, Sarah E.; Kodibagkar, Vikram D.
2015-01-01
Cell therapy represents a promising therapeutic for a myriad of medical conditions, including cancer, traumatic brain injury, and cardiovascular disease among others. A thorough understanding of the efficacy and cellular dynamics of these therapies necessitates the ability to non-invasively track cells in vivo. Magnetic resonance imaging (MRI) provides a platform to track cells as a non-invasive modality with superior resolution and soft tissue contrast. We recently reported a new nanoprobe platform for cell labeling and imaging using fluorophore doped siloxane core nanoemulsions as dual modality (1H MRI/Fluorescence), dual-functional (oximetry/detection) nanoprobes. Here, we successfully demonstrate the labeling, dual-modality imaging, and oximetry of neural progenitor/stem cells (NPSCs) in vitro using this platform. Labeling at a concentration of 10 μl/10^4 cells with a 40% v/v polydimethylsiloxane core nanoemulsion, doped with rhodamine, had minimal effect on viability, no effect on migration, proliferation and differentiation of NPSCs and allowed for unambiguous visualization of labeled NPSCs by 1H MR and fluorescence and local pO2 reporting by labeled NPSCs. This new approach for cell labeling with a positive contrast 1H MR probe has the potential to improve mechanistic knowledge of current therapies, and guide the design of future cell therapies due to its clinical translatability. PMID:26597417
TanDEM-X calibrated Raw DEM generation
NASA Astrophysics Data System (ADS)
Rossi, Cristian; Rodriguez Gonzalez, Fernando; Fritz, Thomas; Yague-Martinez, Nestor; Eineder, Michael
2012-09-01
The TanDEM-X mission successfully started on June 21st 2010 with the launch of the German radar satellite TDX, placed in orbit in close formation with the TerraSAR-X (TSX) satellite, and establishing the first spaceborne bistatic interferometer. The processing of SAR raw data to the Raw DEM is performed by one single processor, the Integrated TanDEM-X Processor (ITP). The quality of the Raw DEM is a fundamental parameter for the mission planning. In this paper, a novel quality indicator is derived. It is based on the comparison of the interferometric measure, the unwrapped phase, and the stereo-radargrammetric measure, the geometrical shifts computed in the coregistration stage. By stating the accuracy of the unwrapped phase, it constitutes a useful parameter for the determination of problematic scenes, which will be resubmitted to the dual baseline phase unwrapping processing chain for the mitigation of phase unwrapping errors. The stereo-radargrammetric measure is also operationally used for the Raw DEM absolute calibration through an accurate estimation of the absolute phase offset. This paper examines the interferometric algorithms implemented for the operational TanDEM-X Raw DEM generation, focusing particularly on its quality assessment and its calibration.
Novel memory architecture for video signal processor
NASA Astrophysics Data System (ADS)
Hung, Jen-Sheng; Lin, Chia-Hsing; Jen, Chein-Wei
1993-11-01
An on-chip memory architecture for a video signal processor (VSP) is proposed. This memory structure is a two-level design matched to the different kinds of data locality in video applications. The upper level, Memory A, provides enough storage capacity to reduce the impact of limited chip I/O bandwidth, and the lower level, Memory B, provides enough data parallelism and flexibility to meet the requirements of multiple reconfigurable pipeline function units in a single VSP chip. The required memory size is determined by a memory usage analysis of video algorithms and by the number of function units. Both levels of memory adopt a dual-port memory scheme to sustain simultaneous read and write operations. In particular, Memory B uses multiple one-read-one-write memory banks to emulate a true multiport memory. Therefore, one can change the configuration of Memory B into several sets of memories with variable numbers of read/write ports by adjusting the bus switches. The numbers of read ports and write ports in the proposed memory can then meet the requirements of the data flow patterns in different video coding algorithms. We have finished the design of a prototype memory using 1.2-micrometer SPDM SRAM technology and will fabricate it through TSMC in Taiwan.
Spacecraft On-Board Information Extraction Computer (SOBIEC)
NASA Technical Reports Server (NTRS)
Eisenman, David; Decaro, Robert E.; Jurasek, David W.
1994-01-01
The Jet Propulsion Laboratory is the Technical Monitor on an SBIR Program issued for Irvine Sensors Corporation to develop a highly compact, dual-use massively parallel processing node known as SOBIEC. SOBIEC couples 3D memory stacking technology provided by nCUBE. The node contains sufficient network Input/Output to implement up to an order-13 binary hypercube. The benefit of this network is that it scales linearly as more processors are added, and it is a superset of other commonly used interconnect topologies such as meshes, rings, toroids, and trees. In this manner, a distributed processing network can be easily devised and supported. The SOBIEC node has sufficient memory for most multi-computer applications, and also supports external memory expansion and DMA interfaces. The SOBIEC node is supported by a mature set of software development tools from nCUBE. The nCUBE operating system (OS) provides configuration and operational support for up to 8000 SOBIEC processors in an order-13 binary hypercube or any subset or partition(s) thereof. The OS is UNIX (USL SVR4) compatible, with C, C++, and FORTRAN compilers readily available. A stand-alone development system is also available to support SOBIEC test and integration.
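The hypercube addressing that this abstract relies on can be sketched in a few lines. The following is an illustrative sketch of the general binary-hypercube topology, not of the SOBIEC or nCUBE software; the function names are hypothetical. In an order-d hypercube each node has a d-bit id and is linked to the d nodes whose ids differ in exactly one bit, so an order-13 network has 2^13 = 8192 nodes, which is consistent with the "up to 8000 processors" quoted above.

```python
def hypercube_neighbors(node: int, order: int) -> list[int]:
    """Return the nodes directly linked to `node` in an order-`order` binary hypercube."""
    if not 0 <= node < (1 << order):
        raise ValueError("node id out of range for this order")
    # Flipping bit k of the node id gives the neighbor along dimension k.
    return [node ^ (1 << k) for k in range(order)]


def hop_distance(a: int, b: int) -> int:
    """Minimum number of hops between two nodes: the Hamming distance of their ids."""
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    order = 13
    print(1 << order)                          # number of nodes in the network
    print(hypercube_neighbors(0, 3))           # neighbors of node 0 in an order-3 cube
    print(hop_distance(0, (1 << order) - 1))   # worst-case hop count equals the order
```

Because the worst-case hop count equals the order, the network diameter grows only logarithmically with the number of nodes.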
Characterizing and Optimizing the Performance of the MAESTRO 49-Core Processor
2014-03-27
process large volumes of data, it is necessary during testing to vary the dimensions of the inbound data matrix to determine what effect this has on the...needed that can process the extra data these systems seek to collect. However, the space environment presents a number of threats, such as ambient or...induced faults, and that also have sufficient computational power to handle the large flow of data they encounter. This research investigates one
European Scientific Notes. Volume 35, Number 7,
1981-07-31
simulated the entire processor down cores, semiconductor PROMs, etc. pack- to gate level on a PDP-11/45 computer, aged on FUROCARDS can be interfaced...approaching retirement were used to generate internal heat age , but DERMO will undoubtedly con- when irradiated. It was found that tinue to be France’s leading...import- parameters , such a doublet will focus ance. it plays an important role not a bundle of rays incident parallel only in mapping and defining the
NASA Astrophysics Data System (ADS)
Eickhoff, Jens; Cook, Barry; Walker, Paul; Habinc, Sadi; Witt, Rouven; Roser, Hans-Peter
2011-08-01
As already published in another paper at DASIA 2010 in Budapest [1], the University of Stuttgart, Germany, is developing an advanced 3-axis stabilized small satellite applying industry standards for command/control techniques, onboard software design and onboard computer components. The satellite has a launch mass of approx. 120 kg and is foreseen to be launched end 2013 as piggy back payload on an Indian PSLV launcher. During phase C the main challenge was the conceptual design for an ultra compact and performant onboard computer (OBC), which is able to support an industry standard operating system, a PUS standard based onboard software (OBSW) and CCSDS standard based ground/space communication. The developed architecture is based on 4 main elements (see [1] and Figure 4):
• the OBC core board (single board computer based on LEON3 FT architecture),
• an I/O Board for all OBC digital interfaces to S/C equipment,
• a CCSDS TC/TM pre-processor board,
• CPDU being embedded in the PCDU.
The EM for the OBC core meanwhile has been shipped to the University by the supplier Aeroflex Colorado Springs, USA and is in use in Stuttgart since January 2011. Figure 2 and Figure 3 provide brief impressions. This paper concentrates on the common design of the I/O board and the CCSDS processor boards.
Kalman Filter Tracking on Parallel Architectures
NASA Astrophysics Data System (ADS)
Cerati, Giuseppe; Elmer, Peter; Krutelyov, Slava; Lantz, Steven; Lefebvre, Matthieu; McDermott, Kevin; Riley, Daniel; Tadel, Matevž; Wittich, Peter; Würthwein, Frank; Yagil, Avi
2016-11-01
Power density constraints are limiting the performance improvements of modern CPUs. To address this we have seen the introduction of lower-power, multi-core processors such as GPGPU, ARM and Intel MIC. In order to achieve the theoretical performance gains of these processors, it will be necessary to parallelize algorithms to exploit larger numbers of lightweight cores and specialized functions like large vector units. Track finding and fitting is one of the most computationally challenging problems for event reconstruction in particle physics. At the High-Luminosity Large Hadron Collider (HL-LHC), for example, this will be by far the dominant problem. The need for greater parallelism has driven investigations of very different track finding techniques such as Cellular Automata or Hough Transforms. The most common track finding techniques in use today, however, are those based on a Kalman filter approach. Significant experience has been accumulated with these techniques on real tracking detector systems, both in the trigger and offline. They are known to provide high physics performance, are robust, and are in use today at the LHC. Given the utility of the Kalman filter in track finding, we have begun to port these algorithms to parallel architectures, namely Intel Xeon and Xeon Phi. We report here on our progress towards an end-to-end track reconstruction algorithm fully exploiting vectorization and parallelization techniques in a simplified experimental environment.
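For readers unfamiliar with the Kalman filter named in this abstract, the following is a minimal sketch of one predict/update cycle for a scalar state. It is an illustration only, not the authors' code: real track fitting uses a multi-component track state with matrix-valued covariances, and the parameter names (`f`, `q`, `h`, `r`) are notation assumed here.

```python
def kalman_step(x, p, z, f=1.0, q=0.0, h=1.0, r=1.0):
    """One predict/update cycle of a scalar Kalman filter.

    x, p : prior state estimate and its variance
    z    : new measurement
    f, q : state transition factor and process-noise variance
    h, r : measurement factor and measurement-noise variance
    """
    # Predict: propagate the state and its uncertainty to the next step.
    x_pred = f * x
    p_pred = f * p * f + q
    # Update: weight the measurement residual by the Kalman gain.
    k = p_pred * h / (h * p_pred * h + r)
    x_new = x_pred + k * (z - h * x_pred)
    p_new = (1.0 - k * h) * p_pred
    return x_new, p_new


if __name__ == "__main__":
    x, p = 0.0, 1.0
    for z in [2.0, 2.1, 1.9]:   # hypothetical measurements
        x, p = kalman_step(x, p, z)
    print(x, p)
```

Each step is a short, fixed sequence of arithmetic, which is why the same update can in principle be applied to many track candidates at once in vector lanes or parallel threads, the direction the abstract describes.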
Mobile high-performance computing (HPC) for synthetic aperture radar signal processing
NASA Astrophysics Data System (ADS)
Misko, Joshua; Kim, Youngsoo; Qi, Chenchen; Sirkeci, Birsen
2018-04-01
The importance of mobile high-performance computing has emerged in numerous battlespace applications at the tactical edge in hostile environments. Energy efficient computing power is a key enabler for diverse areas ranging from real-time big data analytics and atmospheric science to network science. However, the design of tactical mobile data centers is dominated by power, thermal, and physical constraints. Presently, it is very unlikely that the required computing power can be achieved by aggregating emerging heterogeneous many-core processing platforms consisting of CPU, Field Programmable Gate Arrays and Graphic Processor cores constrained by power and performance. To address these challenges, we performed a Synthetic Aperture Radar case study for Automatic Target Recognition (ATR) using Deep Neural Networks (DNNs). However, these DNN models are typically trained using GPUs with gigabytes of external memory and make heavy use of 32-bit floating point operations. As a result, DNNs do not run efficiently on hardware appropriate for low power or mobile applications. To address this limitation, we proposed a framework for compressing DNN models for ATR, suited to deployment on resource constrained hardware. This proposed compression framework utilizes promising DNN compression techniques including pruning and weight quantization while also focusing on processor features common to modern low-power devices. Following this methodology as a guideline produced a DNN for ATR tuned to maximize classification throughput, minimize power consumption, and minimize memory footprint on a low-power device.
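The two compression techniques named in this abstract, pruning and weight quantization, can be illustrated on a flat list of weights. This is a generic sketch under stated assumptions, not the authors' framework: it assumes magnitude-based pruning and symmetric 8-bit quantization with a single scale factor, and all function names are hypothetical.

```python
def prune_by_magnitude(weights, sparsity):
    """Zero out the smallest-magnitude fraction `sparsity` of the weights."""
    n_prune = int(len(weights) * sparsity)
    if n_prune == 0:
        return list(weights)
    # Indices of the n_prune weights with the smallest magnitudes.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(order[:n_prune])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]


def quantize_int8(weights):
    """Symmetric uniform quantization to signed 8-bit integers plus one scale."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate floating-point weights from the integer codes."""
    return [v * scale for v in q]


if __name__ == "__main__":
    w = [0.1, -0.5, 0.02, 1.27]          # hypothetical weights
    pruned = prune_by_magnitude(w, 0.5)  # drop the smaller half
    q, scale = quantize_int8(pruned)
    print(pruned, q, dequantize(q, scale))
```

Storing 8-bit codes instead of 32-bit floats cuts weight storage to a quarter, and the zeros introduced by pruning can be skipped or stored sparsely, which is the kind of memory-footprint reduction the abstract targets.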
NASA Astrophysics Data System (ADS)
Genovese, Mariangela; Napoli, Ettore
2013-05-01
The identification of moving objects is a fundamental step in computer vision processing chains. The development of low cost and lightweight smart cameras steadily increases the demand for efficient and high performance circuits able to process high definition video in real time. The paper proposes two processor cores designed to perform real-time background identification on High Definition (HD, 1920 × 1080 pixel) video streams. The implemented algorithm is the OpenCV version of the Gaussian Mixture Model (GMM), a high-performance probabilistic algorithm for the segmentation of the background that is, however, computationally intensive and impossible to implement on a general-purpose CPU under the constraint of real-time processing. In the proposed paper, the equations of the OpenCV GMM algorithm are optimized in such a way that a lightweight and low power implementation of the algorithm is obtained. The reported performances are also the result of the use of state of the art truncated binary multipliers and ROM compression techniques for the implementation of the non-linear functions. The first circuit has commercial FPGA devices as a target and provides speed and logic resource occupation that improve on previously proposed implementations. The second circuit is oriented to an ASIC (UMC-90nm) standard cell implementation. Both implementations are able to process more than 60 frames per second in 1080p format, a frame rate compatible with HD television.
Development of sustained and dual drug release co-extrusion formulations for individual dosing.
Laukamp, Eva Julia; Vynckier, An-Katrien; Voorspoels, Jody; Thommes, Markus; Breitkreutz, Joerg
2015-01-01
In personalized medicine and patient-centered medical treatment individual dosing of medicines is crucial. The Solid Dosage Pen (SDP) allows for an individual dosing of solid drug carriers by cutting them into tablet-like slices. The aim of the present study was the development of sustained release and dual release formulations with carbamazepine (CBZ) via hot-melt co-extrusion for the use in the SDP. The selection of appropriate coat- and core-formulations was performed by adapting the mechanical properties (like tensile strength and E-modulus) for example. By using different excipients (polyethyleneglycols, poloxamers, white wax, stearic acid, and carnauba wax) and drug loadings (30-50%) tailored dissolution kinetics was achieved showing cube root or zero order release mechanisms. Besides a biphasic drug release, the dose-dependent dissolution characteristics of sustained release formulations were minimized by a co-extruded wax-coated formulation. The dissolution profiles of the co-extrudates were confirmed during short term stability study (six months at 21.0 ± 0.2 °C, 45%r.h.). Due to a good layer adhesion of core and coat and adequate mechanical properties (maximum cutting force of 35.8 ± 2.0 N and 26.4 ± 2.8 N and E-modulus of 118.1 ± 8.4 and 33.9 ± 4.5 MPa for the dual drug release and the wax-coated co-extrudates, respectively) cutting off doses via the SDP was precise. While differences of the process parameters (like the barrel temperature) between the core- and the coat-layer resulted in unsatisfying content uniformities for the wax-coated co-extrudates, the content uniformity of the dual drug release co-extrudates was found to be in compliance with pharmacopoeial specification. Copyright © 2015 Elsevier B.V. All rights reserved.
Cross talk analysis in multicore optical fibers by supermode theory.
Szostkiewicz, Lukasz; Napierala, Marek; Ziolowicz, Anna; Pytel, Anna; Tenderenda, Tadeusz; Nasilowski, Tomasz
2016-08-15
We discuss the theoretical aspects of core-to-core power transfer in multicore fibers relying on supermode theory. Based on a dual core fiber model, we investigate the consequences of this approach, such as the influence of initial excitation conditions on cross talk. Supermode interpretation of power coupling proves to be intuitive and thus may lead to new concepts of multicore fiber-based devices. As a conclusion, we propose a definition of a uniform cross talk parameter that describes multicore fiber design.
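For orientation, the textbook two-core case behind this description can be written out. This is the standard result for two identical, weakly coupled cores, not a formula quoted from the paper; the symbols (core modes \phi_{1,2}, even and odd supermodes \psi_{e,o} with propagation constants \beta_{e,o}, and excitation amplitudes a_{e,o}) are notation assumed here.

```latex
\begin{align}
E(z) &= a_e\,\psi_e\,e^{-i\beta_e z} + a_o\,\psi_o\,e^{-i\beta_o z},
  & \psi_{e,o} &= \tfrac{1}{\sqrt{2}}\,(\phi_1 \pm \phi_2), \\
\intertext{Exciting core 1 only corresponds to $a_e = a_o = 1/\sqrt{2}$, which gives}
P_1(z) &= \cos^2\!\left(\frac{(\beta_e - \beta_o)\,z}{2}\right),
  & P_2(z) &= \sin^2\!\left(\frac{(\beta_e - \beta_o)\,z}{2}\right), \\
L_c &= \frac{\pi}{\beta_e - \beta_o}.
\end{align}
```

In this picture the power exchange is a beat between the two supermodes, with full transfer after the coupling length L_c. If only one supermode is excited (for example a_o = 0), both core powers stay constant at 1/2 and no beating occurs, which illustrates why the observed cross talk depends on the initial excitation conditions.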
GERICOS: A Generic Framework for the Development of On-Board Software
NASA Astrophysics Data System (ADS)
Plasson, P.; Cuomo, C.; Gabriel, G.; Gauthier, N.; Gueguen, L.; Malac-Allain, L.
2016-08-01
This paper presents an overview of the GERICOS framework (GEneRIC Onboard Software), its architecture, its various layers and its future evolutions. The GERICOS framework, developed and qualified by LESIA, offers a set of generic, reusable and customizable software components for the rapid development of payload flight software. The GERICOS framework has a layered structure. The first layer (GERICOS::CORE) implements the concept of active objects and forms an abstraction layer over the top of real-time kernels. The second layer (GERICOS::BLOCKS) offers a set of reusable software components for building flight software based on generic solutions to recurrent functionalities. The third layer (GERICOS::DRIVERS) implements software drivers for several COTS IP cores of the LEON processor ecosystem.
Portable LQCD Monte Carlo code using OpenACC
NASA Astrophysics Data System (ADS)
Bonati, Claudio; Calore, Enrico; Coscetti, Simone; D'Elia, Massimo; Mesiti, Michele; Negro, Francesco; Fabio Schifano, Sebastiano; Silvi, Giorgio; Tripiccione, Raffaele
2018-03-01
Varying from multi-core CPU processors to many-core GPUs, the present scenario of HPC architectures is extremely heterogeneous. In this context, code portability is increasingly important for easy maintainability of applications; this is relevant in scientific computing where code changes are numerous and frequent. In this talk we present the design and optimization of a state-of-the-art production level LQCD Monte Carlo application, using the OpenACC directives model. OpenACC aims to abstract parallel programming to a descriptive level, where programmers do not need to specify the mapping of the code on the target machine. We describe the OpenACC implementation and show that the same code is able to target different architectures, including state-of-the-art CPUs and GPUs.
LTE-Enhanced Cognitive Radio Network Testbed (LTE-CORNET)
2016-11-01
4 PERCENT_SUPPORTEDNAME FTE Equivalent: Total Number: Sub Contractors (DD882) Names of Personnel receiving masters degrees Names of personnel...Turbo, HT , 15M, 140W) Intel Core i7-3770 (3.4 GHz Quad Core, 77W) Dual Intel Xeon E5-2695 v4 (18C, 2.1GHz, 3.3GHz Turbo, 2400MHz, 45MB, 120W
Enhanced near-infrared photoacoustic imaging of silica-coated rare-earth doped nanoparticles.
Sheng, Yang; Liao, Lun-De; Bandla, Aishwarya; Liu, Yu-Hang; Yuan, Jun; Thakor, Nitish; Tan, Mei Chee
2017-01-01
Near-infrared photoacoustic (PA) imaging is an emerging diagnostic technology that utilizes the tissue transparent window to achieve improved contrast and spatial resolution for deep tissue imaging. In this study, we investigated the enhancement effect of the SiO2 shell on the PA property of our core/shell rare-earth nanoparticles (REs) consisting of an active rare-earth doped core of NaYF4:Yb,Er (REDNPs) and an undoped NaYF4 shell. We observed that the PA signal amplitude increased with SiO2 shell thickness. Although the SiO2 shell caused an observed decrease in the integrated fluorescence intensity due to the dilution effect, fluorescence quenching of the rare earth emitting ions within the REDNPs cores was successfully prevented by the undoped NaYF4 shell. Therefore, our multilayer structure consisting of an active core with successive functional layers was demonstrated to be an effective design for dual-modal fluorescence and PA imaging probes with improved PA property. The result from this work addresses a critical need for the development of dual-modal contrast agent that advances deep tissue imaging with high resolution and signal-to-noise ratio. Copyright © 2016 Elsevier B.V. All rights reserved.
Core/shell Fe3O4/Gd2O3 nanocubes as T1-T2 dual modal MRI contrast agents
NASA Astrophysics Data System (ADS)
Li, Fenfen; Zhi, Debo; Luo, Yufeng; Zhang, Jiqian; Nan, Xiang; Zhang, Yunjiao; Zhou, Wei; Qiu, Bensheng; Wen, Longping; Liang, Gaolin
2016-06-01
T1-T2 dual modal magnetic resonance imaging (MRI) has attracted considerable interest because it offers complementary diagnostic information, leading to more precise diagnosis. To date, a number of nanostructures have been reported as T1-T2 dual modal MR contrast agents (CAs). However, hybrids of nanocubes with both iron and gadolinium (Gd) elements as T1-T2 dual modal CAs have not been reported. Herein, we report the synthesis of novel core/shell Fe3O4/Gd2O3 nanocubes as T1-T2 dual-modal CAs and their application for enhanced T1-T2 MR imaging of rat livers. A relaxivity study at 1.5 T indicated that our Fe3O4/Gd2O3 nanocubes have an r1 value of 45.24 mM⁻¹ s⁻¹ and an r2 value of 186.51 mM⁻¹ s⁻¹, which were about two folds of those of Gd2O3 nanoparticles and Fe3O4 nanocubes, respectively. In vivo MR imaging of rats showed both T1-positive and T2-negative contrast enhancements in the livers. We envision that our Fe3O4/Gd2O3 nanocubes could be applied as T1-T2 dual modal MR CAs for a wide range of theranostic applications in the near future. Electronic supplementary information (ESI) available: Scheme S1, Fig. S1-S8, and Tables S1, S2. See DOI: 10.1039/c6nr02620f
NASA Astrophysics Data System (ADS)
Hasan, Md. Rabiul; Akter, Sanjida; Khatun, Tania; Rifat, Ahmmed A.; Anower, Md. Shamim
2017-04-01
A low-loss microstructure fiber is numerically investigated for convenient transmission of polarization maintaining terahertz (THz) waves. The dual-hole units (DHUs) are used inside the core of the kagome lattice microstructure to achieve high birefringence and low effective material loss (EML). It is demonstrated that by rotating the axis of orientation of the DHUs, it is possible to obtain low EML of 0.052 cm⁻¹, low confinement loss of 0.01 cm⁻¹, and high birefringence of 0.0354 at 0.85 THz. It is also reported that the transmission properties of the proposed microstructure fiber are varied with rotation angle, core diameter, and operating frequencies. Other guiding characteristics, such as single-mode propagation, power fraction, and dispersion, are also discussed thoroughly.
NASA Astrophysics Data System (ADS)
Anghel, Ion; Grumezescu, Alexandru Mihai
2013-01-01
Prosthetic medical device-associated infections are responsible for significant morbidity and mortality rates. Novel improved materials and surfaces exhibiting inappropriate conditions for microbial development are urgently required in the medical environment. This study reveals the benefit of using natural Mentha piperita essential oil, combined with a 5 nm core/shell nanosystem-improved surface exhibiting anti-adherence and antibiofilm properties. This strategy reveals a dual role of the nano-oil system; on one hand, inhibiting bacterial adherence and, on the other hand, exhibiting bactericidal effect, the core/shell nanosystem is acting as a controlled releasing machine for the essential oil. Our results demonstrate that this dual nanobiosystem is very efficient also for inhibiting biofilm formation, being a good candidate for the design of novel material surfaces used for prosthetic devices.
FPGA-Based, Self-Checking, Fault-Tolerant Computers
NASA Technical Reports Server (NTRS)
Some, Raphael; Rennels, David
2004-01-01
A proposed computer architecture would exploit the capabilities of commercially available field-programmable gate arrays (FPGAs) to enable computers to detect and recover from bit errors. The main purpose of the proposed architecture is to enable fault-tolerant computing in the presence of single-event upsets (SEUs). [An SEU is a spurious bit flip (also called a soft error) caused by a single impact of ionizing radiation.] The architecture would also enable recovery from some soft errors caused by electrical transients and, to some extent, from intermittent and permanent (hard) errors caused by aging of electronic components. A typical FPGA of the current generation contains one or more complete processor cores, memories, and high-speed serial input/output (I/O) channels, making it possible to shrink a board-level processor node to a single integrated-circuit chip. Custom, highly efficient microcontrollers, general-purpose computers, custom I/O processors, and signal processors can be rapidly and efficiently implemented by use of FPGAs. Unfortunately, FPGAs are susceptible to SEUs. Prior efforts to mitigate the effects of SEUs have yielded solutions that degrade performance of the system and require support from external hardware and software. In comparison with other fault-tolerant-computing architectures (e.g., triple modular redundancy), the proposed architecture could be implemented with less circuitry and lower power demand. Moreover, the fault-tolerant computing functions would require only minimal support from circuitry outside the central processing units (CPUs) of computers, would not require any software support, and would be largely transparent to software and to other computer hardware. There would be two types of modules: a self-checking processor module and a memory system (see figure). The self-checking processor module would be implemented on a single FPGA and would be capable of detecting its own internal errors. 
It would contain two CPUs executing identical programs in lock step, with comparison of their outputs to detect errors. It would also contain various cache local memory circuits, communication circuits, and configurable special-purpose processors that would use self-checking checkers. (The basic principle of the self-checking checker method is to utilize logic circuitry that generates error signals whenever there is an error in either the checker or the circuit being checked.) The memory system would comprise a main memory and a hardware-controlled check-pointing system (CPS) based on a buffer memory denoted the recovery cache. The main memory would contain random-access memory (RAM) chips and FPGAs that would, in addition to everything else, implement double-error-detecting and single-error-correcting memory functions to enable recovery from single-bit errors.
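The lock-step duplicate-and-compare idea described above can be mimicked in software. The following is a toy sketch only: the proposed design performs the comparison in FPGA hardware, whereas here two software replicas of a hypothetical `step` function are run on identical inputs, and the names `run_lockstep`, `accumulate`, and `fault_at` are assumptions introduced for illustration.

```python
class LockstepMismatch(Exception):
    """Raised when the two replicas disagree, i.e. an error was detected."""


def run_lockstep(step, state_a, state_b, inputs, fault_at=None):
    """Run two replicas of `step` on the same inputs and compare every output.

    `fault_at` optionally names an input index at which a bit flip is injected
    into replica B's output, to emulate a single-event upset.
    """
    outputs = []
    for i, value in enumerate(inputs):
        state_a, out_a = step(state_a, value)
        state_b, out_b = step(state_b, value)
        if i == fault_at:
            out_b ^= 1  # injected soft error: flip the lowest bit
        if out_a != out_b:
            raise LockstepMismatch(f"outputs differ at step {i}")
        outputs.append(out_a)
    return outputs


def accumulate(state, value):
    """Toy workload: running integer sum."""
    state += value
    return state, state


if __name__ == "__main__":
    print(run_lockstep(accumulate, 0, 0, [1, 2, 3]))
    try:
        run_lockstep(accumulate, 0, 0, [1, 2, 3], fault_at=1)
    except LockstepMismatch as err:
        print("detected:", err)
```

Comparison alone only detects an error; it cannot tell which replica is wrong. That is consistent with the abstract pairing the self-checking processor with a check-pointing memory system, so that execution can be rolled back and retried after a mismatch.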
An NLRA Transducer for Dual Use Bone Conduction Audio and Haptic Communication. Summary Report
2016-12-30
VIBRANT COMPOSITES INC. 1 A16-019 Phase 1 Summary Report Vibrant Composites Inc. December 30, 2016 I. ABSTRACT A combined transducer capable of bone ...transducer core capable of both precise haptic communication and high fidelity bone conduction audio. The transducer design leverages Micro-Multilayer...head-mounted system. In this Phase I SBIR, Vibrant Composites has delivered functional dual-mode bone conduction and vibrotactile transducer prototypes
Hierarchical fractional-step approximations and parallel kinetic Monte Carlo algorithms
DOE Office of Scientific and Technical Information (OSTI.GOV)
Arampatzis, Giorgos, E-mail: garab@math.uoc.gr; Katsoulakis, Markos A., E-mail: markos@math.umass.edu; Plechac, Petr, E-mail: plechac@math.udel.edu
2012-10-01
We present a mathematical framework for constructing and analyzing parallel algorithms for lattice kinetic Monte Carlo (KMC) simulations. The resulting algorithms have the capacity to simulate a wide range of spatio-temporal scales in spatially distributed, non-equilibrium physiochemical processes with complex chemistry and transport micro-mechanisms. Rather than focusing on constructing exactly the stochastic trajectories, our approach relies on approximating the evolution of observables, such as density, coverage, correlations and so on. More specifically, we develop a spatial domain decomposition of the Markov operator (generator) that describes the evolution of all observables according to the kinetic Monte Carlo algorithm. This domain decomposition corresponds to a decomposition of the Markov generator into a hierarchy of operators and can be tailored to specific hierarchical parallel architectures such as multi-core processors or clusters of Graphical Processing Units (GPUs). Based on this operator decomposition, we formulate parallel Fractional step kinetic Monte Carlo algorithms by employing the Trotter Theorem and its randomized variants; these schemes, (a) are partially asynchronous on each fractional step time-window, and (b) are characterized by their communication schedule between processors. The proposed mathematical framework allows us to rigorously justify the numerical and statistical consistency of the proposed algorithms, showing the convergence of our approximating schemes to the original serial KMC. The approach also provides a systematic evaluation of different processor communicating schedules. We carry out a detailed benchmarking of the parallel KMC schemes using available exact solutions, for example, in Ising-type systems and we demonstrate the capabilities of the method to simulate complex spatially distributed reactions at very large scales on GPUs. 
Finally, we discuss work load balancing between processors and propose a re-balancing scheme based on probabilistic mass transport methods.
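The operator-splitting idea invoked in this abstract can be stated compactly. The following is the standard form of the Trotter product formula and of the Lie and Strang splittings, written formally for a generator split into two parts; the two-part split and the symbols \mathcal{L}_1, \mathcal{L}_2, \Delta t are notation assumed here, and the paper's own hierarchical decomposition is not reproduced.

```latex
\begin{align}
\mathcal{L} &= \mathcal{L}_1 + \mathcal{L}_2, \qquad
e^{t\mathcal{L}} = \lim_{n\to\infty}
  \left( e^{\frac{t}{n}\mathcal{L}_1}\, e^{\frac{t}{n}\mathcal{L}_2} \right)^{n}
  && \text{(Trotter product formula)} \\
e^{\Delta t\,\mathcal{L}} &= e^{\Delta t\,\mathcal{L}_1}\, e^{\Delta t\,\mathcal{L}_2}
  + O(\Delta t^{2})
  && \text{(Lie splitting, one step)} \\
e^{\Delta t\,\mathcal{L}} &= e^{\frac{\Delta t}{2}\mathcal{L}_1}\, e^{\Delta t\,\mathcal{L}_2}\,
  e^{\frac{\Delta t}{2}\mathcal{L}_1} + O(\Delta t^{3})
  && \text{(Strang splitting, one step)}
\end{align}
```

One way to read the parallel scheme is that each factor acts on a set of subdomains that do not interact during that fractional step, so the subdomains can be simulated independently on separate processors for a window of length \Delta t, with communication only between fractional steps.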
Compatibility between dental adhesive systems and dual-polymerizing composite resins.
Michaud, Pierre-Luc; MacKenzie, Alexandra
2016-10-01
Information is lacking about incompatibilities between certain types of adhesive systems and dual-polymerizing composite resins, and universal adhesives have yet to be tested with these resins. The purpose of this in vitro study was to investigate the bonding outcome of dual-polymerizing foundation composite resins by using different categories of adhesive solutions and to determine whether incompatibilities were present. One hundred and eighty caries-free, extracted third molar teeth were allocated to 9 groups (n=20), in which 3 different bonding agents (Single Bond Plus [SB], Scotchbond Multi-purpose [MP], and Scotchbond Universal [SU]) were used to bond 3 different composite resins (CompCore AF [CC], Core Paste XP [CP], and Filtek Supreme Ultra [FS]). After restorations had been fabricated using an Ultradent device, the specimens were stored in water at 37°C for 24 hours. The specimens were tested under shear force at a rate of 0.5 mm/min. The data were analyzed with Kruskal-Wallis tests and post hoc pairwise comparisons (α=.05). All 3 composite resins produced comparable shear bond strengths when used with MP (P=.076). However, when either SB or SU was used, the light-polymerized composite resin (FS) and 1 dual-polymerized foundation composite resin (CC) bonded significantly better than the other dual-polymerized foundation composite resin (CP) (P<.005). Both FS and CC performed best with SU but had acceptable results with all of the bonding agents. CP only performed acceptably with MP (P=.023) and had poor results with both other agents. Dual-polymerizing composite resins can obtain bond strengths as good as those of light-polymerizing alternatives. However, not all dual-polymerizing composite resins perform well with all bonding systems; some incompatibilities exist between different products. Copyright © 2016 Editorial Council for the Journal of Prosthetic Dentistry. Published by Elsevier Inc. All rights reserved.