Sample records for simultaneous multithreading processor

  1. Multithreading in vector processors

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Evangelinos, Constantinos; Kim, Changhoan; Nair, Ravi

    In one embodiment, a system includes a processor having a vector processing mode and a multithreading mode. The processor is configured to operate on one thread per cycle in the multithreading mode. The processor includes a program counter register having a plurality of program counters, and the program counter register is vectorized. Each program counter in the program counter register represents a distinct corresponding thread of a plurality of threads. The processor is configured to execute the plurality of threads by activating the plurality of program counters in a round robin cycle.

  2. Multithreaded Model for Dynamic Load Balancing Parallel Adaptive PDE Computations

    NASA Technical Reports Server (NTRS)

    Chrisochoides, Nikos

    1995-01-01

    We present a multithreaded model for the dynamic load-balancing of numerical, adaptive computations required for the solution of Partial Differential Equations (PDE's) on multiprocessors. Multithreading is used as a means of exploring concurrency in the processor level in order to tolerate synchronization costs inherent to traditional (non-threaded) parallel adaptive PDE solvers. Our preliminary analysis for parallel, adaptive PDE solvers indicates that multithreading can be used an a mechanism to mask overheads required for the dynamic balancing of processor workloads with computations required for the actual numerical solution of the PDE's. Also, multithreading can simplify the implementation of dynamic load-balancing algorithms, a task that is very difficult for traditional data parallel adaptive PDE computations. Unfortunately, multithreading does not always simplify program complexity, often makes code re-usability not an easy task, and increases software complexity.

  3. Using all of your CPU's in HIPE

    NASA Astrophysics Data System (ADS)

    Jacobson, J. D.; Fadda, D.

    2012-09-01

    Modern computer architectures increasingly feature multi-core CPU's. For example, the MacbookPro features the Intel quad-core i7 processors. Through the use of hyper-threading, where each core can execute two threads simultaneously, the quad-core i7 can support eight simultaneous processing threads. All this on your laptop! This CPU power can now be put into service by scientists to perform data reduction tasks, but only if the software has been designed to take advantage of the multiple processor architectures. Up to now, software written for Herschel data reduction (HIPE), written in Jython and JAVA, is single-threaded and can only utilize a single processor. Users of HIPE do not get any advantage from the additional processors. Why not put all of the CPU resources to work reducing your data? We present a multi-threaded software application that corrects long-term transients in the signal from the PACS unchopped spectroscopy line scan mode. In this poster, we present a multi-threaded software framework to achieve performance improvements from parallel execution. We will show how a task to correct transients in the PACS Spectroscopy Pipeline for the un-chopped line scan mode, has been threaded. This computation-intensive task uses either a one-parameter or a three parameter exponential function, to characterize the transient. The task uses a JAVA implementation of Minpack, translated from the C (Moshier) and IDL (Markwardt) by the authors, to optimize the correction parameters. We also explain how to determine if a task can benefit from threading (Amdahl's Law), and if it is safe to thread. The design and implementation, using the JAVA concurrency package completions service is described. Pitfalls, timing bugs, thread safety, resource control, testing and performance improvements are described and plotted.

  4. Fast and Accurate Simulation of the Cray XMT Multithreaded Supercomputer

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Villa, Oreste; Tumeo, Antonino; Secchi, Simone

    Irregular applications, such as data mining and analysis or graph-based computations, show unpredictable memory/network access patterns and control structures. Highly multithreaded architectures with large processor counts, like the Cray MTA-1, MTA-2 and XMT, appear to address their requirements better than commodity clusters. However, the research on highly multithreaded systems is currently limited by the lack of adequate architectural simulation infrastructures due to issues such as size of the machines, memory footprint, simulation speed, accuracy and customization. At the same time, Shared-memory MultiProcessors (SMPs) with multi-core processors have become an attractive platform to simulate large scale machines. In this paper, wemore » introduce a cycle-level simulator of the highly multithreaded Cray XMT supercomputer. The simulator runs unmodified XMT applications. We discuss how we tackled the challenges posed by its development, detailing the techniques introduced to make the simulation as fast as possible while maintaining a high accuracy. By mapping XMT processors (ThreadStorm with 128 hardware threads) to host computing cores, the simulation speed remains constant as the number of simulated processors increases, up to the number of available host cores. The simulator supports zero-overhead switching among different accuracy levels at run-time and includes a network model that takes into account contention. On a modern 48-core SMP host, our infrastructure simulates a large set of irregular applications 500 to 2000 times slower than real time when compared to a 128-processor XMT, while remaining within 10\\% of accuracy. Emulation is only from 25 to 200 times slower than real time.« less

  5. Multi-threaded parallel simulation of non-local non-linear problems in ultrashort laser pulse propagation in the presence of plasma

    NASA Astrophysics Data System (ADS)

    Baregheh, Mandana; Mezentsev, Vladimir; Schmitz, Holger

    2011-06-01

    We describe a parallel multi-threaded approach for high performance modelling of wide class of phenomena in ultrafast nonlinear optics. Specific implementation has been performed using the highly parallel capabilities of a programmable graphics processor.

  6. Data preprocessing for determining outer/inner parallelization in the nested loop problem using OpenMP

    NASA Astrophysics Data System (ADS)

    Handhika, T.; Bustamam, A.; Ernastuti, Kerami, D.

    2017-07-01

    Multi-thread programming using OpenMP on the shared-memory architecture with hyperthreading technology allows the resource to be accessed by multiple processors simultaneously. Each processor can execute more than one thread for a certain period of time. However, its speedup depends on the ability of the processor to execute threads in limited quantities, especially the sequential algorithm which contains a nested loop. The number of the outer loop iterations is greater than the maximum number of threads that can be executed by a processor. The thread distribution technique that had been found previously only be applied by the high-level programmer. This paper generates a parallelization procedure for low-level programmer in dealing with 2-level nested loop problems with the maximum number of threads that can be executed by a processor is smaller than the number of the outer loop iterations. Data preprocessing which is related to the number of the outer loop and the inner loop iterations, the computational time required to execute each iteration and the maximum number of threads that can be executed by a processor are used as a strategy to determine which parallel region that will produce optimal speedup.

  7. Optics Program Modified for Multithreaded Parallel Computing

    NASA Technical Reports Server (NTRS)

    Lou, John; Bedding, Dave; Basinger, Scott

    2006-01-01

    A powerful high-performance computer program for simulating and analyzing adaptive and controlled optical systems has been developed by modifying the serial version of the Modeling and Analysis for Controlled Optical Systems (MACOS) program to impart capabilities for multithreaded parallel processing on computing systems ranging from supercomputers down to Symmetric Multiprocessing (SMP) personal computers. The modifications included the incorporation of OpenMP, a portable and widely supported application interface software, that can be used to explicitly add multithreaded parallelism to an application program under a shared-memory programming model. OpenMP was applied to parallelize ray-tracing calculations, one of the major computing components in MACOS. Multithreading is also used in the diffraction propagation of light in MACOS based on pthreads [POSIX Thread, (where "POSIX" signifies a portable operating system for UNIX)]. In tests of the parallelized version of MACOS, the speedup in ray-tracing calculations was found to be linear, or proportional to the number of processors, while the speedup in diffraction calculations ranged from 50 to 60 percent, depending on the type and number of processors. The parallelized version of MACOS is portable, and, to the user, its interface is basically the same as that of the original serial version of MACOS.

  8. SALT: The Simulator for the Analysis of LWP Timing

    NASA Technical Reports Server (NTRS)

    Springer, Paul L.; Rodrigues, Arun; Brockman, Jay

    2006-01-01

    With the emergence of new processor architectures that are highly multithreaded, and support features such as full/empty memory semantics and split-phase memory transactions, the need for a processor simulator to handle these features becomes apparent. This paper describes such a simulator, called SALT.

  9. Application of Advanced Multi-Core Processor Technologies to Oceanographic Research

    DTIC Science & Technology

    2013-09-30

    STM32 NXP LPC series No Proprietary Microchip PIC32/DSPIC No > 500 mW; < 5 W ARM Cortex TI OMAP TI Sitara Broadcom BCM2835 Varies FPGA...1 DISTRIBUTION STATEMENT A. Approved for public release; distribution is unlimited. Application of Advanced Multi-Core Processor Technologies...state-of-the-art information processing architectures. OBJECTIVES Next-generation processor architectures (multi-core, multi-threaded) hold the

  10. Multi-threaded ATLAS simulation on Intel Knights Landing processors

    NASA Astrophysics Data System (ADS)

    Farrell, Steven; Calafiura, Paolo; Leggett, Charles; Tsulaia, Vakhtang; Dotti, Andrea; ATLAS Collaboration

    2017-10-01

    The Knights Landing (KNL) release of the Intel Many Integrated Core (MIC) Xeon Phi line of processors is a potential game changer for HEP computing. With 72 cores and deep vector registers, the KNL cards promise significant performance benefits for highly-parallel, compute-heavy applications. Cori, the newest supercomputer at the National Energy Research Scientific Computing Center (NERSC), was delivered to its users in two phases with the first phase online at the end of 2015 and the second phase now online at the end of 2016. Cori Phase 2 is based on the KNL architecture and contains over 9000 compute nodes with 96GB DDR4 memory. ATLAS simulation with the multithreaded Athena Framework (AthenaMT) is a good potential use-case for the KNL architecture and supercomputers like Cori. ATLAS simulation jobs have a high ratio of CPU computation to disk I/O and have been shown to scale well in multi-threading and across many nodes. In this paper we will give an overview of the ATLAS simulation application with details on its multi-threaded design. Then, we will present a performance analysis of the application on KNL devices and compare it to a traditional x86 platform to demonstrate the capabilities of the architecture and evaluate the benefits of utilizing KNL platforms like Cori for ATLAS production.

  11. A Faster Parallel Algorithm and Efficient Multithreaded Implementations for Evaluating Betweenness Centrality on Massive Datasets

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Madduri, Kamesh; Ediger, David; Jiang, Karl

    2009-02-15

    We present a new lock-free parallel algorithm for computing betweenness centralityof massive small-world networks. With minor changes to the data structures, ouralgorithm also achieves better spatial cache locality compared to previous approaches. Betweenness centrality is a key algorithm kernel in HPCS SSCA#2, a benchmark extensively used to evaluate the performance of emerging high-performance computing architectures for graph-theoretic computations. We design optimized implementations of betweenness centrality and the SSCA#2 benchmark for two hardware multithreaded systems: a Cray XMT system with the Threadstorm processor, and a single-socket Sun multicore server with the UltraSPARC T2 processor. For a small-world network of 134 millionmore » vertices and 1.073 billion edges, the 16-processor XMT system and the 8-core Sun Fire T5120 server achieve TEPS scores (an algorithmic performance count for the SSCA#2 benchmark) of 160 million and 90 million respectively, which corresponds to more than a 2X performance improvement over the previous parallel implementations. To better characterize the performance of these multithreaded systems, we correlate the SSCA#2 performance results with data from the memory-intensive STREAM and RandomAccess benchmarks. Finally, we demonstrate the applicability of our implementation to analyze massive real-world datasets by computing approximate betweenness centrality for a large-scale IMDb movie-actor network.« less

  12. A Faster Parallel Algorithm and Efficient Multithreaded Implementations for Evaluating Betweenness Centrality on Massive Datasets

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Madduri, Kamesh; Ediger, David; Jiang, Karl

    2009-05-29

    We present a new lock-free parallel algorithm for computing betweenness centrality of massive small-world networks. With minor changes to the data structures, our algorithm also achieves better spatial cache locality compared to previous approaches. Betweenness centrality is a key algorithm kernel in the HPCS SSCA#2 Graph Analysis benchmark, which has been extensively used to evaluate the performance of emerging high-performance computing architectures for graph-theoretic computations. We design optimized implementations of betweenness centrality and the SSCA#2 benchmark for two hardware multithreaded systems: a Cray XMT system with the ThreadStorm processor, and a single-socket Sun multicore server with the UltraSparc T2 processor.more » For a small-world network of 134 million vertices and 1.073 billion edges, the 16-processor XMT system and the 8-core Sun Fire T5120 server achieve TEPS scores (an algorithmic performance count for the SSCA#2 benchmark) of 160 million and 90 million respectively, which corresponds to more than a 2X performance improvement over the previous parallel implementations. To better characterize the performance of these multithreaded systems, we correlate the SSCA#2 performance results with data from the memory-intensive STREAM and RandomAccess benchmarks. Finally, we demonstrate the applicability of our implementation to analyze massive real-world datasets by computing approximate betweenness centrality for a large-scale IMDb movie-actor network.« less

  13. The Tera Multithreaded Architecture and Unstructured Meshes

    NASA Technical Reports Server (NTRS)

    Bokhari, Shahid H.; Mavriplis, Dimitri J.

    1998-01-01

    The Tera Multithreaded Architecture (MTA) is a new parallel supercomputer currently being installed at San Diego Supercomputing Center (SDSC). This machine has an architecture quite different from contemporary parallel machines. The computational processor is a custom design and the machine uses hardware to support very fine grained multithreading. The main memory is shared, hardware randomized and flat. These features make the machine highly suited to the execution of unstructured mesh problems, which are difficult to parallelize on other architectures. We report the results of a study carried out during July-August 1998 to evaluate the execution of EUL3D, a code that solves the Euler equations on an unstructured mesh, on the 2 processor Tera MTA at SDSC. Our investigation shows that parallelization of an unstructured code is extremely easy on the Tera. We were able to get an existing parallel code (designed for a shared memory machine), running on the Tera by changing only the compiler directives. Furthermore, a serial version of this code was compiled to run in parallel on the Tera by judicious use of directives to invoke the "full/empty" tag bits of the machine to obtain synchronization. This version achieves 212 and 406 Mflop/s on one and two processors respectively, and requires no attention to partitioning or placement of data issues that would be of paramount importance in other parallel architectures.

  14. A multi-threaded version of MCFM

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Campbell, John M.; Ellis, R. Keith; Giele, Walter T.

    We report on our findings modifying MCFM using OpenMP to implement multi-threading. By using OpenMP, the modified MCFM will execute on any processor, automatically adjusting to the number of available threads. We then modified the integration routine VEGAS to distribute the event evaluation over the threads, while combining all events at the end of every iteration to optimize the numerical integration. Furthermore, we took special care so that the results of the Monte Carlo integration were independent of the number of threads used, to facilitate the validation of the OpenMP version of MCFM.

  15. Parallel Lattice Basis Reduction Using a Multi-threaded Schnorr-Euchner LLL Algorithm

    NASA Astrophysics Data System (ADS)

    Backes, Werner; Wetzel, Susanne

    In this paper, we introduce a new parallel variant of the LLL lattice basis reduction algorithm. Our new, multi-threaded algorithm is the first to provide an efficient, parallel implementation of the Schorr-Euchner algorithm for today’s multi-processor, multi-core computer architectures. Experiments with sparse and dense lattice bases show a speed-up factor of about 1.8 for the 2-thread and about factor 3.2 for the 4-thread version of our new parallel lattice basis reduction algorithm in comparison to the traditional non-parallel algorithm.

  16. Hardware based redundant multi-threading inside a GPU for improved reliability

    DOEpatents

    Sridharan, Vilas; Gurumurthi, Sudhanva

    2015-05-05

    A system and method for verifying computation output using computer hardware are provided. Instances of computation are generated and processed on hardware-based processors. As instances of computation are processed, each instance of computation receives a load accessible to other instances of computation. Instances of output are generated by processing the instances of computation. The instances of output are verified against each other in a hardware based processor to ensure accuracy of the output.

  17. Designing Next Generation Massively Multithreaded Architectures for Irregular Applications

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Tumeo, Antonino; Secchi, Simone; Villa, Oreste

    Irregular applications, such as data mining or graph-based computations, show unpredictable memory/network access patterns and control structures. Massively multi-threaded architectures with large node count, like the Cray XMT, have been shown to address their requirements better than commodity clusters. In this paper we present the approaches that we are currently pursuing to design future generations of these architectures. First, we introduce the Cray XMT and compare it to other multithreaded architectures. We then propose an evolution of the architecture, integrating multiple cores per node and next generation network interconnect. We advocate the use of hardware support for remote memory referencemore » aggregation to optimize network utilization. For this evaluation we developed a highly parallel, custom simulation infrastructure for multi-threaded systems. Our simulator executes unmodified XMT binaries with very large datasets, capturing effects due to contention and hot-spotting, while predicting execution times with greater than 90% accuracy. We also discuss the FPGA prototyping approach that we are employing to study efficient support for irregular applications in next generation manycore processors.« less

  18. Programs for Testing Processor-in-Memory Computing Systems

    NASA Technical Reports Server (NTRS)

    Katz, Daniel S.

    2006-01-01

    The Multithreaded Microbenchmarks for Processor-In-Memory (PIM) Compilers, Simulators, and Hardware are computer programs arranged in a series for use in testing the performances of PIM computing systems, including compilers, simulators, and hardware. The programs at the beginning of the series test basic functionality; the programs at subsequent positions in the series test increasingly complex functionality. The programs are intended to be used while designing a PIM system, and can be used to verify that compilers, simulators, and hardware work correctly. The programs can also be used to enable designers of these system components to examine tradeoffs in implementation. Finally, these programs can be run on non-PIM hardware (either single-threaded or multithreaded) using the POSIX pthreads standard to verify that the benchmarks themselves operate correctly. [POSIX (Portable Operating System Interface for UNIX) is a set of standards that define how programs and operating systems interact with each other. pthreads is a library of pre-emptive thread routines that comply with one of the POSIX standards.

  19. Platform-Independence and Scheduling In a Multi-Threaded Real-Time Simulation

    NASA Technical Reports Server (NTRS)

    Sugden, Paul P.; Rau, Melissa A.; Kenney, P. Sean

    2001-01-01

    Aviation research often relies on real-time, pilot-in-the-loop flight simulation as a means to develop new flight software, flight hardware, or pilot procedures. Often these simulations become so complex that a single processor is incapable of performing the necessary computations within a fixed time-step. Threads are an elegant means to distribute the computational work-load when running on a symmetric multi-processor machine. However, programming with threads often requires operating system specific calls that reduce code portability and maintainability. While a multi-threaded simulation allows a significant increase in the simulation complexity, it also increases the workload of a simulation operator by requiring that the operator determine which models run on which thread. To address these concerns an object-oriented design was implemented in the NASA Langley Standard Real-Time Simulation in C++ (LaSRS++) application framework. The design provides a portable and maintainable means to use threads and also provides a mechanism to automatically load balance the simulation models.

  20. Effective Vectorization with OpenMP 4.5

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Huber, Joseph N.; Hernandez, Oscar R.; Lopez, Matthew Graham

    This paper describes how the Single Instruction Multiple Data (SIMD) model and its extensions in OpenMP work, and how these are implemented in different compilers. Modern processors are highly parallel computational machines which often include multiple processors capable of executing several instructions in parallel. Understanding SIMD and executing instructions in parallel allows the processor to achieve higher performance without increasing the power required to run it. SIMD instructions can significantly reduce the runtime of code by executing a single operation on large groups of data. The SIMD model is so integral to the processor s potential performance that, if SIMDmore » is not utilized, less than half of the processor is ever actually used. Unfortunately, using SIMD instructions is a challenge in higher level languages because most programming languages do not have a way to describe them. Most compilers are capable of vectorizing code by using the SIMD instructions, but there are many code features important for SIMD vectorization that the compiler cannot determine at compile time. OpenMP attempts to solve this by extending the C++/C and Fortran programming languages with compiler directives that express SIMD parallelism. OpenMP is used to pass hints to the compiler about the code to be executed in SIMD. This is a key resource for making optimized code, but it does not change whether or not the code can use SIMD operations. However, in many cases critical functions are limited by a poor understanding of how SIMD instructions are actually implemented, as SIMD can be implemented through vector instructions or simultaneous multi-threading (SMT). We have found that it is often the case that code cannot be vectorized, or is vectorized poorly, because the programmer does not have sufficient knowledge of how SIMD instructions work.« less

  1. Tomo3D 2.0--exploitation of advanced vector extensions (AVX) for 3D reconstruction.

    PubMed

    Agulleiro, Jose-Ignacio; Fernandez, Jose-Jesus

    2015-02-01

    Tomo3D is a program for fast tomographic reconstruction on multicore computers. Its high speed stems from code optimization, vectorization with Streaming SIMD Extensions (SSE), multithreading and optimization of disk access. Recently, Advanced Vector eXtensions (AVX) have been introduced in the x86 processor architecture. Compared to SSE, AVX double the number of simultaneous operations, thus pointing to a potential twofold gain in speed. However, in practice, achieving this potential is extremely difficult. Here, we provide a technical description and an assessment of the optimizations included in Tomo3D to take advantage of AVX instructions. Tomo3D 2.0 allows huge reconstructions to be calculated in standard computers in a matter of minutes. Thus, it will be a valuable tool for electron tomography studies with increasing resolution needs. Copyright © 2014 Elsevier Inc. All rights reserved.

  2. Contention Modeling for Multithreaded Distributed Shared Memory Machines: The Cray XMT

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Secchi, Simone; Tumeo, Antonino; Villa, Oreste

    Distributed Shared Memory (DSM) machines are a wide class of multi-processor computing systems where a large virtually-shared address space is mapped on a network of physically distributed memories. High memory latency and network contention are two of the main factors that limit performance scaling of such architectures. Modern high-performance computing DSM systems have evolved toward exploitation of massive hardware multi-threading and fine-grained memory hashing to tolerate irregular latencies, avoid network hot-spots and enable high scaling. In order to model the performance of such large-scale machines, parallel simulation has been proved to be a promising approach to achieve good accuracy inmore » reasonable times. One of the most critical factors in solving the simulation speed-accuracy trade-off is network modeling. The Cray XMT is a massively multi-threaded supercomputing architecture that belongs to the DSM class, since it implements a globally-shared address space abstraction on top of a physically distributed memory substrate. In this paper, we discuss the development of a contention-aware network model intended to be integrated in a full-system XMT simulator. We start by measuring the effects of network contention in a 128-processor XMT machine and then investigate the trade-off that exists between simulation accuracy and speed, by comparing three network models which operate at different levels of accuracy. The comparison and model validation is performed by executing a string-matching algorithm on the full-system simulator and on the XMT, using three datasets that generate noticeably different contention patterns.« less

  3. Generic accelerated sequence alignment in SeqAn using vectorization and multi-threading.

    PubMed

    Rahn, René; Budach, Stefan; Costanza, Pascal; Ehrhardt, Marcel; Hancox, Jonny; Reinert, Knut

    2018-05-03

    Pairwise sequence alignment is undoubtedly a central tool in many bioinformatics analyses. In this paper, we present a generically accelerated module for pairwise sequence alignments applicable for a broad range of applications. In our module, we unified the standard dynamic programming kernel used for pairwise sequence alignments and extended it with a generalized inter-sequence vectorization layout, such that many alignments can be computed simultaneously by exploiting SIMD (Single Instruction Multiple Data) instructions of modern processors. We then extended the module by adding two layers of thread-level parallelization, where we a) distribute many independent alignments on multiple threads and b) inherently parallelize a single alignment computation using a work stealing approach producing a dynamic wavefront progressing along the minor diagonal. We evaluated our alignment vectorization and parallelization on different processors, including the newest Intel® Xeon® (Skylake) and Intel® Xeon Phi™ (KNL) processors, and use cases. The instruction set AVX512-BW (Byte and Word), available on Skylake processors, can genuinely improve the performance of vectorized alignments. We could run single alignments 1600 times faster on the Xeon Phi™ and 1400 times faster on the Xeon® than executing them with our previous sequential alignment module. The module is programmed in C++ using the SeqAn (Reinert et al., 2017) library and distributed with version 2.4. under the BSD license. We support SSE4, AVX2, AVX512 instructions and included UME::SIMD, a SIMD-instruction wrapper library, to extend our module for further instruction sets. We thoroughly test all alignment components with all major C++ compilers on various platforms. rene.rahn@fu-berlin.de.

  4. Results of SEI Independent Research and Development Projects

    DTIC Science & Technology

    2008-12-01

    contained there. When laptops with a dual-core processor came out, ITunes fails crashed. ITunes was designed as multi-threaded application, but until...involving product portfolio, in-bound technical marketing, research and development, product engineering, supply chain, and out-bound sales and marketing...of quality and process improvement professionals to the marketing, product engineering, supply chain, product test and sales professionals. 3

  5. Massively Multithreaded Maxflow for Image Segmentation on the Cray XMT-2

    PubMed Central

    Bokhari, Shahid H.; Çatalyürek, Ümit V.; Gurcan, Metin N.

    2014-01-01

    SUMMARY Image segmentation is a very important step in the computerized analysis of digital images. The maxflow mincut approach has been successfully used to obtain minimum energy segmentations of images in many fields. Classical algorithms for maxflow in networks do not directly lend themselves to efficient parallel implementations on contemporary parallel processors. We present the results of an implementation of Goldberg-Tarjan preflow-push algorithm on the Cray XMT-2 massively multithreaded supercomputer. This machine has hardware support for 128 threads in each physical processor, a uniformly accessible shared memory of up to 4 TB and hardware synchronization for each 64 bit word. It is thus well-suited to the parallelization of graph theoretic algorithms, such as preflow-push. We describe the implementation of the preflow-push code on the XMT-2 and present the results of timing experiments on a series of synthetically generated as well as real images. Our results indicate very good performance on large images and pave the way for practical applications of this machine architecture for image analysis in a production setting. The largest images we have run are 320002 pixels in size, which are well beyond the largest previously reported in the literature. PMID:25598745

  6. Servicing a globally broadcast interrupt signal in a multi-threaded computer

    DOEpatents

    Attinella, John E.; Davis, Kristan D.; Musselman, Roy G.; Satterfield, David L.

    2015-12-29

    Methods, apparatuses, and computer program products for servicing a globally broadcast interrupt signal in a multi-threaded computer comprising a plurality of processor threads. Embodiments include an interrupt controller indicating in a plurality of local interrupt status locations that a globally broadcast interrupt signal has been received by the interrupt controller. Embodiments also include a thread determining that a local interrupt status location corresponding to the thread indicates that the globally broadcast interrupt signal has been received by the interrupt controller. Embodiments also include the thread processing one or more entries in a global interrupt status bit queue based on whether global interrupt status bits associated with the globally broadcast interrupt signal are locked. Each entry in the global interrupt status bit queue corresponds to a queued global interrupt.

  7. DOE Office of Scientific and Technical Information (OSTI.GOV)

    Aliaga, José I., E-mail: aliaga@uji.es; Alonso, Pedro; Badía, José M.

    We introduce a new iterative Krylov subspace-based eigensolver for the simulation of macromolecular motions on desktop multithreaded platforms equipped with multicore processors and, possibly, a graphics accelerator (GPU). The method consists of two stages, with the original problem first reduced into a simpler band-structured form by means of a high-performance compute-intensive procedure. This is followed by a memory-intensive but low-cost Krylov iteration, which is off-loaded to be computed on the GPU by means of an efficient data-parallel kernel. The experimental results reveal the performance of the new eigensolver. Concretely, when applied to the simulation of macromolecules with a few thousandsmore » degrees of freedom and the number of eigenpairs to be computed is small to moderate, the new solver outperforms other methods implemented as part of high-performance numerical linear algebra packages for multithreaded architectures.« less

  8. Parallel definition of tear film maps on distributed-memory clusters for the support of dry eye diagnosis.

    PubMed

    González-Domínguez, Jorge; Remeseiro, Beatriz; Martín, María J

    2017-02-01

    The analysis of the interference patterns on the tear film lipid layer is a useful clinical test to diagnose dry eye syndrome. This task can be automated with a high degree of accuracy by means of the use of tear film maps. However, the time required by the existing applications to generate them prevents a wider acceptance of this method by medical experts. Multithreading has been previously successfully employed by the authors to accelerate the tear film map definition on multicore single-node machines. In this work, we propose a hybrid message-passing and multithreading parallel approach that further accelerates the generation of tear film maps by exploiting the computational capabilities of distributed-memory systems such as multicore clusters and supercomputers. The algorithm for drawing tear film maps is parallelized using Message Passing Interface (MPI) for inter-node communications and the multithreading support available in the C++11 standard for intra-node parallelization. The original algorithm is modified to reduce the communications and increase the scalability. The hybrid method has been tested on 32 nodes of an Intel cluster (with two 12-core Haswell 2680v3 processors per node) using 50 representative images. Results show that maximum runtime is reduced from almost two minutes using the previous only-multithreaded approach to less than ten seconds using the hybrid method. The hybrid MPI/multithreaded implementation can be used by medical experts to obtain tear film maps in only a few seconds, which will significantly accelerate and facilitate the diagnosis of the dry eye syndrome. Copyright © 2016 Elsevier Ireland Ltd. All rights reserved.

  9. Development of an Autonomous Navigation Technology Test Vehicle

    DTIC Science & Technology

    2004-08-01

    as an independent thread on processors using the Linux operating system. The computer hardware selected for the nodes that host the MRS threads...communications system design. Linux was chosen as the operating system for all of the single board computers used on the Mule. Linux was specifically...used for system analysis and development. The simple realization of multi-thread processing and inter-process communications in Linux made it a

  10. Electron-beam lithography data preparation based on multithreading MGS/PROXECCO

    NASA Astrophysics Data System (ADS)

    Eichhorn, Hans; Lemke, Melchior; Gramss, Juergen; Buerger, B.; Baetz, Uwe; Belic, Nikola; Eisenmann, Hans

    2001-04-01

    This paper will highlight an enhanced MGS layout data post processor and the results of its industrial application. Besides the preparation of hierarchical GDS layout data, the processing of flat data has been drastically accelerated. The application of the Proximity Correction in conjunction with the OEM version of the PROXECCO was crowned with success for data preparation of mask sets featuring 0.25 micrometers /0.18 micrometers integration levels.

  11. List-mode PET image reconstruction for motion correction using the Intel XEON PHI co-processor

    NASA Astrophysics Data System (ADS)

    Ryder, W. J.; Angelis, G. I.; Bashar, R.; Gillam, J. E.; Fulton, R.; Meikle, S.

    2014-03-01

    List-mode image reconstruction with motion correction is computationally expensive, as it requires projection of hundreds of millions of rays through a 3D array. To decrease reconstruction time it is possible to use symmetric multiprocessing computers or graphics processing units. The former can have high financial costs, while the latter can require refactoring of algorithms. The Xeon Phi is a new co-processor card with a Many Integrated Core architecture that can run 4 multiple-instruction, multiple data threads per core with each thread having a 512-bit single instruction, multiple data vector register. Thus, it is possible to run in the region of 220 threads simultaneously. The aim of this study was to investigate whether the Xeon Phi co-processor card is a viable alternative to an x86 Linux server for accelerating List-mode PET image reconstruction for motion correction. An existing list-mode image reconstruction algorithm with motion correction was ported to run on the Xeon Phi coprocessor with the multi-threading implemented using pthreads. There were no differences between images reconstructed using the Phi co-processor card and images reconstructed using the same algorithm run on a Linux server. However, it was found that the reconstruction runtimes were 3 times greater for the Phi than the server. A new version of the image reconstruction algorithm was developed in C++ using OpenMP for mutli-threading and the Phi runtimes decreased to 1.67 times that of the host Linux server. Data transfer from the host to co-processor card was found to be a rate-limiting step; this needs to be carefully considered in order to maximize runtime speeds. When considering the purchase price of a Linux workstation with Xeon Phi co-processor card and top of the range Linux server, the former is a cost-effective computation resource for list-mode image reconstruction. A multi-Phi workstation could be a viable alternative to cluster computers at a lower cost for medical imaging applications.

  12. Real-time SHVC software decoding with multi-threaded parallel processing

    NASA Astrophysics Data System (ADS)

    Gudumasu, Srinivas; He, Yuwen; Ye, Yan; He, Yong; Ryu, Eun-Seok; Dong, Jie; Xiu, Xiaoyu

    2014-09-01

    This paper proposes a parallel decoding framework for scalable HEVC (SHVC). Various optimization technologies are implemented on the basis of SHVC reference software SHM-2.0 to achieve real-time decoding speed for the two layer spatial scalability configuration. SHVC decoder complexity is analyzed with profiling information. The decoding process at each layer and the up-sampling process are designed in parallel and scheduled by a high level application task manager. Within each layer, multi-threaded decoding is applied to accelerate the layer decoding speed. Entropy decoding, reconstruction, and in-loop processing are pipeline designed with multiple threads based on groups of coding tree units (CTU). A group of CTUs is treated as a processing unit in each pipeline stage to achieve a better trade-off between parallelism and synchronization. Motion compensation, inverse quantization, and inverse transform modules are further optimized with SSE4 SIMD instructions. Simulations on a desktop with an Intel i7 processor 2600 running at 3.4 GHz show that the parallel SHVC software decoder is able to decode 1080p spatial 2x at up to 60 fps (frames per second) and 1080p spatial 1.5x at up to 50 fps for those bitstreams generated with SHVC common test conditions in the JCT-VC standardization group. The decoding performance at various bitrates with different optimization technologies and different numbers of threads are compared in terms of decoding speed and resource usage, including processor and memory.

  13. Hiding the Disk and Network Latency of Out-of-Core Visualization

    NASA Technical Reports Server (NTRS)

    Ellsworth, David

    2001-01-01

    This paper describes an algorithm that improves the performance of application-controlled demand paging for out-of-core visualization by hiding the latency of reading data from both local disks or disks on remote servers. The performance improvements come from better overlapping the computation with the page reading process, and by performing multiple page reads in parallel. The paper includes measurements that show that the new multithreaded paging algorithm decreases the time needed to compute visualizations by one third when using one processor and reading data from local disk. The time needed when using one processor and reading data from remote disk decreased by two thirds. Visualization runs using data from remote disk actually ran faster than ones using data from local disk because the remote runs were able to make use of the remote server's high performance disk array.

  14. System, methods and apparatus for program optimization for multi-threaded processor architectures

    DOEpatents

    Bastoul, Cedric; Lethin, Richard A; Leung, Allen K; Meister, Benoit J; Szilagyi, Peter; Vasilache, Nicolas T; Wohlford, David E

    2015-01-06

    Methods, apparatus and computer software product for source code optimization are provided. In an exemplary embodiment, a first custom computing apparatus is used to optimize the execution of source code on a second computing apparatus. In this embodiment, the first custom computing apparatus contains a memory, a storage medium and at least one processor with at least one multi-stage execution unit. The second computing apparatus contains at least two multi-stage execution units that allow for parallel execution of tasks. The first custom computing apparatus optimizes the code for parallelism, locality of operations and contiguity of memory accesses on the second computing apparatus. This Abstract is provided for the sole purpose of complying with the Abstract requirement rules. This Abstract is submitted with the explicit understanding that it will not be used to interpret or to limit the scope or the meaning of the claims.

  15. Study of Thread Level Parallelism in a Video Encoding Application for Chip Multiprocessor Design

    NASA Astrophysics Data System (ADS)

    Debes, Eric; Kaine, Greg

    2002-11-01

    In media applications there is a high level of available thread level parallelism (TLP). In this paper we study the intra TLP in a video encoder. We show that a well-distributed highly optimized encoder running on a symmetric multiprocessor (SMP) system can run 3.2 faster on a 4-way SMP machine than on a single processor. The multithreaded encoder running on an SMP system is then used to understand the requirements of a chip multiprocessor (CMP) architecture, which is one possible architectural direction to better exploit TLP. In the framework of this study, we use a software approach to evaluate the dataflow between processors for the video encoder running on an SMP system. An estimation of the dataflow is done with L2 cache miss event counters using Intel® VTuneTM performance analyzer. The experimental measurements are compared to theoretical results.

  16. Gilgamesh: A Multithreaded Processor-In-Memory Architecture for Petaflops Computing

    NASA Technical Reports Server (NTRS)

    Sterling, T. L.; Zima, H. P.

    2002-01-01

    Processor-in-Memory (PIM) architectures avoid the von Neumann bottleneck in conventional machines by integrating high-density DRAM and CMOS logic on the same chip. Parallel systems based on this new technology are expected to provide higher scalability, adaptability, robustness, fault tolerance and lower power consumption than current MPPs or commodity clusters. In this paper we describe the design of Gilgamesh, a PIM-based massively parallel architecture, and elements of its execution model. Gilgamesh extends existing PIM capabilities by incorporating advanced mechanisms for virtualizing tasks and data and providing adaptive resource management for load balancing and latency tolerance. The Gilgamesh execution model is based on macroservers, a middleware layer which supports object-based runtime management of data and threads allowing explicit and dynamic control of locality and load balancing. The paper concludes with a discussion of related research activities and an outlook to future work.

  17. Parallel Computer System for 3D Visualization Stereo on GPU

    NASA Astrophysics Data System (ADS)

    Al-Oraiqat, Anas M.; Zori, Sergii A.

    2018-03-01

    This paper proposes the organization of a parallel computer system based on Graphic Processors Unit (GPU) for 3D stereo image synthesis. The development is based on the modified ray tracing method developed by the authors for fast search of tracing rays intersections with scene objects. The system allows significant increase in the productivity for the 3D stereo synthesis of photorealistic quality. The generalized procedure of 3D stereo image synthesis on the Graphics Processing Unit/Graphics Processing Clusters (GPU/GPC) is proposed. The efficiency of the proposed solutions by GPU implementation is compared with single-threaded and multithreaded implementations on the CPU. The achieved average acceleration in multi-thread implementation on the test GPU and CPU is about 7.5 and 1.6 times, respectively. Studying the influence of choosing the size and configuration of the computational Compute Unified Device Archi-tecture (CUDA) network on the computational speed shows the importance of their correct selection. The obtained experimental estimations can be significantly improved by new GPUs with a large number of processing cores and multiprocessors, as well as optimized configuration of the computing CUDA network.

  18. Improvement and speed optimization of numerical tsunami modelling program using OpenMP technology

    NASA Astrophysics Data System (ADS)

    Chernov, A.; Zaytsev, A.; Yalciner, A.; Kurkin, A.

    2009-04-01

    Currently, the basic problem of tsunami modeling is low speed of calculations which is unacceptable for services of the operative notification. Existing algorithms of numerical modeling of hydrodynamic processes of tsunami waves are developed without taking the opportunities of modern computer facilities. There is an opportunity to have considerable acceleration of process of calculations by using parallel algorithms. We discuss here new approach to parallelization tsunami modeling code using OpenMP Technology (for multiprocessing systems with the general memory). Nowadays, multiprocessing systems are easily accessible for everyone. The cost of the use of such systems becomes much lower comparing to the costs of clusters. This opportunity also benefits all programmers to apply multithreading algorithms on desktop computers of researchers. Other important advantage of the given approach is the mechanism of the general memory - there is no necessity to send data on slow networks (for example Ethernet). All memory is the common for all computing processes; it causes almost linear scalability of the program and processes. In the new version of NAMI DANCE using OpenMP technology and multi-threading algorithm provide 80% gain in speed in comparison with the one-thread version for dual-processor unit. The speed increased and 320% gain was attained for four core processor unit of PCs. Thus, it was possible to reduce considerably time of performance of calculations on the scientific workstations (desktops) without complete change of the program and user interfaces. The further modernization of algorithms of preparation of initial data and processing of results using OpenMP looks reasonable. The final version of NAMI DANCE with the increased computational speed can be used not only for research purposes but also in real time Tsunami Warning Systems.

  19. Kalman filter tracking on parallel architectures

    NASA Astrophysics Data System (ADS)

    Cerati, G.; Elmer, P.; Krutelyov, S.; Lantz, S.; Lefebvre, M.; McDermott, K.; Riley, D.; Tadel, M.; Wittich, P.; Wurthwein, F.; Yagil, A.

    2017-10-01

    We report on the progress of our studies towards a Kalman filter track reconstruction algorithm with optimal performance on manycore architectures. The combinatorial structure of these algorithms is not immediately compatible with an efficient SIMD (or SIMT) implementation; the challenge for us is to recast the existing software so it can readily generate hundreds of shared-memory threads that exploit the underlying instruction set of modern processors. We show how the data and associated tasks can be organized in a way that is conducive to both multithreading and vectorization. We demonstrate very good performance on Intel Xeon and Xeon Phi architectures, as well as promising first results on Nvidia GPUs.

  20. ROOT 6 and beyond: TObject, C++14 and many cores

    DOE PAGES

    Bellenot, B.; Canal, Ph; Couet, O.; ...

    2015-12-23

    Following the release of version 6, ROOT has entered a new area of development. It will leverage the industrial strength compiler library shipping in ROOT 6 and its support of the C++11/14 standard, to significantly simplify and harden ROOT's interfaces and to clarify and substantially improve ROOT's support for multi-threaded environments. Furthermore, this talk will also recap the most important new features and enhancements in ROOT in general, focusing on those allowed by the improved interpreter and better compiler support, including I/O for smart pointers, easier type safe access to the content of TTrees and enhanced multi processor support.

  1. M3T: Morphable Multithreaded Memory Tiles

    DTIC Science & Technology

    2004-01-01

    4 5 6 7 8 9 10 A B C D E F G H I J K L x[0] x[1] x[2] x[3] x[4] x[5] x[6] x[7] y[0] y[4] y[2] y[6] y[1] y[5] y[3] y[7] Figure 4: 8-point radix-2...CONPAR 94, September. 33. [NAR95] Narayan, S. and Gajski , D. (1995) “Interfacing Incompatible Protocols Using Interface Process Generation,” 32nd...Parallel and Distributed Tools, August. 42. [SOH95] Sohi, G ., Breach, S. and Vajapeyam, S. (1995) “Multiscalar Processors,” Proceedings of the 22nd

  2. Accelerating Demand Paging for Local and Remote Out-of-Core Visualization

    NASA Technical Reports Server (NTRS)

    Ellsworth, David

    2001-01-01

    This paper describes a new algorithm that improves the performance of application-controlled demand paging for the out-of-core visualization of data sets that are on either local disks or disks on remote servers. The performance improvements come from better overlapping the computation with the page reading process, and by performing multiple page reads in parallel. The new algorithm can be applied to many different visualization algorithms since application-controlled demand paging is not specific to any visualization algorithm. The paper includes measurements that show that the new multi-threaded paging algorithm decreases the time needed to compute visualizations by one third when using one processor and reading data from local disk. The time needed when using one processor and reading data from remote disk decreased by up to 60%. Visualization runs using data from remote disk ran about as fast as ones using data from local disk because the remote runs were able to make use of the remote server's high performance disk array.

  3. 2-D Acousto-Optic Signal Processors for Simultaneous Spectrum Analysis and Direction Finding

    DTIC Science & Technology

    1990-11-01

    National Dfense Defence nationale 2-D ACOUSTO - OPTIC SIGNAL PROCESSORS FOR SIMULTANEOUS SPECTRUM ANALYSIS 00 AND DIRECTION FINDING (U) by NM Jim P.Y...Wr pdft .1w I0~1111191 3 05089 National DIfense Defence nationale 2-D ACOUSTO - OPTIC SIGNAL PROCESSORS FOR SIMULTANEOUS SPECTRUM ANALYSIS AND DIRECTION...Processing, J.T. Tippet et al., Eds., Chapter 38, pp. 715-748, MIT Press, Cambridge 1965. [6] A.E. Spezio," Acousto - optics for Electronic Warfare

  4. Algorithms and Libraries

    NASA Technical Reports Server (NTRS)

    Dongarra, Jack

    1998-01-01

    This exploratory study initiated our inquiry into algorithms and applications that would benefit by latency tolerant approach to algorithm building, including the construction of new algorithms where appropriate. In a multithreaded execution, when a processor reaches a point where remote memory access is necessary, the request is sent out on the network and a context--switch occurs to a new thread of computation. This effectively masks a long and unpredictable latency due to remote loads, thereby providing tolerance to remote access latency. We began to develop standards to profile various algorithm and application parameters, such as the degree of parallelism, granularity, precision, instruction set mix, interprocessor communication, latency etc. These tools will continue to develop and evolve as the Information Power Grid environment matures. To provide a richer context for this research, the project also focused on issues of fault-tolerance and computation migration of numerical algorithms and software. During the initial phase we tried to increase our understanding of the bottlenecks in single processor performance. Our work began by developing an approach for the automatic generation and optimization of numerical software for processors with deep memory hierarchies and pipelined functional units. Based on the results we achieved in this study we are planning to study other architectures of interest, including development of cost models, and developing code generators appropriate to these architectures.

  5. MetAlign 3.0: performance enhancement by efficient use of advances in computer hardware.

    PubMed

    Lommen, Arjen; Kools, Harrie J

    2012-08-01

    A new, multi-threaded version of the GC-MS and LC-MS data processing software, metAlign, has been developed which is able to utilize multiple cores on one PC. This new version was tested using three different multi-core PCs with different operating systems. The performance of noise reduction, baseline correction and peak-picking was 8-19 fold faster compared to the previous version on a single core machine from 2008. The alignment was 5-10 fold faster. Factors influencing the performance enhancement are discussed. Our observations show that performance scales with the increase in processor core numbers we currently see in consumer PC hardware development.

  6. Electromagnetic Physics Models for Parallel Computing Architectures

    NASA Astrophysics Data System (ADS)

    Amadio, G.; Ananya, A.; Apostolakis, J.; Aurora, A.; Bandieramonte, M.; Bhattacharyya, A.; Bianchini, C.; Brun, R.; Canal, P.; Carminati, F.; Duhem, L.; Elvira, D.; Gheata, A.; Gheata, M.; Goulas, I.; Iope, R.; Jun, S. Y.; Lima, G.; Mohanty, A.; Nikitina, T.; Novak, M.; Pokorski, W.; Ribon, A.; Seghal, R.; Shadura, O.; Vallecorsa, S.; Wenzel, S.; Zhang, Y.

    2016-10-01

    The recent emergence of hardware architectures characterized by many-core or accelerated processors has opened new opportunities for concurrent programming models taking advantage of both SIMD and SIMT architectures. GeantV, a next generation detector simulation, has been designed to exploit both the vector capability of mainstream CPUs and multi-threading capabilities of coprocessors including NVidia GPUs and Intel Xeon Phi. The characteristics of these architectures are very different in terms of the vectorization depth and type of parallelization needed to achieve optimal performance. In this paper we describe implementation of electromagnetic physics models developed for parallel computing architectures as a part of the GeantV project. Results of preliminary performance evaluation and physics validation are presented as well.

  7. Multi-threading: A new dimension to massively parallel scientific computation

    NASA Astrophysics Data System (ADS)

    Nielsen, Ida M. B.; Janssen, Curtis L.

    2000-06-01

    Multi-threading is becoming widely available for Unix-like operating systems, and the application of multi-threading opens new ways for performing parallel computations with greater efficiency. We here briefly discuss the principles of multi-threading and illustrate the application of multi-threading for a massively parallel direct four-index transformation of electron repulsion integrals. Finally, other potential applications of multi-threading in scientific computing are outlined.

  8. Introducing concurrency in the Gaudi data processing framework

    NASA Astrophysics Data System (ADS)

    Clemencic, Marco; Hegner, Benedikt; Mato, Pere; Piparo, Danilo

    2014-06-01

    In the past, the increasing demands for HEP processing resources could be fulfilled by the ever increasing clock-frequencies and by distributing the work to more and more physical machines. Limitations in power consumption of both CPUs and entire data centres are bringing an end to this era of easy scalability. To get the most CPU performance per watt, future hardware will be characterised by less and less memory per processor, as well as thinner, more specialized and more numerous cores per die, and rather heterogeneous resources. To fully exploit the potential of the many cores, HEP data processing frameworks need to allow for parallel execution of reconstruction or simulation algorithms on several events simultaneously. We describe our experience in introducing concurrency related capabilities into Gaudi, a generic data processing software framework, which is currently being used by several HEP experiments, including the ATLAS and LHCb experiments at the LHC. After a description of the concurrent framework and the most relevant design choices driving its development, we describe the behaviour of the framework in a more realistic environment, using a subset of the real LHCb reconstruction workflow, and present our strategy and the used tools to validate the physics outcome of the parallel framework against the results of the present, purely sequential LHCb software. We then summarize the measurement of the code performance of the multithreaded application in terms of memory and CPU usage.

  9. Virtual Machine Language 2.1

    NASA Technical Reports Server (NTRS)

    Riedel, Joseph E.; Grasso, Christopher A.

    2012-01-01

    VML (Virtual Machine Language) is an advanced computing environment that allows spacecraft to operate using mechanisms ranging from simple, time-oriented sequencing to advanced, multicomponent reactive systems. VML has developed in four evolutionary stages. VML 0 is a core execution capability providing multi-threaded command execution, integer data types, and rudimentary branching. VML 1 added named parameterized procedures, extensive polymorphism, data typing, branching, looping issuance of commands using run-time parameters, and named global variables. VML 2 added for loops, data verification, telemetry reaction, and an open flight adaptation architecture. VML 2.1 contains major advances in control flow capabilities for executable state machines. On the resource requirements front, VML 2.1 features a reduced memory footprint in order to fit more capability into modestly sized flight processors, and endian-neutral data access for compatibility with Intel little-endian processors. Sequence packaging has been improved with object-oriented programming constructs and the use of implicit (rather than explicit) time tags on statements. Sequence event detection has been significantly enhanced with multi-variable waiting, which allows a sequence to detect and react to conditions defined by complex expressions with multiple global variables. This multi-variable waiting serves as the basis for implementing parallel rule checking, which in turn, makes possible executable state machines. The new state machine feature in VML 2.1 allows the creation of sophisticated autonomous reactive systems without the need to develop expensive flight software. Users specify named states and transitions, along with the truth conditions required, before taking transitions. Transitions with the same signal name allow separate state machines to coordinate actions: the conditions distributed across all state machines necessary to arm a particular signal are evaluated, and once found true, that signal is raised. The selected signal then causes all identically named transitions in all present state machines to be taken simultaneously. VML 2.1 has relevance to all potential space missions, both manned and unmanned. It was under consideration for use on Orion.

  10. Implementing Shared Memory Parallelism in MCBEND

    NASA Astrophysics Data System (ADS)

    Bird, Adam; Long, David; Dobson, Geoff

    2017-09-01

    MCBEND is a general purpose radiation transport Monte Carlo code from AMEC Foster Wheelers's ANSWERS® Software Service. MCBEND is well established in the UK shielding community for radiation shielding and dosimetry assessments. The existing MCBEND parallel capability effectively involves running the same calculation on many processors. This works very well except when the memory requirements of a model restrict the number of instances of a calculation that will fit on a machine. To more effectively utilise parallel hardware OpenMP has been used to implement shared memory parallelism in MCBEND. This paper describes the reasoning behind the choice of OpenMP, notes some of the challenges of multi-threading an established code such as MCBEND and assesses the performance of the parallel method implemented in MCBEND.

  11. Electromagnetic physics models for parallel computing architectures

    DOE PAGES

    Amadio, G.; Ananya, A.; Apostolakis, J.; ...

    2016-11-21

    The recent emergence of hardware architectures characterized by many-core or accelerated processors has opened new opportunities for concurrent programming models taking advantage of both SIMD and SIMT architectures. GeantV, a next generation detector simulation, has been designed to exploit both the vector capability of mainstream CPUs and multi-threading capabilities of coprocessors including NVidia GPUs and Intel Xeon Phi. The characteristics of these architectures are very different in terms of the vectorization depth and type of parallelization needed to achieve optimal performance. In this paper we describe implementation of electromagnetic physics models developed for parallel computing architectures as a part ofmore » the GeantV project. Finally, the results of preliminary performance evaluation and physics validation are presented as well.« less

  12. First experience of vectorizing electromagnetic physics models for detector simulation

    NASA Astrophysics Data System (ADS)

    Amadio, G.; Apostolakis, J.; Bandieramonte, M.; Bianchini, C.; Bitzes, G.; Brun, R.; Canal, P.; Carminati, F.; de Fine Licht, J.; Duhem, L.; Elvira, D.; Gheata, A.; Jun, S. Y.; Lima, G.; Novak, M.; Presbyterian, M.; Shadura, O.; Seghal, R.; Wenzel, S.

    2015-12-01

    The recent emergence of hardware architectures characterized by many-core or accelerated processors has opened new opportunities for concurrent programming models taking advantage of both SIMD and SIMT architectures. The GeantV vector prototype for detector simulations has been designed to exploit both the vector capability of mainstream CPUs and multi-threading capabilities of coprocessors including NVidia GPUs and Intel Xeon Phi. The characteristics of these architectures are very different in terms of the vectorization depth, parallelization needed to achieve optimal performance or memory access latency and speed. An additional challenge is to avoid the code duplication often inherent to supporting heterogeneous platforms. In this paper we present the first experience of vectorizing electromagnetic physics models developed for the GeantV project.

  13. DOE Office of Scientific and Technical Information (OSTI.GOV)

    Attinella, John E.; Davis, Kristan D.; Musselman, Roy G.

    Methods, apparatuses, and computer program products for servicing a globally broadcast interrupt signal in a multi-threaded computer comprising a plurality of processor threads. Embodiments include an interrupt controller indicating in a plurality of local interrupt status locations that a globally broadcast interrupt signal has been received by the interrupt controller. Embodiments also include a thread determining that a local interrupt status location corresponding to the thread indicates that the globally broadcast interrupt signal has been received by the interrupt controller. Embodiments also include the thread processing one or more entries in a global interrupt status bit queue based on whethermore » global interrupt status bits associated with the globally broadcast interrupt signal are locked. Each entry in the global interrupt status bit queue corresponds to a queued global interrupt.« less

  14. A Metascalable Computing Framework for Large Spatiotemporal-Scale Atomistic Simulations

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Nomura, K; Seymour, R; Wang, W

    2009-02-17

    A metascalable (or 'design once, scale on new architectures') parallel computing framework has been developed for large spatiotemporal-scale atomistic simulations of materials based on spatiotemporal data locality principles, which is expected to scale on emerging multipetaflops architectures. The framework consists of: (1) an embedded divide-and-conquer (EDC) algorithmic framework based on spatial locality to design linear-scaling algorithms for high complexity problems; (2) a space-time-ensemble parallel (STEP) approach based on temporal locality to predict long-time dynamics, while introducing multiple parallelization axes; and (3) a tunable hierarchical cellular decomposition (HCD) parallelization framework to map these O(N) algorithms onto a multicore cluster based onmore » hybrid implementation combining message passing and critical section-free multithreading. The EDC-STEP-HCD framework exposes maximal concurrency and data locality, thereby achieving: (1) inter-node parallel efficiency well over 0.95 for 218 billion-atom molecular-dynamics and 1.68 trillion electronic-degrees-of-freedom quantum-mechanical simulations on 212,992 IBM BlueGene/L processors (superscalability); (2) high intra-node, multithreading parallel efficiency (nanoscalability); and (3) nearly perfect time/ensemble parallel efficiency (eon-scalability). The spatiotemporal scale covered by MD simulation on a sustained petaflops computer per day (i.e. petaflops {center_dot} day of computing) is estimated as NT = 2.14 (e.g. N = 2.14 million atoms for T = 1 microseconds).« less

  15. Two-level main memory co-design: Multi-threaded algorithmic primitives, analysis, and simulation

    DOE PAGES

    Bender, Michael A.; Berry, Jonathan W.; Hammond, Simon D.; ...

    2017-01-03

    A challenge in computer architecture is that processors often cannot be fed data from DRAM as fast as CPUs can consume it. Therefore, many applications are memory-bandwidth bound. With this motivation and the realization that traditional architectures (with all DRAM reachable only via bus) are insufficient to feed groups of modern processing units, vendors have introduced a variety of non-DDR 3D memory technologies (Hybrid Memory Cube (HMC),Wide I/O 2, High Bandwidth Memory (HBM)). These offer higher bandwidth and lower power by stacking DRAM chips on the processor or nearby on a silicon interposer. We will call these solutions “near-memory,” andmore » if user-addressable, “scratchpad.” High-performance systems on the market now offer two levels of main memory: near-memory on package and traditional DRAM further away. In the near term we expect the latencies near-memory and DRAM to be similar. Here, it is natural to think of near-memory as another module on the DRAM level of the memory hierarchy. Vendors are expected to offer modes in which the near memory is used as cache, but we believe that this will be inefficient.« less

  16. Parallelization of the Flow Field Dependent Variation Scheme for Solving the Triple Shock/Boundary Layer Interaction Problem

    NASA Technical Reports Server (NTRS)

    Schunk, Richard Gregory; Chung, T. J.

    2001-01-01

    A parallelized version of the Flowfield Dependent Variation (FDV) Method is developed to analyze a problem of current research interest, the flowfield resulting from a triple shock/boundary layer interaction. Such flowfields are often encountered in the inlets of high speed air-breathing vehicles including the NASA Hyper-X research vehicle. In order to resolve the complex shock structure and to provide adequate resolution for boundary layer computations of the convective heat transfer from surfaces inside the inlet, models containing over 500,000 nodes are needed. Efficient parallelization of the computation is essential to achieving results in a timely manner. Results from a parallelization scheme, based upon multi-threading, as implemented on multiple processor supercomputers and workstations is presented.

  17. muBLASTP: database-indexed protein sequence search on multicore CPUs.

    PubMed

    Zhang, Jing; Misra, Sanchit; Wang, Hao; Feng, Wu-Chun

    2016-11-04

    The Basic Local Alignment Search Tool (BLAST) is a fundamental program in the life sciences that searches databases for sequences that are most similar to a query sequence. Currently, the BLAST algorithm utilizes a query-indexed approach. Although many approaches suggest that sequence search with a database index can achieve much higher throughput (e.g., BLAT, SSAHA, and CAFE), they cannot deliver the same level of sensitivity as the query-indexed BLAST, i.e., NCBI BLAST, or they can only support nucleotide sequence search, e.g., MegaBLAST. Due to different challenges and characteristics between query indexing and database indexing, the existing techniques for query-indexed search cannot be used into database indexed search. muBLASTP, a novel database-indexed BLAST for protein sequence search, delivers identical hits returned to NCBI BLAST. On Intel Haswell multicore CPUs, for a single query, the single-threaded muBLASTP achieves up to a 4.41-fold speedup for alignment stages, and up to a 1.75-fold end-to-end speedup over single-threaded NCBI BLAST. For a batch of queries, the multithreaded muBLASTP achieves up to a 5.7-fold speedups for alignment stages, and up to a 4.56-fold end-to-end speedup over multithreaded NCBI BLAST. With a newly designed index structure for protein database and associated optimizations in BLASTP algorithm, we re-factored BLASTP algorithm for modern multicore processors that achieves much higher throughput with acceptable memory footprint for the database index.

  18. Numericware i: Identical by State Matrix Calculator

    PubMed Central

    Kim, Bongsong; Beavis, William D

    2017-01-01

    We introduce software, Numericware i, to compute identical by state (IBS) matrix based on genotypic data. Calculating an IBS matrix with a large dataset requires large computer memory and takes lengthy processing time. Numericware i addresses these challenges with 2 algorithmic methods: multithreading and forward chopping. The multithreading allows computational routines to concurrently run on multiple central processing unit (CPU) processors. The forward chopping addresses memory limitation by dividing a dataset into appropriately sized subsets. Numericware i allows calculation of the IBS matrix for a large genotypic dataset using a laptop or a desktop computer. For comparison with different software, we calculated genetic relationship matrices using Numericware i, SPAGeDi, and TASSEL with the same genotypic dataset. Numericware i calculates IBS coefficients between 0 and 2, whereas SPAGeDi and TASSEL produce different ranges of values including negative values. The Pearson correlation coefficient between the matrices from Numericware i and TASSEL was high at .9972, whereas SPAGeDi showed low correlation with Numericware i (.0505) and TASSEL (.0587). With a high-dimensional dataset of 500 entities by 10 000 000 SNPs, Numericware i spent 382 minutes using 19 CPU threads and 64 GB memory by dividing the dataset into 3 pieces, whereas SPAGeDi and TASSEL failed with the same dataset. Numericware i is freely available for Windows and Linux under CC-BY 4.0 license at https://figshare.com/s/f100f33a8857131eb2db. PMID:28469375

  19. High-performance ultra-low power VLSI analog processor for data compression

    NASA Technical Reports Server (NTRS)

    Tawel, Raoul (Inventor)

    1996-01-01

    An apparatus for data compression employing a parallel analog processor. The apparatus includes an array of processor cells with N columns and M rows wherein the processor cells have an input device, memory device, and processor device. The input device is used for inputting a series of input vectors. Each input vector is simultaneously input into each column of the array of processor cells in a pre-determined sequential order. An input vector is made up of M components, ones of which are input into ones of M processor cells making up a column of the array. The memory device is used for providing ones of M components of a codebook vector to ones of the processor cells making up a column of the array. A different codebook vector is provided to each of the N columns of the array. The processor device is used for simultaneously comparing the components of each input vector to corresponding components of each codebook vector, and for outputting a signal representative of the closeness between the compared vector components. A combination device is used to combine the signal output from each processor cell in each column of the array and to output a combined signal. A closeness determination device is then used for determining which codebook vector is closest to an input vector from the combined signals, and for outputting a codebook vector index indicating which of the N codebook vectors was the closest to each input vector input into the array.

  20. Hybrid Optimization Parallel Search PACKage

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    2009-11-10

    HOPSPACK is open source software for solving optimization problems without derivatives. Application problems may have a fully nonlinear objective function, bound constraints, and linear and nonlinear constraints. Problem variables may be continuous, integer-valued, or a mixture of both. The software provides a framework that supports any derivative-free type of solver algorithm. Through the framework, solvers request parallel function evaluation, which may use MPI (multiple machines) or multithreading (multiple processors/cores on one machine). The framework provides a Cache and Pending Cache of saved evaluations that reduces execution time and facilitates restarts. Solvers can dynamically create other algorithms to solve subproblems, amore » useful technique for handling multiple start points and integer-valued variables. HOPSPACK ships with the Generating Set Search (GSS) algorithm, developed at Sandia as part of the APPSPACK open source software project.« less

  1. Application of convolve-multiply-convolve SAW processor for satellite communications

    NASA Technical Reports Server (NTRS)

    Lie, Y. S.; Ching, M.

    1991-01-01

    There is a need for a satellite communications receiver than can perform simultaneous multi-channel processing of single channel per carrier (SCPC) signals originating from various small (mobile or fixed) earth stations. The number of ground users can be as many as 1000. Conventional techniques of simultaneously processing these signals is by employing as many RF-bandpass filters as the number of channels. Consequently, such an approach would result in a bulky receiver, which becomes impractical for satellite applications. A unique approach utilizing a realtime surface acoustic wave (SAW) chirp transform processor is presented. The application of a Convolve-Multiply-Convolve (CMC) chirp transform processor is described. The CMC processor transforms each input channel into a unique timeslot, while preserving its modulation content (in this case QPSK). Subsequently, each channel is individually demodulated without the need of input channel filters. Circuit complexity is significantly reduced, because the output frequency of the CMC processor is common for all input channel frequencies. The results of theoretical analysis and experimental results are in good agreement.

  2. Verifying speculative multithreading in an application

    DOEpatents

    Felton, Mitchell D

    2014-12-09

    Verifying speculative multithreading in an application executing in a computing system, including: executing one or more test instructions serially thereby producing a serial result, including insuring that all data dependencies among the test instructions are satisfied; executing the test instructions speculatively in a plurality of threads thereby producing a speculative result; and determining whether a speculative multithreading error exists including: comparing the serial result to the speculative result and, if the serial result does not match the speculative result, determining that a speculative multithreading error exists.

  3. Verifying speculative multithreading in an application

    DOEpatents

    Felton, Mitchell D

    2014-11-18

    Verifying speculative multithreading in an application executing in a computing system, including: executing one or more test instructions serially thereby producing a serial result, including insuring that all data dependencies among the test instructions are satisfied; executing the test instructions speculatively in a plurality of threads thereby producing a speculative result; and determining whether a speculative multithreading error exists including: comparing the serial result to the speculative result and, if the serial result does not match the speculative result, determining that a speculative multithreading error exists.

  4. FPGA-Based Reconfigurable Processor for Ultrafast Interlaced Ultrasound and Photoacoustic Imaging

    PubMed Central

    Alqasemi, Umar; Li, Hai; Aguirre, Andrés; Zhu, Quing

    2016-01-01

    In this paper, we report, to the best of our knowledge, a unique field-programmable gate array (FPGA)-based reconfigurable processor for real-time interlaced co-registered ultrasound and photoacoustic imaging and its application in imaging tumor dynamic response. The FPGA is used to control, acquire, store, delay-and-sum, and transfer the data for real-time co-registered imaging. The FPGA controls the ultrasound transmission and ultrasound and photoacoustic data acquisition process of a customized 16-channel module that contains all of the necessary analog and digital circuits. The 16-channel module is one of multiple modules plugged into a motherboard; their beamformed outputs are made available for a digital signal processor (DSP) to access using an external memory interface (EMIF). The FPGA performs a key role through ultrafast reconfiguration and adaptation of its structure to allow real-time switching between the two imaging modes, including transmission control, laser synchronization, internal memory structure, beamforming, and EMIF structure and memory size. It performs another role by parallel accessing of internal memories and multi-thread processing to reduce the transfer of data and the processing load on the DSP. Furthermore, because the laser will be pulsing even during ultrasound pulse-echo acquisition, the FPGA ensures that the laser pulses are far enough from the pulse-echo acquisitions by appropriate time-division multiplexing (TDM). A co-registered ultrasound and photoacoustic imaging system consisting of four FPGA modules (64-channels) is constructed, and its performance is demonstrated using phantom targets and in vivo mouse tumor models. PMID:22828830

  5. FPGA-based reconfigurable processor for ultrafast interlaced ultrasound and photoacoustic imaging.

    PubMed

    Alqasemi, Umar; Li, Hai; Aguirre, Andrés; Zhu, Quing

    2012-07-01

    In this paper, we report, to the best of our knowledge, a unique field-programmable gate array (FPGA)-based reconfigurable processor for real-time interlaced co-registered ultrasound and photoacoustic imaging and its application in imaging tumor dynamic response. The FPGA is used to control, acquire, store, delay-and-sum, and transfer the data for real-time co-registered imaging. The FPGA controls the ultrasound transmission and ultrasound and photoacoustic data acquisition process of a customized 16-channel module that contains all of the necessary analog and digital circuits. The 16-channel module is one of multiple modules plugged into a motherboard; their beamformed outputs are made available for a digital signal processor (DSP) to access using an external memory interface (EMIF). The FPGA performs a key role through ultrafast reconfiguration and adaptation of its structure to allow real-time switching between the two imaging modes, including transmission control, laser synchronization, internal memory structure, beamforming, and EMIF structure and memory size. It performs another role by parallel accessing of internal memories and multi-thread processing to reduce the transfer of data and the processing load on the DSP. Furthermore, because the laser will be pulsing even during ultrasound pulse-echo acquisition, the FPGA ensures that the laser pulses are far enough from the pulse-echo acquisitions by appropriate time-division multiplexing (TDM). A co-registered ultrasound and photoacoustic imaging system consisting of four FPGA modules (64-channels) is constructed, and its performance is demonstrated using phantom targets and in vivo mouse tumor models.

  6. Real-time video compressing under DSP/BIOS

    NASA Astrophysics Data System (ADS)

    Chen, Qiu-ping; Li, Gui-ju

    2009-10-01

    This paper presents real-time MPEG-4 Simple Profile video compressing based on the DSP processor. The programming framework of video compressing is constructed using TMS320C6416 Microprocessor, TDS510 simulator and PC. It uses embedded real-time operating system DSP/BIOS and the API functions to build periodic function, tasks and interruptions etcs. Realize real-time video compressing. To the questions of data transferring among the system. Based on the architecture of the C64x DSP, utilized double buffer switched and EDMA data transfer controller to transit data from external memory to internal, and realize data transition and processing at the same time; the architecture level optimizations are used to improve software pipeline. The system used DSP/BIOS to realize multi-thread scheduling. The whole system realizes high speed transition of a great deal of data. Experimental results show the encoder can realize real-time encoding of 768*576, 25 frame/s video images.

  7. A study of the parallel algorithm for large-scale DC simulation of nonlinear systems

    NASA Astrophysics Data System (ADS)

    Cortés Udave, Diego Ernesto; Ogrodzki, Jan; Gutiérrez de Anda, Miguel Angel

    Newton-Raphson DC analysis of large-scale nonlinear circuits may be an extremely time consuming process even if sparse matrix techniques and bypassing of nonlinear models calculation are used. A slight decrease in the time required for this task may be enabled on multi-core, multithread computers if the calculation of the mathematical models for the nonlinear elements as well as the stamp management of the sparse matrix entries are managed through concurrent processes. This numerical complexity can be further reduced via the circuit decomposition and parallel solution of blocks taking as a departure point the BBD matrix structure. This block-parallel approach may give a considerable profit though it is strongly dependent on the system topology and, of course, on the processor type. This contribution presents the easy-parallelizable decomposition-based algorithm for DC simulation and provides a detailed study of its effectiveness.

  8. SDR implementation of the receiver of adaptive communication system

    NASA Astrophysics Data System (ADS)

    Skarzynski, Jacek; Darmetko, Marcin; Kozlowski, Sebastian; Kurek, Krzysztof

    2016-04-01

    The paper presents software implementation of a receiver forming a part of an adaptive communication system. The system is intended for communication with a satellite placed in a low Earth orbit (LEO). The ability of adaptation is believed to increase the total amount of data transmitted from the satellite to the ground station. Depending on the signal-to-noise ratio (SNR) of the received signal, adaptive transmission is realized using different transmission modes, i.e., different modulation schemes (BPSK, QPSK, 8-PSK, and 16-APSK) and different convolutional code rates (1/2, 2/3, 3/4, 5/6, and 7/8). The receiver consists of a software-defined radio (SDR) module (National Instruments USRP-2920) and a multithread reception software running on Windows operating system. In order to increase the speed of signal processing, the software takes advantage of single instruction multiple data instructions supported by x86 processor architecture.

  9. BCYCLIC: A parallel block tridiagonal matrix cyclic solver

    NASA Astrophysics Data System (ADS)

    Hirshman, S. P.; Perumalla, K. S.; Lynch, V. E.; Sanchez, R.

    2010-09-01

    A block tridiagonal matrix is factored with minimal fill-in using a cyclic reduction algorithm that is easily parallelized. Storage of the factored blocks allows the application of the inverse to multiple right-hand sides which may not be known at factorization time. Scalability with the number of block rows is achieved with cyclic reduction, while scalability with the block size is achieved using multithreaded routines (OpenMP, GotoBLAS) for block matrix manipulation. This dual scalability is a noteworthy feature of this new solver, as well as its ability to efficiently handle arbitrary (non-powers-of-2) block row and processor numbers. Comparison with a state-of-the art parallel sparse solver is presented. It is expected that this new solver will allow many physical applications to optimally use the parallel resources on current supercomputers. Example usage of the solver in magneto-hydrodynamic (MHD), three-dimensional equilibrium solvers for high-temperature fusion plasmas is cited.

  10. Development and Evaluation of Vectorised and Multi-Core Event Reconstruction Algorithms within the CMS Software Framework

    NASA Astrophysics Data System (ADS)

    Hauth, T.; Innocente and, V.; Piparo, D.

    2012-12-01

    The processing of data acquired by the CMS detector at LHC is carried out with an object-oriented C++ software framework: CMSSW. With the increasing luminosity delivered by the LHC, the treatment of recorded data requires extraordinary large computing resources, also in terms of CPU usage. A possible solution to cope with this task is the exploitation of the features offered by the latest microprocessor architectures. Modern CPUs present several vector units, the capacity of which is growing steadily with the introduction of new processor generations. Moreover, an increasing number of cores per die is offered by the main vendors, even on consumer hardware. Most recent C++ compilers provide facilities to take advantage of such innovations, either by explicit statements in the programs sources or automatically adapting the generated machine instructions to the available hardware, without the need of modifying the existing code base. Programming techniques to implement reconstruction algorithms and optimised data structures are presented, that aim to scalable vectorization and parallelization of the calculations. One of their features is the usage of new language features of the C++11 standard. Portions of the CMSSW framework are illustrated which have been found to be especially profitable for the application of vectorization and multi-threading techniques. Specific utility components have been developed to help vectorization and parallelization. They can easily become part of a larger common library. To conclude, careful measurements are described, which show the execution speedups achieved via vectorised and multi-threaded code in the context of CMSSW.

  11. Method and apparatus of parallel computing with simultaneously operating stream prefetching and list prefetching engines

    DOEpatents

    Boyle, Peter A.; Christ, Norman H.; Gara, Alan; Mawhinney, Robert D.; Ohmacht, Martin; Sugavanam, Krishnan

    2012-12-11

    A prefetch system improves a performance of a parallel computing system. The parallel computing system includes a plurality of computing nodes. A computing node includes at least one processor and at least one memory device. The prefetch system includes at least one stream prefetch engine and at least one list prefetch engine. The prefetch system operates those engines simultaneously. After the at least one processor issues a command, the prefetch system passes the command to a stream prefetch engine and a list prefetch engine. The prefetch system operates the stream prefetch engine and the list prefetch engine to prefetch data to be needed in subsequent clock cycles in the processor in response to the passed command.

  12. Interprocessor bus switching system for simultaneous communication in plural bus parallel processing system

    DOEpatents

    Atac, R.; Fischler, M.S.; Husby, D.E.

    1991-01-15

    A bus switching apparatus and method for multiple processor computer systems comprises a plurality of bus switches interconnected by branch buses. Each processor or other module of the system is connected to a spigot of a bus switch. Each bus switch also serves as part of a backplane of a modular crate hardware package. A processor initiates communication with another processor by identifying that other processor. The bus switch to which the initiating processor is connected identifies and secures, if possible, a path to that other processor, either directly or via one or more other bus switches which operate similarly. If a particular desired path through a given bus switch is not available to be used, an alternate path is considered, identified and secured. 11 figures.

  13. Interprocessor bus switching system for simultaneous communication in plural bus parallel processing system

    DOEpatents

    Atac, Robert; Fischler, Mark S.; Husby, Donald E.

    1991-01-01

    A bus switching apparatus and method for multiple processor computer systems comprises a plurality of bus switches interconnected by branch buses. Each processor or other module of the system is connected to a spigot of a bus switch. Each bus switch also serves as part of a backplane of a modular crate hardware package. A processor initiates communication with another processor by identifying that other processor. The bus switch to which the initiating processor is connected identifies and secures, if possible, a path to that other processor, either directly or via one or more other bus switches which operate similarly. If a particular desired path through a given bus switch is not available to be used, an alternate path is considered, identified and secured.

  14. PAVENET OS: A Compact Hard Real-Time Operating System for Precise Sampling in Wireless Sensor Networks

    NASA Astrophysics Data System (ADS)

    Saruwatari, Shunsuke; Suzuki, Makoto; Morikawa, Hiroyuki

    The paper shows a compact hard real-time operating system for wireless sensor nodes called PAVENET OS. PAVENET OS provides hybrid multithreading: preemptive multithreading and cooperative multithreading. Both of the multithreading are optimized for two kinds of tasks on wireless sensor networks, and those are real-time tasks and best-effort ones. PAVENET OS can efficiently perform hard real-time tasks that cannot be performed by TinyOS. The paper demonstrates the hybrid multithreading realizes compactness and low overheads, which are comparable to those of TinyOS, through quantitative evaluation. The evaluation results show PAVENET OS performs 100 Hz sensor sampling with 0.01% jitter while performing wireless communication tasks, whereas optimized TinyOS has 0.62% jitter. In addition, PAVENET OS has a small footprint and low overheads (minimum RAM size: 29 bytes, minimum ROM size: 490 bytes, minimum task switch time: 23 cycles).

  15. SMT-Aware Instantaneous Footprint Optimization

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Roy, Probir; Liu, Xu; Song, Shuaiwen

    Modern architectures employ simultaneous multithreading (SMT) to increase thread-level parallelism. SMT threads share many functional units and the whole memory hierarchy of a physical core. Without a careful code design, SMT threads can easily contend with each other for these shared resources, causing severe performance degradation. Minimizing SMT thread contention for HPC applications running on dedicated platforms is very challenging, because they usually spawn threads within Single Program Multiple Data (SPMD) models. To address this important issue, we introduce a simple scheme for SMT-aware code optimization, which aims to reduce the memory contention across SMT threads.

  16. Harmony search optimization for HDR prostate brachytherapy

    NASA Astrophysics Data System (ADS)

    Panchal, Aditya

    In high dose-rate (HDR) prostate brachytherapy, multiple catheters are inserted interstitially into the target volume. The process of treating the prostate involves calculating and determining the best dose distribution to the target and organs-at-risk by means of optimizing the time that the radioactive source dwells at specified positions within the catheters. It is the goal of this work to investigate the use of a new optimization algorithm, known as Harmony Search, in order to optimize dwell times for HDR prostate brachytherapy. The new algorithm was tested on 9 different patients and also compared with the genetic algorithm. Simulations were performed to determine the optimal value of the Harmony Search parameters. Finally, multithreading of the simulation was examined to determine potential benefits. First, a simulation environment was created using the Python programming language and the wxPython graphical interface toolkit, which was necessary to run repeated optimizations. DICOM RT data from Varian BrachyVision was parsed and used to obtain patient anatomy and HDR catheter information. Once the structures were indexed, the volume of each structure was determined and compared to the original volume calculated in BrachyVision for validation. Dose was calculated using the AAPM TG-43 point source model of the GammaMed 192Ir HDR source and was validated against Varian BrachyVision. A DVH-based objective function was created and used for the optimization simulation. Harmony Search and the genetic algorithm were implemented as optimization algorithms for the simulation and were compared against each other. The optimal values for Harmony Search parameters (Harmony Memory Size [HMS], Harmony Memory Considering Rate [HMCR], and Pitch Adjusting Rate [PAR]) were also determined. Lastly, the simulation was modified to use multiple threads of execution in order to achieve faster computational times. Experimental results show that the volume calculation that was implemented in this thesis was within 2% of the values computed by Varian BrachyVision for the prostate, within 3% for the rectum and bladder and 6% for the urethra. The calculation of dose compared to BrachyVision was determined to be different by only 0.38%. Isodose curves were also generated and were found to be similar to BrachyVision. The comparison between Harmony Search and genetic algorithm showed that Harmony Search was over 4 times faster when compared over multiple data sets. The optimal Harmony Memory Size was found to be 5 or lower; the Harmony Memory Considering Rate was determined to be 0.95, and the Pitch Adjusting Rate was found to be 0.9. Ultimately, the effect of multithreading showed that as intensive computations such as optimization and dose calculation are involved, the threads of execution scale with the number of processors, achieving a speed increase proportional to the number of processor cores. In conclusion, this work showed that Harmony Search is a viable alternative to existing algorithms for use in HDR prostate brachytherapy optimization. Coupled with the optimal parameters for the algorithm and a multithreaded simulation, this combination has the capability to significantly decrease the time spent on minimizing optimization problems in the clinic that are time intensive, such as brachytherapy, IMRT and beam angle optimization.

  17. A Modular Pipelined Processor for High Resolution Gamma-Ray Spectroscopy

    NASA Astrophysics Data System (ADS)

    Veiga, Alejandro; Grunfeld, Christian

    2016-02-01

    The design of a digital signal processor for gamma-ray applications is presented in which a single ADC input can simultaneously provide temporal and energy characterization of gamma radiation for a wide range of applications. Applying pipelining techniques, the processor is able to manage and synchronize very large volumes of streamed real-time data. Its modular user interface provides a flexible environment for experimental design. The processor can fit in a medium-sized FPGA device operating at ADC sampling frequency, providing an efficient solution for multi-channel applications. Two experiments are presented in order to characterize its temporal and energy resolution.

  18. Data acquisition using the 168/E. [CERN ISR

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Carroll, J.T.; Cittolin, S.; Demoulin, M.

    1983-03-01

    Event sizes and data rates at the CERN anti p p collider compose a formidable environment for a high level trigger. A system using three 168/E processors for experiment UA1 real-time event selection is described. With 168/E data memory expanded to 512K bytes, each processor holds a complete event allowing a FORTRAN trigger algorithm access to data from the entire detector. A smart CAMAC interface reads five Remus branches in parallel transferring one word to the target processor every 0.5 ..mu..s. The NORD host computer can simultaneously read an accepted event from another processor.

  19. Message passing interface and multithreading hybrid for parallel molecular docking of large databases on petascale high performance computing machines.

    PubMed

    Zhang, Xiaohua; Wong, Sergio E; Lightstone, Felice C

    2013-04-30

    A mixed parallel scheme that combines message passing interface (MPI) and multithreading was implemented in the AutoDock Vina molecular docking program. The resulting program, named VinaLC, was tested on the petascale high performance computing (HPC) machines at Lawrence Livermore National Laboratory. To exploit the typical cluster-type supercomputers, thousands of docking calculations were dispatched by the master process to run simultaneously on thousands of slave processes, where each docking calculation takes one slave process on one node, and within the node each docking calculation runs via multithreading on multiple CPU cores and shared memory. Input and output of the program and the data handling within the program were carefully designed to deal with large databases and ultimately achieve HPC on a large number of CPU cores. Parallel performance analysis of the VinaLC program shows that the code scales up to more than 15K CPUs with a very low overhead cost of 3.94%. One million flexible compound docking calculations took only 1.4 h to finish on about 15K CPUs. The docking accuracy of VinaLC has been validated against the DUD data set by the re-docking of X-ray ligands and an enrichment study, 64.4% of the top scoring poses have RMSD values under 2.0 Å. The program has been demonstrated to have good enrichment performance on 70% of the targets in the DUD data set. An analysis of the enrichment factors calculated at various percentages of the screening database indicates VinaLC has very good early recovery of actives. Copyright © 2013 Wiley Periodicals, Inc.

  20. Optical systolic array processor using residue arithmetic

    NASA Technical Reports Server (NTRS)

    Jackson, J.; Casasent, D.

    1983-01-01

    The use of residue arithmetic to increase the accuracy and reduce the dynamic range requirements of optical matrix-vector processors is evaluated. It is determined that matrix-vector operations and iterative algorithms can be performed totally in residue notation. A new parallel residue quantizer circuit is developed which significantly improves the performance of the systolic array feedback processor. Results are presented of a computer simulation of this system used to solve a set of three simultaneous equations.

  1. MULTI-CORE AND OPTICAL PROCESSOR RELATED APPLICATIONS RESEARCH AT OAK RIDGE NATIONAL LABORATORY

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Barhen, Jacob; Kerekes, Ryan A; ST Charles, Jesse Lee

    2008-01-01

    High-speed parallelization of common tasks holds great promise as a low-risk approach to achieving the significant increases in signal processing and computational performance required for next generation innovations in reconfigurable radio systems. Researchers at the Oak Ridge National Laboratory have been working on exploiting the parallelization offered by this emerging technology and applying it to a variety of problems. This paper will highlight recent experience with four different parallel processors applied to signal processing tasks that are directly relevant to signal processing required for SDR/CR waveforms. The first is the EnLight Optical Core Processor applied to matched filter (MF) correlationmore » processing via fast Fourier transform (FFT) of broadband Dopplersensitive waveforms (DSW) using active sonar arrays for target tracking. The second is the IBM CELL Broadband Engine applied to 2-D discrete Fourier transform (DFT) kernel for image processing and frequency domain processing. And the third is the NVIDIA graphical processor applied to document feature clustering. EnLight Optical Core Processor. Optical processing is inherently capable of high-parallelism that can be translated to very high performance, low power dissipation computing. The EnLight 256 is a small form factor signal processing chip (5x5 cm2) with a digital optical core that is being developed by an Israeli startup company. As part of its evaluation of foreign technology, ORNL's Center for Engineering Science Advanced Research (CESAR) had access to a precursor EnLight 64 Alpha hardware for a preliminary assessment of capabilities in terms of large Fourier transforms for matched filter banks and on applications related to Doppler-sensitive waveforms. This processor is optimized for array operations, which it performs in fixed-point arithmetic at the rate of 16 TeraOPS at 8-bit precision. This is approximately 1000 times faster than the fastest DSP available today. The optical core performs the matrix-vector multiplications, where the nominal matrix size is 256x256. The system clock is 125MHz. At each clock cycle, 128K multiply-and-add operations per second (OPS) are carried out, which yields a peak performance of 16 TeraOPS. IBM Cell Broadband Engine. The Cell processor is the extraordinary resulting product of 5 years of sustained, intensive R&D collaboration (involving over $400M investment) between IBM, Sony, and Toshiba. Its architecture comprises one multithreaded 64-bit PowerPC processor element (PPE) with VMX capabilities and two levels of globally coherent cache, and 8 synergistic processor elements (SPEs). Each SPE consists of a processor (SPU) designed for streaming workloads, local memory, and a globally coherent direct memory access (DMA) engine. Computations are performed in 128-bit wide single instruction multiple data streams (SIMD). An integrated high-bandwidth element interconnect bus (EIB) connects the nine processors and their ports to external memory and to system I/O. The Applied Software Engineering Research (ASER) Group at the ORNL is applying the Cell to a variety of text and image analysis applications. Research on Cell-equipped PlayStation3 (PS3) consoles has led to the development of a correlation-based image recognition engine that enables a single PS3 to process images at more than 10X the speed of state-of-the-art single-core processors. NVIDIA Graphics Processing Units. The ASER group is also employing the latest NVIDIA graphical processing units (GPUs) to accelerate clustering of thousands of text documents using recently developed clustering algorithms such as document flocking and affinity propagation.« less

  2. Parallel Task Management Library for MARTe

    NASA Astrophysics Data System (ADS)

    Valcarcel, Daniel F.; Alves, Diogo; Neto, Andre; Reux, Cedric; Carvalho, Bernardo B.; Felton, Robert; Lomas, Peter J.; Sousa, Jorge; Zabeo, Luca

    2014-06-01

    The Multithreaded Application Real-Time executor (MARTe) is a real-time framework with increasing popularity and support in the thermonuclear fusion community. It allows modular code to run in a multi-threaded environment leveraging on the current multi-core processor (CPU) technology. One application that relies on the MARTe framework is the Joint European Torus (JET) tokamak WAll Load Limiter System (WALLS). It calculates and monitors the temperature on metal tiles and plasma facing components (PFCs) that can melt or flake if their temperature gets too high when exposed to power loads. One of the main time consuming tasks in WALLS is the calculation of thermal diffusion models in real-time. These models tend to be described by very large state-space models thus making them perfect candidates for parallelisation. MARTe's traditional approach for task parallelisation is to split the problem into several Real-Time Threads, each responsible for a self-contained sequential execution of an input-to-output chain. This is usually possible, but it might not always be practical for algorithmic or technical reasons. Also, it might not be easily scalable with an increase in the number of available CPU cores. The WorkLibrary introduces a “GPU-like approach” of splitting work among the available cores of modern CPUs that is (i) straightforward to use in an application, (ii) scalable with the availability of cores and all of this (iii) without rewriting or recompiling the source code. The first part of this article explains the motivation behind the library, its architecture and implementation. The second part presents a real application for WALLS, a parallel version of a large state-space model describing the 2D thermal diffusion on a JET tile.

  3. Enabling Graph Appliance for Genome Assembly

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Singh, Rina; Graves, Jeffrey A; Lee, Sangkeun

    2015-01-01

    In recent years, there has been a huge growth in the amount of genomic data available as reads generated from various genome sequencers. The number of reads generated can be huge, ranging from hundreds to billions of nucleotide, each varying in size. Assembling such large amounts of data is one of the challenging computational problems for both biomedical and data scientists. Most of the genome assemblers developed have used de Bruijn graph techniques. A de Bruijn graph represents a collection of read sequences by billions of vertices and edges, which require large amounts of memory and computational power to storemore » and process. This is the major drawback to de Bruijn graph assembly. Massively parallel, multi-threaded, shared memory systems can be leveraged to overcome some of these issues. The objective of our research is to investigate the feasibility and scalability issues of de Bruijn graph assembly on Cray s Urika-GD system; Urika-GD is a high performance graph appliance with a large shared memory and massively multithreaded custom processor designed for executing SPARQL queries over large-scale RDF data sets. However, to the best of our knowledge, there is no research on representing a de Bruijn graph as an RDF graph or finding Eulerian paths in RDF graphs using SPARQL for potential genome discovery. In this paper, we address the issues involved in representing a de Bruin graphs as RDF graphs and propose an iterative querying approach for finding Eulerian paths in large RDF graphs. We evaluate the performance of our implementation on real world ebola genome datasets and illustrate how genome assembly can be accomplished with Urika-GD using iterative SPARQL queries.« less

  4. Time-partitioning simulation models for calculation on parallel computers

    NASA Technical Reports Server (NTRS)

    Milner, Edward J.; Blech, Richard A.; Chima, Rodrick V.

    1987-01-01

    A technique allowing time-staggered solution of partial differential equations is presented in this report. Using this technique, called time-partitioning, simulation execution speedup is proportional to the number of processors used because all processors operate simultaneously, with each updating of the solution grid at a different time point. The technique is limited by neither the number of processors available nor by the dimension of the solution grid. Time-partitioning was used to obtain the flow pattern through a cascade of airfoils, modeled by the Euler partial differential equations. An execution speedup factor of 1.77 was achieved using a two processor Cray X-MP/24 computer.

  5. Multi-threaded Event Processing with DANA

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    David Lawrence; Elliott Wolin

    2007-05-14

    The C++ data analysis framework DANA has been written to support the next generation of Nuclear Physics experiments at Jefferson Lab commensurate with the anticipated 12GeV upgrade. The DANA framework was designed to allow multi-threaded event processing with a minimal impact on developers of reconstruction software. This document describes how DANA implements multi-threaded event processing and compares it to simply running multiple instances of a program. Also presented are relative reconstruction rates for Pentium4, Xeon, and Opteron based machines.

  6. The Regional Hydrologic Extremes Assessment System: A software framework for hydrologic modeling and data assimilation

    PubMed Central

    Das, Narendra; Stampoulis, Dimitrios; Ines, Amor; Fisher, Joshua B.; Granger, Stephanie; Kawata, Jessie; Han, Eunjin; Behrangi, Ali

    2017-01-01

    The Regional Hydrologic Extremes Assessment System (RHEAS) is a prototype software framework for hydrologic modeling and data assimilation that automates the deployment of water resources nowcasting and forecasting applications. A spatially-enabled database is a key component of the software that can ingest a suite of satellite and model datasets while facilitating the interfacing with Geographic Information System (GIS) applications. The datasets ingested are obtained from numerous space-borne sensors and represent multiple components of the water cycle. The object-oriented design of the software allows for modularity and extensibility, showcased here with the coupling of the core hydrologic model with a crop growth model. RHEAS can exploit multi-threading to scale with increasing number of processors, while the database allows delivery of data products and associated uncertainty through a variety of GIS platforms. A set of three example implementations of RHEAS in the United States and Kenya are described to demonstrate the different features of the system in real-world applications. PMID:28545077

  7. The Regional Hydrologic Extremes Assessment System: A software framework for hydrologic modeling and data assimilation.

    PubMed

    Andreadis, Konstantinos M; Das, Narendra; Stampoulis, Dimitrios; Ines, Amor; Fisher, Joshua B; Granger, Stephanie; Kawata, Jessie; Han, Eunjin; Behrangi, Ali

    2017-01-01

    The Regional Hydrologic Extremes Assessment System (RHEAS) is a prototype software framework for hydrologic modeling and data assimilation that automates the deployment of water resources nowcasting and forecasting applications. A spatially-enabled database is a key component of the software that can ingest a suite of satellite and model datasets while facilitating the interfacing with Geographic Information System (GIS) applications. The datasets ingested are obtained from numerous space-borne sensors and represent multiple components of the water cycle. The object-oriented design of the software allows for modularity and extensibility, showcased here with the coupling of the core hydrologic model with a crop growth model. RHEAS can exploit multi-threading to scale with increasing number of processors, while the database allows delivery of data products and associated uncertainty through a variety of GIS platforms. A set of three example implementations of RHEAS in the United States and Kenya are described to demonstrate the different features of the system in real-world applications.

  8. Precise and Efficient Static Array Bound Checking for Large Embedded C Programs

    NASA Technical Reports Server (NTRS)

    Venet, Arnaud

    2004-01-01

    In this paper we describe the design and implementation of a static array-bound checker for a family of embedded programs: the flight control software of recent Mars missions. These codes are large (up to 250 KLOC), pointer intensive, heavily multithreaded and written in an object-oriented style, which makes their analysis very challenging. We designed a tool called C Global Surveyor (CGS) that can analyze the largest code in a couple of hours with a precision of 80%. The scalability and precision of the analyzer are achieved by using an incremental framework in which a pointer analysis and a numerical analysis of array indices mutually refine each other. CGS has been designed so that it can distribute the analysis over several processors in a cluster of machines. To the best of our knowledge this is the first distributed implementation of static analysis algorithms. Throughout the paper we will discuss the scalability setbacks that we encountered during the construction of the tool and their impact on the initial design decisions.

  9. Iterative color-multiplexed, electro-optical processor.

    PubMed

    Psaltis, D; Casasent, D; Carlotto, M

    1979-11-01

    A noncoherent optical vector-matrix multiplier using a linear LED source array and a linear P-I-N photodiode detector array has been combined with a 1-D adder in a feedback loop. The resultant iterative optical processor and its use in solving simultaneous linear equations are described. Operation on complex data is provided by a novel color-multiplexing system.

  10. A Survey of Recent MARTe Based Systems

    NASA Astrophysics Data System (ADS)

    Neto, André C.; Alves, Diogo; Boncagni, Luca; Carvalho, Pedro J.; Valcarcel, Daniel F.; Barbalace, Antonio; De Tommasi, Gianmaria; Fernandes, Horácio; Sartori, Filippo; Vitale, Enzo; Vitelli, Riccardo; Zabeo, Luca

    2011-08-01

    The Multithreaded Application Real-Time executor (MARTe) is a data driven framework environment for the development and deployment of real-time control algorithms. The main ideas which led to the present version of the framework were to standardize the development of real-time control systems, while providing a set of strictly bounded standard interfaces to the outside world and also accommodating a collection of facilities which promote the speed and ease of development, commissioning and deployment of such systems. At the core of every MARTe based application, is a set of independent inter-communicating software blocks, named Generic Application Modules (GAM), orchestrated by a real-time scheduler. The platform independence of its core library provides MARTe the necessary robustness and flexibility for conveniently testing applications in different environments including non-real-time operating systems. MARTe is already being used in several machines, each with its own peculiarities regarding hardware interfacing, supervisory control configuration, operating system and target control application. This paper presents and compares the most recent results of systems using MARTe: the JET Vertical Stabilization system, which uses the Real Time Application Interface (RTAI) operating system on Intel multi-core processors; the COMPASS plasma control system, driven by Linux RT also on Intel multi-core processors; ISTTOK real-time tomography equilibrium reconstruction which shares the same support configuration of COMPASS; JET error field correction coils based on VME, PowerPC and VxWorks; FTU LH reflected power system running on VME, Intel with RTAI.

  11. AthenaMT: upgrading the ATLAS software framework for the many-core world with multi-threading

    NASA Astrophysics Data System (ADS)

    Leggett, Charles; Baines, John; Bold, Tomasz; Calafiura, Paolo; Farrell, Steven; van Gemmeren, Peter; Malon, David; Ritsch, Elmar; Stewart, Graeme; Snyder, Scott; Tsulaia, Vakhtang; Wynne, Benjamin; ATLAS Collaboration

    2017-10-01

    ATLAS’s current software framework, Gaudi/Athena, has been very successful for the experiment in LHC Runs 1 and 2. However, its single threaded design has been recognized for some time to be increasingly problematic as CPUs have increased core counts and decreased available memory per core. Even the multi-process version of Athena, AthenaMP, will not scale to the range of architectures we expect to use beyond Run2. After concluding a rigorous requirements phase, where many design components were examined in detail, ATLAS has begun the migration to a new data-flow driven, multi-threaded framework, which enables the simultaneous processing of singleton, thread unsafe legacy Algorithms, cloned Algorithms that execute concurrently in their own threads with different Event contexts, and fully re-entrant, thread safe Algorithms. In this paper we report on the process of modifying the framework to safely process multiple concurrent events in different threads, which entails significant changes in the underlying handling of features such as event and time dependent data, asynchronous callbacks, metadata, integration with the online High Level Trigger for partial processing in certain regions of interest, concurrent I/O, as well as ensuring thread safety of core services. We also report on upgrading the framework to handle Algorithms that are fully re-entrant.

  12. Beaver-mediated lateral hydrologic connectivity, fluvial carbon and nutrient flux, and aquatic ecosystem metabolism

    NASA Astrophysics Data System (ADS)

    Wegener, Pam; Covino, Tim; Wohl, Ellen

    2017-06-01

    River networks that drain mountain landscapes alternate between narrow and wide valley segments. Within the wide segments, beaver activity can facilitate the development and maintenance of complex, multithread planform. Because the narrow segments have limited ability to retain water, carbon, and nutrients, the wide, multithread segments are likely important locations of retention. We evaluated hydrologic dynamics, nutrient flux, and aquatic ecosystem metabolism along two adjacent segments of a river network in the Rocky Mountains, Colorado: (1) a wide, multithread segment with beaver activity; and, (2) an adjacent (directly upstream) narrow, single-thread segment without beaver activity. We used a mass balance approach to determine the water, carbon, and nutrient source-sink behavior of each river segment across a range of flows. While the single-thread segment was consistently a source of water, carbon, and nitrogen, the beaver impacted multithread segment exhibited variable source-sink dynamics as a function of flow. Specifically, the multithread segment was a sink for water, carbon, and nutrients during high flows, and subsequently became a source as flows decreased. Shifts in river-floodplain hydrologic connectivity across flows related to higher and more variable aquatic ecosystem metabolism rates along the multithread relative to the single-thread segment. Our data suggest that beaver activity in wide valleys can create a physically complex hydrologic environment that can enhance hydrologic and biogeochemical buffering, and promote high rates of aquatic ecosystem metabolism. Given the widespread removal of beaver, determining the cumulative effects of these changes is a critical next step in restoring function in altered river networks.

  13. Programmable DNA-Mediated Multitasking Processor.

    PubMed

    Shu, Jian-Jun; Wang, Qi-Wen; Yong, Kian-Yan; Shao, Fangwei; Lee, Kee Jin

    2015-04-30

    Because of DNA appealing features as perfect material, including minuscule size, defined structural repeat and rigidity, programmable DNA-mediated processing is a promising computing paradigm, which employs DNAs as information storing and processing substrates to tackle the computational problems. The massive parallelism of DNA hybridization exhibits transcendent potential to improve multitasking capabilities and yield a tremendous speed-up over the conventional electronic processors with stepwise signal cascade. As an example of multitasking capability, we present an in vitro programmable DNA-mediated optimal route planning processor as a functional unit embedded in contemporary navigation systems. The novel programmable DNA-mediated processor has several advantages over the existing silicon-mediated methods, such as conducting massive data storage and simultaneous processing via much fewer materials than conventional silicon devices.

  14. A digital retina-like low-level vision processor.

    PubMed

    Mertoguno, S; Bourbakis, N G

    2003-01-01

    This correspondence presents the basic design and the simulation of a low level multilayer vision processor that emulates to some degree the functional behavior of a human retina. This retina-like multilayer processor is the lower part of an autonomous self-organized vision system, called Kydon, that could be used on visually impaired people with a damaged visual cerebral cortex. The Kydon vision system, however, is not presented in this paper. The retina-like processor consists of four major layers, where each of them is an array processor based on hexagonal, autonomous processing elements that perform a certain set of low level vision tasks, such as smoothing and light adaptation, edge detection, segmentation, line recognition and region-graph generation. At each layer, the array processor is a 2D array of k/spl times/m hexagonal identical autonomous cells that simultaneously execute certain low level vision tasks. Thus, the hardware design and the simulation at the transistor level of the processing elements (PEs) of the retina-like processor and its simulated functionality with illustrative examples are provided in this paper.

  15. Multi-Threaded DNA Tag/Anti-Tag Library Generator for Multi-Core Platforms

    DTIC Science & Technology

    2009-05-01

    base pair)  Watson ‐ Crick  strand pairs that bind perfectly within pairs, but poorly across pairs. A variety  of  DNA  strand hybridization metrics...AFRL-RI-RS-TR-2009-131 Final Technical Report May 2009 MULTI-THREADED DNA TAG/ANTI-TAG LIBRARY GENERATOR FOR MULTI-CORE PLATFORMS...TYPE Final 3. DATES COVERED (From - To) Jun 08 – Feb 09 4. TITLE AND SUBTITLE MULTI-THREADED DNA TAG/ANTI-TAG LIBRARY GENERATOR FOR MULTI-CORE

  16. Method for fast start of a fuel processor

    DOEpatents

    Ahluwalia, Rajesh K [Burr Ridge, IL; Ahmed, Shabbir [Naperville, IL; Lee, Sheldon H. D. [Willowbrook, IL

    2008-01-29

    An improved fuel processor for fuel cells is provided whereby the startup time of the processor is less than sixty seconds and can be as low as 30 seconds, if not less. A rapid startup time is achieved by either igniting or allowing a small mixture of air and fuel to react over and warm up the catalyst of an autothermal reformer (ATR). The ATR then produces combustible gases to be subsequently oxidized on and simultaneously warm up water-gas shift zone catalysts. After normal operating temperature has been achieved, the proportion of air included with the fuel is greatly diminished.

  17. Method for simultaneous overlapped communications between neighboring processors in a multiple

    DOEpatents

    Benner, Robert E.; Gustafson, John L.; Montry, Gary R.

    1991-01-01

    A parallel computing system and method having improved performance where a program is concurrently run on a plurality of nodes for reducing total processing time, each node having a processor, a memory, and a predetermined number of communication channels connected to the node and independently connected directly to other nodes. The present invention improves performance of performance of the parallel computing system by providing a system which can provide efficient communication between the processors and between the system and input and output devices. A method is also disclosed which can locate defective nodes with the computing system.

  18. Algorithms for solving large sparse systems of simultaneous linear equations on vector processors

    NASA Technical Reports Server (NTRS)

    David, R. E.

    1984-01-01

    Very efficient algorithms for solving large sparse systems of simultaneous linear equations have been developed for serial processing computers. These involve a reordering of matrix rows and columns in order to obtain a near triangular pattern of nonzero elements. Then an LU factorization is developed to represent the matrix inverse in terms of a sequence of elementary Gaussian eliminations, or pivots. In this paper it is shown how these algorithms are adapted for efficient implementation on vector processors. Results obtained on the CYBER 200 Model 205 are presented for a series of large test problems which show the comparative advantages of the triangularization and vector processing algorithms.

  19. High performance computing for deformable image registration: towards a new paradigm in adaptive radiotherapy.

    PubMed

    Samant, Sanjiv S; Xia, Junyi; Muyan-Ozcelik, Pinar; Owens, John D

    2008-08-01

    The advent of readily available temporal imaging or time series volumetric (4D) imaging has become an indispensable component of treatment planning and adaptive radiotherapy (ART) at many radiotherapy centers. Deformable image registration (DIR) is also used in other areas of medical imaging, including motion corrected image reconstruction. Due to long computation time, clinical applications of DIR in radiation therapy and elsewhere have been limited and consequently relegated to offline analysis. With the recent advances in hardware and software, graphics processing unit (GPU) based computing is an emerging technology for general purpose computation, including DIR, and is suitable for highly parallelized computing. However, traditional general purpose computation on the GPU is limited because the constraints of the available programming platforms. As well, compared to CPU programming, the GPU currently has reduced dedicated processor memory, which can limit the useful working data set for parallelized processing. We present an implementation of the demons algorithm using the NVIDIA 8800 GTX GPU and the new CUDA programming language. The GPU performance will be compared with single threading and multithreading CPU implementations on an Intel dual core 2.4 GHz CPU using the C programming language. CUDA provides a C-like language programming interface, and allows for direct access to the highly parallel compute units in the GPU. Comparisons for volumetric clinical lung images acquired using 4DCT were carried out. Computation time for 100 iterations in the range of 1.8-13.5 s was observed for the GPU with image size ranging from 2.0 x 10(6) to 14.2 x 10(6) pixels. The GPU registration was 55-61 times faster than the CPU for the single threading implementation, and 34-39 times faster for the multithreading implementation. For CPU based computing, the computational time generally has a linear dependence on image size for medical imaging data. Computational efficiency is characterized in terms of time per megapixels per iteration (TPMI) with units of seconds per megapixels per iteration (or spmi). For the demons algorithm, our CPU implementation yielded largely invariant values of TPMI. The mean TPMIs were 0.527 spmi and 0.335 spmi for the single threading and multithreading cases, respectively, with <2% variation over the considered image data range. For GPU computing, we achieved TPMI =0.00916 spmi with 3.7% variation, indicating optimized memory handling under CUDA. The paradigm of GPU based real-time DIR opens up a host of clinical applications for medical imaging.

  20. Implementation of a multi-threaded framework for large-scale scientific applications

    DOE PAGES

    Sexton-Kennedy, E.; Gartung, Patrick; Jones, C. D.; ...

    2015-05-22

    The CMS experiment has recently completed the development of a multi-threaded capable application framework. In this paper, we will discuss the design, implementation and application of this framework to production applications in CMS. For the 2015 LHC run, this functionality is particularly critical for both our online and offline production applications, which depend on faster turn-around times and a reduced memory footprint relative to before. These applications are complex codes, each including a large number of physics-driven algorithms. While the framework is capable of running a mix of thread-safe and 'legacy' modules, algorithms running in our production applications need tomore » be thread-safe for optimal use of this multi-threaded framework at a large scale. Towards this end, we discuss the types of changes, which were necessary for our algorithms to achieve good performance of our multithreaded applications in a full-scale application. Lastly performance numbers for what has been achieved for the 2015 run are presented.« less

  1. Using Multithreading for the Automatic Load Balancing of 2D Adaptive Finite Element Meshes

    NASA Technical Reports Server (NTRS)

    Heber, Gerd; Biswas, Rupak; Thulasiraman, Parimala; Gao, Guang R.; Bailey, David H. (Technical Monitor)

    1998-01-01

    In this paper, we present a multi-threaded approach for the automatic load balancing of adaptive finite element (FE) meshes. The platform of our choice is the EARTH multi-threaded system which offers sufficient capabilities to tackle this problem. We implement the question phase of FE applications on triangular meshes, and exploit the EARTH token mechanism to automatically balance the resulting irregular and highly nonuniform workload. We discuss the results of our experiments on EARTH-SP2, an implementation of EARTH on the IBM SP2, with different load balancing strategies that are built into the runtime system.

  2. Using Multi-threading for the Automatic Load Balancing of 2D Adaptive Finite Element Meshes

    NASA Technical Reports Server (NTRS)

    Heber, Gerd; Biswas, Rupak; Thulasiraman, Parimala; Gao, Guang R.; Saini, Subhash (Technical Monitor)

    1998-01-01

    In this paper, we present a multi-threaded approach for the automatic load balancing of adaptive finite element (FE) meshes The platform of our choice is the EARTH multi-threaded system which offers sufficient capabilities to tackle this problem. We implement the adaption phase of FE applications oil triangular meshes and exploit the EARTH token mechanism to automatically balance the resulting irregular and highly nonuniform workload. We discuss the results of our experiments oil EARTH-SP2, on implementation of EARTH on the IBM SP2 with different load balancing strategies that are built into the runtime system.

  3. NASA Tech Briefs, January 2006

    NASA Technical Reports Server (NTRS)

    2006-01-01

    Topics covered include: Semiautonomous Avionics-and-Sensors System for a UAV; Biomimetic/Optical Sensors for Detecting Bacterial Species; System Would Detect Foreign-Object Damage in Turbofan Engine; Detection of Water Hazards for Autonomous Robotic Vehicles; Fuel Cells Utilizing Oxygen From Air at Low Pressures; Hybrid Ion-Detector/Data-Acquisition System for a TOF-MS; Spontaneous-Desorption Ionizer for a TOF-MS; Equipment for On-Wafer Testing From 220 to 325 GHz; Computing Isentropic Flow Properties of Air/R-134a Mixtures; Java Mission Evaluation Workstation System; Using a Quadtree Algorithm To Assess Line of Sight; Software for Automated Generation of Cartesian Meshes; Optics Program Modified for Multithreaded Parallel Computing; Programs for Testing Processor-in-Memory Computing Systems; PVM Enhancement for Beowulf Multiple-Processor Nodes; Ion-Exclusion Chromatography for Analyzing Organics in Water; Selective Plasma Deposition of Fluorocarbon Films on SAMs; Water-Based Pressure-Sensitive Paints; System Finds Horizontal Location of Center of Gravity; Predicting Tail Buffet Loads of a Fighter Airplane; Water Containment Systems for Testing High-Speed Flywheels; Vapor-Compression Heat Pumps for Operation Aboard Spacecraft; Multistage Electrophoretic Separators; Recovering Residual Xenon Propellant for an Ion Propulsion System; Automated Solvent Seaming of Large Polyimide Membranes; Manufacturing Precise, Lightweight Paraboloidal Mirrors; Analysis of Membrane Lipids of Airborne Micro-Organisms; Noninvasive Diagnosis of Coronary Artery Disease Using 12-Lead High-Frequency Electrocardiograms; Dual-Laser-Pulse Ignition; Enhanced-Contrast Viewing of White-Hot Objects in Furnaces; Electrically Tunable Terahertz Quantum-Cascade Lasers; Few-Mode Whispering-Gallery-Mode Resonators; Conflict-Aware Scheduling Algorithm; and Real-Time Diagnosis of Faults Using a Bank of Kalman Filters.

  4. FODEM: A Multi-Threaded Research and Development Method for Educational Technology

    ERIC Educational Resources Information Center

    Suhonen, Jarkko; de Villiers, M. Ruth; Sutinen, Erkki

    2012-01-01

    Formative development method (FODEM) is a multithreaded design approach that was originated to support the design and development of various types of educational technology innovations, such as learning tools, and online study programmes. The threaded and agile structure of the approach provides flexibility to the design process. Intensive…

  5. Vectorization of a classical trajectory code on a floating point systems, Inc. Model 164 attached processor.

    PubMed

    Kraus, Wayne A; Wagner, Albert F

    1986-04-01

    A triatomic classical trajectory code has been modified by extensive vectorization of the algorithms to achieve much improved performance on an FPS 164 attached processor. Extensive timings on both the FPS 164 and a VAX 11/780 with floating point accelerator are presented as a function of the number of trajectories simultaneously run. The timing tests involve a potential energy surface of the LEPS variety and trajectories with 1000 time steps. The results indicate that vectorization results in timing improvements on both the VAX and the FPS. For larger numbers of trajectories run simultaneously, up to a factor of 25 improvement in speed occurs between VAX and FPS vectorized code. Copyright © 1986 John Wiley & Sons, Inc.

  6. Multithreaded transactions in scientific computing: New versions of a computer program for kinematical calculations of RHEED intensity oscillations

    NASA Astrophysics Data System (ADS)

    Brzuszek, Marcin; Daniluk, Andrzej

    2006-11-01

    Writing a concurrent program can be more difficult than writing a sequential program. Programmer needs to think about synchronisation, race conditions and shared variables. Transactions help reduce the inconvenience of using threads. A transaction is an abstraction, which allows programmers to group a sequence of actions on the program into a logical, higher-level computation unit. This paper presents multithreaded versions of the GROWTH program, which allow to calculate the layer coverages during the growth of thin epitaxial films and the corresponding RHEED intensities according to the kinematical approximation. The presented programs also contain graphical user interfaces, which enable displaying program data at run-time. New version program summaryTitles of programs:GROWTHGr, GROWTH06 Catalogue identifier:ADVL_v2_0 Program summary URL:http://cpc.cs.qub.ac.uk/summaries/ADVL_v2_0 Program obtainable from:CPC Program Library, Queen's University of Belfast, N. Ireland Catalogue identifier of previous version:ADVL Does the new version supersede the original program:No Computer for which the new version is designed and others on which it has been tested: Pentium-based PC Operating systems or monitors under which the new version has been tested: Windows 9x, XP, NT Programming language used:Object Pascal Memory required to execute with typical data:More than 1 MB Number of bits in a word:64 bits Number of processors used:1 No. of lines in distributed program, including test data, etc.:20 931 Number of bytes in distributed program, including test data, etc.: 1 311 268 Distribution format:tar.gz Nature of physical problem: The programs compute the RHEED intensities during the growth of thin epitaxial structures prepared using the molecular beam epitaxy (MBE). The computations are based on the use of kinematical diffraction theory [P.I. Cohen, G.S. Petrich, P.R. Pukite, G.J. Whaley, A.S. Arrott, Surf. Sci. 216 (1989) 222. [1

  7. A fast CT reconstruction scheme for a general multi-core PC.

    PubMed

    Zeng, Kai; Bai, Erwei; Wang, Ge

    2007-01-01

    Expensive computational cost is a severe limitation in CT reconstruction for clinical applications that need real-time feedback. A primary example is bolus-chasing computed tomography (CT) angiography (BCA) that we have been developing for the past several years. To accelerate the reconstruction process using the filtered backprojection (FBP) method, specialized hardware or graphics cards can be used. However, specialized hardware is expensive and not flexible. The graphics processing unit (GPU) in a current graphic card can only reconstruct images in a reduced precision and is not easy to program. In this paper, an acceleration scheme is proposed based on a multi-core PC. In the proposed scheme, several techniques are integrated, including utilization of geometric symmetry, optimization of data structures, single-instruction multiple-data (SIMD) processing, multithreaded computation, and an Intel C++ compilier. Our scheme maintains the original precision and involves no data exchange between the GPU and CPU. The merits of our scheme are demonstrated in numerical experiments against the traditional implementation. Our scheme achieves a speedup of about 40, which can be further improved by several folds using the latest quad-core processors.

  8. ParDRe: faster parallel duplicated reads removal tool for sequencing studies.

    PubMed

    González-Domínguez, Jorge; Schmidt, Bertil

    2016-05-15

    Current next generation sequencing technologies often generate duplicated or near-duplicated reads that (depending on the application scenario) do not provide any interesting biological information but can increase memory requirements and computational time of downstream analysis. In this work we present ParDRe, a de novo parallel tool to remove duplicated and near-duplicated reads through the clustering of Single-End or Paired-End sequences from fasta or fastq files. It uses a novel bitwise approach to compare the suffixes of DNA strings and employs hybrid MPI/multithreading to reduce runtime on multicore systems. We show that ParDRe is up to 27.29 times faster than Fulcrum (a representative state-of-the-art tool) on a platform with two 8-core Sandy-Bridge processors. Source code in C ++ and MPI running on Linux systems as well as a reference manual are available at https://sourceforge.net/projects/pardre/ jgonzalezd@udc.es. © The Author 2016. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.

  9. Multicore Architecture-aware Scientific Applications

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Srinivasa, Avinash

    Modern high performance systems are becoming increasingly complex and powerful due to advancements in processor and memory architecture. In order to keep up with this increasing complexity, applications have to be augmented with certain capabilities to fully exploit such systems. These may be at the application level, such as static or dynamic adaptations or at the system level, like having strategies in place to override some of the default operating system polices, the main objective being to improve computational performance of the application. The current work proposes two such capabilites with respect to multi-threaded scientific applications, in particular a largemore » scale physics application computing ab-initio nuclear structure. The first involves using a middleware tool to invoke dynamic adaptations in the application, so as to be able to adjust to the changing computational resource availability at run-time. The second involves a strategy for effective placement of data in main memory, to optimize memory access latencies and bandwidth. These capabilties when included were found to have a significant impact on the application performance, resulting in average speedups of as much as two to four times.« less

  10. A Fast CT Reconstruction Scheme for a General Multi-Core PC

    PubMed Central

    Zeng, Kai; Bai, Erwei; Wang, Ge

    2007-01-01

    Expensive computational cost is a severe limitation in CT reconstruction for clinical applications that need real-time feedback. A primary example is bolus-chasing computed tomography (CT) angiography (BCA) that we have been developing for the past several years. To accelerate the reconstruction process using the filtered backprojection (FBP) method, specialized hardware or graphics cards can be used. However, specialized hardware is expensive and not flexible. The graphics processing unit (GPU) in a current graphic card can only reconstruct images in a reduced precision and is not easy to program. In this paper, an acceleration scheme is proposed based on a multi-core PC. In the proposed scheme, several techniques are integrated, including utilization of geometric symmetry, optimization of data structures, single-instruction multiple-data (SIMD) processing, multithreaded computation, and an Intel C++ compilier. Our scheme maintains the original precision and involves no data exchange between the GPU and CPU. The merits of our scheme are demonstrated in numerical experiments against the traditional implementation. Our scheme achieves a speedup of about 40, which can be further improved by several folds using the latest quad-core processors. PMID:18256731

  11. Adaptive and mobile ground sensor array.

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Holzrichter, Michael Warren; O'Rourke, William T.; Zenner, Jennifer

    The goal of this LDRD was to demonstrate the use of robotic vehicles for deploying and autonomously reconfiguring seismic and acoustic sensor arrays with high (centimeter) accuracy to obtain enhancement of our capability to locate and characterize remote targets. The capability to accurately place sensors and then retrieve and reconfigure them allows sensors to be placed in phased arrays in an initial monitoring configuration and then to be reconfigured in an array tuned to the specific frequencies and directions of the selected target. This report reviews the findings and accomplishments achieved during this three-year project. This project successfully demonstrated autonomousmore » deployment and retrieval of a payload package with an accuracy of a few centimeters using differential global positioning system (GPS) signals. It developed an autonomous, multisensor, temporally aligned, radio-frequency communication and signal processing capability, and an array optimization algorithm, which was implemented on a digital signal processor (DSP). Additionally, the project converted the existing single-threaded, monolithic robotic vehicle control code into a multi-threaded, modular control architecture that enhances the reuse of control code in future projects.« less

  12. ALFA: The new ALICE-FAIR software framework

    NASA Astrophysics Data System (ADS)

    Al-Turany, M.; Buncic, P.; Hristov, P.; Kollegger, T.; Kouzinopoulos, C.; Lebedev, A.; Lindenstruth, V.; Manafov, A.; Richter, M.; Rybalchenko, A.; Vande Vyvre, P.; Winckler, N.

    2015-12-01

    The commonalities between the ALICE and FAIR experiments and their computing requirements led to the development of large parts of a common software framework in an experiment independent way. The FairRoot project has already shown the feasibility of such an approach for the FAIR experiments and extending it beyond FAIR to experiments at other facilities[1, 2]. The ALFA framework is a joint development between ALICE Online- Offline (O2) and FairRoot teams. ALFA is designed as a flexible, elastic system, which balances reliability and ease of development with performance using multi-processing and multithreading. A message- based approach has been adopted; such an approach will support the use of the software on different hardware platforms, including heterogeneous systems. Each process in ALFA assumes limited communication and reliance on other processes. Such a design will add horizontal scaling (multiple processes) to vertical scaling provided by multiple threads to meet computing and throughput demands. ALFA does not dictate any application protocols. Potentially, any content-based processor or any source can change the application protocol. The framework supports different serialization standards for data exchange between different hardware and software languages.

  13. Integrated optical 3D digital imaging based on DSP scheme

    NASA Astrophysics Data System (ADS)

    Wang, Xiaodong; Peng, Xiang; Gao, Bruce Z.

    2008-03-01

    We present a scheme of integrated optical 3-D digital imaging (IO3DI) based on digital signal processor (DSP), which can acquire range images independently without PC support. This scheme is based on a parallel hardware structure with aid of DSP and field programmable gate array (FPGA) to realize 3-D imaging. In this integrated scheme of 3-D imaging, the phase measurement profilometry is adopted. To realize the pipeline processing of the fringe projection, image acquisition and fringe pattern analysis, we present a multi-threads application program that is developed under the environment of DSP/BIOS RTOS (real-time operating system). Since RTOS provides a preemptive kernel and powerful configuration tool, with which we are able to achieve a real-time scheduling and synchronization. To accelerate automatic fringe analysis and phase unwrapping, we make use of the technique of software optimization. The proposed scheme can reach a performance of 39.5 f/s (frames per second), so it may well fit into real-time fringe-pattern analysis and can implement fast 3-D imaging. Experiment results are also presented to show the validity of proposed scheme.

  14. Avoiding and tolerating latency in large-scale next-generation shared-memory multiprocessors

    NASA Technical Reports Server (NTRS)

    Probst, David K.

    1993-01-01

    A scalable solution to the memory-latency problem is necessary to prevent the large latencies of synchronization and memory operations inherent in large-scale shared-memory multiprocessors from reducing high performance. We distinguish latency avoidance and latency tolerance. Latency is avoided when data is brought to nearby locales for future reference. Latency is tolerated when references are overlapped with other computation. Latency-avoiding locales include: processor registers, data caches used temporally, and nearby memory modules. Tolerating communication latency requires parallelism, allowing the overlap of communication and computation. Latency-tolerating techniques include: vector pipelining, data caches used spatially, prefetching in various forms, and multithreading in various forms. Relaxing the consistency model permits increased use of avoidance and tolerance techniques. Each model is a mapping from the program text to sets of partial orders on program operations; it is a convention about which temporal precedences among program operations are necessary. Information about temporal locality and parallelism constrains the use of avoidance and tolerance techniques. Suitable architectural primitives and compiler technology are required to exploit the increased freedom to reorder and overlap operations in relaxed models.

  15. Moessfit. A free Mössbauer fitting program

    NASA Astrophysics Data System (ADS)

    Kamusella, Sirko; Klauss, Hans-Henning

    2016-12-01

    A free data analysis program for Mössbauer spectroscopy was developed to solve commonly faced problems such as simultaneous fitting of multiple data sets, Maximum Entropy Method and a proper error estimation. The program is written in C++ using the Qt application framework and the Gnu Scientific Library. Moessfit makes use of multithreading to reasonably apply the multi core CPU capacities of modern PC. The whole fit is specified in a text input file issued to simplify work flow for the user and provide a simple start in the Mössbauer data analysis for beginners. However, the possibility to define arbitrary parameter dependencies and distributions as well as relaxation spectra makes Moessfit interesting for advanced user as well.

  16. Autothermal and partial oxidation reformer-based fuel processor, method for improving catalyst function in autothermal and partial oxidation reformer-based processors

    DOEpatents

    Ahmed, Shabbir; Papadias, Dionissios D.; Lee, Sheldon H. D.; Ahluwalia, Rajesh K.

    2013-01-08

    The invention provides a fuel processor comprising a linear flow structure having an upstream portion and a downstream portion; a first catalyst supported at the upstream portion; and a second catalyst supported at the downstream portion, wherein the first catalyst is in fluid communication with the second catalyst. Also provided is a method for reforming fuel, the method comprising contacting the fuel to an oxidation catalyst so as to partially oxidize the fuel and generate heat; warming incoming fuel with the heat while simultaneously warming a reforming catalyst with the heat; and reacting the partially oxidized fuel with steam using the reforming catalyst.

  17. metAlignID: a high-throughput software tool set for automated detection of trace level contaminants in comprehensive LECO two-dimensional gas chromatography time-of-flight mass spectrometry data.

    PubMed

    Lommen, Arjen; van der Kamp, Henk J; Kools, Harrie J; van der Lee, Martijn K; van der Weg, Guido; Mol, Hans G J

    2012-11-09

    A new alternative data processing tool set, metAlignID, is developed for automated pre-processing and library-based identification and concentration estimation of target compounds after analysis by comprehensive two-dimensional gas chromatography with mass spectrometric detection. The tool set has been developed for and tested on LECO data. The software is developed to run multi-threaded (one thread per processor core) on a standard PC (personal computer) under different operating systems and is as such capable of processing multiple data sets simultaneously. Raw data files are converted into netCDF (network Common Data Form) format using a fast conversion tool. They are then preprocessed using previously developed algorithms originating from metAlign software. Next, the resulting reduced data files are searched against a user-composed library (derived from user or commercial NIST-compatible libraries) (NIST=National Institute of Standards and Technology) and the identified compounds, including an indicative concentration, are reported in Excel format. Data can be processed batch wise. The overall time needed for conversion together with processing and searching of 30 raw data sets for 560 compounds is routinely within an hour. The screening performance is evaluated for detection of pesticides and contaminants in raw data obtained after analysis of soil and plant samples. Results are compared to the existing data-handling routine based on proprietary software (LECO, ChromaTOF). The developed software tool set, which is freely downloadable at www.metalign.nl, greatly accelerates data-analysis and offers more options for fine-tuning automated identification toward specific application needs. The quality of the results obtained is slightly better than the standard processing and also adds a quantitative estimate. The software tool set in combination with two-dimensional gas chromatography coupled to time-of-flight mass spectrometry shows great potential as a highly-automated and fast multi-residue instrumental screening method. Copyright © 2012 Elsevier B.V. All rights reserved.

  18. a Spatiotemporal Aggregation Query Method Using Multi-Thread Parallel Technique Based on Regional Division

    NASA Astrophysics Data System (ADS)

    Liao, S.; Chen, L.; Li, J.; Xiong, W.; Wu, Q.

    2015-07-01

    Existing spatiotemporal database supports spatiotemporal aggregation query over massive moving objects datasets. Due to the large amounts of data and single-thread processing method, the query speed cannot meet the application requirements. On the other hand, the query efficiency is more sensitive to spatial variation then temporal variation. In this paper, we proposed a spatiotemporal aggregation query method using multi-thread parallel technique based on regional divison and implemented it on the server. Concretely, we divided the spatiotemporal domain into several spatiotemporal cubes, computed spatiotemporal aggregation on all cubes using the technique of multi-thread parallel processing, and then integrated the query results. By testing and analyzing on the real datasets, this method has improved the query speed significantly.

  19. Scalable architecture for a room temperature solid-state quantum information processor.

    PubMed

    Yao, N Y; Jiang, L; Gorshkov, A V; Maurer, P C; Giedke, G; Cirac, J I; Lukin, M D

    2012-04-24

    The realization of a scalable quantum information processor has emerged over the past decade as one of the central challenges at the interface of fundamental science and engineering. Here we propose and analyse an architecture for a scalable, solid-state quantum information processor capable of operating at room temperature. Our approach is based on recent experimental advances involving nitrogen-vacancy colour centres in diamond. In particular, we demonstrate that the multiple challenges associated with operation at ambient temperature, individual addressing at the nanoscale, strong qubit coupling, robustness against disorder and low decoherence rates can be simultaneously achieved under realistic, experimentally relevant conditions. The architecture uses a novel approach to quantum information transfer and includes a hierarchy of control at successive length scales. Moreover, it alleviates the stringent constraints currently limiting the realization of scalable quantum processors and will provide fundamental insights into the physics of non-equilibrium many-body quantum systems.

  20. On the relationship between parallel computation and graph embedding

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Gupta, A.K.

    1989-01-01

    The problem of efficiently simulating an algorithm designed for an n-processor parallel machine G on an m-processor parallel machine H with n > m arises when parallel algorithms designed for an ideal size machine are simulated on existing machines which are of a fixed size. The author studies this problem when every processor of H takes over the function of a number of processors in G, and he phrases the simulation problem as a graph embedding problem. New embeddings presented address relevant issues arising from the parallel computation environment. The main focus centers around embedding complete binary trees into smaller-sizedmore » binary trees, butterflies, and hypercubes. He also considers simultaneous embeddings of r source machines into a single hypercube. Constant factors play a crucial role in his embeddings since they are not only important in practice but also lead to interesting theoretical problems. All of his embeddings minimize dilation and load, which are the conventional cost measures in graph embeddings and determine the maximum amount of time required to simulate one step of G on H. His embeddings also optimize a new cost measure called ({alpha},{beta})-utilization which characterizes how evenly the processors of H are used by the processors of G. Ideally, the utilization should be balanced (i.e., every processor of H simulates at most (n/m) processors of G) and the ({alpha},{beta})-utilization measures how far off from a balanced utilization the embedding is. He presents embeddings for the situation when some processors of G have different capabilities (e.g. memory or I/O) than others and the processors with different capabilities are to be distributed uniformly among the processors of H. Placing such conditions on an embedding results in an increase in some of the cost measures.« less

  1. Dual-mode self-validating resistance/Johnson noise thermometer system

    DOEpatents

    Shepard, Robert L.; Blalock, Theron V.; Roberts, Michael J.

    1993-01-01

    A dual-mode Johnson noise and DC resistance thermometer capable of use in control systems where prompt indications of temperature changes and long term accuracy are needed. A resistance-inductance-capacitance (RLC) tuned circuit produces a continuous voltage signal for Johnson noise temperature measurement. The RLC circuit provides a mean-squared noise voltage that depends only on the capacitance used and the temperature of the sensor. The sensor has four leads for simultaneous coupling to a noise signal processor and to a DC resistance signal processor.

  2. Memory and Energy Optimization Strategies for Multithreaded Operating System on the Resource-Constrained Wireless Sensor Node

    PubMed Central

    Liu, Xing; Hou, Kun Mean; de Vaulx, Christophe; Xu, Jun; Yang, Jianfeng; Zhou, Haiying; Shi, Hongling; Zhou, Peng

    2015-01-01

    Memory and energy optimization strategies are essential for the resource-constrained wireless sensor network (WSN) nodes. In this article, a new memory-optimized and energy-optimized multithreaded WSN operating system (OS) LiveOS is designed and implemented. Memory cost of LiveOS is optimized by using the stack-shifting hybrid scheduling approach. Different from the traditional multithreaded OS in which thread stacks are allocated statically by the pre-reservation, thread stacks in LiveOS are allocated dynamically by using the stack-shifting technique. As a result, memory waste problems caused by the static pre-reservation can be avoided. In addition to the stack-shifting dynamic allocation approach, the hybrid scheduling mechanism which can decrease both the thread scheduling overhead and the thread stack number is also implemented in LiveOS. With these mechanisms, the stack memory cost of LiveOS can be reduced more than 50% if compared to that of a traditional multithreaded OS. Not is memory cost optimized, but also the energy cost is optimized in LiveOS, and this is achieved by using the multi-core “context aware” and multi-core “power-off/wakeup” energy conservation approaches. By using these approaches, energy cost of LiveOS can be reduced more than 30% when compared to the single-core WSN system. Memory and energy optimization strategies in LiveOS not only prolong the lifetime of WSN nodes, but also make the multithreaded OS feasible to run on the memory-constrained WSN nodes. PMID:25545264

  3. Applying Jlint to Space Exploration Software

    NASA Technical Reports Server (NTRS)

    Artho, Cyrille; Havelund, Klaus

    2004-01-01

    Java is a very successful programming language which is also becoming widespread in embedded systems, where software correctness is critical. Jlint is a simple but highly efficient static analyzer that checks a Java program for several common errors, such as null pointer exceptions, and overflow errors. It also includes checks for multi-threading problems, such as deadlocks and data races. The case study described here shows the effectiveness of Jlint in find-false positives in the multi-threading warnings gives an insight into design patterns commonly used in multi-threaded code. The results show that a few analysis techniques are sufficient to avoid almost all false positives. These techniques include investigating all possible callers and a few code idioms. Verifying the correct application of these patterns is still crucial, because their correct usage is not trivial.

  4. Biological Water Processor and Forward Osmosis Secondary Treatment

    NASA Technical Reports Server (NTRS)

    Shull, Sarah; Meyer, Caitlin

    2014-01-01

    The goal of the Biological Water Processor (BWP) is to remove 90% organic carbon and 75% ammonium from an exploration-based wastewater stream for four crew members. The innovative design saves on space, power and consumables as compared to the ISS Urine Processor Assembly (UPA) by utilizing microbes in a biofilm. The attached-growth system utilizes simultaneous nitrification and denitrification to mineralize organic carbon and ammonium to carbon dioxide and nitrogen gas, which can be scrubbed in a cabin air revitalization system. The BWP uses a four-crew wastewater comprised of urine and humidity condensate, as on the ISS, but also includes hygiene (shower, shave, hand washing and oral hygiene) and laundry. The BWP team donates 58L per day of this wastewater processed in Building 7.

  5. Final report for the Tera Computer TTI CRADA

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Davidson, G.S.; Pavlakos, C.; Silva, C.

    1997-01-01

    Tera Computer and Sandia National Laboratories have completed a CRADA, which examined the Tera Multi-Threaded Architecture (MTA) for use with large codes of importance to industry and DOE. The MTA is an innovative architecture that uses parallelism to mask latency between memories and processors. The physical implementation is a parallel computer with high cross-section bandwidth and GaAs processors designed by Tera, which support many small computation threads and fast, lightweight context switches between them. When any thread blocks while waiting for memory accesses to complete, another thread immediately begins execution so that high CPU utilization is maintained. The Tera MTAmore » parallel computer has a single, global address space, which is appealing when porting existing applications to a parallel computer. This ease of porting is further enabled by compiler technology that helps break computations into parallel threads. DOE and Sandia National Laboratories were interested in working with Tera to further develop this computing concept. While Tera Computer would continue the hardware development and compiler research, Sandia National Laboratories would work with Tera to ensure that their compilers worked well with important Sandia codes, most particularly CTH, a shock physics code used for weapon safety computations. In addition to that important code, Sandia National Laboratories would complete research on a robotic path planning code, SANDROS, which is important in manufacturing applications, and would evaluate the MTA performance on this code. Finally, Sandia would work directly with Tera to develop 3D visualization codes, which would be appropriate for use with the MTA. Each of these tasks has been completed to the extent possible, given that Tera has just completed the MTA hardware. All of the CRADA work had to be done on simulators.« less

  6. Hydrogen Balmer Line Broadening in Solar and Stellar Flares

    NASA Technical Reports Server (NTRS)

    Kowalski, Adam F.; Allred, Joel C.; Uitenbroek, Han; Tremblay, Pier-Emmanuel; Brown, Stephen; Carlsson, Mats; Osten, Rachel A.; Wisniewski, John P.; Hawley, Suzanne L.

    2017-01-01

    The broadening of the hydrogen lines during flares is thought to result from increased charge (electron, proton) density in the flare chromosphere. However, disagreements between theory and modeling prescriptions have precluded an accurate diagnostic of the degree of ionization and compression resulting from flare heating in the chromosphere. To resolve this issue, we have incorporated the unified theory of electric pressure broadening of the hydrogen lines into the non-LTE radiative-transfer code RH. This broadening prescription produces a much more realistic spectrum of the quiescent, A0 star Vega compared to the analytic approximations used as a damping parameter in the Voigt profiles. We test recent radiative-hydrodynamic (RHD) simulations of the atmospheric response to high nonthermal electron beam fluxes with the new broadening prescription and find that the Balmer lines are overbroadened at the densest times in the simulations. Adding many simultaneously heated and cooling model loops as a 'multithread' model improves the agreement with the observations. We revisit the three component phenomenological flare model of the YZ CMi Megaflare using recent and new RHD models. The evolution of the broadening, line flux ratios, and continuum flux ratios are well-reproduced by a multithread model with high-flux nonthermal electron beam heating, an extended decay phase model, and a 'hot spot' atmosphere heated by an ultra relativistic electron beam with reasonable filling factors: approximately 0.1%, 1%, and 0.1% of the visible stellar hemisphere, respectively. The new modeling motivates future work to understand the origin of the extended gradual phase emission.

  7. Demonstration of two-qubit algorithms with a superconducting quantum processor.

    PubMed

    DiCarlo, L; Chow, J M; Gambetta, J M; Bishop, Lev S; Johnson, B R; Schuster, D I; Majer, J; Blais, A; Frunzio, L; Girvin, S M; Schoelkopf, R J

    2009-07-09

    Quantum computers, which harness the superposition and entanglement of physical states, could outperform their classical counterparts in solving problems with technological impact-such as factoring large numbers and searching databases. A quantum processor executes algorithms by applying a programmable sequence of gates to an initialized register of qubits, which coherently evolves into a final state containing the result of the computation. Building a quantum processor is challenging because of the need to meet simultaneously requirements that are in conflict: state preparation, long coherence times, universal gate operations and qubit readout. Processors based on a few qubits have been demonstrated using nuclear magnetic resonance, cold ion trap and optical systems, but a solid-state realization has remained an outstanding challenge. Here we demonstrate a two-qubit superconducting processor and the implementation of the Grover search and Deutsch-Jozsa quantum algorithms. We use a two-qubit interaction, tunable in strength by two orders of magnitude on nanosecond timescales, which is mediated by a cavity bus in a circuit quantum electrodynamics architecture. This interaction allows the generation of highly entangled states with concurrence up to 94 per cent. Although this processor constitutes an important step in quantum computing with integrated circuits, continuing efforts to increase qubit coherence times, gate performance and register size will be required to fulfil the promise of a scalable technology.

  8. Right-Brain/Left-Brain Integrated Associative Processor Employing Convertible Multiple-Instruction-Stream Multiple-Data-Stream Elements

    NASA Astrophysics Data System (ADS)

    Hayakawa, Hitoshi; Ogawa, Makoto; Shibata, Tadashi

    2005-04-01

    A very large scale integrated circuit (VLSI) architecture for a multiple-instruction-stream multiple-data-stream (MIMD) associative processor has been proposed. The processor employs an architecture that enables seamless switching from associative operations to arithmetic operations. The MIMD element is convertible to a regular central processing unit (CPU) while maintaining its high performance as an associative processor. Therefore, the MIMD associative processor can perform not only on-chip perception, i.e., searching for the vector most similar to an input vector throughout the on-chip cache memory, but also arithmetic and logic operations similar to those in ordinary CPUs, both simultaneously in parallel processing. Three key technologies have been developed to generate the MIMD element: associative-operation-and-arithmetic-operation switchable calculation units, a versatile register control scheme within the MIMD element for flexible operations, and a short instruction set for minimizing the memory size for program storage. Key circuit blocks were designed and fabricated using 0.18 μm complementary metal-oxide-semiconductor (CMOS) technology. As a result, the full-featured MIMD element is estimated to be 3 mm2, showing the feasibility of an 8-parallel-MIMD-element associative processor in a single chip of 5 mm× 5 mm.

  9. Shadow-Bitcoin: Scalable Simulation via Direct Execution of Multi-Threaded Applications

    DTIC Science & Technology

    2015-08-10

    Shadow- Bitcoin : Scalable Simulation via Direct Execution of Multi-threaded Applications Andrew Miller University of Maryland amiller@cs.umd.edu Rob...Shadow plug-in that directly executes the Bitcoin reference client software. To demonstrate the usefulness of this tool, we present novel denial-of...service attacks against the Bit- coin software that exploit low-level implementation ar- tifacts in the Bitcoin reference client; our determinis- tic

  10. SATNET Development and Operation. Pluribus Satellite IMP Development, Remote Site Maintenance. Internet Development. Mobile Access Terminal Network, TCP for the HP3000, TCP-TAC.

    DTIC Science & Technology

    1980-08-01

    reduce I/O latency. Periodically, the polling processor would hand off the polling task to a different processor which would then become the active...DMA (2 X 1.5 microsecond/word) since both halves of the SKI carry on simultaneous DMA transfers in a looped configuration. The difference between 3.0...intelligent DMA. The only difference between this approach and the intelligent DNA is that the true intelligent DNA (approach (1)) would not use up I

  11. Electronic neural network for solving traveling salesman and similar global optimization problems

    NASA Technical Reports Server (NTRS)

    Thakoor, Anilkumar P. (Inventor); Moopenn, Alexander W. (Inventor); Duong, Tuan A. (Inventor); Eberhardt, Silvio P. (Inventor)

    1993-01-01

    This invention is a novel high-speed neural network based processor for solving the 'traveling salesman' and other global optimization problems. It comprises a novel hybrid architecture employing a binary synaptic array whose embodiment incorporates the fixed rules of the problem, such as the number of cities to be visited. The array is prompted by analog voltages representing variables such as distances. The processor incorporates two interconnected feedback networks, each of which solves part of the problem independently and simultaneously, yet which exchange information dynamically.

  12. Unipro UGENE: a unified bioinformatics toolkit.

    PubMed

    Okonechnikov, Konstantin; Golosova, Olga; Fursov, Mikhail

    2012-04-15

    Unipro UGENE is a multiplatform open-source software with the main goal of assisting molecular biologists without much expertise in bioinformatics to manage, analyze and visualize their data. UGENE integrates widely used bioinformatics tools within a common user interface. The toolkit supports multiple biological data formats and allows the retrieval of data from remote data sources. It provides visualization modules for biological objects such as annotated genome sequences, Next Generation Sequencing (NGS) assembly data, multiple sequence alignments, phylogenetic trees and 3D structures. Most of the integrated algorithms are tuned for maximum performance by the usage of multithreading and special processor instructions. UGENE includes a visual environment for creating reusable workflows that can be launched on local resources or in a High Performance Computing (HPC) environment. UGENE is written in C++ using the Qt framework. The built-in plugin system and structured UGENE API make it possible to extend the toolkit with new functionality. UGENE binaries are freely available for MS Windows, Linux and Mac OS X at http://ugene.unipro.ru/download.html. UGENE code is licensed under the GPLv2; the information about the code licensing and copyright of integrated tools can be found in the LICENSE.3rd_party file provided with the source bundle.

  13. Shared Memory Parallelism for 3D Cartesian Discrete Ordinates Solver

    NASA Astrophysics Data System (ADS)

    Moustafa, Salli; Dutka-Malen, Ivan; Plagne, Laurent; Ponçot, Angélique; Ramet, Pierre

    2014-06-01

    This paper describes the design and the performance of DOMINO, a 3D Cartesian SN solver that implements two nested levels of parallelism (multicore+SIMD) on shared memory computation nodes. DOMINO is written in C++, a multi-paradigm programming language that enables the use of powerful and generic parallel programming tools such as Intel TBB and Eigen. These two libraries allow us to combine multi-thread parallelism with vector operations in an efficient and yet portable way. As a result, DOMINO can exploit the full power of modern multi-core processors and is able to tackle very large simulations, that usually require large HPC clusters, using a single computing node. For example, DOMINO solves a 3D full core PWR eigenvalue problem involving 26 energy groups, 288 angular directions (S16), 46 × 106 spatial cells and 1 × 1012 DoFs within 11 hours on a single 32-core SMP node. This represents a sustained performance of 235 GFlops and 40:74% of the SMP node peak performance for the DOMINO sweep implementation. The very high Flops/Watt ratio of DOMINO makes it a very interesting building block for a future many-nodes nuclear simulation tool.

  14. JaSTA-2: Second version of the Java Superposition T-matrix Application

    NASA Astrophysics Data System (ADS)

    Halder, Prithish; Das, Himadri Sekhar

    2017-12-01

    In this article, we announce the development of a new version of the Java Superposition T-matrix App (JaSTA-2), to study the light scattering properties of porous aggregate particles. It has been developed using Netbeans 7.1.2, which is a java integrated development environment (IDE). The JaSTA uses double precision superposition T-matrix codes for multi-sphere clusters in random orientation, developed by Mackowski and Mischenko (1996). The new version consists of two options as part of the input parameters: (i) single wavelength and (ii) multiple wavelengths. The first option (which retains the applicability of older version of JaSTA) calculates the light scattering properties of aggregates of spheres for a single wavelength at a given instant of time whereas the second option can execute the code for a multiple numbers of wavelengths in a single run. JaSTA-2 provides convenient and quicker data analysis which can be used in diverse fields like Planetary Science, Atmospheric Physics, Nanoscience, etc. This version of the software is developed for Linux platform only, and it can be operated over all the cores of a processor using the multi-threading option.

  15. MSAProbs-MPI: parallel multiple sequence aligner for distributed-memory systems.

    PubMed

    González-Domínguez, Jorge; Liu, Yongchao; Touriño, Juan; Schmidt, Bertil

    2016-12-15

    MSAProbs is a state-of-the-art protein multiple sequence alignment tool based on hidden Markov models. It can achieve high alignment accuracy at the expense of relatively long runtimes for large-scale input datasets. In this work we present MSAProbs-MPI, a distributed-memory parallel version of the multithreaded MSAProbs tool that is able to reduce runtimes by exploiting the compute capabilities of common multicore CPU clusters. Our performance evaluation on a cluster with 32 nodes (each containing two Intel Haswell processors) shows reductions in execution time of over one order of magnitude for typical input datasets. Furthermore, MSAProbs-MPI using eight nodes is faster than the GPU-accelerated QuickProbs running on a Tesla K20. Another strong point is that MSAProbs-MPI can deal with large datasets for which MSAProbs and QuickProbs might fail due to time and memory constraints, respectively. Source code in C ++ and MPI running on Linux systems as well as a reference manual are available at http://msaprobs.sourceforge.net CONTACT: jgonzalezd@udc.esSupplementary information: Supplementary data are available at Bioinformatics online. © The Author 2016. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.

  16. Performance of hybrid programming models for multiscale cardiac simulations: preparing for petascale computation.

    PubMed

    Pope, Bernard J; Fitch, Blake G; Pitman, Michael C; Rice, John J; Reumann, Matthias

    2011-10-01

    Future multiscale and multiphysics models that support research into human disease, translational medical science, and treatment can utilize the power of high-performance computing (HPC) systems. We anticipate that computationally efficient multiscale models will require the use of sophisticated hybrid programming models, mixing distributed message-passing processes [e.g., the message-passing interface (MPI)] with multithreading (e.g., OpenMP, Pthreads). The objective of this study is to compare the performance of such hybrid programming models when applied to the simulation of a realistic physiological multiscale model of the heart. Our results show that the hybrid models perform favorably when compared to an implementation using only the MPI and, furthermore, that OpenMP in combination with the MPI provides a satisfactory compromise between performance and code complexity. Having the ability to use threads within MPI processes enables the sophisticated use of all processor cores for both computation and communication phases. Considering that HPC systems in 2012 will have two orders of magnitude more cores than what was used in this study, we believe that faster than real-time multiscale cardiac simulations can be achieved on these systems.

  17. Environmentally adaptive processing for shallow ocean applications: A sequential Bayesian approach.

    PubMed

    Candy, J V

    2015-09-01

    The shallow ocean is a changing environment primarily due to temperature variations in its upper layers directly affecting sound propagation throughout. The need to develop processors capable of tracking these changes implies a stochastic as well as an environmentally adaptive design. Bayesian techniques have evolved to enable a class of processors capable of performing in such an uncertain, nonstationary (varying statistics), non-Gaussian, variable shallow ocean environment. A solution to this problem is addressed by developing a sequential Bayesian processor capable of providing a joint solution to the modal function tracking and environmental adaptivity problem. Here, the focus is on the development of both a particle filter and an unscented Kalman filter capable of providing reasonable performance for this problem. These processors are applied to hydrophone measurements obtained from a vertical array. The adaptivity problem is attacked by allowing the modal coefficients and/or wavenumbers to be jointly estimated from the noisy measurement data along with tracking of the modal functions while simultaneously enhancing the noisy pressure-field measurements.

  18. Acousto-optic time- and space-integrating spotlight-mode SAR processor

    NASA Astrophysics Data System (ADS)

    Haney, Michael W.; Levy, James J.; Michael, Robert R., Jr.

    1993-09-01

    The technical approach and recent experimental results for the acousto-optic time- and space- integrating real-time SAR image formation processor program are reported. The concept overcomes the size and power consumption limitations of electronic approaches by using compact, rugged, and low-power analog optical signal processing techniques for the most computationally taxing portions of the SAR imaging problem. Flexibility and performance are maintained by the use of digital electronics for the critical low-complexity filter generation and output image processing functions. The results include a demonstration of the processor's ability to perform high-resolution spotlight-mode SAR imaging by simultaneously compensating for range migration and range/azimuth coupling in the analog optical domain, thereby avoiding a highly power-consuming digital interpolation or reformatting operation usually required in all-electronic approaches.

  19. Multi-threaded integration of HTC-Vive and MeVisLab

    NASA Astrophysics Data System (ADS)

    Gunacker, Simon; Gall, Markus; Schmalstieg, Dieter; Egger, Jan

    2018-03-01

    This work presents how Virtual Reality (VR) can easily be integrated into medical applications via a plugin for a medical image processing framework called MeVisLab. A multi-threaded plugin has been developed using OpenVR, a VR library that can be used for developing vendor and platform independent VR applications. The plugin is tested using the HTC Vive, a head-mounted display developed by HTC and Valve Corporation.

  20. Factors controlling large-wood transport in a mountain river

    NASA Astrophysics Data System (ADS)

    Ruiz-Villanueva, Virginia; Wyżga, Bartłomiej; Zawiejska, Joanna; Hajdukiewicz, Maciej; Stoffel, Markus

    2016-11-01

    As with bedload transport, wood transport in rivers is governed by several factors such as flow regime, geomorphic configuration of the channel and floodplain, or wood size and shape. Because large-wood tends to be transported during floods, safety and logistical constraints make field measurements difficult. As a result, direct observation and measurements of the conditions of wood transport are scarce. This lack of direct observations and the complexity of the processes involved in wood transport may result in an incomplete understanding of wood transport processes. Numerical modelling provides an alternative approach to addressing some of the unknowns in the dynamics of large-wood in rivers. The aim of this study is to improve the understanding of controls governing wood transport in mountain rivers, combining numerical modelling and direct field observations. By defining different scenarios, we illustrate relationships between the rate of wood transport and discharge, wood size, and river morphology. We test these relationships for a wide, multithread reach and a narrower, partially channelized single-thread reach of the Czarny Dunajec River in the Polish Carpathians. Results indicate that a wide range of quantitative information about wood transport can be obtained from a combination of numerical modelling and field observations and from document contrasting patterns of wood transport in single- and multithread river reaches. On the one hand, log diameter seems to have a greater importance for wood transport in the multithread channel because of shallower flow, lower flow velocity, and lower stream power. Hydrodynamic conditions in the single-thread channel allow transport of large-wood pieces, whereas in the multithread reach, logs with diameters similar to water depth are not being moved. On the other hand, log length also exerts strong control on wood transport, more so in the single-thread than in the multithread reach. In any case, wood transport strongly decreases with increasing piece volume, although this relation is not linear. We also document a nonlinear relationship between wood transport and flood magnitude. A threshold discharge was identified below which wood transport is negligible. This threshold is higher in the multithread reach, while in the single-thread reach floods of lower magnitude are able to transport wood downstream. Wood transport ratio increases with discharge until it reaches an upper threshold or tipping point, and then decreases or increases much more slowly. This threshold is clearly related to bankfull discharge, but it is much higher for the multithread reach than for the single-thread one. Although modelling input and field observations were taken from a specific river, our findings and conclusions are likely to be applicable to a much larger suite of (mountain) rivers.

  1. HTMT-class Latency Tolerant Parallel Architecture for Petaflops Scale Computation

    NASA Technical Reports Server (NTRS)

    Sterling, Thomas; Bergman, Larry

    2000-01-01

    Computational Aero Sciences and other numeric intensive computation disciplines demand computing throughputs substantially greater than the Teraflops scale systems only now becoming available. The related fields of fluids, structures, thermal, combustion, and dynamic controls are among the interdisciplinary areas that in combination with sufficient resolution and advanced adaptive techniques may force performance requirements towards Petaflops. This will be especially true for compute intensive models such as Navier-Stokes are or when such system models are only part of a larger design optimization computation involving many design points. Yet recent experience with conventional MPP configurations comprising commodity processing and memory components has shown that larger scale frequently results in higher programming difficulty and lower system efficiency. While important advances in system software and algorithms techniques have had some impact on efficiency and programmability for certain classes of problems, in general it is unlikely that software alone will resolve the challenges to higher scalability. As in the past, future generations of high-end computers may require a combination of hardware architecture and system software advances to enable efficient operation at a Petaflops level. The NASA led HTMT project has engaged the talents of a broad interdisciplinary team to develop a new strategy in high-end system architecture to deliver petaflops scale computing in the 2004/5 timeframe. The Hybrid-Technology, MultiThreaded parallel computer architecture incorporates several advanced technologies in combination with an innovative dynamic adaptive scheduling mechanism to provide unprecedented performance and efficiency within practical constraints of cost, complexity, and power consumption. The emerging superconductor Rapid Single Flux Quantum electronics can operate at 100 GHz (the record is 770 GHz) and one percent of the power required by convention semiconductor logic. Wave Division Multiplexing optical communications can approach a peak per fiber bandwidth of 1 Tbps and the new Data Vortex network topology employing this technology can connect tens of thousands of ports providing a bi-section bandwidth on the order of a Petabyte per second with latencies well below 100 nanoseconds, even under heavy loads. Processor-in-Memory (PIM) technology combines logic and memory on the same chip exposing the internal bandwidth of the memory row buffers at low latency. And holographic storage photorefractive storage technologies provide high-density memory with access a thousand times faster than conventional disk technologies. Together these technologies enable a new class of shared memory system architecture with a peak performance in the range of a Petaflops but size and power requirements comparable to today's largest Teraflops scale systems. To achieve high-sustained performance, HTMT combines an advanced multithreading processor architecture with a memory-driven coarse-grained latency management strategy called "percolation", yielding high efficiency while reducing the much of the parallel programming burden. This paper will present the basic system architecture characteristics made possible through this series of advanced technologies and then give a detailed description of the new percolation approach to runtime latency management.

  2. Multi-thread parallel algorithm for reconstructing 3D large-scale porous structures

    NASA Astrophysics Data System (ADS)

    Ju, Yang; Huang, Yaohui; Zheng, Jiangtao; Qian, Xu; Xie, Heping; Zhao, Xi

    2017-04-01

    Geomaterials inherently contain many discontinuous, multi-scale, geometrically irregular pores, forming a complex porous structure that governs their mechanical and transport properties. The development of an efficient reconstruction method for representing porous structures can significantly contribute toward providing a better understanding of the governing effects of porous structures on the properties of porous materials. In order to improve the efficiency of reconstructing large-scale porous structures, a multi-thread parallel scheme was incorporated into the simulated annealing reconstruction method. In the method, four correlation functions, which include the two-point probability function, the linear-path functions for the pore phase and the solid phase, and the fractal system function for the solid phase, were employed for better reproduction of the complex well-connected porous structures. In addition, a random sphere packing method and a self-developed pre-conditioning method were incorporated to cast the initial reconstructed model and select independent interchanging pairs for parallel multi-thread calculation, respectively. The accuracy of the proposed algorithm was evaluated by examining the similarity between the reconstructed structure and a prototype in terms of their geometrical, topological, and mechanical properties. Comparisons of the reconstruction efficiency of porous models with various scales indicated that the parallel multi-thread scheme significantly shortened the execution time for reconstruction of a large-scale well-connected porous model compared to a sequential single-thread procedure.

  3. Fuel processing device

    DOEpatents

    Ahluwalia, Rajesh K [Burr Ridge, IL; Ahmed, Shabbir [Naperville, IL; Lee, Sheldon H. D. [Willowbrook, IL

    2011-08-02

    An improved fuel processor for fuel cells is provided whereby the startup time of the processor is less than sixty seconds and can be as low as 30 seconds, if not less. A rapid startup time is achieved by either igniting or allowing a small mixture of air and fuel to react over and warm up the catalyst of an autothermal reformer (ATR). The ATR then produces combustible gases to be subsequently oxidized on and simultaneously warm up water-gas shift zone catalysts. After normal operating temperature has been achieved, the proportion of air included with the fuel is greatly diminished.

  4. Development Of A Three-Dimensional Circuit Integration Technology And Computer Architecture

    NASA Astrophysics Data System (ADS)

    Etchells, R. D.; Grinberg, J.; Nudd, G. R.

    1981-12-01

    This paper is the first of a series 1,2,3 describing a range of efforts at Hughes Research Laboratories, which are collectively referred to as "Three-Dimensional Microelectronics." The technology being developed is a combination of a unique circuit fabrication/packaging technology and a novel processing architecture. The packaging technology greatly reduces the parasitic impedances associated with signal-routing in complex VLSI structures, while simultaneously allowing circuit densities orders of magnitude higher than the current state-of-the-art. When combined with the 3-D processor architecture, the resulting machine exhibits a one- to two-order of magnitude simultaneous improvement over current state-of-the-art machines in the three areas of processing speed, power consumption, and physical volume. The 3-D architecture is essentially that commonly referred to as a "cellular array", with the ultimate implementation having as many as 512 x 512 processors working in parallel. The three-dimensional nature of the assembled machine arises from the fact that the chips containing the active circuitry of the processor are stacked on top of each other. In this structure, electrical signals are passed vertically through the chips via thermomigrated aluminum feedthroughs. Signals are passed between adjacent chips by micro-interconnects. This discussion presents a broad view of the total effort, as well as a more detailed treatment of the fabrication and packaging technologies themselves. The results of performance simulations of the completed 3-D processor executing a variety of algorithms are also presented. Of particular pertinence to the interests of the focal-plane array community is the simulation of the UNICORNS nonuniformity correction algorithms as executed by the 3-D architecture.

  5. Efficient Parallelization of a Dynamic Unstructured Application on the Tera MTA

    NASA Technical Reports Server (NTRS)

    Oliker, Leonid; Biswas, Rupak

    1999-01-01

    The success of parallel computing in solving real-life computationally-intensive problems relies on their efficient mapping and execution on large-scale multiprocessor architectures. Many important applications are both unstructured and dynamic in nature, making their efficient parallel implementation a daunting task. This paper presents the parallelization of a dynamic unstructured mesh adaptation algorithm using three popular programming paradigms on three leading supercomputers. We examine an MPI message-passing implementation on the Cray T3E and the SGI Origin2OOO, a shared-memory implementation using cache coherent nonuniform memory access (CC-NUMA) of the Origin2OOO, and a multi-threaded version on the newly-released Tera Multi-threaded Architecture (MTA). We compare several critical factors of this parallel code development, including runtime, scalability, programmability, and memory overhead. Our overall results demonstrate that multi-threaded systems offer tremendous potential for quickly and efficiently solving some of the most challenging real-life problems on parallel computers.

  6. Modernizing the ATLAS simulation infrastructure

    NASA Astrophysics Data System (ADS)

    Di Simone, A.; CollaborationAlbert-Ludwigs-Universitt Freiburg, ATLAS; Institut, Physikalisches; Br., 79104 Freiburg i.; Germany

    2017-10-01

    The ATLAS Simulation infrastructure has been used to produce upwards of 50 billion proton-proton collision events for analyses ranging from detailed Standard Model measurements to searches for exotic new phenomena. In the last several years, the infrastructure has been heavily revised to allow intuitive multithreading and significantly improved maintainability. Such a massive update of a legacy code base requires careful choices about what pieces of code to completely rewrite and what to wrap or revise. The initialization of the complex geometry was generalized to allow new tools and geometry description languages, popular in some detector groups. The addition of multithreading requires Geant4-MT and GaudiHive, two frameworks with fundamentally different approaches to multithreading, to work together. It also required enforcing thread safety throughout a large code base, which required the redesign of several aspects of the simulation, including truth, the record of particle interactions with the detector during the simulation. These advances were possible thanks to close interactions with the Geant4 developers.

  7. A new parallel-vector finite element analysis software on distributed-memory computers

    NASA Technical Reports Server (NTRS)

    Qin, Jiangning; Nguyen, Duc T.

    1993-01-01

    A new parallel-vector finite element analysis software package MPFEA (Massively Parallel-vector Finite Element Analysis) is developed for large-scale structural analysis on massively parallel computers with distributed-memory. MPFEA is designed for parallel generation and assembly of the global finite element stiffness matrices as well as parallel solution of the simultaneous linear equations, since these are often the major time-consuming parts of a finite element analysis. Block-skyline storage scheme along with vector-unrolling techniques are used to enhance the vector performance. Communications among processors are carried out concurrently with arithmetic operations to reduce the total execution time. Numerical results on the Intel iPSC/860 computers (such as the Intel Gamma with 128 processors and the Intel Touchstone Delta with 512 processors) are presented, including an aircraft structure and some very large truss structures, to demonstrate the efficiency and accuracy of MPFEA.

  8. Tunable multi-wavelength fiber lasers based on an Opto-VLSI processor and optical amplifiers.

    PubMed

    Xiao, Feng; Alameh, Kamal; Lee, Yong Tak

    2009-12-07

    A multi-wavelength tunable fiber laser based on the use of an Opto-VLSI processor in conjunction with different optical amplifiers is proposed and experimentally demonstrated. The Opto-VLSI processor can simultaneously select any part of the gain spectrum from each optical amplifier into its associated fiber ring, leading to a multiport tunable fiber laser source. We experimentally demonstrate a 3-port tunable fiber laser source, where each output wavelength of each port can independently be tuned within the C-band with a wavelength step of about 0.05 nm. Experimental results demonstrate a laser linewidth as narrow as 0.05 nm and an optical side-mode-suppression-ratio (SMSR) of about 35 dB. The demonstrated three fiber lasers have excellent stability at room temperature and output power uniformity less than 0.5 dB over the whole C-band.

  9. Compact propane fuel processor for auxiliary power unit application

    NASA Astrophysics Data System (ADS)

    Dokupil, M.; Spitta, C.; Mathiak, J.; Beckhaus, P.; Heinzel, A.

    With focus on mobile applications a fuel cell auxiliary power unit (APU) using liquefied petroleum gas (LPG) is currently being developed at the Centre for Fuel Cell Technology (Zentrum für BrennstoffzellenTechnik, ZBT gGmbH). The system is consisting of an integrated compact and lightweight fuel processor and a low temperature PEM fuel cell for an electric power output of 300 W. This article is presenting the current status of development of the fuel processor which is designed for a nominal hydrogen output of 1 k Wth,H2 within a load range from 50 to 120%. A modular setup was chosen defining a reformer/burner module and a CO-purification module. Based on the performance specifications, thermodynamic simulations, benchmarking and selection of catalysts the modules have been developed and characterised simultaneously and then assembled to the complete fuel processor. Automated operation results in a cold startup time of about 25 min for nominal load and carbon monoxide output concentrations below 50 ppm for steady state and dynamic operation. Also fast transient response of the fuel processor at load changes with low fluctuations of the reformate gas composition have been achieved. Beside the development of the main reactors the transfer of the fuel processor to an autonomous system is of major concern. Hence, concepts for packaging have been developed resulting in a volume of 7 l and a weight of 3 kg. Further a selection of peripheral components has been tested and evaluated regarding to the substitution of the laboratory equipment.

  10. P-HS-SFM: a parallel harmony search algorithm for the reproduction of experimental data in the continuous microscopic crowd dynamic models

    NASA Astrophysics Data System (ADS)

    Jaber, Khalid Mohammad; Alia, Osama Moh'd.; Shuaib, Mohammed Mahmod

    2018-03-01

    Finding the optimal parameters that can reproduce experimental data (such as the velocity-density relation and the specific flow rate) is a very important component of the validation and calibration of microscopic crowd dynamic models. Heavy computational demand during parameter search is a known limitation that exists in a previously developed model known as the Harmony Search-Based Social Force Model (HS-SFM). In this paper, a parallel-based mechanism is proposed to reduce the computational time and memory resource utilisation required to find these parameters. More specifically, two MATLAB-based multicore techniques (parfor and create independent jobs) using shared memory are developed by taking advantage of the multithreading capabilities of parallel computing, resulting in a new framework called the Parallel Harmony Search-Based Social Force Model (P-HS-SFM). The experimental results show that the parfor-based P-HS-SFM achieved a better computational time of about 26 h, an efficiency improvement of ? 54% and a speedup factor of 2.196 times in comparison with the HS-SFM sequential processor. The performance of the P-HS-SFM using the create independent jobs approach is also comparable to parfor with a computational time of 26.8 h, an efficiency improvement of about 30% and a speedup of 2.137 times.

  11. A simple modern correctness condition for a space-based high-performance multiprocessor

    NASA Technical Reports Server (NTRS)

    Probst, David K.; Li, Hon F.

    1992-01-01

    A number of U.S. national programs, including space-based detection of ballistic missile launches, envisage putting significant computing power into space. Given sufficient progress in low-power VLSI, multichip-module packaging and liquid-cooling technologies, we will see design of high-performance multiprocessors for individual satellites. In very high speed implementations, performance depends critically on tolerating large latencies in interprocessor communication; without latency tolerance, performance is limited by the vastly differing time scales in processor and data-memory modules, including interconnect times. The modern approach to tolerating remote-communication cost in scalable, shared-memory multiprocessors is to use a multithreaded architecture, and alter the semantics of shared memory slightly, at the price of forcing the programmer either to reason about program correctness in a relaxed consistency model or to agree to program in a constrained style. The literature on multiprocessor correctness conditions has become increasingly complex, and sometimes confusing, which may hinder its practical application. We propose a simple modern correctness condition for a high-performance, shared-memory multiprocessor; the correctness condition is based on a simple interface between the multiprocessor architecture and a high-performance, shared-memory multiprocessor; the correctness condition is based on a simple interface between the multiprocessor architecture and the parallel programming system.

  12. High-Speed GPU-Based Fully Three-Dimensional Diffuse Optical Tomographic System

    PubMed Central

    Saikia, Manob Jyoti; Kanhirodan, Rajan; Mohan Vasu, Ram

    2014-01-01

    We have developed a graphics processor unit (GPU-) based high-speed fully 3D system for diffuse optical tomography (DOT). The reduction in execution time of 3D DOT algorithm, a severely ill-posed problem, is made possible through the use of (1) an algorithmic improvement that uses Broyden approach for updating the Jacobian matrix and thereby updating the parameter matrix and (2) the multinode multithreaded GPU and CUDA (Compute Unified Device Architecture) software architecture. Two different GPU implementations of DOT programs are developed in this study: (1) conventional C language program augmented by GPU CUDA and CULA routines (C GPU), (2) MATLAB program supported by MATLAB parallel computing toolkit for GPU (MATLAB GPU). The computation time of the algorithm on host CPU and the GPU system is presented for C and Matlab implementations. The forward computation uses finite element method (FEM) and the problem domain is discretized into 14610, 30823, and 66514 tetrahedral elements. The reconstruction time, so achieved for one iteration of the DOT reconstruction for 14610 elements, is 0.52 seconds for a C based GPU program for 2-plane measurements. The corresponding MATLAB based GPU program took 0.86 seconds. The maximum number of reconstructed frames so achieved is 2 frames per second. PMID:24891848

  13. High-Speed GPU-Based Fully Three-Dimensional Diffuse Optical Tomographic System.

    PubMed

    Saikia, Manob Jyoti; Kanhirodan, Rajan; Mohan Vasu, Ram

    2014-01-01

    We have developed a graphics processor unit (GPU-) based high-speed fully 3D system for diffuse optical tomography (DOT). The reduction in execution time of 3D DOT algorithm, a severely ill-posed problem, is made possible through the use of (1) an algorithmic improvement that uses Broyden approach for updating the Jacobian matrix and thereby updating the parameter matrix and (2) the multinode multithreaded GPU and CUDA (Compute Unified Device Architecture) software architecture. Two different GPU implementations of DOT programs are developed in this study: (1) conventional C language program augmented by GPU CUDA and CULA routines (C GPU), (2) MATLAB program supported by MATLAB parallel computing toolkit for GPU (MATLAB GPU). The computation time of the algorithm on host CPU and the GPU system is presented for C and Matlab implementations. The forward computation uses finite element method (FEM) and the problem domain is discretized into 14610, 30823, and 66514 tetrahedral elements. The reconstruction time, so achieved for one iteration of the DOT reconstruction for 14610 elements, is 0.52 seconds for a C based GPU program for 2-plane measurements. The corresponding MATLAB based GPU program took 0.86 seconds. The maximum number of reconstructed frames so achieved is 2 frames per second.

  14. Cranial implant design using augmented reality immersive system.

    PubMed

    Ai, Zhuming; Evenhouse, Ray; Leigh, Jason; Charbel, Fady; Rasmussen, Mary

    2007-01-01

    Software tools that utilize haptics for sculpting precise fitting cranial implants are utilized in an augmented reality immersive system to create a virtual working environment for the modelers. The virtual environment is designed to mimic the traditional working environment as closely as possible, providing more functionality for the users. The implant design process uses patient CT data of a defective area. This volumetric data is displayed in an implant modeling tele-immersive augmented reality system where the modeler can build a patient specific implant that precisely fits the defect. To mimic the traditional sculpting workspace, the implant modeling augmented reality system includes stereo vision, viewer centered perspective, sense of touch, and collaboration. To achieve optimized performance, this system includes a dual-processor PC, fast volume rendering with three-dimensional texture mapping, the fast haptic rendering algorithm, and a multi-threading architecture. The system replaces the expensive and time consuming traditional sculpting steps such as physical sculpting, mold making, and defect stereolithography. This augmented reality system is part of a comprehensive tele-immersive system that includes a conference-room-sized system for tele-immersive small group consultation and an inexpensive, easily deployable networked desktop virtual reality system for surgical consultation, evaluation and collaboration. This system has been used to design patient-specific cranial implants with precise fit.

  15. Multi-Threaded Algorithms for GPGPU in the ATLAS High Level Trigger

    NASA Astrophysics Data System (ADS)

    Conde Muíño, P.; ATLAS Collaboration

    2017-10-01

    General purpose Graphics Processor Units (GPGPU) are being evaluated for possible future inclusion in an upgraded ATLAS High Level Trigger farm. We have developed a demonstrator including GPGPU implementations of Inner Detector and Muon tracking and Calorimeter clustering within the ATLAS software framework. ATLAS is a general purpose particle physics experiment located on the LHC collider at CERN. The ATLAS Trigger system consists of two levels, with Level-1 implemented in hardware and the High Level Trigger implemented in software running on a farm of commodity CPU. The High Level Trigger reduces the trigger rate from the 100 kHz Level-1 acceptance rate to 1.5 kHz for recording, requiring an average per-event processing time of ∼ 250 ms for this task. The selection in the high level trigger is based on reconstructing tracks in the Inner Detector and Muon Spectrometer and clusters of energy deposited in the Calorimeter. Performing this reconstruction within the available farm resources presents a significant challenge that will increase significantly with future LHC upgrades. During the LHC data taking period starting in 2021, luminosity will reach up to three times the original design value. Luminosity will increase further to 7.5 times the design value in 2026 following LHC and ATLAS upgrades. Corresponding improvements in the speed of the reconstruction code will be needed to provide the required trigger selection power within affordable computing resources. Key factors determining the potential benefit of including GPGPU as part of the HLT processor farm are: the relative speed of the CPU and GPGPU algorithm implementations; the relative execution times of the GPGPU algorithms and serial code remaining on the CPU; the number of GPGPU required, and the relative financial cost of the selected GPGPU. We give a brief overview of the algorithms implemented and present new measurements that compare the performance of various configurations exploiting GPGPU cards.

  16. Human Cognition and Information Processing: Potential Problems for a Field Dependent Human Sequential Information Processor.

    ERIC Educational Resources Information Center

    Shaughnessy, M.; And Others

    Numerous cognitive psychologists have validated the hypothesis, originally advanced by the Russian physician, A. Luria, that different individuals process information in two distinctly different manners: simultaneously and sequentially. The importance of recognizing the existence of these two distinct styles of processing information and selecting…

  17. Data processing techniques used with MST radars: A review

    NASA Technical Reports Server (NTRS)

    Rastogi, P. K.

    1983-01-01

    The data processing methods used in high power radar probing of the middle atmosphere are examined. The radar acts as a spatial filter on the small scale refractivity fluctuations in the medium. The characteristics of the received signals are related to the statistical properties of these fluctuations. A functional outline of the components of a radar system is given. Most computation intensive tasks are carried out by the processor. The processor computes a statistical function of the received signals, simultaneously for a large number of ranges. The slow fading of atmospheric signals is used to reduce the data input rate to the processor by coherent integration. The inherent range resolution of the radar experiments can be improved significant with the use of pseudonoise phase codes to modulate the transmitted pulses and a corresponding decoding operation on the received signals. Commutability of the decoding and coherent integration operations is used to obtain a significant reduction in computations. The limitations of the processors are outlined. At the next level of data reduction, the measured function is parameterized by a few spectral moments that can be related to physical processes in the medium. The problems encountered in estimating the spectral moments in the presence of strong ground clutter, external interference, and noise are discussed. The graphical and statistical analysis of the inferred parameters are outlined. The requirements for special purpose processors for MST radars are discussed.

  18. Expressing Parallelism with ROOT

    NASA Astrophysics Data System (ADS)

    Piparo, D.; Tejedor, E.; Guiraud, E.; Ganis, G.; Mato, P.; Moneta, L.; Valls Pla, X.; Canal, P.

    2017-10-01

    The need for processing the ever-increasing amount of data generated by the LHC experiments in a more efficient way has motivated ROOT to further develop its support for parallelism. Such support is being tackled both for shared-memory and distributed-memory environments. The incarnations of the aforementioned parallelism are multi-threading, multi-processing and cluster-wide executions. In the area of multi-threading, we discuss the new implicit parallelism and related interfaces, as well as the new building blocks to safely operate with ROOT objects in a multi-threaded environment. Regarding multi-processing, we review the new MultiProc framework, comparing it with similar tools (e.g. multiprocessing module in Python). Finally, as an alternative to PROOF for cluster-wide executions, we introduce the efforts on integrating ROOT with state-of-the-art distributed data processing technologies like Spark, both in terms of programming model and runtime design (with EOS as one of the main components). For all the levels of parallelism, we discuss, based on real-life examples and measurements, how our proposals can increase the productivity of scientists.

  19. Expressing Parallelism with ROOT

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Piparo, D.; Tejedor, E.; Guiraud, E.

    The need for processing the ever-increasing amount of data generated by the LHC experiments in a more efficient way has motivated ROOT to further develop its support for parallelism. Such support is being tackled both for shared-memory and distributed-memory environments. The incarnations of the aforementioned parallelism are multi-threading, multi-processing and cluster-wide executions. In the area of multi-threading, we discuss the new implicit parallelism and related interfaces, as well as the new building blocks to safely operate with ROOT objects in a multi-threaded environment. Regarding multi-processing, we review the new MultiProc framework, comparing it with similar tools (e.g. multiprocessing module inmore » Python). Finally, as an alternative to PROOF for cluster-wide executions, we introduce the efforts on integrating ROOT with state-of-the-art distributed data processing technologies like Spark, both in terms of programming model and runtime design (with EOS as one of the main components). For all the levels of parallelism, we discuss, based on real-life examples and measurements, how our proposals can increase the productivity of scientists.« less

  20. Software design of a remote real-time ECG monitoring system

    NASA Astrophysics Data System (ADS)

    Yu, Chengbo; Tao, Hongyan

    2005-12-01

    Heart disease is one of the main diseases that threaten the health and lives of human beings. At present, the normal remote ECG monitoring system has the disadvantages of a short testing distance and limitation of monitoring lines. Because of accident and paroxysmal disease, ECG monitoring has extended from the hospital to the family. Therefore, remote ECG monitoring through the Internet has the actual value and significance. The principle and design method of software of the remote dynamic ECG monitor was presented and discussed. The monitoring software is programmed with Delphi software based on client-sever interactive mode. The application program of the system, which makes use of multithreading technology, is shown to perform in an excellent manner. The program includes remote link users and ECG processing, i.e. ECG data's receiving, real-time displaying, recording and replaying. The system can connect many clients simultaneously and perform real-time monitoring to patients.

  1. Automated target recognition and tracking using an optical pattern recognition neural network

    NASA Technical Reports Server (NTRS)

    Chao, Tien-Hsin

    1991-01-01

    The on-going development of an automatic target recognition and tracking system at the Jet Propulsion Laboratory is presented. This system is an optical pattern recognition neural network (OPRNN) that is an integration of an innovative optical parallel processor and a feature extraction based neural net training algorithm. The parallel optical processor provides high speed and vast parallelism as well as full shift invariance. The neural network algorithm enables simultaneous discrimination of multiple noisy targets in spite of their scales, rotations, perspectives, and various deformations. This fully developed OPRNN system can be effectively utilized for the automated spacecraft recognition and tracking that will lead to success in the Automated Rendezvous and Capture (AR&C) of the unmanned Cargo Transfer Vehicle (CTV). One of the most powerful optical parallel processors for automatic target recognition is the multichannel correlator. With the inherent advantages of parallel processing capability and shift invariance, multiple objects can be simultaneously recognized and tracked using this multichannel correlator. This target tracking capability can be greatly enhanced by utilizing a powerful feature extraction based neural network training algorithm such as the neocognitron. The OPRNN, currently under investigation at JPL, is constructed with an optical multichannel correlator where holographic filters have been prepared using the neocognitron training algorithm. The computation speed of the neocognitron-type OPRNN is up to 10(exp 14) analog connections/sec that enabling the OPRNN to outperform its state-of-the-art electronics counterpart by at least two orders of magnitude.

  2. PIFEX: An advanced programmable pipelined-image processor

    NASA Technical Reports Server (NTRS)

    Gennery, D. B.; Wilcox, B.

    1985-01-01

    PIFEX is a pipelined-image processor being built in the JPL Robotics Lab. It will operate on digitized raster-scanned images (at 60 frames per second for images up to about 300 by 400 and at lesser rates for larger images), performing a variety of operations simultaneously under program control. It thus is a powerful, flexible tool for image processing and low-level computer vision. It also has applications in other two-dimensional problems such as route planning for obstacle avoidance and the numerical solution of two-dimensional partial differential equations (although its low numerical precision limits its use in the latter field). The concept and design of PIFEX are described herein, and some examples of its use are given.

  3. Skin contamination dosimeter

    DOEpatents

    Hamby, David M [Corvallis, OR; Farsoni, Abdollah T [Corvallis, OR; Cazalas, Edward [Corvallis, OR

    2011-06-21

    A technique and device provides absolute skin dosimetry in real time at multiple tissue depths simultaneously. The device uses a phoswich detector which has multiple scintillators embedded at different depths within a non-scintillating material. A digital pulse processor connected to the phoswich detector measures a differential distribution (dN/dH) of count rate N as function of pulse height H for signals from each of the multiple scintillators. A digital processor computes in real time from the differential count-rate distribution for each of multiple scintillators an estimate of an ionizing radiation dose delivered to each of multiple depths of skin tissue corresponding to the multiple scintillators embedded at multiple corresponding depths within the non-scintillating material.

  4. Operational summary of an electric propulsion long term test facility

    NASA Technical Reports Server (NTRS)

    Trump, G. E.; James, E. L.; Bechtel, R. T.

    1982-01-01

    An automated test facility capable of simultaneously operating three 2.5 kW, 30-cm mercury ion thrusters and their power processors is described, along with a test program conducted for the documentation of thruster characteristics as a function of time. Facility controls are analog, with full redundancy, so that in the event of malfunction the facility automaticcally activates a backup mode and notifies an operator. Test data are recorded by a central data collection system and processed as daily averages. The facility has operated continuously for a period of 37 months, over which nine mercury ion thrusters and four power processor units accumulated a total of over 14,500 hours of thruster operating time.

  5. Global Futures: a multithreaded execution model for Global Arrays-based applications

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Chavarría-Miranda, Daniel; Krishnamoorthy, Sriram; Vishnu, Abhinav

    2012-05-31

    We present Global Futures (GF), an execution model extension to Global Arrays, which is based on a PGAS-compatible Active Message-based paradigm. We describe the design and implementation of Global Futures and illustrate its use in a computational chemistry application benchmark (Hartree-Fock matrix construction using the Self-Consistent Field method). Our results show how we used GF to increase the scalability of the Hartree-Fock matrix build to up to 6,144 cores of an Infiniband cluster. We also show how GF's multithreaded execution has comparable performance to the traditional process-based SPMD model.

  6. Simultaneous bilateral cochlear implantation in a five-month-old child with Usher syndrome.

    PubMed

    Alsanosi, A A

    2015-09-01

    To report a rare case of simultaneous bilateral cochlear implantation in a five-month-old child with Usher syndrome. Case report. A five-month-old boy with Usher syndrome and congenital profound bilateral deafness underwent simultaneous bilateral cochlear implantation. The decision to perform implantation in such a young child was based on his having a supportive family and the desire to foster his audiological development before his vision deteriorated. The subject experienced easily resolvable intra- and post-operative adverse events, and was first fitted with an externally worn audio processor four weeks after implantation. At 14 months of age, his audiological development was age-appropriate. Simultaneous bilateral cochlear implantation is possible, and even advisable, in children as young as five months old when performed by an experienced implantation team.

  7. Using a source-to-source transformation to introduce multi-threading into the AliRoot framework for a parallel event reconstruction

    NASA Astrophysics Data System (ADS)

    Lohn, Stefan B.; Dong, Xin; Carminati, Federico

    2012-12-01

    Chip-Multiprocessors are going to support massive parallelism by many additional physical and logical cores. Improving performance can no longer be obtained by increasing clock-frequency because the technical limits are almost reached. Instead, parallel execution must be used to gain performance. Resources like main memory, the cache hierarchy, bandwidth of the memory bus or links between cores and sockets are not going to be improved as fast. Hence, parallelism can only result into performance gains if the memory usage is optimized and the communication between threads is minimized. Besides concurrent programming has become a domain for experts. Implementing multi-threading is error prone and labor-intensive. A full reimplementation of the whole AliRoot source-code is unaffordable. This paper describes the effort to evaluate the adaption of AliRoot to the needs of multi-threading and to provide the capability of parallel processing by using a semi-automatic source-to-source transformation to address the problems as described before and to provide a straight-forward way of parallelization with almost no interference between threads. This makes the approach simple and reduces the required manual changes in the code. In a first step, unconditional thread-safety will be introduced to bring the original sequential and thread unaware source-code into the position of utilizing multi-threading. Afterwards further investigations have to be performed to point out candidates of classes that are useful to share amongst threads. Then in a second step, the transformation has to change the code to share these classes and finally to verify if there are anymore invalid interferences between threads.

  8. A FPGA-Based, Granularity-Variable Neuromorphic Processor and Its Application in a MIMO Real-Time Control System.

    PubMed

    Zhang, Zhen; Ma, Cheng; Zhu, Rong

    2017-08-23

    Artificial Neural Networks (ANNs), including Deep Neural Networks (DNNs), have become the state-of-the-art methods in machine learning and achieved amazing success in speech recognition, visual object recognition, and many other domains. There are several hardware platforms for developing accelerated implementation of ANN models. Since Field Programmable Gate Array (FPGA) architectures are flexible and can provide high performance per watt of power consumption, they have drawn a number of applications from scientists. In this paper, we propose a FPGA-based, granularity-variable neuromorphic processor (FBGVNP). The traits of FBGVNP can be summarized as granularity variability, scalability, integrated computing, and addressing ability: first, the number of neurons is variable rather than constant in one core; second, the multi-core network scale can be extended in various forms; third, the neuron addressing and computing processes are executed simultaneously. These make the processor more flexible and better suited for different applications. Moreover, a neural network-based controller is mapped to FBGVNP and applied in a multi-input, multi-output, (MIMO) real-time, temperature-sensing and control system. Experiments validate the effectiveness of the neuromorphic processor. The FBGVNP provides a new scheme for building ANNs, which is flexible, highly energy-efficient, and can be applied in many areas.

  9. A FPGA-Based, Granularity-Variable Neuromorphic Processor and Its Application in a MIMO Real-Time Control System

    PubMed Central

    Zhang, Zhen; Zhu, Rong

    2017-01-01

    Artificial Neural Networks (ANNs), including Deep Neural Networks (DNNs), have become the state-of-the-art methods in machine learning and achieved amazing success in speech recognition, visual object recognition, and many other domains. There are several hardware platforms for developing accelerated implementation of ANN models. Since Field Programmable Gate Array (FPGA) architectures are flexible and can provide high performance per watt of power consumption, they have drawn a number of applications from scientists. In this paper, we propose a FPGA-based, granularity-variable neuromorphic processor (FBGVNP). The traits of FBGVNP can be summarized as granularity variability, scalability, integrated computing, and addressing ability: first, the number of neurons is variable rather than constant in one core; second, the multi-core network scale can be extended in various forms; third, the neuron addressing and computing processes are executed simultaneously. These make the processor more flexible and better suited for different applications. Moreover, a neural network-based controller is mapped to FBGVNP and applied in a multi-input, multi-output, (MIMO) real-time, temperature-sensing and control system. Experiments validate the effectiveness of the neuromorphic processor. The FBGVNP provides a new scheme for building ANNs, which is flexible, highly energy-efficient, and can be applied in many areas. PMID:28832522

  10. Multi-threading performance of Geant4, MCNP6, and PHITS Monte Carlo codes for tetrahedral-mesh geometry.

    PubMed

    Han, Min Cheol; Yeom, Yeon Soo; Lee, Hyun Su; Shin, Bangho; Kim, Chan Hyeong; Furuta, Takuya

    2018-05-04

    In this study, the multi-threading performance of the Geant4, MCNP6, and PHITS codes was evaluated as a function of the number of threads (N) and the complexity of the tetrahedral-mesh phantom. For this, three tetrahedral-mesh phantoms of varying complexity (simple, moderately complex, and highly complex) were prepared and implemented in the three different Monte Carlo codes, in photon and neutron transport simulations. Subsequently, for each case, the initialization time, calculation time, and memory usage were measured as a function of the number of threads used in the simulation. It was found that for all codes, the initialization time significantly increased with the complexity of the phantom, but not with the number of threads. Geant4 exhibited much longer initialization time than the other codes, especially for the complex phantom (MRCP). The improvement of computation speed due to the use of a multi-threaded code was calculated as the speed-up factor, the ratio of the computation speed on a multi-threaded code to the computation speed on a single-threaded code. Geant4 showed the best multi-threading performance among the codes considered in this study, with the speed-up factor almost linearly increasing with the number of threads, reaching ~30 when N  =  40. PHITS and MCNP6 showed a much smaller increase of the speed-up factor with the number of threads. For PHITS, the speed-up factors were low when N  =  40. For MCNP6, the increase of the speed-up factors was better, but they were still less than ~10 when N  =  40. As for memory usage, Geant4 was found to use more memory than the other codes. In addition, compared to that of the other codes, the memory usage of Geant4 more rapidly increased with the number of threads, reaching as high as ~74 GB when N  =  40 for the complex phantom (MRCP). It is notable that compared to that of the other codes, the memory usage of PHITS was much lower, regardless of both the complexity of the phantom and the number of threads, hardly increasing with the number of threads for the MRCP.

  11. Two-dimensional acousto-optic processor using circular antenna array with a Butler matrix

    NASA Astrophysics Data System (ADS)

    Lee, Jim P.

    1992-09-01

    A two-dimensional acousto-optic signal processor is shown to be useful for providing simultaneous spectrum analysis and direction finding of radar signals over an instantaneous field of view of 360 deg. A system analysis with emphasis on the direction-finding aspect of this new architecture is presented. The peak location of the optical pattern provides a direct measure of bearing, independent of signal frequency. In addition, the sidelobe levels of the pattern can be effectively reduced using amplitude weighting. Performance parameters, such as mainlobe beamwidth, peak-sidelobe level, and pointing error, are analyzed as a function of the Gaussian laser illumination profile and the number of channels. Finally, a comparison with a linear antenna array architecture is also discussed.

  12. Interactive digital signal processor

    NASA Technical Reports Server (NTRS)

    Mish, W. H.; Wenger, R. M.; Behannon, K. W.; Byrnes, J. B.

    1982-01-01

    The Interactive Digital Signal Processor (IDSP) is examined. It consists of a set of time series analysis Operators each of which operates on an input file to produce an output file. The operators can be executed in any order that makes sense and recursively, if desired. The operators are the various algorithms used in digital time series analysis work. User written operators can be easily interfaced to the sysatem. The system can be operated both interactively and in batch mode. In IDSP a file can consist of up to n (currently n=8) simultaneous time series. IDSP currently includes over thirty standard operators that range from Fourier transform operations, design and application of digital filters, eigenvalue analysis, to operators that provide graphical output, allow batch operation, editing and display information.

  13. Method and apparatus for measurement of orientation in an anisotropic medium

    DOEpatents

    Gilmore, Robert Snee; Kline, Ronald Alan; Deaton, Jr., John Broddus

    1999-01-01

    A method and apparatus are provided for simultaneously measuring the anisotropic orientation and the thickness of an article. The apparatus comprises a transducer assembly which propagates longitudinal and transverse waves through the article and which receives reflections of the waves. A processor is provided to measure respective transit times of the longitudinal and shear waves propagated through the article and to calculate respective predicted transit times of the longitudinal and shear waves based on an estimated thickness, an estimated anisotropic orientation, and an elasticity of the article. The processor adjusts the estimated thickness and the estimated anisotropic orientation to reduce the difference between the measured transit times and the respective predicted transit times of the longitudinal and shear waves.

  14. A portable platform to collect and review behavioral data simultaneously with neurophysiological signals.

    PubMed

    Tianxiao Jiang; Siddiqui, Hasan; Ray, Shruti; Asman, Priscella; Ozturk, Musa; Ince, Nuri F

    2017-07-01

    This paper presents a portable platform to collect and review behavioral data simultaneously with neurophysiological signals. The whole system is comprised of four parts: a sensor data acquisition interface, a socket server for real-time data streaming, a Simulink system for real-time processing and an offline data review and analysis toolbox. A low-cost microcontroller is used to acquire data from external sensors such as accelerometer and hand dynamometer. The micro-controller transfers the data either directly through USB or wirelessly through a bluetooth module to a data server written in C++ for MS Windows OS. The data server also interfaces with the digital glove and captures HD video from webcam. The acquired sensor data are streamed under User Datagram Protocol (UDP) to other applications such as Simulink/Matlab for real-time analysis and recording. Neurophysiological signals such as electroencephalography (EEG), electrocorticography (ECoG) and local field potential (LFP) recordings can be collected simultaneously in Simulink and fused with behavioral data. In addition, we developed a customized Matlab Graphical User Interface (GUI) software to review, annotate and analyze the data offline. The software provides a fast, user-friendly data visualization environment with synchronized video playback feature. The software is also capable of reviewing long-term neural recordings. Other featured functions such as fast preprocessing with multithreaded filters, annotation, montage selection, power-spectral density (PSD) estimate, time-frequency map and spatial spectral map are also implemented.

  15. Scalable Triadic Analysis of Large-Scale Graphs: Multi-Core vs. Multi-Processor vs. Multi-Threaded Shared Memory Architectures

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Chin, George; Marquez, Andres; Choudhury, Sutanay

    2012-09-01

    Triadic analysis encompasses a useful set of graph mining methods that is centered on the concept of a triad, which is a subgraph of three nodes and the configuration of directed edges across the nodes. Such methods are often applied in the social sciences as well as many other diverse fields. Triadic methods commonly operate on a triad census that counts the number of triads of every possible edge configuration in a graph. Like other graph algorithms, triadic census algorithms do not scale well when graphs reach tens of millions to billions of nodes. To enable the triadic analysis ofmore » large-scale graphs, we developed and optimized a triad census algorithm to efficiently execute on shared memory architectures. We will retrace the development and evolution of a parallel triad census algorithm. Over the course of several versions, we continually adapted the code’s data structures and program logic to expose more opportunities to exploit parallelism on shared memory that would translate into improved computational performance. We will recall the critical steps and modifications that occurred during code development and optimization. Furthermore, we will compare the performances of triad census algorithm versions on three specific systems: Cray XMT, HP Superdome, and AMD multi-core NUMA machine. These three systems have shared memory architectures but with markedly different hardware capabilities to manage parallelism.« less

  16. DOE Office of Scientific and Technical Information (OSTI.GOV)

    Madduri, Kamesh; Im, Eun-Jin; Ibrahim, Khaled Z.

    The next decade of high-performance computing (HPC) systems will see a rapid evolution and divergence of multi- and manycore architectures as power and cooling constraints limit increases in microprocessor clock speeds. Understanding efficient optimization methodologies on diverse multicore designs in the context of demanding numerical methods is one of the greatest challenges faced today by the HPC community. In this paper, we examine the efficient multicore optimization of GTC, a petascale gyrokinetic toroidal fusion code for studying plasma microturbulence in tokamak devices. For GTC’s key computational components (charge deposition and particle push), we explore efficient parallelization strategies across a broadmore » range of emerging multicore designs, including the recently-released Intel Nehalem-EX, the AMD Opteron Istanbul, and the highly multithreaded Sun UltraSparc T2+. We also present the first study on tuning gyrokinetic particle-in-cell (PIC) algorithms for graphics processors, using the NVIDIA C2050 (Fermi). Our work discusses several novel optimization approaches for gyrokinetic PIC, including mixed-precision computation, particle binning and decomposition strategies, grid replication, SIMDized atomic floating-point operations, and effective GPU texture memory utilization. Overall, we achieve significant performance improvements of 1.3–4.7× on these complex PIC kernels, despite the inherent challenges of data dependency and locality. Finally, our work also points to several architectural and programming features that could significantly enhance PIC performance and productivity on next-generation architectures.« less

  17. Monocular Stereo Measurement Using High-Speed Catadioptric Tracking

    PubMed Central

    Hu, Shaopeng; Matsumoto, Yuji; Takaki, Takeshi; Ishii, Idaku

    2017-01-01

    This paper presents a novel concept of real-time catadioptric stereo tracking using a single ultrafast mirror-drive pan-tilt active vision system that can simultaneously switch between hundreds of different views in a second. By accelerating video-shooting, computation, and actuation at the millisecond-granularity level for time-division multithreaded processing in ultrafast gaze control, the active vision system can function virtually as two or more tracking cameras with different views. It enables a single active vision system to act as virtual left and right pan-tilt cameras that can simultaneously shoot a pair of stereo images for the same object to be observed at arbitrary viewpoints by switching the direction of the mirrors of the active vision system frame by frame. We developed a monocular galvano-mirror-based stereo tracking system that can switch between 500 different views in a second, and it functions as a catadioptric active stereo with left and right pan-tilt tracking cameras that can virtually capture 8-bit color 512×512 images each operating at 250 fps to mechanically track a fast-moving object with a sufficient parallax for accurate 3D measurement. Several tracking experiments for moving objects in 3D space are described to demonstrate the performance of our monocular stereo tracking system. PMID:28792483

  18. The science of computing - Parallel computation

    NASA Technical Reports Server (NTRS)

    Denning, P. J.

    1985-01-01

    Although parallel computation architectures have been known for computers since the 1920s, it was only in the 1970s that microelectronic components technologies advanced to the point where it became feasible to incorporate multiple processors in one machine. Concommitantly, the development of algorithms for parallel processing also lagged due to hardware limitations. The speed of computing with solid-state chips is limited by gate switching delays. The physical limit implies that a 1 Gflop operational speed is the maximum for sequential processors. A computer recently introduced features a 'hypercube' architecture with 128 processors connected in networks at 5, 6 or 7 points per grid, depending on the design choice. Its computing speed rivals that of supercomputers, but at a fraction of the cost. The added speed with less hardware is due to parallel processing, which utilizes algorithms representing different parts of an equation that can be broken into simpler statements and processed simultaneously. Present, highly developed computer languages like FORTRAN, PASCAL, COBOL, etc., rely on sequential instructions. Thus, increased emphasis will now be directed at parallel processing algorithms to exploit the new architectures.

  19. Discovering Motifs in Biological Sequences Using the Micron Automata Processor.

    PubMed

    Roy, Indranil; Aluru, Srinivas

    2016-01-01

    Finding approximately conserved sequences, called motifs, across multiple DNA or protein sequences is an important problem in computational biology. In this paper, we consider the (l, d) motif search problem of identifying one or more motifs of length l present in at least q of the n given sequences, with each occurrence differing from the motif in at most d substitutions. The problem is known to be NP-complete, and the largest solved instance reported to date is (26,11). We propose a novel algorithm for the (l,d) motif search problem using streaming execution over a large set of non-deterministic finite automata (NFA). This solution is designed to take advantage of the micron automata processor, a new technology close to deployment that can simultaneously execute multiple NFA in parallel. We demonstrate the capability for solving much larger instances of the (l, d) motif search problem using the resources available within a single automata processor board, by estimating run-times for problem instances (39,18) and (40,17). The paper serves as a useful guide to solving problems using this new accelerator technology.

  20. Performances of multiprocessor multidisk architectures for continuous media storage

    NASA Astrophysics Data System (ADS)

    Gennart, Benoit A.; Messerli, Vincent; Hersch, Roger D.

    1996-03-01

    Multimedia interfaces increase the need for large image databases, capable of storing and reading streams of data with strict synchronicity and isochronicity requirements. In order to fulfill these requirements, we consider a parallel image server architecture which relies on arrays of intelligent disk nodes, each disk node being composed of one processor and one or more disks. This contribution analyzes through bottleneck performance evaluation and simulation the behavior of two multi-processor multi-disk architectures: a point-to-point architecture and a shared-bus architecture similar to current multiprocessor workstation architectures. We compare the two architectures on the basis of two multimedia algorithms: the compute-bound frame resizing by resampling and the data-bound disk-to-client stream transfer. The results suggest that the shared bus is a potential bottleneck despite its very high hardware throughput (400Mbytes/s) and that an architecture with addressable local memories located closely to their respective processors could partially remove this bottleneck. The point- to-point architecture is scalable and able to sustain high throughputs for simultaneous compute- bound and data-bound operations.

  1. Nebo: An efficient, parallel, and portable domain-specific language for numerically solving partial differential equations

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Earl, Christopher; Might, Matthew; Bagusetty, Abhishek

    This study presents Nebo, a declarative domain-specific language embedded in C++ for discretizing partial differential equations for transport phenomena on multiple architectures. Application programmers use Nebo to write code that appears sequential but can be run in parallel, without editing the code. Currently Nebo supports single-thread execution, multi-thread execution, and many-core (GPU-based) execution. With single-thread execution, Nebo performs on par with code written by domain experts. With multi-thread execution, Nebo can linearly scale (with roughly 90% efficiency) up to 12 cores, compared to its single-thread execution. Moreover, Nebo’s many-core execution can be over 140x faster than its single-thread execution.

  2. Nebo: An efficient, parallel, and portable domain-specific language for numerically solving partial differential equations

    DOE PAGES

    Earl, Christopher; Might, Matthew; Bagusetty, Abhishek; ...

    2016-01-26

    This study presents Nebo, a declarative domain-specific language embedded in C++ for discretizing partial differential equations for transport phenomena on multiple architectures. Application programmers use Nebo to write code that appears sequential but can be run in parallel, without editing the code. Currently Nebo supports single-thread execution, multi-thread execution, and many-core (GPU-based) execution. With single-thread execution, Nebo performs on par with code written by domain experts. With multi-thread execution, Nebo can linearly scale (with roughly 90% efficiency) up to 12 cores, compared to its single-thread execution. Moreover, Nebo’s many-core execution can be over 140x faster than its single-thread execution.

  3. A VxD-based automatic blending system using multithreaded programming.

    PubMed

    Wang, L; Jiang, X; Chen, Y; Tan, K C

    2004-01-01

    This paper discusses the object-oriented software design for an automatic blending system. By combining the advantages of a programmable logic controller (PLC) and an industrial control PC (ICPC), an automatic blending control system is developed for a chemical plant. The system structure and multithread-based communication approach are first presented in this paper. The overall software design issues, such as system requirements and functionalities, are then discussed in detail. Furthermore, by replacing the conventional dynamic link library (DLL) with virtual X device drivers (VxD's), a practical and cost-effective solution is provided to improve the robustness of the Windows platform-based automatic blending system in small- and medium-sized plants.

  4. Real time display Fourier-domain OCT using multi-thread parallel computing with data vectorization

    NASA Astrophysics Data System (ADS)

    Eom, Tae Joong; Kim, Hoon Seop; Kim, Chul Min; Lee, Yeung Lak; Choi, Eun-Seo

    2011-03-01

    We demonstrate a real-time display of processed OCT images using multi-thread parallel computing with a quad-core CPU of a personal computer. The data of each A-line are treated as one vector to maximize the data translation rate between the cores of the CPU and RAM stored image data. A display rate of 29.9 frames/sec for processed OCT data (4096 FFT-size x 500 A-scans) is achieved in our system using a wavelength swept source with 52-kHz swept frequency. The data processing times of the OCT image and a Doppler OCT image with a 4-time average are 23.8 msec and 91.4 msec.

  5. Method for improving catalyst function in auto-thermal and partial oxidation reformer-based processors

    DOEpatents

    Ahmed, Shabbir; Papadias, Dionissios D.; Lee, Sheldon H.D.; Ahluwalia, Rajesh K.

    2014-08-26

    The invention provides a method for reforming fuel, the method comprising contacting the fuel to an oxidation catalyst so as to partially oxidize the fuel and generate heat; warming incoming fuel with the heat while simultaneously warming a reforming catalyst with the heat; and reacting the partially oxidized fuel with steam using the reforming catalyst.

  6. Simultaneous master-slave Omega pairs. [navigation system featuring low cost receiver

    NASA Technical Reports Server (NTRS)

    Burhans, R. W.

    1974-01-01

    Master-slave sequence ordering of the Omega system is suggested as a method of improving the pair geometry for low-cost receiver user benefit. The sequence change will not affect present sophisticated processor users other than require new labels for some pair combinations, but may require worldwide transmitter operators to slightly alter their long-range synchronizing techniques.

  7. Equation solvers for distributed-memory computers

    NASA Technical Reports Server (NTRS)

    Storaasli, Olaf O.

    1994-01-01

    A large number of scientific and engineering problems require the rapid solution of large systems of simultaneous equations. The performance of parallel computers in this area now dwarfs traditional vector computers by nearly an order of magnitude. This talk describes the major issues involved in parallel equation solvers with particular emphasis on the Intel Paragon, IBM SP-1 and SP-2 processors.

  8. Autonomic Cluster Management System (ACMS): A Demonstration of Autonomic Principles at Work

    NASA Technical Reports Server (NTRS)

    Baldassari, James D.; Kopec, Christopher L.; Leshay, Eric S.; Truszkowski, Walt; Finkel, David

    2005-01-01

    Cluster computing, whereby a large number of simple processors or nodes are combined together to apparently function as a single powerful computer, has emerged as a research area in its own right. The approach offers a relatively inexpensive means of achieving significant computational capabilities for high-performance computing applications, while simultaneously affording the ability to. increase that capability simply by adding more (inexpensive) processors. However, the task of manually managing and con.guring a cluster quickly becomes impossible as the cluster grows in size. Autonomic computing is a relatively new approach to managing complex systems that can potentially solve many of the problems inherent in cluster management. We describe the development of a prototype Automatic Cluster Management System (ACMS) that exploits autonomic properties in automating cluster management.

  9. FAST: framework for heterogeneous medical image computing and visualization.

    PubMed

    Smistad, Erik; Bozorgi, Mohammadmehdi; Lindseth, Frank

    2015-11-01

    Computer systems are becoming increasingly heterogeneous in the sense that they consist of different processors, such as multi-core CPUs and graphic processing units. As the amount of medical image data increases, it is crucial to exploit the computational power of these processors. However, this is currently difficult due to several factors, such as driver errors, processor differences, and the need for low-level memory handling. This paper presents a novel FrAmework for heterogeneouS medical image compuTing and visualization (FAST). The framework aims to make it easier to simultaneously process and visualize medical images efficiently on heterogeneous systems. FAST uses common image processing programming paradigms and hides the details of memory handling from the user, while enabling the use of all processors and cores on a system. The framework is open-source, cross-platform and available online. Code examples and performance measurements are presented to show the simplicity and efficiency of FAST. The results are compared to the insight toolkit (ITK) and the visualization toolkit (VTK) and show that the presented framework is faster with up to 20 times speedup on several common medical imaging algorithms. FAST enables efficient medical image computing and visualization on heterogeneous systems. Code examples and performance evaluations have demonstrated that the toolkit is both easy to use and performs better than existing frameworks, such as ITK and VTK.

  10. DOE Office of Scientific and Technical Information (OSTI.GOV)

    Krityakierne, Tipaluck; Akhtar, Taimoor; Shoemaker, Christine A.

    This paper presents a parallel surrogate-based global optimization method for computationally expensive objective functions that is more effective for larger numbers of processors. To reach this goal, we integrated concepts from multi-objective optimization and tabu search into, single objective, surrogate optimization. Our proposed derivative-free algorithm, called SOP, uses non-dominated sorting of points for which the expensive function has been previously evaluated. The two objectives are the expensive function value of the point and the minimum distance of the point to previously evaluated points. Based on the results of non-dominated sorting, P points from the sorted fronts are selected as centersmore » from which many candidate points are generated by random perturbations. Based on surrogate approximation, the best candidate point is subsequently selected for expensive evaluation for each of the P centers, with simultaneous computation on P processors. Centers that previously did not generate good solutions are tabu with a given tenure. We show almost sure convergence of this algorithm under some conditions. The performance of SOP is compared with two RBF based methods. The test results show that SOP is an efficient method that can reduce time required to find a good near optimal solution. In a number of cases the efficiency of SOP is so good that SOP with 8 processors found an accurate answer in less wall-clock time than the other algorithms did with 32 processors.« less

  11. Single-Scale Retinex Using Digital Signal Processors

    NASA Technical Reports Server (NTRS)

    Hines, Glenn; Rahman, Zia-Ur; Jobson, Daniel; Woodell, Glenn

    2005-01-01

    The Retinex is an image enhancement algorithm that improves the brightness, contrast and sharpness of an image. It performs a non-linear spatial/spectral transform that provides simultaneous dynamic range compression and color constancy. It has been used for a wide variety of applications ranging from aviation safety to general purpose photography. Many potential applications require the use of Retinex processing at video frame rates. This is difficult to achieve with general purpose processors because the algorithm contains a large number of complex computations and data transfers. In addition, many of these applications also constrain the potential architectures to embedded processors to save power, weight and cost. Thus we have focused on digital signal processors (DSPs) and field programmable gate arrays (FPGAs) as potential solutions for real-time Retinex processing. In previous efforts we attained a 21 (full) frame per second (fps) processing rate for the single-scale monochromatic Retinex with a TMS320C6711 DSP operating at 150 MHz. This was achieved after several significant code improvements and optimizations. Since then we have migrated our design to the slightly more powerful TMS320C6713 DSP and the fixed point TMS320DM642 DSP. In this paper we briefly discuss the Retinex algorithm, the performance of the algorithm executing on the TMS320C6713 and the TMS320DM642, and compare the results with the TMS320C6711.

  12. Evaluating architecture impact on system energy efficiency

    PubMed Central

    Yu, Shijie; Wang, Rui; Luan, Zhongzhi; Qian, Depei

    2017-01-01

    As the energy consumption has been surging in an unsustainable way, it is important to understand the impact of existing architecture designs from energy efficiency perspective, which is especially valuable for High Performance Computing (HPC) and datacenter environment hosting tens of thousands of servers. One obstacle hindering the advance of comprehensive evaluation on energy efficiency is the deficient power measuring approach. Most of the energy study relies on either external power meters or power models, both of these two methods contain intrinsic drawbacks in their practical adoption and measuring accuracy. Fortunately, the advent of Intel Running Average Power Limit (RAPL) interfaces has promoted the power measurement ability into next level, with higher accuracy and finer time resolution. Therefore, we argue it is the exact time to conduct an in-depth evaluation of the existing architecture designs to understand their impact on system energy efficiency. In this paper, we leverage representative benchmark suites including serial and parallel workloads from diverse domains to evaluate the architecture features such as Non Uniform Memory Access (NUMA), Simultaneous Multithreading (SMT) and Turbo Boost. The energy is tracked at subcomponent level such as Central Processing Unit (CPU) cores, uncore components and Dynamic Random-Access Memory (DRAM) through exploiting the power measurement ability exposed by RAPL. The experiments reveal non-intuitive results: 1) the mismatch between local compute and remote memory node caused by NUMA effect not only generates dramatic power and energy surge but also deteriorates the energy efficiency significantly; 2) for multithreaded application such as the Princeton Application Repository for Shared-Memory Computers (PARSEC), most of the workloads benefit a notable increase of energy efficiency using SMT, with more than 40% decline in average power consumption; 3) Turbo Boost is effective to accelerate the workload execution and further preserve the energy, however it may not be applicable on system with tight power budget. PMID:29161317

  13. Evaluating architecture impact on system energy efficiency.

    PubMed

    Yu, Shijie; Yang, Hailong; Wang, Rui; Luan, Zhongzhi; Qian, Depei

    2017-01-01

    As the energy consumption has been surging in an unsustainable way, it is important to understand the impact of existing architecture designs from energy efficiency perspective, which is especially valuable for High Performance Computing (HPC) and datacenter environment hosting tens of thousands of servers. One obstacle hindering the advance of comprehensive evaluation on energy efficiency is the deficient power measuring approach. Most of the energy study relies on either external power meters or power models, both of these two methods contain intrinsic drawbacks in their practical adoption and measuring accuracy. Fortunately, the advent of Intel Running Average Power Limit (RAPL) interfaces has promoted the power measurement ability into next level, with higher accuracy and finer time resolution. Therefore, we argue it is the exact time to conduct an in-depth evaluation of the existing architecture designs to understand their impact on system energy efficiency. In this paper, we leverage representative benchmark suites including serial and parallel workloads from diverse domains to evaluate the architecture features such as Non Uniform Memory Access (NUMA), Simultaneous Multithreading (SMT) and Turbo Boost. The energy is tracked at subcomponent level such as Central Processing Unit (CPU) cores, uncore components and Dynamic Random-Access Memory (DRAM) through exploiting the power measurement ability exposed by RAPL. The experiments reveal non-intuitive results: 1) the mismatch between local compute and remote memory node caused by NUMA effect not only generates dramatic power and energy surge but also deteriorates the energy efficiency significantly; 2) for multithreaded application such as the Princeton Application Repository for Shared-Memory Computers (PARSEC), most of the workloads benefit a notable increase of energy efficiency using SMT, with more than 40% decline in average power consumption; 3) Turbo Boost is effective to accelerate the workload execution and further preserve the energy, however it may not be applicable on system with tight power budget.

  14. EPIC

    NASA Image and Video Library

    2011-12-29

    ISS030-E-017789 (29 Dec. 2011) --- Working in chorus with the International Space Station team in Houston?s Mission Control Center, this astronaut and his Expedition 30 crewmates on the station install a set of Enhanced Processor and Integrated Communications (EPIC) computer cards in one of seven primary computers onboard. The upgrade will allow more experiments to operate simultaneously, and prepare for the arrival of commercial cargo ships later this year.

  15. EPIC Computer Cards

    NASA Image and Video Library

    2011-12-29

    ISS030-E-017776 (29 Dec. 2011) --- Working in chorus with the International Space Station team in Houston?s Mission Control Center, this astronaut and his Expedition 30 crewmates on the station install a set of Enhanced Processor and Integrated Communications (EPIC) computer cards in one of seven primary computers onboard. The upgrade will allow more experiments to operate simultaneously, and prepare for the arrival of commercial cargo ships later this year.

  16. The Multi-energy High precision Data Processor Based on AD7606

    NASA Astrophysics Data System (ADS)

    Zhao, Chen; Zhang, Yanchi; Xie, Da

    2017-11-01

    This paper designs an information collector based on AD7606 to realize the high-precision simultaneous acquisition of multi-source information of multi-energy systems to form the information platform of the energy Internet at Laogang with electricty as its major energy source. Combined with information fusion technologies, this paper analyzes the data to improve the overall energy system scheduling capability and reliability.

  17. RedThreads: An Interface for Application-Level Fault Detection/Correction Through Adaptive Redundant Multithreading

    DOE PAGES

    Hukerikar, Saurabh; Teranishi, Keita; Diniz, Pedro C.; ...

    2017-02-11

    In the presence of accelerated fault rates, which are projected to be the norm on future exascale systems, it will become increasingly difficult for high-performance computing (HPC) applications to accomplish useful computation. Due to the fault-oblivious nature of current HPC programming paradigms and execution environments, HPC applications are insufficiently equipped to deal with errors. We believe that HPC applications should be enabled with capabilities to actively search for and correct errors in their computations. The redundant multithreading (RMT) approach offers lightweight replicated execution streams of program instructions within the context of a single application process. Furthermore, the use of completemore » redundancy incurs significant overhead to the application performance.« less

  18. RedThreads: An Interface for Application-Level Fault Detection/Correction Through Adaptive Redundant Multithreading

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Hukerikar, Saurabh; Teranishi, Keita; Diniz, Pedro C.

    In the presence of accelerated fault rates, which are projected to be the norm on future exascale systems, it will become increasingly difficult for high-performance computing (HPC) applications to accomplish useful computation. Due to the fault-oblivious nature of current HPC programming paradigms and execution environments, HPC applications are insufficiently equipped to deal with errors. We believe that HPC applications should be enabled with capabilities to actively search for and correct errors in their computations. The redundant multithreading (RMT) approach offers lightweight replicated execution streams of program instructions within the context of a single application process. Furthermore, the use of completemore » redundancy incurs significant overhead to the application performance.« less

  19. Accelerating finite-rate chemical kinetics with coprocessors: Comparing vectorization methods on GPUs, MICs, and CPUs

    NASA Astrophysics Data System (ADS)

    Stone, Christopher P.; Alferman, Andrew T.; Niemeyer, Kyle E.

    2018-05-01

    Accurate and efficient methods for solving stiff ordinary differential equations (ODEs) are a critical component of turbulent combustion simulations with finite-rate chemistry. The ODEs governing the chemical kinetics at each mesh point are decoupled by operator-splitting allowing each to be solved concurrently. An efficient ODE solver must then take into account the available thread and instruction-level parallelism of the underlying hardware, especially on many-core coprocessors, as well as the numerical efficiency. A stiff Rosenbrock and a nonstiff Runge-Kutta ODE solver are both implemented using the single instruction, multiple thread (SIMT) and single instruction, multiple data (SIMD) paradigms within OpenCL. Both methods solve multiple ODEs concurrently within the same instruction stream. The performance of these parallel implementations was measured on three chemical kinetic models of increasing size across several multicore and many-core platforms. Two separate benchmarks were conducted to clearly determine any performance advantage offered by either method. The first benchmark measured the run-time of evaluating the right-hand-side source terms in parallel and the second benchmark integrated a series of constant-pressure, homogeneous reactors using the Rosenbrock and Runge-Kutta solvers. The right-hand-side evaluations with SIMD parallelism on the host multicore Xeon CPU and many-core Xeon Phi co-processor performed approximately three times faster than the baseline multithreaded C++ code. The SIMT parallel model on the host and Phi was 13%-35% slower than the baseline while the SIMT model on the NVIDIA Kepler GPU provided approximately the same performance as the SIMD model on the Phi. The runtimes for both ODE solvers decreased significantly with the SIMD implementations on the host CPU (2.5-2.7 ×) and Xeon Phi coprocessor (4.7-4.9 ×) compared to the baseline parallel code. The SIMT implementations on the GPU ran 1.5-1.6 times faster than the baseline multithreaded CPU code; however, this was significantly slower than the SIMD versions on the host CPU or the Xeon Phi. The performance difference between the three platforms was attributed to thread divergence caused by the adaptive step-sizes within the ODE integrators. Analysis showed that the wider vector width of the GPU incurs a higher level of divergence than the narrower Sandy Bridge or Xeon Phi. The significant performance improvement provided by the SIMD parallel strategy motivates further research into more ODE solver methods that are both SIMD-friendly and computationally efficient.

  20. General-purpose interface bus for multiuser, multitasking computer system

    NASA Technical Reports Server (NTRS)

    Generazio, Edward R.; Roth, Don J.; Stang, David B.

    1990-01-01

    The architecture of a multiuser, multitasking, virtual-memory computer system intended for the use by a medium-size research group is described. There are three central processing units (CPU) in the configuration, each with 16 MB memory, and two 474 MB hard disks attached. CPU 1 is designed for data analysis and contains an array processor for fast-Fourier transformations. In addition, CPU 1 shares display images viewed with the image processor. CPU 2 is designed for image analysis and display. CPU 3 is designed for data acquisition and contains 8 GPIB channels and an analog-to-digital conversion input/output interface with 16 channels. Up to 9 users can access the third CPU simultaneously for data acquisition. Focus is placed on the optimization of hardware interfaces and software, facilitating instrument control, data acquisition, and processing.

  1. An Intrinsically Digital Amplification Scheme for Hearing Aids

    NASA Astrophysics Data System (ADS)

    Blamey, Peter J.; Macfarlane, David S.; Steele, Brenton R.

    2005-12-01

    Results for linear and wide-dynamic range compression were compared with a new 64-channel digital amplification strategy in three separate studies. The new strategy addresses the requirements of the hearing aid user with efficient computations on an open-platform digital signal processor (DSP). The new amplification strategy is not modeled on prior analog strategies like compression and linear amplification, but uses statistical analysis of the signal to optimize the output dynamic range in each frequency band independently. Using the open-platform DSP processor also provided the opportunity for blind trial comparisons of the different processing schemes in BTE and ITE devices of a high commercial standard. The speech perception scores and questionnaire results show that it is possible to provide improved audibility for sound in many narrow frequency bands while simultaneously improving comfort, speech intelligibility in noise, and sound quality.

  2. Architecture of security management unit for safe hosting of multiple agents

    NASA Astrophysics Data System (ADS)

    Gilmont, Tanguy; Legat, Jean-Didier; Quisquater, Jean-Jacques

    1999-04-01

    In such growing areas as remote applications in large public networks, electronic commerce, digital signature, intellectual property and copyright protection, and even operating system extensibility, the hardware security level offered by existing processors is insufficient. They lack protection mechanisms that prevent the user from tampering critical data owned by those applications. Some devices make exception, but have not enough processing power nor enough memory to stand up to such applications (e.g. smart cards). This paper proposes an architecture of secure processor, in which the classical memory management unit is extended into a new security management unit. It allows ciphered code execution and ciphered data processing. An internal permanent memory can store cipher keys and critical data for several client agents simultaneously. The ordinary supervisor privilege scheme is replaced by a privilege inheritance mechanism that is more suited to operating system extensibility. The result is a secure processor that has hardware support for extensible multitask operating systems, and can be used for both general applications and critical applications needing strong protection. The security management unit and the internal permanent memory can be added to an existing CPU core without loss of performance, and do not require it to be modified.

  3. The CMS Level-1 Calorimeter Trigger for LHC Run II

    NASA Astrophysics Data System (ADS)

    Sinthuprasith, Tutanon

    2017-01-01

    The phase-1 upgrades of the CMS Level-1 calorimeter trigger have been completed. The Level-1 trigger has been fully commissioned and it will be used by CMS to collect data starting from the 2016 data run. The new trigger has been designed to improve the performance at high luminosity and large number of simultaneous inelastic collisions per crossing (pile-up). For this purpose it uses a novel design, the Time Multiplexed Design, which enables the data from an event to be processed by a single trigger processor at full granularity over several bunch crossings. The TMT design is a modular design based on the uTCA standard. The architecture is flexible and the number of trigger processors can be expanded according to the physics needs of CMS. Intelligent, more complex, and innovative algorithms are now the core of the first decision layer of CMS: the upgraded trigger system implements pattern recognition and MVA (Boosted Decision Tree) regression techniques in the trigger processors for pT assignment, pile up subtraction, and isolation requirements for electrons, and taus. The performance of the TMT design and the latency measurements and the algorithm performance which has been measured using data is also presented here.

  4. On the Suitability of MPI as a PGAS Runtime

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Daily, Jeffrey A.; Vishnu, Abhinav; Palmer, Bruce J.

    2014-12-18

    Partitioned Global Address Space (PGAS) models are emerging as a popular alternative to MPI models for designing scalable applications. At the same time, MPI remains a ubiquitous communication subsystem due to its standardization, high performance, and availability on leading platforms. In this paper, we explore the suitability of using MPI as a scalable PGAS communication subsystem. We focus on the Remote Memory Access (RMA) communication in PGAS models which typically includes {\\em get, put,} and {\\em atomic memory operations}. We perform an in-depth exploration of design alternatives based on MPI. These alternatives include using a semantically-matching interface such as MPI-RMA,more » as well as not-so-intuitive interfaces such as MPI two-sided with a combination of multi-threading and dynamic process management. With an in-depth exploration of these alternatives and their shortcomings, we propose a novel design which is facilitated by the data-centric view in PGAS models. This design leverages a combination of highly tuned MPI two-sided semantics and an automatic, user-transparent split of MPI communicators to provide asynchronous progress. We implement the asynchronous progress ranks approach and other approaches within the Communication Runtime for Exascale which is a communication subsystem for Global Arrays. Our performance evaluation spans pure communication benchmarks, graph community detection and sparse matrix-vector multiplication kernels, and a computational chemistry application. The utility of our proposed PR-based approach is demonstrated by a 2.17x speed-up on 1008 processors over the other MPI-based designs.« less

  5. Gyrokinetic particle-in-cell optimization on emerging multi- and manycore platforms

    DOE PAGES

    Madduri, Kamesh; Im, Eun-Jin; Ibrahim, Khaled Z.; ...

    2011-03-02

    The next decade of high-performance computing (HPC) systems will see a rapid evolution and divergence of multi- and manycore architectures as power and cooling constraints limit increases in microprocessor clock speeds. Understanding efficient optimization methodologies on diverse multicore designs in the context of demanding numerical methods is one of the greatest challenges faced today by the HPC community. In this paper, we examine the efficient multicore optimization of GTC, a petascale gyrokinetic toroidal fusion code for studying plasma microturbulence in tokamak devices. For GTC’s key computational components (charge deposition and particle push), we explore efficient parallelization strategies across a broadmore » range of emerging multicore designs, including the recently-released Intel Nehalem-EX, the AMD Opteron Istanbul, and the highly multithreaded Sun UltraSparc T2+. We also present the first study on tuning gyrokinetic particle-in-cell (PIC) algorithms for graphics processors, using the NVIDIA C2050 (Fermi). Our work discusses several novel optimization approaches for gyrokinetic PIC, including mixed-precision computation, particle binning and decomposition strategies, grid replication, SIMDized atomic floating-point operations, and effective GPU texture memory utilization. Overall, we achieve significant performance improvements of 1.3–4.7× on these complex PIC kernels, despite the inherent challenges of data dependency and locality. Finally, our work also points to several architectural and programming features that could significantly enhance PIC performance and productivity on next-generation architectures.« less

  6. Input-independent, Scalable and Fast String Matching on the Cray XMT

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Villa, Oreste; Chavarría-Miranda, Daniel; Maschhoff, Kristyn J

    2009-05-25

    String searching is at the core of many security and network applications like search engines, intrusion detection systems, virus scanners and spam filters. The growing size of on-line content and the increasing wire speeds push the need for fast, and often real- time, string searching solutions. For these conditions, many software implementations (if not all) targeting conventional cache-based microprocessors do not perform well. They either exhibit overall low performance or exhibit highly variable performance depending on the types of inputs. For this reason, real-time state of the art solutions rely on the use of either custom hardware or Field-Programmable Gatemore » Arrays (FPGAs) at the expense of overall system flexibility and programmability. This paper presents a software based implementation of the Aho-Corasick string searching algorithm on the Cray XMT multithreaded shared memory machine. Our so- lution relies on the particular features of the XMT architecture and on several algorith- mic strategies: it is fast, scalable and its performance is virtually content-independent. On a 128-processor Cray XMT, it reaches a scanning speed of ≈ 28 Gbps with a performance variability below 10 %. In the 10 Gbps performance range, variability is below 2.5%. By comparison, an Intel dual-socket, 8-core system running at 2.66 GHz achieves a peak performance which varies from 500 Mbps to 10 Gbps depending on the type of input and dictionary size.« less

  7. PIPER: Performance Insight for Programmers and Exascale Runtimes: Guiding the Development of the Exascale Software Stack

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Mellor-Crummey, John

    The PIPER project set out to develop methodologies and software for measurement, analysis, attribution, and presentation of performance data for extreme-scale systems. Goals of the project were to support analysis of massive multi-scale parallelism, heterogeneous architectures, multi-faceted performance concerns, and to support both post-mortem performance analysis to identify program features that contribute to problematic performance and on-line performance analysis to drive adaptation. This final report summarizes the research and development activity at Rice University as part of the PIPER project. Producing a complete suite of performance tools for exascale platforms during the course of this project was impossible since bothmore » hardware and software for exascale systems is still a moving target. For that reason, the project focused broadly on the development of new techniques for measurement and analysis of performance on modern parallel architectures, enhancements to HPCToolkit’s software infrastructure to support our research goals or use on sophisticated applications, engaging developers of multithreaded runtimes to explore how support for tools should be integrated into their designs, engaging operating system developers with feature requests for enhanced monitoring support, engaging vendors with requests that they add hardware measure- ment capabilities and software interfaces needed by tools as they design new components of HPC platforms including processors, accelerators and networks, and finally collaborations with partners interested in using HPCToolkit to analyze and tune scalable parallel applications.« less

  8. A Hybrid Task Graph Scheduler for High Performance Image Processing Workflows.

    PubMed

    Blattner, Timothy; Keyrouz, Walid; Bhattacharyya, Shuvra S; Halem, Milton; Brady, Mary

    2017-12-01

    Designing applications for scalability is key to improving their performance in hybrid and cluster computing. Scheduling code to utilize parallelism is difficult, particularly when dealing with data dependencies, memory management, data motion, and processor occupancy. The Hybrid Task Graph Scheduler (HTGS) improves programmer productivity when implementing hybrid workflows for multi-core and multi-GPU systems. The Hybrid Task Graph Scheduler (HTGS) is an abstract execution model, framework, and API that increases programmer productivity when implementing hybrid workflows for such systems. HTGS manages dependencies between tasks, represents CPU and GPU memories independently, overlaps computations with disk I/O and memory transfers, keeps multiple GPUs occupied, and uses all available compute resources. Through these abstractions, data motion and memory are explicit; this makes data locality decisions more accessible. To demonstrate the HTGS application program interface (API), we present implementations of two example algorithms: (1) a matrix multiplication that shows how easily task graphs can be used; and (2) a hybrid implementation of microscopy image stitching that reduces code size by ≈ 43% compared to a manually coded hybrid workflow implementation and showcases the minimal overhead of task graphs in HTGS. Both of the HTGS-based implementations show good performance. In image stitching the HTGS implementation achieves similar performance to the hybrid workflow implementation. Matrix multiplication with HTGS achieves 1.3× and 1.8× speedup over the multi-threaded OpenBLAS library for 16k × 16k and 32k × 32k size matrices, respectively.

  9. Multilevel Parallelization of AutoDock 4.2.

    PubMed

    Norgan, Andrew P; Coffman, Paul K; Kocher, Jean-Pierre A; Katzmann, David J; Sosa, Carlos P

    2011-04-28

    Virtual (computational) screening is an increasingly important tool for drug discovery. AutoDock is a popular open-source application for performing molecular docking, the prediction of ligand-receptor interactions. AutoDock is a serial application, though several previous efforts have parallelized various aspects of the program. In this paper, we report on a multi-level parallelization of AutoDock 4.2 (mpAD4). Using MPI and OpenMP, AutoDock 4.2 was parallelized for use on MPI-enabled systems and to multithread the execution of individual docking jobs. In addition, code was implemented to reduce input/output (I/O) traffic by reusing grid maps at each node from docking to docking. Performance of mpAD4 was examined on two multiprocessor computers. Using MPI with OpenMP multithreading, mpAD4 scales with near linearity on the multiprocessor systems tested. In situations where I/O is limiting, reuse of grid maps reduces both system I/O and overall screening time. Multithreading of AutoDock's Lamarkian Genetic Algorithm with OpenMP increases the speed of execution of individual docking jobs, and when combined with MPI parallelization can significantly reduce the execution time of virtual screens. This work is significant in that mpAD4 speeds the execution of certain molecular docking workloads and allows the user to optimize the degree of system-level (MPI) and node-level (OpenMP) parallelization to best fit both workloads and computational resources.

  10. DOE Office of Scientific and Technical Information (OSTI.GOV)

    Rodenbeck, Christopher T.; Young, Derek; Chou, Tina

    A combined radar and telemetry system is described. The combined radar and telemetry system includes a processing unit that executes instructions, where the instructions define a radar waveform and a telemetry waveform. The processor outputs a digital baseband signal based upon the instructions, where the digital baseband signal is based upon the radar waveform and the telemetry waveform. A radar and telemetry circuit transmits, simultaneously, a radar signal and telemetry signal based upon the digital baseband signal.

  11. Automated mass spectrometer analysis system

    NASA Technical Reports Server (NTRS)

    Giffin, Charles E. (Inventor); Kuppermann, Aron (Inventor); Dreyer, William J. (Inventor); Boettger, Heinz G. (Inventor)

    1982-01-01

    An automated mass spectrometer analysis system is disclosed, in which samples are automatically processed in a sample processor and converted into volatilizable samples, or their characteristic volatilizable derivatives. Each volatilizable sample is sequentially volatilized and analyzed in a double focusing mass spectrometer, whose output is in the form of separate ion beams all of which are simultaneously focused in a focal plane. Each ion beam is indicative of a different sample component or different fragments of one or more sample components and the beam intensity is related to the relative abundance of the sample component. The system includes an electro-optical ion detector which automatically and simultaneously converts the ion beams, first into electron beams which in turn produce a related image which is transferred to the target of a vilicon unit. The latter converts the images into electrical signals which are supplied to a data processor, whose output is a list of the components of the analyzed sample and their abundances. The system is under the control of a master control unit, which in addition to monitoring and controlling various power sources, controls the automatic operation of the system under expected and some unexpected conditions and further protects various critical parts of the system from damage due to particularly abnormal conditions.

  12. Automated mass spectrometer analysis system

    NASA Technical Reports Server (NTRS)

    Boettger, Heinz G. (Inventor); Giffin, Charles E. (Inventor); Dreyer, William J. (Inventor); Kuppermann, Aron (Inventor)

    1978-01-01

    An automated mass spectrometer analysis system is disclosed, in which samples are automatically processed in a sample processor and converted into volatilizable samples, or their characteristic volatilizable derivatives. Each volatizable sample is sequentially volatilized and analyzed in a double focusing mass spectrometer, whose output is in the form of separate ion beams all of which are simultaneously focused in a focal plane. Each ion beam is indicative of a different sample component or different fragments of one or more sample components and the beam intensity is related to the relative abundance of the sample component. The system includes an electro-optical ion detector which automatically and simultaneously converts the ion beams, first into electron beams which in turn produce a related image which is transferred to the target of a vidicon unit. The latter converts the images into electrical signals which are supplied to a data processor, whose output is a list of the components of the analyzed sample and their abundances. The system is under the control of a master control unit, which in addition to monitoring and controlling various power sources, controls the automatic operation of the system under expected and some unexpected conditions and further protects various critical parts of the system from damage due to particularly abnormal conditions.

  13. SOP: parallel surrogate global optimization with Pareto center selection for computationally expensive single objective problems

    DOE PAGES

    Krityakierne, Tipaluck; Akhtar, Taimoor; Shoemaker, Christine A.

    2016-02-02

    This paper presents a parallel surrogate-based global optimization method for computationally expensive objective functions that is more effective for larger numbers of processors. To reach this goal, we integrated concepts from multi-objective optimization and tabu search into, single objective, surrogate optimization. Our proposed derivative-free algorithm, called SOP, uses non-dominated sorting of points for which the expensive function has been previously evaluated. The two objectives are the expensive function value of the point and the minimum distance of the point to previously evaluated points. Based on the results of non-dominated sorting, P points from the sorted fronts are selected as centersmore » from which many candidate points are generated by random perturbations. Based on surrogate approximation, the best candidate point is subsequently selected for expensive evaluation for each of the P centers, with simultaneous computation on P processors. Centers that previously did not generate good solutions are tabu with a given tenure. We show almost sure convergence of this algorithm under some conditions. The performance of SOP is compared with two RBF based methods. The test results show that SOP is an efficient method that can reduce time required to find a good near optimal solution. In a number of cases the efficiency of SOP is so good that SOP with 8 processors found an accurate answer in less wall-clock time than the other algorithms did with 32 processors.« less

  14. A Biologically-Based Alternative Water Processor for Long Duration Space Missions

    NASA Technical Reports Server (NTRS)

    Barta, Daniel J.; Pickering, Karen D.; Meyer, Caitlin; Pensinger, Stuart; Vega, Leticia; Flynn, Michael; Jackson, Andrew; Wheeler, Raymond

    2015-01-01

    A wastewater recovery system has been developed that combines novel biological and physicochemical components for recycling wastewater on long duration space missions. Functionally, this Alternative Water Processor (AWP) would replace the Urine Processing Assembly on the International Space Station and reduce or eliminate the need for the multifiltration beds of the Water Processing Assembly (WPA). At its center are two unique game changing technologies: 1) a biological water processor (BWP) to mineralize organic forms of carbon and nitrogen and 2) an advanced membrane processor (Forward Osmosis Secondary Treatment) for removal of solids and inorganic ions. The AWP is designed for recycling larger quantities of wastewater from multiple sources expected during future exploration missions, including urine, hygiene (hand wash, shower, oral and shave) and laundry. The BWP utilizes a single-stage membrane-aerated biological reactor for simultaneous nitrification and denitrification. The Forward Osmosis Secondary Treatment (FOST) system uses a combination of forward osmosis (FO) and reverse osmosis (RO), is resistant to biofouling and can easily tolerate wastewaters high in non-volatile organics and solids associated with shower and/or hand washing. The BWP was operated continuously for over 300 days. After startup, the mature biological system averaged 85% organic carbon removal and 44% nitrogen removal, close to maximum based on available carbon. The FOST has averaged 93% water recovery, with a maximum of 98%. If the wastewater is slighty acidified, ammonia rejection is optimal. This paper will provide a description of the technology and summarize results from ground-based testing using real wastewater.

  15. Real-time heart rate measurement for multi-people using compressive tracking

    NASA Astrophysics Data System (ADS)

    Liu, Lingling; Zhao, Yuejin; Liu, Ming; Kong, Lingqin; Dong, Liquan; Ma, Feilong; Pang, Zongguang; Cai, Zhi; Zhang, Yachu; Hua, Peng; Yuan, Ruifeng

    2017-09-01

    The rise of aging population has created a demand for inexpensive, unobtrusive, automated health care solutions. Image PhotoPlethysmoGraphy(IPPG) aids in the development of these solutions by allowing for the extraction of physiological signals from video data. However, the main deficiencies of the recent IPPG methods are non-automated, non-real-time and susceptible to motion artifacts(MA). In this paper, a real-time heart rate(HR) detection method for multiple subjects simultaneously was proposed and realized using the open computer vision(openCV) library, which consists of getting multiple subjects' facial video automatically through a Webcam, detecting the region of interest (ROI) in the video, reducing the false detection rate by our improved Adaboost algorithm, reducing the MA by our improved compress tracking(CT) algorithm, wavelet noise-suppression algorithm for denoising and multi-threads for higher detection speed. For comparison, HR was measured simultaneously using a medical pulse oximetry device for every subject during all sessions. Experimental results on a data set of 30 subjects show that the max average absolute error of heart rate estimation is less than 8 beats per minute (BPM), and the processing speed of every frame has almost reached real-time: the experiments with video recordings of ten subjects under the condition of the pixel resolution of 600× 800 pixels show that the average HR detection time of 10 subjects was about 17 frames per second (fps).

  16. Jagged Tiling for Intra-tile Parallelism and Fine-Grain Multithreading

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Shrestha, Sunil; Manzano Franco, Joseph B.; Marquez, Andres

    In this paper, we have developed a novel methodology that takes into consideration multithreaded many-core designs to better utilize memory/processing resources and improve memory residence on tileable applications. It takes advantage of polyhedral analysis and transformation in the form of PLUTO, combined with a highly optimized finegrain tile runtime to exploit parallelism at all levels. The main contributions of this paper include the introduction of multi-hierarchical tiling techniques that increases intra tile parallelism; and a data-flow inspired runtime library that allows the expression of parallel tiles with an efficient synchronization registry. Our current implementation shows performance improvements on an Intelmore » Xeon Phi board up to 32.25% against instances produced by state-of-the-art compiler frameworks for selected stencil applications.« less

  17. TOGA - A GNSS Reflections Instrument for Remote Sensing Using Beamforming

    NASA Technical Reports Server (NTRS)

    Esterhuizen, S.; Meehan, T. K.; Robison, D.

    2009-01-01

    Remotely sensing the Earth's surface using GNSS signals as bi-static radar sources is one of the most challenging applications for radiometric instrument design. As part of NASA's Instrument Incubator Program, our group at JPL has built a prototype instrument, TOGA (Time-shifted, Orthometric, GNSS Array), to address a variety of GNSS science needs. Observing GNSS reflections is major focus of the design/development effort. The TOGA design features a steerable beam antenna array which can form a high-gain antenna pattern in multiple directions simultaneously. Multiple FPGAs provide flexible digital signal processing logic to process both GPS and Galileo reflections. A Linux OS based science processor serves as experiment scheduler and data post-processor. This paper outlines the TOGA design approach as well as preliminary results of reflection data collected from test flights over the Pacific ocean. This reflections data demonstrates observation of the GPS L1/L2C/L5 signals.

  18. Interconnect-free parallel logic circuits in a single mechanical resonator

    PubMed Central

    Mahboob, I.; Flurin, E.; Nishiguchi, K.; Fujiwara, A.; Yamaguchi, H.

    2011-01-01

    In conventional computers, wiring between transistors is required to enable the execution of Boolean logic functions. This has resulted in processors in which billions of transistors are physically interconnected, which limits integration densities, gives rise to huge power consumption and restricts processing speeds. A method to eliminate wiring amongst transistors by condensing Boolean logic into a single active element is thus highly desirable. Here, we demonstrate a novel logic architecture using only a single electromechanical parametric resonator into which multiple channels of binary information are encoded as mechanical oscillations at different frequencies. The parametric resonator can mix these channels, resulting in new mechanical oscillation states that enable the construction of AND, OR and XOR logic gates as well as multibit logic circuits. Moreover, the mechanical logic gates and circuits can be executed simultaneously, giving rise to the prospect of a parallel logic processor in just a single mechanical resonator. PMID:21326230

  19. Interconnect-free parallel logic circuits in a single mechanical resonator.

    PubMed

    Mahboob, I; Flurin, E; Nishiguchi, K; Fujiwara, A; Yamaguchi, H

    2011-02-15

    In conventional computers, wiring between transistors is required to enable the execution of Boolean logic functions. This has resulted in processors in which billions of transistors are physically interconnected, which limits integration densities, gives rise to huge power consumption and restricts processing speeds. A method to eliminate wiring amongst transistors by condensing Boolean logic into a single active element is thus highly desirable. Here, we demonstrate a novel logic architecture using only a single electromechanical parametric resonator into which multiple channels of binary information are encoded as mechanical oscillations at different frequencies. The parametric resonator can mix these channels, resulting in new mechanical oscillation states that enable the construction of AND, OR and XOR logic gates as well as multibit logic circuits. Moreover, the mechanical logic gates and circuits can be executed simultaneously, giving rise to the prospect of a parallel logic processor in just a single mechanical resonator.

  20. On-board processing concepts for future satellite communications systems

    NASA Technical Reports Server (NTRS)

    Brandon, W. T. (Editor); White, B. E. (Editor)

    1980-01-01

    The initial definition of on-board processing for an advanced satellite communications system to service domestic markets in the 1990's is discussed. An exemplar system with both RF on-board switching and demodulation/remodulation baseband processing is used to identify important issues related to system implementation, cost, and technology development. Analyses of spectrum-efficient modulation, coding, and system control techniques are summarized. Implementations for an RF switch and baseband processor are described. Among the major conclusions listed is the need for high gain satellites capable of handling tens of simultaneous beams for the efficient reuse of the 2.5 GHz 30/20 frequency band. Several scanning beams are recommended in addition to the fixed beams. Low power solid state 20 GHz GaAs FET power amplifiers in the 5W range and a general purpose digital baseband processor with gigahertz logic speeds and megabits of memory are also recommended.

  1. Handling of huge multispectral image data volumes from a spectral hole burning device (SHBD)

    NASA Astrophysics Data System (ADS)

    Graff, Werner; Rosselet, Armel C.; Wild, Urs P.; Gschwind, Rudolf; Keller, Christoph U.

    1995-06-01

    We use chlorin-doped polymer films at low temperatures as the primary imaging detector. Based on the principles of persistent spectral hole burning, this system is capable of storing spatial and spectral information simultaneously in one exposure with extremely high resolution. The sun as an extended light source has been imaged onto the film. The information recorded amounts to tens of GBytes. This data volume is read out by scanning the frequency of a tunable dye laser and reading the images with a digital CCD camera. For acquisition, archival, processing, and visualization, we use MUSIC (MUlti processor System with Intelligent Communication), a single instruction multiple data parallel processor system equipped with the necessary I/O facilities. The huge amount of data requires the developemnt of sophisticated algorithms to efficiently calibrate the data and to extract useful and new information for solar physics.

  2. Electronic Structure Calculations and Adaptation Scheme in Multi-core Computing Environments

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Seshagiri, Lakshminarasimhan; Sosonkina, Masha; Zhang, Zhao

    2009-05-20

    Multi-core processing environments have become the norm in the generic computing environment and are being considered for adding an extra dimension to the execution of any application. The T2 Niagara processor is a very unique environment where it consists of eight cores having a capability of running eight threads simultaneously in each of the cores. Applications like General Atomic and Molecular Electronic Structure (GAMESS), used for ab-initio molecular quantum chemistry calculations, can be good indicators of the performance of such machines and would be a guideline for both hardware designers and application programmers. In this paper we try to benchmarkmore » the GAMESS performance on a T2 Niagara processor for a couple of molecules. We also show the suitability of using a middleware based adaptation algorithm on GAMESS on such a multi-core environment.« less

  3. Real-time digital holographic microscopy using the graphic processing unit.

    PubMed

    Shimobaba, Tomoyoshi; Sato, Yoshikuni; Miura, Junya; Takenouchi, Mai; Ito, Tomoyoshi

    2008-08-04

    Digital holographic microscopy (DHM) is a well-known powerful method allowing both the amplitude and phase of a specimen to be simultaneously observed. In order to obtain a reconstructed image from a hologram, numerous calculations for the Fresnel diffraction are required. The Fresnel diffraction can be accelerated by the FFT (Fast Fourier Transform) algorithm. However, real-time reconstruction from a hologram is difficult even if we use a recent central processing unit (CPU) to calculate the Fresnel diffraction by the FFT algorithm. In this paper, we describe a real-time DHM system using a graphic processing unit (GPU) with many stream processors, which allows use as a highly parallel processor. The computational speed of the Fresnel diffraction using the GPU is faster than that of recent CPUs. The real-time DHM system can obtain reconstructed images from holograms whose size is 512 x 512 grids in 24 frames per second.

  4. Long sequence correlation coprocessor

    NASA Astrophysics Data System (ADS)

    Gage, Douglas W.

    1994-09-01

    A long sequence correlation coprocessor (LSCC) accelerates the bitwise correlation of arbitrarily long digital sequences by calculating in parallel the correlation score for 16, for example, adjacent bit alignments between two binary sequences. The LSCC integrated circuit is incorporated into a computer system with memory storage buffers and a separate general purpose computer processor which serves as its controller. Each of the LSCC's set of sequential counters simultaneously tallies a separate correlation coefficient. During each LSCC clock cycle, computer enable logic associated with each counter compares one bit of a first sequence with one bit of a second sequence to increment the counter if the bits are the same. A shift register assures that the same bit of the first sequence is simultaneously compared to different bits of the second sequence to simultaneously calculate the correlation coefficient by the different counters to represent different alignments of the two sequences.

  5. Exploiting multicore compute resources in the CMS experiment

    NASA Astrophysics Data System (ADS)

    Ramírez, J. E.; Pérez-Calero Yzquierdo, A.; Hernández, J. M.; CMS Collaboration

    2016-10-01

    CMS has developed a strategy to efficiently exploit the multicore architecture of the compute resources accessible to the experiment. A coherent use of the multiple cores available in a compute node yields substantial gains in terms of resource utilization. The implemented approach makes use of the multithreading support of the event processing framework and the multicore scheduling capabilities of the resource provisioning system. Multicore slots are acquired and provisioned by means of multicore pilot agents which internally schedule and execute single and multicore payloads. Multicore scheduling and multithreaded processing are currently used in production for online event selection and prompt data reconstruction. More workflows are being adapted to run in multicore mode. This paper presents a review of the experience gained in the deployment and operation of the multicore scheduling and processing system, the current status and future plans.

  6. CMS event processing multi-core efficiency status

    NASA Astrophysics Data System (ADS)

    Jones, C. D.; CMS Collaboration

    2017-10-01

    In 2015, CMS was the first LHC experiment to begin using a multi-threaded framework for doing event processing. This new framework utilizes Intel’s Thread Building Block library to manage concurrency via a task based processing model. During the 2015 LHC run period, CMS only ran reconstruction jobs using multiple threads because only those jobs were sufficiently thread efficient. Recent work now allows simulation and digitization to be thread efficient. In addition, during 2015 the multi-threaded framework could run events in parallel but could only use one thread per event. Work done in 2016 now allows multiple threads to be used while processing one event. In this presentation we will show how these recent changes have improved CMS’s overall threading and memory efficiency and we will discuss work to be done to further increase those efficiencies.

  7. Multithreaded implicitly dealiased convolutions

    NASA Astrophysics Data System (ADS)

    Roberts, Malcolm; Bowman, John C.

    2018-03-01

    Implicit dealiasing is a method for computing in-place linear convolutions via fast Fourier transforms that decouples work memory from input data. It offers easier memory management and, for long one-dimensional input sequences, greater efficiency than conventional zero-padding. Furthermore, for convolutions of multidimensional data, the segregation of data and work buffers can be exploited to reduce memory usage and execution time significantly. This is accomplished by processing and discarding data as it is generated, allowing work memory to be reused, for greater data locality and performance. A multithreaded implementation of implicit dealiasing that accepts an arbitrary number of input and output vectors and a general multiplication operator is presented, along with an improved one-dimensional Hermitian convolution that avoids the loop dependency inherent in previous work. An alternate data format that can accommodate a Nyquist mode and enhance cache efficiency is also proposed.

  8. Large Scale Document Inversion using a Multi-threaded Computing System

    PubMed Central

    Jung, Sungbo; Chang, Dar-Jen; Park, Juw Won

    2018-01-01

    Current microprocessor architecture is moving towards multi-core/multi-threaded systems. This trend has led to a surge of interest in using multi-threaded computing devices, such as the Graphics Processing Unit (GPU), for general purpose computing. We can utilize the GPU in computation as a massive parallel coprocessor because the GPU consists of multiple cores. The GPU is also an affordable, attractive, and user-programmable commodity. Nowadays a lot of information has been flooded into the digital domain around the world. Huge volume of data, such as digital libraries, social networking services, e-commerce product data, and reviews, etc., is produced or collected every moment with dramatic growth in size. Although the inverted index is a useful data structure that can be used for full text searches or document retrieval, a large number of documents will require a tremendous amount of time to create the index. The performance of document inversion can be improved by multi-thread or multi-core GPU. Our approach is to implement a linear-time, hash-based, single program multiple data (SPMD), document inversion algorithm on the NVIDIA GPU/CUDA programming platform utilizing the huge computational power of the GPU, to develop high performance solutions for document indexing. Our proposed parallel document inversion system shows 2-3 times faster performance than a sequential system on two different test datasets from PubMed abstract and e-commerce product reviews. CCS Concepts •Information systems➝Information retrieval • Computing methodologies➝Massively parallel and high-performance simulations. PMID:29861701

  9. NavP: Structured and Multithreaded Distributed Parallel Programming

    NASA Technical Reports Server (NTRS)

    Pan, Lei

    2007-01-01

    We present Navigational Programming (NavP) -- a distributed parallel programming methodology based on the principles of migrating computations and multithreading. The four major steps of NavP are: (1) Distribute the data using the data communication pattern in a given algorithm; (2) Insert navigational commands for the computation to migrate and follow large-sized distributed data; (3) Cut the sequential migrating thread and construct a mobile pipeline; and (4) Loop back for refinement. NavP is significantly different from the current prevailing Message Passing (MP) approach. The advantages of NavP include: (1) NavP is structured distributed programming and it does not change the code structure of an original algorithm. This is in sharp contrast to MP as MP implementations in general do not resemble the original sequential code; (2) NavP implementations are always competitive with the best MPI implementations in terms of performance. Approaches such as DSM or HPF have failed to deliver satisfying performance as of today in contrast, even if they are relatively easy to use compared to MP; (3) NavP provides incremental parallelization, which is beyond the reach of MP; and (4) NavP is a unifying approach that allows us to exploit both fine- (multithreading on shared memory) and coarse- (pipelined tasks on distributed memory) grained parallelism. This is in contrast to the currently popular hybrid use of MP+OpenMP, which is known to be complex to use. We present experimental results that demonstrate the effectiveness of NavP.

  10. Airbreathing Propulsion System Analysis Using Multithreaded Parallel Processing

    NASA Technical Reports Server (NTRS)

    Schunk, Richard Gregory; Chung, T. J.; Rodriguez, Pete (Technical Monitor)

    2000-01-01

    In this paper, parallel processing is used to analyze the mixing, and combustion behavior of hypersonic flow. Preliminary work for a sonic transverse hydrogen jet injected from a slot into a Mach 4 airstream in a two-dimensional duct combustor has been completed [Moon and Chung, 1996]. Our aim is to extend this work to three-dimensional domain using multithreaded domain decomposition parallel processing based on the flowfield-dependent variation theory. Numerical simulations of chemically reacting flows are difficult because of the strong interactions between the turbulent hydrodynamic and chemical processes. The algorithm must provide an accurate representation of the flowfield, since unphysical flowfield calculations will lead to the faulty loss or creation of species mass fraction, or even premature ignition, which in turn alters the flowfield information. Another difficulty arises from the disparity in time scales between the flowfield and chemical reactions, which may require the use of finite rate chemistry. The situations are more complex when there is a disparity in length scales involved in turbulence. In order to cope with these complicated physical phenomena, it is our plan to utilize the flowfield-dependent variation theory mentioned above, facilitated by large eddy simulation. Undoubtedly, the proposed computation requires the most sophisticated computational strategies. The multithreaded domain decomposition parallel processing will be necessary in order to reduce both computational time and storage. Without special treatments involved in computer engineering, our attempt to analyze the airbreathing combustion appears to be difficult, if not impossible.

  11. Large Scale Document Inversion using a Multi-threaded Computing System.

    PubMed

    Jung, Sungbo; Chang, Dar-Jen; Park, Juw Won

    2017-06-01

    Current microprocessor architecture is moving towards multi-core/multi-threaded systems. This trend has led to a surge of interest in using multi-threaded computing devices, such as the Graphics Processing Unit (GPU), for general purpose computing. We can utilize the GPU in computation as a massive parallel coprocessor because the GPU consists of multiple cores. The GPU is also an affordable, attractive, and user-programmable commodity. Nowadays a lot of information has been flooded into the digital domain around the world. Huge volume of data, such as digital libraries, social networking services, e-commerce product data, and reviews, etc., is produced or collected every moment with dramatic growth in size. Although the inverted index is a useful data structure that can be used for full text searches or document retrieval, a large number of documents will require a tremendous amount of time to create the index. The performance of document inversion can be improved by multi-thread or multi-core GPU. Our approach is to implement a linear-time, hash-based, single program multiple data (SPMD), document inversion algorithm on the NVIDIA GPU/CUDA programming platform utilizing the huge computational power of the GPU, to develop high performance solutions for document indexing. Our proposed parallel document inversion system shows 2-3 times faster performance than a sequential system on two different test datasets from PubMed abstract and e-commerce product reviews. •Information systems➝Information retrieval • Computing methodologies➝Massively parallel and high-performance simulations.

  12. An Alternative Water Processor for Long Duration Space Missions

    NASA Technical Reports Server (NTRS)

    Barta, Daniel J.; Pickering, Karen D.; Meyer, Caitlin; Pennsinger, Stuart; Vega, Leticia; Flynn, Michael; Jackson, Andrew; Wheeler, Raymond

    2014-01-01

    A new wastewater recovery system has been developed that combines novel biological and physicochemical components for recycling wastewater on long duration human space missions. Functionally, this Alternative Water Processor (AWP) would replace the Urine Processing Assembly on the International Space Station and reduce or eliminate the need for the multi-filtration beds of the Water Processing Assembly (WPA). At its center are two unique game changing technologies: 1) a biological water processor (BWP) to mineralize organic forms of carbon and nitrogen and 2) an advanced membrane processor (Forward Osmosis Secondary Treatment) for removal of solids and inorganic ions. The AWP is designed for recycling larger quantities of wastewater from multiple sources expected during future exploration missions, including urine, hygiene (hand wash, shower, oral and shave) and laundry. The BWP utilizes a single-stage membrane-aerated biological reactor for simultaneous nitrification and denitrification. The Forward Osmosis Secondary Treatment (FOST) system uses a combination of forward osmosis (FO) and reverse osmosis (RO), is resistant to biofouling and can easily tolerate wastewaters high in non-volatile organics and solids associated with shower and/or hand washing. The BWP has been operated continuously for over 300 days. After startup, the mature biological system averaged 85% organic carbon removal and 44% nitrogen removal, close to stoichiometric maximum based on available carbon. To date, the FOST has averaged 93% water recovery, with a maximum of 98%. If the wastewater is slighty acidified, ammonia rejection is optimal. This paper will provide a description of the technology and summarize results from ground-based testing using real wastewater

  13. An Adaptive Insertion and Promotion Policy for Partitioned Shared Caches

    NASA Astrophysics Data System (ADS)

    Mahrom, Norfadila; Liebelt, Michael; Raof, Rafikha Aliana A.; Daud, Shuhaizar; Hafizah Ghazali, Nur

    2018-03-01

    Cache replacement policies in chip multiprocessors (CMP) have been investigated extensively and proven able to enhance shared cache management. However, competition among multiple processors executing different threads that require simultaneous access to a shared memory may cause cache contention and memory coherence problems on the chip. These issues also exist due to some drawbacks of the commonly used Least Recently Used (LRU) policy employed in multiprocessor systems, which are because of the cache lines residing in the cache longer than required. In image processing analysis of for example extra pulmonary tuberculosis (TB), an accurate diagnosis for tissue specimen is required. Therefore, a fast and reliable shared memory management system to execute algorithms for processing vast amount of specimen image is needed. In this paper, the effects of the cache replacement policy in a partitioned shared cache are investigated. The goal is to quantify whether better performance can be achieved by using less complex replacement strategies. This paper proposes a Middle Insertion 2 Positions Promotion (MI2PP) policy to eliminate cache misses that could adversely affect the access patterns and the throughput of the processors in the system. The policy employs a static predefined insertion point, near distance promotion, and the concept of ownership in the eviction policy to effectively improve cache thrashing and to avoid resource stealing among the processors.

  14. Mapping of MPEG-4 decoding on a flexible architecture platform

    NASA Astrophysics Data System (ADS)

    van der Tol, Erik B.; Jaspers, Egbert G.

    2001-12-01

    In the field of consumer electronics, the advent of new features such as Internet, games, video conferencing, and mobile communication has triggered the convergence of television and computers technologies. This requires a generic media-processing platform that enables simultaneous execution of very diverse tasks such as high-throughput stream-oriented data processing and highly data-dependent irregular processing with complex control flows. As a representative application, this paper presents the mapping of a Main Visual profile MPEG-4 for High-Definition (HD) video onto a flexible architecture platform. A stepwise approach is taken, going from the decoder application toward an implementation proposal. First, the application is decomposed into separate tasks with self-contained functionality, clear interfaces, and distinct characteristics. Next, a hardware-software partitioning is derived by analyzing the characteristics of each task such as the amount of inherent parallelism, the throughput requirements, the complexity of control processing, and the reuse potential over different applications and different systems. Finally, a feasible implementation is proposed that includes amongst others a very-long-instruction-word (VLIW) media processor, one or more RISC processors, and some dedicated processors. The mapping study of the MPEG-4 decoder proves the flexibility and extensibility of the media-processing platform. This platform enables an effective HW/SW co-design yielding a high performance density.

  15. Spaceborne Processor Array

    NASA Technical Reports Server (NTRS)

    Chow, Edward T.; Schatzel, Donald V.; Whitaker, William D.; Sterling, Thomas

    2008-01-01

    A Spaceborne Processor Array in Multifunctional Structure (SPAMS) can lower the total mass of the electronic and structural overhead of spacecraft, resulting in reduced launch costs, while increasing the science return through dynamic onboard computing. SPAMS integrates the multifunctional structure (MFS) and the Gilgamesh Memory, Intelligence, and Network Device (MIND) multi-core in-memory computer architecture into a single-system super-architecture. This transforms every inch of a spacecraft into a sharable, interconnected, smart computing element to increase computing performance while simultaneously reducing mass. The MIND in-memory architecture provides a foundation for high-performance, low-power, and fault-tolerant computing. The MIND chip has an internal structure that includes memory, processing, and communication functionality. The Gilgamesh is a scalable system comprising multiple MIND chips interconnected to operate as a single, tightly coupled, parallel computer. The array of MIND components shares a global, virtual name space for program variables and tasks that are allocated at run time to the distributed physical memory and processing resources. Individual processor- memory nodes can be activated or powered down at run time to provide active power management and to configure around faults. A SPAMS system is comprised of a distributed Gilgamesh array built into MFS, interfaces into instrument and communication subsystems, a mass storage interface, and a radiation-hardened flight computer.

  16. Implementation and Optimization of miniGMG - a Compact Geometric Multigrid Benchmark

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Williams, Samuel; Kalamkar, Dhiraj; Singh, Amik

    2012-12-01

    Multigrid methods are widely used to accelerate the convergence of iterative solvers for linear systems used in a number of different application areas. In this report, we describe miniGMG, our compact geometric multigrid benchmark designed to proxy the multigrid solves found in AMR applications. We explore optimization techniques for geometric multigrid on existing and emerging multicore systems including the Opteron-based Cray XE6, Intel Sandy Bridge and Nehalem-based Infiniband clusters, as well as manycore-based architectures including NVIDIA's Fermi and Kepler GPUs and Intel's Knights Corner (KNC) co-processor. This report examines a variety of novel techniques including communication-aggregation, threaded wavefront-based DRAM communication-avoiding,more » dynamic threading decisions, SIMDization, and fusion of operators. We quantify performance through each phase of the V-cycle for both single-node and distributed-memory experiments and provide detailed analysis for each class of optimization. Results show our optimizations yield significant speedups across a variety of subdomain sizes while simultaneously demonstrating the potential of multi- and manycore processors to dramatically accelerate single-node performance. However, our analysis also indicates that improvements in networks and communication will be essential to reap the potential of manycore processors in large-scale multigrid calculations.« less

  17. Two schemes for rapid generation of digital video holograms using PC cluster

    NASA Astrophysics Data System (ADS)

    Park, Hanhoon; Song, Joongseok; Kim, Changseob; Park, Jong-Il

    2017-12-01

    Computer-generated holography (CGH), which is a process of generating digital holograms, is computationally expensive. Recently, several methods/systems of parallelizing the process using graphic processing units (GPUs) have been proposed. Indeed, use of multiple GPUs or a personal computer (PC) cluster (each PC with GPUs) enabled great improvements in the process speed. However, extant literature has less often explored systems involving rapid generation of multiple digital holograms and specialized systems for rapid generation of a digital video hologram. This study proposes a system that uses a PC cluster and is able to more efficiently generate a video hologram. The proposed system is designed to simultaneously generate multiple frames and accelerate the generation by parallelizing the CGH computations across a number of frames, as opposed to separately generating each individual frame while parallelizing the CGH computations within each frame. The proposed system also enables the subprocesses for generating each frame to execute in parallel through multithreading. With these two schemes, the proposed system significantly reduced the data communication time for generating a digital hologram when compared with that of the state-of-the-art system.

  18. Spectral Analysis Tool 6.2 for Windows

    NASA Technical Reports Server (NTRS)

    Morgan, Feiming; Sue, Miles; Peng, Ted; Tan, Harry; Liang, Robert; Kinman, Peter

    2006-01-01

    Spectral Analysis Tool 6.2 is the latest version of a computer program that assists in analysis of interference between radio signals of the types most commonly used in Earth/spacecraft radio communications. [An earlier version was reported in Software for Analyzing Earth/Spacecraft Radio Interference (NPO-20422), NASA Tech Briefs, Vol. 25, No. 4 (April 2001), page 52.] SAT 6.2 calculates signal spectra, bandwidths, and interference effects for several families of modulation schemes. Several types of filters can be modeled, and the program calculates and displays signal spectra after filtering by any of the modeled filters. The program accommodates two simultaneous signals: a desired signal and an interferer. The interference-to-signal power ratio can be calculated for the filtered desired and interfering signals. Bandwidth-occupancy and link-budget calculators are included for the user s convenience. SAT 6.2 has a new software structure and provides a new user interface that is both intuitive and convenient. SAT 6.2 incorporates multi-tasking, multi-threaded execution, virtual memory management, and a dynamic link library. SAT 6.2 is designed for use on 32- bit computers employing Microsoft Windows operating systems.

  19. Simultaneous determination of thorium, niobium, lead, and zinc by photon-induced x-ray fluorescence of lateritic material

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    LaBrecque, J.J.; Adames, D.; Parker, W.C.

    1981-01-01

    A rapid method is presented for the simultaneous determinations of thorium, niobium, lead, and zinc in lateritic material from Cerro Impacto, Estado Bolivar, Venezuela. This technique uses a PDP - 11/05 processor - based photon induced x-ray fluorescence system. The total variations of approximately 5% for concentrations of approximately 1 and 10% for concentrations of approximately 0.1% were obtained with only 500 s of fluorescent time. The values obtained by this method were in agreement with values measured by conventional flame atomic absorption spectroscopy for lead and zinc. The values for thorium measured were in agreement with the reported valuesmore » for the reference materials supplied by NBL.« less

  20. The Mark III Hypercube-Ensemble Computers

    NASA Technical Reports Server (NTRS)

    Peterson, John C.; Tuazon, Jesus O.; Lieberman, Don; Pniel, Moshe

    1988-01-01

    Mark III Hypercube concept applied in development of series of increasingly powerful computers. Processor of each node of Mark III Hypercube ensemble is specialized computer containing three subprocessors and shared main memory. Solves problem quickly by simultaneously processing part of problem at each such node and passing combined results to host computer. Disciplines benefitting from speed and memory capacity include astrophysics, geophysics, chemistry, weather, high-energy physics, applied mechanics, image processing, oil exploration, aircraft design, and microcircuit design.

  1. CDL description of the CDC 6600 stunt box

    NASA Technical Reports Server (NTRS)

    Hertzog, J. B.

    1971-01-01

    The CDC 6600 central memory control (stunt box) is described utilizing CDL (Computer Design Language), block diagrams, and text. The stunt box is a clearing house for all central memory references from the 6600 central and peripheral processors. Since memory requests can be issued simultaneously, the stunt box must be capable of assigning priorities to requests, of labeling requests so that the data will be distributed correctly, and of remembering rejected addresses due to memory conflicts.

  2. Workshop on Future Directions for Optical Information Processing.

    DTIC Science & Technology

    1981-03-01

    h . The i reference point source simultaneously illuminates the i member of a family of n phase-encoding Aiffusers (e.g. shower glass , ground glass ...diffuser (ground glass ) section illuminated with a plane wave [35.37). The n(n-1) - 4(3) - 12 crosstalk terms have been distributed into the noise...for 2x2 input Fig. 6. Outnut of processor analogous to that array, l.Sx magnifier, ground glass diffuser of Fig. 5, but using spherical wavefront and

  3. Tse computers. [Chinese pictograph character binary image processor design for high speed applications

    NASA Technical Reports Server (NTRS)

    Strong, J. P., III

    1973-01-01

    Tse computers have the potential of operating four or five orders of magnitude faster than present digital computers. The computers of the new design use binary images as their basic computational entity. The word 'tse' is the transliteration of the Chinese word for 'pictograph character.' Tse computers are large collections of devices that perform logical operations on binary images. The operations on binary images are to be performed over the entire image simultaneously.

  4. GPU-based Parallel Application Design for Emerging Mobile Devices

    NASA Astrophysics Data System (ADS)

    Gupta, Kshitij

    A revolution is underway in the computing world that is causing a fundamental paradigm shift in device capabilities and form-factor, with a move from well-established legacy desktop/laptop computers to mobile devices in varying sizes and shapes. Amongst all the tasks these devices must support, graphics has emerged as the 'killer app' for providing a fluid user interface and high-fidelity game rendering, effectively making the graphics processor (GPU) one of the key components in (present and future) mobile systems. By utilizing the GPU as a general-purpose parallel processor, this dissertation explores the GPU computing design space from an applications standpoint, in the mobile context, by focusing on key challenges presented by these devices---limited compute, memory bandwidth, and stringent power consumption requirements---while improving the overall application efficiency of the increasingly important speech recognition workload for mobile user interaction. We broadly partition trends in GPU computing into four major categories. We analyze hardware and programming model limitations in current-generation GPUs and detail an alternate programming style called Persistent Threads, identify four use case patterns, and propose minimal modifications that would be required for extending native support. We show how by manually extracting data locality and altering the speech recognition pipeline, we are able to achieve significant savings in memory bandwidth while simultaneously reducing the compute burden on GPU-like parallel processors. As we foresee GPU computing to evolve from its current 'co-processor' model into an independent 'applications processor' that is capable of executing complex work independently, we create an alternate application framework that enables the GPU to handle all control-flow dependencies autonomously at run-time while minimizing host involvement to just issuing commands, that facilitates an efficient application implementation. Finally, as compute and communication capabilities of mobile devices improve, we analyze energy implications of processing speech recognition locally (on-chip) and offloading it to servers (in-cloud).

  5. Strong Motion Seismograph Based On MEMS Accelerometer

    NASA Astrophysics Data System (ADS)

    Teng, Y.; Hu, X.

    2013-12-01

    The MEMS strong motion seismograph we developed used the modularization method to design its software and hardware.It can fit various needs in different application situation.The hardware of the instrument is composed of a MEMS accelerometer,a control processor system,a data-storage system,a wired real-time data transmission system by IP network,a wireless data transmission module by 3G broadband,a GPS calibration module and power supply system with a large-volumn lithium battery in it. Among it,the seismograph's sensor adopted a three-axis with 14-bit high resolution and digital output MEMS accelerometer.Its noise level just reach about 99μg/√Hz and ×2g to ×8g dynamically selectable full-scale.Its output data rates from 1.56Hz to 800Hz. Its maximum current consumption is merely 165μA,and the device is so small that it is available in a 3mm×3mm×1mm QFN package. Furthermore,there is access to both low pass filtered data as well as high pass filtered data,which minimizes the data analysis required for earthquake signal detection. So,the data post-processing can be simplified. Controlling process system adopts a 32-bit low power consumption embedded ARM9 processor-S3C2440 and is based on the Linux operation system.The processor's operating clock at 400MHz.The controlling system's main memory is a 64MB SDRAM with a 256MB flash-memory.Besides,an external high-capacity SD card data memory can be easily added.So the system can meet the requirements for data acquisition,data processing,data transmission,data storage,and so on. Both wired and wireless network can satisfy remote real-time monitoring, data transmission,system maintenance,status monitoring or updating software.Linux was embedded and multi-layer designed conception was used.The code, including sensor hardware driver,the data acquisition,earthquake setting out and so on,was written on medium layer.The hardware driver consist of IIC-Bus interface driver, IO driver and asynchronous notification driver. The application program layer mainly concludes: earthquake parameter module, local database managing module, data transmission module, remote monitoring, FTP service and so on. The application layer adopted multi-thread process. The whole strong motion seismograph was encapsulated in a small aluminum box, which size is 80mm×120mm×55mm. The inner battery can work continuesly more than 24 hours. The MEMS accelerograph uses modular design for its software part and hardware part. It has remote software update function and can meet the following needs: a) Auto picking up the earthquake event; saving the data on wave-event files and hours files; It may be used for monitoring strong earthquake, explosion, bridge and house health. b) Auto calculate the earthquake parameters, and transferring those parameters by 3G wireless broadband network. This kind of seismograph has characteristics of low cost, easy installation. They can be concentrated in the urban region or areas need to specially care. We can set up a ground motion parameters quick report sensor network while large earthquake break out. Then high-resolution-fine shake-map can be easily produced for the need of emergency rescue. c) By loading P-wave detection program modules, it can be used for earthquake early warning for large earthquakes; d) Can easily construct a high-density layout seismic monitoring network owning remote control and modern intelligent earthquake sensor.

  6. Kernel optimization for short-range molecular dynamics

    NASA Astrophysics Data System (ADS)

    Hu, Changjun; Wang, Xianmeng; Li, Jianjiang; He, Xinfu; Li, Shigang; Feng, Yangde; Yang, Shaofeng; Bai, He

    2017-02-01

    To optimize short-range force computations in Molecular Dynamics (MD) simulations, multi-threading and SIMD optimizations are presented in this paper. With respect to multi-threading optimization, a Partition-and-Separate-Calculation (PSC) method is designed to avoid write conflicts caused by using Newton's third law. Serial bottlenecks are eliminated with no additional memory usage. The method is implemented by using the OpenMP model. Furthermore, the PSC method is employed on Intel Xeon Phi coprocessors in both native and offload models. We also evaluate the performance of the PSC method under different thread affinities on the MIC architecture. In the SIMD execution, we explain the performance influence in the PSC method, considering the "if-clause" of the cutoff radius check. The experiment results show that our PSC method is relatively more efficient compared to some traditional methods. In double precision, our 256-bit SIMD implementation is about 3 times faster than the scalar version.

  7. Multithreaded hybrid feature tracking for markerless augmented reality.

    PubMed

    Lee, Taehee; Höllerer, Tobias

    2009-01-01

    We describe a novel markerless camera tracking approach and user interaction methodology for augmented reality (AR) on unprepared tabletop environments. We propose a real-time system architecture that combines two types of feature tracking. Distinctive image features of the scene are detected and tracked frame-to-frame by computing optical flow. In order to achieve real-time performance, multiple operations are processed in a synchronized multi-threaded manner: capturing a video frame, tracking features using optical flow, detecting distinctive invariant features, and rendering an output frame. We also introduce user interaction methodology for establishing a global coordinate system and for placing virtual objects in the AR environment by tracking a user's outstretched hand and estimating a camera pose relative to it. We evaluate the speed and accuracy of our hybrid feature tracking approach, and demonstrate a proof-of-concept application for enabling AR in unprepared tabletop environments, using bare hands for interaction.

  8. High-Level Data Races

    NASA Technical Reports Server (NTRS)

    Artho, Cyrille; Havelund, Klaus; Biere, Armin; Koga, Dennis (Technical Monitor)

    2003-01-01

    Data races are a common problem in concurrent and multi-threaded programming. They are hard to detect without proper tool support. Despite the successful application of these tools, experience shows that the notion of data race is not powerful enough to capture certain types of inconsistencies occurring in practice. In this paper we investigate data races on a higher abstraction layer. This enables us to detect inconsistent uses of shared variables, even if no classical race condition occurs. For example, a data structure representing a coordinate pair may have to be treated atomically. By lifting the meaning of a data race to a higher level, such problems can now be covered. The paper defines the concepts view and view consistency to give a notation for this novel kind of property. It describes what kinds of errors can be detected with this new definition, and where its limitations are. It also gives a formal guideline for using data structures in a multi-threading environment.

  9. Thread mapping using system-level model for shared memory multicores

    NASA Astrophysics Data System (ADS)

    Mitra, Reshmi

    Exploring thread-to-core mapping options for a parallel application on a multicore architecture is computationally very expensive. For the same algorithm, the mapping strategy (MS) with the best response time may change with data size and thread counts. The primary challenge is to design a fast, accurate and automatic framework for exploring these MSs for large data-intensive applications. This is to ensure that the users can explore the design space within reasonable machine hours, without thorough understanding on how the code interacts with the platform. Response time is related to the cycles per instructions retired (CPI), taking into account both active and sleep states of the pipeline. This work establishes a hybrid approach, based on Markov Chain Model (MCM) and Model Tree (MT) for system-level steady state CPI prediction. It is designed for shared memory multicore processors with coarse-grained multithreading. The thread status is represented by the MCM states. The program characteristics are modeled as the transition probabilities, representing the system moving between active and suspended thread states. The MT model extrapolates these probabilities for the actual application size (AS) from the smaller AS performance. This aspect of the framework, along with, the use of mathematical expressions for the actual AS performance information, results in a tremendous reduction in the CPI prediction time. The framework is validated using an electromagnetics application. The average performance prediction error for steady state CPI results with 12 different MSs is less than 1%. The total run time of model is of the order of minutes, whereas the actual application execution time is in terms of days.

  10. Massively Parallel Signal Processing using the Graphics Processing Unit for Real-Time Brain-Computer Interface Feature Extraction.

    PubMed

    Wilson, J Adam; Williams, Justin C

    2009-01-01

    The clock speeds of modern computer processors have nearly plateaued in the past 5 years. Consequently, neural prosthetic systems that rely on processing large quantities of data in a short period of time face a bottleneck, in that it may not be possible to process all of the data recorded from an electrode array with high channel counts and bandwidth, such as electrocorticographic grids or other implantable systems. Therefore, in this study a method of using the processing capabilities of a graphics card [graphics processing unit (GPU)] was developed for real-time neural signal processing of a brain-computer interface (BCI). The NVIDIA CUDA system was used to offload processing to the GPU, which is capable of running many operations in parallel, potentially greatly increasing the speed of existing algorithms. The BCI system records many channels of data, which are processed and translated into a control signal, such as the movement of a computer cursor. This signal processing chain involves computing a matrix-matrix multiplication (i.e., a spatial filter), followed by calculating the power spectral density on every channel using an auto-regressive method, and finally classifying appropriate features for control. In this study, the first two computationally intensive steps were implemented on the GPU, and the speed was compared to both the current implementation and a central processing unit-based implementation that uses multi-threading. Significant performance gains were obtained with GPU processing: the current implementation processed 1000 channels of 250 ms in 933 ms, while the new GPU method took only 27 ms, an improvement of nearly 35 times.

  11. Historical perspectives on channel pattern in the Clark Fork River, Montana and implications for post-dam removal restoration

    NASA Astrophysics Data System (ADS)

    Woelfle-Erskine, C. A.; Wilcox, A. C.

    2009-12-01

    Active restoration approaches such as channel reconstruction have moved beyond the realm of small streams and are being applied to larger rivers. Uncertainties arising from limited knowledge, fluvial and ecosystem variability, and contaminants are especially significant in restoration of large rivers, where project costs and the social, infrastructural, and ecological costs of failure are high. We use the case of Milltown Dam removal on the Clark Fork River, Montana and subsequent channel reconstruction in the former reservoir to examine the use of historical research and uncertainty analysis in river restoration. At a cost of approximately $120 million, the Milltown Dam removal involves the mechanical removal of approximately 2 million cubic meters of sediments contaminated by upstream mining, followed by restoration of the former reservoir reach in which a single-thread meandering channel is being constructed. Historical maps, surveys, photographs, and accounts suggest a conceptual model of a multi-thread, anastomosing river in the reach targeted for channel reconstruction, upstream of the confluence of the Clark Fork and Blackfoot Rivers. We supplemented historical research with analysis of aerial photographs, topographic data, and USGS stage-discharge measurements in a lotic but reservoir-influenced reach of the Clark Fork River within our study area to estimate avulsion frequency (0.8 avulsions/year over a 70-year period) and average rates of lateral migration and aggradation. These were used to calculate the mobility number, a dimensionless relationship between channel filling and lateral migration timescales that can be used to predict whether a river’s planform is single or multi-threaded. The mobility number within our study reach ranged from 0.6 (multi-thread channel) to 1.7 (transitional channel). We predict that, in the absence of active channel reconstruction, the post-dam channel pattern would evolve to one that alternates between single and multi-threaded. We propose that multiple working hypotheses should be applied to managing uncertainty as part of an adaptive management plan for restoration in our study area and elsewhere. In this approach, restoration planning and implementation would be underpinned by an explicitly identified set of uncertainties and hypotheses about channel processes and post-restoration responses. This framework would allow for and embrace channel processes such as bifurcations and avulsions that are excluded from dominant approaches to channel reconstruction, which emphasize single-thread meandering planforms.

  12. Hydraulic conditions of flood flows in a Polish Carpathian river subjected to variable human impacts

    NASA Astrophysics Data System (ADS)

    Radecki-Pawlik, Artur; Czech, Wiktoria; Wyżga, Bartłomiej; Mikuś, Paweł; Zawiejska, Joanna; Ruiz-Villanueva, Virginia

    2016-04-01

    Channel morphology of the Czarny Dunajec River, Polish Carpathians, has been considerably modified as a result of channelization and gravel-mining induced channel incision, and now it varies from a single-thread, incised or regulated channel to an unmanaged, multi-thread channel. We investigated effects of these distinct channel morphologies on the conditions for flood flows in a study of 25 cross-sections from the middle river course where the Czarny Dunajec receives no significant tributaries and flood discharges increase little in the downstream direction. Cross-sectional morphology, channel slope and roughness of particular cross-section parts were used as input data for the hydraulic modelling performed with the 1D steady-flow HEC-RAS model for discharges with recurrence interval from 1.5 to 50 years. The model for each cross-section was calibrated with the water level of a 20-year flood from May 2014, determined shortly after the flood on the basis of high-water marks. Results indicated that incised and channelized river reaches are typified by similar flow widths and cross-sectional flow areas, which are substantially smaller than those in the multi-thread reach. However, because of steeper channel slope in the incised reach than in the channelized reach, the three river reaches differ in unit stream power and bed shear stress, which attain the highest values in the incised reach, intermediate values in the channelized reach, and the lowest ones in the multi-thread reach. These patterns of flow power and hydraulic forces are reflected in significant differences in river competence between the three river reaches. Since the introduction of the channelization scheme 30 years ago, sedimentation has reduced its initial flow conveyance by more than half and elevated water stages at given flood discharges by about 0.5-0.7 m. This partly reflects a progressive growth of natural levees along artificially stabilized channel banks. By contrast, sediments of natural levees deposited along the multi-thread channel and subsequently eroded in the course of lateral channel migration and floodplain reworking; as a result, they do not reduce the conveyance of floodplain flows in this reach. This study was performed within the scope of the Research Project DEC-2013/09/B/ST10/00056 financed by the National Science Centre of Poland.

  13. History of river regulation of the Noce River (NE Italy) and related bio-morphodynamic responses

    NASA Astrophysics Data System (ADS)

    Serlet, Alyssa; Scorpio, Vittoria; Mastronunzio, Marco; Proto, Matteo; Zen, Simone; Zolezzi, Guido; Bertoldi, Walter; Comiti, Francesco; Prà, Elena Dai; Surian, Nicola; Gurnell, Angela

    2016-04-01

    The Noce River is a hydropower-regulated Alpine stream in Northern-East Italy and a major tributary of the Adige River, the second longest Italian river. The objective of the research is to investigate the response of the lower course of the Noce to two main stages of hydromorphological regulation; channelization/ diversion and, one century later, hydropower regulation. This research uses a historical reconstruction to link the geomorphic response with natural and human-induced factors by identifying morphological and vegetation features from historical maps and airborne photogrammetry and implementing a quantitative analysis of the river response to channelization and flow / sediment supply regulation related to hydropower development. A descriptive overview is presented. The concept of evolutionary trajectory is integrated with predictions from morphodynamic theories for river bars that allow increased insight to investigate the river response to a complex sequence of regulatory events such as development of bars, islands and riparian vegetation. Until the mid-19th century the river had a multi-thread channel pattern. Thereafter (1852) the river was straightened and diverted. Upstream of Mezzolombardo village the river was constrained between embankments of approximately 100 m width while downstream they are of approximately 50 m width. Since channelization some interesting geomorphic changes have appeared in the river e.g. the appearance of alternate bars in the channel. In 1926 there was a breach in the right bank of the downstream part that resulted in a multi-thread river reach which can be viewed as a recovery to the earlier multi-thread pattern. After the 1950's the flow and sediment supply became strongly regulated by hydropower development. The analysis of aerial images reveals that the multi-thread reach became progressively stabilized by vegetation development over the bars, though signs of some dynamics can still be recognizable today, despite the strong hydropeaking that dominates the flow regime. The results of the historical analysis will be used in a larger framework that focuses on interdisciplinary research of interactions between flow, sediment and vegetation in regulated rivers and aims to enhance knowledge on the interplay between river bars and vegetation in the perspective of providing enhanced tools for river rehabilitation and restoration.

  14. Speech understanding in noise with the Roger Pen, Naida CI Q70 processor, and integrated Roger 17 receiver in a multi-talker network.

    PubMed

    De Ceulaer, Geert; Bestel, Julie; Mülder, Hans E; Goldbeck, Felix; de Varebeke, Sebastien Pierre Janssens; Govaerts, Paul J

    2016-05-01

    Roger is a digital adaptive multi-channel remote microphone technology that wirelessly transmits a speaker's voice directly to a hearing instrument or cochlear implant sound processor. Frequency hopping between channels, in combination with repeated broadcast, avoids interference issues that have limited earlier generation FM systems. This study evaluated the benefit of the Roger Pen transmitter microphone in a multiple talker network (MTN) for cochlear implant users in a simulated noisy conversation setting. Twelve post-lingually deafened adult Advanced Bionics CII/HiRes 90K recipients were recruited. Subjects used a Naida CI Q70 processor with integrated Roger 17 receiver. The test environment simulated four people having a meal in a noisy restaurant, one the CI user (listener), and three companions (talkers) talking non-simultaneously in a diffuse field of multi-talker babble. Speech reception thresholds (SRTs) were determined without the Roger Pen, with one Roger Pen, and with three Roger Pens in an MTN. Using three Roger Pens in an MTN improved the SRT by 14.8 dB over using no Roger Pen, and by 13.1 dB over using a single Roger Pen (p < 0.0001). The Roger Pen in an MTN provided statistically and clinically significant improvement in speech perception in noise for Advanced Bionics cochlear implant recipients. The integrated Roger 17 receiver made it easy for users of the Naida CI Q70 processor to take advantage of the Roger system. The listening advantage and ease of use should encourage more clinicians to recommend and fit Roger in adult cochlear implant patients.

  15. Real-time optical signal processors employing optical feedback: amplitude and phase control.

    PubMed

    Gallagher, N C

    1976-04-01

    The development of real-time coherent optical signal processors has increased the appeal of optical computing techniques in signal processing applications. A major limitation of these real-time systems is the. fact that the optical processing material is generally of a phase-only type. The result is that the spatial filters synthesized with these systems must be either phase-only filters or amplitude-only filters. The main concern of this paper is the application of optical feedback techniques to obtain simultaneous and independent amplitude and phase control of the light passing through the system. It is shown that optical feedback techniques may be employed with phase-only spatial filters to obtain this amplitude and phase control. The feedback system with phase-only filters is compared with other feedback systems that employ combinations of phase-only and amplitude-only filters; it is found that the phase-only system is substantially more flexible than the other two systems investigated.

  16. Highly parallel sparse Cholesky factorization

    NASA Technical Reports Server (NTRS)

    Gilbert, John R.; Schreiber, Robert

    1990-01-01

    Several fine grained parallel algorithms were developed and compared to compute the Cholesky factorization of a sparse matrix. The experimental implementations are on the Connection Machine, a distributed memory SIMD machine whose programming model conceptually supplies one processor per data element. In contrast to special purpose algorithms in which the matrix structure conforms to the connection structure of the machine, the focus is on matrices with arbitrary sparsity structure. The most promising algorithm is one whose inner loop performs several dense factorizations simultaneously on a 2-D grid of processors. Virtually any massively parallel dense factorization algorithm can be used as the key subroutine. The sparse code attains execution rates comparable to those of the dense subroutine. Although at present architectural limitations prevent the dense factorization from realizing its potential efficiency, it is concluded that a regular data parallel architecture can be used efficiently to solve arbitrarily structured sparse problems. A performance model is also presented and it is used to analyze the algorithms.

  17. Massively parallel processor computer

    NASA Technical Reports Server (NTRS)

    Fung, L. W. (Inventor)

    1983-01-01

    An apparatus for processing multidimensional data with strong spatial characteristics, such as raw image data, characterized by a large number of parallel data streams in an ordered array is described. It comprises a large number (e.g., 16,384 in a 128 x 128 array) of parallel processing elements operating simultaneously and independently on single bit slices of a corresponding array of incoming data streams under control of a single set of instructions. Each of the processing elements comprises a bidirectional data bus in communication with a register for storing single bit slices together with a random access memory unit and associated circuitry, including a binary counter/shift register device, for performing logical and arithmetical computations on the bit slices, and an I/O unit for interfacing the bidirectional data bus with the data stream source. The massively parallel processor architecture enables very high speed processing of large amounts of ordered parallel data, including spatial translation by shifting or sliding of bits vertically or horizontally to neighboring processing elements.

  18. A new multifunction acousto-optic signal processor

    NASA Technical Reports Server (NTRS)

    Berg, N. J.; Casseday, M. W.; Filipov, A. N.; Pellegrino, J. M.

    1984-01-01

    An acousto-optic architecture for simultaneously obtaining time integration correlation and high-speed power spectrum analysis was constructed using commercially available TeO2 modulators and photodiode detector-arrays. The correlator section of the processor uses coherent interferometry to attain maximum bandwidth and dynamic range while achieving a time-bandwidth product of 1 million. Two correllator outputs are achieved in this system configuration. One is optically filtered and magnified 2 : 1 to decrease the spatial frequency to a level where a 25-MHz bandwidth may be sampled by a 62-mm array with elements on 25-micro centers. The other output is magnified by a factor of 10 such that the center 4 microseconds of information is available for estimation of time-difference-of-arrival to within 10 ns. The Bragg cell spectrum-analyzer section, which also has two outputs, resolves a 25-MHz instantaneous bandwidth to 25 kHz and can determine discrete-frequency reception time to within 15 microseconds. A microprocessor combines spectrum analysis information with that obtained from the correlator.

  19. Load Balancing Strategies for Multiphase Flows on Structured Grids

    NASA Astrophysics Data System (ADS)

    Olshefski, Kristopher; Owkes, Mark

    2017-11-01

    The computation time required to perform large simulations of complex systems is currently one of the leading bottlenecks of computational research. Parallelization allows multiple processing cores to perform calculations simultaneously and reduces computational times. However, load imbalances between processors waste computing resources as processors wait for others to complete imbalanced tasks. In multiphase flows, these imbalances arise due to the additional computational effort required at the gas-liquid interface. However, many current load balancing schemes are only designed for unstructured grid applications. The purpose of this research is to develop a load balancing strategy while maintaining the simplicity of a structured grid. Several approaches are investigated including brute force oversubscription, node oversubscription through Message Passing Interface (MPI) commands, and shared memory load balancing using OpenMP. Each of these strategies are tested with a simple one-dimensional model prior to implementation into the three-dimensional NGA code. Current results show load balancing will reduce computational time by at least 30%.

  20. Production Level CFD Code Acceleration for Hybrid Many-Core Architectures

    NASA Technical Reports Server (NTRS)

    Duffy, Austen C.; Hammond, Dana P.; Nielsen, Eric J.

    2012-01-01

    In this work, a novel graphics processing unit (GPU) distributed sharing model for hybrid many-core architectures is introduced and employed in the acceleration of a production-level computational fluid dynamics (CFD) code. The latest generation graphics hardware allows multiple processor cores to simultaneously share a single GPU through concurrent kernel execution. This feature has allowed the NASA FUN3D code to be accelerated in parallel with up to four processor cores sharing a single GPU. For codes to scale and fully use resources on these and the next generation machines, codes will need to employ some type of GPU sharing model, as presented in this work. Findings include the effects of GPU sharing on overall performance. A discussion of the inherent challenges that parallel unstructured CFD codes face in accelerator-based computing environments is included, with considerations for future generation architectures. This work was completed by the author in August 2010, and reflects the analysis and results of the time.

  1. Beam Stability R&D for the APS MBA Upgrade

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Sereno, Nicholas S.; Arnold, Ned D.; Bui, Hanh D.

    2015-01-01

    Beam diagnostics required for the APS Multi-bend acromat (MBA) are driven by ambitious beam stability requirements. The major AC stability challenge is to correct rms beam motion to 10% the rms beam size at the insertion device source points from0.01 to 1000 Hz. The vertical plane represents the biggest challenge forAC stability, which is required to be 400 nm rms for a 4-micron vertical beam size. In addition to AC stability, long-term drift over a period of seven days is required to be 1 micron or less. Major diagnostics R&D components include improved rf beam position processing using commercially availablemore » FPGA-based BPM processors, new X-ray beam position monitors based on hard X-ray fluorescence from copper and Compton scattering off diamond, mechanical motion sensing to detect and correct long-term vacuum chamber drift, a new feedback system featuring a tenfold increase in sampling rate, and a several-fold increase in the number of fast correctors and BPMs in the feedback algorithm. Feedback system development represents a major effort, and we are pursuing development of a novel algorithm that integrates orbit correction for both slow and fast correctors down to DC simultaneously. Finally, a new data acquisition system (DAQ) is being developed to simultaneously acquire streaming data from all diagnostics as well as the feedback processors for commissioning and fault diagnosis. Results of studies and the design effort are reported.« less

  2. Polymer-Based Dense Fluidic Networks for High Throughput Screening with Ultrasensitive Fluorescence Detection

    PubMed Central

    Okagbare, Paul I.; Soper, Steven A.

    2011-01-01

    Microfluidics represents a viable platform for performing High Throughput Screening (HTS) due to its ability to automate fluid handling and generate fluidic networks with high number densities over small footprints appropriate for the simultaneous optical interrogation of many screening assays. While most HTS campaigns depend on fluorescence, readers typically use point detection and serially address the assay results significantly lowering throughput or detection sensitivity due to a low duty cycle. To address this challenge, we present here the fabrication of a high density microfluidic network packed into the imaging area of a large field-of-view (FoV) ultrasensitive fluorescence detection system. The fluidic channels were 1, 5 or 10 μm (width), 1 μm (depth) with a pitch of 1–10 μm and each fluidic processor was individually addressable. The fluidic chip was produced from a molding tool using hot embossing and thermal fusion bonding to enclose the fluidic channels. A 40X microscope objective (numerical aperture = 0.75) created a FoV of 200 μm, providing the ability to interrogate ~25 channels using the current fluidic configuration. An ultrasensitive fluorescence detection system with a large FoV was used to transduce fluorescence signals simultaneously from each fluidic processor onto the active area of an electron multiplying charge-coupled device (EMCCD). The utility of these multichannel networks for HTS was demonstrated by carrying out the high throughput monitoring of the activity of an enzyme, APE1, used as a model screening assay. PMID:20872611

  3. Interior Noise Reduction by Adaptive Feedback Vibration Control

    NASA Technical Reports Server (NTRS)

    Lim, Tae W.

    1998-01-01

    The objective of this project is to investigate the possible use of adaptive digital filtering techniques in simultaneous, multiple-mode identification of the modal parameters of a vibrating structure in real-time. It is intended that the results obtained from this project will be used for state estimation needed in adaptive structural acoustics control. The work done in this project is basically an extension of the work on real-time single mode identification, which was performed successfully using a digital signal processor (DSP) at NASA, Langley. Initially, in this investigation the single mode identification work was duplicated on a different processor, namely the Texas Instruments TMS32OC40 DSP. The system identification results for the single mode case were very good. Then an algorithm for simultaneous two mode identification was developed and tested using analytical simulation. When it successfully performed the expected tasks, it was implemented in real-time on the DSP system to identify the first two modes of vibration of a cantilever aluminum beam. The results of the simultaneous two mode case were good but some problems were identified related to frequency warping and spurious mode identification. The frequency warping problem was found to be due to the bilinear transformation used in the algorithm to convert the system transfer function from the continuous-time domain to the discrete-time domain. An alternative approach was developed to rectify the problem. The spurious mode identification problem was found to be associated with high sampling rates. Noise in the signal is suspected to be the cause of this problem but further investigation will be needed to clarify the cause. For simultaneous identification of more than two modes, it was found that theoretically an adaptive digital filter can be designed to identify the required number of modes, but the algebra became very complex which made it impossible to implement in the DSP system used in this study. The on-line identification algorithm developed in this research will be useful in constructing a state estimator for feedback vibration control.

  4. Development of a Next Generation Concurrent Framework for the ATLAS Experiment

    NASA Astrophysics Data System (ADS)

    Calafiura, P.; Lampl, W.; Leggett, C.; Malon, D.; Stewart, G.; Wynne, B.

    2015-12-01

    The ATLAS experiment has successfully used its Gaudi/Athena software framework for data taking and analysis during the first LHC run, with billions of events successfully processed. However, the design of Gaudi/Athena dates from early 2000 and the software and the physics code has been written using a single threaded, serial design. This programming model has increasing difficulty in exploiting the potential of current CPUs, which offer their best performance only through taking full advantage of multiple cores and wide vector registers. Future CPU evolution will intensify this trend, with core counts increasing and memory per core falling. With current memory consumption for 64 bit ATLAS reconstruction in a high luminosity environment approaching 4GB, it will become impossible to fully occupy all cores in a machine without exhausting available memory. However, since maximizing performance per watt will be a key metric, a mechanism must be found to use all cores as efficiently as possible. In this paper we report on our progress with a practical demonstration of the use of multithreading in the ATLAS reconstruction software, using the GaudiHive framework. We have expanded support to Calorimeter, Inner Detector, and Tracking code, discussing what changes were necessary in order to allow the serially designed ATLAS code to run, both to the framework and to the tools and algorithms used. We report on both the performance gains, and what general lessons were learned about the code patterns that had been employed in the software and which patterns were identified as particularly problematic for multi-threading. We also present our findings on implementing a hybrid multi-threaded / multi-process framework, to take advantage of the strengths of each type of concurrency, while avoiding some of their corresponding limitations.

  5. Large-N in Volcano Settings: Volcanosri

    NASA Astrophysics Data System (ADS)

    Lees, J. M.; Song, W.; Xing, G.; Vick, S.; Phillips, D.

    2014-12-01

    We seek a paradigm shift in the approach we take on volcano monitoring where the compromise from high fidelity to large numbers of sensors is used to increase coverage and resolution. Accessibility, danger and the risk of equipment loss requires that we develop systems that are independent and inexpensive. Furthermore, rather than simply record data on hard disk for later analysis we desire a system that will work autonomously, capitalizing on wireless technology and in field network analysis. To this end we are currently producing a low cost seismic array which will incorporate, at the very basic level, seismological tools for first cut analysis of a volcano in crises mode. At the advanced end we expect to perform tomographic inversions in the network in near real time. Geophone (4 Hz) sensors connected to a low cost recording system will be installed on an active volcano where triggering earthquake location and velocity analysis will take place independent of human interaction. Stations are designed to be inexpensive and possibly disposable. In one of the first implementations the seismic nodes consist of an Arduino Due processor board with an attached Seismic Shield. The Arduino Due processor board contains an Atmel SAM3X8E ARM Cortex-M3 CPU. This 32 bit 84 MHz processor can filter and perform coarse seismic event detection on a 1600 sample signal in fewer than 200 milliseconds. The Seismic Shield contains a GPS module, 900 MHz high power mesh network radio, SD card, seismic amplifier, and 24 bit ADC. External sensors can be attached to either this 24-bit ADC or to the internal multichannel 12 bit ADC contained on the Arduino Due processor board. This allows the node to support attachment of multiple sensors. By utilizing a high-speed 32 bit processor complex signal processing tasks can be performed simultaneously on multiple sensors. Using a 10 W solar panel, second system being developed can run autonomously and collect data on 3 channels at 100Hz for 6 months with the installed 16Gb SD card. Initial designs and test results will be presented and discussed.

  6. A Multi-Threaded Cryptographic Pseudorandom Number Generator Test Suite

    DTIC Science & Technology

    2016-09-01

    bitcoin thieves, Google releases patch. (2013, Aug. 16). SiliconANGLE. [Online]. Available: http://siliconangle.com/blog/2013/ 08/16/android-crypto-prng...flaw-aided- bitcoin -thieves-google-releases-patch/ [5] M. Gondree. (2014, Sep. 28). NPS POSIX thread pool library. [Online]. Available: https

  7. Parallel k-means++

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    A parallelization of the k-means++ seed selection algorithm on three distinct hardware platforms: GPU, multicore CPU, and multithreaded architecture. K-means++ was developed by David Arthur and Sergei Vassilvitskii in 2007 as an extension of the k-means data clustering technique. These algorithms allow people to cluster multidimensional data, by attempting to minimize the mean distance of data points within a cluster. K-means++ improved upon traditional k-means by using a more intelligent approach to selecting the initial seeds for the clustering process. While k-means++ has become a popular alternative to traditional k-means clustering, little work has been done to parallelize this technique.more » We have developed original C++ code for parallelizing the algorithm on three unique hardware architectures: GPU using NVidia's CUDA/Thrust framework, multicore CPU using OpenMP, and the Cray XMT multithreaded architecture. By parallelizing the process for these platforms, we are able to perform k-means++ clustering much more quickly than it could be done before.« less

  8. Scheduling based on a dynamic resource connection

    NASA Astrophysics Data System (ADS)

    Nagiyev, A. E.; Botygin, I. A.; Shersntneva, A. I.; Konyaev, P. A.

    2017-02-01

    The practical using of distributed computing systems associated with many problems, including troubles with the organization of an effective interaction between the agents located at the nodes of the system, with the specific configuration of each node of the system to perform a certain task, with the effective distribution of the available information and computational resources of the system, with the control of multithreading which implements the logic of solving research problems and so on. The article describes the method of computing load balancing in distributed automatic systems, focused on the multi-agency and multi-threaded data processing. The scheme of the control of processing requests from the terminal devices, providing the effective dynamic scaling of computing power under peak load is offered. The results of the model experiments research of the developed load scheduling algorithm are set out. These results show the effectiveness of the algorithm even with a significant expansion in the number of connected nodes and zoom in the architecture distributed computing system.

  9. Multithreading with separate data to improve the performance of Backpropagation method

    NASA Astrophysics Data System (ADS)

    Dhamma, Mulia; Zarlis, Muhammad; Budhiarti Nababan, Erna

    2017-12-01

    Backpropagation is one method of artificial neural network that can make a prediction for a new data with learning by supervised of the past data. The learning process of backpropagation method will become slow if we give too much data for backpropagation method to learn the data. Multithreading with a separate data inside of each thread are being used in order to improve the performance of backpropagtion method . Base on the research for 39 data and also 5 times experiment with separate data into 2 thread, the result showed that the average epoch become 6490 when using 2 thread and 453049 epoch when using only 1 thread. The most lowest epoch for 2 thread is 1295 and 1 thread is 356116. The process of improvement is caused by the minimum error from 2 thread that has been compared to take the weight and bias value. This process will be repeat as long as the backpropagation do learning.

  10. Real-Time Processing System for the JET Hard X-Ray and Gamma-Ray Profile Monitor Enhancement

    NASA Astrophysics Data System (ADS)

    Fernandes, Ana M.; Pereira, Rita C.; Neto, André; Valcárcel, Daniel F.; Alves, Diogo; Sousa, Jorge; Carvalho, Bernardo B.; Kiptily, Vasily; Syme, Brian; Blanchard, Patrick; Murari, Andrea; Correia, Carlos M. B. A.; Varandas, Carlos A. F.; Gonçalves, Bruno

    2014-06-01

    The Joint European Torus (JET) is currently undertaking an enhancement program which includes tests of relevant diagnostics with real-time processing capabilities for the International Thermonuclear Experimental Reactor (ITER). Accordingly, a new real-time processing system was developed and installed at JET for the gamma-ray and hard X-ray profile monitor diagnostic. The new system is connected to 19 CsI(Tl) photodiodes in order to obtain the line-integrated profiles of the gamma-ray and hard X-ray emissions. Moreover, it was designed to overcome the former data acquisition (DAQ) limitations while exploiting the required real-time features. The new DAQ hardware, based on the Advanced Telecommunication Computer Architecture (ATCA) standard, includes reconfigurable digitizer modules with embedded field-programmable gate array (FPGA) devices capable of acquiring and simultaneously processing data in real-time from the 19 detectors. A suitable algorithm was developed and implemented in the FPGAs, which are able to deliver the corresponding energy of the acquired pulses. The processed data is sent periodically, during the discharge, through the JET real-time network and stored in the JET scientific databases at the end of the pulse. The interface between the ATCA digitizers, the JET control and data acquisition system (CODAS), and the JET real-time network is provided by the Multithreaded Application Real-Time executor (MARTe). The work developed allowed attaining two of the major milestones required by next fusion devices: the ability to process and simultaneously supply high volume data rates in real-time.

  11. Universal computer control system (UCCS) for space telerobots

    NASA Technical Reports Server (NTRS)

    Bejczy, Antal K.; Szakaly, Zoltan

    1987-01-01

    A universal computer control system (UCCS) is under development for all motor elements of a space telerobot. The basic hardware architecture and software design of UCCS are described, together with the rich motor sensing, control, and self-test capabilities of this all-computerized motor control system. UCCS is integrated into a multibus computer environment with direct interface to higher level control processors, uses pulsewidth multiplier power amplifiers, and one unit can control up to sixteen different motors simultaneously at a high I/O rate. UCCS performance capabilities are illustrated by a few data.

  12. A personal computer-based, multitasking data acquisition system

    NASA Technical Reports Server (NTRS)

    Bailey, Steven A.

    1990-01-01

    A multitasking, data acquisition system was written to simultaneously collect meteorological radar and telemetry data from two sources. This system is based on the personal computer architecture. Data is collected via two asynchronous serial ports and is deposited to disk. The system is written in both the C programming language and assembler. It consists of three parts: a multitasking kernel for data collection, a shell with pull down windows as user interface, and a graphics processor for editing data and creating coded messages. An explanation of both system principles and program structure is presented.

  13. A multitasking, multisinked, multiprocessor data acquisition front end

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Fox, R.; Au, R.; Molen, A.V.

    1989-10-01

    The authors have developed a generalized data acquisition front end system which is based on MC68020 processors running a commercial real time kernel (rhoSOS), and implemented primarily in a high level language (C). This system has been attached to the back end on-line computing system at NSCL via our high performance ETHERNET protocol. Data may be simultaneously sent to any number of back end systems. Fixed fraction sampling along links to back end computing is also supported. A nonprocedural program generator simplifies the development of experiment specific code.

  14. Simultaneous Range-Velocity Processing and SNR Analysis of AFIT’s Random Noise Radar

    DTIC Science & Technology

    2012-03-22

    reducing the overall processing time. Two computers, equipped with NVIDIA ® GPUs, were used to process the col- 45 lected data. The specifications for each...gather the results back to the CPU. Another company , AccelerEyes®, has developed a product called Jacket® that claims to be better than the parallel...Number of Processing Cores 4 8 Processor Speed 3.33 GHz 3.07 GHz Installed Memory 48 GB 48 GB GPU Make NVIDIA NVIDIA GPU Model Tesla 1060 Tesla C2070 GPU

  15. An Expert System Interfaced with a Database System to Perform Troubleshooting of Aircraft Carrier Piping Systems

    DTIC Science & Technology

    1988-12-01

    interval of four feet, and are numbered sequentially bow to stem. * "wing tank" is a tank or void, outboard of the holding bulkhead, away from the center...system and DBMS simultaneously with a multi-processor, allowing queries to the DBMS without terminating the expert system. This method was judged...RECIRC). eductor -strip("Y"):- ask _ques _read_ans(OVBD,"ovbd dis open"),ovbd dis-open(OVBD). eductor-strip("N"):- ask_ques read_ans( LINEUP , "strip lineup

  16. Integration of Diagnostics into Ground Equipment Study. Volume 1

    DTIC Science & Technology

    2004-07-30

    Marine Corps V-22, CH-53E, MH-53E, SH- 60B, MH- 60S /R, AH-1Z and UH -1Y aircraft. In addition, 30 systems are in delivery to the US Army Aviation Applied...simultaneous) can be connected to the VMEP system, which is based on a PC-104 platform and a 233MHz processor. The AH-64 Apache and UH - 60 Blackhawk are outfitted...34A Model-Based Health and Usage Monitoring and Diagnostic System for the UH - 60 Helicopter," Proceedings of the American Helicopter Society 57th

  17. SimExTargId: A comprehensive package for real-time LC-MS data acquisition and analysis.

    PubMed

    Edmands, William M B; Hayes, Josie; Rappaport, Stephen M

    2018-05-22

    Liquid chromatography mass spectrometry (LC-MS) is the favored method for untargeted metabolomic analysis of small molecules in biofluids. Here we present SimExTargId, an open-source R package for autonomous analysis of metabolomic data and real-time observation of experimental runs. This simultaneous, fully automated and multi-threaded (optional) package is a wrapper for vendor-independent format conversion (ProteoWizard), xcms- and CAMERA- based peak-picking, MetMSLine-based pre-processing and covariate-based statistical analysis. Users are notified of detrimental instrument drift or errors by email. Also included are two shiny applications, targetId for real-time MS2 target identification, and peakMonitor to monitor targeted metabolites. SimExTargId is publicly available under GNU LGPL v3.0 license at https://github.com/JosieLHayes/simExTargId, which includes a vignette with example data. SimExTargId should be installed on a dedicated data-processing workstation or server that is networked to the LC-MS platform to facilitate MS1 profiling of metabolomic data. josie.hayes@berkeley.edu. Supplementary data are available at Bioinformatics online.

  18. Distributed processor allocation for launching applications in a massively connected processors complex

    DOEpatents

    Pedretti, Kevin

    2008-11-18

    A compute processor allocator architecture for allocating compute processors to run applications in a multiple processor computing apparatus is distributed among a subset of processors within the computing apparatus. Each processor of the subset includes a compute processor allocator. The compute processor allocators can share a common database of information pertinent to compute processor allocation. A communication path permits retrieval of information from the database independently of the compute processor allocators.

  19. A Wireless Headstage for Combined Optogenetics and Multichannel Electrophysiological Recording.

    PubMed

    Gagnon-Turcotte, Gabriel; LeChasseur, Yoan; Bories, Cyril; Messaddeq, Younes; De Koninck, Yves; Gosselin, Benoit

    2017-02-01

    This paper presents a wireless headstage with real-time spike detection and data compression for combined optogenetics and multichannel electrophysiological recording. The proposed headstage, which is intended to perform both optical stimulation and electrophysiological recordings simultaneously in freely moving transgenic rodents, is entirely built with commercial off-the-shelf components, and includes 32 recording channels and 32 optical stimulation channels. It can detect, compress and transmit full action potential waveforms over 32 channels in parallel and in real time using an embedded digital signal processor based on a low-power field programmable gate array and a Microblaze microprocessor softcore. Such a processor implements a complete digital spike detector featuring a novel adaptive threshold based on a Sigma-delta control loop, and a wavelet data compression module using a new dynamic coefficient re-quantization technique achieving large compression ratios with higher signal quality. Simultaneous optical stimulation and recording have been performed in-vivo using an optrode featuring 8 microelectrodes and 1 implantable fiber coupled to a 465-nm LED, in the somatosensory cortex and the Hippocampus of a transgenic mouse expressing ChannelRhodospin (Thy1::ChR2-YFP line 4) under anesthetized conditions. Experimental results show that the proposed headstage can trigger neuron activity while collecting, detecting and compressing single cell microvolt amplitude activity from multiple channels in parallel while achieving overall compression ratios above 500. This is the first reported high-channel count wireless optogenetic device providing simultaneous optical stimulation and recording. Measured characteristics show that the proposed headstage can achieve up to 100% of true positive detection rate for signal-to-noise ratio (SNR) down to 15 dB, while achieving up to 97.28% at SNR as low as 5 dB. The implemented prototype features a lifespan of up to 105 minutes, and uses a lightweight (2.8 g) and compact [Formula: see text] rigid-flex printed circuit board.

  20. Fast 2D FWI on a multi and many-cores workstation.

    NASA Astrophysics Data System (ADS)

    Thierry, Philippe; Donno, Daniela; Noble, Mark

    2014-05-01

    Following the introduction of x86 co-processors (Xeon Phi) and the performance increase of standard 2-socket workstations using the latest 12 cores E5-v2 x86-64 CPU, we present here a MPI + OpenMP implementation of an acoustic 2D FWI (full waveform inversion) code which simultaneously runs on the CPUs and on the co-processors installed in a workstation. The main advantage of running a 2D FWI on a workstation is to be able to quickly evaluate new features such as more complicated wave equations, new cost functions, finite-difference stencils or boundary conditions. Since the co-processor is made of 61 in-order x86 cores, each of them having up to 4 threads, this many-core can be seen as a shared memory SMP (symmetric multiprocessing) machine with its own IP address. Depending on the vendor, a single workstation can handle several co-processors making the workstation as a personal cluster under the desk. The original Fortran 90 CPU version of the 2D FWI code is just recompiled to get a Xeon Phi x86 binary. This multi and many-core configuration uses standard compilers and associated MPI as well as math libraries under Linux; therefore, the cost of code development remains constant, while improving computation time. We choose to implement the code with the so-called symmetric mode to fully use the capacity of the workstation, but we also evaluate the scalability of the code in native mode (i.e running only on the co-processor) thanks to the Linux ssh and NFS capabilities. Usual care of optimization and SIMD vectorization is used to ensure optimal performances, and to analyze the application performances and bottlenecks on both platforms. The 2D FWI implementation uses finite-difference time-domain forward modeling and a quasi-Newton (with L-BFGS algorithm) optimization scheme for the model parameters update. Parallelization is achieved through standard MPI shot gathers distribution and OpenMP for domain decomposition within the co-processor. Taking advantage of the 16 GB of memory available on the co-processor we are able to keep wavefields in memory to achieve the gradient computation by cross-correlation of forward and back-propagated wavefields needed by our time-domain FWI scheme, without heavy traffic on the i/o subsystem and PCIe bus. In this presentation we will also review some simple methodologies to determine performance expectation compared to real performances in order to get optimization effort estimation before starting any huge modification or rewriting of research codes. The key message is the ease of use and development of this hybrid configuration to reach not the absolute peak performance value but the optimal one that ensures the best balance between geophysical and computer developments.

  1. MARTe: A Multiplatform Real-Time Framework

    NASA Astrophysics Data System (ADS)

    Neto, André C.; Sartori, Filippo; Piccolo, Fabio; Vitelli, Riccardo; De Tommasi, Gianmaria; Zabeo, Luca; Barbalace, Antonio; Fernandes, Horacio; Valcarcel, Daniel F.; Batista, Antonio J. N.

    2010-04-01

    Development of real-time applications is usually associated with nonportable code targeted at specific real-time operating systems. The boundary between hardware drivers, system services, and user code is commonly not well defined, making the development in the target host significantly difficult. The Multithreaded Application Real-Time executor (MARTe) is a framework built over a multiplatform library that allows the execution of the same code in different operating systems. The framework provides the high-level interfaces with hardware, external configuration programs, and user interfaces, assuring at the same time hard real-time performances. End-users of the framework are required to define and implement algorithms inside a well-defined block of software, named Generic Application Module (GAM), that is executed by the real-time scheduler. Each GAM is reconfigurable with a set of predefined configuration meta-parameters and interchanges information using a set of data pipes that are provided as inputs and required as output. Using these connections, different GAMs can be chained either in series or parallel. GAMs can be developed and debugged in a non-real-time system and, only once the robustness of the code and correctness of the algorithm are verified, deployed to the real-time system. The software also supplies a large set of utilities that greatly ease the interaction and debugging of a running system. Among the most useful are a highly efficient real-time logger, HTTP introspection of real-time objects, and HTTP remote configuration. MARTe is currently being used to successfully drive the plasma vertical stabilization controller on the largest magnetic confinement fusion device in the world, with a control loop cycle of 50 ?s and a jitter under 1 ?s. In this particular project, MARTe is used with the Real-Time Application Interface (RTAI)/Linux operating system exploiting the new ?86 multicore processors technology.

  2. Development of an extensible dual-core wireless sensing node for cyber-physical systems

    NASA Astrophysics Data System (ADS)

    Kane, Michael; Zhu, Dapeng; Hirose, Mitsuhito; Dong, Xinjun; Winter, Benjamin; Häckell, Mortiz; Lynch, Jerome P.; Wang, Yang; Swartz, A.

    2014-04-01

    The introduction of wireless telemetry into the design of monitoring and control systems has been shown to reduce system costs while simplifying installations. To date, wireless nodes proposed for sensing and actuation in cyberphysical systems have been designed using microcontrollers with one computational pipeline (i.e., single-core microcontrollers). While concurrent code execution can be implemented on single-core microcontrollers, concurrency is emulated by splitting the pipeline's resources to support multiple threads of code execution. For many applications, this approach to multi-threading is acceptable in terms of speed and function. However, some applications such as feedback controls demand deterministic timing of code execution and maximum computational throughput. For these applications, the adoption of multi-core processor architectures represents one effective solution. Multi-core microcontrollers have multiple computational pipelines that can execute embedded code in parallel and can be interrupted independent of one another. In this study, a new wireless platform named Martlet is introduced with a dual-core microcontroller adopted in its design. The dual-core microcontroller design allows Martlet to dedicate one core to standard wireless sensor operations while the other core is reserved for embedded data processing and real-time feedback control law execution. Another distinct feature of Martlet is a standardized hardware interface that allows specialized daughter boards (termed wing boards) to be interfaced to the Martlet baseboard. This extensibility opens opportunity to encapsulate specialized sensing and actuation functions in a wing board without altering the design of Martlet. In addition to describing the design of Martlet, a few example wings are detailed, along with experiments showing the Martlet's ability to monitor and control physical systems such as wind turbines and buildings.

  3. Developing Multimedia Courseware for the Internet's Java versus Shockwave.

    ERIC Educational Resources Information Center

    Majchrzak, Tina L.

    1996-01-01

    Describes and compares two methods for developing multimedia courseware for use on the Internet: an authoring tool called Shockwave, and an object-oriented language called Java. Topics include vector graphics, browsers, interaction with network protocols, data security, multithreading, and computer languages versus development environments. (LRW)

  4. Implementation of a fully-balanced periodic tridiagonal solver on a parallel distributed memory architecture

    NASA Technical Reports Server (NTRS)

    Eidson, T. M.; Erlebacher, G.

    1994-01-01

    While parallel computers offer significant computational performance, it is generally necessary to evaluate several programming strategies. Two programming strategies for a fairly common problem - a periodic tridiagonal solver - are developed and evaluated. Simple model calculations as well as timing results are presented to evaluate the various strategies. The particular tridiagonal solver evaluated is used in many computational fluid dynamic simulation codes. The feature that makes this algorithm unique is that these simulation codes usually require simultaneous solutions for multiple right-hand-sides (RHS) of the system of equations. Each RHS solutions is independent and thus can be computed in parallel. Thus a Gaussian elimination type algorithm can be used in a parallel computation and the more complicated approaches such as cyclic reduction are not required. The two strategies are a transpose strategy and a distributed solver strategy. For the transpose strategy, the data is moved so that a subset of all the RHS problems is solved on each of the several processors. This usually requires significant data movement between processor memories across a network. The second strategy attempts to have the algorithm allow the data across processor boundaries in a chained manner. This usually requires significantly less data movement. An approach to accomplish this second strategy in a near-perfect load-balanced manner is developed. In addition, an algorithm will be shown to directly transform a sequential Gaussian elimination type algorithm into the parallel chained, load-balanced algorithm.

  5. A Tool for Intersecting Context-Free Grammars and Its Applications

    NASA Technical Reports Server (NTRS)

    Gange, Graeme; Navas, Jorge A.; Schachte, Peter; Sondergaard, Harald; Stuckey, Peter J.

    2015-01-01

    This paper describes a tool for intersecting context-free grammars. Since this problem is undecidable the tool follows a refinement-based approach and implements a novel refinement which is complete for regularly separable grammars. We show its effectiveness for safety verification of recursive multi-threaded programs.

  6. A Novel Implementation of Massively Parallel Three Dimensional Monte Carlo Radiation Transport

    NASA Astrophysics Data System (ADS)

    Robinson, P. B.; Peterson, J. D. L.

    2005-12-01

    The goal of our summer project was to implement the difference formulation for radiation transport into Cosmos++, a multidimensional, massively parallel, magneto hydrodynamics code for astrophysical applications (Peter Anninos - AX). The difference formulation is a new method for Symbolic Implicit Monte Carlo thermal transport (Brooks and Szöke - PAT). Formerly, simultaneous implementation of fully implicit Monte Carlo radiation transport in multiple dimensions on multiple processors had not been convincingly demonstrated. We found that a combination of the difference formulation and the inherent structure of Cosmos++ makes such an implementation both accurate and straightforward. We developed a "nearly nearest neighbor physics" technique to allow each processor to work independently, even with a fully implicit code. This technique coupled with the increased accuracy of an implicit Monte Carlo solution and the efficiency of parallel computing systems allows us to demonstrate the possibility of massively parallel thermal transport. This work was performed under the auspices of the U.S. Department of Energy by University of California Lawrence Livermore National Laboratory under contract No. W-7405-Eng-48

  7. Bottom-up construction of artificial molecules for superconducting quantum processors

    NASA Astrophysics Data System (ADS)

    Poletto, Stefano; Rigetti, Chad; Gambetta, Jay M.; Merkel, Seth; Chow, Jerry M.; Corcoles, Antonio D.; Smolin, John A.; Rozen, Jim R.; Keefe, George A.; Rothwell, Mary B.; Ketchen, Mark B.; Steffen, Matthias

    2012-02-01

    Recent experiments on transmon qubits capacitively coupled to superconducting 3-dimensional cavities have shown coherence times much longer than transmons coupled to more traditional planar resonators. For the implementation of a quantum processor this approach has clear advantages over traditional techniques but it poses the challenge of scalability. We are currently implementing multi-qubits experiments based on a bottom-up scaling approach. First, transmon qubits are fabricated on individual chips and are independently characterized. Second, an artificial molecule is assembled by selecting a particular set of previously characterized single-transmon chips. We present recent data on a two-qubit artificial molecule constructed in this way. The two qubits are chosen to generate a strong Z-Z interaction by matching the 0-1 transition energy of one qubit with the 1-2 transition of the other. Single qubit manipulations and state tomography cannot be done with ``traditional'' single tone microwave pulses but instead specifically shaped pulses have to be simultaneously applied on both qubits. Coherence times, coupling strength, and optimal pulses for decoupling the two qubits and perform state tomography are presented

  8. Results of the Alternative Water Processor Test, A Novel Technology for Exploration Wastewater Remediation

    NASA Technical Reports Server (NTRS)

    Meyer, Caitlin E.; Pensinger, Stuart; Adam, Niklas; Pickering, Karen D.; Barta, Daniel; Shull, Sarah A.; Vega, Leticia M.; Lange, Kevin; Christenson, Dylan; Jackson, W. Andrew

    2016-01-01

    Biologically-based water recovery systems are a regenerative, low energy alternative to physiochemical processes to reclaim water from wastewater. This report summarizes the results of the Alternative Water Processor (AWP) Integrated Test, conducted from June 2013 until April 2014. The system was comprised of four (4) membrane aerated bioreactors (MABRs) to remove carbon and nitrogen from an exploration mission wastewater and a coupled forward and reverse osmosis system to remove large organic and inorganic salts from the biological system effluent. The system exceeded the overall objectives of the test by recovering 90% of the influent wastewater processed into a near potable state and a 64% reduction of consumables from the current state of the art water recovery system on the International Space Station (ISS). However, the biological system fell short of its test goals, failing to remove 75% and 90% of the influent ammonium and organic carbon, respectively. Despite not meeting its test goals, the BWP demonstrated the feasibility of an attached-growth biological system for simultaneous nitrification and denitrification, an innovative, volume- and consumable-saving design that does not require toxic pretreatment.

  9. The role of feedback mechanisms in historic channel changes of the lower Rio Grande in the Big Bend region

    NASA Astrophysics Data System (ADS)

    Dean, David J.; Schmidt, John C.

    2011-03-01

    Over the last century, large-scale water development of the upper Rio Grande in the U.S. and Mexico, and of the Rio Conchos in Mexico, has resulted in progressive channel narrowing of the lower Rio Grande in the Big Bend region. We used methods operating at multiple spatial and temporal scales to analyze the rate, magnitude, and processes responsible for channel narrowing. These methods included: hydrologic analysis of historic stream gage data, analysis of notes of measured discharges, historic oblique and aerial photograph analysis, and stratigraphic and dendrogeomorphic analysis of inset floodplain deposits. Our analyses indicate that frequent large floods between 1900 and the mid-1940s acted as a negative feedback mechanism and maintained a wide, sandy, multi-threaded river. Declines in mean and peak flow in the mid-1940s resulted in progressive channel narrowing. Channel narrowing has been temporarily interrupted by occasional large floods that widened the channel, however, channel narrowing has always resumed. After large floods in 1990 and 1991, the active channel width of the lower Rio Grande has narrowed by 36-52%. Narrowing has occurred by the vertical accretion of fine-grained deposits on top of sand and gravel bars, inset within natural levees. Channel narrowing by vertical accretion occurred simultaneously with a rapid invasion of non-native riparian vegetation ( Tamarix spp., Arundo donax) which created a positive feedback and exacerbated the processes of channel narrowing and vertical accretion. In two floodplain trenches, we measured 2.75 and 3.5 m of vertical accretion between 1993 and 2008. In some localities, nearly 90% of bare, active channel bars were converted to vegetated floodplain during the same period. Upward shifts of stage-discharge relations occurred resulting in over-bank flooding at lower discharges, and continued vertical accretion despite a progressive reduction in stream flow. Thus, although the magnitude of the average annual flood was reduced between 40 and 50%, over-bank flooding continued. These changes reflect a shift in the geomorphic nature of the Rio Grande from a wide, laterally unstable, multi-thread river, to a laterally stable, single-thread channel with cohesive, vertical banks, and few active in-channel bars.

  10. Simultaneous multi-beam planar array IR (pair) spectroscopy

    DOEpatents

    Elmore, Douglas L.; Rabolt, John F.; Tsao, Mei-Wei

    2005-09-13

    An apparatus and method capable of providing spatially multiplexed IR spectral information simultaneously in real-time for multiple samples or multiple spatial areas of one sample using IR absorption phenomena requires no moving parts or Fourier Transform during operation, and self-compensates for background spectra and degradation of component performance over time. IR spectral information and chemical analysis of the samples is determined by using one or more IR sources, sampling accessories for positioning the samples, optically dispersive elements, a focal plane array (FPA) arranged to detect the dispersed light beams, and a processor and display to control the FPA, and display an IR spectrograph. Fiber-optic coupling can be used to allow remote sensing. Portability, reliability, and ruggedness is enhanced due to the no-moving part construction. Applications include determining time-resolved orientation and characteristics of materials, including polymer monolayers. Orthogonal polarizers may be used to determine certain material characteristics.

  11. x-y curvature wavefront sensor.

    PubMed

    Cagigal, Manuel P; Valle, Pedro J

    2015-04-15

    In this Letter, we propose a new curvature wavefront sensor based on the principles of optical differentiation. The theoretically modeled setup consists of a diffractive optical mask placed at the intermediate plane of a classical two-lens coherent optical processor. The resulting image is composed of a number of local derivatives of the entrance pupil function whose proper combination provides the wavefront curvature. In contrast to the common radial curvature sensors, this one is able to provide the x and y wavefront curvature maps simultaneously. The sensor offers other additional advantages like having high spatial resolution, adjustable dynamic range, and not being sensitive to misalignment.

  12. Multiple-access phased array antenna simulator for a digital beam-forming system investigation

    NASA Technical Reports Server (NTRS)

    Kerczewski, Robert J.; Yu, John; Walton, Joanne C.; Perl, Thomas D.; Andro, Monty; Alexovich, Robert E.

    1992-01-01

    Future versions of data relay satellite systems are currently being planned by NASA. Being given consideration for implementation are on-board digital beamforming techniques which will allow multiple users to simultaneously access a single S-band phased array antenna system. To investigate the potential performance of such a system, a laboratory simulator has been developed at NASA's Lewis Research Center. This paper describes the system simulator, and in particular, the requirements, design and performance of a key subsystem, the phased array antenna simulator, which provides realistic inputs to the digital processor including multiple signals, noise, and nonlinearities.

  13. Multiple-access phased array antenna simulator for a digital beam forming system investigation

    NASA Technical Reports Server (NTRS)

    Kerczewski, Robert J.; Yu, John; Walton, Joanne C.; Perl, Thomas D.; Andro, Monty; Alexovich, Robert E.

    1992-01-01

    Future versions of data relay satellite systems are currently being planned by NASA. Being given consideration for implementation are on-board digital beamforming techniques which will allow multiple users to simultaneously access a single S-band phased array antenna system. To investigate the potential performance of such a system, a laboratory simulator has been developed at NASA's Lewis Research Center. This paper describes the system simulator, and in particular, the requirements, design, and performance of a key subsystem, the phased array antenna simulator, which provides realistic inputs to the digital processor including multiple signals, noise, and nonlinearities.

  14. Benchmark and Framework for Encouraging Research on Multi-Threaded Testing Tools

    NASA Technical Reports Server (NTRS)

    Havelund, Klaus; Stoller, Scott D.; Ur, Shmuel

    2003-01-01

    A problem that has been getting prominence in testing is that of looking for intermittent bugs. Multi-threaded code is becoming very common, mostly on the server side. As there is no silver bullet solution, research focuses on a variety of partial solutions. In this paper (invited by PADTAD 2003) we outline a proposed project to facilitate research. The project goals are as follows. The first goal is to create a benchmark that can be used to evaluate different solutions. The benchmark, apart from containing programs with documented bugs, will include other artifacts, such as traces, that are useful for evaluating some of the technologies. The second goal is to create a set of tools with open API s that can be used to check ideas without building a large system. For example an instrumentor will be available, that could be used to test temporal noise making heuristics. The third goal is to create a focus for the research in this area around which a community of people who try to solve similar problems with different techniques, could congregate.

  15. On Parallel Push-Relabel based Algorithms for Bipartite Maximum Matching

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Langguth, Johannes; Azad, Md Ariful; Halappanavar, Mahantesh

    2014-07-01

    We study multithreaded push-relabel based algorithms for computing maximum cardinality matching in bipartite graphs. Matching is a fundamental combinatorial (graph) problem with applications in a wide variety of problems in science and engineering. We are motivated by its use in the context of sparse linear solvers for computing maximum transversal of a matrix. We implement and test our algorithms on several multi-socket multicore systems and compare their performance to state-of-the-art augmenting path-based serial and parallel algorithms using a testset comprised of a wide range of real-world instances. Building on several heuristics for enhancing performance, we demonstrate good scaling for themore » parallel push-relabel algorithm. We show that it is comparable to the best augmenting path-based algorithms for bipartite matching. To the best of our knowledge, this is the first extensive study of multithreaded push-relabel based algorithms. In addition to a direct impact on the applications using matching, the proposed algorithmic techniques can be extended to preflow-push based algorithms for computing maximum flow in graphs.« less

  16. AN MHD AVALANCHE IN A MULTI-THREADED CORONAL LOOP

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Hood, A. W.; Cargill, P. J.; Tam, K. V.

    For the first time, we demonstrate how an MHD avalanche might occur in a multithreaded coronal loop. Considering 23 non-potential magnetic threads within a loop, we use 3D MHD simulations to show that only one thread needs to be unstable in order to start an avalanche even when the others are below marginal stability. This has significant implications for coronal heating in that it provides for energy dissipation with a trigger mechanism. The instability of the unstable thread follows the evolution determined in many earlier investigations. However, once one stable thread is disrupted, it coalesces with a neighboring thread andmore » this process disrupts other nearby threads. Coalescence with these disrupted threads then occurs leading to the disruption of yet more threads as the avalanche develops. Magnetic energy is released in discrete bursts as the surrounding stable threads are disrupted. The volume integrated heating, as a function of time, shows short spikes suggesting that the temporal form of the heating is more like that of nanoflares than of constant heating.« less

  17. Parallelization strategies for continuum-generalized method of moments on the multi-thread systems

    NASA Astrophysics Data System (ADS)

    Bustamam, A.; Handhika, T.; Ernastuti, Kerami, D.

    2017-07-01

    Continuum-Generalized Method of Moments (C-GMM) covers the Generalized Method of Moments (GMM) shortfall which is not as efficient as Maximum Likelihood estimator by using the continuum set of moment conditions in a GMM framework. However, this computation would take a very long time since optimizing regularization parameter. Unfortunately, these calculations are processed sequentially whereas in fact all modern computers are now supported by hierarchical memory systems and hyperthreading technology, which allowing for parallel computing. This paper aims to speed up the calculation process of C-GMM by designing a parallel algorithm for C-GMM on the multi-thread systems. First, parallel regions are detected for the original C-GMM algorithm. There are two parallel regions in the original C-GMM algorithm, that are contributed significantly to the reduction of computational time: the outer-loop and the inner-loop. Furthermore, this parallel algorithm will be implemented with standard shared-memory application programming interface, i.e. Open Multi-Processing (OpenMP). The experiment shows that the outer-loop parallelization is the best strategy for any number of observations.

  18. Observation and modelling of the Fe XXI line profile observed by IRIS during the impulsive phase of flares

    NASA Astrophysics Data System (ADS)

    Polito, V.; Testa, P.; De Pontieu, B.; Allred, J. C.

    2017-12-01

    The observation of the high temperature (above 10 MK) Fe XXI 1354.1 A line with the Interface Region Imaging Spectrograph (IRIS) has provided significant insights into the chromospheric evaporation process in flares. In particular, the line is often observed to be completely blueshifted, in contrast to previous observations at lower spatial and spectral resolution, and in agreement with predictions from theoretical models. Interestingly, the line is also observed to be mostly symmetric and with a large excess above the thermal width. One popular interpretation for the excess broadening is given by assuming a superposition of flows from different loop strands. In this work, we perform a statistical analysis of Fe XXI line profiles observed by IRIS during the impulsive phase of flares and compare our results with hydrodynamic simulations of multi-thread flare loops performed with the 1D RADYN code. Our results indicate that the multi-thread models cannot easily reproduce the symmetry of the line and that some other physical process might need to be invoked in order to explain the observed profiles.

  19. Evolution of the ATLAS Software Framework towards Concurrency

    NASA Astrophysics Data System (ADS)

    Jones, R. W. L.; Stewart, G. A.; Leggett, C.; Wynne, B. M.

    2015-05-01

    The ATLAS experiment has successfully used its Gaudi/Athena software framework for data taking and analysis during the first LHC run, with billions of events successfully processed. However, the design of Gaudi/Athena dates from early 2000 and the software and the physics code has been written using a single threaded, serial design. This programming model has increasing difficulty in exploiting the potential of current CPUs, which offer their best performance only through taking full advantage of multiple cores and wide vector registers. Future CPU evolution will intensify this trend, with core counts increasing and memory per core falling. Maximising performance per watt will be a key metric, so all of these cores must be used as efficiently as possible. In order to address the deficiencies of the current framework, ATLAS has embarked upon two projects: first, a practical demonstration of the use of multi-threading in our reconstruction software, using the GaudiHive framework; second, an exercise to gather requirements for an updated framework, going back to the first principles of how event processing occurs. In this paper we report on both these aspects of our work. For the hive based demonstrators, we discuss what changes were necessary in order to allow the serially designed ATLAS code to run, both to the framework and to the tools and algorithms used. We report on what general lessons were learned about the code patterns that had been employed in the software and which patterns were identified as particularly problematic for multi-threading. These lessons were fed into our considerations of a new framework and we present preliminary conclusions on this work. In particular we identify areas where the framework can be simplified in order to aid the implementation of a concurrent event processing scheme. Finally, we discuss the practical difficulties involved in migrating a large established code base to a multi-threaded framework and how this can be achieved for LHC Run 3.

  20. Efficiency of static core turn-off in a system-on-a-chip with variation

    DOEpatents

    Cher, Chen-Yong; Coteus, Paul W; Gara, Alan; Kursun, Eren; Paulsen, David P; Schuelke, Brian A; Sheets, II, John E; Tian, Shurong

    2013-10-29

    A processor-implemented method for improving efficiency of a static core turn-off in a multi-core processor with variation, the method comprising: conducting via a simulation a turn-off analysis of the multi-core processor at the multi-core processor's design stage, wherein the turn-off analysis of the multi-core processor at the multi-core processor's design stage includes a first output corresponding to a first multi-core processor core to turn off; conducting a turn-off analysis of the multi-core processor at the multi-core processor's testing stage, wherein the turn-off analysis of the multi-core processor at the multi-core processor's testing stage includes a second output corresponding to a second multi-core processor core to turn off; comparing the first output and the second output to determine if the first output is referring to the same core to turn off as the second output; outputting a third output corresponding to the first multi-core processor core if the first output and the second output are both referring to the same core to turn off.

  1. Millimeter-wave passive ultra-compact imaging technology for synthetic vision & mobile platforms

    NASA Technical Reports Server (NTRS)

    Olsen, Randall

    1996-01-01

    Substantial technical progress was made on all of the three high-risk subsystems of this program. The subsystems include dielectric antenna, G-band receiver, and electro-optic image processor. Progress is approximately on-schedule for both the receiver and the electro-optic processor development, while greater than anticipated challenges have been discovered in the dielectric antenna development. Much of the information in this report was covered in greater detail in the One-Year Review Meeting held at TTC on 22 February 1996. The performance goals of the dielectric antenna project are: Scan Angle -- 20 deg. desired; Loss -- 6 dB end to end (3 dB average); Frequency -- 206-218 GHz (6% bandwidth); Beam width -- 0.25 deg.; and Length -- 12 inches. The scan angle requirement was chosen to satisfy the needs of aircraft pilots. This requirement, coupled with the presently limited bandwidth processors (1 GHz state-of-the-art and 12 GHz in development in this program) forces the antenna to be dielectric (high scan angle air-filled waveguide-based antennas would be too lossy and their performance would vary too much as a function of frequency). A high dielectric constant (e.g., 10) was initially chosen for the dielectric material. This choice lead to the following fabrication challenges: total thickness variation (TTV) tolerance is 1 micrometer; coupler spacing tolerance is 1 micrometer; width tolerance is larger, but unknown, and the surfaces must have mirror finish. Also of importance is the difficulty in obtaining raw materials that satisfy the overall length requirement of 12 inches while simultaneously satisfying the above specifications.

  2. Calibrating thermal behavior of electronics

    DOEpatents

    Chainer, Timothy J.; Parida, Pritish R.; Schultz, Mark D.

    2017-07-11

    A method includes determining a relationship between indirect thermal data for a processor and a measured temperature associated with the processor, during a calibration process, obtaining the indirect thermal data for the processor during actual operation of the processor, and determining an actual significant temperature associated with the processor during the actual operation using the indirect thermal data for the processor during actual operation of the processor and the relationship.

  3. Calibrating thermal behavior of electronics

    DOEpatents

    Chainer, Timothy J.; Parida, Pritish R.; Schultz, Mark D.

    2016-05-31

    A method includes determining a relationship between indirect thermal data for a processor and a measured temperature associated with the processor, during a calibration process, obtaining the indirect thermal data for the processor during actual operation of the processor, and determining an actual significant temperature associated with the processor during the actual operation using the indirect thermal data for the processor during actual operation of the processor and the relationship.

  4. Calibrating thermal behavior of electronics

    DOEpatents

    Chainer, Timothy J.; Parida, Pritish R.; Schultz, Mark D.

    2017-01-03

    A method includes determining a relationship between indirect thermal data for a processor and a measured temperature associated with the processor, during a calibration process, obtaining the indirect thermal data for the processor during actual operation of the processor, and determining an actual significant temperature associated with the processor during the actual operation using the indirect thermal data for the processor during actual operation of the processor and the relationship.

  5. Real-time calibration-free C-scan images of the eye fundus using Master Slave swept source optical coherence tomography

    NASA Astrophysics Data System (ADS)

    Bradu, Adrian; Kapinchev, Konstantin; Barnes, Fred; Garway-Heath, David F.; Rajendram, Ranjan; Keane, Pearce; Podoleanu, Adrian G.

    2015-03-01

    Recently, we introduced a novel Optical Coherence Tomography (OCT) method, termed as Master Slave OCT (MS-OCT), specialized for delivering en-face images. This method uses principles of spectral domain interfereometry in two stages. MS-OCT operates like a time domain OCT, selecting only signals from a chosen depth only while scanning the laser beam across the eye. Time domain OCT allows real time production of an en-face image, although relatively slowly. As a major advance, the Master Slave method allows collection of signals from any number of depths, as required by the user. The tremendous advantage in terms of parallel provision of data from numerous depths could not be fully employed by using multi core processors only. The data processing required to generate images at multiple depths simultaneously is not achievable with commodity multicore processors only. We compare here the major improvement in processing and display, brought about by using graphic cards. We demonstrate images obtained with a swept source at 100 kHz (which determines an acquisition time [Ta] for a frame of 200×200 pixels2 of Ta =1.6 s). By the end of the acquired frame being scanned, using our computing capacity, 4 simultaneous en-face images could be created in T = 0.8 s. We demonstrate that by using graphic cards, 32 en-face images can be displayed in Td 0.3 s. Other faster swept source engines can be used with no difference in terms of Td. With 32 images (or more), volumes can be created for 3D display, using en-face images, as opposed to the current technology where volumes are created using cross section OCT images.

  6. The Two Micron All Sky Survey

    NASA Astrophysics Data System (ADS)

    Lonsdale, Carol

    The 2 Micron All Sky Survey (2MASS) project, a collaboration between the University of Massachusetts (Dr. Mike Skrutskie, PI) and the Infrared Processing and Analysis Center, JPL/Caltech funded primarily by NASA and the NSF, will scan the entire sky utilizing two new, highly automated 1.3m telescopes at Mt. Hopkins, AZ and at CTIO, Chile. Each telescope simultaneously scans the sky at J, H and Ks with a three channel camera using 256x256 arrays of HgCdTe detectors to detect point sources brighter than about 1 mJy (to SNR=10), with a pixel size of 2.0 arcseconds. The data rate is $\\sim 19$ Gbyte per night, with a total processed data volume of 13 Tbytes of images and 0.5 Tbyte of tabular data. The 2MASS data is archived nightly into the Infrared Science Information System at IPAC, which is based on an Informix database engine, judged at the time of purchase to have the best commercially available indexing and parallelization flexibility, and a 5 Tbyte-capacity RAID multi-threaded disk system with multi-server shared disk architecture. I will discuss the challenges of processing and archiving the 2MASS data, and of supporting intelligent query access to them by the astronomical community across the net, including possibilities for cross-correlation with other remote data sets.

  7. Naval Research Laboratory Fact Book 2012

    DTIC Science & Technology

    2012-11-01

    Distributed network-based battle management High performance computing supporting uniform and nonuniform memory access with single and multithreaded...hyperspectral systems VNIR, MWIR, and LWIR high-resolution systems Wideband SAR systems RF and laser data links High-speed, high-power...hyperspectral imaging system Long-wave infrared ( LWIR ) quantum well IR photodetector (QWIP) imaging system Research and Development Services Divi- sion

  8. NRL Fact Book

    DTIC Science & Technology

    2008-01-01

    Distributed network-based battle management High performance computing supporting uniform and nonuniform memory access with single and multithreaded...pallet Airborne EO/IR and radar sensors VNIR through SWIR hyperspectral systems VNIR, MWIR, and LWIR high-resolution sys- tems Wideband SAR systems...meteorological sensors Hyperspectral sensor systems (PHILLS) Mid-wave infrared (MWIR) Indium Antimonide (InSb) imaging system Long-wave infrared ( LWIR

  9. Merced

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Hedstrom, Gerald; Beck, Bret; Mattoon, Caleb

    2016-10-01

    Merced performs a multi-dimensional integral tl generate so-called 'transfer matrices' for use in deterministic radiation transport applications. It produces transfer matrices on the user-defind energy grid. The angular dependence of outgoing products is captured in a Legendre expansion, up to a user-specified maximun Legendre order. Merced calculations can use multi-threading for enhanced performance on a single compute node.

  10. Implementation of Tree and Butterfly Barriers with Optimistic Time Management Algorithms for Discrete Event Simulation

    NASA Astrophysics Data System (ADS)

    Rizvi, Syed S.; Shah, Dipali; Riasat, Aasia

    The Time Wrap algorithm [3] offers a run time recovery mechanism that deals with the causality errors. These run time recovery mechanisms consists of rollback, anti-message, and Global Virtual Time (GVT) techniques. For rollback, there is a need to compute GVT which is used in discrete-event simulation to reclaim the memory, commit the output, detect the termination, and handle the errors. However, the computation of GVT requires dealing with transient message problem and the simultaneous reporting problem. These problems can be dealt in an efficient manner by the Samadi's algorithm [8] which works fine in the presence of causality errors. However, the performance of both Time Wrap and Samadi's algorithms depends on the latency involve in GVT computation. Both algorithms give poor latency for large simulation systems especially in the presence of causality errors. To improve the latency and reduce the processor ideal time, we implement tree and butterflies barriers with the optimistic algorithm. Our analysis shows that the use of synchronous barriers such as tree and butterfly with the optimistic algorithm not only minimizes the GVT latency but also minimizes the processor idle time.

  11. Effective correlator for RadioAstron project

    NASA Astrophysics Data System (ADS)

    Sergeev, Sergey

    This paper presents the implementation of programme FX-correlator for Very Long Baseline Interferometry, adapted for the project "RadioAstron". Software correlator implemented for heterogeneous computing systems using graphics accelerators. It is shown that for the task interferometry implementation of the graphics hardware has a high efficiency. The host processor of heterogeneous computing system, performs the function of forming the data flow for graphics accelerators, the number of which corresponds to the number of frequency channels. So, for the Radioastron project, such channels is seven. Each accelerator is perform correlation matrix for all bases for a single frequency channel. Initial data is converted to the floating-point format, is correction for the corresponding delay function and computes the entire correlation matrix simultaneously. Calculation of the correlation matrix is performed using the sliding Fourier transform. Thus, thanks to the compliance of a solved problem for architecture graphics accelerators, managed to get a performance for one processor platform Kepler, which corresponds to the performance of this task, the computing cluster platforms Intel on four nodes. This task successfully scaled not only on a large number of graphics accelerators, but also on a large number of nodes with multiple accelerators.

  12. Accelerating the Gillespie Exact Stochastic Simulation Algorithm using hybrid parallel execution on graphics processing units.

    PubMed

    Komarov, Ivan; D'Souza, Roshan M

    2012-01-01

    The Gillespie Stochastic Simulation Algorithm (GSSA) and its variants are cornerstone techniques to simulate reaction kinetics in situations where the concentration of the reactant is too low to allow deterministic techniques such as differential equations. The inherent limitations of the GSSA include the time required for executing a single run and the need for multiple runs for parameter sweep exercises due to the stochastic nature of the simulation. Even very efficient variants of GSSA are prohibitively expensive to compute and perform parameter sweeps. Here we present a novel variant of the exact GSSA that is amenable to acceleration by using graphics processing units (GPUs). We parallelize the execution of a single realization across threads in a warp (fine-grained parallelism). A warp is a collection of threads that are executed synchronously on a single multi-processor. Warps executing in parallel on different multi-processors (coarse-grained parallelism) simultaneously generate multiple trajectories. Novel data-structures and algorithms reduce memory traffic, which is the bottleneck in computing the GSSA. Our benchmarks show an 8×-120× performance gain over various state-of-the-art serial algorithms when simulating different types of models.

  13. CMS Readiness for Multi-Core Workload Scheduling

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Perez-Calero Yzquierdo, A.; Balcas, J.; Hernandez, J.

    In the present run of the LHC, CMS data reconstruction and simulation algorithms benefit greatly from being executed as multiple threads running on several processor cores. The complexity of the Run 2 events requires parallelization of the code to reduce the memory-per- core footprint constraining serial execution programs, thus optimizing the exploitation of present multi-core processor architectures. The allocation of computing resources for multi-core tasks, however, becomes a complex problem in itself. The CMS workload submission infrastructure employs multi-slot partitionable pilots, built on HTCondor and GlideinWMS native features, to enable scheduling of single and multi-core jobs simultaneously. This provides amore » solution for the scheduling problem in a uniform way across grid sites running a diversity of gateways to compute resources and batch system technologies. This paper presents this strategy and the tools on which it has been implemented. The experience of managing multi-core resources at the Tier-0 and Tier-1 sites during 2015, along with the deployment phase to Tier-2 sites during early 2016 is reported. The process of performance monitoring and optimization to achieve efficient and flexible use of the resources is also described.« less

  14. System for clinical photometric stereo endoscopy

    NASA Astrophysics Data System (ADS)

    Durr, Nicholas J.; González, Germán.; Lim, Daryl; Traverso, Giovanni; Nishioka, Norman S.; Vakoc, Benjamin J.; Parot, Vicente

    2014-02-01

    Photometric stereo endoscopy is a technique that captures information about the high-spatial-frequency topography of the field of view simultaneously with a conventional color image. Here we describe a system that will enable photometric stereo endoscopy to be clinically evaluated in the large intestine of human patients. The clinical photometric stereo endoscopy system consists of a commercial gastroscope, a commercial video processor, an image capturing and processing unit, custom synchronization electronics, white light LEDs, a set of four fibers with diffusing tips, and an alignment cap. The custom pieces that come into contact with the patient are composed of biocompatible materials that can be sterilized before use. The components can then be assembled in the endoscopy suite before use. The resulting endoscope has the same outer diameter as a conventional colonoscope (14 mm), plugs into a commercial video processor, captures topography and color images at 15 Hz, and displays the conventional color image to the gastroenterologist in real-time. We show that this system can capture a color and topographical video in a tubular colon phantom, demonstrating robustness to complex geometries and motion. The reported system is suitable for in vivo evaluation of photometric stereo endoscopy in the human large intestine.

  15. Novel memory architecture for video signal processor

    NASA Astrophysics Data System (ADS)

    Hung, Jen-Sheng; Lin, Chia-Hsing; Jen, Chein-Wei

    1993-11-01

    An on-chip memory architecture for video signal processor (VSP) is proposed. This memory structure is a two-level design for the different data locality in video applications. The upper level--Memory A provides enough storage capacity to reduce the impact on the limitation of chip I/O bandwidth, and the lower level--Memory B provides enough data parallelism and flexibility to meet the requirements of multiple reconfigurable pipeline function units in a single VSP chip. The needed memory size is decided by the memory usage analysis for video algorithms and the number of function units. Both levels of memory adopted a dual-port memory scheme to sustain the simultaneous read and write operations. Especially, Memory B uses multiple one-read-one-write memory banks to emulate the real multiport memory. Therefore, one can change the configuration of Memory B to several sets of memories with variable read/write ports by adjusting the bus switches. Then the numbers of read ports and write ports in proposed memory can meet requirement of data flow patterns in different video coding algorithms. We have finished the design of a prototype memory design using 1.2- micrometers SPDM SRAM technology and will fabricated it through TSMC, in Taiwan.

  16. Very low cost real time histogram-based contrast enhancer utilizing fixed-point DSP processing

    NASA Astrophysics Data System (ADS)

    McCaffrey, Nathaniel J.; Pantuso, Francis P.

    1998-03-01

    A real time contrast enhancement system utilizing histogram- based algorithms has been developed to operate on standard composite video signals. This low-cost DSP based system is designed with fixed-point algorithms and an off-chip look up table (LUT) to reduce the cost considerably over other contemporary approaches. This paper describes several real- time contrast enhancing systems advanced at the Sarnoff Corporation for high-speed visible and infrared cameras. The fixed-point enhancer was derived from these high performance cameras. The enhancer digitizes analog video and spatially subsamples the stream to qualify the scene's luminance. Simultaneously, the video is streamed through a LUT that has been programmed with the previous calculation. Reducing division operations by subsampling reduces calculation- cycles and also allows the processor to be used with cameras of nominal resolutions. All values are written to the LUT during blanking so no frames are lost. The enhancer measures 13 cm X 6.4 cm X 3.2 cm, operates off 9 VAC and consumes 12 W. This processor is small and inexpensive enough to be mounted with field deployed security cameras and can be used for surveillance, video forensics and real- time medical imaging.

  17. CMS readiness for multi-core workload scheduling

    NASA Astrophysics Data System (ADS)

    Perez-Calero Yzquierdo, A.; Balcas, J.; Hernandez, J.; Aftab Khan, F.; Letts, J.; Mason, D.; Verguilov, V.

    2017-10-01

    In the present run of the LHC, CMS data reconstruction and simulation algorithms benefit greatly from being executed as multiple threads running on several processor cores. The complexity of the Run 2 events requires parallelization of the code to reduce the memory-per- core footprint constraining serial execution programs, thus optimizing the exploitation of present multi-core processor architectures. The allocation of computing resources for multi-core tasks, however, becomes a complex problem in itself. The CMS workload submission infrastructure employs multi-slot partitionable pilots, built on HTCondor and GlideinWMS native features, to enable scheduling of single and multi-core jobs simultaneously. This provides a solution for the scheduling problem in a uniform way across grid sites running a diversity of gateways to compute resources and batch system technologies. This paper presents this strategy and the tools on which it has been implemented. The experience of managing multi-core resources at the Tier-0 and Tier-1 sites during 2015, along with the deployment phase to Tier-2 sites during early 2016 is reported. The process of performance monitoring and optimization to achieve efficient and flexible use of the resources is also described.

  18. Stereo and IMU-Assisted Visual Odometry for Small Robots

    NASA Technical Reports Server (NTRS)

    2012-01-01

    This software performs two functions: (1) taking stereo image pairs as input, it computes stereo disparity maps from them by cross-correlation to achieve 3D (three-dimensional) perception; (2) taking a sequence of stereo image pairs as input, it tracks features in the image sequence to estimate the motion of the cameras between successive image pairs. A real-time stereo vision system with IMU (inertial measurement unit)-assisted visual odometry was implemented on a single 750 MHz/520 MHz OMAP3530 SoC (system on chip) from TI (Texas Instruments). Frame rates of 46 fps (frames per second) were achieved at QVGA (Quarter Video Graphics Array i.e. 320 240), or 8 fps at VGA (Video Graphics Array 640 480) resolutions, while simultaneously tracking up to 200 features, taking full advantage of the OMAP3530's integer DSP (digital signal processor) and floating point ARM processors. This is a substantial advancement over previous work as the stereo implementation produces 146 Mde/s (millions of disparities evaluated per second) in 2.5W, yielding a stereo energy efficiency of 58.8 Mde/J, which is 3.75 better than prior DSP stereo while providing more functionality.

  19. Methods and systems for providing reconfigurable and recoverable computing resources

    NASA Technical Reports Server (NTRS)

    Stange, Kent (Inventor); Hess, Richard (Inventor); Kelley, Gerald B (Inventor); Rogers, Randy (Inventor)

    2010-01-01

    A method for optimizing the use of digital computing resources to achieve reliability and availability of the computing resources is disclosed. The method comprises providing one or more processors with a recovery mechanism, the one or more processors executing one or more applications. A determination is made whether the one or more processors needs to be reconfigured. A rapid recovery is employed to reconfigure the one or more processors when needed. A computing system that provides reconfigurable and recoverable computing resources is also disclosed. The system comprises one or more processors with a recovery mechanism, with the one or more processors configured to execute a first application, and an additional processor configured to execute a second application different than the first application. The additional processor is reconfigurable with rapid recovery such that the additional processor can execute the first application when one of the one more processors fails.

  20. Rectangular Array Of Digital Processors For Planning Paths

    NASA Technical Reports Server (NTRS)

    Kemeny, Sabrina E.; Fossum, Eric R.; Nixon, Robert H.

    1993-01-01

    Prototype 24 x 25 rectangular array of asynchronous parallel digital processors rapidly finds best path across two-dimensional field, which could be patch of terrain traversed by robotic or military vehicle. Implemented as single-chip very-large-scale integrated circuit. Excepting processors on edges, each processor communicates with four nearest neighbors along paths representing travel to north, south, east, and west. Each processor contains delay generator in form of 8-bit ripple counter, preset to 1 of 256 possible values. Operation begins with choice of processor representing starting point. Transmits signals to nearest neighbor processors, which retransmits to other neighboring processors, and process repeats until signals propagated across entire field.

  1. Buffered coscheduling for parallel programming and enhanced fault tolerance

    DOEpatents

    Petrini, Fabrizio [Los Alamos, NM; Feng, Wu-chun [Los Alamos, NM

    2006-01-31

    A computer implemented method schedules processor jobs on a network of parallel machine processors or distributed system processors. Control information communications generated by each process performed by each processor during a defined time interval is accumulated in buffers, where adjacent time intervals are separated by strobe intervals for a global exchange of control information. A global exchange of the control information communications at the end of each defined time interval is performed during an intervening strobe interval so that each processor is informed by all of the other processors of the number of incoming jobs to be received by each processor in a subsequent time interval. The buffered coscheduling method of this invention also enhances the fault tolerance of a network of parallel machine processors or distributed system processors

  2. Staging memory for massively parallel processor

    NASA Technical Reports Server (NTRS)

    Batcher, Kenneth E. (Inventor)

    1988-01-01

    The invention herein relates to a computer organization capable of rapidly processing extremely large volumes of data. A staging memory is provided having a main stager portion consisting of a large number of memory banks which are accessed in parallel to receive, store, and transfer data words simultaneous with each other. Substager portions interconnect with the main stager portion to match input and output data formats with the data format of the main stager portion. An address generator is coded for accessing the data banks for receiving or transferring the appropriate words. Input and output permutation networks arrange the lineal order of data into and out of the memory banks.

  3. A high performance load balance strategy for real-time multicore systems.

    PubMed

    Cho, Keng-Mao; Tsai, Chun-Wei; Chiu, Yi-Shiuan; Yang, Chu-Sing

    2014-01-01

    Finding ways to distribute workloads to each processor core and efficiently reduce power consumption is of vital importance, especially for real-time systems. In this paper, a novel scheduling algorithm is proposed for real-time multicore systems to balance the computation loads and save power. The developed algorithm simultaneously considers multiple criteria, a novel factor, and task deadline, and is called power and deadline-aware multicore scheduling (PDAMS). Experiment results show that the proposed algorithm can greatly reduce energy consumption by up to 54.2% and the deadline times missed, as compared to the other scheduling algorithms outlined in this paper.

  4. Opto-VLSI-based photonic true-time delay architecture for broadband adaptive nulling in phased array antennas.

    PubMed

    Juswardy, Budi; Xiao, Feng; Alameh, Kamal

    2009-03-16

    This paper proposes a novel Opto-VLSI-based tunable true-time delay generation unit for adaptively steering the nulls of microwave phased array antennas. Arbitrary single or multiple true-time delays can simultaneously be synthesized for each antenna element by slicing an RF-modulated broadband optical source and routing specific sliced wavebands through an Opto-VLSI processor to a high-dispersion fiber. Experimental results are presented, which demonstrate the principle of the true-time delay unit through the generation of 5 arbitrary true-time delays of up to 2.5 ns each. (c) 2009 Optical Society of America

  5. Failure detection and identification

    NASA Technical Reports Server (NTRS)

    Massoumnia, Mohammad-Ali; Verghese, George C.; Willsky, Alan S.

    1989-01-01

    Using the geometric concept of an unobservability subspace, a solution is given to the problem of detecting and identifying control system component failures in linear, time-invariant systems. Conditions are developed for the existence of a causal, linear, time-invariant processor that can detect and uniquely identify a component failure, first for the case where components can fail simultaneously, and then for the case where they fail only one at a time. Explicit design algorithms are provided when these conditions are satisfied. In addition to time-domain solvability conditions, frequency-domain interpretations of the results are given, and connections are drawn with results already available in the literature.

  6. A High Performance Load Balance Strategy for Real-Time Multicore Systems

    PubMed Central

    Cho, Keng-Mao; Tsai, Chun-Wei; Chiu, Yi-Shiuan; Yang, Chu-Sing

    2014-01-01

    Finding ways to distribute workloads to each processor core and efficiently reduce power consumption is of vital importance, especially for real-time systems. In this paper, a novel scheduling algorithm is proposed for real-time multicore systems to balance the computation loads and save power. The developed algorithm simultaneously considers multiple criteria, a novel factor, and task deadline, and is called power and deadline-aware multicore scheduling (PDAMS). Experiment results show that the proposed algorithm can greatly reduce energy consumption by up to 54.2% and the deadline times missed, as compared to the other scheduling algorithms outlined in this paper. PMID:24955382

  7. Hardware/Software Issues for Video Guidance Systems: The Coreco Frame Grabber

    NASA Technical Reports Server (NTRS)

    Bales, John W.

    1996-01-01

    The F64 frame grabber is a high performance video image acquisition and processing board utilizing the TMS320C40 and TMS34020 processors. The hardware is designed for the ISA 16 bit bus and supports multiple digital or analog cameras. It has an acquisition rate of 40 million pixels per second, with a variable sampling frequency of 510 kHz to MO MHz. The board has a 4MB frame buffer memory expandable to 32 MB, and has a simultaneous acquisition and processing capability. It supports both VGA and RGB displays, and accepts all analog and digital video input standards.

  8. Automatic analysis of stereoscopic satellite image pairs for determination of cloud-top height and structure

    NASA Technical Reports Server (NTRS)

    Hasler, A. F.; Strong, J.; Woodward, R. H.; Pierce, H.

    1991-01-01

    Results are presented on an automatic stereo analysis of cloud-top heights from nearly simultaneous satellite image pairs from the GOES and NOAA satellites, using a massively parallel processor computer. Comparisons of computer-derived height fields and manually analyzed fields show that the automatic analysis technique shows promise for performing routine stereo analysis in a real-time environment, providing a useful forecasting tool by augmenting observational data sets of severe thunderstorms and hurricanes. Simulations using synthetic stereo data show that it is possible to automatically resolve small-scale features such as 4000-m-diam clouds to about 1500 m in the vertical.

  9. FPGA implementation of ICA algorithm for blind signal separation and adaptive noise canceling.

    PubMed

    Kim, Chang-Min; Park, Hyung-Min; Kim, Taesu; Choi, Yoon-Kyung; Lee, Soo-Young

    2003-01-01

    An field programmable gate array (FPGA) implementation of independent component analysis (ICA) algorithm is reported for blind signal separation (BSS) and adaptive noise canceling (ANC) in real time. In order to provide enormous computing power for ICA-based algorithms with multipath reverberation, a special digital processor is designed and implemented in FPGA. The chip design fully utilizes modular concept and several chips may be put together for complex applications with a large number of noise sources. Experimental results with a fabricated test board are reported for ANC only, BSS only, and simultaneous ANC/BSS, which demonstrates successful speech enhancement in real environments in real time.

  10. Multichannel Baseband Processor for Wideband CDMA

    NASA Astrophysics Data System (ADS)

    Jalloul, Louay M. A.; Lin, Jim

    2005-12-01

    The system architecture of the cellular base station modem engine (CBME) is described. The CBME is a single-chip multichannel transceiver capable of processing and demodulating signals from multiple users simultaneously. It is optimized to process different classes of code-division multiple-access (CDMA) signals. The paper will show that through key functional system partitioning, tightly coupled small digital signal processing cores, and time-sliced reuse architecture, CBME is able to achieve a high degree of algorithmic flexibility while maintaining efficiency. The paper will also highlight the implementation and verification aspects of the CBME chip design. In this paper, wideband CDMA is used as an example to demonstrate the architecture concept.

  11. Flight design system level C requirements. Solid rocket booster and external tank impact prediction processors. [space transportation system

    NASA Technical Reports Server (NTRS)

    Seale, R. H.

    1979-01-01

    The prediction of the SRB and ET impact areas requires six separate processors. The SRB impact prediction processor computes the impact areas and related trajectory data for each SRB element. Output from this processor is stored on a secure file accessible by the SRB impact plot processor which generates the required plots. Similarly the ET RTLS impact prediction processor and the ET RTLS impact plot processor generates the ET impact footprints for return-to-launch-site (RTLS) profiles. The ET nominal/AOA/ATO impact prediction processor and the ET nominal/AOA/ATO impact plot processor generate the ET impact footprints for non-RTLS profiles. The SRB and ET impact processors compute the size and shape of the impact footprints by tabular lookup in a stored footprint dispersion data base. The location of each footprint is determined by simulating a reference trajectory and computing the reference impact point location. To insure consistency among all flight design system (FDS) users, much input required by these processors will be obtained from the FDS master data base.

  12. Parallel approach in RDF query processing

    NASA Astrophysics Data System (ADS)

    Vajgl, Marek; Parenica, Jan

    2017-07-01

    Parallel approach is nowadays a very cheap solution to increase computational power due to possibility of usage of multithreaded computational units. This hardware became typical part of nowadays personal computers or notebooks and is widely spread. This contribution deals with experiments how evaluation of computational complex algorithm of the inference over RDF data can be parallelized over graphical cards to decrease computational time.

  13. The effects of wildfire on native tree species in the Middle Rio Grande bosques of New Mexico

    Treesearch

    Brad Johnson; David Merritt

    2009-01-01

    The cottonwood bosques along the Middle Fork of the Rio Grande (MRG) form a ribbon of surviving habitat in this once vast ecosystem. Historically, the channel had a multi-threaded and braided configuration that created a rich mosaic of habitats, including mixed-aged cottonwood forests, meadows, and willow-dominated riparian wetlands and backwaters (...

  14. Thread scheduling for GPU-based OPC simulation on multi-thread

    NASA Astrophysics Data System (ADS)

    Lee, Heejun; Kim, Sangwook; Hong, Jisuk; Lee, Sooryong; Han, Hwansoo

    2018-03-01

    As semiconductor product development based on shrinkage continues, the accuracy and difficulty required for the model based optical proximity correction (MBOPC) is increasing. OPC simulation time, which is the most timeconsuming part of MBOPC, is rapidly increasing due to high pattern density in a layout and complex OPC model. To reduce OPC simulation time, we attempt to apply graphic processing unit (GPU) to MBOPC because OPC process is good to be programmed in parallel. We address some issues that may typically happen during GPU-based OPC simulation in multi thread system, such as "out of memory" and "GPU idle time". To overcome these problems, we propose a thread scheduling method, which manages OPC jobs in multiple threads in such a way that simulations jobs from multiple threads are alternatively executed on GPU while correction jobs are executed at the same time in each CPU cores. It was observed that the amount of GPU peak memory usage decreases by up to 35%, and MBOPC runtime also decreases by 4%. In cases where out of memory issues occur in a multi-threaded environment, the thread scheduler was used to improve MBOPC runtime up to 23%.

  15. Playback system designed for X-Band SAR

    NASA Astrophysics Data System (ADS)

    Yuquan, Liu; Changyong, Dou

    2014-03-01

    SAR(Synthetic Aperture Radar) has extensive application because it is daylight and weather independent. In particular, X-Band SAR strip map, designed by Institute of Remote Sensing and Digital Earth, Chinese Academy of Sciences, provides high ground resolution images, at the same time it has a large spatial coverage and a short acquisition time, so it is promising in multi-applications. When sudden disaster comes, the emergency situation acquires radar signal data and image as soon as possible, in order to take action to reduce loss and save lives in the first time. This paper summarizes a type of X-Band SAR playback processing system designed for disaster response and scientific needs. It describes SAR data workflow includes the payload data transmission and reception process. Playback processing system completes signal analysis on the original data, providing SAR level 0 products and quick image. Gigabit network promises radar signal transmission efficiency from recorder to calculation unit. Multi-thread parallel computing and ping pong operation can ensure computation speed. Through gigabit network, multi-thread parallel computing and ping pong operation, high speed data transmission and processing meet the SAR radar data playback real time requirement.

  16. The Reconstruction Toolkit (RTK), an open-source cone-beam CT reconstruction toolkit based on the Insight Toolkit (ITK)

    NASA Astrophysics Data System (ADS)

    Rit, S.; Vila Oliva, M.; Brousmiche, S.; Labarbe, R.; Sarrut, D.; Sharp, G. C.

    2014-03-01

    We propose the Reconstruction Toolkit (RTK, http://www.openrtk.org), an open-source toolkit for fast cone-beam CT reconstruction, based on the Insight Toolkit (ITK) and using GPU code extracted from Plastimatch. RTK is developed by an open consortium (see affiliations) under the non-contaminating Apache 2.0 license. The quality of the platform is daily checked with regression tests in partnership with Kitware, the company supporting ITK. Several features are already available: Elekta, Varian and IBA inputs, multi-threaded Feldkamp-David-Kress reconstruction on CPU and GPU, Parker short scan weighting, multi-threaded CPU and GPU forward projectors, etc. Each feature is either accessible through command line tools or C++ classes that can be included in independent software. A MIDAS community has been opened to share CatPhan datasets of several vendors (Elekta, Varian and IBA). RTK will be used in the upcoming cone-beam CT scanner developed by IBA for proton therapy rooms. Many features are under development: new input format support, iterative reconstruction, hybrid Monte Carlo / deterministic CBCT simulation, etc. RTK has been built to freely share tomographic reconstruction developments between researchers and is open for new contributions.

  17. New technique for real-time distortion-invariant multiobject recognition and classification

    NASA Astrophysics Data System (ADS)

    Hong, Rutong; Li, Xiaoshun; Hong, En; Wang, Zuyi; Wei, Hongan

    2001-04-01

    A real-time hybrid distortion-invariant OPR system was established to make 3D multiobject distortion-invariant automatic pattern recognition. Wavelet transform technique was used to make digital preprocessing of the input scene, to depress the noisy background and enhance the recognized object. A three-layer backpropagation artificial neural network was used in correlation signal post-processing to perform multiobject distortion-invariant recognition and classification. The C-80 and NOA real-time processing ability and the multithread programming technology were used to perform high speed parallel multitask processing and speed up the post processing rate to ROIs. The reference filter library was constructed for the distortion version of 3D object model images based on the distortion parameter tolerance measuring as rotation, azimuth and scale. The real-time optical correlation recognition testing of this OPR system demonstrates that using the preprocessing, post- processing, the nonlinear algorithm os optimum filtering, RFL construction technique and the multithread programming technology, a high possibility of recognition and recognition rate ere obtained for the real-time multiobject distortion-invariant OPR system. The recognition reliability and rate was improved greatly. These techniques are very useful to automatic target recognition.

  18. Implementation of GPU accelerated SPECT reconstruction with Monte Carlo-based scatter correction.

    PubMed

    Bexelius, Tobias; Sohlberg, Antti

    2018-06-01

    Statistical SPECT reconstruction can be very time-consuming especially when compensations for collimator and detector response, attenuation, and scatter are included in the reconstruction. This work proposes an accelerated SPECT reconstruction algorithm based on graphics processing unit (GPU) processing. Ordered subset expectation maximization (OSEM) algorithm with CT-based attenuation modelling, depth-dependent Gaussian convolution-based collimator-detector response modelling, and Monte Carlo-based scatter compensation was implemented using OpenCL. The OpenCL implementation was compared against the existing multi-threaded OSEM implementation running on a central processing unit (CPU) in terms of scatter-to-primary ratios, standardized uptake values (SUVs), and processing speed using mathematical phantoms and clinical multi-bed bone SPECT/CT studies. The difference in scatter-to-primary ratios, visual appearance, and SUVs between GPU and CPU implementations was minor. On the other hand, at its best, the GPU implementation was noticed to be 24 times faster than the multi-threaded CPU version on a normal 128 × 128 matrix size 3 bed bone SPECT/CT data set when compensations for collimator and detector response, attenuation, and scatter were included. GPU SPECT reconstructions show great promise as an every day clinical reconstruction tool.

  19. Rapid Startup and Loading of an Attached Growth, Simultaneous Nitrification/Denitrification Membrane Aerated Bioreactor

    NASA Technical Reports Server (NTRS)

    Meyer, Caitlin; Vega, Leticia

    2014-01-01

    The Membrane Aerated Bioreactor (MABR) is an attached-growth biological system for simultaneous nitrification and denitrification. This design is an innovative approach to common terrestrial wastewater treatments for nitrogen and carbon removal. Implementing a biologically-based water treatment system for long-duration human exploration is an attractive, low energy alternative to physiochemical processes. Two obstacles to implementing such a system are (1) the "start-up" duration from inoculation to steady-state operations and (2) the amount of surface area needed for the biological activity to occur. The Advanced Water Recovery Systems (AWRS) team at JSC explored these two issues through two tests; a rapid inoculation study and a wastewater loading study. Results from these tests demonstrate that the duration from inoculation to steady state can be reduced to two weeks and that the surface area to volume ratio baseline used in the Alternative Water Processor (AWP) test was higher than what was needed to remove the organic carbon and ammonium from the system.

  20. Bin-Hash Indexing: A Parallel Method for Fast Query Processing

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Bethel, Edward W; Gosink, Luke J.; Wu, Kesheng

    2008-06-27

    This paper presents a new parallel indexing data structure for answering queries. The index, called Bin-Hash, offers extremely high levels of concurrency, and is therefore well-suited for the emerging commodity of parallel processors, such as multi-cores, cell processors, and general purpose graphics processing units (GPU). The Bin-Hash approach first bins the base data, and then partitions and separately stores the values in each bin as a perfect spatial hash table. To answer a query, we first determine whether or not a record satisfies the query conditions based on the bin boundaries. For the bins with records that can not bemore » resolved, we examine the spatial hash tables. The procedures for examining the bin numbers and the spatial hash tables offer the maximum possible level of concurrency; all records are able to be evaluated by our procedure independently in parallel. Additionally, our Bin-Hash procedures access much smaller amounts of data than similar parallel methods, such as the projection index. This smaller data footprint is critical for certain parallel processors, like GPUs, where memory resources are limited. To demonstrate the effectiveness of Bin-Hash, we implement it on a GPU using the data-parallel programming language CUDA. The concurrency offered by the Bin-Hash index allows us to fully utilize the GPU's massive parallelism in our work; over 12,000 records can be simultaneously evaluated at any one time. We show that our new query processing method is an order of magnitude faster than current state-of-the-art CPU-based indexing technologies. Additionally, we compare our performance to existing GPU-based projection index strategies.« less

  1. Coding, testing and documentation of processors for the flight design system

    NASA Technical Reports Server (NTRS)

    1980-01-01

    The general functional design and implementation of processors for a space flight design system are briefly described. Discussions of a basetime initialization processor; conic, analytical, and precision coasting flight processors; and an orbit lifetime processor are included. The functions of several utility routines are also discussed.

  2. The computational structural mechanics testbed generic structural-element processor manual

    NASA Technical Reports Server (NTRS)

    Stanley, Gary M.; Nour-Omid, Shahram

    1990-01-01

    The usage and development of structural finite element processors based on the CSM Testbed's Generic Element Processor (GEP) template is documented. By convention, such processors have names of the form ESi, where i is an integer. This manual is therefore intended for both Testbed users who wish to invoke ES processors during the course of a structural analysis, and Testbed developers who wish to construct new element processors (or modify existing ones).

  3. Highly parallel reconfigurable computer architecture for robotic computation having plural processor cells each having right and left ensembles of plural processors

    NASA Technical Reports Server (NTRS)

    Fijany, Amir (Inventor); Bejczy, Antal K. (Inventor)

    1994-01-01

    In a computer having a large number of single-instruction multiple data (SIMD) processors, each of the SIMD processors has two sets of three individual processor elements controlled by a master control unit and interconnected among a plurality of register file units where data is stored. The register files input and output data in synchronism with a minor cycle clock under control of two slave control units controlling the register file units connected to respective ones of the two sets of processor elements. Depending upon which ones of the register file units are enabled to store or transmit data during a particular minor clock cycle, the processor elements within an SIMD processor are connected in rings or in pipeline arrays, and may exchange data with the internal bus or with neighboring SIMD processors through interface units controlled by respective ones of the two slave control units.

  4. System and method for representing and manipulating three-dimensional objects on massively parallel architectures

    DOEpatents

    Karasick, Michael S.; Strip, David R.

    1996-01-01

    A parallel computing system is described that comprises a plurality of uniquely labeled, parallel processors, each processor capable of modelling a three-dimensional object that includes a plurality of vertices, faces and edges. The system comprises a front-end processor for issuing a modelling command to the parallel processors, relating to a three-dimensional object. Each parallel processor, in response to the command and through the use of its own unique label, creates a directed-edge (d-edge) data structure that uniquely relates an edge of the three-dimensional object to one face of the object. Each d-edge data structure at least includes vertex descriptions of the edge and a description of the one face. As a result, each processor, in response to the modelling command, operates upon a small component of the model and generates results, in parallel with all other processors, without the need for processor-to-processor intercommunication.

  5. Switch for serial or parallel communication networks

    DOEpatents

    Crosette, D.B.

    1994-07-19

    A communication switch apparatus and a method for use in a geographically extensive serial, parallel or hybrid communication network linking a multi-processor or parallel processing system has a very low software processing overhead in order to accommodate random burst of high density data. Associated with each processor is a communication switch. A data source and a data destination, a sensor suite or robot for example, may also be associated with a switch. The configuration of the switches in the network are coordinated through a master processor node and depends on the operational phase of the multi-processor network: data acquisition, data processing, and data exchange. The master processor node passes information on the state to be assumed by each switch to the processor node associated with the switch. The processor node then operates a series of multi-state switches internal to each communication switch. The communication switch does not parse and interpret communication protocol and message routing information. During a data acquisition phase, the communication switch couples sensors producing data to the processor node associated with the switch, to a downlink destination on the communications network, or to both. It also may couple an uplink data source to its processor node. During the data exchange phase, the switch couples its processor node or an uplink data source to a downlink destination (which may include a processor node or a robot), or couples an uplink source to its processor node and its processor node to a downlink destination. 9 figs.

  6. Switch for serial or parallel communication networks

    DOEpatents

    Crosette, Dario B.

    1994-01-01

    A communication switch apparatus and a method for use in a geographically extensive serial, parallel or hybrid communication network linking a multi-processor or parallel processing system has a very low software processing overhead in order to accommodate random burst of high density data. Associated with each processor is a communication switch. A data source and a data destination, a sensor suite or robot for example, may also be associated with a switch. The configuration of the switches in the network are coordinated through a master processor node and depends on the operational phase of the multi-processor network: data acquisition, data processing, and data exchange. The master processor node passes information on the state to be assumed by each switch to the processor node associated with the switch. The processor node then operates a series of multi-state switches internal to each communication switch. The communication switch does not parse and interpret communication protocol and message routing information. During a data acquisition phase, the communication switch couples sensors producing data to the processor node associated with the switch, to a downlink destination on the communications network, or to both. It also may couple an uplink data source to its processor node. During the data exchange phase, the switch couples its processor node or an uplink data source to a downlink destination (which may include a processor node or a robot), or couples an uplink source to its processor node and its processor node to a downlink destination.

  7. Conditions for space invariance in optical data processors used with coherent or noncoherent light.

    PubMed

    Arsenault, H R

    1972-10-01

    The conditions for space invariance in coherent and noncoherent optical processors are considered. All linear optical processors are shown to belong to one of two types. The conditions for space invariance are more stringent for noncoherent processors than for coherent processors, so that a system that is linear in coherent light may be nonlinear in noncoherent light. However, any processor that is linear in noncoherent light is also linear in the coherent limit.

  8. Complete all-optical processing polarization-based binary logic gates and optical processors.

    PubMed

    Zaghloul, Y A; Zaghloul, A R M

    2006-10-16

    We present a complete all-optical-processing polarization-based binary-logic system, by which any logic gate or processor can be implemented. Following the new polarization-based logic presented in [Opt. Express 14, 7253 (2006)], we develop a new parallel processing technique that allows for the creation of all-optical-processing gates that produce a unique output either logic 1 or 0 only once in a truth table, and those that do not. This representation allows for the implementation of simple unforced OR, AND, XOR, XNOR, inverter, and more importantly NAND and NOR gates that can be used independently to represent any Boolean expression or function. In addition, the concept of a generalized gate is presented which opens the door for reconfigurable optical processors and programmable optical logic gates. Furthermore, the new design is completely compatible with the old one presented in [Opt. Express 14, 7253 (2006)], and with current semiconductor based devices. The gates can be cascaded, where the information is always on the laser beam. The polarization of the beam, and not its intensity, carries the information. The new methodology allows for the creation of multiple-input-multiple-output processors that implement, by itself, any Boolean function, such as specialized or non-specialized microprocessors. Three all-optical architectures are presented: orthoparallel optical logic architecture for all known and unknown binary gates, singlebranch architecture for only XOR and XNOR gates, and the railroad (RR) architecture for polarization optical processors (POP). All the control inputs are applied simultaneously leading to a single time lag which leads to a very-fast and glitch-immune POP. A simple and easy-to-follow step-by-step algorithm is provided for the POP, and design reduction methodologies are briefly discussed. The algorithm lends itself systematically to software programming and computer-assisted design. As examples, designs of all binary gates, multiple-input gates, and sequential and non-sequential Boolean expressions are presented and discussed. The operation of each design is simply understood by a bullet train traveling at the speed of light on a railroad system preconditioned by the crossover states predetermined by the control inputs. The presented designs allow for optical processing of the information eliminating the need to convert it, back and forth, to an electronic signal for processing purposes. All gates with a truth table, including for example Fredkin, Toffoli, testable reversible logic, and threshold logic gates, can be designed and implemented using the railroad architecture. That includes any future gates not known today. Those designs and the quantum gates are not discussed in this paper.

  9. Broadcasting collective operation contributions throughout a parallel computer

    DOEpatents

    Faraj, Ahmad [Rochester, MN

    2012-02-21

    Methods, systems, and products are disclosed for broadcasting collective operation contributions throughout a parallel computer. The parallel computer includes a plurality of compute nodes connected together through a data communications network. Each compute node has a plurality of processors for use in collective parallel operations on the parallel computer. Broadcasting collective operation contributions throughout a parallel computer according to embodiments of the present invention includes: transmitting, by each processor on each compute node, that processor's collective operation contribution to the other processors on that compute node using intra-node communications; and transmitting on a designated network link, by each processor on each compute node according to a serial processor transmission sequence, that processor's collective operation contribution to the other processors on the other compute nodes using inter-node communications.

  10. LANDSAT-D flight segment operations manual. Appendix B: OBC software operations

    NASA Technical Reports Server (NTRS)

    Talipsky, R.

    1981-01-01

    The LANDSAT 4 satellite contains two NASA standard spacecraft computers and 65,536 words of memory. Onboard computer software is divided into flight executive and applications processors. Both applications processors and the flight executive use one or more of 67 system tables to obtain variables, constants, and software flags. Output from the software for monitoring operation is via 49 OBC telemetry reports subcommutated in the spacecraft telemetry. Information is provided about the flight software as it is used to control the various spacecraft operations and interpret operational OBC telemetry. Processor function descriptions, processor operation, software constraints, processor system tables, processor telemetry, and processor flow charts are presented.

  11. Managing Power Heterogeneity

    NASA Astrophysics Data System (ADS)

    Pruhs, Kirk

    A particularly important emergent technology is heterogeneous processors (or cores), which many computer architects believe will be the dominant architectural design in the future. The main advantage of a heterogeneous architecture, relative to an architecture of identical processors, is that it allows for the inclusion of processors whose design is specialized for particular types of jobs, and for jobs to be assigned to a processor best suited for that job. Most notably, it is envisioned that these heterogeneous architectures will consist of a small number of high-power high-performance processors for critical jobs, and a larger number of lower-power lower-performance processors for less critical jobs. Naturally, the lower-power processors would be more energy efficient in terms of the computation performed per unit of energy expended, and would generate less heat per unit of computation. For a given area and power budget, heterogeneous designs can give significantly better performance for standard workloads. Moreover, even processors that were designed to be homogeneous, are increasingly likely to be heterogeneous at run time: the dominant underlying cause is the increasing variability in the fabrication process as the feature size is scaled down (although run time faults will also play a role). Since manufacturing yields would be unacceptably low if every processor/core was required to be perfect, and since there would be significant performance loss from derating the entire chip to the functioning of the least functional processor (which is what would be required in order to attain processor homogeneity), some processor heterogeneity seems inevitable in chips with many processors/cores.

  12. Multi-Core Processor Memory Contention Benchmark Analysis Case Study

    NASA Technical Reports Server (NTRS)

    Simon, Tyler; McGalliard, James

    2009-01-01

    Multi-core processors dominate current mainframe, server, and high performance computing (HPC) systems. This paper provides synthetic kernel and natural benchmark results from an HPC system at the NASA Goddard Space Flight Center that illustrate the performance impacts of multi-core (dual- and quad-core) vs. single core processor systems. Analysis of processor design, application source code, and synthetic and natural test results all indicate that multi-core processors can suffer from significant memory subsystem contention compared to similar single-core processors.

  13. Simulink/PARS Integration Support

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Vacaliuc, B.; Nakhaee, N.

    2013-12-18

    The state of the art for signal processor hardware has far out-paced the development tools for placing applications on that hardware. In addition, signal processors are available in a variety of architectures, each uniquely capable of handling specific types of signal processing efficiently. With these processors becoming smaller and demanding less power, it has become possible to group multiple processors, a heterogeneous set of processors, into single systems. Different portions of the desired problem set can be assigned to different processor types as appropriate. As software development tools do not keep pace with these processors, especially when multiple processors ofmore » different types are used, a method is needed to enable software code portability among multiple processors and multiple types of processors along with their respective software environments. Sundance DSP, Inc. has developed a software toolkit called “PARS”, whose objective is to provide a framework that uses suites of tools provided by different vendors, along with modeling tools and a real time operating system, to build an application that spans different processor types. The software language used to express the behavior of the system is a very high level modeling language, “Simulink”, a MathWorks product. ORNL has used this toolkit to effectively implement several deliverables. This CRADA describes this collaboration between ORNL and Sundance DSP, Inc.« less

  14. SPECIAL ISSUE ON OPTICAL PROCESSING OF INFORMATION: Optoelectronic processors with scanning CCD photodetectors

    NASA Astrophysics Data System (ADS)

    Esepkina, N. A.; Lavrov, A. P.; Anan'ev, M. N.; Blagodarnyi, V. S.; Ivanov, S. I.; Mansyrev, M. I.; Molodyakov, S. A.

    1995-10-01

    Two new types of optoelectronic radio-signal processors were investigated. Charge-coupled device (CCD) photodetectors are used in these processors under continuous scanning conditions, i.e. in a time delay and storage mode. One of these processors is based on a CCD photodetector array with a reference-signal amplitude transparency and the other is an adaptive acousto-optical signal processor with linear frequency modulation. The processor with the transparency performs multichannel discrete—analogue convolution of an input signal with a corresponding kernel of the transformation determined by the transparency. If a light source is an array of light-emitting diodes of special (stripe) geometry, the optical stages of the processor can be made from optical fibre components and the whole processor then becomes a rigid 'sandwich' (a compact hybrid optoelectronic microcircuit). A report is given also of a study of a prototype processor with optical fibre components for the reception of signals from a system with antenna aperture synthesis, which forms a radio image of the Earth.

  15. System and method for representing and manipulating three-dimensional objects on massively parallel architectures

    DOEpatents

    Karasick, M.S.; Strip, D.R.

    1996-01-30

    A parallel computing system is described that comprises a plurality of uniquely labeled, parallel processors, each processor capable of modeling a three-dimensional object that includes a plurality of vertices, faces and edges. The system comprises a front-end processor for issuing a modeling command to the parallel processors, relating to a three-dimensional object. Each parallel processor, in response to the command and through the use of its own unique label, creates a directed-edge (d-edge) data structure that uniquely relates an edge of the three-dimensional object to one face of the object. Each d-edge data structure at least includes vertex descriptions of the edge and a description of the one face. As a result, each processor, in response to the modeling command, operates upon a small component of the model and generates results, in parallel with all other processors, without the need for processor-to-processor intercommunication. 8 figs.

  16. Shared performance monitor in a multiprocessor system

    DOEpatents

    Chiu, George; Gara, Alan G.; Salapura, Valentina

    2012-07-24

    A performance monitoring unit (PMU) and method for monitoring performance of events occurring in a multiprocessor system. The multiprocessor system comprises a plurality of processor devices units, each processor device for generating signals representing occurrences of events in the processor device, and, a single shared counter resource for performance monitoring. The performance monitor unit is shared by all processor cores in the multiprocessor system. The PMU comprises: a plurality of performance counters each for counting signals representing occurrences of events from one or more the plurality of processor units in the multiprocessor system; and, a plurality of input devices for receiving the event signals from one or more processor devices of the plurality of processor units, the plurality of input devices programmable to select event signals for receipt by one or more of the plurality of performance counters for counting, wherein the PMU is shared between multiple processing units, or within a group of processors in the multiprocessing system. The PMU is further programmed to monitor event signals issued from non-processor devices.

  17. Efficient Aho-Corasick String Matching on Emerging Multicore Architectures

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Tumeo, Antonino; Villa, Oreste; Secchi, Simone

    String matching algorithms are critical to several scientific fields. Beside text processing and databases, emerging applications such as DNA protein sequence analysis, data mining, information security software, antivirus, ma- chine learning, all exploit string matching algorithms [3]. All these applica- tions usually process large quantity of textual data, require high performance and/or predictable execution times. Among all the string matching algorithms, one of the most studied, especially for text processing and security applica- tions, is the Aho-Corasick algorithm. 1 2 Book title goes here Aho-Corasick is an exact, multi-pattern string matching algorithm which performs the search in a time linearlymore » proportional to the length of the input text independently from pattern set size. However, depending on the imple- mentation, when the number of patterns increase, the memory occupation may raise drastically. In turn, this can lead to significant variability in the performance, due to the memory access times and the caching effects. This is a significant concern for many mission critical applications and modern high performance architectures. For example, security applications such as Network Intrusion Detection Systems (NIDS), must be able to scan network traffic against very large dictionaries in real time. Modern Ethernet links reach up to 10 Gbps, and malicious threats are already well over 1 million, and expo- nentially growing [28]. When performing the search, a NIDS should not slow down the network, or let network packets pass unchecked. Nevertheless, on the current state-of-the-art cache based processors, there may be a large per- formance variability when dealing with big dictionaries and inputs that have different frequencies of matching patterns. In particular, when few patterns are matched and they are all in the cache, the procedure is fast. Instead, when they are not in the cache, often because many patterns are matched and the caches are continuously thrashed, they should be retrieved from the system memory and the procedure is slowed down by the increased latency. Efficient implementations of string matching algorithms have been the fo- cus of several works, targeting Field Programmable Gate Arrays [4, 25, 15, 5], highly multi-threaded solutions like the Cray XMT [34], multicore proces- sors [19] or heterogeneous processors like the Cell Broadband Engine [35, 22]. Recently, several researchers have also started to investigate the use Graphic Processing Units (GPUs) for string matching algorithms in security applica- tions [20, 10, 32, 33]. Most of these approaches mainly focus on reaching high peak performance, or try to optimize the memory occupation, rather than looking at performance stability. However, hardware solutions supports only small dictionary sizes due to lack of memory and are difficult to customize, while platforms such as the Cell/B.E. are very complex to program.« less

  18. Comprehensive Software Simulation on Ground Power Supply for Launch Pads and Processing Facilities at NASA Kennedy Space Center

    NASA Technical Reports Server (NTRS)

    Dominguez, Jesus A.; Victor, Elias; Vasquez, Angel L.; Urbina, Alfredo R.

    2017-01-01

    A multi-threaded software application has been developed in-house by the Ground Special Power (GSP) team at NASA Kennedy Space Center (KSC) to separately simulate and fully emulate all units that supply VDC power and battery-based power backup to multiple KSC launch ground support systems for NASA Space Launch Systems (SLS) rocket.

  19. NavP: Structured and Multithreaded Distributed Parallel Programming

    NASA Technical Reports Server (NTRS)

    Pan, Lei; Xu, Jingling

    2006-01-01

    This slide presentation reviews some of the issues around distributed parallel programming. It compares and contrast two methods of programming: Single Program Multiple Data (SPMD) with the Navigational Programming (NAVP). It then reviews the distributed sequential computing (DSC) method and the methodology of NavP. Case studies are presented. It also reviews the work that is being done to enable the NavP system.

  20. Concurrency Attacks and Defenses

    DTIC Science & Technology

    2016-10-04

    Enter name(s) of person(s) responsible for writing the report, performing the research, or credited with the content of the report. The form of...Amsterdam Avenue New York, NY 10027-7003 Tel.: 212-939-7012 Fax: 212-666-0140 Email : junfeng@cs.columbia.edu 2. Research Objectives...Multithreaded programs are getting increasingly pervasive and critical. Unfortunately, they remain extremely difficult to write . This difficulty has led to

  1. Distributed Emulation in Support of Large Networks

    DTIC Science & Technology

    2016-06-01

    Provider LTE Long Term Evolution MB Megabyte MIPS Microprocessor without Interlocked Pipeline Stages MRT Multi-Threaded Routing Toolkit NPS Naval...environment, modifications to a network, protocol, or model can be executed – and the effects measured – without affecting real-world users or services...produce their results when analyzing performance of Long Term Evolution ( LTE ) gateways [3]. Many research scenarios allow problems to be represented

  2. SU-E-T-531: Performance Evaluation of Multithreaded Geant4 for Proton Therapy Dose Calculations in a High Performance Computing Facility

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Shin, J; Coss, D; McMurry, J

    Purpose: To evaluate the efficiency of multithreaded Geant4 (Geant4-MT, version 10.0) for proton Monte Carlo dose calculations using a high performance computing facility. Methods: Geant4-MT was used to calculate 3D dose distributions in 1×1×1 mm3 voxels in a water phantom and patient's head with a 150 MeV proton beam covering approximately 5×5 cm2 in the water phantom. Three timestamps were measured on the fly to separately analyze the required time for initialization (which cannot be parallelized), processing time of individual threads, and completion time. Scalability of averaged processing time per thread was calculated as a function of thread number (1,more » 100, 150, and 200) for both 1M and 50 M histories. The total memory usage was recorded. Results: Simulations with 50 M histories were fastest with 100 threads, taking approximately 1.3 hours and 6 hours for the water phantom and the CT data, respectively with better than 1.0 % statistical uncertainty. The calculations show 1/N scalability in the event loops for both cases. The gains from parallel calculations started to decrease with 150 threads. The memory usage increases linearly with number of threads. No critical failures were observed during the simulations. Conclusion: Multithreading in Geant4-MT decreased simulation time in proton dose distribution calculations by a factor of 64 and 54 at a near optimal 100 threads for water phantom and patient's data respectively. Further simulations will be done to determine the efficiency at the optimal thread number. Considering the trend of computer architecture development, utilizing Geant4-MT for radiotherapy simulations is an excellent cost-effective alternative for a distributed batch queuing system. However, because the scalability depends highly on simulation details, i.e., the ratio of the processing time of one event versus waiting time to access for the shared event queue, a performance evaluation as described is recommended.« less

  3. A PC-based shutter glasses controller for visual stimulation using multithreading in LabWindows/CVI.

    PubMed

    Gramatikov, Ivan; Simons, Kurt; Guyton, David; Gramatikov, Boris

    2017-05-01

    Amblyopia, commonly known as "lazy eye," is poor vision in an eye from prolonged neurologic suppression. It is a major public health problem, afflicting up to 3.6% of children, and will lead to lifelong visual impairment if not identified and treated in early childhood. Traditional treatment methods, such as occluding or penalizing the good eye with eye patches or blurring eye drops, do not always yield satisfactory results. Newer methods have emerged, based on liquid crystal shutter glasses that intermittently occlude the better eye, or alternately occlude the two eyes, thus stimulating vision in the "lazy" eye. As yet there is no technology that allows easy and efficient optimization of the shuttering characteristics for a given individual. The purpose of this study was to develop an inexpensive, computer-based system to perform liquid crystal shuttering in laboratory and clinical settings to help "wake up" the suppressed eye in amblyopic patients, and to help optimize the individual shuttering parameters such as wave shape, level of transparency/opacity, frequency, and duty cycle of the shuttering. We developed a liquid crystal glasses controller connected by USB cable to a PC computer. It generates the voltage waveforms going to the glasses, and has potentiometer knobs for interactive adjustments by the patient. In order to achieve good timing performance in this bidirectional system, we used multithreading programming techniques with data protection, implemented in LabWindows/CVI. The hardware and software developed were assessed experimentally. We achieved an accuracy of ±1Hz for the frequency, and ±2% for the duty cycle of the occlusion pulses. We consider these values to be satisfactory for the purpose of optimizing the visual stimulation by means of shutter glasses. The system can be used for individual optimization of shuttering attributes by clinicians, for training sessions in clinical settings, or even at home, aimed at stimulating vision in the "lazy" eye. Multithreading offers significant benefits for data acquisition and instrument control, making it possible to implement time-efficient algorithms in inexpensive yet versatile medical instrumentation with only minimum requirements on the hardware. Copyright © 2017 Elsevier B.V. All rights reserved.

  4. Implementation of kernels on the Maestro processor

    NASA Astrophysics Data System (ADS)

    Suh, Jinwoo; Kang, D. I. D.; Crago, S. P.

    Currently, most microprocessors use multiple cores to increase performance while limiting power usage. Some processors use not just a few cores, but tens of cores or even 100 cores. One such many-core microprocessor is the Maestro processor, which is based on Tilera's TILE64 processor. The Maestro chip is a 49-core, general-purpose, radiation-hardened processor designed for space applications. The Maestro processor, unlike the TILE64, has a floating point unit (FPU) in each core for improved floating point performance. The Maestro processor runs at 342 MHz clock frequency. On the Maestro processor, we implemented several widely used kernels: matrix multiplication, vector add, FIR filter, and FFT. We measured and analyzed the performance of these kernels. The achieved performance was up to 5.7 GFLOPS, and the speedup compared to single tile was up to 49 using 49 tiles.

  5. Ordering of guarded and unguarded stores for no-sync I/O

    DOEpatents

    Gara, Alan; Ohmacht, Martin

    2013-06-25

    A parallel computing system processes at least one store instruction. A first processor core issues a store instruction. A first queue, associated with the first processor core, stores the store instruction. A second queue, associated with a first local cache memory device of the first processor core, stores the store instruction. The first processor core updates first data in the first local cache memory device according to the store instruction. The third queue, associated with at least one shared cache memory device, stores the store instruction. The first processor core invalidates second data, associated with the store instruction, in the at least one shared cache memory. The first processor core invalidates third data, associated with the store instruction, in other local cache memory devices of other processor cores. The first processor core flushing only the first queue.

  6. Electrochemical sensing using voltage-current time differential

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Woo, Leta Yar-Li; Glass, Robert Scott; Fitzpatrick, Joseph Jay

    2017-02-28

    A device for signal processing. The device includes a signal generator, a signal detector, and a processor. The signal generator generates an original waveform. The signal detector detects an affected waveform. The processor is coupled to the signal detector. The processor receives the affected waveform from the signal detector. The processor also compares at least one portion of the affected waveform with the original waveform. The processor also determines a difference between the affected waveform and the original waveform. The processor also determines a value corresponding to a unique portion of the determined difference between the original and affected waveforms.more » The processor also outputs the determined value.« less

  7. Accuracy requirements of optical linear algebra processors in adaptive optics imaging systems

    NASA Technical Reports Server (NTRS)

    Downie, John D.; Goodman, Joseph W.

    1989-01-01

    The accuracy requirements of optical processors in adaptive optics systems are determined by estimating the required accuracy in a general optical linear algebra processor (OLAP) that results in a smaller average residual aberration than that achieved with a conventional electronic digital processor with some specific computation speed. Special attention is given to an error analysis of a general OLAP with regard to the residual aberration that is created in an adaptive mirror system by the inaccuracies of the processor, and to the effect of computational speed of an electronic processor on the correction. Results are presented on the ability of an OLAP to compete with a digital processor in various situations.

  8. Considerations on the Use of Custom Accelerators for Big Data Analytics

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Castellana, Vito G.; Tumeo, Antonino; Minutoli, Marco

    Accelerators, including Graphic Processing Units (GPUs) for gen- eral purpose computation, many-core designs with wide vector units (e.g., Intel Phi), have become a common component of many high performance clusters. The appearance of more stable and reliable tools tools that can automatically convert code written in high-level specifications with annotations (such as C or C++) to hardware de- scription languages (High-Level Synthesis - HLS), is also setting the stage for a broader use of reconfigurable devices (e.g., Field Pro- grammable Gate Arrays - FPGAs) in high performance system for the implementation of custom accelerators, helped by the fact that newmore » processors include advanced cache-coherent interconnects for these components. In this chapter, we briefly survey the status of the use of accelerators in high performance systems targeted at big data analytics applications. We argue that, although the progress in the use of accelerators for this class of applications has been sig- nificant, differently from scientific simulations there still are gaps to close. This is particularly true for the ”irregular” behaviors exhibited by no-SQL, graph databases. We focus our attention on the limits of HLS tools for data analytics and graph methods, and discuss a new architectural template that better fits the requirement of this class of applications. We validate the new architectural templates by mod- ifying the Graph Engine for Multithreaded System (GEMS) frame- work to support accelerators generated with such a methodology, and testing with queries coming from the Lehigh University Benchmark (LUBM). The architectural template enables better supporting the task and memory level parallelism present in graph methods by sup- porting a new control model and a enhanced memory interface. We show that out solution allows generating parallel accelerators, pro- viding speed ups with respect to conventional HLS flows. We finally draw conclusions and present a perspective on the use of reconfig- urable devices and Design Automation tools for data analytics.« less

  9. GAMERA - The New Magnetospheric Code

    NASA Astrophysics Data System (ADS)

    Lyon, J.; Sorathia, K.; Zhang, B.; Merkin, V. G.; Wiltberger, M. J.; Daldorff, L. K. S.

    2017-12-01

    The Lyon-Fedder-Mobarry (LFM) code has been a main-line magnetospheric simulation code for 30 years. The code base, designed in the age of memory to memory vector ma- chines,is still in wide use for science production but needs upgrading to ensure the long term sustainability. In this presentation, we will discuss our recent efforts to update and improve that code base and also highlight some recent results. The new project GAM- ERA, Grid Agnostic MHD for Extended Research Applications, has kept the original design characteristics of the LFM and made significant improvements. The original de- sign included high order numerical differencing with very aggressive limiting, the ability to use arbitrary, but logically rectangular, grids, and maintenance of div B = 0 through the use of the Yee grid. Significant improvements include high-order upwinding and a non-clipping limiter. One other improvement with wider applicability is an im- proved averaging technique for the singularities in polar and spherical grids. The new code adopts a hybrid structure - multi-threaded OpenMP with an overarching MPI layer for large scale and coupled applications. The MPI layer uses a combination of standard MPI and the Global Array Toolkit from PNL to provide a lightweight mechanism for coupling codes together concurrently. The single processor code is highly efficient and can run magnetospheric simulations at the default CCMC resolution faster than real time on a MacBook pro. We have run the new code through the Athena suite of tests, and the results compare favorably with the codes available to the astrophysics community. LFM/GAMERA has been applied to many different situations ranging from the inner and outer heliosphere and magnetospheres of Venus, the Earth, Jupiter and Saturn. We present example results the Earth's magnetosphere including a coupled ring current (RCM), the magnetospheres of Jupiter and Saturn, and the inner heliosphere.

  10. High-Performance, Multi-Node File Copies and Checksums for Clustered File Systems

    NASA Technical Reports Server (NTRS)

    Kolano, Paul Z.; Ciotti, Robert B.

    2012-01-01

    Modern parallel file systems achieve high performance using a variety of techniques, such as striping files across multiple disks to increase aggregate I/O bandwidth and spreading disks across multiple servers to increase aggregate interconnect bandwidth. To achieve peak performance from such systems, it is typically necessary to utilize multiple concurrent readers/writers from multiple systems to overcome various singlesystem limitations, such as number of processors and network bandwidth. The standard cp and md5sum tools of GNU coreutils found on every modern Unix/Linux system, however, utilize a single execution thread on a single CPU core of a single system, and hence cannot take full advantage of the increased performance of clustered file systems. Mcp and msum are drop-in replacements for the standard cp and md5sum programs that utilize multiple types of parallelism and other optimizations to achieve maximum copy and checksum performance on clustered file systems. Multi-threading is used to ensure that nodes are kept as busy as possible. Read/write parallelism allows individual operations of a single copy to be overlapped using asynchronous I/O. Multinode cooperation allows different nodes to take part in the same copy/checksum. Split-file processing allows multiple threads to operate concurrently on the same file. Finally, hash trees allow inherently serial checksums to be performed in parallel. Mcp and msum provide significant performance improvements over standard cp and md5sum using multiple types of parallelism and other optimizations. The total speed-ups from all improvements are significant. Mcp improves cp performance over 27x, msum improves md5sum performance almost 19x, and the combination of mcp and msum improves verified copies via cp and md5sum by almost 22x. These improvements come in the form of drop-in replacements for cp and md5sum, so are easily used and are available for download as open source software at http://mutil.sourceforge.net.

  11. SIPSMetGen: It's Not Just For Aircraft Data and ECS Anymore.

    NASA Astrophysics Data System (ADS)

    Schwab, M.

    2015-12-01

    The SIPSMetGen utility, developed for the NASA EOSDIS project, under the EED contract, simplified the creation of file level metadata for the ECS System. The utility has been enhanced for ease of use, efficiency, speed and increased flexibility. The SIPSMetGen utility was originally created as a means of generating file level spatial metadata for Operation IceBridge. The first version created only ODL metadata, specific for ingest into ECS. The core strength of the utility was, and continues to be, its ability to take complex shapes and patterns of data collection point clouds from aircraft flights and simplify them to a relatively simple concave hull geo-polygon. It has been found to be a useful and easy to use tool for creating file level metadata for many other missions, both aircraft and satellite. While the original version was useful it had its limitations. In 2014 Raytheon was tasked to make enhancements to SIPSMetGen, this resulted a new version of SIPSMetGen which can create ISO Compliant XML metadata; provides optimization and streamlining of the algorithm for creating the spatial metadata; a quicker runtime with more consistent results; a utility that can be configured to run multi-threaded on systems with multiple processors. The utility comes with a java based graphical user interface to aid in configuration and running of the utility. The enhanced SIPSMetGen allows more diverse data sets to be archived with file level metadata. The advantage of archiving data with file level metadata is that it makes it easier for data users, and scientists to find relevant data. File level metadata unlocks the power of existing archives and metadata repositories such as ECS and CMR and search and discovery utilities like Reverb and Earth Data Search. Current missions now using SIPSMetGen include: Aquarius, Measures, ARISE, and Nimbus.

  12. Modeling heterogeneous processor scheduling for real time systems

    NASA Technical Reports Server (NTRS)

    Leathrum, J. F.; Mielke, R. R.; Stoughton, J. W.

    1994-01-01

    A new model is presented to describe dataflow algorithms implemented in a multiprocessing system. Called the resource/data flow graph (RDFG), the model explicitly represents cyclo-static processor schedules as circuits of processor arcs which reflect the order that processors execute graph nodes. The model also allows the guarantee of meeting hard real-time deadlines. When unfolded, the model identifies statically the processor schedule. The model therefore is useful for determining the throughput and latency of systems with heterogeneous processors. The applicability of the model is demonstrated using a space surveillance algorithm.

  13. Parallel processor for real-time structural control

    NASA Astrophysics Data System (ADS)

    Tise, Bert L.

    1993-07-01

    A parallel processor that is optimized for real-time linear control has been developed. This modular system consists of A/D modules, D/A modules, and floating-point processor modules. The scalable processor uses up to 1,000 Motorola DSP96002 floating-point processors for a peak computational rate of 60 GFLOPS. Sampling rates up to 625 kHz are supported by this analog-in to analog-out controller. The high processing rate and parallel architecture make this processor suitable for computing state-space equations and other multiply/accumulate-intensive digital filters. Processor features include 14-bit conversion devices, low input-to-output latency, 240 Mbyte/s synchronous backplane bus, low-skew clock distribution circuit, VME connection to host computer, parallelizing code generator, and look- up-tables for actuator linearization. This processor was designed primarily for experiments in structural control. The A/D modules sample sensors mounted on the structure and the floating- point processor modules compute the outputs using the programmed control equations. The outputs are sent through the D/A module to the power amps used to drive the structure's actuators. The host computer is a Sun workstation. An OpenWindows-based control panel is provided to facilitate data transfer to and from the processor, as well as to control the operating mode of the processor. A diagnostic mode is provided to allow stimulation of the structure and acquisition of the structural response via sensor inputs.

  14. Testing and operating a multiprocessor chip with processor redundancy

    DOEpatents

    Bellofatto, Ralph E; Douskey, Steven M; Haring, Rudolf A; McManus, Moyra K; Ohmacht, Martin; Schmunkamp, Dietmar; Sugavanam, Krishnan; Weatherford, Bryan J

    2014-10-21

    A system and method for improving the yield rate of a multiprocessor semiconductor chip that includes primary processor cores and one or more redundant processor cores. A first tester conducts a first test on one or more processor cores, and encodes results of the first test in an on-chip non-volatile memory. A second tester conducts a second test on the processor cores, and encodes results of the second test in an external non-volatile storage device. An override bit of a multiplexer is set if a processor core fails the second test. In response to the override bit, the multiplexer selects a physical-to-logical mapping of processor IDs according to one of: the encoded results in the memory device or the encoded results in the external storage device. On-chip logic configures the processor cores according to the selected physical-to-logical mapping.

  15. DOE Office of Scientific and Technical Information (OSTI.GOV)

    Reed, D.A.; Grunwald, D.C.

    The spectrum of parallel processor designs can be divided into three sections according to the number and complexity of the processors. At one end there are simple, bit-serial processors. Any one of thee processors is of little value, but when it is coupled with many others, the aggregate computing power can be large. This approach to parallel processing can be likened to a colony of termites devouring a log. The most notable examples of this approach are the NASA/Goodyear Massively Parallel Processor, which has 16K one-bit processors, and the Thinking Machines Connection Machine, which has 64K one-bit processors. At themore » other end of the spectrum, a small number of processors, each built using the fastest available technology and the most sophisticated architecture, are combined. An example of this approach is the Cray X-MP. This type of parallel processing is akin to four woodmen attacking the log with chainsaws.« less

  16. Real-time separation of multineuron recordings with a DSP32C signal processor.

    PubMed

    Gädicke, R; Albus, K

    1995-04-01

    We have developed a hardware and software package for real-time discrimination of multiple-unit activities recorded simultaneously from multiple microelectrodes using a VME-Bus system. Compared with other systems cited in literature or commercially available, our system has the following advantages. (1) Each electrode is served by its own preprocessor (DSP32C); (2) On-line spike discrimination is performed independently for each electrode. (3) The VME-bus allows processing of data received from 16 electrodes. The digitized (62.5 kHz) spike form is itself used as the model spike; the algorithm allows for comparing and sorting complete wave forms in real time into 8 different models per electrode.

  17. An Efficient Downlink Scheduling Strategy Using Normal Graphs for Multiuser MIMO Wireless Systems

    NASA Astrophysics Data System (ADS)

    Chen, Jung-Chieh; Wu, Cheng-Hsuan; Lee, Yao-Nan; Wen, Chao-Kai

    Inspired by the success of the low-density parity-check (LDPC) codes in the field of error-control coding, in this paper we propose transforming the downlink multiuser multiple-input multiple-output scheduling problem into an LDPC-like problem using the normal graph. Based on the normal graph framework, soft information, which indicates the probability that each user will be scheduled to transmit packets at the access point through a specified angle-frequency sub-channel, is exchanged among the local processors to iteratively optimize the multiuser transmission schedule. Computer simulations show that the proposed algorithm can efficiently schedule simultaneous multiuser transmission which then increases the overall channel utilization and reduces the average packet delay.

  18. Real-Time Cognitive Computing Architecture for Data Fusion in a Dynamic Environment

    NASA Technical Reports Server (NTRS)

    Duong, Tuan A.; Duong, Vu A.

    2012-01-01

    A novel cognitive computing architecture is conceptualized for processing multiple channels of multi-modal sensory data streams simultaneously, and fusing the information in real time to generate intelligent reaction sequences. This unique architecture is capable of assimilating parallel data streams that could be analog, digital, synchronous/asynchronous, and could be programmed to act as a knowledge synthesizer and/or an "intelligent perception" processor. In this architecture, the bio-inspired models of visual pathway and olfactory receptor processing are combined as processing components, to achieve the composite function of "searching for a source of food while avoiding the predator." The architecture is particularly suited for scene analysis from visual data and odorant.

  19. PCLIPS: Parallel CLIPS

    NASA Technical Reports Server (NTRS)

    Gryphon, Coranth D.; Miller, Mark D.

    1991-01-01

    PCLIPS (Parallel CLIPS) is a set of extensions to the C Language Integrated Production System (CLIPS) expert system language. PCLIPS is intended to provide an environment for the development of more complex, extensive expert systems. Multiple CLIPS expert systems are now capable of running simultaneously on separate processors, or separate machines, thus dramatically increasing the scope of solvable tasks within the expert systems. As a tool for parallel processing, PCLIPS allows for an expert system to add to its fact-base information generated by other expert systems, thus allowing systems to assist each other in solving a complex problem. This allows individual expert systems to be more compact and efficient, and thus run faster or on smaller machines.

  20. Electrochemical sensing using comparison of voltage-current time differential values during waveform generation and detection

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Woo, Leta Yar-Li; Glass, Robert Scott; Fitzpatrick, Joseph Jay

    2018-01-02

    A device for signal processing. The device includes a signal generator, a signal detector, and a processor. The signal generator generates an original waveform. The signal detector detects an affected waveform. The processor is coupled to the signal detector. The processor receives the affected waveform from the signal detector. The processor also compares at least one portion of the affected waveform with the original waveform. The processor also determines a difference between the affected waveform and the original waveform. The processor also determines a value corresponding to a unique portion of the determined difference between the original and affected waveforms.more » The processor also outputs the determined value.« less

  1. Real-time co-registered ultrasound and photoacoustic imaging system based on FPGA and DSP architecture

    NASA Astrophysics Data System (ADS)

    Alqasemi, Umar; Li, Hai; Aguirre, Andres; Zhu, Quing

    2011-03-01

    Co-registering ultrasound (US) and photoacoustic (PA) imaging is a logical extension to conventional ultrasound because both modalities provide complementary information of tumor morphology, tumor vasculature and hypoxia for cancer detection and characterization. In addition, both modalities are capable of providing real-time images for clinical applications. In this paper, a Field Programmable Gate Array (FPGA) and Digital Signal Processor (DSP) module-based real-time US/PA imaging system is presented. The system provides real-time US/PA data acquisition and image display for up to 5 fps* using the currently implemented DSP board. It can be upgraded to 15 fps, which is the maximum pulse repetition rate of the used laser, by implementing an advanced DSP module. Additionally, the photoacoustic RF data for each frame is saved for further off-line processing. The system frontend consists of eight 16-channel modules made of commercial and customized circuits. Each 16-channel module consists of two commercial 8-channel receiving circuitry boards and one FPGA board from Analog Devices. Each receiving board contains an IC† that combines. 8-channel low-noise amplifiers, variable-gain amplifiers, anti-aliasing filters, and ADC's‡ in a single chip with sampling frequency of 40MHz. The FPGA board captures the LVDSξ Double Data Rate (DDR) digital output of the receiving board and performs data conditioning and subbeamforming. A customized 16-channel transmission circuitry is connected to the two receiving boards for US pulseecho (PE) mode data acquisition. A DSP module uses External Memory Interface (EMIF) to interface with the eight 16-channel modules through a customized adaptor board. The DSP transfers either sub-beamformed data (US pulse-echo mode or PAI imaging mode) or raw data from FPGA boards to its DDR-2 memory through the EMIF link, then it performs additional processing, after that, it transfer the data to the PC** for further image processing. The PC code performs image processing including demodulation, beam envelope detection and scan conversion. Additionally, the PC code pre-calculates the delay coefficients used for transmission focusing and receiving dynamic focusing for different types of transducers to speed up the imaging process. To further speed up the imaging process, a multi-threads technique is implemented in order to allow formation of previous image frame data and acquisition of the next one simultaneously. The system is also capable of doing semi-real-time automated SO2 imaging at 10 seconds per frame by changing the wavelength knob of the laser automatically using a stepper motor controlled by the system. Initial in vivo experiments were performed on animal tumors to map out its vasculature and hypoxia level, which were superimposed on co-registered US images. The real-time system allows capturing co-registered US/PA images free of motion artifacts and also provides dynamitic information when contrast agents are used.

  2. Management tools for the 21st century environmental office: The role of office automation and information technology. Final report

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Fittipaldi, J.J.; Sliwinski, B.J.

    1991-06-01

    Army environmental planning and compliance activities continue to grow in magnitude and complexity, straining the resources of installation environmental offices. New efficiencies must be found to meet the increasing demands of planning and compliance imperatives. This study examined how office automation/information technology (OA/IT) may boost productivity in U.S. Army Training and Doctrine Command (TRADOC) installation environmental offices between now and the year 2000. A survey of four TRADOC installation environmental offices revealed that the workload often exceeds the capacity of staff. Computer literacy among personnel varies widely, limiting the benefits available from OA/IT now in use. Since environmental personnel aremore » primarily gatherers and processors of information, better implementation of OA/IT could substantially improve work quality and productivity. Advanced technologies expected to reach the consumer market during the 1990s will dramatically increase the potential productivity of environmental office personnel. Multitasking operating environments will allow simultaneous automation of communications, document processing, and engineering software. Increased processor power and parallel processing techniques will spur simplification of the user interface and greater software capabilities in general. The authors conclude that full implementation of this report's OA/IT recommendations could double TRADOC environmental office productivity by the year 2000.« less

  3. LIBS data analysis using a predictor-corrector based digital signal processor algorithm

    NASA Astrophysics Data System (ADS)

    Sanders, Alex; Griffin, Steven T.; Robinson, Aaron

    2012-06-01

    There are many accepted sensor technologies for generating spectra for material classification. Once the spectra are generated, communication bandwidth limitations favor local material classification with its attendant reduction in data transfer rates and power consumption. Transferring sensor technologies such as Cavity Ring-Down Spectroscopy (CRDS) and Laser Induced Breakdown Spectroscopy (LIBS) require effective material classifiers. A result of recent efforts has been emphasis on Partial Least Squares - Discriminant Analysis (PLS-DA) and Principle Component Analysis (PCA). Implementation of these via general purpose computers is difficult in small portable sensor configurations. This paper addresses the creation of a low mass, low power, robust hardware spectra classifier for a limited set of predetermined materials in an atmospheric matrix. Crucial to this is the incorporation of PCA or PLS-DA classifiers into a predictor-corrector style implementation. The system configuration guarantees rapid convergence. Software running on multi-core Digital Signal Processor (DSPs) simulates a stream-lined plasma physics model estimator, reducing Analog-to-Digital (ADC) power requirements. This paper presents the results of a predictorcorrector model implemented on a low power multi-core DSP to perform substance classification. This configuration emphasizes the hardware system and software design via a predictor corrector model that simultaneously decreases the sample rate while performing the classification.

  4. Feasibility study of parallel optical correlation-decoding analysis of lightning

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Descour, M.R.; Sweatt, W.C.; Elliott, G.R.

    The optical correlator described in this report is intended to serve as an attention-focusing processor. The objective is to narrowly bracket the range of a parameter value that characterizes the correlator input. The input is a waveform collected by a satellite-borne receiver. In the correlator, this waveform is simultaneously correlated with an ensemble of ionosphere impulse-response functions, each corresponding to a different total-electron-count (TEC) value. We have found that correlation is an effective method of bracketing the range of TEC values likely to be represented by the input waveform. High accuracy in a computational sense is not required of themore » correlator. Binarization of the impulse-response functions and the input waveforms prior to correlation results in a lower correlation-peak-to-background-fluctuation (signal-to-noise) ratio than the peak that is obtained when all waveforms retain their grayscale values. The results presented in this report were obtained by means of an acousto-optic correlator previously developed at SNL as well as by simulation. An optical-processor architecture optimized for 1D correlation of long waveforms characteristic of this application is described. Discussions of correlator components, such as optics, acousto-optic cells, digital micromirror devices, laser diodes, and VCSELs are included.« less

  5. Results of the Alternative Water Processor Test, A Novel Technology for Exploration Wastewater Remediation

    NASA Technical Reports Server (NTRS)

    Vega, Leticia; Meyer, Caitlin

    2016-01-01

    Biologically-based water recovery systems are a regenerative, low energy alternative to physiochemical processes to reclaim water from wastewater. This paper summarizes the results of the Alternative Water Processor (AWP) test conducted over one year. The AWP recovered 90% of water from four crewmembers using (4) membrane aerated bioreactors (MABRs) to remove carbon and nitrogen from an exploration mission wastewater, including urine, hygiene, laundry and humidity condensate. Downstream, a coupled forward and reverse osmosis system removed large organics and inorganic salts from the biological system effluent. The system exceeded the overall objectives of the test by recovering 90% of the influent wastewater processed and a 29% reduction of consumables from the current state of the art water recovery system on the International Space Station (ISS). However the biological system fell short of its test goals, failing to remove 75% and 90% of the influent ammonium and organic carbon, respectively. Despite not meeting its test goals, the BWP demonstrated the feasibility of an attached-growth biological system for simultaneous nitrification and denitrification, an innovative, volume and consumable-saving design that doesn't require toxic pretreatment. This paper will explain the reasons for this and will discuss steps to optimize each subsystem to increase effluent quality from the MABRs and the FOST to advance the system.

  6. Results of the Alternative Water Processor Test, A Novel Technology for Exploration Wastewater Remediation

    NASA Technical Reports Server (NTRS)

    Vega, Leticia; Meyer, Caitlin

    2015-01-01

    Biologically-based water recovery systems are a regenerative, low energy alternative to physiochemical processes to reclaim water from wastewater. This paper summarizes the results of the Alternative Water Processor (AWP) test conducted over one year. The AWP recovered 90% of water from four crewmembers using (4) membrane aerated bioreactors (MABRs) to remove carbon and nitrogen from an exploration mission wastewater, including urine, hygiene, laundry and humidity condensate. Downstream, a coupled forward and reverse osmosis system removed large organics and inorganic salts from the biological system effluent. The system exceeded the overall objectives of the test by recovering 90% of the influent wastewater processed and a 29% reduction of consumables from the current state of the art water recovery system on the International Space Station (ISS). However the biological system fell short of its test goals, failing to remove 75% and 90% of the influent ammonium and organic carbon, respectively. Despite not meeting its test goals, the BWP demonstrated the feasibility of an attachedgrowth biological system for simultaneous nitrification and denitrification, an innovative, volume and consumable-saving design that doesn't require toxic pretreatment. This paper will explain the reasons for this and will discuss steps to optimize each subsystem to increase effluent quality from the MABRs and the FOST to advance the system.

  7. Hybrid Electro-Optic Processor

    DTIC Science & Technology

    1991-07-01

    This report describes the design of a hybrid electro - optic processor to perform adaptive interference cancellation in radar systems. The processor is...modulator is reported. Included is this report is a discussion of the design, partial fabrication in the laboratory, and partial testing of the hybrid electro ... optic processor. A follow on effort is planned to complete the construction and testing of the processor. The work described in this report is the

  8. JPRS Report, Science & Technology, Europe.

    DTIC Science & Technology

    1991-04-30

    processor in collaboration with Intel . The processor , christened Touchstone, will be used as the core of a parallel computer with 2,000 processors . One of...ELECTRONIQUE HEBDO in French 24 Jan 91 pp 14-15 [Article by Claire Remy: "Everything Set for Neural Signal Processors " first paragraph is ELECTRONIQUE...paving the way for neural signal processors in so doing. The principal advantage of this specific circuit over a neuromimetic software program is

  9. Processor register error correction management

    DOEpatents

    Bose, Pradip; Cher, Chen-Yong; Gupta, Meeta S.

    2016-12-27

    Processor register protection management is disclosed. In embodiments, a method of processor register protection management can include determining a sensitive logical register for executable code generated by a compiler, generating an error-correction table identifying the sensitive logical register, and storing the error-correction table in a memory accessible by a processor. The processor can be configured to generate a duplicate register of the sensitive logical register identified by the error-correction table.

  10. The CSM testbed matrix processors internal logic and dataflow descriptions

    NASA Technical Reports Server (NTRS)

    Regelbrugge, Marc E.; Wright, Mary A.

    1988-01-01

    This report constitutes the final report for subtask 1 of Task 5 of NASA Contract NAS1-18444, Computational Structural Mechanics (CSM) Research. This report contains a detailed description of the coded workings of selected CSM Testbed matrix processors (i.e., TOPO, K, INV, SSOL) and of the arithmetic utility processor AUS. These processors and the current sparse matrix data structures are studied and documented. Items examined include: details of the data structures, interdependence of data structures, data-blocking logic in the data structures, processor data flow and architecture, and processor algorithmic logic flow.

  11. Parallel processor for real-time structural control

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Tise, B.L.

    1992-01-01

    A parallel processor that is optimized for real-time linear control has been developed. This modular system consists of A/D modules, D/A modules, and floating-point processor modules. The scalable processor uses up to 1,000 Motorola DSP96002 floating-point processors for a peak computational rate of 60 GFLOPS. Sampling rates up to 625 kHz are supported by this analog-in to analog-out controller. The high processing rate and parallel architecture make this processor suitable for computing state-space equations and other multiply/accumulate-intensive digital filters. Processor features include 14-bit conversion devices, low input-output latency, 240 Mbyte/s synchronous backplane bus, low-skew clock distribution circuit, VME connection tomore » host computer, parallelizing code generator, and look-up-tables for actuator linearization. This processor was designed primarily for experiments in structural control. The A/D modules sample sensors mounted on the structure and the floating-point processor modules compute the outputs using the programmed control equations. The outputs are sent through the D/A module to the power amps used to drive the structure's actuators. The host computer is a Sun workstation. An Open Windows-based control panel is provided to facilitate data transfer to and from the processor, as well as to control the operating mode of the processor. A diagnostic mode is provided to allow stimulation of the structure and acquisition of the structural response via sensor inputs.« less

  12. 7 CFR 1435.310 - Sharing processors' allocations with producers.

    Code of Federal Regulations, 2011 CFR

    2011-01-01

    ... CREDIT CORPORATION, DEPARTMENT OF AGRICULTURE LOANS, PURCHASES, AND OTHER OPERATIONS SUGAR PROGRAM Flexible Marketing Allotments For Sugar § 1435.310 Sharing processors' allocations with producers. (a) Every sugar beet and sugarcane processor must provide CCC a certification that: (1) The processor...

  13. 7 CFR 1435.310 - Sharing processors' allocations with producers.

    Code of Federal Regulations, 2010 CFR

    2010-01-01

    ... CREDIT CORPORATION, DEPARTMENT OF AGRICULTURE LOANS, PURCHASES, AND OTHER OPERATIONS SUGAR PROGRAM Flexible Marketing Allotments For Sugar § 1435.310 Sharing processors' allocations with producers. (a) Every sugar beet and sugarcane processor must provide CCC a certification that: (1) The processor...

  14. 7 CFR 1435.310 - Sharing processors' allocations with producers.

    Code of Federal Regulations, 2012 CFR

    2012-01-01

    ... CREDIT CORPORATION, DEPARTMENT OF AGRICULTURE LOANS, PURCHASES, AND OTHER OPERATIONS SUGAR PROGRAM Flexible Marketing Allotments For Sugar § 1435.310 Sharing processors' allocations with producers. (a) Every sugar beet and sugarcane processor must provide CCC a certification that: (1) The processor...

  15. 7 CFR 1435.310 - Sharing processors' allocations with producers.

    Code of Federal Regulations, 2014 CFR

    2014-01-01

    ... CREDIT CORPORATION, DEPARTMENT OF AGRICULTURE LOANS, PURCHASES, AND OTHER OPERATIONS SUGAR PROGRAM Flexible Marketing Allotments For Sugar § 1435.310 Sharing processors' allocations with producers. (a) Every sugar beet and sugarcane processor must provide CCC a certification that: (1) The processor...

  16. 7 CFR 1435.310 - Sharing processors' allocations with producers.

    Code of Federal Regulations, 2013 CFR

    2013-01-01

    ... CREDIT CORPORATION, DEPARTMENT OF AGRICULTURE LOANS, PURCHASES, AND OTHER OPERATIONS SUGAR PROGRAM Flexible Marketing Allotments For Sugar § 1435.310 Sharing processors' allocations with producers. (a) Every sugar beet and sugarcane processor must provide CCC a certification that: (1) The processor...

  17. 40 CFR 791.45 - Processors.

    Code of Federal Regulations, 2010 CFR

    2010-07-01

    ...) When a test rule or subsequent Federal Register notice pertaining to a test rule expressly obligates processors as well as manufacturers to assume direct testing and data reimbursement responsibilities. (2... processors voluntarily agree to reimburse manufacturers for a portion of test costs. Only those processors...

  18. Method and structure for skewed block-cyclic distribution of lower-dimensional data arrays in higher-dimensional processor grids

    DOEpatents

    Chatterjee, Siddhartha [Yorktown Heights, NY; Gunnels, John A [Brewster, NY

    2011-11-08

    A method and structure of distributing elements of an array of data in a computer memory to a specific processor of a multi-dimensional mesh of parallel processors includes designating a distribution of elements of at least a portion of the array to be executed by specific processors in the multi-dimensional mesh of parallel processors. The pattern of the designating includes a cyclical repetitive pattern of the parallel processor mesh, as modified to have a skew in at least one dimension so that both a row of data in the array and a column of data in the array map to respective contiguous groupings of the processors such that a dimension of the contiguous groupings is greater than one.

  19. A Multithreaded Missions And Means Framework (MMF) Concept Report

    DTIC Science & Technology

    2012-03-01

    Vasconcelos , W.; Gibson, C.; Bar-Noy, A.; Borowiecki, K.; La Porta, T.; Pizzocaro, D.; Rowaihy, H.; Pearson, G.; Pham, T. An Ontology Centric...M.; de Mel, G.; Vasconcelos , W.; Sleeman, D.; Colley, S.; La Porta, T. An Ontology-Based Approach to Sensor-Mission Assignment. Proceedings of the...1st Annual Conference of the International Technology Alliance (ACITA 2007), 2007. Preece, A.; Gomez, M.; de Mel, G.; Vasconcelos , W.; Sleeman, D

  20. Systematic and Scalable Testing of Concurrent Programs

    DTIC Science & Technology

    2013-12-16

    The evaluation of CHESS [107] checked eight different programs ranging from process management libraries to a distributed execution engine to a research...tool (§3.1) targets systematic testing of scheduling nondeterminism in multi- threaded components of the Omega cluster management system [129], while...tool for systematic testing of multithreaded com- ponents of the Omega cluster management system [129]. In particular, §3.1.1 defines a model for

  1. Development and study of a parallel algorithm of iteratively forming latent functionally-determined structures for classification and analysis of meteorological data

    NASA Astrophysics Data System (ADS)

    Sorokin, V. A.; Volkov, Yu V.; Sherstneva, A. I.; Botygin, I. A.

    2016-11-01

    This paper overviews a method of generating climate regions based on an analytic signal theory. When applied to atmospheric surface layer temperature data sets, the method allows forming climatic structures with the corresponding changes in the temperature to make conclusions on the uniformity of climate in an area and to trace the climate changes in time by analyzing the type group shifts. The algorithm is based on the fact that the frequency spectrum of the thermal oscillation process is narrow-banded and has only one mode for most weather stations. This allows using the analytic signal theory, causality conditions and introducing an oscillation phase. The annual component of the phase, being a linear function, was removed by the least squares method. The remaining phase fluctuations allow consistent studying of their coordinated behavior and timing, using the Pearson correlation coefficient for dependence evaluation. This study includes program experiments to evaluate the calculation efficiency in the phase grouping task. The paper also overviews some single-threaded and multi-threaded computing models. It is shown that the phase grouping algorithm for meteorological data can be parallelized and that a multi-threaded implementation leads to a 25-30% increase in the performance.

  2. Application of Intel Many Integrated Core (MIC) architecture to the Yonsei University planetary boundary layer scheme in Weather Research and Forecasting model

    NASA Astrophysics Data System (ADS)

    Huang, Melin; Huang, Bormin; Huang, Allen H.

    2014-10-01

    The Weather Research and Forecasting (WRF) model provided operational services worldwide in many areas and has linked to our daily activity, in particular during severe weather events. The scheme of Yonsei University (YSU) is one of planetary boundary layer (PBL) models in WRF. The PBL is responsible for vertical sub-grid-scale fluxes due to eddy transports in the whole atmospheric column, determines the flux profiles within the well-mixed boundary layer and the stable layer, and thus provide atmospheric tendencies of temperature, moisture (including clouds), and horizontal momentum in the entire atmospheric column. The YSU scheme is very suitable for massively parallel computation as there are no interactions among horizontal grid points. To accelerate the computation process of the YSU scheme, we employ Intel Many Integrated Core (MIC) Architecture as it is a multiprocessor computer structure with merits of efficient parallelization and vectorization essentials. Our results show that the MIC-based optimization improved the performance of the first version of multi-threaded code on Xeon Phi 5110P by a factor of 2.4x. Furthermore, the same CPU-based optimizations improved the performance on Intel Xeon E5-2603 by a factor of 1.6x as compared to the first version of multi-threaded code.

  3. In-Memory Graph Databases for Web-Scale Data

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Castellana, Vito G.; Morari, Alessandro; Weaver, Jesse R.

    RDF databases have emerged as one of the most relevant way for organizing, integrating, and managing expo- nentially growing, often heterogeneous, and not rigidly structured data for a variety of scientific and commercial fields. In this paper we discuss the solutions integrated in GEMS (Graph database Engine for Multithreaded Systems), a software framework for implementing RDF databases on commodity, distributed-memory high-performance clusters. Unlike the majority of current RDF databases, GEMS has been designed from the ground up to primarily employ graph-based methods. This is reflected in all the layers of its stack. The GEMS framework is composed of: a SPARQL-to-C++more » compiler, a library of data structures and related methods to access and modify them, and a custom runtime providing lightweight software multithreading, network messages aggregation and a partitioned global address space. We provide an overview of the framework, detailing its component and how they have been closely designed and customized to address issues of graph methods applied to large-scale datasets on clusters. We discuss in details the principles that enable automatic translation of the queries (expressed in SPARQL, the query language of choice for RDF databases) to graph methods, and identify differences with respect to other RDF databases.« less

  4. A parallel approach of COFFEE objective function to multiple sequence alignment

    NASA Astrophysics Data System (ADS)

    Zafalon, G. F. D.; Visotaky, J. M. V.; Amorim, A. R.; Valêncio, C. R.; Neves, L. A.; de Souza, R. C. G.; Machado, J. M.

    2015-09-01

    The computational tools to assist genomic analyzes show even more necessary due to fast increasing of data amount available. With high computational costs of deterministic algorithms for sequence alignments, many works concentrate their efforts in the development of heuristic approaches to multiple sequence alignments. However, the selection of an approach, which offers solutions with good biological significance and feasible execution time, is a great challenge. Thus, this work aims to show the parallelization of the processing steps of MSA-GA tool using multithread paradigm in the execution of COFFEE objective function. The standard objective function implemented in the tool is the Weighted Sum of Pairs (WSP), which produces some distortions in the final alignments when sequences sets with low similarity are aligned. Then, in studies previously performed we implemented the COFFEE objective function in the tool to smooth these distortions. Although the nature of COFFEE objective function implies in the increasing of execution time, this approach presents points, which can be executed in parallel. With the improvements implemented in this work, we can verify the execution time of new approach is 24% faster than the sequential approach with COFFEE. Moreover, the COFFEE multithreaded approach is more efficient than WSP, because besides it is slightly fast, its biological results are better.

  5. A Registration Method Based on Contour Point Cloud for 3D Whole-Body PET and CT Images

    PubMed Central

    Yang, Qiyao; Wang, Zhiguo; Zhang, Guoxu

    2017-01-01

    The PET and CT fusion image, combining the anatomical and functional information, has important clinical meaning. An effective registration of PET and CT images is the basis of image fusion. This paper presents a multithread registration method based on contour point cloud for 3D whole-body PET and CT images. Firstly, a geometric feature-based segmentation (GFS) method and a dynamic threshold denoising (DTD) method are creatively proposed to preprocess CT and PET images, respectively. Next, a new automated trunk slices extraction method is presented for extracting feature point clouds. Finally, the multithread Iterative Closet Point is adopted to drive an affine transform. We compare our method with a multiresolution registration method based on Mattes Mutual Information on 13 pairs (246~286 slices per pair) of 3D whole-body PET and CT data. Experimental results demonstrate the registration effectiveness of our method with lower negative normalization correlation (NC = −0.933) on feature images and less Euclidean distance error (ED = 2.826) on landmark points, outperforming the source data (NC = −0.496, ED = 25.847) and the compared method (NC = −0.614, ED = 16.085). Moreover, our method is about ten times faster than the compared one. PMID:28316979

  6. Geant4 Computing Performance Benchmarking and Monitoring

    DOE PAGES

    Dotti, Andrea; Elvira, V. Daniel; Folger, Gunter; ...

    2015-12-23

    Performance evaluation and analysis of large scale computing applications is essential for optimal use of resources. As detector simulation is one of the most compute intensive tasks and Geant4 is the simulation toolkit most widely used in contemporary high energy physics (HEP) experiments, it is important to monitor Geant4 through its development cycle for changes in computing performance and to identify problems and opportunities for code improvements. All Geant4 development and public releases are being profiled with a set of applications that utilize different input event samples, physics parameters, and detector configurations. Results from multiple benchmarking runs are compared tomore » previous public and development reference releases to monitor CPU and memory usage. Observed changes are evaluated and correlated with code modifications. Besides the full summary of call stack and memory footprint, a detailed call graph analysis is available to Geant4 developers for further analysis. The set of software tools used in the performance evaluation procedure, both in sequential and multi-threaded modes, include FAST, IgProf and Open|Speedshop. In conclusion, the scalability of the CPU time and memory performance in multi-threaded application is evaluated by measuring event throughput and memory gain as a function of the number of threads for selected event samples.« less

  7. MEGAnnotator: a user-friendly pipeline for microbial genomes assembly and annotation.

    PubMed

    Lugli, Gabriele Andrea; Milani, Christian; Mancabelli, Leonardo; van Sinderen, Douwe; Ventura, Marco

    2016-04-01

    Genome annotation is one of the key actions that must be undertaken in order to decipher the genetic blueprint of organisms. Thus, a correct and reliable annotation is essential in rendering genomic data valuable. Here, we describe a bioinformatics pipeline based on freely available software programs coordinated by a multithreaded script named MEGAnnotator (Multithreaded Enhanced prokaryotic Genome Annotator). This pipeline allows the generation of multiple annotated formats fulfilling the NCBI guidelines for assembled microbial genome submission, based on DNA shotgun sequencing reads, and minimizes manual intervention, while also reducing waiting times between software program executions and improving final quality of both assembly and annotation outputs. MEGAnnotator provides an efficient way to pre-arrange the assembly and annotation work required to process NGS genome sequence data. The script improves the final quality of microbial genome annotation by reducing ambiguous annotations. Moreover, the MEGAnnotator platform allows the user to perform a partial annotation of pre-assembled genomes and includes an option to accomplish metagenomic data set assemblies. MEGAnnotator platform will be useful for microbiologists interested in genome analyses of bacteria as well as those investigating the complexity of microbial communities that do not possess the necessary skills to prepare their own bioinformatics pipeline. © FEMS 2016. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com.

  8. Presentation of a large amount of moving objects in a virtual environment

    NASA Astrophysics Data System (ADS)

    Ye, Huanzhuo; Gong, Jianya; Ye, Jing

    2004-05-01

    It needs a lot of consideration to manage the presentation of a large amount of moving objects in virtual environment. Motion state model (MSM) is used to represent the motion of objects and 2n tree is used to index the motion data stored in database or files. To minimize the necessary memory occupation for static models, cache with LRU or FIFO refreshing is introduced. DCT and wavelet work well with different playback speeds of motion presentation because they can filter low frequencies from motion data and adjust the filter according to playback speed. Since large amount of data are continuously retrieved, calculated, used for displaying, and then discarded, multithreading technology is naturally employed though single thread with carefully arranged data retrieval also works well when the number of objects is not very big. With multithreading, the level of concurrence should be placed at data retrieval, where waiting may occur, rather than at calculating or displaying, and synchronization should be carefully arranged to make sure that different threads can collaborate well. Collision detection is not needed when playing with history data and sampled current data; however, it is necessary for spatial state prediction. When the current state is presented, either predicting-adjusting method or late updating method could be used according to the users' preference.

  9. Variable word length encoder reduces TV bandwith requirements

    NASA Technical Reports Server (NTRS)

    Sivertson, W. E., Jr.

    1965-01-01

    Adaptive variable resolution encoding technique provides an adaptive compression pseudo-random noise signal processor for reducing television bandwidth requirements. Complementary processors are required in both the transmitting and receiving systems. The pretransmission processor is analog-to-digital, while the postreception processor is digital-to-analog.

  10. Accelerating molecular dynamic simulation on the cell processor and Playstation 3.

    PubMed

    Luttmann, Edgar; Ensign, Daniel L; Vaidyanathan, Vishal; Houston, Mike; Rimon, Noam; Øland, Jeppe; Jayachandran, Guha; Friedrichs, Mark; Pande, Vijay S

    2009-01-30

    Implementation of molecular dynamics (MD) calculations on novel architectures will vastly increase its power to calculate the physical properties of complex systems. Herein, we detail algorithmic advances developed to accelerate MD simulations on the Cell processor, a commodity processor found in PlayStation 3 (PS3). In particular, we discuss issues regarding memory access versus computation and the types of calculations which are best suited for streaming processors such as the Cell, focusing on implicit solvation models. We conclude with a comparison of improved performance on the PS3's Cell processor over more traditional processors. (c) 2008 Wiley Periodicals, Inc.

  11. Allocating application to group of consecutive processors in fault-tolerant deadlock-free routing path defined by routers obeying same rules for path selection

    DOEpatents

    Leung, Vitus J [Albuquerque, NM; Phillips, Cynthia A [Albuquerque, NM; Bender, Michael A [East Northport, NY; Bunde, David P [Urbana, IL

    2009-07-21

    In a multiple processor computing apparatus, directional routing restrictions and a logical channel construct permit fault tolerant, deadlock-free routing. Processor allocation can be performed by creating a linear ordering of the processors based on routing rules used for routing communications between the processors. The linear ordering can assume a loop configuration, and bin-packing is applied to this loop configuration. The interconnection of the processors can be conceptualized as a generally rectangular 3-dimensional grid, and the MC allocation algorithm is applied with respect to the 3-dimensional grid.

  12. Communications systems and methods for subsea processors

    DOEpatents

    Gutierrez, Jose; Pereira, Luis

    2016-04-26

    A subsea processor may be located near the seabed of a drilling site and used to coordinate operations of underwater drilling components. The subsea processor may be enclosed in a single interchangeable unit that fits a receptor on an underwater drilling component, such as a blow-out preventer (BOP). The subsea processor may issue commands to control the BOP and receive measurements from sensors located throughout the BOP. A shared communications bus may interconnect the subsea processor and underwater components and the subsea processor and a surface or onshore network. The shared communications bus may be operated according to a time division multiple access (TDMA) scheme.

  13. An Efficient Functional Test Generation Method For Processors Using Genetic Algorithms

    NASA Astrophysics Data System (ADS)

    Hudec, Ján; Gramatová, Elena

    2015-07-01

    The paper presents a new functional test generation method for processors testing based on genetic algorithms and evolutionary strategies. The tests are generated over an instruction set architecture and a processor description. Such functional tests belong to the software-oriented testing. Quality of the tests is evaluated by code coverage of the processor description using simulation. The presented test generation method uses VHDL models of processors and the professional simulator ModelSim. The rules, parameters and fitness functions were defined for various genetic algorithms used in automatic test generation. Functionality and effectiveness were evaluated using the RISC type processor DP32.

  14. Experimental testing of the noise-canceling processor.

    PubMed

    Collins, Michael D; Baer, Ralph N; Simpson, Harry J

    2011-09-01

    Signal-processing techniques for localizing an acoustic source buried in noise are tested in a tank experiment. Noise is generated using a discrete source, a bubble generator, and a sprinkler. The experiment has essential elements of a realistic scenario in matched-field processing, including complex source and noise time series in a waveguide with water, sediment, and multipath propagation. The noise-canceling processor is found to outperform the Bartlett processor and provide the correct source range for signal-to-noise ratios below -10 dB. The multivalued Bartlett processor is found to outperform the Bartlett processor but not the noise-canceling processor. © 2011 Acoustical Society of America

  15. A High Performance VLSI Computer Architecture For Computer Graphics

    NASA Astrophysics Data System (ADS)

    Chin, Chi-Yuan; Lin, Wen-Tai

    1988-10-01

    A VLSI computer architecture, consisting of multiple processors, is presented in this paper to satisfy the modern computer graphics demands, e.g. high resolution, realistic animation, real-time display etc.. All processors share a global memory which are partitioned into multiple banks. Through a crossbar network, data from one memory bank can be broadcasted to many processors. Processors are physically interconnected through a hyper-crossbar network (a crossbar-like network). By programming the network, the topology of communication links among processors can be reconfigurated to satisfy specific dataflows of different applications. Each processor consists of a controller, arithmetic operators, local memory, a local crossbar network, and I/O ports to communicate with other processors, memory banks, and a system controller. Operations in each processor are characterized into two modes, i.e. object domain and space domain, to fully utilize the data-independency characteristics of graphics processing. Special graphics features such as 3D-to-2D conversion, shadow generation, texturing, and reflection, can be easily handled. With the current high density interconnection (MI) technology, it is feasible to implement a 64-processor system to achieve 2.5 billion operations per second, a performance needed in most advanced graphics applications.

  16. Rapid prototyping and evaluation of programmable SIMD SDR processors in LISA

    NASA Astrophysics Data System (ADS)

    Chen, Ting; Liu, Hengzhu; Zhang, Botao; Liu, Dongpei

    2013-03-01

    With the development of international wireless communication standards, there is an increase in computational requirement for baseband signal processors. Time-to-market pressure makes it impossible to completely redesign new processors for the evolving standards. Due to its high flexibility and low power, software defined radio (SDR) digital signal processors have been proposed as promising technology to replace traditional ASIC and FPGA fashions. In addition, there are large numbers of parallel data processed in computation-intensive functions, which fosters the development of single instruction multiple data (SIMD) architecture in SDR platform. So a new way must be found to prototype the SDR processors efficiently. In this paper we present a bit-and-cycle accurate model of programmable SIMD SDR processors in a machine description language LISA. LISA is a language for instruction set architecture which can gain rapid model at architectural level. In order to evaluate the availability of our proposed processor, three common baseband functions, FFT, FIR digital filter and matrix multiplication have been mapped on the SDR platform. Analytical results showed that the SDR processor achieved the maximum of 47.1% performance boost relative to the opponent processor.

  17. New Modular Ultrasonic Signal Processing Building Blocks for Real-Time Data Acquisition and Post Processing

    NASA Astrophysics Data System (ADS)

    Weber, Walter H.; Mair, H. Douglas; Jansen, Dion

    2003-03-01

    A suite of basic signal processors has been developed. These basic building blocks can be cascaded together to form more complex processors without the need for programming. The data structures between each of the processors are handled automatically. This allows a processor built for one purpose to be applied to any type of data such as images, waveform arrays and single values. The processors are part of Winspect Data Acquisition software. The new processors are fast enough to work on A-scan signals live while scanning. Their primary use is to extract features, reduce noise or to calculate material properties. The cascaded processors work equally well on live A-scan displays, live gated data or as a post-processing engine on saved data. Researchers are able to call their own MATLAB or C-code from anywhere within the processor structure. A built-in formula node processor that uses a simple algebraic editor may make external user programs unnecessary. This paper also discusses the problems associated with ad hoc software development and how graphical programming languages can tie up researchers writing software rather than designing experiments.

  18. Array processor architecture connection network

    NASA Technical Reports Server (NTRS)

    Barnes, George H. (Inventor); Lundstrom, Stephen F. (Inventor); Shafer, Philip E. (Inventor)

    1982-01-01

    A connection network is disclosed for use between a parallel array of processors and a parallel array of memory modules for establishing non-conflicting data communications paths between requested memory modules and requesting processors. The connection network includes a plurality of switching elements interposed between the processor array and the memory modules array in an Omega networking architecture. Each switching element includes a first and a second processor side port, a first and a second memory module side port, and control logic circuitry for providing data connections between the first and second processor ports and the first and second memory module ports. The control logic circuitry includes strobe logic for examining data arriving at the first and the second processor ports to indicate when the data arriving is requesting data from a requesting processor to a requested memory module. Further, connection circuitry is associated with the strobe logic for examining requesting data arriving at the first and the second processor ports for providing a data connection therefrom to the first and the second memory module ports in response thereto when the data connection so provided does not conflict with a pre-established data connection currently in use.

  19. 21 CFR 892.1900 - Automatic radiographic film processor.

    Code of Federal Regulations, 2011 CFR

    2011-04-01

    ... 21 Food and Drugs 8 2011-04-01 2011-04-01 false Automatic radiographic film processor. 892.1900... (CONTINUED) MEDICAL DEVICES RADIOLOGY DEVICES Diagnostic Devices § 892.1900 Automatic radiographic film processor. (a) Identification. An automatic radiographic film processor is a device intended to be used to...

  20. 21 CFR 892.1900 - Automatic radiographic film processor.

    Code of Federal Regulations, 2013 CFR

    2013-04-01

    ... 21 Food and Drugs 8 2013-04-01 2013-04-01 false Automatic radiographic film processor. 892.1900... (CONTINUED) MEDICAL DEVICES RADIOLOGY DEVICES Diagnostic Devices § 892.1900 Automatic radiographic film processor. (a) Identification. An automatic radiographic film processor is a device intended to be used to...

  1. 21 CFR 892.1900 - Automatic radiographic film processor.

    Code of Federal Regulations, 2014 CFR

    2014-04-01

    ... 21 Food and Drugs 8 2014-04-01 2014-04-01 false Automatic radiographic film processor. 892.1900... (CONTINUED) MEDICAL DEVICES RADIOLOGY DEVICES Diagnostic Devices § 892.1900 Automatic radiographic film processor. (a) Identification. An automatic radiographic film processor is a device intended to be used to...

  2. 21 CFR 892.1900 - Automatic radiographic film processor.

    Code of Federal Regulations, 2012 CFR

    2012-04-01

    ... 21 Food and Drugs 8 2012-04-01 2012-04-01 false Automatic radiographic film processor. 892.1900... (CONTINUED) MEDICAL DEVICES RADIOLOGY DEVICES Diagnostic Devices § 892.1900 Automatic radiographic film processor. (a) Identification. An automatic radiographic film processor is a device intended to be used to...

  3. 7 CFR 1160.108 - Fluid milk processor.

    Code of Federal Regulations, 2013 CFR

    2013-01-01

    ... 7 Agriculture 9 2013-01-01 2013-01-01 false Fluid milk processor. 1160.108 Section 1160.108... AGREEMENTS AND ORDERS; MILK), DEPARTMENT OF AGRICULTURE FLUID MILK PROMOTION PROGRAM Fluid Milk Promotion Order Definitions § 1160.108 Fluid milk processor. (a) Fluid milk processor means any person who...

  4. 7 CFR 1160.108 - Fluid milk processor.

    Code of Federal Regulations, 2012 CFR

    2012-01-01

    ... 7 Agriculture 9 2012-01-01 2012-01-01 false Fluid milk processor. 1160.108 Section 1160.108... Agreements and Orders; Milk), DEPARTMENT OF AGRICULTURE FLUID MILK PROMOTION PROGRAM Fluid Milk Promotion Order Definitions § 1160.108 Fluid milk processor. (a) Fluid milk processor means any person who...

  5. 7 CFR 1160.108 - Fluid milk processor.

    Code of Federal Regulations, 2014 CFR

    2014-01-01

    ... 7 Agriculture 9 2014-01-01 2013-01-01 true Fluid milk processor. 1160.108 Section 1160.108... AGREEMENTS AND ORDERS; MILK), DEPARTMENT OF AGRICULTURE FLUID MILK PROMOTION PROGRAM Fluid Milk Promotion Order Definitions § 1160.108 Fluid milk processor. (a) Fluid milk processor means any person who...

  6. 21 CFR 892.1900 - Automatic radiographic film processor.

    Code of Federal Regulations, 2010 CFR

    2010-04-01

    ... 21 Food and Drugs 8 2010-04-01 2010-04-01 false Automatic radiographic film processor. 892.1900... (CONTINUED) MEDICAL DEVICES RADIOLOGY DEVICES Diagnostic Devices § 892.1900 Automatic radiographic film processor. (a) Identification. An automatic radiographic film processor is a device intended to be used to...

  7. 7 CFR 1160.108 - Fluid milk processor.

    Code of Federal Regulations, 2010 CFR

    2010-01-01

    ... 7 Agriculture 9 2010-01-01 2009-01-01 true Fluid milk processor. 1160.108 Section 1160.108... Agreements and Orders; Milk), DEPARTMENT OF AGRICULTURE FLUID MILK PROMOTION PROGRAM Fluid Milk Promotion Order Definitions § 1160.108 Fluid milk processor. (a) Fluid milk processor means any person who...

  8. 7 CFR 1160.108 - Fluid milk processor.

    Code of Federal Regulations, 2011 CFR

    2011-01-01

    ... 7 Agriculture 9 2011-01-01 2011-01-01 false Fluid milk processor. 1160.108 Section 1160.108... Agreements and Orders; Milk), DEPARTMENT OF AGRICULTURE FLUID MILK PROMOTION PROGRAM Fluid Milk Promotion Order Definitions § 1160.108 Fluid milk processor. (a) Fluid milk processor means any person who...

  9. Shared performance monitor in a multiprocessor system

    DOEpatents

    Chiu, George; Gara, Alan G; Salapura, Valentina

    2014-12-02

    A performance monitoring unit (PMU) and method for monitoring performance of events occurring in a multiprocessor system. The multiprocessor system comprises a plurality of processor devices units, each processor device for generating signals representing occurrences of events in the processor device, and, a single shared counter resource for performance monitoring. The performance monitor unit is shared by all processor cores in the multiprocessor system. The PMU is further programmed to monitor event signals issued from non-processor devices.

  10. Noncoherent parallel optical processor for discrete two-dimensional linear transformations.

    PubMed

    Glaser, I

    1980-10-01

    We describe a parallel optical processor, based on a lenslet array, that provides general linear two-dimensional transformations using noncoherent light. Such a processor could become useful in image- and signal-processing applications in which the throughput requirements cannot be adequately satisfied by state-of-the-art digital processors. Experimental results that illustrate the feasibility of the processor by demonstrating its use in parallel optical computation of the two-dimensional Walsh-Hadamard transformation are presented.

  11. Processors for wavelet analysis and synthesis: NIFS and TI-C80 MVP

    NASA Astrophysics Data System (ADS)

    Brooks, Geoffrey W.

    1996-03-01

    Two processors are considered for image quadrature mirror filtering (QMF). The neuromorphic infrared focal-plane sensor (NIFS) is an existing prototype analog processor offering high speed spatio-temporal Gaussian filtering, which could be used for the QMF low- pass function, and difference of Gaussian filtering, which could be used for the QMF high- pass function. Although not designed specifically for wavelet analysis, the biologically- inspired system accomplishes the most computationally intensive part of QMF processing. The Texas Instruments (TI) TMS320C80 Multimedia Video Processor (MVP) is a 32-bit RISC master processor with four advanced digital signal processors (DSPs) on a single chip. Algorithm partitioning, memory management and other issues are considered for optimal performance. This paper presents these considerations with simulated results leading to processor implementation of high-speed QMF analysis and synthesis.

  12. 77 FR 124 - Biological Processors of Alabama; Decatur, Morgan County, AL; Notice of Settlement

    Federal Register 2010, 2011, 2012, 2013, 2014

    2012-01-03

    ... ENVIRONMENTAL PROTECTION AGENCY [FRL-9612-9] Biological Processors of Alabama; Decatur, Morgan... reimbursement of past response costs concerning the Biological Processors of Alabama Superfund Site located in... Ms. Paula V. Painter. Submit your comments by Site name Biological Processors of Alabama Superfund...

  13. Multiple core computer processor with globally-accessible local memories

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Shalf, John; Donofrio, David; Oliker, Leonid

    A multi-core computer processor including a plurality of processor cores interconnected in a Network-on-Chip (NoC) architecture, a plurality of caches, each of the plurality of caches being associated with one and only one of the plurality of processor cores, and a plurality of memories, each of the plurality of memories being associated with a different set of at least one of the plurality of processor cores and each of the plurality of memories being configured to be visible in a global memory address space such that the plurality of memories are visible to two or more of the plurality ofmore » processor cores.« less

  14. Scalable load balancing for massively parallel distributed Monte Carlo particle transport

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    O'Brien, M. J.; Brantley, P. S.; Joy, K. I.

    2013-07-01

    In order to run computer simulations efficiently on massively parallel computers with hundreds of thousands or millions of processors, care must be taken that the calculation is load balanced across the processors. Examining the workload of every processor leads to an unscalable algorithm, with run time at least as large as O(N), where N is the number of processors. We present a scalable load balancing algorithm, with run time 0(log(N)), that involves iterated processor-pair-wise balancing steps, ultimately leading to a globally balanced workload. We demonstrate scalability of the algorithm up to 2 million processors on the Sequoia supercomputer at Lawrencemore » Livermore National Laboratory. (authors)« less

  15. Parallel processor-based raster graphics system architecture

    DOEpatents

    Littlefield, Richard J.

    1990-01-01

    An apparatus for generating raster graphics images from the graphics command stream includes a plurality of graphics processors connected in parallel, each adapted to receive any part of the graphics command stream for processing the command stream part into pixel data. The apparatus also includes a frame buffer for mapping the pixel data to pixel locations and an interconnection network for interconnecting the graphics processors to the frame buffer. Through the interconnection network, each graphics processor may access any part of the frame buffer concurrently with another graphics processor accessing any other part of the frame buffer. The plurality of graphics processors can thereby transmit concurrently pixel data to pixel locations in the frame buffer.

  16. Performance evaluation of throughput computing workloads using multi-core processors and graphics processors

    NASA Astrophysics Data System (ADS)

    Dave, Gaurav P.; Sureshkumar, N.; Blessy Trencia Lincy, S. S.

    2017-11-01

    Current trend in processor manufacturing focuses on multi-core architectures rather than increasing the clock speed for performance improvement. Graphic processors have become as commodity hardware for providing fast co-processing in computer systems. Developments in IoT, social networking web applications, big data created huge demand for data processing activities and such kind of throughput intensive applications inherently contains data level parallelism which is more suited for SIMD architecture based GPU. This paper reviews the architectural aspects of multi/many core processors and graphics processors. Different case studies are taken to compare performance of throughput computing applications using shared memory programming in OpenMP and CUDA API based programming.

  17. Eigensolution of finite element problems in a completely connected parallel architecture

    NASA Technical Reports Server (NTRS)

    Akl, F.; Morel, M.

    1989-01-01

    A parallel algorithm is presented for the solution of the generalized eigenproblem in linear elastic finite element analysis. The algorithm is based on a completely connected parallel architecture in which each processor is allowed to communicate with all other processors. The algorithm is successfully implemented on a tightly coupled MIMD parallel processor. A finite element model is divided into m domains each of which is assumed to process n elements. Each domain is then assigned to a processor or to a logical processor (task) if the number of domains exceeds the number of physical processors. The effect of the number of domains, the number of degrees-of-freedom located along the global fronts, and the dimension of the subspace on the performance of the algorithm is investigated. For a 64-element rectangular plate, speed-ups of 1.86, 3.13, 3.18, and 3.61 are achieved on two, four, six, and eight processors, respectively.

  18. Extended performance electric propulsion power processor design study. Volume 2: Technical summary

    NASA Technical Reports Server (NTRS)

    Biess, J. J.; Inouye, L. Y.; Schoenfeld, A. D.

    1977-01-01

    Electric propulsion power processor technology has processed during the past decade to the point that it is considered ready for application. Several power processor design concepts were evaluated and compared. Emphasis was placed on a 30 cm ion thruster power processor with a beam power rating supply of 2.2KW to 10KW for the main propulsion power stage. Extension in power processor performance were defined and were designed in sufficient detail to determine efficiency, component weight, part count, reliability and thermal control. A detail design was performed on a microprocessor as the thyristor power processor controller. A reliability analysis was performed to evaluate the effect of the control electronics redesign. Preliminary electrical design, mechanical design and thermal analysis were performed on a 6KW power transformer for the beam supply. Bi-Mod mechanical, structural and thermal control configurations were evaluated for the power processor and preliminary estimates of mechanical weight were determined.

  19. Parallel grid population

    DOEpatents

    Wald, Ingo; Ize, Santiago

    2015-07-28

    Parallel population of a grid with a plurality of objects using a plurality of processors. One example embodiment is a method for parallel population of a grid with a plurality of objects using a plurality of processors. The method includes a first act of dividing a grid into n distinct grid portions, where n is the number of processors available for populating the grid. The method also includes acts of dividing a plurality of objects into n distinct sets of objects, assigning a distinct set of objects to each processor such that each processor determines by which distinct grid portion(s) each object in its distinct set of objects is at least partially bounded, and assigning a distinct grid portion to each processor such that each processor populates its distinct grid portion with any objects that were previously determined to be at least partially bounded by its distinct grid portion.

  20. Sequence information signal processor

    DOEpatents

    Peterson, John C.; Chow, Edward T.; Waterman, Michael S.; Hunkapillar, Timothy J.

    1999-01-01

    An electronic circuit is used to compare two sequences, such as genetic sequences, to determine which alignment of the sequences produces the greatest similarity. The circuit includes a linear array of series-connected processors, each of which stores a single element from one of the sequences and compares that element with each successive element in the other sequence. For each comparison, the processor generates a scoring parameter that indicates which segment ending at those two elements produces the greatest degree of similarity between the sequences. The processor uses the scoring parameter to generate a similar scoring parameter for a comparison between the stored element and the next successive element from the other sequence. The processor also delivers the scoring parameter to the next processor in the array for use in generating a similar scoring parameter for another pair of elements. The electronic circuit determines which processor and alignment of the sequences produce the scoring parameter with the highest value.

  1. Conditional load and store in a shared memory

    DOEpatents

    Blumrich, Matthias A; Ohmacht, Martin

    2015-02-03

    A method, system and computer program product for implementing load-reserve and store-conditional instructions in a multi-processor computing system. The computing system includes a multitude of processor units and a shared memory cache, and each of the processor units has access to the memory cache. In one embodiment, the method comprises providing the memory cache with a series of reservation registers, and storing in these registers addresses reserved in the memory cache for the processor units as a result of issuing load-reserve requests. In this embodiment, when one of the processor units makes a request to store data in the memory cache using a store-conditional request, the reservation registers are checked to determine if an address in the memory cache is reserved for that processor unit. If an address in the memory cache is reserved for that processor, the data are stored at this address.

  2. 17 CFR 242.609 - Registration of securities information processors: form of application and amendments.

    Code of Federal Regulations, 2011 CFR

    2011-04-01

    ... information processors: form of application and amendments. 242.609 Section 242.609 Commodity and Securities....609 Registration of securities information processors: form of application and amendments. (a) An application for the registration of a securities information processor shall be filed on Form SIP (§ 249.1001...

  3. 17 CFR 242.609 - Registration of securities information processors: form of application and amendments.

    Code of Federal Regulations, 2010 CFR

    2010-04-01

    ... information processors: form of application and amendments. 242.609 Section 242.609 Commodity and Securities....609 Registration of securities information processors: form of application and amendments. (a) An application for the registration of a securities information processor shall be filed on Form SIP (§ 249.1001...

  4. Optical Associative Processors For Visual Perception"

    NASA Astrophysics Data System (ADS)

    Casasent, David; Telfer, Brian

    1988-05-01

    We consider various associative processor modifications required to allow these systems to be used for visual perception, scene analysis, and object recognition. For these applications, decisions on the class of the objects present in the input image are required and thus heteroassociative memories are necessary (rather than the autoassociative memories that have been given most attention). We analyze the performance of both associative processors and note that there is considerable difference between heteroassociative and autoassociative memories. We describe associative processors suitable for realizing functions such as: distortion invariance (using linear discriminant function memory synthesis techniques), noise and image processing performance (using autoassociative memories in cascade with with a heteroassociative processor and with a finite number of autoassociative memory iterations employed), shift invariance (achieved through the use of associative processors operating on feature space data), and the analysis of multiple objects in high noise (which is achieved using associative processing of the output from symbolic correlators). We detail and provide initial demonstrations of the use of associative processors operating on iconic, feature space and symbolic data, as well as adaptive associative processors.

  5. Enabling Future Robotic Missions with Multicore Processors

    NASA Technical Reports Server (NTRS)

    Powell, Wesley A.; Johnson, Michael A.; Wilmot, Jonathan; Some, Raphael; Gostelow, Kim P.; Reeves, Glenn; Doyle, Richard J.

    2011-01-01

    Recent commercial developments in multicore processors (e.g. Tilera, Clearspeed, HyperX) have provided an option for high performance embedded computing that rivals the performance attainable with FPGA-based reconfigurable computing architectures. Furthermore, these processors offer more straightforward and streamlined application development by allowing the use of conventional programming languages and software tools in lieu of hardware design languages such as VHDL and Verilog. With these advantages, multicore processors can significantly enhance the capabilities of future robotic space missions. This paper will discuss these benefits, along with onboard processing applications where multicore processing can offer advantages over existing or competing approaches. This paper will also discuss the key artchitecural features of current commercial multicore processors. In comparison to the current art, the features and advancements necessary for spaceflight multicore processors will be identified. These include power reduction, radiation hardening, inherent fault tolerance, and support for common spacecraft bus interfaces. Lastly, this paper will explore how multicore processors might evolve with advances in electronics technology and how avionics architectures might evolve once multicore processors are inserted into NASA robotic spacecraft.

  6. Hot Chips and Hot Interconnects for High End Computing Systems

    NASA Technical Reports Server (NTRS)

    Saini, Subhash

    2005-01-01

    I will discuss several processors: 1. The Cray proprietary processor used in the Cray X1; 2. The IBM Power 3 and Power 4 used in an IBM SP 3 and IBM SP 4 systems; 3. The Intel Itanium and Xeon, used in the SGI Altix systems and clusters respectively; 4. IBM System-on-a-Chip used in IBM BlueGene/L; 5. HP Alpha EV68 processor used in DOE ASCI Q cluster; 6. SPARC64 V processor, which is used in the Fujitsu PRIMEPOWER HPC2500; 7. An NEC proprietary processor, which is used in NEC SX-6/7; 8. Power 4+ processor, which is used in Hitachi SR11000; 9. NEC proprietary processor, which is used in Earth Simulator. The IBM POWER5 and Red Storm Computing Systems will also be discussed. The architectures of these processors will first be presented, followed by interconnection networks and a description of high-end computer systems based on these processors and networks. The performance of various hardware/programming model combinations will then be compared, based on latest NAS Parallel Benchmark results (MPI, OpenMP/HPF and hybrid (MPI + OpenMP). The tutorial will conclude with a discussion of general trends in the field of high performance computing, (quantum computing, DNA computing, cellular engineering, and neural networks).

  7. Configurable Multi-Purpose Processor

    NASA Technical Reports Server (NTRS)

    Valencia, J. Emilio; Forney, Chirstopher; Morrison, Robert; Birr, Richard

    2010-01-01

    Advancements in technology have allowed the miniaturization of systems used in aerospace vehicles. This technology is driven by the need for next-generation systems that provide reliable, responsive, and cost-effective range operations while providing increased capabilities such as simultaneous mission support, increased launch trajectories, improved launch, and landing opportunities, etc. Leveraging the newest technologies, the command and telemetry processor (CTP) concept provides for a compact, flexible, and integrated solution for flight command and telemetry systems and range systems. The CTP is a relatively small circuit board that serves as a processing platform for high dynamic, high vibration environments. The CTP can be reconfigured and reprogrammed, allowing it to be adapted for many different applications. The design is centered around a configurable field-programmable gate array (FPGA) device that contains numerous logic cells that can be used to implement traditional integrated circuits. The FPGA contains two PowerPC processors running the Vx-Works real-time operating system and are used to execute software programs specific to each application. The CTP was designed and developed specifically to provide telemetry functions; namely, the command processing, telemetry processing, and GPS metric tracking of a flight vehicle. However, it can be used as a general-purpose processor board to perform numerous functions implemented in either hardware or software using the FPGA s processors and/or logic cells. Functionally, the CTP was designed for range safety applications where it would ultimately become part of a vehicle s flight termination system. Consequently, the major functions of the CTP are to perform the forward link command processing, GPS metric tracking, return link telemetry data processing, error detection and correction, data encryption/ decryption, and initiate flight termination action commands. Also, the CTP had to be designed to survive and operate in a launch environment. Additionally, the CTP was designed to interface with the WFF (Wallops Flight Facility) custom-designed transceiver board which is used in the Low Cost TDRSS Transceiver (LCT2) also developed by WFF. The LCT2 s transceiver board demodulates commands received from the ground via the forward link and sends them to the CTP, where they are processed. The CTP inputs and processes data from the inertial measurement unit (IMU) and the GPS receiver board, generates status data, and then sends the data to the transceiver board where it is modulated and sent to the ground via the return link. Overall, the CTP has combined processing with the ability to interface to a GPS receiver, an IMU, and a pulse code modulation (PCM) communication link, while providing the capability to support common interfaces including Ethernet and serial interfaces boarding a relatively small-sized, lightweight package.

  8. Variation tolerant SoC design

    NASA Astrophysics Data System (ADS)

    Kozhikkottu, Vivek J.

    The scaling of integrated circuits into the nanometer regime has led to variations emerging as a primary concern for designers of integrated circuits. Variations are an inevitable consequence of the semiconductor manufacturing process, and also arise due to the side-effects of operation of integrated circuits (voltage, temperature, and aging). Conventional design approaches, which are based on design corners or worst-case scenarios, leave designers with an undesirable choice between the considerable overheads associated with over-design and significantly reduced manufacturing yield. Techniques for variation-tolerant design at the logic, circuit and layout levels of the design process have been developed and are in commercial use. However, with the incessant increase in variations due to technology scaling and design trends such as near-threshold computing, these techniques are no longer sufficient to contain the effects of variations, and there is a need to address variations at all stages of design. This thesis addresses the problem of variation-tolerant design at the earliest stages of the design process, where the system-level design decisions that are made can have a very significant impact. There are two key aspects to making system-level design variation-aware. First, analysis techniques must be developed to project the impact of variations on system-level metrics such as application performance and energy. Second, variation-tolerant design techniques need to be developed to absorb the residual impact of variations (that cannot be contained through lower-level techniques). In this thesis, we address both these facets by developing robust and scalable variation-aware analysis and variation mitigation techniques at the system level. The first contribution of this thesis is a variation-aware system-level performance analysis framework. We address the key challenge of translating the per-component clock frequency distributions into a system-level application performance distribution. This task is particularly complex and challenging due to the inter-dependencies between components' execution, indirect effects of shared resources, and interactions between multiple system-level "execution paths". We argue that accurate variation-aware performance analysis requires Monte-Carlo based repeated system execution. Our proposed analysis framework leverages emulation to significantly speedup performance analysis without sacrificing the generality and accuracy achieved by Monte-Carlo based simulations. Our experiments show performance improvements of around 60x compared to state-of-the-art hardware-software co-simulation tools and also underscore the framework's potential to enable variation-aware design and exploration at the system level. Our second contribution addresses the problem of designing variation-tolerant SoCs using recovery based design, a popular circuit design paradigm that addresses variations by eliminating guard-bands and operating circuits at close to "zero margins" while detecting and recovering from timing errors. While previous efforts have demonstrated the potential benefits of recovery based design, we identify several challenges that need to be addressed in order to apply this technique to SoCs. We present a systematic design framework to apply recovery based design at the system level. We propose to partition SoCs into "recovery islands", wherein each recovery island consists of one or more SoC components that can recover independent of the rest of the SoC. We present a variation-aware design methodology that partitions a given SoC into recovery islands and computes the optimal operating points for each island, taking into account the various trade-offs involved. Our experiments demonstrate that the proposed design framework achieves an average of 32% energy savings over conventional worst-case designs, with negligible losses in performance. The third contribution of this thesis introduces disproportionate allocation of shared system resources as a means to combat the adverse impact of within-die variations on multi-core platforms. For multi-threaded programs executing on variation-impacted multi-cores platforms, we make the key observation that thread performance is not only a function of the frequency of the core on which it is executing on, but also depends upon the amount of shared system resources allocated to it. We utilize this insight to design a variation-aware runtime scheme which allocates the ways of a last-level shared L2 cache amongst the different cores/threads of a multi-core platform taking into account both application characteristics as well as chip specific variation profiles. Our experiments on 100 quad-core chips, each with a distinct variation profile, shows on an average 15% performance improvements for a suite of multi-threaded benchmarks. Our final contribution investigates the variation-tolerant design of domain-specific accelerators and demonstrates how the unique architectural properties of these accelerators can be leveraged to create highly effective variation tolerance mechanisms. We explore this concept through the variation-tolerant design of a vector processor that efficiently executes applications from the domains of recognition, mining and synthesis (RMS). We develop a novel design approach for variation tolerance, which leverages the unique nature of the vector reduction operations performed by this processor to effectively predict and preempt the occurrence of timing errors under variations and subsequently restore the correct output at the end of each vector reduction operation. We implement the above predict, preempt and restore operations by suitably enhancing the processor hardware and the application software and demonstrate considerable energy benefits (on an average 32%) across six applications from the domains of RMS. In conclusion, our work provides system designers with powerful tools and mechanisms in their efforts to combat variations, resulting in improved designer productivity and variation-tolerant systems.

  9. Distributed-Memory Breadth-First Search on Massive Graphs

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Buluc, Aydin; Beamer, Scott; Madduri, Kamesh

    This chapter studies the problem of traversing large graphs using the breadth-first search order on distributed-memory supercomputers. We consider both the traditional level-synchronous top-down algorithm as well as the recently discovered direction optimizing algorithm. We analyze the performance and scalability trade-offs in using different local data structures such as CSR and DCSC, enabling in-node multithreading, and graph decompositions such as 1D and 2D decomposition.

  10. Cost Computations for Cyber Fighter Associate

    DTIC Science & Technology

    2015-05-01

    associate. Aberdeen Proving Ground (MD): Army Research Laboratory (US); in press. 2 Harman D, Brown S, Henz B, Marvel LM. A communication protocol... Harman , et al.2 A specific class called ListenThread was created for multithreaded listeners. When ListenThread is instantiated, it is passed a given...2. Harman D, Brown S, Henz B, Marvel LM. A communication protocol for CyAMS and the cyber associate interface. Aberdeen Proving Ground (MD): US Army

  11. Transformation Systems at NASA Ames

    NASA Technical Reports Server (NTRS)

    Buntine, Wray; Fischer, Bernd; Havelund, Klaus; Lowry, Michael; Pressburger, TOm; Roach, Steve; Robinson, Peter; VanBaalen, Jeffrey

    1999-01-01

    In this paper, we describe the experiences of the Automated Software Engineering Group at the NASA Ames Research Center in the development and application of three different transformation systems. The systems span the entire technology range, from deductive synthesis, to logic-based transformation, to almost compiler-like source-to-source transformation. These systems also span a range of NASA applications, including solving solar system geometry problems, generating data analysis software, and analyzing multi-threaded Java code.

  12. 17 CFR 249.1001 - Form SIP, for application for registration as a securities information processor or to amend such...

    Code of Federal Regulations, 2011 CFR

    2011-04-01

    ... registration as a securities information processor or to amend such an application or registration. 249.1001..., SECURITIES EXCHANGE ACT OF 1934 Form for Registration of, and Reporting by Securities Information Processors § 249.1001 Form SIP, for application for registration as a securities information processor or to amend...

  13. 75 FR 39892 - Fisheries of the Exclusive Economic Zone Off Alaska; Community Development Quota Program

    Federal Register 2010, 2011, 2012, 2013, 2014

    2010-07-13

    ... Fisheries Act (AFA) trawl catcher/processor sector (otherwise known as the Amendment 80 sector... catcher/processors. Hook-and-line catcher/processors are allocated 48.7 percent of the annual BSAI Pacific... harvest of Pacific cod by hook-and-line catcher/processors, although this is one of the major groundfish...

  14. 78 FR 21483 - Joint Industry Plan; Order Approving the Third Amendment to the National Market System Plan to...

    Federal Register 2010, 2011, 2012, 2013, 2014

    2013-04-10

    ... the Securities Information Processors (``SIPs'' or ``Processors'') responsible for consolidation of... Plan. \\9\\ 17 CFR 242.603(b). The Plan refers to this entity as the Processor. \\10\\ See Section I(T) of... Euronext, to Elizabeth M. Murphy, Secretary, Commission, dated May 24, 2012. The Processors would also...

  15. Simulating Synchronous Processors

    DTIC Science & Technology

    1988-06-01

    34f Fvtvru m LABORATORY FOR INMASSACHUSETTSFCOMPUTER SCIENCE TECHNOLOGY MIT/LCS/TM-359 SIMULATING SYNCHRONOUS PROCESSORS Jennifer Lundelius Welch...PROJECT TASK WORK UNIT Arlington, VA 22217 ELEMENT NO. NO. NO ACCESSION NO. 11. TITLE Include Security Classification) Simulating Synchronous Processors...necessary and identify by block number) In this paper we show how a distributed system with synchronous processors and asynchro- nous message delays can

  16. Middle School Pupil Writing and the Word Processor.

    ERIC Educational Resources Information Center

    Ediger, Marlow

    Pupils in middle schools should have ample opportunities to write with the use of word processors. Legible writing in longhand will always be necessary in selected situations but, nevertheless, much drudgery is taken care of when using a word processor. Word processors tend to be very user friendly in that few mechanical skills are needed by the…

  17. 17 CFR 249.1001 - Form SIP, for application for registration as a securities information processor or to amend such...

    Code of Federal Regulations, 2010 CFR

    2010-04-01

    ... registration as a securities information processor or to amend such an application or registration. 249.1001..., SECURITIES EXCHANGE ACT OF 1934 Form for Registration of, and Reporting by Securities Information Processors § 249.1001 Form SIP, for application for registration as a securities information processor or to amend...

  18. Analog Processor To Solve Optimization Problems

    NASA Technical Reports Server (NTRS)

    Duong, Tuan A.; Eberhardt, Silvio P.; Thakoor, Anil P.

    1993-01-01

    Proposed analog processor solves "traveling-salesman" problem, considered paradigm of global-optimization problems involving routing or allocation of resources. Includes electronic neural network and auxiliary circuitry based partly on concepts described in "Neural-Network Processor Would Allocate Resources" (NPO-17781) and "Neural Network Solves 'Traveling-Salesman' Problem" (NPO-17807). Processor based on highly parallel computing solves problem in significantly less time.

  19. Finite elements and the method of conjugate gradients on a concurrent processor

    NASA Technical Reports Server (NTRS)

    Lyzenga, G. A.; Raefsky, A.; Hager, G. H.

    1985-01-01

    An algorithm for the iterative solution of finite element problems on a concurrent processor is presented. The method of conjugate gradients is used to solve the system of matrix equations, which is distributed among the processors of a MIMD computer according to an element-based spatial decomposition. This algorithm is implemented in a two-dimensional elastostatics program on the Caltech Hypercube concurrent processor. The results of tests on up to 32 processors show nearly linear concurrent speedup, with efficiencies over 90 percent for sufficiently large problems.

  20. A model for tracking concentration of chemical compounds within a tank of an automatic film processor.

    PubMed

    Sobol, Wlad T

    2002-01-01

    A simple kinetic model that describes the time evolution of the chemical concentration of an arbitrary compound within the tank of an automatic film processor is presented. It provides insights into the kinetics of chemistry concentration inside the processor's tank; the results facilitate the tasks of processor tuning and quality control (QC). The model has successfully been used in several troubleshooting sessions of low-volume mammography processors for which maintaining consistent QC tracking was difficult due to fluctuations of bromide levels in the developer tank.

  1. Finite elements and the method of conjugate gradients on a concurrent processor

    NASA Technical Reports Server (NTRS)

    Lyzenga, G. A.; Raefsky, A.; Hager, B. H.

    1984-01-01

    An algorithm for the iterative solution of finite element problems on a concurrent processor is presented. The method of conjugate gradients is used to solve the system of matrix equations, which is distributed among the processors of a MIMD computer according to an element-based spatial decomposition. This algorithm is implemented in a two-dimensional elastostatics program on the Caltech Hypercube concurrent processor. The results of tests on up to 32 processors show nearly linear concurrent speedup, with efficiencies over 90% for sufficiently large problems.

  2. A fully reconfigurable photonic integrated signal processor

    NASA Astrophysics Data System (ADS)

    Liu, Weilin; Li, Ming; Guzzon, Robert S.; Norberg, Erik J.; Parker, John S.; Lu, Mingzhi; Coldren, Larry A.; Yao, Jianping

    2016-03-01

    Photonic signal processing has been considered a solution to overcome the inherent electronic speed limitations. Over the past few years, an impressive range of photonic integrated signal processors have been proposed, but they usually offer limited reconfigurability, a feature highly needed for the implementation of large-scale general-purpose photonic signal processors. Here, we report and experimentally demonstrate a fully reconfigurable photonic integrated signal processor based on an InP-InGaAsP material system. The proposed photonic signal processor is capable of performing reconfigurable signal processing functions including temporal integration, temporal differentiation and Hilbert transformation. The reconfigurability is achieved by controlling the injection currents to the active components of the signal processor. Our demonstration suggests great potential for chip-scale fully programmable all-optical signal processing.

  3. Neurovision processor for designing intelligent sensors

    NASA Astrophysics Data System (ADS)

    Gupta, Madan M.; Knopf, George K.

    1992-03-01

    A programmable multi-task neuro-vision processor, called the Positive-Negative (PN) neural processor, is proposed as a plausible hardware mechanism for constructing robust multi-task vision sensors. The computational operations performed by the PN neural processor are loosely based on the neural activity fields exhibited by certain nervous tissue layers situated in the brain. The neuro-vision processor can be programmed to generate diverse dynamic behavior that may be used for spatio-temporal stabilization (STS), short-term visual memory (STVM), spatio-temporal filtering (STF) and pulse frequency modulation (PFM). A multi- functional vision sensor that performs a variety of information processing operations on time- varying two-dimensional sensory images can be constructed from a parallel and hierarchical structure of numerous individually programmed PN neural processors.

  4. Glycerol Production and Transformation: A Critical Review with Particular Emphasis on Glycerol Reforming Reaction for Producing Hydrogen in Conventional and Membrane Reactors.

    PubMed

    Bagnato, Giuseppe; Iulianelli, Adolfo; Sanna, Aimaro; Basile, Angelo

    2017-03-23

    Glycerol represents an emerging renewable bio-derived feedstock, which could be used as a source for producing hydrogen through steam reforming reaction. In this review, the state-of-the-art about glycerol production processes is reviewed, with particular focus on glycerol reforming reactions and on the main catalysts under development. Furthermore, the use of membrane catalytic reactors instead of conventional reactors for steam reforming is discussed. Finally, the review describes the utilization of the Pd-based membrane reactor technology, pointing out the ability of these alternative fuel processors to simultaneously extract high purity hydrogen and enhance the whole performances of the reaction system in terms of glycerol conversion and hydrogen yield.

  5. Real time animation of space plasma phenomena

    NASA Technical Reports Server (NTRS)

    Jordan, K. F.; Greenstadt, E. W.

    1987-01-01

    In pursuit of real time animation of computer simulated space plasma phenomena, the code was rewritten for the Massively Parallel Processor (MPP). The program creates a dynamic representation of the global bowshock which is based on actual spacecraft data and designed for three dimensional graphic output. This output consists of time slice sequences which make up the frames of the animation. With the MPP, 16384, 512 or 4 frames can be calculated simultaneously depending upon which characteristic is being computed. The run time was greatly reduced which promotes the rapid sequence of images and makes real time animation a foreseeable goal. The addition of more complex phenomenology in the constructed computer images is now possible and work proceeds to generate these images.

  6. Glycerol Production and Transformation: A Critical Review with Particular Emphasis on Glycerol Reforming Reaction for Producing Hydrogen in Conventional and Membrane Reactors

    PubMed Central

    Bagnato, Giuseppe; Iulianelli, Adolfo; Sanna, Aimaro; Basile, Angelo

    2017-01-01

    Glycerol represents an emerging renewable bio-derived feedstock, which could be used as a source for producing hydrogen through steam reforming reaction. In this review, the state-of-the-art about glycerol production processes is reviewed, with particular focus on glycerol reforming reactions and on the main catalysts under development. Furthermore, the use of membrane catalytic reactors instead of conventional reactors for steam reforming is discussed. Finally, the review describes the utilization of the Pd-based membrane reactor technology, pointing out the ability of these alternative fuel processors to simultaneously extract high purity hydrogen and enhance the whole performances of the reaction system in terms of glycerol conversion and hydrogen yield. PMID:28333121

  7. Real-Time Visualization of Tissue Ischemia

    NASA Technical Reports Server (NTRS)

    Bearman, Gregory H. (Inventor); Chrien, Thomas D. (Inventor); Eastwood, Michael L. (Inventor)

    2000-01-01

    A real-time display of tissue ischemia which comprises three CCD video cameras, each with a narrow bandwidth filter at the correct wavelength is discussed. The cameras simultaneously view an area of tissue suspected of having ischemic areas through beamsplitters. The output from each camera is adjusted to give the correct signal intensity for combining with, the others into an image for display. If necessary a digital signal processor (DSP) can implement algorithms for image enhancement prior to display. Current DSP engines are fast enough to give real-time display. Measurement at three, wavelengths, combined into a real-time Red-Green-Blue (RGB) video display with a digital signal processing (DSP) board to implement image algorithms, provides direct visualization of ischemic areas.

  8. When emotionality trumps reason: a study of individual processing style and juror bias.

    PubMed

    Gunnell, Justin J; Ceci, Stephen J

    2010-01-01

    "Cognitive Experiential Self Theory" (CEST) postulates that information-processing proceeds through two pathways, a rational one and an experiential one. The former is characterized by an emphasis on analysis, fact, and logical argument, whereas the latter is characterized by emotional and personal experience. We examined whether individuals influenced by the experiential system (E-processors) are more susceptible to extralegal biases (e.g. defendant attractiveness) than those influenced by the rational system (R-processors). Participants reviewed a criminal trial transcript and defendant profile and determined verdict, sentencing, and extralegal susceptibility. Although E-processors and R-processors convicted attractive defendants at similar rates, E-processors were more likely to convict less attractive defendants. Whereas R-processors did not sentence attractive and less attractive defendants differently, E-processors gave more lenient sentences to attractive defendants and harsher sentences to less attractive defendants. E-processors were also more likely to report that extralegal factors would change their verdicts. Further, the degree to which emotionality trumped rationality within an individual, as measured by a novel scoring method, linearly correlated with harsher sentences and extralegal influence. In sum, the results support an "unattractive harshness" effect during guilt determination, an attraction leniency effect during sentencing and increased susceptibility to extralegal factors within E-processors. Copyright © 2010 John Wiley & Sons, Ltd. Copyright © 2010 John Wiley & Sons, Ltd.

  9. Soft-core processor study for node-based architectures.

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Van Houten, Jonathan Roger; Jarosz, Jason P.; Welch, Benjamin James

    2008-09-01

    Node-based architecture (NBA) designs for future satellite projects hold the promise of decreasing system development time and costs, size, weight, and power and positioning the laboratory to address other emerging mission opportunities quickly. Reconfigurable Field Programmable Gate Array (FPGA) based modules will comprise the core of several of the NBA nodes. Microprocessing capabilities will be necessary with varying degrees of mission-specific performance requirements on these nodes. To enable the flexibility of these reconfigurable nodes, it is advantageous to incorporate the microprocessor into the FPGA itself, either as a hardcore processor built into the FPGA or as a soft-core processor builtmore » out of FPGA elements. This document describes the evaluation of three reconfigurable FPGA based processors for use in future NBA systems--two soft cores (MicroBlaze and non-fault-tolerant LEON) and one hard core (PowerPC 405). Two standard performance benchmark applications were developed for each processor. The first, Dhrystone, is a fixed-point operation metric. The second, Whetstone, is a floating-point operation metric. Several trials were run at varying code locations, loop counts, processor speeds, and cache configurations. FPGA resource utilization was recorded for each configuration. Cache configurations impacted the results greatly; for optimal processor efficiency it is necessary to enable caches on the processors. Processor caches carry a penalty; cache error mitigation is necessary when operating in a radiation environment.« less

  10. Development of small scale cluster computer for numerical analysis

    NASA Astrophysics Data System (ADS)

    Zulkifli, N. H. N.; Sapit, A.; Mohammed, A. N.

    2017-09-01

    In this study, two units of personal computer were successfully networked together to form a small scale cluster. Each of the processor involved are multicore processor which has four cores in it, thus made this cluster to have eight processors. Here, the cluster incorporate Ubuntu 14.04 LINUX environment with MPI implementation (MPICH2). Two main tests were conducted in order to test the cluster, which is communication test and performance test. The communication test was done to make sure that the computers are able to pass the required information without any problem and were done by using simple MPI Hello Program where the program written in C language. Additional, performance test was also done to prove that this cluster calculation performance is much better than single CPU computer. In this performance test, four tests were done by running the same code by using single node, 2 processors, 4 processors, and 8 processors. The result shows that with additional processors, the time required to solve the problem decrease. Time required for the calculation shorten to half when we double the processors. To conclude, we successfully develop a small scale cluster computer using common hardware which capable of higher computing power when compare to single CPU processor, and this can be beneficial for research that require high computing power especially numerical analysis such as finite element analysis, computational fluid dynamics, and computational physics analysis.

  11. 78 FR 74063 - Fisheries of the Exclusive Economic Zone Off Alaska; Bering Sea and Aleutian Islands; 2014 and...

    Federal Register 2010, 2011, 2012, 2013, 2014

    2013-12-10

    ...; catcher/ processor--40 percent; and motherships--10 percent. Under Sec. 679.20(a)(5)(iii)(B)(2)(i) and (ii... sector, 40 percent to the catcher/processor sector, and 10 percent to the mothership sector. In the.../processor sector will be available for harvest by AFA catcher vessels with catcher/ processor sector...

  12. Processor architecture for airborne SAR systems

    NASA Technical Reports Server (NTRS)

    Glass, C. M.

    1983-01-01

    Digital processors for spaceborne imaging radars and application of the technology developed for airborne SAR systems are considered. Transferring algorithms and implementation techniques from airborne to spaceborne SAR processors offers obvious advantages. The following topics are discussed: (1) a quantification of the differences in processing algorithms for airborne and spaceborne SARs; and (2) an overview of three processors for airborne SAR systems.

  13. Yes! An object-oriented compiler compiler (YOOCC)

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Avotins, J.; Mingins, C.; Schmidt, H.

    1995-12-31

    Grammar-based processor generation is one of the most widely studied areas in language processor construction. However, there have been very few approaches to date that reconcile object-oriented principles, processor generation, and an object-oriented language. Pertinent here also. is that currently to develop a processor using the Eiffel Parse libraries requires far too much time to be expended on tasks that can be automated. For these reasons, we have developed YOOCC (Yes! an Object-Oriented Compiler Compiler), which produces a processor framework from a grammar using an enhanced version of the Eiffel Parse libraries, incorporating the ideas hypothesized by Meyer, and Grapemore » and Walden, as well as many others. Various essential changes have been made to the Eiffel Parse libraries. Examples are presented to illustrate the development of a processor using YOOCC, and it is concluded that the Eiffel Parse libraries are now not only an intelligent, but also a productive option for processor construction.« less

  14. Effect of poor control of film processors on mammographic image quality.

    PubMed

    Kimme-Smith, C; Sun, H; Bassett, L W; Gold, R H

    1992-11-01

    With the increasingly stringent standards of image quality in mammography, film processor quality control is especially important. Current methods are not sufficient for ensuring good processing. The authors used a sensitometer and densitometer system to evaluate the performance of 22 processors at 16 mammographic facilities. Standard sensitometric values of two films were established, and processor performance was assessed for variations from these standards. Developer chemistry of each processor was analyzed and correlated with its sensitometric values. Ten processors were retested, and nine were found to be out of calibration. The developer components of hydroquinone, sulfites, bromide, and alkalinity varied the most, and low concentrations of hydroquinone were associated with lower average gradients at two facilities. Use of the sensitometer and densitometer system helps identify out-of-calibration processors, but further study is needed to correlate sensitometric values with developer component values. The authors believe that present quality control would be improved if sensitometric or other tests could be used to identify developer components that are out of calibration.

  15. Automatic film processors' quality control test in Greek military hospitals.

    PubMed

    Lymberis, C; Efstathopoulos, E P; Manetou, A; Poudridis, G

    1993-04-01

    The two major military radiology installations (Athens, Greece) using a total of 15 automatic film processors were assessed using the 21-step-wedge method. The results of quality control in all these processors are presented. The parameters measured under actual working conditions were base and fog, contrast and speed. Base and fog as well as speed displayed large variations with average values generally higher than acceptable, whilst contrast displayed greater stability. Developer temperature was measured daily during the test and was found to be outside the film manufacturers' recommended limits in nine of the 15 processors. In only one processor did film passing time vary on an every day basis and this was due to maloperation. Developer pH test was not part of the daily monitoring service being performed every 5 days for each film processor and found to be in the range 9-12; 10 of the 15 processors presented pH values outside the limits specified by the film manufacturers.

  16. A high-accuracy optical linear algebra processor for finite element applications

    NASA Technical Reports Server (NTRS)

    Casasent, D.; Taylor, B. K.

    1984-01-01

    Optical linear processors are computationally efficient computers for solving matrix-matrix and matrix-vector oriented problems. Optical system errors limit their dynamic range to 30-40 dB, which limits their accuray to 9-12 bits. Large problems, such as the finite element problem in structural mechanics (with tens or hundreds of thousands of variables) which can exploit the speed of optical processors, require the 32 bit accuracy obtainable from digital machines. To obtain this required 32 bit accuracy with an optical processor, the data can be digitally encoded, thereby reducing the dynamic range requirements of the optical system (i.e., decreasing the effect of optical errors on the data) while providing increased accuracy. This report describes a new digitally encoded optical linear algebra processor architecture for solving finite element and banded matrix-vector problems. A linear static plate bending case study is described which quantities the processor requirements. Multiplication by digital convolution is explained, and the digitally encoded optical processor architecture is advanced.

  17. Optimal processor assignment for pipeline computations

    NASA Technical Reports Server (NTRS)

    Nicol, David M.; Simha, Rahul; Choudhury, Alok N.; Narahari, Bhagirath

    1991-01-01

    The availability of large scale multitasked parallel architectures introduces the following processor assignment problem for pipelined computations. Given a set of tasks and their precedence constraints, along with their experimentally determined individual responses times for different processor sizes, find an assignment of processor to tasks. Two objectives are of interest: minimal response given a throughput requirement, and maximal throughput given a response time requirement. These assignment problems differ considerably from the classical mapping problem in which several tasks share a processor; instead, it is assumed that a large number of processors are to be assigned to a relatively small number of tasks. Efficient assignment algorithms were developed for different classes of task structures. For a p processor system and a series parallel precedence graph with n constituent tasks, an O(np2) algorithm is provided that finds the optimal assignment for the response time optimization problem; it was found that the assignment optimizing the constrained throughput in O(np2log p) time. Special cases of linear, independent, and tree graphs are also considered.

  18. Array processor architecture

    NASA Technical Reports Server (NTRS)

    Barnes, George H. (Inventor); Lundstrom, Stephen F. (Inventor); Shafer, Philip E. (Inventor)

    1983-01-01

    A high speed parallel array data processing architecture fashioned under a computational envelope approach includes a data base memory for secondary storage of programs and data, and a plurality of memory modules interconnected to a plurality of processing modules by a connection network of the Omega gender. Programs and data are fed from the data base memory to the plurality of memory modules and from hence the programs are fed through the connection network to the array of processors (one copy of each program for each processor). Execution of the programs occur with the processors operating normally quite independently of each other in a multiprocessing fashion. For data dependent operations and other suitable operations, all processors are instructed to finish one given task or program branch before all are instructed to proceed in parallel processing fashion on the next instruction. Even when functioning in the parallel processing mode however, the processors are not locked-step but execute their own copy of the program individually unless or until another overall processor array synchronization instruction is issued.

  19. Development of a Dynamic Time Sharing Scheduled Environment Final Report CRADA No. TC-824-94E

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Jette, M.; Caliga, D.

    Massively parallel computers, such as the Cray T3D, have historically supported resource sharing solely with space sharing. In that method, multiple problems are solved by executing them on distinct processors. This project developed a dynamic time- and space-sharing scheduler to achieve greater interactivity and throughput than could be achieved with space-sharing alone. CRI and LLNL worked together on the design, testing, and review aspects of this project. There were separate software deliverables. CFU implemented a general purpose scheduling system as per the design specifications. LLNL ported the local gang scheduler software to the LLNL Cray T3D. In this approach, processorsmore » are allocated simultaneously to aU components of a parallel program (in a “gang”). Program execution is preempted as needed to provide for interactivity. Programs are also reIocated to different processors as needed to efficiently pack the computer’s torus of processors. In phase one, CRI developed an interface specification after discussions with LLNL for systemlevel software supporting a time- and space-sharing environment on the LLNL T3D. The two parties also discussed interface specifications for external control tools (such as scheduling policy tools, system administration tools) and applications programs. CRI assumed responsibility for the writing and implementation of all the necessary system software in this phase. In phase two, CRI implemented job-rolling on the Cray T3D, a mechanism for preempting a program, saving its state to disk, and later restoring its state to memory for continued execution. LLNL ported its gang scheduler to the LLNL T3D utilizing the CRI interface implemented in phases one and two. During phase three, the functionality and effectiveness of the LLNL gang scheduler was assessed to provide input to CRI time- and space-sharing, efforts. CRI will utilize this information in the development of general schedulers suitable for other sites and future architectures.« less

  20. Hyperspectral Microwave Atmospheric Sounder (HyMas) - New Capability in the CoSMIR-CoSSIR Scanhead

    NASA Technical Reports Server (NTRS)

    Hilliard, L. M.; Racette, P. E.; Blackwell, W.; Galbraith, C.; Thompson, E.

    2015-01-01

    Lincoln Laboratory and NASA's Goddard Space Flight Center have teamed to re-use an existing instrument platform, the CoSMIRCoSSIR system for atmospheric sounding, to develop a new capability in hyperspectral filtering, data collection, and display. The volume of the scanhead accomodated an intermediate frequency processor(IFP), that provides the filtering and digitization of the raw data and the interoperable remote component (IRC) adapted to CoSMIR, CoSSIR, and HyMAS that stores and archives the data with time tagged calibration and navigation data.The first element of the work is the demonstration of a hyperspectral microwave receiver subsystem that was recently shown using a comprehensive simulation study to yield performance that substantially exceeds current state-of-the-art. Hyperspectral microwave sounders with 100 channels offer temperature and humidity sounding improvements similar to those obtained when infrared sensors became hyperspectral, but with the relative insensitivity to clouds that characterizes microwave sensors. Hyperspectral microwave operation is achieved using independent RF antennareceiver arrays that sample the same areavolume of the Earths surfaceatmosphere at slightly different frequencies and therefore synthesize a set of dense, finely spaced vertical weighting functions. The second, enabling element of the proposal is the development of a compact 52-channel Intermediate Frequency processor module. A principal challenge in the development of a hyperspectral microwave system is the size of the IF filter bank required for channelization. Large bandwidths are simultaneously processed, thus complicating the use of digital back-ends with associated high complexities, costs, and power requirements. Our approach involves passive filters implemented using low-temperature co-fired ceramic (LTCC) technology to achieve an ultra-compact module that can be easily integrated with existing RF front-end technology. This IF processor is universally applicable to other microwave sensing missions requiring compact IF spectrometry.The data include 52 operational channels with low IF module volume (100cm3) and mass (300g) and linearity better than 0.3 over a 330K dynamic range.

  1. A high-resolution physically-based global flood hazard map

    NASA Astrophysics Data System (ADS)

    Kaheil, Y.; Begnudelli, L.; McCollum, J.

    2016-12-01

    We present the results from a physically-based global flood hazard model. The model uses a physically-based hydrologic model to simulate river discharges, and 2D hydrodynamic model to simulate inundation. The model is set up such that it allows the application of large-scale flood hazard through efficient use of parallel computing. For hydrology, we use the Hillslope River Routing (HRR) model. HRR accounts for surface hydrology using Green-Ampt parameterization. The model is calibrated against observed discharge data from the Global Runoff Data Centre (GRDC) network, among other publicly-available datasets. The parallel-computing framework takes advantage of the river network structure to minimize cross-processor messages, and thus significantly increases computational efficiency. For inundation, we implemented a computationally-efficient 2D finite-volume model with wetting/drying. The approach consists of simulating flood along the river network by forcing the hydraulic model with the streamflow hydrographs simulated by HRR, and scaled up to certain return levels, e.g. 100 years. The model is distributed such that each available processor takes the next simulation. Given an approximate criterion, the simulations are ordered from most-demanding to least-demanding to ensure that all processors finalize almost simultaneously. Upon completing all simulations, the maximum envelope of flood depth is taken to generate the final map. The model is applied globally, with selected results shown from different continents and regions. The maps shown depict flood depth and extent at different return periods. These maps, which are currently available at 3 arc-sec resolution ( 90m) can be made available at higher resolutions where high resolution DEMs are available. The maps can be utilized by flood risk managers at the national, regional, and even local levels to further understand their flood risk exposure, exercise certain measures of mitigation, and/or transfer the residual risk financially through flood insurance programs.

  2. Implementation of the ATLAS trigger within the multi-threaded software framework AthenaMT

    NASA Astrophysics Data System (ADS)

    Wynne, Ben; ATLAS Collaboration

    2017-10-01

    We present an implementation of the ATLAS High Level Trigger, HLT, that provides parallel execution of trigger algorithms within the ATLAS multithreaded software framework, AthenaMT. This development will enable the ATLAS HLT to meet future challenges due to the evolution of computing hardware and upgrades of the Large Hadron Collider, LHC, and ATLAS Detector. During the LHC data-taking period starting in 2021, luminosity will reach up to three times the original design value. Luminosity will increase further, to up to 7.5 times the design value, in 2026 following LHC and ATLAS upgrades. This includes an upgrade of the ATLAS trigger architecture that will result in an increase in the HLT input rate by a factor of 4 to 10 compared to the current maximum rate of 100 kHz. The current ATLAS multiprocess framework, AthenaMP, manages a number of processes that each execute algorithms sequentially for different events. AthenaMT will provide a fully multi-threaded environment that will additionally enable concurrent execution of algorithms within an event. This has the potential to significantly reduce the memory footprint on future manycore devices. An additional benefit of the HLT implementation within AthenaMT is that it facilitates the integration of offline code into the HLT. The trigger must retain high rejection in the face of increasing numbers of pileup collisions. This will be achieved by greater use of offline algorithms that are designed to maximize the discrimination of signal from background. Therefore a unification of the HLT and offline reconstruction software environment is required. This has been achieved while at the same time retaining important HLT-specific optimisations that minimize the computation performed to reach a trigger decision. Such optimizations include early event rejection and reconstruction within restricted geometrical regions. We report on an HLT prototype in which the need for HLT-specific components has been reduced to a minimum. Promising results have been obtained with a prototype that includes the key elements of trigger functionality including regional reconstruction and early event rejection. We report on the first experience of migrating trigger selections to this new framework and present the next steps towards a full implementation of the ATLAS trigger.

  3. Efficient methods for implementation of multi-level nonrigid mass-preserving image registration on GPUs and multi-threaded CPUs.

    PubMed

    Ellingwood, Nathan D; Yin, Youbing; Smith, Matthew; Lin, Ching-Long

    2016-04-01

    Faster and more accurate methods for registration of images are important for research involved in conducting population-based studies that utilize medical imaging, as well as improvements for use in clinical applications. We present a novel computation- and memory-efficient multi-level method on graphics processing units (GPU) for performing registration of two computed tomography (CT) volumetric lung images. We developed a computation- and memory-efficient Diffeomorphic Multi-level B-Spline Transform Composite (DMTC) method to implement nonrigid mass-preserving registration of two CT lung images on GPU. The framework consists of a hierarchy of B-Spline control grids of increasing resolution. A similarity criterion known as the sum of squared tissue volume difference (SSTVD) was adopted to preserve lung tissue mass. The use of SSTVD consists of the calculation of the tissue volume, the Jacobian, and their derivatives, which makes its implementation on GPU challenging due to memory constraints. The use of the DMTC method enabled reduced computation and memory storage of variables with minimal communication between GPU and Central Processing Unit (CPU) due to ability to pre-compute values. The method was assessed on six healthy human subjects. Resultant GPU-generated displacement fields were compared against the previously validated CPU counterpart fields, showing good agreement with an average normalized root mean square error (nRMS) of 0.044±0.015. Runtime and performance speedup are compared between single-threaded CPU, multi-threaded CPU, and GPU algorithms. Best performance speedup occurs at the highest resolution in the GPU implementation for the SSTVD cost and cost gradient computations, with a speedup of 112 times that of the single-threaded CPU version and 11 times over the twelve-threaded version when considering average time per iteration using a Nvidia Tesla K20X GPU. The proposed GPU-based DMTC method outperforms its multi-threaded CPU version in terms of runtime. Total registration time reduced runtime to 2.9min on the GPU version, compared to 12.8min on twelve-threaded CPU version and 112.5min on a single-threaded CPU. Furthermore, the GPU implementation discussed in this work can be adapted for use of other cost functions that require calculation of the first derivatives. Copyright © 2015 Elsevier Ireland Ltd. All rights reserved.

  4. Extended performance electric propulsion power processor design study. Volume 1: Executive summary

    NASA Technical Reports Server (NTRS)

    Biess, J. J.; Inouye, L. Y.; Schoenfeld, A. D.

    1977-01-01

    Several power processor design concepts were evaluated and compared. Emphasis was placed on a 30cm ion thruster power processor with a beam supply rating of 2.2kW to 10kW. Extensions in power processor performance were defined and were designed in sufficient detail to determine efficiency, component weight, part count, reliability and thermal control. Preliminary electrical design, mechanical design, and thermal analysis were performed on a 6kW power transformer for the beam supply. Bi-Mod mechanical, structural, and thermal control configurations were evaluated for the power processor, and preliminary estimates of mechanical weight were determined. A program development plan was formulated that outlines the work breakdown structure for the development, qualification and fabrication of the power processor flight hardware.

  5. APRON: A Cellular Processor Array Simulation and Hardware Design Tool

    NASA Astrophysics Data System (ADS)

    Barr, David R. W.; Dudek, Piotr

    2009-12-01

    We present a software environment for the efficient simulation of cellular processor arrays (CPAs). This software (APRON) is used to explore algorithms that are designed for massively parallel fine-grained processor arrays, topographic multilayer neural networks, vision chips with SIMD processor arrays, and related architectures. The software uses a highly optimised core combined with a flexible compiler to provide the user with tools for the design of new processor array hardware architectures and the emulation of existing devices. We present performance benchmarks for the software processor array implemented on standard commodity microprocessors. APRON can be configured to use additional processing hardware if necessary and can be used as a complete graphical user interface and development environment for new or existing CPA systems, allowing more users to develop algorithms for CPA systems.

  6. Efficient Interconnection Schemes for VLSI and Parallel Computation

    DTIC Science & Technology

    1989-08-01

    Definition: Let R be a routing network. A set S of wires in R is a (directed) cut if it partitions the network into two sets of processors A and B ...such that every path from a processor in A to a processor in B contains a wire in S. The capacity cap(S) is the number of wires in the cut. For a set of...messages M, define the load load(M, S) of M on a cut S to be the number of messages in M from a processor in A to a processor in B . The load factor

  7. Hypercluster - Parallel processing for computational mechanics

    NASA Technical Reports Server (NTRS)

    Blech, Richard A.

    1988-01-01

    An account is given of the development status, performance capabilities and implications for further development of NASA-Lewis' testbed 'hypercluster' parallel computer network, in which multiple processors communicate through a shared memory. Processors have local as well as shared memory; the hypercluster is expanded in the same manner as the hypercube, with processor clusters replacing the normal single processor node. The NASA-Lewis machine has three nodes with a vector personality and one node with a scalar personality. Each of the vector nodes uses four board-level vector processors, while the scalar node uses four general-purpose microcomputer boards.

  8. The Effect of NUMA Tunings on CPU Performance

    NASA Astrophysics Data System (ADS)

    Hollowell, Christopher; Caramarcu, Costin; Strecker-Kellogg, William; Wong, Antonio; Zaytsev, Alexandr

    2015-12-01

    Non-Uniform Memory Access (NUMA) is a memory architecture for symmetric multiprocessing (SMP) systems where each processor is directly connected to separate memory. Indirect access to other CPU's (remote) RAM is still possible, but such requests are slower as they must also pass through that memory's controlling CPU. In concert with a NUMA-aware operating system, the NUMA hardware architecture can help eliminate the memory performance reductions generally seen in SMP systems when multiple processors simultaneously attempt to access memory. The x86 CPU architecture has supported NUMA for a number of years. Modern operating systems such as Linux support NUMA-aware scheduling, where the OS attempts to schedule a process to the CPU directly attached to the majority of its RAM. In Linux, it is possible to further manually tune the NUMA subsystem using the numactl utility. With the release of Red Hat Enterprise Linux (RHEL) 6.3, the numad daemon became available in this distribution. This daemon monitors a system's NUMA topology and utilization, and automatically makes adjustments to optimize locality. As the number of cores in x86 servers continues to grow, efficient NUMA mappings of processes to CPUs/memory will become increasingly important. This paper gives a brief overview of NUMA, and discusses the effects of manual tunings and numad on the performance of the HEPSPEC06 benchmark, and ATLAS software.

  9. The Berkeley Out-of-Order Machine (BOOM): An Industry-Competitive, Synthesizable, Parameterized RISC-V Processor

    DTIC Science & Technology

    2015-06-13

    The Berkeley Out-of-Order Machine (BOOM): An Industry- Competitive, Synthesizable, Parameterized RISC-V Processor Christopher Celio David A...Synthesizable, Parameterized RISC-V Processor Christopher Celio, David Patterson, and Krste Asanović University of California, Berkeley, California 94720...Order Machine BOOM is a synthesizable, parameterized, superscalar out- of-order RISC-V core designed to serve as the prototypical baseline processor

  10. A Medical Language Processor for Two Indo-European Languages

    PubMed Central

    Nhan, Ngo Thanh; Sager, Naomi; Lyman, Margaret; Tick, Leo J.; Borst, François; Su, Yun

    1989-01-01

    The syntax and semantics of clinical narrative across Indo-European languages are quite similar, making it possible to envison a single medical language processor that can be adapted for different European languages. The Linguistic String Project of New York University is continuing the development of its Medical Language Processor in this direction. The paper describes how the processor operates on English and French.

  11. Performance Modeling of the ADA Rendezvous

    DTIC Science & Technology

    1991-10-01

    queueing network of figure 2, SERVERTASK can complete only one rendezvous at a time. Thus, the rate that the rendezvous requests are processed at the... Network 1, SERVERTASK competes with the traffic tasks of Server Processor. Each time SERVERTASK gains access to the processor, SERVERTASK completes...Client Processor Server Processor Software Server Nek Netork2 Figure 10. A conceptualization of the algorithm. The SERVERTASK software server of Network 2

  12. Accelerating semantic graph databases on commodity clusters

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Morari, Alessandro; Castellana, Vito G.; Haglin, David J.

    We are developing a full software system for accelerating semantic graph databases on commodity cluster that scales to hundreds of nodes while maintaining constant query throughput. Our framework comprises a SPARQL to C++ compiler, a library of parallel graph methods and a custom multithreaded runtime layer, which provides a Partitioned Global Address Space (PGAS) programming model with fork/join parallelism and automatic load balancing over a commodity clusters. We present preliminary results for the compiler and for the runtime.

  13. ImageJ: Image processing and analysis in Java

    NASA Astrophysics Data System (ADS)

    Rasband, W. S.

    2012-06-01

    ImageJ is a public domain Java image processing program inspired by NIH Image. It can display, edit, analyze, process, save and print 8-bit, 16-bit and 32-bit images. It can read many image formats including TIFF, GIF, JPEG, BMP, DICOM, FITS and "raw". It supports "stacks", a series of images that share a single window. It is multithreaded, so time-consuming operations such as image file reading can be performed in parallel with other operations.

  14. An adaptive transmission protocol for managing dynamic shared states in collaborative surgical simulation.

    PubMed

    Qin, J; Choi, K S; Ho, Simon S M; Heng, P A

    2008-01-01

    A force prediction algorithm is proposed to facilitate virtual-reality (VR) based collaborative surgical simulation by reducing the effect of network latencies. State regeneration is used to correct the estimated prediction. This algorithm is incorporated into an adaptive transmission protocol in which auxiliary features such as view synchronization and coupling control are equipped to ensure the system consistency. We implemented this protocol using multi-threaded technique on a cluster-based network architecture.

  15. Model Checking with Multi-Threaded IC3 Portfolios

    DTIC Science & Technology

    2015-01-15

    different runs varies randomly depending on the thread interleaving. The use of a portfolio of solvers to maximize the likelihood of a quick solution is...empirically show (cf. Sec. 5.2) that the predictions based on this formula have high accuracy. Note that each solver in the portfolio potentially searches...speedup of over 300. We also show that widening the proof search of ic3 by randomizing its SAT solver is not as effective as paral- lelization

  16. Reducing False Positives in Runtime Analysis of Deadlocks

    NASA Technical Reports Server (NTRS)

    Bensalem, Saddek; Havelund, Klaus; Clancy, Daniel (Technical Monitor)

    2002-01-01

    This paper presents an improvement of a standard algorithm for detecting dead-lock potentials in multi-threaded programs, in that it reduces the number of false positives. The standard algorithm works as follows. The multi-threaded program under observation is executed, while lock and unlock events are observed. A graph of locks is built, with edges between locks symbolizing locking orders. Any cycle in the graph signifies a potential for a deadlock. The typical standard example is the group of dining philosophers sharing forks. The algorithm is interesting because it can catch deadlock potentials even though no deadlocks occur in the examined trace, and at the same time it scales very well in contrast t o more formal approaches to deadlock detection. The algorithm, however, can yield false positives (as well as false negatives). The extension of the algorithm described in this paper reduces the amount of false positives for three particular cases: when a gate lock protects a cycle, when a single thread introduces a cycle, and when the code segments in different threads that cause the cycle can actually not execute in parallel. The paper formalizes a theory for dynamic deadlock detection and compares it to model checking and static analysis techniques. It furthermore describes an implementation for analyzing Java programs and its application to two case studies: a planetary rover and a space craft altitude control system.

  17. GPU Linear Algebra Libraries and GPGPU Programming for Accelerating MOPAC Semiempirical Quantum Chemistry Calculations.

    PubMed

    Maia, Julio Daniel Carvalho; Urquiza Carvalho, Gabriel Aires; Mangueira, Carlos Peixoto; Santana, Sidney Ramos; Cabral, Lucidio Anjos Formiga; Rocha, Gerd B

    2012-09-11

    In this study, we present some modifications in the semiempirical quantum chemistry MOPAC2009 code that accelerate single-point energy calculations (1SCF) of medium-size (up to 2500 atoms) molecular systems using GPU coprocessors and multithreaded shared-memory CPUs. Our modifications consisted of using a combination of highly optimized linear algebra libraries for both CPU (LAPACK and BLAS from Intel MKL) and GPU (MAGMA and CUBLAS) to hasten time-consuming parts of MOPAC such as the pseudodiagonalization, full diagonalization, and density matrix assembling. We have shown that it is possible to obtain large speedups just by using CPU serial linear algebra libraries in the MOPAC code. As a special case, we show a speedup of up to 14 times for a methanol simulation box containing 2400 atoms and 4800 basis functions, with even greater gains in performance when using multithreaded CPUs (2.1 times in relation to the single-threaded CPU code using linear algebra libraries) and GPUs (3.8 times). This degree of acceleration opens new perspectives for modeling larger structures which appear in inorganic chemistry (such as zeolites and MOFs), biochemistry (such as polysaccharides, small proteins, and DNA fragments), and materials science (such as nanotubes and fullerenes). In addition, we believe that this parallel (GPU-GPU) MOPAC code will make it feasible to use semiempirical methods in lengthy molecular simulations using both hybrid QM/MM and QM/QM potentials.

  18. A Parallel Algorithm for Contact in a Finite Element Hydrocode

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Pierce, Timothy G.

    A parallel algorithm is developed for contact/impact of multiple three dimensional bodies undergoing large deformation. As time progresses the relative positions of contact between the multiple bodies changes as collision and sliding occurs. The parallel algorithm is capable of tracking these changes and enforcing an impenetrability constraint and momentum transfer across the surfaces in contact. Portions of the various surfaces of the bodies are assigned to the processors of a distributed-memory parallel machine in an arbitrary fashion, known as the primary decomposition. A secondary, dynamic decomposition is utilized to bring opposing sections of the contacting surfaces together on the samemore » processors, so that opposing forces may be balanced and the resultant deformation of the bodies calculated. The secondary decomposition is accomplished and updated using only local communication with a limited subset of neighbor processors. Each processor represents both a domain of the primary decomposition and a domain of the secondary, or contact, decomposition. Thus each processor has four sets of neighbor processors: (a) those processors which represent regions adjacent to it in the primary decomposition, (b) those processors which represent regions adjacent to it in the contact decomposition, (c) those processors which send it the data from which it constructs its contact domain, and (d) those processors to which it sends its primary domain data, from which they construct their contact domains. The latter three of these neighbor sets change dynamically as the simulation progresses. By constraining all communication to these sets of neighbors, all global communication, with its attendant nonscalable performance, is avoided. A set of tests are provided to measure the degree of scalability achieved by this algorithm on up to 1024 processors. Issues related to the operating system of the test platform which lead to some degradation of the results are analyzed. This algorithm has been implemented as the contact capability of the ALE3D multiphysics code, and is currently in production use.« less

  19. FPGA wavelet processor design using language for instruction-set architectures (LISA)

    NASA Astrophysics Data System (ADS)

    Meyer-Bäse, Uwe; Vera, Alonzo; Rao, Suhasini; Lenk, Karl; Pattichis, Marios

    2007-04-01

    The design of an microprocessor is a long, tedious, and error-prone task consisting of typically three design phases: architecture exploration, software design (assembler, linker, loader, profiler), architecture implementation (RTL generation for FPGA or cell-based ASIC) and verification. The Language for instruction-set architectures (LISA) allows to model a microprocessor not only from instruction-set but also from architecture description including pipelining behavior that allows a design and development tool consistency over all levels of the design. To explore the capability of the LISA processor design platform a.k.a. CoWare Processor Designer we present in this paper three microprocessor designs that implement a 8/8 wavelet transform processor that is typically used in today's FBI fingerprint compression scheme. We have designed a 3 stage pipelined 16 bit RISC processor (NanoBlaze). Although RISC μPs are usually considered "fast" processors due to design concept like constant instruction word size, deep pipelines and many general purpose registers, it turns out that DSP operations consume essential processing time in a RISC processor. In a second step we have used design principles from programmable digital signal processor (PDSP) to improve the throughput of the DWT processor. A multiply-accumulate operation along with indirect addressing operation were the key to achieve higher throughput. A further improvement is possible with today's FPGA technology. Today's FPGAs offer a large number of embedded array multipliers and it is now feasible to design a "true" vector processor (TVP). A multiplication of two vectors can be done in just one clock cycle with our TVP, a complete scalar product in two clock cycles. Code profiling and Xilinx FPGA ISE synthesis results are provided that demonstrate the essential improvement that a TVP has compared with traditional RISC or PDSP designs.

  20. Automobile Crash Sensor Signal Processor

    DOT National Transportation Integrated Search

    1973-11-01

    The crash sensor signal processor described interfaces between an automobile-installed doppler radar and an air bag activating solenoid or equivalent electromechanical device. The processor utilizes both digital and analog techniques to produce an ou...

Top