NASA Astrophysics Data System (ADS)
Zhang, Yuli; Han, Jun; Weng, Xinqian; He, Zhongzhu; Zeng, Xiaoyang
This paper presents an Application Specific Instruction-set Processor (ASIP) for the SHA-3 BLAKE algorithm family by instruction set extensions (ISE) from an RISC (reduced instruction set computer) processor. With a design space exploration for this ASIP to increase the performance and reduce the area cost, we accomplish an efficient hardware and software implementation of BLAKE algorithm. The special instructions and their well-matched hardware function unit improve the calculation of the key section of the algorithm, namely G-functions. Also, relaxing the time constraint of the special function unit can decrease its hardware cost, while keeping the high data throughput of the processor. Evaluation results reveal the ASIP achieves 335Mbps and 176Mbps for BLAKE-256 and BLAKE-512. The extra area cost is only 8.06k equivalent gates. The proposed ASIP outperforms several software approaches on various platforms in cycle per byte. In fact, both high throughput and low hardware cost achieved by this programmable processor are comparable to that of ASIC implementations.
Gschwind, Michael K [Chappaqua, NY
2011-03-01
Mechanisms for implementing a floating point only single instruction multiple data instruction set architecture are provided. A processor is provided that comprises an issue unit, an execution unit coupled to the issue unit, and a vector register file coupled to the execution unit. The execution unit has logic that implements a floating point (FP) only single instruction multiple data (SIMD) instruction set architecture (ISA). The floating point vector registers of the vector register file store both scalar and floating point values as vectors having a plurality of vector elements. The processor may be part of a data processing system.
FPGA wavelet processor design using language for instruction-set architectures (LISA)
NASA Astrophysics Data System (ADS)
Meyer-Bäse, Uwe; Vera, Alonzo; Rao, Suhasini; Lenk, Karl; Pattichis, Marios
2007-04-01
The design of an microprocessor is a long, tedious, and error-prone task consisting of typically three design phases: architecture exploration, software design (assembler, linker, loader, profiler), architecture implementation (RTL generation for FPGA or cell-based ASIC) and verification. The Language for instruction-set architectures (LISA) allows to model a microprocessor not only from instruction-set but also from architecture description including pipelining behavior that allows a design and development tool consistency over all levels of the design. To explore the capability of the LISA processor design platform a.k.a. CoWare Processor Designer we present in this paper three microprocessor designs that implement a 8/8 wavelet transform processor that is typically used in today's FBI fingerprint compression scheme. We have designed a 3 stage pipelined 16 bit RISC processor (NanoBlaze). Although RISC μPs are usually considered "fast" processors due to design concept like constant instruction word size, deep pipelines and many general purpose registers, it turns out that DSP operations consume essential processing time in a RISC processor. In a second step we have used design principles from programmable digital signal processor (PDSP) to improve the throughput of the DWT processor. A multiply-accumulate operation along with indirect addressing operation were the key to achieve higher throughput. A further improvement is possible with today's FPGA technology. Today's FPGAs offer a large number of embedded array multipliers and it is now feasible to design a "true" vector processor (TVP). A multiplication of two vectors can be done in just one clock cycle with our TVP, a complete scalar product in two clock cycles. Code profiling and Xilinx FPGA ISE synthesis results are provided that demonstrate the essential improvement that a TVP has compared with traditional RISC or PDSP designs.
Diversification of Processors Based on Redundancy in Instruction Set
NASA Astrophysics Data System (ADS)
Ichikawa, Shuichi; Sawada, Takashi; Hata, Hisashi
By diversifying processor architecture, computer software is expected to be more resistant to plagiarism, analysis, and attacks. This study presents a new method to diversify instruction set architecture (ISA) by utilizing the redundancy in the instruction set. Our method is particularly suited for embedded systems implemented with FPGA technology, and realizes a genuine instruction set randomization, which has not been provided by the preceding studies. The evaluation results on four typical ISAs indicate that our scheme can provide a far larger degree of freedom than the preceding studies. Diversified processors based on MIPS architecture were actually implemented and evaluated with Xilinx Spartan-3 FPGA. The increase of logic scale was modest: 5.1% in Specialized design and 3.6% in RAM-mapped design. The performance overhead was also modest: 3.4% in Specialized design and 11.6% in RAM-mapped design. From these results, our scheme is regarded as a practical and promising way to secure FPGA-based embedded systems.
Hypercluster Parallel Processor
NASA Technical Reports Server (NTRS)
Blech, Richard A.; Cole, Gary L.; Milner, Edward J.; Quealy, Angela
1992-01-01
Hypercluster computer system includes multiple digital processors, operation of which coordinated through specialized software. Configurable according to various parallel-computing architectures of shared-memory or distributed-memory class, including scalar computer, vector computer, reduced-instruction-set computer, and complex-instruction-set computer. Designed as flexible, relatively inexpensive system that provides single programming and operating environment within which one can investigate effects of various parallel-computing architectures and combinations on performance in solution of complicated problems like those of three-dimensional flows in turbomachines. Hypercluster software and architectural concepts are in public domain.
System support software for the Space Ultrareliable Modular Computer (SUMC)
NASA Technical Reports Server (NTRS)
Hill, T. E.; Hintze, G. C.; Hodges, B. C.; Austin, F. A.; Buckles, B. P.; Curran, R. T.; Lackey, J. D.; Payne, R. E.
1974-01-01
The highly transportable programming system designed and implemented to support the development of software for the Space Ultrareliable Modular Computer (SUMC) is described. The SUMC system support software consists of program modules called processors. The initial set of processors consists of the supervisor, the general purpose assembler for SUMC instruction and microcode input, linkage editors, an instruction level simulator, a microcode grid print processor, and user oriented utility programs. A FORTRAN 4 compiler is undergoing development. The design facilitates the addition of new processors with a minimum effort and provides the user quasi host independence on the ground based operational software development computer. Additional capability is provided to accommodate variations in the SUMC architecture without consequent major modifications in the initial processors.
A debugger-interpreter with setup facilities for assembly programs
DOE Office of Scientific and Technical Information (OSTI.GOV)
Dolinskii, I.S.; Zisel`man, I.M.; Belotskii, S.L.
1995-11-01
In this paper a software program allowing one to introduce and debug the descriptions of the von Nuemann architecture processors and their assemblers, efficiently debug assembly programs, and investigate the instruction sets of the described processors is considered. For a description of the processor sematics and assembler syntax, a metassembly language is suggested.
A Simple and Affordable TTL Processor for the Classroom
ERIC Educational Resources Information Center
Feinberg, Dave
2007-01-01
This paper presents a simple 4 bit computer processor design that may be built using TTL chips for less than $65. In addition to describing the processor itself in detail, we discuss our experience using the laboratory kit and its associated machine instruction set to teach computer architecture to high school students. (Contains 3 figures and 5…
NASA Technical Reports Server (NTRS)
Fijany, Amir (Inventor); Bejczy, Antal K. (Inventor)
1994-01-01
In a computer having a large number of single-instruction multiple data (SIMD) processors, each of the SIMD processors has two sets of three individual processor elements controlled by a master control unit and interconnected among a plurality of register file units where data is stored. The register files input and output data in synchronism with a minor cycle clock under control of two slave control units controlling the register file units connected to respective ones of the two sets of processor elements. Depending upon which ones of the register file units are enabled to store or transmit data during a particular minor clock cycle, the processor elements within an SIMD processor are connected in rings or in pipeline arrays, and may exchange data with the internal bus or with neighboring SIMD processors through interface units controlled by respective ones of the two slave control units.
Development for SSV on a parallel processing system (PARAGON)
NASA Astrophysics Data System (ADS)
Gothard, Benny M.; Allmen, Mark; Carroll, Michael J.; Rich, Dan
1995-12-01
A goal of the surrogate semi-autonomous vehicle (SSV) program is to have multiple vehicles navigate autonomously and cooperatively with other vehicles. This paper describes the process and tools used in porting UGV/SSV (unmanned ground vehicle) autonomous mobility and target recognition algorithms from a SISD (single instruction single data) processor architecture (i.e., a Sun SPARC workstation running C/UNIX) to a MIMD (multiple instruction multiple data) parallel processor architecture (i.e., PARAGON-a parallel set of i860 processors running C/UNIX). It discusses the gains in performance and the pitfalls of such a venture. It also examines the merits of this processor architecture (based on this conceptual prototyping effort) and programming paradigm to meet the final SSV demonstration requirements.
Ordering of guarded and unguarded stores for no-sync I/O
Gara, Alan; Ohmacht, Martin
2013-06-25
A parallel computing system processes at least one store instruction. A first processor core issues a store instruction. A first queue, associated with the first processor core, stores the store instruction. A second queue, associated with a first local cache memory device of the first processor core, stores the store instruction. The first processor core updates first data in the first local cache memory device according to the store instruction. The third queue, associated with at least one shared cache memory device, stores the store instruction. The first processor core invalidates second data, associated with the store instruction, in the at least one shared cache memory. The first processor core invalidates third data, associated with the store instruction, in other local cache memory devices of other processor cores. The first processor core flushing only the first queue.
Unaligned instruction relocation
DOE Office of Scientific and Technical Information (OSTI.GOV)
Bertolli, Carlo; O'Brien, John K.; Sallenave, Olivier H.
In one embodiment, a computer-implemented method includes receiving source code to be compiled into an executable file for an unaligned instruction set architecture (ISA). Aligned assembled code is generated, by a computer processor. The aligned assembled code complies with an aligned ISA and includes aligned processor code for a processor and aligned accelerator code for an accelerator. A first linking pass is performed on the aligned assembled code, including relocating a first relocation target in the aligned accelerator code that refers to a first object outside the aligned accelerator code. Unaligned assembled code is generated in accordance with the unalignedmore » ISA and includes unaligned accelerator code for the accelerator and unaligned processor code for the processor. A second linking pass is performed on the unaligned assembled code, including relocating a second relocation target outside the unaligned accelerator code that refers to an object in the unaligned accelerator code.« less
Unaligned instruction relocation
Bertolli, Carlo; O'Brien, John K.; Sallenave, Olivier H.; Sura, Zehra N.
2018-01-23
In one embodiment, a computer-implemented method includes receiving source code to be compiled into an executable file for an unaligned instruction set architecture (ISA). Aligned assembled code is generated, by a computer processor. The aligned assembled code complies with an aligned ISA and includes aligned processor code for a processor and aligned accelerator code for an accelerator. A first linking pass is performed on the aligned assembled code, including relocating a first relocation target in the aligned accelerator code that refers to a first object outside the aligned accelerator code. Unaligned assembled code is generated in accordance with the unaligned ISA and includes unaligned accelerator code for the accelerator and unaligned processor code for the processor. A second linking pass is performed on the unaligned assembled code, including relocating a second relocation target outside the unaligned accelerator code that refers to an object in the unaligned accelerator code.
Generic accelerated sequence alignment in SeqAn using vectorization and multi-threading.
Rahn, René; Budach, Stefan; Costanza, Pascal; Ehrhardt, Marcel; Hancox, Jonny; Reinert, Knut
2018-05-03
Pairwise sequence alignment is undoubtedly a central tool in many bioinformatics analyses. In this paper, we present a generically accelerated module for pairwise sequence alignments applicable for a broad range of applications. In our module, we unified the standard dynamic programming kernel used for pairwise sequence alignments and extended it with a generalized inter-sequence vectorization layout, such that many alignments can be computed simultaneously by exploiting SIMD (Single Instruction Multiple Data) instructions of modern processors. We then extended the module by adding two layers of thread-level parallelization, where we a) distribute many independent alignments on multiple threads and b) inherently parallelize a single alignment computation using a work stealing approach producing a dynamic wavefront progressing along the minor diagonal. We evaluated our alignment vectorization and parallelization on different processors, including the newest Intel® Xeon® (Skylake) and Intel® Xeon Phi™ (KNL) processors, and use cases. The instruction set AVX512-BW (Byte and Word), available on Skylake processors, can genuinely improve the performance of vectorized alignments. We could run single alignments 1600 times faster on the Xeon Phi™ and 1400 times faster on the Xeon® than executing them with our previous sequential alignment module. The module is programmed in C++ using the SeqAn (Reinert et al., 2017) library and distributed with version 2.4. under the BSD license. We support SSE4, AVX2, AVX512 instructions and included UME::SIMD, a SIMD-instruction wrapper library, to extend our module for further instruction sets. We thoroughly test all alignment components with all major C++ compilers on various platforms. rene.rahn@fu-berlin.de.
Rapid prototyping and evaluation of programmable SIMD SDR processors in LISA
NASA Astrophysics Data System (ADS)
Chen, Ting; Liu, Hengzhu; Zhang, Botao; Liu, Dongpei
2013-03-01
With the development of international wireless communication standards, there is an increase in computational requirement for baseband signal processors. Time-to-market pressure makes it impossible to completely redesign new processors for the evolving standards. Due to its high flexibility and low power, software defined radio (SDR) digital signal processors have been proposed as promising technology to replace traditional ASIC and FPGA fashions. In addition, there are large numbers of parallel data processed in computation-intensive functions, which fosters the development of single instruction multiple data (SIMD) architecture in SDR platform. So a new way must be found to prototype the SDR processors efficiently. In this paper we present a bit-and-cycle accurate model of programmable SIMD SDR processors in a machine description language LISA. LISA is a language for instruction set architecture which can gain rapid model at architectural level. In order to evaluate the availability of our proposed processor, three common baseband functions, FFT, FIR digital filter and matrix multiplication have been mapped on the SDR platform. Analytical results showed that the SDR processor achieved the maximum of 47.1% performance boost relative to the opponent processor.
Parallel machine architecture for production rule systems
Allen, Jr., John D.; Butler, Philip L.
1989-01-01
A parallel processing system for production rule programs utilizes a host processor for storing production rule right hand sides (RHS) and a plurality of rule processors for storing left hand sides (LHS). The rule processors operate in parallel in the recognize phase of the system recognize -Act Cycle to match their respective LHS's against a stored list of working memory elements (WME) in order to find a self consistent set of WME's. The list of WME is dynamically varied during the Act phase of the system in which the host executes or fires rule RHS's for those rules for which a self-consistent set has been found by the rule processors. The host transmits instructions for creating or deleting working memory elements as dictated by the rule firings until the rule processors are unable to find any further self-consistent working memory element sets at which time the production rule system is halted.
NASA Astrophysics Data System (ADS)
Liu, Fenglai; Kong, Jing
2018-07-01
Unique technical challenges and their solutions for implementing semi-numerical Hartree-Fock exchange on the Phil Processor are discussed, especially concerning the single- instruction-multiple-data type of processing and small cache size. Benchmark calculations on a series of buckyball molecules with various Gaussian basis sets on a Phi processor and a six-core CPU show that the Phi processor provides as much as 12 times of speedup with large basis sets compared with the conventional four-center electron repulsion integration approach performed on the CPU. The accuracy of the semi-numerical scheme is also evaluated and found to be comparable to that of the resolution-of-identity approach.
An Efficient Functional Test Generation Method For Processors Using Genetic Algorithms
NASA Astrophysics Data System (ADS)
Hudec, Ján; Gramatová, Elena
2015-07-01
The paper presents a new functional test generation method for processors testing based on genetic algorithms and evolutionary strategies. The tests are generated over an instruction set architecture and a processor description. Such functional tests belong to the software-oriented testing. Quality of the tests is evaluated by code coverage of the processor description using simulation. The presented test generation method uses VHDL models of processors and the professional simulator ModelSim. The rules, parameters and fitness functions were defined for various genetic algorithms used in automatic test generation. Functionality and effectiveness were evaluated using the RISC type processor DP32.
NASA Astrophysics Data System (ADS)
Hayakawa, Hitoshi; Ogawa, Makoto; Shibata, Tadashi
2005-04-01
A very large scale integrated circuit (VLSI) architecture for a multiple-instruction-stream multiple-data-stream (MIMD) associative processor has been proposed. The processor employs an architecture that enables seamless switching from associative operations to arithmetic operations. The MIMD element is convertible to a regular central processing unit (CPU) while maintaining its high performance as an associative processor. Therefore, the MIMD associative processor can perform not only on-chip perception, i.e., searching for the vector most similar to an input vector throughout the on-chip cache memory, but also arithmetic and logic operations similar to those in ordinary CPUs, both simultaneously in parallel processing. Three key technologies have been developed to generate the MIMD element: associative-operation-and-arithmetic-operation switchable calculation units, a versatile register control scheme within the MIMD element for flexible operations, and a short instruction set for minimizing the memory size for program storage. Key circuit blocks were designed and fabricated using 0.18 μm complementary metal-oxide-semiconductor (CMOS) technology. As a result, the full-featured MIMD element is estimated to be 3 mm2, showing the feasibility of an 8-parallel-MIMD-element associative processor in a single chip of 5 mm× 5 mm.
Application of Prognostic Health Management in Digital Electronic Systems
2007-01-01
variable external supply applied the necessary core power to the processor while the motherboard continued to source power from the ATX supply. By...isolating the processor power from the motherboard power , control over the aging profile of the processor was achieved. Once nominal operating...Physics-of-failure RISC – Reduced Instruction Set Computer RUL – Remaining Useful Life 1 1-4244-0525-4/07/$20.00 ©2007 IEEE. Paper 1326
Measurement of fault latency in a digital avionic mini processor, part 2
NASA Technical Reports Server (NTRS)
Mcgough, J.; Swern, F.
1983-01-01
The results of fault injection experiments utilizing a gate-level emulation of the central processor unit of the Bendix BDX-930 digital computer are described. Several earlier programs were reprogrammed, expanding the instruction set to capitalize on the full power of the BDX-930 computer. As a final demonstration of fault coverage an extensive, 3-axis, high performance flght control computation was added. The stages in the development of a CPU self-test program emphasizing the relationship between fault coverage, speed, and quantity of instructions were demonstrated.
Instruction-level performance modeling and characterization of multimedia applications
DOE Office of Scientific and Technical Information (OSTI.GOV)
Luo, Y.; Cameron, K.W.
1999-06-01
One of the challenges for characterizing and modeling realistic multimedia applications is the lack of access to source codes. On-chip performance counters effectively resolve this problem by monitoring run-time behaviors at the instruction-level. This paper presents a novel technique of characterizing and modeling workloads at the instruction level for realistic multimedia applications using hardware performance counters. A variety of instruction counts are collected from some multimedia applications, such as RealPlayer, GSM Vocoder, MPEG encoder/decoder, and speech synthesizer. These instruction counts can be used to form a set of abstract characteristic parameters directly related to a processor`s architectural features. Based onmore » microprocessor architectural constraints and these calculated abstract parameters, the architectural performance bottleneck for a specific application can be estimated. Meanwhile, the bottleneck estimation can provide suggestions about viable architectural/functional improvement for certain workloads. The biggest advantage of this new characterization technique is a better understanding of processor utilization efficiency and architectural bottleneck for each application. This technique also provides predictive insight of future architectural enhancements and their affect on current codes. In this paper the authors also attempt to model architectural effect on processor utilization without memory influence. They derive formulas for calculating CPI{sub 0}, CPI without memory effect, and they quantify utilization of architectural parameters. These equations are architecturally diagnostic and predictive in nature. Results provide promise in code characterization, and empirical/analytical modeling.« less
Grafe, Victor G.; Hoch, James E.
1993-01-01
A sequencing and data fanout mechanism is provided for a dataflow processor is activated by an input token which causes a sequence of operations to occur by initiating a first instruction to act on data contained within the token and then executing a sequential thread of instructions identified by either a repeat count and an offset within the token, or by an offset within each preceding instruction.
A Micro-Processor Based System as a Teaching Tool.
ERIC Educational Resources Information Center
Spero, Samuel W.
1979-01-01
Two instructional strategies incorporating a microprocessor-based computer system are described. These are the use of the system to drive a television monitor, and the system's use in generating problem sets. (MP)
An innovative on-board processor for lightsats
NASA Technical Reports Server (NTRS)
Henshaw, R. M.; Ballard, B. W.; Hayes, J. R.; Lohr, D. A.
1990-01-01
The Applied Physics Laboratory (APL) has developed a flightworthy custom microprocessor that increases capability and reduces development costs of lightsat science instruments. This device, called the FRISC (FORTH Reduced Instruction Set Computer), directly executes the high-level language called FORTH, which is ideally suited to the multitasking control and data processing environment of a spaceborne instrument processor. The FRISC will be flown as the onboard processor in the Magnetic Field Experiment on the Freja satllite. APL has achieved a significant increase in onboard processing capability with no increase in cost when compared to the magnetometer instrument on Freja's predecessor, the Viking satellite.
A 32-bit NMOS microprocessor with a large register file
NASA Astrophysics Data System (ADS)
Sherburne, R. W., Jr.; Katevenis, M. G. H.; Patterson, D. A.; Sequin, C. H.
1984-10-01
Two scaled versions of a 32-bit NMOS reduced instruction set computer CPU, called RISC II, have been implemented on two different processing lines using the simple Mead and Conway layout rules with lambda values of 2 and 1.5 microns (corresponding to drawn gate lengths of 4 and 3 microns), respectively. The design utilizes a small set of simple instructions in conjunction with a large register file in order to provide high performance. This approach has resulted in two surprisingly powerful single-chip processors.
NASA Technical Reports Server (NTRS)
Barnes, George H. (Inventor); Lundstrom, Stephen F. (Inventor); Shafer, Philip E. (Inventor)
1983-01-01
A high speed parallel array data processing architecture fashioned under a computational envelope approach includes a data base memory for secondary storage of programs and data, and a plurality of memory modules interconnected to a plurality of processing modules by a connection network of the Omega gender. Programs and data are fed from the data base memory to the plurality of memory modules and from hence the programs are fed through the connection network to the array of processors (one copy of each program for each processor). Execution of the programs occur with the processors operating normally quite independently of each other in a multiprocessing fashion. For data dependent operations and other suitable operations, all processors are instructed to finish one given task or program branch before all are instructed to proceed in parallel processing fashion on the next instruction. Even when functioning in the parallel processing mode however, the processors are not locked-step but execute their own copy of the program individually unless or until another overall processor array synchronization instruction is issued.
NASA Astrophysics Data System (ADS)
Tanikawa, Ataru; Yoshikawa, Kohji; Nitadori, Keigo; Okamoto, Takashi
2013-02-01
We have developed a numerical software library for collisionless N-body simulations named "Phantom-GRAPE" which highly accelerates force calculations among particles by use of a new SIMD instruction set extension to the x86 architecture, Advanced Vector eXtensions (AVX), an enhanced version of the Streaming SIMD Extensions (SSE). In our library, not only the Newton's forces, but also central forces with an arbitrary shape f(r), which has a finite cutoff radius rcut (i.e. f(r)=0 at r>rcut), can be quickly computed. In computing such central forces with an arbitrary force shape f(r), we refer to a pre-calculated look-up table. We also present a new scheme to create the look-up table whose binning is optimal to keep good accuracy in computing forces and whose size is small enough to avoid cache misses. Using an Intel Core i7-2600 processor, we measure the performance of our library for both of the Newton's forces and the arbitrarily shaped central forces. In the case of Newton's forces, we achieve 2×109 interactions per second with one processor core (or 75 GFLOPS if we count 38 operations per interaction), which is 20 times higher than the performance of an implementation without any explicit use of SIMD instructions, and 2 times than that with the SSE instructions. With four processor cores, we obtain the performance of 8×109 interactions per second (or 300 GFLOPS). In the case of the arbitrarily shaped central forces, we can calculate 1×109 and 4×109 interactions per second with one and four processor cores, respectively. The performance with one processor core is 6 times and 2 times higher than those of the implementations without any use of SIMD instructions and with the SSE instructions. These performances depend only weakly on the number of particles, irrespective of the force shape. It is good contrast with the fact that the performance of force calculations accelerated by graphics processing units (GPUs) depends strongly on the number of particles. Substantially weak dependence of the performance on the number of particles is suitable to collisionless N-body simulations, since these simulations are usually performed with sophisticated N-body solvers such as Tree- and TreePM-methods combined with an individual timestep scheme. We conclude that collisionless N-body simulations accelerated with our library have significant advantage over those accelerated by GPUs, especially on massively parallel environments.
A 20 MHz CMOS reorder buffer for a superscalar microprocessor
NASA Technical Reports Server (NTRS)
Lenell, John; Wallace, Steve; Bagherzadeh, Nader
1992-01-01
Superscalar processors can achieve increased performance by issuing instructions out-of-order from the original sequential instruction stream. Implementing an out-of-order instruction issue policy requires a hardware mechanism to prevent incorrectly executed instructions from updating register values. A reorder buffer can be used to allow a superscalar processor to issue instructions out-of-order and maintain program correctness. This paper describes the design and implementation of a 20MHz CMOS reorder buffer for superscalar processors. The reorder buffer is designed to accept and retire two instructions per cycle. A full-custom layout in 1.2 micron has been implemented, measuring 1.1058 mm by 1.3542 mm.
NASA Astrophysics Data System (ADS)
González, Diego; Botella, Guillermo; García, Carlos; Prieto, Manuel; Tirado, Francisco
2013-12-01
This contribution focuses on the optimization of matching-based motion estimation algorithms widely used for video coding standards using an Altera custom instruction-based paradigm and a combination of synchronous dynamic random access memory (SDRAM) with on-chip memory in Nios II processors. A complete profile of the algorithms is achieved before the optimization, which locates code leaks, and afterward, creates a custom instruction set, which is then added to the specific design, enhancing the original system. As well, every possible memory combination between on-chip memory and SDRAM has been tested to achieve the best performance. The final throughput of the complete designs are shown. This manuscript outlines a low-cost system, mapped using very large scale integration technology, which accelerates software algorithms by converting them into custom hardware logic blocks and showing the best combination between on-chip memory and SDRAM for the Nios II processor.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Zhang, Yao; Balaprakash, Prasanna; Meng, Jiayuan
We present Raexplore, a performance modeling framework for architecture exploration. Raexplore enables rapid, automated, and systematic search of architecture design space by combining hardware counter-based performance characterization and analytical performance modeling. We demonstrate Raexplore for two recent manycore processors IBM Blue- Gene/Q compute chip and Intel Xeon Phi, targeting a set of scientific applications. Our framework is able to capture complex interactions between architectural components including instruction pipeline, cache, and memory, and to achieve a 3–22% error for same-architecture and cross-architecture performance predictions. Furthermore, we apply our framework to assess the two processors, and discover and evaluate a list ofmore » architectural scaling options for future processor designs.« less
Reconfigurable Very Long Instruction Word (VLIW) Processor
NASA Technical Reports Server (NTRS)
Velev, Miroslav N.
2015-01-01
Future NASA missions will depend on radiation-hardened, power-efficient processing systems-on-a-chip (SOCs) that consist of a range of processor cores custom tailored for space applications. Aries Design Automation, LLC, has developed a processing SOC that is optimized for software-defined radio (SDR) uses. The innovation implements the Institute of Electrical and Electronics Engineers (IEEE) RazorII voltage management technique, a microarchitectural mechanism that allows processor cores to self-monitor, self-analyze, and selfheal after timing errors, regardless of their cause (e.g., radiation; chip aging; variations in the voltage, frequency, temperature, or manufacturing process). This highly automated SOC can also execute legacy PowerPC 750 binary code instruction set architecture (ISA), which is used in the flight-control computers of many previous NASA space missions. In developing this innovation, Aries Design Automation has made significant contributions to the fields of formal verification of complex pipelined microprocessors and Boolean satisfiability (SAT) and has developed highly efficient electronic design automation tools that hold promise for future developments.
NASA Astrophysics Data System (ADS)
Tanikawa, Ataru; Yoshikawa, Kohji; Okamoto, Takashi; Nitadori, Keigo
2012-02-01
We present a high-performance N-body code for self-gravitating collisional systems accelerated with the aid of a new SIMD instruction set extension of the x86 architecture: Advanced Vector eXtensions (AVX), an enhanced version of the Streaming SIMD Extensions (SSE). With one processor core of Intel Core i7-2600 processor (8 MB cache and 3.40 GHz) based on Sandy Bridge micro-architecture, we implemented a fourth-order Hermite scheme with individual timestep scheme ( Makino and Aarseth, 1992), and achieved the performance of ˜20 giga floating point number operations per second (GFLOPS) for double-precision accuracy, which is two times and five times higher than that of the previously developed code implemented with the SSE instructions ( Nitadori et al., 2006b), and that of a code implemented without any explicit use of SIMD instructions with the same processor core, respectively. We have parallelized the code by using so-called NINJA scheme ( Nitadori et al., 2006a), and achieved ˜90 GFLOPS for a system containing more than N = 8192 particles with 8 MPI processes on four cores. We expect to achieve about 10 tera FLOPS (TFLOPS) for a self-gravitating collisional system with N ˜ 10 5 on massively parallel systems with at most 800 cores with Sandy Bridge micro-architecture. This performance will be comparable to that of Graphic Processing Unit (GPU) cluster systems, such as the one with about 200 Tesla C1070 GPUs ( Spurzem et al., 2010). This paper offers an alternative to collisional N-body simulations with GRAPEs and GPUs.
Hardware accuracy counters for application precision and quality feedback
DOE Office of Scientific and Technical Information (OSTI.GOV)
de Paula Rosa Piga, Leonardo; Majumdar, Abhinandan; Paul, Indrani
Methods, devices, and systems for capturing an accuracy of an instruction executing on a processor. An instruction may be executed on the processor, and the accuracy of the instruction may be captured using a hardware counter circuit. The accuracy of the instruction may be captured by analyzing bits of at least one value of the instruction to determine a minimum or maximum precision datatype for representing the field, and determining whether to adjust a value of the hardware counter circuit accordingly. The representation may be output to a debugger or logfile for use by a developer, or may be outputmore » to a runtime or virtual machine to automatically adjust instruction precision or gating of portions of the processor datapath.« less
Stateless and stateful implementations of faithful execution
Pierson, Lyndon G; Witzke, Edward L; Tarman, Thomas D; Robertson, Perry J; Eldridge, John M; Campbell, Philip L
2014-12-16
A faithful execution system includes system memory, a target processor, and protection engine. The system memory stores a ciphertext including value fields and integrity fields. The value fields each include an encrypted executable instruction and the integrity fields each include an encrypted integrity value for determining whether a corresponding one of the value fields has been modified. The target processor executes plaintext instructions decoded from the ciphertext while the protection engine is coupled between the system memory and the target processor. The protection engine includes logic to retrieve the ciphertext from the system memory, decrypt the value fields into the plaintext instructions, perform an integrity check based on the integrity fields to determine whether any of the corresponding value fields have been modified, and provide the plaintext instructions to the target processor for execution.
Processor-in-memory-and-storage architecture
DOE Office of Scientific and Technical Information (OSTI.GOV)
DeBenedictis, Erik
A method and apparatus for performing reliable general-purpose computing. Each sub-core of a plurality of sub-cores of a processor core processes a same instruction at a same time. A code analyzer receives a plurality of residues that represents a code word corresponding to the same instruction and an indication of whether the code word is a memory address code or a data code from the plurality of sub-cores. The code analyzer determines whether the plurality of residues are consistent or inconsistent. The code analyzer and the plurality of sub-cores perform a set of operations based on whether the code wordmore » is a memory address code or a data code and a determination of whether the plurality of residues are consistent or inconsistent.« less
Multitasking OS manages a team of processors
DOE Office of Scientific and Technical Information (OSTI.GOV)
Ripps, D.L.
1983-07-21
MTOS-68k is a real-time multitasking operating system designed for the popular MC68000 microprocessors. It aproaches task coordination and synchronization in a fashion that matches uniquely the structural simplicity and regularity of the 68000 instruction set. Since in many 68000 applications the speed and power of one CPU are not enough, MTOS-68k has been designed to support multiple processors, as well as multiple tasks. Typically, the devices are tightly coupled single-board computers, that is they share a backplane and parts of global memory.
Computer-Based Practice in Editing.
ERIC Educational Resources Information Center
Cronnell, Bruce
One goal of computer-based instruction in writing is to help students to edit their compositions, particularly those compositions written on a word processor. This can be accomplished by a complete editing program that would contain the full set of mechanics rules--capitalization, punctuation, spelling, usage--appropriate for the grade level of…
Content addressable memory project
NASA Technical Reports Server (NTRS)
Hall, Josh; Levy, Saul; Smith, D.; Wei, S.; Miyake, K.; Murdocca, M.
1991-01-01
The progress on the Rutgers CAM (Content Addressable Memory) Project is described. The overall design of the system is completed at the architectural level and described. The machine is composed of two kinds of cells: (1) the CAM cells which include both memory and processor, and support local processing within each cell; and (2) the tree cells, which have smaller instruction set, and provide global processing over the CAM cells. A parameterized design of the basic CAM cell is completed. Progress was made on the final specification of the CPS. The machine architecture was driven by the design of algorithms whose requirements are reflected in the resulted instruction set(s). A few of these algorithms are described.
Effective Vectorization with OpenMP 4.5
DOE Office of Scientific and Technical Information (OSTI.GOV)
Huber, Joseph N.; Hernandez, Oscar R.; Lopez, Matthew Graham
This paper describes how the Single Instruction Multiple Data (SIMD) model and its extensions in OpenMP work, and how these are implemented in different compilers. Modern processors are highly parallel computational machines which often include multiple processors capable of executing several instructions in parallel. Understanding SIMD and executing instructions in parallel allows the processor to achieve higher performance without increasing the power required to run it. SIMD instructions can significantly reduce the runtime of code by executing a single operation on large groups of data. The SIMD model is so integral to the processor s potential performance that, if SIMDmore » is not utilized, less than half of the processor is ever actually used. Unfortunately, using SIMD instructions is a challenge in higher level languages because most programming languages do not have a way to describe them. Most compilers are capable of vectorizing code by using the SIMD instructions, but there are many code features important for SIMD vectorization that the compiler cannot determine at compile time. OpenMP attempts to solve this by extending the C++/C and Fortran programming languages with compiler directives that express SIMD parallelism. OpenMP is used to pass hints to the compiler about the code to be executed in SIMD. This is a key resource for making optimized code, but it does not change whether or not the code can use SIMD operations. However, in many cases critical functions are limited by a poor understanding of how SIMD instructions are actually implemented, as SIMD can be implemented through vector instructions or simultaneous multi-threading (SMT). We have found that it is often the case that code cannot be vectorized, or is vectorized poorly, because the programmer does not have sufficient knowledge of how SIMD instructions work.« less
A GaAs vector processor based on parallel RISC microprocessors
NASA Astrophysics Data System (ADS)
Misko, Tim A.; Rasset, Terry L.
A vector processor architecture based on the development of a 32-bit microprocessor using gallium arsenide (GaAs) technology has been developed. The McDonnell Douglas vector processor (MVP) will be fabricated completely from GaAs digital integrated circuits. The MVP architecture includes a vector memory of 1 megabyte, a parallel bus architecture with eight processing elements connected in parallel, and a control processor. The processing elements consist of a reduced instruction set CPU (RISC) with four floating-point coprocessor units and necessary memory interface functions. This architecture has been simulated for several benchmark programs including complex fast Fourier transform (FFT), complex inner product, trigonometric functions, and sort-merge routine. The results of this study indicate that the MVP can process a 1024-point complex FFT at a speed of 112 microsec (389 megaflops) while consuming approximately 618 W of power in a volume of approximately 0.1 ft-cubed.
Stanford Hardware Development Program
NASA Technical Reports Server (NTRS)
Peterson, A.; Linscott, I.; Burr, J.
1986-01-01
Architectures for high performance, digital signal processing, particularly for high resolution, wide band spectrum analysis were developed. These developments are intended to provide instrumentation for NASA's Search for Extraterrestrial Intelligence (SETI) program. The real time signal processing is both formal and experimental. The efficient organization and optimal scheduling of signal processing algorithms were investigated. The work is complemented by efforts in processor architecture design and implementation. A high resolution, multichannel spectrometer that incorporates special purpose microcoded signal processors is being tested. A general purpose signal processor for the data from the multichannel spectrometer was designed to function as the processing element in a highly concurrent machine. The processor performance required for the spectrometer is in the range of 1000 to 10,000 million instructions per second (MIPS). Multiple node processor configurations, where each node performs at 100 MIPS, are sought. The nodes are microprogrammable and are interconnected through a network with high bandwidth for neighboring nodes, and medium bandwidth for nodes at larger distance. The implementation of both the current mutlichannel spectrometer and the signal processor as Very Large Scale Integration CMOS chip sets was commenced.
Arguing for Computer-Based Composition Instruction.
ERIC Educational Resources Information Center
Larsen, Richard B.
The English teaching profession must accept the fact that the new paradigm for composition involves microcomputers and word processors, and as such calls for a somewhat different set of skills on the part of both students and teachers. The "text editor" can lessen the effects of some physical and psychological constraints on students,…
A Low-Power Instruction Issue Queue for Microprocessors
NASA Astrophysics Data System (ADS)
Watanabe, Shingo; Chiyonobu, Akihiro; Sato, Toshinori
Instruction issue queue is a key component which extracts instruction level parallelism (ILP) in modern out-of-order microprocessors. In order to exploit ILP for improving processor performance, instruction queue size should be increased. However, it is difficult to increase the size, since instruction queue is implemented by a content addressable memory (CAM) whose power and delay are much large. This paper introduces a low power and scalable instruction queue that replaces the CAM with a RAM. In this queue, instructions are explicitly woken up. Evaluation results show that the proposed instruction queue decreases processor performance by only 1.9% on average. Furthermore, the total energy consumption is reduced by 54% on average.
Evaluating local indirect addressing in SIMD proc essors
NASA Technical Reports Server (NTRS)
Middleton, David; Tomboulian, Sherryl
1989-01-01
In the design of parallel computers, there exists a tradeoff between the number and power of individual processors. The single instruction stream, multiple data stream (SIMD) model of parallel computers lies at one extreme of the resulting spectrum. The available hardware resources are devoted to creating the largest possible number of processors, and consequently each individual processor must use the fewest possible resources. Disagreement exists as to whether SIMD processors should be able to generate addresses individually into their local data memory, or all processors should access the same address. The tradeoff is examined between the increased capability and the reduced number of processors that occurs in this single instruction stream, multiple, locally addressed, data (SIMLAD) model. Factors are assembled that affect this design choice, and the SIMLAD model is compared with the bare SIMD and the MIMD models.
Enhancing instruction scheduling with a block-structured ISA
DOE Office of Scientific and Technical Information (OSTI.GOV)
Melvin, S.; Patt, Y.
It is now generally recognized that not enough parallelism exists within the small basic blocks of most general purpose programs to satisfy high performance processors. Thus, a wide variety of techniques have been developed to exploit instruction level parallelism across basic block boundaries. In this paper we discuss some previous techniques along with their hardware and software requirements. Then we propose a new paradigm for an instruction set architecture (ISA): block-structuring. This new paradigm is presented, its hardware and software requirements are discussed and the results from a simulation study are presented. We show that a block-structured ISA utilizes bothmore » dynamic and compile-time mechanisms for exploiting instruction level parallelism and has significant performance advantages over a conventional ISA.« less
Design of a modular digital computer system, CDRL no. D001, final design plan
NASA Technical Reports Server (NTRS)
Easton, R. A.
1975-01-01
The engineering breadboard implementation for the CDRL no. D001 modular digital computer system developed during design of the logic system was documented. This effort followed the architecture study completed and documented previously, and was intended to verify the concepts of a fault tolerant, automatically reconfigurable, modular version of the computer system conceived during the architecture study. The system has a microprogrammed 32 bit word length, general register architecture and an instruction set consisting of a subset of the IBM System 360 instruction set plus additional fault tolerance firmware. The following areas were covered: breadboard packaging, central control element, central processing element, memory, input/output processor, and maintenance/status panel and electronics.
NASA Technical Reports Server (NTRS)
Fijany, Amir (Inventor); Bejczy, Antal K. (Inventor)
1993-01-01
This is a real-time robotic controller and simulator which is a MIMD-SIMD parallel architecture for interfacing with an external host computer and providing a high degree of parallelism in computations for robotic control and simulation. It includes a host processor for receiving instructions from the external host computer and for transmitting answers to the external host computer. There are a plurality of SIMD microprocessors, each SIMD processor being a SIMD parallel processor capable of exploiting fine grain parallelism and further being able to operate asynchronously to form a MIMD architecture. Each SIMD processor comprises a SIMD architecture capable of performing two matrix-vector operations in parallel while fully exploiting parallelism in each operation. There is a system bus connecting the host processor to the plurality of SIMD microprocessors and a common clock providing a continuous sequence of clock pulses. There is also a ring structure interconnecting the plurality of SIMD microprocessors and connected to the clock for providing the clock pulses to the SIMD microprocessors and for providing a path for the flow of data and instructions between the SIMD microprocessors. The host processor includes logic for controlling the RRCS by interpreting instructions sent by the external host computer, decomposing the instructions into a series of computations to be performed by the SIMD microprocessors, using the system bus to distribute associated data among the SIMD microprocessors, and initiating activity of the SIMD microprocessors to perform the computations on the data by procedure call.
Extending the granularity of representation and control for the MIL-STD CAIS 1.0 node model
NASA Technical Reports Server (NTRS)
Rogers, Kathy L.
1986-01-01
The Common APSE (Ada 1 Program Support Environment) Interface Set (CAIS) (DoD85) node model provides an excellent baseline for interfaces in a single-host development environment. To encompass the entire spectrum of computing, however, the CAIS model should be extended in four areas. It should provide the interface between the engineering workstation and the host system throughout the entire lifecycle of the system. It should provide a basis for communication and integration functions needed by distributed host environments. It should provide common interfaces for communications mechanisms to and among target processors. It should provide facilities for integration, validation, and verification of test beds extending to distributed systems on geographically separate processors with heterogeneous instruction set architectures (ISAS). Additions to the PROCESS NODE model to extend the CAIS into these four areas are proposed.
Replication of Space-Shuttle Computers in FPGAs and ASICs
NASA Technical Reports Server (NTRS)
Ferguson, Roscoe C.
2008-01-01
A document discusses the replication of the functionality of the onboard space-shuttle general-purpose computers (GPCs) in field-programmable gate arrays (FPGAs) and application-specific integrated circuits (ASICs). The purpose of the replication effort is to enable utilization of proven space-shuttle flight software and software-development facilities to the extent possible during development of software for flight computers for a new generation of launch vehicles derived from the space shuttles. The replication involves specifying the instruction set of the central processing unit and the input/output processor (IOP) of the space-shuttle GPC in a hardware description language (HDL). The HDL is synthesized to form a "core" processor in an FPGA or, less preferably, in an ASIC. The core processor can be used to create a flight-control card to be inserted into a new avionics computer. The IOP of the GPC as implemented in the core processor could be designed to support data-bus protocols other than that of a multiplexer interface adapter (MIA) used in the space shuttle. Hence, a computer containing the core processor could be tailored to communicate via the space-shuttle GPC bus and/or one or more other buses.
LISP on a Reduced-Instruction-Set-Processor,
1986-01-01
Digital * Press, 1984. 19. Steele, G. L. Jr., and Sussman, G.J. LAMBDA : The Ultimate Imperative. Al Memo 353, MIT, Artificial ,, Inteligence Laboratory...procedure B is No 444, MIT Artificial Intelligence Laboratory, August, recursive, if procedure A can be reexecuted before the call 1977. returns. This...the programs Artificial Intelligence Programming. Lawrence Erlbaum use apply and eval, and of these three only frl uses eval Associates, Hillsdale, New
Design of RISC Processor Using VHDL and Cadence
NASA Astrophysics Data System (ADS)
Moslehpour, Saeid; Puliroju, Chandrasekhar; Abu-Aisheh, Akram
The project deals about development of a basic RISC processor. The processor is designed with basic architecture consisting of internal modules like clock generator, memory, program counter, instruction register, accumulator, arithmetic and logic unit and decoder. This processor is mainly used for simple general purpose like arithmetic operations and which can be further developed for general purpose processor by increasing the size of the instruction register. The processor is designed in VHDL by using Xilinx 8.1i version. The present project also serves as an application of the knowledge gained from past studies of the PSPICE program. The study will show how PSPICE can be used to simplify massive complex circuits designed in VHDL Synthesis. The purpose of the project is to explore the designed RISC model piece by piece, examine and understand the Input/ Output pins, and to show how the VHDL synthesis code can be converted to a simplified PSPICE model. The project will also serve as a collection of various research materials about the pieces of the circuit.
Malleable architecture generator for FPGA computing
NASA Astrophysics Data System (ADS)
Gokhale, Maya; Kaba, James; Marks, Aaron; Kim, Jang
1996-10-01
The malleable architecture generator (MARGE) is a tool set that translates high-level parallel C to configuration bit streams for field-programmable logic based computing systems. MARGE creates an application-specific instruction set and generates the custom hardware components required to perform exactly those computations specified by the C program. In contrast to traditional fixed-instruction processors, MARGE's dynamic instruction set creation provides for efficient use of hardware resources. MARGE processes intermediate code in which each operation is annotated by the bit lengths of the operands. Each basic block (sequence of straight line code) is mapped into a single custom instruction which contains all the operations and logic inherent in the block. A synthesis phase maps the operations comprising the instructions into register transfer level structural components and control logic which have been optimized to exploit functional parallelism and function unit reuse. As a final stage, commercial technology-specific tools are used to generate configuration bit streams for the desired target hardware. Technology- specific pre-placed, pre-routed macro blocks are utilized to implement as much of the hardware as possible. MARGE currently supports the Xilinx-based Splash-2 reconfigurable accelerator and National Semiconductor's CLAy-based parallel accelerator, MAPA. The MARGE approach has been demonstrated on systolic applications such as DNA sequence comparison.
NASA Astrophysics Data System (ADS)
Coffey, Stephen; Connell, Joseph
2005-06-01
This paper presents a development platform for real-time image processing based on the ADSP-BF533 Blackfin processor and the MicroC/OS-II real-time operating system (RTOS). MicroC/OS-II is a completely portable, ROMable, pre-emptive, real-time kernel. The Blackfin Digital Signal Processors (DSPs), incorporating the Analog Devices/Intel Micro Signal Architecture (MSA), are a broad family of 16-bit fixed-point products with a dual Multiply Accumulate (MAC) core. In addition, they have a rich instruction set with variable instruction length and both DSP and MCU functionality thus making them ideal for media based applications. Using the MicroC/OS-II for task scheduling and management, the proposed system can capture and process raw RGB data from any standard 8-bit greyscale image sensor in soft real-time and then display the processed result using a simple PC graphical user interface (GUI). Additionally, the GUI allows configuration of the image capture rate and the system and core DSP clock rates thereby allowing connectivity to a selection of image sensors and memory devices. The GUI also allows selection from a set of image processing algorithms based in the embedded operating system.
NASA Astrophysics Data System (ADS)
Weigand, R.
Two new processor devices have been developed for the use on board of spacecrafts. An 8-bit 8032-microcontroller targets typical controlling applications in instruments and sub-systems, or could be used as a main processor on small satellites, whereas the LEON 32-bit SPARC processor can be used for high performance controlling and data processing tasks. The ADV80S32 is fully compliant to the Intel 80x1 architecture and instruction set, extended by additional peripherals, 512 bytes on-chip RAM and a bootstrap PROM, which allows downloading the application software using the CCSDS PacketWire pro- tocol. The memory controller provides a de-multiplexed address/data bus, and allows to access up to 16 MB data and 8 MB program RAM. The peripherals have been de- signed for the specific needs of a spacecraft, such as serial interfaces compatible to RS232, PacketWire and TTC-B-01, counters/timers for extended duration and a CRC calculation unit accelerating the CCSDS TM/TC protocol. The 0.5 um Atmel manu- facturing technology (MG2RT) provides latch-up and total dose immunity; SEU fault immunity is implemented by using SEU hardened Flip-Flops and EDAC protection of internal and external memories. The maximum clock frequency of 20 MHz allows a processing power of 3 MIPS. Engineering samples are available. For SW develop- ment, various SW packages for the 8051 architecture are on the market. The LEON processor implements a 32-bit SPARC V8 architecture, including all the multiply and divide instructions, complemented by a floating-point unit (FPU). It includes several standard peripherals, such as timers/watchdog, interrupt controller, UARTs, parallel I/Os and a memory controller, allowing to use 8, 16 and 32 bit PROM, SRAM or memory mapped I/O. With on-chip separate instruction and data caches, almost one instruction per clock cycle can be reached in some applications. A 33-MHz 32-bit PCI master/target interface and a PCI arbiter allow operating the device in a plug-in card (for SW development on PC etc.), or to consider using it as a PCI master controller in an on-board system. Advanced SEU fault tolerance is in- troduced by design, using triple modular redundancy (TMR) flip-flops for all registers and EDAC protection for all memories. The device will be manufactured in a radia- tion hard Atmel 0.25 um technology, targeting 100 MHz processor clock frequency. The non fault-tolerant LEON processor VHDL model is available as free source code, and the SPARC architecture is a well-known industry standard. Therefore, know-how, software tools and operating systems are widely available.
Operation of commercial R3000 processors in the low earth orbit (LEO) space environment
NASA Astrophysics Data System (ADS)
Kaschmitter, J. L.; Shaeffer, D. L.; Colella, N. J.; McKnett, C. L.; Coakley, P. G.
1991-12-01
Spacecraft processors must operate with minimal degradation of performance in the LEO radiation environment, which includes the effects of total accumulated ionizing dose and single event phenomena (SEP) caused by protons and cosmic rays. Commercially available microprocessors can offer a number of advantages relative to radiation-hardened devices but are not normally designed to tolerate effects induced by the LEO environment. Extensive testing of the MIPS R3000 Reduced Instruction Set Computer (RISC) microprocessor family for operation in LEO environments is reported. The authors have characterized total dose and SEP effects for altitudes and inclinations of interest to systems operating in LEO, and they postulate techniques for detection and alleviation of SEP effects based on experimental results.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Nash, T.; Atac, R.; Cook, A.
1989-03-06
The ACPMAPS multipocessor is a highly cost effective, local memory parallel computer with a hypercube or compound hypercube architecture. Communication requires the attention of only the two communicating nodes. The design is aimed at floating point intensive, grid like problems, particularly those with extreme computing requirements. The processing nodes of the system are single board array processors, each with a peak power of 20 Mflops, supported by 8 Mbytes of data and 2 Mbytes of instruction memory. The system currently being assembled has a peak power of 5 Gflops. The nodes are based on the Weitek XL Chip set. Themore » system delivers performance at approximately $300/Mflop. 8 refs., 4 figs.« less
Store-operate-coherence-on-value
DOE Office of Scientific and Technical Information (OSTI.GOV)
Chen, Dong; Heidelberger, Philip; Kumar, Sameer
A system, method and computer program product for performing various store-operate instructions in a parallel computing environment that includes a plurality of processors and at least one cache memory device. A queue in the system receives, from a processor, a store-operate instruction that specifies under which condition a cache coherence operation is to be invoked. A hardware unit in the system runs the received store-operate instruction. The hardware unit evaluates whether a result of the running the received store-operate instruction satisfies the condition. The hardware unit invokes a cache coherence operation on a cache memory address associated with the receivedmore » store-operate instruction if the result satisfies the condition. Otherwise, the hardware unit does not invoke the cache coherence operation on the cache memory device.« less
Ultra-Reliable Digital Avionics (URDA) processor
NASA Astrophysics Data System (ADS)
Branstetter, Reagan; Ruszczyk, William; Miville, Frank
1994-10-01
Texas Instruments Incorporated (TI) developed the URDA processor design under contract with the U.S. Air Force Wright Laboratory and the U.S. Army Night Vision and Electro-Sensors Directorate. TI's approach couples advanced packaging solutions with advanced integrated circuit (IC) technology to provide a high-performance (200 MIPS/800 MFLOPS) modular avionics processor module for a wide range of avionics applications. TI's processor design integrates two Ada-programmable, URDA basic processor modules (BPM's) with a JIAWG-compatible PiBus and TMBus on a single F-22 common integrated processor-compatible form-factor SEM-E avionics card. A separate, high-speed (25-MWord/second 32-bit word) input/output bus is provided for sensor data. Each BPM provides a peak throughput of 100 MIPS scalar concurrent with 400-MFLOPS vector processing in a removable multichip module (MCM) mounted to a liquid-flowthrough (LFT) core and interfacing to a processor interface module printed wiring board (PWB). Commercial RISC technology coupled with TI's advanced bipolar complementary metal oxide semiconductor (BiCMOS) application specific integrated circuit (ASIC) and silicon-on-silicon packaging technologies are used to achieve the high performance in a miniaturized package. A Mips R4000-family reduced instruction set computer (RISC) processor and a TI 100-MHz BiCMOS vector coprocessor (VCP) ASIC provide, respectively, the 100 MIPS of a scalar processor throughput and 400 MFLOPS of vector processing throughput for each BPM. The TI Aladdim ASIC chipset was developed on the TI Aladdin Program under contract with the U.S. Army Communications and Electronics Command and was sponsored by the Advanced Research Projects Agency with technical direction from the U.S. Army Night Vision and Electro-Sensors Directorate.
Compilation of Abstracts of Theses Submitted by Candidates for Degrees.
1984-06-01
Management System for the TI - 59 Programmable Calculator Kersh, T. B. Signal Processor Interface 65 CPT, USA Simulation of the AN/SPY-lA Radar...DESIGN AND IMPLEMENTATION OF A BASIC CROSS-COMPILER AND VIRTUAL MEMORY MANAGEMENT SYSTEM FOR THE TI - 59 PROGRAMMABLE CALCULATOR Mark R. Kindl Captain...Academy, 1974 The instruction set of the TI - 59 Programmable Calculator bears a close similarity to that of an assembler. Though most of the calculator
NASA Technical Reports Server (NTRS)
Phyne, J. R.; Nelson, M. D.
1975-01-01
The design and implementation of hardware and software systems involved in using a 40,000 bit/second communication line as the connecting link between an IMLAC PDS 1-D display computer and a Univac 1108 computer system were described. The IMLAC consists of two independent processors sharing a common memory. The display processor generates the deflection and beam control currents as it interprets a program contained in the memory; the minicomputer has a general instruction set and is responsible for starting and stopping the display processor and for communicating with the outside world through the keyboard, teletype, light pen, and communication line. The processing time associated with each data byte was minimized by designing the input and output processes as finite state machines which automatically sequence from each state to the next. Several tests of the communication link and the IMLAC software were made using a special low capacity computer grade cable between the IMLAC and the Univac.
Visualization of unsteady computational fluid dynamics
NASA Astrophysics Data System (ADS)
Haimes, Robert
1994-11-01
A brief summary of the computer environment used for calculating three dimensional unsteady Computational Fluid Dynamic (CFD) results is presented. This environment requires a super computer as well as massively parallel processors (MPP's) and clusters of workstations acting as a single MPP (by concurrently working on the same task) provide the required computational bandwidth for CFD calculations of transient problems. The cluster of reduced instruction set computers (RISC) is a recent advent based on the low cost and high performance that workstation vendors provide. The cluster, with the proper software can act as a multiple instruction/multiple data (MIMD) machine. A new set of software tools is being designed specifically to address visualizing 3D unsteady CFD results in these environments. Three user's manuals for the parallel version of Visual3, pV3, revision 1.00 make up the bulk of this report.
Visualization of unsteady computational fluid dynamics
NASA Technical Reports Server (NTRS)
Haimes, Robert
1994-01-01
A brief summary of the computer environment used for calculating three dimensional unsteady Computational Fluid Dynamic (CFD) results is presented. This environment requires a super computer as well as massively parallel processors (MPP's) and clusters of workstations acting as a single MPP (by concurrently working on the same task) provide the required computational bandwidth for CFD calculations of transient problems. The cluster of reduced instruction set computers (RISC) is a recent advent based on the low cost and high performance that workstation vendors provide. The cluster, with the proper software can act as a multiple instruction/multiple data (MIMD) machine. A new set of software tools is being designed specifically to address visualizing 3D unsteady CFD results in these environments. Three user's manuals for the parallel version of Visual3, pV3, revision 1.00 make up the bulk of this report.
Design and implementation of a high performance network security processor
NASA Astrophysics Data System (ADS)
Wang, Haixin; Bai, Guoqiang; Chen, Hongyi
2010-03-01
The last few years have seen many significant progresses in the field of application-specific processors. One example is network security processors (NSPs) that perform various cryptographic operations specified by network security protocols and help to offload the computation intensive burdens from network processors (NPs). This article presents a high performance NSP system architecture implementation intended for both internet protocol security (IPSec) and secure socket layer (SSL) protocol acceleration, which are widely employed in virtual private network (VPN) and e-commerce applications. The efficient dual one-way pipelined data transfer skeleton and optimised integration scheme of the heterogenous parallel crypto engine arrays lead to a Gbps rate NSP, which is programmable with domain specific descriptor-based instructions. The descriptor-based control flow fragments large data packets and distributes them to the crypto engine arrays, which fully utilises the parallel computation resources and improves the overall system data throughput. A prototyping platform for this NSP design is implemented with a Xilinx XC3S5000 based FPGA chip set. Results show that the design gives a peak throughput for the IPSec ESP tunnel mode of 2.85 Gbps with over 2100 full SSL handshakes per second at a clock rate of 95 MHz.
NASA Astrophysics Data System (ADS)
Kutt, P. H.; Balamuth, D. P.
1989-10-01
Summary form only given, as follows. A multiprocessor system based on commercially available VMEbus components has been developed for the acquisition and reduction of event-mode data in nuclear physics experiments. The system contains seven 68000 CPUs and 14 Mbyte of memory. A minimal operating system handles data transfer and task allocation, and a compiler for a specially designed event analysis language produces code for the processors. The system has been in operation for four years at the University of Pennsylvania Tandem Accelerator Laboratory. Computation rates over three times that of a MicroVAX II have been achieved at a fraction of the cost. The use of WORM optical disks for event recording allows the processing of gigabyte data sets without operator intervention. A more powerful system is being planned which will make use of recently developed RISC (reduced instruction set computer) processors to obtain an order of magnitude increase in computing power per node.
The One Computer Classroom: Applications in Language Arts.
ERIC Educational Resources Information Center
Tobin, Kenneth; Tobin, Barbara
1985-01-01
Describes a study showing that primary grade students can benefit from using a word processor in activities that involve the creation of text. Suggests activities that further extend the value of the word processor in language arts instruction. (EL)
The Formal Specification of a Visual display Device: Design and Implementation.
1985-06-01
The use of these data structures with their defined operations, give the programmer a very powerful instructions set. Like the DPU code generator in...which any AM hosted machine could faithfully display. 27 In- general , most applications have no need to create images from a data structure representing...formation of standard functional interfaces to these resources. OS’s generally do not provide a functional interface to either the processor or the display2
Elan4/SPARC V9 Cross Loader and Dynamic Linker
DOE Office of Scientific and Technical Information (OSTI.GOV)
anf Fabien Lebaillif-Delamare, Fabrizio Petrini
2004-10-25
The Elan4/Sparc V9 Croos Loader and Liner is a part of the Linux system software that allows the dynamic loading and linking of user code in the network interface Quadrics QsNETII, also called as Elan4 Quadrics. Elan44 uses a thread processor that is based on the assembly instruction set of the Sparc V9. All this software is integrated as a Linux kernel module in the Linux 2.6.5 release.
Evaluation of Instruction Set Processor Architecture by Program Tracing
1974-07-01
l ■! ... ...i v«!!, la . mm>>i mmmm mm«’ ■wnnvnfnmnmi "■ "»■’" ipMvmpw !■ i Th« rt^iitcr uta^e cia^ .iftcMlipn i NOI« MPiiHntnc... SOJA • 233 117.07 SOJCE | 2850 5101.50 SOJN 1 158 282.82 SOJC a 1811 3295.39 37 SOS • 1279 3980.95 SOSL ■ S IS. 25 sose • 1 21 M SOStE
120-MHz BiCMOS superscalar RISC processor
NASA Astrophysics Data System (ADS)
Tanaka, Shigeya; Hotta, Takashi; Murabayashi, Fumio; Yamada, Hiromichi; Yoshida, Shoji; Shimamura, Kotaro; Katsura, Koyo; Bandoh, Tadaaki; Ikeda, Koichi; Matsubara, Kenji
1994-04-01
A superscalar RISC processor contains 2.8 million transistors in a die size of 16.2 mm x 16.5 mm, and utilizes 3.3 V/0.5 micron BiCMOS technology. In order to take advantage of superscalar performance without incurring penalties from a slower clock or a longer pipeline, a tag bit is implemented in the instruction cache to indicate dependency between two instructions. A performance gain of up to 37% is obtained with only a 3.5% area overhead from our superscalar design.
A scalable SIMD digital signal processor for high-quality multifunctional printer systems
NASA Astrophysics Data System (ADS)
Kang, Hyeong-Ju; Choi, Yongwoo; Kim, Kimo; Park, In-Cheol; Kim, Jung-Wook; Lee, Eul-Hwan; Gahang, Goo-Soo
2005-01-01
This paper describes a high-performance scalable SIMD digital signal processor (DSP) developed for multifunctional printer systems. The DSP supports a variable number of datapaths to cover a wide range of performance and maintain a RISC-like pipeline structure. Many special instructions suitable for image processing algorithms are included in the DSP. Quad/dual instructions are introduced for 8-bit or 16-bit data, and bit-field extraction/insertion instructions are supported to process various data types. Conditional instructions are supported to deal with complex relative conditions efficiently. In addition, an intelligent DMA block is integrated to align data in the course of data reading. Experimental results show that the proposed DSP outperforms a high-end printer-system DSP by at least two times.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Smith, Tyler Barratt; Urrea, Jorge Mario
2012-06-01
The aim of the Authenticating Cache architecture is to ensure that machine instructions in a Read Only Memory (ROM) are legitimate from the time the ROM image is signed (immediately after compilation) to the time they are placed in the cache for the processor to consume. The proposed architecture allows the detection of ROM image modifications during distribution or when it is loaded into memory. It also ensures that modified instructions will not execute in the processor-as the cache will not be loaded with a page that fails an integrity check. The authenticity of the instruction stream can also bemore » verified in this architecture. The combination of integrity and authenticity assurance greatly improves the security profile of a system.« less
1992-09-01
to acquire or develop effective simulation tools to observe the behavior of a RISC implementation as it executes different types of programs . We choose...Performance Computer performance is measured by the amount of the time required to execute a program . Performance encompasses two types of time, elapsed time...and CPU time. Elapsed time is the time required to execute a program from start to finish. It includes latency of input/output activities such as
Data processing with microcode designed with source coding
McCoy, James A; Morrison, Steven E
2013-05-07
Programming for a data processor to execute a data processing application is provided using microcode source code. The microcode source code is assembled to produce microcode that includes digital microcode instructions with which to signal the data processor to execute the data processing application.
NASA Technical Reports Server (NTRS)
Srivas, Mandayam; Bickford, Mark
1991-01-01
The design and formal verification of a hardware system for a task that is an important component of a fault tolerant computer architecture for flight control systems is presented. The hardware system implements an algorithm for obtaining interactive consistancy (byzantine agreement) among four microprocessors as a special instruction on the processors. The property verified insures that an execution of the special instruction by the processors correctly accomplishes interactive consistency, provided certain preconditions hold. An assumption is made that the processors execute synchronously. For verification, the authors used a computer aided design hardware design verification tool, Spectool, and the theorem prover, Clio. A major contribution of the work is the demonstration of a significant fault tolerant hardware design that is mechanically verified by a theorem prover.
NASA Technical Reports Server (NTRS)
Bickford, Mark; Srivas, Mandayam
1991-01-01
Presented here is a formal specification and verification of a property of a quadruplicately redundant fault tolerant microprocessor system design. A complete listing of the formal specification of the system and the correctness theorems that are proved are given. The system performs the task of obtaining interactive consistency among the processors using a special instruction on the processors. The design is based on an algorithm proposed by Pease, Shostak, and Lamport. The property verified insures that an execution of the special instruction by the processors correctly accomplishes interactive consistency, providing certain preconditions hold, using a computer aided design verification tool, Spectool, and the theorem prover, Clio. A major contribution of the work is the demonstration of a significant fault tolerant hardware design that is mechanically verified by a theorem prover.
Watchdog activity monitor (WAM) for use wth high coverage processor self-test
NASA Technical Reports Server (NTRS)
Tulpule, Bhalchandra R. (Inventor); Crosset, III, Richard W. (Inventor); Versailles, Richard E. (Inventor)
1988-01-01
A high fault coverage, instruction modeled self-test for a signal processor in a user environment is disclosed. The self-test executes a sequence of sub-tests and issues a state transition signal upon the execution of each sub-test. The self-test may be combined with a watchdog activity monitor (WAM) which provides a test-failure signal in the presence of a counted number of state transitions not agreeing with an expected number. An independent measure of time may be provided in the WAM to increase fault coverage by checking the processor's clock. Additionally, redundant processor systems are protected from inadvertent unsevering of a severed processor using a unique unsever arming technique and apparatus.
Design of a MIMD neural network processor
NASA Astrophysics Data System (ADS)
Saeks, Richard E.; Priddy, Kevin L.; Pap, Robert M.; Stowell, S.
1994-03-01
The Accurate Automation Corporation (AAC) neural network processor (NNP) module is a fully programmable multiple instruction multiple data (MIMD) parallel processor optimized for the implementation of neural networks. The AAC NNP design fully exploits the intrinsic sparseness of neural network topologies. Moreover, by using a MIMD parallel processing architecture one can update multiple neurons in parallel with efficiency approaching 100 percent as the size of the network increases. Each AAC NNP module has 8 K neurons and 32 K interconnections and is capable of 140,000,000 connections per second with an eight processor array capable of over one billion connections per second.
Scan line graphics generation on the massively parallel processor
NASA Technical Reports Server (NTRS)
Dorband, John E.
1988-01-01
Described here is how researchers implemented a scan line graphics generation algorithm on the Massively Parallel Processor (MPP). Pixels are computed in parallel and their results are applied to the Z buffer in large groups. To perform pixel value calculations, facilitate load balancing across the processors and apply the results to the Z buffer efficiently in parallel requires special virtual routing (sort computation) techniques developed by the author especially for use on single-instruction multiple-data (SIMD) architectures.
A sweep algorithm for massively parallel simulation of circuit-switched networks
NASA Technical Reports Server (NTRS)
Gaujal, Bruno; Greenberg, Albert G.; Nicol, David M.
1992-01-01
A new massively parallel algorithm is presented for simulating large asymmetric circuit-switched networks, controlled by a randomized-routing policy that includes trunk-reservation. A single instruction multiple data (SIMD) implementation is described, and corresponding experiments on a 16384 processor MasPar parallel computer are reported. A multiple instruction multiple data (MIMD) implementation is also described, and corresponding experiments on an Intel IPSC/860 parallel computer, using 16 processors, are reported. By exploiting parallelism, our algorithm increases the possible execution rate of such complex simulations by as much as an order of magnitude.
Vector generator scan converter
Moore, James M.; Leighton, James F.
1990-01-01
High printing speeds for graphics data are achieved with a laser printer by transmitting compressed graphics data from a main processor over an I/O (input/output) channel to a vector generator scan converter which reconstructs a full graphics image for input to the laser printer through a raster data input port. The vector generator scan converter includes a microprocessor with associated microcode memory containing a microcode instruction set, a working memory for storing compressed data, vector generator hardward for drawing a full graphic image from vector parameters calculated by the microprocessor, image buffer memory for storing the reconstructed graphics image and an output scanner for reading the graphics image data and inputting the data to the printer. The vector generator scan converter eliminates the bottleneck created by the I/O channel for transmitting graphics data from the main processor to the laser printer, and increases printer speed up to thirty fold.
Vector generator scan converter
Moore, J.M.; Leighton, J.F.
1988-02-05
High printing speeds for graphics data are achieved with a laser printer by transmitting compressed graphics data from a main processor over an I/O channel to a vector generator scan converter which reconstructs a full graphics image for input to the laser printer through a raster data input port. The vector generator scan converter includes a microprocessor with associated microcode memory containing a microcode instruction set, a working memory for storing compressed data, vector generator hardware for drawing a full graphic image from vector parameters calculated by the microprocessor, image buffer memory for storing the reconstructed graphics image and an output scanner for reading the graphics image data and inputting the data to the printer. The vector generator scan converter eliminates the bottleneck created by the I/O channel for transmitting graphics data from the main processor to the laser printer, and increases printer speed up to thirty fold. 7 figs.
Image segmentation based upon topological operators: real-time implementation case study
NASA Astrophysics Data System (ADS)
Mahmoudi, R.; Akil, M.
2009-02-01
In miscellaneous applications of image treatment, thinning and crest restoring present a lot of interests. Recommended algorithms for these procedures are those able to act directly over grayscales images while preserving topology. But their strong consummation in term of time remains the major disadvantage in their choice. In this paper we present an efficient hardware implementation on RISC processor of two powerful algorithms of thinning and crest restoring developed by our team. Proposed implementation enhances execution time. A chain of segmentation applied to medical imaging will serve as a concrete example to illustrate the improvements brought thanks to the optimization techniques in both algorithm and architectural levels. The particular use of the SSE instruction set relative to the X86_32 processors (PIV 3.06 GHz) will allow a best performance for real time processing: a cadency of 33 images (512*512) per second is assured.
Parallelization of the FLAPW method
NASA Astrophysics Data System (ADS)
Canning, A.; Mannstadt, W.; Freeman, A. J.
2000-08-01
The FLAPW (full-potential linearized-augmented plane-wave) method is one of the most accurate first-principles methods for determining structural, electronic and magnetic properties of crystals and surfaces. Until the present work, the FLAPW method has been limited to systems of less than about a hundred atoms due to the lack of an efficient parallel implementation to exploit the power and memory of parallel computers. In this work, we present an efficient parallelization of the method by division among the processors of the plane-wave components for each state. The code is also optimized for RISC (reduced instruction set computer) architectures, such as those found on most parallel computers, making full use of BLAS (basic linear algebra subprograms) wherever possible. Scaling results are presented for systems of up to 686 silicon atoms and 343 palladium atoms per unit cell, running on up to 512 processors on a CRAY T3E parallel supercomputer.
NASA Technical Reports Server (NTRS)
Liu, Yuan-Kwei
1991-01-01
The feasibility is analyzed of upgrading the Intel 386 microprocessor, which has been proposed as the baseline processor for the Space Station Freedom (SSF) Data Management System (DMS), to the more advanced i486 microprocessors. The items compared between the two processors include the instruction set architecture, power consumption, the MIL-STD-883C Class S (Space) qualification schedule, and performance. The advantages of the i486 over the 386 are (1) lower power consumption; and (2) higher floating point performance. The i486 on-chip cache does not have parity check or error detection and correction circuitry. The i486 with on-chip cache disabled, however, has lower integer performance than the 386 without cache, which is the current DMS design choice. Adding cache to the 386/386 DX memory hierachy appears to be the most beneficial change to the current DMS design at this time.
NASA Technical Reports Server (NTRS)
Liu, Yuan-Kwei
1991-01-01
The feasibility is analyzed of upgrading the Intel 386 microprocessor, which has been proposed as the baseline processor for the Space Station Freedom (SSF) Data Management System (DMS), to the more advanced i486 microprocessors. The items compared between the two processors include the instruction set architecture, power consumption, the MIL-STD-883C Class S (Space) qualification schedule, and performance. The advantages of the i486 over the 386 are (1) lower power consumption; and (2) higher floating point performance. The i486 on-chip cache does not have parity check or error detection and correction circuitry. The i486 with on-chip cache disabled, however, has lower integer performance than the 386 without cache, which is the current DMS design choice. Adding cache to the 386/387 DX memory hierarchy appears to be the most beneficial change to the current DMS design at this time.
Design of a massively parallel computer using bit serial processing elements
NASA Technical Reports Server (NTRS)
Aburdene, Maurice F.; Khouri, Kamal S.; Piatt, Jason E.; Zheng, Jianqing
1995-01-01
A 1-bit serial processor designed for a parallel computer architecture is described. This processor is used to develop a massively parallel computational engine, with a single instruction-multiple data (SIMD) architecture. The computer is simulated and tested to verify its operation and to measure its performance for further development.
Introduction to Parallel Computing
1992-05-01
Instruction Stream, Multiple Data Stream Machines .................... 19 2.4 Networks of M achines...independent memory units and connecting them to the processors by an interconnection network . Many different interconnection schemes have been considered, and...connected to the same processor at the same time. Crossbar switching networks are still too expensive to be practical for connecting large numbers of
An FPGA computing demo core for space charge simulation
DOE Office of Scientific and Technical Information (OSTI.GOV)
Wu, Jinyuan; Huang, Yifei; /Fermilab
2009-01-01
In accelerator physics, space charge simulation requires large amount of computing power. In a particle system, each calculation requires time/resource consuming operations such as multiplications, divisions, and square roots. Because of the flexibility of field programmable gate arrays (FPGAs), we implemented this task with efficient use of the available computing resources and completely eliminated non-calculating operations that are indispensable in regular micro-processors (e.g. instruction fetch, instruction decoding, etc.). We designed and tested a 16-bit demo core for computing Coulomb's force in an Altera Cyclone II FPGA device. To save resources, the inverse square-root cube operation in our design is computedmore » using a memory look-up table addressed with nine to ten most significant non-zero bits. At 200 MHz internal clock, our demo core reaches a throughput of 200 M pairs/s/core, faster than a typical 2 GHz micro-processor by about a factor of 10. Temperature and power consumption of FPGAs were also lower than those of micro-processors. Fast and convenient, FPGAs can serve as alternatives to time-consuming micro-processors for space charge simulation.« less
Conditional load and store in a shared memory
Blumrich, Matthias A; Ohmacht, Martin
2015-02-03
A method, system and computer program product for implementing load-reserve and store-conditional instructions in a multi-processor computing system. The computing system includes a multitude of processor units and a shared memory cache, and each of the processor units has access to the memory cache. In one embodiment, the method comprises providing the memory cache with a series of reservation registers, and storing in these registers addresses reserved in the memory cache for the processor units as a result of issuing load-reserve requests. In this embodiment, when one of the processor units makes a request to store data in the memory cache using a store-conditional request, the reservation registers are checked to determine if an address in the memory cache is reserved for that processor unit. If an address in the memory cache is reserved for that processor, the data are stored at this address.
ERIC Educational Resources Information Center
McCrary, Ronald G.
A discussion of computer software and courseware for second-language instruction outlines considerations for selecting software of various kinds and presents a list of selected computer programs. Suggestions are made for choosing text-specific software, non-text-specific software intended for language instruction, word processors intended for…
Efficient irregular wavefront propagation algorithms on Intel® Xeon Phi™
Gomes, Jeremias M.; Teodoro, George; de Melo, Alba; Kong, Jun; Kurc, Tahsin; Saltz, Joel H.
2016-01-01
We investigate the execution of the Irregular Wavefront Propagation Pattern (IWPP), a fundamental computing structure used in several image analysis operations, on the Intel® Xeon Phi™ co-processor. An efficient implementation of IWPP on the Xeon Phi is a challenging problem because of IWPP’s irregularity and the use of atomic instructions in the original IWPP algorithm to resolve race conditions. On the Xeon Phi, the use of SIMD and vectorization instructions is critical to attain high performance. However, SIMD atomic instructions are not supported. Therefore, we propose a new IWPP algorithm that can take advantage of the supported SIMD instruction set. We also evaluate an alternate storage container (priority queue) to track active elements in the wavefront in an effort to improve the parallel algorithm efficiency. The new IWPP algorithm is evaluated with Morphological Reconstruction and Imfill operations as use cases. Our results show performance improvements of up to 5.63× on top of the original IWPP due to vectorization. Moreover, the new IWPP achieves speedups of 45.7× and 1.62×, respectively, as compared to efficient CPU and GPU implementations. PMID:27298591
Efficient irregular wavefront propagation algorithms on Intel® Xeon Phi™.
Gomes, Jeremias M; Teodoro, George; de Melo, Alba; Kong, Jun; Kurc, Tahsin; Saltz, Joel H
2015-10-01
We investigate the execution of the Irregular Wavefront Propagation Pattern (IWPP), a fundamental computing structure used in several image analysis operations, on the Intel ® Xeon Phi ™ co-processor. An efficient implementation of IWPP on the Xeon Phi is a challenging problem because of IWPP's irregularity and the use of atomic instructions in the original IWPP algorithm to resolve race conditions. On the Xeon Phi, the use of SIMD and vectorization instructions is critical to attain high performance. However, SIMD atomic instructions are not supported. Therefore, we propose a new IWPP algorithm that can take advantage of the supported SIMD instruction set. We also evaluate an alternate storage container (priority queue) to track active elements in the wavefront in an effort to improve the parallel algorithm efficiency. The new IWPP algorithm is evaluated with Morphological Reconstruction and Imfill operations as use cases. Our results show performance improvements of up to 5.63 × on top of the original IWPP due to vectorization. Moreover, the new IWPP achieves speedups of 45.7 × and 1.62 × , respectively, as compared to efficient CPU and GPU implementations.
Update on radiation-hardened microcomputers for robotics and teleoperated systems
DOE Office of Scientific and Technical Information (OSTI.GOV)
Sias, F.R. Jr.; Tulenko, J.S.
1993-12-31
Since many programs sponsored by the Department of Defense are being canceled, it is important to select carefully radiation-hardened microprocessors for projects that will mature (or will require continued support) several years in the future. At the present time there are seven candidate 32-bit processors that should be considered for long-range planning for high-performance radiation-hardened computer systems. For Department of Energy applications it is also important to consider efforts at standardization that require the use of the VxWorks operating system and hardware based on the VMEbus. Of the seven processors, one has been delivered and is operating and other systemsmore » are scheduled to be delivered late in 1993 or early in 1994. At the present time the Honeywell-developed RH32, the Harris RH-3000 and the Harris RHC-3000 are leading contenders for meeting DOE requirements for a radiation-hardened advanced 32-bit microprocessor. These are all either compatible with or are derivatives of the MIPS R3000 Reduced Instruction Set Computer. It is anticipated that as few as two of the seven radiation-hardened processors will be supported by the space program in the long run.« less
ERIC Educational Resources Information Center
Osguthorpe, Russell T.; Li Chang, Linda
1988-01-01
A computerized symbol processor system using an Apple IIe computer and a Power Pad graphics tablet was tested with 22 nonspeaking, multiply disabled students. The students were taught to express themselves independently in writing, and they did significantly better than control students on measures of language comprehension and symbol recognition.…
Direct match data flow machine apparatus and process for data driven computing
Davidson, G.S.; Grafe, V.G.
1997-08-12
A data flow computer and method of computing are disclosed which utilizes a data driven processor node architecture. The apparatus in a preferred embodiment includes a plurality of First-In-First-Out (FIFO) registers, a plurality of related data flow memories, and a processor. The processor makes the necessary calculations and includes a control unit to generate signals to enable the appropriate FIFO register receiving the result. In a particular embodiment, there are three FIFO registers per node: an input FIFO register to receive input information form an outside source and provide it to the data flow memories; an output FIFO register to provide output information from the processor to an outside recipient; and an internal FIFO register to provide information from the processor back to the data flow memories. The data flow memories are comprised of four commonly addressed memories. A parameter memory holds the A and B parameters used in the calculations; an opcode memory holds the instruction; a target memory holds the output address; and a tag memory contains status bits for each parameter. One status bit indicates whether the corresponding parameter is in the parameter memory and one status but to indicate whether the stored information in the corresponding data parameter is to be reused. The tag memory outputs a ``fire`` signal (signal R VALID) when all of the necessary information has been stored in the data flow memories, and thus when the instruction is ready to be fired to the processor. 11 figs.
Data flow machine for data driven computing
Davidson, G.S.; Grafe, V.G.
1988-07-22
A data flow computer and method of computing is disclosed which utilizes a data driven processor node architecture. The apparatus in a preferred embodiment includes a plurality of First-In-First-Out (FIFO) registers, a plurality of related data flow memories, and a processor. The processor makes the necessary calculations and includes a control unit to generate signals to enable the appropriate FIFO register receiving the result. In a particular embodiment, there are three FIFO registers per node: an input FIFO register to receive input information from an outside source and provide it to the data flow memories; an output FIFO register to provide output information from the processor to an outside recipient; and an internal FIFO register to provide information from the processor back to the data flow memories. The data flow memories are comprised of four commonly addressed memories. A parameter memory holds the A and B parameters used in the calculations; an opcode memory holds the instruction; a target memory holds the output address; and a tag memory contains status bits for each parameter. One status bit indicates whether the corresponding parameter is in the parameter memory and one status bit to indicate whether the stored information in the corresponding data parameter is to be reused. The tag memory outputs a ''fire'' signal (signal R VALID) when all of the necessary information has been stored in the data flow memories, and thus when the instruction is ready to be fired to the processor. 11 figs.
Data flow machine for data driven computing
Davidson, George S.; Grafe, Victor G.
1995-01-01
A data flow computer which of computing is disclosed which utilizes a data driven processor node architecture. The apparatus in a preferred embodiment includes a plurality of First-In-First-Out (FIFO) registers, a plurality of related data flow memories, and a processor. The processor makes the necessary calculations and includes a control unit to generate signals to enable the appropriate FIFO register receiving the result. In a particular embodiment, there are three FIFO registers per node: an input FIFO register to receive input information form an outside source and provide it to the data flow memories; an output FIFO register to provide output information from the processor to an outside recipient; and an internal FIFO register to provide information from the processor back to the data flow memories. The data flow memories are comprised of four commonly addressed memories. A parameter memory holds the A and B parameters used in the calculations; an opcode memory holds the instruction; a target memory holds the output address; and a tag memory contains status bits for each parameter. One status bit indicates whether the corresponding parameter is in the parameter memory and one status but to indicate whether the stored information in the corresponding data parameter is to be reused. The tag memory outputs a "fire" signal (signal R VALID) when all of the necessary information has been stored in the data flow memories, and thus when the instruction is ready to be fired to the processor.
Direct match data flow machine apparatus and process for data driven computing
Davidson, George S.; Grafe, Victor Gerald
1997-01-01
A data flow computer and method of computing is disclosed which utilizes a data driven processor node architecture. The apparatus in a preferred embodiment includes a plurality of First-In-First-Out (FIFO) registers, a plurality of related data flow memories, and a processor. The processor makes the necessary calculations and includes a control unit to generate signals to enable the appropriate FIFO register receiving the result. In a particular embodiment, there are three FIFO registers per node: an input FIFO register to receive input information form an outside source and provide it to the data flow memories; an output FIFO register to provide output information from the processor to an outside recipient; and an internal FIFO register to provide information from the processor back to the data flow memories. The data flow memories are comprised of four commonly addressed memories. A parameter memory holds the A and B parameters used in the calculations; an opcode memory holds the instruction; a target memory holds the output address; and a tag memory contains status bits for each parameter. One status bit indicates whether the corresponding parameter is in the parameter memory and one status but to indicate whether the stored information in the corresponding data parameter is to be reused. The tag memory outputs a "fire" signal (signal R VALID) when all of the necessary information has been stored in the data flow memories, and thus when the instruction is ready to be fired to the processor.
Direct match data flow memory for data driven computing
Davidson, George S.; Grafe, Victor Gerald
1997-01-01
A data flow computer and method of computing is disclosed which utilizes a data driven processor node architecture. The apparatus in a preferred embodiment includes a plurality of First-In-First-Out (FIFO) registers, a plurality of related data flow memories, and a processor. The processor makes the necessary calculations and includes a control unit to generate signals to enable the appropriate FIFO register receiving the result. In a particular embodiment, there are three FIFO registers per node: an input FIFO register to receive input information form an outside source and provide it to the data flow memories; an output FIFO register to provide output information from the processor to an outside recipient; and an internal FIFO register to provide information from the processor back to the data flow memories. The data flow memories are comprised of four commonly addressed memories. A parameter memory holds the A and B parameters used in the calculations; an opcode memory holds the instruction; a target memory holds the output address; and a tag memory contains status bits for each parameter. One status bit indicates whether the corresponding parameter is in the parameter memory and one status bit to indicate whether the stored information in the corresponding data parameter is to be reused. The tag memory outputs a "fire" signal (signal R VALID) when all of the necessary information has been stored in the data flow memories, and thus when the instruction is ready to be fired to the processor.
Direct match data flow memory for data driven computing
Davidson, G.S.; Grafe, V.G.
1997-10-07
A data flow computer and method of computing is disclosed which utilizes a data driven processor node architecture. The apparatus in a preferred embodiment includes a plurality of First-In-First-Out (FIFO) registers, a plurality of related data flow memories, and a processor. The processor makes the necessary calculations and includes a control unit to generate signals to enable the appropriate FIFO register receiving the result. In a particular embodiment, there are three FIFO registers per node: an input FIFO register to receive input information form an outside source and provide it to the data flow memories; an output FIFO register to provide output information from the processor to an outside recipient; and an internal FIFO register to provide information from the processor back to the data flow memories. The data flow memories are comprised of four commonly addressed memories. A parameter memory holds the A and B parameters used in the calculations; an opcode memory holds the instruction; a target memory holds the output address; and a tag memory contains status bits for each parameter. One status bit indicates whether the corresponding parameter is in the parameter memory and one status bit to indicate whether the stored information in the corresponding data parameter is to be reused. The tag memory outputs a ``fire`` signal (signal R VALID) when all of the necessary information has been stored in the data flow memories, and thus when the instruction is ready to be fired to the processor. 11 figs.
Using video-oriented instructions to speed up sequence comparison.
Wozniak, A
1997-04-01
This document presents an implementation of the well-known Smith-Waterman algorithm for comparison of proteic and nucleic sequences, using specialized video instructions. These instructions, SIMD-like in their design, make possible parallelization of the algorithm at the instruction level. Benchmarks on an ULTRA SPARC running at 167 MHz show a speed-up factor of two compared to the same algorithm implemented with integer instructions on the same machine. Performance reaches over 18 million matrix cells per second on a single processor, giving to our knowledge the fastest implementation of the Smith-Waterman algorithm on a workstation. The accelerated procedure was introduced in LASSAP--a LArge Scale Sequence compArison Package software developed at INRIA--which handles parallelism at higher level. On a SUN Enterprise 6000 server with 12 processors, a speed of nearly 200 million matrix cells per second has been obtained. A sequence of length 300 amino acids is scanned against SWISSPROT R33 (1,8531,385 residues) in 29 s. This procedure is not restricted to databank scanning. It applies to all cases handled by LASSAP (intra- and inter-bank comparisons, Z-score computation, etc.
Combining instruction prefetching with partial cache locking to improve WCET in real-time systems.
Ni, Fan; Long, Xiang; Wan, Han; Gao, Xiaopeng
2013-01-01
Caches play an important role in embedded systems to bridge the performance gap between fast processor and slow memory. And prefetching mechanisms are proposed to further improve the cache performance. While in real-time systems, the application of caches complicates the Worst-Case Execution Time (WCET) analysis due to its unpredictable behavior. Modern embedded processors often equip locking mechanism to improve timing predictability of the instruction cache. However, locking the whole cache may degrade the cache performance and increase the WCET of the real-time application. In this paper, we proposed an instruction-prefetching combined partial cache locking mechanism, which combines an instruction prefetching mechanism (termed as BBIP) with partial cache locking to improve the WCET estimates of real-time applications. BBIP is an instruction prefetching mechanism we have already proposed to improve the worst-case cache performance and in turn the worst-case execution time. The estimations on typical real-time applications show that the partial cache locking mechanism shows remarkable WCET improvement over static analysis and full cache locking.
Combining Instruction Prefetching with Partial Cache Locking to Improve WCET in Real-Time Systems
Ni, Fan; Long, Xiang; Wan, Han; Gao, Xiaopeng
2013-01-01
Caches play an important role in embedded systems to bridge the performance gap between fast processor and slow memory. And prefetching mechanisms are proposed to further improve the cache performance. While in real-time systems, the application of caches complicates the Worst-Case Execution Time (WCET) analysis due to its unpredictable behavior. Modern embedded processors often equip locking mechanism to improve timing predictability of the instruction cache. However, locking the whole cache may degrade the cache performance and increase the WCET of the real-time application. In this paper, we proposed an instruction-prefetching combined partial cache locking mechanism, which combines an instruction prefetching mechanism (termed as BBIP) with partial cache locking to improve the WCET estimates of real-time applications. BBIP is an instruction prefetching mechanism we have already proposed to improve the worst-case cache performance and in turn the worst-case execution time. The estimations on typical real-time applications show that the partial cache locking mechanism shows remarkable WCET improvement over static analysis and full cache locking. PMID:24386133
Chen, Dong; Giampapa, Mark; Heidelberger, Philip; Ohmacht, Martin; Satterfield, David L; Steinmacher-Burow, Burkhard; Sugavanam, Krishnan
2013-05-21
A system and method for enhancing performance of a computer which includes a computer system including a data storage device. The computer system includes a program stored in the data storage device and steps of the program are executed by a processer. The processor processes instructions from the program. A wait state in the processor waits for receiving specified data. A thread in the processor has a pause state wherein the processor waits for specified data. A pin in the processor initiates a return to an active state from the pause state for the thread. A logic circuit is external to the processor, and the logic circuit is configured to detect a specified condition. The pin initiates a return to the active state of the thread when the specified condition is detected using the logic circuit.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Donofrio, David
A method and apparatus for performing stencil computations efficiently are disclosed. In one embodiment, a processor receives an offset, and in response, retrieves a value from a memory via a single instruction, where the retrieving comprises: identifying, based on the offset, one of a plurality of registers of the processor; loading an address stored in the identified register; and retrieving from the memory the value at the address.
NASA Technical Reports Server (NTRS)
Marlowe, M. B.; Moore, R. A.; Whetstone, W. D.
1979-01-01
User instructions are given for performing linear and nonlinear steady state and transient thermal analyses with SPAR thermal analysis processors TGEO, SSTA, and TRTA. It is assumed that the user is familiar with basic SPAR operations and basic heat transfer theory.
Methodology for fast detection of false sharing in threaded scientific codes
Chung, I-Hsin; Cong, Guojing; Murata, Hiroki; Negishi, Yasushi; Wen, Hui-Fang
2014-11-25
A profiling tool identifies a code region with a false sharing potential. A static analysis tool classifies variables and arrays in the identified code region. A mapping detection library correlates memory access instructions in the identified code region with variables and arrays in the identified code region while a processor is running the identified code region. The mapping detection library identifies one or more instructions at risk, in the identified code region, which are subject to an analysis by a false sharing detection library. A false sharing detection library performs a run-time analysis of the one or more instructions at risk while the processor is re-running the identified code region. The false sharing detection library determines, based on the performed run-time analysis, whether two different portions of the cache memory line are accessed by the generated binary code.
Simulation of a master-slave event set processor
DOE Office of Scientific and Technical Information (OSTI.GOV)
Comfort, J.C.
1984-03-01
Event set manipulation may consume a considerable amount of the computation time spent in performing a discrete-event simulation. One way of minimizing this time is to allow event set processing to proceed in parallel with the remainder of the simulation computation. The paper describes a multiprocessor simulation computer, in which all non-event set processing is performed by the principal processor (called the host). Event set processing is coordinated by a front end processor (the master) and actually performed by several other functionally identical processors (the slaves). A trace-driven simulation program modeling this system was constructed, and was run with tracemore » output taken from two different simulation programs. Output from this simulation suggests that a significant reduction in run time may be realized by this approach. Sensitivity analysis was performed on the significant parameters to the system (number of slave processors, relative processor speeds, and interprocessor communication times). A comparison between actual and simulation run times for a one-processor system was used to assist in the validation of the simulation. 7 references.« less
Efficient Interconnection Schemes for VLSI and Parallel Computation
1989-08-01
Definition: Let R be a routing network. A set S of wires in R is a (directed) cut if it partitions the network into two sets of processors A and B ...such that every path from a processor in A to a processor in B contains a wire in S. The capacity cap(S) is the number of wires in the cut. For a set of...messages M, define the load load(M, S) of M on a cut S to be the number of messages in M from a processor in A to a processor in B . The load factor
Massively parallel processor computer
NASA Technical Reports Server (NTRS)
Fung, L. W. (Inventor)
1983-01-01
An apparatus for processing multidimensional data with strong spatial characteristics, such as raw image data, characterized by a large number of parallel data streams in an ordered array is described. It comprises a large number (e.g., 16,384 in a 128 x 128 array) of parallel processing elements operating simultaneously and independently on single bit slices of a corresponding array of incoming data streams under control of a single set of instructions. Each of the processing elements comprises a bidirectional data bus in communication with a register for storing single bit slices together with a random access memory unit and associated circuitry, including a binary counter/shift register device, for performing logical and arithmetical computations on the bit slices, and an I/O unit for interfacing the bidirectional data bus with the data stream source. The massively parallel processor architecture enables very high speed processing of large amounts of ordered parallel data, including spatial translation by shifting or sliding of bits vertically or horizontally to neighboring processing elements.
Parallelization of the FLAPW method and comparison with the PPW method
NASA Astrophysics Data System (ADS)
Canning, Andrew; Mannstadt, Wolfgang; Freeman, Arthur
2000-03-01
The FLAPW (full-potential linearized-augmented plane-wave) method is one of the most accurate first-principles methods for determining electronic and magnetic properties of crystals and surfaces. In the past the FLAPW method has been limited to systems of about a hundred atoms due to the lack of an efficient parallel implementation to exploit the power and memory of parallel computers. In this work we present an efficient parallelization of the method by division among the processors of the plane-wave components for each state. The code is also optimized for RISC (reduced instruction set computer) architectures, such as those found on most parallel computers, making full use of BLAS (basic linear algebra subprograms) wherever possible. Scaling results are presented for systems of up to 686 silicon atoms and 343 palladium atoms per unit cell running on up to 512 processors on a Cray T3E parallel supercomputer. Some results will also be presented on a comparison of the plane-wave pseudopotential method and the FLAPW method on large systems.
Gigaflop architecture, a hardware perspective
NASA Technical Reports Server (NTRS)
Feierbach, G. F.
1978-01-01
Any super computer built in the early 1980s will use components that are available by fall 1978. The architecture of such a system cannot depart radically from current super computers if the software experience painfully acquired from these computers in the 70's is to apply. Given the above constraints, 10 billion floating point operations per second (BFLOPS) are attainable and a problem memory of 512 million (64 bit) words could be supported by the technology of the time. In contrast to this, industry is likely to respond with commercially available machines with a performance of less than 150 MFLOPS. This is due to self-imposed constraints on the manufacturers to provide upward compatible architectures (same instruction set) and systems which can be sold in significant volumes. Since this computing speed is inadequate to meet the demands of computational fluid dynamics, a special processor is required. Issues which are felt to be significant in the pursuit of maximum compute capability in this special processor are discussed.
High performance in silico virtual drug screening on many-core processors.
McIntosh-Smith, Simon; Price, James; Sessions, Richard B; Ibarra, Amaurys A
2015-05-01
Drug screening is an important part of the drug development pipeline for the pharmaceutical industry. Traditional, lab-based methods are increasingly being augmented with computational methods, ranging from simple molecular similarity searches through more complex pharmacophore matching to more computationally intensive approaches, such as molecular docking. The latter simulates the binding of drug molecules to their targets, typically protein molecules. In this work, we describe BUDE, the Bristol University Docking Engine, which has been ported to the OpenCL industry standard parallel programming language in order to exploit the performance of modern many-core processors. Our highly optimized OpenCL implementation of BUDE sustains 1.43 TFLOP/s on a single Nvidia GTX 680 GPU, or 46% of peak performance. BUDE also exploits OpenCL to deliver effective performance portability across a broad spectrum of different computer architectures from different vendors, including GPUs from Nvidia and AMD, Intel's Xeon Phi and multi-core CPUs with SIMD instruction sets.
High performance in silico virtual drug screening on many-core processors
Price, James; Sessions, Richard B; Ibarra, Amaurys A
2015-01-01
Drug screening is an important part of the drug development pipeline for the pharmaceutical industry. Traditional, lab-based methods are increasingly being augmented with computational methods, ranging from simple molecular similarity searches through more complex pharmacophore matching to more computationally intensive approaches, such as molecular docking. The latter simulates the binding of drug molecules to their targets, typically protein molecules. In this work, we describe BUDE, the Bristol University Docking Engine, which has been ported to the OpenCL industry standard parallel programming language in order to exploit the performance of modern many-core processors. Our highly optimized OpenCL implementation of BUDE sustains 1.43 TFLOP/s on a single Nvidia GTX 680 GPU, or 46% of peak performance. BUDE also exploits OpenCL to deliver effective performance portability across a broad spectrum of different computer architectures from different vendors, including GPUs from Nvidia and AMD, Intel’s Xeon Phi and multi-core CPUs with SIMD instruction sets. PMID:25972727
NASA Astrophysics Data System (ADS)
Jo, Hyunho; Sim, Donggyu
2014-06-01
We present a bitstream decoding processor for entropy decoding of variable length coding-based multiformat videos. Since most of the computational complexity of entropy decoders comes from bitstream accesses and table look-up process, the developed bitstream processing unit (BsPU) has several designated instructions to access bitstreams and to minimize branch operations in the table look-up process. In addition, the instruction for bitstream access has the capability to remove emulation prevention bytes (EPBs) of H.264/AVC without initial delay, repeated memory accesses, and additional buffer. Experimental results show that the proposed method for EPB removal achieves a speed-up of 1.23 times compared to the conventional EPB removal method. In addition, the BsPU achieves speed-ups of 5.6 and 3.5 times in entropy decoding of H.264/AVC and MPEG-4 Visual bitstreams, respectively, compared to an existing processor without designated instructions and a new table mapping algorithm. The BsPU is implemented on a Xilinx Virtex5 LX330 field-programmable gate array. The MPEG-4 Visual (ASP, Level 5) and H.264/AVC (Main Profile, Level 4) are processed using the developed BsPU with a core clock speed of under 250 MHz in real time.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Rodenbeck, Christopher T.; Young, Derek; Chou, Tina
A combined radar and telemetry system is described. The combined radar and telemetry system includes a processing unit that executes instructions, where the instructions define a radar waveform and a telemetry waveform. The processor outputs a digital baseband signal based upon the instructions, where the digital baseband signal is based upon the radar waveform and the telemetry waveform. A radar and telemetry circuit transmits, simultaneously, a radar signal and telemetry signal based upon the digital baseband signal.
ALI: A CSSL/multiprocessor software interface
DOE Office of Scientific and Technical Information (OSTI.GOV)
Makoui, A.; Karplus, W.J.
ALI (A Language Interface) is a software package which translates simulation models expressed in one of the higher-level languages, CSSL-IV or ACSL, into sequences of instructions for each processor of a network of microprocessors. The partitioning of the source program among the processors is automatically accomplished. The code is converted into a data flow graph, analyzed and divided among the processors to minimize the overall execution time in the presence of interprocessor communication delays. This paper describes ALI from the user's point of view and includes a detailed example of the application of ALI to a specific dynamic system simulation.
Fast Fourier Transform Co-Processor (FFTC)- Towards Embedded GFLOPs
NASA Astrophysics Data System (ADS)
Kuehl, Christopher; Liebstueckel, Uwe; Tejerina, Isaac; Uemminghaus, Michael; Wite, Felix; Kolb, Michael; Suess, Martin; Weigand, Roland
2012-08-01
Many signal processing applications and algorithms perform their operations on the data in the transform domain to gain efficiency. The Fourier Transform Co- Processor has been developed with the aim to offload General Purpose Processors from performing these transformations and therefore to boast the overall performance of a processing module. The IP of the commercial PowerFFT processor has been selected and adapted to meet the constraints of the space environment.In frame of the ESA activity “Fast Fourier Transform DSP Co-processor (FFTC)” (ESTEC/Contract No. 15314/07/NL/LvH/ma) the objectives were the following:Production of prototypes of a space qualified version of the commercial PowerFFT chip called FFTC based on the PowerFFT IP.The development of a stand-alone FFTC Accelerator Board (FTAB) based on the FFTC including the Controller FPGA and SpaceWire Interfaces to verify the FFTC function and performance.The FFTC chip performs its calculations with floating point precision. Stand alone it is capable computing FFTs of up to 1K complex samples in length in only 10μsec. This corresponds to an equivalent processing performance of 4.7 GFlops. In this mode the maximum sustained data throughput reaches 6.4Gbit/s. When connected to up to 4 EDAC protected SDRAM memory banks the FFTC can perform long FFTs with up to 1M complex samples in length or multidimensional FFT- based processing tasks.A Controller FPGA on the FTAB takes care of the SDRAM addressing. The instructions commanded via the Controller FPGA are used to set up the data flow and generate the memory addresses.The presentation will give and overview on the project, including the results of the validation of the FFTC ASIC prototypes.
Fast Fourier Transform Co-processor (FFTC), towards embedded GFLOPs
NASA Astrophysics Data System (ADS)
Kuehl, Christopher; Liebstueckel, Uwe; Tejerina, Isaac; Uemminghaus, Michael; Witte, Felix; Kolb, Michael; Suess, Martin; Weigand, Roland; Kopp, Nicholas
2012-10-01
Many signal processing applications and algorithms perform their operations on the data in the transform domain to gain efficiency. The Fourier Transform Co-Processor has been developed with the aim to offload General Purpose Processors from performing these transformations and therefore to boast the overall performance of a processing module. The IP of the commercial PowerFFT processor has been selected and adapted to meet the constraints of the space environment. In frame of the ESA activity "Fast Fourier Transform DSP Co-processor (FFTC)" (ESTEC/Contract No. 15314/07/NL/LvH/ma) the objectives were the following: • Production of prototypes of a space qualified version of the commercial PowerFFT chip called FFTC based on the PowerFFT IP. • The development of a stand-alone FFTC Accelerator Board (FTAB) based on the FFTC including the Controller FPGA and SpaceWire Interfaces to verify the FFTC function and performance. The FFTC chip performs its calculations with floating point precision. Stand alone it is capable computing FFTs of up to 1K complex samples in length in only 10μsec. This corresponds to an equivalent processing performance of 4.7 GFlops. In this mode the maximum sustained data throughput reaches 6.4Gbit/s. When connected to up to 4 EDAC protected SDRAM memory banks the FFTC can perform long FFTs with up to 1M complex samples in length or multidimensional FFT-based processing tasks. A Controller FPGA on the FTAB takes care of the SDRAM addressing. The instructions commanded via the Controller FPGA are used to set up the data flow and generate the memory addresses. The paper will give an overview on the project, including the results of the validation of the FFTC ASIC prototypes.
Using the Microcomputer to Generate Materials for Bibliographic Instruction.
ERIC Educational Resources Information Center
Hendley, Gaby G.
Guide-worksheets were developed on a word processor in a high school library for bibliographic instruction of English and social studies students to cover the following reference sources: Facts on File; Social Issues Resource Series (S.I.R.S.); Editorial Research Reports; Great Contemporary Issues (New York Times), which also includes Facts on…
NASA Technical Reports Server (NTRS)
Dongarra, Jack
1998-01-01
This exploratory study initiated our inquiry into algorithms and applications that would benefit by latency tolerant approach to algorithm building, including the construction of new algorithms where appropriate. In a multithreaded execution, when a processor reaches a point where remote memory access is necessary, the request is sent out on the network and a context--switch occurs to a new thread of computation. This effectively masks a long and unpredictable latency due to remote loads, thereby providing tolerance to remote access latency. We began to develop standards to profile various algorithm and application parameters, such as the degree of parallelism, granularity, precision, instruction set mix, interprocessor communication, latency etc. These tools will continue to develop and evolve as the Information Power Grid environment matures. To provide a richer context for this research, the project also focused on issues of fault-tolerance and computation migration of numerical algorithms and software. During the initial phase we tried to increase our understanding of the bottlenecks in single processor performance. Our work began by developing an approach for the automatic generation and optimization of numerical software for processors with deep memory hierarchies and pipelined functional units. Based on the results we achieved in this study we are planning to study other architectures of interest, including development of cost models, and developing code generators appropriate to these architectures.
Technology transfer of military space microprocessor developments
NASA Astrophysics Data System (ADS)
Gorden, C.; King, D.; Byington, L.; Lanza, D.
1999-01-01
Over the past 13 years the Air Force Research Laboratory (AFRL) has led the development of microprocessors and computers for USAF space and strategic missile applications. As a result of these Air Force development programs, advanced computer technology is available for use by civil and commercial space customers as well. The Generic VHSIC Spaceborne Computer (GVSC) program began in 1985 at AFRL to fulfill a deficiency in the availability of space-qualified data and control processors. GVSC developed a radiation hardened multi-chip version of the 16-bit, Mil-Std 1750A microprocessor. The follow-on to GVSC, the Advanced Spaceborne Computer Module (ASCM) program, was initiated by AFRL to establish two industrial sources for complete, radiation-hardened 16-bit and 32-bit computers and microelectronic components. Development of the Control Processor Module (CPM), the first of two ASCM contract phases, concluded in 1994 with the availability of two sources for space-qualified, 16-bit Mil-Std-1750A computers, cards, multi-chip modules, and integrated circuits. The second phase of the program, the Advanced Technology Insertion Module (ATIM), was completed in December 1997. ATIM developed two single board computers based on 32-bit reduced instruction set computer (RISC) processors. GVSC, CPM, and ATIM technologies are flying or baselined into the majority of today's DoD, NASA, and commercial satellite systems.
Wald, Ingo; Ize, Santiago
2015-07-28
Parallel population of a grid with a plurality of objects using a plurality of processors. One example embodiment is a method for parallel population of a grid with a plurality of objects using a plurality of processors. The method includes a first act of dividing a grid into n distinct grid portions, where n is the number of processors available for populating the grid. The method also includes acts of dividing a plurality of objects into n distinct sets of objects, assigning a distinct set of objects to each processor such that each processor determines by which distinct grid portion(s) each object in its distinct set of objects is at least partially bounded, and assigning a distinct grid portion to each processor such that each processor populates its distinct grid portion with any objects that were previously determined to be at least partially bounded by its distinct grid portion.
An efficient and portable SIMD algorithm for charge/current deposition in Particle-In-Cell codes
NASA Astrophysics Data System (ADS)
Vincenti, H.; Lobet, M.; Lehe, R.; Sasanka, R.; Vay, J.-L.
2017-01-01
In current computer architectures, data movement (from die to network) is by far the most energy consuming part of an algorithm (≈ 20 pJ/word on-die to ≈10,000 pJ/word on the network). To increase memory locality at the hardware level and reduce energy consumption related to data movement, future exascale computers tend to use many-core processors on each compute nodes that will have a reduced clock speed to allow for efficient cooling. To compensate for frequency decrease, machine vendors are making use of long SIMD instruction registers that are able to process multiple data with one arithmetic operator in one clock cycle. SIMD register length is expected to double every four years. As a consequence, Particle-In-Cell (PIC) codes will have to achieve good vectorization to fully take advantage of these upcoming architectures. In this paper, we present a new algorithm that allows for efficient and portable SIMD vectorization of current/charge deposition routines that are, along with the field gathering routines, among the most time consuming parts of the PIC algorithm. Our new algorithm uses a particular data structure that takes into account memory alignment constraints and avoids gather/scatter instructions that can significantly affect vectorization performances on current CPUs. The new algorithm was successfully implemented in the 3D skeleton PIC code PICSAR and tested on Haswell Xeon processors (AVX2-256 bits wide data registers). Results show a factor of × 2 to × 2.5 speed-up in double precision for particle shape factor of orders 1- 3. The new algorithm can be applied as is on future KNL (Knights Landing) architectures that will include AVX-512 instruction sets with 512 bits register lengths (8 doubles/16 singles).
High-Speed Computation of the Kleene Star in Max-Plus Algebraic System Using a Cell Broadband Engine
NASA Astrophysics Data System (ADS)
Goto, Hiroyuki
This research addresses a high-speed computation method for the Kleene star of the weighted adjacency matrix in a max-plus algebraic system. We focus on systems whose precedence constraints are represented by a directed acyclic graph and implement it on a Cell Broadband Engine™ (CBE) processor. Since the resulting matrix gives the longest travel times between two adjacent nodes, it is often utilized in scheduling problem solvers for a class of discrete event systems. This research, in particular, attempts to achieve a speedup by using two approaches: parallelization and SIMDization (Single Instruction, Multiple Data), both of which can be accomplished by a CBE processor. The former refers to a parallel computation using multiple cores, while the latter is a method whereby multiple elements are computed by a single instruction. Using the implementation on a Sony PlayStation 3™ equipped with a CBE processor, we found that the SIMDization is effective regardless of the system's size and the number of processor cores used. We also found that the scalability of using multiple cores is remarkable especially for systems with a large number of nodes. In a numerical experiment where the number of nodes is 2000, we achieved a speedup of 20 times compared with the method without the above techniques.
Kalman filter tracking on parallel architectures
NASA Astrophysics Data System (ADS)
Cerati, G.; Elmer, P.; Krutelyov, S.; Lantz, S.; Lefebvre, M.; McDermott, K.; Riley, D.; Tadel, M.; Wittich, P.; Wurthwein, F.; Yagil, A.
2017-10-01
We report on the progress of our studies towards a Kalman filter track reconstruction algorithm with optimal performance on manycore architectures. The combinatorial structure of these algorithms is not immediately compatible with an efficient SIMD (or SIMT) implementation; the challenge for us is to recast the existing software so it can readily generate hundreds of shared-memory threads that exploit the underlying instruction set of modern processors. We show how the data and associated tasks can be organized in a way that is conducive to both multithreading and vectorization. We demonstrate very good performance on Intel Xeon and Xeon Phi architectures, as well as promising first results on Nvidia GPUs.
Instructional image processing on a university mainframe: The Kansas system
NASA Technical Reports Server (NTRS)
Williams, T. H. L.; Siebert, J.; Gunn, C.
1981-01-01
An interactive digital image processing program package was developed that runs on the University of Kansas central computer, a Honeywell Level 66 multi-processor system. The module form of the package allows easy and rapid upgrades and extensions of the system and is used in remote sensing courses in the Department of Geography, in regional five-day short courses for academics and professionals, and also in remote sensing projects and research. The package comprises three self-contained modules of processing functions: Subimage extraction and rectification; image enhancement, preprocessing and data reduction; and classification. Its use in a typical course setting is described. Availability and costs are considered.
Hyperswitch Communication Network Computer
NASA Technical Reports Server (NTRS)
Peterson, John C.; Chow, Edward T.; Priel, Moshe; Upchurch, Edwin T.
1993-01-01
Hyperswitch Communications Network (HCN) computer is prototype multiple-processor computer being developed. Incorporates improved version of hyperswitch communication network described in "Hyperswitch Network For Hypercube Computer" (NPO-16905). Designed to support high-level software and expansion of itself. HCN computer is message-passing, multiple-instruction/multiple-data computer offering significant advantages over older single-processor and bus-based multiple-processor computers, with respect to price/performance ratio, reliability, availability, and manufacturing. Design of HCN operating-system software provides flexible computing environment accommodating both parallel and distributed processing. Also achieves balance among following competing factors; performance in processing and communications, ease of use, and tolerance of (and recovery from) faults.
Integration, Development and Performance of the 500 TFLOPS Heterogeneous Cluster (Condor)
2012-08-01
PlayStation 3 for High Performance Cluster Computing” LAPACK Working Note 185, 2007. [ 4 ] Feng, W., X. Feng, and R. Ge, “Green Supercomputing Comes of...CONFERENCE PAPER (Post Print) 3. DATES COVERED (From - To) JUN 2010 – MAY 2013 4 . TITLE AND SUBTITLE INTEGRATION, DEVELOPMENT AND PERFORMANCE OF...and streaming processing; the PlayStation 3 uses the IBM Cell BE processor, which adopts the multi-processor, single-instruction-multiple- data (SIMD
Pierce, Paul E.
1986-01-01
A hardware processor is disclosed which in the described embodiment is a memory mapped multiplier processor that can operate in parallel with a 16 bit microcomputer. The multiplier processor decodes the address bus to receive specific instructions so that in one access it can write and automatically perform single or double precision multiplication involving a number written to it with or without addition or subtraction with a previously stored number. It can also, on a single read command automatically round and scale a previously stored number. The multiplier processor includes two concatenated 16 bit multiplier registers, two 16 bit concatenated 16 bit multipliers, and four 16 bit product registers connected to an internal 16 bit data bus. A high level address decoder determines when the multiplier processor is being addressed and first and second low level address decoders generate control signals. In addition, certain low order address lines are used to carry uncoded control signals. First and second control circuits coupled to the decoders generate further control signals and generate a plurality of clocking pulse trains in response to the decoded and address control signals.
NASA Astrophysics Data System (ADS)
Erez, Mattan; Dally, William J.
Stream processors, like other multi core architectures partition their functional units and storage into multiple processing elements. In contrast to typical architectures, which contain symmetric general-purpose cores and a cache hierarchy, stream processors have a significantly leaner design. Stream processors are specifically designed for the stream execution model, in which applications have large amounts of explicit parallel computation, structured and predictable control, and memory accesses that can be performed at a coarse granularity. Applications in the streaming model are expressed in a gather-compute-scatter form, yielding programs with explicit control over transferring data to and from on-chip memory. Relying on these characteristics, which are common to many media processing and scientific computing applications, stream architectures redefine the boundary between software and hardware responsibilities with software bearing much of the complexity required to manage concurrency, locality, and latency tolerance. Thus, stream processors have minimal control consisting of fetching medium- and coarse-grained instructions and executing them directly on the many ALUs. Moreover, the on-chip storage hierarchy of stream processors is under explicit software control, as is all communication, eliminating the need for complex reactive hardware mechanisms.
Pierce, P.E.
A hardware processor is disclosed which in the described embodiment is a memory mapped multiplier processor that can operate in parallel with a 16 bit microcomputer. The multiplier processor decodes the address bus to receive specific instructions so that in one access it can write and automatically perform single or double precision multiplication involving a number written to it with or without addition or subtraction with a previously stored number. It can also, on a single read command automatically round and scale a previously stored number. The multiplier processor includes two concatenated 16 bit multiplier registers, two 16 bit concatenated 16 bit multipliers, and four 16 bit product registers connected to an internal 16 bit data bus. A high level address decoder determines when the multiplier processor is being addressed and first and second low level address decoders generate control signals. In addition, certain low order address lines are used to carry uncoded control signals. First and second control circuits coupled to the decoders generate further control signals and generate a plurality of clocking pulse trains in response to the decoded and address control signals.
Multiprocessing on supercomputers for computational aerodynamics
NASA Technical Reports Server (NTRS)
Yarrow, Maurice; Mehta, Unmeel B.
1991-01-01
Little use is made of multiple processors available on current supercomputers (computers with a theoretical peak performance capability equal to 100 MFLOPS or more) to improve turnaround time in computational aerodynamics. The productivity of a computer user is directly related to this turnaround time. In a time-sharing environment, such improvement in this speed is achieved when multiple processors are used efficiently to execute an algorithm. The concept of multiple instructions and multiple data (MIMD) is applied through multitasking via a strategy that requires relatively minor modifications to an existing code for a single processor. This approach maps the available memory to multiple processors, exploiting the C-Fortran-Unix interface. The existing code is mapped without the need for developing a new algorithm. The procedure for building a code utilizing this approach is automated with the Unix stream editor.
Efficient Ada multitasking on a RISC register window architecture
NASA Technical Reports Server (NTRS)
Kearns, J. P.; Quammen, D.
1987-01-01
This work addresses the problem of reducing context switch overhead on a processor which supports a large register file - a register file much like that which is part of the Berkeley RISC processors and several other emerging architectures (which are not necessarily reduced instruction set machines in the purest sense). Such a reduction in overhead is particularly desirable in a real-time embedded application, in which task-to-task context switch overhead may result in failure to meet crucial deadlines. A storage management technique by which a context switch may be implemented as cheaply as a procedure call is presented. The essence of this technique is the avoidance of the save/restore of registers on the context switch. This is achieved through analysis of the static source text of an Ada tasking program. Information gained during that analysis directs the optimized storage management strategy for that program at run time. A formal verification of the technique in terms of an operational control model and an evaluation of the technique's performance via simulations driven by synthetic Ada program traces are presented.
Set processing in a network environment. [data bases and magnetic disks and tapes
NASA Technical Reports Server (NTRS)
Hardgrave, W. T.
1975-01-01
A combination of a local network, a mass storage system, and an autonomous set processor serving as a data/storage management machine is described. Its characteristics include: content-accessible data bases usable from all connected devices; efficient storage/access of large data bases; simple and direct programming with data manipulation and storage management handled by the set processor; simple data base design and entry from source representation to set processor representation with no predefinition necessary; capability available for user sort/order specification; significant reduction in tape/disk pack storage and mounts; flexible environment that allows upgrading hardware/software configuration without causing major interruptions in service; minimal traffic on data communications network; and improved central memory usage on large processors.
Balasubramonian, Rajeev [Sandy, UT; Dwarkadas, Sandhya [Rochester, NY; Albonesi, David [Ithaca, NY
2009-02-10
In a processor having multiple clusters which operate in parallel, the number of clusters in use can be varied dynamically. At the start of each program phase, the configuration option for an interval is run to determine the optimal configuration, which is used until the next phase change is detected. The optimum instruction interval is determined by starting with a minimum interval and doubling it until a low stability factor is reached.
A Parallel Algorithm for Contact in a Finite Element Hydrocode
DOE Office of Scientific and Technical Information (OSTI.GOV)
Pierce, Timothy G.
A parallel algorithm is developed for contact/impact of multiple three dimensional bodies undergoing large deformation. As time progresses the relative positions of contact between the multiple bodies changes as collision and sliding occurs. The parallel algorithm is capable of tracking these changes and enforcing an impenetrability constraint and momentum transfer across the surfaces in contact. Portions of the various surfaces of the bodies are assigned to the processors of a distributed-memory parallel machine in an arbitrary fashion, known as the primary decomposition. A secondary, dynamic decomposition is utilized to bring opposing sections of the contacting surfaces together on the samemore » processors, so that opposing forces may be balanced and the resultant deformation of the bodies calculated. The secondary decomposition is accomplished and updated using only local communication with a limited subset of neighbor processors. Each processor represents both a domain of the primary decomposition and a domain of the secondary, or contact, decomposition. Thus each processor has four sets of neighbor processors: (a) those processors which represent regions adjacent to it in the primary decomposition, (b) those processors which represent regions adjacent to it in the contact decomposition, (c) those processors which send it the data from which it constructs its contact domain, and (d) those processors to which it sends its primary domain data, from which they construct their contact domains. The latter three of these neighbor sets change dynamically as the simulation progresses. By constraining all communication to these sets of neighbors, all global communication, with its attendant nonscalable performance, is avoided. A set of tests are provided to measure the degree of scalability achieved by this algorithm on up to 1024 processors. Issues related to the operating system of the test platform which lead to some degradation of the results are analyzed. This algorithm has been implemented as the contact capability of the ALE3D multiphysics code, and is currently in production use.« less
A thin-film microprocessor with inkjet print-programmable memory
NASA Astrophysics Data System (ADS)
Myny, Kris; Smout, Steve; Rockelé, Maarten; Bhoolokam, Ajay; Ke, Tung Huei; Steudel, Soeren; Cobb, Brian; Gulati, Aashini; Rodriguez, Francisco Gonzalez; Obata, Koji; Marinkovic, Marko; Pham, Duy-Vu; Hoppe, Arne; Gelinck, Gerwin H.; Genoe, Jan; Dehaene, Wim; Heremans, Paul
2014-12-01
The Internet of Things is driving extensive efforts to develop intelligent everyday objects. This requires seamless integration of relatively simple electronics, for example through `stick-on' electronics labels. We believe the future evolution of this technology will be governed by Wright's Law, which was first proposed in 1936 and states that the cost of a product decreases with cumulative production. This implies that a generic electronic device that can be tailored for application-specific requirements during downstream integration would be a cornerstone in the development of the Internet of Things. We present an 8-bit thin-film microprocessor with a write-once, read-many (WORM) instruction generator that can be programmed after manufacture via inkjet printing. The processor combines organic p-type and soluble oxide n-type thin-film transistors in a new flavor of the familiar complementary transistor technology with the potential to be manufactured on a very thin polyimide film, enabling low-cost flexible electronics. It operates at 6.5 V and reaches clock frequencies up to 2.1 kHz. An instruction set of 16 code lines, each line providing a 9 bit instruction, is defined by means of inkjet printing of conductive silver inks.
Multiprocessing on supercomputers for computational aerodynamics
NASA Technical Reports Server (NTRS)
Yarrow, Maurice; Mehta, Unmeel B.
1990-01-01
Very little use is made of multiple processors available on current supercomputers (computers with a theoretical peak performance capability equal to 100 MFLOPs or more) in computational aerodynamics to significantly improve turnaround time. The productivity of a computer user is directly related to this turnaround time. In a time-sharing environment, the improvement in this speed is achieved when multiple processors are used efficiently to execute an algorithm. The concept of multiple instructions and multiple data (MIMD) through multi-tasking is applied via a strategy which requires relatively minor modifications to an existing code for a single processor. Essentially, this approach maps the available memory to multiple processors, exploiting the C-FORTRAN-Unix interface. The existing single processor code is mapped without the need for developing a new algorithm. The procedure for building a code utilizing this approach is automated with the Unix stream editor. As a demonstration of this approach, a Multiple Processor Multiple Grid (MPMG) code is developed. It is capable of using nine processors, and can be easily extended to a larger number of processors. This code solves the three-dimensional, Reynolds averaged, thin-layer and slender-layer Navier-Stokes equations with an implicit, approximately factored and diagonalized method. The solver is applied to generic oblique-wing aircraft problem on a four processor Cray-2 computer. A tricubic interpolation scheme is developed to increase the accuracy of coupling of overlapped grids. For the oblique-wing aircraft problem, a speedup of two in elapsed (turnaround) time is observed in a saturated time-sharing environment.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Evangelinos, Constantinos; Nair, Ravi; Ohmacht, Martin
In one embodiment, a computer-implemented method includes encountering a store operation during a compile-time of a program, where the store operation is applicable to a memory line. It is determined, by a computer processor, that no cache coherence action is necessary for the store operation. A store-without-coherence-action instruction is generated for the store operation, responsive to determining that no cache coherence action is necessary. The store-without-coherence-action instruction specifies that the store operation is to be performed without a cache coherence action, and cache coherence is maintained upon execution of the store-without-coherence-action instruction.
Schedulers with load-store queue awareness
DOE Office of Scientific and Technical Information (OSTI.GOV)
Chen, Tong; Eichenberger, Alexandre E.; Jacob, Arpith C.
2017-02-07
In one embodiment, a computer-implemented method includes tracking a size of a load-store queue (LSQ) during compile time of a program. The size of the LSQ is time-varying and indicates how many memory access instructions of the program are on the LSQ. The method further includes scheduling, by a computer processor, a plurality of memory access instructions of the program based on the size of the LSQ.
Schedulers with load-store queue awareness
DOE Office of Scientific and Technical Information (OSTI.GOV)
Chen, Tong; Eichenberger, Alexandre E.; Jacob, Arpith C.
2017-01-24
In one embodiment, a computer-implemented method includes tracking a size of a load-store queue (LSQ) during compile time of a program. The size of the LSQ is time-varying and indicates how many memory access instructions of the program are on the LSQ. The method further includes scheduling, by a computer processor, a plurality of memory access instructions of the program based on the size of the LSQ.
Simulink/PARS Integration Support
DOE Office of Scientific and Technical Information (OSTI.GOV)
Vacaliuc, B.; Nakhaee, N.
2013-12-18
The state of the art for signal processor hardware has far out-paced the development tools for placing applications on that hardware. In addition, signal processors are available in a variety of architectures, each uniquely capable of handling specific types of signal processing efficiently. With these processors becoming smaller and demanding less power, it has become possible to group multiple processors, a heterogeneous set of processors, into single systems. Different portions of the desired problem set can be assigned to different processor types as appropriate. As software development tools do not keep pace with these processors, especially when multiple processors ofmore » different types are used, a method is needed to enable software code portability among multiple processors and multiple types of processors along with their respective software environments. Sundance DSP, Inc. has developed a software toolkit called “PARS”, whose objective is to provide a framework that uses suites of tools provided by different vendors, along with modeling tools and a real time operating system, to build an application that spans different processor types. The software language used to express the behavior of the system is a very high level modeling language, “Simulink”, a MathWorks product. ORNL has used this toolkit to effectively implement several deliverables. This CRADA describes this collaboration between ORNL and Sundance DSP, Inc.« less
Move Over, Word Processors--Here Come the Databases.
ERIC Educational Resources Information Center
Olds, Henry F., Jr.; Dickenson, Anne
1985-01-01
Discusses the use of beginning, intermediate, and advanced databases for instructional purposes. A table listing seven databases with information on ease of use, smoothness of operation, data capacity, speed, source, and program features is included. (JN)
Data communications in a parallel active messaging interface of a parallel computer
Archer, Charles J; Blocksome, Michael A; Ratterman, Joseph D; Smith, Brian E
2013-11-12
Data communications in a parallel active messaging interface (`PAMI`) of a parallel computer composed of compute nodes that execute a parallel application, each compute node including application processors that execute the parallel application and at least one management processor dedicated to gathering information regarding data communications. The PAMI is composed of data communications endpoints, each endpoint composed of a specification of data communications parameters for a thread of execution on a compute node, including specifications of a client, a context, and a task, the compute nodes and the endpoints coupled for data communications through the PAMI and through data communications resources. Embodiments function by gathering call site statistics describing data communications resulting from execution of data communications instructions and identifying in dependence upon the call cite statistics a data communications algorithm for use in executing a data communications instruction at a call site in the parallel application.
NASA Technical Reports Server (NTRS)
Holland, Maurita Peterson; Pinelli, Thomas E.; Barclay, Rebecca O.; Kennedy, John M.
1991-01-01
U.S. aerospace engineering faculty and students were surveyed as part of the NASA/DoD Aerospace Knowledge Research Project. Faculty and students were viewed as information processors within a conceptual framework of information seeking behavior. Questionnaires were received from 275 faculty members and 640 students, which were used to determine: (1) use and importance of information sources; (2) use of specific print sources and electronic data bases; (3) use of information technology; and (4) the influence of instruction on the use of information sources and the products of faculty and students. Little evidence was found to support the belief that instruction in library or engineering information use has significant impact either on broadening the frequency or range of information products and sources used by U.S. aerospace engineering students.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Luo, Y.; Cameron, K.W.
1998-11-24
Workload characterization has been proven an essential tool to architecture design and performance evaluation in both scientific and commercial computing areas. Traditional workload characterization techniques include FLOPS rate, cache miss ratios, CPI (cycles per instruction or IPC, instructions per cycle) etc. With the complexity of sophisticated modern superscalar microprocessors, these traditional characterization techniques are not powerful enough to pinpoint the performance bottleneck of an application on a specific microprocessor. They are also incapable of immediately demonstrating the potential performance benefit of any architectural or functional improvement in a new processor design. To solve these problems, many people rely on simulators,more » which have substantial constraints especially on large-scale scientific computing applications. This paper presents a new technique of characterizing applications at the instruction level using hardware performance counters. It has the advantage of collecting instruction-level characteristics in a few runs virtually without overhead or slowdown. A variety of instruction counts can be utilized to calculate some average abstract workload parameters corresponding to microprocessor pipelines or functional units. Based on the microprocessor architectural constraints and these calculated abstract parameters, the architectural performance bottleneck for a specific application can be estimated. In particular, the analysis results can provide some insight to the problem that only a small percentage of processor peak performance can be achieved even for many very cache-friendly codes. Meanwhile, the bottleneck estimation can provide suggestions about viable architectural/functional improvement for certain workloads. Eventually, these abstract parameters can lead to the creation of an analytical microprocessor pipeline model and memory hierarchy model.« less
ELIPS: Toward a Sensor Fusion Processor on a Chip
NASA Technical Reports Server (NTRS)
Daud, Taher; Stoica, Adrian; Tyson, Thomas; Li, Wei-te; Fabunmi, James
1998-01-01
The paper presents the concept and initial tests from the hardware implementation of a low-power, high-speed reconfigurable sensor fusion processor. The Extended Logic Intelligent Processing System (ELIPS) processor is developed to seamlessly combine rule-based systems, fuzzy logic, and neural networks to achieve parallel fusion of sensor in compact low power VLSI. The first demonstration of the ELIPS concept targets interceptor functionality; other applications, mainly in robotics and autonomous systems are considered for the future. The main assumption behind ELIPS is that fuzzy, rule-based and neural forms of computation can serve as the main primitives of an "intelligent" processor. Thus, in the same way classic processors are designed to optimize the hardware implementation of a set of fundamental operations, ELIPS is developed as an efficient implementation of computational intelligence primitives, and relies on a set of fuzzy set, fuzzy inference and neural modules, built in programmable analog hardware. The hardware programmability allows the processor to reconfigure into different machines, taking the most efficient hardware implementation during each phase of information processing. Following software demonstrations on several interceptor data, three important ELIPS building blocks (a fuzzy set preprocessor, a rule-based fuzzy system and a neural network) have been fabricated in analog VLSI hardware and demonstrated microsecond-processing times.
Wolinski, Christophe Czeslaw [Los Alamos, NM; Gokhale, Maya B [Los Alamos, NM; McCabe, Kevin Peter [Los Alamos, NM
2011-01-18
Fabric-based computing systems and methods are disclosed. A fabric-based computing system can include a polymorphous computing fabric that can be customized on a per application basis and a host processor in communication with said polymorphous computing fabric. The polymorphous computing fabric includes a cellular architecture that can be highly parameterized to enable a customized synthesis of fabric instances for a variety of enhanced application performances thereof. A global memory concept can also be included that provides the host processor random access to all variables and instructions associated with the polymorphous computing fabric.
FPGA based control system for space instrumentation
NASA Astrophysics Data System (ADS)
Di Giorgio, Anna M.; Cerulli Irelli, Pasquale; Nuzzolo, Francesco; Orfei, Renato; Spinoglio, Luigi; Liu, Giovanni S.; Saraceno, Paolo
2008-07-01
The prototype for a general purpose FPGA based control system for space instrumentation is presented, with particular attention to the instrument control application software. The system HW is based on the LEON3FT processor, which gives the flexibility to configure the chip with only the necessary HW functionalities, from simple logic up to small dedicated processors. The instrument control SW is developed in ANSI C and for time critical (<10μs) commanding sequences implements an internal instructions sequencer, triggered via an interrupt service routine based on a HW high priority interrupt.
JPRS Report, Science & Technology, Europe.
1991-04-30
processor in collaboration with Intel . The processor , christened Touchstone, will be used as the core of a parallel computer with 2,000 processors . One of...ELECTRONIQUE HEBDO in French 24 Jan 91 pp 14-15 [Article by Claire Remy: "Everything Set for Neural Signal Processors " first paragraph is ELECTRONIQUE...paving the way for neural signal processors in so doing. The principal advantage of this specific circuit over a neuromimetic software program is
NASA Astrophysics Data System (ADS)
Schultz, A.
2010-12-01
3D forward solvers lie at the core of inverse formulations used to image the variation of electrical conductivity within the Earth's interior. This property is associated with variations in temperature, composition, phase, presence of volatiles, and in specific settings, the presence of groundwater, geothermal resources, oil/gas or minerals. The high cost of 3D solutions has been a stumbling block to wider adoption of 3D methods. Parallel algorithms for modeling frequency domain 3D EM problems have not achieved wide scale adoption, with emphasis on fairly coarse grained parallelism using MPI and similar approaches. The communications bandwidth as well as the latency required to send and receive network communication packets is a limiting factor in implementing fine grained parallel strategies, inhibiting wide adoption of these algorithms. Leading Graphics Processor Unit (GPU) companies now produce GPUs with hundreds of GPU processor cores per die. The footprint, in silicon, of the GPU's restricted instruction set is much smaller than the general purpose instruction set required of a CPU. Consequently, the density of processor cores on a GPU can be much greater than on a CPU. GPUs also have local memory, registers and high speed communication with host CPUs, usually through PCIe type interconnects. The extremely low cost and high computational power of GPUs provides the EM geophysics community with an opportunity to achieve fine grained (i.e. massive) parallelization of codes on low cost hardware. The current generation of GPUs (e.g. NVidia Fermi) provides 3 billion transistors per chip die, with nearly 500 processor cores and up to 6 GB of fast (DDR5) GPU memory. This latest generation of GPU supports fast hardware double precision (64 bit) floating point operations of the type required for frequency domain EM forward solutions. Each Fermi GPU board can sustain nearly 1 TFLOP in double precision, and multiple boards can be installed in the host computer system. We describe our ongoing efforts to achieve massive parallelization on a novel hybrid GPU testbed machine currently configured with 12 Intel Westmere Xeon CPU cores (or 24 parallel computational threads) with 96 GB DDR3 system memory, 4 GPU subsystems which in aggregate contain 960 NVidia Tesla GPU cores with 16 GB dedicated DDR3 GPU memory, and a second interleved bank of 4 GPU subsystems containing in aggregate 1792 NVidia Fermi GPU cores with 12 GB dedicated DDR5 GPU memory. We are applying domain decomposition methods to a modified version of Weiss' (2001) 3D frequency domain full physics EM finite difference code, an open source GPL licensed f90 code available for download from www.OpenEM.org. This will be the core of a new hybrid 3D inversion that parallelizes frequencies across CPUs and individual forward solutions across GPUs. We describe progress made in modifying the code to use direct solvers in GPU cores dedicated to each small subdomain, iteratively improving the solution by matching adjacent subdomain boundary solutions, rather than iterative Krylov space sparse solvers as currently applied to the whole domain.
Communications and Information: Compendium of Communications and Information Terminology
2002-02-01
Basic Access Module BASIC— Beginners All-Purpose Symbolic Instruction Code BBP—Baseband Processor BBS—Bulletin Board Service (System) BBTC—Broadband...media, formats and labels, programming language, computer documentation, flowcharts and terminology, character codes, data communications and input
Efficient Parallel Algorithms on Restartable Fail-Stop Processors
1991-01-01
resource (memory), and ( 3 ) that processors, memory and their interconnection must be The model of parallel computation known as the Par- perfectly...setting), arid ure an(I restart errors. We describe these arguments if] [AAtPS 871 (in a deterministic setting). Fault-tolerance Section 3 . of...grannmarity at the processor level --- for recent work on where Al is the nmber of failures during this step’s gate granilarities see [All 90, Pip 85
Near-memory data reorganization engine
DOE Office of Scientific and Technical Information (OSTI.GOV)
Gokhale, Maya; Lloyd, G. Scott
A memory subsystem package is provided that has processing logic for data reorganization within the memory subsystem package. The processing logic is adapted to reorganize data stored within the memory subsystem package. In some embodiments, the memory subsystem package includes memory units, a memory interconnect, and a data reorganization engine ("DRE"). The data reorganization engine includes a stream interconnect and DRE units including a control processor and a load-store unit. The control processor is adapted to execute instructions to control a data reorganization. The load-store unit is adapted to process data move commands received from the control processor via themore » stream interconnect for loading data from a load memory address of a memory unit and storing data to a store memory address of a memory unit.« less
Store operations to maintain cache coherence
Evangelinos, Constantinos; Nair, Ravi; Ohmacht, Martin
2017-08-01
In one embodiment, a computer-implemented method includes encountering a store operation during a compile-time of a program, where the store operation is applicable to a memory line. It is determined, by a computer processor, that no cache coherence action is necessary for the store operation. A store-without-coherence-action instruction is generated for the store operation, responsive to determining that no cache coherence action is necessary. The store-without-coherence-action instruction specifies that the store operation is to be performed without a cache coherence action, and cache coherence is maintained upon execution of the store-without-coherence-action instruction.
A Comparative Study : Microprogrammed Vs Risc Architectures For Symbolic Processing
NASA Astrophysics Data System (ADS)
Heudin, J. C.; Metivier, C.; Demigny, D.; Maurin, T.; Zavidovique, B.; Devos, F.
1987-05-01
It is oftenclaimed that conventional computers are not well suited for human-like tasks : Vision (Image Processing), Intelligence (Symbolic Processing) ... In the particular case of Artificial Intelligence, dynamic type-checking is one example of basic task that must be improved. The solution implemented in most Lisp work-stations consists in a microprogrammed architecture with a tagged memory. Another way to gain efficiency is to design a well suited instruction set for symbolic processing, which reduces the semantic gap between the high level language and the machine code. In this framework, the RISC concept provides a convenient approach to study new architectures for symbolic processing. This paper compares both approaches and describes our projectof designing a compact symbolic processor for Artificial Intelligence applications.
An ultra-compact processor module based on the R3000
NASA Astrophysics Data System (ADS)
Mullenhoff, D. J.; Kaschmitter, J. L.; Lyke, J. C.; Forman, G. A.
1992-08-01
Viable high density packaging is of critical importance for future military systems, particularly space borne systems which require minimum weight and size and high mechanical integrity. A leading, emerging technology for high density packaging is multi-chip modules (MCM). During the 1980's, a number of different MCM technologies have emerged. In support of Strategic Defense Initiative Organization (SDIO) programs, Lawrence Livermore National Laboratory (LLNL) has developed, utilized, and evaluated several different MCM technologies. Prior LLNL efforts include modules developed in 1986, using hybrid wafer scale packaging, which are still operational in an Air Force satellite mission. More recent efforts have included very high density cache memory modules, developed using laser pantography. As part of the demonstration effort, LLNL and Phillips Laboratory began collaborating in 1990 in the Phase 3 Multi-Chip Module (MCM) technology demonstration project. The goal of this program was to demonstrate the feasibility of General Electric's (GE) High Density Interconnect (HDI) MCM technology. The design chosen for this demonstration was the processor core for a MIPS R3000 based reduced instruction set computer (RISC), which has been described previously. It consists of the R3000 microprocessor, R3010 floating point coprocessor and 128 Kbytes of cache memory.
Knowledge-based processing for aircraft flight control
NASA Technical Reports Server (NTRS)
Painter, John H.
1991-01-01
The purpose is to develop algorithms and architectures for embedding artificial intelligence in aircraft guidance and control systems. With the approach adopted, AI-computing is used to create an outer guidance loop for driving the usual aircraft autopilot. That is, a symbolic processor monitors the operation and performance of the aircraft. Then, based on rules and other stored knowledge, commands are automatically formulated for driving the autopilot so as to accomplish desired flight operations. The focus is on developing a software system which can respond to linguistic instructions, input in a standard format, so as to formulate a sequence of simple commands to the autopilot. The instructions might be a fairly complex flight clearance, input either manually or by data-link. Emphasis is on a software system which responds much like a pilot would, employing not only precise computations, but, also, knowledge which is less precise, but more like common-sense. The approach is based on prior work to develop a generic 'shell' architecture for an AI-processor, which may be tailored to many applications by describing the application in appropriate processor data bases (libraries). Such descriptions include numerical models of the aircraft and flight control system, as well as symbolic (linguistic) descriptions of flight operations, rules, and tactics.
Eigensolution of finite element problems in a completely connected parallel architecture
NASA Technical Reports Server (NTRS)
Akl, Fred A.; Morel, Michael R.
1989-01-01
A parallel algorithm for the solution of the generalized eigenproblem in linear elastic finite element analysis, (K)(phi)=(M)(phi)(omega), where (K) and (M) are of order N, and (omega) is of order q is presented. The parallel algorithm is based on a completely connected parallel architecture in which each processor is allowed to communicate with all other processors. The algorithm has been successfully implemented on a tightly coupled multiple-instruction-multiple-data (MIMD) parallel processing computer, Cray X-MP. A finite element model is divided into m domains each of which is assumed to process n elements. Each domain is then assigned to a processor, or to a logical processor (task) if the number of domains exceeds the number of physical processors. The macro-tasking library routines are used in mapping each domain to a user task. Computational speed-up and efficiency are used to determine the effectiveness of the algorithm. The effect of the number of domains, the number of degrees-of-freedom located along the global fronts and the dimension of the subspace on the performance of the algorithm are investigated. For a 64-element rectangular plate, speed-ups of 1.86, 3.13, 3.18 and 3.61 are achieved on two, four, six and eight processors, respectively.
An implementation of a tree code on a SIMD, parallel computer
NASA Technical Reports Server (NTRS)
Olson, Kevin M.; Dorband, John E.
1994-01-01
We describe a fast tree algorithm for gravitational N-body simulation on SIMD parallel computers. The tree construction uses fast, parallel sorts. The sorted lists are recursively divided along their x, y and z coordinates. This data structure is a completely balanced tree (i.e., each particle is paired with exactly one other particle) and maintains good spatial locality. An implementation of this tree-building algorithm on a 16k processor Maspar MP-1 performs well and constitutes only a small fraction (approximately 15%) of the entire cycle of finding the accelerations. Each node in the tree is treated as a monopole. The tree search and the summation of accelerations also perform well. During the tree search, node data that is needed from another processor is simply fetched. Roughly 55% of the tree search time is spent in communications between processors. We apply the code to two problems of astrophysical interest. The first is a simulation of the close passage of two gravitationally, interacting, disk galaxies using 65,636 particles. We also simulate the formation of structure in an expanding, model universe using 1,048,576 particles. Our code attains speeds comparable to one head of a Cray Y-MP, so single instruction, multiple data (SIMD) type computers can be used for these simulations. The cost/performance ratio for SIMD machines like the Maspar MP-1 make them an extremely attractive alternative to either vector processors or large multiple instruction, multiple data (MIMD) type parallel computers. With further optimizations (e.g., more careful load balancing), speeds in excess of today's vector processing computers should be possible.
Balasubramonian, Rajeev [Sandy, UT; Dwarkadas, Sandhya [Rochester, NY; Albonesi, David [Ithaca, NY
2012-01-24
In a processor having multiple clusters which operate in parallel, the number of clusters in use can be varied dynamically. At the start of each program phase, the configuration option for an interval is run to determine the optimal configuration, which is used until the next phase change is detected. The optimum instruction interval is determined by starting with a minimum interval and doubling it until a low stability factor is reached.
Compiler-assisted multiple instruction rollback recovery using a read buffer
NASA Technical Reports Server (NTRS)
Alewine, N. J.; Chen, S.-K.; Fuchs, W. K.; Hwu, W.-M.
1993-01-01
Multiple instruction rollback (MIR) is a technique that has been implemented in mainframe computers to provide rapid recovery from transient processor failures. Hardware-based MIR designs eliminate rollback data hazards by providing data redundancy implemented in hardware. Compiler-based MIR designs have also been developed which remove rollback data hazards directly with data-flow transformations. This paper focuses on compiler-assisted techniques to achieve multiple instruction rollback recovery. We observe that some data hazards resulting from instruction rollback can be resolved efficiently by providing an operand read buffer while others are resolved more efficiently with compiler transformations. A compiler-assisted multiple instruction rollback scheme is developed which combines hardware-implemented data redundancy with compiler-driven hazard removal transformations. Experimental performance evaluations indicate improved efficiency over previous hardware-based and compiler-based schemes.
List-mode PET image reconstruction for motion correction using the Intel XEON PHI co-processor
NASA Astrophysics Data System (ADS)
Ryder, W. J.; Angelis, G. I.; Bashar, R.; Gillam, J. E.; Fulton, R.; Meikle, S.
2014-03-01
List-mode image reconstruction with motion correction is computationally expensive, as it requires projection of hundreds of millions of rays through a 3D array. To decrease reconstruction time it is possible to use symmetric multiprocessing computers or graphics processing units. The former can have high financial costs, while the latter can require refactoring of algorithms. The Xeon Phi is a new co-processor card with a Many Integrated Core architecture that can run 4 multiple-instruction, multiple data threads per core with each thread having a 512-bit single instruction, multiple data vector register. Thus, it is possible to run in the region of 220 threads simultaneously. The aim of this study was to investigate whether the Xeon Phi co-processor card is a viable alternative to an x86 Linux server for accelerating List-mode PET image reconstruction for motion correction. An existing list-mode image reconstruction algorithm with motion correction was ported to run on the Xeon Phi coprocessor with the multi-threading implemented using pthreads. There were no differences between images reconstructed using the Phi co-processor card and images reconstructed using the same algorithm run on a Linux server. However, it was found that the reconstruction runtimes were 3 times greater for the Phi than the server. A new version of the image reconstruction algorithm was developed in C++ using OpenMP for mutli-threading and the Phi runtimes decreased to 1.67 times that of the host Linux server. Data transfer from the host to co-processor card was found to be a rate-limiting step; this needs to be carefully considered in order to maximize runtime speeds. When considering the purchase price of a Linux workstation with Xeon Phi co-processor card and top of the range Linux server, the former is a cost-effective computation resource for list-mode image reconstruction. A multi-Phi workstation could be a viable alternative to cluster computers at a lower cost for medical imaging applications.
NASA Technical Reports Server (NTRS)
Pitts, E. R.
1976-01-01
The DJANAL (DisJunct ANALyzer) Program provides a means for the LSI designer to format output from the Mask Analysis Program (MAP) for input to the FETLOG (FETSIM/LOGSIM) processor. This document presents a brief description of the operation of DJANAL and provides comprehensive instruction for its use.
Software Coherence in Multiprocessor Memory Systems. Ph.D. Thesis
NASA Technical Reports Server (NTRS)
Bolosky, William Joseph
1993-01-01
Processors are becoming faster and multiprocessor memory interconnection systems are not keeping up. Therefore, it is necessary to have threads and the memory they access as near one another as possible. Typically, this involves putting memory or caches with the processors, which gives rise to the problem of coherence: if one processor writes an address, any other processor reading that address must see the new value. This coherence can be maintained by the hardware or with software intervention. Systems of both types have been built in the past; the hardware-based systems tended to outperform the software ones. However, the ratio of processor to interconnect speed is now so high that the extra overhead of the software systems may no longer be significant. This issue is explored both by implementing a software maintained system and by introducing and using the technique of offline optimal analysis of memory reference traces. It finds that in properly built systems, software maintained coherence can perform comparably to or even better than hardware maintained coherence. The architectural features necessary for efficient software coherence to be profitable include a small page size, a fast trap mechanism, and the ability to execute instructions while remote memory references are outstanding.
Single bus star connected reluctance drive and method
Fahimi, Babak; Shamsi, Pourya
2016-05-10
A system and methods for operating a switched reluctance machine includes a controller, an inverter connected to the controller and to the switched reluctance machine, a hysteresis control connected to the controller and to the inverter, a set of sensors connected to the switched reluctance machine and to the controller, the switched reluctance machine further including a set of phases the controller further comprising a processor and a memory connected to the processor, wherein the processor programmed to execute a control process and a generation process.
A Real-Time Image Acquisition And Processing System For A RISC-Based Microcomputer
NASA Astrophysics Data System (ADS)
Luckman, Adrian J.; Allinson, Nigel M.
1989-03-01
A low cost image acquisition and processing system has been developed for the Acorn Archimedes microcomputer. Using a Reduced Instruction Set Computer (RISC) architecture, the ARM (Acorn Risc Machine) processor provides instruction speeds suitable for image processing applications. The associated improvement in data transfer rate has allowed real-time video image acquisition without the need for frame-store memory external to the microcomputer. The system is comprised of real-time video digitising hardware which interfaces directly to the Archimedes memory, and software to provide an integrated image acquisition and processing environment. The hardware can digitise a video signal at up to 640 samples per video line with programmable parameters such as sampling rate and gain. Software support includes a work environment for image capture and processing with pixel, neighbourhood and global operators. A friendly user interface is provided with the help of the Archimedes Operating System WIMP (Windows, Icons, Mouse and Pointer) Manager. Windows provide a convenient way of handling images on the screen and program control is directed mostly by pop-up menus.
Optimal partitioning of random programs across two processors
NASA Technical Reports Server (NTRS)
Nicol, D. M.
1986-01-01
The optimal partitioning of random distributed programs is discussed. It is concluded that the optimal partitioning of a homogeneous random program over a homogeneous distributed system either assigns all modules to a single processor, or distributes the modules as evenly as possible among all processors. The analysis rests heavily on the approximation which equates the expected maximum of a set of independent random variables with the set's maximum expectation. The results are strengthened by providing an approximation-free proof of this result for two processors under general conditions on the module execution time distribution. It is also shown that use of this approximation causes two of the previous central results to be false.
Schafer, Erin C; Romine, Denise; Musgrave, Elizabeth; Momin, Sadaf; Huynh, Christy
2013-01-01
Previous research has suggested that electrically coupled frequency modulation (FM) systems substantially improved speech-recognition performance in noise in individuals with cochlear implants (CIs). However, there is limited evidence to support the use of electromagnetically coupled (neck loop) FM receivers with contemporary CI sound processors containing telecoils. The primary goal of this study was to compare speech-recognition performance in noise and subjective ratings of adolescents and adults using one of three contemporary CI sound processors coupled to electromagnetically and electrically coupled FM receivers from Oticon. A repeated-measures design was used to compare speech-recognition performance in noise and subjective ratings without and with the FM systems across three test sessions (Experiment 1) and to compare performance at different FM-gain settings (Experiment 2). Descriptive statistics were used in Experiment 3 to describe output differences measured through a CI sound processor. Experiment 1 included nine adolescents or adults with unilateral or bilateral Advanced Bionics Harmony (n = 3), Cochlear Nucleus 5 (n = 3), and MED-EL OPUS 2 (n = 3) CI sound processors. In Experiment 2, seven of the original nine participants were tested. In Experiment 3, electroacoustic output was measured from a Nucleus 5 sound processor when coupled to the electromagnetically coupled Oticon Arc neck loop and electrically coupled Oticon R2. In Experiment 1, participants completed a field trial with each FM receiver and three test sessions that included speech-recognition performance in noise and a subjective rating scale. In Experiment 2, participants were tested in three receiver-gain conditions. Results in both experiments were analyzed using repeated-measures analysis of variance. Experiment 3 involved electroacoustic-test measures to determine the monitor-earphone output of the CI alone and CI coupled to the two FM receivers. The results in Experiment 1 suggested that both FM receivers provided significantly better speech-recognition performance in noise than the CI alone; however, the electromagnetically coupled receiver provided significantly better speech-recognition performance in noise and better ratings in some situations than the electrically coupled receiver when set to the same gain. In Experiment 2, the primary analysis suggested significantly better speech-recognition performance in noise for the neck-loop versus electrically coupled receiver, but a second analysis, using the best performance across gain settings for each device, revealed no significant differences between the two FM receivers. Experiment 3 revealed monitor-earphone output differences in the Nucleus 5 sound processor for the two FM receivers when set to the +8 setting used in Experiment 1 but equal output when the electrically coupled device was set to a +16 gain setting and the electromagnetically coupled device was set to the +8 gain setting. Individuals with contemporary sound processors may show more favorable speech-recognition performance in noise electromagnetically coupled FM systems (i.e., Oticon Arc), which is most likely related to the input processing and signal processing pathway within the CI sound processor for direct input versus telecoil input. Further research is warranted to replicate these findings with a larger sample size and to develop and validate a more objective approach to fitting FM systems to CI sound processors. American Academy of Audiology.
49 CFR 236.929 - Training specific to roadway workers.
Code of Federal Regulations, 2010 CFR
2010-10-01
... themselves or roadway work groups. (b) What subject areas must roadway worker training include? (1... control equipment in establishing protection for roadway workers and their equipment. (2) Instruction for roadway workers must ensure recognition of processor-based signal and train control equipment on the...
Efficiency of parallel direct optimization
NASA Technical Reports Server (NTRS)
Janies, D. A.; Wheeler, W. C.
2001-01-01
Tremendous progress has been made at the level of sequential computation in phylogenetics. However, little attention has been paid to parallel computation. Parallel computing is particularly suited to phylogenetics because of the many ways large computational problems can be broken into parts that can be analyzed concurrently. In this paper, we investigate the scaling factors and efficiency of random addition and tree refinement strategies using the direct optimization software, POY, on a small (10 slave processors) and a large (256 slave processors) cluster of networked PCs running LINUX. These algorithms were tested on several data sets composed of DNA and morphology ranging from 40 to 500 taxa. Various algorithms in POY show fundamentally different properties within and between clusters. All algorithms are efficient on the small cluster for the 40-taxon data set. On the large cluster, multibuilding exhibits excellent parallel efficiency, whereas parallel building is inefficient. These results are independent of data set size. Branch swapping in parallel shows excellent speed-up for 16 slave processors on the large cluster. However, there is no appreciable speed-up for branch swapping with the further addition of slave processors (>16). This result is independent of data set size. Ratcheting in parallel is efficient with the addition of up to 32 processors in the large cluster. This result is independent of data set size. c2001 The Willi Hennig Society.
Geospace simulations on the Cell BE processor
NASA Astrophysics Data System (ADS)
Germaschewski, K.; Raeder, J.; Larson, D.
2008-12-01
OpenGGCM (Open Geospace General circulation Model) is an established numerical code that simulates the Earth's space environment. The most computing intensive part is the MHD (magnetohydrodynamics) solver that models the plasma surrounding Earth and its interaction with Earth's magnetic field and the solar wind flowing in from the sun. Like other global magnetosphere codes, OpenGGCM's realism is limited by computational constraints on grid resolution. We investigate porting of the MHD solver to the Cell BE architecture, a novel inhomogeneous multicore architecture capable of up to 230 GFlops per processor. Realizing this high performance on the Cell processor is a programming challenge, though. We implemented the MHD solver using a multi-level parallel approach: On the coarsest level, the problem is distributed to processors based upon the usual domain decomposition approach. Then, on each processor, the problem is divided into 3D columns, each of which is handled by the memory limited SPEs (synergistic processing elements) slice by slice. Finally, SIMD instructions are used to fully exploit the vector/SIMD FPUs in each SPE. Memory management needs to be handled explicitly by the code, using DMA to move data from main memory to the per-SPE local store and vice versa. We obtained excellent performance numbers, a speed-up of a factor of 25 compared to just using the main processor, while still keeping the numerical implementation details of the code maintainable.
NASA Technical Reports Server (NTRS)
Dorband, John E.
1987-01-01
Generating graphics to faithfully represent information can be a computationally intensive task. A way of using the Massively Parallel Processor to generate images by ray tracing is presented. This technique uses sort computation, a method of performing generalized routing interspersed with computation on a single-instruction-multiple-data (SIMD) computer.
49 CFR 236.1049 - Training specific to roadway workers.
Code of Federal Regulations, 2010 CFR
2010-10-01
... who provide protection for themselves or roadway work groups. (b) Training subject areas. (1... control equipment in establishing protection for roadway workers and their equipment. (2) Instruction for... recognition of processor-based signal and train control equipment on the wayside and an understanding of how...
Tyagi, Neelam; Bose, Abhijit; Chetty, Indrin J
2004-09-01
We have parallelized the Dose Planning Method (DPM), a Monte Carlo code optimized for radiotherapy class problems, on distributed-memory processor architectures using the Message Passing Interface (MPI). Parallelization has been investigated on a variety of parallel computing architectures at the University of Michigan-Center for Advanced Computing, with respect to efficiency and speedup as a function of the number of processors. We have integrated the parallel pseudo random number generator from the Scalable Parallel Pseudo-Random Number Generator (SPRNG) library to run with the parallel DPM. The Intel cluster consisting of 800 MHz Intel Pentium III processor shows an almost linear speedup up to 32 processors for simulating 1 x 10(8) or more particles. The speedup results are nearly linear on an Athlon cluster (up to 24 processors based on availability) which consists of 1.8 GHz+ Advanced Micro Devices (AMD) Athlon processors on increasing the problem size up to 8 x 10(8) histories. For a smaller number of histories (1 x 10(8)) the reduction of efficiency with the Athlon cluster (down to 83.9% with 24 processors) occurs because the processing time required to simulate 1 x 10(8) histories is less than the time associated with interprocessor communication. A similar trend was seen with the Opteron Cluster (consisting of 1400 MHz, 64-bit AMD Opteron processors) on increasing the problem size. Because of the 64-bit architecture Opteron processors are capable of storing and processing instructions at a faster rate and hence are faster as compared to the 32-bit Athlon processors. We have validated our implementation with an in-phantom dose calculation study using a parallel pencil monoenergetic electron beam of 20 MeV energy. The phantom consists of layers of water, lung, bone, aluminum, and titanium. The agreement in the central axis depth dose curves and profiles at different depths shows that the serial and parallel codes are equivalent in accuracy.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Hunn, B. D.; Diamond, S. C.; Bennett, G. A.
1977-10-01
A set of computer programs, called Cal-ERDA, is described that is capable of rapid and detailed analysis of energy consumption in buildings. A new user-oriented input language, named the Building Design Language (BDL), has been written to allow simplified manipulation of the many variables used to describe a building and its operation. This manual provides the user with information necessary to understand in detail the Cal-ERDA set of computer programs. The new computer programs described include: an EXECUTIVE Processor to create computer system control commands; a BDL Processor to analyze input instructions, execute computer system control commands, perform assignments andmore » data retrieval, and control the operation of the LOADS, SYSTEMS, PLANT, ECONOMICS, and REPORT programs; a LOADS analysis program that calculates peak (design) zone and hourly loads and the effect of the ambient weather conditions, the internal occupancy, lighting, and equipment within the building, as well as variations in the size, location, orientation, construction, walls, roofs, floors, fenestrations, attachments (awnings, balconies), and shape of a building; a Heating, Ventilating, and Air-Conditioning (HVAC) SYSTEMS analysis program capable of modeling the operation of HVAC components including fans, coils, economizers, humidifiers, etc.; 16 standard configurations and operated according to various temperature and humidity control schedules. A plant equipment program models the operation of boilers, chillers, electrical generation equipment (diesel or turbines), heat storage apparatus (chilled or heated water), and solar heating and/or cooling systems. An ECONOMIC analysis program calculates life-cycle costs. A REPORT program produces tables of user-selected variables and arranges them according to user-specified formats. A set of WEATHER ANALYSIS programs manipulates, summarizes and plots weather data. Libraries of weather data, schedule data, and building data were prepared.« less
Fault-free behavior of reliable multiprocessor systems: FTMP experiments in AIRLAB
NASA Technical Reports Server (NTRS)
Clune, E.; Segall, Z.; Siewiorek, D.
1985-01-01
This report describes a set of experiments which were implemented on the Fault tolerant Multi-Processor (FTMP) at NASA/Langley's AIRLAB facility. These experiments are part of an effort to formulate and evaluate validation methodologies for fault-tolerant computers. This report deals with the measurement of single parameters (baselines) of a fault free system. The initial set of baseline experiments lead to the following conclusions: (1) The system clock is constant and independent of workload in the tested cases; (2) the instruction execution times are constant; (3) the R4 frame size is 40mS with some variation; (4) the frame stretching mechanism has some flaws in its implementation that allow the possibility of an infinite stretching of frame duration. Future experiments are planned. Some will broaden the results of these initial experiments. Others will measure the system more dynamically. The implementation of a synthetic workload generation mechanism for FTMP is planned to enhance the experimental environment of the system.
Compiler-assisted multiple instruction rollback recovery using a read buffer
NASA Technical Reports Server (NTRS)
Alewine, Neal J.; Chen, Shyh-Kwei; Fuchs, W. Kent; Hwu, Wen-Mei W.
1995-01-01
Multiple instruction rollback (MIR) is a technique that has been implemented in mainframe computers to provide rapid recovery from transient processor failures. Hardware-based MIR designs eliminate rollback data hazards by providing data redundancy implemented in hardware. Compiler-based MIR designs have also been developed which remove rollback data hazards directly with data-flow transformations. This paper describes compiler-assisted techniques to achieve multiple instruction rollback recovery. We observe that some data hazards resulting from instruction rollback can be resolved efficiently by providing an operand read buffer while others are resolved more efficiently with compiler transformations. The compiler-assisted scheme presented consists of hardware that is less complex than shadow files, history files, history buffers, or delayed write buffers, while experimental evaluation indicates performance improvement over compiler-based schemes.
Parallel eigenanalysis of finite element models in a completely connected architecture
NASA Technical Reports Server (NTRS)
Akl, F. A.; Morel, M. R.
1989-01-01
A parallel algorithm is presented for the solution of the generalized eigenproblem in linear elastic finite element analysis, (K)(phi) = (M)(phi)(omega), where (K) and (M) are of order N, and (omega) is order of q. The concurrent solution of the eigenproblem is based on the multifrontal/modified subspace method and is achieved in a completely connected parallel architecture in which each processor is allowed to communicate with all other processors. The algorithm was successfully implemented on a tightly coupled multiple-instruction multiple-data parallel processing machine, Cray X-MP. A finite element model is divided into m domains each of which is assumed to process n elements. Each domain is then assigned to a processor or to a logical processor (task) if the number of domains exceeds the number of physical processors. The macrotasking library routines are used in mapping each domain to a user task. Computational speed-up and efficiency are used to determine the effectiveness of the algorithm. The effect of the number of domains, the number of degrees-of-freedom located along the global fronts and the dimension of the subspace on the performance of the algorithm are investigated. A parallel finite element dynamic analysis program, p-feda, is documented and the performance of its subroutines in parallel environment is analyzed.
Assessment of directionality performances: comparison between Freedom and CP810 sound processors.
Razza, Sergio; Albanese, Greta; Ermoli, Lucilla; Zaccone, Monica; Cristofari, Eliana
2013-10-01
To compare speech recognition in noise for the Nucleus Freedom and CP810 sound processors using different directional settings among those available in the SmartSound portfolio. Single-subject, repeated measures study. Tertiary care referral center. Thirty-one monoaurally and binaurally implanted subjects (24 children and 7 adults) were enrolled. They were all experienced Nucleus Freedom sound processor users and achieved a 100% open set word recognition score in quiet listening conditions. Each patient was fitted with the Freedom and the CP810 processor. The program setting incorporated Adaptive Dynamic Range Optimization (ADRO) and adopted the directional algorithm BEAM (both devices) and ZOOM (only on CP810). Speech reception threshold (SRT) was assessed in a free-field layout, with disyllabic word list and interfering multilevel babble noise in the 3 different pre-processing configurations. On average, CP810 improved significantly patients' SRTs as compared to Freedom SP after 1 hour of use. Instead, no significant difference was observed in patients' SRT between the BEAM and the ZOOM algorithm fitted in the CP810 processor. The results suggest that hardware developments achieved in the design of CP810 allow an immediate and relevant directional advantage as compared to the previous-generation Freedom device.
Research on numerical control system based on S3C2410 and MCX314AL
NASA Astrophysics Data System (ADS)
Ren, Qiang; Jiang, Tingbiao
2008-10-01
With the rapid development of micro-computer technology, embedded system, CNC technology and integrated circuits, numerical control system with powerful functions can be realized by several high-speed CPU chips and RISC (Reduced Instruction Set Computing) chips which have small size and strong stability. In addition, the real-time operating system also makes the attainment of embedded system possible. Developing the NC system based on embedded technology can overcome some shortcomings of common PC-based CNC system, such as the waste of resources, low control precision, low frequency and low integration. This paper discusses a hardware platform of ENC (Embedded Numerical Control) system based on embedded processor chip ARM (Advanced RISC Machines)-S3C2410 and DSP (Digital Signal Processor)-MCX314AL and introduces the process of developing ENC system software. Finally write the MCX314AL's driver under the embedded Linux operating system. The embedded Linux operating system can deal with multitask well moreover satisfy the real-time and reliability of movement control. NC system has the advantages of best using resources and compact system with embedded technology. It provides a wealth of functions and superior performance with a lower cost. It can be sure that ENC is the direction of the future development.
PC-CUBE: A Personal Computer Based Hypercube
NASA Technical Reports Server (NTRS)
Ho, Alex; Fox, Geoffrey; Walker, David; Snyder, Scott; Chang, Douglas; Chen, Stanley; Breaden, Matt; Cole, Terry
1988-01-01
PC-CUBE is an ensemble of IBM PCs or close compatibles connected in the hypercube topology with ordinary computer cables. Communication occurs at the rate of 115.2 K-band via the RS-232 serial links. Available for PC-CUBE is the Crystalline Operating System III (CrOS III), Mercury Operating System, CUBIX and PLOTIX which are parallel I/O and graphics libraries. A CrOS performance monitor was developed to facilitate the measurement of communication and computation time of a program and their effects on performance. Also available are CXLISP, a parallel version of the XLISP interpreter; GRAFIX, some graphics routines for the EGA and CGA; and a general execution profiler for determining execution time spent by program subroutines. PC-CUBE provides a programming environment similar to all hypercube systems running CrOS III, Mercury and CUBIX. In addition, every node (personal computer) has its own graphics display monitor and storage devices. These allow data to be displayed or stored at every processor, which has much instructional value and enables easier debugging of applications. Some application programs which are taken from the book Solving Problems on Concurrent Processors (Fox 88) were implemented with graphics enhancement on PC-CUBE. The applications range from solving the Mandelbrot set, Laplace equation, wave equation, long range force interaction, to WaTor, an ecological simulation.
Custom instruction set NIOS-based OFDM processor for FPGAs
NASA Astrophysics Data System (ADS)
Meyer-Bäse, Uwe; Sunkara, Divya; Castillo, Encarnacion; Garcia, Antonio
2006-05-01
Orthogonal Frequency division multiplexing (OFDM) spread spectrum technique, sometimes also called multi-carrier or discrete multi-tone modulation, are used in bandwidth-efficient communication systems in the presence of channel distortion. The benefits of OFDM are high spectral efficiency, resiliency to RF interference, and lower multi-path distortion. OFDM is the basis for the European digital audio broadcasting (DAB) standard, the global asymmetric digital subscriber line (ADSL) standard, in the IEEE 802.11 5.8 GHz band standard, and ongoing development in wireless local area networks. The modulator and demodulator in an OFDM system can be implemented by use of a parallel bank of filters based on the discrete Fourier transform (DFT), in case the number of subchannels is large (e.g. K > 25), the OFDM system are efficiently implemented by use of the fast Fourier transform (FFT) to compute the DFT. We have developed a custom FPGA-based Altera NIOS system to increase the performance, programmability, and low power in mobil wireless systems. The overall gain observed for a 1024-point FFT ranges depending on the multiplier used by the NIOS processor between a factor of 3 and 16. A careful optimization described in the appendix yield a performance gain of up to 77% when compared with our preliminary results.
Multiple core computer processor with globally-accessible local memories
DOE Office of Scientific and Technical Information (OSTI.GOV)
Shalf, John; Donofrio, David; Oliker, Leonid
A multi-core computer processor including a plurality of processor cores interconnected in a Network-on-Chip (NoC) architecture, a plurality of caches, each of the plurality of caches being associated with one and only one of the plurality of processor cores, and a plurality of memories, each of the plurality of memories being associated with a different set of at least one of the plurality of processor cores and each of the plurality of memories being configured to be visible in a global memory address space such that the plurality of memories are visible to two or more of the plurality ofmore » processor cores.« less
Intelligent sensor and controller framework for the power grid
Akyol, Bora A.; Haack, Jereme Nathan; Craig, Jr., Philip Allen; Tews, Cody William; Kulkarni, Anand V.; Carpenter, Brandon J.; Maiden, Wendy M.; Ciraci, Selim
2015-07-28
Disclosed below are representative embodiments of methods, apparatus, and systems for monitoring and using data in an electric power grid. For example, one disclosed embodiment comprises a sensor for measuring an electrical characteristic of a power line, electrical generator, or electrical device; a network interface; a processor; and one or more computer-readable storage media storing computer-executable instructions. In this embodiment, the computer-executable instructions include instructions for implementing an authorization and authentication module for validating a software agent received at the network interface; instructions for implementing one or more agent execution environments for executing agent code that is included with the software agent and that causes data from the sensor to be collected; and instructions for implementing an agent packaging and instantiation module for storing the collected data in a data container of the software agent and for transmitting the software agent, along with the stored data, to a next destination.
Dynamic wavefront creation for processing units using a hybrid compactor
DOE Office of Scientific and Technical Information (OSTI.GOV)
Puthoor, Sooraj; Beckmann, Bradford M.; Yudanov, Dmitri
A method, a non-transitory computer readable medium, and a processor for repacking dynamic wavefronts during program code execution on a processing unit, each dynamic wavefront including multiple threads are presented. If a branch instruction is detected, a determination is made whether all wavefronts following a same control path in the program code have reached a compaction point, which is the branch instruction. If no branch instruction is detected in executing the program code, a determination is made whether all wavefronts following the same control path have reached a reconvergence point, which is a beginning of a program code segment tomore » be executed by both a taken branch and a not taken branch from a previous branch instruction. The dynamic wavefronts are repacked with all threads that follow the same control path, if all wavefronts following the same control path have reached the branch instruction or the reconvergence point.« less
Intelligent sensor and controller framework for the power grid
DOE Office of Scientific and Technical Information (OSTI.GOV)
Akyol, Bora A.; Haack, Jereme Nathan; Craig, Jr., Philip Allen
Disclosed below are representative embodiments of methods, apparatus, and systems for monitoring and using data in an electric power grid. For example, one disclosed embodiment comprises a sensor for measuring an electrical characteristic of a power line, electrical generator, or electrical device; a network interface; a processor; and one or more computer-readable storage media storing computer-executable instructions. In this embodiment, the computer-executable instructions include instructions for implementing an authorization and authentication module for validating a software agent received at the network interface; instructions for implementing one or more agent execution environments for executing agent code that is included with themore » software agent and that causes data from the sensor to be collected; and instructions for implementing an agent packaging and instantiation module for storing the collected data in a data container of the software agent and for transmitting the software agent, along with the stored data, to a next destination.« less
Testing and operating a multiprocessor chip with processor redundancy
Bellofatto, Ralph E; Douskey, Steven M; Haring, Rudolf A; McManus, Moyra K; Ohmacht, Martin; Schmunkamp, Dietmar; Sugavanam, Krishnan; Weatherford, Bryan J
2014-10-21
A system and method for improving the yield rate of a multiprocessor semiconductor chip that includes primary processor cores and one or more redundant processor cores. A first tester conducts a first test on one or more processor cores, and encodes results of the first test in an on-chip non-volatile memory. A second tester conducts a second test on the processor cores, and encodes results of the second test in an external non-volatile storage device. An override bit of a multiplexer is set if a processor core fails the second test. In response to the override bit, the multiplexer selects a physical-to-logical mapping of processor IDs according to one of: the encoded results in the memory device or the encoded results in the external storage device. On-chip logic configures the processor cores according to the selected physical-to-logical mapping.
SPAR improved structural-fluid dynamic analysis capability
NASA Technical Reports Server (NTRS)
Pearson, M. L.
1985-01-01
The results of a study whose objective was to improve the operation of the SPAR computer code by improving efficiency, user features, and documentation is presented. Additional capability was added to the SPAR arithmetic utility system, including trigonometric functions, numerical integration, interpolation, and matrix combinations. Improvements were made in the EIG processor. A processor was created to compute and store principal stresses in table-format data sets. An additional capability was developed and incorporated into the plot processor which permits plotting directly from table-format data sets. Documentation of all these features is provided in the form of updates to the SPAR users manual.
System on a chip with MPEG-4 capability
NASA Astrophysics Data System (ADS)
Yassa, Fathy; Schonfeld, Dan
2002-12-01
Current products supporting video communication applications rely on existing computer architectures. RISC processors have been used successfully in numerous applications over several decades. DSP processors have become ubiquitous in signal processing and communication applications. Real-time applications such as speech processing in cellular telephony rely extensively on the computational power of these processors. Video processors designed to implement the computationally intensive codec operations have also been used to address the high demands of video communication applications (e.g., cable set-top boxes and DVDs). This paper presents an overview of a system-on-chip (SOC) architecture used for real-time video in wireless communication applications. The SOC specifications answer to the system requirements imposed by the application environment. A CAM-based video processor is used to accelerate data intensive video compression tasks such as motion estimations and filtering. Other components are dedicated to system level data processing and audio processing. A rich set of I/Os allows the SOC to communicate with other system components such as baseband and memory subsystems.
System Level RBDO for Military Ground Vehicles using High Performance Computing
2008-01-01
platform. Only the analyses that required more than 24 processors were conducted on the Onyx 350 due to the limited number of processors on the...optimization constraints varied. The queues set the number of processors and number of finite element code licenses available to the analyses. sgi ONYX ...3900: unix 24 MIPS R16000 PROCESSORS 4 IR2 GRAPHICS PIPES 4 IR3 GRAPHICS PIPES 24 GBYTES MEMORY 36 GBYTES LOCAL DISK SPACE sgi ONYX 350: unix 32 MIPS
Processor and method for developing a set of admissible fixture designs for a workpiece
Brost, R.C.; Goldberg, K.Y.; Wallack, A.S.; Canny, J.
1996-08-13
A fixture process and method is provided for developing a complete set of all admissible fixture designs for a workpiece which prevents the workpiece from translating or rotating. The fixture processor generates the set of all admissible designs based on geometric access constraints and expected applied forces on the workpiece. For instance, the fixture processor may generate a set of admissible fixture designs for first, second and third locators placed in an array of holes on a fixture plate and a translating clamp attached to the fixture plate for contacting the workpiece. In another instance, a fixture vice is used in which first, second, third and fourth locators are used and first and second fixture jaws are tightened to secure the workpiece. The fixture process also ranks the set of admissible fixture designs according to a predetermined quality metric so that the optimal fixture design for the desired purpose may be identified from the set of all admissible fixture designs. 27 figs.
Processor and method for developing a set of admissible fixture designs for a workpiece
Brost, Randolph C.; Goldberg, Kenneth Y.; Canny, John; Wallack, Aaron S.
1999-01-01
Methods and apparatus are provided for developing a complete set of all admissible Type I and Type II fixture designs for a workpiece. The fixture processor generates the set of all admissible designs based on geometric access constraints and expected applied forces on the workpiece. For instance, the fixture processor may generate a set of admissible fixture designs for first, second and third locators placed in an array of holes on a fixture plate and a translating clamp attached to the fixture plate for contacting the workpiece. In another instance, a fixture vise is used in which first, second, third and fourth locators are used and first and second fixture jaws are tightened to secure the workpiece. The fixture process also ranks the set of admissible fixture designs according to a predetermined quality metric so that the optimal fixture design for the desired purpose may be identified from the set of all admissible fixture designs.
Processor and method for developing a set of admissible fixture designs for a workpiece
Brost, Randolph C.; Goldberg, Kenneth Y.; Wallack, Aaron S.; Canny, John
1996-01-01
A fixture process and method is provided for developing a complete set of all admissible fixture designs for a workpiece which prevents the workpiece from translating or rotating. The fixture processor generates the set of all admissible designs based on geometric access constraints and expected applied forces on the workpiece. For instance, the fixture processor may generate a set of admissible fixture designs for first, second and third locators placed in an array of holes on a fixture plate and a translating clamp attached to the fixture plate for contacting the workpiece. In another instance, a fixture vice is used in which first, second, third and fourth locators are used and first and second fixture jaws are tightened to secure the workpiece. The fixture process also ranks the set of admissible fixture designs according to a predetermined quality metric so that the optimal fixture design for the desired purpose may be identified from the set of all admissible fixture designs.
Processor and method for developing a set of admissible fixture designs for a workpiece
Brost, R.C.; Goldberg, K.Y.; Canny, J.; Wallack, A.S.
1999-01-05
Methods and apparatus are provided for developing a complete set of all admissible Type 1 and Type 2 fixture designs for a workpiece. The fixture processor generates the set of all admissible designs based on geometric access constraints and expected applied forces on the workpiece. For instance, the fixture processor may generate a set of admissible fixture designs for first, second and third locators placed in an array of holes on a fixture plate and a translating clamp attached to the fixture plate for contacting the workpiece. In another instance, a fixture vise is used in which first, second, third and fourth locators are used and first and second fixture jaws are tightened to secure the workpiece. The fixture process also ranks the set of admissible fixture designs according to a predetermined quality metric so that the optimal fixture design for the desired purpose may be identified from the set of all admissible fixture designs. 44 figs.
Holo-Chidi video concentrator card
NASA Astrophysics Data System (ADS)
Nwodoh, Thomas A.; Prabhakar, Aditya; Benton, Stephen A.
2001-12-01
The Holo-Chidi Video Concentrator Card is a frame buffer for the Holo-Chidi holographic video processing system. Holo- Chidi is designed at the MIT Media Laboratory for real-time computation of computer generated holograms and the subsequent display of the holograms at video frame rates. The Holo-Chidi system is made of two sets of cards - the set of Processor cards and the set of Video Concentrator Cards (VCCs). The Processor cards are used for hologram computation, data archival/retrieval from a host system, and for higher-level control of the VCCs. The VCC formats computed holographic data from multiple hologram computing Processor cards, converting the digital data to analog form to feed the acousto-optic-modulators of the Media lab's Mark-II holographic display system. The Video Concentrator card is made of: a High-Speed I/O (HSIO) interface whence data is transferred from the hologram computing Processor cards, a set of FIFOs and video RAM used as buffer for data for the hololines being displayed, a one-chip integrated microprocessor and peripheral combination that handles communication with other VCCs and furnishes the card with a USB port, a co-processor which controls display data formatting, and D-to-A converters that convert digital fringes to analog form. The co-processor is implemented with an SRAM-based FPGA with over 500,000 gates and controls all the signals needed to format the data from the multiple Processor cards into the format required by Mark-II. A VCC has three HSIO ports through which up to 500 Megabytes of computed holographic data can flow from the Processor Cards to the VCC per second. A Holo-Chidi system with three VCCs has enough frame buffering capacity to hold up to thirty two 36Megabyte hologram frames at a time. Pre-computed holograms may also be loaded into the VCC from a host computer through the low- speed USB port. Both the microprocessor and the co- processor in the VCC can access the main system memory used to store control programs and data for the VCC. The Card also generates the control signals used by the scanning mirrors of Mark-II. In this paper we discuss the design of the VCC and its implementation in the Holo-Chidi system.
Power estimation on functional level for programmable processors
NASA Astrophysics Data System (ADS)
Schneider, M.; Blume, H.; Noll, T. G.
2004-05-01
In diesem Beitrag werden verschiedene Ansätze zur Verlustleistungsschätzung von programmierbaren Prozessoren vorgestellt und bezüglich ihrer Übertragbarkeit auf moderne Prozessor-Architekturen wie beispielsweise Very Long Instruction Word (VLIW)-Architekturen bewertet. Besonderes Augenmerk liegt hierbei auf dem Konzept der sogenannten Functional-Level Power Analysis (FLPA). Dieser Ansatz basiert auf der Einteilung der Prozessor-Architektur in funktionale Blöcke wie beispielsweise Processing-Unit, Clock-Netzwerk, interner Speicher und andere. Die Verlustleistungsaufnahme dieser Bl¨ocke wird parameterabhängig durch arithmetische Modellfunktionen beschrieben. Durch automatisierte Analyse von Assemblercodes des zu schätzenden Systems mittels eines Parsers können die Eingangsparameter wie beispielsweise der erzielte Parallelitätsgrad oder die Art des Speicherzugriffs gewonnen werden. Dieser Ansatz wird am Beispiel zweier moderner digitaler Signalprozessoren durch eine Vielzahl von Basis-Algorithmen der digitalen Signalverarbeitung evaluiert. Die ermittelten Schätzwerte für die einzelnen Algorithmen werden dabei mit physikalisch gemessenen Werten verglichen. Es ergibt sich ein sehr kleiner maximaler Schätzfehler von 3%. In this contribution different approaches for power estimation for programmable processors are presented and evaluated concerning their capability to be applied to modern digital signal processor architectures like e.g. Very Long InstructionWord (VLIW) -architectures. Special emphasis will be laid on the concept of so-called Functional-Level Power Analysis (FLPA). This approach is based on the separation of the processor architecture into functional blocks like e.g. processing unit, clock network, internal memory and others. The power consumption of these blocks is described by parameter dependent arithmetic model functions. By application of a parser based automized analysis of assembler codes of the systems to be estimated the input parameters of the Correspondence to: H. Blume (blume@eecs.rwth-aachen.de) arithmetic functions like e.g. the achieved degree of parallelism or the kind and number of memory accesses can be computed. This approach is exemplarily demonstrated and evaluated applying two modern digital signal processors and a variety of basic algorithms of digital signal processing. The resulting estimation values for the inspected algorithms are compared to physically measured values. A resulting maximum estimation error of 3% is achieved.
NASA Astrophysics Data System (ADS)
Olson, Richard F.
2013-05-01
Rendering of point scatterer based radar scenes for millimeter wave (mmW) seeker tests in real-time hardware-in-the-loop (HWIL) scene generation requires efficient algorithms and vector-friendly computer architectures for complex signal synthesis. New processor technology from Intel implements an extended 256-bit vector SIMD instruction set (AVX, AVX2) in a multi-core CPU design providing peak execution rates of hundreds of GigaFLOPS (GFLOPS) on one chip. Real world mmW scene generation code can approach peak SIMD execution rates only after careful algorithm and source code design. An effective software design will maintain high computing intensity emphasizing register-to-register SIMD arithmetic operations over data movement between CPU caches or off-chip memories. Engineers at the U.S. Army Aviation and Missile Research, Development and Engineering Center (AMRDEC) applied two basic parallel coding methods to assess new 256-bit SIMD multi-core architectures for mmW scene generation in HWIL. These include use of POSIX threads built on vector library functions and more portable, highlevel parallel code based on compiler technology (e.g. OpenMP pragmas and SIMD autovectorization). Since CPU technology is rapidly advancing toward high processor core counts and TeraFLOPS peak SIMD execution rates, it is imperative that coding methods be identified which produce efficient and maintainable parallel code. This paper describes the algorithms used in point scatterer target model rendering, the parallelization of those algorithms, and the execution performance achieved on an AVX multi-core machine using the two basic parallel coding methods. The paper concludes with estimates for scale-up performance on upcoming multi-core technology.
RISC-type microprocessors may revolutionize aerospace simulation
NASA Astrophysics Data System (ADS)
Jackson, Albert S.
The author explores the application of RISC (reduced instruction set computer) processors in massively parallel computer (MPC) designs for aerospace simulation. The MPC approach is shown to be well adapted to the needs of aerospace simulation. It is shown that any of the three common types of interconnection schemes used with MPCs are effective for general-purpose simulation, although the bus-or switch-oriented machines are somewhat easier to use. For partial differential equation models, the hypercube approach at first glance appears more efficient because the nearest-neighbor connections required for three-dimensional models are hardwired in a hypercube machine. However, the data broadcast ability of a bus system, combined with the fact that data can be transmitted over a bus as soon as it has been updated, makes the bus approach very competitive with the hypercube approach even for these types of models.
Design of a search and rescue terminal based on the dual-mode satellite and CDMA network
NASA Astrophysics Data System (ADS)
Zhao, Junping; Zhang, Xuan; Zheng, Bing; Zhou, Yubin; Song, Hao; Song, Wei; Zhang, Meikui; Liu, Tongze; Zhou, Li
2010-12-01
The current goal is to create a set of portable terminals with GPS/BD2 dual-mode satellite positioning, vital signs monitoring and wireless transmission functions. The terminal depends on an ARM processor to collect and combine data related to vital signs and GPS/BD2 location information, and sends the message to headquarters through the military CDMA network. It integrates multiple functions as a whole. The satellite positioning and wireless transmission capabilities are integrated into the motherboard, and the vital signs sensors used in the form of belts communicate with the board through Bluetooth. It can be adjusted according to the headquarters' instructions. This kind of device is of great practical significance for operations during disaster relief, search and rescue of the wounded in wartime, non-war military operations and other special circumstances.
Computers as Instructional Aids.
ERIC Educational Resources Information Center
Wright, Anne
The use of microcomputers as word processors for writing papers is commonplace in English departments, but there are many less well-known uses that English teachers can make of the computer. For example, word processing programs can be used to teach sentence combining. Moving text on the screen is very easy, so it is possible to rearrange words or…
Radio-nuclide mixture identification using medium energy resolution detectors
Nelson, Karl Einar
2013-09-17
According to one embodiment, a method for identifying radio-nuclides includes receiving spectral data, extracting a feature set from the spectral data comparable to a plurality of templates in a template library, and using a branch and bound method to determine a probable template match based on the feature set and templates in the template library. In another embodiment, a device for identifying unknown radio-nuclides includes a processor, a multi-channel analyzer, and a memory operatively coupled to the processor, the memory having computer readable code stored thereon. The computer readable code is configured, when executed by the processor, to receive spectral data, to extract a feature set from the spectral data comparable to a plurality of templates in a template library, and to use a branch and bound method to determine a probable template match based on the feature set and templates in the template library.
System and method for motor fault detection using stator current noise cancellation
Zhou, Wei; Lu, Bin; Nowak, Michael P.; Dimino, Steven A.
2010-12-07
A system and method for detecting incipient mechanical motor faults by way of current noise cancellation is disclosed. The system includes a controller configured to detect indicia of incipient mechanical motor faults. The controller further includes a processor programmed to receive a baseline set of current data from an operating motor and define a noise component in the baseline set of current data. The processor is also programmed to acquire at least on additional set of real-time operating current data from the motor during operation, redefine the noise component present in each additional set of real-time operating current data, and remove the noise component from the operating current data in real-time to isolate any fault components present in the operating current data. The processor is then programmed to generate a fault index for the operating current data based on any isolated fault components.
Hybrid multicore/vectorisation technique applied to the elastic wave equation on a staggered grid
NASA Astrophysics Data System (ADS)
Titarenko, Sofya; Hildyard, Mark
2017-07-01
In modern physics it has become common to find the solution of a problem by solving numerically a set of PDEs. Whether solving them on a finite difference grid or by a finite element approach, the main calculations are often applied to a stencil structure. In the last decade it has become usual to work with so called big data problems where calculations are very heavy and accelerators and modern architectures are widely used. Although CPU and GPU clusters are often used to solve such problems, parallelisation of any calculation ideally starts from a single processor optimisation. Unfortunately, it is impossible to vectorise a stencil structured loop with high level instructions. In this paper we suggest a new approach to rearranging the data structure which makes it possible to apply high level vectorisation instructions to a stencil loop and which results in significant acceleration. The suggested method allows further acceleration if shared memory APIs are used. We show the effectiveness of the method by applying it to an elastic wave propagation problem on a finite difference grid. We have chosen Intel architecture for the test problem and OpenMP (Open Multi-Processing) since they are extensively used in many applications.
NASA Astrophysics Data System (ADS)
Bellerby, Tim
2015-04-01
PM (Parallel Models) is a new parallel programming language specifically designed for writing environmental and geophysical models. The language is intended to enable implementers to concentrate on the science behind the model rather than the details of running on parallel hardware. At the same time PM leaves the programmer in control - all parallelisation is explicit and the parallel structure of any given program may be deduced directly from the code. This paper describes a PM implementation based on the Message Passing Interface (MPI) and Open Multi-Processing (OpenMP) standards, looking at issues involved with translating the PM parallelisation model to MPI/OpenMP protocols and considering performance in terms of the competing factors of finer-grained parallelisation and increased communication overhead. In order to maximise portability, the implementation stays within the MPI 1.3 standard as much as possible, with MPI-2 MPI-IO file handling the only significant exception. Moreover, it does not assume a thread-safe implementation of MPI. PM adopts a two-tier abstract representation of parallel hardware. A PM processor is a conceptual unit capable of efficiently executing a set of language tasks, with a complete parallel system consisting of an abstract N-dimensional array of such processors. PM processors may map to single cores executing tasks using cooperative multi-tasking, to multiple cores or even to separate processing nodes, efficiently sharing tasks using algorithms such as work stealing. While tasks may move between hardware elements within a PM processor, they may not move between processors without specific programmer intervention. Tasks are assigned to processors using a nested parallelism approach, building on ideas from Reyes et al. (2009). The main program owns all available processors. When the program enters a parallel statement then either processors are divided out among the newly generated tasks (number of new tasks < number of processors) or tasks are divided out among the available processors (number of tasks > number of processors). Nested parallel statements may further subdivide the processor set owned by a given task. Tasks or processors are distributed evenly by default, but uneven distributions are possible under programmer control. It is also possible to explicitly enable child tasks to migrate within the processor set owned by their parent task, reducing load unbalancing at the potential cost of increased inter-processor message traffic. PM incorporates some programming structures from the earlier MIST language presented at a previous EGU General Assembly, while adopting a significantly different underlying parallelisation model and type system. PM code is available at www.pm-lang.org under an unrestrictive MIT license. Reference Ruymán Reyes, Antonio J. Dorta, Francisco Almeida, Francisco de Sande, 2009. Automatic Hybrid MPI+OpenMP Code Generation with llc, Recent Advances in Parallel Virtual Machine and Message Passing Interface, Lecture Notes in Computer Science Volume 5759, 185-195
NASA Astrophysics Data System (ADS)
Bross, Benjamin; Alvarez-Mesa, Mauricio; George, Valeri; Chi, Chi Ching; Mayer, Tobias; Juurlink, Ben; Schierl, Thomas
2013-09-01
The new High Efficiency Video Coding Standard (HEVC) was finalized in January 2013. Compared to its predecessor H.264 / MPEG4-AVC, this new international standard is able to reduce the bitrate by 50% for the same subjective video quality. This paper investigates decoder optimizations that are needed to achieve HEVC real-time software decoding on a mobile processor. It is shown that HEVC real-time decoding up to high definition video is feasible using instruction extensions of the processor while decoding 4K ultra high definition video in real-time requires additional parallel processing. For parallel processing, a picture-level parallel approach has been chosen because it is generic and does not require bitstreams with special indication.
Reader set encoding for directory of shared cache memory in multiprocessor system
Ahn, Dnaiel; Ceze, Luis H.; Gara, Alan; Ohmacht, Martin; Xiaotong, Zhuang
2014-06-10
In a parallel processing system with speculative execution, conflict checking occurs in a directory lookup of a cache memory that is shared by all processors. In each case, the same physical memory address will map to the same set of that cache, no matter which processor originated that access. The directory includes a dynamic reader set encoding, indicating what speculative threads have read a particular line. This reader set encoding is used in conflict checking. A bitset encoding is used to specify particular threads that have read the line.
SPAR data set contents. [finite element structural analysis system
NASA Technical Reports Server (NTRS)
Cunningham, S. W.
1981-01-01
The contents of the stored data sets of the SPAR (space processing applications rocket) finite element structural analysis system are documented. The data generated by each of the system's processors are stored in a data file organized as a library. Each data set, containing a two-dimensional table or matrix, is identified by a four-word name listed in a table of contents. The creating SPAR processor, number of rows and columns, and definitions of each of the data items are listed for each data set. An example SPAR problem using these data sets is also presented.
Picoradio: Communication/Computation Piconodes for Sensor Networks
2003-01-02
diagram of PicoNode III, or Quark node. It is made from two custom chips, Strange RF and Charm digital processor , and is complemented by a set of...the chipset comprising of Strange (analog OOK transceiver) and Charm (digital processor ) chips. 44 Figure 33: System block diagram of the Quark node...19 2.B PICONODE II - TWO-CHIP PICONODE IMPLEMENTATION ......................................... 21 2.B.1 Baseband processor (BBP
Multi-mode sensor processing on a dynamically reconfigurable massively parallel processor array
NASA Astrophysics Data System (ADS)
Chen, Paul; Butts, Mike; Budlong, Brad; Wasson, Paul
2008-04-01
This paper introduces a novel computing architecture that can be reconfigured in real time to adapt on demand to multi-mode sensor platforms' dynamic computational and functional requirements. This 1 teraOPS reconfigurable Massively Parallel Processor Array (MPPA) has 336 32-bit processors. The programmable 32-bit communication fabric provides streamlined inter-processor connections with deterministically high performance. Software programmability, scalability, ease of use, and fast reconfiguration time (ranging from microseconds to milliseconds) are the most significant advantages over FPGAs and DSPs. This paper introduces the MPPA architecture, its programming model, and methods of reconfigurability. An MPPA platform for reconfigurable computing is based on a structural object programming model. Objects are software programs running concurrently on hundreds of 32-bit RISC processors and memories. They exchange data and control through a network of self-synchronizing channels. A common application design pattern on this platform, called a work farm, is a parallel set of worker objects, with one input and one output stream. Statically configured work farms with homogeneous and heterogeneous sets of workers have been used in video compression and decompression, network processing, and graphics applications.
Floating-Point Modules Targeted for Use with RC Compilation Tools
NASA Technical Reports Server (NTRS)
Sahin, Ibrahin; Gloster, Clay S.
2000-01-01
Reconfigurable Computing (RC) has emerged as a viable computing solution for computationally intensive applications. Several applications have been mapped to RC system and in most cases, they provided the smallest published execution time. Although RC systems offer significant performance advantages over general-purpose processors, they require more application development time than general-purpose processors. This increased development time of RC systems provides the motivation to develop an optimized module library with an assembly language instruction format interface for use with future RC system that will reduce development time significantly. In this paper, we present area/performance metrics for several different types of floating point (FP) modules that can be utilized to develop complex FP applications. These modules are highly pipelined and optimized for both speed and area. Using these modules, and example application, FP matrix multiplication, is also presented. Our results and experiences show, that with these modules, 8-10X speedup over general-purpose processors can be achieved.
1992-10-01
5 2.1 Network Overview ................................................ 5 2.2 Network Access Methods...to TAC and ?4ini-TAC users, such as common error messages, TAC commands, and instructions for performing file tranders. Section 5 , Network Use...originally known as Interface Message Processors, or IMPs. 5 THE DEFENSE DATA NETWORK DRAFt NIC 60001, October 1992 message do not necessarily take the same
Mongoose: Creation of a Rad-Hard MIPS R3000
NASA Technical Reports Server (NTRS)
Lincoln, Dan; Smith, Brian
1993-01-01
This paper describes the development of a 32 Bit, full MIPS R3000 code-compatible Rad-Hard CPU, code named Mongoose. Mongoose progressed from contract award, through the design cycle, to operational silicon in 12 months to meet a space mission for NASA. The goal was the creation of a fully static device capable of operation to the maximum Mil-883 derated speed, worst-case post-rad exposure with full operational integrity. This included consideration of features for functional enhancements relating to mission compatibility and removal of commercial practices not supported by Rad-Hard technology. 'Mongoose' developed from an evolution of LSI Logic's MIPS-I embedded processor, LR33000, code named Cobra, to its Rad-Hard 'equivalent', Mongoose. The term 'equivalent' is used to infer that the core of the processor is functionally identical, allowing the same use and optimizations of the MIPS-I Instruction Set software tool suite for compilation, software program trace, etc. This activity was started in September of 1991 under a contract from NASA-Goddard Space Flight Center (GSFC)-Flight Data Systems. The approach affected a teaming of NASA-GSFC for program development, LSI Logic for system and ASIC design coupled with the Rad-Hard process technology, and Harris (GASD) for Rad-Hard microprocessor design expertise. The program culminated with the generation of Rad-Hard Mongoose prototypes one year later.
NASA Astrophysics Data System (ADS)
Iwato, Hirofumi; Sakanushi, Keishi; Takeuchi, Yoshinori; Imai, Masaharu
To measure the detrusor pressure for diagnosing lower urinary tract symptoms, we designed a small-area and low-power System on a Chip (SoC). The SoC should be small and low power because it is encapsulated in tiny air-tight capsules which are simultaneously inserted in the urinary bladder and rectum for several days. Since the SoC is also required to be programmable, we designed an Application Specific Instruction set Processor (ASIP) for pressure measurement and wireless communication, and implemented almost required functions on the ASIP. The SoC was fabricated using a 0.18µm CMOS mixed-signal process and the chip size is 2.5×2.5mm2. Evaluation results show that the power consumption of the SoC is 93.5µW, and that it can operate the capsule for seven days with a tiny battery.
Using a Cray Y-MP as an array processor for a RISC Workstation
NASA Technical Reports Server (NTRS)
Lamaster, Hugh; Rogallo, Sarah J.
1992-01-01
As microprocessors increase in power, the economics of centralized computing has changed dramatically. At the beginning of the 1980's, mainframes and super computers were often considered to be cost-effective machines for scalar computing. Today, microprocessor-based RISC (reduced-instruction-set computer) systems have displaced many uses of mainframes and supercomputers. Supercomputers are still cost competitive when processing jobs that require both large memory size and high memory bandwidth. One such application is array processing. Certain numerical operations are appropriate to use in a Remote Procedure Call (RPC)-based environment. Matrix multiplication is an example of an operation that can have a sufficient number of arithmetic operations to amortize the cost of an RPC call. An experiment which demonstrates that matrix multiplication can be executed remotely on a large system to speed the execution over that experienced on a workstation is described.
Line-drawing algorithms for parallel machines
NASA Technical Reports Server (NTRS)
Pang, Alex T.
1990-01-01
The fact that conventional line-drawing algorithms, when applied directly on parallel machines, can lead to very inefficient codes is addressed. It is suggested that instead of modifying an existing algorithm for a parallel machine, a more efficient implementation can be produced by going back to the invariants in the definition. Popular line-drawing algorithms are compared with two alternatives; distance to a line (a point is on the line if sufficiently close to it) and intersection with a line (a point on the line if an intersection point). For massively parallel single-instruction-multiple-data (SIMD) machines (with thousands of processors and up), the alternatives provide viable line-drawing algorithms. Because of the pixel-per-processor mapping, their performance is independent of the line length and orientation.
Optimal processor assignment for pipeline computations
NASA Technical Reports Server (NTRS)
Nicol, David M.; Simha, Rahul; Choudhury, Alok N.; Narahari, Bhagirath
1991-01-01
The availability of large scale multitasked parallel architectures introduces the following processor assignment problem for pipelined computations. Given a set of tasks and their precedence constraints, along with their experimentally determined individual responses times for different processor sizes, find an assignment of processor to tasks. Two objectives are of interest: minimal response given a throughput requirement, and maximal throughput given a response time requirement. These assignment problems differ considerably from the classical mapping problem in which several tasks share a processor; instead, it is assumed that a large number of processors are to be assigned to a relatively small number of tasks. Efficient assignment algorithms were developed for different classes of task structures. For a p processor system and a series parallel precedence graph with n constituent tasks, an O(np2) algorithm is provided that finds the optimal assignment for the response time optimization problem; it was found that the assignment optimizing the constrained throughput in O(np2log p) time. Special cases of linear, independent, and tree graphs are also considered.
Distributed Computation of the knn Graph for Large High-Dimensional Point Sets
Plaku, Erion; Kavraki, Lydia E.
2009-01-01
High-dimensional problems arising from robot motion planning, biology, data mining, and geographic information systems often require the computation of k nearest neighbor (knn) graphs. The knn graph of a data set is obtained by connecting each point to its k closest points. As the research in the above-mentioned fields progressively addresses problems of unprecedented complexity, the demand for computing knn graphs based on arbitrary distance metrics and large high-dimensional data sets increases, exceeding resources available to a single machine. In this work we efficiently distribute the computation of knn graphs for clusters of processors with message passing. Extensions to our distributed framework include the computation of graphs based on other proximity queries, such as approximate knn or range queries. Our experiments show nearly linear speedup with over one hundred processors and indicate that similar speedup can be obtained with several hundred processors. PMID:19847318
CoNNeCT Baseband Processor Module
NASA Technical Reports Server (NTRS)
Yamamoto, Clifford K; Jedrey, Thomas C.; Gutrich, Daniel G.; Goodpasture, Richard L.
2011-01-01
A document describes the CoNNeCT Baseband Processor Module (BPM) based on an updated processor, memory technology, and field-programmable gate arrays (FPGAs). The BPM was developed from a requirement to provide sufficient computing power and memory storage to conduct experiments for a Software Defined Radio (SDR) to be implemented. The flight SDR uses the AT697 SPARC processor with on-chip data and instruction cache. The non-volatile memory has been increased from a 20-Mbit EEPROM (electrically erasable programmable read only memory) to a 4-Gbit Flash, managed by the RTAX2000 Housekeeper, allowing more programs and FPGA bit-files to be stored. The volatile memory has been increased from a 20-Mbit SRAM (static random access memory) to a 1.25-Gbit SDRAM (synchronous dynamic random access memory), providing additional memory space for more complex operating systems and programs to be executed on the SPARC. All memory is EDAC (error detection and correction) protected, while the SPARC processor implements fault protection via TMR (triple modular redundancy) architecture. Further capability over prior BPM designs includes the addition of a second FPGA to implement features beyond the resources of a single FPGA. Both FPGAs are implemented with Xilinx Virtex-II and are interconnected by a 96-bit bus to facilitate data exchange. Dedicated 1.25- Gbit SDRAMs are wired to each Xilinx FPGA to accommodate high rate data buffering for SDR applications as well as independent SpaceWire interfaces. The RTAX2000 manages scrub and configuration of each Xilinx.
Efficiently modeling neural networks on massively parallel computers
NASA Technical Reports Server (NTRS)
Farber, Robert M.
1993-01-01
Neural networks are a very useful tool for analyzing and modeling complex real world systems. Applying neural network simulations to real world problems generally involves large amounts of data and massive amounts of computation. To efficiently handle the computational requirements of large problems, we have implemented at Los Alamos a highly efficient neural network compiler for serial computers, vector computers, vector parallel computers, and fine grain SIMD computers such as the CM-2 connection machine. This paper describes the mapping used by the compiler to implement feed-forward backpropagation neural networks for a SIMD (Single Instruction Multiple Data) architecture parallel computer. Thinking Machines Corporation has benchmarked our code at 1.3 billion interconnects per second (approximately 3 gigaflops) on a 64,000 processor CM-2 connection machine (Singer 1990). This mapping is applicable to other SIMD computers and can be implemented on MIMD computers such as the CM-5 connection machine. Our mapping has virtually no communications overhead with the exception of the communications required for a global summation across the processors (which has a sub-linear runtime growth on the order of O(log(number of processors)). We can efficiently model very large neural networks which have many neurons and interconnects and our mapping can extend to arbitrarily large networks (within memory limitations) by merging the memory space of separate processors with fast adjacent processor interprocessor communications. This paper will consider the simulation of only feed forward neural network although this method is extendable to recurrent networks.
Development and analysis of the Software Implemented Fault-Tolerance (SIFT) computer
NASA Technical Reports Server (NTRS)
Goldberg, J.; Kautz, W. H.; Melliar-Smith, P. M.; Green, M. W.; Levitt, K. N.; Schwartz, R. L.; Weinstock, C. B.
1984-01-01
SIFT (Software Implemented Fault Tolerance) is an experimental, fault-tolerant computer system designed to meet the extreme reliability requirements for safety-critical functions in advanced aircraft. Errors are masked by performing a majority voting operation over the results of identical computations, and faulty processors are removed from service by reassigning computations to the nonfaulty processors. This scheme has been implemented in a special architecture using a set of standard Bendix BDX930 processors, augmented by a special asynchronous-broadcast communication interface that provides direct, processor to processor communication among all processors. Fault isolation is accomplished in hardware; all other fault-tolerance functions, together with scheduling and synchronization are implemented exclusively by executive system software. The system reliability is predicted by a Markov model. Mathematical consistency of the system software with respect to the reliability model has been partially verified, using recently developed tools for machine-aided proof of program correctness.
A Series of Case Studies of Tinnitus Suppression With Mixed Background Stimuli in a Cochlear Implant
Keiner, A. J.; Walker, Kurt; Deshpande, Aniruddha K.; Witt, Shelley; Killian, Matthijs; Ji, Helena; Patrick, Jim; Dillier, Norbert; van Dijk, Pim; Lai, Wai Kong; Hansen, Marlan R.; Gantz, Bruce
2015-01-01
Purpose Background sounds provided by a wearable sound playback device were mixed with the acoustical input picked up by a cochlear implant speech processor in an attempt to suppress tinnitus. Method First, patients were allowed to listen to several sounds and to select up to 4 sounds that they thought might be effective. These stimuli were programmed to loop continuously in the wearable playback device. Second, subjects were instructed to use 1 background sound each day on the wearable device, and they sequenced the selected background sounds during a 28-day trial. Patients were instructed to go to a website at the end of each day and rate the loudness and annoyance of the tinnitus as well as the acceptability of the background sound. Patients completed the Tinnitus Primary Function Questionnaire (Tyler, Stocking, Secor, & Slattery, 2014) at the beginning of the trial. Results Results indicated that background sounds were very effective at suppressing tinnitus. There was considerable variability in sounds preferred by the subjects. Conclusion The study shows that a background sound mixed with the microphone input can be effective for suppressing tinnitus during daily use of the sound processor in selected cochlear implant users. PMID:26001407
Broadband set-top box using MAP-CA processor
NASA Astrophysics Data System (ADS)
Bush, John E.; Lee, Woobin; Basoglu, Chris
2001-12-01
Advances in broadband access are expected to exert a profound impact in our everyday life. It will be the key to the digital convergence of communication, computer and consumer equipment. A common thread that facilitates this convergence comprises digital media and Internet. To address this market, Equator Technologies, Inc., is developing the Dolphin broadband set-top box reference platform using its MAP-CA Broadband Signal ProcessorT chip. The Dolphin reference platform is a universal media platform for display and presentation of digital contents on end-user entertainment systems. The objective of the Dolphin reference platform is to provide a complete set-top box system based on the MAP-CA processor. It includes all the necessary hardware and software components for the emerging broadcast and the broadband digital media market based on IP protocols. Such reference design requires a broadband Internet access and high-performance digital signal processing. By using the MAP-CA processor, the Dolphin reference platform is completely programmable, allowing various codecs to be implemented in software, such as MPEG-2, MPEG-4, H.263 and proprietary codecs. The software implementation also enables field upgrades to keep pace with evolving technology and industry demands.
Optimizing the inner loop of the gravitational force interaction on modern processors
DOE Office of Scientific and Technical Information (OSTI.GOV)
Warren, Michael S
2010-12-08
We have achieved superior performance on multiple generations of the fastest supercomputers in the world with our hashed oct-tree N-body code (HOT), spanning almost two decades and garnering multiple Gordon Bell Prizes for significant achievement in parallel processing. Execution time for our N-body code is largely influenced by the force calculation in the inner loop. Improvements to the inner loop using SSE3 instructions has enabled the calculation of over 200 million gravitational interactions per second per processor on a 2.6 GHz Opteron, for a computational rate of over 7 Gflops in single precision (700/0 of peak). We obtain optimal performancemore » some processors (including the Cell) by decomposing the reciprocal square root function required for a gravitational interaction into a table lookup, Chebychev polynomial interpolation, and Newton-Raphson iteration, using the algorithm of Karp. By unrolling the loop by a factor of six, and using SPU intrinsics to compute on vectors, we obtain performance of over 16 Gflops on a single Cell SPE. Aggregated over the 8 SPEs on a Cell processor, the overall performance is roughly 130 Gflops. In comparison, the ordinary C version of our inner loop only obtains 1.6 Gflops per SPE with the spuxlc compiler.« less
Geospace simulations using modern accelerator processor technology
NASA Astrophysics Data System (ADS)
Germaschewski, K.; Raeder, J.; Larson, D. J.
2009-12-01
OpenGGCM (Open Geospace General Circulation Model) is a well-established numerical code simulating the Earth's space environment. The most computing intensive part is the MHD (magnetohydrodynamics) solver that models the plasma surrounding Earth and its interaction with Earth's magnetic field and the solar wind flowing in from the sun. Like other global magnetosphere codes, OpenGGCM's realism is currently limited by computational constraints on grid resolution. OpenGGCM has been ported to make use of the added computational powerof modern accelerator based processor architectures, in particular the Cell processor. The Cell architecture is a novel inhomogeneous multicore architecture capable of achieving up to 230 GFLops on a single chip. The University of New Hampshire recently acquired a PowerXCell 8i based computing cluster, and here we will report initial performance results of OpenGGCM. Realizing the high theoretical performance of the Cell processor is a programming challenge, though. We implemented the MHD solver using a multi-level parallelization approach: On the coarsest level, the problem is distributed to processors based upon the usual domain decomposition approach. Then, on each processor, the problem is divided into 3D columns, each of which is handled by the memory limited SPEs (synergistic processing elements) slice by slice. Finally, SIMD instructions are used to fully exploit the SIMD FPUs in each SPE. Memory management needs to be handled explicitly by the code, using DMA to move data from main memory to the per-SPE local store and vice versa. We use a modern technique, automatic code generation, which shields the application programmer from having to deal with all of the implementation details just described, keeping the code much more easily maintainable. Our preliminary results indicate excellent performance, a speed-up of a factor of 30 compared to the unoptimized version.
Optical systolic array processor using residue arithmetic
NASA Technical Reports Server (NTRS)
Jackson, J.; Casasent, D.
1983-01-01
The use of residue arithmetic to increase the accuracy and reduce the dynamic range requirements of optical matrix-vector processors is evaluated. It is determined that matrix-vector operations and iterative algorithms can be performed totally in residue notation. A new parallel residue quantizer circuit is developed which significantly improves the performance of the systolic array feedback processor. Results are presented of a computer simulation of this system used to solve a set of three simultaneous equations.
2017-02-15
Maunz2 Quantum information processors promise fast algorithms for problems inaccessible to classical computers. But since qubits are noisy and error-prone...information processors have been demonstrated experimentally using superconducting circuits1–3, electrons in semiconductors4–6, trapped atoms and...qubit quantum information processor has been realized14, and single- qubit gates have demonstrated randomized benchmarking (RB) infidelities as low as 10
Real-Time Spatio-Temporal Twice Whitening for MIMO Energy Detector
DOE Office of Scientific and Technical Information (OSTI.GOV)
Humble, Travis S; Mitra, Pramita; Barhen, Jacob
2010-01-01
While many techniques exist for local spectrum sensing of a primary user, each represents a computationally demanding task to secondary user receivers. In software-defined radio, computational complexity lengthens the time for a cognitive radio to recognize changes in the transmission environment. This complexity is even more significant for spatially multiplexed receivers, e.g., in SIMO and MIMO, where the spatio-temporal data sets grow in size with the number of antennae. Limits on power and space for the processor hardware further constrain SDR performance. In this report, we discuss improvements in spatio-temporal twice whitening (STTW) for real-time local spectrum sensing by demonstratingmore » a form of STTW well suited for MIMO environments. We implement STTW on the Coherent Logix hx3100 processor, a multicore processor intended for low-power, high-throughput software-defined signal processing. These results demonstrate how coupling the novel capabilities of emerging multicore processors with algorithmic advances can enable real-time, software-defined processing of large spatio-temporal data sets.« less
A Systematic Methodology for Verifying Superscalar Microprocessors
NASA Technical Reports Server (NTRS)
Srivas, Mandayam; Hosabettu, Ravi; Gopalakrishnan, Ganesh
1999-01-01
We present a systematic approach to decompose and incrementally build the proof of correctness of pipelined microprocessors. The central idea is to construct the abstraction function by using completion functions, one per unfinished instruction, each of which specifies the effect (on the observables) of completing the instruction. In addition to avoiding the term size and case explosion problem that limits the pure flushing approach, our method helps localize errors, and also handles stages with interactive loops. The technique is illustrated on pipelined and superscalar pipelined implementations of a subset of the DLX architecture. It has also been applied to a processor with out-of-order execution.
Application of an onboard processor to the OAO C spacecraft
NASA Technical Reports Server (NTRS)
Stewart, W. N.; Hartenstein, R. G.; Trevathan, C.
1972-01-01
The design of a stored program computer for spacecraft use and its application on the fourth Orbiting Astronomical Observatory (OAO) is reported. The computer is a medium scale, parallel machine with a memory capacity of 16384 words of 18 bits each. It possesses a comprehensive instruction repertoire and operates on 45 W of power (including the dc-to-dc converter). The machine operates at a 500-kHz rate and executes an add instruction in 10 microseconds. Its primary functions on OAO C will be auxiliary command storage, spacecraft monitoring and malfunction reporting, data compression and status summary, and possible performance of emergency corrective action for certain anomalous situations.
The computational structural mechanics testbed architecture. Volume 2: The interface
NASA Technical Reports Server (NTRS)
Felippa, Carlos A.
1988-01-01
This is the third set of five volumes which describe the software architecture for the Computational Structural Mechanics Testbed. Derived from NICE, an integrated software system developed at Lockheed Palo Alto Research Laboratory, the architecture is composed of the command language CLAMP, the command language interpreter CLIP, and the data manager GAL. Volumes 1, 2, and 3 (NASA CR's 178384, 178385, and 178386, respectively) describe CLAMP and CLIP and the CLIP-processor interface. Volumes 4 and 5 (NASA CR's 178387 and 178388, respectively) describe GAL and its low-level I/O. CLAMP, an acronym for Command Language for Applied Mechanics Processors, is designed to control the flow of execution of processors written for NICE. Volume 3 describes the CLIP-Processor interface and related topics. It is intended only for processor developers.
Method and apparatus for high speed data acquisition and processing
Ferron, J.R.
1997-02-11
A method and apparatus are disclosed for high speed digital data acquisition. The apparatus includes one or more multiplexers for receiving multiple channels of digital data at a low data rate and asserting a multiplexed data stream at a high data rate, and one or more FIFO memories for receiving data from the multiplexers and asserting the data to a real time processor. Preferably, the invention includes two multiplexers, two FIFO memories, and a 64-bit bus connecting the FIFO memories with the processor. Each multiplexer receives four channels of 14-bit digital data at a rate of up to 5 MHz per channel, and outputs a data stream to one of the FIFO memories at a rate of 20 MHz. The FIFO memories assert output data in parallel to the 64-bit bus, thus transferring 14-bit data values to the processor at a combined rate of 40 MHz. The real time processor is preferably a floating-point processor which processes 32-bit floating-point words. A set of mask bits is prestored in each 32-bit storage location of the processor memory into which a 14-bit data value is to be written. After data transfer from the FIFO memories, mask bits are concatenated with each stored 14-bit data value to define a valid 32-bit floating-point word. Preferably, a user can select any of several modes for starting and stopping direct memory transfers of data from the FIFO memories to memory within the real time processor, by setting the content of a control and status register. 15 figs.
Method and apparatus for high speed data acquisition and processing
Ferron, John R.
1997-01-01
A method and apparatus for high speed digital data acquisition. The apparatus includes one or more multiplexers for receiving multiple channels of digital data at a low data rate and asserting a multiplexed data stream at a high data rate, and one or more FIFO memories for receiving data from the multiplexers and asserting the data to a real time processor. Preferably, the invention includes two multiplexers, two FIFO memories, and a 64-bit bus connecting the FIFO memories with the processor. Each multiplexer receives four channels of 14-bit digital data at a rate of up to 5 MHz per channel, and outputs a data stream to one of the FIFO memories at a rate of 20 MHz. The FIFO memories assert output data in parallel to the 64-bit bus, thus transferring 14-bit data values to the processor at a combined rate of 40 MHz. The real time processor is preferably a floating-point processor which processes 32-bit floating-point words. A set of mask bits is prestored in each 32-bit storage location of the processor memory into which a 14-bit data value is to be written. After data transfer from the FIFO memories, mask bits are concatenated with each stored 14-bit data value to define a valid 32-bit floating-point word. Preferably, a user can select any of several modes for starting and stopping direct memory transfers of data from the FIFO memories to memory within the real time processor, by setting the content of a control and status register.
Fault Tolerance in Critical Information Systems
2001-05-01
that provides an inte- grated editing and analysis environment through the use of the Adobe FrameMaker document processor [1] and the Z/Eves theorem... FrameMaker document processor provid- ing the special character set for Z just as it would any other character set (such as mathe- matical symbols). Zeus...happens to use the LaTeX Z language definition, so Zeus processes the Framemaker spec- ification and outputs the LaTeX translation to Z/Eves for
The science of computing - Parallel computation
NASA Technical Reports Server (NTRS)
Denning, P. J.
1985-01-01
Although parallel computation architectures have been known for computers since the 1920s, it was only in the 1970s that microelectronic components technologies advanced to the point where it became feasible to incorporate multiple processors in one machine. Concommitantly, the development of algorithms for parallel processing also lagged due to hardware limitations. The speed of computing with solid-state chips is limited by gate switching delays. The physical limit implies that a 1 Gflop operational speed is the maximum for sequential processors. A computer recently introduced features a 'hypercube' architecture with 128 processors connected in networks at 5, 6 or 7 points per grid, depending on the design choice. Its computing speed rivals that of supercomputers, but at a fraction of the cost. The added speed with less hardware is due to parallel processing, which utilizes algorithms representing different parts of an equation that can be broken into simpler statements and processed simultaneously. Present, highly developed computer languages like FORTRAN, PASCAL, COBOL, etc., rely on sequential instructions. Thus, increased emphasis will now be directed at parallel processing algorithms to exploit the new architectures.
G-cueing microcontroller (a microprocessor application in simulators)
NASA Technical Reports Server (NTRS)
Horattas, C. G.
1980-01-01
A g cueing microcontroller is described which consists of a tandem pair of microprocessors, dedicated to the task of simulating pilot sensed cues caused by gravity effects. This task includes execution of a g cueing model which drives actuators that alter the configuration of the pilot's seat. The g cueing microcontroller receives acceleration commands from the aerodynamics model in the main computer and creates the stimuli that produce physical acceleration effects of the aircraft seat on the pilots anatomy. One of the two microprocessors is a fixed instruction processor that performs all control and interface functions. The other, a specially designed bipolar bit slice microprocessor, is a microprogrammable processor dedicated to all arithmetic operations. The two processors communicate with each other by a shared memory. The g cueing microcontroller contains its own dedicated I/O conversion modules for interface with the seat actuators and controls, and a DMA controller for interfacing with the simulation computer. Any application which can be microcoded within the available memory, the available real time and the available I/O channels, could be implemented in the same controller.
NASA Astrophysics Data System (ADS)
Fellman, Ronald D.; Kaneshiro, Ronald T.; Konstantinides, Konstantinos
1990-03-01
The authors present the design and evaluation of an architecture for a monolithic, programmable, floating-point digital signal processor (DSP) for instrumentation applications. An investigation of the most commonly used algorithms in instrumentation led to a design that satisfies the requirements for high computational and I/O (input/output) throughput. In the arithmetic unit, a 16- x 16-bit multiplier and a 32-bit accumulator provide the capability for single-cycle multiply/accumulate operations, and three format adjusters automatically adjust the data format for increased accuracy and dynamic range. An on-chip I/O unit is capable of handling data block transfers through a direct memory access port and real-time data streams through a pair of parallel I/O ports. I/O operations and program execution are performed in parallel. In addition, the processor includes two data memories with independent addressing units, a microsequencer with instruction RAM, and multiplexers for internal data redirection. The authors also present the structure and implementation of a design environment suitable for the algorithmic, behavioral, and timing simulation of a complete DSP system. Various benchmarking results are reported.
A Testbed Processor for Embedded Multicomputing
1990-04-01
Gajski 85]. These two problems of parallel expression and performance impact the real-time response of a vehicle system and, consequently, what models...and memory access. The following discussion of these problems is primarily from Gajski and Peir [ Gajski 85]. Multi-computers are Multiple Instruction...International Symposium on Unmanned Untethered Submersible Technology, University of New Hampshire, Durham, NH, June 22-24 1987, pp. 33-43. [ Gajski 85
A digital retina-like low-level vision processor.
Mertoguno, S; Bourbakis, N G
2003-01-01
This correspondence presents the basic design and the simulation of a low level multilayer vision processor that emulates to some degree the functional behavior of a human retina. This retina-like multilayer processor is the lower part of an autonomous self-organized vision system, called Kydon, that could be used on visually impaired people with a damaged visual cerebral cortex. The Kydon vision system, however, is not presented in this paper. The retina-like processor consists of four major layers, where each of them is an array processor based on hexagonal, autonomous processing elements that perform a certain set of low level vision tasks, such as smoothing and light adaptation, edge detection, segmentation, line recognition and region-graph generation. At each layer, the array processor is a 2D array of k/spl times/m hexagonal identical autonomous cells that simultaneously execute certain low level vision tasks. Thus, the hardware design and the simulation at the transistor level of the processing elements (PEs) of the retina-like processor and its simulated functionality with illustrative examples are provided in this paper.
Support for Diagnosis of Custom Computer Hardware
NASA Technical Reports Server (NTRS)
Molock, Dwaine S.
2008-01-01
The Coldfire SDN Diagnostics software is a flexible means of exercising, testing, and debugging custom computer hardware. The software is a set of routines that, collectively, serve as a common software interface through which one can gain access to various parts of the hardware under test and/or cause the hardware to perform various functions. The routines can be used to construct tests to exercise, and verify the operation of, various processors and hardware interfaces. More specifically, the software can be used to gain access to memory, to execute timer delays, to configure interrupts, and configure processor cache, floating-point, and direct-memory-access units. The software is designed to be used on diverse NASA projects, and can be customized for use with different processors and interfaces. The routines are supported, regardless of the architecture of a processor that one seeks to diagnose. The present version of the software is configured for Coldfire processors on the Subsystem Data Node processor boards of the Solar Dynamics Observatory. There is also support for the software with respect to Mongoose V, RAD750, and PPC405 processors or their equivalents.
System and method for bearing fault detection using stator current noise cancellation
Zhou, Wei; Lu, Bin; Habetler, Thomas G.; Harley, Ronald G.; Theisen, Peter J.
2010-08-17
A system and method for detecting incipient mechanical motor faults by way of current noise cancellation is disclosed. The system includes a controller configured to detect indicia of incipient mechanical motor faults. The controller further includes a processor programmed to receive a baseline set of current data from an operating motor and define a noise component in the baseline set of current data. The processor is also programmed to repeatedly receive real-time operating current data from the operating motor and remove the noise component from the operating current data in real-time to isolate any fault components present in the operating current data. The processor is then programmed to generate a fault index for the operating current data based on any isolated fault components.
A wideband software reconfigurable modem
NASA Astrophysics Data System (ADS)
Turner, J. H., Jr.; Vickers, H.
A wideband modem is described which provides signal processing capability for four Lx-band signals employing QPSK, MSK and PPM waveforms and employs a software reconfigurable architecture for maximum system flexibility and graceful degradation. The current processor uses a 2901 and two 8086 microprocessors per channel and performs acquisition, tracking, and data demodulation for JITDS, GPS, IFF and TACAN systems. The next generation processor will be implemented using a VHSIC chip set employing a programmable complex array vector processor module, a GP computer module, customized gate array modules, and a digital array correlator. This integrated processor has application to a wide number of diverse system waveforms, and will bring the benefits of VHSIC technology insertion into avionic antijam communications systems.
Parallel reduced-instruction-set-computer architecture for real-time symbolic pattern matching
NASA Astrophysics Data System (ADS)
Parson, Dale E.
1991-03-01
This report discusses ongoing work on a parallel reduced-instruction- set-computer (RISC) architecture for automatic production matching. The PRIOPS compiler takes advantage of the memoryless character of automatic processing by translating a program's collection of automatic production tests into an equivalent combinational circuit-a digital circuit without memory, whose outputs are immediate functions of its inputs. The circuit provides a highly parallel, fine-grain model of automatic matching. The compiler then maps the combinational circuit onto RISC hardware. The heart of the processor is an array of comparators capable of testing production conditions in parallel, Each comparator attaches to private memory that contains virtual circuit nodes-records of the current state of nodes and busses in the combinational circuit. All comparator memories hold identical information, allowing simultaneous update for a single changing circuit node and simultaneous retrieval of different circuit nodes by different comparators. Along with the comparator-based logic unit is a sequencer that determines the current combination of production-derived comparisons to try, based on the combined success and failure of previous combinations of comparisons. The memoryless nature of automatic matching allows the compiler to designate invariant memory addresses for virtual circuit nodes, and to generate the most effective sequences of comparison test combinations. The result is maximal utilization of parallel hardware, indicating speed increases and scalability beyond that found for course-grain, multiprocessor approaches to concurrent Rete matching. Future work will consider application of this RISC architecture to the standard (controlled) Rete algorithm, where search through memory dominates portions of matching.
Satellite on-board real-time SAR processor prototype
NASA Astrophysics Data System (ADS)
Bergeron, Alain; Doucet, Michel; Harnisch, Bernd; Suess, Martin; Marchese, Linda; Bourqui, Pascal; Desnoyers, Nicholas; Legros, Mathieu; Guillot, Ludovic; Mercier, Luc; Châteauneuf, François
2017-11-01
A Compact Real-Time Optronic SAR Processor has been successfully developed and tested up to a Technology Readiness Level of 4 (TRL4), the breadboard validation in a laboratory environment. SAR, or Synthetic Aperture Radar, is an active system allowing day and night imaging independent of the cloud coverage of the planet. The SAR raw data is a set of complex data for range and azimuth, which cannot be compressed. Specifically, for planetary missions and unmanned aerial vehicle (UAV) systems with limited communication data rates this is a clear disadvantage. SAR images are typically processed electronically applying dedicated Fourier transformations. This, however, can also be performed optically in real-time. Originally the first SAR images were optically processed. The optical Fourier processor architecture provides inherent parallel computing capabilities allowing real-time SAR data processing and thus the ability for compression and strongly reduced communication bandwidth requirements for the satellite. SAR signal return data are in general complex data. Both amplitude and phase must be combined optically in the SAR processor for each range and azimuth pixel. Amplitude and phase are generated by dedicated spatial light modulators and superimposed by an optical relay set-up. The spatial light modulators display the full complex raw data information over a two-dimensional format, one for the azimuth and one for the range. Since the entire signal history is displayed at once, the processor operates in parallel yielding real-time performances, i.e. without resulting bottleneck. Processing of both azimuth and range information is performed in a single pass. This paper focuses on the onboard capabilities of the compact optical SAR processor prototype that allows in-orbit processing of SAR images. Examples of processed ENVISAT ASAR images are presented. Various SAR processor parameters such as processing capabilities, image quality (point target analysis), weight and size are reviewed.
Case Study of Using High Performance Commercial Processors in Space
NASA Technical Reports Server (NTRS)
Ferguson, Roscoe C.; Olivas, Zulema
2009-01-01
The purpose of the Space Shuttle Cockpit Avionics Upgrade project (1999 2004) was to reduce crew workload and improve situational awareness. The upgrade was to augment the Shuttle avionics system with new hardware and software. A major success of this project was the validation of the hardware architecture and software design. This was significant because the project incorporated new technology and approaches for the development of human rated space software. An early version of this system was tested at the Johnson Space Center for one month by teams of astronauts. The results were positive, but NASA eventually cancelled the project towards the end of the development cycle. The goal to reduce crew workload and improve situational awareness resulted in the need for high performance Central Processing Units (CPUs). The choice of CPU selected was the PowerPC family, which is a reduced instruction set computer (RISC) known for its high performance. However, the requirement for radiation tolerance resulted in the re-evaluation of the selected family member of the PowerPC line. Radiation testing revealed that the original selected processor (PowerPC 7400) was too soft to meet mission objectives and an effort was established to perform trade studies and performance testing to determine a feasible candidate. At that time, the PowerPC RAD750s were radiation tolerant, but did not meet the required performance needs of the project. Thus, the final solution was to select the PowerPC 7455. This processor did not have a radiation tolerant version, but had some ability to detect failures. However, its cache tags did not provide parity and thus the project incorporated a software strategy to detect radiation failures. The strategy was to incorporate dual paths for software generating commands to the legacy Space Shuttle avionics to prevent failures due to the softness of the upgraded avionics.
Case Study of Using High Performance Commercial Processors in a Space Environment
NASA Technical Reports Server (NTRS)
Ferguson, Roscoe C.; Olivas, Zulema
2009-01-01
The purpose of the Space Shuttle Cockpit Avionics Upgrade project was to reduce crew workload and improve situational awareness. The upgrade was to augment the Shuttle avionics system with new hardware and software. A major success of this project was the validation of the hardware architecture and software design. This was significant because the project incorporated new technology and approaches for the development of human rated space software. An early version of this system was tested at the Johnson Space Center for one month by teams of astronauts. The results were positive, but NASA eventually cancelled the project towards the end of the development cycle. The goal to reduce crew workload and improve situational awareness resulted in the need for high performance Central Processing Units (CPUs). The choice of CPU selected was the PowerPC family, which is a reduced instruction set computer (RISC) known for its high performance. However, the requirement for radiation tolerance resulted in the reevaluation of the selected family member of the PowerPC line. Radiation testing revealed that the original selected processor (PowerPC 7400) was too soft to meet mission objectives and an effort was established to perform trade studies and performance testing to determine a feasible candidate. At that time, the PowerPC RAD750s where radiation tolerant, but did not meet the required performance needs of the project. Thus, the final solution was to select the PowerPC 7455. This processor did not have a radiation tolerant version, but faired better than the 7400 in the ability to detect failures. However, its cache tags did not provide parity and thus the project incorporated a software strategy to detect radiation failures. The strategy was to incorporate dual paths for software generating commands to the legacy Space Shuttle avionics to prevent failures due to the softness of the upgraded avionics.
NASA Technical Reports Server (NTRS)
1994-01-01
The objective of this contract was the investigation of the potential performance gains that would result from an upgrade of the Space Station Freedom (SSF) Data Management System (DMS) Embedded Data Processor (EDP) '386' design with the Intel Pentium (registered trade-mark of Intel Corp.) '586' microprocessor. The Pentium ('586') is the latest member of the industry standard Intel X86 family of CISC (Complex Instruction Set Computer) microprocessors. This contract was scheduled to run in parallel with an internal IBM Federal Systems Company (FSC) Internal Research and Development (IR&D) task that had the goal to generate a baseline flight design for an upgraded EDP using the Pentium. This final report summarizes the activities performed in support of Contract NAS2-13758. Our plan was to baseline performance analyses and measurements on the latest state-of-the-art commercially available Pentium processor, representative of the proposed space station design, and then phase to an IBM capital funded breadboard version of the flight design (if available from IR&D and Space Station work) for additional evaluation of results. Unfortunately, the phase-over to the flight design breadboard did not take place, since the IBM Data Management System (DMS) for the Space Station Freedom was terminated by NASA before the referenced capital funded EDP breadboard could be completed. The baseline performance analyses and measurements, however, were successfully completed, as planned, on the commercial Pentium hardware. The results of those analyses, evaluations, and measurements are presented in this final report.
On the Efficacy of Source Code Optimizations for Cache-Based Systems
NASA Technical Reports Server (NTRS)
VanderWijngaart, Rob F.; Saphir, William C.
1998-01-01
Obtaining high performance without machine-specific tuning is an important goal of scientific application programmers. Since most scientific processing is done on commodity microprocessors with hierarchical memory systems, this goal of "portable performance" can be achieved if a common set of optimization principles is effective for all such systems. It is widely believed, or at least hoped, that portable performance can be realized. The rule of thumb for optimization on hierarchical memory systems is to maximize temporal and spatial locality of memory references by reusing data and minimizing memory access stride. We investigate the effects of a number of optimizations on the performance of three related kernels taken from a computational fluid dynamics application. Timing the kernels on a range of processors, we observe an inconsistent and often counterintuitive impact of the optimizations on performance. In particular, code variations that have a positive impact on one architecture can have a negative impact on another, and variations expected to be unimportant can produce large effects. Moreover, we find that cache miss rates - as reported by a cache simulation tool, and confirmed by hardware counters - only partially explain the results. By contrast, the compiler-generated assembly code provides more insight by revealing the importance of processor-specific instructions and of compiler maturity, both of which strongly, and sometimes unexpectedly, influence performance. We conclude that it is difficult to obtain performance portability on modern cache-based computers, and comment on the implications of this result.
On the Efficacy of Source Code Optimizations for Cache-Based Systems
NASA Technical Reports Server (NTRS)
VanderWijngaart, Rob F.; Saphir, William C.; Saini, Subhash (Technical Monitor)
1998-01-01
Obtaining high performance without machine-specific tuning is an important goal of scientific application programmers. Since most scientific processing is done on commodity microprocessors with hierarchical memory systems, this goal of "portable performance" can be achieved if a common set of optimization principles is effective for all such systems. It is widely believed, or at least hoped, that portable performance can be realized. The rule of thumb for optimization on hierarchical memory systems is to maximize temporal and spatial locality of memory references by reusing data and minimizing memory access stride. We investigate the effects of a number of optimizations on the performance of three related kernels taken from a computational fluid dynamics application. Timing the kernels on a range of processors, we observe an inconsistent and often counterintuitive impact of the optimizations on performance. In particular, code variations that have a positive impact on one architecture can have a negative impact on another, and variations expected to be unimportant can produce large effects. Moreover, we find that cache miss rates-as reported by a cache simulation tool, and confirmed by hardware counters-only partially explain the results. By contrast, the compiler-generated assembly code provides more insight by revealing the importance of processor-specific instructions and of compiler maturity, both of which strongly, and sometimes unexpectedly, influence performance. We conclude that it is difficult to obtain performance portability on modern cache-based computers, and comment on the implications of this result.
Interactive Digital Signal Processor
NASA Technical Reports Server (NTRS)
Mish, W. H.
1985-01-01
Interactive Digital Signal Processor, IDSP, consists of set of time series analysis "operators" based on various algorithms commonly used for digital signal analysis. Processing of digital signal time series to extract information usually achieved by applications of number of fairly standard operations. IDSP excellent teaching tool for demonstrating application for time series operators to artificially generated signals.
Nair, Erika L; Sousa, Rhonda; Wannagot, Shannon
Guidelines established by the AAA currently recommend behavioral testing when fitting frequency modulated (FM) systems to individuals with cochlear implants (CIs). A protocol for completing electroacoustic measures has not yet been validated for personal FM systems or digital modulation (DM) systems coupled to CI sound processors. In response, some professionals have used or altered the AAA electroacoustic verification steps for fitting FM systems to hearing aids when fitting FM systems to CI sound processors. More recently steps were outlined in a proposed protocol. The purpose of this research is to review and compare the electroacoustic test measures outlined in a 2013 article by Schafer and colleagues in the Journal of the American Academy of Audiology titled "A Proposed Electroacoustic Test Protocol for Personal FM Receivers Coupled to Cochlear Implant Sound Processors" to the AAA electroacoustic verification steps for fitting FM systems to hearing aids when fitting DM systems to CI users. Electroacoustic measures were conducted on 71 CI sound processors and Phonak Roger DM systems using a proposed protocol and an adapted AAA protocol. Phonak's recommended default receiver gain setting was used for each CI sound processor manufacturer and adjusted if necessary to achieve transparency. Electroacoustic measures were conducted on Cochlear and Advanced Bionics (AB) sound processors. In this study, 28 Cochlear Nucleus 5/CP810 sound processors, 26 Cochlear Nucleus 6/CP910 sound processors, and 17 AB Naida CI Q70 sound processors were coupled in various combinations to Phonak Roger DM dedicated receivers (25 Phonak Roger 14 receivers-Cochlear dedicated receiver-and 9 Phonak Roger 17 receivers-AB dedicated receiver) and 20 Phonak Roger Inspiro transmitters. Employing both the AAA and the Schafer et al protocols, electroacoustic measurements were conducted with the Audioscan Verifit in a clinical setting on 71 CI sound processors and Phonak Roger DM systems to determine transparency and verify FM advantage, comparing speech inputs (65 dB SPL) in an effort to achieve equal outputs. If transparency was not achieved at Phonak's recommended default receiver gain, adjustments were made to the receiver gain. The integrity of the signal was monitored with the appropriate manufacturer's monitor earphones. Using the AAA hearing aid protocol, 50 of the 71 CI sound processors achieved transparency, and 59 of the 71 CI sound processors achieved transparency when using the proposed protocol at Phonak's recommended default receiver gain. After the receiver gain was adjusted, 3 of 21 CI sound processors still did not meet transparency using the AAA protocol, and 2 of 12 CI sound processors still did not meet transparency using the Schafer et al proposed protocol. Both protocols were shown to be effective in taking reliable electroacoustic measurements and demonstrate transparency. Both protocols are felt to be clinically feasible and to address the needs of populations that are unable to reliably report regarding the integrity of their personal DM systems. American Academy of Audiology
Image Matrix Processor for Volumetric Computations Final Report CRADA No. TSB-1148-95
DOE Office of Scientific and Technical Information (OSTI.GOV)
Roberson, G. Patrick; Browne, Jolyon
The development of an Image Matrix Processor (IMP) was proposed that would provide an economical means to perform rapid ray-tracing processes on volume "Giga Voxel" data sets. This was a multi-phased project. The objective of the first phase of the IMP project was to evaluate the practicality of implementing a workstation-based Image Matrix Processor for use in volumetric reconstruction and rendering using hardware simulation techniques. Additionally, ARACOR and LLNL worked together to identify and pursue further funding sources to complete a second phase of this project.
An Efficient Solution Method for Multibody Systems with Loops Using Multiple Processors
NASA Technical Reports Server (NTRS)
Ghosh, Tushar K.; Nguyen, Luong A.; Quiocho, Leslie J.
2015-01-01
This paper describes a multibody dynamics algorithm formulated for parallel implementation on multiprocessor computing platforms using the divide-and-conquer approach. The system of interest is a general topology of rigid and elastic articulated bodies with or without loops. The algorithm divides the multibody system into a number of smaller sets of bodies in chain or tree structures, called "branches" at convenient joints called "connection points", and uses an Order-N (O (N)) approach to formulate the dynamics of each branch in terms of the unknown spatial connection forces. The equations of motion for the branches, leaving the connection forces as unknowns, are implemented in separate processors in parallel for computational efficiency, and the equations for all the unknown connection forces are synthesized and solved in one or several processors. The performances of two implementations of this divide-and-conquer algorithm in multiple processors are compared with an existing method implemented on a single processor.
Effect of processor temperature on film dosimetry
DOE Office of Scientific and Technical Information (OSTI.GOV)
Srivastava, Shiv P.; Das, Indra J., E-mail: idas@iupui.edu
2012-07-01
Optical density (OD) of a radiographic film plays an important role in radiation dosimetry, which depends on various parameters, including beam energy, depth, field size, film batch, dose, dose rate, air film interface, postexposure processing time, and temperature of the processor. Most of these parameters have been studied for Kodak XV and extended dose range (EDR) films used in radiation oncology. There is very limited information on processor temperature, which is investigated in this study. Multiple XV and EDR films were exposed in the reference condition (d{sub max.}, 10 Multiplication-Sign 10 cm{sup 2}, 100 cm) to a given dose. Anmore » automatic film processor (X-Omat 5000) was used for processing films. The temperature of the processor was adjusted manually with increasing temperature. At each temperature, a set of films was processed to evaluate OD at a given dose. For both films, OD is a linear function of processor temperature in the range of 29.4-40.6 Degree-Sign C (85-105 Degree-Sign F) for various dose ranges. The changes in processor temperature are directly related to the dose by a quadratic function. A simple linear equation is provided for the changes in OD vs. processor temperature, which could be used for correcting dose in radiation dosimetry when film is used.« less
Multiple Hypothesis Tracking (MHT) for Space Surveillance: Results and Simulation Studies
2013-09-01
processor. 1 . INTRODUCTION The Joint Space Operations Center (JSpOC) currently tracks more than 22,000 satellites and space debris orbiting the Earth... 1 , 2]. With the anticipated installation of more accurate sensors and the increased probability of future collisions between space objects, the...average 1 hour per response, including the time for reviewing instructions, searching existing data sources, gathering and maintaining the data needed
Numerical aerodynamic simulation facility preliminary study, volume 2 and appendices
NASA Technical Reports Server (NTRS)
1977-01-01
Data to support results obtained in technology assessment studies are presented. Objectives, starting points, and future study tasks are outlined. Key design issues discussed in appendices include: data allocation, transposition network design, fault tolerance and trustworthiness, logic design, processing element of existing components, number of processors, the host system, alternate data base memory designs, number representation, fast div 521 instruction, architectures, and lockstep array versus synchronizable array machine comparison.
Microdot - A Four-Bit Microcontroller Designed for Distributed Low-End Computing in Satellites
NASA Astrophysics Data System (ADS)
2002-03-01
Many satellites are an integrated collection of sensors and actuators that require dedicated real-time control. For single processor systems, additional sensors require an increase in computing power and speed to provide the multi-tasking capability needed to service each sensor. Faster processors cost more and consume more power, which taxes a satellite's power resources and may lead to shorter satellite lifetimes. An alternative design approach is a distributed network of small and low power microcontrollers designed for space that handle the computing requirements of each individual sensor and actuator. The design of microdot, a four-bit microcontroller for distributed low-end computing, is presented. The design is based on previous research completed at the Space Electronics Branch, Air Force Research Laboratory (AFRL/VSSE) at Kirtland AFB, NM, and the Air Force Institute of Technology at Wright-Patterson AFB, OH. The Microdot has 29 instructions and a 1K x 4 instruction memory. The distributed computing architecture is based on the Philips Semiconductor I2C Serial Bus Protocol. A prototype was implemented and tested using an Altera Field Programmable Gate Array (FPGA). The prototype was operable to 9.1 MHz. The design was targeted for fabrication in a radiation-hardened-by-design gate-array cell library for the TSMC 0.35 micrometer CMOS process.
Parallel discrete event simulation: A shared memory approach
NASA Technical Reports Server (NTRS)
Reed, Daniel A.; Malony, Allen D.; Mccredie, Bradley D.
1987-01-01
With traditional event list techniques, evaluating a detailed discrete event simulation model can often require hours or even days of computation time. Parallel simulation mimics the interacting servers and queues of a real system by assigning each simulated entity to a processor. By eliminating the event list and maintaining only sufficient synchronization to insure causality, parallel simulation can potentially provide speedups that are linear in the number of processors. A set of shared memory experiments is presented using the Chandy-Misra distributed simulation algorithm to simulate networks of queues. Parameters include queueing network topology and routing probabilities, number of processors, and assignment of network nodes to processors. These experiments show that Chandy-Misra distributed simulation is a questionable alternative to sequential simulation of most queueing network models.
NASA Technical Reports Server (NTRS)
Saini, Subhash; Hood, Robert T.; Chang, Johnny; Baron, John
2016-01-01
We present a performance evaluation conducted on a production supercomputer of the Intel Xeon Processor E5- 2680v3, a twelve-core implementation of the fourth-generation Haswell architecture, and compare it with Intel Xeon Processor E5-2680v2, an Ivy Bridge implementation of the third-generation Sandy Bridge architecture. Several new architectural features have been incorporated in Haswell including improvements in all levels of the memory hierarchy as well as improvements to vector instructions and power management. We critically evaluate these new features of Haswell and compare with Ivy Bridge using several low-level benchmarks including subset of HPCC, HPCG and four full-scale scientific and engineering applications. We also present a model to predict the performance of HPCG and Cart3D within 5%, and Overflow within 10% accuracy.
Deri, Robert J.; DeGroot, Anthony J.; Haigh, Ronald E.
2002-01-01
As the performance of individual elements within parallel processing systems increases, increased communication capability between distributed processor and memory elements is required. There is great interest in using fiber optics to improve interconnect communication beyond that attainable using electronic technology. Several groups have considered WDM, star-coupled optical interconnects. The invention uses a fiber optic transceiver to provide low latency, high bandwidth channels for such interconnects using a robust multimode fiber technology. Instruction-level simulation is used to quantify the bandwidth, latency, and concurrency required for such interconnects to scale to 256 nodes, each operating at 1 GFLOPS performance. Performance scales have been shown to .apprxeq.100 GFLOPS for scientific application kernels using a small number of wavelengths (8 to 32), only one wavelength received per node, and achievable optoelectronic bandwidth and latency.
NSTAR Ion Thrusters and Power Processors
NASA Technical Reports Server (NTRS)
Bond, T. A.; Christensen, J. A.
1999-01-01
The purpose of the NASA Solar Electric Propulsion Technology Applications Readiness (NSTAR) project is to validate ion propulsion technology for use on future NASA deep space missions. This program, which was initiated in September 1995, focused on the development of two sets of flight quality ion thrusters, power processors, and controllers that provided the same performance as engineering model hardware and also met the dynamic and environmental requirements of the Deep Space 1 Project. One of the flight sets was used for primary propulsion for the Deep Space 1 spacecraft which was launched in October 1998.
Monte Carlo: in the beginning and some great expectations
DOE Office of Scientific and Technical Information (OSTI.GOV)
Metropolis, N.
1985-01-01
The central theme will be on the historical setting and origins of the Monte Carlo Method. The scene was post-war Los Alamos Scientific Laboratory. There was an inevitability about the Monte Carlo Event: the ENIAC had recently enjoyed its meteoric rise (on a classified Los Alamos problem); Stan Ulam had returned to Los Alamos; John von Neumann was a frequent visitor. Techniques, algorithms, and applications developed rapidly at Los Alamos. Soon, the fascination of the Method reached wider horizons. The first paper was submitted for publication in the spring of 1949. In the summer of 1949, the first open conferencemore » was held at the University of California at Los Angeles. Of some interst perhaps is an account of Fermi's earlier, independent application in neutron moderation studies while at the University of Rome. The quantum leap expected with the advent of massively parallel processors will provide stimuli for very ambitious applications of the Monte Carlo Method in disciplines ranging from field theories to cosmology, including more realistic models in the neurosciences. A structure of multi-instruction sets for parallel processing is ideally suited for the Monte Carlo approach. One may even hope for a modest hardening of the soft sciences.« less
Massively parallel algorithms for trace-driven cache simulations
NASA Technical Reports Server (NTRS)
Nicol, David M.; Greenberg, Albert G.; Lubachevsky, Boris D.
1991-01-01
Trace driven cache simulation is central to computer design. A trace is a very long sequence of reference lines from main memory. At the t(exp th) instant, reference x sub t is hashed into a set of cache locations, the contents of which are then compared with x sub t. If at the t sup th instant x sub t is not present in the cache, then it is said to be a miss, and is loaded into the cache set, possibly forcing the replacement of some other memory line, and making x sub t present for the (t+1) sup st instant. The problem of parallel simulation of a subtrace of N references directed to a C line cache set is considered, with the aim of determining which references are misses and related statistics. A simulation method is presented for the Least Recently Used (LRU) policy, which regradless of the set size C runs in time O(log N) using N processors on the exclusive read, exclusive write (EREW) parallel model. A simpler LRU simulation algorithm is given that runs in O(C log N) time using N/log N processors. Timings are presented of the second algorithm's implementation on the MasPar MP-1, a machine with 16384 processors. A broad class of reference based line replacement policies are considered, which includes LRU as well as the Least Frequently Used and Random replacement policies. A simulation method is presented for any such policy that on any trace of length N directed to a C line set runs in the O(C log N) time with high probability using N processors on the EREW model. The algorithms are simple, have very little space overhead, and are well suited for SIMD implementation.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Kelly, Priscilla N.
2016-08-12
The code runs the Game of Life among several processors. Each processor uses CUDA to set up the grid's buffer on the GPU, and that buffer is fed to other GPU languages to apply the rules of the game of life. Only the halo is copied off the buffer and exchanged using MPI. This code looks at the interoperability of GPU languages among current platforms.
Image matrix processor for fast multi-dimensional computations
Roberson, George P.; Skeate, Michael F.
1996-01-01
An apparatus for multi-dimensional computation which comprises a computation engine, including a plurality of processing modules. The processing modules are configured in parallel and compute respective contributions to a computed multi-dimensional image of respective two dimensional data sets. A high-speed, parallel access storage system is provided which stores the multi-dimensional data sets, and a switching circuit routes the data among the processing modules in the computation engine and the storage system. A data acquisition port receives the two dimensional data sets representing projections through an image, for reconstruction algorithms such as encountered in computerized tomography. The processing modules include a programmable local host, by which they may be configured to execute a plurality of different types of multi-dimensional algorithms. The processing modules thus include an image manipulation processor, which includes a source cache, a target cache, a coefficient table, and control software for executing image transformation routines using data in the source cache and the coefficient table and loading resulting data in the target cache. The local host processor operates to load the source cache with a two dimensional data set, loads the coefficient table, and transfers resulting data out of the target cache to the storage system, or to another destination.
Ansari, A H; Cherian, P J; Dereymaeker, A; Matic, V; Jansen, K; De Wispelaere, L; Dielman, C; Vervisch, J; Swarte, R M; Govaert, P; Naulaers, G; De Vos, M; Van Huffel, S
2016-09-01
After identifying the most seizure-relevant characteristics by a previously developed heuristic classifier, a data-driven post-processor using a novel set of features is applied to improve the performance. The main characteristics of the outputs of the heuristic algorithm are extracted by five sets of features including synchronization, evolution, retention, segment, and signal features. Then, a support vector machine and a decision making layer remove the falsely detected segments. Four datasets including 71 neonates (1023h, 3493 seizures) recorded in two different university hospitals, are used to train and test the algorithm without removing the dubious seizures. The heuristic method resulted in a false alarm rate of 3.81 per hour and good detection rate of 88% on the entire test databases. The post-processor, effectively reduces the false alarm rate by 34% while the good detection rate decreases by 2%. This post-processing technique improves the performance of the heuristic algorithm. The structure of this post-processor is generic, improves our understanding of the core visually determined EEG features of neonatal seizures and is applicable for other neonatal seizure detectors. The post-processor significantly decreases the false alarm rate at the expense of a small reduction of the good detection rate. Copyright © 2016 International Federation of Clinical Neurophysiology. Published by Elsevier Ireland Ltd. All rights reserved.
Algorithm architecture co-design for ultra low-power image sensor
NASA Astrophysics Data System (ADS)
Laforest, T.; Dupret, A.; Verdant, A.; Lattard, D.; Villard, P.
2012-03-01
In a context of embedded video surveillance, stand alone leftbehind image sensors are used to detect events with high level of confidence, but also with a very low power consumption. Using a steady camera, motion detection algorithms based on background estimation to find regions in movement are simple to implement and computationally efficient. To reduce power consumption, the background is estimated using a down sampled image formed of macropixels. In order to extend the class of moving objects to be detected, we propose an original mixed mode architecture developed thanks to an algorithm architecture co-design methodology. This programmable architecture is composed of a vector of SIMD processors. A basic RISC architecture was optimized in order to implement motion detection algorithms with a dedicated set of 42 instructions. Definition of delta modulation as a calculation primitive has allowed to implement algorithms in a very compact way. Thereby, a 1920x1080@25fps CMOS image sensor performing integrated motion detection is proposed with a power estimation of 1.8 mW.
Toshiba TDF-500 High Resolution Viewing And Analysis System
NASA Astrophysics Data System (ADS)
Roberts, Barry; Kakegawa, M.; Nishikawa, M.; Oikawa, D.
1988-06-01
A high resolution, operator interactive, medical viewing and analysis system has been developed by Toshiba and Bio-Imaging Research. This system provides many advanced features including high resolution displays, a very large image memory and advanced image processing capability. In particular, the system provides CRT frame buffers capable of update in one frame period, an array processor capable of image processing at operator interactive speeds, and a memory system capable of updating multiple frame buffers at frame rates whilst supporting multiple array processors. The display system provides 1024 x 1536 display resolution at 40Hz frame and 80Hz field rates. In particular, the ability to provide whole or partial update of the screen at the scanning rate is a key feature. This allows multiple viewports or windows in the display buffer with both fixed and cine capability. To support image processing features such as windowing, pan, zoom, minification, filtering, ROI analysis, multiplanar and 3D reconstruction, a high performance CPU is integrated into the system. This CPU is an array processor capable of up to 400 million instructions per second. To support the multiple viewer and array processors' instantaneous high memory bandwidth requirement, an ultra fast memory system is used. This memory system has a bandwidth capability of 400MB/sec and a total capacity of 256MB. This bandwidth is more than adequate to support several high resolution CRT's and also the fast processing unit. This fully integrated approach allows effective real time image processing. The integrated design of viewing system, memory system and array processor are key to the imaging system. It is the intention to describe the architecture of the image system in this paper.
Early MIMD experience on the CRAY X-MP
NASA Astrophysics Data System (ADS)
Rhoades, Clifford E.; Stevens, K. G.
1985-07-01
This paper describes some early experience with converting four physics simulation programs to the CRAY X-MP, a current Multiple Instruction, Multiple Data (MIMD) computer consisting of two processors each with an architecture similar to that of the CRAY-1. As a multi-processor, the CRAY X-MP together with the high speed Solid-state Storage Device (SSD) in an ideal machine upon which to study MIMD algorithms for solving the equations of mathematical physics because it is fast enough to run real problems. The computer programs used in this study are all FORTRAN versions of original production codes. They range in sophistication from a one-dimensional numerical simulation of collisionless plasma to a two-dimensional hydrodynamics code with heat flow to a couple of three-dimensional fluid dynamics codes with varying degrees of viscous modeling. Early research with a dual processor configuration has shown speed-ups ranging from 1.55 to 1.98. It has been observed that a few simple extensions to FORTRAN allow a typical programmer to achieve a remarkable level of efficiency. These extensions involve the concept of memory local to a concurrent subprogram and memory common to all concurrent subprograms.
The AIS-5000 parallel processor
DOE Office of Scientific and Technical Information (OSTI.GOV)
Schmitt, L.A.; Wilson, S.S.
1988-05-01
The AIS-5000 is a commercially available massively parallel processor which has been designed to operate in an industrial environment. It has fine-grained parallelism with up to 1024 processing elements arranged in a single-instruction multiple-data (SIMD) architecture. The processing elements are arranged in a one-dimensional chain that, for computer vision applications, can be as wide as the image itself. This architecture has superior cost/performance characteristics than two-dimensional mesh-connected systems. The design of the processing elements and their interconnections as well as the software used to program the system allow a wide variety of algorithms and applications to be implemented. In thismore » paper, the overall architecture of the system is described. Various components of the system are discussed, including details of the processing elements, data I/O pathways and parallel memory organization. A virtual two-dimensional model for programming image-based algorithms for the system is presented. This model is supported by the AIS-5000 hardware and software and allows the system to be treated as a full-image-size, two-dimensional, mesh-connected parallel processor. Performance bench marks are given for certain simple and complex functions.« less
Real-Time Data Processing in the muon system of the D0 detector.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Neeti Parashar et al.
2001-07-03
This paper presents a real-time application of the 16-bit fixed point Digital Signal Processors (DSPs), in the Muon System of the D0 detector located at the Fermilab Tevatron, presently the world's highest-energy hadron collider. As part of the Upgrade for a run beginning in the year 2000, the system is required to process data at an input event rate of 10 KHz without incurring significant deadtime in readout. The ADSP21csp01 processor has high I/O bandwidth, single cycle instruction execution and fast task switching support to provide efficient multisignal processing. The processor's internal memory consists of 4K words of Program Memorymore » and 4K words of Data Memory. In addition there is an external memory of 32K words for general event buffering and 16K words of Dual port Memory for input data queuing. This DSP fulfills the requirement of the Muon subdetector systems for data readout. All error handling, buffering, formatting and transferring of the data to the various trigger levels of the data acquisition system is done in software. The algorithms developed for the system complete these tasks in about 20 {micro}s per event.« less
Method and apparatus for digitally based high speed x-ray spectrometer
Warburton, W.K.; Hubbard, B.
1997-11-04
A high speed, digitally based, signal processing system which accepts input data from a detector-preamplifier and produces a spectral analysis of the x-rays illuminating the detector. The system achieves high throughputs at low cost by dividing the required digital processing steps between a ``hardwired`` processor implemented in combinatorial digital logic, which detects the presence of the x-ray signals in the digitized data stream and extracts filtered estimates of their amplitudes, and a programmable digital signal processing computer, which refines the filtered amplitude estimates and bins them to produce the desired spectral analysis. One set of algorithms allow this hybrid system to match the resolution of analog systems while operating at much higher data rates. A second set of algorithms implemented in the processor allow the system to be self calibrating as well. The same processor also handles the interface to an external control computer. 19 figs.
Method and apparatus for digitally based high speed x-ray spectrometer
Warburton, William K.; Hubbard, Bradley
1997-01-01
A high speed, digitally based, signal processing system which accepts input data from a detector-preamplifier and produces a spectral analysis of the x-rays illuminating the detector. The system achieves high throughputs at low cost by dividing the required digital processing steps between a "hardwired" processor implemented in combinatorial digital logic, which detects the presence of the x-ray signals in the digitized data stream and extracts filtered estimates of their amplitudes, and a programmable digital signal processing computer, which refines the filtered amplitude estimates and bins them to produce the desired spectral analysis. One set of algorithms allow this hybrid system to match the resolution of analog systems while operating at much higher data rates. A second set of algorithms implemented in the processor allow the system to be self calibrating as well. The same processor also handles the interface to an external control computer.
MIL-STD-1553B Marconi LSI chip set in a remote terminal application
NASA Astrophysics Data System (ADS)
Dimarino, A.
1982-11-01
Marconi Avionics is utilizing the MIL-STD-1553B LSI Chip Set in the SCADC Air Data Computer application to perform all of the required remote terminal MIL-STD-1553B protocol functions. Basic components of the RTU are the dual redundant chip set, CT3231 Transceivers, 256 x 16 RAM and a Z8002 microprocessor. Basic transfers are to/from the RAM command of the bus controller or Z8002 processor. During transfers from the processor to the RAM, the chip set busy bit is set for a period not exceeding 250 microseconds. When the transfer is complete, the busy bit is released and transfers to the data bus occur on command. The LSI Chip Set word count lines are used to locate each data word in the local memory and 4 mode codes are used in the application: reset remote terminal, transmit status word, transmitter shut-down, and override transmitter shutdown.
Compiler-Assisted Multiple Instruction Rollback Recovery Using a Read Buffer. Ph.D. Thesis
NASA Technical Reports Server (NTRS)
Alewine, Neal Jon
1993-01-01
Multiple instruction rollback (MIR) is a technique to provide rapid recovery from transient processor failures and was implemented in hardware by researchers and slow in mainframe computers. Hardware-based MIR designs eliminate rollback data hazards by providing data redundancy implemented in hardware. Compiler-based MIR designs were also developed which remove rollback data hazards directly with data flow manipulations, thus eliminating the need for most data redundancy hardware. Compiler-assisted techniques to achieve multiple instruction rollback recovery are addressed. It is observed that data some hazards resulting from instruction rollback can be resolved more efficiently by providing hardware redundancy while others are resolved more efficiently with compiler transformations. A compiler-assisted multiple instruction rollback scheme is developed which combines hardware-implemented data redundancy with compiler-driven hazard removal transformations. Experimental performance evaluations were conducted which indicate improved efficiency over previous hardware-based and compiler-based schemes. Various enhancements to the compiler transformations and to the data redundancy hardware developed for the compiler-assisted MIR scheme are described and evaluated. The final topic deals with the application of compiler-assisted MIR techniques to aid in exception repair and branch repair in a speculative execution architecture.
Read buffer optimizations to support compiler-assisted multiple instruction retry
NASA Technical Reports Server (NTRS)
Alewine, N. J.; Fuchs, W. K.; Hwu, W. M.
1993-01-01
Multiple instruction retry is a recovery mechanism for transient processor faults. We previously developed a compiler-assisted approach to multiple instruction ferry in which a read buffer of size 2N (where N represents the maximum instruction rollback distance) was used to resolve some data hazards while the compiler resolved the remaining hazards. The compiler-assisted scheme was shown to reduce the performance overhead and/or hardware complexity normally associated with hardware-only retry schemes. This paper examines the size and design of the read buffer. We establish a practical lower bound and average size requirement for the read buffer by modifying the scheme to save only the data required for rollback. The study measures the effect on the performance of a DECstation 3100 running ten application programs using six read buffer configurations with varying read buffer sizes. Two alternative configurations are shown to be the most efficient and differed depending on whether split-cycle-saves are assumed. Up to a 55 percent read buffer size reduction is achievable with an average reduction of 39 percent given the most efficient read buffer configuration and a variety of applications.
Implementation of 4-way Superscalar Hash MIPS Processor Using FPGA
NASA Astrophysics Data System (ADS)
Sahib Omran, Safaa; Fouad Jumma, Laith
2018-05-01
Due to the quick advancements in the personal communications systems and wireless communications, giving data security has turned into a more essential subject. This security idea turns into a more confounded subject when next-generation system requirements and constant calculation speed are considered in real-time. Hash functions are among the most essential cryptographic primitives and utilized as a part of the many fields of signature authentication and communication integrity. These functions are utilized to acquire a settled size unique fingerprint or hash value of an arbitrary length of message. In this paper, Secure Hash Algorithms (SHA) of types SHA-1, SHA-2 (SHA-224, SHA-256) and SHA-3 (BLAKE) are implemented on Field-Programmable Gate Array (FPGA) in a processor structure. The design is described and implemented using a hardware description language, namely VHSIC “Very High Speed Integrated Circuit” Hardware Description Language (VHDL). Since the logical operation of the hash types of (SHA-1, SHA-224, SHA-256 and SHA-3) are 32-bits, so a Superscalar Hash Microprocessor without Interlocked Pipelines (MIPS) processor are designed with only few instructions that were required in invoking the desired Hash algorithms, when the four types of hash algorithms executed sequentially using the designed processor, the total time required equal to approximately 342 us, with a throughput of 4.8 Mbps while the required to execute the same four hash algorithms using the designed four-way superscalar is reduced to 237 us with improved the throughput to 5.1 Mbps.
Multicore Challenges and Benefits for High Performance Scientific Computing
Nielsen, Ida M. B.; Janssen, Curtis L.
2008-01-01
Until recently, performance gains in processors were achieved largely by improvements in clock speeds and instruction level parallelism. Thus, applications could obtain performance increases with relatively minor changes by upgrading to the latest generation of computing hardware. Currently, however, processor performance improvements are realized by using multicore technology and hardware support for multiple threads within each core, and taking full advantage of this technology to improve the performance of applications requires exposure of extreme levels of software parallelism. We will here discuss the architecture of parallel computers constructed from many multicore chips as well as techniques for managing the complexitymore » of programming such computers, including the hybrid message-passing/multi-threading programming model. We will illustrate these ideas with a hybrid distributed memory matrix multiply and a quantum chemistry algorithm for energy computation using Møller–Plesset perturbation theory.« less
Handling of huge multispectral image data volumes from a spectral hole burning device (SHBD)
NASA Astrophysics Data System (ADS)
Graff, Werner; Rosselet, Armel C.; Wild, Urs P.; Gschwind, Rudolf; Keller, Christoph U.
1995-06-01
We use chlorin-doped polymer films at low temperatures as the primary imaging detector. Based on the principles of persistent spectral hole burning, this system is capable of storing spatial and spectral information simultaneously in one exposure with extremely high resolution. The sun as an extended light source has been imaged onto the film. The information recorded amounts to tens of GBytes. This data volume is read out by scanning the frequency of a tunable dye laser and reading the images with a digital CCD camera. For acquisition, archival, processing, and visualization, we use MUSIC (MUlti processor System with Intelligent Communication), a single instruction multiple data parallel processor system equipped with the necessary I/O facilities. The huge amount of data requires the developemnt of sophisticated algorithms to efficiently calibrate the data and to extract useful and new information for solar physics.
An architecture for real-time vision processing
NASA Technical Reports Server (NTRS)
Chien, Chiun-Hong
1994-01-01
To study the feasibility of developing an architecture for real time vision processing, a task queue server and parallel algorithms for two vision operations were designed and implemented on an i860-based Mercury Computing System 860VS array processor. The proposed architecture treats each vision function as a task or set of tasks which may be recursively divided into subtasks and processed by multiple processors coordinated by a task queue server accessible by all processors. Each idle processor subsequently fetches a task and associated data from the task queue server for processing and posts the result to shared memory for later use. Load balancing can be carried out within the processing system without the requirement for a centralized controller. The author concludes that real time vision processing cannot be achieved without both sequential and parallel vision algorithms and a good parallel vision architecture.
Göritz, Anja S; Birnbaum, Michael H
2005-11-01
The customizable PHP script Generic HTML Form Processor is intended to assist researchers and students in quickly setting up surveys and experiments that can be administered via the Web. This script relieves researchers from the burdens of writing new CGI scripts and building databases for each Web study. Generic HTML Form Processor processes any syntactically correct HTML forminput and saves it into a dynamically created open-source database. We describe five modes for usage of the script that allow increasing functionality but require increasing levels of knowledge of PHP and Web servers: The first two modes require no previous knowledge, and the fifth requires PHP programming expertise. Use of Generic HTML Form Processor is free for academic purposes, and its Web address is www.goeritz.net/brmic.
2006-12-01
Data transfer unit ( DTU ) • Remote data concentrator (RDC) • Main processor unit (MPU) • 2 junction boxes (JB1/JB2) • 20 drive train and...NETWORKS TO PREDICT UH-60L ELECTRICAL GENERATOR CONDITION USING (IMD-HUMS) DATA by Evangelos Tourvalis December 2006 Thesis Advisor...including the time for reviewing instruction, searching existing data sources, gathering and maintaining the data needed, and completing and reviewing the
Concurrent Probabilistic Simulation of High Temperature Composite Structural Response
NASA Technical Reports Server (NTRS)
Abdi, Frank
1996-01-01
A computational structural/material analysis and design tool which would meet industry's future demand for expedience and reduced cost is presented. This unique software 'GENOA' is dedicated to parallel and high speed analysis to perform probabilistic evaluation of high temperature composite response of aerospace systems. The development is based on detailed integration and modification of diverse fields of specialized analysis techniques and mathematical models to combine their latest innovative capabilities into a commercially viable software package. The technique is specifically designed to exploit the availability of processors to perform computationally intense probabilistic analysis assessing uncertainties in structural reliability analysis and composite micromechanics. The primary objectives which were achieved in performing the development were: (1) Utilization of the power of parallel processing and static/dynamic load balancing optimization to make the complex simulation of structure, material and processing of high temperature composite affordable; (2) Computational integration and synchronization of probabilistic mathematics, structural/material mechanics and parallel computing; (3) Implementation of an innovative multi-level domain decomposition technique to identify the inherent parallelism, and increasing convergence rates through high- and low-level processor assignment; (4) Creating the framework for Portable Paralleled architecture for the machine independent Multi Instruction Multi Data, (MIMD), Single Instruction Multi Data (SIMD), hybrid and distributed workstation type of computers; and (5) Market evaluation. The results of Phase-2 effort provides a good basis for continuation and warrants Phase-3 government, and industry partnership.
2001-06-19
Queue Get Put The MutexQ module provides primitive queue operations which synchronize access to the queues and ensure queue structure integrity...interface provides for synchronous data rates ranging from 64 Kbps to 1.536 Mbps, while an RS-232 interface accommodates asynchronous data up to...interface VME Communications processor 57 and 8-channel serial I/O board. This board set provides a 68040 processor and 8-channels of synchronous
2004-03-01
turned off. SLEEP Set the timer for 30 seconds before scheduled transmit time, then sleep the processor. WAKE When timer trips, power up the processor...slots where none of its neighbors are schedule to transmit. This allows the sensor nodes to perform a simple power man- agement scheme that puts the...routing This simple case study highlights the following crucial observation: optimal traffic scheduling in energy constrained networks requires future
NASA Astrophysics Data System (ADS)
Arestova, M. L.; Bykovskii, A. Yu
1995-10-01
An architecture is proposed for a specialised optoelectronic multivalued logic processor based on the Allen—Givone algebra. The processor is intended for multiparametric processing of data arriving from a large number of sensors or for tackling spectral analysis tasks. The processor architecture makes it possible to obtain an approximate general estimate of the state of an object being diagnosed on a p-level scale. Optoelectronic systems are proposed for MAXIMUM, MINIMUM, and LITERAL logic gates, based on optical-frequency encoding of logic levels. Corresponding logic gates form a complete set of logic functions in the Allen—Givone algebra.
Page, Andrew J.; Keane, Thomas M.; Naughton, Thomas J.
2010-01-01
We present a multi-heuristic evolutionary task allocation algorithm to dynamically map tasks to processors in a heterogeneous distributed system. It utilizes a genetic algorithm, combined with eight common heuristics, in an effort to minimize the total execution time. It operates on batches of unmapped tasks and can preemptively remap tasks to processors. The algorithm has been implemented on a Java distributed system and evaluated with a set of six problems from the areas of bioinformatics, biomedical engineering, computer science and cryptography. Experiments using up to 150 heterogeneous processors show that the algorithm achieves better efficiency than other state-of-the-art heuristic algorithms. PMID:20862190
Parallel discrete event simulation using shared memory
NASA Technical Reports Server (NTRS)
Reed, Daniel A.; Malony, Allen D.; Mccredie, Bradley D.
1988-01-01
With traditional event-list techniques, evaluating a detailed discrete-event simulation-model can often require hours or even days of computation time. By eliminating the event list and maintaining only sufficient synchronization to ensure causality, parallel simulation can potentially provide speedups that are linear in the numbers of processors. A set of shared-memory experiments, using the Chandy-Misra distributed-simulation algorithm, to simulate networks of queues is presented. Parameters of the study include queueing network topology and routing probabilities, number of processors, and assignment of network nodes to processors. These experiments show that Chandy-Misra distributed simulation is a questionable alternative to sequential-simulation of most queueing network models.
Image matrix processor for fast multi-dimensional computations
Roberson, G.P.; Skeate, M.F.
1996-10-15
An apparatus for multi-dimensional computation is disclosed which comprises a computation engine, including a plurality of processing modules. The processing modules are configured in parallel and compute respective contributions to a computed multi-dimensional image of respective two dimensional data sets. A high-speed, parallel access storage system is provided which stores the multi-dimensional data sets, and a switching circuit routes the data among the processing modules in the computation engine and the storage system. A data acquisition port receives the two dimensional data sets representing projections through an image, for reconstruction algorithms such as encountered in computerized tomography. The processing modules include a programmable local host, by which they may be configured to execute a plurality of different types of multi-dimensional algorithms. The processing modules thus include an image manipulation processor, which includes a source cache, a target cache, a coefficient table, and control software for executing image transformation routines using data in the source cache and the coefficient table and loading resulting data in the target cache. The local host processor operates to load the source cache with a two dimensional data set, loads the coefficient table, and transfers resulting data out of the target cache to the storage system, or to another destination. 10 figs.
Simplified microprocessor design for VLSI control applications
NASA Technical Reports Server (NTRS)
Cameron, K.
1991-01-01
A design technique for microprocessors combining the simplicity of reduced instruction set computers (RISC's) with the richer instruction sets of complex instruction set computers (CISC's) is presented. They utilize the pipelined instruction decode and datapaths common to RISC's. Instruction invariant data processing sequences which transparently support complex addressing modes permit the formulation of simple control circuitry. Compact implementations are possible since neither complicated controllers nor large register sets are required.
Federal Register 2010, 2011, 2012, 2013, 2014
2011-11-14
... Associated Instruction Sets and Software; Institution of Investigation AGENCY: U.S. International Trade... associated instruction sets and software by reason of infringement of certain claims of U.S. Patent No. 6,253... certain computing devices with associated instruction sets and software that infringe one or more of...
System and method for resolving gamma-ray spectra
Gentile, Charles A.; Perry, Jason; Langish, Stephen W.; Silber, Kenneth; Davis, William M.; Mastrovito, Dana
2010-05-04
A system for identifying radionuclide emissions is described. The system includes at least one processor for processing output signals from a radionuclide detecting device, at least one training algorithm run by the at least one processor for analyzing data derived from at least one set of known sample data from the output signals, at least one classification algorithm derived from the training algorithm for classifying unknown sample data, wherein the at least one training algorithm analyzes the at least one sample data set to derive at least one rule used by said classification algorithm for identifying at least one radionuclide emission detected by the detecting device.
1982-07-01
blocks. DISCUS has no form of hardware synchronisation between the processors. The only synchronisation is at an operating system level. ;ach processor is... operations in global store so that semaphoring on global objects can be done correctly. Write Protect is used by the operating system for read-only...the appropriate operating system program. String Handling primitives . The Z8000 has a rich set of string primitives . However as we saw before if a
Power Aware Distributed Systems
2004-01-01
detection or threshold functions to trigger the main CPU. The main processor can sleep and either wakeup on a schedule or by a positive threshold event...the RTOS must determine if wake-up latency can be tolerated (or, if it could be hidden by pre- wakeup ). The prediction accuracy for scheduling ...and processor shutdown/ wakeup . This analysis can be used to accurately analyze the schedulability of non-concrete periodic task sets, scheduled using
Efficient NC Algorithms for Set Cover Applications to Learning and Geometry.
1989-05-01
required for one invocation of [L2]. The number of processors is O( ZeEE Jej 2 +n 2) = O(cneE_,E jeI+n 2). Assuming the n 2 term is insignificant...get a good P. 0 The number of processors for our deterministic selection procedure will be O( ZeEE , e 2+ IV12), which is at most n times the input size
NASA Technical Reports Server (NTRS)
Fatoohi, Rod; Saini, Subbash; Ciotti, Robert
2006-01-01
We study the performance of inter-process communication on four high-speed multiprocessor systems using a set of communication benchmarks. The goal is to identify certain limiting factors and bottlenecks with the interconnect of these systems as well as to compare these interconnects. We measured network bandwidth using different number of communicating processors and communication patterns, such as point-to-point communication, collective communication, and dense communication patterns. The four platforms are: a 512-processor SGI Altix 3700 BX2 shared-memory machine with 3.2 GB/s links; a 64-processor (single-streaming) Cray XI shared-memory machine with 32 1.6 GB/s links; a 128-processor Cray Opteron cluster using a Myrinet network; and a 1280-node Dell PowerEdge cluster with an InfiniBand network. Our, results show the impact of the network bandwidth and topology on the overall performance of each interconnect.
Mapping of MPEG-4 decoding on a flexible architecture platform
NASA Astrophysics Data System (ADS)
van der Tol, Erik B.; Jaspers, Egbert G.
2001-12-01
In the field of consumer electronics, the advent of new features such as Internet, games, video conferencing, and mobile communication has triggered the convergence of television and computers technologies. This requires a generic media-processing platform that enables simultaneous execution of very diverse tasks such as high-throughput stream-oriented data processing and highly data-dependent irregular processing with complex control flows. As a representative application, this paper presents the mapping of a Main Visual profile MPEG-4 for High-Definition (HD) video onto a flexible architecture platform. A stepwise approach is taken, going from the decoder application toward an implementation proposal. First, the application is decomposed into separate tasks with self-contained functionality, clear interfaces, and distinct characteristics. Next, a hardware-software partitioning is derived by analyzing the characteristics of each task such as the amount of inherent parallelism, the throughput requirements, the complexity of control processing, and the reuse potential over different applications and different systems. Finally, a feasible implementation is proposed that includes amongst others a very-long-instruction-word (VLIW) media processor, one or more RISC processors, and some dedicated processors. The mapping study of the MPEG-4 decoder proves the flexibility and extensibility of the media-processing platform. This platform enables an effective HW/SW co-design yielding a high performance density.
Multiple Microcomputer Control Algorithm.
1979-09-01
discrete and semaphore supervisor calls can be used with tasks in separate processors, in which case they are maintained in shared memory. Operations on ...the source or destination operand specifier of each mode in most cases . However, four of the 16 general register addressing modes and one of the 8 pro...instruction time is based on the specified usage factors and the best cast, and worst case execution times for the instruc- 1I 5 1NAVTRAEQZJ1PCrN M’.V7~j
Function algorithms for MPP scientific subroutines, volume 1
NASA Technical Reports Server (NTRS)
Gouch, J. G.
1984-01-01
Design documentation and user documentation for function algorithms for the Massively Parallel Processor (MPP) are presented. The contract specifies development of MPP assembler instructions to perform the following functions: natural logarithm; exponential (e to the x power); square root; sine; cosine; and arctangent. To fulfill the requirements of the contract, parallel array and solar implementations for these functions were developed on the PDP11/34 Program Development and Management Unit (PDMU) that is resident at the MPP testbed installation located at the NASA Goddard facility.
Application of total distributed control system in car-body inspection
NASA Astrophysics Data System (ADS)
Yang, Xueyou; Ren, Dahai; Wang, Zhong; Ye, Shenghua; Lu, Hongbo; Duan, Jilin
1996-08-01
An application of distributed control system in Autocar-body Visual Inspection Station is presented in the paper, a distributed control system using PC as the host processor and single-chip microcomputer as the slave controller is proposed. In this paper, the physical interface of the control network and the relevant hardware are introduced. Meanwhile, a minute research on data communication is performed, relevant protocols on data framing, instruction codes and channel access methods have been laid down and part of related software is presented.
Distributed control system in a car-body inspection station
NASA Astrophysics Data System (ADS)
Yang, Xueyou; Ren, Dahai; Ye, Shenghua; Lu, Hongbo; Duan, Jilin
1997-06-01
In this paper, a distributed control network in autocar-body visual inspection station is presented in which PC is used as the host processor and single-chip microcomputers are employed as slave controllers. The physical interface of the control network and the relevant hardware are introduced in this paper. Meanwhile, a minute research on data communication is performed, relevant protocols on data framing, instruction codes and channel access methods have been laid down and part of related software is presented.
Measurements of the LHCb software stack on the ARM architecture
NASA Astrophysics Data System (ADS)
Vijay Kartik, S.; Couturier, Ben; Clemencic, Marco; Neufeld, Niko
2014-06-01
The ARM architecture is a power-efficient design that is used in most processors in mobile devices all around the world today since they provide reasonable compute performance per watt. The current LHCb software stack is designed (and thus expected) to build and run on machines with the x86/x86_64 architecture. This paper outlines the process of measuring the performance of the LHCb software stack on the ARM architecture - specifically, the ARMv7 architecture on Cortex-A9 processors from NVIDIA and on full-fledged ARM servers with chipsets from Calxeda - and makes comparisons with the performance on x86_64 architectures on the Intel Xeon L5520/X5650 and AMD Opteron 6272. The paper emphasises the aspects of performance per core with respect to the power drawn by the compute nodes for the given performance - this ensures a fair real-world comparison with much more 'powerful' Intel/AMD processors. The comparisons of these real workloads in the context of LHCb are also complemented with the standard synthetic benchmarks HEPSPEC and Coremark. The pitfalls and solutions for the non-trivial task of porting the source code to build for the ARMv7 instruction set are presented. The specific changes in the build process needed for ARM-specific portions of the software stack are described, to serve as pointers for further attempts taken up by other groups in this direction. Cases where architecture-specific tweaks at the assembler lever (both in ROOT and the LHCb software stack) were needed for a successful compile are detailed - these cases are good indicators of where/how the software stack as well as the build system can be made more portable and multi-arch friendly. The experience gained from the tasks described in this paper are intended to i) assist in making an informed choice about ARM-based server solutions as a feasible low-power alternative to the current compute nodes, and ii) revisit the software design and build system for portability and generic improvements.
Does overgeneral autobiographical memory result from poor memory for task instructions?
Yanes, Paula K; Roberts, John E; Carlos, Erica L
2008-10-01
Considerable previous research has shown that retrieval of overgeneral autobiographical memories (OGM) is elevated among individuals suffering from various emotional disorders and those with a history of trauma. Although previous theories suggest that OGM serves the function of regulating acute negative affect, it is also possible that OGM results from difficulties in keeping the instruction set for the Autobiographical Memory Test (AMT) in working memory, or what has been coined "secondary goal neglect" (Dalgleish, 2004). The present study tested whether OGM is associated with poor memory for the task's instruction set, and whether an instruction set reminder would improve memory specificity over repeated trials. Multilevel modelling data-analytic techniques demonstrated a significant relationship between poor recall of instruction set and probability of retrieving OGMs. Providing an instruction set reminder for the AMT relative to a control task's instruction set improved memory specificity immediately afterward.
ERIC Educational Resources Information Center
Burkholder, Barry L.
1981-01-01
This study conducted to determine the effectiveness of using the Instructional Strategy Diagnostic Profile to revise self-instructional materials that teach abstract concepts examined three sets of materials: the original set, the set with improved consistency rating, and the set with improved consistency and adequacy ratings. Forty-six references…
Prototype Focal-Plane-Array Optoelectronic Image Processor
NASA Technical Reports Server (NTRS)
Fang, Wai-Chi; Shaw, Timothy; Yu, Jeffrey
1995-01-01
Prototype very-large-scale integrated (VLSI) planar array of optoelectronic processing elements combines speed of optical input and output with flexibility of reconfiguration (programmability) of electronic processing medium. Basic concept of processor described in "Optical-Input, Optical-Output Morphological Processor" (NPO-18174). Performs binary operations on binary (black and white) images. Each processing element corresponds to one picture element of image and located at that picture element. Includes input-plane photodetector in form of parasitic phototransistor part of processing circuit. Output of each processing circuit used to modulate one picture element in output-plane liquid-crystal display device. Intended to implement morphological processing algorithms that transform image into set of features suitable for high-level processing; e.g., recognition.
Prototype automated post-MECO ascent I-load Verification Data Table
NASA Technical Reports Server (NTRS)
Lardas, George D.
1990-01-01
A prototype automated processor for quality assurance of Space Shuttle post-Main Engine Cut Off (MECO) ascent initialization parameters (I-loads) is described. The processor incorporates Clips rules adapted from the quality assurance criteria for the post-MECO ascent I-loads. Specifically, the criteria are implemented for nominal and abort targets, as given in the 'I-load Verification Data Table, Part 3, Post-MECO Ascent, Version 2.1, December 1989.' This processor, ivdt, compares a given l-load set with the stated mission design and quality assurance criteria. It determines which I-loads violate the stated criteria, and presents a summary of I-loads that pass or fail the tests.
An efficient and portable SIMD algorithm for charge/current deposition in Particle-In-Cell codes
Vincenti, H.; Lobet, M.; Lehe, R.; ...
2016-09-19
In current computer architectures, data movement (from die to network) is by far the most energy consuming part of an algorithm (≈20pJ/word on-die to ≈10,000 pJ/word on the network). To increase memory locality at the hardware level and reduce energy consumption related to data movement, future exascale computers tend to use many-core processors on each compute nodes that will have a reduced clock speed to allow for efficient cooling. To compensate for frequency decrease, machine vendors are making use of long SIMD instruction registers that are able to process multiple data with one arithmetic operator in one clock cycle. SIMD registermore » length is expected to double every four years. As a consequence, Particle-In-Cell (PIC) codes will have to achieve good vectorization to fully take advantage of these upcoming architectures. In this paper, we present a new algorithm that allows for efficient and portable SIMD vectorization of current/charge deposition routines that are, along with the field gathering routines, among the most time consuming parts of the PIC algorithm. Our new algorithm uses a particular data structure that takes into account memory alignment constraints and avoids gather/scat;ter instructions that can significantly affect vectorization performances on current CPUs. The new algorithm was successfully implemented in the 3D skeleton PIC code PICSAR and tested on Haswell Xeon processors (AVX2-256 bits wide data registers). Results show a factor of ×2 to ×2.5 speed-up in double precision for particle shape factor of orders 1–3. The new algorithm can be applied as is on future KNL (Knights Landing) architectures that will include AVX-512 instruction sets with 512 bits register lengths (8 doubles/16 singles). Program summary Program Title: vec_deposition Program Files doi:http://dx.doi.org/10.17632/nh77fv9k8c.1 Licensing provisions: BSD 3-Clause Programming language: Fortran 90 External routines/libraries: OpenMP > 4.0 Nature of problem: Exascale architectures will have many-core processors per node with long vector data registers capable of performing one single instruction on multiple data during one clock cycle. Data register lengths are expected to double every four years and this pushes for new portable solutions for efficiently vectorizing Particle-In-Cell codes on these future many-core architectures. One of the main hotspot routines of the PIC algorithm is the current/charge deposition for which there is no efficient and portable vector algorithm. Solution method: Here we provide an efficient and portable vector algorithm of current/charge deposition routines that uses a new data structure, which significantly reduces gather/scatter operations. Vectorization is controlled using OpenMP 4.0 compiler directives for vectorization which ensures portability across different architectures. Restrictions: Here we do not provide the full PIC algorithm with an executable but only vector routines for current/charge deposition. These scalar/vector routines can be used as library routines in your 3D Particle-In-Cell code. However, to get the best performances out of vector routines you have to satisfy the two following requirements: (1) Your code should implement particle tiling (as explained in the manuscript) to allow for maximized cache reuse and reduce memory accesses that can hinder vector performances. The routines can be used directly on each particle tile. (2) You should compile your code with a Fortran 90 compiler (e.g Intel, gnu or cray) and provide proper alignment flags and compiler alignment directives (more details in README file).« less
An efficient and portable SIMD algorithm for charge/current deposition in Particle-In-Cell codes
DOE Office of Scientific and Technical Information (OSTI.GOV)
Vincenti, H.; Lobet, M.; Lehe, R.
In current computer architectures, data movement (from die to network) is by far the most energy consuming part of an algorithm (≈20pJ/word on-die to ≈10,000 pJ/word on the network). To increase memory locality at the hardware level and reduce energy consumption related to data movement, future exascale computers tend to use many-core processors on each compute nodes that will have a reduced clock speed to allow for efficient cooling. To compensate for frequency decrease, machine vendors are making use of long SIMD instruction registers that are able to process multiple data with one arithmetic operator in one clock cycle. SIMD registermore » length is expected to double every four years. As a consequence, Particle-In-Cell (PIC) codes will have to achieve good vectorization to fully take advantage of these upcoming architectures. In this paper, we present a new algorithm that allows for efficient and portable SIMD vectorization of current/charge deposition routines that are, along with the field gathering routines, among the most time consuming parts of the PIC algorithm. Our new algorithm uses a particular data structure that takes into account memory alignment constraints and avoids gather/scat;ter instructions that can significantly affect vectorization performances on current CPUs. The new algorithm was successfully implemented in the 3D skeleton PIC code PICSAR and tested on Haswell Xeon processors (AVX2-256 bits wide data registers). Results show a factor of ×2 to ×2.5 speed-up in double precision for particle shape factor of orders 1–3. The new algorithm can be applied as is on future KNL (Knights Landing) architectures that will include AVX-512 instruction sets with 512 bits register lengths (8 doubles/16 singles). Program summary Program Title: vec_deposition Program Files doi:http://dx.doi.org/10.17632/nh77fv9k8c.1 Licensing provisions: BSD 3-Clause Programming language: Fortran 90 External routines/libraries: OpenMP > 4.0 Nature of problem: Exascale architectures will have many-core processors per node with long vector data registers capable of performing one single instruction on multiple data during one clock cycle. Data register lengths are expected to double every four years and this pushes for new portable solutions for efficiently vectorizing Particle-In-Cell codes on these future many-core architectures. One of the main hotspot routines of the PIC algorithm is the current/charge deposition for which there is no efficient and portable vector algorithm. Solution method: Here we provide an efficient and portable vector algorithm of current/charge deposition routines that uses a new data structure, which significantly reduces gather/scatter operations. Vectorization is controlled using OpenMP 4.0 compiler directives for vectorization which ensures portability across different architectures. Restrictions: Here we do not provide the full PIC algorithm with an executable but only vector routines for current/charge deposition. These scalar/vector routines can be used as library routines in your 3D Particle-In-Cell code. However, to get the best performances out of vector routines you have to satisfy the two following requirements: (1) Your code should implement particle tiling (as explained in the manuscript) to allow for maximized cache reuse and reduce memory accesses that can hinder vector performances. The routines can be used directly on each particle tile. (2) You should compile your code with a Fortran 90 compiler (e.g Intel, gnu or cray) and provide proper alignment flags and compiler alignment directives (more details in README file).« less
Reconfigurable data path processor
NASA Technical Reports Server (NTRS)
Donohoe, Gregory (Inventor)
2005-01-01
A reconfigurable data path processor comprises a plurality of independent processing elements. Each of the processing elements advantageously comprising an identical architecture. Each processing element comprises a plurality of data processing means for generating a potential output. Each processor is also capable of through-putting an input as a potential output with little or no processing. Each processing element comprises a conditional multiplexer having a first conditional multiplexer input, a second conditional multiplexer input and a conditional multiplexer output. A first potential output value is transmitted to the first conditional multiplexer input, and a second potential output value is transmitted to the second conditional multiplexer output. The conditional multiplexer couples either the first conditional multiplexer input or the second conditional multiplexer input to the conditional multiplexer output, according to an output control command. The output control command is generated by processing a set of arithmetic status-bits through a logical mask. The conditional multiplexer output is coupled to a first processing element output. A first set of arithmetic bits are generated according to the processing of the first processable value. A second set of arithmetic bits may be generated from a second processing operation. The selection of the arithmetic status-bits is performed by an arithmetic-status bit multiplexer selects the desired set of arithmetic status bits from among the first and second set of arithmetic status bits. The conditional multiplexer evaluates the select arithmetic status bits according to logical mask defining an algorithm for evaluating the arithmetic status bits.
Ion Thruster Development at NASA Lewis Research Center
NASA Technical Reports Server (NTRS)
Sovey, James S.; Hamley, John A.; Patterson, Michael J.; Rawlin, Vincent K.; Sarver-Verhey, Timothy R.
1992-01-01
Recent ion propulsion technology efforts at NASA's Lewis Research Center including development of kW-class xenon ion thrusters, high power xenon and krypton ion thrusters, and power processors are reviewed. Thruster physical characteristics, performance data, life projections, and power processor component technology are summarized. The ion propulsion technology program is structured to address a broad set of mission applications from satellite stationkeeping and repositioning to primary propulsion using solar or nuclear power systems.
2016-05-07
REPORT DOCUMENTATION PAGE I . ... ... .. . ,...,.., ............. OMB No. 0704-0188 The public reporting burden for this collection of...Student Support for Appl ication of Advanced Multi- Core Processor N00014-12-1-0298 Technologies to Oceanographic Research Sb. GRANT NUMBER Sc...communications protocols (i.e. UART, I2C, and SPI), through the , ’ . handing off of the data to the server APis. By providing a common set of tools
Handling debugger breakpoints in a shared instruction system
Gooding, Thomas Michael; Shok, Richard Michael
2014-01-21
A debugger debugs processes that execute shared instructions so that a breakpoint set for one process will not cause a breakpoint to occur in the other processes. A breakpoint is set by recording the original instruction at the desired location and writing a trap instruction to the shared instructions at that location. When a process encounters the breakpoint, the process passes control to the debugger for breakpoint processing if the breakpoint was set at that location for that process. If the trap was not set at that location for that process, the cacheline containing the trap is copied to a small scratchpad memory, and the virtual memory mappings are changed to translate the virtual address of the cacheline to the scratchpad. The original instruction is then written to replace the trap instruction in the scratchpad, so that process can execute the instructions in the scatchpad thereby avoiding the trap instruction.
Clinical Validation of a Sound Processor Upgrade in Direct Acoustic Cochlear Implant Subjects
Kludt, Eugen; D’hondt, Christiane; Lenarz, Thomas; Maier, Hannes
2017-01-01
Objective: The objectives of the investigation were to evaluate the effect of a sound processor upgrade on the speech reception threshold in noise and to collect long-term safety and efficacy data after 2½ to 5 years of device use of direct acoustic cochlear implant (DACI) recipients. Study Design: The study was designed as a mono-centric, prospective clinical trial. Setting: Tertiary referral center. Patients: Fifteen patients implanted with a direct acoustic cochlear implant. Intervention: Upgrade with a newer generation of sound processor. Main Outcome Measures: Speech recognition test in quiet and in noise, pure tone thresholds, subject-reported outcome measures. Results: The speech recognition in quiet and in noise is superior after the sound processor upgrade and stable after long-term use of the direct acoustic cochlear implant. The bone conduction thresholds did not decrease significantly after long-term high level stimulation. Conclusions: The new sound processor for the DACI system provides significant benefits for DACI users for speech recognition in both quiet and noise. Especially the noise program with the use of directional microphones (Zoom) allows DACI patients to have much less difficulty when having conversations in noisy environments. Furthermore, the study confirms that the benefits of the sound processor upgrade are available to the DACI recipients even after several years of experience with a legacy sound processor. Finally, our study demonstrates that the DACI system is a safe and effective long-term therapy. PMID:28406848
MULTI-CORE AND OPTICAL PROCESSOR RELATED APPLICATIONS RESEARCH AT OAK RIDGE NATIONAL LABORATORY
DOE Office of Scientific and Technical Information (OSTI.GOV)
Barhen, Jacob; Kerekes, Ryan A; ST Charles, Jesse Lee
2008-01-01
High-speed parallelization of common tasks holds great promise as a low-risk approach to achieving the significant increases in signal processing and computational performance required for next generation innovations in reconfigurable radio systems. Researchers at the Oak Ridge National Laboratory have been working on exploiting the parallelization offered by this emerging technology and applying it to a variety of problems. This paper will highlight recent experience with four different parallel processors applied to signal processing tasks that are directly relevant to signal processing required for SDR/CR waveforms. The first is the EnLight Optical Core Processor applied to matched filter (MF) correlationmore » processing via fast Fourier transform (FFT) of broadband Dopplersensitive waveforms (DSW) using active sonar arrays for target tracking. The second is the IBM CELL Broadband Engine applied to 2-D discrete Fourier transform (DFT) kernel for image processing and frequency domain processing. And the third is the NVIDIA graphical processor applied to document feature clustering. EnLight Optical Core Processor. Optical processing is inherently capable of high-parallelism that can be translated to very high performance, low power dissipation computing. The EnLight 256 is a small form factor signal processing chip (5x5 cm2) with a digital optical core that is being developed by an Israeli startup company. As part of its evaluation of foreign technology, ORNL's Center for Engineering Science Advanced Research (CESAR) had access to a precursor EnLight 64 Alpha hardware for a preliminary assessment of capabilities in terms of large Fourier transforms for matched filter banks and on applications related to Doppler-sensitive waveforms. This processor is optimized for array operations, which it performs in fixed-point arithmetic at the rate of 16 TeraOPS at 8-bit precision. This is approximately 1000 times faster than the fastest DSP available today. The optical core performs the matrix-vector multiplications, where the nominal matrix size is 256x256. The system clock is 125MHz. At each clock cycle, 128K multiply-and-add operations per second (OPS) are carried out, which yields a peak performance of 16 TeraOPS. IBM Cell Broadband Engine. The Cell processor is the extraordinary resulting product of 5 years of sustained, intensive R&D collaboration (involving over $400M investment) between IBM, Sony, and Toshiba. Its architecture comprises one multithreaded 64-bit PowerPC processor element (PPE) with VMX capabilities and two levels of globally coherent cache, and 8 synergistic processor elements (SPEs). Each SPE consists of a processor (SPU) designed for streaming workloads, local memory, and a globally coherent direct memory access (DMA) engine. Computations are performed in 128-bit wide single instruction multiple data streams (SIMD). An integrated high-bandwidth element interconnect bus (EIB) connects the nine processors and their ports to external memory and to system I/O. The Applied Software Engineering Research (ASER) Group at the ORNL is applying the Cell to a variety of text and image analysis applications. Research on Cell-equipped PlayStation3 (PS3) consoles has led to the development of a correlation-based image recognition engine that enables a single PS3 to process images at more than 10X the speed of state-of-the-art single-core processors. NVIDIA Graphics Processing Units. The ASER group is also employing the latest NVIDIA graphical processing units (GPUs) to accelerate clustering of thousands of text documents using recently developed clustering algorithms such as document flocking and affinity propagation.« less
Partitioning and packing mathematical simulation models for calculation on parallel computers
NASA Technical Reports Server (NTRS)
Arpasi, D. J.; Milner, E. J.
1986-01-01
The development of multiprocessor simulations from a serial set of ordinary differential equations describing a physical system is described. Degrees of parallelism (i.e., coupling between the equations) and their impact on parallel processing are discussed. The problem of identifying computational parallelism within sets of closely coupled equations that require the exchange of current values of variables is described. A technique is presented for identifying this parallelism and for partitioning the equations for parallel solution on a multiprocessor. An algorithm which packs the equations into a minimum number of processors is also described. The results of the packing algorithm when applied to a turbojet engine model are presented in terms of processor utilization.
General linear codes for fault-tolerant matrix operations on processor arrays
NASA Technical Reports Server (NTRS)
Nair, V. S. S.; Abraham, J. A.
1988-01-01
Various checksum codes have been suggested for fault-tolerant matrix computations on processor arrays. Use of these codes is limited due to potential roundoff and overflow errors. Numerical errors may also be misconstrued as errors due to physical faults in the system. In this a set of linear codes is identified which can be used for fault-tolerant matrix operations such as matrix addition, multiplication, transposition, and LU-decomposition, with minimum numerical error. Encoding schemes are given for some of the example codes which fall under the general set of codes. With the help of experiments, a rule of thumb for the selection of a particular code for a given application is derived.
CMOS Image Sensor with a Built-in Lane Detector.
Hsiao, Pei-Yung; Cheng, Hsien-Chein; Huang, Shih-Shinh; Fu, Li-Chen
2009-01-01
This work develops a new current-mode mixed signal Complementary Metal-Oxide-Semiconductor (CMOS) imager, which can capture images and simultaneously produce vehicle lane maps. The adopted lane detection algorithm, which was modified to be compatible with hardware requirements, can achieve a high recognition rate of up to approximately 96% under various weather conditions. Instead of a Personal Computer (PC) based system or embedded platform system equipped with expensive high performance chip of Reduced Instruction Set Computer (RISC) or Digital Signal Processor (DSP), the proposed imager, without extra Analog to Digital Converter (ADC) circuits to transform signals, is a compact, lower cost key-component chip. It is also an innovative component device that can be integrated into intelligent automotive lane departure systems. The chip size is 2,191.4 × 2,389.8 μm, and the package uses 40 pin Dual-In-Package (DIP). The pixel cell size is 18.45 × 21.8 μm and the core size of photodiode is 12.45 × 9.6 μm; the resulting fill factor is 29.7%.
Yang, Fan; Paindavoine, M
2003-01-01
This paper describes a real time vision system that allows us to localize faces in video sequences and verify their identity. These processes are image processing techniques based on the radial basis function (RBF) neural network approach. The robustness of this system has been evaluated quantitatively on eight video sequences. We have adapted our model for an application of face recognition using the Olivetti Research Laboratory (ORL), Cambridge, UK, database so as to compare the performance against other systems. We also describe three hardware implementations of our model on embedded systems based on the field programmable gate array (FPGA), zero instruction set computer (ZISC) chips, and digital signal processor (DSP) TMS320C62, respectively. We analyze the algorithm complexity and present results of hardware implementations in terms of the resources used and processing speed. The success rates of face tracking and identity verification are 92% (FPGA), 85% (ZISC), and 98.2% (DSP), respectively. For the three embedded systems, the processing speeds for images size of 288 /spl times/ 352 are 14 images/s, 25 images/s, and 4.8 images/s, respectively.
A Low Power SOC Architecture for the V2.0+EDR Bluetooth Using a Unified Verification Platform
NASA Astrophysics Data System (ADS)
Kim, Jeonghun; Kim, Suki; Baek, Kwang-Hyun
This paper presents a low-power System on Chip (SOC) architecture for the v2.0+EDR (Enhanced Data Rate) Bluetooth and its applications. Our design includes a link controller, modem, RF transceiver, Sub-Band Codec (SBC), Expanded Instruction Set Computer (ESIC) processor, and peripherals. To decrease power consumption of the proposed SOC, we reduce data transfer using a dual-port memory, including a power management unit, and a clock gated approach. We also address some of issues and benefits of reusable and unified environment on a centralized data structure and SOC verification platform. This includes flexibility in meeting the final requirements using technology-independent tools wherever possible in various processes and for projects. The other aims of this work are to minimize design efforts by avoiding the same work done twice by different people and to reuse the similar environment and platform for different projects. This chip occupies a die size of 30mm2 in 0.18µm CMOS, and the worst-case current of the total chip is 54mA.
Yu, Jen-Shiang K; Hwang, Jenn-Kang; Tang, Chuan Yi; Yu, Chin-Hui
2004-01-01
A number of recently released numerical libraries including Automatically Tuned Linear Algebra Subroutines (ATLAS) library, Intel Math Kernel Library (MKL), GOTO numerical library, and AMD Core Math Library (ACML) for AMD Opteron processors, are linked against the executables of the Gaussian 98 electronic structure calculation package, which is compiled by updated versions of Fortran compilers such as Intel Fortran compiler (ifc/efc) 7.1 and PGI Fortran compiler (pgf77/pgf90) 5.0. The ifc 7.1 delivers about 3% of improvement on 32-bit machines compared to the former version 6.0. Performance improved from pgf77 3.3 to 5.0 is also around 3% when utilizing the original unmodified optimization options of the compiler enclosed in the software. Nevertheless, if extensive compiler tuning options are used, the speed can be further accelerated to about 25%. The performances of these fully optimized numerical libraries are similar. The double-precision floating-point (FP) instruction sets (SSE2) are also functional on AMD Opteron processors operated in 32-bit compilation, and Intel Fortran compiler has performed better optimization. Hardware-level tuning is able to improve memory bandwidth by adjusting the DRAM timing, and the efficiency in the CL2 mode is further accelerated by 2.6% compared to that of the CL2.5 mode. The FP throughput is measured by simultaneous execution of two identical copies of each of the test jobs. Resultant performance impact suggests that IA64 and AMD64 architectures are able to fulfill significantly higher throughput than the IA32, which is consistent with the SpecFPrate2000 benchmarks.
A fast, programmable hardware architecture for the processing of spaceborne SAR data
NASA Technical Reports Server (NTRS)
Bennett, J. R.; Cumming, I. G.; Lim, J.; Wedding, R. M.
1984-01-01
The development of high-throughput SAR processors (HTSPs) for the spaceborne SARs being planned by NASA, ESA, DFVLR, NASDA, and the Canadian Radarsat Project is discussed. The basic parameters and data-processing requirements of the SARs are listed in tables, and the principal problems are identified as real-operations rates in excess of 2 x 10 to the 9th/sec, I/O rates in excess of 8 x 10 to the 6th samples/sec, and control computation loads (as for range cell migration correction) as high as 1.4 x 10 to the 6th instructions/sec. A number of possible HTSP architectures are reviewed; host/array-processor (H/AP) and distributed-control/data-path (DCDP) architectures are examined in detail and illustrated with block diagrams; and a cost/speed comparison of these two architectures is presented. The H/AP approach is found to be adequate and economical for speeds below 1/200 of real time, while DCDP is more cost-effective above 1/50 of real time.
1987-12-01
Application Programs Intelligent Disk Database Controller Manangement System Operating System Host .1’ I% Figure 2. Intelligent Disk Controller Application...8217. /- - • Database Control -% Manangement System Disk Data Controller Application Programs Operating Host I"" Figure 5. Processor-Per- Head data. Therefore, the...However. these ad- ditional properties have been proven in classical set and relation theory [75]. These additional properties are described here
Multicore Programming Challenges
NASA Astrophysics Data System (ADS)
Perrone, Michael
The computer industry is facing fundamental challenges that are driving a major change in the design of computer processors. Due to restrictions imposed by quantum physics, one historical path to higher computer processor performance - by increased clock frequency - has come to an end. Increasing clock frequency now leads to power consumption costs that are too high to justify. As a result, we have seen in recent years that the processor frequencies have peaked and are receding from their high point. At the same time, competitive market conditions are giving business advantage to those companies that can field new streaming applications, handle larger data sets, and update their models to market conditions faster. The desire for newer, faster and larger is driving continued demand for higher computer performance.
A Conformance Test Suite for Arden Syntax Compilers and Interpreters.
Wolf, Klaus-Hendrik; Klimek, Mike
2016-01-01
The Arden Syntax for Medical Logic Modules is a standardized and well-established programming language to represent medical knowledge. To test the compliance level of existing compilers and interpreters no public test suite exists. This paper presents the research to transform the specification into a set of unit tests, represented in JUnit. It further reports on the utilization of the test suite testing four different Arden Syntax processors. The presented and compared results reveal the status conformance of the tested processors. How test driven development of Arden Syntax processors can help increasing the compliance with the standard is described with two examples. In the end some considerations how an open source test suite can improve the development and distribution of the Arden Syntax are presented.
Efficacy of Code Optimization on Cache-based Processors
NASA Technical Reports Server (NTRS)
VanderWijngaart, Rob F.; Chancellor, Marisa K. (Technical Monitor)
1997-01-01
The current common wisdom in the U.S. is that the powerful, cost-effective supercomputers of tomorrow will be based on commodity (RISC) micro-processors with cache memories. Already, most distributed systems in the world use such hardware as building blocks. This shift away from vector supercomputers and towards cache-based systems has brought about a change in programming paradigm, even when ignoring issues of parallelism. Vector machines require inner-loop independence and regular, non-pathological memory strides (usually this means: non-power-of-two strides) to allow efficient vectorization of array operations. Cache-based systems require spatial and temporal locality of data, so that data once read from main memory and stored in high-speed cache memory is used optimally before being written back to main memory. This means that the most cache-friendly array operations are those that feature zero or unit stride, so that each unit of data read from main memory (a cache line) contains information for the next iteration in the loop. Moreover, loops ought to be 'fat', meaning that as many operations as possible are performed on cache data-provided instruction caches do not overflow and enough registers are available. If unit stride is not possible, for example because of some data dependency, then care must be taken to avoid pathological strides, just ads on vector computers. For cache-based systems the issues are more complex, due to the effects of associativity and of non-unit block (cache line) size. But there is more to the story. Most modern micro-processors are superscalar, which means that they can issue several (arithmetic) instructions per clock cycle, provided that there are enough independent instructions in the loop body. This is another argument for providing fat loop bodies. With these restrictions, it appears fairly straightforward to produce code that will run efficiently on any cache-based system. It can be argued that although some of the important computational algorithms employed at NASA Ames require different programming styles on vector machines and cache-based machines, respectively, neither architecture class appeared to be favored by particular algorithms in principle. Practice tells us that the situation is more complicated. This report presents observations and some analysis of performance tuning for cache-based systems. We point out several counterintuitive results that serve as a cautionary reminder that memory accesses are not the only factors that determine performance, and that within the class of cache-based systems, significant differences exist.
Vaerenberg, Bart; Govaerts, Paul J; de Ceulaer, Geert; Daemers, Kristin; Schauwers, Karen
2011-01-01
This report describes the application of the software tool "Fitting to Outcomes eXpert" (FOX) in programming the cochlear implant (CI) processor in new users. FOX is an intelligent agent to assist in the programming of CI processors. The concept of FOX is to modify maps on the basis of specific outcome measures, achieved using heuristic logic and based on a set of deterministic "rules". A prospective study was conducted on eight consecutive CI-users with a follow-up of three months. Eight adult subjects with postlingual deafness were implanted with the Advanced Bionics HiRes90k device. The implants were programmed using FOX, running a set of rules known as Eargroup's EG0910 advice, which features a set of "automaps". The protocol employed for the initial 3 months is presented, with description of the map modifications generated by FOX and the corresponding psychoacoustic test results. The 3 month median results show 25 dBHL as PTA, 77% (55 dBSPL) and 71% (70 dBSPL) phoneme score at speech audiometry and loudness scaling in or near to the normal zone at different frequencies. It is concluded that this approach is feasible to start up CI fitting and yields good outcome.
Six years of total ozone column measurements from SCIAMACHY nadir observations
NASA Astrophysics Data System (ADS)
Lerot, C.; van Roozendael, M.; van Geffen, J.; van Gent, J.; Fayt, C.; Spurr, R.; Lichtenberg, G.; von Bargen, A.
2009-04-01
Total O3 columns have been retrieved from six years of SCIAMACHY nadir UV radiance measurements using SDOAS, an adaptation of the GDOAS algorithm previously developed at BIRA-IASB for the GOME instrument. GDOAS and SDOAS have been implemented by the German Aerospace Center (DLR) in the version 4 of the GOME Data Processor (GDP) and in version 3 of the SCIAMACHY Ground Processor (SGP), respectively. The processors are being run at the DLR processing centre on behalf of the European Space Agency (ESA). We first focus on the description of the SDOAS algorithm with particular attention to the impact of uncertainties on the reference O3 absorption cross-sections. Second, the resulting SCIAMACHY total ozone data set is globally evaluated through large-scale comparisons with results from GOME and OMI as well as with ground-based correlative measurements. The various total ozone data sets are found to agree within 2% on average. However, a negative trend of 0.2-0.4%/year has been identified in the SCIAMACHY O3 columns; this probably originates from instrumental degradation effects that have not yet been fully characterized.
Six years of total ozone column measurements from SCIAMACHY nadir observations
NASA Astrophysics Data System (ADS)
Lerot, C.; van Roozendael, M.; van Geffen, J.; van Gent, J.; Fayt, C.; Spurr, R.; Lichtenberg, G.; von Bargen, A.
2008-11-01
Total O3 columns have been retrieved from six years of SCIAMACHY nadir UV radiance measurements using SDOAS, an adaptation of the GDOAS algorithm previously developed at BIRA-IASB for the GOME instrument. GDOAS and SDOAS have been implemented by the German Aerospace Center (DLR) in the version 4 of the GOME Data Processor (GDP) and in version 3 of the SCIAMACHY Ground Processor (SGP), respectively. The processors are being run at the DLR processing centre on behalf of the European Space Agency (ESA). We first focus on the description of the SDOAS algorithm with particular attention to the impact of uncertainties on the reference O3 absorption cross-sections. Second, the resulting SCIAMACHY total ozone data set is globally evaluated through large-scale comparisons with results from GOME and OMI as well as with ground-based correlative measurements. The various total ozone data sets are found to agree within 2% on average. However, a negative trend of 0.2-0.4%/year has been identified in the SCIAMACHY O3 columns; this probably originates from instrumental degradation effects that have not yet been fully characterized.
Transputer parallel processing at NASA Lewis Research Center
NASA Technical Reports Server (NTRS)
Ellis, Graham K.
1989-01-01
The transputer parallel processing lab at NASA Lewis Research Center (LeRC) consists of 69 processors (transputers) that can be connected into various networks for use in general purpose concurrent processing applications. The main goal of the lab is to develop concurrent scientific and engineering application programs that will take advantage of the computational speed increases available on a parallel processor over the traditional sequential processor. Current research involves the development of basic programming tools. These tools will help standardize program interfaces to specific hardware by providing a set of common libraries for applications programmers. The thrust of the current effort is in developing a set of tools for graphics rendering/animation. The applications programmer currently has two options for on-screen plotting. One option can be used for static graphics displays and the other can be used for animated motion. The option for static display involves the use of 2-D graphics primitives that can be called from within an application program. These routines perform the standard 2-D geometric graphics operations in real-coordinate space as well as allowing multiple windows on a single screen.
A Electro-Optical Image Algebra Processing System for Automatic Target Recognition
NASA Astrophysics Data System (ADS)
Coffield, Patrick Cyrus
The proposed electro-optical image algebra processing system is designed specifically for image processing and other related computations. The design is a hybridization of an optical correlator and a massively paralleled, single instruction multiple data processor. The architecture of the design consists of three tightly coupled components: a spatial configuration processor (the optical analog portion), a weighting processor (digital), and an accumulation processor (digital). The systolic flow of data and image processing operations are directed by a control buffer and pipelined to each of the three processing components. The image processing operations are defined in terms of basic operations of an image algebra developed by the University of Florida. The algebra is capable of describing all common image-to-image transformations. The merit of this architectural design is how it implements the natural decomposition of algebraic functions into spatially distributed, point use operations. The effect of this particular decomposition allows convolution type operations to be computed strictly as a function of the number of elements in the template (mask, filter, etc.) instead of the number of picture elements in the image. Thus, a substantial increase in throughput is realized. The implementation of the proposed design may be accomplished in many ways. While a hybrid electro-optical implementation is of primary interest, the benefits and design issues of an all digital implementation are also discussed. The potential utility of this architectural design lies in its ability to control a large variety of the arithmetic and logic operations of the image algebra's generalized matrix product. The generalized matrix product is the most powerful fundamental operation in the algebra, thus allowing a wide range of applications. No other known device or design has made this claim of processing speed and general implementation of a heterogeneous image algebra.
Mapping of H.264 decoding on a multiprocessor architecture
NASA Astrophysics Data System (ADS)
van der Tol, Erik B.; Jaspers, Egbert G.; Gelderblom, Rob H.
2003-05-01
Due to the increasing significance of development costs in the competitive domain of high-volume consumer electronics, generic solutions are required to enable reuse of the design effort and to increase the potential market volume. As a result from this, Systems-on-Chip (SoCs) contain a growing amount of fully programmable media processing devices as opposed to application-specific systems, which offered the most attractive solutions due to a high performance density. The following motivates this trend. First, SoCs are increasingly dominated by their communication infrastructure and embedded memory, thereby making the cost of the functional units less significant. Moreover, the continuously growing design costs require generic solutions that can be applied over a broad product range. Hence, powerful programmable SoCs are becoming increasingly attractive. However, to enable power-efficient designs, that are also scalable over the advancing VLSI technology, parallelism should be fully exploited. Both task-level and instruction-level parallelism can be provided by means of e.g. a VLIW multiprocessor architecture. To provide the above-mentioned scalability, we propose to partition the data over the processors, instead of traditional functional partitioning. An advantage of this approach is the inherent locality of data, which is extremely important for communication-efficient software implementations. Consequently, a software implementation is discussed, enabling e.g. SD resolution H.264 decoding with a two-processor architecture, whereas High-Definition (HD) decoding can be achieved with an eight-processor system, executing the same software. Experimental results show that the data communication considerably reduces up to 65% directly improving the overall performance. Apart from considerable improvement in memory bandwidth, this novel concept of partitioning offers a natural approach for optimally balancing the load of all processors, thereby further improving the overall speedup.
Benchmarking NWP Kernels on Multi- and Many-core Processors
NASA Astrophysics Data System (ADS)
Michalakes, J.; Vachharajani, M.
2008-12-01
Increased computing power for weather, climate, and atmospheric science has provided direct benefits for defense, agriculture, the economy, the environment, and public welfare and convenience. Today, very large clusters with many thousands of processors are allowing scientists to move forward with simulations of unprecedented size. But time-critical applications such as real-time forecasting or climate prediction need strong scaling: faster nodes and processors, not more of them. Moreover, the need for good cost- performance has never been greater, both in terms of performance per watt and per dollar. For these reasons, the new generations of multi- and many-core processors being mass produced for commercial IT and "graphical computing" (video games) are being scrutinized for their ability to exploit the abundant fine- grain parallelism in atmospheric models. We present results of our work to date identifying key computational kernels within the dynamics and physics of a large community NWP model, the Weather Research and Forecast (WRF) model. We benchmark and optimize these kernels on several different multi- and many-core processors. The goals are to (1) characterize and model performance of the kernels in terms of computational intensity, data parallelism, memory bandwidth pressure, memory footprint, etc. (2) enumerate and classify effective strategies for coding and optimizing for these new processors, (3) assess difficulties and opportunities for tool or higher-level language support, and (4) establish a continuing set of kernel benchmarks that can be used to measure and compare effectiveness of current and future designs of multi- and many-core processors for weather and climate applications.
GaAs Supercomputing: Architecture, Language, And Algorithms For Image Processing
NASA Astrophysics Data System (ADS)
Johl, John T.; Baker, Nick C.
1988-10-01
The application of high-speed GaAs processors in a parallel system matches the demanding computational requirements of image processing. The architecture of the McDonnell Douglas Astronautics Company (MDAC) vector processor is described along with the algorithms and language translator. Most image and signal processing algorithms can utilize parallel processing and show a significant performance improvement over sequential versions. The parallelization performed by this system is within each vector instruction. Since each vector has many elements, each requiring some computation, useful concurrent arithmetic operations can easily be performed. Balancing the memory bandwidth with the computation rate of the processors is an important design consideration for high efficiency and utilization. The architecture features a bus-based execution unit consisting of four to eight 32-bit GaAs RISC microprocessors running at a 200 MHz clock rate for a peak performance of 1.6 BOPS. The execution unit is connected to a vector memory with three buses capable of transferring two input words and one output word every 10 nsec. The address generators inside the vector memory perform different vector addressing modes and feed the data to the execution unit. The functions discussed in this paper include basic MATRIX OPERATIONS, 2-D SPATIAL CONVOLUTION, HISTOGRAM, and FFT. For each of these algorithms, assembly language programs were run on a behavioral model of the system to obtain performance figures.
NASA Technical Reports Server (NTRS)
Windley, P.
1992-01-01
We present a state property called congruence and show how it can be used to demonstrate commutivity of instructions in a modern load-store architecture. Our analysis is particularly important in pipelined microprocessors where instructions are frequently reordered to avoid costly delays in execution caused by hazards. Our work has significant implications to safety and security critical applications since reordering can easily change the meaning and an instruction sequence and current techniques are largely ad hoc. Our work is done in a mechanical theorem prover and results in a set of trustworthy rules for instruction reordering. The mechanization makes it practical to analyze the entire instruction set.
Implementation and simulations of the sphere solution in FAST
NASA Astrophysics Data System (ADS)
Murgolo, F. P.; Schirone, M. G.; Lattanzi, M.; Bernacca, P. L.
1989-06-01
The details of the implementation of the sphere solution software in the Fundamental Astronomy by Space Techniques (FAST) consortium, are described. The simulation results for realistic data sets, both with and without grid-step errors are given. Expected errors on the astrometric parameters of the primary stars and the precision of the reference great circle zero points, are provided as a function of mission duration. The design matrix, the diagrams of the context processor and the processors experimental results are given.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Bales, Benjamin B; Barrett, Richard F
In almost all modern scientific applications, developers achieve the greatest performance gains by tuning algorithms, communication systems, and memory access patterns, while leaving low level instruction optimizations to the compiler. Given the increasingly varied and complicated x86 architectures, the value of these optimizations is unclear, and, due to time and complexity constraints, it is difficult for many programmers to experiment with them. In this report we explore the potential gains of these 'last mile' optimization efforts on an AMD Barcelona processor, providing readers with relevant information so that they can decide whether investment in the presented optimizations is worthwhile.
A multicomputer simulation of the Galileo spacecraft command and data subsystem
NASA Technical Reports Server (NTRS)
Zipse, John E.; Yeung, Raymond Y.; Zimmerman, Barbara A.; Morillo, Ronald; Olster, Daniel B.; Flower, Jon W.; Mizuo, Thomas
1991-01-01
A detailed simulation of the command and data subsystem of the Galileo spacecraft on a distributed memory multicomputer is described. The simulation is based on an ensemble of Inmos Transputers for simulating, to the bit level, the execution of instruction sequences for the six RCA 1802 microcomputers and the intricate bus traffic between them and other components of the spacecraft. Expressions were developed to estimate the performance of the simulator on a distributed system given the processor clock speed, memory access time, and communication characteristics.
NASA Astrophysics Data System (ADS)
Gaikwad, Akshay; Rehal, Diksha; Singh, Amandeep; Arvind, Dorai, Kavita
2018-02-01
We present the NMR implementation of a scheme for selective and efficient quantum process tomography without ancilla. We generalize this scheme such that it can be implemented efficiently using only a set of measurements involving product operators. The method allows us to estimate any element of the quantum process matrix to a desired precision, provided a set of quantum states can be prepared efficiently. Our modified technique requires fewer experimental resources as compared to the standard implementation of selective and efficient quantum process tomography, as it exploits the special nature of NMR measurements to allow us to compute specific elements of the process matrix by a restrictive set of subsystem measurements. To demonstrate the efficacy of our scheme, we experimentally tomograph the processes corresponding to "no operation," a controlled-NOT (CNOT), and a controlled-Hadamard gate on a two-qubit NMR quantum information processor, with high fidelities.
ERIC Educational Resources Information Center
Mazzotti, Valerie L.; Wood, Charles L.; Test, David W.; Fowler, Catherine H.
2012-01-01
Instruction about goal setting can increase students' self-determination and reduce problem behavior. Computer-assisted instruction could offer teachers another format for teaching goal setting and self-determination. This study used a multiple probes across participants design to examine the effects of computer-assisted instruction on students'…
ERIC Educational Resources Information Center
Molenda, Michael; Pershing, James A.
2004-01-01
Training in business settings and instruction in academic settings have never taken place in a vacuum, but in earlier times many instructional technology professionals behaved as though they did. Models of instructional systems design (ISD) placed training and instruction at the center of the universe ignoring the impact of the external…
ERIC Educational Resources Information Center
Mendenhall, Anne M.
2012-01-01
Merrill (2002a) created a set of fundamental principles of instruction that can lead to effective, efficient, and engaging (e[superscript 3]) instruction. The First Principles of Instruction (Merrill, 2002a) are a prescriptive set of interrelated instructional design practices that consist of activating prior knowledge, using specific portrayals…
Computing an operating parameter of a unified power flow controller
Wilson, David G.; Robinett, III, Rush D.
2017-12-26
A Unified Power Flow Controller described herein comprises a sensor that outputs at least one sensed condition, a processor that receives the at least one sensed condition, a memory that comprises control logic that is executable by the processor; and power electronics that comprise power storage, wherein the processor causes the power electronics to selectively cause the power storage to act as one of a power generator or a load based at least in part upon the at least one sensed condition output by the sensor and the control logic, and wherein at least one operating parameter of the power electronics is designed to facilitate maximal transmittal of electrical power generated at a variable power generation system to a grid system while meeting power constraints set forth by the electrical power grid.
The computational structural mechanics testbed architecture. Volume 1: The language
NASA Technical Reports Server (NTRS)
Felippa, Carlos A.
1988-01-01
This is the first set of five volumes which describe the software architecture for the Computational Structural Mechanics Testbed. Derived from NICE, an integrated software system developed at Lockheed Palo Alto Research Laboratory, the architecture is composed of the command language CLAMP, the command language interpreter CLIP, and the data manager GAL. Volumes 1, 2, and 3 (NASA CR's 178384, 178385, and 178386, respectively) describe CLAMP and CLIP, and the CLIP-processor interface. Volumes 4 and 5 (NASA CR's 178387 and 178388, respectively) describe GAL and its low-level I/O. CLAMP, an acronym for Command Language for Applied Mechanics Processors, is designed to control the flow of execution of processors written for NICE. Volume 1 presents the basic elements of the CLAMP language and is intended for all users.
Moving target, distributed, real-time simulation using Ada
NASA Technical Reports Server (NTRS)
Collins, W. R.; Feyock, S.; King, L. A.; Morell, L. J.
1985-01-01
Research on a precompiler solution is described for the moving target compiler problem encountered when trying to run parallel simulation algorithms on several microcomputers. The precompiler is under development at NASA-Lewis for simulating jet engines. Since the behavior of any component of a jet engine, e.g., the fan inlet, rear duct, forward sensor, etc., depends on the previous behaviors and not the current behaviors of other components, the behaviors can be modeled on different processors provided the outputs of the processors reach other processors in appropriate time intervals. The simulator works in compute and transfer modes. The Ada procedure sets for the behaviors of different components are divided up and routed by the precompiler, which essentially receives a multitasking program. The subroutines are synchronized after each computation cycle.
The computational structural mechanics testbed architecture. Volume 2: Directives
NASA Technical Reports Server (NTRS)
Felippa, Carlos A.
1989-01-01
This is the second of a set of five volumes which describe the software architecture for the Computational Structural Mechanics Testbed. Derived from NICE, an integrated software system developed at Lockheed Palo Alto Research Laboratory, the architecture is composed of the command language (CLAMP), the command language interpreter (CLIP), and the data manager (GAL). Volumes 1, 2, and 3 (NASA CR's 178384, 178385, and 178386, respectively) describe CLAMP and CLIP and the CLIP-processor interface. Volumes 4 and 5 (NASA CR's 178387 and 178388, respectively) describe GAL and its low-level I/O. CLAMP, an acronym for Command Language for Applied Mechanics Processors, is designed to control the flow of execution of processors written for NICE. Volume 2 describes the CLIP directives in detail. It is intended for intermediate and advanced users.
Computing an operating parameter of a unified power flow controller
Wilson, David G; Robinett, III, Rush D
2015-01-06
A Unified Power Flow Controller described herein comprises a sensor that outputs at least one sensed condition, a processor that receives the at least one sensed condition, a memory that comprises control logic that is executable by the processor; and power electronics that comprise power storage, wherein the processor causes the power electronics to selectively cause the power storage to act as one of a power generator or a load based at least in part upon the at least one sensed condition output by the sensor and the control logic, and wherein at least one operating parameter of the power electronics is designed to facilitate maximal transmittal of electrical power generated at a variable power generation system to a grid system while meeting power constraints set forth by the electrical power grid.
Solving the Cauchy-Riemann equations on parallel computers
NASA Technical Reports Server (NTRS)
Fatoohi, Raad A.; Grosch, Chester E.
1987-01-01
Discussed is the implementation of a single algorithm on three parallel-vector computers. The algorithm is a relaxation scheme for the solution of the Cauchy-Riemann equations; a set of coupled first order partial differential equations. The computers were chosen so as to encompass a variety of architectures. They are: the MPP, and SIMD machine with 16K bit serial processors; FLEX/32, an MIMD machine with 20 processors; and CRAY/2, an MIMD machine with four vector processors. The machine architectures are briefly described. The implementation of the algorithm is discussed in relation to these architectures and measures of the performance on each machine are given. Simple performance models are used to describe the performance. These models highlight the bottlenecks and limiting factors for this algorithm on these architectures. Conclusions are presented.
ERIC Educational Resources Information Center
Pistoljevic, Nirvana; Greer, R. Douglas
2006-01-01
We tested the effects of an intensive tact instruction procedure on numbers of tacts emitted in non-instructional settings (NIS) using a multiple probe design across 3 participants (3- and 4-year old boys with autism). The dependent variable was tacts emitted in NIS before/after the mastery of sets of 5 different stimuli. The non-instructional…
SDR implementation of the receiver of adaptive communication system
NASA Astrophysics Data System (ADS)
Skarzynski, Jacek; Darmetko, Marcin; Kozlowski, Sebastian; Kurek, Krzysztof
2016-04-01
The paper presents software implementation of a receiver forming a part of an adaptive communication system. The system is intended for communication with a satellite placed in a low Earth orbit (LEO). The ability of adaptation is believed to increase the total amount of data transmitted from the satellite to the ground station. Depending on the signal-to-noise ratio (SNR) of the received signal, adaptive transmission is realized using different transmission modes, i.e., different modulation schemes (BPSK, QPSK, 8-PSK, and 16-APSK) and different convolutional code rates (1/2, 2/3, 3/4, 5/6, and 7/8). The receiver consists of a software-defined radio (SDR) module (National Instruments USRP-2920) and a multithread reception software running on Windows operating system. In order to increase the speed of signal processing, the software takes advantage of single instruction multiple data instructions supported by x86 processor architecture.
Preprocessor and postprocessor computer programs for a radial-flow finite-element model
Pucci, A.A.; Pope, D.A.
1987-01-01
Preprocessing and postprocessing computer programs that enhance the utility of the U.S. Geological Survey radial-flow model have been developed. The preprocessor program: (1) generates a triangular finite element mesh from minimal data input, (2) produces graphical displays and tabulations of data for the mesh , and (3) prepares an input data file to use with the radial-flow model. The postprocessor program is a version of the radial-flow model, which was modified to (1) produce graphical output for simulation and field results, (2) generate a statistic for comparing the simulation results with observed data, and (3) allow hydrologic properties to vary in the simulated region. Examples of the use of the processor programs for a hypothetical aquifer test are presented. Instructions for the data files, format instructions, and a listing of the preprocessor and postprocessor source codes are given in the appendixes. (Author 's abstract)
77 FR 75617 - 36(b)(1) Arms Sales Notification
Federal Register 2010, 2011, 2012, 2013, 2014
2012-12-21
... transmittal, policy justification, and Sensitivity of Technology. Dated: December 18, 2012. Aaron Siegel... Processor Cabinets, 2 Video Wall Screen and Projector Systems, 46 Flat Panel Displays, and 2 Distributed Video Systems), 2 ship sets AN/SPQ-15 Digital Video Distribution Systems, 2 ship sets Operational...
Federal Register 2010, 2011, 2012, 2013, 2014
2013-01-08
... INTERNATIONAL TRADE COMMISSION [Investigation No. 337-TA-812] Certain Computing Devices With Associated Instruction Sets and Software; Notice of Commission Determination Not To Review an Initial... devices with associated instruction sets and software by reason of infringement of claims 1-4, 7-10, and...
Phase coherence adaptive processor for automatic signal detection and identification
NASA Astrophysics Data System (ADS)
Wagstaff, Ronald A.
2006-05-01
A continuously adapting acoustic signal processor with an automatic detection/decision aid is presented. Its purpose is to preserve the signals of tactical interest, and filter out other signals and noise. It utilizes single sensor or beamformed spectral data and transforms the signal and noise phase angles into "aligned phase angles" (APA). The APA increase the phase temporal coherence of signals and leave the noise incoherent. Coherence thresholds are set, which are representative of the type of source "threat vehicle" and the geographic area or volume in which it is operating. These thresholds separate signals, based on the "quality" of their APA coherence. An example is presented in which signals from a submerged source in the ocean are preserved, while clutter signals from ships and noise are entirely eliminated. Furthermore, the "signals of interest" were identified by the processor's automatic detection aid. Similar performance is expected for air and ground vehicles. The processor's equations are formulated in such a manner that they can be tuned to eliminate noise and exploit signal, based on the "quality" of their APA temporal coherence. The mathematical formulation for this processor is presented, including the method by which the processor continuously self-adapts. Results show nearly complete elimination of noise, with only the selected category of signals remaining, and accompanying enhancements in spectral and spatial resolution. In most cases, the concept of signal-to-noise ratio looses significance, and "adaptive automated /decision aid" is more relevant.
Gschwind, Michael K
2013-04-16
Mechanisms for generating and executing programs for a floating point (FP) only single instruction multiple data (SIMD) instruction set architecture (ISA) are provided. A computer program product comprising a computer recordable medium having a computer readable program recorded thereon is provided. The computer readable program, when executed on a computing device, causes the computing device to receive one or more instructions and execute the one or more instructions using logic in an execution unit of the computing device. The logic implements a floating point (FP) only single instruction multiple data (SIMD) instruction set architecture (ISA), based on data stored in a vector register file of the computing device. The vector register file is configured to store both scalar and floating point values as vectors having a plurality of vector elements.
Assessment of mammographic film processor performance in a hospital and mobile screening unit.
Murray, J G; Dowsett, D J; Laird, O; Ennis, J T
1992-12-01
In contrast to the majority of mammographic breast screening programmes, film processing at this centre occurs on site in both hospital and mobile trailer units. Initial (1989) quality control (QC) sensitometric tests revealed a large variation in film processor performance in the mobile unit. The clinical significance of these variations was assessed and acceptance limits for processor performance determined. Abnormal mammograms were used as reference material and copied using high definition 35 mm film over a range of exposure settings. The copies were than matched with QC film density variation from the mobile unit. All films were subsequently ranked for spatial and contrast resolution. Optimal values for processing time of 2 min (equivalent to film transit time 3 min and developer time 46 s) and temperature of 36 degrees C were obtained. The widespread anomaly of reporting film transit time as processing time is highlighted. Use of mammogram copies as a means of measuring the influence of film processor variation is advocated. Careful monitoring of the mobile unit film processor performance has produced stable quality comparable with the hospital based unit. The advantages of on site film processing are outlined. The addition of a sensitometric step wedge to all mammography film stock as a means of assessing image quality is recommended.
A Parallel Pipelined Renderer for the Time-Varying Volume Data
NASA Technical Reports Server (NTRS)
Chiueh, Tzi-Cker; Ma, Kwan-Liu
1997-01-01
This paper presents a strategy for efficiently rendering time-varying volume data sets on a distributed-memory parallel computer. Time-varying volume data take large storage space and visualizing them requires reading large files continuously or periodically throughout the course of the visualization process. Instead of using all the processors to collectively render one volume at a time, a pipelined rendering process is formed by partitioning processors into groups to render multiple volumes concurrently. In this way, the overall rendering time may be greatly reduced because the pipelined rendering tasks are overlapped with the I/O required to load each volume into a group of processors; moreover, parallelization overhead may be reduced as a result of partitioning the processors. We modify an existing parallel volume renderer to exploit various levels of rendering parallelism and to study how the partitioning of processors may lead to optimal rendering performance. Two factors which are important to the overall execution time are re-source utilization efficiency and pipeline startup latency. The optimal partitioning configuration is the one that balances these two factors. Tests on Intel Paragon computers show that in general optimal partitionings do exist for a given rendering task and result in 40-50% saving in overall rendering time.
CTF Preprocessor User's Manual
DOE Office of Scientific and Technical Information (OSTI.GOV)
Avramova, Maria; Salko, Robert K.
2016-05-26
This document describes how a user should go about using the CTF pre- processor tool to create an input deck for modeling rod-bundle geometry in CTF. The tool was designed to generate input decks in a quick and less error-prone manner for CTF. The pre-processor is a completely independent utility, written in Fortran, that takes a reduced amount of input from the user. The information that the user must supply is basic information on bundle geometry, such as rod pitch, clad thickness, and axial location of spacer grids--the pre-processor takes this basic information and determines channel placement and connection informationmore » to be written to the input deck, which is the most time-consuming and error-prone segment of creating a deck. Creation of the model is also more intuitive, as the user can specify assembly and water-tube placement using visual maps instead of having to place them by determining channel/channel and rod/channel connections. As an example of the benefit of the pre-processor, a quarter-core model that contains 500,000 scalar-mesh cells was read into CTF from an input deck containing 200,000 lines of data. This 200,000 line input deck was produced automatically from a set of pre-processor decks that contained only 300 lines of data.« less
Interactive high-resolution isosurface ray casting on multicore processors.
Wang, Qin; JaJa, Joseph
2008-01-01
We present a new method for the interactive rendering of isosurfaces using ray casting on multi-core processors. This method consists of a combination of an object-order traversal that coarsely identifies possible candidate 3D data blocks for each small set of contiguous pixels, and an isosurface ray casting strategy tailored for the resulting limited-size lists of candidate 3D data blocks. While static screen partitioning is widely used in the literature, our scheme performs dynamic allocation of groups of ray casting tasks to ensure almost equal loads among the different threads running on multi-cores while maintaining spatial locality. We also make careful use of memory management environment commonly present in multi-core processors. We test our system on a two-processor Clovertown platform, each consisting of a Quad-Core 1.86-GHz Intel Xeon Processor, for a number of widely different benchmarks. The detailed experimental results show that our system is efficient and scalable, and achieves high cache performance and excellent load balancing, resulting in an overall performance that is superior to any of the previous algorithms. In fact, we achieve an interactive isosurface rendering on a 1024(2) screen for all the datasets tested up to the maximum size of the main memory of our platform.
50 CFR 648.206 - Framework provisions.
Code of Federal Regulations, 2011 CFR
2011-10-01
... deducted when determining specifications; (8) Distribution of the ACL; (9) Gear restrictions (such as mesh... and bycatch monitoring; (27) Requirements for a herring processor survey; (28) ACL set-aside amounts...
50 CFR 648.206 - Framework provisions.
Code of Federal Regulations, 2013 CFR
2013-10-01
... deducted when determining specifications; (8) Distribution of the ACL; (9) Gear restrictions (such as mesh... and bycatch monitoring; (27) Requirements for a herring processor survey; (28) ACL set-aside amounts...
50 CFR 648.206 - Framework provisions.
Code of Federal Regulations, 2014 CFR
2014-10-01
... deducted when determining specifications; (8) Distribution of the ACL; (9) Gear restrictions (such as mesh... and bycatch monitoring; (27) Requirements for a herring processor survey; (28) ACL set-aside amounts...
50 CFR 648.206 - Framework provisions.
Code of Federal Regulations, 2012 CFR
2012-10-01
... deducted when determining specifications; (8) Distribution of the ACL; (9) Gear restrictions (such as mesh... and bycatch monitoring; (27) Requirements for a herring processor survey; (28) ACL set-aside amounts...
Multiprocessor speed-up, Amdahl's Law, and the Activity Set Model of parallel program behavior
NASA Technical Reports Server (NTRS)
Gelenbe, Erol
1988-01-01
An important issue in the effective use of parallel processing is the estimation of the speed-up one may expect as a function of the number of processors used. Amdahl's Law has traditionally provided a guideline to this issue, although it appears excessively pessimistic in the light of recent experimental results. In this note, Amdahl's Law is amended by giving a greater importance to the capacity of a program to make effective use of parallel processing, but also recognizing the fact that imbalance of the workload of each processor is bound to occur. An activity set model of parallel program behavior is then introduced along with the corresponding parallelism index of a program, leading to upper and lower bounds to the speed-up.
Partitioning in parallel processing of production systems
DOE Office of Scientific and Technical Information (OSTI.GOV)
Oflazer, K.
1987-01-01
This thesis presents research on certain issues related to parallel processing of production systems. It first presents a parallel production system interpreter that has been implemented on a four-processor multiprocessor. This parallel interpreter is based on Forgy's OPS5 interpreter and exploits production-level parallelism in production systems. Runs on the multiprocessor system indicate that it is possible to obtain speed-up of around 1.7 in the match computation for certain production systems when productions are split into three sets that are processed in parallel. The next issue addressed is that of partitioning a set of rules to processors in a parallel interpretermore » with production-level parallelism, and the extent of additional improvement in performance. The partitioning problem is formulated and an algorithm for approximate solutions is presented. The thesis next presents a parallel processing scheme for OPS5 production systems that allows some redundancy in the match computation. This redundancy enables the processing of a production to be divided into units of medium granularity each of which can be processed in parallel. Subsequently, a parallel processor architecture for implementing the parallel processing algorithm is presented.« less
Landsat image registration for agricultural applications
NASA Technical Reports Server (NTRS)
Wolfe, R. H., Jr.; Juday, R. D.; Wacker, A. G.; Kaneko, T.
1982-01-01
An image registration system has been developed at the NASA Johnson Space Center (JSC) to spatially align multi-temporal Landsat acquisitions for use in agriculture and forestry research. Working in conjunction with the Master Data Processor (MDP) at the Goddard Space Flight Center, it functionally replaces the long-standing LACIE Registration Processor as JSC's data supplier. The system represents an expansion of the techniques developed for the MDP and LACIE Registration Processor, and it utilizes the experience gained in an IBM/JSC effort evaluating the performance of the latter. These techniques are discussed in detail. Several tests were developed to evaluate the registration performance of the system. The results indicate that 1/15-pixel accuracy (about 4m for Landsat MSS) is achievable in ideal circumstances, sub-pixel accuracy (often to 0.2 pixel or better) was attained on a representative set of U.S. acquisitions, and a success rate commensurate with the LACIE Registration Processor was realized. The system has been employed in a production mode on U.S. and foreign data, and a performance similar to the earlier tests has been noted.
ERIC Educational Resources Information Center
Titsworth, Scott
2017-01-01
In this response, Scott Titsworth analyzes similarities among the forum essays and then offers ideas for how instructional communication scholars might adopt greater situational awareness in research, theory, and application of their work. [Other essays in this forum include: (1) FORUM: Interpersonal Communication in Instructional Settings: The…
Portable parallel stochastic optimization for the design of aeropropulsion components
NASA Technical Reports Server (NTRS)
Sues, Robert H.; Rhodes, G. S.
1994-01-01
This report presents the results of Phase 1 research to develop a methodology for performing large-scale Multi-disciplinary Stochastic Optimization (MSO) for the design of aerospace systems ranging from aeropropulsion components to complete aircraft configurations. The current research recognizes that such design optimization problems are computationally expensive, and require the use of either massively parallel or multiple-processor computers. The methodology also recognizes that many operational and performance parameters are uncertain, and that uncertainty must be considered explicitly to achieve optimum performance and cost. The objective of this Phase 1 research was to initialize the development of an MSO methodology that is portable to a wide variety of hardware platforms, while achieving efficient, large-scale parallelism when multiple processors are available. The first effort in the project was a literature review of available computer hardware, as well as review of portable, parallel programming environments. The first effort was to implement the MSO methodology for a problem using the portable parallel programming language, Parallel Virtual Machine (PVM). The third and final effort was to demonstrate the example on a variety of computers, including a distributed-memory multiprocessor, a distributed-memory network of workstations, and a single-processor workstation. Results indicate the MSO methodology can be well-applied towards large-scale aerospace design problems. Nearly perfect linear speedup was demonstrated for computation of optimization sensitivity coefficients on both a 128-node distributed-memory multiprocessor (the Intel iPSC/860) and a network of workstations (speedups of almost 19 times achieved for 20 workstations). Very high parallel efficiencies (75 percent for 31 processors and 60 percent for 50 processors) were also achieved for computation of aerodynamic influence coefficients on the Intel. Finally, the multi-level parallelization strategy that will be needed for large-scale MSO problems was demonstrated to be highly efficient. The same parallel code instructions were used on both platforms, demonstrating portability. There are many applications for which MSO can be applied, including NASA's High-Speed-Civil Transport, and advanced propulsion systems. The use of MSO will reduce design and development time and testing costs dramatically.
Sequential pattern data mining and visualization
Wong, Pak Chung [Richland, WA; Jurrus, Elizabeth R [Kennewick, WA; Cowley, Wendy E [Benton City, WA; Foote, Harlan P [Richland, WA; Thomas, James J [Richland, WA
2011-12-06
One or more processors (22) are operated to extract a number of different event identifiers therefrom. These processors (22) are further operable to determine a number a display locations each representative of one of the different identifiers and a corresponding time. The display locations are grouped into sets each corresponding to a different one of several event sequences (330a, 330b, 330c. 330d, 330e). An output is generated corresponding to a visualization (320) of the event sequences (330a, 330b, 330c, 330d, 330e).
Sequential pattern data mining and visualization
Wong, Pak Chung [Richland, WA; Jurrus, Elizabeth R [Kennewick, WA; Cowley, Wendy E [Benton City, WA; Foote, Harlan P [Richland, WA; Thomas, James J [Richland, WA
2009-05-26
One or more processors (22) are operated to extract a number of different event identifiers therefrom. These processors (22) are further operable to determine a number a display locations each representative of one of the different identifiers and a corresponding time. The display locations are grouped into sets each corresponding to a different one of several event sequences (330a, 330b, 330c. 330d, 330e). An output is generated corresponding to a visualization (320) of the event sequences (330a, 330b, 330c, 330d, 330e).
The development of an engineering computer graphics laboratory
NASA Technical Reports Server (NTRS)
Anderson, D. C.; Garrett, R. E.
1975-01-01
Hardware and software systems developed to further research and education in interactive computer graphics were described, as well as several of the ongoing application-oriented projects, educational graphics programs, and graduate research projects. The software system consists of a FORTRAN 4 subroutine package, in conjunction with a PDP 11/40 minicomputer as the primary computation processor and the Imlac PDS-1 as an intelligent display processor. The package comprises a comprehensive set of graphics routines for dynamic, structured two-dimensional display manipulation, and numerous routines to handle a variety of input devices at the Imlac.
State estimation for distributed systems with sensing delay
NASA Astrophysics Data System (ADS)
Alexander, Harold L.
1991-08-01
Control of complex systems such as remote robotic vehicles requires combining data from many sensors where the data may often be delayed by sensory processing requirements. The number and variety of sensors make it desirable to distribute the computational burden of sensing and estimation among multiple processors. Classic Kalman filters do not lend themselves to distributed implementations or delayed measurement data. The alternative Kalman filter designs presented in this paper are adapted for delays in sensor data generation and for distribution of computation for sensing and estimation over a set of networked processors.
Dynamically allocating sets of fine-grained processors to running computations
NASA Technical Reports Server (NTRS)
Middleton, David
1988-01-01
Researchers explore an approach to using general purpose parallel computers which involves mapping hardware resources onto computations instead of mapping computations onto hardware. Problems such as processor allocation, task scheduling and load balancing, which have traditionally proven to be challenging, change significantly under this approach and may become amenable to new attacks. Researchers describe the implementation of this approach used by the FFP Machine whose computation and communication resources are repeatedly partitioned into disjoint groups that match the needs of available tasks from moment to moment. Several consequences of this system are examined.
NASA Technical Reports Server (NTRS)
Wright, Mary A.; Regelbrugge, Marc E.; Felippa, Carlos A.
1989-01-01
This is the fourth of a set of five volumes which describe the software architecture for the Computational Structural Mechanics Testbed. Derived from NICE, an integrated software system developed at Lockheed Palo Alto Research Laboratory, the architecture is composed of the command language CLAMP, the command language interpreter CLIP, and the data manager GAL. Volumes 1, 2, and 3 (NASA CR's 178384, 178385, and 178386, respectively) describe CLAMP and CLIP and the CLIP-processor interface. Volumes 4 and 5 (NASA CR's 178387 and 178388, respectively) describe GAL and its low-level I/O. CLAMP, an acronym for Command Language for Applied Mechanics Processors, is designed to control the flow of execution of processors written for NICE. Volume 4 describes the nominal-record data management component of the NICE software. It is intended for all users.
Soto-Quiros, Pablo
2015-01-01
This paper presents a parallel implementation of a kind of discrete Fourier transform (DFT): the vector-valued DFT. The vector-valued DFT is a novel tool to analyze the spectra of vector-valued discrete-time signals. This parallel implementation is developed in terms of a mathematical framework with a set of block matrix operations. These block matrix operations contribute to analysis, design, and implementation of parallel algorithms in multicore processors. In this work, an implementation and experimental investigation of the mathematical framework are performed using MATLAB with the Parallel Computing Toolbox. We found that there is advantage to use multicore processors and a parallel computing environment to minimize the high execution time. Additionally, speedup increases when the number of logical processors and length of the signal increase.
Algorithms and software for solving finite element equations on serial and parallel architectures
NASA Technical Reports Server (NTRS)
Chu, Eleanor; George, Alan
1988-01-01
The primary objective was to compare the performance of state-of-the-art techniques for solving sparse systems with those that are currently available in the Computational Structural Mechanics (MSC) testbed. One of the first tasks was to become familiar with the structure of the testbed, and to install some or all of the SPARSPAK package in the testbed. A brief overview of the CSM Testbed software and its usage is presented. An overview of the sparse matrix research for the Testbed currently employed in the CSM Testbed is given. An interface which was designed and implemented as a research tool for installing and appraising new matrix processors in the CSM Testbed is described. The results of numerical experiments performed in solving a set of testbed demonstration problems using the processor SPK and other experimental processors are contained.
Recall Performance for Content-Addressable Memory Using Adiabatic Quantum Optimization
DOE Office of Scientific and Technical Information (OSTI.GOV)
Imam, Neena; Humble, Travis S.; McCaskey, Alex
A content-addressable memory (CAM) stores key-value associations such that the key is recalled by providing its associated value. While CAM recall is traditionally performed using recurrent neural network models, we show how to solve this problem using adiabatic quantum optimization. Our approach maps the recurrent neural network to a commercially available quantum processing unit by taking advantage of the common underlying Ising spin model. We then assess the accuracy of the quantum processor to store key-value associations by quantifying recall performance against an ensemble of problem sets. We observe that different learning rules from the neural network community influence recallmore » accuracy but performance appears to be limited by potential noise in the processor. The strong connection established between quantum processors and neural network problems supports the growing intersection of these two ideas.« less
Parallel ALLSPD-3D: Speeding Up Combustor Analysis Via Parallel Processing
NASA Technical Reports Server (NTRS)
Fricker, David M.
1997-01-01
The ALLSPD-3D Computational Fluid Dynamics code for reacting flow simulation was run on a set of benchmark test cases to determine its parallel efficiency. These test cases included non-reacting and reacting flow simulations with varying numbers of processors. Also, the tests explored the effects of scaling the simulation with the number of processors in addition to distributing a constant size problem over an increasing number of processors. The test cases were run on a cluster of IBM RS/6000 Model 590 workstations with ethernet and ATM networking plus a shared memory SGI Power Challenge L workstation. The results indicate that the network capabilities significantly influence the parallel efficiency, i.e., a shared memory machine is fastest and ATM networking provides acceptable performance. The limitations of ethernet greatly hamper the rapid calculation of flows using ALLSPD-3D.
Petrenko, Taras; Kossmann, Simone; Neese, Frank
2011-02-07
In this paper, we present the implementation of efficient approximations to time-dependent density functional theory (TDDFT) within the Tamm-Dancoff approximation (TDA) for hybrid density functionals. For the calculation of the TDDFT/TDA excitation energies and analytical gradients, we combine the resolution of identity (RI-J) algorithm for the computation of the Coulomb terms and the recently introduced "chain of spheres exchange" (COSX) algorithm for the calculation of the exchange terms. It is shown that for extended basis sets, the RIJCOSX approximation leads to speedups of up to 2 orders of magnitude compared to traditional methods, as demonstrated for hydrocarbon chains. The accuracy of the adiabatic transition energies, excited state structures, and vibrational frequencies is assessed on a set of 27 excited states for 25 molecules with the configuration interaction singles and hybrid TDDFT/TDA methods using various basis sets. Compared to the canonical values, the typical error in transition energies is of the order of 0.01 eV. Similar to the ground-state results, excited state equilibrium geometries differ by less than 0.3 pm in the bond distances and 0.5° in the bond angles from the canonical values. The typical error in the calculated excited state normal coordinate displacements is of the order of 0.01, and relative error in the calculated excited state vibrational frequencies is less than 1%. The errors introduced by the RIJCOSX approximation are, thus, insignificant compared to the errors related to the approximate nature of the TDDFT methods and basis set truncation. For TDDFT/TDA energy and gradient calculations on Ag-TB2-helicate (156 atoms, 2732 basis functions), it is demonstrated that the COSX algorithm parallelizes almost perfectly (speedup ~26-29 for 30 processors). The exchange-correlation terms also parallelize well (speedup ~27-29 for 30 processors). The solution of the Z-vector equations shows a speedup of ~24 on 30 processors. The parallelization efficiency for the Coulomb terms can be somewhat smaller (speedup ~15-25 for 30 processors), but their contribution to the total calculation time is small. Thus, the parallel program completes a Becke3-Lee-Yang-Parr energy and gradient calculation on the Ag-TB2-helicate in less than 4 h on 30 processors. We also present the necessary extension of the Lagrangian formalism, which enables the calculation of the TDDFT excited state properties in the frozen-core approximation. The algorithms described in this work are implemented into the ORCA electronic structure system.
Morrow, S A; Bates, P E
1987-01-01
This study examined the effectiveness of three sets of school-based instructional materials and community training on acquisition and generalization of a community laundry skill by nine students with severe handicaps. School-based instruction involved artificial materials (pictures), simulated materials (cardboard replica of a community washing machine), and natural materials (modified home model washing machine). Generalization assessments were conducted at two different community laundromats, on two machines represented fully by the school-based instructional materials and two machines not represented fully by these materials. After three phases of school-based instruction, the students were provided ten community training trials in one laundromat setting and a final assessment was conducted in both the trained and untrained community settings. A multiple probe design across students was used to evaluate the effectiveness of the three types of school instruction and community training. After systematic training, most of the students increased their laundry performance with all three sets of school-based materials; however, generalization of these acquired skills was limited in the two community settings. Direct training in one of the community settings resulted in more efficient acquisition of the laundry skills and enhanced generalization to the untrained laundromat setting for most of the students. Results of this study are discussed in regard to the issue of school versus community-based instruction and recommendations are made for future research in this area.
Proteus: a reconfigurable computational network for computer vision
NASA Astrophysics Data System (ADS)
Haralick, Robert M.; Somani, Arun K.; Wittenbrink, Craig M.; Johnson, Robert; Cooper, Kenneth; Shapiro, Linda G.; Phillips, Ihsin T.; Hwang, Jenq N.; Cheung, William; Yao, Yung H.; Chen, Chung-Ho; Yang, Larry; Daugherty, Brian; Lorbeski, Bob; Loving, Kent; Miller, Tom; Parkins, Larye; Soos, Steven L.
1992-04-01
The Proteus architecture is a highly parallel MIMD, multiple instruction, multiple-data machine, optimized for large granularity tasks such as machine vision and image processing The system can achieve 20 Giga-flops (80 Giga-flops peak). It accepts data via multiple serial links at a rate of up to 640 megabytes/second. The system employs a hierarchical reconfigurable interconnection network with the highest level being a circuit switched Enhanced Hypercube serial interconnection network for internal data transfers. The system is designed to use 256 to 1,024 RISC processors. The processors use one megabyte external Read/Write Allocating Caches for reduced multiprocessor contention. The system detects, locates, and replaces faulty subsystems using redundant hardware to facilitate fault tolerance. The parallelism is directly controllable through an advanced software system for partitioning, scheduling, and development. System software includes a translator for the INSIGHT language, a parallel debugger, low and high level simulators, and a message passing system for all control needs. Image processing application software includes a variety of point operators neighborhood, operators, convolution, and the mathematical morphology operations of binary and gray scale dilation, erosion, opening, and closing.
Implementation and analysis of a Navier-Stokes algorithm on parallel computers
NASA Technical Reports Server (NTRS)
Fatoohi, Raad A.; Grosch, Chester E.
1988-01-01
The results of the implementation of a Navier-Stokes algorithm on three parallel/vector computers are presented. The object of this research is to determine how well, or poorly, a single numerical algorithm would map onto three different architectures. The algorithm is a compact difference scheme for the solution of the incompressible, two-dimensional, time-dependent Navier-Stokes equations. The computers were chosen so as to encompass a variety of architectures. They are the following: the MPP, an SIMD machine with 16K bit serial processors; Flex/32, an MIMD machine with 20 processors; and Cray/2. The implementation of the algorithm is discussed in relation to these architectures and measures of the performance on each machine are given. The basic comparison is among SIMD instruction parallelism on the MPP, MIMD process parallelism on the Flex/32, and vectorization of a serial code on the Cray/2. Simple performance models are used to describe the performance. These models highlight the bottlenecks and limiting factors for this algorithm on these architectures. Finally, conclusions are presented.
[Improving speech comprehension using a new cochlear implant speech processor].
Müller-Deile, J; Kortmann, T; Hoppe, U; Hessel, H; Morsnowski, A
2009-06-01
The aim of this multicenter clinical field study was to assess the benefits of the new Freedom 24 sound processor for cochlear implant (CI) users implanted with the Nucleus 24 cochlear implant system. The study included 48 postlingually profoundly deaf experienced CI users who demonstrated speech comprehension performance with their current speech processor on the Oldenburg sentence test (OLSA) in quiet conditions of at least 80% correct scores and who were able to perform adaptive speech threshold testing using the OLSA in noisy conditions. Following baseline measures of speech comprehension performance with their current speech processor, subjects were upgraded to the Freedom 24 speech processor. After a take-home trial period of at least 2 weeks, subject performance was evaluated by measuring the speech reception threshold with the Freiburg multisyllabic word test and speech intelligibility with the Freiburg monosyllabic word test at 50 dB and 70 dB in the sound field. The results demonstrated highly significant benefits for speech comprehension with the new speech processor. Significant benefits for speech comprehension were also demonstrated with the new speech processor when tested in competing background noise.In contrast, use of the Abbreviated Profile of Hearing Aid Benefit (APHAB) did not prove to be a suitably sensitive assessment tool for comparative subjective self-assessment of hearing benefits with each processor. Use of the preprocessing algorithm known as adaptive dynamic range optimization (ADRO) in the Freedom 24 led to additional improvements over the standard upgrade map for speech comprehension in quiet and showed equivalent performance in noise. Through use of the preprocessing beam-forming algorithm BEAM, subjects demonstrated a highly significant improved signal-to-noise ratio for speech comprehension thresholds (i.e., signal-to-noise ratio for 50% speech comprehension scores) when tested with an adaptive procedure using the Oldenburg sentences in the clinical setting S(0)N(CI), with speech signal at 0 degrees and noise lateral to the CI at 90 degrees . With the convincing findings from our evaluations of this multicenter study cohort, a trial with the Freedom 24 sound processor for all suitable CI users is recommended. For evaluating the benefits of a new processor, the comparative assessment paradigm used in our study design would be considered ideal for use with individual patients.
NASA Technical Reports Server (NTRS)
Clune, E.; Segall, Z.; Siewiorek, D.
1984-01-01
A program of experiments has been conducted at NASA-Langley to test the fault-free performance of a Fault-Tolerant Multiprocessor (FTMP) avionics system for next-generation aircraft. Baseline measurements of an operating FTMP system were obtained with respect to the following parameters: instruction execution time, frame size, and the variation of clock ticks. The mechanisms of frame stretching were also investigated. The experimental results are summarized in a table. Areas of interest for future tests are identified, with emphasis given to the implementation of a synthetic workload generation mechanism on FTMP.
Implementing Access to Data Distributed on Many Processors
NASA Technical Reports Server (NTRS)
James, Mark
2006-01-01
A reference architecture is defined for an object-oriented implementation of domains, arrays, and distributions written in the programming language Chapel. This technology primarily addresses domains that contain arrays that have regular index sets with the low-level implementation details being beyond the scope of this discussion. What is defined is a complete set of object-oriented operators that allows one to perform data distributions for domain arrays involving regular arithmetic index sets. What is unique is that these operators allow for the arbitrary regions of the arrays to be fragmented and distributed across multiple processors with a single point of access giving the programmer the illusion that all the elements are collocated on a single processor. Today's massively parallel High Productivity Computing Systems (HPCS) are characterized by a modular structure, with a large number of processing and memory units connected by a high-speed network. Locality of access as well as load balancing are primary concerns in these systems that are typically used for high-performance scientific computation. Data distributions address these issues by providing a range of methods for spreading large data sets across the components of a system. Over the past two decades, many languages, systems, tools, and libraries have been developed for the support of distributions. Since the performance of data parallel applications is directly influenced by the distribution strategy, users often resort to low-level programming models that allow fine-tuning of the distribution aspects affecting performance, but, at the same time, are tedious and error-prone. This technology presents a reusable design of a data-distribution framework for data parallel high-performance applications. Distributions are a means to express locality in systems composed of large numbers of processor and memory components connected by a network. Since distributions have a great effect on the performance of applications, it is important that the distribution strategy is flexible, so its behavior can change depending on the needs of the application. At the same time, high productivity concerns require that the user be shielded from error-prone, tedious details such as communication and synchronization.
2011-01-01
Background Educators in allied health and medical education programs utilize instructional multimedia to facilitate psychomotor skill acquisition in students. This study examines the effects of instructional multimedia on student and instructor attitudes and student study behavior. Methods Subjects consisted of 45 student physical therapists from two universities. Two skill sets were taught during the course of the study. Skill set one consisted of knee examination techniques and skill set two consisted of ankle/foot examination techniques. For each skill set, subjects were randomly assigned to either a control group or an experimental group. The control group was taught with live demonstration of the examination skills, while the experimental group was taught using multimedia. A cross-over design was utilized so that subjects in the control group for skill set one served as the experimental group for skill set two, and vice versa. During the last week of the study, students and instructors completed written questionnaires to assess attitude toward teaching methods, and students answered questions regarding study behavior. Results There were no differences between the two instructional groups in attitudes, but students in the experimental group for skill set two reported greater study time alone compared to other groups. Conclusions Multimedia provides an efficient method to teach psychomotor skills to students entering the health professions. Both students and instructors identified advantages and disadvantages for both instructional techniques. Reponses relative to instructional multimedia emphasized efficiency, processing level, autonomy, and detail of instruction compared to live presentation. Students and instructors identified conflicting views of instructional detail and control of the content. PMID:21693058
Smith, A Russell; Cavanaugh, Cathy; Moore, W Allen
2011-06-21
Educators in allied health and medical education programs utilize instructional multimedia to facilitate psychomotor skill acquisition in students. This study examines the effects of instructional multimedia on student and instructor attitudes and student study behavior. Subjects consisted of 45 student physical therapists from two universities. Two skill sets were taught during the course of the study. Skill set one consisted of knee examination techniques and skill set two consisted of ankle/foot examination techniques. For each skill set, subjects were randomly assigned to either a control group or an experimental group. The control group was taught with live demonstration of the examination skills, while the experimental group was taught using multimedia. A cross-over design was utilized so that subjects in the control group for skill set one served as the experimental group for skill set two, and vice versa. During the last week of the study, students and instructors completed written questionnaires to assess attitude toward teaching methods, and students answered questions regarding study behavior. There were no differences between the two instructional groups in attitudes, but students in the experimental group for skill set two reported greater study time alone compared to other groups. Multimedia provides an efficient method to teach psychomotor skills to students entering the health professions. Both students and instructors identified advantages and disadvantages for both instructional techniques. Reponses relative to instructional multimedia emphasized efficiency, processing level, autonomy, and detail of instruction compared to live presentation. Students and instructors identified conflicting views of instructional detail and control of the content.
Using Microcomputer Word Processors for Foreign Languages.
ERIC Educational Resources Information Center
Smith, Kim L.
1984-01-01
Describes the programs and modifications needed to do word processing using foreign language characters. One such program, Screenwriter, uses soft character sets -- character sets which can be designed by the program user. This program has a word processing power combined with a foreign language capability that would allow any person to work with…
ERIC Educational Resources Information Center
Wilk, Leslie A.; Redmon, William K.
1990-01-01
Daily goals set by supervisors and feedback were used with three female application processors in a college admissions office. After 39 weeks, comparison with baseline data showed large increases in the amount of work done and decreases in overtime, use of temporaries, and absenteeism. (SK)
Extending Moore's Law via Computationally Error Tolerant Computing.
Deng, Bobin; Srikanth, Sriseshan; Hein, Eric R.; ...
2018-03-01
Dennard scaling has ended. Lowering the voltage supply (V dd) to sub-volt levels causes intermittent losses in signal integrity, rendering further scaling (down) no longer acceptable as a means to lower the power required by a processor core. However, it is possible to correct the occasional errors caused due to lower V dd in an efficient manner and effectively lower power. By deploying the right amount and kind of redundancy, we can strike a balance between overhead incurred in achieving reliability and energy savings realized by permitting lower V dd. One promising approach is the Redundant Residue Number System (RRNS)more » representation. Unlike other error correcting codes, RRNS has the important property of being closed under addition, subtraction and multiplication, thus enabling computational error correction at a fraction of an overhead compared to conventional approaches. We use the RRNS scheme to design a Computationally-Redundant, Energy-Efficient core, including the microarchitecture, Instruction Set Architecture (ISA) and RRNS centered algorithms. Finally, from the simulation results, this RRNS system can reduce the energy-delay-product by about 3× for multiplication intensive workloads and by about 2× in general, when compared to a non-error-correcting binary core.« less
Benchmarking multimedia performance
NASA Astrophysics Data System (ADS)
Zandi, Ahmad; Sudharsanan, Subramania I.
1998-03-01
With the introduction of faster processors and special instruction sets tailored to multimedia, a number of exciting applications are now feasible on the desktops. Among these is the DVD playback consisting, among other things, of MPEG-2 video and Dolby digital audio or MPEG-2 audio. Other multimedia applications such as video conferencing and speech recognition are also becoming popular on computer systems. In view of this tremendous interest in multimedia, a group of major computer companies have formed, Multimedia Benchmarks Committee as part of Standard Performance Evaluation Corp. to address the performance issues of multimedia applications. The approach is multi-tiered with three tiers of fidelity from minimal to full compliant. In each case the fidelity of the bitstream reconstruction as well as quality of the video or audio output are measured and the system is classified accordingly. At the next step the performance of the system is measured. In many multimedia applications such as the DVD playback the application needs to be run at a specific rate. In this case the measurement of the excess processing power, makes all the difference. All these make a system level, application based, multimedia benchmark very challenging. Several ideas and methodologies for each aspect of the problems will be presented and analyzed.
Extending Moore's Law via Computationally Error Tolerant Computing.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Deng, Bobin; Srikanth, Sriseshan; Hein, Eric R.
Dennard scaling has ended. Lowering the voltage supply (V dd) to sub-volt levels causes intermittent losses in signal integrity, rendering further scaling (down) no longer acceptable as a means to lower the power required by a processor core. However, it is possible to correct the occasional errors caused due to lower V dd in an efficient manner and effectively lower power. By deploying the right amount and kind of redundancy, we can strike a balance between overhead incurred in achieving reliability and energy savings realized by permitting lower V dd. One promising approach is the Redundant Residue Number System (RRNS)more » representation. Unlike other error correcting codes, RRNS has the important property of being closed under addition, subtraction and multiplication, thus enabling computational error correction at a fraction of an overhead compared to conventional approaches. We use the RRNS scheme to design a Computationally-Redundant, Energy-Efficient core, including the microarchitecture, Instruction Set Architecture (ISA) and RRNS centered algorithms. Finally, from the simulation results, this RRNS system can reduce the energy-delay-product by about 3× for multiplication intensive workloads and by about 2× in general, when compared to a non-error-correcting binary core.« less
NASA Technical Reports Server (NTRS)
Jones, Kennie H.; Randall, Donald P.; Stallcup, Scott S.; Rowell, Lawrence F.
1988-01-01
The Environment for Application Software Integration and Execution, EASIE, provides a methodology and a set of software utility programs to ease the task of coordinating engineering design and analysis codes. EASIE was designed to meet the needs of conceptual design engineers that face the task of integrating many stand-alone engineering analysis programs. Using EASIE, programs are integrated through a relational data base management system. In volume 2, the use of a SYSTEM LIBRARY PROCESSOR is used to construct a DATA DICTIONARY describing all relations defined in the data base, and a TEMPLATE LIBRARY. A TEMPLATE is a description of all subsets of relations (including conditional selection criteria and sorting specifications) to be accessed as input or output for a given application. Together, these form the SYSTEM LIBRARY which is used to automatically produce the data base schema, FORTRAN subroutines to retrieve/store data from/to the data base, and instructions to a generic REVIEWER program providing review/modification of data for a given template. Automation of these functions eliminates much of the tedious, error prone work required by the usual approach to data base integration.
Computer-simulated laboratory explorations for middle school life, earth, and physical Science
NASA Astrophysics Data System (ADS)
von Blum, Ruth
1992-06-01
Explorations in Middle School Science is a set of 72 computer-simulated laboratory lessons in life, earth, and physical Science for grades 6 9 developed by Jostens Learning Corporation with grants from the California State Department of Education and the National Science Foundation.3 At the heart of each lesson is a computer-simulated laboratory that actively involves students in doing science improving their: (1) understanding of science concepts by applying critical thinking to solve real problems; (2) skills in scientific processes and communications; and (3) attitudes about science. Students use on-line tools (notebook, calculator, word processor) to undertake in-depth investigations of phenomena (like motion in outer space, disease transmission, volcanic eruptions, or the structure of the atom) that would be too difficult, dangerous, or outright impossible to do in a “live” laboratory. Suggested extension activities lead students to hands-on investigations, away from the computer. This article presents the underlying rationale, instructional model, and process by which Explorations was designed and developed. It also describes the general courseware structure and three lesson's in detail, as well as presenting preliminary data from the evaluation. Finally, it suggests a model for incorporating technology into the science classroom.
Minati, Ludovico; Zacà, Domenico; D'Incerti, Ludovico; Jovicich, Jorge
2014-09-01
An outstanding issue in graph-based analysis of resting-state functional MRI is choice of network nodes. Individual consideration of entire brain voxels may represent a less biased approach than parcellating the cortex according to pre-determined atlases, but entails establishing connectedness for 1(9)-1(11) links, with often prohibitive computational cost. Using a representative Human Connectome Project dataset, we show that, following appropriate time-series normalization, it may be possible to accelerate connectivity determination replacing Pearson correlation with l1-norm. Even though the adjacency matrices derived from correlation coefficients and l1-norms are not identical, their similarity is high. Further, we describe and provide in full an example vector hardware implementation of l1-norm on an array of 4096 zero instruction-set processors. Calculation times <1000 s are attainable, removing the major deterrent to voxel-based resting-sate network mapping and revealing fine-grained node degree heterogeneity. L1-norm should be given consideration as a substitute for correlation in very high-density resting-state functional connectivity analyses. Copyright © 2014 IPEM. Published by Elsevier Ltd. All rights reserved.
Messiah College Biodiesel Fuel Generation Project Final Technical Report
DOE Office of Scientific and Technical Information (OSTI.GOV)
Zummo, Michael M; Munson, J; Derr, A
Many obvious and significant concerns arise when considering the concept of small-scale biodiesel production. Does the fuel produced meet the stringent requirements set by the commercial biodiesel industry? Is the process safe? How are small-scale producers collecting and transporting waste vegetable oil? How is waste from the biodiesel production process handled by small-scale producers? These concerns and many others were the focus of the research preformed in the Messiah College Biodiesel Fuel Generation project over the last three years. This project was a unique research program in which undergraduate engineering students at Messiah College set out to research the feasibilitymore » of small-biodiesel production for application on a campus of approximately 3000 students. This Department of Energy (DOE) funded research program developed out of almost a decade of small-scale biodiesel research and development work performed by students at Messiah College. Over the course of the last three years the research team focused on four key areas related to small-scale biodiesel production: Quality Testing and Assurance, Process and Processor Research, Process and Processor Development, and Community Education. The objectives for the Messiah College Biodiesel Fuel Generation Project included the following: 1. Preparing a laboratory facility for the development and optimization of processors and processes, ASTM quality assurance, and performance testing of biodiesel fuels. 2. Developing scalable processor and process designs suitable for ASTM certifiable small-scale biodiesel production, with the goals of cost reduction and increased quality. 3. Conduct research into biodiesel process improvement and cost optimization using various biodiesel feedstocks and production ingredients.« less
NASA Astrophysics Data System (ADS)
Gan, Chee Kwan; Challacombe, Matt
2003-05-01
Recently, early onset linear scaling computation of the exchange-correlation matrix has been achieved using hierarchical cubature [J. Chem. Phys. 113, 10037 (2000)]. Hierarchical cubature differs from other methods in that the integration grid is adaptive and purely Cartesian, which allows for a straightforward domain decomposition in parallel computations; the volume enclosing the entire grid may be simply divided into a number of nonoverlapping boxes. In our data parallel approach, each box requires only a fraction of the total density to perform the necessary numerical integrations due to the finite extent of Gaussian-orbital basis sets. This inherent data locality may be exploited to reduce communications between processors as well as to avoid memory and copy overheads associated with data replication. Although the hierarchical cubature grid is Cartesian, naive boxing leads to irregular work loads due to strong spatial variations of the grid and the electron density. In this paper we describe equal time partitioning, which employs time measurement of the smallest sub-volumes (corresponding to the primitive cubature rule) to load balance grid-work for the next self-consistent-field iteration. After start-up from a heuristic center of mass partitioning, equal time partitioning exploits smooth variation of the density and grid between iterations to achieve load balance. With the 3-21G basis set and a medium quality grid, equal time partitioning applied to taxol (62 heavy atoms) attained a speedup of 61 out of 64 processors, while for a 110 molecule water cluster at standard density it achieved a speedup of 113 out of 128. The efficiency of equal time partitioning applied to hierarchical cubature improves as the grid work per processor increases. With a fine grid and the 6-311G(df,p) basis set, calculations on the 26 atom molecule α-pinene achieved a parallel efficiency better than 99% with 64 processors. For more coarse grained calculations, superlinear speedups are found to result from reduced computational complexity associated with data parallelism.
Fast and Accurate Simulation of the Cray XMT Multithreaded Supercomputer
DOE Office of Scientific and Technical Information (OSTI.GOV)
Villa, Oreste; Tumeo, Antonino; Secchi, Simone
Irregular applications, such as data mining and analysis or graph-based computations, show unpredictable memory/network access patterns and control structures. Highly multithreaded architectures with large processor counts, like the Cray MTA-1, MTA-2 and XMT, appear to address their requirements better than commodity clusters. However, the research on highly multithreaded systems is currently limited by the lack of adequate architectural simulation infrastructures due to issues such as size of the machines, memory footprint, simulation speed, accuracy and customization. At the same time, Shared-memory MultiProcessors (SMPs) with multi-core processors have become an attractive platform to simulate large scale machines. In this paper, wemore » introduce a cycle-level simulator of the highly multithreaded Cray XMT supercomputer. The simulator runs unmodified XMT applications. We discuss how we tackled the challenges posed by its development, detailing the techniques introduced to make the simulation as fast as possible while maintaining a high accuracy. By mapping XMT processors (ThreadStorm with 128 hardware threads) to host computing cores, the simulation speed remains constant as the number of simulated processors increases, up to the number of available host cores. The simulator supports zero-overhead switching among different accuracy levels at run-time and includes a network model that takes into account contention. On a modern 48-core SMP host, our infrastructure simulates a large set of irregular applications 500 to 2000 times slower than real time when compared to a 128-processor XMT, while remaining within 10\\% of accuracy. Emulation is only from 25 to 200 times slower than real time.« less
Asynchronous parallel status comparator
Arnold, Jeffrey W.; Hart, Mark M.
1992-01-01
Apparatus for matching asynchronously received signals and determining whether two or more out of a total number of possible signals match. The apparatus comprises, in one embodiment, an array of sensors positioned in discrete locations and in communication with one or more processors. The processors will receive signals if the sensors detect a change in the variable sensed from a nominal to a special condition and will transmit location information in the form of a digital data set to two or more receivers. The receivers collect, read, latch and acknowledge the data sets and forward them to decoders that produce an output signal for each data set received. The receivers also periodically reset the system following each scan of the sensor array. A comparator then determines if any two or more, as specified by the user, of the output signals corresponds to the same location. A sufficient number of matches produces a system output signal that activates a system to restore the array to its nominal condition.
Asynchronous parallel status comparator
Arnold, J.W.; Hart, M.M.
1992-12-15
Disclosed is an apparatus for matching asynchronously received signals and determining whether two or more out of a total number of possible signals match. The apparatus comprises, in one embodiment, an array of sensors positioned in discrete locations and in communication with one or more processors. The processors will receive signals if the sensors detect a change in the variable sensed from a nominal to a special condition and will transmit location information in the form of a digital data set to two or more receivers. The receivers collect, read, latch and acknowledge the data sets and forward them to decoders that produce an output signal for each data set received. The receivers also periodically reset the system following each scan of the sensor array. A comparator then determines if any two or more, as specified by the user, of the output signals corresponds to the same location. A sufficient number of matches produces a system output signal that activates a system to restore the array to its nominal condition. 4 figs.
Scalable Algorithms for Clustering Large Geospatiotemporal Data Sets on Manycore Architectures
NASA Astrophysics Data System (ADS)
Mills, R. T.; Hoffman, F. M.; Kumar, J.; Sreepathi, S.; Sripathi, V.
2016-12-01
The increasing availability of high-resolution geospatiotemporal data sets from sources such as observatory networks, remote sensing platforms, and computational Earth system models has opened new possibilities for knowledge discovery using data sets fused from disparate sources. Traditional algorithms and computing platforms are impractical for the analysis and synthesis of data sets of this size; however, new algorithmic approaches that can effectively utilize the complex memory hierarchies and the extremely high levels of available parallelism in state-of-the-art high-performance computing platforms can enable such analysis. We describe a massively parallel implementation of accelerated k-means clustering and some optimizations to boost computational intensity and utilization of wide SIMD lanes on state-of-the art multi- and manycore processors, including the second-generation Intel Xeon Phi ("Knights Landing") processor based on the Intel Many Integrated Core (MIC) architecture, which includes several new features, including an on-package high-bandwidth memory. We also analyze the code in the context of a few practical applications to the analysis of climatic and remotely-sensed vegetation phenology data sets, and speculate on some of the new applications that such scalable analysis methods may enable.
Feature instructions improve face-matching accuracy
Bindemann, Markus
2018-01-01
Identity comparisons of photographs of unfamiliar faces are prone to error but important for applied settings, such as person identification at passport control. Finding techniques to improve face-matching accuracy is therefore an important contemporary research topic. This study investigated whether matching accuracy can be improved by instruction to attend to specific facial features. Experiment 1 showed that instruction to attend to the eyebrows enhanced matching accuracy for optimized same-day same-race face pairs but not for other-race faces. By contrast, accuracy was unaffected by instruction to attend to the eyes, and declined with instruction to attend to ears. Experiment 2 replicated the eyebrow-instruction improvement with a different set of same-race faces, comprising both optimized same-day and more challenging different-day face pairs. These findings suggest that instruction to attend to specific features can enhance face-matching accuracy, but feature selection is crucial and generalization across face sets may be limited. PMID:29543822
Contextual classification on the massively parallel processor
NASA Technical Reports Server (NTRS)
Tilton, James C.
1987-01-01
Classifiers are often used to produce land cover maps from multispectral Earth observation imagery. Conventionally, these classifiers have been designed to exploit the spectral information contained in the imagery. Very few classifiers exploit the spatial information content of the imagery, and the few that do rarely exploit spatial information content in conjunction with spectral and/or temporal information. A contextual classifier that exploits spatial and spectral information in combination through a general statistical approach was studied. Early test results obtained from an implementation of the classifier on a VAX-11/780 minicomputer were encouraging, but they are of limited meaning because they were produced from small data sets. An implementation of the contextual classifier is presented on the Massively Parallel Processor (MPP) at Goddard that for the first time makes feasible the testing of the classifier on large data sets.
A neuronal model of a global workspace in effortful cognitive tasks.
Dehaene, S; Kerszberg, M; Changeux, J P
1998-11-24
A minimal hypothesis is proposed concerning the brain processes underlying effortful tasks. It distinguishes two main computational spaces: a unique global workspace composed of distributed and heavily interconnected neurons with long-range axons, and a set of specialized and modular perceptual, motor, memory, evaluative, and attentional processors. Workspace neurons are mobilized in effortful tasks for which the specialized processors do not suffice. They selectively mobilize or suppress, through descending connections, the contribution of specific processor neurons. In the course of task performance, workspace neurons become spontaneously coactivated, forming discrete though variable spatio-temporal patterns subject to modulation by vigilance signals and to selection by reward signals. A computer simulation of the Stroop task shows workspace activation to increase during acquisition of a novel task, effortful execution, and after errors. We outline predictions for spatio-temporal activation patterns during brain imaging, particularly about the contribution of dorsolateral prefrontal cortex and anterior cingulate to the workspace.
NASA Technical Reports Server (NTRS)
Carter, John; Stephenson, Mark
1999-01-01
The NASA Dryden Flight Research Center has completed the initial flight test of a modified set of F/A-18 flight control computers that gives the aircraft a research control law capability. The production support flight control computers (PSFCC) provide an increased capability for flight research in the control law, handling qualities, and flight systems areas. The PSFCC feature a research flight control processor that is "piggybacked" onto the baseline F/A-18 flight control system. This research processor allows for pilot selection of research control law operation in flight. To validate flight operation, a replication of a standard F/A-18 control law was programmed into the research processor and flight-tested over a limited envelope. This paper provides a brief description of the system, summarizes the initial flight test of the PSFCC, and describes future experiments for the PSFCC.
Implementation of an ADI method on parallel computers
NASA Technical Reports Server (NTRS)
Fatoohi, Raad A.; Grosch, Chester E.
1987-01-01
The implementation of an ADI method for solving the diffusion equation on three parallel/vector computers is discussed. The computers were chosen so as to encompass a variety of architectures. They are: the MPP, an SIMD machine with 16K bit serial processors; FLEX/32, an MIMD machine with 20 processors; and CRAY/2, an MIMD machine with four vector processors. The Gaussian elimination algorithm is used to solve a set of tridiagonal systems on the FLEX/32 and CRAY/2 while the cyclic elimination algorithm is used to solve these systems on the MPP. The implementation of the method is discussed in relation to these architectures and measures of the performance on each machine are given. Simple performance models are used to describe the performance. These models highlight the bottlenecks and limiting factors for this algorithm on these architectures. Finally, conclusions are presented.
Implementation of an ADI method on parallel computers
NASA Technical Reports Server (NTRS)
Fatoohi, Raad A.; Grosch, Chester E.
1987-01-01
In this paper the implementation of an ADI method for solving the diffusion equation on three parallel/vector computers is discussed. The computers were chosen so as to encompass a variety of architectures. They are the MPP, an SIMD machine with 16-Kbit serial processors; Flex/32, an MIMD machine with 20 processors; and Cray/2, an MIMD machine with four vector processors. The Gaussian elimination algorithm is used to solve a set of tridiagonal systems on the Flex/32 and Cray/2 while the cyclic elimination algorithm is used to solve these systems on the MPP. The implementation of the method is discussed in relation to these architectures and measures of the performance on each machine are given. Simple performance models are used to describe the performance. These models highlight the bottlenecks and limiting factors for this algorithm on these architectures. Finally conclusions are presented.
NASA Technical Reports Server (NTRS)
Felippa, Carlos A.
1989-01-01
This is the fifth of a set of five volumes which describe the software architecture for the Computational Structural Mechanics Testbed. Derived from NICE, an integrated software system developed at Lockheed Palo Alto Research Laboratory, the architecture is composed of the command language (CLAMP), the command language interpreter (CLIP), and the data manager (GAL). Volumes 1, 2, and 3 (NASA CR's 178384, 178385, and 178386, respectively) describe CLAMP and CLIP and the CLIP-processor interface. Volumes 4 and 5 (NASA CR's 178387 and 178388, respectively) describe GAL and its low-level I/O. CLAMP, an acronym for Command Language for Applied Mechanics Processors, is designed to control the flow of execution of processors written for NICE. Volume 5 describes the low-level data management component of the NICE software. It is intended only for advanced programmers involved in maintenance of the software.
ERIC Educational Resources Information Center
Erguvan, Deniz
2014-01-01
This study sets out to explore the faculty members' perceptions of a specific web-based instruction tool (Achieve3000) in a private higher education institute in Kuwait. The online tool provides highly differentiated instruction, which is initiated with a level set at the beginning of the term. The program is used in two consecutive courses as…
Parallel evolution of image processing tools for multispectral imagery
NASA Astrophysics Data System (ADS)
Harvey, Neal R.; Brumby, Steven P.; Perkins, Simon J.; Porter, Reid B.; Theiler, James P.; Young, Aaron C.; Szymanski, John J.; Bloch, Jeffrey J.
2000-11-01
We describe the implementation and performance of a parallel, hybrid evolutionary-algorithm-based system, which optimizes image processing tools for feature-finding tasks in multi-spectral imagery (MSI) data sets. Our system uses an integrated spatio-spectral approach and is capable of combining suitably-registered data from different sensors. We investigate the speed-up obtained by parallelization of the evolutionary process via multiple processors (a workstation cluster) and develop a model for prediction of run-times for different numbers of processors. We demonstrate our system on Landsat Thematic Mapper MSI , covering the recent Cerro Grande fire at Los Alamos, NM, USA.
A Micro-Computer Computational Unit for an IR-CCD Intrusion Detection System.
1980-10-01
signal processor. In the laboratory this processor has met or exceeded all its design goals. LYN H. SKOLNIK Project Engineer i, viii AMONO I...4 +* * SEPIKY LANL ITINE’-10. - 4. 1 4* This section of :ode containrs all assembly lasosiae rotinec ft - used in the IRCCD tntruton 4eector rm r s...In arra -B47- -I- here o, ’t , ’ retore A0I o.- .tmp2,r! rotetre r5t jr t o I 00 to r.9ioter restore ro-tine THRESH - routix, for set.ing J thres
Fault-Tolerant, Radiation-Hard DSP
NASA Technical Reports Server (NTRS)
Czajkowski, David
2011-01-01
Commercial digital signal processors (DSPs) for use in high-speed satellite computers are challenged by the damaging effects of space radiation, mainly single event upsets (SEUs) and single event functional interrupts (SEFIs). Innovations have been developed for mitigating the effects of SEUs and SEFIs, enabling the use of very-highspeed commercial DSPs with improved SEU tolerances. Time-triple modular redundancy (TTMR) is a method of applying traditional triple modular redundancy on a single processor, exploiting the VLIW (very long instruction word) class of parallel processors. TTMR improves SEU rates substantially. SEFIs are solved by a SEFI-hardened core circuit, external to the microprocessor. It monitors the health of the processor, and if a SEFI occurs, forces the processor to return to performance through a series of escalating events. TTMR and hardened-core solutions were developed for both DSPs and reconfigurable field-programmable gate arrays (FPGAs). This includes advancement of TTMR algorithms for DSPs and reconfigurable FPGAs, plus a rad-hard, hardened-core integrated circuit that services both the DSP and FPGA. Additionally, a combined DSP and FPGA board architecture was fully developed into a rad-hard engineering product. This technology enables use of commercial off-the-shelf (COTS) DSPs in computers for satellite and other space applications, allowing rapid deployment at a much lower cost. Traditional rad-hard space computers are very expensive and typically have long lead times. These computers are either based on traditional rad-hard processors, which have extremely low computational performance, or triple modular redundant (TMR) FPGA arrays, which suffer from power and complexity issues. Even more frustrating is that the TMR arrays of FPGAs require a fixed, external rad-hard voting element, thereby causing them to lose much of their reconfiguration capability and in some cases significant speed reduction. The benefits of COTS high-performance signal processing include significant increase in onboard science data processing, enabling orders of magnitude reduction in required communication bandwidth for science data return, orders of magnitude improvement in onboard mission planning and critical decision making, and the ability to rapidly respond to changing mission environments, thus enabling opportunistic science and orders of magnitude reduction in the cost of mission operations through reduction of required staff. Additional benefits of COTS-based, high-performance signal processing include the ability to leverage considerable commercial and academic investments in advanced computing tools, techniques, and infra structure, and the familiarity of the science and IT community with these computing environments.
Weaving Mathematical Instructional Strategies into Inclusive Settings.
ERIC Educational Resources Information Center
Karp, Karen S.; Voltz, Deborah L.
2000-01-01
This article describes a framework that allows teachers in inclusive elementary settings to interweave instructional strategies from a variety of paradigms to meet individual learning needs in inclusive mathematics classes. Factors to be considered are highlighted and an instructional continuum from more teacher-centered strategies to more…
Compact gasoline fuel processor for passenger vehicle APU
NASA Astrophysics Data System (ADS)
Severin, Christopher; Pischinger, Stefan; Ogrzewalla, Jürgen
Due to the increasing demand for electrical power in today's passenger vehicles, and with the requirements regarding fuel consumption and environmental sustainability tightening, a fuel cell-based auxiliary power unit (APU) becomes a promising alternative to the conventional generation of electrical energy via internal combustion engine, generator and battery. It is obvious that the on-board stored fuel has to be used for the fuel cell system, thus, gasoline or diesel has to be reformed on board. This makes the auxiliary power unit a complex integrated system of stack, air supply, fuel processor, electrics as well as heat and water management. Aside from proving the technical feasibility of such a system, the development has to address three major barriers:start-up time, costs, and size/weight of the systems. In this paper a packaging concept for an auxiliary power unit is presented. The main emphasis is placed on the fuel processor, as good packaging of this large subsystem has the strongest impact on overall size. The fuel processor system consists of an autothermal reformer in combination with water-gas shift and selective oxidation stages, based on adiabatic reactors with inter-cooling. The configuration was realized in a laboratory set-up and experimentally investigated. The results gained from this confirm a general suitability for mobile applications. A start-up time of 30 min was measured, while a potential reduction to 10 min seems feasible. An overall fuel processor efficiency of about 77% was measured. On the basis of the know-how gained by the experimental investigation of the laboratory set-up a packaging concept was developed. Using state-of-the-art catalyst and heat exchanger technology, the volumes of these components are fixed. However, the overall volume is higher mainly due to mixing zones and flow ducts, which do not contribute to the chemical or thermal function of the system. Thus, the concept developed mainly focuses on minimization of those component volumes. Therefore, the packaging utilizes rectangular catalyst bricks and integrates flow ducts into the heat exchangers. A concept is presented with a 25 l fuel processor volume including thermal isolation for a 3 kW el auxiliary power unit. The overall size of the system, i.e. including stack, air supply and auxiliaries can be estimated to 44 l.
Shared prefetching to reduce execution skew in multi-threaded systems
DOE Office of Scientific and Technical Information (OSTI.GOV)
Eichenberger, Alexandre E; Gunnels, John A
Mechanisms are provided for optimizing code to perform prefetching of data into a shared memory of a computing device that is shared by a plurality of threads that execute on the computing device. A memory stream of a portion of code that is shared by the plurality of threads is identified. A set of prefetch instructions is distributed across the plurality of threads. Prefetch instructions are inserted into the instruction sequences of the plurality of threads such that each instruction sequence has a separate sub-portion of the set of prefetch instructions, thereby generating optimized code. Executable code is generated basedmore » on the optimized code and stored in a storage device. The executable code, when executed, performs the prefetches associated with the distributed set of prefetch instructions in a shared manner across the plurality of threads.« less
NASA Technical Reports Server (NTRS)
Muller, Dagmar; Krasemann, Hajo; Brewin, Robert J. W.; Brockmann, Carsten; Deschamps, Pierre-Yves; Fomferra, Norman; Franz, Bryan A.; Grant, Mike G.; Groom, Steve B.; Melin, Frederic;
2015-01-01
The established procedure to access the quality of atmospheric correction processors and their underlying algorithms is the comparison of satellite data products with related in-situ measurements. Although this approach addresses the accuracy of derived geophysical properties in a straight forward fashion, it is also limited in its ability to catch systematic sensor and processor dependent behaviour of satellite products along the scan-line, which might impair the usefulness of the data in spatial analyses. The Ocean Colour Climate Change Initiative (OC-CCI) aims to create an ocean colour dataset on a global scale to meet the demands of the ecosystem modelling community. The need for products with increasing spatial and temporal resolution that also show as little systematic and random errors as possible, increases. Due to cloud cover, even temporal means can be influenced by along-scanline artefacts if the observations are not balanced and effects cannot be cancelled out mutually. These effects can arise from a multitude of results which are not easily separated, if at all. Among the sources of artefacts, there are some sensor-specific calibration issues which should lead to similar responses in all processors, as well as processor-specific features which correspond with the individual choices in the algorithms. A set of methods is proposed and applied to MERIS data over two regions of interest in the North Atlantic and the South Pacific Gyre. The normalised water leaving reflectance products of four atmospheric correction processors, which have also been evaluated in match-up analysis, is analysed in order to find and interpret systematic effects across track. These results are summed up with a semi-objective ranking and are used as a complement to the match-up analysis in the decision for the best Atmospheric Correction (AC) processor. Although the need for discussion remains concerning the absolutes by which to judge an AC processor, this example demonstrates clearly, that relying on the match-up analysis alone can lead to misjudgement.
Fast particles identification in programmable form at level-0 trigger by means of the 3D-Flow system
DOE Office of Scientific and Technical Information (OSTI.GOV)
Crosetto, Dario B.
1998-10-30
The 3D-Flow Processor system is a new, technology-independent concept in very fast, real-time system architectures. Based on either an FPGA or an ASIC implementation, it can address, in a fully programmable manner, applications where commercially available processors would fail because of throughput requirements. Possible applications include filtering-algorithms (pattern recognition) from the input of multiple sensors, as well as moving any input validated by these filtering-algorithms to a single output channel. Both operations can easily be implemented on a 3D-Flow system to achieve a real-time processing system with a very short lag time. This system can be built either with off-the-shelfmore » FPGAs or, for higher data rates, with CMOS chips containing 4 to 16 processors each. The basic building block of the system, a 3D-Flow processor, has been successfully designed in VHDL code written in ''Generic HDL'' (mostly made of reusable blocks that are synthesizable in different technologies, or FPGAs), to produce a netlist for a four-processor ASIC featuring 0.35 micron CBA (Ceil Base Array) technology at 3.3 Volts, 884 mW power dissipation at 60 MHz and 63.75 mm sq. die size. The same VHDL code has been targeted to three FPGA manufacturers (Altera EPF10K250A, ORCA-Lucent Technologies 0R3T165 and Xilinx XCV1000). A complete set of software tools, the 3D-Flow System Manager, equally applicable to ASIC or FPGA implementations, has been produced to provide full system simulation, application development, real-time monitoring, and run-time fault recovery. Today's technology can accommodate 16 processors per chip in a medium size die, at a cost per processor of less than $5 based on the current silicon die/size technology cost.« less
Relationships between Working Conditions and Special Educators' Instruction
ERIC Educational Resources Information Center
Bettini, Elizabeth A.; Crockett, Jean B.; Brownell, Mary T.; Merrill, Kristen L.
2016-01-01
Students with disabilities (SWDs) depend upon special education teachers (SETs) to provide effective instruction. SETs, in turn, depend upon school leaders to provide conditions necessary to learn and engage in effective instructional practices for students with the most significant learning needs. A promising body of research indicates that…
ERIC Educational Resources Information Center
Wood, Amanda L.; Luiselli, James K.; Harchik, Alan E.
2007-01-01
The present study evaluates a training program with paraprofessional service providers at a community-based habilitation setting. Four staff were taught to implement alternative and augmentative communication instruction with an adult who had autism and mental retardation through a combination of instruction, demonstration, behavior rehearsal, and…
Magalhães, Ana Tereza de Matos; Goffi-Gomez, M Valéria Schmidt; Hoshino, Ana Cristina; Tsuji, Robinson Koji; Bento, Ricardo Ferreira; Brito, Rubens
2013-09-01
To identify the technological contributions of the newer version of speech processor to the first generation of multichannel cochlear implant and the satisfaction of users of the new technology. Among the new features available, we focused on the effect of the frequency allocation table, the T-SPL and C-SPL, and the preprocessing gain adjustments (adaptive dynamic range optimization). Prospective exploratory study. Cochlear implant center at hospital. Cochlear implant users of the Spectra processor with speech recognition in closed set. Seventeen patients were selected between the ages of 15 and 82 and deployed for more than 8 years. The technology update of the speech processor for the Nucleus 22. To determine Freedom's contribution, thresholds and speech perception tests were performed with the last map used with the Spectra and the maps created for Freedom. To identify the effect of the frequency allocation table, both upgraded and converted maps were programmed. One map was programmed with 25 dB T-SPL and 65 dB C-SPL and the other map with adaptive dynamic range optimization. To assess satisfaction, SADL and APHAB were used. All speech perception tests and all sound field thresholds were statistically better with the new speech processor; 64.7% of patients preferred maintaining the same frequency table that was suggested for the older processor. The sound field threshold was statistically significant at 500, 1,000, 1,500, and 2,000 Hz with 25 dB T-SPL/65 dB C-SPL. Regarding patient's satisfaction, there was a statistically significant improvement, only in the subscale of speech in noise abilities and phone use. The new technology improved the performance of patients with the first generation of multichannel cochlear implant.
Seeing the forest for the trees: Networked workstations as a parallel processing computer
NASA Technical Reports Server (NTRS)
Breen, J. O.; Meleedy, D. M.
1992-01-01
Unlike traditional 'serial' processing computers in which one central processing unit performs one instruction at a time, parallel processing computers contain several processing units, thereby, performing several instructions at once. Many of today's fastest supercomputers achieve their speed by employing thousands of processing elements working in parallel. Few institutions can afford these state-of-the-art parallel processors, but many already have the makings of a modest parallel processing system. Workstations on existing high-speed networks can be harnessed as nodes in a parallel processing environment, bringing the benefits of parallel processing to many. While such a system can not rival the industry's latest machines, many common tasks can be accelerated greatly by spreading the processing burden and exploiting idle network resources. We study several aspects of this approach, from algorithms to select nodes to speed gains in specific tasks. With ever-increasing volumes of astronomical data, it becomes all the more necessary to utilize our computing resources fully.
Design of object-oriented distributed simulation classes
NASA Technical Reports Server (NTRS)
Schoeffler, James D. (Principal Investigator)
1995-01-01
Distributed simulation of aircraft engines as part of a computer aided design package is being developed by NASA Lewis Research Center for the aircraft industry. The project is called NPSS, an acronym for 'Numerical Propulsion Simulation System'. NPSS is a flexible object-oriented simulation of aircraft engines requiring high computing speed. It is desirable to run the simulation on a distributed computer system with multiple processors executing portions of the simulation in parallel. The purpose of this research was to investigate object-oriented structures such that individual objects could be distributed. The set of classes used in the simulation must be designed to facilitate parallel computation. Since the portions of the simulation carried out in parallel are not independent of one another, there is the need for communication among the parallel executing processors which in turn implies need for their synchronization. Communication and synchronization can lead to decreased throughput as parallel processors wait for data or synchronization signals from other processors. As a result of this research, the following have been accomplished. The design and implementation of a set of simulation classes which result in a distributed simulation control program have been completed. The design is based upon MIT 'Actor' model of a concurrent object and uses 'connectors' to structure dynamic connections between simulation components. Connectors may be dynamically created according to the distribution of objects among machines at execution time without any programming changes. Measurements of the basic performance have been carried out with the result that communication overhead of the distributed design is swamped by the computation time of modules unless modules have very short execution times per iteration or time step. An analytical performance model based upon queuing network theory has been designed and implemented. Its application to realistic configurations has not been carried out.
Design of Object-Oriented Distributed Simulation Classes
NASA Technical Reports Server (NTRS)
Schoeffler, James D.
1995-01-01
Distributed simulation of aircraft engines as part of a computer aided design package being developed by NASA Lewis Research Center for the aircraft industry. The project is called NPSS, an acronym for "Numerical Propulsion Simulation System". NPSS is a flexible object-oriented simulation of aircraft engines requiring high computing speed. It is desirable to run the simulation on a distributed computer system with multiple processors executing portions of the simulation in parallel. The purpose of this research was to investigate object-oriented structures such that individual objects could be distributed. The set of classes used in the simulation must be designed to facilitate parallel computation. Since the portions of the simulation carried out in parallel are not independent of one another, there is the need for communication among the parallel executing processors which in turn implies need for their synchronization. Communication and synchronization can lead to decreased throughput as parallel processors wait for data or synchronization signals from other processors. As a result of this research, the following have been accomplished. The design and implementation of a set of simulation classes which result in a distributed simulation control program have been completed. The design is based upon MIT "Actor" model of a concurrent object and uses "connectors" to structure dynamic connections between simulation components. Connectors may be dynamically created according to the distribution of objects among machines at execution time without any programming changes. Measurements of the basic performance have been carried out with the result that communication overhead of the distributed design is swamped by the computation time of modules unless modules have very short execution times per iteration or time step. An analytical performance model based upon queuing network theory has been designed and implemented. Its application to realistic configurations has not been carried out.
A new parallel algorithm of MP2 energy calculations.
Ishimura, Kazuya; Pulay, Peter; Nagase, Shigeru
2006-03-01
A new parallel algorithm has been developed for second-order Møller-Plesset perturbation theory (MP2) energy calculations. Its main projected applications are for large molecules, for instance, for the calculation of dispersion interaction. Tests on a moderate number of processors (2-16) show that the program has high CPU and parallel efficiency. Timings are presented for two relatively large molecules, taxol (C(47)H(51)NO(14)) and luciferin (C(11)H(8)N(2)O(3)S(2)), the former with the 6-31G* and 6-311G** basis sets (1,032 and 1,484 basis functions, 164 correlated orbitals), and the latter with the aug-cc-pVDZ and aug-cc-pVTZ basis sets (530 and 1,198 basis functions, 46 correlated orbitals). An MP2 energy calculation on C(130)H(10) (1,970 basis functions, 265 correlated orbitals) completed in less than 2 h on 128 processors.
E-Learning: Students Input for Using Mobile Devices in Science Instructional Settings
ERIC Educational Resources Information Center
Yilmaz, Ozkan
2016-01-01
A variety of e-learning theories, models, and strategy have been developed to support educational settings. There are many factors for designing good instructional settings. This study set out to determine functionality of mobile devices, students who already have, and the student needs and views in relation to e-learning settings. The study…
Malakooti, Behnam; Yang, Ziyong
2004-02-01
In many real-world problems, the range of consequences of different alternatives are considerably different. In addition, sometimes, selection of a group of alternatives (instead of only one best alternative) is necessary. Traditional decision making approaches treat the set of alternatives with the same method of analysis and selection. In this paper, we propose clustering alternatives into different groups so that different methods of analysis, selection, and implementation for each group can be applied. As an example, consider the selection of a group of functions (or tasks) to be processed by a group of processors. The set of tasks can be grouped according to their similar criteria, and hence, each cluster of tasks to be processed by a processor. The selection of the best alternative for each clustered group can be performed using existing methods; however, the process of selecting groups is different than the process of selecting alternatives within a group. We develop theories and procedures for clustering discrete multiple criteria alternatives. We also demonstrate how the set of alternatives is clustered into mutually exclusive groups based on 1) similar features among alternatives; 2) ideal (or most representative) alternatives given by the decision maker; and 3) other preferential information of the decision maker. The clustering of multiple criteria alternatives also has the following advantages. 1) It decreases the set of alternatives to be considered by the decision maker (for example, different decision makers are assigned to different groups of alternatives). 2) It decreases the number of criteria. 3) It may provide a different approach for analyzing multiple decision makers problems. Each decision maker may cluster alternatives differently, and hence, clustering of alternatives may provide a basis for negotiation. The developed approach is applicable for solving a class of telecommunication networks problems where a set of objects (such as routers, processors, or intelligent autonomous vehicles) are to be clustered into similar groups. Objects are clustered based on several criteria and the decision maker's preferences.
PS3 CELL Development for Scientific Computation and Research
NASA Astrophysics Data System (ADS)
Christiansen, M.; Sevre, E.; Wang, S. M.; Yuen, D. A.; Liu, S.; Lyness, M. D.; Broten, M.
2007-12-01
The Cell processor is one of the most powerful processors on the market, and researchers in the earth sciences may find its parallel architecture to be very useful. A cell processor, with 7 cores, can easily be obtained for experimentation by purchasing a PlayStation 3 (PS3) and installing linux and the IBM SDK. Each core of the PS3 is capable of 25 GFLOPS giving a potential limit of 150 GFLOPS when using all 6 SPUs (synergistic processing units) by using vectorized algorithms. We have used the Cell's computational power to create a program which takes simulated tsunami datasets, parses them, and returns a colorized height field image using ray casting techniques. As expected, the time required to create an image is inversely proportional to the number of SPUs used. We believe that this trend will continue when multiple PS3s are chained using OpenMP functionality and are in the process of researching this. By using the Cell to visualize tsunami data, we have found that its greatest feature is its power. This fact entwines well with the needs of the scientific community where the limiting factor is time. Any algorithm, such as the heat equation, that can be subdivided into multiple parts can take advantage of the PS3 Cell's ability to split the computations across the 6 SPUs reducing required run time by one sixth. Further vectorization of the code can allow for 4 simultanious floating point operations by using the SIMD (single instruction multiple data) capabilities of the SPU increasing efficiency 24 times.
NASA Technical Reports Server (NTRS)
Green, F. M.; Resnick, D. R.
1979-01-01
An FMP (Flow Model Processor) was designed for use in the Numerical Aerodynamic Simulation Facility (NASF). The NASF was developed to simulate fluid flow over three-dimensional bodies in wind tunnel environments and in free space. The facility is applicable to studying aerodynamic and aircraft body designs. The following general topics are discussed in this volume: (1) FMP functional computer specifications; (2) FMP instruction specification; (3) standard product system components; (4) loosely coupled network (LCN) specifications/description; and (5) three appendices: performance of trunk allocation contention elimination (trace) method, LCN channel protocol and proposed LCN unified second level protocol.
Parallel processors and nonlinear structural dynamics algorithms and software
NASA Technical Reports Server (NTRS)
Belytschko, Ted; Gilbertsen, Noreen D.; Neal, Mark O.; Plaskacz, Edward J.
1989-01-01
The adaptation of a finite element program with explicit time integration to a massively parallel SIMD (single instruction multiple data) computer, the CONNECTION Machine is described. The adaptation required the development of a new algorithm, called the exchange algorithm, in which all nodal variables are allocated to the element with an exchange of nodal forces at each time step. The architectural and C* programming language features of the CONNECTION Machine are also summarized. Various alternate data structures and associated algorithms for nonlinear finite element analysis are discussed and compared. Results are presented which demonstrate that the CONNECTION Machine is capable of outperforming the CRAY XMP/14.
Interaction sorting method for molecular dynamics on multi-core SIMD CPU architecture.
Matvienko, Sergey; Alemasov, Nikolay; Fomin, Eduard
2015-02-01
Molecular dynamics (MD) is widely used in computational biology for studying binding mechanisms of molecules, molecular transport, conformational transitions, protein folding, etc. The method is computationally expensive; thus, the demand for the development of novel, much more efficient algorithms is still high. Therefore, the new algorithm designed in 2007 and called interaction sorting (IS) clearly attracted interest, as it outperformed the most efficient MD algorithms. In this work, a new IS modification is proposed which allows the algorithm to utilize SIMD processor instructions. This paper shows that the improvement provides an additional gain in performance, 9% to 45% in comparison to the original IS method.
Implicit schemes and parallel computing in unstructured grid CFD
NASA Technical Reports Server (NTRS)
Venkatakrishnam, V.
1995-01-01
The development of implicit schemes for obtaining steady state solutions to the Euler and Navier-Stokes equations on unstructured grids is outlined. Applications are presented that compare the convergence characteristics of various implicit methods. Next, the development of explicit and implicit schemes to compute unsteady flows on unstructured grids is discussed. Next, the issues involved in parallelizing finite volume schemes on unstructured meshes in an MIMD (multiple instruction/multiple data stream) fashion are outlined. Techniques for partitioning unstructured grids among processors and for extracting parallelism in explicit and implicit solvers are discussed. Finally, some dynamic load balancing ideas, which are useful in adaptive transient computations, are presented.
ERIC Educational Resources Information Center
Jorgensen, Christie L.
2016-01-01
Although instructional coaching and professional learning communities provide ongoing, job-embedded support and professional learning, little is known about what role the instructional coach serves within the setting of the professional learning community or what coaching skills teachers find most helpful within this setting. Research examining…
Categories for Observing Language Arts Instruction (COLAI).
ERIC Educational Resources Information Center
Benterud, Julianna G.
Designed to study individual use of time spent in reading during regularly scheduled language arts instruction in a natural classroom setting, this coding sheet consists of nine categories: (1) engagement, (2) area of language arts, (3) instructional setting, (4) partner (teacher or pupil(s)), (5) source of content, (6) type of unit, (7) assigned…
Enabling MPEG-2 video playback in embedded systems through improved data cache efficiency
NASA Astrophysics Data System (ADS)
Soderquist, Peter; Leeser, Miriam E.
1999-01-01
Digital video decoding, enabled by the MPEG-2 Video standard, is an important future application for embedded systems, particularly PDAs and other information appliances. Many such system require portability and wireless communication capabilities, and thus face severe limitations in size and power consumption. This places a premium on integration and efficiency, and favors software solutions for video functionality over specialized hardware. The processors in most embedded system currently lack the computational power needed to perform video decoding, but a related and equally important problem is the required data bandwidth, and the need to cost-effectively insure adequate data supply. MPEG data sets are very large, and generate significant amounts of excess memory traffic for standard data caches, up to 100 times the amount required for decoding. Meanwhile, cost and power limitations restrict cache sizes in embedded systems. Some systems, including many media processors, eliminate caches in favor of memories under direct, painstaking software control in the manner of digital signal processors. Yet MPEG data has locality which caches can exploit if properly optimized, providing fast, flexible, and automatic data supply. We propose a set of enhancements which target the specific needs of the heterogeneous types within the MPEG decoder working set. These optimizations significantly improve the efficiency of small caches, reducing cache-memory traffic by almost 70 percent, and can make an enhanced 4 KB cache perform better than a standard 1 MB cache. This performance improvement can enable high-resolution, full frame rate video playback in cheaper, smaller system than woudl otherwise be possible.
New insight into pecan boron nutrition
USDA-ARS?s Scientific Manuscript database
Alternate bearing by individual pecan [Carya illinoinensis (Wangenh.) K. Koch] trees is problematic for nut producers and processors. There are many unknowns regarding alternate bearing physiology, such as the relationship between boron and fruit set, nutmeat quality, and kernel maladies. Evidence...
48 CFR 52.219-18 - Notification of Competition Limited to Eligible 8(a) Concerns.
Code of Federal Regulations, 2010 CFR
2010-10-01
... conformance with the Business Activity Targets set forth in its approved business plan or any remedial action... business manufacturers or processors in the Federal market in accordance with 19.502-2(c), delete...
Effects of Goal-Setting Instruction on Academic Engagement for Students at Risk
ERIC Educational Resources Information Center
Rowe, Dawn A.; Mazzotti, Valerie L.; Ingram, Angela; Lee, Seunghee
2017-01-01
Research indicates teachers feel teaching goal-setting is an effective way to enhance academic engagement. However, teachers ultimately feel unprepared to embed goal-setting instruction into academic content to support active student engagement. Given the importance teachers place on goal-setting skills, there is a need to identify strategies to…
AltiVec performance increases for autonomous robotics for the MARSSCAPE architecture program
NASA Astrophysics Data System (ADS)
Gothard, Benny M.
2002-02-01
One of the main tall poles that must be overcome to develop a fully autonomous vehicle is the inability of the computer to understand its surrounding environment to a level that is required for the intended task. The military mission scenario requires a robot to interact in a complex, unstructured, dynamic environment. Reference A High Fidelity Multi-Sensor Scene Understanding System for Autonomous Navigation The Mobile Autonomous Robot Software Self Composing Adaptive Programming Environment (MarsScape) perception research addresses three aspects of the problem; sensor system design, processing architectures, and algorithm enhancements. A prototype perception system has been demonstrated on robotic High Mobility Multi-purpose Wheeled Vehicle and All Terrain Vehicle testbeds. This paper addresses the tall pole of processing requirements and the performance improvements based on the selected MarsScape Processing Architecture. The processor chosen is the Motorola Altivec-G4 Power PC(PPC) (1998 Motorola, Inc.), a highly parallized commercial Single Instruction Multiple Data processor. Both derived perception benchmarks and actual perception subsystems code will be benchmarked and compared against previous Demo II-Semi-autonomous Surrogate Vehicle processing architectures along with desktop Personal Computers(PC). Performance gains are highlighted with progress to date, and lessons learned and future directions are described.
Local wavelet transform: a cost-efficient custom processor for space image compression
NASA Astrophysics Data System (ADS)
Masschelein, Bart; Bormans, Jan G.; Lafruit, Gauthier
2002-11-01
Thanks to its intrinsic scalability features, the wavelet transform has become increasingly popular as decorrelator in image compression applications. Throuhgput, memory requirements and complexity are important parameters when developing hardware image compression modules. An implementation of the classical, global wavelet transform requires large memory sizes and implies a large latency between the availability of the input image and the production of minimal data entities for entropy coding. Image tiling methods, as proposed by JPEG2000, reduce the memory sizes and the latency, but inevitably introduce image artefacts. The Local Wavelet Transform (LWT), presented in this paper, is a low-complexity wavelet transform architecture using a block-based processing that results in the same transformed images as those obtained by the global wavelet transform. The architecture minimizes the processing latency with a limited amount of memory. Moreover, as the LWT is an instruction-based custom processor, it can be programmed for specific tasks, such as push-broom processing of infinite-length satelite images. The features of the LWT makes it appropriate for use in space image compression, where high throughput, low memory sizes, low complexity, low power and push-broom processing are important requirements.
Reumann, Matthias; Fitch, Blake G; Rayshubskiy, Aleksandr; Keller, David U J; Seemann, Gunnar; Dossel, Olaf; Pitman, Michael C; Rice, John J
2009-01-01
Orthogonal recursive bisection (ORB) algorithm can be used as data decomposition strategy to distribute a large data set of a cardiac model to a distributed memory supercomputer. It has been shown previously that good scaling results can be achieved using the ORB algorithm for data decomposition. However, the ORB algorithm depends on the distribution of computational load of each element in the data set. In this work we investigated the dependence of data decomposition and load balancing on different rotations of the anatomical data set to achieve optimization in load balancing. The anatomical data set was given by both ventricles of the Visible Female data set in a 0.2 mm resolution. Fiber orientation was included. The data set was rotated by 90 degrees around x, y and z axis, respectively. By either translating or by simply taking the magnitude of the resulting negative coordinates we were able to create 14 data set of the same anatomy with different orientation and position in the overall volume. Computation load ratios for non - tissue vs. tissue elements used in the data decomposition were 1:1, 1:2, 1:5, 1:10, 1:25, 1:38.85, 1:50 and 1:100 to investigate the effect of different load ratios on the data decomposition. The ten Tusscher et al. (2004) electrophysiological cell model was used in monodomain simulations of 1 ms simulation time to compare performance using the different data sets and orientations. The simulations were carried out for load ratio 1:10, 1:25 and 1:38.85 on a 512 processor partition of the IBM Blue Gene/L supercomputer. Th results show that the data decomposition does depend on the orientation and position of the anatomy in the global volume. The difference in total run time between the data sets is 10 s for a simulation time of 1 ms. This yields a difference of about 28 h for a simulation of 10 s simulation time. However, given larger processor partitions, the difference in run time decreases and becomes less significant. Depending on the processor partition size, future work will have to consider the orientation of the anatomy in the global volume for longer simulation runs.
Comparison of neuronal spike exchange methods on a Blue Gene/P supercomputer.
Hines, Michael; Kumar, Sameer; Schürmann, Felix
2011-01-01
For neural network simulations on parallel machines, interprocessor spike communication can be a significant portion of the total simulation time. The performance of several spike exchange methods using a Blue Gene/P (BG/P) supercomputer has been tested with 8-128 K cores using randomly connected networks of up to 32 M cells with 1 k connections per cell and 4 M cells with 10 k connections per cell, i.e., on the order of 4·10(10) connections (K is 1024, M is 1024(2), and k is 1000). The spike exchange methods used are the standard Message Passing Interface (MPI) collective, MPI_Allgather, and several variants of the non-blocking Multisend method either implemented via non-blocking MPI_Isend, or exploiting the possibility of very low overhead direct memory access (DMA) communication available on the BG/P. In all cases, the worst performing method was that using MPI_Isend due to the high overhead of initiating a spike communication. The two best performing methods-the persistent Multisend method using the Record-Replay feature of the Deep Computing Messaging Framework DCMF_Multicast; and a two-phase multisend in which a DCMF_Multicast is used to first send to a subset of phase one destination cores, which then pass it on to their subset of phase two destination cores-had similar performance with very low overhead for the initiation of spike communication. Departure from ideal scaling for the Multisend methods is almost completely due to load imbalance caused by the large variation in number of cells that fire on each processor in the interval between synchronization. Spike exchange time itself is negligible since transmission overlaps with computation and is handled by a DMA controller. We conclude that ideal performance scaling will be ultimately limited by imbalance between incoming processor spikes between synchronization intervals. Thus, counterintuitively, maximization of load balance requires that the distribution of cells on processors should not reflect neural net architecture but be randomly distributed so that sets of cells which are burst firing together should be on different processors with their targets on as large a set of processors as possible.
Ayres, Daniel L; Darling, Aaron; Zwickl, Derrick J; Beerli, Peter; Holder, Mark T; Lewis, Paul O; Huelsenbeck, John P; Ronquist, Fredrik; Swofford, David L; Cummings, Michael P; Rambaut, Andrew; Suchard, Marc A
2012-01-01
Phylogenetic inference is fundamental to our understanding of most aspects of the origin and evolution of life, and in recent years, there has been a concentration of interest in statistical approaches such as Bayesian inference and maximum likelihood estimation. Yet, for large data sets and realistic or interesting models of evolution, these approaches remain computationally demanding. High-throughput sequencing can yield data for thousands of taxa, but scaling to such problems using serial computing often necessitates the use of nonstatistical or approximate approaches. The recent emergence of graphics processing units (GPUs) provides an opportunity to leverage their excellent floating-point computational performance to accelerate statistical phylogenetic inference. A specialized library for phylogenetic calculation would allow existing software packages to make more effective use of available computer hardware, including GPUs. Adoption of a common library would also make it easier for other emerging computing architectures, such as field programmable gate arrays, to be used in the future. We present BEAGLE, an application programming interface (API) and library for high-performance statistical phylogenetic inference. The API provides a uniform interface for performing phylogenetic likelihood calculations on a variety of compute hardware platforms. The library includes a set of efficient implementations and can currently exploit hardware including GPUs using NVIDIA CUDA, central processing units (CPUs) with Streaming SIMD Extensions and related processor supplementary instruction sets, and multicore CPUs via OpenMP. To demonstrate the advantages of a common API, we have incorporated the library into several popular phylogenetic software packages. The BEAGLE library is free open source software licensed under the Lesser GPL and available from http://beagle-lib.googlecode.com. An example client program is available as public domain software.
Ayres, Daniel L.; Darling, Aaron; Zwickl, Derrick J.; Beerli, Peter; Holder, Mark T.; Lewis, Paul O.; Huelsenbeck, John P.; Ronquist, Fredrik; Swofford, David L.; Cummings, Michael P.; Rambaut, Andrew; Suchard, Marc A.
2012-01-01
Abstract Phylogenetic inference is fundamental to our understanding of most aspects of the origin and evolution of life, and in recent years, there has been a concentration of interest in statistical approaches such as Bayesian inference and maximum likelihood estimation. Yet, for large data sets and realistic or interesting models of evolution, these approaches remain computationally demanding. High-throughput sequencing can yield data for thousands of taxa, but scaling to such problems using serial computing often necessitates the use of nonstatistical or approximate approaches. The recent emergence of graphics processing units (GPUs) provides an opportunity to leverage their excellent floating-point computational performance to accelerate statistical phylogenetic inference. A specialized library for phylogenetic calculation would allow existing software packages to make more effective use of available computer hardware, including GPUs. Adoption of a common library would also make it easier for other emerging computing architectures, such as field programmable gate arrays, to be used in the future. We present BEAGLE, an application programming interface (API) and library for high-performance statistical phylogenetic inference. The API provides a uniform interface for performing phylogenetic likelihood calculations on a variety of compute hardware platforms. The library includes a set of efficient implementations and can currently exploit hardware including GPUs using NVIDIA CUDA, central processing units (CPUs) with Streaming SIMD Extensions and related processor supplementary instruction sets, and multicore CPUs via OpenMP. To demonstrate the advantages of a common API, we have incorporated the library into several popular phylogenetic software packages. The BEAGLE library is free open source software licensed under the Lesser GPL and available from http://beagle-lib.googlecode.com. An example client program is available as public domain software. PMID:21963610
Pedretti, Kevin
2008-11-18
A compute processor allocator architecture for allocating compute processors to run applications in a multiple processor computing apparatus is distributed among a subset of processors within the computing apparatus. Each processor of the subset includes a compute processor allocator. The compute processor allocators can share a common database of information pertinent to compute processor allocation. A communication path permits retrieval of information from the database independently of the compute processor allocators.
Enabling Next-Generation Multicore Platforms in Embedded Applications
2014-04-01
mapping to sets 129 − 256 ) to the second page in memory, color 2 (sets 257 − 384) to the third page, and so on. Then, after the 32nd page, all 212 sets...the Real-Time Nested Locking Protocol (RNLP) [56], a recently developed multiprocessor real-time locking protocol that optimally supports the...RELEASE; DISTRIBUTION UNLIMITED 15 In general, the problems of optimally assigning tasks to processors and colors to tasks are both NP-hard in the
DOE Office of Scientific and Technical Information (OSTI.GOV)
Newman, G.A.; Commer, M.
Three-dimensional (3D) geophysical imaging is now receiving considerable attention for electrical conductivity mapping of potential offshore oil and gas reservoirs. The imaging technology employs controlled source electromagnetic (CSEM) and magnetotelluric (MT) fields and treats geological media exhibiting transverse anisotropy. Moreover when combined with established seismic methods, direct imaging of reservoir fluids is possible. Because of the size of the 3D conductivity imaging problem, strategies are required exploiting computational parallelism and optimal meshing. The algorithm thus developed has been shown to scale to tens of thousands of processors. In one imaging experiment, 32,768 tasks/processors on the IBM Watson Research Blue Gene/Lmore » supercomputer were successfully utilized. Over a 24 hour period we were able to image a large scale field data set that previously required over four months of processing time on distributed clusters based on Intel or AMD processors utilizing 1024 tasks on an InfiniBand fabric. Electrical conductivity imaging using massively parallel computational resources produces results that cannot be obtained otherwise and are consistent with timeframes required for practical exploration problems.« less
Particle simulation of plasmas on the massively parallel processor
NASA Technical Reports Server (NTRS)
Gledhill, I. M. A.; Storey, L. R. O.
1987-01-01
Particle simulations, in which collective phenomena in plasmas are studied by following the self consistent motions of many discrete particles, involve several highly repetitive sets of calculations that are readily adaptable to SIMD parallel processing. A fully electromagnetic, relativistic plasma simulation for the massively parallel processor is described. The particle motions are followed in 2 1/2 dimensions on a 128 x 128 grid, with periodic boundary conditions. The two dimensional simulation space is mapped directly onto the processor network; a Fast Fourier Transform is used to solve the field equations. Particle data are stored according to an Eulerian scheme, i.e., the information associated with each particle is moved from one local memory to another as the particle moves across the spatial grid. The method is applied to the study of the nonlinear development of the whistler instability in a magnetospheric plasma model, with an anisotropic electron temperature. The wave distribution function is included as a new diagnostic to allow simulation results to be compared with satellite observations.
Item Difficulty in the Evaluation of Computer-Based Instruction: An Example from Neuroanatomy
Chariker, Julia H.; Naaz, Farah; Pani, John R.
2012-01-01
This article reports large item effects in a study of computer-based learning of neuroanatomy. Outcome measures of the efficiency of learning, transfer of learning, and generalization of knowledge diverged by a wide margin across test items, with certain sets of items emerging as particularly difficult to master. In addition, the outcomes of comparisons between instructional methods changed with the difficulty of the items to be learned. More challenging items better differentiated between instructional methods. This set of results is important for two reasons. First, it suggests that instruction may be more efficient if sets of consistently difficult items are the targets of instructional methods particularly suited to them. Second, there is wide variation in the published literature regarding the outcomes of empirical evaluations of computer-based instruction. As a consequence, many questions arise as to the factors that may affect such evaluations. The present paper demonstrates that the level of challenge in the material that is presented to learners is an important factor to consider in the evaluation of a computer-based instructional system. PMID:22231801
Item difficulty in the evaluation of computer-based instruction: an example from neuroanatomy.
Chariker, Julia H; Naaz, Farah; Pani, John R
2012-01-01
This article reports large item effects in a study of computer-based learning of neuroanatomy. Outcome measures of the efficiency of learning, transfer of learning, and generalization of knowledge diverged by a wide margin across test items, with certain sets of items emerging as particularly difficult to master. In addition, the outcomes of comparisons between instructional methods changed with the difficulty of the items to be learned. More challenging items better differentiated between instructional methods. This set of results is important for two reasons. First, it suggests that instruction may be more efficient if sets of consistently difficult items are the targets of instructional methods particularly suited to them. Second, there is wide variation in the published literature regarding the outcomes of empirical evaluations of computer-based instruction. As a consequence, many questions arise as to the factors that may affect such evaluations. The present article demonstrates that the level of challenge in the material that is presented to learners is an important factor to consider in the evaluation of a computer-based instructional system. Copyright © 2011 American Association of Anatomists.
ERIC Educational Resources Information Center
Leh, Jayne
2011-01-01
Substantial evidence indicates that teacher-delivered schema-based instruction (SBI) facilitates significant increases in mathematics word problem solving (WPS) skills for diverse students; however research is unclear whether technology affordances facilitate superior gains in computer-mediated (CM) instruction in mathematics WPS when compared to…
Analysis of the Supporting Websites for the Use of Instructional Games in K-12 Settings
ERIC Educational Resources Information Center
Kebritchi, Mansureh; Hirumi, Atsusi; Kappers, Wendi; Henry, Renee
2009-01-01
This paper identifies resources to be included in a website designed to facilitate the integration of instructional games in K-12 settings. Guidelines and supporting components are based on a survey of K-12 educators who are integrating games, an analysis of existing instructional game websites, and summaries of literature on the use of…
The Effects of Text Structure Instruction on Expository Reading Comprehension: A Meta-Analysis
ERIC Educational Resources Information Center
Hebert, Michael; Bohaty, Janet J.; Nelson, J. Ron; Brown, Jessica
2016-01-01
In this meta-analysis of 45 studies involving students in Grades 2-12, the authors present evidence on the effects of text structure instruction on the expository reading comprehension of students. The meta-analysis was deigned to answer 2 sets of questions. The first set of questions examined the effectiveness of text structure instruction on…
ERIC Educational Resources Information Center
Schulz, Russel E.; And Others
This Programming Design Guide (PDG) was developed to permit the offline Job Aid for Selecting Instructional Setting, which is one of 13 job aids presently available for use with the Instructional Systems Development (ISD) model, to be available in an inquiry-type, online version. It is intended to provide computer programmers with all of the…
Instructional Set, Deep Relaxation and Growth Enhancement: A Pilot Study
ERIC Educational Resources Information Center
Leeb, Charles; And Others
1976-01-01
This study provides experimental evidence that instructional set can influence access to altered states of consciousness. Fifteen male subjects were randomly assigned to three groups, each of which received the same autogenic biofeedback training in hand temperature control, but each group received a different attitudinal set. (Editor)
Grade-related differences in strategy use in multidigit division in two instructional settings.
Hickendorff, Marian; Torbeyns, Joke; Verschaffel, Lieven
2017-11-23
We aimed to investigate upper elementary children's strategy use in the domain of multidigit division in two instructional settings: the Netherlands and Flanders (Belgium). A cross-sectional sample of 119 Dutch and 122 Flemish fourth to sixth graders solved a varied set of multidigit division problems. With latent class analysis, three distinct strategy profiles were identified: children consistently using number-based strategies, children combining the use of column-based and number-based strategies, and children combining the use of digit-based and number-based strategies. The relation between children's strategy profiles and their instructional setting (country) and grade were generally in line with instructional differences, but large individual differences remained. Furthermore, Dutch children more frequently made adaptive strategy choices and realistic solutions than their Flemish peers. These results complement and refine previous findings on children's strategy use in relation to mathematics instruction. Statement of contribution What is already known? Mathematics education reform emphasizes variety, adaptivity, and insight in arithmetic strategies. Countries have different instructional trajectories for multidigit division. Mixed results on the impact of instruction on children's strategy use in multidigit division. What does this study add? Latent class analysis identified three meaningful strategy profiles in children from grades 4-6. These strategy profiles substantially differed between children. Dutch and Flemish children's strategy use is related to their instructional trajectory. © 2017 The Authors. British Journal of Developmental Psychology published by John Wiley & Sons Ltd on behalf of British Psychological Society.
Design Spectrum Analysis in NASTRAN
NASA Technical Reports Server (NTRS)
Butler, T. G.
1984-01-01
The utility of Design Spectrum Analysis is to give a mode by mode characterization of the behavior of a design under a given loading. The theory of design spectrum is discussed after operations are explained. User instructions are taken up here in three parts: Transient Preface, Maximum Envelope Spectrum, and RMS Average Spectrum followed by a Summary Table. A single DMAP ALTER packet will provide for all parts of the design spectrum operations. The starting point for getting a modal break-down of the response to acceleration loading is the Modal Transient rigid format. After eigenvalue extraction, modal vectors need to be isolated in the full set of physical coordinates (P-sized as opposed to the D-sized vectors in RF 12). After integration for transient response the results are scanned over the solution time interval for the peak values and for the times that they occur. A module called SCAN was written to do this job, that organizes these maxima into a diagonal output matrix. The maximum amplifier in each mode is applied to the eigenvector of each mode which then reveals the maximum displacements, stresses, forces and boundary reactions that the structure will experience for a load history, mode by mode. The standard NASTRAN output processors have been modified for this task. It is required that modes be normalized to mass.
Identifying Instructional Strategies Used to Design Mobile Learning in a Corporate Setting
ERIC Educational Resources Information Center
Jackson-Butler, Uletta
2016-01-01
The purpose of this qualitative embedded multiple case study was to describe what instructional strategies corporate instructional designers were using to design mobile learning and to understand from their experiences which instructional strategies they believed enhance learning. Participants were five instructional designers who were actively…
PANDA: A distributed multiprocessor operating system
DOE Office of Scientific and Technical Information (OSTI.GOV)
Chubb, P.
1989-01-01
PANDA is a design for a distributed multiprocessor and an operating system. PANDA is designed to allow easy expansion of both hardware and software. As such, the PANDA kernel provides only message passing and memory and process management. The other features needed for the system (device drivers, secondary storage management, etc.) are provided as replaceable user tasks. The thesis presents PANDA's design and implementation, both hardware and software. PANDA uses multiple 68010 processors sharing memory on a VME bus, each such node potentially connected to others via a high speed network. The machine is completely homogeneous: there are no differencesmore » between processors that are detectable by programs running on the machine. A single two-processor node has been constructed. Each processor contains memory management circuits designed to allow processors to share page tables safely. PANDA presents a programmers' model similar to the hardware model: a job is divided into multiple tasks, each having its own address space. Within each task, multiple processes share code and data. Tasks can send messages to each other, and set up virtual circuits between themselves. Peripheral devices such as disc drives are represented within PANDA by tasks. PANDA divides secondary storage into volumes, each volume being accessed by a volume access task, or VAT. All knowledge about the way that data is stored on a disc is kept in its volume's VAT. The design is such that PANDA should provide a useful testbed for file systems and device drivers, as these can be installed without recompiling PANDA itself, and without rebooting the machine.« less
The role of absorption in women's sexual response to erotica: a cognitive-affective investigation.
Sheen, Jade; Koukounas, Eric
2009-01-01
This study examined the effect of absorption on women's emotional and cognitive processing of erotic film. Absorption was experimentally manipulated using 2 different sets of test session instructions. The first, participant-oriented, instruction set directed participants to absorb themselves in the erotic film presentation, imagining that they were active participants in the sexual activities depicted. The second, spectator-oriented, instruction set directed participants to observe and assess the erotic film excerpt as impartial spectators. The participant-oriented instruction set was found to elicit greater subjective absorption in women than the spectator-oriented instruction set, and women reported greater subjective sexual arousal in the former set compared with the latter. Thus, it appears that the degree to which a woman becomes absorbed in an erotic stimulus may affect her subsequent subjective sexual arousal. Also, women reported greater degrees of positive affect when they took a participant-oriented perspective than when they viewed the erotic materials as impartial spectators. Thus, participants who were highly absorbed in the erotic film excerpt were more likely to view the stimulus favorably. By contrast, the degree to which women became absorbed in the stimulus had no effect on their reported negative affect. Future directions for examining female response patterns are suggested.
Strong scaling and speedup to 16,384 processors in cardiac electro-mechanical simulations.
Reumann, Matthias; Fitch, Blake G; Rayshubskiy, Aleksandr; Keller, David U J; Seemann, Gunnar; Dossel, Olaf; Pitman, Michael C; Rice, John J
2009-01-01
High performance computing is required to make feasible simulations of whole organ models of the heart with biophysically detailed cellular models in a clinical setting. Increasing model detail by simulating electrophysiology and mechanical models increases computation demands. We present scaling results of an electro - mechanical cardiac model of two ventricles and compare them to our previously published results using an electrophysiological model only. The anatomical data-set was given by both ventricles of the Visible Female data-set in a 0.2 mm resolution. Fiber orientation was included. Data decomposition for the distribution onto the distributed memory system was carried out by orthogonal recursive bisection. Load weight ratios for non-tissue vs. tissue elements used in the data decomposition were 1:1, 1:2, 1:5, 1:10, 1:25, 1:38.85, 1:50 and 1:100. The ten Tusscher et al. (2004) electrophysiological cell model was used and the Rice et al. (1999) model for the computation of the calcium transient dependent force. Scaling results for 512, 1024, 2048, 4096, 8192 and 16,384 processors were obtained for 1 ms simulation time. The simulations were carried out on an IBM Blue Gene/L supercomputer. The results show linear scaling from 512 to 16,384 processors with speedup factors between 1.82 and 2.14 between partitions. The most optimal load ratio was 1:25 for on all partitions. However, a shift towards load ratios with higher weight for the tissue elements can be recognized as can be expected when adding computational complexity to the model while keeping the same communication setup. This work demonstrates that it is potentially possible to run simulations of 0.5 s using the presented electro-mechanical cardiac model within 1.5 hours.
Runtime support and compilation methods for user-specified data distributions
NASA Technical Reports Server (NTRS)
Ponnusamy, Ravi; Saltz, Joel; Choudhury, Alok; Hwang, Yuan-Shin; Fox, Geoffrey
1993-01-01
This paper describes two new ideas by which an HPF compiler can deal with irregular computations effectively. The first mechanism invokes a user specified mapping procedure via a set of compiler directives. The directives allow use of program arrays to describe graph connectivity, spatial location of array elements, and computational load. The second mechanism is a simple conservative method that in many cases enables a compiler to recognize that it is possible to reuse previously computed information from inspectors (e.g. communication schedules, loop iteration partitions, information that associates off-processor data copies with on-processor buffer locations). We present performance results for these mechanisms from a Fortran 90D compiler implementation.
Parallel/distributed direct method for solving linear systems
NASA Technical Reports Server (NTRS)
Lin, Avi
1990-01-01
A new family of parallel schemes for directly solving linear systems is presented and analyzed. It is shown that these schemes exhibit a near optimal performance and enjoy several important features: (1) For large enough linear systems, the design of the appropriate paralleled algorithm is insensitive to the number of processors as its performance grows monotonically with them; (2) It is especially good for large matrices, with dimensions large relative to the number of processors in the system; (3) It can be used in both distributed parallel computing environments and tightly coupled parallel computing systems; and (4) This set of algorithms can be mapped onto any parallel architecture without any major programming difficulties or algorithmical changes.
Signal processor for processing ultrasonic receiver signals
Fasching, George E.
1980-01-01
A signal processor is provided which uses an analog integrating circuit in conjunction with a set of digital counters controlled by a precision clock for sampling timing to provide an improved presentation of an ultrasonic transmitter/receiver signal. The signal is sampled relative to the transmitter trigger signal timing at precise times, the selected number of samples are integrated and the integrated samples are transferred and held for recording on a strip chart recorder or converted to digital form for storage. By integrating multiple samples taken at precisely the same time with respect to the trigger for the ultrasonic transmitter, random noise, which is contained in the ultrasonic receiver signal, is reduced relative to the desired useful signal.
Memory-Intensive Benchmarks: IRAM vs. Cache-Based Machines
NASA Technical Reports Server (NTRS)
Biswas, Rupak; Gaeke, Brian R.; Husbands, Parry; Li, Xiaoye S.; Oliker, Leonid; Yelick, Katherine A.; Biegel, Bryan (Technical Monitor)
2002-01-01
The increasing gap between processor and memory performance has lead to new architectural models for memory-intensive applications. In this paper, we explore the performance of a set of memory-intensive benchmarks and use them to compare the performance of conventional cache-based microprocessors to a mixed logic and DRAM processor called VIRAM. The benchmarks are based on problem statements, rather than specific implementations, and in each case we explore the fundamental hardware requirements of the problem, as well as alternative algorithms and data structures that can help expose fine-grained parallelism or simplify memory access patterns. The benchmarks are characterized by their memory access patterns, their basic control structures, and the ratio of computation to memory operation.
Efficient accesses of data structures using processing near memory
DOE Office of Scientific and Technical Information (OSTI.GOV)
Jayasena, Nuwan S.; Zhang, Dong Ping; Diez, Paula Aguilera
Systems, apparatuses, and methods for implementing efficient queues and other data structures. A queue may be shared among multiple processors and/or threads without using explicit software atomic instructions to coordinate access to the queue. System software may allocate an atomic queue and corresponding queue metadata in system memory and return, to the requesting thread, a handle referencing the queue metadata. Any number of threads may utilize the handle for accessing the atomic queue. The logic for ensuring the atomicity of accesses to the atomic queue may reside in a management unit in the memory controller coupled to the memory wheremore » the atomic queue is allocated.« less
Vectorization for Molecular Dynamics on Intel Xeon Phi Corpocessors
NASA Astrophysics Data System (ADS)
Yi, Hongsuk
2014-03-01
Many modern processors are capable of exploiting data-level parallelism through the use of single instruction multiple data (SIMD) execution. The new Intel Xeon Phi coprocessor supports 512 bit vector registers for the high performance computing. In this paper, we have developed a hierarchical parallelization scheme for accelerated molecular dynamics simulations with the Terfoff potentials for covalent bond solid crystals on Intel Xeon Phi coprocessor systems. The scheme exploits multi-level parallelism computing. We combine thread-level parallelism using a tightly coupled thread-level and task-level parallelism with 512-bit vector register. The simulation results show that the parallel performance of SIMD implementations on Xeon Phi is apparently superior to their x86 CPU architecture.
ERIC Educational Resources Information Center
Chapman, Stanley L.; Jeffrey, D. Balfour
1978-01-01
In comprehensive wieght loss program, overweight women exposed to instruction in self-standard setting and to situational management techniques lost more weight than those instructed only in situational management techniques. Findings illustrate facilitative effect of teaching individuals to set specific, objective, and realistic goals for eating…
A Critical Review of Instructional Design Process of Distance Learning System
ERIC Educational Resources Information Center
Chaudry, Muhammad Ajmal; ur-Rahman, Fazal
2010-01-01
Instructional design refers to planning, development, delivery and evaluation of instructional system. It is an applied field of study aiming at the application of descriptive research outcomes in regular instructional settings. The present study was designed to critically review the process of instructional design at Allama Iqbal Open University…
RASSP signal processing architectures
NASA Astrophysics Data System (ADS)
Shirley, Fred; Bassett, Bob; Letellier, J. P.
1995-06-01
The rapid prototyping of application specific signal processors (RASSP) program is an ARPA/tri-service effort to dramatically improve the process by which complex digital systems, particularly embedded signal processors, are specified, designed, documented, manufactured, and supported. The domain of embedded signal processing was chosen because it is important to a variety of military and commercial applications as well as for the challenge it presents in terms of complexity and performance demands. The principal effort is being performed by two major contractors, Lockheed Sanders (Nashua, NH) and Martin Marietta (Camden, NJ). For both, improvements in methodology are to be exercised and refined through the performance of individual 'Demonstration' efforts. The Lockheed Sanders' Demonstration effort is to develop an infrared search and track (IRST) processor. In addition, both contractors' results are being measured by a series of externally administered (by Lincoln Labs) six-month Benchmark programs that measure process improvement as a function of time. The first two Benchmark programs are designing and implementing a synthetic aperture radar (SAR) processor. Our demonstration team is using commercially available VME modules from Mercury Computer to assemble a multiprocessor system scalable from one to hundreds of Intel i860 microprocessors. Custom modules for the sensor interface and display driver are also being developed. This system implements either proprietary or Navy owned algorithms to perform the compute-intensive IRST function in real time in an avionics environment. Our Benchmark team is designing custom modules using commercially available processor ship sets, communication submodules, and reconfigurable logic devices. One of the modules contains multiple vector processors optimized for fast Fourier transform processing. Another module is a fiberoptic interface that accepts high-rate input data from the sensors and provides video-rate output data to a display. This paper discusses the impact of simulation on choosing signal processing algorithms and architectures, drawing from the experiences of the Demonstration and Benchmark inter-company teams at Lockhhed Sanders, Motorola, Hughes, and ISX.
A Cost Effective System Design Approach for Critical Space Systems
NASA Technical Reports Server (NTRS)
Abbott, Larry Wayne; Cox, Gary; Nguyen, Hai
2000-01-01
NASA-JSC required an avionics platform capable of serving a wide range of applications in a cost-effective manner. In part, making the avionics platform cost effective means adhering to open standards and supporting the integration of COTS products with custom products. Inherently, operation in space requires low power, mass, and volume while retaining high performance, reconfigurability, scalability, and upgradability. The Universal Mini-Controller project is based on a modified PC/104-Plus architecture while maintaining full compatibility with standard COTS PC/104 products. The architecture consists of a library of building block modules, which can be mixed and matched to meet a specific application. A set of NASA developed core building blocks, processor card, analog input/output card, and a Mil-Std-1553 card, have been constructed to meet critical functions and unique interfaces. The design for the processor card is based on the PowerPC architecture. This architecture provides an excellent balance between power consumption and performance, and has an upgrade path to the forthcoming radiation hardened PowerPC processor. The processor card, which makes extensive use of surface mount technology, has a 166 MHz PowerPC 603e processor, 32 Mbytes of error detected and corrected RAM, 8 Mbytes of Flash, and I Mbytes of EPROM, on a single PC/104-Plus card. Similar densities have been achieved with the quad channel Mil-Std-1553 card and the analog input/output cards. The power management built into the processor and its peripheral chip allows the power and performance of the system to be adjusted to meet the requirements of the application, allowing another dimension to the flexibility of the Universal Mini-Controller. Unique mechanical packaging allows the Universal Mini-Controller to accommodate standard COTS and custom oversized PC/104-Plus cards. This mechanical packaging also provides thermal management via conductive cooling of COTS boards, which are typically designed for convection cooling methods.
European Scientific Notes. Volume 35, Number 12,
1981-12-31
been redesigned to work A. Osorio, which was organized some 3 with the Intel 8085 microprocessor, it years ago and contains about half of the has the...operational set. attempt to derive a set of invariants MOISE is based on the Intel 8085A upon which virtually speaker-invariant microprocessor, and...FACILITY software interface; a Research Signal Processor (RSP) using reduced computational It has been IBM International’s complexity algorithms for
Complex Instruction Set Quantum Computing
NASA Astrophysics Data System (ADS)
Sanders, G. D.; Kim, K. W.; Holton, W. C.
1998-03-01
In proposed quantum computers, electromagnetic pulses are used to implement logic gates on quantum bits (qubits). Gates are unitary transformations applied to coherent qubit wavefunctions and a universal computer can be created using a minimal set of gates. By applying many elementary gates in sequence, desired quantum computations can be performed. This reduced instruction set approach to quantum computing (RISC QC) is characterized by serial application of a few basic pulse shapes and a long coherence time. However, the unitary matrix of the overall computation is ultimately a unitary matrix of the same size as any of the elementary matrices. This suggests that we might replace a sequence of reduced instructions with a single complex instruction using an optimally taylored pulse. We refer to this approach as complex instruction set quantum computing (CISC QC). One trades the requirement for long coherence times for the ability to design and generate potentially more complex pulses. We consider a model system of coupled qubits interacting through nearest neighbor coupling and show that CISC QC can reduce the time required to perform quantum computations.
ERIC Educational Resources Information Center
Shriver, Edgar L.; And Others
This document furnishes a complete copy of the Test Subject's Instructions and the Test Administrator's Handbook for a battery of criterion referenced Job Task Performance Tests (JTPT) for electronic maintenance. General information is provided on soldering, Radar Set AN/APN-147(v), Radar Set Special Equipment, Radar Set Bench Test Set-Up, and…
Instructional Rounds in Action
ERIC Educational Resources Information Center
Roberts, John E.
2012-01-01
"Instructional Rounds in Action" is an invaluable guide for those involved in implementing instructional rounds as the foundation and framework for systemic improvement in schools. Over the past few years, districts across the United States, Canada, and Australia have begun implementing "instructional rounds," a set of ideas…
ERIC Educational Resources Information Center
Malan, Pierre
This paper presents an overview of information technology development. The first section sets the scene, comparing the first WAN (Wide Area Network) and Intel processor to current technology. The birth of the microcomputer is described in the second section, including historical background on semiconductors, microprocessors, and the microcomputer.…
Physics Teachers' Professional Development in the Project "physics in Context"
NASA Astrophysics Data System (ADS)
Mikelskis-Seifert, Silke; Duit, Reinders
2013-06-01
Developing teachers' ways of thinking about "good" instruction as well as their views of the teaching and learning process is generally seen as essential for improving teaching behaviour and implementation of more efficient teaching and learning settings. Major deficiencies of German physics instruction as revealed by a nationwide video-study on the practice of physics instruction are addressed. Teachers participating in the project are made familiar with recent views of efficient instruction on the one hand and develop context-based instructional settings on the other. The evaluation resulted in partly encouraging findings. However, it also turned out that a number of teachers' ways of thinking about good instruction did only develop to a somewhat limited degree. The most impressive changes occurred for teachers who enjoyed the most intensive coaching.
Physics Teachers' Professional Development in the Project "physics in Context"
NASA Astrophysics Data System (ADS)
Mikelskis-Seifert, Silke; Duit, Reinders
2012-12-01
Developing teachers' ways of thinking about "good" instruction as well as their views of the teaching and learning process is generally seen as essential for improving teaching behaviour and implementation of more efficient teaching and learning settings. Major deficiencies of German physics instruction as revealed by a nationwide video-study on the practice of physics instruction are addressed. Teachers participating in the project are made familiar with recent views of efficient instruction on the one hand and develop context-based instructional settings on the other. The evaluation resulted in partly encouraging findings. However, it also turned out that a number of teachers' ways of thinking about good instruction did only develop to a somewhat limited degree. The most impressive changes occurred for teachers who enjoyed the most intensive coaching.
Chip architecture - A revolution brewing
NASA Astrophysics Data System (ADS)
Guterl, F.
1983-07-01
Techniques being explored by microchip designers and manufacturers to both speed up memory access and instruction execution while protecting memory are discussed. Attention is given to hardwiring control logic, pipelining for parallel processing, devising orthogonal instruction sets for interchangeable instruction fields, and the development of hardware for implementation of virtual memory and multiuser systems to provide memory management and protection. The inclusion of microcode in mainframes eliminated logic circuits that control timing and gating of the CPU. However, improvements in memory architecture have reduced access time to below that needed for instruction execution. Hardwiring the functions as a virtual memory enhances memory protection. Parallelism involves a redundant architecture, which allows identical operations to be performed simultaneously, and can be directed with microcode to avoid abortion of intermediate instructions once on set of instructions has been completed.
Quasi-Newton parallel geometry optimization methods
NASA Astrophysics Data System (ADS)
Burger, Steven K.; Ayers, Paul W.
2010-07-01
Algorithms for parallel unconstrained minimization of molecular systems are examined. The overall framework of minimization is the same except for the choice of directions for updating the quasi-Newton Hessian. Ideally these directions are chosen so the updated Hessian gives steps that are same as using the Newton method. Three approaches to determine the directions for updating are presented: the straightforward approach of simply cycling through the Cartesian unit vectors (finite difference), a concurrent set of minimizations, and the Lanczos method. We show the importance of using preconditioning and a multiple secant update in these approaches. For the Lanczos algorithm, an initial set of directions is required to start the method, and a number of possibilities are explored. To test the methods we used the standard 50-dimensional analytic Rosenbrock function. Results are also reported for the histidine dipeptide, the isoleucine tripeptide, and cyclic adenosine monophosphate. All of these systems show a significant speed-up with the number of processors up to about eight processors.
The resolution of identity and chain of spheres approximations for the LPNO-CCSD singles Fock term
NASA Astrophysics Data System (ADS)
Izsák, Róbert; Hansen, Andreas; Neese, Frank
2012-10-01
In the present work, the RIJCOSX approximation, developed earlier for accelerating the SCF procedure, is applied to one of the limiting factors of LPNO-CCSD calculations: the evaluation of the singles Fock term. It turns out that the introduction of RIJCOSX in the evaluation of the closed shell LPNO-CCSD singles Fock term causes errors below the microhartree limit. If the proposed procedure is also combined with RIJCOSX in SCF, then a somewhat larger error occurs, but reaction energy errors will still remain negligible. The speedup for the singles Fock term only is about 9-10 fold for the largest basis set applied. For the case of Penicillin using the def2-QZVPP basis set, a single point energy evaluation takes 2 day 16 h on a single processor leading to a total speedup of 2.6 as compared to a fully analytic calculation. Using eight processors, the same calculation takes only 14 h.
ERIC Educational Resources Information Center
Jett, Janice Rowe
2013-01-01
With the steady increase of ubiquitous computing initiatives across the country in the last decade, there is a pressing need for specific research looking at content area instruction in 1:1 settings. This qualitative multiple case study examines writing instruction at two middle schools as it is delivered by experienced teachers in five English…
Managing Small Group Instruction in an Integrated Preschool Setting.
ERIC Educational Resources Information Center
O'Connell, Joanne Curry
1986-01-01
A structured small group instructional setting helps to teach mainstreamed handicapped preschoolers the skills necessary to interact with the classroom materials without direct supervision. Examples are cited of individualized play activities with puzzles, paint, and play dough. (CL)
Branch classification: A new mechanism for improving branch predictor performance
DOE Office of Scientific and Technical Information (OSTI.GOV)
Chang, P.Y.; Hao, E.; Patt, Y.
There is wide agreement that one of the most significant impediments to the performance of current and future pipelined superscalar processors is the presence of conditional branches in the instruction stream. Speculative execution is one solution to the branch problem, but speculative work is discarded if a branch is mispredicted. For it to be effective, speculative work is discarded if a branch is mispredicted. For it to be effective, speculative execution requires a very accurate branch predictor; 95% accuracy is not good enough. This paper proposes branch classification, a methodology for building more accurate branch predictors. Branch classification allows anmore » individual branch instruction to be associated with the branch predictor best suited to predict its direction. Using this approach, a hybrid branch predictor can be constructed such that each component branch predictor predicts those branches for which it is best suited. To demonstrate the usefulness of branch classification, an example classification scheme is given and a new hybrid predictor is built based on this scheme which achieves a higher prediction accuracy than any branch predictor previously reported in the literature.« less
Small Microprocessor for ASIC or FPGA Implementation
NASA Technical Reports Server (NTRS)
Kleyner, Igor; Katz, Richard; Blair-Smith, Hugh
2011-01-01
A small microprocessor, suitable for use in applications in which high reliability is required, was designed to be implemented in either an application-specific integrated circuit (ASIC) or a field-programmable gate array (FPGA). The design is based on commercial microprocessor architecture, making it possible to use available software development tools and thereby to implement the microprocessor at relatively low cost. The design features enhancements, including trapping during execution of illegal instructions. The internal structure of the design yields relatively high performance, with a significant decrease, relative to other microprocessors that perform the same functions, in the number of microcycles needed to execute macroinstructions. The problem meant to be solved in designing this microprocessor was to provide a modest level of computational capability in a general-purpose processor while adding as little as possible to the power demand, size, and weight of a system into which the microprocessor would be incorporated. As designed, this microprocessor consumes very little power and occupies only a small portion of a typical modern ASIC or FPGA. The microprocessor operates at a rate of about 4 million instructions per second with clock frequency of 20 MHz.
Probing for quantum speedup in spin-glass problems with planted solutions
NASA Astrophysics Data System (ADS)
Hen, Itay; Job, Joshua; Albash, Tameem; Rønnow, Troels F.; Troyer, Matthias; Lidar, Daniel A.
2015-10-01
The availability of quantum annealing devices with hundreds of qubits has made the experimental demonstration of a quantum speedup for optimization problems a coveted, albeit elusive goal. Going beyond earlier studies of random Ising problems, here we introduce a method to construct a set of frustrated Ising-model optimization problems with tunable hardness. We study the performance of a D-Wave Two device (DW2) with up to 503 qubits on these problems and compare it to a suite of classical algorithms, including a highly optimized algorithm designed to compete directly with the DW2. The problems are generated around predetermined ground-state configurations, called planted solutions, which makes them particularly suitable for benchmarking purposes. The problem set exhibits properties familiar from constraint satisfaction (SAT) problems, such as a peak in the typical hardness of the problems, determined by a tunable clause density parameter. We bound the hardness regime where the DW2 device either does not or might exhibit a quantum speedup for our problem set. While we do not find evidence for a speedup for the hardest and most frustrated problems in our problem set, we cannot rule out that a speedup might exist for some of the easier, less frustrated problems. Our empirical findings pertain to the specific D-Wave processor and problem set we studied and leave open the possibility that future processors might exhibit a quantum speedup on the same problem set.
Instructional Development for Clinical Settings.
ERIC Educational Resources Information Center
Cranton, P. A.
Clinical teaching involves instruction in a natural health-related environment which allows students to observe and participate in the actual practice of the profession. The use of objectives, the sequence of instruction, the instructional methods and materials, and the evaluation of student performance constitute the components studied in…
Instructional Design: System Strategies.
ERIC Educational Resources Information Center
Ledford, Bruce R.; Sleeman, Phillip J.
This book is intended as a source for those who desire to apply a coherent system of instructional design, thereby insuring accountability. Chapter 1 covers the instructional design process, including: instructional technology; the role of evaluation; goal setting; the psychology of teaching and learning; task analysis; operational objectives;…
English-Language Writing Instruction in Poland
ERIC Educational Resources Information Center
Reichelt, Melinda
2005-01-01
Second language writing scholars have undertaken descriptions of English-language writing instruction in a variety of international settings, describing the role of various contextual factors in shaping English-language writing instruction. This article describes English-language writing instruction at various levels in Poland, noting how it is…
Teacher's Perceived Instructional Needs in the Northwest Region.
ERIC Educational Resources Information Center
Lilly, M. Stephen; Kelleher, John
A survey was conducted to determine teachers' perceived needs in direct instruction and related professional activities and to determine teachers' familiarity with 14 sets of instructional materials, which were said to represent materials available through Special Education Instructional Materials Centers (SEIMC). Data indicated consistency of…
Number of Instructional Days/Hours in the School Year
ERIC Educational Resources Information Center
Rowland, Julie
2014-01-01
While state requirements vary on the number of instructional days and/or hours in the school year, the majority of states require 180 days of student instruction. Most also specify the minimum length of time that constitutes an instructional day. Some states set instructional time in terms of days, some specify hours, and some provide…
Parallel hyperbolic PDE simulation on clusters: Cell versus GPU
NASA Astrophysics Data System (ADS)
Rostrup, Scott; De Sterck, Hans
2010-12-01
Increasingly, high-performance computing is looking towards data-parallel computational devices to enhance computational performance. Two technologies that have received significant attention are IBM's Cell Processor and NVIDIA's CUDA programming model for graphics processing unit (GPU) computing. In this paper we investigate the acceleration of parallel hyperbolic partial differential equation simulation on structured grids with explicit time integration on clusters with Cell and GPU backends. The message passing interface (MPI) is used for communication between nodes at the coarsest level of parallelism. Optimizations of the simulation code at the several finer levels of parallelism that the data-parallel devices provide are described in terms of data layout, data flow and data-parallel instructions. Optimized Cell and GPU performance are compared with reference code performance on a single x86 central processing unit (CPU) core in single and double precision. We further compare the CPU, Cell and GPU platforms on a chip-to-chip basis, and compare performance on single cluster nodes with two CPUs, two Cell processors or two GPUs in a shared memory configuration (without MPI). We finally compare performance on clusters with 32 CPUs, 32 Cell processors, and 32 GPUs using MPI. Our GPU cluster results use NVIDIA Tesla GPUs with GT200 architecture, but some preliminary results on recently introduced NVIDIA GPUs with the next-generation Fermi architecture are also included. This paper provides computational scientists and engineers who are considering porting their codes to accelerator environments with insight into how structured grid based explicit algorithms can be optimized for clusters with Cell and GPU accelerators. It also provides insight into the speed-up that may be gained on current and future accelerator architectures for this class of applications. Program summaryProgram title: SWsolver Catalogue identifier: AEGY_v1_0 Program summary URL:http://cpc.cs.qub.ac.uk/summaries/AEGY_v1_0.html Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland Licensing provisions: GPL v3 No. of lines in distributed program, including test data, etc.: 59 168 No. of bytes in distributed program, including test data, etc.: 453 409 Distribution format: tar.gz Programming language: C, CUDA Computer: Parallel Computing Clusters. Individual compute nodes may consist of x86 CPU, Cell processor, or x86 CPU with attached NVIDIA GPU accelerator. Operating system: Linux Has the code been vectorised or parallelized?: Yes. Tested on 1-128 x86 CPU cores, 1-32 Cell Processors, and 1-32 NVIDIA GPUs. RAM: Tested on Problems requiring up to 4 GB per compute node. Classification: 12 External routines: MPI, CUDA, IBM Cell SDK Nature of problem: MPI-parallel simulation of Shallow Water equations using high-resolution 2D hyperbolic equation solver on regular Cartesian grids for x86 CPU, Cell Processor, and NVIDIA GPU using CUDA. Solution method: SWsolver provides 3 implementations of a high-resolution 2D Shallow Water equation solver on regular Cartesian grids, for CPU, Cell Processor, and NVIDIA GPU. Each implementation uses MPI to divide work across a parallel computing cluster. Additional comments: Sub-program numdiff is used for the test run.
42 CFR 488.68 - State Agency responsibilities for OASIS collection and data base requirements.
Code of Federal Regulations, 2014 CFR
2014-10-01
...— (1) Instruct each HHA on the administration of the data set, privacy/confidentiality of the data set, and integration of the OASIS data set into the facility's own record keeping system; (2) Instruct each... and data base requirements. 488.68 Section 488.68 Public Health CENTERS FOR MEDICARE & MEDICAID...
42 CFR 488.68 - State Agency responsibilities for OASIS collection and data base requirements.
Code of Federal Regulations, 2011 CFR
2011-10-01
...— (1) Instruct each HHA on the administration of the data set, privacy/confidentiality of the data set, and integration of the OASIS data set into the facility's own record keeping system; (2) Instruct each... and data base requirements. 488.68 Section 488.68 Public Health CENTERS FOR MEDICARE & MEDICAID...
42 CFR 488.68 - State Agency responsibilities for OASIS collection and data base requirements.
Code of Federal Regulations, 2013 CFR
2013-10-01
...— (1) Instruct each HHA on the administration of the data set, privacy/confidentiality of the data set, and integration of the OASIS data set into the facility's own record keeping system; (2) Instruct each... and data base requirements. 488.68 Section 488.68 Public Health CENTERS FOR MEDICARE & MEDICAID...
42 CFR 488.68 - State Agency responsibilities for OASIS collection and data base requirements.
Code of Federal Regulations, 2012 CFR
2012-10-01
...— (1) Instruct each HHA on the administration of the data set, privacy/confidentiality of the data set, and integration of the OASIS data set into the facility's own record keeping system; (2) Instruct each... and data base requirements. 488.68 Section 488.68 Public Health CENTERS FOR MEDICARE & MEDICAID...
78 FR 67099 - Submission for OMB Review; Comment Request
Federal Register 2010, 2011, 2012, 2013, 2014
2013-11-08
... professionals, who provide meals in institutional settings, can locate processors who manufacture foods... Service Title: USDA Food Connect Web site. OMB Control Number: 0581-0224. Summary of Collection: The USDA Food Connect Web site (previously known as the USDA Food and Commodity Connection Web site) operates...
Data Input for Libraries: State-of-the-Art Report.
ERIC Educational Resources Information Center
Buckland, Lawrence F.
This brief overview of new manuscript preparation methods which allow authors and editors to set their own type discusses the advantages and disadvantages of optical character recognition (OCR), microcomputers and personal computers, minicomputers, and word processors for editing and database entry. Potential library applications are also…
Wood, Amanda L; Luiselli, James K; Harchik, Alan E
2007-11-01
The present study evaluates a training program with paraprofessional service providers at a community-based habilitation setting. Four staff were taught to implement alternative and augmentative communication instruction with an adult who had autism and mental retardation through a combination of instruction, demonstration, behavior rehearsal, and performance feedback. Training was conducted under natural conditions at the adult's group home residence. Three of the four staff were able to maintain near-100% instructional accuracy following initial training. The results add to the limited research literature concerning community-based training of direct-care personnel.
A Model for Designing Library Instruction for Distance Learning
ERIC Educational Resources Information Center
Rand, Angela Doucet
2013-01-01
Providing library instruction in distance learning environments presents a unique set of challenges for instructional librarians. Innovations in computer-mediated communication and advances in cognitive science research provide the opportunity for designing library instruction that meets a variety of student information seeking needs. Using a…