driven processor node: Topics by Science.gov

Sample records for driven processor node

Direct match data flow machine apparatus and process for data driven computing

DOEpatents

Davidson, G.S.; Grafe, V.G.

1997-08-12

A data flow computer and method of computing are disclosed which utilize a data driven processor node architecture. The apparatus in a preferred embodiment includes a plurality of First-In-First-Out (FIFO) registers, a plurality of related data flow memories, and a processor. The processor makes the necessary calculations and includes a control unit to generate signals to enable the appropriate FIFO register receiving the result. In a particular embodiment, there are three FIFO registers per node: an input FIFO register to receive input information from an outside source and provide it to the data flow memories; an output FIFO register to provide output information from the processor to an outside recipient; and an internal FIFO register to provide information from the processor back to the data flow memories. The data flow memories are comprised of four commonly addressed memories. A parameter memory holds the A and B parameters used in the calculations; an opcode memory holds the instruction; a target memory holds the output address; and a tag memory contains status bits for each parameter. One status bit indicates whether the corresponding parameter is in the parameter memory, and one status bit indicates whether the stored information in the corresponding data parameter is to be reused. The tag memory outputs a "fire" signal (signal R VALID) when all of the necessary information has been stored in the data flow memories, and thus when the instruction is ready to be fired to the processor. 11 figs.
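The firing rule described in this abstract can be illustrated with a small software model. The following is only a sketch under stated assumptions, not the patented hardware: the class name `DataFlowNode`, its methods, and the convention that a target of `None` routes a result to the output FIFO are all hypothetical, chosen for illustration. It models the four commonly addressed memories, the two tag bits per parameter (present and reuse), and the internal FIFO that feeds results back.

```python
import operator
from collections import deque


class DataFlowNode:
    """Toy model of one data driven node (hypothetical names throughout)."""

    def __init__(self, size):
        # Four commonly addressed memories, all indexed by the same address.
        self.param = [{"A": None, "B": None} for _ in range(size)]  # parameter memory
        self.opcode = [None] * size                                 # opcode memory
        self.target = [None] * size                                 # target memory
        # Tag memory: a "present" bit and a "reuse" bit for each parameter.
        self.present = [{"A": False, "B": False} for _ in range(size)]
        self.reuse = [{"A": False, "B": False} for _ in range(size)]
        self.internal_fifo = deque()  # results fed back to the memories
        self.output_fifo = deque()    # results leaving the node

    def load(self, addr, op, target):
        """Program one instruction. Assumption: target None means 'output FIFO'."""
        self.opcode[addr] = op
        self.target[addr] = target

    def store(self, addr, port, value, reuse=False):
        """Write parameter A or B; fire once both parameters are present."""
        self.param[addr][port] = value
        self.present[addr][port] = True
        self.reuse[addr][port] = reuse
        if all(self.present[addr].values()):  # the "fire" condition
            self._fire(addr)

    def _fire(self, addr):
        result = self.opcode[addr](self.param[addr]["A"], self.param[addr]["B"])
        for port in ("A", "B"):
            if not self.reuse[addr][port]:
                self.present[addr][port] = False  # consumed unless marked reusable
        tgt = self.target[addr]
        if tgt is None:
            self.output_fifo.append(result)
        else:
            self.internal_fifo.append((tgt[0], tgt[1], result))

    def run(self):
        """Drain the internal FIFO, delivering results to their targets."""
        while self.internal_fifo:
            addr, port, value = self.internal_fifo.popleft()
            self.store(addr, port, value)


# Usage: compute (2 + 3) * 4, with the constant 4 marked as reusable.
node = DataFlowNode(2)
node.load(0, operator.add, (1, "A"))  # result of the add goes to parameter A at address 1
node.load(1, operator.mul, None)      # result of the multiply leaves the node
node.store(1, "B", 4, reuse=True)
node.store(0, "A", 2)
node.store(0, "B", 3)                 # both parameters present, so address 0 fires
node.run()
print(list(node.output_fifo))
```

No instruction is ever scheduled explicitly in this model; execution is triggered only by the arrival of the last missing parameter, which is the sense in which the node is data driven.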
Data flow machine for data driven computing

DOEpatents

Davidson, G.S.; Grafe, V.G.

1988-07-22

A data flow computer and method of computing are disclosed which utilize a data driven processor node architecture. The apparatus in a preferred embodiment includes a plurality of First-In-First-Out (FIFO) registers, a plurality of related data flow memories, and a processor. The processor makes the necessary calculations and includes a control unit to generate signals to enable the appropriate FIFO register receiving the result. In a particular embodiment, there are three FIFO registers per node: an input FIFO register to receive input information from an outside source and provide it to the data flow memories; an output FIFO register to provide output information from the processor to an outside recipient; and an internal FIFO register to provide information from the processor back to the data flow memories. The data flow memories are comprised of four commonly addressed memories. A parameter memory holds the A and B parameters used in the calculations; an opcode memory holds the instruction; a target memory holds the output address; and a tag memory contains status bits for each parameter. One status bit indicates whether the corresponding parameter is in the parameter memory, and one status bit indicates whether the stored information in the corresponding data parameter is to be reused. The tag memory outputs a "fire" signal (signal R VALID) when all of the necessary information has been stored in the data flow memories, and thus when the instruction is ready to be fired to the processor. 11 figs.
Data flow machine for data driven computing

DOEpatents

Davidson, George S.; Grafe, Victor G.

1995-01-01

A data flow computer and method of computing are disclosed which utilize a data driven processor node architecture. The apparatus in a preferred embodiment includes a plurality of First-In-First-Out (FIFO) registers, a plurality of related data flow memories, and a processor. The processor makes the necessary calculations and includes a control unit to generate signals to enable the appropriate FIFO register receiving the result. In a particular embodiment, there are three FIFO registers per node: an input FIFO register to receive input information from an outside source and provide it to the data flow memories; an output FIFO register to provide output information from the processor to an outside recipient; and an internal FIFO register to provide information from the processor back to the data flow memories. The data flow memories are comprised of four commonly addressed memories. A parameter memory holds the A and B parameters used in the calculations; an opcode memory holds the instruction; a target memory holds the output address; and a tag memory contains status bits for each parameter. One status bit indicates whether the corresponding parameter is in the parameter memory, and one status bit indicates whether the stored information in the corresponding data parameter is to be reused. The tag memory outputs a "fire" signal (signal R VALID) when all of the necessary information has been stored in the data flow memories, and thus when the instruction is ready to be fired to the processor.
Direct match data flow machine apparatus and process for data driven computing

DOEpatents

Davidson, George S.; Grafe, Victor Gerald

1997-01-01

A data flow computer and method of computing are disclosed which utilize a data driven processor node architecture. The apparatus in a preferred embodiment includes a plurality of First-In-First-Out (FIFO) registers, a plurality of related data flow memories, and a processor. The processor makes the necessary calculations and includes a control unit to generate signals to enable the appropriate FIFO register receiving the result. In a particular embodiment, there are three FIFO registers per node: an input FIFO register to receive input information from an outside source and provide it to the data flow memories; an output FIFO register to provide output information from the processor to an outside recipient; and an internal FIFO register to provide information from the processor back to the data flow memories. The data flow memories are comprised of four commonly addressed memories. A parameter memory holds the A and B parameters used in the calculations; an opcode memory holds the instruction; a target memory holds the output address; and a tag memory contains status bits for each parameter. One status bit indicates whether the corresponding parameter is in the parameter memory, and one status bit indicates whether the stored information in the corresponding data parameter is to be reused. The tag memory outputs a "fire" signal (signal R VALID) when all of the necessary information has been stored in the data flow memories, and thus when the instruction is ready to be fired to the processor.
Direct match data flow memory for data driven computing

DOEpatents

Davidson, George S.; Grafe, Victor Gerald

1997-01-01

A data flow computer and method of computing are disclosed which utilize a data driven processor node architecture. The apparatus in a preferred embodiment includes a plurality of First-In-First-Out (FIFO) registers, a plurality of related data flow memories, and a processor. The processor makes the necessary calculations and includes a control unit to generate signals to enable the appropriate FIFO register receiving the result. In a particular embodiment, there are three FIFO registers per node: an input FIFO register to receive input information from an outside source and provide it to the data flow memories; an output FIFO register to provide output information from the processor to an outside recipient; and an internal FIFO register to provide information from the processor back to the data flow memories. The data flow memories are comprised of four commonly addressed memories. A parameter memory holds the A and B parameters used in the calculations; an opcode memory holds the instruction; a target memory holds the output address; and a tag memory contains status bits for each parameter. One status bit indicates whether the corresponding parameter is in the parameter memory, and one status bit indicates whether the stored information in the corresponding data parameter is to be reused. The tag memory outputs a "fire" signal (signal R VALID) when all of the necessary information has been stored in the data flow memories, and thus when the instruction is ready to be fired to the processor.
Direct match data flow memory for data driven computing

DOEpatents

Davidson, G.S.; Grafe, V.G.

1997-10-07

A data flow computer and method of computing are disclosed which utilize a data driven processor node architecture. The apparatus in a preferred embodiment includes a plurality of First-In-First-Out (FIFO) registers, a plurality of related data flow memories, and a processor. The processor makes the necessary calculations and includes a control unit to generate signals to enable the appropriate FIFO register receiving the result. In a particular embodiment, there are three FIFO registers per node: an input FIFO register to receive input information from an outside source and provide it to the data flow memories; an output FIFO register to provide output information from the processor to an outside recipient; and an internal FIFO register to provide information from the processor back to the data flow memories. The data flow memories are comprised of four commonly addressed memories. A parameter memory holds the A and B parameters used in the calculations; an opcode memory holds the instruction; a target memory holds the output address; and a tag memory contains status bits for each parameter. One status bit indicates whether the corresponding parameter is in the parameter memory, and one status bit indicates whether the stored information in the corresponding data parameter is to be reused. The tag memory outputs a "fire" signal (signal R VALID) when all of the necessary information has been stored in the data flow memories, and thus when the instruction is ready to be fired to the processor. 11 figs.
Broadcasting collective operation contributions throughout a parallel computer

DOEpatents

Faraj, Ahmad [Rochester, MN]

2012-02-21

Methods, systems, and products are disclosed for broadcasting collective operation contributions throughout a parallel computer. The parallel computer includes a plurality of compute nodes connected together through a data communications network. Each compute node has a plurality of processors for use in collective parallel operations on the parallel computer. Broadcasting collective operation contributions throughout a parallel computer according to embodiments of the present invention includes: transmitting, by each processor on each compute node, that processor's collective operation contribution to the other processors on that compute node using intra-node communications; and transmitting on a designated network link, by each processor on each compute node according to a serial processor transmission sequence, that processor's collective operation contribution to the other processors on the other compute nodes using inter-node communications.
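The two transmission steps named in this abstract can be sketched as a small simulation. This is a sketch under assumptions, not the patented method: the function name `broadcast_contributions` is hypothetical, and the "serial processor transmission sequence" is interpreted here as the processors of a node taking turns on the shared link in processor-index order, which the abstract does not spell out.

```python
def broadcast_contributions(nodes):
    """Simulate the two-step broadcast.

    nodes[n][p] is the contribution of processor p on compute node n.
    Returns (received, link_log), where received[n][p] is the set of
    contributions known to that processor and link_log records the order
    in which processors used the inter-node link.
    """
    received = [[{c} for c in node] for node in nodes]

    # Step 1: intra-node communication. Every processor shares its
    # contribution with the other processors on the same compute node.
    for n, node in enumerate(nodes):
        for p in range(len(node)):
            received[n][p].update(node)

    # Step 2: inter-node communication. Assumption: processors take turns
    # by index, so only one processor per node drives the link at a time.
    link_log = []
    slots = max(len(node) for node in nodes)
    for p in range(slots):
        for n, node in enumerate(nodes):
            if p >= len(node):
                continue
            link_log.append((n, p))
            for m, other in enumerate(nodes):
                if m != n:
                    for q in range(len(other)):
                        received[m][q].add(node[p])
    return received, link_log


# Usage: two compute nodes with two processors each.
received, link_log = broadcast_contributions([["a0", "a1"], ["b0", "b1"]])
print(sorted(received[0][0]))
print(link_log)
```

After both steps every processor holds every contribution, and the link log shows that transmissions on the inter-node link were serialized per node rather than issued by all processors at once.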
Switch for serial or parallel communication networks

DOEpatents

Crosette, D.B.

1994-07-19

A communication switch apparatus and a method for use in a geographically extensive serial, parallel or hybrid communication network linking a multi-processor or parallel processing system has a very low software processing overhead in order to accommodate random bursts of high density data. Associated with each processor is a communication switch. A data source and a data destination, a sensor suite or robot for example, may also be associated with a switch. The configuration of the switches in the network is coordinated through a master processor node and depends on the operational phase of the multi-processor network: data acquisition, data processing, and data exchange. The master processor node passes information on the state to be assumed by each switch to the processor node associated with the switch. The processor node then operates a series of multi-state switches internal to each communication switch. The communication switch does not parse and interpret communication protocol and message routing information. During a data acquisition phase, the communication switch couples sensors producing data to the processor node associated with the switch, to a downlink destination on the communications network, or to both. It also may couple an uplink data source to its processor node. During the data exchange phase, the switch couples its processor node or an uplink data source to a downlink destination (which may include a processor node or a robot), or couples an uplink source to its processor node and its processor node to a downlink destination. 9 figs.
Switch for serial or parallel communication networks

DOEpatents

Crosette, Dario B.

1994-01-01

A communication switch apparatus and a method for use in a geographically extensive serial, parallel or hybrid communication network linking a multi-processor or parallel processing system has a very low software processing overhead in order to accommodate random bursts of high density data. Associated with each processor is a communication switch. A data source and a data destination, a sensor suite or robot for example, may also be associated with a switch. The configuration of the switches in the network is coordinated through a master processor node and depends on the operational phase of the multi-processor network: data acquisition, data processing, and data exchange. The master processor node passes information on the state to be assumed by each switch to the processor node associated with the switch. The processor node then operates a series of multi-state switches internal to each communication switch. The communication switch does not parse and interpret communication protocol and message routing information. During a data acquisition phase, the communication switch couples sensors producing data to the processor node associated with the switch, to a downlink destination on the communications network, or to both. It also may couple an uplink data source to its processor node. During the data exchange phase, the switch couples its processor node or an uplink data source to a downlink destination (which may include a processor node or a robot), or couples an uplink source to its processor node and its processor node to a downlink destination.
Hypercluster - Parallel processing for computational mechanics

NASA Technical Reports Server (NTRS)

Blech, Richard A.

1988-01-01

An account is given of the development status, performance capabilities and implications for further development of NASA-Lewis' testbed 'hypercluster' parallel computer network, in which multiple processors communicate through a shared memory. Processors have local as well as shared memory; the hypercluster is expanded in the same manner as the hypercube, with processor clusters replacing the normal single processor node. The NASA-Lewis machine has three nodes with a vector personality and one node with a scalar personality. Each of the vector nodes uses four board-level vector processors, while the scalar node uses four general-purpose microcomputer boards.
Novel Hybrid Scheduling Technique for Sensor Nodes with Mixed Criticality Tasks.

PubMed

Micea, Mihai-Victor; Stangaciu, Cristina-Sorina; Stangaciu, Valentin; Curiac, Daniel-Ioan

2017-06-26

Sensor networks are increasingly becoming a key technology for complex control applications. Their potential use in safety- and time-critical domains has raised the need for task scheduling mechanisms specially adapted to sensor node specific requirements, often materialized in predictable jitter-less execution of tasks characterized by different criticality levels. This paper offers an efficient scheduling solution, named Hybrid Hard Real-Time Scheduling (H²RTS), which combines a static, clock driven method with a dynamic, event driven scheduling technique, in order to provide high execution predictability, while keeping a high node Central Processing Unit (CPU) utilization factor. From the detailed, integrated schedulability analysis of the H²RTS, a set of sufficiency tests is introduced and demonstrated based on the processor demand and linear upper bound metrics. The performance and correct behavior of the proposed hybrid scheduling technique have been extensively evaluated and validated both on a simulator and on a sensor mote equipped with an ARM7 microcontroller.
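For readers unfamiliar with the processor demand metric mentioned in this abstract, the following sketch shows the classic processor demand test for earliest-deadline-first scheduling. It is not the H²RTS sufficiency tests, which the abstract does not reproduce. Assumptions: tasks are synchronous and periodic, each given as a hypothetical tuple `(C, D, T)` of integer worst-case execution time, relative deadline, and period, with D no greater than T. The function names `demand` and `edf_feasible` are illustrative.

```python
from fractions import Fraction
from functools import reduce
from math import gcd


def demand(tasks, t):
    """Total execution time of all jobs with both release and deadline in [0, t]."""
    return sum(max(0, (t - d) // p + 1) * c for (c, d, p) in tasks)


def edf_feasible(tasks):
    """Classic processor demand test for synchronous periodic tasks with D <= T.

    Feasible if utilization does not exceed 1 and the demand never exceeds
    the available time at any absolute deadline within one hyperperiod.
    """
    if sum(Fraction(c, p) for (c, _, p) in tasks) > 1:
        return False
    hyper = reduce(lambda a, b: a * b // gcd(a, b), (p for (_, _, p) in tasks))
    deadlines = sorted(
        {d + k * p for (_, d, p) in tasks for k in range(hyper // p)}
    )
    return all(demand(tasks, t) <= t for t in deadlines)


# Usage: three tasks with implicit deadlines, then two tasks that overload t = 2.
print(edf_feasible([(1, 4, 4), (2, 6, 6), (3, 12, 12)]))
print(edf_feasible([(2, 2, 4), (2, 2, 4)]))
```

The second task set has total utilization of exactly 1 yet fails the test, because both jobs must finish by time 2 and together need 4 units; this is why a demand check at each deadline is needed in addition to a utilization bound.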
A wireless laser displacement sensor node for structural health monitoring.

PubMed

Park, Hyo Seon; Kim, Jong Moon; Choi, Se Woon; Kim, Yousok

2013-09-30

This study describes a wireless laser displacement sensor node that measures displacement as a representative damage index for structural health monitoring (SHM). The proposed measurement system consists of a laser displacement sensor (LDS) and a customized wireless sensor node. Wireless communication is enabled by a sensor node that consists of a sensor module, a code division multiple access (CDMA) communication module, a processor, and a power module. An LDS with a long measurement distance is chosen to increase field applicability. For a wireless sensor node driven by a battery, we use a power control module with a low-power processor, which facilitates switching between the sleep and active modes, thus maximizing the power consumption efficiency during non-measurement and non-transfer periods. The CDMA mode is also used to overcome the limitation of communication distance, which is a challenge for wireless sensor networks and wireless communication. To evaluate the reliability and field applicability of the proposed wireless displacement measurement system, the system is tested onsite to obtain the required vertical displacement measurements during the construction of mega-trusses and an edge truss, which are the primary structural members in a large-scale irregular building currently under construction. The measurement values confirm the validity of the proposed wireless displacement measurement system and its potential for use in safety evaluations of structural elements.
Parallel processing data network of master and slave transputers controlled by a serial control network

DOEpatents

Crosetto, D.B.

1996-12-31

The present device provides for a dynamically configurable communication network having a multi-processor parallel processing system having a serial communication network and a high speed parallel communication network. The serial communication network is used to disseminate commands from a master processor to a plurality of slave processors to effect communication protocol, to control transmission of high density data among nodes and to monitor each slave processor's status. The high speed parallel processing network is used to effect the transmission of high density data among nodes in the parallel processing system. Each node comprises a transputer, a digital signal processor, a parallel transfer controller, and two three-port memory devices. A communication switch within each node connects it to a fast parallel hardware channel through which all high density data arrives or leaves the node. 6 figs.
Asynchronous broadcast for ordered delivery between compute nodes in a parallel computing system where packet header space is limited

DOE Office of Scientific and Technical Information (OSTI.GOV)

Kumar, Sameer

Disclosed is a mechanism on receiving processors in a parallel computing system for providing order to data packets received from a broadcast call and to distinguish data packets received at nodes from several incoming asynchronous broadcast messages where header space is limited. In the present invention, processors at lower leafs of a tree do not need to obtain a broadcast message by directly accessing the data in a root processor's buffer. Instead, each subsequent intermediate node's rank id information is squeezed into the software header of packet headers. In turn, the entire broadcast message is not transferred from the root processor to each processor in a communicator but instead is replicated on several intermediate nodes which then replicate the message to nodes in lower leafs. Hence, the intermediate compute nodes become "virtual root compute nodes" for the purpose of replicating the broadcast message to lower levels of a tree.
Novel Hybrid Scheduling Technique for Sensor Nodes with Mixed Criticality Tasks

PubMed Central

Micea, Mihai-Victor; Stangaciu, Cristina-Sorina; Stangaciu, Valentin; Curiac, Daniel-Ioan

2017-01-01

Sensor networks are increasingly becoming a key technology for complex control applications. Their potential use in safety- and time-critical domains has raised the need for task scheduling mechanisms specially adapted to sensor node specific requirements, often materialized in predictable jitter-less execution of tasks characterized by different criticality levels. This paper offers an efficient scheduling solution, named Hybrid Hard Real-Time Scheduling (H2RTS), which combines a static, clock driven method with a dynamic, event driven scheduling technique, in order to provide high execution predictability, while keeping a high node Central Processing Unit (CPU) utilization factor. From the detailed, integrated schedulability analysis of the H2RTS, a set of sufficiency tests is introduced and demonstrated based on the processor demand and linear upper bound metrics. The performance and correct behavior of the proposed hybrid scheduling technique have been extensively evaluated and validated both on a simulator and on a sensor mote equipped with an ARM7 microcontroller. PMID:28672856
Parallel processing data network of master and slave transputers controlled by a serial control network

DOEpatents

Crosetto, Dario B.

1996-01-01

The present device provides for a dynamically configurable communication network having a multi-processor parallel processing system having a serial communication network and a high speed parallel communication network. The serial communication network is used to disseminate commands from a master processor (100) to a plurality of slave processors (200) to effect communication protocol, to control transmission of high density data among nodes and to monitor each slave processor's status. The high speed parallel processing network is used to effect the transmission of high density data among nodes in the parallel processing system. Each node comprises a transputer (104), a digital signal processor (114), a parallel transfer controller (106), and two three-port memory devices. A communication switch (108) within each node (100) connects it to a fast parallel hardware channel (70) through which all high density data arrives or leaves the node.
Soft-core processor study for node-based architectures.

DOE Office of Scientific and Technical Information (OSTI.GOV)

Van Houten, Jonathan Roger; Jarosz, Jason P.; Welch, Benjamin James

2008-09-01

Node-based architecture (NBA) designs for future satellite projects hold the promise of decreasing system development time and costs, size, weight, and power and positioning the laboratory to address other emerging mission opportunities quickly. Reconfigurable Field Programmable Gate Array (FPGA) based modules will comprise the core of several of the NBA nodes. Microprocessing capabilities will be necessary with varying degrees of mission-specific performance requirements on these nodes. To enable the flexibility of these reconfigurable nodes, it is advantageous to incorporate the microprocessor into the FPGA itself, either as a hardcore processor built into the FPGA or as a soft-core processor built out of FPGA elements. This document describes the evaluation of three reconfigurable FPGA based processors for use in future NBA systems--two soft cores (MicroBlaze and non-fault-tolerant LEON) and one hard core (PowerPC 405). Two standard performance benchmark applications were developed for each processor. The first, Dhrystone, is a fixed-point operation metric. The second, Whetstone, is a floating-point operation metric. Several trials were run at varying code locations, loop counts, processor speeds, and cache configurations. FPGA resource utilization was recorded for each configuration. Cache configurations impacted the results greatly; for optimal processor efficiency it is necessary to enable caches on the processors. Processor caches carry a penalty; cache error mitigation is necessary when operating in a radiation environment.
Stanford Hardware Development Program

NASA Technical Reports Server (NTRS)

Peterson, A.; Linscott, I.; Burr, J.

1986-01-01

Architectures for high performance, digital signal processing, particularly for high resolution, wide band spectrum analysis were developed. These developments are intended to provide instrumentation for NASA's Search for Extraterrestrial Intelligence (SETI) program. The real time signal processing is both formal and experimental. The efficient organization and optimal scheduling of signal processing algorithms were investigated. The work is complemented by efforts in processor architecture design and implementation. A high resolution, multichannel spectrometer that incorporates special purpose microcoded signal processors is being tested. A general purpose signal processor for the data from the multichannel spectrometer was designed to function as the processing element in a highly concurrent machine. The processor performance required for the spectrometer is in the range of 1000 to 10,000 million instructions per second (MIPS). Multiple node processor configurations, where each node performs at 100 MIPS, are sought. The nodes are microprogrammable and are interconnected through a network with high bandwidth for neighboring nodes, and medium bandwidth for nodes at larger distance. The implementation of both the current multichannel spectrometer and the signal processor as Very Large Scale Integration CMOS chip sets was commenced.
Picoradio: Communication/Computation Piconodes for Sensor Networks

DTIC Science & Technology

2003-01-02

diagram of PicoNode III, or Quark node. It is made from two custom chips, Strange RF and Charm digital processor, and is complemented by a set of...the chipset comprising of Strange (analog OOK transceiver) and Charm (digital processor) chips. 44 Figure 33: System block diagram of the Quark node...19 2.B PICONODE II - TWO-CHIP PICONODE IMPLEMENTATION ......................................... 21 2.B.1 Baseband processor (BBP
Method for simultaneous overlapped communications between neighboring processors in a multiple

DOEpatents

Benner, Robert E.; Gustafson, John L.; Montry, Gary R.

1991-01-01

A parallel computing system and method having improved performance where a program is concurrently run on a plurality of nodes for reducing total processing time, each node having a processor, a memory, and a predetermined number of communication channels connected to the node and independently connected directly to other nodes. The present invention improves performance of the parallel computing system by providing a system which can provide efficient communication between the processors and between the system and input and output devices. A method is also disclosed which can locate defective nodes with the computing system.

A universal computer control system for motors

NASA Technical Reports Server (NTRS)

Szakaly, Zoltan F. (Inventor)

1991-01-01

A control system for a multi-motor system such as a space telerobot, having a remote computational node and a local computational node interconnected with one another by a high speed data link is described. A Universal Computer Control System (UCCS) for the telerobot is located at each node. Each node is provided with a multibus computer system which is characterized by a plurality of processors with all processors being connected to a common bus, and including at least one command processor. The command processor communicates over the bus with a plurality of joint controller cards. A plurality of direct current torque motors, of the type used in telerobot joints and telerobot hand-held controllers, are connected to the controller cards and respond to digital control signals from the command processor. Essential motor operating parameters are sensed by analog sensing circuits and the sensed analog signals are converted to digital signals for storage at the controller cards where such signals can be read during an address read/write cycle of the command processing processor.
Function Allocation in a Robust Distributed Real-Time Environment

DTIC Science & Technology

1991-12-01

fundamental characteristic of a distributed system is its ability to map individual logical functions of an application program onto many physical nodes... how much of a node’s processor time is scheduled for function processing. IMC is the function-to-function communication required to facilitate...indicator of how much excess processor time a node has. The reconfiguration algorithms use these variables to determine the most appropriate node(s) to
Digital system for structural dynamics simulation

NASA Technical Reports Server (NTRS)

Krauter, A. I.; Lagace, L. J.; Wojnar, M. K.; Glor, C.

1982-01-01

State-of-the-art digital hardware and software for the simulation of complex structural dynamic interactions, such as those which occur in rotating structures (engine systems), were incorporated in a system designed to use an array of processors in which the computation for each physical subelement or functional subsystem would be assigned to a single specific processor in the simulator. These node processors are microprogrammed bit-slice microcomputers which function autonomously and can communicate with each other and a central control minicomputer over parallel digital lines. Inter-processor nearest neighbor communications busses pass the constants which represent physical constraints and boundary conditions. The node processors are connected to the six nearest neighbor node processors to simulate the actual physical interface of real substructures. Computer generated finite element mesh and force models can be developed with the aid of the central control minicomputer. The control computer also oversees the animation of a graphics display system and disk-based mass storage, along with the individual processing elements.
Multi-petascale highly efficient parallel supercomputer

DOEpatents

Asaad, Sameh; Bellofatto, Ralph E.; Blocksome, Michael A.; Blumrich, Matthias A.; Boyle, Peter; Brunheroto, Jose R.; Chen, Dong; Cher, Chen -Yong; Chiu, George L.; Christ, Norman; Coteus, Paul W.; Davis, Kristan D.; Dozsa, Gabor J.; Eichenberger, Alexandre E.; Eisley, Noel A.; Ellavsky, Matthew R.; Evans, Kahn C.; Fleischer, Bruce M.; Fox, Thomas W.; Gara, Alan; Giampapa, Mark E.; Gooding, Thomas M.; Gschwind, Michael K.; Gunnels, John A.; Hall, Shawn A.; Haring, Rudolf A.; Heidelberger, Philip; Inglett, Todd A.; Knudson, Brant L.; Kopcsay, Gerard V.; Kumar, Sameer; Mamidala, Amith R.; Marcella, James A.; Megerian, Mark G.; Miller, Douglas R.; Miller, Samuel J.; Muff, Adam J.; Mundy, Michael B.; O'Brien, John K.; O'Brien, Kathryn M.; Ohmacht, Martin; Parker, Jeffrey J.; Poole, Ruth J.; Ratterman, Joseph D.; Salapura, Valentina; Satterfield, David L.; Senger, Robert M.; Smith, Brian; Steinmacher-Burow, Burkhard; Stockdell, William M.; Stunkel, Craig B.; Sugavanam, Krishnan; Sugawara, Yutaka; Takken, Todd E.; Trager, Barry M.; Van Oosten, James L.; Wait, Charles D.; Walkup, Robert E.; Watson, Alfred T.; Wisniewski, Robert W.; Wu, Peng

2015-07-14

A Multi-Petascale Highly Efficient Parallel Supercomputer of 100 petaOPS-scale computing, at decreased cost, power and footprint, and that allows for a maximum packaging density of processing nodes from an interconnect point of view. The Supercomputer exploits technological advances in VLSI that enable a computing model where many processors can be integrated into a single Application Specific Integrated Circuit (ASIC). Each ASIC computing node comprises a system-on-chip ASIC utilizing four or more processors integrated into one die, with each having full access to all system resources and enabling adaptive partitioning of the processors to functions such as compute or messaging I/O on an application by application basis, and preferably, enabling adaptive partitioning of functions in accordance with various algorithmic phases within an application; if I/O or other processors are underutilized, they can participate in computation or communication. Nodes are interconnected by a five dimensional torus network with DMA that optimally maximizes the throughput of packet communications between nodes and minimizes latency.
Lambda network having 2^(m-1) nodes in each of m stages with each node coupled to four other nodes for bidirectional routing of data packets between nodes

DOEpatents

Napolitano, Jr., Leonard M.

1995-01-01

The Lambda network is a single stage, packet-switched interprocessor communication network for a distributed memory, parallel processor computer. Its design arises from the desired network characteristics of minimizing mean and maximum packet transfer time, local routing, expandability, deadlock avoidance, and fault tolerance. The network is based on fixed degree nodes and has mean and maximum packet transfer distances where n is the number of processors. The routing method is detailed, as are methods for expandability, deadlock avoidance, and fault tolerance.
Methods for operating parallel computing systems employing sequenced communications

DOEpatents

Benner, R.E.; Gustafson, J.L.; Montry, G.R.

1999-08-10

A parallel computing system and method are disclosed having improved performance where a program is concurrently run on a plurality of nodes for reducing total processing time, each node having a processor, a memory, and a predetermined number of communication channels connected to the node and independently connected directly to other nodes. The present invention improves performance of the parallel computing system by providing a system which can provide efficient communication between the processors and between the system and input and output devices. A method is also disclosed which can locate defective nodes with the computing system. 15 figs.
Methods for operating parallel computing systems employing sequenced communications

DOEpatents

Benner, Robert E.; Gustafson, John L.; Montry, Gary R.

1999-01-01

A parallel computing system and method having improved performance where a program is concurrently run on a plurality of nodes for reducing total processing time, each node having a processor, a memory, and a predetermined number of communication channels connected to the node and independently connected directly to other nodes. The present invention improves performance of the parallel computing system by providing a system which can provide efficient communication between the processors and between the system and input and output devices. A method is also disclosed which can locate defective nodes with the computing system.
Smart Sensor Network for Aircraft Corrosion Monitoring

DTIC Science & Technology

2010-02-01

Network Elements – Hub, Network capable application processor ( NCAP ) – Node, Smart transducer interface module (STIM)  Corrosion Sensing and...software Transducer software Network Protocol 1451.2 1451.3 1451.5 1451.6 1451.7 I/O Node -processor Power TEDS Smart Sensor Hub ( NCAP ) IEEE 1451.0 and
Parallel-aware, dedicated job co-scheduling within/across symmetric multiprocessing nodes

DOEpatents

Jones, Terry R.; Watson, Pythagoras C.; Tuel, William; Brenner, Larry; Caffrey, Patrick; Fier, Jeffrey

2010-10-05

In a parallel computing environment comprising a network of SMP nodes each having at least one processor, a parallel-aware co-scheduling method and system for improving the performance and scalability of a dedicated parallel job having synchronizing collective operations. The method and system uses a global co-scheduler and an operating system kernel dispatcher adapted to coordinate interfering system and daemon activities on a node and across nodes to promote intra-node and inter-node overlap of said interfering system and daemon activities as well as intra-node and inter-node overlap of said synchronizing collective operations. In this manner, the impact of random short-lived interruptions, such as timer-decrement processing and periodic daemon activity, on synchronizing collective operations is minimized on large processor-count SPMD bulk-synchronous programming styles.
Lambda network having 2^(m-1) nodes in each of m stages with each node coupled to four other nodes for bidirectional routing of data packets between nodes

DOEpatents

Napolitano, L.M. Jr.

1995-11-28

The Lambda network is a single stage, packet-switched interprocessor communication network for a distributed memory, parallel processor computer. Its design arises from the desired network characteristics of minimizing mean and maximum packet transfer time, local routing, expandability, deadlock avoidance, and fault tolerance. The network is based on fixed degree nodes and has mean and maximum packet transfer distances where n is the number of processors. The routing method is detailed, as are methods for expandability, deadlock avoidance, and fault tolerance. 14 figs.
MPI parallelization of Vlasov codes for the simulation of nonlinear laser-plasma interactions

NASA Astrophysics Data System (ADS)

Savchenko, V.; Won, K.; Afeyan, B.; Decyk, V.; Albrecht-Marc, M.; Ghizzo, A.; Bertrand, P.

2003-10-01

The simulation of optical mixing driven KEEN waves [1] and electron plasma waves [1] in laser-produced plasmas requires nonlinear kinetic models and massive parallelization. We use Message Passing Interface (MPI) libraries and Appleseed [2] to solve the Vlasov Poisson system of equations on an 8 node dual processor MAC G4 cluster. We use the semi-Lagrangian time splitting method [3]. It requires only row-column exchanges in the global data redistribution, minimizing the total number of communications between processors. Recurrent communication patterns for 2D FFTs involve global transposition. In the Vlasov-Maxwell case, we use splitting into two 1D spatial advections and a 2D momentum advection [4]. Discretized momentum advection equations have a double loop structure with the outer index being assigned to different processors. We adhere to a code structure with separate routines for calculations and data management for parallel computations. [1] B. Afeyan et al., IFSA 2003 Conference Proceedings, Monterey, CA [2] V. K. Decyk, Computers in Physics, 7, 418 (1993) [3] Sonnendrucker et al., JCP 149, 201 (1998) [4] Begue et al., JCP 151, 458 (1999)
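The abstract states that the outer index of the double loop is assigned to different processors but does not say how. A minimal sketch, assuming a contiguous block distribution (one common choice, not confirmed by the source), is shown below. The helper name `block_range` and the example sizes are hypothetical, and the code runs without MPI so that the index arithmetic can be checked on its own.

```python
def block_range(n, nprocs, rank):
    """Half-open range [start, stop) of outer-loop indices owned by `rank`.

    The n indices are split into nprocs contiguous blocks; the first
    (n % nprocs) ranks receive one extra index so the load is balanced.
    """
    base, extra = divmod(n, nprocs)
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop


# Example: 10 outer indices shared among 4 processors.
ranges = [block_range(10, 4, r) for r in range(4)]
```

In an MPI code, each rank would call the helper with its own rank number and iterate only over its block, leaving the inner loop unchanged.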
Energy Efficient Real-Time Scheduling Using DPM on Mobile Sensors with a Uniform Multi-Cores

PubMed Central

Kim, Youngmin; Lee, Chan-Gun

2017-01-01

In wireless sensor networks (WSNs), sensor nodes are deployed for collecting and analyzing data. These nodes use limited energy batteries for easy deployment and low cost. The use of limited energy batteries is closely related to the lifetime of the sensor nodes when using wireless sensor networks. Efficient energy management is important for extending the lifetime of the sensor nodes. Most efforts to improve power efficiency in tiny sensor nodes have focused mainly on reducing the power consumed during data transmission. However, the recent emergence of sensor nodes equipped with multi-cores requires attention to the problem of reducing power consumption in multi-cores. In this paper, we propose an energy efficient scheduling method for sensor nodes supporting uniform multi-cores. We extend the proposed T-Ler plane based scheduling for global optimal scheduling of uniform multi-cores and multi-processors to enable power management using dynamic power management. In the proposed approach, processor selection for scheduling and a mapping method between the tasks and processors are proposed to efficiently utilize dynamic power management. Experiments show the effectiveness of the proposed approach compared to other existing methods. PMID:29240695
Multinode reconfigurable pipeline computer

NASA Technical Reports Server (NTRS)

Nosenchuck, Daniel M. (Inventor); Littman, Michael G. (Inventor)

1989-01-01

A multinode parallel-processing computer is made up of a plurality of interconnected, large capacity nodes each including a reconfigurable pipeline of functional units such as Integer Arithmetic Logic Processors, Floating Point Arithmetic Processors, Special Purpose Processors, etc. The reconfigurable pipeline of each node is connected to a multiplane memory by a Memory-ALU switch NETwork (MASNET). The reconfigurable pipeline includes three (3) basic substructures formed from functional units which have been found to be sufficient to perform the bulk of all calculations. The MASNET controls the flow of signals from the memory planes to the reconfigurable pipeline and vice versa. The nodes are connectable together by an internode data router (hyperspace router) so as to form a hypercube configuration. The capability of the nodes to conditionally configure the pipeline at each tick of the clock, without requiring a pipeline flush, permits many powerful algorithms to be implemented directly.
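To make the hypercube configuration concrete, the sketch below shows how nodes could be addressed and how a packet might be forwarded between them. It assumes nodes are labeled with d-bit integers and uses dimension-order routing, a standard textbook scheme; the record does not describe the hyperspace router's actual algorithm, and the names `hypercube_neighbors` and `hypercube_route` are hypothetical.

```python
def hypercube_neighbors(node, dim):
    """Nodes whose label differs from `node` in exactly one of `dim` bits."""
    return [node ^ (1 << bit) for bit in range(dim)]


def hypercube_route(src, dst, dim):
    """Dimension-order path from src to dst (assumed scheme, lowest bit first).

    Each hop flips one bit in which the current node and the destination
    differ, so the hop count equals the Hamming distance of the labels.
    """
    path = [src]
    current = src
    for bit in range(dim):
        if (current ^ dst) & (1 << bit):
            current ^= 1 << bit
            path.append(current)
    return path


# Example on a 3-dimensional hypercube (8 nodes).
route = hypercube_route(0b000, 0b101, 3)
```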
Performance of VPIC on Sequoia

NASA Astrophysics Data System (ADS)

Nystrom, William

2014-10-01

Sequoia is a major DOE computing resource which is characteristic of future resources in that it has many threads per compute node, 64, and the individual processor cores are simpler and less powerful than cores on previous processors like Intel's Sandy Bridge or AMD's Opteron. An effort is in progress to port VPIC to the Blue Gene Q architecture of Sequoia and evaluate its performance. Results of this work will be presented on single node performance of VPIC as well as multi-node scaling.
OpenMP Performance on the Columbia Supercomputer

NASA Technical Reports Server (NTRS)

Haoqiang, Jin; Hood, Robert

2005-01-01

This presentation discusses the Columbia World Class Supercomputer, one of the world's fastest supercomputers, providing 61 TFLOPs (10/20/04). It was conceived, designed, built, and deployed in just 120 days. It is a 20-node supercomputer built on proven 512-processor nodes. It is the largest SGI system in the world, with over 10,000 Intel Itanium 2 processors, and provides the largest node size incorporating commodity parts (512) and the largest shared-memory environment (2048); with 88% efficiency, it tops the scalar systems on the Top500 list.
A Software Implementation of a Satellite Interface Message Processor.

ERIC Educational Resources Information Center

Eastwood, Margaret A.; Eastwood, Lester F., Jr.

A design for network control software for a computer network is described in which some nodes are linked by a communications satellite channel. It is assumed that the network has an ARPANET-like configuration; that is, that specialized processors at each node are responsible for message switching and network control. The purpose of the control…
An intelligent allocation algorithm for parallel processing

NASA Technical Reports Server (NTRS)

Carroll, Chester C.; Homaifar, Abdollah; Ananthram, Kishan G.

1988-01-01

The problem of allocating nodes of a program graph to processors in a parallel processing architecture is considered. The algorithm is based on critical path analysis, some allocation heuristics, and the execution granularity of nodes in a program graph. These factors, and the structure of the interprocessor communication network, influence the allocation. To achieve realistic estimations of the execution durations of allocations, the algorithm considers the fact that nodes in a program graph have to communicate through varying numbers of tokens. Coarse and fine granularities have been implemented, with interprocessor token-communication duration varying from zero up to values comparable to the execution durations of individual nodes. The effect on allocation of communication network structures is demonstrated by performing allocations for crossbar (non-blocking) and star (blocking) networks. The algorithm assumes the availability of as many processors as it needs for the optimal allocation of any program graph. Hence, the focus of allocation has been on varying token-communication durations rather than varying the number of processors. The algorithm always utilizes as many processors as necessary for the optimal allocation of any program graph, depending upon granularity and characteristics of the interprocessor communication network.
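The critical path analysis mentioned in this abstract can be sketched as a longest-path computation over the program graph. The example below is a simplified illustration, not the authors' algorithm: it assumes every node runs on its own processor, charges a single uniform token-communication delay `comm` on every edge, and uses hypothetical names and durations throughout.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def critical_path_length(durations, edges, comm=0):
    """Length of the longest path through a program graph.

    durations: {node: execution duration}
    edges:     iterable of (producer, consumer) pairs
    comm:      assumed uniform communication delay added on every edge
    """
    preds = {node: set() for node in durations}
    for producer, consumer in edges:
        preds[consumer].add(producer)

    finish = {}
    # static_order() yields each node after all of its predecessors.
    for node in TopologicalSorter(preds).static_order():
        start = max((finish[p] + comm for p in preds[node]), default=0)
        finish[node] = start + durations[node]
    return max(finish.values())


# Hypothetical four-node program graph.
durations = {"a": 2, "b": 3, "c": 1, "d": 4}
edges = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")]
no_comm = critical_path_length(durations, edges, comm=0)
with_comm = critical_path_length(durations, edges, comm=1)
```

Raising `comm` from zero toward the node durations lengthens the critical path, which mirrors the abstract's focus on varying token-communication durations.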
Simplifying and speeding the management of intra-node cache coherence

DOEpatents

Blumrich, Matthias A [Ridgefield, CT]; Chen, Dong [Croton on Hudson, NY]; Coteus, Paul W [Yorktown Heights, NY]; Gara, Alan G [Mount Kisco, NY]; Giampapa, Mark E [Irvington, NY]; Heidelberger, Philip [Cortlandt Manor, NY]; Hoenicke, Dirk [Ossining, NY]; Ohmacht, Martin [Yorktown Heights, NY]

2012-04-17

A method and apparatus for managing coherence between two processors of a two processor node of a multi-processor computer system. Generally the present invention relates to a software algorithm that simplifies and significantly speeds the management of cache coherence in a message passing parallel computer, and to hardware apparatus that assists this cache coherence algorithm. The software algorithm uses the opening and closing of put/get windows to coordinate the activities required to achieve cache coherence. The hardware apparatus may be an extension to the hardware address decode that creates, in the physical memory address space of the node, an area of virtual memory that (a) does not actually exist, and (b) is therefore able to respond instantly to read and write requests from the processing elements.
A hybrid optic-fiber sensor network with the function of self-diagnosis and self-healing

NASA Astrophysics Data System (ADS)

Xu, Shibo; Liu, Tiegen; Ge, Chunfeng; Chen, Cheng; Zhang, Hongxia

2014-11-01

We develop a hybrid wavelength division multiplexing optical fiber network with distributed fiber-optic sensors and quasi-distributed FBG sensor arrays, which detect vibrations, temperatures and strains at the same time. Through cooperative work of software and hardware, the network can locate failure sites automatically (self-diagnosis) and perform protective switching to reestablish sensing service (self-healing). These processes are accomplished by master-slave processors with the help of optical and wireless telemetry signals. All sensing and optical telemetry signals are transmitted in the same fiber, either the working fiber or the backup fiber. Under normal circumstances, we use a 1450 nm wavelength as the downstream signal and a 1350 nm wavelength as the upstream signal to control the network; both signals are sent by a light-emitting node of the corresponding processor. A continuous 1310 nm laser signal is also sent by each node and received by the next node on both working and backup fibers to monitor their health, but unlike the telemetry signals it does not carry any message. When the fibers of two sensor units are completely damaged, the master processor loses communication with the node between the damaged units. To address this, we install an RF module in each node. Finally, the whole network state is transmitted to the host computer by the master processor. An operator can monitor and control the network through a human-machine interface if needed.
A parallel algorithm for generation and assembly of finite element stiffness and mass matrices

NASA Technical Reports Server (NTRS)

Storaasli, O. O.; Carmona, E. A.; Nguyen, D. T.; Baddourah, M. A.

1991-01-01

A new algorithm is proposed for parallel generation and assembly of the finite element stiffness and mass matrices. The proposed assembly algorithm is based on a node-by-node approach rather than the more conventional element-by-element approach. The new algorithm's generality and computation speed-up when using multiple processors are demonstrated for several practical applications on multi-processor Cray Y-MP and Cray 2 supercomputers.

Modeling heterogeneous processor scheduling for real time systems

NASA Technical Reports Server (NTRS)

Leathrum, J. F.; Mielke, R. R.; Stoughton, J. W.

1994-01-01

A new model is presented to describe dataflow algorithms implemented in a multiprocessing system. Called the resource/data flow graph (RDFG), the model explicitly represents cyclo-static processor schedules as circuits of processor arcs which reflect the order that processors execute graph nodes. The model also allows the guarantee of meeting hard real-time deadlines. When unfolded, the model identifies statically the processor schedule. The model therefore is useful for determining the throughput and latency of systems with heterogeneous processors. The applicability of the model is demonstrated using a space surveillance algorithm.
Next Generation Security for the 10,240 Processor Columbia System

NASA Technical Reports Server (NTRS)

Hinke, Thomas; Kolano, Paul; Shaw, Derek; Keller, Chris; Tweton, Dave; Welch, Todd; Liu, Wen (Betty)

2005-01-01

This presentation includes a discussion of the Columbia 10,240-processor system located at the NASA Advanced Supercomputing (NAS) division at the NASA Ames Research Center which supports each of NASA's four missions: science, exploration systems, aeronautics, and space operations. It is comprised of 20 Silicon Graphics nodes, each consisting of 512 Itanium II processors. A 64 processor Columbia front-end system supports users as they prepare their jobs and then submits them to the PBS system. Columbia nodes and front-end systems use the Linux OS. Prior to SC04, the Columbia system was used to attain a processing speed of 51.87 TeraFlops, which made it number two on the Top 500 list of the world's supercomputers and the world's fastest "operational" supercomputer since it was fully engaged in supporting NASA users.
Peregrine System | High-Performance Computing | NREL

Science.gov Websites

) and longer-term (/projects) storage. These file systems are mounted on all nodes. Peregrine has three -2670 Xeon processors and 64 GB of memory. In addition to mounting the /home, /nopt, /projects and # cores/node Memory/node Peak (DP) performance per node 88 Intel Xeon E5-2670 "Sandy Bridge" 8
Managing coherence via put/get windows

DOEpatents

Blumrich, Matthias A [Ridgefield, CT]; Chen, Dong [Croton on Hudson, NY]; Coteus, Paul W [Yorktown Heights, NY]; Gara, Alan G [Mount Kisco, NY]; Giampapa, Mark E [Irvington, NY]; Heidelberger, Philip [Cortlandt Manor, NY]; Hoenicke, Dirk [Ossining, NY]; Ohmacht, Martin [Yorktown Heights, NY]

2011-01-11

A method and apparatus for managing coherence between two processors of a two processor node of a multi-processor computer system. Generally the present invention relates to a software algorithm that simplifies and significantly speeds the management of cache coherence in a message passing parallel computer, and to hardware apparatus that assists this cache coherence algorithm. The software algorithm uses the opening and closing of put/get windows to coordinate the activities required to achieve cache coherence. The hardware apparatus may be an extension to the hardware address decode that creates, in the physical memory address space of the node, an area of virtual memory that (a) does not actually exist, and (b) is therefore able to respond instantly to read and write requests from the processing elements.
Managing coherence via put/get windows

DOEpatents

Blumrich, Matthias A [Ridgefield, CT]; Chen, Dong [Croton on Hudson, NY]; Coteus, Paul W [Yorktown Heights, NY]; Gara, Alan G [Mount Kisco, NY]; Giampapa, Mark E [Irvington, NY]; Heidelberger, Philip [Cortlandt Manor, NY]; Hoenicke, Dirk [Ossining, NY]; Ohmacht, Martin [Yorktown Heights, NY]

2012-02-21

A method and apparatus for managing coherence between two processors of a two processor node of a multi-processor computer system. Generally the present invention relates to a software algorithm that simplifies and significantly speeds the management of cache coherence in a message passing parallel computer, and to hardware apparatus that assists this cache coherence algorithm. The software algorithm uses the opening and closing of put/get windows to coordinate the activities required to achieve cache coherence. The hardware apparatus may be an extension to the hardware address decode that creates, in the physical memory address space of the node, an area of virtual memory that (a) does not actually exist, and (b) is therefore able to respond instantly to read and write requests from the processing elements.
Method and apparatus of parallel computing with simultaneously operating stream prefetching and list prefetching engines

DOEpatents

Boyle, Peter A.; Christ, Norman H.; Gara, Alan; Mawhinney, Robert D.; Ohmacht, Martin; Sugavanam, Krishnan

2012-12-11

A prefetch system improves a performance of a parallel computing system. The parallel computing system includes a plurality of computing nodes. A computing node includes at least one processor and at least one memory device. The prefetch system includes at least one stream prefetch engine and at least one list prefetch engine. The prefetch system operates those engines simultaneously. After the at least one processor issues a command, the prefetch system passes the command to a stream prefetch engine and a list prefetch engine. The prefetch system operates the stream prefetch engine and the list prefetch engine to prefetch data to be needed in subsequent clock cycles in the processor in response to the passed command.
Peregrine System Configuration | High-Performance Computing | NREL

Science.gov Websites

nodes and storage are connected by a high speed InfiniBand network. Compute nodes are diskless with an directories are mounted on all nodes, along with a file system dedicated to shared projects. A brief processors with 64 GB of memory. All nodes are connected to the high speed Infiniband network and and a
A Coherent VLSI Environment

DTIC Science & Technology

1987-03-31

processors . The symmetry-breaking algorithms give efficient ways to convert probabilistic algorithms to deterministic algorithms. Some of the...techniques have been applied to construct several efficient linear- processor algorithms for graph problems, including an O(lg* n)-time algorithm for (A + 1...On n-node graphs, the algorithm works in O(log 2 n) time using only n processors , in contrast to the previous best algorithm which used about n3
High-Speed Computation of the Kleene Star in Max-Plus Algebraic System Using a Cell Broadband Engine

NASA Astrophysics Data System (ADS)

Goto, Hiroyuki

This research addresses a high-speed computation method for the Kleene star of the weighted adjacency matrix in a max-plus algebraic system. We focus on systems whose precedence constraints are represented by a directed acyclic graph and implement it on a Cell Broadband Engine™ (CBE) processor. Since the resulting matrix gives the longest travel times between two adjacent nodes, it is often utilized in scheduling problem solvers for a class of discrete event systems. This research, in particular, attempts to achieve a speedup by using two approaches: parallelization and SIMDization (Single Instruction, Multiple Data), both of which can be accomplished by a CBE processor. The former refers to a parallel computation using multiple cores, while the latter is a method whereby multiple elements are computed by a single instruction. Using the implementation on a Sony PlayStation 3™ equipped with a CBE processor, we found that the SIMDization is effective regardless of the system's size and the number of processor cores used. We also found that the scalability of using multiple cores is remarkable especially for systems with a large number of nodes. In a numerical experiment where the number of nodes is 2000, we achieved a speedup of 20 times compared with the method without the above techniques.
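For readers unfamiliar with the operator, the Kleene star of an n-by-n max-plus matrix A with an acyclic precedence graph is A* = E ⊕ A ⊕ A^2 ⊕ ... ⊕ A^(n-1), where ⊕ is elementwise max, the matrix product replaces (+, ×) with (max, +), and E is the max-plus identity. The sketch below is a plain sequential illustration of that definition, not the parallel SIMD implementation described in the record. The convention that `A[i][j]` holds the weight of the edge from node j to node i, and all function names, are assumptions.

```python
NEG_INF = float("-inf")  # max-plus "zero": no path


def maxplus_matmul(X, Y):
    """Max-plus product: (X ⊗ Y)[i][j] = max over k of X[i][k] + Y[k][j]."""
    n = len(X)
    return [[max(X[i][k] + Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def kleene_star(A):
    """A* = E ⊕ A ⊕ ... ⊕ A^(n-1), valid when the graph of A is acyclic."""
    n = len(A)
    identity = [[0 if i == j else NEG_INF for j in range(n)] for i in range(n)]
    result, power = identity, identity
    for _ in range(n - 1):
        power = maxplus_matmul(power, A)
        result = [[max(r, p) for r, p in zip(r_row, p_row)]
                  for r_row, p_row in zip(result, power)]
    return result


# Hypothetical 3-node DAG: 0 -> 1 (weight 2), 1 -> 2 (weight 3), 0 -> 2 (weight 4).
A = [[NEG_INF, NEG_INF, NEG_INF],
     [2,       NEG_INF, NEG_INF],
     [4,       3,       NEG_INF]]
star = kleene_star(A)
```

In this example, the entry for the pair (0, 2) becomes 5, because the two-hop path 0 -> 1 -> 2 is longer than the direct edge of weight 4.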
A site oriented supercomputer for theoretical physics: The Fermilab Advanced Computer Program Multi Array Processor System (ACMAPS)

DOE Office of Scientific and Technical Information (OSTI.GOV)

Nash, T.; Atac, R.; Cook, A.

1989-03-06

The ACPMAPS multiprocessor is a highly cost effective, local memory parallel computer with a hypercube or compound hypercube architecture. Communication requires the attention of only the two communicating nodes. The design is aimed at floating point intensive, grid like problems, particularly those with extreme computing requirements. The processing nodes of the system are single board array processors, each with a peak power of 20 Mflops, supported by 8 Mbytes of data and 2 Mbytes of instruction memory. The system currently being assembled has a peak power of 5 Gflops. The nodes are based on the Weitek XL Chip set. The system delivers performance at approximately $300/Mflop. 8 refs., 4 figs.
Video sensor architecture for surveillance applications.

PubMed

Sánchez, Jordi; Benet, Ginés; Simó, José E

2012-01-01

This paper introduces a flexible hardware and software architecture for a smart video sensor. This sensor has been applied in a video surveillance application where some of these video sensors are deployed, constituting the sensory nodes of a distributed surveillance system. In this system, a video sensor node processes images locally in order to extract objects of interest, and classify them. The sensor node reports the processing results to other nodes in the cloud (a user or higher level software) in the form of an XML description. The hardware architecture of each sensor node has been developed using two DSP processors and an FPGA that controls, in a flexible way, the interconnection among processors and the image data flow. The developed node software is based on pluggable components and runs on a provided execution run-time. Some basic and application-specific software components have been developed, in particular: acquisition, segmentation, labeling, tracking, classification and feature extraction. Preliminary results demonstrate that the system can achieve up to 7.5 frames per second in the worst case, and the true positive rates in the classification of objects are better than 80%.
Video Sensor Architecture for Surveillance Applications

PubMed Central

Sánchez, Jordi; Benet, Ginés; Simó, José E.

2012-01-01

This paper introduces a flexible hardware and software architecture for a smart video sensor. This sensor has been applied in a video surveillance application where some of these video sensors are deployed, constituting the sensory nodes of a distributed surveillance system. In this system, a video sensor node processes images locally in order to extract objects of interest, and classify them. The sensor node reports the processing results to other nodes in the cloud (a user or higher level software) in the form of an XML description. The hardware architecture of each sensor node has been developed using two DSP processors and an FPGA that controls, in a flexible way, the interconnection among processors and the image data flow. The developed node software is based on pluggable components and runs on a provided execution run-time. Some basic and application-specific software components have been developed, in particular: acquisition, segmentation, labeling, tracking, classification and feature extraction. Preliminary results demonstrate that the system can achieve up to 7.5 frames per second in the worst case, and the true positive rates in the classification of objects are better than 80%. PMID:22438723
DMA shared byte counters in a parallel computer

DOEpatents

Chen, Dong; Gara, Alan G.; Heidelberger, Philip; Vranas, Pavlos

2010-04-06

A parallel computer system is constructed as a network of interconnected compute nodes. Each of the compute nodes includes at least one processor, a memory and a DMA engine. The DMA engine includes a processor interface for interfacing with the at least one processor, DMA logic, a memory interface for interfacing with the memory, a DMA network interface for interfacing with the network, injection and reception byte counters, injection and reception FIFO metadata, and status registers and control registers. The injection FIFOs maintain memory locations of the injection FIFO metadata memory locations including its current head and tail, and the reception FIFOs maintain the reception FIFO metadata memory locations including its current head and tail. The injection byte counters and reception byte counters may be shared between messages.
An Application-Based Performance Characterization of the Columbia Supercluster

NASA Technical Reports Server (NTRS)

Biswas, Rupak; Djomehri, Jahed M.; Hood, Robert; Jin, Haoqiang; Kiris, Cetin; Saini, Subhash

2005-01-01

Columbia is a 10,240-processor supercluster consisting of 20 Altix nodes with 512 processors each, and currently ranked as the second-fastest computer in the world. In this paper, we present the performance characteristics of Columbia obtained on up to four computing nodes interconnected via the InfiniBand and/or NUMAlink4 communication fabrics. We evaluate floating-point performance, memory bandwidth, message passing communication speeds, and compilers using a subset of the HPC Challenge benchmarks, and some of the NAS Parallel Benchmarks including the multi-zone versions. We present detailed performance results for three scientific applications of interest to NASA, one from molecular dynamics, and two from computational fluid dynamics. Our results show that both the NUMAlink4 and the InfiniBand hold promise for application scaling to a large number of processors.
Managing coherence via put/get windows

DOE Office of Scientific and Technical Information (OSTI.GOV)

Blumrich, Matthias A; Chen, Dong; Coteus, Paul W

A method and apparatus for managing coherence between two processors of a two processor node of a multi-processor computer system. Generally the present invention relates to a software algorithm that simplifies and significantly speeds the management of cache coherence in a message passing parallel computer, and to hardware apparatus that assists this cache coherence algorithm. The software algorithm uses the opening and closing of put/get windows to coordinate the activities required to achieve cache coherence. The hardware apparatus may be an extension to the hardware address decode that creates, in the physical memory address space of the node, an area of virtual memory that (a) does not actually exist, and (b) is therefore able to respond instantly to read and write requests from the processing elements.
Data communications in a parallel active messaging interface of a parallel computer

DOEpatents

Archer, Charles J; Blocksome, Michael A; Ratterman, Joseph D; Smith, Brian E

2013-11-12

Data communications in a parallel active messaging interface ('PAMI') of a parallel computer composed of compute nodes that execute a parallel application, each compute node including application processors that execute the parallel application and at least one management processor dedicated to gathering information regarding data communications. The PAMI is composed of data communications endpoints, each endpoint composed of a specification of data communications parameters for a thread of execution on a compute node, including specifications of a client, a context, and a task, the compute nodes and the endpoints coupled for data communications through the PAMI and through data communications resources. Embodiments function by gathering call site statistics describing data communications resulting from execution of data communications instructions and identifying in dependence upon the call site statistics a data communications algorithm for use in executing a data communications instruction at a call site in the parallel application.
Portable multi-node LQCD Monte Carlo simulations using OpenACC

NASA Astrophysics Data System (ADS)

Bonati, Claudio; Calore, Enrico; D'Elia, Massimo; Mesiti, Michele; Negro, Francesco; Sanfilippo, Francesco; Schifano, Sebastiano Fabio; Silvi, Giorgio; Tripiccione, Raffaele

This paper describes a state-of-the-art parallel Lattice QCD Monte Carlo code for staggered fermions, purposely designed to be portable across different computer architectures, including GPUs and commodity CPUs. Portability is achieved using the OpenACC parallel programming model, used to develop a code that can be compiled for several processor architectures. The paper focuses on parallelization on multiple computing nodes using OpenACC to manage parallelism within the node, and OpenMPI to manage parallelism among the nodes. We first discuss the available strategies for maximizing performance, then describe selected relevant details of the code, and finally measure the level of performance and scaling that we are able to achieve. The work focuses mainly on GPUs, which offer a significantly high level of performance for this application, but also compares with results measured on other processors.
Active non-volatile memory post-processing

DOE Office of Scientific and Technical Information (OSTI.GOV)

Kannan, Sudarsun; Milojicic, Dejan S.; Talwar, Vanish

A computing node includes an active Non-Volatile Random Access Memory (NVRAM) component which includes memory and a sub-processor component. The memory is to store data chunks received from a processor core, the data chunks comprising metadata indicating a type of post-processing to be performed on data within the data chunks. The sub-processor component is to perform post-processing of said data chunks based on said metadata.
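
The metadata-driven dispatch described in this abstract can be sketched roughly as follows. This is a minimal illustration only: the chunk layout, the `post_process` metadata key, and the two operations (`compress`, `checksum`) are hypothetical choices, since the abstract does not specify them.

```python
import zlib

# Hypothetical post-processing operations, keyed by the chunk's metadata field.
POST_PROCESSORS = {
    "compress": lambda data: zlib.compress(data),
    "checksum": lambda data: zlib.crc32(data).to_bytes(4, "big"),
}

def sub_processor(chunk):
    """Apply the post-processing type named in the chunk's metadata to its data."""
    kind = chunk["metadata"]["post_process"]
    return POST_PROCESSORS[kind](chunk["data"])

# A processor core hands over a chunk; the sub-processor decides what to do with it.
chunk = {"metadata": {"post_process": "compress"}, "data": b"abc" * 100}
stored = sub_processor(chunk)
```

The point of the table lookup is that the processor core only tags the data; the choice of routine is made next to the memory.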
Low latency messages on distributed memory multiprocessors

NASA Technical Reports Server (NTRS)

Rosing, Matthew; Saltz, Joel

1993-01-01

Many of the issues in developing an efficient interface for communication on distributed memory machines are described and a portable interface is proposed. Although the hardware component of message latency is less than one microsecond on many distributed memory machines, the software latency associated with sending and receiving typed messages is on the order of 50 microseconds. The reason for this imbalance is that the software interface does not match the hardware. By changing the interface to match the hardware more closely, applications with fine grained communication can be put on these machines. Based on several tests that were run on the iPSC/860, an interface that will better match current distributed memory machines is proposed. The model used in the proposed interface consists of a computation processor and a communication processor on each node. Communication between these processors and other nodes in the system is done through a buffered network. Information that is transmitted is either data or procedures to be executed on the remote processor. The dual processor system is better suited for efficiently handling asynchronous communications compared to a single processor system. The ability to send data or procedure is very flexible for minimizing message latency, based on the type of communication being performed. The test performed and the proposed interface are described.
Marshburn updates software on the WHC UPA in the Node 3

NASA Image and Video Library

2013-01-17

ISS034-E-031133 (17 Jan. 2013) --- NASA astronaut Tom Marshburn, Expedition 34 flight engineer, updates software on the Waste and Hygiene Compartment's Urine Processor Assembly in the Tranquility node of the International Space Station.

Marshburn updates software on the WHC UPA in the Node 3

NASA Image and Video Library

2013-01-17

ISS034-E-031130 (17 Jan. 2013) --- NASA astronaut Tom Marshburn, Expedition 34 flight engineer, updates software on the Waste and Hygiene Compartment's Urine Processor Assembly in the Tranquility node of the International Space Station.
Asynchronous Communication Scheme For Hypercube Computer

NASA Technical Reports Server (NTRS)

Madan, Herb S.

1988-01-01

Scheme devised for asynchronous-message communication system for Mark III hypercube concurrent-processor network. Network consists of up to 1,024 processing elements connected electrically as though they were at corners of 10-dimensional cube. Each node contains two Motorola 68020 processors along with Motorola 68881 floating-point processor utilizing up to 4 megabytes of shared dynamic random-access memory. Scheme intended to support applications requiring passage of both polled or solicited and unsolicited messages.
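
The hypercube connectivity mentioned in this abstract can be illustrated with a short sketch. It assumes the usual convention, not stated in the abstract, that nodes carry binary labels and that two nodes are linked when their labels differ in exactly one bit; the function name is hypothetical.

```python
def hypercube_neighbors(node, dimensions):
    """Return the nodes adjacent to `node` in a binary hypercube.

    Assumes binary node labels: neighbors differ from `node` in exactly one bit.
    """
    return [node ^ (1 << bit) for bit in range(dimensions)]

DIMENSIONS = 10
node_count = 2 ** DIMENSIONS                    # 1,024 corners of a 10-dimensional cube
neighbors = hypercube_neighbors(0, DIMENSIONS)  # each node has 10 direct links
```

Under this labeling, every one of the 1,024 nodes has exactly 10 direct neighbors.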
Enhancing Image Processing Performance for PCID in a Heterogeneous Network of Multi-core Processors

DTIC Science & Technology

2009-09-01

TFLOPS of Playstation 3 (PS3) nodes with IBM Cell Broadband Engine multi-cores and 15 dual-quad Xeon head nodes. The interconnect fabric includes...
Eliminating livelock by assigning the same priority state to each message that is input into a flushable routing system during N time intervals

DOEpatents

Faber, V.

1994-11-29

Livelock-free message routing is provided in a network of interconnected nodes that is flushable in time T. An input message processor generates sequences of at least N time intervals, each of duration T. An input register provides for receiving and holding each input message, where the message is assigned a priority state p during an nth one of the N time intervals. At each of the network nodes a message processor reads the assigned priority state and awards priority to messages with priority state (p-1) during an nth time interval and to messages with priority state p during an (n+1) th time interval. The messages that are awarded priority are output on an output path toward the addressed output message processor. Thus, no message remains in the network for a time longer than T. 4 figures.
Eliminating livelock by assigning the same priority state to each message that is inputted into a flushable routing system during N time intervals

DOEpatents

Faber, Vance

1994-01-01

Livelock-free message routing is provided in a network of interconnected nodes that is flushable in time T. An input message processor generates sequences of at least N time intervals, each of duration T. An input register provides for receiving and holding each input message, where the message is assigned a priority state p during an nth one of the N time intervals. At each of the network nodes a message processor reads the assigned priority state and awards priority to messages with priority state (p-1) during an nth time interval and to messages with priority state p during an (n+1) th time interval. The messages that are awarded priority are output on an output path toward the addressed output message processor. Thus, no message remains in the network for a time longer than T.
Comments on Samal and Henderson: Parallel consistent labeling algorithms

DOE Office of Scientific and Technical Information (OSTI.GOV)

Swain, M.J.

Samal and Henderson claim that any parallel algorithm for enforcing arc consistency in the worst case must have Ω(na) sequential steps, where n is the number of nodes, and a is the number of labels per node. The authors argue that Samal and Henderson's argument makes assumptions about how processors are used and give a counterexample that enforces arc consistency in a constant number of steps using O(n^2 a^2 2^(na)) processors. It is possible that the lower bound holds for a polynomial number of processors; if such a lower bound were to be proven it would answer an important open question in theoretical computer science concerning the relation between the complexity classes P and NC. The strongest existing lower bound for the arc consistency problem states that it cannot be solved in polynomial log time unless P = NC.
Extending the granularity of representation and control for the MIL-STD CAIS 1.0 node model

NASA Technical Reports Server (NTRS)

Rogers, Kathy L.

1986-01-01

The Common APSE (Ada Program Support Environment) Interface Set (CAIS) (DoD85) node model provides an excellent baseline for interfaces in a single-host development environment. To encompass the entire spectrum of computing, however, the CAIS model should be extended in four areas. It should provide the interface between the engineering workstation and the host system throughout the entire lifecycle of the system. It should provide a basis for communication and integration functions needed by distributed host environments. It should provide common interfaces for communications mechanisms to and among target processors. It should provide facilities for integration, validation, and verification of test beds extending to distributed systems on geographically separate processors with heterogeneous instruction set architectures (ISAs). Additions to the PROCESS NODE model to extend the CAIS into these four areas are proposed.
Latency Hiding in Dynamic Partitioning and Load Balancing of Grid Computing Applications

NASA Technical Reports Server (NTRS)

Das, Sajal K.; Harvey, Daniel J.; Biswas, Rupak

2001-01-01

The Information Power Grid (IPG) concept developed by NASA aims to provide a metacomputing platform for large-scale distributed computations by hiding the intricacies of a highly heterogeneous environment while maintaining adequate security. In this paper, we propose a latency-tolerant partitioning scheme that dynamically balances processor workloads on the IPG, and minimizes data movement and runtime communication. By simulating an unsteady adaptive mesh application on a wide area network, we study the performance of our load balancer under the Globus environment. The number of IPG nodes, the number of processors per node, and the interconnect speeds are parameterized to derive conditions under which the IPG would be suitable for parallel distributed processing of such applications. Experimental results demonstrate that effective solutions are achieved when the IPG nodes are connected by a high-speed asynchronous interconnection network.
An efficient 3-dim FFT for plane wave electronic structure calculations on massively parallel machines composed of multiprocessor nodes

NASA Astrophysics Data System (ADS)

Goedecker, Stefan; Boulet, Mireille; Deutsch, Thierry

2003-08-01

Three-dimensional Fast Fourier Transforms (FFTs) are the main computational task in plane wave electronic structure calculations. Obtaining high performance on a large number of processors is non-trivial on the latest generation of parallel computers that consist of nodes made up of shared memory multiprocessors. A non-dogmatic method for obtaining high performance for such 3-dim FFTs in a combined MPI/OpenMP programming paradigm will be presented. Exploiting the peculiarities of plane wave electronic structure calculations, speedups of up to 160 and speeds of up to 130 Gflops were obtained on 256 processors.
Analysis and Implementation of Particle-to-Particle (P2P) Graphics Processor Unit (GPU) Kernel for Black-Box Adaptive Fast Multipole Method

DTIC Science & Technology

2015-06-01

5110P and 16 dx360M4 nodes each with one NVIDIA Kepler K20M/K40M GPU. Each node contained dual Intel Xeon E5-2670 (Sandy Bridge) central processing...kernel and as such does not employ multiple processors. This work makes use of a single processing core and a single NVIDIA Kepler K40 GK110...bandwidth (2 × 16 slot), 7.877 GFloat/s; Kepler K40 peak, 4,290 × 1 billion floating-point operations (GFLOPs), and 288 GB/s Kepler K40 memory
An MPI-IO interface to HPSS

NASA Technical Reports Server (NTRS)

Jones, Terry; Mark, Richard; Martin, Jeanne; May, John; Pierce, Elsie; Stanberry, Linda

1996-01-01

This paper describes an implementation of the proposed MPI-IO (Message Passing Interface - Input/Output) standard for parallel I/O. Our system uses third-party transfer to move data over an external network between the processors where it is used and the I/O devices where it resides. Data travels directly from source to destination, without the need for shuffling it among processors or funneling it through a central node. Our distributed server model lets multiple compute nodes share the burden of coordinating data transfers. The system is built on the High Performance Storage System (HPSS), and a prototype version runs on a Meiko CS-2 parallel computer.
Real-time trajectory optimization on parallel processors

NASA Technical Reports Server (NTRS)

Psiaki, Mark L.

1993-01-01

A parallel algorithm has been developed for rapidly solving trajectory optimization problems. The goal of the work has been to develop an algorithm that is suitable to do real-time, on-line optimal guidance through repeated solution of a trajectory optimization problem. The algorithm has been developed on an INTEL iPSC/860 message passing parallel processor. It uses a zero-order-hold discretization of a continuous-time problem and solves the resulting nonlinear programming problem using a custom-designed augmented Lagrangian nonlinear programming algorithm. The algorithm achieves parallelism of function, derivative, and search direction calculations through the principle of domain decomposition applied along the time axis. It has been encoded and tested on 3 example problems: the Goddard problem, the acceleration-limited, planar minimum-time to the origin problem, and a National Aerospace Plane minimum-fuel ascent guidance problem. Execution times as fast as 118 sec of wall clock time have been achieved for a 128-stage Goddard problem solved on 32 processors. A 32-stage minimum-time problem has been solved in 151 sec on 32 processors. A 32-stage National Aerospace Plane problem required 2 hours when solved on 32 processors. A speed-up factor of 7.2 has been achieved by using 32 nodes instead of 1 node to solve a 64-stage Goddard problem.
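
To put the reported speed-up in context, the standard definition of parallel efficiency (speed-up divided by processor count) can be applied to the figures in this abstract. The efficiency value is derived here and is not quoted by the authors; the function name is illustrative.

```python
def parallel_efficiency(speedup, processors):
    """Fraction of ideal linear speed-up actually achieved."""
    return speedup / processors

# Reported: speed-up of 7.2 on 32 nodes for the 64-stage Goddard problem.
efficiency = parallel_efficiency(7.2, 32)   # about 0.225, i.e. 22.5% of ideal
```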
System on chip module configured for event-driven architecture

DOEpatents

Robbins, Kevin; Brady, Charles E.; Ashlock, Tad A.

2017-10-17

A system on chip (SoC) module is described herein, wherein the SoC module comprises a processor subsystem and a hardware logic subsystem. The processor subsystem and hardware logic subsystem are in communication with one another, and transmit event messages between one another. The processor subsystem executes software actors, while the hardware logic subsystem includes hardware actors. The software actors and hardware actors conform to an event-driven architecture, such that the software actors receive and generate event messages and the hardware actors receive and generate event messages.
A Locality-Based Threading Algorithm for the Configuration-Interaction Method

DOE Office of Scientific and Technical Information (OSTI.GOV)

Shan, Hongzhang; Williams, Samuel; Johnson, Calvin

The Configuration Interaction (CI) method has been widely used to solve the non-relativistic many-body Schrodinger equation. One great challenge to implementing it efficiently on manycore architectures is its immense memory and data movement requirements. To address this issue, within each node, we exploit a hybrid MPI+OpenMP programming model in lieu of the traditional flat MPI programming model. Here in this paper, we develop optimizations that partition the workloads among OpenMP threads based on data locality, which is essential in ensuring applications with complex data access patterns scale well on manycore architectures. The new algorithm scales to 256 threads on the 64-core Intel Knights Landing (KNL) manycore processor and 24 threads on dual-socket Ivy Bridge (Xeon) nodes. Compared with the original implementation, the performance has been improved by up to 7× on the Knights Landing processor and 3× on the dual-socket Ivy Bridge node.
A Locality-Based Threading Algorithm for the Configuration-Interaction Method

DOE PAGES

Shan, Hongzhang; Williams, Samuel; Johnson, Calvin; ...

2017-07-03

The Configuration Interaction (CI) method has been widely used to solve the non-relativistic many-body Schrodinger equation. One great challenge to implementing it efficiently on manycore architectures is its immense memory and data movement requirements. To address this issue, within each node, we exploit a hybrid MPI+OpenMP programming model in lieu of the traditional flat MPI programming model. Here in this paper, we develop optimizations that partition the workloads among OpenMP threads based on data locality, which is essential in ensuring applications with complex data access patterns scale well on manycore architectures. The new algorithm scales to 256 threads on the 64-core Intel Knights Landing (KNL) manycore processor and 24 threads on dual-socket Ivy Bridge (Xeon) nodes. Compared with the original implementation, the performance has been improved by up to 7× on the Knights Landing processor and 3× on the dual-socket Ivy Bridge node.
Massively parallel processor networks with optical express channels

DOEpatents

Deri, R.J.; Brooks, E.D. III; Haigh, R.E.; DeGroot, A.J.

1999-08-24

An optical method for separating and routing local and express channel data comprises interconnecting the nodes in a network with fiber optic cables. A single fiber optic cable carries both express channel traffic and local channel traffic, e.g., in a massively parallel processor (MPP) network. Express channel traffic is placed on, or filtered from, the fiber optic cable at a light frequency or a color different from that of the local channel traffic. The express channel traffic is thus placed on a light carrier that skips over the local intermediate nodes one-by-one by reflecting off of selective mirrors placed at each local node. The local-channel-traffic light carriers pass through the selective mirrors and are not reflected. A single fiber optic cable can thus be threaded throughout a three-dimensional matrix of nodes with the x,y,z directions of propagation encoded by the color of the respective light carriers for both local and express channel traffic. Thus frequency division multiple access is used to hierarchically separate the local and express channels to eliminate the bucket brigade latencies that would otherwise result if the express traffic had to hop between every local node to reach its ultimate destination. 3 figs.
Massively parallel processor networks with optical express channels

DOEpatents

Deri, Robert J.; Brooks, III, Eugene D.; Haigh, Ronald E.; DeGroot, Anthony J.

1999-01-01

An optical method for separating and routing local and express channel data comprises interconnecting the nodes in a network with fiber optic cables. A single fiber optic cable carries both express channel traffic and local channel traffic, e.g., in a massively parallel processor (MPP) network. Express channel traffic is placed on, or filtered from, the fiber optic cable at a light frequency or a color different from that of the local channel traffic. The express channel traffic is thus placed on a light carrier that skips over the local intermediate nodes one-by-one by reflecting off of selective mirrors placed at each local node. The local-channel-traffic light carriers pass through the selective mirrors and are not reflected. A single fiber optic cable can thus be threaded throughout a three-dimensional matrix of nodes with the x,y,z directions of propagation encoded by the color of the respective light carriers for both local and express channel traffic. Thus frequency division multiple access is used to hierarchically separate the local and express channels to eliminate the bucket brigade latencies that would otherwise result if the express traffic had to hop between every local node to reach its ultimate destination.
Investigation of Large Scale Cortical Models on Clustered Multi-Core Processors

DTIC Science & Technology

2013-02-01

with the bias node (gray) denoted as ww and the weights associated with the remaining first layer nodes (black) denoted as W. In forming the overall...Implementation of RBF network on GPU Platform 3.5.1 The Cholesky decomposition algorithm We need to invert the matrix multiplication GTG to
Energy-Efficient Querying of Wireless Sensor Networks

DTIC Science & Technology

2007-09-01

will fail to locate the desired information. Depending on the rate of node movement, this data exchange will be costly in terms of total network...nodes is best accomplished using a small time window to reduce errors introduced by the node’s movement (i.e., older measurements are less likely to...embedded processor or input from upper layer applications,” nodes which detect their own movement transmit an alert signal over a “wake-up” channel
Experiments and Simulations on Magnetically Driven Implosions in High Repetition Rate Dense Plasma Focus

NASA Astrophysics Data System (ADS)

Caballero Bendixsen, Luis; Bott-Suzuki, Simon; Cordaro, Samuel; Krishnan, Mahadevan; Chapman, Stephen; Coleman, Phil; Chittenden, Jeremy

2015-11-01

Results will be shown on coordinated experiments and MHD simulations on magnetically driven implosions, with an emphasis on current diffusion and heat transport. Experiments are run at a Mather-type dense plasma focus (DPF-3, Vc: 20 kV, Ip: 480 kA, E: 5.8 kJ). Typical experiments are run at 300 kA and 0.33 Hz repetition rate with different gas loads (Ar, Ne, and He) at pressures of ~1-3 Torr, usually gathering 1000 shots per day. Simulations are run on a 96-core HP blade server cluster using 3 GHz processors with 4 GB RAM per node. Preliminary results show axial and radial phase plasma sheath velocities of ~1×10^5 m/s. These are in agreement with the snow-plough model of DPFs. Peak magnetic fields of ~1 Tesla in the radial compression phase are measured. Electron densities on the order of 10^18 cm^-3 are anticipated. Comparison between 2D and 3D models with empirical results shows good agreement in the axial and radial phases.

ALMA Correlator Real-Time Data Processor

NASA Astrophysics Data System (ADS)

Pisano, J.; Amestica, R.; Perez, J.

2005-10-01

The design of a real-time Linux application utilizing Real-Time Application Interface (RTAI) to process real-time data from the radio astronomy correlator for the Atacama Large Millimeter Array (ALMA) is described. The correlator is a custom-built digital signal processor which computes the cross-correlation function of two digitized signal streams. ALMA will have 64 antennas with 2080 signal streams each with a sample rate of 4 giga-samples per second. The correlator's aggregate data output will be 1 gigabyte per second. The software is defined by hard deadlines with high input and processing data rates, while requiring interfaces to non real-time external computers. The designed computer system - the Correlator Data Processor or CDP, consists of a cluster of 17 SMP computers, 16 of which are compute nodes plus a master controller node all running real-time Linux kernels. Each compute node uses an RTAI kernel module to interface to a 32-bit parallel interface which accepts raw data at 64 megabytes per second in 1 megabyte chunks every 16 milliseconds. These data are transferred to tasks running on multiple CPUs in hard real-time using RTAI's LXRT facility to perform quantization corrections, data windowing, FFTs, and phase corrections for a processing rate of approximately 1 GFLOPS. Highly accurate timing signals are distributed to all seventeen computer nodes in order to synchronize them to other time-dependent devices in the observatory array. RTAI kernel tasks interface to the timing signals providing sub-millisecond timing resolution. The CDP interfaces, via the master node, to other computer systems on an external intra-net for command and control, data storage, and further data (image) processing. The master node accesses these external systems utilizing ALMA Common Software (ACS), a CORBA-based client-server software infrastructure providing logging, monitoring, data delivery, and intra-computer function invocation. 
The software is being developed in tandem with the correlator hardware which presents software engineering challenges as the hardware evolves. The current status of this project and future goals are also presented.
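
The data rates quoted in this abstract can be cross-checked with simple arithmetic. The sketch below uses only the quoted figures; the variable names are illustrative, and the abstract does not say whether "megabyte" is meant in the decimal or binary sense, so the comparison is approximate.

```python
COMPUTE_NODES = 16        # compute nodes in the 17-computer cluster
CHUNK_MEGABYTES = 1.0     # size of each raw-data chunk
CHUNK_PERIOD_S = 0.016    # one chunk every 16 milliseconds
NODE_RATE_MB_S = 64.0     # quoted per-node input rate

# One 1-megabyte chunk every 16 ms is roughly 62.5 MB/s, close to the quoted 64 MB/s.
rate_from_chunks = CHUNK_MEGABYTES / CHUNK_PERIOD_S

# Sixteen compute nodes at 64 MB/s give 1024 MB/s, matching the ~1 gigabyte per second aggregate.
aggregate_rate = COMPUTE_NODES * NODE_RATE_MB_S
```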
Dynamic Sensor Networks

DTIC Science & Technology

2004-03-01

turned off. SLEEP Set the timer for 30 seconds before scheduled transmit time, then sleep the processor. WAKE When timer trips, power up the processor...slots where none of its neighbors are scheduled to transmit. This allows the sensor nodes to perform a simple power management scheme that puts the...routing This simple case study highlights the following crucial observation: optimal traffic scheduling in energy constrained networks requires future
High Performance Active Database Management on a Shared-Nothing Parallel Processor

DTIC Science & Technology

1998-05-01

either stored or virtual. A stored node is like a materialized view. It actually contains the specified tuples. A virtual node is like a real view...
Network Coding on Heterogeneous Multi-Core Processors for Wireless Sensor Networks

PubMed Central

Kim, Deokho; Park, Karam; Ro, Won W.

2011-01-01

While network coding is well known for its efficiency and usefulness in wireless sensor networks, the excessive costs associated with decoding computation and complexity still hinder its adoption into practical use. On the other hand, high-performance microprocessors with heterogeneous multi-cores would be used as processing nodes of the wireless sensor networks in the near future. To this end, this paper introduces an efficient network coding algorithm developed for the heterogeneous multi-core processors. The proposed idea is fully tested on one of the currently available heterogeneous multi-core processors referred to as the Cell Broadband Engine. PMID:22164053
Treecode with a Special-Purpose Processor

NASA Astrophysics Data System (ADS)

Makino, Junichiro

1991-08-01

We describe an implementation of the modified Barnes-Hut tree algorithm for a gravitational N-body calculation on a GRAPE (GRAvity PipE) backend processor. GRAPE is a special-purpose computer for N-body calculations. It receives the positions and masses of particles from a host computer and then calculates the gravitational force at each coordinate specified by the host. To use this GRAPE processor with the hierarchical tree algorithm, the host computer must maintain a list of all nodes that exert force on a particle. If we create this list for each particle of the system at each timestep, the number of floating-point operations on the host and that on GRAPE would become comparable, and the increased speed obtained by using GRAPE would be small. In our modified algorithm, we create a list of nodes for many particles. Thus, the amount of the work required of the host is significantly reduced. This algorithm was originally developed by Barnes in order to vectorize the force calculation on a Cyber 205. With this algorithm, the computing time of the force calculation becomes comparable to that of the tree construction, if the GRAPE backend processor is sufficiently fast. The obtained speed-up factor is 30 to 50 for a RISC-based host computer and GRAPE-1A with a peak speed of 240 Mflops.
SpaceWire Driver Software for Special DSPs

NASA Technical Reports Server (NTRS)

Clark, Douglas; Lux, James; Nishimoto, Kouji; Lang, Minh

2003-01-01

A computer program provides a high-level C-language interface to electronics circuitry that controls a SpaceWire interface in a system based on a space qualified version of the ADSP-21020 digital signal processor (DSP). SpaceWire is a spacecraft-oriented standard for packet-switching data-communication networks that comprise nodes connected through bidirectional digital serial links that utilize low-voltage differential signaling (LVDS). The software is tailored to the SMCS-332 application-specific integrated circuit (ASIC) (also available as the TSS901E), which provides three high-speed (150 Mbps) serial point-to-point links compliant with the proposed Institute of Electrical and Electronics Engineers (IEEE) Standard 1355.2 and equivalent European Space Agency (ESA) Standard ECSS-E-50-12. In the specific application of this software, the SpaceWire ASIC was combined with the DSP processor, memory, and control logic in a Multi-Chip Module DSP (MCM-DSP). The software is a collection of low-level driver routines that provide a simple message-passing application programming interface (API) for software running on the DSP. Routines are provided for interrupt-driven access to the two styles of interface provided by the SMCS: (1) the "word at a time" conventional host interface (HOCI); and (2) a higher performance "dual port memory" style interface (COMI).
Parallel discrete event simulation: A shared memory approach

NASA Technical Reports Server (NTRS)

Reed, Daniel A.; Malony, Allen D.; Mccredie, Bradley D.

1987-01-01

With traditional event list techniques, evaluating a detailed discrete event simulation model can often require hours or even days of computation time. Parallel simulation mimics the interacting servers and queues of a real system by assigning each simulated entity to a processor. By eliminating the event list and maintaining only sufficient synchronization to ensure causality, parallel simulation can potentially provide speedups that are linear in the number of processors. A set of shared memory experiments is presented using the Chandy-Misra distributed simulation algorithm to simulate networks of queues. Parameters include queueing network topology and routing probabilities, number of processors, and assignment of network nodes to processors. These experiments show that Chandy-Misra distributed simulation is a questionable alternative to sequential simulation of most queueing network models.
Efficient packet forwarding using cyber-security aware policies

DOEpatents

Ros-Giralt, Jordi

2017-04-04

For balancing load, a forwarder can selectively direct data from the forwarder to a processor according to a loading parameter. The selective direction includes forwarding the data to the processor for processing, transforming and/or forwarding the data to another node, and dropping the data. The forwarder can also adjust the loading parameter based on, at least in part, feedback received from the processor. One or more processing elements can store values associated with one or more flows into a structure without locking the structure. The stored values can be used to determine how to direct the flows, e.g., whether to process a flow or to drop it. The structure can be used within an information channel providing feedback to a processor.
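
One possible reading of the feedback loop in this abstract is sketched below. Everything specific here is an assumption made for illustration: the loading parameter is interpreted as the share of flows sent to the processor, flows are bucketed by a CRC32 hash, and the class, method, and argument names are hypothetical. The patent itself may work quite differently.

```python
import zlib

class Forwarder:
    """Toy forwarder: `load` is the (assumed) share of flows sent to the processor."""

    def __init__(self, load=1.0, step=0.1):
        self.load = load    # hypothetical loading parameter in [0, 1]
        self.step = step    # hypothetical adjustment applied per feedback message

    def direct(self, flow_id, next_node_available=True):
        """Return 'process', 'forward', or 'drop' for the given flow."""
        # Hash the flow id to a stable value in [0, 1) so a flow is treated consistently.
        bucket = (zlib.crc32(flow_id.encode()) % 1000) / 1000.0
        if bucket < self.load:
            return "process"
        return "forward" if next_node_available else "drop"

    def feedback(self, overloaded):
        """Lower the parameter when the processor reports overload, raise it otherwise."""
        delta = -self.step if overloaded else self.step
        self.load = min(1.0, max(0.0, self.load + delta))
```

For example, `Forwarder(load=1.0).direct("flow-a")` returns `"process"`, and repeated `feedback(True)` calls shift a growing share of flows to the forward or drop paths.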
Efficient packet forwarding using cyber-security aware policies

DOEpatents

Ros-Giralt, Jordi

2017-10-25

For balancing load, a forwarder can selectively direct data from the forwarder to a processor according to a loading parameter. The selective direction includes forwarding the data to the processor for processing, transforming and/or forwarding the data to another node, and dropping the data. The forwarder can also adjust the loading parameter based on, at least in part, feedback received from the processor. One or more processing elements can store values associated with one or more flows into a structure without locking the structure. The stored values can be used to determine how to direct the flows, e.g., whether to process a flow or to drop it. The structure can be used within an information channel providing feedback to a processor.
Checkpointing for a hybrid computing node

DOE Office of Scientific and Technical Information (OSTI.GOV)

Cher, Chen-Yong

2016-03-08

According to an aspect, a method for checkpointing in a hybrid computing node includes executing a task in a processing accelerator of the hybrid computing node. A checkpoint is created in a local memory of the processing accelerator. The checkpoint includes state data to restart execution of the task in the processing accelerator upon a restart operation. Execution of the task is resumed in the processing accelerator after creating the checkpoint. The state data of the checkpoint are transferred from the processing accelerator to a main processor of the hybrid computing node while the processing accelerator is executing the task.
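
The sequence in this abstract (checkpoint locally, resume the task, transfer the state in the background) can be mimicked in a few lines. This is a toy sketch under stated assumptions: plain dictionaries stand in for accelerator and host memory, a Python thread stands in for the transfer engine, and the function and variable names are hypothetical.

```python
import copy
import threading

def run_task(total_steps, checkpoint_step):
    """Run a toy accumulation task, checkpointing once without waiting for the transfer."""
    accelerator_state = {"step": 0, "total": 0}   # state held by the accelerator
    host_memory = {}                               # stands in for the main processor's memory
    transfer = None

    for i in range(total_steps):
        accelerator_state["total"] += i
        accelerator_state["step"] = i + 1
        if accelerator_state["step"] == checkpoint_step:
            # 1. Create the checkpoint in the accelerator's local memory.
            local_checkpoint = copy.deepcopy(accelerator_state)
            # 2. Start moving it to the host in the background.
            transfer = threading.Thread(target=host_memory.update, args=(local_checkpoint,))
            transfer.start()
            # 3. The loop continues immediately, so the task resumes during the transfer.

    if transfer is not None:
        transfer.join()
    return accelerator_state, host_memory

final_state, saved = run_task(total_steps=10, checkpoint_step=4)
```

Because the checkpoint is a deep copy, later updates to the running state do not alter what reaches the host, which is what allows a restart from step 4.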
Spacecraft On-Board Information Extraction Computer (SOBIEC)

NASA Technical Reports Server (NTRS)

Eisenman, David; Decaro, Robert E.; Jurasek, David W.

1994-01-01

The Jet Propulsion Laboratory is the Technical Monitor on an SBIR Program issued for Irvine Sensors Corporation to develop a highly compact, dual use massively parallel processing node known as SOBIEC. SOBIEC couples 3D memory stacking technology provided by nCUBE. The node contains sufficient network Input/Output to implement up to an order-13 binary hypercube. The benefit of this network is that it scales linearly as more processors are added, and it is a superset of other commonly used interconnect topologies such as meshes, rings, toroids, and trees. In this manner, a distributed processing network can be easily devised and supported. The SOBIEC node has sufficient memory for most multi-computer applications, and also supports external memory expansion and DMA interfaces. The SOBIEC node is supported by a mature set of software development tools from nCUBE. The nCUBE operating system (OS) provides configuration and operational support for up to 8000 SOBIEC processors in an order-13 binary hypercube or any subset or partition(s) thereof. The OS is UNIX (USL SVR4) compatible, with C, C++, and FORTRAN compilers readily available. A stand-alone development system is also available to support SOBIEC test and integration.
New Modular Ultrasonic Signal Processing Building Blocks for Real-Time Data Acquisition and Post Processing

NASA Astrophysics Data System (ADS)

Weber, Walter H.; Mair, H. Douglas; Jansen, Dion

2003-03-01

A suite of basic signal processors has been developed. These basic building blocks can be cascaded together to form more complex processors without the need for programming. The data structures between each of the processors are handled automatically. This allows a processor built for one purpose to be applied to any type of data such as images, waveform arrays and single values. The processors are part of Winspect Data Acquisition software. The new processors are fast enough to work on A-scan signals live while scanning. Their primary use is to extract features, reduce noise or to calculate material properties. The cascaded processors work equally well on live A-scan displays, live gated data or as a post-processing engine on saved data. Researchers are able to call their own MATLAB or C-code from anywhere within the processor structure. A built-in formula node processor that uses a simple algebraic editor may make external user programs unnecessary. This paper also discusses the problems associated with ad hoc software development and how graphical programming languages can tie up researchers writing software rather than designing experiments.
Systems and Methods for Locating a Target in a GPS-Denied Environment

NASA Technical Reports Server (NTRS)

Mackay, John D. (Inventor); Murdock, Ronald G. (Inventor); Cummins, Douglas A. (Inventor)

2017-01-01

A system for locating an object in a GPS-denied environment includes first and second stationary nodes of a network and an object out of synchronization with a common time base of the network. The system includes one or more processors that are configured to estimate a distance between the first stationary node and the object and a distance between the second stationary node and the object by comparing time-stamps of messages relayed between the object and the nodes. A position of the object can then be trilaterated using a location of each of the first and second stationary nodes and the measured distances between the object and each of the first and second stationary nodes.
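The abstract above does not give the ranging formula. As a hedged illustration only, the following Python sketch uses the standard two-way ranging calculation, which is one common way to estimate distance from time-stamps when the object's clock is not synchronized with the network. The function name and the four-time-stamp message exchange are assumptions, not details from the patent.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458.0


def two_way_range(t1: float, t2: float, t3: float, t4: float) -> float:
    """Estimate node-to-object distance in meters from four time-stamps.

    t1: node sends request      (node clock)
    t2: object receives request (object clock)
    t3: object sends reply      (object clock)
    t4: node receives reply     (node clock)
    """
    round_trip = t4 - t1   # measured entirely on the node's clock
    turnaround = t3 - t2   # measured entirely on the object's clock
    # Each difference uses a single clock, so the unknown offset between
    # the two clocks cancels out of the time-of-flight estimate.
    return SPEED_OF_LIGHT_M_PER_S * (round_trip - turnaround) / 2.0
```

Repeating the exchange with each stationary node gives the distances that the trilateration step needs.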
Performances of multiprocessor multidisk architectures for continuous media storage

NASA Astrophysics Data System (ADS)

Gennart, Benoit A.; Messerli, Vincent; Hersch, Roger D.

1996-03-01

Multimedia interfaces increase the need for large image databases, capable of storing and reading streams of data with strict synchronicity and isochronicity requirements. In order to fulfill these requirements, we consider a parallel image server architecture which relies on arrays of intelligent disk nodes, each disk node being composed of one processor and one or more disks. This contribution analyzes through bottleneck performance evaluation and simulation the behavior of two multi-processor multi-disk architectures: a point-to-point architecture and a shared-bus architecture similar to current multiprocessor workstation architectures. We compare the two architectures on the basis of two multimedia algorithms: the compute-bound frame resizing by resampling and the data-bound disk-to-client stream transfer. The results suggest that the shared bus is a potential bottleneck despite its very high hardware throughput (400Mbytes/s) and that an architecture with addressable local memories located close to their respective processors could partially remove this bottleneck. The point-to-point architecture is scalable and able to sustain high throughputs for simultaneous compute-bound and data-bound operations.
Class network routing

DOEpatents

Bhanot, Gyan [Princeton, NJ; Blumrich, Matthias A [Ridgefield, CT; Chen, Dong [Croton On Hudson, NY; Coteus, Paul W [Yorktown Heights, NY; Gara, Alan G [Mount Kisco, NY; Giampapa, Mark E [Irvington, NY; Heidelberger, Philip [Cortlandt Manor, NY; Steinmacher-Burow, Burkhard D [Mount Kisco, NY; Takken, Todd E [Mount Kisco, NY; Vranas, Pavlos M [Bedford Hills, NY

2009-09-08

Class network routing is implemented in a network such as a computer network comprising a plurality of parallel compute processors at nodes thereof. Class network routing allows a compute processor to broadcast a message to a range (one or more) of other compute processors in the computer network, such as processors in a column or a row. Normally this type of operation requires a separate message to be sent to each processor. With class network routing pursuant to the invention, a single message is sufficient, which generally reduces the total number of messages in the network as well as the latency to do a broadcast. Class network routing is also applied to dense matrix inversion algorithms on distributed memory parallel supercomputers with hardware class function (multicast) capability. This is achieved by exploiting the fact that the communication patterns of dense matrix inversion can be served by hardware class functions, which results in faster execution times.
Using algebra for massively parallel processor design and utilization

NASA Technical Reports Server (NTRS)

Campbell, Lowell; Fellows, Michael R.

1990-01-01

This paper summarizes the authors' advances in the design of dense processor networks. Within is reported a collection of recent constructions of dense symmetric networks that provide the largest known values for the number of nodes that can be placed in a network of a given degree and diameter. The constructions are in the range of current potential engineering significance and are based on groups of automorphisms of finite-dimensional vector spaces.
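For context on the degree/diameter problem described above, the Moore bound is the standard upper limit on how many nodes a network of a given degree and diameter can contain. The abstract does not mention it. The Python sketch below, with a hypothetical function name, only shows the ceiling that such constructions are measured against.

```python
def moore_bound(degree: int, diameter: int) -> int:
    """Upper bound on the node count of a graph with the given degree and diameter."""
    if degree < 1 or diameter < 0:
        raise ValueError("degree must be >= 1 and diameter must be >= 0")
    total = 1        # the starting node itself
    layer = degree   # at most `degree` nodes at distance 1
    for _ in range(diameter):
        total += layer
        # Each node in a layer has one link back toward the start,
        # leaving at most (degree - 1) links to reach new nodes.
        layer *= degree - 1
    return total
```

For degree 3 and diameter 2 the bound is 10, which the Petersen graph attains. For most degree and diameter pairs no graph reaches the bound, which is why constructions of the largest known networks are of interest.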
DFT algorithms for bit-serial GaAs array processor architectures

NASA Technical Reports Server (NTRS)

Mcmillan, Gary B.

1988-01-01

Systems and Processes Engineering Corporation (SPEC) has developed an innovative array processor architecture for computing Fourier transforms and other commonly used signal processing algorithms. This architecture is designed to extract the highest possible array performance from state-of-the-art GaAs technology. SPEC's architectural design includes a high performance RISC processor implemented in GaAs, along with a Floating Point Coprocessor and a unique Array Communications Coprocessor, also implemented in GaAs technology. Together, these data processors represent the latest in technology, both from an architectural and implementation viewpoint. SPEC has examined numerous algorithms and parallel processing architectures to determine the optimum array processor architecture. SPEC has developed an array processor architecture with integral communications ability to provide maximum node connectivity. The Array Communications Coprocessor embeds communications operations directly in the core of the processor architecture. A Floating Point Coprocessor architecture has been defined that utilizes Bit-Serial arithmetic units, operating at very high frequency, to perform floating point operations. These Bit-Serial devices reduce the device integration level and complexity to a level compatible with state-of-the-art GaAs device technology.
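To illustrate the bit-serial arithmetic mentioned above, here is a minimal Python sketch of a bit-serial adder: one full adder plus a one-bit carry register, processing one bit position per clock. It models unsigned integer addition only and is not SPEC's floating-point design; the function name and word width are assumptions for illustration.

```python
def bit_serial_add(a: int, b: int, width: int) -> int:
    """Add two unsigned `width`-bit words one bit per clock, least significant bit first."""
    carry = 0    # models the single carry flip-flop
    result = 0
    for clock in range(width):
        bit_a = (a >> clock) & 1
        bit_b = (b >> clock) & 1
        # Full-adder logic for this clock cycle.
        sum_bit = bit_a ^ bit_b ^ carry
        carry = (bit_a & bit_b) | (carry & (bit_a ^ bit_b))
        result |= sum_bit << clock
    # The final carry is dropped, so the sum wraps modulo 2**width.
    return result
```

The logic needed does not grow with the word width; only the number of clock cycles does. That is the trade the abstract describes: lower device integration in exchange for a very high clock frequency.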
Implementing direct, spatially isolated problems on transputer networks

NASA Technical Reports Server (NTRS)

Ellis, Graham K.

1988-01-01

Parametric studies were performed on transputer networks of up to 40 processors to determine how to implement and maximize the performance of the solution of problems where no processor-to-processor data transfer is required for the problem solution (spatially isolated). Two types of problems are investigated: a computationally intensive problem where the solution required the transmission of 160 bytes of data through the parallel network, and a communication intensive example that required the transmission of 3 Mbytes of data through the network. This data consists of solutions being sent back to the host processor and not intermediate results for another processor to work on. Studies were performed on both integer and floating-point transputers. The latter features an on-chip floating-point math unit and offers approximately an order of magnitude performance increase over the integer transputer on real valued computations. The results indicate that a minimum amount of work is required on each node per communication to achieve high network speedups (efficiencies). The floating-point processor requires approximately an order of magnitude more work per communication than the integer processor because of the floating-point unit's increased computing capacity.
Support for Diagnosis of Custom Computer Hardware

NASA Technical Reports Server (NTRS)

Molock, Dwaine S.

2008-01-01

The Coldfire SDN Diagnostics software is a flexible means of exercising, testing, and debugging custom computer hardware. The software is a set of routines that, collectively, serve as a common software interface through which one can gain access to various parts of the hardware under test and/or cause the hardware to perform various functions. The routines can be used to construct tests to exercise, and verify the operation of, various processors and hardware interfaces. More specifically, the software can be used to gain access to memory, to execute timer delays, to configure interrupts, and configure processor cache, floating-point, and direct-memory-access units. The software is designed to be used on diverse NASA projects, and can be customized for use with different processors and interfaces. The routines are supported, regardless of the architecture of a processor that one seeks to diagnose. The present version of the software is configured for Coldfire processors on the Subsystem Data Node processor boards of the Solar Dynamics Observatory. There is also support for the software with respect to Mongoose V, RAD750, and PPC405 processors or their equivalents.
Parallel discrete event simulation using shared memory

NASA Technical Reports Server (NTRS)

Reed, Daniel A.; Malony, Allen D.; Mccredie, Bradley D.

1988-01-01

With traditional event-list techniques, evaluating a detailed discrete-event simulation model can often require hours or even days of computation time. By eliminating the event list and maintaining only sufficient synchronization to ensure causality, parallel simulation can potentially provide speedups that are linear in the number of processors. A set of shared-memory experiments, using the Chandy-Misra distributed-simulation algorithm, to simulate networks of queues is presented. Parameters of the study include queueing network topology and routing probabilities, number of processors, and assignment of network nodes to processors. These experiments show that Chandy-Misra distributed simulation is a questionable alternative to sequential simulation of most queueing network models.

A message passing kernel for the hypercluster parallel processing test bed

NASA Technical Reports Server (NTRS)

Blech, Richard A.; Quealy, Angela; Cole, Gary L.

1989-01-01

A Message-Passing Kernel (MPK) for the Hypercluster parallel-processing test bed is described. The Hypercluster is being developed at the NASA Lewis Research Center to support investigations of parallel algorithms and architectures for computational fluid and structural mechanics applications. The Hypercluster resembles the hypercube architecture except that each node consists of multiple processors communicating through shared memory. The MPK efficiently routes information through the Hypercluster, using a message-passing protocol when necessary and faster shared-memory communication whenever possible. The MPK also interfaces all of the processors with the Hypercluster operating system (HYCLOPS), which runs on a Front-End Processor (FEP). This approach distributes many of the I/O tasks to the Hypercluster processors and eliminates the need for a separate I/O support program on the FEP.
A Scalable Software Architecture Booting and Configuring Nodes in the Whitney Commodity Computing Testbed

NASA Technical Reports Server (NTRS)

Fineberg, Samuel A.; Kutler, Paul (Technical Monitor)

1997-01-01

The Whitney project is integrating commodity off-the-shelf PC hardware and software technology to build a parallel supercomputer with hundreds to thousands of nodes. To build such a system, one must have a scalable software model, and the installation and maintenance of the system software must be completely automated. We describe the design of an architecture for booting, installing, and configuring nodes in such a system with particular consideration given to scalability and ease of maintenance. This system has been implemented on a 40-node prototype of Whitney and is to be used on the 500 processor Whitney system to be built in 1998.
The Carnegie Mellon University Insert Project

DTIC Science & Technology

1997-02-01

Real-Time Systems (INSERT) project under the DARPA Evolutionary Design for Complex Software (EDCS) Program. The INSERT team has completed an initial API definition and ported the existing real-time publication subscription group communication software to LynxOS 2.4, a POSIX.1b compliant OS. The distributed real-time publisher/subscriber communication model is now supported by a processor membership protocol which allows a node in the system to fail, or to rejoin the system later. When a node fails, all the publishers and subscribers on that node have to be
PANDA: A distributed multiprocessor operating system

DOE Office of Scientific and Technical Information (OSTI.GOV)

Chubb, P.

1989-01-01

PANDA is a design for a distributed multiprocessor and an operating system. PANDA is designed to allow easy expansion of both hardware and software. As such, the PANDA kernel provides only message passing and memory and process management. The other features needed for the system (device drivers, secondary storage management, etc.) are provided as replaceable user tasks. The thesis presents PANDA's design and implementation, both hardware and software. PANDA uses multiple 68010 processors sharing memory on a VME bus, each such node potentially connected to others via a high speed network. The machine is completely homogeneous: there are no differences between processors that are detectable by programs running on the machine. A single two-processor node has been constructed. Each processor contains memory management circuits designed to allow processors to share page tables safely. PANDA presents a programmers' model similar to the hardware model: a job is divided into multiple tasks, each having its own address space. Within each task, multiple processes share code and data. Tasks can send messages to each other, and set up virtual circuits between themselves. Peripheral devices such as disc drives are represented within PANDA by tasks. PANDA divides secondary storage into volumes, each volume being accessed by a volume access task, or VAT. All knowledge about the way that data is stored on a disc is kept in its volume's VAT. The design is such that PANDA should provide a useful testbed for file systems and device drivers, as these can be installed without recompiling PANDA itself, and without rebooting the machine.
HeinzelCluster: accelerated reconstruction for FORE and OSEM3D.

PubMed

Vollmar, S; Michel, C; Treffert, J T; Newport, D F; Casey, M; Knöss, C; Wienhard, K; Liu, X; Defrise, M; Heiss, W D

2002-08-07

Using iterative three-dimensional (3D) reconstruction techniques for reconstruction of positron emission tomography (PET) is not feasible on most single-processor machines due to the excessive computing time needed, especially so for the large sinogram sizes of our high-resolution research tomograph (HRRT). In our first approach to speed up reconstruction time we transform the 3D scan into the format of a two-dimensional (2D) scan with sinograms that can be reconstructed independently using Fourier rebinning (FORE) and a fast 2D reconstruction method. On our dedicated reconstruction cluster (seven four-processor systems, Intel PIII@700 MHz, switched fast ethernet and Myrinet, Windows NT Server), we process these 2D sinograms in parallel. We have achieved a speedup > 23 using 26 processors and also compared results for different communication methods (RPC, Syngo, Myrinet GM). The other approach is to parallelize OSEM3D (implementation of C Michel), which has produced the best results for HRRT data so far and is more suitable for an adequate treatment of the sinogram gaps that result from the detector geometry of the HRRT. We have implemented two levels of parallelization for our dedicated cluster (a shared memory fine-grain level on each node utilizing all four processors and a coarse-grain level allowing for 15 nodes), reducing the time for one core iteration from over 7 h to about 35 min.
Status of the Node 3 Regenerative Environmental Control & Life Support System Water Recovery & Oxygen Generation Systems

NASA Technical Reports Server (NTRS)

Carrasquillo, Robyn L.

2003-01-01

NASA's Marshall Space Flight Center is providing three racks containing regenerative water recovery and oxygen generation systems (WRS and OGS) for flight on the International Space Station's (ISS) Node 3 element. The major assemblies included in these racks are the Water Processor Assembly (WPA), Urine Processor Assembly (UPA), Oxygen Generation Assembly (OGA), and the Power Supply Module (PSM) supporting the OGA. The WPA and OGA are provided by Hamilton Sundstrand Space Systems International (HSSSI), while the UPA and PSM are being designed and manufactured in-house by MSFC. The assemblies are currently in the manufacturing and test phase and are to be completed and integrated into flight racks this year. This paper gives an overview of the technologies and system designs, technical challenges encountered and solved, and the current status.
High-performance parallel processors based on star-coupled wavelength division multiplexing optical interconnects

DOEpatents

Deri, Robert J.; DeGroot, Anthony J.; Haigh, Ronald E.

2002-01-01

As the performance of individual elements within parallel processing systems increases, increased communication capability between distributed processor and memory elements is required. There is great interest in using fiber optics to improve interconnect communication beyond that attainable using electronic technology. Several groups have considered WDM, star-coupled optical interconnects. The invention uses a fiber optic transceiver to provide low latency, high bandwidth channels for such interconnects using a robust multimode fiber technology. Instruction-level simulation is used to quantify the bandwidth, latency, and concurrency required for such interconnects to scale to 256 nodes, each operating at 1 GFLOPS performance. Performance has been shown to scale to ≈100 GFLOPS for scientific application kernels using a small number of wavelengths (8 to 32), only one wavelength received per node, and achievable optoelectronic bandwidth and latency.
DMA engine for repeating communication patterns

DOEpatents

Chen, Dong; Gara, Alan G.; Giampapa, Mark E.; Heidelberger, Philip; Steinmacher-Burow, Burkhard; Vranas, Pavlos

2010-09-21

A parallel computer system is constructed as a network of interconnected compute nodes to operate a global message-passing application for performing communications across the network. Each of the compute nodes includes one or more individual processors with memories which run local instances of the global message-passing application operating at each compute node to carry out local processing operations independent of processing operations carried out at other compute nodes. Each compute node also includes a DMA engine constructed to interact with the application via Injection FIFO Metadata describing multiple Injection FIFOs where each Injection FIFO may contain an arbitrary number of message descriptors in order to process messages with a fixed processing overhead irrespective of the number of message descriptors included in the Injection FIFO.
Reconfiguration in Robust Distributed Real-Time Systems Based on Global Checkpoints

DTIC Science & Technology

1991-12-01

achieved by utilizing distributed systems in which a single application program executes on multiple processors, connected to a network. The distributed...single application program executes on multiple processors, connected to a network. The distributed nature of such systems make it possible to ...resident at every node. However, the responsibility for execution of a particular function is assigned to only one node in this framework. This function
Homemade Buckeye-Pi: A Learning Many-Node Platform for High-Performance Parallel Computing

NASA Astrophysics Data System (ADS)

Amooie, M. A.; Moortgat, J.

2017-12-01

We report on the "Buckeye-Pi" cluster, the supercomputer developed in The Ohio State University School of Earth Sciences from 128 inexpensive Raspberry Pi (RPi) 3 Model B single-board computers. Each RPi is equipped with a fast Quad Core 1.2GHz ARMv8 64bit processor, 1GB of RAM, and a 32GB microSD card for local storage. Therefore, the cluster has a total RAM of 128GB that is distributed on the individual nodes and a flash capacity of 4TB with 512 processors, while it benefits from low power consumption, easy portability, and low total cost. The cluster uses the Message Passing Interface protocol to manage the communications between each node. These features render our platform the most powerful RPi supercomputer to date and suitable for educational applications in high-performance-computing (HPC) and handling of large datasets. In particular, we use the Buckeye-Pi to implement optimized parallel codes in our in-house simulator for subsurface media flows with the goal of achieving a massively-parallelized scalable code. We present benchmarking results for the computational performance across various numbers of RPi nodes. We believe our project could inspire scientists and students to consider the proposed unconventional cluster architecture as a mainstream and feasible learning platform for challenging engineering and scientific problems.
Reliable appropriate topology design for multiple-processor systems

DOE Office of Scientific and Technical Information (OSTI.GOV)

Chou, C.P.

1987-01-01

A Shift and Replace Graph, which is a very appropriate candidate for the topology of a multiple-processor system, is a function of two positive integers r and m, and is denoted as SRG(r,m). Pradhan and Reddy proved that the node connectivity of SRG(r,m) is at least r and also gave a routing algorithm which generally requires 2m jumps if the number of node failures is no larger than r - 1. Later, Esfahanian and Hakimi proved that SRG(r,m) has maximum node connectivity 2r - 2 and gave routing algorithms which require: (1) at most m + 3 + log_r m jumps if 3 + log_r m does not exceed m and the number of node failures is at most r - 1; (2) at most m + 5 + log_r m jumps if 4 + log_r m is less than or equal to m and the number of node failures is less than or equal to 2r - 3; (3) all the other situations require no more than 2m jumps. By modifying the SRG(r,m), it is first proved that the node connectivity of SRG(r,m) can be increased to: (1) 2r - 1 when r = 2, m = 2, and (2) 2r when (r = 2, m > 2) or (r > 2, m greater than or equal to 2). The routing algorithms are also given for the modified SRG(r,m), which require at most 2m + 3 jumps when the number of node failures is less than or equal to 2r - 1.
Network of dedicated processors for finding lowest-cost map path

NASA Technical Reports Server (NTRS)

Eberhardt, Silvio P. (Inventor)

1991-01-01

A method and associated apparatus are disclosed for finding the lowest cost path of several variable paths. The paths are comprised of a plurality of linked cost-incurring areas existing between an origin point and a destination point. The method comprises the steps of connecting a plurality of nodes together in the manner of the cost-incurring areas; programming each node to have a cost associated therewith corresponding to one of the cost-incurring areas; injecting a signal into one of the nodes representing the origin point; propagating the signal through the plurality of nodes from inputs to outputs; reducing the signal in magnitude at each node as a function of the respective cost of the node; and, starting at one of the nodes representing the destination point and following a path having the least reduction in magnitude of the signal from node to node back to one of the nodes representing the origin point whereby the lowest cost path from the origin point to the destination point is found.
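The steps in the abstract above can be sketched in software. This is a minimal Python simulation under stated assumptions, not the patented hardware: each node attenuates its strongest input by exp(-cost), which is one possible choice of "a function of the respective cost", links are bidirectional, and costs are strictly positive. All names are hypothetical.

```python
import math


def lowest_cost_path(links, cost, origin, destination):
    """Find the lowest-cost path by propagating an attenuated signal from the origin.

    links: dict mapping each node to a list of directly connected nodes
    cost:  dict mapping each node to a strictly positive cost
    """
    if any(c <= 0 for c in cost.values()):
        raise ValueError("node costs must be strictly positive")
    attenuation = {node: math.exp(-cost[node]) for node in links}
    signal = {node: 0.0 for node in links}
    signal[origin] = attenuation[origin]   # unit signal injected at the origin

    # Propagate until the network settles: each node outputs its strongest
    # input, reduced according to its own cost.
    changed = True
    while changed:
        changed = False
        for node in links:
            if node == origin:
                continue
            strongest = max((signal[nb] for nb in links[node]), default=0.0)
            new_level = strongest * attenuation[node]
            if new_level > signal[node]:
                signal[node] = new_level
                changed = True

    if signal[destination] == 0.0:
        return None                        # destination never received the signal

    # Walk back from the destination, always stepping to the neighbor with
    # the strongest signal, i.e. the least reduction in magnitude.
    path = [destination]
    while path[-1] != origin:
        here = path[-1]
        path.append(max(links[here], key=lambda nb: signal[nb]))
    path.reverse()
    return path


links = {"A": ["B", "C"], "B": ["A", "D"], "C": ["A", "D"], "D": ["B", "C"]}
cost = {"A": 1.0, "B": 5.0, "C": 2.0, "D": 1.0}
route = lowest_cost_path(links, cost, "A", "D")   # ['A', 'C', 'D']
```

With multiplicative attenuation, the strongest surviving signal corresponds to the smallest summed cost, so the backward walk recovers the same path a conventional shortest-path search would find. Very large costs would underflow exp(-cost) to zero in floating point, so a practical version might work with summed costs directly.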
Implementing An Image Understanding System Architecture Using Pipe

NASA Astrophysics Data System (ADS)

Luck, Randall L.

1988-03-01

This paper will describe PIPE and how it can be used to implement an image understanding system. Image understanding is the process of developing a description of an image in order to make decisions about its contents. The tasks of image understanding are generally split into low level vision and high level vision. Low level vision is performed by PIPE, a high performance parallel processor with an architecture specifically designed for processing video images at up to 60 fields per second. High level vision is performed by one of several types of serial or parallel computers, depending on the application. An additional processor called ISMAP performs the conversion from iconic image space to symbolic feature space. ISMAP plugs into one of PIPE's slots and is memory mapped into the high level processor. Thus it forms the high speed link between the low and high level vision processors. The mechanisms for bottom-up, data driven processing and top-down, model driven processing are discussed.
Parallel hyperbolic PDE simulation on clusters: Cell versus GPU

NASA Astrophysics Data System (ADS)

Rostrup, Scott; De Sterck, Hans

2010-12-01

Increasingly, high-performance computing is looking towards data-parallel computational devices to enhance computational performance. Two technologies that have received significant attention are IBM's Cell Processor and NVIDIA's CUDA programming model for graphics processing unit (GPU) computing. In this paper we investigate the acceleration of parallel hyperbolic partial differential equation simulation on structured grids with explicit time integration on clusters with Cell and GPU backends. The message passing interface (MPI) is used for communication between nodes at the coarsest level of parallelism. Optimizations of the simulation code at the several finer levels of parallelism that the data-parallel devices provide are described in terms of data layout, data flow and data-parallel instructions. Optimized Cell and GPU performance are compared with reference code performance on a single x86 central processing unit (CPU) core in single and double precision. We further compare the CPU, Cell and GPU platforms on a chip-to-chip basis, and compare performance on single cluster nodes with two CPUs, two Cell processors or two GPUs in a shared memory configuration (without MPI). We finally compare performance on clusters with 32 CPUs, 32 Cell processors, and 32 GPUs using MPI. Our GPU cluster results use NVIDIA Tesla GPUs with GT200 architecture, but some preliminary results on recently introduced NVIDIA GPUs with the next-generation Fermi architecture are also included. This paper provides computational scientists and engineers who are considering porting their codes to accelerator environments with insight into how structured grid based explicit algorithms can be optimized for clusters with Cell and GPU accelerators. It also provides insight into the speed-up that may be gained on current and future accelerator architectures for this class of applications. 
Program summary
Program title: SWsolver
Catalogue identifier: AEGY_v1_0
Program summary URL: http://cpc.cs.qub.ac.uk/summaries/AEGY_v1_0.html
Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland
Licensing provisions: GPL v3
No. of lines in distributed program, including test data, etc.: 59 168
No. of bytes in distributed program, including test data, etc.: 453 409
Distribution format: tar.gz
Programming language: C, CUDA
Computer: Parallel Computing Clusters. Individual compute nodes may consist of x86 CPU, Cell processor, or x86 CPU with attached NVIDIA GPU accelerator.
Operating system: Linux
Has the code been vectorised or parallelized?: Yes. Tested on 1-128 x86 CPU cores, 1-32 Cell Processors, and 1-32 NVIDIA GPUs.
RAM: Tested on Problems requiring up to 4 GB per compute node.
Classification: 12
External routines: MPI, CUDA, IBM Cell SDK
Nature of problem: MPI-parallel simulation of Shallow Water equations using high-resolution 2D hyperbolic equation solver on regular Cartesian grids for x86 CPU, Cell Processor, and NVIDIA GPU using CUDA.
Solution method: SWsolver provides 3 implementations of a high-resolution 2D Shallow Water equation solver on regular Cartesian grids, for CPU, Cell Processor, and NVIDIA GPU. Each implementation uses MPI to divide work across a parallel computing cluster.
Additional comments: Sub-program numdiff is used for the test run.
Self-checking self-repairing computer nodes using the mirror processor

NASA Technical Reports Server (NTRS)

Tamir, Yuval

1992-01-01

Circuitry added to fault-tolerant systems for concurrent error detection usually reduces performance. Using a technique called micro rollback, it is possible to eliminate most of the performance penalty of concurrent error detection. Error detection is performed in parallel with intermodule communication, and erroneous state changes are later undone. The author reports on the design and implementation of a VLSI RISC microprocessor, called the Mirror Processor (MP), which is capable of micro rollback. In order to achieve concurrent error detection, two MP chips operate in lockstep, comparing external signals and a signature of internal signals every clock cycle. If a mismatch is detected, both processors roll back to the beginning of the cycle when the error occurred. In some cases the erroneous state is corrected by copying a value from the fault-free processor to the faulty processor. The architecture, microarchitecture, and VLSI implementation of the MP, emphasizing its error-detection, error-recovery, and self-diagnosis capabilities, are described.
Distributed computation of graphics primitives on a transputer network

NASA Technical Reports Server (NTRS)

Ellis, Graham K.

1988-01-01

A method is developed for distributing the computation of graphics primitives on a parallel processing network. Off-the-shelf transputer boards are used to perform the graphics transformations and scan-conversion tasks that would normally be assigned to a single transputer based display processor. Each node in the network performs a single graphics primitive computation. Frequently requested tasks can be duplicated on several nodes. The results indicate that the current distribution of commands on the graphics network shows a performance degradation when compared to the graphics display board alone. A change to more computation per node for every communication (perform more complex tasks on each node) may cause the desired increase in throughput.
Wide-Area Persistent Energy-Efficient Maritime Sensing

DTIC Science & Technology

2015-09-30

Matt Reynolds, Lefteris Kampianakis, and Andreas Pedrosse-Engel at UW designed and tested a Software Defined Radar testbed as well as an Arduino-based ...hardware based on a software-defined radio platform. 2) Development of a standalone Arduino-based backscatter node. 3) Analysis of the limits of the... Arduino-based node that can modulate radar backscatter with data received from a sensor using a low-power Arduino Nano processor. Figure 5 shows a
Active Nodal Task Seeking for High-Performance, Ultra-Dependable Computing

DTIC Science & Technology

1994-07-01

implementation. Figure 1 shows a hardware organization of ANTS: stand-alone computing nodes interconnected by buses. 2.1 Run Time Partitioning The...nodes in 14 respond to changing loads [27] or system reconfiguration [26]. Existing techniques are all source-initiated or server-initiated [27]. 5.1...short-running task segments. The task segments must be short-running in order that processors will become available often enough to satisfy changing
Non-preconditioned conjugate gradient on cell and FPGA based hybrid supercomputer nodes

DOE Office of Scientific and Technical Information (OSTI.GOV)

Dubois, David H; Dubois, Andrew J; Boorman, Thomas M

2009-01-01

This work presents a detailed implementation of a double precision, non-preconditioned, Conjugate Gradient algorithm on a Roadrunner heterogeneous supercomputer node. These nodes utilize the Cell Broadband Engine Architecture™ in conjunction with x86 Opteron™ processors from AMD. We implement a common Conjugate Gradient algorithm, on a variety of systems, to compare and contrast performance. Implementation results are presented for the Roadrunner hybrid supercomputer, SRC Computers, Inc. MAPStation SRC-6 FPGA enhanced hybrid supercomputer, and AMD Opteron only. In all hybrid implementations wall clock time is measured, including all transfer overhead and compute timings.
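For readers unfamiliar with the method being benchmarked, a minimal serial sketch of the textbook non-preconditioned Conjugate Gradient iteration follows. It is not the authors' Cell or FPGA implementation; the function name conjugate_gradient, the dense list-of-rows matrix layout, and the tolerance are assumptions made for illustration.

```python
def conjugate_gradient(A, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for a symmetric positive-definite A given as a list of rows."""
    n = len(b)
    x = [0.0] * n
    r = list(b)                        # residual b - A x for the initial guess x = 0
    p = list(r)                        # first search direction is the residual
    rs_old = sum(v * v for v in r)
    for _ in range(max_iter):
        if rs_old ** 0.5 < tol:        # stop once the residual norm is small enough
            break
        Ap = [sum(A[i][j] * p[j] for j in range(n)) for i in range(n)]
        alpha = rs_old / sum(p[i] * Ap[i] for i in range(n))
        x = [x[i] + alpha * p[i] for i in range(n)]
        r = [r[i] - alpha * Ap[i] for i in range(n)]
        rs_new = sum(v * v for v in r)
        beta = rs_new / rs_old
        p = [r[i] + beta * p[i] for i in range(n)]
        rs_old = rs_new
    return x


# Small illustrative system (values chosen for this example only).
A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
x = conjugate_gradient(A, b)
```

The matrix-vector product inside the loop dominates the cost, which is why it is the natural candidate for offloading to an accelerator in hybrid implementations.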
Non-preconditioned conjugate gradient on cell and FPGA-based hybrid supercomputer nodes

DOE Office of Scientific and Technical Information (OSTI.GOV)

Dubois, David H; Dubois, Andrew J; Boorman, Thomas M

2009-03-10

This work presents a detailed implementation of a double precision, Non-Preconditioned, Conjugate Gradient algorithm on a Roadrunner heterogeneous supercomputer node. These nodes utilize the Cell Broadband Engine Architecture™ in conjunction with x86 Opteron™ processors from AMD. We implement a common Conjugate Gradient algorithm, on a variety of systems, to compare and contrast performance. Implementation results are presented for the Roadrunner hybrid supercomputer, SRC Computers, Inc. MAPStation SRC-6 FPGA enhanced hybrid supercomputer, and AMD Opteron only. In all hybrid implementations wall clock time is measured, including all transfer overhead and compute timings.

System and method for modeling and analyzing complex scenarios

DOEpatents

Shevitz, Daniel Wolf

2013-04-09

An embodiment of the present invention includes a method for analyzing and solving a possibility tree. A possibility tree having a plurality of programmable nodes is constructed and solved with a solver module executed by a processor element. The solver module executes the programming of said nodes, and tracks the state of at least a variable through a branch. When a variable of said branch is out of tolerance with a parameter, the solver disables remaining nodes of the branch and marks the branch as an invalid solution. The valid solutions are then aggregated and displayed as valid tree solutions.
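The pruning behaviour described here can be sketched as a depth-first traversal. The sketch below is not the patented solver: the names Node, solve, and add_cost, the dictionary state, and the cost budget are hypothetical, chosen only to show a tracked variable disabling the rest of a branch once it goes out of tolerance.

```python
class Node:
    """A programmable node: 'update' is the node's program, applied to the branch state."""

    def __init__(self, name, update, children=None):
        self.name = name
        self.update = update
        self.children = children or []


def solve(node, state, in_tolerance, path=()):
    """Return every root-to-leaf path whose tracked state stays within tolerance."""
    state = node.update(dict(state))   # run this node's program on a copy of the state
    path = path + (node.name,)
    if not in_tolerance(state):
        return []                      # invalid branch: remaining nodes are never run
    if not node.children:
        return [path]                  # reached a leaf with a valid state
    valid = []
    for child in node.children:
        valid.extend(solve(child, state, in_tolerance, path))
    return valid


def add_cost(amount):
    """Build a node program that adds 'amount' to the tracked 'cost' variable."""
    def update(state):
        state["cost"] = state.get("cost", 0) + amount
        return state
    return update


tree = Node("start", add_cost(0), [
    Node("cheap", add_cost(3), [
        Node("cheap_finish", add_cost(4)),
        Node("costly_finish", add_cost(9)),
    ]),
    Node("expensive", add_cost(12), [Node("never_run", add_cost(1))]),
])

solutions = solve(tree, {}, lambda s: s["cost"] <= 10)
```

With a budget of 10, only the path through "cheap" and "cheap_finish" survives; the node "never_run" is skipped because its parent already exceeded the tolerance.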
Development of small scale cluster computer for numerical analysis

NASA Astrophysics Data System (ADS)

Zulkifli, N. H. N.; Sapit, A.; Mohammed, A. N.

2017-09-01

In this study, two personal computers were successfully networked together to form a small-scale cluster. Each computer has a multicore processor with four cores, giving the cluster eight processors in total. The cluster runs Ubuntu 14.04 Linux with an MPI implementation (MPICH2). Two main tests were conducted on the cluster: a communication test and a performance test. The communication test, which used a simple MPI Hello program written in C, verified that the computers can pass the required information without any problem. The performance test was done to show that the cluster's calculation performance is much better than that of a single-CPU computer. It consisted of four runs of the same code using a single node, 2 processors, 4 processors, and 8 processors. The results show that the time required to solve the problem decreases as processors are added: the calculation time is halved when the number of processors is doubled. To conclude, we successfully developed a small-scale cluster computer from common hardware that is capable of higher computing power than a single-CPU processor, which can be beneficial for research that requires high computing power, especially numerical analysis such as finite element analysis, computational fluid dynamics, and computational physics analysis.
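The scaling claim can be made concrete with the usual definitions of speedup (serial time divided by parallel time) and parallel efficiency (speedup divided by processor count). The timings below are made-up values that simply halve with each doubling of processors, as the abstract reports qualitatively; they are not measurements from the study.

```python
def speedup(t_serial, t_parallel):
    """Speedup relative to the single-processor run."""
    return t_serial / t_parallel


def efficiency(t_serial, t_parallel, n_procs):
    """Fraction of ideal linear speedup actually achieved."""
    return speedup(t_serial, t_parallel) / n_procs


# Hypothetical wall-clock times in seconds, keyed by processor count.
timings = {1: 80.0, 2: 40.0, 4: 20.0, 8: 10.0}

table = {
    p: (speedup(timings[1], t), efficiency(timings[1], t, p))
    for p, t in timings.items()
}
```

An efficiency of 1.0 at every processor count corresponds to the ideal halving behaviour; real measurements typically fall below that as communication overhead grows.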
Feasibility of optically interconnected parallel processors using wavelength division multiplexing

DOE Office of Scientific and Technical Information (OSTI.GOV)

Deri, R.J.; De Groot, A.J.; Haigh, R.E.

1996-03-01

New national security demands require enhanced computing systems for nearly ab initio simulations of extremely complex systems and analyzing unprecedented quantities of remote sensing data. This computational performance is being sought using parallel processing systems, in which many less powerful processors are ganged together to achieve high aggregate performance. Such systems require increased capability to communicate information between individual processor and memory elements. As it is likely that the limited performance of today's electronic interconnects will prevent the system from achieving its ultimate performance, there is great interest in using fiber optic technology to improve interconnect communication. However, little information is available to quantify the requirements on fiber optical hardware technology for this application. Furthermore, we have sought to explore interconnect architectures that use the complete communication richness of the optical domain rather than using optics as a simple replacement for electronic interconnects. These considerations have led us to study the performance of a moderate size parallel processor with optical interconnects using multiple optical wavelengths. We quantify the bandwidth, latency, and concurrency requirements which allow a bus-type interconnect to achieve scalable computing performance using up to 256 nodes, each operating at GFLOP performance. Our key conclusion is that scalable performance, to ~150 GFLOPS, is achievable for several scientific codes using an optical bus with a small number of WDM channels (8 to 32), only one WDM channel received per node, and achievable optoelectronic bandwidth and latency requirements. 21 refs., 10 figs.
Integrated High-Speed Torque Control System for a Robotic Joint

NASA Technical Reports Server (NTRS)

Davis, Donald R. (Inventor); Radford, Nicolaus A. (Inventor); Permenter, Frank Noble (Inventor); Valvo, Michael C. (Inventor); Askew, R. Scott (Inventor)

2013-01-01

A control system for achieving high-speed torque for a joint of a robot includes a printed circuit board assembly (PCBA) having a collocated joint processor and high-speed communication bus. The PCBA may also include a power inverter module (PIM) and local sensor conditioning electronics (SCE) for processing sensor data from one or more motor position sensors. Torque control of a motor of the joint is provided via the PCBA as a high-speed torque loop. Each joint processor may be embedded within or collocated with the robotic joint being controlled. Collocation of the joint processor, PIM, and high-speed bus may increase noise immunity of the control system, and the localized processing of sensor data from the joint motor at the joint level may minimize bus cabling to and from each control node. The joint processor may include a field programmable gate array (FPGA).
Interconnect Performance Evaluation of SGI Altix 3700 BX2, Cray X1, Cray Opteron Cluster, and Dell PowerEdge

NASA Technical Reports Server (NTRS)

Fatoohi, Rod; Saini, Subbash; Ciotti, Robert

2006-01-01

We study the performance of inter-process communication on four high-speed multiprocessor systems using a set of communication benchmarks. The goal is to identify certain limiting factors and bottlenecks with the interconnect of these systems as well as to compare these interconnects. We measured network bandwidth using different numbers of communicating processors and communication patterns, such as point-to-point communication, collective communication, and dense communication patterns. The four platforms are: a 512-processor SGI Altix 3700 BX2 shared-memory machine with 3.2 GB/s links; a 64-processor (single-streaming) Cray X1 shared-memory machine with 32 1.6 GB/s links; a 128-processor Cray Opteron cluster using a Myrinet network; and a 1280-node Dell PowerEdge cluster with an InfiniBand network. Our results show the impact of the network bandwidth and topology on the overall performance of each interconnect.
Data driven CAN node reliability assessment for manufacturing system

NASA Astrophysics Data System (ADS)

Zhang, Leiming; Yuan, Yong; Lei, Yong

2017-01-01

The reliability of the Controller Area Network (CAN) is critical to the performance and safety of the system. However, direct bus-off time assessment tools are lacking in practice due to inaccessibility of the node information and the complexity of the node interactions upon errors. In order to measure the mean time to bus-off (MTTB) of all the nodes, a novel data driven node bus-off time assessment method for CAN network is proposed by directly using network error information. First, the corresponding network error event sequence for each node is constructed using multiple-layer network error information. Then, the generalized zero inflated Poisson process (GZIP) model is established for each node based on the error event sequence. Finally, the stochastic model is constructed to predict the MTTB of the node. The accelerated case studies with different error injection rates are conducted on a laboratory network to demonstrate the proposed method, where the network errors are generated by a computer controlled error injection system. Experiment results show that the MTTB of nodes predicted by the proposed method agree well with observations in the case studies. The proposed data driven node time to bus-off assessment method for CAN networks can successfully predict the MTTB of nodes by directly using network error event data.
The Mark III Hypercube-Ensemble Computers

NASA Technical Reports Server (NTRS)

Peterson, John C.; Tuazon, Jesus O.; Lieberman, Don; Pniel, Moshe

1988-01-01

Mark III Hypercube concept applied in development of series of increasingly powerful computers. Processor of each node of Mark III Hypercube ensemble is specialized computer containing three subprocessors and shared main memory. Solves problem quickly by simultaneously processing part of problem at each such node and passing combined results to host computer. Disciplines benefitting from speed and memory capacity include astrophysics, geophysics, chemistry, weather, high-energy physics, applied mechanics, image processing, oil exploration, aircraft design, and microcircuit design.
Systems and methods for process and user driven dynamic voltage and frequency scaling

DOEpatents

Mallik, Arindam [Evanston, IL]; Lin, Bin [Hillsboro, OR]; Memik, Gokhan [Evanston, IL]; Dinda, Peter [Evanston, IL]; Dick, Robert [Evanston, IL]

2011-03-22

Certain embodiments of the present invention provide a method for power management including determining at least one of an operating frequency and an operating voltage for a processor and configuring the processor based on the determined at least one of the operating frequency and the operating voltage. The operating frequency is determined based at least in part on direct user input. The operating voltage is determined based at least in part on an individual profile for the processor.
Energy efficient sensor network implementations

DOE Office of Scientific and Technical Information (OSTI.GOV)

Frigo, Janette R; Raby, Eric Y; Brennan, Sean M

In this paper, we discuss a low power embedded sensor node architecture we are developing for distributed sensor network systems deployed in a natural environment. In particular, we examine the sensor node for energy efficient processing-at-the-sensor. We analyze the following modes of operation: event detection, sleep (wake-up), data acquisition, and data processing modes using low power, high performance embedded technology such as specialized embedded DSP processors and low power FPGAs at the sensing node. We use compute intensive sensor node applications: an acoustic vehicle classifier (frequency domain analysis) and a video license plate identification application (learning algorithm) as a case study. We report performance and total energy usage for our system implementations and discuss the system architecture design trade offs.
DD-αAMG on QPACE 3

NASA Astrophysics Data System (ADS)

Georg, Peter; Richtmann, Daniel; Wettig, Tilo

2018-03-01

We describe our experience porting the Regensburg implementation of the DD-αAMG solver from QPACE 2 to QPACE 3. We first review how the code was ported from the first generation Intel Xeon Phi processor (Knights Corner) to its successor (Knights Landing). We then describe the modifications in the communication library necessitated by the switch from InfiniBand to Omni-Path. Finally, we present the performance of the code on a single processor as well as the scaling on many nodes, where in both cases the speedup factor is close to the theoretical expectations.
Computationally Efficient Modeling and Simulation of Large Scale Systems

NASA Technical Reports Server (NTRS)

Jain, Jitesh (Inventor); Koh, Cheng-Kok (Inventor); Balakrishnan, Vankataramanan (Inventor); Cauley, Stephen F (Inventor); Li, Hong (Inventor)

2014-01-01

A system for simulating operation of a VLSI interconnect structure having capacitive and inductive coupling between nodes thereof, including a processor, and a memory, the processor configured to perform obtaining a matrix X and a matrix Y containing different combinations of passive circuit element values for the interconnect structure, the element values for each matrix including inductance L and inverse capacitance P, obtaining an adjacency matrix A associated with the interconnect structure, storing the matrices X, Y, and A in the memory, and performing numerical integration to solve first and second equations.
RAMA: A file system for massively parallel computers

NASA Technical Reports Server (NTRS)

Miller, Ethan L.; Katz, Randy H.

1993-01-01

This paper describes a file system design for massively parallel computers which makes very efficient use of a few disks per processor. This overcomes the traditional I/O bottleneck of massively parallel machines by storing the data on disks within the high-speed interconnection network. In addition, the file system, called RAMA, requires little inter-node synchronization, removing another common bottleneck in parallel processor file systems. Support for a large tertiary storage system can easily be integrated into the file system; in fact, RAMA runs most efficiently when tertiary storage is used.
Computing NLTE Opacities -- Node Level Parallel Calculation

DOE Office of Scientific and Technical Information (OSTI.GOV)

Holladay, Daniel

Presentation. The goal: to produce a robust library capable of computing reasonably accurate opacities inline with the assumption of LTE relaxed (non-LTE). Near term: demonstrate acceleration of non-LTE opacity computation. Far term (if funded): connect to application codes with in-line capability and compute opacities. Study science problems. Use efficient algorithms that expose many levels of parallelism and utilize good memory access patterns for use on advanced architectures. Portability to multiple types of hardware including multicore processors, manycore processors such as KNL, GPUs, etc. Easily coupled to radiation hydrodynamics and thermal radiative transfer codes.
Free-Space Optical Interconnect Employing VCSEL Diodes

NASA Technical Reports Server (NTRS)

Simons, Rainee N.; Savich, Gregory R.; Torres, Heidi

2009-01-01

Sensor signal processing is widely used on aircraft and spacecraft. The scheme employs multiple input/output nodes for data acquisition and CPU (central processing unit) nodes for data processing. To connect I/O nodes and CPU nodes, scalable interconnections such as backplanes are desired because the number of nodes depends on requirements of each mission. An optical backplane consisting of vertical-cavity surface-emitting lasers (VCSELs), VCSEL drivers, photodetectors, and transimpedance amplifiers is the preferred approach since it can handle several hundred megabits per second data throughput. The next generation of satellite-borne systems will require transceivers and processors that can handle several Gb/s of data. Optical interconnects have been praised for both their speed and functionality with hopes that light can relieve the electrical bottleneck predicted for the near future. Optoelectronic interconnects provide a factor of ten improvement over electrical interconnects.
ROBUS-2: A Fault-Tolerant Broadcast Communication System

NASA Technical Reports Server (NTRS)

Torres-Pomales, Wilfredo; Malekpour, Mahyar R.; Miner, Paul S.

2005-01-01

The Reliable Optical Bus (ROBUS) is the core communication system of the Scalable Processor-Independent Design for Enhanced Reliability (SPIDER), a general-purpose fault-tolerant integrated modular architecture currently under development at NASA Langley Research Center. The ROBUS is a time-division multiple access (TDMA) broadcast communication system with medium access control by means of a time-indexed communication schedule. ROBUS-2 is a developmental version of the ROBUS providing guaranteed fault-tolerant services to the attached processing elements (PEs), in the presence of a bounded number of faults. These services include message broadcast (Byzantine Agreement), dynamic communication schedule update, clock synchronization, and distributed diagnosis (group membership). The ROBUS also features fault-tolerant startup and restart capabilities. ROBUS-2 is tolerant to internal as well as PE faults, and incorporates a dynamic self-reconfiguration capability driven by the internal diagnostic system. This version of the ROBUS is intended for laboratory experimentation and demonstrations of the capability to reintegrate failed nodes, dynamically update the communication schedule, and tolerate and recover from correlated transient faults.
Implementation and Optimization of miniGMG - a Compact Geometric Multigrid Benchmark

DOE Office of Scientific and Technical Information (OSTI.GOV)

Williams, Samuel; Kalamkar, Dhiraj; Singh, Amik

2012-12-01

Multigrid methods are widely used to accelerate the convergence of iterative solvers for linear systems used in a number of different application areas. In this report, we describe miniGMG, our compact geometric multigrid benchmark designed to proxy the multigrid solves found in AMR applications. We explore optimization techniques for geometric multigrid on existing and emerging multicore systems including the Opteron-based Cray XE6, Intel Sandy Bridge and Nehalem-based Infiniband clusters, as well as manycore-based architectures including NVIDIA's Fermi and Kepler GPUs and Intel's Knights Corner (KNC) co-processor. This report examines a variety of novel techniques including communication-aggregation, threaded wavefront-based DRAM communication-avoiding, dynamic threading decisions, SIMDization, and fusion of operators. We quantify performance through each phase of the V-cycle for both single-node and distributed-memory experiments and provide detailed analysis for each class of optimization. Results show our optimizations yield significant speedups across a variety of subdomain sizes while simultaneously demonstrating the potential of multi- and manycore processors to dramatically accelerate single-node performance. However, our analysis also indicates that improvements in networks and communication will be essential to reap the potential of manycore processors in large-scale multigrid calculations.
Scalable Domain Decomposed Monte Carlo Particle Transport

NASA Astrophysics Data System (ADS)

O'Brien, Matthew Joseph

In this dissertation, we present the parallel algorithms necessary to run domain decomposed Monte Carlo particle transport on large numbers of processors (millions of processors). Previous algorithms were not scalable, and the parallel overhead became more computationally costly than the numerical simulation. The main algorithms we consider are:
• Domain decomposition of constructive solid geometry: enables extremely large calculations in which the background geometry is too large to fit in the memory of a single computational node.
• Load Balancing: keeps the workload per processor as even as possible so the calculation runs efficiently.
• Global Particle Find: if particles are on the wrong processor, globally resolve their locations to the correct processor based on particle coordinate and background domain.
• Visualizing constructive solid geometry, sourcing particles, deciding that particle streaming communication is completed and spatial redecomposition.
These algorithms are some of the most important parallel algorithms required for domain decomposed Monte Carlo particle transport. We demonstrate that our previous algorithms were not scalable, prove that our new algorithms are scalable, and run some of the algorithms up to 2 million MPI processes on the Sequoia supercomputer.
An implementation of a tree code on a SIMD, parallel computer

NASA Technical Reports Server (NTRS)

Olson, Kevin M.; Dorband, John E.

1994-01-01

We describe a fast tree algorithm for gravitational N-body simulation on SIMD parallel computers. The tree construction uses fast, parallel sorts. The sorted lists are recursively divided along their x, y and z coordinates. This data structure is a completely balanced tree (i.e., each particle is paired with exactly one other particle) and maintains good spatial locality. An implementation of this tree-building algorithm on a 16k processor Maspar MP-1 performs well and constitutes only a small fraction (approximately 15%) of the entire cycle of finding the accelerations. Each node in the tree is treated as a monopole. The tree search and the summation of accelerations also perform well. During the tree search, node data that is needed from another processor is simply fetched. Roughly 55% of the tree search time is spent in communications between processors. We apply the code to two problems of astrophysical interest. The first is a simulation of the close passage of two gravitationally interacting disk galaxies using 65,636 particles. We also simulate the formation of structure in an expanding, model universe using 1,048,576 particles. Our code attains speeds comparable to one head of a Cray Y-MP, so single instruction, multiple data (SIMD) type computers can be used for these simulations. The cost/performance ratio for SIMD machines like the Maspar MP-1 makes them an extremely attractive alternative to either vector processors or large multiple instruction, multiple data (MIMD) type parallel computers. With further optimizations (e.g., more careful load balancing), speeds in excess of today's vector processing computers should be possible.
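The tree construction described in this abstract (sort, split in half, cycle through the x, y and z axes, and summarise each node as a monopole) can be sketched serially. This is not the authors' SIMD code: the function names build_tree and depth, the dictionary node layout, and the four test particles are assumptions for illustration, and the sort here is an ordinary sequential sort rather than a parallel one.

```python
def build_tree(particles, axis=0):
    """Build a balanced binary tree from a list of (mass, (x, y, z)) particles.

    Each node stores a monopole: total mass and centre of mass of its particles.
    """
    if len(particles) == 1:
        mass, pos = particles[0]
        return {"mass": mass, "com": pos, "children": []}
    ordered = sorted(particles, key=lambda p: p[1][axis])   # sort along current axis
    half = len(ordered) // 2
    next_axis = (axis + 1) % 3                              # cycle x -> y -> z -> x
    left = build_tree(ordered[:half], next_axis)
    right = build_tree(ordered[half:], next_axis)
    mass = left["mass"] + right["mass"]
    com = tuple(
        (left["mass"] * l + right["mass"] * r) / mass
        for l, r in zip(left["com"], right["com"])
    )
    return {"mass": mass, "com": com, "children": [left, right]}


def depth(node):
    """Number of levels below this node (0 for a leaf)."""
    if not node["children"]:
        return 0
    return 1 + max(depth(child) for child in node["children"])


# Four unit-mass particles on the corners of a unit square (illustrative only).
particles = [
    (1.0, (0.0, 0.0, 0.0)),
    (1.0, (1.0, 0.0, 0.0)),
    (1.0, (0.0, 1.0, 0.0)),
    (1.0, (1.0, 1.0, 0.0)),
]
root = build_tree(particles)
```

When the particle count is a power of two, every split is exact and the tree is completely balanced, which is the property the abstract highlights.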
Improved multi-stage neonatal seizure detection using a heuristic classifier and a data-driven post-processor.

PubMed

Ansari, A H; Cherian, P J; Dereymaeker, A; Matic, V; Jansen, K; De Wispelaere, L; Dielman, C; Vervisch, J; Swarte, R M; Govaert, P; Naulaers, G; De Vos, M; Van Huffel, S

2016-09-01

After identifying the most seizure-relevant characteristics by a previously developed heuristic classifier, a data-driven post-processor using a novel set of features is applied to improve the performance. The main characteristics of the outputs of the heuristic algorithm are extracted by five sets of features including synchronization, evolution, retention, segment, and signal features. Then, a support vector machine and a decision making layer remove the falsely detected segments. Four datasets including 71 neonates (1023 h, 3493 seizures) recorded in two different university hospitals, are used to train and test the algorithm without removing the dubious seizures. The heuristic method resulted in a false alarm rate of 3.81 per hour and good detection rate of 88% on the entire test databases. The post-processor effectively reduces the false alarm rate by 34% while the good detection rate decreases by 2%. This post-processing technique improves the performance of the heuristic algorithm. The structure of this post-processor is generic, improves our understanding of the core visually determined EEG features of neonatal seizures and is applicable for other neonatal seizure detectors. The post-processor significantly decreases the false alarm rate at the expense of a small reduction of the good detection rate. Copyright © 2016 International Federation of Clinical Neurophysiology. Published by Elsevier Ireland Ltd. All rights reserved.
GNAQPMS v1.1: accelerating the Global Nested Air Quality Prediction Modeling System (GNAQPMS) on Intel Xeon Phi processors

NASA Astrophysics Data System (ADS)

Wang, Hui; Chen, Huansheng; Wu, Qizhong; Lin, Junmin; Chen, Xueshun; Xie, Xinwei; Wang, Rongrong; Tang, Xiao; Wang, Zifa

2017-08-01

The Global Nested Air Quality Prediction Modeling System (GNAQPMS) is the global version of the Nested Air Quality Prediction Modeling System (NAQPMS), which is a multi-scale chemical transport model used for air quality forecast and atmospheric environmental research. In this study, we present the porting and optimisation of GNAQPMS on a second-generation Intel Xeon Phi processor, codenamed Knights Landing (KNL). Compared with the first-generation Xeon Phi coprocessor (codenamed Knights Corner, KNC), KNL has many new hardware features such as a bootable processor, high-performance in-package memory and ISA compatibility with Intel Xeon processors. In particular, we describe the five optimisations we applied to the key modules of GNAQPMS, including the CBM-Z gas-phase chemistry, advection, convection and wet deposition modules. These optimisations work well on both the KNL 7250 processor and the Intel Xeon E5-2697 V4 processor. They include (1) updating the pure Message Passing Interface (MPI) parallel mode to the hybrid parallel mode with MPI and OpenMP in the emission, advection, convection and gas-phase chemistry modules; (2) fully employing the 512 bit wide vector processing units (VPUs) on the KNL platform; (3) reducing unnecessary memory access to improve cache efficiency; (4) reducing the thread local storage (TLS) in the CBM-Z gas-phase chemistry module to improve its OpenMP performance; and (5) changing the global communication from writing/reading interface files to MPI functions to improve the performance and the parallel scalability. These optimisations greatly improved the GNAQPMS performance. The same optimisations also work well for the Intel Xeon Broadwell processor, specifically E5-2697 v4. Compared with the baseline version of GNAQPMS, the optimised version was 3.51 × faster on KNL and 2.77 × faster on the CPU. Moreover, the optimised version ran at 26 % lower average power on KNL than on the CPU. 
With the combined performance and energy improvement, the KNL platform was 37.5 % more efficient on power consumption compared with the CPU platform. The optimisations also enabled much further parallel scalability on both the CPU cluster and the KNL cluster scaled to 40 CPU nodes and 30 KNL nodes, with a parallel efficiency of 70.4 and 42.2 %, respectively.

Comparing the OpenMP, MPI, and Hybrid Programming Paradigm on an SMP Cluster

NASA Technical Reports Server (NTRS)

Jost, Gabriele; Jin, Hao-Qiang; an Mey, Dieter; Hatay, Ferhat F.

2003-01-01

Clusters of SMP (Symmetric Multi-Processors) nodes provide support for a wide range of parallel programming paradigms. The shared address space within each node is suitable for OpenMP parallelization. Message passing can be employed within and across the nodes of a cluster. Multiple levels of parallelism can be achieved by combining message passing and OpenMP parallelization. Which programming paradigm is the best will depend on the nature of the given problem, the hardware components of the cluster, the network, and the available software. In this study we compare the performance of different implementations of the same CFD benchmark application, using the same numerical algorithm but employing different programming paradigms.
Transient Solid Dynamics Simulations on the Sandia/Intel Teraflop Computer

DOE Office of Scientific and Technical Information (OSTI.GOV)

Attaway, S.; Brown, K.; Gardner, D.

1997-12-31

Transient solid dynamics simulations are among the most widely used engineering calculations. Industrial applications include vehicle crashworthiness studies, metal forging, and powder compaction prior to sintering. These calculations are also critical to defense applications including safety studies and weapons simulations. The practical importance of these calculations and their computational intensiveness make them natural candidates for parallelization. This has proved to be difficult, and existing implementations fail to scale to more than a few dozen processors. In this paper we describe our parallelization of PRONTO, Sandia's transient solid dynamics code, via a novel algorithmic approach that utilizes multiple decompositions for different key segments of the computations, including the material contact calculation. This latter calculation is notoriously difficult to perform well in parallel, because it involves dynamically changing geometry, global searches for elements in contact, and unstructured communications among the compute nodes. Our approach scales to at least 3600 compute nodes of the Sandia/Intel Teraflop computer (the largest set of nodes to which we have had access to date) on problems involving millions of finite elements. On this machine we can simulate models using more than ten million elements in a few tenths of a second per timestep, and solve problems more than 3000 times faster than a single processor Cray Jedi.
Efficacy of Code Optimization on Cache-Based Processors

NASA Technical Reports Server (NTRS)

VanderWijngaart, Rob F.; Saphir, William C.; Chancellor, Marisa K. (Technical Monitor)

1997-01-01

In this paper a number of techniques for improving the cache performance of a representative piece of numerical software are presented. Target machines are popular processors from several vendors: MIPS R5000 (SGI Indy), MIPS R8000 (SGI PowerChallenge), MIPS R10000 (SGI Origin), DEC Alpha EV4 + EV5 (Cray T3D & T3E), IBM RS6000 (SP Wide-node), Intel PentiumPro (Ames' Whitney), Sun UltraSparc (NERSC's NOW). The optimizations all attempt to increase the locality of memory accesses, but they meet with rather varied and often counterintuitive success on the different computing platforms. We conclude that it may be genuinely impossible to obtain portable performance on the current generation of cache-based machines. At the least, it appears that the performance of modern commodity processors cannot be described with parameters defining the cache alone.
Synchronous Parallel Emulation and Discrete Event Simulation System with Self-Contained Simulation Objects and Active Event Objects

NASA Technical Reports Server (NTRS)

Steinman, Jeffrey S. (Inventor)

1998-01-01

The present invention is embodied in a method of performing object-oriented simulation and a system having inter-connected processor nodes operating in parallel to simulate mutual interactions of a set of discrete simulation objects distributed among the nodes as a sequence of discrete events changing state variables of respective simulation objects so as to generate new event-defining messages addressed to respective ones of the nodes. The object-oriented simulation is performed at each one of the nodes by assigning passive self-contained simulation objects to each one of the nodes, responding to messages received at one node by generating corresponding active event objects having user-defined inherent capabilities and individual time stamps and corresponding to respective events affecting one of the passive self-contained simulation objects of the one node, restricting the respective passive self-contained simulation objects to only providing and receiving information from the respective active event objects, requesting information and changing variables within a passive self-contained simulation object by the active event object, and producing corresponding messages specifying events resulting therefrom by the active event objects.
Dynamic Load Balancing for Grid Partitioning on a SP-2 Multiprocessor: A Framework

NASA Technical Reports Server (NTRS)

Sohn, Andrew; Simon, Horst; Lasinski, T. A. (Technical Monitor)

1994-01-01

Computational requirements of full scale computational fluid dynamics change as computation progresses on a parallel machine. The change in computational intensity causes workload imbalance of processors, which in turn requires a large amount of data movement at runtime. If parallel CFD is to be successful on a parallel or massively parallel machine, balancing of the runtime load is indispensable. Here a framework is presented for dynamic load balancing for CFD applications, called Jove. One processor is designated as a decision maker Jove while others are assigned to computational fluid dynamics. Processors running CFD send flags to Jove in a predetermined number of iterations to initiate load balancing. Jove starts working on load balancing while other processors continue working with the current data and load distribution. Jove goes through several steps to decide if the new data should be taken, including preliminary evaluation, partition, processor reassignment, cost evaluation, and decision. Jove running on a single IBM SP2 node has been completely implemented. Preliminary experimental results show that the Jove approach to dynamic load balancing can be effective for full scale grid partitioning on the target machine IBM SP2.
Concurrent hypercube system with improved message passing

NASA Technical Reports Server (NTRS)

Peterson, John C. (Inventor); Tuazon, Jesus O. (Inventor); Lieberman, Don (Inventor); Pniel, Moshe (Inventor)

1989-01-01

A network of microprocessors, or nodes, is interconnected in an n-dimensional cube having bidirectional communication links along the edges of the n-dimensional cube. Each node's processor network includes an I/O subprocessor dedicated to controlling communication of message packets along a bidirectional communication link with each end thereof terminating at an I/O controlled transceiver. Transmit data lines are directly connected from a local FIFO through each node's communication link transceiver. Status and control signals from the neighboring nodes are delivered over supervisory lines to inform the local node that the neighbor node's FIFO is empty and the bidirectional link between the two nodes is idle for data communication. A clocking line between neighbors clocks a message into an empty FIFO at a neighbor's node and vice versa. Either neighbor may acquire control over the bidirectional communication link at any time, and thus each node has circuitry for checking whether or not the communication link is busy or idle, and whether or not the receive FIFO is empty. Likewise, each node can empty its own FIFO and in turn deliver a status signal to a neighboring node indicating that the local FIFO is empty. The system includes features of automatic message rerouting, block message transfer and automatic parity checking and generation.
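As an illustration of the n-dimensional cube topology named in this record, the following minimal Python sketch uses the conventional binary labeling, in which two nodes share a link exactly when their labels differ in one bit. The labeling scheme and the function names `hypercube_neighbors` and `hop_count` are assumptions made for illustration; the record itself does not specify how nodes are addressed.

```python
def hypercube_neighbors(node: int, n: int) -> list[int]:
    """Return the labels of the n nodes directly linked to `node`.

    Assumes the conventional labeling: nodes are numbered 0 .. 2**n - 1
    and a link joins two nodes whose labels differ in exactly one bit.
    """
    if not 0 <= node < (1 << n):
        raise ValueError("node label out of range for this cube dimension")
    # Flipping bit d gives the neighbor along dimension d.
    return [node ^ (1 << d) for d in range(n)]


def hop_count(a: int, b: int) -> int:
    """Minimum number of links between nodes a and b (Hamming distance)."""
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    print(hypercube_neighbors(5, 3))
    print(hop_count(0, 7))
```

Under this labeling every node has exactly n neighbors, which is why each node in such a system needs one bidirectional link per cube dimension.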
An enhanced Ada run-time system for real-time embedded processors

NASA Technical Reports Server (NTRS)

Sims, J. T.

1991-01-01

An enhanced Ada run-time system has been developed to support real-time embedded processor applications. The primary focus of this development effort has been on the tasking system and the memory management facilities of the run-time system. The tasking system has been extended to support efficient and precise periodic task execution as required for control applications. Event-driven task execution providing a means of task-asynchronous control and communication among Ada tasks is supported in this system. Inter-task control is even provided among tasks distributed on separate physical processors. The memory management system has been enhanced to provide object allocation and protected access support for memory shared between disjoint processors, each of which is executing a distinct Ada program.
A Parallel Vector Machine for the PM Programming Language

NASA Astrophysics Data System (ADS)

Bellerby, Tim

2016-04-01

PM is a new programming language which aims to make the writing of computational geoscience models on parallel hardware accessible to scientists who are not themselves expert parallel programmers. It is based around the concept of communicating operators: language constructs that enable variables local to a single invocation of a parallelised loop to be viewed as if they were arrays spanning the entire loop domain. This mechanism enables different loop invocations (which may or may not be executing on different processors) to exchange information in a manner that extends the successful Communicating Sequential Processes idiom from single messages to collective communication. Communicating operators avoid the additional synchronisation mechanisms, such as atomic variables, required when programming using the Partitioned Global Address Space (PGAS) paradigm. Using a single loop invocation as the fundamental unit of concurrency enables PM to uniformly represent different levels of parallelism from vector operations through shared memory systems to distributed grids. This paper describes an implementation of PM based on a vectorised virtual machine. On a single processor node, concurrent operations are implemented using masked vector operations. Virtual machine instructions operate on vectors of values and may be unmasked, masked using a Boolean field, or masked using an array of active vector cell locations. Conditional structures (such as if-then-else or while statement implementations) calculate and apply masks to the operations they control. A shift in mask representation from Boolean to location-list occurs when active locations become sufficiently sparse. Parallel loops unfold data structures (or vectors of data structures for nested loops) into vectors of values that may additionally be distributed over multiple computational nodes and then split into micro-threads compatible with the size of the local cache. 
Inter-node communication is accomplished using standard OpenMP and MPI. Performance analyses of the PM vector machine, demonstrating its scaling properties with respect to domain size and the number of processor nodes will be presented for a range of hardware configurations. The PM software and language definition are being made available under unrestrictive MIT and Creative Commons Attribution licenses respectively: www.pm-lang.org.
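The two mask representations described in this abstract (a Boolean field versus a list of active cell locations) can be sketched in plain Python as follows. This is only an illustration of the idea, not the PM virtual machine's implementation: the names `choose_mask` and `masked_add` and the cut-over value `SPARSE_THRESHOLD` are hypothetical, and the abstract does not state what sparsity triggers the switch.

```python
SPARSE_THRESHOLD = 0.25  # assumed cut-over fraction; not given in the abstract


def choose_mask(mask):
    """Pick a mask representation based on how sparse the active cells are.

    Returns ("locs", [indices]) when few cells are active,
    otherwise ("bool", [flags]).
    """
    locs = [i for i, active in enumerate(mask) if active]
    if len(locs) < SPARSE_THRESHOLD * len(mask):
        return ("locs", locs)
    return ("bool", list(mask))


def masked_add(dst, a, b, mask_repr=None):
    """Set dst[i] = a[i] + b[i] on active cells only; other cells are untouched."""
    if mask_repr is None:  # unmasked instruction: every cell is active
        for i in range(len(dst)):
            dst[i] = a[i] + b[i]
        return dst
    kind, payload = mask_repr
    if kind == "locs":  # location list: visit only the active cells
        for i in payload:
            dst[i] = a[i] + b[i]
    else:  # Boolean field: test every cell
        for i, active in enumerate(payload):
            if active:
                dst[i] = a[i] + b[i]
    return dst


if __name__ == "__main__":
    a = [1, 2, 3, 4, 5, 6, 7, 8]
    b = [10] * 8
    sparse = [False] * 8
    sparse[3] = True
    print(choose_mask(sparse))
    print(masked_add([0] * 8, a, b, choose_mask(sparse)))
```

The location-list form does work proportional to the number of active cells rather than the vector length, which is the motivation for switching representation once a conditional has deactivated most of the vector.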
Burbank works on the EPIC in the Node 2

NASA Image and Video Library

2012-02-28

ISS030-E-114433 (29 Feb. 2012) --- In the International Space Station's Destiny laboratory, NASA astronaut Dan Burbank, Expedition 30 commander, upgrades Multiplexer/Demultiplexer (MDM) computers and Portable Computer System (PCS) laptops and installs the Enhanced Processor & Integrated Communications (EPIC) hardware in the Payload 1 (PL-1) MDM.
Simulation of a master-slave event set processor

DOE Office of Scientific and Technical Information (OSTI.GOV)

Comfort, J.C.

1984-03-01

Event set manipulation may consume a considerable amount of the computation time spent in performing a discrete-event simulation. One way of minimizing this time is to allow event set processing to proceed in parallel with the remainder of the simulation computation. The paper describes a multiprocessor simulation computer, in which all non-event set processing is performed by the principal processor (called the host). Event set processing is coordinated by a front end processor (the master) and actually performed by several other functionally identical processors (the slaves). A trace-driven simulation program modeling this system was constructed, and was run with trace output taken from two different simulation programs. Output from this simulation suggests that a significant reduction in run time may be realized by this approach. Sensitivity analysis was performed on the significant parameters to the system (number of slave processors, relative processor speeds, and interprocessor communication times). A comparison between actual and simulation run times for a one-processor system was used to assist in the validation of the simulation. 7 references.
Non-smooth saddle-node bifurcations III: Strange attractors in continuous time

NASA Astrophysics Data System (ADS)

Fuhrmann, G.

2016-08-01

Non-smooth saddle-node bifurcations give rise to minimal sets of interesting geometry built of so-called strange non-chaotic attractors. We show that certain families of quasiperiodically driven logistic differential equations undergo a non-smooth bifurcation. By a previous result on the occurrence of non-smooth bifurcations in forced discrete time dynamical systems, this yields that within the class of families of quasiperiodically driven differential equations, non-smooth saddle-node bifurcations occur in a set with non-empty C2-interior.
Garbage Collection in a Distributed Object-Oriented System

NASA Technical Reports Server (NTRS)

Gupta, Aloke; Fuchs, W. Kent

1993-01-01

An algorithm is described in this paper for garbage collection in distributed systems with object sharing across processor boundaries. The algorithm allows local garbage collection at each node in the system to proceed independently of local collection at the other nodes. It requires no global synchronization or knowledge of the global state of the system and exhibits the capability of graceful degradation. The concept of a specialized dump node is proposed to facilitate the collection of inaccessible circular structures. An experimental evaluation of the algorithm is also described. The algorithm is compared with a corresponding scheme that requires global synchronization. The results show that the algorithm works well in distributed processing environments even when the locality of object references is low.
Direct memory access transfer completion notification

DOEpatents

Chen, Dong; Giampapa, Mark E.; Heidelberger, Philip; Kumar, Sameer; Parker, Jeffrey J.; Steinmacher-Burow, Burkhard D.; Vranas, Pavlos

2010-07-27

Methods, compute nodes, and computer program products are provided for direct memory access (`DMA`) transfer completion notification. Embodiments include determining, by an origin DMA engine on an origin compute node, whether a data descriptor for an application message to be sent to a target compute node is currently in an injection first-in-first-out (`FIFO`) buffer in dependence upon a sequence number previously associated with the data descriptor, the total number of descriptors currently in the injection FIFO buffer, and the current sequence number for the newest data descriptor stored in the injection FIFO buffer; and notifying a processor core on the origin DMA engine that the message has been sent if the data descriptor for the message is not currently in the injection FIFO buffer.
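The membership test described in this record uses three quantities: the descriptor's own sequence number, the number of descriptors currently in the injection FIFO, and the sequence number of the newest descriptor in it. A minimal Python sketch of one way such a test could work is given below. It assumes that sequence numbers are assigned consecutively in injection order, that the FIFO drains strictly in order, and that sequence numbers do not wrap around; none of these details are stated in the abstract, and the function names are hypothetical.

```python
def descriptor_in_fifo(seq: int, newest_seq: int, count: int) -> bool:
    """Return True if the descriptor numbered `seq` is still in the FIFO.

    Under the assumptions above, the FIFO holds exactly the descriptors
    numbered newest_seq - count + 1 .. newest_seq.
    """
    if seq > newest_seq:
        raise ValueError("descriptor has not been injected yet")
    if count == 0:  # empty FIFO: nothing is pending
        return False
    oldest_seq = newest_seq - count + 1
    return oldest_seq <= seq <= newest_seq


def message_sent(seq: int, newest_seq: int, count: int) -> bool:
    """A message counts as sent once its descriptor has left the FIFO."""
    return not descriptor_in_fifo(seq, newest_seq, count)


if __name__ == "__main__":
    # Newest descriptor is 10 and three are pending, so 8, 9, 10 are in the FIFO.
    print(message_sent(7, 10, 3))
    print(message_sent(8, 10, 3))
```

The appeal of a check like this is that it needs only two counters from the DMA engine rather than a scan of the FIFO contents.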
Parallelization of MRCI based on hole-particle symmetry.

PubMed

Suo, Bing; Zhai, Gaohong; Wang, Yubin; Wen, Zhenyi; Hu, Xiangqian; Li, Lemin

2005-01-15

The parallel implementation of a multireference configuration interaction program based on the hole-particle symmetry is described. The platform used for the parallelization is an Intel-architecture cluster consisting of 12 nodes, each of which is equipped with two 2.4-GHz Xeon processors, 3-GB memory, and 36-GB disk; the nodes are connected by a Gigabit Ethernet switch. The dependence of speedup on molecular symmetries and task granularities is discussed. Test calculations show that the scaling with the number of nodes is about 1.9 (for C1 and Cs), 1.65 (for C2v), and 1.55 (for D2h) when the number of nodes is doubled. The largest calculation performed on this cluster involves 5.6 × 10^8 CSFs.
NATIONAL WATER INFORMATION SYSTEM OF THE U. S. GEOLOGICAL SURVEY.

USGS Publications Warehouse

Edwards, Melvin D.

1985-01-01

National Water Information System (NWIS) has been designed as an interactive, distributed data system. It will integrate the existing, diverse data-processing systems into a common system. It will also provide easier, more flexible use as well as more convenient access and expanded computing, dissemination, and data-analysis capabilities. The NWIS is being implemented as part of a Distributed Information System (DIS) being developed by the Survey's Water Resources Division. The NWIS will be implemented on each node of the distributed network for the local processing, storage, and dissemination of hydrologic data collected within the node's area of responsibility. The processor at each node will also be used to perform hydrologic modeling, statistical data analysis, text editing, and some administrative work.
High order parallel numerical schemes for solving incompressible flows

NASA Technical Reports Server (NTRS)

Lin, Avi; Milner, Edward J.; Liou, May-Fun; Belch, Richard A.

1992-01-01

The use of parallel computers for numerically solving flow fields has gained much importance in recent years. This paper introduces a new high order numerical scheme for computational fluid dynamics (CFD) specifically designed for parallel computational environments. A distributed MIMD system gives the flexibility of treating different elements of the governing equations with totally different numerical schemes in different regions of the flow field. The parallel decomposition of the governing operator to be solved is the primary parallel split. The primary parallel split was studied using a hypercube-like architecture having clusters of shared memory processors at each node. The approach is demonstrated using examples of simple steady state incompressible flows. Future studies should investigate the secondary split because, depending on the numerical scheme that each of the processors applies and the nature of the flow in the specific subdomain, it may be possible for a processor to seek better, or higher order, schemes for its particular subcase.
Uncovering hidden nodes in complex networks in the presence of noise

PubMed Central

Su, Ri-Qi; Lai, Ying-Cheng; Wang, Xiao; Do, Younghae

2014-01-01

Ascertaining the existence of hidden objects in a complex system, objects that cannot be observed from the external world, not only is curiosity-driven but also has significant practical applications. Generally, uncovering a hidden node in a complex network requires successful identification of its neighboring nodes, but a challenge is to differentiate its effects from those of noise. We develop a completely data-driven, compressive-sensing based method to address this issue by utilizing complex weighted networks with continuous-time oscillatory or discrete-time evolutionary-game dynamics. For any node, compressive sensing enables accurate reconstruction of the dynamical equations and coupling functions, provided that time series from this node and all its neighbors are available. For a neighboring node of the hidden node, this condition cannot be met, resulting in abnormally large prediction errors that, counterintuitively, can be used to infer the existence of the hidden node. Based on the principle of differential signal, we demonstrate that, when strong noise is present, insofar as at least two neighboring nodes of the hidden node are subject to weak background noise only, unequivocal identification of the hidden node can be achieved. PMID:24487720
Multiple-Flat-Panel System Displays Multidimensional Data

NASA Technical Reports Server (NTRS)

Gundo, Daniel; Levit, Creon; Henze, Christopher; Sandstrom, Timothy; Ellsworth, David; Green, Bryan; Joly, Arthur

2006-01-01

The NASA Ames hyperwall is a display system designed to facilitate the visualization of sets of multivariate and multidimensional data like those generated in complex engineering and scientific computations. The hyperwall includes a 7 × 7 matrix of computer-driven flat-panel video display units, each presenting an image of 1,280 × 1,024 pixels. The term hyperwall reflects the fact that this system is a more capable successor to prior computer-driven multiple-flat-panel display systems known by names that include the generic term powerwall and the trade names PowerWall and Powerwall. Each of the 49 flat-panel displays is driven by a rack-mounted, dual-central-processing-unit, workstation-class personal computer equipped with a high-performance graphical-display circuit card and with a hard-disk drive having a storage capacity of 100 GB. Each such computer is a slave node in a master/slave computing/data-communication system (see Figure 1). The computer that acts as the master node is similar to the slave-node computers, except that it runs the master portion of the system software and is equipped with a keyboard and mouse for control by a human operator. The system utilizes commercially available master/slave software along with custom software that enables the human controller to interact simultaneously with any number of selected slave nodes. In a powerwall, a single rendering task is spread across multiple processors and then the multiple outputs are tiled into one seamless super-display. It must be noted that the hyperwall concept subsumes the powerwall concept in that a single scene could be rendered as a mosaic image on the hyperwall. However, the hyperwall offers a wider set of capabilities to serve a different purpose: The hyperwall concept is one of (1) simultaneously displaying multiple different but related images, and (2) providing means for composing and controlling such sets of images.
In place of elaborate software or hardware crossbar switches, the hyperwall concept substitutes reliance on the human visual system for integration, synthesis, and discrimination of patterns in complex and high-dimensional data spaces represented by the multiple displayed images. The variety of multidimensional data sets that can be displayed on the hyperwall is practically unlimited. For example, Figure 2 shows a hyperwall display of surface pressures and streamlines from a computational simulation of airflow about an aerospacecraft at various Mach numbers and angles of attack. In this display, Mach numbers increase from left to right and angles of attack increase from bottom to top. That is, all images in the same column represent simulations at the same Mach number, while all images in the same row represent simulations at the same angle of attack. The same viewing transformations and the same mapping from surface pressure to colors were used in generating all the images.
Going End to End to Deliver High-Speed Data

NASA Technical Reports Server (NTRS)

2005-01-01

By the end of the 1990s, the optical fiber "backbone" of the telecommunication and data-communication networks had evolved from megabits-per-second transmission rates to gigabits-per-second transmission rates. Despite this boom in bandwidth, however, users at the end nodes were still not being reached on a consistent basis. (An end node is any device that does not behave like a router or a managed hub or switch. Examples of end node objects are computers, printers, serial interface processor phones, and unmanaged hubs and switches.) The primary reason that prevents bandwidth from reaching the end nodes is the complex local network topology that exists between the optical backbone and the end nodes. This complex network topology consists of several layers of routing and switch equipment which introduce potential congestion points and network latency. By breaking down the complex network topology, a true optical connection can be achieved. Access Optical Networks, Inc., is making this connection a reality with guidance from NASA's nondestructive evaluation experts.

Parallel Directionally Split Solver Based on Reformulation of Pipelined Thomas Algorithm

NASA Technical Reports Server (NTRS)

Povitsky, A.

1998-01-01

In this research an efficient parallel algorithm for 3-D directionally split problems is developed. The proposed algorithm is based on a reformulated version of the pipelined Thomas algorithm that starts the backward step computations immediately after the completion of the forward step computations for the first portion of lines. This algorithm has data available for other computational tasks while processors are idle from the Thomas algorithm. The proposed 3-D directionally split solver is based on the static scheduling of processors where local and non-local, data-dependent and data-independent computations are scheduled while processors are idle. A theoretical model of parallelization efficiency is used to define optimal parameters of the algorithm, to show an asymptotic parallelization penalty and to obtain an optimal cover of a global domain with subdomains. It is shown by computational experiments and by the theoretical model that the proposed algorithm reduces the parallelization penalty about two times over the basic algorithm for the range of the number of processors (subdomains) considered and the number of grid nodes per subdomain.
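For readers unfamiliar with the forward and backward steps mentioned in this abstract, the sketch below shows the standard serial Thomas algorithm for a single tridiagonal system in Python. It is only the textbook baseline: the pipelined, reformulated parallel version that the record describes is not reproduced here, and the function name `thomas_solve` and its argument layout are choices made for illustration.

```python
def thomas_solve(a, b, c, d):
    """Solve a tridiagonal system with the serial Thomas algorithm.

    a: sub-diagonal (a[0] is unused), b: main diagonal,
    c: super-diagonal (c[-1] is unused), d: right-hand side.
    No pivoting is done, so the matrix is assumed to be well conditioned
    (for example, diagonally dominant).
    """
    n = len(d)
    cp = [0.0] * n  # modified super-diagonal
    dp = [0.0] * n  # modified right-hand side

    # Forward step: eliminate the sub-diagonal, row by row.
    cp[0] = c[0] / b[0]
    dp[0] = d[0] / b[0]
    for i in range(1, n):
        denom = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / denom if i < n - 1 else 0.0
        dp[i] = (d[i] - a[i] * dp[i - 1]) / denom

    # Backward step: back-substitute from the last row upward.
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x


if __name__ == "__main__":
    # 3x3 system whose exact solution is [1, 2, 3].
    print(thomas_solve([0.0, 1.0, 1.0], [2.0, 2.0, 2.0], [1.0, 1.0, 0.0], [4.0, 8.0, 8.0]))
```

Each row of the forward step depends on the previous row, and each row of the backward step depends on the next one. When a line of unknowns is split across processors, this dependency chain is what leaves processors idle and motivates pipelining and the scheduling of other work into the idle time.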
Performance and scalability evaluation of "Big Memory" on Blue Gene Linux.

DOE Office of Scientific and Technical Information (OSTI.GOV)

Yoshii, K.; Iskra, K.; Naik, H.

2011-05-01

We address memory performance issues observed in Blue Gene Linux and discuss the design and implementation of 'Big Memory' - an alternative, transparent memory space introduced to eliminate the memory performance issues. We evaluate the performance of Big Memory using custom memory benchmarks, NAS Parallel Benchmarks, and the Parallel Ocean Program, at a scale of up to 4,096 nodes. We find that Big Memory successfully resolves the performance issues normally encountered in Blue Gene Linux. For the ocean simulation program, we even find that Linux with Big Memory provides better scalability than does the lightweight compute node kernel designed solely for high-performance applications. Originally intended exclusively for compute node tasks, our new memory subsystem dramatically improves the performance of certain I/O node applications as well. We demonstrate this performance using the central processor of the LOw Frequency ARray radio telescope as an example.
Initial Kernel Timing Using a Simple PIM Performance Model

NASA Technical Reports Server (NTRS)

Katz, Daniel S.; Block, Gary L.; Springer, Paul L.; Sterling, Thomas; Brockman, Jay B.; Callahan, David

2005-01-01

This presentation will describe some initial results of paper-and-pencil studies of 4 or 5 application kernels applied to a processor-in-memory (PIM) system roughly similar to the Cascade Lightweight Processor (LWP). The application kernels are: linked list traversal; sum of leaf nodes on a tree; bitonic sort; vector sum; and Gaussian elimination. The intent of this work is to guide and validate work on the Cascade project in the areas of compilers, simulators, and languages. We will first discuss the generic PIM structure. Then, we will explain the concepts needed to program a parallel PIM system (locality, threads, parcels). Next, we will present a simple PIM performance model that will be used in the remainder of the presentation. For each kernel, we will then present a set of codes, including codes for a single PIM node, and codes for multiple PIM nodes that move data to threads and move threads to data. These codes are written at a fairly low level, between assembly and C, but much closer to C than to assembly. For each code, we will present some hand-drafted timing forecasts, based on the simple PIM performance model. Finally, we will conclude by discussing what we have learned from this work, including what programming styles seem to work best, from the point-of-view of both expressiveness and performance.
Demonstration of a full volume 3D pre-stack depth migration in the Garden Banks area using massively parallel processor (MPP) technology

DOE Office of Scientific and Technical Information (OSTI.GOV)

Solano, M.; Chang, H.; VanDyke, J.

1996-12-31

This paper describes the implementation and results of portable, production-scale 3D Pre-stack Kirchhoff depth migration software. Full volume pre-stack imaging was applied to a six million trace (46.9 Gigabyte) data set from a subsalt play in the Garden Banks area in the Gulf of Mexico. The velocity model building and updating were derived using image depth gathers and an image-driven strategy. After three velocity iterations, depth migrated sections revealed drilling targets that were not visible in the conventional 3D post-stack time migrated data set. As expected from the implementation of the migration algorithm, it was found that amplitudes are well preserved and anomalies associated with known reservoirs conform to petrophysical predictions. Image gathers for velocity analysis and the final depth migrated volume were generated on an 1824 node Intel Paragon at Sandia National Laboratories. The code has been successfully ported to a CRAY (T3D) and Unix workstation Parallel Virtual Machine environments (PVM).
ADEN ALOS PALSAR Product Verification

NASA Astrophysics Data System (ADS)

Wright, P. A.; Meadows, P. J.; Mack, G.; Miranda, N.; Lavalle, M.

2008-11-01

Within the ALOS Data European Node (ADEN) the verification of PALSAR products is an important and continuing activity, to ensure data utility for the users. The paper will give a summary of the verification activities, the status of the ADEN PALSAR processor and the current quality issues that are important for users of ADEN PALSAR data.
A FPGA embedded web server for remote monitoring and control of smart sensors networks.

PubMed

Magdaleno, Eduardo; Rodríguez, Manuel; Pérez, Fernando; Hernández, David; García, Enrique

2013-12-27

This article describes the implementation of a web server using an embedded Altera NIOS II IP core, a general purpose and configurable RISC processor which is embedded in a Cyclone FPGA. The processor uses the μCLinux operating system to support a Boa web server of dynamic pages using Common Gateway Interface (CGI). The FPGA is configured to act like the master node of a network, and also to control and monitor a network of smart sensors or instruments. In order to develop a totally functional system, the FPGA also includes an implementation of the time-triggered protocol (TTP/A). Thus, the implemented master node has two interfaces, the webserver that acts as an Internet interface and the other to control the network. This protocol is widely used to connect smart sensors, actuators, and microsystems in embedded real-time systems in different application domains, e.g., industrial, automotive, domotic, etc., although this protocol can be easily replaced by any other because of the inherent characteristics of the FPGA-based technology.
A FPGA Embedded Web Server for Remote Monitoring and Control of Smart Sensors Networks

PubMed Central

Magdaleno, Eduardo; Rodríguez, Manuel; Pérez, Fernando; Hernández, David; García, Enrique

2014-01-01

This article describes the implementation of a web server using an embedded Altera NIOS II IP core, a general purpose and configurable RISC processor which is embedded in a Cyclone FPGA. The processor uses the μCLinux operating system to support a Boa web server of dynamic pages using Common Gateway Interface (CGI). The FPGA is configured to act like the master node of a network, and also to control and monitor a network of smart sensors or instruments. In order to develop a totally functional system, the FPGA also includes an implementation of the time-triggered protocol (TTP/A). Thus, the implemented master node has two interfaces, the webserver that acts as an Internet interface and the other to control the network. This protocol is widely used to connect smart sensors, actuators, and microsystems in embedded real-time systems in different application domains, e.g., industrial, automotive, domotic, etc., although this protocol can be easily replaced by any other because of the inherent characteristics of the FPGA-based technology. PMID:24379047
A high-speed, large-capacity, 'jukebox' optical disk system

NASA Technical Reports Server (NTRS)

Ammon, G. J.; Calabria, J. A.; Thomas, D. T.

1985-01-01

Two optical disk 'jukebox' mass storage systems which provide access to any data in a store of 10 to the 13th bits (1250G bytes) within six seconds have been developed. The optical disk jukebox system is divided into two units, including a hardware/software controller and a disk drive. The controller provides flexibility and adaptability, through a ROM-based microcode-driven data processor and a ROM-based software-driven control processor. The cartridge storage module contains 125 optical disks housed in protective cartridges. Attention is given to a conceptual view of the disk drive unit, the NASA optical disk system, the NASA database management system configuration, the NASA optical disk system interface, and an open systems interconnect reference model.
Algorithm implementation on the Navier-Stokes computer

DOE Office of Scientific and Technical Information (OSTI.GOV)

Krist, S.E.; Zang, T.A.

1987-03-01

The Navier-Stokes Computer is a multi-purpose parallel-processing supercomputer which is currently under development at Princeton University. It consists of multiple local memory parallel processors, called Nodes, which are interconnected in a hypercube network. Details of the procedures involved in implementing an algorithm on the Navier-Stokes computer are presented. The particular finite difference algorithm considered in this analysis was developed for simulation of laminar-turbulent transition in wall bounded shear flows. Projected timing results for implementing this algorithm indicate that operation rates in excess of 42 GFLOPS are feasible on a 128 Node machine.
Algorithm implementation on the Navier-Stokes computer

NASA Technical Reports Server (NTRS)

Krist, Steven E.; Zang, Thomas A.

1987-01-01

The Navier-Stokes Computer is a multi-purpose parallel-processing supercomputer which is currently under development at Princeton University. It consists of multiple local memory parallel processors, called Nodes, which are interconnected in a hypercube network. Details of the procedures involved in implementing an algorithm on the Navier-Stokes computer are presented. The particular finite difference algorithm considered in this analysis was developed for simulation of laminar-turbulent transition in wall bounded shear flows. Projected timing results for implementing this algorithm indicate that operation rates in excess of 42 GFLOPS are feasible on a 128 Node machine.
Advanced flight computer. Special study

NASA Technical Reports Server (NTRS)

Coo, Dennis

1995-01-01

This report documents a special study to define a 32-bit radiation hardened, SEU tolerant flight computer architecture, and to investigate current or near-term technologies and development efforts that contribute to the Advanced Flight Computer (AFC) design and development. An AFC processing node architecture is defined. Each node may consist of a multi-chip processor as needed. The modular, building block approach uses VLSI technology and packaging methods that demonstrate a feasible AFC module in 1998 that meets the AFC goals. The defined architecture and approach demonstrate a clear low-risk, low-cost path to the 1998 production goal, with intermediate prototypes in 1996.
Modern multicore and manycore architectures: Modelling, optimisation and benchmarking a multiblock CFD code

NASA Astrophysics Data System (ADS)

Hadade, Ioan; di Mare, Luca

2016-08-01

Modern multicore and manycore processors exhibit multiple levels of parallelism through a wide range of architectural features such as SIMD for data parallel execution or threads for core parallelism. The exploitation of multi-level parallelism is therefore crucial for achieving superior performance on current and future processors. This paper presents the performance tuning of a multiblock CFD solver on Intel SandyBridge and Haswell multicore CPUs and the Intel Xeon Phi Knights Corner coprocessor. Code optimisations have been applied on two computational kernels exhibiting different computational patterns: the update of flow variables and the evaluation of the Roe numerical fluxes. We discuss at great length the code transformations required for achieving efficient SIMD computations for both kernels across the selected devices including SIMD shuffles and transpositions for flux stencil computations and global memory transformations. Core parallelism is expressed through threading based on a number of domain decomposition techniques together with optimisations pertaining to alleviating NUMA effects found in multi-socket compute nodes. Results are correlated with the Roofline performance model in order to assert their efficiency for each distinct architecture. We report significant speedups for single thread execution across both kernels: 2-5X on the multicore CPUs and 14-23X on the Xeon Phi coprocessor. Computations at full node and chip concurrency deliver a factor of three speedup on the multicore processors and up to 24X on the Xeon Phi manycore coprocessor.
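The Roofline model referred to in this abstract bounds a kernel's attainable performance by the lesser of the peak compute rate and the product of memory bandwidth and arithmetic intensity. The sketch below shows that bound only; the peak and bandwidth figures are hypothetical placeholders, not measurements from the paper, and the function name is an assumption.

```python
def roofline_gflops(peak_gflops: float, bandwidth_gbs: float, intensity: float) -> float:
    """Attainable GFLOP/s for a kernel with the given arithmetic intensity (FLOP/byte)."""
    # Memory-bound kernels sit on the sloped part of the roof,
    # compute-bound kernels on the flat part.
    return min(peak_gflops, bandwidth_gbs * intensity)


# Hypothetical machine: 500 GFLOP/s peak, 50 GB/s memory bandwidth.
PEAK, BANDWIDTH = 500.0, 50.0

print(roofline_gflops(PEAK, BANDWIDTH, 0.5))   # 25.0  (memory bound)
print(roofline_gflops(PEAK, BANDWIDTH, 20.0))  # 500.0 (compute bound)
print(PEAK / BANDWIDTH)                        # 10.0  (ridge point, FLOP/byte)
```

Comparing a measured kernel rate against this bound indicates how close an optimised kernel is to what the hardware allows.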
DOE Office of Scientific and Technical Information (OSTI.GOV)

Learn, Mark Walter

Sandia National Laboratories is currently developing new processing and data communication architectures for use in future satellite payloads. These architectures will leverage the flexibility and performance of state-of-the-art static-random-access-memory-based Field Programmable Gate Arrays (FPGAs). One such FPGA is the radiation-hardened version of the Virtex-5 being developed by Xilinx. However, not all features of this FPGA are being radiation-hardened by design and could still be susceptible to on-orbit upsets. One such feature is the embedded hard-core PPC440 processor. Since this processor is implemented in the FPGA as a hard-core, traditional mitigation approaches such as Triple Modular Redundancy (TMR) are not available to improve the processor's on-orbit reliability. The goal of this work is to investigate techniques that can help mitigate the embedded hard-core PPC440 processor within the Virtex-5 FPGA other than TMR. Implementing various mitigation schemes reliably within the PPC440 offers a powerful reconfigurable computing resource to these node-based processing architectures. This document summarizes the work done on the cache mitigation scheme for the embedded hard-core PPC440 processor within the Virtex-5 FPGAs, and describes in detail the design of the cache mitigation scheme and the testing conducted at the radiation effects facility on the Texas A&M campus.
Communication-Driven Codesign for Multiprocessor Systems

DTIC Science & Technology

2004-01-01

processors, FPGA or ASIC subsystems, microprocessors, and microcontrollers. When a processor is embedded within a SLOT architecture, one or more...Broderson, Low-power CMOS digital design, IEEE Journal of Solid-State Circuits 27 (1992), no. 4, 473–484. [25] L. Chao and E. Sha, Scheduling data-flow...1997), 239–256. [82] P. K. Murthy, E. G. Cohen, and S. Rowland, System Canvas: A new design environment for embedded DSP and telecommunications
Bluetooth-based wireless sensor networks

NASA Astrophysics Data System (ADS)

You, Ke; Liu, Rui Qiang

2007-11-01

In this work a Bluetooth-based wireless sensor network is proposed. The network uses an information-driven star topology and an energy-saving mode, through which a Bluetooth master node can control more than seven slave nodes, the energy consumption of each sensor node is reduced, and the secure management of each sensor node is improved.
Alternative Water Processor Test Development

NASA Technical Reports Server (NTRS)

Pickering, Karen D.; Mitchell, Julie L.; Adam, Niklas M.; Barta, Daniel; Meyer, Caitlin E.; Pensinger, Stuart; Vega, Leticia M.; Callahan, Michael R.; Flynn, Michael; Wheeler, Ray;

2013-01-01

The Next Generation Life Support Project is developing an Alternative Water Processor (AWP) as a candidate water recovery system for long duration exploration missions. The AWP consists of a biological water processor (BWP) integrated with a forward osmosis secondary treatment system (FOST). The basis of the BWP is a membrane aerated biological reactor (MABR), developed in concert with Texas Tech University. Bacteria located within the MABR metabolize organic material in wastewater, converting approximately 90% of the total organic carbon to carbon dioxide. In addition, bacteria convert a portion of the ammonia-nitrogen present in the wastewater to nitrogen gas, through a combination of nitrification and denitrification. The effluent from the BWP system is low in organic contaminants, but high in total dissolved solids. The FOST system, integrated downstream of the BWP, removes dissolved solids through a combination of concentration-driven forward osmosis and pressure-driven reverse osmosis. The integrated system is expected to produce water with a total organic carbon less than 50 mg/l and dissolved solids that meet potable water requirements for spaceflight. This paper describes the test definition, the design of the BWP and FOST subsystems, and plans for integrated testing.

Low-voltage analog front-end processor design for ISFET-based sensor and H+ sensing applications

NASA Astrophysics Data System (ADS)

Chung, Wen-Yaw; Yang, Chung-Huang; Peng, Kang-Chu; Yeh, M. H.

2003-04-01

This paper presents a modular low-voltage analog front-end processor design in a 0.5 μm double-poly double-metal CMOS technology for Ion Sensitive Field Effect Transistor (ISFET)-based sensor and H+ sensing applications. To meet the potentiometric response of the ISFET, which is proportional to various H+ concentrations, the constant-voltage and constant-current (CVCS) testing configuration has been used. Low-voltage design techniques such as a bulk-driven input pair, a folded-cascode amplifier, and bootstrap switch control circuits have been designed and integrated for a 1.5 V supply and nearly rail-to-rail analog-to-digital signal processing. Core modules consisting of an 8-bit two-step analog-digital converter and bulk-driven pre-amplifiers have been developed in this research. The experimental results show that the proposed circuitry has an acceptable linearity to 0.1 pH-H+ sensing conversions with the buffer solution in the range of pH2 to pH12. The processor has potential use in battery-operated and portable healthcare devices and environmental monitoring applications.
Alternative Water Processor Test Development

NASA Technical Reports Server (NTRS)

Pickering, Karen D.; Mitchell, Julie; Vega, Leticia; Adam, Niklas; Flynn, Michael; Wheeler, Ray; Lunn, Griffin; Jackson, Andrew

2012-01-01

The Next Generation Life Support Project is developing an Alternative Water Processor (AWP) as a candidate water recovery system for long duration exploration missions. The AWP consists of a biological water processor (BWP) integrated with a forward osmosis secondary treatment system (FOST). The basis of the BWP is a membrane aerated biological reactor (MABR), developed in concert with Texas Tech University. Bacteria located within the MABR metabolize organic material in wastewater, converting approximately 90% of the total organic carbon to carbon dioxide. In addition, bacteria convert a portion of the ammonia-nitrogen present in the wastewater to nitrogen gas, through a combination of nitrification and denitrification. The effluent from the BWP system is low in organic contaminants, but high in total dissolved solids. The FOST system, integrated downstream of the BWP, removes dissolved solids through a combination of concentration-driven forward osmosis and pressure-driven reverse osmosis. The integrated system is expected to produce water with a total organic carbon less than 50 mg/l and dissolved solids that meet potable water requirements for spaceflight. This paper describes the test definition, the design of the BWP and FOST subsystems, and plans for integrated testing.
Software/hardware distributed processing network supporting the Ada environment

NASA Astrophysics Data System (ADS)

Wood, Richard J.; Pryk, Zen

1993-09-01

A high-performance, fault-tolerant, distributed network has been developed, tested, and demonstrated. The network is based on the MIPS Computer Systems, Inc. R3000 RISC for processing, VHSIC ASICs for high-speed, reliable inter-node communications, and compatible commercial memory and I/O boards. The network is an evolution of the Advanced Onboard Signal Processor (AOSP) architecture. It supports Ada application software with an Ada-implemented operating system. A six-node implementation (capable of expansion up to 256 nodes) of the RISC multiprocessor architecture provides 120 MIPS of scalar throughput, 96 Mbytes of RAM and 24 Mbytes of non-volatile memory. The network provides for all ground processing applications, has merit for space-qualified RISC-based networks, and interfaces to advanced Computer Aided Software Engineering (CASE) tools for application software development.
[A novel biologic electricity signal measurement based on neuron chip].

PubMed

Lei, Yinsheng; Wang, Mingshi; Sun, Tongjing; Zhu, Qiang; Qin, Ran

2006-06-01

The Neuron chip is a multiprocessor with three pipelined CPUs; its communication protocol and control processor are integrated to carry out communication, control, scheduling, I/O, and other functions. A novel biologic electronic signal measurement network system is composed of intelligent measurement nodes with the Neuron chip at the core. In this study, electronic signals such as ECG, EEG, EMG and BOS can be measured together by these intelligent nodes, and some valuable diagnostic information is found. The wavelet transform is employed in this system to analyze various biologic electronic signals because of its strong time-frequency ability to decompose the local character of a signal, and improved results are obtained. This paper introduces the hardware structure of the network and the intelligent measurement node, the measurement theory, and the signal figures of data acquisition and processing.

Large-N in Volcano Settings: Volcanosri

NASA Astrophysics Data System (ADS)

Lees, J. M.; Song, W.; Xing, G.; Vick, S.; Phillips, D.

2014-12-01

We seek a paradigm shift in the approach we take to volcano monitoring, in which high fidelity is traded for large numbers of sensors to increase coverage and resolution. Accessibility, danger and the risk of equipment loss require that we develop systems that are independent and inexpensive. Furthermore, rather than simply recording data on hard disk for later analysis, we desire a system that will work autonomously, capitalizing on wireless technology and in-field network analysis. To this end we are currently producing a low-cost seismic array which will incorporate, at the very basic level, seismological tools for first-cut analysis of a volcano in crisis mode. At the advanced end we expect to perform tomographic inversions in the network in near real time. Geophone (4 Hz) sensors connected to a low-cost recording system will be installed on an active volcano, where triggering, earthquake location and velocity analysis will take place independent of human interaction. Stations are designed to be inexpensive and possibly disposable. In one of the first implementations the seismic nodes consist of an Arduino Due processor board with an attached Seismic Shield. The Arduino Due processor board contains an Atmel SAM3X8E ARM Cortex-M3 CPU. This 32-bit 84 MHz processor can filter and perform coarse seismic event detection on a 1600-sample signal in fewer than 200 milliseconds. The Seismic Shield contains a GPS module, 900 MHz high-power mesh network radio, SD card, seismic amplifier, and 24-bit ADC. External sensors can be attached to either this 24-bit ADC or to the internal multichannel 12-bit ADC contained on the Arduino Due processor board. This allows the node to support attachment of multiple sensors. By utilizing a high-speed 32-bit processor, complex signal processing tasks can be performed simultaneously on multiple sensors. Using a 10 W solar panel, a second system being developed can run autonomously and collect data on 3 channels at 100 Hz for 6 months with the installed 16Gb SD card. Initial designs and test results will be presented and discussed.
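The abstract does not say how the coarse event detection is done. One widely used approach in seismology is the short-term-average/long-term-average (STA/LTA) trigger, sketched below under that assumption. The window lengths, threshold, synthetic trace, and function names are all hypothetical choices for illustration, not parameters of the described node.

```python
import math


def sta_lta_trigger(samples, sta_len=20, lta_len=200, threshold=4.0):
    """Return the first index where the STA/LTA energy ratio exceeds threshold, else None."""
    energy = [s * s for s in samples]
    # Both windows initially end at sample lta_len - 1.
    sta = sum(energy[lta_len - sta_len:lta_len])
    lta = sum(energy[:lta_len])
    for i in range(lta_len, len(energy)):
        # Slide both windows forward by one sample.
        sta += energy[i] - energy[i - sta_len]
        lta += energy[i] - energy[i - lta_len]
        ratio = (sta / sta_len) / max(lta / lta_len, 1e-12)
        if ratio > threshold:
            return i
    return None


def synthetic_trace(n=1600, burst_start=None, burst_len=200):
    """Low-amplitude background with an optional stronger burst (a stand-in for an event)."""
    trace = []
    for i in range(n):
        value = 0.1 * math.sin(0.3 * i)
        if burst_start is not None and burst_start <= i < burst_start + burst_len:
            value += math.sin(0.9 * i)
        trace.append(value)
    return trace


print(sta_lta_trigger(synthetic_trace()))                 # quiet trace: None
print(sta_lta_trigger(synthetic_trace(burst_start=800)))  # an index shortly after 800
```

The running sums make each step constant-time, which suits a small microcontroller processing a 1600-sample buffer.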
Benchmarking NWP Kernels on Multi- and Many-core Processors

NASA Astrophysics Data System (ADS)

Michalakes, J.; Vachharajani, M.

2008-12-01

Increased computing power for weather, climate, and atmospheric science has provided direct benefits for defense, agriculture, the economy, the environment, and public welfare and convenience. Today, very large clusters with many thousands of processors are allowing scientists to move forward with simulations of unprecedented size. But time-critical applications such as real-time forecasting or climate prediction need strong scaling: faster nodes and processors, not more of them. Moreover, the need for good cost-performance has never been greater, both in terms of performance per watt and per dollar. For these reasons, the new generations of multi- and many-core processors being mass produced for commercial IT and "graphical computing" (video games) are being scrutinized for their ability to exploit the abundant fine-grain parallelism in atmospheric models. We present results of our work to date identifying key computational kernels within the dynamics and physics of a large community NWP model, the Weather Research and Forecast (WRF) model. We benchmark and optimize these kernels on several different multi- and many-core processors. The goals are to (1) characterize and model performance of the kernels in terms of computational intensity, data parallelism, memory bandwidth pressure, memory footprint, etc., (2) enumerate and classify effective strategies for coding and optimizing for these new processors, (3) assess difficulties and opportunities for tool or higher-level language support, and (4) establish a continuing set of kernel benchmarks that can be used to measure and compare effectiveness of current and future designs of multi- and many-core processors for weather and climate applications.
Trade-offs between driving nodes and time-to-control in complex networks

PubMed Central

Pequito, Sérgio; Preciado, Victor M.; Barabási, Albert-László; Pappas, George J.

2017-01-01

Recent advances in control theory provide us with efficient tools to determine the minimum number of driving (or driven) nodes to steer a complex network towards a desired state. Furthermore, we often need to do it within a given time window, so it is of practical importance to understand the trade-offs between the minimum number of driving/driven nodes and the minimum time required to reach a desired state. Therefore, we introduce the notion of actuation spectrum to capture such trade-offs, which we used to find that in many complex networks only a small fraction of driving (or driven) nodes is required to steer the network to a desired state within a relatively small time window. Furthermore, our empirical studies reveal that, even though synthetic network models are designed to present structural properties similar to those observed in real networks, their actuation spectra can be dramatically different. Thus, it supports the need to develop new synthetic network models able to replicate controllability properties of real-world networks. PMID:28054597
Trade-offs between driving nodes and time-to-control in complex networks

NASA Astrophysics Data System (ADS)

Pequito, Sérgio; Preciado, Victor M.; Barabási, Albert-László; Pappas, George J.

2017-01-01

Recent advances in control theory provide us with efficient tools to determine the minimum number of driving (or driven) nodes to steer a complex network towards a desired state. Furthermore, we often need to do it within a given time window, so it is of practical importance to understand the trade-offs between the minimum number of driving/driven nodes and the minimum time required to reach a desired state. Therefore, we introduce the notion of actuation spectrum to capture such trade-offs, which we used to find that in many complex networks only a small fraction of driving (or driven) nodes is required to steer the network to a desired state within a relatively small time window. Furthermore, our empirical studies reveal that, even though synthetic network models are designed to present structural properties similar to those observed in real networks, their actuation spectra can be dramatically different. Thus, it supports the need to develop new synthetic network models able to replicate controllability properties of real-world networks.
Magnetic Bubble Memories for Data Collection in Sounding Rockets,

DTIC Science & Technology

1982-01-29

generate interest in bubbles as a mass storage device for microprocessor based equipment, manufacturers have come up with a variety of diversified...absence of a bubble represents a "0". With diameters on the order of 1 to 5 micrometers, these bubbles are so small that extremely tiny chips can hold...methods of transfer: polled I/O, interrupt driven I/O, and direct memory access (DMA). The first two methods require the host processor be involved
Parallel solution of high-order numerical schemes for solving incompressible flows

NASA Technical Reports Server (NTRS)

Milner, Edward J.; Lin, Avi; Liou, May-Fun; Blech, Richard A.

1993-01-01

A new parallel numerical scheme for solving incompressible steady-state flows is presented. The algorithm uses a finite-difference approach to solving the Navier-Stokes equations. The algorithms are scalable and expandable. They may be used with only two processors or with as many processors as are available. The code is general and expandable. Any size grid may be used. Four processors of the NASA LeRC Hypercluster were used to solve for steady-state flow in a driven square cavity. The Hypercluster was configured in a distributed-memory, hypercube-like architecture. By using a 50-by-50 finite-difference solution grid, an efficiency of 74 percent (a speedup of 2.96) was obtained.
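The efficiency and speedup quoted in this abstract are related by efficiency = speedup / processors. The short check below reproduces the 74 percent figure from the stated speedup of 2.96 on four processors. It also computes the Karp-Flatt experimentally determined serial fraction as an additional diagnostic; that metric is not mentioned in the abstract and is included here only as an illustration.

```python
def parallel_efficiency(speedup: float, processors: int) -> float:
    """Fraction of ideal linear speedup actually achieved."""
    return speedup / processors


def karp_flatt(speedup: float, processors: int) -> float:
    """Experimentally determined serial fraction (Karp-Flatt metric)."""
    return (1.0 / speedup - 1.0 / processors) / (1.0 - 1.0 / processors)


print(round(parallel_efficiency(2.96, 4), 2))  # 0.74
print(round(karp_flatt(2.96, 4), 3))           # 0.117
```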
Using Nested Contractions and a Hierarchical Tensor Format To Compute Vibrational Spectra of Molecules with Seven Atoms.

PubMed

Thomas, Phillip S; Carrington, Tucker

2015-12-31

We propose a method for solving the vibrational Schrödinger equation with which one can compute hundreds of energy levels of seven-atom molecules using at most a few gigabytes of memory. It uses nested contractions in conjunction with the reduced-rank block power method (RRBPM) described in J. Chem. Phys. 2014, 140, 174111. Successive basis contractions are organized into a tree, the nodes of which are associated with eigenfunctions of reduced-dimension Hamiltonians. The RRBPM is used recursively to compute eigenfunctions of nodes in bases of products of reduced-dimension eigenfunctions of nodes with fewer coordinates. The corresponding vectors are tensors in what is called CP-format. The final wave functions are therefore represented in a hierarchical CP-format. Computational efficiency and accuracy are significantly improved by representing the Hamiltonian in the same hierarchical format as the wave function. We demonstrate that with this hierarchical RRBPM it is possible to compute energy levels of a 64-D coupled-oscillator model Hamiltonian and also of acetonitrile (CH3CN) and ethylene oxide (C2H4O), for which we use quartic potentials. The most accurate acetonitrile calculation uses 139 MB of memory and takes 3.2 h on a single processor. The most accurate ethylene oxide calculation uses 6.1 GB of memory and takes 14 d on 63 processors. The hierarchical RRBPM shatters the memory barrier that impedes the calculation of vibrational spectra.
A computational system for lattice QCD with overlap Dirac quarks

NASA Astrophysics Data System (ADS)

Chiu, Ting-Wai; Hsieh, Tung-Han; Huang, Chao-Hsi; Huang, Tsung-Ren

2003-05-01

We outline the essential features of a Linux PC cluster which is now being developed at National Taiwan University, and discuss how to optimize its hardware and software for lattice QCD with overlap Dirac quarks. At present, the cluster consists of 30 nodes, with each node consisting of one Pentium 4 processor (1.6/2.0 GHz), one Gbyte of PC800 RDRAM, one 40/80 Gbyte hard disk, and a network card. The speed of this system is estimated to be 30 Gflops, and its price/performance ratio is better than $1.0/Mflops for 64-bit (double precision) computations in quenched lattice QCD with overlap Dirac quarks.
Spectral Structure Of Phase-Induced Intensity Noise In Recirculating Delay Lines

NASA Astrophysics Data System (ADS)

Tur, M.; Moslehi, B.; Bowers, J. E.; Newton, S. A.; Jackson, K. P.; Goodman, J. W.; Cutler, C. C.; Shaw, H. J.

1983-09-01

The dynamic range of fiber optic signal processors driven by relatively incoherent multimode semiconductor lasers is shown to be severely limited by laser phase-induced noise. It is experimentally demonstrated that while the noise power spectrum of differential length fiber filters is approximately flat, processors with recirculating loops exhibit noise with a periodically structured power spectrum with notches at zero frequency as well as at all other multiples of 1/(loop delay). The experimental results are augmented by a theoretical analysis.
Free-Electron Laser Driven by the NBS (National Bureau of Standards) CW Microtron

DTIC Science & Technology

1988-03-31

planned over several years. This will begin with the purchase of a 32-bit dual processor system for the yet to be constructed primary station wire scanner ...display subsystem. This 32-bit dual processor system will not only form the wire scanner display system, but has sufficient processing power to...7th Int. Conf. on FELs, eds., E.T. Scharlemann and D. Prosnitz (North-Holland, Amsterdam, 1986) p. 278. 121 X.K. Maruyama and S. Penner, C.M. Tang
Flexible network wireless transceiver and flexible network telemetry transceiver

DOEpatents

Brown, Kenneth D.

2008-08-05

A transceiver for facilitating two-way wireless communication between a baseband application and other nodes in a wireless network, wherein the transceiver provides baseband communication networking and necessary configuration and control functions along with transmitter, receiver, and antenna functions to enable the wireless communication. More specifically, the transceiver provides a long-range wireless duplex communication node or channel between the baseband application, which is associated with a mobile or fixed space, air, water, or ground vehicle or other platform, and other nodes in the wireless network or grid. The transceiver broadly comprises a communication processor; a flexible telemetry transceiver including a receiver and a transmitter; a power conversion and regulation mechanism; a diplexer; and a phased array antenna system, wherein these various components and certain subcomponents thereof may be separately enclosed and distributable relative to the other components and subcomponents.
Fault isolation through no-overhead link level CRC

DOEpatents

Chen, Dong; Coteus, Paul W.; Gara, Alan G.

2007-04-24

A fault isolation technique for checking the accuracy of data packets transmitted between nodes of a parallel processor. An independent CRC is kept of all data sent from one processor to another, and of all data received by one processor from another. At the end of each checkpoint, the CRCs are compared. If they do not match, there was an error. The CRCs may be cleared and restarted at each checkpoint. In the preferred embodiment, the basic functionality is to calculate a CRC of all packet data that has been successfully transmitted across a given link. This CRC is done on both ends of the link, thereby allowing an independent check on all data believed to have been correctly transmitted. Preferably, all links have this CRC coverage, and the CRC used in this link level check is different from that used in the packet transfer protocol. This independent check, if successfully passed, virtually eliminates the possibility that any data errors were missed during the previous transfer period.
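The checkpoint comparison described in this abstract can be sketched as follows: each end of a link accumulates a running CRC over every payload it sends or receives, and the two values are compared and cleared at a checkpoint. This is a minimal illustration only. It uses CRC-32 from Python's `zlib` as a stand-in, since the abstract does not specify the polynomial, and the class and function names are hypothetical.

```python
import zlib


class LinkEndpoint:
    """One end of a link, keeping a running CRC over all packet payloads seen."""

    def __init__(self):
        self.crc = 0

    def record(self, payload: bytes) -> None:
        # zlib.crc32 accepts the previous value, so the CRC spans all packets.
        self.crc = zlib.crc32(payload, self.crc)

    def checkpoint(self) -> int:
        # Return the accumulated CRC and restart for the next period.
        value, self.crc = self.crc, 0
        return value


def link_ok(packets, corrupt_index=None) -> bool:
    """Simulate one transfer period; optionally flip one bit of one packet in flight."""
    sender, receiver = LinkEndpoint(), LinkEndpoint()
    for i, packet in enumerate(packets):
        sender.record(packet)
        received = packet
        if i == corrupt_index:
            received = bytes([packet[0] ^ 0x01]) + packet[1:]
        receiver.record(received)
    return sender.checkpoint() == receiver.checkpoint()


packets = [b"header+payload-0", b"header+payload-1", b"header+payload-2"]
print(link_ok(packets))                   # True
print(link_ok(packets, corrupt_index=1))  # False
```

Because the comparison happens only at checkpoints, the scheme adds no per-packet traffic on the link; a mismatch identifies the faulty link rather than the faulty packet.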
A multi-satellite orbit determination problem in a parallel processing environment

NASA Technical Reports Server (NTRS)

Deakyne, M. S.; Anderle, R. J.

1988-01-01

The Engineering Orbit Analysis Unit at GE Valley Forge used an Intel Hypercube Parallel Processor to investigate the performance and gain experience of parallel processors with a multi-satellite orbit determination problem. A general study was selected in which major blocks of computation for the multi-satellite orbit computations were used as units to be assigned to the various processors on the Hypercube. Problems encountered or successes achieved in addressing the orbit determination problem would be more likely to be transferable to other parallel processors. The prime objective was to study the algorithm to allow processing of observations later in time than those employed in the state update. Expertise in ephemeris determination was exploited in addressing these problems and the facility used to bring a realism to the study which would highlight the problems which may not otherwise be anticipated. Secondary objectives were to gain experience of a non-trivial problem in a parallel processor environment, to explore the necessary interplay of serial and parallel sections of the algorithm in terms of timing studies, to explore the granularity (coarse vs. fine grain) to discover the granularity limit above which there would be a risk of starvation where the majority of nodes would be idle or under the limit where the overhead associated with splitting the problem may require more work and communication time than is useful.
A monitoring system for vegetable greenhouses based on a wireless sensor network.

PubMed

Li, Xiu-hong; Cheng, Xiao; Yan, Ke; Gong, Peng

2010-01-01

A wireless sensor network-based automatic monitoring system is designed for monitoring the life conditions of greenhouse vegetables. The complete system architecture includes a group of sensor nodes, a base station, and an internet data center. For the design of wireless sensor node, the JN5139 micro-processor is adopted as the core component and the Zigbee protocol is used for wireless communication between nodes. With an ARM7 microprocessor and embedded ZKOS operating system, a proprietary gateway node is developed to achieve data influx, screen display, system configuration and GPRS based remote data forwarding. Through a Client/Server mode the management software for remote data center achieves real-time data distribution and time-series analysis. Besides, a GSM-short-message-based interface is developed for sending real-time environmental measurements, and for alarming when a measurement is beyond some pre-defined threshold. The whole system has been tested for over one year and satisfactory results have been observed, which indicate that this system is very useful for greenhouse environment monitoring.
DOE Office of Scientific and Technical Information (OSTI.GOV)

Dmitriev, Alexander S.; Yemelyanov, Ruslan Yu.; Moscow Institute of Physics and Technology

The paper deals with a new multi-element processor platform designed for modelling the behaviour of interacting dynamical systems, i.e., an active wireless network. Experimentally, this ensemble is implemented in an active network, the active nodes of which include direct chaotic transceivers and special actuator boards containing microcontrollers for modelling the dynamical systems and an information display unit (colored LEDs). The modelling technique and experimental results are described and analyzed.
Effective correlator for RadioAstron project

NASA Astrophysics Data System (ADS)

Sergeev, Sergey

This paper presents the implementation of a software FX correlator for Very Long Baseline Interferometry, adapted for the "RadioAstron" project. The software correlator is implemented for heterogeneous computing systems using graphics accelerators. It is shown that graphics hardware is highly efficient for the interferometry task. The host processor of the heterogeneous computing system forms the data flow for the graphics accelerators, the number of which corresponds to the number of frequency channels; for the RadioAstron project there are seven such channels. Each accelerator computes the correlation matrix for all baselines for a single frequency channel. The initial data are converted to floating-point format and corrected for the corresponding delay function, and the entire correlation matrix is computed simultaneously. The correlation matrix is calculated using the sliding Fourier transform. Because the problem maps well onto the architecture of graphics accelerators, a single Kepler-platform processor achieves performance on this task corresponding to that of a four-node computing cluster of Intel platforms. The task scales successfully not only to a large number of graphics accelerators, but also to a large number of nodes with multiple accelerators.
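In an FX correlator the "F" step Fourier-transforms each station's data and the "X" step cross-multiplies the spectra of station pairs. The sketch below shows that order of operations for one baseline in plain Python. It is an illustration under simplifying assumptions: it uses a direct DFT over non-overlapping segments rather than the sliding Fourier transform named in the abstract, applies no delay correction, and all function names are hypothetical.

```python
import cmath
import math


def dft(x):
    """Direct discrete Fourier transform (O(n^2); adequate for a small illustration)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def fx_correlate(a, b, nfft):
    """Average cross-spectrum of two station streams over consecutive nfft-sample segments."""
    nseg = min(len(a), len(b)) // nfft
    if nseg == 0:
        raise ValueError("streams are shorter than one segment")
    acc = [0j] * nfft
    for s in range(nseg):
        fa = dft(a[s * nfft:(s + 1) * nfft])    # F step, station A
        fb = dft(b[s * nfft:(s + 1) * nfft])    # F step, station B
        for k in range(nfft):
            acc[k] += fa[k] * fb[k].conjugate() # X step
    return [v / nseg for v in acc]


# Two copies of the same tone; station B sees it one sample later.
N = 8
a = [math.cos(2 * math.pi * 2 * t / N) for t in range(4 * N)]
b = [math.cos(2 * math.pi * 2 * (t - 1) / N) for t in range(4 * N)]
spectrum = fx_correlate(a, b, N)
print(cmath.phase(spectrum[2]))  # close to pi/2: the one-sample delay at bin 2
```

The phase of the cross-spectrum in the tone's frequency bin encodes the relative delay between the two streams, which is the quantity an interferometric correlator measures.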
COTS Correlator Platform

NASA Astrophysics Data System (ADS)

Schaaf, Kjeld; Overeem, Ruud

2004-06-01

Moore’s law is best exploited by using consumer market hardware. In particular, the gaming industry pushes the limit of processor performance, thus reducing the cost per raw flop even faster than Moore’s law predicts. Next to the cost benefits of Commercial-Off-The-Shelf (COTS) processing resources, there is a rapidly growing experience pool in cluster-based processing. Typical Beowulf supercomputers built from clusters of PCs are well known. Multiple examples exist of specialised cluster computers based on more advanced server nodes or even gaming stations. All these cluster machines build upon the same knowledge about cluster software management, scheduling, middleware libraries and mathematical libraries. In this study, we have integrated COTS processing resources and cluster nodes into a very high performance processing platform suitable for streaming data applications, in particular to implement a correlator. The required processing power for the correlator in modern radio telescopes is in the range of the larger supercomputers, which motivates the usage of supercomputer technology. Raw processing power is provided by graphical processors and is combined with an Infiniband host bus adapter with integrated data stream handling logic. With this processing platform a scalable correlator can be built with continuously growing processing power at consumer market prices.
Scalability and performance of data-parallel pressure-based multigrid methods for viscous flows

DOE Office of Scientific and Technical Information (OSTI.GOV)

Blosch, E.L.; Shyy, W.

1996-05-01

A full-approximation storage multigrid method for solving the steady-state 2-d incompressible Navier-Stokes equations on staggered grids has been implemented in Fortran on the CM-5, using the array aliasing feature in CM-Fortran to avoid declaring fine-grid-sized arrays on all levels while still allowing a variable number of grid levels. Thus, the storage cost scales with the number of unknowns, allowing us to consider significantly larger problems than would otherwise be possible. Timings over a range of problem sizes and numbers of processors, up to 4096 x 4096 on 512 nodes, show that the smoothing procedure, a pressure-correction technique, is scalable and that the restriction and prolongation steps are nearly so. The performance obtained for the multigrid method is 333 Mflops out of the theoretical peak 4 Gflops on a 32-node CM-5. In comparison, a single-grid computation obtained 420 Mflops. The decrease is due to the inefficiency of the smoothing iterations on the coarse grid levels. W cycles cost much more and are much less efficient than V cycles, due to the increased contribution from the coarse grids. The convergence rate characteristics of the pressure-correction multigrid method are investigated in a Re = 5000 lid-driven cavity flow and a Re = 300 symmetric backward-facing step flow, using either a defect-correction scheme or a second-order upwind scheme. A heuristic technique relating the convergence tolerances for the coarse grids to the truncation error of the discretization has been found effective and robust. With second-order upwinding on all grid levels, a 5-level 320 x 80 step flow solution was obtained in 20 V cycles, which corresponds to a smoothing rate of 0.7, and required 25 s on a 32-node CM-5. Overall, the convergence rates obtained in the present work are comparable to the most competitive findings reported in the literature. 62 refs., 13 figs.
Scalability and Performance of Data-Parallel Pressure-Based Multigrid Methods for Viscous Flows

NASA Astrophysics Data System (ADS)

Blosch, Edwin L.; Shyy, Wei

1996-05-01

A full-approximation storage multigrid method for solving the steady-state 2-d incompressible Navier-Stokes equations on staggered grids has been implemented in Fortran on the CM-5, using the array aliasing feature in CM-Fortran to avoid declaring fine-grid-sized arrays on all levels while still allowing a variable number of grid levels. Thus, the storage cost scales with the number of unknowns, allowing us to consider significantly larger problems than would otherwise be possible. Timings over a range of problem sizes and numbers of processors, up to 4096 × 4096 on 512 nodes, show that the smoothing procedure, a pressure-correction technique, is scalable and that the restriction and prolongation steps are nearly so. The performance obtained for the multigrid method is 333 Mflops out of the theoretical peak 4 Gflops on a 32-node CM-5. In comparison, a single-grid computation obtained 420 Mflops. The decrease is due to the inefficiency of the smoothing iterations on the coarse grid levels. W cycles cost much more and are much less efficient than V cycles, due to the increased contribution from the coarse grids. The convergence rate characteristics of the pressure-correction multigrid method are investigated in a Re = 5000 lid-driven cavity flow and a Re = 300 symmetric backward-facing step flow, using either a defect-correction scheme or a second-order upwind scheme. A heuristic technique relating the convergence tolerances for the coarse grids to the truncation error of the discretization has been found effective and robust. With second-order upwinding on all grid levels, a 5-level 320 × 80 step flow solution was obtained in 20 V cycles, which corresponds to a smoothing rate of 0.7, and required 25 s on a 32-node CM-5. Overall, the convergence rates obtained in the present work are comparable to the most competitive findings reported in the literature.
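The remark that W cycles draw a larger contribution from the coarse grids can be illustrated by counting how often one cycle visits each level. The sketch below is a generic recursive cycle count, not the authors' CM-Fortran implementation; the function names and the coarsening factor of 4 (standard coarsening in two dimensions) are assumptions.

```python
def visits_per_level(levels, gamma):
    """Count how often one multigrid cycle visits each level (0 = finest).

    gamma = 1 gives a V cycle, gamma = 2 a W cycle.
    """
    counts = [0] * levels

    def cycle(level):
        counts[level] += 1
        if level == levels - 1:      # coarsest grid: solve and return
            return
        for _ in range(gamma):       # recurse gamma times on the next coarser grid
            cycle(level + 1)

    cycle(0)
    return counts


def relative_work(levels, gamma, coarsening=4):
    """Work per cycle in units of one finest-grid visit (4x fewer points per level)."""
    counts = visits_per_level(levels, gamma)
    return sum(c / coarsening ** k for k, c in enumerate(counts))


v_visits = visits_per_level(5, 1)   # [1, 1, 1, 1, 1]
w_visits = visits_per_level(5, 2)   # [1, 2, 4, 8, 16]
```

In pure operation counts the 5-level W cycle costs only about 1.45 times the V cycle under these assumptions, but it makes 31 level visits instead of 5. On a data-parallel machine where every visit carries a roughly fixed overhead regardless of grid size, that is one plausible reading of why the coarse levels weigh so heavily.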
Opinion formation driven by PageRank node influence on directed networks

NASA Astrophysics Data System (ADS)

Eom, Young-Ho; Shepelyansky, Dima L.

2015-10-01

We study a two-state opinion formation model driven by PageRank node influence and report an extensive numerical study on how PageRank affects collective opinion formation in large-scale empirical directed networks. In our model the opinion of a node can be updated by the sum of its neighbor nodes' opinions weighted by the node influence of the neighbor nodes at each step. We consider PageRank probability and its sublinear power as node influence measures and investigate evolution of opinion under various conditions. First, we observe that all networks reach steady state opinion after a certain relaxation time. This time scale is decreasing with the heterogeneity of node influence in the networks. Second, we find that our model shows consensus and non-consensus behavior in steady state depending on types of networks: Web graph, citation network of physics articles, and LiveJournal social network show non-consensus behavior while Wikipedia article network shows consensus behavior. Third, we find that a more heterogeneous influence distribution leads to a more uniform opinion state in the cases of Web graph, Wikipedia, and LiveJournal. However, the opposite behavior is observed in the citation network. Finally we identify that a small number of influential nodes can impose their own opinion on a significant fraction of other nodes in all considered networks. Our study shows that the effects of heterogeneity of node influence on opinion formation can be significant and suggests further investigations on the interplay between node influence and collective opinion in networks.
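The update rule described in this abstract can be sketched in a few lines. The code below is an illustrative toy, not the authors' model code: the four-node graph, the damping factor of 0.85, the tie-breaking rule, and the choice to treat nodes linked in either direction as neighbors are all assumptions.

```python
def pagerank(out_links, damping=0.85, iterations=100):
    """Power-iteration PageRank for a dict {node: [nodes it links to]}."""
    nodes = list(out_links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        new = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            targets = out_links[v] or nodes      # dangling node: spread evenly
            share = damping * rank[v] / len(targets)
            for w in targets:
                new[w] += share
        rank = new
    return rank


def update_opinion(node, opinions, neighbors, influence):
    """Adopt the sign of the influence-weighted sum of neighbor opinions (+1 or -1)."""
    total = sum(influence[w] * opinions[w] for w in neighbors[node])
    if total == 0:
        return opinions[node]        # tie: keep the current opinion (assumption)
    return 1 if total > 0 else -1


# Hypothetical directed graph and initial two-state opinions.
out_links = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["a"]}
neighbors = {v: set(ws) for v, ws in out_links.items()}
for v, ws in out_links.items():
    for w in ws:
        neighbors[w].add(v)          # assumption: links count in both directions

rank = pagerank(out_links)
opinions = {"a": 1, "b": 1, "c": -1, "d": 1}
new_b = update_opinion("b", opinions, neighbors, rank)
```

Here node `b` has a single, highly ranked neighbor `c`, so it adopts that neighbor's opinion. Replacing `rank` with a sublinear power of the ranks (for example `r ** 0.5`) would flatten the influence distribution, which is the kind of variation the abstract mentions.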

Two-dimensional optoelectronic interconnect-processor and its operational bit error rate

NASA Astrophysics Data System (ADS)

Liu, J. Jiang; Gollsneider, Brian; Chang, Wayne H.; Carhart, Gary W.; Vorontsov, Mikhail A.; Simonis, George J.; Shoop, Barry L.

2004-10-01

A two-dimensional (2-D) multi-channel 8x8 optical interconnect and processor system was designed and developed using complementary metal-oxide-semiconductor (CMOS) driven 850-nm vertical-cavity surface-emitting laser (VCSEL) arrays and photodetector (PD) arrays with corresponding wavelengths. We performed operation and bit-error-rate (BER) analysis on these free-space integrated 8x8 VCSEL optical interconnects driven by silicon-on-sapphire (SOS) circuits. A pseudo-random bit stream (PRBS) data sequence was used in operation of the interconnects. Eye diagrams were measured from individual channels and analyzed using a digital oscilloscope at data rates from 155 Mb/s to 1.5 Gb/s. Using a statistical model of Gaussian distribution for the random noise in the transmission, we developed a method to compute the BER instantaneously with the digital eye-diagrams. Direct measurements on these interconnects were also taken on a standard BER tester for verification. We found that the results of the two methods were of the same order and agreed to within 50%. The integrated interconnects were investigated in an optoelectronic processing architecture of a digital halftoning image processor. Error diffusion networks implemented by the inherently parallel nature of photonics promise to provide high quality digital halftoned images.
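The abstract does not give the formula used to turn an eye diagram into a BER. One standard Gaussian-noise estimate, shown here only as a sketch of the general idea, is the Q-factor relation BER = 0.5 erfc(Q / sqrt(2)). The function names and the example level statistics are hypothetical.

```python
import math


def q_factor(mean_one, sigma_one, mean_zero, sigma_zero):
    """Q factor from the level means and noise standard deviations of an eye diagram."""
    return (mean_one - mean_zero) / (sigma_one + sigma_zero)


def ber_from_q(q):
    """Gaussian-noise bit error rate estimate: BER = 0.5 * erfc(Q / sqrt(2))."""
    return 0.5 * math.erfc(q / math.sqrt(2.0))


# Hypothetical eye statistics in arbitrary voltage units (not measured values).
q = q_factor(mean_one=1.0, sigma_one=0.08, mean_zero=0.0, sigma_zero=0.06)
estimated_ber = ber_from_q(q)
```

Because the estimate needs only the means and spreads of the two logic levels, it can be evaluated from an oscilloscope histogram without waiting for actual bit errors, which fits the "instantaneous" computation the abstract describes.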
A Conformance Test Suite for Arden Syntax Compilers and Interpreters.

PubMed

Wolf, Klaus-Hendrik; Klimek, Mike

2016-01-01

The Arden Syntax for Medical Logic Modules is a standardized and well-established programming language to represent medical knowledge. No public test suite exists to test the compliance level of existing compilers and interpreters. This paper presents the research to transform the specification into a set of unit tests, represented in JUnit. It further reports on the use of the test suite to test four different Arden Syntax processors. The presented and compared results reveal the conformance status of the tested processors. How test-driven development of Arden Syntax processors can help increase compliance with the standard is described with two examples. Finally, some considerations on how an open source test suite can improve the development and distribution of the Arden Syntax are presented.
A New Path-Constrained Rendezvous Planning Approach for Large-Scale Event-Driven Wireless Sensor Networks

PubMed Central

Zhang, Gongxuan; Wang, Yongli; Wang, Tianshu

2018-01-01

We study the problem of employing a mobile-sink in large-scale Event-Driven Wireless Sensor Networks (EWSNs) for the purpose of data harvesting from sensor-nodes. Generally, this approach mitigates the main weakness of WSNs, namely energy consumption in battery-driven sensor-nodes. The main motivation of our work is to address challenges which are related to a network’s topology by adopting a mobile-sink that moves along a predefined trajectory in the environment. Since, in this fashion, it is not possible to gather data from sensor-nodes individually, we adopt the approach of defining some of the sensor-nodes as Rendezvous Points (RPs) in the network. We argue that RP-planning in this case is a tradeoff between minimizing the number of RPs and decreasing the number of hops for a sensor-node that needs to transmit data to the related RP, which leads to minimizing average energy consumption in the network. We address the problem by formulating the challenges and expectations as a Mixed Integer Linear Programming (MILP) problem. Then, after proving the NP-hardness of the problem, we propose three effective and distributed heuristics for RP-planning, identifying sojourn locations, and constructing routing trees. Finally, experimental results prove the effectiveness of our approach. PMID:29734718
A New Path-Constrained Rendezvous Planning Approach for Large-Scale Event-Driven Wireless Sensor Networks.

PubMed

Vajdi, Ahmadreza; Zhang, Gongxuan; Zhou, Junlong; Wei, Tongquan; Wang, Yongli; Wang, Tianshu

2018-05-04

We study the problem of employing a mobile-sink in large-scale Event-Driven Wireless Sensor Networks (EWSNs) for the purpose of data harvesting from sensor-nodes. Generally, this approach mitigates the main weakness of WSNs, namely energy consumption in battery-driven sensor-nodes. The main motivation of our work is to address challenges which are related to a network’s topology by adopting a mobile-sink that moves along a predefined trajectory in the environment. Since, in this fashion, it is not possible to gather data from sensor-nodes individually, we adopt the approach of defining some of the sensor-nodes as Rendezvous Points (RPs) in the network. We argue that RP-planning in this case is a tradeoff between minimizing the number of RPs and decreasing the number of hops for a sensor-node that needs to transmit data to the related RP, which leads to minimizing average energy consumption in the network. We address the problem by formulating the challenges and expectations as a Mixed Integer Linear Programming (MILP) problem. Then, after proving the NP-hardness of the problem, we propose three effective and distributed heuristics for RP-planning, identifying sojourn locations, and constructing routing trees. Finally, experimental results prove the effectiveness of our approach.
Flow-driven triboelectric generator for directly powering a wireless sensor node.

PubMed

Wang, Shuhua; Mu, Xiaojing; Yang, Ya; Sun, Chengliang; Gu, Alex Yuandong; Wang, Zhong Lin

2015-01-14

A triboelectric generator (TEG) for scavenging flow-driven mechanical energy to directly power a wireless sensor node is demonstrated for the first time. The output performances of TEGs with different dimensions are systematically investigated, indicating that a maximum output power of about 3.7 mW for one TEG can be achieved under an external load of 3 MΩ. © 2014 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.
Modeling Dynamic Evolution of Online Friendship Network

NASA Astrophysics Data System (ADS)

Wu, Lian-Ren; Yan, Qiang

2012-10-01

In this paper, we study the dynamic evolution of friendship network in SNS (Social Networking Site). Our analysis suggests that an individual joining a community depends not only on the number of friends he or she has within the community, but also on the friendship network generated by those friends. In addition, we propose a model which is based on two processes: first, connecting nearest neighbors; second, strength driven attachment mechanism. The model reflects two facts: first, in the social network it is a universal phenomenon that two nodes are connected when they have at least one common neighbor; second, new nodes connect more likely to nodes which have larger weights and interactions, a phenomenon called strength driven attachment (also called weight driven attachment). From the simulation results, we find that degree distribution P(k), strength distribution P(s), and degree-strength correlation are all consistent with empirical data.
Applications Performance on NAS Intel Paragon XP/S-15

NASA Technical Reports Server (NTRS)

Saini, Subhash; Simon, Horst D.; Copper, D. M. (Technical Monitor)

1994-01-01

The Numerical Aerodynamic Simulation (NAS) Systems Division received an Intel Touchstone Sigma prototype model Paragon XP/S-15 in February, 1993. The i860 XP microprocessor with an integrated floating point unit and operating in dual-instruction mode gives peak performance of 75 million floating point operations per second (MFLOPS) for 64 bit floating point arithmetic. It is used in the Paragon XP/S-15 which has been installed at NAS, NASA Ames Research Center. The NAS Paragon has 208 nodes and its peak performance is 15.6 GFLOPS. Here, we will report on early experience using the Paragon XP/S-15. We have tested its performance using both kernels and applications of interest to NAS. We have measured the performance of BLAS 1, 2 and 3 both assembly-coded and Fortran coded on NAS Paragon XP/S-15. Furthermore, we have investigated the performance of a single node one-dimensional FFT, a distributed two-dimensional FFT and a distributed three-dimensional FFT. Finally, we measured the performance of NAS Parallel Benchmarks (NPB) on the Paragon and compare it with the performance obtained on other highly parallel machines, such as CM-5, CRAY T3D, IBM SP1, etc. In particular, we investigated the following issues, which can strongly affect the performance of the Paragon: a. Impact of the operating system: Intel currently uses as a default an operating system OSF/1 AD from the Open Software Foundation. The paging of Open Software Foundation (OSF) server at 22 MB to make more memory available for the application degrades the performance. We found that when the limit of 26 MB per node out of 32 MB available is reached, the application is paged out of main memory using virtual memory. When the application starts paging, the performance is considerably reduced. We found that dynamic memory allocation can help applications performance under certain circumstances. b. Impact of data cache on the i860/XP: We measured the performance of the BLAS both assembly coded and Fortran coded. We found that the measured performance of assembly-coded BLAS is much less than what memory bandwidth limitation would predict. The influence of data cache on different sizes of vectors is also investigated using one-dimensional FFTs. c. Impact of processor layout: There are several different ways processors can be laid out within the two-dimensional grid of processors on the Paragon. We have used the FFT example to investigate performance differences based on processors layout.
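The system figures quoted in this abstract are mutually consistent, as a short check shows. The snippet below is only a worked arithmetic sketch; the variable names are made up, and it reads the per-node figures as 75 MFLOPS peak and a 26 MB paging limit out of 32 MB.

```python
nodes = 208
peak_per_node_mflops = 75          # i860 XP peak for 64-bit arithmetic

# Aggregate peak: 208 nodes x 75 MFLOPS, converted to GFLOPS.
system_peak_gflops = nodes * peak_per_node_mflops / 1000.0

# Share of each node's 32 MB that an application can use before paging starts.
usable_memory_fraction = 26 / 32
```

The product reproduces the stated 15.6 GFLOPS, and the memory figures imply that roughly four fifths of each node's memory is available to the application before performance degrades.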
Optimally Distributed Kalman Filtering with Data-Driven Communication

PubMed Central

Dormann, Katharina

2018-01-01

For multisensor data fusion, distributed state estimation techniques that enable a local processing of sensor data are the means of choice in order to minimize storage and communication costs. In particular, a distributed implementation of the optimal Kalman filter has recently been developed. A significant disadvantage of this algorithm is that the fusion center needs access to each node so as to compute a consistent state estimate, which requires full communication each time an estimate is requested. In this article, different extensions of the optimally distributed Kalman filter are proposed that employ data-driven transmission schemes in order to reduce communication expenses. As a first relaxation of the full-rate communication scheme, it can be shown that each node only has to transmit every second time step without endangering consistency of the fusion result. Also, two data-driven algorithms are introduced that even allow for lower transmission rates, and bounds are derived to guarantee consistent fusion results. Simulations demonstrate that the data-driven distributed filtering schemes can outperform a centralized Kalman filter that requires each measurement to be sent to the center node. PMID:29596392
Joint Experimentation on Scalable Parallel Processors (JESPP)

DTIC Science & Technology

2006-04-01

made use of local embedded relational databases, implemented using sqlite on each node of an SPP to execute queries and return results via an ad hoc ... Approved for public release; distribution unlimited. ... Experimentation Directorate (J9) required expansion of its joint semi-automated forces (JSAF) code capabilities; including number of entities, behavior complexity
Communication overhead on the Intel iPSC-860 hypercube

NASA Technical Reports Server (NTRS)

Bokhari, Shahid H.

1990-01-01

Experiments were conducted on the Intel iPSC-860 hypercube in order to evaluate the overhead of interprocessor communication. It is demonstrated that: (1) contrary to popular belief, the distance between two communicating processors has a significant impact on communication time, (2) edge contention can increase communication time by a factor of more than 7, and (3) node contention has no measurable impact.
MOBS - A modular on-board switching system

NASA Astrophysics Data System (ADS)

Berner, W.; Grassmann, W.; Piontek, M.

The authors describe a multibeam satellite system that is designed for business services and for communications at a high bit rate. The repeater is regenerative with a modular onboard switching system. It acts not only as baseband switch but also as the central node of the network, performing network control and protocol evaluation. The hardware is based on a modular bus/memory architecture with associated processors.
Free Mesh Method: fundamental conception, algorithms and accuracy study

PubMed Central

YAGAWA, Genki

2011-01-01

The finite element method (FEM) has been commonly employed in a variety of fields as a computer simulation method to solve such problems as solid, fluid, electro-magnetic phenomena and so on. However, creation of a quality mesh for the problem domain is a prerequisite when using FEM, which becomes a major part of the cost of a simulation. It is natural that the concept of meshless method has evolved. The free mesh method (FMM) is among the typical meshless methods intended for particle-like finite element analysis of problems that are difficult to handle using global mesh generation, especially on parallel processors. FMM is an efficient node-based finite element method that employs a local mesh generation technique and a node-by-node algorithm for the finite element calculations. In this paper, FMM and its variation are reviewed focusing on their fundamental conception, algorithms and accuracy. PMID:21558752
Hybrid Parallelism for Volume Rendering on Large-, Multi-, and Many-Core Systems

DOE Office of Scientific and Technical Information (OSTI.GOV)

Howison, Mark; Bethel, E. Wes; Childs, Hank

2012-01-01

With the computing industry trending towards multi- and many-core processors, we study how a standard visualization algorithm, ray-casting volume rendering, can benefit from a hybrid parallelism approach. Hybrid parallelism provides the best of both worlds: using distributed-memory parallelism across a large number of nodes increases available FLOPs and memory, while exploiting shared-memory parallelism among the cores within each node ensures that each node performs its portion of the larger calculation as efficiently as possible. We demonstrate results from weak and strong scaling studies, at levels of concurrency ranging up to 216,000, and with datasets as large as 12.2 trillion cells. The greatest benefit from hybrid parallelism lies in the communication portion of the algorithm, the dominant cost at higher levels of concurrency. We show that reducing the number of participants with a hybrid approach significantly improves performance.
A Parallel Multigrid Solver for Viscous Flows on Anisotropic Structured Grids

NASA Technical Reports Server (NTRS)

Prieto, Manuel; Montero, Ruben S.; Llorente, Ignacio M.; Bushnell, Dennis M. (Technical Monitor)

2001-01-01

This paper presents an efficient parallel multigrid solver for speeding up the computation of a 3-D model that treats the flow of a viscous fluid over a flat plate. The main interest of this simulation lies in exhibiting some basic difficulties that prevent optimal multigrid efficiencies from being achieved. As the computing platform, we have used Coral, a Beowulf-class system based on Intel Pentium processors and equipped with GigaNet cLAN and switched Fast Ethernet networks. Our study not only examines the scalability of the solver but also includes a performance evaluation of Coral where the investigated solver has been used to compare several of its design choices, namely, the interconnection network (GigaNet versus switched Fast-Ethernet) and the node configuration (dual nodes versus single nodes). As a reference, the performance results have been compared with those obtained with the NAS-MG benchmark.
Accuracy-energy configurable sensor processor and IoT device for long-term activity monitoring in rare-event sensing applications.

PubMed

Park, Daejin; Cho, Jeonghun

2014-01-01

A specially designed sensor processor used as the main processor in an IoT (Internet-of-Things) device for rare-event sensing applications is proposed. The IoT device including the proposed sensor processor performs event-driven sensor data processing based on an accuracy-energy configurable event-quantization at the architectural level. The received sensor signal is converted into a sequence of atomic events, which are extracted by the signal-to-atomic-event generator (AEG). Using an event signal processing unit (EPU) as an accelerator, the extracted atomic events are analyzed to build the final event. Instead of transmitting the sampled raw data via the internet, the proposed method delays the communication with a host system until a semantic pattern of the signal is identified as a final event. The proposed processor is implemented on a single chip, which is tightly coupled at the bus connection level with a microcontroller using a 0.18 μm CMOS embedded-flash process. For experimental results, we evaluated the proposed sensor processor by using an IR- (infrared radio-) based signal reflection and sensor signal acquisition system. We successfully demonstrated that the expected power consumption is in the range of 20% to 50% of the baseline result when a 10% accuracy error is allowed.
A general natural-language text processor for clinical radiology.

PubMed Central

Friedman, C; Alderson, P O; Austin, J H; Cimino, J J; Johnson, S B

1994-01-01

OBJECTIVE: Development of a general natural-language processor that identifies clinical information in narrative reports and maps that information into a structured representation containing clinical terms. DESIGN: The natural-language processor provides three phases of processing, all of which are driven by different knowledge sources. The first phase performs the parsing. It identifies the structure of the text through use of a grammar that defines semantic patterns and a target form. The second phase, regularization, standardizes the terms in the initial target structure via a compositional mapping of multi-word phrases. The third phase, encoding, maps the terms to a controlled vocabulary. Radiology is the test domain for the processor and the target structure is a formal model for representing clinical information in that domain. MEASUREMENTS: The impression sections of 230 radiology reports were encoded by the processor. Results of an automated query of the resultant database for the occurrences of four diseases were compared with the analysis of a panel of three physicians to determine recall and precision. RESULTS: Without training specific to the four diseases, recall and precision of the system (combined effect of the processor and query generator) were 70% and 87%. Training of the query component increased recall to 85% without changing precision. PMID:7719797
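For readers unfamiliar with the two evaluation measures, the sketch below shows how recall and precision are computed from counts of correct, spurious, and missed retrievals. The counts in the example are hypothetical and chosen only to illustrate the definitions; they are not the study's data.

```python
def recall_precision(true_positives, false_positives, false_negatives):
    """Return (recall, precision) for a retrieval task.

    recall    = share of the reference cases that the system found
    precision = share of the system's answers that were correct
    """
    recall = true_positives / (true_positives + false_negatives)
    precision = true_positives / (true_positives + false_positives)
    return recall, precision


# Hypothetical counts: 7 reports correctly retrieved, 1 retrieved wrongly, 3 missed.
recall, precision = recall_precision(true_positives=7, false_positives=1, false_negatives=3)
```

In this setting the reference cases would be those identified by the physician panel, so a missed report lowers recall while a spurious retrieval lowers precision.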
Endobronchial ultrasound-guided transbronchial needle aspiration for lung cancer staging: early experience in Brazil

PubMed Central

Figueiredo, Viviane Rossi; Cardoso, Paulo Francisco Guerreiro; Jacomelli, Márcia; Demarzo, Sérgio Eduardo; Palomino, Addy Lidvina Mejia; Rodrigues, Ascédio José; Terra, Ricardo Mingarini; Pego-Fernandes, Paulo Manoel; Carvalho, Carlos Roberto Ribeiro

2015-01-01

Objective: Endobronchial ultrasound-guided transbronchial needle aspiration (EBUS-TBNA) is a minimally invasive, safe and accurate method for collecting samples from mediastinal and hilar lymph nodes. This study focused on the initial results obtained with EBUS-TBNA for lung cancer and lymph node staging at three teaching hospitals in Brazil. Methods: This was a retrospective analysis of patients diagnosed with lung cancer and submitted to EBUS-TBNA for mediastinal lymph node staging. The EBUS-TBNA procedures, which involved the use of an EBUS scope, an ultrasound processor, and a compatible, disposable 22 G needle, were performed while the patients were under general anesthesia. Results: Between January of 2011 and January of 2014, 149 patients underwent EBUS-TBNA for lymph node staging. The mean age was 66 ± 12 years, and 58% were male. A total of 407 lymph nodes were sampled by EBUS-TBNA. The most common types of lung neoplasm were adenocarcinoma (in 67%) and squamous cell carcinoma (in 24%). For lung cancer staging, EBUS-TBNA was found to have a sensitivity of 96%, a specificity of 100%, and a negative predictive value of 85%. Conclusions: We found EBUS-TBNA to be a safe and accurate method for lymph node staging in lung cancer patients. PMID:25750671
Endobronchial ultrasound-guided transbronchial needle aspiration for lung cancer staging: early experience in Brazil.

PubMed

Figueiredo, Viviane Rossi; Cardoso, Paulo Francisco Guerreiro; Jacomelli, Márcia; Demarzo, Sérgio Eduardo; Palomino, Addy Lidvina Mejia; Rodrigues, Ascédio José; Terra, Ricardo Mingarini; Pego-Fernandes, Paulo Manoel; Carvalho, Carlos Roberto Ribeiro

2015-01-01

Endobronchial ultrasound-guided transbronchial needle aspiration (EBUS-TBNA) is a minimally invasive, safe and accurate method for collecting samples from mediastinal and hilar lymph nodes. This study focused on the initial results obtained with EBUS-TBNA for lung cancer and lymph node staging at three teaching hospitals in Brazil. This was a retrospective analysis of patients diagnosed with lung cancer and submitted to EBUS-TBNA for mediastinal lymph node staging. The EBUS-TBNA procedures, which involved the use of an EBUS scope, an ultrasound processor, and a compatible, disposable 22 G needle, were performed while the patients were under general anesthesia. Between January of 2011 and January of 2014, 149 patients underwent EBUS-TBNA for lymph node staging. The mean age was 66 ± 12 years, and 58% were male. A total of 407 lymph nodes were sampled by EBUS-TBNA. The most common types of lung neoplasm were adenocarcinoma (in 67%) and squamous cell carcinoma (in 24%). For lung cancer staging, EBUS-TBNA was found to have a sensitivity of 96%, a specificity of 100%, and a negative predictive value of 85%. We found EBUS-TBNA to be a safe and accurate method for lymph node staging in lung cancer patients.
Performance of an MPI-only semiconductor device simulator on a quad socket/quad core InfiniBand platform.

DOE Office of Scientific and Technical Information (OSTI.GOV)

Shadid, John Nicolas; Lin, Paul Tinphone

2009-01-01

This preliminary study considers the scaling and performance of a finite element (FE) semiconductor device simulator on a capacity cluster with 272 compute nodes based on a homogeneous multicore node architecture utilizing 16 cores. The inter-node communication backbone for this Tri-Lab Linux Capacity Cluster (TLCC) machine is comprised of an InfiniBand interconnect. The nonuniform memory access (NUMA) nodes consist of 2.2 GHz quad socket/quad core AMD Opteron processors. The performance results for this study are obtained with a FE semiconductor device simulation code (Charon) that is based on a fully-coupled Newton-Krylov solver with domain decomposition and multilevel preconditioners. Scaling and multicore performance results are presented for large-scale problems of 100+ million unknowns on up to 4096 cores. A parallel scaling comparison is also presented with the Cray XT3/4 Red Storm capability platform. The results indicate that an MPI-only programming model for utilizing the multicore nodes is reasonably efficient on all 16 cores per compute node. However, the results also indicated that the multilevel preconditioner, which is critical for large-scale capability type simulations, scales better on the Red Storm machine than the TLCC machine.
Wireless and Powerless Sensing Node System Developed for Monitoring Motors.

PubMed

Lee, Dasheng

2008-08-27

Reliability and maintainability of tooling systems can be improved through condition monitoring of motors. However, it is difficult to deploy sensor nodes due to the harsh environment of industrial plants. Sensor cables are easily damaged, which renders the monitoring system deployed to assure the machine's reliability itself unreliable. A wireless and powerless sensing node integrated with a MEMS (Micro Electro-Mechanical System) sensor, a signal processor, a communication module, and a self-powered generator was developed in this study for implementation of an easily mounted network sensor for monitoring motors. A specially designed communication module transmits a sequence of electromagnetic (EM) pulses in response to the sensor signals. The EM pulses can penetrate the machine's metal case and deliver signals from the sensor inside the motor to the external data acquisition center. By using induction power, which is generated by the motor's shaft rotation, the sensor node is self-sustaining; therefore, no power line is required. A monitoring system, equipped with novel sensing nodes, was constructed to test its performance. The test results illustrate that the novel sensing node developed in this study can effectively enhance the reliability of the motor monitoring system, and it is expected to be a valuable technology, which will be available to the plant for implementation in a reliable motor management program.

Wireless and Powerless Sensing Node System Developed for Monitoring Motors

PubMed Central

Lee, Dasheng

2008-01-01

Reliability and maintainability of tooling systems can be improved through condition monitoring of motors. However, it is difficult to deploy sensor nodes due to the harsh environment of industrial plants. Sensor cables are easily damaged, which renders the monitoring system deployed to assure the machine's reliability itself unreliable. A wireless and powerless sensing node integrated with a MEMS (Micro Electro-Mechanical System) sensor, a signal processor, a communication module, and a self-powered generator was developed in this study for implementation of an easily mounted network sensor for monitoring motors. A specially designed communication module transmits a sequence of electromagnetic (EM) pulses in response to the sensor signals. The EM pulses can penetrate the machine's metal case and deliver signals from the sensor inside the motor to the external data acquisition center. By using induction power, which is generated by the motor's shaft rotation, the sensor node is self-sustaining; therefore, no power line is required. A monitoring system, equipped with novel sensing nodes, was constructed to test its performance. The test results illustrate that the novel sensing node developed in this study can effectively enhance the reliability of the motor monitoring system, and it is expected to be a valuable technology, which will be available to the plant for implementation in a reliable motor management program. PMID:27873798
Petri net model for analysis of concurrently processed complex algorithms

NASA Technical Reports Server (NTRS)

Stoughton, John W.; Mielke, Roland R.

1986-01-01

This paper presents a Petri-net model suitable for analyzing the concurrent processing of computationally complex algorithms. The decomposed operations are to be processed in a multiple processor, data driven architecture. Of particular interest is the application of the model to both the description of the data/control flow of a particular algorithm, and to the general specification of the data driven architecture. A candidate architecture is also presented.
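The data-driven behavior that such a Petri-net model captures can be illustrated with the standard firing rule: a transition may fire only when every one of its input places holds a token. The following is a minimal sketch of that rule, not the model from the paper; the place names (`a_ready`, `b_ready`, `sum_ready`) and the `add` transition are hypothetical.

```python
from typing import Dict, List, Tuple

# A transition is (input places, output places). All names are hypothetical.
Transition = Tuple[List[str], List[str]]


def enabled(marking: Dict[str, int], t: Transition) -> bool:
    """A transition is enabled when every input place holds a token."""
    inputs, _ = t
    return all(marking.get(p, 0) >= 1 for p in inputs)


def fire(marking: Dict[str, int], t: Transition) -> Dict[str, int]:
    """Consume one token per input place and produce one per output place."""
    if not enabled(marking, t):
        raise ValueError("transition is not enabled")
    inputs, outputs = t
    new = dict(marking)
    for p in inputs:
        new[p] -= 1
    for p in outputs:
        new[p] = new.get(p, 0) + 1
    return new


# Hypothetical two-operand operation: it can run only once both operands arrive.
add = (["a_ready", "b_ready"], ["sum_ready"])
m0 = {"a_ready": 1, "b_ready": 0}   # one operand missing: not enabled
m1 = {"a_ready": 1, "b_ready": 1}   # both operands present: enabled
```

In this sketch, tokens stand for available data items, so the order of execution is determined by data arrival rather than by a program counter.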
Performance Evaluation of NWChem Ab-Initio Molecular Dynamics (AIMD) Simulations on the Intel® Xeon Phi™ Processor

DOE Office of Scientific and Technical Information (OSTI.GOV)

Bylaska, Eric J.; Jacquelin, Mathias; De Jong, Wibe A.

2017-10-20

Ab-initio Molecular Dynamics (AIMD) methods are an important class of algorithms, as they enable scientists to understand the chemistry and dynamics of molecular and condensed phase systems while retaining a first-principles-based description of their interactions. Many-core architectures such as the Intel® Xeon Phi™ processor are an interesting and promising target for these algorithms, as they can provide the computational power that is needed to solve interesting problems in chemistry. In this paper, we describe the efforts of refactoring the existing AIMD plane-wave method of NWChem from an MPI-only implementation to a scalable, hybrid code that employs MPI and OpenMP to exploit the capabilities of current and future many-core architectures. We describe the optimizations required to get close to optimal performance for the multiplication of the tall-and-skinny matrices that form the core of the computational algorithm. We present strong scaling results on the complete AIMD simulation for a test case that simulates 256 water molecules and that strong-scales well on a cluster of 1024 nodes of Intel Xeon Phi processors. We compare the performance obtained with a cluster of dual-socket Intel® Xeon® E5–2698v3 processors.
Shared Memory Parallelism for 3D Cartesian Discrete Ordinates Solver

NASA Astrophysics Data System (ADS)

Moustafa, Salli; Dutka-Malen, Ivan; Plagne, Laurent; Ponçot, Angélique; Ramet, Pierre

2014-06-01

This paper describes the design and the performance of DOMINO, a 3D Cartesian SN solver that implements two nested levels of parallelism (multicore+SIMD) on shared memory computation nodes. DOMINO is written in C++, a multi-paradigm programming language that enables the use of powerful and generic parallel programming tools such as Intel TBB and Eigen. These two libraries allow us to combine multi-thread parallelism with vector operations in an efficient and yet portable way. As a result, DOMINO can exploit the full power of modern multi-core processors and is able to tackle very large simulations, which usually require large HPC clusters, using a single computing node. For example, DOMINO solves a 3D full core PWR eigenvalue problem involving 26 energy groups, 288 angular directions (S16), 46 × 10^6 spatial cells and 1 × 10^12 DoFs within 11 hours on a single 32-core SMP node. This represents a sustained performance of 235 GFlops and 40.74% of the SMP node peak performance for the DOMINO sweep implementation. The very high Flops/Watt ratio of DOMINO makes it a very interesting building block for a future many-nodes nuclear simulation tool.
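As a quick consistency check on the figures quoted above, the node peak implied by the sustained rate can be worked out directly. This is a back-of-the-envelope sketch only: it reads the quoted fraction as 40.74%, and the derived peak values are inferences from the abstract's numbers, not figures reported by the authors.

```python
# Figures quoted in the abstract (fraction read as 40.74%).
sustained_gflops = 235.0
fraction_of_peak = 0.4074
cores = 32

# Derived values (inferred here, not stated in the abstract).
implied_peak_gflops = sustained_gflops / fraction_of_peak
implied_peak_per_core = implied_peak_gflops / cores

print(f"implied node peak: {implied_peak_gflops:.1f} GFlops")
print(f"implied peak per core: {implied_peak_per_core:.1f} GFlops")
```

The implied node peak comes out a little under 580 GFlops, or roughly 18 GFlops per core.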
Optimistic barrier synchronization

NASA Technical Reports Server (NTRS)

Nicol, David M.

1992-01-01

Barrier synchronization is a fundamental operation in parallel computation. In many contexts, at the point a processor enters a barrier it knows that it has already processed all the work required of it prior to synchronization. The alternative case, when a processor cannot enter a barrier with the assurance that it has already performed all the necessary pre-synchronization computation, is treated. The problem arises when the number of pre-synchronization messages to be received by a processor is unknown, for example, in a parallel discrete simulation or any other computation that is largely driven by an unpredictable exchange of messages. We describe an optimistic O(log^2 P) barrier algorithm for such problems, study its performance on a large-scale parallel system, and consider extensions to general associative reductions as well as associative parallel prefix computations.
Software-Reconfigurable Processors for Spacecraft

NASA Technical Reports Server (NTRS)

Farrington, Allen; Gray, Andrew; Bell, Bryan; Stanton, Valerie; Chong, Yong; Peters, Kenneth; Lee, Clement; Srinivasan, Jeffrey

2005-01-01

A report presents an overview of an architecture for a software-reconfigurable network data processor for a spacecraft engaged in scientific exploration. When executed on suitable electronic hardware, the software performs the functions of a physical layer (in effect, acts as a software radio in that it performs modulation, demodulation, pulse-shaping, error correction, coding, and decoding), a data-link layer, a network layer, a transport layer, and application-layer processing of scientific data. The software-reconfigurable network processor is undergoing development to enable rapid prototyping and rapid implementation of communication, navigation, and scientific signal-processing functions; to provide a long-lived communication infrastructure; and to provide greatly improved scientific-instrumentation and scientific-data-processing functions by enabling science-driven in-flight reconfiguration of computing resources devoted to these functions. This development is an extension of terrestrial radio and network developments (e.g., in the cellular-telephone industry) implemented in software running on such hardware as field-programmable gate arrays, digital signal processors, traditional digital circuits, and mixed-signal application-specific integrated circuits (ASICs).
Performance analysis of the GR712RC dual-core LEON3FT SPARC V8 processor in an asymmetric multi-processing environment

NASA Astrophysics Data System (ADS)

Giusi, Giovanni; Liu, Scige J.; Galli, Emanuele; Di Giorgio, Anna M.; Farina, Maria; Vertolli, Nello; Di Lellis, Andrea M.

2016-07-01

In this paper we present the results of a series of performance tests carried out on a prototype board mounting the Cobham Gaisler GR712RC Dual Core LEON3FT processor. The aim was to characterize the performance of the dual core processor when used for executing a highly demanding lossless compression task, acting on data segments continuously copied from the static memory to the processor RAM. The selection of the compression activity to evaluate the performance was driven by the possibility of a comparison with previously executed tests on the Cobham/Aeroflex Gaisler UT699 LEON3FT SPARC™ V8. The results of the test activity have shown an improvement by a factor of 1.6 with respect to the previous tests, which can easily be increased by adopting a faster onboard clock, and provided indications on the best size of the data chunks to be used in the compression activity.
Neural node network and model, and method of teaching same

DOEpatents

Parlos, A.G.; Atiya, A.F.; Fernandez, B.; Tsai, W.K.; Chong, K.T.

1995-12-26

The present invention is a fully connected feed forward network that includes at least one hidden layer. The hidden layer includes nodes in which the output of the node is fed back to that node as an input with a unit delay produced by a delay device occurring in the feedback path (local feedback). Each node within each layer also receives a delayed output (crosstalk) produced by a delay unit from all the other nodes within the same layer. The node performs a transfer function operation based on the inputs from the previous layer and the delayed outputs. The network can be implemented as analog or digital or within a general purpose processor. Two teaching methods can be used: (1) back propagation of weight calculation that includes the local feedback and the crosstalk or (2) more preferably a feed forward gradient descent which immediately follows the output computations and which also includes the local feedback and the crosstalk. Subsequent to the gradient propagation, the weights can be normalized, thereby preventing convergence to a local optimum. Education of the network can be incremental both on and off-line. An educated network is suitable for modeling and controlling dynamic nonlinear systems and time series systems and predicting the outputs as well as hidden states and parameters. The educated network can also be further educated during on-line processing. 21 figs.
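The node structure described above (inputs from the previous layer, a unit-delayed copy of the node's own output, and unit-delayed outputs of the other nodes in the same layer) can be sketched as a single time step of a hidden layer. This is an illustrative sketch under stated assumptions, not the patented implementation: the function name `layer_step`, the weight layout, and the choice of `tanh` as the transfer function are all hypothetical.

```python
import math
from typing import List


def layer_step(
    x: List[float],              # outputs of the previous layer at this step
    prev_out: List[float],       # this layer's outputs at the previous step
    w_in: List[List[float]],     # w_in[i][k]: weight from input k to node i
    w_self: List[float],         # w_self[i]: local-feedback weight of node i
    w_cross: List[List[float]],  # w_cross[i][j]: crosstalk weight from node j to node i
) -> List[float]:
    """One time step of a hidden layer with local feedback and crosstalk."""
    n = len(prev_out)
    out = []
    for i in range(n):
        s = sum(w * xi for w, xi in zip(w_in[i], x))          # feed-forward term
        s += w_self[i] * prev_out[i]                           # delayed own output
        s += sum(w_cross[i][j] * prev_out[j]
                 for j in range(n) if j != i)                  # delayed neighbors
        out.append(math.tanh(s))                               # assumed transfer function
    return out


# Two-node layer: node 0 sees the input, node 1 only hears node 0 via crosstalk.
w_in = [[1.0], [0.0]]
w_self = [0.0, 0.0]
w_cross = [[0.0, 0.0], [1.0, 0.0]]
step1 = layer_step([0.5], [0.0, 0.0], w_in, w_self, w_cross)
step2 = layer_step([0.0], step1, w_in, w_self, w_cross)
```

Because both feedback paths pass through a unit delay, node 1 responds to node 0's activity one step later, which is what gives the layer its memory of past inputs.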
Neural node network and model, and method of teaching same

DOEpatents

Parlos, Alexander G.; Atiya, Amir F.; Fernandez, Benito; Tsai, Wei K.; Chong, Kil T.

1995-01-01

The present invention is a fully connected feed forward network that includes at least one hidden layer 16. The hidden layer 16 includes nodes 20 in which the output of the node is fed back to that node as an input with a unit delay produced by a delay device 24 occurring in the feedback path 22 (local feedback). Each node within each layer also receives a delayed output (crosstalk) produced by a delay unit 36 from all the other nodes within the same layer 16. The node performs a transfer function operation based on the inputs from the previous layer and the delayed outputs. The network can be implemented as analog or digital or within a general purpose processor. Two teaching methods can be used: (1) back propagation of weight calculation that includes the local feedback and the crosstalk or (2) more preferably a feed forward gradient descent which immediately follows the output computations and which also includes the local feedback and the crosstalk. Subsequent to the gradient propagation, the weights can be normalized, thereby preventing convergence to a local optimum. Education of the network can be incremental both on and off-line. An educated network is suitable for modeling and controlling dynamic nonlinear systems and time series systems and predicting the outputs as well as hidden states and parameters. The educated network can also be further educated during on-line processing.
Autonomic Cluster Management System (ACMS): A Demonstration of Autonomic Principles at Work

NASA Technical Reports Server (NTRS)

Baldassari, James D.; Kopec, Christopher L.; Leshay, Eric S.; Truszkowski, Walt; Finkel, David

2005-01-01

Cluster computing, whereby a large number of simple processors or nodes are combined together to apparently function as a single powerful computer, has emerged as a research area in its own right. The approach offers a relatively inexpensive means of achieving significant computational capabilities for high-performance computing applications, while simultaneously affording the ability to increase that capability simply by adding more (inexpensive) processors. However, the task of manually managing and configuring a cluster quickly becomes impossible as the cluster grows in size. Autonomic computing is a relatively new approach to managing complex systems that can potentially solve many of the problems inherent in cluster management. We describe the development of a prototype Autonomic Cluster Management System (ACMS) that exploits autonomic properties in automating cluster management.
Research in Parallel Algorithms and Software for Computational Aerosciences

NASA Technical Reports Server (NTRS)

Domel, Neal D.

1996-01-01

Phase I is complete for the development of a Computational Fluid Dynamics parallel code with automatic grid generation and adaptation for the Euler analysis of flow over complex geometries. SPLITFLOW, an unstructured Cartesian grid code developed at Lockheed Martin Tactical Aircraft Systems, has been modified for a distributed memory/massively parallel computing environment. The parallel code is operational on an SGI network, Cray J90 and C90 vector machines, SGI Power Challenge, and Cray T3D and IBM SP2 massively parallel machines. Parallel Virtual Machine (PVM) is the message passing protocol for portability to various architectures. A domain decomposition technique was developed which enforces dynamic load balancing to improve solution speed and memory requirements. A host/node algorithm distributes the tasks. The solver parallelizes very well, and scales with the number of processors. Partially parallelized and non-parallelized tasks consume most of the wall clock time in a very fine grain environment. Timing comparisons on a Cray C90 demonstrate that Parallel SPLITFLOW runs 2.4 times faster on 8 processors than its non-parallel counterpart autotasked over 8 processors.
Research in Parallel Algorithms and Software for Computational Aerosciences

NASA Technical Reports Server (NTRS)

Domel, Neal D.

1996-01-01

Phase 1 is complete for the development of a computational fluid dynamics (CFD) parallel code with automatic grid generation and adaptation for the Euler analysis of flow over complex geometries. SPLITFLOW, an unstructured Cartesian grid code developed at Lockheed Martin Tactical Aircraft Systems, has been modified for a distributed memory/massively parallel computing environment. The parallel code is operational on an SGI network, Cray J90 and C90 vector machines, SGI Power Challenge, and Cray T3D and IBM SP2 massively parallel machines. Parallel Virtual Machine (PVM) is the message passing protocol for portability to various architectures. A domain decomposition technique was developed which enforces dynamic load balancing to improve solution speed and memory requirements. A host/node algorithm distributes the tasks. The solver parallelizes very well, and scales with the number of processors. Partially parallelized and non-parallelized tasks consume most of the wall clock time in a very fine grain environment. Timing comparisons on a Cray C90 demonstrate that Parallel SPLITFLOW runs 2.4 times faster on 8 processors than its non-parallel counterpart autotasked over 8 processors.
Development of an Autonomous Navigation Technology Test Vehicle

DTIC Science & Technology

2004-08-01

as an independent thread on processors using the Linux operating system. The computer hardware selected for the nodes that host the MRS threads...communications system design. Linux was chosen as the operating system for all of the single board computers used on the Mule. Linux was specifically...used for system analysis and development. The simple realization of multi-thread processing and inter-process communications in Linux made it a
QCDOC: A 10-teraflops scale computer for lattice QCD

NASA Astrophysics Data System (ADS)

Chen, D.; Christ, N. H.; Cristian, C.; Dong, Z.; Gara, A.; Garg, K.; Joo, B.; Kim, C.; Levkova, L.; Liao, X.; Mawhinney, R. D.; Ohta, S.; Wettig, T.

2001-03-01

The architecture of a new class of computers, optimized for lattice QCD calculations, is described. An individual node is based on a single integrated circuit containing a PowerPC 32-bit integer processor with a 1 Gflops 64-bit IEEE floating point unit, 4 Mbyte of memory, 8 Gbit/sec nearest-neighbor communications and additional control and diagnostic circuitry. The machine's name, QCDOC, derives from "QCD On a Chip".
A Monitoring System for Vegetable Greenhouses based on a Wireless Sensor Network

PubMed Central

Li, Xiu-hong; Cheng, Xiao; Yan, Ke; Gong, Peng

2010-01-01

A wireless sensor network-based automatic monitoring system is designed for monitoring the life conditions of greenhouse vegetables. The complete system architecture includes a group of sensor nodes, a base station, and an internet data center. For the design of the wireless sensor node, the JN5139 micro-processor is adopted as the core component and the Zigbee protocol is used for wireless communication between nodes. With an ARM7 microprocessor and embedded ZKOS operating system, a proprietary gateway node is developed to achieve data influx, screen display, system configuration and GPRS based remote data forwarding. Through a Client/Server mode the management software for the remote data center achieves real-time data distribution and time-series analysis. In addition, a GSM-short-message-based interface is developed for sending real-time environmental measurements, and for raising an alarm when a measurement is beyond some pre-defined threshold. The whole system has been tested for over one year and satisfactory results have been observed, which indicate that this system is very useful for greenhouse environment monitoring. PMID:22163391
Adapting wave-front algorithms to efficiently utilize systems with deep communication hierarchies

DOE Office of Scientific and Technical Information (OSTI.GOV)

Kerbyson, Darren J; Lang, Michael; Pakin, Scott

2009-01-01

Large-scale systems increasingly exhibit a differential between intra-chip and inter-chip communication performance. Processor-cores on the same socket are able to communicate at lower latencies, and with higher bandwidths, than cores on different sockets either within the same node or between nodes. A key challenge is to efficiently use this communication hierarchy and hence optimize performance. We consider here the class of applications that contain wave-front processing. In these applications data can only be processed after their upstream neighbors have been processed. Similar dependencies result between processors in which communication is required to pass boundary data downstream and whose cost is typically impacted by the slowest communication channel in use. In this work we develop a novel hierarchical wave-front approach that reduces the use of slower communications in the hierarchy but at the cost of additional computation and higher use of on-chip communications. This tradeoff is explored using a performance model and an implementation on the Petascale Roadrunner system demonstrates a 27% performance improvement at full system-scale on a kernel application. The approach is generally applicable to large-scale multi-core and accelerated systems where a differential in system communication performance exists.
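The wave-front dependency described above (a cell can be processed only after its upstream neighbors) can be sketched on a small 2D grid. This is a generic illustration of wave-front ordering, not the hierarchical approach developed in the paper; the function names and the path-counting recurrence are hypothetical stand-ins for a real sweep kernel, and the upstream neighbors are assumed to be the cells above and to the left.

```python
from typing import List, Tuple


def wavefront_order(rows: int, cols: int) -> List[List[Tuple[int, int]]]:
    """Group cells by anti-diagonal; each group depends only on earlier groups."""
    return [
        [(i, d - i) for i in range(max(0, d - cols + 1), min(rows, d + 1))]
        for d in range(rows + cols - 1)
    ]


def sweep(rows: int, cols: int) -> List[List[int]]:
    """Stand-in recurrence: each cell sums its upstream (up and left) neighbors."""
    grid = [[0] * cols for _ in range(rows)]
    for front in wavefront_order(rows, cols):
        for i, j in front:  # cells within one front are mutually independent
            if i == 0 and j == 0:
                grid[i][j] = 1
            else:
                up = grid[i - 1][j] if i > 0 else 0
                left = grid[i][j - 1] if j > 0 else 0
                grid[i][j] = up + left
    return grid


fronts = wavefront_order(2, 3)
result = sweep(3, 3)
```

Cells on the same anti-diagonal could be handled by different processors at the same time, whereas moving from one front to the next requires passing boundary data downstream, which is where communication cost enters.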
Multi-threaded ATLAS simulation on Intel Knights Landing processors

NASA Astrophysics Data System (ADS)

Farrell, Steven; Calafiura, Paolo; Leggett, Charles; Tsulaia, Vakhtang; Dotti, Andrea; ATLAS Collaboration

2017-10-01

The Knights Landing (KNL) release of the Intel Many Integrated Core (MIC) Xeon Phi line of processors is a potential game changer for HEP computing. With 72 cores and deep vector registers, the KNL cards promise significant performance benefits for highly-parallel, compute-heavy applications. Cori, the newest supercomputer at the National Energy Research Scientific Computing Center (NERSC), was delivered to its users in two phases with the first phase online at the end of 2015 and the second phase now online at the end of 2016. Cori Phase 2 is based on the KNL architecture and contains over 9000 compute nodes with 96GB DDR4 memory. ATLAS simulation with the multithreaded Athena Framework (AthenaMT) is a good potential use-case for the KNL architecture and supercomputers like Cori. ATLAS simulation jobs have a high ratio of CPU computation to disk I/O and have been shown to scale well in multi-threading and across many nodes. In this paper we will give an overview of the ATLAS simulation application with details on its multi-threaded design. Then, we will present a performance analysis of the application on KNL devices and compare it to a traditional x86 platform to demonstrate the capabilities of the architecture and evaluate the benefits of utilizing KNL platforms like Cori for ATLAS production.
Modern gyrokinetic particle-in-cell simulation of fusion plasmas on top supercomputers

DOE PAGES

Wang, Bei; Ethier, Stephane; Tang, William; ...

2017-06-29

The Gyrokinetic Toroidal Code at Princeton (GTC-P) is a highly scalable and portable particle-in-cell (PIC) code. It solves the 5D Vlasov-Poisson equation featuring efficient utilization of modern parallel computer architectures at the petascale and beyond. Motivated by the goal of developing a modern code capable of dealing with the physics challenge of increasing problem size with sufficient resolution, new thread-level optimizations have been introduced as well as a key additional domain decomposition. GTC-P's multiple levels of parallelism, including inter-node 2D domain decomposition and particle decomposition, as well as intra-node shared memory partition and vectorization have enabled pushing the scalability of the PIC method to extreme computational scales. In this paper, we describe the methods developed to build a highly parallelized PIC code across a broad range of supercomputer designs. This particularly includes implementations on heterogeneous systems using NVIDIA GPU accelerators and Intel Xeon Phi (MIC) co-processors and performance comparisons with state-of-the-art homogeneous HPC systems such as Blue Gene/Q. New discovery science capabilities in the magnetic fusion energy application domain are enabled, including investigations of Ion-Temperature-Gradient (ITG) driven turbulence simulations with unprecedented spatial resolution and long temporal duration. Performance studies with realistic fusion experimental parameters are carried out on multiple supercomputing systems spanning a wide range of cache capacities, cache-sharing configurations, memory bandwidth, interconnects and network topologies. These performance comparisons using a realistic discovery-science-capable domain application code provide valuable insights on optimization techniques across one of the broadest sets of current high-end computing platforms worldwide.
Modern gyrokinetic particle-in-cell simulation of fusion plasmas on top supercomputers

DOE Office of Scientific and Technical Information (OSTI.GOV)

Wang, Bei; Ethier, Stephane; Tang, William

The Gyrokinetic Toroidal Code at Princeton (GTC-P) is a highly scalable and portable particle-in-cell (PIC) code. It solves the 5D Vlasov-Poisson equation featuring efficient utilization of modern parallel computer architectures at the petascale and beyond. Motivated by the goal of developing a modern code capable of dealing with the physics challenge of increasing problem size with sufficient resolution, new thread-level optimizations have been introduced as well as a key additional domain decomposition. GTC-P's multiple levels of parallelism, including inter-node 2D domain decomposition and particle decomposition, as well as intra-node shared memory partition and vectorization have enabled pushing the scalability of the PIC method to extreme computational scales. In this paper, we describe the methods developed to build a highly parallelized PIC code across a broad range of supercomputer designs. This particularly includes implementations on heterogeneous systems using NVIDIA GPU accelerators and Intel Xeon Phi (MIC) co-processors and performance comparisons with state-of-the-art homogeneous HPC systems such as Blue Gene/Q. New discovery science capabilities in the magnetic fusion energy application domain are enabled, including investigations of Ion-Temperature-Gradient (ITG) driven turbulence simulations with unprecedented spatial resolution and long temporal duration. Performance studies with realistic fusion experimental parameters are carried out on multiple supercomputing systems spanning a wide range of cache capacities, cache-sharing configurations, memory bandwidth, interconnects and network topologies. These performance comparisons using a realistic discovery-science-capable domain application code provide valuable insights on optimization techniques across one of the broadest sets of current high-end computing platforms worldwide.
Time Series Analysis for Spatial Node Selection in Environment Monitoring Sensor Networks

PubMed Central

Bhandari, Siddhartha; Jurdak, Raja; Kusy, Branislav

2017-01-01

Wireless sensor networks are widely used in environmental monitoring. The number of sensor nodes to be deployed will vary depending on the desired spatio-temporal resolution. Selecting an optimal number, position and sampling rate for an array of sensor nodes in environmental monitoring is a challenging question. Most of the current solutions are either theoretical or simulation-based where the problems are tackled using random field theory, computational geometry or computer simulations, limiting their specificity to a given sensor deployment. Using an empirical dataset from a mine rehabilitation monitoring sensor network, this work proposes a data-driven approach where co-integrated time series analysis is used to select the number of sensors from a short-term deployment of a larger set of potential node positions. Analyses conducted on temperature time series show 75% of sensors are co-integrated. Using only 25% of the original nodes can generate a complete dataset within a 0.5 °C average error bound. Our data-driven approach to sensor position selection is applicable for spatiotemporal monitoring of spatially correlated environmental parameters to minimize deployment cost without compromising data resolution. PMID:29271880

Hyperswitch Network For Hypercube Computer

NASA Technical Reports Server (NTRS)

Chow, Edward; Madan, Herbert; Peterson, John

1989-01-01

Data-driven dynamic switching enables high speed data transfer. Proposed hyperswitch network based on mixed static and dynamic topologies. Routing header modified in response to congestion or faults encountered as path established. Static topology meets requirement if nodes have switching elements that perform necessary routing header revisions dynamically. Hypercube topology now being implemented with switching element in each computer node aimed at designing very-richly-interconnected multicomputer system. Interconnection network connects great number of small computer nodes, using fixed hypercube topology, characterized by point-to-point links between nodes.
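The fixed hypercube topology mentioned above has a simple addressing property: two nodes share a point-to-point link exactly when their binary addresses differ in one bit. The sketch below illustrates this along with plain dimension-order (e-cube) routing, which is a standard static scheme and not the adaptive hyperswitch routing the record describes; the function names are hypothetical.

```python
from typing import List


def neighbors(node: int, dim: int) -> List[int]:
    """Nodes directly linked to `node` in a dim-dimensional hypercube."""
    return [node ^ (1 << k) for k in range(dim)]


def ecube_route(src: int, dst: int, dim: int) -> List[int]:
    """Static dimension-order route: correct differing bits from lowest to highest."""
    path = [src]
    cur = src
    for k in range(dim):
        if (cur ^ dst) & (1 << k):   # addresses differ in dimension k
            cur ^= 1 << k            # traverse the link along dimension k
            path.append(cur)
    return path


links_of_node_0 = neighbors(0b000, 3)
route = ecube_route(0b000, 0b101, 3)
```

A static route like this cannot react to a congested or faulty link, which is the limitation that revising the routing header dynamically, as described above, is meant to address.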
Accuracy-Energy Configurable Sensor Processor and IoT Device for Long-Term Activity Monitoring in Rare-Event Sensing Applications

PubMed Central

2014-01-01

A specially designed sensor processor for use as the main processor in an IoT (Internet of Things) device for rare-event sensing applications is proposed. The IoT device including the proposed sensor processor performs event-driven sensor data processing based on an accuracy-energy configurable event quantization at the architectural level. The received sensor signal is converted into a sequence of atomic events, which is extracted by the signal-to-atomic-event generator (AEG). Using an event signal processing unit (EPU) as an accelerator, the extracted atomic events are analyzed to build the final event. Instead of transmitting the sampled raw data via the internet, the proposed method delays communication with a host system until a semantic pattern of the signal is identified as a final event. The proposed processor is implemented on a single chip, which is tightly coupled at the bus connection level with a microcontroller, using a 0.18 μm CMOS embedded-flash process. For the experiments, we evaluated the proposed sensor processor using an IR- (infrared radio-) based signal reflection and sensor signal acquisition system. We successfully demonstrated that the expected power consumption is in the range of 20% to 50% of the baseline result when a 10% accuracy error is allowed. PMID:25580458
Implementing the PM Programming Language using MPI and OpenMP - a New Tool for Programming Geophysical Models on Parallel Systems

NASA Astrophysics Data System (ADS)

Bellerby, Tim

2015-04-01

PM (Parallel Models) is a new parallel programming language specifically designed for writing environmental and geophysical models. The language is intended to enable implementers to concentrate on the science behind the model rather than the details of running on parallel hardware. At the same time PM leaves the programmer in control - all parallelisation is explicit and the parallel structure of any given program may be deduced directly from the code. This paper describes a PM implementation based on the Message Passing Interface (MPI) and Open Multi-Processing (OpenMP) standards, looking at issues involved with translating the PM parallelisation model to MPI/OpenMP protocols and considering performance in terms of the competing factors of finer-grained parallelisation and increased communication overhead. In order to maximise portability, the implementation stays within the MPI 1.3 standard as much as possible, with MPI-2 MPI-IO file handling the only significant exception. Moreover, it does not assume a thread-safe implementation of MPI. PM adopts a two-tier abstract representation of parallel hardware. A PM processor is a conceptual unit capable of efficiently executing a set of language tasks, with a complete parallel system consisting of an abstract N-dimensional array of such processors. PM processors may map to single cores executing tasks using cooperative multi-tasking, to multiple cores or even to separate processing nodes, efficiently sharing tasks using algorithms such as work stealing. While tasks may move between hardware elements within a PM processor, they may not move between processors without specific programmer intervention. Tasks are assigned to processors using a nested parallelism approach, building on ideas from Reyes et al. (2009). The main program owns all available processors. 
When the program enters a parallel statement then either processors are divided out among the newly generated tasks (number of new tasks < number of processors) or tasks are divided out among the available processors (number of tasks > number of processors). Nested parallel statements may further subdivide the processor set owned by a given task. Tasks or processors are distributed evenly by default, but uneven distributions are possible under programmer control. It is also possible to explicitly enable child tasks to migrate within the processor set owned by their parent task, reducing load imbalance at the potential cost of increased inter-processor message traffic. PM incorporates some programming structures from the earlier MIST language presented at a previous EGU General Assembly, while adopting a significantly different underlying parallelisation model and type system. PM code is available at www.pm-lang.org under an unrestrictive MIT license. Reference: Ruymán Reyes, Antonio J. Dorta, Francisco Almeida, Francisco de Sande, 2009. Automatic Hybrid MPI+OpenMP Code Generation with llc, Recent Advances in Parallel Virtual Machine and Message Passing Interface, Lecture Notes in Computer Science Volume 5759, 185-195
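The default even distribution described above can be sketched as follows. This is an illustration in Python rather than PM, and it is not taken from the PM implementation: the function names are hypothetical, and treating the case of equal task and processor counts as one processor per task is an assumption, since the abstract does not specify it.

```python
from typing import List, Tuple


def split_evenly(n_items: int, n_bins: int) -> List[int]:
    """Block sizes for n_items spread over n_bins, differing by at most one."""
    base, extra = divmod(n_items, n_bins)
    return [base + (1 if b < extra else 0) for b in range(n_bins)]


def distribute(n_tasks: int, n_procs: int) -> Tuple[str, List[int]]:
    """Decide which resource is divided when a parallel statement is entered."""
    if n_tasks <= n_procs:
        # Fewer tasks than processors: each task owns a share of the processors.
        # (The equal case is an assumption; it yields one processor per task.)
        return "processors_per_task", split_evenly(n_procs, n_tasks)
    # More tasks than processors: each processor receives a share of the tasks.
    return "tasks_per_processor", split_evenly(n_tasks, n_procs)


few_tasks = distribute(3, 8)
many_tasks = distribute(10, 4)
```

Applying `distribute` again inside a task, with that task's share as the new processor count, would mirror how nested parallel statements further subdivide the processor set owned by a task.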
DIRAC universal pilots

NASA Astrophysics Data System (ADS)

Stagni, F.; McNab, A.; Luzzi, C.; Krzemien, W.; Consortium, DIRAC

2017-10-01

In the last few years, new types of computing models, such as IaaS (Infrastructure as a Service) and IaaC (Infrastructure as a Client), have gained popularity. New resources may come as part of pledged resources, while others are in the form of opportunistic ones. Most but not all of these new infrastructures are based on virtualization techniques. In addition, some of them present opportunities for multi-processor computing slots to the users. Virtual Organizations are therefore facing heterogeneity of the available resources, and the use of interware software like DIRAC to provide a transparent, uniform interface has become essential. The transparent access to the underlying resources is realized by implementing the pilot model. DIRAC’s newest generation of generic pilots (the so-called Pilots 2.0) are the “pilots for all the skies”, and were successfully released in production more than a year ago. They use a plugin mechanism that makes them easily adaptable. Pilots 2.0 have been used for fetching and running jobs on every type of resource, be it a Worker Node (WN) behind a CREAM/ARC/HTCondor/DIRAC Computing element, a Virtual Machine running on IaaC infrastructures like Vac or BOINC, on IaaS cloud resources managed by Vcycle, the LHCb High Level Trigger farm nodes, or any type of opportunistic computing resource. Make a machine a “Pilot Machine”, and all diversities between them will disappear. This contribution describes how pilots are made suitable for different resources, and the recent steps taken towards a fully unified framework, including monitoring. Also, the cases of multi-processor computing slots either on real or virtual machines, with the whole node or a partition of it, are discussed.
Design of the SLAC RCE Platform: A General Purpose ATCA Based Data Acquisition System

DOE Office of Scientific and Technical Information (OSTI.GOV)

Herbst, R.; Claus, R.; Freytag, M.

2015-01-23

The SLAC RCE platform is a general purpose clustered data acquisition system implemented on a custom ATCA compliant blade, called the Cluster On Board (COB). The core of the system is the Reconfigurable Cluster Element (RCE), which is a system-on-chip design based upon the Xilinx Zynq family of FPGAs, mounted on custom COB daughter-boards. The Zynq architecture couples a dual core ARM Cortex A9 based processor with a high performance 28nm FPGA. The RCE has 12 external general purpose bi-directional high speed links, each supporting serial rates of up to 12 Gbps. 8 RCE nodes are included on a COB, each with a 10 Gbps connection to an on-board 24-port Ethernet switch integrated circuit. The COB is designed to be used with a standard full-mesh ATCA backplane allowing multiple RCE nodes to be tightly interconnected with minimal interconnect latency. Multiple shelves can be clustered using the front panel 10 Gbps connections. The COB also supports local and inter-blade timing and trigger distribution. An experiment specific Rear Transition Module adapts the 96 high speed serial links to specific experiments and allows an experiment-specific timing and busy feedback connection. This coupling of processors with a high performance FPGA fabric in a low latency, multiple node cluster allows high speed data processing that can be easily adapted to any physics experiment. RTEMS and Linux are both ported to the module. The RCE has been used or is the baseline for several current and proposed experiments (LCLS, HPS, LSST, ATLAS-CSC, LBNE, DarkSide, ILC-SiD, etc).
Cost/Performance Ratio Achieved by Using a Commodity-Based Cluster

NASA Technical Reports Server (NTRS)

Lopez, Isaac

2001-01-01

Researchers at the NASA Glenn Research Center acquired a commodity cluster based on Intel Corporation processors to compare its performance with a traditional UNIX cluster in the execution of aeropropulsion applications. Since the cost differential of the clusters was significant, a cost/performance ratio was calculated. After executing a propulsion application on both clusters, the researchers demonstrated a 9.4 cost/performance ratio in favor of the Intel-based cluster. These researchers utilize the Aeroshark cluster as one of the primary testbeds for developing NPSS parallel application codes and system software. The Aeroshark cluster provides 64 Intel Pentium II 400-MHz processors, housed in 32 nodes. Recently, APNASA, a code developed by a Government/industry team for the design and analysis of turbomachinery systems, was used for a simulation on Glenn's Aeroshark cluster.
Benchmarking and tuning the MILC code on clusters and supercomputers

NASA Astrophysics Data System (ADS)

Gottlieb, Steven

2002-03-01

Recently, we have benchmarked and tuned the MILC code on a number of architectures including Intel Itanium and Pentium IV (PIV), dual-CPU Athlon, and the latest Compaq Alpha nodes. Results will be presented for many of these, and we shall discuss some simple code changes that can result in a very dramatic speedup of the KS conjugate gradient on processors with more advanced memory systems such as PIV, IBM SP and Alpha.
Scalable Unix commands for parallel processors : a high-performance implementation.

DOE Office of Scientific and Technical Information (OSTI.GOV)

Ong, E.; Lusk, E.; Gropp, W.

2001-06-22

We describe a family of MPI applications we call the Parallel Unix Commands. These commands are natural parallel versions of common Unix user commands such as ls, ps, and find, together with a few similar commands particular to the parallel environment. We describe the design and implementation of these programs and present some performance results on a 256-node Linux cluster. The Parallel Unix Commands are open source and freely available.
Electrooptical adaptive switching network for the hypercube computer

NASA Technical Reports Server (NTRS)

Chow, E.; Peterson, J.

1988-01-01

An all-optical network design for the hyperswitch network using regular free-space interconnects between electronic processor nodes is presented. The adaptive routing model used is described, and an adaptive routing control example is presented. The design demonstrates that existing electrooptical techniques are sufficient for implementing efficient parallel architectures without the need for more complex means of implementing arbitrary interconnection schemes. The electrooptical hyperswitch network significantly improves the communication performance of the hypercube computer.
Morpho-functional characterization of the systemic venous pole of the reptile heart.

PubMed

Jensen, Bjarke; Vesterskov, Signe; Boukens, Bastiaan J; Nielsen, Jan M; Moorman, Antoon F M; Christoffels, Vincent M; Wang, Tobias

2017-07-27

Mammals evolved from reptile-like ancestors, and while the mammalian heart is driven by a distinct sinus node, a sinus node is not apparent in reptiles. We characterized the myocardial systemic venous pole, the sinus venosus, in reptiles to identify the dominant pacemaker and to assess whether the sinus venosus remodels and adopts an atrium-like phenotype as observed in mammals. Anolis lizards had an extensive sinus venosus of myocardium expressing Tbx18. A small sub-population of cells encircling the sinuatrial junction expressed Isl1, Bmp2, Tbx3, and Hcn4, homologues of genes marking the mammalian sinus node. Electrical mapping showed that hearts of Anolis lizards and Python snakes were driven from the sinuatrial junction. The electrical impulse was delayed between the sinus venosus and the right atrium, allowing the sinus venosus to contract and aid right atrial filling. In proximity of the systemic veins, the Anolis sinus venosus expressed markers of the atrial phenotype Nkx2-5 and Gja5. In conclusion, the reptile heart is driven by a pacemaker region with an expression signature similar to that of the immature sinus node of mammals. Unlike mammals, reptiles maintain a sinuatrial delay of the impulse, allowing the partly atrialized sinus venosus to function as a chamber.
Asynchronous Data Retrieval from an Object-Oriented Database

NASA Astrophysics Data System (ADS)

Gilbert, Jonathan P.; Bic, Lubomir

We present an object-oriented semantic database model which, similar to other object-oriented systems, combines the virtues of four concepts: the functional data model, a property inheritance hierarchy, abstract data types and message-driven computation. The main emphasis is on the last of these four concepts. We describe generic procedures that permit queries to be processed in a purely message-driven manner. A database is represented as a network of nodes and directed arcs, in which each node is a logical processing element, capable of communicating with other nodes by exchanging messages. This eliminates the need for shared memory and for centralized control during query processing. Hence, the model is suitable for implementation on a multiprocessor computer architecture, consisting of large numbers of loosely coupled processing elements.
Sensor node for remote monitoring of waterborne disease-causing bacteria.

PubMed

Kim, Kyukwang; Myung, Hyun

2015-05-05

A sensor node for sampling water and checking for the presence of harmful bacteria such as E. coli in water sources was developed in this research. A chromogenic enzyme substrate assay method was used to easily detect coliform bacteria by monitoring the color change of the sampled water mixed with a reagent. Live webcam image streaming to the web browser of the end user with a Wi-Fi connected sensor node shows the water color changes in real time. The liquid can be manipulated on the web-based user interface, and also can be observed by webcam feeds. Image streaming and web console servers run on an embedded processor with an expansion board. The UART channel of the expansion board is connected to an external Arduino board and a motor driver to control self-priming water pumps to sample the water, mix the reagent, and remove the water sample after the test is completed. The sensor node can repeat water testing until the test reagent is depleted. The authors anticipate that the use of the sensor node developed in this research can decrease the cost and required labor for testing samples in a factory environment and checking the water quality of local water sources in developing countries.
Spectral Element Method for the Simulation of Unsteady Compressible Flows

NASA Technical Reports Server (NTRS)

Diosady, Laslo Tibor; Murman, Scott M.

2013-01-01

This work uses a discontinuous-Galerkin spectral-element method (DGSEM) to solve the compressible Navier-Stokes equations [1-3]. The inviscid flux is computed using the approximate Riemann solver of Roe [4]. The viscous fluxes are computed using the second form of Bassi and Rebay (BR2) [5] in a manner consistent with the spectral-element approximation. The method of lines with the classical 4th-order explicit Runge-Kutta scheme is used for time integration. Results for polynomial orders up to p = 15 (16th order) are presented. The code is parallelized using the Message Passing Interface (MPI). The computations presented in this work are performed using the Sandy Bridge nodes of the NASA Pleiades supercomputer at NASA Ames Research Center. Each Sandy Bridge node consists of 2 eight-core Intel Xeon E5-2670 processors with a clock speed of 2.6 GHz and 2 GB of memory per core. On a Sandy Bridge node the Tau Benchmark [6] runs in a time of 7.6 s.
Massively parallel algorithms for trace-driven cache simulations

NASA Technical Reports Server (NTRS)

Nicol, David M.; Greenberg, Albert G.; Lubachevsky, Boris D.

1991-01-01

Trace-driven cache simulation is central to computer design. A trace is a very long sequence of reference lines from main memory. At the t-th instant, reference x_t is hashed into a set of cache locations, the contents of which are then compared with x_t. If at the t-th instant x_t is not present in the cache, then it is said to be a miss, and is loaded into the cache set, possibly forcing the replacement of some other memory line, and making x_t present for the (t+1)-st instant. The problem of parallel simulation of a subtrace of N references directed to a C-line cache set is considered, with the aim of determining which references are misses and related statistics. A simulation method is presented for the Least Recently Used (LRU) policy, which, regardless of the set size C, runs in time O(log N) using N processors on the exclusive read, exclusive write (EREW) parallel model. A simpler LRU simulation algorithm is given that runs in O(C log N) time using N/log N processors. Timings are presented of the second algorithm's implementation on the MasPar MP-1, a machine with 16384 processors. A broad class of reference-based line replacement policies is considered, which includes LRU as well as the Least Frequently Used and Random replacement policies. A simulation method is presented for any such policy that, on any trace of length N directed to a C-line set, runs in O(C log N) time with high probability using N processors on the EREW model. The algorithms are simple, have very little space overhead, and are well suited for SIMD implementation.
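For orientation, the problem this abstract parallelizes can be stated as a short sequential program. The sketch below is only the straightforward one-processor baseline for a single C-line set under LRU; it is not the O(log N) or O(C log N) parallel algorithm of the paper, and the function name is an assumption made for illustration.

```python
from collections import OrderedDict


def lru_misses(trace, set_size):
    """Return one boolean per reference: True where the reference misses in a
    single cache set holding `set_size` lines under LRU replacement."""
    cache = OrderedDict()  # keys ordered from least to most recently used
    misses = []
    for ref in trace:
        if ref in cache:
            cache.move_to_end(ref)         # hit: mark as most recently used
            misses.append(False)
        else:
            misses.append(True)            # miss: load the line into the set
            if len(cache) >= set_size:
                cache.popitem(last=False)  # evict the least recently used line
            cache[ref] = None
    return misses


print(lru_misses([1, 2, 3, 1, 4, 2], set_size=3))
```

The output marks which of the N references are misses, which is the quantity the parallel methods in the abstract compute.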
enhancedGraphics: a Cytoscape app for enhanced node graphics

PubMed Central

Morris, John H.; Kuchinsky, Allan; Ferrin, Thomas E.; Pico, Alexander R.

2014-01-01

enhancedGraphics (http://apps.cytoscape.org/apps/enhancedGraphics) is a Cytoscape app that implements a series of enhanced charts and graphics that may be added to Cytoscape nodes. It enables users and other app developers to create pie, line, bar, and circle plots that are driven by columns in the Cytoscape Node Table. Charts are drawn using vector graphics to allow full-resolution scaling. PMID:25285206
A Linux PC Cluster for Lattice QCD with Exact Chiral Symmetry

NASA Astrophysics Data System (ADS)

Chiu, Ting-Wai; Hsieh, Tung-Han; Huang, Chao-Hsi; Huang, Tsung-Ren

A computational system for lattice QCD with overlap Dirac quarks is described. The platform is a home-made Linux PC cluster, built with off-the-shelf components. At present the system consists of 64 nodes, with each node consisting of one Pentium 4 processor (1.6/2.0/2.5 GHz), one Gbyte of PC800/1066 RDRAM, one 40/80/120 Gbyte hard disk, and a network card. The computationally intensive parts of our program are written in SSE2 code. The speed of our system is estimated to be 70 Gflops, and its price/performance ratio is better than $1.0/Mflops for 64-bit (double precision) computations in quenched QCD. We discuss how to optimize its hardware and software for computing propagators of overlap Dirac quarks.
Radar Data Processing Using a Distributed Computational System

DTIC Science & Technology

1992-06-01

objects to processors must reduce Toc (N) (i.e., the time to compute on 85 N nodes) [Ref. 28]. Time spent communicating can represent a degradation of...de Sistemas e Computação, s/ data. [9] Vilhena R. "Introdução aos Algoritmos para Processamento de Marcações e Distâncias", Escola Naval - Notas de...Aula - Automação de Sistemas Navais, s/ data. [10] Averbuch A., Itzikowitz S., and Kapon T. "Parallel Implementation of Multiple Model Tracking
Ultra-Low Power Event-Driven Wireless Sensor Node Using Piezoelectric Accelerometer for Health Monitoring

NASA Astrophysics Data System (ADS)

Okada, Hironao; Kobayashi, Takeshi; Masuda, Takashi; Itoh, Toshihiro

2009-07-01

We describe a low-power wireless sensor node designed for monitoring the condition of animals, especially chickens. The node detects variations in 24-h behavior patterns by counting, per unit time, the movements of an animal whose acceleration exceeds a threshold. Wireless sensor nodes that operate intermittently are likely to miss necessary data while in sleep mode and to waste power acquiring useless data. We designed the node to operate only when the required acceleration is detected, using a piezoelectric accelerometer and a comparator as the wake-up source for the microcontroller unit.
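The counting scheme in this abstract can be modelled in software as follows. This is a minimal sketch under stated assumptions: in the described node the threshold comparison is done by a hardware comparator that wakes the microcontroller, whereas here it is a plain comparison on sampled data, and the function name, sample format, and window length are hypothetical.

```python
def count_events_per_window(samples, threshold, window):
    """Count threshold-exceeding movements per time window.

    samples   -- iterable of (time_in_seconds, acceleration) pairs, time-ordered
    threshold -- acceleration magnitude that counts as a movement
    window    -- window length in seconds (the "unit time")
    Returns a dict mapping window index to the number of upward crossings.
    """
    counts = {}
    above = False
    for t, accel in samples:
        is_above = abs(accel) > threshold
        if is_above and not above:
            # Rising edge: this is where the comparator would wake the MCU.
            idx = int(t // window)
            counts[idx] = counts.get(idx, 0) + 1
        above = is_above
    return counts


samples = [(0, 0.1), (1, 1.5), (2, 1.6), (3, 0.2), (4, 2.0), (11, 0.1), (12, 1.2)]
print(count_events_per_window(samples, threshold=1.0, window=10))
```

Only rising edges are counted, so one sustained movement is recorded once rather than once per sample.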
Instrument front-ends at Fermilab during Run II

NASA Astrophysics Data System (ADS)

Meyer, T.; Slimmer, D.; Voy, D.

2011-11-01

The optimization of an accelerator relies on the ability to monitor the behavior of the beam in an intelligent and timely fashion. The use of processor-driven front-ends allowed for the deployment of smart systems in the field for improved data collection and analysis during Run II. This paper describes the implementation of the two main systems used: National Instruments LabVIEW running on PCs, and WindRiver's VxWorks real-time operating system running in a VME crate processor. Work supported by Fermi Research Alliance, LLC under Contract No. DE-AC02-07CH11359 with the United States Department of Energy.

Cooperation Helps Power Saving

DTIC Science & Technology

2009-04-07

the destination node hears the poll, the link between the two nodes is activated. In the original STEM, two radios working on two separate channels are used: one radio is...Computer and Communications Societies. Proceedings. IEEE, vol. 3, pp. 1548–1557 vol.3, 2001. [2] R. Kravets and P. Krishnan, “Application-driven power
Smart-Pixel Array Processors Based on Optimal Cellular Neural Networks for Space Sensor Applications

NASA Technical Reports Server (NTRS)

Fang, Wai-Chi; Sheu, Bing J.; Venus, Holger; Sandau, Rainer

1997-01-01

A smart-pixel cellular neural network (CNN) with hardware annealing capability, digitally programmable synaptic weights, and multisensor parallel interface has been under development for advanced space sensor applications. The smart-pixel CNN architecture is a programmable multi-dimensional array of optoelectronic neurons which are locally connected with their local neurons and associated active-pixel sensors. Integration of the neuroprocessor in each processor node of a scalable multiprocessor system offers orders-of-magnitude computing performance enhancements for on-board real-time intelligent multisensor processing and control tasks of advanced small satellites. The smart-pixel CNN operation theory, architecture, design and implementation, and system applications are investigated in detail. The VLSI (Very Large Scale Integration) implementation feasibility was illustrated by a prototype smart-pixel 5x5 neuroprocessor array chip of active dimensions 1380 micron x 746 micron in a 2-micron CMOS technology.
High Fidelity Simulations of Unsteady Flow through Turbopumps and Flowliners

NASA Technical Reports Server (NTRS)

Kiris, Cetin C.; Kwak, Dochan; Chan, William; Housman, Jeff

2006-01-01

High fidelity computations were carried out to analyze the orbiter LH2 feedline flowliner. Computations were performed on the Columbia platform, which is a 10,240-processor supercluster consisting of 20 Altix nodes with 512 processors each. Various computational models were used to characterize the unsteady flow features in the turbopump, including the orbiter Low-Pressure-Fuel-Turbopump (LPFTP) inducer, the orbiter manifold and a test article used to represent the manifold. Unsteady flow originating from the orbiter LPFTP inducer is one of the major contributors to the high frequency cyclic loading that results in high cycle fatigue damage to the gimbal flowliners just upstream of the LPFTP. The flow fields for the orbiter manifold and representative test article are computed and analyzed for similarities and differences. The incompressible Navier-Stokes flow solver INS3D, based on the artificial compressibility method, was used to compute the flow of liquid hydrogen in each test article.
IJA: an efficient algorithm for query processing in sensor networks.

PubMed

Lee, Hyun Chang; Lee, Young Jae; Lim, Ji Hyang; Kim, Dong Hwa

2011-01-01

One of the main features of sensor networks is the ability to process real-time state information after gathering the needed data from many domains. The component technologies of each sensor node, including physical sensors, processors, actuators and power, have advanced significantly over the last decade. Thanks to these advances, sensor networks have over time been adopted across industry for sensing physical phenomena. However, sensor nodes are considerably constrained: with their limited energy and memory resources, they have a very limited ability to process information compared to conventional computer systems. Query processing over the nodes must therefore respect these limitations. Because of these constraints, join operations in sensor networks are typically processed in a distributed manner over a set of nodes and have been studied. While simple queries, such as select and aggregate queries, have been addressed in the literature, the processing of join queries in sensor networks remains to be investigated. Therefore, in this paper, we propose and describe an Incremental Join Algorithm (IJA) for sensor networks to reduce the overhead caused by moving a join pair to the final join node, or to minimize the communication cost, which is the main consumer of battery power when processing distributed queries in sensor network environments. The simulation results show that the proposed IJA algorithm significantly reduces the number of bytes to be moved to join nodes compared to the popular synopsis join algorithm.
IJA: An Efficient Algorithm for Query Processing in Sensor Networks

PubMed Central

Lee, Hyun Chang; Lee, Young Jae; Lim, Ji Hyang; Kim, Dong Hwa

2011-01-01

One of the main features of sensor networks is the ability to process real-time state information after gathering the needed data from many domains. The component technologies of each sensor node, including physical sensors, processors, actuators and power, have advanced significantly over the last decade. Thanks to these advances, sensor networks have over time been adopted across industry for sensing physical phenomena. However, sensor nodes are considerably constrained: with their limited energy and memory resources, they have a very limited ability to process information compared to conventional computer systems. Query processing over the nodes must therefore respect these limitations. Because of these constraints, join operations in sensor networks are typically processed in a distributed manner over a set of nodes and have been studied. While simple queries, such as select and aggregate queries, have been addressed in the literature, the processing of join queries in sensor networks remains to be investigated. Therefore, in this paper, we propose and describe an Incremental Join Algorithm (IJA) for sensor networks to reduce the overhead caused by moving a join pair to the final join node, or to minimize the communication cost, which is the main consumer of battery power when processing distributed queries in sensor network environments. The simulation results show that the proposed IJA algorithm significantly reduces the number of bytes to be moved to join nodes compared to the popular synopsis join algorithm. PMID:22319375
The Design and Development of the SMEX-Lite Power System

NASA Technical Reports Server (NTRS)

Rakow, Glenn P.; Schnurr, Richard G., Jr.; Solly, Michael A.

1998-01-01

This paper describes the design and development of a 250 W orbit-average electrical power system electronic Power Node and software for use in Low Earth Orbit missions. The mass of the Power Node is 3.6 kg (8 lb.). The dimensions of the Power Node are 30 cm x 26 cm x 7.9 cm (11 in. x 10.25 in. x 3.1 in.). The design was realized using software, Field Programmable Gate Array (FPGA) digital logic and surface mount technology. The design is generic enough to reduce the non-recurring engineering for different mission configurations. The Power Node independently charges one to five low-cost, 22-cell, 4 Ah D-cell battery packs. The battery charging algorithms are executed in the power software to reduce the mass and size of the power electronics. The Power Node implements a peak-power tracking algorithm using an innovative hardware/software approach. The power software task is hosted on the spacecraft processor. The power software task generates a MIL-STD-1553 command packet to update the Power Node control settings. The settings for the battery voltage and current limits, as well as the minimum solar array voltage used to implement peak power tracking, are contained in this packet. Several advanced topologies are used in the Power Node. These include synchronous rectification in the bus regulators, average current control in the battery chargers and quasi-resonant converters for the Field Effect Transistor (FET) drive electronics. Lastly, the main bus regulator uses a feed-forward topology with the PWM implemented in an FPGA.
Maximizing Mission Science Return Through Use of Spacecraft Autonomy: Active Volcanism and the Autonomous Sciencecraft Experiment

NASA Technical Reports Server (NTRS)

Davies, A. G.; Chien, S.; Baker, V.; Castano, R.; Cichy, B.; Doggett, T.; Dohm, J. M.; Greeley, R.; Ip, F.; Rabideau, G.

2005-01-01

ASE has successfully demonstrated that a spacecraft can be driven by science analysis and autonomously controlled. ASE is available for flight on other missions. Mission hardware design should consider ASE requirements for available onboard data storage, onboard memory size and processor speed.
Advanced satellite communication system

NASA Technical Reports Server (NTRS)

Staples, Edward J.; Lie, Sen

1992-01-01

The objective of this research program was to develop an innovative advanced satellite receiver/demodulator utilizing a surface acoustic wave (SAW) chirp transform processor and coherent BPSK demodulation. The algorithm of this SAW chirp Fourier transformer is of the Convolve - Multiply - Convolve (CMC) type, utilizing off-the-shelf reflective array compressor (RAC) chirp filters. This satellite receiver, if fully developed, was intended to be used as an on-board multichannel communications repeater. The Advanced Communications Receiver consists of four units: (1) CMC processor, (2) single sideband modulator, (3) demodulator, and (4) chirp waveform generator and individual channel processors. The input signal is composed of multiple user transmission frequencies operating independently from remotely located ground terminals. This signal is Fourier transformed by the CMC processor into a unique time slot for each user frequency. The CMC processor is driven by a waveform generator through a single sideband (SSB) modulator. The output of the coherent demodulator is composed of positive and negative pulses, which are the envelopes of the chirp transform processor output. These pulses correspond to the data symbols. Following the demodulator, a logic circuit reconstructs the pulses into data, which are subsequently differentially decoded to form the transmitted data. The coherent demodulation and detection of BPSK signals derived from a CMC chirp transform processor were experimentally demonstrated and bit error rate (BER) testing was performed. To assess the feasibility of such an advanced receiver, the results were compared with the theoretical analysis and plotted for an average BER as a function of signal-to-noise ratio. Another goal of this SBIR program was the development of a commercial product. The commercial product developed was an arbitrary waveform generator. Sales have begun with the delivery of the first arbitrary waveform generator.
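The last decoding step named in this abstract, turning positive and negative demodulator pulses into bits and then differentially decoding them, can be sketched as below. The mapping (positive pulse = 1) and the convention that a data 1 is sent as a change of symbol are assumptions made for illustration; the abstract does not state which convention the receiver used, and all function names are hypothetical.

```python
def pulses_to_bits(pulses):
    """Map demodulator pulses to bits: positive -> 1, negative -> 0 (assumed mapping)."""
    return [1 if p > 0 else 0 for p in pulses]


def differential_encode(data, initial=0):
    """Transmit side: each sent bit is the data bit XOR the previously sent bit."""
    encoded, prev = [], initial
    for d in data:
        prev = d ^ prev
        encoded.append(prev)
    return encoded


def differential_decode(bits, initial=0):
    """Receive side: each data bit is the XOR of consecutive received bits."""
    decoded, prev = [], initial
    for b in bits:
        decoded.append(b ^ prev)
        prev = b
    return decoded


data = [1, 0, 1, 1, 0]
sent = differential_encode(data)
pulses = [0.9 if b else -0.8 for b in sent]  # idealized demodulator output
print(differential_decode(pulses_to_bits(pulses)))
```

A design point worth noting: if the receiver locks onto the carrier with inverted polarity, every pulse flips sign, yet all decoded bits after the first are unchanged, which is the usual reason for pairing differential coding with coherent BPSK.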
Effectiveness of the Benign and Malignant Diagnosis of Mediastinal and Hilar Lymph Nodes by Endobronchial Ultrasound Elastography.

PubMed

Huang, Haidong; Huang, Zhiang; Wang, Qin; Wang, Xinan; Dong, Yuchao; Zhang, Wei; Zarogoulidis, Paul; Man, Yan-Gao; Schmidt, Wolfgang Hohenforst; Bai, Chong

2017-01-01

Background and Objectives: Endobronchial ultrasound elastography is a new technique for describing the stiffness of tissue during endobronchial ultrasound-guided transbronchial needle aspiration (EBUS-TBNA). The aim of this study was to investigate the diagnostic value of endobronchial ultrasound (EBUS) elastography for distinguishing benign from malignant mediastinal and hilar lymph nodes. Materials and Methods: From June 2015 to August 2015, 47 patients with mediastinal and hilar lymph node enlargement confirmed by computed tomography (CT) were enrolled, and a total of 78 lymph nodes were evaluated by EBUS-TBNA. EBUS-guided elastography of lymph nodes was performed prior to EBUS-TBNA. A convex probe EBUS was used with a new EBUS processor to assess elastographic patterns that were classified based on color distribution as follows: Type 1, predominantly non-blue (green, yellow and red); Type 2, part blue, part non-blue (green, yellow and red); Type 3, predominantly blue. Pathological determination of malignant or benign lymph nodes was used as the gold standard for this study. The elastographic patterns were compared with the final pathologic results from EBUS-TBNA. Results: On pathological evaluation of the lymph nodes, 45 were benign and 33 were malignant. The lymph nodes that were classified as Type 1 on endobronchial ultrasound elastography were benign in 26/27 (96.3%) and malignant in 1/27 (3.7%); for Type 2 lymph nodes, 15/20 (75.0%) were benign and 5/20 (25.0%) were malignant; Type 3 lymph nodes were benign in 4/31 (12.9%) and malignant in 27/31 (87.1%). In classifying Type 1 as 'benign' and Type 3 as 'malignant,' the sensitivity, specificity, positive predictive value, negative predictive value and diagnostic accuracy rates were 96.43%, 86.67%, 87.10%, 96.30%, 91.38%, respectively.
Conclusion: EBUS elastography of mediastinal and hilar lymph nodes is a noninvasive technique that can be performed reliably and may be helpful in the prediction of benign and malignant lymph nodes among mediastinal and hilar lymph nodes during EBUS-TBNA.
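The reported rates can be checked directly from the counts in the Results. The sketch below uses the standard definitions of the five measures; the function name is an assumption. The counts are taken from the abstract with the 20 Type 2 nodes left out, since the reported figures are reproduced only on the 58 Type 1 and Type 3 nodes.

```python
def diagnostic_metrics(tp, fn, tn, fp):
    """Standard 2x2 diagnostic measures, returned as fractions."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
    }


# Type 3 called malignant: 27 truly malignant (TP), 4 benign (FP).
# Type 1 called benign:    26 truly benign (TN),    1 malignant (FN).
metrics = diagnostic_metrics(tp=27, fn=1, tn=26, fp=4)
for name, value in metrics.items():
    print(f"{name}: {100 * value:.2f}%")
```

Run as written, this yields 96.43%, 86.67%, 87.10%, 96.30% and 91.38%, matching the abstract.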
Scalable parallel communications

NASA Technical Reports Server (NTRS)

Maly, K.; Khanna, S.; Overstreet, C. M.; Mukkamala, R.; Zubair, M.; Sekhar, Y. S.; Foudriat, E. C.

1992-01-01

Coarse-grain parallelism in networking (that is, the use of multiple protocol processors running replicated software sending over several physical channels) can be used to provide gigabit communications for a single application. Since parallel network performance is highly dependent on real issues such as hardware properties (e.g., memory speeds and cache hit rates), operating system overhead (e.g., interrupt handling), and protocol performance (e.g., effect of timeouts), we have performed detailed simulation studies of both a bus-based multiprocessor workstation node (based on the Sun Galaxy MP multiprocessor) and a distributed-memory parallel computer node (based on the Touchstone DELTA) to evaluate the behavior of coarse-grain parallelism. Our results indicate: (1) coarse-grain parallelism can deliver multiple 100 Mbps with currently available hardware platforms and existing networking protocols (such as Transmission Control Protocol/Internet Protocol (TCP/IP) and parallel Fiber Distributed Data Interface (FDDI) rings); (2) scale-up is near linear in n, the number of protocol processors, and channels (for small n and up to a few hundred Mbps); and (3) since these results are based on existing hardware without specialized devices (except perhaps for some simple modifications of the FDDI boards), this is a low cost solution to providing multiple 100 Mbps on current machines.
In addition, from both the performance analysis and the properties of these architectures, we conclude: (1) multiple processors providing identical services and the use of space division multiplexing for the physical channels can provide better reliability than monolithic approaches (it also provides graceful degradation and low-cost load balancing); (2) coarse-grain parallelism supports running several transport protocols in parallel to provide different types of service (for example, one TCP handles small messages for many users, while other TCPs running in parallel provide high bandwidth service to a single application); and (3) coarse-grain parallelism will be able to incorporate many future improvements from related work (e.g., reduced data movement, fast TCP, fine-grain parallelism), also with near linear speed-ups.
A Versatile Image Processor For Digital Diagnostic Imaging And Its Application In Computed Radiography

NASA Astrophysics Data System (ADS)

Blume, H.; Alexandru, R.; Applegate, R.; Giordano, T.; Kamiya, K.; Kresina, R.

1986-06-01

In a digital diagnostic imaging department, the majority of operations for handling and processing of images can be grouped into a small set of basic operations, such as image data buffering and storage, image processing and analysis, image display, image data transmission and image data compression. These operations occur in almost all nodes of the diagnostic imaging communications network of the department. An image processor architecture was developed in which each of these functions has been mapped into hardware and software modules. The modular approach has advantages in terms of economics, service, expandability and upgradeability. The architectural design is based on the principles of hierarchical functionality, distributed and parallel processing and aims at real time response. Parallel processing and real time response are facilitated in part by a dual bus system: a VME control bus and a high speed image data bus, consisting of 8 independent parallel 16-bit busses, capable of handling up to 144 MBytes/sec combined. The presented image processor is versatile enough to meet the video rate processing needs of digital subtraction angiography, the large pixel matrix processing requirements of static projection radiography, or the broad range of manipulation and display needs of a multi-modality diagnostic work station. Several hardware modules are described in detail. For illustrating the capabilities of the image processor, processed 2000 x 2000 pixel computed radiographs are shown and estimated computation times for executing the processing operations are presented.
The TurboLAN project. Phase 1: Protocol choices for high speed local area networks. Phase 2: TurboLAN Intelligent Network Adapter Card, (TINAC) architecture

NASA Technical Reports Server (NTRS)

Alkhatib, Hasan S.

1991-01-01

The hardware and the software architecture of the TurboLAN Intelligent Network Adapter Card (TINAC) are described. A high level as well as detailed treatment of the workings of various components of the TINAC are presented. The TINAC is divided into the following four major functional units: (1) the network access unit (NAU); (2) the buffer management unit; (3) the host interface unit; and (4) the node processor unit.
100 GB/S Time Division Multiplex (TDM) Access Nodes and Regenerators Based on Novel Loop Mirrors with High Nonlinearity Fibers

DTIC Science & Technology

2002-07-01

spectral components remain co-polarized. We confirmed that this was the case by passing the continuum through a polarizing beam splitter. The...propagation direction through polarization beam splitters and aligned along the other axis of the fiber. Co-propagating control and signal pulses...amplifier, PBS = polarization beam splitter. Figure 8. Eye diagram of header processor. This is the trace of the eye diagrams taken with the setup of Fig
Building Columbia from the SysAdmin View

NASA Technical Reports Server (NTRS)

Chan, David

2005-01-01

Project Columbia was built at NASA Ames Research Center in partnership with SGI and Intel. Columbia consists of 20 512-processor Altix machines with 440TB of storage and achieved 51.87 TeraFLOPS to be ranked the second fastest on the top 500 at SuperComputing 2004. Columbia was delivered, installed and put into production in 3 months. On average, a new Columbia node was brought into production in less than a week. Columbia's configuration, installation, and future plans will be discussed.
DOE Office of Scientific and Technical Information (OSTI.GOV)

Yoshii, Kazutomo; Llopis, Pablo; Zhang, Kaicheng

As CMOS scaling nears its end, parameter variations (process, temperature and voltage) are becoming a major concern. To overcome parameter variations and provide stability, modern processors are becoming dynamic, opportunistically adjusting voltage and frequency based on thermal and energy constraints, which negatively impacts traditional bulk-synchronous parallelism-minded hardware and software designs. As node-level architecture is growing in complexity, implementing variation control mechanisms only with hardware can be a challenging task. In this paper we investigate a software strategy to manage hardware-induced variations, leveraging low-level monitoring/controlling mechanisms.
Specifications of a Simulation Model for a Local Area Network Design in Support of Stock Point Logistics Integrated Communications Environment (SPLICE).

DTIC Science & Technology

1982-10-01

class queueing system with a preemptive-resume priority service discipline, as depicted in Figure 4.2. Concerning a SPLICLAN configuration a node can...processor can be modeled as a single resource, multi-class queueing system with a preemptive-resume priority structure as the one given in Figure 4.2. An...LOCAL AREA NETWORK DESIGN IN SUPPORT OF STOCK POINT LOGISTICS INTEGRATED COMMUNICATIONS ENVIRONMENT (SPLICE) by Ioannis Th. Mastrocostopoulos October
Functional Specification and Simulation of a Floating Point Co-Processor for SPUR

DTIC Science & Technology

1986-08-01

depend on this state will not be stable until the next phase; this leaves the problem of how to control events that must occur on phi 1 of a cycle. The...problems with the structure of the chip description. The worst of these problems is the absence of Slang constructs for coding separate chip component...constructs such as UNK as well. Another related problem was the inability to explicitly declare the size of Slang node values. While the correct
JSC Wireless Sensor Network Update

NASA Technical Reports Server (NTRS)

Wagner, Robert

2010-01-01

Sensor nodes are composed of three basic components. (1) Radio module: COTS radio module implementing a standardized WSN protocol; treated as a WSN modem by the main board. (2) Main board: contains the application processor (TI MSP430 microcontroller), memory, and power supply; responsible for sensor data acquisition, pre-processing, and task scheduling; re-used in every application with a growing library of embedded C code. (3) Sensor card: contains application-specific sensors, data conditioning hardware, and any advanced hardware not built into the main board (DSPs, faster A/D, etc.); requires (re-)development for each application.
A Future Accelerated Cognitive Distributed Hybrid Testbed for Big Data Science Analytics

NASA Astrophysics Data System (ADS)

Halem, M.; Prathapan, S.; Golpayegani, N.; Huang, Y.; Blattner, T.; Dorband, J. E.

2016-12-01

As increased sensor spectral data volumes from current and future Earth Observing satellites are assimilated into high-resolution climate models, intensive cognitive machine learning technologies are needed to data mine, extract and intercompare model outputs. It is clear today that the next generation of computers and storage, beyond petascale cluster architectures, will be data centric. They will manage data movement and process data in place. Future cluster nodes have been announced that integrate multiple CPUs with high-speed links to GPUs and MICs on their backplanes with massive non-volatile RAM and access to active flash RAM disk storage. Active Ethernet connected key value store disk storage drives with 10GbE or higher are now available through the Kinetic Open Storage Alliance. At the UMBC Center for Hybrid Multicore Productivity Research, a future state-of-the-art Accelerated Cognitive Computer System (ACCS) for Big Data science is being integrated into the current IBM iDataplex computational system `bluewave'. Based on the next gen IBM 200 PF Sierra processor, an interim two node IBM Power S822 testbed is being integrated with dual Power 8 processors with 10 cores, 1TB RAM, a PCIe to a K80 GPU and an FPGA Coherent Accelerated Processor Interface card to 20TB Flash RAM. This system is to be updated to the Power 8+, an NVlink 1.0 with the Pascal GPU late in 2016. Moreover, the Seagate 96TB Kinetic Disk system with 24 Ethernet connected active disks is integrated into the ACCS storage system. A Lightweight Virtual File System developed at the NASA GSFC is installed on bluewave. Since remote access to publicly available quantum annealing computers is available at several government labs, the ACCS will offer an in-line Restricted Boltzmann Machine optimization capability to the D-Wave 2X quantum annealing processor over the campus high speed 100 Gb network to Internet 2 for large files.
As an evaluation test of the cognitive functionality of the architecture, the following studies utilizing all the system components will be presented: (i) a near real time climate change study generating CO2 fluxes, (ii) a deep dive capability into an 8000 x 8000 pixel image pyramid display, and (iii) large dense and sparse eigenvalue decomposition.
Circuit for Communication Over Power Lines

NASA Technical Reports Server (NTRS)

Krasowski, Michael J.; Prokop, Norman F.; Greer, Lawrence C., III; Nappier, Jennifer

2011-01-01

Many distributed systems share common sensors and instruments along with a common power line supplying current to the system. A communication technique and circuit has been developed that allows for the simple inclusion of an instrument, sensor, or actuator node within any system containing a common power bus. Wherever power is available, a node can be added, which can then draw power for itself, its associated sensors, and actuators from the power bus all while communicating with other nodes on the power bus. The technique modulates a DC power bus through capacitive coupling using on-off keying (OOK), and receives and demodulates the signal from the DC power bus through the same capacitive coupling. The circuit acts as serial modem for the physical power line communication. The circuit and technique can be made of commercially available components or included in an application specific integrated circuit (ASIC) design, which allows for the circuit to be included in current designs with additional circuitry or embedded into new designs. This device and technique moves computational, sensing, and actuation abilities closer to the source, and allows for the networking of multiple similar nodes to each other and to a central processor. This technique also allows for reconfigurable systems by adding or removing nodes at any time. It can do so using nothing more than the in situ power wiring of the system.
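The on-off keying step described in this abstract can be illustrated with a minimal baseband sketch. It models only the envelope (signal present for a 1, absent for a 0) and a threshold detector; the capacitive coupling, carrier, and DC bus of the actual circuit are not modelled, and SAMPLES_PER_BIT and the threshold are assumed values, not figures from the source.

```python
# Minimal on-off keying (OOK) sketch at baseband.
# Assumptions (not from the source): 8 samples per bit, threshold 0.5.
SAMPLES_PER_BIT = 8


def byte_to_bits(value):
    """Return the 8 bits of a byte, most significant bit first."""
    return [(value >> shift) & 1 for shift in range(7, -1, -1)]


def bits_to_byte(bits):
    """Inverse of byte_to_bits."""
    value = 0
    for bit in bits:
        value = (value << 1) | bit
    return value


def ook_modulate(bits, samples_per_bit=SAMPLES_PER_BIT):
    """Emit a run of 1.0 samples for a 1 bit and 0.0 samples for a 0 bit."""
    samples = []
    for bit in bits:
        samples.extend([1.0 if bit else 0.0] * samples_per_bit)
    return samples


def ook_demodulate(samples, samples_per_bit=SAMPLES_PER_BIT, threshold=0.5):
    """Average each bit period and compare the mean against a threshold."""
    bits = []
    for start in range(0, len(samples) - samples_per_bit + 1, samples_per_bit):
        window = samples[start:start + samples_per_bit]
        bits.append(1 if sum(window) / len(window) > threshold else 0)
    return bits


if __name__ == "__main__":
    sent = byte_to_bits(0xA5)
    received = ook_demodulate(ook_modulate(sent))
    print(hex(bits_to_byte(received)))
```

Averaging over the bit period is one simple way to tolerate a constant offset on the line; a real receiver would also need bit synchronization, which this sketch leaves out.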

Energy modelling in sensor networks

NASA Astrophysics Data System (ADS)

Schmidt, D.; Krämer, M.; Kuhn, T.; Wehn, N.

2007-06-01

Wireless sensor networks are one of the key enabling technologies for the vision of ambient intelligence. Energy resources for sensor nodes are very scarce. A key challenge is the design of energy efficient communication protocols. Models of the energy consumption are needed to accurately simulate the efficiency of a protocol or application design, and can also be used for automatic energy optimizations in a model driven design process. We propose a novel methodology to create models for sensor nodes based on few simple measurements. In a case study the methodology was used to create models for MICAz nodes. The models were integrated in a simulation environment as well as in a SDL runtime framework of a model driven design process. Measurements on a test application that was created automatically from an SDL specification showed an 80% reduction in energy consumption compared to an implementation without power saving strategies.
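A common way to turn a few power measurements into an energy model is to assign each operating state an average power and sum power times time over a schedule. The sketch below illustrates that idea only; the abstract does not describe the authors' model structure, and every state name and power value here is a hypothetical placeholder, not a MICAz measurement.

```python
# State-based energy model sketch.
# All power figures are hypothetical placeholders (milliwatts).
POWER_MW = {
    "sleep": 0.05,
    "cpu_active": 25.0,
    "radio_rx": 60.0,
    "radio_tx": 55.0,
}


def energy_mj(schedule, power_mw=POWER_MW):
    """Sum power * duration over (state, seconds) pairs; mW * s gives mJ."""
    return sum(power_mw[state] * seconds for state, seconds in schedule)


# Two hypothetical 10-second schedules: radio always listening vs. duty-cycled.
ALWAYS_ON = [("radio_rx", 9.9), ("radio_tx", 0.1)]
DUTY_CYCLED = [("sleep", 9.0), ("radio_rx", 0.9), ("radio_tx", 0.1)]

if __name__ == "__main__":
    baseline = energy_mj(ALWAYS_ON)
    saving = 1.0 - energy_mj(DUTY_CYCLED) / baseline
    print(f"relative saving: {saving:.2f}")
```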
FLY MPI-2: a parallel tree code for LSS

NASA Astrophysics Data System (ADS)

Becciani, U.; Comparato, M.; Antonuccio-Delogu, V.

2006-04-01

New version program summary. Program title: FLY 3.1; Catalogue identifier: ADSC_v2_0; Licensing provisions: yes; Program summary URL: http://cpc.cs.qub.ac.uk/summaries/ADSC_v2_0; Program obtainable from: CPC Program Library, Queen's University of Belfast, N. Ireland; No. of lines in distributed program, including test data, etc.: 158 172; No. of bytes in distributed program, including test data, etc.: 4 719 953; Distribution format: tar.gz; Programming language: Fortran 90, C; Computer: Beowulf cluster, PC, MPP systems; Operating system: Linux, Aix; RAM: 100M words; Catalogue identifier of previous version: ADSC_v1_0; Journal reference of previous version: Comput. Phys. Comm. 155 (2003) 159; Does the new version supersede the previous version?: yes. Nature of problem: FLY is a parallel collisionless N-body code for the calculation of the gravitational force. Solution method: FLY is based on the hierarchical oct-tree domain decomposition introduced by Barnes and Hut (1986). Reasons for the new version: The new version of FLY is implemented by using the MPI-2 standard: the distributed version 3.1 was developed by using the MPICH2 library on a PC Linux cluster. Today the FLY performance allows us to consider the FLY code among the most powerful parallel codes for tree N-body simulations. Another important new feature regards the availability of an interface with hydrodynamical Paramesh based codes. Simulations must follow a box large enough to accurately represent the power spectrum of fluctuations on very large scales so that we may hope to compare them meaningfully with real data. The number of particles then sets the mass resolution of the simulation, which we would like to make as fine as possible.
The idea of building an interface between two codes that have different and complementary cosmological tasks allows us to execute complex cosmological simulations with FLY, specialized for DM evolution, and a code specialized for hydrodynamical components that uses a Paramesh block structure. Summary of revisions: The parallel communication schema was totally changed. The new version adopts the MPICH2 library. Now FLY can be executed on all Unix systems having an MPI-2 standard library. The main data structure is declared in a module procedure of FLY (fly_h.F90 routine). FLY creates the MPI Window object for one-sided communication for all the shared arrays, with a call like the following: CALL MPI_WIN_CREATE(POS, SIZE, REAL8, MPI_INFO_NULL, MPI_COMM_WORLD, WIN_POS, IERR). The following main window objects are created: win_pos, win_vel, win_acc (particle positions, velocities and accelerations); win_pos_cell, win_mass_cell, win_quad, win_subp, win_grouping (cell positions, masses, quadrupole momenta, tree structure and grouping cells). Other windows are created for dynamic load balance and global counters. Restrictions: The program uses the leapfrog integrator schema, but this could be changed by the user. Unusual features: FLY uses the MPI-2 standard: the MPICH2 library on Linux systems was adopted. To run this version of FLY the working directory must be shared among all the processors that execute FLY. Additional comments: Full documentation for the program is included in the distribution in the form of a README file, a User Guide and a Reference manuscript. Running time: IBM Linux Cluster 1350, 512 nodes with 2 processors for each node and 2 GB RAM for each processor, at Cineca, was adopted to make performance tests. Processor type: Intel Xeon Pentium IV 3.0 GHz and 512 KB cache (128 nodes have Nocona processors). Internal Network: Myricom LAN Card "C" Version and "D" Version. Operating System: Linux SuSE SLES 8.
The code was compiled using the mpif90 compiler version 8.1 and with basic optimization options in order to have performances that could be useful compared with other generic clusters Processors
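The summary above mentions that FLY uses a leapfrog integrator. The sketch below shows the standard kick-drift-kick form of leapfrog on a toy harmonic oscillator; it is an assumption that this is the variant of interest, since the summary does not say which form FLY implements, and the function names are illustrative only.

```python
# Kick-drift-kick leapfrog sketch on a toy problem.
# Assumption: the standard velocity-Verlet form; FLY's own variant is not
# specified in the program summary.


def leapfrog_step(pos, vel, accel_fn, dt):
    """Advance positions and velocities by one time step dt."""
    # Half kick: update velocities with the acceleration at the old positions.
    half_vel = [v + 0.5 * dt * a for v, a in zip(vel, accel_fn(pos))]
    # Drift: move positions with the half-step velocities.
    new_pos = [x + dt * v for x, v in zip(pos, half_vel)]
    # Half kick: finish the velocity update at the new positions.
    new_vel = [v + 0.5 * dt * a for v, a in zip(half_vel, accel_fn(new_pos))]
    return new_pos, new_vel


def harmonic_accel(pos):
    """Toy force law a = -x (unit mass, unit spring constant)."""
    return [-x for x in pos]


def total_energy(pos, vel):
    """Kinetic plus potential energy for the toy oscillator."""
    return 0.5 * sum(v * v for v in vel) + 0.5 * sum(x * x for x in pos)


def integrate(pos, vel, accel_fn, dt, steps):
    """Apply leapfrog_step repeatedly and return the final state."""
    for _ in range(steps):
        pos, vel = leapfrog_step(pos, vel, accel_fn, dt)
    return pos, vel


if __name__ == "__main__":
    final_pos, final_vel = integrate([1.0], [0.0], harmonic_accel, 0.01, 1000)
    print(total_energy(final_pos, final_vel))
```

In an N-body code the acceleration function would be the tree-based gravitational force evaluation; the integrator itself is unchanged.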
A Parallel Genetic Algorithm for Automated Electronic Circuit Design

NASA Technical Reports Server (NTRS)

Long, Jason D.; Colombano, Silvano P.; Haith, Gary L.; Stassinopoulos, Dimitris

2000-01-01

Parallelized versions of genetic algorithms (GAs) are popular primarily for three reasons: the GA is an inherently parallel algorithm, typical GA applications are very compute intensive, and powerful computing platforms, especially Beowulf-style computing clusters, are becoming more affordable and easier to implement. In addition, the low communication bandwidth required allows the use of inexpensive networking hardware such as standard office ethernet. In this paper we describe a parallel GA and its use in automated high-level circuit design. Genetic algorithms are a type of trial-and-error search technique that is guided by principles of Darwinian evolution. Just as the genetic material of two living organisms can intermix to produce offspring that are better adapted to their environment, GAs expose genetic material, frequently strings of 1s and 0s, to the forces of artificial evolution: selection, mutation, recombination, etc. GAs start with a pool of randomly-generated candidate solutions which are then tested and scored with respect to their utility. Solutions are then bred by probabilistically selecting high quality parents and recombining their genetic representations to produce offspring solutions. Offspring are typically subjected to a small amount of random mutation. After a pool of offspring is produced, this process iterates until a satisfactory solution is found or an iteration limit is reached. Genetic algorithms have been applied to a wide variety of problems in many fields, including chemistry, biology, and many engineering disciplines. There are many styles of parallelism used in implementing parallel GAs. One such method is called the master-slave or processor farm approach. In this technique, slave nodes are used solely to compute fitness evaluations (the most time consuming part). The master processor collects fitness scores from the nodes and performs the genetic operators (selection, reproduction, variation, etc.).
Because of dependency issues in the GA, it is possible to have idle processors. However, as long as the load at each processing node is similar, the processors are kept busy nearly all of the time. In applying GAs to circuit design, a suitable genetic representation is that of a circuit-construction program. We discuss one such circuit-construction programming language and show how evolution can generate useful analog circuit designs. This language has the desirable property that virtually all sets of combinations of primitives result in valid circuit graphs. Our system allows circuit size (number of devices), circuit topology, and device values to be evolved. Using a parallel genetic algorithm and circuit simulation software, we present experimental results as applied to three analog filter and two amplifier design tasks. For example, a figure shows an 85 dB amplifier design evolved by our system, and another figure shows the performance of that circuit (gain and frequency response). In all tasks, our system is able to generate circuits that achieve the target specifications.
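The processor-farm arrangement described in this abstract, where workers only evaluate fitness and one coordinating process applies the genetic operators, can be sketched as follows. This is a toy under stated assumptions: the fitness function counts 1 bits instead of running a circuit simulation, selection is simple truncation with one elite rather than the probabilistic selection the abstract mentions, and a thread pool stands in for cluster nodes (in CPython, threads would not speed up CPU-bound fitness; a process pool or MPI would be the realistic choice).

```python
# Processor-farm GA sketch: workers score genomes, the coordinator breeds.
# All parameters and the toy fitness function are illustrative assumptions.
import random
from concurrent.futures import ThreadPoolExecutor

GENOME_BITS = 32


def fitness(genome):
    """Toy fitness: number of 1 bits (stand-in for an expensive simulation)."""
    return sum(genome)


def evolve(generations=30, pop_size=40, workers=4, seed=0):
    """Run the GA and return the best fitness seen in each generation."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(GENOME_BITS)]
                  for _ in range(pop_size)]
    history = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for _ in range(generations):
            # Workers: fitness evaluation only.
            scores = list(pool.map(fitness, population))
            history.append(max(scores))
            # Coordinator: rank, select, recombine, mutate.
            ranked = [genome for _, genome in
                      sorted(zip(scores, population),
                             key=lambda pair: pair[0], reverse=True)]
            parents = ranked[:pop_size // 2]
            children = [ranked[0][:]]  # keep the best genome unchanged
            while len(children) < pop_size:
                mother, father = rng.sample(parents, 2)
                cut = rng.randrange(1, GENOME_BITS)
                child = mother[:cut] + father[cut:]  # one-point crossover
                if rng.random() < 0.2:
                    child[rng.randrange(GENOME_BITS)] ^= 1  # bit-flip mutation
                children.append(child)
            population = children
    return history


if __name__ == "__main__":
    print(evolve()[-1])
```

Because the best genome is copied into each new generation, the per-generation best score in this sketch never decreases.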
Female-specific down-regulation of tissue-PMN drives impaired Treg and amplified effector T cell responses in autoimmune dry eye disease

PubMed Central

Gao, Yuan; Min, Kyungji; Zhang, Yibing; Su, John; Greenwood, Matthew; Gronert, Karsten

2015-01-01

Immune-driven dry eye disease primarily affects women; the cause for this sex-specific prevalence is unknown. PMN have distinct phenotypes that drive inflammation but also regulate lymphocytes and are the rate-limiting cell for generating anti-inflammatory lipoxin A4 (LXA4). Estrogen regulates the LXA4 circuit to induce delayed female-specific wound healing in the cornea. However, the role of PMN in dry eye disease remains unexplored. We discovered a LXA4-producing tissue-PMN population in the corneal limbus, lacrimal glands and cervical lymph nodes of healthy male and female mice. These tissue-PMN, unlike inflammatory-PMN, expressed a highly amplified LXA4 circuit and were sex-specifically regulated during immune-driven dry eye disease. Desiccating stress in females, unlike in males, triggered a remarkable decrease in lymph node PMN and LXA4 formation that remained depressed during dry eye disease. Depressed lymph node PMN and LXA4 in females correlated with an increase in T effector cells (TH1 and TH17), a decrease in regulatory T cells (Treg) and increased dry eye pathogenesis. Antibody depletion of tissue-PMN abrogated LXA4 formation in lymph nodes and caused a marked increase in TH1 and TH17 and a decrease in Treg cells. To establish an immune regulatory role for PMN-derived LXA4 in dry eye, females were treated with LXA4. LXA4 treatment markedly inhibited TH1 and TH17 and amplified Treg cells in draining lymph nodes, while reducing dry eye pathogenesis. These results identify female-specific regulation of LXA4-producing tissue-PMN as a potential key factor in aberrant T effector cell activation and initiation of immune-driven dry eye disease. PMID:26324767
Dynamic Voltage-Frequency and Workload Joint Scaling Power Management for Energy Harvesting Multi-Core WSN Node SoC

PubMed Central

Li, Xiangyu; Xie, Nijie; Tian, Xinyue

2017-01-01

This paper proposes a scheduling and power management solution for energy harvesting heterogeneous multi-core WSN node SoC such that the system continues to operate perennially and uses the harvested energy efficiently. The solution consists of a heterogeneous multi-core system oriented task scheduling algorithm and a low-complexity dynamic workload scaling and configuration optimization algorithm suitable for light-weight platforms. Moreover, because the power consumption of most WSN applications is data dependent, we introduce a branch-handling mechanism into the solution as well. The experimental results show that the proposed algorithm can operate in real-time on a lightweight embedded processor (MSP430), and that it enables a system to perform more useful work and to use more than 99.9% of the power budget. PMID:28208730
A novel strategy for load balancing of distributed medical applications.

PubMed

Logeswaran, Rajasvaran; Chen, Li-Choo

2012-04-01

Current trends in medicine, specifically in the electronic handling of medical applications, ranging from digital imaging, paperless hospital administration and electronic medical records, telemedicine, to computer-aided diagnosis, create a burden on the network. Distributed Service Architectures, such as Intelligent Network (IN), Telecommunication Information Networking Architecture (TINA) and Open Service Access (OSA), are able to meet this new challenge. Distribution enables computational tasks to be spread among multiple processors; hence, performance is an important issue. This paper proposes a novel approach in load balancing, the Random Sender Initiated Algorithm, for distribution of tasks among several nodes sharing the same computational object (CO) instances in Distributed Service Architectures. Simulations illustrate that the proposed algorithm produces better network performance than the benchmark load balancing algorithms, the Random Node Selection Algorithm and the Shortest Queue Algorithm, especially under medium and heavily loaded conditions.
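The two benchmark policies named in this abstract can be illustrated with a toy static assignment of equal-sized tasks. This sketch covers only those benchmarks under simplifying assumptions (no task completion, no network delay, unit-size tasks); the abstract does not specify how the proposed Random Sender Initiated Algorithm works, so it is not modelled here.

```python
# Toy comparison of two load-balancing benchmarks on unit-size tasks.
# Assumptions: tasks never complete during assignment and all cost the same.
import random


def random_node_selection(num_tasks, num_nodes, seed=0):
    """Send each task to a uniformly random node; return queue lengths."""
    rng = random.Random(seed)
    queues = [0] * num_nodes
    for _ in range(num_tasks):
        queues[rng.randrange(num_nodes)] += 1
    return queues


def shortest_queue(num_tasks, num_nodes):
    """Send each task to the node with the fewest queued tasks."""
    queues = [0] * num_nodes
    for _ in range(num_tasks):
        target = min(range(num_nodes), key=lambda node: queues[node])
        queues[target] += 1
    return queues


if __name__ == "__main__":
    print("random:  ", random_node_selection(100, 4))
    print("shortest:", shortest_queue(100, 4))
```

Shortest-queue placement needs up-to-date queue lengths from every node, which is the kind of state-exchange cost that motivates cheaper randomized policies.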
Dynamic Voltage-Frequency and Workload Joint Scaling Power Management for Energy Harvesting Multi-Core WSN Node SoC.

PubMed

Li, Xiangyu; Xie, Nijie; Tian, Xinyue

2017-02-08

This paper proposes a scheduling and power management solution for energy harvesting heterogeneous multi-core WSN node SoC such that the system continues to operate perennially and uses the harvested energy efficiently. The solution consists of a heterogeneous multi-core system oriented task scheduling algorithm and a low-complexity dynamic workload scaling and configuration optimization algorithm suitable for light-weight platforms. Moreover, because the power consumption of most WSN applications is data dependent, we introduce a branch-handling mechanism into the solution as well. The experimental results show that the proposed algorithm can operate in real-time on a lightweight embedded processor (MSP430), and that it enables a system to perform more useful work and to use more than 99.9% of the power budget.
The P-Mesh: A Commodity-based Scalable Network Architecture for Clusters

NASA Technical Reports Server (NTRS)

Nitzberg, Bill; Kuszmaul, Chris; Stockdale, Ian; Becker, Jeff; Jiang, John; Wong, Parkson; Tweten, David (Technical Monitor)

1998-01-01

We designed a new network architecture, the P-Mesh, which combines the scalability and fault resilience of a torus with the performance of a switch. We compare the scalability, performance, and cost of the hub, switch, torus, tree, and P-Mesh architectures. The latter three are capable of scaling to thousands of nodes; however, the torus has severe performance limitations with that many processors. The tree and P-Mesh have similar latency, bandwidth, and bisection bandwidth, but the P-Mesh outperforms the switch architecture (a lower bound for tree performance) on 16-node NAS Parallel Benchmark tests by up to 23%, and costs 40% less. Further, the P-Mesh has better fault resilience characteristics. The P-Mesh architecture trades increased management overhead for lower cost, and is a good bridging technology while tree uplinks remain expensive.
Diversity Driven Coexistence: Collective Stability in the Cyclic Competition of Three Species

NASA Astrophysics Data System (ADS)

Bassler, Kevin E.; Frey, Erwin; Zia, R. K. P.

2015-03-01

The basic physics of collective behavior are often difficult to quantify and understand, particularly when the system is driven out of equilibrium. Many complex systems are usefully described as complex networks, consisting of nodes and links. The nodes specify individual components of the system and the links describe their interactions. When both nodes and links change dynamically, or `co-evolve', as happens in many realistic systems, complex mathematical structures are encountered, posing challenges to our understanding. In this context, we introduce a minimal system of node and link degrees of freedom, co-evolving with stochastic rules. Specifically, we show that diversity of social temperament (intro- or extroversion) can produce collective stable coexistence when three species compete cyclically. It is well-known that when only extroverts exist in a stochastic rock-paper-scissors game, or in a conserved predator-prey, Lotka-Volterra system, extinction occurs at times of O(N), where N is the number of nodes. We find that when both introverts and extroverts exist, where introverts sever social interactions and extroverts create them, collective coexistence prevails in long-living, quasi-stationary states. Work supported by the NSF through Grants DMR-1206839 (KEB) and DMR-1244666 (RKPZ), and by the AFOSR and DARPA through Grant FA9550-12-1-0405 (KEB).
Programmable Quantum Photonic Processor Using Silicon Photonics

DTIC Science & Technology

2017-04-01

quantum information processing and quantum sensing, ranging from linear optics quantum computing and quantum simulation to quantum...transformers have driven experimental and theoretical advances in quantum simulation, cluster-state quantum computing, all-optical quantum repeaters...neuromorphic computing, and other applications. In addition, we developed new schemes for ballistic quantum computation, new methods for
International Space Station USOS Waste and Hygiene Compartment Development

NASA Technical Reports Server (NTRS)

Link, Dwight E., Jr.; Broyan, James Lee, Jr.; Gelmis, Karen; Philistine, Cynthia; Balistreri, Steven

2007-01-01

The International Space Station (ISS) currently provides human waste collection and hygiene facilities in the Russian Segment Service Module (SM) which supports a three person crew. Additional hardware is planned for the United States Operational Segment (USOS) to support expansion of the crew to six person capability. The additional hardware will be integrated in an ISS standard equipment rack structure that was planned to be installed in the Node 3 element; however, the ISS Program Office recently directed implementation of the rack, or Waste and Hygiene Compartment (WHC), into the U.S. Laboratory element to provide early operational capability. In this configuration, preserved urine from the WHC waste collection system can be processed by the Urine Processor Assembly (UPA) in either the U.S. Lab or Node 3 to recover water for crew consumption or oxygen production. The human waste collection hardware is derived from the Service Module system and is provided by RSC-Energia. This paper describes the concepts, design, and integration of the WHC waste collection hardware into the USOS including integration with U.S. Lab and Node 3 systems.
An Embedded Sensor Node Microcontroller with Crypto-Processors.

PubMed

Panić, Goran; Stecklina, Oliver; Stamenković, Zoran

2016-04-27

Wireless sensor network applications range from industrial automation and control, agricultural and environmental protection, to surveillance and medicine. In most applications, data are highly sensitive and must be protected from any type of attack and abuse. Security challenges in wireless sensor networks are mainly defined by the power and computing resources of sensor devices, memory size, quality of radio channels and susceptibility to physical capture. In this article, an embedded sensor node microcontroller designed to support sensor network applications with severe security demands is presented. It features a low power 16-bit processor core supported by a number of hardware accelerators designed to perform complex operations required by advanced crypto algorithms. The microcontroller integrates an embedded Flash and an 8-channel 12-bit analog-to-digital converter making it a good solution for low-power sensor nodes. The article discusses the most important security topics in wireless sensor networks and presents the architecture of the proposed hardware solution. Furthermore, it gives details on the chip implementation, verification and hardware evaluation. Finally, the chip power dissipation and performance figures are estimated and analyzed.
An Embedded Sensor Node Microcontroller with Crypto-Processors

PubMed Central

Panić, Goran; Stecklina, Oliver; Stamenković, Zoran

2016-01-01

Wireless sensor network applications range from industrial automation and control, agricultural and environmental protection, to surveillance and medicine. In most applications, data are highly sensitive and must be protected from any type of attack and abuse. Security challenges in wireless sensor networks are mainly defined by the power and computing resources of sensor devices, memory size, quality of radio channels and susceptibility to physical capture. In this article, an embedded sensor node microcontroller designed to support sensor network applications with severe security demands is presented. It features a low power 16-bit processor core supported by a number of hardware accelerators designed to perform complex operations required by advanced crypto algorithms. The microcontroller integrates an embedded Flash and an 8-channel 12-bit analog-to-digital converter making it a good solution for low-power sensor nodes. The article discusses the most important security topics in wireless sensor networks and presents the architecture of the proposed hardware solution. Furthermore, it gives details on the chip implementation, verification and hardware evaluation. Finally, the chip power dissipation and performance figures are estimated and analyzed. PMID:27128925
Performance of a parallel thermal-hydraulics code TEMPEST

DOE Office of Scientific and Technical Information (OSTI.GOV)

Fann, G.I.; Trent, D.S.

The authors describe the parallelization of the Tempest thermal-hydraulics code. The serial version of this code is used for production quality 3-D thermal-hydraulics simulations. Good speedup was obtained with a parallel diagonally preconditioned BiCGStab non-symmetric linear solver, using a spatial domain decomposition approach for the semi-iterative pressure-based and mass-conserved algorithm. The test case used here to illustrate the performance of the BiCGStab solver is a 3-D natural convection problem modeled using finite volume discretization in cylindrical coordinates. The BiCGStab solver replaced the LSOR-ADI method for solving the pressure equation in TEMPEST. BiCGStab also solves the coupled thermal energy equation. Scaling performance for 3 problem sizes (221220 nodes, 358120 nodes, and 701220 nodes) is presented. These problems were run on 2 different parallel machines: IBM-SP and SGI PowerChallenge. The largest problem attains a speedup of 68 on a 128-processor IBM-SP. In real terms, this is over 34 times faster than the fastest serial production time using the LSOR-ADI solver.
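The speedup figure quoted above can be turned into a parallel efficiency with one division. The helper below is a generic illustration (the function name is an assumption, not from the source); the inputs 68 and 128 are the values reported in the abstract.

```python
# Parallel efficiency from a reported speedup.
# Inputs 68 and 128 come from the abstract; the helper name is illustrative.


def parallel_efficiency(speedup, processors):
    """Efficiency = speedup / processor count (1.0 means ideal scaling)."""
    if processors <= 0:
        raise ValueError("processors must be positive")
    return speedup / processors


if __name__ == "__main__":
    eff = parallel_efficiency(68, 128)
    print(f"efficiency: {eff:.1%}")
```

The separate "over 34 times faster" figure is measured against a different serial baseline (the LSOR-ADI solver), so it is not directly comparable to the speedup of 68.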
The Evolution of Software and Its Impact on Complex System Design in Robotic Spacecraft Embedded Systems

NASA Technical Reports Server (NTRS)

Butler, Roy

2013-01-01

The growth in computer hardware performance, coupled with reduced energy requirements, has led to a rapid expansion of the resources available to software systems, driving them towards greater logical abstraction, flexibility, and complexity. This shift in focus from compacting functionality into a limited field towards developing layered, multi-state architectures in a grand field has both driven and been driven by the history of embedded processor design in the robotic spacecraft industry. The combinatorial growth of interprocess conditions is accompanied by benefits (concurrent development, situational autonomy, and evolution of goals) and drawbacks (late integration, non-deterministic interactions, and multifaceted anomalies) in achieving mission success, as illustrated by the case of the Mars Reconnaissance Orbiter. Approaches to optimizing the benefits while mitigating the drawbacks have taken the form of the formalization of requirements, modular design practices, extensive system simulation, and spacecraft data trend analysis. The growth of hardware capability and software complexity can be expected to continue, with future directions including stackable commodity subsystems, computer-generated algorithms, runtime reconfigurable processors, and greater autonomy.
Dense, Efficient Chip-to-Chip Communication at the Extremes of Computing

ERIC Educational Resources Information Center

Loh, Matthew

2013-01-01

The scalability of CMOS technology has driven computation into a diverse range of applications across the power consumption, performance and size spectra. Communication is a necessary adjunct to computation, and whether this is to push data from node-to-node in a high-performance computing cluster or from the receiver of wireless link to a neural…
Efficient Execution of Microscopy Image Analysis on CPU, GPU, and MIC Equipped Cluster Systems.

PubMed

Andrade, G; Ferreira, R; Teodoro, George; Rocha, Leonardo; Saltz, Joel H; Kurc, Tahsin

2014-10-01

High performance computing is experiencing a major paradigm shift with the introduction of accelerators, such as graphics processing units (GPUs) and Intel Xeon Phi (MIC). These processors have made available a tremendous computing power at low cost, and are transforming machines into hybrid systems equipped with CPUs and accelerators. Although these systems can deliver a very high peak performance, making full use of their resources in real-world applications is a complex problem. Most current applications deployed to these machines are still being executed in a single processor, leaving other devices underutilized. In this paper we explore a scenario in which applications are composed of hierarchical data flow tasks which are allocated to nodes of a distributed memory machine in coarse-grain, but each of them may be composed of several finer-grain tasks which can be allocated to different devices within the node. We propose and implement novel performance aware scheduling techniques that can be used to allocate tasks to devices. We evaluate our techniques using a pathology image analysis application used to investigate brain cancer morphology, and our experimental evaluation shows that the proposed scheduling strategies significantly outperform other efficient scheduling techniques, such as Heterogeneous Earliest Finish Time - HEFT, in cooperative executions using CPUs, GPUs, and MICs. We also experimentally show that our strategies are less sensitive to inaccuracy in the scheduling input data and that the performance gains are maintained as the application scales.
Efficient Execution of Microscopy Image Analysis on CPU, GPU, and MIC Equipped Cluster Systems

PubMed Central

Andrade, G.; Ferreira, R.; Teodoro, George; Rocha, Leonardo; Saltz, Joel H.; Kurc, Tahsin

2015-01-01

High performance computing is experiencing a major paradigm shift with the introduction of accelerators, such as graphics processing units (GPUs) and Intel Xeon Phi (MIC). These processors have made available a tremendous computing power at low cost, and are transforming machines into hybrid systems equipped with CPUs and accelerators. Although these systems can deliver a very high peak performance, making full use of their resources in real-world applications is a complex problem. Most current applications deployed to these machines are still being executed in a single processor, leaving other devices underutilized. In this paper we explore a scenario in which applications are composed of hierarchical data flow tasks which are allocated to nodes of a distributed memory machine in coarse-grain, but each of them may be composed of several finer-grain tasks which can be allocated to different devices within the node. We propose and implement novel performance aware scheduling techniques that can be used to allocate tasks to devices. We evaluate our techniques using a pathology image analysis application used to investigate brain cancer morphology, and our experimental evaluation shows that the proposed scheduling strategies significantly outperform other efficient scheduling techniques, such as Heterogeneous Earliest Finish Time - HEFT, in cooperative executions using CPUs, GPUs, and MICs. We also experimentally show that our strategies are less sensitive to inaccuracy in the scheduling input data and that the performance gains are maintained as the application scales. PMID:26640423
Modeling a Million-Node Slim Fly Network Using Parallel Discrete-Event Simulation

DOE Office of Scientific and Technical Information (OSTI.GOV)

Wolfe, Noah; Carothers, Christopher; Mubarak, Misbah

As supercomputers close in on exascale performance, the increased number of processors and processing power translates to an increased demand on the underlying network interconnect. The Slim Fly network topology, a new low-diameter and low-latency interconnection network, is gaining interest as one possible solution for next-generation supercomputing interconnect systems. In this paper, we present a high-fidelity Slim Fly flit-level model leveraging the Rensselaer Optimistic Simulation System (ROSS) and Co-Design of Exascale Storage (CODES) frameworks. We validate our Slim Fly model with the Kathareios et al. Slim Fly model results provided at moderately sized network scales. We further scale the model size up to an unprecedented 1 million compute nodes; and through visualization of network simulation metrics such as link bandwidth, packet latency, and port occupancy, we get an insight into the network behavior at the million-node scale. We also show linear strong scaling of the Slim Fly model on an Intel cluster achieving a peak event rate of 36 million events per second using 128 MPI tasks to process 7 billion events. Detailed analysis of the underlying discrete-event simulation performance shows that a million-node Slim Fly model simulation can execute in 198 seconds on the Intel cluster.
Interface Supports Lightweight Subsystem Routing for Flight Applications

NASA Technical Reports Server (NTRS)

Lux, James P.; Block, Gary L.; Ahmad, Mohammad; Whitaker, William D.; Dillon, James W.

2010-01-01

A wireless avionics interface exploits the constrained nature of data networks in flight systems to use a lightweight routing method. This simplified routing means that a processor is not required, and the logic can be implemented as an intellectual property (IP) core in a field-programmable gate array (FPGA). The FPGA can be shared with the flight subsystem application. In addition, the router is aware of redundant subsystems, and can be configured to provide hot standby support as part of the interface. This simplifies implementation of flight applications requiring hot-standby support. When a valid inbound packet is received from the network, the destination node address is inspected to determine whether the packet is to be processed by this node. Each node has routing tables for the next neighbor node to guide the packet to the destination node. If it is to be processed, the final packet destination is inspected to determine whether the packet is to be forwarded to another node, or routed locally. If the packet is local, it is sent to an Applications Data Interface (ADI), which is attached to a local flight application. Under this scheme, an interface can support many applications in a subsystem supporting a high level of subsystem integration. If the packet is to be forwarded to another node, it is sent to the outbound packet router. The outbound packet router receives packets from an ADI or a packet to be forwarded. It then uses a lookup table to determine the next destination for the packet. Upon detecting a remote subsystem failure, the routing table can be updated to autonomously bypass the failed subsystem.
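The routing decision in this abstract (deliver locally, or forward via a lookup table that can be rewritten after a failure) might be sketched as follows. This is only an illustration in Python, not the FPGA logic from the report; the `Node` class, the `next_hop` and `standby_for` tables, and the `mark_failed` method are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical model of one interface node; not the actual IP core."""
    address: int
    # Lookup table: final destination address -> next neighbor address.
    next_hop: dict = field(default_factory=dict)
    # Assumed redundancy map: primary subsystem -> hot-standby subsystem.
    standby_for: dict = field(default_factory=dict)
    # Payloads handed to the local application data interface.
    delivered: list = field(default_factory=list)

    def route(self, destination, payload):
        """Return ("local", address) or ("forward", neighbor) for a packet."""
        if destination == self.address:
            # Local packet: hand it to the application data interface.
            self.delivered.append(payload)
            return ("local", self.address)
        # Otherwise consult the lookup table for the next neighbor.
        return ("forward", self.next_hop[destination])

    def mark_failed(self, failed):
        """Rewrite table entries so traffic bypasses a failed subsystem."""
        backup = self.standby_for.get(failed)
        if backup is None:
            return
        for dest, hop in list(self.next_hop.items()):
            if hop == failed:
                self.next_hop[dest] = backup


node = Node(address=1, next_hop={2: 2, 3: 2}, standby_for={2: 4})
before = node.route(3, b"telemetry")   # forwarded through neighbor 2
node.mark_failed(2)
after = node.route(3, b"telemetry")    # now forwarded through standby 4
```

Because the decision reduces to an address comparison and a table lookup, it is plausible that no processor is needed, which matches the abstract's claim that the logic fits in an FPGA core.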

Scalable NIC-based reduction on large-scale clusters

DOE Office of Scientific and Technical Information (OSTI.GOV)

Moody, A.; Fernández, J. C.; Petrini, F.

2003-01-01

Many parallel algorithms require efficient support for reduction collectives. Over the years, researchers have developed optimal reduction algorithms by taking into account system size, data size, and complexities of reduction operations. However, all of these algorithms have assumed the fact that the reduction processing takes place on the host CPU. Modern Network Interface Cards (NICs) sport programmable processors with substantial memory and thus introduce a fresh variable into the equation. This raises the following interesting challenge: Can we take advantage of modern NICs to implement fast reduction operations? In this paper, we take on this challenge in the context of large-scale clusters. Through experiments on the 960-node, 1920-processor ASCI Linux Cluster (ALC) located at the Lawrence Livermore National Laboratory, we show that NIC-based reductions indeed perform with reduced latency and improved consistency over host-based algorithms for the common case and that these benefits scale as the system grows. In the largest configuration tested--1812 processors--our NIC-based algorithm can sum a single element vector in 73 μs with 32-bit integers and in 118 μs with 64-bit floating-point numbers. These results represent an improvement, respectively, of 121% and 39% with respect to the production-level MPI library.
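For readers unfamiliar with reduction collectives, the sketch below shows a generic pairwise tree reduction in Python. It is not the NIC-based algorithm from the paper, whose details the abstract does not give; `tree_reduce` is a hypothetical helper that only illustrates why the number of combining rounds grows logarithmically with the number of participants.

```python
def tree_reduce(values, op):
    """Combine a non-empty sequence pairwise in rounds.

    Returns (result, rounds), where rounds is the number of combining
    steps that could each run in parallel across participants.
    """
    level = list(values)
    rounds = 0
    while len(level) > 1:
        # Combine neighbors (0,1), (2,3), ...; an odd element carries over.
        paired = [op(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            paired.append(level[-1])
        level = paired
        rounds += 1
    return level[0], rounds


# One value per processor in the largest configuration the abstract mentions.
total, rounds = tree_reduce(range(1812), lambda a, b: a + b)
```

With 1812 participants the loop finishes in 11 rounds, so the per-round latency (host CPU versus NIC) dominates the total time of a small reduction.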
Spaceborne Processor Array

NASA Technical Reports Server (NTRS)

Chow, Edward T.; Schatzel, Donald V.; Whitaker, William D.; Sterling, Thomas

2008-01-01

A Spaceborne Processor Array in Multifunctional Structure (SPAMS) can lower the total mass of the electronic and structural overhead of spacecraft, resulting in reduced launch costs, while increasing the science return through dynamic onboard computing. SPAMS integrates the multifunctional structure (MFS) and the Gilgamesh Memory, Intelligence, and Network Device (MIND) multi-core in-memory computer architecture into a single-system super-architecture. This transforms every inch of a spacecraft into a sharable, interconnected, smart computing element to increase computing performance while simultaneously reducing mass. The MIND in-memory architecture provides a foundation for high-performance, low-power, and fault-tolerant computing. The MIND chip has an internal structure that includes memory, processing, and communication functionality. The Gilgamesh is a scalable system comprising multiple MIND chips interconnected to operate as a single, tightly coupled, parallel computer. The array of MIND components shares a global, virtual name space for program variables and tasks that are allocated at run time to the distributed physical memory and processing resources. Individual processor-memory nodes can be activated or powered down at run time to provide active power management and to configure around faults. A SPAMS system is comprised of a distributed Gilgamesh array built into MFS, interfaces into instrument and communication subsystems, a mass storage interface, and a radiation-hardened flight computer.
Generic Divide and Conquer Internet-Based Computing

NASA Technical Reports Server (NTRS)

Follen, Gregory J. (Technical Monitor); Radenski, Atanas

2003-01-01

The growth of Internet-based applications and the proliferation of networking technologies have been transforming traditional commercial application areas as well as computer and computational sciences and engineering. This growth stimulates the exploration of Peer to Peer (P2P) software technologies that can open new research and application opportunities not only for the commercial world, but also for the scientific and high-performance computing applications community. The general goal of this project is to achieve better understanding of the transition to Internet-based high-performance computing and to develop solutions for some of the technical challenges of this transition. In particular, we are interested in creating long-term motivation for end users to provide their idle processor time to support computationally intensive tasks. We believe that a practical P2P architecture should provide useful service to both clients with high-performance computing needs and contributors of lower-end computing resources. To achieve this, we are designing a dual-service architecture for P2P high-performance divide-and-conquer computing; we are also experimenting with a prototype implementation. Our proposed architecture incorporates a master server, utilizes dual satellite servers, and operates on the Internet in a dynamically changing large configuration of lower-end nodes provided by volunteer contributors. A dual satellite server comprises a high-performance computing engine and a lower-end contributor service engine. The computing engine provides generic support for divide-and-conquer computations. The service engine is intended to provide free useful HTTP-based services to contributors of lower-end computing resources. Our proposed architecture is complementary to and accessible from computational grids, such as Globus, Legion, and Condor. Grids provide remote access to existing higher-end computing resources; in contrast, our goal is to utilize idle processor time of lower-end Internet nodes. Our project is focused on a generic divide-and-conquer paradigm and on mobile applications of this paradigm that can operate on a loose and ever-changing pool of lower-end Internet nodes.
Endobronchial ultrasound elastography: a new method in endobronchial ultrasound-guided transbronchial needle aspiration.

PubMed

Jiang, Jun-Hong; Turner, J Francis; Huang, Jian-An

2015-12-01

TBNA through the flexible bronchoscope is a 37-year-old technology that utilizes a TBNA needle to puncture the bronchial wall and obtain specimens of peribronchial and mediastinal lesions through the flexible bronchoscope for the diagnosis of benign and malignant diseases in the mediastinum and lung. Since 2002, the Olympus Company developed the first generation ultrasound equipment for use in the airway, initially utilizing an ultrasound probe introduced through the working channel followed by incorporation of a fixed linear ultrasound array at the distal tip of the bronchoscope. This new bronchoscope equipped with a convex type ultrasound probe on the tip was subsequently introduced into clinical practice. The convex probe (CP)-EBUS allows real-time endobronchial ultrasound-guided transbronchial needle aspiration (EBUS-TBNA) of mediastinal and hilar lymph nodes. EBUS-TBNA is a minimally invasive procedure performed under local anesthesia that has been shown to have a high sensitivity and diagnostic yield for lymph node staging of lung cancer. In 10 years of EBUS development, the Olympus Company developed the second generation EBUS bronchoscope (BF-UC260FW) with the ultrasound image processor (EU-M1), and in 2013 introduced a new ultrasound image processor (EU-M2) into clinical practice. FUJI company has also developed a curvilinear array endobronchial ultrasound bronchoscope (EB-530 US) that makes it easier for the operator to master the operation of the ultrasonic bronchoscope. Also, the new thin convex probe endobronchial ultrasound bronchoscope (TCP-EBUS) is able to visualize one to three bifurcations distal to the current CP-EBUS. The emergence of EBUS-TBNA has also been accompanied by innovation in EBUS instruments. EBUS elastography is, then, a new technique for describing the compliance of structures during EBUS, which may be of use in the determination of metastasis to the mediastinal and hilar lymph nodes. This article describes these new EBUS techniques and reviews the relevant literature.
Knowledge diffusion of dynamical network in terms of interaction frequency.

PubMed

Liu, Jian-Guo; Zhou, Qing; Guo, Qiang; Yang, Zhen-Hua; Xie, Fei; Han, Jing-Ti

2017-09-07

In this paper, we present a knowledge diffusion (SKD) model for dynamic networks that takes into account the interaction frequency, which is often used to measure social closeness. A set of agents, which are initially interconnected to form a random network, either exchange knowledge with their neighbors or move toward a new location through an edge-rewiring procedure. The activity of knowledge exchange between agents is determined by a knowledge transfer rule: with probability p, the target node preferentially selects one neighbor node to transfer knowledge with, according to their interaction frequency instead of the knowledge distance; otherwise, with probability 1 - p, the target node builds a new link with its second-order neighbor preferentially or selects one node in the system randomly. The simulation results show that, compared with the Null model defined by the random selection mechanism and the traditional knowledge diffusion (TKD) model driven by knowledge distance, knowledge spreads faster under SKD driven by interaction frequency. In particular, the network structure of SKD evolves into an assortative one, which is a fundamental feature of social networks. This work should be helpful for understanding the coevolution of knowledge diffusion and network structure.
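One way to read the transfer rule above is as a single per-agent decision step, sketched below in Python. The abstract does not give the exact selection weights or data structures, so the `+ 1` smoothing, the `neighbors` and `freq` containers, and the function name `choose_action` are assumptions made only for illustration.

```python
import random


def choose_action(node, neighbors, freq, p, rng):
    """One decision step for `node` (an illustrative reading of the rule).

    neighbors: dict mapping each node to the set of its adjacent nodes.
    freq: dict mapping frozenset({a, b}) to an interaction count.
    Returns ("exchange", partner), ("rewire", new_neighbor) or ("idle", None).
    """
    current = sorted(neighbors[node])
    if current and rng.random() < p:
        # Choose a partner with probability proportional to interaction
        # frequency (+1 is an assumed smoothing so unseen pairs can be picked).
        weights = [freq.get(frozenset((node, n)), 0) + 1 for n in current]
        partner = rng.choices(current, weights=weights, k=1)[0]
        return ("exchange", partner)
    # Otherwise rewire: prefer second-order neighbors, else any other node.
    second = {m for n in current for m in neighbors[n]} - set(current) - {node}
    pool = sorted(second) or sorted(set(neighbors) - set(current) - {node})
    return ("rewire", rng.choice(pool)) if pool else ("idle", None)


# A four-node path graph 0 - 1 - 2 - 3 used as a toy network.
graph = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
counts = {frozenset((0, 1)): 5}
rng = random.Random(0)
always_exchange = choose_action(0, graph, counts, 1.0, rng)
always_rewire = choose_action(0, graph, counts, 0.0, rng)
```

Setting p to 1 or 0 isolates the two branches: node 0 can only exchange with node 1, and its only second-order neighbor is node 2.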
PC-BASED MIE SCATTERING PROGRAM FOR THEORETICAL INVESTIGATIONS OF THE OPTICAL PROPERTIES OF ATMOSPHERIC AEROSOLS AS A FUNCTION OF COMPOSITION AND RELATIVE HUMIDITY

EPA Science Inventory

Over the past decade there has been interest in exploring possible relationships between atmospheric visibility (extinction of light) and the chemical form of aerosols in the atmosphere. A user-friendly, menu-driven program for the personal computer (AT 286 with math co-processor or...
The Caltech Concurrent Computation Program - Project description

NASA Technical Reports Server (NTRS)

Fox, G.; Otto, S.; Lyzenga, G.; Rogstad, D.

1985-01-01

The Caltech Concurrent Computation Program, which studies basic issues in computational science, is described. The research builds on initial work where novel concurrent hardware, the necessary systems software to use it and twenty significant scientific implementations running on the initial 32, 64, and 128 node hypercube machines have been constructed. A major goal of the program will be to extend this work into new disciplines and more complex algorithms including general packages that decompose arbitrary problems in major application areas. New high-performance concurrent processors with up to 1024 nodes, over a gigabyte of memory and multigigaflop performance are being constructed. The implementations cover a wide range of problems in areas such as high energy and astrophysics, condensed matter, chemical reactions, plasma physics, applied mathematics, geophysics, simulation, CAD for VLSI, graphics and image processing. The products of the research program include the concurrent algorithms, hardware, systems software, and complete program implementations.
T-L Plane Abstraction-Based Energy-Efficient Real-Time Scheduling for Multi-Core Wireless Sensors.

PubMed

Kim, Youngmin; Lee, Ki-Seong; Pham, Ngoc-Son; Lee, Sun-Ro; Lee, Chan-Gun

2016-07-08

Energy efficiency is considered as a critical requirement for wireless sensor networks. As more wireless sensor nodes are equipped with multi-cores, there are emerging needs for energy-efficient real-time scheduling algorithms. The T-L plane-based scheme is known to be an optimal global scheduling technique for periodic real-time tasks on multi-cores. Unfortunately, there has been a scarcity of studies on extending T-L plane-based scheduling algorithms to exploit energy-saving techniques. In this paper, we propose a new T-L plane-based algorithm enabling energy-efficient real-time scheduling on multi-core sensor nodes with dynamic power management (DPM). Our approach addresses the overhead of processor mode transitions and reduces fragmentations of the idle time, which are inherent in T-L plane-based algorithms. Our experimental results show the effectiveness of the proposed algorithm compared to other energy-aware scheduling methods on T-L plane abstraction.
CAPRI (Computational Analysis PRogramming Interface): A Solid Modeling Based Infra-Structure for Engineering Analysis and Design Simulations

NASA Technical Reports Server (NTRS)

Haimes, Robert; Follen, Gregory J.

1998-01-01

CAPRI is a CAD-vendor neutral application programming interface designed for the construction of analysis and design systems. By allowing access to the geometry from within all modules (grid generators, solvers and post-processors) such tasks as meshing on the actual surfaces, node enrichment by solvers and defining which mesh faces are boundaries (for the solver and visualization system) become simpler. The overall reliance on file 'standards' is minimized. This 'Geometry Centric' approach makes multi-physics (multi-disciplinary) analysis codes much easier to build. By using the shared (coupled) surface as the foundation, CAPRI provides a single call to interpolate grid-node based data from the surface discretization in one volume to another. Finally, design systems are possible where the results can be brought back into the CAD system (and therefore manufactured) because all geometry construction and modification are performed using the CAD system's geometry kernel.
Hierarchical Address Event Routing for Reconfigurable Large-Scale Neuromorphic Systems.

PubMed

Park, Jongkil; Yu, Theodore; Joshi, Siddharth; Maier, Christoph; Cauwenberghs, Gert

2017-10-01

We present a hierarchical address-event routing (HiAER) architecture for scalable communication of neural and synaptic spike events between neuromorphic processors, implemented with five Xilinx Spartan-6 field-programmable gate arrays and four custom analog neuromorphic integrated circuits serving 262k neurons and 262M synapses. The architecture extends the single-bus address-event representation protocol to a hierarchy of multiple nested buses, routing events across increasing scales of spatial distance. The HiAER protocol provides individually programmable axonal delay in addition to strength for each synapse, lending itself toward biologically plausible neural network architectures, and scales across a range of hierarchies suitable for multichip and multiboard systems in reconfigurable large-scale neuromorphic systems. We show approximately linear scaling of net global synaptic event throughput with number of routing nodes in the network, at 3.6×10^7 synaptic events per second per 16k-neuron node in the hierarchy.
KWICgrouper--Designing a Tool for Corpus-Driven Concordance Analysis

ERIC Educational Resources Information Center

O'Donnell, Matthew Brook

2008-01-01

The corpus-driven analysis of concordance data often results in the identification of groups of lines in which repeated patterns around the node item establish membership in a particular function meaning group (Mahlberg, 2005). This paper explains the KWICgrouper, a concept designed to support this kind of concordance analysis. Groups are defined…
Technology and design of an active-matrix OLED on crystalline silicon direct-view display for a wristwatch computer

NASA Astrophysics Data System (ADS)

Sanford, James L.; Schlig, Eugene S.; Prache, Olivier; Dove, Derek B.; Ali, Tariq A.; Howard, Webster E.

2002-02-01

The IBM Research Division and eMagin Corp. have jointly developed a low-power VGA direct view active matrix OLED display, fabricated on a crystalline silicon CMOS chip. The display is incorporated in IBM prototype wristwatch computers running the Linux operating system. IBM designed the silicon chip and eMagin developed the organic stack and performed the back-end-of-line processing and packaging. Each pixel is driven by a constant current source controlled by a CMOS RAM cell, and the display receives its data from the processor memory bus. This paper describes the OLED technology and packaging, and outlines the design of the pixel and display electronics and the processor interface. Experimental results are presented.
Speeding up parallel processing

NASA Technical Reports Server (NTRS)

Denning, Peter J.

1988-01-01

In 1967 Amdahl expressed doubts about the ultimate utility of multiprocessors. The formulation, now called Amdahl's law, became part of the computing folklore and has inspired much skepticism about the ability of the current generation of massively parallel processors to efficiently deliver all their computing power to programs. The widely publicized recent results of a group at Sandia National Laboratory, which showed speedup on a 1024 node hypercube of over 500 for three fixed size problems and over 1000 for three scalable problems, have convincingly challenged this bit of folklore and have given new impetus to parallel scientific computing.
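The tension described above can be made concrete with a short calculation. The sketch below assumes the usual statement of Amdahl's law for fixed-size problems and the scaled-speedup formula commonly attributed to Gustafson for problems that grow with the machine; the abstract itself does not spell out either formula, and the function names are chosen only for this example.

```python
def amdahl_speedup(serial_fraction, processors):
    """Fixed-size speedup: S = 1 / (s + (1 - s) / N)."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / processors)


def scaled_speedup(serial_fraction, processors):
    """Scaled-size speedup: S = s + (1 - s) * N, with s measured on the parallel run."""
    return serial_fraction + (1.0 - serial_fraction) * processors


def serial_fraction_for(speedup, processors):
    """Invert Amdahl's law to get the serial fraction implied by a speedup."""
    return (processors / speedup - 1.0) / (processors - 1.0)


# Serial fraction implied by a fixed-size speedup of 500 on 1024 nodes.
implied = serial_fraction_for(500.0, 1024)
# Round trip: feeding that fraction back into Amdahl's law recovers 500.
recovered = amdahl_speedup(implied, 1024)
```

Under these assumptions, a fixed-size speedup of 500 on 1024 nodes corresponds to a serial fraction of roughly 0.1%, which suggests how small the non-parallel part of those programs had to be.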
Modular System Control Development Model (MSCDM). Design Specification.

DTIC Science & Technology

1979-08-01

with power supply and can be used independently of the loop. The PDU can be used as a general purpose processor. The loop is contained in a separate...inputs to nodes 22 (VSQC), 23 (DSQC), and 26 (BWBSA) will be generated by an LSI-11 microprocessor used as a simulated input generator (SIG). The SIG...who communicate faults to the FIAC module. FIAC generates event reports to the OCRI and DBMS. The PDP-11/40 in loop 2 generates
Polymorphous Computing Architecture (PCA) Kernel Benchmark Measurements on the MIT Raw Microprocessor

DTIC Science & Technology

2006-06-14

Robert Graybill. A Raw board for the use of this project was provided by the Computer Architecture Group at the Massachusetts Institute of Technology...simulator is presented by MIT as being an accurate model of the Raw chip, we have found that it does not accurately model the board. Our comparison...G4 processor, model 7410, with a 32 kbyte level-1 cache on-chip and a 2 Mbyte L2 cache connected through a 250 MHz bus [12]. Each node has 256 Mbyte
Feasibility study, software design, layout and simulation of a two-dimensional Fast Fourier Transform machine for use in optical array interferometry

NASA Technical Reports Server (NTRS)

Boriakoff, Valentin

1994-01-01

The goal of this project was the feasibility study of a particular architecture of a digital signal processing machine operating in real time which could do in a pipeline fashion the computation of the fast Fourier transform (FFT) of a time-domain sampled complex digital data stream. The particular architecture makes use of simple identical processors (called inner product processors) in a linear organization called a systolic array. Through computer simulation the new architecture to compute the FFT with systolic arrays was proved to be viable, and computed the FFT correctly and with the predicted particulars of operation. Integrated circuits to compute the operations expected of the vital node of the systolic architecture were proven feasible, and even with a 2 micron VLSI technology can execute the required operations in the required time. Actual construction of the integrated circuits was successful in one variant (fixed point) and unsuccessful in the other (floating point).
Operating experience with a VMEbus multiprocessor system for data acquisition and reduction in nuclear physics

NASA Astrophysics Data System (ADS)

Kutt, P. H.; Balamuth, D. P.

1989-10-01

Summary form only given, as follows. A multiprocessor system based on commercially available VMEbus components has been developed for the acquisition and reduction of event-mode data in nuclear physics experiments. The system contains seven 68000 CPUs and 14 Mbyte of memory. A minimal operating system handles data transfer and task allocation, and a compiler for a specially designed event analysis language produces code for the processors. The system has been in operation for four years at the University of Pennsylvania Tandem Accelerator Laboratory. Computation rates over three times that of a MicroVAX II have been achieved at a fraction of the cost. The use of WORM optical disks for event recording allows the processing of gigabyte data sets without operator intervention. A more powerful system is being planned which will make use of recently developed RISC (reduced instruction set computer) processors to obtain an order of magnitude increase in computing power per node.
Impacts of the IBM Cell Processor to Support Climate Models

NASA Technical Reports Server (NTRS)

Zhou, Shujia; Duffy, Daniel; Clune, Tom; Suarez, Max; Williams, Samuel; Halem, Milt

2008-01-01

NASA is interested in the performance and cost benefits of adapting its applications to the IBM Cell processor. However, its 256KB local memory per SPE and the new communication mechanism make it very challenging to port an application. We selected the solar radiation component of the NASA GEOS-5 climate model, which: (1) is representative of column physics (approximately 50% computational time), (2) has a high computational load relative to transferring data from and to main memory, (3) performs independent calculations across multiple columns. We converted the baseline code (single-precision, Fortran) to C and ported it by manually SIMDizing 4 independent columns and found that a Cell with 8 SPEs can process 2274 columns per second. Compared with the baseline results, the Cell is approximately 5.2X, approximately 8.2X, approximately 15.1X faster than a core on Intel Woodcrest, Dempsey, and Itanium2, respectively. We believe this dramatic performance improvement makes a hybrid cluster with Cell and traditional nodes competitive.
Energy Efficient Image/Video Data Transmission on Commercial Multi-Core Processors

PubMed Central

Lee, Sungju; Kim, Heegon; Chung, Yongwha; Park, Daihee

2012-01-01

In transmitting image/video data over Video Sensor Networks (VSNs), energy consumption must be minimized while maintaining high image/video quality. Although image/video compression is well known for its efficiency and usefulness in VSNs, the excessive costs associated with encoding computation and complexity still hinder its adoption for practical use. However, it is anticipated that high-performance handheld multi-core devices will be used as VSN processing nodes in the near future. In this paper, we propose a way to improve the energy efficiency of image and video compression with multi-core processors while maintaining the image/video quality. We improve the compression efficiency at the algorithmic level or derive the optimal parameters for the combination of a machine and compression based on the tradeoff between the energy consumption and the image/video quality. Based on experimental results, we confirm that the proposed approach can improve the energy efficiency of the straightforward approach by a factor of 2∼5 without compromising image/video quality. PMID:23202181
Load Balancing Strategies for Multiphase Flows on Structured Grids

NASA Astrophysics Data System (ADS)

Olshefski, Kristopher; Owkes, Mark

2017-11-01

The computation time required to perform large simulations of complex systems is currently one of the leading bottlenecks of computational research. Parallelization allows multiple processing cores to perform calculations simultaneously and reduces computational times. However, load imbalances between processors waste computing resources as processors wait for others to complete imbalanced tasks. In multiphase flows, these imbalances arise due to the additional computational effort required at the gas-liquid interface. However, many current load balancing schemes are only designed for unstructured grid applications. The purpose of this research is to develop a load balancing strategy while maintaining the simplicity of a structured grid. Several approaches are investigated including brute force oversubscription, node oversubscription through Message Passing Interface (MPI) commands, and shared memory load balancing using OpenMP. Each of these strategies is tested with a simple one-dimensional model prior to implementation into the three-dimensional NGA code. Current results show load balancing will reduce computational time by at least 30%.
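The effect of oversubscription on a one-dimensional structured grid can be illustrated with a toy cost model. The Python sketch below is not the authors' MPI or OpenMP implementation: the cell costs, the greedy largest-first assignment, and the function names are all assumptions made for this example.

```python
def block_loads(costs, parts):
    """Split `costs` into `parts` contiguous blocks and sum each block."""
    n = len(costs)
    bounds = [i * n // parts for i in range(parts + 1)]
    return [sum(costs[bounds[i]:bounds[i + 1]]) for i in range(parts)]


def oversubscribed_loads(costs, processors, factor):
    """Cut processors * factor blocks, then give each block (largest first)
    to whichever processor currently has the least work."""
    blocks = sorted(block_loads(costs, processors * factor), reverse=True)
    loads = [0.0] * processors
    for work in blocks:
        loads[loads.index(min(loads))] += work
    return loads


def imbalance(loads):
    """Ratio of the busiest processor's load to the mean load (1.0 is ideal)."""
    return max(loads) / (sum(loads) / len(loads))


# Hypothetical 1-D model: 64 cells, 8 "interface" cells cost 10x more.
costs = [10.0 if 28 <= i < 36 else 1.0 for i in range(64)]
static = imbalance(block_loads(costs, 4))               # one block per processor
balanced = imbalance(oversubscribed_loads(costs, 4, 8))  # 32 small blocks
```

In this toy case the two processors that own the interface cells carry about 1.5 times the mean load under the static split, while the oversubscribed assignment evens the load out; the trade-off, not modeled here, is that the small blocks are no longer contiguous per processor, which increases communication.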

The Fermilab lattice supercomputer project

DOE Office of Scientific and Technical Information (OSTI.GOV)

Fischler, M.; Atac, R.; Cook, A.

1989-02-01

The ACPMAPS system is a highly cost effective, local memory MIMD computer targeted at algorithm development and production running for gauge theory on the lattice. The machine consists of a compound hypercube of crates, each of which is a full crossbar switch containing several processors. The processing nodes are single board array processors based on the Weitek XL chip set, each with a peak power of 20 MFLOPS and supported by 8 MBytes of data memory. The system currently being assembled has a peak power of 5 GFLOPS, delivering performance at approximately $250/MFLOP. The system is programmable in C and Fortran. An underpinning of software routines (CANOPY) provides an easy and natural way of coding lattice problems, such that the details of parallelism, and communication and system architecture are transparent to the user. CANOPY can easily be ported to any single CPU or MIMD system which supports C, and allows the coding of typical applications with very little effort. 3 refs., 1 fig.
The cost of conservative synchronization in parallel discrete event simulations

NASA Technical Reports Server (NTRS)

Nicol, David M.

1990-01-01

The performance of a synchronous conservative parallel discrete-event simulation protocol is analyzed. The class of simulation models considered is oriented around a physical domain and possesses a limited ability to predict future behavior. A stochastic model is used to show that as the volume of simulation activity in the model increases relative to a fixed architecture, the complexity of the average per-event overhead due to synchronization, event list manipulation, lookahead calculations, and processor idle time approaches the complexity of the average per-event overhead of a serial simulation. The method is therefore within a constant factor of optimal. The analysis demonstrates that on large problems--those for which parallel processing is ideally suited--there is often enough parallel workload so that processors are not usually idle. The viability of the method is also demonstrated empirically, showing how good performance is achieved on large problems using a thirty-two node Intel iPSC/2 distributed memory multiprocessor.
A light hydrocarbon fuel processor producing high-purity hydrogen

NASA Astrophysics Data System (ADS)

Löffler, Daniel G.; Taylor, Kyle; Mason, Dylan

This paper discusses the design process and presents performance data for a dual fuel (natural gas and LPG) fuel processor for PEM fuel cells delivering between 2 and 8 kW electric power in stationary applications. The fuel processor resulted from a series of design compromises made to address different design constraints. First, the product quality was selected; then, the unit operations needed to achieve that product quality were chosen from the pool of available technologies. Next, the specific equipment needed for each unit operation was selected. Finally, the unit operations were thermally integrated to achieve high thermal efficiency. Early in the design process, it was decided that the fuel processor would deliver high-purity hydrogen. Hydrogen can be separated from other gases by pressure-driven processes based on either selective adsorption or permeation. The pressure requirement made steam reforming (SR) the preferred reforming technology because it does not require compression of combustion air; therefore, steam reforming is more efficient in a high-pressure fuel processor than alternative technologies like autothermal reforming (ATR) or partial oxidation (POX), where the combustion occurs at the pressure of the process stream. A low-temperature pre-reformer reactor is needed upstream of a steam reformer to suppress coke formation; yet, low temperatures facilitate the formation of metal sulfides that deactivate the catalyst. For this reason, a desulfurization unit is needed upstream of the pre-reformer. Hydrogen separation was implemented using a palladium alloy membrane. Packed beds were chosen for the pre-reformer and reformer reactors primarily because of their low cost, relatively simple operation and low maintenance. Commercial, off-the-shelf balance of plant (BOP) components (pumps, valves, and heat exchangers) were used to integrate the unit operations. The fuel processor delivers up to 100 slm hydrogen >99.9% pure with <1 ppm CO, <3 ppm CO2. The thermal efficiency is better than 67% operating at full load. This fuel processor has been integrated with a 5-kW fuel cell producing electricity and hot water.
Concurrent Smalltalk on the Message-Driven Processor

DTIC Science & Technology

1991-09-01

language close to Concurrent Smalltalk and having an almost identical name is CONCURRENTSMALLTALK [39] [40] independently developed by Yasuhiko Yokote and...Laboratory Memo 1044, October 1988. [39] Yokote, Yasuhiko, and Tokoro, Mario. "The Design and Implementation of ConcurrentSmalltalk." Proceedings...of the 1986 Object-Oriented Programming Systems, Languages, and Applications Conference, September 1986. [40] Yokote, Yasuhiko, and
A software toolbox for robotics

NASA Technical Reports Server (NTRS)

Sanwal, J. C.

1985-01-01

A method for programming cooperating manipulators, which is guided by a geometric description of the task to be performed, is given. This requires a suitable language and a method for describing the workplace and the objects in it in geometric terms. A task level command language and its implementation for concurrently driven multiple robot arms are described. The language is suitable for driving a cell in which manipulators, end effectors, and sensors are controlled by their own dedicated processors. These processors can communicate with each other through a communication network. A mechanism for keeping track of the history of the commands already executed allows the command language for the manipulators to be event driven. A frame based world modeling system is utilized to describe the objects in the work environment and any relationships that hold between these objects. This system provides a versatile tool for managing information about the world model. Default actions normally needed are invoked when the database is updated or accessed. Most of the first level error recovery is also invoked by the database by utilizing the concept of demons. The package can be utilized to generate task level commands in a problem solver or a planner.
DNS of incompressible turbulence in a periodic box with up to 4096^3 grid points

NASA Astrophysics Data System (ADS)

Kaneda, Yukio

2007-11-01

Turbulence of incompressible fluid obeying the Navier-Stokes (NS) equations under periodic boundary conditions is one of the simplest dynamical systems keeping the essence of turbulence dynamics, and suitable for the study of high Reynolds number (Re) turbulence by direct numerical simulation (DNS). This talk presents a review on DNS of such a system with the number N^3 of the grid points up to 4096^3, performed on the Earth Simulator (ES). The ES consists of 640 processor nodes (=5120 arithmetic processors) with 10 TB of main memory and a peak performance of 40 Tflops. The DNSs are based on a spectral method free from alias error. The convolution sums in the wave vector space were evaluated by radix-4 Fast Fourier Transforms with double precision arithmetic. Sustained performance of 16.4 Tflops was achieved on the 2048^3 DNS by using 512 processor nodes of the ES. The DNSs consist of two series; one is with kmaxη ≈ 1 (Series 1) and the other with kmaxη ≈ 2 (Series 2), where kmax is the highest wavenumber in each simulation, and η is the Kolmogorov length scale. In the 4096^3 DNS, the Taylor-scale Reynolds number Rλ ≈ 1130 (675) and the ratio L/η of the integral length scale L to η is approximately 2133 (1040), in Series 1 (Series 2). Such DNS data are expected to shed some light on the basic questions in turbulence research, including those on (i) the normalized mean rate of energy dissipation in the high Re limit, (ii) the universality of energy spectrum at small scale, (iii) scale- and Re-dependences of the statistics, and (iv) intermittency. We have constructed a database consisting of (a) animations and figures of turbulent fields, (b) statistics including those associated with (i)-(iv) noted above, and (c) snapshot data of the velocity fields. The data size of (c) can be very large for large N. For example, one snapshot of single precision data of the velocity vector field of the 4096^3 DNS requires approximately 0.8 TB.
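The quoted snapshot size follows directly from the grid dimensions. The short check below assumes three velocity components stored as 4-byte single-precision values with no metadata or compression; the variable names are illustrative only.

```python
N = 4096                 # grid points per direction
components = 3           # velocity vector components (assumed u, v, w)
bytes_per_value = 4      # single precision

snapshot_bytes = N**3 * components * bytes_per_value
snapshot_tb = snapshot_bytes / 1e12      # decimal terabytes
snapshot_tib = snapshot_bytes / 2**40    # binary tebibytes

# Arithmetic processors per node implied by the ES figures in the abstract
processors_per_node = 5120 // 640

print(snapshot_bytes, round(snapshot_tb, 2), snapshot_tib, processors_per_node)
```

Either unit convention gives a value consistent with the "approximately 0.8 TB" stated in the abstract.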
Measurements of the LHCb software stack on the ARM architecture

NASA Astrophysics Data System (ADS)

Vijay Kartik, S.; Couturier, Ben; Clemencic, Marco; Neufeld, Niko

2014-06-01

The ARM architecture is a power-efficient design that is used in most processors in mobile devices all around the world today since it provides reasonable compute performance per watt. The current LHCb software stack is designed (and thus expected) to build and run on machines with the x86/x86_64 architecture. This paper outlines the process of measuring the performance of the LHCb software stack on the ARM architecture - specifically, the ARMv7 architecture on Cortex-A9 processors from NVIDIA and on full-fledged ARM servers with chipsets from Calxeda - and makes comparisons with the performance on x86_64 architectures on the Intel Xeon L5520/X5650 and AMD Opteron 6272. The paper emphasises the aspects of performance per core with respect to the power drawn by the compute nodes for the given performance - this ensures a fair real-world comparison with much more 'powerful' Intel/AMD processors. The comparisons of these real workloads in the context of LHCb are also complemented with the standard synthetic benchmarks HEPSPEC and Coremark. The pitfalls and solutions for the non-trivial task of porting the source code to build for the ARMv7 instruction set are presented. The specific changes in the build process needed for ARM-specific portions of the software stack are described, to serve as pointers for further attempts taken up by other groups in this direction. Cases where architecture-specific tweaks at the assembler level (both in ROOT and the LHCb software stack) were needed for a successful compile are detailed - these cases are good indicators of where/how the software stack as well as the build system can be made more portable and multi-arch friendly. The experience gained from the tasks described in this paper is intended to i) assist in making an informed choice about ARM-based server solutions as a feasible low-power alternative to the current compute nodes, and ii) revisit the software design and build system for portability and generic improvements.
Heterogeneous delays making parents synchronized: A coupled maps on Cayley tree model

NASA Astrophysics Data System (ADS)

Singh, Aradhana; Jalan, Sarika

2014-06-01

We study the phase synchronized clusters in the diffusively coupled maps on the Cayley tree networks for heterogeneous delay values. Cayley tree networks comprise two parts: the inner nodes and the boundary nodes. We find that heterogeneous delays lead to various cluster states, such as (a) a cluster state consisting of inner nodes and boundary nodes, and (b) a cluster state consisting of only boundary nodes. The former state may comprise nodes from all the generations forming a self-organized cluster, or nodes from a few generations yielding driven clusters, depending on the parity of the heterogeneous delay values. Furthermore, heterogeneity in delays leads to lag synchronization between the siblings lying on the boundary by destroying the exact synchronization among them, the time lag being equal to the difference in the delay values. The Lyapunov function analysis sheds light on the destruction of the exact synchrony among the last generation nodes. At the end, we discuss the relevance of our results with respect to their applications in the family business as well as in understanding the occurrence of genetic diseases.
Smart photonic networks and computer security for image data

NASA Astrophysics Data System (ADS)

Campello, Jorge; Gill, John T.; Morf, Martin; Flynn, Michael J.

1998-02-01

Work reported here is part of a larger project on 'Smart Photonic Networks and Computer Security for Image Data', studying the interactions of coding and security, switching architecture simulations, and basic technologies. Coding and security: coding methods that are appropriate for data security in data fusion networks were investigated. These networks have several characteristics that distinguish them from other currently employed networks, such as Ethernet LANs or the Internet. The most significant characteristics are very high maximum data rates; predominance of image data; narrowcasting - transmission of data from one source to a designated set of receivers; data fusion - combining related data from several sources; simple sensor nodes with limited buffering. These characteristics affect both the lower level network design and the higher level coding methods. Data security encompasses privacy, integrity, reliability, and availability. Privacy, integrity, and reliability can be provided through encryption and coding for error detection and correction. Availability is primarily a network issue; network nodes must be protected against failure or routed around in the case of failure. One of the more promising techniques is the use of 'secret sharing'. We consider this method as a special case of our new space-time code diversity based algorithms for secure communication. These algorithms enable us to exploit parallelism and scalable multiplexing schemes to build photonic network architectures. A number of very high-speed switching and routing architectures and their relationships with very high performance processor architectures were studied. Indications are that routers for very high speed photonic networks can be designed using the very robust and distributed TCP/IP protocol, if suitable processor architecture support is available.
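For readers unfamiliar with secret sharing, the simplest variant splits a message into n shares such that all n are required to recover it and any smaller subset reveals nothing. The sketch below shows that generic XOR-based n-of-n scheme in Python; it is not the authors' space-time code diversity algorithm, and the function names are hypothetical.

```python
import secrets
from functools import reduce

def xor_bytes(a, b):
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))

def split_secret(secret, n):
    """Split `secret` into n shares; all n are needed to reconstruct it."""
    if n < 2:
        raise ValueError("need at least two shares")
    # n - 1 shares are uniformly random; the last one is secret XOR all of them
    shares = [secrets.token_bytes(len(secret)) for _ in range(n - 1)]
    shares.append(reduce(xor_bytes, shares, secret))
    return shares

def recover_secret(shares):
    """XOR all shares together to recover the original bytes."""
    return reduce(xor_bytes, shares)

shares = split_secret(b"image tile 42", 3)
recovered = recover_secret(shares)
print(recovered)
```

Each share could in principle travel over a different route or wavelength, so an eavesdropper on a single path sees only random bytes. Threshold schemes that tolerate lost shares need a different construction.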
Parallel Application Performance on Two Generations of Intel Xeon HPC Platforms

DOE Office of Scientific and Technical Information (OSTI.GOV)

Chang, Christopher H.; Long, Hai; Sides, Scott

2015-10-15

Two next-generation node configurations hosting the Haswell microarchitecture were tested with a suite of microbenchmarks and application examples, and compared with a current Ivy Bridge production node on NREL's Peregrine high-performance computing cluster. A primary conclusion from this study is that the additional cores are of little value to individual task performance--limitations to application parallelism, or resource contention among concurrently running but independent tasks, limit effective utilization of these added cores. Hyperthreading generally impacts throughput negatively, but can improve performance in the absence of detailed attention to runtime workflow configuration. The observations offer some guidance to procurement of future HPC systems at NREL. First, raw core count must be balanced with available resources, particularly memory bandwidth. Balance-of-system will determine value more than processor capability alone. Second, hyperthreading continues to be largely irrelevant to the workloads that are commonly seen, and were tested here, at NREL. Finally, perhaps the most impactful enhancement to productivity might occur through enabling multiple concurrent jobs per node. Given the right type and size of workload, more may be achieved by doing many slow things at once, than fast things in order.
DOE Office of Scientific and Technical Information (OSTI.GOV)

Krueger, Jens; Micikevicius, Paulius; Williams, Samuel

Reverse Time Migration (RTM) is one of the main approaches in the seismic processing industry for imaging the subsurface structure of the Earth. While RTM provides qualitative advantages over its predecessors, it has a high computational cost warranting implementation on HPC architectures. We focus on three progressively more complex kernels extracted from RTM: for isotropic (ISO), vertical transverse isotropic (VTI) and tilted transverse isotropic (TTI) media. In this work, we examine performance optimization of forward wave modeling, which describes the computational kernels used in RTM, on emerging multi- and manycore processors and introduce a novel common subexpression elimination optimization for TTI kernels. We compare attained performance and energy efficiency in both the single-node and distributed memory environments in order to satisfy industry's demands for fidelity, performance, and energy efficiency. Moreover, we discuss the interplay between architecture (chip and system) and optimizations (both on-node computation) highlighting the importance of NUMA-aware approaches to MPI communication. Ultimately, our results show we can improve CPU energy efficiency by more than 10× on Magny Cours nodes while acceleration via multiple GPUs can surpass the energy-efficient Intel Sandy Bridge by as much as 3.6×.
A Wireless Electronic Nose System Using a Fe2O3 Gas Sensing Array and Least Squares Support Vector Regression

PubMed Central

Song, Kai; Wang, Qi; Liu, Qi; Zhang, Hongquan; Cheng, Yingguo

2011-01-01

This paper describes the design and implementation of a wireless electronic nose (WEN) system which can detect the combustible gases methane and hydrogen (CH4/H2) online and estimate their concentrations, either singly or in mixtures. The system is composed of two wireless sensor nodes: a slave node and a master node. The former comprises a Fe2O3 gas sensing array for the combustible gas detection, a digital signal processor (DSP) system for real-time sampling and processing of the sensor array data, and a wireless transceiver unit (WTU) by which the detection results can be transmitted to the master node connected with a computer. A type of Fe2O3 gas sensor insensitive to humidity is developed for resistance to environmental influences. A threshold-based least squares support vector regression (LS-SVR) estimator is implemented on a DSP for classification and concentration measurements. Experimental results confirm that LS-SVR produces higher accuracy compared with artificial neural networks (ANNs) and a faster convergence rate than the standard support vector regression (SVR). The designed WEN system effectively achieves gas mixture analysis in a real-time process. PMID:22346587
Simulating Hydrologic Flow and Reactive Transport with PFLOTRAN and PETSc on Emerging Fine-Grained Parallel Computer Architectures

NASA Astrophysics Data System (ADS)

Mills, R. T.; Rupp, K.; Smith, B. F.; Brown, J.; Knepley, M.; Zhang, H.; Adams, M.; Hammond, G. E.

2017-12-01

As the high-performance computing community pushes towards the exascale horizon, power and heat considerations have driven the increasing importance and prevalence of fine-grained parallelism in new computer architectures. High-performance computing centers have become increasingly reliant on GPGPU accelerators and "manycore" processors such as the Intel Xeon Phi line, and 512-bit SIMD registers have even been introduced in the latest generation of Intel's mainstream Xeon server processors. The high degree of fine-grained parallelism and more complicated memory hierarchy considerations of such "manycore" processors present several challenges to existing scientific software. Here, we consider how the massively parallel, open-source hydrologic flow and reactive transport code PFLOTRAN - and the underlying Portable, Extensible Toolkit for Scientific Computation (PETSc) library on which it is built - can best take advantage of such architectures. We will discuss some key features of these novel architectures and our code optimizations and algorithmic developments targeted at them, and present experiences drawn from working with a wide range of PFLOTRAN benchmark problems on these architectures.
WebStruct and VisualStruct: Web interfaces and visualization for Structure software implemented in a cluster environment.

PubMed

Jayashree, B; Rajgopal, S; Hoisington, D; Prasanth, V P; Chandra, S

2008-09-24

Structure is a widely used software tool to investigate population genetic structure with multi-locus genotyping data. The software uses an iterative algorithm to group individuals into "K" clusters, representing possibly K genetically distinct subpopulations. The serial implementation of this programme is processor-intensive even with small datasets. We describe an implementation of the program within a parallel framework. Speedup was achieved by running different replicates and values of K on each node of the cluster. A web-based user-oriented GUI has been implemented in PHP, through which the user can specify input parameters for the programme. The number of processors to be used can be specified in the background command. A web-based visualization tool "Visualstruct", written in PHP (HTML and JavaScript embedded), allows for the graphical display of population clusters output from Structure, where each individual may be visualized as a line segment with K colors defining its possible genomic composition with respect to the K genetic sub-populations. The advantage over available programs is in the increased number of individuals that can be visualized. The analyses of real datasets indicate a speedup of up to four, when comparing the speed of execution on clusters of eight processors with the speed of execution on one desktop. The software package is freely available to interested users upon request.
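The parallelization described here is a task farm: each combination of K and replicate number is an independent run that can go to any node. The sketch below shows one possible way to build such a run plan with a round-robin assignment. The scheduling policy and the function name are assumptions; the abstract does not say how runs are actually distributed or launched.

```python
from itertools import product

def assign_runs(k_values, replicates, n_nodes):
    """Round-robin assignment of independent (K, replicate) runs to nodes."""
    tasks = list(product(k_values, range(1, replicates + 1)))
    plan = {node: [] for node in range(n_nodes)}
    for index, task in enumerate(tasks):
        plan[index % n_nodes].append(task)
    return plan

# Hypothetical study: K = 1..4, four replicates each, eight processors
plan = assign_runs(k_values=range(1, 5), replicates=4, n_nodes=8)
for node, runs in plan.items():
    print(node, runs)
```

Because the runs share no state, the achievable speedup in such a scheme is bounded by the number of nodes and by the longest single run.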
Parallelization of a Monte Carlo particle transport simulation code

NASA Astrophysics Data System (ADS)

Hadjidoukas, P.; Bousis, C.; Emfietzoglou, D.

2010-05-01

We have developed a high performance version of the Monte Carlo particle transport simulation code MC4. The original application code, developed in Visual Basic for Applications (VBA) for Microsoft Excel, was first rewritten in the C programming language for improving code portability. Several pseudo-random number generators have also been integrated and studied. The new MC4 version was then parallelized for shared and distributed-memory multiprocessor systems using the Message Passing Interface. Two parallel pseudo-random number generator libraries (SPRNG and DCMT) have been seamlessly integrated. The performance speedup of parallel MC4 has been studied on a variety of parallel computing architectures including an Intel Xeon server with 4 dual-core processors, a Sun cluster consisting of 16 nodes of 2 dual-core AMD Opteron processors and a 200 dual-processor HP cluster. For large problem size, which is limited only by the physical memory of the multiprocessor server, the speedup results are almost linear on all systems. We have validated the parallel implementation against the serial VBA and C implementations using the same random number generator. Our experimental results on the transport and energy loss of electrons in a water medium show that the serial and parallel codes are equivalent in accuracy. The present improvements allow for the study of higher particle energies with the use of more accurate physical models, and improve statistics as more particle tracks can be simulated in low response time.
Towards Highly Scalable Ab Initio Molecular Dynamics (AIMD) Simulations on the Intel Knights Landing Manycore Processor

DOE Office of Scientific and Technical Information (OSTI.GOV)

Jacquelin, Mathias; De Jong, Wibe A.; Bylaska, Eric J.

2017-07-03

The Ab Initio Molecular Dynamics (AIMD) method allows scientists to treat the dynamics of molecular and condensed phase systems while retaining a first-principles-based description of their interactions. This extremely important method has tremendous computational requirements, because the electronic Schrödinger equation, approximated using Kohn-Sham Density Functional Theory (DFT), is solved at every time step. With the advent of manycore architectures, application developers have a significant amount of processing power within each compute node that can only be exploited through massive parallelism. A compute intensive application such as AIMD forms a good candidate to leverage this processing power. In this paper, we focus on adding thread level parallelism to the plane wave DFT methodology implemented in NWChem. Through a careful optimization of tall-skinny matrix products, which are at the heart of the Lagrange multiplier and nonlocal pseudopotential kernels, as well as 3D FFTs, our OpenMP implementation delivers excellent strong scaling on the latest Intel Knights Landing (KNL) processor. We assess the efficiency of our Lagrange multiplier kernels by building a Roofline model of the platform, and verify that our implementation is close to the roofline for various problem sizes. Finally, we present strong scaling results on the complete AIMD simulation for a 64 water molecules test case, that scales up to all 68 cores of the Knights Landing processor.
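A Roofline model bounds a kernel's attainable performance by the lesser of the machine's peak compute rate and the product of memory bandwidth and arithmetic intensity (flops per byte moved). The sketch below shows that standard formula. The peak and bandwidth values are placeholders chosen for round numbers, not measured KNL figures, and the function names are hypothetical.

```python
def roofline_gflops(peak_gflops, bandwidth_gbs, arithmetic_intensity):
    """Attainable GFLOP/s for a kernel with the given flops-per-byte ratio."""
    return min(peak_gflops, bandwidth_gbs * arithmetic_intensity)

def fraction_of_roofline(measured_gflops, peak_gflops, bandwidth_gbs, arithmetic_intensity):
    """How close a measured kernel runs to its roofline bound (1.0 = on the roof)."""
    bound = roofline_gflops(peak_gflops, bandwidth_gbs, arithmetic_intensity)
    return measured_gflops / bound

# Placeholder platform (assumed values, not taken from the paper)
PEAK = 2000.0       # GFLOP/s
BANDWIDTH = 400.0   # GB/s
ridge_point = PEAK / BANDWIDTH  # intensity where the kernel becomes compute bound

memory_bound = roofline_gflops(PEAK, BANDWIDTH, 0.5)    # low-intensity kernel
compute_bound = roofline_gflops(PEAK, BANDWIDTH, 10.0)  # high-intensity kernel
print(ridge_point, memory_bound, compute_bound)
```

A kernel whose measured rate divided by this bound is near 1.0 is "close to the roofline" in the sense used in the abstract.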
Transient many-body instability in driven Dirac materials

NASA Astrophysics Data System (ADS)

Pertsova, Anna; Triola, Christopher; Balatsky, Alexander

The defining feature of a Dirac material (DM) is the presence of nodes in the low-energy excitation spectrum leading to a strong energy dependence of the density of states (DOS). The vanishing of the DOS at the nodal point implies a very low effective coupling constant which leads to stability of the node against electron-electron interactions. Non-equilibrium or driven DMs, in which the DOS and hence the effective coupling can be controlled by external drive, offer a new platform for investigating collective instabilities. In this work, we discuss the possibility of realizing transient collective states in driven DMs. Motivated by recent pump-probe experiments which demonstrate the existence of long-lived photo-excited states in DMs, we consider an example of a transient excitonic instability in an optically-pumped DM. We identify experimental signatures of the transient excitonic condensate and provide estimates of the critical temperatures and lifetimes of these states for a few important examples of DMs, such as single-layer graphene and topological-insulator surfaces.
Detecting climatically driven phylogenetic and morphological divergence among spruce (Picea) species worldwide

NASA Astrophysics Data System (ADS)

Wang, Guo-Hong; Li, He; Zhao, Hai-Wei; Zhang, Wei-Kang

2017-05-01

This study aimed to elucidate the relationship between climate and the phylogenetic and morphological divergence of spruces (Picea) worldwide. Climatic and georeferenced data were collected from a total of 3388 sites distributed within the global domain of spruce species. A phylogenetic tree and a morphological tree for the global spruces were reconstructed based on DNA sequences and morphological characteristics. Spatial evolutionary and ecological vicariance analysis (SEEVA) was used to detect the ecological divergence among spruces. A divergence index (D) with (0, 1) scaling was calculated for each climatic factor at each node for both trees. The annual mean values, extreme values and annual range of the climatic variables were among the major determinants for spruce divergence. The ecological divergence was significant (P < 0.001) for 185 of the 279 comparisons at 31 nodes in the phylogenetic tree, as well as for 196 of the 288 comparisons at 32 nodes in the morphological tree. Temperature parameters and precipitation parameters tended to be the main driving factors for the primary divergences of spruce phylogeny and morphology, respectively. Generally, the maximum D of the climatic variables was smaller in the basal nodes than in the remaining nodes. Notably, the primary divergence of morphology and phylogeny among the investigated spruces tended to be driven by different selective pressures. Given the climate scenario of severe and widespread drought over land areas in the next 30-90 years, our findings shed light on the prediction of spruce distribution under future climate change.
Traffic-driven epidemic spreading on scale-free networks with tunable degree distribution

NASA Astrophysics Data System (ADS)

Yang, Han-Xin; Wang, Bing-Hong

2016-04-01

We study the traffic-driven epidemic spreading on scale-free networks with tunable degree distribution. The heterogeneity of networks is controlled by the exponent γ of power-law degree distribution. It is found that the epidemic threshold is minimized at about γ=2.2. Moreover, we find that nodes with larger algorithmic betweenness are more likely to be infected. We expect our work to provide new insights into the effect of network structures on traffic-driven epidemic spreading.
Parallel reduced-instruction-set-computer architecture for real-time symbolic pattern matching

NASA Astrophysics Data System (ADS)

Parson, Dale E.

1991-03-01

This report discusses ongoing work on a parallel reduced-instruction-set-computer (RISC) architecture for automatic production matching. The PRIOPS compiler takes advantage of the memoryless character of automatic processing by translating a program's collection of automatic production tests into an equivalent combinational circuit, that is, a digital circuit without memory, whose outputs are immediate functions of its inputs. The circuit provides a highly parallel, fine-grain model of automatic matching. The compiler then maps the combinational circuit onto RISC hardware. The heart of the processor is an array of comparators capable of testing production conditions in parallel. Each comparator attaches to private memory that contains virtual circuit nodes: records of the current state of nodes and busses in the combinational circuit. All comparator memories hold identical information, allowing simultaneous update for a single changing circuit node and simultaneous retrieval of different circuit nodes by different comparators. Along with the comparator-based logic unit is a sequencer that determines the current combination of production-derived comparisons to try, based on the combined success and failure of previous combinations of comparisons. The memoryless nature of automatic matching allows the compiler to designate invariant memory addresses for virtual circuit nodes, and to generate the most effective sequences of comparison test combinations. The result is maximal utilization of parallel hardware, indicating speed increases and scalability beyond that found for coarse-grain, multiprocessor approaches to concurrent Rete matching. Future work will consider application of this RISC architecture to the standard (controlled) Rete algorithm, where search through memory dominates portions of matching.

ROSE::FTTransform - A Source-to-Source Translation Framework for Exascale Fault-Tolerance Research

DOE Office of Scientific and Technical Information (OSTI.GOV)

Lidman, J; Quinlan, D; Liao, C

2012-03-26

Exascale computing systems will require sufficient resilience to tolerate numerous types of hardware faults while still assuring correct program execution. Such extreme-scale machines are expected to be dominated by processors driven at lower voltages (near the minimum 0.5 volts for current transistors). At these voltage levels, the rate of transient errors increases dramatically due to the sensitivity to transient and geographically localized voltage drops on parts of the processor chip. To achieve power efficiency, these processors are likely to be streamlined and minimal, and thus they cannot be expected to handle transient errors entirely in hardware. Here we present an open, compiler-based framework to automate the armoring of High Performance Computing (HPC) software to protect it from these types of transient processor errors. We develop an open infrastructure to support research work in this area, and we define tools that, in the future, may provide more complete automated and/or semi-automated solutions to support software resiliency on future exascale architectures. Results demonstrate that our approach is feasible, pragmatic in how it can be separated from the software development process, and reasonably efficient (0% to 30% overhead for the Jacobi iteration on common hardware; and 20%, 40%, 26%, and 2% overhead for a randomly selected subset of benchmarks from the Livermore Loops [1]).
A parallel algorithm for 2D visco-acoustic frequency-domain full-waveform inversion: application to a dense OBS data set

NASA Astrophysics Data System (ADS)

Sourbier, F.; Operto, S.; Virieux, J.

2006-12-01

We present a distributed-memory parallel algorithm for 2D visco-acoustic full-waveform inversion of wide-angle seismic data. Our code is written in Fortran 90 and uses MPI for parallelism. The algorithm was applied to a real wide-angle data set recorded by 100 OBSs with a 1-km spacing in the eastern Nankai trough (Japan) to image the deep structure of the subduction zone. Full-waveform inversion is applied sequentially to discrete frequencies by proceeding from the low to the high frequencies. The inverse problem is solved with a classic gradient method. Full-waveform modeling is performed with a frequency-domain finite-difference method. In the frequency domain, solving the wave equation requires the solution of a large unsymmetric system of linear equations. We use the massively parallel direct solver MUMPS (http://www.enseeiht.fr/irit/apo/MUMPS) for distributed-memory computers to solve this system. The MUMPS solver is based on a multifrontal method for the parallel factorization. The MUMPS algorithm is subdivided into 3 main steps. First, a symbolic analysis step performs re-ordering of the matrix coefficients to minimize the fill-in of the matrix during the subsequent factorization and estimates the assembly tree of the matrix. Second, the factorization is performed with dynamic scheduling to accommodate numerical pivoting and provides the LU factors distributed over all the processors. Third, the solution step is performed for multiple sources. To compute the gradient of the cost function, 2 simulations per shot are required (one to compute the forward wavefield and one to back-propagate residuals). The multi-source solutions can be performed in parallel with MUMPS. In the end, each processor stores in core a sub-domain of all the solutions. These distributed solutions can be exploited to compute in parallel the gradient of the cost function.
Since the gradient of the cost function is a weighted stack of the shot and residual solutions of MUMPS, each processor computes the corresponding sub-domain of the gradient. In the end, the gradient is centralized on the master processor using a collective communication. The gradient is scaled by the diagonal elements of the Hessian matrix. This scaling is computed only once per frequency before the first iteration of the inversion. Estimation of the diagonal terms of the Hessian requires performing one simulation per non-redundant shot and receiver position. The same strategy as the one used for the gradient is used to compute the diagonal Hessian in parallel. This algorithm was applied to a dense wide-angle data set recorded by 100 OBSs in the eastern Nankai trough, offshore Japan. Thirteen frequencies ranging from 3 to 15 Hz were inverted. Twenty iterations per frequency were computed, leading to 260 tomographic velocity models of increasing resolution. The velocity model dimensions are 105 km x 25 km, corresponding to a finite-difference grid of 4201 x 1001 points with a 25-m grid interval. The number of shots was 1005 and the number of inverted OBS gathers was 93. The inversion requires 20 days on 6 32-bit bi-processor nodes with 4 Gbytes of RAM per node when only the LU factorization is performed in parallel. Preliminary estimates of the time required to perform the inversion with the fully parallelized code are 6 and 4 days using 20 and 50 processors respectively.
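The sizes quoted in this record can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only: the function and variable names are ours, and it assumes that the grid includes both end points of each axis and that the linear system has one unknown per grid node, neither of which the abstract states explicitly.

```python
# Cross-check of the model and inversion sizes quoted in the abstract.
# Assumption: the finite-difference grid includes both end points of each axis.

def grid_points(length_m: int, spacing_m: int) -> int:
    """Number of grid nodes along an axis of the given length."""
    return length_m // spacing_m + 1

nx = grid_points(105_000, 25)   # 105 km horizontal extent, 25 m interval
nz = grid_points(25_000, 25)    # 25 km depth extent, 25 m interval
n_models = 13 * 20              # frequencies x iterations per frequency
unknowns = nx * nz              # assumption: one unknown per grid node

print(nx, nz, n_models, unknowns)
```

Under these assumptions the matrix factorized at each frequency has on the order of four million rows, which is consistent with the abstract's emphasis on a parallel direct solver.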
Interdependent Multi-Layer Networks: Modeling and Survivability Analysis with Applications to Space-Based Networks

PubMed Central

Castet, Jean-Francois; Saleh, Joseph H.

2013-01-01

This article develops a novel approach and algorithmic tools for the modeling and survivability analysis of networks with heterogeneous nodes, and examines their application to space-based networks. Space-based networks (SBNs) allow the sharing of spacecraft on-orbit resources, such as data storage, processing, and downlink. Each spacecraft in the network can have different subsystem composition and functionality, thus resulting in node heterogeneity. Most traditional survivability analyses of networks assume node homogeneity and as a result, are not suited for the analysis of SBNs. This work proposes that heterogeneous networks can be modeled as interdependent multi-layer networks, which enables their survivability analysis. The multi-layer aspect captures the breakdown of the network according to common functionalities across the different nodes, and it allows the emergence of homogeneous sub-networks, while the interdependency aspect constrains the network to capture the physical characteristics of each node. Definitions of primitives of failure propagation are devised. Formal characterization of interdependent multi-layer networks, as well as algorithmic tools for the analysis of failure propagation across the network are developed and illustrated with space applications. The SBN applications considered consist of several networked spacecraft that can tap into each other's Command and Data Handling subsystem, in case of failure of its own, including the Telemetry, Tracking and Command, the Control Processor, and the Data Handling sub-subsystems. Various design insights are derived and discussed, and the capability to perform trade-space analysis with the proposed approach for various network characteristics is indicated. The select results here shown quantify the incremental survivability gains (with respect to a particular class of threats) of the SBN over the traditional monolith spacecraft. 
Failure of the connectivity between nodes is also examined, and the results highlight the importance of the reliability of the wireless links between spacecraft (nodes) to enable any survivability improvements for space-based networks. PMID:23599835
Interdependent multi-layer networks: modeling and survivability analysis with applications to space-based networks.

PubMed

Castet, Jean-Francois; Saleh, Joseph H

2013-01-01

This article develops a novel approach and algorithmic tools for the modeling and survivability analysis of networks with heterogeneous nodes, and examines their application to space-based networks. Space-based networks (SBNs) allow the sharing of spacecraft on-orbit resources, such as data storage, processing, and downlink. Each spacecraft in the network can have different subsystem composition and functionality, thus resulting in node heterogeneity. Most traditional survivability analyses of networks assume node homogeneity and as a result, are not suited for the analysis of SBNs. This work proposes that heterogeneous networks can be modeled as interdependent multi-layer networks, which enables their survivability analysis. The multi-layer aspect captures the breakdown of the network according to common functionalities across the different nodes, and it allows the emergence of homogeneous sub-networks, while the interdependency aspect constrains the network to capture the physical characteristics of each node. Definitions of primitives of failure propagation are devised. Formal characterization of interdependent multi-layer networks, as well as algorithmic tools for the analysis of failure propagation across the network are developed and illustrated with space applications. The SBN applications considered consist of several networked spacecraft that can tap into each other's Command and Data Handling subsystem, in case of failure of its own, including the Telemetry, Tracking and Command, the Control Processor, and the Data Handling sub-subsystems. Various design insights are derived and discussed, and the capability to perform trade-space analysis with the proposed approach for various network characteristics is indicated. The select results here shown quantify the incremental survivability gains (with respect to a particular class of threats) of the SBN over the traditional monolith spacecraft. 
Failure of the connectivity between nodes is also examined, and the results highlight the importance of the reliability of the wireless links between spacecraft (nodes) to enable any survivability improvements for space-based networks.
INTEGRATED MONITORING HARDWARE DEVELOPMENTS AT LOS ALAMOS

DOE Office of Scientific and Technical Information (OSTI.GOV)

R. PARKER; J. HALBIG; ET AL

1999-09-01

The hardware of the integrated monitoring system supports a family of instruments having a common internal architecture and firmware. Instruments can be easily configured from application-specific personality boards combined with common master-processor and high- and low-voltage power supply boards, and basic operating firmware. The instruments are designed to function autonomously to survive power and communication outages and to adapt to changing conditions. The personality boards allow measurement of gross gammas and neutrons, neutron coincidence and multiplicity, and gamma spectra. In addition, the Intelligent Local Node (ILON) provides a moderate-bandwidth network to tie together instruments, sensors, and computers.
Vapor Compression Distillation Flight Experiment

NASA Technical Reports Server (NTRS)

Hutchens, Cindy F.

2002-01-01

One of the major requirements associated with operating the International Space Station is the transportation (space shuttle and Russian Progress spacecraft launches) necessary to re-supply station crews with food and water. The Vapor Compression Distillation (VCD) Flight Experiment, managed by NASA's Marshall Space Flight Center in Huntsville, Ala., is a full-scale demonstration of technology being developed to recycle crewmember urine and wastewater aboard the International Space Station and thereby reduce the amount of water that must be re-supplied. Based on results of the VCD Flight Experiment, an operational urine processor will be installed in Node 3 of the space station in 2005.
Dendritic cells control fibroblastic reticular network tension and lymph node expansion.

PubMed

Acton, Sophie E; Farrugia, Aaron J; Astarita, Jillian L; Mourão-Sá, Diego; Jenkins, Robert P; Nye, Emma; Hooper, Steven; van Blijswijk, Janneke; Rogers, Neil C; Snelgrove, Kathryn J; Rosewell, Ian; Moita, Luis F; Stamp, Gordon; Turley, Shannon J; Sahai, Erik; Reis e Sousa, Caetano

2014-10-23

After immunogenic challenge, infiltrating and dividing lymphocytes markedly increase lymph node cellularity, leading to organ expansion. Here we report that the physical elasticity of lymph nodes is maintained in part by podoplanin (PDPN) signalling in stromal fibroblastic reticular cells (FRCs) and its modulation by CLEC-2 expressed on dendritic cells. We show in mouse cells that PDPN induces actomyosin contractility in FRCs via activation of RhoA/C and downstream Rho-associated protein kinase (ROCK). Engagement by CLEC-2 causes PDPN clustering and rapidly uncouples PDPN from RhoA/C activation, relaxing the actomyosin cytoskeleton and permitting FRC stretching. Notably, administration of CLEC-2 protein to immunized mice augments lymph node expansion. In contrast, lymph node expansion is significantly constrained in mice selectively lacking CLEC-2 expression in dendritic cells. Thus, the same dendritic cells that initiate immunity by presenting antigens to T lymphocytes also initiate remodelling of lymph nodes by delivering CLEC-2 to FRCs. CLEC-2 modulation of PDPN signalling permits FRC network stretching and allows for the rapid lymph node expansion--driven by lymphocyte influx and proliferation--that is the critical hallmark of adaptive immunity.
FPGA-Based, Self-Checking, Fault-Tolerant Computers

NASA Technical Reports Server (NTRS)

Some, Raphael; Rennels, David

2004-01-01

A proposed computer architecture would exploit the capabilities of commercially available field-programmable gate arrays (FPGAs) to enable computers to detect and recover from bit errors. The main purpose of the proposed architecture is to enable fault-tolerant computing in the presence of single-event upsets (SEUs). [An SEU is a spurious bit flip (also called a soft error) caused by a single impact of ionizing radiation.] The architecture would also enable recovery from some soft errors caused by electrical transients and, to some extent, from intermittent and permanent (hard) errors caused by aging of electronic components. A typical FPGA of the current generation contains one or more complete processor cores, memories, and high-speed serial input/output (I/O) channels, making it possible to shrink a board-level processor node to a single integrated-circuit chip. Custom, highly efficient microcontrollers, general-purpose computers, custom I/O processors, and signal processors can be rapidly and efficiently implemented by use of FPGAs. Unfortunately, FPGAs are susceptible to SEUs. Prior efforts to mitigate the effects of SEUs have yielded solutions that degrade performance of the system and require support from external hardware and software. In comparison with other fault-tolerant-computing architectures (e.g., triple modular redundancy), the proposed architecture could be implemented with less circuitry and lower power demand. Moreover, the fault-tolerant computing functions would require only minimal support from circuitry outside the central processing units (CPUs) of computers, would not require any software support, and would be largely transparent to software and to other computer hardware. There would be two types of modules: a self-checking processor module and a memory system (see figure). The self-checking processor module would be implemented on a single FPGA and would be capable of detecting its own internal errors.
It would contain two CPUs executing identical programs in lock step, with comparison of their outputs to detect errors. It would also contain various cache local memory circuits, communication circuits, and configurable special-purpose processors that would use self-checking checkers. (The basic principle of the self-checking checker method is to utilize logic circuitry that generates error signals whenever there is an error in either the checker or the circuit being checked.) The memory system would comprise a main memory and a hardware-controlled check-pointing system (CPS) based on a buffer memory denoted the recovery cache. The main memory would contain random-access memory (RAM) chips and FPGAs that would, in addition to everything else, implement double-error-detecting and single-error-correcting memory functions to enable recovery from single-bit errors.
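The lock-step comparison described in this record can be illustrated in software. The following is a minimal sketch, not the proposed hardware design: `lockstep`, `run_with_retry`, and the fault-injection dictionary are hypothetical names, and re-executing a step from the last good state is our simplified stand-in for the recovery-cache mechanism.

```python
from typing import Callable, Dict, Tuple

def lockstep(step: Callable[[int], int], state: int, flip_b: int = 0) -> Tuple[int, bool]:
    """Run one step on two replicas and report whether their outputs differ.

    flip_b is XORed into replica B's output to emulate a transient bit flip.
    """
    out_a = step(state)
    out_b = step(state) ^ flip_b
    return out_a, out_a != out_b

def run_with_retry(step: Callable[[int], int], state: int, n_steps: int,
                   faults: Dict[int, int]) -> Tuple[int, int]:
    """Advance n_steps, re-executing a step whenever the replicas disagree."""
    pending = dict(faults)          # step index -> bit mask; each fault fires once
    retries = 0
    i = 0
    while i < n_steps:
        out, mismatch = lockstep(step, state, pending.pop(i, 0))
        if mismatch:
            retries += 1            # state is untouched, so simply repeat the step
            continue
        state = out
        i += 1
    return state, retries

def step(x: int) -> int:
    """Toy 8-bit computation standing in for one unit of program work."""
    return (3 * x + 1) & 0xFF

clean, _ = run_with_retry(step, 1, 5, {})
recovered, retries = run_with_retry(step, 1, 5, {2: 0b100})
print(clean, recovered, retries)
```

Comparison alone only detects a disagreement; it cannot tell which replica is wrong, which is why the sketch repeats the step rather than picking one output.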
Joint estimation of preferential attachment and node fitness in growing complex networks

NASA Astrophysics Data System (ADS)

Pham, Thong; Sheridan, Paul; Shimodaira, Hidetoshi

2016-09-01

Complex network growth across diverse fields of science is hypothesized to be driven in the main by a combination of preferential attachment and node fitness processes. For measuring the respective influences of these processes, previous approaches make strong and untested assumptions on the functional forms of either the preferential attachment function or fitness function or both. We introduce a Bayesian statistical method called PAFit to estimate preferential attachment and node fitness without imposing such functional constraints that works by maximizing a log-likelihood function with suitably added regularization terms. We use PAFit to investigate the interplay between preferential attachment and node fitness processes in a Facebook wall-post network. While we uncover evidence for both preferential attachment and node fitness, thus validating the hypothesis that these processes together drive complex network evolution, we also find that node fitness plays the bigger role in determining the degree of a node. This is the first validation of its kind on real-world network data. But surprisingly the rate of preferential attachment is found to deviate from the conventional log-linear form when node fitness is taken into account. The proposed method is implemented in the R package PAFit.
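To make the interplay between the two processes concrete, the sketch below computes the probability that each existing node receives the next edge. It assumes the commonly used product form, in which that probability is proportional to an attachment function of the node's degree multiplied by the node's fitness; the default attachment function `k + 1` and all names are illustrative assumptions, and this is not the PAFit estimation procedure itself.

```python
from typing import Callable, List, Sequence

def attachment_probabilities(degrees: Sequence[int],
                             fitness: Sequence[float],
                             pa: Callable[[int], float] = lambda k: k + 1) -> List[float]:
    """Probability that each node receives the next edge.

    Assumed model: weight_i = pa(degree_i) * fitness_i, normalized to sum to 1.
    """
    weights = [pa(k) * eta for k, eta in zip(degrees, fitness)]
    total = sum(weights)
    return [w / total for w in weights]

# Three nodes: the highest-degree node has half the fitness of the others.
probs = attachment_probabilities(degrees=[0, 1, 3], fitness=[1.0, 1.0, 0.5])
print(probs)
```

In this toy case the degree-3 node is no more likely to gain an edge than the degree-1 node, because its lower fitness cancels its degree advantage.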
Joint estimation of preferential attachment and node fitness in growing complex networks

PubMed Central

Pham, Thong; Sheridan, Paul; Shimodaira, Hidetoshi

2016-01-01

Complex network growth across diverse fields of science is hypothesized to be driven in the main by a combination of preferential attachment and node fitness processes. For measuring the respective influences of these processes, previous approaches make strong and untested assumptions on the functional forms of either the preferential attachment function or fitness function or both. We introduce a Bayesian statistical method called PAFit to estimate preferential attachment and node fitness without imposing such functional constraints that works by maximizing a log-likelihood function with suitably added regularization terms. We use PAFit to investigate the interplay between preferential attachment and node fitness processes in a Facebook wall-post network. While we uncover evidence for both preferential attachment and node fitness, thus validating the hypothesis that these processes together drive complex network evolution, we also find that node fitness plays the bigger role in determining the degree of a node. This is the first validation of its kind on real-world network data. But surprisingly the rate of preferential attachment is found to deviate from the conventional log-linear form when node fitness is taken into account. The proposed method is implemented in the R package PAFit. PMID:27601314
Solving Coupled Gross-Pitaevskii Equations on a Cluster of PlayStation 3 Computers

NASA Astrophysics Data System (ADS)

Edwards, Mark; Heward, Jeffrey; Clark, C. W.

2009-05-01

At Georgia Southern University we have constructed an 8+1-node cluster of Sony PlayStation 3 (PS3) computers with the intention of using this computing resource to solve problems related to the behavior of ultra-cold atoms in general, with a particular emphasis on studying Bose-Bose and Bose-Fermi mixtures confined in optical lattices. As a first project that uses this computing resource, we have implemented a parallel solver of the coupled time-dependent, one-dimensional Gross-Pitaevskii (TDGP) equations. These equations govern the behavior of dual-species bosonic mixtures. We chose the split-operator/FFT method to solve the coupled 1D TDGP equations. The fast Fourier transform component of this solver can be readily parallelized on the PS3 CPU, known as the Cell Broadband Engine (CellBE). Each CellBE chip contains a single 64-bit PowerPC Processor Element known as the PPE and eight "Synergistic Processor Elements" known as SPEs. We report on this algorithm and compare its performance to a non-parallel solver as applied to modeling evaporative cooling in dual-species bosonic mixtures.
FPGA-accelerated algorithm for the regular expression matching system

NASA Astrophysics Data System (ADS)

Russek, P.; Wiatr, K.

2015-01-01

This article describes an algorithm to support a regular expressions matching system. The goal was to achieve an attractive performance system with low energy consumption. The basic idea of the algorithm comes from a concept of the Bloom filter. It starts from the extraction of static sub-strings for strings of regular expressions. The algorithm is devised to gain from its decomposition into parts which are intended to be executed by custom hardware and the central processing unit (CPU). The pipelined custom processor architecture is proposed and a software algorithm explained accordingly. The software part of the algorithm was coded in C and runs on a processor from the ARM family. The hardware architecture was described in VHDL and implemented in field programmable gate array (FPGA). The performance results and required resources of the above experiments are given. An example of target application for the presented solution is computer and network security systems. The idea was tested on nearly 100,000 body-based viruses from the ClamAV virus database. The solution is intended for the emerging technology of clusters of low-energy computing nodes.
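The Bloom-filter idea behind this record can be sketched as a cheap pre-filter: fixed-width static substrings taken from the patterns are inserted into the filter, and only input windows that hit the filter are passed on for full matching. The code below is a hedged illustration; the class, the SHA-256-based hashing, the filter size, and the 4-byte window are our assumptions and do not describe the authors' hardware pipeline.

```python
import hashlib
from typing import Iterator, List

class BloomFilter:
    """Small Bloom filter over byte strings (no false negatives, rare false positives)."""

    def __init__(self, n_bits: int = 1024, n_hashes: int = 3) -> None:
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = 0  # bit array stored in one Python integer

    def _positions(self, item: bytes) -> Iterator[int]:
        # Derive n_hashes bit positions by salting SHA-256 with a one-byte seed.
        for seed in range(self.n_hashes):
            digest = hashlib.sha256(bytes([seed]) + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.n_bits

    def add(self, item: bytes) -> None:
        for p in self._positions(item):
            self.bits |= 1 << p

    def __contains__(self, item: bytes) -> bool:
        return all((self.bits >> p) & 1 for p in self._positions(item))

def candidate_offsets(data: bytes, bloom: BloomFilter, width: int) -> List[int]:
    """Offsets whose width-byte window may be a known static substring."""
    return [i for i in range(len(data) - width + 1) if data[i:i + width] in bloom]

WIDTH = 4
bloom = BloomFilter()
for static_substring in (b"EVIL", b"X5O!"):   # hypothetical pattern fragments
    bloom.add(static_substring)

data = b"header....EVILpayload"
hits = candidate_offsets(data, bloom, WIDTH)
print(hits)
```

Every offset in `hits` still has to be confirmed by a full regular-expression match, since a Bloom filter can report false positives; what it guarantees is that no true occurrence is skipped.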
Performance Analysis of a Hybrid Overset Multi-Block Application on Multiple Architectures

NASA Technical Reports Server (NTRS)

Djomehri, M. Jahed; Biswas, Rupak

2003-01-01

This paper presents a detailed performance analysis of a multi-block overset grid computational fluid dynamics application on multiple state-of-the-art computer architectures. The application is implemented using a hybrid MPI+OpenMP programming paradigm that exploits both coarse and fine-grain parallelism; the former via MPI message passing and the latter via OpenMP directives. The hybrid model also extends the applicability of multi-block programs to large clusters of SMP nodes by overcoming the restriction that the number of processors be less than the number of grid blocks. A key kernel of the application, namely the LU-SGS linear solver, had to be modified to enhance the performance of the hybrid approach on the target machines. Investigations were conducted on cacheless Cray SX6 vector processors, cache-based IBM Power3 and Power4 architectures, and single system image SGI Origin3000 platforms. Overall results for complex vortex dynamics simulations demonstrate that the SX6 achieves the highest performance and outperforms the RISC-based architectures; however, the best scaling performance was achieved on the Power3.
Monitoring Data-Structure Evolution in Distributed Message-Passing Programs

NASA Technical Reports Server (NTRS)

Sarukkai, Sekhar R.; Beers, Andrew; Woodrow, Thomas S. (Technical Monitor)

1996-01-01

Monitoring the evolution of data structures in parallel and distributed programs is critical for debugging their semantics and performance. However, the current state of the art in tracking and presenting data-structure information in parallel and distributed environments is cumbersome and does not scale. In this paper we present a methodology that automatically tracks memory bindings (not the actual contents) of static and dynamic data structures of message-passing C programs, using PVM. With the help of a number of examples we show that in addition to determining the impact of memory allocation overheads on program performance, graphical views can help in debugging the semantics of program execution. Scalable animations of virtual address bindings of source-level data structures are used for debugging the semantics of parallel programs across all processors. In conjunction with light-weight core files, this technique can be used to complement traditional debuggers on single processors. Detailed information (such as data-structure contents) on specific nodes can be determined using traditional debuggers after the data-structure evolution leading to the semantic error is observed graphically.
Global synchronization algorithms for the Intel iPSC/860

NASA Technical Reports Server (NTRS)

Seidel, Steven R.; Davis, Mark A.

1992-01-01

In a distributed memory multicomputer that has no global clock, global processor synchronization can only be achieved through software. Global synchronization algorithms are used in tridiagonal systems solvers, CFD codes, sequence comparison algorithms, and sorting algorithms. They are also useful for event simulation, debugging, and for solving mutual exclusion problems. For the Intel iPSC/860 in particular, global synchronization can be used to ensure the most effective use of the communication network for operations such as the shift, where each processor in a one-dimensional array or ring concurrently sends a message to its right (or left) neighbor. Three global synchronization algorithms are considered for the iPSC/860: the gsync() primitive provided by Intel, the PICL primitive sync0(), and a new recursive doubling synchronization (RDS) algorithm. The performance of these algorithms is compared to the performance predicted by communication models of both the long and forced message protocols. Measurements of the cost of shift operations preceded by global synchronization show that the RDS algorithm always synchronizes the nodes more precisely and costs only slightly more than the other two algorithms.
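The abstract names a recursive doubling synchronization algorithm without describing it. The sketch below simulates the standard recursive doubling exchange pattern, which we assume the RDS algorithm follows: in round r, node i exchanges a message with node i XOR 2^r, so after log2(P) rounds every node has heard, directly or transitively, from every other node. The function name and the power-of-two restriction are our simplifications.

```python
from typing import List, Set, Tuple

def recursive_doubling(n_nodes: int) -> Tuple[int, List[Set[int]]]:
    """Simulate the exchange pattern and track which arrivals each node knows about."""
    if n_nodes <= 0 or (n_nodes & (n_nodes - 1)) != 0:
        raise ValueError("this sketch assumes a power-of-two node count")
    known = [{i} for i in range(n_nodes)]   # each node initially knows only itself
    rounds = 0
    distance = 1
    while distance < n_nodes:
        # Node i pairs with node i XOR distance; both merge what they know.
        known = [known[i] | known[i ^ distance] for i in range(n_nodes)]
        distance *= 2
        rounds += 1
    return rounds, known

rounds, known = recursive_doubling(8)
print(rounds, all(k == set(range(8)) for k in known))
```

On a hypercube such as the iPSC/860, each partner `i ^ distance` is a direct neighbor, so every round uses a single network hop.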
GROMACS 4: Algorithms for Highly Efficient, Load-Balanced, and Scalable Molecular Simulation.

PubMed

Hess, Berk; Kutzner, Carsten; van der Spoel, David; Lindahl, Erik

2008-03-01

Molecular simulation is an extremely useful, but computationally very expensive tool for studies of chemical and biomolecular systems. Here, we present a new implementation of our molecular simulation toolkit GROMACS which now both achieves extremely high performance on single processors from algorithmic optimizations and hand-coded routines and simultaneously scales very well on parallel machines. The code encompasses a minimal-communication domain decomposition algorithm, full dynamic load balancing, a state-of-the-art parallel constraint solver, and efficient virtual site algorithms that allow removal of hydrogen atom degrees of freedom to enable integration time steps up to 5 fs for atomistic simulations also in parallel. To improve the scaling properties of the common particle mesh Ewald electrostatics algorithms, we have in addition used a Multiple-Program, Multiple-Data approach, with separate node domains responsible for direct and reciprocal space interactions. Not only does this combination of algorithms enable extremely long simulations of large systems but also it provides that simulation performance on quite modest numbers of standard cluster nodes.
T-L Plane Abstraction-Based Energy-Efficient Real-Time Scheduling for Multi-Core Wireless Sensors

PubMed Central

Kim, Youngmin; Lee, Ki-Seong; Pham, Ngoc-Son; Lee, Sun-Ro; Lee, Chan-Gun

2016-01-01

Energy efficiency is considered as a critical requirement for wireless sensor networks. As more wireless sensor nodes are equipped with multi-cores, there are emerging needs for energy-efficient real-time scheduling algorithms. The T-L plane-based scheme is known to be an optimal global scheduling technique for periodic real-time tasks on multi-cores. Unfortunately, there has been a scarcity of studies on extending T-L plane-based scheduling algorithms to exploit energy-saving techniques. In this paper, we propose a new T-L plane-based algorithm enabling energy-efficient real-time scheduling on multi-core sensor nodes with dynamic power management (DPM). Our approach addresses the overhead of processor mode transitions and reduces fragmentations of the idle time, which are inherent in T-L plane-based algorithms. Our experimental results show the effectiveness of the proposed algorithm compared to other energy-aware scheduling methods on T-L plane abstraction. PMID:27399722
IGA-ADS: Isogeometric analysis FEM using ADS solver

NASA Astrophysics Data System (ADS)

Łoś, Marcin M.; Woźniak, Maciej; Paszyński, Maciej; Lenharth, Andrew; Hassaan, Muhamm Amber; Pingali, Keshav

2017-08-01

In this paper we present a fast explicit solver for solution of non-stationary problems using L2 projections with isogeometric finite element method. The solver has been implemented within GALOIS framework. It enables parallel multi-core simulations of different time-dependent problems, in 1D, 2D, or 3D. We have prepared the solver framework in a way that enables direct implementation of the selected PDE and corresponding boundary conditions. In this paper we describe the installation, implementation of three exemplary PDEs, and execution of the simulations on multi-core Linux cluster nodes. We consider three case studies, including heat transfer, linear elasticity, as well as non-linear flow in heterogeneous media. The presented package generates output suitable for interfacing with Gnuplot and ParaView visualization software. The exemplary simulations show near perfect scalability on the Gilbert shared-memory node with four Intel® Xeon® CPU E7-4860 processors, each possessing 10 physical cores (for a total of 40 cores).
Distributed Two-Dimensional Fourier Transforms on DSPs with an Application for Phase Retrieval

NASA Technical Reports Server (NTRS)

Smith, Jeffrey Scott

2006-01-01

Many applications of two-dimensional Fourier Transforms require fixed timing as defined by system specifications. One example is image-based wavefront sensing. The image-based approach has many benefits, yet it is a computationally intensive solution for adaptive optic correction, where optical adjustments are made in real-time to correct for external (atmospheric turbulence) and internal (stability) aberrations, which cause image degradation. For phase retrieval, a type of image-based wavefront sensing, numerous two-dimensional Fast Fourier Transforms (FFTs) are used. To meet the required real-time specifications, a distributed system is needed, and thus, the 2-D FFT necessitates an all-to-all communication among the computational nodes. The 1-D floating point FFT is very efficient on a digital signal processor (DSP). For this study, several architectures that address the all-to-all communication with DSPs are presented and analyzed. Emphasis of this research is on a 64-node cluster of Analog Devices TigerSharc TS-101 DSPs.
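The reason a distributed 2-D FFT needs an all-to-all exchange can be seen from the row-column decomposition: each node transforms the rows it holds, the data are transposed across nodes, and each node then transforms what were originally columns. The single-process sketch below uses a naive DFT in place of an FFT and a local `transpose` in place of the network exchange; all names are illustrative and nothing here is specific to the TigerSharc cluster.

```python
import cmath
from typing import List, Sequence

Matrix = List[List[complex]]

def dft(x: Sequence[complex]) -> List[complex]:
    """Naive 1-D discrete Fourier transform (stands in for a per-node 1-D FFT)."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * cmath.pi * j * k / n) for k in range(n))
            for j in range(n)]

def transpose(m: Sequence[Sequence[complex]]) -> Matrix:
    """Local transpose; in a distributed system this is the all-to-all exchange."""
    return [list(col) for col in zip(*m)]

def dft2_row_column(image: Sequence[Sequence[complex]]) -> Matrix:
    rows_done = [dft(row) for row in image]       # step 1: transform every row
    exchanged = transpose(rows_done)              # step 2: all-to-all (transpose)
    cols_done = [dft(col) for col in exchanged]   # step 3: transform original columns
    return transpose(cols_done)                   # restore [row][column] layout

spectrum = dft2_row_column([[1, 2], [3, 4]])
print([[round(abs(v), 6) for v in row] for row in spectrum])
```

Steps 1 and 3 need no communication at all, so the transpose in step 2 is where the interconnect architecture matters.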
Data decomposition of Monte Carlo particle transport simulations via tally servers

DOE Office of Scientific and Technical Information (OSTI.GOV)

Romano, Paul K.; Siegel, Andrew R.; Forget, Benoit

An algorithm for decomposing large tally data in Monte Carlo particle transport simulations is developed, analyzed, and implemented in a continuous-energy Monte Carlo code, OpenMC. The algorithm is based on a non-overlapping decomposition of compute nodes into tracking processors and tally servers. The former are used to simulate the movement of particles through the domain while the latter continuously receive and update tally data. A performance model for this approach is developed, suggesting that, for a range of parameters relevant to LWR analysis, the tally server algorithm should perform with minimal overhead on contemporary supercomputers. An implementation of the algorithm in OpenMC is then tested on the Intrepid and Titan supercomputers, supporting the key predictions of the model over a wide range of parameters. We thus conclude that the tally server algorithm is a successful approach to circumventing classical on-node memory constraints en route to unprecedentedly detailed Monte Carlo reactor simulations.
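The division of labor between tracking processors and tally servers can be sketched as follows. This is an assumption-laden illustration rather than OpenMC code: the block partition of tally bins, the `TallyServer` class, and `send_scores` are hypothetical, and ordinary function calls stand in for the messages that would cross the network.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

class TallyServer:
    """Owns one contiguous slice of the global tally array."""

    def __init__(self, first_bin: int, n_bins: int) -> None:
        self.first_bin = first_bin
        self.totals = [0.0] * max(n_bins, 0)

    def receive(self, scores: List[Tuple[int, float]]) -> None:
        # Accumulate (global bin index, score) pairs into the local slice.
        for bin_index, value in scores:
            self.totals[bin_index - self.first_bin] += value

def make_servers(n_bins: int, n_servers: int) -> Tuple[List[TallyServer], int]:
    block = -(-n_bins // n_servers)   # ceiling division: bins per server
    servers = [TallyServer(s * block, min(block, n_bins - s * block))
               for s in range(n_servers)]
    return servers, block

def send_scores(scores: List[Tuple[int, float]],
                servers: List[TallyServer], block: int) -> None:
    """Tracking-processor side: group buffered scores by owning server, then send."""
    outgoing: Dict[int, List[Tuple[int, float]]] = defaultdict(list)
    for bin_index, value in scores:
        outgoing[bin_index // block].append((bin_index, value))
    for server_id, batch in outgoing.items():
        servers[server_id].receive(batch)   # stands in for one network message

servers, block = make_servers(n_bins=10, n_servers=4)
send_scores([(0, 1.0), (4, 2.0), (4, 0.5), (9, 3.0)], servers, block)
print([s.totals for s in servers])
```

Because no tracking processor holds the full tally array, the memory needed per tracking node no longer grows with the number of tally bins.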

Seeing the forest for the trees: Networked workstations as a parallel processing computer

NASA Technical Reports Server (NTRS)

Breen, J. O.; Meleedy, D. M.

1992-01-01

Unlike traditional 'serial' processing computers in which one central processing unit performs one instruction at a time, parallel processing computers contain several processing units, thereby performing several instructions at once. Many of today's fastest supercomputers achieve their speed by employing thousands of processing elements working in parallel. Few institutions can afford these state-of-the-art parallel processors, but many already have the makings of a modest parallel processing system. Workstations on existing high-speed networks can be harnessed as nodes in a parallel processing environment, bringing the benefits of parallel processing to many. While such a system cannot rival the industry's latest machines, many common tasks can be accelerated greatly by spreading the processing burden and exploiting idle network resources. We study several aspects of this approach, from algorithms to select nodes to speed gains in specific tasks. With ever-increasing volumes of astronomical data, it becomes all the more necessary to utilize our computing resources fully.
Global-view coefficients: a data management solution for parallel quantum Monte Carlo applications

DOE Office of Scientific and Technical Information (OSTI.GOV)

Niu, Qingpeng; Dinan, James; Tirukkovalur, Sravya

2016-01-28

Quantum Monte Carlo (QMC) applications perform simulation with respect to an initial state of the quantum mechanical system, which is often captured by using a cubic B-spline basis. This representation is stored as a read-only table of coefficients and accesses to the table are generated at random as part of the Monte Carlo simulation. Current QMC applications, such as QWalk and QMCPACK, replicate this table at every process or node, which limits scalability because increasing the number of processors does not enable larger systems to be run. We present a partitioned global address space approach to transparently managing this data using Global Arrays in a manner that allows the memory of multiple nodes to be aggregated. We develop an automated data management system that significantly reduces communication overheads, enabling new capabilities for QMC codes. Experimental results with QWalk and QMCPACK demonstrate the effectiveness of the data management system.
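The contrast between replicating the coefficient table and partitioning it can be shown with a toy global-to-local index mapping. The sketch below does not use the Global Arrays API; `PartitionedTable`, its block distribution, and the remote-read counter are hypothetical and serve only to illustrate how one global index resolves to an owning node and a local offset.

```python
from typing import List, Sequence, Tuple

class PartitionedTable:
    """Read-only table split into contiguous blocks, one block per node."""

    def __init__(self, values: Sequence[float], n_nodes: int) -> None:
        self.block = -(-len(values) // n_nodes)   # ceiling division
        self.shards: List[List[float]] = [
            list(values[n * self.block:(n + 1) * self.block]) for n in range(n_nodes)
        ]
        self.remote_reads = 0

    def locate(self, global_index: int) -> Tuple[int, int]:
        """Map a global index to (owning node, offset within that node's shard)."""
        return global_index // self.block, global_index % self.block

    def read(self, global_index: int, from_node: int) -> float:
        owner, offset = self.locate(global_index)
        if owner != from_node:
            self.remote_reads += 1   # would be a one-sided get over the network
        return self.shards[owner][offset]

table = PartitionedTable([c * 0.5 for c in range(10)], n_nodes=4)
print(table.locate(7), table.read(7, from_node=0), table.read(1, from_node=0),
      table.remote_reads)
```

Each node stores roughly 1/P of the table instead of all of it, at the price of remote reads for randomly generated accesses, which is the communication overhead the abstract says the data management system reduces.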
Augmenting computer networks

NASA Technical Reports Server (NTRS)

Bokhari, S. H.; Raza, A. D.

1984-01-01

Three methods of augmenting computer networks by adding at most one link per processor are discussed: (1) A tree of N nodes may be augmented such that the resulting graph has diameter no greater than $4\log_2((N+2)/3) - 2$. This $O(N^3)$ algorithm can be applied to any spanning tree of a connected graph to reduce the diameter of that graph to O(log N); (2) Given a binary tree T and a chain C of N nodes each, C may be augmented to produce C' so that T is a subgraph of C'. This algorithm is O(N) and may be used to produce augmented chains or rings that have diameter no greater than $2\log_2((N+2)/3)$ and are planar; (3) Any rectangular two-dimensional 4 (8) nearest neighbor array of size $N = 2^k$ may be augmented so that it can emulate a single step shuffle-exchange network of size N/2 in 3(t) time steps.
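A quick numerical reading of the two diameter bounds, taken as 4 log2((N+2)/3) - 2 for augmented trees and 2 log2((N+2)/3) for augmented chains, shows how much one extra link per processor can buy. The helper names below are ours, and the comparison with an unaugmented chain (diameter N - 1) is added only for scale.

```python
import math

def augmented_tree_bound(n: int) -> float:
    """Upper bound on the diameter of an augmented tree of n nodes (method 1)."""
    return 4 * math.log2((n + 2) / 3) - 2

def augmented_chain_bound(n: int) -> float:
    """Upper bound on the diameter of an augmented chain or ring of n nodes (method 2)."""
    return 2 * math.log2((n + 2) / 3)

n = 22
print(n - 1, augmented_tree_bound(n), augmented_chain_bound(n))
```

For N = 22 a plain chain has diameter 21, while the bounds after augmentation are 10 and 6 respectively, reflecting the drop from linear to logarithmic growth.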
A game theory approach to target tracking in sensor networks.

PubMed

Gu, Dongbing

2011-02-01

In this paper, we investigate a moving-target tracking problem with sensor networks. Each sensor node has a sensor to observe the target and a processor to estimate the target position. It also has wireless communication capability but with limited range and can only communicate with neighbors. The moving target is assumed to be an intelligent agent, which is "smart" enough to escape from the detection by maximizing the estimation error. This adversary behavior makes the target tracking problem more difficult. We formulate this target estimation problem as a zero-sum game in this paper and use a minimax filter to estimate the target position. The minimax filter is a robust filter that minimizes the estimation error by considering the worst case noise. Furthermore, we develop a distributed version of the minimax filter for multiple sensor nodes. The distributed computation is implemented via modeling the information received from neighbors as measurements in the minimax filter. The simulation results show that the target tracking algorithm proposed in this paper provides a satisfactory result.
Development and evaluation of a Fault-Tolerant Multiprocessor (FTMP) computer. Volume 2: FTMP software

NASA Technical Reports Server (NTRS)

Lala, J. H.; Smith, T. B., III

1983-01-01

The software developed for the Fault-Tolerant Multiprocessor (FTMP) is described. The FTMP executive is a timer-interrupt driven dispatcher that schedules iterative tasks which run at 3.125, 12.5, and 25 Hz. Major tasks which run under the executive include system configuration control, flight control, and display. The flight control task includes autopilot and autoland functions for a jet transport aircraft. System Displays include status displays of all hardware elements (processors, memories, I/O ports, buses), failure log displays showing transient and hard faults, and an autopilot display. All software is in a higher order language (AED, an ALGOL derivative). The executive is a fully distributed general purpose executive which automatically balances the load among available processor triads. Provisions for graceful performance degradation under processing overload are an integral part of the scheduling algorithms.
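A dispatcher that runs tasks at 3.125, 12.5, and 25 Hz can be pictured as a frame counter advanced by the timer interrupt. The Python sketch below is a generic rate-group scheduler written under that assumption; the group names and the choice of a 25 Hz base frame are illustrative and are not taken from the FTMP executive itself.

```python
BASE_HZ = 25.0  # assumed frame rate: one timer interrupt per frame

# Rates from the abstract; the group names are hypothetical labels.
RATES_HZ = {"fast": 25.0, "medium": 12.5, "slow": 3.125}


def due_groups(frame):
    """Return the rate groups that should be dispatched in this frame."""
    due = []
    for name, hz in RATES_HZ.items():
        period = int(BASE_HZ / hz)  # 1, 2, and 8 frames respectively
        if frame % period == 0:
            due.append(name)
    return due


schedule = [due_groups(frame) for frame in range(8)]
```

Over one 8-frame cycle the fast group runs 8 times, the medium group 4 times, and the slow group once, which reproduces the 25 : 12.5 : 3.125 ratio.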
Energy-Efficient Implementation of ECDH Key Exchange for Wireless Sensor Networks

NASA Astrophysics Data System (ADS)

Lederer, Christian; Mader, Roland; Koschuch, Manuel; Großschädl, Johann; Szekely, Alexander; Tillich, Stefan

Wireless Sensor Networks (WSNs) are playing a vital role in an ever-growing number of applications ranging from environmental surveillance over medical monitoring to home automation. Since WSNs are often deployed in unattended or even hostile environments, they can be subject to various malicious attacks, including the manipulation and capture of nodes. The establishment of a shared secret key between two or more individual nodes is one of the most important security services needed to guarantee the proper functioning of a sensor network. Despite some recent advances in this field, the efficient implementation of cryptographic key establishment for WSNs remains a challenge due to the resource constraints of small sensor nodes such as the MICAz mote. In this paper we present a lightweight implementation of the elliptic curve Diffie-Hellman (ECDH) key exchange for ZigBee-compliant sensor nodes equipped with an ATmega128 processor running the TinyOS operating system. Our implementation uses a 192-bit prime field specified by the NIST as underlying algebraic structure and requires only 5.20 · 10^6 clock cycles to compute a scalar multiplication if the base point is fixed and known a priori. A scalar multiplication using a random base point takes about 12.33 · 10^6 cycles. Our results show that a full ECDH key exchange between two MICAz motes consumes an energy of 57.33 mJ (including radio communication), which is significantly better than most previously reported ECDH implementations on comparable platforms.
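The structure of an ECDH exchange (one fixed-base scalar multiplication to produce a public key, one random-base multiplication to derive the shared secret) can be shown on a toy curve. The Python sketch below assumes the small textbook curve y^2 = x^3 + 2x + 2 over GF(17) with base point (5, 1); it is not the 192-bit NIST field used in the paper, offers no security, and requires Python 3.8 or later for the modular inverse via `pow`.

```python
# Toy parameters (assumption for illustration only, not NIST P-192).
P, A = 17, 2        # field prime and curve coefficient a
G = (5, 1)          # base point on y^2 = x^3 + 2x + 2 over GF(17)


def point_add(p1, p2):
    """Add two curve points; None represents the point at infinity."""
    if p1 is None:
        return p2
    if p2 is None:
        return p1
    x1, y1 = p1
    x2, y2 = p2
    if x1 == x2 and (y1 + y2) % P == 0:
        return None  # p2 is the inverse of p1
    if p1 == p2:
        slope = (3 * x1 * x1 + A) * pow(2 * y1 % P, -1, P) % P
    else:
        slope = (y2 - y1) * pow((x2 - x1) % P, -1, P) % P
    x3 = (slope * slope - x1 - x2) % P
    y3 = (slope * (x1 - x3) - y1) % P
    return (x3, y3)


def scalar_mult(k, point):
    """Compute k * point with the double-and-add method."""
    result, addend = None, point
    while k > 0:
        if k & 1:
            result = point_add(result, addend)
        addend = point_add(addend, addend)
        k >>= 1
    return result


alice_secret, bob_secret = 3, 7                 # hypothetical private keys
alice_public = scalar_mult(alice_secret, G)     # fixed base point
bob_public = scalar_mult(bob_secret, G)
shared_a = scalar_mult(alice_secret, bob_public)  # random base point
shared_b = scalar_mult(bob_secret, alice_public)
```

Both parties arrive at the same point because a(bG) = b(aG). The two multiplications per node correspond to the fixed-base and random-base cycle counts reported in the abstract.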
Interactive Planning for Capability Driven Air & Space Operations

DTIC Science & Technology

2008-04-30

Time, Routledge and Kegan, London, UK, 1980. [5] A. Bochman, Concerted instant–interval temporal semantics I: Temporal ontologies, Notre Dame Journal...then return true else deleteStatement (X, rj, Y) end if end for return false Figure 8 shows the search space for instance in Table 2. The green ...nodes are those for which the set of relations corresponding to the path from the root form a consistent set. A path from root to a green leaf node
Design of the Bus Interface Unit for the Distributed Processor/Memory System.

DTIC Science & Technology

1976-12-01

microroutine flowchart developed. Once this had been done, a high-speed, flexible microprocessor that would be adaptable to a hardware...routine) was translated into microcode and provide the mnemonic code and flowchart, Chapter V summarizes and discusses actual system construction...Fig. 11. This diagram shows that the BIU is driven by interrupt stimuli which select the beginning address of the appropriate microroutine rather
Architecture of a Message-Driven Processor,

DTIC Science & Technology

1987-11-01

Jon Kaplan, Paul Song, Brian Totty, and Scott Wills Artificial Intelligence Laboratory, Laboratory for Computer Science Massachusetts Institute of...Information Dally, Chao, Chien, Hassoun, Horwat, Kaplan, Song, Totty & Wills: Artificial Intelligence Laboratory and Laboratory for Computer Science, MIT...applied to a problem if we could are 36 bits long (32 data bits + 4 tag bits) and are used to hold efficiently run programs with a granularity of 5s
A Diagnostic System for Studying Energy Partitioning and Assessing the Response of the Ionosphere during HAARP Modification Experiments

NASA Technical Reports Server (NTRS)

Djuth, Frank T.; Elder, John H.; Williams, Kenneth L.

1996-01-01

This research program focused on the construction of several key radio wave diagnostics in support of the HF Active Auroral Ionospheric Research Program (HAARP). Project activities led to the design, development, and fabrication of a variety of hardware units and to the development of several menu-driven software packages for data acquisition and analysis. The principal instrumentation includes an HF (28 MHz) radar system, a VHF (50 MHz) radar system, and a high-speed radar processor consisting of three separable processing units. The processor system supports the HF and VHF radars and is capable of acquiring very detailed data with large incoherent scatter radars. In addition, a tunable HF receiver system having high dynamic range was developed primarily for measurements of stimulated electromagnetic emissions (SEE). A separate processor unit was constructed for the SEE receiver. Finally, a large amount of support instrumentation was developed to accommodate complex field experiments. Overall, the HAARP diagnostics are powerful tools for studying diverse ionospheric modification phenomena. They are also flexible enough to support a host of other missions beyond the scope of HAARP. Many new research programs have been initiated by applying the HAARP diagnostics to studies of natural atmospheric processes.
Generic element processor (application to nonlinear analysis)

NASA Technical Reports Server (NTRS)

Stanley, Gary

1989-01-01

The focus here is on one aspect of the Computational Structural Mechanics (CSM) Testbed: finite element technology. The approach involves a Generic Element Processor: a command-driven, database-oriented software shell that facilitates introduction of new elements into the testbed. This shell features an element-independent corotational capability that upgrades linear elements to geometrically nonlinear analysis, and corrects the rigid-body errors that plague many contemporary plate and shell elements. Specific elements that have been implemented in the Testbed via this mechanism include the Assumed Natural-Coordinate Strain (ANS) shell elements, developed with Professor K. C. Park (University of Colorado, Boulder), a new class of curved hybrid shell elements, developed by Dr. David Kang of LPARL (formerly a student of Professor T. Pian), other shell and solid hybrid elements developed by NASA personnel, and recently a repackaged version of the workhorse shell element used in the traditional STAGS nonlinear shell analysis code. The presentation covers: (1) user and developer interfaces to the generic element processor, (2) an explanation of the built-in corotational option, (3) a description of some of the shell-elements currently implemented, and (4) application to sample nonlinear shell postbuckling problems.
Test Platform for Advanced Digital Control of Brushless DC Motors (MSFC Center Director's Discretionary Fund)

NASA Technical Reports Server (NTRS)

Gwaltney, D. A.

2002-01-01

A FY 2001 Center Director's Discretionary Fund task to develop a test platform for the development, implementation, and evaluation of adaptive and other advanced control techniques for brushless DC (BLDC) motor-driven mechanisms is described. Important applications for BLDC motor-driven mechanisms are the translation of specimens in microgravity experiments and electromechanical actuation of nozzle and fuel valves in propulsion systems. Motor-driven aerocontrol surfaces are also being utilized in developmental X vehicles. The experimental test platform employs a linear translation stage that is mounted vertically and driven by a BLDC motor. Control approaches are implemented on a digital signal processor-based controller for real-time, closed-loop control of the stage carriage position. The goal of the effort is to explore the application of advanced control approaches that can enhance the performance of a motor-driven actuator over the performance obtained using linear control approaches with fixed gains. Adaptive controllers utilizing an exact model knowledge controller and a self-tuning controller are implemented and the control system performance is illustrated through the presentation of experimental results.
Multi-processor including data flow accelerator module

DOEpatents

Davidson, George S.; Pierce, Paul E.

1990-01-01

An accelerator module for a data flow computer includes an intelligent memory. The module is added to a multiprocessor arrangement and uses a shared tagged memory architecture in the data flow computer. The intelligent memory module assigns locations for holding data values in correspondence with arcs leading to a node in a data dependency graph. Each primitive computation is associated with a corresponding memory cell, including a number of slots for operands needed to execute a primitive computation, a primitive identifying pointer, and linking slots for distributing the result of the cell computation to other cells requiring that result as an operand. Circuitry is provided for utilizing tag bits to determine automatically when all operands required by a processor are available and for scheduling the primitive for execution in a queue. Each memory cell of the module may be associated with any of the primitives, and the particular primitive to be executed by the processor associated with the cell is identified by providing an index, such as the cell number for the primitive, to the primitive lookup table of starting addresses. The module thus serves to perform functions previously performed by a number of sections of data flow architectures and coexists with conventional shared memory therein. A multiprocessing system including the module operates in a hybrid mode, wherein the same processing modules are used to perform some processing in a sequential mode, under immediate control of an operating system, while performing other processing in a data flow mode.
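The firing rule described here (a memory cell with operand slots, tag bits that mark which operands have arrived, and linking slots that forward the result) can be sketched in a few lines of Python. The class and function names are hypothetical, and the sketch models only the scheduling behavior in software, not the patented hardware.

```python
from collections import deque


class Cell:
    """One memory cell: a primitive, operand slots, tag bits, and link targets."""

    def __init__(self, primitive, n_operands, targets=()):
        self.primitive = primitive
        self.slots = [None] * n_operands
        self.tags = [False] * n_operands   # True once the operand has arrived
        self.targets = list(targets)       # (cell name, slot index) pairs


def run(cells, tokens):
    """Deliver initial tokens, then execute cells as their operands fill."""
    ready = deque()   # queue of cells whose tag bits are all set
    results = {}

    def deliver(name, slot, value):
        cell = cells[name]
        cell.slots[slot] = value
        cell.tags[slot] = True
        if all(cell.tags):
            ready.append(name)

    for name, slot, value in tokens:
        deliver(name, slot, value)
    while ready:
        name = ready.popleft()
        cell = cells[name]
        value = cell.primitive(*cell.slots)
        cell.tags = [False] * len(cell.tags)   # clear tags after firing
        results[name] = value
        for target, slot in cell.targets:
            deliver(target, slot, value)
    return results


# Data dependency graph for (a + b) * (c - d).
cells = {
    "add": Cell(lambda x, y: x + y, 2, [("mul", 0)]),
    "sub": Cell(lambda x, y: x - y, 2, [("mul", 1)]),
    "mul": Cell(lambda x, y: x * y, 2),
}
results = run(cells, [("add", 0, 2), ("add", 1, 3), ("sub", 0, 10), ("sub", 1, 4)])
```

With a = 2, b = 3, c = 10, d = 4, the `mul` cell fires only after both `add` and `sub` have delivered their results, and yields 30. No program counter orders the operations; availability of operands does.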
Present Status and Extensions of the Monte Carlo Performance Benchmark

NASA Astrophysics Data System (ADS)

Hoogenboom, J. Eduard; Petrovic, Bojan; Martin, William R.

2014-06-01

The NEA Monte Carlo Performance benchmark started in 2011 aiming to monitor over the years the abilities to perform a full-size Monte Carlo reactor core calculation with a detailed power production for each fuel pin with axial distribution. This paper gives an overview of the contributed results thus far. It shows that reaching a statistical accuracy of 1 % for most of the small fuel zones requires about 100 billion neutron histories. The efficiency of parallel execution of Monte Carlo codes on a large number of processor cores shows clear limitations for computer clusters with common type computer nodes. However, using true supercomputers the speedup of parallel calculations is increasing up to large numbers of processor cores. More experience is needed from calculations on true supercomputers using large numbers of processors in order to predict if the requested calculations can be done in a short time. As the specifications of the reactor geometry for this benchmark test are well suited for further investigations of full-core Monte Carlo calculations and a need is felt for testing other issues than its computational performance, proposals are presented for extending the benchmark to a suite of benchmark problems for evaluating fission source convergence for a system with a high dominance ratio, for coupling with thermal-hydraulics calculations to evaluate the use of different temperatures and coolant densities and to study the correctness and effectiveness of burnup calculations. Moreover, other contemporary proposals for a full-core calculation with realistic geometry and material composition will be discussed.
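The cost of tighter statistics follows from the usual Monte Carlo behavior that the relative statistical error falls as one over the square root of the number of histories. The Python sketch below extrapolates from the figure quoted in the abstract (about 100 billion histories for 1 %); the function name is hypothetical, and the scaling is a general rule of thumb rather than a result taken from the paper.

```python
REFERENCE_ERROR_PERCENT = 1.0   # accuracy quoted in the abstract
REFERENCE_HISTORIES = 100e9     # about 100 billion neutron histories


def histories_needed(target_error_percent):
    """Histories required, assuming error is proportional to 1/sqrt(N)."""
    ratio = REFERENCE_ERROR_PERCENT / target_error_percent
    return REFERENCE_HISTORIES * ratio ** 2


half_percent = histories_needed(0.5)   # four times the reference count
two_percent = histories_needed(2.0)    # one quarter of the reference count
```

Halving the error to 0.5 % would need about 400 billion histories under this assumption, which is why parallel scaling dominates the feasibility question.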
Call Admission Control on Single Node Networks under Output Rate-Controlled Generalized Processor Sharing (ORC-GPS) Scheduler

NASA Astrophysics Data System (ADS)

Hanada, Masaki; Nakazato, Hidenori; Watanabe, Hitoshi

Multimedia applications such as music or video streaming, video teleconferencing and IP telephony are flourishing in packet-switched networks. Applications that generate such real-time data can have very diverse quality-of-service (QoS) requirements. In order to guarantee diverse QoS requirements, the combined use of a packet scheduling algorithm based on Generalized Processor Sharing (GPS) and a leaky bucket traffic regulator is the most successful QoS mechanism. GPS can provide a minimum guaranteed service rate for each session and tight delay bounds for leaky bucket constrained sessions. However, the delay bounds for leaky bucket constrained sessions under GPS are unnecessarily large because each session is served according to its associated constant weight until the session buffer is empty. In order to solve this problem, a scheduling policy called Output Rate-Controlled Generalized Processor Sharing (ORC-GPS) was proposed in [17]. ORC-GPS is a rate-based scheduling policy like GPS, and controls the service rate in order to lower the delay bounds for leaky bucket constrained sessions. In this paper, we propose a call admission control (CAC) algorithm for ORC-GPS, for leaky-bucket constrained sessions with deterministic delay requirements. This CAC algorithm for ORC-GPS determines the optimal values of parameters of ORC-GPS from the deterministic delay requirements of the sessions. In numerical experiments, we compare the CAC algorithm for ORC-GPS with one for GPS in terms of schedulable region and computational complexity.
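For plain GPS, the link between weights, guaranteed rates, and delay bounds can be made concrete with a short Python sketch. It assumes the classical single-node result that a session shaped by a leaky bucket with burst size sigma and token rate rho, and guaranteed a rate g of at least rho, sees a delay of at most sigma / g. The function names are hypothetical, and the sketch covers neither ORC-GPS nor the proposed CAC algorithm.

```python
def gps_guaranteed_rates(weights, capacity):
    """Minimum service rate of each session: its weight share of the link."""
    total = sum(weights)
    return [capacity * weight / total for weight in weights]


def gps_delay_bound(sigma, rho, guaranteed_rate):
    """Worst-case delay for a (sigma, rho) leaky-bucket session under GPS."""
    if rho > guaranteed_rate:
        raise ValueError("token rate exceeds the guaranteed rate; no finite bound")
    return sigma / guaranteed_rate


def admissible(sigma, rho, guaranteed_rate, delay_requirement):
    """Simple admission test: does the bound meet the delay requirement?"""
    return rho <= guaranteed_rate and sigma / guaranteed_rate <= delay_requirement


rates = gps_guaranteed_rates([1, 1, 2], capacity=8.0)            # [2.0, 2.0, 4.0]
bound = gps_delay_bound(sigma=4.0, rho=1.0, guaranteed_rate=rates[0])  # 2.0
```

Because the bound depends only on the constant weight share, a session with a tight delay requirement must be given a large weight even when its average rate is low. That coupling is the inefficiency the abstract attributes to GPS.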
Parallel definition of tear film maps on distributed-memory clusters for the support of dry eye diagnosis.

PubMed

González-Domínguez, Jorge; Remeseiro, Beatriz; Martín, María J

2017-02-01

The analysis of the interference patterns on the tear film lipid layer is a useful clinical test to diagnose dry eye syndrome. This task can be automated with a high degree of accuracy by means of the use of tear film maps. However, the time required by the existing applications to generate them prevents a wider acceptance of this method by medical experts. Multithreading has been previously successfully employed by the authors to accelerate the tear film map definition on multicore single-node machines. In this work, we propose a hybrid message-passing and multithreading parallel approach that further accelerates the generation of tear film maps by exploiting the computational capabilities of distributed-memory systems such as multicore clusters and supercomputers. The algorithm for drawing tear film maps is parallelized using Message Passing Interface (MPI) for inter-node communications and the multithreading support available in the C++11 standard for intra-node parallelization. The original algorithm is modified to reduce the communications and increase the scalability. The hybrid method has been tested on 32 nodes of an Intel cluster (with two 12-core Haswell 2680v3 processors per node) using 50 representative images. Results show that maximum runtime is reduced from almost two minutes using the previous only-multithreaded approach to less than ten seconds using the hybrid method. The hybrid MPI/multithreaded implementation can be used by medical experts to obtain tear film maps in only a few seconds, which will significantly accelerate and facilitate the diagnosis of the dry eye syndrome. Copyright © 2016 Elsevier Ireland Ltd. All rights reserved.
The use of isotope injections in sentinel node biopsy for breast cancer: are the 1- and 2-day protocols equally effective?

PubMed

Dodia, Nazera; El-Sharief, Deena; Kirwan, Cliona C

2015-01-01

Sentinel lymph nodes are mapped using (99m)Technetium, injected on day of surgery (1-day protocol) or day before (2-day protocol). This retrospective cohort study compares efficacy between the two protocols. Histopathology for all unilateral sentinel lymph node biopsies (March 2012-March 2013) in a single centre were reviewed. Number of sentinel lymph nodes, non-sentinel lymph nodes and pathology was compared. 2/270 (0.7 %) in 1-day protocol and 8/192 (4 %) in 2-day protocol had no sentinel lymph nodes removed (p = 0.02). The median (range) number of sentinel lymph nodes removed per patient was 2 (0-7) and 1 (0-11) in the 1- and 2-day protocols respectively (p = 0.08). There was a trend for removing more non-sentinel lymph nodes in 2-day protocol [1-day: 52/270 (19 %); 2-day: 50/192 (26 %), p = 0.07]. Using 2-day, sentinel lymph node identification failure rate is higher, although within acceptable rates. The 1 and 2 day protocols are both effective, therefore choice of protocol should be driven by patient convenience and hospital efficiency. However, this study raises the possibility that 1-day may be preferable when higher sentinel lymph node count is beneficial, for example following neoadjuvant chemotherapy.
Status report of the end-to-end ASKAP software system: towards early science operations

NASA Astrophysics Data System (ADS)

Guzman, Juan Carlos; Chapman, Jessica; Marquarding, Malte; Whiting, Matthew

2016-08-01

The Australian SKA Pathfinder (ASKAP) is a novel centimetre radio synthesis telescope currently in the commissioning phase and located in the midwest region of Western Australia. It comprises 36 × 12 m diameter reflector antennas, each equipped with state-of-the-art and award-winning Phased Array Feeds (PAF) technology. The PAFs provide a wide, 30 square degree field-of-view by forming up to 36 separate dual-polarisation beams at once. This results in a high data rate: 70 TB of correlated visibilities in an 8-hour observation, requiring custom-written, high-performance software running in dedicated High Performance Computing (HPC) facilities. The first six antennas equipped with first-generation PAF technology (Mark I), named the Boolardy Engineering Test Array (BETA), have been in use since 2014 as a platform to test PAF calibration and imaging techniques, and along the way it has been producing some great science results. Commissioning of the ASKAP Array Release 1, that is the first six antennas with second-generation PAFs (Mark II), is currently under way. An integral part of the instrument is the Central Processor platform hosted at the Pawsey Supercomputing Centre in Perth, which executes custom-written software pipelines, designed specifically to meet the ASKAP imaging requirements of wide field of view and high dynamic range. There are three key hardware components of the Central Processor: the ingest nodes (16 x node cluster), the fast temporary storage (1 PB Lustre file system) and the processing supercomputer (200 TFlop system). This High-Performance Computing (HPC) platform is managed and supported by the Pawsey support team. Due to the limited amount of data generated by BETA and the first ASKAP Array Release, the Central Processor platform has been running in a more "traditional" or user-interactive mode.
But this is about to change: integration and verification of the online ingest pipeline starts in early 2016, which is required to support the full 300 MHz bandwidth for Array Release 1; followed by the deployment of the real-time data processing components. In addition to the Central Processor, the first production release of the CSIRO ASKAP Science Data Archive (CASDA) has also been deployed in one of the Pawsey Supercomputing Centre facilities and it is integrated to the end-to-end ASKAP data flow system. This paper describes the current status of the "end-to-end" data flow software system from preparing observations to data acquisition, processing and archiving; and the challenges of integrating an HPC facility as a key part of the instrument. It also shares some lessons learned since the start of integration activities and the challenges ahead in preparation for the start of the Early Science program.
Portable parallel stochastic optimization for the design of aeropropulsion components

NASA Technical Reports Server (NTRS)

Sues, Robert H.; Rhodes, G. S.

1994-01-01

This report presents the results of Phase 1 research to develop a methodology for performing large-scale Multi-disciplinary Stochastic Optimization (MSO) for the design of aerospace systems ranging from aeropropulsion components to complete aircraft configurations. The current research recognizes that such design optimization problems are computationally expensive, and require the use of either massively parallel or multiple-processor computers. The methodology also recognizes that many operational and performance parameters are uncertain, and that uncertainty must be considered explicitly to achieve optimum performance and cost. The objective of this Phase 1 research was to initiate the development of an MSO methodology that is portable to a wide variety of hardware platforms, while achieving efficient, large-scale parallelism when multiple processors are available. The first effort in the project was a literature review of available computer hardware, as well as review of portable, parallel programming environments. The second effort was to implement the MSO methodology for a problem using the portable parallel programming language, Parallel Virtual Machine (PVM). The third and final effort was to demonstrate the example on a variety of computers, including a distributed-memory multiprocessor, a distributed-memory network of workstations, and a single-processor workstation. Results indicate the MSO methodology can be well-applied towards large-scale aerospace design problems. Nearly perfect linear speedup was demonstrated for computation of optimization sensitivity coefficients on both a 128-node distributed-memory multiprocessor (the Intel iPSC/860) and a network of workstations (speedups of almost 19 times achieved for 20 workstations). Very high parallel efficiencies (75 percent for 31 processors and 60 percent for 50 processors) were also achieved for computation of aerodynamic influence coefficients on the Intel.
Finally, the multi-level parallelization strategy that will be needed for large-scale MSO problems was demonstrated to be highly efficient. The same parallel code instructions were used on both platforms, demonstrating portability. There are many applications for which MSO can be applied, including NASA's High-Speed Civil Transport, and advanced propulsion systems. The use of MSO will reduce design and development time and testing costs dramatically.
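The speedup and efficiency figures in this abstract are related by the standard definition: parallel efficiency equals speedup divided by processor count. The Python sketch below converts between the two using the quoted numbers; the function names are hypothetical.

```python
def parallel_efficiency(speedup, processors):
    """Efficiency = speedup / number of processors."""
    return speedup / processors


def implied_speedup(efficiency, processors):
    """Speedup implied by a reported efficiency."""
    return efficiency * processors


workstation_efficiency = parallel_efficiency(19, 20)   # about 0.95
speedup_31 = implied_speedup(0.75, 31)                 # about 23.25
speedup_50 = implied_speedup(0.60, 50)                 # about 30
```

The speedup of almost 19 on 20 workstations therefore corresponds to roughly 95 percent efficiency, while the influence-coefficient runs imply speedups of about 23 on 31 processors and about 30 on 50 processors.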
Comparing brain graphs in which nodes are regions of interest or independent components: A simulation study.

PubMed

Yu, Qingbao; Du, Yuhui; Chen, Jiayu; He, Hao; Sui, Jing; Pearlson, Godfrey; Calhoun, Vince D

2017-11-01

A key challenge in building a brain graph using fMRI data is how to define the nodes. Spatial brain components estimated by independent components analysis (ICA) and regions of interest (ROIs) determined by brain atlas are two popular methods to define nodes in brain graphs. It is difficult to evaluate which method is better in real fMRI data. Here we perform a simulation study and evaluate the accuracies of a few graph metrics in graphs with nodes of ICA components, ROIs, or modified ROIs in four simulation scenarios. Graph measures with ICA nodes are more accurate than graphs with ROI nodes in all cases. Graph measures with modified ROI nodes are modulated by artifacts. The correlations of graph metrics across subjects between graphs with ICA nodes and ground truth are higher than the correlations between graphs with ROI nodes and ground truth in scenarios with large overlapped spatial sources. Moreover, moving the location of ROIs would largely decrease the correlations in all scenarios. Evaluating graphs with different nodes is promising in simulated data rather than real data because different scenarios can be simulated and measures of different graphs can be compared with a known ground truth. Since ROIs defined using brain atlas may not correspond well to real functional boundaries, overall findings of this work suggest that it is more appropriate to define nodes using data-driven ICA than ROI approaches in real fMRI data. Copyright © 2017 Elsevier B.V. All rights reserved.

Quantum information density scaling and qubit operation time constraints of CMOS silicon-based quantum computer architectures

NASA Astrophysics Data System (ADS)

Rotta, Davide; Sebastiano, Fabio; Charbon, Edoardo; Prati, Enrico

2017-06-01

Even the quantum simulation of an apparently simple molecule such as Fe2S2 requires a considerable number of qubits, on the order of 10^6, while more complex molecules such as alanine (C3H7NO2) require about a hundred times more. In order to assess such a multimillion scale of identical qubits and control lines, the silicon platform seems to be one of the most indicated routes as it naturally provides, together with qubit functionalities, the capability of nanometric, serial, and industrial-quality fabrication. The scaling trend of microelectronic devices predicting that computing power would double every 2 years, known as Moore's law, according to the new slope set after the 32-nm node of 2009, suggests that the technology roadmap will achieve the 3-nm manufacturability limit proposed by Kelly around 2020. Today, circuital quantum information processing architectures are predicted to take advantage of the scalability ensured by silicon technology. However, the maximum amount of quantum information per unit surface that can be stored in silicon-based qubits and the consequent space constraints on qubit operations have never been addressed so far. This represents one of the key parameters toward the implementation of quantum error correction for fault-tolerant quantum information processing and its dependence on the features of the technology node. The maximum quantum information per unit surface virtually storable and controllable in the compact exchange-only silicon double quantum dot qubit architecture is expressed as a function of the complementary metal-oxide-semiconductor technology node, so the size scale optimizing both physical qubit operation time and quantum error correction requirements is assessed by reviewing the physical and technological constraints.
According to the requirements imposed by the quantum error correction method and the constraints given by the typical strength of the exchange coupling, we determine the workable operation frequency range of a silicon complementary metal-oxide-semiconductor quantum processor to be between 1 and 100 GHz. Such constraint limits the feasibility of fault-tolerant quantum information processing with complementary metal-oxide-semiconductor technology only to the most advanced nodes. The compatibility with classical complementary metal-oxide-semiconductor control circuitry is discussed, focusing on the cryogenic complementary metal-oxide-semiconductor operation required to bring the classical controller as close as possible to the quantum processor and to enable interfacing thousands of qubits on the same chip via time-division, frequency-division, and space-division multiplexing. The operation time range prospected for cryogenic control electronics is found to be compatible with the operation time expected for qubits. By combining the forecast of the development of scaled technology nodes with operation time and classical circuitry constraints, we derive a maximum quantum information density for logical qubits of 2.8 and 4 Mqb/cm^2 for the 10 and 7-nm technology nodes, respectively, for the Steane code. The density is one and two orders of magnitude less for surface codes and for concatenated codes, respectively. Such values provide a benchmark for the development of fault-tolerant quantum algorithms by circuital quantum information based on silicon platforms and a guideline for other technologies in general.
Announcement/Subscription/Publication: Message Based Communication for Heterogeneous Mobile Environments

NASA Astrophysics Data System (ADS)

Ristau, Henry

Many tasks in smart environments can be implemented using message based communication paradigms that decouple applications in time, space, synchronization and semantics. Current solutions for decoupled message based communication either do not support message processing and thus semantic decoupling or rely on clearly defined network structures. In this paper we present ASP, a novel concept for such communication that can directly operate on neighbor relations between brokers and does not rely on a homogeneous addressing scheme or on anything more than simple link layer communication. We show by simulation that ASP performs well in a heterogeneous scenario with mobile nodes and decreases network or processor load significantly compared to message flooding.
Studies of an Optical Multi-Processor Interconnect

DTIC Science & Technology

1994-01-01

that the destinations are uniformly distributed, I is given by [equation illegible in the scanned source]...Figure 5.16: Variation of Maximum User Throughput with Size (horizontal axis: Number of Nodes)...curves. Table (6.9) shows that the size and topology of the network does not have any significant effect on the number g of threads needed to keep the
Efficient Parallel Formulations of Hierarchical Methods and Their Applications

NASA Astrophysics Data System (ADS)

Grama, Ananth Y.

1996-01-01

Hierarchical methods such as the Fast Multipole Method (FMM) and Barnes-Hut (BH) are used for rapid evaluation of potential (gravitational, electrostatic) fields in particle systems. They are also used for solving integral equations using boundary element methods. The linear systems arising from these methods are dense and are solved iteratively. Hierarchical methods reduce the complexity of the core matrix-vector product from O(n^2) to O(n log n) and the memory requirement from O(n^2) to O(n). We have developed highly scalable parallel formulations of a hybrid FMM/BH method that are capable of handling arbitrarily irregular distributions. We apply these formulations to astrophysical simulations of Plummer and Gaussian galaxies. We have used our parallel formulations to solve the integral form of the Laplace equation. We show that our parallel hierarchical mat-vecs yield high efficiency and overall performance even on relatively small problems. A problem containing approximately 200K nodes takes under a second to compute on 256 processors and yet yields over 85% efficiency. The efficiency and raw performance are expected to increase for bigger problems. For the 200K node problem, our code delivers about 5 GFLOPS of performance on a 256 processor T3D. This is impressive considering the fact that the problem has floating point divides and roots, and very little locality resulting in poor cache performance. A dense matrix-vector product of the same dimensions would require about 0.5 TeraBytes of memory and about 770 TeraFLOPS of computing speed. Clearly, if the loss in accuracy resulting from the use of hierarchical methods is acceptable, our code yields significant savings in time and memory. We also study the convergence of a GMRES solver built around this mat-vec. We accelerate the convergence of the solver using three preconditioning techniques: diagonal scaling, block-diagonal preconditioning, and inner-outer preconditioning.
We study the performance and parallel efficiency of these preconditioned solvers. Using this solver, we solve dense linear systems with hundreds of thousands of unknowns. Solving a 105K unknown problem takes about 10 minutes on a 64 processor T3D. Until very recently, boundary element problems of this magnitude could not even be generated, let alone solved.
Turbo Pascal Implementation of a Distributed Processing Network of MS-DOS Microcomputers Connected in a Master-Slave Configuration

DTIC Science & Technology

1989-12-01

Interrupt Procedures; 13. Support for a Larger Memory Model; C. IMPLEMENTATION ... describe the programmer’s model of the hardware utilized in the microcomputers and interrupt-driven serial communication considerations. Chapter III ... Central Processor Unit: The programming model of Table 2.1 is common to the Intel 8088, 8086 and 80x86 series of microprocessors used in the IBM PC/AT
Demonstration of multi-wavelength tunable fiber lasers based on a digital micromirror device processor.

PubMed

Ai, Qi; Chen, Xiao; Tian, Miao; Yan, Bin-bin; Zhang, Ying; Song, Fei-jun; Chen, Gen-xiang; Sang, Xin-zhu; Wang, Yi-quan; Xiao, Feng; Alameh, Kamal

2015-02-01

Using a digital micromirror device (DMD) processor as the multi-wavelength narrow-band tunable filter, we experimentally demonstrate a multi-port tunable fiber laser. The key property of this laser is that any lasing wavelength channel from any output port can be switched independently over the whole C-band, driven flexibly by a single DMD chip. All outputs display an excellent tuning capacity and high consistency over the whole C-band, with a 0.02 nm linewidth, a 0.055 nm wavelength tuning step, and a side-mode suppression ratio greater than 60 dB. Owing to the automatic power control and polarization design, the power uniformity of the output lasers is less than 0.008 dB and the wavelength fluctuation is below 0.02 nm within 2 h at room temperature.
Heuristic-driven graph wavelet modeling of complex terrain

NASA Astrophysics Data System (ADS)

Cioacǎ, Teodor; Dumitrescu, Bogdan; Stupariu, Mihai-Sorin; Pǎtru-Stupariu, Ileana; Nǎpǎrus, Magdalena; Stoicescu, Ioana; Peringer, Alexander; Buttler, Alexandre; Golay, François

2015-03-01

We present a novel method for building a multi-resolution representation of large digital surface models. The surface points coincide with the nodes of a planar graph which can be processed using a critically sampled, invertible lifting scheme. To drive the lazy wavelet node partitioning, we employ an attribute aware cost function based on the generalized quadric error metric. The resulting algorithm can be applied to multivariate data by storing additional attributes at the graph's nodes. We discuss how the cost computation mechanism can be coupled with the lifting scheme and examine the results by evaluating the root mean square error. The algorithm is experimentally tested using two multivariate LiDAR sets representing terrain surface and vegetation structure with different sampling densities.
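As an illustration of the kind of critically sampled, invertible lifting scheme mentioned in this abstract, the following is a minimal one-dimensional sketch. It is a simplification and an assumption throughout: the paper works on planar graphs and chooses the node partition with a quadric-error cost function, whereas this sketch uses a fixed even/odd split, a linear predictor, and hypothetical function names.

```python
def lifting_forward(signal):
    """One lifting level on an even-length 1D signal (illustrative only)."""
    # Lazy wavelet: split the samples into two disjoint sets.
    even = signal[0::2]
    odd = signal[1::2]
    n = len(odd)
    # Predict: estimate each odd sample from its even neighbours and
    # keep only the prediction error (the detail coefficient).
    detail = [odd[i] - 0.5 * (even[i] + even[min(i + 1, n - 1)])
              for i in range(n)]
    # Update: adjust the even samples using the details so the coarse
    # signal better preserves the local average.
    approx = [even[i] + 0.25 * (detail[max(i - 1, 0)] + detail[i])
              for i in range(n)]
    return approx, detail


def lifting_inverse(approx, detail):
    """Undo lifting_forward by reversing the steps in opposite order."""
    n = len(detail)
    even = [approx[i] - 0.25 * (detail[max(i - 1, 0)] + detail[i])
            for i in range(n)]
    odd = [detail[i] + 0.5 * (even[i] + even[min(i + 1, n - 1)])
           for i in range(n)]
    signal = [0.0] * (2 * n)
    signal[0::2] = even
    signal[1::2] = odd
    return signal
```

The transform is critically sampled because the approximation and detail lists together hold exactly as many values as the input, and it is invertible because each step is undone by subtracting what was added.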
The Brain's Router: A Cortical Network Model of Serial Processing in the Primate Brain

PubMed Central

Zylberberg, Ariel; Fernández Slezak, Diego; Roelfsema, Pieter R.; Dehaene, Stanislas; Sigman, Mariano

2010-01-01

The human brain efficiently solves certain operations such as object recognition and categorization through a massively parallel network of dedicated processors. However, human cognition also relies on the ability to perform an arbitrarily large set of tasks by flexibly recombining different processors into a novel chain. This flexibility comes at the cost of a severe slowing down and a seriality of operations (100–500 ms per step). A limit on parallel processing is demonstrated in experimental setups such as the psychological refractory period (PRP) and the attentional blink (AB) in which the processing of an element either significantly delays (PRP) or impedes conscious access (AB) of a second, rapidly presented element. Here we present a spiking-neuron implementation of a cognitive architecture where a large number of local parallel processors assemble together to produce goal-driven behavior. The precise mapping of incoming sensory stimuli onto motor representations relies on a “router” network capable of flexibly interconnecting processors and rapidly changing its configuration from one task to another. Simulations show that, when presented with dual-task stimuli, the network exhibits parallel processing at peripheral sensory levels, a memory buffer capable of keeping the result of sensory processing on hold, and a slow serial performance at the router stage, resulting in a performance bottleneck. The network captures the detailed dynamics of human behavior during dual-task-performance, including both mean RTs and RT distributions, and establishes concrete predictions on neuronal dynamics during dual-task experiments in humans and non-human primates. PMID:20442869
Cancer of the esophagus and esophagogastric junction: data-driven staging for the seventh edition of the American Joint Committee on Cancer/International Union Against Cancer Cancer Staging Manuals.

PubMed

Rice, Thomas W; Rusch, Valerie W; Ishwaran, Hemant; Blackstone, Eugene H

2010-08-15

Previous American Joint Committee on Cancer/International Union Against Cancer (AJCC/UICC) stage groupings for esophageal cancer have not been data driven or harmonized with stomach cancer. At the request of the AJCC, worldwide data from 3 continents were assembled to develop data-driven, harmonized esophageal staging for the seventh edition of the AJCC/UICC cancer staging manuals. All-cause mortality among 4627 patients with esophageal and esophagogastric junction cancer who underwent surgery alone (no preoperative or postoperative adjuvant therapy) was analyzed by using novel random forest methodology to produce stage groups for which survival was monotonically decreasing, distinctive, and homogeneous. For lymph node-negative pN0M0 cancers, risk-adjusted 5-year survival was dominated by pathologic tumor classification (pT) but was modulated by histopathologic cell type, histologic grade, and location. For lymph node-positive, pN+M0 cancers, the number of cancer-positive lymph nodes (a new pN classification) dominated survival. Resulting stage groupings departed from a simple, logical arrangement of TNM. Stage groupings for stage I and II adenocarcinoma were based on pT, pN, and histologic grade; and groupings for squamous cell carcinoma were based on pT, pN, histologic grade, and location. Stage III was similar for histopathologic cell types and was based only on pT and pN. Stage 0 and stage IV, by definition, were categorized as tumor in situ (Tis) (high-grade dysplasia) and pM1, respectively. The prognosis for patients with esophageal and esophagogastric junction cancer depends on the complex interplay of TNM classifications as well as nonanatomic factors, including histopathologic cell type, histologic grade, and cancer location. These features were incorporated into a data-driven staging of these cancers for the seventh edition of the AJCC/UICC cancer staging manuals. Copyright (c) 2010 American Cancer Society.
Sub-nanosecond clock synchronization and trigger management in the nuclear physics experiment AGATA

NASA Astrophysics Data System (ADS)

Bellato, M.; Bortolato, D.; Chavas, J.; Isocrate, R.; Rampazzo, G.; Triossi, A.; Bazzacco, D.; Mengoni, D.; Recchia, F.

2013-07-01

The new-generation spectrometer AGATA, the Advanced GAmma Tracking Array, requires sub-nanosecond clock synchronization among readout and front-end electronics modules that may lie a hundred meters apart. We call GTS (Global Trigger and Synchronization System) the infrastructure responsible for precise clock synchronization and for the trigger management of AGATA. It is made of a central trigger processor and nodes, connected in a tree structure by means of optical fibers operated at 2 Gb/s. The GTS tree handles the synchronization and the trigger data flow, whereas the trigger processor analyses and eventually validates the trigger primitives centrally. Sub-nanosecond synchronization is achieved by measuring two different types of round-trip times and by automatically correcting for phase-shift differences. For a tree of depth two, the peak-to-peak clock jitter at each leaf is 70 ps; the mean phase difference is 180 ps, while the standard deviation of this phase difference, namely the phase equalization repeatability, is 20 ps. The GTS system has run flawlessly for the two-year long AGATA campaign, held at the INFN Legnaro National Laboratories, Italy, where five triple clusters of the AGATA sub-array were coupled with a variety of ancillary detectors.
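The general idea of equalizing clock phase from round-trip measurements can be sketched as below. This is a generic illustration and not the GTS procedure, which the abstract does not detail: it assumes each link is symmetric (one-way delay is half the round trip), and the function name and node labels are hypothetical.

```python
def phase_corrections(round_trip_ps):
    """Extra delay (ps) per node so all nodes see the clock edge together.

    round_trip_ps maps a node label to its measured round-trip time in
    picoseconds. Assumes symmetric links, which is an assumption of this
    sketch rather than a property stated in the abstract.
    """
    # Estimate the one-way propagation delay to each node.
    one_way = {node: rtt / 2.0 for node, rtt in round_trip_ps.items()}
    # Align every node to the one with the longest path.
    latest = max(one_way.values())
    return {node: latest - delay for node, delay in one_way.items()}
```

For example, with round trips of 1000 ps and 1400 ps, the nearer node would be delayed by 200 ps and the farther node by 0 ps.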
Recent advances and future prospects for Monte Carlo

DOE Office of Scientific and Technical Information (OSTI.GOV)

Brown, Forrest B

2010-01-01

The history of Monte Carlo methods is closely linked to that of computers: The first known Monte Carlo program was written in 1947 for the ENIAC; a pre-release of the first Fortran compiler was used for Monte Carlo in 1957; Monte Carlo codes were adapted to vector computers in the 1980s, clusters and parallel computers in the 1990s, and teraflop systems in the 2000s. Recent advances include hierarchical parallelism, combining threaded calculations on multicore processors with message-passing among different nodes. With the advances in computing, Monte Carlo codes have evolved with new capabilities and new ways of use. Production codes such as MCNP, MVP, MONK, TRIPOLI and SCALE are now 20-30 years old (or more) and are very rich in advanced features. The former 'method of last resort' has now become the first choice for many applications. Calculations are now routinely performed on office computers, not just on supercomputers. Current research and development efforts are investigating the use of Monte Carlo methods on FPGAs, GPUs, and many-core processors. Other far-reaching research is exploring ways to adapt Monte Carlo methods to future exaflop systems that may have 1M or more concurrent computational processes.
Optimization of the coherence function estimation for multi-core central processing unit

NASA Astrophysics Data System (ADS)

Cheremnov, A. G.; Faerman, V. A.; Avramchuk, V. S.

2017-02-01

The paper considers the use of parallel processing on a multi-core central processing unit to optimize the coherence function evaluation arising in digital signal processing. The coherence function, along with other methods of spectral analysis, is commonly used for vibration diagnosis of rotating machinery and its particular nodes. An algorithm is given for evaluating the function for signals represented by digital samples. The algorithm is analyzed with respect to its software implementation and computational problems. Optimization measures are described, including algorithmic, architecture and compiler optimization; their results are assessed for multi-core processors from different manufacturers. The speed-up of parallel execution with respect to sequential execution was studied, and results are presented for Intel Core i7-4720HQ and AMD FX-9590 processors. The results show comparatively high efficiency of the optimization measures taken. In particular, acceleration indicators and average CPU utilization have been significantly improved, showing a high degree of parallelism of the constructed calculating functions. The developed software underwent state registration and will be used as part of a software and hardware solution for rotating machinery fault diagnosis and pipeline leak location with the acoustic correlation method.
Assumed-stress hybrid elements with drilling degrees of freedom for nonlinear analysis of composite structures

NASA Technical Reports Server (NTRS)

Knight, Norman F., Jr. (Principal Investigator)

1996-01-01

The goal of this research project is to develop assumed-stress hybrid elements with rotational degrees of freedom for analyzing composite structures. During the first year of the three-year activity, the effort was directed to further assess the AQ4 shell element and its extensions to buckling and free vibration problems. In addition, the development of a compatible 2-node beam element was to be accomplished. The extensions and new developments were implemented in the Computational Structural Mechanics Testbed COMET. An assessment was performed to verify the implementation and to assess the performance of these elements in terms of accuracy. During the second and third years, extensions to geometrically nonlinear problems were developed and tested. This effort involved working with the nonlinear solution strategy as well as the nonlinear formulation for the elements. This research has resulted in the development and implementation of two additional element processors (ES22 for the beam element and ES24 for the shell elements) in COMET. The software was developed using a SUN workstation and has been ported to the NASA Langley Convex named blackbird. Both element processors are now part of the baseline version of COMET.
A Parallel Compact Multi-Dimensional Numerical Algorithm with Aeroacoustics Applications

NASA Technical Reports Server (NTRS)

Povitsky, Alex; Morris, Philip J.

1999-01-01

In this study we propose a novel method to parallelize high-order compact numerical algorithms for the solution of three-dimensional PDEs (Partial Differential Equations) in a space-time domain. For this numerical integration most of the computer time is spent in computation of spatial derivatives at each stage of the Runge-Kutta temporal update. The most efficient direct method to compute spatial derivatives on a serial computer is a version of Gaussian elimination for narrow linear banded systems known as the Thomas algorithm. In a straightforward pipelined implementation of the Thomas algorithm, processors are idle due to the forward and backward recurrences of the algorithm. To utilize processors during this time, we propose to use them for either non-local data-independent computations, solving lines in the next spatial direction, or local data-dependent computations by the Runge-Kutta method. To achieve this goal, control of processor communication and computations by a static schedule is adopted. Thus, our parallel code is driven by a communication and computation schedule instead of the usual "creative programming" approach. The obtained parallelization speed-up of the novel algorithm is about twice as much as that for the standard pipelined algorithm and close to that for the explicit DRP algorithm.
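The forward and backward recurrences that cause the idle time described in this abstract can be seen in a serial sketch of the Thomas algorithm for a tridiagonal system. This is the textbook form, not the authors' parallel code; the function name and the argument layout (three length-n lists, with `lower[0]` and `upper[n-1]` unused) are assumptions of the sketch, and no pivoting is done.

```python
def thomas_solve(lower, diag, upper, rhs):
    """Solve a tridiagonal system; all four inputs are lists of length n.

    Row i reads: lower[i]*x[i-1] + diag[i]*x[i] + upper[i]*x[i+1] = rhs[i].
    Assumes the elimination never hits a zero pivot (for example, a
    diagonally dominant matrix).
    """
    n = len(diag)
    c_prime = [0.0] * n
    d_prime = [0.0] * n
    c_prime[0] = upper[0] / diag[0]
    d_prime[0] = rhs[0] / diag[0]
    # Forward recurrence: each row depends on the result of the previous row.
    for i in range(1, n):
        denom = diag[i] - lower[i] * c_prime[i - 1]
        c_prime[i] = upper[i] / denom if i < n - 1 else 0.0
        d_prime[i] = (rhs[i] - lower[i] * d_prime[i - 1]) / denom
    # Backward recurrence: each unknown depends on the one after it.
    x = [0.0] * n
    x[-1] = d_prime[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d_prime[i] - c_prime[i] * x[i + 1]
    return x
```

Because row i cannot start until row i-1 has finished (and the reverse in the back substitution), a processor that owns a later segment of the line has to wait for its neighbour, which is the idle time the abstract proposes to fill with other work.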
Design of the Protocol Processor for the ROBUS-2 Communication System

NASA Technical Reports Server (NTRS)

Torres-Pomales, Wilfredo; Malekpour, Mahyar R.; Miner, Paul S.

2005-01-01

The ROBUS-2 Protocol Processor (RPP) is a custom-designed hardware component implementing the functionality of the ROBUS-2 fault-tolerant communication system. The Reliable Optical Bus (ROBUS) is the core communication system of the Scalable Processor-Independent Design for Enhanced Reliability (SPIDER), a general-purpose fault tolerant integrated modular architecture currently under development at NASA Langley Research Center. ROBUS is a time-division multiple access (TDMA) broadcast communication system with medium access control by means of time-indexed communication schedule. ROBUS-2 is a developmental version of the ROBUS providing guaranteed fault-tolerant services to the attached processing elements (PEs), in the presence of a bounded number of faults. These services include message broadcast (Byzantine Agreement), dynamic communication schedule update, time reference (clock synchronization), and distributed diagnosis (group membership). ROBUS also features fault-tolerant startup and restart capabilities. ROBUS-2 tolerates internal as well as PE faults, and incorporates a dynamic self-reconfiguration capability driven by the internal diagnostic system. ROBUS consists of RPPs connected to each other by a lower-level physical communication network. The RPP has a pipelined architecture and the design is parameterized in the behavioral and structural domains. The design of the RPP enables the bus to achieve a PE-message throughput that approaches the available bandwidth at the physical layer.
Designing Next Generation Massively Multithreaded Architectures for Irregular Applications

DOE Office of Scientific and Technical Information (OSTI.GOV)

Tumeo, Antonino; Secchi, Simone; Villa, Oreste

Irregular applications, such as data mining or graph-based computations, show unpredictable memory/network access patterns and control structures. Massively multi-threaded architectures with large node count, like the Cray XMT, have been shown to address their requirements better than commodity clusters. In this paper we present the approaches that we are currently pursuing to design future generations of these architectures. First, we introduce the Cray XMT and compare it to other multithreaded architectures. We then propose an evolution of the architecture, integrating multiple cores per node and next generation network interconnect. We advocate the use of hardware support for remote memory reference aggregation to optimize network utilization. For this evaluation we developed a highly parallel, custom simulation infrastructure for multi-threaded systems. Our simulator executes unmodified XMT binaries with very large datasets, capturing effects due to contention and hot-spotting, while predicting execution times with greater than 90% accuracy. We also discuss the FPGA prototyping approach that we are employing to study efficient support for irregular applications in next generation manycore processors.
Distributed Virtual System (DIVIRS) Project

NASA Technical Reports Server (NTRS)

Schorr, Herbert; Neuman, B. Clifford

1993-01-01

As outlined in our continuation proposal 92-ISI-50R (revised) on contract NCC 2-539, we are (1) developing software, including a system manager and a job manager, that will manage available resources and that will enable programmers to program parallel applications in terms of a virtual configuration of processors, hiding the mapping to physical nodes; (2) developing communications routines that support the abstractions implemented in item one; (3) continuing the development of file and information systems based on the virtual system model; and (4) incorporating appropriate security measures to allow the mechanisms developed in items 1 through 3 to be used on an open network. The goal throughout our work is to provide a uniform model that can be applied to both parallel and distributed systems. We believe that multiprocessor systems should exist in the context of distributed systems, allowing them to be more easily shared by those that need them. Our work provides the mechanisms through which nodes on multiprocessors are allocated to jobs running within the distributed system and the mechanisms through which files needed by those jobs can be located and accessed.
Running ATLAS workloads within massively parallel distributed applications using Athena Multi-Process framework (AthenaMP)

NASA Astrophysics Data System (ADS)

Calafiura, Paolo; Leggett, Charles; Seuster, Rolf; Tsulaia, Vakhtang; Van Gemmeren, Peter

2015-12-01

AthenaMP is a multi-process version of the ATLAS reconstruction, simulation and data analysis framework Athena. By leveraging Linux fork and copy-on-write mechanisms, it allows for sharing of memory pages between event processors running on the same compute node with little to no change in the application code. Originally targeted to optimize the memory footprint of reconstruction jobs, AthenaMP has demonstrated that it can reduce the memory usage of certain configurations of ATLAS production jobs by a factor of 2. AthenaMP has also evolved to become the parallel event-processing core of the recently developed ATLAS infrastructure for fine-grained event processing (Event Service) which allows the running of AthenaMP inside massively parallel distributed applications on hundreds of compute nodes simultaneously. We present the architecture of AthenaMP, various strategies implemented by AthenaMP for scheduling workload to worker processes (for example: Shared Event Queue and Shared Distributor of Event Tokens) and the usage of AthenaMP in the diversity of ATLAS event processing workloads on various computing resources: Grid, opportunistic resources and HPC.
DIstributed VIRtual System (DIVIRS) project

NASA Technical Reports Server (NTRS)

Schorr, Herbert; Neuman, B. Clifford

1994-01-01

As outlined in our continuation proposal 92-ISI-50R (revised) on NASA cooperative agreement NCC2-539, we are (1) developing software, including a system manager and a job manager, that will manage available resources and that will enable programmers to develop and execute parallel applications in terms of a virtual configuration of processors, hiding the mapping to physical nodes; (2) developing communications routines that support the abstractions implemented in item one; (3) continuing the development of file and information systems based on the Virtual System Model; and (4) incorporating appropriate security measures to allow the mechanisms developed in items 1 through 3 to be used on an open network. The goal throughout our work is to provide a uniform model that can be applied to both parallel and distributed systems. We believe that multiprocessor systems should exist in the context of distributed systems, allowing them to be more easily shared by those that need them. Our work provides the mechanisms through which nodes on multiprocessors are allocated to jobs running within the distributed system and the mechanisms through which files needed by those jobs can be located and accessed.
DIstributed VIRtual System (DIVIRS) project

NASA Technical Reports Server (NTRS)

Schorr, Herbert; Neuman, Clifford B.

1995-01-01

As outlined in our continuation proposal 92-ISI-50R (revised) on NASA cooperative agreement NCC2-539, we are (1) developing software, including a system manager and a job manager, that will manage available resources and that will enable programmers to develop and execute parallel applications in terms of a virtual configuration of processors, hiding the mapping to physical nodes; (2) developing communications routines that support the abstractions implemented in item one; (3) continuing the development of file and information systems based on the Virtual System Model; and (4) incorporating appropriate security measures to allow the mechanisms developed in items 1 through 3 to be used on an open network. The goal throughout our work is to provide a uniform model that can be applied to both parallel and distributed systems. We believe that multiprocessor systems should exist in the context of distributed systems, allowing them to be more easily shared by those that need them. Our work provides the mechanisms through which nodes on multiprocessors are allocated to jobs running within the distributed system and the mechanisms through which files needed by those jobs can be located and accessed.

Distributed Virtual System (DIVIRS) project

NASA Technical Reports Server (NTRS)

Schorr, Herbert; Neuman, B. Clifford

1993-01-01

As outlined in the continuation proposal 92-ISI-50R (revised) on NASA cooperative agreement NCC 2-539, the investigators are developing software, including a system manager and a job manager, that will manage available resources and that will enable programmers to develop and execute parallel applications in terms of a virtual configuration of processors, hiding the mapping to physical nodes; developing communications routines that support the abstractions implemented; continuing the development of file and information systems based on the Virtual System Model; and incorporating appropriate security measures to allow the mechanisms developed to be used on an open network. The goal throughout the work is to provide a uniform model that can be applied to both parallel and distributed systems. The authors believe that multiprocessor systems should exist in the context of distributed systems, allowing them to be more easily shared by those that need them. The work provides the mechanisms through which nodes on multiprocessors are allocated to jobs running within the distributed system and the mechanisms through which files needed by those jobs can be located and accessed.
Comparing the OpenMP, MPI, and Hybrid Programming Paradigm on an SMP Cluster

NASA Technical Reports Server (NTRS)

Jost, Gabriele; Jin, Haoqiang; an Mey, Dieter; Hatay, Ferhat F.

2003-01-01

With the advent of parallel hardware and software technologies users are faced with the challenge to choose a programming paradigm best suited for the underlying computer architecture. With the current trend in parallel computer architectures towards clusters of shared memory symmetric multi-processors (SMP), parallel programming techniques have evolved to support parallelism beyond a single level. Which programming paradigm is the best will depend on the nature of the given problem, the hardware architecture, and the available software. In this study we will compare different programming paradigms for the parallelization of a selected benchmark application on a cluster of SMP nodes. We compare the timings of different implementations of the same CFD benchmark application employing the same numerical algorithm on a cluster of Sun Fire SMP nodes. The rest of the paper is structured as follows: In section 2 we briefly discuss the programming models under consideration. We describe our compute platform in section 3. The different implementations of our benchmark code are described in section 4 and the performance results are presented in section 5. We conclude our study in section 6.
Optimal power allocation and joint source-channel coding for wireless DS-CDMA visual sensor networks

NASA Astrophysics Data System (ADS)

Pandremmenou, Katerina; Kondi, Lisimachos P.; Parsopoulos, Konstantinos E.

2011-01-01

In this paper, we propose a scheme for the optimal allocation of power, source coding rate, and channel coding rate for each of the nodes of a wireless Direct Sequence Code Division Multiple Access (DS-CDMA) visual sensor network. The optimization is quality-driven, i.e. the received quality of the video that is transmitted by the nodes is optimized. The scheme takes into account the fact that the sensor nodes may be imaging scenes with varying levels of motion. Nodes that image low-motion scenes will require a lower source coding rate, so they will be able to allocate a greater portion of the total available bit rate to channel coding. Stronger channel coding will mean that such nodes will be able to transmit at lower power. This will both increase battery life and reduce interference to other nodes. Two optimization criteria are considered: one that minimizes the average video distortion of the nodes and one that minimizes the maximum distortion among the nodes. The transmission powers are allowed to take continuous values, whereas the source and channel coding rates can assume only discrete values. Thus, the resulting optimization problem is a mixed-integer optimization task and is solved using Particle Swarm Optimization. Our experimental results show the importance of considering the characteristics of the video sequences when determining the transmission power, source coding rate and channel coding rate for the nodes of the visual sensor network.
MIROS: A Hybrid Real-Time Energy-Efficient Operating System for the Resource-Constrained Wireless Sensor Nodes

PubMed Central

Liu, Xing; Hou, Kun Mean; de Vaulx, Christophe; Shi, Hongling; Gholami, Khalid El

2014-01-01

Operating system (OS) technology is significant for the proliferation of the wireless sensor network (WSN). With an outstanding OS, the constrained WSN resources (processor, memory and energy) can be utilized efficiently. Moreover, the user application development can be served soundly. In this article, a new hybrid, real-time, memory-efficient, energy-efficient, user-friendly and fault-tolerant WSN OS, MIROS, is designed and implemented. MIROS implements the hybrid scheduler and the dynamic memory allocator. Real-time scheduling can thus be achieved with low memory consumption. In addition, it implements a mid-layer software EMIDE (Efficient Mid-layer Software for User-Friendly Application Development Environment) to decouple the WSN application from the low-level system. The application programming process can consequently be simplified and the application reprogramming performance improved. Moreover, it combines both the software and the multi-core hardware techniques to conserve the energy resources, improve the node reliability, as well as achieve a new debugging method. To evaluate the performance of MIROS, it is compared with the other WSN OSes (TinyOS, Contiki, SOS, openWSN and mantisOS) from different OS concerns. The final evaluation results prove that MIROS is suitable to be used even on the tight resource-constrained WSN nodes. It can support the real-time WSN applications. Furthermore, it is energy efficient, user friendly and fault tolerant. PMID:25248069
Scalable Triadic Analysis of Large-Scale Graphs: Multi-Core vs. Multi-Processor vs. Multi-Threaded Shared Memory Architectures

DOE Office of Scientific and Technical Information (OSTI.GOV)

Chin, George; Marquez, Andres; Choudhury, Sutanay

2012-09-01

Triadic analysis encompasses a useful set of graph mining methods that is centered on the concept of a triad, which is a subgraph of three nodes and the configuration of directed edges across the nodes. Such methods are often applied in the social sciences as well as many other diverse fields. Triadic methods commonly operate on a triad census that counts the number of triads of every possible edge configuration in a graph. Like other graph algorithms, triadic census algorithms do not scale well when graphs reach tens of millions to billions of nodes. To enable the triadic analysis of large-scale graphs, we developed and optimized a triad census algorithm to efficiently execute on shared memory architectures. We will retrace the development and evolution of a parallel triad census algorithm. Over the course of several versions, we continually adapted the code’s data structures and program logic to expose more opportunities to exploit parallelism on shared memory that would translate into improved computational performance. We will recall the critical steps and modifications that occurred during code development and optimization. Furthermore, we will compare the performances of triad census algorithm versions on three specific systems: Cray XMT, HP Superdome, and AMD multi-core NUMA machine. These three systems have shared memory architectures but with markedly different hardware capabilities to manage parallelism.
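To make the notion of a triad census concrete, here is a small brute-force sketch. It is not the optimized parallel algorithm the abstract describes: it enumerates every node triple (cubic cost) and classifies each triad only by how many of its three node pairs are mutual, asymmetric, or null. That grouping is coarser than the standard 16-class census, which further separates triads by edge direction. The function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def dyad_type(edge_set, u, v):
    """Classify a node pair as Mutual, Asymmetric, or Null."""
    forward = (u, v) in edge_set
    backward = (v, u) in edge_set
    if forward and backward:
        return "M"
    if forward or backward:
        return "A"
    return "N"


def man_census(nodes, edges):
    """Count triads by their (mutual, asymmetric, null) dyad counts."""
    edge_set = set(edges)
    census = Counter()
    # Brute force over all unordered triples of nodes.
    for a, b, c in combinations(sorted(nodes), 3):
        kinds = [dyad_type(edge_set, a, b),
                 dyad_type(edge_set, a, c),
                 dyad_type(edge_set, b, c)]
        code = (kinds.count("M"), kinds.count("A"), kinds.count("N"))
        census[code] += 1
    return census
```

Each triple is independent of the others, which is the property a parallel version would exploit; the enumeration over all triples is also why a naive census stops scaling on very large graphs.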
Using SRAM Based FPGAs for Power-Aware High Performance Wireless Sensor Networks

PubMed Central

Valverde, Juan; Otero, Andres; Lopez, Miguel; Portilla, Jorge; de la Torre, Eduardo; Riesgo, Teresa

2012-01-01

While for years traditional wireless sensor nodes have been based on ultra-low power microcontrollers with sufficient but limited computing power, the complexity and number of tasks of today’s applications are constantly increasing. Increasing the node duty cycle is not feasible in all cases, so in many cases more computing power is required. This extra computing power may be achieved by either more powerful microcontrollers, though at the cost of more power consumption, or, in general, any solution capable of accelerating task execution. At this point, the use of hardware-based, and in particular FPGA, solutions might appear as a candidate technology: although power use is higher compared with lower power devices, execution time is reduced, so energy could be reduced overall. In order to demonstrate this, an innovative WSN node architecture is proposed. This architecture is based on a high performance, high capacity, state-of-the-art FPGA, which combines the advantages of the intrinsic acceleration provided by the parallelism of hardware devices, the use of partial reconfiguration capabilities, as well as a careful power-aware management system, to show that energy savings for certain higher-end applications can be achieved. Finally, comprehensive tests have been done to validate the platform in terms of performance and power consumption, to prove that better energy efficiency compared to processor-based solutions can be achieved, for instance, when encryption is imposed by the application requirements. PMID:22736971
Using SRAM based FPGAs for power-aware high performance wireless sensor networks.

PubMed

Valverde, Juan; Otero, Andres; Lopez, Miguel; Portilla, Jorge; de la Torre, Eduardo; Riesgo, Teresa

2012-01-01

While for years traditional wireless sensor nodes have been based on ultra-low power microcontrollers with sufficient but limited computing power, the complexity and number of tasks of today's applications are constantly increasing. Increasing the node duty cycle is not feasible in all cases, so in many cases more computing power is required. This extra computing power may be achieved by either more powerful microcontrollers, though at the cost of higher power consumption, or, in general, any solution capable of accelerating task execution. At this point, the use of hardware based, and in particular FPGA solutions, might appear as a candidate technology, since though power use is higher compared with lower power devices, execution time is reduced, so energy could be reduced overall. In order to demonstrate this, an innovative WSN node architecture is proposed. This architecture is based on a high performance high capacity state-of-the-art FPGA, which combines the advantages of the intrinsic acceleration provided by the parallelism of hardware devices, the use of partial reconfiguration capabilities, as well as a careful power-aware management system, to show that energy savings for certain higher-end applications can be achieved. Finally, comprehensive tests have been done to validate the platform in terms of performance and power consumption, to prove that better energy efficiency compared to processor based solutions can be achieved, for instance, when encryption is imposed by the application requirements.
Efficient implementation of MrBayes on multi-GPU.

PubMed

Bao, Jie; Xia, Hongju; Zhou, Jianfu; Liu, Xiaoguang; Wang, Gang

2013-06-01

MrBayes, using Metropolis-coupled Markov chain Monte Carlo (MCMCMC or (MC)³), is a popular program for Bayesian inference. As a leading method of using DNA data to infer phylogeny, the (MC)³ Bayesian algorithm and its improved and parallel versions are now not fast enough for biologists to analyze massive real-world DNA data. Recently, the graphics processing unit (GPU) has shown its power as a coprocessor (or rather, an accelerator) in many fields. This article describes an efficient implementation, a(MC)³ (aMCMCMC), of MrBayes (MC)³ on the Compute Unified Device Architecture (CUDA). By dynamically adjusting the task granularity to adapt to input data size and hardware configuration, it makes full use of GPU cores with different data sets. An adaptive method is also developed to split and combine DNA sequences to make full use of a large number of GPU cards. Furthermore, a new "node-by-node" task scheduling strategy is developed to improve concurrency, and several optimizing methods are used to reduce extra overhead. Experimental results show that a(MC)³ achieves up to 63× speedup over serial MrBayes on a single machine with one GPU card, and up to 170× speedup with four GPU cards, and up to 478× speedup with a 32-node GPU cluster. a(MC)³ is dramatically faster than all the previous (MC)³ algorithms and scales well to large GPU clusters.
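For readers unfamiliar with Metropolis coupling, the sketch below shows only the chain-swap step in its textbook form; it is not the GPU implementation from the article. The incremental heating formula, the parameter `lam` and all function names are assumptions made for illustration.

```python
import math
import random


def heats(n_chains, lam):
    """Incremental heating: chain 0 is the cold chain (beta = 1)."""
    return [1.0 / (1.0 + lam * i) for i in range(n_chains)]


def swap_log_ratio(beta_i, beta_j, logp_i, logp_j):
    """Log acceptance ratio for exchanging the states of chains i and j."""
    return (beta_i - beta_j) * (logp_j - logp_i)


def maybe_swap(states, logps, betas, i, j, rng):
    """Attempt a Metropolis swap between chains i and j; return True if accepted."""
    log_r = swap_log_ratio(betas[i], betas[j], logps[i], logps[j])
    if log_r >= 0 or rng.random() < math.exp(log_r):
        states[i], states[j] = states[j], states[i]
        logps[i], logps[j] = logps[j], logps[i]
        return True
    return False


# Hypothetical example: the heated chain currently holds the better state.
betas = heats(2, lam=1.0)            # [1.0, 0.5]
states = ["cold_state", "hot_state"]
logps = [-10.0, -2.0]                # unnormalised log posterior of each state
swapped = maybe_swap(states, logps, betas, 0, 1, random.Random(0))
```

Because each chain only needs the other's log posterior at swap time, the chains can run largely independently between swaps, which is what makes the method attractive to parallelise.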
MIROS: a hybrid real-time energy-efficient operating system for the resource-constrained wireless sensor nodes.

PubMed

Liu, Xing; Hou, Kun Mean; de Vaulx, Christophe; Shi, Hongling; El Gholami, Khalid

2014-09-22

Operating system (OS) technology is significant for the proliferation of the wireless sensor network (WSN). With an outstanding OS, the constrained WSN resources (processor, memory and energy) can be utilized efficiently. Moreover, the user application development can be served soundly. In this article, a new hybrid, real-time, memory-efficient, energy-efficient, user-friendly and fault-tolerant WSN OS, MIROS, is designed and implemented. MIROS implements the hybrid scheduler and the dynamic memory allocator. Real-time scheduling can thus be achieved with low memory consumption. In addition, it implements a mid-layer software EMIDE (Efficient Mid-layer Software for User-Friendly Application Development Environment) to decouple the WSN application from the low-level system. The application programming process can consequently be simplified and the application reprogramming performance improved. Moreover, it combines both the software and the multi-core hardware techniques to conserve the energy resources, improve the node reliability, as well as achieve a new debugging method. To evaluate the performance of MIROS, it is compared with the other WSN OSes (TinyOS, Contiki, SOS, openWSN and mantisOS) from different OS concerns. The final evaluation results prove that MIROS is suitable to be used even on the tight resource-constrained WSN nodes. It can support the real-time WSN applications. Furthermore, it is energy efficient, user friendly and fault tolerant.
A Non-Overlapping Discretization Method for Partial Differential Equations

NASA Astrophysics Data System (ADS)

Rosas-Medina, A.; Herrera, I.

2013-05-01

Mathematical models of many systems of interest, including very important continuous systems of Engineering and Science, lead to a great variety of partial differential equations whose solution methods are based on the computational processing of large-scale algebraic systems. Furthermore, the incredible expansion experienced by the existing computational hardware and software has made amenable to effective treatment problems of an ever increasing diversity and complexity, posed by engineering and scientific applications. The emergence of parallel computing prompted on the part of the computational-modeling community a continued and systematic effort with the purpose of harnessing it for the endeavor of solving boundary-value problems (BVPs) of partial differential equations. Very early after such an effort began, it was recognized that domain decomposition methods (DDM) were the most effective technique for applying parallel computing to the solution of partial differential equations, since such an approach drastically simplifies the coordination of the many processors that carry out the different tasks and also greatly reduces the requirements of information-transmission between them. Ideally, DDMs aim to produce algorithms that fulfill the DDM-paradigm; i.e., such that "the global solution is obtained by solving local problems defined separately in each subdomain of the coarse-mesh -or domain-decomposition-". Stated in a simplistic manner, the basic idea is that, when the DDM-paradigm is satisfied, full parallelization can be achieved by assigning each subdomain to a different processor. When intensive DDM research began much attention was given to overlapping DDMs, but soon after attention shifted to non-overlapping DDMs. This evolution seems natural when the DDM-paradigm is taken into account: it is easier to uncouple the local problems when the subdomains are separated.
However, an important limitation of non-overlapping domain decompositions, as that concept is usually understood today, is that interface nodes are shared by two or more subdomains of the coarse-mesh and, therefore, even non-overlapping DDMs are actually overlapping when seen from the perspective of the nodes used in the discretization. In this talk we present and discuss a discretization method in which the nodes used are non-overlapping, in the sense that each one of them belongs to one and only one subdomain of the coarse-mesh.
Experiences modeling ocean circulation problems on a 30 node commodity cluster with 3840 GPU processor cores.

NASA Astrophysics Data System (ADS)

Hill, C.

2008-12-01

Low-cost graphics cards today use many, relatively simple, compute cores to deliver support for memory bandwidth of more than 100GB/s and theoretical floating point performance of more than 500 GFlop/s. Right now this performance is, however, only accessible to highly parallel algorithm implementations that, (i) can use a hundred or more, 32-bit floating point, concurrently executing cores, (ii) can work with graphics memory that resides on the graphics card side of the graphics bus and (iii) can be partially expressed in a language that can be compiled by a graphics programming tool. In this talk we describe our experiences implementing a complete, but relatively simple, time dependent shallow-water equations simulation targeting a cluster of 30 computers each hosting one graphics card. The implementation takes into account the considerations (i), (ii) and (iii) listed previously. We code our algorithm as a series of numerical kernels. Each kernel is designed to be executed by multiple threads of a single process. Kernels are passed memory blocks to compute over which can be persistent blocks of memory on a graphics card. Each kernel is individually implemented using the NVidia CUDA language but driven from a higher level supervisory code that is almost identical to a standard model driver. The supervisory code controls the overall simulation timestepping, but is written to minimize data transfer between main memory and graphics memory (a massive performance bottleneck on current systems). Using the recipe outlined we can boost the performance of our cluster by nearly an order of magnitude, relative to the same algorithm executing only on the cluster CPUs. Achieving this performance boost requires that many threads are available to each graphics processor for execution within each numerical kernel and that the simulation's working set of data can fit into the graphics card memory.
As we describe, this puts interesting upper and lower bounds on the problem sizes for which this technology is currently most useful. However, many interesting problems fit within this envelope. Looking forward, we extrapolate our experience to estimate full-scale ocean model performance and applicability. Finally we describe preliminary hybrid mixed 32-bit and 64-bit experiments with graphics cards that support 64-bit arithmetic, albeit at a lower performance.
Non-Markovian dynamics in chiral quantum networks with spins and photons

NASA Astrophysics Data System (ADS)

Ramos, Tomás; Vermersch, Benoît; Hauke, Philipp; Pichler, Hannes; Zoller, Peter

2016-06-01

We study the dynamics of chiral quantum networks consisting of nodes coupled by unidirectional or asymmetric bidirectional quantum channels. In contrast to familiar photonic networks where driven two-level atoms exchange photons via 1D photonic nanostructures, we propose and study a setup where interactions between the atoms are mediated by spin excitations (magnons) in 1D XX spin chains representing spin waveguides. While Markovian quantum network theory eliminates quantum channels as structureless reservoirs in a Born-Markov approximation to obtain a master equation for the nodes, we are interested in non-Markovian dynamics. This arises from the nonlinear character of the dispersion with band-edge effects, and from finite spin propagation velocities leading to time delays in interactions. To account for the non-Markovian dynamics we treat the quantum degrees of freedom of the nodes and connecting channel as a composite spin system with the surrounding of the quantum network as a Markovian bath, allowing for an efficient solution with time-dependent density matrix renormalization-group techniques. We illustrate our approach showing non-Markovian effects in the driven-dissipative formation of quantum dimers, and we present examples for quantum information protocols involving quantum state transfer with engineered elements as basic building blocks of quantum spintronic circuits.
Category-theoretic models of algebraic computer systems

NASA Astrophysics Data System (ADS)

Kovalyov, S. P.

2016-01-01

A computer system is said to be algebraic if it contains nodes that implement unconventional computation paradigms based on universal algebra. A category-based approach to modeling such systems that provides a theoretical basis for mapping tasks to these systems' architecture is proposed. The construction of algebraic models of general-purpose computations involving conditional statements and overflow control is formally described by a reflector in an appropriate category of algebras. It is proved that this reflector takes the modulo ring whose operations are implemented in the conventional arithmetic processors to the Łukasiewicz logic matrix. Enrichments of the set of ring operations that form bases in the Łukasiewicz logic matrix are found.
Effect of various features on the life cycle cost of the timing/synchronization subsystem of the DCS digital communications network

NASA Technical Reports Server (NTRS)

Kimsey, D. B.

1978-01-01

The effect on the life cycle cost of the timing subsystem was examined, when these optional features were included in various combinations. The features included mutual control, directed control, double-ended reference links, independence of clock error measurement and correction, phase reference combining, self-organization, smoothing for link and nodal dropouts, unequal reference weightings, and a master in a mutual control network. An overall design of a microprocessor-based timing subsystem was formulated. The microprocessor (8080) implements the digital filter portion of a digital phase locked loop, as well as other control functions such as organization of the network through communication with processors at neighboring nodes.
Ultra-Compact Transputer-Based Controller for High-Level, Multi-Axis Coordination

NASA Technical Reports Server (NTRS)

Zenowich, Brian; Crowell, Adam; Townsend, William T.

2013-01-01

The design of machines that rely on arrays of servomotors such as robotic arms, orbital platforms, and combinations of both, imposes a heavy computational burden to coordinate their actions to perform coherent tasks. For example, the robotic equivalent of a person tracing a straight line in space requires enormously complex kinematics calculations, and complexity increases with the number of servo nodes. A new high-level architecture for coordinated servo-machine control enables a practical, distributed transputer alternative to conventional central processor electronics. The solution is inherently scalable, dramatically reduces bulkiness and number of conductor runs throughout the machine, requires only a fraction of the power, and is designed for cooling in a vacuum.
High-speed prediction of crystal structures for organic molecules

NASA Astrophysics Data System (ADS)

Obata, Shigeaki; Goto, Hitoshi

2015-02-01

We developed a master-worker type parallel algorithm for allocating tasks of crystal structure optimizations to distributed compute nodes, in order to improve the performance of simulations for crystal structure predictions. The performance experiments were carried out on the TUT-ADSIM supercomputer system (HITACHI HA8000-tc/HT210). The experimental results show that our parallel algorithm achieved speed-ups of 214 and 179 times using 256 processor cores on crystal structure optimizations in predictions of crystal structures for 3-aza-bicyclo(3.3.1)nonane-2,4-dione and 2-diazo-3,5-cyclohexadiene-1-one, respectively. We expect that this parallel algorithm can reduce the computational cost of any crystal structure prediction.
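The master-worker pattern mentioned in this abstract can be sketched with Python's standard thread pool: a master submits independent optimization tasks and collects results as workers finish. This is only an illustration of the pattern on a single machine, not the authors' distributed code; `optimize_structure` and its toy cost function are hypothetical stand-ins. The helper at the end turns the reported speed-ups into parallel efficiencies.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def optimize_structure(candidate):
    """Stand-in for one crystal structure optimization (hypothetical cost model)."""
    return candidate, sum(x * x for x in candidate)


def run_master(candidates, n_workers=4):
    """Master: hand out tasks to idle workers and gather results as they complete."""
    results = {}
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        futures = [pool.submit(optimize_structure, c) for c in candidates]
        for fut in as_completed(futures):
            cand, energy = fut.result()
            results[cand] = energy
    return results


def parallel_efficiency(speedup, cores):
    """Fraction of ideal linear speedup actually obtained."""
    return speedup / cores


candidates = [(1, 2), (0, 1), (3, 3)]     # hypothetical trial structures
results = run_master(candidates)
best = min(results, key=results.get)

eff_a = parallel_efficiency(214, 256)     # about 0.84
eff_b = parallel_efficiency(179, 256)     # about 0.70
```

Dynamic hand-out of tasks matters here because individual optimizations can take very different amounts of time, so a static split would leave some workers idle.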
Single-Event Transient Testing of Low Dropout PNP Series Linear Voltage Regulators

NASA Technical Reports Server (NTRS)

Adell, Philippe; Allen, Gregory

2013-01-01

As demand for high-speed, on-board, digital-processing integrated circuits on spacecraft increases (field-programmable gate arrays and digital signal processors in particular), the need for the next generation point-of-load (POL) regulator becomes a prominent design issue. Shrinking process nodes have resulted in core rails dropping to values close to 1.0 V, drastically reducing margin to standard switching converters or regulators that power digital ICs. The goal of this task is to perform SET characterization of several commercial POL converters, and provide a discussion of the impact of these results on state-of-the-art digital processing ICs through laser and heavy ion testing.
Critical Problems in Very Large Scale Computer Systems

DTIC Science & Technology

1988-09-30

253-6043 Srinivas Devadas (617) 253-0454 Thomas F. Knight, Jr. (617) 253-7807 F. Thomson Leighton (617) 253-3662 Charles E. Leiserson (617) 253-5833...J. Keen, P. Nuth, J. Larivee, and B. Totty, "Message-Driven Processor Architecture," MIT VLSI Memo No. 88-468, August 1988. *W. J. Dally and A. A...losses and gains) which are the first polynomial-time combinatorial algorithms for this problem. One algorithm runs in O(n^2 m^2 lg^2 n lg B) time and the
Compiler-directed cache management in multiprocessors

NASA Technical Reports Server (NTRS)

Cheong, Hoichi; Veidenbaum, Alexander V.

1990-01-01

The necessity of finding alternatives to hardware-based cache coherence strategies for large-scale multiprocessor systems is discussed. Three different software-based strategies sharing the same goals and general approach are presented. They consist of a simple invalidation approach, a fast selective invalidation scheme, and a version control scheme. The strategies are suitable for shared-memory multiprocessor systems with interconnection networks and a large number of processors. Results of trace-driven simulations conducted on numerical benchmark routines to compare the performance of the three schemes are presented.
A Concurrent Smalltalk Compiler for the Message-Driven Processor

DTIC Science & Technology

1988-05-01

apj with bits from low-bit (inclusive) to high-bit (exclusive) set. ;;;Low-bit defaults to zero. (defmacro brange (high-bit &optional low-bit) (list...n2) (null (cddr num))) (aetg bits (b+ bits (if (>- nl n2) ( brange (1+ nl) n2) ( brange (1+ n2) ni)))) (error "Bad bmap range: -S" flu.)))) (t (error...vlocs) flat ((vlive (b- finst-vllv* mast) *I.( brange firat-context-slot-nun))) (next (inst-next last))) (if (bempty vlive) (delete-module module inat

A SPDS Node to Support the Systematic Interpretation of Cosmic Ray Data

NASA Technical Reports Server (NTRS)

1997-01-01

The purpose of this project was to establish and maintain a Space Physics Data System (SPDS) node that supports the analysis and interpretation of current and future galactic cosmic ray (GCR) measurements by (1) providing on-line databases relevant to GCR propagation studies; (2) providing other on-line services, such as anonymous FTP access, mail list service and pointers to e-mail address books, to support the cosmic ray community; (3) providing a mechanism for those in the community who might wish to submit similar contributions for public access; (4) maintaining the node to assure that the databases remain current; and (5) investigating other possibilities, such as CD-ROM, for public dissemination of the data products. Shortly after the original grant to support these activities was established at Louisiana State University, a detailed study of alternate choices for the node hardware was initiated. The chosen hardware was an Apple Workgroup Server 9150/120 consisting of a 120 MHz PowerPC 601 processor, 32 MB of memory, two 1 GB disks and one 2 GB disk. This hardware was ordered and installed and has been operating reliably ever since. A preliminary version of the database server was available during the first year effort and was used as part of the very successful SPDS demonstration during the Rome, Italy International Cosmic Ray Conference. For this server version we were able to establish the HTML and anonymous FTP server software, develop a Web page structure which can be easily modified to include new items, provide an on-line database of charge changing total cross sections, include the cross section prediction software of Silberberg & Tsao as well as Webber, Kish and Schrier for download access, and provide an on-line bibliography of the cross section measurement references by the Transport Collaboration.
The preliminary version of this SPDS Cosmic Ray node was examined by members of the C&H SPDS committee and returned comments were used to refine the implementation.
Robust scalable stabilisability conditions for large-scale heterogeneous multi-agent systems with uncertain nonlinear interactions: towards a distributed computing architecture

NASA Astrophysics Data System (ADS)

Manfredi, Sabato

2016-06-01

Large-scale dynamic systems are becoming highly pervasive, with applications ranging from systems biology and environment monitoring to sensor networks and power systems. They are characterised by high dimensionality, complexity, and uncertainty in the node dynamics/interactions, and they require increasingly computationally demanding methods for their analysis and control design as the network size and the node system/interaction complexity increase. Therefore, it is a challenging problem to find a scalable computational method for distributed control design of large-scale networks. In this paper, we investigate the robust distributed stabilisation problem of large-scale nonlinear multi-agent systems (briefly MASs) composed of non-identical (heterogeneous) linear dynamical systems coupled by uncertain nonlinear time-varying interconnections. By employing Lyapunov stability theory and the linear matrix inequality (LMI) technique, new conditions are given for the distributed control design of large-scale MASs that can be easily solved by the toolbox of MATLAB. The stabilisability of each node dynamic is a sufficient assumption to design a global stabilising distributed control. The proposed approach improves some of the existing LMI-based results on MAS by both overcoming their computational limits and extending the applicative scenario to large-scale nonlinear heterogeneous MASs. Additionally, the proposed LMI conditions are further reduced in terms of computational requirement in the case of weakly heterogeneous MASs, which is a common scenario in real applications where the network nodes and links are affected by parameter uncertainties.
One of the main advantages of the proposed approach is that it allows a move from a centralised towards a distributed computing architecture, so that the expensive computational workload spent to solve LMIs may be shared among processors located at the networked nodes, thus increasing the scalability of the approach with respect to the network size. Finally, a numerical example shows the applicability of the proposed method and its advantage in terms of computational complexity when compared with the existing approaches.
Multi-hop routing mechanism for reliable sensor computing.

PubMed

Chen, Jiann-Liang; Ma, Yi-Wei; Lai, Chia-Ping; Hu, Chia-Cheng; Huang, Yueh-Min

2009-01-01

Current research on routing in wireless sensor computing concentrates on increasing the service lifetime, enabling scalability for large numbers of sensors and supporting fault tolerance for battery exhaustion and broken nodes. A sensor node is naturally exposed to various sources of unreliable communication channels and node failures. Sensor nodes have many failure modes, and each failure degrades the network performance. This work develops a novel mechanism, called Reliable Routing Mechanism (RRM), based on a hybrid cluster-based routing protocol to specify the best reliable routing path for sensor computing. Table-driven intra-cluster routing and on-demand inter-cluster routing are combined by changing the relationship between clusters for sensor computing. Applying a reliable routing mechanism in sensor computing can improve routing reliability, maintain low packet loss, minimize management overhead and reduce energy consumption. Simulation results indicate that the reliability of the proposed RRM mechanism is around 25% higher than that of the Dynamic Source Routing (DSR) and ad hoc On-demand Distance Vector routing (AODV) mechanisms.
Proactive Fault Tolerance for HPC with Xen Virtualization

DOE Office of Scientific and Technical Information (OSTI.GOV)

Nagarajan, Arun Babu; Mueller, Frank; Engelmann, Christian

2007-01-01

with thousands of processors. At such large counts of compute nodes, faults are becoming commonplace. Current techniques to tolerate faults focus on reactive schemes to recover from faults and generally rely on a checkpoint/restart mechanism. Yet, in today's systems, node failures can often be anticipated by detecting a deteriorating health status. Instead of a reactive scheme for fault tolerance (FT), we are promoting a proactive one where processes automatically migrate from unhealthy nodes to healthy ones. Our approach relies on operating system virtualization techniques exemplified by but not limited to Xen. This paper contributes an automatic and transparent mechanism for proactive FT for arbitrary MPI applications. It leverages virtualization techniques combined with health monitoring and load-based migration. We exploit Xen's live migration mechanism for a guest operating system (OS) to migrate an MPI task from a health-deteriorating node to a healthy one without stopping the MPI task during most of the migration. Our proactive FT daemon orchestrates the tasks of health monitoring, load determination and initiation of guest OS migration. Experimental results demonstrate that live migration hides migration costs and limits the overhead to only a few seconds, making it an attractive approach to realize FT in HPC systems. Overall, our enhancements make proactive FT a valuable asset for long-running MPI applications that is complementary to reactive FT using full checkpoint/restart schemes since checkpoint frequencies can be reduced as fewer unanticipated failures are encountered. In the context of OS virtualization, we believe that this is the first comprehensive study of proactive fault tolerance where live migration is actually triggered by health monitoring.
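The decision logic of a proactive FT daemon (monitor health, determine load, pick a migration target) might be sketched as below. This is a simplified illustration under assumed inputs, not the authors' daemon: the health scores, the threshold, the load units and the function name `plan_migrations` are all hypothetical, and the actual Xen live migration call is deliberately left out.

```python
def plan_migrations(health, load, threshold=0.5):
    """Map each unhealthy node to the least-loaded healthy node.

    health: node -> score in [0, 1], lower means deteriorating (assumed scale)
    load:   node -> current load in arbitrary integer units (assumed scale)
    """
    healthy = [n for n, h in health.items() if h >= threshold]
    projected = dict(load)          # load after the planned migrations
    plan = {}
    for node in sorted(n for n, h in health.items() if h < threshold):
        if not healthy:
            break                   # nowhere to go; fall back to reactive FT
        target = min(healthy, key=lambda n: (projected[n], n))
        plan[node] = target
        projected[target] += load[node]
    return plan


# Hypothetical cluster snapshot.
health = {"n1": 0.9, "n2": 0.3, "n3": 0.8, "n4": 0.2}
load = {"n1": 70, "n2": 50, "n3": 20, "n4": 40}
plan = plan_migrations(health, load)
```

Updating the projected load after each decision keeps several failing nodes from all being sent to the same target.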
Benchmarking worker nodes using LHCb productions and comparing with HEPSpec06

NASA Astrophysics Data System (ADS)

Charpentier, P.

2017-10-01

In order to estimate the capabilities of a computing slot with limited processing time, it is necessary to know with a rather good precision its “power”. This allows for example pilot jobs to match a task for which the required CPU-work is known, or to define the number of events to be processed knowing the CPU-work per event. Otherwise one always has the risk that the task is aborted because it exceeds the CPU capabilities of the resource. It also allows a better accounting of the consumed resources. Since 2007, CPU power in WLCG has traditionally been estimated using the HEP-Spec06 benchmark (HS06) suite, which was verified at the time to scale properly with a set of typical HEP applications. However, the hardware architecture of processors has evolved, and all WLCG experiments have moved to 64-bit applications and use different compilation flags from those advertised for running HS06. It is therefore interesting to check the scaling of HS06 with the HEP applications. For this purpose, we have been using CPU intensive massive simulation productions from the LHCb experiment and compared their event throughput to the HS06 rating of the worker nodes. We also compared it with a much faster benchmark script that is used by the DIRAC framework used by LHCb for evaluating at run time the performance of the worker nodes. This contribution reports on the findings of these comparisons: the main observation is that the scaling with HS06 is no longer fulfilled, while the fast benchmarks have a better scaling but are less precise. One can also clearly see that some hardware or software features when enabled on the worker nodes may enhance their performance beyond expectation from either benchmark, depending on external factors.
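The matching described here (slot power and remaining time versus CPU-work per event) reduces to simple arithmetic. The sketch below assumes power is expressed in HS06 and CPU-work in HS06-seconds; the function name, the safety margin and all numbers are hypothetical and are not taken from DIRAC or LHCb.

```python
def max_events(power_hs06, time_left_s, work_per_event_hs06s, margin_s=0):
    """Number of whole events that fit in the slot's remaining CPU-work budget.

    power_hs06:           estimated power of the slot (HS06)
    time_left_s:          remaining wall time of the slot (seconds)
    work_per_event_hs06s: CPU-work needed per event (HS06 x seconds)
    margin_s:             time held back as a safety margin (seconds)
    """
    usable_s = max(time_left_s - margin_s, 0)
    budget = power_hs06 * usable_s
    return budget // work_per_event_hs06s


# Hypothetical slot: 10 HS06, 10 hours left, 1 hour margin, 500 HS06.s per event.
n_events = max_events(power_hs06=10, time_left_s=36000,
                      work_per_event_hs06s=500, margin_s=3600)
```

If the power estimate is too optimistic, the job asks for more events than fit and risks being aborted, which is why the precision of the benchmark matters.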
Data-Driven Packet Loss Estimation for Node Healthy Sensing in Decentralized Cluster.

PubMed

Fan, Hangyu; Wang, Huandong; Li, Yong

2018-01-23

Decentralized clustering has been widely adopted in various fields of modern information technology in recent years. One of the main reasons is its high availability and fault tolerance, which can prevent the entire system from breaking down because of a single point of failure. Recently, toolkits such as Akka have been commonly used to build such clusters easily. However, clusters of this kind, which use Gossip as their membership management protocol and a link failure detection mechanism to detect link failures, cannot deal with the scenario in which a node stochastically drops packets and corrupts the member status of the cluster. In this paper, we formulate the problem as evaluating the link quality and finding a max clique (NP-Complete) in the connectivity graph. We then propose an algorithm that consists of two models, driven by data from the application layer, that solve these two problems respectively. Through simulations with statistical data and a real-world product, we demonstrate that our algorithm performs well.
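To illustrate the max-clique formulation, the sketch below finds the largest set of nodes that are pairwise connected by links judged good. It uses the classical Bron-Kerbosch algorithm with pivoting, which is an exact exponential-time method swapped in for illustration; it is not the data-driven algorithm proposed in the paper. The helper names and the example links are hypothetical.

```python
def build_adjacency(nodes, good_links):
    """Undirected connectivity graph: an edge means the link quality is acceptable."""
    adj = {n: set() for n in nodes}
    for u, v in good_links:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def max_clique(adj):
    """Largest clique via Bron-Kerbosch with pivoting (exact, worst case exponential)."""
    best = set()

    def expand(r, p, x):
        nonlocal best
        if not p and not x:
            if len(r) > len(best):
                best = set(r)
            return
        # Choose the pivot with the most neighbours in p to prune branches.
        pivot = max(p | x, key=lambda v: len(adj[v] & p))
        for v in list(p - adj[pivot]):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return best


# Hypothetical cluster: a, b, c see each other reliably; d and e only partly.
nodes = ["a", "b", "c", "d", "e"]
good_links = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d"), ("d", "e")]
members = max_clique(build_adjacency(nodes, good_links))
```

Keeping only the nodes in the largest clique gives a membership in which every pair can communicate reliably, at the cost of excluding nodes with even one bad link.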
Separable projection integrals for higher-order correlators of the cosmic microwave sky: Acceleration by factors exceeding 100

NASA Astrophysics Data System (ADS)

Briggs, J. P.; Pennycook, S. J.; Fergusson, J. R.; Jäykkä, J.; Shellard, E. P. S.

2016-04-01

We present a case study describing efforts to optimise and modernise "Modal", the simulation and analysis pipeline used by the Planck satellite experiment for constraining general non-Gaussian models of the early universe via the bispectrum (or three-point correlator) of the cosmic microwave background radiation. We focus on one particular element of the code: the projection of bispectra from the end of inflation to the spherical shell at decoupling, which defines the CMB we observe today. This code involves a three-dimensional inner product between two functions, one of which requires an integral, on a non-rectangular domain containing a sparse grid. We show that by employing separable methods this calculation can be reduced to a one-dimensional summation plus two integrations, reducing the overall dimensionality from four to three. The introduction of separable functions also solves the issue of the non-rectangular sparse grid. This separable method can become unstable in certain scenarios and so the slower non-separable integral must be calculated instead. We present a discussion of the optimisation of both approaches. We demonstrate significant speed-ups of ≈100×, arising from a combination of algorithmic improvements and architecture-aware optimisations targeted at improving thread and vectorisation behaviour. The resulting MPI/OpenMP hybrid code is capable of executing on clusters containing processors and/or coprocessors, with strong-scaling efficiency of 98.6% on up to 16 nodes. We find that a single coprocessor outperforms two processor sockets by a factor of 1.3× and that running the same code across a combination of both microarchitectures improves performance-per-node by a factor of 3.38×. By making bispectrum calculations competitive with those for the power spectrum (or two-point correlator) we are now able to consider joint analysis for cosmological science exploitation of new data.
Accelerating list management for MPI.

DOE Office of Scientific and Technical Information (OSTI.GOV)

Hemmert, K. Scott; Rodrigues, Arun F.; Underwood, Keith Douglas

2005-07-01

The latency and throughput of MPI messages are critically important to a range of parallel scientific applications. In many modern networks, both of these performance characteristics are largely driven by the performance of a processor on the network interface. Because of the semantics of MPI, this embedded processor is forced to traverse a linked list of posted receives each time a message is received. As this list grows long, the latency of message reception grows and the throughput of MPI messages decreases. This paper presents a novel hardware feature to handle list management functions on a network interface. By moving functions such as list insertion, list traversal, and list deletion to the hardware unit, latencies are decreased by up to 20% in the zero length queue case with dramatic improvements in the presence of long queues. Similarly, the throughput is increased by up to 10% in the zero length queue case and by nearly 100% in the presence of queues of 30 messages.
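The traversal cost described in this abstract comes from MPI's matching rule: an incoming message must be paired with the earliest posted receive whose source and tag match, where either may be a wildcard. The software sketch below models that search and counts the entries visited. It is an illustration only: it uses a Python list instead of a linked list, ignores communicators, and the class and constant names are assumptions rather than any real MPI implementation's API.

```python
ANY = -1  # stand-in for the MPI_ANY_SOURCE / MPI_ANY_TAG wildcards


class PostedReceiveList:
    """Ordered list of posted receives, searched front to back on each arrival."""

    def __init__(self):
        self._entries = []

    def post(self, source, tag, buffer_id):
        """Append a receive; posting order determines matching order."""
        self._entries.append((source, tag, buffer_id))

    def match(self, source, tag):
        """Return (buffer_id, entries_traversed) for the first match and remove it."""
        for i, (src, tg, buf) in enumerate(self._entries):
            if src in (ANY, source) and tg in (ANY, tag):
                del self._entries[i]
                return buf, i + 1
        return None, len(self._entries)   # no match: message would be unexpected


# Hypothetical workload: 30 receives posted, message matches only the last one.
posted = PostedReceiveList()
for tag in range(30):
    posted.post(source=0, tag=tag, buffer_id=f"buf{tag}")
buf, traversed = posted.match(source=0, tag=29)
```

The number of entries traversed grows linearly with the queue length, which is the per-message cost the hardware unit in the paper is meant to take off the embedded processor.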
Using Intel's Knight Landing Processor to Accelerate Global Nested Air Quality Prediction Modeling System (GNAQPMS) Model

NASA Astrophysics Data System (ADS)

Wang, H.; Chen, H.; Chen, X.; Wu, Q.; Wang, Z.

2016-12-01

The Global Nested Air Quality Prediction Modeling System for Hg (GNAQPMS-Hg) is a global chemical transport model coupled with an Hg transport module to investigate mercury pollution. In this study, we present our work on porting the GNAQPMS model to the Intel Xeon Phi processor, Knights Landing (KNL), to accelerate the model. KNL is the second-generation product adopting the Many Integrated Core (MIC) architecture. Compared with the first-generation Knights Corner (KNC), KNL has more new hardware features and can be used as a standalone processor as well as a coprocessor alongside another CPU. Using the VTune tool, the high-overhead modules in the GNAQPMS model were identified, including CBMZ gas chemistry, the advection and convection module, and the wet deposition module. These high-overhead modules were accelerated by optimizing code and using new techniques of KNL. The following optimization measures were taken: 1) changing the pure MPI parallel mode to a hybrid parallel mode with MPI and OpenMP; 2) vectorizing the code to use the 512-bit wide vector computation unit; 3) reducing unnecessary memory access and calculation; 4) reducing Thread Local Storage (TLS) for common variables with each OpenMP thread in CBMZ; 5) changing the way of global communication from file writing and reading to MPI functions. After optimization, the performance of GNAQPMS is greatly increased on both the CPU and KNL platforms. The single-node test showed that the optimized version has a 2.6x speedup on the two-socket CPU platform and a 3.3x speedup on the one-socket KNL platform compared with the baseline code, which means the KNL has a 1.29x speedup when compared with the two-socket CPU platform.
A Survey of Recent MARTe Based Systems

NASA Astrophysics Data System (ADS)

Neto, André C.; Alves, Diogo; Boncagni, Luca; Carvalho, Pedro J.; Valcarcel, Daniel F.; Barbalace, Antonio; De Tommasi, Gianmaria; Fernandes, Horácio; Sartori, Filippo; Vitale, Enzo; Vitelli, Riccardo; Zabeo, Luca

2011-08-01

The Multithreaded Application Real-Time executor (MARTe) is a data driven framework environment for the development and deployment of real-time control algorithms. The main ideas which led to the present version of the framework were to standardize the development of real-time control systems, while providing a set of strictly bounded standard interfaces to the outside world and also accommodating a collection of facilities which promote the speed and ease of development, commissioning and deployment of such systems. At the core of every MARTe based application, is a set of independent inter-communicating software blocks, named Generic Application Modules (GAM), orchestrated by a real-time scheduler. The platform independence of its core library provides MARTe the necessary robustness and flexibility for conveniently testing applications in different environments including non-real-time operating systems. MARTe is already being used in several machines, each with its own peculiarities regarding hardware interfacing, supervisory control configuration, operating system and target control application. This paper presents and compares the most recent results of systems using MARTe: the JET Vertical Stabilization system, which uses the Real Time Application Interface (RTAI) operating system on Intel multi-core processors; the COMPASS plasma control system, driven by Linux RT also on Intel multi-core processors; ISTTOK real-time tomography equilibrium reconstruction which shares the same support configuration of COMPASS; JET error field correction coils based on VME, PowerPC and VxWorks; FTU LH reflected power system running on VME, Intel with RTAI.
Field Model: An Object-Oriented Data Model for Fields

NASA Technical Reports Server (NTRS)

Moran, Patrick J.

2001-01-01

We present an extensible, object-oriented data model designed for field data entitled Field Model (FM). FM objects can represent a wide variety of fields, including fields of arbitrary dimension and node type. FM can also handle time-series data. FM achieves generality through carefully selected topological primitives and through an implementation that leverages the potential of templated C++. FM supports fields where the node values are paired with any cell type. Thus FM can represent data where the field nodes are paired with the vertices ("vertex-centered" data), fields where the nodes are paired with the D-dimensional cells in R(sup D) (often called "cell-centered" data), as well as fields where nodes are paired with edges or other cell types. FM is designed to effectively handle very large data sets; in particular FM employs a demand-driven evaluation strategy that works especially well with large field data. Finally, the interfaces developed for FM have the potential to effectively abstract field data based on adaptive meshes. We present initial results with a triangular adaptive grid in R(sup 2) and discuss how the same design abstractions would work equally well with other adaptive-grid variations, including meshes in R(sup 3).
A Decentralized Event-Triggered Dissipative Control Scheme for Systems With Multiple Sensors to Sample the System Outputs.

PubMed

Zhang, Xian-Ming; Han, Qing-Long

2016-12-01

This paper is concerned with decentralized event-triggered dissipative control for systems with the entries of the system outputs having different physical properties. Depending on these different physical properties, the entries of the system outputs are grouped into multiple nodes. A number of sensors are used to sample the signals from different nodes. A decentralized event-triggering scheme is introduced to select those necessary sampled-data packets to be transmitted so that communication resources can be saved significantly while preserving the prescribed closed-loop performance. First, in order to organize the decentralized data packets transmitted from the sensor nodes, a data packet processor (DPP) is used to generate a new signal to be held by the zero-order-hold once the signal stored by the DPP is updated at some time instant. Second, under the mechanism of the DPP, the resulting closed-loop system is modeled as a linear system with an interval time-varying delay. A sufficient condition is derived such that the closed-loop system is asymptotically stable and strictly (Q_0, S_0, R_0)-dissipative, where Q_0, S_0, and R_0 are real matrices of appropriate dimensions with Q_0 and R_0 symmetric. Third, suitable output-based controllers can be designed based on solutions to a set of linear matrix inequalities. Finally, two examples are given to demonstrate the effectiveness of the proposed method.
Sam2bam: High-Performance Framework for NGS Data Preprocessing Tools

PubMed Central

Cheng, Yinhe; Tzeng, Tzy-Hwa Kathy

2016-01-01

This paper introduces a high-throughput software tool framework called sam2bam that enables users to significantly speed up pre-processing for next-generation sequencing data. The sam2bam is especially efficient on single-node multi-core large-memory systems. It can reduce the runtime of data pre-processing in marking duplicate reads on a single node system by 156–186x compared with de facto standard tools. The sam2bam consists of parallel software components that can fully utilize multiple processors, available memory, high-bandwidth storage, and hardware compression accelerators, if available. The sam2bam provides file format conversion between well-known genome file formats, from SAM to BAM, as a basic feature. Additional features such as analyzing, filtering, and converting input data are provided by using plug-in tools, e.g., duplicate marking, which can be attached to sam2bam at runtime. We demonstrated that sam2bam could significantly reduce the runtime of next generation sequencing (NGS) data pre-processing from about two hours to about one minute for a whole-exome data set on a 16-core single-node system using up to 130 GB of memory. The sam2bam could reduce the runtime of NGS data pre-processing from about 20 hours to about nine minutes for a whole-genome sequencing data set on the same system using up to 711 GB of memory. PMID:27861637
Performance analysis of three dimensional integral equation computations on a massively parallel computer. M.S. Thesis

NASA Technical Reports Server (NTRS)

Logan, Terry G.

1994-01-01

The purpose of this study is to investigate the performance of the integral equation computations using numerical source field-panel method in a massively parallel processing (MPP) environment. A comparative study of computational performance of the MPP CM-5 computer and conventional Cray-YMP supercomputer for a three-dimensional flow problem is made. A serial FORTRAN code is converted into a parallel CM-FORTRAN code. Some performance results are obtained on CM-5 with 32, 62, 128 nodes along with those on Cray-YMP with a single processor. The comparison of the performance indicates that the parallel CM-FORTRAN code near or out-performs the equivalent serial FORTRAN code for some cases.
Towards an Autonomic Cluster Management System (ACMS) with Reflex Autonomicity

NASA Technical Reports Server (NTRS)

Truszkowski, Walt; Hinchey, Mike; Sterritt, Roy

2005-01-01

Cluster computing, whereby a large number of simple processors or nodes are combined together to apparently function as a single powerful computer, has emerged as a research area in its own right. The approach offers a relatively inexpensive means of providing a fault-tolerant environment and achieving significant computational capabilities for high-performance computing applications. However, the task of manually managing and configuring a cluster quickly becomes daunting as the cluster grows in size. Autonomic computing, with its vision to provide self-management, can potentially solve many of the problems inherent in cluster management. We describe the development of a prototype Autonomic Cluster Management System (ACMS) that exploits autonomic properties in automating cluster management and its evolution to include reflex reactions via pulse monitoring.
A Configurable Event-Driven Convolutional Node with Rate Saturation Mechanism for Modular ConvNet Systems Implementation.

PubMed

Camuñas-Mesa, Luis A; Domínguez-Cordero, Yaisel L; Linares-Barranco, Alejandro; Serrano-Gotarredona, Teresa; Linares-Barranco, Bernabé

2018-01-01

Convolutional Neural Networks (ConvNets) are a particular type of neural network often used for many applications like image recognition, video analysis or natural language processing. They are inspired by the human brain, following a specific organization of the connectivity pattern between layers of neurons known as receptive field. These networks have been traditionally implemented in software, but they are becoming more computationally expensive as they scale up, having limitations for real-time processing of high-speed stimuli. On the other hand, hardware implementations show difficulties to be used for different applications, due to their reduced flexibility. In this paper, we propose a fully configurable event-driven convolutional node with rate saturation mechanism that can be used to implement arbitrary ConvNets on FPGAs. This node includes a convolutional processing unit and a routing element which allows to build large 2D arrays where any multilayer structure can be implemented. The rate saturation mechanism emulates the refractory behavior in biological neurons, guaranteeing a minimum separation in time between consecutive events. A 4-layer ConvNet with 22 convolutional nodes trained for poker card symbol recognition has been implemented in a Spartan6 FPGA. This network has been tested with a stimulus where 40 poker cards were observed by a Dynamic Vision Sensor (DVS) in 1 s time. Different slow-down factors were applied to characterize the behavior of the system for high speed processing. For slow stimulus play-back, a 96% recognition rate is obtained with a power consumption of 0.85 mW. At maximum play-back speed, a traffic control mechanism downsamples the input stimulus, obtaining a recognition rate above 63% when less than 20% of the input events are processed, demonstrating the robustness of the network.
A Configurable Event-Driven Convolutional Node with Rate Saturation Mechanism for Modular ConvNet Systems Implementation

PubMed Central

Camuñas-Mesa, Luis A.; Domínguez-Cordero, Yaisel L.; Linares-Barranco, Alejandro; Serrano-Gotarredona, Teresa; Linares-Barranco, Bernabé

2018-01-01

Convolutional Neural Networks (ConvNets) are a particular type of neural network often used for many applications like image recognition, video analysis or natural language processing. They are inspired by the human brain, following a specific organization of the connectivity pattern between layers of neurons known as receptive field. These networks have been traditionally implemented in software, but they are becoming more computationally expensive as they scale up, having limitations for real-time processing of high-speed stimuli. On the other hand, hardware implementations show difficulties to be used for different applications, due to their reduced flexibility. In this paper, we propose a fully configurable event-driven convolutional node with rate saturation mechanism that can be used to implement arbitrary ConvNets on FPGAs. This node includes a convolutional processing unit and a routing element which allows to build large 2D arrays where any multilayer structure can be implemented. The rate saturation mechanism emulates the refractory behavior in biological neurons, guaranteeing a minimum separation in time between consecutive events. A 4-layer ConvNet with 22 convolutional nodes trained for poker card symbol recognition has been implemented in a Spartan6 FPGA. This network has been tested with a stimulus where 40 poker cards were observed by a Dynamic Vision Sensor (DVS) in 1 s time. Different slow-down factors were applied to characterize the behavior of the system for high speed processing. For slow stimulus play-back, a 96% recognition rate is obtained with a power consumption of 0.85 mW. At maximum play-back speed, a traffic control mechanism downsamples the input stimulus, obtaining a recognition rate above 63% when less than 20% of the input events are processed, demonstrating the robustness of the network. PMID:29515349
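The rate saturation mechanism in this abstract guarantees a minimum time separation between consecutive events. A minimal software sketch of that idea follows, assuming that events arriving inside the refractory window are simply dropped; the function name `rate_saturate` and the parameter `refractory` are hypothetical, and the FPGA node described in the paper may handle the window differently.

```python
from typing import Iterable, List, Optional


def rate_saturate(event_times: Iterable[float], refractory: float) -> List[float]:
    """Keep only events at least `refractory` seconds after the last kept event.

    `event_times` is assumed to be sorted in increasing order.
    """
    emitted: List[float] = []
    last: Optional[float] = None
    for t in event_times:
        if last is None or t - last >= refractory:
            emitted.append(t)
            last = t
    return emitted


# Events at 0.5 s and 1.25 s fall inside a 1 s refractory window and are dropped.
out = rate_saturate([0.0, 0.5, 1.0, 1.25, 3.0], refractory=1.0)
```

The output rate is therefore bounded by 1/`refractory` regardless of how fast the input events arrive, which is the saturation behaviour the abstract compares to biological neurons.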
High-performance reconfigurable hardware architecture for restricted Boltzmann machines.

PubMed

Ly, Daniel Le; Chow, Paul

2010-11-01

Despite the popularity and success of neural networks in research, the number of resulting commercial or industrial applications has been limited. A primary cause for this lack of adoption is that neural networks are usually implemented as software running on general-purpose processors. Hence, a hardware implementation that can exploit the inherent parallelism in neural networks is desired. This paper investigates how the restricted Boltzmann machine (RBM), which is a popular type of neural network, can be mapped to a high-performance hardware architecture on field-programmable gate array (FPGA) platforms. The proposed modular framework is designed to reduce the time complexity of the computations through heavily customized hardware engines. A method to partition large RBMs into smaller congruent components is also presented, allowing the distribution of one RBM across multiple FPGA resources. The framework is tested on a platform of four Xilinx Virtex II-Pro XC2VP70 FPGAs running at 100 MHz through a variety of different configurations. The maximum performance was obtained by instantiating an RBM of 256 × 256 nodes distributed across four FPGAs, which resulted in a computational speed of 3.13 billion connection-updates-per-second and a speedup of 145-fold over an optimized C program running on a 2.8-GHz Intel processor.
PC-CUBE: A Personal Computer Based Hypercube

NASA Technical Reports Server (NTRS)

Ho, Alex; Fox, Geoffrey; Walker, David; Snyder, Scott; Chang, Douglas; Chen, Stanley; Breaden, Matt; Cole, Terry

1988-01-01

PC-CUBE is an ensemble of IBM PCs or close compatibles connected in the hypercube topology with ordinary computer cables. Communication occurs at the rate of 115.2 Kbaud via the RS-232 serial links. Available for PC-CUBE are the Crystalline Operating System III (CrOS III), the Mercury Operating System, and CUBIX and PLOTIX, which are parallel I/O and graphics libraries. A CrOS performance monitor was developed to facilitate the measurement of communication and computation time of a program and their effects on performance. Also available are CXLISP, a parallel version of the XLISP interpreter; GRAFIX, some graphics routines for the EGA and CGA; and a general execution profiler for determining execution time spent by program subroutines. PC-CUBE provides a programming environment similar to all hypercube systems running CrOS III, Mercury and CUBIX. In addition, every node (personal computer) has its own graphics display monitor and storage devices. These allow data to be displayed or stored at every processor, which has much instructional value and enables easier debugging of applications. Some application programs which are taken from the book Solving Problems on Concurrent Processors (Fox 88) were implemented with graphics enhancement on PC-CUBE. The applications range from solving the Mandelbrot set, Laplace equation, wave equation, long range force interaction, to WaTor, an ecological simulation.
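For readers unfamiliar with the hypercube topology mentioned in this abstract, a short sketch may help. In a d-dimensional hypercube, nodes are numbered 0 to 2^d - 1 and two nodes are linked when their binary labels differ in exactly one bit. The function name `hypercube_neighbors` is hypothetical and is not part of CrOS III or any PC-CUBE software.

```python
from typing import List


def hypercube_neighbors(node: int, dimension: int) -> List[int]:
    """Return the neighbours of `node` in a hypercube of the given dimension.

    Flipping bit k of the node label gives the neighbour along dimension k.
    """
    if not 0 <= node < (1 << dimension):
        raise ValueError("node label out of range for this dimension")
    return [node ^ (1 << bit) for bit in range(dimension)]


# Node 5 (binary 101) in a 3-cube is linked to 100, 111 and 001.
neighbors = hypercube_neighbors(5, 3)
```

Each node needs only `dimension` links, so an 8-node machine in this arrangement would use three serial connections per PC.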
An Atmospheric General Circulation Model with Chemistry for the CRAY T3E: Design, Performance Optimization and Coupling to an Ocean Model

NASA Technical Reports Server (NTRS)

Farrara, John D.; Drummond, Leroy A.; Mechoso, Carlos R.; Spahr, Joseph A.

1998-01-01

The design, implementation and performance optimization on the CRAY T3E of an atmospheric general circulation model (AGCM) which includes the transport of, and chemical reactions among, an arbitrary number of constituents is reviewed. The parallel implementation is based on a two-dimensional (longitude and latitude) data domain decomposition. Initial optimization efforts centered on minimizing the impact of substantial static and weakly-dynamic load imbalances among processors through load redistribution schemes. Recent optimization efforts have centered on single-node optimization. Strategies employed include loop unrolling, both manually and through the compiler, the use of an optimized assembler-code library for special function calls, and restructuring of parts of the code to improve data locality. Data exchanges and synchronizations involved in coupling different data-distributed models can account for a significant fraction of the running time. Therefore, the required scattering and gathering of data must be optimized. In systems such as the T3E, there is much more aggregate bandwidth in the total system than in any particular processor. This suggests a distributed design. The design and implementation of such a distributed 'Data Broker' as a means to efficiently couple the components of our climate system model is described.

A web-based institutional DICOM distribution system with the integration of the Clinical Trial Processor (CTP).

PubMed

Aryanto, K Y E; Broekema, A; Langenhuysen, R G A; Oudkerk, M; van Ooijen, P M A

2015-05-01

To develop and test a fast and easy rule-based web-environment with optional de-identification of imaging data to facilitate data distribution within a hospital environment. A web interface was built using Hypertext Preprocessor (PHP), an open source scripting language for web development, and Java with SQL Server to handle the database. The system allows for the selection of patient data and for de-identifying these when necessary. Using the services provided by the RSNA Clinical Trial Processor (CTP), the selected images were pushed to the appropriate services using a protocol based on the module created for the associated task. Five pipelines, each performing a different task, were set up in the server. In a 75 month period, more than 2,000,000 images were transferred and de-identified in a proper manner while 20,000,000 images were moved from one node to another without de-identification. While maintaining a high level of security and stability, the proposed system is easy to set up, integrates well with our clinical and research practice, and provides a fast and accurate vendor-neutral process of transferring, de-identifying, and storing DICOM images. Its ability to run different de-identification processes in parallel pipelines is a major advantage in both clinical and research settings.
International Space Station (ISS)

NASA Image and Video Library

2001-03-01

The Environmental Control and Life Support System (ECLSS) Group of the Flight Projects Directorate at the Marshall Space Flight Center in Huntsville, Alabama, is responsible for designing and building the life support systems that will provide the crew of the International Space Station (ISS) a comfortable environment in which to live and work. This photograph shows the mockup of the ECLSS to be installed in the Node 3 module of the ISS. From left to right, shower rack, waste management rack, Water Recovery System (WRS) Rack #2, WRS Rack #1, and Oxygen Generation System (OGS) rack are shown. The WRS provides clean water through the reclamation of wastewaters and is comprised of a Urine Processor Assembly (UPA) and a Water Processor Assembly (WPA). The UPA accepts and processes pretreated crewmember urine to allow it to be processed along with other wastewaters in the WPA. The WPA removes free gas, organic, and nonorganic constituents before the water goes through a series of multifiltration beds for further purification. The OGS produces oxygen for breathing air for the crew and laboratory animals, as well as for replacing oxygen loss. The OGS is comprised of a cell stack, which electrolyzes (breaks apart the hydrogen and oxygen molecules) some of the clean water provided by the WRS, and the separators that remove the gases from the water after electrolysis.
Voltage scheduling for low power/energy

NASA Astrophysics Data System (ADS)

Manzak, Ali

2001-07-01

Power considerations have become an increasingly dominant factor in the design of both portable and desk-top systems. An effective way to reduce power consumption is to lower the supply voltage since voltage is quadratically related to power. This dissertation considers the problem of lowering the supply voltage at (i) the system level and at (ii) the behavioral level. At the system level, the voltage of the variable voltage processor is dynamically changed with the work load. Processors with limited sized buffers as well as those with very large buffers are considered. Given the task arrival times, deadline times, execution times, periods and switching activities, task scheduling algorithms that minimize energy or peak power are developed for the processors equipped with very large buffers. A relation between the operating voltages of the tasks for minimum energy/power is determined using the Lagrange multiplier method, and an iterative algorithm that utilizes this relation is developed. Experimental results show that the voltage assignment obtained by the proposed algorithm is very close (0.1% error) to that of the optimal energy assignment and the optimal peak power (1% error) assignment. Next, on-line and off-line minimum energy task scheduling algorithms are developed for processors with limited sized buffers. These algorithms have polynomial time complexity and present optimal (off-line) and close-to-optimal (on-line) solutions. A procedure to calculate the minimum buffer size given information about the size of the task (maximum, minimum), execution time (best case, worst case) and deadlines is also presented. At the behavioral level, resources operating at multiple voltages are used to minimize power while maintaining the throughput. Such a scheme has the advantage of allowing modules on the critical paths to be assigned to the highest voltage levels (thus meeting the required timing constraints) while allowing modules on non-critical paths to be assigned to lower voltage levels (thus reducing the power consumption). A polynomial time resource and latency constrained scheduling algorithm is developed to distribute the available slack among the nodes such that power consumption is minimum. The algorithm is iterative and utilizes the slack based on the Lagrange multiplier method.
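The quadratic relation between supply voltage and power that motivates this dissertation can be made concrete with a small worked sketch. It assumes the common simplified switching-energy model E = C * V^2 per cycle with a normalised capacitance; the function name `dynamic_energy` and the example voltages are hypothetical and are not taken from the dissertation.

```python
def dynamic_energy(cycles: int, voltage: float, capacitance: float = 1.0) -> float:
    """Switching energy under the simplified model E = C * V^2 * cycles.

    `capacitance` is a normalised effective switched capacitance (assumption).
    """
    return capacitance * voltage ** 2 * cycles


# The same task (same cycle count) run at full and at half supply voltage.
full = dynamic_energy(cycles=1_000_000, voltage=5.0)
half = dynamic_energy(cycles=1_000_000, voltage=2.5)
ratio = half / full  # energy at half voltage relative to full voltage
```

Under this model, halving the voltage cuts the switching energy of a fixed workload to one quarter. Lower voltage also slows the circuit, which is why the scheduling algorithms in the abstract can only lower voltages as far as deadlines and available slack allow.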
Supermassive Black Hole Binaries in High Performance Massively Parallel Direct N-body Simulations on Large GPU Clusters

NASA Astrophysics Data System (ADS)

Spurzem, R.; Berczik, P.; Zhong, S.; Nitadori, K.; Hamada, T.; Berentzen, I.; Veles, A.

2012-07-01

Astrophysical computer simulations of dense star clusters in galactic nuclei with supermassive black holes are presented using new cost-efficient supercomputers in China accelerated by graphical processing cards (GPU). We use large high-accuracy direct N-body simulations with Hermite scheme and block-time steps, parallelised across a large number of nodes on the large scale and across many GPU thread processors on each node on the small scale. A sustained performance of more than 350 Tflop/s is reached for a science run using 1600 Fermi C2050 GPUs simultaneously; a detailed performance model is presented, along with studies for the largest GPU clusters in China with up to Petaflop/s performance and 7000 Fermi GPU cards. In our case study we look at two supermassive black holes with equal and unequal masses embedded in a dense stellar cluster in a galactic nucleus. The hardening processes due to interactions between black holes and stars, effects of rotation in the stellar system and relativistic forces between the black holes are simultaneously taken into account. The simulation stops at the complete relativistic merger of the black holes.
Comparing the Performance of Blue Gene/Q with Leading Cray XE6 and InfiniBand Systems

DOE Office of Scientific and Technical Information (OSTI.GOV)

Kerbyson, Darren J.; Barker, Kevin J.; Vishnu, Abhinav

2013-01-21

Three types of systems dominate the current High Performance Computing landscape: the Cray XE6, the IBM Blue Gene, and commodity clusters using InfiniBand. These systems have quite different characteristics, making the choice for a particular deployment difficult. The XE6 uses Cray's proprietary Gemini 3-D torus interconnect with two nodes at each network endpoint. The latest IBM Blue Gene/Q uses a single socket integrating processor and communication in a 5-D torus network. InfiniBand provides the flexibility of using nodes from many vendors connected in many possible topologies. The performance characteristics of each vary vastly along with their utilization model. In this work we compare the performance of these three systems using a combination of micro-benchmarks and a set of production applications. In particular we discuss the causes of variability in performance across the systems and also quantify where performance is lost using a combination of measurements and models. Our results show that significant performance can be lost in normal production operation of the Cray XE6 and InfiniBand clusters in comparison to Blue Gene/Q.
MSAProbs-MPI: parallel multiple sequence aligner for distributed-memory systems.

PubMed

González-Domínguez, Jorge; Liu, Yongchao; Touriño, Juan; Schmidt, Bertil

2016-12-15

MSAProbs is a state-of-the-art protein multiple sequence alignment tool based on hidden Markov models. It can achieve high alignment accuracy at the expense of relatively long runtimes for large-scale input datasets. In this work we present MSAProbs-MPI, a distributed-memory parallel version of the multithreaded MSAProbs tool that is able to reduce runtimes by exploiting the compute capabilities of common multicore CPU clusters. Our performance evaluation on a cluster with 32 nodes (each containing two Intel Haswell processors) shows reductions in execution time of over one order of magnitude for typical input datasets. Furthermore, MSAProbs-MPI using eight nodes is faster than the GPU-accelerated QuickProbs running on a Tesla K20. Another strong point is that MSAProbs-MPI can deal with large datasets for which MSAProbs and QuickProbs might fail due to time and memory constraints, respectively. Source code in C++ and MPI running on Linux systems as well as a reference manual are available at http://msaprobs.sourceforge.net. Contact: jgonzalezd@udc.es. Supplementary information: Supplementary data are available at Bioinformatics online.
Harnessing the power of emerging petascale platforms

NASA Astrophysics Data System (ADS)

Mellor-Crummey, John

2007-07-01

As part of the US Department of Energy's Scientific Discovery through Advanced Computing (SciDAC-2) program, science teams are tackling problems that require computational simulation and modeling at the petascale. A grand challenge for computer science is to develop software technology that makes it easier to harness the power of these systems to aid scientific discovery. As part of its activities, the SciDAC-2 Center for Scalable Application Development Software (CScADS) is building open source software tools to support efficient scientific computing on the emerging leadership-class platforms. In this paper, we describe two tools for performance analysis and tuning that are being developed as part of CScADS: a tool for analyzing scalability and performance, and a tool for optimizing loop nests for better node performance. We motivate these tools by showing how they apply to S3D, a turbulent combustion code under development at Sandia National Laboratory. For S3D, our node performance analysis tool helped uncover several performance bottlenecks. Using our loop nest optimization tool, we transformed S3D's most costly loop nest to reduce execution time by a factor of 2.94 for a processor working on a 50^3 domain.
Fast computation of voxel-level brain connectivity maps from resting-state functional MRI using l₁-norm as approximation of Pearson's temporal correlation: proof-of-concept and example vector hardware implementation.

PubMed

Minati, Ludovico; Zacà, Domenico; D'Incerti, Ludovico; Jovicich, Jorge

2014-09-01

An outstanding issue in graph-based analysis of resting-state functional MRI is choice of network nodes. Individual consideration of entire brain voxels may represent a less biased approach than parcellating the cortex according to pre-determined atlases, but entails establishing connectedness for 10^9 to 10^11 links, with often prohibitive computational cost. Using a representative Human Connectome Project dataset, we show that, following appropriate time-series normalization, it may be possible to accelerate connectivity determination by replacing Pearson correlation with l1-norm. Even though the adjacency matrices derived from correlation coefficients and l1-norms are not identical, their similarity is high. Further, we describe and provide in full an example vector hardware implementation of l1-norm on an array of 4096 zero instruction-set processors. Calculation times <1000 s are attainable, removing the major deterrent to voxel-based resting-state network mapping and revealing fine-grained node degree heterogeneity. L1-norm should be given consideration as a substitute for correlation in very high-density resting-state functional connectivity analyses.
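The substitution described in this abstract can be sketched in a few lines. The normalization used here (mean removal followed by scaling to unit Euclidean norm) is an assumption made for illustration, since the abstract does not specify the paper's exact procedure; the function names are hypothetical. After this normalization, Pearson correlation is a dot product, while the l1 distance needs only subtractions and absolute values, which is what makes it attractive for simple vector hardware.

```python
import math
from typing import List, Sequence


def normalize(x: Sequence[float]) -> List[float]:
    """Remove the mean and scale to unit Euclidean norm (assumed normalization).

    A constant series has zero norm and is not handled by this sketch.
    """
    mean = sum(x) / len(x)
    centered = [v - mean for v in x]
    norm = math.sqrt(sum(c * c for c in centered))
    return [c / norm for c in centered]


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation as the dot product of the normalized series."""
    return sum(a * b for a, b in zip(normalize(x), normalize(y)))


def l1_distance(x: Sequence[float], y: Sequence[float]) -> float:
    """Sum of absolute differences between the normalized series."""
    return sum(abs(a - b) for a, b in zip(normalize(x), normalize(y)))


x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [3.0, 5.0, 7.0, 9.0, 11.0]   # same shape as x, so correlation is 1
z = [5.0, 4.0, 3.0, 2.0, 1.0]    # reversed x, so correlation is -1
```

In this toy case the perfectly correlated pair has an l1 distance of zero and the anti-correlated pair has a large one, so a small l1 distance stands in for a high correlation. The two measures are not identical in general, which matches the abstract's remark that the resulting adjacency matrices are similar rather than equal.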
Entanglement of spin waves among four quantum memories.

PubMed

Choi, K S; Goban, A; Papp, S B; van Enk, S J; Kimble, H J

2010-11-18

Quantum networks are composed of quantum nodes that interact coherently through quantum channels, and open a broad frontier of scientific opportunities. For example, a quantum network can serve as a 'web' for connecting quantum processors for computation and communication, or as a 'simulator' allowing investigations of quantum critical phenomena arising from interactions among the nodes mediated by the channels. The physical realization of quantum networks generically requires dynamical systems capable of generating and storing entangled states among multiple quantum memories, and efficiently transferring stored entanglement into quantum channels for distribution across the network. Although such capabilities have been demonstrated for diverse bipartite systems, entangled states have not been achieved for interconnects capable of 'mapping' multipartite entanglement stored in quantum memories to quantum channels. Here we demonstrate measurement-induced entanglement stored in four atomic memories; user-controlled, coherent transfer of the atomic entanglement to four photonic channels; and characterization of the full quadripartite entanglement using quantum uncertainty relations. Our work therefore constitutes an advance in the distribution of multipartite entanglement across quantum networks. We also show that our entanglement verification method is suitable for studying the entanglement order of condensed-matter systems in thermal equilibrium.
Graphics Processing Unit Acceleration of Gyrokinetic Turbulence Simulations

NASA Astrophysics Data System (ADS)

Hause, Benjamin; Parker, Scott; Chen, Yang

2013-10-01

We find a substantial increase in on-node performance using Graphics Processing Unit (GPU) acceleration in gyrokinetic delta-f particle-in-cell simulation. Optimization is performed on a two-dimensional slab gyrokinetic particle simulation using the Portland Group Fortran compiler with the OpenACC compiler directives and CUDA Fortran. A mixed implementation of both OpenACC and CUDA is demonstrated. CUDA is required for optimizing the particle deposition algorithm. We have implemented the GPU acceleration on a third generation Core i7 gaming PC with two NVIDIA GTX 680 GPUs. We find comparable, or better, acceleration relative to the NERSC DIRAC cluster with the NVIDIA Tesla C2050 computing processor. The Tesla C2050 is about 2.6 times more expensive than the GTX 580 gaming GPU. We also see enormous speedups (10 or more) on the Titan supercomputer at Oak Ridge with Kepler K20 GPUs. Results show speed-ups comparable to or better than those of OpenMP models utilizing multiple cores. The use of hybrid OpenACC, CUDA Fortran, and MPI models across many nodes will also be discussed. Optimization strategies will be presented. We will discuss progress on optimizing the comprehensive three dimensional general geometry GEM code.
Enabling Functional Neural Circuit Simulations with Distributed Computing of Neuromodulated Plasticity

PubMed Central

Potjans, Wiebke; Morrison, Abigail; Diesmann, Markus

2010-01-01

A major puzzle in the field of computational neuroscience is how to relate system-level learning in higher organisms to synaptic plasticity. Recently, plasticity rules depending not only on pre- and post-synaptic activity but also on a third, non-local neuromodulatory signal have emerged as key candidates to bridge the gap between the macroscopic and the microscopic level of learning. Crucial insights into this topic are expected to be gained from simulations of neural systems, as these allow the simultaneous study of the multiple spatial and temporal scales that are involved in the problem. In particular, synaptic plasticity can be studied during the whole learning process, i.e., on a time scale of minutes to hours and across multiple brain areas. Implementing neuromodulated plasticity in large-scale network simulations where the neuromodulatory signal is dynamically generated by the network itself is challenging, because the network structure is commonly defined purely by the connectivity graph without explicit reference to the embedding of the nodes in physical space. Furthermore, the simulation of networks with realistic connectivity entails the use of distributed computing. A neuromodulated synapse must therefore be informed in an efficient way about the neuromodulatory signal, which is typically generated by a population of neurons located on different machines than either the pre- or post-synaptic neuron. Here, we develop a general framework to solve the problem of implementing neuromodulated plasticity in a time-driven distributed simulation, without reference to a particular implementation language, neuromodulator, or neuromodulated plasticity mechanism. We implement our framework in the simulator NEST and demonstrate excellent scaling up to 1024 processors for simulations of a recurrent network incorporating neuromodulated spike-timing dependent plasticity. PMID:21151370
Can High Bandwidth and Latency Justify Large Cache Blocks in Scalable Multiprocessors?

DTIC Science & Technology

1994-01-01

400 MB/second. Dubnicki's work used trace-driven simulation, with traces collected on an 8-processor machine. We would expect such small-scale... Figure 17: Miss rate of Ind Blocked LU. Figure 18: MCPR of Ind Blocked LU. ...overall miss rate of TGauss is a factor of...easily. This approach assumes that the model parameters we collect from simulations with infinite bandwidth (such as the miss rate and the
Software for System for Controlling a Magnetically Levitated Rotor

NASA Technical Reports Server (NTRS)

Morrison, Carlos R. (Inventor)

2004-01-01

In a rotor assembly having a rotor supported for rotation by magnetic bearings, a processor controlled by software or firmware controls the generation of force vectors that position the rotor relative to its bearings in a 'bounce' mode in which the rotor axis is displaced from the principal axis defined between the bearings and a 'tilt' mode in which the rotor axis is tilted or inclined relative to the principal axis. Waveform driven perturbations are introduced to generate force vectors that excite the rotor in either the 'bounce' or 'tilt' modes.
System for Controlling a Magnetically Levitated Rotor

NASA Technical Reports Server (NTRS)

Morrison, Carlos R. (Inventor)

2006-01-01

In a rotor assembly having a rotor supported for rotation by magnetic bearings, a processor controlled by software or firmware controls the generation of force vectors that position the rotor relative to its bearings in a "bounce" mode in which the rotor axis is displaced from the principal axis defined between the bearings and a "tilt" mode in which the rotor axis is tilted or inclined relative to the principal axis. Waveform driven perturbations are introduced to generate force vectors that excite the rotor in either the "bounce" or "tilt" modes.
Light-Driven Contact Hearing Aid for Broad-Spectrum Amplification: Safety and Effectiveness Pivotal Study.

PubMed

Gantz, Bruce J; Perkins, Rodney; Murray, Michael; Levy, Suzanne Carr; Puria, Sunil

2017-03-01

Demonstrate safety and effectiveness of the light-driven contact hearing aid to support FDA clearance. A single-arm, open-label investigational-device clinical trial. Two private-practice and one hospital-based ENT clinics. Forty-three subjects (86 ears) with mild-to-severe bilateral sensorineural hearing impairment. Bilateral amplification delivered via a light-driven contact hearing aid comprising a Tympanic Lens (Lens) with a customized platform to directly drive the umbo and a behind-the-ear sound processor (Processor) that encodes sound into light pulses to wirelessly deliver signal and power to the Lens. The primary safety endpoint was a determination of "no change" (PTA4 < 10 dB) in residual unaided hearing at the 120-day measurement interval. The primary efficacy endpoint was improvement in word recognition using NU-6 at the 30-day measurement interval over the baseline unaided case. Secondary efficacy endpoints included functional gain from 2 to 10 kHz and speech-in-noise improvement over the baseline unaided case using both omnidirectional and directional microphones. The results for the 86 ears in the study determined a mean change of -0.40 dB in PTA4, indicating no change in residual hearing (p < 0.0001). There were no serious device- or procedure-related adverse events, or unanticipated adverse events. Word recognition aided with the Earlens improved significantly (p < 0.0001) over the unaided performance, by 35% rationalized arcsine units on average. Mean functional gain was 31 dB across 2 to 10 kHz. The average speech-recognition threshold improvement over the unaided case for the Hearing in Noise Test was 0.75 dB (p = 0.028) and 3.14 dB (p < 0.0001) for the omnidirectional and directional microphone modes, respectively. The safety and effectiveness data supported a de novo 510(k) submission that received clearance from the FDA.
A Novel Coarsening Method for Scalable and Efficient Mesh Generation

DOE Office of Scientific and Technical Information (OSTI.GOV)

Yoo, A; Hysom, D; Gunney, B

2010-12-02

In this paper, we propose a novel mesh coarsening method called brick coarsening method. The proposed method can be used in conjunction with any graph partitioner and scales to very large meshes. This method reduces problem space by decomposing the original mesh into fixed-size blocks of nodes called bricks, layered in a similar way to conventional brick laying, and then assigning each node of the original mesh to the appropriate brick. Our experiments indicate that the proposed method scales to very large meshes while allowing a simple RCB partitioner to produce higher-quality partitions with significantly fewer edge cuts. Our results further indicate that the proposed brick-coarsening method allows more complicated partitioners like PT-Scotch to scale to very large problem sizes while still maintaining good partitioning performance with a relatively good edge-cut metric. Graph partitioning is an important problem that has many scientific and engineering applications in such areas as VLSI design, scientific computing, and resource management. Given a graph G = (V,E), where V is the set of vertices and E is the set of edges, the (k-way) graph partitioning problem is to partition the vertices of the graph (V) into k disjoint groups such that each group contains a roughly equal number of vertices and the number of edges connecting vertices in different groups is minimized. Graph partitioning plays a key role in large scientific computing, especially in mesh-based computations, as it is used as a tool to minimize the volume of communication and to ensure well-balanced load across computing nodes. The impact of graph partitioning on the reduction of communication can be easily seen, for example, in different iterative methods to solve a sparse system of linear equations.
Here, a graph partitioning technique is applied to the matrix, which is basically a graph in which each edge is a non-zero entry in the matrix, to allocate groups of vertices to processors in such a way that much of the matrix-vector multiplication can be performed locally on each processor, thereby minimizing communication. Furthermore, a good graph partitioning scheme ensures that an equal amount of computation is performed on each processor. Graph partitioning is a well known NP-complete problem, and thus the most commonly used graph partitioning algorithms employ some form of heuristics. These algorithms vary in terms of their complexity, partition generation time, and the quality of partitions, and they tend to trade off these factors. A significant challenge we are currently facing at the Lawrence Livermore National Laboratory is how to partition very large meshes on massive-size distributed memory machines like IBM BlueGene/P, where scalability becomes a big issue. For example, we have found that ParMetis, a very popular graph partitioning tool, can only scale to 16K processors. An ideal graph partitioning method in such an environment should be fast and scale to very large meshes, while producing high quality partitions. This is an extremely challenging task, as to scale to that level, the partitioning algorithm should be simple and be able to produce partitions that minimize inter-processor communications and balance the load imposed on the processors. Our goals in this work are two-fold: (1) To develop a new scalable graph partitioning method with good load balancing and communication reduction capability. (2) To study the performance of the proposed partitioning method on very large parallel machines using actual data sets and compare the performance to that of existing methods. The proposed method achieves the desired scalability by reducing the mesh size.
For this, it coarsens an input mesh into a smaller size mesh by coalescing the vertices and edges of the original mesh into a set of mega-vertices and mega-edges. A new coarsening method called brick algorithm is developed in this research. In the brick algorithm, the zones in a given mesh are first grouped into fixed size blocks called bricks. These bricks are then laid in a way similar to conventional brick laying technique, which reduces the number of neighboring blocks each block needs to communicate with. Contributions of this research are as follows: (1) We have developed a novel method that scales to a really large problem size while producing high quality mesh partitions; (2) We measured the performance and scalability of the proposed method on a machine of massive size using a set of actual large complex data sets, where we have scaled to a mesh with 110 million zones using our method. To the best of our knowledge, this is the largest complex mesh that a partitioning method is successfully applied to; and (3) We have shown that the proposed method can reduce the number of edge cuts by as much as 65%.
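The edge-cut metric and the staggered brick layout described in this record can be sketched as follows. This is a minimal illustration on a hypothetical 2D grid mesh, not the authors' implementation: the half-brick shift on odd brick rows, the function names, and the grid size are all assumptions, and each brick is treated as a part here only to exercise the edge-cut count (in the paper, bricks are the coarse vertices handed to a partitioner).

```python
def brick_of(x, y, brick_w, brick_h):
    """Map grid node (x, y) to a brick id; odd brick rows are shifted by half a brick."""
    row = y // brick_h
    shift = (brick_w // 2) if row % 2 else 0
    col = (x + shift) // brick_w
    return (row, col)

def grid_edges(width, height):
    """Yield the edges of a 4-connected width x height grid mesh."""
    for y in range(height):
        for x in range(width):
            if x + 1 < width:
                yield (x, y), (x + 1, y)
            if y + 1 < height:
                yield (x, y), (x, y + 1)

def edge_cut(edges, part_of):
    """Count edges whose endpoints fall in different parts."""
    return sum(1 for u, v in edges if part_of(u) != part_of(v))

def part(node):
    """Hypothetical assignment: 4 x 2 bricks on the toy grid below."""
    return brick_of(node[0], node[1], brick_w=4, brick_h=2)

width, height = 8, 4
bricks = {part((x, y)) for y in range(height) for x in range(width)}
cut = edge_cut(grid_edges(width, height), part)
print(len(bricks), "bricks, edge cut =", cut)
```

A k-way partitioner would aim to keep part sizes balanced while making the value returned by `edge_cut` as small as possible.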
Random walks on activity-driven networks with attractiveness

NASA Astrophysics Data System (ADS)

Alessandretti, Laura; Sun, Kaiyuan; Baronchelli, Andrea; Perra, Nicola

2017-05-01

Virtually all real-world networks are dynamical entities. In social networks, the propensity of nodes to engage in social interactions (activity) and their chances to be selected by active nodes (attractiveness) are heterogeneously distributed. Here, we present a time-varying network model where each node and the dynamical formation of ties are characterized by these two features. We study how these properties affect random-walk processes unfolding on the network when the time scales describing the process and the network evolution are comparable. We derive analytical solutions for the stationary state and the mean first-passage time of the process, and we study cases informed by empirical observations of social networks. Our work shows that previously disregarded properties of real social systems, such as heterogeneous distributions of activity and attractiveness as well as the correlations between them, substantially affect the dynamical process unfolding on the network.
Delay and cost performance analysis of the Diffie-Hellman key exchange protocol in opportunistic mobile networks

NASA Astrophysics Data System (ADS)

Soelistijanto, B.; Muliadi, V.

2018-03-01

Diffie-Hellman (DH) provides an efficient key exchange system by reducing the number of cryptographic keys distributed in the network. In this method, a node broadcasts a single public key to all nodes in the network, and in turn each peer uses this key to establish a shared secret key which can then be used to encrypt and decrypt traffic between the peer and the given node. In this paper, we evaluate the key transfer delay and cost performance of DH in opportunistic mobile networks, a specific scenario of MANETs where complete end-to-end paths rarely exist between sources and destinations; consequently, the end-to-end delays in these networks are much greater than in typical MANETs. Simulation results, driven by a random node movement model and real human mobility traces, showed that DH outperforms a typical key distribution scheme based on the RSA algorithm in terms of key transfer delay, measured by average key convergence time; however, DH performs as well as the benchmark in terms of key transfer cost, evaluated by total key (copies) forwards.
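The key agreement step summarized above can be illustrated with a toy Diffie-Hellman exchange. The parameters below are deliberately tiny textbook values chosen for readability, and the variable and function names are hypothetical; nothing here reflects the parameters or the simulation setup used in the paper.

```python
# Toy public parameters (far too small for real use; chosen for readability).
P = 23   # prime modulus
G = 5    # generator

def public_key(private_key):
    """Public value g^a mod p that a node can broadcast."""
    return pow(G, private_key, P)

def shared_secret(their_public, my_private):
    """Secret derived from the other side's public value and one's own private key."""
    return pow(their_public, my_private, P)

node_private, peer_private = 6, 15          # hypothetical private keys
node_public = public_key(node_private)      # broadcast by the node
peer_public = public_key(peer_private)      # returned by the peer

node_secret = shared_secret(peer_public, node_private)
peer_secret = shared_secret(node_public, peer_private)
print(node_secret == peer_secret)
```

Both sides arrive at the same secret without ever transmitting it, which is why only public values need to be forwarded through the network.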
A convex optimization method for self-organization in dynamic (FSO/RF) wireless networks

NASA Astrophysics Data System (ADS)

Llorca, Jaime; Davis, Christopher C.; Milner, Stuart D.

2008-08-01

Next generation communication networks are becoming increasingly complex systems. Previously, we presented a novel physics-based approach to model dynamic wireless networks as physical systems which react to local forces exerted on network nodes. We showed that under clear atmospheric conditions the network communication energy can be modeled as the potential energy of an analogous spring system and presented a distributed mobility control algorithm where nodes react to local forces driving the network to energy minimizing configurations. This paper extends our previous work by including the effects of atmospheric attenuation and transmitted power constraints in the optimization problem. We show how our new formulation still results in a convex energy minimization problem. Accordingly, an updated force-driven mobility control algorithm is presented. Forces on mobile backbone nodes are computed as the negative gradient of the new energy function. Results show how in the presence of atmospheric obscuration stronger forces are exerted on network nodes that make them move closer to each other, avoiding loss of connectivity. We show results in terms of network coverage and backbone connectivity and compare the developed algorithms for different scenarios.
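The idea of computing forces as the negative gradient of an energy function can be sketched with a plain quadratic spring energy. This is an assumption-laden illustration: it uses the clear-atmosphere spring analogy only, omits the attenuation and power constraints that the paper adds, and all names, positions, and constants are hypothetical.

```python
def spring_energy(positions, links):
    """Total energy U = sum over links of 0.5 * k * |x_i - x_j|^2."""
    total = 0.0
    for i, j, k in links:
        dx = positions[i][0] - positions[j][0]
        dy = positions[i][1] - positions[j][1]
        total += 0.5 * k * (dx * dx + dy * dy)
    return total

def force_on(node, positions, links):
    """Force on `node`: the negative gradient of spring_energy w.r.t. its position."""
    fx = fy = 0.0
    for i, j, k in links:
        if node == i:
            other = j
        elif node == j:
            other = i
        else:
            continue
        fx += k * (positions[other][0] - positions[node][0])
        fy += k * (positions[other][1] - positions[node][1])
    return fx, fy

# Two fixed terminals "a" and "b" and one mobile backbone node "m" (toy values).
positions = {"a": (0.0, 0.0), "b": (10.0, 0.0), "m": (2.0, 3.0)}
links = [("m", "a", 1.0), ("m", "b", 1.0)]   # (node, node, spring constant)
step = 0.1                                   # assumed gradient-descent step size
for _ in range(200):
    fx, fy = force_on("m", positions, links)
    x, y = positions["m"]
    positions["m"] = (x + step * fx, y + step * fy)
print(positions["m"], spring_energy(positions, links))
```

Because this energy is convex in the mobile node's position, repeatedly stepping along the force drives the node to the unique minimum, here the midpoint between the two terminals.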
Strategic parameter-driven routing models for multidestination traffic in telecommunication networks.

PubMed

Lee, Y; Tien, J M

2001-01-01

We present mathematical models that determine the optimal parameters for strategically routing multidestination traffic in an end-to-end network setting. Multidestination traffic refers to a traffic type that can be routed to any one of a multiple number of destinations. A growing number of communication services is based on multidestination routing. In this parameter-driven approach, a multidestination call is routed to one of the candidate destination nodes in accordance with predetermined decision parameters associated with each candidate node. We present three different approaches: (1) a link utilization (LU) approach, (2) a network cost (NC) approach, and (3) a combined parametric (CP) approach. The LU approach provides the solution that would result in an optimally balanced link utilization, whereas the NC approach provides the least expensive way to route traffic to destinations. The CP approach, on the other hand, provides multiple solutions that help leverage link utilization and cost. The LU approach has in fact been implemented by a long distance carrier resulting in a considerable efficiency improvement in its international direct services, as summarized.

Data-Driven Packet Loss Estimation for Node Healthy Sensing in Decentralized Cluster

PubMed Central

Fan, Hangyu; Wang, Huandong; Li, Yong

2018-01-01

Decentralized clustering has been widely adopted in many fields of modern information technology in recent years. One of the main reasons is its high availability and failure tolerance, which prevent a single point of failure from bringing down the entire system. Recently, toolkits such as Akka have come into common use for building such clusters easily. However, clusters of this kind, which use Gossip as their membership management protocol and a link failure detection mechanism to detect link failures, cannot deal with the scenario in which a node stochastically drops packets and corrupts the member status of the cluster. In this paper, we formulate the problem as evaluating the link quality and finding a maximum clique (NP-complete) in the connectivity graph. We then propose an algorithm that consists of two models, driven by data from the application layer, to solve these two problems respectively. Through simulations with statistical data and a real-world product, we demonstrate that our algorithm performs well. PMID:29360792
Demand-driven energy requirement of world economy 2007: A multi-region input-output network simulation

NASA Astrophysics Data System (ADS)

Chen, Zhan-Ming; Chen, G. Q.

2013-07-01

This study presents a network simulation of the global embodied energy flows in 2007 based on a multi-region input-output model. The world economy is portrayed as a 6384-node network and the energy interactions between any two nodes are calculated and analyzed. According to the results, about 70% of the world's direct energy input is invested in resource, heavy manufacture, and transportation sectors which provide only 30% of the embodied energy to satisfy final demand. By contrast, non-transportation services sectors contribute to 24% of the world's demand-driven energy requirement with only 6% of the direct energy input. Commodity trade is shown to be an important alternative to fuel trade in redistributing energy, as international commodity flows embody 1.74E+20 J of energy in magnitude up to 89% of the traded fuels. China is the largest embodied energy exporter with a net export of 3.26E+19 J, in contrast to the United States as the largest importer with a net import of 2.50E+19 J. The recent economic fluctuations following the financial crisis accelerate the relative expansions of energy requirement by developing countries; as a consequence, China will take over the place of the United States as the world's top demand-driven energy consumer in 2022 and India will become the third largest in 2015.
Modeling Temporal Behavior in Large Networks: A Dynamic Mixed-Membership Model

DOE Office of Scientific and Technical Information (OSTI.GOV)

Rossi, R; Gallagher, B; Neville, J

Given a large time-evolving network, how can we model and characterize the temporal behaviors of individual nodes (and network states)? How can we model the behavioral transition patterns of nodes? We propose a temporal behavior model that captures the 'roles' of nodes in the graph and how they evolve over time. The proposed dynamic behavioral mixed-membership model (DBMM) is scalable, fully automatic (no user-defined parameters), non-parametric/data-driven (no specific functional form or parameterization), interpretable (identifies explainable patterns), and flexible (applicable to dynamic and streaming networks). Moreover, the interpretable behavioral roles are generalizable, computationally efficient, and natively support attributes. We applied our model for (a) identifying patterns and trends of nodes and network states based on the temporal behavior, (b) predicting future structural changes, and (c) detecting unusual temporal behavior transitions. We use eight large real-world datasets from different time-evolving settings (dynamic and streaming). In particular, we model the evolving mixed-memberships and the corresponding behavioral transitions of Twitter, Facebook, IP-Traces, Email (University), Internet AS, Enron, Reality, and IMDB. The experiments demonstrate the scalability, flexibility, and effectiveness of our model for identifying interesting patterns, detecting unusual structural transitions, and predicting the future structural changes of the network and individual nodes.
Delay-induced cluster patterns in coupled Cayley tree networks

NASA Astrophysics Data System (ADS)

Singh, A.; Jalan, S.

2013-07-01

We study the effects of delay in diffusively coupled logistic maps on Cayley tree networks. We find that smaller coupling values exhibit sensitivity to the value of the delay and lead to different cluster patterns of self-organized and driven types, whereas larger coupling strengths exhibit robustness against changes in delay values and lead to stable driven clusters comprising nodes from the last generation of the Cayley tree. Furthermore, the introduction of delay leads to suppression as well as enhancement of synchronization, depending upon the coupling strength. Finally, we discuss the importance of the results for understanding the conflicts and cooperation observed in family businesses.
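A simulation of this kind can be sketched as below. The update rule x_i(t+1) = (1 - eps) f(x_i(t)) + (eps / k_i) * sum_j f(x_j(t - tau)) is a commonly used form for delayed, diffusively coupled maps and is an assumption here, since the abstract does not state the equation; the tree construction, the parameter names, and the choice mu = 4 are likewise illustrative.

```python
import random

def cayley_tree(branching, generations):
    """Adjacency lists of a tree whose root has `branching` children and whose
    other internal nodes have `branching - 1` children each."""
    adj = {0: []}
    frontier = [0]
    for _ in range(generations):
        next_frontier = []
        for parent in frontier:
            n_children = branching if parent == 0 else branching - 1
            for _ in range(n_children):
                child = len(adj)
                adj[child] = [parent]
                adj[parent].append(child)
                next_frontier.append(child)
        frontier = next_frontier
    return adj

def logistic(x, mu=4.0):
    """Logistic map f(x) = mu * x * (1 - x)."""
    return mu * x * (1.0 - x)

def simulate(adj, eps, tau, steps, seed=0):
    """Iterate the assumed delayed coupling rule and return the final state."""
    rng = random.Random(seed)
    n = len(adj)
    # history[-1] is the current state; history[-1 - tau] is the delayed one.
    history = [[rng.random() for _ in range(n)] for _ in range(tau + 1)]
    for _ in range(steps):
        current, delayed = history[-1], history[-1 - tau]
        nxt = [
            (1 - eps) * logistic(current[i])
            + eps * sum(logistic(delayed[j]) for j in adj[i]) / len(adj[i])
            for i in range(n)
        ]
        history.append(nxt)
        history.pop(0)
    return history[-1]

tree = cayley_tree(branching=3, generations=2)
state = simulate(tree, eps=0.2, tau=1, steps=500)
print(len(tree), "nodes; final state of the root:", state[0])
```

Cluster patterns would then be identified by grouping nodes whose trajectories coincide, a step this sketch leaves out.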
Efficient Ada multitasking on a RISC register window architecture

NASA Technical Reports Server (NTRS)

Kearns, J. P.; Quammen, D.

1987-01-01

This work addresses the problem of reducing context switch overhead on a processor which supports a large register file - a register file much like that which is part of the Berkeley RISC processors and several other emerging architectures (which are not necessarily reduced instruction set machines in the purest sense). Such a reduction in overhead is particularly desirable in a real-time embedded application, in which task-to-task context switch overhead may result in failure to meet crucial deadlines. A storage management technique by which a context switch may be implemented as cheaply as a procedure call is presented. The essence of this technique is the avoidance of the save/restore of registers on the context switch. This is achieved through analysis of the static source text of an Ada tasking program. Information gained during that analysis directs the optimized storage management strategy for that program at run time. A formal verification of the technique in terms of an operational control model and an evaluation of the technique's performance via simulations driven by synthetic Ada program traces are presented.
Development of an embedded atmospheric turbulence mitigation engine

NASA Astrophysics Data System (ADS)

Paolini, Aaron; Bonnett, James; Kozacik, Stephen; Kelmelis, Eric

2017-05-01

Methods to reconstruct pictures from imagery degraded by atmospheric turbulence have been under development for decades. The techniques were initially developed for observing astronomical phenomena from the Earth's surface, but have more recently been modified for ground and air surveillance scenarios. Such applications can impose significant constraints on deployment options because they both increase the computational complexity of the algorithms themselves and often dictate a requirement for low size, weight, and power (SWaP) form factors. Consequently, embedded implementations must be developed that can perform the necessary computations on low-SWaP platforms. Fortunately, there is an emerging class of embedded processors driven by the mobile and ubiquitous computing industries. We have leveraged these processors to develop embedded versions of the core atmospheric correction engine found in our ATCOM software. In this paper, we will present our experience adapting our algorithms for embedded systems on a chip (SoCs), namely the NVIDIA Tegra that couples general-purpose ARM cores with their graphics processing unit (GPU) technology and the Xilinx Zynq which pairs similar ARM cores with their field-programmable gate array (FPGA) fabric.
Strategies for concurrent processing of complex algorithms in data driven architectures

NASA Technical Reports Server (NTRS)

Stoughton, John W.; Mielke, Roland R.

1988-01-01

The purpose is to document research to develop strategies for concurrent processing of complex algorithms in data driven architectures. The problem domain consists of decision-free algorithms having large-grained, computationally complex primitive operations. Such are often found in signal processing and control applications. The anticipated multiprocessor environment is a data flow architecture containing between two and twenty computing elements. Each computing element is a processor having local program memory, and which communicates with a common global data memory. A new graph theoretic model called ATAMM which establishes rules for relating a decomposed algorithm to its execution in a data flow architecture is presented. The ATAMM model is used to determine strategies to achieve optimum time performance and to develop a system diagnostic software tool. In addition, preliminary work on a new multiprocessor operating system based on the ATAMM specifications is described.
Distributed processor allocation for launching applications in a massively connected processors complex

DOEpatents

Pedretti, Kevin

2008-11-18

A compute processor allocator architecture for allocating compute processors to run applications in a multiple processor computing apparatus is distributed among a subset of processors within the computing apparatus. Each processor of the subset includes a compute processor allocator. The compute processor allocators can share a common database of information pertinent to compute processor allocation. A communication path permits retrieval of information from the database independently of the compute processor allocators.
Preliminary user's manuals for DYNA3D and DYNAP. [In FORTRAN IV for CDC 7600 and Cray-1]

DOE Office of Scientific and Technical Information (OSTI.GOV)

Hallquist, J. O.

1979-10-01

This report provides a user's manual for DYNA3D, an explicit three-dimensional finite-element code for analyzing the large deformation dynamic response of inelastic solids. A contact-impact algorithm permits gaps and sliding along material interfaces. By a specialization of this algorithm, such interfaces can be rigidly tied to admit variable zoning without the need of transition regions. Spatial discretization is achieved by the use of 8-node solid elements, and the equations of motion are integrated by the central difference method. Post-processors for DYNA3D include GRAPE for plotting deformed shapes and stress contours and DYNAP for plotting time histories. A user's manual for DYNAP is also provided. 23 figures.
Parallelization of the Flow Field Dependent Variation Scheme for Solving the Triple Shock/Boundary Layer Interaction Problem

NASA Technical Reports Server (NTRS)

Schunk, Richard Gregory; Chung, T. J.

2001-01-01

A parallelized version of the Flowfield Dependent Variation (FDV) Method is developed to analyze a problem of current research interest, the flowfield resulting from a triple shock/boundary layer interaction. Such flowfields are often encountered in the inlets of high speed air-breathing vehicles including the NASA Hyper-X research vehicle. In order to resolve the complex shock structure and to provide adequate resolution for boundary layer computations of the convective heat transfer from surfaces inside the inlet, models containing over 500,000 nodes are needed. Efficient parallelization of the computation is essential to achieving results in a timely manner. Results from a parallelization scheme, based upon multi-threading, as implemented on multiple processor supercomputers and workstations are presented.
Large Scale GW Calculations on the Cori System

NASA Astrophysics Data System (ADS)

Deslippe, Jack; Del Ben, Mauro; da Jornada, Felipe; Canning, Andrew; Louie, Steven

The NERSC Cori system, powered by 9000+ Intel Xeon-Phi processors, represents one of the largest HPC systems for open-science in the United States and the world. We discuss the optimization of the GW methodology for this system, including both node level and system-scale optimizations. We highlight multiple large scale (thousands of atoms) case studies and discuss both absolute application performance and comparison to calculations on more traditional HPC architectures. We find that the GW method is particularly well suited for many-core architectures due to the ability to exploit a large amount of parallelism across many layers of the system. This work was supported by the U.S. Department of Energy, Office of Science, Basic Energy Sciences, Materials Sciences and Engineering Division, as part of the Computational Materials Sciences Program.
Megatux

DOE Office of Scientific and Technical Information (OSTI.GOV)

2012-09-25

The Megatux platform enables the emulation of large scale (multi-million node) distributed systems. In particular, it allows for the emulation of large-scale networks interconnecting a very large number of emulated computer systems. It does this by leveraging virtualization and associated technologies to allow hundreds of virtual computers to be hosted on a single moderately sized server or workstation. Virtualization technology provided by modern processors allows for multiple guest OSs to run at the same time, sharing the hardware resources. The Megatux platform can be deployed on a single PC, a small cluster of a few boxes or a large cluster of computers. With a modest cluster, the Megatux platform can emulate complex organizational networks. By using virtualization, we emulate the hardware, but run actual software enabling large scale without sacrificing fidelity.
Merlin - Massively parallel heterogeneous computing

NASA Technical Reports Server (NTRS)

Wittie, Larry; Maples, Creve

1989-01-01

Hardware and software for Merlin, a new kind of massively parallel computing system, are described. Eight computers are linked as a 300-MIPS prototype to develop system software for a larger Merlin network with 16 to 64 nodes, totaling 600 to 3000 MIPS. These working prototypes help refine a mapped reflective memory technique that offers a new, very general way of linking many types of computer to form supercomputers. Processors share data selectively and rapidly on a word-by-word basis. Fast firmware virtual circuits are reconfigured to match topological needs of individual application programs. Merlin's low-latency memory-sharing interfaces solve many problems in the design of high-performance computing systems. The Merlin prototypes are intended to run parallel programs for scientific applications and to determine hardware and software needs for a future Teraflops Merlin network.
Three-dimensional object surface identification

NASA Astrophysics Data System (ADS)

Celenk, Mehmet

1995-03-01

This paper describes a computationally efficient matching method for inspecting 3D objects using their serial cross sections. Object regions of interest in cross-sectional binary images of successive slices are aligned with those of the models. Cross-sectional differences between the object and the models are measured in the direction of the gradient of the cross section boundary. This is repeated in all the cross-sectional images. The model with minimum average cross-sectional difference is selected as the best match to the given object (i.e., no defect). The method is tested using various computer generated surfaces and matching results are presented. It is also demonstrated using a Symult S-2010 16-node system that the method is suitable for parallel implementation in message-passing processors with the maximum attainable speedup (close to 16 for S-2010).
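The selection step described in this abstract can be illustrated with a minimal Python sketch. It assumes the slices are already aligned and uses a plain pixel-mismatch count as the cross-sectional difference, which is only a stand-in for the gradient-direction measure of the paper. The function names and the toy data are hypothetical.

```python
def slice_difference(obj_slice, model_slice):
    """Count mismatching pixels between two aligned binary cross sections.

    Assumption: a simple mismatch count replaces the paper's measure, which
    is taken along the gradient of the cross-section boundary.
    """
    return sum(
        o != m
        for obj_row, model_row in zip(obj_slice, model_slice)
        for o, m in zip(obj_row, model_row)
    )


def best_match(obj_slices, models):
    """Return the name of the model with minimum average slice difference."""
    def average_difference(model_slices):
        diffs = [slice_difference(o, m) for o, m in zip(obj_slices, model_slices)]
        return sum(diffs) / len(diffs)

    return min(models, key=lambda name: average_difference(models[name]))


# Toy data: two 2x2 binary slices per object (hypothetical, illustration only).
scanned = [[[1, 1], [0, 0]], [[1, 0], [0, 0]]]
models = {
    "model_a": [[[1, 1], [0, 0]], [[1, 0], [0, 1]]],  # one mismatching pixel
    "model_b": [[[0, 0], [1, 1]], [[0, 1], [1, 1]]],  # every pixel differs
}
print(best_match(scanned, models))
```

Because each slice pair is scored independently, the per-slice work could be distributed across nodes, which is consistent with the near-linear speedup the abstract reports.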
A multitasking behavioral control system for the Robotic All-Terrain Lunar Exploration Rover (RATLER)

NASA Technical Reports Server (NTRS)

Klarer, Paul

1993-01-01

An approach for a robotic control system which implements so-called 'behavioral' control within a realtime multitasking architecture is proposed. The proposed system would attempt to ameliorate some of the problems noted by some researchers when implementing subsumptive or behavioral control systems, particularly with regard to multiple processor systems and realtime operations. The architecture is designed to allow synchronous operations between various behavior modules by taking advantage of a realtime multitasking system's intertask communications channels, and by implementing each behavior module and each interconnection node as a stand-alone task. The potential advantages of this approach over those previously described in the field are discussed. An implementation of the architecture is planned for a prototype Robotic All Terrain Lunar Exploration Rover (RATLER) currently under development and is briefly described.
Graphics Processing Unit Acceleration of Gyrokinetic Turbulence Simulations

NASA Astrophysics Data System (ADS)

Hause, Benjamin; Parker, Scott

2012-10-01

We find a substantial increase in on-node performance using Graphics Processing Unit (GPU) acceleration in gyrokinetic delta-f particle-in-cell simulation. Optimization is performed on a two-dimensional slab gyrokinetic particle simulation using the Portland Group Fortran compiler with the GPU accelerator compiler directives. We have implemented the GPU acceleration on a Core i7 gaming PC with an NVIDIA GTX 580 GPU. We find comparable, or better, acceleration relative to the NERSC DIRAC cluster with the NVIDIA Tesla C2050 computing processor. The Tesla C2050 is about 2.6 times more expensive than the GTX 580 gaming GPU. Optimization strategies and comparisons between DIRAC and the gaming PC will be presented. We will also discuss progress on optimizing the comprehensive three dimensional general geometry GEM code.
A Parallel Saturation Algorithm on Shared Memory Architectures

NASA Technical Reports Server (NTRS)

Ezekiel, Jonathan; Siminiceanu

2007-01-01

Symbolic state-space generators are notoriously hard to parallelize. However, the Saturation algorithm implemented in the SMART verification tool differs from other sequential symbolic state-space generators in that it exploits the locality of firing events in asynchronous system models. This paper explores whether event locality can be utilized to efficiently parallelize Saturation on shared-memory architectures. Conceptually, we propose to parallelize the firing of events within a decision diagram node, which is technically realized via a thread pool. We discuss the challenges involved in our parallel design and conduct experimental studies on its prototypical implementation. On a dual-processor dual core PC, our studies show speed-ups for several example models, e.g., of up to 50% for a Kanban model, when compared to running our algorithm only on a single core.
Plasma Physics Calculations on a Parallel Macintosh Cluster

NASA Astrophysics Data System (ADS)

Decyk, Viktor; Dauger, Dean; Kokelaar, Pieter

2000-03-01

We have constructed a parallel cluster consisting of 16 Apple Macintosh G3 computers running the MacOS, and achieved very good performance on numerically intensive, parallel plasma particle-in-cell simulations. A subset of the MPI message-passing library was implemented in Fortran77 and C. This library enabled us to port code, without modification, from other parallel processors to the Macintosh cluster. For large problems where message packets are large and relatively few in number, performance of 50-150 MFlops/node is possible, depending on the problem. This is fast enough that 3D calculations can be routinely done. Unlike Unix-based clusters, no special expertise in operating systems is required to build and run the cluster. Full details are available on our web site: http://exodus.physics.ucla.edu/appleseed/.
Plasma Physics Calculations on a Parallel Macintosh Cluster

NASA Astrophysics Data System (ADS)

Decyk, Viktor K.; Dauger, Dean E.; Kokelaar, Pieter R.

We have constructed a parallel cluster consisting of 16 Apple Macintosh G3 computers running the MacOS, and achieved very good performance on numerically intensive, parallel plasma particle-in-cell simulations. A subset of the MPI message-passing library was implemented in Fortran77 and C. This library enabled us to port code, without modification, from other parallel processors to the Macintosh cluster. For large problems where message packets are large and relatively few in number, performance of 50-150 Mflops/node is possible, depending on the problem. This is fast enough that 3D calculations can be routinely done. Unlike Unix-based clusters, no special expertise in operating systems is required to build and run the cluster. Full details are available on our web site: http://exodus.physics.ucla.edu/appleseed/.
A Spiking Neural Network Model of the Lateral Geniculate Nucleus on the SpiNNaker Machine

PubMed Central

Sen-Bhattacharya, Basabdatta; Serrano-Gotarredona, Teresa; Balassa, Lorinc; Bhattacharya, Akash; Stokes, Alan B.; Rowley, Andrew; Sugiarto, Indar; Furber, Steve

2017-01-01

We present a spiking neural network model of the thalamic Lateral Geniculate Nucleus (LGN) developed on SpiNNaker, which is a state-of-the-art digital neuromorphic hardware built with very-low-power ARM processors. The parallel, event-based data processing in SpiNNaker makes it viable for building massively parallel neuro-computational frameworks. The LGN model has 140 neurons representing a “basic building block” for larger modular architectures. The motivation of this work is to simulate biologically plausible LGN dynamics on SpiNNaker. Synaptic layout of the model is consistent with biology. The model response is validated with existing literature reporting entrainment in steady state visually evoked potentials (SSVEP), i.e., brain oscillations corresponding to periodic visual stimuli recorded via electroencephalography (EEG). Periodic stimulus to the model is provided by: a synthetic spike-train with inter-spike-intervals in the range 10–50 Hz at a resolution of 1 Hz; and spike-train output from a state-of-the-art electronic retina subjected to a light emitting diode flashing at 10, 20, and 40 Hz, simulating real-world visual stimulus to the model. The resolution of simulation is 0.1 ms to ensure solution accuracy for the underlying differential equations defining Izhikevich's neuron model. Under this constraint, 1 s of model simulation time is executed in 10 s real time on SpiNNaker; this is because simulations on SpiNNaker work in real time for time-steps dt ⩾ 1 ms. The model output shows entrainment with both sets of input and contains harmonic components of the fundamental frequency. However, suppressing the feed-forward inhibition in the circuit produces subharmonics within the gamma band (>30 Hz) implying a reduced information transmission fidelity. These model predictions agree with recent lumped-parameter computational model-based predictions, using conventional computers.
Scalability of the framework is demonstrated by a multi-node architecture consisting of three “nodes,” where each node is the “basic building block” LGN model. This 420 neuron model is tested with synthetic periodic stimulus at 10 Hz to all the nodes. The model output is the average of the outputs from all nodes, and conforms to the above-mentioned predictions of each node. Power consumption for model simulation on SpiNNaker is ≪1 W. PMID:28848380
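As an illustration of the time-step constraint mentioned in this abstract, the following minimal Python sketch integrates a single Izhikevich neuron with a 0.1 ms step and computes the resulting slowdown relative to a 1 ms real-time step. The equations and the regular-spiking parameters (a, b, c, d) are the commonly published ones for this neuron model and are not taken from the abstract. The input current of 10 and all function names are hypothetical, and this is ordinary single-neuron Python, not SpiNNaker code.

```python
def simulate_izhikevich(current, duration_ms=1000.0, dt_ms=0.1,
                        a=0.02, b=0.2, c=-65.0, d=8.0):
    """Forward-Euler integration of one Izhikevich neuron; returns spike count."""
    v = c          # membrane potential (mV)
    u = b * v      # recovery variable
    spikes = 0
    steps = int(round(duration_ms / dt_ms))
    for _ in range(steps):
        dv = 0.04 * v * v + 5.0 * v + 140.0 - u + current
        du = a * (b * v - u)
        v += dt_ms * dv
        u += dt_ms * du
        if v >= 30.0:      # spike: reset potential, bump recovery variable
            v = c
            u += d
            spikes += 1
    return spikes


# Slowdown if the hardware runs in real time only for steps of at least 1 ms.
real_time_dt_ms = 1.0
dt_ms = 0.1
slowdown = real_time_dt_ms / dt_ms   # ten 0.1 ms steps per real-time 1 ms tick

print(simulate_izhikevich(current=10.0), slowdown)
```

The slowdown factor of 10 matches the abstract's statement that 1 s of model time takes 10 s of real time at a 0.1 ms resolution.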

A Spiking Neural Network Model of the Lateral Geniculate Nucleus on the SpiNNaker Machine.

PubMed

Sen-Bhattacharya, Basabdatta; Serrano-Gotarredona, Teresa; Balassa, Lorinc; Bhattacharya, Akash; Stokes, Alan B; Rowley, Andrew; Sugiarto, Indar; Furber, Steve

2017-01-01

We present a spiking neural network model of the thalamic Lateral Geniculate Nucleus (LGN) developed on SpiNNaker, which is a state-of-the-art digital neuromorphic hardware built with very-low-power ARM processors. The parallel, event-based data processing in SpiNNaker makes it viable for building massively parallel neuro-computational frameworks. The LGN model has 140 neurons representing a "basic building block" for larger modular architectures. The motivation of this work is to simulate biologically plausible LGN dynamics on SpiNNaker. Synaptic layout of the model is consistent with biology. The model response is validated with existing literature reporting entrainment in steady state visually evoked potentials (SSVEP), i.e., brain oscillations corresponding to periodic visual stimuli recorded via electroencephalography (EEG). Periodic stimulus to the model is provided by: a synthetic spike-train with inter-spike-intervals in the range 10-50 Hz at a resolution of 1 Hz; and spike-train output from a state-of-the-art electronic retina subjected to a light emitting diode flashing at 10, 20, and 40 Hz, simulating real-world visual stimulus to the model. The resolution of simulation is 0.1 ms to ensure solution accuracy for the underlying differential equations defining Izhikevich's neuron model. Under this constraint, 1 s of model simulation time is executed in 10 s real time on SpiNNaker; this is because simulations on SpiNNaker work in real time for time-steps dt ⩾ 1 ms. The model output shows entrainment with both sets of input and contains harmonic components of the fundamental frequency. However, suppressing the feed-forward inhibition in the circuit produces subharmonics within the gamma band (>30 Hz) implying a reduced information transmission fidelity. These model predictions agree with recent lumped-parameter computational model-based predictions, using conventional computers.
Scalability of the framework is demonstrated by a multi-node architecture consisting of three "nodes," where each node is the "basic building block" LGN model. This 420 neuron model is tested with synthetic periodic stimulus at 10 Hz to all the nodes. The model output is the average of the outputs from all nodes, and conforms to the above-mentioned predictions of each node. Power consumption for model simulation on SpiNNaker is ≪1 W.
Generic Divide and Conquer Internet-Based Computing

NASA Technical Reports Server (NTRS)

Radenski, Atanas; Follen, Gregory J. (Technical Monitor)

2001-01-01

The rapid growth of internet-based applications and the proliferation of networking technologies have been transforming traditional commercial application areas as well as computer and computational sciences and engineering. This growth stimulates the exploration of new, internet-oriented software technologies that can open new research and application opportunities not only for the commercial world, but also for the scientific and high-performance computing applications community. The general goal of this research project is to contribute to better understanding of the transition to internet-based high-performance computing and to develop solutions for some of the difficulties of this transition. More specifically, our goal is to design an architecture for generic divide and conquer internet-based computing, to develop a portable implementation of this architecture, to create an example library of high-performance divide-and-conquer computing agents that run on top of this architecture, and to evaluate the performance of these agents. We have been designing an architecture that incorporates a master task-pool server and utilizes satellite computational servers that operate on the Internet in a dynamically changing large configuration of lower-end nodes provided by volunteer contributors. Our designed architecture is intended to be complementary to and accessible from computational grids such as Globus, Legion, and Condor. Grids provide remote access to existing high-end computing resources; in contrast, our goal is to utilize idle processor time of lower-end internet nodes. Our project is focused on a generic divide-and-conquer paradigm and its applications that operate on a loose and ever changing pool of lower-end internet nodes.
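The master task-pool idea in this abstract can be sketched, under strong simplifying assumptions, with Python's standard thread pool. Here a local pool of worker threads stands in for the satellite computational servers, the problem is a simple sum, and nothing is modeled of networking or of volunteer nodes joining and leaving. All function names are hypothetical and do not come from the project.

```python
from concurrent.futures import ThreadPoolExecutor


def divide(values, leaf_size):
    """Split the problem into independent leaf tasks for the task pool."""
    return [values[i:i + leaf_size] for i in range(0, len(values), leaf_size)]


def conquer(chunk):
    """Solve one leaf task; here, a partial sum."""
    return sum(chunk)


def combine(partials):
    """Merge the leaf results into the final answer."""
    return sum(partials)


def divide_and_conquer_sum(values, leaf_size=100, workers=4):
    """Master: fill the task pool, let workers drain it, combine the results."""
    tasks = divide(list(values), leaf_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(conquer, tasks))
    return combine(partials)


print(divide_and_conquer_sum(range(1000)))
```

Keeping `divide`, `conquer`, and `combine` as separate functions is what makes the scheme generic: a different problem only replaces those three pieces, while the pool handling stays the same.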
MSTor: A program for calculating partition functions, free energies, enthalpies, entropies, and heat capacities of complex molecules including torsional anharmonicity

NASA Astrophysics Data System (ADS)

Zheng, Jingjing; Mielke, Steven L.; Clarkson, Kenneth L.; Truhlar, Donald G.

2012-08-01

We present a Fortran program package, MSTor, which calculates partition functions and thermodynamic functions of complex molecules involving multiple torsional motions by the recently proposed MS-T method. This method interpolates between the local harmonic approximation in the low-temperature limit, and the limit of free internal rotation of all torsions at high temperature. The program can also carry out calculations in the multiple-structure local harmonic approximation. The program package also includes six utility codes that can be used as stand-alone programs to calculate reduced moment of inertia matrices by the method of Kilpatrick and Pitzer, to generate conformational structures, to calculate, either analytically or by Monte Carlo sampling, volumes for torsional subdomains defined by Voronoi tessellation of the conformational subspace, to generate template input files, and to calculate one-dimensional torsional partition functions using the torsional eigenvalue summation method. Catalogue identifier: AEMF_v1_0 Program summary URL:http://cpc.cs.qub.ac.uk/summaries/AEMF_v1_0.html Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland Licensing provisions: Standard CPC licence, http://cpc.cs.qub.ac.uk/licence/licence.html No. of lines in distributed program, including test data, etc.: 77 434 No. of bytes in distributed program, including test data, etc.: 3 264 737 Distribution format: tar.gz Programming language: Fortran 90, C, and Perl Computer: Itasca (HP Linux cluster, each node has two-socket, quad-core 2.8 GHz Intel Xeon X5560 “Nehalem EP” processors), Calhoun (SGI Altix XE 1300 cluster, each node containing two quad-core 2.66 GHz Intel Xeon “Clovertown”-class processors sharing 16 GB of main memory), Koronis (Altix UV 1000 server with 190 6-core Intel Xeon X7542 “Westmere” processors at 2.66 GHz), Elmo (Sun Fire X4600 Linux cluster with AMD Opteron cores), and Mac Pro (two 2.8 GHz Quad-core Intel Xeon processors) Operating system: Linux/Unix/Mac OS RAM: 2 Mbytes Classification: 16.3, 16.12, 23 Nature of problem: Calculation of the partition functions and thermodynamic functions (standard-state energy, enthalpy, entropy, and free energy as functions of temperatures) of complex molecules involving multiple torsional motions. Solution method: The multi-structural approximation with torsional anharmonicity (MS-T). The program also provides results for the multi-structural local harmonic approximation [1]. Restrictions: There is no limit on the number of torsions that can be included in either the Voronoi calculation or the full MS-T calculation. In practice, the range of problems that can be addressed with the present method consists of all multi-torsional problems for which one can afford to calculate all the conformations and their frequencies. Unusual features: The method can be applied to transition states as well as stable molecules.
The program package also includes the hull program for the calculation of Voronoi volumes and six utility codes that can be used as stand-alone programs to calculate reduced moment-of-inertia matrices by the method of Kilpatrick and Pitzer, to generate conformational structures, to calculate, either analytically or by Monte Carlo sampling, volumes for torsional subdomain defined by Voronoi tessellation of the conformational subspace, to generate template input files, and to calculate one-dimensional torsional partition functions using the torsional eigenvalue summation method. Additional comments: The program package includes a manual, installation script, and input and output files for a test suite. Running time: There are 24 test runs. The running time of the test runs on a single processor of the Itasca computer is less than 2 seconds. J. Zheng, T. Yu, E. Papajak, I.M. Alecu, S.L. Mielke, D.G. Truhlar, Practical methods for including torsional anharmonicity in thermochemical calculations of complex molecules: The internal-coordinate multi-structural approximation, Phys. Chem. Chem. Phys. 13 (2011) 10885-10907.
An Ontology Driven Information Architecture for Interoperable Disparate Data Sources

NASA Technical Reports Server (NTRS)

Hughes, J. Steven; Crichton, Dan; Hardman, Sean; Joyner, Ronald; Mattmann, Chris; Ramirez, Paul; Kelly, Sean; Castano, Rebecca

2011-01-01

The mission of the Planetary Data System is to facilitate achievement of NASA's planetary science goals by efficiently collecting, archiving, and making accessible digital data produced by or relevant to NASA's planetary missions, research programs, and data analysis programs. The vision is: (1) to gather and preserve the data obtained from exploration of the Solar System by the U.S. and other nations; (2) to facilitate new and exciting discoveries by providing access to and ensuring usability of those data to the worldwide community; and (3) to inspire the public through availability and distribution of the body of knowledge reflected in the PDS data collection. PDS is a federation of heterogeneous nodes including science and support nodes.
Perceptual grouping effects on cursor movement expectations.

PubMed

Dorneich, Michael C; Hamblin, Christopher J; Lancaster, Jeff A; Olofinboba, Olu

2014-05-01

Two studies were conducted to develop an understanding of factors that drive user expectations when navigating between discrete elements on a display via a limited degree-of-freedom cursor control device. For the Orion Crew Exploration Vehicle spacecraft, a free-floating cursor with a graphical user interface (GUI) would require an unachievable level of accuracy due to expected acceleration and vibration conditions during dynamic phases of flight. Therefore, the Orion program proposed using a "caged" cursor to "jump" from one controllable element (node) on the GUI to another. However, nodes are not likely to be arranged on a rectilinear grid, and so movements between nodes are not obvious. Proximity between nodes, direction of nodes relative to each other, and context features may all contribute to user cursor movement expectations. In an initial study, we examined user expectations based on the nodes themselves. In a second study, we examined the effect of context features on user expectations. The studies established that perceptual grouping effects influence expectations to varying degrees. Based on these results, a simple rule set was developed to support users in building a straightforward mental model that closely matches their natural expectations for cursor movement. The results will help designers of display formats take advantage of the natural context-driven cursor movement expectations of users to reduce navigation errors, increase usability, and decrease access time. The rule set and guidelines tie theory to practice and can be applied in environments where vibration or acceleration are significant, including spacecraft, aircraft, and automobiles.
Programming for 1.6 Million cores: Early experiences with IBM's BG/Q SMP architecture

NASA Astrophysics Data System (ADS)

Glosli, James

2013-03-01

With the stall in clock cycle improvements a decade ago, the drive for computational performance has continued along a path of increasing core counts on a processor. The multi-core evolution has been expressed in both a symmetric multiprocessor (SMP) architecture and a CPU/GPU architecture. Debates rage in the high performance computing (HPC) community over which architecture best serves HPC. In this talk I will not attempt to resolve that debate but perhaps fuel it. I will discuss the experience of exploiting Sequoia, a 98304 node IBM Blue Gene/Q SMP at Lawrence Livermore National Laboratory. The advantages and challenges of leveraging the computational power of BG/Q will be detailed through the discussion of two applications. The first application is a Molecular Dynamics code called ddcMD. This is a code developed over the last decade at LLNL and ported to BG/Q. The second application is a cardiac modeling code called Cardioid. This is a code that was recently designed and developed at LLNL to exploit the fine scale parallelism of BG/Q's SMP architecture. Through the lenses of these efforts I'll illustrate the need to rethink how we express and implement our computational approaches. This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344.
A Two-Stage Reconstruction Processor for Human Detection in Compressive Sensing CMOS Radar.

PubMed

Tsao, Kuei-Chi; Lee, Ling; Chu, Ta-Shun; Huang, Yuan-Hao

2018-04-05

Complementary metal-oxide-semiconductor (CMOS) radar has recently attracted much research attention because small and low-power CMOS devices are very suitable for deploying sensing nodes in a low-power wireless sensing system. This study focuses on the signal processing of a wireless CMOS impulse radar system that can detect humans and objects in the home-care internet-of-things sensing system. The challenges of low-power CMOS radar systems are the weakness of human signals and the high computational complexity of the target detection algorithm. The compressive sensing-based detection algorithm can relax the computational costs by avoiding the utilization of matched filters and reducing the analog-to-digital converter bandwidth requirement. The orthogonal matching pursuit (OMP) is one of the popular signal reconstruction algorithms for compressive sensing radar; however, the complexity is still very high because the high resolution of human respiration leads to high-dimension signal reconstruction. Thus, this paper proposes a two-stage reconstruction algorithm for compressive sensing radar. The proposed algorithm not only has 75% lower complexity than the OMP algorithm but also achieves better positioning performance than the OMP algorithm, especially in noisy environments. This study also designed and implemented the algorithm using a Virtex-7 FPGA chip (Xilinx, San Jose, CA, USA). The proposed reconstruction processor can support the 256 × 13 real-time radar image display with a throughput of 28.2 frames per second.
Computer aided stress analysis of long bones utilizing computer tomography

DOE Office of Scientific and Technical Information (OSTI.GOV)

Marom, S.A.

1986-01-01

A computer aided analysis method, utilizing computed tomography (CT), has been developed, which together with a finite element program determines the stress-displacement pattern in a long bone section. The CT data file provides the geometry, the density and the material properties for the generated finite element model. A three-dimensional finite element model of a tibial shaft is automatically generated from the CT file by a pre-processing procedure for a finite element program. The developed pre-processor includes an edge detection algorithm which determines the boundaries of the reconstructed cross-sectional images of the scanned bone. A mesh generation procedure then automatically generates a three-dimensional mesh of a user-selected refinement. The elastic properties needed for the stress analysis are individually determined for each model element using the radiographic density (CT number) of each pixel within the elemental borders. The elastic modulus is determined from the CT radiographic density by using an empirical relationship from the literature. The generated finite element model, together with applied loads, determined from existing gait analysis and initial displacements, comprise a formatted input for the SAP IV finite element program. The output of this program, stresses and displacements at the model elements and nodes, are sorted and displayed by a developed post-processor to provide maximum and minimum values at selected locations in the model.
Environmental Control and Life Support System Mockup

NASA Technical Reports Server (NTRS)

2001-01-01

The Environmental Control and Life Support System (ECLSS) Group of the Flight Projects Directorate at the Marshall Space Flight Center in Huntsville, Alabama, is responsible for designing and building the life support systems that will provide the crew of the International Space Station (ISS) a comfortable environment in which to live and work. This photograph shows the mockup of the ECLSS to be installed in the Node 3 module of the ISS. From left to right, shower rack, waste management rack, Water Recovery System (WRS) Rack #2, WRS Rack #1, and Oxygen Generation System (OGS) rack are shown. The WRS provides clean water through the reclamation of wastewaters and is comprised of a Urine Processor Assembly (UPA) and a Water Processor Assembly (WPA). The UPA accepts and processes pretreated crewmember urine to allow it to be processed along with other wastewaters in the WPA. The WPA removes free gas, organic, and nonorganic constituents before the water goes through a series of multifiltration beds for further purification. The OGS produces oxygen for breathing air for the crew and laboratory animals, as well as for replacing oxygen loss. The OGS is comprised of a cell stack, which electrolyzes (breaks apart the hydrogen and oxygen molecules) some of the clean water provided by the WRS, and the separators that remove the gases from the water after electrolysis.
Accelerating Subsurface Transport Simulation on Heterogeneous Clusters

DOE Office of Scientific and Technical Information (OSTI.GOV)

Villa, Oreste; Gawande, Nitin A.; Tumeo, Antonino

Reactive transport numerical models simulate chemical and microbiological reactions that occur along a flowpath. These models have to compute reactions for a large number of locations. They solve the set of ordinary differential equations (ODEs) that describes the reaction for each location through the Newton-Raphson technique. This technique involves computing a Jacobian matrix and a residual vector for each set of equations, and then solving iteratively the linearized system by performing Gaussian Elimination and LU decomposition until convergence. STOMP, a well known subsurface flow simulation tool, employs matrices with sizes in the order of 100x100 elements and, for numerical accuracy, LU factorization with full pivoting instead of the faster partial pivoting. Modern high performance computing systems are heterogeneous machines whose nodes integrate both CPUs and GPUs, exposing unprecedented amounts of parallelism. To exploit all their computational power, applications must use both types of processing elements. For the case of subsurface flow simulation, this mainly requires implementing efficient batched LU-based solvers and identifying efficient solutions for enabling load balancing among the different processors of the system. In this paper we discuss two approaches that allow scaling STOMP's performance on heterogeneous clusters. We initially identify the challenges in implementing batched LU-based solvers for small matrices on GPUs, and propose an implementation that fulfills STOMP's requirements. We compare this implementation to other existing solutions. Then, we combine the batched GPU solver with an OpenMP-based CPU solver, and present an adaptive load balancer that dynamically distributes the linear systems to solve between the two components inside a node.
We show how these approaches, integrated into the full application, provide speed ups from 6 to 7 times on large problems, executed on up to 16 nodes of a cluster with two AMD Opteron 6272 and a Tesla M2090 per node.
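To make the distinction between full and partial pivoting concrete, here is a minimal Python sketch of Gaussian elimination with full pivoting for one small dense system. It is an illustrative serial solver only: it is not STOMP's code, it is not batched, and it does not run on a GPU. The function name and the example system are hypothetical.

```python
def solve_full_pivot(a, b, tol=1e-12):
    """Solve a x = b by Gaussian elimination with full (row and column) pivoting."""
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]   # augmented copy
    col_order = list(range(n))                         # tracks column swaps

    for k in range(n):
        # Full pivoting: pick the largest entry in the whole remaining block.
        # (Partial pivoting would search column k only.)
        p, q = max(
            ((i, j) for i in range(k, n) for j in range(k, n)),
            key=lambda ij: abs(m[ij[0]][ij[1]]),
        )
        if abs(m[p][q]) < tol:
            raise ValueError("matrix is singular to working precision")
        m[k], m[p] = m[p], m[k]                        # row swap
        for row in m:                                  # column swap in all rows
            row[k], row[q] = row[q], row[k]
        col_order[k], col_order[q] = col_order[q], col_order[k]

        for i in range(k + 1, n):                      # eliminate below the pivot
            factor = m[i][k] / m[k][k]
            for j in range(k, n + 1):
                m[i][j] -= factor * m[k][j]

    y = [0.0] * n                                      # back substitution
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * y[j] for j in range(i + 1, n))
        y[i] = (m[i][n] - tail) / m[i][i]

    x = [0.0] * n                                      # undo the column permutation
    for position, original in enumerate(col_order):
        x[original] = y[position]
    return x


print(solve_full_pivot([[2.0, 1.0], [1.0, 3.0]], [3.0, 5.0]))
```

The pivot search over the whole trailing block is what makes full pivoting slower than partial pivoting, and it is the kind of step that is awkward to map onto many small matrices processed together on a GPU.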
Scalability of Parallel Spatial Direct Numerical Simulations on Intel Hypercube and IBM SP1 and SP2

NASA Technical Reports Server (NTRS)

Joslin, Ronald D.; Hanebutte, Ulf R.; Zubair, Mohammad

1995-01-01

The implementation and performance of a parallel spatial direct numerical simulation (PSDNS) approach on the Intel iPSC/860 hypercube and IBM SP1 and SP2 parallel computers is documented. Spatially evolving disturbances associated with the laminar-to-turbulent transition in boundary-layer flows are computed with the PSDNS code. The feasibility of using the PSDNS to perform transition studies on these computers is examined. The results indicate that the PSDNS approach can effectively be parallelized on a distributed-memory parallel machine by remapping the distributed data structure during the course of the calculation. Scalability information is provided to estimate computational costs to match the actual costs relative to changes in the number of grid points. By increasing the number of processors, slower than linear speedups are achieved with optimized (machine-dependent library) routines. This slower than linear speedup results because the computational cost is dominated by the FFT routine, which yields less than ideal speedups. By using appropriate compile options and optimized library routines on the SP1, the serial code achieves 52-56 Mflops on a single node of the SP1 (45 percent of theoretical peak performance). The actual performance of the PSDNS code on the SP1 is evaluated with a "real world" simulation that consists of 1.7 million grid points. One time step of this simulation is calculated on eight nodes of the SP1 in the same time as required by a Cray Y/MP supercomputer. For the same simulation, 32 nodes of the SP1 and SP2 are required to reach the performance of a Cray C-90. A 32 node SP1 (SP2) configuration is 2.9 (4.6) times faster than a Cray Y/MP for this simulation, while the hypercube is roughly 2 times slower than the Y/MP for this application. KEY WORDS: Spatial direct numerical simulations; incompressible viscous flows; spectral methods; finite differences; parallel computing.
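The "slower than linear" speedup can be quantified from the figures quoted in this abstract with a short Python calculation. It assumes, as the abstract states, that eight SP1 nodes match one Cray Y/MP, so the 2.9x figure for 32 nodes is also the speedup over the eight-node run. The function name is hypothetical, and the SP2 is left out because its eight-node timing is not given.

```python
def relative_efficiency(base_nodes, nodes, speedup_vs_base):
    """Parallel efficiency when scaling from base_nodes to nodes."""
    ideal_speedup = nodes / base_nodes
    return speedup_vs_base / ideal_speedup


# SP1: 8 nodes match the Y/MP; 32 nodes are 2.9 times faster than the Y/MP.
sp1_efficiency = relative_efficiency(base_nodes=8, nodes=32, speedup_vs_base=2.9)
print(f"SP1 efficiency from 8 to 32 nodes: {sp1_efficiency:.1%}")
```

Under that reading, quadrupling the node count yields a 2.9x speedup, which is roughly three quarters of the ideal and agrees with the abstract's remark about the FFT routine limiting scalability.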
Optical implementation of a parallel out-of-band controller for large broadband ATM switch applications

NASA Astrophysics Data System (ADS)

Cloonan, Thomas J.; Richards, Gaylord W.; Lentine, Anthony L.

1996-03-01

Asynchronous transfer mode (ATM) is rapidly becoming the transport mechanism of choice for the information superhighway, because it promises the bandwidth and flexibility needed for many voice, video and data service offerings. Some industry experts project that the required sizes for ATM switching equipment in the public-switched environment will reach the Tbps range by the beginning of the next decade. This paper analyzes the problems associated with controlling the flow of packets within a broadband ATM switch of this size. The analysis is based on the requirements of the growable packet switch architecture. The paper proposes a novel solution to the problem of hunting paths within an ATM packet switch network. The resulting control scheme is unconventional in two ways. First, it uses an out-of-band control algorithm instead of the more common self-routing approach. In particular, we explore the benefits of using a parallel processor as an out-of-band controller for a growable packet switch distribution network. The processor permits additional levels of parallelism to be added to the out-of-band control function so that path hunts can be performed for all N of the input ports within a single cell interval. The proposed approach is also unconventional because it uses free-space digital optics to guide signals between successive stages of the controller. The paper describes the underlying motivations for implementing an optical out-of-band controller for an ATM switch, and it also describes the logic within a controller node that has been fabricated using a hybrid Si CMOS/GaAs SEED technology. The node uses optical detectors (in GaAs), amplifiers and digital control logic (in Si), and optical modulators (in GaAs). Free-space optical connections between successive device arrays can be provided using either bulk optical elements or micro-optics, but the optical interconnects must provide massive fanout capability. 
An architectural analysis of the feasibility of applying free-space optics in the proposed ATM switch controller is also presented.
A Proposed Scalable Design and Simulation of Wireless Sensor Network-Based Long-Distance Water Pipeline Leakage Monitoring System

PubMed Central

Almazyad, Abdulaziz S.; Seddiq, Yasser M.; Alotaibi, Ahmed M.; Al-Nasheri, Ahmed Y.; BenSaleh, Mohammed S.; Obeid, Abdulfattah M.; Qasim, Syed Manzoor

2014-01-01

Anomalies such as leakage and bursts in water pipelines have severe consequences for the environment and the economy. To ensure the reliability of water pipelines, they must be monitored effectively. Wireless Sensor Networks (WSNs) have emerged as an effective technology for monitoring critical infrastructure such as water, oil and gas pipelines. In this paper, we present a scalable design and simulation of a water pipeline leakage monitoring system using Radio Frequency IDentification (RFID) and WSN technology. The proposed design targets long-distance aboveground water pipelines that have special considerations for maintenance, energy consumption and cost. The design is based on deploying a group of mobile wireless sensor nodes inside the pipeline and allowing them to work cooperatively according to a prescheduled order. Under this mechanism, only one node is active at a time, while the other nodes are sleeping. The node whose turn is next wakes up according to one of three wakeup techniques: location-based, time-based and interrupt-driven. In this paper, mathematical models are derived for each technique to estimate the corresponding energy consumption and memory size requirements. The proposed equations are analyzed and the results are validated using simulation. PMID:24561404
A proposed scalable design and simulation of wireless sensor network-based long-distance water pipeline leakage monitoring system.

PubMed

Almazyad, Abdulaziz S; Seddiq, Yasser M; Alotaibi, Ahmed M; Al-Nasheri, Ahmed Y; BenSaleh, Mohammed S; Obeid, Abdulfattah M; Qasim, Syed Manzoor

2014-02-20

Anomalies such as leakage and bursts in water pipelines have severe consequences for the environment and the economy. To ensure the reliability of water pipelines, they must be monitored effectively. Wireless Sensor Networks (WSNs) have emerged as an effective technology for monitoring critical infrastructure such as water, oil and gas pipelines. In this paper, we present a scalable design and simulation of a water pipeline leakage monitoring system using Radio Frequency IDentification (RFID) and WSN technology. The proposed design targets long-distance aboveground water pipelines that have special considerations for maintenance, energy consumption and cost. The design is based on deploying a group of mobile wireless sensor nodes inside the pipeline and allowing them to work cooperatively according to a prescheduled order. Under this mechanism, only one node is active at a time, while the other nodes are sleeping. The node whose turn is next wakes up according to one of three wakeup techniques: location-based, time-based and interrupt-driven. In this paper, mathematical models are derived for each technique to estimate the corresponding energy consumption and memory size requirements. The proposed equations are analyzed and the results are validated using simulation.
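The abstract does not reproduce the paper's energy equations, so the following is only a minimal sketch of the general idea behind a time-based wakeup schedule in which one node is active per slot. The power figures, slot length, and function names are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class NodePower:
    active_mw: float  # hypothetical power draw while sensing and transmitting
    sleep_mw: float   # hypothetical power draw while asleep


def active_node(slot_index: int, num_nodes: int) -> int:
    """Time-based schedule: nodes take turns in a fixed, prescheduled order."""
    return slot_index % num_nodes


def energy_per_node_mj(power: NodePower, num_nodes: int,
                       slot_s: float, rounds: int) -> float:
    """Energy in mJ used by one node when it is active for one slot per round."""
    active_s = slot_s * rounds
    sleep_s = slot_s * (num_nodes - 1) * rounds
    return power.active_mw * active_s + power.sleep_mw * sleep_s


def always_on_energy_mj(power: NodePower, num_nodes: int,
                        slot_s: float, rounds: int) -> float:
    """Baseline for comparison: the node never sleeps."""
    return power.active_mw * slot_s * num_nodes * rounds


power = NodePower(active_mw=60.0, sleep_mw=0.03)  # illustrative values only
scheduled = energy_per_node_mj(power, num_nodes=10, slot_s=5.0, rounds=100)
baseline = always_on_energy_mj(power, num_nodes=10, slot_s=5.0, rounds=100)
print(f"scheduled: {scheduled:.0f} mJ, always on: {baseline:.0f} mJ")
```

With these made-up numbers the per-node energy is dominated by the single active slot per round, which is the effect the cooperative schedule is meant to exploit.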
Adapting Wave-front Algorithms to Efficiently Utilize Systems with Deep Communication Hierarchies

DOE Office of Scientific and Technical Information (OSTI.GOV)

Kerbyson, Darren J.; Lang, Michael; Pakin, Scott

2011-09-30

Large-scale systems increasingly exhibit a differential between intra-chip and inter-chip communication performance, especially in hybrid systems using accelerators. Processor cores on the same socket are able to communicate at lower latencies, and with higher bandwidths, than cores on different sockets either within the same node or between nodes. A key challenge is to efficiently use this communication hierarchy and hence optimize performance. We consider here the class of applications that contains wavefront processing. In these applications data can only be processed after their upstream neighbors have been processed. Similar dependencies result between processors in which communication is required to pass boundary data downstream and whose cost is typically impacted by the slowest communication channel in use. In this work we develop a novel hierarchical wave-front approach that reduces the use of slower communications in the hierarchy but at the cost of additional steps in the parallel computation and higher use of on-chip communications. This tradeoff is explored using a performance model. An implementation using the Reverse-acceleration programming model on the petascale Roadrunner system demonstrates a 27% performance improvement at full system-scale on a kernel application. The approach is generally applicable to large-scale multi-core and accelerated systems where a differential in system communication performance exists.
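The dependency pattern described here, where a cell can be processed only after its upstream neighbors, can be illustrated with a plain (non-hierarchical) wavefront sweep over a 2-D grid. This is a generic sketch, not the hierarchical scheme developed in the paper; the update rule and function names are placeholders chosen for illustration.

```python
def wavefront_order(rows, cols):
    """Yield anti-diagonals; all cells in one diagonal are mutually independent."""
    for d in range(rows + cols - 1):
        first_row = max(0, d - cols + 1)
        last_row = min(rows, d + 1)
        yield [(i, d - i) for i in range(first_row, last_row)]


def sweep(grid):
    """Placeholder update: own value plus the larger of the north and west results."""
    rows, cols = len(grid), len(grid[0])
    out = [[0] * cols for _ in range(rows)]
    for front in wavefront_order(rows, cols):
        # Cells in `front` could be processed in parallel, since each one
        # depends only on cells from earlier fronts.
        for i, j in front:
            north = out[i - 1][j] if i > 0 else 0
            west = out[i][j - 1] if j > 0 else 0
            out[i][j] = grid[i][j] + max(north, west)
    return out


fronts = list(wavefront_order(2, 3))
result = sweep([[1, 2], [3, 4]])
print(fronts)
print(result)
```

In a distributed run each front boundary would be a communication step, which is why the cost of the slowest channel in the hierarchy matters.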
Spherical Harmonic Solutions to the 3D Kobayashi Benchmark Suite

DOE Office of Scientific and Technical Information (OSTI.GOV)

Brown, P.N.; Chang, B.; Hanebutte, U.R.

1999-12-29

Spherical harmonic solutions of order 5, 9 and 21 on spatial grids containing up to 3.3 million cells are presented for the Kobayashi benchmark suite. This suite of three problems, each with a simple geometry consisting of a pure absorber with a large void region, was proposed by Professor Kobayashi at an OECD/NEA meeting in 1996. Each of the three problems contains a source, a void and a shield region. Problem 1 can best be described as a box-in-a-box problem, where a source region is surrounded by a square void region which itself is embedded in a square shield region. Problems 2 and 3 represent a shield with a void duct: problem 2 has a straight duct and problem 3 a dog-leg-shaped duct. A pure absorber and a 50% scattering case are considered for each of the three problems. The solutions have been obtained with Ardra, a scalable, parallel neutron transport code developed at Lawrence Livermore National Laboratory (LLNL). The Ardra code takes advantage of a two-level parallelization strategy, which combines message passing between processing nodes and thread-based parallelism amongst processors on each node. All calculations were performed on the IBM ASCI Blue-Pacific computer at LLNL.
Low-complex energy-aware image communication in visual sensor networks

NASA Astrophysics Data System (ADS)

Phamila, Yesudhas Asnath Victy; Amutha, Ramachandran

2013-10-01

A low-complexity, low-bit-rate, energy-efficient image compression algorithm is presented. It is explicitly designed for resource-constrained visual sensor networks used for surveillance, battlefield and habitat monitoring, etc., where a voluminous amount of image data has to be communicated over a bandwidth-limited wireless medium. The proposed method overcomes the energy limitation of individual nodes and is investigated in terms of image quality, entropy, processing time, overall energy consumption, and system lifetime. This algorithm is highly energy efficient and extremely fast since it applies an energy-aware zonal binary discrete cosine transform (DCT) that computes only the few required significant coefficients and codes them using an enhanced complementary Golomb Rice code without using any floating point operations. Experiments are performed using the Atmel Atmega128 and MSP430 processors to measure the resultant energy savings. Simulation results show that the proposed energy-aware fast zonal transform consumes only 0.3% of the energy needed by conventional DCT. This algorithm consumes only 6% of the energy needed by the Independent JPEG Group (fast) version, and it is suited to embedded systems requiring low power consumption. The proposed scheme is unique since it significantly enhances the lifetime of the camera sensor node and the network without any need for distributed processing as was traditionally required in existing algorithms.
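The abstract does not spell out the "enhanced complementary" variant, so the sketch below shows only the standard Golomb-Rice code it presumably builds on, as a baseline illustration of integer-only entropy coding. The bit convention (quotient in unary as ones ended by a zero, then `k` remainder bits) and the function names are assumptions.

```python
def rice_encode(value: int, k: int) -> str:
    """Standard Golomb-Rice code of a non-negative integer, as a bit string."""
    if value < 0 or k < 0:
        raise ValueError("value and k must be non-negative")
    quotient = value >> k                      # integer shift, no floating point
    remainder = value & ((1 << k) - 1)         # low k bits
    remainder_bits = format(remainder, "b").zfill(k) if k else ""
    return "1" * quotient + "0" + remainder_bits


def rice_decode(bits: str, k: int) -> int:
    """Inverse of rice_encode for a single codeword."""
    quotient = bits.index("0")                 # length of the unary prefix
    remainder_bits = bits[quotient + 1: quotient + 1 + k]
    remainder = int(remainder_bits, 2) if k else 0
    return (quotient << k) | remainder


codeword = rice_encode(9, 2)
print(codeword, rice_decode(codeword, 2))
```

Small values get short codewords, which suits transform coefficients that are mostly near zero; only shifts, masks, and comparisons are needed, so the scheme fits microcontrollers without a floating point unit.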
A real-time robot arm collision detection system

NASA Technical Reports Server (NTRS)

Shaffer, Clifford A.; Herb, Gregory M.

1990-01-01

A data structure and update algorithm are presented for a prototype real time collision detection safety system for a multi-robot environment. The data structure is a variant of the octree, which serves as a spatial index. An octree recursively decomposes 3-D space into eight equal cubic octants until each octant meets some decomposition criteria. The octree stores cylspheres (cylinders with spheres on each end) and rectangular solids as primitives (other primitives can easily be added as required). These primitives make up the two seven degrees-of-freedom robot arms and environment modeled by the system. Octree nodes containing more than a predetermined number N of primitives are decomposed. This rule keeps the octree small, as the entire environment for the application can be modeled using a few dozen primitives. As robot arms move, the octree is updated to reflect their changed positions. During most update cycles, any given primitive does not change which octree nodes it is in. Thus, modification to the octree is rarely required. Incidents in which one robot arm comes too close to another arm or an object are reported. Cycle time for interpreting current joint angles, updating the octree, and detecting/reporting imminent collisions averages 30 milliseconds on an Intel 80386 processor running at 20 MHz.
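The decomposition rule described above (split a node once it holds more than N primitives) can be sketched as follows. This is a simplified illustration, not the authors' data structure: primitives are reduced to axis-aligned bounding boxes instead of cylspheres, and the `depth_left` limit is an added safeguard against unbounded splitting. All names are hypothetical.

```python
from dataclasses import dataclass, field


def overlaps(a, b):
    """True if two boxes (xmin, ymin, zmin, xmax, ymax, zmax) intersect."""
    return all(a[i] < b[i + 3] and b[i] < a[i + 3] for i in range(3))


@dataclass
class OctreeNode:
    bounds: tuple
    capacity: int     # the threshold N from the text
    depth_left: int   # assumed safeguard, not mentioned in the abstract
    items: list = field(default_factory=list)
    children: list = field(default_factory=list)

    def insert(self, name, box):
        if not overlaps(self.bounds, box):
            return
        if self.children:
            for child in self.children:
                child.insert(name, box)
            return
        self.items.append((name, box))
        if len(self.items) > self.capacity and self.depth_left > 0:
            self._split()

    def _split(self):
        """Decompose this cube into eight equal octants and push items down."""
        x0, y0, z0, x1, y1, z1 = self.bounds
        xm, ym, zm = (x0 + x1) / 2, (y0 + y1) / 2, (z0 + z1) / 2
        for xa, xb in ((x0, xm), (xm, x1)):
            for ya, yb in ((y0, ym), (ym, y1)):
                for za, zb in ((z0, zm), (zm, z1)):
                    self.children.append(OctreeNode(
                        (xa, ya, za, xb, yb, zb), self.capacity, self.depth_left - 1))
        items, self.items = self.items, []
        for name, box in items:
            for child in self.children:
                child.insert(name, box)

    def candidate_pairs(self):
        """Pairs of primitives sharing a leaf; only these need an exact test."""
        if self.children:
            pairs = set()
            for child in self.children:
                pairs |= child.candidate_pairs()
            return pairs
        names = sorted(name for name, _ in self.items)
        return {(a, b) for i, a in enumerate(names) for b in names[i + 1:]}


root = OctreeNode(bounds=(0, 0, 0, 8, 8, 8), capacity=1, depth_left=3)
root.insert("arm1", (0.5, 0.5, 0.5, 1.5, 1.5, 1.5))
root.insert("arm2", (1.0, 1.0, 1.0, 2.0, 2.0, 2.0))
root.insert("table", (6.0, 6.0, 6.0, 7.0, 7.0, 7.0))
pairs = root.candidate_pairs()
print(pairs)
```

The spatial index narrows collision checks to primitives that share a leaf, so the distant `table` box is never compared against either arm.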
ms2: A molecular simulation tool for thermodynamic properties

NASA Astrophysics Data System (ADS)

Deublein, Stephan; Eckl, Bernhard; Stoll, Jürgen; Lishchuk, Sergey V.; Guevara-Carrion, Gabriela; Glass, Colin W.; Merker, Thorsten; Bernreuther, Martin; Hasse, Hans; Vrabec, Jadran

2011-11-01

This work presents the molecular simulation program ms2 that is designed for the calculation of thermodynamic properties of bulk fluids in equilibrium consisting of small electro-neutral molecules. ms2 features the two main molecular simulation techniques, molecular dynamics (MD) and Monte-Carlo. It supports the calculation of vapor-liquid equilibria of pure fluids and multi-component mixtures described by rigid molecular models on the basis of the grand equilibrium method. Furthermore, it is capable of sampling various classical ensembles and yields numerous thermodynamic properties. To evaluate the chemical potential, Widom's test molecule method and gradual insertion are implemented. Transport properties are determined by equilibrium MD simulations following the Green-Kubo formalism. ms2 is designed to meet the requirements of academia and industry, particularly achieving short response times and straightforward handling. It is written in Fortran90 and optimized for a fast execution on a broad range of computer architectures, spanning from single processor PCs over PC-clusters and vector computers to high-end parallel machines. The standard Message Passing Interface (MPI) is used for parallelization and ms2 is therefore easily portable to different computing platforms. Feature tools facilitate the interaction with the code and the interpretation of input and output files. The accuracy and reliability of ms2 has been shown for a large variety of fluids in preceding work.

Program summary
Program title: ms2
Catalogue identifier: AEJF_v1_0
Program summary URL: http://cpc.cs.qub.ac.uk/summaries/AEJF_v1_0.html
Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland
Licensing provisions: Special Licence supplied by the authors
No. of lines in distributed program, including test data, etc.: 82 794
No. of bytes in distributed program, including test data, etc.: 793 705
Distribution format: tar.gz
Programming language: Fortran90
Computer: The simulation tool ms2 is usable on a wide variety of platforms, from single processor machines over PC-clusters and vector computers to vector-parallel architectures. (Tested with Fortran compilers: gfortran, Intel, PathScale, Portland Group and Sun Studio.)
Operating system: Unix/Linux, Windows
Has the code been vectorized or parallelized?: Yes. Message Passing Interface (MPI) protocol. Scalability: Excellent scalability up to 16 processors for molecular dynamics and >512 processors for Monte-Carlo simulations.
RAM: ms2 runs on single processors with 512 MB RAM. The memory demand rises with increasing number of processors used per node and increasing number of molecules.
Classification: 7.7, 7.9, 12
External routines: Message Passing Interface (MPI)
Nature of problem: Calculation of application oriented thermodynamic properties for rigid electro-neutral molecules: vapor-liquid equilibria, thermal and caloric data as well as transport properties of pure fluids and multi-component mixtures.
Solution method: Molecular dynamics, Monte-Carlo, various classical ensembles, grand equilibrium method, Green-Kubo formalism.
Restrictions: No. The system size is user-defined. Typical problems addressed by ms2 can be solved by simulating systems containing typically 2000 molecules or less.
Unusual features: Feature tools are available for creating input files, analyzing simulation results and visualizing molecular trajectories.
Additional comments: Sample makefiles for multiple operation platforms are provided. Documentation is provided with the installation package and is available at http://www.ms-2.de.
Running time: The running time of ms2 depends on the problem set, the system size and the number of processes used in the simulation. Running four processes on a "Nehalem" processor, simulations calculating VLE data take between two and twelve hours, calculating transport properties between six and 24 hours.
Efficient Use of Distributed Systems for Scientific Applications

NASA Technical Reports Server (NTRS)

Taylor, Valerie; Chen, Jian; Canfield, Thomas; Richard, Jacques

2000-01-01

Distributed computing has been regarded as the future of high performance computing. Nationwide high speed networks such as vBNS are becoming widely available to interconnect high-speed computers, virtual environments, scientific instruments and large data sets. One of the major issues to be addressed with distributed systems is the development of computational tools that facilitate the efficient execution of parallel applications on such systems. These tools must exploit the heterogeneous resources (networks and compute nodes) in distributed systems. This paper presents a tool, called PART, which addresses this issue for mesh partitioning. PART takes advantage of the following heterogeneous system features: (1) processor speed; (2) number of processors; (3) local network performance; and (4) wide area network performance. Further, different finite element applications under consideration may have different computational complexities, different communication patterns, and different element types, which also must be taken into consideration when partitioning. PART uses parallel simulated annealing to partition the domain, taking into consideration network and processor heterogeneity. The results of using PART for an explicit finite element application executing on two IBM SPs (located at Argonne National Laboratory and the San Diego Supercomputer Center) indicate an increase in efficiency by up to 36% as compared to METIS, a widely used mesh partitioning tool. The input to METIS was modified to take into consideration heterogeneous processor performance; METIS does not take into consideration heterogeneous networks. The execution times for these applications were reduced by up to 30% as compared to METIS. These results are given in Figure 1 for four irregular meshes with number of elements ranging from 30,269 elements for the Barth5 mesh to 11,451 elements for the Barth4 mesh. 
Future work with PART entails using the tool with an integrated application requiring distributed systems. In particular, this application, illustrated in the document, entails an integration of finite element and fluid dynamic simulations to address the cooling of turbine blades of a gas turbine engine design. It is not uncommon to encounter high-temperature, film-cooled turbine airfoils with millions of degrees of freedom. This is because the complexity of the various airfoil components requires fine-grain meshing for accuracy. Additional information is contained in the original.
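One of the heterogeneity features listed for PART is processor speed. As a much simpler stand-in for the parallel simulated annealing the tool actually uses, the sketch below only computes target partition sizes proportional to relative processor speed. The speeds are made-up values and the function name is hypothetical; network performance and mesh connectivity are ignored entirely.

```python
def target_partition_sizes(num_elements, speeds):
    """Split an element count across processors in proportion to their speed.

    Uses largest-remainder rounding so the sizes always sum to num_elements.
    """
    if not speeds or any(s <= 0 for s in speeds):
        raise ValueError("speeds must be a non-empty list of positive numbers")
    total_speed = sum(speeds)
    exact = [num_elements * s / total_speed for s in speeds]
    sizes = [int(x) for x in exact]
    leftover = num_elements - sum(sizes)
    # Hand out the remaining elements to the largest fractional parts,
    # breaking ties by processor index for determinism.
    by_fraction = sorted(range(len(speeds)), key=lambda i: (-(exact[i] - sizes[i]), i))
    for i in by_fraction[:leftover]:
        sizes[i] += 1
    return sizes


# 11,451 elements is the Barth4 mesh size quoted above; the speeds are illustrative.
sizes = target_partition_sizes(11451, [1.0, 1.0, 2.0])
print(sizes)
```

A real partitioner would then place elements so that these size targets are met while keeping the cut between partitions, and hence communication over the slower network links, small.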
