master processor node: Topics by Science.gov

Sample records for master processor node

Switch for serial or parallel communication networks

DOEpatents

Crosette, D.B.

1994-07-19

A communication switch apparatus and a method for use in a geographically extensive serial, parallel or hybrid communication network linking a multi-processor or parallel processing system has a very low software processing overhead in order to accommodate random burst of high density data. Associated with each processor is a communication switch. A data source and a data destination, a sensor suite or robot for example, may also be associated with a switch. The configuration of the switches in the network are coordinated through a master processor node and depends on the operational phase of the multi-processor network: data acquisition, data processing, and data exchange. The master processor node passes information on the state to be assumed by each switch to the processor node associated with the switch. The processor node then operates a series of multi-state switches internal to each communication switch. The communication switch does not parse and interpret communication protocol and message routing information. During a data acquisition phase, the communication switch couples sensors producing data to the processor node associated with the switch, to a downlink destination on the communications network, or to both. It also may couple an uplink data source to its processor node. During the data exchange phase, the switch couples its processor node or an uplink data source to a downlink destination (which may include a processor node or a robot), or couples an uplink source to its processor node and its processor node to a downlink destination. 9 figs.
Switch for serial or parallel communication networks

DOEpatents

Crosette, Dario B.

1994-01-01

A communication switch apparatus and a method for use in a geographically extensive serial, parallel or hybrid communication network linking a multi-processor or parallel processing system has a very low software processing overhead in order to accommodate random burst of high density data. Associated with each processor is a communication switch. A data source and a data destination, a sensor suite or robot for example, may also be associated with a switch. The configuration of the switches in the network are coordinated through a master processor node and depends on the operational phase of the multi-processor network: data acquisition, data processing, and data exchange. The master processor node passes information on the state to be assumed by each switch to the processor node associated with the switch. The processor node then operates a series of multi-state switches internal to each communication switch. The communication switch does not parse and interpret communication protocol and message routing information. During a data acquisition phase, the communication switch couples sensors producing data to the processor node associated with the switch, to a downlink destination on the communications network, or to both. It also may couple an uplink data source to its processor node. During the data exchange phase, the switch couples its processor node or an uplink data source to a downlink destination (which may include a processor node or a robot), or couples an uplink source to its processor node and its processor node to a downlink destination.
Parallel processing data network of master and slave transputers controlled by a serial control network

DOEpatents

Crosetto, D.B.

1996-12-31

The present device provides for a dynamically configurable communication network having a multi-processor parallel processing system having a serial communication network and a high speed parallel communication network. The serial communication network is used to disseminate commands from a master processor to a plurality of slave processors to effect communication protocol, to control transmission of high density data among nodes and to monitor each slave processor`s status. The high speed parallel processing network is used to effect the transmission of high density data among nodes in the parallel processing system. Each node comprises a transputer, a digital signal processor, a parallel transfer controller, and two three-port memory devices. A communication switch within each node connects it to a fast parallel hardware channel through which all high density data arrives or leaves the node. 6 figs.
Parallel processing data network of master and slave transputers controlled by a serial control network

DOEpatents

Crosetto, Dario B.

1996-01-01

The present device provides for a dynamically configurable communication network having a multi-processor parallel processing system having a serial communication network and a high speed parallel communication network. The serial communication network is used to disseminate commands from a master processor (100) to a plurality of slave processors (200) to effect communication protocol, to control transmission of high density data among nodes and to monitor each slave processor's status. The high speed parallel processing network is used to effect the transmission of high density data among nodes in the parallel processing system. Each node comprises a transputer (104), a digital signal processor (114), a parallel transfer controller (106), and two three-port memory devices. A communication switch (108) within each node (100) connects it to a fast parallel hardware channel (70) through which all high density data arrives or leaves the node.
A hybrid optic-fiber sensor network with the function of self-diagnosis and self-healing

NASA Astrophysics Data System (ADS)

Xu, Shibo; Liu, Tiegen; Ge, Chunfeng; Chen, Cheng; Zhang, Hongxia

2014-11-01

We develop a hybrid wavelength division multiplexing optical fiber network with distributed fiber-optic sensors and quasi-distributed FBG sensor arrays which detect vibrations, temperatures and strains at the same time. The network has the ability to locate the failure sites automatically designated as self-diagnosis and make protective switching to reestablish sensing service designated as self-healing by cooperative work of software and hardware. The processes above are accomplished by master-slave processors with the help of optical and wireless telemetry signals. All the sensing and optical telemetry signals transmit in the same fiber either working fiber or backup fiber. We take wavelength 1450nm as downstream signal and wavelength 1350nm as upstream signal to control the network in normal circumstances, both signals are sent by a light emitting node of the corresponding processor. There is also a continuous laser wavelength 1310nm sent by each node and received by next node on both working and backup fibers to monitor their healthy states, but it does not carry any message like telemetry signals do. When fibers of two sensor units are completely damaged, the master processor will lose the communication with the node between the damaged ones.However we install RF module in each node to solve the possible problem. Finally, the whole network state is transmitted to host computer by master processor. Operator could know and control the network by human-machine interface if needed.
ALMA Correlator Real-Time Data Processor

NASA Astrophysics Data System (ADS)

Pisano, J.; Amestica, R.; Perez, J.

2005-10-01

The design of a real-time Linux application utilizing Real-Time Application Interface (RTAI) to process real-time data from the radio astronomy correlator for the Atacama Large Millimeter Array (ALMA) is described. The correlator is a custom-built digital signal processor which computes the cross-correlation function of two digitized signal streams. ALMA will have 64 antennas with 2080 signal streams each with a sample rate of 4 giga-samples per second. The correlator's aggregate data output will be 1 gigabyte per second. The software is defined by hard deadlines with high input and processing data rates, while requiring interfaces to non real-time external computers. The designed computer system - the Correlator Data Processor or CDP, consists of a cluster of 17 SMP computers, 16 of which are compute nodes plus a master controller node all running real-time Linux kernels. Each compute node uses an RTAI kernel module to interface to a 32-bit parallel interface which accepts raw data at 64 megabytes per second in 1 megabyte chunks every 16 milliseconds. These data are transferred to tasks running on multiple CPUs in hard real-time using RTAI's LXRT facility to perform quantization corrections, data windowing, FFTs, and phase corrections for a processing rate of approximately 1 GFLOPS. Highly accurate timing signals are distributed to all seventeen computer nodes in order to synchronize them to other time-dependent devices in the observatory array. RTAI kernel tasks interface to the timing signals providing sub-millisecond timing resolution. The CDP interfaces, via the master node, to other computer systems on an external intra-net for command and control, data storage, and further data (image) processing. The master node accesses these external systems utilizing ALMA Common Software (ACS), a CORBA-based client-server software infrastructure providing logging, monitoring, data delivery, and intra-computer function invocation. The software is being developed in tandem with the correlator hardware which presents software engineering challenges as the hardware evolves. The current status of this project and future goals are also presented.
A FPGA embedded web server for remote monitoring and control of smart sensors networks.

PubMed

Magdaleno, Eduardo; Rodríguez, Manuel; Pérez, Fernando; Hernández, David; García, Enrique

2013-12-27

This article describes the implementation of a web server using an embedded Altera NIOS II IP core, a general purpose and configurable RISC processor which is embedded in a Cyclone FPGA. The processor uses the μCLinux operating system to support a Boa web server of dynamic pages using Common Gateway Interface (CGI). The FPGA is configured to act like the master node of a network, and also to control and monitor a network of smart sensors or instruments. In order to develop a totally functional system, the FPGA also includes an implementation of the time-triggered protocol (TTP/A). Thus, the implemented master node has two interfaces, the webserver that acts as an Internet interface and the other to control the network. This protocol is widely used to connecting smart sensors and actuators and microsystems in embedded real-time systems in different application domains, e.g., industrial, automotive, domotic, etc., although this protocol can be easily replaced by any other because of the inherent characteristics of the FPGA-based technology.
A FPGA Embedded Web Server for Remote Monitoring and Control of Smart Sensors Networks

PubMed Central

Magdaleno, Eduardo; Rodríguez, Manuel; Pérez, Fernando; Hernández, David; García, Enrique

2014-01-01

This article describes the implementation of a web server using an embedded Altera NIOS II IP core, a general purpose and configurable RISC processor which is embedded in a Cyclone FPGA. The processor uses the μCLinux operating system to support a Boa web server of dynamic pages using Common Gateway Interface (CGI). The FPGA is configured to act like the master node of a network, and also to control and monitor a network of smart sensors or instruments. In order to develop a totally functional system, the FPGA also includes an implementation of the time-triggered protocol (TTP/A). Thus, the implemented master node has two interfaces, the webserver that acts as an Internet interface and the other to control the network. This protocol is widely used to connecting smart sensors and actuators and microsystems in embedded real-time systems in different application domains, e.g., industrial, automotive, domotic, etc., although this protocol can be easily replaced by any other because of the inherent characteristics of the FPGA-based technology. PMID:24379047
Broadcasting collective operation contributions throughout a parallel computer

DOEpatents

Faraj, Ahmad [Rochester, MN

2012-02-21

Methods, systems, and products are disclosed for broadcasting collective operation contributions throughout a parallel computer. The parallel computer includes a plurality of compute nodes connected together through a data communications network. Each compute node has a plurality of processors for use in collective parallel operations on the parallel computer. Broadcasting collective operation contributions throughout a parallel computer according to embodiments of the present invention includes: transmitting, by each processor on each compute node, that processor's collective operation contribution to the other processors on that compute node using intra-node communications; and transmitting on a designated network link, by each processor on each compute node according to a serial processor transmission sequence, that processor's collective operation contribution to the other processors on the other compute nodes using inter-node communications.
A Wireless Electronic Nose System Using a Fe2O3 Gas Sensing Array and Least Squares Support Vector Regression

PubMed Central

Song, Kai; Wang, Qi; Liu, Qi; Zhang, Hongquan; Cheng, Yingguo

2011-01-01

This paper describes the design and implementation of a wireless electronic nose (WEN) system which can online detect the combustible gases methane and hydrogen (CH4/H2) and estimate their concentrations, either singly or in mixtures. The system is composed of two wireless sensor nodes—a slave node and a master node. The former comprises a Fe2O3 gas sensing array for the combustible gas detection, a digital signal processor (DSP) system for real-time sampling and processing the sensor array data and a wireless transceiver unit (WTU) by which the detection results can be transmitted to the master node connected with a computer. A type of Fe2O3 gas sensor insensitive to humidity is developed for resistance to environmental influences. A threshold-based least square support vector regression (LS-SVR)estimator is implemented on a DSP for classification and concentration measurements. Experimental results confirm that LS-SVR produces higher accuracy compared with artificial neural networks (ANNs) and a faster convergence rate than the standard support vector regression (SVR). The designed WEN system effectively achieves gas mixture analysis in a real-time process. PMID:22346587
Hypercluster - Parallel processing for computational mechanics

NASA Technical Reports Server (NTRS)

Blech, Richard A.

1988-01-01

An account is given of the development status, performance capabilities and implications for further development of NASA-Lewis' testbed 'hypercluster' parallel computer network, in which multiple processors communicate through a shared memory. Processors have local as well as shared memory; the hypercluster is expanded in the same manner as the hypercube, with processor clusters replacing the normal single processor node. The NASA-Lewis machine has three nodes with a vector personality and one node with a scalar personality. Each of the vector nodes uses four board-level vector processors, while the scalar node uses four general-purpose microcomputer boards.
Master/Programmable-Slave Computer

NASA Technical Reports Server (NTRS)

Smaistrla, David; Hall, William A.

1990-01-01

Unique modular computer features compactness, low power, mass storage of data, multiprocessing, and choice of various input/output modes. Master processor communicates with user via usual keyboard and video display terminal. Coordinates operations of as many as 24 slave processors, each dedicated to different experiment. Each slave circuit card includes slave microprocessor and assortment of input/output circuits for communication with external equipment, with master processor, and with other slave processors. Adaptable to industrial process control with selectable degrees of automatic control, automatic and/or manual monitoring, and manual intervention.
Asynchronous broadcast for ordered delivery between compute nodes in a parallel computing system where packet header space is limited

DOE Office of Scientific and Technical Information (OSTI.GOV)

Kumar, Sameer

Disclosed is a mechanism on receiving processors in a parallel computing system for providing order to data packets received from a broadcast call and to distinguish data packets received at nodes from several incoming asynchronous broadcast messages where header space is limited. In the present invention, processors at lower leafs of a tree do not need to obtain a broadcast message by directly accessing the data in a root processor's buffer. Instead, each subsequent intermediate node's rank id information is squeezed into the software header of packet headers. In turn, the entire broadcast message is not transferred from the rootmore » processor to each processor in a communicator but instead is replicated on several intermediate nodes which then replicated the message to nodes in lower leafs. Hence, the intermediate compute nodes become "virtual root compute nodes" for the purpose of replicating the broadcast message to lower levels of a tree.« less
INTEGRATED MONITORING HARDWARE DEVELOPMENTS AT LOS ALAMOS

DOE Office of Scientific and Technical Information (OSTI.GOV)

R. PARKER; J. HALBIG; ET AL

1999-09-01

The hardware of the integrated monitoring system supports a family of instruments having a common internal architecture and firmware. Instruments can be easily configured from application-specific personality boards combined with common master-processor and high- and low-voltage power supply boards, and basic operating firmware. The instruments are designed to function autonomously to survive power and communication outages and to adapt to changing conditions. The personality boards allow measurement of gross gammas and neutrons, neutron coincidence and multiplicity, and gamma spectra. In addition, the Intelligent Local Node (ILON) provides a moderate-bandwidth network to tie together instruments, sensors, and computers.
Soft-core processor study for node-based architectures.

DOE Office of Scientific and Technical Information (OSTI.GOV)

Van Houten, Jonathan Roger; Jarosz, Jason P.; Welch, Benjamin James

2008-09-01

Node-based architecture (NBA) designs for future satellite projects hold the promise of decreasing system development time and costs, size, weight, and power and positioning the laboratory to address other emerging mission opportunities quickly. Reconfigurable Field Programmable Gate Array (FPGA) based modules will comprise the core of several of the NBA nodes. Microprocessing capabilities will be necessary with varying degrees of mission-specific performance requirements on these nodes. To enable the flexibility of these reconfigurable nodes, it is advantageous to incorporate the microprocessor into the FPGA itself, either as a hardcore processor built into the FPGA or as a soft-core processor builtmore » out of FPGA elements. This document describes the evaluation of three reconfigurable FPGA based processors for use in future NBA systems--two soft cores (MicroBlaze and non-fault-tolerant LEON) and one hard core (PowerPC 405). Two standard performance benchmark applications were developed for each processor. The first, Dhrystone, is a fixed-point operation metric. The second, Whetstone, is a floating-point operation metric. Several trials were run at varying code locations, loop counts, processor speeds, and cache configurations. FPGA resource utilization was recorded for each configuration. Cache configurations impacted the results greatly; for optimal processor efficiency it is necessary to enable caches on the processors. Processor caches carry a penalty; cache error mitigation is necessary when operating in a radiation environment.« less
System and method for merging clusters of wireless nodes in a wireless network

DOEpatents

Budampati, Ramakrishna S [Maple Grove, MN; Gonia, Patrick S [Maplewood, MN; Kolavennu, Soumitri N [Blaine, MN; Mahasenan, Arun V [Kerala, IN

2012-05-29

A system includes a first cluster having multiple first wireless nodes. One first node is configured to act as a first cluster master, and other first nodes are configured to receive time synchronization information provided by the first cluster master. The system also includes a second cluster having one or more second wireless nodes. One second node is configured to act as a second cluster master, and any other second nodes configured to receive time synchronization information provided by the second cluster master. The system further includes a manager configured to merge the clusters into a combined cluster. One of the nodes is configured to act as a single cluster master for the combined cluster, and the other nodes are configured to receive time synchronization information provided by the single cluster master.
Stanford Hardware Development Program

NASA Technical Reports Server (NTRS)

Peterson, A.; Linscott, I.; Burr, J.

1986-01-01

Architectures for high performance, digital signal processing, particularly for high resolution, wide band spectrum analysis were developed. These developments are intended to provide instrumentation for NASA's Search for Extraterrestrial Intelligence (SETI) program. The real time signal processing is both formal and experimental. The efficient organization and optimal scheduling of signal processing algorithms were investigated. The work is complemented by efforts in processor architecture design and implementation. A high resolution, multichannel spectrometer that incorporates special purpose microcoded signal processors is being tested. A general purpose signal processor for the data from the multichannel spectrometer was designed to function as the processing element in a highly concurrent machine. The processor performance required for the spectrometer is in the range of 1000 to 10,000 million instructions per second (MIPS). Multiple node processor configurations, where each node performs at 100 MIPS, are sought. The nodes are microprogrammable and are interconnected through a network with high bandwidth for neighboring nodes, and medium bandwidth for nodes at larger distance. The implementation of both the current mutlichannel spectrometer and the signal processor as Very Large Scale Integration CMOS chip sets was commenced.
Development of a software interface for optical disk archival storage for a new life sciences flight experiments computer

NASA Technical Reports Server (NTRS)

Bartram, Peter N.

1989-01-01

The current Life Sciences Laboratory Equipment (LSLE) microcomputer for life sciences experiment data acquisition is now obsolete. Among the weaknesses of the current microcomputer are small memory size, relatively slow analog data sampling rates, and the lack of a bulk data storage device. While life science investigators normally prefer data to be transmitted to Earth as it is taken, this is not always possible. No down-link exists for experiments performed in the Shuttle middeck region. One important aspect of a replacement microcomputer is provision for in-flight storage of experimental data. The Write Once, Read Many (WORM) optical disk was studied because of its high storage density, data integrity, and the availability of a space-qualified unit. In keeping with the goals for a replacement microcomputer based upon commercially available components and standard interfaces, the system studied includes a Small Computer System Interface (SCSI) for interfacing the WORM drive. The system itself is designed around the STD bus, using readily available boards. Configurations examined were: (1) master processor board and slave processor board with the SCSI interface; (2) master processor with SCSI interface; (3) master processor with SCSI and Direct Memory Access (DMA); (4) master processor controlling a separate STD bus SCSI board; and (5) master processor controlling a separate STD bus SCSI board with DMA.
Picoradio: Communication/Computation Piconodes for Sensor Networks

DTIC Science & Technology

2003-01-02

diagram of PicoNode III, or Quark node. It is made from two custom chips, Strange RF and Charm digital processor , and is complemented by a set of...the chipset comprising of Strange (analog OOK transceiver) and Charm (digital processor ) chips. 44 Figure 33: System block diagram of the Quark node...19 2.B PICONODE II - TWO-CHIP PICONODE IMPLEMENTATION ......................................... 21 2.B.1 Baseband processor (BBP
Method for simultaneous overlapped communications between neighboring processors in a multiple

DOEpatents

Benner, Robert E.; Gustafson, John L.; Montry, Gary R.

1991-01-01

A parallel computing system and method having improved performance where a program is concurrently run on a plurality of nodes for reducing total processing time, each node having a processor, a memory, and a predetermined number of communication channels connected to the node and independently connected directly to other nodes. The present invention improves performance of performance of the parallel computing system by providing a system which can provide efficient communication between the processors and between the system and input and output devices. A method is also disclosed which can locate defective nodes with the computing system.

Protocol for multiple node network

NASA Technical Reports Server (NTRS)

Kirkham, Harold (Inventor)

1995-01-01

The invention is a multiple interconnected network of intelligent message-repeating remote nodes which employs an antibody recognition message termination process performed by all remote nodes and a remote node polling process performed by other nodes which are master units controlling remote nodes in respective zones of the network assigned to respective master nodes. Each remote node repeats only those messages originated in the local zone, to provide isolation among the master nodes.
Protocol for multiple node network

NASA Technical Reports Server (NTRS)

Kirkham, Harold (Inventor)

1994-01-01

The invention is a multiple interconnected network of intelligent message-repeating remote nodes which employs an antibody recognition message termination process performed by all remote nodes and a remote node polling process performed by other nodes which are master units controlling remote nodes in respective zones of the network assigned to respective master nodes. Each remote node repeats only those messages originated in the local zone, to provide isolation among the master nodes.
A universal computer control system for motors

NASA Technical Reports Server (NTRS)

Szakaly, Zoltan F. (Inventor)

1991-01-01

A control system for a multi-motor system such as a space telerobot, having a remote computational node and a local computational node interconnected with one another by a high speed data link is described. A Universal Computer Control System (UCCS) for the telerobot is located at each node. Each node is provided with a multibus computer system which is characterized by a plurality of processors with all processors being connected to a common bus, and including at least one command processor. The command processor communicates over the bus with a plurality of joint controller cards. A plurality of direct current torque motors, of the type used in telerobot joints and telerobot hand-held controllers, are connected to the controller cards and responds to digital control signals from the command processor. Essential motor operating parameters are sensed by analog sensing circuits and the sensed analog signals are converted to digital signals for storage at the controller cards where such signals can be read during an address read/write cycle of the command processing processor.
Function Allocation in a Robust Distributed Real-Time Environment

DTIC Science & Technology

1991-12-01

fundamental characteristic of a distributed system is its ability to map individual logical functions of an application program onto many physical nodes... how much of a node’s processor time is scheduled for function processing. IMC is the function- to -function communication required to facilitate...indicator of how much excess processor time a node has. The reconfiguration algorithms use these variables to determine the most appropriate node(s) to
Performing a global barrier operation in a parallel computer

DOEpatents

Archer, Charles J; Blocksome, Michael A; Ratterman, Joseph D; Smith, Brian E

2014-12-09

Executing computing tasks on a parallel computer that includes compute nodes coupled for data communications, where each compute node executes tasks, with one task on each compute node designated as a master task, including: for each task on each compute node until all master tasks have joined a global barrier: determining whether the task is a master task; if the task is not a master task, joining a single local barrier; if the task is a master task, joining the global barrier and the single local barrier only after all other tasks on the compute node have joined the single local barrier.
DOE Office of Scientific and Technical Information (OSTI.GOV)

None

Performing a global barrier operation in a parallel computer that includes compute nodes coupled for data communications, where each compute node executes tasks, with one task on each compute node designated as a master task, including: for each task on each compute node until all master tasks have joined a global barrier: determining whether the task is a master task; if the task is not a master task, joining a single local barrier; if the task is a master task, joining the global barrier and the single local barrier only after all other tasks on the compute node have joinedmore » the single local barrier.« less
Effect of various features on the life cycle cost of the timing/synchronization subsystem of the DCS digital communications network

NASA Technical Reports Server (NTRS)

Kimsey, D. B.

1978-01-01

The effect on the life cycle cost of the timing subsystem was examined, when these optional features were included in various combinations. The features included mutual control, directed control, double-ended reference links, independence of clock error measurement and correction, phase reference combining, self-organization, smoothing for link and nodal dropouts, unequal reference weightings, and a master in a mutual control network. An overall design of a microprocessor-based timing subsystem was formulated. The microprocessor (8080) implements the digital filter portion of a digital phase locked loop, as well as other control functions such as organization of the network through communication with processors at neighboring nodes.
High-speed prediction of crystal structures for organic molecules

NASA Astrophysics Data System (ADS)

Obata, Shigeaki; Goto, Hitoshi

2015-02-01

We developed a master-worker type parallel algorithm for allocating tasks of crystal structure optimizations to distributed compute nodes, in order to improve a performance of simulations for crystal structure predictions. The performance experiments were demonstrated on TUT-ADSIM supercomputer system (HITACHI HA8000-tc/HT210). The experimental results show that our parallel algorithm could achieve speed-ups of 214 and 179 times using 256 processor cores on crystal structure optimizations in predictions of crystal structures for 3-aza-bicyclo(3.3.1)nonane-2,4-dione and 2-diazo-3,5-cyclohexadiene-1-one, respectively. We expect that this parallel algorithm is always possible to reduce computational costs of any crystal structure predictions.
A Parallel Genetic Algorithm for Automated Electronic Circuit Design

NASA Technical Reports Server (NTRS)

Long, Jason D.; Colombano, Silvano P.; Haith, Gary L.; Stassinopoulos, Dimitris

2000-01-01

Parallelized versions of genetic algorithms (GAs) are popular primarily for three reasons: the GA is an inherently parallel algorithm, typical GA applications are very compute intensive, and powerful computing platforms, especially Beowulf-style computing clusters, are becoming more affordable and easier to implement. In addition, the low communication bandwidth required allows the use of inexpensive networking hardware such as standard office ethernet. In this paper we describe a parallel GA and its use in automated high-level circuit design. Genetic algorithms are a type of trial-and-error search technique that are guided by principles of Darwinian evolution. Just as the genetic material of two living organisms can intermix to produce offspring that are better adapted to their environment, GAs expose genetic material, frequently strings of 1s and Os, to the forces of artificial evolution: selection, mutation, recombination, etc. GAs start with a pool of randomly-generated candidate solutions which are then tested and scored with respect to their utility. Solutions are then bred by probabilistically selecting high quality parents and recombining their genetic representations to produce offspring solutions. Offspring are typically subjected to a small amount of random mutation. After a pool of offspring is produced, this process iterates until a satisfactory solution is found or an iteration limit is reached. Genetic algorithms have been applied to a wide variety of problems in many fields, including chemistry, biology, and many engineering disciplines. There are many styles of parallelism used in implementing parallel GAs. One such method is called the master-slave or processor farm approach. In this technique, slave nodes are used solely to compute fitness evaluations (the most time consuming part). The master processor collects fitness scores from the nodes and performs the genetic operators (selection, reproduction, variation, etc.). Because of dependency issues in the GA, it is possible to have idle processors. However, as long as the load at each processing node is similar, the processors are kept busy nearly all of the time. In applying GAs to circuit design, a suitable genetic representation 'is that of a circuit-construction program. We discuss one such circuit-construction programming language and show how evolution can generate useful analog circuit designs. This language has the desirable property that virtually all sets of combinations of primitives result in valid circuit graphs. Our system allows circuit size (number of devices), circuit topology, and device values to be evolved. Using a parallel genetic algorithm and circuit simulation software, we present experimental results as applied to three analog filter and two amplifier design tasks. For example, a figure shows an 85 dB amplifier design evolved by our system, and another figure shows the performance of that circuit (gain and frequency response). In all tasks, our system is able to generate circuits that achieve the target specifications.
Digital system for structural dynamics simulation

NASA Technical Reports Server (NTRS)

Krauter, A. I.; Lagace, L. J.; Wojnar, M. K.; Glor, C.

1982-01-01

State-of-the-art digital hardware and software for the simulation of complex structural dynamic interactions, such as those which occur in rotating structures (engine systems). System were incorporated in a designed to use an array of processors in which the computation for each physical subelement or functional subsystem would be assigned to a single specific processor in the simulator. These node processors are microprogrammed bit-slice microcomputers which function autonomously and can communicate with each other and a central control minicomputer over parallel digital lines. Inter-processor nearest neighbor communications busses pass the constants which represent physical constraints and boundary conditions. The node processors are connected to the six nearest neighbor node processors to simulate the actual physical interface of real substructures. Computer generated finite element mesh and force models can be developed with the aid of the central control minicomputer. The control computer also oversees the animation of a graphics display system, disk-based mass storage along with the individual processing elements.
Multi-petascale highly efficient parallel supercomputer

DOEpatents

Asaad, Sameh; Bellofatto, Ralph E.; Blocksome, Michael A.; Blumrich, Matthias A.; Boyle, Peter; Brunheroto, Jose R.; Chen, Dong; Cher, Chen -Yong; Chiu, George L.; Christ, Norman; Coteus, Paul W.; Davis, Kristan D.; Dozsa, Gabor J.; Eichenberger, Alexandre E.; Eisley, Noel A.; Ellavsky, Matthew R.; Evans, Kahn C.; Fleischer, Bruce M.; Fox, Thomas W.; Gara, Alan; Giampapa, Mark E.; Gooding, Thomas M.; Gschwind, Michael K.; Gunnels, John A.; Hall, Shawn A.; Haring, Rudolf A.; Heidelberger, Philip; Inglett, Todd A.; Knudson, Brant L.; Kopcsay, Gerard V.; Kumar, Sameer; Mamidala, Amith R.; Marcella, James A.; Megerian, Mark G.; Miller, Douglas R.; Miller, Samuel J.; Muff, Adam J.; Mundy, Michael B.; O'Brien, John K.; O'Brien, Kathryn M.; Ohmacht, Martin; Parker, Jeffrey J.; Poole, Ruth J.; Ratterman, Joseph D.; Salapura, Valentina; Satterfield, David L.; Senger, Robert M.; Smith, Brian; Steinmacher-Burow, Burkhard; Stockdell, William M.; Stunkel, Craig B.; Sugavanam, Krishnan; Sugawara, Yutaka; Takken, Todd E.; Trager, Barry M.; Van Oosten, James L.; Wait, Charles D.; Walkup, Robert E.; Watson, Alfred T.; Wisniewski, Robert W.; Wu, Peng

2015-07-14

A Multi-Petascale Highly Efficient Parallel Supercomputer of 100 petaOPS-scale computing, at decreased cost, power and footprint, and that allows for a maximum packaging density of processing nodes from an interconnect point of view. The Supercomputer exploits technological advances in VLSI that enables a computing model where many processors can be integrated into a single Application Specific Integrated Circuit (ASIC). Each ASIC computing node comprises a system-on-chip ASIC utilizing four or more processors integrated into one die, with each having full access to all system resources and enabling adaptive partitioning of the processors to functions such as compute or messaging I/O on an application by application basis, and preferably, enable adaptive partitioning of functions in accordance with various algorithmic phases within an application, or if I/O or other processors are underutilized, then can participate in computation or communication nodes are interconnected by a five dimensional torus network with DMA that optimally maximize the throughput of packet communications between nodes and minimize latency.
High speed polling protocol for multiple node network

NASA Technical Reports Server (NTRS)

Kirkham, Harold (Inventor)

1995-01-01

The invention is a multiple interconnected network of intelligent message-repeating remote nodes which employs a remote node polling process performed by a master node by transmitting a polling message generically addressed to all remote nodes associated with the master node. Each remote node responds upon receipt of the generically addressed polling message by transmitting a poll-answering informational message and by relaying the polling message to other adjacent remote nodes.
Lambda network having 2.sup.m-1 nodes in each of m stages with each node coupled to four other nodes for bidirectional routing of data packets between nodes

DOEpatents

Napolitano, Jr., Leonard M.

1995-01-01

The Lambda network is a single stage, packet-switched interprocessor communication network for a distributed memory, parallel processor computer. Its design arises from the desired network characteristics of minimizing mean and maximum packet transfer time, local routing, expandability, deadlock avoidance, and fault tolerance. The network is based on fixed degree nodes and has mean and maximum packet transfer distances where n is the number of processors. The routing method is detailed, as are methods for expandability, deadlock avoidance, and fault tolerance.
Methods for operating parallel computing systems employing sequenced communications

DOEpatents

Benner, R.E.; Gustafson, J.L.; Montry, G.R.

1999-08-10

A parallel computing system and method are disclosed having improved performance where a program is concurrently run on a plurality of nodes for reducing total processing time, each node having a processor, a memory, and a predetermined number of communication channels connected to the node and independently connected directly to other nodes. The present invention improves performance of the parallel computing system by providing a system which can provide efficient communication between the processors and between the system and input and output devices. A method is also disclosed which can locate defective nodes with the computing system. 15 figs.
Methods for operating parallel computing systems employing sequenced communications

DOEpatents

Benner, Robert E.; Gustafson, John L.; Montry, Gary R.

1999-01-01

A parallel computing system and method having improved performance where a program is concurrently run on a plurality of nodes for reducing total processing time, each node having a processor, a memory, and a predetermined number of communication channels connected to the node and independently connected directly to other nodes. The present invention improves performance of performance of the parallel computing system by providing a system which can provide efficient communication between the processors and between the system and input and output devices. A method is also disclosed which can locate defective nodes with the computing system.
Scalable Multiprocessor for High-Speed Computing in Space

NASA Technical Reports Server (NTRS)

Lux, James; Lang, Minh; Nishimoto, Kouji; Clark, Douglas; Stosic, Dorothy; Bachmann, Alex; Wilkinson, William; Steffke, Richard

2004-01-01

A report discusses the continuing development of a scalable multiprocessor computing system for hard real-time applications aboard a spacecraft. "Hard realtime applications" signifies applications, like real-time radar signal processing, in which the data to be processed are generated at "hundreds" of pulses per second, each pulse "requiring" millions of arithmetic operations. In these applications, the digital processors must be tightly integrated with analog instrumentation (e.g., radar equipment), and data input/output must be synchronized with analog instrumentation, controlled to within fractions of a microsecond. The scalable multiprocessor is a cluster of identical commercial-off-the-shelf generic DSP (digital-signal-processing) computers plus generic interface circuits, including analog-to-digital converters, all controlled by software. The processors are computers interconnected by high-speed serial links. Performance can be increased by adding hardware modules and correspondingly modifying the software. Work is distributed among the processors in a parallel or pipeline fashion by means of a flexible master/slave control and timing scheme. Each processor operates under its own local clock; synchronization is achieved by broadcasting master time signals to all the processors, which compute offsets between the master clock and their local clocks.
Smart Sensor Network for Aircraft Corrosion Monitoring

DTIC Science & Technology

2010-02-01

Network Elements – Hub, Network capable application processor ( NCAP ) – Node, Smart transducer interface module (STIM)  Corrosion Sensing and...software Transducer software Network Protocol 1451.2 1451.3 1451.5 1451.6 1451.7 I/O Node -processor Power TEDS Smart Sensor Hub ( NCAP ) IEEE 1451.0 and
Parallel-aware, dedicated job co-scheduling within/across symmetric multiprocessing nodes

DOEpatents

Jones, Terry R.; Watson, Pythagoras C.; Tuel, William; Brenner, Larry; ,Caffrey, Patrick; Fier, Jeffrey

2010-10-05

In a parallel computing environment comprising a network of SMP nodes each having at least one processor, a parallel-aware co-scheduling method and system for improving the performance and scalability of a dedicated parallel job having synchronizing collective operations. The method and system uses a global co-scheduler and an operating system kernel dispatcher adapted to coordinate interfering system and daemon activities on a node and across nodes to promote intra-node and inter-node overlap of said interfering system and daemon activities as well as intra-node and inter-node overlap of said synchronizing collective operations. In this manner, the impact of random short-lived interruptions, such as timer-decrement processing and periodic daemon activity, on synchronizing collective operations is minimized on large processor-count SPMD bulk-synchronous programming styles.
Simulation of a master-slave event set processor

DOE Office of Scientific and Technical Information (OSTI.GOV)

Comfort, J.C.

1984-03-01

Event set manipulation may consume a considerable amount of the computation time spent in performing a discrete-event simulation. One way of minimizing this time is to allow event set processing to proceed in parallel with the remainder of the simulation computation. The paper describes a multiprocessor simulation computer, in which all non-event set processing is performed by the principal processor (called the host). Event set processing is coordinated by a front end processor (the master) and actually performed by several other functionally identical processors (the slaves). A trace-driven simulation program modeling this system was constructed, and was run with tracemore » output taken from two different simulation programs. Output from this simulation suggests that a significant reduction in run time may be realized by this approach. Sensitivity analysis was performed on the significant parameters to the system (number of slave processors, relative processor speeds, and interprocessor communication times). A comparison between actual and simulation run times for a one-processor system was used to assist in the validation of the simulation. 7 references.« less
Lambda network having 2{sup m{minus}1} nodes in each of m stages with each node coupled to four other nodes for bidirectional routing of data packets between nodes

DOEpatents

Napolitano, L.M. Jr.

1995-11-28

The Lambda network is a single stage, packet-switched interprocessor communication network for a distributed memory, parallel processor computer. Its design arises from the desired network characteristics of minimizing mean and maximum packet transfer time, local routing, expandability, deadlock avoidance, and fault tolerance. The network is based on fixed degree nodes and has mean and maximum packet transfer distances where n is the number of processors. The routing method is detailed, as are methods for expandability, deadlock avoidance, and fault tolerance. 14 figs.

System and method for time synchronization in a wireless network

DOEpatents

Gonia, Patrick S.; Kolavennu, Soumitri N.; Mahasenan, Arun V.; Budampati, Ramakrishna S.

2010-03-30

A system includes multiple wireless nodes forming a cluster in a wireless network, where each wireless node is configured to communicate and exchange data wirelessly based on a clock. One of the wireless nodes is configured to operate as a cluster master. Each of the other wireless nodes is configured to (i) receive time synchronization information from a parent node, (ii) adjust its clock based on the received time synchronization information, and (iii) broadcast time synchronization information based on the time synchronization information received by that wireless node. The time synchronization information received by each of the other wireless nodes is based on time synchronization information provided by the cluster master so that the other wireless nodes substantially synchronize their clocks with the clock of the cluster master.
Multiprocessor shared-memory information exchange

DOE Office of Scientific and Technical Information (OSTI.GOV)

Santoline, L.L.; Bowers, M.D.; Crew, A.W.

1989-02-01

In distributed microprocessor-based instrumentation and control systems, the inter-and intra-subsystem communication requirements ultimately form the basis for the overall system architecture. This paper describes a software protocol which addresses the intra-subsystem communications problem. Specifically the protocol allows for multiple processors to exchange information via a shared-memory interface. The authors primary goal is to provide a reliable means for information to be exchanged between central application processor boards (masters) and dedicated function processor boards (slaves) in a single computer chassis. The resultant Multiprocessor Shared-Memory Information Exchange (MSMIE) protocol, a standard master-slave shared-memory interface suitable for use in nuclear safety systems, ismore » designed to pass unidirectional buffers of information between the processors while providing a minimum, deterministic cycle time for this data exchange.« less
A Very Large Area Network (VLAN) knowledge-base applied to space communication problems

NASA Technical Reports Server (NTRS)

Zander, Carol S.

1988-01-01

This paper first describes a hierarchical model for very large area networks (VLAN). Space communication problems whose solution could profit by the model are discussed and then an enhanced version of this model incorporating the knowledge needed for the missile detection-destruction problem is presented. A satellite network or VLAN is a network which includes at least one satellite. Due to the complexity, a compromise between fully centralized and fully distributed network management has been adopted. Network nodes are assigned to a physically localized group, called a partition. Partitions consist of groups of cell nodes with one cell node acting as the organizer or master, called the Group Master (GM). Coordinating the group masters is a Partition Master (PM). Knowledge is also distributed hierarchically existing in at least two nodes. Each satellite node has a back-up earth node. Knowledge must be distributed in such a way so as to minimize information loss when a node fails. Thus the model is hierarchical both physically and informationally.
Energy Efficient Real-Time Scheduling Using DPM on Mobile Sensors with a Uniform Multi-Cores

PubMed Central

Kim, Youngmin; Lee, Chan-Gun

2017-01-01

In wireless sensor networks (WSNs), sensor nodes are deployed for collecting and analyzing data. These nodes use limited energy batteries for easy deployment and low cost. The use of limited energy batteries is closely related to the lifetime of the sensor nodes when using wireless sensor networks. Efficient-energy management is important to extending the lifetime of the sensor nodes. Most effort for improving power efficiency in tiny sensor nodes has focused mainly on reducing the power consumed during data transmission. However, recent emergence of sensor nodes equipped with multi-cores strongly requires attention to be given to the problem of reducing power consumption in multi-cores. In this paper, we propose an energy efficient scheduling method for sensor nodes supporting a uniform multi-cores. We extend the proposed T-Ler plane based scheduling for global optimal scheduling of a uniform multi-cores and multi-processors to enable power management using dynamic power management. In the proposed approach, processor selection for a scheduling and mapping method between the tasks and processors is proposed to efficiently utilize dynamic power management. Experiments show the effectiveness of the proposed approach compared to other existing methods. PMID:29240695
Multinode reconfigurable pipeline computer

NASA Technical Reports Server (NTRS)

Nosenchuck, Daniel M. (Inventor); Littman, Michael G. (Inventor)

1989-01-01

A multinode parallel-processing computer is made up of a plurality of innerconnected, large capacity nodes each including a reconfigurable pipeline of functional units such as Integer Arithmetic Logic Processors, Floating Point Arithmetic Processors, Special Purpose Processors, etc. The reconfigurable pipeline of each node is connected to a multiplane memory by a Memory-ALU switch NETwork (MASNET). The reconfigurable pipeline includes three (3) basic substructures formed from functional units which have been found to be sufficient to perform the bulk of all calculations. The MASNET controls the flow of signals from the memory planes to the reconfigurable pipeline and vice versa. the nodes are connectable together by an internode data router (hyperspace router) so as to form a hypercube configuration. The capability of the nodes to conditionally configure the pipeline at each tick of the clock, without requiring a pipeline flush, permits many powerful algorithms to be implemented directly.
High speed polling protocol for multiple node network with sequential flooding of a polling message and a poll-answering message

NASA Technical Reports Server (NTRS)

Marvit, Maclen (Inventor); Kirkham, Harold (Inventor)

1995-01-01

The invention is a multiple interconnected network of intelligent message-repeating remote nodes which employs a remote node polling process performed by a master node by transmitting a polling message generically addressed to all remote nodes associated with the master node. Each remote node responds upon receipt of the generically addressed polling message by sequentially flooding the network with a poll-answering informational message and with the polling message.
Performance of VPIC on Sequoia

NASA Astrophysics Data System (ADS)

Nystrom, William

2014-10-01

Sequoia is a major DOE computing resource which is characteristic of future resources in that it has many threads per compute node, 64, and the individual processor cores are simpler and less powerful than cores on previous processors like Intel's Sandy Bridge or AMD's Opteron. An effort is in progress to port VPIC to the Blue Gene Q architecture of Sequoia and evaluate its performance. Results of this work will be presented on single node performance of VPIC as well as multi-node scaling.
OpenMP Performance on the Columbia Supercomputer

NASA Technical Reports Server (NTRS)

Haoqiang, Jin; Hood, Robert

2005-01-01

This presentation discusses Columbia World Class Supercomputer which is one of the world's fastest supercomputers providing 61 TFLOPs (10/20/04). Conceived, designed, built, and deployed in just 120 days. A 20-node supercomputer built on proven 512-processor nodes. The largest SGI system in the world with over 10,000 Intel Itanium 2 processors and provides the largest node size incorporating commodity parts (512) and the largest shared-memory environment (2048) with 88% efficiency tops the scalar systems on the Top500 list.
A Software Implementation of a Satellite Interface Message Processor.

ERIC Educational Resources Information Center

Eastwood, Margaret A.; Eastwood, Lester F., Jr.

A design for network control software for a computer network is described in which some nodes are linked by a communications satellite channel. It is assumed that the network has an ARPANET-like configuration; that is, that specialized processors at each node are responsible for message switching and network control. The purpose of the control…
Parallelising a molecular dynamics algorithm on a multi-processor workstation

NASA Astrophysics Data System (ADS)

Müller-Plathe, Florian

1990-12-01

The Verlet neighbour-list algorithm is parallelised for a multi-processor Hewlett-Packard/Apollo DN10000 workstation. The implementation makes use of memory shared between the processors. It is a genuine master-slave approach by which most of the computational tasks are kept in the master process and the slaves are only called to do part of the nonbonded forces calculation. The implementation features elements of both fine-grain and coarse-grain parallelism. Apart from three calls to library routines, two of which are standard UNIX calls, and two machine-specific language extensions, the whole code is written in standard Fortran 77. Hence, it may be expected that this parallelisation concept can be transfered in parts or as a whole to other multi-processor shared-memory computers. The parallel code is routinely used in production work.
An intelligent allocation algorithm for parallel processing

NASA Technical Reports Server (NTRS)

Carroll, Chester C.; Homaifar, Abdollah; Ananthram, Kishan G.

1988-01-01

The problem of allocating nodes of a program graph to processors in a parallel processing architecture is considered. The algorithm is based on critical path analysis, some allocation heuristics, and the execution granularity of nodes in a program graph. These factors, and the structure of interprocessor communication network, influence the allocation. To achieve realistic estimations of the executive durations of allocations, the algorithm considers the fact that nodes in a program graph have to communicate through varying numbers of tokens. Coarse and fine granularities have been implemented, with interprocessor token-communication duration, varying from zero up to values comparable to the execution durations of individual nodes. The effect on allocation of communication network structures is demonstrated by performing allocations for crossbar (non-blocking) and star (blocking) networks. The algorithm assumes the availability of as many processors as it needs for the optimal allocation of any program graph. Hence, the focus of allocation has been on varying token-communication durations rather than varying the number of processors. The algorithm always utilizes as many processors as necessary for the optimal allocation of any program graph, depending upon granularity and characteristics of the interprocessor communication network.
Simplifying and speeding the management of intra-node cache coherence

DOEpatents

Blumrich, Matthias A [Ridgefield, CT; Chen, Dong [Croton on Hudson, NY; Coteus, Paul W [Yorktown Heights, NY; Gara, Alan G [Mount Kisco, NY; Giampapa, Mark E [Irvington, NY; Heidelberger, Phillip [Cortlandt Manor, NY; Hoenicke, Dirk [Ossining, NY; Ohmacht, Martin [Yorktown Heights, NY

2012-04-17

A method and apparatus for managing coherence between two processors of a two processor node of a multi-processor computer system. Generally the present invention relates to a software algorithm that simplifies and significantly speeds the management of cache coherence in a message passing parallel computer, and to hardware apparatus that assists this cache coherence algorithm. The software algorithm uses the opening and closing of put/get windows to coordinate the activated required to achieve cache coherence. The hardware apparatus may be an extension to the hardware address decode, that creates, in the physical memory address space of the node, an area of virtual memory that (a) does not actually exist, and (b) is therefore able to respond instantly to read and write requests from the processing elements.
System Architecture For High Speed Sorting Of Potatoes

NASA Astrophysics Data System (ADS)

Marchant, J. A.; Onyango, C. M.; Street, M. J.

1989-03-01

This paper illustrates an industrial application of vision processing in which potatoes are sorted according to their size and shape at speeds of up to 40 objects per second. The result is a multi-processing approach built around the VME bus. A hardware unit has been designed and constructed to encode the boundary of the potatoes, to reducing the amount of data to be processed. A master 68000 processor is used to control this unit and to handle data transfers along the bus. Boundary data is passed to one of three 68010 slave processors each responsible for a line of potatoes across a conveyor belt. The slave processors calculate attributes such as shape, size and estimated weight of each potato and the master processor uses this data to operate the sorting mechanism. The system has been interfaced with a commercial grading machine and performance trials are now in progress.
Generic Divide and Conquer Internet-Based Computing

NASA Technical Reports Server (NTRS)

Follen, Gregory J. (Technical Monitor); Radenski, Atanas

2003-01-01

The growth of Internet-based applications and the proliferation of networking technologies have been transforming traditional commercial application areas as well as computer and computational sciences and engineering. This growth stimulates the exploration of Peer to Peer (P2P) software technologies that can open new research and application opportunities not only for the commercial world, but also for the scientific and high-performance computing applications community. The general goal of this project is to achieve better understanding of the transition to Internet-based high-performance computing and to develop solutions for some of the technical challenges of this transition. In particular, we are interested in creating long-term motivation for end users to provide their idle processor time to support computationally intensive tasks. We believe that a practical P2P architecture should provide useful service to both clients with high-performance computing needs and contributors of lower-end computing resources. To achieve this, we are designing dual -service architecture for P2P high-performance divide-and conquer computing; we are also experimenting with a prototype implementation. Our proposed architecture incorporates a master server, utilizes dual satellite servers, and operates on the Internet in a dynamically changing large configuration of lower-end nodes provided by volunteer contributors. A dual satellite server comprises a high-performance computing engine and a lower-end contributor service engine. The computing engine provides generic support for divide and conquer computations. The service engine is intended to provide free useful HTTP-based services to contributors of lower-end computing resources. Our proposed architecture is complementary to and accessible from computational grids, such as Globus, Legion, and Condor. Grids provide remote access to existing higher-end computing resources; in contrast, our goal is to utilize idle processor time of lower-end Internet nodes. Our project is focused on a generic divide and conquer paradigm and on mobile applications of this paradigm that can operate on a loose and ever changing pool of lower-end Internet nodes.
Endobronchial ultrasound elastography: a new method in endobronchial ultrasound-guided transbronchial needle aspiration.

PubMed

Jiang, Jun-Hong; Turner, J Francis; Huang, Jian-An

2015-12-01

TBNA through the flexible bronchoscope is a 37-year-old technology that utilizes a TBNA needle to puncture the bronchial wall and obtain specimens of peribronchial and mediastinal lesions through the flexible bronchoscope for the diagnosis of benign and malignant diseases in the mediastinum and lung. Since 2002, the Olympus Company developed the first generation ultrasound equipment for use in the airway, initially utilizing an ultrasound probe introduced through the working channel followed by incoroporation of a fixed linear ultrasound array at the distal tip of the bronchoscope. This new bronchoscope equipped with a convex type ultrasound probe on the tip was subsequently introduced into clinical practice. The convex probe (CP)-EBUS allows real-time endobronchial ultrasound-guided transbronchial needle aspiration (EBUS-TBNA) of mediastinal and hilar lymph nodes. EBUS-TBNA is a minimally invasive procedure performed under local anesthesia that has been shown to have a high sensitivity and diagnostic yield for lymph node staging of lung cancer. In 10 years of EBUS development, the Olympus Company developed the second generation EBUS bronchoscope (BF-UC260FW) with the ultrasound image processor (EU-M1), and in 2013 introduced a new ultrasound image processor (EU-M2) into clinical practice. FUJI company has also developed a curvilinear array endobronchial ultrasound bronchoscope (EB-530 US) that makes it easier for the operator to master the operation of the ultrasonic bronchoscope. Also, the new thin convex probe endobronchial ultrasound bronchoscope (TCP-EBUS) is able to visualize one to three bifurcations distal to the current CP-EBUS. The emergence of EBUS-TBNA has also been accompanied by innovation in EBUS instruments. EBUS elastography is, then, a new technique for describing the compliance of structures during EBUS, which may be of use in the determination of metastasis to the mediastinal and hilar lymph nodes. This article describes these new EBUS techniques and reviews the relevant literature.
Time and frequency transfer by the Master-Slave Returnable Timing System technique - Application to solar power transmission

NASA Technical Reports Server (NTRS)

Lindsey, W. C.; Kantak, A. V.

1979-01-01

The concept of the Master Slave Returnable Timing System (MSRTS) is presented which combines the advantages of the master slave (MS) and the Returnable Timing System (RTS) for time and frequency transfer. The basic idea of MSRTS is to send the time-frequency signal received at a particular node back to the sending node. The delay accumulated by this return signal is used to advance the phase of the master (sending) node thereby canceling the effect of the delay introduced by the path. The method can be used in highly accurate clock distribution systems required in avionics, computer communications, and large retrodirective phased arrays such as the Solar Power Satellite.
Method and apparatus for connecting finite element meshes and performing simulations therewith

DOEpatents

Dohrmann, Clark R.; Key, Samuel W.; Heinstein, Martin W.

2003-05-06

The present invention provides a method of connecting dissimilar finite element meshes. A first mesh, designated the master mesh, and a second mesh, designated the slave mesh, each have interface surfaces proximal the other. Each interface surface has a corresponding interface mesh comprising a plurality of interface nodes. Each slave interface node is assigned new coordinates locating the interface node on the interface surface of the master mesh. The slave interface surface is further redefined to be the projection of the slave interface mesh onto the master interface surface.
Companion Chip: Building a Segregated Hardware Architecture

NASA Astrophysics Data System (ADS)

Pareaud, Thomas; Houelle, Alain; Vaucher, Niolas; Albinet, Mathieu; Honvault, Christophe

2011-08-01

Partitioning is a more and more mature concept in Space industry. It aims at assuring that some error propagation modes are not possible. This paper gives an overview of an analysis conducted in the frame of a research and technology study performed in 2010/2011. The "Java Companion Chip" study addresses an interesting approach to partitioning using hardware concepts: a SoC architecture integrates a master processor, a companion chip and additional hardware functions aiming at enforcing the time and space segregation between the master processor and the slave one.This paper discusses the benefits and the main challenges of the proposed approach. In addition, it presents an application of these concepts to a case study: a Leon/Java processor architecture able to concurrently execute native and Java applications.
A parallel algorithm for generation and assembly of finite element stiffness and mass matrices

NASA Technical Reports Server (NTRS)

Storaasli, O. O.; Carmona, E. A.; Nguyen, D. T.; Baddourah, M. A.

1991-01-01

A new algorithm is proposed for parallel generation and assembly of the finite element stiffness and mass matrices. The proposed assembly algorithm is based on a node-by-node approach rather than the more conventional element-by-element approach. The new algorithm's generality and computation speed-up when using multiple processors are demonstrated for several practical applications on multi-processor Cray Y-MP and Cray 2 supercomputers.
Configuring compute nodes of a parallel computer in an operational group into a plurality of independent non-overlapping collective networks

DOEpatents

Archer, Charles J.; Inglett, Todd A.; Ratterman, Joseph D.; Smith, Brian E.

2010-03-02

Methods, apparatus, and products are disclosed for configuring compute nodes of a parallel computer in an operational group into a plurality of independent non-overlapping collective networks, the compute nodes in the operational group connected together for data communications through a global combining network, that include: partitioning the compute nodes in the operational group into a plurality of non-overlapping subgroups; designating one compute node from each of the non-overlapping subgroups as a master node; and assigning, to the compute nodes in each of the non-overlapping subgroups, class routing instructions that organize the compute nodes in that non-overlapping subgroup as a collective network such that the master node is a physical root.

Modeling heterogeneous processor scheduling for real time systems

NASA Technical Reports Server (NTRS)

Leathrum, J. F.; Mielke, R. R.; Stoughton, J. W.

1994-01-01

A new model is presented to describe dataflow algorithms implemented in a multiprocessing system. Called the resource/data flow graph (RDFG), the model explicitly represents cyclo-static processor schedules as circuits of processor arcs which reflect the order that processors execute graph nodes. The model also allows the guarantee of meeting hard real-time deadlines. When unfolded, the model identifies statically the processor schedule. The model therefore is useful for determining the throughput and latency of systems with heterogeneous processors. The applicability of the model is demonstrated using a space surveillance algorithm.
Next Generation Security for the 10,240 Processor Columbia System

NASA Technical Reports Server (NTRS)

Hinke, Thomas; Kolano, Paul; Shaw, Derek; Keller, Chris; Tweton, Dave; Welch, Todd; Liu, Wen (Betty)

2005-01-01

This presentation includes a discussion of the Columbia 10,240-processor system located at the NASA Advanced Supercomputing (NAS) division at the NASA Ames Research Center which supports each of NASA's four missions: science, exploration systems, aeronautics, and space operations. It is comprised of 20 Silicon Graphics nodes, each consisting of 512 Itanium II processors. A 64 processor Columbia front-end system supports users as they prepare their jobs and then submits them to the PBS system. Columbia nodes and front-end systems use the Linux OS. Prior to SC04, the Columbia system was used to attain a processing speed of 51.87 TeraFlops, which made it number two on the Top 500 list of the world's supercomputers and the world's fastest "operational" supercomputer since it was fully engaged in supporting NASA users.
Peregrine System | High-Performance Computing | NREL

Science.gov Websites

) and longer-term (/projects) storage. These file systems are mounted on all nodes. Peregrine has three -2670 Xeon processors and 64 GB of memory. In addition to mounting the /home, /nopt, /projects and # cores/node Memory/node Peak (DP) performance per node 88 Intel Xeon E5-2670 "Sandy Bridge" 8
Algorithmic mechanisms for reliable crowdsourcing computation under collusion.

PubMed

Fernández Anta, Antonio; Georgiou, Chryssis; Mosteiro, Miguel A; Pareja, Daniel

2015-01-01

We consider a computing system where a master processor assigns a task for execution to worker processors that may collude. We model the workers' decision of whether to comply (compute the task) or not (return a bogus result to save the computation cost) as a game among workers. That is, we assume that workers are rational in a game-theoretic sense. We identify analytically the parameter conditions for a unique Nash Equilibrium where the master obtains the correct result. We also evaluate experimentally mixed equilibria aiming to attain better reliability-profit trade-offs. For a wide range of parameter values that may be used in practice, our simulations show that, in fact, both master and workers are better off using a pure equilibrium where no worker cheats, even under collusion, and even for colluding behaviors that involve deviating from the game.
Algorithmic Mechanisms for Reliable Crowdsourcing Computation under Collusion

PubMed Central

Fernández Anta, Antonio; Georgiou, Chryssis; Mosteiro, Miguel A.; Pareja, Daniel

2015-01-01

We consider a computing system where a master processor assigns a task for execution to worker processors that may collude. We model the workers’ decision of whether to comply (compute the task) or not (return a bogus result to save the computation cost) as a game among workers. That is, we assume that workers are rational in a game-theoretic sense. We identify analytically the parameter conditions for a unique Nash Equilibrium where the master obtains the correct result. We also evaluate experimentally mixed equilibria aiming to attain better reliability-profit trade-offs. For a wide range of parameter values that may be used in practice, our simulations show that, in fact, both master and workers are better off using a pure equilibrium where no worker cheats, even under collusion, and even for colluding behaviors that involve deviating from the game. PMID:25793524
Highly parallel reconfigurable computer architecture for robotic computation having plural processor cells each having right and left ensembles of plural processors

NASA Technical Reports Server (NTRS)

Fijany, Amir (Inventor); Bejczy, Antal K. (Inventor)

1994-01-01

In a computer having a large number of single-instruction multiple data (SIMD) processors, each of the SIMD processors has two sets of three individual processor elements controlled by a master control unit and interconnected among a plurality of register file units where data is stored. The register files input and output data in synchronism with a minor cycle clock under control of two slave control units controlling the register file units connected to respective ones of the two sets of processor elements. Depending upon which ones of the register file units are enabled to store or transmit data during a particular minor clock cycle, the processor elements within an SIMD processor are connected in rings or in pipeline arrays, and may exchange data with the internal bus or with neighboring SIMD processors through interface units controlled by respective ones of the two slave control units.
Managing coherence via put/get windows

DOEpatents

Blumrich, Matthias A [Ridgefield, CT; Chen, Dong [Croton on Hudson, NY; Coteus, Paul W [Yorktown Heights, NY; Gara, Alan G [Mount Kisco, NY; Giampapa, Mark E [Irvington, NY; Heidelberger, Philip [Cortlandt Manor, NY; Hoenicke, Dirk [Ossining, NY; Ohmacht, Martin [Yorktown Heights, NY

2011-01-11

A method and apparatus for managing coherence between two processors of a two processor node of a multi-processor computer system. Generally the present invention relates to a software algorithm that simplifies and significantly speeds the management of cache coherence in a message passing parallel computer, and to hardware apparatus that assists this cache coherence algorithm. The software algorithm uses the opening and closing of put/get windows to coordinate the activated required to achieve cache coherence. The hardware apparatus may be an extension to the hardware address decode, that creates, in the physical memory address space of the node, an area of virtual memory that (a) does not actually exist, and (b) is therefore able to respond instantly to read and write requests from the processing elements.
Managing coherence via put/get windows

DOEpatents

Blumrich, Matthias A [Ridgefield, CT; Chen, Dong [Croton on Hudson, NY; Coteus, Paul W [Yorktown Heights, NY; Gara, Alan G [Mount Kisco, NY; Giampapa, Mark E [Irvington, NY; Heidelberger, Philip [Cortlandt Manor, NY; Hoenicke, Dirk [Ossining, NY; Ohmacht, Martin [Yorktown Heights, NY

2012-02-21

A method and apparatus for managing coherence between two processors of a two processor node of a multi-processor computer system. Generally the present invention relates to a software algorithm that simplifies and significantly speeds the management of cache coherence in a message passing parallel computer, and to hardware apparatus that assists this cache coherence algorithm. The software algorithm uses the opening and closing of put/get windows to coordinate the activated required to achieve cache coherence. The hardware apparatus may be an extension to the hardware address decode, that creates, in the physical memory address space of the node, an area of virtual memory that (a) does not actually exist, and (b) is therefore able to respond instantly to read and write requests from the processing elements.
Method and apparatus of parallel computing with simultaneously operating stream prefetching and list prefetching engines

DOEpatents

Boyle, Peter A.; Christ, Norman H.; Gara, Alan; Mawhinney, Robert D.; Ohmacht, Martin; Sugavanam, Krishnan

2012-12-11

A prefetch system improves a performance of a parallel computing system. The parallel computing system includes a plurality of computing nodes. A computing node includes at least one processor and at least one memory device. The prefetch system includes at least one stream prefetch engine and at least one list prefetch engine. The prefetch system operates those engines simultaneously. After the at least one processor issues a command, the prefetch system passes the command to a stream prefetch engine and a list prefetch engine. The prefetch system operates the stream prefetch engine and the list prefetch engine to prefetch data to be needed in subsequent clock cycles in the processor in response to the passed command.
Multiple-Flat-Panel System Displays Multidimensional Data

NASA Technical Reports Server (NTRS)

Gundo, Daniel; Levit, Creon; Henze, Christopher; Sandstrom, Timothy; Ellsworth, David; Green, Bryan; Joly, Arthur

2006-01-01

The NASA Ames hyperwall is a display system designed to facilitate the visualization of sets of multivariate and multidimensional data like those generated in complex engineering and scientific computations. The hyperwall includes a 77 matrix of computer-driven flat-panel video display units, each presenting an image of 1,280 1,024 pixels. The term hyperwall reflects the fact that this system is a more capable successor to prior computer-driven multiple-flat-panel display systems known by names that include the generic term powerwall and the trade names PowerWall and Powerwall. Each of the 49 flat-panel displays is driven by a rack-mounted, dual-central-processing- unit, workstation-class personal computer equipped with a hig-hperformance graphical-display circuit card and with a hard-disk drive having a storage capacity of 100 GB. Each such computer is a slave node in a master/ slave computing/data-communication system (see Figure 1). The computer that acts as the master node is similar to the slave-node computers, except that it runs the master portion of the system software and is equipped with a keyboard and mouse for control by a human operator. The system utilizes commercially available master/slave software along with custom software that enables the human controller to interact simultaneously with any number of selected slave nodes. In a powerwall, a single rendering task is spread across multiple processors and then the multiple outputs are tiled into one seamless super-display. It must be noted that the hyperwall concept subsumes the powerwall concept in that a single scene could be rendered as a mosaic image on the hyperwall. However, the hyperwall offers a wider set of capabilities to serve a different purpose: The hyperwall concept is one of (1) simultaneously displaying multiple different but related images, and (2) providing means for composing and controlling such sets of images. In place of elaborate software or hardware crossbar switches, the hyperwall concept substitutes reliance on the human visual system for integration, synthesis, and discrimination of patterns in complex and high-dimensional data spaces represented by the multiple displayed images. The variety of multidimensional data sets that can be displayed on the hyperwall is practically unlimited. For example, Figure 2 shows a hyperwall display of surface pressures and streamlines from a computational simulation of airflow about an aerospacecraft at various Mach numbers and angles of attack. In this display, Mach numbers increase from left to right and angles of attack increase from bottom to top. That is, all images in the same column represent simulations at the same Mach number, while all images in the same row represent simulations at the same angle of attack. The same viewing transformations and the same mapping from surface pressure to colors were used in generating all the images.
Program Description: Financial Master File Processor-SWRL Financial System.

ERIC Educational Resources Information Center

Ideda, Masumi

Computer routines designed to produce various management and accounting reports required by the Southwest Regional Laboratory's (SWRL) Financial System are described. Input data requirements and output report formats are presented together with a discussion of the Financial Master File updating capabilities of the system. This document should be…
Peregrine System Configuration | High-Performance Computing | NREL

Science.gov Websites

nodes and storage are connected by a high speed InfiniBand network. Compute nodes are diskless with an directories are mounted on all nodes, along with a file system dedicated to shared projects. A brief processors with 64 GB of memory. All nodes are connected to the high speed Infiniband network and and a
A Coherent VLSI Environment

DTIC Science & Technology

1987-03-31

processors . The symmetry-breaking algorithms give efficient ways to convert probabilistic algorithms to deterministic algorithms. Some of the...techniques have been applied to construct several efficient linear- processor algorithms for graph problems, including an O(lg* n)-time algorithm for (A + 1...On n-node graphs, the algorithm works in O(log 2 n) time using only n processors , in contrast to the previous best algorithm which used about n3
Processors for wavelet analysis and synthesis: NIFS and TI-C80 MVP

NASA Astrophysics Data System (ADS)

Brooks, Geoffrey W.

1996-03-01

Two processors are considered for image quadrature mirror filtering (QMF). The neuromorphic infrared focal-plane sensor (NIFS) is an existing prototype analog processor offering high speed spatio-temporal Gaussian filtering, which could be used for the QMF low- pass function, and difference of Gaussian filtering, which could be used for the QMF high- pass function. Although not designed specifically for wavelet analysis, the biologically- inspired system accomplishes the most computationally intensive part of QMF processing. The Texas Instruments (TI) TMS320C80 Multimedia Video Processor (MVP) is a 32-bit RISC master processor with four advanced digital signal processors (DSPs) on a single chip. Algorithm partitioning, memory management and other issues are considered for optimal performance. This paper presents these considerations with simulated results leading to processor implementation of high-speed QMF analysis and synthesis.
High-Speed Computation of the Kleene Star in Max-Plus Algebraic System Using a Cell Broadband Engine

NASA Astrophysics Data System (ADS)

Goto, Hiroyuki

This research addresses a high-speed computation method for the Kleene star of the weighted adjacency matrix in a max-plus algebraic system. We focus on systems whose precedence constraints are represented by a directed acyclic graph and implement it on a Cell Broadband Engine™ (CBE) processor. Since the resulting matrix gives the longest travel times between two adjacent nodes, it is often utilized in scheduling problem solvers for a class of discrete event systems. This research, in particular, attempts to achieve a speedup by using two approaches: parallelization and SIMDization (Single Instruction, Multiple Data), both of which can be accomplished by a CBE processor. The former refers to a parallel computation using multiple cores, while the latter is a method whereby multiple elements are computed by a single instruction. Using the implementation on a Sony PlayStation 3™ equipped with a CBE processor, we found that the SIMDization is effective regardless of the system's size and the number of processor cores used. We also found that the scalability of using multiple cores is remarkable especially for systems with a large number of nodes. In a numerical experiment where the number of nodes is 2000, we achieved a speedup of 20 times compared with the method without the above techniques.
A site oriented supercomputer for theoretical physics: The Fermilab Advanced Computer Program Multi Array Processor System (ACMAPS)

DOE Office of Scientific and Technical Information (OSTI.GOV)

Nash, T.; Atac, R.; Cook, A.

1989-03-06

The ACPMAPS multipocessor is a highly cost effective, local memory parallel computer with a hypercube or compound hypercube architecture. Communication requires the attention of only the two communicating nodes. The design is aimed at floating point intensive, grid like problems, particularly those with extreme computing requirements. The processing nodes of the system are single board array processors, each with a peak power of 20 Mflops, supported by 8 Mbytes of data and 2 Mbytes of instruction memory. The system currently being assembled has a peak power of 5 Gflops. The nodes are based on the Weitek XL Chip set. Themore » system delivers performance at approximately $300/Mflop. 8 refs., 4 figs.« less
Video sensor architecture for surveillance applications.

PubMed

Sánchez, Jordi; Benet, Ginés; Simó, José E

2012-01-01

This paper introduces a flexible hardware and software architecture for a smart video sensor. This sensor has been applied in a video surveillance application where some of these video sensors are deployed, constituting the sensory nodes of a distributed surveillance system. In this system, a video sensor node processes images locally in order to extract objects of interest, and classify them. The sensor node reports the processing results to other nodes in the cloud (a user or higher level software) in the form of an XML description. The hardware architecture of each sensor node has been developed using two DSP processors and an FPGA that controls, in a flexible way, the interconnection among processors and the image data flow. The developed node software is based on pluggable components and runs on a provided execution run-time. Some basic and application-specific software components have been developed, in particular: acquisition, segmentation, labeling, tracking, classification and feature extraction. Preliminary results demonstrate that the system can achieve up to 7.5 frames per second in the worst case, and the true positive rates in the classification of objects are better than 80%.
Video Sensor Architecture for Surveillance Applications

PubMed Central

Sánchez, Jordi; Benet, Ginés; Simó, José E.

2012-01-01

This paper introduces a flexible hardware and software architecture for a smart video sensor. This sensor has been applied in a video surveillance application where some of these video sensors are deployed, constituting the sensory nodes of a distributed surveillance system. In this system, a video sensor node processes images locally in order to extract objects of interest, and classify them. The sensor node reports the processing results to other nodes in the cloud (a user or higher level software) in the form of an XML description. The hardware architecture of each sensor node has been developed using two DSP processors and an FPGA that controls, in a flexible way, the interconnection among processors and the image data flow. The developed node software is based on pluggable components and runs on a provided execution run-time. Some basic and application-specific software components have been developed, in particular: acquisition, segmentation, labeling, tracking, classification and feature extraction. Preliminary results demonstrate that the system can achieve up to 7.5 frames per second in the worst case, and the true positive rates in the classification of objects are better than 80%. PMID:22438723
DMA shared byte counters in a parallel computer

DOEpatents

Chen, Dong; Gara, Alan G.; Heidelberger, Philip; Vranas, Pavlos

2010-04-06

A parallel computer system is constructed as a network of interconnected compute nodes. Each of the compute nodes includes at least one processor, a memory and a DMA engine. The DMA engine includes a processor interface for interfacing with the at least one processor, DMA logic, a memory interface for interfacing with the memory, a DMA network interface for interfacing with the network, injection and reception byte counters, injection and reception FIFO metadata, and status registers and control registers. The injection FIFOs maintain memory locations of the injection FIFO metadata memory locations including its current head and tail, and the reception FIFOs maintain the reception FIFO metadata memory locations including its current head and tail. The injection byte counters and reception byte counters may be shared between messages.
An Application-Based Performance Characterization of the Columbia Supercluster

NASA Technical Reports Server (NTRS)

Biswas, Rupak; Djomehri, Jahed M.; Hood, Robert; Jin, Hoaqiang; Kiris, Cetin; Saini, Subhash

2005-01-01

Columbia is a 10,240-processor supercluster consisting of 20 Altix nodes with 512 processors each, and currently ranked as the second-fastest computer in the world. In this paper, we present the performance characteristics of Columbia obtained on up to four computing nodes interconnected via the InfiniBand and/or NUMAlink4 communication fabrics. We evaluate floating-point performance, memory bandwidth, message passing communication speeds, and compilers using a subset of the HPC Challenge benchmarks, and some of the NAS Parallel Benchmarks including the multi-zone versions. We present detailed performance results for three scientific applications of interest to NASA, one from molecular dynamics, and two from computational fluid dynamics. Our results show that both the NUMAlink4 and the InfiniBand hold promise for application scaling to a large number of processors.

Managing coherence via put/get windows

DOE Office of Scientific and Technical Information (OSTI.GOV)

Blumrich, Matthias A; Chen, Dong; Coteus, Paul W

A method and apparatus for managing coherence between two processors of a two processor node of a multi-processor computer system. Generally the present invention relates to a software algorithm that simplifies and significantly speeds the management of cache coherence in a message passing parallel computer, and to hardware apparatus that assists this cache coherence algorithm. The software algorithm uses the opening and closing of put/get windows to coordinate the activated required to achieve cache coherence. The hardware apparatus may be an extension to the hardware address decode, that creates, in the physical memory address space of the node, an areamore » of virtual memory that (a) does not actually exist, and (b) is therefore able to respond instantly to read and write requests from the processing elements.« less
Data communications in a parallel active messaging interface of a parallel computer

DOEpatents

Archer, Charles J; Blocksome, Michael A; Ratterman, Joseph D; Smith, Brian E

2013-11-12

Data communications in a parallel active messaging interface (`PAMI`) of a parallel computer composed of compute nodes that execute a parallel application, each compute node including application processors that execute the parallel application and at least one management processor dedicated to gathering information regarding data communications. The PAMI is composed of data communications endpoints, each endpoint composed of a specification of data communications parameters for a thread of execution on a compute node, including specifications of a client, a context, and a task, the compute nodes and the endpoints coupled for data communications through the PAMI and through data communications resources. Embodiments function by gathering call site statistics describing data communications resulting from execution of data communications instructions and identifying in dependence upon the call cite statistics a data communications algorithm for use in executing a data communications instruction at a call site in the parallel application.
Portable multi-node LQCD Monte Carlo simulations using OpenACC

NASA Astrophysics Data System (ADS)

Bonati, Claudio; Calore, Enrico; D'Elia, Massimo; Mesiti, Michele; Negro, Francesco; Sanfilippo, Francesco; Schifano, Sebastiano Fabio; Silvi, Giorgio; Tripiccione, Raffaele

This paper describes a state-of-the-art parallel Lattice QCD Monte Carlo code for staggered fermions, purposely designed to be portable across different computer architectures, including GPUs and commodity CPUs. Portability is achieved using the OpenACC parallel programming model, used to develop a code that can be compiled for several processor architectures. The paper focuses on parallelization on multiple computing nodes using OpenACC to manage parallelism within the node, and OpenMPI to manage parallelism among the nodes. We first discuss the available strategies to be adopted to maximize performances, we then describe selected relevant details of the code, and finally measure the level of performance and scaling-performance that we are able to achieve. The work focuses mainly on GPUs, which offer a significantly high level of performances for this application, but also compares with results measured on other processors.
Active non-volatile memory post-processing

DOE Office of Scientific and Technical Information (OSTI.GOV)

Kannan, Sudarsun; Milojicic, Dejan S.; Talwar, Vanish

A computing node includes an active Non-Volatile Random Access Memory (NVRAM) component which includes memory and a sub-processor component. The memory is to store data chunks received from a processor core, the data chunks comprising metadata indicating a type of post-processing to be performed on data within the data chunks. The sub-processor component is to perform post-processing of said data chunks based on said metadata.
Low latency messages on distributed memory multiprocessors

NASA Technical Reports Server (NTRS)

Rosing, Matthew; Saltz, Joel

1993-01-01

Many of the issues in developing an efficient interface for communication on distributed memory machines are described and a portable interface is proposed. Although the hardware component of message latency is less than one microsecond on many distributed memory machines, the software latency associated with sending and receiving typed messages is on the order of 50 microseconds. The reason for this imbalance is that the software interface does not match the hardware. By changing the interface to match the hardware more closely, applications with fine grained communication can be put on these machines. Based on several tests that were run on the iPSC/860, an interface that will better match current distributed memory machines is proposed. The model used in the proposed interface consists of a computation processor and a communication processor on each node. Communication between these processors and other nodes in the system is done through a buffered network. Information that is transmitted is either data or procedures to be executed on the remote processor. The dual processor system is better suited for efficiently handling asynchronous communications compared to a single processor system. The ability to send data or procedure is very flexible for minimizing message latency, based on the type of communication being performed. The test performed and the proposed interface are described.
Marshburn updates software on the WHC UPA in the Node 3

NASA Image and Video Library

2013-01-17

ISS034-E-031133 (17 Jan. 2013) --- NASA astronaut Tom Marshburn, Expedition 34 flight engineer, updates software on the Waste and Hygiene Compartment?s Urine Processor Assembly in the Tranquility node of the International Space Station.
Marshburn updates software on the WHC UPA in the Node 3

NASA Image and Video Library

2013-01-17

ISS034-E-031130 (17 Jan. 2013) --- NASA astronaut Tom Marshburn, Expedition 34 flight engineer, updates software on the Waste and Hygiene Compartment?s Urine Processor Assembly in the Tranquility node of the International Space Station.
Flight design system level C requirements. Solid rocket booster and external tank impact prediction processors. [space transportation system

NASA Technical Reports Server (NTRS)

Seale, R. H.

1979-01-01

The prediction of the SRB and ET impact areas requires six separate processors. The SRB impact prediction processor computes the impact areas and related trajectory data for each SRB element. Output from this processor is stored on a secure file accessible by the SRB impact plot processor which generates the required plots. Similarly the ET RTLS impact prediction processor and the ET RTLS impact plot processor generates the ET impact footprints for return-to-launch-site (RTLS) profiles. The ET nominal/AOA/ATO impact prediction processor and the ET nominal/AOA/ATO impact plot processor generate the ET impact footprints for non-RTLS profiles. The SRB and ET impact processors compute the size and shape of the impact footprints by tabular lookup in a stored footprint dispersion data base. The location of each footprint is determined by simulating a reference trajectory and computing the reference impact point location. To insure consistency among all flight design system (FDS) users, much input required by these processors will be obtained from the FDS master data base.
Asynchronous Communication Scheme For Hypercube Computer

NASA Technical Reports Server (NTRS)

Madan, Herb S.

1988-01-01

Scheme devised for asynchronous-message communication system for Mark III hypercube concurrent-processor network. Network consists of up to 1,024 processing elements connected electrically as though were at corners of 10-dimensional cube. Each node contains two Motorola 68020 processors along with Motorola 68881 floating-point processor utilizing up to 4 megabytes of shared dynamic random-access memory. Scheme intended to support applications requiring passage of both polled or solicited and unsolicited messages.
Bus-Programmable Slave Card

NASA Technical Reports Server (NTRS)

Hall, William A.

1990-01-01

Slave microprocessors in multimicroprocessor computing system contains modified circuit cards programmed via bus connecting master processor with slave microprocessors. Enables interactive, microprocessor-based, single-loop control. Confers ability to load and run program from master/slave bus, without need for microprocessor development station. Tristate buffers latch all data and information on status. Slave central processing unit never connected directly to bus.
Wireless Data-Acquisition System for Testing Rocket Engines

NASA Technical Reports Server (NTRS)

Lin, Chujen; Lonske, Ben; Hou, Yalin; Xu, Yingjiu; Gang, Mei

2007-01-01

A prototype wireless data-acquisition system has been developed as a potential replacement for a wired data-acquisition system heretofore used in testing rocket engines. The traditional use of wires to connect sensors, signal-conditioning circuits, and data acquisition circuitry is time-consuming and prone to error, especially when, as is often the case, many sensors are used in a test. The system includes one master and multiple slave nodes. The master node communicates with a computer via an Ethernet connection. The slave nodes are powered by rechargeable batteries and are packaged in weatherproof enclosures. The master unit and each of the slave units are equipped with a time-modulated ultra-wide-band (TMUWB) radio transceiver, which spreads its RF energy over several gigahertz by transmitting extremely low-power and super-narrow pulses. In this prototype system, each slave node can be connected to as many as six sensors: two sensors can be connected directly to analog-to-digital converters (ADCs) in the slave node and four sensors can be connected indirectly to the ADCs via signal conditioners. The maximum sampling rate for streaming data from any given sensor is about 5 kHz. The bandwidth of one channel of the TM-UWB radio communication system is sufficient to accommodate streaming of data from five slave nodes when they are fully loaded with data collected through all possible sensor connections. TM-UWB radios have a much higher spatial capacity than traditional sinusoidal wave-based radios. Hence, this TM-UWB wireless data-acquisition can be scaled to cover denser sensor setups for rocket engine test stands. Another advantage of TM-UWB radios is that it will not interfere with existing wireless transmission. The maximum radio-communication range between the master node and a slave node for this prototype system is about 50 ft (15 m) when the master and slave transceivers are equipped with small dipole antennas. The range can be increased by changing to larger antennas and/or greater transmission power. The battery life of a slave node ranges from about six hours during operation at full capacity to as long as three days when the system is in a "sleep" mode used to conserve battery charge during times between setup and rocket-engine testing. Batteries can be added to prolong operational lifetimes. The radio transceiver dominates the power consumption.
Data transmission system with distributed microprocessors

DOEpatents

Nambu, Shigeo

1985-01-01

A data transmission system having a common request line and a special request line in addition to a transmission line. The special request line has priority over the common request line. A plurality of node stations are multi-drop connected to the transmission line. Among the node stations, a supervising station is connected to the special request line and takes precedence over other slave stations to become a master station. The master station collects data from the slave stations. The station connected to the common request line can assign a master control function to any station requesting to be assigned the master control function within a short period of time. Each station has an auto response control circuit. The master station automatically collects data by the auto response controlling circuit independently of the microprocessors of the slave stations.
Enhancing Image Processing Performance for PCID in a Heterogeneous Network of Multi-core Processors

DTIC Science & Technology

2009-09-01

TFLOPS of Playstation 3 (PS3) nodes with IBM Cell Broadband Engine multi-cores and 15 dual-quad Xeon head nodes. The interconnect fabric includes... 4 3. INFORMATION MANAGEMENT FOR PARALLELIZATION AND...STREAMING............................................................. 7 4 . RESULTS
Simultaneous master-slave Omega pairs. [navigation system featuring low cost receiver

NASA Technical Reports Server (NTRS)

Burhans, R. W.

1974-01-01

Master-slave sequence ordering of the Omega system is suggested as a method of improving the pair geometry for low-cost receiver user benefit. The sequence change will not affect present sophisticated processor users other than require new labels for some pair combinations, but may require worldwide transmitter operators to slightly alter their long-range synchronizing techniques.
Eliminating livelock by assigning the same priority state to each message that is input into a flushable routing system during N time intervals

DOEpatents

Faber, V.

1994-11-29

Livelock-free message routing is provided in a network of interconnected nodes that is flushable in time T. An input message processor generates sequences of at least N time intervals, each of duration T. An input register provides for receiving and holding each input message, where the message is assigned a priority state p during an nth one of the N time intervals. At each of the network nodes a message processor reads the assigned priority state and awards priority to messages with priority state (p-1) during an nth time interval and to messages with priority state p during an (n+1) th time interval. The messages that are awarded priority are output on an output path toward the addressed output message processor. Thus, no message remains in the network for a time longer than T. 4 figures.
Eliminating livelock by assigning the same priority state to each message that is inputted into a flushable routing system during N time intervals

DOEpatents

Faber, Vance

1994-01-01

Livelock-free message routing is provided in a network of interconnected nodes that is flushable in time T. An input message processor generates sequences of at least N time intervals, each of duration T. An input register provides for receiving and holding each input message, where the message is assigned a priority state p during an nth one of the N time intervals. At each of the network nodes a message processor reads the assigned priority state and awards priority to messages with priority state (p-1) during an nth time interval and to messages with priority state p during an (n+1) th time interval. The messages that are awarded priority are output on an output path toward the addressed output message processor. Thus, no message remains in the network for a time longer than T.
Comments on Samal and Henderson: Parallel consistent labeling algorithms

DOE Office of Scientific and Technical Information (OSTI.GOV)

Swain, M.J.

Samal and Henderson claim that any parallel algorithm for enforcing arc consistency in the worst case must have {Omega}(na) sequential steps, where n is the number of nodes, and a is the number of labels per node. The authors argue that Samal and Henderon's argument makes assumptions about how processors are used and give a counterexample that enforces arc consistency in a constant number of steps using O(n{sup 2}a{sup 2}2{sup na}) processors. It is possible that the lower bound holds for a polynomial number of processors; if such a lower bound were to be proven it would answer an importantmore » open question in theoretical computer science concerning the relation between the complexity classes P and NC. The strongest existing lower bound for the arc consistency problem states that it cannot be solved in polynomial log time unless P = NC.« less
Extending the granularity of representation and control for the MIL-STD CAIS 1.0 node model

NASA Technical Reports Server (NTRS)

Rogers, Kathy L.

1986-01-01

The Common APSE (Ada 1 Program Support Environment) Interface Set (CAIS) (DoD85) node model provides an excellent baseline for interfaces in a single-host development environment. To encompass the entire spectrum of computing, however, the CAIS model should be extended in four areas. It should provide the interface between the engineering workstation and the host system throughout the entire lifecycle of the system. It should provide a basis for communication and integration functions needed by distributed host environments. It should provide common interfaces for communications mechanisms to and among target processors. It should provide facilities for integration, validation, and verification of test beds extending to distributed systems on geographically separate processors with heterogeneous instruction set architectures (ISAS). Additions to the PROCESS NODE model to extend the CAIS into these four areas are proposed.
Latency Hiding in Dynamic Partitioning and Load Balancing of Grid Computing Applications

NASA Technical Reports Server (NTRS)

Das, Sajal K.; Harvey, Daniel J.; Biswas, Rupak

2001-01-01

The Information Power Grid (IPG) concept developed by NASA is aimed to provide a metacomputing platform for large-scale distributed computations, by hiding the intricacies of highly heterogeneous environment and yet maintaining adequate security. In this paper, we propose a latency-tolerant partitioning scheme that dynamically balances processor workloads on the.IPG, and minimizes data movement and runtime communication. By simulating an unsteady adaptive mesh application on a wide area network, we study the performance of our load balancer under the Globus environment. The number of IPG nodes, the number of processors per node, and the interconnected speeds are parameterized to derive conditions under which the IPG would be suitable for parallel distributed processing of such applications. Experimental results demonstrate that effective solution are achieved when the IPG nodes are connected by a high-speed asynchronous interconnection network.
All-to-all sequenced fault detection system

DOEpatents

Archer, Charles Jens; Pinnow, Kurt Walter; Ratterman, Joseph D.; Smith, Brian Edward

2010-11-02

An apparatus, program product and method enable nodal fault detection by sequencing communications between all system nodes. A master node may coordinate communications between two slave nodes before sequencing to and initiating communications between a new pair of slave nodes. The communications may be analyzed to determine the nodal fault.

An efficient 3-dim FFT for plane wave electronic structure calculations on massively parallel machines composed of multiprocessor nodes

NASA Astrophysics Data System (ADS)

Goedecker, Stefan; Boulet, Mireille; Deutsch, Thierry

2003-08-01

Three-dimensional Fast Fourier Transforms (FFTs) are the main computational task in plane wave electronic structure calculations. Obtaining a high performance on a large numbers of processors is non-trivial on the latest generation of parallel computers that consist of nodes made up of a shared memory multiprocessors. A non-dogmatic method for obtaining high performance for such 3-dim FFTs in a combined MPI/OpenMP programming paradigm will be presented. Exploiting the peculiarities of plane wave electronic structure calculations, speedups of up to 160 and speeds of up to 130 Gflops were obtained on 256 processors.
Analysis and Implementation of Particle-to-Particle (P2P) Graphics Processor Unit (GPU) Kernel for Black-Box Adaptive Fast Multipole Method

DTIC Science & Technology

2015-06-01

5110P and 16 dx360M4 nodes each with one NVIDIA Kepler K20M/K40M GPU. Each node contained dual Intel Xeon E5-2670 (Sandy Bridge) central processing...kernel and as such does not employ multiple processors. This work makes use of a single processing core and a single NVIDIA Kepler K40 GK110...bandwidth (2 × 16 slot), 7.877 GFloat/s; Kepler K40 peak, 4,290 × 1 billion floating-point operations (GFLOPs), and 288 GB/s Kepler K40 memory
An MPA-IO interface to HPSS

NASA Technical Reports Server (NTRS)

Jones, Terry; Mark, Richard; Martin, Jeanne; May, John; Pierce, Elsie; Stanberry, Linda

1996-01-01

This paper describes an implementation of the proposed MPI-IO (Message Passing Interface - Input/Output) standard for parallel I/O. Our system uses third-party transfer to move data over an external network between the processors where it is used and the I/O devices where it resides. Data travels directly from source to destination, without the need for shuffling it among processors or funneling it through a central node. Our distributed server model lets multiple compute nodes share the burden of coordinating data transfers. The system is built on the High Performance Storage System (HPSS), and a prototype version runs on a Meiko CS-2 parallel computer.
Real-time trajectory optimization on parallel processors

NASA Technical Reports Server (NTRS)

Psiaki, Mark L.

1993-01-01

A parallel algorithm has been developed for rapidly solving trajectory optimization problems. The goal of the work has been to develop an algorithm that is suitable to do real-time, on-line optimal guidance through repeated solution of a trajectory optimization problem. The algorithm has been developed on an INTEL iPSC/860 message passing parallel processor. It uses a zero-order-hold discretization of a continuous-time problem and solves the resulting nonlinear programming problem using a custom-designed augmented Lagrangian nonlinear programming algorithm. The algorithm achieves parallelism of function, derivative, and search direction calculations through the principle of domain decomposition applied along the time axis. It has been encoded and tested on 3 example problems, the Goddard problem, the acceleration-limited, planar minimum-time to the origin problem, and a National Aerospace Plane minimum-fuel ascent guidance problem. Execution times as fast as 118 sec of wall clock time have been achieved for a 128-stage Goddard problem solved on 32 processors. A 32-stage minimum-time problem has been solved in 151 sec on 32 processors. A 32-stage National Aerospace Plane problem required 2 hours when solved on 32 processors. A speed-up factor of 7.2 has been achieved by using 32-nodes instead of 1-node to solve a 64-stage Goddard problem.
Generic Divide and Conquer Internet-Based Computing

NASA Technical Reports Server (NTRS)

Radenski, Atanas; Follen, Gregory J. (Technical Monitor)

2001-01-01

The rapid growth of internet-based applications and the proliferation of networking technologies have been transforming traditional commercial application areas as well as computer and computational sciences and engineering. This growth stimulates the exploration of new, internet-oriented software technologies that can open new research and application opportunities not only for the commercial world, but also for the scientific and high -performance computing applications community. The general goal of this research project is to contribute to better understanding of the transition to internet-based high -performance computing and to develop solutions for some of the difficulties of this transition. More specifically, our goal is to design an architecture for generic divide and conquer internet-based computing, to develop a portable implementation of this architecture, to create an example library of high-performance divide-and-conquer computing agents that run on top of this architecture, and to evaluate the performance of these agents. We have been designing an architecture that incorporates a master task-pool server and utilizes satellite computational servers that operate on the Internet in a dynamically changing large configuration of lower-end nodes provided by volunteer contributors. Our designed architecture is intended to be complementary to and accessible from computational grids such as Globus, Legion, and Condor. Grids provide remote access to existing high-end computing resources; in contrast, our goal is to utilize idle processor time of lower-end internet nodes. Our project is focused on a generic divide-and-conquer paradigm and its applications that operate on a loose and ever changing pool of lower-end internet nodes.
Bluetooth-based wireless sensor networks

NASA Astrophysics Data System (ADS)

You, Ke; Liu, Rui Qiang

2007-11-01

In this work a Bluetooth-based wireless sensor network is proposed. In this bluetooth-based wireless sensor networks, information-driven star topology and energy-saved mode are used, through which a blue master node can control more than seven slave node, the energy of each sensor node is reduced and secure management of each sensor node is improved.
A Locality-Based Threading Algorithm for the Configuration-Interaction Method

DOE Office of Scientific and Technical Information (OSTI.GOV)

Shan, Hongzhang; Williams, Samuel; Johnson, Calvin

The Configuration Interaction (CI) method has been widely used to solve the non-relativistic many-body Schrodinger equation. One great challenge to implementing it efficiently on manycore architectures is its immense memory and data movement requirements. To address this issue, within each node, we exploit a hybrid MPI+OpenMP programming model in lieu of the traditional flat MPI programming model. Here in this paper, we develop optimizations that partition the workloads among OpenMP threads based on data locality,-which is essential in ensuring applications with complex data access patterns scale well on manycore architectures. The new algorithm scales to 256 threadson the 64-core Intelmore » Knights Landing (KNL) manycore processor and 24 threads on dual-socket Ivy Bridge (Xeon) nodes. Compared with the original implementation, the performance has been improved by up to 7× on theKnights Landing processor and 3× on the dual-socket Ivy Bridge node.« less
A Locality-Based Threading Algorithm for the Configuration-Interaction Method

DOE PAGES

Shan, Hongzhang; Williams, Samuel; Johnson, Calvin; ...

2017-07-03

The Configuration Interaction (CI) method has been widely used to solve the non-relativistic many-body Schrodinger equation. One great challenge to implementing it efficiently on manycore architectures is its immense memory and data movement requirements. To address this issue, within each node, we exploit a hybrid MPI+OpenMP programming model in lieu of the traditional flat MPI programming model. Here in this paper, we develop optimizations that partition the workloads among OpenMP threads based on data locality,-which is essential in ensuring applications with complex data access patterns scale well on manycore architectures. The new algorithm scales to 256 threadson the 64-core Intelmore » Knights Landing (KNL) manycore processor and 24 threads on dual-socket Ivy Bridge (Xeon) nodes. Compared with the original implementation, the performance has been improved by up to 7× on theKnights Landing processor and 3× on the dual-socket Ivy Bridge node.« less
Direct match data flow machine apparatus and process for data driven computing

DOEpatents

Davidson, G.S.; Grafe, V.G.

1997-08-12

A data flow computer and method of computing are disclosed which utilizes a data driven processor node architecture. The apparatus in a preferred embodiment includes a plurality of First-In-First-Out (FIFO) registers, a plurality of related data flow memories, and a processor. The processor makes the necessary calculations and includes a control unit to generate signals to enable the appropriate FIFO register receiving the result. In a particular embodiment, there are three FIFO registers per node: an input FIFO register to receive input information form an outside source and provide it to the data flow memories; an output FIFO register to provide output information from the processor to an outside recipient; and an internal FIFO register to provide information from the processor back to the data flow memories. The data flow memories are comprised of four commonly addressed memories. A parameter memory holds the A and B parameters used in the calculations; an opcode memory holds the instruction; a target memory holds the output address; and a tag memory contains status bits for each parameter. One status bit indicates whether the corresponding parameter is in the parameter memory and one status but to indicate whether the stored information in the corresponding data parameter is to be reused. The tag memory outputs a ``fire`` signal (signal R VALID) when all of the necessary information has been stored in the data flow memories, and thus when the instruction is ready to be fired to the processor. 11 figs.
Data flow machine for data driven computing

DOEpatents

Davidson, G.S.; Grafe, V.G.

1988-07-22

A data flow computer and method of computing is disclosed which utilizes a data driven processor node architecture. The apparatus in a preferred embodiment includes a plurality of First-In-First-Out (FIFO) registers, a plurality of related data flow memories, and a processor. The processor makes the necessary calculations and includes a control unit to generate signals to enable the appropriate FIFO register receiving the result. In a particular embodiment, there are three FIFO registers per node: an input FIFO register to receive input information from an outside source and provide it to the data flow memories; an output FIFO register to provide output information from the processor to an outside recipient; and an internal FIFO register to provide information from the processor back to the data flow memories. The data flow memories are comprised of four commonly addressed memories. A parameter memory holds the A and B parameters used in the calculations; an opcode memory holds the instruction; a target memory holds the output address; and a tag memory contains status bits for each parameter. One status bit indicates whether the corresponding parameter is in the parameter memory and one status bit to indicate whether the stored information in the corresponding data parameter is to be reused. The tag memory outputs a ''fire'' signal (signal R VALID) when all of the necessary information has been stored in the data flow memories, and thus when the instruction is ready to be fired to the processor. 11 figs.
Data flow machine for data driven computing

DOEpatents

Davidson, George S.; Grafe, Victor G.

1995-01-01

A data flow computer which of computing is disclosed which utilizes a data driven processor node architecture. The apparatus in a preferred embodiment includes a plurality of First-In-First-Out (FIFO) registers, a plurality of related data flow memories, and a processor. The processor makes the necessary calculations and includes a control unit to generate signals to enable the appropriate FIFO register receiving the result. In a particular embodiment, there are three FIFO registers per node: an input FIFO register to receive input information form an outside source and provide it to the data flow memories; an output FIFO register to provide output information from the processor to an outside recipient; and an internal FIFO register to provide information from the processor back to the data flow memories. The data flow memories are comprised of four commonly addressed memories. A parameter memory holds the A and B parameters used in the calculations; an opcode memory holds the instruction; a target memory holds the output address; and a tag memory contains status bits for each parameter. One status bit indicates whether the corresponding parameter is in the parameter memory and one status but to indicate whether the stored information in the corresponding data parameter is to be reused. The tag memory outputs a "fire" signal (signal R VALID) when all of the necessary information has been stored in the data flow memories, and thus when the instruction is ready to be fired to the processor.
Direct match data flow machine apparatus and process for data driven computing

DOEpatents

Davidson, George S.; Grafe, Victor Gerald

1997-01-01

A data flow computer and method of computing is disclosed which utilizes a data driven processor node architecture. The apparatus in a preferred embodiment includes a plurality of First-In-First-Out (FIFO) registers, a plurality of related data flow memories, and a processor. The processor makes the necessary calculations and includes a control unit to generate signals to enable the appropriate FIFO register receiving the result. In a particular embodiment, there are three FIFO registers per node: an input FIFO register to receive input information form an outside source and provide it to the data flow memories; an output FIFO register to provide output information from the processor to an outside recipient; and an internal FIFO register to provide information from the processor back to the data flow memories. The data flow memories are comprised of four commonly addressed memories. A parameter memory holds the A and B parameters used in the calculations; an opcode memory holds the instruction; a target memory holds the output address; and a tag memory contains status bits for each parameter. One status bit indicates whether the corresponding parameter is in the parameter memory and one status but to indicate whether the stored information in the corresponding data parameter is to be reused. The tag memory outputs a "fire" signal (signal R VALID) when all of the necessary information has been stored in the data flow memories, and thus when the instruction is ready to be fired to the processor.
Direct match data flow memory for data driven computing

DOEpatents

Davidson, George S.; Grafe, Victor Gerald

1997-01-01

A data flow computer and method of computing is disclosed which utilizes a data driven processor node architecture. The apparatus in a preferred embodiment includes a plurality of First-In-First-Out (FIFO) registers, a plurality of related data flow memories, and a processor. The processor makes the necessary calculations and includes a control unit to generate signals to enable the appropriate FIFO register receiving the result. In a particular embodiment, there are three FIFO registers per node: an input FIFO register to receive input information form an outside source and provide it to the data flow memories; an output FIFO register to provide output information from the processor to an outside recipient; and an internal FIFO register to provide information from the processor back to the data flow memories. The data flow memories are comprised of four commonly addressed memories. A parameter memory holds the A and B parameters used in the calculations; an opcode memory holds the instruction; a target memory holds the output address; and a tag memory contains status bits for each parameter. One status bit indicates whether the corresponding parameter is in the parameter memory and one status bit to indicate whether the stored information in the corresponding data parameter is to be reused. The tag memory outputs a "fire" signal (signal R VALID) when all of the necessary information has been stored in the data flow memories, and thus when the instruction is ready to be fired to the processor.
Direct match data flow memory for data driven computing

DOEpatents

Davidson, G.S.; Grafe, V.G.

1997-10-07

A data flow computer and method of computing is disclosed which utilizes a data driven processor node architecture. The apparatus in a preferred embodiment includes a plurality of First-In-First-Out (FIFO) registers, a plurality of related data flow memories, and a processor. The processor makes the necessary calculations and includes a control unit to generate signals to enable the appropriate FIFO register receiving the result. In a particular embodiment, there are three FIFO registers per node: an input FIFO register to receive input information form an outside source and provide it to the data flow memories; an output FIFO register to provide output information from the processor to an outside recipient; and an internal FIFO register to provide information from the processor back to the data flow memories. The data flow memories are comprised of four commonly addressed memories. A parameter memory holds the A and B parameters used in the calculations; an opcode memory holds the instruction; a target memory holds the output address; and a tag memory contains status bits for each parameter. One status bit indicates whether the corresponding parameter is in the parameter memory and one status bit to indicate whether the stored information in the corresponding data parameter is to be reused. The tag memory outputs a ``fire`` signal (signal R VALID) when all of the necessary information has been stored in the data flow memories, and thus when the instruction is ready to be fired to the processor. 11 figs.
Massively parallel processor networks with optical express channels

DOEpatents

Deri, R.J.; Brooks, E.D. III; Haigh, R.E.; DeGroot, A.J.

1999-08-24

An optical method for separating and routing local and express channel data comprises interconnecting the nodes in a network with fiber optic cables. A single fiber optic cable carries both express channel traffic and local channel traffic, e.g., in a massively parallel processor (MPP) network. Express channel traffic is placed on, or filtered from, the fiber optic cable at a light frequency or a color different from that of the local channel traffic. The express channel traffic is thus placed on a light carrier that skips over the local intermediate nodes one-by-one by reflecting off of selective mirrors placed at each local node. The local-channel-traffic light carriers pass through the selective mirrors and are not reflected. A single fiber optic cable can thus be threaded throughout a three-dimensional matrix of nodes with the x,y,z directions of propagation encoded by the color of the respective light carriers for both local and express channel traffic. Thus frequency division multiple access is used to hierarchically separate the local and express channels to eliminate the bucket brigade latencies that would otherwise result if the express traffic had to hop between every local node to reach its ultimate destination. 3 figs.
Massively parallel processor networks with optical express channels

DOEpatents

Deri, Robert J.; Brooks, III, Eugene D.; Haigh, Ronald E.; DeGroot, Anthony J.

1999-01-01

An optical method for separating and routing local and express channel data comprises interconnecting the nodes in a network with fiber optic cables. A single fiber optic cable carries both express channel traffic and local channel traffic, e.g., in a massively parallel processor (MPP) network. Express channel traffic is placed on, or filtered from, the fiber optic cable at a light frequency or a color different from that of the local channel traffic. The express channel traffic is thus placed on a light carrier that skips over the local intermediate nodes one-by-one by reflecting off of selective mirrors placed at each local node. The local-channel-traffic light carriers pass through the selective mirrors and are not reflected. A single fiber optic cable can thus be threaded throughout a three-dimensional matrix of nodes with the x,y,z directions of propagation encoded by the color of the respective light carriers for both local and express channel traffic. Thus frequency division multiple access is used to hierarchically separate the local and express channels to eliminate the bucket brigade latencies that would otherwise result if the express traffic had to hop between every local node to reach its ultimate destination.
One Way of Testing a Distributed Processor

NASA Technical Reports Server (NTRS)

Edstrom, R.; Kleckner, D.

1982-01-01

Launch processing for Space Shuttle is checked out, controlled, and monitored with new system. Entire system can be exercised by two computer programs--one in master console and other in each of operations consoles. Control program in each operations console detects change in status and begins task initiation. All of front-end processors are exercised from consoles through common data buffer, and all data are logged to processed-data recorder for posttest analysis.
Control Circuitry for High Speed VLSI (Very Large Scale Integration) Winograd Fourier Transform Processors.

DTIC Science & Technology

1985-12-01

Office of Scientific Research , and Air Force Space Division are sponsoring research for the development of a high speed DFT processor. This DFT...to the arithmetic circuitry through a master/slave 11-15 %v OPR ONESHOT OUTPUT OUTPUT .., ~ INITIALIZATION COLUMN’ 00 N DONE CUTRPLANE PLAtNE Figure...Since the TSP is an NP-complete problem, many mathematicians, operations researchers , computer scientists and the like have proposed heuristic
A Spacecraft Housekeeping System-on-Chip in a Radiation Hardened Structured ASIC

NASA Technical Reports Server (NTRS)

Suarez, George; DuMonthier, Jeffrey J.; Sheikh, Salman S.; Powell, Wesley A.; King, Robyn L.

2012-01-01

Housekeeping systems are essential to health monitoring of spacecraft and instruments. Typically, sensors are distributed across various sub-systems and data is collected using components such as analog-to-digital converters, analog multiplexers and amplifiers. In most cases programmable devices are used to implement the data acquisition control and storage, and the interface to higher level systems. Such discrete implementations require additional size, weight, power and interconnect complexity versus an integrated circuit solution, as well as the qualification of multiple parts. Although commercial devices are readily available, they are not suitable for space applications due the radiation tolerance and qualification requirements. The Housekeeping System-o n-A-Chip (HKSOC) is a low power, radiation hardened integrated solution suitable for spacecraft and instrument control and data collection. A prototype has been designed and includes a wide variety of functions including a 16-channel analog front-end for driving and reading sensors, analog-to-digital and digital-to-analog converters, on-chip temperature sensor, power supply current sense circuits, general purpose comparators and amplifiers, a 32-bit processor, digital I/O, pulse-width modulation (PWM) generators, timers and I2C master and slave serial interfaces. In addition, the device can operate in a bypass mode where the processor is disabled and external logic is used to control the analog and mixed signal functions. The device is suitable for stand-alone or distributed systems where multiple chips can be deployed across different sub-systems as intelligent nodes with computing and processing capabilities.
GSFC Cutting Edge Avionics Technologies for Spacecraft

NASA Technical Reports Server (NTRS)

Luers, Philip J.; Culver, Harry L.; Plante, Jeannette

1998-01-01

With the launch of NASA's first fiber optic bus on SAMPEX in 1992, GSFC has ushered in an era of new technology development and insertion into flight programs. Predating such programs the Lewis and Clark missions and the New Millenium Program, GSFC has spearheaded the drive to use cutting edge technologies on spacecraft for three reasons: to enable next generation Space and Earth Science, to shorten spacecraft development schedules, and to reduce the cost of NASA missions. The technologies developed have addressed three focus areas: standard interface components, high performance processing, and high-density packaging techniques enabling lower cost systems. To realize the benefits of standard interface components GSFC has developed and utilized radiation hardened/tolerant devices such as PCI target ASICs, Parallel Fiber Optic Data Bus terminals, MIL-STD-1773 and AS1773 transceivers, and Essential Services Node. High performance processing has been the focus of the Mongoose I and Mongoose V rad-hard 32-bit processor programs as well as the SMEX-Lite Computation Hub. High-density packaging techniques have resulted in 3-D stack DRAM packages and Chip-On-Board processes. Lower cost systems have been demonstrated by judiciously using all of our technology developments to enable "plug and play" scalable architectures. The paper will present a survey of development and insertion experiences for the above technologies, as well as future plans to enable more "better, faster, cheaper" spacecraft. Details of ongoing GSFC programs such as Ultra-Low Power electronics, Rad-Hard FPGAs, PCI master ASICs, and Next Generation Mongoose processors.

Investigation of Large Scale Cortical Models on Clustered Multi-Core Processors

DTIC Science & Technology

2013-02-01

with the bias node ( gray ) denoted as ww and the weights associated with the remaining first layer nodes (black) denoted as W. In forming the overall...Implementation of RBF network on GPU Platform 3.5.1 The Cholesky decomposition algorithm We need to invert the matrix multiplication GTG to
Energy-Efficient Querying of Wireless Sensor Networks

DTIC Science & Technology

2007-09-01

will fail to locate the desired information. Depending on the rate of node movement , this data exchange will be costly in terms of total network...nodes is best accomplished using a small time window to reduce errors introduced by the node’s movement (i.e., older measurements are less likely to...embedded processor or input from upper layer applications,” nodes which detect their own movement transmit an alert signal over a “wake-up” channel
Dynamic Sensor Networks

DTIC Science & Technology

2004-03-01

turned off. SLEEP Set the timer for 30 seconds before scheduled transmit time, then sleep the processor. WAKE When timer trips, power up the processor...slots where none of its neighbors are schedule to transmit. This allows the sensor nodes to perform a simple power man- agement scheme that puts the...routing This simple case study highlights the following crucial observation: optimal traffic scheduling in energy constrained networks requires future
High Performance Active Database Management on a Shared-Nothing Parallel Processor

DTIC Science & Technology

1998-05-01

either stored or virtual. A stored node is like a materialized view. It actually contains the specified tuples. A virtual node is like a real view...90292-6695 DL-5 COLUMBIA UNIV/DEPT COMPUTER SCIENCi ATTN: OR GAIL £. KAISER 450 COMPUTER SCIENCE 3LDG 500 WEST 12ÖTH STRSET NEW YORK NY 10027
A wireless laser displacement sensor node for structural health monitoring.

PubMed

Park, Hyo Seon; Kim, Jong Moon; Choi, Se Woon; Kim, Yousok

2013-09-30

This study describes a wireless laser displacement sensor node that measures displacement as a representative damage index for structural health monitoring (SHM). The proposed measurement system consists of a laser displacement sensor (LDS) and a customized wireless sensor node. Wireless communication is enabled by a sensor node that consists of a sensor module, a code division multiple access (CDMA) communication module, a processor, and a power module. An LDS with a long measurement distance is chosen to increase field applicability. For a wireless sensor node driven by a battery, we use a power control module with a low-power processor, which facilitates switching between the sleep and active modes, thus maximizing the power consumption efficiency during non-measurement and non-transfer periods. The CDMA mode is also used to overcome the limitation of communication distance, which is a challenge for wireless sensor networks and wireless communication. To evaluate the reliability and field applicability of the proposed wireless displacement measurement system, the system is tested onsite to obtain the required vertical displacement measurements during the construction of mega-trusses and an edge truss, which are the primary structural members in a large-scale irregular building currently under construction. The measurement values confirm the validity of the proposed wireless displacement measurement system and its potential for use in safety evaluations of structural elements.
A parallel algorithm for 2D visco-acoustic frequency-domain full-waveform inversion: application to a dense OBS data set

NASA Astrophysics Data System (ADS)

Sourbier, F.; Operto, S.; Virieux, J.

2006-12-01

We present a distributed-memory parallel algorithm for 2D visco-acoustic full-waveform inversion of wide-angle seismic data. Our code is written in fortran90 and use MPI for parallelism. The algorithm was applied to real wide-angle data set recorded by 100 OBSs with a 1-km spacing in the eastern-Nankai trough (Japan) to image the deep structure of the subduction zone. Full-waveform inversion is applied sequentially to discrete frequencies by proceeding from the low to the high frequencies. The inverse problem is solved with a classic gradient method. Full-waveform modeling is performed with a frequency-domain finite-difference method. In the frequency-domain, solving the wave equation requires resolution of a large unsymmetric system of linear equations. We use the massively parallel direct solver MUMPS (http://www.enseeiht.fr/irit/apo/MUMPS) for distributed-memory computer to solve this system. The MUMPS solver is based on a multifrontal method for the parallel factorization. The MUMPS algorithm is subdivided in 3 main steps: a symbolic analysis step that performs re-ordering of the matrix coefficients to minimize the fill-in of the matrix during the subsequent factorization and an estimation of the assembly tree of the matrix. Second, the factorization is performed with dynamic scheduling to accomodate numerical pivoting and provides the LU factors distributed over all the processors. Third, the resolution is performed for multiple sources. To compute the gradient of the cost function, 2 simulations per shot are required (one to compute the forward wavefield and one to back-propagate residuals). The multi-source resolutions can be performed in parallel with MUMPS. In the end, each processor stores in core a sub-domain of all the solutions. These distributed solutions can be exploited to compute in parallel the gradient of the cost function. Since the gradient of the cost function is a weighted stack of the shot and residual solutions of MUMPS, each processor computes the corresponding sub-domain of the gradient. In the end, the gradient is centralized on the master processor using a collective communation. The gradient is scaled by the diagonal elements of the Hessian matrix. This scaling is computed only once per frequency before the first iteration of the inversion. Estimation of the diagonal terms of the Hessian requires performing one simulation per non redondant shot and receiver position. The same strategy that the one used for the gradient is used to compute the diagonal Hessian in parallel. This algorithm was applied to a dense wide-angle data set recorded by 100 OBSs in the eastern Nankai trough, offshore Japan. Thirteen frequencies ranging from 3 and 15 Hz were inverted. Tweny iterations per frequency were computed leading to 260 tomographic velocity models of increasing resolution. The velocity model dimensions are 105 km x 25 km corresponding to a finite-difference grid of 4201 x 1001 grid with a 25-m grid interval. The number of shot was 1005 and the number of inverted OBS gathers was 93. The inversion requires 20 days on 6 32-bits bi-processor nodes with 4 Gbytes of RAM memory per node when only the LU factorization is performed in parallel. Preliminary estimations of the time required to perform the inversion with the fully-parallelized code is 6 and 4 days using 20 and 50 processors respectively.
Network Coding on Heterogeneous Multi-Core Processors for Wireless Sensor Networks

PubMed Central

Kim, Deokho; Park, Karam; Ro, Won W.

2011-01-01

While network coding is well known for its efficiency and usefulness in wireless sensor networks, the excessive costs associated with decoding computation and complexity still hinder its adoption into practical use. On the other hand, high-performance microprocessors with heterogeneous multi-cores would be used as processing nodes of the wireless sensor networks in the near future. To this end, this paper introduces an efficient network coding algorithm developed for the heterogenous multi-core processors. The proposed idea is fully tested on one of the currently available heterogeneous multi-core processors referred to as the Cell Broadband Engine. PMID:22164053
Treecode with a Special-Purpose Processor

NASA Astrophysics Data System (ADS)

Makino, Junichiro

1991-08-01

We describe an implementation of the modified Barnes-Hut tree algorithm for a gravitational N-body calculation on a GRAPE (GRAvity PipE) backend processor. GRAPE is a special-purpose computer for N-body calculations. It receives the positions and masses of particles from a host computer and then calculates the gravitational force at each coordinate specified by the host. To use this GRAPE processor with the hierarchical tree algorithm, the host computer must maintain a list of all nodes that exert force on a particle. If we create this list for each particle of the system at each timestep, the number of floating-point operations on the host and that on GRAPE would become comparable, and the increased speed obtained by using GRAPE would be small. In our modified algorithm, we create a list of nodes for many particles. Thus, the amount of the work required of the host is significantly reduced. This algorithm was originally developed by Barnes in order to vectorize the force calculation on a Cyber 205. With this algorithm, the computing time of the force calculation becomes comparable to that of the tree construction, if the GRAPE backend processor is sufficiently fast. The obtained speed-up factor is 30 to 50 for a RISC-based host computer and GRAPE-1A with a peak speed of 240 Mflops.
Parallel discrete event simulation: A shared memory approach

NASA Technical Reports Server (NTRS)

Reed, Daniel A.; Malony, Allen D.; Mccredie, Bradley D.

1987-01-01

With traditional event list techniques, evaluating a detailed discrete event simulation model can often require hours or even days of computation time. Parallel simulation mimics the interacting servers and queues of a real system by assigning each simulated entity to a processor. By eliminating the event list and maintaining only sufficient synchronization to insure causality, parallel simulation can potentially provide speedups that are linear in the number of processors. A set of shared memory experiments is presented using the Chandy-Misra distributed simulation algorithm to simulate networks of queues. Parameters include queueing network topology and routing probabilities, number of processors, and assignment of network nodes to processors. These experiments show that Chandy-Misra distributed simulation is a questionable alternative to sequential simulation of most queueing network models.
Efficient packet forwarding using cyber-security aware policies

DOEpatents

Ros-Giralt, Jordi

2017-04-04

For balancing load, a forwarder can selectively direct data from the forwarder to a processor according to a loading parameter. The selective direction includes forwarding the data to the processor for processing, transforming and/or forwarding the data to another node, and dropping the data. The forwarder can also adjust the loading parameter based on, at least in part, feedback received from the processor. One or more processing elements can store values associated with one or more flows into a structure without locking the structure. The stored values can be used to determine how to direct the flows, e.g., whether to process a flow or to drop it. The structure can be used within an information channel providing feedback to a processor.
Efficient packet forwarding using cyber-security aware policies

DOEpatents

Ros-Giralt, Jordi

2017-10-25

For balancing load, a forwarder can selectively direct data from the forwarder to a processor according to a loading parameter. The selective direction includes forwarding the data to the processor for processing, transforming and/or forwarding the data to another node, and dropping the data. The forwarder can also adjust the loading parameter based on, at least in part, feedback received from the processor. One or more processing elements can store values associated with one or more flows into a structure without locking the structure. The stored values can be used to determine how to direct the flows, e.g., whether to process a flow or to drop it. The structure can be used within an information channel providing feedback to a processor.
Marionette

DOE Office of Scientific and Technical Information (OSTI.GOV)

Sullivan, M.; Anderson, D.P.

1988-01-01

Marionette is a system for distributed parallel programming in an environment of networked heterogeneous computer systems. It is based on a master/slave model. The master process can invoke worker operations (asynchronous remote procedure calls to single slaves) and context operations (updates to the state of all slaves). The master and slaves also interact through shared data structures that can be modified only by the master. The master and slave processes are programmed in a sequential language. The Marionette runtime system manages slave process creation, propagates shared data structures to slaves as needed, queues and dispatches worker and context operations, andmore » manages recovery from slave processor failures. The Marionette system also includes tools for automated compilation of program binaries for multiple architectures, and for distributing binaries to remote fuel systems. A UNIX-based implementation of Marionette is described.« less
Local area network with fault-checking, priorities, and redundant backup

NASA Technical Reports Server (NTRS)

Morales, Sergio (Inventor); Friedman, Gary L. (Inventor)

1989-01-01

This invention is a redundant error detecting and correcting local area networked computer system having a plurality of nodes each including a network connector board within the node for connecting to an interfacing transceiver operably attached to a network cable. There is a first network cable disposed along a path to interconnect the nodes. The first network cable includes a plurality of first interfacing transceivers attached thereto. A second network cable is disposed in parallel with the first cable and, in like manner, includes a plurality of second interfacing transceivers attached thereto. There are a plurality of three position switches each having a signal input, three outputs for individual selective connection to the input, and a control input for receiving signals designating which of the outputs is to be connected to the signal input. Each of the switches includes means for designating a response address for responding to addressed signals appearing at the control input and each of the switches further has its signal input connected to a respective one of the input/output lines from the nodes. Also, one of the three outputs is connected to a repective one of the plurality of first interfacing transceivers. There is master switch control means having an output connected to the control inputs of the plurality of three position switches and an input for receiving directive signals for outputting addressed switch position signals to the three position switches as well as monitor and control computer means having a pair of network connector boards therein connected to respective ones of one of the first interfacing transceivers and one of the second interfacing transceivers and an output connected to the input of the master switch means for monitoring the status of the networked computer system by sending messages to the nodes and receiving and verifying messages therefrom and for sending control signals to the master switch to cause the master switch to cause respective ones of the nodes to use a desired one of the first and second cables for transmitting and receiving messages and for disconnecting desired ones of the nodes from both cables.
Checkpointing for a hybrid computing node

DOE Office of Scientific and Technical Information (OSTI.GOV)

Cher, Chen-Yong

2016-03-08

According to an aspect, a method for checkpointing in a hybrid computing node includes executing a task in a processing accelerator of the hybrid computing node. A checkpoint is created in a local memory of the processing accelerator. The checkpoint includes state data to restart execution of the task in the processing accelerator upon a restart operation. Execution of the task is resumed in the processing accelerator after creating the checkpoint. The state data of the checkpoint are transferred from the processing accelerator to a main processor of the hybrid computing node while the processing accelerator is executing the task.
Spacecraft On-Board Information Extraction Computer (SOBIEC)

NASA Technical Reports Server (NTRS)

Eisenman, David; Decaro, Robert E.; Jurasek, David W.

1994-01-01

The Jet Propulsion Laboratory is the Technical Monitor on an SBIR Program issued for Irvine Sensors Corporation to develop a highly compact, dual use massively parallel processing node known as SOBIEC. SOBIEC couples 3D memory stacking technology provided by nCUBE. The node contains sufficient network Input/Output to implement up to an order-13 binary hypercube. The benefit of this network, is that it scales linearly as more processors are added, and it is a superset of other commonly used interconnect topologies such as: meshes, rings, toroids, and trees. In this manner, a distributed processing network can be easily devised and supported. The SOBIEC node has sufficient memory for most multi-computer applications, and also supports external memory expansion and DMA interfaces. The SOBIEC node is supported by a mature set of software development tools from nCUBE. The nCUBE operating system (OS) provides configuration and operational support for up to 8000 SOBIEC processors in an order-13 binary hypercube or any subset or partition(s) thereof. The OS is UNIX (USL SVR4) compatible, with C, C++, and FORTRAN compilers readily available. A stand-alone development system is also available to support SOBIEC test and integration.
New Modular Ultrasonic Signal Processing Building Blocks for Real-Time Data Acquisition and Post Processing

NASA Astrophysics Data System (ADS)

Weber, Walter H.; Mair, H. Douglas; Jansen, Dion

2003-03-01

A suite of basic signal processors has been developed. These basic building blocks can be cascaded together to form more complex processors without the need for programming. The data structures between each of the processors are handled automatically. This allows a processor built for one purpose to be applied to any type of data such as images, waveform arrays and single values. The processors are part of Winspect Data Acquisition software. The new processors are fast enough to work on A-scan signals live while scanning. Their primary use is to extract features, reduce noise or to calculate material properties. The cascaded processors work equally well on live A-scan displays, live gated data or as a post-processing engine on saved data. Researchers are able to call their own MATLAB or C-code from anywhere within the processor structure. A built-in formula node processor that uses a simple algebraic editor may make external user programs unnecessary. This paper also discusses the problems associated with ad hoc software development and how graphical programming languages can tie up researchers writing software rather than designing experiments.
Systems and Methods for Locating a Target in a GPS-Denied Environment

NASA Technical Reports Server (NTRS)

Mackay, John D. (Inventor); Murdock, Ronald G. (Inventor); Cummins, Douglas A. (Inventor)

2017-01-01

A system for locating an object in a GPS-denied environment includes first and second stationary nodes of a network and an object out of synchronization with a common time base of the network. The system includes one or more processors that are configured to estimate distances between the first stationary node and the object and a distance between the second stationary node and the object by comparing time-stamps of messages relayed between the object and the nodes. A position of the object can then be trilaterated using a location of each of the first and second stationary nodes and the measured distances between the object and each of the first and second stationary nodes.
Performances of multiprocessor multidisk architectures for continuous media storage

NASA Astrophysics Data System (ADS)

Gennart, Benoit A.; Messerli, Vincent; Hersch, Roger D.

1996-03-01

Multimedia interfaces increase the need for large image databases, capable of storing and reading streams of data with strict synchronicity and isochronicity requirements. In order to fulfill these requirements, we consider a parallel image server architecture which relies on arrays of intelligent disk nodes, each disk node being composed of one processor and one or more disks. This contribution analyzes through bottleneck performance evaluation and simulation the behavior of two multi-processor multi-disk architectures: a point-to-point architecture and a shared-bus architecture similar to current multiprocessor workstation architectures. We compare the two architectures on the basis of two multimedia algorithms: the compute-bound frame resizing by resampling and the data-bound disk-to-client stream transfer. The results suggest that the shared bus is a potential bottleneck despite its very high hardware throughput (400Mbytes/s) and that an architecture with addressable local memories located closely to their respective processors could partially remove this bottleneck. The point- to-point architecture is scalable and able to sustain high throughputs for simultaneous compute- bound and data-bound operations.
Class network routing

DOEpatents

Bhanot, Gyan [Princeton, NJ; Blumrich, Matthias A [Ridgefield, CT; Chen, Dong [Croton On Hudson, NY; Coteus, Paul W [Yorktown Heights, NY; Gara, Alan G [Mount Kisco, NY; Giampapa, Mark E [Irvington, NY; Heidelberger, Philip [Cortlandt Manor, NY; Steinmacher-Burow, Burkhard D [Mount Kisco, NY; Takken, Todd E [Mount Kisco, NY; Vranas, Pavlos M [Bedford Hills, NY

2009-09-08

Class network routing is implemented in a network such as a computer network comprising a plurality of parallel compute processors at nodes thereof. Class network routing allows a compute processor to broadcast a message to a range (one or more) of other compute processors in the computer network, such as processors in a column or a row. Normally this type of operation requires a separate message to be sent to each processor. With class network routing pursuant to the invention, a single message is sufficient, which generally reduces the total number of messages in the network as well as the latency to do a broadcast. Class network routing is also applied to dense matrix inversion algorithms on distributed memory parallel supercomputers with hardware class function (multicast) capability. This is achieved by exploiting the fact that the communication patterns of dense matrix inversion can be served by hardware class functions, which results in faster execution times.
Using algebra for massively parallel processor design and utilization

NASA Technical Reports Server (NTRS)

Campbell, Lowell; Fellows, Michael R.

1990-01-01

This paper summarizes the author's advances in the design of dense processor networks. Within is reported a collection of recent constructions of dense symmetric networks that provide the largest know values for the number of nodes that can be placed in a network of a given degree and diameter. The constructions are in the range of current potential engineering significance and are based on groups of automorphisms of finite-dimensional vector spaces.

DFT algorithms for bit-serial GaAs array processor architectures

NASA Technical Reports Server (NTRS)

Mcmillan, Gary B.

1988-01-01

Systems and Processes Engineering Corporation (SPEC) has developed an innovative array processor architecture for computing Fourier transforms and other commonly used signal processing algorithms. This architecture is designed to extract the highest possible array performance from state-of-the-art GaAs technology. SPEC's architectural design includes a high performance RISC processor implemented in GaAs, along with a Floating Point Coprocessor and a unique Array Communications Coprocessor, also implemented in GaAs technology. Together, these data processors represent the latest in technology, both from an architectural and implementation viewpoint. SPEC has examined numerous algorithms and parallel processing architectures to determine the optimum array processor architecture. SPEC has developed an array processor architecture with integral communications ability to provide maximum node connectivity. The Array Communications Coprocessor embeds communications operations directly in the core of the processor architecture. A Floating Point Coprocessor architecture has been defined that utilizes Bit-Serial arithmetic units, operating at very high frequency, to perform floating point operations. These Bit-Serial devices reduce the device integration level and complexity to a level compatible with state-of-the-art GaAs device technology.
Implementing direct, spatially isolated problems on transputer networks

NASA Technical Reports Server (NTRS)

Ellis, Graham K.

1988-01-01

Parametric studies were performed on transputer networks of up to 40 processors to determine how to implement and maximize the performance of the solution of problems where no processor-to-processor data transfer is required for the problem solution (spatially isolated). Two types of problems are investigated a computationally intensive problem where the solution required the transmission of 160 bytes of data through the parallel network, and a communication intensive example that required the transmission of 3 Mbytes of data through the network. This data consists of solutions being sent back to the host processor and not intermediate results for another processor to work on. Studies were performed on both integer and floating-point transputers. The latter features an on-chip floating-point math unit and offers approximately an order of magnitude performance increase over the integer transputer on real valued computations. The results indicate that a minimum amount of work is required on each node per communication to achieve high network speedups (efficiencies). The floating-point processor requires approximately an order of magnitude more work per communication than the integer processor because of the floating-point unit's increased computing capacity.
Support for Diagnosis of Custom Computer Hardware

NASA Technical Reports Server (NTRS)

Molock, Dwaine S.

2008-01-01

The Coldfire SDN Diagnostics software is a flexible means of exercising, testing, and debugging custom computer hardware. The software is a set of routines that, collectively, serve as a common software interface through which one can gain access to various parts of the hardware under test and/or cause the hardware to perform various functions. The routines can be used to construct tests to exercise, and verify the operation of, various processors and hardware interfaces. More specifically, the software can be used to gain access to memory, to execute timer delays, to configure interrupts, and configure processor cache, floating-point, and direct-memory-access units. The software is designed to be used on diverse NASA projects, and can be customized for use with different processors and interfaces. The routines are supported, regardless of the architecture of a processor that one seeks to diagnose. The present version of the software is configured for Coldfire processors on the Subsystem Data Node processor boards of the Solar Dynamics Observatory. There is also support for the software with respect to Mongoose V, RAD750, and PPC405 processors or their equivalents.
Parallel discrete event simulation using shared memory

NASA Technical Reports Server (NTRS)

Reed, Daniel A.; Malony, Allen D.; Mccredie, Bradley D.

1988-01-01

With traditional event-list techniques, evaluating a detailed discrete-event simulation-model can often require hours or even days of computation time. By eliminating the event list and maintaining only sufficient synchronization to ensure causality, parallel simulation can potentially provide speedups that are linear in the numbers of processors. A set of shared-memory experiments, using the Chandy-Misra distributed-simulation algorithm, to simulate networks of queues is presented. Parameters of the study include queueing network topology and routing probabilities, number of processors, and assignment of network nodes to processors. These experiments show that Chandy-Misra distributed simulation is a questionable alternative to sequential-simulation of most queueing network models.
A message passing kernel for the hypercluster parallel processing test bed

NASA Technical Reports Server (NTRS)

Blech, Richard A.; Quealy, Angela; Cole, Gary L.

1989-01-01

A Message-Passing Kernel (MPK) for the Hypercluster parallel-processing test bed is described. The Hypercluster is being developed at the NASA Lewis Research Center to support investigations of parallel algorithms and architectures for computational fluid and structural mechanics applications. The Hypercluster resembles the hypercube architecture except that each node consists of multiple processors communicating through shared memory. The MPK efficiently routes information through the Hypercluster, using a message-passing protocol when necessary and faster shared-memory communication whenever possible. The MPK also interfaces all of the processors with the Hypercluster operating system (HYCLOPS), which runs on a Front-End Processor (FEP). This approach distributes many of the I/O tasks to the Hypercluster processors and eliminates the need for a separate I/O support program on the FEP.
Distributed support vector machine in master-slave mode.

PubMed

Chen, Qingguo; Cao, Feilong

2018-05-01

It is well known that the support vector machine (SVM) is an effective learning algorithm. The alternating direction method of multipliers (ADMM) algorithm has emerged as a powerful technique for solving distributed optimisation models. This paper proposes a distributed SVM algorithm in a master-slave mode (MS-DSVM), which integrates a distributed SVM and ADMM acting in a master-slave configuration where the master node and slave nodes are connected, meaning the results can be broadcasted. The distributed SVM is regarded as a regularised optimisation problem and modelled as a series of convex optimisation sub-problems that are solved by ADMM. Additionally, the over-relaxation technique is utilised to accelerate the convergence rate of the proposed MS-DSVM. Our theoretical analysis demonstrates that the proposed MS-DSVM has linear convergence, meaning it possesses the fastest convergence rate among existing standard distributed ADMM algorithms. Numerical examples demonstrate that the convergence and accuracy of the proposed MS-DSVM are superior to those of existing methods under the ADMM framework. Copyright © 2018 Elsevier Ltd. All rights reserved.
A Scalable Software Architecture Booting and Configuring Nodes in the Whitney Commodity Computing Testbed

NASA Technical Reports Server (NTRS)

Fineberg, Samuel A.; Kutler, Paul (Technical Monitor)

1997-01-01

The Whitney project is integrating commodity off-the-shelf PC hardware and software technology to build a parallel supercomputer with hundreds to thousands of nodes. To build such a system, one must have a scalable software model, and the installation and maintenance of the system software must be completely automated. We describe the design of an architecture for booting, installing, and configuring nodes in such a system with particular consideration given to scalability and ease of maintenance. This system has been implemented on a 40-node prototype of Whitney and is to be used on the 500 processor Whitney system to be built in 1998.
The Carnegie Mellon University Insert Project

DTIC Science & Technology

1997-02-01

Real - Time Systems (INSERT) project under the DARPA Evolutionary Design for Complex Software (EDCS) Program. The INSERT team has completed an initial API definition and ported the existing real-time publication subscription group communication software to LynxOS 2.4, a POSIX.1b compliant OS. The distributed real-time publisher/subscriber communication model is now supported by a processor membership protocol which allows a node in the system to fail, or to rejoin the system later. When a node fails, all the publishers and subscribers on that node have to be
PANDA: A distributed multiprocessor operating system

DOE Office of Scientific and Technical Information (OSTI.GOV)

Chubb, P.

1989-01-01

PANDA is a design for a distributed multiprocessor and an operating system. PANDA is designed to allow easy expansion of both hardware and software. As such, the PANDA kernel provides only message passing and memory and process management. The other features needed for the system (device drivers, secondary storage management, etc.) are provided as replaceable user tasks. The thesis presents PANDA's design and implementation, both hardware and software. PANDA uses multiple 68010 processors sharing memory on a VME bus, each such node potentially connected to others via a high speed network. The machine is completely homogeneous: there are no differencesmore » between processors that are detectable by programs running on the machine. A single two-processor node has been constructed. Each processor contains memory management circuits designed to allow processors to share page tables safely. PANDA presents a programmers' model similar to the hardware model: a job is divided into multiple tasks, each having its own address space. Within each task, multiple processes share code and data. Tasks can send messages to each other, and set up virtual circuits between themselves. Peripheral devices such as disc drives are represented within PANDA by tasks. PANDA divides secondary storage into volumes, each volume being accessed by a volume access task, or VAT. All knowledge about the way that data is stored on a disc is kept in its volume's VAT. The design is such that PANDA should provide a useful testbed for file systems and device drivers, as these can be installed without recompiling PANDA itself, and without rebooting the machine.« less
HeinzelCluster: accelerated reconstruction for FORE and OSEM3D.

PubMed

Vollmar, S; Michel, C; Treffert, J T; Newport, D F; Casey, M; Knöss, C; Wienhard, K; Liu, X; Defrise, M; Heiss, W D

2002-08-07

Using iterative three-dimensional (3D) reconstruction techniques for reconstruction of positron emission tomography (PET) is not feasible on most single-processor machines due to the excessive computing time needed, especially so for the large sinogram sizes of our high-resolution research tomograph (HRRT). In our first approach to speed up reconstruction time we transform the 3D scan into the format of a two-dimensional (2D) scan with sinograms that can be reconstructed independently using Fourier rebinning (FORE) and a fast 2D reconstruction method. On our dedicated reconstruction cluster (seven four-processor systems, Intel PIII@700 MHz, switched fast ethernet and Myrinet, Windows NT Server), we process these 2D sinograms in parallel. We have achieved a speedup > 23 using 26 processors and also compared results for different communication methods (RPC, Syngo, Myrinet GM). The other approach is to parallelize OSEM3D (implementation of C Michel), which has produced the best results for HRRT data so far and is more suitable for an adequate treatment of the sinogram gaps that result from the detector geometry of the HRRT. We have implemented two levels of parallelization for four dedicated cluster (a shared memory fine-grain level on each node utilizing all four processors and a coarse-grain level allowing for 15 nodes) reducing the time for one core iteration from over 7 h to about 35 min.
Status of the Node 3 Regenerative Environmental Cpntrol& Life Support System Water Recovery & Oxygen Generation Systems

NASA Technical Reports Server (NTRS)

Carrasquillo, Robyn L.

2003-01-01

NASA s Marshall Space Flight Center is providing three racks containing regenerative water recovery and oxygen generation systems (WRS and OGS) for flight on the lnternational Space Station s (ISS) Node 3 element. The major assemblies included in these racks are the Water Processor Assembly (WPA), Urine Processor Assembly (UPA), Oxygen Generation Assembly (OGA), and the Power Supply Module (PSM) supporting the OGA. The WPA and OGA are provided by Hamilton Sundstrand Space Systems lnternational (HSSSI), while the UPA and PSM are being designed and manufactured in-house by MSFC. The assemblies are currently in the manufacturing and test phase and are to be completed and integrated into flight racks this year. This paper gives an overview of the technologies and system designs, technical challenges encountered and solved, and the current status.
High-performance parallel processors based on star-coupled wavelength division multiplexing optical interconnects

DOEpatents

Deri, Robert J.; DeGroot, Anthony J.; Haigh, Ronald E.

2002-01-01

As the performance of individual elements within parallel processing systems increases, increased communication capability between distributed processor and memory elements is required. There is great interest in using fiber optics to improve interconnect communication beyond that attainable using electronic technology. Several groups have considered WDM, star-coupled optical interconnects. The invention uses a fiber optic transceiver to provide low latency, high bandwidth channels for such interconnects using a robust multimode fiber technology. Instruction-level simulation is used to quantify the bandwidth, latency, and concurrency required for such interconnects to scale to 256 nodes, each operating at 1 GFLOPS performance. Performance scales have been shown to .apprxeq.100 GFLOPS for scientific application kernels using a small number of wavelengths (8 to 32), only one wavelength received per node, and achievable optoelectronic bandwidth and latency.
DMA engine for repeating communication patterns

DOEpatents

Chen, Dong; Gara, Alan G.; Giampapa, Mark E.; Heidelberger, Philip; Steinmacher-Burow, Burkhard; Vranas, Pavlos

2010-09-21

A parallel computer system is constructed as a network of interconnected compute nodes to operate a global message-passing application for performing communications across the network. Each of the compute nodes includes one or more individual processors with memories which run local instances of the global message-passing application operating at each compute node to carry out local processing operations independent of processing operations carried out at other compute nodes. Each compute node also includes a DMA engine constructed to interact with the application via Injection FIFO Metadata describing multiple Injection FIFOs where each Injection FIFO may containing an arbitrary number of message descriptors in order to process messages with a fixed processing overhead irrespective of the number of message descriptors included in the Injection FIFO.
Reconfiguration in Robust Distributed Real-Time Systems Based on Global Checkpoints

DTIC Science & Technology

1991-12-01

achieved by utilizing distributed systems in which a single application program executes on multiple processors, connected to a network. The distributed...single application program executes on multiple proces- sors, connected to a network. The distributed nature of such systems make it possible to ...resident at every node. How - ever, the responsibility for execution of a particular function is assigned to only one node in this framework. This function
Homemade Buckeye-Pi: A Learning Many-Node Platform for High-Performance Parallel Computing

NASA Astrophysics Data System (ADS)

Amooie, M. A.; Moortgat, J.

2017-12-01

We report on the "Buckeye-Pi" cluster, the supercomputer developed in The Ohio State University School of Earth Sciences from 128 inexpensive Raspberry Pi (RPi) 3 Model B single-board computers. Each RPi is equipped with fast Quad Core 1.2GHz ARMv8 64bit processor, 1GB of RAM, and 32GB microSD card for local storage. Therefore, the cluster has a total RAM of 128GB that is distributed on the individual nodes and a flash capacity of 4TB with 512 processors, while it benefits from low power consumption, easy portability, and low total cost. The cluster uses the Message Passing Interface protocol to manage the communications between each node. These features render our platform the most powerful RPi supercomputer to date and suitable for educational applications in high-performance-computing (HPC) and handling of large datasets. In particular, we use the Buckeye-Pi to implement optimized parallel codes in our in-house simulator for subsurface media flows with the goal of achieving a massively-parallelized scalable code. We present benchmarking results for the computational performance across various number of RPi nodes. We believe our project could inspire scientists and students to consider the proposed unconventional cluster architecture as a mainstream and a feasible learning platform for challenging engineering and scientific problems.
Reliable appropriate topology design for multiple-processor systems

DOE Office of Scientific and Technical Information (OSTI.GOV)

Chou, C.P.

1987-01-01

A Shift and Replace Graph which is a very appropriate candidate for the topology of a multiple-processor system is a function of two positive integers r and m, and is denoted as SRF(r,m). Pradhan and Reddy proved that the node connectivity of SRG(r,m) is at least r and also give a routing algorithm which generally requires 2m jumps if the number of node failures is no larger than r - 1. Later, Esfahanian and Hakimi proved that SRG(r,m) has maximum node connectivity 2r - 2 and give routing algorithms which require: (1) at most m + 3 + log/sub r/mmore » jumps if 3 + log/sub r/m does not exceed m and the number of node failures is at most r - 1; (2) at most m + 5 + log/sub r/m jumps if 4 + log/sub r/m less than or equal to m and the number of node failures if less than or equal to 2r - 3; (3) all the other situations require no more than 2m jumps. By modifying the SRG(r,m), it is first proved that node connectivity of SRG(r,m) can be increased to: (1) 2r - 1 when r = 2, m = 2, and (2) 2r when (r = 2, m > 2) or (r > 2, m greater than or equal to 2, m greater than or equal to 2). The routing algorithms are also given for the modified SRG (r,m), which require at most 2m + 3 jumps when the number of node failures is less than or equal to 2r - 1.« less
Network of dedicated processors for finding lowest-cost map path

NASA Technical Reports Server (NTRS)

Eberhardt, Silvio P. (Inventor)

1991-01-01

A method and associated apparatus are disclosed for finding the lowest cost path of several variable paths. The paths are comprised of a plurality of linked cost-incurring areas existing between an origin point and a destination point. The method comprises the steps of connecting a purality of nodes together in the manner of the cost-incurring areas; programming each node to have a cost associated therewith corresponding to one of the cost-incurring areas; injecting a signal into one of the nodes representing the origin point; propagating the signal through the plurality of nodes from inputs to outputs; reducing the signal in magnitude at each node as a function of the respective cost of the node; and, starting at one of the nodes representing the destination point and following a path having the least reduction in magnitude of the signal from node to node back to one of the nodes representing the origin point whereby the lowest cost path from the origin point to the destination point is found.
Parallel hyperbolic PDE simulation on clusters: Cell versus GPU

NASA Astrophysics Data System (ADS)

Rostrup, Scott; De Sterck, Hans

2010-12-01

Increasingly, high-performance computing is looking towards data-parallel computational devices to enhance computational performance. Two technologies that have received significant attention are IBM's Cell Processor and NVIDIA's CUDA programming model for graphics processing unit (GPU) computing. In this paper we investigate the acceleration of parallel hyperbolic partial differential equation simulation on structured grids with explicit time integration on clusters with Cell and GPU backends. The message passing interface (MPI) is used for communication between nodes at the coarsest level of parallelism. Optimizations of the simulation code at the several finer levels of parallelism that the data-parallel devices provide are described in terms of data layout, data flow and data-parallel instructions. Optimized Cell and GPU performance are compared with reference code performance on a single x86 central processing unit (CPU) core in single and double precision. We further compare the CPU, Cell and GPU platforms on a chip-to-chip basis, and compare performance on single cluster nodes with two CPUs, two Cell processors or two GPUs in a shared memory configuration (without MPI). We finally compare performance on clusters with 32 CPUs, 32 Cell processors, and 32 GPUs using MPI. Our GPU cluster results use NVIDIA Tesla GPUs with GT200 architecture, but some preliminary results on recently introduced NVIDIA GPUs with the next-generation Fermi architecture are also included. This paper provides computational scientists and engineers who are considering porting their codes to accelerator environments with insight into how structured grid based explicit algorithms can be optimized for clusters with Cell and GPU accelerators. It also provides insight into the speed-up that may be gained on current and future accelerator architectures for this class of applications. Program summaryProgram title: SWsolver Catalogue identifier: AEGY_v1_0 Program summary URL:http://cpc.cs.qub.ac.uk/summaries/AEGY_v1_0.html Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland Licensing provisions: GPL v3 No. of lines in distributed program, including test data, etc.: 59 168 No. of bytes in distributed program, including test data, etc.: 453 409 Distribution format: tar.gz Programming language: C, CUDA Computer: Parallel Computing Clusters. Individual compute nodes may consist of x86 CPU, Cell processor, or x86 CPU with attached NVIDIA GPU accelerator. Operating system: Linux Has the code been vectorised or parallelized?: Yes. Tested on 1-128 x86 CPU cores, 1-32 Cell Processors, and 1-32 NVIDIA GPUs. RAM: Tested on Problems requiring up to 4 GB per compute node. Classification: 12 External routines: MPI, CUDA, IBM Cell SDK Nature of problem: MPI-parallel simulation of Shallow Water equations using high-resolution 2D hyperbolic equation solver on regular Cartesian grids for x86 CPU, Cell Processor, and NVIDIA GPU using CUDA. Solution method: SWsolver provides 3 implementations of a high-resolution 2D Shallow Water equation solver on regular Cartesian grids, for CPU, Cell Processor, and NVIDIA GPU. Each implementation uses MPI to divide work across a parallel computing cluster. Additional comments: Sub-program numdiff is used for the test run.
Self-checking self-repairing computer nodes using the mirror processor

NASA Technical Reports Server (NTRS)

Tamir, Yuval

1992-01-01

Circuitry added to fault-tolerant systems for concurrent error deduction usually reduces performance. Using a technique called micro rollback, it is possible to eliminate most of the performance penalty of concurrent error detection. Error detection is performed in parallel with intermodule communication, and erroneous state changes are later undone. The author reports on the design and implementation of a VLSI RISC microprocessor, called the Mirror Processor (MP), which is capable of micro rollback. In order to achieve concurrent error detection, two MP chips operate in lockstep, comparing external signals and a signature of internal signals every clock cycle. If a mismatch is detected, both processors roll back to the beginning of the cycle when the error occurred. In some cases the erroneous state is corrected by copying a value from the fault-free processor to the faulty processor. The architecture, microarchitecture, and VLSI implementation of the MP, emphasizing its error-detection, error-recovery, and self-diagnosis capabilities, are described.
Distributed computation of graphics primitives on a transputer network

NASA Technical Reports Server (NTRS)

Ellis, Graham K.

1988-01-01

A method is developed for distributing the computation of graphics primitives on a parallel processing network. Off-the-shelf transputer boards are used to perform the graphics transformations and scan-conversion tasks that would normally be assigned to a single transputer based display processor. Each node in the network performs a single graphics primitive computation. Frequently requested tasks can be duplicated on several nodes. The results indicate that the current distribution of commands on the graphics network shows a performance degradation when compared to the graphics display board alone. A change to more computation per node for every communication (perform more complex tasks on each node) may cause the desired increase in throughput.

Integral Fast Reactor fuel pin processor

DOE Office of Scientific and Technical Information (OSTI.GOV)

Levinskas, D.

1993-01-01

This report discusses the pin processor which receives metal alloy pins cast from recycled Integral Fast Reactor (IFR) fuel and prepares them for assembly into new IFR fuel elements. Either full length as-cast or precut pins are fed to the machine from a magazine, cut if necessary, and measured for length, weight, diameter and deviation from straightness. Accepted pins are loaded into cladding jackets located in a magazine, while rejects and cutting scraps are separated into trays. The magazines, trays, and the individual modules that perform the different machine functions are assembled and removed using remote manipulators and master-slaves.
Integral Fast Reactor fuel pin processor

DOE Office of Scientific and Technical Information (OSTI.GOV)

Levinskas, D.

1993-03-01

This report discusses the pin processor which receives metal alloy pins cast from recycled Integral Fast Reactor (IFR) fuel and prepares them for assembly into new IFR fuel elements. Either full length as-cast or precut pins are fed to the machine from a magazine, cut if necessary, and measured for length, weight, diameter and deviation from straightness. Accepted pins are loaded into cladding jackets located in a magazine, while rejects and cutting scraps are separated into trays. The magazines, trays, and the individual modules that perform the different machine functions are assembled and removed using remote manipulators and master-slaves.
Wide-Area Persistent Energy-Efficient Maritime Sensing

DTIC Science & Technology

2015-09-30

Matt Reynolds, Lefteris Kampianakis, and Andreas Pedrosse-Engel at UW designed and tested a Software Defined Radar testbed as well as an Arduino - based ...hardware based on a software-defined radio platform. 2) Development of a standalone Arduino - based backscatter node. 3) Analysis of the limits of the... Arduino - based node that can modulate radar backscatter with data received from a sensor using a low-power Arduino Nano processor. Figure 5 shows a
Active Nodal Task Seeking for High-Performance, Ultra-Dependable Computing

DTIC Science & Technology

1994-07-01

implementation. Figure 1 shows a hardware organization of ANTS: stand-alone computing nodes inter - connected by buses. 2.1 Run Time Partitioning The...nodes in 14 respond to changing loads [27] or system reconfiguration [26]. Existing techniques are all source-initiated or server-initiated [27]. 5.1...short-running task segments. The task segments must be short-running in order that processors will become avalable often enough to satisfy changing
Non-preconditioned conjugate gradient on cell and FPGA based hybrid supercomputer nodes

DOE Office of Scientific and Technical Information (OSTI.GOV)

Dubois, David H; Dubois, Andrew J; Boorman, Thomas M

2009-01-01

This work presents a detailed implementation of a double precision, non-preconditioned, Conjugate Gradient algorithm on a Roadrunner heterogeneous supercomputer node. These nodes utilize the Cell Broadband Engine Architecture{sup TM} in conjunction with x86 Opteron{sup TM} processors from AMD. We implement a common Conjugate Gradient algorithm, on a variety of systems, to compare and contrast performance. Implementation results are presented for the Roadrunner hybrid supercomputer, SRC Computers, Inc. MAPStation SRC-6 FPGA enhanced hybrid supercomputer, and AMD Opteron only. In all hybrid implementations wall clock time is measured, including all transfer overhead and compute timings.
Non-preconditioned conjugate gradient on cell and FPCA-based hybrid supercomputer nodes

DOE Office of Scientific and Technical Information (OSTI.GOV)

Dubois, David H; Dubois, Andrew J; Boorman, Thomas M

2009-03-10

This work presents a detailed implementation of a double precision, Non-Preconditioned, Conjugate Gradient algorithm on a Roadrunner heterogeneous supercomputer node. These nodes utilize the Cell Broadband Engine Architecture{trademark} in conjunction with x86 Opteron{trademark} processors from AMD. We implement a common Conjugate Gradient algorithm, on a variety of systems, to compare and contrast performance. Implementation results are presented for the Roadrunner hybrid supercomputer, SRC Computers, Inc. MAPStation SRC-6 FPGA enhanced hybrid supercomputer, and AMD Opteron only. In all hybrid implementations wall clock time is measured, including all transfer overhead and compute timings.
System and method for modeling and analyzing complex scenarios

DOEpatents

Shevitz, Daniel Wolf

2013-04-09

An embodiment of the present invention includes a method for analyzing and solving possibility tree. A possibility tree having a plurality of programmable nodes is constructed and solved with a solver module executed by a processor element. The solver module executes the programming of said nodes, and tracks the state of at least a variable through a branch. When a variable of said branch is out of tolerance with a parameter, the solver disables remaining nodes of the branch and marks the branch as an invalid solution. The valid solutions are then aggregated and displayed as valid tree solutions.
Development of small scale cluster computer for numerical analysis

NASA Astrophysics Data System (ADS)

Zulkifli, N. H. N.; Sapit, A.; Mohammed, A. N.

2017-09-01

In this study, two units of personal computer were successfully networked together to form a small scale cluster. Each of the processor involved are multicore processor which has four cores in it, thus made this cluster to have eight processors. Here, the cluster incorporate Ubuntu 14.04 LINUX environment with MPI implementation (MPICH2). Two main tests were conducted in order to test the cluster, which is communication test and performance test. The communication test was done to make sure that the computers are able to pass the required information without any problem and were done by using simple MPI Hello Program where the program written in C language. Additional, performance test was also done to prove that this cluster calculation performance is much better than single CPU computer. In this performance test, four tests were done by running the same code by using single node, 2 processors, 4 processors, and 8 processors. The result shows that with additional processors, the time required to solve the problem decrease. Time required for the calculation shorten to half when we double the processors. To conclude, we successfully develop a small scale cluster computer using common hardware which capable of higher computing power when compare to single CPU processor, and this can be beneficial for research that require high computing power especially numerical analysis such as finite element analysis, computational fluid dynamics, and computational physics analysis.
Design Report for Isolated RS-485 Bus Node

DTIC Science & Technology

2016-07-01

controlled wired RS-485 network. The Android-based smartphone or tablet is used in conjunction with a USB to serial bridge to operate as the bus master in...Android-based smartphone or tablet is used in conjunction with a USB to serial bridge to operate as the bus master in the system. The Android device
Feasibility of optically interconnected parallel processors using wavelength division multiplexing

DOE Office of Scientific and Technical Information (OSTI.GOV)

Deri, R.J.; De Groot, A.J.; Haigh, R.E.

1996-03-01

New national security demands require enhanced computing systems for nearly ab initio simulations of extremely complex systems and analyzing unprecedented quantities of remote sensing data. This computational performance is being sought using parallel processing systems, in which many less powerful processors are ganged together to achieve high aggregate performance. Such systems require increased capability to communicate information between individual processor and memory elements. As it is likely that the limited performance of today`s electronic interconnects will prevent the system from achieving its ultimate performance, there is great interest in using fiber optic technology to improve interconnect communication. However, little informationmore » is available to quantify the requirements on fiber optical hardware technology for this application. Furthermore, we have sought to explore interconnect architectures that use the complete communication richness of the optical domain rather than using optics as a simple replacement for electronic interconnects. These considerations have led us to study the performance of a moderate size parallel processor with optical interconnects using multiple optical wavelengths. We quantify the bandwidth, latency, and concurrency requirements which allow a bus-type interconnect to achieve scalable computing performance using up to 256 nodes, each operating at GFLOP performance. Our key conclusion is that scalable performance, to {approx}150 GFLOPS, is achievable for several scientific codes using an optical bus with a small number of WDM channels (8 to 32), only one WDM channel received per node, and achievable optoelectronic bandwidth and latency requirements. 21 refs. , 10 figs.« less
Learning Disabled College Writers Project, Evaluation Report, 1985-86.

ERIC Educational Resources Information Center

Dunham, Trudy

This report describes the Learning Disabled College Writer's Project, implemented at the University of Minnesota during the 1985-86 school year and designed to aid learning disabled college students master composition skills through training in the use of microcomputer word processors. Following an executive summary, an introduction states the…
Integrated High-Speed Torque Control System for a Robotic Joint

NASA Technical Reports Server (NTRS)

Davis, Donald R. (Inventor); Radford, Nicolaus A. (Inventor); Permenter, Frank Noble (Inventor); Valvo, Michael C. (Inventor); Askew, R. Scott (Inventor)

2013-01-01

A control system for achieving high-speed torque for a joint of a robot includes a printed circuit board assembly (PCBA) having a collocated joint processor and high-speed communication bus. The PCBA may also include a power inverter module (PIM) and local sensor conditioning electronics (SCE) for processing sensor data from one or more motor position sensors. Torque control of a motor of the joint is provided via the PCBA as a high-speed torque loop. Each joint processor may be embedded within or collocated with the robotic joint being controlled. Collocation of the joint processor, PIM, and high-speed bus may increase noise immunity of the control system, and the localized processing of sensor data from the joint motor at the joint level may minimize bus cabling to and from each control node. The joint processor may include a field programmable gate array (FPGA).
Interconnect Performance Evaluation of SGI Altix 3700 BX2, Cray X1, Cray Opteron Cluster, and Dell PowerEdge

NASA Technical Reports Server (NTRS)

Fatoohi, Rod; Saini, Subbash; Ciotti, Robert

2006-01-01

We study the performance of inter-process communication on four high-speed multiprocessor systems using a set of communication benchmarks. The goal is to identify certain limiting factors and bottlenecks with the interconnect of these systems as well as to compare these interconnects. We measured network bandwidth using different number of communicating processors and communication patterns, such as point-to-point communication, collective communication, and dense communication patterns. The four platforms are: a 512-processor SGI Altix 3700 BX2 shared-memory machine with 3.2 GB/s links; a 64-processor (single-streaming) Cray XI shared-memory machine with 32 1.6 GB/s links; a 128-processor Cray Opteron cluster using a Myrinet network; and a 1280-node Dell PowerEdge cluster with an InfiniBand network. Our, results show the impact of the network bandwidth and topology on the overall performance of each interconnect.
The Mark III Hypercube-Ensemble Computers

NASA Technical Reports Server (NTRS)

Peterson, John C.; Tuazon, Jesus O.; Lieberman, Don; Pniel, Moshe

1988-01-01

Mark III Hypercube concept applied in development of series of increasingly powerful computers. Processor of each node of Mark III Hypercube ensemble is specialized computer containing three subprocessors and shared main memory. Solves problem quickly by simultaneously processing part of problem at each such node and passing combined results to host computer. Disciplines benefitting from speed and memory capacity include astrophysics, geophysics, chemistry, weather, high-energy physics, applied mechanics, image processing, oil exploration, aircraft design, and microcircuit design.
Energy efficient sensor network implementations

DOE Office of Scientific and Technical Information (OSTI.GOV)

Frigo, Janette R; Raby, Eric Y; Brennan, Sean M

In this paper, we discuss a low power embedded sensor node architecture we are developing for distributed sensor network systems deployed in a natural environment. In particular, we examine the sensor node for energy efficient processing-at-the-sensor. We analyze the following modes of operation; event detection, sleep(wake-up), data acquisition, data processing modes using low power, high performance embedded technology such as specialized embedded DSP processors and a low power FPGAs at the sensing node. We use compute intensive sensor node applications: an acoustic vehicle classifier (frequency domain analysis) and a video license plate identification application (learning algorithm) as a case study.more » We report performance and total energy usage for our system implementations and discuss the system architecture design trade offs.« less
DD-αAMG on QPACE 3

NASA Astrophysics Data System (ADS)

Georg, Peter; Richtmann, Daniel; Wettig, Tilo

2018-03-01

We describe our experience porting the Regensburg implementation of the DD-αAMG solver from QPACE 2 to QPACE 3. We first review how the code was ported from the first generation Intel Xeon Phi processor (Knights Corner) to its successor (Knights Landing). We then describe the modifications in the communication library necessitated by the switch from InfiniBand to Omni-Path. Finally, we present the performance of the code on a single processor as well as the scaling on many nodes, where in both cases the speedup factor is close to the theoretical expectations.
Computationally Efficient Modeling and Simulation of Large Scale Systems

NASA Technical Reports Server (NTRS)

Jain, Jitesh (Inventor); Koh, Cheng-Kok (Inventor); Balakrishnan, Vankataramanan (Inventor); Cauley, Stephen F (Inventor); Li, Hong (Inventor)

2014-01-01

A system for simulating operation of a VLSI interconnect structure having capacitive and inductive coupling between nodes thereof, including a processor, and a memory, the processor configured to perform obtaining a matrix X and a matrix Y containing different combinations of passive circuit element values for the interconnect structure, the element values for each matrix including inductance L and inverse capacitance P, obtaining an adjacency matrix A associated with the interconnect structure, storing the matrices X, Y, and A in the memory, and performing numerical integration to solve first and second equations.
RAMA: A file system for massively parallel computers

NASA Technical Reports Server (NTRS)

Miller, Ethan L.; Katz, Randy H.

1993-01-01

This paper describes a file system design for massively parallel computers which makes very efficient use of a few disks per processor. This overcomes the traditional I/O bottleneck of massively parallel machines by storing the data on disks within the high-speed interconnection network. In addition, the file system, called RAMA, requires little inter-node synchronization, removing another common bottleneck in parallel processor file systems. Support for a large tertiary storage system can easily be integrated in lo the file system; in fact, RAMA runs most efficiently when tertiary storage is used.
Computing NLTE Opacities -- Node Level Parallel Calculation

DOE Office of Scientific and Technical Information (OSTI.GOV)

Holladay, Daniel

Presentation. The goal: to produce a robust library capable of computing reasonably accurate opacities inline with the assumption of LTE relaxed (non-LTE). Near term: demonstrate acceleration of non-LTE opacity computation. Far term (if funded): connect to application codes with in-line capability and compute opacities. Study science problems. Use efficient algorithms that expose many levels of parallelism and utilize good memory access patterns for use on advanced architectures. Portability to multiple types of hardware including multicore processors, manycore processors such as KNL, GPUs, etc. Easily coupled to radiation hydrodynamics and thermal radiative transfer codes.
Free-Space Optical Interconnect Employing VCSEL Diodes

NASA Technical Reports Server (NTRS)

Simons, Rainee N.; Savich, Gregory R.; Torres, Heidi

2009-01-01

Sensor signal processing is widely used on aircraft and spacecraft. The scheme employs multiple input/output nodes for data acquisition and CPU (central processing unit) nodes for data processing. To connect 110 nodes and CPU nodes, scalable interconnections such as backplanes are desired because the number of nodes depends on requirements of each mission. An optical backplane consisting of vertical-cavity surface-emitting lasers (VCSELs), VCSEL drivers, photodetectors, and transimpedance amplifiers is the preferred approach since it can handle several hundred megabits per second data throughput.The next generation of satellite-borne systems will require transceivers and processors that can handle several Gb/s of data. Optical interconnects have been praised for both their speed and functionality with hopes that light can relieve the electrical bottleneck predicted for the near future. Optoelectronic interconnects provide a factor of ten improvement over electrical interconnects.

Implementation and Optimization of miniGMG - a Compact Geometric Multigrid Benchmark

DOE Office of Scientific and Technical Information (OSTI.GOV)

Williams, Samuel; Kalamkar, Dhiraj; Singh, Amik

2012-12-01

Multigrid methods are widely used to accelerate the convergence of iterative solvers for linear systems used in a number of different application areas. In this report, we describe miniGMG, our compact geometric multigrid benchmark designed to proxy the multigrid solves found in AMR applications. We explore optimization techniques for geometric multigrid on existing and emerging multicore systems including the Opteron-based Cray XE6, Intel Sandy Bridge and Nehalem-based Infiniband clusters, as well as manycore-based architectures including NVIDIA's Fermi and Kepler GPUs and Intel's Knights Corner (KNC) co-processor. This report examines a variety of novel techniques including communication-aggregation, threaded wavefront-based DRAM communication-avoiding,more » dynamic threading decisions, SIMDization, and fusion of operators. We quantify performance through each phase of the V-cycle for both single-node and distributed-memory experiments and provide detailed analysis for each class of optimization. Results show our optimizations yield significant speedups across a variety of subdomain sizes while simultaneously demonstrating the potential of multi- and manycore processors to dramatically accelerate single-node performance. However, our analysis also indicates that improvements in networks and communication will be essential to reap the potential of manycore processors in large-scale multigrid calculations.« less
Scalable Domain Decomposed Monte Carlo Particle Transport

NASA Astrophysics Data System (ADS)

O'Brien, Matthew Joseph

In this dissertation, we present the parallel algorithms necessary to run domain decomposed Monte Carlo particle transport on large numbers of processors (millions of processors). Previous algorithms were not scalable, and the parallel overhead became more computationally costly than the numerical simulation. The main algorithms we consider are: • Domain decomposition of constructive solid geometry: enables extremely large calculations in which the background geometry is too large to fit in the memory of a single computational node. • Load Balancing: keeps the workload per processor as even as possible so the calculation runs efficiently. • Global Particle Find: if particles are on the wrong processor, globally resolve their locations to the correct processor based on particle coordinate and background domain. • Visualizing constructive solid geometry, sourcing particles, deciding that particle streaming communication is completed and spatial redecomposition. These algorithms are some of the most important parallel algorithms required for domain decomposed Monte Carlo particle transport. We demonstrate that our previous algorithms were not scalable, prove that our new algorithms are scalable, and run some of the algorithms up to 2 million MPI processes on the Sequoia supercomputer.
GPS-Free Localization Algorithm for Wireless Sensor Networks

PubMed Central

Wang, Lei; Xu, Qingzheng

2010-01-01

Localization is one of the most fundamental problems in wireless sensor networks, since the locations of the sensor nodes are critical to both network operations and most application level tasks. A GPS-free localization scheme for wireless sensor networks is presented in this paper. First, we develop a standardized clustering-based approach for the local coordinate system formation wherein a multiplication factor is introduced to regulate the number of master and slave nodes and the degree of connectivity among master nodes. Second, using homogeneous coordinates, we derive a transformation matrix between two Cartesian coordinate systems to efficiently merge them into a global coordinate system and effectively overcome the flip ambiguity problem. The algorithm operates asynchronously without a centralized controller; and does not require that the location of the sensors be known a priori. A set of parameter-setting guidelines for the proposed algorithm is derived based on a probability model and the energy requirements are also investigated. A simulation analysis on a specific numerical example is conducted to validate the mathematical analytical results. We also compare the performance of the proposed algorithm under a variety multiplication factor, node density and node communication radius scenario. Experiments show that our algorithm outperforms existing mechanisms in terms of accuracy and convergence time. PMID:22219694
Digital Parallel Processor Array for Optimum Path Planning

NASA Technical Reports Server (NTRS)

Kremeny, Sabrina E. (Inventor); Fossum, Eric R. (Inventor); Nixon, Robert H. (Inventor)

1996-01-01

The invention computes the optimum path across a terrain or topology represented by an array of parallel processor cells interconnected between neighboring cells by links extending along different directions to the neighboring cells. Such an array is preferably implemented as a high-speed integrated circuit. The computation of the optimum path is accomplished by, in each cell, receiving stimulus signals from neighboring cells along corresponding directions, determining and storing the identity of a direction along which the first stimulus signal is received, broadcasting a subsequent stimulus signal to the neighboring cells after a predetermined delay time, whereby stimulus signals propagate throughout the array from a starting one of the cells. After propagation of the stimulus signal throughout the array, a master processor traces back from a selected destination cell to the starting cell along an optimum path of the cells in accordance with the identity of the directions stored in each of the cells.
An implementation of a tree code on a SIMD, parallel computer

NASA Technical Reports Server (NTRS)

Olson, Kevin M.; Dorband, John E.

1994-01-01

We describe a fast tree algorithm for gravitational N-body simulation on SIMD parallel computers. The tree construction uses fast, parallel sorts. The sorted lists are recursively divided along their x, y and z coordinates. This data structure is a completely balanced tree (i.e., each particle is paired with exactly one other particle) and maintains good spatial locality. An implementation of this tree-building algorithm on a 16k processor Maspar MP-1 performs well and constitutes only a small fraction (approximately 15%) of the entire cycle of finding the accelerations. Each node in the tree is treated as a monopole. The tree search and the summation of accelerations also perform well. During the tree search, node data that is needed from another processor is simply fetched. Roughly 55% of the tree search time is spent in communications between processors. We apply the code to two problems of astrophysical interest. The first is a simulation of the close passage of two gravitationally, interacting, disk galaxies using 65,636 particles. We also simulate the formation of structure in an expanding, model universe using 1,048,576 particles. Our code attains speeds comparable to one head of a Cray Y-MP, so single instruction, multiple data (SIMD) type computers can be used for these simulations. The cost/performance ratio for SIMD machines like the Maspar MP-1 make them an extremely attractive alternative to either vector processors or large multiple instruction, multiple data (MIMD) type parallel computers. With further optimizations (e.g., more careful load balancing), speeds in excess of today's vector processing computers should be possible.
Landsat image registration for agricultural applications

NASA Technical Reports Server (NTRS)

Wolfe, R. H., Jr.; Juday, R. D.; Wacker, A. G.; Kaneko, T.

1982-01-01

An image registration system has been developed at the NASA Johnson Space Center (JSC) to spatially align multi-temporal Landsat acquisitions for use in agriculture and forestry research. Working in conjunction with the Master Data Processor (MDP) at the Goddard Space Flight Center, it functionally replaces the long-standing LACIE Registration Processor as JSC's data supplier. The system represents an expansion of the techniques developed for the MDP and LACIE Registration Processor, and it utilizes the experience gained in an IBM/JSC effort evaluating the performance of the latter. These techniques are discussed in detail. Several tests were developed to evaluate the registration performance of the system. The results indicate that 1/15-pixel accuracy (about 4m for Landsat MSS) is achievable in ideal circumstances, sub-pixel accuracy (often to 0.2 pixel or better) was attained on a representative set of U.S. acquisitions, and a success rate commensurate with the LACIE Registration Processor was realized. The system has been employed in a production mode on U.S. and foreign data, and a performance similar to the earlier tests has been noted.
Real-Time Imaging with a Pulsed Coherent CO, Laser Radar

DTIC Science & Technology

1997-01-01

30 joule) transmitted energy levels has just begun. The FLD program will conclude in 1997 with the demonstration of a full-up, real - time operating system . This...The master system and VMEbus controller is an off-the-shelf controller based on the Motorola 68040 processor running the VxWorks real time operating system . Application
The Influence of Computer-Based Text Editors on the Revision Strategies of Inexperienced Writers.

ERIC Educational Resources Information Center

Collier, Richard M.

A study sought to determine the effect of computer-based text editing on the revision strategies of inexperienced writers. Four subjects, none of whom had experience with computers or word processors, were selected from an introductory college composition course and required to master the basic terminal functions that would be necessary for…
Design and Application of a Field Sensing System for Ground Anchors in Slopes

PubMed Central

Choi, Se Woon; Lee, Jihoon; Kim, Jong Moon; Park, Hyo Seon

2013-01-01

In a ground anchor system, cables or tendons connected to a bearing plate are used for stabilization of slopes. Then, the stability of a slope is dependent on maintaining the tension levels in the cables. So far, no research on a strain-based field sensing system for ground anchors has been reported. Therefore, in this study, a practical monitoring system for long-term sensing of tension levels in tendons for anchor-reinforced slopes is proposed. The system for anchor-reinforced slopes is composed of: (1) load cells based on vibrating wire strain gauges (VWSGs), (2) wireless sensor nodes which receive and process the signals from load cells and then transmit the result to a master node through local area communication, (3) master nodes which transmit the data sent from sensor nodes to the server through mobile communication, and (4) a server located at the base station. The system was applied to field sensing of ground anchors in the 62 m-long and 26 m-high slope at the side of the highway. Based on the long-term monitoring, the safety of the anchor-reinforced slope can be secured by the timely applications of re-tensioning processes in tendons. PMID:23507820
Distribution of shortest path lengths in a class of node duplication network models

NASA Astrophysics Data System (ADS)

Steinbock, Chanania; Biham, Ofer; Katzav, Eytan

2017-09-01

We present analytical results for the distribution of shortest path lengths (DSPL) in a network growth model which evolves by node duplication (ND). The model captures essential properties of the structure and growth dynamics of social networks, acquaintance networks, and scientific citation networks, where duplication mechanisms play a major role. Starting from an initial seed network, at each time step a random node, referred to as a mother node, is selected for duplication. Its daughter node is added to the network, forming a link to the mother node, and with probability p to each one of its neighbors. The degree distribution of the resulting network turns out to follow a power-law distribution, thus the ND network is a scale-free network. To calculate the DSPL we derive a master equation for the time evolution of the probability Pt(L =ℓ ) , ℓ =1 ,2 ,⋯ , where L is the distance between a pair of nodes and t is the time. Finding an exact analytical solution of the master equation, we obtain a closed form expression for Pt(L =ℓ ) . The mean distance 〈L〉 t and the diameter Δt are found to scale like lnt , namely, the ND network is a small-world network. The variance of the DSPL is also found to scale like lnt . Interestingly, the mean distance and the diameter exhibit properties of a small-world network, rather than the ultrasmall-world network behavior observed in other scale-free networks, in which 〈L〉 t˜lnlnt .
GNAQPMS v1.1: accelerating the Global Nested Air Quality Prediction Modeling System (GNAQPMS) on Intel Xeon Phi processors

NASA Astrophysics Data System (ADS)

Wang, Hui; Chen, Huansheng; Wu, Qizhong; Lin, Junmin; Chen, Xueshun; Xie, Xinwei; Wang, Rongrong; Tang, Xiao; Wang, Zifa

2017-08-01

The Global Nested Air Quality Prediction Modeling System (GNAQPMS) is the global version of the Nested Air Quality Prediction Modeling System (NAQPMS), which is a multi-scale chemical transport model used for air quality forecast and atmospheric environmental research. In this study, we present the porting and optimisation of GNAQPMS on a second-generation Intel Xeon Phi processor, codenamed Knights Landing (KNL). Compared with the first-generation Xeon Phi coprocessor (codenamed Knights Corner, KNC), KNL has many new hardware features such as a bootable processor, high-performance in-package memory and ISA compatibility with Intel Xeon processors. In particular, we describe the five optimisations we applied to the key modules of GNAQPMS, including the CBM-Z gas-phase chemistry, advection, convection and wet deposition modules. These optimisations work well on both the KNL 7250 processor and the Intel Xeon E5-2697 V4 processor. They include (1) updating the pure Message Passing Interface (MPI) parallel mode to the hybrid parallel mode with MPI and OpenMP in the emission, advection, convection and gas-phase chemistry modules; (2) fully employing the 512 bit wide vector processing units (VPUs) on the KNL platform; (3) reducing unnecessary memory access to improve cache efficiency; (4) reducing the thread local storage (TLS) in the CBM-Z gas-phase chemistry module to improve its OpenMP performance; and (5) changing the global communication from writing/reading interface files to MPI functions to improve the performance and the parallel scalability. These optimisations greatly improved the GNAQPMS performance. The same optimisations also work well for the Intel Xeon Broadwell processor, specifically E5-2697 v4. Compared with the baseline version of GNAQPMS, the optimised version was 3.51 × faster on KNL and 2.77 × faster on the CPU. Moreover, the optimised version ran at 26 % lower average power on KNL than on the CPU. With the combined performance and energy improvement, the KNL platform was 37.5 % more efficient on power consumption compared with the CPU platform. The optimisations also enabled much further parallel scalability on both the CPU cluster and the KNL cluster scaled to 40 CPU nodes and 30 KNL nodes, with a parallel efficiency of 70.4 and 42.2 %, respectively.
Comparing the OpenMP, MPI, and Hybrid Programming Paradigm on an SMP Cluster

NASA Technical Reports Server (NTRS)

Jost, Gabriele; Jin, Hao-Qiang; anMey, Dieter; Hatay, Ferhat F.

2003-01-01

Clusters of SMP (Symmetric Multi-Processors) nodes provide support for a wide range of parallel programming paradigms. The shared address space within each node is suitable for OpenMP parallelization. Message passing can be employed within and across the nodes of a cluster. Multiple levels of parallelism can be achieved by combining message passing and OpenMP parallelization. Which programming paradigm is the best will depend on the nature of the given problem, the hardware components of the cluster, the network, and the available software. In this study we compare the performance of different implementations of the same CFD benchmark application, using the same numerical algorithm but employing different programming paradigms.
Transient Solid Dynamics Simulations on the Sandia/Intel Teraflop Computer

DOE Office of Scientific and Technical Information (OSTI.GOV)

Attaway, S.; Brown, K.; Gardner, D.

1997-12-31

Transient solid dynamics simulations are among the most widely used engineering calculations. Industrial applications include vehicle crashworthiness studies, metal forging, and powder compaction prior to sintering. These calculations are also critical to defense applications including safety studies and weapons simulations. The practical importance of these calculations and their computational intensiveness make them natural candidates for parallelization. This has proved to be difficult, and existing implementations fail to scale to more than a few dozen processors. In this paper we describe our parallelization of PRONTO, Sandia`s transient solid dynamics code, via a novel algorithmic approach that utilizes multiple decompositions for differentmore » key segments of the computations, including the material contact calculation. This latter calculation is notoriously difficult to perform well in parallel, because it involves dynamically changing geometry, global searches for elements in contact, and unstructured communications among the compute nodes. Our approach scales to at least 3600 compute nodes of the Sandia/Intel Teraflop computer (the largest set of nodes to which we have had access to date) on problems involving millions of finite elements. On this machine we can simulate models using more than ten- million elements in a few tenths of a second per timestep, and solve problems more than 3000 times faster than a single processor Cray Jedi.« less
A Spectral Element Ocean Model on the Cray T3D: the interannual variability of the Mediterranean Sea general circulation

NASA Astrophysics Data System (ADS)

Molcard, A. J.; Pinardi, N.; Ansaloni, R.

A new numerical model, SEOM (Spectral Element Ocean Model, (Iskandarani et al, 1994)), has been implemented in the Mediterranean Sea. Spectral element methods combine the geometric flexibility of finite element techniques with the rapid convergence rate of spectral schemes. The current version solves the shallow water equations with a fifth (or sixth) order accuracy spectral scheme and about 50.000 nodes. The domain decomposition philosophy makes it possible to exploit the power of parallel machines. The original MIMD master/slave version of SEOM, written in F90 and PVM, has been ported to the Cray T3D. When critical for performance, Cray specific high-performance one-sided communication routines (SHMEM) have been adopted to fully exploit the Cray T3D interprocessor network. Tests performed with highly unstructured and irregular grid, on up to 128 processors, show an almost linear scalability even with unoptimized domain decomposition techniques. Results from various case studies on the Mediterranean Sea are shown, involving realistic coastline geometry, and monthly mean 1000mb winds from the ECMWF's atmospheric model operational analysis from the period January 1987 to December 1994. The simulation results show that variability in the wind forcing considerably affect the circulation dynamics of the Mediterranean Sea.
Efficacy of Code Optimization on Cache-Based Processors

NASA Technical Reports Server (NTRS)

VanderWijngaart, Rob F.; Saphir, William C.; Chancellor, Marisa K. (Technical Monitor)

1997-01-01

In this paper a number of techniques for improving the cache performance of a representative piece of numerical software is presented. Target machines are popular processors from several vendors: MIPS R5000 (SGI Indy), MIPS R8000 (SGI PowerChallenge), MIPS R10000 (SGI Origin), DEC Alpha EV4 + EV5 (Cray T3D & T3E), IBM RS6000 (SP Wide-node), Intel PentiumPro (Ames' Whitney), Sun UltraSparc (NERSC's NOW). The optimizations all attempt to increase the locality of memory accesses. But they meet with rather varied and often counterintuitive success on the different computing platforms. We conclude that it may be genuinely impossible to obtain portable performance on the current generation of cache-based machines. At the least, it appears that the performance of modern commodity processors cannot be described with parameters defining the cache alone.
Synchronous Parallel Emulation and Discrete Event Simulation System with Self-Contained Simulation Objects and Active Event Objects

NASA Technical Reports Server (NTRS)

Steinman, Jeffrey S. (Inventor)

1998-01-01

The present invention is embodied in a method of performing object-oriented simulation and a system having inter-connected processor nodes operating in parallel to simulate mutual interactions of a set of discrete simulation objects distributed among the nodes as a sequence of discrete events changing state variables of respective simulation objects so as to generate new event-defining messages addressed to respective ones of the nodes. The object-oriented simulation is performed at each one of the nodes by assigning passive self-contained simulation objects to each one of the nodes, responding to messages received at one node by generating corresponding active event objects having user-defined inherent capabilities and individual time stamps and corresponding to respective events affecting one of the passive self-contained simulation objects of the one node, restricting the respective passive self-contained simulation objects to only providing and receiving information from die respective active event objects, requesting information and changing variables within a passive self-contained simulation object by the active event object, and producing corresponding messages specifying events resulting therefrom by the active event objects.
Dynamic Load Balancing for Grid Partitioning on a SP-2 Multiprocessor: A Framework

NASA Technical Reports Server (NTRS)

Sohn, Andrew; Simon, Horst; Lasinski, T. A. (Technical Monitor)

1994-01-01

Computational requirements of full scale computational fluid dynamics change as computation progresses on a parallel machine. The change in computational intensity causes workload imbalance of processors, which in turn requires a large amount of data movement at runtime. If parallel CFD is to be successful on a parallel or massively parallel machine, balancing of the runtime load is indispensable. Here a framework is presented for dynamic load balancing for CFD applications, called Jove. One processor is designated as a decision maker Jove while others are assigned to computational fluid dynamics. Processors running CFD send flags to Jove in a predetermined number of iterations to initiate load balancing. Jove starts working on load balancing while other processors continue working with the current data and load distribution. Jove goes through several steps to decide if the new data should be taken, including preliminary evaluate, partition, processor reassignment, cost evaluation, and decision. Jove running on a single EBM SP2 node has been completely implemented. Preliminary experimental results show that the Jove approach to dynamic load balancing can be effective for full scale grid partitioning on the target machine IBM SP2.
Dynamic Load Balancing For Grid Partitioning on a SP-2 Multiprocessor: A Framework

NASA Technical Reports Server (NTRS)

Sohn, Andrew; Simon, Horst; Lasinski, T. A. (Technical Monitor)

1994-01-01

Computational requirements of full scale computational fluid dynamics change as computation progresses on a parallel machine. The change in computational intensity causes workload imbalance of processors, which in turn requires a large amount of data movement at runtime. If parallel CFD is to be successful on a parallel or massively parallel machine, balancing of the runtime load is indispensable. Here a framework is presented for dynamic load balancing for CFD applications, called Jove. One processor is designated as a decision maker Jove while others are assigned to computational fluid dynamics. Processors running CFD send flags to Jove in a predetermined number of iterations to initiate load balancing. Jove starts working on load balancing while other processors continue working with the current data and load distribution. Jove goes through several steps to decide if the new data should be taken, including preliminary evaluate, partition, processor reassignment, cost evaluation, and decision. Jove running on a single IBM SP2 node has been completely implemented. Preliminary experimental results show that the Jove approach to dynamic load balancing can be effective for full scale grid partitioning on the target machine IBM SP2.
SU-E-T-628: A Cloud Computing Based Multi-Objective Optimization Method for Inverse Treatment Planning.

PubMed

Na, Y; Suh, T; Xing, L

2012-06-01

Multi-objective (MO) plan optimization entails generation of an enormous number of IMRT or VMAT plans constituting the Pareto surface, which presents a computationally challenging task. The purpose of this work is to overcome the hurdle by developing an efficient MO method using emerging cloud computing platform. As a backbone of cloud computing for optimizing inverse treatment planning, Amazon Elastic Compute Cloud with a master node (17.1 GB memory, 2 virtual cores, 420 GB instance storage, 64-bit platform) is used. The master node is able to scale seamlessly a number of working group instances, called workers, based on the user-defined setting account for MO functions in clinical setting. Each worker solved the objective function with an efficient sparse decomposition method. The workers are automatically terminated if there are finished tasks. The optimized plans are archived to the master node to generate the Pareto solution set. Three clinical cases have been planned using the developed MO IMRT and VMAT planning tools to demonstrate the advantages of the proposed method. The target dose coverage and critical structure sparing of plans are comparable obtained using the cloud computing platform are identical to that obtained using desktop PC (Intel Xeon® CPU 2.33GHz, 8GB memory). It is found that the MO planning speeds up the processing of obtaining the Pareto set substantially for both types of plans. The speedup scales approximately linearly with the number of nodes used for computing. With the use of N nodes, the computational time is reduced by the fitting model, 0.2+2.3/N, with r̂2>0.99, on average of the cases making real-time MO planning possible. A cloud computing infrastructure is developed for MO optimization. The algorithm substantially improves the speed of inverse plan optimization. The platform is valuable for both MO planning and future off- or on-line adaptive re-planning. © 2012 American Association of Physicists in Medicine.
Concurrent hypercube system with improved message passing

NASA Technical Reports Server (NTRS)

Peterson, John C. (Inventor); Tuazon, Jesus O. (Inventor); Lieberman, Don (Inventor); Pniel, Moshe (Inventor)

1989-01-01

A network of microprocessors, or nodes, are interconnected in an n-dimensional cube having bidirectional communication links along the edges of the n-dimensional cube. Each node's processor network includes an I/O subprocessor dedicated to controlling communication of message packets along a bidirectional communication link with each end thereof terminating at an I/O controlled transceiver. Transmit data lines are directly connected from a local FIFO through each node's communication link transceiver. Status and control signals from the neighboring nodes are delivered over supervisory lines to inform the local node that the neighbor node's FIFO is empty and the bidirectional link between the two nodes is idle for data communication. A clocking line between neighbors, clocks a message into an empty FIFO at a neighbor's node and vica versa. Either neighbor may acquire control over the bidirectional communication link at any time, and thus each node has circuitry for checking whether or not the communication link is busy or idle, and whether or not the receive FIFO is empty. Likewise, each node can empty its own FIFO and in turn deliver a status signal to a neighboring node indicating that the local FIFO is empty. The system includes features of automatic message rerouting, block message transfer and automatic parity checking and generation.

A Parallel Vector Machine for the PM Programming Language

NASA Astrophysics Data System (ADS)

Bellerby, Tim

2016-04-01

PM is a new programming language which aims to make the writing of computational geoscience models on parallel hardware accessible to scientists who are not themselves expert parallel programmers. It is based around the concept of communicating operators: language constructs that enable variables local to a single invocation of a parallelised loop to be viewed as if they were arrays spanning the entire loop domain. This mechanism enables different loop invocations (which may or may not be executing on different processors) to exchange information in a manner that extends the successful Communicating Sequential Processes idiom from single messages to collective communication. Communicating operators avoid the additional synchronisation mechanisms, such as atomic variables, required when programming using the Partitioned Global Address Space (PGAS) paradigm. Using a single loop invocation as the fundamental unit of concurrency enables PM to uniformly represent different levels of parallelism from vector operations through shared memory systems to distributed grids. This paper describes an implementation of PM based on a vectorised virtual machine. On a single processor node, concurrent operations are implemented using masked vector operations. Virtual machine instructions operate on vectors of values and may be unmasked, masked using a Boolean field, or masked using an array of active vector cell locations. Conditional structures (such as if-then-else or while statement implementations) calculate and apply masks to the operations they control. A shift in mask representation from Boolean to location-list occurs when active locations become sufficiently sparse. Parallel loops unfold data structures (or vectors of data structures for nested loops) into vectors of values that may additionally be distributed over multiple computational nodes and then split into micro-threads compatible with the size of the local cache. Inter-node communication is accomplished using standard OpenMP and MPI. Performance analyses of the PM vector machine, demonstrating its scaling properties with respect to domain size and the number of processor nodes will be presented for a range of hardware configurations. The PM software and language definition are being made available under unrestrictive MIT and Creative Commons Attribution licenses respectively: www.pm-lang.org.
Burbank works on the EPIC in the Node 2

NASA Image and Video Library

2012-02-28

ISS030-E-114433 (29 Feb. 2012) --- In the International Space Station?s Destiny laboratory, NASA astronaut Dan Burbank, Expedition 30 commander, upgrades Multiplexer/Demultiplexer (MDM) computers and Portable Computer System (PCS) laptops and installs the Enhanced Processor & Integrated Communications (EPIC) hardware in the Payload 1 (PL-1) MDM.
mDARAL: A Multi-Radio Version for the DARAL Routing Algorithm.

PubMed

Estévez, Francisco José; Castillo-Secilla, José María; González, Jesús; Olivares, Joaquín; Glösekötter, Peter

2017-02-09

Smart Cities are called to change the daily life of human beings. This concept permits improving the efficiency of our cities in several areas such as the use of water, energy consumption, waste treatment, and mobility both for people as well as vehicles throughout the city. This represents an interconnected scenario in which thousands of embedded devices need to work in a collaborative way both for sensing and modifying the environment properly. Under this scenario, the majority of devices will use wireless protocols for communicating among them, representing a challenge for optimizing the use of the electromagnetic spectrum. When the density of deployed nodes increases, the competition for using the physical medium becomes harder and, in consequence, traffic collisions will be higher, affecting data-rates in the communication process. This work presents mDARAL , a multi-radio routing algorithm based on the Dynamic and Adaptive Radio Algorithm ( DARAL ), which has the capability of isolating groups of nodes into sub-networks. The nodes of each sub-network will communicate among them using a dedicated radio frequency, thus isolating the use of the radio channel to a reduced number of nodes. Each sub-network will have a master node with two physical radios, one for communicating with its neighbours and the other for being the contact point among its group and other sub-networks. The communication among sub-networks is done through master nodes in a dedicated radio frequency. The algorithm works to maximize the overall performance of the network through the distribution of the traffic messages into unoccupied frequencies. The obtained results show that mDARAL achieves great improvement in terms of the number of control messages necessary to connect a node to the network, convergence time and energy consumption during the connection phase compared to DARAL .
mDARAL: A Multi-Radio Version for the DARAL Routing Algorithm

PubMed Central

Estévez, Francisco José; Castillo-Secilla, José María; González, Jesús; Olivares, Joaquín; Glösekötter, Peter

2017-01-01

Smart Cities are called to change the daily life of human beings. This concept permits improving the efficiency of our cities in several areas such as the use of water, energy consumption, waste treatment, and mobility both for people as well as vehicles throughout the city. This represents an interconnected scenario in which thousands of embedded devices need to work in a collaborative way both for sensing and modifying the environment properly. Under this scenario, the majority of devices will use wireless protocols for communicating among them, representing a challenge for optimizing the use of the electromagnetic spectrum. When the density of deployed nodes increases, the competition for using the physical medium becomes harder and, in consequence, traffic collisions will be higher, affecting data-rates in the communication process. This work presents mDARAL, a multi-radio routing algorithm based on the Dynamic and Adaptive Radio Algorithm (DARAL), which has the capability of isolating groups of nodes into sub-networks. The nodes of each sub-network will communicate among them using a dedicated radio frequency, thus isolating the use of the radio channel to a reduced number of nodes. Each sub-network will have a master node with two physical radios, one for communicating with its neighbours and the other for being the contact point among its group and other sub-networks. The communication among sub-networks is done through master nodes in a dedicated radio frequency. The algorithm works to maximize the overall performance of the network through the distribution of the traffic messages into unoccupied frequencies. The obtained results show that mDARAL achieves great improvement in terms of the number of control messages necessary to connect a node to the network, convergence time and energy consumption during the connection phase compared to DARAL. PMID:28208760
Garbage Collection in a Distributed Object-Oriented System

NASA Technical Reports Server (NTRS)

Gupta, Aloke; Fuchs, W. Kent

1993-01-01

An algorithm is described in this paper for garbage collection in distributed systems with object sharing across processor boundaries. The algorithm allows local garbage collection at each node in the system to proceed independently of local collection at the other nodes. It requires no global synchronization or knowledge of the global state of the system and exhibits the capability of graceful degradation. The concept of a specialized dump node is proposed to facilitate the collection of inaccessible circular structures. An experimental evaluation of the algorithm is also described. The algorithm is compared with a corresponding scheme that requires global synchronization. The results show that the algorithm works well in distributed processing environments even when the locality of object references is low.
Direct memory access transfer completion notification

DOEpatents

Chen, Dong; Giampapa, Mark E.; Heidelberger, Philip; Kumar, Sameer; Parker, Jeffrey J.; Steinmacher-Burow, Burkhard D.; Vranas, Pavlos

2010-07-27

Methods, compute nodes, and computer program products are provided for direct memory access (`DMA`) transfer completion notification. Embodiments include determining, by an origin DMA engine on an origin compute node, whether a data descriptor for an application message to be sent to a target compute node is currently in an injection first-in-first-out (`FIFO`) buffer in dependence upon a sequence number previously associated with the data descriptor, the total number of descriptors currently in the injection FIFO buffer, and the current sequence number for the newest data descriptor stored in the injection FIFO buffer; and notifying a processor core on the origin DMA engine that the message has been sent if the data descriptor for the message is not currently in the injection FIFO buffer.
Parallelization of MRCI based on hole-particle symmetry.

PubMed

Suo, Bing; Zhai, Gaohong; Wang, Yubin; Wen, Zhenyi; Hu, Xiangqian; Li, Lemin

2005-01-15

The parallel implementation of multireference configuration interaction program based on the hole-particle symmetry is described. The platform to implement the parallelization is an Intel-Architectural cluster consisting of 12 nodes, each of which is equipped with two 2.4-G XEON processors, 3-GB memory, and 36-GB disk, and are connected by a Gigabit Ethernet Switch. The dependence of speedup on molecular symmetries and task granularities is discussed. Test calculations show that the scaling with the number of nodes is about 1.9 (for C1 and Cs), 1.65 (for C2v), and 1.55 (for D2h) when the number of nodes is doubled. The largest calculation performed on this cluster involves 5.6 x 10(8) CSFs.
NATIONAL WATER INFORMATION SYSTEM OF THE U. S. GEOLOGICAL SURVEY.

USGS Publications Warehouse

Edwards, Melvin D.

1985-01-01

National Water Information System (NWIS) has been designed as an interactive, distributed data system. It will integrate the existing, diverse data-processing systems into a common system. It will also provide easier, more flexible use as well as more convenient access and expanded computing, dissemination, and data-analysis capabilities. The NWIS is being implemented as part of a Distributed Information System (DIS) being developed by the Survey's Water Resources Division. The NWIS will be implemented on each node of the distributed network for the local processing, storage, and dissemination of hydrologic data collected within the node's area of responsibility. The processor at each node will also be used to perform hydrologic modeling, statistical data analysis, text editing, and some administrative work.
High order parallel numerical schemes for solving incompressible flows

NASA Technical Reports Server (NTRS)

Lin, Avi; Milner, Edward J.; Liou, May-Fun; Belch, Richard A.

1992-01-01

The use of parallel computers for numerically solving flow fields has gained much importance in recent years. This paper introduces a new high order numerical scheme for computational fluid dynamics (CFD) specifically designed for parallel computational environments. A distributed MIMD system gives the flexibility of treating different elements of the governing equations with totally different numerical schemes in different regions of the flow field. The parallel decomposition of the governing operator to be solved is the primary parallel split. The primary parallel split was studied using a hypercube like architecture having clusters of shared memory processors at each node. The approach is demonstrated using examples of simple steady state incompressible flows. Future studies should investigate the secondary split because, depending on the numerical scheme that each of the processors applies and the nature of the flow in the specific subdomain, it may be possible for a processor to seek better, or higher order, schemes for its particular subcase.
Going End to End to Deliver High-Speed Data

NASA Technical Reports Server (NTRS)

2005-01-01

By the end of the 1990s, the optical fiber "backbone" of the telecommunication and data-communication networks had evolved from megabits-per-second transmission rates to gigabits-per-second transmission rates. Despite this boom in bandwidth, however, users at the end nodes were still not being reached on a consistent basis. (An end node is any device that does not behave like a router or a managed hub or switch. Examples of end node objects are computers, printers, serial interface processor phones, and unmanaged hubs and switches.) The primary reason that prevents bandwidth from reaching the end nodes is the complex local network topology that exists between the optical backbone and the end nodes. This complex network topology consists of several layers of routing and switch equipment which introduce potential congestion points and network latency. By breaking down the complex network topology, a true optical connection can be achieved. Access Optical Networks, Inc., is making this connection a reality with guidance from NASA s nondestructive evaluation experts.
Data base manipulation for assessment of multiresource suitability and land change

NASA Technical Reports Server (NTRS)

Colwell, J.; Sanders, P.; Davis, G.; Thomson, F. (Principal Investigator)

1981-01-01

Progress is reported in three tasks which support the overall objectives of renewable resources inventory task of the AgRISTARS program. In the first task, the geometric correction algorithms of the Master Data Processor were investigated to determine the utility of data corrected by this processor for U.S. Forest Service uses. The second task involved investigation of logic to form blobs as a precursor step to automatic change detection involving two dates of LANDSAT data. Some routine procedures for selecting BLOB (spatial averaging) parameters were developed. In the third task, a major effort was made to develop land suitability modeling approches for timber, grazing, and wildlife habitat in support of resource planning efforts on the San Juan National Forest.
Parallel Directionally Split Solver Based on Reformulation of Pipelined Thomas Algorithm

NASA Technical Reports Server (NTRS)

Povitsky, A.

1998-01-01

In this research an efficient parallel algorithm for 3-D directionally split problems is developed. The proposed algorithm is based on a reformulated version of the pipelined Thomas algorithm that starts the backward step computations immediately after the completion of the forward step computations for the first portion of lines This algorithm has data available for other computational tasks while processors are idle from the Thomas algorithm. The proposed 3-D directionally split solver is based on the static scheduling of processors where local and non-local, data-dependent and data-independent computations are scheduled while processors are idle. A theoretical model of parallelization efficiency is used to define optimal parameters of the algorithm, to show an asymptotic parallelization penalty and to obtain an optimal cover of a global domain with subdomains. It is shown by computational experiments and by the theoretical model that the proposed algorithm reduces the parallelization penalty about two times over the basic algorithm for the range of the number of processors (subdomains) considered and the number of grid nodes per subdomain.
Performance and scalability evaluation of "Big Memory" on Blue Gene Linux.

DOE Office of Scientific and Technical Information (OSTI.GOV)

Yoshii, K.; Iskra, K.; Naik, H.

2011-05-01

We address memory performance issues observed in Blue Gene Linux and discuss the design and implementation of 'Big Memory' - an alternative, transparent memory space introduced to eliminate the memory performance issues. We evaluate the performance of Big Memory using custom memory benchmarks, NAS Parallel Benchmarks, and the Parallel Ocean Program, at a scale of up to 4,096 nodes. We find that Big Memory successfully resolves the performance issues normally encountered in Blue Gene Linux. For the ocean simulation program, we even find that Linux with Big Memory provides better scalability than does the lightweight compute node kernel designed solelymore » for high-performance applications. Originally intended exclusively for compute node tasks, our new memory subsystem dramatically improves the performance of certain I/O node applications as well. We demonstrate this performance using the central processor of the LOw Frequency ARray radio telescope as an example.« less
Initial Kernel Timing Using a Simple PIM Performance Model

NASA Technical Reports Server (NTRS)

Katz, Daniel S.; Block, Gary L.; Springer, Paul L.; Sterling, Thomas; Brockman, Jay B.; Callahan, David

2005-01-01

This presentation will describe some initial results of paper-and-pencil studies of 4 or 5 application kernels applied to a processor-in-memory (PIM) system roughly similar to the Cascade Lightweight Processor (LWP). The application kernels are: * Linked list traversal * Sun of leaf nodes on a tree * Bitonic sort * Vector sum * Gaussian elimination The intent of this work is to guide and validate work on the Cascade project in the areas of compilers, simulators, and languages. We will first discuss the generic PIM structure. Then, we will explain the concepts needed to program a parallel PIM system (locality, threads, parcels). Next, we will present a simple PIM performance model that will be used in the remainder of the presentation. For each kernel, we will then present a set of codes, including codes for a single PIM node, and codes for multiple PIM nodes that move data to threads and move threads to data. These codes are written at a fairly low level, between assembly and C, but much closer to C than to assembly. For each code, we will present some hand-drafted timing forecasts, based on the simple PIM performance model. Finally, we will conclude by discussing what we have learned from this work, including what programming styles seem to work best, from the point-of-view of both expressiveness and performance.
ADEN ALOS PALSAR Product Verification

NASA Astrophysics Data System (ADS)

Wright, P. A.; Meadows, P. J.; Mack, G.; Miranda, N.; Lavalle, M.

2008-11-01

Within the ALOS Data European Node (ADEN) the verification of PALSAR products is an important and continuing activity, to ensure data utility for the users. The paper will give a summary of the verification activities, the status of the ADEN PALSAR processor and the current quality issues that are important for users of ADEN PALSAR data.
Algorithm implementation on the Navier-Stokes computer

DOE Office of Scientific and Technical Information (OSTI.GOV)

Krist, S.E.; Zang, T.A.

1987-03-01

The Navier-Stokes Computer is a multi-purpose parallel-processing supercomputer which is currently under development at Princeton University. It consists of multiple local memory parallel processors, called Nodes, which are interconnected in a hypercube network. Details of the procedures involved in implementing an algorithm on the Navier-Stokes computer are presented. The particular finite difference algorithm considered in this analysis was developed for simulation of laminar-turbulent transition in wall bounded shear flows. Projected timing results for implementing this algorithm indicate that operation rates in excess of 42 GFLOPS are feasible on a 128 Node machine.
Algorithm implementation on the Navier-Stokes computer

NASA Technical Reports Server (NTRS)

Krist, Steven E.; Zang, Thomas A.

1987-01-01

The Navier-Stokes Computer is a multi-purpose parallel-processing supercomputer which is currently under development at Princeton University. It consists of multiple local memory parallel processors, called Nodes, which are interconnected in a hypercube network. Details of the procedures involved in implementing an algorithm on the Navier-Stokes computer are presented. The particular finite difference algorithm considered in this analysis was developed for simulation of laminar-turbulent transition in wall bounded shear flows. Projected timing results for implementing this algorithm indicate that operation rates in excess of 42 GFLOPS are feasible on a 128 Node machine.
Advanced flight computer. Special study

NASA Technical Reports Server (NTRS)

Coo, Dennis

1995-01-01

This report documents a special study to define a 32-bit radiation hardened, SEU tolerant flight computer architecture, and to investigate current or near-term technologies and development efforts that contribute to the Advanced Flight Computer (AFC) design and development. An AFC processing node architecture is defined. Each node may consist of a multi-chip processor as needed. The modular, building block approach uses VLSI technology and packaging methods that demonstrate a feasible AFC module in 1998 that meets that AFC goals. The defined architecture and approach demonstrate a clear low-risk, low-cost path to the 1998 production goal, with intermediate prototypes in 1996.
General framework for dynamic large deformation contact problems based on phantom-node X-FEM

NASA Astrophysics Data System (ADS)

Broumand, P.; Khoei, A. R.

2018-04-01

This paper presents a general framework for modeling dynamic large deformation contact-impact problems based on the phantom-node extended finite element method. The large sliding penalty contact formulation is presented based on a master-slave approach which is implemented within the phantom-node X-FEM and an explicit central difference scheme is used to model the inertial effects. The method is compared with conventional contact X-FEM; advantages, limitations and implementational aspects are also addressed. Several numerical examples are presented to show the robustness and accuracy of the proposed method.
Modern multicore and manycore architectures: Modelling, optimisation and benchmarking a multiblock CFD code

NASA Astrophysics Data System (ADS)

Hadade, Ioan; di Mare, Luca

2016-08-01

Modern multicore and manycore processors exhibit multiple levels of parallelism through a wide range of architectural features such as SIMD for data parallel execution or threads for core parallelism. The exploitation of multi-level parallelism is therefore crucial for achieving superior performance on current and future processors. This paper presents the performance tuning of a multiblock CFD solver on Intel SandyBridge and Haswell multicore CPUs and the Intel Xeon Phi Knights Corner coprocessor. Code optimisations have been applied on two computational kernels exhibiting different computational patterns: the update of flow variables and the evaluation of the Roe numerical fluxes. We discuss at great length the code transformations required for achieving efficient SIMD computations for both kernels across the selected devices including SIMD shuffles and transpositions for flux stencil computations and global memory transformations. Core parallelism is expressed through threading based on a number of domain decomposition techniques together with optimisations pertaining to alleviating NUMA effects found in multi-socket compute nodes. Results are correlated with the Roofline performance model in order to assert their efficiency for each distinct architecture. We report significant speedups for single thread execution across both kernels: 2-5X on the multicore CPUs and 14-23X on the Xeon Phi coprocessor. Computations at full node and chip concurrency deliver a factor of three speedup on the multicore processors and up to 24X on the Xeon Phi manycore coprocessor.

DOE Office of Scientific and Technical Information (OSTI.GOV)

Learn, Mark Walter

Sandia National Laboratories is currently developing new processing and data communication architectures for use in future satellite payloads. These architectures will leverage the flexibility and performance of state-of-the-art static-random-access-memory-based Field Programmable Gate Arrays (FPGAs). One such FPGA is the radiation-hardened version of the Virtex-5 being developed by Xilinx. However, not all features of this FPGA are being radiation-hardened by design and could still be susceptible to on-orbit upsets. One such feature is the embedded hard-core PPC440 processor. Since this processor is implemented in the FPGA as a hard-core, traditional mitigation approaches such as Triple Modular Redundancy (TMR) are not availablemore » to improve the processor's on-orbit reliability. The goal of this work is to investigate techniques that can help mitigate the embedded hard-core PPC440 processor within the Virtex-5 FPGA other than TMR. Implementing various mitigation schemes reliably within the PPC440 offers a powerful reconfigurable computing resource to these node-based processing architectures. This document summarizes the work done on the cache mitigation scheme for the embedded hard-core PPC440 processor within the Virtex-5 FPGAs, and describes in detail the design of the cache mitigation scheme and the testing conducted at the radiation effects facility on the Texas A&M campus.« less
Fuzzy logic controller to improve powerline communication

NASA Astrophysics Data System (ADS)

Tirrito, Salvatore

2015-12-01

The Power Line Communications (PLC) technology allows the use of the power grid in order to ensure the exchange of data information among devices. This work proposes an approach, based on Fuzzy Logic, that dynamically manages the amplitude of the signal, with which each node transmits, by processing the master-slave link quality measured and the master-slave distance. The main objective of this is to reduce both the impact of communication interferences induced and power consumption.
Software/hardware distributed processing network supporting the Ada environment

NASA Astrophysics Data System (ADS)

Wood, Richard J.; Pryk, Zen

1993-09-01

A high-performance, fault-tolerant, distributed network has been developed, tested, and demonstrated. The network is based on the MIPS Computer Systems, Inc. R3000 Risc for processing, VHSIC ASICs for high speed, reliable, inter-node communications and compatible commercial memory and I/O boards. The network is an evolution of the Advanced Onboard Signal Processor (AOSP) architecture. It supports Ada application software with an Ada- implemented operating system. A six-node implementation (capable of expansion up to 256 nodes) of the RISC multiprocessor architecture provides 120 MIPS of scalar throughput, 96 Mbytes of RAM and 24 Mbytes of non-volatile memory. The network provides for all ground processing applications, has merit for space-qualified RISC-based network, and interfaces to advanced Computer Aided Software Engineering (CASE) tools for application software development.
[A novel biologic electricity signal measurement based on neuron chip].

PubMed

Lei, Yinsheng; Wang, Mingshi; Sun, Tongjing; Zhu, Qiang; Qin, Ran

2006-06-01

Neuron chip is a multiprocessor with three pipeline CPU; its communication protocol and control processor are integrated in effect to carry out the function of communication, control, attemper, I/O, etc. A novel biologic electronic signal measurement network system is composed of intelligent measurement nodes with neuron chip at the core. In this study, the electronic signals such as ECG, EEG, EMG and BOS can be synthetically measured by those intelligent nodes, and some valuable diagnostic messages are found. Wavelet transform is employed in this system to analyze various biologic electronic signals due to its strong time-frequency ability of decomposing signal local character. Better effect is gained. This paper introduces the hardware structure of network and intelligent measurement node, the measurement theory and the signal figure of data acquisition and processing.
Large-N in Volcano Settings: Volcanosri

NASA Astrophysics Data System (ADS)

Lees, J. M.; Song, W.; Xing, G.; Vick, S.; Phillips, D.

2014-12-01

We seek a paradigm shift in the approach we take on volcano monitoring where the compromise from high fidelity to large numbers of sensors is used to increase coverage and resolution. Accessibility, danger and the risk of equipment loss requires that we develop systems that are independent and inexpensive. Furthermore, rather than simply record data on hard disk for later analysis we desire a system that will work autonomously, capitalizing on wireless technology and in field network analysis. To this end we are currently producing a low cost seismic array which will incorporate, at the very basic level, seismological tools for first cut analysis of a volcano in crises mode. At the advanced end we expect to perform tomographic inversions in the network in near real time. Geophone (4 Hz) sensors connected to a low cost recording system will be installed on an active volcano where triggering earthquake location and velocity analysis will take place independent of human interaction. Stations are designed to be inexpensive and possibly disposable. In one of the first implementations the seismic nodes consist of an Arduino Due processor board with an attached Seismic Shield. The Arduino Due processor board contains an Atmel SAM3X8E ARM Cortex-M3 CPU. This 32 bit 84 MHz processor can filter and perform coarse seismic event detection on a 1600 sample signal in fewer than 200 milliseconds. The Seismic Shield contains a GPS module, 900 MHz high power mesh network radio, SD card, seismic amplifier, and 24 bit ADC. External sensors can be attached to either this 24-bit ADC or to the internal multichannel 12 bit ADC contained on the Arduino Due processor board. This allows the node to support attachment of multiple sensors. By utilizing a high-speed 32 bit processor complex signal processing tasks can be performed simultaneously on multiple sensors. Using a 10 W solar panel, second system being developed can run autonomously and collect data on 3 channels at 100Hz for 6 months with the installed 16Gb SD card. Initial designs and test results will be presented and discussed.
Benchmarking NWP Kernels on Multi- and Many-core Processors

NASA Astrophysics Data System (ADS)

Michalakes, J.; Vachharajani, M.

2008-12-01

Increased computing power for weather, climate, and atmospheric science has provided direct benefits for defense, agriculture, the economy, the environment, and public welfare and convenience. Today, very large clusters with many thousands of processors are allowing scientists to move forward with simulations of unprecedented size. But time-critical applications such as real-time forecasting or climate prediction need strong scaling: faster nodes and processors, not more of them. Moreover, the need for good cost- performance has never been greater, both in terms of performance per watt and per dollar. For these reasons, the new generations of multi- and many-core processors being mass produced for commercial IT and "graphical computing" (video games) are being scrutinized for their ability to exploit the abundant fine- grain parallelism in atmospheric models. We present results of our work to date identifying key computational kernels within the dynamics and physics of a large community NWP model, the Weather Research and Forecast (WRF) model. We benchmark and optimize these kernels on several different multi- and many-core processors. The goals are to (1) characterize and model performance of the kernels in terms of computational intensity, data parallelism, memory bandwidth pressure, memory footprint, etc. (2) enumerate and classify effective strategies for coding and optimizing for these new processors, (3) assess difficulties and opportunities for tool or higher-level language support, and (4) establish a continuing set of kernel benchmarks that can be used to measure and compare effectiveness of current and future designs of multi- and many-core processors for weather and climate applications.
Fault-Tolerant Local-Area Network

NASA Technical Reports Server (NTRS)

Morales, Sergio; Friedman, Gary L.

1988-01-01

Local-area network (LAN) for computers prevents single-point failure from interrupting communication between nodes of network. Includes two complete cables, LAN 1 and LAN 2. Microprocessor-based slave switches link cables to network-node devices as work stations, print servers, and file servers. Slave switches respond to commands from master switch, connecting nodes to two cable networks or disconnecting them so they are completely isolated. System monitor and control computer (SMC) acts as gateway, allowing nodes on either cable to communicate with each other and ensuring that LAN 1 and LAN 2 are fully used when functioning properly. Network monitors and controls itself, automatically routes traffic for efficient use of resources, and isolates and corrects its own faults, with potential dramatic reduction in time out of service.
Comparison of Communication Architectures and Network Topologies for Distributed Propulsion Controls (Preprint)

DTIC Science & Technology

2013-05-01

logic to perform control function computations and are connected to the full authority digital engine control ( FADEC ) via a high-speed data...Digital Engine Control ( FADEC ) via a high speed data communication bus. The short term distributed engine control configu- rations will be core...concen- trator; and high temperature electronics, high speed communication bus between the data concentrator and the control law processor master FADEC
Delivery and application of precise timing for a traveling wave powerline fault locator system

NASA Technical Reports Server (NTRS)

Street, Michael A.

1990-01-01

The Bonneville Power Administration (BPA) has successfully operated an in-house developed powerline fault locator system since 1986. The BPA fault locator system consists of remotes installed at cardinal power transmission line system nodes and a central master which polls the remotes for traveling wave time-of-arrival data. A power line fault produces a fast rise-time traveling wave which emanates from the fault point and propagates throughout the power grid. The remotes time-tag the traveling wave leading edge as it passes through the power system cardinal substation nodes. A synchronizing pulse transmitted via the BPA analog microwave system on a wideband channel sychronizes the time-tagging counters in the remote units to a different accuracy of better than one microsecond. The remote units correct the raw time tags for synchronizing pulse propagation delay and return these corrected values to the fault locator master. The master then calculates the power system disturbance source using the collected time tags. The system design objective is a fault location accuracy of 300 meters. BPA's fault locator system operation, error producing phenomena, and method of distributing precise timing are described.
The Control Point Library Building System. [for Landsat MSS and RBV geometric image correction

NASA Technical Reports Server (NTRS)

Niblack, W.

1981-01-01

The Earth Resources Observation System (EROS) Data Center in Sioux Falls, South Dakota distributes precision corrected Landsat MSS and RBV data. These data are derived from master data tapes produced by the Master Data Processor (MDP), NASA's system for computing and applying corrections to the data. Included in the MDP is the Control Point Library Building System (CPLBS), an interactive, menu-driven system which permits a user to build and maintain libraries of control points. The control points are required to achieve the high geometric accuracy desired in the output MSS and RBV data. This paper describes the processing performed by CPLBS, the accuracy of the system, and the host computer and special image viewing equipment employed.
Using Nested Contractions and a Hierarchical Tensor Format To Compute Vibrational Spectra of Molecules with Seven Atoms.

PubMed

Thomas, Phillip S; Carrington, Tucker

2015-12-31

We propose a method for solving the vibrational Schrödinger equation with which one can compute hundreds of energy levels of seven-atom molecules using at most a few gigabytes of memory. It uses nested contractions in conjunction with the reduced-rank block power method (RRBPM) described in J. Chem. Phys. 2014, 140, 174111. Successive basis contractions are organized into a tree, the nodes of which are associated with eigenfunctions of reduced-dimension Hamiltonians. The RRBPM is used recursively to compute eigenfunctions of nodes in bases of products of reduced-dimension eigenfunctions of nodes with fewer coordinates. The corresponding vectors are tensors in what is called CP-format. The final wave functions are therefore represented in a hierarchical CP-format. Computational efficiency and accuracy are significantly improved by representing the Hamiltonian in the same hierarchical format as the wave function. We demonstrate that with this hierarchical RRBPM it is possible to compute energy levels of a 64-D coupled-oscillator model Hamiltonian and also of acetonitrile (CH3CN) and ethylene oxide (C2H4O), for which we use quartic potentials. The most accurate acetonitrile calculation uses 139 MB of memory and takes 3.2 h on a single processor. The most accurate ethylene oxide calculation uses 6.1 GB of memory and takes 14 d on 63 processors. The hierarchical RRBPM shatters the memory barrier that impedes the calculation of vibrational spectra.
A computational system for lattice QCD with overlap Dirac quarks

NASA Astrophysics Data System (ADS)

Chiu, Ting-Wai; Hsieh, Tung-Han; Huang, Chao-Hsi; Huang, Tsung-Ren

2003-05-01

We outline the essential features of a Linux PC cluster which is now being developed at National Taiwan University, and discuss how to optimize its hardware and software for lattice QCD with overlap Dirac quarks. At present, the cluster constitutes of 30 nodes, with each node consisting of one Pentium 4 processor (1.6/2.0 GHz), one Gbyte of PC800 RDRAM, one 40/80 Gbyte hard disk, and a network card. The speed of this system is estimated to be 30 Gflops, and its price/performance ratio is better than $1.0/Mflops for 64-bit (double precision) computations in quenched lattice QCD with overlap Dirac quarks.
Real-time calibration-free C-scan images of the eye fundus using Master Slave swept source optical coherence tomography

NASA Astrophysics Data System (ADS)

Bradu, Adrian; Kapinchev, Konstantin; Barnes, Fred; Garway-Heath, David F.; Rajendram, Ranjan; Keane, Pearce; Podoleanu, Adrian G.

2015-03-01

Recently, we introduced a novel Optical Coherence Tomography (OCT) method, termed as Master Slave OCT (MS-OCT), specialized for delivering en-face images. This method uses principles of spectral domain interfereometry in two stages. MS-OCT operates like a time domain OCT, selecting only signals from a chosen depth only while scanning the laser beam across the eye. Time domain OCT allows real time production of an en-face image, although relatively slowly. As a major advance, the Master Slave method allows collection of signals from any number of depths, as required by the user. The tremendous advantage in terms of parallel provision of data from numerous depths could not be fully employed by using multi core processors only. The data processing required to generate images at multiple depths simultaneously is not achievable with commodity multicore processors only. We compare here the major improvement in processing and display, brought about by using graphic cards. We demonstrate images obtained with a swept source at 100 kHz (which determines an acquisition time [Ta] for a frame of 200×200 pixels2 of Ta =1.6 s). By the end of the acquired frame being scanned, using our computing capacity, 4 simultaneous en-face images could be created in T = 0.8 s. We demonstrate that by using graphic cards, 32 en-face images can be displayed in Td 0.3 s. Other faster swept source engines can be used with no difference in terms of Td. With 32 images (or more), volumes can be created for 3D display, using en-face images, as opposed to the current technology where volumes are created using cross section OCT images.
Flexible network wireless transceiver and flexible network telemetry transceiver

DOEpatents

Brown, Kenneth D.

2008-08-05

A transceiver for facilitating two-way wireless communication between a baseband application and other nodes in a wireless network, wherein the transceiver provides baseband communication networking and necessary configuration and control functions along with transmitter, receiver, and antenna functions to enable the wireless communication. More specifically, the transceiver provides a long-range wireless duplex communication node or channel between the baseband application, which is associated with a mobile or fixed space, air, water, or ground vehicle or other platform, and other nodes in the wireless network or grid. The transceiver broadly comprises a communication processor; a flexible telemetry transceiver including a receiver and a transmitter; a power conversion and regulation mechanism; a diplexer; and a phased array antenna system, wherein these various components and certain subcomponents thereof may be separately enclosed and distributable relative to the other components and subcomponents.
Fault isolation through no-overhead link level CRC

DOEpatents

Chen, Dong; Coteus, Paul W.; Gara, Alan G.

2007-04-24

A fault isolation technique for checking the accuracy of data packets transmitted between nodes of a parallel processor. An independent crc is kept of all data sent from one processor to another, and received from one processor to another. At the end of each checkpoint, the crcs are compared. If they do not match, there was an error. The crcs may be cleared and restarted at each checkpoint. In the preferred embodiment, the basic functionality is to calculate a CRC of all packet data that has been successfully transmitted across a given link. This CRC is done on both ends of the link, thereby allowing an independent check on all data believed to have been correctly transmitted. Preferably, all links have this CRC coverage, and the CRC used in this link level check is different from that used in the packet transfer protocol. This independent check, if successfully passed, virtually eliminates the possibility that any data errors were missed during the previous transfer period.
A multi-satellite orbit determination problem in a parallel processing environment

NASA Technical Reports Server (NTRS)

Deakyne, M. S.; Anderle, R. J.

1988-01-01

The Engineering Orbit Analysis Unit at GE Valley Forge used an Intel Hypercube Parallel Processor to investigate the performance and gain experience of parallel processors with a multi-satellite orbit determination problem. A general study was selected in which major blocks of computation for the multi-satellite orbit computations were used as units to be assigned to the various processors on the Hypercube. Problems encountered or successes achieved in addressing the orbit determination problem would be more likely to be transferable to other parallel processors. The prime objective was to study the algorithm to allow processing of observations later in time than those employed in the state update. Expertise in ephemeris determination was exploited in addressing these problems and the facility used to bring a realism to the study which would highlight the problems which may not otherwise be anticipated. Secondary objectives were to gain experience of a non-trivial problem in a parallel processor environment, to explore the necessary interplay of serial and parallel sections of the algorithm in terms of timing studies, to explore the granularity (coarse vs. fine grain) to discover the granularity limit above which there would be a risk of starvation where the majority of nodes would be idle or under the limit where the overhead associated with splitting the problem may require more work and communication time than is useful.
A monitoring system for vegetable greenhouses based on a wireless sensor network.

PubMed

Li, Xiu-hong; Cheng, Xiao; Yan, Ke; Gong, Peng

2010-01-01

A wireless sensor network-based automatic monitoring system is designed for monitoring the life conditions of greenhouse vegetables. The complete system architecture includes a group of sensor nodes, a base station, and an internet data center. For the design of wireless sensor node, the JN5139 micro-processor is adopted as the core component and the Zigbee protocol is used for wireless communication between nodes. With an ARM7 microprocessor and embedded ZKOS operating system, a proprietary gateway node is developed to achieve data influx, screen display, system configuration and GPRS based remote data forwarding. Through a Client/Server mode the management software for remote data center achieves real-time data distribution and time-series analysis. Besides, a GSM-short-message-based interface is developed for sending real-time environmental measurements, and for alarming when a measurement is beyond some pre-defined threshold. The whole system has been tested for over one year and satisfactory results have been observed, which indicate that this system is very useful for greenhouse environment monitoring.
DOE Office of Scientific and Technical Information (OSTI.GOV)

Dmitriev, Alexander S.; Yemelyanov, Ruslan Yu.; Moscow Institute of Physics and Technology

The paper deals with a new multi-element processor platform assigned for modelling the behaviour of interacting dynamical systems, i.e., active wireless network. Experimentally, this ensemble is implemented in an active network, the active nodes of which include direct chaotic transceivers and special actuator boards containing microcontrollers for modelling the dynamical systems and an information display unit (colored LEDs). The modelling technique and experimental results are described and analyzed.
Effective correlator for RadioAstron project

NASA Astrophysics Data System (ADS)

Sergeev, Sergey

This paper presents the implementation of programme FX-correlator for Very Long Baseline Interferometry, adapted for the project "RadioAstron". Software correlator implemented for heterogeneous computing systems using graphics accelerators. It is shown that for the task interferometry implementation of the graphics hardware has a high efficiency. The host processor of heterogeneous computing system, performs the function of forming the data flow for graphics accelerators, the number of which corresponds to the number of frequency channels. So, for the Radioastron project, such channels is seven. Each accelerator is perform correlation matrix for all bases for a single frequency channel. Initial data is converted to the floating-point format, is correction for the corresponding delay function and computes the entire correlation matrix simultaneously. Calculation of the correlation matrix is performed using the sliding Fourier transform. Thus, thanks to the compliance of a solved problem for architecture graphics accelerators, managed to get a performance for one processor platform Kepler, which corresponds to the performance of this task, the computing cluster platforms Intel on four nodes. This task successfully scaled not only on a large number of graphics accelerators, but also on a large number of nodes with multiple accelerators.
Cots Correlator Platform

NASA Astrophysics Data System (ADS)

Schaaf, Kjeld; Overeem, Ruud

2004-06-01

Moore’s law is best exploited by using consumer market hardware. In particular, the gaming industry pushes the limit of processor performance thus reducing the cost per raw flop even faster than Moore’s law predicts. Next to the cost benefits of Common-Of-The-Shelf (COTS) processing resources, there is a rapidly growing experience pool in cluster based processing. The typical Beowulf cluster of PC’s supercomputers are well known. Multiple examples exists of specialised cluster computers based on more advanced server nodes or even gaming stations. All these cluster machines build upon the same knowledge about cluster software management, scheduling, middleware libraries and mathematical libraries. In this study, we have integrated COTS processing resources and cluster nodes into a very high performance processing platform suitable for streaming data applications, in particular to implement a correlator. The required processing power for the correlator in modern radio telescopes is in the range of the larger supercomputers, which motivates the usage of supercomputer technology. Raw processing power is provided by graphical processors and is combined with an Infiniband host bus adapter with integrated data stream handling logic. With this processing platform a scalable correlator can be built with continuously growing processing power at consumer market prices.

Turbo Pascal Implementation of a Distributed Processing Network of MS-DOS Microcomputers Connected in a Master-Slave Configuration

DTIC Science & Technology

1989-12-01

Interrupt Procedures ....... 29 13. Support for a Larger Memory Model ................ 29 C. IMPLEMENTATION ........................................ 29...describe the programmer’s model of the hardware utilized in the microcomputers and interrupt driven serial communication considerations. Chapter III...Central Processor Unit The programming model of Table 2.1 is common to the Intel 8088, 8086 and 80x86 series of microprocessors used in the IBM PC/AT
Design and implementation of a hybrid MPI-CUDA model for the Smith-Waterman algorithm.

PubMed

Khaled, Heba; Faheem, Hossam El Deen Mostafa; El Gohary, Rania

2015-01-01

This paper provides a novel hybrid model for solving the multiple pair-wise sequence alignment problem combining message passing interface and CUDA, the parallel computing platform and programming model invented by NVIDIA. The proposed model targets homogeneous cluster nodes equipped with similar Graphical Processing Unit (GPU) cards. The model consists of the Master Node Dispatcher (MND) and the Worker GPU Nodes (WGN). The MND distributes the workload among the cluster working nodes and then aggregates the results. The WGN performs the multiple pair-wise sequence alignments using the Smith-Waterman algorithm. We also propose a modified implementation to the Smith-Waterman algorithm based on computing the alignment matrices row-wise. The experimental results demonstrate a considerable reduction in the running time by increasing the number of the working GPU nodes. The proposed model achieved a performance of about 12 Giga cell updates per second when we tested against the SWISS-PROT protein knowledge base running on four nodes.
New On-board Microprocessors

NASA Astrophysics Data System (ADS)

Weigand, R.

Two new processor devices have been developed for the use on board of spacecrafts. An 8-bit 8032-microcontroller targets typical controlling applications in instruments and sub-systems, or could be used as a main processor on small satellites, whereas the LEON 32-bit SPARC processor can be used for high performance controlling and data processing tasks. The ADV80S32 is fully compliant to the Intel 80x1 architecture and instruction set, extended by additional peripherals, 512 bytes on-chip RAM and a bootstrap PROM, which allows downloading the application software using the CCSDS PacketWire pro- tocol. The memory controller provides a de-multiplexed address/data bus, and allows to access up to 16 MB data and 8 MB program RAM. The peripherals have been de- signed for the specific needs of a spacecraft, such as serial interfaces compatible to RS232, PacketWire and TTC-B-01, counters/timers for extended duration and a CRC calculation unit accelerating the CCSDS TM/TC protocol. The 0.5 um Atmel manu- facturing technology (MG2RT) provides latch-up and total dose immunity; SEU fault immunity is implemented by using SEU hardened Flip-Flops and EDAC protection of internal and external memories. The maximum clock frequency of 20 MHz allows a processing power of 3 MIPS. Engineering samples are available. For SW develop- ment, various SW packages for the 8051 architecture are on the market. The LEON processor implements a 32-bit SPARC V8 architecture, including all the multiply and divide instructions, complemented by a floating-point unit (FPU). It includes several standard peripherals, such as timers/watchdog, interrupt controller, UARTs, parallel I/Os and a memory controller, allowing to use 8, 16 and 32 bit PROM, SRAM or memory mapped I/O. With on-chip separate instruction and data caches, almost one instruction per clock cycle can be reached in some applications. A 33-MHz 32-bit PCI master/target interface and a PCI arbiter allow operating the device in a plug-in card (for SW development on PC etc.), or to consider using it as a PCI master controller in an on-board system. Advanced SEU fault tolerance is in- troduced by design, using triple modular redundancy (TMR) flip-flops for all registers and EDAC protection for all memories. The device will be manufactured in a radia- tion hard Atmel 0.25 um technology, targeting 100 MHz processor clock frequency. The non fault-tolerant LEON processor VHDL model is available as free source code, and the SPARC architecture is a well-known industry standard. Therefore, know-how, software tools and operating systems are widely available.
A method for compression of intra-cortically-recorded neural signals dedicated to implantable brain-machine interfaces.

PubMed

Shaeri, Mohammad Ali; Sodagar, Amir M

2015-05-01

This paper proposes an efficient data compression technique dedicated to implantable intra-cortical neural recording devices. The proposed technique benefits from processing neural signals in the Discrete Haar Wavelet Transform space, a new spike extraction approach, and a novel data framing scheme to telemeter the recorded neural information to the outside world. Based on the proposed technique, a 64-channel neural signal processor was designed and prototyped as a part of a wireless implantable extra-cellular neural recording microsystem. Designed in a 0.13- μ m standard CMOS process, the 64-channel neural signal processor reported in this paper occupies ∼ 0.206 mm(2) of silicon area, and consumes 94.18 μW when operating under a 1.2-V supply voltage at a master clock frequency of 1.28 MHz.
On the Design of Smart Parking Networks in the Smart Cities: An Optimal Sensor Placement Model

PubMed Central

Bagula, Antoine; Castelli, Lorenzo; Zennaro, Marco

2015-01-01

Smart parking is a typical IoT application that can benefit from advances in sensor, actuator and RFID technologies to provide many services to its users and parking owners of a smart city. This paper considers a smart parking infrastructure where sensors are laid down on the parking spots to detect car presence and RFID readers are embedded into parking gates to identify cars and help in the billing of the smart parking. Both types of devices are endowed with wired and wireless communication capabilities for reporting to a gateway where the situation recognition is performed. The sensor devices are tasked to play one of the three roles: (1) slave sensor nodes located on the parking spot to detect car presence/absence; (2) master nodes located at one of the edges of a parking lot to detect presence and collect the sensor readings from the slave nodes; and (3) repeater sensor nodes, also called “anchor” nodes, located strategically at specific locations in the parking lot to increase the coverage and connectivity of the wireless sensor network. While slave and master nodes are placed based on geographic constraints, the optimal placement of the relay/anchor sensor nodes in smart parking is an important parameter upon which the cost and efficiency of the parking system depends. We formulate the optimal placement of sensors in smart parking as an integer linear programming multi-objective problem optimizing the sensor network engineering efficiency in terms of coverage and lifetime maximization, as well as its economic gain in terms of the number of sensors deployed for a specific coverage and lifetime. We propose an exact solution to the node placement problem using single-step and two-step solutions implemented in the Mosel language based on the Xpress-MPsuite of libraries. Experimental results reveal the relative efficiency of the single-step compared to the two-step model on different performance parameters. These results are consolidated by simulation results, which reveal that our solution outperforms a random placement in terms of both energy consumption, delay and throughput achieved by a smart parking network. PMID:26134104
On the Design of Smart Parking Networks in the Smart Cities: An Optimal Sensor Placement Model.

PubMed

Bagula, Antoine; Castelli, Lorenzo; Zennaro, Marco

2015-06-30

Smart parking is a typical IoT application that can benefit from advances in sensor, actuator and RFID technologies to provide many services to its users and parking owners of a smart city. This paper considers a smart parking infrastructure where sensors are laid down on the parking spots to detect car presence and RFID readers are embedded into parking gates to identify cars and help in the billing of the smart parking. Both types of devices are endowed with wired and wireless communication capabilities for reporting to a gateway where the situation recognition is performed. The sensor devices are tasked to play one of the three roles: (1) slave sensor nodes located on the parking spot to detect car presence/absence; (2) master nodes located at one of the edges of a parking lot to detect presence and collect the sensor readings from the slave nodes; and (3) repeater sensor nodes, also called "anchor" nodes, located strategically at specific locations in the parking lot to increase the coverage and connectivity of the wireless sensor network. While slave and master nodes are placed based on geographic constraints, the optimal placement of the relay/anchor sensor nodes in smart parking is an important parameter upon which the cost and efficiency of the parking system depends. We formulate the optimal placement of sensors in smart parking as an integer linear programming multi-objective problem optimizing the sensor network engineering efficiency in terms of coverage and lifetime maximization, as well as its economic gain in terms of the number of sensors deployed for a specific coverage and lifetime. We propose an exact solution to the node placement problem using single-step and two-step solutions implemented in the Mosel language based on the Xpress-MPsuite of libraries. Experimental results reveal the relative efficiency of the single-step compared to the two-step model on different performance parameters. These results are consolidated by simulation results, which reveal that our solution outperforms a random placement in terms of both energy consumption, delay and throughput achieved by a smart parking network.
Applications Performance on NAS Intel Paragon XP/S - 15#

NASA Technical Reports Server (NTRS)

Saini, Subhash; Simon, Horst D.; Copper, D. M. (Technical Monitor)

1994-01-01

The Numerical Aerodynamic Simulation (NAS) Systems Division received an Intel Touchstone Sigma prototype model Paragon XP/S- 15 in February, 1993. The i860 XP microprocessor with an integrated floating point unit and operating in dual -instruction mode gives peak performance of 75 million floating point operations (NIFLOPS) per second for 64 bit floating point arithmetic. It is used in the Paragon XP/S-15 which has been installed at NAS, NASA Ames Research Center. The NAS Paragon has 208 nodes and its peak performance is 15.6 GFLOPS. Here, we will report on early experience using the Paragon XP/S- 15. We have tested its performance using both kernels and applications of interest to NAS. We have measured the performance of BLAS 1, 2 and 3 both assembly-coded and Fortran coded on NAS Paragon XP/S- 15. Furthermore, we have investigated the performance of a single node one-dimensional FFT, a distributed two-dimensional FFT and a distributed three-dimensional FFT Finally, we measured the performance of NAS Parallel Benchmarks (NPB) on the Paragon and compare it with the performance obtained on other highly parallel machines, such as CM-5, CRAY T3D, IBM SP I, etc. In particular, we investigated the following issues, which can strongly affect the performance of the Paragon: a. Impact of the operating system: Intel currently uses as a default an operating system OSF/1 AD from the Open Software Foundation. The paging of Open Software Foundation (OSF) server at 22 MB to make more memory available for the application degrades the performance. We found that when the limit of 26 NIB per node out of 32 MB available is reached, the application is paged out of main memory using virtual memory. When the application starts paging, the performance is considerably reduced. We found that dynamic memory allocation can help applications performance under certain circumstances. b. Impact of data cache on the i860/XP: We measured the performance of the BLAS both assembly coded and Fortran coded. We found that the measured performance of assembly-coded BLAS is much less than what memory bandwidth limitation would predict. The influence of data cache on different sizes of vectors is also investigated using one-dimensional FFTs. c. Impact of processor layout: There are several different ways processors can be laid out within the two-dimensional grid of processors on the Paragon. We have used the FFT example to investigate performance differences based on processors layout.
Toward real-time Monte Carlo simulation using a commercial cloud computing infrastructure.

PubMed

Wang, Henry; Ma, Yunzhi; Pratx, Guillem; Xing, Lei

2011-09-07

Monte Carlo (MC) methods are the gold standard for modeling photon and electron transport in a heterogeneous medium; however, their computational cost prohibits their routine use in the clinic. Cloud computing, wherein computing resources are allocated on-demand from a third party, is a new approach for high performance computing and is implemented to perform ultra-fast MC calculation in radiation therapy. We deployed the EGS5 MC package in a commercial cloud environment. Launched from a single local computer with Internet access, a Python script allocates a remote virtual cluster. A handshaking protocol designates master and worker nodes. The EGS5 binaries and the simulation data are initially loaded onto the master node. The simulation is then distributed among independent worker nodes via the message passing interface, and the results aggregated on the local computer for display and data analysis. The described approach is evaluated for pencil beams and broad beams of high-energy electrons and photons. The output of cloud-based MC simulation is identical to that produced by single-threaded implementation. For 1 million electrons, a simulation that takes 2.58 h on a local computer can be executed in 3.3 min on the cloud with 100 nodes, a 47× speed-up. Simulation time scales inversely with the number of parallel nodes. The parallelization overhead is also negligible for large simulations. Cloud computing represents one of the most important recent advances in supercomputing technology and provides a promising platform for substantially improved MC simulation. In addition to the significant speed up, cloud computing builds a layer of abstraction for high performance parallel computing, which may change the way dose calculations are performed and radiation treatment plans are completed.
Joint Experimentation on Scalable Parallel Processors (JESPP)

DTIC Science & Technology

2006-04-01

made use of local embedded relational databases, implemented using sqlite on each node of an SPP to execute queries and return results via an ad hoc ...rl.af.mil 12a. DISTRIBUTION / AVAILABILITY STATEENT APPROVED FOR PUBLIC RELEASE; DISTRIBUTION UNLIMITED. 12b. DISTRIBUTION CODE 13. ABSTRACT...Experimentation Directorate (J9) required expansion of its joint semi-automated forces (JSAF) code capabilities; including number of entities, behavior complexity
Communication overhead on the Intel iPSC-860 hypercube

NASA Technical Reports Server (NTRS)

Bokhari, Shahid H.

1990-01-01

Experiments were conducted on the Intel iPSC-860 hypercube in order to evaluate the overhead of interprocessor communication. It is demonstrated that: (1) contrary to popular belief, the distance between two communicating processors has a significant impact on communication time, (2) edge contention can increase communication time by a factor of more than 7, and (3) node contention has no measurable impact.
MOBS - A modular on-board switching system

NASA Astrophysics Data System (ADS)

Berner, W.; Grassmann, W.; Piontek, M.

The authors describe a multibeam satellite system that is designed for business services and for communications at a high bit rate. The repeater is regenerative with a modular onboard switching system. It acts not only as baseband switch but also as the central node of the network, performing network control and protocol evaluation. The hardware is based on a modular bus/memory architecture with associated processors.
Compilation of Abstracts of Theses Submitted by Candidates for Degrees.

DTIC Science & Technology

1982-05-01

which either ti- tanium or aluminum tubes are used in the heat exchanges . Master of Science in Advisor: R. H. Nunn Mechanical Engineering Department... Testing in an Inert Environment Holihan, R. G., Jr. Investigation of Heat Transfer in 208 LCDR, USN Straight and Curved Rectangular Ducts for Laminar...for c-ft materials fatigue testing . The system uses an HP-9835 Desktop . .tnr, an HP-2240A Measurement and Control Processor and a Materials P System
Novel Hybrid Scheduling Technique for Sensor Nodes with Mixed Criticality Tasks.

PubMed

Micea, Mihai-Victor; Stangaciu, Cristina-Sorina; Stangaciu, Valentin; Curiac, Daniel-Ioan

2017-06-26

Sensor networks become increasingly a key technology for complex control applications. Their potential use in safety- and time-critical domains has raised the need for task scheduling mechanisms specially adapted to sensor node specific requirements, often materialized in predictable jitter-less execution of tasks characterized by different criticality levels. This paper offers an efficient scheduling solution, named Hybrid Hard Real-Time Scheduling (H²RTS), which combines a static, clock driven method with a dynamic, event driven scheduling technique, in order to provide high execution predictability, while keeping a high node Central Processing Unit (CPU) utilization factor. From the detailed, integrated schedulability analysis of the H²RTS, a set of sufficiency tests are introduced and demonstrated based on the processor demand and linear upper bound metrics. The performance and correct behavior of the proposed hybrid scheduling technique have been extensively evaluated and validated both on a simulator and on a sensor mote equipped with ARM7 microcontroller.
Free Mesh Method: fundamental conception, algorithms and accuracy study

PubMed Central

YAGAWA, Genki

2011-01-01

The finite element method (FEM) has been commonly employed in a variety of fields as a computer simulation method to solve such problems as solid, fluid, electro-magnetic phenomena and so on. However, creation of a quality mesh for the problem domain is a prerequisite when using FEM, which becomes a major part of the cost of a simulation. It is natural that the concept of meshless method has evolved. The free mesh method (FMM) is among the typical meshless methods intended for particle-like finite element analysis of problems that are difficult to handle using global mesh generation, especially on parallel processors. FMM is an efficient node-based finite element method that employs a local mesh generation technique and a node-by-node algorithm for the finite element calculations. In this paper, FMM and its variation are reviewed focusing on their fundamental conception, algorithms and accuracy. PMID:21558752
Hybrid Parallelism for Volume Rendering on Large-, Multi-, and Many-Core Systems

DOE Office of Scientific and Technical Information (OSTI.GOV)

Howison, Mark; Bethel, E. Wes; Childs, Hank

2012-01-01

With the computing industry trending towards multi- and many-core processors, we study how a standard visualization algorithm, ray-casting volume rendering, can benefit from a hybrid parallelism approach. Hybrid parallelism provides the best of both worlds: using distributed-memory parallelism across a large numbers of nodes increases available FLOPs and memory, while exploiting shared-memory parallelism among the cores within each node ensures that each node performs its portion of the larger calculation as efficiently as possible. We demonstrate results from weak and strong scaling studies, at levels of concurrency ranging up to 216,000, and with datasets as large as 12.2 trillion cells.more » The greatest benefit from hybrid parallelism lies in the communication portion of the algorithm, the dominant cost at higher levels of concurrency. We show that reducing the number of participants with a hybrid approach significantly improves performance.« less
A Parallel Multigrid Solver for Viscous Flows on Anisotropic Structured Grids

NASA Technical Reports Server (NTRS)

Prieto, Manuel; Montero, Ruben S.; Llorente, Ignacio M.; Bushnell, Dennis M. (Technical Monitor)

2001-01-01

This paper presents an efficient parallel multigrid solver for speeding up the computation of a 3-D model that treats the flow of a viscous fluid over a flat plate. The main interest of this simulation lies in exhibiting some basic difficulties that prevent optimal multigrid efficiencies from being achieved. As the computing platform, we have used Coral, a Beowulf-class system based on Intel Pentium processors and equipped with GigaNet cLAN and switched Fast Ethernet networks. Our study not only examines the scalability of the solver but also includes a performance evaluation of Coral where the investigated solver has been used to compare several of its design choices, namely, the interconnection network (GigaNet versus switched Fast-Ethernet) and the node configuration (dual nodes versus single nodes). As a reference, the performance results have been compared with those obtained with the NAS-MG benchmark.
Rigid body formulation in a finite element context with contact interaction

NASA Astrophysics Data System (ADS)

Refachinho de Campos, Paulo R.; Gay Neto, Alfredo

2018-03-01

The present work proposes a formulation to employ rigid bodies together with flexible bodies in the context of a nonlinear finite element solver, with contact interactions. Inertial contributions due to distribution of mass of a rigid body are fully developed, considering a general pole position associated with a single node, representing a rigid body element. Additionally, a mechanical constraint is proposed to connect a rigid region composed by several nodes, which is useful for linking rigid/flexible bodies in a finite element environment. Rodrigues rotation parameters are used to describe finite rotations, by an updated Lagrangian description. In addition, the contact formulation entitled master-surface to master-surface is employed in conjunction with the rigid body element and flexible bodies, aiming to consider their interaction in a rigid-flexible multibody environment. New surface parameterizations are presented to establish contact pairs, permitting pointwise interaction in a frictional scenario. Numerical examples are provided to show robustness and applicability of the methods.
Sudden spreading of infections in an epidemic model with a finite seed fraction

NASA Astrophysics Data System (ADS)

Hasegawa, Takehisa; Nemoto, Koji

2018-03-01

We study a simple case of the susceptible-weakened-infected-removed model in regular random graphs in a situation where an epidemic starts from a finite fraction of initially infected nodes (seeds). Previous studies have shown that, assuming a single seed, this model exhibits a kind of discontinuous transition at a certain value of infection rate. Performing Monte Carlo simulations and evaluating approximate master equations, we find that the present model has two critical infection rates for the case with a finite seed fraction. At the first critical rate the system shows a percolation transition of clusters composed of removed nodes, and at the second critical rate, which is larger than the first one, a giant cluster suddenly grows and the order parameter jumps even though it has been already rising. Numerical evaluation of the master equations shows that such sudden epidemic spreading does occur if the degree of the underlying network is large and the seed fraction is small.
Endobronchial ultrasound-guided transbronchial needle aspiration for lung cancer staging: early experience in Brazil*,**

PubMed Central

Figueiredo, Viviane Rossi; Cardoso, Paulo Francisco Guerreiro; Jacomelli, Márcia; Demarzo, Sérgio Eduardo; Palomino, Addy Lidvina Mejia; Rodrigues, Ascédio José; Terra, Ricardo Mingarini; Pego-Fernandes, Paulo Manoel; Carvalho, Carlos Roberto Ribeiro

2015-01-01

Objective: Endobronchial ultrasound-guided transbronchial needle aspiration (EBUS-TBNA) is a minimally invasive, safe and accurate method for collecting samples from mediastinal and hilar lymph nodes. This study focused on the initial results obtained with EBUS-TBNA for lung cancer and lymph node staging at three teaching hospitals in Brazil. Methods: This was a retrospective analysis of patients diagnosed with lung cancer and submitted to EBUS-TBNA for mediastinal lymph node staging. The EBUS-TBNA procedures, which involved the use of an EBUS scope, an ultrasound processor, and a compatible, disposable 22 G needle, were performed while the patients were under general anesthesia. Results: Between January of 2011 and January of 2014, 149 patients underwent EBUS-TBNA for lymph node staging. The mean age was 66 ± 12 years, and 58% were male. A total of 407 lymph nodes were sampled by EBUS-TBNA. The most common types of lung neoplasm were adenocarcinoma (in 67%) and squamous cell carcinoma (in 24%). For lung cancer staging, EBUS-TBNA was found to have a sensitivity of 96%, a specificity of 100%, and a negative predictive value of 85%. Conclusions: We found EBUS-TBNA to be a safe and accurate method for lymph node staging in lung cancer patients. PMID:25750671
Endobronchial ultrasound-guided transbronchial needle aspiration for lung cancer staging: early experience in Brazil.

PubMed

Figueiredo, Viviane Rossi; Cardoso, Paulo Francisco Guerreiro; Jacomelli, Márcia; Demarzo, Sérgio Eduardo; Palomino, Addy Lidvina Mejia; Rodrigues, Ascédio José; Terra, Ricardo Mingarini; Pego-Fernandes, Paulo Manoel; Carvalho, Carlos Roberto Ribeiro

2015-01-01

Endobronchial ultrasound-guided transbronchial needle aspiration (EBUS-TBNA) is a minimally invasive, safe and accurate method for collecting samples from mediastinal and hilar lymph nodes. This study focused on the initial results obtained with EBUS-TBNA for lung cancer and lymph node staging at three teaching hospitals in Brazil. This was a retrospective analysis of patients diagnosed with lung cancer and submitted to EBUS-TBNA for mediastinal lymph node staging. The EBUS-TBNA procedures, which involved the use of an EBUS scope, an ultrasound processor, and a compatible, disposable 22 G needle, were performed while the patients were under general anesthesia. Between January of 2011 and January of 2014, 149 patients underwent EBUS-TBNA for lymph node staging. The mean age was 66 ± 12 years, and 58% were male. A total of 407 lymph nodes were sampled by EBUS-TBNA. The most common types of lung neoplasm were adenocarcinoma (in 67%) and squamous cell carcinoma (in 24%). For lung cancer staging, EBUS-TBNA was found to have a sensitivity of 96%, a specificity of 100%, and a negative predictive value of 85%. We found EBUS-TBNA to be a safe and accurate method for lymph node staging in lung cancer patients.

Performance of an MPI-only semiconductor device simulator on a quad socket/quad core InfiniBand platform.

DOE Office of Scientific and Technical Information (OSTI.GOV)

Shadid, John Nicolas; Lin, Paul Tinphone

2009-01-01

This preliminary study considers the scaling and performance of a finite element (FE) semiconductor device simulator on a capacity cluster with 272 compute nodes based on a homogeneous multicore node architecture utilizing 16 cores. The inter-node communication backbone for this Tri-Lab Linux Capacity Cluster (TLCC) machine is comprised of an InfiniBand interconnect. The nonuniform memory access (NUMA) nodes consist of 2.2 GHz quad socket/quad core AMD Opteron processors. The performance results for this study are obtained with a FE semiconductor device simulation code (Charon) that is based on a fully-coupled Newton-Krylov solver with domain decomposition and multilevel preconditioners. Scaling andmore » multicore performance results are presented for large-scale problems of 100+ million unknowns on up to 4096 cores. A parallel scaling comparison is also presented with the Cray XT3/4 Red Storm capability platform. The results indicate that an MPI-only programming model for utilizing the multicore nodes is reasonably efficient on all 16 cores per compute node. However, the results also indicated that the multilevel preconditioner, which is critical for large-scale capability type simulations, scales better on the Red Storm machine than the TLCC machine.« less
Wireless and Powerless Sensing Node System Developed for Monitoring Motors.

PubMed

Lee, Dasheng

2008-08-27

Reliability and maintainability of tooling systems can be improved through condition monitoring of motors. However, it is difficult to deploy sensor nodes due to the harsh environment of industrial plants. Sensor cables are easily damaged, which renders the monitoring system deployed to assure the machine's reliability itself unreliable. A wireless and powerless sensing node integrated with a MEMS (Micro Electro-Mechanical System) sensor, a signal processor, a communication module, and a self-powered generator was developed in this study for implementation of an easily mounted network sensor for monitoring motors. A specially designed communication module transmits a sequence of electromagnetic (EM) pulses in response to the sensor signals. The EM pulses can penetrate through the machine's metal case and delivers signals from the sensor inside the motor to the external data acquisition center. By using induction power, which is generated by the motor's shaft rotation, the sensor node is self-sustaining; therefore, no power line is required. A monitoring system, equipped with novel sensing nodes, was constructed to test its performance. The test results illustrate that, the novel sensing node developed in this study can effectively enhance the reliability of the motor monitoring system and it is expected to be a valuable technology, which will be available to the plant for implementation in a reliable motor management program.
Wireless and Powerless Sensing Node System Developed for Monitoring Motors

PubMed Central

Lee, Dasheng

2008-01-01

Reliability and maintainability of tooling systems can be improved through condition monitoring of motors. However, it is difficult to deploy sensor nodes due to the harsh environment of industrial plants. Sensor cables are easily damaged, which renders the monitoring system deployed to assure the machine's reliability itself unreliable. A wireless and powerless sensing node integrated with a MEMS (Micro Electro-Mechanical System) sensor, a signal processor, a communication module, and a self-powered generator was developed in this study for implementation of an easily mounted network sensor for monitoring motors. A specially designed communication module transmits a sequence of electromagnetic (EM) pulses in response to the sensor signals. The EM pulses can penetrate through the machine's metal case and delivers signals from the sensor inside the motor to the external data acquisition center. By using induction power, which is generated by the motor's shaft rotation, the sensor node is self-sustaining; therefore, no power line is required. A monitoring system, equipped with novel sensing nodes, was constructed to test its performance. The test results illustrate that, the novel sensing node developed in this study can effectively enhance the reliability of the motor monitoring system and it is expected to be a valuable technology, which will be available to the plant for implementation in a reliable motor management program. PMID:27873798
Performance Evaluation of NWChem Ab-Initio Molecular Dynamics (AIMD) Simulations on the Intel® Xeon Phi™ Processor

DOE Office of Scientific and Technical Information (OSTI.GOV)

Bylaska, Eric J.; Jacquelin, Mathias; De Jong, Wibe A.

2017-10-20

Ab-initio Molecular Dynamics (AIMD) methods are an important class of algorithms, as they enable scientists to understand the chemistry and dynamics of molecular and condensed phase systems while retaining a first-principles-based description of their interactions. Many-core architectures such as the Intel® Xeon Phi™ processor are an interesting and promising target for these algorithms, as they can provide the computational power that is needed to solve interesting problems in chemistry. In this paper, we describe the efforts of refactoring the existing AIMD plane-wave method of NWChem from an MPI-only implementation to a scalable, hybrid code that employs MPI and OpenMP tomore » exploit the capabilities of current and future many-core architectures. We describe the optimizations required to get close to optimal performance for the multiplication of the tall-and-skinny matrices that form the core of the computational algorithm. We present strong scaling results on the complete AIMD simulation for a test case that simulates 256 water molecules and that strong-scales well on a cluster of 1024 nodes of Intel Xeon Phi processors. We compare the performance obtained with a cluster of dual-socket Intel® Xeon® E5–2698v3 processors.« less
Shared Memory Parallelism for 3D Cartesian Discrete Ordinates Solver

NASA Astrophysics Data System (ADS)

Moustafa, Salli; Dutka-Malen, Ivan; Plagne, Laurent; Ponçot, Angélique; Ramet, Pierre

2014-06-01

This paper describes the design and the performance of DOMINO, a 3D Cartesian SN solver that implements two nested levels of parallelism (multicore+SIMD) on shared memory computation nodes. DOMINO is written in C++, a multi-paradigm programming language that enables the use of powerful and generic parallel programming tools such as Intel TBB and Eigen. These two libraries allow us to combine multi-thread parallelism with vector operations in an efficient and yet portable way. As a result, DOMINO can exploit the full power of modern multi-core processors and is able to tackle very large simulations, that usually require large HPC clusters, using a single computing node. For example, DOMINO solves a 3D full core PWR eigenvalue problem involving 26 energy groups, 288 angular directions (S16), 46 × 106 spatial cells and 1 × 1012 DoFs within 11 hours on a single 32-core SMP node. This represents a sustained performance of 235 GFlops and 40:74% of the SMP node peak performance for the DOMINO sweep implementation. The very high Flops/Watt ratio of DOMINO makes it a very interesting building block for a future many-nodes nuclear simulation tool.
Design of temperature monitoring system based on CAN bus

NASA Astrophysics Data System (ADS)

Zhang, Li

2017-10-01

The remote temperature monitoring system based on the Controller Area Network (CAN) bus is designed to collect the multi-node remote temperature. By using the STM32F103 as main controller and multiple DS18B20s as temperature sensors, the system achieves a master-slave node data acquisition and transmission based on the CAN bus protocol. And making use of the serial port communication technology to communicate with the host computer, the system achieves the function of remote temperature storage, historical data show and the temperature waveform display.
Hybrid Computational Architecture for Multi-Scale Modeling of Materials and Devices

DTIC Science & Technology

2016-01-03

Equivalent: Total Number: Sub Contractors (DD882) Names of Faculty Supported Names of Under Graduate students supported Names of Personnel receiving masters...GHz, 20 cores (40 with hyper-threading ( HT )) Single node performance Node # of cores Total CPU time User CPU time System CPU time Elapsed time...INTEL20 40 (with HT ) 534.785 529.984 4.800 541.179 20 468.873 466.119 2.754 476.878 10 671.798 669.653 2.145 680.510 8 772.269 770.256 2.013
Neural node network and model, and method of teaching same

DOEpatents

Parlos, A.G.; Atiya, A.F.; Fernandez, B.; Tsai, W.K.; Chong, K.T.

1995-12-26

The present invention is a fully connected feed forward network that includes at least one hidden layer. The hidden layer includes nodes in which the output of the node is fed back to that node as an input with a unit delay produced by a delay device occurring in the feedback path (local feedback). Each node within each layer also receives a delayed output (crosstalk) produced by a delay unit from all the other nodes within the same layer. The node performs a transfer function operation based on the inputs from the previous layer and the delayed outputs. The network can be implemented as analog or digital or within a general purpose processor. Two teaching methods can be used: (1) back propagation of weight calculation that includes the local feedback and the crosstalk or (2) more preferably a feed forward gradient decent which immediately follows the output computations and which also includes the local feedback and the crosstalk. Subsequent to the gradient propagation, the weights can be normalized, thereby preventing convergence to a local optimum. Education of the network can be incremental both on and off-line. An educated network is suitable for modeling and controlling dynamic nonlinear systems and time series systems and predicting the outputs as well as hidden states and parameters. The educated network can also be further educated during on-line processing. 21 figs.
Neural node network and model, and method of teaching same

DOEpatents

Parlos, Alexander G.; Atiya, Amir F.; Fernandez, Benito; Tsai, Wei K.; Chong, Kil T.

1995-01-01

The present invention is a fully connected feed forward network that includes at least one hidden layer 16. The hidden layer 16 includes nodes 20 in which the output of the node is fed back to that node as an input with a unit delay produced by a delay device 24 occurring in the feedback path 22 (local feedback). Each node within each layer also receives a delayed output (crosstalk) produced by a delay unit 36 from all the other nodes within the same layer 16. The node performs a transfer function operation based on the inputs from the previous layer and the delayed outputs. The network can be implemented as analog or digital or within a general purpose processor. Two teaching methods can be used: (1) back propagation of weight calculation that includes the local feedback and the crosstalk or (2) more preferably a feed forward gradient decent which immediately follows the output computations and which also includes the local feedback and the crosstalk. Subsequent to the gradient propagation, the weights can be normalized, thereby preventing convergence to a local optimum. Education of the network can be incremental both on and off-line. An educated network is suitable for modeling and controlling dynamic nonlinear systems and time series systems and predicting the outputs as well as hidden states and parameters. The educated network can also be further educated during on-line processing.
Autonomic Cluster Management System (ACMS): A Demonstration of Autonomic Principles at Work

NASA Technical Reports Server (NTRS)

Baldassari, James D.; Kopec, Christopher L.; Leshay, Eric S.; Truszkowski, Walt; Finkel, David

2005-01-01

Cluster computing, whereby a large number of simple processors or nodes are combined together to apparently function as a single powerful computer, has emerged as a research area in its own right. The approach offers a relatively inexpensive means of achieving significant computational capabilities for high-performance computing applications, while simultaneously affording the ability to. increase that capability simply by adding more (inexpensive) processors. However, the task of manually managing and con.guring a cluster quickly becomes impossible as the cluster grows in size. Autonomic computing is a relatively new approach to managing complex systems that can potentially solve many of the problems inherent in cluster management. We describe the development of a prototype Automatic Cluster Management System (ACMS) that exploits autonomic properties in automating cluster management.
Mechanism of supporting sub-communicator collectives with O(64) counters as opposed to one counter for each sub-communicator

DOE Office of Scientific and Technical Information (OSTI.GOV)

Kumar, Sameer; Mamidala, Amith R.; Ratterman, Joseph D.

A system and method for enhancing barrier collective synchronization on a computer system comprises a computer system including a data storage device. The computer system includes a program stored in the data storage device and steps of the program being executed by a processor. The system includes providing a plurality of communicators for storing state information for a bather algorithm. Each communicator designates a master core in a multi-processor environment of the computer system. The system allocates or designates one counter for each of a plurality of threads. The system configures a table with a number of entries equal tomore » the maximum number of threads. The system sets a table entry with an ID associated with a communicator when a process thread initiates a collective. The system determines an allocated or designated counter by searching entries in the table.« less
Mechanism of supporting sub-communicator collectives with o(64) counters as opposed to one counter for each sub-communicator

DOE Office of Scientific and Technical Information (OSTI.GOV)

Blocksome, Michael; Kumar, Sameer; Mamidala, Amith R.

A system and method for enhancing barrier collective synchronization on a computer system comprises a computer system including a data storage device. The computer system includes a program stored in the data storage device and steps of the program being executed by a processor. The system includes providing a plurality of communicators for storing state information for a barrier algorithm. Each communicator designates a master core in a multi-processor environment of the computer system. The system allocates or designates one counter for each of a plurality of threads. The system configures a table with a number of entries equal tomore » the maximum number of threads. The system sets a table entry with an ID associated with a communicator when a process thread initiates a collective. The system determines an allocated or designated counter by searching entries in the table.« less
Mechanism of supporting sub-communicator collectives with O(64) counters as opposed to one counter for each sub-communicator

DOEpatents

Kumar, Sameer; Mamidala, Amith R.; Ratterman, Joseph D.; Blocksome, Michael; Miller, Douglas

2013-09-03

A system and method for enhancing barrier collective synchronization on a computer system comprises a computer system including a data storage device. The computer system includes a program stored in the data storage device and steps of the program being executed by a processor. The system includes providing a plurality of communicators for storing state information for a bather algorithm. Each communicator designates a master core in a multi-processor environment of the computer system. The system allocates or designates one counter for each of a plurality of threads. The system configures a table with a number of entries equal to the maximum number of threads. The system sets a table entry with an ID associated with a communicator when a process thread initiates a collective. The system determines an allocated or designated counter by searching entries in the table.
An MPI-based MoSST core dynamics model

NASA Astrophysics Data System (ADS)

Jiang, Weiyuan; Kuang, Weijia

2008-09-01

Distributed systems are among the main cost-effective and expandable platforms for high-end scientific computing. Therefore scalable numerical models are important for effective use of such systems. In this paper, we present an MPI-based numerical core dynamics model for simulation of geodynamo and planetary dynamos, and for simulation of core-mantle interactions. The model is developed based on MPI libraries. Two algorithms are used for node-node communication: a "master-slave" architecture and a "divide-and-conquer" architecture. The former is easy to implement but not scalable in communication. The latter is scalable in both computation and communication. The model scalability is tested on Linux PC clusters with up to 128 nodes. This model is also benchmarked with a published numerical dynamo model solution.
Space-Shuttle Emulator Software

NASA Technical Reports Server (NTRS)

Arnold, Scott; Askew, Bill; Barry, Matthew R.; Leigh, Agnes; Mermelstein, Scott; Owens, James; Payne, Dan; Pemble, Jim; Sollinger, John; Thompson, Hiram;

2007-01-01

A package of software has been developed to execute a raw binary image of the space shuttle flight software for simulation of the computational effects of operation of space shuttle avionics. This software can be run on inexpensive computer workstations. Heretofore, it was necessary to use real flight computers to perform such tests and simulations. The package includes a program that emulates the space shuttle orbiter general- purpose computer [consisting of a central processing unit (CPU), input/output processor (IOP), master sequence controller, and buscontrol elements]; an emulator of the orbiter display electronics unit and models of the associated cathode-ray tubes, keyboards, and switch controls; computational models of the data-bus network; computational models of the multiplexer-demultiplexer components; an emulation of the pulse-code modulation master unit; an emulation of the payload data interleaver; a model of the master timing unit; a model of the mass memory unit; and a software component that ensures compatibility of telemetry and command services between the simulated space shuttle avionics and a mission control center. The software package is portable to several host platforms.

MPI parallelization of Vlasov codes for the simulation of nonlinear laser-plasma interactions

NASA Astrophysics Data System (ADS)

Savchenko, V.; Won, K.; Afeyan, B.; Decyk, V.; Albrecht-Marc, M.; Ghizzo, A.; Bertrand, P.

2003-10-01

The simulation of optical mixing driven KEEN waves [1] and electron plasma waves [1] in laser-produced plasmas require nonlinear kinetic models and massive parallelization. We use Massage Passing Interface (MPI) libraries and Appleseed [2] to solve the Vlasov Poisson system of equations on an 8 node dual processor MAC G4 cluster. We use the semi-Lagrangian time splitting method [3]. It requires only row-column exchanges in the global data redistribution, minimizing the total number of communications between processors. Recurrent communication patterns for 2D FFTs involves global transposition. In the Vlasov-Maxwell case, we use splitting into two 1D spatial advections and a 2D momentum advection [4]. Discretized momentum advection equations have a double loop structure with the outer index being assigned to different processors. We adhere to a code structure with separate routines for calculations and data management for parallel computations. [1] B. Afeyan et al., IFSA 2003 Conference Proceedings, Monterey, CA [2] V. K. Decyk, Computers in Physics, 7, 418 (1993) [3] Sonnendrucker et al., JCP 149, 201 (1998) [4] Begue et al., JCP 151, 458 (1999)
Research in Parallel Algorithms and Software for Computational Aerosciences

NASA Technical Reports Server (NTRS)

Domel, Neal D.

1996-01-01

Phase I is complete for the development of a Computational Fluid Dynamics parallel code with automatic grid generation and adaptation for the Euler analysis of flow over complex geometries. SPLITFLOW, an unstructured Cartesian grid code developed at Lockheed Martin Tactical Aircraft Systems, has been modified for a distributed memory/massively parallel computing environment. The parallel code is operational on an SGI network, Cray J90 and C90 vector machines, SGI Power Challenge, and Cray T3D and IBM SP2 massively parallel machines. Parallel Virtual Machine (PVM) is the message passing protocol for portability to various architectures. A domain decomposition technique was developed which enforces dynamic load balancing to improve solution speed and memory requirements. A host/node algorithm distributes the tasks. The solver parallelizes very well, and scales with the number of processors. Partially parallelized and non-parallelized tasks consume most of the wall clock time in a very fine grain environment. Timing comparisons on a Cray C90 demonstrate that Parallel SPLITFLOW runs 2.4 times faster on 8 processors than its non-parallel counterpart autotasked over 8 processors.
Research in Parallel Algorithms and Software for Computational Aerosciences

NASA Technical Reports Server (NTRS)

Domel, Neal D.

1996-01-01

Phase 1 is complete for the development of a computational fluid dynamics CFD) parallel code with automatic grid generation and adaptation for the Euler analysis of flow over complex geometries. SPLITFLOW, an unstructured Cartesian grid code developed at Lockheed Martin Tactical Aircraft Systems, has been modified for a distributed memory/massively parallel computing environment. The parallel code is operational on an SGI network, Cray J90 and C90 vector machines, SGI Power Challenge, and Cray T3D and IBM SP2 massively parallel machines. Parallel Virtual Machine (PVM) is the message passing protocol for portability to various architectures. A domain decomposition technique was developed which enforces dynamic load balancing to improve solution speed and memory requirements. A host/node algorithm distributes the tasks. The solver parallelizes very well, and scales with the number of processors. Partially parallelized and non-parallelized tasks consume most of the wall clock time in a very fine grain environment. Timing comparisons on a Cray C90 demonstrate that Parallel SPLITFLOW runs 2.4 times faster on 8 processors than its non-parallel counterpart autotasked over 8 processors.
Development of an Autonomous Navigation Technology Test Vehicle

DTIC Science & Technology

2004-08-01

as an independent thread on processors using the Linux operating system. The computer hardware selected for the nodes that host the MRS threads...communications system design. Linux was chosen as the operating system for all of the single board computers used on the Mule. Linux was specifically...used for system analysis and development. The simple realization of multi-thread processing and inter-process communications in Linux made it a
QCDOC: A 10-teraflops scale computer for lattice QCD

NASA Astrophysics Data System (ADS)

Chen, D.; Christ, N. H.; Cristian, C.; Dong, Z.; Gara, A.; Garg, K.; Joo, B.; Kim, C.; Levkova, L.; Liao, X.; Mawhinney, R. D.; Ohta, S.; Wettig, T.

2001-03-01

The architecture of a new class of computers, optimized for lattice QCD calculations, is described. An individual node is based on a single integrated circuit containing a PowerPC 32-bit integer processor with a 1 Gflops 64-bit IEEE floating point unit, 4 Mbyte of memory, 8 Gbit/sec nearest-neighbor communications and additional control and diagnostic circuitry. The machine's name, QCDOC, derives from "QCD On a Chip".

A Monitoring System for Vegetable Greenhouses based on a Wireless Sensor Network

PubMed Central

Li, Xiu-hong; Cheng, Xiao; Yan, Ke; Gong, Peng

2010-01-01

A wireless sensor network-based automatic monitoring system is designed for monitoring the life conditions of greenhouse vegetatables. The complete system architecture includes a group of sensor nodes, a base station, and an internet data center. For the design of wireless sensor node, the JN5139 micro-processor is adopted as the core component and the Zigbee protocol is used for wireless communication between nodes. With an ARM7 microprocessor and embedded ZKOS operating system, a proprietary gateway node is developed to achieve data influx, screen display, system configuration and GPRS based remote data forwarding. Through a Client/Server mode the management software for remote data center achieves real-time data distribution and time-series analysis. Besides, a GSM-short-message-based interface is developed for sending real-time environmental measurements, and for alarming when a measurement is beyond some pre-defined threshold. The whole system has been tested for over one year and satisfactory results have been observed, which indicate that this system is very useful for greenhouse environment monitoring. PMID:22163391
Adapting wave-front algorithms to efficiently utilize systems with deep communication hierarchies

DOE Office of Scientific and Technical Information (OSTI.GOV)

Kerbyson, Darren J; Lang, Michael; Pakin, Scott

2009-01-01

Large-scale systems increasingly exhibit a differential between intra-chip and inter-chip communication performance. Processor-cores on the same socket are able to communicate at lower latencies, and with higher bandwidths, than cores on different sockets either within the same node or between nodes. A key challenge is to efficiently use this communication hierarchy and hence optimize performance. We consider here the class of applications that contain wave-front processing. In these applications data can only be processed after their upstream neighbors have been processed. Similar dependencies result between processors in which communication is required to pass boundary data downstream and whose cost ismore » typically impacted by the slowest communication channel in use. In this work we develop a novel hierarchical wave-front approach that reduces the use of slower communications in the hierarchy but at the cost of additional computation and higher use of on-chip communications. This tradeoff is explored using a performance model and an implementation on the Petascale Roadrunner system demonstrates a 27% performance improvement at full system-scale on a kernel application. The approach is generally applicable to large-scale multi-core and accelerated systems where a differential in system communication performance exists.« less
Multi-threaded ATLAS simulation on Intel Knights Landing processors

NASA Astrophysics Data System (ADS)

Farrell, Steven; Calafiura, Paolo; Leggett, Charles; Tsulaia, Vakhtang; Dotti, Andrea; ATLAS Collaboration

2017-10-01

The Knights Landing (KNL) release of the Intel Many Integrated Core (MIC) Xeon Phi line of processors is a potential game changer for HEP computing. With 72 cores and deep vector registers, the KNL cards promise significant performance benefits for highly-parallel, compute-heavy applications. Cori, the newest supercomputer at the National Energy Research Scientific Computing Center (NERSC), was delivered to its users in two phases with the first phase online at the end of 2015 and the second phase now online at the end of 2016. Cori Phase 2 is based on the KNL architecture and contains over 9000 compute nodes with 96GB DDR4 memory. ATLAS simulation with the multithreaded Athena Framework (AthenaMT) is a good potential use-case for the KNL architecture and supercomputers like Cori. ATLAS simulation jobs have a high ratio of CPU computation to disk I/O and have been shown to scale well in multi-threading and across many nodes. In this paper we will give an overview of the ATLAS simulation application with details on its multi-threaded design. Then, we will present a performance analysis of the application on KNL devices and compare it to a traditional x86 platform to demonstrate the capabilities of the architecture and evaluate the benefits of utilizing KNL platforms like Cori for ATLAS production.
Porting Gravitational Wave Signal Extraction to Parallel Virtual Machine (PVM)

NASA Technical Reports Server (NTRS)

Thirumalainambi, Rajkumar; Thompson, David E.; Redmon, Jeffery

2009-01-01

Laser Interferometer Space Antenna (LISA) is a planned NASA-ESA mission to be launched around 2012. The Gravitational Wave detection is fundamentally the determination of frequency, source parameters, and waveform amplitude derived in a specific order from the interferometric time-series of the rotating LISA spacecrafts. The LISA Science Team has developed a Mock LISA Data Challenge intended to promote the testing of complicated nested search algorithms to detect the 100-1 millihertz frequency signals at amplitudes of 10E-21. However, it has become clear that, sequential search of the parameters is very time consuming and ultra-sensitive; hence, a new strategy has been developed. Parallelization of existing sequential search algorithms of Gravitational Wave signal identification consists of decomposing sequential search loops, beginning with outermost loops and working inward. In this process, the main challenge is to detect interdependencies among loops and partitioning the loops so as to preserve concurrency. Existing parallel programs are based upon either shared memory or distributed memory paradigms. In PVM, master and node programs are used to execute parallelization and process spawning. The PVM can handle process management and process addressing schemes using a virtual machine configuration. The task scheduling and the messaging and signaling can be implemented efficiently for the LISA Gravitational Wave search process using a master and 6 nodes. This approach is accomplished using a server that is available at NASA Ames Research Center, and has been dedicated to the LISA Data Challenge Competition. Historically, gravitational wave and source identification parameters have taken around 7 days in this dedicated single thread Linux based server. Using PVM approach, the parameter extraction problem can be reduced to within a day. The low frequency computation and a proxy signal-to-noise ratio are calculated in separate nodes that are controlled by the master using message and vector of data passing. The message passing among nodes follows a pattern of synchronous and asynchronous send-and-receive protocols. The communication model and the message buffers are allocated dynamically to address rapid search of gravitational wave source information in the Mock LISA data sets.
Implementing the PM Programming Language using MPI and OpenMP - a New Tool for Programming Geophysical Models on Parallel Systems

NASA Astrophysics Data System (ADS)

Bellerby, Tim

2015-04-01

PM (Parallel Models) is a new parallel programming language specifically designed for writing environmental and geophysical models. The language is intended to enable implementers to concentrate on the science behind the model rather than the details of running on parallel hardware. At the same time PM leaves the programmer in control - all parallelisation is explicit and the parallel structure of any given program may be deduced directly from the code. This paper describes a PM implementation based on the Message Passing Interface (MPI) and Open Multi-Processing (OpenMP) standards, looking at issues involved with translating the PM parallelisation model to MPI/OpenMP protocols and considering performance in terms of the competing factors of finer-grained parallelisation and increased communication overhead. In order to maximise portability, the implementation stays within the MPI 1.3 standard as much as possible, with MPI-2 MPI-IO file handling the only significant exception. Moreover, it does not assume a thread-safe implementation of MPI. PM adopts a two-tier abstract representation of parallel hardware. A PM processor is a conceptual unit capable of efficiently executing a set of language tasks, with a complete parallel system consisting of an abstract N-dimensional array of such processors. PM processors may map to single cores executing tasks using cooperative multi-tasking, to multiple cores or even to separate processing nodes, efficiently sharing tasks using algorithms such as work stealing. While tasks may move between hardware elements within a PM processor, they may not move between processors without specific programmer intervention. Tasks are assigned to processors using a nested parallelism approach, building on ideas from Reyes et al. (2009). The main program owns all available processors. When the program enters a parallel statement then either processors are divided out among the newly generated tasks (number of new tasks < number of processors) or tasks are divided out among the available processors (number of tasks > number of processors). Nested parallel statements may further subdivide the processor set owned by a given task. Tasks or processors are distributed evenly by default, but uneven distributions are possible under programmer control. It is also possible to explicitly enable child tasks to migrate within the processor set owned by their parent task, reducing load unbalancing at the potential cost of increased inter-processor message traffic. PM incorporates some programming structures from the earlier MIST language presented at a previous EGU General Assembly, while adopting a significantly different underlying parallelisation model and type system. PM code is available at www.pm-lang.org under an unrestrictive MIT license. Reference Ruymán Reyes, Antonio J. Dorta, Francisco Almeida, Francisco de Sande, 2009. Automatic Hybrid MPI+OpenMP Code Generation with llc, Recent Advances in Parallel Virtual Machine and Message Passing Interface, Lecture Notes in Computer Science Volume 5759, 185-195
DIRAC universal pilots

NASA Astrophysics Data System (ADS)

Stagni, F.; McNab, A.; Luzzi, C.; Krzemien, W.; Consortium, DIRAC

2017-10-01

In the last few years, new types of computing models, such as IAAS (Infrastructure as a Service) and IAAC (Infrastructure as a Client), gained popularity. New resources may come as part of pledged resources, while others are in the form of opportunistic ones. Most but not all of these new infrastructures are based on virtualization techniques. In addition, some of them, present opportunities for multi-processor computing slots to the users. Virtual Organizations are therefore facing heterogeneity of the available resources and the use of an Interware software like DIRAC to provide the transparent, uniform interface has become essential. The transparent access to the underlying resources is realized by implementing the pilot model. DIRAC’s newest generation of generic pilots (the so-called Pilots 2.0) are the “pilots for all the skies”, and have been successfully released in production more than a year ago. They use a plugin mechanism that makes them easily adaptable. Pilots 2.0 have been used for fetching and running jobs on every type of resource, being it a Worker Node (WN) behind a CREAM/ARC/HTCondor/DIRAC Computing element, a Virtual Machine running on IaaC infrastructures like Vac or BOINC, on IaaS cloud resources managed by Vcycle, the LHCb High Level Trigger farm nodes, and any type of opportunistic computing resource. Make a machine a “Pilot Machine”, and all diversities between them will disappear. This contribution describes how pilots are made suitable for different resources, and the recent steps taken towards a fully unified framework, including monitoring. Also, the cases of multi-processor computing slots either on real or virtual machines, with the whole node or a partition of it, is discussed.
Design of the SLAC RCE Platform: A General Purpose ATCA Based Data Acquisition System

DOE Office of Scientific and Technical Information (OSTI.GOV)

Herbst, R.; Claus, R.; Freytag, M.

2015-01-23

The SLAC RCE platform is a general purpose clustered data acquisition system implemented on a custom ATCA compliant blade, called the Cluster On Board (COB). The core of the system is the Reconfigurable Cluster Element (RCE), which is a system-on-chip design based upon the Xilinx Zynq family of FPGAs, mounted on custom COB daughter-boards. The Zynq architecture couples a dual core ARM Cortex A9 based processor with a high performance 28nm FPGA. The RCE has 12 external general purpose bi-directional high speed links, each supporting serial rates of up to 12Gbps. 8 RCE nodes are included on a COB, eachmore » with a 10Gbps connection to an on-board 24-port Ethernet switch integrated circuit. The COB is designed to be used with a standard full-mesh ATCA backplane allowing multiple RCE nodes to be tightly interconnected with minimal interconnect latency. Multiple shelves can be clustered using the front panel 10-gbps connections. The COB also supports local and inter-blade timing and trigger distribution. An experiment specific Rear Transition Module adapts the 96 high speed serial links to specific experiments and allows an experiment-specific timing and busy feedback connection. This coupling of processors with a high performance FPGA fabric in a low latency, multiple node cluster allows high speed data processing that can be easily adapted to any physics experiment. RTEMS and Linux are both ported to the module. The RCE has been used or is the baseline for several current and proposed experiments (LCLS, HPS, LSST, ATLAS-CSC, LBNE, DarkSide, ILC-SiD, etc).« less
Cost/Performance Ratio Achieved by Using a Commodity-Based Cluster

NASA Technical Reports Server (NTRS)

Lopez, Isaac

2001-01-01

Researchers at the NASA Glenn Research Center acquired a commodity cluster based on Intel Corporation processors to compare its performance with a traditional UNIX cluster in the execution of aeropropulsion applications. Since the cost differential of the clusters was significant, a cost/performance ratio was calculated. After executing a propulsion application on both clusters, the researchers demonstrated a 9.4 cost/performance ratio in favor of the Intel-based cluster. These researchers utilize the Aeroshark cluster as one of the primary testbeds for developing NPSS parallel application codes and system software. The Aero-shark cluster provides 64 Intel Pentium II 400-MHz processors, housed in 32 nodes. Recently, APNASA - a code developed by a Government/industry team for the design and analysis of turbomachinery systems was used for a simulation on Glenn's Aeroshark cluster.
Benchmarking and tuning the MILC code on clusters and supercomputers

NASA Astrophysics Data System (ADS)

Gottlieb, Steven

2002-03-01

Recently, we have benchmarked and tuned the MILC code on a number of architectures including Intel Itanium and Pentium IV (PIV), dual-CPU Athlon, and the latest Compaq Alpha nodes. Results will be presented for many of these, and we shall discuss some simple code changes that can result in a very dramatic speedup of the KS conjugate gradient on processors with more advanced memory systems such as PIV, IBM SP and Alpha.
Benchmarking and tuning the MILC code on clusters and supercomputers

NASA Astrophysics Data System (ADS)

Gottlieb, Steven

Recently, we have benchmarked and tuned the MILC code on a number of architectures including Intel Itanium and Pentium IV (PIV), dual-CPU Athlon, and the latest Compaq Alpha nodes. Results will be presented for many of these, and we shall discuss some simple code changes that can result in a very dramatic speedup of the KS conjugate gradient on processors with more advanced memory systems such as PIV, IBM SP and Alpha.
Scalable Unix commands for parallel processors : a high-performance implementation.

DOE Office of Scientific and Technical Information (OSTI.GOV)

Ong, E.; Lusk, E.; Gropp, W.

2001-06-22

We describe a family of MPI applications we call the Parallel Unix Commands. These commands are natural parallel versions of common Unix user commands such as ls, ps, and find, together with a few similar commands particular to the parallel environment. We describe the design and implementation of these programs and present some performance results on a 256-node Linux cluster. The Parallel Unix Commands are open source and freely available.
Electrooptical adaptive switching network for the hypercube computer

NASA Technical Reports Server (NTRS)

Chow, E.; Peterson, J.

1988-01-01

An all-optical network design for the hyperswitch network using regular free-space interconnects between electronic processor nodes is presented. The adaptive routing model used is described, and an adaptive routing control example is presented. The design demonstrates that existing electrooptical techniques are sufficient for implementing efficient parallel architectures without the need for more complex means of implementing arbitrary interconnection schemes. The electrooptical hyperswitch network significantly improves the communication performance of the hypercube computer.
Wi-GIM system: a new wireless sensor network (WSN) for accurate ground instability monitoring

NASA Astrophysics Data System (ADS)

Mucchi, Lorenzo; Trippi, Federico; Schina, Rosa; Fornaciai, Alessandro; Gigli, Giovanni; Nannipieri, Luca; Favalli, Massimiliano; Marturia Alavedra, Jordi; Intrieri, Emanuele; Agostini, Andrea; Carnevale, Ennio; Bertolini, Giovanni; Pizziolo, Marco; Casagli, Nicola

2016-04-01

Landslides are among the most serious and common geologic hazards around the world. Their impact on human life is expected to increase in the next future as a consequence of human-induced climate change as well as the population growth in proximity of unstable slopes. Therefore, developing better performing technologies for monitoring landslides and providing local authorities with new instruments able to help them in the decision making process, is becoming more and more important. The recent progresses in Information and Communication Technologies (ICT) allow us to extend the use of wireless technologies in landslide monitoring. In particular, the developments in electronics components have permitted to lower the price of the sensors and, at the same time, to actuate more efficient wireless communications. In this work we present a new wireless sensor network (WSN) system, designed and developed for landslide monitoring in the framework of EU Wireless Sensor Network for Ground Instability Monitoring - Wi-GIM project (LIFE12 ENV/IT/001033). We show the preliminary performance of the Wi-GIM system after the first period of monitoring on the active Roncovetro Landslide and on a large subsiding area in the neighbourhood of Sallent village. The Roncovetro landslide is located in the province of Reggio Emilia (Italy) and moved an inferred volume of about 3 million cubic meters. Sallent village is located at the centre of the Catalan evaporitic basin in Spain. The Wi-GIM WSN monitoring system consists of three levels: 1) Master/Gateway level coordinates the WSN and performs data aggregation and local storage; 2) Master/Server level takes care of acquiring and storing data on a remote server; 3) Nodes level that is based on a mesh of peripheral nodes, each consisting in a sensor board equipped with sensors and wireless module. The nodes are located in the landslide ground perimeter and are able to create an ad-hoc WSN. The location of each sensor on the ground is determined by integrating an ultra wideband technology with a radar technology; this integration allows to push the accuracy towards the cm. An extended Kalman filter is also used to reduce the noise and enhance the accuracy of the measures. The sensor nodes are organized as a hierarchical cluster, composed by one master and several slave nodes. The landslide movement is detected by comparing day by day the x, y and z coordinates of each nodes. The 3D movements of each sensor during the monitoring period are represented as vector and displayed on a Web-GIS which is accessible at the following link: www.life-wigim.eu.
Sensor node for remote monitoring of waterborne disease-causing bacteria.

PubMed

Kim, Kyukwang; Myung, Hyun

2015-05-05

A sensor node for sampling water and checking for the presence of harmful bacteria such as E. coli in water sources was developed in this research. A chromogenic enzyme substrate assay method was used to easily detect coliform bacteria by monitoring the color change of the sampled water mixed with a reagent. Live webcam image streaming to the web browser of the end user with a Wi-Fi connected sensor node shows the water color changes in real time. The liquid can be manipulated on the web-based user interface, and also can be observed by webcam feeds. Image streaming and web console servers run on an embedded processor with an expansion board. The UART channel of the expansion board is connected to an external Arduino board and a motor driver to control self-priming water pumps to sample the water, mix the reagent, and remove the water sample after the test is completed. The sensor node can repeat water testing until the test reagent is depleted. The authors anticipate that the use of the sensor node developed in this research can decrease the cost and required labor for testing samples in a factory environment and checking the water quality of local water sources in developing countries.
Novel Hybrid Scheduling Technique for Sensor Nodes with Mixed Criticality Tasks

PubMed Central

Micea, Mihai-Victor; Stangaciu, Cristina-Sorina; Stangaciu, Valentin; Curiac, Daniel-Ioan

2017-01-01

Sensor networks become increasingly a key technology for complex control applications. Their potential use in safety- and time-critical domains has raised the need for task scheduling mechanisms specially adapted to sensor node specific requirements, often materialized in predictable jitter-less execution of tasks characterized by different criticality levels. This paper offers an efficient scheduling solution, named Hybrid Hard Real-Time Scheduling (H2RTS), which combines a static, clock driven method with a dynamic, event driven scheduling technique, in order to provide high execution predictability, while keeping a high node Central Processing Unit (CPU) utilization factor. From the detailed, integrated schedulability analysis of the H2RTS, a set of sufficiency tests are introduced and demonstrated based on the processor demand and linear upper bound metrics. The performance and correct behavior of the proposed hybrid scheduling technique have been extensively evaluated and validated both on a simulator and on a sensor mote equipped with ARM7 microcontroller. PMID:28672856
Spectral Element Method for the Simulation of Unsteady Compressible Flows

NASA Technical Reports Server (NTRS)

Diosady, Laslo Tibor; Murman, Scott M.

2013-01-01

This work uses a discontinuous-Galerkin spectral-element method (DGSEM) to solve the compressible Navier-Stokes equations [1{3]. The inviscid ux is computed using the approximate Riemann solver of Roe [4]. The viscous fluxes are computed using the second form of Bassi and Rebay (BR2) [5] in a manner consistent with the spectral-element approximation. The method of lines with the classical 4th-order explicit Runge-Kutta scheme is used for time integration. Results for polynomial orders up to p = 15 (16th order) are presented. The code is parallelized using the Message Passing Interface (MPI). The computations presented in this work are performed using the Sandy Bridge nodes of the NASA Pleiades supercomputer at NASA Ames Research Center. Each Sandy Bridge node consists of 2 eight-core Intel Xeon E5-2670 processors with a clock speed of 2.6Ghz and 2GB per core memory. On a Sandy Bridge node the Tau Benchmark [6] runs in a time of 7.6s.
The Hazard Notification System (HANS)

NASA Astrophysics Data System (ADS)

Snedigar, S. F.; Venezky, D. Y.

2009-12-01

The Volcano Hazards Program (VHP) has developed a Hazard Notification System (HANS) for distributing volcanic activity information collected by scientists to airlines, emergency services, and the general public. In the past year, data from HANS have been used by airlines to make decisions about diverting or canceling flights during the eruption of Mount Redoubt. HANS was developed to provide a single system that each of the five U.S. volcano observatories could use for communicating and storing volcanic information about the 160+ potentially active U.S. volcanoes. The data that cover ten tables and nearly 100 fields are now stored in similar formats, and the information can be released in styles requested by our agency partners, such as the International Civil Aviation Organization (ICAO). Currently, HANS has about 4500 reports stored; on average, two - three reports are added daily. HANS (at its most basic form) consists of a user interface for entering data into one of many release types (Daily Status Reports, Weekly Updates, Volcano Activity Notifications, etc.); a database holding previous releases as well as observatory information such as email address lists and volcano boilerplates; and a transmission system for formatting releases and sending them out by email or other web related system. The user interface to HANS is completely web based, providing access to our observatory scientists from any online PC. The underlying database stores the observatory information and drives the observatory and program websites' dynamic updates and archived information releases. HANS also runs scripts for generating several different feeds including the program home page Volcano Status Map. Each observatory has the capability of running an instance of HANS. There are currently three instances of HANS and each instance is synchronized to all other instances using a master-slave environment. Information can be entered on any node; slave nodes transmit data to the master node, and the master retransmits that data to all slave nodes. All data transfer between instances uses the Simple Object Access Protocol (SOAP) as the envelope in which data are transmitted between nodes. The HANS data synchronization not only works as a backup feature, but also acts as a simple fault-tolerant system. Information from any observatory can be entered on any instance, and still be transmitted to the specified observatory's distribution list, which provides added flexibility if there is a disruption in access from an area that needs to send an update. Additionally, having the same information available on our multiple websites is necessary for communicating our scientists' most up-to-date information.
Toward real-time Monte Carlo simulation using a commercial cloud computing infrastructure

NASA Astrophysics Data System (ADS)

Wang, Henry; Ma, Yunzhi; Pratx, Guillem; Xing, Lei

2011-09-01

Monte Carlo (MC) methods are the gold standard for modeling photon and electron transport in a heterogeneous medium; however, their computational cost prohibits their routine use in the clinic. Cloud computing, wherein computing resources are allocated on-demand from a third party, is a new approach for high performance computing and is implemented to perform ultra-fast MC calculation in radiation therapy. We deployed the EGS5 MC package in a commercial cloud environment. Launched from a single local computer with Internet access, a Python script allocates a remote virtual cluster. A handshaking protocol designates master and worker nodes. The EGS5 binaries and the simulation data are initially loaded onto the master node. The simulation is then distributed among independent worker nodes via the message passing interface, and the results aggregated on the local computer for display and data analysis. The described approach is evaluated for pencil beams and broad beams of high-energy electrons and photons. The output of cloud-based MC simulation is identical to that produced by single-threaded implementation. For 1 million electrons, a simulation that takes 2.58 h on a local computer can be executed in 3.3 min on the cloud with 100 nodes, a 47× speed-up. Simulation time scales inversely with the number of parallel nodes. The parallelization overhead is also negligible for large simulations. Cloud computing represents one of the most important recent advances in supercomputing technology and provides a promising platform for substantially improved MC simulation. In addition to the significant speed up, cloud computing builds a layer of abstraction for high performance parallel computing, which may change the way dose calculations are performed and radiation treatment plans are completed. This work was presented in part at the 2010 Annual Meeting of the American Association of Physicists in Medicine (AAPM), Philadelphia, PA.
A spread willingness computing-based information dissemination model.

PubMed

Huang, Haojing; Cui, Zhiming; Zhang, Shukui

2014-01-01

This paper constructs a kind of spread willingness computing based on information dissemination model for social network. The model takes into account the impact of node degree and dissemination mechanism, combined with the complex network theory and dynamics of infectious diseases, and further establishes the dynamical evolution equations. Equations characterize the evolutionary relationship between different types of nodes with time. The spread willingness computing contains three factors which have impact on user's spread behavior: strength of the relationship between the nodes, views identity, and frequency of contact. Simulation results show that different degrees of nodes show the same trend in the network, and even if the degree of node is very small, there is likelihood of a large area of information dissemination. The weaker the relationship between nodes, the higher probability of views selection and the higher the frequency of contact with information so that information spreads rapidly and leads to a wide range of dissemination. As the dissemination probability and immune probability change, the speed of information dissemination is also changing accordingly. The studies meet social networking features and can help to master the behavior of users and understand and analyze characteristics of information dissemination in social network.
A Spread Willingness Computing-Based Information Dissemination Model

PubMed Central

Cui, Zhiming; Zhang, Shukui

2014-01-01

This paper constructs a kind of spread willingness computing based on information dissemination model for social network. The model takes into account the impact of node degree and dissemination mechanism, combined with the complex network theory and dynamics of infectious diseases, and further establishes the dynamical evolution equations. Equations characterize the evolutionary relationship between different types of nodes with time. The spread willingness computing contains three factors which have impact on user's spread behavior: strength of the relationship between the nodes, views identity, and frequency of contact. Simulation results show that different degrees of nodes show the same trend in the network, and even if the degree of node is very small, there is likelihood of a large area of information dissemination. The weaker the relationship between nodes, the higher probability of views selection and the higher the frequency of contact with information so that information spreads rapidly and leads to a wide range of dissemination. As the dissemination probability and immune probability change, the speed of information dissemination is also changing accordingly. The studies meet social networking features and can help to master the behavior of users and understand and analyze characteristics of information dissemination in social network. PMID:25110738

Splitting nodes and linking channels: A method for assembling biocircuits from stochastic elementary units

NASA Astrophysics Data System (ADS)

Ferwerda, Cameron; Lipan, Ovidiu

2016-11-01

Akin to electric circuits, we construct biocircuits that are manipulated by cutting and assembling channels through which stochastic information flows. This diagrammatic manipulation allows us to create a method which constructs networks by joining building blocks selected so that (a) they cover only basic processes; (b) it is scalable to large networks; (c) the mean and variance-covariance from the Pauli master equation form a closed system; and (d) given the initial probability distribution, no special boundary conditions are necessary to solve the master equation. The method aims to help with both designing new synthetic signaling pathways and quantifying naturally existing regulatory networks.
Real-Time Acquisition and Processing System (RTAPS) Version 1.1 Installation and User’s Manual.

DTIC Science & Technology

1986-08-01

The language is incrementally compiled and procedure-oriented. It is run on an 8088 processor with 56K of available user RAM. The master board features...RTAPS/PC computers. The wiring configuration is shown in figure 10. Switch Modem Port MAC P5 or P6* 2, B4 3 B8 1%7 1 B10 *P6 recommended Figure 10. $MAC...activated switch. The AXAC output port is physically connected to the modem input on the switch. The subchannels are the labeled terminal connections
a Linux PC Cluster for Lattice QCD with Exact Chiral Symmetry

NASA Astrophysics Data System (ADS)

Chiu, Ting-Wai; Hsieh, Tung-Han; Huang, Chao-Hsi; Huang, Tsung-Ren

A computational system for lattice QCD with overlap Dirac quarks is described. The platform is a home-made Linux PC cluster, built with off-the-shelf components. At present the system constitutes of 64 nodes, with each node consisting of one Pentium 4 processor (1.6/2.0/2.5 GHz), one Gbyte of PC800/1066 RDRAM, one 40/80/120 Gbyte hard disk, and a network card. The computationally intensive parts of our program are written in SSE2 codes. The speed of our system is estimated to be 70 Gflops, and its price/performance ratio is better than $1.0/Mflops for 64-bit (double precision) computations in quenched QCD. We discuss how to optimize its hardware and software for computing propagators of overlap Dirac quarks.
Radar Data Processing Using a Distributed Computational System

DTIC Science & Technology

1992-06-01

objects to processors must reduce Toc (N) (i.e., the time to compute on 85 N nodes) [Ref. 28]. Time spent communicating can represent a degradation of...de Sistemas e Computaq&o, s/ data. [9] Vilhena R. "IntroduqAo aos Algoritmos para Processamento de Marcaq6es e DistAncias", Escola Naval - Notas de...Aula - Automaq&o de Sistemas Navais, s/ data. (101 Averbuch A., Itzikcwitz S., and Kapon T. "Parallel Implementation of Multiple Model Tracking
Smart-Pixel Array Processors Based on Optimal Cellular Neural Networks for Space Sensor Applications

NASA Technical Reports Server (NTRS)

Fang, Wai-Chi; Sheu, Bing J.; Venus, Holger; Sandau, Rainer

1997-01-01

A smart-pixel cellular neural network (CNN) with hardware annealing capability, digitally programmable synaptic weights, and multisensor parallel interface has been under development for advanced space sensor applications. The smart-pixel CNN architecture is a programmable multi-dimensional array of optoelectronic neurons which are locally connected with their local neurons and associated active-pixel sensors. Integration of the neuroprocessor in each processor node of a scalable multiprocessor system offers orders-of-magnitude computing performance enhancements for on-board real-time intelligent multisensor processing and control tasks of advanced small satellites. The smart-pixel CNN operation theory, architecture, design and implementation, and system applications are investigated in detail. The VLSI (Very Large Scale Integration) implementation feasibility was illustrated by a prototype smart-pixel 5x5 neuroprocessor array chip of active dimensions 1380 micron x 746 micron in a 2-micron CMOS technology.
High Fidelity Simulations of Unsteady Flow through Turbopumps and Flowliners

NASA Technical Reports Server (NTRS)

Kiris, Cetin C.; Kwak, dochan; Chan, William; Housman, Jeff

2006-01-01

High fidelity computations were carried out to analyze the orbiter LH2 feedline flowliner. Computations were performed on the Columbia platform which is a 10,240-processor supercluster consisting of 20 Altix nodes with 512 processor each. Various computational models were used to characterize the unsteady flow features in the turbopump, including the orbiter Low-Pressure-Fuel-Turbopump (LPFTP) inducer, the orbiter manifold and a test article used to represent the manifold. Unsteady flow originating from the orbiter LPFTP inducer is one of the major contributors to the high frequency cyclic loading that results in high cycle fatigue damage to the gimbal flowliners just upstream of the LPFTP. The flow fields for the orbiter manifold and representative test article are computed and analyzed for similarities and differences. The incompressible Navier-Stokes flow solver INS3D, based on the artificial compressibility method, was used to compute the flow of liquid hydrogen in each test article.
IJA: an efficient algorithm for query processing in sensor networks.

PubMed

Lee, Hyun Chang; Lee, Young Jae; Lim, Ji Hyang; Kim, Dong Hwa

2011-01-01

One of main features in sensor networks is the function that processes real time state information after gathering needed data from many domains. The component technologies consisting of each node called a sensor node that are including physical sensors, processors, actuators and power have advanced significantly over the last decade. Thanks to the advanced technology, over time sensor networks have been adopted in an all-round industry sensing physical phenomenon. However, sensor nodes in sensor networks are considerably constrained because with their energy and memory resources they have a very limited ability to process any information compared to conventional computer systems. Thus query processing over the nodes should be constrained because of their limitations. Due to the problems, the join operations in sensor networks are typically processed in a distributed manner over a set of nodes and have been studied. By way of example while simple queries, such as select and aggregate queries, in sensor networks have been addressed in the literature, the processing of join queries in sensor networks remains to be investigated. Therefore, in this paper, we propose and describe an Incremental Join Algorithm (IJA) in Sensor Networks to reduce the overhead caused by moving a join pair to the final join node or to minimize the communication cost that is the main consumer of the battery when processing the distributed queries in sensor networks environments. At the same time, the simulation result shows that the proposed IJA algorithm significantly reduces the number of bytes to be moved to join nodes compared to the popular synopsis join algorithm.
IJA: An Efficient Algorithm for Query Processing in Sensor Networks

PubMed Central

Lee, Hyun Chang; Lee, Young Jae; Lim, Ji Hyang; Kim, Dong Hwa

2011-01-01

One of main features in sensor networks is the function that processes real time state information after gathering needed data from many domains. The component technologies consisting of each node called a sensor node that are including physical sensors, processors, actuators and power have advanced significantly over the last decade. Thanks to the advanced technology, over time sensor networks have been adopted in an all-round industry sensing physical phenomenon. However, sensor nodes in sensor networks are considerably constrained because with their energy and memory resources they have a very limited ability to process any information compared to conventional computer systems. Thus query processing over the nodes should be constrained because of their limitations. Due to the problems, the join operations in sensor networks are typically processed in a distributed manner over a set of nodes and have been studied. By way of example while simple queries, such as select and aggregate queries, in sensor networks have been addressed in the literature, the processing of join queries in sensor networks remains to be investigated. Therefore, in this paper, we propose and describe an Incremental Join Algorithm (IJA) in Sensor Networks to reduce the overhead caused by moving a join pair to the final join node or to minimize the communication cost that is the main consumer of the battery when processing the distributed queries in sensor networks environments. At the same time, the simulation result shows that the proposed IJA algorithm significantly reduces the number of bytes to be moved to join nodes compared to the popular synopsis join algorithm. PMID:22319375
The Design and Development of the SMEX-Lite Power System

NASA Technical Reports Server (NTRS)

Rakow, Glenn P.; Schnurr, Richard G., Jr.; Solly, Michael A.

1998-01-01

This paper describes the design and development of a 250W orbit average electrical power system electronic Power Node and software for use in Low Earth Orbit missions. The mass of the Power Node is 3.6 Kg (8 lb.). The dimensions of the Power Node are 30cm x 26cm x 7.9cm (11 in. x 10.25 in x 3.1 in.) The design was realized using software, Field Programmable Gate Array (FPGA) digital logic and surface mount technology. The design is generic enough to reduce the non-recurring engineering for different mission configurations. The Power Node charges one to five, low cost, 22-cell 4 AH D-cell battery packs independently. The battery charging algorithms are executed in the power software to reduce the mass and size of the power electronic. The Power Node implements a peak-power tracking algorithm using an innovative hardware/software approach. The power software task is hosted on the spacecraft processor. The power software task generates a MIL-STD-1553 command packet to update the Power Node control settings. The settings for the battery voltage and current limits, as well as minimum solar array voltage used to implement peak power tracking are contained in this packet. Several advanced topologies are used in the Power Node. These include synchronous rectification in the bus regulators, average current control in the battery chargers and quasi-resonant converters for the Field Effect Transistor (FET) transistor drive electronics. Lastly, the main bus regulator uses a feed-forward topology with the PWM implemented in an FPGA.
MOLA: a bootable, self-configuring system for virtual screening using AutoDock4/Vina on computer clusters.

PubMed

Abreu, Rui Mv; Froufe, Hugo Jc; Queiroz, Maria João Rp; Ferreira, Isabel Cfr

2010-10-28

Virtual screening of small molecules using molecular docking has become an important tool in drug discovery. However, large scale virtual screening is time demanding and usually requires dedicated computer clusters. There are a number of software tools that perform virtual screening using AutoDock4 but they require access to dedicated Linux computer clusters. Also no software is available for performing virtual screening with Vina using computer clusters. In this paper we present MOLA, an easy-to-use graphical user interface tool that automates parallel virtual screening using AutoDock4 and/or Vina in bootable non-dedicated computer clusters. MOLA automates several tasks including: ligand preparation, parallel AutoDock4/Vina jobs distribution and result analysis. When the virtual screening project finishes, an open-office spreadsheet file opens with the ligands ranked by binding energy and distance to the active site. All results files can automatically be recorded on an USB-flash drive or on the hard-disk drive using VirtualBox. MOLA works inside a customized Live CD GNU/Linux operating system, developed by us, that bypass the original operating system installed on the computers used in the cluster. This operating system boots from a CD on the master node and then clusters other computers as slave nodes via ethernet connections. MOLA is an ideal virtual screening tool for non-experienced users, with a limited number of multi-platform heterogeneous computers available and no access to dedicated Linux computer clusters. When a virtual screening project finishes, the computers can just be restarted to their original operating system. The originality of MOLA lies on the fact that, any platform-independent computer available can he added to the cluster, without ever using the computer hard-disk drive and without interfering with the installed operating system. With a cluster of 10 processors, and a potential maximum speed-up of 10x, the parallel algorithm of MOLA performed with a speed-up of 8,64× using AutoDock4 and 8,60× using Vina.
Effectiveness of the Benign and Malignant Diagnosis of Mediastinal and Hilar Lymph Nodes by Endobronchial Ultrasound Elastography.

PubMed

Huang, Haidong; Huang, Zhiang; Wang, Qin; Wang, Xinan; Dong, Yuchao; Zhang, Wei; Zarogoulidis, Paul; Man, Yan-Gao; Schmidt, Wolfgang Hohenforst; Bai, Chong

2017-01-01

Background and Objectives: Endobronchial ultrasound elastography is a new technique for describing the stiffness of tissue during endobronchial ultrasound-guided transbronchial needle aspiration (EBUS-TBNA). The aims of this study were to investigate the diagnostic value of Endobronchial ultrasound (EBUS) elastography for distinguishing the difference between benign and malignant lymph nodes among mediastinal and hilar lymph node. Materials and Methods: From June 2015 to August 2015, 47 patients confirmed of mediastinal and hilar lymph node enlargement through examination of Computed tomography (CT) were enrolled, and a total of 78 lymph nodes were evaluated by endobronchial ultrasound-guided transbronchial needle aspiration (EBUS-TBNA). EBUS-guided elastography of lymph nodes was performed prior to EBUS-TBNA. A convex probe EBUS was used with a new EBUS processor to assess elastographic patterns that were classified based on color distribution as follows: Type 1, predominantly non-blue (green, yellow and red); Type 2, part blue, part non-blue (green, yellow and red); Type 3, predominantly blue. Pathological determination of malignant or benign lymph nodes was used as the gold standard for this study. The elastographic patterns were compared with the final pathologic results from EBUS-TBNA. Results: On pathological evaluation of the lymph nodes, 45 were benign and 33 were malignant. The lymph nodes that were classified as Type 1 on endobronchial ultrasound elastography were benign in 26/27 (96.3%) and malignant in 1/27 (3.7%); for Type 2 lymph nodes, 15/20 (75.0%) were benign and 5/20 (25.0%) were malignant; Type 3 lymph nodes were benign in 4/31 (12.9%) and malignant in 27/31 (87.1%). In classifying Type 1 as 'benign' and Type 3 as 'malignant,' the sensitivity, specificity, positive predictive value, negative predictive value and diagnostic accuracy rates were 96.43%, 86.67%, 87.10%, 96.30%, 91.38%, respectively. Conclusion: EBUS elastography of mediastinal and hilar lymph nodes is a noninvasive technique that can be performed reliably and may be helpful in the prediction of benign and malignant lymph nodes among mediastinal and hilar lymph node during EBUS-TBNA.
Scalable parallel communications

NASA Technical Reports Server (NTRS)

Maly, K.; Khanna, S.; Overstreet, C. M.; Mukkamala, R.; Zubair, M.; Sekhar, Y. S.; Foudriat, E. C.

1992-01-01

Coarse-grain parallelism in networking (that is, the use of multiple protocol processors running replicated software sending over several physical channels) can be used to provide gigabit communications for a single application. Since parallel network performance is highly dependent on real issues such as hardware properties (e.g., memory speeds and cache hit rates), operating system overhead (e.g., interrupt handling), and protocol performance (e.g., effect of timeouts), we have performed detailed simulations studies of both a bus-based multiprocessor workstation node (based on the Sun Galaxy MP multiprocessor) and a distributed-memory parallel computer node (based on the Touchstone DELTA) to evaluate the behavior of coarse-grain parallelism. Our results indicate: (1) coarse-grain parallelism can deliver multiple 100 Mbps with currently available hardware platforms and existing networking protocols (such as Transmission Control Protocol/Internet Protocol (TCP/IP) and parallel Fiber Distributed Data Interface (FDDI) rings); (2) scale-up is near linear in n, the number of protocol processors, and channels (for small n and up to a few hundred Mbps); and (3) since these results are based on existing hardware without specialized devices (except perhaps for some simple modifications of the FDDI boards), this is a low cost solution to providing multiple 100 Mbps on current machines. In addition, from both the performance analysis and the properties of these architectures, we conclude: (1) multiple processors providing identical services and the use of space division multiplexing for the physical channels can provide better reliability than monolithic approaches (it also provides graceful degradation and low-cost load balancing); (2) coarse-grain parallelism supports running several transport protocols in parallel to provide different types of service (for example, one TCP handles small messages for many users, other TCP's running in parallel provide high bandwidth service to a single application); and (3) coarse grain parallelism will be able to incorporate many future improvements from related work (e.g., reduced data movement, fast TCP, fine-grain parallelism) also with near linear speed-ups.
A Versatile Image Processor For Digital Diagnostic Imaging And Its Application In Computed Radiography

NASA Astrophysics Data System (ADS)

Blume, H.; Alexandru, R.; Applegate, R.; Giordano, T.; Kamiya, K.; Kresina, R.

1986-06-01

In a digital diagnostic imaging department, the majority of operations for handling and processing of images can be grouped into a small set of basic operations, such as image data buffering and storage, image processing and analysis, image display, image data transmission and image data compression. These operations occur in almost all nodes of the diagnostic imaging communications network of the department. An image processor architecture was developed in which each of these functions has been mapped into hardware and software modules. The modular approach has advantages in terms of economics, service, expandability and upgradeability. The architectural design is based on the principles of hierarchical functionality, distributed and parallel processing and aims at real time response. Parallel processing and real time response is facilitated in part by a dual bus system: a VME control bus and a high speed image data bus, consisting of 8 independent parallel 16-bit busses, capable of handling combined up to 144 MBytes/sec. The presented image processor is versatile enough to meet the video rate processing needs of digital subtraction angiography, the large pixel matrix processing requirements of static projection radiography, or the broad range of manipulation and display needs of a multi-modality diagnostic work station. Several hardware modules are described in detail. For illustrating the capabilities of the image processor, processed 2000 x 2000 pixel computed radiographs are shown and estimated computation times for executing the processing opera-tions are presented.
The TurboLAN project. Phase 1: Protocol choices for high speed local area networks. Phase 2: TurboLAN Intelligent Network Adapter Card, (TINAC) architecture

NASA Technical Reports Server (NTRS)

Alkhatib, Hasan S.

1991-01-01

The hardware and the software architecture of the TurboLAN Intelligent Network Adapter Card (TINAC) are described. A high level as well as detailed treatment of the workings of various components of the TINAC are presented. The TINAC is divided into the following four major functional units: (1) the network access unit (NAU); (2) the buffer management unit; (3) the host interface unit; and (4) the node processor unit.
100 GB/S Time Division Multiplex (TDM) Access Nodes and Regenerators Based on Novel Loop Mirrors with High Nonlinearity Fibers

DTIC Science & Technology

2002-07-01

spectral components remain co-polarized. We confirmed that this was the case by passing the continuum through a polarizing beam splitter . The...propagation direction through polarization beam splitters and aligned along the other axis of the fiber. Co-propagating control and signal pulses...amplifier, PBS = polarization beam splitter . Figure 8. Eye diagram of header processor. This is the trace of the eye diagrams taken with the setup of Fig
Building Columbia from the SysAdmin View

NASA Technical Reports Server (NTRS)

Chan, David

2005-01-01

Project Columbia was built at NASA Ames Research Center in partnership with SGI and Intel. Columbia consists of 20 512 processor Altix machines with 440TB of storage and achieved 51.87 TeraPlops to be ranked the second fastest on the top 500 at SuperComputing 2004. Columbia was delivered, installed and put into production in 3 months. On average, a new Columbia node was brought into production in less than a week. Columbia's configuration, installation, and future plans will be discussed.
DOE Office of Scientific and Technical Information (OSTI.GOV)

Yoshii, Kazutomo; Llopis, Pablo; Zhang, Kaicheng

As CMOS scaling nears its end, parameter variations (process, temperature and voltage) are becoming a major concern. To overcome parameter variations and provide stability, modern processors are becoming dynamic, opportunistically adjusting voltage and frequency based on thermal and energy constraints, which negatively impacts traditional bulk-synchronous parallelism-minded hardware and software designs. As node-level architecture is growing in complexity, implementing variation control mechanisms only with hardware can be a challenging task. In this paper we investigate a software strategy to manage hardwareinduced variations, leveraging low-level monitoring/controlling mechanisms.
Specifications of a Simulation Model for a Local Area Network Design in Support of Stock Point Logistics Integrated Communications Environment (SPLICE).

DTIC Science & Technology

1982-10-01

class queueing system with a preemptive -resume priority service discipline, as depicted in Figure 4.2. Concerning a SPLICLAN configuration a node can...processor can be modeled as a single resource, multi-class queueing system with a preemptive -resume priority structure as the one given in Figure 4.2. An...LOCAL AREA NETWORK DESIGN IN SUPPORT OF STOCK POINT LOGISTICS INTEGRATED COMMUNICATIONS ENVIRONMENT (SPLICE) by Ioannis Th. Mastrocostopoulos October
Functional Specification and Simulation of a Floating Point Co-Processor for SPUR

DTIC Science & Technology

1986-08-01

depend on this state will not be stable until the next phase; this leaves the problem of how to control events that must occur on phi 1 of a cycle. The... problems with the structure of the chip description. The worst of these problems is the absence of Slang constructs for coding separate chip component...constructs such as UNK as well. Another related problem was the inability to explicitly declare the size of Slang node values. \\Vhile the correct
JSC Wireless Sensor Network Update

NASA Technical Reports Server (NTRS)

Wagner, Robert

2010-01-01

Sensor nodes composed of three basic components... radio module: COTS radio module implementing standardized WSN protocol; treated as WSN modem by main board main board: contains application processor (TI MSP430 microcontroller), memory, power supply; responsible for sensor data acquisition, pre-processing, and task scheduling; re-used in every application with growing library of embedded C code sensor card: contains application-specific sensors, data conditioning hardware, and any advanced hardware not built into main board (DSPs, faster A/D, etc.); requires (re-) development for each application.

A Future Accelerated Cognitive Distributed Hybrid Testbed for Big Data Science Analytics

NASA Astrophysics Data System (ADS)

Halem, M.; Prathapan, S.; Golpayegani, N.; Huang, Y.; Blattner, T.; Dorband, J. E.

2016-12-01

As increased sensor spectral data volumes from current and future Earth Observing satellites are assimilated into high-resolution climate models, intensive cognitive machine learning technologies are needed to data mine, extract and intercompare model outputs. It is clear today that the next generation of computers and storage, beyond petascale cluster architectures, will be data centric. They will manage data movement and process data in place. Future cluster nodes have been announced that integrate multiple CPUs with high-speed links to GPUs and MICS on their backplanes with massive non-volatile RAM and access to active flash RAM disk storage. Active Ethernet connected key value store disk storage drives with 10Ge or higher are now available through the Kinetic Open Storage Alliance. At the UMBC Center for Hybrid Multicore Productivity Research, a future state-of-the-art Accelerated Cognitive Computer System (ACCS) for Big Data science is being integrated into the current IBM iDataplex computational system `bluewave'. Based on the next gen IBM 200 PF Sierra processor, an interim two node IBM Power S822 testbed is being integrated with dual Power 8 processors with 10 cores, 1TB Ram, a PCIe to a K80 GPU and an FPGA Coherent Accelerated Processor Interface card to 20TB Flash Ram. This system is to be updated to the Power 8+, an NVlink 1.0 with the Pascal GPU late in 2016. Moreover, the Seagate 96TB Kinetic Disk system with 24 Ethernet connected active disks is integrated into the ACCS storage system. A Lightweight Virtual File System developed at the NASA GSFC is installed on bluewave. Since remote access to publicly available quantum annealing computers is available at several govt labs, the ACCS will offer an in-line Restricted Boltzmann Machine optimization capability to the D-Wave 2X quantum annealing processor over the campus high speed 100 Gb network to Internet 2 for large files. As an evaluation test of the cognitive functionality of the architecture, the following studies utilizing all the system components will be presented; (i) a near real time climate change study generating CO2 fluxes and (ii) a deep dive capability into an 8000 x8000 pixel image pyramid display and (iii) Large dense and sparse eigenvalue decomposition.
Circuit for Communication Over Power Lines

NASA Technical Reports Server (NTRS)

Krasowski, Michael J.; Prokop, Normal F.; Greer, Lawrence C., III; Nappier, Jennifer

2011-01-01

Many distributed systems share common sensors and instruments along with a common power line supplying current to the system. A communication technique and circuit has been developed that allows for the simple inclusion of an instrument, sensor, or actuator node within any system containing a common power bus. Wherever power is available, a node can be added, which can then draw power for itself, its associated sensors, and actuators from the power bus all while communicating with other nodes on the power bus. The technique modulates a DC power bus through capacitive coupling using on-off keying (OOK), and receives and demodulates the signal from the DC power bus through the same capacitive coupling. The circuit acts as serial modem for the physical power line communication. The circuit and technique can be made of commercially available components or included in an application specific integrated circuit (ASIC) design, which allows for the circuit to be included in current designs with additional circuitry or embedded into new designs. This device and technique moves computational, sensing, and actuation abilities closer to the source, and allows for the networking of multiple similar nodes to each other and to a central processor. This technique also allows for reconfigurable systems by adding or removing nodes at any time. It can do so using nothing more than the in situ power wiring of the system.
Fault tolerant hypercube computer system architecture

NASA Technical Reports Server (NTRS)

Madan, Herb S. (Inventor); Chow, Edward (Inventor)

1989-01-01

A fault-tolerant multiprocessor computer system of the hypercube type comprising a hierarchy of computers of like kind which can be functionally substituted for one another as necessary is disclosed. Communication between the working nodes is via one communications network while communications between the working nodes and watch dog nodes and load balancing nodes higher in the structure is via another communications network separate from the first. A typical branch of the hierarchy reporting to a master node or host computer comprises, a plurality of first computing nodes; a first network of message conducting paths for interconnecting the first computing nodes as a hypercube. The first network provides a path for message transfer between the first computing nodes; a first watch dog node; and a second network of message connecting paths for connecting the first computing nodes to the first watch dog node independent from the first network, the second network provides an independent path for test message and reconfiguration affecting transfers between the first computing nodes and the first switch watch dog node. There is additionally, a plurality of second computing nodes; a third network of message conducting paths for interconnecting the second computing nodes as a hypercube. The third network provides a path for message transfer between the second computing nodes; a fourth network of message conducting paths for connecting the second computing nodes to the first watch dog node independent from the third network. The fourth network provides an independent path for test message and reconfiguration affecting transfers between the second computing nodes and the first watch dog node; and a first multiplexer disposed between the first watch dog node and the second and fourth networks for allowing the first watch dog node to selectively communicate with individual ones of the computing nodes through the second and fourth networks; as well as, a second watch dog node operably connected to the first multiplexer whereby the second watch dog node can selectively communicate with individual ones of the computing nodes through the second and fourth networks. The branch is completed by a first load balancing node; and a second multiplexer connected between the first load balancing node and the first and second watch dog nodes, allowing the first load balancing node to selectively communicate with the first and second watch dog nodes.
FLY MPI-2: a parallel tree code for LSS

NASA Astrophysics Data System (ADS)

Becciani, U.; Comparato, M.; Antonuccio-Delogu, V.

2006-04-01

New version program summaryProgram title: FLY 3.1 Catalogue identifier: ADSC_v2_0 Licensing provisions: yes Program summary URL:http://cpc.cs.qub.ac.uk/summaries/ADSC_v2_0 Program obtainable from: CPC Program Library, Queen's University of Belfast, N. Ireland No. of lines in distributed program, including test data, etc.: 158 172 No. of bytes in distributed program, including test data, etc.: 4 719 953 Distribution format: tar.gz Programming language: Fortran 90, C Computer: Beowulf cluster, PC, MPP systems Operating system: Linux, Aix RAM: 100M words Catalogue identifier of previous version: ADSC_v1_0 Journal reference of previous version: Comput. Phys. Comm. 155 (2003) 159 Does the new version supersede the previous version?: yes Nature of problem: FLY is a parallel collisionless N-body code for the calculation of the gravitational force Solution method: FLY is based on the hierarchical oct-tree domain decomposition introduced by Barnes and Hut (1986) Reasons for the new version: The new version of FLY is implemented by using the MPI-2 standard: the distributed version 3.1 was developed by using the MPICH2 library on a PC Linux cluster. Today the FLY performance allows us to consider the FLY code among the most powerful parallel codes for tree N-body simulations. Another important new feature regards the availability of an interface with hydrodynamical Paramesh based codes. Simulations must follow a box large enough to accurately represent the power spectrum of fluctuations on very large scales so that we may hope to compare them meaningfully with real data. The number of particles then sets the mass resolution of the simulation, which we would like to make as fine as possible. The idea to build an interface between two codes, that have different and complementary cosmological tasks, allows us to execute complex cosmological simulations with FLY, specialized for DM evolution, and a code specialized for hydrodynamical components that uses a Paramesh block structure. Summary of revisions: The parallel communication schema was totally changed. The new version adopts the MPICH2 library. Now FLY can be executed on all Unix systems having an MPI-2 standard library. The main data structure, is declared in a module procedure of FLY (fly_h.F90 routine). FLY creates the MPI Window object for one-sided communication for all the shared arrays, with a call like the following: CALL MPI_WIN_CREATE(POS, SIZE, REAL8, MPI_INFO_NULL, MPI_COMM_WORLD, WIN_POS, IERR) the following main window objects are created: win_pos, win_vel, win_acc: particles positions velocities and accelerations, win_pos_cell, win_mass_cell, win_quad, win_subp, win_grouping: cells positions, masses, quadrupole momenta, tree structure and grouping cells. Other windows are created for dynamic load balance and global counters. Restrictions: The program uses the leapfrog integrator schema, but could be changed by the user. Unusual features: FLY uses the MPI-2 standard: the MPICH2 library on Linux systems was adopted. To run this version of FLY the working directory must be shared among all the processors that execute FLY. Additional comments: Full documentation for the program is included in the distribution in the form of a README file, a User Guide and a Reference manuscript. Running time: IBM Linux Cluster 1350, 512 nodes with 2 processors for each node and 2 GB RAM for each processor, at Cineca, was adopted to make performance tests. Processor type: Intel Xeon Pentium IV 3.0 GHz and 512 KB cache (128 nodes have Nocona processors). Internal Network: Myricom LAN Card "C" Version and "D" Version. Operating System: Linux SuSE SLES 8. The code was compiled using the mpif90 compiler version 8.1 and with basic optimization options in order to have performances that could be useful compared with other generic clusters Processors
Dynamic Voltage-Frequency and Workload Joint Scaling Power Management for Energy Harvesting Multi-Core WSN Node SoC

PubMed Central

Li, Xiangyu; Xie, Nijie; Tian, Xinyue

2017-01-01

This paper proposes a scheduling and power management solution for energy harvesting heterogeneous multi-core WSN node SoC such that the system continues to operate perennially and uses the harvested energy efficiently. The solution consists of a heterogeneous multi-core system oriented task scheduling algorithm and a low-complexity dynamic workload scaling and configuration optimization algorithm suitable for light-weight platforms. Moreover, considering the power consumption of most WSN applications have the characteristic of data dependent behavior, we introduce branches handling mechanism into the solution as well. The experimental result shows that the proposed algorithm can operate in real-time on a lightweight embedded processor (MSP430), and that it can make a system do more valuable works and make more than 99.9% use of the power budget. PMID:28208730
A novel strategy for load balancing of distributed medical applications.

PubMed

Logeswaran, Rajasvaran; Chen, Li-Choo

2012-04-01

Current trends in medicine, specifically in the electronic handling of medical applications, ranging from digital imaging, paperless hospital administration and electronic medical records, telemedicine, to computer-aided diagnosis, creates a burden on the network. Distributed Service Architectures, such as Intelligent Network (IN), Telecommunication Information Networking Architecture (TINA) and Open Service Access (OSA), are able to meet this new challenge. Distribution enables computational tasks to be spread among multiple processors; hence, performance is an important issue. This paper proposes a novel approach in load balancing, the Random Sender Initiated Algorithm, for distribution of tasks among several nodes sharing the same computational object (CO) instances in Distributed Service Architectures. Simulations illustrate that the proposed algorithm produces better network performance than the benchmark load balancing algorithms-the Random Node Selection Algorithm and the Shortest Queue Algorithm, especially under medium and heavily loaded conditions.
Dynamic Voltage-Frequency and Workload Joint Scaling Power Management for Energy Harvesting Multi-Core WSN Node SoC.

PubMed

Li, Xiangyu; Xie, Nijie; Tian, Xinyue

2017-02-08

This paper proposes a scheduling and power management solution for energy harvesting heterogeneous multi-core WSN node SoC such that the system continues to operate perennially and uses the harvested energy efficiently. The solution consists of a heterogeneous multi-core system oriented task scheduling algorithm and a low-complexity dynamic workload scaling and configuration optimization algorithm suitable for light-weight platforms. Moreover, considering the power consumption of most WSN applications have the characteristic of data dependent behavior, we introduce branches handling mechanism into the solution as well. The experimental result shows that the proposed algorithm can operate in real-time on a lightweight embedded processor (MSP430), and that it can make a system do more valuable works and make more than 99.9% use of the power budget.
The P-Mesh: A Commodity-based Scalable Network Architecture for Clusters

NASA Technical Reports Server (NTRS)

Nitzberg, Bill; Kuszmaul, Chris; Stockdale, Ian; Becker, Jeff; Jiang, John; Wong, Parkson; Tweten, David (Technical Monitor)

1998-01-01

We designed a new network architecture, the P-Mesh which combines the scalability and fault resilience of a torus with the performance of a switch. We compare the scalability, performance, and cost of the hub, switch, torus, tree, and P-Mesh architectures. The latter three are capable of scaling to thousands of nodes, however, the torus has severe performance limitations with that many processors. The tree and P-Mesh have similar latency, bandwidth, and bisection bandwidth, but the P-Mesh outperforms the switch architecture (a lower bound for tree performance) on 16-node NAB Parallel Benchmark tests by up to 23%, and costs 40% less. Further, the P-Mesh has better fault resilience characteristics. The P-Mesh architecture trades increased management overhead for lower cost, and is a good bridging technology while the price of tree uplinks is expensive.
Avatar - a multi-sensory system for real time body position monitoring.

PubMed

Jovanov, E; Hanish, N; Courson, V; Stidham, J; Stinson, H; Webb, C; Denny, K

2009-01-01

Virtual reality and computer assisted physical rehabilitation applications require an unobtrusive and inexpensive real time monitoring systems. Existing systems are usually complex and expensive and based on infrared monitoring. In this paper we propose Avatar, a hybrid system consisting of off-the-shelf components and sensors. Absolute positioning of a few reference points is determined using infrared diode on subject's body and a set of Wii Remotes as optical sensors. Individual body segments are monitored by intelligent inertial sensor nodes iSense. A network of inertial nodes is controlled by a master node that serves as a gateway for communication with a capture device. Each sensor features a 3D accelerometer and a 2 axis gyroscope. Avatar system is used for control of avatars in Virtual Reality applications, but could be used in a variety of augmented reality, gaming, and computer assisted physical rehabilitation applications.
International Space Station USOS Waste and Hygiene Compartment Development

NASA Technical Reports Server (NTRS)

Link, Dwight E., Jr.; Broyan, James Lee, Jr.; Gelmis, Karen; Philistine, Cynthia; Balistreri, Steven

2007-01-01

The International Space Station (ISS) currently provides human waste collection and hygiene facilities in the Russian Segment Service Module (SM) which supports a three person crew. Additional hardware is planned for the United States Operational Segment (USOS) to support expansion of the crew to six person capability. The additional hardware will be integrated in an ISS standard equipment rack structure that was planned to be installed in the Node 3 element; however, the ISS Program Office recently directed implementation of the rack, or Waste and Hygiene Compartment (WHC), into the U.S. Laboratory element to provide early operational capability. In this configuration, preserved urine from the WHC waste collection system can be processed by the Urine Processor Assembly (UPA) in either the U.S. Lab or Node 3 to recover water for crew consumption or oxygen production. The human waste collection hardware is derived from the Service Module system and is provided by RSC-Energia. This paper describes the concepts, design, and integration of the WHC waste collection hardware into the USOS including integration with U.S. Lab and Node 3 systems.
An Embedded Sensor Node Microcontroller with Crypto-Processors.

PubMed

Panić, Goran; Stecklina, Oliver; Stamenković, Zoran

2016-04-27

Wireless sensor network applications range from industrial automation and control, agricultural and environmental protection, to surveillance and medicine. In most applications, data are highly sensitive and must be protected from any type of attack and abuse. Security challenges in wireless sensor networks are mainly defined by the power and computing resources of sensor devices, memory size, quality of radio channels and susceptibility to physical capture. In this article, an embedded sensor node microcontroller designed to support sensor network applications with severe security demands is presented. It features a low power 16-bitprocessor core supported by a number of hardware accelerators designed to perform complex operations required by advanced crypto algorithms. The microcontroller integrates an embedded Flash and an 8-channel 12-bit analog-to-digital converter making it a good solution for low-power sensor nodes. The article discusses the most important security topics in wireless sensor networks and presents the architecture of the proposed hardware solution. Furthermore, it gives details on the chip implementation, verification and hardware evaluation. Finally, the chip power dissipation and performance figures are estimated and analyzed.
An Embedded Sensor Node Microcontroller with Crypto-Processors

PubMed Central

Panić, Goran; Stecklina, Oliver; Stamenković, Zoran

2016-01-01

Wireless sensor network applications range from industrial automation and control, agricultural and environmental protection, to surveillance and medicine. In most applications, data are highly sensitive and must be protected from any type of attack and abuse. Security challenges in wireless sensor networks are mainly defined by the power and computing resources of sensor devices, memory size, quality of radio channels and susceptibility to physical capture. In this article, an embedded sensor node microcontroller designed to support sensor network applications with severe security demands is presented. It features a low power 16-bitprocessor core supported by a number of hardware accelerators designed to perform complex operations required by advanced crypto algorithms. The microcontroller integrates an embedded Flash and an 8-channel 12-bit analog-to-digital converter making it a good solution for low-power sensor nodes. The article discusses the most important security topics in wireless sensor networks and presents the architecture of the proposed hardware solution. Furthermore, it gives details on the chip implementation, verification and hardware evaluation. Finally, the chip power dissipation and performance figures are estimated and analyzed. PMID:27128925
Performance of a parallel thermal-hydraulics code TEMPEST

DOE Office of Scientific and Technical Information (OSTI.GOV)

Fann, G.I.; Trent, D.S.

The authors describe the parallelization of the Tempest thermal-hydraulics code. The serial version of this code is used for production quality 3-D thermal-hydraulics simulations. Good speedup was obtained with a parallel diagonally preconditioned BiCGStab non-symmetric linear solver, using a spatial domain decomposition approach for the semi-iterative pressure-based and mass-conserved algorithm. The test case used here to illustrate the performance of the BiCGStab solver is a 3-D natural convection problem modeled using finite volume discretization in cylindrical coordinates. The BiCGStab solver replaced the LSOR-ADI method for solving the pressure equation in TEMPEST. BiCGStab also solves the coupled thermal energy equation. Scalingmore » performance of 3 problem sizes (221220 nodes, 358120 nodes, and 701220 nodes) are presented. These problems were run on 2 different parallel machines: IBM-SP and SGI PowerChallenge. The largest problem attains a speedup of 68 on an 128 processor IBM-SP. In real terms, this is over 34 times faster than the fastest serial production time using the LSOR-ADI solver.« less
Efficient Execution of Microscopy Image Analysis on CPU, GPU, and MIC Equipped Cluster Systems.

PubMed

Andrade, G; Ferreira, R; Teodoro, George; Rocha, Leonardo; Saltz, Joel H; Kurc, Tahsin

2014-10-01

High performance computing is experiencing a major paradigm shift with the introduction of accelerators, such as graphics processing units (GPUs) and Intel Xeon Phi (MIC). These processors have made available a tremendous computing power at low cost, and are transforming machines into hybrid systems equipped with CPUs and accelerators. Although these systems can deliver a very high peak performance, making full use of its resources in real-world applications is a complex problem. Most current applications deployed to these machines are still being executed in a single processor, leaving other devices underutilized. In this paper we explore a scenario in which applications are composed of hierarchical data flow tasks which are allocated to nodes of a distributed memory machine in coarse-grain, but each of them may be composed of several finer-grain tasks which can be allocated to different devices within the node. We propose and implement novel performance aware scheduling techniques that can be used to allocate tasks to devices. We evaluate our techniques using a pathology image analysis application used to investigate brain cancer morphology, and our experimental evaluation shows that the proposed scheduling strategies significantly outperforms other efficient scheduling techniques, such as Heterogeneous Earliest Finish Time - HEFT, in cooperative executions using CPUs, GPUs, and MICs. We also experimentally show that our strategies are less sensitive to inaccuracy in the scheduling input data and that the performance gains are maintained as the application scales.
Efficient Execution of Microscopy Image Analysis on CPU, GPU, and MIC Equipped Cluster Systems

PubMed Central

Andrade, G.; Ferreira, R.; Teodoro, George; Rocha, Leonardo; Saltz, Joel H.; Kurc, Tahsin

2015-01-01

High performance computing is experiencing a major paradigm shift with the introduction of accelerators, such as graphics processing units (GPUs) and Intel Xeon Phi (MIC). These processors have made available a tremendous computing power at low cost, and are transforming machines into hybrid systems equipped with CPUs and accelerators. Although these systems can deliver a very high peak performance, making full use of its resources in real-world applications is a complex problem. Most current applications deployed to these machines are still being executed in a single processor, leaving other devices underutilized. In this paper we explore a scenario in which applications are composed of hierarchical data flow tasks which are allocated to nodes of a distributed memory machine in coarse-grain, but each of them may be composed of several finer-grain tasks which can be allocated to different devices within the node. We propose and implement novel performance aware scheduling techniques that can be used to allocate tasks to devices. We evaluate our techniques using a pathology image analysis application used to investigate brain cancer morphology, and our experimental evaluation shows that the proposed scheduling strategies significantly outperforms other efficient scheduling techniques, such as Heterogeneous Earliest Finish Time - HEFT, in cooperative executions using CPUs, GPUs, and MICs. We also experimentally show that our strategies are less sensitive to inaccuracy in the scheduling input data and that the performance gains are maintained as the application scales. PMID:26640423
Modeling a Million-Node Slim Fly Network Using Parallel Discrete-Event Simulation

DOE Office of Scientific and Technical Information (OSTI.GOV)

Wolfe, Noah; Carothers, Christopher; Mubarak, Misbah

As supercomputers close in on exascale performance, the increased number of processors and processing power translates to an increased demand on the underlying network interconnect. The Slim Fly network topology, a new lowdiameter and low-latency interconnection network, is gaining interest as one possible solution for next-generation supercomputing interconnect systems. In this paper, we present a high-fidelity Slim Fly it-level model leveraging the Rensselaer Optimistic Simulation System (ROSS) and Co-Design of Exascale Storage (CODES) frameworks. We validate our Slim Fly model with the Kathareios et al. Slim Fly model results provided at moderately sized network scales. We further scale the modelmore » size up to n unprecedented 1 million compute nodes; and through visualization of network simulation metrics such as link bandwidth, packet latency, and port occupancy, we get an insight into the network behavior at the million-node scale. We also show linear strong scaling of the Slim Fly model on an Intel cluster achieving a peak event rate of 36 million events per second using 128 MPI tasks to process 7 billion events. Detailed analysis of the underlying discrete-event simulation performance shows that a million-node Slim Fly model simulation can execute in 198 seconds on the Intel cluster.« less
Interface Supports Lightweight Subsystem Routing for Flight Applications

NASA Technical Reports Server (NTRS)

Lux, James P.; Block, Gary L.; Ahmad, Mohammad; Whitaker, William D.; Dillon, James W.

2010-01-01

A wireless avionics interface exploits the constrained nature of data networks in flight systems to use a lightweight routing method. This simplified routing means that a processor is not required, and the logic can be implemented as an intellectual property (IP) core in a field-programmable gate array (FPGA). The FPGA can be shared with the flight subsystem application. In addition, the router is aware of redundant subsystems, and can be configured to provide hot standby support as part of the interface. This simplifies implementation of flight applications requiring hot stand - by support. When a valid inbound packet is received from the network, the destination node address is inspected to determine whether the packet is to be processed by this node. Each node has routing tables for the next neighbor node to guide the packet to the destination node. If it is to be processed, the final packet destination is inspected to determine whether the packet is to be forwarded to another node, or routed locally. If the packet is local, it is sent to an Applications Data Interface (ADI), which is attached to a local flight application. Under this scheme, an interface can support many applications in a subsystem supporting a high level of subsystem integration. If the packet is to be forwarded to another node, it is sent to the outbound packet router. The outbound packet router receives packets from an ADI or a packet to be forwarded. It then uses a lookup table to determine the next destination for the packet. Upon detecting a remote subsystem failure, the routing table can be updated to autonomously bypass the failed subsystem.
Scalable NIC-based reduction on large-scale clusters

DOE Office of Scientific and Technical Information (OSTI.GOV)

Moody, A.; Fernández, J. C.; Petrini, F.

2003-01-01

Many parallel algorithms require effiaent support for reduction mllectives. Over the years, researchers have developed optimal reduction algonduns by taking inm account system size, dam size, and complexities of reduction operations. However, all of these algorithm have assumed the faa that the reduction precessing takes place on the host CPU. Modem Network Interface Cards (NICs) sport programmable processors with substantial memory and thus introduce a fresh variable into the equation This raises the following intersting challenge: Can we take advantage of modern NICs to implementJost redudion operations? In this paper, we take on this challenge in the context of large-scalemore » clusters. Through experiments on the 960-node, 1920-processor or ASCI Linux Cluster (ALC) located at the Lawrence Livermore National Laboratory, we show that NIC-based reductions indeed perform with reduced latency and immed consistency over host-based aleorithms for the wmmon case and that these benefits scale as the system grows. In the largest configuration tested--1812 processors-- our NIC-based algorithm can sum a single element vector in 73 ps with 32-bi integers and in 118 with Mbit floating-point numnbers. These results represent an improvement, respeaively, of 121% and 39% with resvect w the {approx}roductionle vel MPI library« less
Spaceborne Processor Array

NASA Technical Reports Server (NTRS)

Chow, Edward T.; Schatzel, Donald V.; Whitaker, William D.; Sterling, Thomas

2008-01-01

A Spaceborne Processor Array in Multifunctional Structure (SPAMS) can lower the total mass of the electronic and structural overhead of spacecraft, resulting in reduced launch costs, while increasing the science return through dynamic onboard computing. SPAMS integrates the multifunctional structure (MFS) and the Gilgamesh Memory, Intelligence, and Network Device (MIND) multi-core in-memory computer architecture into a single-system super-architecture. This transforms every inch of a spacecraft into a sharable, interconnected, smart computing element to increase computing performance while simultaneously reducing mass. The MIND in-memory architecture provides a foundation for high-performance, low-power, and fault-tolerant computing. The MIND chip has an internal structure that includes memory, processing, and communication functionality. The Gilgamesh is a scalable system comprising multiple MIND chips interconnected to operate as a single, tightly coupled, parallel computer. The array of MIND components shares a global, virtual name space for program variables and tasks that are allocated at run time to the distributed physical memory and processing resources. Individual processor- memory nodes can be activated or powered down at run time to provide active power management and to configure around faults. A SPAMS system is comprised of a distributed Gilgamesh array built into MFS, interfaces into instrument and communication subsystems, a mass storage interface, and a radiation-hardened flight computer.
ACCURATE CHEMICAL MASTER EQUATION SOLUTION USING MULTI-FINITE BUFFERS

PubMed Central

Cao, Youfang; Terebus, Anna; Liang, Jie

2016-01-01

The discrete chemical master equation (dCME) provides a fundamental framework for studying stochasticity in mesoscopic networks. Because of the multi-scale nature of many networks where reaction rates have large disparity, directly solving dCMEs is intractable due to the exploding size of the state space. It is important to truncate the state space effectively with quantified errors, so accurate solutions can be computed. It is also important to know if all major probabilistic peaks have been computed. Here we introduce the Accurate CME (ACME) algorithm for obtaining direct solutions to dCMEs. With multi-finite buffers for reducing the state space by O(n!), exact steady-state and time-evolving network probability landscapes can be computed. We further describe a theoretical framework of aggregating microstates into a smaller number of macrostates by decomposing a network into independent aggregated birth and death processes, and give an a priori method for rapidly determining steady-state truncation errors. The maximal sizes of the finite buffers for a given error tolerance can also be pre-computed without costly trial solutions of dCMEs. We show exactly computed probability landscapes of three multi-scale networks, namely, a 6-node toggle switch, 11-node phage-lambda epigenetic circuit, and 16-node MAPK cascade network, the latter two with no known solutions. We also show how probabilities of rare events can be computed from first-passage times, another class of unsolved problems challenging for simulation-based techniques due to large separations in time scales. Overall, the ACME method enables accurate and efficient solutions of the dCME for a large class of networks. PMID:27761104

Lumping of degree-based mean-field and pair-approximation equations for multistate contact processes

NASA Astrophysics Data System (ADS)

Kyriakopoulos, Charalampos; Grossmann, Gerrit; Wolf, Verena; Bortolussi, Luca

2018-01-01

Contact processes form a large and highly interesting class of dynamic processes on networks, including epidemic and information-spreading networks. While devising stochastic models of such processes is relatively easy, analyzing them is very challenging from a computational point of view, particularly for large networks appearing in real applications. One strategy to reduce the complexity of their analysis is to rely on approximations, often in terms of a set of differential equations capturing the evolution of a random node, distinguishing nodes with different topological contexts (i.e., different degrees of different neighborhoods), such as degree-based mean-field (DBMF), approximate-master-equation (AME), or pair-approximation (PA) approaches. The number of differential equations so obtained is typically proportional to the maximum degree kmax of the network, which is much smaller than the size of the master equation of the underlying stochastic model, yet numerically solving these equations can still be problematic for large kmax. In this paper, we consider AME and PA, extended to cope with multiple local states, and we provide an aggregation procedure that clusters together nodes having similar degrees, treating those in the same cluster as indistinguishable, thus reducing the number of equations while preserving an accurate description of global observables of interest. We also provide an automatic way to build such equations and to identify a small number of degree clusters that give accurate results. The method is tested on several case studies, where it shows a high level of compression and a reduction of computational time of several orders of magnitude for large networks, with minimal loss in accuracy.
The Caltech Concurrent Computation Program - Project description

NASA Technical Reports Server (NTRS)

Fox, G.; Otto, S.; Lyzenga, G.; Rogstad, D.

1985-01-01

The Caltech Concurrent Computation Program wwhich studies basic issues in computational science is described. The research builds on initial work where novel concurrent hardware, the necessary systems software to use it and twenty significant scientific implementations running on the initial 32, 64, and 128 node hypercube machines have been constructed. A major goal of the program will be to extend this work into new disciplines and more complex algorithms including general packages that decompose arbitrary problems in major application areas. New high-performance concurrent processors with up to 1024-nodes, over a gigabyte of memory and multigigaflop performance are being constructed. The implementations cover a wide range of problems in areas such as high energy and astrophysics, condensed matter, chemical reactions, plasma physics, applied mathematics, geophysics, simulation, CAD for VLSI, graphics and image processing. The products of the research program include the concurrent algorithms, hardware, systems software, and complete program implementations.
T-L Plane Abstraction-Based Energy-Efficient Real-Time Scheduling for Multi-Core Wireless Sensors.

PubMed

Kim, Youngmin; Lee, Ki-Seong; Pham, Ngoc-Son; Lee, Sun-Ro; Lee, Chan-Gun

2016-07-08

Energy efficiency is considered as a critical requirement for wireless sensor networks. As more wireless sensor nodes are equipped with multi-cores, there are emerging needs for energy-efficient real-time scheduling algorithms. The T-L plane-based scheme is known to be an optimal global scheduling technique for periodic real-time tasks on multi-cores. Unfortunately, there has been a scarcity of studies on extending T-L plane-based scheduling algorithms to exploit energy-saving techniques. In this paper, we propose a new T-L plane-based algorithm enabling energy-efficient real-time scheduling on multi-core sensor nodes with dynamic power management (DPM). Our approach addresses the overhead of processor mode transitions and reduces fragmentations of the idle time, which are inherent in T-L plane-based algorithms. Our experimental results show the effectiveness of the proposed algorithm compared to other energy-aware scheduling methods on T-L plane abstraction.
CAPRI (Computational Analysis PRogramming Interface): A Solid Modeling Based Infra-Structure for Engineering Analysis and Design Simulations

NASA Technical Reports Server (NTRS)

Haimes, Robert; Follen, Gregory J.

1998-01-01

CAPRI is a CAD-vendor neutral application programming interface designed for the construction of analysis and design systems. By allowing access to the geometry from within all modules (grid generators, solvers and post-processors) such tasks as meshing on the actual surfaces, node enrichment by solvers and defining which mesh faces are boundaries (for the solver and visualization system) become simpler. The overall reliance on file 'standards' is minimized. This 'Geometry Centric' approach makes multi-physics (multi-disciplinary) analysis codes much easier to build. By using the shared (coupled) surface as the foundation, CAPRI provides a single call to interpolate grid-node based data from the surface discretization in one volume to another. Finally, design systems are possible where the results can be brought back into the CAD system (and therefore manufactured) because all geometry construction and modification are performed using the CAD system's geometry kernel.
Hierarchical Address Event Routing for Reconfigurable Large-Scale Neuromorphic Systems.

PubMed

Park, Jongkil; Yu, Theodore; Joshi, Siddharth; Maier, Christoph; Cauwenberghs, Gert

2017-10-01

We present a hierarchical address-event routing (HiAER) architecture for scalable communication of neural and synaptic spike events between neuromorphic processors, implemented with five Xilinx Spartan-6 field-programmable gate arrays and four custom analog neuromophic integrated circuits serving 262k neurons and 262M synapses. The architecture extends the single-bus address-event representation protocol to a hierarchy of multiple nested buses, routing events across increasing scales of spatial distance. The HiAER protocol provides individually programmable axonal delay in addition to strength for each synapse, lending itself toward biologically plausible neural network architectures, and scales across a range of hierarchies suitable for multichip and multiboard systems in reconfigurable large-scale neuromorphic systems. We show approximately linear scaling of net global synaptic event throughput with number of routing nodes in the network, at 3.6×10 7 synaptic events per second per 16k-neuron node in the hierarchy.
Speeding up parallel processing

NASA Technical Reports Server (NTRS)

Denning, Peter J.

1988-01-01

In 1967 Amdahl expressed doubts about the ultimate utility of multiprocessors. The formulation, now called Amdahl's law, became part of the computing folklore and has inspired much skepticism about the ability of the current generation of massively parallel processors to efficiently deliver all their computing power to programs. The widely publicized recent results of a group at Sandia National Laboratory, which showed speedup on a 1024 node hypercube of over 500 for three fixed size problems and over 1000 for three scalable problems, have convincingly challenged this bit of folklore and have given new impetus to parallel scientific computing.
Modular System Control Development Model (MSCDM). Design Specification.

DTIC Science & Technology

1979-08-01

with power supply and ¶ can be used independently of the loop. The PDU can be used as a general purpose processor. The loop is contained in a separate...inputs to nodes 22 (VSQC), 23 (DSQC ) , and 26 (BWBSA) will be generated by a LSI—ll microprocessor used as a simulated input generator (SIG). The SIG...who c o b m n u n i — cate tau lt - s to the FIAC module. F~IAC generates even t reports to the OCRI and DBMS. The PDP1I/40 in loop 2 generates
Polymorphous Computing Architecture (PCA) Kernel Benchmark Measurements on the MIT Raw Microprocessor

DTIC Science & Technology

2006-06-14

Robert Graybill . A Raw hoard for the use of this project was provided by the Computer Architecture Croup at the Massachusetts Institute of Technology...simulator is presented by MIT as being an accurate model of the Raw chip, we have found that it does not accurately model the board. Our comparison...G4 processor, model 7410. with a 32 kbyte level-1 cache on-chip and a 2 Mbyte L2 cache connected through a 250 MH/ bus [12]. Each node has 256 Mbyte
Feasibility study, software design, layout and simulation of a two-dimensional Fast Fourier Transform machine for use in optical array interferometry

NASA Technical Reports Server (NTRS)

Boriakoff, Valentin

1994-01-01

The goal of this project was the feasibility study of a particular architecture of a digital signal processing machine operating in real time which could do in a pipeline fashion the computation of the fast Fourier transform (FFT) of a time-domain sampled complex digital data stream. The particular architecture makes use of simple identical processors (called inner product processors) in a linear organization called a systolic array. Through computer simulation the new architecture to compute the FFT with systolic arrays was proved to be viable, and computed the FFT correctly and with the predicted particulars of operation. Integrated circuits to compute the operations expected of the vital node of the systolic architecture were proven feasible, and even with a 2 micron VLSI technology can execute the required operations in the required time. Actual construction of the integrated circuits was successful in one variant (fixed point) and unsuccessful in the other (floating point).
Operating experience with a VMEbus multiprocessor system for data acquisition and reduction in nuclear physics

NASA Astrophysics Data System (ADS)

Kutt, P. H.; Balamuth, D. P.

1989-10-01

Summary form only given, as follows. A multiprocessor system based on commercially available VMEbus components has been developed for the acquisition and reduction of event-mode data in nuclear physics experiments. The system contains seven 68000 CPUs and 14 Mbyte of memory. A minimal operating system handles data transfer and task allocation, and a compiler for a specially designed event analysis language produces code for the processors. The system has been in operation for four years at the University of Pennsylvania Tandem Accelerator Laboratory. Computation rates over three times that of a MicroVAX II have been achieved at a fraction of the cost. The use of WORM optical disks for event recording allows the processing of gigabyte data sets without operator intervention. A more powerful system is being planned which will make use of recently developed RISC (reduced instruction set computer) processors to obtain an order of magnitude increase in computing power per node.
Impacts of the IBM Cell Processor to Support Climate Models

NASA Technical Reports Server (NTRS)

Zhou, Shujia; Duffy, Daniel; Clune, Tom; Suarez, Max; Williams, Samuel; Halem, Milt

2008-01-01

NASA is interested in the performance and cost benefits for adapting its applications to the IBM Cell processor. However, its 256KB local memory per SPE and the new communication mechanism, make it very challenging to port an application. We selected the solar radiation component of the NASA GEOS-5 climate model, which: (1) is representative of column physics (approximately 50% computational time), (2) has a high computational load relative to transferring data from and to main memory, (3) performs independent calculations across multiple columns. We converted the baseline code (single-precision, Fortran) to C and ported it with manually SIMDizing 4 independent columns and found that a Cell with 8 SPEs can process 2274 columns per second. Compared with the baseline results, the Cell is approximately 5.2X, approximately 8.2X, approximately 15.1X faster than a core on Intel Woodcrest, Dempsey, and Itanium2, respectively. We believe this dramatic performance improvement makes a hybrid cluster with Cell and traditional nodes competitive.
Energy Efficient Image/Video Data Transmission on Commercial Multi-Core Processors

PubMed Central

Lee, Sungju; Kim, Heegon; Chung, Yongwha; Park, Daihee

2012-01-01

In transmitting image/video data over Video Sensor Networks (VSNs), energy consumption must be minimized while maintaining high image/video quality. Although image/video compression is well known for its efficiency and usefulness in VSNs, the excessive costs associated with encoding computation and complexity still hinder its adoption for practical use. However, it is anticipated that high-performance handheld multi-core devices will be used as VSN processing nodes in the near future. In this paper, we propose a way to improve the energy efficiency of image and video compression with multi-core processors while maintaining the image/video quality. We improve the compression efficiency at the algorithmic level or derive the optimal parameters for the combination of a machine and compression based on the tradeoff between the energy consumption and the image/video quality. Based on experimental results, we confirm that the proposed approach can improve the energy efficiency of the straightforward approach by a factor of 2∼5 without compromising image/video quality. PMID:23202181
Load Balancing Strategies for Multiphase Flows on Structured Grids

NASA Astrophysics Data System (ADS)

Olshefski, Kristopher; Owkes, Mark

2017-11-01

The computation time required to perform large simulations of complex systems is currently one of the leading bottlenecks of computational research. Parallelization allows multiple processing cores to perform calculations simultaneously and reduces computational times. However, load imbalances between processors waste computing resources as processors wait for others to complete imbalanced tasks. In multiphase flows, these imbalances arise due to the additional computational effort required at the gas-liquid interface. However, many current load balancing schemes are only designed for unstructured grid applications. The purpose of this research is to develop a load balancing strategy while maintaining the simplicity of a structured grid. Several approaches are investigated including brute force oversubscription, node oversubscription through Message Passing Interface (MPI) commands, and shared memory load balancing using OpenMP. Each of these strategies are tested with a simple one-dimensional model prior to implementation into the three-dimensional NGA code. Current results show load balancing will reduce computational time by at least 30%.
The Fermilab lattice supercomputer project

DOE Office of Scientific and Technical Information (OSTI.GOV)

Fischler, M.; Atac, R.; Cook, A.

1989-02-01

The ACPMAPS system is a highly cost effective, local memory MIMD computer targeted at algorithm development and production running for gauge theory on the lattice. The machine consists of a compound hypercube of crates, each of which is a full crossbar switch containing several processors. The processing nodes are single board array processors based on the Weitek XL chip set, each with a peak power of 20 MFLOPS and supported by 8 MBytes of data memory. The system currently being assembled has a peak power of 5 GFLOPS, delivering performance at approximately $250/MFLOP. The system is programmable in C andmore » Fortran. An underpinning of software routines (CANOPY) provides an easy and natural way of coding lattice problems, such that the details of parallelism, and communication and system architecture are transparent to the user. CANOPY can easily be ported to any single CPU or MIMD system which supports C, and allows the coding of typical applications with very little effort. 3 refs., 1 fig.« less
The cost of conservative synchronization in parallel discrete event simulations

NASA Technical Reports Server (NTRS)

Nicol, David M.

1990-01-01

The performance of a synchronous conservative parallel discrete-event simulation protocol is analyzed. The class of simulation models considered is oriented around a physical domain and possesses a limited ability to predict future behavior. A stochastic model is used to show that as the volume of simulation activity in the model increases relative to a fixed architecture, the complexity of the average per-event overhead due to synchronization, event list manipulation, lookahead calculations, and processor idle time approach the complexity of the average per-event overhead of a serial simulation. The method is therefore within a constant factor of optimal. The analysis demonstrates that on large problems--those for which parallel processing is ideally suited--there is often enough parallel workload so that processors are not usually idle. The viability of the method is also demonstrated empirically, showing how good performance is achieved on large problems using a thirty-two node Intel iPSC/2 distributed memory multiprocessor.
The research and application of multi-biometric acquisition embedded system

NASA Astrophysics Data System (ADS)

Deng, Shichao; Liu, Tiegen; Guo, Jingjing; Li, Xiuyan

2009-11-01

The identification technology based on multi-biometric can greatly improve the applicability, reliability and antifalsification. This paper presents a multi-biometric system bases on embedded system, which includes: three capture daughter boards are applied to obtain different biometric: one each for fingerprint, iris and vein of the back of hand; FPGA (Field Programmable Gate Array) is designed as coprocessor, which uses to configure three daughter boards on request and provides data path between DSP (digital signal processor) and daughter boards; DSP is the master processor and its functions include: control the biometric information acquisition, extracts feature as required and responsible for compare the results with the local database or data server through network communication. The advantages of this system were it can acquire three different biometric in real time, extracts complexity feature flexibly in different biometrics' raw data according to different purposes and arithmetic and network interface on the core-board will be the solution of big data scale. Because this embedded system has high stability, reliability, flexibility and fit for different data scale, it can satisfy the demand of multi-biometric recognition.
Three-dimensional flat shell-to-shell coupling: numerical challenges

NASA Astrophysics Data System (ADS)

Guo, Kuo; Haikal, Ghadir

2017-11-01

The node-to-surface formulation is widely used in contact simulations with finite elements because it is relatively easy to implement using different types of element discretizations. This approach, however, has a number of well-known drawbacks, including locking due to over-constraint when this formulation is used as a twopass method. Most studies on the node-to-surface contact formulation, however, have been conducted using solid elements and little has been done to investigate the effectiveness of this approach for beam or shell elements. In this paper we show that locking can also be observed with the node-to-surface contact formulation when applied to plate and flat shell elements even with a singlepass implementation with distinct master/slave designations, which is the standard solution to locking with solid elements. In our study, we use the quadrilateral four node flat shell element for thin (Kirchhoff-Love) plate and thick (Reissner-Mindlin) plate theory, both in their standard forms and with improved formulations such as the linked interpolation [1] and the Discrete Kirchhoff [2] elements for thick and thin plates, respectively. The Lagrange multiplier method is used to enforce the node-to-surface constraints for all elements. The results show clear locking when compared to those obtained using a conforming mesh configuration.
DNS of incompressible turbulence in a periodic box with up to 4096^3 grid points

NASA Astrophysics Data System (ADS)

Kaneda, Yukio

2007-11-01

Turbulence of incompressible fluid obeying the Navier-Stokes (NS) equations under periodic boundary conditions is one of the simplest dynamical systems keeping the essence of turbulence dynamics, and suitable for the study of high Reynolds number (Re) turbulence by direct numerical simulation (DNS). This talk presents a review on DNS of such a system with the number N^3 of the grid points up to 4096^3, performed on the Earth Simulator (ES). The ES consists of 640 processor nodes (=5120 arithmetic processors) with 10TB of main memory and the peak performance of 40 Tflops. The DNSs are based on a spectral method free from alias error. The convolution sums in the wave vector space were evaluated by radix-4 Fast Fourier Transforms with double precision arithmetic. Sustained performance of 16.4 Tflops was achieved on the 2048^3 DNS by using 512 processor nodes of the ES. The DNSs consist of two series; one is with kmax η1 (Series 1) and the other with kmax η2 (Series 2), where kmax is the highest wavenumber in each simulation, and η is the Kolmogorov length scale. In the 4096^3 DNS, the Taylor-scale Reynolds number Rλ1130 (675) and the ratio L/η of the integral length scale L to η is approximately 2133(1040), in Series 1 (Series 2). Such DNS data are expected to shed some light on the basic questions in turbulence research, including those on (i) the normalized mean rate of energy dissipation in the high Re limit, (ii) the universality of energy spectrum at small scale, (iii) scale- and Re- dependences of the statistics, and (iv) intermittency. We have constructed a database consisting of (a) animations and figures of turbulent fields (b) statistics including those associated with (i)-(iv) noted above, (c) snapshot data of the velocity fields. The data size of (c) can be very large for large N. For example, one snapshot of single precision data of the velocity vector field of the 4096^3 DNS requires approximately 0.8 TB.
Measurements of the LHCb software stack on the ARM architecture

NASA Astrophysics Data System (ADS)

Vijay Kartik, S.; Couturier, Ben; Clemencic, Marco; Neufeld, Niko

2014-06-01

The ARM architecture is a power-efficient design that is used in most processors in mobile devices all around the world today since they provide reasonable compute performance per watt. The current LHCb software stack is designed (and thus expected) to build and run on machines with the x86/x86_64 architecture. This paper outlines the process of measuring the performance of the LHCb software stack on the ARM architecture - specifically, the ARMv7 architecture on Cortex-A9 processors from NVIDIA and on full-fledged ARM servers with chipsets from Calxeda - and makes comparisons with the performance on x86_64 architectures on the Intel Xeon L5520/X5650 and AMD Opteron 6272. The paper emphasises the aspects of performance per core with respect to the power drawn by the compute nodes for the given performance - this ensures a fair real-world comparison with much more 'powerful' Intel/AMD processors. The comparisons of these real workloads in the context of LHCb are also complemented with the standard synthetic benchmarks HEPSPEC and Coremark. The pitfalls and solutions for the non-trivial task of porting the source code to build for the ARMv7 instruction set are presented. The specific changes in the build process needed for ARM-specific portions of the software stack are described, to serve as pointers for further attempts taken up by other groups in this direction. Cases where architecture-specific tweaks at the assembler lever (both in ROOT and the LHCb software stack) were needed for a successful compile are detailed - these cases are good indicators of where/how the software stack as well as the build system can be made more portable and multi-arch friendly. The experience gained from the tasks described in this paper are intended to i) assist in making an informed choice about ARM-based server solutions as a feasible low-power alternative to the current compute nodes, and ii) revisit the software design and build system for portability and generic improvements.
Smart photonic networks and computer security for image data

NASA Astrophysics Data System (ADS)

Campello, Jorge; Gill, John T.; Morf, Martin; Flynn, Michael J.

1998-02-01

Work reported here is part of a larger project on 'Smart Photonic Networks and Computer Security for Image Data', studying the interactions of coding and security, switching architecture simulations, and basic technologies. Coding and security: coding methods that are appropriate for data security in data fusion networks were investigated. These networks have several characteristics that distinguish them form other currently employed networks, such as Ethernet LANs or the Internet. The most significant characteristics are very high maximum data rates; predominance of image data; narrowcasting - transmission of data form one source to a designated set of receivers; data fusion - combining related data from several sources; simple sensor nodes with limited buffering. These characteristics affect both the lower level network design and the higher level coding methods.Data security encompasses privacy, integrity, reliability, and availability. Privacy, integrity, and reliability can be provided through encryption and coding for error detection and correction. Availability is primarily a network issue; network nodes must be protected against failure or routed around in the case of failure. One of the more promising techniques is the use of 'secret sharing'. We consider this method as a special case of our new space-time code diversity based algorithms for secure communication. These algorithms enable us to exploit parallelism and scalable multiplexing schemes to build photonic network architectures. A number of very high-speed switching and routing architectures and their relationships with very high performance processor architectures were studied. Indications are that routers for very high speed photonic networks can be designed using the very robust and distributed TCP/IP protocol, if suitable processor architecture support is available.

The CEOS International Directory Network: Progress and Plans, Spring, 1999

NASA Technical Reports Server (NTRS)

Olsen, Lola M.

1999-01-01

The Global Change Master Directory (GCMD) serves as the software development hub for the Committee on Earth observation Satellites' (CEOS) International Directory Network (IDN). The GCMD has upgraded the software for the IDN nodes as Version 7 of the GCMD: MD7-Oracle and MD7-Isite, as well as three other MD7 experimental interfaces. The contribution by DLR representatives (Germany) of the DLR Thesaurus will be demonstrated as an educational tool for use with MD7-Isite. The software will be installed at twelve nodes around the world: Brazil, Argentina, the Netherlands, Canada, France, Germany, Italy, Japan, Australia, New Zealand, Switzerland, and several sites in the United States. Representing NASA for the International Directory Network and the CEOS Data Access Subgroup, NASA's contribution to this international interoperability effort will be updated. Discussion will include interoperability with the CEOS Interoperability Protocol (CIP), features of the latest version of the software, including upgraded capabilities for distributed input by the IDN nodes, installation logistics, "mirroring", population objectives, and future plans.
The CEOS International Directory Network Progress and Plans: Spring, 1999

NASA Technical Reports Server (NTRS)

Olsen, Lola M.

1999-01-01

The Global Change Master Directory (GCMD) serves as the software development hub for the Committee on Earth Observation Satellites' (CEOS) International Directory Network (IDN). The GCMD has upgraded the software for the IDN nodes as Version 7 of the GCMD: MD7-Oracle and MD7-Isite, as well as three other MD7 experimental interfaces. The contribution by DLR representatives (Germany) of the DLR Thesaurus will be demonstrated as an educational tool for use with MD7-Isite. The software will be installed at twelve nodes around the world: Brazil, Argentina, the Netherlands, Canada, France, Germany, Italy, Japan, Australia, New Zealand, Switzerland, and several sites in the United States. Representing NASA for the International Directory Network and the CEOS Data Access Subgroup, NASA's contribution to this international interoperability effort will be updated. Discussion will include interoperability with the CEOS Interoperability Protocol (CIP), features of the latest version of the software, including upgraded capabilities for distributed input by the IDN nodes, installation logistics, "mirroring', population objectives, and future plans.
GBSFP: General Bluetooth Scatternet Formation Protocol for Ad Hoc Networking

NASA Astrophysics Data System (ADS)

Lim, Chaegwon; Huh, Myung-Sun; Choi, Chong-Ho; Jeong, Gu-Min

Recently, bluetooth technology has become widely prevalent so that many laptops and mobile phones are equipped with bluetooth capability. In order to meet the increasing demand to interconnect these devices a new scatternet formation protocol named GBSFP (General Bluetooth Scatternet Formation Protocol) is proposed in this paper. GBSFP is the result of efforts to overcome the two major limitations of the legacy scatternet formation protocols as regards their real implementation, that all of the nodes should be within the Bluetooth communication range or that they should be time synchronized. In GBSFP, a node goes through three phases; 1) the Init phase to establish a bluetooth link to as many of its neighbors as possible, 2) the Ready phase to determine the role of each node, i.e., master or slave, and remove any unnecessary bluetooth links, and 3) the Complete phase to finalize the formation of the scatternet and begin data transmission. The simulation results show that GBSFP provides higher connectivity in many scenarios compared with BTCP and BlueStars.
Parallel Application Performance on Two Generations of Intel Xeon HPC Platforms

DOE Office of Scientific and Technical Information (OSTI.GOV)

Chang, Christopher H.; Long, Hai; Sides, Scott

2015-10-15

Two next-generation node configurations hosting the Haswell microarchitecture were tested with a suite of microbenchmarks and application examples, and compared with a current Ivy Bridge production node on NREL" tm s Peregrine high-performance computing cluster. A primary conclusion from this study is that the additional cores are of little value to individual task performance--limitations to application parallelism, or resource contention among concurrently running but independent tasks, limits effective utilization of these added cores. Hyperthreading generally impacts throughput negatively, but can improve performance in the absence of detailed attention to runtime workflow configuration. The observations offer some guidance to procurement ofmore » future HPC systems at NREL. First, raw core count must be balanced with available resources, particularly memory bandwidth. Balance-of-system will determine value more than processor capability alone. Second, hyperthreading continues to be largely irrelevant to the workloads that are commonly seen, and were tested here, at NREL. Finally, perhaps the most impactful enhancement to productivity might occur through enabling multiple concurrent jobs per node. Given the right type and size of workload, more may be achieved by doing many slow things at once, than fast things in order.« less
DOE Office of Scientific and Technical Information (OSTI.GOV)

Krueger, Jens; Micikevicius, Paulius; Williams, Samuel

Reverse Time Migration (RTM) is one of the main approaches in the seismic processing industry for imaging the subsurface structure of the Earth. While RTM provides qualitative advantages over its predecessors, it has a high computational cost warranting implementation on HPC architectures. We focus on three progressively more complex kernels extracted from RTM: for isotropic (ISO), vertical transverse isotropic (VTI) and tilted transverse isotropic (TTI) media. In this work, we examine performance optimization of forward wave modeling, which describes the computational kernels used in RTM, on emerging multi- and manycore processors and introduce a novel common subexpression elimination optimization formore » TTI kernels. We compare attained performance and energy efficiency in both the single-node and distributed memory environments in order to satisfy industry’s demands for fidelity, performance, and energy efficiency. Moreover, we discuss the interplay between architecture (chip and system) and optimizations (both on-node computation) highlighting the importance of NUMA-aware approaches to MPI communication. Ultimately, our results show we can improve CPU energy efficiency by more than 10× on Magny Cours nodes while acceleration via multiple GPUs can surpass the energy-efficient Intel Sandy Bridge by as much as 3.6×.« less
WebStruct and VisualStruct: Web interfaces and visualization for Structure software implemented in a cluster environment.

PubMed

Jayashree, B; Rajgopal, S; Hoisington, D; Prasanth, V P; Chandra, S

2008-09-24

Structure, is a widely used software tool to investigate population genetic structure with multi-locus genotyping data. The software uses an iterative algorithm to group individuals into "K" clusters, representing possibly K genetically distinct subpopulations. The serial implementation of this programme is processor-intensive even with small datasets. We describe an implementation of the program within a parallel framework. Speedup was achieved by running different replicates and values of K on each node of the cluster. A web-based user-oriented GUI has been implemented in PHP, through which the user can specify input parameters for the programme. The number of processors to be used can be specified in the background command. A web-based visualization tool "Visualstruct", written in PHP (HTML and Java script embedded), allows for the graphical display of population clusters output from Structure, where each individual may be visualized as a line segment with K colors defining its possible genomic composition with respect to the K genetic sub-populations. The advantage over available programs is in the increased number of individuals that can be visualized. The analyses of real datasets indicate a speedup of up to four, when comparing the speed of execution on clusters of eight processors with the speed of execution on one desktop. The software package is freely available to interested users upon request.
Parallelization of a Monte Carlo particle transport simulation code

NASA Astrophysics Data System (ADS)

Hadjidoukas, P.; Bousis, C.; Emfietzoglou, D.

2010-05-01

We have developed a high performance version of the Monte Carlo particle transport simulation code MC4. The original application code, developed in Visual Basic for Applications (VBA) for Microsoft Excel, was first rewritten in the C programming language for improving code portability. Several pseudo-random number generators have been also integrated and studied. The new MC4 version was then parallelized for shared and distributed-memory multiprocessor systems using the Message Passing Interface. Two parallel pseudo-random number generator libraries (SPRNG and DCMT) have been seamlessly integrated. The performance speedup of parallel MC4 has been studied on a variety of parallel computing architectures including an Intel Xeon server with 4 dual-core processors, a Sun cluster consisting of 16 nodes of 2 dual-core AMD Opteron processors and a 200 dual-processor HP cluster. For large problem size, which is limited only by the physical memory of the multiprocessor server, the speedup results are almost linear on all systems. We have validated the parallel implementation against the serial VBA and C implementations using the same random number generator. Our experimental results on the transport and energy loss of electrons in a water medium show that the serial and parallel codes are equivalent in accuracy. The present improvements allow for studying of higher particle energies with the use of more accurate physical models, and improve statistics as more particles tracks can be simulated in low response time.
Towards Highly Scalable Ab Initio Molecular Dynamics (AIMD) Simulations on the Intel Knights Landing Manycore Processor

DOE Office of Scientific and Technical Information (OSTI.GOV)

Jacquelin, Mathias; De Jong, Wibe A.; Bylaska, Eric J.

2017-07-03

The Ab Initio Molecular Dynamics (AIMD) method allows scientists to treat the dynamics of molecular and condensed phase systems while retaining a first-principles-based description of their interactions. This extremely important method has tremendous computational requirements, because the electronic Schr¨odinger equation, approximated using Kohn-Sham Density Functional Theory (DFT), is solved at every time step. With the advent of manycore architectures, application developers have a significant amount of processing power within each compute node that can only be exploited through massive parallelism. A compute intensive application such as AIMD forms a good candidate to leverage this processing power. In this paper, wemore » focus on adding thread level parallelism to the plane wave DFT methodology implemented in NWChem. Through a careful optimization of tall-skinny matrix products, which are at the heart of the Lagrange multiplier and nonlocal pseudopotential kernels, as well as 3D FFTs, our OpenMP implementation delivers excellent strong scaling on the latest Intel Knights Landing (KNL) processor. We assess the efficiency of our Lagrange multiplier kernels by building a Roofline model of the platform, and verify that our implementation is close to the roofline for various problem sizes. Finally, we present strong scaling results on the complete AIMD simulation for a 64 water molecules test case, that scales up to all 68 cores of the Knights Landing processor.« less
Accurate chemical master equation solution using multi-finite buffers

DOE PAGES

Cao, Youfang; Terebus, Anna; Liang, Jie

2016-06-29

Here, the discrete chemical master equation (dCME) provides a fundamental framework for studying stochasticity in mesoscopic networks. Because of the multiscale nature of many networks where reaction rates have a large disparity, directly solving dCMEs is intractable due to the exploding size of the state space. It is important to truncate the state space effectively with quantified errors, so accurate solutions can be computed. It is also important to know if all major probabilistic peaks have been computed. Here we introduce the accurate CME (ACME) algorithm for obtaining direct solutions to dCMEs. With multifinite buffers for reducing the state spacemore » by $O(n!)$, exact steady-state and time-evolving network probability landscapes can be computed. We further describe a theoretical framework of aggregating microstates into a smaller number of macrostates by decomposing a network into independent aggregated birth and death processes and give an a priori method for rapidly determining steady-state truncation errors. The maximal sizes of the finite buffers for a given error tolerance can also be precomputed without costly trial solutions of dCMEs. We show exactly computed probability landscapes of three multiscale networks, namely, a 6-node toggle switch, 11-node phage-lambda epigenetic circuit, and 16-node MAPK cascade network, the latter two with no known solutions. We also show how probabilities of rare events can be computed from first-passage times, another class of unsolved problems challenging for simulation-based techniques due to large separations in time scales. Overall, the ACME method enables accurate and efficient solutions of the dCME for a large class of networks.« less
Tape underlayment rotary-node (TURN) valves for simple on-chip microfluidic flow control

PubMed Central

Markov, Dmitry A.; Manuel, Steven; Shor, Leslie M.; Opalenik, Susan R.; Wikswo, John P.; Samson, Philip C.

2013-01-01

We describe a simple and reliable fabrication method for producing multiple, manually activated microfluidic control valves in polydimethylsiloxane (PDMS) devices. These screwdriver-actuated valves reside directly on the microfluidic chip and can provide both simple on/off operation as well as graded control of fluid flow. The fabrication procedure can be easily implemented in any soft lithography lab and requires only two specialized tools – a hot-glue gun and a machined brass mold. To facilitate use in multi-valve fluidic systems, the mold is designed to produce a linear tape that contains a series of plastic rotary nodes with small stainless steel machine screws that form individual valves which can be easily separated for applications when only single valves are required. The tape and its valves are placed on the surface of a partially cured thin PDMS microchannel device while the PDMS is still on the soft-lithographic master, with the master providing alignment marks for the tape. The tape is permanently affixed to the microchannel device by pouring an over-layer of PDMS, to form a full-thickness device with the tape as an enclosed underlayment. The advantages of these Tape Underlayment Rotary-Node (TURN) valves include parallel fabrication of multiple valves, low risk of damaging a microfluidic device during valve installation, high torque, elimination of stripped threads, the capabilities of TURN hydraulic actuators, and facile customization of TURN molds. We have utilized these valves to control microfluidic flow, to control the onset of molecular diffusion, and to manipulate channel connectivity. Practical applications of TURN valves include control of loading and chemokine release in chemotaxis assay devices, flow in microfluidic bioreactors, and channel connectivity in microfluidic devices intended to study competition and predator / prey relationships among microbes. PMID:19859812
Lumping of degree-based mean-field and pair-approximation equations for multistate contact processes.

PubMed

Kyriakopoulos, Charalampos; Grossmann, Gerrit; Wolf, Verena; Bortolussi, Luca

2018-01-01

Contact processes form a large and highly interesting class of dynamic processes on networks, including epidemic and information-spreading networks. While devising stochastic models of such processes is relatively easy, analyzing them is very challenging from a computational point of view, particularly for large networks appearing in real applications. One strategy to reduce the complexity of their analysis is to rely on approximations, often in terms of a set of differential equations capturing the evolution of a random node, distinguishing nodes with different topological contexts (i.e., different degrees of different neighborhoods), such as degree-based mean-field (DBMF), approximate-master-equation (AME), or pair-approximation (PA) approaches. The number of differential equations so obtained is typically proportional to the maximum degree k_{max} of the network, which is much smaller than the size of the master equation of the underlying stochastic model, yet numerically solving these equations can still be problematic for large k_{max}. In this paper, we consider AME and PA, extended to cope with multiple local states, and we provide an aggregation procedure that clusters together nodes having similar degrees, treating those in the same cluster as indistinguishable, thus reducing the number of equations while preserving an accurate description of global observables of interest. We also provide an automatic way to build such equations and to identify a small number of degree clusters that give accurate results. The method is tested on several case studies, where it shows a high level of compression and a reduction of computational time of several orders of magnitude for large networks, with minimal loss in accuracy.
Accurate chemical master equation solution using multi-finite buffers

DOE Office of Scientific and Technical Information (OSTI.GOV)

Cao, Youfang; Terebus, Anna; Liang, Jie

Here, the discrete chemical master equation (dCME) provides a fundamental framework for studying stochasticity in mesoscopic networks. Because of the multiscale nature of many networks where reaction rates have a large disparity, directly solving dCMEs is intractable due to the exploding size of the state space. It is important to truncate the state space effectively with quantified errors, so accurate solutions can be computed. It is also important to know if all major probabilistic peaks have been computed. Here we introduce the accurate CME (ACME) algorithm for obtaining direct solutions to dCMEs. With multifinite buffers for reducing the state spacemore » by $O(n!)$, exact steady-state and time-evolving network probability landscapes can be computed. We further describe a theoretical framework of aggregating microstates into a smaller number of macrostates by decomposing a network into independent aggregated birth and death processes and give an a priori method for rapidly determining steady-state truncation errors. The maximal sizes of the finite buffers for a given error tolerance can also be precomputed without costly trial solutions of dCMEs. We show exactly computed probability landscapes of three multiscale networks, namely, a 6-node toggle switch, 11-node phage-lambda epigenetic circuit, and 16-node MAPK cascade network, the latter two with no known solutions. We also show how probabilities of rare events can be computed from first-passage times, another class of unsolved problems challenging for simulation-based techniques due to large separations in time scales. Overall, the ACME method enables accurate and efficient solutions of the dCME for a large class of networks.« less
Nanosatellite optical downlink experiment: design, simulation, and prototyping

NASA Astrophysics Data System (ADS)

Clements, Emily; Aniceto, Raichelle; Barnes, Derek; Caplan, David; Clark, James; Portillo, Iñigo del; Haughwout, Christian; Khatsenko, Maxim; Kingsbury, Ryan; Lee, Myron; Morgan, Rachel; Twichell, Jonathan; Riesing, Kathleen; Yoon, Hyosang; Ziegler, Caleb; Cahoy, Kerri

2016-11-01

The nanosatellite optical downlink experiment (NODE) implements a free-space optical communications (lasercom) capability on a CubeSat platform that can support low earth orbit (LEO) to ground downlink rates>10 Mbps. A primary goal of NODE is to leverage commercially available technologies to provide a scalable and cost-effective alternative to radio-frequency-based communications. The NODE transmitter uses a 200-mW 1550-nm master-oscillator power-amplifier design using power-efficient M-ary pulse position modulation. To facilitate pointing the 0.12-deg downlink beam, NODE augments spacecraft body pointing with a microelectromechanical fast steering mirror (FSM) and uses an 850-nm uplink beacon to an onboard CCD camera. The 30-cm aperture ground telescope uses an infrared camera and FSM for tracking to an avalanche photodiode detector-based receiver. Here, we describe our approach to transition prototype transmitter and receiver designs to a full end-to-end CubeSat-scale system. This includes link budget refinement, drive electronics miniaturization, packaging reduction, improvements to pointing and attitude estimation, implementation of modulation, coding, and interleaving, and ground station receiver design. We capture trades and technology development needs and outline plans for integrated system ground testing.
A Wireless MEMS-Based Inclinometer Sensor Node for Structural Health Monitoring

PubMed Central

Ha, Dae Woong; Park, Hyo Seon; Choi, Se Woon; Kim, Yousok

2013-01-01

This paper proposes a wireless inclinometer sensor node for structural health monitoring (SHM) that can be applied to civil engineering and building structures subjected to various loadings. The inclinometer used in this study employs a method for calculating the tilt based on the difference between the static acceleration and the acceleration due to gravity, using a micro-electro-mechanical system (MEMS)-based accelerometer. A wireless sensor node was developed through which tilt measurement data are wirelessly transmitted to a monitoring server. This node consists of a slave node that uses a short-distance wireless communication system (RF 2.4 GHz) and a master node that uses a long-distance telecommunication system (code division multiple access—CDMA). The communication distance limitation, which is recognized as an important issue in wireless monitoring systems, has been resolved via these two wireless communication components. The reliability of the proposed wireless inclinometer sensor node was verified experimentally by comparing the values measured by the inclinometer and subsequently transferred to the monitoring server via wired and wireless transfer methods to permit a performance evaluation of the wireless communication sensor nodes. The experimental results indicated that the two systems (wired and wireless transfer systems) yielded almost identical values at a tilt angle greater than 1°, and a uniform difference was observed at a tilt angle less than 0.42° (approximately 0.0032° corresponding to 0.76% of the tilt angle, 0.42°) regardless of the tilt size. This result was deemed to be within the allowable range of measurement error in SHM. Thus, the wireless transfer system proposed in this study was experimentally verified for practical application in a structural health monitoring system. PMID:24287533
Peer-to-peer architectures for exascale computing : LDRD final report.

DOE Office of Scientific and Technical Information (OSTI.GOV)

Vorobeychik, Yevgeniy; Mayo, Jackson R.; Minnich, Ronald G.

2010-09-01

The goal of this research was to investigate the potential for employing dynamic, decentralized software architectures to achieve reliability in future high-performance computing platforms. These architectures, inspired by peer-to-peer networks such as botnets that already scale to millions of unreliable nodes, hold promise for enabling scientific applications to run usefully on next-generation exascale platforms ({approx} 10{sup 18} operations per second). Traditional parallel programming techniques suffer rapid deterioration of performance scaling with growing platform size, as the work of coping with increasingly frequent failures dominates over useful computation. Our studies suggest that new architectures, in which failures are treated as ubiquitousmore » and their effects are considered as simply another controllable source of error in a scientific computation, can remove such obstacles to exascale computing for certain applications. We have developed a simulation framework, as well as a preliminary implementation in a large-scale emulation environment, for exploration of these 'fault-oblivious computing' approaches. High-performance computing (HPC) faces a fundamental problem of increasing total component failure rates due to increasing system sizes, which threaten to degrade system reliability to an unusable level by the time the exascale range is reached ({approx} 10{sup 18} operations per second, requiring of order millions of processors). As computer scientists seek a way to scale system software for next-generation exascale machines, it is worth considering peer-to-peer (P2P) architectures that are already capable of supporting 10{sup 6}-10{sup 7} unreliable nodes. Exascale platforms will require a different way of looking at systems and software because the machine will likely not be available in its entirety for a meaningful execution time. Realistic estimates of failure rates range from a few times per day to more than once per hour for these platforms. P2P architectures give us a starting point for crafting applications and system software for exascale. In the context of the Internet, P2P applications (e.g., file sharing, botnets) have already solved this problem for 10{sup 6}-10{sup 7} nodes. Usually based on a fractal distributed hash table structure, these systems have proven robust in practice to constant and unpredictable outages, failures, and even subversion. For example, a recent estimate of botnet turnover (i.e., the number of machines leaving and joining) is about 11% per week. Nonetheless, P2P networks remain effective despite these failures: The Conficker botnet has grown to {approx} 5 x 10{sup 6} peers. Unlike today's system software and applications, those for next-generation exascale machines cannot assume a static structure and, to be scalable over millions of nodes, must be decentralized. P2P architectures achieve both, and provide a promising model for 'fault-oblivious computing'. This project aimed to study the dynamics of P2P networks in the context of a design for exascale systems and applications. Having no single point of failure, the most successful P2P architectures are adaptive and self-organizing. While there has been some previous work applying P2P to message passing, little attention has been previously paid to the tightly coupled exascale domain. Typically, the per-node footprint of P2P systems is small, making them ideal for HPC use. The implementation on each peer node cooperates en masse to 'heal' disruptions rather than relying on a controlling 'master' node. Understanding this cooperative behavior from a complex systems viewpoint is essential to predicting useful environments for the inextricably unreliable exascale platforms of the future. We sought to obtain theoretical insight into the stability and large-scale behavior of candidate architectures, and to work toward leveraging Sandia's Emulytics platform to test promising candidates in a realistic (ultimately {ge} 10{sup 7} nodes) setting. Our primary example applications are drawn from linear algebra: a Jacobi relaxation solver for the heat equation, and the closely related technique of value iteration in optimization. We aimed to apply P2P concepts in designing implementations capable of surviving an unreliable machine of 10{sup 6} nodes.« less
Parallel reduced-instruction-set-computer architecture for real-time symbolic pattern matching

NASA Astrophysics Data System (ADS)

Parson, Dale E.

1991-03-01

This report discusses ongoing work on a parallel reduced-instruction- set-computer (RISC) architecture for automatic production matching. The PRIOPS compiler takes advantage of the memoryless character of automatic processing by translating a program's collection of automatic production tests into an equivalent combinational circuit-a digital circuit without memory, whose outputs are immediate functions of its inputs. The circuit provides a highly parallel, fine-grain model of automatic matching. The compiler then maps the combinational circuit onto RISC hardware. The heart of the processor is an array of comparators capable of testing production conditions in parallel, Each comparator attaches to private memory that contains virtual circuit nodes-records of the current state of nodes and busses in the combinational circuit. All comparator memories hold identical information, allowing simultaneous update for a single changing circuit node and simultaneous retrieval of different circuit nodes by different comparators. Along with the comparator-based logic unit is a sequencer that determines the current combination of production-derived comparisons to try, based on the combined success and failure of previous combinations of comparisons. The memoryless nature of automatic matching allows the compiler to designate invariant memory addresses for virtual circuit nodes, and to generate the most effective sequences of comparison test combinations. The result is maximal utilization of parallel hardware, indicating speed increases and scalability beyond that found for course-grain, multiprocessor approaches to concurrent Rete matching. Future work will consider application of this RISC architecture to the standard (controlled) Rete algorithm, where search through memory dominates portions of matching.
Interdependent Multi-Layer Networks: Modeling and Survivability Analysis with Applications to Space-Based Networks

PubMed Central

Castet, Jean-Francois; Saleh, Joseph H.

2013-01-01

This article develops a novel approach and algorithmic tools for the modeling and survivability analysis of networks with heterogeneous nodes, and examines their application to space-based networks. Space-based networks (SBNs) allow the sharing of spacecraft on-orbit resources, such as data storage, processing, and downlink. Each spacecraft in the network can have different subsystem composition and functionality, thus resulting in node heterogeneity. Most traditional survivability analyses of networks assume node homogeneity and as a result, are not suited for the analysis of SBNs. This work proposes that heterogeneous networks can be modeled as interdependent multi-layer networks, which enables their survivability analysis. The multi-layer aspect captures the breakdown of the network according to common functionalities across the different nodes, and it allows the emergence of homogeneous sub-networks, while the interdependency aspect constrains the network to capture the physical characteristics of each node. Definitions of primitives of failure propagation are devised. Formal characterization of interdependent multi-layer networks, as well as algorithmic tools for the analysis of failure propagation across the network are developed and illustrated with space applications. The SBN applications considered consist of several networked spacecraft that can tap into each other's Command and Data Handling subsystem, in case of failure of its own, including the Telemetry, Tracking and Command, the Control Processor, and the Data Handling sub-subsystems. Various design insights are derived and discussed, and the capability to perform trade-space analysis with the proposed approach for various network characteristics is indicated. The select results here shown quantify the incremental survivability gains (with respect to a particular class of threats) of the SBN over the traditional monolith spacecraft. Failure of the connectivity between nodes is also examined, and the results highlight the importance of the reliability of the wireless links between spacecraft (nodes) to enable any survivability improvements for space-based networks. PMID:23599835
Interdependent multi-layer networks: modeling and survivability analysis with applications to space-based networks.

PubMed

Castet, Jean-Francois; Saleh, Joseph H

2013-01-01

This article develops a novel approach and algorithmic tools for the modeling and survivability analysis of networks with heterogeneous nodes, and examines their application to space-based networks. Space-based networks (SBNs) allow the sharing of spacecraft on-orbit resources, such as data storage, processing, and downlink. Each spacecraft in the network can have different subsystem composition and functionality, thus resulting in node heterogeneity. Most traditional survivability analyses of networks assume node homogeneity and as a result, are not suited for the analysis of SBNs. This work proposes that heterogeneous networks can be modeled as interdependent multi-layer networks, which enables their survivability analysis. The multi-layer aspect captures the breakdown of the network according to common functionalities across the different nodes, and it allows the emergence of homogeneous sub-networks, while the interdependency aspect constrains the network to capture the physical characteristics of each node. Definitions of primitives of failure propagation are devised. Formal characterization of interdependent multi-layer networks, as well as algorithmic tools for the analysis of failure propagation across the network are developed and illustrated with space applications. The SBN applications considered consist of several networked spacecraft that can tap into each other's Command and Data Handling subsystem, in case of failure of its own, including the Telemetry, Tracking and Command, the Control Processor, and the Data Handling sub-subsystems. Various design insights are derived and discussed, and the capability to perform trade-space analysis with the proposed approach for various network characteristics is indicated. The select results here shown quantify the incremental survivability gains (with respect to a particular class of threats) of the SBN over the traditional monolith spacecraft. Failure of the connectivity between nodes is also examined, and the results highlight the importance of the reliability of the wireless links between spacecraft (nodes) to enable any survivability improvements for space-based networks.
Vapor Compression Distillation Flight Experiment

NASA Technical Reports Server (NTRS)

Hutchens, Cindy F.

2002-01-01

One of the major requirements associated with operating the International Space Station is the transportation -- space shuttle and Russian Progress spacecraft launches - necessary to re-supply station crews with food and water. The Vapor Compression Distillation (VCD) Flight Experiment, managed by NASA's Marshall Space Flight Center in Huntsville, Ala., is a full-scale demonstration of technology being developed to recycle crewmember urine and wastewater aboard the International Space Station and thereby reduce the amount of water that must be re-supplied. Based on results of the VCD Flight Experiment, an operational urine processor will be installed in Node 3 of the space station in 2005.
FPGA-Based, Self-Checking, Fault-Tolerant Computers

NASA Technical Reports Server (NTRS)

Some, Raphael; Rennels, David

2004-01-01

A proposed computer architecture would exploit the capabilities of commercially available field-programmable gate arrays (FPGAs) to enable computers to detect and recover from bit errors. The main purpose of the proposed architecture is to enable fault-tolerant computing in the presence of single-event upsets (SEUs). [An SEU is a spurious bit flip (also called a soft error) caused by a single impact of ionizing radiation.] The architecture would also enable recovery from some soft errors caused by electrical transients and, to some extent, from intermittent and permanent (hard) errors caused by aging of electronic components. A typical FPGA of the current generation contains one or more complete processor cores, memories, and highspeed serial input/output (I/O) channels, making it possible to shrink a board-level processor node to a single integrated-circuit chip. Custom, highly efficient microcontrollers, general-purpose computers, custom I/O processors, and signal processors can be rapidly and efficiently implemented by use of FPGAs. Unfortunately, FPGAs are susceptible to SEUs. Prior efforts to mitigate the effects of SEUs have yielded solutions that degrade performance of the system and require support from external hardware and software. In comparison with other fault-tolerant- computing architectures (e.g., triple modular redundancy), the proposed architecture could be implemented with less circuitry and lower power demand. Moreover, the fault-tolerant computing functions would require only minimal support from circuitry outside the central processing units (CPUs) of computers, would not require any software support, and would be largely transparent to software and to other computer hardware. There would be two types of modules: a self-checking processor module and a memory system (see figure). The self-checking processor module would be implemented on a single FPGA and would be capable of detecting its own internal errors. It would contain two CPUs executing identical programs in lock step, with comparison of their outputs to detect errors. It would also contain various cache local memory circuits, communication circuits, and configurable special-purpose processors that would use self-checking checkers. (The basic principle of the self-checking checker method is to utilize logic circuitry that generates error signals whenever there is an error in either the checker or the circuit being checked.) The memory system would comprise a main memory and a hardware-controlled check-pointing system (CPS) based on a buffer memory denoted the recovery cache. The main memory would contain random-access memory (RAM) chips and FPGAs that would, in addition to everything else, implement double-error-detecting and single-error-correcting memory functions to enable recovery from single-bit errors.

Solving Coupled Gross--Pitaevskii Equations on a Cluster of PlayStation 3 Computers

NASA Astrophysics Data System (ADS)

Edwards, Mark; Heward, Jeffrey; Clark, C. W.

2009-05-01

At Georgia Southern University we have constructed an 8+1--node cluster of Sony PlayStation 3 (PS3) computers with the intention of using this computing resource to solve problems related to the behavior of ultra--cold atoms in general with a particular emphasis on studying bose--bose and bose--fermi mixtures confined in optical lattices. As a first project that uses this computing resource, we have implemented a parallel solver of the coupled time--dependent, one--dimensional Gross--Pitaevskii (TDGP) equations. These equations govern the behavior of dual-- species bosonic mixtures. We chose the split--operator/FFT to solve the coupled 1D TDGP equations. The fast Fourier transform component of this solver can be readily parallelized on the PS3 cpu known as the Cell Broadband Engine (CellBE). Each CellBE chip contains a single 64--bit PowerPC Processor Element known as the PPE and eight ``Synergistic Processor Element'' identified as the SPE's. We report on this algorithm and compare its performance to a non--parallel solver as applied to modeling evaporative cooling in dual--species bosonic mixtures.
FPGA-accelerated algorithm for the regular expression matching system

NASA Astrophysics Data System (ADS)

Russek, P.; Wiatr, K.

2015-01-01

This article describes an algorithm to support a regular expressions matching system. The goal was to achieve an attractive performance system with low energy consumption. The basic idea of the algorithm comes from a concept of the Bloom filter. It starts from the extraction of static sub-strings for strings of regular expressions. The algorithm is devised to gain from its decomposition into parts which are intended to be executed by custom hardware and the central processing unit (CPU). The pipelined custom processor architecture is proposed and a software algorithm explained accordingly. The software part of the algorithm was coded in C and runs on a processor from the ARM family. The hardware architecture was described in VHDL and implemented in field programmable gate array (FPGA). The performance results and required resources of the above experiments are given. An example of target application for the presented solution is computer and network security systems. The idea was tested on nearly 100,000 body-based viruses from the ClamAV virus database. The solution is intended for the emerging technology of clusters of low-energy computing nodes.
Performance Analysis of a Hybrid Overset Multi-Block Application on Multiple Architectures

NASA Technical Reports Server (NTRS)

Djomehri, M. Jahed; Biswas, Rupak

2003-01-01

This paper presents a detailed performance analysis of a multi-block overset grid compu- tational fluid dynamics app!ication on multiple state-of-the-art computer architectures. The application is implemented using a hybrid MPI+OpenMP programming paradigm that exploits both coarse and fine-grain parallelism; the former via MPI message passing and the latter via OpenMP directives. The hybrid model also extends the applicability of multi-block programs to large clusters of SNIP nodes by overcoming the restriction that the number of processors be less than the number of grid blocks. A key kernel of the application, namely the LU-SGS linear solver, had to be modified to enhance the performance of the hybrid approach on the target machines. Investigations were conducted on cacheless Cray SX6 vector processors, cache-based IBM Power3 and Power4 architectures, and single system image SGI Origin3000 platforms. Overall results for complex vortex dynamics simulations demonstrate that the SX6 achieves the highest performance and outperforms the RISC-based architectures; however, the best scaling performance was achieved on the Power3.
Monitoring Data-Structure Evolution in Distributed Message-Passing Programs

NASA Technical Reports Server (NTRS)

Sarukkai, Sekhar R.; Beers, Andrew; Woodrow, Thomas S. (Technical Monitor)

1996-01-01

Monitoring the evolution of data structures in parallel and distributed programs, is critical for debugging its semantics and performance. However, the current state-of-art in tracking and presenting data-structure information on parallel and distributed environments is cumbersome and does not scale. In this paper we present a methodology that automatically tracks memory bindings (not the actual contents) of static and dynamic data-structures of message-passing C programs, using PVM. With the help of a number of examples we show that in addition to determining the impact of memory allocation overheads on program performance, graphical views can help in debugging the semantics of program execution. Scalable animations of virtual address bindings of source-level data-structures are used for debugging the semantics of parallel programs across all processors. In conjunction with light-weight core-files, this technique can be used to complement traditional debuggers on single processors. Detailed information (such as data-structure contents), on specific nodes, can be determined using traditional debuggers after the data structure evolution leading to the semantic error is observed graphically.
Global synchronization algorithms for the Intel iPSC/860

NASA Technical Reports Server (NTRS)

Seidel, Steven R.; Davis, Mark A.

1992-01-01

In a distributed memory multicomputer that has no global clock, global processor synchronization can only be achieved through software. Global synchronization algorithms are used in tridiagonal systems solvers, CFD codes, sequence comparison algorithms, and sorting algorithms. They are also useful for event simulation, debugging, and for solving mutual exclusion problems. For the Intel iPSC/860 in particular, global synchronization can be used to ensure the most effective use of the communication network for operations such as the shift, where each processor in a one-dimensional array or ring concurrently sends a message to its right (or left) neighbor. Three global synchronization algorithms are considered for the iPSC/860: the gysnc() primitive provided by Intel, the PICL primitive sync0(), and a new recursive doubling synchronization (RDS) algorithm. The performance of these algorithms is compared to the performance predicted by communication models of both the long and forced message protocols. Measurements of the cost of shift operations preceded by global synchronization show that the RDS algorithm always synchronizes the nodes more precisely and costs only slightly more than the other two algorithms.
GROMACS 4: Algorithms for Highly Efficient, Load-Balanced, and Scalable Molecular Simulation.

PubMed

Hess, Berk; Kutzner, Carsten; van der Spoel, David; Lindahl, Erik

2008-03-01

Molecular simulation is an extremely useful, but computationally very expensive tool for studies of chemical and biomolecular systems. Here, we present a new implementation of our molecular simulation toolkit GROMACS which now both achieves extremely high performance on single processors from algorithmic optimizations and hand-coded routines and simultaneously scales very well on parallel machines. The code encompasses a minimal-communication domain decomposition algorithm, full dynamic load balancing, a state-of-the-art parallel constraint solver, and efficient virtual site algorithms that allow removal of hydrogen atom degrees of freedom to enable integration time steps up to 5 fs for atomistic simulations also in parallel. To improve the scaling properties of the common particle mesh Ewald electrostatics algorithms, we have in addition used a Multiple-Program, Multiple-Data approach, with separate node domains responsible for direct and reciprocal space interactions. Not only does this combination of algorithms enable extremely long simulations of large systems but also it provides that simulation performance on quite modest numbers of standard cluster nodes.
T-L Plane Abstraction-Based Energy-Efficient Real-Time Scheduling for Multi-Core Wireless Sensors

PubMed Central

Kim, Youngmin; Lee, Ki-Seong; Pham, Ngoc-Son; Lee, Sun-Ro; Lee, Chan-Gun

2016-01-01

Energy efficiency is considered as a critical requirement for wireless sensor networks. As more wireless sensor nodes are equipped with multi-cores, there are emerging needs for energy-efficient real-time scheduling algorithms. The T-L plane-based scheme is known to be an optimal global scheduling technique for periodic real-time tasks on multi-cores. Unfortunately, there has been a scarcity of studies on extending T-L plane-based scheduling algorithms to exploit energy-saving techniques. In this paper, we propose a new T-L plane-based algorithm enabling energy-efficient real-time scheduling on multi-core sensor nodes with dynamic power management (DPM). Our approach addresses the overhead of processor mode transitions and reduces fragmentations of the idle time, which are inherent in T-L plane-based algorithms. Our experimental results show the effectiveness of the proposed algorithm compared to other energy-aware scheduling methods on T-L plane abstraction. PMID:27399722
IGA-ADS: Isogeometric analysis FEM using ADS solver

NASA Astrophysics Data System (ADS)

Łoś, Marcin M.; Woźniak, Maciej; Paszyński, Maciej; Lenharth, Andrew; Hassaan, Muhamm Amber; Pingali, Keshav

2017-08-01

In this paper we present a fast explicit solver for solution of non-stationary problems using L2 projections with isogeometric finite element method. The solver has been implemented within GALOIS framework. It enables parallel multi-core simulations of different time-dependent problems, in 1D, 2D, or 3D. We have prepared the solver framework in a way that enables direct implementation of the selected PDE and corresponding boundary conditions. In this paper we describe the installation, implementation of exemplary three PDEs, and execution of the simulations on multi-core Linux cluster nodes. We consider three case studies, including heat transfer, linear elasticity, as well as non-linear flow in heterogeneous media. The presented package generates output suitable for interfacing with Gnuplot and ParaView visualization software. The exemplary simulations show near perfect scalability on Gilbert shared-memory node with four Intel® Xeon® CPU E7-4860 processors, each possessing 10 physical cores (for a total of 40 cores).
Distributed Two-Dimensional Fourier Transforms on DSPs with an Application for Phase Retrieval

NASA Technical Reports Server (NTRS)

Smith, Jeffrey Scott

2006-01-01

Many applications of two-dimensional Fourier Transforms require fixed timing as defined by system specifications. One example is image-based wavefront sensing. The image-based approach has many benefits, yet it is a computational intensive solution for adaptive optic correction, where optical adjustments are made in real-time to correct for external (atmospheric turbulence) and internal (stability) aberrations, which cause image degradation. For phase retrieval, a type of image-based wavefront sensing, numerous two-dimensional Fast Fourier Transforms (FFTs) are used. To meet the required real-time specifications, a distributed system is needed, and thus, the 2-D FFT necessitates an all-to-all communication among the computational nodes. The 1-D floating point FFT is very efficient on a digital signal processor (DSP). For this study, several architectures and analysis of such are presented which address the all-to-all communication with DSPs. Emphasis of this research is on a 64-node cluster of Analog Devices TigerSharc TS-101 DSPs.
Data decomposition of Monte Carlo particle transport simulations via tally servers

DOE Office of Scientific and Technical Information (OSTI.GOV)

Romano, Paul K.; Siegel, Andrew R.; Forget, Benoit

An algorithm for decomposing large tally data in Monte Carlo particle transport simulations is developed, analyzed, and implemented in a continuous-energy Monte Carlo code, OpenMC. The algorithm is based on a non-overlapping decomposition of compute nodes into tracking processors and tally servers. The former are used to simulate the movement of particles through the domain while the latter continuously receive and update tally data. A performance model for this approach is developed, suggesting that, for a range of parameters relevant to LWR analysis, the tally server algorithm should perform with minimal overhead on contemporary supercomputers. An implementation of the algorithmmore » in OpenMC is then tested on the Intrepid and Titan supercomputers, supporting the key predictions of the model over a wide range of parameters. We thus conclude that the tally server algorithm is a successful approach to circumventing classical on-node memory constraints en route to unprecedentedly detailed Monte Carlo reactor simulations.« less
Seeing the forest for the trees: Networked workstations as a parallel processing computer

NASA Technical Reports Server (NTRS)

Breen, J. O.; Meleedy, D. M.

1992-01-01

Unlike traditional 'serial' processing computers in which one central processing unit performs one instruction at a time, parallel processing computers contain several processing units, thereby, performing several instructions at once. Many of today's fastest supercomputers achieve their speed by employing thousands of processing elements working in parallel. Few institutions can afford these state-of-the-art parallel processors, but many already have the makings of a modest parallel processing system. Workstations on existing high-speed networks can be harnessed as nodes in a parallel processing environment, bringing the benefits of parallel processing to many. While such a system can not rival the industry's latest machines, many common tasks can be accelerated greatly by spreading the processing burden and exploiting idle network resources. We study several aspects of this approach, from algorithms to select nodes to speed gains in specific tasks. With ever-increasing volumes of astronomical data, it becomes all the more necessary to utilize our computing resources fully.
Global-view coefficients: a data management solution for parallel quantum Monte Carlo applications: A DATA MANAGEMENT SOLUTION FOR QMC APPLICATIONS

DOE Office of Scientific and Technical Information (OSTI.GOV)

Niu, Qingpeng; Dinan, James; Tirukkovalur, Sravya

2016-01-28

Quantum Monte Carlo (QMC) applications perform simulation with respect to an initial state of the quantum mechanical system, which is often captured by using a cubic B-spline basis. This representation is stored as a read-only table of coefficients and accesses to the table are generated at random as part of the Monte Carlo simulation. Current QMC applications, such as QWalk and QMCPACK, replicate this table at every process or node, which limits scalability because increasing the number of processors does not enable larger systems to be run. We present a partitioned global address space approach to transparently managing this datamore » using Global Arrays in a manner that allows the memory of multiple nodes to be aggregated. We develop an automated data management system that significantly reduces communication overheads, enabling new capabilities for QMC codes. Experimental results with QWalk and QMCPACK demonstrate the effectiveness of the data management system.« less
Augmenting computer networks

NASA Technical Reports Server (NTRS)

Bokhari, S. H.; Raza, A. D.

1984-01-01

Three methods of augmenting computer networks by adding at most one link per processor are discussed: (1) A tree of N nodes may be augmented such that the resulting graph has diameter no greater than 4log sub 2((N+2)/3)-2. Thi O(N(3)) algorithm can be applied to any spanning tree of a connected graph to reduce the diameter of that graph to O(log N); (2) Given a binary tree T and a chain C of N nodes each, C may be augmented to produce C so that T is a subgraph of C. This algorithm is O(N) and may be used to produce augmented chains or rings that have diameter no greater than 2log sub 2((N+2)/3) and are planar; (3) Any rectangular two-dimensional 4 (8) nearest neighbor array of size N = 2(k) may be augmented so that it can emulate a single step shuffle-exchange network of size N/2 in 3(t) time steps.
A game theory approach to target tracking in sensor networks.

PubMed

Gu, Dongbing

2011-02-01

In this paper, we investigate a moving-target tracking problem with sensor networks. Each sensor node has a sensor to observe the target and a processor to estimate the target position. It also has wireless communication capability but with limited range and can only communicate with neighbors. The moving target is assumed to be an intelligent agent, which is "smart" enough to escape from the detection by maximizing the estimation error. This adversary behavior makes the target tracking problem more difficult. We formulate this target estimation problem as a zero-sum game in this paper and use a minimax filter to estimate the target position. The minimax filter is a robust filter that minimizes the estimation error by considering the worst case noise. Furthermore, we develop a distributed version of the minimax filter for multiple sensor nodes. The distributed computation is implemented via modeling the information received from neighbors as measurements in the minimax filter. The simulation results show that the target tracking algorithm proposed in this paper provides a satisfactory result.
Force-reflective teleoperated system with shared and compliant control capabilities

NASA Technical Reports Server (NTRS)

Szakaly, Z.; Kim, W. S.; Bejczy, A. K.

1989-01-01

The force-reflecting teleoperator breadboard is described. It is the first system among available Research and Development systems with the following combined capabilities: (1) The master input device is not a replica of the slave arm. It is a general purpose device which can be applied to the control of different robot arms through proper mathematical transformations. (2) Force reflection generated in the master hand controller is referenced to forces and moments measured by a six DOF force-moment sensor at the base of the robot hand. (3) The system permits a smooth spectrum of operations between full manual, shared manual and automatic, and full automatic (called traded) control. (4) The system can be operated with variable compliance or stiffness in force-reflecting control. Some of the key points of the system are the data handling and computing architecture, the communication method, and the handling of mathematical transformations. The architecture is a fully synchronized pipeline. The communication method achieves optimal use of a parallel communication channel between the local and remote computing nodes. A time delay box is also implemented in this communication channel permitting experiments with up to 8 sec time delay. The mathematical transformations are computed faster than 1 msec so that control at each node can be operated at 1 kHz servo rate without interpolation. This results in an overall force-reflecting loop rate of 200 Hz.
Energy-Efficient Implementation of ECDH Key Exchange for Wireless Sensor Networks

NASA Astrophysics Data System (ADS)

Lederer, Christian; Mader, Roland; Koschuch, Manuel; Großschädl, Johann; Szekely, Alexander; Tillich, Stefan

Wireless Sensor Networks (WSNs) are playing a vital role in an ever-growing number of applications ranging from environmental surveillance over medical monitoring to home automation. Since WSNs are often deployed in unattended or even hostile environments, they can be subject to various malicious attacks, including the manipulation and capture of nodes. The establishment of a shared secret key between two or more individual nodes is one of the most important security services needed to guarantee the proper functioning of a sensor network. Despite some recent advances in this field, the efficient implementation of cryptographic key establishment for WSNs remains a challenge due to the resource constraints of small sensor nodes such as the MICAz mote. In this paper we present a lightweight implementation of the elliptic curve Diffie-Hellman (ECDH) key exchange for ZigBee-compliant sensor nodes equipped with an ATmega128 processor running the TinyOS operating system. Our implementation uses a 192-bit prime field specified by the NIST as underlying algebraic structure and requires only 5.20 ·106 clock cycles to compute a scalar multiplication if the base point is fixed and known a priori. A scalar multiplication using a random base point takes about 12.33 ·106 cycles. Our results show that a full ECDH key exchange between two MICAz motes consumes an energy of 57.33 mJ (including radio communication), which is significantly better than most previously reported ECDH implementations on comparable platforms.
Deterministic quantum state transfer and remote entanglement using microwave photons.

PubMed

Kurpiers, P; Magnard, P; Walter, T; Royer, B; Pechal, M; Heinsoo, J; Salathé, Y; Akin, A; Storz, S; Besse, J-C; Gasparinetti, S; Blais, A; Wallraff, A

2018-06-01

Sharing information coherently between nodes of a quantum network is fundamental to distributed quantum information processing. In this scheme, the computation is divided into subroutines and performed on several smaller quantum registers that are connected by classical and quantum channels 1 . A direct quantum channel, which connects nodes deterministically rather than probabilistically, achieves larger entanglement rates between nodes and is advantageous for distributed fault-tolerant quantum computation 2 . Here we implement deterministic state-transfer and entanglement protocols between two superconducting qubits fabricated on separate chips. Superconducting circuits 3 constitute a universal quantum node 4 that is capable of sending, receiving, storing and processing quantum information 5-8 . Our implementation is based on an all-microwave cavity-assisted Raman process 9 , which entangles or transfers the qubit state of a transmon-type artificial atom 10 with a time-symmetric itinerant single photon. We transfer qubit states by absorbing these itinerant photons at the receiving node, with a probability of 98.1 ± 0.1 per cent, achieving a transfer-process fidelity of 80.02 ± 0.07 per cent for a protocol duration of only 180 nanoseconds. We also prepare remote entanglement on demand with a fidelity as high as 78.9 ± 0.1 per cent at a rate of 50 kilohertz. Our results are in excellent agreement with numerical simulations based on a master-equation description of the system. This deterministic protocol has the potential to be used for quantum computing distributed across different nodes of a cryogenic network.
Multi-processor including data flow accelerator module

DOEpatents

Davidson, George S.; Pierce, Paul E.

1990-01-01

An accelerator module for a data flow computer includes an intelligent memory. The module is added to a multiprocessor arrangement and uses a shared tagged memory architecture in the data flow computer. The intelligent memory module assigns locations for holding data values in correspondence with arcs leading to a node in a data dependency graph. Each primitive computation is associated with a corresponding memory cell, including a number of slots for operands needed to execute a primitive computation, a primitive identifying pointer, and linking slots for distributing the result of the cell computation to other cells requiring that result as an operand. Circuitry is provided for utilizing tag bits to determine automatically when all operands required by a processor are available and for scheduling the primitive for execution in a queue. Each memory cell of the module may be associated with any of the primitives, and the particular primitive to be executed by the processor associated with the cell is identified by providing an index, such as the cell number for the primitive, to the primitive lookup table of starting addresses. The module thus serves to perform functions previously performed by a number of sections of data flow architectures and coexists with conventional shared memory therein. A multiprocessing system including the module operates in a hybrid mode, wherein the same processing modules are used to perform some processing in a sequential mode, under immediate control of an operating system, while performing other processing in a data flow mode.
Present Status and Extensions of the Monte Carlo Performance Benchmark

NASA Astrophysics Data System (ADS)

Hoogenboom, J. Eduard; Petrovic, Bojan; Martin, William R.

2014-06-01

The NEA Monte Carlo Performance benchmark started in 2011 aiming to monitor over the years the abilities to perform a full-size Monte Carlo reactor core calculation with a detailed power production for each fuel pin with axial distribution. This paper gives an overview of the contributed results thus far. It shows that reaching a statistical accuracy of 1 % for most of the small fuel zones requires about 100 billion neutron histories. The efficiency of parallel execution of Monte Carlo codes on a large number of processor cores shows clear limitations for computer clusters with common type computer nodes. However, using true supercomputers the speedup of parallel calculations is increasing up to large numbers of processor cores. More experience is needed from calculations on true supercomputers using large numbers of processors in order to predict if the requested calculations can be done in a short time. As the specifications of the reactor geometry for this benchmark test are well suited for further investigations of full-core Monte Carlo calculations and a need is felt for testing other issues than its computational performance, proposals are presented for extending the benchmark to a suite of benchmark problems for evaluating fission source convergence for a system with a high dominance ratio, for coupling with thermal-hydraulics calculations to evaluate the use of different temperatures and coolant densities and to study the correctness and effectiveness of burnup calculations. Moreover, other contemporary proposals for a full-core calculation with realistic geometry and material composition will be discussed.
Call Admission Control on Single Node Networks under Output Rate-Controlled Generalized Processor Sharing (ORC-GPS) Scheduler

NASA Astrophysics Data System (ADS)

Hanada, Masaki; Nakazato, Hidenori; Watanabe, Hitoshi

Multimedia applications such as music or video streaming, video teleconferencing and IP telephony are flourishing in packet-switched networks. Applications that generate such real-time data can have very diverse quality-of-service (QoS) requirements. In order to guarantee diverse QoS requirements, the combined use of a packet scheduling algorithm based on Generalized Processor Sharing (GPS) and leaky bucket traffic regulator is the most successful QoS mechanism. GPS can provide a minimum guaranteed service rate for each session and tight delay bounds for leaky bucket constrained sessions. However, the delay bounds for leaky bucket constrained sessions under GPS are unnecessarily large because each session is served according to its associated constant weight until the session buffer is empty. In order to solve this problem, a scheduling policy called Output Rate-Controlled Generalized Processor Sharing (ORC-GPS) was proposed in [17]. ORC-GPS is a rate-based scheduling like GPS, and controls the service rate in order to lower the delay bounds for leaky bucket constrained sessions. In this paper, we propose a call admission control (CAC) algorithm for ORC-GPS, for leaky-bucket constrained sessions with deterministic delay requirements. This CAC algorithm for ORC-GPS determines the optimal values of parameters of ORC-GPS from the deterministic delay requirements of the sessions. In numerical experiments, we compare the CAC algorithm for ORC-GPS with one for GPS in terms of schedulable region and computational complexity.

Parallel definition of tear film maps on distributed-memory clusters for the support of dry eye diagnosis.

PubMed

González-Domínguez, Jorge; Remeseiro, Beatriz; Martín, María J

2017-02-01

The analysis of the interference patterns on the tear film lipid layer is a useful clinical test to diagnose dry eye syndrome. This task can be automated with a high degree of accuracy by means of the use of tear film maps. However, the time required by the existing applications to generate them prevents a wider acceptance of this method by medical experts. Multithreading has been previously successfully employed by the authors to accelerate the tear film map definition on multicore single-node machines. In this work, we propose a hybrid message-passing and multithreading parallel approach that further accelerates the generation of tear film maps by exploiting the computational capabilities of distributed-memory systems such as multicore clusters and supercomputers. The algorithm for drawing tear film maps is parallelized using Message Passing Interface (MPI) for inter-node communications and the multithreading support available in the C++11 standard for intra-node parallelization. The original algorithm is modified to reduce the communications and increase the scalability. The hybrid method has been tested on 32 nodes of an Intel cluster (with two 12-core Haswell 2680v3 processors per node) using 50 representative images. Results show that maximum runtime is reduced from almost two minutes using the previous only-multithreaded approach to less than ten seconds using the hybrid method. The hybrid MPI/multithreaded implementation can be used by medical experts to obtain tear film maps in only a few seconds, which will significantly accelerate and facilitate the diagnosis of the dry eye syndrome. Copyright © 2016 Elsevier Ireland Ltd. All rights reserved.
Status report of the end-to-end ASKAP software system: towards early science operations

NASA Astrophysics Data System (ADS)

Guzman, Juan Carlos; Chapman, Jessica; Marquarding, Malte; Whiting, Matthew

2016-08-01

The Australian SKA Pathfinder (ASKAP) is a novel centimetre radio synthesis telescope currently in the commissioning phase and located in the midwest region of Western Australia. It comprises of 36 x 12 m diameter reflector antennas each equipped with state-of-the-art and award winning Phased Array Feeds (PAF) technology. The PAFs provide a wide, 30 square degree field-of-view by forming up to 36 separate dual-polarisation beams at once. This results in a high data rate: 70 TB of correlated visibilities in an 8-hour observation, requiring custom-written, high-performance software running in dedicated High Performance Computing (HPC) facilities. The first six antennas equipped with first-generation PAF technology (Mark I), named the Boolardy Engineering Test Array (BETA) have been in use since 2014 as a platform to test PAF calibration and imaging techniques, and along the way it has been producing some great science results. Commissioning of the ASKAP Array Release 1, that is the first six antennas with second-generation PAFs (Mark II) is currently under way. An integral part of the instrument is the Central Processor platform hosted at the Pawsey Supercomputing Centre in Perth, which executes custom-written software pipelines, designed specifically to meet the ASKAP imaging requirements of wide field of view and high dynamic range. There are three key hardware components of the Central Processor: The ingest nodes (16 x node cluster), the fast temporary storage (1 PB Lustre file system) and the processing supercomputer (200 TFlop system). This High-Performance Computing (HPC) platform is managed and supported by the Pawsey support team. Due to the limited amount of data generated by BETA and the first ASKAP Array Release, the Central Processor platform has been running in a more "traditional" or user-interactive mode. But this is about to change: integration and verification of the online ingest pipeline starts in early 2016, which is required to support the full 300 MHz bandwidth for Array Release 1; followed by the deployment of the real-time data processing components. In addition to the Central Processor, the first production release of the CSIRO ASKAP Science Data Archive (CASDA) has also been deployed in one of the Pawsey Supercomputing Centre facilities and it is integrated to the end-to-end ASKAP data flow system. This paper describes the current status of the "end-to-end" data flow software system from preparing observations to data acquisition, processing and archiving; and the challenges of integrating an HPC facility as a key part of the instrument. It also shares some lessons learned since the start of integration activities and the challenges ahead in preparation for the start of the Early Science program.
Portable parallel stochastic optimization for the design of aeropropulsion components

NASA Technical Reports Server (NTRS)

Sues, Robert H.; Rhodes, G. S.

1994-01-01

This report presents the results of Phase 1 research to develop a methodology for performing large-scale Multi-disciplinary Stochastic Optimization (MSO) for the design of aerospace systems ranging from aeropropulsion components to complete aircraft configurations. The current research recognizes that such design optimization problems are computationally expensive, and require the use of either massively parallel or multiple-processor computers. The methodology also recognizes that many operational and performance parameters are uncertain, and that uncertainty must be considered explicitly to achieve optimum performance and cost. The objective of this Phase 1 research was to initialize the development of an MSO methodology that is portable to a wide variety of hardware platforms, while achieving efficient, large-scale parallelism when multiple processors are available. The first effort in the project was a literature review of available computer hardware, as well as review of portable, parallel programming environments. The first effort was to implement the MSO methodology for a problem using the portable parallel programming language, Parallel Virtual Machine (PVM). The third and final effort was to demonstrate the example on a variety of computers, including a distributed-memory multiprocessor, a distributed-memory network of workstations, and a single-processor workstation. Results indicate the MSO methodology can be well-applied towards large-scale aerospace design problems. Nearly perfect linear speedup was demonstrated for computation of optimization sensitivity coefficients on both a 128-node distributed-memory multiprocessor (the Intel iPSC/860) and a network of workstations (speedups of almost 19 times achieved for 20 workstations). Very high parallel efficiencies (75 percent for 31 processors and 60 percent for 50 processors) were also achieved for computation of aerodynamic influence coefficients on the Intel. Finally, the multi-level parallelization strategy that will be needed for large-scale MSO problems was demonstrated to be highly efficient. The same parallel code instructions were used on both platforms, demonstrating portability. There are many applications for which MSO can be applied, including NASA's High-Speed-Civil Transport, and advanced propulsion systems. The use of MSO will reduce design and development time and testing costs dramatically.
An Integrative Structural Health Monitoring System for the Local/Global Responses of a Large-Scale Irregular Building under Construction

PubMed Central

Park, Hyo Seon; Shin, Yunah; Choi, Se Woon; Kim, Yousok

2013-01-01

In this study, a practical and integrative SHM system was developed and applied to a large-scale irregular building under construction, where many challenging issues exist. In the proposed sensor network, customized energy-efficient wireless sensing units (sensor nodes, repeater nodes, and master nodes) were employed and comprehensive communications from the sensor node to the remote monitoring server were conducted through wireless communications. The long-term (13-month) monitoring results recorded from a large number of sensors (75 vibrating wire strain gauges, 10 inclinometers, and three laser displacement sensors) indicated that the construction event exhibiting the largest influence on structural behavior was the removal of bents that were temporarily installed to support the free end of the cantilevered members during their construction. The safety of each member could be confirmed based on the quantitative evaluation of each response. Furthermore, it was also confirmed that the relation between these responses (i.e., deflection, strain, and inclination) can provide information about the global behavior of structures induced from specific events. Analysis of the measurement results demonstrates the proposed sensor network system is capable of automatic and real-time monitoring and can be applied and utilized for both the safety evaluation and precise implementation of buildings under construction. PMID:23860317
Quantum information density scaling and qubit operation time constraints of CMOS silicon-based quantum computer architectures

NASA Astrophysics Data System (ADS)

Rotta, Davide; Sebastiano, Fabio; Charbon, Edoardo; Prati, Enrico

2017-06-01

Even the quantum simulation of an apparently simple molecule such as Fe2S2 requires a considerable number of qubits of the order of 106, while more complex molecules such as alanine (C3H7NO2) require about a hundred times more. In order to assess such a multimillion scale of identical qubits and control lines, the silicon platform seems to be one of the most indicated routes as it naturally provides, together with qubit functionalities, the capability of nanometric, serial, and industrial-quality fabrication. The scaling trend of microelectronic devices predicting that computing power would double every 2 years, known as Moore's law, according to the new slope set after the 32-nm node of 2009, suggests that the technology roadmap will achieve the 3-nm manufacturability limit proposed by Kelly around 2020. Today, circuital quantum information processing architectures are predicted to take advantage from the scalability ensured by silicon technology. However, the maximum amount of quantum information per unit surface that can be stored in silicon-based qubits and the consequent space constraints on qubit operations have never been addressed so far. This represents one of the key parameters toward the implementation of quantum error correction for fault-tolerant quantum information processing and its dependence on the features of the technology node. The maximum quantum information per unit surface virtually storable and controllable in the compact exchange-only silicon double quantum dot qubit architecture is expressed as a function of the complementary metal-oxide-semiconductor technology node, so the size scale optimizing both physical qubit operation time and quantum error correction requirements is assessed by reviewing the physical and technological constraints. According to the requirements imposed by the quantum error correction method and the constraints given by the typical strength of the exchange coupling, we determine the workable operation frequency range of a silicon complementary metal-oxide-semiconductor quantum processor to be within 1 and 100 GHz. Such constraint limits the feasibility of fault-tolerant quantum information processing with complementary metal-oxide-semiconductor technology only to the most advanced nodes. The compatibility with classical complementary metal-oxide-semiconductor control circuitry is discussed, focusing on the cryogenic complementary metal-oxide-semiconductor operation required to bring the classical controller as close as possible to the quantum processor and to enable interfacing thousands of qubits on the same chip via time-division, frequency-division, and space-division multiplexing. The operation time range prospected for cryogenic control electronics is found to be compatible with the operation time expected for qubits. By combining the forecast of the development of scaled technology nodes with operation time and classical circuitry constraints, we derive a maximum quantum information density for logical qubits of 2.8 and 4 Mqb/cm2 for the 10 and 7-nm technology nodes, respectively, for the Steane code. The density is one and two orders of magnitude less for surface codes and for concatenated codes, respectively. Such values provide a benchmark for the development of fault-tolerant quantum algorithms by circuital quantum information based on silicon platforms and a guideline for other technologies in general.
Announcement/Subscription/Publication: Message Based Communication for Heterogeneous Mobile Environments

NASA Astrophysics Data System (ADS)

Ristau, Henry

Many tasks in smart environments can be implemented using message based communication paradigms that decouple applications in time, space, synchronization and semantics. Current solutions for decoupled message based communication either do not support message processing and thus semantic decoupling or rely on clearly defined network structures. In this paper we present ASP, a novel concept for such communication that can directly operate on neighbor relations between brokers and does not rely on a homogeneous addressing scheme or anymore than simple link layer communication. We show by simulation that ASP performs well in a heterogeneous scenario with mobile nodes and decreases network or processor load significantly compared to message flooding.
Studies of an Optical Multi-Processor Interconnect

DTIC Science & Technology

1994-01-01

that the destinations are uniformly distributed, I is given by (E2k-i 2 k(l -(1 -Pd)i)]ýE) N g -1 i+ ( -- Pd)- +2k- 1 2k - 2’-k’ + k(1 -- (1 -- pd)k...15oo 2000- Number of Nodes Figure 5.16: Variation of Maximum User Throughput with Size I I 94I I I I6 Io•MSe P (snowed) g (dsehd) o𔃺.0 OJU/ I£ 0. 00...curves. Table (6.9) shows that the size and topology of the network does not have any significant effect on the number g of threads needed to keep the
Efficient Parallel Formulations of Hierarchical Methods and Their Applications

NASA Astrophysics Data System (ADS)

Grama, Ananth Y.

1996-01-01

Hierarchical methods such as the Fast Multipole Method (FMM) and Barnes-Hut (BH) are used for rapid evaluation of potential (gravitational, electrostatic) fields in particle systems. They are also used for solving integral equations using boundary element methods. The linear systems arising from these methods are dense and are solved iteratively. Hierarchical methods reduce the complexity of the core matrix-vector product from O(n^2) to O(n log n) and the memory requirement from O(n^2) to O(n). We have developed highly scalable parallel formulations of a hybrid FMM/BH method that are capable of handling arbitrarily irregular distributions. We apply these formulations to astrophysical simulations of Plummer and Gaussian galaxies. We have used our parallel formulations to solve the integral form of the Laplace equation. We show that our parallel hierarchical mat-vecs yield high efficiency and overall performance even on relatively small problems. A problem containing approximately 200K nodes takes under a second to compute on 256 processors and yet yields over 85% efficiency. The efficiency and raw performance is expected to increase for bigger problems. For the 200K node problem, our code delivers about 5 GFLOPS of performance on a 256 processor T3D. This is impressive considering the fact that the problem has floating point divides and roots, and very little locality resulting in poor cache performance. A dense matrix-vector product of the same dimensions would require about 0.5 TeraBytes of memory and about 770 TeraFLOPS of computing speed. Clearly, if the loss in accuracy resulting from the use of hierarchical methods is acceptable, our code yields significant savings in time and memory. We also study the convergence of a GMRES solver built around this mat-vec. We accelerate the convergence of the solver using three preconditioning techniques: diagonal scaling, block-diagonal preconditioning, and inner-outer preconditioning. We study the performance and parallel efficiency of these preconditioned solvers. Using this solver, we solve dense linear systems with hundreds of thousands of unknowns. Solving a 105K unknown problem takes about 10 minutes on a 64 processor T3D. Until very recently, boundary element problems of this magnitude could not even be generated, let alone solved.
Maximizing synchronizability of duplex networks

NASA Astrophysics Data System (ADS)

Wei, Xiang; Emenheiser, Jeffrey; Wu, Xiaoqun; Lu, Jun-an; D'Souza, Raissa M.

2018-01-01

We study the synchronizability of duplex networks formed by two randomly generated network layers with different patterns of interlayer node connections. According to the master stability function, we use the smallest nonzero eigenvalue and the eigenratio between the largest and the second smallest eigenvalues of supra-Laplacian matrices to characterize synchronizability on various duplexes. We find that the interlayer linking weight and linking fraction have a profound impact on synchronizability of duplex networks. The increasingly large inter-layer coupling weight is found to cause either decreasing or constant synchronizability for different classes of network dynamics. In addition, negative node degree correlation across interlayer links outperforms positive degree correlation when most interlayer links are present. The reverse is true when a few interlayer links are present. The numerical results and understanding based on these representative duplex networks are illustrative and instructive for building insights into maximizing synchronizability of more realistic multiplex networks.
Automated mass spectrometer analysis system

NASA Technical Reports Server (NTRS)

Giffin, Charles E. (Inventor); Kuppermann, Aron (Inventor); Dreyer, William J. (Inventor); Boettger, Heinz G. (Inventor)

1982-01-01

An automated mass spectrometer analysis system is disclosed, in which samples are automatically processed in a sample processor and converted into volatilizable samples, or their characteristic volatilizable derivatives. Each volatilizable sample is sequentially volatilized and analyzed in a double focusing mass spectrometer, whose output is in the form of separate ion beams all of which are simultaneously focused in a focal plane. Each ion beam is indicative of a different sample component or different fragments of one or more sample components and the beam intensity is related to the relative abundance of the sample component. The system includes an electro-optical ion detector which automatically and simultaneously converts the ion beams, first into electron beams which in turn produce a related image which is transferred to the target of a vilicon unit. The latter converts the images into electrical signals which are supplied to a data processor, whose output is a list of the components of the analyzed sample and their abundances. The system is under the control of a master control unit, which in addition to monitoring and controlling various power sources, controls the automatic operation of the system under expected and some unexpected conditions and further protects various critical parts of the system from damage due to particularly abnormal conditions.
Automated mass spectrometer analysis system

NASA Technical Reports Server (NTRS)

Boettger, Heinz G. (Inventor); Giffin, Charles E. (Inventor); Dreyer, William J. (Inventor); Kuppermann, Aron (Inventor)

1978-01-01

An automated mass spectrometer analysis system is disclosed, in which samples are automatically processed in a sample processor and converted into volatilizable samples, or their characteristic volatilizable derivatives. Each volatizable sample is sequentially volatilized and analyzed in a double focusing mass spectrometer, whose output is in the form of separate ion beams all of which are simultaneously focused in a focal plane. Each ion beam is indicative of a different sample component or different fragments of one or more sample components and the beam intensity is related to the relative abundance of the sample component. The system includes an electro-optical ion detector which automatically and simultaneously converts the ion beams, first into electron beams which in turn produce a related image which is transferred to the target of a vidicon unit. The latter converts the images into electrical signals which are supplied to a data processor, whose output is a list of the components of the analyzed sample and their abundances. The system is under the control of a master control unit, which in addition to monitoring and controlling various power sources, controls the automatic operation of the system under expected and some unexpected conditions and further protects various critical parts of the system from damage due to particularly abnormal conditions.
Sub-nanosecond clock synchronization and trigger management in the nuclear physics experiment AGATA

NASA Astrophysics Data System (ADS)

Bellato, M.; Bortolato, D.; Chavas, J.; Isocrate, R.; Rampazzo, G.; Triossi, A.; Bazzacco, D.; Mengoni, D.; Recchia, F.

2013-07-01

The new-generation spectrometer AGATA, the Advanced GAmma Tracking Array, requires sub-nanosecond clock synchronization among readout and front-end electronics modules that may lie hundred meters apart. We call GTS (Global Trigger and Synchronization System) the infrastructure responsible for precise clock synchronization and for the trigger management of AGATA. It is made of a central trigger processor and nodes, connected in a tree structure by means of optical fibers operated at 2Gb/s. The GTS tree handles the synchronization and the trigger data flow, whereas the trigger processor analyses and eventually validates the trigger primitives centrally. Sub-nanosecond synchronization is achieved by measuring two different types of round-trip times and by automatically correcting for phase-shift differences. For a tree of depth two, the peak-to-peak clock jitter at each leaf is 70 ps; the mean phase difference is 180 ps, while the standard deviation over such phase difference, namely the phase equalization repeatability, is 20 ps. The GTS system has run flawlessly for the two-year long AGATA campaign, held at the INFN Legnaro National Laboratories, Italy, where five triple clusters of the AGATA sub-array were coupled with a variety of ancillary detectors.
Recent advances and future prospects for Monte Carlo

DOE Office of Scientific and Technical Information (OSTI.GOV)

Brown, Forrest B

2010-01-01

The history of Monte Carlo methods is closely linked to that of computers: The first known Monte Carlo program was written in 1947 for the ENIAC; a pre-release of the first Fortran compiler was used for Monte Carlo In 1957; Monte Carlo codes were adapted to vector computers in the 1980s, clusters and parallel computers in the 1990s, and teraflop systems in the 2000s. Recent advances include hierarchical parallelism, combining threaded calculations on multicore processors with message-passing among different nodes. With the advances In computmg, Monte Carlo codes have evolved with new capabilities and new ways of use. Production codesmore » such as MCNP, MVP, MONK, TRIPOLI and SCALE are now 20-30 years old (or more) and are very rich in advanced featUres. The former 'method of last resort' has now become the first choice for many applications. Calculations are now routinely performed on office computers, not just on supercomputers. Current research and development efforts are investigating the use of Monte Carlo methods on FPGAs. GPUs, and many-core processors. Other far-reaching research is exploring ways to adapt Monte Carlo methods to future exaflop systems that may have 1M or more concurrent computational processes.« less
Optimization of the coherence function estimation for multi-core central processing unit

NASA Astrophysics Data System (ADS)

Cheremnov, A. G.; Faerman, V. A.; Avramchuk, V. S.

2017-02-01

The paper considers use of parallel processing on multi-core central processing unit for optimization of the coherence function evaluation arising in digital signal processing. Coherence function along with other methods of spectral analysis is commonly used for vibration diagnosis of rotating machinery and its particular nodes. An algorithm is given for the function evaluation for signals represented with digital samples. The algorithm is analyzed for its software implementation and computational problems. Optimization measures are described, including algorithmic, architecture and compiler optimization, their results are assessed for multi-core processors from different manufacturers. Thus, speeding-up of the parallel execution with respect to sequential execution was studied and results are presented for Intel Core i7-4720HQ и AMD FX-9590 processors. The results show comparatively high efficiency of the optimization measures taken. In particular, acceleration indicators and average CPU utilization have been significantly improved, showing high degree of parallelism of the constructed calculating functions. The developed software underwent state registration and will be used as a part of a software and hardware solution for rotating machinery fault diagnosis and pipeline leak location with acoustic correlation method.
Assumed--stress hybrid elements with drilling degrees of freedom for nonlinear analysis of composite structures

NASA Technical Reports Server (NTRS)

Knight, Norman F., Jr. (Principal Investigator)

1996-01-01

The goal of this research project is to develop assumed-stress hybrid elements with rotational degrees of freedom for analyzing composite structures. During the first year of the three-year activity, the effort was directed to further assess the AQ4 shell element and its extensions to buckling and free vibration problems. In addition, the development of a compatible 2-node beam element was to be accomplished. The extensions and new developments were implemented in the Computational Structural Mechanics Testbed COMET. An assessment was performed to verify the implementation and to assess the performance of these elements in terms of accuracy. During the second and third years, extensions to geometrically nonlinear problems were developed and tested. This effort involved working with the nonlinear solution strategy as well as the nonlinear formulation for the elements. This research has resulted in the development and implementation of two additional element processors (ES22 for the beam element and ES24 for the shell elements) in COMET. The software was developed using a SUN workstation and has been ported to the NASA Langley Convex named blackbird. Both element processors are now part of the baseline version of COMET.
BOKASUN: A fast and precise numerical program to calculate the Master Integrals of the two-loop sunrise diagrams

NASA Astrophysics Data System (ADS)

Caffo, Michele; Czyż, Henryk; Gunia, Michał; Remiddi, Ettore

2009-03-01

We present the program BOKASUN for fast and precise evaluation of the Master Integrals of the two-loop self-mass sunrise diagram for arbitrary values of the internal masses and the external four-momentum. We use a combination of two methods: a Bernoulli accelerated series expansion and a Runge-Kutta numerical solution of a system of linear differential equations. Program summaryProgram title: BOKASUN Catalogue identifier: AECG_v1_0 Program summary URL:http://cpc.cs.qub.ac.uk/summaries/AECG_v1_0.html Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland Licensing provisions: Standard CPC licence, http://cpc.cs.qub.ac.uk/licence/licence.html No. of lines in distributed program, including test data, etc.: 9404 No. of bytes in distributed program, including test data, etc.: 104 123 Distribution format: tar.gz Programming language: FORTRAN77 Computer: Any computer with a Fortran compiler accepting FORTRAN77 standard. Tested on various PC's with LINUX Operating system: LINUX RAM: 120 kbytes Classification: 4.4 Nature of problem: Any integral arising in the evaluation of the two-loop sunrise Feynman diagram can be expressed in terms of a given set of Master Integrals, which should be calculated numerically. The program provides a fast and precise evaluation method of the Master Integrals for arbitrary (but not vanishing) masses and arbitrary value of the external momentum. Solution method: The integrals depend on three internal masses and the external momentum squared p. The method is a combination of an accelerated expansion in 1/p in its (pretty large!) region of fast convergence and of a Runge-Kutta numerical solution of a system of linear differential equations. Running time: To obtain 4 Master Integrals on PC with 2 GHz processor it takes 3 μs for series expansion with pre-calculated coefficients, 80 μs for series expansion without pre-calculated coefficients, from a few seconds up to a few minutes for Runge-Kutta method (depending on the required accuracy and the values of the physical parameters).
Designing Next Generation Massively Multithreaded Architectures for Irregular Applications

DOE Office of Scientific and Technical Information (OSTI.GOV)

Tumeo, Antonino; Secchi, Simone; Villa, Oreste

Irregular applications, such as data mining or graph-based computations, show unpredictable memory/network access patterns and control structures. Massively multi-threaded architectures with large node count, like the Cray XMT, have been shown to address their requirements better than commodity clusters. In this paper we present the approaches that we are currently pursuing to design future generations of these architectures. First, we introduce the Cray XMT and compare it to other multithreaded architectures. We then propose an evolution of the architecture, integrating multiple cores per node and next generation network interconnect. We advocate the use of hardware support for remote memory referencemore » aggregation to optimize network utilization. For this evaluation we developed a highly parallel, custom simulation infrastructure for multi-threaded systems. Our simulator executes unmodified XMT binaries with very large datasets, capturing effects due to contention and hot-spotting, while predicting execution times with greater than 90% accuracy. We also discuss the FPGA prototyping approach that we are employing to study efficient support for irregular applications in next generation manycore processors.« less
Distributed Virtual System (DIVIRS) Project

NASA Technical Reports Server (NTRS)

Schorr, Herbert; Neuman, B. Clifford

1993-01-01

As outlined in our continuation proposal 92-ISI-50R (revised) on contract NCC 2-539, we are (1) developing software, including a system manager and a job manager, that will manage available resources and that will enable programmers to program parallel applications in terms of a virtual configuration of processors, hiding the mapping to physical nodes; (2) developing communications routines that support the abstractions implemented in item one; (3) continuing the development of file and information systems based on the virtual system model; and (4) incorporating appropriate security measures to allow the mechanisms developed in items 1 through 3 to be used on an open network. The goal throughout our work is to provide a uniform model that can be applied to both parallel and distributed systems. We believe that multiprocessor systems should exist in the context of distributed systems, allowing them to be more easily shared by those that need them. Our work provides the mechanisms through which nodes on multiprocessors are allocated to jobs running within the distributed system and the mechanisms through which files needed by those jobs can be located and accessed.
Running ATLAS workloads within massively parallel distributed applications using Athena Multi-Process framework (AthenaMP)

NASA Astrophysics Data System (ADS)

Calafiura, Paolo; Leggett, Charles; Seuster, Rolf; Tsulaia, Vakhtang; Van Gemmeren, Peter

2015-12-01

AthenaMP is a multi-process version of the ATLAS reconstruction, simulation and data analysis framework Athena. By leveraging Linux fork and copy-on-write mechanisms, it allows for sharing of memory pages between event processors running on the same compute node with little to no change in the application code. Originally targeted to optimize the memory footprint of reconstruction jobs, AthenaMP has demonstrated that it can reduce the memory usage of certain configurations of ATLAS production jobs by a factor of 2. AthenaMP has also evolved to become the parallel event-processing core of the recently developed ATLAS infrastructure for fine-grained event processing (Event Service) which allows the running of AthenaMP inside massively parallel distributed applications on hundreds of compute nodes simultaneously. We present the architecture of AthenaMP, various strategies implemented by AthenaMP for scheduling workload to worker processes (for example: Shared Event Queue and Shared Distributor of Event Tokens) and the usage of AthenaMP in the diversity of ATLAS event processing workloads on various computing resources: Grid, opportunistic resources and HPC.
DIstributed VIRtual System (DIVIRS) project

NASA Technical Reports Server (NTRS)

Schorr, Herbert; Neuman, B. Clifford

1994-01-01

As outlined in our continuation proposal 92-ISI-. OR (revised) on NASA cooperative agreement NCC2-539, we are (1) developing software, including a system manager and a job manager, that will manage available resources and that will enable programmers to develop and execute parallel applications in terms of a virtual configuration of processors, hiding the mapping to physical nodes; (2) developing communications routines that support the abstractions implemented in item one; (3) continuing the development of file and information systems based on the Virtual System Model; and (4) incorporating appropriate security measures to allow the mechanisms developed in items 1 through 3 to be used on an open network. The goal throughout our work is to provide a uniform model that can be applied to both parallel and distributed systems. We believe that multiprocessor systems should exist in the context of distributed systems, allowing them to be more easily shared by those that need them. Our work provides the mechanisms through which nodes on multiprocessors are allocated to jobs running within the distributed system and the mechanisms through which files needed by those jobs can be located and accessed.

DIstributed VIRtual System (DIVIRS) project

NASA Technical Reports Server (NTRS)

Schorr, Herbert; Neuman, Clifford B.

1995-01-01

As outlined in our continuation proposal 92-ISI-50R (revised) on NASA cooperative agreement NCC2-539, we are (1) developing software, including a system manager and a job manager, that will manage available resources and that will enable programmers to develop and execute parallel applications in terms of a virtual configuration of processors, hiding the mapping to physical nodes; (2) developing communications routines that support the abstractions implemented in item one; (3) continuing the development of file and information systems based on the Virtual System Model; and (4) incorporating appropriate security measures to allow the mechanisms developed in items 1 through 3 to be used on an open network. The goal throughout our work is to provide a uniform model that can be applied to both parallel and distributed systems. We believe that multiprocessor systems should exist in the context of distributed systems, allowing them to be more easily shared by those that need them. Our work provides the mechanisms through which nodes on multiprocessors are allocated to jobs running within the distributed system and the mechanisms through which files needed by those jobs can be located and accessed.
Distributed Virtual System (DIVIRS) project

NASA Technical Reports Server (NTRS)

Schorr, Herbert; Neuman, B. Clifford

1993-01-01

As outlined in the continuation proposal 92-ISI-50R (revised) on NASA cooperative agreement NCC 2-539, the investigators are developing software, including a system manager and a job manager, that will manage available resources and that will enable programmers to develop and execute parallel applications in terms of a virtual configuration of processors, hiding the mapping to physical nodes; developing communications routines that support the abstractions implemented; continuing the development of file and information systems based on the Virtual System Model; and incorporating appropriate security measures to allow the mechanisms developed to be used on an open network. The goal throughout the work is to provide a uniform model that can be applied to both parallel and distributed systems. The authors believe that multiprocessor systems should exist in the context of distributed systems, allowing them to be more easily shared by those that need them. The work provides the mechanisms through which nodes on multiprocessors are allocated to jobs running within the distributed system and the mechanisms through which files needed by those jobs can be located and accessed.
Comparing the OpenMP, MPI, and Hybrid Programming Paradigm on an SMP Cluster

NASA Technical Reports Server (NTRS)

Jost, Gabriele; Jin, Haoqiang; anMey, Dieter; Hatay, Ferhat F.

2003-01-01

With the advent of parallel hardware and software technologies users are faced with the challenge to choose a programming paradigm best suited for the underlying computer architecture. With the current trend in parallel computer architectures towards clusters of shared memory symmetric multi-processors (SMP), parallel programming techniques have evolved to support parallelism beyond a single level. Which programming paradigm is the best will depend on the nature of the given problem, the hardware architecture, and the available software. In this study we will compare different programming paradigms for the parallelization of a selected benchmark application on a cluster of SMP nodes. We compare the timings of different implementations of the same CFD benchmark application employing the same numerical algorithm on a cluster of Sun Fire SMP nodes. The rest of the paper is structured as follows: In section 2 we briefly discuss the programming models under consideration. We describe our compute platform in section 3. The different implementations of our benchmark code are described in section 4 and the performance results are presented in section 5. We conclude our study in section 6.
MIROS: A Hybrid Real-Time Energy-Efficient Operating System for the Resource-Constrained Wireless Sensor Nodes

PubMed Central

Liu, Xing; Hou, Kun Mean; de Vaulx, Christophe; Shi, Hongling; Gholami, Khalid El

2014-01-01

Operating system (OS) technology is significant for the proliferation of the wireless sensor network (WSN). With an outstanding OS; the constrained WSN resources (processor; memory and energy) can be utilized efficiently. Moreover; the user application development can be served soundly. In this article; a new hybrid; real-time; memory-efficient; energy-efficient; user-friendly and fault-tolerant WSN OS MIROS is designed and implemented. MIROS implements the hybrid scheduler and the dynamic memory allocator. Real-time scheduling can thus be achieved with low memory consumption. In addition; it implements a mid-layer software EMIDE (Efficient Mid-layer Software for User-Friendly Application Development Environment) to decouple the WSN application from the low-level system. The application programming process can consequently be simplified and the application reprogramming performance improved. Moreover; it combines both the software and the multi-core hardware techniques to conserve the energy resources; improve the node reliability; as well as achieve a new debugging method. To evaluate the performance of MIROS; it is compared with the other WSN OSes (TinyOS; Contiki; SOS; openWSN and mantisOS) from different OS concerns. The final evaluation results prove that MIROS is suitable to be used even on the tight resource-constrained WSN nodes. It can support the real-time WSN applications. Furthermore; it is energy efficient; user friendly and fault tolerant. PMID:25248069
Scalable Triadic Analysis of Large-Scale Graphs: Multi-Core vs. Multi-Processor vs. Multi-Threaded Shared Memory Architectures

DOE Office of Scientific and Technical Information (OSTI.GOV)

Chin, George; Marquez, Andres; Choudhury, Sutanay

2012-09-01

Triadic analysis encompasses a useful set of graph mining methods that is centered on the concept of a triad, which is a subgraph of three nodes and the configuration of directed edges across the nodes. Such methods are often applied in the social sciences as well as many other diverse fields. Triadic methods commonly operate on a triad census that counts the number of triads of every possible edge configuration in a graph. Like other graph algorithms, triadic census algorithms do not scale well when graphs reach tens of millions to billions of nodes. To enable the triadic analysis ofmore » large-scale graphs, we developed and optimized a triad census algorithm to efficiently execute on shared memory architectures. We will retrace the development and evolution of a parallel triad census algorithm. Over the course of several versions, we continually adapted the code’s data structures and program logic to expose more opportunities to exploit parallelism on shared memory that would translate into improved computational performance. We will recall the critical steps and modifications that occurred during code development and optimization. Furthermore, we will compare the performances of triad census algorithm versions on three specific systems: Cray XMT, HP Superdome, and AMD multi-core NUMA machine. These three systems have shared memory architectures but with markedly different hardware capabilities to manage parallelism.« less
Using SRAM Based FPGAs for Power-Aware High Performance Wireless Sensor Networks

PubMed Central

Valverde, Juan; Otero, Andres; Lopez, Miguel; Portilla, Jorge; de la Torre, Eduardo; Riesgo, Teresa

2012-01-01

While for years traditional wireless sensor nodes have been based on ultra-low power microcontrollers with sufficient but limited computing power, the complexity and number of tasks of today’s applications are constantly increasing. Increasing the node duty cycle is not feasible in all cases, so in many cases more computing power is required. This extra computing power may be achieved by either more powerful microcontrollers, though more power consumption or, in general, any solution capable of accelerating task execution. At this point, the use of hardware based, and in particular FPGA solutions, might appear as a candidate technology, since though power use is higher compared with lower power devices, execution time is reduced, so energy could be reduced overall. In order to demonstrate this, an innovative WSN node architecture is proposed. This architecture is based on a high performance high capacity state-of-the-art FPGA, which combines the advantages of the intrinsic acceleration provided by the parallelism of hardware devices, the use of partial reconfiguration capabilities, as well as a careful power-aware management system, to show that energy savings for certain higher-end applications can be achieved. Finally, comprehensive tests have been done to validate the platform in terms of performance and power consumption, to proof that better energy efficiency compared to processor based solutions can be achieved, for instance, when encryption is imposed by the application requirements. PMID:22736971
Using SRAM based FPGAs for power-aware high performance wireless sensor networks.

PubMed

Valverde, Juan; Otero, Andres; Lopez, Miguel; Portilla, Jorge; de la Torre, Eduardo; Riesgo, Teresa

2012-01-01

While for years traditional wireless sensor nodes have been based on ultra-low power microcontrollers with sufficient but limited computing power, the complexity and number of tasks of today's applications are constantly increasing. Increasing the node duty cycle is not feasible in all cases, so in many cases more computing power is required. This extra computing power may be achieved by either more powerful microcontrollers, though more power consumption or, in general, any solution capable of accelerating task execution. At this point, the use of hardware based, and in particular FPGA solutions, might appear as a candidate technology, since though power use is higher compared with lower power devices, execution time is reduced, so energy could be reduced overall. In order to demonstrate this, an innovative WSN node architecture is proposed. This architecture is based on a high performance high capacity state-of-the-art FPGA, which combines the advantages of the intrinsic acceleration provided by the parallelism of hardware devices, the use of partial reconfiguration capabilities, as well as a careful power-aware management system, to show that energy savings for certain higher-end applications can be achieved. Finally, comprehensive tests have been done to validate the platform in terms of performance and power consumption, to proof that better energy efficiency compared to processor based solutions can be achieved, for instance, when encryption is imposed by the application requirements.
Efficient implementation of MrBayes on multi-GPU.

PubMed

Bao, Jie; Xia, Hongju; Zhou, Jianfu; Liu, Xiaoguang; Wang, Gang

2013-06-01

MrBayes, using Metropolis-coupled Markov chain Monte Carlo (MCMCMC or (MC)(3)), is a popular program for Bayesian inference. As a leading method of using DNA data to infer phylogeny, the (MC)(3) Bayesian algorithm and its improved and parallel versions are now not fast enough for biologists to analyze massive real-world DNA data. Recently, graphics processor unit (GPU) has shown its power as a coprocessor (or rather, an accelerator) in many fields. This article describes an efficient implementation a(MC)(3) (aMCMCMC) for MrBayes (MC)(3) on compute unified device architecture. By dynamically adjusting the task granularity to adapt to input data size and hardware configuration, it makes full use of GPU cores with different data sets. An adaptive method is also developed to split and combine DNA sequences to make full use of a large number of GPU cards. Furthermore, a new "node-by-node" task scheduling strategy is developed to improve concurrency, and several optimizing methods are used to reduce extra overhead. Experimental results show that a(MC)(3) achieves up to 63× speedup over serial MrBayes on a single machine with one GPU card, and up to 170× speedup with four GPU cards, and up to 478× speedup with a 32-node GPU cluster. a(MC)(3) is dramatically faster than all the previous (MC)(3) algorithms and scales well to large GPU clusters.
MIROS: a hybrid real-time energy-efficient operating system for the resource-constrained wireless sensor nodes.

PubMed

Liu, Xing; Hou, Kun Mean; de Vaulx, Christophe; Shi, Hongling; El Gholami, Khalid

2014-09-22

Operating system (OS) technology is significant for the proliferation of the wireless sensor network (WSN). With an outstanding OS; the constrained WSN resources (processor; memory and energy) can be utilized efficiently. Moreover; the user application development can be served soundly. In this article; a new hybrid; real-time; memory-efficient; energy-efficient; user-friendly and fault-tolerant WSN OS MIROS is designed and implemented. MIROS implements the hybrid scheduler and the dynamic memory allocator. Real-time scheduling can thus be achieved with low memory consumption. In addition; it implements a mid-layer software EMIDE (Efficient Mid-layer Software for User-Friendly Application Development Environment) to decouple the WSN application from the low-level system. The application programming process can consequently be simplified and the application reprogramming performance improved. Moreover; it combines both the software and the multi-core hardware techniques to conserve the energy resources; improve the node reliability; as well as achieve a new debugging method. To evaluate the performance of MIROS; it is compared with the other WSN OSes (TinyOS; Contiki; SOS; openWSN and mantisOS) from different OS concerns. The final evaluation results prove that MIROS is suitable to be used even on the tight resource-constrained WSN nodes. It can support the real-time WSN applications. Furthermore; it is energy efficient; user friendly and fault tolerant.
a Non-Overlapping Discretization Method for Partial Differential Equations

NASA Astrophysics Data System (ADS)

Rosas-Medina, A.; Herrera, I.

2013-05-01

Mathematical models of many systems of interest, including very important continuous systems of Engineering and Science, lead to a great variety of partial differential equations whose solution methods are based on the computational processing of large-scale algebraic systems. Furthermore, the incredible expansion experienced by the existing computational hardware and software has made amenable to effective treatment problems of an ever increasing diversity and complexity, posed by engineering and scientific applications. The emergence of parallel computing prompted on the part of the computational-modeling community a continued and systematic effort with the purpose of harnessing it for the endeavor of solving boundary-value problems (BVPs) of partial differential equations. Very early after such an effort began, it was recognized that domain decomposition methods (DDM) were the most effective technique for applying parallel computing to the solution of partial differential equations, since such an approach drastically simplifies the coordination of the many processors that carry out the different tasks and also reduces very much the requirements of information-transmission between them. Ideally, DDMs intend producing algorithms that fulfill the DDM-paradigm; i.e., such that "the global solution is obtained by solving local problems defined separately in each subdomain of the coarse-mesh -or domain-decomposition-". Stated in a simplistic manner, the basic idea is that, when the DDM-paradigm is satisfied, full parallelization can be achieved by assigning each subdomain to a different processor. When intensive DDM research began much attention was given to overlapping DDMs, but soon after attention shifted to non-overlapping DDMs. This evolution seems natural when the DDM-paradigm is taken into account: it is easier to uncouple the local problems when the subdomains are separated. However, an important limitation of non-overlapping domain decompositions, as that concept is usually understood today, is that interface nodes are shared by two or more subdomains of the coarse-mesh and, therefore, even non-overlapping DDMs are actually overlapping when seen from the perspective of the nodes used in the discretization. In this talk we present and discuss a discretization method in which the nodes used are non-overlapping, in the sense that each one of them belongs to one and only one subdomain of the coarse-mesh.
Compact silicon photonics-based multi laser module for sensing

NASA Astrophysics Data System (ADS)

Ayotte, S.; Costin, F.; Babin, A.; Paré-Olivier, G.; Morin, M.; Filion, B.; Bédard, K.; Chrétien, P.; Bilodeau, G.; Girard-Deschênes, E.; Perron, L.-P.; Davidson, C.-A.; D'Amato, D.; Laplante, M.; Blanchet-Létourneau, J.

2018-02-01

A compact three-laser source for optical sensing is presented. It is based on a low-noise implementation of the Pound Drever-Hall method and comprises high-bandwidth optical phase-locked loops. The outputs from three semiconductor distributed feedback lasers, mounted on thermo-electric coolers (TEC), are coupled with micro-lenses into a silicon photonics (SiP) chip that performs beat note detection and several other functions. The chip comprises phase modulators, variable optical attenuators, multi-mode-interference couplers, variable ratio tap couplers, integrated photodiodes and optical fiber butt-couplers. Electrical connections between a metallized ceramic and the TECs, lasers and SiP chip are achieved by wirebonds. All these components stand within a 35 mm by 35 mm package which is interfaced with 90 electrical pins and two fiber pigtails. One pigtail carries the signals from a master and slave lasers, while another carries that from a second slave laser. The pins are soldered to a printed circuit board featuring a micro-processor that controls and monitors the system to ensure stable operation over fluctuating environmental conditions. This highly adaptable multi-laser source can address various sensing applications requiring the tracking of up to three narrow spectral features with a high bandwidth. It is used to sense a fiber-based ring resonator emulating a resonant fiber optics gyroscope. The master laser is locked to the resonator with a loop bandwidth greater than 1 MHz. The slave lasers are offset frequency locked to the master laser with loop bandwidths greater than 100 MHz. This high performance source is compact, automated, robust, and remains locked for days.
Category-theoretic models of algebraic computer systems

NASA Astrophysics Data System (ADS)

Kovalyov, S. P.

2016-01-01

A computer system is said to be algebraic if it contains nodes that implement unconventional computation paradigms based on universal algebra. A category-based approach to modeling such systems that provides a theoretical basis for mapping tasks to these systems' architecture is proposed. The construction of algebraic models of general-purpose computations involving conditional statements and overflow control is formally described by a reflector in an appropriate category of algebras. It is proved that this reflector takes the modulo ring whose operations are implemented in the conventional arithmetic processors to the Łukasiewicz logic matrix. Enrichments of the set of ring operations that form bases in the Łukasiewicz logic matrix are found.
Ultra-Compact Transputer-Based Controller for High-Level, Multi-Axis Coordination

NASA Technical Reports Server (NTRS)

Zenowich, Brian; Crowell, Adam; Townsend, William T.

2013-01-01

The design of machines that rely on arrays of servomotors such as robotic arms, orbital platforms, and combinations of both, imposes a heavy computational burden to coordinate their actions to perform coherent tasks. For example, the robotic equivalent of a person tracing a straight line in space requires enormously complex kinematics calculations, and complexity increases with the number of servo nodes. A new high-level architecture for coordinated servo-machine control enables a practical, distributed transputer alternative to conventional central processor electronics. The solution is inherently scalable, dramatically reduces bulkiness and number of conductor runs throughout the machine, requires only a fraction of the power, and is designed for cooling in a vacuum.
Single-Event Transient Testing of Low Dropout PNP Series Linear Voltage Regulators

NASA Technical Reports Server (NTRS)

Adell, Philippe; Allen, Gregory

2013-01-01

As demand for high-speed, on-board, digital-processing integrated circuits on spacecraft increases (field-programmable gate arrays and digital signal processors in particular), the need for the next generation point-of-load (POL) regulator becomes a prominent design issue. Shrinking process nodes have resulted in core rails dropping to values close to 1.0 V, drastically reducing margin to standard switching converters or regulators that power digital ICs. The goal of this task is to perform SET characterization of several commercial POL converters, and provide a discussion of the impact of these results to state-of-the-art digital processing IC through laser and heavy ion testing
Distributed software framework and continuous integration in hydroinformatics systems

NASA Astrophysics Data System (ADS)

Zhou, Jianzhong; Zhang, Wei; Xie, Mengfei; Lu, Chengwei; Chen, Xiao

2017-08-01

When encountering multiple and complicated models, multisource structured and unstructured data, complex requirements analysis, the platform design and integration of hydroinformatics systems become a challenge. To properly solve these problems, we describe a distributed software framework and it’s continuous integration process in hydroinformatics systems. This distributed framework mainly consists of server cluster for models, distributed database, GIS (Geographic Information System) servers, master node and clients. Based on it, a GIS - based decision support system for joint regulating of water quantity and water quality of group lakes in Wuhan China is established.
Current Saturation Avoidance with Real-Time Control using DPCS

NASA Astrophysics Data System (ADS)

Ferrara, M.; Hutchinson, I.; Wolfe, S.; Stillerman, J.; Fredian, T.

2008-11-01

Tokamak ohmic-transformer and equilibrium-field coils need to be able to operate near their maximum current capabilities. However if they reach their upper limit during high-performance discharges or in the presence of a strong off-normal event, shape control is compromised, and instability, even plasma disruptions can result. On Alcator C-Mod we designed and tested an anti-saturation routine which detects the impending saturation of OH and EF currents and interpolates to a neighboring safe equilibrium in real-time. The routine was implemented with a multi-processor, multi-time-scale control scheme, which is based on a master process and multiple asynchronous slave processes. The scheme is general and can be used for any computationally-intensive algorithm. USDoE award DE- FC02-99ER545512.
A SPDS Node to Support the Systematic Interpretation of Cosmic Ray Data

NASA Technical Reports Server (NTRS)

1997-01-01

The purpose of this project was to establish and maintain a Space Physics Data System (SPDS) node that supports the analysis and interpretation of current and future galactic cosmic ray (GCR) measurements by (1) providing on-line databases relevant to GCR propagation studies; (2) providing other on-line services, such as anonymous FTP access, mail list service and pointers to e-mail address books, to support the cosmic ray community; (3) providing a mechanism for those in the community who might wish to submit similar contributions for public access; (4) maintaining the node to assure that the databases remain current; and (5) investigating other possibilities, such as CD-ROM, for public dissemination of the data products. Shortly after the original grant to support these activities was established at Louisiana State University a detailed study of alternate choices for the node hardware was initiated. The chosen hardware was an Apple Workgroup Server 9150/120 consisting of a 120 MHz PowerPC 601 processor, 32 MB of memory, two I GB disks and one 2 GB disk. This hardware was ordered and installed and has been operating reliably ever since. A preliminary version of the database server was available during the first year effort and was used as part of the very successful SPDS demonstration during the Rome, Italy International Cosmic Ray Conference. For this server version we were able to establish the html and anonymous FTP server software, develop a Web page structure which can be easily modified to include new items, provide an on-line database of charge changing total cross sections, include the cross section prediction software of Silberberg & Tsao as well as Webber, Kish and Schrier for download access, and provide an on-line bibliography of the cross section measurement references by the Transport Collaboration. The preliminary version of this SPDS Cosmic Ray node was examined by members of the C&H SPDS committee and returned comments were used to refine the implementation.
Robust scalable stabilisability conditions for large-scale heterogeneous multi-agent systems with uncertain nonlinear interactions: towards a distributed computing architecture

NASA Astrophysics Data System (ADS)

Manfredi, Sabato

2016-06-01

Large-scale dynamic systems are becoming highly pervasive in their occurrence with applications ranging from system biology, environment monitoring, sensor networks, and power systems. They are characterised by high dimensionality, complexity, and uncertainty in the node dynamic/interactions that require more and more computational demanding methods for their analysis and control design, as well as the network size and node system/interaction complexity increase. Therefore, it is a challenging problem to find scalable computational method for distributed control design of large-scale networks. In this paper, we investigate the robust distributed stabilisation problem of large-scale nonlinear multi-agent systems (briefly MASs) composed of non-identical (heterogeneous) linear dynamical systems coupled by uncertain nonlinear time-varying interconnections. By employing Lyapunov stability theory and linear matrix inequality (LMI) technique, new conditions are given for the distributed control design of large-scale MASs that can be easily solved by the toolbox of MATLAB. The stabilisability of each node dynamic is a sufficient assumption to design a global stabilising distributed control. The proposed approach improves some of the existing LMI-based results on MAS by both overcoming their computational limits and extending the applicative scenario to large-scale nonlinear heterogeneous MASs. Additionally, the proposed LMI conditions are further reduced in terms of computational requirement in the case of weakly heterogeneous MASs, which is a common scenario in real application where the network nodes and links are affected by parameter uncertainties. One of the main advantages of the proposed approach is to allow to move from a centralised towards a distributed computing architecture so that the expensive computation workload spent to solve LMIs may be shared among processors located at the networked nodes, thus increasing the scalability of the approach than the network size. Finally, a numerical example shows the applicability of the proposed method and its advantage in terms of computational complexity when compared with the existing approaches.
A versatile small form factor twisted-pair TFC FMC for MTCA AMCs

NASA Astrophysics Data System (ADS)

Meder, L.; Lebedev, J.; Becker, J.

2017-03-01

In continuous readout systems of particle physics experiments, the provision of a common clock and time reference and the distribution of critical low latency messages to the processing and fronted layers of the readout are crucial tasks. In the context of the Compressed Baryonic Matter (CBM) experiment, a versatile small form factor Timing and Fast-Control (TFC) interfacing FPGA Mezzanine Card (FMC) was developed, offering bidirectional twisted-pair (TP) links for the communication between TFC nodes. Also a versatile clocking including voltage controlled oscillators and a connection to the telecommunication clock lines of mTCA crates are available. Being designed for both TFC Master and Slaves, the card allows rapid system developments without additional Slave hardware circuits. Measurements show that it is possible to transmit over cable lengths of 25 m at a rate of 240 Mbit/s for all data channels simultaneously. A TFC Master-Slave system using two of these cards can be synchronized with a precision of ±10 ps to an user-defined phase setpoint.
Evolutionary prisoner's dilemma games coevolving on adaptive networks.

PubMed

Lee, Hsuan-Wei; Malik, Nishant; Mucha, Peter J

2018-02-01

We study a model for switching strategies in the Prisoner's Dilemma game on adaptive networks of player pairings that coevolve as players attempt to maximize their return. We use a node-based strategy model wherein each player follows one strategy at a time (cooperate or defect) across all of its neighbors, changing that strategy and possibly changing partners in response to local changes in the network of player pairing and in the strategies used by connected partners. We compare and contrast numerical simulations with existing pair approximation differential equations for describing this system, as well as more accurate equations developed here using the framework of approximate master equations. We explore the parameter space of the model, demonstrating the relatively high accuracy of the approximate master equations for describing the system observations made from simulations. We study two variations of this partner-switching model to investigate the system evolution, predict stationary states, and compare the total utilities and other qualitative differences between these two model variants.

Proactive Fault Tolerance for HPC with Xen Virtualization

DOE Office of Scientific and Technical Information (OSTI.GOV)

Nagarajan, Arun Babu; Mueller, Frank; Engelmann, Christian

2007-01-01

with thousands of processors. At such large counts of compute nodes, faults are becoming common place. Current techniques to tolerate faults focus on reactive schemes to recover from faults and generally rely on a checkpoint/restart mechanism. Yet, in today's systems, node failures can often be anticipated by detecting a deteriorating health status. Instead of a reactive scheme for fault tolerance (FT), we are promoting a proactive one where processes automatically migrate from unhealthy nodes to healthy ones. Our approach relies on operating system virtualization techniques exemplied by but not limited to Xen. This paper contributes an automatic and transparent mechanismmore » for proactive FT for arbitrary MPI applications. It leverages virtualization techniques combined with health monitoring and load-based migration. We exploit Xen's live migration mechanism for a guest operating system (OS) to migrate an MPI task from a health-deteriorating node to a healthy one without stopping the MPI task during most of the migration. Our proactive FT daemon orchestrates the tasks of health monitoring, load determination and initiation of guest OS migration. Experimental results demonstrate that live migration hides migration costs and limits the overhead to only a few seconds making it an attractive approach to realize FT in HPC systems. Overall, our enhancements make proactive FT a valuable asset for long-running MPI application that is complementary to reactive FT using full checkpoint/ restart schemes since checkpoint frequencies can be reduced as fewer unanticipated failures are encountered. In the context of OS virtualization, we believe that this is the rst comprehensive study of proactive fault tolerance where live migration is actually triggered by health monitoring.« less
Benchmarking worker nodes using LHCb productions and comparing with HEPSpec06

NASA Astrophysics Data System (ADS)

Charpentier, P.

2017-10-01

In order to estimate the capabilities of a computing slot with limited processing time, it is necessary to know with a rather good precision its “power”. This allows for example pilot jobs to match a task for which the required CPU-work is known, or to define the number of events to be processed knowing the CPU-work per event. Otherwise one always has the risk that the task is aborted because it exceeds the CPU capabilities of the resource. It also allows a better accounting of the consumed resources. The traditional way the CPU power is estimated in WLCG since 2007 is using the HEP-Spec06 benchmark (HS06) suite that was verified at the time to scale properly with a set of typical HEP applications. However, the hardware architecture of processors has evolved, all WLCG experiments moved to using 64-bit applications and use different compilation flags from those advertised for running HS06. It is therefore interesting to check the scaling of HS06 with the HEP applications. For this purpose, we have been using CPU intensive massive simulation productions from the LHCb experiment and compared their event throughput to the HS06 rating of the worker nodes. We also compared it with a much faster benchmark script that is used by the DIRAC framework used by LHCb for evaluating at run time the performance of the worker nodes. This contribution reports on the finding of these comparisons: the main observation is that the scaling with HS06 is no longer fulfilled, while the fast benchmarks have a better scaling but are less precise. One can also clearly see that some hardware or software features when enabled on the worker nodes may enhance their performance beyond expectation from either benchmark, depending on external factors.
Separable projection integrals for higher-order correlators of the cosmic microwave sky: Acceleration by factors exceeding 100

NASA Astrophysics Data System (ADS)

Briggs, J. P.; Pennycook, S. J.; Fergusson, J. R.; Jäykkä, J.; Shellard, E. P. S.

2016-04-01

We present a case study describing efforts to optimise and modernise "Modal", the simulation and analysis pipeline used by the Planck satellite experiment for constraining general non-Gaussian models of the early universe via the bispectrum (or three-point correlator) of the cosmic microwave background radiation. We focus on one particular element of the code: the projection of bispectra from the end of inflation to the spherical shell at decoupling, which defines the CMB we observe today. This code involves a three-dimensional inner product between two functions, one of which requires an integral, on a non-rectangular domain containing a sparse grid. We show that by employing separable methods this calculation can be reduced to a one-dimensional summation plus two integrations, reducing the overall dimensionality from four to three. The introduction of separable functions also solves the issue of the non-rectangular sparse grid. This separable method can become unstable in certain scenarios and so the slower non-separable integral must be calculated instead. We present a discussion of the optimisation of both approaches. We demonstrate significant speed-ups of ≈100×, arising from a combination of algorithmic improvements and architecture-aware optimisations targeted at improving thread and vectorisation behaviour. The resulting MPI/OpenMP hybrid code is capable of executing on clusters containing processors and/or coprocessors, with strong-scaling efficiency of 98.6% on up to 16 nodes. We find that a single coprocessor outperforms two processor sockets by a factor of 1.3× and that running the same code across a combination of both microarchitectures improves performance-per-node by a factor of 3.38×. By making bispectrum calculations competitive with those for the power spectrum (or two-point correlator) we are now able to consider joint analysis for cosmological science exploitation of new data.
Fast variogram analysis of remotely sensed images in HPC environment

NASA Astrophysics Data System (ADS)

Pesquer, Lluís; Cortés, Anna; Masó, Joan; Pons, Xavier

2013-04-01

Exploring and describing spatial variation of images is one of the main applications of geostatistics to remote sensing. The variogram is a very suitable tool to carry out this spatial pattern analysis. Variogram analysis is composed of two steps: empirical variogram generation and fitting a variogram model. The empirical variogram generation is a very quick procedure for most analyses of irregularly distributed samples, but time consuming increases quite significantly for remotely sensed images, because number of samples (pixels) involved is usually huge (more than 30 million for a Landsat TM scene), basically depending on extension and spatial resolution of images. In several remote sensing applications this type of analysis is repeated for each image, sometimes hundreds of scenes and sometimes for each radiometric band (high number in the case of hyperspectral images) so that there is a need for a fast implementation. In order to reduce this high execution time, we carried out a parallel solution of the variogram analyses. The solution adopted is the master/worker programming paradigm in which the master process distributes and coordinates the tasks executed by the worker processes. The code is written in ANSI-C language, including MPI (Message Passing Interface) as a message-passing library in order to communicate the master with the workers. This solution (ANSI-C + MPI) guarantees portability between different computer platforms. The High Performance Computing (HPC) environment is formed by 32 nodes, each with two Dual Core Intel(R) Xeon (R) 3.0 GHz processors with 12 Gb of RAM, communicated with integrated dual gigabit Ethernet. This IBM cluster is located in the research laboratory of the Computer Architecture and Operating Systems Department of the Universitat Autònoma de Barcelona. The performance results for a 15km x 15km subcene of 198-31 path-row Landsat TM image are shown in table 1. The proximity between empirical speedup behaviour and theoretical linear speedup confirms a suitable parallel design and implementation applied. N Workers Time (s) Speedup 0 2975.03 2 2112.33 1.41 4 1067.45 2.79 8 534.18 5.57 12 357.54 8.32 16 269.00 11.06 20 216.24 13.76 24 186.31 15.97 Furthermore, very similar performance results are obtained for CASI images (hyperspectral and finer spatial resolution than Landsat), showed in table 2, and demonstrating that the distributed load design is not specifically defined and optimized for unique type of images, but it is a flexible design that maintains a good balance and scalability suitable for different range of image dimensions. N Workers Time (s) Speedup 0 5485.03 2 3847.47 1.43 4 1921.62 2.85 8 965.55 5.68 12 644.26 8.51 16 483.40 11.35 20 393.67 13.93 24 347.15 15.80 28 306.33 17.91 32 304.39 18.02 Finally, we conclude that this significant time reduction underlines the utility of distributed environments for processing large amount of data as remotely sensed images.
A master-slave parallel hybrid multi-objective evolutionary algorithm for groundwater remediation design under general hydrogeological conditions

NASA Astrophysics Data System (ADS)

Wu, J.; Yang, Y.; Luo, Q.; Wu, J.

2012-12-01

This study presents a new hybrid multi-objective evolutionary algorithm, the niched Pareto tabu search combined with a genetic algorithm (NPTSGA), whereby the global search ability of niched Pareto tabu search (NPTS) is improved by the diversification of candidate solutions arose from the evolving nondominated sorting genetic algorithm II (NSGA-II) population. Also, the NPTSGA coupled with the commonly used groundwater flow and transport codes, MODFLOW and MT3DMS, is developed for multi-objective optimal design of groundwater remediation systems. The proposed methodology is then applied to a large-scale field groundwater remediation system for cleanup of large trichloroethylene (TCE) plume at the Massachusetts Military Reservation (MMR) in Cape Cod, Massachusetts. Furthermore, a master-slave (MS) parallelization scheme based on the Message Passing Interface (MPI) is incorporated into the NPTSGA to implement objective function evaluations in distributed processor environment, which can greatly improve the efficiency of the NPTSGA in finding Pareto-optimal solutions to the real-world application. This study shows that the MS parallel NPTSGA in comparison with the original NPTS and NSGA-II can balance the tradeoff between diversity and optimality of solutions during the search process and is an efficient and effective tool for optimizing the multi-objective design of groundwater remediation systems under complicated hydrogeologic conditions.
Superelement Analysis of Tile-Reinforced Composite Armor

NASA Technical Reports Server (NTRS)

Davila, Carlos G.

1998-01-01

Super-elements can greatly improve the computational efficiency of analyses of tile-reinforced structures such as the hull of the Composite Armored Vehicle. By taking advantage of the periodicity in this type of construction, super-elements can be used to simplify the task of modeling, to virtually eliminate the time required to assemble the stiffness matrices, and to reduce significantly the analysis solution time. Furthermore, super-elements are fully transferable between analyses and analysts, so that they provide a consistent method to share information and reduce duplication. This paper describes a methodology that was developed to model and analyze large upper hull components of the Composite Armored Vehicle. The analyses are based on two types of superelement models. The first type is based on element-layering, which consists of modeling a laminate by using several layers of shell elements constrained together with compatibility equations. Element layering is used to ensure the proper transverse shear deformation in the laminate rubber layer. The second type of model uses three-dimensional elements. Since no graphical pre-processor currently supports super-elements, a special technique based on master-elements was developed. Master-elements are representations of super-elements that are used in conjunction with a custom translator to write the superelement connectivities as input decks for ABAQUS.
Producing an Infrared Multiwavelength Galactic Plane Atlas Using Montage, Pegasus, and Amazon Web Services

NASA Astrophysics Data System (ADS)

Rynge, M.; Juve, G.; Kinney, J.; Good, J.; Berriman, B.; Merrihew, A.; Deelman, E.

2014-05-01

In this paper, we describe how to leverage cloud resources to generate large-scale mosaics of the galactic plane in multiple wavelengths. Our goal is to generate a 16-wavelength infrared Atlas of the Galactic Plane at a common spatial sampling of 1 arcsec, processed so that they appear to have been measured with a single instrument. This will be achieved by using the Montage image mosaic engine process observations from the 2MASS, GLIMPSE, MIPSGAL, MSX and WISE datasets, over a wavelength range of 1 μm to 24 μm, and by using the Pegasus Workflow Management System for managing the workload. When complete, the Atlas will be made available to the community as a data product. We are generating images that cover ±180° in Galactic longitude and ±20° in Galactic latitude, to the extent permitted by the spatial coverage of each dataset. Each image will be 5°x5° in size (including an overlap of 1° with neighboring tiles), resulting in an atlas of 1,001 images. The final size will be about 50 TBs. This paper will focus on the computational challenges, solutions, and lessons learned in producing the Atlas. To manage the computation we are using the Pegasus Workflow Management System, a mature, highly fault-tolerant system now in release 4.2.2 that has found wide applicability across many science disciplines. A scientific workflow describes the dependencies between the tasks and in most cases the workflow is described as a directed acyclic graph, where the nodes are tasks and the edges denote the task dependencies. A defining property for a scientific workflow is that it manages data flow between tasks. Applied to the galactic plane project, each 5 by 5 mosaic is a Pegasus workflow. Pegasus is used to fetch the source images, execute the image mosaicking steps of Montage, and store the final outputs in a storage system. As these workflows are very I/O intensive, care has to be taken when choosing what infrastructure to execute the workflow on. In our setup, we choose to use dynamically provisioned compute clusters running on the Amazon Elastic Compute Cloud (EC2). All our instances are using the same base image, which is configured to come up as a master node by default. The master node is a central instance from where the workflow can be managed. Additional worker instances are provisioned and configured to accept work assignments from the master node. The system allows for adding/removing workers in an ad hoc fashion, and could be run in large configurations. To-date we have performed 245,000 CPU hours of computing and generated 7,029 images and totaling 30 TB. With the current set up our runtime would be 340,000 CPU hours for the whole project. Using spot m2.4xlarge instances, the cost would be approximately $5,950. Using faster AWS instances, such as cc2.8xlarge could potentially decrease the total CPU hours and further reduce the compute costs. The paper will explore these tradeoffs.
Using Intel's Knight Landing Processor to Accelerate Global Nested Air Quality Prediction Modeling System (GNAQPMS) Model

NASA Astrophysics Data System (ADS)

Wang, H.; Chen, H.; Chen, X.; Wu, Q.; Wang, Z.

2016-12-01

The Global Nested Air Quality Prediction Modeling System for Hg (GNAQPMS-Hg) is a global chemical transport model coupled Hg transport module to investigate the mercury pollution. In this study, we present our work of transplanting the GNAQPMS model on Intel Xeon Phi processor, Knights Landing (KNL) to accelerate the model. KNL is the second-generation product adopting Many Integrated Core Architecture (MIC) architecture. Compared with the first generation Knight Corner (KNC), KNL has more new hardware features, that it can be used as unique processor as well as coprocessor with other CPU. According to the Vtune tool, the high overhead modules in GNAQPMS model have been addressed, including CBMZ gas chemistry, advection and convection module, and wet deposition module. These high overhead modules were accelerated by optimizing code and using new techniques of KNL. The following optimized measures was done: 1) Changing the pure MPI parallel mode to hybrid parallel mode with MPI and OpenMP; 2.Vectorizing the code to using the 512-bit wide vector computation unit. 3. Reducing unnecessary memory access and calculation. 4. Reducing Thread Local Storage (TLS) for common variables with each OpenMP thread in CBMZ. 5. Changing the way of global communication from files writing and reading to MPI functions. After optimization, the performance of GNAQPMS is greatly increased both on CPU and KNL platform, the single-node test showed that optimized version has 2.6x speedup on two sockets CPU platform and 3.3x speedup on one socket KNL platform compared with the baseline version code, which means the KNL has 1.29x speedup when compared with 2 sockets CPU platform.
Real-Time Spaceborne Synthetic Aperture Radar Float-Point Imaging System Using Optimized Mapping Methodology and a Multi-Node Parallel Accelerating Technique

PubMed Central

Li, Bingyi; Chen, Liang; Yu, Wenyue; Xie, Yizhuang; Bian, Mingming; Zhang, Qingjun; Pang, Long

2018-01-01

With the development of satellite load technology and very large-scale integrated (VLSI) circuit technology, on-board real-time synthetic aperture radar (SAR) imaging systems have facilitated rapid response to disasters. A key goal of the on-board SAR imaging system design is to achieve high real-time processing performance under severe size, weight, and power consumption constraints. This paper presents a multi-node prototype system for real-time SAR imaging processing. We decompose the commonly used chirp scaling (CS) SAR imaging algorithm into two parts according to the computing features. The linearization and logic-memory optimum allocation methods are adopted to realize the nonlinear part in a reconfigurable structure, and the two-part bandwidth balance method is used to realize the linear part. Thus, float-point SAR imaging processing can be integrated into a single Field Programmable Gate Array (FPGA) chip instead of relying on distributed technologies. A single-processing node requires 10.6 s and consumes 17 W to focus on 25-km swath width, 5-m resolution stripmap SAR raw data with a granularity of 16,384 × 16,384. The design methodology of the multi-FPGA parallel accelerating system under the real-time principle is introduced. As a proof of concept, a prototype with four processing nodes and one master node is implemented using a Xilinx xc6vlx315t FPGA. The weight and volume of one single machine are 10 kg and 32 cm × 24 cm × 20 cm, respectively, and the power consumption is under 100 W. The real-time performance of the proposed design is demonstrated on Chinese Gaofen-3 stripmap continuous imaging. PMID:29495637
Reduze - Feynman integral reduction in C++

NASA Astrophysics Data System (ADS)

Studerus, C.

2010-07-01

Reduze is a computer program for reducing Feynman integrals to master integrals employing a Laporta algorithm. The program is written in C++ and uses classes provided by the GiNaC library to perform the simplifications of the algebraic prefactors in the system of equations. Reduze offers the possibility to run reductions in parallel. Program summaryProgram title:Reduze Catalogue identifier: AEGE_v1_0 Program summary URL:http://cpc.cs.qub.ac.uk/summaries/AEGE_v1_0.html Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland Licensing provisions:: yes No. of lines in distributed program, including test data, etc.: 55 433 No. of bytes in distributed program, including test data, etc.: 554 866 Distribution format: tar.gz Programming language: C++ Computer: All Operating system: Unix/Linux Number of processors used: The number of processors is problem dependent. More than one possible but not arbitrary many. RAM: Depends on the complexity of the system. Classification: 4.4, 5 External routines: CLN ( http://www.ginac.de/CLN/), GiNaC ( http://www.ginac.de/) Nature of problem: Solving large systems of linear equations with Feynman integrals as unknowns and rational polynomials as prefactors. Solution method: Using a Gauss/Laporta algorithm to solve the system of equations. Restrictions: Limitations depend on the complexity of the system (number of equations, number of kinematic invariants). Running time: Depends on the complexity of the system.
CanOpen on RASTA: The Integration of the CanOpen IP Core in the Avionics Testbed

NASA Astrophysics Data System (ADS)

Furano, Gianluca; Guettache, Farid; Magistrati, Giorgio; Tiotto, Gabriele; Ortega, Carlos Urbina; Valverde, Alberto

2013-08-01

This paper presents the work done within the ESA Estec Data Systems Division, targeting the integration of the CanOpen IP Core with the existing Reference Architecture Test-bed for Avionics (RASTA). RASTA is the reference testbed system of the ESA Avionics Lab, designed to integrate the main elements of a typical Data Handling system. It aims at simulating a scenario where a Mission Control Center communicates with on-board computers and systems through a TM/TC link, thus providing the data management through qualified processors and interfaces such as Leon2 core processors, CAN bus controllers, MIL-STD-1553 and SpaceWire. This activity aims at the extension of the RASTA with two boards equipped with HurriCANe controller, acting as CANOpen slaves. CANOpen software modules have been ported on the RASTA system I/O boards equipped with Gaisler GR-CAN controller and acts as master communicating with the CCIPC boards. CanOpen serves as upper application layer for based on CAN defined within the CAN-in-Automation standard and can be regarded as the definitive standard for the implementation of CAN-based systems solutions. The development and integration of CCIPC performed by SITAEL S.p.A., is the first application that aims to bring the CANOpen standard for space applications. The definition of CANOpen within the European Cooperation for Space Standardization (ECSS) is under development.
A Decentralized Event-Triggered Dissipative Control Scheme for Systems With Multiple Sensors to Sample the System Outputs.

PubMed

Zhang, Xian-Ming; Han, Qing-Long

2016-12-01

This paper is concerned with decentralized event-triggered dissipative control for systems with the entries of the system outputs having different physical properties. Depending on these different physical properties, the entries of the system outputs are grouped into multiple nodes. A number of sensors are used to sample the signals from different nodes. A decentralized event-triggering scheme is introduced to select those necessary sampled-data packets to be transmitted so that communication resources can be saved significantly while preserving the prescribed closed-loop performance. First, in order to organize the decentralized data packets transmitted from the sensor nodes, a data packet processor (DPP) is used to generate a new signal to be held by the zero-order-hold once the signal stored by the DPP is updated at some time instant. Second, under the mechanism of the DPP, the resulting closed-loop system is modeled as a linear system with an interval time-varying delay. A sufficient condition is derived such that the closed-loop system is asymptotically stable and strictly (Q 0 ,S 0 ,R 0 ) -dissipative, where Q 0 ,S 0 , and R 0 are real matrices of appropriate dimensions with Q 0 and R 0 symmetric. Third, suitable output-based controllers can be designed based on solutions to a set of a linear matrix inequality. Finally, two examples are given to demonstrate the effectiveness of the proposed method.
Sam2bam: High-Performance Framework for NGS Data Preprocessing Tools

PubMed Central

Cheng, Yinhe; Tzeng, Tzy-Hwa Kathy

2016-01-01

This paper introduces a high-throughput software tool framework called sam2bam that enables users to significantly speed up pre-processing for next-generation sequencing data. The sam2bam is especially efficient on single-node multi-core large-memory systems. It can reduce the runtime of data pre-processing in marking duplicate reads on a single node system by 156–186x compared with de facto standard tools. The sam2bam consists of parallel software components that can fully utilize multiple processors, available memory, high-bandwidth storage, and hardware compression accelerators, if available. The sam2bam provides file format conversion between well-known genome file formats, from SAM to BAM, as a basic feature. Additional features such as analyzing, filtering, and converting input data are provided by using plug-in tools, e.g., duplicate marking, which can be attached to sam2bam at runtime. We demonstrated that sam2bam could significantly reduce the runtime of next generation sequencing (NGS) data pre-processing from about two hours to about one minute for a whole-exome data set on a 16-core single-node system using up to 130 GB of memory. The sam2bam could reduce the runtime of NGS data pre-processing from about 20 hours to about nine minutes for a whole-genome sequencing data set on the same system using up to 711 GB of memory. PMID:27861637
Performance analysis of three dimensional integral equation computations on a massively parallel computer. M.S. Thesis

NASA Technical Reports Server (NTRS)

Logan, Terry G.

1994-01-01

The purpose of this study is to investigate the performance of the integral equation computations using numerical source field-panel method in a massively parallel processing (MPP) environment. A comparative study of computational performance of the MPP CM-5 computer and conventional Cray-YMP supercomputer for a three-dimensional flow problem is made. A serial FORTRAN code is converted into a parallel CM-FORTRAN code. Some performance results are obtained on CM-5 with 32, 62, 128 nodes along with those on Cray-YMP with a single processor. The comparison of the performance indicates that the parallel CM-FORTRAN code near or out-performs the equivalent serial FORTRAN code for some cases.
Towards an Autonomic Cluster Management System (ACMS) with Reflex Autonomicity

NASA Technical Reports Server (NTRS)

Truszkowski, Walt; Hinchey, Mike; Sterritt, Roy

2005-01-01

Cluster computing, whereby a large number of simple processors or nodes are combined together to apparently function as a single powerful computer, has emerged as a research area in its own right. The approach offers a relatively inexpensive means of providing a fault-tolerant environment and achieving significant computational capabilities for high-performance computing applications. However, the task of manually managing and configuring a cluster quickly becomes daunting as the cluster grows in size. Autonomic computing, with its vision to provide self-management, can potentially solve many of the problems inherent in cluster management. We describe the development of a prototype Autonomic Cluster Management System (ACMS) that exploits autonomic properties in automating cluster management and its evolution to include reflex reactions via pulse monitoring.
High-performance reconfigurable hardware architecture for restricted Boltzmann machines.

PubMed

Ly, Daniel Le; Chow, Paul

2010-11-01

Despite the popularity and success of neural networks in research, the number of resulting commercial or industrial applications has been limited. A primary cause for this lack of adoption is that neural networks are usually implemented as software running on general-purpose processors. Hence, a hardware implementation that can exploit the inherent parallelism in neural networks is desired. This paper investigates how the restricted Boltzmann machine (RBM), which is a popular type of neural network, can be mapped to a high-performance hardware architecture on field-programmable gate array (FPGA) platforms. The proposed modular framework is designed to reduce the time complexity of the computations through heavily customized hardware engines. A method to partition large RBMs into smaller congruent components is also presented, allowing the distribution of one RBM across multiple FPGA resources. The framework is tested on a platform of four Xilinx Virtex II-Pro XC2VP70 FPGAs running at 100 MHz through a variety of different configurations. The maximum performance was obtained by instantiating an RBM of 256 × 256 nodes distributed across four FPGAs, which resulted in a computational speed of 3.13 billion connection-updates-per-second and a speedup of 145-fold over an optimized C program running on a 2.8-GHz Intel processor.
PC-CUBE: A Personal Computer Based Hypercube

NASA Technical Reports Server (NTRS)

Ho, Alex; Fox, Geoffrey; Walker, David; Snyder, Scott; Chang, Douglas; Chen, Stanley; Breaden, Matt; Cole, Terry

1988-01-01

PC-CUBE is an ensemble of IBM PCs or close compatibles connected in the hypercube topology with ordinary computer cables. Communication occurs at the rate of 115.2 K-band via the RS-232 serial links. Available for PC-CUBE is the Crystalline Operating System III (CrOS III), Mercury Operating System, CUBIX and PLOTIX which are parallel I/O and graphics libraries. A CrOS performance monitor was developed to facilitate the measurement of communication and computation time of a program and their effects on performance. Also available are CXLISP, a parallel version of the XLISP interpreter; GRAFIX, some graphics routines for the EGA and CGA; and a general execution profiler for determining execution time spent by program subroutines. PC-CUBE provides a programming environment similar to all hypercube systems running CrOS III, Mercury and CUBIX. In addition, every node (personal computer) has its own graphics display monitor and storage devices. These allow data to be displayed or stored at every processor, which has much instructional value and enables easier debugging of applications. Some application programs which are taken from the book Solving Problems on Concurrent Processors (Fox 88) were implemented with graphics enhancement on PC-CUBE. The applications range from solving the Mandelbrot set, Laplace equation, wave equation, long range force interaction, to WaTor, an ecological simulation.
An Atmospheric General Circulation Model with Chemistry for the CRAY T3E: Design, Performance Optimization and Coupling to an Ocean Model

NASA Technical Reports Server (NTRS)

Farrara, John D.; Drummond, Leroy A.; Mechoso, Carlos R.; Spahr, Joseph A.

1998-01-01

The design, implementation and performance optimization on the CRAY T3E of an atmospheric general circulation model (AGCM) which includes the transport of, and chemical reactions among, an arbitrary number of constituents is reviewed. The parallel implementation is based on a two-dimensional (longitude and latitude) data domain decomposition. Initial optimization efforts centered on minimizing the impact of substantial static and weakly-dynamic load imbalances among processors through load redistribution schemes. Recent optimization efforts have centered on single-node optimization. Strategies employed include loop unrolling, both manually and through the compiler, the use of an optimized assembler-code library for special function calls, and restructuring of parts of the code to improve data locality. Data exchanges and synchronizations involved in coupling different data-distributed models can account for a significant fraction of the running time. Therefore, the required scattering and gathering of data must be optimized. In systems such as the T3E, there is much more aggregate bandwidth in the total system than in any particular processor. This suggests a distributed design. The design and implementation of a such distributed 'Data Broker' as a means to efficiently couple the components of our climate system model is described.
A web-based institutional DICOM distribution system with the integration of the Clinical Trial Processor (CTP).

PubMed

Aryanto, K Y E; Broekema, A; Langenhuysen, R G A; Oudkerk, M; van Ooijen, P M A

2015-05-01

To develop and test a fast and easy rule-based web-environment with optional de-identification of imaging data to facilitate data distribution within a hospital environment. A web interface was built using Hypertext Preprocessor (PHP), an open source scripting language for web development, and Java with SQL Server to handle the database. The system allows for the selection of patient data and for de-identifying these when necessary. Using the services provided by the RSNA Clinical Trial Processor (CTP), the selected images were pushed to the appropriate services using a protocol based on the module created for the associated task. Five pipelines, each performing a different task, were set up in the server. In a 75 month period, more than 2,000,000 images are transferred and de-identified in a proper manner while 20,000,000 images are moved from one node to another without de-identification. While maintaining a high level of security and stability, the proposed system is easy to setup, it integrate well with our clinical and research practice and it provides a fast and accurate vendor-neutral process of transferring, de-identifying, and storing DICOM images. Its ability to run different de-identification processes in parallel pipelines is a major advantage in both clinical and research setting.
International Space Station (ISS)

NASA Image and Video Library

2001-03-01

The Environmental Control and Life Support System (ECLSS) Group of the Flight Projects Directorate at the Marshall Space Flight Center in Huntsville, Alabama, is responsible for designing and building the life support systems that will provide the crew of the International Space Station (ISS) a comfortable environment in which to live and work. This photograph shows the mockup of the the ECLSS to be installed in the Node 3 module of the ISS. From left to right, shower rack, waste management rack, Water Recovery System (WRS) Rack #2, WRS Rack #1, and Oxygen Generation System (OGS) rack are shown. The WRS provides clean water through the reclamation of wastewaters and is comprised of a Urine Processor Assembly (UPA) and a Water Processor Assembly (WPA). The UPA accepts and processes pretreated crewmember urine to allow it to be processed along with other wastewaters in the WPA. The WPA removes free gas, organic, and nonorganic constituents before the water goes through a series of multifiltration beds for further purification. The OGS produces oxygen for breathing air for the crew and laboratory animals, as well as for replacing oxygen loss. The OGS is comprised of a cell stack, which electrolyzes (breaks apart the hydrogen and oxygen molecules) some of the clean water provided by the WRS, and the separators that remove the gases from the water after electrolysis.

SpaceWire Driver Software for Special DSPs

NASA Technical Reports Server (NTRS)

Clark, Douglas; Lux, James; Nishimoto, Kouji; Lang, Minh

2003-01-01

A computer program provides a high-level C-language interface to electronics circuitry that controls a SpaceWire interface in a system based on a space qualified version of the ADSP-21020 digital signal processor (DSP). SpaceWire is a spacecraft-oriented standard for packet-switching data-communication networks that comprise nodes connected through bidirectional digital serial links that utilize low-voltage differential signaling (LVDS). The software is tailored to the SMCS-332 application-specific integrated circuit (ASIC) (also available as the TSS901E), which provides three highspeed (150 Mbps) serial point-to-point links compliant with the proposed Institute of Electrical and Electronics Engineers (IEEE) Standard 1355.2 and equivalent European Space Agency (ESA) Standard ECSS-E-50-12. In the specific application of this software, the SpaceWire ASIC was combined with the DSP processor, memory, and control logic in a Multi-Chip Module DSP (MCM-DSP). The software is a collection of low-level driver routines that provide a simple message-passing application programming interface (API) for software running on the DSP. Routines are provided for interrupt-driven access to the two styles of interface provided by the SMCS: (1) the "word at a time" conventional host interface (HOCI); and (2) a higher performance "dual port memory" style interface (COMI).
Voltage scheduling for low power/energy

NASA Astrophysics Data System (ADS)

Manzak, Ali

2001-07-01

Power considerations have become an increasingly dominant factor in the design of both portable and desk-top systems. An effective way to reduce power consumption is to lower the supply voltage since voltage is quadratically related to power. This dissertation considers the problem of lowering the supply voltage at (i) the system level and at (ii) the behavioral level. At the system level, the voltage of the variable voltage processor is dynamically changed with the work load. Processors with limited sized buffers as well as those with very large buffers are considered. Given the task arrival times, deadline times, execution times, periods and switching activities, task scheduling algorithms that minimize energy or peak power are developed for the processors equipped with very large buffers. A relation between the operating voltages of the tasks for minimum energy/power is determined using the Lagrange multiplier method, and an iterative algorithm that utilizes this relation is developed. Experimental results show that the voltage assignment obtained by the proposed algorithm is very close (0.1% error) to that of the optimal energy assignment and the optimal peak power (1% error) assignment. Next, on-line and off-fine minimum energy task scheduling algorithms are developed for processors with limited sized buffers. These algorithms have polynomial time complexity and present optimal (off-line) and close-to-optimal (on-line) solutions. A procedure to calculate the minimum buffer size given information about the size of the task (maximum, minimum), execution time (best case, worst case) and deadlines is also presented. At the behavioral level, resources operating at multiple voltages are used to minimize power while maintaining the throughput. Such a scheme has the advantage of allowing modules on the critical paths to be assigned to the highest voltage levels (thus meeting the required timing constraints) while allowing modules on non-critical paths to be assigned to lower voltage levels (thus reducing the power consumption). A polynomial time resource and latency constrained scheduling algorithm is developed to distribute the available slack among the nodes such that power consumption is minimum. The algorithm is iterative and utilizes the slack based on the Lagrange multiplier method.
Real-time synchronization of wireless sensor network by 1-PPS signal

NASA Astrophysics Data System (ADS)

Giammarini, Marco; Pieralisi, Marco; Isidori, Daniela; Concettoni, Enrico; Cristalli, Cristina; Fioravanti, Matteo

2015-05-01

The use of wireless sensor networks with different nodes is desirable in a smart environment, because the network setting up and installation on preexisting structures can be done without a fixed cabled infrastructure. The flexibility of the monitoring system is fundamental where the use of a considerable quantity of cables could compromise the normal exercise, could affect the quality of acquired signal and finally increase the cost of the materials and installation. The network is composed of several intelligent "nodes", which acquires data from different kind of sensors, and then store or transmit them to a central elaboration unit. The synchronization of data acquisition is the core of the real-time wireless sensor network (WSN). In this paper, we present a comparison between different methods proposed by literature for the real-time acquisition in a WSN and finally we present our solution based on 1-Pulse-Per-Second (1-PPS) signal generated by GPS systems. The sensor node developed is a small-embedded system based on ARM microcontroller that manages the acquisition, the timing and the post-processing of the data. The communications between the sensors and the master based on IEEE 802.15.4 protocol and managed by dedicated software. Finally, we present the preliminary results obtained on a 3 floor building simulator with the wireless sensors system developed.
Non-Markovian dynamics in chiral quantum networks with spins and photons

NASA Astrophysics Data System (ADS)

Ramos, Tomás; Vermersch, Benoît; Hauke, Philipp; Pichler, Hannes; Zoller, Peter

2016-06-01

We study the dynamics of chiral quantum networks consisting of nodes coupled by unidirectional or asymmetric bidirectional quantum channels. In contrast to familiar photonic networks where driven two-level atoms exchange photons via 1D photonic nanostructures, we propose and study a setup where interactions between the atoms are mediated by spin excitations (magnons) in 1D X X spin chains representing spin waveguides. While Markovian quantum network theory eliminates quantum channels as structureless reservoirs in a Born-Markov approximation to obtain a master equation for the nodes, we are interested in non-Markovian dynamics. This arises from the nonlinear character of the dispersion with band-edge effects, and from finite spin propagation velocities leading to time delays in interactions. To account for the non-Markovian dynamics we treat the quantum degrees of freedom of the nodes and connecting channel as a composite spin system with the surrounding of the quantum network as a Markovian bath, allowing for an efficient solution with time-dependent density matrix renormalization-group techniques. We illustrate our approach showing non-Markovian effects in the driven-dissipative formation of quantum dimers, and we present examples for quantum information protocols involving quantum state transfer with engineered elements as basic building blocks of quantum spintronic circuits.
CrossVit: enhancing canopy monitoring management practices in viticulture.

PubMed

Matese, Alessandro; Vaccari, Francesco Primo; Tomasi, Diego; Di Gennaro, Salvatore Filippo; Primicerio, Jacopo; Sabatini, Francesco; Guidoni, Silvia

2013-06-13

A new wireless sensor network (WSN), called CrossVit, and based on MEMSIC products, has been tested for two growing seasons in two vineyards in Italy. The aims are to evaluate the monitoring performances of the new WSN directly in the vineyard and collect air temperature, air humidity and solar radiation data to support vineyard management practices. The WSN consists of various levels: the Master/Gateway level coordinates the WSN and performs data aggregation; the Farm/Server level takes care of storing data on a server, data processing and graphic rendering; Nodes level is based on a network of peripheral nodes consisting of a MDA300 sensor board and Iris module and equipped with thermistors for air temperature, photodiodes for global and diffuse solar radiation, and an HTM2500LF sensor for relative humidity. The communication levels are: WSN links between gateways and sensor nodes by ZigBee, and long-range GSM/GPRS links between gateways and the server farm level. The system was able to monitor the agrometeorological parameters in the vineyard: solar radiation, air temperature and air humidity, detecting the differences between the canopy treatments applied. The performance of CrossVit, in terms of monitoring and reliability of the system, have been evaluated considering: its handiness, cost-effective, non-invasive dimensions and low power consumption.
CrossVit: Enhancing Canopy Monitoring Management Practices in Viticulture

PubMed Central

Matese, Alessandro; Vaccari, Francesco Primo; Tomasi, Diego; Di Gennaro, Salvatore Filippo; Primicerio, Jacopo; Sabatini, Francesco; Guidoni, Silvia

2013-01-01

A new wireless sensor network (WSN), called CrossVit, and based on MEMSIC products, has been tested for two growing seasons in two vineyards in Italy. The aims are to evaluate the monitoring performances of the new WSN directly in the vineyard and collect air temperature, air humidity and solar radiation data to support vineyard management practices. The WSN consists of various levels: the Master/Gateway level coordinates the WSN and performs data aggregation; the Farm/Server level takes care of storing data on a server, data processing and graphic rendering; Nodes level is based on a network of peripheral nodes consisting of a MDA300 sensor board and Iris module and equipped with thermistors for air temperature, photodiodes for global and diffuse solar radiation, and an HTM2500LF sensor for relative humidity. The communication levels are: WSN links between gateways and sensor nodes by ZigBee, and long-range GSM/GPRS links between gateways and the server farm level. The system was able to monitor the agrometeorological parameters in the vineyard: solar radiation, air temperature and air humidity, detecting the differences between the canopy treatments applied. The performance of CrossVit, in terms of monitoring and reliability of the system, have been evaluated considering: its handiness, cost-effective, non-invasive dimensions and low power consumption. PMID:23765273
Supermassive Black Hole Binaries in High Performance Massively Parallel Direct N-body Simulations on Large GPU Clusters

NASA Astrophysics Data System (ADS)

Spurzem, R.; Berczik, P.; Zhong, S.; Nitadori, K.; Hamada, T.; Berentzen, I.; Veles, A.

2012-07-01

Astrophysical Computer Simulations of Dense Star Clusters in Galactic Nuclei with Supermassive Black Holes are presented using new cost-efficient supercomputers in China accelerated by graphical processing cards (GPU). We use large high-accuracy direct N-body simulations with Hermite scheme and block-time steps, parallelised across a large number of nodes on the large scale and across many GPU thread processors on each node on the small scale. A sustained performance of more than 350 Tflop/s for a science run on using simultaneously 1600 Fermi C2050 GPUs is reached; a detailed performance model is presented and studies for the largest GPU clusters in China with up to Petaflop/s performance and 7000 Fermi GPU cards. In our case study we look at two supermassive black holes with equal and unequal masses embedded in a dense stellar cluster in a galactic nucleus. The hardening processes due to interactions between black holes and stars, effects of rotation in the stellar system and relativistic forces between the black holes are simultaneously taken into account. The simulation stops at the complete relativistic merger of the black holes.
Comparing the Performance of Blue Gene/Q with Leading Cray XE6 and InfiniBand Systems

DOE Office of Scientific and Technical Information (OSTI.GOV)

Kerbyson, Darren J.; Barker, Kevin J.; Vishnu, Abhinav

2013-01-21

Abstract—Three types of systems dominate the current High Performance Computing landscape: the Cray XE6, the IBM Blue Gene, and commodity clusters using InfiniBand. These systems have quite different characteristics making the choice for a particular deployment difficult. The XE6 uses Cray’s proprietary Gemini 3-D torus interconnect with two nodes at each network endpoint. The latest IBM Blue Gene/Q uses a single socket integrating processor and communication in a 5-D torus network. InfiniBand provides the flexibility of using nodes from many vendors connected in many possible topologies. The performance characteristics of each vary vastly along with their utilization model. In thismore » work we compare the performance of these three systems using a combination of micro-benchmarks and a set of production applications. In particular we discuss the causes of variability in performance across the systems and also quantify where performance is lost using a combination of measurements and models. Our results show that significant performance can be lost in normal production operation of the Cray XT6 and InfiniBand Clusters in comparison to Blue Gene/Q.« less
MSAProbs-MPI: parallel multiple sequence aligner for distributed-memory systems.

PubMed

González-Domínguez, Jorge; Liu, Yongchao; Touriño, Juan; Schmidt, Bertil

2016-12-15

MSAProbs is a state-of-the-art protein multiple sequence alignment tool based on hidden Markov models. It can achieve high alignment accuracy at the expense of relatively long runtimes for large-scale input datasets. In this work we present MSAProbs-MPI, a distributed-memory parallel version of the multithreaded MSAProbs tool that is able to reduce runtimes by exploiting the compute capabilities of common multicore CPU clusters. Our performance evaluation on a cluster with 32 nodes (each containing two Intel Haswell processors) shows reductions in execution time of over one order of magnitude for typical input datasets. Furthermore, MSAProbs-MPI using eight nodes is faster than the GPU-accelerated QuickProbs running on a Tesla K20. Another strong point is that MSAProbs-MPI can deal with large datasets for which MSAProbs and QuickProbs might fail due to time and memory constraints, respectively. Source code in C ++ and MPI running on Linux systems as well as a reference manual are available at http://msaprobs.sourceforge.net CONTACT: jgonzalezd@udc.esSupplementary information: Supplementary data are available at Bioinformatics online. © The Author 2016. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.
Harnessing the power of emerging petascale platforms

NASA Astrophysics Data System (ADS)

Mellor-Crummey, John

2007-07-01

As part of the US Department of Energy's Scientific Discovery through Advanced Computing (SciDAC-2) program, science teams are tackling problems that require computational simulation and modeling at the petascale. A grand challenge for computer science is to develop software technology that makes it easier to harness the power of these systems to aid scientific discovery. As part of its activities, the SciDAC-2 Center for Scalable Application Development Software (CScADS) is building open source software tools to support efficient scientific computing on the emerging leadership-class platforms. In this paper, we describe two tools for performance analysis and tuning that are being developed as part of CScADS: a tool for analyzing scalability and performance, and a tool for optimizing loop nests for better node performance. We motivate these tools by showing how they apply to S3D, a turbulent combustion code under development at Sandia National Laboratory. For S3D, our node performance analysis tool helped uncover several performance bottlenecks. Using our loop nest optimization tool, we transformed S3D's most costly loop nest to reduce execution time by a factor of 2.94 for a processor working on a 503 domain.
Fast computation of voxel-level brain connectivity maps from resting-state functional MRI using l₁-norm as approximation of Pearson's temporal correlation: proof-of-concept and example vector hardware implementation.

PubMed

Minati, Ludovico; Zacà, Domenico; D'Incerti, Ludovico; Jovicich, Jorge

2014-09-01

An outstanding issue in graph-based analysis of resting-state functional MRI is choice of network nodes. Individual consideration of entire brain voxels may represent a less biased approach than parcellating the cortex according to pre-determined atlases, but entails establishing connectedness for 1(9)-1(11) links, with often prohibitive computational cost. Using a representative Human Connectome Project dataset, we show that, following appropriate time-series normalization, it may be possible to accelerate connectivity determination replacing Pearson correlation with l1-norm. Even though the adjacency matrices derived from correlation coefficients and l1-norms are not identical, their similarity is high. Further, we describe and provide in full an example vector hardware implementation of l1-norm on an array of 4096 zero instruction-set processors. Calculation times <1000 s are attainable, removing the major deterrent to voxel-based resting-sate network mapping and revealing fine-grained node degree heterogeneity. L1-norm should be given consideration as a substitute for correlation in very high-density resting-state functional connectivity analyses. Copyright © 2014 IPEM. Published by Elsevier Ltd. All rights reserved.
Entanglement of spin waves among four quantum memories.

PubMed

Choi, K S; Goban, A; Papp, S B; van Enk, S J; Kimble, H J

2010-11-18

Quantum networks are composed of quantum nodes that interact coherently through quantum channels, and open a broad frontier of scientific opportunities. For example, a quantum network can serve as a 'web' for connecting quantum processors for computation and communication, or as a 'simulator' allowing investigations of quantum critical phenomena arising from interactions among the nodes mediated by the channels. The physical realization of quantum networks generically requires dynamical systems capable of generating and storing entangled states among multiple quantum memories, and efficiently transferring stored entanglement into quantum channels for distribution across the network. Although such capabilities have been demonstrated for diverse bipartite systems, entangled states have not been achieved for interconnects capable of 'mapping' multipartite entanglement stored in quantum memories to quantum channels. Here we demonstrate measurement-induced entanglement stored in four atomic memories; user-controlled, coherent transfer of the atomic entanglement to four photonic channels; and characterization of the full quadripartite entanglement using quantum uncertainty relations. Our work therefore constitutes an advance in the distribution of multipartite entanglement across quantum networks. We also show that our entanglement verification method is suitable for studying the entanglement order of condensed-matter systems in thermal equilibrium.
Graphics Processing Unit Acceleration of Gyrokinetic Turbulence Simulations

NASA Astrophysics Data System (ADS)

Hause, Benjamin; Parker, Scott; Chen, Yang

2013-10-01

We find a substantial increase in on-node performance using Graphics Processing Unit (GPU) acceleration in gyrokinetic delta-f particle-in-cell simulation. Optimization is performed on a two-dimensional slab gyrokinetic particle simulation using the Portland Group Fortran compiler with the OpenACC compiler directives and Fortran CUDA. Mixed implementation of both Open-ACC and CUDA is demonstrated. CUDA is required for optimizing the particle deposition algorithm. We have implemented the GPU acceleration on a third generation Core I7 gaming PC with two NVIDIA GTX 680 GPUs. We find comparable, or better, acceleration relative to the NERSC DIRAC cluster with the NVIDIA Tesla C2050 computing processor. The Tesla C 2050 is about 2.6 times more expensive than the GTX 580 gaming GPU. We also see enormous speedups (10 or more) on the Titan supercomputer at Oak Ridge with Kepler K20 GPUs. Results show speed-ups comparable or better than that of OpenMP models utilizing multiple cores. The use of hybrid OpenACC, CUDA Fortran, and MPI models across many nodes will also be discussed. Optimization strategies will be presented. We will discuss progress on optimizing the comprehensive three dimensional general geometry GEM code.
A modified implementation of tristate inverter based static master-slave flip-flop with improved power-delay-area product.

PubMed

Singh, Kunwar; Tiwari, Satish Chandra; Gupta, Maneesha

2014-01-01

The paper introduces novel architectures for implementation of fully static master-slave flip-flops for low power, high performance, and high density. Based on the proposed structure, traditional C(2)MOS latch (tristate inverter/clocked inverter) based flip-flop is implemented with fewer transistors. The modified C(2)MOS based flip-flop designs mC(2)MOSff1 and mC(2)MOSff2 are realized using only sixteen transistors each while the number of clocked transistors is also reduced in case of mC(2)MOSff1. Postlayout simulations indicate that mC(2)MOSff1 flip-flop shows 12.4% improvement in PDAP (power-delay-area product) when compared with transmission gate flip-flop (TGFF) at 16X capacitive load which is considered to be the best design alternative among the conventional master-slave flip-flops. To validate the correct behaviour of the proposed design, an eight bit asynchronous counter is designed to layout level. LVS and parasitic extraction were carried out on Calibre, whereas layouts were implemented using IC station (Mentor Graphics). HSPICE simulations were used to characterize the transient response of the flip-flop designs in a 180 nm/1.8 V CMOS technology. Simulations were also performed at 130 nm, 90 nm, and 65 nm to reveal the scalability of both the designs at modern process nodes.
A Modified Implementation of Tristate Inverter Based Static Master-Slave Flip-Flop with Improved Power-Delay-Area Product

PubMed Central

Tiwari, Satish Chandra; Gupta, Maneesha

2014-01-01

The paper introduces novel architectures for implementation of fully static master-slave flip-flops for low power, high performance, and high density. Based on the proposed structure, traditional C2MOS latch (tristate inverter/clocked inverter) based flip-flop is implemented with fewer transistors. The modified C2MOS based flip-flop designs mC2MOSff1 and mC2MOSff2 are realized using only sixteen transistors each while the number of clocked transistors is also reduced in case of mC2MOSff1. Postlayout simulations indicate that mC2MOSff1 flip-flop shows 12.4% improvement in PDAP (power-delay-area product) when compared with transmission gate flip-flop (TGFF) at 16X capacitive load which is considered to be the best design alternative among the conventional master-slave flip-flops. To validate the correct behaviour of the proposed design, an eight bit asynchronous counter is designed to layout level. LVS and parasitic extraction were carried out on Calibre, whereas layouts were implemented using IC station (Mentor Graphics). HSPICE simulations were used to characterize the transient response of the flip-flop designs in a 180 nm/1.8 V CMOS technology. Simulations were also performed at 130 nm, 90 nm, and 65 nm to reveal the scalability of both the designs at modern process nodes. PMID:24723808
A Novel Coarsening Method for Scalable and Efficient Mesh Generation

DOE Office of Scientific and Technical Information (OSTI.GOV)

Yoo, A; Hysom, D; Gunney, B

2010-12-02

In this paper, we propose a novel mesh coarsening method called brick coarsening method. The proposed method can be used in conjunction with any graph partitioners and scales to very large meshes. This method reduces problem space by decomposing the original mesh into fixed-size blocks of nodes called bricks, layered in a similar way to conventional brick laying, and then assigning each node of the original mesh to appropriate brick. Our experiments indicate that the proposed method scales to very large meshes while allowing simple RCB partitioner to produce higher-quality partitions with significantly less edge cuts. Our results further indicatemore » that the proposed brick-coarsening method allows more complicated partitioners like PT-Scotch to scale to very large problem size while still maintaining good partitioning performance with relatively good edge-cut metric. Graph partitioning is an important problem that has many scientific and engineering applications in such areas as VLSI design, scientific computing, and resource management. Given a graph G = (V,E), where V is the set of vertices and E is the set of edges, (k-way) graph partitioning problem is to partition the vertices of the graph (V) into k disjoint groups such that each group contains roughly equal number of vertices and the number of edges connecting vertices in different groups is minimized. Graph partitioning plays a key role in large scientific computing, especially in mesh-based computations, as it is used as a tool to minimize the volume of communication and to ensure well-balanced load across computing nodes. The impact of graph partitioning on the reduction of communication can be easily seen, for example, in different iterative methods to solve a sparse system of linear equation. Here, a graph partitioning technique is applied to the matrix, which is basically a graph in which each edge is a non-zero entry in the matrix, to allocate groups of vertices to processors in such a way that many of matrix-vector multiplication can be performed locally on each processor and hence to minimize communication. Furthermore, a good graph partitioning scheme ensures the equal amount of computation performed on each processor. Graph partitioning is a well known NP-complete problem, and thus the most commonly used graph partitioning algorithms employ some forms of heuristics. These algorithms vary in terms of their complexity, partition generation time, and the quality of partitions, and they tend to trade off these factors. A significant challenge we are currently facing at the Lawrence Livermore National Laboratory is how to partition very large meshes on massive-size distributed memory machines like IBM BlueGene/P, where scalability becomes a big issue. For example, we have found that the ParMetis, a very popular graph partitioning tool, can only scale to 16K processors. An ideal graph partitioning method on such an environment should be fast and scale to very large meshes, while producing high quality partitions. This is an extremely challenging task, as to scale to that level, the partitioning algorithm should be simple and be able to produce partitions that minimize inter-processor communications and balance the load imposed on the processors. Our goals in this work are two-fold: (1) To develop a new scalable graph partitioning method with good load balancing and communication reduction capability. (2) To study the performance of the proposed partitioning method on very large parallel machines using actual data sets and compare the performance to that of existing methods. The proposed method achieves the desired scalability by reducing the mesh size. For this, it coarsens an input mesh into a smaller size mesh by coalescing the vertices and edges of the original mesh into a set of mega-vertices and mega-edges. A new coarsening method called brick algorithm is developed in this research. In the brick algorithm, the zones in a given mesh are first grouped into fixed size blocks called bricks. These brick are then laid in a way similar to conventional brick laying technique, which reduces the number of neighboring blocks each block needs to communicate. Contributions of this research are as follows: (1) We have developed a novel method that scales to a really large problem size while producing high quality mesh partitions; (2) We measured the performance and scalability of the proposed method on a machine of massive size using a set of actual large complex data sets, where we have scaled to a mesh with 110 million zones using our method. To the best of our knowledge, this is the largest complex mesh that a partitioning method is successfully applied to; and (3) We have shown that proposed method can reduce the number of edge cuts by as much as 65%.« less
Transformation of two and three-dimensional regions by elliptic systems

NASA Technical Reports Server (NTRS)

Mastin, C. Wayne

1994-01-01

Several reports are attached to this document which contain the results of our research at the end of this contract period. Three of the reports deal with our work on generating surface grids. One is a preprint of a paper which will appear in the journal Applied Mathematics and Computation. Another is the abstract from a dissertation which has been prepared by Ahmed Khamayseh, a graduate student who has been supported by this grant for the last two years. The last report on surface grids is the extended abstract of a paper to be presented at the 14th IMACS World Congress in July. This report contains results on conformal mappings of surfaces, which are closely related to elliptic methods for surface grid generation. A preliminary report is included on new methods for dealing with block interfaces in multiblock grid systems. The development work is complete and the methods will eventually be incorporated into the National Grid Project (NGP) grid generation code. Thus, the attached report contains only a simple grid system which was used to test the algorithms to prove that the concepts are sound. These developments will greatly aid grid control when using elliptic systems and prevent unwanted grid movement. The last report is a brief summary of some timings that were obtained when the multiblock grid generation code was run on the Intel IPSC/860 hypercube. Since most of the data in a grid code is local to a particular block, only a small fraction of the total data must be passed between processors. The data is also distributed among the processors so that the total size of the grid can be increase along with the number of processors. This work is only in a preliminary stage. However, one of the ERC graduate students has taken an interest in the project and is presently extending these results as a part of his master's thesis.
Synchronization sampling method based on delta-sigma analog-digital converter for underwater towed array system.

PubMed

Jiang, Jia-Jia; Duan, Fa-Jie; Li, Yan-Chao; Hua, Xiang-Ning

2014-03-01

Synchronization sampling is very important in underwater towed array system where every acquisition node (AN) samples analog signals by its own analog-digital converter (ADC). In this paper, a simple and effective synchronization sampling method is proposed to ensure synchronized operation among different ANs of the underwater towed array system. We first present a master-slave synchronization sampling model, and then design a high accuracy phase-locked loop to synchronize all delta-sigma ADCs to a reference clock. However, when the master-slave synchronization sampling model is used, both the time-delay (TD) of messages traveling along the wired transmission medium and the jitter of the clocks will bring out synchronization sampling error (SSE). Therefore, a simple method is proposed to estimate and compensate the TD of the messages transmission, and then another effective method is presented to overcome the SSE caused by the jitter of the clocks. An experimental system with three ANs is set up, and the related experimental results verify the validity of the synchronization sampling method proposed in this paper.
Synchronization sampling method based on delta-sigma analog-digital converter for underwater towed array system

NASA Astrophysics Data System (ADS)

Jiang, Jia-Jia; Duan, Fa-Jie; Li, Yan-Chao; Hua, Xiang-Ning

2014-03-01

Synchronization sampling is very important in underwater towed array system where every acquisition node (AN) samples analog signals by its own analog-digital converter (ADC). In this paper, a simple and effective synchronization sampling method is proposed to ensure synchronized operation among different ANs of the underwater towed array system. We first present a master-slave synchronization sampling model, and then design a high accuracy phase-locked loop to synchronize all delta-sigma ADCs to a reference clock. However, when the master-slave synchronization sampling model is used, both the time-delay (TD) of messages traveling along the wired transmission medium and the jitter of the clocks will bring out synchronization sampling error (SSE). Therefore, a simple method is proposed to estimate and compensate the TD of the messages transmission, and then another effective method is presented to overcome the SSE caused by the jitter of the clocks. An experimental system with three ANs is set up, and the related experimental results verify the validity of the synchronization sampling method proposed in this paper.
A scalable approach for tree segmentation within small-footprint airborne LiDAR data

NASA Astrophysics Data System (ADS)

Hamraz, Hamid; Contreras, Marco A.; Zhang, Jun

2017-05-01

This paper presents a distributed approach that scales up to segment tree crowns within a LiDAR point cloud representing an arbitrarily large forested area. The approach uses a single-processor tree segmentation algorithm as a building block in order to process the data delivered in the shape of tiles in parallel. The distributed processing is performed in a master-slave manner, in which the master maintains the global map of the tiles and coordinates the slaves that segment tree crowns within and across the boundaries of the tiles. A minimal bias was introduced to the number of detected trees because of trees lying across the tile boundaries, which was quantified and adjusted for. Theoretical and experimental analyses of the runtime of the approach revealed a near linear speedup. The estimated number of trees categorized by crown class and the associated error margins as well as the height distribution of the detected trees aligned well with field estimations, verifying that the distributed approach works correctly. The approach enables providing information of individual tree locations and point cloud segments for a forest-level area in a timely manner, which can be used to create detailed remotely sensed forest inventories. Although the approach was presented for tree segmentation within LiDAR point clouds, the idea can also be generalized to scale up processing other big spatial datasets.

Distributed processor allocation for launching applications in a massively connected processors complex

DOEpatents

Pedretti, Kevin

2008-11-18

A compute processor allocator architecture for allocating compute processors to run applications in a multiple processor computing apparatus is distributed among a subset of processors within the computing apparatus. Each processor of the subset includes a compute processor allocator. The compute processor allocators can share a common database of information pertinent to compute processor allocation. A communication path permits retrieval of information from the database independently of the compute processor allocators.
Preliminary user's manuals for DYNA3D and DYNAP. [In FORTRAN IV for CDC 7600 and Cray-1

DOE Office of Scientific and Technical Information (OSTI.GOV)

Hallquist, J. O.

1979-10-01

This report provides a user's manual for DYNA3D, an explicit three-dimensional finite-element code for analyzing the large deformation dynamic response of inelastic solids. A contact-impact algorithm permits gaps and sliding along material interfaces. By a specialization of this algorithm, such interfaces can be rigidly tied to admit variable zoning without the need of transition regions. Spatial discretization is achieved by the use of 8-node solid elements, and the equations of motion are integrated by the central difference method. Post-processors for DYNA3D include GRAPE for plotting deformed shapes and stress contours and DYNAP for plotting time histories. A user's manual formore » DYNAP is also provided. 23 figures.« less
Parallelization of the Flow Field Dependent Variation Scheme for Solving the Triple Shock/Boundary Layer Interaction Problem

NASA Technical Reports Server (NTRS)

Schunk, Richard Gregory; Chung, T. J.

2001-01-01

A parallelized version of the Flowfield Dependent Variation (FDV) Method is developed to analyze a problem of current research interest, the flowfield resulting from a triple shock/boundary layer interaction. Such flowfields are often encountered in the inlets of high speed air-breathing vehicles including the NASA Hyper-X research vehicle. In order to resolve the complex shock structure and to provide adequate resolution for boundary layer computations of the convective heat transfer from surfaces inside the inlet, models containing over 500,000 nodes are needed. Efficient parallelization of the computation is essential to achieving results in a timely manner. Results from a parallelization scheme, based upon multi-threading, as implemented on multiple processor supercomputers and workstations is presented.
Large Scale GW Calculations on the Cori System

NASA Astrophysics Data System (ADS)

Deslippe, Jack; Del Ben, Mauro; da Jornada, Felipe; Canning, Andrew; Louie, Steven

The NERSC Cori system, powered by 9000+ Intel Xeon-Phi processors, represents one of the largest HPC systems for open-science in the United States and the world. We discuss the optimization of the GW methodology for this system, including both node level and system-scale optimizations. We highlight multiple large scale (thousands of atoms) case studies and discuss both absolute application performance and comparison to calculations on more traditional HPC architectures. We find that the GW method is particularly well suited for many-core architectures due to the ability to exploit a large amount of parallelism across many layers of the system. This work was supported by the U.S. Department of Energy, Office of Science, Basic Energy Sciences, Materials Sciences and Engineering Division, as part of the Computational Materials Sciences Program.
Megatux

DOE Office of Scientific and Technical Information (OSTI.GOV)

2012-09-25

The Megatux platform enables the emulation of large scale (multi-million node) distributed systems. In particular, it allows for the emulation of large-scale networks interconnecting a very large number of emulated computer systems. It does this by leveraging virtualization and associated technologies to allow hundreds of virtual computers to be hosted on a single moderately sized server or workstation. Virtualization technology provided by modern processors allows for multiple guest OSs to run at the same time, sharing the hardware resources. The Megatux platform can be deployed on a single PC, a small cluster of a few boxes or a large clustermore » of computers. With a modest cluster, the Megatux platform can emulate complex organizational networks. By using virtualization, we emulate the hardware, but run actual software enabling large scale without sacrificing fidelity.« less
Merlin - Massively parallel heterogeneous computing

NASA Technical Reports Server (NTRS)

Wittie, Larry; Maples, Creve

1989-01-01

Hardware and software for Merlin, a new kind of massively parallel computing system, are described. Eight computers are linked as a 300-MIPS prototype to develop system software for a larger Merlin network with 16 to 64 nodes, totaling 600 to 3000 MIPS. These working prototypes help refine a mapped reflective memory technique that offers a new, very general way of linking many types of computer to form supercomputers. Processors share data selectively and rapidly on a word-by-word basis. Fast firmware virtual circuits are reconfigured to match topological needs of individual application programs. Merlin's low-latency memory-sharing interfaces solve many problems in the design of high-performance computing systems. The Merlin prototypes are intended to run parallel programs for scientific applications and to determine hardware and software needs for a future Teraflops Merlin network.
Three-dimensional object surface identification

NASA Astrophysics Data System (ADS)

Celenk, Mehmet

1995-03-01

This paper describes a computationally efficient matching method for inspecting 3D objects using their serial cross sections. Object regions of interest in cross-sectional binary images of successive slices are aligned with those of the models. Cross-sectional differences between the object and the models are measured in the direction of the gradient of the cross section boundary. This is repeated in all the cross-sectional images. The model with minimum average cross-sectional difference is selected as the best match to the given object (i.e., no defect). The method is tested using various computer generated surfaces and matching results are presented. It is also demonstrated using Symult S-2010 16-node system that the method is suitable for parallel implementation in massage passing processors with the maximum attainable speedup (close to 16 for S-2010).
A multitasking behavioral control system for the Robotic All-Terrain Lunar Exploration Rover (RATLER)

NASA Technical Reports Server (NTRS)

Klarer, Paul

1993-01-01

An approach for a robotic control system which implements so called 'behavioral' control within a realtime multitasking architecture is proposed. The proposed system would attempt to ameliorate some of the problems noted by some researchers when implementing subsumptive or behavioral control systems, particularly with regard to multiple processor systems and realtime operations. The architecture is designed to allow synchronous operations between various behavior modules by taking advantage of a realtime multitasking system's intertask communications channels, and by implementing each behavior module and each interconnection node as a stand-alone task. The potential advantages of this approach over those previously described in the field are discussed. An implementation of the architecture is planned for a prototype Robotic All Terrain Lunar Exploration Rover (RATLER) currently under development and is briefly described.
Graphics Processing Unit Acceleration of Gyrokinetic Turbulence Simulations

NASA Astrophysics Data System (ADS)

Hause, Benjamin; Parker, Scott

2012-10-01

We find a substantial increase in on-node performance using Graphics Processing Unit (GPU) acceleration in gyrokinetic delta-f particle-in-cell simulation. Optimization is performed on a two-dimensional slab gyrokinetic particle simulation using the Portland Group Fortran compiler with the GPU accelerator compiler directives. We have implemented the GPU acceleration on a Core I7 gaming PC with a NVIDIA GTX 580 GPU. We find comparable, or better, acceleration relative to the NERSC DIRAC cluster with the NVIDIA Tesla C2050 computing processor. The Tesla C 2050 is about 2.6 times more expensive than the GTX 580 gaming GPU. Optimization strategies and comparisons between DIRAC and the gaming PC will be presented. We will also discuss progress on optimizing the comprehensive three dimensional general geometry GEM code.
A Parallel Saturation Algorithm on Shared Memory Architectures

NASA Technical Reports Server (NTRS)

Ezekiel, Jonathan; Siminiceanu

2007-01-01

Symbolic state-space generators are notoriously hard to parallelize. However, the Saturation algorithm implemented in the SMART verification tool differs from other sequential symbolic state-space generators in that it exploits the locality of ring events in asynchronous system models. This paper explores whether event locality can be utilized to efficiently parallelize Saturation on shared-memory architectures. Conceptually, we propose to parallelize the ring of events within a decision diagram node, which is technically realized via a thread pool. We discuss the challenges involved in our parallel design and conduct experimental studies on its prototypical implementation. On a dual-processor dual core PC, our studies show speed-ups for several example models, e.g., of up to 50% for a Kanban model, when compared to running our algorithm only on a single core.
Plasma Physics Calculations on a Parallel Macintosh Cluster

NASA Astrophysics Data System (ADS)

Decyk, Viktor; Dauger, Dean; Kokelaar, Pieter

2000-03-01

We have constructed a parallel cluster consisting of 16 Apple Macintosh G3 computers running the MacOS, and achieved very good performance on numerically intensive, parallel plasma particle-in-cell simulations. A subset of the MPI message-passing library was implemented in Fortran77 and C. This library enabled us to port code, without modification, from other parallel processors to the Macintosh cluster. For large problems where message packets are large and relatively few in number, performance of 50-150 MFlops/node is possible, depending on the problem. This is fast enough that 3D calculations can be routinely done. Unlike Unix-based clusters, no special expertise in operating systems is required to build and run the cluster. Full details are available on our web site: http://exodus.physics.ucla.edu/appleseed/.
Plasma Physics Calculations on a Parallel Macintosh Cluster

NASA Astrophysics Data System (ADS)

Decyk, Viktor K.; Dauger, Dean E.; Kokelaar, Pieter R.

We have constructed a parallel cluster consisting of 16 Apple Macintosh G3 computers running the MacOS, and achieved very good performance on numerically intensive, parallel plasma particle-in-cell simulations. A subset of the MPI message-passing library was implemented in Fortran77 and C. This library enabled us to port code, without modification, from other parallel processors to the Macintosh cluster. For large problems where message packets are large and relatively few in number, performance of 50-150 Mflops/node is possible, depending on the problem. This is fast enough that 3D calculations can be routinely done. Unlike Unix-based clusters, no special expertise in operating systems is required to build and run the cluster. Full details are available on our web site: http://exodus.physics.ucla.edu/appleseed/.
A Spiking Neural Network Model of the Lateral Geniculate Nucleus on the SpiNNaker Machine

PubMed Central

Sen-Bhattacharya, Basabdatta; Serrano-Gotarredona, Teresa; Balassa, Lorinc; Bhattacharya, Akash; Stokes, Alan B.; Rowley, Andrew; Sugiarto, Indar; Furber, Steve

2017-01-01

We present a spiking neural network model of the thalamic Lateral Geniculate Nucleus (LGN) developed on SpiNNaker, which is a state-of-the-art digital neuromorphic hardware built with very-low-power ARM processors. The parallel, event-based data processing in SpiNNaker makes it viable for building massively parallel neuro-computational frameworks. The LGN model has 140 neurons representing a “basic building block” for larger modular architectures. The motivation of this work is to simulate biologically plausible LGN dynamics on SpiNNaker. Synaptic layout of the model is consistent with biology. The model response is validated with existing literature reporting entrainment in steady state visually evoked potentials (SSVEP)—brain oscillations corresponding to periodic visual stimuli recorded via electroencephalography (EEG). Periodic stimulus to the model is provided by: a synthetic spike-train with inter-spike-intervals in the range 10–50 Hz at a resolution of 1 Hz; and spike-train output from a state-of-the-art electronic retina subjected to a light emitting diode flashing at 10, 20, and 40 Hz, simulating real-world visual stimulus to the model. The resolution of simulation is 0.1 ms to ensure solution accuracy for the underlying differential equations defining Izhikevichs neuron model. Under this constraint, 1 s of model simulation time is executed in 10 s real time on SpiNNaker; this is because simulations on SpiNNaker work in real time for time-steps dt ⩾ 1 ms. The model output shows entrainment with both sets of input and contains harmonic components of the fundamental frequency. However, suppressing the feed-forward inhibition in the circuit produces subharmonics within the gamma band (>30 Hz) implying a reduced information transmission fidelity. These model predictions agree with recent lumped-parameter computational model-based predictions, using conventional computers. Scalability of the framework is demonstrated by a multi-node architecture consisting of three “nodes,” where each node is the “basic building block” LGN model. This 420 neuron model is tested with synthetic periodic stimulus at 10 Hz to all the nodes. The model output is the average of the outputs from all nodes, and conforms to the above-mentioned predictions of each node. Power consumption for model simulation on SpiNNaker is ≪1 W. PMID:28848380
A Spiking Neural Network Model of the Lateral Geniculate Nucleus on the SpiNNaker Machine.

PubMed

Sen-Bhattacharya, Basabdatta; Serrano-Gotarredona, Teresa; Balassa, Lorinc; Bhattacharya, Akash; Stokes, Alan B; Rowley, Andrew; Sugiarto, Indar; Furber, Steve

2017-01-01

We present a spiking neural network model of the thalamic Lateral Geniculate Nucleus (LGN) developed on SpiNNaker, which is a state-of-the-art digital neuromorphic hardware built with very-low-power ARM processors. The parallel, event-based data processing in SpiNNaker makes it viable for building massively parallel neuro-computational frameworks. The LGN model has 140 neurons representing a "basic building block" for larger modular architectures. The motivation of this work is to simulate biologically plausible LGN dynamics on SpiNNaker. Synaptic layout of the model is consistent with biology. The model response is validated with existing literature reporting entrainment in steady state visually evoked potentials (SSVEP)-brain oscillations corresponding to periodic visual stimuli recorded via electroencephalography (EEG). Periodic stimulus to the model is provided by: a synthetic spike-train with inter-spike-intervals in the range 10-50 Hz at a resolution of 1 Hz; and spike-train output from a state-of-the-art electronic retina subjected to a light emitting diode flashing at 10, 20, and 40 Hz, simulating real-world visual stimulus to the model. The resolution of simulation is 0.1 ms to ensure solution accuracy for the underlying differential equations defining Izhikevichs neuron model. Under this constraint, 1 s of model simulation time is executed in 10 s real time on SpiNNaker; this is because simulations on SpiNNaker work in real time for time-steps dt ⩾ 1 ms. The model output shows entrainment with both sets of input and contains harmonic components of the fundamental frequency. However, suppressing the feed-forward inhibition in the circuit produces subharmonics within the gamma band (>30 Hz) implying a reduced information transmission fidelity. These model predictions agree with recent lumped-parameter computational model-based predictions, using conventional computers. Scalability of the framework is demonstrated by a multi-node architecture consisting of three "nodes," where each node is the "basic building block" LGN model. This 420 neuron model is tested with synthetic periodic stimulus at 10 Hz to all the nodes. The model output is the average of the outputs from all nodes, and conforms to the above-mentioned predictions of each node. Power consumption for model simulation on SpiNNaker is ≪1 W.
MSTor: A program for calculating partition functions, free energies, enthalpies, entropies, and heat capacities of complex molecules including torsional anharmonicity

NASA Astrophysics Data System (ADS)

Zheng, Jingjing; Mielke, Steven L.; Clarkson, Kenneth L.; Truhlar, Donald G.

2012-08-01

We present a Fortran program package, MSTor, which calculates partition functions and thermodynamic functions of complex molecules involving multiple torsional motions by the recently proposed MS-T method. This method interpolates between the local harmonic approximation in the low-temperature limit, and the limit of free internal rotation of all torsions at high temperature. The program can also carry out calculations in the multiple-structure local harmonic approximation. The program package also includes six utility codes that can be used as stand-alone programs to calculate reduced moment of inertia matrices by the method of Kilpatrick and Pitzer, to generate conformational structures, to calculate, either analytically or by Monte Carlo sampling, volumes for torsional subdomains defined by Voronoi tessellation of the conformational subspace, to generate template input files, and to calculate one-dimensional torsional partition functions using the torsional eigenvalue summation method. Catalogue identifier: AEMF_v1_0 Program summary URL:http://cpc.cs.qub.ac.uk/summaries/AEMF_v1_0.html Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland Licensing provisions: Standard CPC licence, http://cpc.cs.qub.ac.uk/licence/licence.html No. of lines in distributed program, including test data, etc.: 77 434 No. of bytes in distributed program, including test data, etc.: 3 264 737 Distribution format: tar.gz Programming language: Fortran 90, C, and Perl Computer: Itasca (HP Linux cluster, each node has two-socket, quad-core 2.8 GHz Intel Xeon X5560 “Nehalem EP” processors), Calhoun (SGI Altix XE 1300 cluster, each node containing two quad-core 2.66 GHz Intel Xeon “Clovertown”-class processors sharing 16 GB of main memory), Koronis (Altix UV 1000 server with 190 6-core Intel Xeon X7542 “Westmere” processors at 2.66 GHz), Elmo (Sun Fire X4600 Linux cluster with AMD Opteron cores), and Mac Pro (two 2.8 GHz Quad-core Intel Xeon processors) Operating system: Linux/Unix/Mac OS RAM: 2 Mbytes Classification: 16.3, 16.12, 23 Nature of problem: Calculation of the partition functions and thermodynamic functions (standard-state energy, enthalpy, entropy, and free energy as functions of temperatures) of complex molecules involving multiple torsional motions. Solution method: The multi-structural approximation with torsional anharmonicity (MS-T). The program also provides results for the multi-structural local harmonic approximation [1]. Restrictions: There is no limit on the number of torsions that can be included in either the Voronoi calculation or the full MS-T calculation. In practice, the range of problems that can be addressed with the present method consists of all multi-torsional problems for which one can afford to calculate all the conformations and their frequencies. Unusual features: The method can be applied to transition states as well as stable molecules. The program package also includes the hull program for the calculation of Voronoi volumes and six utility codes that can be used as stand-alone programs to calculate reduced moment-of-inertia matrices by the method of Kilpatrick and Pitzer, to generate conformational structures, to calculate, either analytically or by Monte Carlo sampling, volumes for torsional subdomain defined by Voronoi tessellation of the conformational subspace, to generate template input files, and to calculate one-dimensional torsional partition functions using the torsional eigenvalue summation method. Additional comments: The program package includes a manual, installation script, and input and output files for a test suite. Running time: There are 24 test runs. The running time of the test runs on a single processor of the Itasca computer is less than 2 seconds. J. Zheng, T. Yu, E. Papajak, I.M. Alecu, S.L. Mielke, D.G. Truhlar, Practical methods for including torsional anharmonicity in thermochemical calculations of complex molecules: The internal-coordinate multi-structural approximation, Phys. Chem. Chem. Phys. 13 (2011) 10885-10907.
Digital optical tomography system for dynamic breast imaging

NASA Astrophysics Data System (ADS)

Flexman, Molly L.; Khalil, Michael A.; Al Abdi, Rabah; Kim, Hyun K.; Fong, Christopher J.; Desperito, Elise; Hershman, Dawn L.; Barbour, Randall L.; Hielscher, Andreas H.

2011-07-01

Diffuse optical tomography has shown promising results as a tool for breast cancer screening and monitoring response to chemotherapy. Dynamic imaging of the transient response of the breast to an external stimulus, such as pressure or a respiratory maneuver, can provide additional information that can be used to detect tumors. We present a new digital continuous-wave optical tomography system designed to simultaneously image both breasts at fast frame rates and with a large number of sources and detectors. The system uses a master-slave digital signal processor-based detection architecture to achieve a dynamic range of 160 dB and a frame rate of 1.7 Hz with 32 sources, 64 detectors, and 4 wavelengths per breast. Included is a preliminary study of one healthy patient and two breast cancer patients showing the ability to identify an invasive carcinoma based on the hemodynamic response to a breath hold.
Evaluation of a parallel implementation of the learning portion of the backward error propagation neural network: experiments in artifact identification.

PubMed Central

Sittig, D. F.; Orr, J. A.

1991-01-01

Various methods have been proposed in an attempt to solve problems in artifact and/or alarm identification including expert systems, statistical signal processing techniques, and artificial neural networks (ANN). ANNs consist of a large number of simple processing units connected by weighted links. To develop truly robust ANNs, investigators are required to train their networks on huge training data sets, requiring enormous computing power. We implemented a parallel version of the backward error propagation neural network training algorithm in the widely portable parallel programming language C-Linda. A maximum speedup of 4.06 was obtained with six processors. This speedup represents a reduction in total run-time from approximately 6.4 hours to 1.5 hours. We conclude that use of the master-worker model of parallel computation is an excellent method for obtaining speedups in the backward error propagation neural network training algorithm. PMID:1807607
A Framework for Parallel Unstructured Grid Generation for Complex Aerodynamic Simulations

NASA Technical Reports Server (NTRS)

Zagaris, George; Pirzadeh, Shahyar Z.; Chrisochoides, Nikos

2009-01-01

A framework for parallel unstructured grid generation targeting both shared memory multi-processors and distributed memory architectures is presented. The two fundamental building-blocks of the framework consist of: (1) the Advancing-Partition (AP) method used for domain decomposition and (2) the Advancing Front (AF) method used for mesh generation. Starting from the surface mesh of the computational domain, the AP method is applied recursively to generate a set of sub-domains. Next, the sub-domains are meshed in parallel using the AF method. The recursive nature of domain decomposition naturally maps to a divide-and-conquer algorithm which exhibits inherent parallelism. For the parallel implementation, the Master/Worker pattern is employed to dynamically balance the varying workloads of each task on the set of available CPUs. Performance results by this approach are presented and discussed in detail as well as future work and improvements.
Biomedical Wireless Ambulatory Crew Monitor

NASA Technical Reports Server (NTRS)

Chmiel, Alan; Humphreys, Brad

2009-01-01

A compact, ambulatory biometric data acquisition system has been developed for space and commercial terrestrial use. BioWATCH (Bio medical Wireless and Ambulatory Telemetry for Crew Health) acquires signals from biomedical sensors using acquisition modules attached to a common data and power bus. Several slots allow the user to configure the unit by inserting sensor-specific modules. The data are then sent real-time from the unit over any commercially implemented wireless network including 802.11b/g, WCDMA, 3G. This system has a distributed computing hierarchy and has a common data controller on each sensor module. This allows for the modularity of the device along with the tailored ability to control the cards using a relatively small master processor. The distributed nature of this system affords the modularity, size, and power consumption that betters the current state of the art in medical ambulatory data acquisition. A new company was created to market this technology.
The Goddard Space Flight Center (GSFC) robotics technology testbed

NASA Technical Reports Server (NTRS)

Schnurr, Rick; Obrien, Maureen; Cofer, Sue

1989-01-01

Much of the technology planned for use in NASA's Flight Telerobotic Servicer (FTS) and the Demonstration Test Flight (DTF) is relatively new and untested. To provide the answers needed to design safe, reliable, and fully functional robotics for flight, NASA/GSFC is developing a robotics technology testbed for research of issues such as zero-g robot control, dual arm teleoperation, simulations, and hierarchical control using a high level programming language. The testbed will be used to investigate these high risk technologies required for the FTS and DTF projects. The robotics technology testbed is centered around the dual arm teleoperation of a pair of 7 degree-of-freedom (DOF) manipulators, each with their own 6-DOF mini-master hand controllers. Several levels of safety are implemented using the control processor, a separate watchdog computer, and other low level features. High speed input/output ports allow the control processor to interface to a simulation workstation: all or part of the testbed hardware can be used in real time dynamic simulation of the testbed operations, allowing a quick and safe means for testing new control strategies. The NASA/National Bureau of Standards Standard Reference Model for Telerobot Control System Architecture (NASREM) hierarchical control scheme, is being used as the reference standard for system design. All software developed for the testbed, excluding some of simulation workstation software, is being developed in Ada. The testbed is being developed in phases. The first phase, which is nearing completion, and highlights future developments is described.

Some links on this page may take you to non-federal websites. Their policies may differ from this site.