Scalable Quantum Networks for Distributed Computing and Sensing
2016-04-01
probabilistic measurement, so we developed quantum memories and guided-wave implementations of the same, demonstrating controlled delay of a heralded single... Second, fundamental scalability requires a method to synchronize protocols based on quantum measurements, which are inherently probabilistic. To meet... (AFRL-AFOSR-UK-TR-2016-0007, Scalable Quantum Networks for Distributed Computing and Sensing, Ian Walmsley, The University of Oxford, Final Report, 04/01)
A Fast Approximate Algorithm for Mapping Long Reads to Large Reference Databases.
Jain, Chirag; Dilthey, Alexander; Koren, Sergey; Aluru, Srinivas; Phillippy, Adam M
2018-04-30
Emerging single-molecule sequencing technologies from Pacific Biosciences and Oxford Nanopore have revived interest in long-read mapping algorithms. Alignment-based seed-and-extend methods demonstrate good accuracy, but face limited scalability, while faster alignment-free methods typically trade decreased precision for efficiency. In this article, we combine a fast approximate read mapping algorithm based on minimizers with a novel MinHash identity estimation technique to achieve both scalability and precision. In contrast to prior methods, we develop a mathematical framework that defines the types of mapping targets we uncover, establish probabilistic estimates of p-value and sensitivity, and demonstrate tolerance for alignment error rates up to 20%. With this framework, our algorithm automatically adapts to different minimum length and identity requirements and provides both positional and identity estimates for each mapping reported. For mapping human PacBio reads to the hg38 reference, our method is 290 × faster than Burrows-Wheeler Aligner-MEM with a lower memory footprint and recall rate of 96%. We further demonstrate the scalability of our method by mapping noisy PacBio reads (each ≥5 kbp in length) to the complete NCBI RefSeq database containing 838 Gbp of sequence and >60,000 genomes.
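For intuition, the MinHash identity estimate at the heart of this style of mapper can be sketched in a few lines of Python (illustrative only: the k-mer size, sketch size, and hashing details below are assumptions, and the paper's actual algorithm additionally uses minimizers and windowed mapping):

```python
import hashlib
import math

def bottom_sketch(seq, k=16, s=128):
    """Bottom-s MinHash sketch of the k-mer set of a DNA sequence."""
    hashes = {int.from_bytes(hashlib.sha1(seq[i:i + k].encode()).digest()[:8], "big")
              for i in range(len(seq) - k + 1)}
    return sorted(hashes)[:s]

def jaccard_estimate(sk_a, sk_b, s=128):
    """Estimate Jaccard similarity from the bottom-s sketch of the union."""
    a, b = set(sk_a), set(sk_b)
    merged = sorted(a | b)[:s]
    return sum(1 for h in merged if h in a and h in b) / len(merged)

def identity_from_jaccard(j, k=16):
    """Mash-style conversion of a Jaccard estimate to percent identity."""
    if j <= 0.0:
        return 0.0
    return 100.0 * (1.0 + math.log(2.0 * j / (1.0 + j)) / k)
```

The conversion in the last function is the Mash distance formula under a Poisson mutation model; a perfect Jaccard of 1 maps to 100% identity, and smaller sketches trade precision for speed.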
Scalable DB+IR Technology: Processing Probabilistic Datalog with HySpirit.
Frommholz, Ingo; Roelleke, Thomas
2016-01-01
Probabilistic Datalog (PDatalog, proposed in 1995) is a probabilistic variant of Datalog and a nice conceptual idea to model Information Retrieval in a logical, rule-based programming paradigm. Making PDatalog work in real-world applications requires more than probabilistic facts and rules, and the semantics associated with the evaluation of the programs. We report in this paper some of the key features of the HySpirit system required to scale the execution of PDatalog programs. Firstly, there is the requirement to express probability estimation in PDatalog. Secondly, fuzzy-like predicates are required to model vague predicates (e.g. vague match of attributes such as age or price). Thirdly, to handle large data sets there are scalability issues to be addressed, and therefore HySpirit provides probabilistic relational indexes and parallel and distributed processing. The main contribution of this paper is a consolidated view of the methods of the HySpirit system to make PDatalog applicable in real-scale applications that involve a wide range of requirements typical for data (information) management and analysis.
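As a toy illustration of evaluating a probabilistic rule in the PDatalog spirit (a hand-rolled sketch, not HySpirit's syntax or semantics; tuple independence is assumed):

```python
# Probabilistic facts: P(term(doc, t)) -- a toy extensional database.
term = {("d1", "ir"): 0.8, ("d1", "db"): 0.5, ("d2", "db"): 0.9}

def prob_or(probabilities):
    """P(head tuple) when any of several independent derivations suffices."""
    q = 1.0
    for p in probabilities:
        q *= 1.0 - p
    return 1.0 - q

# e.g. two independent derivations of about(d1, db) with P=0.5 and P=0.3:
print(prob_or([0.5, 0.3]))  # 0.65
```

Real engines must also track when derivations share facts (and so are not independent), which is one reason scaling PDatalog evaluation is nontrivial.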
A Scalable Approach to Probabilistic Latent Space Inference of Large-Scale Networks
Yin, Junming; Ho, Qirong; Xing, Eric P.
2014-01-01
We propose a scalable approach for making inference about latent spaces of large networks. With a succinct representation of networks as a bag of triangular motifs, a parsimonious statistical model, and an efficient stochastic variational inference algorithm, we are able to analyze real networks with over a million vertices and hundreds of latent roles on a single machine in a matter of hours, a setting that is out of reach for many existing methods. When compared to the state-of-the-art probabilistic approaches, our method is several orders of magnitude faster, with competitive or improved accuracy for latent space recovery and link prediction. PMID:25400487
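As a rough illustration of the "bag of triangular motifs" representation, the sketch below enumerates closed triangles from an adjacency structure (a simplification on our part; the paper's motif set and its subsampling strategy are richer than this):

```python
from itertools import combinations

def closed_triangles(adj):
    """Collect 3-cliques from an adjacency dict mapping node -> set of neighbors."""
    triangles = set()
    for u, neighbors in adj.items():
        for v, w in combinations(sorted(neighbors), 2):
            if w in adj.get(v, set()):
                triangles.add(tuple(sorted((u, v, w))))
    return triangles

adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
print(closed_triangles(adj))  # {(1, 2, 3)}
```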
Architecture Knowledge for Evaluating Scalable Databases
2015-01-16
problems arising from the proliferation of new data models and distributed technologies for building scalable, available data stores. Architects must... No longer are relational databases the de facto standard for building data repositories. Highly distributed, scalable "NoSQL" databases [11] have emerged... This is especially challenging at the data storage layer. The multitude of competing NoSQL database technologies creates a complex and rapidly
Xia, Kai; Dong, Dong; Han, Jing-Dong J
2006-01-01
Background Although protein-protein interaction (PPI) networks have been explored by various experimental methods, the maps so built are still limited in coverage and accuracy. To further expand the PPI network and to extract more accurate information from existing maps, studies have been carried out to integrate various types of functional relationship data. A frequently updated database of computationally analyzed potential PPIs to provide biological researchers with rapid and easy access to analyze original data as a biological network is still lacking. Results By applying a probabilistic model, we integrated 27 heterogeneous genomic, proteomic and functional annotation datasets to predict PPI networks in human. In addition to previously studied data types, we show that phenotypic distances and genetic interactions can also be integrated to predict PPIs. We further built an easy-to-use, updatable integrated PPI database, the Integrated Network Database (IntNetDB), online, to provide automatic prediction and visualization of PPI networks among genes of interest. The networks can be visualized in SVG (Scalable Vector Graphics) format for zooming in or out. IntNetDB also provides a tool to extract topologically highly connected network neighborhoods from a specific network for further exploration and research. Using the MCODE (Molecular Complex Detection) algorithm, 190 such neighborhoods were detected among all the predicted interactions. The predicted PPIs can also be mapped to worm, fly and mouse interologs. Conclusion IntNetDB includes 180,010 predicted protein-protein interactions among 9,901 human proteins and represents a useful resource for the research community. Our study has increased prediction coverage by five-fold. IntNetDB also provides easy-to-use network visualization and analysis tools that allow biological researchers unfamiliar with computational biology to access and analyze data over the internet. The web interface of IntNetDB is freely accessible at . Visualization requires Mozilla version 1.8 (or higher) or Internet Explorer with installation of SVGviewer. PMID:17112386
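A schematic of this kind of probabilistic evidence integration, in the naive-Bayes style commonly used for PPI prediction (a sketch under an assumed conditional-independence model; the paper's actual likelihood ratios are derived from its 27 datasets):

```python
def posterior_odds(prior_odds, likelihood_ratios):
    """Combine per-dataset likelihood ratios for 'pair interacts', assuming independence."""
    odds = prior_odds
    for lr in likelihood_ratios:
        odds *= lr
    return odds

def odds_to_prob(odds):
    return odds / (1.0 + odds)

# e.g. a weak prior and three datasets that each favor interaction:
print(odds_to_prob(posterior_odds(1e-3, [30.0, 8.0, 4.0])))  # ~0.49
```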
A probabilistic NF2 relational algebra for integrated information retrieval and database systems
DOE Office of Scientific and Technical Information (OSTI.GOV)
Fuhr, N.; Roelleke, T.
The integration of information retrieval (IR) and database systems requires a data model which allows for modelling documents as entities, representing uncertainty and vagueness and performing uncertain inference. For this purpose, we present a probabilistic data model based on relations in non-first-normal-form (NF2). Here, tuples are assigned probabilistic weights giving the probability that a tuple belongs to a relation. Thus, the set of weighted index terms of a document is represented as a probabilistic subrelation. In a similar way, imprecise attribute values are modelled as a set-valued attribute. We redefine the relational operators for this type of relations such that the result of each operator is again a probabilistic NF2 relation, where the weight of a tuple gives the probability that this tuple belongs to the result. By ordering the tuples according to decreasing probabilities, the model yields a ranking of answers like in most IR models. This effect can also be used for typical database queries involving imprecise attribute values as well as for combinations of database and IR queries.
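A condensed sketch of the weighted-tuple algebra described here (illustrative Python, with event independence assumed so that weights multiply under join and combine by inclusion-exclusion under union):

```python
def p_select(rel, pred):
    """Selection keeps a tuple's membership probability unchanged."""
    return {t: w for t, w in rel.items() if pred(t)}

def p_join(r, s, on):
    """Join: result weight is the product of the input weights (independence)."""
    return {(t + u): wr * ws
            for t, wr in r.items() for u, ws in s.items() if on(t, u)}

def p_union(r, s):
    """Union: P(t in r or t in s) = 1 - (1 - P_r)(1 - P_s)."""
    out = dict(r)
    for t, w in s.items():
        out[t] = 1.0 - (1.0 - out.get(t, 0.0)) * (1.0 - w)
    return out

def ranked(rel):
    """Order answers by decreasing probability, as in IR-style ranking."""
    return sorted(rel.items(), key=lambda kv: -kv[1])

docs = {("d1", "ir"): 0.9, ("d1", "db"): 0.6, ("d2", "db"): 0.8}
print(ranked(p_select(docs, lambda t: t[1] == "db")))
```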
Optimal entangling operations between deterministic blocks of qubits encoded into single photons
NASA Astrophysics Data System (ADS)
Smith, Jake A.; Kaplan, Lev
2018-01-01
Here, we numerically simulate probabilistic elementary entangling operations between rail-encoded photons for the purpose of scalable universal quantum computation or communication. We propose grouping logical qubits into single-photon blocks wherein single-qubit rotations and the controlled-not (cnot) gate are fully deterministic and simple to implement. Interblock communication is then allowed through said probabilistic entangling operations. We find a promising trend in the increasing probability of successful interblock communication as we increase the number of optical modes operated on by our elementary entangling operations.
Advanced technologies for scalable ATLAS conditions database access on the grid
NASA Astrophysics Data System (ADS)
Basset, R.; Canali, L.; Dimitrov, G.; Girone, M.; Hawkings, R.; Nevski, P.; Valassi, A.; Vaniachine, A.; Viegas, F.; Walker, R.; Wong, A.
2010-04-01
During massive data reprocessing operations an ATLAS Conditions Database application must support concurrent access from numerous ATLAS data processing jobs running on the Grid. By simulating realistic work-flow, ATLAS database scalability tests provided feedback for Conditions Db software optimization and allowed precise determination of required distributed database resources. In distributed data processing one must take into account the chaotic nature of Grid computing characterized by peak loads, which can be much higher than average access rates. To validate database performance at peak loads, we tested database scalability at very high concurrent job rates. This has been achieved through coordinated database stress tests performed in series of ATLAS reprocessing exercises at the Tier-1 sites. The goal of database stress tests is to detect scalability limits of the hardware deployed at the Tier-1 sites, so that the server overload conditions can be safely avoided in a production environment. Our analysis of server performance under stress tests indicates that Conditions Db data access is limited by the disk I/O throughput. An unacceptable side-effect of the disk I/O saturation is a degradation of the WLCG 3D Services that update Conditions Db data at all ten ATLAS Tier-1 sites using the technology of Oracle Streams. To avoid such bottlenecks we prototyped and tested a novel approach for database peak load avoidance in Grid computing. Our approach is based upon the proven idea of pilot job submission on the Grid: instead of the actual query, an ATLAS utility library sends to the database server a pilot query first.
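The pilot-query idea can be pictured as a small guard executed before the heavy payload query. The sketch below is hypothetical: the function names, thresholds, and backoff policy are invented for illustration and are not the ATLAS utility library's API.

```python
import time

def guarded_fetch(conn, pilot_sql, payload_sql, max_pilot_seconds=0.5,
                  backoff_seconds=30, max_retries=10):
    """Send a cheap pilot query first; defer the real query while the server is overloaded."""
    for _ in range(max_retries):
        t0 = time.time()
        conn.execute(pilot_sql).fetchall()
        if time.time() - t0 <= max_pilot_seconds:
            return conn.execute(payload_sql).fetchall()
        time.sleep(backoff_seconds)  # pilot was slow: wait out the peak load
    raise RuntimeError("database stayed overloaded; giving up")
```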
Discriminative confidence estimation for probabilistic multi-atlas label fusion.
Benkarim, Oualid M; Piella, Gemma; González Ballester, Miguel Angel; Sanroma, Gerard
2017-12-01
Quantitative neuroimaging analyses often rely on the accurate segmentation of anatomical brain structures. In contrast to manual segmentation, automatic methods offer reproducible outputs and provide scalability to study large databases. Among existing approaches, multi-atlas segmentation has recently shown to yield state-of-the-art performance in automatic segmentation of brain images. It consists in propagating the labelmaps from a set of atlases to the anatomy of a target image using image registration, and then fusing these multiple warped labelmaps into a consensus segmentation on the target image. Accurately estimating the contribution of each atlas labelmap to the final segmentation is a critical step for the success of multi-atlas segmentation. Common approaches to label fusion either rely on local patch similarity, probabilistic statistical frameworks or a combination of both. In this work, we propose a probabilistic label fusion framework based on atlas label confidences computed at each voxel of the structure of interest. Maximum likelihood atlas confidences are estimated using a supervised approach, explicitly modeling the relationship between local image appearances and segmentation errors produced by each of the atlases. We evaluate different spatial pooling strategies for modeling local segmentation errors. We also present a novel type of label-dependent appearance features based on atlas labelmaps that are used during confidence estimation to increase the accuracy of our label fusion. Our approach is evaluated on the segmentation of seven subcortical brain structures from the MICCAI 2013 SATA Challenge dataset and the hippocampi from the ADNI dataset. Overall, our results indicate that the proposed label fusion framework achieves superior performance to state-of-the-art approaches in the majority of the evaluated brain structures and shows more robustness to registration errors. Copyright © 2017 Elsevier B.V. All rights reserved.
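The fusion step itself can be summarized as confidence-weighted voting per voxel; here is a minimal numpy sketch (the estimation of the confidence maps, which is the paper's main contribution, is not shown):

```python
import numpy as np

def fuse(labelmaps, confidences):
    """labelmaps: (A, V) warped atlas labels; confidences: (A, V) per-voxel weights.
    Returns the (V,) consensus segmentation."""
    labels = np.unique(labelmaps)
    scores = np.stack([((labelmaps == l) * confidences).sum(axis=0) for l in labels])
    return labels[np.argmax(scores, axis=0)]

labelmaps = np.array([[0, 1, 1], [0, 1, 0], [1, 1, 0]])
confidences = np.array([[0.9, 0.2, 0.6], [0.5, 0.9, 0.3], [0.1, 0.5, 0.4]])
print(fuse(labelmaps, confidences))  # [0 1 0]
```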
WOVOdat, A Worldwide Volcano Unrest Database, to Improve Eruption Forecasts
NASA Astrophysics Data System (ADS)
Widiwijayanti, C.; Costa, F.; Win, N. T. Z.; Tan, K.; Newhall, C. G.; Ratdomopurbo, A.
2015-12-01
WOVOdat is the World Organization of Volcano Observatories' Database of Volcanic Unrest: an international effort to develop common standards for compiling and storing data on volcanic unrest in a centralized database, freely web-accessible for reference during volcanic crises, comparative studies, and basic research on pre-eruption processes. WOVOdat will be to volcanology what an epidemiological database is to medicine. Despite the large spectrum of monitoring techniques, interpreting monitoring data throughout the evolution of unrest and making timely forecasts remain the most challenging tasks for volcanologists. The field of eruption forecasting is becoming more quantitative, based on understanding of pre-eruptive magmatic processes and the dynamic interaction between variables at play in a volcanic system. Such forecasts must also acknowledge and express uncertainties; therefore, most current research in this field focuses on the application of event-tree analysis to reflect multiple possible scenarios and the probability of each scenario. Such forecasts are critically dependent on comprehensive and authoritative global volcano unrest data sets - the very information currently collected in WOVOdat. As the database becomes more complete, Boolean searches, side-by-side digital (and thus scalable) comparisons of unrest, and pattern recognition will generate reliable results. Statistical distributions obtained from WOVOdat can then be used to estimate the probabilities of each scenario after specific patterns of unrest. We have established the main web interface for data submission and visualization, and have now incorporated ~20% of worldwide unrest data into the database, covering more than 100 eruptive episodes. In the upcoming years we will concentrate on acquiring data from volcano observatories, developing a robust data query interface, optimizing data mining, and creating tools by which WOVOdat can be used for probabilistic eruption forecasting. The more data in WOVOdat, the more useful it will be.
Scalable Database Design of End-Game Model with Decoupled Countermeasure and Threat Information
2017-11-01
Threat Information, by Decetria Akole and Michael Chen. Approved for public release; distribution is unlimited... Scalable Database Design of End-Game Model with Decoupled Countermeasure and Threat Information, by Decetria Akole, The Thurgood Marshall...
A probabilistic approach to randomness in geometric configuration of scalable origami structures
NASA Astrophysics Data System (ADS)
Liu, Ke; Paulino, Glaucio; Gardoni, Paolo
2015-03-01
Origami, an ancient paper folding art, has inspired many solutions to modern engineering challenges. The demand for actual engineering applications motivates further investigation in this field. Although rooted in the historic art form, many applications of origami are based on newly designed origami patterns that match the specific requirements of an engineering problem. The application of origami to structural design problems ranges from the micro-structure of materials to large-scale deployable shells. For instance, some origami-inspired designs have unique properties such as a negative Poisson's ratio and flat foldability. However, origami structures are typically constrained by strict mathematical geometric relationships, which, in reality, can easily be violated due to, for example, random imperfections introduced during manufacturing or non-uniform deformations under working conditions (e.g., due to non-uniform thermal effects). Therefore, the effects of uncertainties in origami-like structures need to be studied in further detail in order to provide a practical guide for scalable origami-inspired engineering designs. Through reliability and probabilistic analysis, we investigate the effect of randomness in origami structures on their mechanical properties. Dislocations of the vertices of an origami structure have different impacts on different mechanical properties, and different origami designs can have different sensitivities to imperfections. We thus aim to provide a preliminary understanding of the structural behavior of some common scalable origami structures subject to randomness in their geometric configurations, in order to help transition the technology toward practical applications of origami engineering.
Building Scalable Knowledge Graphs for Earth Science
NASA Technical Reports Server (NTRS)
Ramachandran, Rahul; Maskey, Manil; Gatlin, Patrick; Zhang, Jia; Duan, Xiaoyi; Miller, J. J.; Bugbee, Kaylin; Christopher, Sundar; Freitag, Brian
2017-01-01
Knowledge Graphs link key entities in a specific domain with other entities via relationships. From these relationships, researchers can query knowledge graphs for probabilistic recommendations to infer new knowledge. Scientific papers are an untapped resource which knowledge graphs could leverage to accelerate research discovery. Goal: Develop an end-to-end (semi) automated methodology for constructing Knowledge Graphs for Earth Science.
NASA Astrophysics Data System (ADS)
Velazquez, Enrique Israel
Improvements in medical and genomic technologies have dramatically increased the production of electronic data over the last decade. As a result, data management is rapidly becoming a major determinant, and urgent challenge, for the development of Precision Medicine. Although successful data management is achievable using Relational Database Management Systems (RDBMS), exponential data growth is a significant contributor to failure scenarios. Growing amounts of data can also be observed in other sectors, such as economics and business, which, together with the previous facts, suggests that alternate database approaches (NoSQL) may soon be required for efficient storage and management of big databases. However, this hypothesis has been difficult to test in the Precision Medicine field since alternate database architectures are complex to assess and means to integrate heterogeneous electronic health records (EHR) with dynamic genomic data are not easily available. In this dissertation, we present a novel set of experiments for identifying NoSQL database approaches that enable effective data storage and management in Precision Medicine using patients' clinical and genomic information from The Cancer Genome Atlas (TCGA). The first experiment draws on performance and scalability from biologically meaningful queries with differing complexity and database sizes. The second experiment measures performance and scalability in database updates without schema changes. The third experiment assesses performance and scalability in database updates with schema modifications due to dynamic data. We have identified two NoSQL approaches, based on Cassandra and Redis, which seem to be ideal database management systems for our precision medicine queries in terms of performance and scalability. We present NoSQL approaches and show how they can be used to manage clinical and genomic big data. Our research is relevant to public health since we are focusing on one of the main challenges to the development of Precision Medicine and, consequently, investigating a potential solution to the progressively increasing demands on health care.
Integration of Oracle and Hadoop: Hybrid Databases Affordable at Scale
NASA Astrophysics Data System (ADS)
Canali, L.; Baranowski, Z.; Kothuri, P.
2017-10-01
This work reports on the activities aimed at integrating Oracle and Hadoop technologies for the use cases of CERN database services, and in particular on the development of solutions for offloading data and queries from Oracle databases into Hadoop-based systems. The goal and interest of this investigation is to increase the scalability and optimize the cost/performance footprint of some of our largest Oracle databases. These concepts have been applied, among others, to build offline copies of the CERN accelerator controls and logging databases. The tested solution makes it possible to run reports on the controls data offloaded into Hadoop without affecting the critical production database, providing both performance benefits and cost reduction for the underlying infrastructure. Other use cases discussed include building hybrid database solutions with Oracle and Hadoop, offering the combined advantages of a mature relational database system and a scalable analytics engine.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Knio, Omar
2017-05-05
The current project develops a novel approach that uses a probabilistic description to capture the current state of knowledge about the computational solution. To effectively spread the computational effort over multiple nodes, the global computational domain is split into many subdomains. Computational uncertainty in the solution translates into uncertain boundary conditions for the equation system to be solved on those subdomains, and many independent, concurrent subdomain simulations are used to account for this boundary condition uncertainty. By relying on the fact that solutions on neighboring subdomains must agree with each other, a more accurate estimate for the global solution can be achieved. Statistical approaches in this update process make it possible to account for the effect of system faults in the probabilistic description of the computational solution, and the associated uncertainty is reduced through successive iterations. By combining all of these elements, the probabilistic reformulation allows splitting the computational work over very many independent tasks for good scalability, while being robust to system faults.
Eddy, Sean R.
2008-01-01
Sequence database searches require accurate estimation of the statistical significance of scores. Optimal local sequence alignment scores follow Gumbel distributions, but determining an important parameter of the distribution (λ) requires time-consuming computational simulation. Moreover, optimal alignment scores are less powerful than probabilistic scores that integrate over alignment uncertainty (“Forward” scores), but the expected distribution of Forward scores remains unknown. Here, I conjecture that both expected score distributions have simple, predictable forms when full probabilistic modeling methods are used. For a probabilistic model of local sequence alignment, optimal alignment bit scores (“Viterbi” scores) are Gumbel-distributed with constant λ = log 2, and the high scoring tail of Forward scores is exponential with the same constant λ. Simulation studies support these conjectures over a wide range of profile/sequence comparisons, using 9,318 profile-hidden Markov models from the Pfam database. This enables efficient and accurate determination of expectation values (E-values) for both Viterbi and Forward scores for probabilistic local alignments. PMID:18516236
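Given the conjectured Gumbel tail with λ = log 2, significance estimation for bit scores reduces to a couple of lines. A sketch (the location parameter tau must still be fitted per profile and is a placeholder here):

```python
import math

LAMBDA = math.log(2.0)  # conjectured slope for bit scores

def pvalue(bit_score, tau=0.0):
    """P(S >= bit_score) under a Gumbel with slope log 2 and fitted location tau."""
    return 1.0 - math.exp(-math.exp(-LAMBDA * (bit_score - tau)))

def evalue(bit_score, n_targets, tau=0.0):
    """Expected number of hits at this score in a database of n_targets sequences."""
    return n_targets * pvalue(bit_score, tau)

print(evalue(30.0, 60000))  # ~5.6e-5: significant even against a large database
```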
Supermultiplicative Speedups of Probabilistic Model-Building Genetic Algorithms
2009-02-01
physicists as well as practitioners in evolutionary computation. The project was later extended to the one-dimensional SK spin glass with power-law... Brasil) 10. Yuji Sato (Hosei University, Japan) 11. Shunsuke Saruwatari (Tokyo University, Japan) 12. Jian-Hung Chen (Feng Chia University, Taiwan)...scalability. In A. Tiwari, J. Knowles, E. Avineri, K. Dahal, and R. Roy (Eds.), Applications of Soft Computing: Recent Trends. Berlin: Springer (2006)
Hydra: a scalable proteomic search engine which utilizes the Hadoop distributed computing framework
2012-01-01
Background For shotgun mass spectrometry based proteomics the most computationally expensive step is in matching the spectra against an increasingly large database of sequences and their post-translational modifications with known masses. Each mass spectrometer can generate data at an astonishingly high rate, and the scope of what is searched for is continually increasing. Therefore solutions for improving our ability to perform these searches are needed. Results We present a sequence database search engine that is specifically designed to run efficiently on the Hadoop MapReduce distributed computing framework. The search engine implements the K-score algorithm, generating comparable output for the same input files as the original implementation. The scalability of the system is shown, and the architecture required for the development of such distributed processing is discussed. Conclusion The software is scalable in its ability to handle a large peptide database, numerous modifications and large numbers of spectra. Performance scales with the number of processors in the cluster, allowing throughput to expand with the available resources. PMID:23216909
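The shape of such a search maps naturally onto MapReduce: the map step scores each spectrum against candidate peptides in its precursor-mass window, and the reduce step keeps the best match per spectrum. A toy in-memory stand-in for the Hadoop shuffle (the scoring function and mass tolerance below are placeholders, not the K-score):

```python
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    """Minimal single-process stand-in for Hadoop's map/shuffle/reduce."""
    groups = defaultdict(list)
    for rec in records:
        for key, value in mapper(rec):
            groups[key].append(value)
    return {k: reducer(vs) for k, vs in groups.items()}

def make_mapper(peptides, tol=0.5):
    def mapper(spectrum):
        sid, precursor_mass, peaks = spectrum
        for pep, mass in peptides:
            if abs(mass - precursor_mass) <= tol:  # candidate filtering by mass
                yield sid, (len(peaks), pep)       # placeholder score, not K-score
    return mapper

peptides = [("PEPTIDE", 799.4), ("PROTEIN", 798.9)]
spectra = [(1, 799.2, [101.1, 202.2]), (2, 555.5, [99.0])]
print(map_reduce(spectra, make_mapper(peptides), max))  # best candidate per spectrum
```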
Hydra: a scalable proteomic search engine which utilizes the Hadoop distributed computing framework.
Lewis, Steven; Csordas, Attila; Killcoyne, Sarah; Hermjakob, Henning; Hoopmann, Michael R; Moritz, Robert L; Deutsch, Eric W; Boyle, John
2012-12-05
For shotgun mass spectrometry based proteomics the most computationally expensive step is in matching the spectra against an increasingly large database of sequences and their post-translational modifications with known masses. Each mass spectrometer can generate data at an astonishingly high rate, and the scope of what is searched for is continually increasing. Therefore solutions for improving our ability to perform these searches are needed. We present a sequence database search engine that is specifically designed to run efficiently on the Hadoop MapReduce distributed computing framework. The search engine implements the K-score algorithm, generating comparable output for the same input files as the original implementation. The scalability of the system is shown, and the architecture required for the development of such distributed processing is discussed. The software is scalable in its ability to handle a large peptide database, numerous modifications and large numbers of spectra. Performance scales with the number of processors in the cluster, allowing throughput to expand with the available resources.
The composite load spectra project
NASA Technical Reports Server (NTRS)
Newell, J. F.; Ho, H.; Kurth, R. E.
1990-01-01
Probabilistic methods and generic load models capable of simulating the load spectra that are induced in space propulsion system components are being developed. Four engine component types (the transfer ducts, the turbine blades, the liquid oxygen posts and the turbopump oxidizer discharge duct) were selected as representative hardware examples. The composite load spectra that simulate the probabilistic loads for these components are typically used as the input loads for a probabilistic structural analysis. The knowledge-based system approach used for the composite load spectra project provides an ideal environment for incremental development. The intelligent database paradigm employed in developing the expert system provides a smooth coupling between the numerical processing and the symbolic (information) processing. Large volumes of engine load information and engineering data are stored in database format and managed by a database management system. Numerical procedures for probabilistic load simulation and database management functions are controlled by rule modules. Rules were hard-wired as decision trees into rule modules to perform process control tasks. There are modules to retrieve load information and models. There are modules to select loads and models to carry out quick load calculations or make an input file for full duty-cycle time dependent load simulation. The composite load spectra load expert system implemented today is capable of performing intelligent rocket engine load spectra simulation. Further development of the expert system will provide tutorial capability for users to learn from it.
Aldridge, Robert W; Shaji, Kunju; Hayward, Andrew C; Abubakar, Ibrahim
2015-01-01
The Enhanced Matching System (EMS) is a probabilistic record linkage program developed by the tuberculosis section at Public Health England to match data for individuals across two datasets. This paper outlines how EMS works and investigates its accuracy for linkage across public health datasets. EMS is a configurable Microsoft SQL Server database program. To examine the accuracy of EMS, two public health databases were matched using National Health Service (NHS) numbers as a gold standard unique identifier. Probabilistic linkage was then performed on the same two datasets without inclusion of NHS number. Sensitivity analyses were carried out to examine the effect of varying matching process parameters. Exact matching using NHS number between two datasets (containing 5931 and 1759 records) identified 1071 matched pairs. EMS probabilistic linkage identified 1068 record pairs. The sensitivity of probabilistic linkage was calculated as 99.5% (95%CI: 98.9, 99.8), specificity 100.0% (95%CI: 99.9, 100.0), positive predictive value 99.8% (95%CI: 99.3, 100.0), and negative predictive value 99.9% (95%CI: 99.8, 100.0). Probabilistic matching was most accurate when including address variables and using the automatically generated threshold for determining links with manual review. With the establishment of national electronic datasets across health and social care, EMS enables previously unanswerable research questions to be tackled with confidence in the accuracy of the linkage process. In scenarios where a small sample is being matched into a very large database (such as national records of hospital attendance) then, compared to results presented in this analysis, the positive predictive value or sensitivity may drop according to the prevalence of matches between databases. Despite this possible limitation, probabilistic linkage has great potential to be used where exact matching using a common identifier is not possible, including in low-income settings, and for vulnerable groups such as homeless populations, where the absence of unique identifiers and lower data quality has historically hindered the ability to identify individuals across datasets.
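The probabilistic machinery behind such a linker is typically Fellegi-Sunter match weights; a compact sketch follows (the m/u parameters are invented for illustration and are not EMS's configuration):

```python
import math

def field_weight(agree, m, u):
    """log2 likelihood ratio for one field: m = P(agree | match), u = P(agree | non-match)."""
    return math.log2(m / u) if agree else math.log2((1.0 - m) / (1.0 - u))

def match_weight(rec_a, rec_b, params):
    """Sum field weights; pairs above a threshold are declared (or reviewed as) links."""
    return sum(field_weight(rec_a[f] == rec_b[f], m, u) for f, (m, u) in params.items())

params = {"surname": (0.95, 0.01), "dob": (0.98, 0.002), "postcode": (0.90, 0.05)}
a = {"surname": "KHAN", "dob": "1980-01-02", "postcode": "N1 9GU"}
b = {"surname": "KHAN", "dob": "1980-01-02", "postcode": "E1 6AN"}
print(match_weight(a, b, params))  # ~12.3, well above 0: likely the same person
```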
The Eruption Forecasting Information System (EFIS) database project
NASA Astrophysics Data System (ADS)
Ogburn, Sarah; Harpel, Chris; Pesicek, Jeremy; Wellik, Jay; Pallister, John; Wright, Heather
2016-04-01
The Eruption Forecasting Information System (EFIS) project is a new initiative of the U.S. Geological Survey-USAID Volcano Disaster Assistance Program (VDAP) with the goal of enhancing VDAP's ability to forecast the outcome of volcanic unrest. The EFIS project seeks to: (1) Move away from relying on the collective memory to probability estimation using databases (2) Create databases useful for pattern recognition and for answering common VDAP questions; e.g. how commonly does unrest lead to eruption? how commonly do phreatic eruptions portend magmatic eruptions and what is the range of antecedence times? (3) Create generic probabilistic event trees using global data for different volcano 'types' (4) Create background, volcano-specific, probabilistic event trees for frequently active or particularly hazardous volcanoes in advance of a crisis (5) Quantify and communicate uncertainty in probabilities A major component of the project is the global EFIS relational database, which contains multiple modules designed to aid in the construction of probabilistic event trees and to answer common questions that arise during volcanic crises. The primary module contains chronologies of volcanic unrest, including the timing of phreatic eruptions, column heights, eruptive products, etc. and will be initially populated using chronicles of eruptive activity from Alaskan volcanic eruptions in the GeoDIVA database (Cameron et al. 2013). This database module allows us to query across other global databases such as the WOVOdat database of monitoring data and the Smithsonian Institution's Global Volcanism Program (GVP) database of eruptive histories and volcano information. The EFIS database is in the early stages of development and population; thus, this contribution also serves as a request for feedback from the community.
DESIGNING ENVIRONMENTAL MONITORING DATABASES FOR STATISTIC ASSESSMENT
Databases designed for statistical analyses have characteristics that distinguish them from databases intended for general use. EMAP uses a probabilistic sampling design to collect data to produce statistical assessments of environmental conditions. In addition to supporting the ...
Lee, Ken Ka-Yin; Tang, Wai-Choi; Choi, Kup-Sze
2013-04-01
Clinical data are dynamic in nature, often arranged hierarchically and stored as free text and numbers. Effective management of clinical data and the transformation of the data into structured format for data analysis are therefore challenging issues in electronic health records development. Despite the popularity of relational databases, the scalability of the NoSQL database model and the document-centric data structure of XML databases appear to be promising features for effective clinical data management. In this paper, three database approaches--NoSQL, XML-enabled and native XML--are investigated to evaluate their suitability for structured clinical data. The database query performance is reported, together with our experience in the databases development. The results show that NoSQL database is the best choice for query speed, whereas XML databases are advantageous in terms of scalability, flexibility and extensibility, which are essential to cope with the characteristics of clinical data. While NoSQL and XML technologies are relatively new compared to the conventional relational database, both of them demonstrate potential to become a key database technology for clinical data management as the technology further advances. Copyright © 2012 Elsevier Ireland Ltd. All rights reserved.
Probabilistic Risk Assessment: A Bibliography
NASA Technical Reports Server (NTRS)
2000-01-01
Probabilistic risk analysis is an integration of failure modes and effects analysis (FMEA), fault tree analysis, and other techniques to assess the potential for failure and to find ways to reduce risk. This bibliography references 160 documents in the NASA STI Database that contain the major concepts, probabilistic risk assessment, risk and probability theory, in the basic index or major subject terms. An abstract is included with most citations, followed by the applicable subject terms.
ERIC Educational Resources Information Center
Lundquist, Carol; Frieder, Ophir; Holmes, David O.; Grossman, David
1999-01-01
Describes a scalable, parallel, relational database-driven information retrieval engine. To support portability across a wide range of execution environments, all algorithms adhere to the SQL-92 standard. By incorporating relevance feedback algorithms, accuracy is enhanced over prior database-driven information retrieval efforts. Presents…
ERIC Educational Resources Information Center
Kim, Deok-Hwan; Chung, Chin-Wan
2003-01-01
Discusses the collection fusion problem of image databases, concerned with retrieving relevant images by content based retrieval from image databases distributed on the Web. Focuses on a metaserver which selects image databases supporting similarity measures and proposes a new algorithm which exploits a probabilistic technique using Bayesian…
NASA Astrophysics Data System (ADS)
Jing, Changfeng; Liang, Song; Ruan, Yong; Huang, Jie
2008-10-01
During the urbanization process, when facing the complex requirements of city development, ever-growing urban data, the rapid development of planning business, and increasing planning complexity, a scalable, extensible urban planning management information system is urgently needed. PM2006 is such a system. In response to the status and problems in urban planning, the scalability and extensibility of PM2006 are presented: business-oriented workflow extensibility, scalability of its DLL-based architecture, flexibility across GIS and database platforms, scalability of data updating and maintenance, and so on. It is verified that the PM2006 system has good extensibility and scalability, can meet the requirements of all levels of administrative divisions, and can adapt to ever-growing changes in urban planning business. At the end of this paper, the application of PM2006 in the Urban Planning Bureau of Suzhou city is described.
The CEBAF Element Database and Related Operational Software
DOE Office of Scientific and Technical Information (OSTI.GOV)
Larrieu, Theodore; Slominski, Christopher; Keesee, Marie
The newly commissioned 12 GeV CEBAF accelerator relies on a flexible, scalable and comprehensive database to define the accelerator. This database delivers the configuration for CEBAF operational tools, including hardware checkout, the downloadable optics model, control screens, and much more. The presentation will describe the flexible design of the CEBAF Element Database (CED), its features, and assorted use-case examples.
Effects of distributed database modeling on evaluation of transaction rollbacks
NASA Technical Reports Server (NTRS)
Mukkamala, Ravi
1991-01-01
Data distribution, degree of data replication, and transaction access patterns are key factors in determining the performance of distributed database systems. In order to simplify the evaluation of performance measures, database designers and researchers tend to make simplistic assumptions about the system. The effect of modeling assumptions is studied on the evaluation of one such measure, the number of transaction rollbacks, in a partitioned distributed database system. Six probabilistic models are developed, along with expressions for the number of rollbacks under each model. Essentially, the models differ in terms of the available system information. The analytical results so obtained are compared to results from simulation. From this, it is concluded that most of the probabilistic models yield overly conservative estimates of the number of rollbacks. The effect of transaction commutativity on system throughput is also grossly undermined when such models are employed.
Effects of distributed database modeling on evaluation of transaction rollbacks
NASA Technical Reports Server (NTRS)
Mukkamala, Ravi
1991-01-01
Data distribution, degree of data replication, and transaction access patterns are key factors in determining the performance of distributed database systems. In order to simplify the evaluation of performance measures, database designers and researchers tend to make simplistic assumptions about the system. Here, researchers investigate the effect of modeling assumptions on the evaluation of one such measure, the number of transaction rollbacks in a partitioned distributed database system. The researchers developed six probabilistic models and expressions for the number of rollbacks under each of these models. Essentially, the models differ in terms of the available system information. The analytical results obtained are compared to results from simulation. It was concluded that most of the probabilistic models yield overly conservative estimates of the number of rollbacks. The effect of transaction commutativity on system throughput is also grossly undermined when such models are employed.
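The simplest model in this family treats transactions as uniform random access sets, for which the chance that two transactions touch a common item (a crude proxy for conflict and rollback risk) is hypergeometric. A sketch of that baseline (our own simplification, not a reproduction of the paper's six models):

```python
from math import comb

def p_overlap(n_items, k_a, k_b):
    """P(two uniformly random access sets of sizes k_a and k_b intersect)."""
    return 1.0 - comb(n_items - k_a, k_b) / comb(n_items, k_b)

# 10,000 items, two transactions touching 50 items each:
print(round(p_overlap(10_000, 50, 50), 3))  # ~0.222
```

Richer models replace the uniform-access assumption with whatever system information is available (partitioning, replication, access skew), which is exactly the axis along which the paper's six models differ.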
Rivas, Elena; Lang, Raymond; Eddy, Sean R
2012-02-01
The standard approach for single-sequence RNA secondary structure prediction uses a nearest-neighbor thermodynamic model with several thousand experimentally determined energy parameters. An attractive alternative is to use statistical approaches with parameters estimated from growing databases of structural RNAs. Good results have been reported for discriminative statistical methods using complex nearest-neighbor models, including CONTRAfold, Simfold, and ContextFold. Little work has been reported on generative probabilistic models (stochastic context-free grammars [SCFGs]) of comparable complexity, although probabilistic models are generally easier to train and to use. To explore a range of probabilistic models of increasing complexity, and to directly compare probabilistic, thermodynamic, and discriminative approaches, we created TORNADO, a computational tool that can parse a wide spectrum of RNA grammar architectures (including the standard nearest-neighbor model and more) using a generalized super-grammar that can be parameterized with probabilities, energies, or arbitrary scores. By using TORNADO, we find that probabilistic nearest-neighbor models perform comparably to (but not significantly better than) discriminative methods. We find that complex statistical models are prone to overfitting RNA structure and that evaluations should use structurally nonhomologous training and test data sets. Overfitting has affected at least one published method (ContextFold). The most important barrier to improving statistical approaches for RNA secondary structure prediction is the lack of diversity of well-curated single-sequence RNA secondary structures in current RNA databases.
Rivas, Elena; Lang, Raymond; Eddy, Sean R.
2012-01-01
The standard approach for single-sequence RNA secondary structure prediction uses a nearest-neighbor thermodynamic model with several thousand experimentally determined energy parameters. An attractive alternative is to use statistical approaches with parameters estimated from growing databases of structural RNAs. Good results have been reported for discriminative statistical methods using complex nearest-neighbor models, including CONTRAfold, Simfold, and ContextFold. Little work has been reported on generative probabilistic models (stochastic context-free grammars [SCFGs]) of comparable complexity, although probabilistic models are generally easier to train and to use. To explore a range of probabilistic models of increasing complexity, and to directly compare probabilistic, thermodynamic, and discriminative approaches, we created TORNADO, a computational tool that can parse a wide spectrum of RNA grammar architectures (including the standard nearest-neighbor model and more) using a generalized super-grammar that can be parameterized with probabilities, energies, or arbitrary scores. By using TORNADO, we find that probabilistic nearest-neighbor models perform comparably to (but not significantly better than) discriminative methods. We find that complex statistical models are prone to overfitting RNA structure and that evaluations should use structurally nonhomologous training and test data sets. Overfitting has affected at least one published method (ContextFold). The most important barrier to improving statistical approaches for RNA secondary structure prediction is the lack of diversity of well-curated single-sequence RNA secondary structures in current RNA databases. PMID:22194308
A scalable healthcare information system based on a service-oriented architecture.
Yang, Tzu-Hsiang; Sun, Yeali S; Lai, Feipei
2011-06-01
Many existing healthcare information systems are composed of a number of heterogeneous systems and face the important issue of system scalability. This paper first describes the comprehensive healthcare information systems used in National Taiwan University Hospital (NTUH) and then presents a service-oriented architecture (SOA)-based healthcare information system (HIS) based on the service standard HL7. The proposed architecture focuses on system scalability, in terms of both hardware and software. Moreover, we describe how scalability is implemented in rightsizing, service groups, databases, and hardware scalability. Although SOA-based systems sometimes display poor performance, in a performance evaluation of our SOA-based HIS, the average response times for the outpatient, inpatient, and emergency HL7Central systems are 0.035, 0.04, and 0.036 s, respectively. The outpatient, inpatient, and emergency WebUI average response times are 0.79, 1.25, and 0.82 s. The scalability of the rightsizing project and our evaluation results provide evidence that the proposed SOA HIS can deliver system scalability and sustainability in a highly demanding healthcare information system.
Windows on the brain: the emerging role of atlases and databases in neuroscience
NASA Technical Reports Server (NTRS)
Van Essen, David C.; VanEssen, D. C. (Principal Investigator)
2002-01-01
Brain atlases and associated databases have great potential as gateways for navigating, accessing, and visualizing a wide range of neuroscientific data. Recent progress towards realizing this potential includes the establishment of probabilistic atlases, surface-based atlases and associated databases, combined with improvements in visualization capabilities and internet access.
A probabilistic approach to information retrieval in heterogeneous databases
DOE Office of Scientific and Technical Information (OSTI.GOV)
Chatterjee, A.; Segev, A.
During the past decade, organizations have increased their scope and operations beyond their traditional geographic boundaries. At the same time, they have adopted heterogeneous and incompatible information systems independent of each other, without careful consideration that one day they might need to be integrated. As a result of this diversity, many important business applications today require access to data stored in multiple autonomous databases. This paper examines a problem of inter-database information retrieval in a heterogeneous environment, where conventional techniques are no longer efficient. To solve the problem, broader definitions for the join, union, intersection, and selection operators are proposed. Also, a probabilistic method to specify the selectivity of these operators is discussed. An algorithm to compute these probabilities is provided in pseudocode.
Huang, Yi-Fei; Gulko, Brad; Siepel, Adam
2017-04-01
Many genetic variants that influence phenotypes of interest are located outside of protein-coding genes, yet existing methods for identifying such variants have poor predictive power. Here we introduce a new computational method, called LINSIGHT, that substantially improves the prediction of noncoding nucleotide sites at which mutations are likely to have deleterious fitness consequences, and which, therefore, are likely to be phenotypically important. LINSIGHT combines a generalized linear model for functional genomic data with a probabilistic model of molecular evolution. The method is fast and highly scalable, enabling it to exploit the 'big data' available in modern genomics. We show that LINSIGHT outperforms the best available methods in identifying human noncoding variants associated with inherited diseases. In addition, we apply LINSIGHT to an atlas of human enhancers and show that the fitness consequences at enhancers depend on cell type, tissue specificity, and constraints at associated promoters.
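The generalized-linear-model half of such a method can be pictured in one line: a sigmoid of a weighted sum of functional genomic features per site (a schematic only; LINSIGHT's weights are learned jointly with its probabilistic evolutionary model, which is not shown, and the feature names below are illustrative):

```python
import numpy as np

def site_score(features, weights, bias=0.0):
    """Map a site's functional-genomic feature vector to P(mutation is deleterious)."""
    return 1.0 / (1.0 + np.exp(-(features @ weights + bias)))

features = np.array([0.8, 0.1, 1.0])  # e.g. conservation, DNase, TF-binding signals
weights = np.array([2.0, 0.5, 1.5])
print(site_score(features, weights))   # ~0.96
```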
HBLAST: Parallelised sequence similarity--A Hadoop MapReducable basic local alignment search tool.
O'Driscoll, Aisling; Belogrudov, Vladislav; Carroll, John; Kropp, Kai; Walsh, Paul; Ghazal, Peter; Sleator, Roy D
2015-04-01
The recent exponential growth of genomic databases has resulted in the common task of sequence alignment becoming one of the major bottlenecks in the field of computational biology. It is typical for these large datasets and complex computations to require cost prohibitive High Performance Computing (HPC) to function. As such, parallelised solutions have been proposed but many exhibit scalability limitations and are incapable of effectively processing "Big Data" - the name attributed to datasets that are extremely large, complex and require rapid processing. The Hadoop framework, comprised of distributed storage and a parallelised programming framework known as MapReduce, is specifically designed to work with such datasets but it is not trivial to efficiently redesign and implement bioinformatics algorithms according to this paradigm. The parallelisation strategy of "divide and conquer" for alignment algorithms can be applied to both data sets and input query sequences. However, scalability is still an issue due to memory constraints or large databases, with very large database segmentation leading to additional performance decline. Herein, we present Hadoop Blast (HBlast), a parallelised BLAST algorithm that proposes a flexible method to partition both databases and input query sequences using "virtual partitioning". HBlast presents improved scalability over existing solutions and well balanced computational work load while keeping database segmentation and recompilation to a minimum. Enhanced BLAST search performance on cheap memory constrained hardware has significant implications for in field clinical diagnostic testing; enabling faster and more accurate identification of pathogenic DNA in human blood or tissue samples. Copyright © 2015 Elsevier Inc. All rights reserved.
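"Virtual partitioning" amounts to scheduling the cross product of query chunks and database segments across workers without physically re-segmenting the database. A toy sketch of such a schedule (round-robin assignment is our assumption, not necessarily HBlast's policy):

```python
from itertools import product

def schedule(n_query_chunks, n_db_segments, n_workers):
    """Assign every (query chunk, DB segment) pair to a worker, round-robin."""
    tasks = list(product(range(n_query_chunks), range(n_db_segments)))
    return {w: tasks[w::n_workers] for w in range(n_workers)}

print(schedule(3, 2, 4))
# {0: [(0, 0), (2, 0)], 1: [(0, 1), (2, 1)], 2: [(1, 0)], 3: [(1, 1)]}
```

Because the partitioning is logical rather than physical, the same database copy serves differently sized grids of tasks, which keeps segmentation and recompilation to a minimum while balancing load.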
Unbiased, scalable sampling of protein loop conformations from probabilistic priors.
Zhang, Yajia; Hauser, Kris
2013-01-01
Protein loops are flexible structures that are intimately tied to function, but understanding loop motion and generating loop conformation ensembles remain significant computational challenges. Discrete search techniques scale poorly to large loops, optimization and molecular dynamics techniques are prone to local minima, and inverse kinematics techniques can only incorporate structural preferences in ad hoc fashion. This paper presents Sub-Loop Inverse Kinematics Monte Carlo (SLIKMC), a new Markov chain Monte Carlo algorithm for generating conformations of closed loops according to experimentally available, heterogeneous structural preferences. Our simulation experiments demonstrate that the method computes high-scoring conformations of large loops (>10 residues) orders of magnitude faster than standard Monte Carlo and discrete search techniques. Two new developments contribute to the scalability of the new method. First, structural preferences are specified via a probabilistic graphical model (PGM) that links conformation variables, spatial variables (e.g., atom positions), constraints and prior information in a unified framework. The method uses a sparse PGM that exploits locality of interactions between atoms and residues. Second, a novel method for sampling sub-loops is developed to generate statistically unbiased samples of probability densities restricted by loop-closure constraints. Numerical experiments confirm that SLIKMC generates conformation ensembles that are statistically consistent with specified structural preferences. Protein conformations with 100+ residues are sampled on standard PC hardware in seconds. Application to proteins involved in ion-binding demonstrates its potential as a tool for loop ensemble generation and missing structure completion.
Unbiased, scalable sampling of protein loop conformations from probabilistic priors
2013-01-01
Background Protein loops are flexible structures that are intimately tied to function, but understanding loop motion and generating loop conformation ensembles remain significant computational challenges. Discrete search techniques scale poorly to large loops, optimization and molecular dynamics techniques are prone to local minima, and inverse kinematics techniques can only incorporate structural preferences in ad hoc fashion. This paper presents Sub-Loop Inverse Kinematics Monte Carlo (SLIKMC), a new Markov chain Monte Carlo algorithm for generating conformations of closed loops according to experimentally available, heterogeneous structural preferences. Results Our simulation experiments demonstrate that the method computes high-scoring conformations of large loops (>10 residues) orders of magnitude faster than standard Monte Carlo and discrete search techniques. Two new developments contribute to the scalability of the new method. First, structural preferences are specified via a probabilistic graphical model (PGM) that links conformation variables, spatial variables (e.g., atom positions), constraints and prior information in a unified framework. The method uses a sparse PGM that exploits locality of interactions between atoms and residues. Second, a novel method for sampling sub-loops is developed to generate statistically unbiased samples of probability densities restricted by loop-closure constraints. Conclusion Numerical experiments confirm that SLIKMC generates conformation ensembles that are statistically consistent with specified structural preferences. Protein conformations with 100+ residues are sampled on standard PC hardware in seconds. Application to proteins involved in ion-binding demonstrates its potential as a tool for loop ensemble generation and missing structure completion. PMID:24565175
Fast and Scalable Gaussian Process Modeling with Applications to Astronomical Time Series
NASA Astrophysics Data System (ADS)
Foreman-Mackey, Daniel; Agol, Eric; Ambikasaran, Sivaram; Angus, Ruth
2017-12-01
The growing field of large-scale time domain astronomy requires methods for probabilistic data analysis that are computationally tractable, even with large data sets. Gaussian processes (GPs) are a popular class of models used for this purpose, but since the computational cost scales, in general, as the cube of the number of data points, their application has been limited to small data sets. In this paper, we present a novel method for GP modeling in one dimension where the computational requirements scale linearly with the size of the data set. We demonstrate the method by applying it to simulated and real astronomical time series data sets. These demonstrations are examples of probabilistic inference of stellar rotation periods, asteroseismic oscillation spectra, and transiting planet parameters. The method exploits structure in the problem when the covariance function is expressed as a mixture of complex exponentials, without requiring evenly spaced observations or uniform noise. This form of covariance arises naturally when the process is a mixture of stochastically driven damped harmonic oscillators—providing a physical motivation for and interpretation of this choice—but we also demonstrate that it can be a useful effective model in some other cases. We present a mathematical description of the method and compare it to existing scalable GP methods. The method is fast and interpretable, with a range of potential applications within astronomical data analysis and beyond. We provide well-tested and documented open-source implementations of this method in C++, Python, and Julia.
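For reference, using the authors' open-source implementation looks roughly like the following (a sketch based on the celerite Python package; the argument names and SHOTerm parameterization are written from memory and should be checked against the package documentation):

```python
import numpy as np
import celerite
from celerite import terms

t = np.sort(np.random.uniform(0, 10, 200))   # unevenly spaced observation times
yerr = 0.1 * np.ones_like(t)
y = np.sin(t) + yerr * np.random.randn(len(t))

# A stochastically driven, damped harmonic oscillator term:
kernel = terms.SHOTerm(log_S0=0.0, log_Q=np.log(10.0), log_omega0=0.0)
gp = celerite.GP(kernel)
gp.compute(t, yerr)                          # O(N) factorization
print(gp.log_likelihood(y))
```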
Multi-Resolution Playback of Network Trace Files
2015-06-01
a complete MySQL database, C++ developer tools and the libraries utilized in the development of the system (Boost and Libcrafter), and Wireshark... XE suite has a limit on the allowed size of each database. In order to be scalable, the project had to switch to the MySQL database suite. The... programs that access the database use the MySQL C++ connector, provided by Oracle, and the supplied methods and libraries.
Decerns: A framework for multi-criteria decision analysis
Yatsalo, Boris; Didenko, Vladimir; Gritsyuk, Sergey; ...
2015-02-27
A new framework, Decerns, for multicriteria decision analysis (MCDA) of a wide range of practical risk-management problems is introduced. The Decerns framework contains a library of modules that form the basis for two scalable systems: DecernsMCDA, for analysis of multicriteria problems, and DecernsSDSS, for multicriteria analysis of spatial options. DecernsMCDA includes well-known MCDA methods as well as original methods for uncertainty treatment based on probabilistic approaches and fuzzy numbers. These MCDA methods are described along with a case study on analysis of a multicriteria location problem.
Probabilistic liquefaction triggering based on the cone penetration test
Moss, R.E.S.; Seed, R.B.; Kayen, R.E.; Stewart, J.P.; Tokimatsu, K.
2005-01-01
Performance-based earthquake engineering requires a probabilistic treatment of potential failure modes in order to accurately quantify the overall stability of the system. This paper is a summary of the application portions of the probabilistic liquefaction triggering correlations recently proposed by Moss and co-workers. To enable probabilistic treatment of liquefaction triggering, the variables comprising the seismic load and the liquefaction resistance were treated as inherently uncertain. Supporting data from an extensive Cone Penetration Test (CPT)-based liquefaction case history database were used to develop a probabilistic correlation. The methods used to measure the uncertainty of the load and resistance variables, how the interactions of these variables were treated using Bayesian updating, and how reliability analysis was applied to produce curves of equal probability of liquefaction are presented. The normalization for effective overburden stress, the magnitude correlated duration weighting factor, and the non-linear shear mass participation factor used are also discussed.
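The paper's specific CPT-based correlation is not reproduced here, but the underlying reliability idea, the probability that uncertain liquefaction resistance falls below uncertain seismic load, can be sketched with a generic Monte Carlo calculation (all distribution parameters are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical lognormal seismic load (CSR) and liquefaction resistance (CRR);
# medians and log-standard deviations are illustrative, not from the paper.
csr = rng.lognormal(mean=np.log(0.25), sigma=0.35, size=n)
crr = rng.lognormal(mean=np.log(0.30), sigma=0.45, size=n)

# Probability of liquefaction = P(resistance < load).
p_liq = np.mean(crr < csr)
print(f"P(liquefaction) ~ {p_liq:.3f}")
```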
Integration of Information Retrieval and Database Management Systems.
ERIC Educational Resources Information Center
Deogun, Jitender S.; Raghavan, Vijay V.
1988-01-01
Discusses the motivation for integrating information retrieval and database management systems, and proposes a probabilistic retrieval model in which records in a file may be composed of attributes (formatted data items) and descriptors (content indicators). The details and resolutions of difficulties involved in integrating such systems are…
Automated Database Schema Design Using Mined Data Dependencies.
ERIC Educational Resources Information Center
Wong, S. K. M.; Butz, C. J.; Xiang, Y.
1998-01-01
Describes a bottom-up procedure for discovering multivalued dependencies in observed data without knowing a priori the relationships among the attributes. The proposed algorithm is an application of technique designed for learning conditional independencies in probabilistic reasoning; a prototype system for automated database schema design has…
Performances of the PIPER scalable child human body model in accident reconstruction
Giordano, Chiara; Kleiven, Svein
2017-01-01
Human body models (HBMs) have the potential to provide significant insights into the pediatric response to impact. This study describes a scalable/posable approach to performing child accident reconstructions using the Position and Personalize Advanced Human Body Models for Injury Prediction (PIPER) scalable child HBM of different ages and in different positions obtained by the PIPER tool. Overall, the PIPER scalable child HBM managed reasonably well to predict the injury severity and location of the children involved in real-life crash scenarios documented in the medical records. The developed methodology and workflow are essential for future work to determine child injury tolerances based on the full Child Advanced Safety Project for European Roads (CASPER) accident reconstruction database. With the workflow presented in this study, the open-source PIPER scalable HBM combined with the PIPER tool is also foreseen to have implications for improved safety designs for better protection of children in traffic accidents. PMID:29135997
Identifying work-related motor vehicle crashes in multiple databases.
Thomas, Andrea M; Thygerson, Steven M; Merrill, Ray M; Cook, Lawrence J
2012-01-01
To compare and estimate the magnitude of work-related motor vehicle crashes in Utah using 2 probabilistically linked statewide databases. Data from 2006 and 2007 motor vehicle crash and hospital databases were joined through probabilistic linkage. Summary statistics and capture-recapture were used to describe occupants injured in work-related motor vehicle crashes and estimate the size of this population. There were 1597 occupants in the motor vehicle crash database and 1673 patients in the hospital database identified as being in a work-related motor vehicle crash. We identified 1443 occupants with at least one record from either the motor vehicle crash or hospital database indicating work-relatedness that linked to any record in the opposing database. We found that 38.7 percent of occupants injured in work-related motor vehicle crashes identified in the motor vehicle crash database did not have a primary payer code of workers' compensation in the hospital database, and 40.0 percent of patients injured in work-related motor vehicle crashes identified in the hospital database did not meet our definition of a work-related motor vehicle crash in the motor vehicle crash database. Depending on how occupants injured in work-related motor vehicle crashes are identified, we estimate the population to be between 1852 and 8492 in Utah for the years 2006 and 2007. Research on single databases may lead to biased interpretations of work-related motor vehicle crashes. Combining 2 population-based databases may still result in an underestimate of the magnitude of work-related motor vehicle crashes. Improved coding of work-related incidents is needed in current databases.
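In its simplest two-source form, the capture-recapture step reduces to the Lincoln-Petersen estimator N = n1*n2/m; treating the abstract's 1443 linked cases as the two-source overlap (a simplification) reproduces its lower estimate of 1852:

```python
# Lincoln-Petersen capture-recapture estimate of the total population from
# two overlapping sources; counts are taken from the abstract.
n1 = 1597   # occupants flagged work-related in the crash database
n2 = 1673   # patients flagged work-related in the hospital database
m = 1443    # cases captured by both sources via probabilistic linkage

n_hat = n1 * n2 / m
print(f"Estimated work-related MVC population: {n_hat:.0f}")  # ~1852
```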
2012-11-27
with powerful analysis tools and an informatics approach leveraging best-of-breed NoSQL databases, in order to store, search and retrieve relevant...dictionaries, and JavaScript also has good support. The MongoDB project[15] was chosen as a scalable NoSQL data store for the cheminformatics components
Improving the Scalability of an Exact Approach for Frequent Item Set Hiding
ERIC Educational Resources Information Center
LaMacchia, Carolyn
2013-01-01
Technological advances have led to the generation of large databases of organizational data recognized as an information-rich, strategic asset for internal analysis and sharing with trading partners. Data mining techniques can discover patterns in large databases including relationships considered strategically relevant to the owner of the data.…
Durham, Erin-Elizabeth A; Yu, Xiaxia; Harrison, Robert W
2014-12-01
Effective machine-learning handles large datasets efficiently. One key feature of handling large data is the use of databases such as MySQL. The freeware fuzzy decision tree induction tool, FDT, is a scalable supervised-classification software tool implementing fuzzy decision trees. It is based on an optimized fuzzy ID3 (FID3) algorithm. FDT 2.0 improves upon FDT 1.0 by bridging the gap between data science and data engineering: it combines a robust decisioning tool with data retention for future decisions, so that the tool does not need to be recalibrated from scratch every time a new decision is required. In this paper we briefly review the analytical capabilities of the freeware FDT tool and its major features and functionalities; examples of large biological datasets from HIV, microRNAs and sRNAs are included. This work shows how to integrate fuzzy decision algorithms with modern database technology. In addition, we show that integrating the fuzzy decision tree induction tool with database storage allows for optimal user satisfaction in today's Data Analytics world.
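A minimal sketch of the fuzzy ID3 scoring at the heart of such tools: examples belong to branches with membership degrees in [0, 1], and fuzzy entropy weights class contributions by membership. This is a generic illustration, not FDT's exact implementation; all values are toy.

```python
import numpy as np

def fuzzy_entropy(memberships, labels):
    # Entropy with class "counts" replaced by summed membership degrees.
    H, total = 0.0, memberships.sum()
    for c in set(labels):
        p = memberships[labels == c].sum() / total
        if p > 0:
            H -= p * np.log2(p)
    return H

labels = np.array(["resistant", "resistant", "susceptible", "susceptible"])
mu_root = np.ones(4)                      # full membership at the root
mu_low = np.array([0.9, 0.7, 0.2, 0.1])   # fuzzy split: "low" branch
mu_high = 1.0 - mu_low                    # and its complementary "high" branch

gain = fuzzy_entropy(mu_root, labels) - sum(
    (mu.sum() / mu_root.sum()) * fuzzy_entropy(mu, labels)
    for mu in (mu_low, mu_high))
print(f"fuzzy information gain: {gain:.3f}")
```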
Privacy-Aware Location Database Service for Granular Queries
NASA Astrophysics Data System (ADS)
Kiyomoto, Shinsaku; Martin, Keith M.; Fukushima, Kazuhide
Future mobile markets are expected to increasingly embrace location-based services. This paper presents a new system architecture for location-based services, which consists of a location database and distributed location anonymizers. The service is privacy-aware in the sense that the location database always maintains a degree of anonymity. The location database service permits three different levels of query and can thus be used to implement a wide range of location-based services. Furthermore, the architecture is scalable and employs simple functions that are similar to those found in general database systems.
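A minimal sketch (not the paper's protocol) of the three-level query idea: the location database answers queries by snapping positions to grids of decreasing cell size, so higher levels reveal finer locations. Grid sizes and names are assumptions.

```python
LEVELS = {1: 1.0, 2: 0.1, 3: 0.01}  # grid cell size in degrees per query level

_db = {}  # user id -> (lat, lon), held by the location database

def update(uid, lat, lon):
    _db[uid] = (lat, lon)

def query(uid, level):
    """Return the user's grid cell at the requested granularity level."""
    cell = LEVELS[level]
    lat, lon = _db[uid]
    # Coarsen by snapping to the cell grid; finer levels reveal more.
    return (round(lat // cell * cell, 6), round(lon // cell * cell, 6))

update("alice", 35.6895, 139.6917)
print(query("alice", 1))  # coarse: (35.0, 139.0)
print(query("alice", 3))  # fine:   (35.68, 139.69)
```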
Space Situational Awareness Data Processing Scalability Utilizing Google Cloud Services
NASA Astrophysics Data System (ADS)
Greenly, D.; Duncan, M.; Wysack, J.; Flores, F.
Space Situational Awareness (SSA) is a fundamental and critical component of current space operations. The term SSA encompasses the awareness, understanding and predictability of all objects in space. As the population of orbital space objects and debris increases, the number of collision avoidance maneuvers grows and prompts the need for accurate and timely process measures. The SSA mission continually evolves to near real-time assessment and analysis, demanding higher processing capabilities. By conventional methods, meeting these demands requires the integration of new hardware to keep pace with the growing complexity of maneuver planning algorithms. SpaceNav has implemented a highly scalable architecture that will track satellites and debris by utilizing powerful virtual machines on the Google Cloud Platform. SpaceNav algorithms for processing CDMs outpace conventional means. A robust processing environment for tracking data, collision avoidance maneuvers and various other aspects of SSA can be created and deleted on demand. The migration of SpaceNav tools and algorithms into the Google Cloud Platform will be discussed, along with the trials and tribulations involved. Information will be shared on how and why certain cloud products were used, as well as integration techniques that were implemented. Key items to be presented are: 1. Scientific algorithms and SpaceNav tools integrated into a scalable architecture: a) Maneuver Planning; b) Parallel Processing; c) Monte Carlo Simulations; d) Optimization Algorithms; e) SW Application Development/Integration into the Google Cloud Platform. 2. Compute Engine Processing: a) Application Engine Automated Processing; b) Performance Testing and Performance Scalability; c) Cloud MySQL Databases and Database Scalability; d) Cloud Data Storage; e) Redundancy and Availability.
A Bayesian network approach to the database search problem in criminal proceedings
2012-01-01
Background The ‘database search problem’, that is, the strengthening of a case - in terms of probative value - against an individual who is found as a result of a database search, has been approached during the last two decades with substantial mathematical analyses, accompanied by lively debate and centrally opposing conclusions. This represents a challenging obstacle in teaching but also hinders a balanced and coherent discussion of the topic within the wider scientific and legal community. This paper revisits and tracks the associated mathematical analyses in terms of Bayesian networks. Their derivation and discussion for capturing probabilistic arguments that explain the database search problem are outlined in detail. The resulting Bayesian networks offer a distinct view on the main debated issues, along with further clarity. Methods As a general framework for representing and analyzing formal arguments in probabilistic reasoning about uncertain target propositions (that is, whether or not a given individual is the source of a crime stain), this paper relies on graphical probability models, in particular, Bayesian networks. This graphical probability modeling approach is used to capture, within a single model, a series of key variables, such as the number of individuals in a database, the size of the population of potential crime stain sources, and the rarity of the corresponding analytical characteristics in a relevant population. Results This paper demonstrates the feasibility of deriving Bayesian network structures for analyzing, representing, and tracking the database search problem. The output of the proposed models can be shown to agree with existing but exclusively formulaic approaches. Conclusions The proposed Bayesian networks allow one to capture and analyze the currently most well-supported but reputedly counter-intuitive and difficult solution to the database search problem in a way that goes beyond the traditional, purely formulaic expressions. The method’s graphical environment, along with its computational and probabilistic architectures, represents a rich package that offers analysts and discussants additional modes of interaction, concise representation, and coherent communication. PMID:22849390
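A numeric sketch of the classic calculation behind the debate (one textbook-style formulation, not the paper's Bayesian networks): with a uniform prior over N potential sources and a database search that excludes n-1 people, the posterior for the single matching individual depends on the random match probability gamma. All values are illustrative assumptions.

```python
gamma = 1e-6      # random match probability of the profile (assumed)
N = 1_000_000     # size of the population of potential sources (assumed)
n = 100_000       # size of the searched database (assumed)

# Flat prior: each of the N individuals is the source with probability 1/N.
# The search excludes n-1 database members; each of the N-n untested people
# would match with probability gamma, giving the "island"-style posterior.
posterior = 1.0 / (1.0 + (N - n) * gamma)
print(f"P(source | one match in database) ~ {posterior:.3f}")  # ~0.526
```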
Information Security Considerations for Applications Using Apache Accumulo
2014-09-01
Distributed File System INSCOM United States Army Intelligence and Security Command JPA Java Persistence API JSON JavaScript Object Notation MAC Mandatory... MySQL [13]. BigTable can process 20 petabytes per day [14]. High degree of scalability on commodity hardware. NoSQL databases do not rely on highly...manipulation in relational databases. NoSQL databases each have a unique programming interface that uses a lower level procedural language (e.g., Java
NASA Astrophysics Data System (ADS)
Lange, Rense
2015-02-01
An extension of concurrent validity is proposed that uses qualitative data for the purpose of validating quantitative measures. The approach relies on Latent Semantic Analysis (LSA), which places verbal (written) statements in a high-dimensional semantic space. Using data from a medical/psychiatric domain as a case study - Near Death Experiences, or NDE - we established concurrent validity by connecting NDErs' qualitative (written) experiential accounts with their locations on a Rasch scalable measure of NDE intensity. Concurrent validity received strong empirical support since the variance in the Rasch measures could be predicted reliably from the coordinates of their accounts in the LSA-derived semantic space (R2 = 0.33). These coordinates also predicted NDErs' age with considerable precision (R2 = 0.25). Both estimates are probably artificially low due to the small available data samples (n = 588). It appears that Rasch scalability of NDE intensity is a prerequisite for these findings, as each intensity level is associated (at least probabilistically) with a well-defined pattern of item endorsements.
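A minimal sketch of the analysis pipeline under stated assumptions: written accounts placed in an LSA space via truncated SVD, then a regression of a Rasch-type intensity measure on the semantic coordinates. The accounts and measures below are toy stand-ins, not the NDE data.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LinearRegression

accounts = ["I floated above my body and felt calm",
            "a bright light and a tunnel appeared",
            "I felt nothing unusual at all",
            "time slowed and I reviewed my life"]
rasch_measure = np.array([2.1, 2.8, 0.3, 2.5])  # hypothetical intensities

# LSA: term weighting followed by truncated SVD to get semantic coordinates.
X = TfidfVectorizer().fit_transform(accounts)
coords = TruncatedSVD(n_components=2, random_state=0).fit_transform(X)

# Predict the Rasch measure from the coordinates (the R^2-style check).
model = LinearRegression().fit(coords, rasch_measure)
print("R^2 on training data:", model.score(coords, rasch_measure))
```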
Probabilistic combination of static and dynamic gait features for verification
NASA Astrophysics Data System (ADS)
Bazin, Alex I.; Nixon, Mark S.
2005-03-01
This paper describes a novel probabilistic framework for biometric identification and data fusion. Based on intra- and inter-class variation extracted from training data, posterior probabilities describing the similarity between two feature vectors may be directly calculated from the data using the logistic function and Bayes rule. Using a large publicly available database we show that two imbalanced gait modalities may be fused using this framework. All fusion methods tested provide an improvement over the best modality, with the weighted sum rule giving the best performance, hence showing that highly imbalanced classifiers may be fused in a probabilistic setting; improving not only the performance, but also generalized application capability.
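A minimal sketch of the general recipe: map each modality's match score to a posterior probability with a logistic function (here with fixed, assumed parameters rather than ones fitted from intra-/inter-class variation), then fuse with the weighted sum rule.

```python
import numpy as np

def posterior(score, a, b):
    """Logistic mapping from similarity score to P(same subject | score)."""
    return 1.0 / (1.0 + np.exp(-(a * score + b)))

def fuse_weighted_sum(posts, weights):
    w = np.asarray(weights, dtype=float)
    return float(np.dot(posts, w / w.sum()))

# Two gait modalities (static/dynamic) with illustrative scores and fits.
p_static = posterior(0.62, a=8.0, b=-4.0)
p_dynamic = posterior(0.41, a=6.0, b=-2.0)

# Weighted sum rule, weighting the stronger modality more (assumed weights).
p_fused = fuse_weighted_sum([p_static, p_dynamic], weights=[0.7, 0.3])
print(f"fused posterior: {p_fused:.3f}")
```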
PCEMCAN - Probabilistic Ceramic Matrix Composites Analyzer: User's Guide, Version 1.0
NASA Technical Reports Server (NTRS)
Shah, Ashwin R.; Mital, Subodh K.; Murthy, Pappu L. N.
1998-01-01
PCEMCAN (Probabilistic CEramic Matrix Composites ANalyzer) is an integrated computer code developed at NASA Lewis Research Center that simulates uncertainties associated with the constituent properties, manufacturing process, and geometric parameters of fiber reinforced ceramic matrix composites and quantifies their random thermomechanical behavior. The PCEMCAN code can perform deterministic as well as probabilistic analyses to predict thermomechanical properties. This User's Guide details the step-by-step procedure to create the input file and update/modify the material properties database required to run the PCEMCAN computer code. An overview of the geometric conventions, micromechanical unit cell, nonlinear constitutive relationship and probabilistic simulation methodology is also provided in the manual. Fast probability integration as well as Monte Carlo simulation methods are available for the uncertainty simulation. The various options available in the code to simulate probabilistic material properties and quantify the sensitivity of the primitive random variables are described. Deterministic as well as probabilistic results are described using demonstration problems. For a detailed theoretical description of the deterministic and probabilistic analyses, the user is referred to the companion documents "Computational Simulation of Continuous Fiber-Reinforced Ceramic Matrix Composite Behavior," NASA TP-3602, 1996, and "Probabilistic Micromechanics and Macromechanics for Ceramic Matrix Composites," NASA TM-4766, June 1997.
Extracting Databases from Dark Data with DeepDive.
Zhang, Ce; Shin, Jaeho; Ré, Christopher; Cafarella, Michael; Niu, Feng
2016-01-01
DeepDive is a system for extracting relational databases from dark data : the mass of text, tables, and images that are widely collected and stored but which cannot be exploited by standard relational tools. If the information in dark data - scientific papers, Web classified ads, customer service notes, and so on - were instead in a relational database, it would give analysts a massive and valuable new set of "big data." DeepDive is distinctive when compared to previous information extraction systems in its ability to obtain very high precision and recall at reasonable engineering cost; in a number of applications, we have used DeepDive to create databases with accuracy that meets that of human annotators. To date we have successfully deployed DeepDive to create data-centric applications for insurance, materials science, genomics, paleontologists, law enforcement, and others. The data unlocked by DeepDive represents a massive opportunity for industry, government, and scientific researchers. DeepDive is enabled by an unusual design that combines large-scale probabilistic inference with a novel developer interaction cycle. This design is enabled by several core innovations around probabilistic training and inference.
Scalable Indoor Localization via Mobile Crowdsourcing and Gaussian Process
Chang, Qiang; Li, Qun; Shi, Zesen; Chen, Wei; Wang, Weiping
2016-01-01
Indoor localization using Received Signal Strength Indication (RSSI) fingerprinting has been extensively studied for decades. The positioning accuracy is highly dependent on the density of the signal database. In areas without calibration data, however, this algorithm breaks down. Building and updating a dense signal database is labor intensive, expensive, and even impossible in some areas. Researchers are continually searching for better algorithms to create and update dense databases more efficiently. In this paper, we propose a scalable indoor positioning algorithm that works both in surveyed and unsurveyed areas. We first propose Minimum Inverse Distance (MID) algorithm to build a virtual database with uniformly distributed virtual Reference Points (RP). The area covered by the virtual RPs can be larger than the surveyed area. A Local Gaussian Process (LGP) is then applied to estimate the virtual RPs’ RSSI values based on the crowdsourced training data. Finally, we improve the Bayesian algorithm to estimate the user’s location using the virtual database. All the parameters are optimized by simulations, and the new algorithm is tested on real-case scenarios. The results show that the new algorithm improves the accuracy by 25.5% in the surveyed area, with an average positioning error below 2.2 m for 80% of the cases. Moreover, the proposed algorithm can localize the users in the neighboring unsurveyed area. PMID:26999139
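A condensed sketch of the pipeline under stated assumptions: fit a Gaussian process to crowdsourced RSSI readings (a single global GP here, rather than the paper's Local Gaussian Process), predict a uniform grid of virtual reference points, and locate a user by likelihood over the grid. Data, kernel settings, and the single access point are illustrative.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(1)
train_xy = rng.uniform(0, 20, size=(60, 2))     # crowdsourced positions (m)
ap = np.array([5.0, 5.0])                       # one access point (assumed)
d = np.linalg.norm(train_xy - ap, axis=1)
train_rssi = -40 - 20 * np.log10(d + 1) + rng.normal(0, 2, 60)

gp = GaussianProcessRegressor(kernel=RBF(5.0) + WhiteKernel(4.0))
gp.fit(train_xy, train_rssi)

# Virtual RPs on a uniform grid (may extend beyond the surveyed area).
gx, gy = np.meshgrid(np.linspace(0, 20, 41), np.linspace(0, 20, 41))
grid = np.column_stack([gx.ravel(), gy.ravel()])
mu, sigma = gp.predict(grid, return_std=True)

# Locate a user from an observed RSSI by maximum likelihood over the grid.
observed = -62.0
log_lik = -0.5 * ((observed - mu) / sigma) ** 2 - np.log(sigma)
print("estimated position:", grid[np.argmax(log_lik)])
```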
2013-01-01
commercial NoSQL database system. The results show that IndexedHBase provides a data loading speed that is 6 times faster than Riak, and is...compare it with Riak, a widely adopted commercial NoSQL database system. The results show that IndexedHBase provides a data loading speed that is 6...events. This chapter describes our research towards building an efficient and scalable storage platform for Truthy. Many existing NoSQL databases
A universal exchange language for healthcare.
Robson, Barry; Caruso, Thomas P
2013-01-01
We have defined a Universal Exchange Language (UEL) for healthcare that takes a green-field approach to the development of a novel "XML-like" language. We consider here what having a free hand might mean: a UEL that incorporates an advanced mathematical foundation that uses Dirac's notation and algebra. For consented and public information, it allows probabilistic inference from UEL semantic web triplet tags. It is also possible to use similar thinking to maximize the security and analytic characteristics of private health data by disaggregating or "shredding" it. Both are scalable to millions of records that could be spread across the Internet.
A Living Laboratory for Energy Systems Integration - Continuum Magazine
research centers across NREL to study how to optimize the campus's energy use. The Energy DataBus...at second-by-second intervals, 24 hours per day, and stores it all in one giant database. And the...solution that is designed for large, scalable databases. "It's similar to the one that Facebook and
TreeVector: scalable, interactive, phylogenetic trees for the web.
Pethica, Ralph; Barker, Gary; Kovacs, Tim; Gough, Julian
2010-01-28
Phylogenetic trees are complex data forms that need to be graphically displayed to be human-readable. Traditional techniques of plotting phylogenetic trees focus on rendering a single static image, but increases in the production of biological data and large-scale analyses demand scalable, browsable, and interactive trees. We introduce TreeVector, a Scalable Vector Graphics- and Java-based method that allows trees to be integrated and viewed seamlessly in standard web browsers with no extra software required, and can be modified and linked using standard web technologies. There are now many bioinformatics servers and databases with a range of dynamic processes and updates to cope with the increasing volume of data. TreeVector is designed as a framework to integrate with these processes and produce user-customized phylogenies automatically. We also address the strengths of phylogenetic trees as part of a linked-in browsing process rather than an end graphic for print. TreeVector is fast and easy to use and is available to download precompiled, but is also open source. It can also be run from the web server listed below or the user's own web server. It has already been deployed on two recognized and widely used database Web sites.
Singer, D.A.
2006-01-01
A probabilistic neural network is employed to classify 1610 mineral deposits into 18 types using tonnage, average Cu, Mo, Ag, Au, Zn, and Pb grades, and six generalized rock types. The purpose is to examine whether neural networks might serve for integrating geoscience information available in large mineral databases to classify sites by deposit type. Successful classifications of 805 deposits not used in training - 87% with grouped porphyry copper deposits - and the nature of misclassifications demonstrate the power of probabilistic neural networks and the value of quantitative mineral-deposit models. The results also suggest that neural networks can classify deposits as well as experienced economic geologists. © International Association for Mathematical Geology 2006.
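A minimal probabilistic neural network (Parzen-window) classifier of the kind used here; the features, classes, and bandwidth are toy values, not the mineral-deposit data.

```python
import numpy as np

def pnn_predict(x, X_train, y_train, sigma=0.5):
    """Class with the highest Gaussian-kernel density estimate at x."""
    scores = {}
    for c in np.unique(y_train):
        Xc = X_train[y_train == c]
        sq = np.sum((Xc - x) ** 2, axis=1)
        scores[c] = np.mean(np.exp(-sq / (2 * sigma ** 2)))
    return max(scores, key=scores.get)

# Toy "deposits": log-tonnage and Cu grade for two hypothetical types.
X = np.array([[1.0, 0.5], [1.2, 0.6], [3.0, 0.1], [2.8, 0.2]])
y = np.array(["porphyry", "porphyry", "massive-sulfide", "massive-sulfide"])

print(pnn_predict(np.array([1.1, 0.55]), X, y))  # -> "porphyry"
```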
A Hybrid EAV-Relational Model for Consistent and Scalable Capture of Clinical Research Data.
Khan, Omar; Lim Choi Keung, Sarah N; Zhao, Lei; Arvanitis, Theodoros N
2014-01-01
Many clinical research databases are built for specific purposes and their design is often guided by the requirements of their particular setting. Not only does this lead to issues of interoperability and reusability between research groups in the wider community but, within the project itself, changes and additions to the system could be implemented using an ad hoc approach, which may make the system difficult to maintain and even more difficult to share. In this paper, we outline a hybrid Entity-Attribute-Value and relational model approach for modelling data, in light of frequently changing requirements, which enables the back-end database schema to remain static, improving the extensibility and scalability of an application. The model also facilitates data reuse. The methods used build on the modular architecture previously introduced in the CURe project.
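A minimal sketch of the hybrid idea: stable entities stay relational while frequently changing study attributes go into an EAV table, so new attributes need no schema change. Table and column names are illustrative, not CURe's actual schema.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE patient (            -- stable, relational part
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE observation_eav (    -- flexible Entity-Attribute-Value part
    entity_id INTEGER REFERENCES patient(id),
    attribute TEXT NOT NULL,
    value     TEXT NOT NULL
);
""")
con.execute("INSERT INTO patient VALUES (1, 'subject-001')")
rows = [(1, "systolic_bp", "118"), (1, "new_biomarker_x", "0.73")]
con.executemany("INSERT INTO observation_eav VALUES (?, ?, ?)", rows)

# Pivot EAV rows back into columns at query time.
for r in con.execute("""
    SELECT p.name,
           MAX(CASE WHEN attribute = 'systolic_bp' THEN value END) AS bp
    FROM patient p JOIN observation_eav o ON o.entity_id = p.id
    GROUP BY p.id"""):
    print(r)
```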
Comparison of probabilistic and deterministic fiber tracking of cranial nerves.
Zolal, Amir; Sobottka, Stephan B; Podlesek, Dino; Linn, Jennifer; Rieger, Bernhard; Juratli, Tareq A; Schackert, Gabriele; Kitzler, Hagen H
2017-09-01
OBJECTIVE The depiction of cranial nerves (CNs) using diffusion tensor imaging (DTI) is of great interest in skull base tumor surgery and DTI used with deterministic tracking methods has been reported previously. However, there are still no good methods usable for the elimination of noise from the resulting depictions. The authors have hypothesized that probabilistic tracking could lead to more accurate results, because it more efficiently extracts information from the underlying data. Moreover, the authors have adapted a previously described technique for noise elimination using gradual threshold increases to probabilistic tracking. To evaluate the utility of this new approach, a comparison is provided with this work between the gradual threshold increase method in probabilistic and deterministic tracking of CNs. METHODS Both tracking methods were used to depict CNs II, III, V, and the VII+VIII bundle. Depiction of 240 CNs was attempted with each of the above methods in 30 healthy subjects, which were obtained from 2 public databases: the Kirby repository (KR) and Human Connectome Project (HCP). Elimination of erroneous fibers was attempted by gradually increasing the respective thresholds (fractional anisotropy [FA] and probabilistic index of connectivity [PICo]). The results were compared with predefined ground truth images based on corresponding anatomical scans. Two label overlap measures (false-positive error and Dice similarity coefficient) were used to evaluate the success of both methods in depicting the CN. Moreover, the differences between these parameters obtained from the KR and HCP (with higher angular resolution) databases were evaluated. Additionally, visualization of 10 CNs in 5 clinical cases was attempted with both methods and evaluated by comparing the depictions with intraoperative findings. RESULTS Maximum Dice similarity coefficients were significantly higher with probabilistic tracking (p < 0.001; Wilcoxon signed-rank test). The false-positive error of the last obtained depiction was also significantly lower in probabilistic than in deterministic tracking (p < 0.001). The HCP data yielded significantly better results in terms of the Dice coefficient in probabilistic tracking (p < 0.001, Mann-Whitney U-test) and in deterministic tracking (p = 0.02). The false-positive errors were smaller in HCP data in deterministic tracking (p < 0.001) and showed a strong trend toward significance in probabilistic tracking (p = 0.06). In the clinical cases, the probabilistic method visualized 7 of 10 attempted CNs accurately, compared with 3 correct depictions with deterministic tracking. CONCLUSIONS High angular resolution DTI scans are preferable for the DTI-based depiction of the cranial nerves. Probabilistic tracking with a gradual PICo threshold increase is more effective for this task than the previously described deterministic tracking with a gradual FA threshold increase and might represent a method that is useful for depicting cranial nerves with DTI since it eliminates the erroneous fibers without manual intervention.
Privacy-preserving heterogeneous health data sharing.
Mohammed, Noman; Jiang, Xiaoqian; Chen, Rui; Fung, Benjamin C M; Ohno-Machado, Lucila
2013-05-01
Privacy-preserving data publishing addresses the problem of disclosing sensitive data when mining for useful information. Among existing privacy models, ε-differential privacy provides one of the strongest privacy guarantees and makes no assumptions about an adversary's background knowledge. All existing solutions that ensure ε-differential privacy handle the problem of disclosing relational and set-valued data in a privacy-preserving manner separately. In this paper, we propose an algorithm that considers both relational and set-valued data in differentially private disclosure of healthcare data. The proposed approach makes a simple yet fundamental switch in differentially private algorithm design: instead of listing all possible records (ie, a contingency table) for noise addition, records are generalized before noise addition. The algorithm first generalizes the raw data in a probabilistic way, and then adds noise to guarantee ε-differential privacy. We showed that the disclosed data could be used effectively to build a decision tree induction classifier. Experimental results demonstrated that the proposed algorithm is scalable and performs better than existing solutions for classification analysis. The resulting utility may degrade when the output domain size is very large, making it potentially inappropriate to generate synthetic data for large health databases. Unlike existing techniques, the proposed algorithm allows the disclosure of health data containing both relational and set-valued data in a differentially private manner, and can retain essential information for discriminative analysis.
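A minimal sketch of the core switch described above: generalize records first, then add Laplace noise to the generalized counts to satisfy epsilon-differential privacy. The fixed taxonomy and counts are illustrative; the paper's probabilistic generalization step is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)
epsilon = 1.0

# Raw ages generalized into coarse intervals (a fixed taxonomy for brevity).
ages = np.array([23, 27, 31, 34, 35, 42, 44, 51, 58, 63])
bins = [(20, 40), (40, 60), (60, 80)]
counts = np.array([np.sum((ages >= lo) & (ages < hi)) for lo, hi in bins])

# Laplace mechanism: a disjoint histogram has L1 sensitivity 1 (one record
# can change one count by one), so the noise scale is 1/epsilon.
noisy = counts + rng.laplace(scale=1.0 / epsilon, size=len(counts))
for (lo, hi), c in zip(bins, noisy):
    print(f"age [{lo},{hi}): {max(c, 0):.1f}")
```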
Cost Considerations in Cloud Computing
2014-01-01
investments. 2. Database Options The potential promise that "big data" analytics holds for many enterprise mission areas makes relevant the question of the...development of a range of new distributed file systems and databases that have better scalability properties than traditional SQL databases. Hadoop...data. Many systems exist that extend or supplement Hadoop, such as Apache Accumulo, which provides a highly granular mechanism for managing security
The LSST Data Mining Research Agenda
NASA Astrophysics Data System (ADS)
Borne, K.; Becla, J.; Davidson, I.; Szalay, A.; Tyson, J. A.
2008-12-01
We describe features of the LSST science database that are amenable to scientific data mining, object classification, outlier identification, anomaly detection, image quality assurance, and survey science validation. The data mining research agenda includes: scalability (at petabyte scales) of existing machine learning and data mining algorithms; development of grid-enabled parallel data mining algorithms; designing a robust system for brokering classifications from the LSST event pipeline (which may produce 10,000 or more event alerts per night); multi-resolution methods for exploration of petascale databases; indexing of multi-attribute multi-dimensional astronomical databases (beyond spatial indexing) for rapid querying of petabyte databases; and more.
SPMBR: a scalable algorithm for mining sequential patterns based on bitmaps
NASA Astrophysics Data System (ADS)
Xu, Xiwei; Zhang, Changhai
2013-12-01
Some existing sequential pattern mining algorithms generate too many candidate sequences and thereby increase the processing cost of support counting. We therefore present an effective and scalable algorithm called SPMBR (Sequential Patterns Mining based on Bitmap Representation) to solve the problem of mining sequential patterns in large databases. Our method differs from previous work on mining sequential patterns mainly in that the database of sequential patterns is represented by bitmaps, for which a simplified bitmap structure is first presented. The algorithm generates candidate sequences by SE (Sequence Extension) and IE (Item Extension), and then obtains all frequent sequences by comparing the original bitmap with the extended item bitmap. This method simplifies the problem of mining sequential patterns and avoids the high processing cost of support counting. Both theory and experiments indicate that the performance of SPMBR is predominant for large transaction databases, that much less memory is required for storing temporal data during the mining process, and that all sequential patterns can be mined feasibly.
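A minimal sketch of bitmap-based support counting: map each item to a bitmap over sequence positions (Python integers as bitsets) so that support checks become bitwise operations rather than database rescans. This illustrates the representation only, not SPMBR's full SE/IE candidate generation.

```python
# Three customer sequences; position p of sequence s is bit (s * WIDTH + p).
WIDTH = 4
sequences = [["a", "b", "c", "a"],
             ["a", "c", "b", "b"],
             ["b", "a", "a", "c"]]

bitmaps = {}
for s, seq in enumerate(sequences):
    for p, item in enumerate(seq):
        bitmaps[item] = bitmaps.get(item, 0) | (1 << (s * WIDTH + p))

def seq_mask(bm, s):
    # Extract the WIDTH-bit slice belonging to sequence s.
    return (bm >> (s * WIDTH)) & ((1 << WIDTH) - 1)

def support_sequence(a, b):
    """Sequences containing item a followed later by item b (S-extension)."""
    count = 0
    for s in range(len(sequences)):
        ma, mb = seq_mask(bitmaps[a], s), seq_mask(bitmaps[b], s)
        first_a = (ma & -ma).bit_length() - 1 if ma else WIDTH
        if mb >> (first_a + 1):      # any b strictly after the first a?
            count += 1
    return count

print(support_sequence("a", "b"))   # -> 2 (sequences 0 and 1)
```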
Efficient data management tools for the heterogeneous big data warehouse
NASA Astrophysics Data System (ADS)
Alekseev, A. A.; Osipova, V. V.; Ivanov, M. A.; Klimentov, A.; Grigorieva, N. V.; Nalamwar, H. S.
2016-09-01
Traditional RDBMSs, built around normalized data structures, have served well for decades, but the technology is not optimal for data processing and analysis in data-intensive fields like social networks, the oil-gas industry, experiments at the Large Hadron Collider, etc. Several challenges have been raised recently regarding the scalability of data-warehouse-like workloads against the transactional schema, in particular for the analysis of archived data or the aggregation of data for summary and accounting purposes. The paper evaluates new database technologies like HBase, Cassandra, and MongoDB, commonly referred to as NoSQL databases, for handling messy, varied and large amounts of data. The evaluation considers the performance, throughput and scalability of the above technologies for several scientific and industrial use-cases. This paper outlines the technologies and architectures needed for processing Big Data, as well as the description of the back-end application that implements data migration from an RDBMS to a NoSQL data warehouse, the NoSQL database organization, and how it could be useful for further data analytics.
NASA Astrophysics Data System (ADS)
Moncoulon, D.; Labat, D.; Ardon, J.; Onfroy, T.; Leblois, E.; Poulard, C.; Aji, S.; Rémy, A.; Quantin, A.
2013-07-01
The analysis of flood exposure at a national scale for the French insurance market must combine the generation of a probabilistic event set of all possible, but not yet observed, flood situations with hazard and damage modeling. In this study, hazard and damage models are calibrated on a 1995-2012 historical event set, both for hazard results (river flow, flooded areas) and loss estimations. Thus, uncertainties in the deterministic estimation of a single event loss are known before simulating a probabilistic event set. To take into account at least 90% of the insured flood losses, the probabilistic event set must combine river overflow (small and large catchments) with surface runoff due to heavy rainfall on the slopes of the watershed. Indeed, internal studies of the CCR claims database have shown that approximately 45% of the insured flood losses are located inside the floodplains and 45% outside; the remaining 10% are due to sea-surge floods and groundwater rise. In this approach, two independent probabilistic methods are combined to create a single flood loss distribution: generation of fictive river flows based on the historical records of the river gauge network, and generation of fictive rain fields on small catchments, calibrated on the 1958-2010 Météo-France rain database SAFRAN. All the events in the probabilistic event sets are simulated with the deterministic model. This hazard and damage distribution is used to simulate the flood losses at the national scale for an insurance company (MACIF) and to generate flood areas associated with hazard return periods. The flood maps concern river overflow and surface water runoff. Validation of these maps is conducted by comparison with the address-located claim data on a small catchment (downstream Argens).
A Scalable Data Access Layer to Manage Structured Heterogeneous Biomedical Data.
Delussu, Giovanni; Lianas, Luca; Frexia, Francesca; Zanetti, Gianluigi
2016-01-01
This work presents a scalable data access layer, called PyEHR, designed to support the implementation of data management systems for secondary use of structured heterogeneous biomedical and clinical data. PyEHR adopts the openEHR's formalisms to guarantee the decoupling of data descriptions from implementation details and exploits structure indexing to accelerate searches. Data persistence is guaranteed by a driver layer with a common driver interface. Interfaces for two NoSQL Database Management Systems are already implemented: MongoDB and Elasticsearch. We evaluated the scalability of PyEHR experimentally through two types of tests, called "Constant Load" and "Constant Number of Records", with queries of increasing complexity on synthetic datasets of ten million records each, containing very complex openEHR archetype structures, distributed on up to ten computing nodes.
SBROME: a scalable optimization and module matching framework for automated biosystems design.
Huynh, Linh; Tsoukalas, Athanasios; Köppe, Matthias; Tagkopoulos, Ilias
2013-05-17
The development of a scalable framework for biodesign automation is a formidable challenge given the expected increase in part availability and the ever-growing complexity of synthetic circuits. To allow for (a) the use of previously constructed and characterized circuits or modules and (b) the implementation of designs that can scale up to hundreds of nodes, we here propose a divide-and-conquer Synthetic Biology Reusable Optimization Methodology (SBROME). An abstract user-defined circuit is first transformed and matched against a module database that incorporates circuits that have previously been experimentally characterized. The resulting circuit is then decomposed into subcircuits that are populated with the set of parts that best approximate the desired function. Finally, all subcircuits are characterized and deposited back to the module database for future reuse. We successfully applied SBROME toward two alternative designs of a modular 3-input multiplexer that utilize pre-existing logic gates and characterized biological parts.
Probabilistic Tsunami Hazard Assessment: the Seaside, Oregon Pilot Study
NASA Astrophysics Data System (ADS)
Gonzalez, F. I.; Geist, E. L.; Synolakis, C.; Titov, V. V.
2004-12-01
A pilot study of Seaside, Oregon is underway, to develop methodologies for probabilistic tsunami hazard assessments that can be incorporated into Flood Insurance Rate Maps (FIRMs) developed by FEMA's National Flood Insurance Program (NFIP). Current NFIP guidelines for tsunami hazard assessment rely on the science, technology and methodologies developed in the 1970s; although generally regarded as groundbreaking and state-of-the-art for its time, this approach is now superseded by modern methods that reflect substantial advances in tsunami research achieved in the last two decades. In particular, post-1990 technical advances include: improvements in tsunami source specification; improved tsunami inundation models; better computational grids by virtue of improved bathymetric and topographic databases; a larger database of long-term paleoseismic and paleotsunami records and short-term, historical earthquake and tsunami records that can be exploited to develop improved probabilistic methodologies; better understanding of earthquake recurrence and probability models. The NOAA-led U.S. National Tsunami Hazard Mitigation Program (NTHMP), in partnership with FEMA, USGS, NSF and Emergency Management and Geotechnical agencies of the five Pacific States, incorporates these advances into site-specific tsunami hazard assessments for coastal communities in Alaska, California, Hawaii, Oregon and Washington. NTHMP hazard assessment efforts currently focus on developing deterministic, "credible worst-case" scenarios that provide valuable guidance for hazard mitigation and emergency management. The NFIP focus, on the other hand, is on actuarial needs that require probabilistic hazard assessments such as those that characterize 100- and 500-year flooding events. There are clearly overlaps in NFIP and NTHMP objectives. NTHMP worst-case scenario assessments that include an estimated probability of occurrence could benefit the NFIP; NFIP probabilistic assessments of 100- and 500-yr events could benefit the NTHMP. The joint NFIP/NTHMP pilot study at Seaside, Oregon is organized into three closely related components: Probabilistic, Modeling, and Impact studies. Probabilistic studies (Geist, et al., this session) are led by the USGS and include the specification of near- and far-field seismic tsunami sources and their associated probabilities. Modeling studies (Titov, et al., this session) are led by NOAA and include the development and testing of a Seaside tsunami inundation model and an associated database of computed wave height and flow velocity fields. Impact studies (Synolakis, et al., this session) are led by USC and include the computation and analyses of indices for the categorization of hazard zones. The results of each component study will be integrated to produce a Seaside tsunami hazard map. This presentation will provide a brief overview of the project and an update on progress, while the above-referenced companion presentations will provide details on the methods used and the preliminary results obtained by each project component.
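A minimal sketch of how per-source probabilities can be combined into the 100- and 500-year exceedance values the NFIP uses, assuming Poissonian source recurrence; the source rates and modeled wave heights are illustrative assumptions, not the Seaside results.

```python
import numpy as np

# Hypothetical sources: annual rate and modeled wave height at one site (m).
sources = [(1 / 500.0, 8.0),   # large far-field event
           (1 / 250.0, 4.5),   # smaller near-field event
           (1 / 100.0, 2.0)]

def annual_exceedance_rate(h):
    # Total rate of events producing wave height >= h at the site.
    return sum(rate for rate, height in sources if height >= h)

def prob_exceed(h, years):
    """P(at least one event with height >= h within the time window)."""
    return 1.0 - np.exp(-annual_exceedance_rate(h) * years)

for h in (2.0, 4.5, 8.0):
    print(f"h >= {h} m: {prob_exceed(h, 100):.2%} in 100 yr")
```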
Research on high availability architecture of SQL and NoSQL
NASA Astrophysics Data System (ADS)
Wang, Zhiguo; Wei, Zhiqiang; Liu, Hao
2017-03-01
With the advent of the era of big data, the amount and importance of data have increased dramatically. SQL databases keep developing in performance and scalability, but more and more companies tend to use NoSQL databases instead, because NoSQL databases have simpler data models and stronger extension capacity than SQL databases. Almost all database designers, for SQL and NoSQL databases alike, aim to improve performance and ensure availability through reasonable architectures that can reduce the effects of software failures and hardware failures, so that they can provide better experiences for their customers. In this paper, I mainly discuss the architectures of MySQL, MongoDB, and Redis, which are highly available and have been deployed in practical application environments, and design a hybrid architecture.
High Resolution Soil Water from Regional Databases and Satellite Images
NASA Technical Reports Server (NTRS)
Morris, Robin D.; Smelyanskly, Vadim N.; Coughlin, Joseph; Dungan, Jennifer; Clancy, Daniel (Technical Monitor)
2002-01-01
This viewgraph presentation provides information on the ways in which plant growth can be inferred from satellite data and can then be used to infer soil water. There are several steps in this process, the first of which is the acquisition of data from satellite observations and relevant information databases such as the State Soil Geographic Database (STATSGO). Then probabilistic analysis and inversion with Bayes' theorem reveal sources of uncertainty. The Markov chain Monte Carlo method is also used.
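A minimal Metropolis sketch of the Bayesian inversion step: sample the posterior of a soil-water parameter given a satellite-derived observation through a forward growth model. The forward model, prior, and data are illustrative stand-ins, not the presentation's actual models.

```python
import numpy as np

rng = np.random.default_rng(0)
obs, obs_sigma = 0.7, 0.05   # e.g., a vegetation index derived from satellite data

def forward(theta):
    # Hypothetical growth response to soil water fraction theta.
    return 1.0 - np.exp(-3.0 * theta)

def log_post(theta):
    if not 0.0 < theta < 1.0:            # uniform prior on (0, 1)
        return -np.inf
    return -0.5 * ((obs - forward(theta)) / obs_sigma) ** 2

samples, theta = [], 0.5
for _ in range(20_000):
    prop = theta + rng.normal(0, 0.05)   # random-walk proposal
    if np.log(rng.uniform()) < log_post(prop) - log_post(theta):
        theta = prop
    samples.append(theta)

print("posterior mean soil water:", np.mean(samples[5000:]))
```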
Pouzou, Jane G.; Cullen, Alison C.; Yost, Michael G.; Kissel, John C.; Fenske, Richard A.
2018-01-01
Implementation of probabilistic analyses in exposure assessment can provide valuable insight into the risks of those at the extremes of population distributions, including more vulnerable or sensitive subgroups. Incorporation of these analyses into current regulatory methods for occupational pesticide exposure is enabled by the exposure data sets and associated data currently used in the risk assessment approach of the Environmental Protection Agency (EPA). Monte Carlo simulations were performed on exposure measurements from the Agricultural Handler Exposure Database and the Pesticide Handler Exposure Database along with data from the Exposure Factors Handbook and other sources to calculate exposure rates for three different neurotoxic compounds (azinphos methyl, acetamiprid, emamectin benzoate) across four pesticide-handling scenarios. Probabilistic estimates of doses were compared with the no observable effect levels used in the EPA occupational risk assessments. Some percentage of workers were predicted to exceed the level of concern for all three compounds: 54% for azinphos methyl, 5% for acetamiprid, and 20% for emamectin benzoate. This finding has implications for pesticide risk assessment and offers an alternative procedure that may be more protective of those at the extremes of exposure than the current approach. PMID:29105804
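A minimal sketch of the probabilistic step: Monte Carlo simulation of dose rates from assumed lognormal exposure and handling distributions, compared against a level of concern. Every parameter value here is illustrative, not taken from the AHED/PHED databases or the EPA assessments.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

unit_exposure = rng.lognormal(np.log(0.05), 0.9, n)    # mg per lb a.i. handled
amount_handled = rng.lognormal(np.log(200.0), 0.6, n)  # lb a.i. per day
body_weight = rng.normal(80.0, 12.0, n).clip(40)       # kg

dose = unit_exposure * amount_handled / body_weight    # mg/kg/day
noel = 0.25                                            # assumed level of concern

# Fraction of simulated workers above the level of concern.
print(f"workers exceeding the level of concern: {np.mean(dose > noel):.1%}")
```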
PROTAX-Sound: A probabilistic framework for automated animal sound identification.
de Camargo, Ulisses Moliterno; Somervuo, Panu; Ovaskainen, Otso
2017-01-01
Autonomous audio recording is a stimulating new field in bioacoustics, with great promise for conducting cost-effective species surveys. One major current challenge is the lack of reliable classifiers capable of multi-species identification. We present PROTAX-Sound, a statistical framework to perform probabilistic classification of animal sounds. PROTAX-Sound is based on a multinomial regression model, and it can utilize as predictors any kind of sound features or classifications produced by other existing algorithms. PROTAX-Sound combines audio and image processing techniques to scan environmental audio files. It identifies regions of interest (a segment of the audio file that contains a vocalization to be classified), extracts acoustic features from them and compares them with samples in a reference database. The output of PROTAX-Sound is the probabilistic classification of each vocalization, including the possibility that it represents a species not present in the reference database. We demonstrate the performance of PROTAX-Sound by classifying audio from a species-rich case study of tropical birds. The best performing classifier achieved 68% classification accuracy for 200 bird species. PROTAX-Sound improves the classification power of current techniques by combining information from multiple classifiers in a manner that yields calibrated classification probabilities.
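A minimal sketch of the multinomial-regression idea: combine sound features (or other classifiers' outputs) into calibrated class probabilities, reserving an explicit class for species absent from the reference database. Weights and features are toy values, not PROTAX-Sound's fitted model.

```python
import numpy as np

def protax_like_probs(z, W):
    """Softmax over K known species plus an explicit 'unknown' class."""
    scores = W @ z                      # one linear score per known species
    scores = np.append(scores, 0.0)     # baseline score for 'unknown'
    e = np.exp(scores - scores.max())
    return e / e.sum()

z = np.array([0.8, 0.3, 1.0])           # acoustic features / classifier scores
W = np.array([[2.0, 0.5, -1.0],         # species A (assumed weights)
              [0.1, 1.5, 0.2]])         # species B

p = protax_like_probs(z, W)
print(dict(zip(["species A", "species B", "unknown"], np.round(p, 3))))
```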
Hamiltonian Monte Carlo acceleration using surrogate functions with random bases.
Zhang, Cheng; Shahbaba, Babak; Zhao, Hongkai
2017-11-01
For big data analysis, the high computational cost of Bayesian methods often limits their applications in practice. In recent years, there have been many attempts to improve the computational efficiency of Bayesian inference. Here we propose an efficient and scalable computational technique for a state-of-the-art Markov chain Monte Carlo method, namely Hamiltonian Monte Carlo. The key idea is to explore and exploit the structure and regularity in parameter space for the underlying probabilistic model to construct an effective approximation of its geometric properties. To this end, we build a surrogate function to approximate the target distribution using properly chosen random bases and an efficient optimization process. The resulting method provides a flexible, scalable, and efficient sampling algorithm, which converges to the correct target distribution. We show that by choosing the basis functions and optimization process differently, our method can be related to other approaches for the construction of surrogate functions such as generalized additive models or Gaussian process models. Experiments based on simulated and real data show that our approach leads to substantially more efficient sampling algorithms compared to existing state-of-the-art methods.
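A compact sketch of the surrogate idea: leapfrog steps follow a cheap approximate gradient, while the accept/reject step uses the exact log density, which preserves the correct target distribution. The toy target and surrogate below stand in for the paper's random-basis construction.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_p(q):                 # exact (in general, expensive) log density
    return -0.5 * np.sum(q ** 2)

def surrogate_grad(q):        # cheap approximate gradient (mildly biased here)
    return -q * 1.05

def hmc_step(q, eps=0.1, L=20):
    p = rng.normal(size=q.shape)
    q_new, p_new = q.copy(), p.copy()
    for _ in range(L):        # leapfrog driven by the surrogate gradient
        p_new += 0.5 * eps * surrogate_grad(q_new)
        q_new += eps * p_new
        p_new += 0.5 * eps * surrogate_grad(q_new)
    # Metropolis correction with the exact density keeps the chain exact.
    log_accept = (log_p(q_new) - log_p(q)
                  - 0.5 * (np.sum(p_new ** 2) - np.sum(p ** 2)))
    return q_new if np.log(rng.uniform()) < log_accept else q

q, draws = np.zeros(2), []
for _ in range(5000):
    q = hmc_step(q)
    draws.append(q)
print("sample mean:", np.mean(draws, axis=0))  # ~ [0, 0]
```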
A comprehensive and scalable database search system for metaproteomics.
Chatterjee, Sandip; Stupp, Gregory S; Park, Sung Kyu Robin; Ducom, Jean-Christophe; Yates, John R; Su, Andrew I; Wolan, Dennis W
2016-08-16
Mass spectrometry-based shotgun proteomics experiments rely on accurate matching of experimental spectra against a database of protein sequences. Existing computational analysis methods are limited in the size of their sequence databases, which severely restricts the proteomic sequencing depth and functional analysis of highly complex samples. The growing amount of public high-throughput sequencing data will only exacerbate this problem. We designed a broadly applicable metaproteomic analysis method (ComPIL) that addresses protein database size limitations. Our approach to overcome this significant limitation in metaproteomics was to design a scalable set of sequence databases assembled for optimal library querying speeds. ComPIL was integrated with a modified version of the search engine ProLuCID (termed "Blazmass") to permit rapid matching of experimental spectra. Proof-of-principle analysis of human HEK293 lysate with a ComPIL database derived from high-quality genomic libraries was able to detect nearly all of the same peptides as a search with a human database (~500x fewer peptides in the database), with a small reduction in sensitivity. We were also able to detect proteins from the adenovirus used to immortalize these cells. We applied our method to a set of healthy human gut microbiome proteomic samples and showed a substantial increase in the number of identified peptides and proteins compared to previous metaproteomic analyses, while retaining a high degree of protein identification accuracy and allowing for a more in-depth characterization of the functional landscape of the samples. The combination of ComPIL with Blazmass allows proteomic searches to be performed with database sizes much larger than previously possible. These large database searches can be applied to complex meta-samples with unknown composition or proteomic samples where unexpected proteins may be identified. The protein database, proteomic search engine, and the proteomic data files for the 5 microbiome samples characterized and discussed herein are open source and available for use and additional analysis.
Chiu, Maria; Lebenbaum, Michael; Lam, Kelvin; Chong, Nelson; Azimaee, Mahmoud; Iron, Karey; Manuel, Doug; Guttmann, Astrid
2016-10-21
Ontario, the most populous province in Canada, has a universal healthcare system that routinely collects health administrative data on its 13 million legal residents, which is used for health research. Record linkage has become a vital tool for this research by enriching these data with the Immigration, Refugees and Citizenship Canada Permanent Resident (IRCC-PR) database and the Office of the Registrar General's Vital Statistics-Death (ORG-VSD) registry. Our objectives were to estimate linkage rates and compare characteristics of individuals in the linked versus unlinked files. We used both deterministic and probabilistic linkage methods to link the IRCC-PR database (1985-2012) and ORG-VSD registry (1990-2012) to Ontario's Registered Persons Database. Linkage rates were estimated and standardized differences were used to assess differences in socio-demographic and other characteristics between the linked and unlinked records. The overall linkage rates for the IRCC-PR database and ORG-VSD registry were 86.4% and 96.2%, respectively. The majority (68.2%) of the record linkages in IRCC-PR were achieved after three deterministic passes, 18.2% were linked probabilistically, and 13.6% were unlinked. Similarly, the majority (79.8%) of the record linkages in the ORG-VSD were made using deterministic record linkage, 16.3% were linked after probabilistic and manual review, and 3.9% were unlinked. Unlinked and linked files were similar for most characteristics, such as age and marital status for IRCC-PR and sex and most causes of death for ORG-VSD. However, lower linkage rates were observed among people born in East Asia (78%) in the IRCC-PR database and for certain causes of death in the ORG-VSD registry, namely perinatal conditions (61.3%) and congenital anomalies (81.3%). The linkages of immigration and vital statistics data to existing population-based healthcare data in Ontario, Canada, will enable many novel cross-sectional and longitudinal studies to be conducted. Analytic techniques to account for sub-optimal linkage rates may be required in studies of certain ethnic groups or certain causes of death among children and infants.
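A minimal sketch of the two-stage linkage described above: exact deterministic passes first, then a Fellegi-Sunter-style probabilistic score for the remainder. Field names, weights, and the threshold are assumptions, not the study's actual rules.

```python
def norm(rec):
    return {k: str(v).strip().lower() for k, v in rec.items()}

def deterministic(a, b):
    # Successive passes on progressively weaker exact-match keys.
    key_passes = [("health_card",), ("last_name", "dob", "sex"),
                  ("last_name", "first_name", "dob")]
    return any(all(a.get(k) and a.get(k) == b.get(k) for k in keys)
               for keys in key_passes)

# Agreement weights (log2 m/u style) per field; illustrative values.
WEIGHTS = {"last_name": 4.2, "first_name": 3.1, "dob": 6.5, "postal": 2.0}
THRESHOLD = 8.0

def probabilistic_score(a, b):
    return sum(w if a.get(f) and a.get(f) == b.get(f) else -w / 2
               for f, w in WEIGHTS.items())

def linked(a, b):
    a, b = norm(a), norm(b)
    return deterministic(a, b) or probabilistic_score(a, b) >= THRESHOLD

rec1 = {"last_name": "Singh", "first_name": "A", "dob": "1980-02-01", "postal": "M5V"}
rec2 = {"last_name": "Singh", "first_name": "A.", "dob": "1980-02-01", "postal": "M5V"}
print(linked(rec1, rec2))   # True, via the probabilistic stage
```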
NASA Astrophysics Data System (ADS)
Marzocchi, W.
2011-12-01
Eruption forecasting estimates the probability of an eruption in a specific time-space-magnitude window. The use of probabilities to track the evolution of a phase of unrest is unavoidable for two main reasons: first, eruptions are intrinsically unpredictable in a deterministic sense; second, probabilities are a quantitative tool that decision-makers can use rationally, as is routinely done in many other fields. The primary information for probability assessment during a phase of unrest comes from monitoring data on quantities such as seismic activity, ground deformation, and geochemical signatures. Nevertheless, probabilistic forecasting based on monitoring data presents two main difficulties. First, many high-risk volcanoes lack pre-eruptive and unrest monitoring databases, making a probabilistic assessment based on the frequency of past observations impossible. The ongoing WOVOdat project (led by Christopher Newhall) is trying to tackle this limitation by creating a sort of worldwide epidemiological database, so that the lack of monitoring data for a specific volcano can be compensated by observations of 'analog' volcanoes. Second, the quantity and quality of monitoring data are rapidly increasing at many volcanoes, creating strongly inhomogeneous datasets. In these cases, classical statistical analysis can be performed on high-quality monitoring observations only for (usually too) short periods of time, or alternatively on the few specific monitoring series available for longer times (such as earthquake counts), thereby neglecting much of the information carried by the most recent kinds of monitoring. Here, we explore a possible strategy to cope with these limitations. In particular, we present a Bayesian strategy that merges different kinds of information: all relevant monitoring observations are embedded into a probabilistic scheme through expert opinion, conceptual models, and, where possible, real past data. After discussing the scientific and philosophical aspects of this approach, we present applications to Campi Flegrei and Vesuvius.
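A minimal sketch of the kind of Bayesian merging described: a prior taken from analog-volcano frequencies is updated with expert-elicited likelihoods for each monitoring anomaly. All numbers are invented placeholders, and the naive conditional-independence assumption is a simplification of the actual method.

```python
# Bayesian merging of monitoring indicators into an eruption probability.
# Prior and likelihoods are hypothetical stand-ins for analog-volcano
# frequencies and expert-elicited conditional probabilities.
def posterior_eruption_probability(prior, indicators):
    """indicators: list of (P(obs | unrest ends in eruption),
                            P(obs | unrest does not)) pairs, treated as
    conditionally independent given the outcome."""
    odds = prior / (1.0 - prior)
    for p_obs_if_eruption, p_obs_if_quiet in indicators:
        odds *= p_obs_if_eruption / p_obs_if_quiet   # Bayes factor per signal
    return odds / (1.0 + odds)

# Example: weak prior from analog volcanoes, two anomalous observations.
prior = 0.05   # hypothetical base rate from an epidemiological database
obs = [(0.8, 0.2),   # seismic swarm: common before eruptions, rare otherwise
       (0.6, 0.3)]   # uplift acceleration
print(posterior_eruption_probability(prior, obs))   # ~0.30
```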
A Scalable Data Access Layer to Manage Structured Heterogeneous Biomedical Data
Lianas, Luca; Frexia, Francesca; Zanetti, Gianluigi
2016-01-01
This work presents a scalable data access layer, called PyEHR, designed to support the implementation of data management systems for secondary use of structured, heterogeneous biomedical and clinical data. PyEHR adopts openEHR's formalisms to guarantee the decoupling of data descriptions from implementation details, and exploits structure indexing to accelerate searches. Data persistence is guaranteed by a driver layer with a common driver interface. Interfaces for two NoSQL database management systems are already implemented: MongoDB and Elasticsearch. We evaluated the scalability of PyEHR experimentally through two types of tests, called "Constant Load" and "Constant Number of Records", with queries of increasing complexity on synthetic datasets of ten million records each, containing very complex openEHR archetype structures, distributed over up to ten computing nodes. PMID:27936191
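The decoupling described can be pictured as one abstract driver interface with a concrete driver per backend. The sketch below is hypothetical (class and method names are not PyEHR's actual API) and assumes pymongo and elasticsearch-py client objects are supplied by the caller.

```python
# Hypothetical sketch of a "common driver interface" with two NoSQL backends.
from abc import ABC, abstractmethod

class DriverInterface(ABC):
    @abstractmethod
    def save_record(self, record: dict) -> str: ...
    @abstractmethod
    def get_record(self, record_id: str) -> dict: ...
    @abstractmethod
    def search(self, structure_query: dict) -> list: ...

class MongoDriver(DriverInterface):
    def __init__(self, client, db_name: str):
        self.coll = client[db_name]["ehr_records"]
    def save_record(self, record):
        return str(self.coll.insert_one(record).inserted_id)
    def get_record(self, record_id):
        from bson import ObjectId
        return self.coll.find_one({"_id": ObjectId(record_id)})
    def search(self, structure_query):
        return list(self.coll.find(structure_query))

class ElasticsearchDriver(DriverInterface):
    def __init__(self, es, index: str = "ehr_records"):
        self.es, self.index = es, index
    def save_record(self, record):
        return self.es.index(index=self.index, document=record)["_id"]
    def get_record(self, record_id):
        return self.es.get(index=self.index, id=record_id)["_source"]
    def search(self, structure_query):
        hits = self.es.search(index=self.index, query=structure_query)
        return [h["_source"] for h in hits["hits"]["hits"]]
```

The data access layer then codes only against `DriverInterface`, so backends can be swapped without touching query logic.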
NPTool: Towards Scalability and Reliability of Business Process Management
NASA Astrophysics Data System (ADS)
Braghetto, Kelly Rosa; Ferreira, João Eduardo; Pu, Calton
Currently, one important challenge in business process management is to provide both scalability and reliability of business process executions. This difficulty becomes more pronounced when the execution control must handle countless complex business processes. This work presents NavigationPlanTool (NPTool), a tool to control the execution of business processes. NPTool is supported by the Navigation Plan Definition Language (NPDL), a language for business process specification that uses process algebra as its formal foundation. NPTool implements the NPDL language as a SQL extension. The main contribution of this paper is a description of NPTool showing how process algebra features combined with a relational database model can provide scalable and reliable control of business process execution. The next steps for NPTool include reuse of control-flow patterns and support for data flow management.
Design of a Multi Dimensional Database for the Archimed DataWarehouse.
Bréant, Claudine; Thurler, Gérald; Borst, François; Geissbuhler, Antoine
2005-01-01
The Archimed data warehouse project started in 1993 at the Geneva University Hospital. It has progressively integrated seven data marts (or domains of activity) archiving medical data such as Admission/Discharge/Transfer (ADT) data, laboratory results, radiology exams, diagnoses, and procedure codes. The objective of the Archimed data warehouse is to facilitate access to an integrated and coherent view of patient medical data in order to support analytical activities such as medical statistics, clinical studies, retrieval of similar cases and data mining processes. This paper discusses three principal design aspects relating to the design of the data warehouse database: 1) the granularity of the database, which refers to the level of detail or summarization of the data; 2) the database model and architecture, describing how data is presented to end users and how new data is integrated; 3) the life cycle of the database, to ensure long-term scalability of the environment. Both the organization of patient medical data using a standardized elementary fact representation and the use of the multidimensional model have proved to be powerful design tools for integrating data coming from the multiple heterogeneous database systems that make up the transactional Hospital Information System (HIS). Concurrently, building the data warehouse incrementally has helped to control the evolution of the data content. These three design aspects bring clarity and performance regarding data access. They also provide long-term scalability to the system and resilience to further changes that may occur in the source systems feeding the data warehouse.
United States Air Force Summer Research Program -- 1993. Volume 4. Rome Laboratory
1993-12-01
…eds., Object-Oriented Concepts, Databases, and Applications, Addison-Wesley, Reading, MA, 1989. [Lano91] Lano, K., "Z++, An Object-Orientated…" …The Target Filter Manager responds to requests for data and accesses the target database… …Figure 12. Contour plot of antenna pattern, QC2 algorithm… UPDATING PROBABILISTIC DATABASES, Michael A…
NASA Technical Reports Server (NTRS)
Saile, Lynn; Lopez, Vilma; Bickham, Grandin; FreiredeCarvalho, Mary; Kerstman, Eric; Byrne, Vicky; Butler, Douglas; Myers, Jerry; Walton, Marlei
2011-01-01
This slide presentation reviews the Integrated Medical Model (IMM) database, an organized evidence base for assessing in-flight crew health risk. The database is a relational database accessible to many people. It quantifies model inputs by ranking the data on a Level of Evidence (LOE) value and a Quality of Evidence (QOE) score, which together assess the evidence base for each medical condition. The IMM evidence base has already provided invaluable information for designers and for other uses.
Comparison of the Frontier Distributed Database Caching System to NoSQL Databases
NASA Astrophysics Data System (ADS)
Dykstra, Dave
2012-12-01
One of the main attractions of non-relational “NoSQL” databases is their ability to scale to large numbers of readers, including readers spread over a wide area. The Frontier distributed database caching system, used in production by the Large Hadron Collider CMS and ATLAS detector projects for Conditions data, is based on traditional SQL databases but also adds high scalability and the ability to be distributed over a wide-area for an important subset of applications. This paper compares the major characteristics of the two different approaches and identifies the criteria for choosing which approach to prefer over the other. It also compares in some detail the NoSQL databases used by CMS and ATLAS: MongoDB, CouchDB, HBase, and Cassandra.
Asymmetric author-topic model for knowledge discovering of big data in toxicogenomics.
Chung, Ming-Hua; Wang, Yuping; Tang, Hailin; Zou, Wen; Basinger, John; Xu, Xiaowei; Tong, Weida
2015-01-01
The advancement of high-throughput screening technologies facilitates the generation of massive amounts of biological data, a big data phenomenon in biomedical science. Yet researchers still rely heavily on keyword search and/or literature review to navigate the databases, and analyses are often done at rather small scale. As a result, the rich information in a database has not been fully utilized, particularly the information embedded in the interactions between data points, which is largely ignored and buried. For the past 10 years, probabilistic topic modeling has been recognized as an effective machine learning algorithm for annotating the hidden thematic structure of massive collections of documents. The analogy between a text corpus and large-scale genomic data enables the application of text mining tools, like probabilistic topic models, to explore hidden patterns in genomic data and, by extension, altered biological functions. In this paper, we developed a generalized probabilistic topic model to analyze a toxicogenomics dataset consisting of a large number of gene expression profiles from the livers of rats treated with drugs at multiple doses and time-points. We discovered hidden patterns in gene expression associated with the effects of dose and time-point of treatment. Finally, we illustrated the ability of our model to identify evidence supporting a potential reduction in animal use.
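As a rough illustration of the text-corpus analogy, the sketch below fits an off-the-shelf LDA model to a synthetic condition-by-gene count matrix, treating genes as "words" and dose/time conditions as "documents". It does not reproduce the paper's generalized asymmetric model.

```python
# Topic modeling on expression data with scikit-learn's LDA (synthetic data).
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.default_rng(0)
# counts[i, j]: discretized expression level of gene j under condition i
counts = rng.poisson(lam=3.0, size=(48, 1000))   # 48 dose/time conditions

lda = LatentDirichletAllocation(n_components=10, random_state=0)
theta = lda.fit_transform(counts)    # condition-by-topic proportions
phi = lda.components_                # topic-by-gene weights

# Top-weighted genes per hidden "topic" hint at co-regulated gene sets.
for k in range(3):
    top = np.argsort(phi[k])[::-1][:5]
    print(f"topic {k}: top gene indices {top}")
```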
Searching Across the International Space Station Databases
NASA Technical Reports Server (NTRS)
Maluf, David A.; McDermott, William J.; Smith, Ernest E.; Bell, David G.; Gurram, Mohana
2007-01-01
Data access in the enterprise generally requires combining data from different sources and different formats. It is therefore advantageous to focus on the intersection of knowledge across sources and domains; keeping irrelevant knowledge around only makes the integration more unwieldy and more complicated than necessary. This paper proposes a context search over multiple domains, using context-sensitive queries to support disciplined manipulation of domain knowledge resources. The objective of a context search is to provide the capability for interrogating many domain knowledge resources that are largely semantically disjoint. The search formally supports the tasks of selecting, combining, extending, specializing, and modifying components from a diverse set of domains. This paper demonstrates a new paradigm for composing information in enterprise applications. In particular, it discusses an approach to achieving data integration across multiple sources that does not require heavy investment in database and middleware maintenance. This lean approach leads to cost-effective, scalable data integration on top of an underlying schema-less object-relational database management system. The resulting highly scalable, information-on-demand system framework, called NX-Search, is an implementation of an information system built on NETMARK, a flexible, high-throughput open database integration framework for managing, storing, and searching unstructured or semi-structured arbitrary XML and HTML, used widely at the National Aeronautics and Space Administration (NASA) and in industry.
A Parallel Fast Sweeping Method for the Eikonal Equation
NASA Astrophysics Data System (ADS)
Baker, B.
2017-12-01
Recently, there has been an exciting emergence of probabilistic methods for travel time tomography. Unlike gradient-based optimization strategies, probabilistic tomographic methods are resistant to becoming trapped in a local minimum and provide a much better quantification of parameter resolution than, say, appealing to ray density or performing checkerboard reconstruction tests. The benefits of random sampling methods, however, are only realized by successive computation of predicted travel times in potentially strongly heterogeneous media. To this end, this abstract is concerned with expediting the solution of the Eikonal equation. While many Eikonal solvers use a fast marching method, the proposed solver uses the iterative fast sweeping method, because the eight fixed sweep orderings in each iteration are natural targets for parallelization. To reduce the number of iterations and grid points required, the high-accuracy finite difference stencil of Nobel et al. (2014) is implemented. A directed acyclic graph (DAG) is created with a priori knowledge of the sweep ordering and the finite difference stencil. By performing a topological sort of the DAG, sets of independent nodes are identified as candidates for concurrent updating. Additionally, the proposed solver addresses scalability during earthquake relocation, a necessary step in local and regional earthquake tomography and a barrier to extending probabilistic methods from active-source to passive-source applications, by introducing an asynchronous parallel forward solve phase for all receivers in the network. Synthetic examples using the SEG over-thrust model will be presented.
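For reference, a first-order serial version of the fast sweeping method reads as follows. The four 2D sweep orderings shown generalize to the eight 3D orderings mentioned above; the high-accuracy stencil and the DAG-based parallelization are omitted from this sketch.

```python
# Compact 2D fast sweeping solver for |grad u| = s (after Zhao, 2005),
# with first-order Godunov upwinding only.
import numpy as np

def fast_sweep(s, sources, h=1.0, n_iter=8):
    ny, nx = s.shape
    u = np.full((ny, nx), np.inf)
    for (i, j) in sources:
        u[i, j] = 0.0
    orders = [(range(ny), range(nx)),
              (range(ny), range(nx - 1, -1, -1)),
              (range(ny - 1, -1, -1), range(nx)),
              (range(ny - 1, -1, -1), range(nx - 1, -1, -1))]
    for _ in range(n_iter):
        for ii, jj in orders:               # the four fixed sweep orderings
            for i in ii:
                for j in jj:
                    a = min(u[i - 1, j] if i > 0 else np.inf,
                            u[i + 1, j] if i < ny - 1 else np.inf)
                    b = min(u[i, j - 1] if j > 0 else np.inf,
                            u[i, j + 1] if j < nx - 1 else np.inf)
                    if a == np.inf and b == np.inf:
                        continue            # no upwind information yet
                    f = s[i, j] * h
                    if abs(a - b) >= f:     # causality: one-sided update
                        ubar = min(a, b) + f
                    else:                   # two-sided quadratic update
                        ubar = 0.5 * (a + b + np.sqrt(2 * f * f - (a - b) ** 2))
                    u[i, j] = min(u[i, j], ubar)
    return u

# Point source in a homogeneous medium: u approximates distance to the source.
u = fast_sweep(np.ones((50, 50)), sources=[(25, 25)])
```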
ConvNetQuake: Convolutional Neural Network for Earthquake Detection and Location
NASA Astrophysics Data System (ADS)
Denolle, M.; Perol, T.; Gharbi, M.
2017-12-01
Over the last decades, the volume of seismic data has increased exponentially, creating a need for efficient algorithms to reliably detect and locate earthquakes. Today's most elaborate methods scan through the plethora of continuous seismic records, searching for repeating seismic signals. In this work, we leverage recent advances in artificial intelligence and present ConvNetQuake, a highly scalable convolutional neural network for probabilistic earthquake detection and location from single stations. We apply our technique to study two years of induced seismicity in Oklahoma (USA). We detect 20 times more earthquakes than previously cataloged by the Oklahoma Geological Survey, and our algorithm is at least one order of magnitude faster than other established methods.
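A toy model in the same spirit (single-station, three-component waveform windows in; joint detection/location class probabilities out) might look like the following PyTorch sketch. The layer sizes are arbitrary and do not reproduce ConvNetQuake's published architecture.

```python
# Toy 1D convolutional classifier: class 0 = noise, classes 1..K = geographic cells.
import torch
import torch.nn as nn

class TinyQuakeNet(nn.Module):
    def __init__(self, n_cells: int, window: int = 1000):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv1d(32, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv1d(32, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )
        # three stride-2 convolutions shrink the window by a factor of 8
        self.classify = nn.Linear(32 * (window // 8), 1 + n_cells)

    def forward(self, x):                  # x: (batch, 3, window)
        z = self.features(x).flatten(1)
        return self.classify(z)            # logits; softmax gives probabilities

model = TinyQuakeNet(n_cells=6)
logits = model(torch.randn(4, 3, 1000))
probs = logits.softmax(dim=1)              # per-window detection/location probs
```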
Ali, Amira Mohammed; Ahmed, Anwar; Sharaf, Amira; Kawakami, Norito; Abdeldayem, Samia M; Green, Joseph
2017-12-01
This study aimed to examine the validity of the Arabic version of the Depression Anxiety Stress Scale-21 (DASS-21) in 149 illicit drug users. We calculated the α coefficient, inter-item and item-total correlations, coefficients of reproducibility and scalability (CR and CS), and item difficulty and discrimination indices. The DASS-21 had acceptable reliability, but the values of the CR and CS were less than acceptable. Items varied in difficulty and discrimination; some items are candidates for elimination. The DASS-21 is a probabilistic rather than a deterministic measure of distress; it has problematic items and needs further investigation. Copyright © 2017 Elsevier B.V. All rights reserved.
NASA Astrophysics Data System (ADS)
Moncoulon, D.; Labat, D.; Ardon, J.; Leblois, E.; Onfroy, T.; Poulard, C.; Aji, S.; Rémy, A.; Quantin, A.
2014-09-01
The analysis of flood exposure at a national scale for the French insurance market must combine the generation of a probabilistic event set of all possible (but not yet observed) flood situations with hazard and damage modeling. In this study, hazard and damage models are calibrated on a 1995-2010 historical event set, both for hazard results (river flow, flooded areas) and for loss estimations. Thus, the uncertainties in the deterministic estimation of a single event's loss are known before a probabilistic event set is simulated. To capture at least 90 % of the insured flood losses, the probabilistic event set must combine river overflow (small and large catchments) with surface runoff due to heavy rainfall on the slopes of the watershed. Indeed, internal studies of the CCR (Caisse Centrale de Réassurance) claim database have shown that approximately 45 % of insured flood losses are located inside floodplains and 45 % outside; another 10 % is due to sea surge floods and groundwater rise. In this approach, two independent probabilistic methods are combined to create a single flood loss distribution: the generation of fictive river flows based on historical records of the river gauge network, and the generation of fictive rain fields on small catchments, calibrated on the 1958-2010 Météo-France rain database SAFRAN. All events in the probabilistic event sets are simulated with the deterministic model. This hazard and damage distribution is used to simulate flood losses at the national scale for an insurance company (Macif) and to generate flood areas associated with hazard return periods. The flood maps cover river overflow and surface water runoff. The maps are validated by comparison with address-located claim data on a small catchment (the downstream Argens).
Selecting the Right Courseware for Your Online Learning Program.
ERIC Educational Resources Information Center
O'Mara, Heather
2000-01-01
Presents criteria for selecting courseware for online classes. Highlights include ease of use, including navigation; assessment tools; advantages of Java-enabled courseware; advantages of Oracle databases, including scalability; future possibilities for multimedia technology; and open architecture that will integrate with other systems. (LRW)
Scalable global grid catalogue for Run3 and beyond
NASA Astrophysics Data System (ADS)
Martinez Pedreira, M.; Grigoras, C.;
2017-10-01
The AliEn (ALICE Environment) file catalogue is a global unique namespace providing mapping between a UNIX-like logical name structure and the corresponding physical files distributed over 80 storage elements worldwide. Powerful search tools and hierarchical metadata information are integral parts of the system and are used by the Grid jobs as well as local users to store and access all files on the Grid storage elements. The catalogue has been in production since 2005 and over the past 11 years has grown to more than 2 billion logical file names. The backend is a set of distributed relational databases, ensuring smooth growth and fast access. Due to the anticipated fast future growth, we are looking for ways to enhance the performance and scalability by simplifying the catalogue schema while keeping the functionality intact. We investigated different backend solutions, such as distributed key value stores, as replacement for the relational database. This contribution covers the architectural changes in the system, together with the technology evaluation, benchmark results and conclusions.
Kiranyaz, Serkan; Mäkinen, Toni; Gabbouj, Moncef
2012-10-01
In this paper, we propose a novel framework based on a collective network of evolutionary binary classifiers (CNBC) to address the problems of feature and class scalability. The main goal of the proposed framework is to achieve high classification performance over dynamic audio and video repositories. The proposed framework adopts a "divide and conquer" approach in which an individual network of binary classifiers (NBC) is allocated to discriminate each audio class. An evolutionary search is applied to find the best binary classifier in each NBC with respect to a given criterion. Through incremental evolution sessions, the CNBC framework can dynamically adapt to each new incoming class or feature set without resorting to full-scale re-training or re-configuration. The CNBC framework is therefore particularly suited to dynamically varying databases, where conventional static classifiers cannot adapt to such changes. In short, it is a novel topology and an unprecedented approach for dynamic, content/data-adaptive and scalable audio classification. A large set of audio features can be used effectively in the framework, where the CNBCs make appropriate selections and combinations so as to achieve the highest discrimination among individual audio classes. Experiments demonstrate a high classification accuracy (above 90%) and the efficiency of the proposed framework over large and dynamic audio databases. Copyright © 2012 Elsevier Ltd. All rights reserved.
Term Dependence: Truncating the Bahadur Lazarsfeld Expansion.
ERIC Educational Resources Information Center
Losee, Robert M., Jr.
1994-01-01
Studies the performance of probabilistic information retrieval systems using differing statistical dependence assumptions when estimating the probabilities inherent in the retrieval model. Experimental results using the Bahadur Lazarsfeld expansion on the Cystic Fibrosis database are discussed that suggest that incorporating term dependence…
Visualization for genomics: the Microbial Genome Viewer.
Kerkhoven, Robert; van Enckevort, Frank H J; Boekhorst, Jos; Molenaar, Douwe; Siezen, Roland J
2004-07-22
A Web-based visualization tool, the Microbial Genome Viewer, is presented that allows the user to combine complex genomic data in a highly interactive way. This Web tool enables the interactive generation of chromosome wheels and linear genome maps from genome annotation data stored in a MySQL database. The generated images are in scalable vector graphics (SVG) format, which is suitable for creating high-quality scalable images and dynamic Web representations. Gene-related data such as transcriptome and time-course microarray experiments can be superimposed on the maps for visual inspection. The Microbial Genome Viewer 1.0 is freely available at http://www.cmbi.kun.nl/MGV
CICS Region Virtualization for Cost Effective Application Development
ERIC Educational Resources Information Center
Khan, Kamal Waris
2012-01-01
Mainframe is used for hosting large commercial databases, transaction servers and applications that require a greater degree of reliability, scalability and security. Customer Information Control System (CICS) is a mainframe software framework for implementing transaction services. It is designed for rapid, high-volume online processing. In order…
A Web-Based Database for Nurse Led Outreach Teams (NLOT) in Toronto.
Li, Shirley; Kuo, Mu-Hsing; Ryan, David
2016-01-01
A web-based system can provide access to real-time data and information. Healthcare is moving towards digitizing patients' medical information and securely exchanging it through web-based systems. In one of Ontario's health regions, Nurse Led Outreach Teams (NLOT) provide emergency mobile nursing services to help reduce unnecessary transfers from long-term care homes to emergency departments. Currently the NLOT team uses a Microsoft Access database to keep track of the health information on the residents that they serve. The Access database lacks scalability, portability, and interoperability. The objective of this study is the development of a web-based database using Oracle Application Express that is easily accessible from mobile devices. The web-based database will allow NLOT nurses to enter and access resident information anytime and from anywhere.
In Situ Distribution Guided Analysis and Visualization of Transonic Jet Engine Simulations.
Dutta, Soumya; Chen, Chun-Ming; Heinlein, Gregory; Shen, Han-Wei; Chen, Jen-Ping
2017-01-01
Study of flow instability in turbine engine compressors is crucial to understanding the inception and evolution of engine stall. Aerodynamics experts have been working on detecting the early signs of stall in order to devise novel stall suppression technologies. A state-of-the-art Navier-Stokes based, time-accurate computational fluid dynamics simulator, TURBO, has been developed at NASA to enhance the understanding of flow phenomena undergoing rotating stall. Despite the proven high modeling accuracy of TURBO, the volume of simulation data it produces prohibits post-hoc analysis, in terms of both storage and I/O time. To address these issues and allow the expert to perform scalable stall analysis, we have designed an in situ, distribution-guided stall analysis technique. Our method summarizes statistics of important properties of the simulation data in situ using a probabilistic data modeling scheme. This data summarization enables statistical anomaly detection for flow instability in post analysis, revealing the spatiotemporal trends of rotating stall from which the expert can form new hypotheses. The verification of these hypotheses and exploratory visualization of the summarized data are realized through probabilistic visualization techniques such as uncertain isocontouring. Positive feedback from the domain scientist has indicated the efficacy of our system in exploratory stall analysis.
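The core idea, keeping compact per-block distributions instead of raw fields and flagging departures from a reference distribution afterwards, can be sketched as follows. Block size, bin count, and the distance threshold are illustrative, and the paper's actual probabilistic model is richer than plain histograms.

```python
# In situ distribution summaries plus post-hoc anomaly detection (sketch).
import numpy as np

def summarize_blocks(field, block=16, bins=32, rng=(0.0, 1.0)):
    """Per-block normalized histograms: the compact summary kept in situ
    instead of the raw field."""
    ny, nx = field.shape
    hists = {}
    for i in range(0, ny, block):
        for j in range(0, nx, block):
            h, _ = np.histogram(field[i:i+block, j:j+block],
                                bins=bins, range=rng)
            hists[(i, j)] = h / max(h.sum(), 1)   # probability per bin
    return hists

def anomalous_blocks(hists, reference, threshold=0.3):
    """Flag blocks whose distribution drifts from the reference
    (total variation distance)."""
    return [k for k, p in hists.items()
            if 0.5 * np.abs(p - reference).sum() > threshold]

# Toy usage: a uniform reference versus a field with one perturbed corner.
field = np.random.default_rng(1).random((64, 64))
field[:16, :16] = 0.95                        # localized "instability"
summary = summarize_blocks(field)
ref = np.full(32, 1.0 / 32)
print(anomalous_blocks(summary, ref))         # reports the perturbed block(s)
```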
Liao, Stephen Shaoyi; Wang, Huai Qing; Li, Qiu Dan; Liu, Wei Yi
2006-06-01
This paper presents a new method for learning Bayesian networks from functional dependencies (FDs) and third normal form (3NF) tables in relational databases. The method sets up a linkage between the theory of relational databases and probabilistic reasoning models, which is interesting and especially useful when data are incomplete or inaccurate. The effectiveness and practicability of the proposed method are demonstrated by its implementation in a mobile commerce system.
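A simplified sketch of the structural intuition (not the paper's full method): each functional dependency X → Y contributes candidate directed edges from the attributes of X to Y, seeding a network skeleton whose conditional probability tables would then be estimated from the tables themselves.

```python
# Derive candidate Bayesian-network edges from functional dependencies.
def edges_from_fds(fds):
    """fds: list of (determinant_attrs, dependent_attr) pairs."""
    edges = set()
    for determinants, dependent in fds:
        for attr in determinants:
            edges.add((attr, dependent))    # X -> Y for each X in determinant
    return sorted(edges)

fds = [(("customer_id",), "city"),           # customer_id -> city
       (("city",), "region"),                # city -> region
       (("product_id", "region"), "price")]  # product_id, region -> price
print(edges_from_fds(fds))
```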
Cotič, Živa; Rees, Rebecca; Wark, Petra A; Car, Josip
2016-10-19
In 2013, there was a shortage of approximately 7.2 million health workers worldwide, a shortage that is more acute among family physicians than among specialists. eLearning could provide a potential solution to some of these global workforce challenges. However, there is little evidence on the factors facilitating or hindering the implementation, adoption, use, scalability and sustainability of eLearning. This review aims to synthesise results from qualitative and mixed-methods studies to provide insight into the factors influencing the implementation of eLearning for family medicine specialty education and training. Additionally, it aims to identify the actions needed to increase the effectiveness of eLearning and the strategies required to improve its implementation, adoption, use, sustainability and scalability for family medicine specialty education and training. A systematic search will be conducted across a range of databases for qualitative studies focusing on experiences, barriers, facilitators, and other factors related to the implementation, adoption, use, sustainability and scalability of eLearning for family medicine specialty education and training. Studies will be synthesised using the framework analysis approach. This study will contribute to the evaluation of eLearning implementation, adoption, use, sustainability and scalability for family medicine specialty training and education and to the development of eLearning guidelines for postgraduate medical education. PROSPERO http://www.crd.york.ac.uk/PROSPERO/display_record.asp?ID=CRD42016036449.
Plumb, Jenny; Pigat, Sandrine; Bompola, Foteini; Cushen, Maeve; Pinchen, Hannah; Nørby, Eric; Astley, Siân; Lyons, Jacqueline; Kiely, Mairead; Finglas, Paul
2017-03-23
eBASIS (Bioactive Substances in Food Information Systems), a web-based database that contains compositional and biological effects data for bioactive compounds of plant origin, has been updated with new data on fruits and vegetables, wheat and, due to some evidence of potential beneficial effects, extended to include meat bioactives. eBASIS remains one of only a handful of comprehensive and searchable databases, with up-to-date coherent and validated scientific information on the composition of food bioactives and their putative health benefits. The database has a user-friendly, efficient, and flexible interface facilitating use by both the scientific community and food industry. Overall, eBASIS contains data for 267 foods, covering the composition of 794 bioactive compounds, from 1147 quality-evaluated peer-reviewed publications, together with information from 567 publications describing beneficial bioeffect studies carried out in humans. This paper highlights recent updates and expansion of eBASIS and the newly-developed link to a probabilistic intake model, allowing exposure assessment of dietary bioactive compounds to be estimated and modelled in human populations when used in conjunction with national food consumption data. This new tool could assist small- and medium-sized enterprises (SMEs) in the development of food product health claim dossiers for submission to the European Food Safety Authority (EFSA).
Lung Cancer Assistant: a hybrid clinical decision support application for lung cancer care.
Sesen, M Berkan; Peake, Michael D; Banares-Alcantara, Rene; Tse, Donald; Kadir, Timor; Stanley, Roz; Gleeson, Fergus; Brady, Michael
2014-09-06
Multidisciplinary team (MDT) meetings are becoming the model of care for cancer patients worldwide. While MDTs have improved the quality of cancer care, the meetings impose substantial time pressure on the members, who generally attend several such MDTs. We describe Lung Cancer Assistant (LCA), a clinical decision support (CDS) prototype designed to assist experts in treatment selection decisions at lung cancer MDTs. A novel feature of LCA is its ability to provide rule-based and probabilistic decision support within a single platform. The guideline-based CDS is based on clinical guideline rules, while the probabilistic CDS is based on a Bayesian network trained on the English Lung Cancer Audit Database (LUCADA). We assess the rule-based and probabilistic recommendations based on their concordance with the treatments recorded in LUCADA. Our results reveal that the guideline rule-based recommendations perform well in simulating the recorded treatments, with exact and partial concordance rates of 0.57 and 0.79, respectively. The exact and partial concordance rates achieved with the probabilistic recommendations are poorer, at 0.27 and 0.76. However, probabilistic decision support fulfils a complementary role in providing accurate survival estimates. Compared to recorded treatments, both CDS approaches promote higher resection rates and multimodality treatments.
[Application of the life sciences platform based on oracle to biomedical informations].
Zhao, Zhi-Yun; Li, Tai-Huan; Yang, Hong-Qiao
2008-03-01
The life sciences platform based on Oracle database technology is introduced in this paper. By providing powerful data access, integrating a variety of data types, and managing vast quantities of data, the software presents a flexible, safe and scalable management platform for biomedical data processing.
Mining and Indexing Graph Databases
ERIC Educational Resources Information Center
Yuan, Dayu
2013-01-01
Graphs are widely used to model structures and relationships of objects in various scientific and commercial fields. Chemical molecules, proteins, malware system-call dependencies and three-dimensional mechanical parts are all modeled as graphs. In this dissertation, we propose to mine and index those graph data to enable fast and scalable search.…
Databases Don't Measure Motivation
ERIC Educational Resources Information Center
Yeager, Joseph
2005-01-01
Automated persuasion is the Holy Grail of quantitatively biased data base designers. However, data base histories are, at best, probabilistic estimates of customer behavior and do not make use of more sophisticated qualitative motivational profiling tools. While usually absent from web designer thinking, qualitative motivational profiling can be…
EXPOSURE TO PESTICIDES BY MEDIUM AND ROUTE: THE 90TH PERCENTILE AND RELATED UNCERTAINTIES
This study investigates distributions of exposure to chlorpyrifos and diazinon using the database generated in the state of Arizona by the National Human Exposure Assessment Survey (NHEXAS-AZ). Exposure to pesticide and associated uncertainties are estimated using probabilistic...
Developing and Implementing the Data Mining Algorithms in RAVEN
DOE Office of Scientific and Technical Information (OSTI.GOV)
Sen, Ramazan Sonat; Maljovec, Daniel Patrick; Alfonsi, Andrea
The RAVEN code is becoming a comprehensive tool to perform probabilistic risk assessment, uncertainty quantification, and verification and validation. The RAVEN code is being developed to support many programs and to provide a set of methodologies and algorithms for advanced analysis. Scientific computer codes can generate enormous amounts of data, and post-processing and analyzing such data might, in some cases, take longer than the initial software runtime. Data mining algorithms and methods help in recognizing and understanding patterns in the data, and thus discover knowledge in databases. The methodologies used in dynamic probabilistic risk assessment or in uncertainty and error quantification analysis couple system/physics codes with simulation controller codes, such as RAVEN. RAVEN introduces both deterministic and stochastic elements into the simulation, while the system/physics code models the dynamics deterministically. A typical analysis is performed by sampling values of a set of parameters. A major challenge in using dynamic probabilistic risk assessment or uncertainty and error quantification analysis for a complex system is analyzing the large number of scenarios generated. Data mining techniques are typically used to better organize and understand the data, i.e., to recognize patterns in it. This report focuses on the development and implementation of Application Programming Interfaces (APIs) for different data mining algorithms, and on the application of these algorithms to different databases.
Fault displacement hazard assessment for nuclear installations based on IAEA safety standards
NASA Astrophysics Data System (ADS)
Fukushima, Y.
2016-12-01
In the IAEA Safety Requirements NS-R-3, surface fault displacement hazard assessment (FDHA) is required for the siting of nuclear installations. If any capable faults exist at a candidate site, the IAEA recommends the consideration of alternative sites. However, with the progress of palaeoseismological investigations, capable faults may be found at existing sites. In such a case, the IAEA recommends evaluating safety using probabilistic FDHA (PFDHA), an empirical approach based on a still quite limited database. A basic and crucial improvement is therefore to enlarge the database. In 2015, the IAEA produced TecDoc-1767 on palaeoseismology as a reference for the identification of capable faults. IAEA Safety Report 85, on ground motion simulation based on fault rupture modelling, provides an annex introducing recent PFDHAs and fault displacement simulation methodologies. The IAEA has expanded the FDHA project to cover both the probabilistic approach and physics-based fault rupture modelling. The first approach needs a refinement of the empirical methods by building a worldwide database; the second needs to shift from a kinematic to a dynamic scheme. The two approaches can complement each other, since simulated displacements can fill the gaps of a sparse database and geological observations can be used to calibrate the simulations. The IAEA supported a workshop in October 2015 to discuss the existing databases with the aim of creating a common worldwide database, and a consensus on a unified database was reached. The next milestone is to fill the database with as many fault rupture data sets as possible. Another IAEA working group held a workshop in November 2015 to discuss the state of the art in PFDHA as well as simulation methodologies. The two groups joined a consultancy meeting in February 2016, where they shared information, identified issues, discussed goals and outputs, and scheduled future meetings. The aim now is to coordinate activities across the whole set of FDHA tasks.
Agents, assemblers, and ANTS: scheduling assembly with market and biological software mechanisms
NASA Astrophysics Data System (ADS)
Toth-Fejel, Tihamer T.
2000-06-01
Nanoscale assemblers will need robust, scalable, flexible, and well-understood mechanisms such as software agents to control them. This paper discusses assemblers and agents, and proposes a taxonomy of their possible interaction. Molecular assembly is seen as a special case of general assembly, subject to many of the same issues, such as the advantages of convergent assembly, and the problem of scheduling. This paper discusses the contract net architecture of ANTS, an agent-based scheduling application under development. It also describes an algorithm for least commitment scheduling, which uses probabilistic committed capacity profiles of resources over time, along with realistic costs, to provide an abstract search space over which the agents can wander to quickly find optimal solutions.
NASA Technical Reports Server (NTRS)
Maluf, David A.; Tran, Peter B.
2003-01-01
Object-Relational database management systems are an integrated, hybrid, cooperative approach combining the best practices of the relational model, utilizing SQL queries, and the object-oriented, semantic paradigm supporting complex data creation. In this paper, a highly scalable, information-on-demand database framework called NETMARK is introduced. NETMARK takes advantage of the Oracle 8i object-relational database, using physical address data types for very efficient keyword search of records spanning both context and content. NETMARK was originally developed in early 2000 as a research and development prototype to address the vast amounts of unstructured and semi-structured documents existing within NASA enterprises. Today, NETMARK is a flexible, high-throughput open database framework for managing, storing, and searching unstructured or semi-structured arbitrary hierarchical models, such as XML and HTML.
An Extensible Schema-less Database Framework for Managing High-throughput Semi-Structured Documents
NASA Technical Reports Server (NTRS)
Maluf, David A.; Tran, Peter B.; La, Tracy; Clancy, Daniel (Technical Monitor)
2002-01-01
Object-Relational database management systems are an integrated, hybrid, cooperative approach combining the best practices of the relational model, utilizing SQL queries, and the object-oriented, semantic paradigm supporting complex data creation. In this paper, a highly scalable, information-on-demand database framework called NETMARK is introduced. NETMARK takes advantage of the Oracle 8i object-relational database, using physical address data types for very efficient keyword searches of records by both context and content. NETMARK was originally developed in early 2000 as a research and development prototype to address the vast amounts of unstructured and semi-structured documents existing within NASA enterprises. Today, NETMARK is a flexible, high-throughput open database framework for managing, storing, and searching unstructured or semi-structured arbitrary hierarchical models such as XML and HTML.
A Probabilistic Approach to Crosslingual Information Retrieval
2001-06-01
…language expansion step can be performed before the translation process. Implemented as a call to the INQUERY function get_modified_query with one of the… …database consists of American English while the dictionary is British English. Therefore, e.g. the Spanish word basura is translated to rubbish and…
Burns, Randal; Roncal, William Gray; Kleissas, Dean; Lillaney, Kunal; Manavalan, Priya; Perlman, Eric; Berger, Daniel R; Bock, Davi D; Chung, Kwanghun; Grosenick, Logan; Kasthuri, Narayanan; Weiler, Nicholas C; Deisseroth, Karl; Kazhdan, Michael; Lichtman, Jeff; Reid, R Clay; Smith, Stephen J; Szalay, Alexander S; Vogelstein, Joshua T; Vogelstein, R Jacob
2013-01-01
We describe a scalable database cluster for the spatial analysis and annotation of high-throughput brain imaging data, initially for 3-d electron microscopy image stacks, but for time-series and multi-channel data as well. The system was designed primarily for workloads that build connectomes (neural connectivity maps of the brain) using the parallel execution of computer vision algorithms on high-performance compute clusters. These services and open-science data sets are publicly available at openconnecto.me. The system design inherits much from NoSQL scale-out and data-intensive computing architectures. We distribute data to cluster nodes by partitioning a spatial index. We direct I/O to different systems (reads to parallel disk arrays and writes to solid-state storage) to avoid I/O interference and maximize throughput. All programming interfaces are RESTful Web services, which are simple and stateless, improving scalability and usability. We include a performance evaluation of the production system, highlighting the effectiveness of spatial data organization.
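Spatial-index partitioning of this kind is often done with space-filling curves. The sketch below, an assumption rather than the system's documented scheme, interleaves coordinate bits into a Morton (Z-order) key and assigns contiguous key ranges to nodes, so nearby image cuboids tend to co-locate.

```python
# Morton (Z-order) keys for partitioning a 3D spatial index across nodes.
def morton3(x: int, y: int, z: int, bits: int = 10) -> int:
    """Interleave the low `bits` bits of x, y, z into one Z-order key."""
    key = 0
    for b in range(bits):
        key |= ((((x >> b) & 1) << (3 * b))
                | (((y >> b) & 1) << (3 * b + 1))
                | (((z >> b) & 1) << (3 * b + 2)))
    return key

def node_for(x, y, z, n_nodes=8, bits=10):
    """Map a cuboid's corner coordinates to one of n_nodes storage nodes
    by splitting the key space into contiguous ranges."""
    return morton3(x, y, z, bits) * n_nodes >> (3 * bits)

# Adjacent cuboids usually land on the same node; distant ones often do not.
print(node_for(5, 5, 5), node_for(5, 5, 6), node_for(900, 10, 3))
```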
DOE Office of Scientific and Technical Information (OSTI.GOV)
White, Richard A.; Panyala, Ajay R.; Glass, Kevin A.
MerCat is a parallel, highly scalable and modular software package for robust analysis of features in next-generation sequencing data. MerCat accepts assembled contigs and raw sequence reads from any platform as input and produces feature abundance count tables. MerCat allows direct analysis of data properties without the reference-sequence-database dependency common to search tools such as BLAST and/or DIAMOND for compositional analysis of whole-community shotgun sequencing (e.g., metagenomes and metatranscriptomes).
NASA Astrophysics Data System (ADS)
Shahzad, Muhammad A.
1999-02-01
With the emergence of data warehousing, decision support systems have evolved substantially. At the core of these warehousing systems lies a good database management system. A database server used for data warehousing is responsible for providing robust data management, scalability, high-performance query processing, and integration with other servers. Oracle, one of the initiators among warehousing servers, provides a wide range of features for facilitating data warehousing. This paper reviews data warehousing, first laying out the concept itself and then the features of Oracle servers for implementing a data warehouse.
Soeiro, Claudia Marques de Oliveira; Miranda, Angélica Espinosa; Saraceni, Valeria; Santos, Marcelo Cordeiro dos; Talhari, Sinesio; Ferreira, Luiz Carlos de Lima
2014-04-01
This study analyzes the notification of syphilis in pregnancy and congenital syphilis in Amazonas State, Brazil, from 2007 to 2009, verifying underreporting in the databases of the National Information System on Diseases of Notification (SINAN) and the occurrence of perinatal deaths associated with congenital syphilis but not reported in the Mortality Information System (SIM). This was a cross-sectional study with probabilistic record linkage between SINAN and SIM. There were 666 reports of syphilis in pregnant women: 224 in 2007 (3.8/1,000), 244 in 2008 (4.5/1,000), and 198 in 2009 (4.0/1,000). The study found 486 cases of congenital syphilis: 153 in 2007 (2.1/1,000), 193 in 2008 (2.6/1,000), and 140 in 2009 (2.0/1,000). After linkage of the SINAN databases, 237 pregnant women (35.6%) had cases of congenital syphilis reported. The SIM recorded 4,905 perinatal deaths, of which 57.8% were stillbirths. Probabilistic record linkage between SIM and SINAN-Congenital Syphilis yielded 13 matched records. The use of SINAN and SIM may not reflect the full magnitude of syphilis, but they provide a basis for monitoring and analyzing this health problem, with a view towards planning and management.
Sjöberg, C; Ahnesjö, A
2013-06-01
Label fusion multi-atlas approaches to image segmentation can give better segmentation results than single-atlas methods. We present a multi-atlas label fusion strategy based on probabilistic weighting of distance maps. Relationships between image similarities and segmentation similarities are estimated in a learning phase and used to derive fusion weights that are proportional to the probability of each atlas improving the segmentation result. The method was tested using a leave-one-out strategy on a database of 21 pre-segmented prostate patients, for different image registrations combined with different image similarity scorings. The probabilistic weighting yields results that are equal to or better than both fusion with equal weights and the STAPLE algorithm. The experiments demonstrate that label fusion by weighted distance maps is feasible, and that probabilistic weighted fusion improves segmentation quality more, the more strongly the individual atlas segmentation quality depends on the similarity of the corresponding registered image. The regions used for evaluating the image similarity measures were found to be more important than the choice of similarity measure. Copyright © 2013 Elsevier Ireland Ltd. All rights reserved.
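A minimal sketch of fusion by weighted distance maps, assuming the per-atlas weights have already been derived from the learned similarity-to-quality relationship: each propagated segmentation is converted to a signed distance map, the maps are averaged with the probabilistic weights, and the fused label is taken where the result is negative. The details differ from the paper's implementation.

```python
# Weighted distance-map label fusion (sketch), using SciPy's EDT.
import numpy as np
from scipy.ndimage import distance_transform_edt

def signed_distance(mask):
    """Negative inside the structure, positive outside."""
    inside = distance_transform_edt(mask)
    outside = distance_transform_edt(~mask)
    return outside - inside

def fuse(atlas_masks, weights):
    weights = np.asarray(weights, dtype=float)
    weights /= weights.sum()                    # probabilistic weighting
    fused = sum(w * signed_distance(m) for w, m in zip(weights, atlas_masks))
    return fused < 0                            # fused binary segmentation

# Three toy 2D "atlas" segmentations of a square, slightly shifted:
masks = [np.zeros((64, 64), bool) for _ in range(3)]
for k, m in enumerate(masks):
    m[20 + k:40 + k, 20:40] = True
seg = fuse(masks, weights=[0.5, 0.3, 0.2])      # weights would be learned
```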
Building Scalable Knowledge Graphs for Earth Science
NASA Astrophysics Data System (ADS)
Ramachandran, R.; Maskey, M.; Gatlin, P. N.; Zhang, J.; Duan, X.; Bugbee, K.; Christopher, S. A.; Miller, J. J.
2017-12-01
Estimates indicate that the world's information will grow by 800% in the next five years. In any given field, a single researcher or a team of researchers cannot keep up with this rate of knowledge expansion without the help of cognitive systems. Cognitive computing, defined as the use of information technology to augment human cognition, can help tackle large systemic problems. Knowledge graphs, one of the foundational components of cognitive systems, link key entities in a specific domain with other entities via relationships. Researchers can mine these graphs to make probabilistic recommendations and to infer new knowledge. At this point, however, there is a dearth of tools for generating scalable knowledge graphs from the existing corpus of scientific literature for Earth science research. Our project is developing an end-to-end automated methodology for incrementally constructing knowledge graphs for Earth science. Semantic Entity Recognition (SER) is one of the key steps in this methodology. SER for Earth science uses external resources (including metadata catalogs and controlled vocabularies) as references to guide entity extraction and recognition (i.e., labeling) from unstructured text, in order to build a large training set to seed the subsequent auto-learning component of our algorithm. Results from several SER experiments will be presented, as well as lessons learned.
Visibiome: an efficient microbiome search engine based on a scalable, distributed architecture.
Azman, Syafiq Kamarul; Anwar, Muhammad Zohaib; Henschel, Andreas
2017-07-24
Given the current influx of 16S rRNA profiles of microbiota samples, it is conceivable that large amounts of them will eventually be available for search, comparison and contextualization with respect to novel samples. This process facilitates the identification of similar compositional features in microbiota elsewhere and can therefore help to understand the driving factors of microbial community assembly. We present Visibiome, a microbiome search engine that can perform exhaustive, phylogeny-based similarity search and contextualization of user-provided samples against a comprehensive dataset of 16S rRNA profiles from diverse environments, while tackling several computational challenges. In order to scale to high demands, we developed a distributed system that combines web framework technology, task queueing and scheduling, cloud computing and a dedicated database server. To further ensure speed and efficiency, we have deployed nearest neighbor search algorithms capable of sublinear searches in high-dimensional metric spaces, in combination with an optimized Earth Mover's Distance based implementation of weighted UniFrac. The search also incorporates pairwise (adaptive) rarefaction and, optionally, 16S rRNA copy number correction. The result of a query is the contextualization of the microbiome sample against a comprehensive database of microbiome samples from a diverse range of environments, visualized through a rich set of interactive figures and diagrams, including barchart-based compositional comparisons and a ranking of the closest matches in the database. Visibiome is a convenient, scalable and efficient framework to search microbiomes against a comprehensive database of environmental samples. The search engine leverages a popular but computationally expensive phylogeny-based distance metric, while providing numerous advantages over the current state-of-the-art tool.
2014-01-01
Automatic reconstruction of metabolic pathways for an organism from genomics and transcriptomics data has been a challenging and important problem in bioinformatics. Traditionally, known reference pathways can be mapped onto organism-specific ones based on genome annotation and protein homology. However, this simple knowledge-based mapping method can produce incomplete pathways and generally cannot predict unknown new relations and reactions. In contrast, ab initio metabolic network construction methods can predict novel reactions and interactions, but their accuracy tends to be low, leading to many false positives. Here we combine existing pathway knowledge and a new ab initio Bayesian probabilistic graphical model in a novel fashion to improve automatic reconstruction of metabolic networks. Specifically, we built a knowledge database containing known individual gene/protein interactions and metabolic reactions extracted from existing reference pathways. Known reactions and interactions were then used as constraints for Bayesian network learning methods to predict metabolic pathways. Using individual reactions and interactions extracted from different pathways of many organisms to guide pathway construction is new and improves both the coverage and the accuracy of metabolic pathway construction. We applied this probabilistic knowledge-based approach to construct metabolic networks from yeast gene expression data and compared its results with 62 known metabolic networks in the KEGG database. The experiment showed that the method improved the coverage of metabolic network construction over the traditional reference pathway mapping method and was more accurate than pure ab initio methods. PMID:25374614
Toward uniform probabilistic seismic hazard assessments for Southeast Asia
NASA Astrophysics Data System (ADS)
Chan, C. H.; Wang, Y.; Shi, X.; Ornthammarath, T.; Warnitchai, P.; Kosuwan, S.; Thant, M.; Nguyen, P. H.; Nguyen, L. M.; Solidum, R., Jr.; Irsyam, M.; Hidayati, S.; Sieh, K.
2017-12-01
Although most Southeast Asian countries have seismic hazard maps, varying methodologies and quality result in appreciable mismatches at national boundaries. We aim to conduct a uniform assessment across the region through standardized earthquake and fault databases, ground-shaking scenarios, and regional hazard maps. Our earthquake database contains earthquake parameters obtained from global and national seismic networks, harmonized by removing duplicate events and using moment magnitude. Our active-fault database includes fault parameters from previous studies and from the databases compiled for national seismic hazard maps. Another crucial input for seismic hazard assessment is a proper evaluation of ground-shaking attenuation. Since few ground-motion prediction equations (GMPEs) have been derived from local observations in this region, we evaluated attenuation by comparing instrumental observations and felt intensities for recent earthquakes with the ground shaking predicted by published GMPEs. We then incorporated the best-fitting GMPEs and site conditions into our seismic hazard assessments. Based on these databases and GMPEs, we have constructed regional probabilistic seismic hazard maps. The assessment shows the highest seismic hazard levels near faults with high slip rates, including the Sagaing Fault in central Myanmar, the Sumatran Fault in Sumatra, the Palu-Koro, Matano and Lawanopo Faults in Sulawesi, and the Philippine Fault across several islands of the Philippines. In addition, our assessment demonstrates that regions with low earthquake probability may well have a higher aggregate probability of future earthquakes, since they encompass much larger areas than the areas of high probability. The significant irony, then, is that in areas of low to moderate probability, where building codes usually require less seismic resilience, seismic risk is likely to be greater. Infrastructural damage in East Malaysia during the 2015 Sabah earthquake offers a case in point.
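The aggregate-probability point can be made with a one-line Poisson calculation; the rates below are invented for illustration only.

```python
# Under a Poisson model, P(at least one event in t years) = 1 - exp(-rate * t).
# Many low-rate cells can aggregate to a higher probability than one hot zone.
import math

def exceedance_prob(annual_rate, years=50):
    return 1.0 - math.exp(-annual_rate * years)

high_rate_zone = exceedance_prob(0.004)                          # one fast fault
low_rate_region = 1.0 - (1.0 - exceedance_prob(0.0004)) ** 30    # 30 quiet cells
print(f"{high_rate_zone:.2f} vs {low_rate_region:.2f}")          # ~0.18 vs ~0.45
```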
Zhang, Miaomiao; Wells, William M; Golland, Polina
2017-10-01
We present an efficient probabilistic model of anatomical variability in a linear space of initial velocities of diffeomorphic transformations and demonstrate its benefits in clinical studies of brain anatomy. To overcome the computational challenges of high dimensional deformation-based descriptors, we develop a latent variable model for principal geodesic analysis (PGA) based on a low dimensional shape descriptor that effectively captures the intrinsic variability in a population. We define a novel shape prior that explicitly represents principal modes as a multivariate complex Gaussian distribution on the initial velocities in a bandlimited space. We demonstrate the performance of our model on a set of 3D brain MRI scans from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database. Our model yields a more compact representation of group variation at substantially lower computational cost than state-of-the-art methods such as tangent space PCA (TPCA) and probabilistic principal geodesic analysis (PPGA) that operate in the high dimensional image space. Copyright © 2017 Elsevier B.V. All rights reserved.
Compression of Probabilistic XML Documents
NASA Astrophysics Data System (ADS)
Veldman, Irma; de Keijzer, Ander; van Keulen, Maurice
Database techniques to store, query and manipulate data that contains uncertainty receive increasing research interest. Such uncertain DBMSs (UDBMSs) can be classified according to their underlying data model: relational, XML, or RDF. We focus on uncertain XML DBMSs, with the Probabilistic XML model (PXML) of [10,9] as a representative example. The size of a PXML document is obviously a factor in performance. There are PXML-specific techniques to reduce the size, such as a push-down mechanism that produces equivalent but more compact PXML documents. It can only be applied, however, where possibilities are dependent. For normal XML documents there also exist several techniques for compressing a document. Since Probabilistic XML is (a special form of) normal XML, it might benefit from these methods even more. In this paper, we show that existing compression mechanisms can be combined with PXML-specific compression techniques. We also show that the best compression rates are obtained by combining a PXML-specific technique with a rather simple generic DAG-compression technique.
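As a flavor of the generic technique mentioned last, here is a minimal DAG compression of an XML tree by hash-consing identical subtrees. It illustrates the general idea only, not the authors' implementation, and the tiny probabilistic document is made up.

```python
# Generic DAG compression: identical subtrees collapse to one shared node.
import xml.etree.ElementTree as ET

def compress(elem, table):
    """Return a canonical node id for elem, sharing identical subtrees."""
    child_ids = tuple(compress(c, table) for c in elem)
    key = (elem.tag, tuple(sorted(elem.attrib.items())),
           (elem.text or "").strip(), child_ids)
    if key not in table:
        table[key] = len(table)  # first occurrence: allocate a node id
    return table[key]

doc = ET.fromstring(
    "<r><p prob='0.5'><a/><b/></p><p prob='0.5'><a/><b/></p></r>")
table = {}
compress(doc, table)
# The two identical <p> possibilities collapse to one shared DAG node:
print(len(table), "unique nodes for", sum(1 for _ in doc.iter()), "elements")
```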
A computational platform to maintain and migrate manual functional annotations for BioCyc databases.
Walsh, Jesse R; Sen, Taner Z; Dickerson, Julie A
2014-10-12
BioCyc databases are an important resource for information on biological pathways and genomic data. Such databases represent the accumulation of biological data, some of which has been manually curated from the literature. An essential feature of these databases is continuing data integration as new knowledge is discovered. As functional annotations are improved, scalable methods are needed for curators to manage annotations without detailed knowledge of the specific design of the BioCyc database. We have developed CycTools, a software tool which allows curators to maintain functional annotations in a model organism database. This tool builds on existing software to simplify the import of user-provided annotation data into BioCyc databases. Additionally, CycTools automatically resolves synonyms and alternate identifiers contained within the database into the appropriate internal identifiers. Automating steps in the manual data entry process can improve curation efforts for major biological databases. The functionality of CycTools is demonstrated by transferring GO term annotations from MaizeCyc to matching proteins in CornCyc, both maize metabolic pathway databases available at MaizeGDB, and by creating strain-specific databases for metabolic engineering.
Newgard, Craig; Malveau, Susan; Staudenmayer, Kristan; Wang, N. Ewen; Hsia, Renee Y.; Mann, N. Clay; Holmes, James F.; Kuppermann, Nathan; Haukoos, Jason S.; Bulger, Eileen M.; Dai, Mengtao; Cook, Lawrence J.
2012-01-01
Objectives The objective was to evaluate the process of using existing data sources, probabilistic linkage, and multiple imputation to create large population-based injury databases matched to outcomes. Methods This was a retrospective cohort study of injured children and adults transported by 94 emergency medical systems (EMS) agencies to 122 hospitals in seven regions of the western United States over a 36-month period (2006 to 2008). All injured patients evaluated by EMS personnel within specific geographic catchment areas were included, regardless of field disposition or outcome. The authors performed probabilistic linkage of EMS records to four hospital and postdischarge data sources (emergency department [ED] data, patient discharge data, trauma registries, and vital statistics files) and then handled missing values using multiple imputation. The authors compare and evaluate matched records, match rates (proportion of matches among eligible patients), and injury outcomes within and across sites. Results There were 381,719 injured patients evaluated by EMS personnel in the seven regions. Among transported patients, match rates ranged from 14.9% to 87.5% and were directly affected by the availability of hospital data sources and proportion of missing values for key linkage variables. For vital statistics records (1-year mortality), estimated match rates ranged from 88.0% to 98.7%. Use of multiple imputation (compared to complete case analysis) reduced bias for injury outcomes, although sample size, percentage missing, type of variable, and combined-site versus single-site imputation models all affected the resulting estimates and variance. Conclusions This project demonstrates the feasibility and describes the process of constructing population-based injury databases across multiple phases of care using existing data sources and commonly available analytic methods. Attention to key linkage variables and decisions for handling missing values can be used to increase match rates between data sources, minimize bias, and preserve sampling design. PMID:22506952
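Probabilistic linkage of the kind described above is conventionally scored with the Fellegi-Sunter model; the sketch below shows the core match-weight computation. The fields and m/u probabilities are invented for illustration and are not the study's actual linkage variables.

```python
# Fellegi-Sunter-style match weight for a candidate record pair.
import math

# Per-field (m, u): P(agree | true match) and P(agree | non-match).
fields = {"dob": (0.95, 0.01), "zip": (0.90, 0.05), "sex": (0.98, 0.50)}

def match_weight(agreements):
    """Sum of log2 likelihood ratios over the linkage fields."""
    w = 0.0
    for field, agrees in agreements.items():
        m, u = fields[field]
        w += math.log2(m / u) if agrees else math.log2((1 - m) / (1 - u))
    return w

# An EMS record and an ED record agreeing on dob and zip but not sex:
print(match_weight({"dob": True, "zip": True, "sex": False}))
# Pairs above an upper threshold are linked; a middle band goes to review.
```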
Protein Simulation Data in the Relational Model.
Simms, Andrew M; Daggett, Valerie
2012-10-01
High performance computing is leading to unprecedented volumes of data. Relational databases offer a robust and scalable model for storing and analyzing scientific data. However, these features do not come without a cost: significant design effort is required to build a functional and efficient repository. Modeling protein simulation data in a relational database presents several challenges: the data captured from individual simulations are large, multi-dimensional, and must integrate with both simulation software and external data sites. Here we present the dimensional design and relational implementation of a comprehensive data warehouse for storing and analyzing molecular dynamics simulations using SQL Server.
Scalable and expressive medical terminologies.
Mays, E; Weida, R; Dionne, R; Laker, M; White, B; Liang, C; Oles, F J
1996-01-01
The K-Rep system, based on description logic, is used to represent and reason with large and expressive controlled medical terminologies. Expressive concept descriptions incorporate semantically precise definitions composed using logical operators, together with important non-semantic information such as synonyms and codes. Examples are drawn from our experience with K-Rep in modeling the InterMed laboratory terminology and also in developing a large clinical terminology now in production use at Kaiser-Permanente. System-level scalability of performance is achieved through an object-oriented database system which efficiently maps persistent memory to virtual memory. Equally important is conceptual scalability: the ability to support collaborative development, organization, and visualization of a substantial terminology as it evolves over time. K-Rep addresses this need by logically completing concept definitions and automatically classifying concepts in a taxonomy via subsumption inferences. The K-Rep system includes a general-purpose GUI environment for terminology development and browsing, a custom interface for formulary term maintenance, a C++ application program interface, and a distributed client-server mode which provides lightweight clients with efficient run-time access to K-Rep by means of a scripting language.
Scalable metagenomic taxonomy classification using a reference genome database
Ames, Sasha K.; Hysom, David A.; Gardner, Shea N.; Lloyd, G. Scott; Gokhale, Maya B.; Allen, Jonathan E.
2013-01-01
Motivation: Deep metagenomic sequencing of biological samples has the potential to recover otherwise difficult-to-detect microorganisms and accurately characterize biological samples with limited prior knowledge of sample contents. Existing metagenomic taxonomic classification algorithms, however, do not scale well to analyze large metagenomic datasets, and balancing classification accuracy with computational efficiency presents a fundamental challenge. Results: A method is presented to shift computational costs to an off-line computation by creating a taxonomy/genome index that supports scalable metagenomic classification. Scalable performance is demonstrated on real and simulated data to show accurate classification in the presence of novel organisms on samples that include viruses, prokaryotes, fungi and protists. Taxonomic classification of the previously published 150 giga-base Tyrolean Iceman dataset was found to take <20 h on a single node 40 core large memory machine and provide new insights on the metagenomic contents of the sample. Availability: Software was implemented in C++ and is freely available at http://sourceforge.net/projects/lmat Contact: allen99@llnl.gov Supplementary information: Supplementary data are available at Bioinformatics online. PMID:23828782
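A toy sketch of the offline index idea follows: every reference k-mer is mapped to the lowest common ancestor (LCA) of the taxa containing it, so a read's k-mers can later be classified by lookup. The miniature taxonomy and sequences are invented; the real LMAT index is far more elaborate.

```python
# Build a k-mer -> LCA taxonomy index from reference genomes (toy version).
taxonomy_parent = {"E.coli": "Bacteria", "B.subtilis": "Bacteria",
                   "Bacteria": "root", "root": None}

def lca(a, b):
    """Lowest common ancestor of two taxa in the parent map."""
    ancestors = set()
    while a:
        ancestors.add(a)
        a = taxonomy_parent[a]
    while b not in ancestors:
        b = taxonomy_parent[b]
    return b

def build_index(genomes, k=4):
    index = {}
    for taxon, seq in genomes.items():
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            index[kmer] = taxon if kmer not in index else lca(index[kmer], taxon)
    return index

index = build_index({"E.coli": "ACGTACGG", "B.subtilis": "ACGTTTGG"})
print(index["ACGT"])  # shared k-mer resolves to 'Bacteria'
```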
A new feature constituting approach to detection of vocal fold pathology
NASA Astrophysics Data System (ADS)
Hariharan, M.; Polat, Kemal; Yaacob, Sazali
2014-08-01
In the last two decades, non-invasive methods based on acoustic analysis of the voice signal have proved to be excellent and reliable tools for diagnosing vocal fold pathologies. This paper proposes a new feature vector based on the wavelet packet transform and singular value decomposition for the detection of vocal fold pathology. A k-means clustering-based feature weighting is proposed to increase the discriminative power of the proposed features. In this work, two databases are used: the Massachusetts Eye and Ear Infirmary (MEEI) voice disorders database and the MAPACI speech pathology database. Four different supervised classifiers, namely k-nearest neighbour (k-NN), least-squares support vector machine, probabilistic neural network and general regression neural network, are employed for testing the proposed features. The experimental results show that the proposed features give a very promising classification accuracy of 100% for both the MEEI database and the MAPACI speech pathology database.
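A compressed sketch of the feature pipeline, assuming the PyWavelets package: decompose a frame into wavelet packet subbands, stack the coefficients, and keep the singular values as the feature vector. The synthetic signal and the omission of the k-means weighting step mean this is only an outline of the paper's method.

```python
# Wavelet packet + SVD feature extraction (sketch).
import numpy as np
import pywt

signal = np.random.randn(1024)          # stand-in for a voice frame
wp = pywt.WaveletPacket(signal, wavelet="db4", maxlevel=4)

# Stack the level-4 subband coefficients into a matrix ...
bands = np.array([node.data for node in wp.get_level(4, order="natural")])

# ... and keep its singular values as a compact, energy-ordered feature vector.
features = np.linalg.svd(bands, compute_uv=False)
print(features[:5])
```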
Integration of Evidence Base into a Probabilistic Risk Assessment
NASA Technical Reports Server (NTRS)
Saile, Lyn; Lopez, Vilma; Bickham, Grandin; Kerstman, Eric; FreiredeCarvalho, Mary; Byrne, Vicky; Butler, Douglas; Myers, Jerry; Walton, Marlei
2011-01-01
INTRODUCTION: A probabilistic decision support model such as the Integrated Medical Model (IMM) utilizes an immense amount of input data, which necessitates a systematic, integrated approach to data collection and management. As a result of this approach, the IMM is able to forecast medical events, resource utilization, and crew health during space flight. METHODS: Inflight data is the most desirable input for the Integrated Medical Model. Non-attributable inflight data is collected from the Lifetime Surveillance of Astronaut Health study as well as from engineers, flight surgeons, and the astronauts themselves. When inflight data is unavailable, cohort studies, other models, and Bayesian analyses are used, supplemented on occasion by input from subject matter experts. To determine the quality of evidence for a medical condition, the data source is categorized and assigned a level of evidence from 1 to 5, with 1 the highest. The collected data reside and are managed in a relational SQL database with a web-based interface for data entry and review. The database is also capable of interfacing with outside applications, which expands capabilities within the database itself. Via the public interface, customers can access a formatted Clinical Findings Form (CLiFF) that outlines the model input and evidence base for each medical condition. Changes to the database are tracked using a documented Configuration Management process. DISCUSSION: This strategic approach provides a comprehensive data management plan for the IMM. The IMM Database's structure and architecture have proven to support additional uses, as seen in the analysis of resource utilization across medical conditions. In addition, the IMM Database's web-based interface provides a user-friendly format for customers to browse and download the clinical information for medical conditions. It is this type of functionality that will provide Exploratory Medicine Capabilities the evidence base for their medical condition list. CONCLUSION: The IMM Database, in conjunction with the IMM, is helping the NASA aerospace program improve health care and reduce risk for astronaut crews. Both the database and the model will continue to expand to meet customer needs through a multi-disciplinary, evidence-based approach to managing data. Future expansion could serve as a platform for a Space Medicine Wiki of medical conditions.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Grabaskas, Dave; Brunett, Acacia J.; Bucknor, Matthew
GE Hitachi Nuclear Energy (GEH) and Argonne National Laboratory are currently engaged in a joint effort to modernize and develop probabilistic risk assessment (PRA) techniques for advanced non-light water reactors. At a high level, the primary outcome of this project will be the development of next-generation PRA methodologies that will enable risk-informed prioritization of safety- and reliability-focused research and development, while also identifying gaps that may be resolved through additional research. A subset of this effort is the development of a reliability database (RDB) methodology to determine applicable reliability data for inclusion in the quantification of the PRA. The RDB method developed during this project seeks to satisfy the requirements of the Data Analysis element of the ASME/ANS Non-LWR PRA standard. The RDB methodology utilizes a relevancy test to examine reliability data and determine whether it is appropriate to include as part of the reliability database for the PRA. The relevancy test compares three component properties to establish the level of similarity to components examined as part of the PRA. These properties include the component function, the component failure modes, and the environment/boundary conditions of the component. The relevancy test is used to gauge the quality of data found in a variety of sources, such as advanced reactor-specific databases, non-advanced reactor nuclear databases, and non-nuclear databases. The RDB also establishes the integration of expert judgment or separate reliability analysis with past reliability data. This paper provides details on the RDB methodology, and includes an example application of the RDB methodology for determining the reliability of the intermediate heat exchanger of a sodium fast reactor. The example explores a variety of reliability data sources, and assesses their applicability for the PRA of interest through the use of the relevancy test.
Fan, Ming; Thongsri, Tepwitoon; Axe, Lisa; Tyson, Trevor A
2005-06-01
A probabilistic approach was applied in an ecological risk assessment (ERA) to characterize risk and address uncertainty, employing Monte Carlo simulations to assess parameter and risk probability distributions. The simulation tool (ERA) includes a Windows-based interface, an interactive and modifiable database management system (DBMS) that addresses a food web at trophic levels, and a comprehensive evaluation of exposure pathways. To illustrate this model, ecological risks from depleted uranium (DU) exposure at the US Army Yuma Proving Ground (YPG) and Aberdeen Proving Ground (APG) were assessed and characterized. Probabilistic distributions showed that at YPG, a reduction in plant root weight is considered likely to occur (98% likelihood) from exposure to DU; for most terrestrial animals, the likelihood of adverse reproductive effects ranges from 0.1% to 44%. For the lesser long-nosed bat, however, the effects are expected to occur (>99% likelihood) through a reduction in the size and weight of offspring. Based on available DU data for the firing range at APG, DU uptake will not likely affect the survival of aquatic plants and animals (<0.1% likelihood). Based on field and laboratory studies conducted at APG and YPG on pocket mice, kangaroo rats, white-throated woodrats, deer, and milfoil, the body burden concentrations observed fall within the distributions simulated at both sites.
Forest Inventory and Analysis Database of the United States of America (FIA)
Andrew N. Gray; Thomas J. Brandeis; John D. Shaw; William H. McWilliams; Patrick Miles
2012-01-01
Extensive vegetation inventories established with a probabilistic design are an indispensable tool in describing distributions of species and community types and detecting changes in composition in response to climate or other drivers. The Forest Inventory and Analysis Program measures vegetation in permanent plots on forested lands across the United States of America...
Using ontology databases for scalable query answering, inconsistency detection, and data integration
Dou, Dejing
2011-01-01
An ontology database is a basic relational database management system that models an ontology plus its instances. To reason over the transitive closure of instances in the subsumption hierarchy, for example, an ontology database can either unfold views at query time or propagate assertions using triggers at load time. In this paper, we use existing benchmarks to evaluate our method—using triggers—and we demonstrate that by forward computing inferences, we not only improve query time, but the improvement appears to cost only more space (not time). However, we go on to show that the true penalties were simply opaque to the benchmark, i.e., the benchmark inadequately captures load-time costs. We have applied our methods to two case studies in biomedicine, using ontologies and data from genetics and neuroscience to illustrate two important applications: first, ontology databases answer ontology-based queries effectively; second, using triggers, ontology databases detect instance-based inconsistencies—something not possible using views. Finally, we demonstrate how to extend our methods to perform data integration across multiple, distributed ontology databases. PMID:22163378
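The trigger-based forward computation described above can be illustrated with a few lines of SQLite; the two-class hierarchy (Neuron is-a Cell) is hypothetical and stands in for a real ontology's subsumption axioms.

```python
# Minimal illustration of trigger-based forward inference in an ontology DB.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE Neuron(id TEXT);
CREATE TABLE Cell(id TEXT);
-- The subsumption axiom "Neuron is-a Cell", materialized at load time:
CREATE TRIGGER neuron_is_cell AFTER INSERT ON Neuron
BEGIN
    INSERT INTO Cell(id) VALUES (NEW.id);
END;
""")
db.execute("INSERT INTO Neuron VALUES ('n1')")
# The inference was forward-computed at load time, so the query needs no
# view unfolding:
print(db.execute("SELECT id FROM Cell").fetchall())  # [('n1',)]
```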
NASA Astrophysics Data System (ADS)
Obulesu, O.; Rama Mohan Reddy, A.; Mahendra, M.
2017-08-01
Detecting regular and efficient cyclic models is a demanding activity for data analysts because of the unstructured, dynamic, and enormous raw information produced from the web. Many existing approaches generate large candidate pattern sets in the presence of huge and complex databases. In this work, two novel algorithms are proposed and a comparative examination is performed with respect to scalability and performance. The first algorithm, EFPMA (Extended Regular Model Detection Algorithm), finds frequent sequential patterns in spatiotemporal datasets; the second, ETMA (Enhanced Tree-based Mining Algorithm), detects effective cyclic models using a symbolic database representation. EFPMA grows patterns from both ends (prefixes and suffixes) of detected patterns, which results in faster pattern growth because fewer levels of database projection are required compared to existing approaches such as PrefixSpan and SPADE. ETMA stores and manages transaction data horizontally using distinct notions such as segments, sequences, and individual symbols, and exploits a partition-and-conquer method to find maximal patterns using symbolic notations. With this algorithm, cyclic models can be mined in full-series sequential patterns, including subsection series. ETMA reduces memory consumption and makes use of efficient symbolic operations; furthermore, it records time-series instances dynamically in terms of character, series, and section approaches, respectively. Determining the extent of patterns and proving the efficiency of the reduction and retrieval techniques on synthetic and real datasets remains an open and challenging mining problem. These techniques are useful in data streams, traffic risk analysis, medical diagnosis, DNA sequence mining, and earthquake prediction applications. Extensive experimental results illustrate that the algorithms outperform the ECLAT, STNR, and MAFIA approaches in both efficiency and scalability.
NASA Astrophysics Data System (ADS)
Naseri Kouzehgarani, Asal
2009-12-01
Most models of aircraft trajectories are non-linear and stochastic in nature, and their internal parameters are often poorly defined. The ability to model, simulate and analyze realistic air traffic management conflict detection scenarios in a scalable, composable, multi-aircraft fashion is an extremely difficult endeavor. Accurate techniques for aircraft mode detection are critical in order to enable the precise projection of aircraft conflicts, and for the enactment of altitude separation resolution strategies. Conflict detection is an inherently probabilistic endeavor; our ability to detect conflicts in a timely and accurate manner over a fixed time horizon is traded off against the increased human workload created by false alarms, that is, situations that would not develop into an actual conflict, or would resolve naturally in the appropriate time horizon, thereby introducing a measure of probabilistic uncertainty into any decision aid fashioned to assist air traffic controllers. The interaction of the continuous dynamics of the aircraft, used for prediction purposes, with the discrete conflict detection logic gives rise to the hybrid nature of the overall system. The introduction of the probabilistic element, common to decision alerting and aiding devices, places the conflict detection and resolution problem in the domain of probabilistic hybrid phenomena. A hidden Markov model (HMM) has two stochastic components: a finite-state Markov chain and a finite set of output probability distributions; in other words, an unobservable (hidden) stochastic process that can only be observed through another set of stochastic processes that generate the sequence of observations. The problem of self-separation in distributed air traffic management reduces to the ability of aircraft to communicate state information to neighboring aircraft, as well as to model the evolution of aircraft trajectories between communications, in the presence of probabilistically uncertain dynamics as well as partially observable and uncertain data. We introduce the Hybrid Hidden Markov Modeling (HHMM) formalism to enable the prediction of stochastic aircraft states (and thus, potential conflicts), by combining elements of the probabilistic timed input/output automaton and the partially observable Markov decision process frameworks, along with the novel addition of a Markovian scheduler to remove the non-deterministic elements arising from the enabling of several actions simultaneously. Comparisons of aircraft in level, climbing/descending and turning flight are performed, and unknown flight track data is evaluated probabilistically against the tuned model in order to assess the effectiveness of the model in detecting the switch between multiple flight modes for a given aircraft. This also allows for the generation of a probabilistic distribution over the execution traces of the hybrid hidden Markov model, which then enables the prediction of aircraft states based on partially observable and uncertain data. Based on the composition properties of the HHMM, we study a decentralized air traffic system where aircraft move along streams and can perform cruise, accelerate, climb and turn maneuvers. We develop a common decentralized policy for conflict avoidance with spatially distributed agents (aircraft in the sky) and assure its safety properties via correctness proofs.
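The inferential core the thesis builds on can be seen in a plain HMM forward filter; the sketch below tracks a posterior over flight modes from noisy vertical-rate observations. The modes, transition matrix and likelihoods are invented and far simpler than the hybrid HHMM developed in the thesis.

```python
# Standard HMM forward filtering for flight-mode inference (sketch).
import numpy as np

modes = ["level", "climb", "turn"]
T = np.array([[0.90, 0.05, 0.05],      # mode transition probabilities
              [0.10, 0.85, 0.05],
              [0.10, 0.05, 0.85]])
pi = np.array([1.0, 0.0, 0.0])          # start in level flight

def likelihood(obs_vdot):
    """P(observed vertical rate | mode), as unnormalized Gaussians."""
    means, sig = np.array([0.0, 10.0, 0.0]), 3.0
    return np.exp(-0.5 * ((obs_vdot - means) / sig) ** 2)

belief = pi
for vdot in [0.2, 8.5, 11.0]:            # noisy vertical-rate observations
    belief = likelihood(vdot) * (T.T @ belief)
    belief /= belief.sum()               # normalized posterior over modes
print(dict(zip(modes, belief.round(3)))) # mass shifts toward "climb"
```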
Validation analysis of probabilistic models of dietary exposure to food additives.
Gilsenan, M B; Thompson, R L; Lambe, J; Gibney, M J
2003-10-01
The validity of a range of simple conceptual models designed specifically for the estimation of food additive intakes using probabilistic analysis was assessed. Modelled intake estimates that fell below traditional conservative point estimates of intake and above 'true' additive intakes (calculated from a reference database at brand level) were considered to be in a valid region. Models were developed for 10 food additives by combining food intake data, the probability of an additive being present in a food group and additive concentration data. Food intake and additive concentration data were entered as raw data or as a lognormal distribution, and the probability of an additive being present was entered based on the per cent brands or the per cent eating occasions within a food group that contained an additive. Since the three model components each assumed two possible modes of input, the validity of eight (2^3) model combinations was assessed. All model inputs were derived from the reference database. An iterative approach was employed in which the validity of individual model components was assessed first, followed by validation of full conceptual models. While the distribution of intake estimates from models fell below conservative intakes, which assume that the additive is present at maximum permitted levels (MPLs) in all foods in which it is permitted, intake estimates were not consistently above 'true' intakes. These analyses indicate the need for more complex models for the estimation of food additive intakes using probabilistic analysis. Such models should incorporate information on market share and/or brand loyalty.
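One such model combination can be simulated in a few lines: draw intake, presence and concentration independently, and read off an upper percentile of the resulting intake distribution. The distribution parameters below are invented, not values from the study.

```python
# Monte Carlo sketch of one probabilistic additive-intake model combination.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
intake_g  = rng.lognormal(mean=4.0, sigma=0.5, size=n)   # food intake (g/day)
present   = rng.random(n) < 0.30                          # additive present?
conc_mgkg = rng.lognormal(mean=2.0, sigma=0.8, size=n)    # concentration (mg/kg)

additive_mg = intake_g / 1000.0 * conc_mgkg * present     # mg/day
# Compare the upper tail against the conservative MPL-based point estimate:
print(np.percentile(additive_mg, 97.5))
```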
Learning Optimized Local Difference Binaries for Scalable Augmented Reality on Mobile Devices.
Xin Yang; Kwang-Ting Cheng
2014-06-01
The efficiency, robustness and distinctiveness of a feature descriptor are critical to the user experience and scalability of a mobile augmented reality (AR) system. However, existing descriptors are either too computationally expensive to achieve real-time performance on a mobile device such as a smartphone or tablet, or not sufficiently robust and distinctive to identify correct matches from a large database. As a result, current mobile AR systems still only have limited capabilities, which greatly restrict their deployment in practice. In this paper, we propose a highly efficient, robust and distinctive binary descriptor, called Learning-based Local Difference Binary (LLDB). LLDB directly computes a binary string for an image patch using simple intensity and gradient difference tests on pairwise grid cells within the patch. To select an optimized set of grid cell pairs, we densely sample grid cells from an image patch and then leverage a modified AdaBoost algorithm to automatically extract a small set of critical ones with the goal of maximizing the Hamming distance between mismatches while minimizing it between matches. Experimental results demonstrate that LLDB is extremely fast to compute and to match against a large database due to its high robustness and distinctiveness. Compared to the state-of-the-art binary descriptors, primarily designed for speed, LLDB has similar efficiency for descriptor construction, while achieving a greater accuracy and faster matching speed when matching over a large database with 2.3M descriptors on mobile devices.
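The descriptor construction reduces to cheap cell-mean comparisons, as the sketch below shows. It fixes the grid-cell pairs by hand and uses intensity tests only, whereas LLDB's contribution is learning an optimized pair set with a modified AdaBoost and also using gradient tests.

```python
# Simplified local-difference binary descriptor over grid cells.
import numpy as np

def ldb(patch, grid=4, pairs=((0, 5), (1, 6), (2, 7), (3, 12), (8, 13))):
    h, w = patch.shape
    # Mean intensity of each grid cell (grid x grid cells).
    cells = patch.reshape(grid, h // grid, grid, w // grid).mean(axis=(1, 3))
    means = cells.ravel()
    bits = [1 if means[i] - means[j] > 0 else 0 for i, j in pairs]
    return np.array(bits, dtype=np.uint8)

patch = np.random.rand(32, 32).astype(np.float32)
print(ldb(patch))  # binary string; matching cost is the Hamming distance
```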
A comparison of different database technologies for the CMS AsyncStageOut transfer database
NASA Astrophysics Data System (ADS)
Ciangottini, D.; Balcas, J.; Mascheroni, M.; Rupeika, E. A.; Vaandering, E.; Riahi, H.; Silva, J. M. D.; Hernandez, J. M.; Belforte, S.; Ivanov, T. T.
2017-10-01
AsyncStageOut (ASO) is the component of the CMS distributed data analysis system (CRAB) that manages users' transfers in a centrally controlled way using the File Transfer System (FTS3) at CERN. It addresses a major weakness of the previous, decentralized model, namely that the transfer of the user's output data to a single remote site was part of the job execution, resulting in inefficient use of job slots and an unacceptable failure rate. Currently ASO manages up to 600k files of various sizes per day from more than 500 users per month, spread over more than 100 sites. ASO uses a NoSQL database (CouchDB) for internal bookkeeping and as a way to communicate with other CRAB components. Since ASO/CRAB were put in production in 2014, the number of transfers has constantly increased, up to a point where the pressure on the central CouchDB instance became critical, creating new challenges for the system's scalability, performance, and monitoring. This forced a re-engineering of the ASO application to increase its scalability and lower its operational effort. In this contribution we present a comparison of the performance of the current NoSQL implementation and a new SQL implementation, and show how their different strengths and features influenced the design choices and operational experience. We also discuss other architectural changes introduced in the system to handle the increasing load and latency in delivering output to the user.
Integrating Scientific Array Processing into Standard SQL
NASA Astrophysics Data System (ADS)
Misev, Dimitar; Bachhuber, Johannes; Baumann, Peter
2014-05-01
We live in a time that is dominated by data. Data storage is cheap and more applications than ever accrue vast amounts of data. Storing the emerging multidimensional data sets efficiently, however, and allowing them to be queried by their inherent structure, is a challenge many databases have to face today. Despite the fact that multidimensional array data is almost always linked to additional, non-array information, array databases have mostly developed separately from relational systems, resulting in a disparity between the two database categories. The current SQL standard and SQL DBMSs support arrays - and, in an extension, also multidimensional arrays - but do so in a very rudimentary and inefficient way. This poster demonstrates the practicality of an SQL extension for array processing, implemented in a proof-of-concept multi-faceted system that manages a federation of array and relational database systems, providing transparent, efficient and scalable access to the heterogeneous data in them.
NASA Astrophysics Data System (ADS)
Schrodt, Franziska; Shan, Hanhuai; Fazayeli, Farideh; Karpatne, Anuj; Kattge, Jens; Banerjee, Arindam; Reichstein, Markus; Reich, Peter
2013-04-01
With the advent of remotely sensed data and coordinated efforts to create global databases, the ecological community has progressively become more data-intensive. However, in contrast to other disciplines, statistical ways of handling these large data sets, especially the gaps which are inherent to them, are lacking. Widely used theoretical approaches, for example model averaging based on Akaike's information criterion (AIC), are sensitive to missing values. Yet, the most common way of handling sparse matrices - the deletion of cases with missing data (complete case analysis) - is known to severely reduce statistical power as well as to induce biased parameter estimates. In order to address these issues, we present novel approaches to gap filling in large ecological data sets using matrix factorization techniques. Factorization-based matrix completion was developed in a recommender system context and has since been widely used to impute missing data in fields outside the ecological community. Here, we evaluate the effectiveness of probabilistic matrix factorization techniques for imputing missing data in ecological matrices using two imputation techniques. Hierarchical Probabilistic Matrix Factorization (HPMF) effectively incorporates hierarchical phylogenetic information (phylogenetic group, family, genus, species and individual plant) into the trait imputation. Advanced Hierarchical Probabilistic Matrix Factorization (aHPMF), on the other hand, includes climate and soil information in the matrix factorization by regressing the environmental variables against residuals of the HPMF. One unique opportunity opened up by aHPMF is out-of-sample prediction, where traits can be predicted for specific species at locations different from those sampled in the past. This has potentially far-reaching consequences for the study of global-scale plant functional trait patterns. We test the accuracy and effectiveness of HPMF and aHPMF in filling sparse matrices, using the TRY database of plant functional traits (http://www.try-db.org). TRY is one of the largest global compilations of plant trait databases (750 traits of 1 million plants), encompassing data on morphological, anatomical, biochemical, phenological and physiological features of plants. However, despite its unprecedented coverage, the TRY database is still very sparse, severely limiting joint trait analyses. Plant traits are the key to understanding how plants as primary producers adjust to changes in environmental conditions and in turn influence them. Forming the basis for Dynamic Global Vegetation Models (DGVMs), plant traits are also fundamental in global change studies for predicting future ecosystem changes. It is thus imperative that missing data is imputed in as accurate and precise a way as possible. In this study, we show the advantages and disadvantages of applying probabilistic matrix factorization techniques in incorporating hierarchical and environmental information for the prediction of missing plant traits as compared to conventional imputation techniques such as the complete case and mean approaches. We will discuss the implications of using gap-filled data for global-scale studies of plant functional trait - environment relationships as opposed to the above-mentioned conventional techniques, using examples of out-of-sample predictions of foliar nitrogen across several species' ranges and biomes.
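At the heart of HPMF-style imputation is ordinary probabilistic matrix factorization; the sketch below does MAP estimation by stochastic gradient descent on a synthetic sparse species-by-trait matrix. The hierarchical phylogenetic levels and environmental regression that distinguish HPMF/aHPMF are deliberately omitted.

```python
# Bare-bones probabilistic matrix factorization for a sparse trait matrix.
import numpy as np

rng = np.random.default_rng(1)
n_species, n_traits, k = 50, 8, 3
U = 0.1 * rng.standard_normal((n_species, k))   # species latent factors
V = 0.1 * rng.standard_normal((n_traits, k))    # trait latent factors

# Observed (species, trait, value) triples; most of the matrix is missing.
obs = [(rng.integers(n_species), rng.integers(n_traits),
        rng.standard_normal()) for _ in range(200)]

lr, lam = 0.02, 0.05   # learning rate; lam = Gaussian prior (L2) strength
for epoch in range(100):
    for i, j, x in obs:
        err = x - U[i] @ V[j]
        U[i] += lr * (err * V[j] - lam * U[i])
        V[j] += lr * (err * U[i] - lam * V[j])

i, j, x = obs[0]
print("imputed:", U[i] @ V[j], "observed:", x)  # any missing cell: U[i] @ V[j]
```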
CORAL Server and CORAL Server Proxy: Scalable Access to Relational Databases from CORAL Applications
DOE Office of Scientific and Technical Information (OSTI.GOV)
Valassi, A.; Bartoldus, R.
The CORAL software is widely used at CERN by the LHC experiments to access the data they store on relational databases, such as Oracle. Two new components have recently been added to implement a model involving a middle tier 'CORAL server' deployed close to the database and a tree of 'CORAL server proxies', providing data caching and multiplexing, deployed close to the client. A first implementation of the two new components, released in the summer 2009, is now deployed in the ATLAS online system to read the data needed by the High Level Trigger, allowing the configuration of a farm of several thousand processes. This paper reviews the architecture of the software, its development status and its usage in ATLAS.
NASA Astrophysics Data System (ADS)
Cervone, G.; Clemente-Harding, L.; Alessandrini, S.; Delle Monache, L.
2016-12-01
A methodology based on Artificial Neural Networks (ANN) and an Analog Ensemble (AnEn) is presented to generate 72-hour deterministic and probabilistic forecasts of power generated by photovoltaic (PV) power plants, using input from a numerical weather prediction model and computed astronomical variables. ANN and AnEn are used individually and in combination to generate forecasts for three solar power plants located in Italy. The computational scalability of the proposed solution is tested using synthetic data simulating 4,450 PV power stations. The NCAR Yellowstone supercomputer is employed to test the parallel implementation of the proposed solution, ranging from 1 node (32 cores) to 4,450 nodes (141,140 cores). Results show that a combined AnEn + ANN solution yields the best results, and that the proposed solution is well suited for massive scale computation.
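The AnEn part is easy to sketch: given the current NWP forecast, find the most similar historical forecasts and treat their matching power observations as ensemble members. The arrays and predictors here are synthetic stand-ins for the plants' history, not the study's data.

```python
# Analog Ensemble (AnEn) forecast sketch.
import numpy as np

rng = np.random.default_rng(2)
past_fcst = rng.random((5000, 3))   # history: [irradiance, temp, cloud cover]
past_obs = past_fcst[:, 0] * 0.8 + 0.05 * rng.standard_normal(5000)  # PV power

def an_en(current_fcst, n_analogs=20):
    # Rank historical forecasts by similarity to the current one ...
    dist = np.linalg.norm(past_fcst - current_fcst, axis=1)
    analogs = np.argsort(dist)[:n_analogs]
    # ... and use their observations as the probabilistic forecast.
    members = past_obs[analogs]
    return members.mean(), np.percentile(members, [10, 90])

mean, (p10, p90) = an_en(np.array([0.7, 0.5, 0.2]))
print(f"deterministic: {mean:.2f}, 10-90% band: [{p10:.2f}, {p90:.2f}]")
```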
Joint Experimentation on Scalable Parallel Processors (JESPP)
2006-04-01
made use of local embedded relational databases, implemented using sqlite on each node of an SPP to execute queries and return results via an ad hoc ...rl.af.mil 12a. DISTRIBUTION / AVAILABILITY STATEENT APPROVED FOR PUBLIC RELEASE; DISTRIBUTION UNLIMITED. 12b. DISTRIBUTION CODE 13. ABSTRACT...Experimentation Directorate (J9) required expansion of its joint semi-automated forces (JSAF) code capabilities; including number of entities, behavior complexity
Modality, probability, and mental models.
Hinterecker, Thomas; Knauff, Markus; Johnson-Laird, P N
2016-10-01
We report 3 experiments investigating novel sorts of inference, such as: A or B or both. Therefore, possibly (A and B). The contents were sensible assertions, for example: Space tourism will achieve widespread popularity in the next 50 years or advances in material science will lead to the development of antigravity materials in the next 50 years, or both. Most participants accepted the inferences as valid, though they are invalid in modal logic and in probabilistic logic too. But, the theory of mental models predicts that individuals should accept them. In contrast, inferences of this sort—A or B but not both. Therefore, A or B or both—are both logically valid and probabilistically valid. Yet, as the model theory also predicts, most reasoners rejected them. The participants' estimates of probabilities showed that their inferences tended not to be based on probabilistic validity, but that they did rate acceptable conclusions as more probable than unacceptable conclusions. We discuss the implications of the results for current theories of reasoning. PsycINFO Database Record (c) 2016 APA, all rights reserved
Development, deployment and operations of ATLAS databases
NASA Astrophysics Data System (ADS)
Vaniachine, A. V.; Schmitt, J. G. v. d.
2008-07-01
In preparation for ATLAS data taking, a coordinated shift from development towards operations has occurred in ATLAS database activities. In addition to development and commissioning activities in databases, ATLAS is active in the development and deployment (in collaboration with the WLCG 3D project) of the tools that allow the worldwide distribution and installation of databases and related datasets, as well as the actual operation of this system on ATLAS multi-grid infrastructure. We describe development and commissioning of major ATLAS database applications for online and offline. We present the first scalability test results and ramp-up schedule over the initial LHC years of operations towards the nominal year of ATLAS running, when the database storage volumes are expected to reach 6.1 TB for the Tag DB and 1.0 TB for the Conditions DB. ATLAS database applications require robust operational infrastructure for data replication between online and offline at Tier-0, and for the distribution of the offline data to Tier-1 and Tier-2 computing centers. We describe ATLAS experience with Oracle Streams and other technologies for coordinated replication of databases in the framework of the WLCG 3D services.
NASA Astrophysics Data System (ADS)
Sastry, Kumara Narasimha
2007-03-01
Effective and efficient multiscale modeling is essential to advance both the science and synthesis in a wide array of fields such as physics, chemistry, materials science, biology, biotechnology and pharmacology. This study investigates the efficacy and potential of using genetic algorithms for multiscale materials modeling and addresses some of the challenges involved in designing competent algorithms that solve hard problems quickly, reliably and accurately. In particular, this thesis demonstrates the use of genetic algorithms (GAs) and genetic programming (GP) in multiscale modeling with the help of two non-trivial case studies in materials science and chemistry. The first case study explores the utility of genetic programming (GP) in multi-timescale alloy kinetics simulations. In essence, GP is used to bridge molecular dynamics and kinetic Monte Carlo methods to span orders of magnitude in simulation time. Specifically, GP is used to symbolically regress an inline barrier function from a limited set of molecular dynamics simulations to enable kinetic Monte Carlo runs that simulate seconds of real time. Results on a non-trivial example of vacancy-assisted migration on a surface of a face-centered cubic (fcc) copper-cobalt (CuxCo1-x) alloy show that GP predicts all barriers with 0.1% error from calculations for less than 3% of active configurations, independent of the type of potentials used to obtain the learning set of barriers via molecular dynamics. The resulting method enables a 2-9 orders-of-magnitude increase in real-time dynamics simulations, taking 4-7 orders of magnitude less CPU time. The second case study presents the application of multiobjective genetic algorithms (MOGAs) in multiscale quantum chemistry simulations. Specifically, MOGAs are used to bridge high-level quantum chemistry and semiempirical methods to provide an accurate representation of complex molecular excited-state and ground-state behavior. Results on ethylene and benzene, two common building blocks in organic chemistry, indicate that MOGAs produce high-quality semiempirical methods that (1) are stable to small perturbations, (2) yield accurate configuration energies on untested and critical excited states, and (3) yield ab initio quality excited-state dynamics. The proposed method enables simulations of more complex systems to realistic, multi-picosecond timescales, well beyond previous attempts or the expectations of human experts, with a 2-3 orders-of-magnitude reduction in computational cost. While the two applications use simple evolutionary operators, in order to tackle more complex systems, their scalability and limitations have to be investigated. The second part of the thesis addresses some of the challenges involved in a successful design of genetic algorithms and genetic programming for multiscale modeling. The first issue addressed is the scalability of genetic programming, where facetwise models are built to assess the population size required by GP to ensure an adequate supply of raw building blocks and accurate decision-making between competing building blocks. This study also presents a design of competent genetic programming, where traditional fixed recombination operators are replaced by building and sampling probabilistic models of promising candidate programs.
The proposed scalable GP, called extended compact GP (eCGP), combines ideas from the extended compact genetic algorithm (eCGA) and probabilistic incremental program evolution (PIPE), and adaptively identifies, propagates and exchanges important subsolutions of a search problem. Results show that eCGP scales cubically with problem size on both GP-easy and GP-hard problems. Finally, facetwise models are developed to explore the limits of MOGA scalability, addressing the ability of multiobjective algorithms to reliably maintain Pareto-optimal solutions. The results show that even when the building blocks are accurately identified, massive multimodality of the search problems can easily overwhelm the nicher (diversity preserving operator) and lead to exponential scale-up. Facetwise models are developed which incorporate the combined effects of model accuracy, decision making, and sub-structure supply, as well as the effect of niching on the population sizing, to predict a limit on the growth rate of the maximum number of sub-structures that can compete in the two objectives without the niching method failing. The results show that if the number of competing building blocks between multiple objectives is less than the proposed limit, multiobjective GAs scale up polynomially with the problem size on boundedly-difficult problems.
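The "build and sample a probabilistic model" idea is easiest to see in the compact GA, a much simpler relative of eCGA and PIPE; the sketch below evolves a per-bit probability vector on a onemax-style toy problem. It is an illustration of the model-building principle only, not the thesis's eCGP.

```python
# Compact GA: evolve a probability vector instead of a population.
import numpy as np

rng = np.random.default_rng(3)
n, pop = 32, 50                      # problem size, virtual population size
p = np.full(n, 0.5)                  # probabilistic model over the bits

for _ in range(2000):
    a, b = (rng.random((2, n)) < p).astype(int)   # sample two candidates
    if a.sum() == b.sum():                        # onemax fitness: count ones
        continue
    winner, loser = (a, b) if a.sum() > b.sum() else (b, a)
    p += (winner - loser) / pop      # shift the model toward the winner
    p = p.clip(0.0, 1.0)

print(p.round(2))                    # converges toward the all-ones optimum
```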
Probabilistic Assessment of Cancer Risk from Solar Particle Events
NASA Technical Reports Server (NTRS)
Kim, Myung-Hee Y.; Cucinotta, Francis A.
2010-01-01
For long duration missions outside of the protection of the Earth's magnetic field, space radiation presents significant health risks including cancer mortality. Space radiation consists of solar particle events (SPEs), comprised largely of medium energy protons (less than several hundred MeV), and galactic cosmic rays (GCR), which include high energy protons and heavy ions. While the frequency distribution of SPEs depends strongly upon the phase within the solar activity cycle, the individual SPE occurrences themselves are random in nature. We estimated the probability of SPE occurrence using a non-homogeneous Poisson model to fit the historical database of proton measurements. Distributions of particle fluences of SPEs for a specified mission period were simulated, ranging from the 5th to the 95th percentile, to assess the cancer risk distribution. Spectral variability of SPEs was also examined, because the detailed energy spectra of protons are important, especially at high energy levels, for assessing the cancer risk associated with energetic particles for large events. We estimated the overall cumulative probability of the GCR environment for a specified mission period using a solar modulation model for the temporal characterization of the GCR environment represented by the deceleration potential (φ). Probabilistic assessment of fatal cancer risk was calculated for various periods of lunar and Mars missions. This probabilistic approach to risk assessment from space radiation is in support of mission design and operational planning for future manned space exploration missions. In future work, this probabilistic approach to space radiation will be combined with a probabilistic approach to the radiobiological factors that contribute to the uncertainties in projecting cancer risks.
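A toy rendering of the occurrence model: a cycle-dependent monthly rate drives a non-homogeneous Poisson count of SPEs, and each event contributes a random fluence; percentiles of the summed mission fluence then feed the risk calculation. All rates and fluence parameters below are invented, not the fitted model.

```python
# Non-homogeneous Poisson sampling of SPE fluence over a 24-month mission.
import numpy as np

rng = np.random.default_rng(4)
# Piecewise-constant monthly SPE rate: solar-active vs quiet months.
monthly_rate = np.where(np.arange(24) < 12, 1.2, 0.2)

def mission_fluence():
    n_events = rng.poisson(monthly_rate).sum()       # events over 24 months
    # Each event draws a lognormal proton fluence (>30 MeV protons/cm^2).
    return rng.lognormal(mean=19.0, sigma=1.5, size=n_events).sum()

samples = np.array([mission_fluence() for _ in range(10_000)])
print(np.percentile(samples, [5, 50, 95]))           # fluence distribution
```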
Probabilistic simulation of concurrent engineering of propulsion systems
NASA Technical Reports Server (NTRS)
Chamis, C. C.; Singhal, S. N.
1993-01-01
Technology readiness and the available infrastructure are assessed for timely computational simulation of concurrent engineering for propulsion systems. Results from initial coupled multidisciplinary, fabrication-process, and system simulators are presented, including uncertainties inherent in various facets of engineering processes. An approach is outlined for computationally formalizing the concurrent engineering process from cradle to grave via discipline-dedicated workstations linked with a common database.
Data-Based Detection of Potential Terrorist Attacks: Statistical and Graphical Methods
2010-06-01
ERIC Educational Resources Information Center
Hammonds, S. J.
1990-01-01
A technique for the numerical identification of bacteria using normalized likelihoods calculated from a probabilistic database is described, and the principles of the technique are explained. The listing of the computer program is included. Specimen results from the program, and examples of how they should be interpreted, are given. (KR)
A Four-Dimensional Probabilistic Atlas of the Human Brain
Mazziotta, John; Toga, Arthur; Evans, Alan; Fox, Peter; Lancaster, Jack; Zilles, Karl; Woods, Roger; Paus, Tomas; Simpson, Gregory; Pike, Bruce; Holmes, Colin; Collins, Louis; Thompson, Paul; MacDonald, David; Iacoboni, Marco; Schormann, Thorsten; Amunts, Katrin; Palomero-Gallagher, Nicola; Geyer, Stefan; Parsons, Larry; Narr, Katherine; Kabani, Noor; Le Goualher, Georges; Feidler, Jordan; Smith, Kenneth; Boomsma, Dorret; Pol, Hilleke Hulshoff; Cannon, Tyrone; Kawashima, Ryuta; Mazoyer, Bernard
2001-01-01
The authors describe the development of a four-dimensional atlas and reference system that includes both macroscopic and microscopic information on structure and function of the human brain in persons between the ages of 18 and 90 years. Given the presumed large but previously unquantified degree of structural and functional variance among normal persons in the human population, the basis for this atlas and reference system is probabilistic. Through the efforts of the International Consortium for Brain Mapping (ICBM), 7,000 subjects will be included in the initial phase of database and atlas development. For each subject, detailed demographic, clinical, behavioral, and imaging information is being collected. In addition, 5,800 subjects will contribute DNA for the purpose of determining genotype–phenotype–behavioral correlations. The process of developing the strategies, algorithms, data collection methods, validation approaches, database structures, and distribution of results is described in this report. Examples of applications of the approach are described for the normal brain in both adults and children as well as in patients with schizophrenia. This project should provide new insights into the relationship between microscopic and macroscopic structure and function in the human brain and should have important implications in basic neuroscience, clinical diagnostics, and cerebral disorders. PMID:11522763
High-performance metadata indexing and search in petascale data storage systems
NASA Astrophysics Data System (ADS)
Leung, A. W.; Shao, M.; Bisson, T.; Pasupathy, S.; Miller, E. L.
2008-07-01
Large-scale storage systems used for scientific applications can store petabytes of data and billions of files, making the organization and management of data in these systems a difficult, time-consuming task. The ability to search file metadata in a storage system can address this problem by allowing scientists to quickly navigate experiment data and code while allowing storage administrators to gather the information they need to properly manage the system. In this paper, we present Spyglass, a file metadata search system that achieves scalability by exploiting storage system properties, providing the scalability that existing file metadata search tools lack. In doing so, Spyglass can achieve search performance up to several thousand times faster than existing database solutions. We show that Spyglass enables important functionality that can aid data management for scientists and storage administrators.
Active in-database processing to support ambient assisted living systems.
de Morais, Wagner O; Lundström, Jens; Wickström, Nicholas
2014-08-12
As an alternative to the existing software architectures that underpin the development of smart homes and ambient assisted living (AAL) systems, this work presents a database-centric architecture that takes advantage of active databases and in-database processing. Current platforms supporting AAL systems use database management systems (DBMSs) exclusively for data storage. Active databases employ database triggers to detect and react to events taking place inside or outside of the database. DBMSs can be extended with stored procedures and functions that enable in-database processing. This means that the data processing is integrated and performed within the DBMS. The feasibility and flexibility of the proposed approach were demonstrated with the implementation of three distinct AAL services. The active database was used to detect bed-exits and to discover common room transitions and deviations during the night. In-database machine learning methods were used to model early night behaviors. Consequently, active in-database processing avoids transferring sensitive data outside the database, and this improves performance, security and privacy. Furthermore, centralizing the computation into the DBMS facilitates code reuse, adaptation and maintenance. These are important system properties that take into account the evolving heterogeneity of users, their needs and the devices that are characteristic of smart homes and AAL systems. Therefore, DBMSs can provide capabilities to address requirements for scalability, security, privacy, dependability and personalization in applications of smart environments in healthcare.
Active In-Database Processing to Support Ambient Assisted Living Systems
de Morais, Wagner O.; Lundström, Jens; Wickström, Nicholas
2014-01-01
As an alternative to the existing software architectures that underpin the development of smart homes and ambient assisted living (AAL) systems, this work presents a database-centric architecture that takes advantage of active databases and in-database processing. Current platforms supporting AAL systems use database management systems (DBMSs) exclusively for data storage. Active databases employ database triggers to detect and react to events taking place inside or outside of the database. DBMSs can be extended with stored procedures and functions that enable in-database processing. This means that the data processing is integrated and performed within the DBMS. The feasibility and flexibility of the proposed approach were demonstrated with the implementation of three distinct AAL services. The active database was used to detect bed-exits and to discover common room transitions and deviations during the night. In-database machine learning methods were used to model early night behaviors. Consequently, active in-database processing avoids transferring sensitive data outside the database, and this improves performance, security and privacy. Furthermore, centralizing the computation into the DBMS facilitates code reuse, adaptation and maintenance. These are important system properties that take into account the evolving heterogeneity of users, their needs and the devices that are characteristic of smart homes and AAL systems. Therefore, DBMSs can provide capabilities to address requirements for scalability, security, privacy, dependability and personalization in applications of smart environments in healthcare. PMID:25120164
Probabilistic Models for Solar Particle Events
NASA Technical Reports Server (NTRS)
Adams, James H., Jr.; Dietrich, W. F.; Xapsos, M. A.; Welton, A. M.
2009-01-01
Probabilistic Models of Solar Particle Events (SPEs) are used in space mission design studies to provide a description of the worst-case radiation environment that the mission must be designed to tolerate. The models determine the worst-case environment using a description of the mission and a user-specified confidence level that the provided environment will not be exceeded. This poster will focus on completing the existing suite of models by developing models for peak flux and event-integrated fluence elemental spectra for the Z>2 elements. It will also discuss methods to take into account uncertainties in the database and the uncertainties resulting from the limited number of solar particle events in the database. These new probabilistic models are based on an extensive survey of SPE measurements of peak and event-integrated elemental differential energy spectra. Attempts are made to fit the measured spectra with eight different published models. The model giving the best fit to each spectrum is chosen and used to represent that spectrum for any energy in the energy range covered by the measurements. The set of all such spectral representations for each element is then used to determine the worst-case spectrum as a function of confidence level. The spectral representation that best fits these worst-case spectra is found and its dependence on confidence level is parameterized. This procedure creates probabilistic models for the peak and event-integrated spectra.
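A minimal sketch of the confidence-level step described above, with synthetic numbers: given many simulated mission-period fluence spectra (rows = simulations, columns = energy bins), the worst-case spectrum at a given confidence level is the corresponding percentile taken bin by bin:

import numpy as np

rng = np.random.default_rng(0)
# 10,000 synthetic event-integrated fluence spectra over 8 energy bins
spectra = rng.lognormal(mean=2.0, sigma=1.0, size=(10_000, 8))
for conf in (0.50, 0.95):
    # Percentile across simulations, computed separately for each energy bin
    worst_case = np.percentile(spectra, 100 * conf, axis=0)
    print(f"{conf:.0%} confidence worst-case spectrum:", np.round(worst_case, 1))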
Probabilistic and machine learning-based retrieval approaches for biomedical dataset retrieval
Karisani, Payam; Qin, Zhaohui S; Agichtein, Eugene
2018-01-01
Abstract The bioCADDIE dataset retrieval challenge brought together different approaches to retrieval of biomedical datasets relevant to a user’s query, expressed as a text description of a needed dataset. We describe experiments in applying a data-driven, machine learning-based approach to biomedical dataset retrieval as part of this challenge. We report on a series of experiments carried out to evaluate the performance of both probabilistic and machine learning-driven techniques from information retrieval, as applied to this challenge. Our experiments with probabilistic information retrieval methods, such as query term weight optimization, automatic query expansion and simulated user relevance feedback, demonstrate that automatically boosting the weights of important keywords in a verbose query is more effective than other methods. We also show that although there is a rich space of potential representations and features available in this domain, machine learning-based re-ranking models are not able to improve on probabilistic information retrieval techniques with the currently available training data. The models and algorithms presented in this paper can serve as a viable implementation of a search engine to provide access to biomedical datasets. The retrieval performance is expected to be further improved by using additional training data that is created by expert annotation, or gathered through usage logs, clicks and other processes during natural operation of the system. Database URL: https://github.com/emory-irlab/biocaddie PMID:29688379
Adaptive predictors based on probabilistic SVM for real time disruption mitigation on JET
NASA Astrophysics Data System (ADS)
Murari, A.; Lungaroni, M.; Peluso, E.; Gaudio, P.; Vega, J.; Dormido-Canto, S.; Baruzzo, M.; Gelfusa, M.; Contributors, JET
2018-05-01
Detecting disruptions with sufficient anticipation time is essential to undertake any form of remedial strategy, mitigation or avoidance. Traditional predictors based on machine learning techniques can perform very well if properly optimised, but they do not provide a natural estimate of the quality of their outputs and they typically age very quickly. In this paper a new set of tools, based on probabilistic extensions of support vector machines (SVM), are introduced and applied for the first time to JET data. The probabilistic output constitutes a natural qualification of the prediction quality and provides additional flexibility. An adaptive training strategy ‘from scratch’ has also been devised, which allows preserving the performance even when the experimental conditions change significantly. Large JET databases of disruptions, covering entire campaigns and thousands of discharges, have been analysed, both for the graphite wall and for the ITER-Like Wall. Performance significantly better than that of any previous predictor using adaptive training has been achieved, satisfying even the requirements of the next generation of devices. The adaptive approach to the training has also provided unique information about the evolution of the operational space. The fact that the developed tools give the probability of disruption improves the interpretability of the results, provides an estimate of the predictor quality and gives new insights into the physics. Moreover, the probabilistic treatment makes it easier to insert these classifiers into general decision support and control systems.
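The probabilistic SVM output described above can be imitated with Platt-style probability calibration; this scikit-learn sketch on synthetic data shows the idea only (the JET feature set, adaptive retraining, and the authors' specific probabilistic extension are not reproduced):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=10, random_state=1)
clf = SVC(probability=True, random_state=1).fit(X, y)  # Platt scaling internally
p_disrupt = clf.predict_proba(X[:3])[:, 1]  # probability of the positive class
print(np.round(p_disrupt, 3))  # a graded disruption probability, not just a label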
Praveen, Paurush; Fröhlich, Holger
2013-01-01
Inferring regulatory networks from experimental data via probabilistic graphical models is a popular framework to gain insights into biological systems. However, the inherent noise in experimental data coupled with a limited sample size reduces the performance of network reverse engineering. Prior knowledge from existing sources of biological information can address this low signal to noise problem by biasing the network inference towards biologically plausible network structures. Although integrating various sources of information is desirable, their heterogeneous nature makes this task challenging. We propose two computational methods to incorporate various information sources into a probabilistic consensus structure prior to be used in graphical model inference. Our first model, called Latent Factor Model (LFM), assumes a high degree of correlation among external information sources and reconstructs a hidden variable as a common source in a Bayesian manner. The second model, a Noisy-OR, picks up the strongest support for an interaction among information sources in a probabilistic fashion. Our extensive computational studies on KEGG signaling pathways as well as on gene expression data from breast cancer and yeast heat shock response reveal that both approaches can significantly enhance the reconstruction accuracy of Bayesian Networks compared to other competing methods as well as to the situation without any prior. Our framework allows for using diverse information sources, like pathway databases, GO terms and protein domain data, etc. and is flexible enough to integrate new sources, if available.
AsyncStageOut: Distributed user data management for CMS Analysis
NASA Astrophysics Data System (ADS)
Riahi, H.; Wildish, T.; Ciangottini, D.; Hernández, J. M.; Andreeva, J.; Balcas, J.; Karavakis, E.; Mascheroni, M.; Tanasijczuk, A. J.; Vaandering, E. W.
2015-12-01
AsyncStageOut (ASO) is a new component of the distributed data analysis system of CMS, CRAB, designed for managing users' data. It addresses a major weakness of the previous model, namely that mass storage of output data was part of the job execution resulting in inefficient use of job slots and an unacceptable failure rate at the end of the jobs. ASO foresees the management of up to 400k files per day of various sizes, spread worldwide across more than 60 sites. It must handle up to 1000 individual users per month, and work with minimal delay. This creates challenging requirements for system scalability, performance and monitoring. ASO uses FTS to schedule and execute the transfers between the storage elements of the source and destination sites. It has evolved from a limited prototype to a highly adaptable service, which manages and monitors the user file placement and bookkeeping. To ensure system scalability and data monitoring, it employs new technologies such as a NoSQL database and re-uses existing components of PhEDEx and the FTS Dashboard. We present the asynchronous stage-out strategy and the architecture of the solution we implemented to deal with those issues and challenges. The deployment model for the high availability and scalability of the service is discussed. The performance of the system during the commissioning and the first phase of production are also shown, along with results from simulations designed to explore the limits of scalability.
AsyncStageOut: Distributed User Data Management for CMS Analysis
DOE Office of Scientific and Technical Information (OSTI.GOV)
Riahi, H.; Wildish, T.; Ciangottini, D.
2015-12-23
AsyncStageOut (ASO) is a new component of the distributed data analysis system of CMS, CRAB, designed for managing users' data. It addresses a major weakness of the previous model, namely that mass storage of output data was part of the job execution resulting in inefficient use of job slots and an unacceptable failure rate at the end of the jobs. ASO foresees the management of up to 400k files per day of various sizes, spread worldwide across more than 60 sites. It must handle up to 1000 individual users per month, and work with minimal delay. This creates challenging requirements for system scalability, performance and monitoring. ASO uses FTS to schedule and execute the transfers between the storage elements of the source and destination sites. It has evolved from a limited prototype to a highly adaptable service, which manages and monitors the user file placement and bookkeeping. To ensure system scalability and data monitoring, it employs new technologies such as a NoSQL database and re-uses existing components of PhEDEx and the FTS Dashboard. We present the asynchronous stage-out strategy and the architecture of the solution we implemented to deal with those issues and challenges. The deployment model for the high availability and scalability of the service is discussed. The performance of the system during the commissioning and the first phase of production are also shown, along with results from simulations designed to explore the limits of scalability.
Daniel J. Isaak; Jay M. Ver Hoef; Erin E. Peterson; Dona L. Horan; David E. Nagel
2017-01-01
Population size estimates for stream fishes are important for conservation and management, but sampling costs limit the extent of most estimates to small portions of river networks that encompass 100s–10,000s of linear kilometres. However, the advent of large fish density data sets, spatial-stream-network (SSN) models that benefit from nonindependence among samples,...
Fu, H C; Xu, Y Y; Chang, H Y
1999-12-01
Recognition of similar (confusion) characters is a difficult problem in optical character recognition (OCR). In this paper, we introduce a neural network solution that is capable of modeling minor differences among similar characters, and is robust to various personal handwriting styles. The Self-growing Probabilistic Decision-based Neural Network (SPDNN) is a probabilistic-type neural network, which adopts a hierarchical network structure with nonlinear basis functions and a competitive credit-assignment scheme. Based on the SPDNN model, we have constructed a three-stage recognition system. First, a coarse classifier determines a character to be input to one of the pre-defined subclasses partitioned from a large character set, such as Chinese mixed with alphanumerics. Then a character recognizer determines the input image which best matches the reference character in the subclass. Lastly, the third module is a similar character recognizer, which can further enhance the recognition accuracy among similar or confusing characters. The prototype system has demonstrated a successful application of SPDNN to similar handwritten Chinese recognition for the public database CCL/HCCR1 (5401 characters × 200 samples). Regarding performance, experiments on the CCL/HCCR1 database produced 90.12% recognition accuracy with no rejection, and 94.11% accuracy with 6.7% rejection. This recognition accuracy represents about a 4% improvement on the previously announced performance. As to processing speed, processing before recognition (including image preprocessing, segmentation, and feature extraction) requires about one second for an A4-size character image, and recognition consumes approximately 0.27 second per character on a Pentium-100 based personal computer, without use of any hardware accelerator or co-processor.
Rothschild, Adam S.; Lehmann, Harold P.
2005-01-01
Objective: The aim of this study was to preliminarily determine the feasibility of probabilistically generating problem-specific computerized provider order entry (CPOE) pick-lists from a database of explicitly linked orders and problems from actual clinical cases. Design: In a pilot retrospective validation, physicians reviewed internal medicine cases consisting of the admission history and physical examination and orders placed using CPOE during the first 24 hours after admission. They created coded problem lists and linked orders from individual cases to the problem for which they were most indicated. Problem-specific order pick-lists were generated by including a given order in a pick-list if the probability of linkage of order and problem (PLOP) equaled or exceeded a specified threshold. PLOP for a given linked order-problem pair was computed as its prevalence among the other cases in the experiment with the given problem. The orders that the reviewer linked to a given problem instance served as the reference standard to evaluate its system-generated pick-list. Measurements: Recall, precision, and length of the pick-lists. Results: Average recall reached a maximum of .67 with a precision of .17 and pick-list length of 31.22 at a PLOP threshold of 0. Average precision reached a maximum of .73 with a recall of .09 and pick-list length of .42 at a PLOP threshold of .9. Recall varied inversely with precision in classic information retrieval behavior. Conclusion: We preliminarily conclude that it is feasible to generate problem-specific CPOE pick-lists probabilistically from a database of explicitly linked orders and problems. Further research is necessary to determine the usefulness of this approach in real-world settings. PMID:15684134
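A toy sketch of the pick-list rule as described (all clinical data below are invented): an order enters a problem's pick-list when its PLOP, computed as its prevalence among the other cases with that problem, meets the threshold:

from collections import Counter

# case id -> (problem, orders linked to that problem); invented examples
cases = {
    1: ("pneumonia", {"chest xray", "blood cultures", "ceftriaxone"}),
    2: ("pneumonia", {"chest xray", "ceftriaxone"}),
    3: ("pneumonia", {"chest xray", "sputum culture"}),
}

def picklist(problem, exclude_case, threshold):
    # PLOP for an order = prevalence among the *other* cases with this problem
    others = [orders for c, (p, orders) in cases.items()
              if p == problem and c != exclude_case]
    counts = Counter(o for orders in others for o in orders)
    return {o for o, n in counts.items() if n / len(others) >= threshold}

print(picklist("pneumonia", exclude_case=1, threshold=0.6))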
Probabilistic techniques for obtaining accurate patient counts in Clinical Data Warehouses
Myers, Risa B.; Herskovic, Jorge R.
2011-01-01
Proposal and execution of clinical trials, computation of quality measures and discovery of correlations between medical phenomena are all applications where an accurate count of patients is needed. However, existing sources of this type of patient information, including Clinical Data Warehouses (CDWs), may be incomplete or inaccurate. This research explores applying probabilistic techniques, supported by the MayBMS probabilistic database, to obtain accurate patient counts from a clinical data warehouse containing synthetic patient data. We present a synthetic clinical data warehouse and populate it with simulated data using a custom patient data generation engine. We then implement, evaluate and compare different techniques for obtaining patient counts. We model billing as a test for the presence of a condition. We compute billing’s sensitivity and specificity both by conducting a “Simulated Expert Review” where a representative sample of records is reviewed and labeled by experts, and by obtaining the ground truth for every record. We compute the posterior probability of a patient having a condition through a “Bayesian Chain”, using Bayes’ Theorem to calculate the probability of a patient having a condition after each visit. The second method is a “one-shot” approach that computes the probability of a patient having a condition based on whether the patient is ever billed for the condition. Our results demonstrate the utility of probabilistic approaches, which improve on the accuracy of raw counts. In particular, the simulated review paired with a single application of Bayes’ Theorem produces the best results, with an average error rate of 2.1% compared to 43.7% for the straightforward billing counts. Overall, this research demonstrates that Bayesian probabilistic approaches improve patient counts on simulated patient populations. We believe that total patient counts based on billing data are one of the many possible applications of our Bayesian framework. Use of these probabilistic techniques will enable more accurate patient counts and better results for applications requiring this metric. PMID:21986292
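A minimal sketch of the "Bayesian Chain" idea: treat billing as a diagnostic test with known sensitivity and specificity and update the probability of the condition after each visit (all numbers are invented):

def bayes_update(prior, billed, sens=0.80, spec=0.95):
    # Posterior P(condition | billing outcome) via Bayes' Theorem
    if billed:
        return sens * prior / (sens * prior + (1 - spec) * (1 - prior))
    return (1 - sens) * prior / ((1 - sens) * prior + spec * (1 - prior))

p = 0.10  # prior prevalence of the condition (invented)
for billed in (True, False, True):  # billing outcomes over three visits
    p = bayes_update(p, billed)
    print(round(p, 3))

A patient's final posterior can then be thresholded, or posteriors can be summed across patients to obtain an expected patient count.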
Paixão, Enny S; Harron, Katie; Andrade, Kleydson; Teixeira, Maria Glória; Fiaccone, Rosemeire L; Costa, Maria da Conceição N; Rodrigues, Laura C
2017-07-17
Due to the increasing availability of individual-level information across different electronic datasets, record linkage has become an efficient and important research tool. High-quality linkage is essential for producing robust results. The objective of this study was to describe the process of preparing and linking national Brazilian datasets, and to compare the accuracy of different linkage methods for assessing the risk of stillbirth due to dengue in pregnancy. We linked mothers and stillbirths in two routinely collected datasets from Brazil for 2009-2010: for dengue in pregnancy, notifications of infectious diseases (SINAN); for stillbirths, mortality (SIM). Since there was no unique identifier, we used probabilistic linkage based on maternal name, age and municipality. We compared two probabilistic approaches, each with two thresholds: 1) a bespoke linkage algorithm; 2) standard linkage software widely used in Brazil (ReclinkIII), and used manual review to identify further links. Sensitivity and positive predictive value (PPV) were estimated using a subset of gold-standard data created through manual review. We examined the characteristics of false-matches and missed-matches to identify any sources of bias. From records of 678,999 dengue cases and 62,373 stillbirths, the gold-standard linkage identified 191 cases. The bespoke linkage algorithm with a conservative threshold produced 131 links, with sensitivity = 64.4% (68 missed-matches) and PPV = 92.5% (8 false-matches). Manual review of uncertain links identified an additional 37 links, increasing sensitivity to 83.7%. The bespoke algorithm with a relaxed threshold identified 132 true matches (sensitivity = 69.1%), but introduced 61 false-matches (PPV = 68.4%). ReclinkIII produced lower sensitivity and PPV than the bespoke linkage algorithm. Linkage error was not associated with any recorded study variables. Despite the lack of a unique identifier for linking mothers and stillbirths, we demonstrate a high standard of linkage of large routine databases from a middle-income country. Probabilistic linkage and manual review were essential for accurately identifying cases for a case-control study, but this approach may not be feasible for larger databases or for linkage of more common outcomes.
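A simplified sketch of scoring candidate links on the three fields used here (maternal name, age, municipality); the fuzzy comparison, weights and threshold are invented for illustration and are not the study's bespoke algorithm:

from difflib import SequenceMatcher

def link_score(a, b):
    # Weighted agreement: fuzzy name match, age within 1 year, same municipality
    name_sim = SequenceMatcher(None, a["name"], b["name"]).ratio()
    age_ok = abs(a["age"] - b["age"]) <= 1
    muni_ok = a["municipality"] == b["municipality"]
    return 0.6 * name_sim + 0.2 * age_ok + 0.2 * muni_ok

rec_sinan = {"name": "maria da silva", "age": 27, "municipality": "salvador"}
rec_sim = {"name": "maria d silva", "age": 27, "municipality": "salvador"}
score = link_score(rec_sinan, rec_sim)
print(round(score, 2), "accept" if score >= 0.85 else "send to manual review")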
NASA Astrophysics Data System (ADS)
Appel, Marius; Lahn, Florian; Pebesma, Edzer; Buytaert, Wouter; Moulds, Simon
2016-04-01
Today's amount of freely available data requires scientists to spend large parts of their work on data management. This is especially true in environmental sciences when working with large remote sensing datasets, such as obtained from earth-observation satellites like the Sentinel fleet. Many frameworks like SpatialHadoop or Apache Spark address the scalability but target programmers rather than data analysts, and are not dedicated to imagery or array data. In this work, we use the open-source data management and analytics system SciDB to bring large earth-observation datasets closer to analysts. Its underlying data representation as multidimensional arrays fits naturally to earth-observation datasets, distributes storage and computational load over multiple instances by multidimensional chunking, and also enables efficient time-series-based analyses, which are usually difficult using file- or tile-based approaches. Existing interfaces to R and Python furthermore allow for scalable analytics with relatively little learning effort. However, interfacing SciDB and file-based earth-observation datasets that come as tiled temporal snapshots requires a lot of manual bookkeeping during ingestion, and SciDB natively only supports loading data from CSV-like and custom binary formatted files, which currently limits its practical use in earth-observation analytics. To make it easier to work with large multi-temporal datasets in SciDB, we developed software tools that enrich SciDB with earth observation metadata and allow working with commonly used file formats: (i) the SciDB extension library scidb4geo simplifies working with spatiotemporal arrays by adding relevant metadata to the database and (ii) the Geospatial Data Abstraction Library (GDAL) driver implementation scidb4gdal allows ingesting and exporting remote sensing imagery from and to a large number of file formats. Using added metadata on temporal resolution and coverage, the GDAL driver supports time-based ingestion of imagery to existing multi-temporal SciDB arrays. While our SciDB plugin works directly in the database, the GDAL driver has been specifically developed using a minimum amount of external dependencies (i.e. cURL). Source code for both tools is available from GitHub [1]. We present these tools in a case study that demonstrates the ingestion of multi-temporal tiled earth-observation data to SciDB, followed by a time-series analysis using R and SciDBR. Through the exclusive use of open-source software, our approach supports reproducibility in scalable large-scale earth-observation analytics. In the future, these tools can be used in an automated way to let scientists only work on ready-to-use SciDB arrays to significantly reduce the data management workload for domain scientists. [1] https://github.com/mappl/scidb4geo and https://github.com/mappl/scidb4gdal
Papageorgiou, Eirini; Nieuwenhuys, Angela; Desloovere, Kaat
2017-01-01
Background This study aimed to improve the automatic probabilistic classification of joint motion gait patterns in children with cerebral palsy by using the expert knowledge available via a recently developed Delphi-consensus study. To this end, this study applied both Naïve Bayes and Logistic Regression classification with varying degrees of usage of the expert knowledge (expert-defined and discretized features). A database of 356 patients and 1719 gait trials was used to validate the classification performance of eleven joint motions. Hypotheses Two main hypotheses stated that: (1) Joint motion patterns in children with CP, obtained through a Delphi-consensus study, can be automatically classified following a probabilistic approach, with an accuracy similar to clinical expert classification, and (2) The inclusion of clinical expert knowledge in the selection of relevant gait features and the discretization of continuous features increases the performance of automatic probabilistic joint motion classification. Findings This study provided objective evidence supporting the first hypothesis. Automatic probabilistic gait classification using the expert knowledge available from the Delphi-consensus study resulted in accuracy (91%) similar to that obtained with two expert raters (90%), and higher accuracy than that obtained with non-expert raters (78%). Regarding the second hypothesis, this study demonstrated that the use of more advanced machine learning techniques such as automatic feature selection and discretization instead of expert-defined and discretized features can result in slightly higher joint motion classification performance. However, the increase in performance is limited and does not outweigh the additional computational cost and the higher risk of loss of clinical interpretability, which threatens the clinical acceptance and applicability. PMID:28570616
Kang, Dongwan D.; Froula, Jeff; Egan, Rob; ...
2015-01-01
Grouping large genomic fragments assembled from shotgun metagenomic sequences to deconvolute complex microbial communities, or metagenome binning, enables the study of individual organisms and their interactions. Because of the complex nature of these communities, existing metagenome binning methods often miss a large number of microbial species. In addition, most of the tools are not scalable to large datasets. Here we introduce automated software called MetaBAT that integrates empirical probabilistic distances of genome abundance and tetranucleotide frequency for accurate metagenome binning. MetaBAT outperforms alternative methods in accuracy and computational efficiency on both synthetic and real metagenome datasets. Lastly, it automatically forms hundreds of high-quality genome bins on a very large assembly consisting of millions of contigs in a matter of hours on a single node. MetaBAT is open-source software and available at https://bitbucket.org/berkeleylab/metabat.
WORDGRAPH: Keyword-in-Context Visualization for NETSPEAK's Wildcard Search.
Riehmann, Patrick; Gruendl, Henning; Potthast, Martin; Trenkmann, Martin; Stein, Benno; Froehlich, Benno
2012-09-01
The WORDGRAPH helps writers in visually choosing phrases while writing a text. It checks for the commonness of phrases and allows for the retrieval of alternatives by means of wildcard queries. To support such queries, we implement a scalable retrieval engine, which returns high-quality results within milliseconds using a probabilistic retrieval strategy. The results are displayed as WORDGRAPH visualization or as a textual list. The graphical interface provides an effective means for interactive exploration of search results using filter techniques, query expansion, and navigation. Our observations indicate that, of three investigated retrieval tasks, the textual interface is sufficient for the phrase verification task, whereas both interfaces support context-sensitive word choice, and the WORDGRAPH best supports the exploration of a phrase's context or the underlying corpus. Our user study confirms these observations and shows that WORDGRAPH is generally the preferred interface over the textual result list for queries containing multiple wildcards.
Atom-by-atom assembly of defect-free one-dimensional cold atom arrays.
Endres, Manuel; Bernien, Hannes; Keesling, Alexander; Levine, Harry; Anschuetz, Eric R; Krajenbrink, Alexandre; Senko, Crystal; Vuletic, Vladan; Greiner, Markus; Lukin, Mikhail D
2016-11-25
The realization of large-scale fully controllable quantum systems is an exciting frontier in modern physical science. We use atom-by-atom assembly to implement a platform for the deterministic preparation of regular one-dimensional arrays of individually controlled cold atoms. In our approach, a measurement and feedback procedure eliminates the entropy associated with probabilistic trap occupation and results in defect-free arrays of more than 50 atoms in less than 400 milliseconds. The technique is based on fast, real-time control of 100 optical tweezers, which we use to arrange atoms in desired geometric patterns and to maintain these configurations by replacing lost atoms with surplus atoms from a reservoir. This bottom-up approach may enable controlled engineering of scalable many-body systems for quantum information processing, quantum simulations, and precision measurements. Copyright © 2016, American Association for the Advancement of Science.
Gao, Xiang; Lin, Huaiying; Revanna, Kashi; Dong, Qunfeng
2017-05-10
Species-level classification for 16S rRNA gene sequences remains a serious challenge for microbiome researchers, because existing taxonomic classification tools for 16S rRNA gene sequences either do not provide species-level classification, or their classification results are unreliable. The unreliable results are due to limitations in the existing methods, which either lack solid probability-based criteria to evaluate the confidence of their taxonomic assignments, or use nucleotide k-mer frequency as the proxy for sequence similarity measurement. We have developed a method that shows significantly improved species-level classification results over existing methods. Our method calculates true sequence similarity between query sequences and database hits using pairwise sequence alignment. Taxonomic classifications are assigned from the species to the phylum levels based on the lowest common ancestors of multiple database hits for each query sequence, and classification reliabilities are further evaluated by bootstrap confidence scores. The novelty of our method is that the contribution of each database hit to the taxonomic assignment of the query sequence is weighted by a Bayesian posterior probability based upon the degree of sequence similarity of the database hit to the query sequence. Our method does not need any training datasets specific for different taxonomic groups. Instead, only a reference database is required for aligning to the query sequences, making our method easily applicable for different regions of the 16S rRNA gene or other phylogenetic marker genes. Reliable species-level classification for 16S rRNA or other phylogenetic marker genes is critical for microbiome research. Our software shows significantly higher classification accuracy than the existing tools, and we provide probability-based confidence scores to evaluate the reliability of our taxonomic classification assignments based on multiple database matches to query sequences. Despite its higher computational costs, our method is still suitable for analyzing large-scale microbiome datasets for practical purposes. Furthermore, our method can be applied for taxonomic classification of any phylogenetic marker gene sequences. Our software, called BLCA, is freely available at https://github.com/qunfengdong/BLCA.
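A schematic sketch of the similarity-weighted voting idea described for BLCA (not the published implementation; lineages and similarities are invented): each database hit supports its lineage in proportion to its alignment similarity to the query, and support is evaluated rank by rank:

from collections import defaultdict

# (alignment similarity to query, lineage) for each database hit; invented
hits = [
    (0.99, ("Firmicutes", "Bacilli", "Lactobacillus", "L. gasseri")),
    (0.97, ("Firmicutes", "Bacilli", "Lactobacillus", "L. johnsonii")),
    (0.88, ("Firmicutes", "Bacilli", "Streptococcus", "S. oralis")),
]

for level, rank in enumerate(("phylum", "class", "genus", "species")):
    votes = defaultdict(float)
    for sim, lineage in hits:
        votes[lineage[level]] += sim  # similarity-weighted vote
    taxon, support = max(votes.items(), key=lambda kv: kv[1])
    confidence = support / sum(votes.values())
    print(f"{rank}: {taxon} (support {confidence:.2f})")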
PASS2: an automated database of protein alignments organised as structural superfamilies.
Bhaduri, Anirban; Pugalenthi, Ganesan; Sowdhamini, Ramanathan
2004-04-02
The functional selection and three-dimensional structural constraints of proteins in nature often relate to the retention of significant sequence similarity between proteins of similar fold and function despite poor sequence identity. Organization of structure-based sequence alignments for distantly related proteins provides a map of the conserved and critical regions of the protein universe that is useful for the analysis of folding principles, for the evolutionary unification of protein families and for maximizing the information return from experimental structure determination. The Protein Alignment organised as Structural Superfamily (PASS2) database represents continuously updated, structural alignments for evolutionarily related, sequentially distant proteins. An automated and updated version of PASS2 is in direct correspondence with SCOP 1.63, consisting of sequences having identity below 40% among themselves. Protein domains have been grouped into 628 multi-member superfamilies and 566 single-member superfamilies. Structure-based sequence alignments for the superfamilies have been obtained using COMPARER, while initial equivalencies have been derived from a preliminary superposition using LSQMAN or STAMP 4.0. The final sequence alignments have been annotated for structural features using JOY4.0. The database is supplemented with sequence relatives belonging to different genomes, conserved spatially interacting and structural motifs, probabilistic hidden Markov models of superfamilies based on the alignments and useful links to other databases. Probabilistic models and sensitive position-specific profiles obtained from reliable superfamily alignments aid annotation of remote homologues and are useful tools in structural and functional genomics. PASS2 presents the phylogeny of its members based on both sequence and structural dissimilarities. Clustering of members allows us to understand the diversification of the family members. The search engine has been improved for simpler browsing of the database. The database resolves alignments among the structural domains consisting of an evolutionarily diverged set of sequences. The availability of reliable sequence alignments of distantly related proteins despite poor sequence identity, together with single-member superfamilies, permits better sampling of structures in libraries for fold recognition of new sequences and for the understanding of protein structure-function relationships of individual superfamilies. PASS2 is accessible at http://www.ncbs.res.in/~faculty/mini/campass/pass2.html
Development of noSQL data storage for the ATLAS PanDA Monitoring System
NASA Astrophysics Data System (ADS)
Potekhin, M.; ATLAS Collaboration
2012-06-01
For several years the PanDA Workload Management System has been the basis for distributed production and analysis for the ATLAS experiment at the LHC. Since the start of data taking PanDA usage has ramped up steadily, typically exceeding 500k completed jobs/day by June 2011. The associated monitoring data volume has been rising as well, to levels that present a new set of challenges in the areas of database scalability and monitoring system performance and efficiency. These challenges are being met with a R&D effort aimed at implementing a scalable and efficient monitoring data storage based on a noSQL solution (Cassandra). We present our motivations for using this technology, as well as data design and the techniques used for efficient indexing of the data. We also discuss the hardware requirements as they were determined by testing with actual data and realistic loads.
The BioIntelligence Framework: a new computational platform for biomedical knowledge computing.
Farley, Toni; Kiefer, Jeff; Lee, Preston; Von Hoff, Daniel; Trent, Jeffrey M; Colbourn, Charles; Mousses, Spyro
2013-01-01
Breakthroughs in molecular profiling technologies are enabling a new data-intensive approach to biomedical research, with the potential to revolutionize how we study, manage, and treat complex diseases. The next great challenge for clinical applications of these innovations will be to create scalable computational solutions for intelligently linking complex biomedical patient data to clinically actionable knowledge. Traditional database management systems (DBMS) are not well suited to representing complex syntactic and semantic relationships in unstructured biomedical information, introducing barriers to realizing such solutions. We propose a scalable computational framework for addressing this need, which leverages a hypergraph-based data model and query language that may be better suited for representing complex multi-lateral, multi-scalar, and multi-dimensional relationships. We also discuss how this framework can be used to create rapid learning knowledge base systems to intelligently capture and relate complex patient data to biomedical knowledge in order to automate the recovery of clinically actionable information.
Relax with CouchDB - Into the non-relational DBMS era of Bioinformatics
Manyam, Ganiraju; Payton, Michelle A.; Roth, Jack A.; Abruzzo, Lynne V.; Coombes, Kevin R.
2012-01-01
With the proliferation of high-throughput technologies, genome-level data analysis has become common in molecular biology. Bioinformaticians are developing extensive resources to annotate and mine biological features from high-throughput data. The underlying database management systems for most bioinformatics software are based on a relational model. Modern non-relational databases offer an alternative that has flexibility, scalability, and a non-rigid design schema. Moreover, with an accelerated development pace, non-relational databases like CouchDB can be ideal tools to construct bioinformatics utilities. We describe CouchDB by presenting three new bioinformatics resources: (a) geneSmash, which collates data from bioinformatics resources and provides automated gene-centric annotations, (b) drugBase, a database of drug-target interactions with a web interface powered by geneSmash, and (c) HapMap-CN, which provides a web interface to query copy number variations from three SNP-chip HapMap datasets. In addition to the web sites, all three systems can be accessed programmatically via web services. PMID:22609849
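A minimal sketch of the document-oriented access pattern CouchDB offers, using its plain HTTP/JSON API through the standard library; it assumes a local CouchDB listening on the default port with open or pre-configured credentials, and the database and document below are invented, not the resources described in the paper:

import json
import urllib.request

def couch(method, path, doc=None):
    # CouchDB speaks plain HTTP: PUT creates databases/documents, GET reads them
    req = urllib.request.Request(
        "http://127.0.0.1:5984" + path,
        data=json.dumps(doc).encode() if doc is not None else None,
        method=method, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

couch("PUT", "/genes")  # create a database
couch("PUT", "/genes/TP53", {"symbol": "TP53", "aliases": ["p53"]})  # schemaless doc
print(couch("GET", "/genes/TP53")["aliases"])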
Experience with ATLAS MySQL PanDA database service
NASA Astrophysics Data System (ADS)
Smirnov, Y.; Wlodek, T.; De, K.; Hover, J.; Ozturk, N.; Smith, J.; Wenaus, T.; Yu, D.
2010-04-01
The PanDA distributed production and analysis system has been in production use for ATLAS data processing and analysis since late 2005 in the US, and globally throughout ATLAS since early 2008. Its core architecture is based on a set of stateless web services served by Apache and backed by a suite of MySQL databases that are the repository for all PanDA information: active and archival job queues, dataset and file catalogs, site configuration information, monitoring information, system control parameters, and so on. This database system is one of the most critical components of PanDA, and has successfully delivered the functional and scaling performance required by PanDA, currently operating at a scale of half a million jobs per week, with much growth still to come. In this paper we describe the design and implementation of the PanDA database system, its architecture of MySQL servers deployed at BNL and CERN, backup strategy and monitoring tools. The system has been developed, thoroughly tested, and brought to production to provide highly reliable, scalable, flexible and available database services for ATLAS Monte Carlo production, reconstruction and physics analysis.
Probabilistic assessment method of the non-monotonic dose-responses-Part I: Methodological approach.
Chevillotte, Grégoire; Bernard, Audrey; Varret, Clémence; Ballet, Pascal; Bodin, Laurent; Roudot, Alain-Claude
2017-08-01
More and more studies aim to characterize non-monotonic dose-response curves (NMDRCs). The greatest difficulty is to assess the statistical plausibility of NMDRCs from previously conducted dose-response studies. This difficulty is linked to the fact that these studies present (i) few doses tested, (ii) a low sample size per dose, and (iii) the absence of any raw data. In this study, we propose a new methodological approach to probabilistically characterize NMDRCs. The methodology is composed of three main steps: (i) sampling from summary data to cover all the possibilities that may be presented by the responses measured by dose and to obtain a new raw database, (ii) statistical analysis of each sampled dose-response curve to characterize the slopes and their signs, and (iii) characterization of these dose-response curves according to the variation of the sign in the slope. This method can characterize all types of dose-response curves and can be applied both to continuous data and to discrete data. The aim of this study is to present the general principle of this probabilistic method for assessing non-monotonic dose-response curves, and to present some results. Copyright © 2017 Elsevier Ltd. All rights reserved.
Communicating weather forecast uncertainty: Do individual differences matter?
Grounds, Margaret A; Joslyn, Susan L
2018-03-01
Research suggests that people make better weather-related decisions when they are given numeric probabilities for critical outcomes (Joslyn & Leclerc, 2012, 2013). However, it is unclear whether all users can take advantage of probabilistic forecasts to the same extent. The research reported here assessed key cognitive and demographic factors to determine their relationship to the use of probabilistic forecasts to improve decision quality. In two studies, participants decided between spending resources to prevent icy conditions on roadways or risk a larger penalty when freezing temperatures occurred. Several forecast formats were tested, including a control condition with the night-time low temperature alone and experimental conditions that also included the probability of freezing and advice based on expected value. All but those with extremely low numeracy scores made better decisions with probabilistic forecasts. Importantly, no groups made worse decisions when probabilities were included. Moreover, numeracy was the best predictor of decision quality, regardless of forecast format, suggesting that the advantage may extend beyond understanding the forecast to general decision strategy issues. This research adds to a growing body of evidence that numerical uncertainty estimates may be an effective way to communicate weather danger to general public end users. (PsycINFO Database Record (c) 2018 APA, all rights reserved).
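The expected-value advice tested in the study reduces to a one-line comparison: treat the roads whenever the probability of freezing times the penalty for leaving them untreated exceeds the cost of treatment. A worked sketch with invented costs:

def should_treat(p_freeze, treat_cost, penalty):
    # Treat iff the expected loss from not treating exceeds the treatment cost
    return p_freeze * penalty > treat_cost

print(should_treat(p_freeze=0.30, treat_cost=1000, penalty=6000))  # True: 1800 > 1000
print(should_treat(p_freeze=0.10, treat_cost=1000, penalty=6000))  # False: 600 < 1000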
Probabilistic TSUnami Hazard MAPS for the NEAM Region: The TSUMAPS-NEAM Project
NASA Astrophysics Data System (ADS)
Basili, R.; Babeyko, A. Y.; Baptista, M. A.; Ben Abdallah, S.; Canals, M.; El Mouraouah, A.; Harbitz, C. B.; Ibenbrahim, A.; Lastras, G.; Lorito, S.; Løvholt, F.; Matias, L. M.; Omira, R.; Papadopoulos, G. A.; Pekcan, O.; Nmiri, A.; Selva, J.; Yalciner, A. C.
2016-12-01
As global awareness of tsunami hazard and risk grows, the North-East Atlantic, the Mediterranean, and connected Seas (NEAM) region still lacks a thorough probabilistic tsunami hazard assessment. The TSUMAPS-NEAM project aims to fill this gap in the NEAM region by 1) producing the first region-wide long-term homogenous Probabilistic Tsunami Hazard Assessment (PTHA) from earthquake sources, and by 2) triggering a common tsunami risk management strategy. The specific objectives of the project are tackled by the following four consecutive actions: 1) Conduct a state-of-the-art, standardized, and updatable PTHA with full uncertainty treatment; 2) Review the entire process with international experts; 3) Produce the PTHA database, with documentation of the entire hazard assessment process; and 4) Publicize the results through an awareness raising and education phase, and a capacity building phase. This presentation will illustrate the project layout, summarize its current status of advancement and prospective results, and outline its connections with similar initiatives in the international context. The TSUMAPS-NEAM Project (http://www.tsumaps-neam.eu/) is co-financed by the European Union Civil Protection Mechanism, Agreement Number: ECHO/SUB/2015/718568/PREV26.
Probabilistic TSUnami Hazard MAPS for the NEAM Region: The TSUMAPS-NEAM Project
NASA Astrophysics Data System (ADS)
Basili, Roberto; Babeyko, Andrey Y.; Hoechner, Andreas; Baptista, Maria Ana; Ben Abdallah, Samir; Canals, Miquel; El Mouraouah, Azelarab; Bonnevie Harbitz, Carl; Ibenbrahim, Aomar; Lastras, Galderic; Lorito, Stefano; Løvholt, Finn; Matias, Luis Manuel; Omira, Rachid; Papadopoulos, Gerassimos A.; Pekcan, Onur; Nmiri, Abdelwaheb; Selva, Jacopo; Yalciner, Ahmet C.; Thio, Hong K.
2017-04-01
As global awareness of tsunami hazard and risk grows, the North-East Atlantic, the Mediterranean, and connected Seas (NEAM) region still lacks a thorough probabilistic tsunami hazard assessment. The TSUMAPS-NEAM project aims to fill this gap in the NEAM region by 1) producing the first region-wide long-term homogenous Probabilistic Tsunami Hazard Assessment (PTHA) from earthquake sources, and by 2) triggering a common tsunami risk management strategy. The specific objectives of the project are tackled by the following four consecutive actions: 1) Conduct a state-of-the-art, standardized, and updatable PTHA with full uncertainty treatment; 2) Review the entire process with international experts; 3) Produce the PTHA database, with documentation of the entire hazard assessment process; and 4) Publicize the results through an awareness raising and education phase, and a capacity building phase. This presentation will illustrate the project layout, summarize its current status of advancement, including the first preliminary release of the assessment, and outline its connections with similar initiatives in the international context. The TSUMAPS-NEAM Project (http://www.tsumaps-neam.eu/) is co-financed by the European Union Civil Protection Mechanism, Agreement Number: ECHO/SUB/2015/718568/PREV26.
Nuclear data made easily accessible through the Notre Dame Nuclear Database
NASA Astrophysics Data System (ADS)
Khouw, Timothy; Lee, Kevin; Fasano, Patrick; Mumpower, Matthew; Aprahamian, Ani
2014-09-01
In 1994, the NNDC revolutionized nuclear research by providing a colorful, clickable, searchable database over the internet. Over the last twenty years, web technology has evolved dramatically. Our project, the Notre Dame Nuclear Database, aims to provide a more comprehensive and broadly searchable interactive body of data. The database can be searched by an array of filters, which include metadata such as the facility where a measurement is made, the author(s), or the date of publication for the datum of interest. The user interface takes full advantage of HTML, a web markup language; CSS (cascading style sheets), to define the aesthetics of the website; and JavaScript, a language that can process complex data. A command-line interface is supported that interacts with the database directly on a user's local machine, providing single-command access to data. This is possible through the use of a standardized API (application programming interface) that relies upon well-defined filtering variables to produce customized search results. We offer an innovative chart of nuclides utilizing scalable vector graphics (SVG) to deliver users an unsurpassed level of interactivity supported on all computers and mobile devices. We will present a functional demo of our database at the conference.
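As a toy illustration of why SVG suits an interactive chart of nuclides (each nuclide is a scalable, scriptable rectangle keyed to neutron number N and proton number Z), this sketch writes a four-cell chart fragment; the layout and data are invented, not the project's actual rendering code:

cells = [(0, 1, "1H"), (1, 1, "2H"), (1, 2, "3He"), (2, 2, "4He")]  # (N, Z, label)
size = 40
parts = []
for n, z, label in cells:
    x, y = n * size, -z * size  # N along x, Z up the y-axis, as on the chart
    parts.append(f'<rect x="{x}" y="{y}" width="{size}" height="{size}" '
                 f'fill="#cde" stroke="#333"/>')
    parts.append(f'<text x="{x + 5}" y="{y + 25}" font-size="12">{label}</text>')
svg = (f'<svg xmlns="http://www.w3.org/2000/svg" viewBox="-10 -130 220 140">'
       f'{"".join(parts)}</svg>')
with open("nuclides.svg", "w") as f:
    f.write(svg)  # vector output stays crisp at any zoom level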
Extensible Probabilistic Repository Technology (XPRT)
2004-10-01
Some data sources were drawn from existing projects, such as Centaurus and the Evidence Data Base (EDB); others were fabricated, such as INS and FED; while others contain data from the open web... [table residue; recoverable entries: Google Web Report (unlimited, SOAP API); BBC News (unlimited, web, RSS 1.0); Centaurus Person Demographics (204,402 people from 240 countries)] ...Objects of the domain ontology map to the various simulated data-sources. For example, the PersonDemographics are stored in the Centaurus database, while
Comment on "Secure quantum private information retrieval using phase-encoded queries"
NASA Astrophysics Data System (ADS)
Shi, Run-hua; Mu, Yi; Zhong, Hong; Zhang, Shun
2016-12-01
In this Comment, we reexamine the security of phase-encoded quantum private query (QPQ). We find that the current phase-encoded QPQ protocols, including their applications, are vulnerable to a probabilistic entangle-and-measure attack performed by the owner of the database. Furthermore, we discuss how to overcome this security loophole and present an improved cheat-sensitive QPQ protocol without losing the good features of the original protocol.
NASA Astrophysics Data System (ADS)
Tien Bui, Dieu; Hoang, Nhat-Duc
2017-09-01
In this study, a probabilistic model, named BayGmmKda, is proposed for flood susceptibility assessment in a study area in central Vietnam. The new model is a Bayesian framework constructed by a combination of a Gaussian mixture model (GMM), radial-basis-function Fisher discriminant analysis (RBFDA), and a geographic information system (GIS) database. In the Bayesian framework, GMM is used for modeling the data distribution of flood-influencing factors in the GIS database, whereas RBFDA is utilized to construct a latent variable that aims at enhancing the model performance. As a result, the posterior probabilistic output of the BayGmmKda model is used as the flood susceptibility index. Experimental results showed that the proposed hybrid framework is superior to other benchmark models, including the adaptive neuro-fuzzy inference system and the support vector machine. To facilitate the model implementation, a software program for BayGmmKda has been developed in MATLAB. The BayGmmKda program can accurately establish a flood susceptibility map for the study region. Accordingly, local authorities can overlay this susceptibility map onto various land-use maps for the purpose of land-use planning or management.
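A schematic sketch of the core Bayesian step only (the RBFDA latent variable and GIS layers of BayGmmKda are not reproduced): fit one Gaussian mixture per class to synthetic conditioning-factor vectors and use the posterior class probability as a susceptibility index:

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X_flood = rng.normal([2.0, 2.0], 1.0, size=(200, 2))  # synthetic factor vectors
X_dry = rng.normal([-1.0, -1.0], 1.0, size=(200, 2))

gmm_flood = GaussianMixture(n_components=2, random_state=0).fit(X_flood)
gmm_dry = GaussianMixture(n_components=2, random_state=0).fit(X_dry)

x = np.array([[1.5, 1.0]])  # factor vector of one map cell to score
like = np.exp([gmm_flood.score_samples(x)[0], gmm_dry.score_samples(x)[0]])
prior = np.array([0.5, 0.5])  # equal class priors, an arbitrary choice here
posterior = like * prior / np.sum(like * prior)
print(f"flood susceptibility index: {posterior[0]:.3f}")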
Probabilistic Assessment of Cancer Risk for Astronauts on Lunar Missions
NASA Technical Reports Server (NTRS)
Kim, Myung-Hee Y.; Cucinotta, Francis A.
2009-01-01
During future lunar missions, exposure to solar particle events (SPEs) is a major safety concern for crew members during extra-vehicular activities (EVAs) on the lunar surface or Earth-to-moon transit. NASA's new lunar program anticipates that up to 15% of crew time may be on EVA, with minimal radiation shielding. For the operational challenge to respond to events of unknown size and duration, a probabilistic risk assessment approach is essential for mission planning and design. Using the historical database of proton measurements during the past 5 solar cycles, a typical hazard function for SPE occurrence was defined using a non-homogeneous Poisson model as a function of time within a non-specific future solar cycle of 4000 days duration. Distributions ranging from the 5th to 95th percentile of particle fluences for a specified mission period were simulated. Organ doses corresponding to particle fluences at the median and at the 95th percentile for a specified mission period were assessed using NASA's baryon transport model, BRYNTRN. The cancer fatality risks for astronauts as functions of age, gender, and solar cycle activity were then analyzed. The probability of exceeding the NASA 30-day limit of blood forming organ (BFO) dose inside a typical spacecraft was calculated. Future work will involve using this probabilistic risk assessment approach to SPE forecasting, combined with a probabilistic approach to the radiobiological factors that contribute to the uncertainties in projecting cancer risks.
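A small sketch of drawing SPE occurrence times from a non-homogeneous Poisson process by thinning (Lewis-Shedler); the hazard function below is an invented stand-in for the cycle-dependent hazard the abstract fits to historical data:

import math
import random

random.seed(1)
T = 4000.0  # days in the notional solar cycle
lam_max = 0.006  # upper bound on the hazard, needed for thinning

def hazard(t):
    # Invented smooth hazard peaking mid-cycle (events per day)
    return 0.002 + 0.004 * math.sin(math.pi * t / T) ** 2

def sample_spe_times():
    t, events = 0.0, []
    while True:
        t += random.expovariate(lam_max)  # candidate from the bounding process
        if t > T:
            return events
        if random.random() < hazard(t) / lam_max:  # accept w.p. hazard/lam_max
            events.append(t)

print(len(sample_spe_times()), "SPEs in one simulated cycle")

Repeating the simulation many times yields the distribution of event counts and fluences from which percentile environments for a chosen confidence level can be read off.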
Poças, Maria F; Oliveira, Jorge C; Brandsch, Rainer; Hogg, Timothy
2010-07-01
The use of probabilistic approaches in exposure assessments of contaminants migrating from food packages is of increasing interest, but the lack of concentration or migration data is often cited as a limitation. Data accounting for the variability and uncertainty that can be expected in migration, for example, due to heterogeneity in the packaging system, variation of the temperature along the distribution chain, and the different time of consumption of each individual package, are required for probabilistic analysis. The objective of this work was to characterize quantitatively the uncertainty and variability in estimates of migration. A Monte Carlo simulation was applied to a typical solution of Fick's law with given variability in the input parameters. The analysis was performed based on experimental data of a model system (migration of Irgafos 168 from polyethylene into isooctane) and illustrates how important sources of variability and uncertainty can be identified in order to refine analyses. For long migration times and controlled conditions of temperature, the affinity of the migrant to the food can be the major factor determining the variability in the migration values (more than 70% of variance). In situations where both the time of consumption and temperature can vary, these factors can be responsible, respectively, for more than 60% and 20% of the variance in the migration estimates. The approach presented can be used with databases from consumption surveys to yield a true probabilistic estimate of exposure.
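A hedged sketch of the Monte Carlo idea using a simplified short-time solution of Fick's law, M(t) = 2*c0*sqrt(D*t/pi), scaled by an affinity factor; every distribution and constant below is invented, and the study's actual model is fuller:

import numpy as np

rng = np.random.default_rng(0)
n = 100_000
c0 = 500.0  # initial migrant concentration in the polymer (invented units)
t = rng.uniform(5, 180, n) * 86_400  # time of consumption: 5-180 days, in seconds
D = 10 ** rng.normal(-12.5, 0.4, n)  # diffusion coefficient, temperature-driven spread
K = rng.lognormal(0.0, 0.8, n)  # migrant-to-food affinity factor (invented)
mig = 2 * c0 * np.sqrt(D * t / np.pi) * K  # simplified Fickian migration estimate

# Crude variance attribution: Spearman rank correlation of each input with output
rank = lambda x: np.argsort(np.argsort(x))
for name, x in (("time", t), ("diffusion", D), ("affinity", K)):
    r = np.corrcoef(rank(x), rank(mig))[0, 1]
    print(f"{name}: rank correlation {r:.2f}")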
NASA Astrophysics Data System (ADS)
Subramanian, A. C.; Lavers, D.; Matsueda, M.; Shukla, S.; Cayan, D. R.; Ralph, M.
2017-12-01
Atmospheric rivers (ARs) - elongated plumes of intense moisture transport - are a primary source of hydrological extremes, water resources and impactful weather along the West Coast of North America and Europe. There is strong demand in the water management, societal infrastructure and humanitarian sectors for reliable sub-seasonal forecasts, particularly of extreme events, such as floods and droughts so that actions to mitigate disastrous impacts can be taken with sufficient lead-time. Many recent studies have shown that ARs in the Pacific and the Atlantic are modulated by large-scale modes of climate variability. Leveraging the improved understanding of how these large-scale climate modes modulate the ARs in these two basins, we use the state-of-the-art multi-model forecast systems such as the North American Multi-Model Ensemble (NMME) and the Subseasonal-to-Seasonal (S2S) database to help inform and assess the probabilistic prediction of ARs and related extreme weather events over the North American and European West Coasts. We will present results from evaluating probabilistic forecasts of extreme precipitation and AR activity at the sub-seasonal scale. In particular, results from the comparison of two winters (2015-16 and 2016-17) will be shown, winters which defied canonical El Niño teleconnection patterns over North America and Europe. We further extend this study to analyze probabilistic forecast skill of AR events in these two basins and the variability in forecast skill during certain regimes of large-scale climate modes.
Praveen, Paurush; Fröhlich, Holger
2013-01-01
Inferring regulatory networks from experimental data via probabilistic graphical models is a popular framework to gain insights into biological systems. However, the inherent noise in experimental data coupled with a limited sample size reduces the performance of network reverse engineering. Prior knowledge from existing sources of biological information can address this low signal to noise problem by biasing the network inference towards biologically plausible network structures. Although integrating various sources of information is desirable, their heterogeneous nature makes this task challenging. We propose two computational methods to incorporate various information sources into a probabilistic consensus structure prior to be used in graphical model inference. Our first model, called Latent Factor Model (LFM), assumes a high degree of correlation among external information sources and reconstructs a hidden variable as a common source in a Bayesian manner. The second model, a Noisy-OR, picks up the strongest support for an interaction among information sources in a probabilistic fashion. Our extensive computational studies on KEGG signaling pathways as well as on gene expression data from breast cancer and yeast heat shock response reveal that both approaches can significantly enhance the reconstruction accuracy of Bayesian Networks compared to other competing methods as well as to the situation without any prior. Our framework allows for using diverse information sources, like pathway databases, GO terms and protein domain data, etc. and is flexible enough to integrate new sources, if available. PMID:23826291
rCAD: A Novel Database Schema for the Comparative Analysis of RNA.
Ozer, Stuart; Doshi, Kishore J; Xu, Weijia; Gutell, Robin R
2011-12-31
Beyond its direct involvement in protein synthesis with mRNA, tRNA, and rRNA, RNA is now being appreciated for its significance in the overall metabolism and regulation of the cell. Comparative analysis has been very effective in the identification and characterization of RNA molecules, including the accurate prediction of their secondary structure. We are developing an integrative scalable data management and analysis system, the RNA Comparative Analysis Database (rCAD), implemented with SQL Server to support RNA comparative analysis. The platform-agnostic database schema of rCAD captures the essential relationships between the different dimensions of information for RNA comparative analysis datasets. The rCAD implementation enables a variety of comparative analysis manipulations with multiple integrated data dimensions for advanced RNA comparative analysis workflows. In this paper, we describe details of the rCAD schema design and illustrate its usefulness with two usage scenarios.
LHCb experience with LFC replication
NASA Astrophysics Data System (ADS)
Bonifazi, F.; Carbone, A.; Perez, E. D.; D'Apice, A.; dell'Agnello, L.; Duellmann, D.; Girone, M.; Re, G. L.; Martelli, B.; Peco, G.; Ricci, P. P.; Sapunenko, V.; Vagnoni, V.; Vitlacil, D.
2008-07-01
Database replication is a key topic in the framework of the LHC Computing Grid to allow processing of data in a distributed environment. In particular, the LHCb computing model relies on the LHC File Catalog (LFC), i.e., a database which stores information about files spread across the Grid, their logical names, and the physical locations of all the replicas. The LHCb computing model requires the LFC to be replicated at Tier-1s. The LCG 3D project deals with the database replication issue and provides a replication service based on Oracle Streams technology. This paper describes the deployment of the LHC File Catalog replication to the INFN National Center for Telematics and Informatics (CNAF) and to other LHCb Tier-1 sites. We performed stress tests designed to evaluate any delay in the propagation of the streams and the scalability of the system. The tests show the robustness of the replica implementation, with performance going well beyond the LHCb requirements.
PathCase-SB architecture and database design
2011-01-01
Background: Integration of metabolic pathways resources and regulatory metabolic network models, and deploying new tools on the integrated platform, can help perform more effective and more efficient systems biology research on understanding the regulation of metabolic networks. Therefore, the tasks of (a) integrating regulatory metabolic networks and existing models under a single database environment, and (b) building tools to help with modeling and analysis are desirable and intellectually challenging computational tasks. Description: PathCase Systems Biology (PathCase-SB) has been built and released. The PathCase-SB database provides data and an API for multiple user interfaces and software tools. The current PathCase-SB system provides a database-enabled framework and web-based computational tools towards facilitating the development of kinetic models for biological systems. PathCase-SB aims to integrate data of selected biological data sources on the web (currently, the BioModels database and KEGG), and to provide more powerful and/or new capabilities via the new web-based integrative framework. This paper describes architecture and database design issues encountered in PathCase-SB's design and implementation, and presents the current design of PathCase-SB's architecture and database. Conclusions: The PathCase-SB architecture and database provide a highly extensible and scalable environment with easy and fast (real-time) access to the data in the database. PathCase-SB itself is already being used by researchers across the world. PMID:22070889
Modular Bayesian Networks with Low-Power Wearable Sensors for Recognizing Eating Activities.
Kim, Kee-Hoon; Cho, Sung-Bae
2017-12-11
Recently, recognizing a user's daily activity using a smartphone and wearable sensors has become a popular issue. However, in contrast with idealized experimental settings, real life presents numerous complex activities shaped by varied backgrounds and contexts: time, space, age, culture, and so on. Recognizing these complex activities with limited low-power sensors, while also respecting the power and memory constraints of the wearable environment and limiting obtrusiveness to the user, is not an easy problem, yet it is crucial if the activity recognizer is to be practically useful. In this paper, we recognize the activity of eating, one of the most typical examples of a complex activity, using only everyday low-power mobile and wearable sensors. To organize the related contexts systematically, we constructed a context model based on activity theory and the "Five W's", and propose a Bayesian network with 88 nodes to predict uncertain contexts probabilistically. The structure of the proposed Bayesian network is designed in a modular and tree-structured way to reduce time complexity and increase scalability. To evaluate the proposed method, we collected data on 10 different activities from 25 volunteers of various ages and occupations, and obtained 79.71% accuracy, which outperforms other conventional classifiers by 7.54-14.4%. Analyses of the results showed that our probabilistic approach can also give approximate results even when one of the contexts or sensor values has a very heterogeneous pattern or is missing.
Scalable Probabilistic Inference for Global Seismic Monitoring
NASA Astrophysics Data System (ADS)
Arora, N. S.; Dear, T.; Russell, S.
2011-12-01
We describe a probabilistic generative model for seismic events, their transmission through the earth, and their detection (or mis-detection) at seismic stations. We also describe an inference algorithm that constructs the most probable event bulletin explaining the observed set of detections. The model and inference are called NET-VISA (network processing vertically integrated seismic analysis) and are designed to replace the current automated network processing at the IDC, the SEL3 bulletin. Our results (attached table) demonstrate that NET-VISA significantly outperforms SEL3 by reducing the missed events from 30.3% down to 12.5%. The difference is even more dramatic for smaller magnitude events. NET-VISA has no difficulty in locating nuclear explosions as well. The attached figure demonstrates the location predicted by NET-VISA versus other bulletins for the second DPRK event. Further evaluation on dense regional networks demonstrates that NET-VISA finds many events missed in the LEB bulletin, which is produced by the human analysts. Large aftershock sequences, as produced by the December 2004 Sumatra earthquake and the March 2011 Tohoku earthquake, can pose a significant load for automated processing, often delaying the IDC bulletins by weeks or months. Indeed, these sequences can overload the serial NET-VISA inference as well. We describe an enhancement to NET-VISA to make it multi-threaded, and hence take full advantage of the processing power of multi-core and multi-CPU machines. Our experiments show that the new inference algorithm is able to achieve 80% efficiency in parallel speedup.
Database of potential sources for earthquakes larger than magnitude 6 in Northern California
1996-01-01
The Northern California Earthquake Potential (NCEP) working group, composed of many contributors and reviewers in industry, academia and government, has pooled its collective expertise and knowledge of regional tectonics to identify potential sources of large earthquakes in northern California. We have created a map and database of active faults, both surficial and buried, that forms the basis for the northern California portion of the national map of probabilistic seismic hazard. The database contains 62 potential sources, including fault segments and areally distributed zones. The working group has integrated constraints from broadly based plate tectonic and VLBI models with local geologic slip rates, geodetic strain rate, and microseismicity. Our earthquake source database derives from a scientific consensus that accounts for conflict in the diverse data. Our preliminary product, as described in this report, brings to light many gaps in the data, including a need for better information on the proportion of deformation in fault systems that is aseismic.
A database of the coseismic effects following the 30 October 2016 Norcia earthquake in Central Italy
Villani, Fabio; Civico, Riccardo; Pucci, Stefano; Pizzimenti, Luca; Nappi, Rosa; De Martini, Paolo Marco; Agosta, F.; Alessio, G.; Alfonsi, L.; Amanti, M.; Amoroso, S.; Aringoli, D.; Auciello, E.; Azzaro, R.; Baize, S.; Bello, S.; Benedetti, L.; Bertagnini, A.; Binda, G.; Bisson, M.; Blumetti, A.M.; Bonadeo, L.; Boncio, P.; Bornemann, P.; Branca, S.; Braun, T.; Brozzetti, F.; Brunori, C.A.; Burrato, P.; Caciagli, M.; Campobasso, C.; Carafa, M.; Cinti, F.R.; Cirillo, D.; Comerci, V.; Cucci, L.; De Ritis, R.; Deiana, G.; Del Carlo, P.; Del Rio, L.; Delorme, A.; Di Manna, P.; Di Naccio, D.; Falconi, L.; Falcucci, E.; Farabollini, P.; Faure Walker, J.P.; Ferrarini, F.; Ferrario, M.F.; Ferry, M.; Feuillet, N.; Fleury, J.; Fracassi, U.; Frigerio, C.; Galluzzo, F.; Gambillara, R.; Gaudiosi, G.; Goodall, H.; Gori, S.; Gregory, L.C.; Guerrieri, L.; Hailemikael, S.; Hollingsworth, J.; Iezzi, F.; Invernizzi, C.; Jablonská, D.; Jacques, E.; Jomard, H.; Kastelic, V.; Klinger, Y.; Lavecchia, G.; Leclerc, F.; Liberi, F.; Lisi, A.; Livio, F.; Lo Sardo, L.; Malet, J.P.; Mariucci, M.T.; Materazzi, M.; Maubant, L.; Mazzarini, F.; McCaffrey, K.J.W.; Michetti, A.M.; Mildon, Z.K.; Montone, P.; Moro, M.; Nave, R.; Odin, M.; Pace, B.; Paggi, S.; Pagliuca, N.; Pambianchi, G.; Pantosti, D.; Patera, A.; Pérouse, E.; Pezzo, G.; Piccardi, L.; Pierantoni, P.P.; Pignone, M.; Pinzi, S.; Pistolesi, E.; Point, J.; Pousse, L.; Pozzi, A.; Proposito, M.; Puglisi, C.; Puliti, I.; Ricci, T.; Ripamonti, L.; Rizza, M.; Roberts, G.P.; Roncoroni, M.; Sapia, V.; Saroli, M.; Sciarra, A.; Scotti, O.; Skupinski, G.; Smedile, A.; Soquet, A.; Tarabusi, G.; Tarquini, S.; Terrana, S.; Tesson, J.; Tondi, E.; Valentini, A.; Vallone, R.; Van der Woerd, J.; Vannoli, P.; Venuti, A.; Vittori, E.; Volatili, T.; Wedmore, L.N.J.; Wilkinson, M.; Zambrano, M.
2018-01-01
We provide a database of the coseismic geological surface effects following the Mw 6.5 Norcia earthquake that hit central Italy on 30 October 2016. This was one of the strongest seismic events to occur in Europe in the past thirty years, causing complex surface ruptures over an area of >400 km2. The database originated from the collaboration of several European teams (Open EMERGEO Working Group; about 130 researchers) coordinated by the Istituto Nazionale di Geofisica e Vulcanologia. The observations were collected by performing detailed field surveys in the epicentral region in order to describe the geometry and kinematics of surface faulting, and subsequently of landslides and other secondary coseismic effects. The resulting database consists of homogeneous georeferenced records identifying 7323 observation points, each of which contains 18 numeric and string fields of relevant information. This database will impact future earthquake studies focused on modelling of the seismic processes in active extensional settings, updating probabilistic estimates of slip distribution, and assessing the hazard of surface faulting. PMID:29583143
Malware detection and analysis
DOE Office of Scientific and Technical Information (OSTI.GOV)
Chiang, Ken; Lloyd, Levi; Crussell, Jonathan
Embodiments of the invention describe systems and methods for malicious software detection and analysis. A binary executable comprising obfuscated malware on a host device may be received, and incident data indicating a time when the binary executable was received and identifying processes operating on the host device may be recorded. The binary executable is analyzed via a scalable plurality of execution environments, including one or more non-virtual execution environments and one or more virtual execution environments, to generate runtime data and deobfuscation data attributable to the binary executable. At least some of the runtime data and deobfuscation data attributable to the binary executable is stored in a shared database, while at least some of the incident data is stored in a private, non-shared database.
The Data Acquisition System of the Stockholm Educational Air Shower Array
NASA Astrophysics Data System (ADS)
Hofverberg, P.; Johansson, H.; Pearce, M.; Rydstrom, S.; Wikstrom, C.
2005-12-01
The Stockholm Educational Air Shower Array (SEASA) project is deploying an array of plastic scintillator detector stations on school roofs in the Stockholm area. Signals from GPS satellites are used to time-synchronise signals from the widely separated detector stations, allowing cosmic ray air showers to be identified and studied. A low-cost and highly scalable data acquisition system has been produced using embedded Linux processors which communicate station data to a central server running a MySQL database. Air shower data can be visualised in real time using a Java-applet client. It is also possible to query the database and manage detector stations from the client. In this paper, the design and performance of the system are described.
A Probabilistic Risk Assessment of Groundwater-Related Risks at Excavation Sites
NASA Astrophysics Data System (ADS)
Jurado, A.; de Gaspari, F.; Vilarrasa, V.; Sanchez-Vila, X.; Fernandez-Garcia, D.; Tartakovsky, D. M.; Bolster, D.
2010-12-01
Excavation sites such as those associated with the construction of subway lines, railways and highway tunnels are hazardous places, posing risks to workers, machinery and surrounding buildings. Many of these risks can be groundwater related. In this work we develop a general framework based on a probabilistic risk assessment (PRA) to quantify such risks. This approach is compatible with standard PRA practices and it employs many well-developed risk analysis tools, such as fault trees. The novelty and computational challenges of the proposed approach stem from the reliance on stochastic differential equations, rather than reliability databases, to compute the probabilities of basic events. The general framework is applied to a specific case study in Spain. It is used to estimate and minimize risks for a potential construction site of an underground station for the new subway line in the Barcelona metropolitan area.
Zhang, Miaomiao; Wells, William M; Golland, Polina
2016-10-01
Using image-based descriptors to investigate clinical hypotheses and therapeutic implications is challenging due to the notorious "curse of dimensionality" coupled with a small sample size. In this paper, we present a low-dimensional analysis of anatomical shape variability in the space of diffeomorphisms and demonstrate its benefits for clinical studies. To combat the high dimensionality of the deformation descriptors, we develop a probabilistic model of principal geodesic analysis in a bandlimited low-dimensional space that still captures the underlying variability of image data. We demonstrate the performance of our model on a set of 3D brain MRI scans from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database. Our model yields a more compact representation of group variation at substantially lower computational cost than models based on the high-dimensional state-of-the-art approaches such as tangent space PCA (TPCA) and probabilistic principal geodesic analysis (PPGA).
Evaluation of NoSQL databases for DIRAC monitoring and beyond
NASA Astrophysics Data System (ADS)
Mathe, Z.; Casajus Ramo, A.; Stagni, F.; Tomassetti, L.
2015-12-01
Nowadays, many database systems are available, but they may not be optimized for storing time series data. Monitoring DIRAC jobs would be better done using a database optimised for storing time series data; so far this has been done using a MySQL database, which is not well suited for such an application. Therefore alternatives have been investigated. Choosing an appropriate database for storing huge amounts of time series data is not trivial, as one must take into account different aspects such as manageability, scalability and extensibility. We compared the performance of the Elasticsearch, OpenTSDB (based on HBase) and InfluxDB NoSQL databases, using the same set of machines and the same data. We also evaluated the effort required for maintaining them. Using the LHCb Workload Management System (WMS), based on DIRAC, as a use case, we set up a new monitoring system in parallel with the current MySQL system and stored the same data into the databases under test. We evaluated the Grafana (for OpenTSDB) and Kibana (for Elasticsearch) metrics and graph editors for creating dashboards, in order to have a clear picture of the usability of each candidate. In this paper we present the results of this study and the performance of the selected technology. We also give an outlook of other potential applications of NoSQL databases within the DIRAC project.
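When comparing stores like these, a client-agnostic harness that times bulk writes of synthetic monitoring points yields comparable throughput numbers across backends. The adapter below is a placeholder to be replaced with each database's client, not DIRAC code:

```python
import time
import random
import statistics

def make_points(n, t0=1_600_000_000):
    """Synthetic job-monitoring samples: (timestamp, site, running-job count)."""
    sites = ["LCG.CERN.ch", "LCG.CNAF.it", "LCG.PIC.es"]   # invented site names
    return [(t0 + i, random.choice(sites), random.randint(0, 5000))
            for i in range(n)]

def benchmark(write_batch, n_batches=20, batch_size=1000):
    """Median insert rate through any client adapter with signature f(points)."""
    rates = []
    for _ in range(n_batches):
        pts = make_points(batch_size)
        t0 = time.perf_counter()
        write_batch(pts)
        rates.append(batch_size / (time.perf_counter() - t0))
    return statistics.median(rates)

def null_write(points):
    # stand-in adapter; swap in an Elasticsearch/OpenTSDB/InfluxDB client here
    pass

print(f"{benchmark(null_write):,.0f} points/s")
```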
Probabilistic seismic hazard assessment for northern Southeast Asia
NASA Astrophysics Data System (ADS)
Chan, C. H.; Wang, Y.; Kosuwan, S.; Nguyen, M. L.; Shi, X.; Sieh, K.
2016-12-01
We assess seismic hazard for northern Southeast Asia by constructing an earthquake and fault database, conducting a series of ground-shaking scenarios, and proposing regional seismic hazard maps. Our earthquake database contains earthquake parameters from global and local seismic catalogues, including the ISC, ISC-GEM, the global ANSS Comprehensive Catalogues, the Seismological Bureau, Thai Meteorological Department, Thailand, and the Institute of Geophysics, Vietnam Academy of Science and Technology, Vietnam. To harmonize the earthquake parameters from various catalogue sources, we remove duplicate events and unify magnitudes into the same scale. Our active fault database includes active fault data from previous studies, e.g. the active fault parameters determined by Wang et al. (2014), the Department of Mineral Resources, Thailand, and the Institute of Geophysics, Vietnam Academy of Science and Technology, Vietnam. Based on the parameters from analysis of the databases (i.e., the Gutenberg-Richter relationship, slip rate, maximum magnitude and time elapsed since last events), we determined the earthquake recurrence models of seismogenic sources. To evaluate the ground shaking behaviours in different tectonic regimes, we conducted a series of tests by matching the felt intensities of historical earthquakes to the modelled ground motions using ground motion prediction equations (GMPEs). By incorporating the best-fitting GMPEs and site conditions, we accounted for site effects and assessed probabilistic seismic hazard. The highest seismic hazard is in the region close to the Sagaing Fault, which cuts through some major cities in central Myanmar. The northern segment of the Sunda megathrust, which could potentially cause an M8-class earthquake, brings significant hazard along the western coast of Myanmar and eastern Bangladesh. In addition, we find a notable hazard level in northern Vietnam and at the boundary between Myanmar, Thailand and Laos, due to a series of strike-slip faults which could potentially cause moderate-to-large earthquakes. Note that although much of the region has a low probability of damaging shaking, low-probability events have recently caused much destruction in SE Asia (e.g. the 2008 Wenchuan and 2015 Sabah earthquakes).
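The recurrence models mentioned rest on the Gutenberg-Richter relation log10 N(≥M) = a - bM. A small sketch of how annual rates and Poissonian exceedance probabilities follow from it; the a and b values below are illustrative, not the study's fitted parameters:

```python
import numpy as np

def gr_annual_rate(m, a, b):
    """Annual rate of events with magnitude >= m from log10 N(>=m) = a - b*m."""
    return 10.0 ** (a - b * m)

def poisson_exceedance_prob(m, a, b, years):
    """P(at least one event >= m in a time window), assuming a Poisson process."""
    lam = gr_annual_rate(m, a, b) * years
    return 1.0 - np.exp(-lam)

# illustrative values only (not the study's fits)
a, b = 4.5, 1.0
for m in (6.0, 7.0, 8.0):
    print(m, gr_annual_rate(m, a, b), poisson_exceedance_prob(m, a, b, years=50))
```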
NASA Technical Reports Server (NTRS)
Maluf, David A.; Bell, David G.; Ashish, Naveen
2005-01-01
This paper describes an approach to achieving data integration across multiple sources in an enterprise, in a manner that is cost efficient and economically scalable. We present an approach that does not rely on major investment in structured, heavy-weight database systems for data storage or heavy-weight middleware responsible for integrated access. The approach is centered on pushing any required data structure and semantics functionality (schema) to application clients, as well as pushing integration specification and functionality to clients, where integration can be performed on the fly.
Perspectives in astrophysical databases
NASA Astrophysics Data System (ADS)
Frailis, Marco; de Angelis, Alessandro; Roberto, Vito
2004-07-01
Astrophysics has become a domain extremely rich in scientific data. Data mining tools are needed for information extraction from such large data sets. This calls for an approach to data management emphasizing the efficiency and simplicity of data access; efficiency is obtained using multidimensional access methods and simplicity is achieved by properly handling metadata. Moreover, clustering and classification techniques on large data sets pose additional requirements in terms of computation and memory scalability and interpretability of results. In this study we review some possible solutions.
2006-09-01
STELLA and PowerLoom. These modules communicate with a knowledge base using KIF and standard relational database systems using either standard...groups ontology as well as a rule that infers additional seed members based on joint participation in a terrorism event. EDB schema files are a special... terrorism links from the Ali Baba EDB. Our interpretation of such links is that they encode that two people committed an act of
Service Management Database for DSN Equipment
NASA Technical Reports Server (NTRS)
Zendejas, Silvino; Bui, Tung; Bui, Bach; Malhotra, Shantanu; Chen, Fannie; Wolgast, Paul; Allen, Christopher; Luong, Ivy; Chang, George; Sadaqathulla, Syed
2009-01-01
This data- and event-driven persistent storage system leverages the use of commercial software provided by Oracle for portability, ease of maintenance, scalability, and ease of integration with embedded, client-server, and multi-tiered applications. In this role, the Service Management Database (SMDB) is a key component of the overall end-to-end process involved in the scheduling, preparation, and configuration of the Deep Space Network (DSN) equipment needed to perform the various telecommunication services the DSN provides to its customers worldwide. SMDB makes efficient use of triggers, stored procedures, queuing functions, e-mail capabilities, data management, and Java integration features provided by the Oracle relational database management system. SMDB uses a third normal form schema design that allows for simple data maintenance procedures and thin layers of integration with client applications. The software provides an integrated event logging system with ability to publish events to a JMS messaging system for synchronous and asynchronous delivery to subscribed applications. It provides a structured classification of events and application-level messages stored in database tables that are accessible by monitoring applications for real-time monitoring or for troubleshooting and analysis over historical archives.
Relax with CouchDB--into the non-relational DBMS era of bioinformatics.
Manyam, Ganiraju; Payton, Michelle A; Roth, Jack A; Abruzzo, Lynne V; Coombes, Kevin R
2012-07-01
With the proliferation of high-throughput technologies, genome-level data analysis has become common in molecular biology. Bioinformaticians are developing extensive resources to annotate and mine biological features from high-throughput data. The underlying database management systems for most bioinformatics software are based on a relational model. Modern non-relational databases offer an alternative that has flexibility, scalability, and a non-rigid design schema. Moreover, with an accelerated development pace, non-relational databases like CouchDB can be ideal tools to construct bioinformatics utilities. We describe CouchDB by presenting three new bioinformatics resources: (a) geneSmash, which collates data from bioinformatics resources and provides automated gene-centric annotations, (b) drugBase, a database of drug-target interactions with a web interface powered by geneSmash, and (c) HapMap-CN, which provides a web interface to query copy number variations from three SNP-chip HapMap datasets. In addition to the web sites, all three systems can be accessed programmatically via web services.
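For a flavour of the document model: CouchDB is driven over plain HTTP with JSON documents and JavaScript map views, so a gene-centric resource of the geneSmash kind can be mocked up in a few requests. This sketch assumes a local development instance with open permissions; the database and field names are hypothetical, not geneSmash's schema:

```python
import json
import requests

BASE = "http://localhost:5984"      # assumes a local CouchDB instance
DB = f"{BASE}/genes"                # hypothetical database name

requests.put(DB)                    # create the database (error handling omitted)

# Documents are schema-free JSON, which suits heterogeneous annotations
requests.put(f"{DB}/TP53", json={"symbol": "TP53", "chromosome": "17",
                                 "aliases": ["p53", "LFS1"]})

# Queries are JavaScript map functions stored in design documents
design = {"views": {"by_chromosome": {
    "map": "function(doc){ if(doc.chromosome) emit(doc.chromosome, doc.symbol); }"}}}
requests.put(f"{DB}/_design/lookup", json=design)

resp = requests.get(f"{DB}/_design/lookup/_view/by_chromosome",
                    params={"key": json.dumps("17")})   # view keys are JSON-encoded
print(resp.json()["rows"])
```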
Denison, Stephanie; Trikutam, Pallavi; Xu, Fei
2014-08-01
A rich tradition in developmental psychology explores physical reasoning in infancy. However, no research to date has investigated whether infants can reason about physical objects that behave probabilistically, rather than deterministically. Physical events are often quite variable, in that similar-looking objects can be placed in similar contexts with different outcomes. Can infants rapidly acquire probabilistic physical knowledge, such as that some leaves fall and some glasses break, simply by observing the statistical regularity with which objects behave, and apply that knowledge in subsequent reasoning? We taught 11-month-old infants physical constraints on objects and asked them to reason about the probability of different outcomes when objects were drawn from a large distribution. Infants could have reasoned either by using the perceptual similarity between the samples and larger distributions or by applying physical rules to adjust base rates and estimate the probabilities. Infants learned the physical constraints quickly and used them to estimate probabilities, rather than relying on similarity, a version of the representativeness heuristic. These results indicate that infants can rapidly and flexibly acquire physical knowledge about objects following very brief exposure and apply it in subsequent reasoning.
A Hybrid Probabilistic Model for Unified Collaborative and Content-Based Image Tagging.
Zhou, Ning; Cheung, William K; Qiu, Guoping; Xue, Xiangyang
2011-07-01
The increasing availability of large quantities of user-contributed images with labels has provided opportunities to develop automatic tools to tag images to facilitate image search and retrieval. In this paper, we present a novel hybrid probabilistic model (HPM) which integrates low-level image features and high-level user-provided tags to automatically tag images. For images without any tags, HPM predicts new tags based solely on the low-level image features. For images with user-provided tags, HPM jointly exploits both the image features and the tags in a unified probabilistic framework to recommend additional tags to label the images. The HPM framework makes use of the tag-image association matrix (TIAM). However, since the number of images is usually very large and user-provided tags are diverse, TIAM is very sparse, thus making it difficult to reliably estimate tag-to-tag co-occurrence probabilities. We developed a collaborative filtering method based on nonnegative matrix factorization (NMF) for tackling this data sparsity issue. Also, an L1-norm kernel method is used to estimate the correlations between image features and semantic concepts. The effectiveness of the proposed approach has been evaluated using three databases containing 5,000 images with 371 tags, 31,695 images with 5,587 tags, and 269,648 images with 5,018 tags, respectively.
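A minimal sketch of the NMF-based smoothing idea: factorize the sparse TIAM and estimate tag-to-tag co-occurrence from the low-rank reconstruction. Dimensions and data are synthetic, and the paper's exact objective may differ:

```python
import numpy as np
from sklearn.decomposition import NMF

# Sparse tag-image association matrix (rows: tags, cols: images), mostly zeros
rng = np.random.default_rng(0)
tiam = (rng.random((50, 200)) > 0.95).astype(float)

# Low-rank factorization fills in plausible associations despite the sparsity
model = NMF(n_components=10, init="nndsvda", max_iter=500, random_state=0)
W = model.fit_transform(tiam)          # tag factors
H = model.components_                  # image factors
smoothed = W @ H                       # denoised/smoothed TIAM

# Tag-to-tag co-occurrence estimated from the smoothed matrix
co = smoothed @ smoothed.T
co_prob = co / (co.sum(axis=1, keepdims=True) + 1e-12)  # row-normalised P(tag_j | tag_i)
print(co_prob.shape)
```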
Lee, Dong-Hoon; Lee, Do-Wan; Han, Bong-Soo
2016-04-01
The purpose of this study is to elucidate the symmetrical characteristics of the corticospinal tract (CST) related to hand movement in the bilateral hemispheres using a probabilistic fiber tracking method. Seventeen subjects participated in this study. Fiber tracking was performed with two regions of interest: hand-activation functional magnetic resonance imaging (fMRI) results and the pontomedullary junction in each cerebral hemisphere. Each subject's extracted fiber tract was normalized to a brain template. To measure the symmetrical distributions of the CST related to hand movement, laterality and anteriority indices were defined in the upper corona radiata (CR), lower CR, and posterior limb of the internal capsule. The measured laterality and anteriority indices between the hemispheres at each brain location showed no significant differences at the P < 0.05 level. There were significant differences in the measured indices among the 3 brain locations in each cerebral hemisphere (P < 0.001). Our results clearly showed that the hand CST has a symmetric structure in the bilateral hemispheres. The probabilistic fiber tracking with fMRI approach demonstrated that the hand CST can be successfully extracted regardless of the crossing-fiber problem. Our analytical approaches and results should be helpful for providing a database of CST somatotopy to neurologists and clinical researchers.
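The abstract does not give the index definitions; a common normalized-asymmetry form, offered here purely as an assumption rather than the paper's definition, is LI = (L - R)/(L + R):

```python
import numpy as np

def laterality_index(left, right):
    """
    Normalized asymmetry of a tract measure (e.g. streamline count or volume):
    LI = (L - R) / (L + R); 0 means perfect symmetry, bounds are [-1, 1].
    This standard form is an assumption -- the paper defines its own indices.
    """
    left, right = np.asarray(left, float), np.asarray(right, float)
    return (left - right) / (left + right)

# per-subject streamline counts through the upper corona radiata (invented numbers)
li = laterality_index(left=[812, 790, 655], right=[798, 805, 640])
print(li)   # values near 0 are consistent with the reported symmetry
```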
Probabilistic Elastic Part Model: A Pose-Invariant Representation for Real-World Face Verification.
Li, Haoxiang; Hua, Gang
2018-04-01
Pose variation remains a major challenge for real-world face recognition. We approach this problem through a probabilistic elastic part model. We extract local descriptors (e.g., LBP or SIFT) from densely sampled multi-scale image patches. By augmenting each descriptor with its location, a Gaussian mixture model (GMM) is trained to capture the spatial-appearance distribution of the face parts of all face images in the training corpus, namely the probabilistic elastic part (PEP) model. Each mixture component of the GMM is confined to be a spherical Gaussian to balance the influence of the appearance and the location terms, which naturally defines a part. Given one or multiple face images of the same subject, the PEP model builds its PEP representation by sequentially concatenating descriptors identified by each Gaussian component in a maximum likelihood sense. We further propose a joint Bayesian adaptation algorithm to adapt the universally trained GMM to better model the pose variations between the target pair of faces/face tracks, which consistently improves face verification accuracy. Our experiments show that we achieve state-of-the-art face verification accuracy with the proposed representations on the Labeled Face in the Wild (LFW) dataset, the YouTube video face database, and the CMU MultiPIE dataset.
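A minimal sketch of the PEP construction using a spherical-covariance GMM over location-augmented descriptors. The descriptor dimensions, component count, and random data are arbitrary stand-ins for real LBP/SIFT features:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)

# Location-augmented local descriptors: [descriptor | x, y], as in the PEP idea
def augmented_descriptors(n, d=128):
    desc = rng.random((n, d))
    loc = rng.random((n, 2))
    return np.hstack([desc, loc])

train = augmented_descriptors(5000)

# Spherical components balance appearance vs. location terms; each one is a "part"
K = 32
gmm = GaussianMixture(n_components=K, covariance_type="spherical",
                      random_state=0).fit(train)

# Build a PEP-style representation for one face image:
# for each component, keep the patch descriptor it claims most strongly
face = augmented_descriptors(400)
resp = gmm.predict_proba(face)              # (n_patches, K) responsibilities
best = resp.argmax(axis=0)                  # max-likelihood patch per part
pep = face[best].ravel()                    # concatenated part descriptors
print(pep.shape)                            # K * (d + 2)
```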
NASA Astrophysics Data System (ADS)
Tang, Zhongqian; Zhang, Hua; Yi, Shanzhen; Xiao, Yangfan
2018-03-01
GIS-based multi-criteria decision analysis (MCDA) is increasingly used to support flood risk assessment. However, conventional GIS-MCDA methods fail to adequately represent spatial variability and are accompanied by considerable uncertainty. It is, thus, important to incorporate spatial variability and uncertainty into GIS-based decision analysis procedures. This research develops a spatially explicit, probabilistic GIS-MCDA approach for the delineation of potentially flood-susceptible areas. The approach integrates the probabilistic and the local ordered weighted averaging (OWA) methods via Monte Carlo simulation, to take into account the uncertainty related to criteria weights, the spatial heterogeneity of preferences, and the risk attitude of the analyst. The approach is applied in a pilot study to Gucheng County, central China, heavily affected by the hazardous 2012 flood. A GIS database of six geomorphological and hydrometeorological factors for the evaluation of susceptibility was created. Moreover, uncertainty and sensitivity analyses were performed to investigate the robustness of the model. The results indicate that the ensemble method improves the robustness of the model outcomes with respect to variation in criteria weights and identifies which criteria weights are most responsible for the variability of model outcomes. Therefore, the proposed approach is an improvement over the conventional deterministic method and can provide a more rational, objective and unbiased tool for flood susceptibility evaluation.
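A compact sketch of the core idea: an ordered weighted average applied per raster cell, wrapped in a Monte Carlo loop over uncertain criterion weights. The weight distributions and rescaling below are illustrative assumptions, not the paper's configuration:

```python
import numpy as np

rng = np.random.default_rng(7)

def owa(values, order_weights):
    """Ordered weighted average: weights apply to rank positions, not criteria."""
    v = np.sort(values, axis=-1)[..., ::-1]          # descending order
    return v @ order_weights

# Six standardized criterion scores per cell (e.g. slope, rainfall, drainage...)
cells = rng.random((1000, 6))

# Weight mass toward the tail emphasises the lowest-ranked scores,
# giving an "and-like", conservative (risk-averse) aggregation
order_w = np.array([0.05, 0.10, 0.15, 0.20, 0.25, 0.25])

# Monte Carlo over uncertain criterion weights drawn from a Dirichlet prior
n_runs = 500
scores = np.empty((n_runs, len(cells)))
for i in range(n_runs):
    crit_w = rng.dirichlet(np.ones(6))
    scores[i] = owa(cells * crit_w * 6.0, order_w)   # rescale so weights average 1

mean_s = scores.mean(axis=0)                         # ensemble susceptibility
spread = scores.std(axis=0)                          # per-cell uncertainty
print(mean_s[:5], spread[:5])
```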
NASA Astrophysics Data System (ADS)
Gilliom, R.; Hogue, T. S.; McCray, J. E.
2017-12-01
There is a need for improved parameterization of stormwater best management practice (BMP) performance estimates to improve modeling of urban hydrology, planning and design of green infrastructure projects, and water quality crediting for stormwater management. Percent removal is commonly used to estimate BMP pollutant removal efficiency, but there is general agreement that this approach has significant uncertainties and is easily affected by site-specific factors. Additionally, some fraction of monitored BMPs have negative percent removal, so it is important to understand the probability that a BMP will provide the desired water quality function versus exacerbating water quality problems. The widely used k-C* equation has been shown to provide a more adaptable and accurate method to model BMP contaminant attenuation, and previous work has begun to evaluate the strengths and weaknesses of the k-C* method. However, no systematic method exists for obtaining the first-order removal rate constants needed to use the k-C* equation for stormwater BMPs; thus there is minimal application of the method. The current research analyzes existing water quality data in the International Stormwater BMP Database to provide screening-level parameterization of the k-C* equation for selected BMP types and analysis of factors that skew the distribution of efficiency estimates from the database. Results illustrate that while certain BMPs are more likely to provide desired contaminant removal than others, site- and design-specific factors strongly influence performance. For example, bioretention systems show both the highest and lowest removal rates of dissolved copper, total phosphorus, and total nitrogen. Exploration and discussion of this and other findings will inform the application of the probabilistic pollutant removal rate constants. Though data limitations exist, this research will facilitate improved accuracy of BMP modeling and ultimately aid decision-making for stormwater quality management in urban systems.
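A sketch of how the k-C* equation supports probabilistic parameterization once a distribution for k is available. The Kadlec-type exponential form and all parameter values below are assumptions for illustration, not fits from the BMP Database:

```python
import numpy as np

def kcstar_outflow(c_in, k, q, c_star):
    """
    First-order k-C* model (Kadlec-type form, stated here as an assumption):
    (C_out - C*) / (C_in - C*) = exp(-k / q)
    c_in   : inflow concentration (mg/L)
    k      : areal first-order rate constant (m/yr)
    q      : hydraulic loading rate (m/yr)
    c_star : irreducible background concentration (mg/L)
    """
    return c_star + (c_in - c_star) * np.exp(-k / q)

# Probabilistic use: sample k from a fitted distribution rather than a point value
rng = np.random.default_rng(3)
k_samples = rng.lognormal(mean=np.log(15.0), sigma=0.6, size=10_000)  # illustrative
c_out = kcstar_outflow(c_in=0.30, k=k_samples, q=30.0, c_star=0.05)

# Percent removal varies strongly with the sampled k; note this form only yields
# negative removal when C* exceeds C_in, one way monitored data can go negative
removal = 100.0 * (0.30 - c_out) / 0.30
print(np.percentile(removal, [10, 50, 90]))
```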
Surviving the Glut: The Management of Event Streams in Cyberphysical Systems
NASA Astrophysics Data System (ADS)
Buchmann, Alejandro
Alejandro Buchmann is Professor in the Department of Computer Science, Technische Universität Darmstadt, where he heads the Databases and Distributed Systems Group. He received his MS (1977) and PhD (1980) from the University of Texas at Austin. He was an Assistant/Associate Professor at the Institute for Applied Mathematics and Systems IIMAS/UNAM in Mexico, doing research on databases for CAD, geographic information systems, and object-oriented databases. At Computer Corporation of America (later Xerox Advanced Information Systems) in Cambridge, Mass., he worked in the areas of active databases and real-time databases, and at GTE Laboratories, Waltham, in the areas of distributed object systems and the integration of heterogeneous legacy systems. In 1991 he returned to academia and joined T.U. Darmstadt. His current research interests are at the intersection of middleware, databases, event-based distributed systems, ubiquitous computing, and very large distributed systems (P2P, WSN). Much of the current research is concerned with guaranteeing quality of service and reliability properties in these systems, for example, scalability, performance, transactional behaviour, consistency, and end-to-end security. Many research projects imply collaboration with industry and cover a broad spectrum of application domains. Further information can be found at http://www.dvs.tu-darmstadt.de
NASA Astrophysics Data System (ADS)
Lebedev, A. A.; Maksimov, N. V.; Smirnova, E. V.
2017-01-01
The paper presents a model of information interactions based on a probabilistic concept of meanings. The proposed hypothesis about the wave nature of information, together with the use of the mathematical apparatus of quantum mechanics, allows us to consider the phenomena of interference and diffraction with respect to linguistic variables, and to quantify the dynamics of terms in subject areas. The retrospective INIS database of the IAEA was used as an experimental base.
Probabilistic Model for Laser Damage to the Human Retina
2012-03-01
the beam. Power density may be measured in radiant exposure, J/cm², or by irradiance, W/cm². In the experimental database used in this study and...to quantify a binary response, either lethal or non-lethal, within a population such as insects or rats. In directed energy research, probit...value of the normalized Arrhenius damage integral. In a one-dimensional simulation, the source term is determined as a spatially averaged irradiance (W
Halligan, Brian D.; Geiger, Joey F.; Vallejos, Andrew K.; Greene, Andrew S.; Twigger, Simon N.
2009-01-01
One of the major difficulties for many laboratories setting up proteomics programs has been obtaining and maintaining the computational infrastructure required for the analysis of the large flow of proteomics data. We describe a system that combines distributed cloud computing and open source software to allow laboratories to set up scalable virtual proteomics analysis clusters without the investment in computational hardware or software licensing fees. Additionally, the pricing structure of distributed computing providers, such as Amazon Web Services, allows laboratories or even individuals to have large-scale computational resources at their disposal at a very low cost per run. We provide detailed step by step instructions on how to implement the virtual proteomics analysis clusters as well as a list of current available preconfigured Amazon machine images containing the OMSSA and X!Tandem search algorithms and sequence databases on the Medical College of Wisconsin Proteomics Center website (http://proteomics.mcw.edu/vipdac). PMID:19358578
The BioIntelligence Framework: a new computational platform for biomedical knowledge computing
Farley, Toni; Kiefer, Jeff; Lee, Preston; Von Hoff, Daniel; Trent, Jeffrey M; Colbourn, Charles
2013-01-01
Breakthroughs in molecular profiling technologies are enabling a new data-intensive approach to biomedical research, with the potential to revolutionize how we study, manage, and treat complex diseases. The next great challenge for clinical applications of these innovations will be to create scalable computational solutions for intelligently linking complex biomedical patient data to clinically actionable knowledge. Traditional database management systems (DBMS) are not well suited to representing complex syntactic and semantic relationships in unstructured biomedical information, introducing barriers to realizing such solutions. We propose a scalable computational framework for addressing this need, which leverages a hypergraph-based data model and query language that may be better suited for representing complex multi-lateral, multi-scalar, and multi-dimensional relationships. We also discuss how this framework can be used to create rapid learning knowledge base systems to intelligently capture and relate complex patient data to biomedical knowledge in order to automate the recovery of clinically actionable information. PMID:22859646
Demirkus, Meltem; Precup, Doina; Clark, James J; Arbel, Tal
2016-06-01
Recent literature shows that facial attributes, i.e., contextual facial information, can be beneficial for improving the performance of real-world applications, such as face verification, face recognition, and image search. Examples of face attributes include gender, skin color, facial hair, etc. How to robustly obtain these facial attributes (traits) is still an open problem, especially in the presence of the challenges of real-world environments: non-uniform illumination conditions, arbitrary occlusions, motion blur and background clutter. What makes this problem even more difficult is the enormous variability presented by the same subject, due to arbitrary face scales, head poses, and facial expressions. In this paper, we focus on the problem of facial trait classification in real-world face videos. We have developed a fully automatic hierarchical and probabilistic framework that models the collective set of frame class distributions and feature spatial information over a video sequence. The experiments are conducted on a large real-world face video database that we have collected, labelled and made publicly available. The proposed method is flexible enough to be applied to any facial classification problem. Experiments on the large, real-world video database McGillFaces [1] of 18,000 video frames reveal that the proposed framework outperforms alternative approaches by up to 16.96% and 10.13% for the facial attributes of gender and facial hair, respectively.
User's Manual for RESRAD-OFFSITE Version 2.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Yu, C.; Gnanapragasam, E.; Biwer, B. M.
2007-09-05
The RESRAD-OFFSITE code is an extension of the RESRAD (onsite) code, which has been widely used for calculating doses and risks from exposure to radioactively contaminated soils. The development of RESRAD-OFFSITE started more than 10 years ago, but new models and methodologies have been developed, tested, and incorporated since then. Some of the new models have been benchmarked against other independently developed (international) models. The databases used have also expanded to include all the radionuclides (more than 830) contained in the International Commission on Radiological Protection (ICRP) 38 database. This manual provides detailed information on the design and application of the RESRAD-OFFSITE code. It describes in detail the new models used in the code, such as the three-dimensional dispersion groundwater flow and radionuclide transport model, the Gaussian plume model for atmospheric dispersion, and the deposition model used to estimate the accumulation of radionuclides in offsite locations and in foods. Potential exposure pathways and exposure scenarios that can be modeled by the RESRAD-OFFSITE code are also discussed. A user's guide is included in Appendix A of this manual. The default parameter values and parameter distributions are presented in Appendix B, along with a discussion on the statistical distributions for probabilistic analysis. A detailed discussion on how to reduce run time, especially when conducting probabilistic (uncertainty) analysis, is presented in Appendix C of this manual.
DataSpread: Unifying Databases and Spreadsheets.
Bendre, Mangesh; Sun, Bofan; Zhang, Ding; Zhou, Xinyan; Chang, Kevin ChenChuan; Parameswaran, Aditya
2015-08-01
Spreadsheet software is often the tool of choice for ad-hoc tabular data management, processing, and visualization, especially on tiny data sets. On the other hand, relational database systems offer significant power, expressivity, and efficiency over spreadsheet software for data management, while lacking in the ease of use and ad-hoc analysis capabilities. We demonstrate DataSpread, a data exploration tool that holistically unifies databases and spreadsheets. It continues to offer a Microsoft Excel-based spreadsheet front-end, while in parallel managing all the data in a back-end database, specifically, PostgreSQL. DataSpread retains all the advantages of spreadsheets, including ease of use, ad-hoc analysis and visualization capabilities, and a schema-free nature, while also adding the advantages of traditional relational databases, such as scalability and the ability to use arbitrary SQL to import, filter, or join external or internal tables and have the results appear in the spreadsheet. DataSpread needs to reason about and reconcile differences in the notions of schema, addressing of cells and tuples, and the current "pane" (which exists in spreadsheets but not in traditional databases), and support data modifications at both the front-end and the back-end. Our demonstration will center on our first and early prototype of the DataSpread, and will give the attendees a sense for the enormous data exploration capabilities offered by unifying spreadsheets and databases.
Mobile object retrieval in server-based image databases
NASA Astrophysics Data System (ADS)
Manger, D.; Pagel, F.; Widak, H.
2013-05-01
The increasing number of mobile phones equipped with powerful cameras leads to huge collections of user-generated images. To utilize the information of the images on site, image retrieval systems are becoming more and more popular as a way to search for similar objects in one's own image database. As the computational performance and the memory capacity of mobile devices are constantly increasing, this search can often be performed on the device itself. This is feasible, for example, if the images are represented with global image features or if the search is done using EXIF or textual metadata. However, for larger image databases, if multiple users are meant to contribute to a growing image database, or if powerful content-based image retrieval methods with local features are required, a server-based image retrieval backend is needed. In this work, we present a content-based image retrieval system with a client-server architecture working with local features. On the server side, the scalability to large image databases is addressed with the popular bag-of-words model with state-of-the-art extensions. The client end of the system focuses on a lightweight user interface presenting the most similar images in the database and highlighting the visual information they have in common with the query image. Additionally, new images can be added to the database, making it a powerful and interactive tool for mobile content-based image retrieval.
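A toy version of the server-side bag-of-visual-words pipeline: cluster local descriptors into a vocabulary, represent each image as a normalized word histogram, and rank by cosine similarity. Random vectors stand in for real SIFT-like features, and a production system would add inverted files and the other extensions mentioned:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(5)

# Stand-ins for local features (e.g. SIFT): each image yields many 128-D descriptors
def extract_descriptors(n_patches):
    return rng.random((n_patches, 128))

db_images = [extract_descriptors(rng.integers(100, 300)) for _ in range(50)]

# Visual vocabulary: cluster a subsample of all database descriptors
vocab = KMeans(n_clusters=64, n_init=4, random_state=0)
vocab.fit(np.vstack(db_images)[::5])            # subsample for speed

def bow_histogram(desc):
    words = vocab.predict(desc)
    h = np.bincount(words, minlength=64).astype(float)
    return h / (np.linalg.norm(h) + 1e-12)      # L2-normalised histogram

db_hists = np.array([bow_histogram(d) for d in db_images])

# Server-side query: rank database images by cosine similarity to the query
query = bow_histogram(extract_descriptors(150))
scores = db_hists @ query
print(np.argsort(scores)[::-1][:5])             # top-5 most similar image ids
```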
Mining Quality Phrases from Massive Text Corpora
Liu, Jialu; Shang, Jingbo; Wang, Chi; Ren, Xiang; Han, Jiawei
2015-01-01
Text data are ubiquitous and play an essential role in big data applications. However, text data are mostly unstructured. Transforming unstructured text into structured units (e.g., semantically meaningful phrases) will substantially reduce semantic ambiguity and enhance the power and efficiency at manipulating such data using database technology. Thus mining quality phrases is a critical research problem in the field of databases. In this paper, we propose a new framework that extracts quality phrases from text corpora integrated with phrasal segmentation. The framework requires only limited training but the quality of phrases so generated is close to human judgment. Moreover, the method is scalable: both computation time and required space grow linearly as corpus size increases. Our experiments on large text corpora demonstrate the quality and efficiency of the new method. PMID:26705375
Harrigan, Robert L; Yvernault, Benjamin C; Boyd, Brian D; Damon, Stephen M; Gibney, Kyla David; Conrad, Benjamin N; Phillips, Nicholas S; Rogers, Baxter P; Gao, Yurui; Landman, Bennett A
2016-01-01
The Vanderbilt University Institute for Imaging Science (VUIIS) Center for Computational Imaging (CCI) has developed a database built on XNAT housing over a quarter of a million scans. The database provides a framework for (1) rapid prototyping, (2) large-scale batch processing of images and (3) scalable project management. The system uses the web-based interfaces of XNAT and REDCap to allow for graphical interaction. A Python middleware layer, the Distributed Automation for XNAT (DAX) package, distributes computation across the Vanderbilt Advanced Computing Center for Research and Education high performance computing center. All software is made available in open source for use in combining portable batch scripting (PBS) grids and XNAT servers.
GLAD: a system for developing and deploying large-scale bioinformatics grid.
Teo, Yong-Meng; Wang, Xianbing; Ng, Yew-Kwong
2005-03-01
Grid computing is used to solve large-scale bioinformatics problems with gigabyte-scale databases by distributing the computation across multiple platforms. Until now, in developing bioinformatics grid applications, it has been extremely tedious to design and implement the component algorithms and parallelization techniques for different classes of problems, and to access remotely located sequence database files of varying formats across the grid. In this study, we propose a grid programming toolkit, GLAD (Grid Life sciences Applications Developer), which facilitates the development and deployment of bioinformatics applications on a grid. GLAD has been developed using ALiCE (Adaptive scaLable Internet-based Computing Engine), a Java-based grid middleware which exploits task-based parallelism. Two bioinformatics benchmark applications, distributed sequence comparison and distributed progressive multiple sequence alignment, have been developed using GLAD.
Scalable Algorithms for Global Scale Remote Sensing Applications
NASA Astrophysics Data System (ADS)
Vatsavai, R. R.; Bhaduri, B. L.; Singh, N.
2015-12-01
The recent decade has witnessed major changes on the Earth, for example, deforestation, varying cropping and human settlement patterns, and crippling damage due to disasters. Accurate assessment of the damage caused by major natural and anthropogenic disasters is becoming critical due to increases in human and economic loss. This increase in loss of life and severe damage can be attributed to the growing population, as well as to human migration to the disaster-prone regions of the world. Rapid assessment of these changes and dissemination of accurate information is critical for creating an effective emergency response. Change detection using high-resolution satellite images is a primary tool in assessing damage, monitoring biomass and critical infrastructure, and identifying new settlements. Existing change detection methods suffer from registration errors and are often based on pixel-wise (location-wise) comparison of spectral observations from a single sensor. In this paper we present a novel probabilistic change detection framework based on patch comparison, together with a GPU implementation that supports a near-real-time rapid damage exploration capability.
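A minimal sketch of patch-based change scoring under simple assumptions: two co-registered grayscale scenes are divided into patches, and each patch pair is compared via a symmetric Kullback-Leibler divergence between intensity histograms. The paper's probabilistic framework and GPU kernel are richer than this.

```python
# Patch-based change scoring sketch for two co-registered 8-bit scenes.
import numpy as np

def patch_hist(patch, bins=32):
    h, _ = np.histogram(patch, bins=bins, range=(0, 255))
    h = h.astype(np.float64) + 1e-9          # smooth to avoid log(0)
    return h / h.sum()

def change_map(img_a, img_b, size=16):
    rows, cols = img_a.shape[0] // size, img_a.shape[1] // size
    out = np.zeros((rows, cols))
    for i in range(rows):
        for j in range(cols):
            sl = np.s_[i*size:(i+1)*size, j*size:(j+1)*size]
            p, q = patch_hist(img_a[sl]), patch_hist(img_b[sl])
            # Symmetric KL divergence as the change score for this patch.
            out[i, j] = np.sum(p * np.log(p / q) + q * np.log(q / p))
    return out
```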
A Scalable Distribution Network Risk Evaluation Framework via Symbolic Dynamics
Yuan, Kai; Liu, Jian; Liu, Kaipei; Tan, Tianyuan
2015-01-01
Background Evaluations of electric power distribution network risks must address the problems of incomplete information and changing dynamics. A risk evaluation framework should be adaptable to a specific situation and to an evolving understanding of risk. Methods This study investigates the use of symbolic dynamics to abstract raw data. After introducing symbolic dynamics operators, Kolmogorov-Sinai entropy and Kullback-Leibler relative entropy are used to quantitatively evaluate relationships between risk sub-factors and main factors. For layered risk indicators, where the factors are categorized into four main factors – device, structure, load and special operation – a merging algorithm that uses the operators to calculate the risk factors is discussed. Finally, an example from the Sanya Power Company is given to demonstrate the feasibility of the proposed method. Conclusion Distribution networks are exposed to their environment and can be affected by many factors. The topology and the operating mode of a distribution network are dynamic, so faults and their consequences are probabilistic. PMID:25789859
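As a rough analogue of the entropy-based operators described, the sketch below symbolizes two series via quantile binning and computes the Kullback-Leibler relative entropy between their symbol distributions; the alphabet size and binning are illustrative choices, not the paper's.

```python
# Symbolization plus Kullback-Leibler relative entropy between a risk
# sub-factor series and a main-factor series.
import numpy as np

def symbolize(series, n_symbols=4):
    """Map a real-valued series onto a small alphabet via quantile bins."""
    edges = np.quantile(series, np.linspace(0, 1, n_symbols + 1)[1:-1])
    return np.digitize(series, edges)

def symbol_dist(symbols, n_symbols=4):
    counts = np.bincount(symbols, minlength=n_symbols) + 1e-9
    return counts / counts.sum()

def kl_relative_entropy(sub_series, main_series, n_symbols=4):
    p = symbol_dist(symbolize(sub_series, n_symbols), n_symbols)
    q = symbol_dist(symbolize(main_series, n_symbols), n_symbols)
    return float(np.sum(p * np.log(p / q)))
```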
EMR Database Upgrade from MUMPS to CACHE: Lessons Learned.
Alotaibi, Abduallah; Emshary, Mshary; Househ, Mowafa
2014-01-01
Over the past few years, Saudi hospitals have been implementing and upgrading Electronic Medical Record Systems (EMRs) to ensure secure data transfer and exchange between EMRs. This paper focuses on the process and lessons learned in upgrading the MUMPS database to the newer Caché database to ensure the integrity of electronic data transfer within a local Saudi hospital. This paper examines the steps taken by the departments concerned, their action plans, and how the change process was managed. Results show that user satisfaction was achieved after the upgrade was completed. The system was stable and offered better healthcare quality to patients as a result of the data exchange. Hardware infrastructure upgrades improved scalability, and software upgrades to Caché improved stability. The overall performance was enhanced and new functions (CPOE) were added during the upgrades. The lessons learned were: 1) involve higher management; 2) research the multiple solutions available in the market; 3) plan for a variety of implementation scenarios.
SiC: An Agent Based Architecture for Preventing and Detecting Attacks to Ubiquitous Databases
NASA Astrophysics Data System (ADS)
Pinzón, Cristian; de Paz, Yanira; Bajo, Javier; Abraham, Ajith; Corchado, Juan M.
One of the main attacks on ubiquitous databases is the Structured Query Language (SQL) injection attack, which causes severe damage both in the commercial aspect and in user confidence. This chapter proposes the SiC architecture as a solution to the SQL injection attack problem. This is a hierarchical distributed multiagent architecture, which involves an entirely new approach with respect to existing architectures for the prevention and detection of SQL injections. SiC incorporates a kind of intelligent agent that integrates a case-based reasoning system. This agent, which is the core of the architecture, allows the application of detection techniques based on anomalies as well as those based on patterns, providing a great degree of autonomy, flexibility, robustness and dynamic scalability. The characteristics of the multiagent system allow the architecture to detect attacks from different types of devices, regardless of their physical location. The architecture has been tested on a medical database, guaranteeing safe access from various devices such as PDAs and notebook computers.
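The pattern-plus-anomaly combination can be illustrated with a toy detector: signature regexes catch known injection shapes, while an anomaly score flags queries whose tokens fall outside a learned baseline of benign queries. The case-based reasoning agent at the core of SiC is far richer than this sketch.

```python
# Toy hybrid detector: signature (pattern) checks plus a simple anomaly score.
import re
from collections import Counter

SIGNATURES = [r"(?i)\bunion\b.+\bselect\b", r"(?i)or\s+1\s*=\s*1", r"--", r";\s*drop\b"]

def signature_hit(query):
    return any(re.search(sig, query) for sig in SIGNATURES)

def anomaly_score(query, baseline_token_freq):
    """Fraction of tokens never seen in the baseline of benign queries."""
    tokens = re.findall(r"\w+|\S", query.lower())
    unseen = sum(1 for t in tokens if baseline_token_freq[t] == 0)
    return unseen / max(len(tokens), 1)

def is_suspicious(query, baseline, threshold=0.4):
    return signature_hit(query) or anomaly_score(query, baseline) > threshold

baseline = Counter("select name from users where id = ?".lower().split())
print(is_suspicious("SELECT * FROM users WHERE id = 1 OR 1=1", baseline))  # True
```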
Modular Bayesian Networks with Low-Power Wearable Sensors for Recognizing Eating Activities
Kim, Kee-Hoon
2017-01-01
Recently, recognizing a user's daily activity using a smartphone and wearable sensors has become a popular issue. However, in contrast with idealized experimental settings, real life contains numerous complex activities shaped by varying backgrounds and contexts: time, space, age, culture, and so on. Recognizing these complex activities with limited low-power sensors, while accounting for the power and memory constraints of the wearable environment and the obtrusiveness to the user, is not an easy problem, although it is crucial for the activity recognizer to be practically useful. In this paper, we recognize the activity of eating, one of the most typical examples of a complex activity, using only everyday low-power mobile and wearable sensors. To organize the related contexts systematically, we construct a context model based on activity theory and the "Five W's", and propose a Bayesian network with 88 nodes to predict uncertain contexts probabilistically. The structure of the proposed Bayesian network is designed in a modular and tree-structured way to reduce the time complexity and increase the scalability. To evaluate the proposed method, we collected data on 10 different activities from 25 volunteers of various ages and occupations, and obtained 79.71% accuracy, which outperforms other conventional classifiers by 7.54-14.4%. Analyses of the results showed that our probabilistic approach can also give approximate results even when one of the contexts or sensor values has a very heterogeneous pattern or is missing. PMID:29232937
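A two-node slice of the idea, assuming the pgmpy library: observed low-power contexts feed a small Bayesian network that yields a posterior over the uncertain "Eating" context. The structure, states, and CPD numbers below are invented; the paper's network has 88 nodes organized in a modular tree.

```python
# Tiny Bayesian network for an uncertain "Eating" context (pgmpy assumed).
from pgmpy.models import BayesianNetwork
from pgmpy.factors.discrete import TabularCPD
from pgmpy.inference import VariableElimination

model = BayesianNetwork([("MealTime", "Eating"), ("AtTable", "Eating")])
model.add_cpds(
    TabularCPD("MealTime", 2, [[0.7], [0.3]]),            # P(MealTime)
    TabularCPD("AtTable", 2, [[0.6], [0.4]]),             # P(AtTable)
    TabularCPD("Eating", 2,                                # P(Eating | parents)
               [[0.95, 0.6, 0.7, 0.15],                    # row: not eating
                [0.05, 0.4, 0.3, 0.85]],                   # row: eating
               evidence=["MealTime", "AtTable"], evidence_card=[2, 2]),
)
# Posterior over Eating given an observed context; missing contexts are simply
# marginalized out, mirroring the robustness noted in the abstract.
posterior = VariableElimination(model).query(["Eating"], evidence={"MealTime": 1})
print(posterior)
```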
Learning and recognition of on-premise signs from weakly labeled street view images.
Tsai, Tsung-Hung; Cheng, Wen-Huang; You, Chuang-Wen; Hu, Min-Chun; Tsui, Arvin Wen; Chi, Heng-Yu
2014-03-01
Camera-enabled mobile devices are commonly used as interaction platforms for linking the user's virtual and physical worlds in numerous research and commercial applications, such as serving as an augmented reality interface for mobile information retrieval. The various application scenarios give rise to a key technical requirement: visual recognition of objects encountered in daily life. On-premise signs (OPSs), a popular form of commercial advertising, are widely used in daily life. OPSs often exhibit great visual diversity (e.g., appearing in arbitrary sizes), accompanied by complex environmental conditions (e.g., foreground and background clutter). Observing that such real-world characteristics are lacking in most existing image data sets, in this paper we first propose an OPS data set, namely OPS-62, in which a total of 4649 OPS images of 62 different businesses were collected from Google's Street View. Further, to address the problem of real-world OPS learning and recognition, we develop a probabilistic framework based on distributional clustering, in which we exploit the distributional information of each visual feature (the distribution of its associated OPS labels) as a reliable selection criterion for building discriminative OPS models. Experiments on the OPS-62 data set demonstrate that our approach outperforms the state-of-the-art probabilistic latent semantic analysis models, yielding more accurate recognition and fewer false alarms, with a significant 151.28% relative improvement in the average recognition rate. Meanwhile, our approach is simple, linear, and can be executed in a parallel fashion, making it practical and scalable for large-scale multimedia applications.
A Machine Reading System for Assembling Synthetic Paleontological Databases
Peters, Shanan E.; Zhang, Ce; Livny, Miron; Ré, Christopher
2014-01-01
Many aspects of macroevolutionary theory and our understanding of biotic responses to global environmental change derive from literature-based compilations of paleontological data. Existing manually assembled databases are, however, incomplete and difficult to assess and enhance with new data types. Here, we develop and validate the quality of a machine reading system, PaleoDeepDive, that automatically locates and extracts data from heterogeneous text, tables, and figures in publications. PaleoDeepDive performs comparably to humans in several complex data extraction and inference tasks and generates congruent synthetic results that describe the geological history of taxonomic diversity and genus-level rates of origination and extinction. Unlike traditional databases, PaleoDeepDive produces a probabilistic database that systematically improves as information is added. We show that the system can readily accommodate sophisticated data types, such as morphological data in biological illustrations and associated textual descriptions. Our machine reading approach to scientific data integration and synthesis brings within reach many questions that are currently underdetermined and does so in ways that may stimulate entirely new modes of inquiry. PMID:25436610
Hammarstrom, Jane M.; Bookstrom, Arthur A.; Dicken, Connie L.; Drenth, Benjamin J.; Ludington, Steve; Robinson, Gilpin R.; Setiabudi, Bambang Tjahjono; Sukserm, Wudhikarn; Sunuhadi, Dwi Nugroho; Wah, Alexander Yan Sze; Zientek, Michael L.
2013-01-01
This assessment includes an overview of the assessment results with summary tables. Detailed descriptions of each tract are included in appendixes, with estimates of numbers of undiscovered deposits, and probabilistic estimates of amounts of copper, molybdenum, gold, and silver that could be contained in undiscovered deposits for each permissive tract. A geographic information system (GIS) that accompanies the report includes tract boundaries and a database of known porphyry copper deposits and significant prospects.
Ludington, Steve; Mihalasky, Mark J.; Hammarstrom, Jane M.; Robinson, Giplin R.; Frost, Thomas P.; Gans, Kathleen D.; Light, Thomas D.; Miller, Robert J.; Alexeiev, Dmitriy V.
2012-01-01
This report includes an overview of the assessment results and summary tables. Descriptions of each tract are included in appendixes, with estimates of numbers of undiscovered deposits, and probabilistic estimates of amounts of copper, molybdenum, gold, and silver that could be contained in undiscovered deposits for each permissive tract. A geographic information system that accompanies the report includes tract boundaries and a database of known porphyry copper deposits and prospects.
Probabilistic seismic hazard estimates incorporating site effects - An example from Indiana, U.S.A
Hasse, J.S.; Park, C.H.; Nowack, R.L.; Hill, J.R.
2010-01-01
The U.S. Geological Survey (USGS) has published probabilistic earthquake hazard maps for the United States based on current knowledge of past earthquake activity and geological constraints on earthquake potential. These maps for the central and eastern United States assume standard site conditions with S-wave velocities of 760 m/s in the top 30 m. For urban and infrastructure planning and long-term budgeting, the public is interested in similar probabilistic seismic hazard maps that take into account near-surface geological materials. We have implemented a probabilistic method for incorporating site effects into the USGS seismic hazard analysis that takes into account the first-order effects of surface geologic conditions. The thicknesses of sediments, which play a large role in amplification, were derived from a P-wave refraction database with over 13,000 profiles, and a preliminary geology-based velocity model was constructed from available information on S-wave velocities. An interesting feature of the preliminary hazard maps incorporating site effects is the approximate factor-of-two increase in the 1-Hz spectral acceleration with 2 percent probability of exceedance in 50 years for parts of the greater Indianapolis metropolitan region and surrounding parts of central Indiana. This effect is primarily due to the relatively thick sequence of sediments infilling ancient bedrock topography that has been deposited since the Pleistocene Epoch. As expected, the Late Pleistocene and Holocene depositional systems of the Wabash and Ohio Rivers produce additional amplification in the southwestern part of Indiana. Ground motions decrease, as would be expected, toward the bedrock units in south-central Indiana, where motions are significantly lower than the values on the USGS maps.
Probabilistic topic modeling for the analysis and classification of genomic sequences
2015-01-01
Background Studies on genomic sequences for classification and taxonomic identification have a leading role in the biomedical field and in the analysis of biodiversity. These studies focus on the so-called barcode genes, representing a well-defined region of the whole genome. Recently, alignment-free techniques have been gaining importance because they are able to overcome the drawbacks of sequence alignment techniques. In this paper a new alignment-free method for DNA sequence clustering and classification is proposed. The method is based on k-mer representation and text mining techniques. Methods The presented method is based on probabilistic topic modeling, a statistical technique originally proposed for text documents. Probabilistic topic models are able to find, in a document corpus, the topics (recurrent themes) characterizing classes of documents. This technique, applied to DNA sequences representing the documents, exploits the frequency of fixed-length k-mers and builds a generative model for a training group of sequences. This generative model, obtained through the Latent Dirichlet Allocation (LDA) algorithm, is then used to classify a large set of genomic sequences. Results and conclusions We performed classification of over 7000 16S DNA barcode sequences taken from the Ribosomal Database Project (RDP) repository, training probabilistic topic models. The proposed method is compared to the RDP tool and the Support Vector Machine (SVM) classification algorithm in an extensive set of trials using both complete sequences and short sequence snippets (from 400 bp to 25 bp). Our method reaches very similar results to the RDP classifier and SVM for complete sequences. The most interesting results are obtained with short sequence snippets. Under these conditions the proposed method outperforms RDP and SVM on ultra-short sequences, and it exhibits a smooth decrease of performance, at every taxonomic level, as the sequence length is decreased. PMID:25916734
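The pipeline can be approximated in a few lines, with scikit-learn standing in for the authors' LDA implementation: each sequence becomes a document of overlapping k-mers, and the fitted model yields per-sequence topic mixtures that a downstream classifier could consume.

```python
# Topic modeling over k-mer "words" (scikit-learn as a stand-in).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

sequences = ["ACGTACGGTT", "ACGTTTGGAA", "TTGGAACCGT"]   # toy training set

# A char analyzer with ngram_range=(k, k) yields all overlapping k-mers.
k = 4
vectorizer = CountVectorizer(analyzer="char", ngram_range=(k, k), lowercase=False)
X = vectorizer.fit_transform(sequences)

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
# Per-sequence topic mixtures; in the paper these feed classification.
print(lda.transform(X))
```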
Blind image quality assessment via probabilistic latent semantic analysis.
Yang, Xichen; Sun, Quansen; Wang, Tianshu
2016-01-01
We propose a blind image quality assessment method that is unsupervised and training-free. The new method is based on the hypothesis that the effect caused by distortion can be expressed by certain latent characteristics. Combined with probabilistic latent semantic analysis, the latent characteristics can be discovered by applying a topic model over a visual word dictionary. Four distortion-affected features are extracted to form the visual words in the dictionary: (1) the block-based local histogram; (2) the block-based local mean value; (3) the mean value of contrast within a block; (4) the variance of contrast within a block. Based on the dictionary, the latent topics in the images can be discovered. The discrepancy between the frequency of the topics in an unfamiliar image and in a large number of pristine images is used to measure the image quality. Experimental results for four open databases show that the newly proposed method correlates well with human subjective judgments of diversely distorted images.
An improved probabilistic account of counterfactual reasoning.
Lucas, Christopher G; Kemp, Charles
2015-10-01
When people want to identify the causes of an event, assign credit or blame, or learn from their mistakes, they often reflect on how things could have gone differently. In this kind of reasoning, one considers a counterfactual world in which some events are different from their real-world counterparts and considers what else would have changed. Researchers have recently proposed several probabilistic models that aim to capture how people do (or should) reason about counterfactuals. We present a new model and show that it accounts better for human inferences than several alternative models. Our model builds on the work of Pearl (2000), and extends his approach in a way that accommodates backtracking inferences and that acknowledges the difference between counterfactual interventions and counterfactual observations. We present 6 new experiments and analyze data from 4 experiments carried out by Rips (2010), and the results suggest that the new model provides an accurate account of both mean human judgments and the judgments of individuals. (PsycINFO Database Record (c) 2015 APA, all rights reserved).
Jones, Michael N.
2017-01-01
A central goal of cognitive neuroscience is to decode human brain activity—that is, to infer mental processes from observed patterns of whole-brain activation. Previous decoding efforts have focused on classifying brain activity into a small set of discrete cognitive states. To attain maximal utility, a decoding framework must be open-ended, systematic, and context-sensitive—that is, capable of interpreting numerous brain states, presented in arbitrary combinations, in light of prior information. Here we take steps towards this objective by introducing a probabilistic decoding framework based on a novel topic model—Generalized Correspondence Latent Dirichlet Allocation—that learns latent topics from a database of over 11,000 published fMRI studies. The model produces highly interpretable, spatially-circumscribed topics that enable flexible decoding of whole-brain images. Importantly, the Bayesian nature of the model allows one to “seed” decoder priors with arbitrary images and text—enabling researchers, for the first time, to generate quantitative, context-sensitive interpretations of whole-brain patterns of brain activity. PMID:29059185
NASA Astrophysics Data System (ADS)
Land, Walker H., Jr.; Anderson, Frances; Smith, Tom; Fahlbusch, Stephen; Choma, Robert; Wong, Lut
2005-04-01
Achieving consistent and correct database cases is crucial to the correct evaluation of any computer-assisted diagnostic (CAD) paradigm. This paper describes the application of artificial intelligence (AI), knowledge engineering (KE) and knowledge representation (KR) to a data set of ~2500 cases from six separate hospitals, with the objective of removing or reducing inconsistent outlier data. Several support vector machine (SVM) kernels were used to measure the diagnostic performance of the original and a "cleaned" data set. Specifically, KE and KR principles were applied to the two data sets, which were re-examined with respect to the environment and agents. One data set was found to contain 25 non-characterizable sets; the other contained 180 non-characterizable sets. CAD system performance was measured with both the original and "cleaned" data sets using two SVM kernels as well as a multivariate probabilistic neural network (PNN). Results demonstrated: (i) a 10% average improvement in overall Az and (ii) approximately a 50% average improvement in partial Az.
A probabilistic safety analysis of incidents in nuclear research reactors.
Lopes, Valdir Maciel; Agostinho Angelo Sordi, Gian Maria; Moralles, Mauricio; Filho, Tufic Madi
2012-06-01
This work aims to evaluate the potential risks of incidents in nuclear research reactors. For its development, two databases of the International Atomic Energy Agency (IAEA) were used: the Research Reactor Data Base (RRDB) and the Incident Report System for Research Reactors (IRSRR). For this study, probabilistic safety analysis (PSA) was used. To obtain the probability calculations for the PSA, the theory and equations in IAEA-TECDOC-636 were used. A specific program to analyse the probabilities was developed within the main program, Scilab 5.1.1, for two distributions, Fisher and chi-square, both with a confidence level of 90%. Using Sordi's equations, the maximum admissible doses were obtained for comparison with the risk limits established by the International Commission on Radiological Protection (ICRP). All results of this probability analysis led to the conclusion that the incidents which occurred involved radiation doses within the stochastic-effects reference interval established in ICRP-64.
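The standard construction behind such PSA frequency estimates is the exact Poisson confidence interval obtained from the chi-square distribution; a sketch with illustrative numbers (not the study's data) follows.

```python
# Chi-square confidence bounds for a Poisson incident rate (90% confidence
# level as in the study; the counts and exposure below are illustrative).
from scipy.stats import chi2

def rate_bounds(n_incidents, reactor_years, conf=0.90):
    alpha = 1.0 - conf
    lower = (chi2.ppf(alpha / 2, 2 * n_incidents) / (2 * reactor_years)
             if n_incidents else 0.0)
    upper = chi2.ppf(1 - alpha / 2, 2 * (n_incidents + 1)) / (2 * reactor_years)
    return lower, upper

print(rate_bounds(n_incidents=3, reactor_years=1500.0))  # events per reactor-year
```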
Astrobiological complexity with probabilistic cellular automata.
Vukotić, Branislav; Ćirković, Milan M
2012-08-01
The search for extraterrestrial life and intelligence constitutes one of the major endeavors in science, yet it has as yet been quantitatively modeled only rarely, and then in a cursory and superficial fashion. We argue that probabilistic cellular automata (PCA) represent the best quantitative framework for modeling the astrobiological history of the Milky Way and its Galactic Habitable Zone. The relevant astrobiological parameters are modeled as the elements of the input probability matrix for the PCA kernel. With the underlying simplicity of cellular automata constructs, this approach enables a quick analysis of the large and ambiguous space of input parameters. We perform a simple clustering analysis of typical astrobiological histories under a "Copernican" choice of input parameters and discuss the relevant boundary conditions of practical importance for planning and guiding empirical astrobiological and SETI projects. In addition to showing how the present framework is adaptable to more complex situations and to updated observational databases from current and near-future space missions, we demonstrate how numerical results can offer a cautious rationale for the continuation of practical SETI searches.
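A minimal probabilistic cellular automaton of this flavor: each empty site becomes "inhabited" with a probability that grows with the number of inhabited neighbors. The transition probabilities below are toys; in the paper they would be derived from astrobiological input parameters.

```python
# Minimal 2-D probabilistic cellular automaton with periodic boundaries.
import numpy as np

rng = np.random.default_rng(0)

def step(grid, p_spont=0.001, p_spread=0.05):
    # Count the 8 neighbors of every cell (toroidal wrap-around).
    nbrs = sum(np.roll(np.roll(grid, di, 0), dj, 1)
               for di in (-1, 0, 1) for dj in (-1, 0, 1) if (di, dj) != (0, 0))
    # Probability a site turns on: spontaneous emergence or spread from neighbors.
    p_on = 1.0 - (1.0 - p_spont) * (1.0 - p_spread) ** nbrs
    born = (rng.random(grid.shape) < p_on) & (grid == 0)
    return grid | born

grid = np.zeros((100, 100), dtype=int)
for _ in range(200):
    grid = step(grid)
print("inhabited sites:", int(grid.sum()))
```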
Studies of Big Data metadata segmentation between relational and non-relational databases
NASA Astrophysics Data System (ADS)
Golosova, M. V.; Grigorieva, M. A.; Klimentov, A. A.; Ryabinkin, E. A.; Dimitrov, G.; Potekhin, M.
2015-12-01
In recent years the concept of Big Data has become well established in IT. Systems managing large data volumes produce metadata that describe the data and workflows. These metadata are used to obtain information about the current system state and for statistical and trend analysis of the processes these systems drive. Over time the amount of stored metadata can grow dramatically. In this article we present our studies demonstrating how metadata storage scalability and performance can be improved by using a hybrid RDBMS/NoSQL architecture.
Open system environment procurement
NASA Technical Reports Server (NTRS)
Fisher, Gary
1994-01-01
Relationships between the request for procurement (RFP) process and open system environment (OSE) standards are described. A guide was prepared to help Federal agency personnel overcome problems in writing an adequate statement of work and developing realistic evaluation criteria when transitioning to an OSE. The guide contains appropriate decision points and transition strategies for developing applications that are affordable, scalable and interoperable across a broad range of computing environments. While useful, the guide does not eliminate the requirement that agencies possess in-depth expertise in software development, communications, and database technology in order to evaluate open systems.
Virtual file system on NoSQL for processing high volumes of HL7 messages.
Kimura, Eizen; Ishihara, Ken
2015-01-01
The Standardized Structured Medical Information Exchange (SS-MIX) is intended to be the standard repository for HL7 messages, but it depends on a local file system and its scalability is therefore limited. We implemented a virtual file system using NoSQL to incorporate modern computing technology into SS-MIX and to allow the system to integrate local patient IDs from different healthcare systems into a universal system. We discuss its implementation using the database MongoDB and describe its performance in a case study.
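A sketch of the storage idea, assuming pymongo and invented collection/field names: raw HL7 v2 messages are stored keyed by a universal patient ID, so local IDs from different healthcare systems resolve to one record.

```python
# Storing raw HL7 v2 messages in MongoDB, keyed by a universal patient ID
# (collection and field names are invented for illustration).
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client.ssmix_virtual

def store_message(universal_pid, local_system, local_pid, raw_hl7):
    db.messages.insert_one({
        "universal_pid": universal_pid,   # system-wide patient identity
        "local_system": local_system,     # originating healthcare system
        "local_pid": local_pid,           # that system's own patient ID
        "message": raw_hl7,               # raw HL7 v2 payload
    })

def messages_for_patient(universal_pid):
    return list(db.messages.find({"universal_pid": universal_pid}))

db.messages.create_index([("universal_pid", 1), ("local_system", 1)])
```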
Parameter Studies, time-dependent simulations and design with automated Cartesian methods
NASA Technical Reports Server (NTRS)
Aftosmis, Michael
2005-01-01
Over the past decade, NASA has made a substantial investment in developing adaptive Cartesian grid methods for aerodynamic simulation. Cartesian-based methods played a key role in both the Space Shuttle Accident Investigation and in NASA's return-to-flight activities. The talk will provide an overview of recent technological developments, focusing on the generation of large-scale aerodynamic databases, automated CAD-based design, and time-dependent simulations of bodies in relative motion. Automation, scalability and robustness underlie all of these applications, and research in each of these topics will be presented.
Asking better questions: How presentation formats influence information search.
Wu, Charley M; Meder, Björn; Filimon, Flavia; Nelson, Jonathan D
2017-08-01
While the influence of presentation formats has been widely studied in Bayesian reasoning tasks, we present the first systematic investigation of how presentation formats influence information search decisions. Four experiments were conducted across different probabilistic environments, where subjects (N = 2,858) chose between 2 possible search queries, each with binary probabilistic outcomes, with the goal of maximizing classification accuracy. We studied 14 different numerical and visual formats for presenting information about the search environment, constructed across 6 design features that have been prominently related to improvements in Bayesian reasoning accuracy (natural frequencies, posteriors, complement, spatial extent, countability, and part-to-whole information). The posterior variants of the icon array and bar graph formats led to the highest proportion of correct responses, and were substantially better than the standard probability format. Results suggest that presenting information in terms of posterior probabilities and visualizing natural frequencies using spatial extent (a perceptual feature) were especially helpful in guiding search decisions, although environments with a mixture of probabilistic and certain outcomes were challenging across all formats. Subjects who made more accurate probability judgments did not perform better on the search task, suggesting that simple decision heuristics may be used to make search decisions without explicitly applying Bayesian inference to compute probabilities. We propose a new take-the-difference (TTD) heuristic that identifies the accuracy-maximizing query without explicit computation of posterior probabilities. (PsycINFO Database Record (c) 2017 APA, all rights reserved).
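The Bayesian benchmark in such tasks can be computed directly from natural frequencies: the expected accuracy of a query is the sum, over its outcomes, of the larger joint frequency (guess the majority class after each outcome). The sketch below shows that computation on a toy environment; the TTD heuristic itself, which avoids posterior computation, is not reproduced here.

```python
# Expected classification accuracy of a binary query from natural frequencies.
def expected_accuracy(joint_counts):
    """joint_counts[outcome][cls] = natural frequency of (outcome, class)."""
    total = sum(sum(row) for row in joint_counts)
    # After each outcome, guess the more frequent class in that row.
    return sum(max(row) for row in joint_counts) / total

# Toy environment: 100 cases, two candidate queries.
query_a = [[30, 10],   # outcome 0: 30 of class 0, 10 of class 1
           [20, 40]]   # outcome 1: 20 of class 0, 40 of class 1
query_b = [[25, 25],
           [25, 25]]   # uninformative query

print(expected_accuracy(query_a), expected_accuracy(query_b))  # 0.7 vs 0.5
```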
A model-based test for treatment effects with probabilistic classifications.
Cavagnaro, Daniel R; Davis-Stober, Clintin P
2018-05-21
Within modern psychology, computational and statistical models play an important role in describing a wide variety of human behavior. Model selection analyses are typically used to classify individuals according to the model(s) that best describe their behavior. These classifications are inherently probabilistic, which presents challenges for performing group-level analyses, such as quantifying the effect of an experimental manipulation. We answer this challenge by presenting a method for quantifying treatment effects in terms of distributional changes in model-based (i.e., probabilistic) classifications across treatment conditions. The method uses hierarchical Bayesian mixture modeling to incorporate classification uncertainty at the individual level into the test for a treatment effect at the group level. We illustrate the method with several worked examples, including a reanalysis of the data from Kellen, Mata, and Davis-Stober (2017), and analyze its performance more generally through simulation studies. Our simulations show that the method is both more powerful and less prone to type-1 errors than Fisher's exact test when classifications are uncertain. In the special case where classifications are deterministic, we find a near-perfect power-law relationship between the Bayes factor, derived from our method, and the p value obtained from Fisher's exact test. We provide code in an online supplement that allows researchers to apply the method to their own data. (PsycINFO Database Record (c) 2018 APA, all rights reserved).
Electrochemical Impedance Sensors for Monitoring Trace Amounts of NO3 in Selected Growing Media.
Ghaffari, Seyed Alireza; Caron, William-O; Loubier, Mathilde; Normandeau, Charles-O; Viens, Jeff; Lamhamedi, Mohammed S; Gosselin, Benoit; Messaddeq, Younes
2015-07-21
With the advent of smart cities and big data, precision agriculture allows the feeding of sensor data into online databases for continuous crop monitoring, production optimization, and data storage. This paper describes a low-cost, compact, and scalable nitrate sensor based on electrochemical impedance spectroscopy for monitoring trace amounts of NO3- in selected growing media. The nitrate sensor can be integrated to conventional microelectronics to perform online nitrate sensing continuously over a wide concentration range from 0.1 ppm to 100 ppm, with a response time of about 1 min, and feed data into a database for storage and analysis. The paper describes the structural design, the Nyquist impedance response, the measurement sensitivity and accuracy, and the field testing of the nitrate sensor performed within tree nursery settings under ISO/IEC 17025 certifications.
High dimensional biological data retrieval optimization with NoSQL technology.
Wang, Shicai; Pandis, Ioannis; Wu, Chao; He, Sijin; Johnson, David; Emam, Ibrahim; Guitton, Florian; Guo, Yike
2014-01-01
High-throughput transcriptomic data generated by microarray experiments is the most abundant and frequently stored kind of data currently used in translational medicine studies. Although microarray data is supported in data warehouses such as tranSMART, queries that retrieve hundreds of different patients' gene expression records from relational databases are slow due to poor performance. Non-relational data models, such as the key-value model implemented in NoSQL databases, hold promise to be more performant solutions. Our motivation is to improve the performance of the tranSMART data warehouse with a view to supporting Next Generation Sequencing data. In this paper we introduce a new data model better suited for high-dimensional data storage and querying, optimized for database scalability and performance. We have designed a key-value pair data model to support faster queries over large-scale microarray data and implemented the model using HBase, an implementation of Google's BigTable storage system. An experimental performance comparison was carried out against the traditional relational data model implemented in both MySQL Cluster and MongoDB, using a large publicly available transcriptomic data set taken from NCBI GEO concerning Multiple Myeloma. Our new key-value data model implemented on HBase exhibits an average 5.24-fold increase in high-dimensional biological data query performance compared to the relational model implemented on MySQL Cluster, and an average 6.47-fold increase in query performance over MongoDB. The performance evaluation found that the new key-value data model, in particular its implementation in HBase, outperforms the relational model currently implemented in tranSMART. We propose that NoSQL technology holds great promise for large-scale data management, in particular for high-dimensional biological data such as that demonstrated in the performance evaluation described in this paper. We aim to use this new data model as a basis for migrating tranSMART's implementation to a more scalable solution for Big Data.
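One plausible HBase key design for this kind of workload, assuming the happybase client and invented table/column names: row key = dataset|gene with one column per patient, so a whole gene vector is a single row fetch.

```python
# Key design sketch for expression values in HBase (happybase client assumed;
# the paper's actual schema is not given here). Requires a running HBase
# Thrift server and a table "expression" with column family "e".
import happybase

conn = happybase.Connection("localhost")
table = conn.table("expression")

def put_value(dataset, gene, patient, value):
    row_key = f"{dataset}|{gene}".encode()
    table.put(row_key, {f"e:{patient}".encode(): str(value).encode()})

def gene_vector(dataset, gene):
    """Fetch all patients' expression values for one gene in a single get."""
    row = table.row(f"{dataset}|{gene}".encode())
    return {col.split(b":", 1)[1].decode(): float(v) for col, v in row.items()}
```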
Bennett, T D; Dean, J M; Keenan, H T; McGlincy, M H; Thomas, A M; Cook, L J
2015-01-01
Record linkage may create powerful datasets with which investigators can conduct comparative effectiveness studies evaluating the impact of tests or interventions on health. All linkages of health care data files to date have used protected health information (PHI) in their linkage variables. A technique to link datasets without using PHI would be advantageous both to preserve privacy and to increase the number of potential linkages. We applied probabilistic linkage to records of injured children in the National Trauma Data Bank (NTDB, N = 156,357) and the Pediatric Health Information Systems (PHIS, N = 104,049) databases from 2007 to 2010. 49 match variables without PHI were used, many of them administrative variables and indicators for procedures recorded as International Classification of Diseases, 9th revision, Clinical Modification codes. We validated the accuracy of the linkage using identified data from a single center that submits to both databases. We accurately linked the PHIS and NTDB records for 69% of children with any injury, and 88% of those with severe traumatic brain injury eligible for a study of intervention effectiveness (positive predictive value of 98%, specificity of 99.99%). Accurate linkage was associated with longer lengths of stay, more severe injuries, and multiple injuries. In populations with substantial illness or injury severity, accurate record linkage may be possible in the absence of PHI. This methodology may enable linkages and, in turn, comparative effectiveness studies that would be unlikely or impossible otherwise.
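The core of probabilistic (Fellegi-Sunter-style) linkage is a log-likelihood match weight summed over match variables; the m- and u-probabilities below are invented, whereas the study estimated them for its 49 non-PHI variables.

```python
# Fellegi-Sunter match weight: each variable contributes log2(m/u) on
# agreement and log2((1-m)/(1-u)) on disagreement, where m and u are the
# agreement probabilities among true matches and non-matches respectively.
import math

def match_weight(agreements, m_probs, u_probs):
    w = 0.0
    for agree, m, u in zip(agreements, m_probs, u_probs):
        w += math.log2(m / u) if agree else math.log2((1 - m) / (1 - u))
    return w

# Three illustrative variables, e.g. ICD-9-CM procedure indicators.
m = [0.95, 0.90, 0.85]
u = [0.10, 0.30, 0.05]
print(match_weight([True, True, False], m, u))  # compare against a threshold
```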
A three-way approach for protein function classification.
Ur Rehman, Hafeez; Azam, Nouman; Yao, JingTao; Benso, Alfredo
2017-01-01
The knowledge of protein functions plays an essential role in understanding biological cells and has a significant impact on human life in areas such as personalized medicine, better crops and improved therapeutic interventions. Due to the expense and inherent difficulty of biological experiments, intelligent methods are generally relied upon for automatic assignment of functions to proteins. Technological advancements in the field of biology are improving our understanding of biological processes and regularly yield new features and characteristics that better describe the role of proteins. These anticipated features must not be neglected or overlooked in designing more effective classification techniques. A key issue in this context, which is not being sufficiently addressed, is how to build effective classification models and approaches for protein function prediction that incorporate and take advantage of the ever-evolving biological information. In this article, we propose a three-way decision-making approach which provides provisions for seeking and incorporating future information. We considered probabilistic rough set based models such as Game-Theoretic Rough Sets (GTRS) and Information-Theoretic Rough Sets (ITRS) for inducing three-way decisions. An architecture for protein function classification with probabilistic rough set based three-way decisions is proposed and explained. Experiments were carried out on a Saccharomyces cerevisiae species dataset obtained from the Uniprot database, with the corresponding functional classes extracted from the Gene Ontology (GO) database. The results indicate that as the level of biological information increases, the number of deferred cases is reduced while a similar level of accuracy is maintained.
Application of kernel functions for accurate similarity search in large chemical databases.
Wang, Xiaohong; Huan, Jun; Smalter, Aaron; Lushington, Gerald H
2010-04-29
Similarity search in chemical structure databases is an important problem with many applications in chemical genomics, drug design, and efficient chemical probe screening, among others. It is widely believed that structure-based methods provide an efficient way to perform the query. Recently, various graph kernel functions have been designed to capture the intrinsic similarity of graphs. Though successful in constructing accurate predictive and classification models, graph kernel functions cannot be applied to large chemical compound databases due to their high computational complexity and the difficulty of indexing similarity search for large databases. To bridge graph kernel functions and similarity search in chemical databases, we applied a novel kernel-based similarity measurement, developed by our team, to measure the similarity of graph-represented chemicals. In our method, we utilize a hash table to support the new graph kernel function definition, efficient storage, and fast search. We have applied our method, named G-hash, to large chemical databases. Our results show that the G-hash method achieves state-of-the-art performance for k-nearest-neighbor (k-NN) classification. Moreover, the similarity measurement and the index structure are scalable to large chemical databases, with a smaller index size and faster query processing time compared to state-of-the-art indexing methods such as Daylight fingerprints, C-tree and GraphGrep. Efficient similarity query processing for large chemical databases is challenging, since we need to balance running-time efficiency and similarity search accuracy. Our previous similarity search method, G-hash, provides a new way to perform similarity search in chemical databases. An experimental study validates the utility of G-hash for chemical databases.
Scalable Machine Learning for Massive Astronomical Datasets
NASA Astrophysics Data System (ADS)
Ball, Nicholas M.; Gray, A.
2014-04-01
We present the ability to perform data mining and machine learning operations on a catalog of half a billion astronomical objects. This is the result of the combination of robust, highly accurate machine learning algorithms with linear scalability that renders the applications of these algorithms to massive astronomical data tractable. We demonstrate the core algorithms: kernel density estimation, K-means clustering, linear regression, nearest neighbors, random forest and gradient-boosted decision tree, singular value decomposition, support vector machine, and two-point correlation function. Each of these is relevant for astronomical applications such as finding novel astrophysical objects, characterizing artifacts in data, object classification (including for rare objects), object distances, finding the important features describing objects, density estimation of distributions, probabilistic quantities, and exploring the unknown structure of new data. The software, Skytree Server, runs on any UNIX-based machine, a virtual machine, or cloud-based and distributed systems including Hadoop. We have integrated it on the cloud computing system of the Canadian Astronomical Data Centre, the Canadian Advanced Network for Astronomical Research (CANFAR), creating the world's first cloud computing data mining system for astronomy. We demonstrate results showing the scaling of each of our major algorithms on large astronomical datasets, including the full 470,992,970 objects of the 2 Micron All-Sky Survey (2MASS) Point Source Catalog. We demonstrate the ability to find outliers in the full 2MASS dataset utilizing multiple methods, e.g., nearest neighbors. This is likely of particular interest to the radio astronomy community given, for example, that survey projects contain groups dedicated to this topic. 2MASS is used as a proof-of-concept dataset due to its convenience and availability. These results are of interest to any astronomical project with large and/or complex datasets that wishes to extract the full scientific value from its data.
Learning to play Go using recursive neural networks.
Wu, Lin; Baldi, Pierre
2008-11-01
Go is an ancient board game that poses unique opportunities and challenges for artificial intelligence. Currently, there are no computer Go programs that can play at the level of a good human player. However, the emergence of large repositories of games is opening the door for new machine learning approaches to address this challenge. Here we develop a machine learning approach to Go, and related board games, focusing primarily on the problem of learning a good evaluation function in a scalable way. Scalability is essential at multiple levels, from the library of local tactical patterns, to the integration of patterns across the board, to the size of the board itself. The system we propose is capable of automatically learning the propensity of local patterns from a library of games. Propensity and other local tactical information are fed into recursive neural networks, derived from a probabilistic Bayesian network architecture. The recursive neural networks in turn integrate local information across the board in all four cardinal directions and produce local outputs that represent local territory ownership probabilities. The aggregation of these probabilities provides an effective strategic evaluation function that is an estimate of the expected area at the end, or at various other stages, of the game. Local area targets for training can be derived from datasets of games played by human players. In this approach, while requiring a learning time proportional to N^4, skills learned on a board of size N^2 can easily be transferred to boards of other sizes. A system trained using only 9 x 9 amateur game data performs surprisingly well on a test set derived from 19 x 19 professional game data. Possible directions for further improvements are briefly discussed.
Gigwa-Genotype investigator for genome-wide analyses.
Sempéré, Guilhem; Philippe, Florian; Dereeper, Alexis; Ruiz, Manuel; Sarah, Gautier; Larmande, Pierre
2016-06-06
Exploring the structure of genomes and analyzing their evolution is essential to understanding the ecological adaptation of organisms. However, with the large amounts of data being produced by next-generation sequencing, computational challenges arise in terms of storage, search, sharing, analysis and visualization. This is particularly true with regards to studies of genomic variation, which are currently lacking scalable and user-friendly data exploration solutions. Here we present Gigwa, a web-based tool that provides an easy and intuitive way to explore large amounts of genotyping data by filtering it not only on the basis of variant features, including functional annotations, but also on genotype patterns. The data storage relies on MongoDB, which offers good scalability properties. Gigwa can handle multiple databases and may be deployed in either single- or multi-user mode. In addition, it provides a wide range of popular export formats. The Gigwa application is suitable for managing large amounts of genomic variation data. Its user-friendly web interface makes such processing widely accessible. It can either be simply deployed on a workstation or be used to provide a shared data portal for a given community of researchers.
Development of noSQL data storage for the ATLAS PanDA Monitoring System
NASA Astrophysics Data System (ADS)
Ito, H.; Potekhin, M.; Wenaus, T.
2012-12-01
For several years the PanDA Workload Management System has been the basis for distributed production and analysis for the ATLAS experiment at the LHC. Since the start of data taking, PanDA usage has ramped up steadily, typically exceeding 500k completed jobs/day by June 2011. The associated monitoring data volume has been rising as well, to levels that present a new set of challenges in the areas of database scalability and monitoring system performance and efficiency. These challenges are being met with an R&D effort aimed at implementing a scalable and efficient monitoring data storage based on a noSQL solution (Cassandra). We present our motivations for using this technology, as well as the data design and the techniques used for efficient indexing of the data. We also discuss the hardware requirements as determined by testing with actual data and a realistic rate of queries. In conclusion, we present our experience with operating a Cassandra cluster over an extended period of time and with a data load adequate for the planned application.
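A plausible Cassandra schema for job-monitoring data of this kind, assuming the DataStax Python driver and invented table/column names: partitioning by (queue, day) spreads writes across the cluster while keeping recent-jobs queries on a single partition. This is only a sketch of the time-bucketing idea, not the project's actual data design.

```python
# Time-bucketed Cassandra schema sketch (DataStax Python driver assumed;
# the keyspace "panda_monitoring" must already exist).
import datetime
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("panda_monitoring")

session.execute("""
CREATE TABLE IF NOT EXISTS jobs_by_day (
    queue text,
    day date,
    modified timestamp,
    job_id bigint,
    status text,
    PRIMARY KEY ((queue, day), modified, job_id)
) WITH CLUSTERING ORDER BY (modified DESC)
""")

# Newest jobs first, confined to one partition by the (queue, day) key.
rows = session.execute(
    "SELECT job_id, status FROM jobs_by_day WHERE queue=%s AND day=%s LIMIT 100",
    ("analysis", datetime.date(2012, 6, 1)),
)
```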
LVFS: A Scalable Petabyte/Exabyte Data Storage System
NASA Astrophysics Data System (ADS)
Golpayegani, N.; Halem, M.; Masuoka, E. J.; Ye, G.; Devine, N. K.
2013-12-01
Managing petabytes of data with hundreds of millions of files is the first step necessary towards an effective big data computing and collaboration environment in a distributed system. We describe here the MODAPS LAADS Virtual File System (LVFS), a new storage architecture that replaces the previous MODAPS operational Level 1 Land Atmosphere Archive Distribution System (LAADS) NFS-based approach to storing and distributing datasets from several instruments, such as MODIS, MERIS, and VIIRS. LAADS is responsible for the distribution of over 4 petabytes of data and over 300 million files across more than 500 disks. We present here the first LVFS big data comparative performance results and new capabilities not previously possible with the LAADS system. We consider two aspects in addressing the inefficiencies of massive scales of data. The first is dealing in a reliable and resilient manner with the volume and quantity of files in such a dataset; the second is minimizing the discovery and lookup times for accessing files in such large datasets. There are several popular file systems that successfully deal with the first aspect of the problem. Their solution, in general, is distribution, replication, and parallelism of the storage architecture. The Hadoop Distributed File System (HDFS), Parallel Virtual File System (PVFS), and Lustre are examples of such file systems that deal with petabyte data volumes. The second aspect concerns data discovery among billions of files, the largest bottleneck in reducing access time. The metadata of a file, generally represented in a directory layout, is stored in ways that are not readily scalable. This is true for HDFS, PVFS, and Lustre as well. Recent experimental file systems, such as Spyglass or Pantheon, have attempted to address this problem through a redesign of the metadata directory architecture. LVFS takes a radically different architectural approach by eliminating the need for a separate directory within the file system. The LVFS system replaces the NFS disk-mounting approach of LAADS and utilizes the already existing, highly optimized metadata database server, an approach applicable to most scientific big-data-intensive compute systems. Thus, LVFS ties the existing storage system to the existing metadata infrastructure, which we believe leads to a scalable exabyte virtual file system. The implemented design is not limited to LAADS but can be employed with most scientific data processing systems. By utilizing Filesystem in Userspace (FUSE), a kernel module available in many operating systems, LVFS was able to replace the NFS system while staying POSIX compliant. As a result, the LVFS system becomes scalable to exabyte sizes owing to the use of highly scalable database servers optimized for metadata storage. The flexibility of the LVFS design allows it to organize data on the fly in different ways, such as by region, date, instrument or product, without the need for duplication, symbolic links, or any other replication methods. We propose here a strategic reference architecture that addresses the inefficiencies of scientific petabyte/exabyte file system access through the dynamic integration of the observing system's large metadata files.
NASA Astrophysics Data System (ADS)
Feng, X.; Shen, S.
2014-12-01
The US coastline has, over the past few years, been overwhelmed by major storms including Hurricanes Katrina (2005), Ike (2008), Irene (2011), and Sandy (2012). Supported by a growing and extensive body of evidence, a majority of research agrees that hurricane activity has been enhanced by climate change. However, precise prediction of hurricane-induced inundation remains a challenge. This study proposes a probabilistic inundation map based on a Statistically Modeled Storm Database (SMSD) to assess the probabilistic coastal inundation risk of Southwest Florida for a near-future (20-year) scenario that accounts for climate change. This map was produced with a Joint Probability Method with Optimal Sampling (JPM-OS), developed by Condon and Sheng in 2012, coupled to the high-resolution storm surge modeling system CH3D-SSMS. The probabilistic inundation map shows a 25.5-31.2% increase in spatially averaged inundation height compared to an inundation map for the present-day scenario. To estimate climate change impacts on coastal communities, socioeconomic analyses were conducted using both the SMSD-based probabilistic inundation map and the present-day inundation map. Combined with 2010 census data and 2012 parcel data from the Florida Geographic Data Library, the differences in economic loss between the near-future and present-day scenarios were used to generate an economic exposure map at the census block group level, reflecting coastal communities' exposure to climate change. The results show that climate-change-induced increases in inundation have significant economic impacts. Moreover, the impacts are not equally distributed among different social groups, given their differing social vulnerability to hazards. A social vulnerability index at the census block group level was obtained from the Hazards and Vulnerability Research Institute. The demographic and economic variables in the index represent a community's adaptability to hazards. Local Moran's I was calculated to identify clusters of highly exposed and vulnerable communities. The economic-exposure cluster map was overlaid with the social-vulnerability cluster map to identify communities with low adaptive capability but high exposure. The result provides decision makers with an intuitive tool to identify the most susceptible communities for adaptation.
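Local Moran's I, used above for cluster identification, is simple to state: for standardized values z and row-standardized weights W, I_i = z_i * (W z)_i. A plain numpy sketch on toy data follows; the weights and exposure values are invented.

```python
# Local Moran's I in plain numpy. A high positive I_i together with a high z_i
# flags a cluster of highly exposed (or highly vulnerable) areal units.
import numpy as np

def local_morans_i(values, W):
    z = (values - values.mean()) / values.std()
    W = W / W.sum(axis=1, keepdims=True)      # row-standardize the weights
    return z * (W @ z)

# Toy example: four areal units on a line (rook contiguity weights).
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
exposure = np.array([9.0, 8.5, 1.0, 1.2])
print(local_morans_i(exposure, W))
```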
Myria: Scalable Analytics as a Service
NASA Astrophysics Data System (ADS)
Howe, B.; Halperin, D.; Whitaker, A.
2014-12-01
At the UW eScience Institute, we're working to empower non-experts, especially in the sciences, to write and use data-parallel algorithms. To this end, we are building Myria, a web-based platform for scalable analytics and data-parallel programming. Myria's internal model of computation is the relational algebra extended with iteration, such that every program is inherently data-parallel, just as every query in a database is inherently data-parallel. But unlike databases, iteration is a first-class concept, allowing us to express machine learning tasks, graph traversal tasks, and more. Programs can be expressed in a number of languages and can be executed on a number of execution environments, but we emphasize a particular language called MyriaL that supports both imperative and declarative styles and a particular execution engine called MyriaX that uses an in-memory column-oriented representation and asynchronous iteration. We deliver Myria over the web as a service, providing an editor, performance analysis tools, and catalog browsing features in a single environment. We find that this web-based "delivery vector" is critical in reaching non-experts: they are insulated from the irrelevant technical work associated with installation, configuration, and resource management. The MyriaX backend, one of several execution runtimes we support, is a main-memory, column-oriented, RDBMS-on-the-worker system that supports cyclic data flows as a first-class citizen and has been shown to outperform competitive systems on 100-machine cluster sizes. I will describe the Myria system, give a demo, and present some new results in large-scale oceanographic microbiology.
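To make the "iteration as a first-class concept" point concrete, here is a minimal sketch (plain Python over an invented edge relation, not MyriaL) of semi-naive transitive closure: a relational join repeated to a fixpoint, which is the kind of program such a system expresses natively and a single non-recursive SQL query cannot:

```python
# Semi-naive evaluation: repeatedly join the newly derived pairs with the
# base relation and union the result, until nothing new is produced.
edges = {(1, 2), (2, 3), (3, 4)}

reachable = set(edges)
delta = set(edges)
while delta:
    new = {(a, d) for (a, b) in delta for (c, d) in edges if b == c}
    delta = new - reachable        # keep only genuinely new facts
    reachable |= delta

print(sorted(reachable))           # all (source, destination) pairs
```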
Development Of New Databases For Tsunami Hazard Analysis In California
NASA Astrophysics Data System (ADS)
Wilson, R. I.; Barberopoulou, A.; Borrero, J. C.; Bryant, W. A.; Dengler, L. A.; Goltz, J. D.; Legg, M.; McGuire, T.; Miller, K. M.; Real, C. R.; Synolakis, C.; Uslu, B.
2009-12-01
The California Geological Survey (CGS) has partnered with other tsunami specialists to produce two statewide databases to facilitate the evaluation of tsunami hazard products for both emergency response and land-use planning and development. A robust, State-run tsunami deposit database is being developed that complements and expands on existing databases from the National Geophysical Data Center (global) and the USGS (Cascadia). Whereas these existing databases focus on references or individual tsunami layers, the new State-maintained database concentrates on the location and contents of individual borings/trenches that sample tsunami deposits. These data provide an important observational benchmark for evaluating the results of tsunami inundation modeling. CGS is collaborating with and sharing the database entry form with other states to encourage its continued development beyond California’s coastline so that historic tsunami deposits can be evaluated on a regional basis. CGS is also developing an internet-based tsunami source scenario database and forum where tsunami source experts and hydrodynamic modelers can discuss the validity of tsunami sources and their contribution to hazard assessments for California and other coastal areas bordering the Pacific Ocean. The database includes all distant and local tsunami sources relevant to California, starting with the forty scenarios evaluated during the creation of the recently completed statewide series of tsunami inundation maps for emergency response planning. Factors germane to probabilistic tsunami hazard analyses (PTHA), such as event histories and recurrence intervals, are also addressed in the database and discussed in the forum. Discussions with other tsunami source experts will help CGS determine what additional scenarios should be considered in PTHA for assessing the feasibility of generating products of value to local land-use planning and development.
NASA Technical Reports Server (NTRS)
Thomas, J. M.; Hanagud, S.
1974-01-01
The design criteria and test options for aerospace structural reliability were investigated. A decision methodology was developed for selecting a combination of structural tests and structural design factors. The decision method involves the use of Bayesian statistics and statistical decision theory. Procedures are discussed for obtaining and updating data-based probabilistic strength distributions for aerospace structures when test information is available and for obtaining subjective distributions when data are not available. The techniques used in developing the distributions are explained.
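A minimal numeric sketch of such Bayesian updating, under simplifying assumptions not taken from the report (strength normally distributed with known test scatter, a conjugate normal prior on the mean, invented numbers):

```python
# Update a subjective prior on mean structural strength with static test
# results; the posterior is the data-based distribution used downstream.
import math

prior_mu, prior_sd = 100.0, 15.0   # subjective prior on mean strength (ksi)
sigma = 8.0                        # assumed test-to-test scatter
tests = [108.0, 95.0, 103.0]       # structural test results

n = len(tests)
xbar = sum(tests) / n
post_var = 1.0 / (1.0 / prior_sd**2 + n / sigma**2)
post_mu = post_var * (prior_mu / prior_sd**2 + n * xbar / sigma**2)
print(f"posterior mean strength: {post_mu:.1f} +/- {math.sqrt(post_var):.1f}")
```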
eCOMPAGT – efficient Combination and Management of Phenotypes and Genotypes for Genetic Epidemiology
Schönherr, Sebastian; Weißensteiner, Hansi; Coassin, Stefan; Specht, Günther; Kronenberg, Florian; Brandstätter, Anita
2009-01-01
Background High-throughput genotyping and phenotyping projects of large epidemiological study populations require sophisticated laboratory information management systems. Most epidemiological studies include subject-related personal information, which needs to be handled with care by following data privacy protection guidelines. In addition, genotyping core facilities handling cooperative projects require a straightforward solution to monitor the status and financial resources of the different projects. Description We developed a database system for an efficient combination and management of phenotypes and genotypes (eCOMPAGT) deriving from genetic epidemiological studies. eCOMPAGT securely stores and manages genotype and phenotype data and enables different user modes with different rights. Special attention was drawn on the import of data deriving from TaqMan and SNPlex genotyping assays. However, the database solution is adjustable to other genotyping systems by programming additional interfaces. Further important features are the scalability of the database and an export interface to statistical software. Conclusion eCOMPAGT can store, administer and connect phenotype data with all kinds of genotype data and is available as a downloadable version at . PMID:19432954
DOGMA: A Disk-Oriented Graph Matching Algorithm for RDF Databases
NASA Astrophysics Data System (ADS)
Bröcheler, Matthias; Pugliese, Andrea; Subrahmanian, V. S.
RDF is an increasingly important paradigm for the representation of information on the Web. As RDF databases increase in size to approach tens of millions of triples, and as sophisticated graph matching queries expressible in languages like SPARQL become increasingly important, scalability becomes an issue. To date, there is no graph-based indexing method for RDF data where the index was designed in a way that makes it disk-resident. There is therefore a growing need for indexes that can operate efficiently when the index itself resides on disk. In this paper, we first propose the DOGMA index for fast subgraph matching on disk and then develop a basic algorithm to answer queries over this index. This algorithm is then significantly sped up via an optimized algorithm that uses efficient (but correct) pruning strategies when combined with two different extensions of the index. We have implemented a preliminary system and tested it against four existing RDF database systems developed by others. Our experiments show that our algorithm performs very well compared to these systems, with orders of magnitude improvements for complex graph queries.
Iavindrasana, Jimison; Depeursinge, Adrien; Ruch, Patrick; Spahni, Stéphane; Geissbuhler, Antoine; Müller, Henning
2007-01-01
The diagnostic and therapeutic processes, as well as the development of new treatments, are hindered by the fragmentation of the information which underlies them. In a multi-institutional research study database, the clinical information system (CIS) contains the primary data input. A substantial part of the budget of large-scale clinical studies is often spent on data creation and maintenance. The objective of this work is to design a decentralized, scalable, reusable database architecture with lower maintenance costs for managing and integrating the distributed heterogeneous data required as the basis for a large-scale research project. Technical and legal aspects are taken into account based on various use case scenarios. The architecture contains 4 layers: data storage and access decentralized at their production source, a connector acting as a proxy between the CIS and the external world, an information mediator serving as a data access point, and the client side. The proposed design will be implemented inside six clinical centers participating in the @neurIST project as part of a larger system on data integration and reuse for aneurysm treatment.
The Virtual Xenbase: transitioning an online bioinformatics resource to a private cloud
Karimi, Kamran; Vize, Peter D.
2014-01-01
As a model organism database, Xenbase has been providing informatics and genomic data on Xenopus (Silurana) tropicalis and Xenopus laevis frogs for more than a decade. The Xenbase database contains curated, as well as community-contributed and automatically harvested literature, gene and genomic data. A GBrowse genome browser, a BLAST+ server and stock center support are available on the site. When this resource was first built, all software services and components in Xenbase ran on a single physical server, with inherent reliability, scalability and inter-dependence issues. Recent advances in networking and virtualization techniques allowed us to move Xenbase to a virtual environment, and more specifically to a private cloud. To do so we decoupled the different software services and components, such that each would run on a different virtual machine. In the process, we also upgraded many of the components. The resulting system is faster and more reliable. System maintenance is easier, as individual virtual machines can now be updated, backed up and changed independently. We are also experiencing more effective resource allocation and utilization. Database URL: www.xenbase.org PMID:25380782
Probabilistic Seismic Hazard Analysis for Georgia
NASA Astrophysics Data System (ADS)
Tsereteli, N. S.; Varazanashvili, O.; Sharia, T.; Arabidze, V.; Tibaldi, A.; Bonali, F. L. L.; Russo, E.; Pasquaré Mariotto, F.
2017-12-01
Nowadays, seismic hazard studies are developed in terms of the calculation of Peak Ground Acceleration (PGA), Spectral Acceleration (SA), Peak Ground Velocity (PGV) and other recorded parameters. In the frame of the EMME project, PSH was calculated for Georgia using GMPEs chosen according to selection criteria. In the frame of Project N 216758 (supported by the Shota Rustaveli National Science Foundation (SRNF)), PSH maps were estimated using a hybrid-empirical ground motion prediction equation developed for Georgia. Due to the paucity of seismically recorded information, in this work we focused our research on a more robust dataset of macroseismic data, and attempted to calculate the probabilistic seismic hazard directly in terms of macroseismic intensity. For this reason, we started calculating new intensity prediction equations (IPEs) for Georgia taking into account different sets, belonging to the same new database, as well as distances from the seismic source. With respect to the seismic source, in order to improve the quality of the results, we have also hypothesized the size of faults from empirical relations, and calculated new IPEs by considering Joyner-Boore and rupture distances in addition to epicentral and hypocentral distances. Finally, site conditions have been included as variables for IPE calculation. Regarding the database, we used a brand new revised set of macroseismic data and instrumental records for the significant earthquakes that struck Georgia between 1900 and 2002. In particular, a large amount of research and documents related to the macroseismic effects of individual earthquakes, stored in the archives of the Institute of Geophysics, were used as sources for the new macroseismic data. The latter are reported in the Medvedev-Sponheuer-Karnik macroseismic scale (MSK64). For each earthquake the magnitude, the focal depth and the epicenter location are also reported. An online version of the database, with the related metadata, has been produced for the 69 revised earthquakes and is available online (http://www.enguriproject.unimib.it/).
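As a sketch of what fitting an IPE involves, the snippet below performs an ordinary least-squares fit of an assumed form I = c0 + c1*M + c2*log10(R) + c3*R on synthetic data; the functional forms, distance metrics, and site terms actually used for Georgia may differ:

```python
# Least-squares fit of an intensity prediction equation on synthetic
# magnitude/distance/intensity triples (all numbers invented).
import numpy as np

rng = np.random.default_rng(0)
M = rng.uniform(4.0, 7.0, 200)             # magnitudes
R = rng.uniform(5.0, 150.0, 200)           # epicentral distances (km)
I = 1.5 + 1.2 * M - 3.0 * np.log10(R) - 0.002 * R + rng.normal(0, 0.3, 200)

X = np.column_stack([np.ones_like(M), M, np.log10(R), R])
coef, *_ = np.linalg.lstsq(X, I, rcond=None)
print("c0..c3 =", np.round(coef, 3))
```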
Just, Rebecca S; Irwin, Jodi A
2018-05-01
Some of the expected advantages of next generation sequencing (NGS) for short tandem repeat (STR) typing include enhanced mixture detection and genotype resolution via sequence variation among non-homologous alleles of the same length. However, at the same time that NGS methods for forensic DNA typing have advanced in recent years, many caseworking laboratories have implemented or are transitioning to probabilistic genotyping to assist the interpretation of complex autosomal STR typing results. Current probabilistic software programs are designed for length-based data, and were not intended to accommodate sequence strings as the product input. Yet to leverage the benefits of NGS for enhanced genotyping and mixture deconvolution, the sequence variation among same-length products must be utilized in some form. Here, we propose use of the longest uninterrupted stretch (LUS) in allele designations as a simple method to represent sequence variation within the STR repeat regions and facilitate, in the near term, probabilistic interpretation of NGS-based typing results. An examination of published population data indicated that a reference LUS region is straightforward to define for most autosomal STR loci, and that using repeat unit plus LUS length as the allele designator can represent greater than 80% of the alleles detected by sequencing. A proof of concept study performed using a freely available probabilistic software demonstrated that the LUS length can be used in allele designations when a program does not require alleles to be integers, and that utilizing sequence information improves interpretation of both single-source and mixed contributor STR typing results as compared to using repeat unit information alone. The LUS concept for allele designation maintains the repeat-based allele nomenclature that will permit backward compatibility to extant STR databases, and the LUS lengths themselves will be concordant regardless of the NGS assay or analysis tools employed. Further, these biologically based, easy-to-derive designations uphold clear relationships between parent alleles and their stutter products, enabling analysis in fully continuous probabilistic programs that model stutter while avoiding the algorithmic complexities that come with string-based searches. Though using repeat unit plus LUS length as the allele designator does not capture variation that occurs outside of the core repeat regions, this straightforward approach would permit the large majority of known STR sequence variation to be used for mixture deconvolution and, in turn, result in more informative mixture statistics in the near term. Ultimately, the method could bridge the gap from current length-based probabilistic systems to facilitate broader adoption of NGS by forensic DNA testing laboratories. Copyright © 2018 The Authors. Published by Elsevier B.V. All rights reserved.
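A minimal sketch of deriving a repeat-unit-plus-LUS designation from a sequence string (the motif, sequence, and designation format here are illustrative, not a published convention):

```python
# The LUS is the longest uninterrupted run of the reference repeat motif.
import re

def lus_length(sequence, motif):
    """Length (in repeat units) of the longest uninterrupted
    stretch of `motif` in `sequence`."""
    runs = re.findall(f"(?:{motif})+", sequence)
    return max((len(r) // len(motif) for r in runs), default=0)

seq = "TCTA" * 5 + "TCTG" + "TCTA" * 8     # a compound 14-repeat allele
repeat_units = 14                           # length-based designation
print(f"allele {repeat_units}_{lus_length(seq, 'TCTA')}")   # -> allele 14_8
```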
Pan, Xuequn; Cimino, James J
2014-01-01
Clinicians and clinical researchers often seek information in electronic health records (EHRs) that are relevant to some concept of interest, such as a disease or finding. The heterogeneous nature of EHRs can complicate retrieval, risking incomplete results. We frame this problem as the presence of two gaps: 1) a gap between clinical concepts and their representations in EHR data and 2) a gap between data representations and their locations within EHR data structures. We bridge these gaps with a knowledge structure that comprises relationships among clinical concepts (including concepts of interest and concepts that may be instantiated in EHR data) and relationships between clinical concepts and the database structures. We make use of available knowledge resources to develop a reproducible, scalable process for creating a knowledge base that can support automated query expansion from a clinical concept to all relevant EHR data.
Reflective random indexing for semi-automatic indexing of the biomedical literature.
Vasuki, Vidya; Cohen, Trevor
2010-10-01
The rapid growth of biomedical literature is evident in the increasing size of the MEDLINE research database. Medical Subject Headings (MeSH), a controlled set of keywords, are used to index all the citations contained in the database to facilitate search and retrieval. This volume of citations calls for efficient tools to assist indexers at the US National Library of Medicine (NLM). Currently, the Medical Text Indexer (MTI) system provides assistance by recommending MeSH terms based on the title and abstract of an article using a combination of distributional and vocabulary-based methods. In this paper, we evaluate a novel approach toward indexer assistance by using nearest neighbor classification in combination with Reflective Random Indexing (RRI), a scalable alternative to the established methods of distributional semantics. On a test set provided by the NLM, our approach significantly outperforms the MTI system, suggesting that the RRI approach would make a useful addition to the current methodologies.
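A simplified sketch of the nearest-neighbour classification step: pool the MeSH terms of the k most similar indexed citations, weighted by similarity. Plain term-count vectors stand in for RRI-trained semantic vectors, and all data are invented:

```python
import numpy as np

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12)

def recommend_mesh(query_vec, doc_vecs, doc_mesh, k=3):
    """Rank MeSH terms by similarity-weighted votes from the k
    nearest indexed citations."""
    sims = [cosine(query_vec, d) for d in doc_vecs]
    votes = {}
    for i in np.argsort(sims)[-k:]:
        for term in doc_mesh[i]:
            votes[term] = votes.get(term, 0.0) + sims[i]
    return sorted(votes, key=votes.get, reverse=True)

doc_vecs = [np.array([3., 0., 1.]), np.array([0., 2., 2.]), np.array([2., 1., 0.])]
doc_mesh = [{"Neoplasms"}, {"Genomics"}, {"Neoplasms", "Humans"}]
print(recommend_mesh(np.array([2., 0., 1.]), doc_vecs, doc_mesh))
```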
NASA Astrophysics Data System (ADS)
Baumann, Peter
2013-04-01
There is a traditional saying that metadata are understandable, semantic-rich, and searchable. Data, on the other hand, are big, with no accessible semantics, and just downloadable. Not only has this led to an imbalance of search support from a user perspective, but also, underneath, to a deep technology divide, often using relational databases for metadata and bespoke archive solutions for data. Our vision is that this barrier will be overcome, and data and metadata will become searchable alike, leveraging the potential of semantic technologies in combination with scalability technologies. Ultimately, in this vision, ad-hoc processing and filtering will no longer be distinguished, forming a uniformly accessible data universe. In the European EarthServer initiative, we work towards this vision by federating database-style raster query languages with metadata search and geo broker technology. We present the approach taken, how it can leverage OGC standards, the benefits envisaged, and first results.
Sequential data access with Oracle and Hadoop: a performance comparison
NASA Astrophysics Data System (ADS)
Baranowski, Zbigniew; Canali, Luca; Grancher, Eric
2014-06-01
The Hadoop framework has proven to be an effective and popular approach for dealing with "Big Data" and, thanks to its scaling ability and optimised storage access, Hadoop Distributed File System-based projects such as MapReduce or HBase are seen as candidates to replace traditional relational database management systems whenever scalable speed of data processing is a priority. But do these projects deliver in practice? Does migrating to Hadoop's "shared nothing" architecture really improve data access throughput? And, if so, at what cost? The authors answer these questions, addressing cost/performance as well as raw performance, based on a performance comparison between an Oracle-based relational database and Hadoop's distributed solutions like MapReduce or HBase for sequential data access. A key feature of our approach is the use of an unbiased data model, as certain data models can significantly favour one of the technologies tested.
NASA Astrophysics Data System (ADS)
Bartolini, S.; Becerril, L.; Martí, J.
2014-11-01
One of the most important issues in modern volcanology is the assessment of volcanic risk, which will depend, among other factors, on both the quantity and quality of the available data and an optimum storage mechanism. This will require the design of purpose-built databases that take into account data format and availability and afford easy data storage and sharing, and will provide for a more complete risk assessment that combines different analyses but avoids any duplication of information. Data contained in any such database should facilitate spatial and temporal analysis that will (1) produce probabilistic hazard models for future vent opening, (2) simulate volcanic hazards and (3) assess their socio-economic impact. We describe the design of a new spatial database structure, VERDI (Volcanic managEment Risk Database desIgn), which allows different types of data, including geological, volcanological, meteorological, monitoring and socio-economic information, to be manipulated, organized and managed. A central goal is to ensure that VERDI will serve as a tool for connecting different kinds of data sources, GIS platforms and modeling applications. We present an overview of the database design, its components and the attributes that play an important role in the database model. The potential of the VERDI structure and the possibilities it offers in regard to data organization are shown here through its application to El Hierro (Canary Islands). The VERDI database will provide scientists and decision makers with a useful tool that will assist in conducting volcanic risk assessment and management.
Nosql for Storage and Retrieval of Large LIDAR Data Collections
NASA Astrophysics Data System (ADS)
Boehm, J.; Liu, K.
2015-08-01
Developments in LiDAR technology over the past decades have made LiDAR a mature and widely accepted source of geospatial information. This in turn has led to an enormous growth in data volume. The central idea for a file-centric storage of LiDAR point clouds is the observation that large collections of LiDAR data are typically delivered as large collections of files, rather than single files of terabyte size. This split of the dataset, commonly referred to as tiling, was usually done to accommodate a specific processing pipeline. It therefore makes sense to preserve this split. A document-oriented NoSQL database can easily emulate this data partitioning by representing each tile (file) in a separate document. The document stores the metadata of the tile. The actual files are stored in a distributed file system emulated by the NoSQL database. We demonstrate the use of MongoDB, a highly scalable document-oriented NoSQL database, for storing large LiDAR files. MongoDB, like any NoSQL database, allows for queries on the attributes of the document. Notably, MongoDB also supports spatial queries, so we can perform spatial queries on the bounding boxes of the LiDAR tiles. Insertion and retrieval of files on a cloud-based database are compared to native file system and cloud storage transfer speeds.
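A sketch of this file-centric design using pymongo (assumes a running MongoDB instance; the collection, field names, and the use of GridFS as the emulated file store are illustrative):

```python
# One document per LiDAR tile: metadata plus a GridFS file id. A 2dsphere
# index on the tile bounding box enables spatial queries.
import gridfs
from pymongo import MongoClient, GEOSPHERE

db = MongoClient()["lidar"]
fs = gridfs.GridFS(db)                       # emulated distributed file store
tiles = db["tiles"]
tiles.create_index([("bbox", GEOSPHERE)])

with open("tile_001.las", "rb") as f:        # hypothetical tile file
    file_id = fs.put(f, filename="tile_001.las")

tiles.insert_one({
    "filename": "tile_001.las", "points": 1_200_000, "gridfs_id": file_id,
    "bbox": {"type": "Polygon", "coordinates":
             [[[-0.14, 51.50], [-0.13, 51.50], [-0.13, 51.51],
               [-0.14, 51.51], [-0.14, 51.50]]]}})

query_region = {"type": "Polygon", "coordinates":
                [[[-0.15, 51.49], [-0.12, 51.49], [-0.12, 51.52],
                  [-0.15, 51.52], [-0.15, 51.49]]]}
for tile in tiles.find({"bbox": {"$geoIntersects": {"$geometry": query_region}}}):
    data = fs.get(tile["gridfs_id"]).read()  # retrieve each matching tile
```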
Mining of high utility-probability sequential patterns from uncertain databases
Zhang, Binbin; Fournier-Viger, Philippe; Li, Ting
2017-01-01
High-utility sequential pattern mining (HUSPM) has become an important issue in the field of data mining. Several HUSPM algorithms have been designed to mine high-utility sequential patterns (HUSPs). They have been applied in several real-life situations such as consumer behavior analysis and event detection in sensor networks. Nonetheless, most studies on HUSPM have focused on mining HUSPs in precise data. But in real life, uncertainty is an important factor, as data is collected using various types of sensors that are more or less accurate. Hence, data collected in a real-life database can be annotated with existence probabilities. This paper presents a novel pattern mining framework called high utility-probability sequential pattern mining (HUPSPM) for mining high utility-probability sequential patterns (HUPSPs) in uncertain sequence databases. A baseline algorithm with three optional pruning strategies is presented to mine HUPSPs. Moreover, to speed up the mining process, a projection mechanism is designed to create a database projection for each processed sequence, which is smaller than the original database. Thus, the number of unpromising candidates can be greatly reduced, as well as the execution time for mining HUPSPs. Substantial experiments both on real-life and synthetic datasets show that the designed algorithm performs well in terms of runtime, number of candidates, memory usage, and scalability for different minimum utility and minimum probability thresholds. PMID:28742847
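A toy sketch of the two thresholds at the heart of the framework: a pattern is retained only if its accumulated utility and its probability (here, expected support) both meet user-set minimums. The data encoding and numbers are invented, and the published algorithm adds pruning strategies and database projections:

```python
# Each sequence carries an existence probability and the utility each
# candidate pattern contributes to it.
database = [
    (0.9, {("a", "b"): 12, ("a",): 5}),
    (0.6, {("a", "b"): 7}),
    (0.4, {("a",): 3}),
]

def mine(min_util, min_prob):
    utility, prob = {}, {}
    for p, patterns in database:
        for pat, u in patterns.items():
            utility[pat] = utility.get(pat, 0) + u
            prob[pat] = prob.get(pat, 0.0) + p    # expected support
    return [pat for pat in utility
            if utility[pat] >= min_util and prob[pat] >= min_prob]

print(mine(min_util=10, min_prob=1.0))            # -> [('a', 'b')]
```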
Life in extra dimensions of database world or penetration of NoSQL in HEP community
NASA Astrophysics Data System (ADS)
Kuznetsov, V.; Evans, D.; Metson, S.
2012-12-01
The recent buzzword in the IT world is NoSQL. Major players, such as Facebook, Yahoo, and Google, have widely adopted different “NoSQL” solutions for their needs. Horizontal scalability, a flexible data model and management of big data volumes are only a few advantages of NoSQL. In the CMS experiment we use several of them in a production environment. Here, we present CMS projects based on NoSQL solutions, their strengths and weaknesses, as well as our experience with those tools and their coexistence with standard RDBMS solutions in our applications.
Hierarchical probabilistic Gabor and MRF segmentation of brain tumours in MRI volumes.
Subbanna, Nagesh K; Precup, Doina; Collins, D Louis; Arbel, Tal
2013-01-01
In this paper, we present a fully automated hierarchical probabilistic framework for segmenting brain tumours from multispectral human brain magnetic resonance images (MRIs) using multiwindow Gabor filters and an adapted Markov Random Field (MRF) framework. In the first stage, a customised Gabor decomposition is developed, based on the combined-space characteristics of the two classes (tumour and non-tumour) in multispectral brain MRIs in order to optimally separate tumour (including edema) from healthy brain tissues. A Bayesian framework then provides a coarse probabilistic texture-based segmentation of tumours (including edema) whose boundaries are then refined at the voxel level through a modified MRF framework that carefully separates the edema from the main tumour. This customised MRF is not only built on the voxel intensities and class labels as in traditional MRFs, but also models the intensity differences between neighbouring voxels in the likelihood model, along with employing a prior based on local tissue class transition probabilities. The second inference stage is shown to resolve local inhomogeneities and impose a smoothing constraint, while also maintaining the appropriate boundaries as supported by the local intensity difference observations. The method was trained and tested on the publicly available MICCAI 2012 Brain Tumour Segmentation Challenge (BRATS) Database [1] on both synthetic and clinical volumes (low grade and high grade tumours). Our method performs well compared to state-of-the-art techniques, outperforming the results of the top methods in cases of clinical high grade and low grade tumour core segmentation by 40% and 45% respectively.
Probabilistic Assessment of Radiation Risk for Astronauts in Space Missions
NASA Technical Reports Server (NTRS)
Kim, Myung-Hee; DeAngelis, Giovanni; Cucinotta, Francis A.
2009-01-01
Accurate predictions of the health risks to astronauts from space radiation exposure are necessary for enabling future lunar and Mars missions. Space radiation consists of solar particle events (SPEs), comprised largely of medium-energy protons (less than 100 MeV), and galactic cosmic rays (GCR), which include protons and heavy ions of higher energies. While the expected frequency of SPEs is strongly influenced by the solar activity cycle, SPE occurrences themselves are random in nature. A solar modulation model has been developed for the temporal characterization of the GCR environment, which is represented by the deceleration potential, φ. The risk of radiation exposure from SPEs during extra-vehicular activities (EVAs) or in lightly shielded vehicles is a major concern for radiation protection, including determining the shielding and operational requirements for astronauts and hardware. To support the probabilistic risk assessment for EVAs, which would be up to 15% of crew time on lunar missions, we estimated the probability of SPE occurrence as a function of time within a solar cycle using a nonhomogeneous Poisson model to fit the historical database of measurements of protons with energy > 30 MeV, Φ30. The resultant organ doses and dose equivalents, as well as effective whole body doses for acute and cancer risk estimations, are analyzed for a conceptual habitat module and a lunar rover during defined space mission periods. This probabilistic approach to radiation risk assessment from SPE and GCR is in support of mission design and operational planning to manage radiation risks for space exploration.
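A numeric sketch of the nonhomogeneous Poisson step: the probability of at least one SPE in a mission window is one minus the exponential of the integrated rate. The rate function below is an invented stand-in for the model fitted to the historical >30 MeV proton data:

```python
import numpy as np

def spe_rate(t):
    """Hypothetical SPE rate (events/year) peaking near solar maximum;
    t is years into the solar cycle."""
    return 8.0 * np.exp(-((t - 5.5) / 2.0) ** 2)

t = np.linspace(2.0, 2.5, 500)                  # a 6-month EVA-heavy window
expected_events = np.trapz(spe_rate(t), t)      # integral of the rate
p_at_least_one = 1.0 - np.exp(-expected_events)
print(f"P(>=1 SPE in window) = {p_at_least_one:.2f}")
```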
Assessment of a Tsunami Hazard for Mediterranean Coast of Egypt
NASA Astrophysics Data System (ADS)
Zaytsev, Andrey; Babeyko, Andrey; Yalciner, Ahmet; Pelinovsky, Efim
2017-04-01
An analysis of tsunami hazard for Egypt based on historic data and numerical modelling of historic and prognostic events is given. There are 13 historic events over 4000 years, including one instrumental record (1956). The tsunami database includes 12 earthquake tsunamis and 1 event of volcanic origin (the Santorini eruption). The tsunami intensity of the events of 365, 881, 1303, and 1870 is estimated as I = 3, corresponding to tsunami wave heights of more than 6 m. Numerical simulation of some possible scenarios of tsunamis of seismic and landslide origin is done using the NAMI-DANCE software, which solves the shallow-water equations. The PTHA method (Probabilistic Tsunami Hazard Assessment) for the Mediterranean Sea developed in (Sorensen M.B., Spada M., Babeyko A., Wiemer S., Grunthal G. Probabilistic tsunami hazard in the Mediterranean Sea. J Geophysical Research, 2012, vol. 117, B01305) is used to evaluate the probability of tsunami occurrence on the Egyptian coast. The synthetic catalogue of prognostic tsunamis of seismic origin with magnitude above 6.5 includes 84,920 events over 100,000 years. For wave heights above 1 m, the exceedance probability versus tsunami height curve can be approximated by an exponential Gumbel function with two parameters, which are determined for each coastal location in Egypt (24 points in total). Prognostic extreme events with probability below 10^-4 (approximately 10 events) do not follow the Gumbel function and require special analysis. Acknowledgements: This work was supported by the EU FP7 ASTARTE Project [603839] and, for EP, by NS6637.2016.5.
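A sketch of the Gumbel approximation, assuming the two-parameter exceedance form P_exc(h) = 1 - exp(-exp(-(h - mu)/beta)) fitted per coastal location; the data points below are invented:

```python
import numpy as np
from scipy.optimize import curve_fit

def gumbel_exceedance(h, mu, beta):
    # exceedance probability under a Gumbel height distribution
    return 1.0 - np.exp(-np.exp(-(h - mu) / beta))

heights = np.array([1.0, 2.0, 3.0, 4.0, 6.0])        # tsunami height (m)
p_exc = np.array([3e-2, 8e-3, 2e-3, 6e-4, 5e-5])     # per-year exceedance
(mu, beta), _ = curve_fit(gumbel_exceedance, heights, p_exc, p0=(0.0, 1.0))
print(f"mu={mu:.2f} m, beta={beta:.2f} m;",
      f"P(h>5 m)={gumbel_exceedance(5.0, mu, beta):.1e}")
```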
PROBABILISTIC RISK ANALYSIS OF RADIOACTIVE WASTE DISPOSALS - a case study
NASA Astrophysics Data System (ADS)
Trinchero, P.; Delos, A.; Tartakovsky, D. M.; Fernandez-Garcia, D.; Bolster, D.; Dentz, M.; Sanchez-Vila, X.; Molinero, J.
2009-12-01
The storage of contaminant material in superficial or sub-superficial repositories, such as tailing piles for mine waste or disposal sites for low- and intermediate-level nuclear waste, poses a potential threat to the surrounding biosphere. The minimization of these risks can be achieved by supporting decision-makers with quantitative tools capable of incorporating all sources of uncertainty within a rigorous probabilistic framework. A case study is presented where we assess the risks associated with the superficial storage of hazardous waste close to a populated area. The intrinsic complexity of the problem, involving many events with different spatial and time scales and many uncertain parameters, is overcome by using a formal PRA (probabilistic risk assessment) procedure that allows decomposing the system into a number of key events. Hence, the failure of the system is directly linked to the potential contamination of one of the three main receptors: the underlying karst aquifer, a superficial stream that flows near the storage piles, and a protection area surrounding a number of wells used for water supply. The minimal cut sets leading to the failure of the system are obtained by defining a fault tree that incorporates different events, including the failure of the engineered system (e.g. the cover of the piles) and the failure of the geological barrier (e.g. the clay layer that separates the bottom of the pile from the karst formation). Finally, the probability of failure is quantitatively assessed by combining individual independent or conditional probabilities that are computed numerically or borrowed from reliability databases.
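A sketch of the cut-set combination step of such a PRA, with invented probabilities for basic events like those named in the abstract; cut sets are treated as independent here, which is only an approximation when they share events:

```python
import math

cover_fails = 1e-2       # engineered cover of the pile fails
clay_fails = 5e-3        # geological clay barrier fails
stream_flood = 2e-2      # runoff reaches the superficial stream

# each minimal cut set is a conjunction of independent basic events
cut_sets = [
    [cover_fails, clay_fails],      # contaminant reaches the karst aquifer
    [cover_fails, stream_flood],    # contaminant reaches the stream
]

p_cut = [math.prod(cs) for cs in cut_sets]
p_bound = sum(p_cut)                                 # rare-event upper bound
p_union = 1.0 - math.prod(1.0 - p for p in p_cut)    # if cut sets independent
print(f"P(system failure): {p_bound:.2e} (bound), {p_union:.2e} (approx.)")
```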
Phase change cellular automata modeling of GeTe, GaSb and SnSe stacked chalcogenide films
NASA Astrophysics Data System (ADS)
Mihai, C.; Velea, A.
2018-06-01
Data storage needs are increasing at a rapid pace across all economic sectors, so the need for new memory technologies with adequate capabilities is also high. Phase change memories (PCMs) are a leading contender in the emerging race for non-volatile memories due to their fast operation speed, high scalability, good reliability and low power consumption. However, in order to meet the present and future storage demands, PCM technologies must further increase the storage density. Here, we employ a probabilistic cellular automata approach to explore the multi-step threshold switching from the reset (off) to the set (on) state in chalcogenide stacked structures. Simulations have shown that in order to obtain multi-step switching with high contrast among different resistance states, the stacked structure needs to contain materials with a large difference among their crystallization temperatures and careful tuning of strata thicknesses. The crystallization dynamics can be controlled through the external energy pulses applied to the system, in such a way that a balance between nucleation and growth in phase change behavior can be achieved, optimized for PCMs.
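A toy probabilistic cellular automaton in the spirit described: an amorphous cell crystallizes with a nucleation probability boosted by crystalline neighbours, gated by how close the cell temperature is to the crystallization temperature. All rates, temperatures, and the gating function are invented:

```python
import numpy as np

rng = np.random.default_rng(1)
grid = np.zeros((64, 64), dtype=int)     # 0 = amorphous, 1 = crystalline

def step(grid, T, T_cryst, p_nucleate=1e-4, p_grow=0.2):
    thermal = np.exp(-abs(T - T_cryst) / 25.0)    # crude thermal gate
    nbrs = sum(np.roll(grid, s, axis=a)           # crystalline neighbours
               for a in (0, 1) for s in (-1, 1))
    p = (p_nucleate + p_grow * nbrs) * thermal    # nucleation + growth
    return np.maximum(grid, (rng.random(grid.shape) < p).astype(int))

for pulse in range(200):                          # one "set" pulse train
    grid = step(grid, T=420.0, T_cryst=430.0)
print("crystalline fraction:", grid.mean())
```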
An incremental DPMM-based method for trajectory clustering, modeling, and retrieval.
Hu, Weiming; Li, Xi; Tian, Guodong; Maybank, Stephen; Zhang, Zhongfei
2013-05-01
Trajectory analysis is the basis for many applications, such as indexing of motion events in videos, activity recognition, and surveillance. In this paper, the Dirichlet process mixture model (DPMM) is applied to trajectory clustering, modeling, and retrieval. We propose an incremental version of a DPMM-based clustering algorithm and apply it to cluster trajectories. An appropriate number of trajectory clusters is determined automatically. When trajectories belonging to new clusters arrive, the new clusters can be identified online and added to the model without any retraining using the previous data. A time-sensitive Dirichlet process mixture model (tDPMM) is applied to each trajectory cluster for learning the trajectory pattern which represents the time-series characteristics of the trajectories in the cluster. Then, a parameterized index is constructed for each cluster. A novel likelihood estimation algorithm for the tDPMM is proposed, and a trajectory-based video retrieval model is developed. The tDPMM-based probabilistic matching method and the DPMM-based model growing method are combined to make the retrieval model scalable and adaptable. Experimental comparisons with state-of-the-art algorithms demonstrate the effectiveness of our algorithm.
Uncertainty analyses of CO2 plume expansion subsequent to wellbore CO2 leakage into aquifers
DOE Office of Scientific and Technical Information (OSTI.GOV)
Hou, Zhangshuan; Bacon, Diana H.; Engel, David W.
2014-08-01
In this study, we apply an uncertainty quantification (UQ) framework to CO2 sequestration problems. In one scenario, we look at the risk of wellbore leakage of CO2 into a shallow unconfined aquifer in an urban area; in another scenario, we study the effects of reservoir heterogeneity on CO2 migration. We combine various sampling approaches (quasi-Monte Carlo, probabilistic collocation, and adaptive sampling) in order to reduce the number of forward calculations while trying to fully explore the input parameter space and quantify the input uncertainty. The CO2 migration is simulated using the PNNL-developed simulator STOMP-CO2e (the water-salt-CO2 module). For computationally demanding simulations with 3D heterogeneity fields, we combined the framework with a scalable version, eSTOMP, as the forward modeling simulator. We built response curves and response surfaces of model outputs with respect to input parameters to look at the individual and combined effects, and to identify and rank the significance of the input parameters.
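A sketch of the quasi-Monte Carlo ingredient using scipy's qmc module (requires scipy >= 1.7); the parameter names and bounds are invented, and each sample row would parameterize one forward STOMP-CO2e/eSTOMP run:

```python
# Low-discrepancy Sobol samples cover the input space with far fewer
# forward runs than plain Monte Carlo.
from scipy.stats import qmc

sampler = qmc.Sobol(d=3, scramble=True, seed=42)
unit_samples = sampler.random_base2(m=7)          # 2**7 = 128 runs

# hypothetical inputs: permeability (log10 m^2), porosity, leak rate (kg/s)
l_bounds = [-14.0, 0.05, 0.01]
u_bounds = [-11.0, 0.35, 1.00]
inputs = qmc.scale(unit_samples, l_bounds, u_bounds)

for perm, poro, rate in inputs:
    pass  # each row would parameterize one forward simulation
```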
Architecture for biomedical multimedia information delivery on the World Wide Web
NASA Astrophysics Data System (ADS)
Long, L. Rodney; Goh, Gin-Hua; Neve, Leif; Thoma, George R.
1997-10-01
Research engineers at the National Library of Medicine are building a prototype system for the delivery of multimedia biomedical information on the World Wide Web. This paper discusses the architecture and design considerations for the system, which will be used initially to make images and text from the third National Health and Nutrition Examination Survey (NHANES) publicly available. We categorized our analysis as follows: (1) fundamental software tools: we analyzed trade-offs among use of conventional HTML/CGI, X Window Broadway, and Java; (2) image delivery: we examined the use of unconventional TCP transmission methods; (3) database manager and database design: we discuss the capabilities and planned use of the Informix object-relational database manager and the planned schema for the NHANES database; (4) storage requirements for our Sun server; (5) user interface considerations; (6) the compatibility of the system with other standard research and analysis tools; (7) image display: we discuss considerations for consistent image display for end users. Finally, we discuss the scalability of the system in terms of incorporating larger or more databases of similar data, and the extendibility of the system for supporting content-based retrieval of biomedical images. The system prototype is called the Web-based Medical Information Retrieval System. An early version was built as a Java applet and tested on Unix, PC, and Macintosh platforms. This prototype used the MiniSQL database manager to do text queries on a small database of records of participants in the second NHANES survey. The full records and associated x-ray images were retrievable and displayable on a standard Web browser. A second version has now been built, also a Java applet, using the MySQL database manager.
A high-performance spatial database based approach for pathology imaging algorithm evaluation
Wang, Fusheng; Kong, Jun; Gao, Jingjing; Cooper, Lee A.D.; Kurc, Tahsin; Zhou, Zhengwen; Adler, David; Vergara-Niedermayr, Cristobal; Katigbak, Bryan; Brat, Daniel J.; Saltz, Joel H.
2013-01-01
Background: Algorithm evaluation provides a means to characterize variability across image analysis algorithms, validate algorithms by comparison with human annotations, combine results from multiple algorithms for performance improvement, and facilitate algorithm sensitivity studies. The sizes of images and image analysis results in pathology image analysis pose significant challenges in algorithm evaluation. We present an efficient parallel spatial database approach to model, normalize, manage, and query large volumes of analytical image result data. This provides an efficient platform for algorithm evaluation. Our experiments with a set of brain tumor images demonstrate the application, scalability, and effectiveness of the platform. Context: The paper describes an approach and platform for evaluation of pathology image analysis algorithms. The platform facilitates algorithm evaluation through a high-performance database built on the Pathology Analytic Imaging Standards (PAIS) data model. Aims: (1) Develop a framework to support algorithm evaluation by modeling and managing analytical results and human annotations from pathology images; (2) Create a robust data normalization tool for converting, validating, and fixing spatial data from algorithm or human annotations; (3) Develop a set of queries to support data sampling and result comparisons; (4) Achieve high performance computation capacity via a parallel data management infrastructure, parallel data loading and spatial indexing optimizations in this infrastructure. Materials and Methods: We have considered two scenarios for algorithm evaluation: (1) algorithm comparison where multiple result sets from different methods are compared and consolidated; and (2) algorithm validation where algorithm results are compared with human annotations. We have developed a spatial normalization toolkit to validate and normalize spatial boundaries produced by image analysis algorithms or human annotations. The validated data were formatted based on the PAIS data model and loaded into a spatial database. To support efficient data loading, we have implemented a parallel data loading tool that takes advantage of multi-core CPUs to accelerate data injection. The spatial database manages both geometric shapes and image features or classifications, and enables spatial sampling, result comparison, and result aggregation through expressive structured query language (SQL) queries with spatial extensions. To provide scalable and efficient query support, we have employed a shared nothing parallel database architecture, which distributes data homogenously across multiple database partitions to take advantage of parallel computation power and implements spatial indexing to achieve high I/O throughput. Results: Our work proposes a high performance, parallel spatial database platform for algorithm validation and comparison. This platform was evaluated by storing, managing, and comparing analysis results from a set of brain tumor whole slide images. The tools we develop are open source and available to download. Conclusions: Pathology image algorithm validation and comparison are essential to iterative algorithm development and refinement. One critical component is the support for queries involving spatial predicates and comparisons. In our work, we develop an efficient data model and parallel database approach to model, normalize, manage and query large volumes of analytical image result data. 
Our experiments demonstrate that the data partitioning strategy and the grid-based indexing result in good data distribution across database nodes and reduce I/O overhead in spatial join queries through parallel retrieval of relevant data and quick subsetting of datasets. The set of tools in the framework provide a full pipeline to normalize, load, manage and query analytical results for algorithm evaluation. PMID:23599905
A Diffusion MRI Tractography Connectome of the Mouse Brain and Comparison with Neuronal Tracer Data
Calabrese, Evan; Badea, Alexandra; Cofer, Gary; Qi, Yi; Johnson, G. Allan
2015-01-01
Interest in structural brain connectivity has grown with the understanding that abnormal neural connections may play a role in neurologic and psychiatric diseases. Small animal connectivity mapping techniques are particularly important for identifying aberrant connectivity in disease models. Diffusion magnetic resonance imaging tractography can provide nondestructive, 3D, brain-wide connectivity maps, but has historically been limited by low spatial resolution, low signal-to-noise ratio, and the difficulty in estimating multiple fiber orientations within a single image voxel. Small animal diffusion tractography can be substantially improved through the combination of ex vivo MRI with exogenous contrast agents, advanced diffusion acquisition and reconstruction techniques, and probabilistic fiber tracking. Here, we present a comprehensive, probabilistic tractography connectome of the mouse brain at microscopic resolution, and a comparison of these data with a neuronal tracer-based connectivity data from the Allen Brain Atlas. This work serves as a reference database for future tractography studies in the mouse brain, and demonstrates the fundamental differences between tractography and neuronal tracer data. PMID:26048951
Lost in search: (Mal-)adaptation to probabilistic decision environments in children and adults.
Betsch, Tilmann; Lehmann, Anne; Lindow, Stefanie; Lang, Anna; Schoemann, Martin
2016-02-01
Adaptive decision making in probabilistic environments requires individuals to use probabilities as weights in predecisional information searches and/or when making subsequent choices. Within a child-friendly computerized environment (Mousekids), we tracked 205 children's (105 children 5-6 years of age and 100 children 9-10 years of age) and 103 adults' (age range: 21-22 years) search behaviors and decisions under different probability dispersions (.17, .33, .83 vs. .50, .67, .83) and constraint conditions (instructions to limit search: yes vs. no). All age groups limited their depth of search when instructed to do so and when probability dispersion was high (range: .17-.83). Unlike adults, children failed to use probabilities as weights for their searches, which were largely not systematic. When examining choices, however, elementary school children (unlike preschoolers) systematically used probabilities as weights in their decisions. This suggests that an intuitive understanding of probabilities and the capacity to use them as weights during integration is not a sufficient condition for applying simple selective search strategies that place one's focus on weight distributions. PsycINFO Database Record (c) 2016 APA, all rights reserved.
Cetin, K.O.; Seed, R.B.; Der Kiureghian, A.; Tokimatsu, K.; Harder, L.F.; Kayen, R.E.; Moss, R.E.S.
2004-01-01
This paper presents new correlations for assessment of the likelihood of initiation (or triggering) of soil liquefaction. These new correlations eliminate several sources of bias intrinsic to previous, similar correlations, and provide greatly reduced overall uncertainty and variance. Key elements in the development of these new correlations are (1) accumulation of a significantly expanded database of field performance case histories; (2) use of improved knowledge and understanding of factors affecting interpretation of standard penetration test data; (3) incorporation of improved understanding of factors affecting site-specific earthquake ground motions (including directivity effects, site-specific response, etc.); (4) use of improved methods for assessment of in situ cyclic shear stress ratio; (5) screening of field data case histories on a quality/uncertainty basis; and (6) use of high-order probabilistic tools (Bayesian updating). The resulting relationships not only provide greatly reduced uncertainty, they also help to resolve a number of corollary issues that have long been difficult and controversial, including: (1) magnitude-correlated duration weighting factors, (2) adjustments for fines content, and (3) corrections for overburden stress. © ASCE.
Probabilistic consensus scoring improves tandem mass spectrometry peptide identification.
Nahnsen, Sven; Bertsch, Andreas; Rahnenführer, Jörg; Nordheim, Alfred; Kohlbacher, Oliver
2011-08-05
Database search is a standard technique for identifying peptides from their tandem mass spectra. To increase the number of correctly identified peptides, we suggest a probabilistic framework that allows the combination of scores from different search engines into a joint consensus score. Central to the approach is a novel method to estimate scores for peptides not found by an individual search engine. This approach allows the estimation of p-values for each candidate peptide and their combination across all search engines. The consensus approach works better than any single search engine across all the different instrument types considered in this study. Improvements vary strongly from platform to platform and from search engine to search engine. Compared to the industry standard MASCOT, our approach can identify up to 60% more peptides. The software for consensus predictions is implemented in C++ as part of OpenMS, a software framework for mass spectrometry. The source code is available in the current development version of OpenMS and can easily be used as a command-line application or via the graphical pipeline designer TOPPAS.
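As a simple illustration of combining per-engine p-values for one candidate peptide, the snippet below applies Fisher's method via scipy; the paper's consensus scheme is more elaborate, notably in estimating scores for peptides an engine did not report:

```python
from scipy.stats import combine_pvalues

# hypothetical p-values for one peptide from three search engines
p_values = [0.04, 0.10, 0.02]
stat, p_consensus = combine_pvalues(p_values, method="fisher")
print(f"consensus p-value: {p_consensus:.4f}")
```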
Dynamic full-scalability conversion in scalable video coding
NASA Astrophysics Data System (ADS)
Lee, Dong Su; Bae, Tae Meon; Thang, Truong Cong; Ro, Yong Man
2007-02-01
For outstanding coding efficiency with scalability functions, SVC (Scalable Video Coding) is being standardized. SVC can support spatial, temporal and SNR scalability, and these scalabilities are useful for providing a smooth video streaming service even in a time-varying network such as a mobile environment. But current SVC is insufficient to support dynamic video conversion with scalability, so the adaptation of the bitrate to meet a fluctuating network condition is limited. In this paper, we propose dynamic full-scalability conversion methods for QoS-adaptive video streaming in SVC. To accomplish dynamic full-scalability conversion, we develop the corresponding bitstream extraction, encoding and decoding schemes. At the encoder, we insert the IDR NAL periodically to solve the problems of spatial scalability conversion. At the extractor, we analyze the SVC bitstream to obtain the information which enables dynamic extraction; real-time extraction is achieved by using this information. Finally, we develop the decoder so that it can manage the changing scalability. Experimental results verify dynamic full-scalability conversion and show that it is necessary under time-varying network conditions.
Efficient and Scalable Cross-Matching of (Very) Large Catalogs
NASA Astrophysics Data System (ADS)
Pineau, F.-X.; Boch, T.; Derriere, S.
2011-07-01
Whether it be for building multi-wavelength datasets from independent surveys, studying changes in objects' luminosities, or detecting moving objects (stellar proper motions, asteroids), cross-catalog matching is a technique widely used in astronomy. The need for efficient, reliable and scalable cross-catalog matching is becoming even more pressing with forthcoming projects which will produce huge catalogs in which astronomers will dig for rare objects, perform statistical analysis and classification, or carry out real-time transient detection. We have developed a formalism and the corresponding technical framework to address the challenge of fast cross-catalog matching. Our formalism supports more than simple nearest-neighbor search, and handles elliptical positional errors. Scalability is improved by partitioning the sky using the HEALPix scheme, and processing each sky cell independently. The use of multi-threaded two-dimensional kd-trees adapted to managing equatorial coordinates enables efficient neighbor search. The whole process can run on a single computer, but could also use clusters of machines to cross-match future very large surveys such as GAIA or LSST in reasonable times. We already achieve performance such that the 2MASS (~470M sources) and SDSS DR7 (~350M sources) catalogs can be matched on a single machine in less than 10 minutes. We aim at providing astronomers with a catalog cross-matching service, available on-line and leveraging the catalogs present in the VizieR database. This service will allow users both to access pre-computed cross-matches across some very large catalogs, and to run customized cross-matching operations. It will also support VO protocols for synchronous or asynchronous queries.
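A sketch of the kd-tree neighbour search at the core of cross-matching: equatorial coordinates become unit-sphere 3D vectors and an angular radius becomes a chord length, so a Euclidean query returns candidate matches. This omits the HEALPix partitioning and elliptical errors, and all coordinates are invented:

```python
import numpy as np
from scipy.spatial import cKDTree

def to_xyz(ra_deg, dec_deg):
    """Convert RA/Dec (degrees) to unit-sphere Cartesian vectors."""
    ra, dec = np.radians(ra_deg), np.radians(dec_deg)
    return np.column_stack([np.cos(dec) * np.cos(ra),
                            np.cos(dec) * np.sin(ra),
                            np.sin(dec)])

cat_a = to_xyz(np.array([10.0, 10.001, 250.0]), np.array([-5.0, -5.0, 30.0]))
cat_b = to_xyz(np.array([10.0008, 250.0002]), np.array([-5.0002, 30.0001]))

radius_arcsec = 5.0                                 # match radius
chord = 2.0 * np.sin(np.radians(radius_arcsec / 3600.0) / 2.0)
pairs = cKDTree(cat_a).query_ball_tree(cKDTree(cat_b), r=chord)
print(pairs)    # for each source in cat_a, indices of matches in cat_b
```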
Scalable Predictive Analysis in Critically Ill Patients Using a Visual Open Data Analysis Platform
Poucke, Sven Van; Zhang, Zhongheng; Schmitz, Martin; Vukicevic, Milan; Laenen, Margot Vander; Celi, Leo Anthony; Deyne, Cathy De
2016-01-01
With the accumulation of large amounts of health related data, predictive analytics could stimulate the transformation of reactive medicine towards Predictive, Preventive and Personalized (PPPM) Medicine, ultimately affecting both cost and quality of care. However, high-dimensionality and high-complexity of the data involved, prevents data-driven methods from easy translation into clinically relevant models. Additionally, the application of cutting edge predictive methods and data manipulation require substantial programming skills, limiting its direct exploitation by medical domain experts. This leaves a gap between potential and actual data usage. In this study, the authors address this problem by focusing on open, visual environments, suited to be applied by the medical community. Moreover, we review code free applications of big data technologies. As a showcase, a framework was developed for the meaningful use of data from critical care patients by integrating the MIMIC-II database in a data mining environment (RapidMiner) supporting scalable predictive analytics using visual tools (RapidMiner’s Radoop extension). Guided by the CRoss-Industry Standard Process for Data Mining (CRISP-DM), the ETL process (Extract, Transform, Load) was initiated by retrieving data from the MIMIC-II tables of interest. As use case, correlation of platelet count and ICU survival was quantitatively assessed. Using visual tools for ETL on Hadoop and predictive modeling in RapidMiner, we developed robust processes for automatic building, parameter optimization and evaluation of various predictive models, under different feature selection schemes. Because these processes can be easily adopted in other projects, this environment is attractive for scalable predictive analytics in health research. PMID:26731286
Huang, Haiyan; Liu, Chun-Chi; Zhou, Xianghong Jasmine
2010-04-13
The rapid accumulation of gene expression data has offered unprecedented opportunities to study human diseases. The National Center for Biotechnology Information Gene Expression Omnibus is currently the largest database that systematically documents the genome-wide molecular basis of diseases. However, thus far, this resource has been far from fully utilized. This paper describes the first study to transform public gene expression repositories into an automated disease diagnosis database. Particularly, we have developed a systematic framework, including a two-stage Bayesian learning approach, to achieve the diagnosis of one or multiple diseases for a query expression profile along a hierarchical disease taxonomy. Our approach, including standardizing cross-platform gene expression data and heterogeneous disease annotations, allows analyzing both sources of information in a unified probabilistic system. A high level of overall diagnostic accuracy was shown by cross validation. It was also demonstrated that the power of our method can increase significantly with the continued growth of public gene expression repositories. Finally, we showed how our disease diagnosis system can be used to characterize complex phenotypes and to construct a disease-drug connectivity map.
Statistical organelle dissection of Arabidopsis guard cells using image database LIPS.
Higaki, Takumi; Kutsuna, Natsumaro; Hosokawa, Yoichiroh; Akita, Kae; Ebine, Kazuo; Ueda, Takashi; Kondo, Noriaki; Hasezawa, Seiichiro
2012-01-01
To comprehensively grasp cell biological events in plant stomatal movement, we have captured microscopic images of guard cells with various organelle markers. The 28,530 serial optical sections of 930 pairs of Arabidopsis guard cells have been released as a new image database, named Live Images of Plant Stomata (LIPS). We visualized the average organellar distributions in guard cells using probabilistic mapping and image clustering techniques. The results indicated that actin microfilaments and the endoplasmic reticulum (ER) are mainly localized to the dorsal side and connection regions of guard cells. Subtractive images of open and closed stomata showed distribution changes in intracellular structures, including the ER, during stomatal movement. Time-lapse imaging showed that similar ER distribution changes occurred during stomatal opening induced by light irradiation or femtosecond laser shots on neighboring epidermal cells, indicating that our image analysis approach has identified a novel ER relocation during stomatal opening.
Global gridded crop specific agricultural areas from 1961-2014
NASA Astrophysics Data System (ADS)
Konar, M.; Jackson, N. D.
2017-12-01
Current global cropland datasets are limited in crop specificity and temporal resolution. Time series maps of crop specific agricultural areas would enable us to better understand the global agricultural geography of the 20th century. To this end, we develop a global gridded dataset of crop specific agricultural areas from 1961-2014. To do this, we downscale national cropland information using a probabilistic approach. Our method relies upon gridded Global Agro-Ecological Zones (GAEZ) maps, the History Database of the Global Environment (HYDE), and crop calendars from Sacks et al. (2010). We estimate crop-specific agricultural areas for a 0.25 degree spatial grid and annual time scale for all major crops. We validate our global estimates for the year 2000 with Monfreda et al. (2008) and our time series estimates within the United States using government data. This database will contribute to our understanding of global agricultural change of the past century.
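A minimal sketch of the probabilistic downscaling idea described above: a national crop-area total is spread over grid cells in proportion to suitability weights, so that the national total is conserved. The weights and the national figure below are random placeholders for the GAEZ/HYDE-derived probabilities the authors actually use.

```python
import numpy as np

rng = np.random.default_rng(0)
national_wheat_area = 1.2e6          # hypothetical national total, ha
suitability = rng.random((40, 60))   # placeholder for GAEZ-style weights
suitability /= suitability.sum()     # normalize to a probability map

# Each grid cell receives a share of the national total proportional
# to its suitability; the national total is conserved by construction.
gridded_area = national_wheat_area * suitability
assert np.isclose(gridded_area.sum(), national_wheat_area)
```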
Privacy-preserving record linkage using Bloom filters
2009-01-01
Background: Combining multiple databases with disjunctive or additional information on the same person is occurring increasingly throughout research. If unique identification numbers for these individuals are not available, probabilistic record linkage is used for the identification of matching record pairs. In many applications, identifiers have to be encrypted due to privacy concerns. Methods: A new protocol for privacy-preserving record linkage with encrypted identifiers allowing for errors in identifiers has been developed. The protocol is based on Bloom filters on q-grams of identifiers. Results: Tests on simulated and actual databases yield linkage results comparable to non-encrypted identifiers and superior to results from phonetic encodings. Conclusion: We proposed a protocol for privacy-preserving record linkage with encrypted identifiers allowing for errors in identifiers. Since the protocol can be easily enhanced and has a low computational burden, the protocol might be useful for many applications requiring privacy-preserving record linkage. PMID:19706187
Warwick, Peter D.; Verma, Mahendra K.; Attanasi, Emil; Olea, Ricardo A.; Blondes, Madalyn S.; Freeman, Philip; Brennan, Sean T.; Merrill, Matthew; Jahediesfanjani, Hossein; Roueche, Jacqueline; Lohr, Celeste D.
2017-01-01
The U.S. Geological Survey (USGS) has developed an assessment methodology for estimating the potential incremental technically recoverable oil resources resulting from carbon dioxide-enhanced oil recovery (CO2-EOR) in reservoirs with appropriate depth, pressure, and oil composition. The methodology also includes a procedure for estimating the CO2 that remains in the reservoir after the CO2-EOR process is complete. The methodology relies on a reservoir-level database that incorporates commercially available geologic and engineering data. The mathematical calculations of this assessment methodology were tested and produced realistic results for the Permian Basin Horseshoe Atoll, Upper Pennsylvanian-Wolfcampian Play (Texas, USA). The USGS plans to use the new methodology to conduct an assessment of technically recoverable hydrocarbons and associated CO2 sequestration resulting from CO2-EOR in the United States.
Privacy-preserving record linkage using Bloom filters.
Schnell, Rainer; Bachteler, Tobias; Reiher, Jörg
2009-08-25
Combining multiple databases with disjunctive or additional information on the same person is occurring increasingly throughout research. If unique identification numbers for these individuals are not available, probabilistic record linkage is used for the identification of matching record pairs. In many applications, identifiers have to be encrypted due to privacy concerns. A new protocol for privacy-preserving record linkage with encrypted identifiers allowing for errors in identifiers has been developed. The protocol is based on Bloom filters on q-grams of identifiers. Tests on simulated and actual databases yield linkage results comparable to non-encrypted identifiers and superior to results from phonetic encodings. We proposed a protocol for privacy-preserving record linkage with encrypted identifiers allowing for errors in identifiers. Since the protocol can be easily enhanced and has a low computational burden, the protocol might be useful for many applications requiring privacy-preserving record linkage.
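A minimal sketch of the q-gram Bloom-filter encoding described in the two records above: each identifier is decomposed into bigrams, each bigram sets several bits via double hashing, and encrypted filters can still be compared with a Dice coefficient, tolerating typos. The filter size, hash count and hash functions are illustrative choices, not the authors' exact parameters.

```python
import hashlib

M, K = 1000, 20  # filter size and hash count: illustrative values only

def qgrams(s, q=2):
    s = f" {s.lower().strip()} "          # pad so edges form q-grams too
    return {s[i:i + q] for i in range(len(s) - q + 1)}

def bloom(name):
    bits = [False] * M
    for g in qgrams(name):
        # Double hashing: derive K bit positions from two base hashes.
        h1 = int(hashlib.md5(g.encode()).hexdigest(), 16)
        h2 = int(hashlib.sha1(g.encode()).hexdigest(), 16)
        for k in range(K):
            bits[(h1 + k * h2) % M] = True
    return bits

def dice(a, b):
    inter = sum(x and y for x, y in zip(a, b))
    return 2 * inter / (sum(a) + sum(b))

# Error-tolerant: encodings of similar names remain similar.
print(dice(bloom("Schmidt"), bloom("Schmitt")))   # high similarity
print(dice(bloom("Schmidt"), bloom("Walker")))    # low similarity
```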
Fan, Long; Hui, Jerome H L; Yu, Zu Guo; Chu, Ka Hou
2014-07-01
Species identification based on short sequences of DNA markers, that is, DNA barcoding, has emerged as an integral part of modern taxonomy. However, software for the analysis of large and multilocus barcoding data sets is scarce. The Basic Local Alignment Search Tool (BLAST) is currently the fastest tool capable of handling large databases (e.g. >5000 sequences), but its accuracy is a concern and it has been criticized for its local optimization. More accurate current software, however, requires sequence alignment or complex calculations, which are time-consuming for large data sets during data preprocessing or during the search stage. It is therefore imperative to develop a practical program for accurate and scalable species identification in DNA barcoding. In this context, we present VIP Barcoding: user-friendly software with a graphical user interface for rapid DNA barcoding. It adopts a hybrid, two-stage algorithm. First, an alignment-free composition vector (CV) method is utilized to reduce the search space by screening a reference database. The alignment-based K2P distance nearest-neighbour method is then employed to analyse the smaller data set generated in the first stage. In comparison with other software, we demonstrate that VIP Barcoding has (i) higher accuracy than Blastn and several alignment-free methods and (ii) higher scalability than alignment-based distance methods and character-based methods. These results suggest that this platform is able to deal with both large-scale and multilocus barcoding data with accuracy and can contribute to DNA barcoding for modern taxonomy. VIP Barcoding is free and available at http://msl.sls.cuhk.edu.hk/vipbarcoding/. © 2014 John Wiley & Sons Ltd.
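The K2P (Kimura two-parameter) distance used in the second stage has a standard closed form, d = -(1/2) ln((1 - 2P - Q) sqrt(1 - 2Q)), where P and Q are the observed proportions of transitions and transversions. The sketch below computes it for two hypothetical aligned sequences.

```python
from math import log, sqrt

PURINES = {"A", "G"}

def k2p_distance(s1, s2):
    """Kimura two-parameter distance between two aligned sequences."""
    pairs = [(a, b) for a, b in zip(s1.upper(), s2.upper())
             if a in "ACGT" and b in "ACGT"]
    n = len(pairs)
    # Transitions: differing bases within the same purine/pyrimidine class.
    ts = sum(a != b and (a in PURINES) == (b in PURINES) for a, b in pairs)
    # Transversions: differing bases across classes.
    tv = sum((a in PURINES) != (b in PURINES) for a, b in pairs)
    P, Q = ts / n, tv / n
    return -0.5 * log((1 - 2 * P - Q) * sqrt(1 - 2 * Q))

print(k2p_distance("ACGTACGTAC", "ACGTACATAC"))  # one transition: ~0.112
```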
Cruella: developing a scalable tissue microarray data management system.
Cowan, James D; Rimm, David L; Tuck, David P
2006-06-01
Compared with DNA microarray technology, relatively little information is available concerning the special requirements, design influences, and implementation strategies of data systems for tissue microarray technology. These issues include the requirement to accommodate new and different data elements for each new project as well as the need to interact with pre-existing models for clinical, biological, and specimen-related data. Our objective was to design and implement a flexible, scalable tissue microarray data storage and management system that could accommodate information regarding different disease types, clinical investigators, and clinical investigation questions, all of which could potentially contribute unforeseen data types requiring dynamic integration with existing data. The unpredictability of the data elements, combined with the novelty of automated analysis algorithms and controlled vocabulary standards in this area, requires flexible designs and practical decisions. Our design includes a custom Java-based persistence layer to mediate and facilitate interaction with an object-relational database model and a novel database schema. User interaction is provided through a Java Servlet-based Web interface. Cruella has become an indispensable resource and is used by dozens of researchers every day. The system stores millions of experimental values covering more than 300 biological markers and more than 30 disease types. The experimental data are merged with clinical data that have been aggregated from multiple sources and are available to the researchers for management, analysis, and export. Cruella addresses many of the special considerations for managing tissue microarray experimental data and the associated clinical information. A metadata-driven approach provides a practical solution to many of the unique issues inherent in tissue microarray research, and allows relatively straightforward interoperability with, and accommodation of, new data models.
NASA Astrophysics Data System (ADS)
Karami, Mojtaba; Rangzan, Kazem; Saberi, Azim
2013-10-01
With the emergence of air-borne and space-borne hyperspectral sensors, spectroscopic measurements are gaining importance in remote sensing, and the amount of available spectral reference data is constantly increasing. This rapid increase is often accompanied by poor data management, which ultimately leaves data isolated on disk storage. Spectral data without a precise description of the target, methods, environment, and sampling geometry cannot be used by other researchers. Moreover, existing spectral data, even when accompanied by good documentation, often remain virtually invisible or unreachable for researchers. Providing documentation and a data-sharing framework for spectral data, in which researchers are able to search for or share spectral data and documentation, would markedly improve the data lifetime. Relational Database Management Systems (RDBMS) are the main candidates for spectral data management, and their efficiency is proven by many studies and applications to date. In this study, a new approach to spectral data administration is presented based on the spatial identity of spectral samples. This method benefits from the scalability and performance of an RDBMS for the storage of spectral data, but uses GIS servers to provide users with interactive maps as an interface to the system. The spectral files, photographs and descriptive data are treated as attributes of a geospatial object. A spectral processing unit is responsible for evaluating metadata quality and performing routine spectral processing tasks for newly added data. As a result, users can visually examine the availability of data in a web browser and search for data based on the descriptive attributes associated with them. The proposed system is scalable, and besides giving users a good sense of what data are available in the database, it facilitates the contribution of spectral reference data to the production of geoinformation.
Freire, Sergio Miranda; Teodoro, Douglas; Wei-Kleiner, Fang; Sundvall, Erik; Karlsson, Daniel; Lambrix, Patrick
2016-01-01
This study provides an experimental performance evaluation on population-based queries of NoSQL databases storing archetype-based Electronic Health Record (EHR) data. There are few published studies regarding the performance of persistence mechanisms for systems that use multilevel modelling approaches, especially when the focus is on population-based queries. A healthcare dataset with 4.2 million records stored in a relational database (MySQL) was used to generate XML and JSON documents based on the openEHR reference model. Six datasets with different sizes were created from these documents and imported into three single machine XML databases (BaseX, eXistdb and Berkeley DB XML) and into a distributed NoSQL database system based on the MapReduce approach, Couchbase, deployed in different cluster configurations of 1, 2, 4, 8 and 12 machines. Population-based queries were submitted to those databases and to the original relational database. Database size and query response times are presented. The XML databases were considerably slower and required much more space than Couchbase. Overall, Couchbase had better response times than MySQL, especially for larger datasets. However, Couchbase requires indexing for each differently formulated query and the indexing time increases with the size of the datasets. The performances of the clusters with 2, 4, 8 and 12 nodes were not better than the single node cluster in relation to the query response time, but the indexing time was reduced proportionally to the number of nodes. The tested XML databases had acceptable performance for openEHR-based data in some querying use cases and small datasets, but were generally much slower than Couchbase. Couchbase also outperformed the response times of the relational database, but required more disk space and had a much longer indexing time. Systems like Couchbase are thus interesting research targets for scalable storage and querying of archetype-based EHR data when population-based use cases are of interest. PMID:26958859
Duchrow, Timo; Shtatland, Timur; Guettler, Daniel; Pivovarov, Misha; Kramer, Stefan; Weissleder, Ralph
2009-01-01
Background: The breadth of biological databases and their information content continues to increase exponentially. Unfortunately, our ability to query such sources is still often suboptimal. Here, we introduce and apply community voting, database-driven text classification, and visual aids as a means to incorporate distributed expert knowledge, to automatically classify database entries and to efficiently retrieve them. Results: Using a previously developed peptide database as an example, we compared several machine learning algorithms in their ability to classify abstracts of published literature results into categories relevant to peptide research, such as related or not related to cancer, angiogenesis, molecular imaging, etc. Ensembles of bagged decision trees met the requirements of our application best. No other algorithm consistently performed better in comparative testing. Moreover, we show that the algorithm produces meaningful class probability estimates, which can be used to visualize the confidence of automatic classification during the retrieval process. To allow viewing long lists of search results enriched by automatic classifications, we added a dynamic heat map to the web interface. We take advantage of community knowledge by enabling users to cast votes in Web 2.0 style in order to correct automated classification errors, which triggers reclassification of all entries. We used a novel framework in which the database "drives" the entire vote aggregation and reclassification process to increase speed while conserving computational resources and keeping the method scalable. In our experiments, we simulate community voting by adding various levels of noise to nearly perfectly labelled instances, and show that, under such conditions, classification can be improved significantly. Conclusion: Using PepBank as a model database, we show how to build a classification-aided retrieval system that gathers training data from the community, is completely controlled by the database, scales well with concurrent change events, and can be adapted to add text classification capability to other biomedical databases. The system can be accessed at . PMID:19799796
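A minimal sketch of the core classifier described above: an ensemble of bagged decision trees over text features, whose class-probability estimates (the vote fraction across trees) are exactly what a confidence heat map could display. The toy abstracts and labels are invented; the feature extraction and ensemble sizes are illustrative, not the paper's configuration.

```python
from sklearn.ensemble import BaggingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Toy stand-ins for curated abstracts and their category labels.
abstracts = ["tumor angiogenesis peptide inhibitor",
             "molecular imaging probe peptide",
             "cancer cell apoptosis peptide",
             "imaging agent contrast peptide"]
labels = ["cancer", "imaging", "cancer", "imaging"]

clf = make_pipeline(TfidfVectorizer(),
                    BaggingClassifier(DecisionTreeClassifier(),
                                      n_estimators=50, random_state=0))
clf.fit(abstracts, labels)

# Class-probability estimates to visualize classification confidence.
print(clf.predict_proba(["peptide probe for tumor imaging"]))
```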
UPM: unified policy-based network management
NASA Astrophysics Data System (ADS)
Law, Eddie; Saxena, Achint
2001-07-01
Besides providing basic network management for the Internet, it has become essential to offer different levels of Quality of Service (QoS) to users. Policy-based management provides control over network routers to achieve this goal. The Internet Engineering Task Force (IETF) has proposed a two-tier architecture whose implementation is based on the Common Open Policy Service (COPS) protocol and the Lightweight Directory Access Protocol (LDAP). However, this design has several limitations, such as scalability and cross-vendor hardware compatibility. To address these issues, we present a functionally enhanced multi-tier policy management architecture in this paper. Several extensions are introduced, thereby adding flexibility and scalability. In particular, an intermediate entity between the policy server and the policy rule database, called the Policy Enforcement Agent (PEA), is introduced. By keeping internal data in a common format, using a standard protocol, and interpreting and translating request and decision messages from multi-vendor hardware, this agent allows a dynamic Unified Information Model throughout the architecture. We have tailored this information system to store policy rules in the directory server and to allow execution of policy rules while new equipment is added dynamically at run time.
A Testbed to Evaluate the FIWARE-Based IoT Platform in the Domain of Precision Agriculture.
Martínez, Ramón; Pastor, Juan Ángel; Álvarez, Bárbara; Iborra, Andrés
2016-11-23
Wireless sensor networks (WSNs) represent one of the most promising technologies for precision farming. Over the next few years, a significant increase in the use of such systems on commercial farms is expected. WSNs present a number of problems, regarding scalability, interoperability, communications, connectivity with databases and data processing. Different Internet of Things middleware is appearing to overcome these challenges. This paper checks whether one of these middleware, FIWARE, is suitable for the development of agricultural applications. To the authors' knowledge, there are no works that show how to use FIWARE in precision agriculture and study its appropriateness, its scalability and its efficiency for this kind of applications. To do this, a testbed has been designed and implemented to simulate different deployments and load conditions. The testbed is a typical FIWARE application, complete, yet simple and comprehensible enough to show the main features and components of FIWARE, as well as the complexity of using this technology. Although the testbed has been deployed in a laboratory environment, its design is based on the analysis of an Internet of Things use case scenario in the domain of precision agriculture.
A Testbed to Evaluate the FIWARE-Based IoT Platform in the Domain of Precision Agriculture
Martínez, Ramón; Pastor, Juan Ángel; Álvarez, Bárbara; Iborra, Andrés
2016-01-01
Wireless sensor networks (WSNs) represent one of the most promising technologies for precision farming. Over the next few years, a significant increase in the use of such systems on commercial farms is expected. WSNs present a number of problems, regarding scalability, interoperability, communications, connectivity with databases and data processing. Different Internet of Things middleware is appearing to overcome these challenges. This paper checks whether one of these middleware, FIWARE, is suitable for the development of agricultural applications. To the authors’ knowledge, there are no works that show how to use FIWARE in precision agriculture and study its appropriateness, its scalability and its efficiency for this kind of applications. To do this, a testbed has been designed and implemented to simulate different deployments and load conditions. The testbed is a typical FIWARE application, complete, yet simple and comprehensible enough to show the main features and components of FIWARE, as well as the complexity of using this technology. Although the testbed has been deployed in a laboratory environment, its design is based on the analysis of an Internet of Things use case scenario in the domain of precision agriculture. PMID:27886091
NASA Astrophysics Data System (ADS)
Kacprzyk, Janusz; Zadrożny, Sławomir
2010-05-01
We show how the conceptually and numerically simple notion of a fuzzy linguistic database summary can be a powerful tool for gaining insight into the essence of data. Linguistic summaries provide tools for the verbalisation of data analysis (mining) results which, in addition to the more commonly used visualisation, e.g. via a graphical user interface, can contribute to increased human consistency and ease of use, notably for supporting decision makers via the data-driven decision support system paradigm. Two new relevant aspects of the analysis, first initiated by the authors, are also outlined. First, following Kacprzyk and Zadrożny, it is further considered how linguistic data summarisation is closely related to some types of solutions used in natural language generation (NLG); this makes it possible to reuse the increasingly effective and efficient tools and techniques developed in NLG. Second, similar remarks are given on relations to systemic functional linguistics. Moreover, following Kacprzyk and Zadrożny, comments are given on an extremely relevant aspect of the scalability of linguistic summarisation of data, using the new notion of conceptual scalability.
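The degree of truth of a simple linguistic summary such as "most employees are young" is classically computed in Zadeh's calculus as T = mu_Q((1/n) * sum_i mu_S(y_i)), with S a fuzzy predicate and Q a fuzzy quantifier. A minimal sketch follows; the membership functions for "young" and "most" are invented for illustration.

```python
def mu_young(age):                    # fuzzy predicate S: "young"
    return max(0.0, min(1.0, (35 - age) / 10))  # 1 below 25, 0 above 35

def mu_most(r):                       # fuzzy quantifier Q: "most"
    return max(0.0, min(1.0, (r - 0.3) / 0.5))  # 0 below 30%, 1 above 80%

def truth_of_summary(ages):
    """Degree of truth of the summary 'most employees are young'."""
    r = sum(mu_young(a) for a in ages) / len(ages)
    return mu_most(r)

print(truth_of_summary([23, 28, 31, 44, 27, 25]))   # -> 0.7
```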
From EGEE Operations Portal towards EGI Operations Portal
NASA Astrophysics Data System (ADS)
Cordier, Hélène; L'Orphelin, Cyril; Reynaud, Sylvain; Lequeux, Olivier; Loikkanen, Sinikka; Veyre, Pierre
Grid operators in EGEE have been using a dedicated dashboard as their central operational tool, which has remained stable and scalable over the last 5 years despite continuous upgrades driven by specifications from users, monitoring tools and data providers. In EGEE-III, the recent regionalisation of operations led the Operations Portal developers to conceive a standalone instance of this tool. We will see how the dashboard reorganization paved the way for the re-engineering of the portal itself. The outcome is an easily deployable package customized with relevant information sources and specific decentralized operational requirements. This package is composed of a generic and scalable data access mechanism, Lavoisier; a renowned PHP framework for configuration flexibility, Symfony; and a MySQL database. VO life-cycle and operational information, EGEE broadcast and downtime notifications are next in line for the major reorganization, until all other key features of the Operations Portal are migrated to the framework. Feature specifications will be sketched at the same time to adapt to EGI requirements. Future work on feature regionalisation, new advanced features and strategy planning will be tracked in EGI-InSPIRE through the Operations Tools Advisory Group (OTAG), where all users, customers and third parties of the Operations Portal are represented from January 2010.
Ergatis: a web interface and scalable software system for bioinformatics workflows
Orvis, Joshua; Crabtree, Jonathan; Galens, Kevin; Gussman, Aaron; Inman, Jason M.; Lee, Eduardo; Nampally, Sreenath; Riley, David; Sundaram, Jaideep P.; Felix, Victor; Whitty, Brett; Mahurkar, Anup; Wortman, Jennifer; White, Owen; Angiuoli, Samuel V.
2010-01-01
Motivation: The growth of sequence data has been accompanied by an increasing need to analyze data on distributed computer clusters. The use of these systems for routine analysis requires scalable and robust software for data management of large datasets. Software is also needed to simplify data management and make large-scale bioinformatics analysis accessible and reproducible to a wide class of target users. Results: We have developed a workflow management system named Ergatis that enables users to build, execute and monitor pipelines for computational analysis of genomics data. Ergatis contains preconfigured components and template pipelines for a number of common bioinformatics tasks such as prokaryotic genome annotation and genome comparisons. Outputs from many of these components can be loaded into a Chado relational database. Ergatis was designed to be accessible to a broad class of users and provides a user friendly, web-based interface. Ergatis supports high-throughput batch processing on distributed compute clusters and has been used for data management in a number of genome annotation and comparative genomics projects. Availability: Ergatis is an open-source project and is freely available at http://ergatis.sourceforge.net Contact: jorvis@users.sourceforge.net PMID:20413634
CellAtlasSearch: a scalable search engine for single cells.
Srivastava, Divyanshu; Iyer, Arvind; Kumar, Vibhor; Sengupta, Debarka
2018-05-21
Owing to the advent of high-throughput single-cell transcriptomics, the past few years have seen exponential growth in the production of gene expression data. Recently, efforts have been made by various research groups to homogenize and store single-cell expression data from a large number of studies. The true value of this ever-increasing data deluge can be unlocked by making it searchable. To this end, we propose CellAtlasSearch, a novel search architecture for high-dimensional expression data, which is massively parallel as well as lightweight, and thus highly scalable. In CellAtlasSearch, we use a Graphical Processing Unit (GPU) friendly version of Locality Sensitive Hashing (LSH) for unmatched speedup in data processing and query. Currently, CellAtlasSearch features over 300,000 reference expression profiles, including both bulk and single-cell data. It enables the user to query individual single-cell transcriptomes and find matching samples from the database along with the necessary meta-information. CellAtlasSearch aims to assist researchers and clinicians in characterizing unannotated single cells. It also facilitates noise-free, low-dimensional representation of single-cell expression profiles by projecting them onto a wide variety of reference samples. The web server is accessible at: http://www.cellatlassearch.com.
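A minimal CPU sketch of the random-hyperplane flavour of LSH behind such a search: each expression profile is reduced to a short binary signature, and Hamming similarity between signatures approximates cosine similarity between profiles. The atlas size, bit count and data are illustrative; CellAtlasSearch's actual GPU variant will differ in detail.

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes, n_bits = 2000, 64

# Reference atlas: rows are (bulk or single-cell) expression profiles.
atlas = rng.lognormal(size=(5000, n_genes)).astype(np.float32)

# Random-hyperplane LSH: each profile becomes a short binary signature.
planes = rng.standard_normal((n_genes, n_bits)).astype(np.float32)
signatures = (atlas @ planes) > 0

def query(cell):
    sig = (cell @ planes) > 0
    # Fraction of agreeing bits approximates cosine similarity.
    sims = (signatures == sig).mean(axis=1)
    return np.argsort(sims)[::-1][:5]   # indices of best-matching samples

print(query(atlas[42]))   # should rank sample 42 first
```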
Probabilistic Learning in Junior High School: Investigation of Student Probabilistic Thinking Levels
NASA Astrophysics Data System (ADS)
Kurniasih, R.; Sujadi, I.
2017-09-01
This paper investigates students' levels of probabilistic thinking, that is, thinking about probabilistic or uncertain matters in probability material. The research subjects were students in grade 8 of junior high school. The main instrument was the researcher, with a probabilistic thinking skills test and interview guidelines as supporting instruments. Data were analyzed using a triangulation method. The results showed that before instruction the students' probabilistic thinking was at the subjective and transitional levels, and that it changed after instruction. Based on the results, some 8th-grade students reached the highest, numerical level of probabilistic thinking. Students' probabilistic thinking levels can be used as a reference for designing learning materials and strategies.
A spatial database for landslides in northern Bavaria: A methodological approach
NASA Astrophysics Data System (ADS)
Jäger, Daniel; Kreuzer, Thomas; Wilde, Martina; Bemm, Stefan; Terhorst, Birgit
2018-04-01
Landslide databases provide essential information for hazard modeling, damage to buildings and infrastructure, mitigation, and research needs. This study presents the development of a landslide database system named WISL (Würzburg Information System on Landslides), currently storing detailed landslide data for northern Bavaria, Germany, in order to enable scientific queries as well as comparisons with other regional landslide inventories. WISL is based on free open-source software (PostgreSQL, PostGIS), ensuring good interoperability between the various software components and enabling further extensions with specific adaptations of self-developed software. Beyond that, WISL was designed for easy communication with other databases. As a central prerequisite for standardized, homogeneous data acquisition in the field, a customized data sheet for landslide description was compiled. This sheet also serves as an input mask for all data registration procedures in WISL. A variety of "in-database" solutions for landslide analysis provides the necessary scalability for the database, enabling operations at the local server. In its current state, WISL already enables extensive analysis and queries. This paper presents an example analysis of landslides in Oxfordian limestones in the northeastern Franconian Alb, northern Bavaria. The results reveal widely differing landslides in terms of geometry and size. Further queries related to landslide activity classify the majority of the landslides as currently inactive; however, they clearly possess a certain potential for remobilization. Along with some active mass movements, a significant percentage of landslides potentially endangers residential areas or infrastructure. The main aspect of future enhancements of the WISL database concerns data extensions to increase research possibilities, as well as transferring the system to other regions and countries.
SeqWare Query Engine: storing and searching sequence data in the cloud.
O'Connor, Brian D; Merriman, Barry; Nelson, Stanley F
2010-12-21
Since the introduction of next-generation DNA sequencers, the rapid increase in sequencer throughput, and the associated drop in costs, has resulted in more than a dozen human genomes being resequenced over the last few years. These efforts are merely a prelude for a future in which genome resequencing will be commonplace for both biomedical research and clinical applications. The dramatic increase in sequencer output strains all facets of computational infrastructure, especially databases and query interfaces. The advent of cloud computing, and a variety of powerful tools designed to process petascale datasets, provide a compelling solution to these ever-increasing demands. In this work, we present the SeqWare Query Engine, which has been created using modern cloud computing technologies and designed to support databasing information from thousands of genomes. Our backend implementation was built using the highly scalable, NoSQL HBase database from the Hadoop project. We also created a web-based frontend that provides both a programmatic and interactive query interface and integrates with widely used genome browsers and tools. Using the query engine, users can load and query variants (SNVs, indels, translocations, etc.) with a rich level of annotations including coverage and functional consequences. As a proof of concept we loaded several whole genome datasets including the U87MG cell line. We also used a glioblastoma multiforme tumor/normal pair to both profile performance and provide an example of using the Hadoop MapReduce framework within the query engine. This software is open source and freely available from the SeqWare project (http://seqware.sourceforge.net). The SeqWare Query Engine provided an easy way to make the U87MG genome accessible to programmers and non-programmers alike. This enabled a faster and more open exploration of results, quicker tuning of parameters for heuristic variant calling filters, and a common data interface to simplify development of analytical tools. The range of data types supported, the ease of querying and integrating with existing tools, and the robust scalability of the underlying cloud-based technologies make SeqWare Query Engine a natural fit for storing and searching ever-growing genome sequence datasets.
SeqWare Query Engine: storing and searching sequence data in the cloud
2010-01-01
Background: Since the introduction of next-generation DNA sequencers, the rapid increase in sequencer throughput, and the associated drop in costs, has resulted in more than a dozen human genomes being resequenced over the last few years. These efforts are merely a prelude for a future in which genome resequencing will be commonplace for both biomedical research and clinical applications. The dramatic increase in sequencer output strains all facets of computational infrastructure, especially databases and query interfaces. The advent of cloud computing, and a variety of powerful tools designed to process petascale datasets, provide a compelling solution to these ever-increasing demands. Results: In this work, we present the SeqWare Query Engine, which has been created using modern cloud computing technologies and designed to support databasing information from thousands of genomes. Our backend implementation was built using the highly scalable, NoSQL HBase database from the Hadoop project. We also created a web-based frontend that provides both a programmatic and interactive query interface and integrates with widely used genome browsers and tools. Using the query engine, users can load and query variants (SNVs, indels, translocations, etc.) with a rich level of annotations including coverage and functional consequences. As a proof of concept we loaded several whole genome datasets including the U87MG cell line. We also used a glioblastoma multiforme tumor/normal pair to both profile performance and provide an example of using the Hadoop MapReduce framework within the query engine. This software is open source and freely available from the SeqWare project (http://seqware.sourceforge.net). Conclusions: The SeqWare Query Engine provided an easy way to make the U87MG genome accessible to programmers and non-programmers alike. This enabled a faster and more open exploration of results, quicker tuning of parameters for heuristic variant calling filters, and a common data interface to simplify development of analytical tools. The range of data types supported, the ease of querying and integrating with existing tools, and the robust scalability of the underlying cloud-based technologies make SeqWare Query Engine a natural fit for storing and searching ever-growing genome sequence datasets. PMID:21210981
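A tiny self-contained illustration of the row-key idea behind HBase backends of this kind: if variant rows are keyed by chromosome plus zero-padded position, a positional range query becomes a cheap scan over a sorted key space. The key layout here is our guess for illustration, not SeqWare's actual schema.

```python
import bisect

def row_key(chrom, pos):
    # Zero-padding makes lexicographic order match numeric order,
    # so variants at nearby positions are adjacent in the key space.
    return f"{chrom}:{pos:09d}"

store = {}   # stand-in for a sorted, distributed key-value store
for pos, ref, alt in [(14370, "G", "A"), (17330, "T", "C"),
                      (1110696, "A", "G")]:
    store[row_key("chr20", pos)] = {"ref": ref, "alt": alt}

keys = sorted(store)

def scan(chrom, start, stop):
    lo = bisect.bisect_left(keys, row_key(chrom, start))
    hi = bisect.bisect_right(keys, row_key(chrom, stop))
    return [(k, store[k]) for k in keys[lo:hi]]

print(scan("chr20", 10000, 20000))   # returns the two nearby variants
```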
NASA Astrophysics Data System (ADS)
Sari, Dwi Ivayana; Budayasa, I. Ketut; Juniati, Dwi
2017-08-01
The formulation of mathematical learning goals is now oriented not only towards cognitive products but also towards cognitive processes, such as probabilistic thinking. Probabilistic thinking is needed by students to make decisions, and elementary school students are required to develop it as a foundation for learning probability at higher levels. A framework of students' probabilistic thinking had been developed using the SOLO taxonomy, consisting of prestructural, unistructural, multistructural and relational probabilistic thinking. This study aimed to analyze probability task completion based on this taxonomy of probabilistic thinking. The subjects were two fifth-grade students, a boy and a girl, selected by a test of mathematical ability on the basis of high math ability. The subjects were given probability tasks covering sample space, the probability of an event, and probability comparison. The data analysis consisted of categorization, reduction, interpretation and conclusion, and the credibility of the data was established through time triangulation. The results showed that the boy's probabilistic thinking in completing the probability tasks was at the multistructural level, while the girl's was at the unistructural level; the boy's level was thus higher than the girl's. The results could help curriculum developers formulate probability learning goals for elementary school students, and teachers could teach probability with regard to gender differences.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Fisher, D.
Concerns about the long-term viability of SFS as the metadata store for HPSS have been increasing. A concern that Transarc may discontinue support for SFS motivates us to consider alternative means to store HPSS metadata. The obvious alternative is a commercial database. Commercial databases have the necessary characteristics for storage of HPSS metadata records. They are robust and scalable and can easily accommodate the volume of data that must be stored. They provide programming interfaces, transactional semantics and a full set of maintenance and performance enhancement tools. A team was organized within the HPSS project to study and recommend an approach for the replacement of SFS. Members of the team are David Fisher, Jim Minton, Donna Mecozzi, Danny Cook, Bart Parliman and Lynn Jones. We examined several possible solutions to the problem of replacing SFS, and recommended on May 22, 2000, in a report to the HPSS Technical and Executive Committees, to change HPSS into a database application over either Oracle or DB2. We recommended either Oracle or DB2 on the basis of market share and technical suitability. Oracle and DB2 are dominant offerings in the market, and it is in the best interest of HPSS to use a major player's product. Both databases provide a suitable programming interface. Transaction management functions, support for multi-threaded clients and data manipulation languages (DML) are available. These findings were supported in meetings held with technical experts from both companies. In both cases, the evidence indicated that either database would provide the features needed to host HPSS.
Time and Space Efficient Algorithms for Two-Party Authenticated Data Structures
NASA Astrophysics Data System (ADS)
Papamanthou, Charalampos; Tamassia, Roberto
Authentication is increasingly relevant to data management. Data are being outsourced to untrusted servers, and clients want to securely update and query their data. For example, in database outsourcing, a client's database is stored and maintained by an untrusted server. Likewise, in simple storage systems, clients can store very large amounts of data, but at the same time they want to verify their integrity when they retrieve them. In this paper, we present a model and protocol for two-party authentication of data structures. Namely, a client outsources its data structure and verifies that the answers to its queries have not been tampered with. We provide efficient algorithms to securely outsource a skip list with logarithmic time overhead at the server and client and logarithmic communication cost, thus providing an efficient authentication primitive for outsourced data, both structured (e.g., relational databases) and semi-structured (e.g., XML documents). In our technique, the client stores only a constant amount of space, which is optimal. Our two-party authentication framework can be deployed on top of existing storage applications, thus providing an efficient authentication service. Finally, we present experimental results that demonstrate the practical efficiency and scalability of our scheme.
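A heavily simplified sketch of the verification idea: the server returns an answer plus a path of sibling hashes, and the client recomputes the single digest it stores locally (constant space) to check integrity. Real authenticated skip lists use commutative hashing over the skip-list structure; the Merkle-style path below is only an analogy.

```python
import hashlib

def h(*parts):
    return hashlib.sha256(b"|".join(parts)).hexdigest().encode()

# Server side: a hash tree over four records.
leaves = [h(x) for x in [b"rec0", b"rec1", b"rec2", b"rec3"]]
n01, n23 = h(leaves[0], leaves[1]), h(leaves[2], leaves[3])
root = h(n01, n23)                 # the client keeps only this digest

# Server answers a query for rec2 with the record and a hash path.
answer, proof = b"rec2", [(leaves[3], "right"), (n01, "left")]

# Client side: recompute the root from the answer and the proof.
digest = h(answer)
for sibling, side in proof:
    digest = h(digest, sibling) if side == "right" else h(sibling, digest)
assert digest == root, "answer was tampered with"
print("verified")
```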
The Matchmaker Exchange: a platform for rare disease gene discovery.
Philippakis, Anthony A; Azzariti, Danielle R; Beltran, Sergi; Brookes, Anthony J; Brownstein, Catherine A; Brudno, Michael; Brunner, Han G; Buske, Orion J; Carey, Knox; Doll, Cassie; Dumitriu, Sergiu; Dyke, Stephanie O M; den Dunnen, Johan T; Firth, Helen V; Gibbs, Richard A; Girdea, Marta; Gonzalez, Michael; Haendel, Melissa A; Hamosh, Ada; Holm, Ingrid A; Huang, Lijia; Hurles, Matthew E; Hutton, Ben; Krier, Joel B; Misyura, Andriy; Mungall, Christopher J; Paschall, Justin; Paten, Benedict; Robinson, Peter N; Schiettecatte, François; Sobreira, Nara L; Swaminathan, Ganesh J; Taschner, Peter E; Terry, Sharon F; Washington, Nicole L; Züchner, Stephan; Boycott, Kym M; Rehm, Heidi L
2015-10-01
There are few better examples of the need for data sharing than in the rare disease community, where patients, physicians, and researchers must search for "the needle in a haystack" to uncover rare, novel causes of disease within the genome. Impeding the pace of discovery has been the existence of many small siloed datasets within individual research or clinical laboratory databases and/or disease-specific organizations, hoping for serendipitous occasions when two distant investigators happen to learn they have a rare phenotype in common and can "match" these cases to build evidence for causality. However, serendipity has never proven to be a reliable or scalable approach in science. As such, the Matchmaker Exchange (MME) was launched to provide a robust and systematic approach to rare disease gene discovery through the creation of a federated network connecting databases of genotypes and rare phenotypes using a common application programming interface (API). The core building blocks of the MME have been defined and assembled. Three MME services have now been connected through the API and are available for community use. Additional databases that support internal matching are anticipated to join the MME network as it continues to grow. © 2015 WILEY PERIODICALS, INC.
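For flavor, a match request against an MME service looks roughly like the sketch below. The endpoint path, media type and field names follow our reading of the published MME API specification and should be verified against the current version; the URL, token and patient content are placeholders.

```python
import requests

MME_ENDPOINT = "https://example-mme-service.org/match"   # placeholder URL

payload = {
    "patient": {
        "id": "case-001",
        "contact": {"name": "Example Clinician",
                    "href": "mailto:clinician@example.org"},
        # Phenotype as HPO terms; genotype as candidate genes.
        "features": [{"id": "HP:0001250"}, {"id": "HP:0000252"}],
        "genomicFeatures": [{"gene": {"id": "EFTUD2"}}],
    }
}
headers = {
    "Content-Type": "application/vnd.ga4gh.matchmaker.v1.0+json",
    "Accept": "application/vnd.ga4gh.matchmaker.v1.0+json",
    "X-Auth-Token": "<token issued by the MME service>",  # placeholder
}
resp = requests.post(MME_ENDPOINT, json=payload, headers=headers)
for match in resp.json().get("results", []):
    print(match.get("score"), match.get("patient", {}).get("id"))
```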
The Virtual Xenbase: transitioning an online bioinformatics resource to a private cloud.
Karimi, Kamran; Vize, Peter D
2014-01-01
As a model organism database, Xenbase has been providing informatics and genomic data on Xenopus (Silurana) tropicalis and Xenopus laevis frogs for more than a decade. The Xenbase database contains curated, as well as community-contributed and automatically harvested literature, gene and genomic data. A GBrowse genome browser, a BLAST+ server and stock center support are available on the site. When this resource was first built, all software services and components in Xenbase ran on a single physical server, with inherent reliability, scalability and inter-dependence issues. Recent advances in networking and virtualization techniques allowed us to move Xenbase to a virtual environment, and more specifically to a private cloud. To do so we decoupled the different software services and components, such that each would run on a different virtual machine. In the process, we also upgraded many of the components. The resulting system is faster and more reliable. System maintenance is easier, as individual virtual machines can now be updated, backed up and changed independently. We are also experiencing more effective resource allocation and utilization. Database URL: www.xenbase.org. © The Author(s) 2014. Published by Oxford University Press.
The Matchmaker Exchange: A Platform for Rare Disease Gene Discovery
Philippakis, Anthony A.; Azzariti, Danielle R.; Beltran, Sergi; Brookes, Anthony J.; Brownstein, Catherine A.; Brudno, Michael; Brunner, Han G.; Buske, Orion J.; Carey, Knox; Doll, Cassie; Dumitriu, Sergiu; Dyke, Stephanie O.M.; den Dunnen, Johan T.; Firth, Helen V.; Gibbs, Richard A.; Girdea, Marta; Gonzalez, Michael; Haendel, Melissa A.; Hamosh, Ada; Holm, Ingrid A.; Huang, Lijia; Hurles, Matthew E.; Hutton, Ben; Krier, Joel B.; Misyura, Andriy; Mungall, Christopher J.; Paschall, Justin; Paten, Benedict; Robinson, Peter N.; Schiettecatte, François; Sobreira, Nara L.; Swaminathan, Ganesh J.; Taschner, Peter E.; Terry, Sharon F.; Washington, Nicole L.; Züchner, Stephan; Boycott, Kym M.; Rehm, Heidi L.
2015-01-01
There are few better examples of the need for data sharing than in the rare disease community, where patients, physicians, and researchers must search for “the needle in a haystack” to uncover rare, novel causes of disease within the genome. Impeding the pace of discovery has been the existence of many small siloed datasets within individual research or clinical laboratory databases and/or disease-specific organizations, hoping for serendipitous occasions when two distant investigators happen to learn they have a rare phenotype in common and can “match” these cases to build evidence for causality. However, serendipity has never proven to be a reliable or scalable approach in science. As such, the Matchmaker Exchange (MME) was launched to provide a robust and systematic approach to rare disease gene discovery through the creation of a federated network connecting databases of genotypes and rare phenotypes using a common application programming interface (API). The core building blocks of the MME have been defined and assembled. Three MME services have now been connected through the API and are available for community use. Additional databases that support internal matching are anticipated to join the MME network as it continues to grow. PMID:26295439
The BioMart community portal: an innovative alternative to large, centralized data repositories
Smedley, Damian; Haider, Syed; Durinck, Steffen; Pandini, Luca; Provero, Paolo; Allen, James; Arnaiz, Olivier; Awedh, Mohammad Hamza; Baldock, Richard; Barbiera, Giulia; Bardou, Philippe; Beck, Tim; Blake, Andrew; Bonierbale, Merideth; Brookes, Anthony J.; Bucci, Gabriele; Buetti, Iwan; Burge, Sarah; Cabau, Cédric; Carlson, Joseph W.; Chelala, Claude; Chrysostomou, Charalambos; Cittaro, Davide; Collin, Olivier; Cordova, Raul; Cutts, Rosalind J.; Dassi, Erik; Genova, Alex Di; Djari, Anis; Esposito, Anthony; Estrella, Heather; Eyras, Eduardo; Fernandez-Banet, Julio; Forbes, Simon; Free, Robert C.; Fujisawa, Takatomo; Gadaleta, Emanuela; Garcia-Manteiga, Jose M.; Goodstein, David; Gray, Kristian; Guerra-Assunção, José Afonso; Haggarty, Bernard; Han, Dong-Jin; Han, Byung Woo; Harris, Todd; Harshbarger, Jayson; Hastings, Robert K.; Hayes, Richard D.; Hoede, Claire; Hu, Shen; Hu, Zhi-Liang; Hutchins, Lucie; Kan, Zhengyan; Kawaji, Hideya; Keliet, Aminah; Kerhornou, Arnaud; Kim, Sunghoon; Kinsella, Rhoda; Klopp, Christophe; Kong, Lei; Lawson, Daniel; Lazarevic, Dejan; Lee, Ji-Hyun; Letellier, Thomas; Li, Chuan-Yun; Lio, Pietro; Liu, Chu-Jun; Luo, Jie; Maass, Alejandro; Mariette, Jerome; Maurel, Thomas; Merella, Stefania; Mohamed, Azza Mostafa; Moreews, Francois; Nabihoudine, Ibounyamine; Ndegwa, Nelson; Noirot, Céline; Perez-Llamas, Cristian; Primig, Michael; Quattrone, Alessandro; Quesneville, Hadi; Rambaldi, Davide; Reecy, James; Riba, Michela; Rosanoff, Steven; Saddiq, Amna Ali; Salas, Elisa; Sallou, Olivier; Shepherd, Rebecca; Simon, Reinhard; Sperling, Linda; Spooner, William; Staines, Daniel M.; Steinbach, Delphine; Stone, Kevin; Stupka, Elia; Teague, Jon W.; Dayem Ullah, Abu Z.; Wang, Jun; Ware, Doreen; Wong-Erasmus, Marie; Youens-Clark, Ken; Zadissa, Amonida; Zhang, Shi-Jian; Kasprzyk, Arek
2015-01-01
The BioMart Community Portal (www.biomart.org) is a community-driven effort to provide a unified interface to biomedical databases that are distributed worldwide. The portal provides access to numerous database projects supported by 30 scientific organizations. It includes over 800 different biological datasets spanning genomics, proteomics, model organisms, cancer data, ontology information and more. All resources available through the portal are independently administered and funded by their host organizations. The BioMart data federation technology provides a unified interface to all the available data. The latest version of the portal comes with many new databases that have been created by our ever-growing community. It also comes with better support and extensibility for data analysis and visualization tools. A new addition to our toolbox, the enrichment analysis tool is now accessible through graphical and web service interface. The BioMart community portal averages over one million requests per day. Building on this level of service and the wealth of information that has become available, the BioMart Community Portal has introduced a new, more scalable and cheaper alternative to the large data stores maintained by specialized organizations. PMID:25897122
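As a taste of the federation interface, BioMart marts expose an XML query language over a simple martservice endpoint. The sketch below follows the common Ensembl-style deployment; the host, dataset, filter and attribute names are assumptions to be adapted to the mart of interest.

```python
import requests

# Ensembl-style BioMart deployment; other community marts expose the
# same /martservice interface under their own hosts (assumed here).
URL = "https://www.ensembl.org/biomart/martservice"

xml_query = """<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE Query>
<Query virtualSchemaName="default" formatter="TSV" header="0"
       uniqueRows="1" datasetConfigVersion="0.6">
  <Dataset name="hsapiens_gene_ensembl" interface="default">
    <Filter name="chromosome_name" value="21"/>
    <Attribute name="ensembl_gene_id"/>
    <Attribute name="hgnc_symbol"/>
  </Dataset>
</Query>"""

resp = requests.get(URL, params={"query": xml_query}, timeout=60)
for line in resp.text.splitlines()[:5]:
    print(line)   # TSV rows: gene ID, symbol
```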
The Matchmaker Exchange: A Platform for Rare Disease Gene Discovery
DOE Office of Scientific and Technical Information (OSTI.GOV)
Philippakis, Anthony A.; Azzariti, Danielle R.; Beltran, Sergi
There are few better examples of the need for data sharing than in the rare disease community, where patients, physicians, and researchers must search for "the needle in a haystack" to uncover rare, novel causes of disease within the genome. Impeding the pace of discovery has been the existence of many small siloed datasets within individual research or clinical laboratory databases and/or disease-specific organizations, hoping for serendipitous occasions when two distant investigators happen to learn they have a rare phenotype in common and can "match" these cases to build evidence for causality. However, serendipity has never proven to be a reliable or scalable approach in science. As such, the Matchmaker Exchange (MME) was launched to provide a robust and systematic approach to rare disease gene discovery through the creation of a federated network connecting databases of genotypes and rare phenotypes using a common application programming interface (API). The core building blocks of the MME have been defined and assembled. In conclusion, three MME services have now been connected through the API and are available for community use. Additional databases that support internal matching are anticipated to join the MME network as it continues to grow.
Earth science big data at users' fingertips: the EarthServer Science Gateway Mobile
NASA Astrophysics Data System (ADS)
Barbera, Roberto; Bruno, Riccardo; Calanducci, Antonio; Fargetta, Marco; Pappalardo, Marco; Rundo, Francesco
2014-05-01
The EarthServer project (www.earthserver.eu), funded by the European Commission under its Seventh Framework Program, aims at establishing open access and ad-hoc analytics on extreme-size Earth Science data, based on and extending leading-edge Array Database technology. The core idea is to use database query languages as client/server interface to achieve barrier-free "mix & match" access to multi-source, any-size, multi-dimensional space-time data -- in short: "Big Earth Data Analytics" - based on the open standards of the Open Geospatial Consortium Web Coverage Processing Service (OGC WCPS) and the W3C XQuery. EarthServer combines both, thereby achieving a tight data/metadata integration. Further, the rasdaman Array Database System (www.rasdaman.com) is extended with further space-time coverage data types. On server side, highly effective optimizations - such as parallel and distributed query processing - ensure scalability to Exabyte volumes. In this contribution we will report on the EarthServer Science Gateway Mobile, an app for both iOS and Android-based devices that allows users to seamlessly access some of the EarthServer applications using SAML-based federated authentication and fine-grained authorisation mechanisms.
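A small sketch of what "database query languages as client/server interface" means in practice: a WCPS expression is shipped to a rasdaman/petascope endpoint over HTTP, and only the server-side-computed result comes back. The endpoint and coverage name below come from rasdaman's public demo and are assumptions here.

```python
import requests

ENDPOINT = "https://ows.rasdaman.org/rasdaman/ows"   # public demo server

# WCPS: slice a space-time datacube server-side and return only the result.
wcps = ('for c in (AvgLandTemp) '
        'return encode(c[Lat(53.08), Long(8.80), '
        'ansi("2014-01":"2014-12")], "csv")')

resp = requests.get(ENDPOINT, params={
    "service": "WCS", "version": "2.0.1",
    "request": "ProcessCoverages", "query": wcps,
}, timeout=60)
print(resp.text)   # monthly temperature values for one location
```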
VisANT 3.0: new modules for pathway visualization, editing, prediction and construction.
Hu, Zhenjun; Ng, David M; Yamada, Takuji; Chen, Chunnuan; Kawashima, Shuichi; Mellor, Joe; Linghu, Bolan; Kanehisa, Minoru; Stuart, Joshua M; DeLisi, Charles
2007-07-01
With the integration of the KEGG and Predictome databases as well as two search engines for coexpressed genes/proteins using data sets obtained from the Stanford Microarray Database (SMD) and Gene Expression Omnibus (GEO) database, VisANT 3.0 supports exploratory pathway analysis, which includes multi-scale visualization of multiple pathways, editing and annotating pathways using a KEGG compatible visual notation and visualization of expression data in the context of pathways. Expression levels are represented either by color intensity or by nodes with an embedded expression profile. Multiple experiments can be navigated or animated. Known KEGG pathways can be enriched by querying either coexpressed components of known pathway members or proteins with known physical interactions. Predicted pathways for genes/proteins with unknown functions can be inferred from coexpression or physical interaction data. Pathways produced in VisANT can be saved as computer-readable XML format (VisML), graphic images or high-resolution Scalable Vector Graphics (SVG). Pathways in the format of VisML can be securely shared within an interested group or published online using a simple Web link. VisANT is freely available at http://visant.bu.edu.
A machine learning system to improve heart failure patient assistance.
Guidi, Gabriele; Pettenati, Maria Chiara; Melillo, Paolo; Iadanza, Ernesto
2014-11-01
In this paper, we present a clinical decision support system (CDSS) for the analysis of heart failure (HF) patients, providing various outputs such as an HF severity evaluation and HF-type prediction, as well as a management interface that compares the different patients' follow-ups. The whole system is composed of an intelligent core and an HF special-purpose management tool, which also acts as the interface for training and using the artificial intelligence. To implement the intelligent functions, we adopted a machine learning approach. In this paper, we compare the performance of a neural network (NN), a support vector machine, a system of genetically produced fuzzy rules, and a classification and regression tree together with its direct evolution, the random forest, in analyzing our database. The best performance in both the HF severity evaluation and HF-type prediction functions is obtained with the random forest algorithm. The management tool allows the cardiologist to populate a "supervised database" suitable for machine learning during his or her regular outpatient consultations. The idea stems from the fact that few databases of this type exist in the literature, and those that do are not scalable to our case.
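A minimal sketch of the model comparison described above, on synthetic stand-in data rather than the paper's HF records; the genetically produced fuzzy-rule system has no off-the-shelf scikit-learn equivalent and is omitted, and all hyperparameters are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for HF follow-up records (not the paper's data).
X, y = make_classification(n_samples=400, n_features=12, random_state=0)

models = {
    "neural network": MLPClassifier(max_iter=2000, random_state=0),
    "SVM": SVC(),
    "CART": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(random_state=0),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name:15s} accuracy: {scores.mean():.2f}")
```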
The Design of a High Performance Earth Imagery and Raster Data Management and Processing Platform
NASA Astrophysics Data System (ADS)
Xie, Qingyun
2016-06-01
This paper summarizes the general requirements and specific characteristics of both a geospatial raster database management system and a raster data processing platform, from a domain-specific perspective as well as from a computing point of view. It also discusses the need for tight integration between the database system and the processing system. These requirements resulted in Oracle Spatial GeoRaster, a global-scale, high-performance earth imagery and raster data management and processing platform. The rationale, design, implementation, and benefits of Oracle Spatial GeoRaster are described. Basically, as a database management system, GeoRaster defines an integrated raster data model and supports image compression, data manipulation, general and spatial indices, content- and context-based queries and updates, versioning, concurrency, security, replication, standby, backup and recovery, multitenancy, and ETL. It provides high scalability using computer and storage clustering. As a raster data processing platform, GeoRaster provides basic operations, image processing, raster analytics, and data distribution featuring high performance computing (HPC). Specifically, HPC features include locality computing, concurrent processing, parallel processing, and in-memory computing. In addition, the APIs and the plug-in architecture are discussed.
NASA Astrophysics Data System (ADS)
Schaefer, Andreas; Daniell, James; Khazai, Bijan; Wenzel, Friedemann
2016-04-01
The occurrence of a strong earthquake often triggers a long-lasting seismic sequence: strong earthquakes are generally followed by many aftershocks or even strong subsequently triggered ruptures. The Nepal 2015 earthquake sequence is one of the most recent examples in which aftershocks contributed significantly to human and economic losses. In addition, rumours about upcoming mega-earthquakes, false predictions and on-going cycles of aftershocks placed a psychological burden on society, which caused panic and additional casualties and prevented people from returning to normal life. This study presents the current phase of development of an operationalised aftershock intensity index, which will contribute to the mitigation of aftershock hazard. Various methods of earthquake forecasting and seismic risk assessment are utilised and integrated into an inherent aftershock intensity. A spatio-temporal analysis of past earthquake clustering provides first-hand data about the nature of aftershock occurrence. Epidemic-type methods can additionally provide time-dependent variation indices of the cascading effects of aftershock generation. The aftershock hazard is combined with the potential for significant losses through the vulnerability of structural systems and population. A historical database of the socioeconomic effects of aftershocks from CATDAT has been used to calibrate the index against the observed impacts of historical events and their aftershocks. In addition, the cyclic behaviour and fragility functions of various building typologies are explored analytically. The integration of many different probabilistic computation methods provides a combined index parameter, which can then be transformed into an easy-to-read spatio-temporal intensity index. The index provides daily updated information about the probability of the inherent seismic risk of aftershocks through a scalable scheme for different aftershock intensities. These intensities define the spatial locations and the temporal period in which aftershocks are either probable or damaging. Instead of a highly scientific probability mash-up, the aftershock intensity index is an easy-to-communicate system of intensity levels for rescue and relief organizations, as well as governments and the general public. For this study, the metric is tested retrospectively on the earthquake sequences of Nepal 2015 and Darfield-Christchurch 2010/2011.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Zhou, C
2009-11-12
In FY09 they will (1) complete the implementation, verification, calibration, and sensitivity and scalability analysis of the in-cell virus replication model; (2) complete the design of the cell culture (cell-to-cell infection) model; (3) continue the research, design, and development of their bioinformatics tools: the Web-based structure-alignment-based sequence variability tool and the functional annotation of the genome database; (4) collaborate with the University of California at San Francisco on areas of common interest; and (5) submit journal articles that describe the in-cell model with simulations and the bioinformatics approaches to evaluation of genome variability and fitness.
NASA Astrophysics Data System (ADS)
Baumgartner, Peter O.
A database of Middle Jurassic-Early Cretaceous radiolarians, consisting of first and final occurrences of 110 species in 226 samples from 43 localities, was used to compute Unitary Associations and probabilistic ranking and scaling (RASC), in order to test deterministic versus probabilistic quantitative biostratigraphic methods. Because the Mesozoic radiolarian fossil record is mainly dissolution-controlled, the sequence of events differs greatly from section to section. The scatter of local first and final appearances along a time scale is large compared to the species range; it is asymmetrical, with a maximum near the ends of the range, and it is non-random. Thus, these data do not satisfy the statistical assumptions made in ranking and scaling. Unitary Associations produce maximum ranges of the species relative to each other by stacking co-occurrence data from all sections, and therefore compensate for the local dissolution effects. Ranking and scaling, based on the assumption of a normal random distribution of the events, produces average ranges which for most species are much shorter than the maximum UA-ranges. There are, however, a number of species with similar ranges in both solutions. These species are believed to be the most dissolution-resistant and, therefore, the most reliable ones for the definition of biochronozones. The comparison of maximum and average ranges may be a powerful tool for testing the reliability of species for biochronology. Dissolution-controlled fossil data yield high crossover frequencies and therefore small, statistically insignificant interfossil distances. Scaling has not produced a useful sequence for this type of data.
NASA Astrophysics Data System (ADS)
Armand, P.; Brocheton, F.; Poulet, D.; Vendel, F.; Dubourg, V.; Yalamas, T.
2014-10-01
This paper is an original contribution to uncertainty quantification in atmospheric transport & dispersion (AT&D) at the local scale (1-10 km). It is proposed to account for the imprecise knowledge of the meteorological and release conditions in the case of an accidental hazardous atmospheric emission. The aim is to produce probabilistic risk maps instead of a deterministic toxic load map in order to help stakeholders make their decisions. Given the urgency attached to such situations, the proposed methodology is able to produce such maps in a limited amount of time. It resorts to a Lagrangian particle dispersion model (LPDM) using wind fields interpolated from a pre-established database that collects the results from a computational fluid dynamics (CFD) model. This decouples the CFD simulations from the dispersion analysis, yielding a considerable saving of computational time. In order to make the Monte-Carlo-sampling-based estimation of the probability field even faster, it is also proposed to use a vector Gaussian process (GP) surrogate model together with high performance computing (HPC) resources. The GP surrogate modelling technique is coupled with a probabilistic principal component analysis (PCA) to reduce the number of GP predictors to fit, store and predict. The design of experiments (DOE) from which the surrogate model is built is run over a cluster of PCs to make the total production time as short as possible. The use of GP predictors is validated by comparing the results produced by this technique with those obtained by crude Monte Carlo sampling.
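A minimal sketch of this surrogate strategy follows, assuming synthetic stand-ins for the DOE inputs and the vector-valued dispersion outputs; none of this is the authors' code or data.

```python
# Sketch: compress vector-valued dispersion-model outputs with PCA, then fit
# one Gaussian-process predictor per retained component. All data synthetic.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 3))            # meteorological/release parameters (DOE)
Y = np.sin(X @ rng.normal(size=(3, 50)))  # stand-in for 50-pixel toxic-load maps

pca = PCA(n_components=5).fit(Y)          # few components instead of 50 outputs
Z = pca.transform(Y)

gps = [GaussianProcessRegressor(kernel=RBF(), alpha=1e-6).fit(X, Z[:, k])
       for k in range(Z.shape[1])]        # one cheap GP per principal component

# Monte Carlo over uncertain inputs is now fast: predict components, invert PCA.
X_mc = rng.uniform(size=(10_000, 3))
Z_mc = np.column_stack([gp.predict(X_mc) for gp in gps])
risk_maps = pca.inverse_transform(Z_mc)   # ensemble of predicted load maps
print(risk_maps.shape)
```

Probability maps then follow by thresholding the ensemble of predicted maps at the toxic-load level of interest and counting exceedances per pixel.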
NASA Astrophysics Data System (ADS)
Nakagawa, Y.; Kawahara, S.; Araki, F.; Matsuoka, D.; Ishikawa, Y.; Fujita, M.; Sugimoto, S.; Okada, Y.; Kawazoe, S.; Watanabe, S.; Ishii, M.; Mizuta, R.; Murata, A.; Kawase, H.
2017-12-01
Analyses of large ensemble data are quite useful for producing probabilistic projections of the effects of climate change. Ensemble data of "+2K future climate simulations" are currently produced by the Japanese national project "Social Implementation Program on Climate Change Adaptation Technology (SI-CAT)" as a part of the database for Policy Decision making for Future climate change (d4PDF; Mizuta et al. 2016) produced by the Program for Risk Information on Climate Change. Those data consist of global warming simulations and regional downscaling simulations. Considering that those data volumes are too large (a few petabytes) to download to a user's local computer, a user-friendly system is required to search and download the data that satisfy users' requests. Under SI-CAT, we develop "a database system for near-future climate change projections" that provides functions for users to find the data they need. The system mainly consists of a relational database, a data download function and a user interface. The relational database, using PostgreSQL, is the key function among them. Temporally and spatially compressed data are registered in the relational database. As a first step, we develop the relational database for precipitation, temperature and typhoon track data, according to requests by SI-CAT members. The data download function, using the Open-source Project for a Network Data Access Protocol (OPeNDAP), provides a means to download temporally and spatially extracted data based on search results obtained from the relational database. We also develop a web-based user interface for using the relational database and the data download function. A prototype of the system is currently under operational testing on our local server. The system will be released on the Data Integration and Analysis System Program (DIAS) in fiscal year 2017. The techniques behind this database system might also be quite useful for simulation and observational data in other research fields. We report the current status of development and some case studies of the database system for near-future climate change projections.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Drotar, Alexander P.; Quinn, Erin E.; Sutherland, Landon D.
2012-07-30
The project description is: (1) build a high performance computer; and (2) create a tool to monitor node applications in the Component Based Tool Framework (CBTF) using code from the Lightweight Data Metric Service (LDMS). The importance of this project is that: (1) there is a need for a scalable, parallel tool to monitor nodes on clusters; and (2) new LDMS plugins need to be easily added to the tool. CBTF stands for Component Based Tool Framework. It is scalable and adjusts to different topologies automatically. It uses the MRNet (Multicast/Reduction Network) mechanism for information transport. CBTF is flexible and general enough to be used for any tool that needs to perform a task on many nodes. Its components are reusable and easily added to a new tool. There are three levels of CBTF: (1) the frontend node, which interacts with users; (2) filter nodes, which filter or concatenate information from backend nodes; and (3) backend nodes, where the actual work of the tool is done. LDMS stands for Lightweight Data Metric Service; it is a tool used for monitoring nodes. Ltool is the name of the tool we derived from LDMS. It is dynamically linked and includes the following components: Vmstat, Meminfo, Procinterrupts and more. It works as follows: the Ltool command is run on the frontend node; Ltool collects information from the backend nodes; the backend nodes send information to the filter nodes; and the filter nodes concatenate the information and send it to a database on the frontend node. Ltool is useful for monitoring nodes on a cluster because the overhead involved in running it is not particularly high and it automatically scales to any size of cluster.
SeqHBase: a big data toolset for family based sequencing data analysis.
He, Min; Person, Thomas N; Hebbring, Scott J; Heinzen, Ethan; Ye, Zhan; Schrodi, Steven J; McPherson, Elizabeth W; Lin, Simon M; Peissig, Peggy L; Brilliant, Murray H; O'Rawe, Jason; Robison, Reid J; Lyon, Gholson J; Wang, Kai
2015-04-01
Whole-genome sequencing (WGS) and whole-exome sequencing (WES) technologies are increasingly used to identify disease-contributing mutations in human genomic studies. It can be a significant challenge to process such data, especially when a large family or cohort is sequenced. Our objective was to develop a big data toolset to efficiently manipulate genome-wide variants, functional annotations and coverage, together with conducting family based sequencing data analysis. Hadoop is a framework for reliable, scalable, distributed processing of large data sets using MapReduce programming models. Based on Hadoop and HBase, we developed SeqHBase, a big data-based toolset for analysing family based sequencing data to detect de novo, inherited homozygous, or compound heterozygous mutations that may contribute to disease manifestations. SeqHBase takes as input BAM files (for coverage at every site), variant call format (VCF) files (for variant calls) and functional annotations (for variant prioritisation). We applied SeqHBase to a 5-member nuclear family and a 10-member 3-generation family with WGS data, as well as a 4-member nuclear family with WES data. Analysis times were almost linearly scalable with the number of data nodes. With 20 data nodes, SeqHBase took about 5 seconds to analyse WES familial data and approximately 1 minute to analyse WGS familial data. These results demonstrate SeqHBase's high efficiency and scalability, which is necessary as WGS and WES are rapidly becoming standard methods to study the genetics of familial disorders. Published by the BMJ Publishing Group Limited. For permission to use (where not already granted under a licence) please go to http://group.bmj.com/group/rights-licensing/permissions.
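The trio logic that such a toolset distributes across data nodes can be sketched in a few lines; the genotype encoding and variant keys below are hypothetical illustrations, not SeqHBase's actual API.

```python
# Toy sketch of trio-based de novo detection: a candidate de novo variant is
# heterozygous in the child while both parents are homozygous reference.
def is_de_novo(child_gt, father_gt, mother_gt):
    """Genotypes as allele tuples, e.g. (0, 1) = heterozygous."""
    return (child_gt in {(0, 1), (1, 0)}
            and father_gt == (0, 0) and mother_gt == (0, 0))

trio_calls = {  # hypothetical variant keys and calls
    "chr1:12345:A>G": {"child": (0, 1), "father": (0, 0), "mother": (0, 0)},
    "chr2:67890:C>T": {"child": (0, 1), "father": (0, 1), "mother": (0, 0)},
}
for variant, gts in trio_calls.items():
    if is_de_novo(gts["child"], gts["father"], gts["mother"]):
        print("candidate de novo:", variant)
```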
A Web-Based System for Bayesian Benchmark Dose Estimation.
Shao, Kan; Shapiro, Andrew J
2018-01-11
Benchmark dose (BMD) modeling is an important step in human health risk assessment and is used as the default approach to identify the point of departure for risk assessment. A probabilistic framework for dose-response assessment has been proposed and advocated by various institutions and organizations; therefore, a reliable tool is needed to provide distributional estimates for BMD and other important quantities in dose-response assessment. We developed an online system for Bayesian BMD (BBMD) estimation and compared results from this software with the U.S. Environmental Protection Agency's (EPA's) Benchmark Dose Software (BMDS). The system is built on a Bayesian framework featuring the application of Markov chain Monte Carlo (MCMC) sampling for model parameter estimation and BMD calculation, which makes the BBMD system fundamentally different from the currently prevailing BMD software packages. In addition to estimating the traditional BMDs for dichotomous and continuous data, the developed system is also capable of computing model-averaged BMD estimates. A total of 518 dichotomous and 108 continuous data sets extracted from the U.S. EPA's Integrated Risk Information System (IRIS) database (and similar databases) were used as testing data to compare the estimates from the BBMD and BMDS programs. The results suggest that the BBMD system may outperform the BMDS program in a number of aspects, including fewer failed BMD and BMDL calculations and estimates. The BBMD system is a useful alternative tool for estimating BMD, with additional functionalities for BMD analysis based on the most recent research. Most importantly, the BBMD has the potential to incorporate prior information to make dose-response modeling more reliable, and can provide distributional estimates for important quantities in dose-response assessment, which greatly facilitates the current trend toward probabilistic risk assessment. https://doi.org/10.1289/EHP1289.
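A toy sketch of the Bayesian machinery described above, assuming a quantal-linear dose-response model with flat priors and synthetic dose-group data; this illustrates the approach, not BBMD's implementation.

```python
# Metropolis MCMC over a quantal-linear model P(d) = g + (1-g)(1 - exp(-b d));
# the BMD for extra risk BMR follows as BMD = -ln(1-BMR)/b. Data are synthetic.
import numpy as np
from scipy.stats import binom

dose = np.array([0.0, 1.0, 3.0, 10.0])     # hypothetical dose groups
n    = np.array([50, 50, 50, 50])
k    = np.array([2, 6, 13, 31])            # responders (made up)
BMR  = 0.10                                # 10% extra risk

def log_post(g, b):
    if not (0 < g < 1 and b > 0):
        return -np.inf                      # flat prior on the valid region
    p = g + (1 - g) * (1 - np.exp(-b * dose))
    return binom.logpmf(k, n, p).sum()

rng = np.random.default_rng(1)
g, b, lp = 0.05, 0.1, log_post(0.05, 0.1)
draws = []
for _ in range(20_000):
    g_new, b_new = g + 0.02 * rng.normal(), b + 0.02 * rng.normal()
    lp_new = log_post(g_new, b_new)
    if np.log(rng.uniform()) < lp_new - lp:  # Metropolis accept/reject
        g, b, lp = g_new, b_new, lp_new
    draws.append(b)

bmd = -np.log(1 - BMR) / np.array(draws[5000:])   # posterior BMD draws
print("median BMD:", np.median(bmd), " BMDL (5th pct):", np.percentile(bmd, 5))
```

Reading the BMDL off a posterior percentile, rather than from an asymptotic confidence bound, is the distributional feature the abstract emphasizes.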
NASA Astrophysics Data System (ADS)
Bonaccorso, Brunella; Cancelliere, Antonino
2015-04-01
In the present study, two probabilistic models for short- to medium-term drought forecasting, able to include information provided by teleconnection indices, are proposed and applied to the Sicily region (Italy). Drought conditions are expressed in terms of the Standardized Precipitation-Evapotranspiration Index (SPEI) at different aggregation time scales. More specifically, a multivariate approach based on the normal distribution is developed in order to estimate (1) transition probabilities to future SPEI drought classes and (2) SPEI forecasts at a generic time horizon M, as functions of past values of the SPEI and the selected teleconnection index. To this end, SPEI series at 3, 4 and 6 aggregation time scales for the Sicily region are extracted from the Global SPEI database, SPEIbase, available at the Web repository of the Spanish National Research Council (http://sac.csic.es/spei/database.html), and averaged over the study area. In particular, SPEIbase v2.3, with a spatial resolution of 0.5° lat/lon and temporal coverage between January 1901 and December 2013, is used. A preliminary correlation analysis is carried out to investigate the link between the drought index and different teleconnection patterns, namely the North Atlantic Oscillation (NAO), the Scandinavian (SCA) and the East Atlantic-West Russia (EA-WR) patterns. The results of this analysis indicate a stronger influence of NAO on drought conditions in Sicily than of the other teleconnection indices. The proposed forecasting methodology is then applied, and the forecasting skill of the proposed models is quantitatively assessed through a simple score approach and performance indices. Results indicate that inclusion of the NAO index generally enhances model performance, thus confirming the suitability of the models for short- to medium-term forecasting of drought conditions.
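Under the multivariate-normal assumption, the forecast at horizon M reduces to the standard conditional-Gaussian formulas; the sketch below works through them on synthetic series, as an illustration of the approach rather than the authors' code.

```python
# Conditional-normal forecast: fit a joint Gaussian to (SPEI_t, NAO_t, SPEI_{t+M})
# and condition on the observed predictors. The series here are synthetic.
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(3, 3))
data = rng.normal(size=(500, 3)) @ A.T   # columns: SPEI(t), NAO(t), SPEI(t+M)

mu = data.mean(axis=0)
S = np.cov(data, rowvar=False)
Sxx, Sxy = S[:2, :2], S[:2, 2]           # x = (SPEI_t, NAO_t) observed
Syx, Syy = S[2, :2], S[2, 2]             # y = SPEI_{t+M} to forecast

def forecast(spei_now, nao_now):
    x = np.array([spei_now, nao_now])
    mean = mu[2] + Syx @ np.linalg.solve(Sxx, x - mu[:2])   # E[y | x]
    var = Syy - Syx @ np.linalg.solve(Sxx, Sxy)             # Var[y | x]
    return mean, var

m, v = forecast(-1.2, 0.8)
print(f"SPEI forecast at horizon M: {m:.2f} +/- {np.sqrt(v):.2f}")
```

Transition probabilities to drought classes follow by integrating this conditional normal over the class boundaries.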
Characterizing Wildfire Regimes and Risk in the USA
NASA Astrophysics Data System (ADS)
Malamud, B. D.; Millington, J. D.; Perry, G. L.
2004-12-01
Over the last decade, high profile wildfires have resulted in numerous fatalities and loss of infrastructure. Wildfires also have a significant impact on climate and ecosystems, with recent authors emphasizing the need for regional-level examinations of wildfire-regime dynamics and change, and the factors driving them. With implications for hazard management, climate studies, and ecosystem research, there is therefore significant interest in appropriate analysis of historical wildfire databases. Insightful studies using wildfire database statistics exist, but are often hampered by the low spatial and/or temporal resolution of their datasets. In this paper, we use a high-resolution dataset consisting of 88,855 USFS wildfires over the time period 1970--2000, and consider wildfire occurrence across the conterminous USA as a function of ecoregion (land units classified by climate, vegetation, and topography), ignition source (anthropogenic vs. lightning), and decade (1970--1979, 1980--1989, 1990--1999). We find that for the conterminous USA (a) wildfires exhibit robust frequency-area power-law behavior in 17 different ecoregions, (b) normalized power-law exponents may be used to compare the scaling of wildfire burned areas between regions, (c) power-law exponents change systematically from east to west, (d) wildfires in 75% of the conterminous USA (particularly the east) have higher power-law exponents for anthropogenic vs. lightning ignition sources, and (e) recurrence intervals for wildfires of a given burned area or larger for each ecoregion can be assessed, allowing for the classification of wildfire regimes for probabilistic hazard estimation in the same vein as is now used for earthquakes. By examining wildfire statistics in a spatially and temporally explicit manner, we are able to present resultant wildfire regime summary statistics and conclusions, along with a probabilistic hazard assessment of wildfire risk at the ecoregion division level across the conterminous USA.
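The frequency-area scaling referred to in points (a)-(c) can be fitted by maximum likelihood; the sketch below uses the standard continuous power-law estimator on synthetic burned areas, as an illustration rather than the authors' pipeline.

```python
# MLE for a power-law tail f(A) ~ A^(-beta), A >= A_min:
# beta_hat = 1 + n / sum(ln(A_i / A_min)), s.e. ~ (beta_hat - 1)/sqrt(n).
import numpy as np

rng = np.random.default_rng(3)
a_min, beta_true = 10.0, 1.4                        # hypothetical threshold/exponent
u = rng.uniform(size=5000)
areas = a_min * (1 - u) ** (-1 / (beta_true - 1))   # inverse-CDF sampling

beta_hat = 1 + len(areas) / np.log(areas / a_min).sum()
se = (beta_hat - 1) / np.sqrt(len(areas))
print(f"fitted exponent: {beta_hat:.3f} +/- {se:.3f}")
```

Comparing such exponents between ecoregions, or between anthropogenic and lightning ignitions, is exactly the kind of normalized comparison the abstract describes.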
NoSQL data model for semi-automatic integration of ethnomedicinal plant data from multiple sources.
Ningthoujam, Sanjoy Singh; Choudhury, Manabendra Dutta; Potsangbam, Kumar Singh; Chetia, Pankaj; Nahar, Lutfun; Sarker, Satyajit D; Basar, Norazah; Das Talukdar, Anupam
2014-01-01
Sharing traditional knowledge with the scientific community could refine scientific approaches to phytochemical investigation and conservation of ethnomedicinal plants. As such, integration of traditional knowledge with scientific data using a single platform for sharing is greatly needed. However, ethnomedicinal data are available in heterogeneous formats, which depend on cultural aspects, survey methodology and the focus of the study. Phytochemical and bioassay data are also available from many open sources in various standard and customised formats. The objective was to design a flexible data model that could integrate both primary and curated ethnomedicinal plant data from multiple sources. The current model is based on MongoDB, one of the Not only Structured Query Language (NoSQL) databases. Although MongoDB does not enforce a schema, modifications were made so that the model could incorporate both standard and customised ethnomedicinal plant data formats from different sources. The model presented can integrate both primary and secondary data related to ethnomedicinal plants. Accommodation of disparate data was accomplished by a feature of this database that supports a different set of fields for each document. It also allows storage of similar data having different properties. The model presented is scalable to a highly complex level with continuing maturation of the database, and is applicable for storing, retrieving and sharing ethnomedicinal plant data. It can also serve as a flexible alternative to a relational and normalised database. Copyright © 2014 John Wiley & Sons, Ltd.
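The schema flexibility the model relies on can be sketched with pymongo: two records about the same species carry entirely different field sets in one collection. The connection string, database, collection, and field names below are hypothetical.

```python
# Sketch of document-level schema flexibility in MongoDB; assumes a local
# MongoDB instance is running. All names and records are made up.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")  # assumed local instance
plants = client["ethnomed"]["plants"]

# a primary field-survey record
plants.insert_one({
    "species": "Centella asiatica",
    "uses": ["wound healing"],
    "survey": {"district": "Cachar", "year": 2012},
})
# a curated phytochemical record -- different fields, same collection
plants.insert_one({
    "species": "Centella asiatica",
    "compounds": [{"name": "asiaticoside", "assay": "antimicrobial"}],
    "source_db": "open phytochemical repository",
})
print(plants.count_documents({"species": "Centella asiatica"}))
```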
Rasdaman for Big Spatial Raster Data
NASA Astrophysics Data System (ADS)
Hu, F.; Huang, Q.; Scheele, C. J.; Yang, C. P.; Yu, M.; Liu, K.
2015-12-01
Spatial raster data have grown exponentially over the past decade. Recent advancements in data acquisition technology, such as remote sensing, have allowed us to collect massive observation data of various spatial resolutions and domain coverages. The volume, velocity, and variety of such spatial data, along with the computationally intensive nature of spatial queries, pose a grand challenge to storage technologies for effective big data management. While high performance computing platforms (e.g., cloud computing) can be used to solve the computing-intensive issues in big data analysis, data have to be managed in a way that is suitable for distributed parallel processing. Recently, rasdaman (raster data manager) has emerged as a scalable and cost-effective database solution to store and retrieve massive multi-dimensional arrays, such as sensor, image, and statistics data. Within this paper, the pros and cons of using rasdaman to manage and query spatial raster data are examined and compared with other common approaches, including file-based systems, relational databases (e.g., PostgreSQL/PostGIS), and NoSQL databases (e.g., MongoDB and Hive). Earth Observing System (EOS) data collected from NASA's Atmospheric Science Data Center (ASDC) are used and stored in these selected database systems, and a set of spatial and non-spatial queries is designed to benchmark their performance in retrieving large-scale, multi-dimensional arrays of EOS data. Lessons learnt from using rasdaman are discussed as well.
Ameur, Adam; Bunikis, Ignas; Enroth, Stefan; Gyllensten, Ulf
2014-01-01
CanvasDB is an infrastructure for management and analysis of genetic variants from massively parallel sequencing (MPS) projects. The system stores SNP and indel calls in a local database, designed to handle very large datasets, to allow for rapid analysis using simple commands in R. Functional annotations are included in the system, making it suitable for direct identification of disease-causing mutations in human exome- (WES) or whole-genome sequencing (WGS) projects. The system has a built-in filtering function implemented to simultaneously take into account variant calls from all individual samples. This enables advanced comparative analysis of variant distribution between groups of samples, including detection of candidate causative mutations within family structures and genome-wide association by sequencing. In most cases, these analyses are executed within just a matter of seconds, even when there are several hundreds of samples and millions of variants in the database. We demonstrate the scalability of canvasDB by importing the individual variant calls from all 1092 individuals present in the 1000 Genomes Project into the system, over 4.4 billion SNPs and indels in total. Our results show that canvasDB makes it possible to perform advanced analyses of large-scale WGS projects on a local server. Database URL: https://github.com/UppsalaGenomeCenter/CanvasDB PMID:25281234
PATtyFams: Protein families for the microbial genomes in the PATRIC database
Davis, James J.; Gerdes, Svetlana; Olsen, Gary J.; ...
2016-02-08
The ability to build accurate protein families is a fundamental operation in bioinformatics that influences comparative analyses, genome annotation, and metabolic modeling. For several years we have been maintaining protein families for all microbial genomes in the PATRIC database (Pathosystems Resource Integration Center, patricbrc.org) in order to drive many of the comparative analysis tools that are available through the PATRIC website. However, due to the burgeoning number of genomes, traditional approaches for generating protein families are becoming prohibitive. In this report, we describe a new approach for generating protein families, which we call PATtyFams. This method uses the k-mer-based function assignments available through RAST (Rapid Annotation using Subsystem Technology) to rapidly guide family formation, and then differentiates the function-based groups into families using a Markov Cluster algorithm (MCL). In conclusion, this new approach for generating protein families is rapid, scalable and has properties that are consistent with alignment-based methods.
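The two-stage idea, first binning proteins by assigned function and then splitting bins by sequence similarity, can be caricatured as follows; the sequences are invented, and a crude single-linkage rule stands in for the MCL step.

```python
# Toy sketch: group proteins by annotated function, then link proteins within
# a function bin whose k-mer Jaccard similarity exceeds a threshold.
def kmers(seq, k=4):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def jaccard(a, b):
    return len(a & b) / len(a | b)

bins = {  # hypothetical function bin with made-up sequences
    "ABC transporter ATP-binding protein": {
        "p1": "MKLVITGGAGFIGSA", "p2": "MKLVITGGSGFIGSA", "p3": "MTTNWKQRALDAVKA",
    },
}
for function, seqs in bins.items():
    ks = {name: kmers(s) for name, s in seqs.items()}
    names = sorted(ks)
    links = [(a, b) for i, a in enumerate(names) for b in names[i + 1:]
             if jaccard(ks[a], ks[b]) > 0.3]   # single-linkage stand-in for MCL
    print(function, "->", links)
```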
LOLAweb: a containerized web server for interactive genomic locus overlap enrichment analysis.
Nagraj, V P; Magee, Neal E; Sheffield, Nathan C
2018-06-06
The past few years have seen an explosion of interest in understanding the role of regulatory DNA. This interest has driven large-scale production of functional genomics data and analytical methods. One popular analysis is to test for enrichment of overlaps between a query set of genomic regions and a database of region sets. In this way, new genomic data can be easily connected to annotations from external data sources. Here, we present an interactive interface for enrichment analysis of genomic locus overlaps using a web server called LOLAweb. LOLAweb accepts a set of genomic ranges from the user and tests it for enrichment against a database of region sets. LOLAweb renders results in an R Shiny application to provide interactive visualization features, enabling users to filter, sort, and explore enrichment results dynamically. LOLAweb is built and deployed in a Linux container, making it scalable to many concurrent users on our servers and also enabling users to download and run LOLAweb locally.
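The core test in this style of region-set enrichment can be sketched with a 2x2 contingency table and Fisher's exact test; the counts below are invented for illustration, and the universe construction is simplified relative to the LOLA method itself.

```python
# Sketch: does the query region set overlap a database region set more often
# than expected, relative to a background universe? Counts are made up.
from scipy.stats import fisher_exact

universe_size = 100_000        # candidate regions in the universe
query_size = 2_000             # user's query regions
db_hits_in_query = 300         # query regions overlapping the DB set
db_hits_in_universe = 5_000    # universe regions overlapping the DB set

table = [
    [db_hits_in_query, query_size - db_hits_in_query],
    [db_hits_in_universe - db_hits_in_query,
     universe_size - query_size - (db_hits_in_universe - db_hits_in_query)],
]
odds, p = fisher_exact(table, alternative="greater")
print(f"odds ratio {odds:.2f}, p = {p:.2e}")
```

Repeating this test across every region set in the database, then ranking by p-value or odds ratio, yields the sortable enrichment table the interface exposes.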
NASA Technical Reports Server (NTRS)
Rilee, Michael Lee; Kuo, Kwo-Sen
2017-01-01
The SpatioTemporal Adaptive Resolution Encoding (STARE) is a unifying scheme encoding geospatial and temporal information for organizing data on scalable computing/storage resources, minimizing expensive data transfers. STARE provides a compact representation that turns set-logic functions into integer operations, e.g. conditional sub-setting, taking into account the representative spatiotemporal resolutions of the data in the datasets. STARE geo-spatiotemporally aligns the placement of diverse data on massively parallel resources to maximize performance. Automating important scientific functions (e.g. regridding) and computational functions (e.g. data placement) allows scientists to focus on domain-specific questions instead of expending their efforts and expertise on data processing. With STARE-enabled automation, SciDB (Scientific Database) plus STARE provides a database interface, reducing costly data preparation, increasing the volume and variety of interoperable data, and easing the sharing of results. Using SciDB plus STARE as part of an integrated analysis infrastructure dramatically eases the combination of diametrically different datasets.
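STARE's actual encoding is hierarchical and defined on the sphere, but the flavor of turning spatial subsetting into integer operations can be shown with a simple two-dimensional bit-interleaving (Morton) code; this is an analogy, not the STARE algorithm itself.

```python
# Toy analogy: interleave the bits of discretized coordinates into one integer
# so that nearby cells get nearby codes and subsets map to integer ranges.
def morton2(ix: int, iy: int, bits: int = 16) -> int:
    """Interleave two `bits`-wide integers into a single Morton code."""
    code = 0
    for b in range(bits):
        code |= ((ix >> b) & 1) << (2 * b)      # x bit -> even position
        code |= ((iy >> b) & 1) << (2 * b + 1)  # y bit -> odd position
    return code

# neighbouring cells yield close codes; a distant cell yields a distant code
print(morton2(3, 5), morton2(3, 6), morton2(40, 5))
```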
The psychology of intelligence analysis: drivers of prediction accuracy in world politics.
Mellers, Barbara; Stone, Eric; Atanasov, Pavel; Rohrbaugh, Nick; Metz, S Emlen; Ungar, Lyle; Bishop, Michael M; Horowitz, Michael; Merkle, Ed; Tetlock, Philip
2015-03-01
This article extends psychological methods and concepts into a domain that is as profoundly consequential as it is poorly understood: intelligence analysis. We report findings from a geopolitical forecasting tournament that assessed the accuracy of more than 150,000 forecasts of 743 participants on 199 events occurring over 2 years. Participants were above average in intelligence and political knowledge relative to the general population. Individual differences in performance emerged, and forecasting skills were surprisingly consistent over time. Key predictors were (a) dispositional variables of cognitive ability, political knowledge, and open-mindedness; (b) situational variables of training in probabilistic reasoning and participation in collaborative teams that shared information and discussed rationales (Mellers, Ungar, et al., 2014); and (c) behavioral variables of deliberation time and frequency of belief updating. We developed a profile of the best forecasters; they were better at inductive reasoning, pattern detection, cognitive flexibility, and open-mindedness. They had greater understanding of geopolitics, training in probabilistic reasoning, and opportunities to succeed in cognitively enriched team environments. Last but not least, they viewed forecasting as a skill that required deliberate practice, sustained effort, and constant monitoring of current affairs. PsycINFO Database Record (c) 2015 APA, all rights reserved.
A Probabilistic Atlas of Diffuse WHO Grade II Glioma Locations in the Brain
Baumann, Cédric; Zouaoui, Sonia; Yordanova, Yordanka; Blonski, Marie; Rigau, Valérie; Chemouny, Stéphane; Taillandier, Luc; Bauchet, Luc; Duffau, Hugues; Paragios, Nikos
2016-01-01
Diffuse WHO grade II gliomas are diffusively infiltrative brain tumors characterized by an unavoidable anaplastic transformation. Their management is strongly dependent on their location in the brain due to interactions with functional regions and potential differences in molecular biology. In this paper, we present the construction of a probabilistic atlas mapping the preferential locations of diffuse WHO grade II gliomas in the brain. This is carried out through a sparse graph whose nodes correspond to clusters of tumors clustered together based on their spatial proximity. The interest of such an atlas is illustrated via two applications. The first one correlates tumor location with the patient’s age via a statistical analysis, highlighting the interest of the atlas for studying the origins and behavior of the tumors. The second exploits the fact that the tumors have preferential locations for automatic segmentation. Through a coupled decomposed Markov Random Field model, the atlas guides the segmentation process, and characterizes which preferential location the tumor belongs to and consequently which behavior it could be associated to. Leave-one-out cross validation experiments on a large database highlight the robustness of the graph, and yield promising segmentation results. PMID:26751577
Protein classification using probabilistic chain graphs and the Gene Ontology structure.
Carroll, Steven; Pavlovic, Vladimir
2006-08-01
Probabilistic graphical models have been developed in the past for the task of protein classification. In many cases, classifications obtained from the Gene Ontology have been used to validate these models. In this work we directly incorporate the structure of the Gene Ontology into the graphical representation for protein classification. We present a method in which each protein is represented by a replicate of the Gene Ontology structure, effectively modeling each protein in its own 'annotation space'. Proteins are also connected to one another according to different measures of functional similarity, after which belief propagation is run to make predictions at all ontology terms. The proposed method was evaluated on a set of 4879 proteins from the Saccharomyces Genome Database whose interactions were also recorded in the GRID project. Results indicate that direct utilization of the Gene Ontology improves predictive ability, outperforming traditional models that do not take advantage of dependencies among functional terms. An average increase in accuracy (precision) of positive and negative term predictions of 27.8% (2.0%) was observed over three different similarity measures and three subontologies. A C/C++/Perl implementation is available from the authors upon request.
NASA Technical Reports Server (NTRS)
Litt, Jonathan S.; Soditus, Sherry; Hendricks, Robert C.; Zaretsky, Erwin V.
2002-01-01
Over the past two decades there has been considerable effort by NASA Glenn and others to develop probabilistic codes to predict, with reasonable engineering certainty, the life and reliability of critical components in rotating machinery and, more specifically, in the rotating sections of airbreathing and rocket engines. These codes have, to a very limited extent, been verified with relatively small bench-rig-type specimens under uniaxial loading. Because of this small and very narrow database, the acceptance of these codes within the aerospace community has been limited. An alternative approach to generating statistically significant data under complex loading and environments simulating aircraft and rocket engine conditions is to obtain, catalog and statistically analyze actual field data. End users of the engines, such as commercial airlines and the military, record and store operational and maintenance information. This presentation describes a cooperative program between NASA GRC, United Airlines, the USAF Wright Laboratory, the U.S. Army Research Laboratory and the Australian Aeronautical & Maritime Research Laboratory to obtain and analyze these airline data for selected components such as blades, disks and combustors. These airline data will be used to benchmark and compare existing life prediction codes.
Zenker, Sven
2010-08-01
Combining mechanistic mathematical models of physiology with quantitative observations using probabilistic inference may offer advantages over established approaches to computerized decision support in acute care medicine. Particle filters (PF) can perform such inference successively as data become available. The potential of PFs for real-time state estimation (SE) for a model of cardiovascular physiology is explored using parallel computers, and their ability to achieve joint state and parameter estimation (JSPE) given minimal prior knowledge is tested. A parallelized sequential importance sampling/resampling algorithm was implemented, and its scalability for the pure SE problem for a non-linear five-dimensional ODE model of the cardiovascular system was evaluated on a Cray XT3 using up to 1,024 cores. JSPE was implemented using a state augmentation approach with artificial stochastic evolution of the parameters. Its performance when simultaneously estimating the 5 states and 18 unknown parameters, given observations only of arterial pressure, central venous pressure, heart rate, and, optionally, cardiac output, was evaluated in a simulated bleeding/resuscitation scenario. SE was successful and scaled up to 1,024 cores with appropriate algorithm parametrization, with real-time-equivalent performance for up to 10 million particles. JSPE in the described underdetermined scenario achieved excellent reproduction of observables and qualitative tracking of end-diastolic ventricular volumes and sympathetic nervous activity. However, only a subset of the posterior distributions of parameters concentrated around the true values for parts of the estimated trajectories. The performance of parallelized PFs makes their application to complex mathematical models of physiology for the purpose of clinical data interpretation, prediction, and therapy optimization appear promising. JSPE in the described extremely underdetermined scenario nevertheless extracted information of potential clinical relevance from the data in this simulation setting. However, fully satisfactory resolution of this problem when minimal prior knowledge about parameter values is available will require further methodological improvements, which are discussed.
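The sequential importance sampling/resampling loop at the heart of such a filter fits in a few lines; the sketch below tracks a one-dimensional toy state rather than the five-dimensional cardiovascular model.

```python
# Minimal sequential importance sampling/resampling (SIR) particle filter on a
# toy AR(1) hidden state with noisy observations; all dynamics are synthetic.
import numpy as np

rng = np.random.default_rng(4)
N = 1000                                   # particles
x = rng.normal(0, 1, N)                    # initial state ensemble
true_x = 0.0

for t in range(50):
    true_x = 0.95 * true_x + rng.normal(0, 0.3)   # hidden dynamics
    y = true_x + rng.normal(0, 0.5)               # noisy observation
    x = 0.95 * x + rng.normal(0, 0.3, N)          # propagate particles
    w = np.exp(-0.5 * ((y - x) / 0.5) ** 2)       # observation likelihood
    w /= w.sum()
    x = x[rng.choice(N, N, p=w)]                  # resample (SIR step)

print(f"truth {true_x:+.2f}, filter mean {x.mean():+.2f}")
```

The propagation and weighting steps are independent across particles, which is exactly why the algorithm parallelizes well across many cores.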
Authomatization of Digital Collection Access Using Mobile and Wireless Data Terminals
NASA Astrophysics Data System (ADS)
Leontiev, I. V.
Information technologies have become vital due to information processing needs, database access, data analysis and decision support. Currently, many scientific projects are oriented towards the database integration of heterogeneous systems. The problem of on-line, rapid access to large integrated systems of digital collections is also very important. Users usually move between different locations, either at work or at home, and in most cases need efficient remote access to the information stored in integrated data collections. Desktop computers are unable to fulfil these needs, so mobile and wireless devices become helpful. Handhelds and data terminals are necessary in medical assistance (they store detailed information about each patient and are helpful for nurses), and immediate access to data collections is used in highway patrol services (databanks of cars, owners, driver licences). Using mobile access, warehouse operations can be validated. Library and museum item cycle-counting speeds up with online barcode scanning and central database access. That is why mobile devices - cell phones, PDAs, handheld computers with wireless access, WindowsCE and PalmOS terminals - have become popular. Generally, mobile devices have a relatively slow processor and limited display capabilities, but they are effective for storing and displaying textual data, recognize user handwriting with a stylus, and support a GUI. Users can perform operations on a handheld terminal and exchange data with the main system (using immediate radio access, or offline during the synchronization process) for update. In our report, we present an approach to mobile access to data collections which raises the efficiency of data processing in a book library, helps to control available books and books in stock, validates service charges, eliminates staff mistakes, and generates requests for book delivery. Our system uses Symbol RF mobile devices (with radio-channel access) and Symbol Palm Terminal data terminals for batch processing and synchronization with remote library databases. We discuss the use of PalmOS-compatible devices and WindowsCE terminals. Our software system is based on a modular, scalable three-tier architecture, and additional functionality can easily be customized. Scalability is also provided by Internet/Intranet technologies and radio access points. The base module of the system supports generic warehouse operations: cycle-counting with handheld barcode scanners, efficient item delivery and issue, item movement, reserving, and report generation on finished and in-process operations. Movements are optimized using the worker's current location; operations are sorted in priority order and transmitted to mobile and wireless worker terminals. Mobile terminals improve the control of task processing, eliminate staff mistakes, display up-to-date information about the main processes, provide data for online reports, and significantly raise the efficiency of data exchange.
O'Neill, M A; Hilgetag, C C
2001-08-29
Many problems in analytical biology, such as the classification of organisms, the modelling of macromolecules, or the structural analysis of metabolic or neural networks, involve complex relational data. Here, we describe a software environment, the portable UNIX programming system (PUPS), which has been developed to allow efficient computational representation and analysis of such data. The system can also be used as a general development tool for database and classification applications. As the complexity of analytical biology problems may lead to computation times of several days or weeks even on powerful computer hardware, the PUPS environment gives support for persistent computations by providing mechanisms for dynamic interaction and homeostatic protection of processes. Biological objects and their interrelations are also represented in a homeostatic way in PUPS. Object relationships are maintained and updated by the objects themselves, thus providing a flexible, scalable and current data representation. Based on the PUPS environment, we have developed an optimization package, CANTOR, which can be applied to a wide range of relational data and which has been employed in different analyses of neuroanatomical connectivity. The CANTOR package makes use of the PUPS system features by modifying candidate arrangements of objects within the system's database. This restructuring is carried out via optimization algorithms that are based on user-defined cost functions, thus providing flexible and powerful tools for the structural analysis of the database content. The use of stochastic optimization also enables the CANTOR system to deal effectively with incomplete and inconsistent data. Prototypical forms of PUPS and CANTOR have been coded and used successfully in the analysis of anatomical and functional mammalian brain connectivity, involving complex and inconsistent experimental data. In addition, PUPS has been used for solving multivariate engineering optimization problems and to implement the digital identification system (DAISY), a system for the automated classification of biological objects. PUPS is implemented in ANSI-C under the POSIX.1 standard and is to a great extent architecture- and operating-system independent. The software is supported by systems libraries that allow multi-threading (the concurrent processing of several database operations), as well as the distribution of the dynamic data objects and library operations over clusters of computers. These attributes make the system easily scalable, and in principle allow the representation and analysis of arbitrarily large sets of relational data. PUPS and CANTOR are freely distributed (http://www.pups.org.uk) as open-source software under the GNU license agreement.
A scalable database model for multiparametric time series: a volcano observatory case study
NASA Astrophysics Data System (ADS)
Montalto, Placido; Aliotta, Marco; Cassisi, Carmelo; Prestifilippo, Michele; Cannata, Andrea
2014-05-01
The variables collected by a sensor network constitute a heterogeneous data source that needs to be properly organized in order to be used in research and geophysical monitoring. The term time series refers to a set of observations of a given phenomenon acquired sequentially in time; when the time intervals are equally spaced, one speaks of the period or sampling frequency. Our work describes in detail a possible methodology for the storage and management of time series using a specific data structure. We designed a framework, hereinafter called TSDSystem (Time Series Database System), in order to acquire time series from different data sources and standardize them within a relational database. The standardization step provides the ability to perform operations, such as querying and visualization, across many measures, synchronizing them on a common time scale. The proposed architecture follows a multiple-layer paradigm (Loaders layer, Database layer and Business Logic layer). Each layer is specialized in performing particular operations for the reorganization and archiving of data from different sources, such as ASCII, Excel, ODBC (Open DataBase Connectivity), and files accessible from the Internet (web pages, XML). In particular, the loader layer performs a security check on the working status of each running piece of software through a heartbeat system, in order to automate the discovery of acquisition issues and other warning conditions. Although our system has to manage huge amounts of data, performance is guaranteed by a smart table-partitioning strategy that keeps the percentage of data stored in each database table balanced. TSDSystem also contains modules for the visualization of acquired data, which provide the possibility of querying different time series over a specified time range, or of following signal acquisition in real time, according to a per-user data access policy.
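The standardization idea, resampling signals with different native sampling onto one common time scale so they can be queried together, can be sketched as follows; the data are synthetic, and pandas stands in for the relational back end.

```python
# Sketch: bring a 1 Hz signal and a 30 s signal onto a common 30 s time scale
# so they can be queried and plotted side by side. Data are synthetic.
import numpy as np
import pandas as pd

rng = np.random.default_rng(6)
seismic = pd.Series(rng.normal(size=600),
                    index=pd.date_range("2014-05-01", periods=600, freq="s"))
gas = pd.Series(rng.normal(size=20),
                index=pd.date_range("2014-05-01", periods=20, freq="30s"))

common = pd.DataFrame({
    "seismic_rms": seismic.resample("30s").apply(lambda x: np.sqrt((x**2).mean())),
    "gas_flux": gas.resample("30s").mean(),
})
print(common.head())
```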
Data Mining Research with the LSST
NASA Astrophysics Data System (ADS)
Borne, Kirk D.; Strauss, M. A.; Tyson, J. A.
2007-12-01
The LSST catalog database will exceed 10 petabytes, comprising several hundred attributes for 5 billion galaxies, 10 billion stars, and over 1 billion variable sources (optical variables, transients, or moving objects), extracted from over 20,000 square degrees of deep imaging in 5 passbands with thorough time domain coverage: 1000 visits over the 10-year LSST survey lifetime. The opportunities are enormous for novel scientific discoveries within this rich time-domain ultra-deep multi-band survey database. Data Mining, Machine Learning, and Knowledge Discovery research opportunities with the LSST are now under study, with the potential for new collaborations to develop and contribute to these investigations. We will describe features of the LSST science database that are amenable to scientific data mining, object classification, outlier identification, anomaly detection, image quality assurance, and survey science validation. We also give some illustrative examples of current scientific data mining research in astronomy, and point out where new research is needed. In particular, the data mining research community will need to address several issues in the coming years as we prepare for the LSST data deluge. The data mining research agenda includes: scalability (at petabyte scales) of existing machine learning and data mining algorithms; development of grid-enabled parallel data mining algorithms; design of a robust system for brokering classifications from the LSST event pipeline (which may produce 10,000 or more event alerts per night); multi-resolution methods for the exploration of petascale databases; visual data mining algorithms for visual exploration of the data; indexing of multi-attribute, multi-dimensional astronomical databases (beyond RA-Dec spatial indexing) for rapid querying of petabyte databases; and more. Finally, we will identify opportunities for synergistic collaboration between the data mining research group and the LSST Data Management and Science Collaboration teams.
Proceedings, Seminar on Probabilistic Methods in Geotechnical Engineering
NASA Astrophysics Data System (ADS)
Hynes-Griffin, M. E.; Buege, L. L.
1983-09-01
Contents: Applications of Probabilistic Methods in Geotechnical Engineering; Probabilistic Seismic and Geotechnical Evaluation at a Dam Site; Probabilistic Slope Stability Methodology; Probability of Liquefaction in a 3-D Soil Deposit; Probabilistic Design of Flood Levees; Probabilistic and Statistical Methods for Determining Rock Mass Deformability Beneath Foundations: An Overview; Simple Statistical Methodology for Evaluating Rock Mechanics Exploration Data; New Developments in Statistical Techniques for Analyzing Rock Slope Stability.
Modelling and analysis of the sugar cataract development process using stochastic hybrid systems.
Riley, D; Koutsoukos, X; Riley, K
2009-05-01
Modelling and analysis of biochemical systems such as sugar cataract development (SCD) are critical because they can provide new insights into systems, which cannot be easily tested with experiments; however, they are challenging problems due to the highly coupled chemical reactions that are involved. The authors present a stochastic hybrid system (SHS) framework for modelling biochemical systems and demonstrate the approach for the SCD process. A novel feature of the framework is that it allows modelling the effect of drug treatment on the system dynamics. The authors validate the three sugar cataract models by comparing trajectories computed by two simulation algorithms. Further, the authors present a probabilistic verification method for computing the probability of sugar cataract formation for different chemical concentrations using safety and reachability analysis methods for SHSs. The verification method employs dynamic programming based on a discretisation of the state space and therefore suffers from the curse of dimensionality. To analyse the SCD process, a parallel dynamic programming implementation that can handle large, realistic systems was developed. Although scalability is a limiting factor, this work demonstrates that the proposed method is feasible for realistic biochemical systems.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Ahn, Tae-Hyuk; Sandu, Adrian; Watson, Layne T.
2015-08-01
Ensembles of simulations are employed to estimate the statistics of possible future states of a system, and are widely used in important applications such as climate change and biological modeling. Ensembles of runs can naturally be executed in parallel. However, when the CPU times of individual simulations vary considerably, a simple strategy of assigning an equal number of tasks per processor can lead to serious work imbalances and low parallel efficiency. This paper presents a new probabilistic framework to analyze the performance of dynamic load balancing algorithms for ensembles of simulations where many tasks are mapped onto each processor, and where the individual compute times vary considerably among tasks. Four load balancing strategies are discussed: most-dividing, all-redistribution, random-polling, and neighbor-redistribution. Simulation results with a stochastic budding yeast cell cycle model are consistent with the theoretical analysis. It is especially significant that a global decrease in load imbalance is provable for the local rebalancing algorithms, since the global rebalancing algorithms raise scalability concerns. The overall simulation time is reduced by up to 25%, and the total processor idle time by 85%.
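A toy sketch of the random-polling strategy, one of the four analyzed above, follows; the task costs and queue sizes are random stand-ins for highly variable ensemble-member run times, not the paper's model.

```python
# Toy random-polling rebalancer: the lightest-loaded processor polls a random
# peer and steals half of its queued tasks. Workloads are synthetic.
import random

random.seed(5)
P = 8
queues = [[random.expovariate(1.0) for _ in range(random.choice([4, 16, 64]))]
          for _ in range(P)]

def imbalance(qs):
    loads = [sum(q) for q in qs]
    return max(loads) / (sum(loads) / len(qs))

def random_poll(qs, idle):
    victim = random.choice([p for p in range(len(qs)) if p != idle])
    half = len(qs[victim]) // 2
    qs[idle].extend(qs[victim][:half])   # steal half the victim's queue
    del qs[victim][:half]

print(f"imbalance before: {imbalance(queues):.2f}")
for _ in range(20):                      # lightest processor polls repeatedly
    lightest = min(range(P), key=lambda p: sum(queues[p]))
    random_poll(queues, lightest)
print(f"imbalance after:  {imbalance(queues):.2f}")
```

Because each poll touches only two processors, the scheme is local; this locality is what makes its provable decrease in imbalance attractive where global redistribution would not scale.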
Extending Wireless Rechargeable Sensor Network Life without Full Knowledge.
Najeeb, Najeeb W; Detweiler, Carrick
2017-07-17
When extending the life of Wireless Rechargeable Sensor Networks (WRSN), one challenge is charging networks as they grow larger. Overcoming this limitation will render a WRSN more practical and highly adaptable to growth in the real world. Most charging algorithms require a priori full knowledge of sensor nodes' power levels in order to determine the nodes that require charging. In this work, we present a probabilistic algorithm that extends the life of a scalable WRSN without a priori power knowledge and without full network exploration. We develop a probability bound on the power level of the sensor nodes and utilize this bound to make decisions while exploring a WRSN. We verify the algorithm by simulating a wireless power transfer unmanned aerial vehicle that charges a WRSN to extend its life. Our results show that, without knowledge, our proposed algorithm extends the life of a WRSN to, on average, 90% of what an optimal full-knowledge algorithm can achieve. This means that the charging robot does not need to explore the whole network, which enables the scaling of WRSN. We analyze the impact of network parameters on our algorithm and show that it is insensitive to a large range of parameter values.
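The paper's probability bound on unobserved power levels can be illustrated with a standard concentration inequality. A minimal sketch, assuming (our assumption, not necessarily the paper's exact model) i.i.d. bounded per-hour drain with known mean, so a Hoeffding bound limits how likely a node is to have fallen below a critical level since its last reading:

```python
import math

def prob_below(last_level, threshold, hours, mu=0.8, c_max=2.0):
    """Hoeffding upper bound on Pr(current level < threshold).

    Assumes per-hour drain is i.i.d. in [0, c_max] with mean mu, and the
    node's level was last observed `hours` ago.
    """
    slack = (last_level - threshold) - mu * hours   # margin over expected drain
    if slack <= 0:
        return 1.0          # expected drain already crosses the threshold
    # Pr(total drain exceeds its mean by `slack`) <= exp(-2 slack^2 / (n c_max^2))
    return math.exp(-2.0 * slack ** 2 / (hours * c_max ** 2))

for h in (60, 90, 99):      # hours since the last reading
    print(h, "h ->", prob_below(last_level=100.0, threshold=20.0, hours=h))
```

The charger can then defer visiting any node whose bound stays below an acceptable risk level, which is the kind of decision the abstract describes making without full knowledge.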
Jin, Rui-Bo; Shimizu, Ryosuke; Morohashi, Isao; Wakui, Kentaro; Takeoka, Masahiro; Izumi, Shuro; Sakamoto, Takahide; Fujiwara, Mikio; Yamashita, Taro; Miki, Shigehito; Terai, Hirotaka; Wang, Zhen; Sasaki, Masahide
2014-12-19
Efficient generation and detection of indistinguishable twin photons are at the core of quantum information and communications technology (Q-ICT). These photons are conventionally generated by spontaneous parametric down conversion (SPDC), which is a probabilistic process and hence occurs at a limited rate, restricting wider applications of Q-ICT. To increase the rate, one had to excite SPDC with higher pump power, which inevitably produced more unwanted multi-photon components, harmfully degrading quantum interference visibility. Here we solve this problem by using a recently developed 10 GHz repetition-rate-tunable comb laser, combined with a group-velocity-matched nonlinear crystal and superconducting nanowire single-photon detectors. They operate at telecom wavelengths, more efficiently and with less noise than conventional schemes, which typically operate at visible and near-infrared wavelengths generated by a 76 MHz Ti:Sapphire laser and detected by Si detectors. We show high interference visibilities that are free from pump-power-induced degradation. Our laser, nonlinear crystal, and detectors constitute a powerful toolbox that will pave the way to implementing quantum photonic circuits with a variety of good and low-cost telecom components, and will eventually realize scalable Q-ICT in optical infrastructures.
Barium Qubit State Detection and Ba Ion-Photon Entanglement
NASA Astrophysics Data System (ADS)
Sosnova, Ksenia; Inlek, Ismail Volkan; Crocker, Clayton; Lichtman, Martin; Monroe, Christopher
2016-05-01
A modular ion-trap network is a promising framework for scalable quantum-computational devices. In this architecture, different ion-trap modules are connected via photonic buses, while within one module ions interact locally via phonons. To eliminate cross-talk between photonic-link qubits and memory qubits, we use different atomic species for quantum information storage (171 Yb+) and intermodular communication (138 Ba+). Conventional deterministic Zeeman-qubit state detection schemes require additional stabilized narrow-linewidth lasers. Instead, we perform fast probabilistic state detection, utilizing efficient detectors and high-NA lenses to detect photons emitted under circularly polarized 493 nm laser excitation. Our method is not susceptible to intensity and frequency noise, and we show a single-shot detection efficiency of ~2%, meaning that we can discriminate between the two qubit states with 99% confidence after as little as 50 ms of averaging. Using this measurement technique, we report entanglement between a single 138 Ba+ ion and its emitted photon with 86% fidelity. This work is supported by the ARO with funding from the IARPA MQCO program, the DARPA Quiness program, the AFOSR MURI on Quantum Transduction, and the ARL Center for Distributed Quantum Information.
Albertson, John D; Harvey, Tierney; Foderaro, Greg; Zhu, Pingping; Zhou, Xiaochi; Ferrari, Silvia; Amin, M Shahrooz; Modrak, Mark; Brantley, Halley; Thoma, Eben D
2016-03-01
This paper addresses the need for surveillance of fugitive methane emissions over broad geographical regions. Most existing techniques suffer from being either extensive (but qualitative) or quantitative (but intensive, with poor scalability). Two novel advances are made here. First, a recursive Bayesian method is presented for probabilistically characterizing fugitive point sources from mobile sensor data. This approach is made possible by a new cross-plume integrated dispersion formulation that overcomes much of the need for time-averaging concentration data. The method is tested here against a limited data set of controlled methane releases and shown to perform well. We then present an information-theoretic approach to planning the paths of the sensor-equipped vehicle, where the path is chosen so as to maximize the expected reduction in integrated target source rate uncertainty in the region, subject to given starting and ending positions and prevailing meteorological conditions. The information-driven sensor path planning algorithm is tested and shown to provide robust results across a wide range of conditions. An overall system concept is presented for optionally piggybacking these techniques onto normal industry maintenance operations using sensor-equipped work trucks.
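The recursive Bayesian idea can be sketched on a grid. The forward dispersion model below is a crude isotropic kernel standing in for the paper's cross-plume integrated formulation, and the grid, noise level, source strength, and length scale are invented assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
gx, gy = np.meshgrid(np.linspace(0, 100, 101), np.linspace(0, 100, 101))
post = np.full(gx.shape, 1.0 / gx.size)        # uniform prior over source cells
q, src, noise = 5.0, np.array([62.0, 40.0]), 0.05
L = 15.0                                       # assumed dispersion length scale

def predict(sensor):
    """Reading at `sensor` if the source sat in each candidate grid cell."""
    d2 = (gx - sensor[0]) ** 2 + (gy - sensor[1]) ** 2
    return q * np.exp(-d2 / (2 * L ** 2))

for sensor in rng.uniform(0, 100, size=(25, 2)):     # mobile sensor waypoints
    true_reading = q * np.exp(-((sensor - src) ** 2).sum() / (2 * L ** 2))
    z = true_reading + rng.normal(0, noise)          # noisy measurement
    post *= np.exp(-0.5 * ((z - predict(sensor)) / noise) ** 2)
    post /= post.sum()                               # recursive Bayes update

i, j = np.unravel_index(post.argmax(), post.shape)
print("MAP source estimate:", (gx[i, j], gy[i, j]), " true:", tuple(src))
```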
Peeking Network States with Clustered Patterns
DOE Office of Scientific and Technical Information (OSTI.GOV)
Kim, Jinoh; Sim, Alex
2015-10-20
Network traffic monitoring has long been a core element for effective network management and security. However, it is still a challenging task with a high degree of complexity for comprehensive analysis when considering multiple variables and ever-increasing traffic volumes to monitor. For example, one widely considered approach is to scrutinize probabilistic distributions, but this poses a scalability concern, and multivariate analysis is not generally supported due to the exponential increase in complexity. In this work, we propose a novel method for network traffic monitoring based on clustering, a powerful unsupervised learning technique. We show that the new approach enables us to recognize clustered results as patterns representing network states, which can then be utilized to evaluate the "similarity" of network states over time. In addition, we define a new quantitative measure for the similarity between two network states observed in different time windows, as a supportive means for intuitive analysis. Finally, we demonstrate clustering-based network monitoring with public traffic traces, and show that the proposed approach offers a promising path toward feasible, cost-effective network monitoring.
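A minimal sketch of the pattern idea: featurize traffic in time windows, cluster the feature vectors, and compare two windows by their cluster-occupancy histograms. The feature choice and the synthetic "two network states" below are assumptions for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Hypothetical per-flow features: [packets/s, mean packet size, distinct peers]
day1 = rng.normal([100, 500, 8], [20, 80, 2], size=(400, 3))
day2 = rng.normal([180, 300, 25], [30, 60, 5], size=(400, 3))  # shifted state

km = KMeans(n_clusters=6, n_init=10, random_state=0).fit(np.vstack([day1, day2]))

def state_histogram(window):
    """A window's 'network state': its normalized cluster-occupancy histogram."""
    h = np.bincount(km.predict(window), minlength=km.n_clusters).astype(float)
    return h / h.sum()

h1, h2 = state_histogram(day1), state_histogram(day2)
cos = h1 @ h2 / (np.linalg.norm(h1) * np.linalg.norm(h2))
print("state similarity (cosine):", round(cos, 3))
```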
Probabilistic Modeling of the Renal Stone Formation Module
NASA Technical Reports Server (NTRS)
Best, Lauren M.; Myers, Jerry G.; Goodenow, Debra A.; McRae, Michael P.; Jackson, Travis C.
2013-01-01
The Integrated Medical Model (IMM) is a probabilistic tool used in mission planning decision making and medical systems risk assessments. The IMM project maintains a database of over 80 medical conditions that could occur during a spaceflight, documenting an incidence rate and end case scenarios for each. In some cases, where observational data are insufficient to adequately define the in-flight medical risk, the IMM utilizes external probabilistic modules to model and estimate the event likelihoods. One such medical event of interest is an unpassed renal stone. Due to a high-salt diet and high concentrations of calcium in the blood (caused by bone depletion from unloading in the microgravity environment), astronauts are at a considerably elevated risk of developing renal calculi (nephrolithiasis) while in space. The lack of observed incidences of nephrolithiasis has led HRP to initiate the development of the Renal Stone Formation Module (RSFM) to create a probabilistic simulator capable of estimating the likelihood of symptomatic renal stone presentation in astronauts on exploration missions. The model consists of two major parts. The first is the probabilistic component, which utilizes probability distributions to assess the range of urine electrolyte parameters and a multivariate regression to transform estimated crystal density and size distributions into the likelihood of the presentation of nephrolithiasis symptoms. The second is a deterministic physical and chemical model of renal stone growth in the kidney developed by Kassemi et al. The probabilistic component of the renal stone model couples the input probability distributions describing the urine chemistry, astronaut physiology, and system parameters with the physical and chemical outputs and inputs of the deterministic stone growth model. These two parts of the model are necessary to capture the uncertainty in the likelihood estimate. The model will be driven by Monte Carlo simulations, continuously randomly sampling the probability distributions of the electrolyte concentrations and system parameters that are inputs to the deterministic model. The total urine chemistry concentrations are used to determine the urine chemistry activity using the Joint Expert Speciation System (JESS), a biochemistry model. Information from JESS is then fed into the deterministic growth model. Outputs from JESS and the deterministic model are passed back to the probabilistic model, where a multivariate regression is used to assess the likelihood of a stone forming and the likelihood of a stone requiring clinical intervention. The parameters used to quantify these risks include: relative supersaturation (RS) of calcium oxalate, citrate/calcium ratio, crystal number density, total urine volume, pH, magnesium excretion, maximum stone width, and ureteral location. Methods and Validation: The RSFM is designed to perform a Monte Carlo simulation to generate probability distributions of clinically significant renal stones, as well as provide an associated uncertainty in the estimate. Initially, early versions will be used to test integration of the components and assess component validation and verification (V&V), with later versions used to address questions regarding design reference mission scenarios. Once integrated with the deterministic component, the credibility assessment of the integrated model will follow NASA STD 7009 requirements.
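The probabilistic component's structure — sample inputs, run a deterministic model, regress to a clinical likelihood — can be sketched in a few lines. Every distribution, the growth stub, and the regression coefficients below are placeholders; the real module uses JESS speciation and the Kassemi growth model:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000
urine_vol = rng.normal(1.6, 0.4, n).clip(0.5)      # L/day, assumed distribution
rs_caox = rng.lognormal(1.2, 0.3, n)               # relative supersaturation
citrate_ca = rng.normal(0.3, 0.1, n).clip(0.01)    # citrate/calcium ratio

def stone_width_mm(rs, vol):
    """Stand-in for the deterministic growth model."""
    return 0.8 * rs / vol

width = stone_width_mm(rs_caox, urine_vol)

# Hypothetical logistic regression from model outputs to clinical outcome.
logit = -6.0 + 1.1 * width + 0.8 * np.log(rs_caox) - 2.0 * citrate_ca
p_symptomatic = 1.0 / (1.0 + np.exp(-logit))

print("mean P(symptomatic stone):", round(p_symptomatic.mean(), 4))
print("95th percentile          :", round(np.quantile(p_symptomatic, 0.95), 4))
```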
Trade Studies of Space Launch Architectures using Modular Probabilistic Risk Analysis
NASA Technical Reports Server (NTRS)
Mathias, Donovan L.; Go, Susie
2006-01-01
A top-down risk assessment in the early phases of space exploration architecture development can provide understanding and intuition of the potential risks associated with new designs and technologies. In this approach, risk analysts draw from their past experience and the heritage of similar existing systems as a source for reliability data. This top-down approach captures the complex interactions of the risk-driving parts of the integrated system without requiring detailed knowledge of the parts themselves, which is often unavailable in the early design stages. Traditional probabilistic risk analysis (PRA) technologies, however, suffer from several drawbacks that limit their timely application to complex technology development programs. The most restrictive of these is a dependence on static planning scenarios, expressed through fault and event trees. Fault trees incorporating comprehensive mission scenarios are routinely constructed for complex space systems, and several commercial software products are available for evaluating fault statistics. These static representations cannot capture the dynamic behavior of system failures without substantial modification of the initial tree. Consequently, the development of dynamic models using fault tree analysis has been an active area of research in recent years. This paper discusses the implementation and demonstration of dynamic, modular scenario modeling for integration of subsystem fault evaluation modules using the Space Architecture Failure Evaluation (SAFE) tool. SAFE is a C++ code that was originally developed to support NASA's Space Launch Initiative. It provides a flexible framework for system architecture definition and trade studies. SAFE supports extensible modeling of dynamic, time-dependent risk drivers of the system and functions at the level of fidelity for which design and failure data exist. The approach is scalable, allowing inclusion of additional information as detailed data become available. The tool performs a Monte Carlo analysis to provide statistical estimates. Example results of an architecture system reliability study are summarized for an exploration system concept using heritage data from liquid-fueled expendable Saturn V/Apollo launch vehicles.
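A minimal sketch of the top-down Monte Carlo idea, assuming invented per-phase failure probabilities (not actual Saturn V/Apollo heritage data) for two hypothetical architectures:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 1_000_000

# Hypothetical per-phase failure probabilities for two candidate stacks.
arch_a = {"boost": 0.004, "upper_stage": 0.006, "spacecraft": 0.010}
arch_b = {"boost": 0.002, "upper_stage": 0.009, "spacecraft": 0.010}

def loss_of_mission(arch):
    """Estimate P(loss of mission) by sampling phases in series."""
    ok = np.ones(n, dtype=bool)
    for p in arch.values():
        ok &= rng.random(n) >= p
    return 1.0 - ok.mean()

for name, arch in (("A", arch_a), ("B", arch_b)):
    print(f"architecture {name}: P(LOM) ~ {loss_of_mission(arch):.4f}")
```

A time-dependent version would replace each per-phase Bernoulli draw with a sampled failure time, which is the kind of dynamic risk driver SAFE is built to handle.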
An Improved Algorithm to Generate a Wi-Fi Fingerprint Database for Indoor Positioning
Chen, Lina; Li, Binghao; Zhao, Kai; Rizos, Chris; Zheng, Zhengqi
2013-01-01
The major problem of Wi-Fi fingerprint-based positioning technology is the signal strength fingerprint database creation and maintenance. The significant temporal variation of received signal strength (RSS) is the main factor responsible for the positioning error. A probabilistic approach can be used, but the RSS distribution is required. The Gaussian distribution or an empirically-derived distribution (histogram) is typically used. However, these distributions are either not always correct or require a large amount of data for each reference point. Double peaks of the RSS distribution have been observed in experiments at some reference points. In this paper a new algorithm based on an improved double-peak Gaussian distribution is proposed. Kurtosis testing is used to decide if this new distribution, or the normal Gaussian distribution, should be applied. Test results show that the proposed algorithm can significantly improve the positioning accuracy, as well as reduce the workload of the off-line data training phase. PMID:23966197
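A sketch of the kurtosis-gated model choice, with synthetic bimodal RSS (e.g., a door that is sometimes open, sometimes closed) and an assumed decision threshold; the gate works because a mixture of two well-separated Gaussians has negative excess kurtosis, while a single Gaussian has zero:

```python
import numpy as np
from scipy.stats import kurtosis
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(3)
# Synthetic RSS samples (dBm) at one reference point, with two modes.
rss = np.concatenate([rng.normal(-62, 2, 600), rng.normal(-71, 2, 400)])

k = kurtosis(rss)                     # excess kurtosis; 0 for a Gaussian
if k < -0.5:                          # assumed decision threshold
    gm = GaussianMixture(n_components=2, random_state=0).fit(rss[:, None])
    print("double-peak fit, means:", gm.means_.ravel().round(1), "k =", round(k, 2))
else:
    print("single Gaussian fit:", rss.mean().round(1), rss.std().round(1))
```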
NASA Astrophysics Data System (ADS)
Tonini, Roberto; Selva, Jacopo; Costa, Antonio; Sandri, Laura
2014-05-01
Probabilistic Hazard Assessment (PHA) is becoming an essential tool for risk mitigation policies, since it allows the hazard due to hazardous phenomena to be quantified and, unlike the deterministic approach, accounts for both aleatory and epistemic uncertainties. On the other hand, one of the main disadvantages of PHA methods is that their results are not easy to understand and interpret by people who are not specialists in probabilistic tools. For scientists, this leads to the issue of providing tools that can be easily used and understood by decision makers (i.e., risk managers or local authorities). The work presented here addresses the problem of simplifying the transfer between scientific knowledge and land protection policies by providing an interface between scientists, who produce PHA results, and decision makers, who use PHA results for risk analyses. In this framework we present pyPHaz, an open tool developed and designed to visualize and analyze PHA results due to one or more phenomena affecting a specific area of interest. The software implementation has been fully developed with the free and open-source Python programming language and several featured Python-based libraries and modules. The pyPHaz tool allows the visualization of the Hazard Curves (HC) calculated on a selected target area, together with different levels of uncertainty (mean and percentiles), on maps that can be interactively created and modified by the user, thanks to a dedicated Graphical User Interface (GUI). Moreover, the tool can be used to compare the results of different PHA models and to merge them by creating ensemble models. The pyPHaz software has been designed with the features of storing and accessing all the data through a MySQL database and of reading as input the XML-based standard file formats defined in the frame of GEM (Global Earthquake Model). This format model is easy to extend to any other kind of hazard, as will be shown in the applications used here as examples of the pyPHaz capabilities, which focus on a Probabilistic Volcanic Hazard Assessment (PVHA) for tephra dispersal and fallout applied to the municipality of Naples.
Probabilistic Assessment of Radiation Risk for Solar Particle Events
NASA Technical Reports Server (NTRS)
Kim, Myung-Hee Y.; Cucinotta, Francis A.
2008-01-01
For long-duration missions outside the protection of the Earth's magnetic field, exposure to solar particle events (SPEs) is a major safety concern for crew members during extra-vehicular activities (EVAs) on the lunar surface or during Earth-to-Moon or Earth-to-Mars transit. The large majority (90%) of SPEs have small or no health consequences because the doses are low and the particles do not penetrate to organ depths. However, there is an operational challenge in responding to events of unknown size and duration. We have developed a probabilistic approach to SPE risk assessment in support of mission design and operational planning. Using the historical database of proton measurements during the past 5 solar cycles, the functional form of the hazard function of SPE occurrence per cycle was found for a nonhomogeneous Poisson model. A typical hazard function was defined as a function of time within a non-specific future solar cycle of 4000 days duration. Distributions of particle fluences for a specified mission period were simulated, ranging from the 5th to the 95th percentile. Organ doses from large SPEs were assessed using NASA's baryon transport model, BRYNTRN. The SPE risk was analyzed with the organ dose distribution for the given particle fluences during a mission period. In addition to the total particle fluences of SPEs, the detailed energy spectra of protons, especially at high energy levels, were recognized as extremely important for assessing the cancer risk associated with energetic particles for large events. The probability of exceeding the NASA 30-day limit for blood forming organ (BFO) dose inside a typical spacecraft was calculated for various SPE sizes. This probabilistic approach to SPE protection will be combined with a probabilistic approach to the radiobiological factors that contribute to the uncertainties in projecting cancer risks in future work.
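The nonhomogeneous Poisson piece can be sketched with standard thinning. The hazard function below (a smooth bump over a 4000-day cycle) and the lognormal per-event dose are invented stand-ins for the fitted historical model and BRYNTRN organ doses:

```python
import numpy as np

rng = np.random.default_rng(11)

def lam(t):
    """Assumed hazard (events/day), peaking near the middle of the cycle."""
    return 0.01 * np.exp(-0.5 * ((t - 2000.0) / 700.0) ** 2)

def p_exceed(start, length_days, n_sims=20_000, dose_limit=250.0):
    """P(total mission dose exceeds the limit), by thinning the NHPP."""
    lam_max = lam(np.linspace(start, start + length_days, 200)).max()
    exceed = 0
    for _ in range(n_sims):
        n_cand = rng.poisson(lam_max * length_days)
        t = start + rng.uniform(0, length_days, n_cand)
        events = rng.random(n_cand) < lam(t) / lam_max   # thinning acceptance
        doses = rng.lognormal(2.5, 1.2, events.sum())    # assumed dose per SPE
        exceed += doses.sum() > dose_limit
    return exceed / n_sims

print("90-day mission near solar max:", p_exceed(1800, 90))
```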
A Parallel Ghosting Algorithm for The Flexible Distributed Mesh Database
Mubarak, Misbah; Seol, Seegyoung; Lu, Qiukai; ...
2013-01-01
Critical to the scalability of parallel adaptive simulations are parallel control functions including load balancing, reduced inter-process communication and optimal data decomposition. In distributed meshes, many mesh-based applications frequently access neighborhood information for computational purposes, which must be transmitted efficiently to avoid parallel performance degradation when the neighbors are on different processors. This article presents a parallel algorithm for creating and deleting data copies, referred to as ghost copies, which localize neighborhood data for computation purposes while minimizing inter-process communication. The key characteristics of the algorithm are: (1) It can create ghost copies of any permissible topological order in a 1D, 2D or 3D mesh based on selected adjacencies. (2) It exploits neighborhood communication patterns during the ghost creation process, thus eliminating all-to-all communication. (3) For applications that need neighbors of neighbors, the algorithm can create n ghost layers, up to the point where the whole partitioned mesh is ghosted. Strong and weak scaling results are presented for the IBM BG/P and Cray XE6 architectures up to a core count of 32,768 processors. The algorithm also leads to scalable results when used in a parallel super-convergent patch recovery error estimator, an application that frequently accesses neighborhood data to carry out computation.
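Characteristic (1) — ghost selection from a chosen adjacency — reduces to a set computation. A minimal sketch on a made-up four-triangle mesh, using vertex adjacency; repeating the step on the union of owned and ghost elements would produce the n-layer case:

```python
from collections import defaultdict

# element -> vertices (a strip of four triangles), element -> owning part
elem_verts = {0: {0, 1, 2}, 1: {1, 2, 3}, 2: {2, 3, 4}, 3: {3, 4, 5}}
elem_part = {0: 0, 1: 0, 2: 1, 3: 1}

# Invert once to vertex -> elements; this encodes the selected adjacency.
vert_elems = defaultdict(set)
for e, verts in elem_verts.items():
    for v in verts:
        vert_elems[v].add(e)

def ghost_layer(part):
    """Off-part elements sharing a vertex with this part's owned elements."""
    owned = {e for e, p in elem_part.items() if p == part}
    ghosts = set()
    for e in owned:
        for v in elem_verts[e]:
            ghosts |= vert_elems[v]
    return ghosts - owned           # keep only copies owned elsewhere

for part in (0, 1):
    print(f"part {part} ghosts elements {sorted(ghost_layer(part))}")
```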
Research on distributed heterogeneous data PCA algorithm based on cloud platform
NASA Astrophysics Data System (ADS)
Zhang, Jin; Huang, Gang
2018-05-01
Principal component analysis (PCA) of distributed heterogeneous data sets can overcome the limited scalability of centralized processing. In order to reduce the intermediate data and error components generated when analyzing distributed heterogeneous data sets, a principal component analysis algorithm for heterogeneous data sets on a cloud platform is proposed. The algorithm computes eigenvalues using Householder tridiagonalization and QR factorization, and calculates the error components of the heterogeneous databases associated with the public key to recover the intermediate data set and the lost information. Experiments on distributed DBM heterogeneous data sets show that the method is feasible and reliable in terms of execution time and accuracy.
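The named eigenvalue route — Householder tridiagonalization followed by QR iteration — can be demonstrated directly. A sketch, assuming synthetic data pooled from two "sites"; production code would use a shifted, deflated QR rather than this plain iteration:

```python
import numpy as np
from scipy.linalg import hessenberg

rng = np.random.default_rng(5)
site_a = rng.normal(size=(300, 6)) @ rng.normal(size=(6, 6))
site_b = rng.normal(size=(200, 6)) @ rng.normal(size=(6, 6))
X = np.vstack([site_a, site_b])           # pooled (conceptually distributed)
X -= X.mean(axis=0)
C = X.T @ X / (len(X) - 1)                # symmetric covariance matrix

# Householder reflections reduce the symmetric C to tridiagonal T = Q0.T C Q0.
T, Q0 = hessenberg(C, calc_q=True)

# Unshifted QR iteration: T <- R Q drives T toward a diagonal of eigenvalues.
Q_acc = np.eye(len(C))
for _ in range(500):
    Q, R = np.linalg.qr(T)
    T, Q_acc = R @ Q, Q_acc @ Q

eigvals = np.diag(T)
components = Q0 @ Q_acc                   # columns ~ principal directions of C
order = np.argsort(eigvals)[::-1]
print("leading eigenvalues:", np.round(eigvals[order][:3], 3))
print("check vs numpy     :", np.round(np.sort(np.linalg.eigvalsh(C))[::-1][:3], 3))
```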
Monitoring service for the Gran Telescopio Canarias control system
NASA Astrophysics Data System (ADS)
Huertas, Manuel; Molgo, Jordi; Macías, Rosa; Ramos, Francisco
2016-07-01
The Monitoring Service collects, persists and propagates the telescope and instrument telemetry for the Gran Telescopio CANARIAS (GTC), an optical-infrared 10-meter segmented-mirror telescope at the ORM observatory in the Canary Islands (Spain). A new version of the Monitoring Service has been developed in order to improve performance, provide high availability, and guarantee fault tolerance and scalability to cope with a high volume of data. The architecture is based on a distributed in-memory data store with a Producer/Consumer design pattern. The producer generates the data samples. The consumers either persist the samples to a database for further analysis or propagate them to the consoles in the control room to monitor the state of the whole system.
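A single-process sketch of the producer/consumer split, assuming invented monitor names and a plain queue in place of the distributed in-memory data store:

```python
import queue, random, threading, time

persist_q, display_q = queue.Queue(), queue.Queue()
STOP = None

def producer(n=20):
    for _ in range(n):
        sample = ("m1.azimuth", time.time(), random.uniform(0.0, 360.0))
        persist_q.put(sample)        # fan out: every consumer sees every sample
        display_q.put(sample)
        time.sleep(0.005)            # telemetry sampling period
    persist_q.put(STOP)
    display_q.put(STOP)

def persister():
    db = []                          # stand-in for the telemetry database
    while (s := persist_q.get()) is not STOP:
        db.append(s)
    print("persisted", len(db), "samples")

def displayer():
    latest = {}                      # consoles show the latest value per monitor
    while (s := display_q.get()) is not STOP:
        latest[s[0]] = s[2]
    print("last displayed value:", latest)

threads = [threading.Thread(target=f) for f in (producer, persister, displayer)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```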
Probabilistic flood warning using grand ensemble weather forecasts
NASA Astrophysics Data System (ADS)
He, Y.; Wetterhall, F.; Cloke, H.; Pappenberger, F.; Wilson, M.; Freer, J.; McGregor, G.
2009-04-01
As the severity of floods increases, possibly due to climate and land-use change, there is an urgent need for more effective and reliable warning systems. The incorporation of numerical weather predictions (NWP) into a flood warning system can increase forecast lead times from a few hours to a few days. A single NWP forecast from a single forecast centre, however, is insufficient, as it involves considerable non-predictable uncertainties and can lead to a high number of false or missed warnings. An ensemble of weather forecasts from one Ensemble Prediction System (EPS), when used to force catchment hydrology, can provide improved early flood warning, as some of the uncertainties can be quantified. EPS forecasts from a single weather centre only account for part of the uncertainties, originating from initial conditions and stochastic physics. Other sources of uncertainty, including numerical implementations and/or data assimilation, can only be assessed if a grand ensemble of EPSs from different weather centres is used. When the various models that produce EPS forecasts at different weather centres are aggregated, the probabilistic nature of the ensemble precipitation forecasts can be better retained and accounted for. The availability of twelve global EPSs through the 'THORPEX Interactive Grand Global Ensemble' (TIGGE) offers a new opportunity for the design of an improved probabilistic flood forecasting framework. This work presents a case study using the TIGGE database for flood warning on a meso-scale catchment. The upper reach of the River Severn catchment, located in the Midlands Region of England, is selected due to its abundant data and its relatively small size (4062 km2) compared to the resolution of the NWPs. This choice was deliberate, as we hypothesize that the forcing uncertainty for smaller catchments cannot be represented by a single EPS with a very limited number of ensemble members, but only through the variance given by a large number of ensembles and ensemble systems. A coupled atmospheric-hydrologic-hydraulic cascade system driven by the TIGGE ensemble forecasts is set up to study the potential benefits of using the TIGGE database in early flood warning. The physically based and fully distributed LISFLOOD suite of models is selected to simulate discharge and flood inundation consecutively. The results show the TIGGE database is a promising tool for producing forecasts of discharge and flood inundation that are comparable with the observed discharge and with simulated inundation driven by the observed discharge. The spread of discharge forecasts varies from centre to centre, but it is generally large, implying a significant level of uncertainty. Precipitation input uncertainties dominate and propagate through the cascade chain. The current NWPs fall short of representing the spatial variability of precipitation on a comparatively small catchment. This perhaps indicates the need to improve NWP resolution and/or disaggregation techniques to narrow the spatial gap between meteorology and hydrology. It is not necessarily true that early flood warning becomes more reliable when more ensemble forecasts are employed. It is difficult to identify the best forecast centre(s), but in general the chance of detecting floods is increased by using the TIGGE database. Only one flood event was studied because most of the TIGGE data became available after October 2007.
It is necessary to test the TIGGE ensemble forecasts with other flood events in other catchments with different hydrological and climatic regimes before general conclusions can be made on its robustness and applicability.
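Once a grand ensemble of discharge forecasts exists, the probabilistic warning itself is simple: the fraction of members exceeding a flood threshold at each lead time. A sketch with synthetic members standing in for TIGGE-driven hydrological runs, and an assumed warning trigger:

```python
import numpy as np

rng = np.random.default_rng(9)
n_members, lead_days = 150, 10
base = 40 + 15 * np.sin(np.linspace(0, 2.5, lead_days))    # m^3/s, assumed
members = base * rng.lognormal(0.0, 0.35, size=(n_members, lead_days))

threshold = 80.0                                           # assumed flood level
p_exceed = (members > threshold).mean(axis=0)              # per lead time

for day, p in enumerate(p_exceed, start=1):
    flag = "WARN" if p > 0.3 else "ok"                     # assumed trigger
    print(f"lead day {day:2d}: P(Q > {threshold:.0f}) = {p:.2f}  {flag}")
```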
BioBenchmark Toyama 2012: an evaluation of the performance of triple stores on biological data.
Wu, Hongyan; Fujiwara, Toyofumi; Yamamoto, Yasunori; Bolleman, Jerven; Yamaguchi, Atsuko
2014-01-01
Biological databases vary enormously in size and data complexity, from small databases that contain a few million Resource Description Framework (RDF) triples to large databases that contain billions of triples. In this paper, we evaluate whether RDF native stores can be used to meet the needs of a biological database provider. Prior evaluations have used synthetic data of limited database size. For example, the largest BSBM benchmark uses 1 billion synthetic e-commerce RDF triples on a single node. However, real-world biological data differs greatly from such simple synthetic data, and it is difficult to determine whether synthetic e-commerce data adequately represents biological databases. Therefore, for this evaluation, we used five real data sets from biological databases. We evaluated five triple stores, 4store, Bigdata, Mulgara, Virtuoso, and OWLIM-SE, with five biological data sets, Cell Cycle Ontology, Allie, PDBj, UniProt, and DDBJ, ranging in size from approximately 10 million to 8 billion triples. For each database, we loaded all the data into our single node and prepared the database for use in a classical data warehouse scenario. Then, we ran a series of SPARQL queries against each endpoint and recorded the execution time and the accuracy of the query response. Our paper shows that, with appropriate configuration, Virtuoso and OWLIM-SE can satisfy the basic requirements for loading and querying biological data of up to roughly 8 billion triples on a single node, with simultaneous access by 64 clients. OWLIM-SE performs best for databases with approximately 11 million triples; for data sets containing 94 million and 590 million triples, OWLIM-SE and Virtuoso perform best, without an overwhelming advantage over each other; for data over 4 billion triples, Virtuoso works best. 4store performs well on small data sets with limited features when the number of triples is less than 100 million, but our test shows its scalability is poor; Bigdata demonstrates average performance and is a good open-source triple store for middle-sized (around 500 million triples) data sets; Mulgara shows some fragility.
Tsanas, Athanasios; Clifford, Gari D
2015-01-01
Sleep spindles are critical in characterizing sleep and have been associated with cognitive function and pathophysiological assessment. Typically, their detection relies on the subjective and time-consuming visual examination of electroencephalogram (EEG) signal(s) by experts, which has led to large inter-rater variability as a result of the poor definition of sleep spindle characteristics. Hitherto, many algorithmic spindle detectors inherently make signal stationarity assumptions (e.g., Fourier transform-based approaches) that are inappropriate for EEG signals, and frequently rely on additional information that may not be readily available in many practical settings (e.g., more than one EEG channel, or prior hypnogram assessment). This study proposes a novel signal processing methodology relying solely on a single EEG channel, providing an objective, accurate means of probabilistically assessing the presence of sleep spindles in EEG signals. We use the intuitively appealing continuous wavelet transform (CWT) with a Morlet basis function, identifying regions of interest where the power of the CWT coefficients corresponding to the frequencies of spindles (11-16 Hz) is large. The potential for assessing the signal segment as a spindle is refined using local weighted smoothing techniques. We evaluate our findings on two databases: the MASS database comprising 19 healthy controls and the DREAMS sleep spindle database comprising eight participants diagnosed with various sleep pathologies. We demonstrate that we can replicate the experts' sleep spindle assessments accurately in both databases (MASS database: sensitivity 84%, specificity 90%, false discovery rate 83%; DREAMS database: sensitivity 76%, specificity 92%, false discovery rate 67%), outperforming six competing automatic sleep spindle detection algorithms in terms of correctly replicating the experts' assessment of detected spindles.
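A single-channel sketch in the spirit of the method: Morlet CWT, mean power over the 11-16 Hz band, local smoothing, and a percentile threshold. The synthetic EEG (noise plus one injected 13 Hz burst), the smoothing window, and the threshold are assumptions:

```python
import numpy as np
import pywt

fs = 200.0
t = np.arange(0, 30, 1 / fs)                     # 30 s of synthetic "EEG"
rng = np.random.default_rng(2)
eeg = rng.normal(0, 1, t.size)
burst = (t > 12) & (t < 13)                      # 1 s spindle-like burst
eeg[burst] += 3 * np.sin(2 * np.pi * 13 * t[burst])

# Scales whose pseudo-frequencies span the spindle band (11-16 Hz).
freqs_wanted = np.linspace(11, 16, 12)
scales = pywt.central_frequency("morl") * fs / freqs_wanted
coefs, freqs = pywt.cwt(eeg, scales, "morl", sampling_period=1 / fs)

power = (np.abs(coefs) ** 2).mean(axis=0)        # mean band power per sample
power = np.convolve(power, np.ones(51) / 51, mode="same")   # local smoothing
thresh = np.percentile(power, 99)                # assumed decision threshold
detected = t[power > thresh]
print(f"spindle candidates between {detected.min():.2f}s and {detected.max():.2f}s")
```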
The Eruption Forecasting Information System: Volcanic Eruption Forecasting Using Databases
NASA Astrophysics Data System (ADS)
Ogburn, S. E.; Harpel, C. J.; Pesicek, J. D.; Wellik, J.
2016-12-01
Forecasting eruptions, including their onset, size, duration, location, and impacts, is vital for hazard assessment and risk mitigation. The Eruption Forecasting Information System (EFIS) project is a new initiative of the US Geological Survey-USAID Volcano Disaster Assistance Program (VDAP) and will advance VDAP's ability to forecast the outcome of volcanic unrest. The project supports probability estimation for eruption forecasting by creating databases useful for pattern recognition, identifying monitoring data thresholds beyond which eruptive probabilities increase, and answering common forecasting questions. A major component of the project is a global relational database, which contains multiple modules designed to aid in the construction of probabilistic event trees and to answer common questions that arise during volcanic crises. The primary module contains chronologies of volcanic unrest. This module allows us to query eruption chronologies, monitoring data, descriptive information, operational data, and eruptive phases alongside other global databases, such as WOVOdat and the Global Volcanism Program. The EFIS database is in the early stages of development and population; thus, this contribution is also a request for feedback from the community. Preliminary data are already benefitting several research areas. For example, VDAP provided a forecast of the likely remaining eruption duration for Sinabung volcano, Indonesia, using global data taken from similar volcanoes in the DomeHaz database module, in combination with local monitoring time-series data. In addition, EFIS seismologists used a beta-statistic test and empirically derived thresholds to identify distal volcano-tectonic earthquake anomalies preceding Alaska volcanic eruptions during 1990-2015, to retrospectively evaluate Alaska Volcano Observatory eruption precursors. This has identified important considerations for selecting analog volcanoes for global data analysis, such as differences between closed- and open-system volcanoes.
P43-S Computational Biology Applications Suite for High-Performance Computing (BioHPC.net)
Pillardy, J.
2007-01-01
One of the challenges of high-performance computing (HPC) is user accessibility. At the Cornell University Computational Biology Service Unit, which is also a Microsoft HPC institute, we have developed a computational biology application suite that allows researchers from biological laboratories to submit their jobs to the parallel cluster through an easy-to-use Web interface. Through this system, we are providing users with popular bioinformatics tools including BLAST, HMMER, InterproScan, and MrBayes. The system is flexible and can be easily customized to include other software. It is also scalable; the installation on our servers currently processes approximately 8500 job submissions per year, many of them requiring massively parallel computations. It also has a built-in user management system, which can limit software and/or database access to specified users. TAIR, the major database of the plant model organism Arabidopsis, and SGN, the international tomato genome database, are both using our system for storage and data analysis. The system consists of a Web server running the interface (ASP.NET C#), Microsoft SQL server (ADO.NET), compute cluster running Microsoft Windows, ftp server, and file server. Users can interact with their jobs and data via a Web browser, ftp, or e-mail. The interface is accessible at http://cbsuapps.tc.cornell.edu/.
Semantic SenseLab: implementing the vision of the Semantic Web in neuroscience
Samwald, Matthias; Chen, Huajun; Ruttenberg, Alan; Lim, Ernest; Marenco, Luis; Miller, Perry; Shepherd, Gordon; Cheung, Kei-Hoi
2011-01-01
Objective: Integrative neuroscience research needs a scalable informatics framework that enables semantic integration of diverse types of neuroscience data. This paper describes the use of the Web Ontology Language (OWL) and other Semantic Web technologies for the representation and integration of molecular-level data provided by several databases of the SenseLab suite of neuroscience databases. Methods: Based on the original database structure, we semi-automatically translated the databases into OWL ontologies with manual addition of semantic enrichment. The SenseLab ontologies are extensively linked to other biomedical Semantic Web resources, including the Subcellular Anatomy Ontology, the Brain Architecture Management System, the Gene Ontology, BIRNLex and UniProt. The SenseLab ontologies have also been mapped to the Basic Formal Ontology and the Relation Ontology, which helps ease interoperability with many other existing and future biomedical ontologies for the Semantic Web. In addition, approaches to representing contradictory research statements are described. The SenseLab ontologies are designed for use on the Semantic Web, which enables their integration into a growing collection of biomedical information resources. Conclusion: We demonstrate that our approach can yield significant potential benefits and that the Semantic Web is rapidly becoming mature enough to realize its anticipated promises. The ontologies are available online at http://neuroweb.med.yale.edu/senselab/ PMID:20006477
FOUNTAIN: A JAVA open-source package to assist large sequencing projects
Buerstedde, Jean-Marie; Prill, Florian
2001-01-01
Background: Better automation, lower cost per reaction and a heightened interest in comparative genomics have led to a dramatic increase in DNA sequencing activities. Although the large sequencing projects of specialized centers are supported by in-house bioinformatics groups, many smaller laboratories face difficulties managing the appropriate processing and storage of their sequencing output. The challenges include documentation of clones, templates and sequencing reactions, and the storage, annotation and analysis of the large number of generated sequences. Results: We describe here a new program, named FOUNTAIN, for the management of large sequencing projects. FOUNTAIN uses the JAVA computer language and data storage in a relational database. Starting with a collection of sequencing objects (clones), the program generates and stores information related to the different stages of the sequencing project using a web browser interface for user input. The generated sequences are subsequently imported and annotated based on BLAST searches against the public databases. In addition, simple algorithms to cluster sequences and determine putative polymorphic positions are implemented. Conclusions: A simple but flexible and scalable software package is presented to facilitate data generation and storage for large sequencing projects. Open source and largely platform and database independent, we wish FOUNTAIN to be improved and extended in a community effort. PMID:11591214
The SEED and the Rapid Annotation of microbial genomes using Subsystems Technology (RAST)
Overbeek, Ross; Olson, Robert; Pusch, Gordon D.; Olsen, Gary J.; Davis, James J.; Disz, Terry; Edwards, Robert A.; Gerdes, Svetlana; Parrello, Bruce; Shukla, Maulik; Vonstein, Veronika; Wattam, Alice R.; Xia, Fangfang; Stevens, Rick
2014-01-01
In 2004, the SEED (http://pubseed.theseed.org/) was created to provide consistent and accurate genome annotations across thousands of genomes and as a platform for discovering and developing de novo annotations. The SEED is a constantly updated integration of genomic data with a genome database, web front end, API and server scripts. It is used by many scientists for predicting gene functions and discovering new pathways. In addition to being a powerful database for bioinformatics research, the SEED also houses subsystems (collections of functionally related protein families) and their derived FIGfams (protein families), which represent the core of the RAST annotation engine (http://rast.nmpdr.org/). When a new genome is submitted to RAST, genes are called and their annotations are made by comparison to the FIGfam collection. If the genome is made public, it is then housed within the SEED and its proteins populate the FIGfam collection. This annotation cycle has proven to be a robust and scalable solution to the problem of annotating the exponentially increasing number of genomes. To date, >12 000 users worldwide have annotated >60 000 distinct genomes using RAST. Here we describe the interconnectedness of the SEED database and RAST, the RAST annotation pipeline and updates to both resources. PMID:24293654
The BioMart community portal: an innovative alternative to large, centralized data repositories.
Smedley, Damian; Haider, Syed; Durinck, Steffen; Pandini, Luca; Provero, Paolo; Allen, James; Arnaiz, Olivier; Awedh, Mohammad Hamza; Baldock, Richard; Barbiera, Giulia; Bardou, Philippe; Beck, Tim; Blake, Andrew; Bonierbale, Merideth; Brookes, Anthony J; Bucci, Gabriele; Buetti, Iwan; Burge, Sarah; Cabau, Cédric; Carlson, Joseph W; Chelala, Claude; Chrysostomou, Charalambos; Cittaro, Davide; Collin, Olivier; Cordova, Raul; Cutts, Rosalind J; Dassi, Erik; Di Genova, Alex; Djari, Anis; Esposito, Anthony; Estrella, Heather; Eyras, Eduardo; Fernandez-Banet, Julio; Forbes, Simon; Free, Robert C; Fujisawa, Takatomo; Gadaleta, Emanuela; Garcia-Manteiga, Jose M; Goodstein, David; Gray, Kristian; Guerra-Assunção, José Afonso; Haggarty, Bernard; Han, Dong-Jin; Han, Byung Woo; Harris, Todd; Harshbarger, Jayson; Hastings, Robert K; Hayes, Richard D; Hoede, Claire; Hu, Shen; Hu, Zhi-Liang; Hutchins, Lucie; Kan, Zhengyan; Kawaji, Hideya; Keliet, Aminah; Kerhornou, Arnaud; Kim, Sunghoon; Kinsella, Rhoda; Klopp, Christophe; Kong, Lei; Lawson, Daniel; Lazarevic, Dejan; Lee, Ji-Hyun; Letellier, Thomas; Li, Chuan-Yun; Lio, Pietro; Liu, Chu-Jun; Luo, Jie; Maass, Alejandro; Mariette, Jerome; Maurel, Thomas; Merella, Stefania; Mohamed, Azza Mostafa; Moreews, Francois; Nabihoudine, Ibounyamine; Ndegwa, Nelson; Noirot, Céline; Perez-Llamas, Cristian; Primig, Michael; Quattrone, Alessandro; Quesneville, Hadi; Rambaldi, Davide; Reecy, James; Riba, Michela; Rosanoff, Steven; Saddiq, Amna Ali; Salas, Elisa; Sallou, Olivier; Shepherd, Rebecca; Simon, Reinhard; Sperling, Linda; Spooner, William; Staines, Daniel M; Steinbach, Delphine; Stone, Kevin; Stupka, Elia; Teague, Jon W; Dayem Ullah, Abu Z; Wang, Jun; Ware, Doreen; Wong-Erasmus, Marie; Youens-Clark, Ken; Zadissa, Amonida; Zhang, Shi-Jian; Kasprzyk, Arek
2015-07-01
The BioMart Community Portal (www.biomart.org) is a community-driven effort to provide a unified interface to biomedical databases that are distributed worldwide. The portal provides access to numerous database projects supported by 30 scientific organizations. It includes over 800 different biological datasets spanning genomics, proteomics, model organisms, cancer data, ontology information and more. All resources available through the portal are independently administered and funded by their host organizations. The BioMart data federation technology provides a unified interface to all the available data. The latest version of the portal comes with many new databases that have been created by our ever-growing community. It also comes with better support and extensibility for data analysis and visualization tools. A new addition to our toolbox, the enrichment analysis tool is now accessible through graphical and web service interface. The BioMart community portal averages over one million requests per day. Building on this level of service and the wealth of information that has become available, the BioMart Community Portal has introduced a new, more scalable and cheaper alternative to the large data stores maintained by specialized organizations.
Probabilistic simulation of stress concentration in composite laminates
NASA Technical Reports Server (NTRS)
Chamis, C. C.; Murthy, P. L. N.; Liaw, L.
1993-01-01
A computational methodology is described to probabilistically simulate the stress concentration factors in composite laminates. This new approach consists of coupling probabilistic composite mechanics with probabilistic finite element structural analysis. The probabilistic composite mechanics is used to probabilistically describe all the uncertainties inherent in composite material properties, while the probabilistic finite element analysis is used to probabilistically describe the uncertainties associated with methods to experimentally evaluate stress concentration factors, such as loads, geometry, and supports. The effectiveness of the methodology is demonstrated by using it to simulate the stress concentration factors in composite laminates made from three different composite systems. Simulated results match experimental data for probability density and for cumulative distribution functions. The sensitivity factors indicate that the stress concentration factors are influenced by local stiffness variables, by load eccentricities and by initial stress fields.
Probabilistic load simulation: Code development status
NASA Astrophysics Data System (ADS)
Newell, J. F.; Ho, H.
1991-05-01
The objective of the Composite Load Spectra (CLS) project is to develop generic load models to simulate the composite load spectra that are induced in space propulsion system components. The probabilistic loads thus generated are part of the probabilistic design analysis (PDA) of a space propulsion system that also includes probabilistic structural analyses, reliability, and risk evaluations. Probabilistic load simulation for space propulsion systems demands sophisticated probabilistic methodology and requires large amounts of load information and engineering data. The CLS approach is to implement a knowledge-based system coupled with a probabilistic load simulation module. The knowledge base manages and furnishes load information and expertise and sets up the simulation runs. The load simulation module performs the numerical computation to generate the probabilistic loads with load information supplied from the CLS knowledge base.
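A minimal sketch of composing a probabilistic load spectrum: sample each contributing load's uncertain parameters and superpose them over a duty cycle. The load components and distributions are illustrative assumptions, not the CLS knowledge base's engine data:

```python
import numpy as np

rng = np.random.default_rng(8)
n = 20_000
t = np.linspace(0, 1, 200)                      # one duty cycle, normalized time

steady = rng.normal(1.00, 0.05, (n, 1))         # steady load multiplier
osc_amp = rng.lognormal(-3.0, 0.4, (n, 1))      # vibratory amplitude
osc_phase = rng.uniform(0, 2 * np.pi, (n, 1))
spike = rng.normal(0.2, 0.08, (n, 1)) * np.exp(-t / 0.05)   # start-up transient

load = steady + osc_amp * np.sin(2 * np.pi * 40 * t + osc_phase) + spike
peak = load.max(axis=1)                         # peak load factor per sample

print("median peak load factor:", round(np.quantile(peak, 0.5), 3))
print("99.9th percentile (design driver):", round(np.quantile(peak, 0.999), 3))
```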
Perovskite Technology is Scalable, But Questions Remain about the Best Methods
NREL news release. NREL researchers examined potential scalable deposition methods that could be used on larger surfaces.
Racoceanu, Daniel; Capron, Frédérique
2016-01-01
Being able to provide a traceable and dynamic second opinion has become an ethical priority for patients and health care professionals in modern computer-aided medicine. In this perspective, a semantic cognitive virtual microscopy approach was recently initiated, the MICO project, focusing on cognitive digital pathology. This approach supports the elaboration of pathology-compliant daily protocols dedicated to breast cancer grading, in particular mitotic counts and nuclear atypia. A proof of concept has thus been elaborated, and an extension of these approaches is now underway in a collaborative digital pathology framework, the FlexMIm project. As important milestones on the way to routine digital pathology, a series of pioneering international benchmarking initiatives have been launched for mitosis detection (MITOS), nuclear atypia grading (MITOS-ATYPIA) and glandular structure detection (GlaS), some of the fundamental grading components in diagnosis and prognosis. These initiatives make it possible to envisage a consolidated validation reference database for digital pathology in the very near future. This reference database will need coordinated efforts from all major teams working in this area worldwide, and it will certainly represent a critical bottleneck for the acceptance of all future imaging modules in clinical practice. In line with recent advances in molecular imaging and genetics, keeping the microscopic modality at the core of future digital systems in pathology is fundamental to ensure the acceptance of these new technologies, as well as a deeper systemic, structured comprehension of the pathologies. After all, at the scale of routine whole-slide imaging (WSI; ∼0.22 µm/pixel), the microscopic image represents a structured 'genomic cluster', enabling a naturally structured support for integrative digital pathology approaches. In order to accelerate and structure the integration of this heterogeneous information, a major effort is, and will continue to be, devoted to morphological microsemiology (microscopic morphology semantics). Besides ensuring the traceability of the results (second opinion) and supporting the orchestration of high-content image analysis modules, the role of semantics will be crucial for the correlation between digital pathology and noninvasive medical imaging modalities. In addition, semantics has an important role in modelling the links between traditional microscopy and recent label-free technologies. The massive amount of visual data is challenging and represents a characteristic intrinsic to digital pathology. The design of an operational integrative microscopy framework needs to focus on a scalable multiscale imaging formalism. In this sense, we prospectively consider some of the most recent scalable methodologies adapted to digital pathology, such as marked point processes for nuclear atypia and point-set mathematical morphology for architecture grading. To orchestrate this scalable framework, semantics-based WSI management (analysis, exploration, indexing, retrieval and report generation support) represents an important means toward integrating big data into biomedicine. This insight reflects our vision, instantiated through the essential building blocks of this type of architecture. The generic approach introduced here is applicable to a number of challenges related to molecular imaging, high-content image management and, more generally, bioinformatics.
Quality Scalability Aware Watermarking for Visual Content.
Bhowmik, Deepayan; Abhayaratne, Charith
2016-11-01
Scalable coding-based content adaptation poses serious challenges to traditional watermarking algorithms, which do not consider the scalable coding structure and hence cannot guarantee correct watermark extraction in the media consumption chain. In this paper, we propose a novel concept of scalable blind watermarking that ensures more robust watermark extraction at various compression ratios while not affecting the visual quality of the host media. The proposed algorithm generates a scalable and robust watermarked image code-stream that allows the user to constrain embedding distortion for target content adaptations. The watermarked image code-stream consists of hierarchically nested joint distortion-robustness coding atoms. The code-stream is generated by a new wavelet-domain blind watermarking algorithm guided by a quantization-based binary tree. The code-stream can be truncated at any distortion-robustness atom to generate the watermarked image with the desired distortion-robustness requirements. A blind extractor is capable of extracting watermark data from the watermarked images. The algorithm is further extended to incorporate a bit-plane discarding-based quantization model used in scalable coding-based content adaptation, e.g., JPEG2000. This improves the robustness against quality scalability of JPEG2000 compression. The simulation results verify the feasibility of the proposed concept, its applications, and its improved robustness against quality scalable content adaptation. Our proposed algorithm also outperforms existing methods, showing a 35% improvement. In terms of robustness to quality scalable video content adaptation using Motion JPEG2000 and wavelet-based scalable video coding, the proposed method shows major improvement for video watermarking.
NASA Astrophysics Data System (ADS)
Minnett, R.; Koppers, A. A. P.; Jarboe, N.; Tauxe, L.; Constable, C.; Jonestrask, L.; Shaar, R.
2014-12-01
Earth science grand challenges often require interdisciplinary and geographically distributed scientific collaboration to make significant progress. However, this organic collaboration between researchers, educators, and students only flourishes with the reduction or elimination of technological barriers. The Magnetics Information Consortium (http://earthref.org/MagIC/) is a grass-roots cyberinfrastructure effort envisioned by the geo-, paleo-, and rock magnetic scientific community to archive their wealth of peer-reviewed raw data and interpretations from studies on natural and synthetic samples. MagIC is dedicated to facilitating scientific progress towards several highly multidisciplinary grand challenges and the MagIC Database team is currently beta testing a new MagIC Search Interface and API designed to be flexible enough for the incorporation of large heterogeneous datasets and for horizontal scalability to tens of millions of records and hundreds of requests per second. In an effort to reduce the barriers to effective collaboration, the search interface includes a simplified data model and upload procedure, support for online editing of datasets amongst team members, commenting by reviewers and colleagues, and automated contribution workflows and data retrieval through the API. This web application has been designed to generalize to other databases in MagIC's umbrella website (EarthRef.org) so the Geochemical Earth Reference Model (http://earthref.org/GERM/) portal, Seamount Biogeosciences Network (http://earthref.org/SBN/), EarthRef Digital Archive (http://earthref.org/ERDA/) and EarthRef Reference Database (http://earthref.org/ERR/) will benefit from its development.
Photo-z-SQL: Integrated, flexible photometric redshift computation in a database
NASA Astrophysics Data System (ADS)
Beck, R.; Dobos, L.; Budavári, T.; Szalay, A. S.; Csabai, I.
2017-04-01
We present a flexible template-based photometric redshift estimation framework, implemented in C#, that can be seamlessly integrated into a SQL database (or DB) server and executed on-demand in SQL. The DB integration eliminates the need to move large photometric datasets outside a database for redshift estimation, and utilizes the computational capabilities of DB hardware. The code is able to perform both maximum likelihood and Bayesian estimation, and can handle inputs of variable photometric filter sets and corresponding broad-band magnitudes. It is possible to take into account the full covariance matrix between filters, and filter zero points can be empirically calibrated using measurements with given redshifts. The list of spectral templates and the prior can be specified flexibly, and the expensive synthetic magnitude computations are done via lazy evaluation, coupled with a caching of results. Parallel execution is fully supported. For large upcoming photometric surveys such as the LSST, the ability to perform in-place photo-z calculation would be a significant advantage. Also, the efficient handling of variable filter sets is a necessity for heterogeneous databases, for example the Hubble Source Catalog, and for cross-match services such as SkyQuery. We illustrate the performance of our code on two reference photo-z estimation testing datasets, and provide an analysis of execution time and scalability with respect to different configurations. The code is available for download at https://github.com/beckrob/Photo-z-SQL.
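The grid-based template fitting described above lends itself to a compact illustration. Below is a minimal sketch, in Python rather than the paper's C#, of photo-z estimation with both a maximum-likelihood point estimate and a Bayesian posterior over a redshift grid; the template grid, array shapes, flat prior, and the direct comparison in magnitude space are illustrative simplifications, not the Photo-z-SQL implementation.

```python
import numpy as np

# templates[t, i, b]: synthetic magnitude of template t at redshift
# z_grid[i] in band b. Shapes and the flat prior are assumptions.
def photo_z(mags, mag_errs, templates, z_grid, prior=None):
    chi2 = (((templates - mags) / mag_errs) ** 2).sum(axis=2)
    like = np.exp(-0.5 * (chi2 - chi2.min()))   # likelihood per (t, z)
    prior = np.ones_like(like) if prior is None else prior
    post = (like * prior).sum(axis=0)           # marginalize over templates
    post /= post.sum()                          # Bayesian posterior on z_grid
    z_ml = z_grid[np.unravel_index(chi2.argmin(), chi2.shape)[1]]
    return z_ml, post

z_grid = np.linspace(0.0, 2.0, 201)
templates = np.random.default_rng(0).normal(22.0, 0.5, (3, 201, 5))
z_ml, post = photo_z(templates[1, 60] + 0.01, np.full(5, 0.05),
                     templates, z_grid)
print(z_ml, post.argmax())   # recovers the redshift of the injected source
```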
Evaluating a NoSQL Alternative for Chilean Virtual Observatory Services
NASA Astrophysics Data System (ADS)
Antognini, J.; Araya, M.; Solar, M.; Valenzuela, C.; Lira, F.
2015-09-01
Currently, the standards and protocols for data access in the Virtual Observatory architecture (DAL) are generally implemented with relational databases based on SQL. In particular, the Astronomical Data Query Language (ADQL), the language used by the IVOA to represent queries to VO services, was created to satisfy the different data access protocols, such as Simple Cone Search. ADQL is based on SQL92 and has extra functionality implemented using PgSphere. An emerging alternative to SQL is the family of so-called NoSQL databases, which can be classified into several categories such as Column, Document, Key-Value, Graph, and Object, each one recommended for different scenarios. Among their notable characteristics are schema-free design, easy replication support, simple APIs, and Big Data support. The Chilean Virtual Observatory (ChiVO) is developing a functional prototype based on the IVOA architecture, with the following relevant factors: performance, scalability, flexibility, complexity, and functionality. Currently, it is very difficult to compare these factors, due to a lack of alternatives. The objective of this paper is to compare NoSQL alternatives with SQL through the implementation of a REST Web API that satisfies ChiVO's needs: a SESAME-style name resolver for the data from ALMA. Therefore, we propose a test scenario by configuring a NoSQL database with data from different sources and evaluating the feasibility of creating a Simple Cone Search service and its performance. This comparison will help pave the way for the application of Big Data databases in the Virtual Observatory.
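As a concrete illustration of the kind of service being evaluated, here is a minimal cone-search sketch against MongoDB using its built-in spherical geometry support; the collection name, field layout, and the RA-to-longitude shift are assumptions for illustration, not ChiVO's actual schema.

```python
from math import radians
from pymongo import MongoClient, GEOSPHERE

# Positions are assumed stored as GeoJSON points with RA shifted into
# MongoDB's longitude range [-180, 180).
sources = MongoClient("mongodb://localhost:27017").chivo.sources
sources.create_index([("pos", GEOSPHERE)])

def cone_search(ra_deg, dec_deg, radius_deg):
    lng = ra_deg - 360.0 if ra_deg >= 180.0 else ra_deg   # RA -> longitude
    return list(sources.find({"pos": {"$geoWithin": {
        "$centerSphere": [[lng, dec_deg], radians(radius_deg)],
    }}}))

for src in cone_search(83.82, -5.39, 0.5):   # half-degree cone near Orion
    print(src["name"], src["pos"])
```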
Database Resources of the BIG Data Center in 2018
Xu, Xingjian; Hao, Lili; Zhu, Junwei; Tang, Bixia; Zhou, Qing; Song, Fuhai; Chen, Tingting; Zhang, Sisi; Dong, Lili; Lan, Li; Wang, Yanqing; Sang, Jian; Hao, Lili; Liang, Fang; Cao, Jiabao; Liu, Fang; Liu, Lin; Wang, Fan; Ma, Yingke; Xu, Xingjian; Zhang, Lijuan; Chen, Meili; Tian, Dongmei; Li, Cuiping; Dong, Lili; Du, Zhenglin; Yuan, Na; Zeng, Jingyao; Zhang, Zhewen; Wang, Jinyue; Shi, Shuo; Zhang, Yadong; Pan, Mengyu; Tang, Bixia; Zou, Dong; Song, Shuhui; Sang, Jian; Xia, Lin; Wang, Zhennan; Li, Man; Cao, Jiabao; Niu, Guangyi; Zhang, Yang; Sheng, Xin; Lu, Mingming; Wang, Qi; Xiao, Jingfa; Zou, Dong; Wang, Fan; Hao, Lili; Liang, Fang; Li, Mengwei; Sun, Shixiang; Zou, Dong; Li, Rujiao; Yu, Chunlei; Wang, Guangyu; Sang, Jian; Liu, Lin; Li, Mengwei; Li, Man; Niu, Guangyi; Cao, Jiabao; Sun, Shixiang; Xia, Lin; Yin, Hongyan; Zou, Dong; Xu, Xingjian; Ma, Lina; Chen, Huanxin; Sun, Yubin; Yu, Lei; Zhai, Shuang; Sun, Mingyuan; Zhang, Zhang; Zhao, Wenming; Xiao, Jingfa; Bao, Yiming; Song, Shuhui; Hao, Lili; Li, Rujiao; Ma, Lina; Sang, Jian; Wang, Yanqing; Tang, Bixia; Zou, Dong; Wang, Fan
2018-01-01
Abstract The BIG Data Center at Beijing Institute of Genomics (BIG) of the Chinese Academy of Sciences provides freely open access to a suite of database resources in support of worldwide research activities in both academia and industry. With the vast amounts of omics data generated at ever-greater scales and rates, the BIG Data Center is continually expanding, updating and enriching its core database resources through big-data integration and value-added curation, including BioCode (a repository archiving bioinformatics tool codes), BioProject (a biological project library), BioSample (a biological sample library), Genome Sequence Archive (GSA, a data repository for archiving raw sequence reads), Genome Warehouse (GWH, a centralized resource housing genome-scale data), Genome Variation Map (GVM, a public repository of genome variations), Gene Expression Nebulas (GEN, a database of gene expression profiles based on RNA-Seq data), Methylation Bank (MethBank, an integrated databank of DNA methylomes), and Science Wikis (a series of biological knowledge wikis for community annotations). In addition, three featured web services are provided, viz., BIG Search (search as a service; a scalable inter-domain text search engine), BIG SSO (single sign-on as a service; a user access control system to gain access to multiple independent systems with a single ID and password) and Gsub (submission as a service; a unified submission service for all relevant resources). All of these resources are publicly accessible through the home page of the BIG Data Center at http://bigd.big.ac.cn. PMID:29036542
The probabilistic nature of preferential choice.
Rieskamp, Jörg
2008-11-01
Previous research has developed a variety of theories explaining when and why people's decisions under risk deviate from the standard economic view of expected utility maximization. These theories are limited in their predictive accuracy in that they do not explain the probabilistic nature of preferential choice, that is, why an individual makes different choices in nearly identical situations, or why the magnitude of these inconsistencies varies in different situations. To illustrate the advantage of probabilistic theories, three probabilistic theories of decision making under risk are compared with their deterministic counterparts. The probabilistic theories are (a) a probabilistic version of a simple choice heuristic, (b) a probabilistic version of cumulative prospect theory, and (c) decision field theory. By testing the theories with the data from three experimental studies, the superiority of the probabilistic models over their deterministic counterparts in predicting people's decisions under risk becomes evident. When testing the probabilistic theories against each other, decision field theory provides the best account of the observed behavior.
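To make the deterministic/probabilistic contrast concrete, the sketch below wraps a deterministic expected-utility comparison in a logistic choice rule, so that nearly identical gambles yield choice probabilities near 0.5 instead of a fixed choice. The utility function, gambles, and sensitivity parameter are illustrative; this is a generic probabilistic extension, not any of the three theories tested in the paper.

```python
import numpy as np

def expected_utility(gamble, alpha=0.8):
    # gamble: list of (probability, outcome); power utility is assumed
    return sum(p * (x ** alpha) for p, x in gamble)

def choice_probability(gamble_a, gamble_b, theta=1.0):
    # logistic (softmax) rule: higher theta -> more deterministic choice
    d = expected_utility(gamble_a) - expected_utility(gamble_b)
    return 1.0 / (1.0 + np.exp(-theta * d))

p = choice_probability([(0.5, 100), (0.5, 0)], [(1.0, 45)])
print(f"P(choose risky gamble) = {p:.2f}")
```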
NASA Technical Reports Server (NTRS)
Singhal, Surendra N.
2003-01-01
The SAE G-11 RMSL Division and Probabilistic Methods Committee meeting, sponsored by the Picatinny Arsenal during March 1-3, 2004 at the Westin Morristown, will report progress on projects for probabilistic assessment of Army systems and launch an initiative for probabilistic education. The meeting features several Army and industry senior executives and an Ivy League professor, providing an industry/government/academia forum to review RMSL technology; reliability and probabilistic technology; reliability-based design methods; software reliability; and maintainability standards. With over 100 members, including members of national and international standing, the G-11's Probabilistic Methods Committee has the mission of enabling and facilitating the rapid deployment of probabilistic technology to enhance the competitiveness of our industries through better, faster, greener, smarter, affordable and reliable product development.
User Centric Job Monitoring - a redesign and novel approach in the STAR experiment
NASA Astrophysics Data System (ADS)
Arkhipkin, D.; Lauret, J.; Zulkarneeva, Y.
2014-06-01
User Centric Monitoring (or UCM) has been a long-awaited feature in STAR, whereby programs, workflows and system "events" can be logged, broadcast and later analyzed. UCM collects and filters available job monitoring information from various resources and presents it to users in a user-centric view rather than from an administrative-centric point of view. The first attempt at and implementation of a UCM approach was made in STAR in 2004 using a log4cxx plug-in back-end; it then evolved with an attempt to push toward a scalable database back-end (2006) and finally a Web-Service approach (2010, CSW4DB SBIR). The latter proved to be incomplete and did not address the evolving needs of the experiment, where streamlined messages for online (data acquisition) purposes as well as continuous support for data mining and event analysis need to coexist in a seamless, unified approach. The code also proved hard to maintain. This paper presents the next evolutionary step of the UCM toolkit, a redesign and redirection of our latest attempt, acknowledging and integrating recent technologies in a simpler, maintainable and yet scalable manner. The extended version of the job logging package is built upon a three-tier approach based on Task, Job and Event, and features a Web-Service based logging API, a responsive AJAX-powered user interface, and a database back-end relying on MongoDB, which is uniquely suited for STAR's needs. In addition, we present details of the integration of this logging package with the STAR offline and online software frameworks. Leveraging the reported experience of the ATLAS and CMS experiments with the ESPER engine, we discuss and show how such an approach has been implemented in STAR for meta-data event-triggered stream processing and filtering. An ESPER-based solution seems to fit well into the online data acquisition system, where many systems are monitored.
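A minimal sketch of the three-tier Task/Job/Event logging idea on MongoDB follows; the database, collection, and field names are assumptions for illustration, not the actual STAR UCM schema.

```python
import datetime
from pymongo import MongoClient, ASCENDING

# One document per event, keyed by its owning task and job.
db = MongoClient("mongodb://localhost:27017").ucm
db.events.create_index([("task", ASCENDING), ("job", ASCENDING)])

def log_event(task_id, job_id, level, message):
    db.events.insert_one({
        "task": task_id, "job": job_id, "level": level,
        "message": message, "ts": datetime.datetime.utcnow(),
    })

log_event("task-42", "job-7", "INFO", "staging input files")
# user-centric view: everything that happened to one job, in order
for ev in db.events.find({"task": "task-42", "job": "job-7"}).sort("ts"):
    print(ev["ts"], ev["level"], ev["message"])
```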
NASA Astrophysics Data System (ADS)
Zaslavsky, I.; Richard, S. M.; Valentine, D. W., Jr.; Grethe, J. S.; Hsu, L.; Malik, T.; Bermudez, L. E.; Gupta, A.; Lehnert, K. A.; Whitenack, T.; Ozyurt, I. B.; Condit, C.; Calderon, R.; Musil, L.
2014-12-01
EarthCube is envisioned as a cyberinfrastructure that fosters new, transformational geoscience by enabling sharing, understanding and scientifically-sound and efficient re-use of formerly unconnected data resources, software, models, repositories, and computational power. Its purpose is to enable science enterprise and workforce development via an extensible and adaptable collaboration and resource integration framework. A key component of this vision is the development of comprehensive inventories supporting resource discovery and re-use across geoscience domains. The goal of the EarthCube CINERGI (Community Inventory of EarthCube Resources for Geoscience Interoperability) project is to create a methodology and assemble a large inventory of high-quality information resources with standard metadata descriptions and traceable provenance. The inventory is compiled from metadata catalogs maintained by geoscience data facilities, as well as from user contributions. The latter mechanism relies on community resource viewers: online applications that support update and curation of metadata records. Once harvested into CINERGI, metadata records from domain catalogs and community resource viewers are loaded into a staging database implemented in MongoDB, and validated for compliance with the ISO 19139 metadata schema. Several types of metadata defects detected by the validation engine are automatically corrected with the help of several information extractors or flagged for manual curation. The metadata harvesting, validation and processing components generate provenance statements using the W3C PROV notation, which are stored in a Neo4j database. The curated metadata, along with the provenance information, is re-published and accessed programmatically and via a CINERGI online application. This presentation focuses on the role of resource inventories in a scalable and adaptable information infrastructure, and on the CINERGI metadata pipeline and its implementation challenges. Key project components are described at the project's website (http://workspace.earthcube.org/cinergi), which also provides access to the initial resource inventory, the inventory metadata model, metadata entry forms and a collection of the community resource viewers.
CMS users data management service integration and first experiences with its NoSQL data storage
NASA Astrophysics Data System (ADS)
Riahi, H.; Spiga, D.; Boccali, T.; Ciangottini, D.; Cinquilli, M.; Hernàndez, J. M.; Konstantinov, P.; Mascheroni, M.; Santocchia, A.
2014-06-01
The distributed data analysis workflow in CMS assumes that jobs run in a different location to where their results are finally stored. Typically the user outputs must be transferred from one site to another by a dedicated CMS service, AsyncStageOut. This service was originally developed to address the inefficiency in using CMS computing resources when transferring analysis job outputs synchronously, once they are produced on the job execution node, to the remote site. The AsyncStageOut is designed as a thin application relying only on a NoSQL database (CouchDB) for input and data storage. It has progressed from a limited prototype to a highly adaptable service which manages and monitors all the user file steps, namely file transfer and publication. The AsyncStageOut is integrated with the Common CMS/ATLAS Analysis Framework. It foresees the management of nearly 200k user files per day from close to 1000 individual users per month with minimal delays, while providing real-time monitoring and reports to users and service operators and remaining highly available. The associated data volume represents a new set of challenges in the areas of database scalability and service performance and efficiency. In this paper, we present an overview of the AsyncStageOut model and the integration strategy with the Common Analysis Framework. The motivations for using NoSQL technology are also presented, as well as the data design and the techniques used for efficient indexing and monitoring of the data. We describe the deployment model for high availability and scalability of the service. We also discuss the hardware requirements and the results achieved, as they were determined by testing with actual data and realistic loads during the commissioning and initial production phase with the Common Analysis Framework.
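Since CouchDB is driven entirely over HTTP, the document-per-transfer design can be sketched with plain REST calls; the database name, document identifier, and state fields below are illustrative assumptions, not the actual AsyncStageOut schema.

```python
import requests

# Record one user-file transfer as a CouchDB document via the REST API.
COUCH = "http://localhost:5984"
requests.put(f"{COUCH}/asyncstageout")   # create the DB (412 if it exists)

doc = {
    "user": "jdoe",
    "source_lfn": "/store/temp/user/jdoe/out_1.root",
    "destination": "T2_IT_Pisa",
    "state": "new",                      # new -> acquired -> done / failed
}
r = requests.put(f"{COUCH}/asyncstageout/jdoe_out_1", json=doc)
print(r.status_code, r.json())           # 201 and {'ok': True, ...} on success
```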
NASA Astrophysics Data System (ADS)
Daniels, M. D.; Graves, S. J.; Kerkez, B.; Chandrasekar, V.; Vernon, F.; Martin, C. L.; Maskey, M.; Keiser, K.; Dye, M. J.
2015-12-01
The Cloud-Hosted Real-time Data Services for the Geosciences (CHORDS) project was funded under the National Science Foundation's EarthCube initiative. CHORDS addresses the ever-increasing importance of real-time scientific data in the geosciences, particularly in mission-critical scenarios, where informed decisions must be made rapidly. Access to constant streams of real-time data also allows many new transient phenomena in space-time to be observed; however, much of these streaming data are either completely inaccessible or only available to proprietary in-house tools or displays. Small research teams do not have the resources to develop tools for the broad dissemination of their unique real-time data and require an easy-to-use, scalable, cloud-based solution to facilitate this access. CHORDS will make these diverse streams of real-time data available to the broader geosciences community. This talk will highlight recently developed CHORDS portal tools and processing systems which address some of the gaps in handling real-time data, particularly in the provisioning of data from the "long-tail" scientific community through a simple interface that is deployed in the cloud, is scalable and is able to be customized by research teams. A running portal, with operational data feeds from across the nation, will be presented. The processing within the CHORDS system will expose these real-time streams via standard services from the Open Geospatial Consortium (OGC) in a way that is simple and transparent to the data provider, while maximizing the usage of these investments. The ingestion of high-velocity, high-volume and diverse data has allowed the project to explore a NoSQL database implementation. Broad use of the CHORDS framework by geoscientists will help to facilitate adaptive experimentation, model assimilation and real-time hypothesis testing.
Students’ difficulties in probabilistic problem-solving
NASA Astrophysics Data System (ADS)
Arum, D. P.; Kusmayadi, T. A.; Pramudya, I.
2018-03-01
Many errors can be identified when students solve mathematics problems, particularly probabilistic problems. The present study aims to investigate students' difficulties in solving probabilistic problems, focusing on analyzing and describing students' errors while solving the problem. This research used the qualitative method with a case study strategy. The subjects in this research were ten students of 9th grade, selected by purposive sampling. Data in this research comprise students' probabilistic problem-solving results and recorded interviews regarding students' difficulties in solving the problem. These data were analyzed descriptively using Miles and Huberman's steps. The results show that students' difficulties in solving probabilistic problems can be divided into three categories. The first relates to difficulties in understanding the probabilistic problem. The second concerns difficulties in choosing and using appropriate strategies for solving the problem. The third involves difficulties with the computational process in solving the problem. The results suggest that students still have difficulties in solving probabilistic problems, meaning they have not yet been able to use their knowledge and abilities when responding to probabilistic problems. Therefore, it is important for mathematics teachers to plan probabilistic learning that can optimize students' probabilistic thinking ability.
A probabilistic Hu-Washizu variational principle
NASA Technical Reports Server (NTRS)
Liu, W. K.; Belytschko, T.; Besterfield, G. H.
1987-01-01
A Probabilistic Hu-Washizu Variational Principle (PHWVP) for the Probabilistic Finite Element Method (PFEM) is presented. This formulation is developed for both linear and nonlinear elasticity. The PHWVP allows incorporation of the probabilistic distributions for the constitutive law, compatibility condition, equilibrium, domain and boundary conditions into the PFEM. Thus, a complete probabilistic analysis can be performed where all aspects of the problem are treated as random variables and/or fields. The Hu-Washizu variational formulation is available in many conventional finite element codes thereby enabling the straightforward inclusion of the probabilistic features into present codes.
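For reference, the classical Hu-Washizu functional, with displacement u, strain ε and stress σ varied independently, can be written as below; in the probabilistic formulation the elasticity tensor C, body force b, and boundary data become random variables or fields. This is the standard textbook form, not a reproduction of the paper's notation.

```latex
% Hu-Washizu three-field functional (displacement boundary term omitted):
\Pi_{\mathrm{HW}}(u,\varepsilon,\sigma) =
  \int_\Omega \left[ \tfrac{1}{2}\,\varepsilon : C : \varepsilon
    + \sigma : \left(\nabla^{s} u - \varepsilon\right)
    - b \cdot u \right] d\Omega
  \;-\; \int_{\Gamma_t} \bar{t} \cdot u \, d\Gamma
```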
Adaptive format conversion for scalable video coding
NASA Astrophysics Data System (ADS)
Wan, Wade K.; Lim, Jae S.
2001-12-01
The enhancement layer in many scalable coding algorithms is composed of residual coding information. There is another type of information that can be transmitted instead of (or in addition to) residual coding. Since the encoder has access to the original sequence, it can utilize adaptive format conversion (AFC) to generate the enhancement layer and transmit the different format conversion methods as enhancement data. This paper investigates the use of adaptive format conversion information as enhancement data in scalable video coding. Experimental results are shown for a wide range of base layer qualities and enhancement bitrates to determine when AFC can improve video scalability. Since the parameters needed for AFC are small compared to residual coding, AFC can provide video scalability at low enhancement layer bitrates that are not possible with residual coding. In addition, AFC can also be used in addition to residual coding to improve video scalability at higher enhancement layer bitrates. Adaptive format conversion has not been studied in detail, but many scalable applications may benefit from it. An example of an application that AFC is well-suited for is the migration path for digital television where AFC can provide immediate video scalability as well as assist future migrations.
Navigation system for autonomous mapper robots
NASA Astrophysics Data System (ADS)
Halbach, Marc; Baudoin, Yvan
1993-05-01
This paper describes the conception and realization of a fast, robust, and general navigation system for a mobile (wheeled or legged) robot. A database representing a high-level map of the environment is generated and continuously updated. The first part describes the legged target vehicle and the hexapod robot being developed. The second section deals with spatial and temporal sensor fusion for dynamic environment modeling within an obstacle/free-space probabilistic classification grid. Ultrasonic sensors are used, others are expected to be integrated, and a priori knowledge is taken into account. The ultrasonic sensors are controlled by the path planning module. The third part concerns path planning; a simulation of a wheeled robot is also presented.
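The obstacle/free-space probabilistic grid mentioned above is commonly maintained as per-cell occupancy log-odds updated by Bayes' rule; the sketch below shows this update for ultrasonic hits and misses, with made-up sensor-model probabilities (the paper does not specify these values).

```python
import numpy as np

# Log-odds increments for an "occupied" and a "free" ultrasonic reading,
# assuming an illustrative sensor model P(hit | occupied) = 0.7.
L_OCC, L_FREE = np.log(0.7 / 0.3), np.log(0.3 / 0.7)

grid = np.zeros((100, 100))          # log-odds 0 == probability 0.5

def update_cell(ix, iy, hit):
    """Fuse one ultrasonic reading into cell (ix, iy)."""
    grid[ix, iy] += L_OCC if hit else L_FREE

def occupancy_probability(ix, iy):
    return 1.0 / (1.0 + np.exp(-grid[ix, iy]))

update_cell(10, 12, hit=True)
update_cell(10, 12, hit=True)
print(f"P(occupied) = {occupancy_probability(10, 12):.2f}")   # ~0.84
```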
Probabilistic Aeroelastic Analysis of Turbomachinery Components
NASA Technical Reports Server (NTRS)
Reddy, T. S. R.; Mital, S. K.; Stefko, G. L.
2004-01-01
A probabilistic approach is described for aeroelastic analysis of turbomachinery blade rows. Blade rows with subsonic flow and blade rows with supersonic flow with subsonic leading edge are considered. To demonstrate the probabilistic approach, the flutter frequency, damping and forced response of a blade row representing a compressor geometry are considered. The analysis accounts for uncertainties in structural and aerodynamic design variables. The results are presented in the form of probability density functions (PDF) and sensitivity factors. For the subsonic flow cascade, comparisons are also made with different probabilistic distributions, probabilistic methods, and Monte-Carlo simulation. The approach shows that the probabilistic approach provides a more realistic and systematic way to assess the effect of uncertainties in design variables on the aeroelastic instabilities and response.
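In the same spirit, a plain Monte Carlo version of such an analysis can be sketched in a few lines: sample the uncertain structural and aerodynamic variables, push them through a response model, and read off an instability probability and crude sensitivity factors. The closed-form damping surrogate below is a made-up stand-in for an actual aeroelastic solver.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
stiffness = rng.normal(1.00, 0.05, n)    # structural design variable
dyn_press = rng.normal(0.80, 0.10, n)    # aerodynamic design variable

def damping(k, q):
    # surrogate response: positive damping means aeroelastically stable
    return 0.005 + 0.04 * (k - 1.0) - 0.05 * (q - 0.8)

g = damping(stiffness, dyn_press)
print("P(instability) ~", np.mean(g < 0.0))
# crude sensitivity factors: correlation of the response with each input
for name, x in [("stiffness", stiffness), ("dyn. pressure", dyn_press)]:
    print(name, "sensitivity ~", round(np.corrcoef(x, g)[0, 1], 2))
```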
A privacy preserving protocol for tracking participants in phase I clinical trials.
El Emam, Khaled; Farah, Hanna; Samet, Saeed; Essex, Aleksander; Jonker, Elizabeth; Kantarcioglu, Murat; Earle, Craig C
2015-10-01
Some phase 1 clinical trials offer strong financial incentives for healthy individuals to participate in their studies. There is evidence that some individuals enroll in multiple trials concurrently. This creates safety risks and introduces data quality problems into the trials. Our objective was to construct a privacy preserving protocol to track phase 1 participants to detect concurrent enrollment. A protocol using secure probabilistic querying against a database of trial participants that allows for screening during telephone interviews and on-site enrollment was developed. The match variables consisted of demographic information. The accuracy (sensitivity, precision, and negative predictive value) of the matching and its computational performance in seconds were measured under simulated environments. Accuracy was also compared to non-secure matching methods. The protocol performance scales linearly with the database size. At the largest database size of 20,000 participants, a query takes under 20 s on a 64-core machine. Sensitivity, precision, and negative predictive value of the queries were consistently at or above 0.9, and were very similar to non-secure versions of the protocol. The protocol provides a reasonable solution to the concurrent enrollment problems in phase 1 clinical trials, and is able to ensure that personal information about participants is kept secure. Copyright © 2015 The Authors. Published by Elsevier Inc. All rights reserved.
A probabilistic model for detecting rigid domains in protein structures.
Nguyen, Thach; Habeck, Michael
2016-09-01
Large-scale conformational changes in proteins are implicated in many important biological functions. These structural transitions can often be rationalized in terms of relative movements of rigid domains. There is a need for objective and automated methods that identify rigid domains in sets of protein structures showing alternative conformational states. We present a probabilistic model for detecting rigid-body movements in protein structures. Our model aims to approximate alternative conformational states by a few structural parts that are rigidly transformed under the action of a rotation and a translation. By using Bayesian inference and Markov chain Monte Carlo sampling, we estimate all parameters of the model, including a segmentation of the protein into rigid domains, the structures of the domains themselves, and the rigid transformations that generate the observed structures. We find that our Gibbs sampling algorithm can also estimate the optimal number of rigid domains with high efficiency and accuracy. We assess the power of our method on several thousand entries of the DynDom database and discuss applications to various complex biomolecular systems. The Python source code for protein ensemble analysis is available at: https://github.com/thachnguyen/motion_detection. Contact: mhabeck@gwdg.de. © The Author 2016. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.
Probabilistic Feasibility of the Reconstruction Process of Russian-Orthodox Churches
NASA Astrophysics Data System (ADS)
Chizhova, M.; Brunn, A.; Stilla, U.
2016-06-01
The cultural heritage of humanity is important for the identity of following generations and has to be preserved in a suitable manner. In the course of time much information about former cultural constructions has been lost because some objects were strongly damaged by natural erosion or by human activity, or were even destroyed. It is important to capture the still-available building parts of former buildings, mostly ruins. These data could be the basis for a virtual reconstruction. Laser scanning offers, in principle, the possibility of extensively capturing building surfaces in their actual state. In this paper we assume a priori given 3D laser scanner data, a 3D point cloud of the partly destroyed church. There are many well-known algorithms that describe different methods for the extraction and detection of geometric primitives, each recognized separately in 3D point clouds. In our work we put them into a common probabilistic framework, which guides the complete reconstruction process of complex buildings, in our case Russian-orthodox churches. Churches are modeled with their functional volumetric components, enriched with a priori known probabilities, which are deduced from a database of Russian-orthodox churches. Each set of components represents a complete church. The power of the new method is shown for a simulated dataset of 100 Russian-orthodox churches.
Community-based early warning systems for flood risk mitigation in Nepal
NASA Astrophysics Data System (ADS)
Smith, Paul J.; Brown, Sarah; Dugar, Sumit
2017-03-01
This paper focuses on the use of community-based early warning systems for flood resilience in Nepal. The first part of the work outlines the evolution and current status of these community-based systems, highlighting the limited lead times currently available for early warning. The second part of the paper focuses on the development of a robust operational flood forecasting methodology for use by the Nepal Department of Hydrology and Meteorology (DHM) to enhance early warning lead times. The methodology uses data-based physically interpretable time series models and data assimilation to generate probabilistic forecasts, which are presented in a simple visual tool. The approach is designed to work in situations of limited data availability with an emphasis on sustainability and appropriate technology. The successful application of the forecast methodology to the flood-prone Karnali River basin in western Nepal is outlined, increasing lead times from 2-3 to 7-8 h. The challenges faced in communicating probabilistic forecasts to the last mile of the existing community-based early warning systems across Nepal are discussed. The paper concludes with an assessment of the applicability of this approach in basins and countries beyond Karnali and Nepal and an overview of key lessons learnt from this initiative.
Sequence similarity is more relevant than species specificity in probabilistic backtranslation.
Ferro, Alfredo; Giugno, Rosalba; Pigola, Giuseppe; Pulvirenti, Alfredo; Di Pietro, Cinzia; Purrello, Michele; Ragusa, Marco
2007-02-21
Backtranslation is the process of decoding a sequence of amino acids into the corresponding codons. All synthetic gene design systems include a backtranslation module. The degeneracy of the genetic code makes backtranslation potentially ambiguous, since most amino acids are encoded by multiple codons. The common approach to overcome this difficulty is based on imitation of codon usage within the target species. This paper describes EasyBack, a new parameter-free, fully-automated software for backtranslation using Hidden Markov Models. EasyBack is not based on imitation of codon usage within the target species, but instead uses a sequence-similarity criterion. The model is trained with a set of proteins with known cDNA coding sequences, constructed from the input protein by querying the NCBI databases with BLAST. Unlike existing software, the proposed method allows the quality of prediction to be estimated. When tested on a group of proteins that show different degrees of sequence conservation, EasyBack outperforms other published methods in terms of precision. The prediction quality of a protein backtranslation method is markedly increased by replacing the criterion of the most used codon in the same species with a Hidden Markov Model trained with a set of the most similar sequences from all species. Moreover, the proposed method allows the quality of prediction to be estimated probabilistically.
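A heavily simplified sketch of the similarity-based idea: estimate per-amino-acid codon probabilities from the coding sequences of the most similar proteins (e.g., BLAST hits) rather than from species-wide codon usage, and report the estimated probability as a confidence. This is a plain frequency model standing in for EasyBack's Hidden Markov Model.

```python
from collections import Counter, defaultdict

def codon_table(similar_cds):
    """similar_cds: list of (protein_seq, cdna_seq) training pairs."""
    counts = defaultdict(Counter)
    for prot, cdna in similar_cds:
        for i, aa in enumerate(prot):
            counts[aa][cdna[3 * i:3 * i + 3]] += 1
    return counts

def backtranslate(protein, counts):
    # per-residue argmax, with the count fraction as a confidence estimate
    out = []
    for aa in protein:
        codon, n = counts[aa].most_common(1)[0]
        out.append((codon, n / sum(counts[aa].values())))
    return out

table = codon_table([("MK", "ATGAAA"), ("MK", "ATGAAG"), ("MK", "ATGAAA")])
print(backtranslate("MK", table))   # ATG certain, AAA with ~2/3 confidence
```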
Protecting Data Privacy in Structured P2P Networks
NASA Astrophysics Data System (ADS)
Jawad, Mohamed; Serrano-Alvarado, Patricia; Valduriez, Patrick
P2P systems are increasingly used for efficient, scalable data sharing. Popular applications focus on massive file sharing. However, advanced applications such as online communities (e.g., medical or research communities) need to share private or sensitive data. Currently, in P2P systems, untrusted peers can easily violate data privacy by using data for malicious purposes (e.g., fraudulence, profiling). To prevent such behavior, the well-accepted Hippocratic database principle states that data owners should specify the purpose for which their data will be collected. In this paper, we apply such principles as well as reputation techniques to support purpose and trust in structured P2P systems. Hippocratic databases enforce purpose-based privacy while reputation techniques guarantee trust. We propose a P2P data privacy model which combines the Hippocratic principles and trust notions. We also present the algorithms of PriServ, a DHT-based P2P privacy service which supports this model and prevents data privacy violations. We show, in a performance evaluation, that PriServ introduces only a small overhead.
Conditions Database for the Belle II Experiment
NASA Astrophysics Data System (ADS)
Wood, L.; Elsethagen, T.; Schram, M.; Stephan, E.
2017-10-01
The Belle II experiment at KEK is preparing for first collisions in 2017. Processing the large amounts of data that will be produced will require conditions data to be readily available to systems worldwide in a fast and efficient manner that is straightforward for both the user and maintainer. The Belle II conditions database was designed with a straightforward goal: make it as easily maintainable as possible. To this end, HEP-specific software tools were avoided as much as possible and industry standard tools used instead. HTTP REST services were selected as the application interface, which provide a high-level interface to users through the use of standard libraries such as curl. The application interface itself is written in Java and runs in an embedded Payara-Micro Java EE application server. Scalability at the application interface is provided by use of Hazelcast, an open source In-Memory Data Grid (IMDG) providing distributed in-memory computing and supporting the creation and clustering of new application interface instances as demand increases. The IMDG provides fast and efficient access to conditions data via in-memory caching.
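A client interaction with such a REST service can be sketched as follows; the host, endpoint path, and parameter names are hypothetical placeholders, not the actual Belle II conditions database API.

```python
import requests

# Hypothetical base URL and endpoint; substitute the real service here.
BASE = "http://conditions.example.org/rest"

def get_payloads(global_tag, experiment, run):
    r = requests.get(f"{BASE}/iovPayloads",
                     params={"gtName": global_tag,
                             "expNumber": experiment,
                             "runNumber": run},
                     timeout=10)
    r.raise_for_status()
    return r.json()          # assumed: list of payload descriptors with URLs

for p in get_payloads("release-01-00-00", 3, 529):
    print(p)
```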
Gonzalez, Sergio; Clavijo, Bernardo; Rivarola, Máximo; Moreno, Patricio; Fernandez, Paula; Dopazo, Joaquín; Paniego, Norma
2017-02-22
In recent years, applications based on massively parallelized RNA sequencing (RNA-seq) have become valuable approaches for studying non-model species, i.e., those without a fully sequenced genome. RNA-seq is a useful tool for detecting novel transcripts and genetic variations and for evaluating differential gene expression by digital measurements. The large and complex datasets resulting from functional genomic experiments represent a challenge in data processing, management, and analysis. This problem is especially significant for small research groups working with non-model species. We developed a web-based application, called ATGC transcriptomics, with a flexible and adaptable interface that allows users to work with new-generation sequencing (NGS) transcriptomic analysis results using an ontology-driven database. This new application simplifies data exploration, visualization, and integration for a better comprehension of the results. ATGC transcriptomics provides non-expert computer users and small research groups with access to a scalable storage option and simple data integration, including database administration and management. The software is freely available under the terms of the GNU public license at http://atgcinta.sourceforge.net.
Monitoring of services with non-relational databases and map-reduce framework
NASA Astrophysics Data System (ADS)
Babik, M.; Souto, F.
2012-12-01
Service Availability Monitoring (SAM) is a well-established monitoring framework that performs regular measurements of the core site services and reports the corresponding availability and reliability of the Worldwide LHC Computing Grid (WLCG) infrastructure. One of the existing extensions of SAM is Site Wide Area Testing (SWAT), which gathers monitoring information from the worker nodes via instrumented jobs. This generates quite a lot of monitoring data to process, as there are several data points for every job and several million jobs are executed every day. The recent uptake of non-relational databases opens a new paradigm in the large-scale storage and distributed processing of systems with heavy read-write workloads. For SAM this brings new possibilities to improve its model, from performing aggregation of measurements to storing raw data and subsequent re-processing. Both SAM and SWAT are currently tuned to run at top performance, reaching some of the limits in storage and processing power of their existing Oracle relational database. We investigated the usability and performance of non-relational storage together with its distributed data processing capabilities. For this, several popular systems have been compared. In this contribution we describe our investigation of the existing non-relational databases suited for monitoring systems covering Cassandra, HBase and MongoDB. Further, we present our experiences in data modeling and prototyping map-reduce algorithms focusing on the extension of the already existing availability and reliability computations. Finally, possible future directions in this area are discussed, analyzing the current deficiencies of the existing Grid monitoring systems and proposing solutions to leverage the benefits of the non-relational databases to get more scalable and flexible frameworks.
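The availability computation maps naturally onto the map-reduce pattern discussed above: the map phase emits a (site, ok/total) pair per test result and the reduce phase combines them per site. The sketch below runs the pattern in-process with illustrative records; a real deployment would run the same two functions on Hadoop or a similar framework.

```python
from functools import reduce
from itertools import groupby

# Illustrative test results; a real job would read these from storage.
results = [
    {"site": "CERN-PROD", "status": "OK"},
    {"site": "CERN-PROD", "status": "CRITICAL"},
    {"site": "FZK-LCG2",  "status": "OK"},
]

def mapper(rec):
    return rec["site"], (1 if rec["status"] == "OK" else 0, 1)

def reducer(a, b):                       # pairwise-combinable, as in MR
    return a[0] + b[0], a[1] + b[1]

pairs = sorted(map(mapper, results))     # sort stands in for the shuffle
for site, group in groupby(pairs, key=lambda kv: kv[0]):
    ok, total = reduce(reducer, (v for _, v in group))
    print(site, "availability =", ok / total)
```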
Implementation of a Big Data Accessing and Processing Platform for Medical Records in Cloud.
Yang, Chao-Tung; Liu, Jung-Chun; Chen, Shuo-Tsung; Lu, Hsin-Wen
2017-08-18
Big Data analysis has become a key factor in being innovative and competitive. Along with population growth worldwide and the trend toward an aging population in developed countries, the rate of national medical care usage has been increasing. Because individual medical data are usually scattered across different institutions and their data formats vary, integrating these ever-growing data is challenging. In order to have scalable load capacity for these data platforms, we must build them on a good platform architecture. Some issues must be considered in order to use cloud computing to quickly integrate big medical data into a database for easy analysis, searching, and filtering of big data to obtain valuable information. This work builds a cloud storage system with HBase of Hadoop for storing and analyzing big data of medical records and improves the performance of importing data into the database. The data of medical records are stored on the HBase database platform for big data analysis. The system performs distributed computing on medical records through Hadoop MapReduce programming and provides functions including keyword search, data filtering, and basic statistics for the HBase database. The system uses the Put operation with a single-threaded method and the CompleteBulkload mechanism to import medical data. From the experimental results, we find that when the file size is less than 300 MB, the single-threaded Put method performs best, and when the file size is larger than 300 MB, the CompleteBulkload mechanism improves the performance of data import into the database. The system provides a web interface that allows users to search data, filter out meaningful information through the web, and analyze and convert data into suitable forms that will be helpful for medical staff and institutions.
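The reported size-based import rule is easy to capture in code. The sketch below stubs out the two import paths and applies the 300 MB threshold from the experiments; the function names and stub bodies are illustrative, with the actual Put and CompleteBulkload steps delegated to HBase client tooling in a real system.

```python
import os

THRESHOLD = 300 * 1024 * 1024     # 300 MB, per the reported experiments

def put_rows(path):
    # stub: a real system would stream rows via an HBase client Put
    print(f"Put: streaming rows of {path} into HBase single-threaded")

def complete_bulkload(path):
    # stub: a real system would build HFiles and invoke CompleteBulkload
    print(f"CompleteBulkload: building HFiles for {path} via MapReduce")

def import_records(path):
    if os.path.getsize(path) < THRESHOLD:
        return put_rows(path)
    return complete_bulkload(path)

import_records(__file__)          # a small file, so the Put path is chosen
```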
Probabilistic structural analysis methods for space propulsion system components
NASA Technical Reports Server (NTRS)
Chamis, C. C.
1986-01-01
The development of a three-dimensional inelastic analysis methodology for the Space Shuttle main engine (SSME) structural components is described. The methodology is composed of: (1) composite load spectra, (2) probabilistic structural analysis methods, (3) the probabilistic finite element theory, and (4) probabilistic structural analysis. The methodology has led to significant technical progress in several important aspects of probabilistic structural analysis. The program and accomplishments to date are summarized.
NASA Technical Reports Server (NTRS)
Townsend, J.; Meyers, C.; Ortega, R.; Peck, J.; Rheinfurth, M.; Weinstock, B.
1993-01-01
Probabilistic structural analyses and design methods are steadily gaining acceptance within the aerospace industry. The safety factor approach to design has long been the industry standard, and it is believed by many to be overly conservative and thus, costly. A probabilistic approach to design may offer substantial cost savings. This report summarizes several probabilistic approaches: the probabilistic failure analysis (PFA) methodology developed by Jet Propulsion Laboratory, fast probability integration (FPI) methods, the NESSUS finite element code, and response surface methods. Example problems are provided to help identify the advantages and disadvantages of each method.
Orhan, A Emin; Ma, Wei Ji
2017-07-26
Animals perform near-optimal probabilistic inference in a wide range of psychophysical tasks. Probabilistic inference requires trial-to-trial representation of the uncertainties associated with task variables and subsequent use of this representation. Previous work has implemented such computations using neural networks with hand-crafted and task-dependent operations. We show that generic neural networks trained with a simple error-based learning rule perform near-optimal probabilistic inference in nine common psychophysical tasks. In a probabilistic categorization task, error-based learning in a generic network simultaneously explains a monkey's learning curve and the evolution of qualitative aspects of its choice behavior. In all tasks, the number of neurons required for a given level of performance grows sublinearly with the input population size, a substantial improvement on previous implementations of probabilistic inference. The trained networks develop a novel sparsity-based probabilistic population code. Our results suggest that probabilistic inference emerges naturally in generic neural networks trained with error-based learning rules. Behavioural tasks often require probability distributions to be inferred about task-specific variables. Here, the authors demonstrate that generic neural networks can be trained using a simple error-based learning rule to perform such probabilistic computations efficiently without any need for task-specific operations.
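The flavor of the result can be reproduced at toy scale: a linear readout trained with a plain delta rule on two cues of unequal reliability converges to roughly the precision-weighted (optimal) combination, with no uncertainty computation built in by hand. The task setup, stimulus range, and noise levels below are illustrative, not the paper's tasks.

```python
import numpy as np

rng = np.random.default_rng(1)
sig1, sig2 = 1.0, 2.0                 # noise level of each cue
w = np.zeros(2)                       # readout weights
lr = 0.001

for _ in range(50_000):
    s = rng.uniform(-20, 20)          # latent stimulus
    x = np.array([s + rng.normal(0, sig1), s + rng.normal(0, sig2)])
    err = w @ x - s                   # prediction error
    w -= lr * err * x                 # delta rule, no uncertainty math

opt = np.array([1 / sig1**2, 1 / sig2**2])
opt /= opt.sum()                      # precision-weighted combination
print("learned:", w.round(2), "optimal:", opt.round(2))  # both ~[0.8, 0.2]
```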
GERIREX - growing a second generation medical expert system
DOE Office of Scientific and Technical Information (OSTI.GOV)
Kocur, J. Jr.; Suh, S.C.
This article describes GERIREX, a medical expert system serving as the core module of an integrated system for total management of a medical practice. GERIREX is currently a first-generation consultant in the domain of prescribing for the geriatric patient with multiple ailments. Employing rule and objective probabilistic knowledge representations, the system performs at the near-expert level, correctly ranking single and multiple drug therapy for hypertension and/or congestive heart failure in the presence of between two and seven of 18 common accompanying or underlying conditions. GERIREX creates permanent consultation records and can access patient information from existing databases. System requirements are met by very modest PCs, yet power, speed, flexibility, and ease of use rival or exceed those of many other systems. GERIREX interfaces with a variety of configurations and applications, including text, spreadsheets, databases, and executables, to fit in with current plans to upgrade to a second-generation system, providing a degree of self-maintenance through intelligent parsing of a drug data source such as the Physicians' Desk Reference (PDR, CD-ROM version). Another option under consideration is developing neural networks to both replace the current knowledge base and embody the rationale employed by the medical expert in evaluating drug data for treatment selection. In this version, the current drug database would be used as training data for the network tasked with adding new drugs to the drug database, imitating the process whereby a physician determines their personal arsenal from among the wide range of available options.
Ames Hybrid Combustion Facility
NASA Technical Reports Server (NTRS)
Zilliac, Greg; Karabeyoglu, Mustafa A.; Cantwell, Brian; Hunt, Rusty; DeZilwa, Shane; Shoffstall, Mike; Soderman, Paul T.; Bencze, Daniel P. (Technical Monitor)
2003-01-01
The report summarizes the design, fabrication, safety features, environmental impact, and operation of the Ames Hybrid-Fuel Combustion Facility (HCF). The facility is used in conducting research into the scalability and combustion processes of advanced paraffin-based hybrid fuels for the purpose of assessing their applicability to practical rocket systems. The facility was designed to deliver gaseous oxygen at rates between 0.5 and 16.0 kg/sec to a combustion chamber operating at pressures ranging from 300 to 900. The required run times were of the order of 10 to 20 sec. The facility proved to be robust and reliable and has been used to generate a database of regression-rate measurements of paraffin at oxygen mass flux levels comparable to those of moderate-sized hybrid rocket motors.
Object classification and outliers analysis in the forthcoming Gaia mission
NASA Astrophysics Data System (ADS)
Ordóñez-Blanco, D.; Arcay, B.; Dafonte, C.; Manteiga, M.; Ulla, A.
2010-12-01
Astrophysics is evolving towards the rational optimization of costly observational material by the intelligent exploitation of large astronomical databases from both terrestrial telescopes and spatial mission archives. However, there has been relatively little advance in the development of highly scalable data exploitation and analysis tools needed to generate the scientific returns from these large and expensively obtained datasets. Among the upcoming projects of astronomical instrumentation, Gaia is the next cornerstone ESA mission. The Gaia survey foresees the creation of a data archive and its future exploitation with automated or semi-automated analysis tools. This work reviews some of the work that is being developed by the Gaia Data Processing and Analysis Consortium for the object classification and analysis of outliers in the forthcoming mission.
Chemistry Modeling for Aerothermodynamics and TPS
NASA Technical Reports Server (NTRS)
Wang, Dunyou; Stallcop, James R.; Dateo, Christopher E.; Schwenke, David W.; Halicioglu, Timur; Huo, Winifred M.
2005-01-01
Recent advances in supercomputers and highly scalable quantum chemistry software render computational chemistry methods a viable means of providing chemistry data for aerothermal analysis at a specific level of confidence. Four examples of first principles quantum chemistry calculations will be presented. Study of the highly nonequilibrium rotational distribution of a nitrogen molecule from the exchange reaction N + N2 illustrates how chemical reactions can influence rotational distribution. The reaction C2H + H2 is one example of a radical reaction that occurs during hypersonic entry into an atmosphere containing methane. A study of the etching of a Si surface illustrates our approach to surface reactions. A recently developed web accessible database and software tool (DDD) that provides the radiation profile of diatomic molecules is also described.
Chemistry Modeling for Aerothermodynamics and TPS
NASA Technical Reports Server (NTRS)
Wang, Dun-You; Stallcop, James R.; Dateo, Christopher E.; Schwenke, David W.; Halicioglu, Timur; Huo, Winifred
2004-01-01
Recent advances in supercomputers and highly scalable quantum chemistry software render computational chemistry methods a viable means of providing chemistry data for aerothermal analysis at a specific level of confidence. Four examples of first principles quantum chemistry calculations will be presented. The study of the highly nonequilibrium rotational distribution of nitrogen molecule from the exchange reaction N + N2 illustrates how chemical reactions can influence the rotational distribution. The reaction C2H + H2 is one example of a radical reaction that occurs during hypersonic entry into a methane containing atmosphere. A study of the etching of Si surface illustrates our approach to surface reactions. A recently developed web accessible database and software tool (DDD) that provides the radiation profile of diatomic molecules is also described.
ECO: A Framework for Entity Co-Occurrence Exploration with Faceted Navigation
DOE Office of Scientific and Technical Information (OSTI.GOV)
Halliday, K. D.
2010-08-20
Even as highly structured databases and semantic knowledge bases become more prevalent, a substantial amount of human knowledge is reported as written prose. Typical textual reports, such as news articles, contain information about entities (people, organizations, and locations) and their relationships. Automatically extracting such relationships from large text corpora is a key component of corporate and government knowledge bases. The primary goal of the ECO project is to develop a scalable framework for extracting and presenting these relationships for exploration using an easily navigable faceted user interface. ECO uses entity co-occurrence relationships to identify related entities. The system aggregates and indexes information on each entity pair, allowing the user to rapidly discover and mine relational information.
NASA Astrophysics Data System (ADS)
Kiktenko, E. O.; Pozhar, N. O.; Anufriev, M. N.; Trushechkin, A. S.; Yunusov, R. R.; Kurochkin, Y. V.; Lvovsky, A. I.; Fedorov, A. K.
2018-07-01
Blockchain is a distributed database which is cryptographically protected against malicious modifications. While promising for a wide range of applications, current blockchain platforms rely on digital signatures, which are vulnerable to attacks by means of quantum computers. The same, albeit to a lesser extent, applies to cryptographic hash functions that are used in preparing new blocks, so parties with access to quantum computation would have unfair advantage in procuring mining rewards. Here we propose a possible solution to the quantum era blockchain challenge and report an experimental realization of a quantum-safe blockchain platform that utilizes quantum key distribution across an urban fiber network for information-theoretically secure authentication. These results address important questions about realizability and scalability of quantum-safe blockchains for commercial and governmental applications.
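For readers unfamiliar with the hash-chaining the abstract refers to, the sketch below shows the basic mechanism: each block stores the hash of its predecessor, so any tampering invalidates the chain. The block layout is illustrative and deliberately omits signatures, consensus, and the quantum-safe authentication that is the subject of the paper.

```python
import hashlib, json, time

def make_block(transactions, prev_hash):
    block = {
        "time": time.time(),
        "transactions": transactions,
        "prev_hash": prev_hash,       # cryptographic link to the parent
    }
    payload = json.dumps(block, sort_keys=True).encode()
    block["hash"] = hashlib.sha256(payload).hexdigest()
    return block

genesis = make_block(["coinbase"], prev_hash="0" * 64)
nxt = make_block(["alice->bob: 1"], prev_hash=genesis["hash"])
print(nxt["prev_hash"] == genesis["hash"])   # True; tampering breaks the link
```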
Real-time micro-modelling of city evacuations
NASA Astrophysics Data System (ADS)
Löhner, Rainald; Haug, Eberhard; Zinggerling, Claudio; Oñate, Eugenio
2018-01-01
A methodology to integrate geographical information system (GIS) data with large-scale pedestrian simulations has been developed. Advances in automatic data acquisition and archiving from GIS databases, automatic input for pedestrian simulations, as well as scalable pedestrian simulation tools have made it possible to simulate pedestrians at the individual level for complete cities in real time. An example that simulates the evacuation of the city of Barcelona demonstrates that this is now possible. This is the first step towards a fully integrated crowd prediction and management tool that takes into account not only data gathered in real time from cameras, cell phones or other sensors, but also merges these with advanced simulation tools to predict the future state of the crowd.
Langley, Shaun A.; Messina, Joseph P.
2011-01-01
The past decade has seen an explosion in the availability of spatial data not only for researchers, but the public alike. As the quantity of data increases, the ability to effectively navigate and understand the data becomes more challenging. Here we detail a conceptual model for a spatially explicit database management system that addresses the issues raised with the growing data management problem. We demonstrate utility with a case study in disease ecology: to develop a multi-scale predictive model of African Trypanosomiasis in Kenya. International collaborations and varying technical expertise necessitate a modular open-source software solution. Finally, we address three recurring problems with data management: scalability, reliability, and security. PMID:21686072
Probabilistic classifiers with high-dimensional data
Kim, Kyung In; Simon, Richard
2011-01-01
For medical classification problems, it is often desirable to have a probability associated with each class. Probabilistic classifiers have received relatively little attention for small-n, large-p classification problems despite their importance in medical decision making. In this paper, we introduce 2 criteria for the assessment of probabilistic classifiers, well-calibratedness and refinement, and develop corresponding evaluation measures. We evaluated several published high-dimensional probabilistic classifiers and developed 2 extensions of the Bayesian compound covariate classifier. Based on simulation studies and analysis of gene expression microarray data, we found that proper probabilistic classification is more difficult than deterministic classification. It is important to ensure that a probabilistic classifier is well calibrated, or at least not “anticonservative”, using the methods developed here. We provide this evaluation for several probabilistic classifiers and also evaluate their refinement as a function of sample size under weak and strong signal conditions. We also present a cross-validation method for evaluating the calibration and refinement of any probabilistic classifier on any data set. PMID:21087946
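Well-calibratedness as used here can be checked with a simple reliability computation: bin the predicted class probabilities and compare each bin's mean prediction with the observed event frequency. The sketch below does this on synthetic predictions that are well calibrated by construction; the binning choices are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
p_pred = rng.uniform(0, 1, 5000)                   # classifier outputs
y = rng.uniform(0, 1, 5000) < p_pred               # calibrated by design

bins = np.linspace(0, 1, 11)                       # 10 probability bins
which = np.digitize(p_pred, bins) - 1
for b in range(10):
    m = which == b
    if m.any():                                    # calibrated: columns match
        print(f"pred {p_pred[m].mean():.2f}  observed {y[m].mean():.2f}")
```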
Astronomy In The Cloud: Using Mapreduce For Image Coaddition
NASA Astrophysics Data System (ADS)
Wiley, Keith; Connolly, A.; Gardner, J.; Krughoff, S.; Balazinska, M.; Howe, B.; Kwon, Y.; Bu, Y.
2011-01-01
In the coming decade, astronomical surveys of the sky will generate tens of terabytes of images and detect hundreds of millions of sources every night. The study of these sources will involve computational challenges such as anomaly detection, classification, and moving object tracking. Since such studies require the highest quality data, methods such as image coaddition, i.e., registration, stacking, and mosaicing, will be critical to scientific investigation. With a requirement that these images be analyzed on a nightly basis to identify moving sources, e.g., asteroids, or transient objects, e.g., supernovae, these data streams present many computational challenges. Given the quantity of data involved, the computational load of these problems can only be addressed by distributing the workload over a large number of nodes. However, the high data throughput demanded by these applications may present scalability challenges for certain storage architectures. One scalable data-processing method that has emerged in recent years is MapReduce, and in this paper we focus on its popular open-source implementation called Hadoop. In the Hadoop framework, the data is partitioned among storage attached directly to worker nodes, and the processing workload is scheduled in parallel on the nodes that contain the required input data. A further motivation for using Hadoop is that it allows us to exploit cloud computing resources, i.e., platforms where Hadoop is offered as a service. We report on our experience implementing a scalable image-processing pipeline for the SDSS imaging database using Hadoop. This multi-terabyte imaging dataset provides a good testbed for algorithm development since its scope and structure approximate future surveys. First, we describe MapReduce and how we adapted image coaddition to the MapReduce framework. Then we describe a number of optimizations to our basic approach and report experimental results comparing their performance. This work is funded by the NSF and by NASA.
Astronomy in the Cloud: Using MapReduce for Image Co-Addition
NASA Astrophysics Data System (ADS)
Wiley, K.; Connolly, A.; Gardner, J.; Krughoff, S.; Balazinska, M.; Howe, B.; Kwon, Y.; Bu, Y.
2011-03-01
In the coming decade, astronomical surveys of the sky will generate tens of terabytes of images and detect hundreds of millions of sources every night. The study of these sources will involve computation challenges such as anomaly detection and classification and moving-object tracking. Since such studies benefit from the highest-quality data, methods such as image co-addition, i.e., astrometric registration followed by per-pixel summation, will be a critical preprocessing step prior to scientific investigation. With a requirement that these images be analyzed on a nightly basis to identify moving sources such as potentially hazardous asteroids or transient objects such as supernovae, these data streams present many computational challenges. Given the quantity of data involved, the computational load of these problems can only be addressed by distributing the workload over a large number of nodes. However, the high data throughput demanded by these applications may present scalability challenges for certain storage architectures. One scalable data-processing method that has emerged in recent years is MapReduce, and in this article we focus on its popular open-source implementation called Hadoop. In the Hadoop framework, the data are partitioned among storage attached directly to worker nodes, and the processing workload is scheduled in parallel on the nodes that contain the required input data. A further motivation for using Hadoop is that it allows us to exploit cloud computing resources: i.e., platforms where Hadoop is offered as a service. We report on our experience of implementing a scalable image-processing pipeline for the SDSS imaging database using Hadoop. This multiterabyte imaging data set provides a good testbed for algorithm development, since its scope and structure approximate future surveys. First, we describe MapReduce and how we adapted image co-addition to the MapReduce framework. Then we describe a number of optimizations to our basic approach and report experimental results comparing their performance.
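Co-addition maps naturally onto MapReduce: mappers emit (sky pixel, flux) pairs from each registered input image, and reducers combine the values for each pixel. A minimal in-memory sketch of that dataflow, with plain Python functions standing in for Hadoop jobs and astrometric registration assumed done upstream:

    from collections import defaultdict
    import numpy as np

    def map_phase(images):
        # Each image is a dict of {sky_pixel: flux}; emit one pair per pixel.
        for img in images:
            for pixel, flux in img.items():
                yield pixel, flux

    def reduce_phase(pairs):
        # Group by key, then average per pixel to form the co-added value.
        acc = defaultdict(list)
        for pixel, flux in pairs:
            acc[pixel].append(flux)
        return {pixel: np.mean(v) for pixel, v in acc.items()}

    images = [{(0, 0): 1.0, (0, 1): 2.0}, {(0, 0): 3.0, (1, 1): 4.0}]
    print(reduce_phase(map_phase(images)))  # {(0, 0): 2.0, (0, 1): 2.0, (1, 1): 4.0}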
Dynamic-ETL: a hybrid approach for health data extraction, transformation and loading.
Ong, Toan C; Kahn, Michael G; Kwan, Bethany M; Yamashita, Traci; Brandt, Elias; Hosokawa, Patrick; Uhrich, Chris; Schilling, Lisa M
2017-09-13
Electronic health records (EHRs) contain detailed clinical data stored in proprietary formats with non-standard codes and structures. Participating in multi-site clinical research networks requires EHR data to be restructured and transformed into a common format and standard terminologies, and optimally linked to other data sources. The expertise and scalable solutions needed to transform data to conform to network requirements are beyond the scope of many health care organizations and there is a need for practical tools that lower the barriers of data contribution to clinical research networks. We designed and implemented a health data transformation and loading approach, which we refer to as Dynamic ETL (Extraction, Transformation and Loading) (D-ETL), that automates part of the process through use of scalable, reusable and customizable code, while retaining manual aspects of the process that require knowledge of complex coding syntax. This approach provides the flexibility required for the ETL of heterogeneous data, variations in semantic expertise, and transparency of transformation logic that are essential to implement ETL conventions across clinical research sharing networks. Processing workflows are directed by the ETL specifications guideline, developed by ETL designers with extensive knowledge of the structure and semantics of health data (i.e., "health data domain experts") and the target common data model. D-ETL was implemented to perform ETL operations that load data from various sources with different database schema structures into the Observational Medical Outcome Partnership (OMOP) common data model. The results showed that ETL rule composition methods and the D-ETL engine offer a scalable solution for health data transformation via automatic query generation to harmonize source datasets. D-ETL supports a flexible and transparent process to transform and load health data into a target data model. This approach offers a solution that lowers technical barriers that prevent data partners from participating in research data networks, and therefore promotes the advancement of comparative effectiveness research using secondary electronic health data.
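The paper does not publish its rule syntax, so the mapping rule below is hypothetical; the sketch only illustrates the core D-ETL idea of expanding a declarative source-to-target mapping into an automatically generated SQL statement:

    # Hypothetical rule: map source table/columns to an OMOP target table.
    rule = {
        "target": "person",
        "source": "ehr_patients",
        "columns": {"person_id": "patient_id",
                    "year_of_birth": "birth_year",
                    "gender_concept_id": "gender_code"},
    }

    def rule_to_sql(rule):
        # Expand one mapping rule into an INSERT ... SELECT statement.
        tgt_cols = ", ".join(rule["columns"])
        src_cols = ", ".join(rule["columns"].values())
        return (f"INSERT INTO {rule['target']} ({tgt_cols}) "
                f"SELECT {src_cols} FROM {rule['source']};")

    print(rule_to_sql(rule))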
EarthServer: Cross-Disciplinary Earth Science Through Data Cube Analytics
NASA Astrophysics Data System (ADS)
Baumann, P.; Rossi, A. P.
2016-12-01
The unprecedented increase of imagery, in-situ measurements, and simulation data produced by Earth (and Planetary) Science observation missions bears a rich, yet unleveraged, potential for gaining insights from integrating such diverse datasets and transforming scientific questions into actual queries to data, formulated in a standardized way. The intercontinental EarthServer [1] initiative is demonstrating new directions for flexible, scalable Earth Science services based on innovative NoSQL technology. Researchers from Europe, the US and Australia have teamed up to rigorously implement the concept of the datacube. Such a datacube may have spatial and temporal dimensions (such as a satellite image time series) and may unite an unlimited number of scenes. Independently of whatever efficient data structuring a server network may perform internally, users (scientists, planners, decision makers) will always see just a few datacubes they can slice and dice. EarthServer has established client [2] and server technology for such spatio-temporal datacubes. The underlying scalable array engine, rasdaman [3,4], enables direct interaction, including 3-D visualization, common EO data processing, and general analytics. Services exclusively rely on the open OGC "Big Geo Data" standards suite, the Web Coverage Service (WCS). Conversely, EarthServer has shaped and advanced WCS based on the experience gained. The first phase of EarthServer has advanced scalable array database technology into 150+ TB services. Currently, Petabyte datacubes are being built for ad-hoc and cross-disciplinary querying, e.g. using climate, Earth observation and ocean data. We will present the EarthServer approach, its impact on OGC / ISO / INSPIRE standardization, and its platform technology, rasdaman. References: [1] Baumann, et al. (2015) DOI: 10.1080/17538947.2014.1003106 [2] Hogan, P., (2011) NASA World Wind, Proceedings of the 2nd International Conference on Computing for Geospatial Research & Applications, ACM. [3] Baumann, Peter, et al. (2014) In Proc. 10th ICDM, 194-201. [4] Dumitru, A. et al. (2014) In Proc. ACM SIGMOD Workshop on Data Analytics in the Cloud (DanaC'2014), 1-4.
A Proposed Probabilistic Extension of the Halpern and Pearl Definition of ‘Actual Cause’
2017-01-01
Joseph Halpern and Judea Pearl ([2005]) draw upon structural equation models to develop an attractive analysis of ‘actual cause’. Their analysis is designed for the case of deterministic causation. I show that their account can be naturally extended to provide an elegant treatment of probabilistic causation. Contents: 1 Introduction; 2 Preemption; 3 Structural Equation Models; 4 The Halpern and Pearl Definition of ‘Actual Cause’; 5 Preemption Again; 6 The Probabilistic Case; 7 Probabilistic Causal Models; 8 A Proposed Probabilistic Extension of Halpern and Pearl’s Definition; 9 Twardy and Korb’s Account; 10 Probabilistic Fizzling; 11 Conclusion. PMID:29593362
Probabilistic Structural Analysis Methods (PSAM) for select space propulsion system components
NASA Technical Reports Server (NTRS)
1991-01-01
The fourth year of technical developments on the Numerical Evaluation of Stochastic Structures Under Stress (NESSUS) system for Probabilistic Structural Analysis Methods is summarized. The effort focused on the continued expansion of the Probabilistic Finite Element Method (PFEM) code, the implementation of the Probabilistic Boundary Element Method (PBEM), and the implementation of the Probabilistic Approximate Methods (PAppM) code. The principal focus for the PFEM code is the addition of a multilevel structural dynamics capability. The strategy includes probabilistic loads, treatment of material, geometry uncertainty, and full probabilistic variables. Enhancements are included for the Fast Probability Integration (FPI) algorithms and the addition of Monte Carlo simulation as an alternate. Work on the expert system and boundary element developments continues. The enhanced capability in the computer codes is validated by applications to a turbine blade and to an oxidizer duct.
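The Monte Carlo alternative mentioned above reduces to sampling the uncertain inputs and counting how often the limit state is violated. A minimal sketch on a toy stress-versus-strength limit state (the distributions are illustrative, not the turbine blade or oxidizer duct models):

    import numpy as np

    rng = np.random.default_rng(1)
    n = 1_000_000
    stress = rng.normal(300.0, 30.0, n)    # applied stress, MPa (illustrative)
    strength = rng.normal(400.0, 40.0, n)  # material strength, MPa (illustrative)

    failures = stress > strength           # limit state: stress exceeds strength
    p_f = failures.mean()
    se = np.sqrt(p_f * (1 - p_f) / n)      # Monte Carlo standard error
    print(f"P(failure) ~ {p_f:.4f} +/- {se:.4f}")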
Tadmouri, Abir; Blomkvist, Josefin; Landais, Cécile; Seymour, Jerome; Azmoun, Alexandre
2018-02-01
Although left ventricular assist devices (LVADs) are currently approved for coverage and reimbursement in France, no French cost-effectiveness (CE) data are available to support this decision. This study aimed at estimating the CE of LVAD compared with medical management in the French health system. Individual patient data from the 'French hospital discharge database' (Medicalization of Information Systems Program) were analysed using the Kaplan-Meier method. Outcomes were time to death, time to heart transplantation (HTx), and time to death after HTx. A micro-costing method was used to calculate the monthly costs extracted from the Program for the Medicalization of Information Systems. A multistate Markov monthly cycle model was developed to assess CE. The analysis over a lifetime horizon was performed from the perspective of the French healthcare payer; discount rates were 4%. Probabilistic and deterministic sensitivity analyses were performed. Outcomes were quality-adjusted life years (QALYs) and the incremental CE ratio (ICER). Mean QALY for an LVAD patient was 1.5 at a lifetime cost of €190 739, delivering a probabilistic ICER of €125 580/QALY [95% confidence interval: 105 587 to 150 314]. The sensitivity analysis showed that the ICER was mainly sensitive to two factors: (i) the high acquisition cost of the device and (ii) the device performance in terms of patient survival. Our economic evaluation showed that the use of LVAD in patients with end-stage heart failure yields greater benefit in terms of survival than medical management, at an extra lifetime cost exceeding €100 000/QALY. Technological advances and device cost reductions should hence lead to an improvement in overall CE. © 2017 The Authors. ESC Heart Failure published by John Wiley & Sons Ltd on behalf of the European Society of Cardiology.
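The study's engine is a multistate Markov model cycled monthly over a lifetime horizon, accumulating discounted costs and QALYs per state. A minimal cohort-simulation sketch with made-up transition probabilities, costs, and utilities (the published model's parameters are not reproduced here):

    import numpy as np

    # States: 0=LVAD support, 1=post-transplant, 2=dead (illustrative values only).
    P = np.array([[0.96, 0.01, 0.03],   # monthly transition probabilities
                  [0.00, 0.99, 0.01],
                  [0.00, 0.00, 1.00]])
    cost = np.array([4000.0, 1500.0, 0.0])   # cost per month in each state
    utility = np.array([0.60, 0.75, 0.0])    # annual QALY weight per state

    cohort = np.array([1.0, 0.0, 0.0])       # everyone starts on LVAD support
    disc = (1 + 0.04) ** (-1 / 12)           # 4% annual discount, applied monthly
    total_cost = total_qaly = 0.0
    for month in range(12 * 40):             # lifetime horizon
        d = disc ** month
        total_cost += d * cohort @ cost
        total_qaly += d * cohort @ utility / 12  # utilities accrue per year
        cohort = cohort @ P

    print(f"cost {total_cost:,.0f}  QALYs {total_qaly:.2f}")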
NASA Astrophysics Data System (ADS)
Villamor, P.; Litchfield, N. J.; Van Dissen, R. J.; Langridge, R.; Berryman, K. R.; Baize, S.
2016-12-01
Surface rupture associated with the 2010 Mw7.1 Darfield Earthquake (South Island, New Zealand) was extremely well documented, thanks to an immediate field mapping response and the acquisition of LiDAR data within days of the event. With respect to informing Probabilistic Fault Displacement Hazard Analysis (PFDHA), the main insights and outcomes from this rupture through Quaternary gravel are: 1) significant distributed deformation on either side of the main trace (a 30 to 300 m wide deformation zone) and how the deformation is distributed away from the main trace; 2) a thorough analysis of the uncertainty of the displacement measures obtained using the LiDAR data and repeated measurements by several scientists; and 3) the short surface rupture length for the reported magnitude, resulting from a complex fault rupture with 5-6 reverse and strike-slip strands, most of which had no surface rupture. While the 2010 event is extremely well documented and will be an excellent case to add to the Surface Rupture during Earthquakes (SURE) database, other NZ historical earthquakes are not so well documented but can still provide important information for PFDHA. New Zealand has experienced about 10 historical surface fault ruptures since 1848, comprising ruptures on strike-slip, reverse and normal faults. Mw associated with these ruptures ranges between 6.3 and 8.1. From these ruptures we observed that the surface expression of deformation can be influenced by: fault maturity; the type of Quaternary sedimentary cover; fault history (e.g., influence of inversion tectonics, flexural slip); fault complexity; and primary versus secondary rupture. Other recent >Mw 6.6 earthquakes post-2010 that did not rupture the ground surface have been documented with InSAR and can inform Mw thresholds for surface fault rupture. It will be important to capture all this information and that of similar events worldwide to inform the SURE database and ultimately PFDHA.
G-Hash: Towards Fast Kernel-based Similarity Search in Large Graph Databases.
Wang, Xiaohong; Smalter, Aaron; Huan, Jun; Lushington, Gerald H
2009-01-01
Structured data, including sets, sequences, trees and graphs, pose significant challenges to fundamental aspects of data management such as efficient storage, indexing, and similarity search. With the fast accumulation of graph databases, similarity search in graph databases has emerged as an important research topic. Graph similarity search has applications in a wide range of domains including cheminformatics, bioinformatics, sensor network management, social network management, and XML documents, among others. Most of the current graph indexing methods focus on subgraph query processing, i.e. determining the set of database graphs that contains the query graph, and hence do not directly support similarity search. In data mining and machine learning, various graph kernel functions have been designed to capture the intrinsic similarity of graphs. Though successful in constructing accurate predictive and classification models for supervised learning, graph kernel functions have (i) high computational complexity and (ii) non-trivial difficulty in being indexed in a graph database. Our objective is to bridge graph kernel functions and similarity search in graph databases by proposing (i) a novel kernel-based similarity measurement and (ii) an efficient indexing structure for graph data management. Our method of similarity measurement builds upon local features extracted from each node and their neighboring nodes in graphs. A hash table is utilized to support efficient storage and fast search of the extracted local features. Using the hash table, a graph kernel function is defined to capture the intrinsic similarity of graphs and support fast similarity query processing. We have implemented our method, which we have named G-hash, and have demonstrated its utility on large chemical graph databases. Our results show that the G-hash method achieves state-of-the-art performance for k-nearest neighbor (k-NN) classification. Most importantly, the new similarity measurement and the index structure are scalable to large databases, with smaller indexing size, faster index construction time, and faster query processing time as compared to state-of-the-art indexing methods such as C-tree, gIndex, and GraphGrep.
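The core idea, local node features stored in a hash table and matched across graphs, can be sketched compactly. The feature here, a node's degree plus its neighbors' degrees, is a simplified stand-in for the paper's full local descriptors:

    from collections import Counter

    def local_features(graph):
        # graph: {node: set(neighbors)}; feature = (own degree, sorted neighbor degrees).
        feats = []
        for node, nbrs in graph.items():
            feats.append((len(nbrs), tuple(sorted(len(graph[v]) for v in nbrs))))
        return Counter(feats)  # hash table: feature -> count

    def kernel(g1, g2):
        # Count matching local features across the two graphs (intersection kernel).
        f1, f2 = local_features(g1), local_features(g2)
        return sum(min(f1[k], f2[k]) for k in f1.keys() & f2.keys())

    g1 = {0: {1, 2}, 1: {0}, 2: {0}}
    g2 = {0: {1, 2}, 1: {0}, 2: {0, 3}, 3: {2}}
    print(kernel(g1, g2))  # 2 shared local features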
Towards Scalable Deep Learning via I/O Analysis and Optimization
DOE Office of Scientific and Technical Information (OSTI.GOV)
Pumma, Sarunya; Si, Min; Feng, Wu-Chun
Deep learning systems have been growing in prominence as a way to automatically characterize objects, trends, and anomalies. Given the importance of deep learning systems, researchers have been investigating techniques to optimize such systems. An area of particular interest has been using large supercomputing systems to quickly generate effective deep learning networks: a phase often referred to as “training” of the deep learning neural network. As we scale existing deep learning frameworks—such as Caffe—on these large supercomputing systems, we notice that the parallelism can help improve the computation tremendously, leaving data I/O as the major bottleneck limiting the overall system scalability. In this paper, we first present a detailed analysis of the performance bottlenecks of Caffe on large supercomputing systems. Our analysis shows that the I/O subsystem of Caffe—LMDB—relies on memory-mapped I/O to access its database, which can be highly inefficient on large-scale systems because of its interaction with the process scheduling system and the network-based parallel filesystem. Based on this analysis, we then present LMDBIO, our optimized I/O plugin for Caffe that takes into account the data access pattern of Caffe in order to vastly improve I/O performance. Our experimental results show that LMDBIO can improve the overall execution time of Caffe by nearly 20-fold in some cases.
Simms, Andrew M; Toofanny, Rudesh D; Kehl, Catherine; Benson, Noah C; Daggett, Valerie
2008-06-01
Dynameomics is a project to investigate and catalog the native-state dynamics and thermal unfolding pathways of representatives of all protein folds using solvated molecular dynamics simulations, as described in the preceding paper. Here we introduce the design of the molecular dynamics data warehouse, a scalable, reliable repository for simulation data that vastly simplifies management and access. In the succeeding paper, we describe the development of a complementary multidimensional database. A single protein unfolding or native-state simulation can take weeks to months to complete, and produces gigabytes of coordinate and analysis data. Mining information from over 3000 completed simulations is complicated and time-consuming. Even the simplest queries involve writing intricate programs that must be built from low-level file system access primitives and include significant logic to correctly locate and parse data of interest. As a result, programs to answer questions that require data from hundreds of simulations are very difficult to write. Thus, organization of and access to simulation data have been major obstacles to the discovery of new knowledge in the Dynameomics project. This repository is used internally and is the foundation of the Dynameomics portal site http://www.dynameomics.org. By organizing simulation data into a scalable, manageable and accessible form, we can begin to address substantial questions that move us closer to solving biomedical and bioengineering problems.
Performance Prediction of a MongoDB-Based Traceability System in Smart Factory Supply Chains
Kang, Yong-Shin; Park, Il-Ha; Youm, Sekyoung
2016-01-01
In the future, with the advent of the smart factory era, manufacturing and logistics processes will become more complex, and the complexity and criticality of traceability will further increase. This research aims at developing a performance assessment method to verify scalability when implementing traceability systems based on key technologies for smart factories, such as Internet of Things (IoT) and BigData. To this end, based on existing research, we analyzed traceability requirements and an event schema for storing traceability data in MongoDB, a document-based Not Only SQL (NoSQL) database. Next, we analyzed the algorithm of the most representative traceability query and defined a query-level performance model, which is composed of response times for the components of the traceability query algorithm. Next, this performance model was solidified as a linear regression model because a benchmark test showed that the response times increase linearly. Finally, for a case analysis, we applied the performance model to a virtual automobile parts logistics scenario. As a result of the case study, we verified the scalability of a MongoDB-based traceability system and predicted the point when data node servers should be expanded in this case. The traceability system performance assessment method proposed in this research can be used as a decision-making tool for hardware capacity planning during the initial stage of construction of traceability systems and during their operational phase. PMID:27983654
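The query-level performance model is, in the end, a linear regression of response time against load, from which the scale-out point can be extrapolated. A minimal sketch with synthetic benchmark measurements (the numbers and the SLA threshold are illustrative):

    import numpy as np

    # Benchmark: traceability-query response time (ms) vs. number of stored events.
    events = np.array([1e6, 2e6, 4e6, 8e6])
    resp_ms = np.array([120.0, 235.0, 470.0, 950.0])

    slope, intercept = np.polyfit(events, resp_ms, 1)  # linear performance model

    # Predict when response time crosses an SLA threshold, signalling scale-out.
    sla_ms = 2000.0
    capacity = (sla_ms - intercept) / slope
    print(f"add data nodes near {capacity:.2e} events")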
Field of genes: using Apache Kafka as a bioinformatic data repository.
Lawlor, Brendan; Lynch, Richard; Mac Aogáin, Micheál; Walsh, Paul
2018-04-01
Bioinformatic research is increasingly dependent on large-scale datasets, accessed either from private or public repositories. An example of a public repository is National Center for Biotechnology Information's (NCBI's) Reference Sequence (RefSeq). These repositories must decide in what form to make their data available. Unstructured data can be put to almost any use but are limited in how access to them can be scaled. Highly structured data offer improved performance for specific algorithms but limit the wider usefulness of the data. We present an alternative: lightly structured data stored in Apache Kafka in a way that is amenable to parallel access and streamed processing, including subsequent transformations into more highly structured representations. We contend that this approach could provide a flexible and powerful nexus of bioinformatic data, bridging the gap between low structure on one hand, and high performance and scale on the other. To demonstrate this, we present a proof-of-concept version of NCBI's RefSeq database using this technology. We measure the performance and scalability characteristics of this alternative with respect to flat files. The proof of concept scales almost linearly as more compute nodes are added, outperforming the standard approach using files. Apache Kafka merits consideration as a fast and more scalable but general-purpose way to store and retrieve bioinformatic data, for public, centralized reference datasets such as RefSeq and for private clinical and experimental data.
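The storage pattern is straightforward to sketch with the kafka-python client: one lightly structured message per sequence record, written once and consumed in parallel streams. The broker address, topic name, and tab-separated record layout below are assumptions for illustration, not the paper's configuration:

    from kafka import KafkaProducer, KafkaConsumer

    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    # One message per sequence record: accession, organism, sequence.
    producer.send("refseq", key=b"NC_000913.3",
                  value=b"NC_000913.3\tEscherichia coli K-12\tAGCTTTTCATTCTGACTG")
    producer.flush()

    consumer = KafkaConsumer("refseq", bootstrap_servers="localhost:9092",
                             auto_offset_reset="earliest",
                             consumer_timeout_ms=5000)
    for msg in consumer:
        accession, organism, seq = msg.value.decode().split("\t")
        print(accession, organism, len(seq))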
NASA Astrophysics Data System (ADS)
Bandyopadhyay, Saptarshi
Multi-agent systems are widely used for constructing a desired formation shape, exploring an area, surveillance, coverage, and other cooperative tasks. This dissertation introduces novel algorithms in the three main areas of shape formation, distributed estimation, and attitude control of large-scale multi-agent systems. In the first part of this dissertation, we address the problem of shape formation for thousands to millions of agents. Here, we present two novel algorithms for guiding a large-scale swarm of robotic systems into a desired formation shape in a distributed and scalable manner. These probabilistic swarm guidance algorithms adopt an Eulerian framework, where the physical space is partitioned into bins and the swarm's density distribution over each bin is controlled using tunable Markov chains. In the first algorithm - Probabilistic Swarm Guidance using Inhomogeneous Markov Chains (PSG-IMC) - each agent determines its bin transition probabilities using a time-inhomogeneous Markov chain that is constructed in real-time using feedback from the current swarm distribution. This PSG-IMC algorithm minimizes the expected cost of the transitions required to achieve and maintain the desired formation shape, even when agents are added to or removed from the swarm. The algorithm scales well with a large number of agents and complex formation shapes, and can also be adapted for area exploration applications. In the second algorithm - Probabilistic Swarm Guidance using Optimal Transport (PSG-OT) - each agent determines its bin transition probabilities by solving an optimal transport problem, which is recast as a linear program. In the presence of perfect feedback of the current swarm distribution, this algorithm minimizes the given cost function, guarantees faster convergence, reduces the number of transitions for achieving the desired formation, and is robust to disturbances or damages to the formation. We demonstrate the effectiveness of these two proposed swarm guidance algorithms using results from numerical simulations and closed-loop hardware experiments on multiple quadrotors. In the second part of this dissertation, we present two novel discrete-time algorithms for distributed estimation, which track a single target using a network of heterogeneous sensing agents. In the Distributed Bayesian Filtering (DBF) algorithm, the sensing agents combine their normalized likelihood functions using the logarithmic opinion pool and the discrete-time dynamic average consensus algorithm. Each agent's estimated likelihood function converges to an error ball centered on the joint likelihood function of the centralized multi-sensor Bayesian filtering algorithm. Using a new proof technique, the convergence, stability, and robustness properties of the DBF algorithm are rigorously characterized. The explicit bounds on the time step of the robust DBF algorithm are shown to depend on the time-scale of the target dynamics. Furthermore, the DBF algorithm for linear-Gaussian models can be cast into a modified form of the Kalman information filter. In the Bayesian Consensus Filtering (BCF) algorithm, the agents combine their estimated posterior pdfs multiple times within each time step using the logarithmic opinion pool scheme. Thus, each agent's consensual pdf minimizes the sum of Kullback-Leibler divergences with the local posterior pdfs. The performance and robustness properties of these algorithms are validated using numerical simulations.
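The Eulerian mechanism behind such probabilistic swarm guidance can be sketched with a small Markov chain whose stationary distribution is the target shape; here a Metropolis-Hastings construction is used as a simplified stand-in for PSG-IMC's feedback-built inhomogeneous chain:

    import numpy as np

    target = np.array([0.1, 0.2, 0.4, 0.3])    # desired swarm density over 4 bins

    # Metropolis-Hastings chain on a line of bins; stationary distribution = target.
    n = len(target)
    M = np.zeros((n, n))
    for i in range(n):
        for j in (i - 1, i + 1):               # transitions to neighboring bins only
            if 0 <= j < n:
                M[i, j] = 0.5 * min(1.0, target[j] / target[i])
        M[i, i] = 1.0 - M[i].sum()             # stay put with leftover probability

    density = np.array([1.0, 0.0, 0.0, 0.0])   # all agents start in bin 0
    for _ in range(200):
        density = density @ M
    print(np.round(density, 3))                # approaches [0.1 0.2 0.4 0.3]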
In the third part of this dissertation, we present an attitude control strategy and a new nonlinear tracking controller for a spacecraft carrying a large object, such as an asteroid or a boulder. If the captured object is larger or comparable in size to the spacecraft and has significant modeling uncertainties, conventional nonlinear control laws that use exact feed-forward cancellation are not suitable because they exhibit a large resultant disturbance torque. The proposed nonlinear tracking control law guarantees global exponential convergence of tracking errors with finite-gain Lp stability in the presence of modeling uncertainties and disturbances, and reduces the resultant disturbance torque. Further, this control law permits the use of any attitude representation and its integral control formulation eliminates any constant disturbance. Under small uncertainties, the best strategy for stabilizing the combined system is to track a fuel-optimal reference trajectory using this nonlinear control law, because it consumes the least amount of fuel. In the presence of large uncertainties, the most effective strategy is to track the derivative plus proportional-derivative based reference trajectory, because it reduces the resultant disturbance torque. The effectiveness of the proposed attitude control law is demonstrated by using results of numerical simulation based on an Asteroid Redirect Mission concept. The new algorithms proposed in this dissertation will facilitate the development of versatile autonomous multi-agent systems that are capable of performing a variety of complex tasks in a robust and scalable manner.
Lee, Christopher T; Bulterys, Marc; Martel, Lise D; Dahl, Benjamin A
2016-03-11
The epidemic of Ebola virus disease (Ebola) in West Africa began in Guinea in late 2013 (1), and on August 8, 2014, the World Health Organization (WHO) declared the epidemic a Public Health Emergency of International Concern (2). Guinea was declared Ebola-free on December 29, 2015, and is under a 90-day period of enhanced surveillance, following 3,351 confirmed and 453 probable cases of Ebola and 2,536 deaths (3). Passive surveillance for Ebola in Guinea has been conducted principally through the use of a telephone alert system. Community members and health facilities report deaths and suspected Ebola cases to local alert numbers operated by prefecture health departments or to a national toll-free call center. The national call center additionally functions as a source of public health information by responding to questions from the public about Ebola. To evaluate the sensitivity of the two systems and compare the sensitivity of the national call center with the local alerts system, the CDC country team performed probabilistic record linkage of the combined prefecture alerts database, as well as the national call center database, with the national viral hemorrhagic fever (VHF) database; the VHF database contains records of all known confirmed Ebola cases. Among 17,309 alert calls analyzed from the national call center, 71 were linked to the 1,838 confirmed Ebola cases in the VHF database, yielding a sensitivity of 3.9%. The sensitivity of the national call center was highest in the capital city of Conakry (11.4%) and lower in other prefectures. In comparison, the local alerts system had a sensitivity of 51.1%. Local public health infrastructure plays an important role in surveillance in an epidemic setting.
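Probabilistic record linkage of this kind typically follows the Fellegi-Sunter approach: score each candidate pair by summing log-likelihood-ratio weights over field comparisons and link pairs above a threshold. A minimal sketch with illustrative field weights (not the parameters used by the CDC team):

    import math

    # Fellegi-Sunter weights: log2(m/u) on agreement, log2((1-m)/(1-u)) on disagreement,
    # where m = P(agree | true match) and u = P(agree | non-match). Values illustrative.
    fields = {"name": (0.95, 0.01), "village": (0.90, 0.05), "onset_date": (0.85, 0.02)}

    def match_weight(rec_a, rec_b):
        total = 0.0
        for field, (m, u) in fields.items():
            if rec_a.get(field) == rec_b.get(field):
                total += math.log2(m / u)
            else:
                total += math.log2((1 - m) / (1 - u))
        return total  # above a chosen threshold => link the records

    a = {"name": "diallo m", "village": "gueckedou", "onset_date": "2014-11-02"}
    b = {"name": "diallo m", "village": "gueckedou", "onset_date": "2014-11-03"}
    print(match_weight(a, b))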
POLARIS: A 30-meter probabilistic soil series map of the contiguous United States
Chaney, Nathaniel W; Wood, Eric F; McBratney, Alexander B; Hempel, Jonathan W; Nauman, Travis; Brungard, Colby W.; Odgers, Nathan P
2016-01-01
A new complete map of soil series probabilities has been produced for the contiguous United States at a 30 m spatial resolution. This innovative database, named POLARIS, is constructed using available high-resolution geospatial environmental data and a state-of-the-art machine learning algorithm (DSMART-HPC) to remap the Soil Survey Geographic (SSURGO) database. This 9-billion-grid-cell database is made possible by available high performance computing resources. POLARIS provides a spatially continuous, internally consistent, quantitative prediction of soil series. It offers potential solutions to the primary weaknesses in SSURGO: 1) unmapped areas are gap-filled using survey data from the surrounding regions, 2) the artificial discontinuities at political boundaries are removed, and 3) the use of high resolution environmental covariate data leads to a spatial disaggregation of the coarse polygons. The geospatial environmental covariates that have the largest role in assembling POLARIS over the contiguous United States (CONUS) are fine-scale (30 m) elevation data and coarse-scale (~ 2 km) estimates of the geographic distribution of uranium, thorium, and potassium. A preliminary validation of POLARIS using the NRCS National Soil Information System (NASIS) database shows variable performance over CONUS. In general, the best performance is obtained at grid cells where DSMART-HPC is most able to reduce the chance of misclassification. The important role of environmental covariates in limiting prediction uncertainty suggests that including additional covariates is pivotal to improving POLARIS' accuracy. This database has the potential to improve the modeling of biogeochemical, water, and energy cycles in environmental models; enhance availability of data for precision agriculture; and assist hydrologic monitoring and forecasting to ensure food and water security.
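The disaggregation step is, at heart, supervised classification: predict soil-series probabilities at each 30 m cell from environmental covariates. A toy sketch with a random forest on synthetic covariates (DSMART itself resamples SSURGO polygons and fits its own classifiers at vastly larger scale):

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    rng = np.random.default_rng(0)
    # Covariates per grid cell: elevation, slope, gamma-radiometric K (synthetic).
    X = rng.normal(size=(5000, 3))
    y = (X[:, 0] + 0.5 * X[:, 2] > 0).astype(int)  # two synthetic soil series

    clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
    cell = np.array([[0.3, -1.2, 0.8]])
    print(clf.predict_proba(cell))  # per-series probabilities for one 30 m cell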
Multivariate exploration of non-intrusive load monitoring via spatiotemporal pattern network
Liu, Chao; Akintayo, Adedotun; Jiang, Zhanhong; ...
2017-12-18
Non-intrusive load monitoring (NILM) of electrical demand for the purpose of identifying load components has thus far mostly been studied using univariate data, e.g., using only whole building electricity consumption time series to identify a certain type of end-use such as lighting load. However, using additional variables in the form of multivariate time series data may provide more information in terms of extracting distinguishable features in the context of energy disaggregation. In this work, a novel probabilistic graphical modeling approach, namely the spatiotemporal pattern network (STPN), is proposed for energy disaggregation using multivariate time-series data. The STPN framework is shown to be capable of handling diverse types of multivariate time-series to improve the energy disaggregation performance. The technique outperforms the state-of-the-art factorial hidden Markov models (FHMM) and combinatorial optimization (CO) techniques in multiple real-life test cases. Furthermore, based on two homes' aggregate electric consumption data, a similarity metric is defined for the energy disaggregation of one home using a trained model based on the other home (i.e., out-of-sample case). The proposed similarity metric allows us to enhance scalability via learning supervised models for a few homes and deploying such models to many other similar but unmodeled homes with significantly high disaggregation accuracy.
Knowlton, Kim; Kulkarni, Suhas P.; Azhar, Gulrez Shah; Mavalankar, Dileep; Jaiswal, Anjali; Connolly, Meredith; Nori-Sarma, Amruta; Rajiva, Ajit; Dutta, Priya; Deol, Bhaskar; Sanchez, Lauren; Khosla, Radhika; Webster, Peter J.; Toma, Violeta E.; Sheffield, Perry; Hess, Jeremy J.
2014-01-01
Recurrent heat waves, already a concern in rapidly growing and urbanizing South Asia, will very likely worsen in a warming world. Coordinated adaptation efforts can reduce heat’s adverse health impacts, however. To address this concern in Ahmedabad (Gujarat, India), a coalition has been formed to develop an evidence-based heat preparedness plan and early warning system. This paper describes the group and initial steps in the plan’s development and implementation. Evidence accumulation included extensive literature review, analysis of local temperature and mortality data, surveys with heat-vulnerable populations, focus groups with health care professionals, and expert consultation. The findings and recommendations were encapsulated in policy briefs for key government agencies, health care professionals, outdoor workers, and slum communities, and synthesized in the heat preparedness plan. A 7-day probabilistic weather forecast was also developed and is used to trigger the plan in advance of dangerous heat waves. The pilot plan was implemented in 2013, and public outreach was done through training workshops, hoardings/billboards, pamphlets, and print advertisements. Evaluation activities and continuous improvement efforts are ongoing, along with plans to explore the program’s scalability to other Indian cities, as Ahmedabad is the first South Asian city to address heat-health threats comprehensively. PMID:24670386
Jin, Rui-Bo; Shimizu, Ryosuke; Morohashi, Isao; Wakui, Kentaro; Takeoka, Masahiro; Izumi, Shuro; Sakamoto, Takahide; Fujiwara, Mikio; Yamashita, Taro; Miki, Shigehito; Terai, Hirotaka; Wang, Zhen; Sasaki, Masahide
2014-01-01
Efficient generation and detection of indistinguishable twin photons are at the core of quantum information and communications technology (Q-ICT). These photons are conventionally generated by spontaneous parametric down conversion (SPDC), which is a probabilistic process and hence occurs at a limited rate, restricting wider applications of Q-ICT. To increase the rate, one had to excite SPDC with higher pump power, which inevitably produced more unwanted multi-photon components, harmfully degrading quantum interference visibility. Here we solve this problem by using a recently developed 10 GHz repetition-rate-tunable comb laser, combined with a group-velocity-matched nonlinear crystal and superconducting nanowire single photon detectors. They operate at telecom wavelengths more efficiently and with less noise than conventional schemes, which typically operate at visible and near-infrared wavelengths, with photons generated by a 76 MHz Ti:Sapphire laser and detected by Si detectors. We show high interference visibilities that are free from pump-power-induced degradation. Our laser, nonlinear crystal, and detectors constitute a powerful tool box that will pave the way to implementing quantum photonics circuits with a variety of good, low-cost telecom components, and will eventually realize scalable Q-ICT in optical infrastructures. PMID:25524646
Statistical Symbolic Execution with Informed Sampling
NASA Technical Reports Server (NTRS)
Filieri, Antonio; Pasareanu, Corina S.; Visser, Willem; Geldenhuys, Jaco
2014-01-01
Symbolic execution techniques have been proposed recently for the probabilistic analysis of programs. These techniques seek to quantify the likelihood of reaching program events of interest, e.g., assert violations. They have many promising applications but have scalability issues due to high computational demand. To address this challenge, we propose a statistical symbolic execution technique that performs Monte Carlo sampling of the symbolic program paths and uses the obtained information for Bayesian estimation and hypothesis testing with respect to the probability of reaching the target events. To speed up the convergence of the statistical analysis, we propose Informed Sampling, an iterative symbolic execution that first explores the paths that have high statistical significance, prunes them from the state space, and guides the execution towards less likely paths. The technique combines Bayesian estimation with a partial exact analysis for the pruned paths, leading to provably improved convergence of the statistical analysis. We have implemented statistical symbolic execution with informed sampling in the Symbolic PathFinder tool. We show experimentally that informed sampling obtains more precise results and converges faster than a purely statistical analysis, and may also be more efficient than an exact symbolic analysis. When the latter does not terminate, symbolic execution with informed sampling can give meaningful results under the same time and memory limits.
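The statistical core can be sketched as a Beta-Bernoulli update: sample paths, count hits on the target event, and stop when the posterior credible interval is tight. The random sampler below is a stand-in for symbolic path sampling:

    from scipy import stats
    import random

    random.seed(0)
    p_true = 0.03                        # unknown probability of reaching the event
    alpha, beta = 1.0, 1.0               # uniform Beta prior

    for n in range(1, 100001):
        hit = random.random() < p_true   # stand-in for sampling one symbolic path
        alpha, beta = alpha + hit, beta + (1 - hit)
        lo, hi = stats.beta.ppf([0.025, 0.975], alpha, beta)
        if hi - lo < 0.005:              # stop when the 95% interval is tight
            break

    print(f"n={n}  estimate={alpha / (alpha + beta):.4f}  CI=({lo:.4f}, {hi:.4f})")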
Traffic accident in Cuiabá-MT: an analysis through the data mining technology.
Galvão, Noemi Dreyer; de Fátima Marin, Heimar
2010-01-01
Road traffic accidents (ATT) are non-intentional events of major magnitude worldwide, mainly in urban centers. This article analyzes data on victims of ATT recorded by the Justice Secretariat and Public Security (SEJUSP) and in hospital morbidity and mortality databases in the city of Cuiabá-MT during 2006, using data mining technology. An observational, retrospective and exploratory study of the secondary databases was carried out. The three selected databases were linked using a probabilistic method implemented in the free software RecLink. One hundred and thirty-nine (139) real pairs of ATT victims were obtained. Data mining was then applied to this linked database with the WEKA software, using the Apriori algorithm. The process generated the 10 best rules, six of which met the established parameters and indicated useful, comprehensible knowledge for characterizing the accident victims in Cuiabá. Finally, the association rules revealed peculiarities of road traffic accident victims in Cuiabá and highlight the need for measures to prevent collision accidents among males.
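Association-rule mining of this kind can be reproduced in a few lines with the mlxtend implementation of Apriori (the study itself used WEKA; the one-hot records below are synthetic):

    import pandas as pd
    from mlxtend.frequent_patterns import apriori, association_rules

    # One-hot encoded accident records (synthetic, for illustration).
    df = pd.DataFrame({
        "male":         [1, 1, 0, 1, 1, 0],
        "collision":    [1, 1, 0, 1, 0, 0],
        "motorcycle":   [1, 0, 0, 1, 1, 0],
        "hospitalized": [1, 1, 0, 1, 1, 1],
    }).astype(bool)

    frequent = apriori(df, min_support=0.3, use_colnames=True)
    rules = association_rules(frequent, metric="confidence", min_threshold=0.7)
    print(rules[["antecedents", "consequents", "support", "confidence"]])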
Andromeda: a peptide search engine integrated into the MaxQuant environment.
Cox, Jürgen; Neuhauser, Nadin; Michalski, Annette; Scheltema, Richard A; Olsen, Jesper V; Mann, Matthias
2011-04-01
A key step in mass spectrometry (MS)-based proteomics is the identification of peptides in sequence databases by their fragmentation spectra. Here we describe Andromeda, a novel peptide search engine using a probabilistic scoring model. On proteome data, Andromeda performs as well as Mascot, a widely used commercial search engine, as judged by sensitivity and specificity analysis based on target decoy searches. Furthermore, it can handle data with arbitrarily high fragment mass accuracy, is able to assign and score complex patterns of post-translational modifications, such as highly phosphorylated peptides, and accommodates extremely large databases. The algorithms of Andromeda are provided. Andromeda can function independently or as an integrated search engine of the widely used MaxQuant computational proteomics platform and both are freely available at www.maxquant.org. The combination enables analysis of large data sets in a simple analysis workflow on a desktop computer. For searching individual spectra Andromeda is also accessible via a web server. We demonstrate the flexibility of the system by implementing the capability to identify cofragmented peptides, significantly improving the total number of identified peptides.
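Andromeda's score is based on the binomial probability of matching at least k of the n theoretical fragment peaks by chance, reported on a -10 log10 scale. A sketch of that scoring idea; the per-peak match probability used here is an illustrative constant, whereas Andromeda derives it from the allowed peak density and mass tolerance:

    from math import comb, log10

    def andromeda_like_score(n_theoretical, k_matched, p_match):
        # P(at least k matches by chance) under a binomial null, scored as -10*log10(P).
        p = sum(comb(n_theoretical, j) * p_match**j * (1 - p_match)**(n_theoretical - j)
                for j in range(k_matched, n_theoretical + 1))
        return -10.0 * log10(p)

    # 14 of 40 theoretical fragments matched; chance-match probability 0.04 per peak
    # (roughly: a few allowed peaks per 100 Da window; illustrative).
    print(f"{andromeda_like_score(40, 14, 0.04):.1f}")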
DOE Office of Scientific and Technical Information (OSTI.GOV)
Jochem, Warren C; Sims, Kelly M; Bright, Eddie A
In recent years, uses of high-resolution population distribution databases are increasing steadily for environmental, socioeconomic, public health, and disaster-related research and operations. With the development of daytime population distribution, temporal resolution of such databases has been improved. However, the lack of incorporation of transitional population, namely business and leisure travelers, leaves a significant population unaccounted for within the critical infrastructure networks, such as at transportation hubs. This paper presents two general methodologies for estimating passenger populations in airport and cruise port terminals at a high temporal resolution which can be incorporated into existing population distribution models. The methodologies are geographically scalable and are based on, and demonstrate how, two different transportation hubs with disparate temporal population dynamics can be modeled utilizing publicly available databases including novel data sources of flight activity from the Internet which are updated in near-real time. The airport population estimation model shows great potential for rapid implementation for a large collection of airports on a national scale, and the results suggest reasonable accuracy in the estimated passenger traffic. By incorporating population dynamics at high temporal resolutions into population distribution models, we hope to improve the estimates of populations exposed to or at risk to disasters, thereby improving emergency planning and response, and leading to more informed policy decisions.
Parallel algorithms for large-scale biological sequence alignment on Xeon-Phi based clusters.
Lan, Haidong; Chan, Yuandong; Xu, Kai; Schmidt, Bertil; Peng, Shaoliang; Liu, Weiguo
2016-07-19
Computing alignments between two or more sequences is a common operation frequently performed in computational molecular biology. The continuing growth of biological sequence databases establishes the need for their efficient parallel implementation on modern accelerators. This paper presents new approaches to high performance biological sequence database scanning with the Smith-Waterman algorithm and the first stage of progressive multiple sequence alignment based on the ClustalW heuristic on a Xeon Phi-based compute cluster. Our approach uses a three-level parallelization scheme to take full advantage of the compute power available on this type of architecture; i.e. cluster-level data parallelism, thread-level coarse-grained parallelism, and vector-level fine-grained parallelism. Furthermore, we re-organize the sequence datasets and use Xeon Phi shuffle operations to improve I/O efficiency. Evaluations show that our method achieves a peak overall performance of up to 220 GCUPS for scanning real protein sequence databanks on a single node consisting of two Intel E5-2620 CPUs and two Intel Xeon Phi 7110P cards. It also exhibits good scalability in terms of sequence length and size, and number of compute nodes for both database scanning and multiple sequence alignment. Furthermore, the achieved performance is highly competitive in comparison to optimized Xeon Phi and GPU implementations. Our implementation is available at https://github.com/turbo0628/LSDBS-mpi .
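For reference, the recurrence the cluster implementation vectorizes: a minimal scalar Smith-Waterman with a linear gap penalty (the paper's contribution is the three-level parallelization layered on top of this same dynamic program):

    def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
        # H[i][j] = best local alignment score ending at a[i-1], b[j-1].
        rows, cols = len(a) + 1, len(b) + 1
        H = [[0] * cols for _ in range(rows)]
        best = 0
        for i in range(1, rows):
            for j in range(1, cols):
                diag = H[i-1][j-1] + (match if a[i-1] == b[j-1] else mismatch)
                H[i][j] = max(0, diag, H[i-1][j] + gap, H[i][j-1] + gap)
                best = max(best, H[i][j])
        return best

    print(smith_waterman("HEAGAWGHEE", "PAWHEAE"))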
Scientific Data Services -- A High-Performance I/O System with Array Semantics
DOE Office of Scientific and Technical Information (OSTI.GOV)
Wu, Kesheng; Byna, Surendra; Rotem, Doron
2011-09-21
As high-performance computing approaches exascale, the existing I/O system design is having trouble keeping pace in both performance and scalability. We propose to address this challenge by adopting database principles and techniques in parallel I/O systems. First, we propose to adopt an array data model because many scientific applications represent their data in arrays. This strategy follows a cardinal principle from database research, which separates the logical view from the physical layout of data. This high-level data model gives the underlying implementation more freedom to optimize the physical layout and to choose the most effective way of accessing the data. For example, knowing that a set of write operations is working on a single multi-dimensional array makes it possible to keep the subarrays in a log structure during the write operations and reassemble them later into another physical layout as resources permit. While maintaining the high-level view, the storage system could compress the user data to reduce the physical storage requirement, collocate data records that are frequently used together, or replicate data to increase availability and fault-tolerance. Additionally, the system could generate secondary data structures such as database indexes and summary statistics. We expect the proposed Scientific Data Services approach to create a “live” storage system that dynamically adjusts to user demands and evolves with the massively parallel storage hardware.
Clinical results of HIS, RIS, PACS integration using data integration CASE tools
NASA Astrophysics Data System (ADS)
Taira, Ricky K.; Chan, Hing-Ming; Breant, Claudine M.; Huang, Lu J.; Valentino, Daniel J.
1995-05-01
Current infrastructure research in PACS is dominated by the development of communication networks (local area networks, teleradiology, ATM networks, etc.), multimedia display workstations, and hierarchical image storage architectures. However, limited work has been performed on developing flexible, expansible, and intelligent information processing architectures for the vast decentralized image and text data repositories prevalent in healthcare environments. Patient information is often distributed among multiple data management systems. Current large-scale efforts to integrate medical information and knowledge sources have been costly and have delivered limited retrieval functionality. Software integration strategies to unify distributed data and knowledge sources are still lacking commercially. Systems heterogeneity (i.e., differences in hardware platforms, communication protocols, database management software, nomenclature, etc.) is at the heart of the problem and is unlikely to be standardized in the near future. In this paper, we demonstrate the use of newly available CASE (computer-aided software engineering) tools to rapidly integrate HIS, RIS, and PACS information systems. The advantages of these tools include fast development time (low-level code is generated from graphical specifications) and easy system maintenance (excellent documentation, easy implementation of changes, and a centralized code repository in an object-oriented database). The CASE tools are used to develop and manage the 'middle-ware' in our client-mediator-server architecture for systems integration. Our architecture is scalable and can accommodate heterogeneous database and communication protocols.
ArrayBridge: Interweaving declarative array processing with high-performance computing
DOE Office of Scientific and Technical Information (OSTI.GOV)
Xing, Haoyuan; Floratos, Sofoklis; Blanas, Spyros
Scientists are increasingly turning to datacenter-scale computers to produce and analyze massive arrays. Despite decades of database research that extols the virtues of declarative query processing, scientists still write, debug and parallelize imperative HPC kernels even for the most mundane queries. This impedance mismatch has been partly attributed to the cumbersome data loading process; in response, the database community has proposed in situ mechanisms to access data in scientific file formats. Scientists, however, desire more than a passive access method that reads arrays from files. This paper describes ArrayBridge, a bi-directional array view mechanism for scientific file formats, that aims to make declarative array manipulations interoperable with imperative file-centric analyses. Our prototype implementation of ArrayBridge uses HDF5 as the underlying array storage library and seamlessly integrates into the SciDB open-source array database system. In addition to fast querying over external array objects, ArrayBridge produces arrays in the HDF5 file format just as easily as it can read from it. ArrayBridge also supports time travel queries from imperative kernels through the unmodified HDF5 API, and automatically deduplicates between array versions for space efficiency. Our extensive performance evaluation in NERSC, a large-scale scientific computing facility, shows that ArrayBridge exhibits statistically indistinguishable performance and I/O scalability to the native SciDB storage engine.
NASA Astrophysics Data System (ADS)
Myrbo, A.; Loeffler, S.; Ai, S.; McEwan, R.
2015-12-01
The ultimate EarthCube product has been described as a mobile app that provides all of the known geoscience data for a geographic point or polygon, from the top of the atmosphere to the core of the Earth, throughout geologic time. The database queries are hidden from the user, and the data are visually rendered for easy recognition of patterns and associations. This fanciful vision is not so remote: NSF EarthCube and Geoinformatics support has already fostered major advances in database interoperability and harmonization of APIs; numerous "domain repositories," databases curated by subject matter experts, now provide a vast wealth of open, easily-accessible georeferenced data on rock and sediment chemistry and mineralogy, paleobiology, stratigraphy, rock magnetics, and more. New datasets accrue daily, including many harvested from the literature by automated means. None of these constitute big data - all are part of the long tail of geoscience, heterogeneous data consisting of relatively small numbers of measurements made by a large number of people, typically on physical samples. This vision of mobile data discovery requires a software package to cleverly expose these domain repositories' holdings; currently, queries mainly come from single investigators to single databases. The NSF-funded mobile app Flyover Country (FC; fc.umn.edu), developed for geoscience outreach and education, has been welcomed by data curators and cyberinfrastructure developers as a testing ground for their API services, data provision, and scalability. FC pulls maps and data within a bounding envelope and caches them for offline use; location-based services alert users to nearby points of interest (POI). The incorporation of data from multiple databases across domains requires parsimonious data requests and novel visualization techniques, especially for mapping of data with a time or stratigraphic depth component. The preservation of data provenance and authority is critical for researcher buy-in to all community databases, and further allows exploration and suggestions of collaborators, based upon geography and topical relevance.
OntoMate: a text-mining tool aiding curation at the Rat Genome Database
Liu, Weisong; Laulederkind, Stanley J. F.; Hayman, G. Thomas; Wang, Shur-Jen; Nigam, Rajni; Smith, Jennifer R.; De Pons, Jeff; Dwinell, Melinda R.; Shimoyama, Mary
2015-01-01
The Rat Genome Database (RGD) is the premier repository of rat genomic, genetic and physiologic data. Converting data from free text in the scientific literature to a structured format is one of the main tasks of all model organism databases. RGD spends considerable effort manually curating gene, Quantitative Trait Locus (QTL) and strain information. The rapidly growing volume of biomedical literature and the active research in the biological natural language processing (bioNLP) community have given RGD the impetus to adopt text-mining tools to improve curation efficiency. Recently, RGD has initiated a project to use OntoMate, an ontology-driven, concept-based literature search engine developed at RGD, as a replacement for the PubMed (http://www.ncbi.nlm.nih.gov/pubmed) search engine in the gene curation workflow. OntoMate tags abstracts with gene names, gene mutations, organism names and terms from most of the 16 ontologies/vocabularies used at RGD. All terms/entities tagged to an abstract are listed with the abstract in the search results. All listed terms are linked both to data entry boxes and to a term browser in the curation tool. OntoMate also provides user-activated filters for species, date and other parameters relevant to the literature search. Using the system for literature search and import has streamlined the process compared to using PubMed. The system was built with a scalable and open architecture, including features specifically designed to accelerate the RGD gene curation process. With the use of bioNLP tools, RGD has added more automation to its curation workflow. Database URL: http://rgd.mcw.edu PMID:25619558
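The concept-based tagging step can be illustrated with a toy dictionary tagger (Python); the lexicons below are invented stand-ins for the gene vocabularies and 16 ontologies that OntoMate actually uses:

    import re

    ONTOLOGY_TERMS = {            # toy lexicon, not the RGD vocabularies
        "hypertension": "Disease Ontology",
        "blood pressure": "Phenotype Ontology",
    }
    GENE_SYMBOLS = {"Ace", "Ren"}  # hypothetical gene lexicon

    def tag_abstract(text):
        """Return (term, source) pairs found in an abstract."""
        tags = []
        for term, source in ONTOLOGY_TERMS.items():
            if re.search(r"\b" + re.escape(term) + r"\b", text, re.IGNORECASE):
                tags.append((term, source))
        for gene in GENE_SYMBOLS:
            if re.search(r"\b" + re.escape(gene) + r"\b", text):
                tags.append((gene, "gene symbol"))
        return tags

    print(tag_abstract("Ace variants alter blood pressure in SHR rats."))

In the real system each tag would also be linked to a curation data-entry box and to the term browser, as described above.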
Distributed cyberinfrastructure tools for automated data processing of structural monitoring data
NASA Astrophysics Data System (ADS)
Zhang, Yilan; Kurata, Masahiro; Lynch, Jerome P.; van der Linden, Gwendolyn; Sederat, Hassan; Prakash, Atul
2012-04-01
The emergence of cost-effective sensing technologies has enabled the use of dense arrays of sensors to monitor the behavior and condition of large-scale bridges. The continuous operation of dense networks of sensors presents a number of new challenges, including how to manage the massive amounts of data created by the system. This paper reports on progress in the creation of cyberinfrastructure tools that hierarchically control networks of wireless sensors deployed on a long-span bridge. The internet-enabled cyberinfrastructure is centrally managed by a powerful database which controls the flow of data in the entire monitoring system architecture. A client-server model built upon the database provides both data providers and system end-users with secured access to various levels of information about a bridge. In the system, information on bridge behavior (e.g., acceleration, strain, displacement) and environmental conditions (e.g., wind speed, wind direction, temperature, humidity) is uploaded to the database from sensor networks installed on the bridge. Data interrogation services then interface with the database via client APIs to autonomously process data. The current research effort focuses on an assessment of the scalability and long-term robustness of the proposed cyberinfrastructure framework, which has been implemented along with a permanent wireless monitoring system on the New Carquinez (Alfred Zampa Memorial) Suspension Bridge in Vallejo, CA. Many data interrogation tools are under development that use sensor data and bridge metadata (e.g., geometric details, material properties); sample data interrogation clients include those for the detection of faulty sensors and for automated modal parameter extraction.
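One of the simpler data interrogation clients mentioned, faulty-sensor detection, might look like the following sketch (Python/NumPy); the thresholds and channel layout are assumptions for illustration, not values from the New Carquinez deployment:

    import numpy as np

    def flag_faulty_channels(records, flatline_std=1e-6, saturation=9.81 * 4):
        """records: acceleration array of shape (n_channels, n_samples)."""
        faulty = []
        for ch, x in enumerate(records):
            if np.std(x) < flatline_std:          # dead or disconnected sensor
                faulty.append((ch, "flatline"))
            elif np.max(np.abs(x)) > saturation:  # clipped / saturated output
                faulty.append((ch, "saturation"))
        return faulty

    records = np.random.randn(4, 1000) * 0.05     # synthetic block of records
    records[2, :] = 0.0                           # simulate a dead channel
    print(flag_faulty_channels(records))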
Database Resources of the BIG Data Center in 2018.
2018-01-04
The BIG Data Center at Beijing Institute of Genomics (BIG) of the Chinese Academy of Sciences provides freely open access to a suite of database resources in support of worldwide research activities in both academia and industry. With the vast amounts of omics data generated at ever-greater scales and rates, the BIG Data Center is continually expanding, updating and enriching its core database resources through big-data integration and value-added curation, including BioCode (a repository archiving bioinformatics tool codes), BioProject (a biological project library), BioSample (a biological sample library), Genome Sequence Archive (GSA, a data repository for archiving raw sequence reads), Genome Warehouse (GWH, a centralized resource housing genome-scale data), Genome Variation Map (GVM, a public repository of genome variations), Gene Expression Nebulas (GEN, a database of gene expression profiles based on RNA-Seq data), Methylation Bank (MethBank, an integrated databank of DNA methylomes), and Science Wikis (a series of biological knowledge wikis for community annotations). In addition, three featured web services are provided, viz., BIG Search (search as a service; a scalable inter-domain text search engine), BIG SSO (single sign-on as a service; a user access control system to gain access to multiple independent systems with a single ID and password) and Gsub (submission as a service; a unified submission service for all relevant resources). All of these resources are publicly accessible through the home page of the BIG Data Center at http://bigd.big.ac.cn. © The Author(s) 2017. Published by Oxford University Press on behalf of Nucleic Acids Research.
Is probabilistic bias analysis approximately Bayesian?
MacLehose, Richard F.; Gustafson, Paul
2011-01-01
Case-control studies are particularly susceptible to differential exposure misclassification when exposure status is determined following incident case status. Probabilistic bias analysis methods have been developed as ways to adjust standard effect estimates based on the sensitivity and specificity of exposure misclassification. The iterative sampling method advocated in probabilistic bias analysis bears a distinct resemblance to a Bayesian adjustment; however, it is not identical. Furthermore, without a formal theoretical framework (Bayesian or frequentist), the results of a probabilistic bias analysis remain somewhat difficult to interpret. We describe, both theoretically and empirically, the extent to which probabilistic bias analysis can be viewed as approximately Bayesian. While the differences between probabilistic bias analysis and Bayesian approaches to misclassification can be substantial, these situations often involve unrealistic prior specifications and are relatively easy to detect. Outside of these special cases, probabilistic bias analysis and Bayesian approaches to exposure misclassification in case-control studies appear to perform equally well. PMID:22157311
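The iterative sampling scheme at the heart of probabilistic bias analysis can be sketched as follows (Python/NumPy); the 2x2 counts and the beta priors on sensitivity and specificity are invented for illustration:

    import numpy as np

    rng = np.random.default_rng(1)
    # Hypothetical observed 2x2 table.
    a, b = 45, 94      # cases: exposed, unexposed
    c, d = 257, 945    # controls: exposed, unexposed
    ors = []
    for _ in range(10_000):
        # Draw classification parameters separately for cases and controls,
        # allowing the differential misclassification discussed above.
        se_ca, sp_ca = rng.beta(80, 20), rng.beta(95, 5)
        se_co, sp_co = rng.beta(75, 25), rng.beta(95, 5)
        # Back-correct the exposed counts:
        # observed = Se*true + (1-Sp)*(N - true)  =>  true = (obs - (1-Sp)*N) / (Se+Sp-1)
        a1 = (a - (1 - sp_ca) * (a + b)) / (se_ca + sp_ca - 1)
        c1 = (c - (1 - sp_co) * (c + d)) / (se_co + sp_co - 1)
        b1, d1 = (a + b) - a1, (c + d) - c1
        if min(a1, b1, c1, d1) > 0:               # discard impossible corrections
            ors.append((a1 * d1) / (b1 * c1))
    print(np.percentile(ors, [2.5, 50, 97.5]))    # bias-adjusted OR interval

A fully Bayesian treatment would instead place the priors and likelihood in one joint model; the resemblance and the differences between the two are exactly what the paper quantifies.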
Probabilistic Structural Analysis of SSME Turbopump Blades: Probabilistic Geometry Effects
NASA Technical Reports Server (NTRS)
Nagpal, V. K.
1985-01-01
A probabilistic study was initiated to evaluate the effects of geometric and material property tolerances on the structural response of turbopump blades. To complete this study, a number of important probabilistic variables were identified that are believed to affect the structural response of the blade. In addition, a methodology was developed to statistically quantify the influence of these probabilistic variables in an optimized way. The identified variables include random geometric and material property perturbations, different loadings, and a probabilistic combination of these loadings. The influence of these probabilistic variables is to be quantified by evaluating the blade structural response. Studies of the geometric perturbations were conducted for a flat plate geometry as well as for a space shuttle main engine blade geometry using a special-purpose finite element code. Analyses indicate that the variances of the perturbations about given mean values have a significant influence on the response.
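A toy version of the flat-plate perturbation study reads as follows (Python/NumPy), propagating a random thickness tolerance through the closed-form fundamental frequency of a simply supported plate; the material and dimensions are assumed, not SSME blade values:

    import numpy as np

    rng = np.random.default_rng(0)
    E, nu, rho = 200e9, 0.3, 7850.0          # steel-like material (assumed)
    a_len, b_len, h0 = 0.10, 0.05, 0.002     # plate dimensions [m] (assumed)

    # Random thickness perturbation about the nominal value (5% tolerance).
    h = rng.normal(h0, 0.05 * h0, 100_000)
    D = E * h**3 / (12 * (1 - nu**2))        # flexural rigidity
    # Fundamental frequency of a simply supported rectangular plate:
    f = (np.pi / 2) * (1 / a_len**2 + 1 / b_len**2) * np.sqrt(D / (rho * h))
    print(f"mean {f.mean():.0f} Hz, coefficient of variation {f.std() / f.mean():.1%}")

Because frequency scales with thickness, even a modest tolerance on h produces a comparable spread in the response, consistent with the finding that perturbation variances significantly influence the response.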
Development of probabilistic multimedia multipathway computer codes.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Yu, C.; LePoire, D.; Gnanapragasam, E.
2002-01-01
The deterministic multimedia dose/risk assessment codes RESRAD and RESRAD-BUILD have been widely used for many years for evaluation of sites contaminated with residual radioactive materials. The RESRAD code applies to the cleanup of sites (soils) and the RESRAD-BUILD code applies to the cleanup of buildings and structures. This work describes the procedure used to enhance the deterministic RESRAD and RESRAD-BUILD codes for probabilistic dose analysis. A six-step procedure was used in developing default parameter distributions and the probabilistic analysis modules. These six steps include (1) listing and categorizing parameters; (2) ranking parameters; (3) developing parameter distributions; (4) testing parameter distributions for probabilistic analysis; (5) developing probabilistic software modules; and (6) testing probabilistic modules and integrated codes. The procedures used can be applied to the development of other multimedia probabilistic codes. The probabilistic versions of the RESRAD and RESRAD-BUILD codes provide tools for studying the uncertainty in dose assessment caused by uncertain input parameters. The parameter distribution data collected in this work can also be applied to other multimedia assessment tasks and multimedia computer codes.
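Steps (3) and (4), developing and testing parameter distributions, can be illustrated with a Latin hypercube sample pushed through a stand-in dose model (Python with SciPy); the distributions and the model below are assumptions, not RESRAD defaults:

    import numpy as np
    from scipy.stats import qmc, lognorm, uniform, triang

    sampler = qmc.LatinHypercube(d=3, seed=42)
    u = sampler.random(n=1000)                            # uniform [0,1)^3 design

    # Map the design onto assumed parameter distributions via inverse CDFs.
    soil_conc = lognorm(s=0.5, scale=1.0).ppf(u[:, 0])    # Bq/g (assumed)
    occupancy = uniform(loc=0.2, scale=0.6).ppf(u[:, 1])  # fraction of year
    transfer = triang(c=0.5, loc=0.1, scale=0.4).ppf(u[:, 2])

    dose = 0.7 * soil_conc * occupancy * transfer         # stand-in dose model
    print(np.percentile(dose, [5, 50, 95]))               # dose uncertainty band

Latin hypercube sampling stratifies each marginal, so far fewer runs are needed than with plain Monte Carlo to cover the parameter space, which matters when each model evaluation is expensive.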
Probabilistic structural analysis methods for select space propulsion system components
NASA Technical Reports Server (NTRS)
Millwater, H. R.; Cruse, T. A.
1989-01-01
The Probabilistic Structural Analysis Methods (PSAM) project developed at the Southwest Research Institute integrates state-of-the-art structural analysis techniques with probability theory for the design and analysis of complex large-scale engineering structures. An advanced, efficient software system (NESSUS) capable of performing complex probabilistic analysis has been developed. NESSUS contains a number of software components to perform probabilistic analysis of structures. These components include an expert system, a probabilistic finite element code, a probabilistic boundary element code and a fast probability integrator. An expert system is included to capture and utilize PSAM knowledge and experience. NESSUS/EXPERT is an interactive menu-driven expert system that provides information to assist in the use of the probabilistic finite element code NESSUS/FEM and the fast probability integrator (FPI). The NESSUS system contains a state-of-the-art nonlinear probabilistic finite element code, NESSUS/FEM, to determine the structural response and sensitivities. A broad range of analysis capabilities and an extensive element library are provided.
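For a linear limit state with independent normal variables, the kind of result a fast probability integrator delivers can be reproduced in closed form. The following sketch (Python, standard library only; the numbers are illustrative) computes the reliability index and failure probability for g = R - S, strength minus load:

    import math
    from statistics import NormalDist

    mu_R, sig_R = 500.0, 40.0    # strength mean and std. dev. (assumed)
    mu_S, sig_S = 350.0, 60.0    # load mean and std. dev. (assumed)

    # Reliability index for the linear limit state g = R - S:
    beta = (mu_R - mu_S) / math.hypot(sig_R, sig_S)
    p_fail = NormalDist().cdf(-beta)     # P(g < 0) without any sampling
    print(f"beta = {beta:.2f}, P(failure) ~= {p_fail:.2e}")

Nonlinear limit states, which NESSUS/FEM handles, require iterative search for the most probable failure point, but the underlying idea, trading brute-force sampling for a transformed-space integration, is the same.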
Frontal and Parietal Contributions to Probabilistic Association Learning
Rushby, Jacqueline A.; Vercammen, Ans; Loo, Colleen; Short, Brooke
2011-01-01
Neuroimaging studies have shown both dorsolateral prefrontal (DLPFC) and inferior parietal cortex (iPARC) activation during probabilistic association learning. Whether these cortical brain regions are necessary for probabilistic association learning is presently unknown. Participants' ability to acquire probabilistic associations was assessed during disruptive 1 Hz repetitive transcranial magnetic stimulation (rTMS) of the left DLPFC, left iPARC, and sham using a crossover single-blind design. On subsequent sessions, performance improved relative to baseline except during DLPFC rTMS that disrupted the early acquisition beneficial effect of prior exposure. A second experiment examining rTMS effects on task-naive participants showed that neither DLPFC rTMS nor sham influenced naive acquisition of probabilistic associations. A third experiment examining consecutive administration of the probabilistic association learning test revealed early trial interference from previous exposure to different probability schedules. These experiments, showing disrupted acquisition of probabilistic associations by rTMS only during subsequent sessions with an intervening night's sleep, suggest that the DLPFC may facilitate early access to learned strategies or prior task-related memories via consolidation. Although neuroimaging studies implicate DLPFC and iPARC in probabilistic association learning, the present findings suggest that early acquisition of the probabilistic cue-outcome associations in task-naive participants is not dependent on either region. PMID:21216842
Chapman, Tara; Lefevre, Philippe; Semal, Patrick; Moiseev, Fedor; Sholukha, Victor; Louryan, Stéphane; Rooze, Marcel; Van Sint Jan, Serge
2014-01-01
The hip bone is one of the most reliable indicators of sex in the human body because it is the most dimorphic bone. Probabilistic Sex Diagnosis (DSP: Diagnose Sexuelle Probabiliste), developed by Murail et al. in 2005, is a sex determination method based on a worldwide hip bone metrical database. Sex is determined by comparing specific measurements taken from each specimen with sliding callipers and computing the probability of the specimen being female or male. In forensic science it is sometimes not possible to sex a body due to corpse decay or injury. Skeletonization and dissection of a body are laborious processes and desecrate the body. This study had two aims. The first was to examine the accuracy of the DSP method in comparison with a current visual sexing method. The second was to see whether the DSP method could be applied virtually to both the hip bone and the pelvic girdle, so that it could be used in the forensic sciences. For the first part of the study, forty-nine dry hip bones of unknown sex were obtained from the Body Donation Programme of the Université Libre de Bruxelles (ULB). A comparison was made between DSP analysis and visual sexing on dry bone by two researchers. CT scans of the bones were then analysed to obtain three-dimensional (3D) virtual models, and the DSP method was applied virtually by importing the models into a customised software programme, lhpFusionBox, developed at ULB. The software enables DSP distances to be measured via virtually palpated bony landmarks. There was 100% agreement of sex between the manual and virtual DSP methods. The second part of the study further validated the method through blind analysis of thirty-nine additional pelvic girdles of known sex. A 100% accuracy rate was found, further demonstrating that the virtual DSP method is robust. Statistically significant differences were found between researchers in the identification of sex with the visual sexing method, whereas both researchers identified the same sex in all cases with the manual and virtual DSP methods, for both the hip bones and the pelvic girdles. Copyright © 2013 Elsevier Ireland Ltd. All rights reserved.
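The probabilistic core of a DSP-style diagnosis, computing a posterior probability of sex from a measurement vector against reference distributions, can be sketched as follows (Python with SciPy); the means, covariance, prior and measurements below are synthetic, not the DSP worldwide database:

    import numpy as np
    from scipy.stats import multivariate_normal

    # Two hypothetical hip-bone distances [mm]; reference stats are invented.
    mu_f, mu_m = np.array([52.0, 78.0]), np.array([58.0, 72.0])
    cov = np.array([[6.0, 1.0], [1.0, 8.0]])      # shared covariance (assumed)

    def posterior_female(x, prior_f=0.5):
        lf = multivariate_normal(mu_f, cov).pdf(x)
        lm = multivariate_normal(mu_m, cov).pdf(x)
        return prior_f * lf / (prior_f * lf + (1 - prior_f) * lm)

    p = posterior_female(np.array([53.5, 77.0]))
    print(f"P(female | measurements) = {p:.3f}")
    # A DSP-style rule assigns sex only when the posterior clears a high
    # threshold, leaving the specimen undetermined otherwise.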
Development of probabilistic internal dosimetry computer code
NASA Astrophysics Data System (ADS)
Noh, Siwan; Kwon, Tae-Eun; Lee, Jai-Ki
2017-02-01
Internal radiation dose assessment involves biokinetic models, the corresponding parameters, measured data, and many assumptions. Every component considered in the internal dose assessment has its own uncertainty, which is propagated into the intake activity and internal dose estimates. For research or scientific purposes, and for retrospective dose reconstruction for accident scenarios in workplaces holding large quantities of unsealed radionuclides, such as nuclear power plants, nuclear fuel cycle facilities, and facilities in which nuclear medicine is practiced, a quantitative uncertainty assessment of the internal dose is often required. However, no calculation tools or computer codes that incorporate all the relevant processes and their corresponding uncertainties, i.e., from the measured data to the committed dose, are available. Thus, the objective of the present study is to develop an integrated probabilistic internal-dose-assessment computer code. First, the uncertainty components in internal dosimetry are identified, and quantitative uncertainty data are collected. Then, an uncertainty database is established for each component. To propagate these uncertainties through an internal dose assessment, a probabilistic internal-dose-assessment system employing Bayesian and Monte Carlo methods was designed. Based on this system, we developed a probabilistic internal-dose-assessment code in MATLAB to estimate the dose distributions, with uncertainty, from the measured data. Using the developed code, we calculated the internal dose distribution and statistical values (e.g., the 2.5th, 5th, median, 95th, and 97.5th percentiles) for three sample scenarios. On the basis of the distributions, we performed a sensitivity analysis to determine the influence of each component on the resulting dose, in order to identify the major sources of uncertainty in a bioassay. The results of this study can be applied to various situations. In cases of severe internal exposure, the causation probability of a deterministic health effect can be derived from the dose distribution, and a high statistical value (e.g., the 95th percentile of the distribution) can be used to determine the appropriate intervention. The distribution-based sensitivity analysis can also be used to quantify the contribution of each factor to the dose uncertainty, which is essential information for reducing and optimizing the uncertainty in the internal dose assessment. The present study can therefore contribute to retrospective dose assessment for accidental internal exposure scenarios, as well as to the optimization of internal dose monitoring and uncertainty reduction.
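A minimal sketch of the measurement-to-dose Monte Carlo propagation described above (Python/NumPy; all distributions are invented for illustration, and a real assessment would use biokinetic models and Bayesian updating as the paper describes):

    import numpy as np

    rng = np.random.default_rng(7)
    n = 100_000
    measured = rng.normal(120.0, 12.0, n)          # bioassay result [Bq], counting error
    m_t = rng.lognormal(np.log(2.0e-3), 0.4, n)    # excreted fraction at day t (model uncertainty)
    dose_coeff = rng.lognormal(np.log(2.0e-8), 0.3, n)  # Sv per Bq intake (assumed)

    intake = measured / m_t                        # back-calculated intake [Bq]
    dose_sv = intake * dose_coeff                  # committed effective dose [Sv]
    print(np.percentile(dose_sv, [2.5, 5, 50, 95, 97.5]))

The printed percentiles correspond to the statistical values the paper reports for its sample scenarios; a sensitivity analysis would vary one input distribution at a time and compare the resulting spreads.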
NASA Astrophysics Data System (ADS)
Marques, R.; Amaral, P.; Zêzere, J. L.; Queiroz, G.; Goulart, C.
2009-04-01
Slope instability research and susceptibility mapping are fundamental components of hazard assessment and are of extreme importance for risk mitigation, land-use management and emergency planning. Landslide susceptibility zonation has been actively pursued during the last two decades and several methodologies are still being improved. Among the methods presented in the literature, indirect quantitative probabilistic methods have been used extensively. In this work, different linear probabilistic methods, both bi-variate and multi-variate (Informative Value, Fuzzy Logic, Weights of Evidence and Logistic Regression), were used to compute the spatial probability of landslide occurrence, using the pixel as the mapping unit. The methods are based on linear relationships between landslides and 9 conditioning factors (altimetry, slope angle, aspect, curvature, distance to streams, wetness index, contributing area, lithology and land use). It was assumed that future landslides will be conditioned by the same factors as past landslides in the study area. The work was developed for the Ribeira Quente Valley (S. Miguel Island, Azores), a study area of 9.5 km2 mainly composed of volcanic deposits (ash and pumice lapilli) produced by explosive eruptions of Furnas Volcano. These materials, together with the steepness of the slopes (38.9% of the area has slope angles above 35°, reaching a maximum of 87.5°), make the area very prone to landslide activity. A total of 1,495 shallow landslides were mapped (at 1:5,000 scale) and included in a GIS database. The total affected area is 401,744 m2 (4.5% of the study area). Most slope movements are translational slides frequently evolving into debris flows. The landslides are elongated, with maximum length generally equivalent to the slope extent, and their width normally does not exceed 25 m. The failure depth rarely exceeds 1.5 m and the volume is usually smaller than 700 m3. For modelling purposes, the landslides were randomly divided into two sub-datasets: a modelling dataset with 748 events (2.2% of the study area) and a validation dataset with 747 events (2.3% of the study area). The susceptibility models obtained with the different probabilistic techniques were rated individually using success-rate and prediction-rate curves. The best model performance was obtained with logistic regression, although the results from the different methods do not show significant differences in either success-rate or prediction-rate curves. This suggests that: (1) the modelling landslide dataset is representative of the characteristics of the entire landslide population; and (2) increasing the complexity and robustness of the probabilistic methodology did not produce a significant increase in success or prediction rates. It was therefore concluded that the resolution and quality of the input variables are much more important than the choice of probabilistic model for assessing landslide susceptibility. This work was developed within the VOLCSOILRISK project (Volcanic Soils Geotechnical Characterization for Landslide Risk Mitigation), supported by Direcção Regional da Ciência e Tecnologia - Governo Regional dos Açores.
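The logistic-regression variant of this workflow, including a success-rate-style summary, can be sketched on synthetic pixels as follows (Python with scikit-learn); the features stand in for the conditioning factors and the data are simulated, not the Ribeira Quente inventory:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(3)
    n = 20_000
    X = rng.normal(size=(n, 3))                  # e.g. slope angle, wetness index, curvature
    logit = 1.5 * X[:, 0] + 0.8 * X[:, 1] - 4.0  # synthetic truth: slope dominates
    y = rng.random(n) < 1 / (1 + np.exp(-logit)) # rare landslide pixels

    idx = rng.permutation(n)
    train, valid = idx[: n // 2], idx[n // 2 :]  # modelling / validation split
    model = LogisticRegression().fit(X[train], y[train])

    # Success/prediction-rate style summary: rank pixels by susceptibility
    # and ask what fraction of landslides the top slice of the area captures.
    p = model.predict_proba(X[valid])[:, 1]
    order = np.argsort(-p)
    hits = np.cumsum(y[valid][order]) / y[valid].sum()
    for frac in (0.05, 0.10, 0.20):
        k = int(frac * len(order))
        print(f"top {frac:.0%} of area captures {hits[k - 1]:.1%} of landslides")

Plotting the full hits curve against area fraction reproduces the prediction-rate curves used above to compare the bi-variate and multi-variate models.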