Sample records for large relational databases

  1. Using relational databases for improved sequence similarity searching and large-scale genomic analyses.

    PubMed

    Mackey, Aaron J; Pearson, William R

    2004-10-01

    Relational databases are designed to integrate diverse types of information and manage large sets of search results, greatly simplifying genome-scale analyses. They are essential for the management and analysis of large-scale sequence analyses, and can also be used to improve the statistical significance of similarity searches by focusing on subsets of sequence libraries most likely to contain homologs. This unit describes how to use relational databases to improve the efficiency of sequence similarity searching and demonstrates various large-scale genomic analyses of homology-related data. It also describes the installation and use of a simple protein sequence database, seqdb_demo, which is used as a basis for the other protocols. These include basic use of the database to generate a novel sequence library subset, extending seqdb_demo to store sequence similarity search results, and using various kinds of stored search results to address aspects of comparative genomic analysis.
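
    The following is a minimal sketch of the unit's core idea, written in Python with sqlite3; the real seqdb_demo schema is richer, and the table, column, and file names here are illustrative only. Sequences are stored with their taxonomy, and a taxon-restricted FASTA subset is exported to search against.

    ```python
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("""CREATE TABLE protein (
        acc TEXT PRIMARY KEY, descr TEXT, taxon TEXT, seq TEXT)""")
    conn.executemany("INSERT INTO protein VALUES (?, ?, ?, ?)", [
        # Dummy records; real entries would hold full-length sequences.
        ("P0A7G6", "RecA", "Escherichia coli", "MAIDENKQK"),
        ("P68431", "Histone H3.1", "Homo sapiens", "MARTKQTAR"),
    ])

    # Export one taxon as a smaller, less redundant search library; a
    # focused library improves the statistical significance of true homologs.
    with open("ecoli_subset.fasta", "w") as out:
        for acc, descr, seq in conn.execute(
                "SELECT acc, descr, seq FROM protein WHERE taxon = ?",
                ("Escherichia coli",)):
            out.write(f">{acc} {descr}\n{seq}\n")
    ```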

  2. Using SQL Databases for Sequence Similarity Searching and Analysis.

    PubMed

    Pearson, William R; Mackey, Aaron J

    2017-09-13

    Relational databases can integrate diverse types of information and manage large sets of similarity search results, greatly simplifying genome-scale analyses. By focusing on taxonomic subsets of sequences, relational databases can reduce the size and redundancy of sequence libraries and improve the statistical significance of homologs. In addition, by loading similarity search results into a relational database, it becomes possible to explore and summarize the relationships between all of the proteins in an organism and those in other biological kingdoms. This unit describes how to use relational databases to improve the efficiency of sequence similarity searching and demonstrates various large-scale genomic analyses of homology-related data. It also describes the installation and use of a simple protein sequence database, seqdb_demo, which is used as a basis for the other protocols. The unit also introduces search_demo, a database that stores sequence similarity search results. The search_demo database is then used to explore the evolutionary relationships between E. coli proteins and proteins in other organisms in a large-scale comparative genomic analysis. © 2017 by John Wiley & Sons, Inc.
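
    A hypothetical sketch of the search_demo idea, with assumed table and column names: load similarity-search hits into one table and let SQL summarize homolog relationships across taxa.

    ```python
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("""CREATE TABLE hit (
        query_acc TEXT, subject_acc TEXT, subject_taxon TEXT,
        evalue REAL, bit_score REAL)""")
    conn.executemany("INSERT INTO hit VALUES (?, ?, ?, ?, ?)", [
        # Toy rows standing in for parsed similarity-search output.
        ("b0002", "YP_001", "Salmonella enterica", 1e-80, 290.0),
        ("b0002", "NP_414", "Homo sapiens", 2e-05, 48.1),
        ("b0003", "WP_000", "Bacillus subtilis", 1e-40, 160.0),
    ])

    # How many E. coli queries have a significant homolog in each taxon?
    for taxon, n in conn.execute("""
            SELECT subject_taxon, COUNT(DISTINCT query_acc)
            FROM hit WHERE evalue < 1.0e-6
            GROUP BY subject_taxon ORDER BY 2 DESC"""):
        print(taxon, n)
    ```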

  3. Relational Databases: A Transparent Framework for Encouraging Biology Students to Think Informatically

    ERIC Educational Resources Information Center

    Rice, Michael; Gladstone, William; Weir, Michael

    2004-01-01

    We discuss how relational databases constitute an ideal framework for representing and analyzing large-scale genomic data sets in biology. As a case study, we describe a Drosophila splice-site database that we recently developed at Wesleyan University for use in research and teaching. The database stores data about splice sites computed by a…

  4. Keeping Track of Our Treasures: Managing Historical Data with Relational Database Software.

    ERIC Educational Resources Information Center

    Gutmann, Myron P.; And Others

    1989-01-01

    Describes the way a relational database management system manages a large historical data collection project. Shows that such databases are practical to construct. States that the programming tasks involved are not for beginners, but the rewards of having data organized are worthwhile. (GG)

  5. Evaluating the Cassandra NoSQL Database Approach for Genomic Data Persistency.

    PubMed

    Aniceto, Rodrigo; Xavier, Rene; Guimarães, Valeria; Hondo, Fernanda; Holanda, Maristela; Walter, Maria Emilia; Lifschitz, Sérgio

    2015-01-01

    Rapid advances in high-throughput sequencing techniques have created interesting computational challenges in bioinformatics. One of them is the management of the massive amounts of data generated by automatic sequencers. We need to deal with the persistency of genomic data, particularly storing and analyzing these large-scale processed data. Finding an alternative to the commonly used relational database model has become a compelling task. Other data models may be more effective when dealing with very large amounts of nonconventional data, especially for write and retrieval operations. In this paper, we discuss the Cassandra NoSQL database approach for storing genomic data. We perform an analysis of persistency and I/O operations with real data, using the Cassandra database system. We also compare the results obtained with a classical relational database system and another NoSQL database approach, MongoDB.
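
    For illustration, a minimal Python sketch using the DataStax cassandra-driver package; the keyspace and table are assumptions standing in for the paper's genomic schema, not its actual design. Partitioning by sample keeps each sample's reads together while writes spread across nodes, matching the write-heavy workload the paper evaluates.

    ```python
    from cassandra.cluster import Cluster  # pip install cassandra-driver

    cluster = Cluster(["127.0.0.1"])  # assumes a local Cassandra node
    session = cluster.connect()
    session.execute("""
        CREATE KEYSPACE IF NOT EXISTS genomics WITH replication =
        {'class': 'SimpleStrategy', 'replication_factor': 1}""")
    session.set_keyspace("genomics")
    session.execute("""
        CREATE TABLE IF NOT EXISTS reads (
            sample_id text, read_id text, seq text,
            PRIMARY KEY (sample_id, read_id))""")

    # Writes are cheap and horizontally scalable in Cassandra.
    session.execute(
        "INSERT INTO reads (sample_id, read_id, seq) VALUES (%s, %s, %s)",
        ("S1", "r1", "ACGTACGT"))
    # Reads within a single partition are efficient.
    for row in session.execute(
            "SELECT read_id, seq FROM reads WHERE sample_id = %s", ("S1",)):
        print(row.read_id, row.seq)
    ```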

  6. Evaluating the Cassandra NoSQL Database Approach for Genomic Data Persistency

    PubMed Central

    Aniceto, Rodrigo; Xavier, Rene; Guimarães, Valeria; Hondo, Fernanda; Holanda, Maristela; Walter, Maria Emilia; Lifschitz, Sérgio

    2015-01-01

    Rapid advances in high-throughput sequencing techniques have created interesting computational challenges in bioinformatics. One of them is the management of the massive amounts of data generated by automatic sequencers. We need to deal with the persistency of genomic data, particularly storing and analyzing these large-scale processed data. Finding an alternative to the commonly used relational database model has become a compelling task. Other data models may be more effective when dealing with very large amounts of nonconventional data, especially for write and retrieval operations. In this paper, we discuss the Cassandra NoSQL database approach for storing genomic data. We perform an analysis of persistency and I/O operations with real data, using the Cassandra database system. We also compare the results obtained with a classical relational database system and another NoSQL database approach, MongoDB. PMID:26558254

  7. YAdumper: extracting and translating large information volumes from relational databases to structured flat files.

    PubMed

    Fernández, José M; Valencia, Alfonso

    2004-10-12

    Downloading the information stored in relational databases into XML and other flat formats is a common task in bioinformatics. This periodic dumping of information requires considerable CPU time, disk and memory resources. YAdumper has been developed as a purpose-specific tool for the complete, structured download of information from relational databases. YAdumper is a Java application that organizes database extraction following an XML template based on an external Document Type Declaration. Compared with other non-native alternatives, YAdumper substantially reduces memory requirements and considerably improves writing performance.
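
    YAdumper itself is a Java application driven by an XML template; the following is only a small Python sketch of the underlying task, with an assumed table: stream a relational table into a flat XML file row by row, so memory use stays constant no matter how large the table is.

    ```python
    import sqlite3
    from xml.sax.saxutils import escape

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE protein (acc TEXT, descr TEXT)")
    conn.execute("INSERT INTO protein VALUES ('P0A7G6', 'RecA protein')")

    with open("proteins.xml", "w") as out:
        out.write("<proteins>\n")
        # Iterating over the cursor streams rows one at a time instead of
        # materializing the whole result set in memory.
        for acc, descr in conn.execute("SELECT acc, descr FROM protein"):
            out.write('  <protein acc="%s">%s</protein>\n'
                      % (escape(acc, {'"': "&quot;"}), escape(descr)))
        out.write("</proteins>\n")
    ```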

  8. Very large database of lipids: rationale and design.

    PubMed

    Martin, Seth S; Blaha, Michael J; Toth, Peter P; Joshi, Parag H; McEvoy, John W; Ahmed, Haitham M; Elshazly, Mohamed B; Swiger, Kristopher J; Michos, Erin D; Kwiterovich, Peter O; Kulkarni, Krishnaji R; Chimera, Joseph; Cannon, Christopher P; Blumenthal, Roger S; Jones, Steven R

    2013-11-01

    Blood lipids have major cardiovascular and public health implications. Lipid-lowering drugs are prescribed based in part on categorization of patients into normal or abnormal lipid metabolism, yet relatively little emphasis has been placed on: (1) the accuracy of current lipid measures used in clinical practice, (2) the reliability of current categorizations of dyslipidemia states, and (3) the relationship of advanced lipid characterization to other cardiovascular disease biomarkers. To these ends, we developed the Very Large Database of Lipids (NCT01698489), an ongoing database protocol that harnesses deidentified data from the daily operations of a commercial lipid laboratory. The database includes individuals who were referred for clinical purposes for a Vertical Auto Profile (Atherotech Inc., Birmingham, AL), which directly measures cholesterol concentrations of low-density lipoprotein, very low-density lipoprotein, intermediate-density lipoprotein, high-density lipoprotein, their subclasses, and lipoprotein(a). Individual Very Large Database of Lipids studies, ranging from studies of measurement accuracy, to dyslipidemia categorization, to biomarker associations, to characterization of rare lipid disorders, are investigator-initiated and utilize peer-reviewed statistical analysis plans to address a priori hypotheses/aims. In the first database harvest (Very Large Database of Lipids 1.0) from 2009 to 2011, there were 1 340 614 adult and 10 294 pediatric patients; the adult sample had a median age of 59 years (interquartile range, 49-70 years) with even representation by sex. Lipid distributions closely matched those from the population-representative National Health and Nutrition Examination Survey. The second harvest of the database (Very Large Database of Lipids 2.0) is underway. Overall, the Very Large Database of Lipids database provides an opportunity for collaboration and new knowledge generation through careful examination of granular lipid data on a large scale. © 2013 Wiley Periodicals, Inc.

  9. Creating databases for biological information: an introduction.

    PubMed

    Stein, Lincoln

    2013-06-01

    The essence of bioinformatics is dealing with large quantities of information. Whether it be sequencing data, microarray data files, mass spectrometric data (e.g., fingerprints), the catalog of strains arising from an insertional mutagenesis project, or even large numbers of PDF files, there inevitably comes a time when the information can simply no longer be managed with files and directories. This is where databases come into play. This unit briefly reviews the characteristics of several database management systems, including flat file, indexed file, relational databases, and NoSQL databases. It compares their strengths and weaknesses and offers some general guidelines for selecting an appropriate database management system. Copyright 2013 by John Wiley & Sons, Inc.
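
    A tiny illustration of two of the options the unit compares, using a hypothetical strain catalog: an indexed key-value file (dbm) gives fast lookup by a single key, while a relational table (sqlite3) also supports ad hoc queries over any column.

    ```python
    import dbm
    import sqlite3

    # Indexed file: one fast access path, by key only.
    with dbm.open("strains", "c") as kv:
        kv["strain42"] = "insertion in geneX; frozen stock box 7"
        print(kv["strain42"].decode())

    # Relational table: query by any column, join with other tables.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE strain (name TEXT, gene TEXT, box INTEGER)")
    conn.execute("INSERT INTO strain VALUES ('strain42', 'geneX', 7)")
    print(conn.execute("SELECT name FROM strain WHERE box = 7").fetchall())
    ```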

  10. Creating databases for biological information: an introduction.

    PubMed

    Stein, Lincoln

    2002-08-01

    The essence of bioinformatics is dealing with large quantities of information. Whether it be sequencing data, microarray data files, mass spectrometric data (e.g., fingerprints), the catalog of strains arising from an insertional mutagenesis project, or even large numbers of PDF files, there inevitably comes a time when the information can simply no longer be managed with files and directories. This is where databases come into play. This unit briefly reviews the characteristics of several database management systems, including flat file, indexed file, and relational databases, as well as ACeDB. It compares their strengths and weaknesses and offers some general guidelines for selecting an appropriate database management system.

  11. Using Large-Scale Databases in Evaluation: Advances, Opportunities, and Challenges

    ERIC Educational Resources Information Center

    Penuel, William R.; Means, Barbara

    2011-01-01

    Major advances in the number, capabilities, and quality of state, national, and transnational databases have opened up new opportunities for evaluators. Both large-scale data sets collected for administrative purposes and those collected by other researchers can provide data for a variety of evaluation-related activities. These include (a)…

  12. Development of a database system for near-future climate change projections under the Japanese National Project SI-CAT

    NASA Astrophysics Data System (ADS)

    Nakagawa, Y.; Kawahara, S.; Araki, F.; Matsuoka, D.; Ishikawa, Y.; Fujita, M.; Sugimoto, S.; Okada, Y.; Kawazoe, S.; Watanabe, S.; Ishii, M.; Mizuta, R.; Murata, A.; Kawase, H.

    2017-12-01

    Analyses of large ensemble data are quite useful for producing probabilistic projections of climate change effects. Ensemble data from "+2K future climate simulations" are currently produced by the Japanese national project "Social Implementation Program on Climate Change Adaptation Technology (SI-CAT)" as part of the database for Policy Decision making for Future climate change (d4PDF; Mizuta et al. 2016) produced by the Program for Risk Information on Climate Change. These data consist of global warming simulations and regional downscaling simulations. Because the data volumes are too large (a few petabytes) to download to a user's local computer, a user-friendly system is required to search and download only the data that satisfy a user's request. Under SI-CAT, we are developing a database system for near-future climate change projections that provides functions for finding the necessary data. The system mainly consists of a relational database, a data download function and a user interface. The relational database, built on PostgreSQL, is its key component; temporally and spatially compressed data are registered in it. As a first step, we developed the relational database for precipitation, temperature and typhoon track data, according to requests by SI-CAT members. The data download function, based on the Open-source Project for a Network Data Access Protocol (OPeNDAP), downloads temporally and spatially extracted data based on search results obtained from the relational database. We also developed a web-based user interface for the relational database and the download function. A prototype of the system is currently in operational testing on our local server, and the system will be released on the Data Integration and Analysis System Program (DIAS) in fiscal year 2017. Its techniques might also be quite useful for simulation and observational data in other research fields. We report the current status of development and some case studies of the system.
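
    A hedged sketch, in Python with psycopg2, of the kind of search the relational layer supports; the connection parameters, table, and column names are assumptions, not the project's actual schema. The query touches only the compressed summaries; the matching subsets would then be fetched via OPeNDAP.

    ```python
    import psycopg2  # pip install psycopg2-binary

    conn = psycopg2.connect("dbname=sicat host=localhost user=reader")
    cur = conn.cursor()
    # Find ensemble members overlapping a latitude band and time window.
    cur.execute("""
        SELECT run_id, start_time
        FROM summary
        WHERE variable = %s
          AND min_lat <= %s AND max_lat >= %s
          AND start_time BETWEEN %s AND %s""",
        ("precipitation", 36.0, 35.0, "2030-01-01", "2039-12-31"))
    for run_id, start_time in cur.fetchall():
        print(run_id, start_time)  # candidates to download via OPeNDAP
    cur.close()
    conn.close()
    ```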

  13. Using Large Diabetes Databases for Research.

    PubMed

    Wild, Sarah; Fischbacher, Colin; McKnight, John

    2016-09-01

    There are an increasing number of clinical, administrative and trial databases that can be used for research. These are particularly valuable if there are opportunities for linkage to other databases. This paper describes examples of the use of large diabetes databases for research. It reviews the advantages and disadvantages of this approach and suggests solutions for some challenges. Large, high-quality databases offer potential sources of information for research at relatively low cost. Fundamental issues for using databases for research are the completeness of capture of cases within the population and time period of interest and the accuracy of the diagnosis of diabetes and outcomes of interest. The extent to which people included in the database are representative should be considered if the database is not population based and there is the intention to extrapolate findings to the wider diabetes population. Information on key variables such as date of diagnosis or duration of diabetes may not be available at all, may be inaccurate or may contain a large amount of missing data. Information on key confounding factors is rarely available for the nondiabetic or general population, limiting comparisons with the population of people with diabetes. However, comparisons that allow for differences in distribution of important demographic factors may be feasible using data for the whole population or a matched cohort study design. In summary, diabetes databases can be used to address important research questions. Understanding the strengths and limitations of this approach is crucial to interpret the findings appropriately. © 2016 Diabetes Technology Society.

  14. Inconsistencies in the red blood cell membrane proteome analysis: generation of a database for research and diagnostic applications

    PubMed Central

    Hegedűs, Tamás; Chaubey, Pururawa Mayank; Várady, György; Szabó, Edit; Sarankó, Hajnalka; Hofstetter, Lia; Roschitzki, Bernd; Sarkadi, Balázs

    2015-01-01

    Based on recent results, the determination of the easily accessible red blood cell (RBC) membrane proteins may provide new diagnostic possibilities for assessing mutations, polymorphisms or regulatory alterations in diseases. However, the analysis of the current mass spectrometry-based proteomics datasets and other major databases indicates inconsistencies—the results show large scattering and only a limited overlap for the identified RBC membrane proteins. Here, we applied membrane-specific proteomics studies in human RBC, compared these results with the data in the literature, and generated a comprehensive and expandable database using all available data sources. The integrated web database now refers to proteomic, genetic and medical databases as well, and contains an unexpected large number of validated membrane proteins previously thought to be specific for other tissues and/or related to major human diseases. Since the determination of protein expression in RBC provides a method to indicate pathological alterations, our database should facilitate the development of RBC membrane biomarker platforms and provide a unique resource to aid related further research and diagnostics. Database URL: http://rbcc.hegelab.org PMID:26078478

  15. High Performance Semantic Factoring of Giga-Scale Semantic Graph Databases

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Joslyn, Cliff A.; Adolf, Robert D.; Al-Saffar, Sinan

    2010-10-04

    As semantic graph database technology grows to address components ranging from extant large triple stores to SPARQL endpoints over SQL-structured relational databases, it will become increasingly important to be able to bring high performance computational resources to bear on their analysis, interpretation, and visualization, especially with respect to their innate semantic structure. Our research group built a novel high performance hybrid system comprising computational capability for semantic graph database processing utilizing the large multithreaded architecture of the Cray XMT platform, conventional clusters, and large data stores. In this paper we describe that architecture, and present the results of our deploying that for the analysis of the Billion Triple dataset with respect to its semantic factors.

  16. Evolution of the use of relational and NoSQL databases in the ATLAS experiment

    NASA Astrophysics Data System (ADS)

    Barberis, D.

    2016-09-01

    The ATLAS experiment used for many years a large database infrastructure based on Oracle to store several different types of non-event data: time-dependent detector configuration and conditions data, calibrations and alignments, configurations of Grid sites, catalogues for data management tools, job records for distributed workload management tools, and run and event metadata. The rapid development of "NoSQL" databases (structured storage services) in the last five years has allowed an extended and complementary usage of traditional relational databases and new structured storage tools, in order to improve the performance of existing applications and to extend their functionalities using the possibilities offered by modern storage systems. The trend is towards using the best tool for each kind of data, separating, for example, the intrinsically relational metadata from payload storage, and frequently updated records that benefit from transactions from archived information. Access to all components has to be orchestrated by specialised services that run on front-end machines and shield the user from the complexity of the data storage infrastructure. This paper describes this technology evolution in the ATLAS database infrastructure and presents a few examples of large database applications that benefit from it.

  17. Multiresource inventories incorporating GIS, GPS, and database management systems

    Treesearch

    Loukas G. Arvanitis; Balaji Ramachandran; Daniel P. Brackett; Hesham Abd-El Rasol; Xuesong Du

    2000-01-01

    Large-scale natural resource inventories generate enormous data sets. Their effective handling requires a sophisticated database management system. Such a system must be robust enough to efficiently store large amounts of data and flexible enough to allow users to manipulate a wide variety of information. In a pilot project, related to a multiresource inventory of the...

  18. Data management and language enhancement for generalized set theory computer language for operation of large relational databases

    NASA Technical Reports Server (NTRS)

    Finley, Gail T.

    1988-01-01

    This report covers the study of the relational database implementation in the NASCAD computer program system. The existing system is used primarily for computer aided design. Attention is also directed to a hidden-surface algorithm for final drawing output.

  19. New Resources for Computer-Aided Legal Research: An Assessment of the Usefulness of the DIALOG System in Securities Regulation Studies.

    ERIC Educational Resources Information Center

    Gruner, Richard; Heron, Carol E.

    1984-01-01

    Examines usefulness of DIALOG as legal research tool through use of DIALOG's DIALINDEX database to identify those databases among almost 200 available that contain large numbers of records related to federal securities regulation. Eight databases selected for further study are detailed. Twenty-six footnotes, database statistics, and samples are…

  20. Comparison of the Frontier Distributed Database Caching System to NoSQL Databases

    NASA Astrophysics Data System (ADS)

    Dykstra, Dave

    2012-12-01

    One of the main attractions of non-relational “NoSQL” databases is their ability to scale to large numbers of readers, including readers spread over a wide area. The Frontier distributed database caching system, used in production by the Large Hadron Collider CMS and ATLAS detector projects for Conditions data, is based on traditional SQL databases but also adds high scalability and the ability to be distributed over a wide-area for an important subset of applications. This paper compares the major characteristics of the two different approaches and identifies the criteria for choosing which approach to prefer over the other. It also compares in some detail the NoSQL databases used by CMS and ATLAS: MongoDB, CouchDB, HBase, and Cassandra.
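
    Frontier's core idea, reduced to a toy sketch: read-mostly SQL results are cached by query text with a time-to-live, so many distributed readers are served from caches instead of the central database. Frontier does this with HTTP proxy caches rather than the in-process dict used here.

    ```python
    import sqlite3
    import time

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE conditions (run INTEGER, payload TEXT)")
    conn.execute("INSERT INTO conditions VALUES (1, 'calib-v7')")

    _cache = {}  # query text -> (expiry time, rows)

    def cached_query(sql, ttl=300.0):
        now = time.monotonic()
        hit = _cache.get(sql)
        if hit and hit[0] > now:  # still fresh: no database round trip
            return hit[1]
        rows = conn.execute(sql).fetchall()
        _cache[sql] = (now + ttl, rows)
        return rows

    print(cached_query("SELECT payload FROM conditions WHERE run = 1"))
    print(cached_query("SELECT payload FROM conditions WHERE run = 1"))  # cache hit
    ```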

  21. Comparison of the Frontier Distributed Database Caching System to NoSQL Databases

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Dykstra, Dave

    One of the main attractions of non-relational NoSQL databases is their ability to scale to large numbers of readers, including readers spread over a wide area. The Frontier distributed database caching system, used in production by the Large Hadron Collider CMS and ATLAS detector projects for Conditions data, is based on traditional SQL databases but also adds high scalability and the ability to be distributed over a wide-area for an important subset of applications. This paper compares the major characteristics of the two different approaches and identifies the criteria for choosing which approach to prefer over the other. It also compares in some detail the NoSQL databases used by CMS and ATLAS: MongoDB, CouchDB, HBase, and Cassandra.

  22. A Database as a Service for the Healthcare System to Store Physiological Signal Data.

    PubMed

    Chang, Hsien-Tsung; Lin, Tsai-Huei

    2016-01-01

    Wearable devices that measure physiological signals to help develop self-health management habits have become increasingly popular in recent years. These records are conducive for follow-up health and medical care. In this study, based on the characteristics of the observed physiological signal records, namely (1) a large number of users, (2) a large amount of data, (3) low information variability, (4) data privacy authorization, and (5) data access by designated users, we wish to resolve physiological signal record-relevant issues utilizing the advantages of the Database as a Service (DaaS) model. Storing a large amount of data using file patterns can reduce database load, allowing users to access data efficiently; the privacy control settings allow users to store data securely. The results of the experiment show that the proposed system has better database access performance than a traditional relational database, with a small difference in database volume, thus proving that the proposed system can improve data storage performance.
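
    A minimal sketch of the paper's storage split, with assumed names: bulky, rarely changing signal samples go to flat files, while the relational layer keeps only metadata plus an access-authorization field, reducing database load.

    ```python
    import array
    import pathlib
    import sqlite3

    store = pathlib.Path("signals")
    store.mkdir(exist_ok=True)
    conn = sqlite3.connect("daas_meta.sqlite")
    conn.execute("""CREATE TABLE IF NOT EXISTS signal (
        user_id TEXT, kind TEXT, path TEXT, authorized_viewer TEXT)""")

    def store_signal(user_id, kind, samples, viewer):
        path = store / f"{user_id}_{kind}.bin"
        with open(path, "wb") as f:
            array.array("f", samples).tofile(f)  # raw payload outside the DB
        conn.execute("INSERT INTO signal VALUES (?, ?, ?, ?)",
                     (user_id, kind, str(path), viewer))
        conn.commit()

    store_signal("u1", "heart_rate", [61.0, 63.5, 62.2], viewer="dr_lin")
    ```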

  23. A Database as a Service for the Healthcare System to Store Physiological Signal Data

    PubMed Central

    Lin, Tsai-Huei

    2016-01-01

    Wearable devices that measure physiological signals to help develop self-health management habits have become increasingly popular in recent years. These records are conducive for follow-up health and medical care. In this study, based on the characteristics of the observed physiological signal records, namely (1) a large number of users, (2) a large amount of data, (3) low information variability, (4) data privacy authorization, and (5) data access by designated users, we wish to resolve physiological signal record-relevant issues utilizing the advantages of the Database as a Service (DaaS) model. Storing a large amount of data using file patterns can reduce database load, allowing users to access data efficiently; the privacy control settings allow users to store data securely. The results of the experiment show that the proposed system has better database access performance than a traditional relational database, with a small difference in database volume, thus proving that the proposed system can improve data storage performance. PMID:28033415

  24. High performance semantic factoring of giga-scale semantic graph databases.

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    al-Saffar, Sinan; Adolf, Bob; Haglin, David

    2010-10-01

    As semantic graph database technology grows to address components ranging from extant large triple stores to SPARQL endpoints over SQL-structured relational databases, it will become increasingly important to be able to bring high performance computational resources to bear on their analysis, interpretation, and visualization, especially with respect to their innate semantic structure. Our research group built a novel high performance hybrid system comprising computational capability for semantic graph database processing utilizing the large multithreaded architecture of the Cray XMT platform, conventional clusters, and large data stores. In this paper we describe that architecture, and present the results of our deploying that for the analysis of the Billion Triple dataset with respect to its semantic factors, including basic properties, connected components, namespace interaction, and typed paths.

  25. What have we learned in minimally invasive colorectal surgery from NSQIP and NIS large databases? A systematic review.

    PubMed

    Batista Rodríguez, Gabriela; Balla, Andrea; Corradetti, Santiago; Martinez, Carmen; Hernández, Pilar; Bollo, Jesús; Targarona, Eduard M

    2018-06-01

    "Big data" refers to large amount of dataset. Those large databases are useful in many areas, including healthcare. The American College of Surgeons National Surgical Quality Improvement Program (ACS-NSQIP) and the National Inpatient Sample (NIS) are big databases that were developed in the USA in order to record surgical outcomes. The aim of the present systematic review is to evaluate the type and clinical impact of the information retrieved through NISQP and NIS big database articles focused on laparoscopic colorectal surgery. A systematic review was conducted using The Meta-Analysis Of Observational Studies in Epidemiology (MOOSE) guidelines. The research was carried out on PubMed database and revealed 350 published papers. Outcomes of articles in which laparoscopic colorectal surgery was the primary aim were analyzed. Fifty-five studies, published between 2007 and February 2017, were included. Articles included were categorized in groups according to the main topic as: outcomes related to surgical technique comparisons, morbidity and perioperatory results, specific disease-related outcomes, sociodemographic disparities, and academic training impact. NSQIP and NIS databases are just the tip of the iceberg for the potential application of Big Data technology and analysis in MIS. Information obtained through big data is useful and could be considered as external validation in those situations where a significant evidence-based medicine exists; also, those databases establish benchmarks to measure the quality of patient care. Data retrieved helps to inform decision-making and improve healthcare delivery.

  26. Design and implementation of a distributed large-scale spatial database system based on J2EE

    NASA Astrophysics Data System (ADS)

    Gong, Jianya; Chen, Nengcheng; Zhu, Xinyan; Zhang, Xia

    2003-03-01

    With the increasing maturity of distributed object technology, CORBA, .NET and EJB are universally used in traditional IT fields. However, the theory and practice of distributed spatial databases need further improvement because of the contradictions between large-scale spatial data and limited network bandwidth, and between transitory sessions and long transaction processing. Differences and trends among CORBA, .NET and EJB are discussed in detail, and then the concept, architecture and characteristics of a distributed large-scale seamless spatial database system based on J2EE are presented, comprising a GIS client application, a web server, a GIS application server and a spatial data server. Moreover, the design and implementation of the system's components are explained: GIS client application components based on JavaBeans, a GIS engine based on servlets, and a GIS application server based on GIS enterprise JavaBeans (session beans and entity beans). In addition, experiments on the relationship between spatial data volume and response time under different conditions were conducted, which show that a distributed spatial database system based on J2EE can be used to manage, distribute and share large-scale spatial data on the Internet. Lastly, a distributed large-scale seamless image database based on the Internet is presented.

  27. A survey of commercial object-oriented database management systems

    NASA Technical Reports Server (NTRS)

    Atkins, John

    1992-01-01

    The object-oriented data model is the culmination of over thirty years of database research. Initially, database research focused on the need to provide information in a consistent and efficient manner to the business community. Early data models such as the hierarchical model and the network model met the goal of consistent and efficient access to data and were substantial improvements over simple file mechanisms for storing and accessing data. However, these models required highly skilled programmers to provide access to the data. Consequently, in the early 70's E.F. Codd, an IBM research computer scientist, proposed a new data model based on the simple mathematical notion of the relation. This model is known as the Relational Model. In the relational model, data is represented in flat tables (or relations) which have no physical or internal links between them. The simplicity of this model fostered the development of powerful but relatively simple query languages that now made data directly accessible to the general database user. Except for large, multi-user database systems, a database professional was in general no longer necessary. Database professionals found that traditional data in the form of character data, dates, and numeric data were easily represented and managed via the relational model. Commercial relational database management systems proliferated and performance of relational databases improved dramatically. However, there was a growing community of potential database users whose needs were not met by the relational model. These users needed to store data with data types not available in the relational model and required a far richer modelling environment than that provided by the relational model. Indeed, the complexity of the objects to be represented in the model mandated a new approach to database technology. The Object-Oriented Model was the result.

  28. A high performance, ad-hoc, fuzzy query processing system for relational databases

    NASA Technical Reports Server (NTRS)

    Mansfield, William H., Jr.; Fleischman, Robert M.

    1992-01-01

    Database queries involving imprecise or fuzzy predicates are currently an evolving area of academic and industrial research. Such queries place severe stress on the indexing and I/O subsystems of conventional database environments since they involve the search of large numbers of records. The Datacycle architecture and research prototype is a database environment that uses filtering technology to perform an efficient, exhaustive search of an entire database. It has recently been modified to include fuzzy predicates in its query processing. The approach obviates the need for complex index structures, provides unlimited query throughput, permits the use of ad-hoc fuzzy membership functions, and provides a deterministic response time largely independent of query complexity and load. This paper describes the Datacycle prototype implementation of fuzzy queries and some recent performance results.
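
    The Datacycle prototype performs its exhaustive scan in specialized filtering hardware, but the notion of an ad hoc fuzzy predicate is easy to sketch in software: register a membership function with the database and rank a full scan by it. A toy sqlite3 example with an assumed schema:

    ```python
    import sqlite3

    def near(value, target, width):
        """Triangular membership: 1 at target, falling to 0 at +/- width."""
        return max(0.0, 1.0 - abs(value - target) / width)

    conn = sqlite3.connect(":memory:")
    conn.create_function("near", 3, near)  # ad hoc membership function
    conn.execute("CREATE TABLE person (name TEXT, age REAL)")
    conn.executemany("INSERT INTO person VALUES (?, ?)",
                     [("ana", 29), ("bo", 41), ("cy", 33)])

    # "age is about 30": every row is scanned, scored, and ranked.
    for name, mu in conn.execute("""
            SELECT name, near(age, 30, 10) AS mu FROM person
            WHERE near(age, 30, 10) > 0 ORDER BY mu DESC"""):
        print(name, round(mu, 2))
    ```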

  29. Leveraging Cognitive Context for Object Recognition

    DTIC Science & Technology

    2014-06-01

    Context is most often viewed as a static concept, learned from large image databases. We build upon this concept by exploring cognitive context, demonstrating how rich dynamic context provided by ... context that people rely upon as they perceive the world. Context in ACT-R/E takes the form of associations between related concepts that are learned ... and accuracy of object recognition.

  30. SORTEZ: a relational translator for NCBI's ASN.1 database.

    PubMed

    Hart, K W; Searls, D B; Overton, G C

    1994-07-01

    The National Center for Biotechnology Information (NCBI) has created a database collection that includes several protein and nucleic acid sequence databases, a biosequence-specific subset of MEDLINE, as well as value-added information such as links between similar sequences. Information in the NCBI database is modeled in Abstract Syntax Notation 1 (ASN.1), an Open Systems Interconnection protocol designed for exchanging structured data between software applications rather than as a data model for database systems. While the NCBI database is distributed with an easy-to-use information retrieval system, ENTREZ, the ASN.1 data model currently lacks an ad hoc query language for general-purpose data access. For that reason, we have developed a software package, SORTEZ, that transforms the ASN.1 database (or other databases with nested data structures) to a relational data model and subsequently to a relational database management system (Sybase), where information can be accessed through the relational query language, SQL. Because the need to transform data from one data model and schema to another arises naturally in several important contexts, including efficient execution of specific applications, access to multiple databases and adaptation to database evolution, this work also serves as a practical study of the issues involved in the various stages of database transformation. We show that transformation from the ASN.1 data model to a relational data model can be largely automated, but that schema transformation and data conversion require considerable domain expertise and would greatly benefit from additional support tools.
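
    A toy version of the transformation SORTEZ automates, using a nested Python dict as a stand-in for an ASN.1 record: one parent row, one child row per nested element, linked by a foreign key, so ordinary SQL joins replace walking the nested structure.

    ```python
    import sqlite3

    record = {"acc": "U00096", "title": "E. coli K-12",
              "refs": [{"pmid": 9278503}, {"pmid": 16738553}]}

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE entry (id INTEGER PRIMARY KEY, acc TEXT, title TEXT)")
    conn.execute("CREATE TABLE entry_ref (entry_id INTEGER, pmid INTEGER)")

    # Parent row first, then one child row per nested element.
    cur = conn.execute("INSERT INTO entry (acc, title) VALUES (?, ?)",
                       (record["acc"], record["title"]))
    conn.executemany("INSERT INTO entry_ref VALUES (?, ?)",
                     [(cur.lastrowid, r["pmid"]) for r in record["refs"]])

    print(conn.execute("""SELECT acc, pmid FROM entry
                          JOIN entry_ref ON entry_ref.entry_id = entry.id""").fetchall())
    ```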

  31. DEXTER: Disease-Expression Relation Extraction from Text.

    PubMed

    Gupta, Samir; Dingerdissen, Hayley; Ross, Karen E; Hu, Yu; Wu, Cathy H; Mazumder, Raja; Vijay-Shanker, K

    2018-01-01

    Gene expression levels affect biological processes and play a key role in many diseases. Characterizing expression profiles is useful for clinical research, and diagnostics and prognostics of diseases. There are currently several high-quality databases that capture gene expression information, obtained mostly from large-scale studies, such as microarray and next-generation sequencing technologies, in the context of disease. The scientific literature is another rich source of information on gene expression-disease relationships that not only have been captured from large-scale studies but have also been observed in thousands of small-scale studies. Expression information obtained from literature through manual curation can extend expression databases. While many of the existing databases include information from literature, they are limited by the time-consuming nature of manual curation and have difficulty keeping up with the explosion of publications in the biomedical field. In this work, we describe an automated text-mining tool, Disease-Expression Relation Extraction from Text (DEXTER), to extract information from the literature on gene and microRNA expression in the context of disease. One of the motivations in developing DEXTER was to extend the BioXpress database, a cancer-focused gene expression database that includes data derived from large-scale experiments and manual curation of publications. The literature-based portion of BioXpress lags behind significantly compared to expression information obtained from large-scale studies and can benefit from our text-mined results. We have conducted two different evaluations to measure the accuracy of our text-mining tool and achieved average F-scores of 88.51% and 81.81% for the two evaluations, respectively. Also, to demonstrate the ability to extract rich expression information in different disease-related scenarios, we used DEXTER to extract differential expression information for 2024 genes in lung cancer, 115 glycosyltransferases in 62 cancers and 826 microRNAs in 171 cancers. All extractions using DEXTER are integrated in the literature-based portion of BioXpress. Database URL: http://biotm.cis.udel.edu/DEXTER.

  32. Mining a human transcriptome database for Nrf2 modulators

    EPA Science Inventory

    Nuclear factor erythroid-2 related factor 2 (Nrf2) is a key transcription factor important in the protection against oxidative stress. We developed computational procedures to enable the identification of chemical, genetic and environmental modulators of Nrf2 in a large database ...

  33. Design and deployment of a large brain-image database for clinical and nonclinical research

    NASA Astrophysics Data System (ADS)

    Yang, Guo Liang; Lim, Choie Cheio Tchoyoson; Banukumar, Narayanaswami; Aziz, Aamer; Hui, Francis; Nowinski, Wieslaw L.

    2004-04-01

    An efficient database is an essential component of organizing diverse information on image metadata and patient information for research in medical imaging. This paper describes the design, development and deployment of a large database system serving as a brain image repository that can be used across different platforms in various medical research studies. It forms the infrastructure that links hospitals and institutions together and shares data among them. The database contains patient-, pathology-, image-, research- and management-specific data. The functionalities of the database system include image uploading, storage, indexing, downloading and sharing as well as database querying and management, with security and data anonymization concerns well taken care of. The structure of the database is a multi-tier client-server architecture with a Relational Database Management System, Security Layer, Application Layer and User Interface. An image source adapter has been developed to handle most of the popular image formats. The database has a user interface based on web browsers and is easy to handle. We have used the Java programming language for its platform independency and vast function libraries. The brain image database can sort data according to clinically relevant information. This can be effectively used in research from the clinicians' point of view. The database is suitable for validation of algorithms on large populations of cases. Medical images for processing can be identified and organized based on information in image metadata. Clinical research in various pathologies can thus be performed with greater efficiency, and large image repositories can be managed more effectively. A prototype of the system has been installed in a few hospitals and is working to the satisfaction of the clinicians.

  34. ARCPHdb: A comprehensive protein database for SF1 and SF2 helicase from archaea.

    PubMed

    Moukhtar, Mirna; Chaar, Wafi; Abdel-Razzak, Ziad; Khalil, Mohamad; Taha, Samir; Chamieh, Hala

    2017-01-01

    Superfamily 1 and Superfamily 2 helicases, two of the largest helicase protein families, play vital roles in many biological processes including replication, transcription and translation. Study of helicase proteins in the model microorganisms of archaea has largely contributed to the understanding of their function, architecture and assembly. Based on a large phylogenomics approach, we have identified and classified all SF1 and SF2 protein families in ninety-five sequenced archaeal genomes. Here we developed an online webserver linked to a specialized protein database named ARCPHdb to provide access to SF1 and SF2 helicase families from archaea. ARCPHdb was implemented using the MySQL relational database. Web interfaces were developed using Netbeans. Data were stored according to UniProt accession numbers, NCBI RefSeq IDs, PDB IDs and Entrez databases. A user-friendly interactive web interface has been developed to browse, search and download archaeal helicase protein sequences, their available 3D structure models, and related documentation available in the literature provided by ARCPHdb. The database provides direct links to matching external databases. The ARCPHdb is the first online database to compile all protein information on SF1 and SF2 helicase from archaea in one platform. This database provides essential resource information for all researchers interested in the field. Copyright © 2016 Elsevier Ltd. All rights reserved.

  35. Validation of a common data model for active safety surveillance research

    PubMed Central

    Ryan, Patrick B; Reich, Christian G; Hartzema, Abraham G; Stang, Paul E

    2011-01-01

    Objective: Systematic analysis of observational medical databases for active safety surveillance is hindered by the variation in data models and coding systems. Data analysts often find robust clinical data models difficult to understand and ill suited to support their analytic approaches. Further, some models do not facilitate the computations required for systematic analysis across many interventions and outcomes for large datasets. Translating the data from these idiosyncratic data models to a common data model (CDM) could facilitate both the analysts' understanding and the suitability for large-scale systematic analysis. In addition to facilitating analysis, a suitable CDM has to faithfully represent the source observational database. Before beginning to use the Observational Medical Outcomes Partnership (OMOP) CDM and a related dictionary of standardized terminologies for a study of large-scale systematic active safety surveillance, the authors validated the model's suitability for this use by example. Validation by example: To validate the OMOP CDM, the model was instantiated into a relational database, data from 10 different observational healthcare databases were loaded into separate instances, a comprehensive array of analytic methods that operate on the data model was created, and these methods were executed against the databases to measure performance. Conclusion: There was acceptable representation of the data from 10 observational databases in the OMOP CDM using the standardized terminologies selected, and a range of analytic methods was developed and executed with sufficient performance to be useful for active safety surveillance. PMID:22037893
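
    The validation pattern in miniature, with a greatly simplified hypothetical CDM rather than the real OMOP schema: instantiate one schema, load each source into its own instance, and run the same analytic method unchanged against every instance.

    ```python
    import sqlite3

    CDM = """
    CREATE TABLE drug_exposure (person_id INTEGER, drug TEXT);
    CREATE TABLE condition_occurrence (person_id INTEGER, condition TEXT);
    """

    def load(drug_rows, condition_rows):
        db = sqlite3.connect(":memory:")  # one CDM instance per source
        db.executescript(CDM)
        db.executemany("INSERT INTO drug_exposure VALUES (?, ?)", drug_rows)
        db.executemany("INSERT INTO condition_occurrence VALUES (?, ?)",
                       condition_rows)
        return db

    # One analytic method: drug/condition co-occurrence counts.
    METHOD = """
    SELECT d.drug, c.condition, COUNT(*)
    FROM drug_exposure d JOIN condition_occurrence c USING (person_id)
    GROUP BY d.drug, c.condition
    """

    sources = [load([(1, "drugA")], [(1, "rash")]),
               load([(2, "drugB")], [(2, "nausea")])]
    for db in sources:  # the same method runs unchanged against each source
        print(db.execute(METHOD).fetchall())
    ```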

  36. The Xeno-glycomics database (XDB): a relational database of qualitative and quantitative pig glycome repertoire.

    PubMed

    Park, Hae-Min; Park, Ju-Hyeong; Kim, Yoon-Woo; Kim, Kyoung-Jin; Jeong, Hee-Jin; Jang, Kyoung-Soon; Kim, Byung-Gee; Kim, Yun-Gon

    2013-11-15

    In recent years, the improvement of mass spectrometry-based glycomics techniques (i.e. highly sensitive, quantitative and high-throughput analytical tools) has enabled us to obtain a large dataset of glycans. Here we present a database named Xeno-glycomics database (XDB) that contains cell- or tissue-specific pig glycomes analyzed with mass spectrometry-based techniques, including comprehensive pig glycan information on chemical structures, mass values, types and relative quantities. It was designed as a user-friendly web-based interface that allows users to query the database according to pig tissue/cell types or glycan masses. This database will contribute to providing qualitative and quantitative information on glycomes characterized from various pig cells/organs in xenotransplantation and might eventually provide new targets in the era of α1,3-galactosyltransferase gene-knockout pigs. The database can be accessed on the web at http://bioinformatics.snu.ac.kr/xdb.

  37. FOUNTAIN: A JAVA open-source package to assist large sequencing projects

    PubMed Central

    Buerstedde, Jean-Marie; Prill, Florian

    2001-01-01

    Background: Better automation, lower cost per reaction and a heightened interest in comparative genomics have led to a dramatic increase in DNA sequencing activities. Although the large sequencing projects of specialized centers are supported by in-house bioinformatics groups, many smaller laboratories face difficulties managing the appropriate processing and storage of their sequencing output. The challenges include documentation of clones, templates and sequencing reactions, and the storage, annotation and analysis of the large number of generated sequences. Results: We describe here a new program, named FOUNTAIN, for the management of large sequencing projects. FOUNTAIN uses the JAVA computer language and data storage in a relational database. Starting with a collection of sequencing objects (clones), the program generates and stores information related to the different stages of the sequencing project using a web browser interface for user input. The generated sequences are subsequently imported and annotated based on BLAST searches against the public databases. In addition, simple algorithms to cluster sequences and determine putative polymorphic positions are implemented. Conclusions: A simple, but flexible and scalable software package is presented to facilitate data generation and storage for large sequencing projects. Open source and largely platform and database independent, we wish FOUNTAIN to be improved and extended in a community effort. PMID:11591214

  38. The Danish Testicular Cancer database.

    PubMed

    Daugaard, Gedske; Kier, Maria Gry Gundgaard; Bandak, Mikkel; Mortensen, Mette Saksø; Larsson, Heidi; Søgaard, Mette; Toft, Birgitte Groenkaer; Engvad, Birte; Agerbæk, Mads; Holm, Niels Vilstrup; Lauritsen, Jakob

    2016-01-01

    The nationwide Danish Testicular Cancer database consists of a retrospective research database (DaTeCa database) and a prospective clinical database (Danish Multidisciplinary Cancer Group [DMCG] DaTeCa database). The aim is to improve the quality of care for patients with testicular cancer (TC) in Denmark, that is, by identifying risk factors for relapse, toxicity related to treatment, and focusing on late effects. All Danish male patients with a histologically verified germ cell cancer diagnosis in the Danish Pathology Registry are included in the DaTeCa databases. Data collection has been performed from 1984 to 2007 and from 2013 onward, respectively. The retrospective DaTeCa database contains detailed information with more than 300 variables related to histology, stage, treatment, relapses, pathology, tumor markers, kidney function, lung function, etc. A questionnaire related to late effects has been developed, which includes questions regarding social relationships, life situation, general health status, family background, diseases, symptoms, use of medication, marital status, psychosocial issues, fertility, and sexuality. TC survivors alive in October 2014 were invited to fill in this questionnaire, which includes 160 validated questions. Collection of questionnaires is still ongoing. A biobank including blood/sputum samples for future genetic analyses has been established. Both samples related to the DaTeCa and DMCG DaTeCa databases are included. The prospective DMCG DaTeCa database includes variables regarding histology, stage, prognostic group, and treatment. The DMCG DaTeCa database has existed since 2013 and is a young clinical database. It is necessary to extend the data collection in the prospective database in order to answer quality-related questions. Data from the retrospective database will be added to the prospective data. This will result in a large and very comprehensive database for future studies on TC patients.

  39. Relational databases: a transparent framework for encouraging biology students to think informatically.

    PubMed

    Rice, Michael; Gladstone, William; Weir, Michael

    2004-01-01

    We discuss how relational databases constitute an ideal framework for representing and analyzing large-scale genomic data sets in biology. As a case study, we describe a Drosophila splice-site database that we recently developed at Wesleyan University for use in research and teaching. The database stores data about splice sites computed by a custom algorithm using Drosophila cDNA transcripts and genomic DNA and supports a set of procedures for analyzing splice-site sequence space. A generic Web interface permits the execution of the procedures with a variety of parameter settings and also supports custom structured query language queries. Moreover, new analytical procedures can be added by updating special metatables in the database without altering the Web interface. The database provides a powerful setting for students to develop informatic thinking skills.

  40. Relational Databases: A Transparent Framework for Encouraging Biology Students To Think Informatically

    PubMed Central

    2004-01-01

    We discuss how relational databases constitute an ideal framework for representing and analyzing large-scale genomic data sets in biology. As a case study, we describe a Drosophila splice-site database that we recently developed at Wesleyan University for use in research and teaching. The database stores data about splice sites computed by a custom algorithm using Drosophila cDNA transcripts and genomic DNA and supports a set of procedures for analyzing splice-site sequence space. A generic Web interface permits the execution of the procedures with a variety of parameter settings and also supports custom structured query language queries. Moreover, new analytical procedures can be added by updating special metatables in the database without altering the Web interface. The database provides a powerful setting for students to develop informatic thinking skills. PMID:15592597

  41. High-Performance Secure Database Access Technologies for HEP Grids

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Matthew Vranicar; John Weicher

    2006-04-17

    The Large Hadron Collider (LHC) at the CERN Laboratory will become the largest scientific instrument in the world when it starts operations in 2007. Large Scale Analysis Computer Systems (computational grids) are required to extract rare signals of new physics from petabytes of LHC detector data. In addition to file-based event data, LHC data processing applications require access to large amounts of data in relational databases: detector conditions, calibrations, etc. U.S. high energy physicists demand efficient performance of grid computing applications in LHC physics research where world-wide remote participation is vital to their success. To empower physicists with data-intensive analysis capabilities a whole hyperinfrastructure of distributed databases cross-cuts a multi-tier hierarchy of computational grids. The crosscutting allows separation of concerns across both the global environment of a federation of computational grids and the local environment of a physicist’s computer used for analysis. Very few efforts are on-going in the area of database and grid integration research. Most of these are outside of the U.S. and rely on traditional approaches to secure database access via an extraneous security layer separate from the database system core, preventing efficient data transfers. Our findings are shared by the Database Access and Integration Services Working Group of the Global Grid Forum, who states that "Research and development activities relating to the Grid have generally focused on applications where data is stored in files. However, in many scientific and commercial domains, database management systems have a central role in data storage, access, organization, authorization, etc, for numerous applications.” There is a clear opportunity for a technological breakthrough, requiring innovative steps to provide high-performance secure database access technologies for grid computing. We believe that an innovative database architecture where the secure authorization is pushed into the database engine will eliminate inefficient data transfer bottlenecks. Furthermore, traditionally separated database and security layers provide an extra vulnerability, leaving a weak clear-text password authorization as the only protection on the database core systems. Due to the legacy limitations of the systems’ security models, the allowed passwords often can not even comply with the DOE password guideline requirements. We see an opportunity for the tight integration of the secure authorization layer with the database server engine resulting in both improved performance and improved security. Phase I has focused on the development of a proof-of-concept prototype using Argonne National Laboratory’s (ANL) Argonne Tandem-Linac Accelerator System (ATLAS) project as a test scenario. By developing a grid-security enabled version of the ATLAS project’s current relational database solution, MySQL, PIOCON Technologies aims to offer a more efficient solution to secure database access.
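
    A small, real example of authorization enforced inside a database engine, the architectural direction the project argues for: SQLite's authorizer callback is consulted while each statement is compiled, so denial happens before any data is read. The table and the rule are illustrative only.

    ```python
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE calib (detector TEXT, secret TEXT)")
    conn.execute("INSERT INTO calib VALUES ('atlas', 'xyz')")

    def authorizer(action, arg1, arg2, db_name, trigger):
        # Deny any read of the 'secret' column; allow everything else.
        if action == sqlite3.SQLITE_READ and arg2 == "secret":
            return sqlite3.SQLITE_DENY
        return sqlite3.SQLITE_OK

    conn.set_authorizer(authorizer)
    print(conn.execute("SELECT detector FROM calib").fetchall())  # permitted
    try:
        conn.execute("SELECT secret FROM calib")
    except sqlite3.DatabaseError as exc:  # refused inside the engine
        print("blocked:", exc)
    ```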

  42. The Era of the Large Databases: Outcomes After Gastroesophageal Surgery According to NSQIP, NIS, and NCDB Databases. Systematic Literature Review.

    PubMed

    Batista Rodríguez, Gabriela; Balla, Andrea; Fernández-Ananín, Sonia; Balagué, Carmen; Targarona, Eduard M

    2018-05-01

    The term big data refers to databases that include large amounts of information used in various areas of knowledge. Currently, there are large databases that allow the evaluation of postoperative evolution, such as the American College of Surgeons National Surgical Quality Improvement Program (ACS-NSQIP), the Healthcare Cost and Utilization Project (HCUP) National Inpatient Sample (NIS), and the National Cancer Database (NCDB). The aim of this review was to evaluate the clinical impact of information obtained from these registries regarding gastroesophageal surgery. A systematic review using the Meta-analysis of Observational Studies in Epidemiology guidelines was performed. The research was carried out using the PubMed database, identifying 251 articles. All outcomes related to gastroesophageal surgery were analyzed. A total of 34 articles published between January 2007 and July 2017 were included, covering a total of 345 697 patients. Studies were analyzed and divided according to the type of surgery and main theme into (1) esophageal surgery and (2) gastric surgery. The information provided by these databases is an effective way to obtain levels of evidence not obtainable by conventional methods. Furthermore, this information is useful for the external validation of previous studies, to establish benchmarks that allow comparisons between centers, and has a positive impact on the quality of care.

  3. Information technologies in public health management: a database on biocides to improve quality of life.

    PubMed

    Roman, C; Scripcariu, L; Diaconescu, Rm; Grigoriu, A

    2012-01-01

    Biocides for prolonging the shelf life of a large variety of materials have been extensively used over the last decades. Worldwide biocide consumption was estimated at about 12.4 billion dollars in 2011 and was expected to increase in 2012. As biocides are substances we come into contact with in our everyday lives, access to this type of information is of paramount importance in order to ensure an appropriate living environment. Consequently, a database where information may be quickly processed, sorted, and easily accessed, according to different search criteria, is the most desirable solution. The main aim of this work was to design and implement a relational database with complete information about biocides used in public health management to improve the quality of life. The work comprised the design and implementation of a relational database for biocides using the software "phpMyAdmin". The result is a database that allows for the efficient collection, storage, and management of information, including the chemical properties and applications of a large number of biocides, as well as for its adequate dissemination into the public health environment. The information contained in the database herein presented promotes an adequate use of biocides, by means of information technologies, which in consequence may help achieve important improvements in our quality of life.

  4. U.S. Quaternary Fault and Fold Database Released

    NASA Astrophysics Data System (ADS)

    Haller, Kathleen M.; Machette, Michael N.; Dart, Richard L.; Rhea, B. Susan

    2004-06-01

    A comprehensive online compilation of Quaternary-age faults and folds throughout the United States was recently released by the U.S. Geological Survey, with cooperation from state geological surveys, academia, and the private sector. The Web site at http://Qfaults.cr.usgs.gov/ contains searchable databases and related geo-spatial data that characterize earthquake-related structures that could be potential seismic sources for large-magnitude (M > 6) earthquakes.

  5. An SQL query generator for CLIPS

    NASA Technical Reports Server (NTRS)

    Snyder, James; Chirica, Laurian

    1990-01-01

    As expert systems become more widely used, their access to large amounts of external information becomes increasingly important. This information exists in several forms, such as statistical or tabular data, knowledge gained by experts, and large databases of information maintained by companies. Because many expert systems, including CLIPS, do not provide access to this external information, much of the usefulness of expert systems is left untapped. The scope of this paper is to describe a database extension for the CLIPS expert system shell. The current industry standard database language is SQL. Due to SQL standardization, large amounts of information stored on various computers, potentially at different locations, will be more easily accessible. Expert systems should be able to directly access these existing databases rather than requiring information to be re-entered into the expert system environment. The ORACLE relational database management system (RDBMS) was used to provide a database connection within the CLIPS environment. To facilitate relational database access, a query generation system was developed as a CLIPS user function. The queries are entered in a CLIPS-like syntax and are passed to the query generator, which constructs an SQL query and submits it to the ORACLE RDBMS for execution. The query results are asserted as CLIPS facts. The query generator was developed primarily for use within the ICADS project (Intelligent Computer Aided Design System) currently being developed by the CAD Research Unit at the California Polytechnic State University (Cal Poly). In ICADS, there are several parallel or distributed expert systems accessing a common knowledge base of facts. Each expert system has a narrow domain of interest and therefore needs only certain portions of the information. The query generator provides a common method of accessing this information and allows the expert system to specify what data is needed without specifying how to retrieve it.
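
    As an illustration of the general idea behind such a query generator (a minimal sketch, not the ICADS implementation; the fact-pattern syntax, table, and column names are invented), a CLIPS-style slot pattern can be translated into a parameterized SQL SELECT:

        # Hypothetical sketch: translate a CLIPS-style fact pattern such as
        # (employee (name ?n) (dept "SALES")) into a parameterized SQL SELECT.
        def pattern_to_sql(table, slots):
            """Slots with '?' values are projected; the rest become WHERE tests."""
            columns = [c for c, v in slots.items() if str(v).startswith("?")]
            conditions = {c: v for c, v in slots.items() if not str(v).startswith("?")}
            select = ", ".join(columns) or "*"
            sql = f"SELECT {select} FROM {table}"
            if conditions:
                sql += " WHERE " + " AND ".join(f"{c} = :{c}" for c in conditions)
            return sql, conditions  # parameterized query plus bind values

        query, binds = pattern_to_sql("employee", {"name": "?n", "dept": "SALES"})
        print(query)   # SELECT name FROM employee WHERE dept = :dept
        print(binds)   # {'dept': 'SALES'}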

  6. Reporting to Improve Reproducibility and Facilitate Validity Assessment for Healthcare Database Studies V1.0.

    PubMed

    Wang, Shirley V; Schneeweiss, Sebastian; Berger, Marc L; Brown, Jeffrey; de Vries, Frank; Douglas, Ian; Gagne, Joshua J; Gini, Rosa; Klungel, Olaf; Mullins, C Daniel; Nguyen, Michael D; Rassen, Jeremy A; Smeeth, Liam; Sturkenboom, Miriam

    2017-09-01

    Defining a study population and creating an analytic dataset from longitudinal healthcare databases involves many decisions. Our objective was to catalogue scientific decisions underpinning study execution that should be reported to facilitate replication and enable assessment of validity of studies conducted in large healthcare databases. We reviewed key investigator decisions required to operate a sample of macros and software tools designed to create and analyze analytic cohorts from longitudinal streams of healthcare data. A panel of academic, regulatory, and industry experts in healthcare database analytics discussed and added to this list. Evidence generated from large healthcare encounter and reimbursement databases is increasingly being sought by decision-makers. Varied terminology is used around the world for the same concepts. Agreeing on terminology and which parameters from a large catalogue are the most essential to report for replicable research would improve transparency and facilitate assessment of validity. At a minimum, reporting for a database study should provide clarity regarding operational definitions for key temporal anchors and their relation to each other when creating the analytic dataset, accompanied by an attrition table and a design diagram. A substantial improvement in reproducibility, rigor and confidence in real world evidence generated from healthcare databases could be achieved with greater transparency about operational study parameters used to create analytic datasets from longitudinal healthcare databases. © 2017 The Authors. Pharmacoepidemiology & Drug Safety Published by John Wiley & Sons Ltd.

  7. Large-scale annotation of small-molecule libraries using public databases.

    PubMed

    Zhou, Yingyao; Zhou, Bin; Chen, Kaisheng; Yan, S Frank; King, Frederick J; Jiang, Shumei; Winzeler, Elizabeth A

    2007-01-01

    While many large publicly accessible databases provide excellent annotation for biological macromolecules, the same is not true for small chemical compounds. Commercial data sources also fail to provide an annotation interface for large numbers of compounds and tend to be too costly to be widely available to biomedical researchers. Therefore, using annotation information for the selection of lead compounds from a modern-day high-throughput screening (HTS) campaign presently occurs on only a very limited scale. The recent rapid expansion of the NIH PubChem database provides an opportunity to link existing biological databases with compound catalogs and provides relevant information that could potentially improve the information garnered from large-scale screening efforts. Using the 2.5 million compound collection at the Genomics Institute of the Novartis Research Foundation (GNF) as a model, we determined that approximately 4% of the library contained compounds with potential annotation in such databases as PubChem and the World Drug Index (WDI), as well as related databases such as the Kyoto Encyclopedia of Genes and Genomes (KEGG) and ChemIDplus. Furthermore, exact structure match analysis showed that 32% of GNF compounds can be linked to third-party databases via PubChem. We also showed that annotations such as MeSH (medical subject headings) terms can be applied to in-house HTS databases to identify signature biological inhibition profiles of interest, as well as to expedite the assay validation process. The automated annotation of thousands of screening hits in batch is becoming feasible and has the potential to play an essential role in the hit-to-lead decision making process.

  8. Search extension transforms Wiki into a relational system: a case for flavonoid metabolite database.

    PubMed

    Arita, Masanori; Suwa, Kazuhiro

    2008-09-17

    In computer science, database systems are based on the relational model founded by Edgar Codd in 1970. On the other hand, in the area of biology the word 'database' often refers to loosely formatted, very large text files. Although such bio-databases may describe conflicts or ambiguities (e.g. a protein pair reported both to interact and not to interact, or unknown parameters) in a positive sense, the flexibility of the data format sacrifices a systematic query mechanism equivalent to the widely used SQL. To overcome this disadvantage, we propose embeddable string-search commands on a Wiki-based system and designed a half-formatted database. As proof of principle, a database of flavonoids with 6902 molecular structures from over 1687 plant species was implemented on MediaWiki, the background system of Wikipedia. Registered users can describe any information in an arbitrary format. The structured part is subject to text-string searches that realize relational operations. The system was written in the PHP language as an extension of MediaWiki. All modifications are open-source and publicly available. This scheme benefits from both the free-formatted Wiki style and the concise and structured relational-database style. MediaWiki supports multi-user environments for document management, and the cost of database maintenance is alleviated.
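
    A toy sketch of the search-extension idea (invented page contents and field syntax, not the actual MediaWiki extension): each page mixes free text with structured "field: value" lines, and a string search over the structured lines emulates a relational selection.

        import re

        # Each wiki page mixes free text with structured 'field: value' lines;
        # a regex search over the structured part emulates SELECT ... WHERE.
        pages = {
            "Quercetin": "Free-form notes...\nformula: C15H10O7\nspecies: Allium cepa",
            "Naringenin": "More notes...\nformula: C15H12O5\nspecies: Citrus paradisi",
        }

        def select(pages, field, pattern):
            """Return titles of pages whose structured field matches the regex."""
            hits = []
            for title, text in pages.items():
                m = re.search(rf"^{field}:\s*(.+)$", text, re.MULTILINE)
                if m and re.search(pattern, m.group(1)):
                    hits.append(title)
            return hits

        print(select(pages, "species", r"Citrus"))  # ['Naringenin']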

  9. Search extension transforms Wiki into a relational system: A case for flavonoid metabolite database

    PubMed Central

    Arita, Masanori; Suwa, Kazuhiro

    2008-01-01

    Background In computer science, database systems are based on the relational model founded by Edgar Codd in 1970. On the other hand, in the area of biology the word 'database' often refers to loosely formatted, very large text files. Although such bio-databases may describe conflicts or ambiguities (e.g. a protein pair reported both to interact and not to interact, or unknown parameters) in a positive sense, the flexibility of the data format sacrifices a systematic query mechanism equivalent to the widely used SQL. Results To overcome this disadvantage, we propose embeddable string-search commands on a Wiki-based system and designed a half-formatted database. As proof of principle, a database of flavonoids with 6902 molecular structures from over 1687 plant species was implemented on MediaWiki, the background system of Wikipedia. Registered users can describe any information in an arbitrary format. The structured part is subject to text-string searches that realize relational operations. The system was written in the PHP language as an extension of MediaWiki. All modifications are open-source and publicly available. Conclusion This scheme benefits from both the free-formatted Wiki style and the concise and structured relational-database style. MediaWiki supports multi-user environments for document management, and the cost of database maintenance is alleviated. PMID:18822113

  10. High Performance Descriptive Semantic Analysis of Semantic Graph Databases

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Joslyn, Cliff A.; Adolf, Robert D.; al-Saffar, Sinan

    As semantic graph database technology grows to address components ranging from extant large triple stores to SPARQL endpoints over SQL-structured relational databases, it will become increasingly important to be able to understand their inherent semantic structure, whether codified in explicit ontologies or not. Our group is researching novel methods for what we call descriptive semantic analysis of RDF triplestores, to serve purposes of analysis, interpretation, visualization, and optimization. But data size and computational complexity make it increasingly necessary to bring high performance computational resources to bear on this task. Our research group built a novel high performance hybrid system comprising computational capability for semantic graph database processing utilizing the large multi-threaded architecture of the Cray XMT platform, conventional servers, and large data stores. In this paper we describe that architecture and our methods, and present the results of our analyses of basic properties, connected components, namespace interaction, and typed paths for the Billion Triple Challenge 2010 dataset.

  11. A blue carbon soil database: Tidal wetland stocks for the US National Greenhouse Gas Inventory

    NASA Astrophysics Data System (ADS)

    Feagin, R. A.; Eriksson, M.; Hinson, A.; Najjar, R. G.; Kroeger, K. D.; Herrmann, M.; Holmquist, J. R.; Windham-Myers, L.; MacDonald, G. M.; Brown, L. N.; Bianchi, T. S.

    2015-12-01

    Coastal wetlands contain large reservoirs of carbon, and in 2015 the US National Greenhouse Gas Inventory began the work of placing blue carbon within the national regulatory context. The potential value of a wetland carbon stock, in relation to its location, soon could be influential in determining governmental policy and management activities, or in stimulating market-based CO2 sequestration projects. To meet the national need for high-resolution maps, a blue carbon stock database was developed linking National Wetlands Inventory datasets with the USDA Soil Survey Geographic Database. Users of the database can identify the economic potential for carbon conservation or restoration projects within specific estuarine basins, states, wetland types, physical parameters, and land management activities. The database is geared towards both national-level assessments and local-level inquiries. Spatial analysis of the stocks shows high variance within individual estuarine basins, largely dependent on geomorphic position on the landscape, though there are continental-scale trends in the carbon distribution as well. Future plans include linking this database with a sedimentary accretion database to predict carbon flux in US tidal wetlands.

  12. Extending the data dictionary for data/knowledge management

    NASA Technical Reports Server (NTRS)

    Hydrick, Cecile L.; Graves, Sara J.

    1988-01-01

    Current relational database technology provides the means for efficiently storing and retrieving large amounts of data. By combining techniques learned from the field of artificial intelligence with this technology, it is possible to expand the capabilities of such systems. This paper suggests using the expanded domain concept, an object-oriented organization, and the storing of knowledge rules within the relational database as a solution to the unique problems associated with CAD/CAM and engineering data.

  13. Scale out databases for CERN use cases

    NASA Astrophysics Data System (ADS)

    Baranowski, Zbigniew; Grzybek, Maciej; Canali, Luca; Lanza Garcia, Daniel; Surdy, Kacper

    2015-12-01

    Data generation rates are expected to grow very fast for some database workloads going into LHC run 2 and beyond. In particular this is expected for data coming from controls, logging and monitoring systems. Storing, administering and accessing big data sets in a relational database system can quickly become a very hard technical challenge, as the size of the active data set and the number of concurrent users increase. Scale-out database technologies are a rapidly developing set of solutions for deploying and managing very large data warehouses on commodity hardware and with open source software. In this paper we will describe the architecture and tests on database systems based on Hadoop and the Cloudera Impala engine. We will discuss the results of our tests, including tests of data loading and integration with existing data sources and in particular with relational databases. We will report on query performance tests done with various data sets of interest at CERN, notably data from the accelerator log database.

  14. The BioImage Database Project: organizing multidimensional biological images in an object-relational database.

    PubMed

    Carazo, J M; Stelzer, E H

    1999-01-01

    The BioImage Database Project collects and structures multidimensional data sets recorded by various microscopic techniques relevant to modern life sciences. It provides, as precisely as possible, the circumstances in which the sample was prepared and the data were recorded. It grants access to the actual data and maintains links between related data sets. In order to promote the interdisciplinary approach of modern science, it offers a large set of key words, which covers essentially all aspects of microscopy. Nonspecialists can, therefore, access and retrieve significant information recorded and submitted by specialists in other areas. A key issue of the undertaking is to exploit the available technology and to provide a well-defined yet flexible structure for dealing with data. Its pivotal element is, therefore, a modern object relational database that structures the metadata and ameliorates the provision of a complete service. The BioImage database can be accessed through the Internet. Copyright 1999 Academic Press.

  15. Variations in data collection methods between national databases affect study results: a comparison of the nationwide inpatient sample and national surgical quality improvement program databases for lumbar spine fusion procedures.

    PubMed

    Bohl, Daniel D; Russo, Glenn S; Basques, Bryce A; Golinvaux, Nicholas S; Fu, Michael C; Long, William D; Grauer, Jonathan N

    2014-12-03

    There has been an increasing use of national databases to conduct orthopaedic research. Questions regarding the validity and consistency of these studies have not been fully addressed. The purpose of this study was to test for similarity in reported measures between two national databases commonly used for orthopaedic research. A retrospective cohort study of patients undergoing lumbar spinal fusion procedures during 2009 to 2011 was performed in two national databases: the Nationwide Inpatient Sample and the National Surgical Quality Improvement Program. Demographic characteristics, comorbidities, and inpatient adverse events were directly compared between databases. The total numbers of patients included were 144,098 from the Nationwide Inpatient Sample and 8434 from the National Surgical Quality Improvement Program. There were only small differences in demographic characteristics between the two databases. There were large differences between databases in the rates at which specific comorbidities were documented. Non-morbid obesity was documented at rates of 9.33% in the Nationwide Inpatient Sample and 36.93% in the National Surgical Quality Improvement Program (relative risk, 0.25; p < 0.05). Peripheral vascular disease was documented at rates of 2.35% in the Nationwide Inpatient Sample and 0.60% in the National Surgical Quality Improvement Program (relative risk, 3.89; p < 0.05). Similarly, there were large differences between databases in the rates at which specific inpatient adverse events were documented. Sepsis was documented at rates of 0.38% in the Nationwide Inpatient Sample and 0.81% in the National Surgical Quality Improvement Program (relative risk, 0.47; p < 0.05). Acute kidney injury was documented at rates of 1.79% in the Nationwide Inpatient Sample and 0.21% in the National Surgical Quality Improvement Program (relative risk, 8.54; p < 0.05). As database studies become more prevalent in orthopaedic surgery, authors, reviewers, and readers should view these studies with caution. This study shows that two commonly used databases can identify demographically similar patients undergoing a common orthopaedic procedure; however, the databases document markedly different rates of comorbidities and inpatient adverse events. The differences are likely the result of the very different mechanisms through which the databases collect their comorbidity and adverse event data. Findings highlight concerns regarding the validity of orthopaedic database research. Copyright © 2014 by The Journal of Bone and Joint Surgery, Incorporated.
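
    The relative risks quoted above follow directly from the documented rates; recomputing them from the percentages in the abstract is a one-liner per finding (the small deviations from the published values reflect rounding of the quoted rates):

        # Recompute the relative risks (NIS rate / NSQIP rate) from the
        # documented percentages; small deviations from the published RRs
        # reflect rounding of the rates quoted in the abstract.
        rates = {
            "non-morbid obesity":          (9.33, 36.93, 0.25),
            "peripheral vascular disease": (2.35,  0.60, 3.89),
            "sepsis":                      (0.38,  0.81, 0.47),
            "acute kidney injury":         (1.79,  0.21, 8.54),
        }
        for finding, (nis, nsqip, reported) in rates.items():
            print(f"{finding}: RR = {nis / nsqip:.2f} (reported {reported})")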

  16. Protein Simulation Data in the Relational Model.

    PubMed

    Simms, Andrew M; Daggett, Valerie

    2012-10-01

    High performance computing is leading to unprecedented volumes of data. Relational databases offer a robust and scalable model for storing and analyzing scientific data. However, these features do not come without a cost: significant design effort is required to build a functional and efficient repository. Modeling protein simulation data in a relational database presents several challenges: the data captured from individual simulations are large, multi-dimensional, and must integrate with both simulation software and external data sites. Here we present the dimensional design and relational implementation of a comprehensive data warehouse for storing and analyzing molecular dynamics simulations using SQL Server.
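
    The published warehouse uses SQL Server; purely as an illustration of a dimensional (star-schema) design, here is a minimal sketch in SQLite with invented table and column names: a fact table of per-frame measurements joined to a small dimension table.

        import sqlite3

        # Illustrative star schema: a fact table of per-frame measurements
        # joined to a dimension table. Names are hypothetical, not the authors'.
        con = sqlite3.connect(":memory:")
        con.executescript("""
        CREATE TABLE dim_simulation (
            sim_id      INTEGER PRIMARY KEY,
            protein     TEXT,
            temperature REAL
        );
        CREATE TABLE fact_frame_measure (
            sim_id INTEGER REFERENCES dim_simulation(sim_id),
            frame  INTEGER,
            rmsd   REAL,
            sasa   REAL
        );
        """)
        con.execute("INSERT INTO dim_simulation VALUES (1, '1ENH', 298.0)")
        con.execute("INSERT INTO fact_frame_measure VALUES (1, 0, 1.2, 880.0)")
        # A typical analytical query: average RMSD per protein over all frames.
        print(con.execute("""
            SELECT s.protein, AVG(f.rmsd)
            FROM fact_frame_measure f JOIN dim_simulation s USING (sim_id)
            GROUP BY s.protein
        """).fetchall())  # [('1ENH', 1.2)]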

  17. Protein Simulation Data in the Relational Model

    PubMed Central

    Simms, Andrew M.; Daggett, Valerie

    2011-01-01

    High performance computing is leading to unprecedented volumes of data. Relational databases offer a robust and scalable model for storing and analyzing scientific data. However, these features do not come without a cost—significant design effort is required to build a functional and efficient repository. Modeling protein simulation data in a relational database presents several challenges: the data captured from individual simulations are large, multi-dimensional, and must integrate with both simulation software and external data sites. Here we present the dimensional design and relational implementation of a comprehensive data warehouse for storing and analyzing molecular dynamics simulations using SQL Server. PMID:23204646

  18. A practical approach for inexpensive searches of radiology report databases.

    PubMed

    Desjardins, Benoit; Hamilton, R Curtis

    2007-06-01

    We present a method to perform full text searches of radiology reports for the large number of departments that do not have this ability as part of their radiology or hospital information system. A tool written in Microsoft Access (front-end) has been designed to search a server (back-end) containing an indexed weekly backup copy of the full relational database extracted from a radiology information system (RIS). This front-end/back-end approach has been implemented in a large academic radiology department, and is used for teaching, research, and administrative purposes. This second weekly backup of the 80 GB, 4 million record RIS database takes 2 hours. Further indexing of the exported radiology reports takes 6 hours. Individual searches typically take less than 1 minute on the indexed database and 30-60 minutes on the nonindexed database. Guidelines to properly address privacy and institutional review board issues are closely followed by all users. This method has the potential to improve teaching, research, and administrative programs within radiology departments that cannot afford more expensive technology.
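
    The same indexed full-text idea can be sketched with SQLite's FTS5 extension (assuming a build that includes it; the report texts are invented): index the exported reports once, and full-text queries then return without scanning every row.

        import sqlite3

        # Index exported report text once; MATCH queries then avoid full scans.
        con = sqlite3.connect(":memory:")
        con.execute("CREATE VIRTUAL TABLE reports USING fts5(accession, body)")
        con.executemany(
            "INSERT INTO reports VALUES (?, ?)",
            [("RAD-1", "No evidence of pulmonary embolism."),
             ("RAD-2", "Small left pleural effusion, unchanged.")],
        )
        hits = con.execute(
            "SELECT accession FROM reports WHERE reports MATCH ?", ("effusion",)
        ).fetchall()
        print(hits)  # [('RAD-2',)]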

  19. The use of DRG for identifying clinical trials centers with high recruitment potential: a feasibility study.

    PubMed

    Aegerter, Philippe; Bendersky, Noelle; Tran, Thi-Chien; Ropers, Jacques; Taright, Namik; Chatellier, Gilles

    2014-01-01

    Recruitment of large samples of patients is crucial for the evidence level and efficacy of clinical trials (CTs). Clinical Trial Recruitment Support Systems (CTRSS) used to estimate patient recruitment are generally specific to particular Hospital Information Systems, and few have been evaluated on a large number of trials. Our aim was to assess, on a large number of CTs, the usefulness of commonly available data such as Diagnosis Related Groups (DRG) databases in order to estimate potential recruitment. We used the DRG database of a large French multicenter medical institution (1.2 million inpatient stays and 400 new trials each year). Eligibility criteria of protocols were broken down into atomic entities (diagnoses, procedures, treatments...), then translated into codes and operators recorded in a standardized form. A program parsed the forms and generated requests on the DRG database. A large majority of selection criteria could be coded, and the final estimates of the number of eligible patients were close to the observed ones (median difference = 25). Such a system could be part of the feasibility evaluation and center selection process before the start of a clinical trial.
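
    A toy version of the recruitment estimate (codes, fields, and records are invented): each eligibility criterion becomes an atomic test, and the protocol is evaluated as a conjunction of criteria over inpatient-stay records.

        # Each criterion is an atomic code test; the protocol is their conjunction.
        stays = [
            {"patient": 1, "diagnoses": {"E11"}, "age": 64},
            {"patient": 2, "diagnoses": {"I25"}, "age": 71},
            {"patient": 3, "diagnoses": {"E11", "I25"}, "age": 58},
        ]
        criteria = [
            lambda s: "E11" in s["diagnoses"],  # a required diagnosis code
            lambda s: s["age"] >= 60,           # a minimum age
        ]
        eligible = {s["patient"] for s in stays if all(c(s) for c in criteria)}
        print(len(eligible), "potentially eligible patients")  # 1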

  20. The Génolevures database.

    PubMed

    Martin, Tiphaine; Sherman, David J; Durrens, Pascal

    2011-01-01

    The Génolevures online database (URL: http://www.genolevures.org) stores and provides the data and results obtained by the Génolevures Consortium through several campaigns of genome annotation of the yeasts in the Saccharomycotina subphylum (hemiascomycetes). This database is dedicated to large-scale comparison of these genomes, storing not only the different chromosomal elements detected in the sequences, but also the logical relations between them. The database is divided into a public part, accessible to anyone through the Internet, and a private part where the Consortium members make genome annotations with our Magus annotation system; this system is used to annotate several related genomes in parallel. The public database is widely consulted and offers structured data, organized using a REST web site architecture that allows for automated requests. The implementation of the database, as well as its associated tools and methods, is evolving to cope with the influx of genome sequences produced by Next Generation Sequencing (NGS). Copyright © 2011 Académie des sciences. Published by Elsevier SAS. All rights reserved.

  1. The Cologne Database for Molecular Spectroscopy, CDMS, in the Virtual Atomic and Molecular Data Centre, VAMDC

    NASA Astrophysics Data System (ADS)

    Endres, Christian P.; Schlemmer, Stephan; Schilke, Peter; Stutzki, Jürgen; Müller, Holger S. P.

    2016-09-01

    The Cologne Database for Molecular Spectroscopy, CDMS, was founded in 1998 to provide in its catalog section line lists of mostly molecular species which are or may be observed in various astronomical sources (usually) by radio astronomical means. The line lists contain transition frequencies with qualified accuracies, intensities, quantum numbers, as well as further auxiliary information. They have been generated from critically evaluated experimental line lists, mostly from laboratory experiments, employing established Hamiltonian models. Separate entries exist for different isotopic species and usually also for different vibrational states. As of December 2015, the number of entries is 792. They are available online as ASCII tables with additional files documenting information on the entries. The Virtual Atomic and Molecular Data Centre, VAMDC, was founded more than 5 years ago as a common platform for atomic and molecular data. This platform facilitates exchange not only between spectroscopic databases related to astrophysics or astrochemistry, but also with collisional and kinetic databases. A dedicated infrastructure was developed to provide a common data format in the various databases, enabling queries to a large variety of databases on atomic and molecular data at once. For CDMS, the incorporation into VAMDC was combined with several modifications to the generation of CDMS catalog entries. Here we introduce the related changes to the data structure and the data content of the CDMS. The new data scheme allows us to incorporate all previous data entries, but in addition also allows us to include entries based on new theoretical descriptions. Moreover, the CDMS entries have been transferred into a MySQL database format. These developments within the VAMDC framework have in part been driven by the needs of the astronomical community to be able to deal efficiently with large data sets obtained with the Herschel Space Telescope or, more recently, with the Atacama Large Millimeter Array.

  2. Value of shared preclinical safety studies - The eTOX database.

    PubMed

    Briggs, Katharine; Barber, Chris; Cases, Montserrat; Marc, Philippe; Steger-Hartmann, Thomas

    2015-01-01

    A first analysis of a database of shared preclinical safety data for 1214 small molecule drugs and drug candidates, extracted from 3970 reports donated by thirteen pharmaceutical companies for the eTOX project (www.etoxproject.eu), is presented. Species, duration of exposure, and administration route data were analysed to assess whether large enough subsets of homogeneous data are available for building in silico predictive models. The prevalence of treatment-related effects for the different types of findings recorded was analysed. The eTOX ontology was used to determine the most common treatment-related clinical chemistry and histopathology findings reported in the database. The data were then mined to evaluate the sensitivity of established in vivo biomarkers for liver toxicity risk assessment. The value of the database in informing other drug development projects during early drug development is illustrated by a case study.

  3. ApoptoProteomics, an integrated database for analysis of proteomics data obtained from apoptotic cells.

    PubMed

    Arntzen, Magnus Ø; Thiede, Bernd

    2012-02-01

    Apoptosis is the most commonly described form of programmed cell death, and its dysfunction is implicated in a large number of human diseases. Many quantitative proteome analyses of apoptosis have been performed to gain insight into the proteins involved in the process. This resulted in large and complex data sets that are difficult to evaluate. Therefore, we developed the ApoptoProteomics database for storage, browsing, and analysis of the outcome of large-scale proteome analyses of apoptosis derived from human, mouse, and rat. The proteomics data of 52 publications were integrated and unified with protein annotations from UniProt-KB, the caspase substrate database homepage (CASBAH), and gene ontology. Currently, more than 2300 records of more than 1500 unique proteins are included, covering a large proportion of the core signaling pathways of apoptosis. Analysis of the data set revealed a high level of agreement between the changes in directionality reported in proteomics studies and the expected apoptosis-related functions, and may disclose proteins without a currently recognized involvement in apoptosis based on gene ontology. Comparison between induction of apoptosis by the intrinsic and the extrinsic apoptotic signaling pathways revealed slight differences. Furthermore, proteomics has significantly contributed to the field of apoptosis in identifying hundreds of caspase substrates. The database is available at http://apoptoproteomics.uio.no.

  4. ApoptoProteomics, an Integrated Database for Analysis of Proteomics Data Obtained from Apoptotic Cells*

    PubMed Central

    Arntzen, Magnus Ø.; Thiede, Bernd

    2012-01-01

    Apoptosis is the most commonly described form of programmed cell death, and its dysfunction is implicated in a large number of human diseases. Many quantitative proteome analyses of apoptosis have been performed to gain insight into the proteins involved in the process. This resulted in large and complex data sets that are difficult to evaluate. Therefore, we developed the ApoptoProteomics database for storage, browsing, and analysis of the outcome of large-scale proteome analyses of apoptosis derived from human, mouse, and rat. The proteomics data of 52 publications were integrated and unified with protein annotations from UniProt-KB, the caspase substrate database homepage (CASBAH), and gene ontology. Currently, more than 2300 records of more than 1500 unique proteins are included, covering a large proportion of the core signaling pathways of apoptosis. Analysis of the data set revealed a high level of agreement between the changes in directionality reported in proteomics studies and the expected apoptosis-related functions, and may disclose proteins without a currently recognized involvement in apoptosis based on gene ontology. Comparison between induction of apoptosis by the intrinsic and the extrinsic apoptotic signaling pathways revealed slight differences. Furthermore, proteomics has significantly contributed to the field of apoptosis in identifying hundreds of caspase substrates. The database is available at http://apoptoproteomics.uio.no. PMID:22067098

  5. A database paradigm for the management of DICOM-RT structure sets using a geographic information system

    NASA Astrophysics Data System (ADS)

    Shao, Weber; Kupelian, Patrick A.; Wang, Jason; Low, Daniel A.; Ruan, Dan

    2014-03-01

    We devise a paradigm for representing DICOM-RT structure sets in a database management system, in such a way that secondary calculations of geometric information can be performed quickly from the existing contour definitions. The implementation of this paradigm is achieved using the PostgreSQL database system and the PostGIS extension, a geographic information system commonly used for encoding geographical map data. The proposed paradigm eliminates the overhead of retrieving large data records from the database, as well as the need to implement various numerical and data parsing routines, when additional information related to the geometry of the anatomy is desired.
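
    A sketch of what a query against such a database might look like (assuming a running PostgreSQL server with PostGIS; the table and column names are invented, not the authors' schema): the contours are stored as geometries, and the server computes areas and centroids on demand.

        import psycopg2  # assumes a PostgreSQL server with the PostGIS extension

        # Ask the database, not client code, for geometric quantities.
        con = psycopg2.connect("dbname=rtdb")
        cur = con.cursor()
        cur.execute("""
            SELECT roi_name,
                   ST_Area(contour)                AS area_mm2,
                   ST_AsText(ST_Centroid(contour)) AS centroid
            FROM structure_set_contours
            WHERE slice_z = %s
        """, (42.5,))
        for roi_name, area, centroid in cur.fetchall():
            print(roi_name, area, centroid)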

  6. NVST Data Archiving System Based On FastBit NoSQL Database

    NASA Astrophysics Data System (ADS)

    Liu, Ying-bo; Wang, Feng; Ji, Kai-fan; Deng, Hui; Dai, Wei; Liang, Bo

    2014-06-01

    The New Vacuum Solar Telescope (NVST) is a 1-meter vacuum solar telescope that aims to observe the fine structures of active regions on the Sun. The main tasks of the NVST are high resolution imaging and spectral observations, including the measurements of the solar magnetic field. The NVST has collected more than 20 million FITS files since it began routine observations in 2012 and produces up to 120 thousand observational records in a day. Given the large number of files, effective archiving and retrieval becomes a critical and urgent problem. In this study, we implement a new data archiving system for the NVST based on the FastBit Not Only Structured Query Language (NoSQL) database. Compared to a relational database (i.e., MySQL; My Structured Query Language), the FastBit database shows distinctive advantages in indexing and querying performance. In a large-scale database of 40 million records, the multi-field combined query response time of the FastBit database is about 15 times faster and fully meets the requirements of the NVST. Our study offers a new approach to massive astronomical data archiving and can contribute to the design of data management systems for other astronomical telescopes.

  7. Generating Shifting Workloads to Benchmark Adaptability in Relational Database Systems

    NASA Astrophysics Data System (ADS)

    Rabl, Tilmann; Lang, Andreas; Hackl, Thomas; Sick, Bernhard; Kosch, Harald

    A large body of research concerns the adaptability of database systems. Many commercial systems already contain autonomic processes that adapt configurations as well as data structures and data organization. Yet there is virtually no way to fairly measure the quality of such optimizations. While standard benchmarks have been developed that simulate real-world database applications very precisely, none of them considers variations in workloads produced by human factors. Today’s benchmarks test the performance of database systems by measuring peak performance on homogeneous request streams. Nevertheless, in systems with user interaction, access patterns are constantly shifting. We present a benchmark that simulates a web information system with interaction of large user groups. It is based on the analysis of a real online eLearning management system with 15,000 users. The benchmark considers the temporal dependency of user interaction. Its main focus is to measure the adaptability of a database management system under shifting workloads. We also give details on our design approach, which uses sophisticated pattern analysis and data mining techniques.
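
    A toy generator of such a shifting workload (the query classes and drift schedule are invented for illustration): the mix of query types drifts smoothly over time, the way access patterns do when large user groups follow a schedule.

        import random
        from collections import Counter

        QUERY_TYPES = ["browse", "search", "submit", "report"]

        def mix_at(t, period=1000):
            """Drift linearly from a read-heavy to a write-heavy mix each period."""
            w = (t % period) / period
            read_heavy  = [0.6, 0.3, 0.05, 0.05]
            write_heavy = [0.2, 0.2, 0.5, 0.1]
            return [(1 - w) * r + w * s for r, s in zip(read_heavy, write_heavy)]

        def workload(n):
            for t in range(n):
                yield random.choices(QUERY_TYPES, weights=mix_at(t))[0]

        print(Counter(workload(2000)))  # counts shift between the two mixes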

  8. Clever generation of rich SPARQL queries from annotated relational schema: application to Semantic Web Service creation for biological databases.

    PubMed

    Wollbrett, Julien; Larmande, Pierre; de Lamotte, Frédéric; Ruiz, Manuel

    2013-04-15

    In recent years, a large amount of "-omics" data have been produced. However, these data are stored in many different species-specific databases that are managed by different institutes and laboratories. Biologists often need to find and assemble data from disparate sources to perform certain analyses. Searching for these data and assembling them is a time-consuming task. The Semantic Web helps to facilitate interoperability across databases. A common approach involves the development of wrapper systems that map a relational database schema onto existing domain ontologies. However, few attempts have been made to automate the creation of such wrappers. We developed a framework, named BioSemantic, for the creation of Semantic Web Services that are applicable to relational biological databases. This framework makes use of both Semantic Web and Web Services technologies and can be divided into two main parts: (i) the generation and semi-automatic annotation of an RDF view; and (ii) the automatic generation of SPARQL queries and their integration into Semantic Web Services backbones. We have used our framework to integrate genomic data from different plant databases. BioSemantic is a framework designed to speed up the integration of relational databases. We show how it can be used to speed up the development of Semantic Web Services for existing relational biological databases. Currently, it creates and annotates RDF views that enable the automatic generation of SPARQL queries. Web Services are also created and deployed automatically, and the semantic annotations of our Web Services are added automatically using SAWSDL attributes. BioSemantic is downloadable at http://southgreen.cirad.fr/?q=content/Biosemantic.
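
    The flavor of the SPARQL-generation step can be sketched as follows (a minimal illustration with an invented column-to-property mapping, not the BioSemantic code): each selected relational column contributes one triple pattern to a basic graph pattern.

        # Map relational columns to ontology properties, then emit a SELECT.
        mapping = {
            "gene.name":    "http://example.org/onto#geneName",
            "gene.species": "http://example.org/onto#fromSpecies",
        }

        def to_sparql(columns, filters):
            vars_ = " ".join(f"?{c.split('.')[1]}" for c in columns)
            patterns = "\n  ".join(
                f"?gene <{mapping[c]}> ?{c.split('.')[1]} ." for c in columns
            )
            conds = "\n  ".join(
                f'FILTER (?{c.split(".")[1]} = "{v}")' for c, v in filters.items()
            )
            return f"SELECT {vars_}\nWHERE {{\n  {patterns}\n  {conds}\n}}"

        print(to_sparql(["gene.name", "gene.species"],
                        {"gene.species": "Oryza sativa"}))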

  9. Clever generation of rich SPARQL queries from annotated relational schema: application to Semantic Web Service creation for biological databases

    PubMed Central

    2013-01-01

    Background In recent years, a large amount of “-omics” data have been produced. However, these data are stored in many different species-specific databases that are managed by different institutes and laboratories. Biologists often need to find and assemble data from disparate sources to perform certain analyses. Searching for these data and assembling them is a time-consuming task. The Semantic Web helps to facilitate interoperability across databases. A common approach involves the development of wrapper systems that map a relational database schema onto existing domain ontologies. However, few attempts have been made to automate the creation of such wrappers. Results We developed a framework, named BioSemantic, for the creation of Semantic Web Services that are applicable to relational biological databases. This framework makes use of both Semantic Web and Web Services technologies and can be divided into two main parts: (i) the generation and semi-automatic annotation of an RDF view; and (ii) the automatic generation of SPARQL queries and their integration into Semantic Web Services backbones. We have used our framework to integrate genomic data from different plant databases. Conclusions BioSemantic is a framework designed to speed up the integration of relational databases. We show how it can be used to speed up the development of Semantic Web Services for existing relational biological databases. Currently, it creates and annotates RDF views that enable the automatic generation of SPARQL queries. Web Services are also created and deployed automatically, and the semantic annotations of our Web Services are added automatically using SAWSDL attributes. BioSemantic is downloadable at http://southgreen.cirad.fr/?q=content/Biosemantic. PMID:23586394

  10. Exploring Human Cognition Using Large Image Databases.

    PubMed

    Griffiths, Thomas L; Abbott, Joshua T; Hsu, Anne S

    2016-07-01

    Most cognitive psychology experiments evaluate models of human cognition using a relatively small, well-controlled set of stimuli. This approach stands in contrast to current work in neuroscience, perception, and computer vision, which have begun to focus on using large databases of natural images. We argue that natural images provide a powerful tool for characterizing the statistical environment in which people operate, for better evaluating psychological theories, and for bringing the insights of cognitive science closer to real applications. We discuss how some of the challenges of using natural images as stimuli in experiments can be addressed through increased sample sizes, using representations from computer vision, and developing new experimental methods. Finally, we illustrate these points by summarizing recent work using large image databases to explore questions about human cognition in four different domains: modeling subjective randomness, defining a quantitative measure of representativeness, identifying prior knowledge used in word learning, and determining the structure of natural categories. Copyright © 2016 Cognitive Science Society, Inc.

  11. Information integration for a sky survey by data warehousing

    NASA Astrophysics Data System (ADS)

    Luo, A.; Zhang, Y.; Zhao, Y.

    The virtualization service of the data system for the sky survey LAMOST is very important for astronomers. The service needs to integrate information from data collections, catalogs, and references, and to support simple federation of a set of distributed files and associated metadata. Data warehousing has been in existence for several years and has demonstrated superiority over traditional relational database management systems by providing novel indexing schemes that support efficient on-line analytical processing (OLAP) of large databases. Relational database systems such as Oracle now support the warehouse capability, which includes extensions to the SQL language to support OLAP operations, and a number of metadata management tools have been created. The information integration of LAMOST by applying data warehousing is intended to effectively provide data and knowledge on-line.

  12. Design and implementation of relational databases relevant to the diverse needs of a tuberculosis case contact study in the Gambia.

    PubMed

    Jeffries, D J; Donkor, S; Brookes, R H; Fox, A; Hill, P C

    2004-09-01

    The data requirements of a large multidisciplinary tuberculosis case contact study are complex. We describe a Microsoft Access-based relational database system that meets our rigorous requirements for data entry and validation, while being user-friendly, flexible, exportable, and easy to install on a network or stand-alone system. This includes the development of a double data entry package for epidemiology and laboratory data, semi-automated entry of ELISPOT data directly from the plate reader, and a suite of new programmes for the manipulation and integration of flow cytometry data. The double-entered epidemiology and immunology databases are combined into a separate database, providing near-real-time analysis of immuno-epidemiological data and allowing important trends to be identified early and major decisions about the study to be made and acted on. This dynamic data management model is portable and can easily be applied to other studies.
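
    The double data entry idea reduces keying errors by comparing two independent transcriptions of the same records; a toy sketch (field names and values invented):

        # Flag disagreements between two independent transcriptions for review.
        entry_1 = {("subj-001", "bcg_scar"): "yes", ("subj-001", "age"): "34"}
        entry_2 = {("subj-001", "bcg_scar"): "yes", ("subj-001", "age"): "43"}

        mismatches = [
            (key, entry_1[key], entry_2[key])
            for key in entry_1.keys() & entry_2.keys()
            if entry_1[key] != entry_2[key]
        ]
        for (record, field), v1, v2 in mismatches:
            print(f"{record}/{field}: first={v1!r}, second={v2!r}")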

  13. Efficient hemodynamic event detection utilizing relational databases and wavelet analysis

    NASA Technical Reports Server (NTRS)

    Saeed, M.; Mark, R. G.

    2001-01-01

    Development of a temporal query framework for time-oriented medical databases has hitherto been a challenging problem. We describe a novel method for the detection of hemodynamic events in multiparameter trends utilizing wavelet coefficients in a MySQL relational database. Storage of the wavelet coefficients allowed for a compact representation of the trends, and provided robust descriptors for the dynamics of the parameter time series. A data model was developed to allow for simplified queries along several dimensions and time scales. Of particular importance, the data model and wavelet framework allowed for queries to be processed with minimal table-join operations. A web-based search engine was developed to allow for user-defined queries. Typical queries required between 0.01 and 0.02 seconds, with at least two orders of magnitude improvement in speed over conventional queries. This powerful and innovative structure will facilitate research on large-scale time-oriented medical databases.
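
    The storage idea can be sketched as follows (using PyWavelets and SQLite purely for illustration; the paper's implementation is server-side in MySQL): decompose a parameter trend into wavelet coefficients and store the coarse coefficients as a compact, queryable descriptor of the trend's dynamics.

        import sqlite3
        import pywt  # PyWavelets, standing in for the paper's server-side code

        trend = [72, 71, 73, 75, 90, 95, 93, 88]  # e.g. heart-rate samples
        coarse, *details = pywt.wavedec(trend, "db1", level=2)

        con = sqlite3.connect(":memory:")
        con.execute("CREATE TABLE trend_coeffs"
                    " (signal TEXT, level INTEGER, idx INTEGER, coeff REAL)")
        con.executemany(
            "INSERT INTO trend_coeffs VALUES ('hr', 2, ?, ?)",
            [(i, float(c)) for i, c in enumerate(coarse)],
        )
        # Queries can now compare trends by a handful of coefficients instead
        # of scanning the raw time series.
        print(con.execute(
            "SELECT coeff FROM trend_coeffs ORDER BY idx").fetchall())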

  14. Factors Affecting Volunteering among Older Rural and City Dwelling Adults in Australia

    ERIC Educational Resources Information Center

    Warburton, Jeni; Stirling, Christine

    2007-01-01

    In the absence of large scale Australian studies of volunteering among older adults, this study compared the relevance of two theoretical approaches--social capital theory and sociostructural resources theory--to predict voluntary activity in relation to a large national database. The paper explores volunteering by older people (aged 55+) in order…

  15. Informatics in neurocritical care: new ideas for Big Data.

    PubMed

    Flechet, Marine; Grandas, Fabian Güiza; Meyfroidt, Geert

    2016-04-01

    Big data is the new hype in business and healthcare. Data storage and processing has become cheap, fast, and easy. Business analysts and scientists are trying to design methods to mine these data for hidden knowledge. Neurocritical care is a field that typically produces large amounts of patient-related data, and these data are increasingly being digitized and stored. This review will try to look beyond the hype, and focus on possible applications in neurointensive care amenable to Big Data research that can potentially improve patient care. The first challenge in Big Data research will be the development of large, multicenter, and high-quality databases. These databases could be used to further investigate recent findings from mathematical models, developed in smaller datasets. Randomized clinical trials and Big Data research are complementary. Big Data research might be used to identify subgroups of patients that could benefit most from a certain intervention, or can be an alternative in areas where randomized clinical trials are not possible. The processing and the analysis of the large amount of patient-related information stored in clinical databases is beyond normal human cognitive ability. Big Data research applications have the potential to discover new medical knowledge, and improve care in the neurointensive care unit.

  16. Similarity Search in Large Collections of Biometric Data

    DTIC Science & Technology

    2009-10-01

    instantaneous identification of a person by converting the biometric into a digital form and then comparing it against a computerized database. They can...combined to get reliable results. Exact matches in biometric collections have very little meaning, and only a relative ordering of database objects with...running several indices for different aspects of the data, e.g. facial features, fingerprints and palmprints of a person, together. The system then

  17. The database of the PREDICTS (Projecting Responses of Ecological Diversity In Changing Terrestrial Systems) project

    Treesearch

    Lawrence N. Hudson; Joseph Wunderle M.; And Others

    2016-01-01

    The PREDICTS project—Projecting Responses of Ecological Diversity In Changing Terrestrial Systems (www.predicts.org.uk)—has collated from published studies a large, reasonably representative database of comparable samples of biodiversity from multiple sites that differ in the nature or intensity of human impacts relating to land use. We have used this evidence base to...

  18. Discovery of parvovirus-related sequences in an unexpected broad range of animals.

    PubMed

    François, S; Filloux, D; Roumagnac, P; Bigot, D; Gayral, P; Martin, D P; Froissart, R; Ogliastro, M

    2016-09-07

    Our knowledge of the genetic diversity and host ranges of viruses is fragmentary. This is particularly true for the Parvoviridae family. Genetic diversity studies of single stranded DNA viruses within this family have been largely focused on arthropod- and vertebrate-infecting species that cause diseases of humans and our domesticated animals: a focus that has biased our perception of parvovirus diversity. While metagenomics approaches could help rectify this bias, so too could transcriptomics studies. Large amounts of transcriptomic data are available for a diverse array of animal species and whenever this data has inadvertently been gathered from virus-infected individuals, it could contain detectable viral transcripts. We therefore performed a systematic search for parvovirus-related sequences (PRSs) within publicly available transcript, genome and protein databases and eleven new transcriptome datasets. This revealed 463 PRSs in the transcript databases of 118 animals. At least 41 of these PRSs are likely integrated within animal genomes in that they were also found within genomic sequence databases. Besides illuminating the ubiquity of parvoviruses, the number of parvoviral sequences discovered within public databases revealed numerous previously unknown parvovirus-host combinations; particularly in invertebrates. Our findings suggest that the host-ranges of extant parvoviruses might span the entire animal kingdom.

  19. Graph Databases for Large-Scale Healthcare Systems: A Framework for Efficient Data Management and Data Services

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Park, Yubin; Shankar, Mallikarjun; Park, Byung H.

    Designing a database system for both efficient data management and data services has been one of the enduring challenges in the healthcare domain. In many healthcare systems, data services and data management are often viewed as two orthogonal tasks; data services refer to retrieval and analytic queries such as search, joins, statistical data extraction, and simple data mining algorithms, while data management refers to building error-tolerant and non-redundant database systems. The gap between service and management has resulted in rigid database systems and schemas that do not support effective analytics. We compose a rich graph structure from an abstracted healthcare RDBMS to illustrate how we can fill this gap in practice. We show how a healthcare graph can be automatically constructed from a normalized relational database using the proposed 3NF Equivalent Graph (3EG) transformation. We discuss a set of real world graph queries such as finding self-referrals, shared providers, and collaborative filtering, and evaluate their performance over a relational database and its 3EG-transformed graph. Experimental results show that the graph representation serves as multiple de-normalized tables, thus reducing complexity in a database and enhancing data accessibility of users. Based on this finding, we propose an ensemble framework of databases for healthcare applications.
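
    The row-and-foreign-key-to-graph idea can be illustrated in a few lines (a toy, not the paper's 3EG algorithm; node names are invented): rows become nodes, foreign-key references become edges, and a multi-join query such as "shared providers" turns into a neighborhood lookup.

        import networkx as nx

        # Rows as nodes, foreign-key references as edges.
        G = nx.Graph()
        G.add_node("patient:1", kind="patient")
        G.add_node("provider:a", kind="provider")
        G.add_node("provider:b", kind="provider")
        G.add_edge("patient:1", "provider:a", relation="visited")
        G.add_edge("patient:1", "provider:b", relation="visited")

        # Providers linked through a common patient: a two-hop neighborhood.
        patients = [n for n, d in G.nodes(data=True) if d["kind"] == "patient"]
        shared = {p for patient in patients for p in G[patient]}
        print(shared)  # {'provider:a', 'provider:b'}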

  20. Global search tool for the Advanced Photon Source Integrated Relational Model of Installed Systems (IRMIS) database.

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Quock, D. E. R.; Cianciarulo, M. B.; APS Engineering Support Division

    2007-01-01

    The Integrated Relational Model of Installed Systems (IRMIS) is a relational database tool that has been implemented at the Advanced Photon Source to maintain an updated account of approximately 600 control system software applications, 400,000 process variables, and 30,000 control system hardware components. To effectively display this large amount of control system information to operators and engineers, IRMIS was initially built with nine Web-based viewers: Applications Organizing Index, IOC, PLC, Component Type, Installed Components, Network, Controls Spares, Process Variables, and Cables. However, since each viewer is designed to provide details from only one major category of the control system, the necessity for a one-stop global search tool for the entire database became apparent. The user requirements for extremely fast database search time and ease of navigation through search results led to the choice of Asynchronous JavaScript and XML (AJAX) technology in the implementation of the IRMIS global search tool. Unique features of the global search tool include a two-tier level of displayed search results, and a database data integrity validation and reporting mechanism.

  1. Stereoselective virtual screening of the ZINC database using atom pair 3D-fingerprints.

    PubMed

    Awale, Mahendra; Jin, Xian; Reymond, Jean-Louis

    2015-01-01

    Tools to explore large compound databases in search of analogs of query molecules provide strategically important support in drug discovery, helping to identify available analogs of any given reference or hit compound by ligand-based virtual screening (LBVS). We recently showed that large databases can be formatted for very fast searching with various 2D-fingerprints using the city-block distance as similarity measure, in particular a 2D-atom pair fingerprint (APfp) and the related category extended atom pair fingerprint (Xfp), which efficiently encode molecular shape and pharmacophores but do not perceive stereochemistry. Here we investigated related 3D-atom pair fingerprints to enable rapid stereoselective searches in the ZINC database (23.2 million 3D structures). Molecular fingerprints counting atom pairs at increasing through-space distance intervals were designed using either all atoms (16-bit 3DAPfp) or different atom categories (80-bit 3DXfp). These 3D-fingerprints retrieved molecular shape and pharmacophore analogs (defined by OpenEye ROCS scoring functions) of 110,000 compounds from the Cambridge Structural Database with equal or better accuracy than the 2D-fingerprints APfp and Xfp, and showed comparable performance in recovering actives from decoys in the DUD database. LBVS by 3DXfp or 3DAPfp similarity was stereoselective and gave very different analogs when starting from different diastereomers of the same chiral drug. Results were also different from LBVS with the parent 2D-fingerprints Xfp or APfp. 3D- and 2D-fingerprints also gave very different results in LBVS of folded molecules, where through-space distances between atom pairs are much shorter than topological distances. 3DAPfp and 3DXfp are suitable for stereoselective searches for shape and pharmacophore analogs of query molecules in large databases. Web browsers for searching ZINC by 3DAPfp and 3DXfp similarity are accessible at www.gdb.unibe.ch and should provide useful assistance to drug discovery projects. Graphical abstract: Atom pair fingerprints based on through-space distances (3DAPfp) provide better shape encoding than atom pair fingerprints based on topological distances (APfp), as measured by the recovery of ROCS shape analogs by fingerprint similarity.
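
    A toy version of a through-space atom-pair fingerprint and its city-block comparison (the binning scheme and coordinates are invented; the real 3DAPfp details differ):

        import numpy as np

        def ap3d_fingerprint(coords, bins=np.arange(0.0, 16.0, 1.0)):
            """Histogram of pairwise through-space distances between atoms."""
            coords = np.asarray(coords, dtype=float)
            diff = coords[:, None, :] - coords[None, :, :]
            dists = np.sqrt((diff ** 2).sum(-1))
            iu = np.triu_indices(len(coords), k=1)  # each unordered pair once
            hist, _ = np.histogram(dists[iu], bins=bins)
            return hist

        def city_block(fp1, fp2):
            return int(np.abs(fp1 - fp2).sum())

        mol_a = [[0, 0, 0], [1.5, 0, 0], [3.0, 0, 0]]    # a linear fragment
        mol_b = [[0, 0, 0], [1.5, 0, 0], [1.5, 1.5, 0]]  # a bent fragment
        print(city_block(ap3d_fingerprint(mol_a), ap3d_fingerprint(mol_b)))  # 2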

  2. A data model and database for high-resolution pathology analytical image informatics.

    PubMed

    Wang, Fusheng; Kong, Jun; Cooper, Lee; Pan, Tony; Kurc, Tahsin; Chen, Wenjin; Sharma, Ashish; Niedermayr, Cristobal; Oh, Tae W; Brat, Daniel; Farris, Alton B; Foran, David J; Saltz, Joel

    2011-01-01

    The systematic analysis of imaged pathology specimens often results in a vast amount of morphological information at both the cellular and sub-cellular scales. While microscopy scanners and computerized analysis are capable of capturing and analyzing data rapidly, microscopy image data remain underutilized in research and clinical settings. One major obstacle which tends to reduce wider adoption of these new technologies throughout the clinical and scientific communities is the challenge of managing, querying, and integrating the vast amounts of data resulting from the analysis of large digital pathology datasets. This paper presents a data model, which addresses these challenges, and demonstrates its implementation in a relational database system. This paper describes a data model, referred to as Pathology Analytic Imaging Standards (PAIS), and a database implementation, which are designed to support the data management and query requirements of detailed characterization of micro-anatomic morphology through many interrelated analysis pipelines on whole-slide images and tissue microarrays (TMAs). (1) Development of a data model capable of efficiently representing and storing virtual slide related image, annotation, markup, and feature information. (2) Development of a database, based on the data model, capable of supporting queries for data retrieval based on analysis and image metadata, queries for comparison of results from different analyses, and spatial queries on segmented regions, features, and classified objects. The work described in this paper is motivated by the challenges associated with characterization of micro-scale features for comparative and correlative analyses involving whole-slides tissue images and TMAs. Technologies for digitizing tissues have advanced significantly in the past decade. Slide scanners are capable of producing high-magnification, high-resolution images from whole slides and TMAs within several minutes. Hence, it is becoming increasingly feasible for basic, clinical, and translational research studies to produce thousands of whole-slide images. Systematic analysis of these large datasets requires efficient data management support for representing and indexing results from hundreds of interrelated analyses generating very large volumes of quantifications such as shape and texture and of classifications of the quantified features. We have designed a data model and a database to address the data management requirements of detailed characterization of micro-anatomic morphology through many interrelated analysis pipelines. The data model represents virtual slide related image, annotation, markup and feature information. The database supports a wide range of metadata and spatial queries on images, annotations, markups, and features. We currently have three databases running on a Dell PowerEdge T410 server with CentOS 5.5 Linux operating system. The database server is IBM DB2 Enterprise Edition 9.7.2. The set of databases consists of 1) a TMA database containing image analysis results from 4740 cases of breast cancer, with 641 MB storage size; 2) an algorithm validation database, which stores markups and annotations from two segmentation algorithms and two parameter sets on 18 selected slides, with 66 GB storage size; and 3) an in silico brain tumor study database comprising results from 307 TCGA slides, with 365 GB storage size. The latter two databases also contain human-generated annotations and markups for regions and nuclei. 
Modeling and managing pathology image analysis results in a database provides immediate benefits for the value and usability of data in a research study. The database provides powerful query capabilities that are otherwise difficult or cumbersome to support with other approaches, such as programming languages. Standardized, semantically annotated data representations and interfaces also make it possible to share image data and analysis results more efficiently.

  3. Techniques for Efficiently Managing Large Geosciences Data Sets

    NASA Astrophysics Data System (ADS)

    Kruger, A.; Krajewski, W. F.; Bradley, A. A.; Smith, J. A.; Baeck, M. L.; Steiner, M.; Lawrence, R. E.; Ramamurthy, M. K.; Weber, J.; Delgreco, S. A.; Domaszczynski, P.; Seo, B.; Gunyon, C. A.

    2007-12-01

    We have developed techniques and software tools for efficiently managing large geosciences data sets. While the techniques were developed as part of an NSF-funded ITR project that focuses on making NEXRAD weather data and rainfall products available to hydrologists and other scientists, they are relevant to other geosciences disciplines that deal with large data sets. Metadata, relational databases, data compression, and networking are central to our methodology. Data and derived products are stored on file servers in a compressed format. URLs to, and metadata about, the data and derived products are managed in a PostgreSQL database. Virtually all access to the data and products is through this database. Geosciences data normally require a number of processing steps to transform the raw data into useful products: data quality assurance, coordinate transformations and georeferencing, applying calibration information, and many more. We have developed the concept of crawlers that manage this scientific workflow. Crawlers are unattended processes that run indefinitely, and at set intervals query the database for their next assignment. A database table functions as a roster for the crawlers. Crawlers perform well-defined tasks that are, except for perhaps sequencing, largely independent from other crawlers. Once a crawler is done with its current assignment, it updates the database roster table, and gets its next assignment by querying the database. We have developed a library that enables one to quickly add crawlers. The library provides hooks to external (i.e., C-language) compiled codes, so that developers can work and contribute independently. Processes called ingesters inject data into the system. The bulk of the data are from a real-time feed using UCAR/Unidata's IDD/LDM software. An exciting recent development is the establishment of a Unidata HYDRO feed that carries value-added metadata over the IDD/LDM. Ingesters grab the metadata and populate the PostgreSQL tables. These and other concepts we have developed have enabled us to efficiently manage a 70 TB (and growing) weather radar data set.
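
    The roster-driven crawler pattern described above is easy to prototype. Below is a minimal sketch in Python, using sqlite3 for self-containment (the project itself uses PostgreSQL); the table layout, column names, and task types are hypothetical illustrations, not the project's actual schema.

        # Minimal sketch of the crawler/roster pattern: a roster table holds
        # pending tasks, and each crawler polls it for its next assignment.
        import sqlite3
        import time

        conn = sqlite3.connect("workflow.db")
        conn.execute("""CREATE TABLE IF NOT EXISTS roster (
                          task_id   INTEGER PRIMARY KEY,
                          task_type TEXT,    -- e.g. 'georeference', 'qc'
                          input_url TEXT,    -- file-server URL of the input
                          status    TEXT DEFAULT 'pending')""")

        def crawl(task_type, handler, poll_seconds=30):
            """Run indefinitely: claim the next pending task of our type,
            process it, and record completion in the roster."""
            while True:
                row = conn.execute(
                    "SELECT task_id, input_url FROM roster "
                    "WHERE task_type = ? AND status = 'pending' LIMIT 1",
                    (task_type,)).fetchone()
                if row is None:
                    time.sleep(poll_seconds)   # nothing to do; poll later
                    continue
                task_id, url = row
                conn.execute("UPDATE roster SET status = 'running' "
                             "WHERE task_id = ?", (task_id,))
                conn.commit()
                handler(url)                   # the crawler's well-defined task
                conn.execute("UPDATE roster SET status = 'done' "
                             "WHERE task_id = ?", (task_id,))
                conn.commit()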

  4. ORFer--retrieval of protein sequences and open reading frames from GenBank and storage into relational databases or text files.

    PubMed

    Büssow, Konrad; Hoffmann, Steve; Sievert, Volker

    2002-12-19

    Functional genomics involves parallel experimentation with large sets of proteins, which requires management of large sets of open reading frames as a prerequisite for the cloning and recombinant expression of these proteins. A Java program was developed for retrieval of protein and nucleic acid sequences and annotations from NCBI GenBank, using the XML sequence format. Annotations retrieved by ORFer include the sequence name, the organism, and the completeness of the sequence. The program has a graphical user interface, although it can also be used in a non-interactive mode. For protein sequences, the program also extracts the open reading frame sequence, if available, and checks its correct translation. ORFer accepts user input in the form of single GenBank GI identifiers or accession numbers, or lists of them. It can be used to extract complete sets of open reading frames and protein sequences from any kind of GenBank sequence entry, including complete genomes or chromosomes. Sequences are either stored with their features in a relational database or can be exported as text files in FASTA or tab-delimited format. The ORFer program is freely available at http://www.proteinstrukturfabrik.de/orfer. It allows for fast retrieval of DNA sequences, protein sequences and their open reading frames, and sequence annotations from GenBank. Furthermore, storage of sequences and features in a relational database is supported. Such a database can supplement a laboratory information management system (LIMS) with appropriate sequence information.
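
    The translation check ORFer performs is simple to illustrate. The sketch below is not ORFer's actual code (which is Java); it assumes Biopython is installed and merely verifies that an open reading frame translates to the annotated protein.

        # Verify that an ORF's translation matches the annotated protein.
        # Requires Biopython; a toy stand-in for ORFer's consistency check.
        from Bio.Seq import Seq

        def orf_matches_protein(orf_nt, protein_aa):
            """True if the ORF translates (up to the stop codon) to the protein."""
            return str(Seq(orf_nt).translate(to_stop=True)) == protein_aa

        # ATG GCT TAA encodes "MA" followed by a stop codon.
        assert orf_matches_protein("ATGGCTTAA", "MA")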

  5. Cloud-Based Distributed Control of Unmanned Systems

    DTIC Science & Technology

    2015-04-01

    during mission execution. At best, the data is saved onto hard-drives and is accessible only by the local team. Data history in a form available and...following open source technologies: GeoServer, OpenLayers, PostgreSQL, and PostGIS are chosen to implement the back-end database and server. A brief...geospatial map data. 3. PostgreSQL: An SQL-compliant object-relational database that easily scales to accommodate large amounts of data - upwards to

  6. Network Configuration of Oracle and Database Programming Using SQL

    NASA Technical Reports Server (NTRS)

    Davis, Melton; Abdurrashid, Jibril; Diaz, Philip; Harris, W. C.

    2000-01-01

    A database can be defined as a collection of information organized in such a way that it can be retrieved and used. A database management system (DBMS) can further be defined as the tool that enables us to manage and interact with the database. The Oracle 8 Server is a state-of-the-art information management environment. It is a repository for very large amounts of data, and gives users rapid access to that data. The Oracle 8 Server allows for sharing of data between applications; the information is stored in one place and used by many systems. My research will focus primarily on SQL (Structured Query Language) programming. SQL is the way you define and manipulate data in Oracle's relational database. SQL is the industry standard adopted by all database vendors. When programming with SQL, you work on sets of data (i.e., information is not processed one record at a time).
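
    The set-at-a-time model mentioned at the end of the abstract is easy to demonstrate. The sketch below uses Python's sqlite3 module rather than Oracle 8, purely so the example is self-contained; the table and values are invented.

        # One UPDATE statement touches every qualifying row; there is no
        # explicit record-by-record loop in the client code.
        import sqlite3

        conn = sqlite3.connect(":memory:")
        conn.execute("CREATE TABLE employees (name TEXT, dept TEXT, salary REAL)")
        conn.executemany("INSERT INTO employees VALUES (?, ?, ?)",
                         [("Ada", "ENG", 100.0), ("Ben", "ENG", 90.0),
                          ("Cal", "OPS", 80.0)])

        # Set-based: raise every engineering salary in a single statement.
        conn.execute("UPDATE employees SET salary = salary * 1.05 "
                     "WHERE dept = 'ENG'")

        for row in conn.execute("SELECT name, salary FROM employees ORDER BY name"):
            print(row)   # Ada and Ben raised; Cal unchanged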

  7. Large-scale silviculture experiments of western Oregon and Washington.

    Treesearch

    Nathan J. Poage; Paul D. Anderson

    2007-01-01

    We review 12 large-scale silviculture experiments (LSSEs) in western Washington and Oregon with which the Pacific Northwest Research Station of the USDA Forest Service is substantially involved. We compiled and arrayed information about the LSSEs as a series of matrices in a relational database, which is included on the compact disc published with this report and...

  8. Extracting Databases from Dark Data with DeepDive.

    PubMed

    Zhang, Ce; Shin, Jaeho; Ré, Christopher; Cafarella, Michael; Niu, Feng

    2016-01-01

    DeepDive is a system for extracting relational databases from dark data: the mass of text, tables, and images that are widely collected and stored but which cannot be exploited by standard relational tools. If the information in dark data - scientific papers, Web classified ads, customer service notes, and so on - were instead in a relational database, it would give analysts a massive and valuable new set of "big data." DeepDive is distinctive when compared to previous information extraction systems in its ability to obtain very high precision and recall at reasonable engineering cost; in a number of applications, we have used DeepDive to create databases with accuracy that meets that of human annotators. To date we have successfully deployed DeepDive to create data-centric applications for insurance, materials science, genomics, paleontology, law enforcement, and others. The data unlocked by DeepDive represents a massive opportunity for industry, government, and scientific researchers. DeepDive is enabled by an unusual design that combines large-scale probabilistic inference with a novel developer interaction cycle. This design is enabled by several core innovations around probabilistic training and inference.

  9. Chess databases as a research vehicle in psychology: Modeling large data.

    PubMed

    Vaci, Nemanja; Bilalić, Merim

    2017-08-01

    The game of chess has often been used for psychological investigations, particularly in cognitive science. The clear-cut rules and well-defined environment of chess provide a model for investigations of basic cognitive processes, such as perception, memory, and problem solving, while the precise rating system for the measurement of skill has enabled investigations of individual differences and expertise-related effects. In the present study, we focus on another appealing feature of chess, namely the large archival databases associated with the game. The German national chess database presented in this study represents fruitful ground for the investigation of multiple longitudinal research questions, since it collects the data of over 130,000 players and spans over 25 years. The database covers all players, including hobby players, and all tournaments played, resulting in a rich and complete collection of the skill, age, and activity of the whole population of chess players in Germany. The database therefore complements the commonly used expertise approach in cognitive science by opening up new possibilities for the investigation of multiple factors that underlie expertise and skill acquisition. Since large datasets are not common in psychology, their introduction also raises the question of optimal and efficient statistical analysis. We offer the database for download and illustrate how it can be used by providing concrete examples and a step-by-step tutorial using different statistical analyses on a range of topics, including skill development over the lifetime, birth cohort effects, effects of activity and inactivity on skill, and gender differences.

  10. A user-defined data type for the storage of time series data allowing efficient similarity screening.

    PubMed

    Sorokin, Anatoly; Selkov, Gene; Goryanin, Igor

    2012-07-16

    The volume of the experimentally measured time series data is rapidly growing, while storage solutions offering better data types than simple arrays of numbers or opaque blobs for keeping series data are sorely lacking. A number of indexing methods have been proposed to provide efficient access to time series data, but none has so far been integrated into a tried-and-proven database system. To explore the possibility of such integration, we have developed a data type for time series storage in PostgreSQL, an object-relational database system, and equipped it with an access method based on SAX (Symbolic Aggregate approXimation). This new data type has been successfully tested in a database supporting a large-scale plant gene expression experiment, and it was additionally tested on a very large set of simulated time series data. Copyright © 2011 Elsevier B.V. All rights reserved.
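
    SAX itself is compact enough to sketch. The snippet below is a generic textbook SAX in Python, not the paper's PostgreSQL access method: z-normalize, reduce the series by piecewise aggregate approximation (PAA), then discretize against standard-normal breakpoints (a 4-letter alphabet here; the series length is assumed to be a multiple of the segment count).

        import numpy as np

        BREAKPOINTS = np.array([-0.6745, 0.0, 0.6745])  # N(0,1) quartile bounds
        ALPHABET = "abcd"

        def sax(series, n_segments):
            x = np.asarray(series, dtype=float)
            x = (x - x.mean()) / x.std()              # z-normalize
            # PAA: mean of each equal-width segment
            paa = x.reshape(n_segments, -1).mean(axis=1)
            # Map each segment mean to a symbol via the Gaussian breakpoints
            return "".join(ALPHABET[i] for i in np.searchsorted(BREAKPOINTS, paa))

        print(sax([0, 1, 2, 3, 4, 5, 6, 7], n_segments=4))   # -> 'abcd'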

  11. Massive parallelization of serial inference algorithms for a complex generalized linear model

    PubMed Central

    Suchard, Marc A.; Simpson, Shawn E.; Zorych, Ivan; Ryan, Patrick; Madigan, David

    2014-01-01

    Following a series of high-profile drug safety disasters in recent years, many countries are redoubling their efforts to ensure the safety of licensed medical products. Large-scale observational databases such as claims databases or electronic health record systems are attracting particular attention in this regard, but present significant methodological and computational concerns. In this paper we show how high-performance statistical computation, including graphics processing units, relatively inexpensive highly parallel computing devices, can enable complex methods in large databases. We focus on optimization and massive parallelization of cyclic coordinate descent approaches to fit a conditioned generalized linear model involving tens of millions of observations and thousands of predictors in a Bayesian context. We find orders-of-magnitude improvement in overall run-time. Coordinate descent approaches are ubiquitous in high-dimensional statistics and the algorithms we propose open up exciting new methodological possibilities with the potential to significantly improve drug safety. PMID:25328363
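
    The paper's model is a conditioned GLM fitted at much larger scale, but the update structure the authors parallelize can be seen in generic cyclic coordinate descent. The sketch below solves a lasso problem (squared loss plus L1 penalty) one coordinate at a time; the data and penalty are invented.

        import numpy as np

        def soft_threshold(z, t):
            return np.sign(z) * max(abs(z) - t, 0.0)

        def lasso_cd(X, y, lam, n_sweeps=100):
            n, p = X.shape
            beta = np.zeros(p)
            resid = y.copy()                   # residual for beta = 0
            col_sq = (X ** 2).sum(axis=0)
            for _ in range(n_sweeps):
                for j in range(p):             # cycle through coordinates
                    resid += X[:, j] * beta[j]     # drop j's contribution
                    rho = X[:, j] @ resid          # partial correlation
                    beta[j] = soft_threshold(rho, lam) / col_sq[j]
                    resid -= X[:, j] * beta[j]     # restore updated contribution
            return beta

        rng = np.random.default_rng(0)
        X = rng.normal(size=(200, 10))
        y = X[:, 0] * 3.0 + rng.normal(size=200)
        print(lasso_cd(X, y, lam=10.0).round(2))   # coordinate 0 dominates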

  12. SchizConnect: Mediating Neuroimaging Databases on Schizophrenia and Related Disorders for Large-Scale Integration

    PubMed Central

    Wang, Lei; Alpert, Kathryn I.; Calhoun, Vince D.; Cobia, Derin J.; Keator, David B.; King, Margaret D.; Kogan, Alexandr; Landis, Drew; Tallis, Marcelo; Turner, Matthew D.; Potkin, Steven G.; Turner, Jessica A.; Ambite, Jose Luis

    2015-01-01

    SchizConnect (www.schizconnect.org) is built to address the issues of multiple data repositories in schizophrenia neuroimaging studies. It includes a level of mediation—translating across data sources—so that the user can place one query, e.g. for diffusion images from male individuals with schizophrenia, and find out from across participating data sources how many datasets there are, as well as downloading the imaging and related data. The current version handles the Data Usage Agreements across different studies, as well as interpreting database-specific terminologies into a common framework. New data repositories can also be mediated to bring immediate access to existing datasets. Compared with centralized, upload data sharing models, SchizConnect is a unique, virtual database with a focus on schizophrenia and related disorders that can mediate live data as information is being updated at each data source. It is our hope that SchizConnect can facilitate testing new hypotheses through aggregated datasets, promoting discovery related to the mechanisms underlying schizophrenic dysfunction. PMID:26142271

  13. Simple re-instantiation of small databases using cloud computing.

    PubMed

    Tan, Tin Wee; Xie, Chao; De Silva, Mark; Lim, Kuan Siong; Patro, C Pawan K; Lim, Shen Jean; Govindarajan, Kunde Ramamoorthy; Tong, Joo Chuan; Choo, Khar Heng; Ranganathan, Shoba; Khan, Asif M

    2013-01-01

    Small bioinformatics databases, unlike institutionally funded large databases, are vulnerable to discontinuation, and many databases reported in publications are no longer accessible. This leads to irreproducible scientific work and redundant effort, impeding the pace of scientific progress. We describe a Web-accessible system, available online at http://biodb100.apbionet.org, for archival and future on-demand re-instantiation of small databases within minutes. Depositors can rebuild their databases by downloading a Linux live operating system (http://www.bioslax.com), preinstalled with bioinformatics and UNIX tools. The database and its dependencies can be compressed into an ".lzm" file for deposition. End-users can search for archived databases and activate them on dynamically re-instantiated BioSlax instances, run as virtual machines on two popular full-virtualization cloud-computing platforms, Xen Hypervisor and vSphere. The system is adaptable to increasing demand for disk storage or computational load and allows database developers to use the re-instantiated databases for integration and development of new databases. Herein, we demonstrate that a relatively inexpensive solution can be implemented for archival of bioinformatics databases and their rapid re-instantiation should the live databases disappear.

  14. Simple re-instantiation of small databases using cloud computing

    PubMed Central

    2013-01-01

    Background Small bioinformatics databases, unlike institutionally funded large databases, are vulnerable to discontinuation, and many databases reported in publications are no longer accessible. This leads to irreproducible scientific work and redundant effort, impeding the pace of scientific progress. Results We describe a Web-accessible system, available online at http://biodb100.apbionet.org, for archival and future on-demand re-instantiation of small databases within minutes. Depositors can rebuild their databases by downloading a Linux live operating system (http://www.bioslax.com), preinstalled with bioinformatics and UNIX tools. The database and its dependencies can be compressed into an ".lzm" file for deposition. End-users can search for archived databases and activate them on dynamically re-instantiated BioSlax instances, run as virtual machines on two popular full-virtualization cloud-computing platforms, Xen Hypervisor and vSphere. The system is adaptable to increasing demand for disk storage or computational load and allows database developers to use the re-instantiated databases for integration and development of new databases. Conclusions Herein, we demonstrate that a relatively inexpensive solution can be implemented for archival of bioinformatics databases and their rapid re-instantiation should the live databases disappear. PMID:24564380

  15. Query by forms: User-oriented relational database retrieving system and its application in analysis of experiment data

    NASA Astrophysics Data System (ADS)

    Skotniczny, Zbigniew

    1989-12-01

    The Query by Forms (QbF) system is a user-oriented interactive tool for querying large relational databases with minimal query-definition cost. The system was developed under the assumption that the user's time and effort for defining needed queries is the most severe bottleneck. The system may be applied to any Rdb/VMS database system and is recommended for specific information systems of any project where end-user queries cannot be foreseen. The tool is dedicated to specialists in an application domain who have to analyze data maintained in the database from any needed point of view, and who do not need to know commercial database languages. The paper presents the system developed as a compromise between functionality and usability. User-system communication via a menu-driven, tree-like structure of screen forms, which produces a query definition and execution, is discussed in detail. Output of query results (printed reports and graphics) is also discussed. Finally, the paper shows an application of QbF to the HERA project.

  16. Empirical performance of the self-controlled case series design: lessons for developing a risk identification and analysis system.

    PubMed

    Suchard, Marc A; Zorych, Ivan; Simpson, Shawn E; Schuemie, Martijn J; Ryan, Patrick B; Madigan, David

    2013-10-01

    The self-controlled case series (SCCS) offers potential as a statistical method for risk identification involving medical products from large-scale observational healthcare data. However, analytic design choices remain in encoding longitudinal health records into the SCCS framework, and its risk identification performance across real-world databases is unknown. To evaluate the performance of SCCS and its design choices as a tool for risk identification in observational healthcare data, we examined the risk identification performance of SCCS across five design choices using 399 drug-health outcome pairs in five real observational databases (four administrative claims and one electronic health records). In these databases, the pairs involve 165 positive controls and 234 negative controls. We also consider several synthetic databases with known relative risks between drug-outcome pairs. We evaluate risk identification performance by estimating the area under the receiver-operator characteristic curve (AUC), and bias and coverage probability in the synthetic examples. The SCCS achieves strong predictive performance: twelve of the twenty health outcome-database scenarios return AUCs >0.75 across all drugs. Including all adverse events instead of just the first per patient, and applying a multivariate adjustment for concomitant drug use, are the most important design choices. However, the SCCS as applied here returns relative-risk point estimates biased towards the null value of 1 with low coverage probability. The SCCS, recently extended to apply a multivariate adjustment for concomitant drug use, offers promise as a statistical tool for risk identification in large-scale observational healthcare databases. Poor estimator calibration dampens enthusiasm, but ongoing work should correct this shortcoming.

  17. Data Aggregation System: A system for information retrieval on demand over relational and non-relational distributed data sources

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Ball, G.; Kuznetsov, V.; Evans, D.

    We present the Data Aggregation System, a system for information retrieval and aggregation from heterogeneous sources of relational and non-relational data for the Compact Muon Solenoid experiment on the CERN Large Hadron Collider. The experiment currently has a number of organically-developed data sources, including front-ends to a number of different relational databases and non-database data services which do not share common data structures or APIs (Application Programming Interfaces), and cannot at this stage be readily converged. DAS provides a single interface for querying all these services, a caching layer to speed up access to expensive underlying calls, and the ability to merge records from different data services pertaining to a single primary key.
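
    The merge-by-primary-key idea is simple to sketch. The toy below stands in for DAS's aggregation step: each "service" returns a partial record for the same key, and the records are merged into one view. Service contents and field names are invented for illustration.

        def fetch_merged(key, services):
            """Merge the partial records that each service holds for one key."""
            merged = {"key": key}
            for service in services:
                merged.update(service.get(key, {}))   # later services add fields
            return merged

        dbs_service = {"dataset42": {"owner": "cms", "size_gb": 12}}
        location_service = {"dataset42": {"site": "T2_CH_CERN"}}
        print(fetch_merged("dataset42", [dbs_service, location_service]))
        # {'key': 'dataset42', 'owner': 'cms', 'size_gb': 12, 'site': 'T2_CH_CERN'}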

  18. Database recovery using redundant disk arrays

    NASA Technical Reports Server (NTRS)

    Mourad, Antoine N.; Fuchs, W. K.; Saab, Daniel G.

    1992-01-01

    Redundant disk arrays provide a way for achieving rapid recovery from media failures with a relatively low storage cost for large scale database systems requiring high availability. In this paper a method is proposed for using redundant disk arrays to support rapid recovery from system crashes and transaction aborts in addition to their role in providing media failure recovery. A twin page scheme is used to store the parity information in the array so that the time for transaction commit processing is not degraded. Using an analytical model, it is shown that the proposed method achieves a significant increase in the throughput of database systems using redundant disk arrays by reducing the number of recovery operations needed to maintain the consistency of the database.
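
    The twin-page bookkeeping is specific to the paper, but the underlying redundancy is standard parity: the parity block is the XOR of the equal-sized data blocks, so any single lost block can be rebuilt from the survivors. A minimal sketch:

        from functools import reduce

        def parity(blocks):
            """XOR equal-length blocks byte-by-byte."""
            return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

        data = [b"page-one", b"page-two", b"page-six"]   # equal-sized blocks
        p = parity(data)

        lost = data[1]                        # pretend the second disk fails
        rebuilt = parity([data[0], data[2], p])
        assert rebuilt == lost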

  19. Recovery issues in databases using redundant disk arrays

    NASA Technical Reports Server (NTRS)

    Mourad, Antoine N.; Fuchs, W. K.; Saab, Daniel G.

    1993-01-01

    Redundant disk arrays provide a way for achieving rapid recovery from media failures with a relatively low storage cost for large scale database systems requiring high availability. In this paper we propose a method for using redundant disk arrays to support rapid recovery from system crashes and transaction aborts in addition to their role in providing media failure recovery. A twin page scheme is used to store the parity information in the array so that the time for transaction commit processing is not degraded. Using an analytical model, we show that the proposed method achieves a significant increase in the throughput of database systems using redundant disk arrays by reducing the number of recovery operations needed to maintain the consistency of the database.

  20. The comparative effectiveness of conventional and digital image libraries.

    PubMed

    McColl, R I; Johnson, A

    2001-03-01

    Before introducing a hospital-wide image database to improve access, navigation and retrieval speed, a comparative study between a conventional slide library and a matching image database was undertaken to assess its relative benefits. Paired time trials and personal questionnaires revealed faster retrieval rates, higher image quality, and easier viewing for the pilot digital image database. Analysis of confidentiality, copyright and data protection exposed similar issues for both systems, thus concluding that the digital image database is a more effective library system. The authors suggest that in the future, medical images will be stored on large, professionally administered, centrally located file servers, allowing specialist image libraries to be tailored locally for individual users. The further integration of the database with web technology will enable cheap and efficient remote access for a wide range of users.

  1. Multiple imputation as one tool to provide longitudinal databases for modelling human height and weight development.

    PubMed

    Aßmann, C

    2016-06-01

    Besides the large effort involved in field work, provision of valid databases requires statistical and informational infrastructure to enable long-term access to longitudinal data sets on height, weight, and related issues. To foster use of longitudinal data sets within the scientific community, provision of valid databases has to address data-protection regulations. It is, therefore, of major importance to prevent identification of individuals from publicly available databases. To reach this goal, one possible strategy is to provide a synthetic database to the public, allowing analysis strategies to be pretested. Such synthetic databases can be established using multiple imputation tools; given approval of the analysis strategy, verification is then based on the original data. Multiple imputation by chained equations is illustrated as a way to provide synthetic databases, as it captures a wide range of statistical interdependencies. Missing values, which typically occur within longitudinal databases for reasons of item non-response, can also be addressed via multiple imputation when providing databases. The provision of synthetic databases using multiple imputation techniques is one possible strategy to ensure data protection, increase the visibility of longitudinal databases, and enhance their analytical potential.
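
    Chained-equations imputation can be sketched with scikit-learn's IterativeImputer (the abstract does not prescribe a particular tool, so this is an illustration rather than the author's implementation). Each incomplete variable is modeled conditionally on the others; sample_posterior=True draws imputed values rather than plugging in conditional means, as multiple imputation requires. The data below are synthetic.

        import numpy as np
        from sklearn.experimental import enable_iterative_imputer  # noqa: F401
        from sklearn.impute import IterativeImputer

        rng = np.random.default_rng(0)
        height = rng.normal(170, 10, size=200)
        weight = 0.9 * height - 80 + rng.normal(0, 5, size=200)
        data = np.column_stack([height, weight])
        data[rng.random(200) < 0.2, 1] = np.nan     # 20% of weights missing

        # m = 5 completed data sets, differing only in the random draws
        imputations = [
            IterativeImputer(sample_posterior=True,
                             random_state=m).fit_transform(data)
            for m in range(5)
        ]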

  2. Global Aerodynamic Modeling for Stall/Upset Recovery Training Using Efficient Piloted Flight Test Techniques

    NASA Technical Reports Server (NTRS)

    Morelli, Eugene A.; Cunningham, Kevin; Hill, Melissa A.

    2013-01-01

    Flight test and modeling techniques were developed for efficiently identifying global aerodynamic models that can be used to accurately simulate stall, upset, and recovery on large transport airplanes. The techniques were developed and validated in a high-fidelity fixed-base flight simulator using a wind-tunnel aerodynamic database, realistic sensor characteristics, and a realistic flight deck representative of a large transport aircraft. Results demonstrated that aerodynamic models for stall, upset, and recovery can be identified rapidly and accurately using relatively simple piloted flight test maneuvers. Stall maneuver predictions and comparisons of identified aerodynamic models with data from the underlying simulation aerodynamic database were used to validate the techniques.

  3. EDULISS: a small-molecule database with data-mining and pharmacophore searching capabilities

    PubMed Central

    Hsin, Kun-Yi; Morgan, Hugh P.; Shave, Steven R.; Hinton, Andrew C.; Taylor, Paul; Walkinshaw, Malcolm D.

    2011-01-01

    We present the relational database EDULISS (EDinburgh University Ligand Selection System), which stores structural, physicochemical and pharmacophoric properties of small molecules. The database comprises a collection of over 4 million commercially available compounds from 28 different suppliers. A user-friendly web-based interface for EDULISS (available at http://eduliss.bch.ed.ac.uk/) has been established providing a number of data-mining possibilities. For each compound a single 3D conformer is stored along with over 1600 calculated descriptor values (molecular properties). A very efficient method for unique compound recognition, especially for a large-scale database, is demonstrated by making use of small subgroups of the descriptors. Many of the shape and distance descriptors are held as pre-calculated bit strings permitting fast and efficient similarity and pharmacophore searches which can be used to identify families of related compounds for biological testing. Two ligand searching applications are given to demonstrate how EDULISS can be used to extract families of molecules with selected structural and biophysical features. PMID:21051336

  4. Characterizing the genetic structure of a forensic DNA database using a latent variable approach.

    PubMed

    Kruijver, Maarten

    2016-07-01

    Several problems in forensic genetics require a representative model of a forensic DNA database. Obtaining an accurate representation of the offender database can be difficult, since databases typically contain groups of persons with unregistered ethnic origins in unknown proportions. We propose to estimate the allele frequencies of the subpopulations comprising the offender database and their proportions from the database itself using a latent variable approach. We present a model for which parameters can be estimated using the expectation maximization (EM) algorithm. This approach does not rely on relatively small and possibly unrepresentative population surveys, but is driven by the actual genetic composition of the database only. We fit the model to a snapshot of the Dutch offender database (2014), which contains close to 180,000 profiles, and find that three subpopulations suffice to describe a large fraction of the heterogeneity in the database. We demonstrate the utility and reliability of the approach with three applications. First, we use the model to predict the number of false leads obtained in database searches. We assess how well the model predicts the number of false leads obtained in mock searches in the Dutch offender database, both for the case of familial searching for first degree relatives of a donor and searching for contributors to three-person mixtures. Second, we study the degree of partial matching between all pairs of profiles in the Dutch database and compare this to what is predicted using the latent variable approach. Third, we use the model to provide evidence to support that the Dutch practice of estimating match probabilities using the Balding-Nichols formula with a native Dutch reference database and θ=0.03 is conservative. Copyright © 2016 Elsevier Ireland Ltd. All rights reserved.
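
    The latent-class idea can be sketched with a deliberately simplified model: haploid alleles at independent loci drawn from one of K subpopulations, fitted by EM. This is far cruder than the paper's model for diploid STR profiles, and every number below is synthetic.

        import numpy as np

        def em_subpopulations(G, n_alleles, K, n_iter=50):
            """G: (n, L) integer allele labels, one per person and locus."""
            n, L = G.shape
            rng = np.random.default_rng(0)
            pi = np.full(K, 1.0 / K)                            # proportions
            p = rng.dirichlet(np.ones(n_alleles), size=(K, L))  # allele freqs
            for _ in range(n_iter):
                # E-step: posterior subpopulation membership per person
                logr = np.log(pi) + np.zeros((n, K))
                for l in range(L):
                    logr += np.log(p[:, l, G[:, l]]).T
                r = np.exp(logr - logr.max(axis=1, keepdims=True))
                r /= r.sum(axis=1, keepdims=True)
                # M-step: re-estimate proportions and allele frequencies
                pi = r.mean(axis=0)
                for z in range(K):
                    for l in range(L):
                        w = np.bincount(G[:, l], weights=r[:, z],
                                        minlength=n_alleles)
                        p[z, l] = w / w.sum()
            return pi, p, r

        rng = np.random.default_rng(1)
        G = np.vstack([rng.choice(3, size=(150, 10), p=[0.7, 0.2, 0.1]),
                       rng.choice(3, size=(150, 10), p=[0.1, 0.2, 0.7])])
        pi, p, r = em_subpopulations(G, n_alleles=3, K=2)
        print(pi.round(2))   # roughly [0.5, 0.5]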

  5. Design and Implementation of an Environmental Mercury Database for Northeastern North America

    NASA Astrophysics Data System (ADS)

    Clair, T. A.; Evers, D.; Smith, T.; Goodale, W.; Bernier, M.

    2002-12-01

    An important issue faced when attempting to interpret geochemical variability studies across large regions is the accumulation, access, and consistent display of data from a large number of sources. We were given the opportunity to provide a regional assessment of mercury distribution in surface waters, sediments, invertebrates, fish, and birds in a region extending from New York State to the Island of Newfoundland. We received over 20 individual databases from State, Provincial, and Federal governments, as well as university researchers from both Canada and the United States. These databases came in a variety of formats and sizes. Our challenge was to find a way of accumulating and presenting the large amounts of acquired data in a consistent, easily accessible fashion, which could then be more easily interpreted. Moreover, the database had to be portable and easily distributable to the large number of study participants. We developed a static database structure using a web-based approach, which we were able to mount on a server accessible to all project participants. The site also contained all the necessary documentation related to the data, its acquisition, and the methods used in its analysis and interpretation. We then copied the complete web site onto CD-ROMs, which we distributed to all project participants, funding agencies, and other interested parties. The CD-ROM formed a permanent record of the project and was issued ISSN and ISBN numbers so that the information remains accessible to researchers in perpetuity. Here we present an overview of the CD-ROM and data structures, the information accumulated over the first year of the study, and an initial interpretation of the results.

  6. SymDex: increasing the efficiency of chemical fingerprint similarity searches for comparing large chemical libraries by using query set indexing.

    PubMed

    Tai, David; Fang, Jianwen

    2012-08-27

    The large sizes of today's chemical databases require efficient algorithms to perform similarity searches, and it can be very time consuming to compare two large chemical databases. This paper builds upon existing research efforts by describing a novel strategy for accelerating existing search algorithms for comparing large chemical collections. The quest for efficiency has focused on developing better indexing algorithms by creating heuristics for searching an individual chemical against a chemical library, detecting and eliminating needless similarity calculations. For comparing two chemical collections, these algorithms simply execute searches for each chemical in the query set sequentially. The strategy presented in this paper achieves a speedup over these algorithms by indexing the set of all query chemicals, so that redundant calculations that arise in the case of sequential searches are eliminated. We implement this novel algorithm in a similarity search program called Symmetric inDexing, or SymDex. SymDex shows a maximum speedup of over 232% compared to the state-of-the-art single-query search algorithm over real data for various fingerprint lengths. Considerable speedup is seen even for batch searches where query set sizes are relatively small compared to typical database sizes. To the best of our knowledge, SymDex is the first search algorithm designed specifically for comparing chemical libraries. It can be adapted to most, if not all, existing indexing algorithms and shows potential for accelerating future similarity search algorithms for comparing chemical databases.
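
    The unit of work these search strategies minimize is the fingerprint comparison itself, typically a Tanimoto similarity on bit strings. A minimal sketch, with Python ints standing in for fixed-length bit vectors:

        def tanimoto(fp_a: int, fp_b: int) -> float:
            """|A AND B| / |A OR B| for fingerprints stored as bit masks."""
            shared = bin(fp_a & fp_b).count("1")
            either = bin(fp_a | fp_b).count("1")
            return shared / either if either else 1.0

        query = 0b1011_0010
        target = 0b1010_0110
        print(tanimoto(query, target))   # 3 shared of 5 set bits = 0.6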

  7. An integrated approach to reservoir modeling

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Donaldson, K.

    1993-08-01

    The purpose of this research is to evaluate the usefulness of the following procedural and analytical methods in investigating the heterogeneity of the oil reserve for the Mississippian Big Injun Sandstone of the Granny Creek field, Clay and Roane counties, West Virginia: (1) relational database, (2) two-dimensional cross sections, (3) true three-dimensional modeling, (4) geohistory analysis, (5) a rule-based expert system, and (6) geographical information systems. The large data set could not be effectively integrated and interpreted without this approach. A relational database was designed to fully integrate three- and four-dimensional data. The database provides an effective means for maintaining and manipulating the data. A two-dimensional cross section program was designed to correlate stratigraphy, depositional environments, porosity, permeability, and petrographic data. This flexible design allows for additional four-dimensional data. Dynamic Graphics[trademark]...

  8. Distributed databases for materials study of thermo-kinetic properties

    NASA Astrophysics Data System (ADS)

    Toher, Cormac

    2015-03-01

    High-throughput computational materials science provides researchers with the opportunity to rapidly generate large databases of materials properties. To add thermal properties to the AFLOWLIB consortium and Materials Project repositories, we have implemented an automated quasi-harmonic Debye model, the Automatic GIBBS Library (AGL). This enables us to screen thousands of materials for thermal conductivity, bulk modulus, thermal expansion, and related properties. The search and sort functions of the online database can then be used to identify suitable materials for more in-depth study using more precise computational or experimental techniques. The AFLOW-AGL source code is publicly available and will soon be released under the GNU GPL license.

  9. A mapping review of the literature on UK-focused health and social care databases.

    PubMed

    Cooper, Chris; Rogers, Morwenna; Bethel, Alison; Briscoe, Simon; Lowe, Jenny

    2015-03-01

    Bibliographic databases are a day-to-day tool of the researcher: they offer easy and organised access to knowledge, but how much is actually known about the databases on offer? The focus of this paper is UK health and social care databases. These databases are often small, specialised by topic, and provide a literature complementary to the large, international databases. There is, however, good evidence that these databases are overlooked in systematic reviews, perhaps because little is known about what they can offer. Our aim was to systematically locate and map published and unpublished literature on the key UK health and social care bibliographic databases, using systematic searching and mapping. Two hundred and forty-two items were identified which specifically related to 24 of the 34 databases under review. There is little published or unpublished literature specifically analysing the key UK health and social care databases. Since several UK databases have closed, others are at risk, and some are overlooked in reviews, better information is required to enhance our knowledge. Further research on UK health and social care databases is required. This paper suggests the need to develop the evidence base through a series of case studies on each of the databases. © 2014 The authors. Health Information and Libraries Journal © 2014 Health Libraries Journal.

  10. Extension of the COG and arCOG databases by amino acid and nucleotide sequences

    PubMed Central

    Meereis, Florian; Kaufmann, Michael

    2008-01-01

    Background The current versions of the COG and arCOG databases, both excellent frameworks for studies in comparative and functional genomics, do not contain the nucleotide sequences corresponding to their protein or protein domain entries. Results Using sequence information obtained from GenBank flat files covering the completely sequenced genomes of the COG and arCOG databases, we constructed NUCOCOG (nucleotide sequences containing COG databases) as an extended version including all nucleotide sequences and in addition the amino acid sequences originally utilized to construct the current COG and arCOG databases. We make available three comprehensive single XML files containing the complete databases including all sequence information. In addition, we provide a web interface as a utility suitable to browse the NUCOCOG database for sequence retrieval. The database is accessible at . Conclusion NUCOCOG offers the possibility to analyze any sequence related property in the context of the COG and arCOG framework simply by using script languages such as PERL applied to a large but single XML document. PMID:19014535
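
    Because each NUCOCOG file is a single large XML document, a streaming parse keeps memory flat while scripting over it. The sketch below uses Python's xml.etree rather than Perl; the element and attribute names are hypothetical placeholders, not the actual NUCOCOG schema.

        import xml.etree.ElementTree as ET

        def sequences(xml_path):
            """Yield (id, nucleotide sequence) pairs without loading the file."""
            for event, elem in ET.iterparse(xml_path, events=("end",)):
                if elem.tag == "entry":                 # hypothetical tag
                    yield elem.get("id"), elem.findtext("nucseq")
                    elem.clear()                        # free the parsed subtree

        for seq_id, nt in sequences("nucocog.xml"):     # hypothetical file name
            ...  # e.g. compute GC content per COG member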

  11. Helicobacter pylori-related chronic gastritis as a risk factor for colonic neoplasms.

    PubMed

    Inoue, Izumi; Kato, Jun; Tamai, Hideyuki; Iguchi, Mikitaka; Maekita, Takao; Yoshimura, Noriko; Ichinose, Masao

    2014-02-14

    To summarize the current views and insights on associations between Helicobacter pylori (H. pylori)-related chronic gastritis and colorectal neoplasm, we reviewed recent studies to clarify whether H. pylori infection/H. pylori-related chronic gastritis is associated with an elevated risk of colorectal neoplasm. Recent studies based on large databases with careful control for confounding variables have clearly demonstrated an increased risk of colorectal neoplasm associated with H. pylori infection. The correlation between H. pylori-related chronic atrophic gastritis (CAG) and colorectal neoplasm has only been examined in a limited number of studies. A recent large study using a national histopathological database, and our study based on the stage of H. pylori-related chronic gastritis as determined by serum levels of H. pylori antibody titer and pepsinogen, indicated that H. pylori-related CAG confers an increased risk of colorectal neoplasm, and more extensive atrophic gastritis will probably be associated with even higher risk of neoplasm. In addition, our study suggested that the activity of H. pylori-related chronic gastritis is correlated with colorectal neoplasm risk. H. pylori-related chronic gastritis could be involved in an increased risk of colorectal neoplasm that appears to be enhanced by the progression of gastric atrophy and the presence of active inflammation.

  12. Helicobacter pylori-related chronic gastritis as a risk factor for colonic neoplasms

    PubMed Central

    Inoue, Izumi; Kato, Jun; Tamai, Hideyuki; Iguchi, Mikitaka; Maekita, Takao; Yoshimura, Noriko; Ichinose, Masao

    2014-01-01

    To summarize the current views and insights on associations between Helicobacter pylori (H. pylori)-related chronic gastritis and colorectal neoplasm, we reviewed recent studies to clarify whether H. pylori infection/H. pylori-related chronic gastritis is associated with an elevated risk of colorectal neoplasm. Recent studies based on large databases with careful control for confounding variables have clearly demonstrated an increased risk of colorectal neoplasm associated with H. pylori infection. The correlation between H. pylori-related chronic atrophic gastritis (CAG) and colorectal neoplasm has only been examined in a limited number of studies. A recent large study using a national histopathological database, and our study based on the stage of H. pylori-related chronic gastritis as determined by serum levels of H. pylori antibody titer and pepsinogen, indicated that H. pylori-related CAG confers an increased risk of colorectal neoplasm, and more extensive atrophic gastritis will probably be associated with even higher risk of neoplasm. In addition, our study suggested that the activity of H. pylori-related chronic gastritis is correlated with colorectal neoplasm risk. H. pylori-related chronic gastritis could be involved in an increased risk of colorectal neoplasm that appears to be enhanced by the progression of gastric atrophy and the presence of active inflammation. PMID:24587623

  13. HBVPathDB: a database of HBV infection-related molecular interaction network.

    PubMed

    Zhang, Yi; Bo, Xiao-Chen; Yang, Jing; Wang, Sheng-Qi

    2005-03-21

    To describe the molecular and gene interactions between hepatitis B virus (HBV) and its host, for understanding how viral and host genes and molecules are networked to form a biological system and for elucidating the mechanism of HBV infection, the knowledge of HBV infection-related reactions was organized into various kinds of pathways with carefully drawn graphs in HBVPathDB. Pathway information is stored in a relational database management system (DBMS), currently the most efficient way to manage large amounts of data, and querying is implemented with the powerful Structured Query Language (SQL). The search engine is written in PHP with embedded SQL, and a web retrieval interface was developed for searching with Hypertext Markup Language (HTML). We present the first version of HBVPathDB, a database of the HBV infection-related molecular interaction network composed of 306 pathways involving 1,050 molecules. With carefully drawn graphs, pathway information stored in HBVPathDB can be browsed in an intuitive way. We developed an easy-to-use interface for flexible access to the details of the database, and convenient software is implemented to query and browse the pathway information. Four search page layout options (category search, gene search, description search, and unitized search) are supported by the search engine of the database. The database is freely available at http://www.bio-inf.net/HBVPathDB/HBV/. HBVPathDB already contains a considerable amount of HBV infection-related pathway information, which is suitable for in-depth analysis of the molecular interaction network of virus and host. HBVPathDB integrates pathway data-sets with convenient software for querying, browsing, and visualization, providing users more opportunity to identify regulatory key molecules as potential drug targets and to explore the possible mechanism of HBV infection based on gene expression datasets.

  14. A Study of the Efficiency of Spatial Indexing Methods Applied to Large Astronomical Databases

    NASA Astrophysics Data System (ADS)

    Donaldson, Tom; Berriman, G. Bruce; Good, John; Shiao, Bernie

    2018-01-01

    Spatial indexing of astronomical databases generally uses quadrature methods, which partition the sky into cells used to create an index (usually a B-tree) written as a database column. We report the results of a study comparing the performance of two common indexing methods, HTM and HEALPix, on Solaris and Windows database servers installed with a PostgreSQL database, and on a Windows server installed with MS SQL Server. The indexing was applied to the 2MASS All-Sky Catalog and to the Hubble Source Catalog. On each server, the study compared indexing performance by submitting 1 million queries at each index level with random sky positions and random cone-search radii, computed on a logarithmic scale between 1 arcsec and 1 degree, and measuring the time to complete the query and write the output. These simulated queries, intended to model realistic use patterns, were run in a uniform way on many combinations of indexing method and indexing level. The query times in all simulations are strongly I/O-bound and are linear with the number of records returned for large numbers of sources. There are, however, considerable differences between simulations, which reveal that hardware I/O throughput is a more important factor in managing the performance of a DBMS than the choice of indexing scheme. The choice of index itself is relatively unimportant: for comparable index levels, the performance is consistent within the scatter of the timings. At small index levels (large cells; e.g. level 4, cell size 3.7 deg), there is large scatter in the timings because of wide variations in the number of sources found in the cells. At larger index levels, performance improves and scatter decreases, but the improvement at level 8 (cell size 14 arcmin) and higher is masked to some extent by the timing scatter caused by the range of query sizes. At very high levels (e.g. 20; cell size 0.0004 arcsec), the granularity of the cells becomes so high that a large number of extraneous empty cells begin to degrade performance. Thus, for the use patterns studied here, database performance is not critically dependent on the exact choices of index or level.
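
    The indexing scheme under test is easy to sketch: assign each source a cell identifier, store it as an indexed column, and turn a cone search into a cell-id lookup plus an exact distance filter. The snippet below assumes the healpy package and invents the catalog values; it illustrates the idea rather than reproducing the study's code.

        import numpy as np
        import healpy as hp

        NSIDE = 2 ** 8                      # HEALPix level 8, as discussed above

        ra = np.array([10.68, 56.75])       # source positions, degrees
        dec = np.array([41.27, 24.12])
        cell = hp.ang2pix(NSIDE, ra, dec, lonlat=True)   # the indexed column

        # Cone search: which cells can contain sources within 1 arcmin?
        center = hp.ang2vec(10.68, 41.27, lonlat=True)
        radius = np.radians(1.0 / 60.0)
        cells = hp.query_disc(NSIDE, center, radius, inclusive=True)
        # ...then: SELECT * FROM sources WHERE cell IN (<cells>), followed by
        # an exact angular-distance cut on the rows returned.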

  15. Database extraction strategies for low-template evidence.

    PubMed

    Bleka, Øyvind; Dørum, Guro; Haned, Hinda; Gill, Peter

    2014-03-01

    Often in forensic cases, the profile of at least one of the contributors to a DNA evidence sample is unknown and a database search is needed to discover possible perpetrators. In this article we consider two types of search strategies to extract suspects from a database using methods based on probability arguments. The performance of the proposed match scores is demonstrated by carrying out a study of each match score relative to the level of allele drop-out in the crime sample, simulating low-template DNA. The efficiency was measured by random man simulation and we compared the performance using the SGM Plus kit and the ESX 17 kit for the Norwegian population, demonstrating that the latter has greatly enhanced power to discover perpetrators of crime in large national DNA databases. The code for the database extraction strategies will be prepared for release in the R-package forensim. Copyright © 2013 Elsevier Ireland Ltd. All rights reserved.

  16. A method to implement fine-grained access control for personal health records through standard relational database queries.

    PubMed

    Sujansky, Walter V; Faus, Sam A; Stone, Ethan; Brennan, Patricia Flatley

    2010-10-01

    Online personal health records (PHRs) enable patients to access, manage, and share certain of their own health information electronically. This capability creates the need for precise access-control mechanisms that restrict the sharing of data to that intended by the patient. The authors describe the design and implementation of an access-control mechanism for PHR repositories that is modeled on the eXtensible Access Control Markup Language (XACML) standard, but intended to reduce the cognitive and computational complexity of XACML. The authors implemented the mechanism entirely in a relational database system using ANSI-standard SQL statements. Based on a set of access-control rules encoded as relational table rows, the mechanism determines via a single SQL query whether a user who accesses patient data from a specific application is authorized to perform a requested operation on a specified data object. Testing of this query on a moderately large database has demonstrated execution times consistently below 100 ms. The authors include the details of the implementation, including algorithms, examples, and a test database, as supplementary materials. Copyright © 2010 Elsevier Inc. All rights reserved.
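
    The rules-in-a-table idea translates directly into code. Below is a minimal sketch with sqlite3 and an invented four-column schema; the paper's actual rule encoding (given in its supplementary materials) is richer.

        import sqlite3

        conn = sqlite3.connect(":memory:")
        conn.execute("""CREATE TABLE acl (
                          user_id TEXT, app_id TEXT,
                          object_type TEXT, operation TEXT)""")
        # The patient granted her nurse read access to medication data:
        conn.execute("INSERT INTO acl VALUES "
                     "('nurse7', 'med-app', 'medication', 'read')")

        def authorized(user_id, app_id, object_type, operation):
            """One SQL query decides the access request."""
            row = conn.execute(
                "SELECT 1 FROM acl WHERE user_id = ? AND app_id = ? "
                "AND object_type = ? AND operation = ? LIMIT 1",
                (user_id, app_id, object_type, operation)).fetchone()
            return row is not None

        print(authorized("nurse7", "med-app", "medication", "read"))    # True
        print(authorized("nurse7", "med-app", "medication", "write"))   # False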

  17. Extracting Databases from Dark Data with DeepDive

    PubMed Central

    Zhang, Ce; Shin, Jaeho; Ré, Christopher; Cafarella, Michael; Niu, Feng

    2016-01-01

    DeepDive is a system for extracting relational databases from dark data: the mass of text, tables, and images that are widely collected and stored but which cannot be exploited by standard relational tools. If the information in dark data — scientific papers, Web classified ads, customer service notes, and so on — were instead in a relational database, it would give analysts a massive and valuable new set of “big data.” DeepDive is distinctive when compared to previous information extraction systems in its ability to obtain very high precision and recall at reasonable engineering cost; in a number of applications, we have used DeepDive to create databases with accuracy that meets that of human annotators. To date we have successfully deployed DeepDive to create data-centric applications for insurance, materials science, genomics, paleontology, law enforcement, and others. The data unlocked by DeepDive represents a massive opportunity for industry, government, and scientific researchers. DeepDive is enabled by an unusual design that combines large-scale probabilistic inference with a novel developer interaction cycle. This design is enabled by several core innovations around probabilistic training and inference. PMID:28316365

  18. SchizConnect: Mediating neuroimaging databases on schizophrenia and related disorders for large-scale integration.

    PubMed

    Wang, Lei; Alpert, Kathryn I; Calhoun, Vince D; Cobia, Derin J; Keator, David B; King, Margaret D; Kogan, Alexandr; Landis, Drew; Tallis, Marcelo; Turner, Matthew D; Potkin, Steven G; Turner, Jessica A; Ambite, Jose Luis

    2016-01-01

    SchizConnect (www.schizconnect.org) is built to address the issues of multiple data repositories in schizophrenia neuroimaging studies. It includes a level of mediation--translating across data sources--so that the user can place one query, e.g. for diffusion images from male individuals with schizophrenia, and find out from across participating data sources how many datasets there are, as well as downloading the imaging and related data. The current version handles the Data Usage Agreements across different studies, as well as interpreting database-specific terminologies into a common framework. New data repositories can also be mediated to bring immediate access to existing datasets. Compared with centralized, upload data sharing models, SchizConnect is a unique, virtual database with a focus on schizophrenia and related disorders that can mediate live data as information is being updated at each data source. It is our hope that SchizConnect can facilitate testing new hypotheses through aggregated datasets, promoting discovery related to the mechanisms underlying schizophrenic dysfunction. Copyright © 2015 Elsevier Inc. All rights reserved.

  19. High dimensional biological data retrieval optimization with NoSQL technology.

    PubMed

    Wang, Shicai; Pandis, Ioannis; Wu, Chao; He, Sijin; Johnson, David; Emam, Ibrahim; Guitton, Florian; Guo, Yike

    2014-01-01

    High-throughput transcriptomic data generated by microarray experiments is the most abundant and frequently stored kind of data currently used in translational medicine studies. Although microarray data is supported in data warehouses such as tranSMART, queries against relational databases for hundreds of different patient gene expression records are slow because the relational model scales poorly for this access pattern. Non-relational data models, such as the key-value model implemented in NoSQL databases, promise better performance. Our motivation is to improve the performance of the tranSMART data warehouse with a view to supporting Next Generation Sequencing data. In this paper we introduce a new data model better suited for high-dimensional data storage and querying, optimized for database scalability and performance. We have designed a key-value pair data model to support faster queries over large-scale microarray data and implemented the model using HBase, an implementation of Google's BigTable storage system. An experimental performance comparison was carried out against the traditional relational data model implemented in both MySQL Cluster and MongoDB, using a large publicly available transcriptomic data set taken from NCBI GEO concerning Multiple Myeloma. Our new key-value data model implemented on HBase exhibits an average 5.24-fold increase in high-dimensional biological data query performance compared to the relational model implemented on MySQL Cluster, and an average 6.47-fold increase over the same model on MongoDB. The performance evaluation found that the new key-value data model, in particular its implementation in HBase, outperforms the relational model currently implemented in tranSMART. We propose that NoSQL technology holds great promise for large-scale data management, in particular for high-dimensional biological data such as that demonstrated in the performance evaluation described in this paper. We aim to use this new data model as a basis for migrating tranSMART's implementation to a more scalable solution for Big Data.
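
    One plausible row-key design for such a key-value model is sketched below with the happybase HBase client; the host, table, and column names are invented, and the paper does not publish its exact schema.

        import happybase

        conn = happybase.Connection("hbase-host")     # hypothetical host
        table = conn.table("expression")              # hypothetical table

        # Row key <gene>:<patient> stores a gene's values contiguously, so a
        # per-gene query becomes a cheap prefix scan instead of a join.
        table.put(b"TP53:patient-0042", {b"e:value": b"7.31"})

        for key, data in table.scan(row_prefix=b"TP53:"):
            print(key, data[b"e:value"])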

  20. High dimensional biological data retrieval optimization with NoSQL technology

    PubMed Central

    2014-01-01

    Background High-throughput transcriptomic data generated by microarray experiments is the most abundant and frequently stored kind of data currently used in translational medicine studies. Although microarray data is supported in data warehouses such as tranSMART, queries against relational databases for hundreds of different patient gene expression records are slow because the relational model scales poorly for this access pattern. Non-relational data models, such as the key-value model implemented in NoSQL databases, promise better performance. Our motivation is to improve the performance of the tranSMART data warehouse with a view to supporting Next Generation Sequencing data. Results In this paper we introduce a new data model better suited for high-dimensional data storage and querying, optimized for database scalability and performance. We have designed a key-value pair data model to support faster queries over large-scale microarray data and implemented the model using HBase, an implementation of Google's BigTable storage system. An experimental performance comparison was carried out against the traditional relational data model implemented in both MySQL Cluster and MongoDB, using a large publicly available transcriptomic data set taken from NCBI GEO concerning Multiple Myeloma. Our new key-value data model implemented on HBase exhibits an average 5.24-fold increase in high-dimensional biological data query performance compared to the relational model implemented on MySQL Cluster, and an average 6.47-fold increase over the same model on MongoDB. Conclusions The performance evaluation found that the new key-value data model, in particular its implementation in HBase, outperforms the relational model currently implemented in tranSMART. We propose that NoSQL technology holds great promise for large-scale data management, in particular for high-dimensional biological data such as that demonstrated in the performance evaluation described in this paper. We aim to use this new data model as a basis for migrating tranSMART's implementation to a more scalable solution for Big Data. PMID:25435347

  1. TransAtlasDB: an integrated database connecting expression data, metadata and variants

    PubMed Central

    Adetunji, Modupeore O; Lamont, Susan J; Schmidt, Carl J

    2018-01-01

    Abstract High-throughput transcriptome sequencing (RNAseq) is the universally applied method for target-free transcript identification and gene expression quantification, and it generates huge amounts of data. Difficulty in accessing such data and interpreting the results can be a major impediment to postulating suitable hypotheses, so an innovative storage solution that addresses limitations such as hard disk storage requirements, efficiency, and reproducibility is paramount. By offering a uniform data storage and retrieval mechanism, diverse data can be compared and easily investigated. We present a sophisticated system, TransAtlasDB, which incorporates a hybrid architecture of relational and NoSQL databases for fast and efficient storage, processing, and querying of large datasets from transcript expression analysis, together with the corresponding metadata as well as gene-associated variants (such as SNPs) and their predicted gene effects. TransAtlasDB provides a data model for accurate storage of the large amounts of data derived from RNAseq analysis, along with two ways of interacting with the database: command-line data management workflows, written in Perl, whose functionality simplifies the storage and manipulation of the massive amounts of data generated by RNAseq analysis, and a web interface. The database application is currently modeled to handle analysis data from agricultural species and will be expanded to include more species groups. Overall, TransAtlasDB aims to serve as an accessible repository for the large, complex result files derived from RNAseq gene expression profiling and variant analysis. Database URL: https://modupeore.github.io/TransAtlasDB/ PMID:29688361

  2. Reducing process delays for real-time earthquake parameter estimation - An application of KD tree to large databases for Earthquake Early Warning

    NASA Astrophysics Data System (ADS)

    Yin, Lucy; Andrews, Jennifer; Heaton, Thomas

    2018-05-01

    Earthquake parameter estimation using nearest-neighbor searches over a large database of observations can yield reliable predictions. In the real-time setting of Earthquake Early Warning (EEW) systems, however, the accuracy gained from a large database is penalized by a significant delay in processing time. We propose using a multidimensional binary search tree (KD tree) data structure to organize large seismic databases, reducing the processing time of the nearest-neighbor searches used for prediction. We evaluated the performance of the KD tree on the Gutenberg Algorithm, a database-searching algorithm for EEW. We constructed an offline test to predict peak ground motions using a database whose feature sets are waveform filter-bank characteristics, and compared the results with the observed seismic parameters. We concluded that a large database provides more accurate predictions of ground-motion information, such as peak ground acceleration, velocity, and displacement (PGA, PGV, PGD), than of source parameters, such as hypocenter distance. Applying the KD tree search to organize the database reduced the average search time by 85% relative to the exhaustive method, making the approach feasible for real-time implementation. The algorithm is straightforward, and the results will reduce the overall time of warning delivery for EEW.
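
As a concrete illustration of the speed-up, the sketch below builds a KD tree over synthetic filter-bank feature vectors with SciPy and predicts ground motion from the k nearest stored records; the feature dimensionality, the value of k, and the geometric-mean predictor are illustrative assumptions, not the Gutenberg Algorithm itself.

```python
# Organize a database of filter-bank feature vectors in a KD tree so
# nearest-neighbor lookup is sub-linear instead of an exhaustive scan.
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
features = rng.normal(size=(100_000, 9))   # e.g., 9 filter-bank amplitudes
pgv = rng.lognormal(size=100_000)          # stored ground-motion observations

tree = cKDTree(features)                   # built once, offline

query = rng.normal(size=9)                 # features from an incoming waveform
dist, idx = tree.query(query, k=30)        # k nearest historical records
predicted_pgv = np.exp(np.mean(np.log(pgv[idx])))  # geometric-mean prediction
print(predicted_pgv)
```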

  3. Profiling Oman education data using data visualization technique

    NASA Astrophysics Data System (ADS)

    Alalawi, Sultan Juma Sultan; Shaharanee, Izwan Nizal Mohd; Jamil, Jastini Mohd

    2016-10-01

    This research presents an innovative data visualization technique to understand and visualize the information in Oman's education data generated by the Ministry of Education's "Educational Portal". The Ministry of Education in the Sultanate of Oman maintains huge databases containing massive amounts of information. The volume of data in the databases increases yearly as many students, teachers, and employees are entered into them, and the task of discovering and analyzing these vast volumes of data becomes increasingly difficult. Information visualization and data mining offer better ways of dealing with large volumes of information. In this paper, an innovative information visualization technique is developed to visualize complex multidimensional educational data. Microsoft Excel dashboards, Visual Basic for Applications (VBA), and pivot tables are utilized to visualize the data. Findings from the summarization of the data are presented, and it is argued that information visualization can help stakeholders become aware of hidden and interesting information in the large amounts of data accumulating in their educational portal.
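
The same kind of dashboard summarization can be sketched outside Excel; the snippet below uses a pandas pivot table on made-up enrolment figures (the column names and numbers are invented for illustration and do not come from the Omani portal).

```python
# Pivot-table summarization of (hypothetical) enrolment counts by
# region and year, the kind of view an Excel dashboard would present.
import pandas as pd

records = pd.DataFrame({
    'region': ['Muscat', 'Muscat', 'Dhofar', 'Dhofar'],
    'year':   [2014, 2015, 2014, 2015],
    'students': [51000, 53500, 18200, 19100],
})
summary = pd.pivot_table(records, values='students',
                         index='region', columns='year', aggfunc='sum')
print(summary)
```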

  4. Compression technique for large statistical data bases

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Eggers, S.J.; Olken, F.; Shoshani, A.

    1981-03-01

    The compression of large statistical databases is explored, and techniques are proposed for organizing the compressed data such that the time required to access the data is logarithmic. The techniques exploit special characteristics of statistical databases, namely, variation in the space required for the natural encoding of integer attributes, a prevalence of a few repeating values or constants, and the clustering of both data of the same length and constants in long, separate series. The techniques are variations of run-length encoding, in which modified run-lengths for the series are extracted from the data stream and stored in a header, which is used to form the base level of a B-tree index into the database. The run-lengths are cumulative, and therefore the access time of the data is logarithmic in the size of the header. The details of the compression scheme and its implementation are discussed, several special cases are presented, and an analysis is given of the relative performance of the various versions.
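
A minimal sketch of the core access idea, assuming a single compressed column: cumulative run lengths act as the header, and a binary search over them locates any logical record in logarithmic time (the paper's B-tree index generalizes this to disk-resident headers).

```python
# Toy version of cumulative run-length access: find the run containing
# logical record i by binary search rather than scanning the data stream.
from bisect import bisect_right

# (run_value, run_length) pairs extracted from a column of a statistical DB
runs = [(0, 1000), (7, 50), (0, 2500), (3, 10)]

cum = []                      # cumulative end positions: the "header"
total = 0
for _, length in runs:
    total += length
    cum.append(total)

def lookup(i):
    """Return the value of logical record i in O(log #runs)."""
    j = bisect_right(cum, i)  # index of the run containing position i
    return runs[j][0]

print(lookup(0), lookup(1020), lookup(3555))  # -> 0 7 3
```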

  5. Application of kernel functions for accurate similarity search in large chemical databases.

    PubMed

    Wang, Xiaohong; Huan, Jun; Smalter, Aaron; Lushington, Gerald H

    2010-04-29

    Similarity search in chemical structure databases is an important problem with many applications in chemical genomics, drug design, and efficient chemical probe screening, among others. It is widely believed that structure-based methods provide an efficient way to perform such queries. Recently, various graph kernel functions have been designed to capture the intrinsic similarity of graphs. Though successful in constructing accurate predictive and classification models, graph kernel functions cannot be applied to large chemical compound databases, owing to their high computational complexity and the difficulty of indexing similarity searches over large collections. To bridge graph kernel functions and similarity search in chemical databases, we applied a novel kernel-based similarity measurement, developed by our team, to measure the similarity of graph-represented chemicals. In our method, we utilize a hash table to support the new graph kernel function definition, efficient storage, and fast search. We have applied our method, named G-hash, to large chemical databases. Our results show that the G-hash method achieves state-of-the-art performance for k-nearest-neighbor (k-NN) classification. Moreover, the similarity measurement and the index structure are scalable to large chemical databases, with smaller index sizes and faster query processing times than state-of-the-art indexing methods such as Daylight fingerprints, C-tree, and GraphGrep. Efficient similarity query processing for large chemical databases is challenging, since running-time efficiency must be balanced against similarity search accuracy. Our previous similarity search method, G-hash, provides a new way to perform similarity search in chemical databases. An experimental study validates the utility of G-hash in chemical databases.
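
The following is a deliberately simplified sketch in the spirit of hashed graph-kernel similarity, not the authors' G-hash implementation: each molecule is reduced to a hash table of atom-environment counts, and similarity is a histogram-intersection kernel over those tables.

```python
# Simplified hashed graph-kernel similarity: one coarse feature per atom
# (its element plus the sorted elements of its neighbors), counted in a
# hash table; the kernel is the histogram intersection of two tables.
from collections import Counter

def atom_environments(mol):
    """mol: {atom_id: (element, [neighbor atom_ids])}."""
    feats = []
    for a, (elem, nbrs) in mol.items():
        nbr_elems = sorted(mol[n][0] for n in nbrs)
        feats.append(hash((elem, tuple(nbr_elems))))
    return Counter(feats)

def kernel(m1, m2):
    h1, h2 = atom_environments(m1), atom_environments(m2)
    return sum(min(h1[k], h2[k]) for k in h1)  # histogram intersection

ethanol  = {0: ('C', [1]), 1: ('C', [0, 2]), 2: ('O', [1])}
methanol = {0: ('C', [1]), 1: ('O', [0])}
print(kernel(ethanol, methanol))  # shared C-O environment -> 1
```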

  6. Empirical ground-motion relations for subduction-zone earthquakes and their application to Cascadia and other regions

    USGS Publications Warehouse

    Atkinson, G.M.; Boore, D.M.

    2003-01-01

    Ground-motion relations for earthquakes that occur in subduction zones are an important input to seismic-hazard analyses in many parts of the world. In the Cascadia region (Washington, Oregon, northern California, and British Columbia), for example, there is a significant hazard from megathrust earthquakes along the subduction interface and from large events within the subducting slab. These hazards are in addition to the hazard from shallow earthquakes in the overlying crust. We have compiled a response spectra database from thousands of strong-motion recordings from events of moment magnitude (M) 5-8.3 occurring in subduction zones around the world, including both interface and in-slab events. The 2001 M 6.8 Nisqually and 1999 M 5.9 Satsop earthquakes are included in the database, as are many records from subduction zones in Japan (Kyoshin-Net data), Mexico (Guerrero data), and Central America. The size of the database is four times larger than that available for previous empirical regressions to determine ground-motion relations for subduction-zone earthquakes. The large dataset enables improved determination of attenuation parameters and magnitude scaling, for both interface and in-slab events. Soil response parameters are also better determined by the data. We use the database to develop global ground-motion relations for interface and in-slab earthquakes, using a maximum likelihood regression method. We analyze regional variability of ground-motion amplitudes across the global database and find that there are significant regional differences. In particular, amplitudes in Cascadia differ by more than a factor of 2 from those in Japan for the same magnitude, distance, event type, and National Earthquake Hazards Reduction Program (NEHRP) soil class. This is believed to be due to regional differences in the depth of the soil profile, which are not captured by the NEHRP site classification scheme. Regional correction factors to account for these differences are proposed for Cascadia and Japan. The results of this study differ significantly from previous analyses based on more limited data and have important implications for seismic-hazard analysis. The ground-motion relations predict that a great megathrust earthquake (M ~8) at a fault distance of about 100 km would produce pseudoacceleration (PSA), 5% damped, horizontal component on soil sites of about 110 cm/sec2 at 0.5 Hz, 660 cm/sec2 at 2.5 Hz, and 410 cm/sec2 at 5 Hz, with a peak ground acceleration of about 180 cm/sec2. These damaging levels of motion would be experienced over a very large area, corresponding to a rectangular area about 300 km wide by 500 km long. Large in-slab events (M 7.5) would produce even higher PSA values within 100 km of the fault, but the in-slab motions attenuate much more rapidly with distance. Thus the hazard posed by moderate to large in-slab events such as the 2001 Nisqually earthquake is modest compared to that of a Cascadia megathrust earthquake of M ~8, in terms of the area that would experience damaging levels of ground motion.
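
A hedged, minimal illustration of fitting an attenuation relation of the generic form log10(PSA) = c0 + c1*M + c2*log10(R) + c3*R on synthetic data; the paper's actual functional form, soil terms, and maximum-likelihood treatment of event terms are considerably richer than this ordinary-least-squares sketch.

```python
# Fit a toy attenuation relation to synthetic magnitude/distance data.
import numpy as np

rng = np.random.default_rng(1)
M = rng.uniform(5.0, 8.3, size=500)          # moment magnitude
R = rng.uniform(20.0, 300.0, size=500)       # fault distance, km
true = 1.2 + 0.6 * M - 1.1 * np.log10(R) - 0.002 * R
log_psa = true + rng.normal(scale=0.3, size=500)   # observed log10(PSA)

# Design matrix for log10(PSA) = c0 + c1*M + c2*log10(R) + c3*R
X = np.column_stack([np.ones_like(M), M, np.log10(R), R])
coef, *_ = np.linalg.lstsq(X, log_psa, rcond=None)
print(coef)   # estimates of (c0, c1, c2, c3)
```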

  7. Evaluation of "shotgun" proteomics for identification of biological threat agents in complex environmental matrixes: experimental simulations.

    PubMed

    Verberkmoes, Nathan C; Hervey, W Judson; Shah, Manesh; Land, Miriam; Hauser, Loren; Larimer, Frank W; Van Berkel, Gary J; Goeringer, Douglas E

    2005-02-01

    There is currently a great need for rapid detection and positive identification of biological threat agents, as well as microbial species in general, directly from complex environmental samples. This need is most urgent in the area of homeland security, but also extends into the medical, environmental, and agricultural sciences. Mass-spectrometry-based analysis is one of the leading technologies in the field, with a diversity of different methodologies for biothreat detection. Over the past few years, "shotgun" proteomics has become one method of choice for the rapid analysis of complex protein mixtures by mass spectrometry. Recently, it was demonstrated that this methodology is capable of distinguishing a target species against a large database of background species from a single-component sample or from dual-component mixtures of roughly equal concentration. Here, we examine the potential of shotgun proteomics to analyze a target species in a background of four contaminant species. We tested the capability of a common commercial mass-spectrometry-based shotgun proteomics platform for the detection of the target species (Escherichia coli) at four different concentrations and four different analysis times. We also tested the effect of database size on positive identification of the four microbes used in this study by testing a small (13-species) database and a large (261-species) database. The results clearly indicated that this technology could easily identify the target species at 20% in the background mixture with a 60, 120, 180, or 240 min analysis time and the small database. The results also indicated that the target species could easily be identified at 20% or 6%, but could not be identified at 0.6% or 0.06%, in either a 240 min analysis or a 30 h analysis with the small database. The effect of the large database on the target species was severe: detection above the background at any concentration used in this study was impossible, although the three other microbes used in this study were clearly identified above the background when analyzed with the large database. This study points to the potential application of this technology for biological threat agent detection, but highlights many areas of needed research before the technology will be useful for real-world samples.

  8. Optical components damage parameters database system

    NASA Astrophysics Data System (ADS)

    Tao, Yizheng; Li, Xinglan; Jin, Yuquan; Xie, Dongmei; Tang, Dingyong

    2012-10-01

    Optical components are key elements of large-scale laser devices: their load capacity is directly related to the device's output capability, and that load capacity depends on many factors. By digitizing the various factors affecting optical component damage parameters in a database, data support can be provided as a scientific basis for assessing the load capacity of optical components. Using business-process and model-driven approaches, we established an information model for component damage parameters and a database system built upon it. Application of the system shows that it meets the business-process and data-management requirements of optical component damage testing; component parameters are flexible and configurable, and the system is simple and easy to use, improving the efficiency of optical component damage testing.

  9. E-MSD: improving data deposition and structure quality.

    PubMed

    Tagari, M; Tate, J; Swaminathan, G J; Newman, R; Naim, A; Vranken, W; Kapopoulou, A; Hussain, A; Fillon, J; Henrick, K; Velankar, S

    2006-01-01

    The Macromolecular Structure Database (MSD) (http://www.ebi.ac.uk/msd/) [H. Boutselakis, D. Dimitropoulos, J. Fillon, A. Golovin, K. Henrick, A. Hussain, J. Ionides, M. John, P. A. Keller, E. Krissinel et al. (2003) E-MSD: the European Bioinformatics Institute Macromolecular Structure Database. Nucleic Acids Res., 31, 458-462.] group is one of the three partners in the worldwide Protein DataBank (wwPDB), the consortium entrusted with the collation, maintenance and distribution of the global repository of macromolecular structure data [H. Berman, K. Henrick and H. Nakamura (2003) Announcing the worldwide Protein Data Bank. Nature Struct. Biol., 10, 980.]. Since its inception, the MSD group has worked with partners around the world to improve the quality of PDB data, through a clean-up programme that addresses inconsistencies and inaccuracies in the legacy archive. The improvements in data quality in the legacy archive have been achieved largely through the creation of a unified data archive, in the form of a relational database that stores all of the data in the wwPDB. The three partners are working towards improving the tools and methods for the deposition of new data by the community at large. The implementation of the MSD database, together with the parallel development of improved tools and methodologies for data harvesting, validation and archival, has led to significant improvements in the quality of data that enters the archive. Through this and related projects in the NMR and EM realms, the MSD continues to improve the quality of publicly available structural data.

  10. Quantum Search in Hilbert Space

    NASA Technical Reports Server (NTRS)

    Zak, Michail

    2003-01-01

    A proposed quantum-computing algorithm would perform a search for an item of information in a database stored in a Hilbert-space memory structure. The algorithm is intended to make it possible to search relatively quickly through a large database under conditions in which available computing resources would otherwise be considered inadequate to perform such a task. The algorithm would apply, more specifically, to a relational database in which information would be stored in a set of N complex orthonormal vectors, each of N dimensions (where N can be exponentially large). Each vector would constitute one row of a unitary matrix, from which one would derive the Hamiltonian operator (and hence the evolutionary operator) of a quantum system. In other words, all the stored information would be mapped onto a unitary operator acting on a quantum state that would represent the item of information to be retrieved. Then one could exploit quantum parallelism: one could pose all search queries simultaneously by performing a quantum measurement on the system. In so doing, one would effectively solve the search problem in one computational step. One could exploit the direct- and inner-product decomposability of the unitary matrix to make the dimensionality of the memory space exponentially large by use of only linear resources. However, inasmuch as the necessary preprocessing (the mapping of the stored information into a Hilbert space) could be exponentially expensive, the proposed algorithm would likely be most beneficial in applications in which the resources available for preprocessing were much greater than those available for searching.
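
A classical numerical toy (not a quantum implementation) showing the storage mapping described above: N orthonormal vectors form the rows of a unitary U, and applying U to the conjugate of row k returns the k-th basis state, "retrieving" item k.

```python
# Toy illustration: stored records are the rows of a unitary matrix U;
# applying U to the conjugate of row k yields the k-th basis vector.
import numpy as np

N = 8
rng = np.random.default_rng(2)
# Build a random unitary via QR decomposition of a complex matrix.
A = rng.normal(size=(N, N)) + 1j * rng.normal(size=(N, N))
U, _ = np.linalg.qr(A)

k = 3
record = U[k].conj()                 # the stored "record" as a state vector
result = U @ record                  # apply the operator to the state
print(np.round(np.abs(result), 6))   # ~1 at index k, ~0 elsewhere
```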

  11. What do data used to develop ground-motion prediction equations tell us about motions near faults?

    USGS Publications Warehouse

    Boore, David M.

    2014-01-01

    A large database of ground motions from shallow earthquakes occurring in active tectonic regions around the world, recently developed in the Pacific Earthquake Engineering Center’s NGA-West2 project, has been used to investigate what such a database can say about the properties and processes of crustal fault zones. There are a relatively small number of near-rupture records, implying that few recordings in the database are within crustal fault zones, but the records that do exist emphasize the complexity of ground-motion amplitudes and polarization close to individual faults. On average over the whole data set, however, the scaling of ground motions with magnitude at a fixed distance, and the distance dependence of the ground motions, seem to be largely consistent with simple seismological models of source scaling, path propagation effects, and local site amplification. The data show that ground motions close to large faults, as measured by elastic response spectra, tend to saturate and become essentially constant for short periods. This saturation seems to be primarily a geometrical effect, due to the increasing size of the rupture surface with magnitude, and not due to a breakdown in self similarity.

  12. What Do Data Used to Develop Ground-Motion Prediction Equations Tell Us About Motions Near Faults?

    NASA Astrophysics Data System (ADS)

    Boore, David M.

    2014-11-01

    A large database of ground motions from shallow earthquakes occurring in active tectonic regions around the world, recently developed in the Pacific Earthquake Engineering Center's NGA-West2 project, has been used to investigate what such a database can say about the properties and processes of crustal fault zones. There are a relatively small number of near-rupture records, implying that few recordings in the database are within crustal fault zones, but the records that do exist emphasize the complexity of ground-motion amplitudes and polarization close to individual faults. On average over the whole data set, however, the scaling of ground motions with magnitude at a fixed distance, and the distance dependence of the ground motions, seem to be largely consistent with simple seismological models of source scaling, path propagation effects, and local site amplification. The data show that ground motions close to large faults, as measured by elastic response spectra, tend to saturate and become essentially constant for short periods. This saturation seems to be primarily a geometrical effect, due to the increasing size of the rupture surface with magnitude, and not due to a breakdown in self similarity.

  13. MouseNet database: digital management of a large-scale mutagenesis project.

    PubMed

    Pargent, W; Heffner, S; Schäble, K F; Soewarto, D; Fuchs, H; Hrabé de Angelis, M

    2000-07-01

    The Munich ENU Mouse Mutagenesis Screen is a large-scale mutant production, phenotyping, and mapping project. It encompasses two animal breeding facilities and a number of screening groups located in the general area of Munich. A central database is required to manage and process the immense amount of data generated by the mutagenesis project. This database, which we named MouseNet(c), runs on a Sybase platform and will eventually store and process all data from the entire project. In addition, the system comprises a portfolio of functions needed to support the workflow management of the core facility and the screening groups. MouseNet(c) will make all of the data available to the participating screening groups, and later to the international scientific community.

    MouseNet(c) will consist of three major software components:
    * Animal Management System (AMS)
    * Sample Tracking System (STS)
    * Result Documentation System (RDS)

    MouseNet(c) provides the following major advantages:
    * accessibility from different client platforms via the Internet
    * a full-featured multi-user system (including access restriction and data locking mechanisms)
    * reliance on a professional RDBMS (relational database management system) running on a UNIX server platform
    * workflow functions and a variety of plausibility checks.

  14. Assessing the general safety and tolerability of vildagliptin: value of pooled analyses from a large safety database versus evaluation of individual studies

    PubMed Central

    Schweizer, Anja; Dejager, Sylvie; Foley, James E; Kothny, Wolfgang

    2011-01-01

    Aim: Analyzing safety aspects of a drug from individual studies can lead to difficult-to-interpret results. The aim of this paper is therefore to assess the general safety and tolerability, including incidences of the most common adverse events (AEs), of vildagliptin based on a large pooled database of Phase II and III clinical trials. Methods: Safety data were pooled from 38 studies of ≥12 to ≥104 weeks’ duration. AE profiles of vildagliptin (50 mg bid; N = 6116) were evaluated relative to a pool of comparators (placebo and active comparators; N = 6210). Absolute incidence rates were calculated for all AEs, serious AEs (SAEs), discontinuations due to AEs, and deaths. Results: Overall AEs, SAEs, discontinuations due to AEs, and deaths were all reported with a similar frequency in patients receiving vildagliptin (69.1%, 8.9%, 5.7%, and 0.4%, respectively) and patients receiving comparators (69.0%, 9.0%, 6.4%, and 0.4%, respectively), whereas drug-related AEs were seen with a lower frequency in vildagliptin-treated patients (15.7% vs 21.7% with comparators). The incidences of the most commonly reported specific AEs were also similar between vildagliptin and comparators, except for increased incidences of hypoglycemia, tremor, and hyperhidrosis in the comparator group related to the use of sulfonylureas. Conclusions: The present pooled analysis shows that vildagliptin was overall well tolerated in clinical trials of up to >2 years in duration. The data further emphasize the value of a pooled analysis from a large safety database versus assessing safety and tolerability from individual studies. PMID:21415917

  15. RARGE II: an integrated phenotype database of Arabidopsis mutant traits using a controlled vocabulary.

    PubMed

    Akiyama, Kenji; Kurotani, Atsushi; Iida, Kei; Kuromori, Takashi; Shinozaki, Kazuo; Sakurai, Tetsuya

    2014-01-01

    Arabidopsis thaliana is one of the most popular experimental plants. However, only 40% of its genes have at least one experimental Gene Ontology (GO) annotation assigned. Systematic observation of mutant phenotypes is an important technique for elucidating gene functions. Indeed, several large-scale phenotypic analyses have been performed and have generated phenotypic data sets from many Arabidopsis mutant lines and overexpressing lines, which are freely available online. Since each Arabidopsis mutant line database uses individual phenotype expression, the differences in the structured term sets used by each database make it difficult to compare data sets and make it impossible to search across databases. Therefore, we obtained publicly available information for a total of 66,209 Arabidopsis mutant lines, including loss-of-function (RATM and TARAPPER) and gain-of-function (AtFOX and OsFOX) lines, and integrated the phenotype data by mapping the descriptions onto Plant Ontology (PO) and Phenotypic Quality Ontology (PATO) terms. This approach made it possible to manage the four different phenotype databases as one large data set. Here, we report a publicly accessible web-based database, the RIKEN Arabidopsis Genome Encyclopedia II (RARGE II; http://rarge-v2.psc.riken.jp/), in which all of the data described in this study are included. Using the database, we demonstrated consistency (in terms of protein function) with a previous study and identified the presumed function of an unknown gene. We provide examples of AT1G21600, which is a subunit in the plastid-encoded RNA polymerase complex, and AT5G56980, which is related to the jasmonic acid signaling pathway.

  16. Monitoring of services with non-relational databases and map-reduce framework

    NASA Astrophysics Data System (ADS)

    Babik, M.; Souto, F.

    2012-12-01

    Service Availability Monitoring (SAM) is a well-established monitoring framework that performs regular measurements of the core site services and reports the corresponding availability and reliability of the Worldwide LHC Computing Grid (WLCG) infrastructure. One of the existing extensions of SAM is Site Wide Area Testing (SWAT), which gathers monitoring information from the worker nodes via instrumented jobs. This generates quite a lot of monitoring data to process, as there are several data points for every job and several million jobs are executed every day. The recent uptake of non-relational databases opens a new paradigm in the large-scale storage and distributed processing of systems with heavy read-write workloads. For SAM this brings new possibilities to improve its model, from performing aggregation of measurements to storing raw data and subsequent re-processing. Both SAM and SWAT are currently tuned to run at top performance, reaching some of the limits in storage and processing power of their existing Oracle relational database. We investigated the usability and performance of non-relational storage together with its distributed data processing capabilities. For this, several popular systems have been compared. In this contribution we describe our investigation of the existing non-relational databases suited for monitoring systems covering Cassandra, HBase and MongoDB. Further, we present our experiences in data modeling and prototyping map-reduce algorithms focusing on the extension of the already existing availability and reliability computations. Finally, possible future directions in this area are discussed, analyzing the current deficiencies of the existing Grid monitoring systems and proposing solutions to leverage the benefits of the non-relational databases to get more scalable and flexible frameworks.
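
A minimal map-reduce-style sketch of the kind of aggregation described: raw probe results are mapped to ((site, service), ok-flag) pairs, grouped, and reduced to availability fractions. The record fields and site names are illustrative.

```python
# Map-reduce style aggregation of service-probe results into
# per-(site, service) availability fractions.
from collections import defaultdict

probes = [
    {'site': 'CERN-PROD', 'service': 'CE',  'status': 'OK'},
    {'site': 'CERN-PROD', 'service': 'CE',  'status': 'CRITICAL'},
    {'site': 'RAL-LCG2',  'service': 'SRM', 'status': 'OK'},
]

# map: emit key/value pairs
mapped = (((p['site'], p['service']), 1 if p['status'] == 'OK' else 0)
          for p in probes)

# shuffle + reduce: group by key, then aggregate
groups = defaultdict(list)
for key, ok in mapped:
    groups[key].append(ok)

availability = {key: sum(v) / len(v) for key, v in groups.items()}
print(availability)  # {('CERN-PROD', 'CE'): 0.5, ('RAL-LCG2', 'SRM'): 1.0}
```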

  17. Performance-Oriented Privacy-Preserving Data Integration

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Pon, R K; Critchlow, T

    2004-09-15

    Current solutions to integrating private data with public data have provided useful privacy metrics, such as relative information gain, that can be used to evaluate alternative approaches. Unfortunately, they have not addressed critical performance issues, especially when the public database is very large. The use of hashes and noise yields better performance than existing techniques while still making it difficult for unauthorized entities to distinguish which data items truly exist in the private database. As we show here, by leveraging the uncertainty introduced by hash collisions and the injection of noise, we present a technique for performing a relational join operation between a massive public table and a relatively smaller private one.
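
A rough sketch of the idea under stated assumptions (a shared salt, SHA-256 hashing, and random decoy values standing in for the paper's noise injection): the private side publishes only hashes, and the public side streams its table against them, yielding a candidate join that may be a superset of the true one.

```python
# Privacy-preserving join sketch: the private party reveals salted
# hashes of its join keys mixed with decoy hashes, so the public side
# cannot tell exactly which keys the private table holds.
import hashlib
import secrets

def h(key, salt):
    return hashlib.sha256((salt + key).encode()).hexdigest()

salt = 'shared-secret-salt'                        # agreed out of band
private_keys = ['alice', 'bob']
noise = [secrets.token_hex(8) for _ in range(4)]   # injected decoys

# What the private party publishes: true hashes mixed with noise hashes.
published = {h(k, salt) for k in private_keys} | {h(n, salt) for n in noise}

# Public side: stream its (large) table, keep rows whose hashed key matches.
public_table = [('alice', 'row1'), ('carol', 'row2'), ('bob', 'row3')]
candidates = [row for row in public_table if h(row[0], salt) in published]
print(candidates)   # may be a superset of the true join (noise, collisions)
```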

  18. Potential relative increment (PRI): a new method to empirically derive optimal tree diameter growth

    Treesearch

    Don C Bragg

    2001-01-01

    Potential relative increment (PRI) is a new method to derive optimal diameter growth equations using inventory information from a large public database. Optimal growth equations for 24 species were developed using plot and tree records from several states (Michigan, Minnesota, and Wisconsin) of the North Central US. Most species were represented by thousands of...

  19. Origine et developpement des industries de la langue (Origin and Development of Language Utilities). Publication K-8.

    ERIC Educational Resources Information Center

    L'Homme, Marie-Claude

    The evolution of "language utilities," a concept confined largely to the francophone world and relating to the uses of language in computer science and the use of computer science for languages, is chronicled. The language utilities are of three types: (1) tools for language development, primarily dictionary databases and related tools;…

  20. Integrative neuroscience: the role of a standardized database.

    PubMed

    Gordon, E; Cooper, N; Rennie, C; Hermens, D; Williams, L M

    2005-04-01

    Most brain-related databases bring together specialized information, with a growing number that include neuroimaging measures. This article outlines the potential use and insights from the first entirely standardized and centralized database, which integrates information from neuroimaging measures (EEG, event-related potential (ERP), structural/functional MRI), arousal (skin conductance responses (SCRs), heart rate, respiration), neuropsychological and personality tests, genomics, and demographics: the Brain Resource International Database. It comprises data from over 2000 "normative" subjects and a growing number of patients with neurological and psychiatric illnesses, acquired from over 50 laboratories (in the USA, United Kingdom, Holland, South Africa, Israel, and Australia), all with identical equipment and experimental procedures. Three primary goals of this database are to quantify individual differences in normative brain function, to compare an individual's performance to their database peers, and to provide a robust normative framework for clinical assessment and treatment prediction. We present three example demonstrations in relation to these goals. First, we show how consistent age differences may be quantified when large subject numbers are available, using EEG and ERP data from nearly 2000 stringently screened normative subjects. Second, the use of a normalization technique provides a means to compare clinical subjects (50 ADHD subjects in this study) to the normative database with the effects of age and gender taken into account. Third, we show how a profile of EEG/ERP and autonomic measures potentially provides a means to predict treatment response in ADHD subjects. The example data consist of EEG under eyes open and eyes closed and ERP data for auditory oddball, working memory, and Go-NoGo paradigms. Autonomic measures of skin conductance (tonic skin conductance level, SCL, and phasic skin conductance responses, SCRs) were acquired simultaneously with central EEG/ERP measures. The findings show that the power of large samples, tested using standardized protocols, allows for the quantification of individual differences that can subsequently be used to control such variation and to enhance the sensitivity and specificity of comparisons between normative and clinical groups. In terms of broader significance, the combination of size and multidimensional measures tapping the brain's core cognitive competencies may provide a normative and evidence-based framework for individually based assessments in "Personalized Medicine."
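
A minimal sketch of the peer-comparison idea: an individual's measure is converted to a z-score against database subjects of the same gender and similar age. The column names and values are invented for illustration.

```python
# Compare an individual's measure to normative database peers matched
# on gender and (approximately) age, via a z-score.
import pandas as pd

norms = pd.DataFrame({
    'age':      [14, 15, 16, 14, 15, 16],
    'gender':   ['M', 'M', 'M', 'F', 'F', 'F'],
    'p300_amp': [12.1, 11.4, 10.9, 13.0, 12.2, 11.5],
})

def z_score(value, age, gender, tol=1):
    peers = norms[(norms.gender == gender) & (norms.age.sub(age).abs() <= tol)]
    return (value - peers.p300_amp.mean()) / peers.p300_amp.std()

print(z_score(9.0, age=15, gender='M'))   # negative => below peer mean
```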

  1. Data Representations for Geographic Information Systems.

    ERIC Educational Resources Information Center

    Shaffer, Clifford A.

    1992-01-01

    Surveys the field and literature of geographic information systems (GIS) and spatial data representation as it relates to GIS. Highlights include GIS terms, data types, and operations; vector representations and raster, or grid, representations; spatial indexing; elevation data representations; large spatial databases; and problem areas and future…

  2. The power of an ontology-driven developmental toxicity database for data mining and computational modeling

    EPA Science Inventory

    Modeling of developmental toxicology presents a significant challenge to computational toxicology due to endpoint complexity and lack of data coverage. These challenges largely account for the relatively few modeling successes using the structure–activity relationship (SAR) parad...

  3. Role of Chemical Reactivity and Transition State Modeling for Virtual Screening.

    PubMed

    Karthikeyan, Muthukumarasamy; Vyas, Renu; Tambe, Sanjeev S; Radhamohan, Deepthi; Kulkarni, Bhaskar D

    2015-01-01

    Every drug discovery research program involves the synthesis of a novel and potential drug molecule using atom-efficient, economical, and environmentally friendly synthetic strategies. The current work focuses on the role of reactivity-based fingerprints of compounds as filters for virtual screening, using a tool called ChemScore. A reactant-like (RLS) and a product-like (PLS) score can be predicted for a given compound using binary fingerprints derived from the numerous known organic reactions, which capture molecule-molecule interactions in the form of addition, substitution, rearrangement, elimination, and isomerization reactions. The reaction fingerprints were applied to large databases in biology and chemistry, namely ChEMBL, KEGG, HMDB, DSSTox, and the Drug Bank database. A large network of 1113 synthetic reactions was constructed to visualize and ascertain the reactant-product mappings in the chemical reaction space. The cumulative reaction fingerprints were computed for 4000 molecules belonging to 29 therapeutic classes of compounds, and these were found capable of discriminating between cognition-disorder-related and anti-allergy compounds with a reasonable accuracy of 75% and an AUC of 0.8. In this study, transition-state-based fingerprints were also developed and used effectively for virtual screening in drug-related databases. The methodology presented here provides an efficient handle for the rapid scoring of molecular libraries for virtual screening.

  4. Covariant Evolutionary Event Analysis for Base Interaction Prediction Using a Relational Database Management System for RNA.

    PubMed

    Xu, Weijia; Ozer, Stuart; Gutell, Robin R

    2009-01-01

    With increasingly large numbers of properly aligned sequences, comparative sequence analysis can accurately identify not only common structures formed by standard base pairing but also new types of structural elements and constraints. However, traditional methods are too computationally expensive to perform well on large-scale alignments and are less effective with sequences from diversified phylogenetic classifications. We propose a new approach that utilizes coevolution rates among pairs of nucleotide positions, using the phylogenetic and evolutionary relationships of the organisms of the aligned sequences. With a novel data schema to manage the relevant information within a relational database, our method, implemented with Microsoft SQL Server 2005, showed 90% sensitivity in identifying base pair interactions among 16S ribosomal RNA sequences from Bacteria, at a scale 40 times larger, and with 50% better sensitivity, than a previous study. The results also indicated covariation signals for a few sets of cross-strand base stacking pairs in secondary structure helices, and other subtle constraints in the RNA structure.
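
For illustration, the snippet below scores covariation between two alignment columns with mutual information; the authors' method additionally weights events by the phylogenetic relationships of the organisms, which this sketch omits.

```python
# Mutual information between two alignment columns as a simple
# covariation score: high MI suggests a base-pairing constraint.
from collections import Counter
from math import log2

def mutual_information(col_i, col_j):
    n = len(col_i)
    pi, pj = Counter(col_i), Counter(col_j)
    pij = Counter(zip(col_i, col_j))
    return sum((c / n) * log2((c / n) / ((pi[a] / n) * (pj[b] / n)))
               for (a, b), c in pij.items())

# Two perfectly covarying columns (as in a conserved base pair):
col_i = list('AAGGCCGG')
col_j = list('UUCCGGCC')
print(mutual_information(col_i, col_j))   # -> 1.5 bits
```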

  5. Covariant Evolutionary Event Analysis for Base Interaction Prediction Using a Relational Database Management System for RNA

    PubMed Central

    Xu, Weijia; Ozer, Stuart; Gutell, Robin R.

    2010-01-01

    With increasingly large numbers of properly aligned sequences, comparative sequence analysis can accurately identify not only common structures formed by standard base pairing but also new types of structural elements and constraints. However, traditional methods are too computationally expensive to perform well on large-scale alignments and are less effective with sequences from diversified phylogenetic classifications. We propose a new approach that utilizes coevolution rates among pairs of nucleotide positions, using the phylogenetic and evolutionary relationships of the organisms of the aligned sequences. With a novel data schema to manage the relevant information within a relational database, our method, implemented with Microsoft SQL Server 2005, showed 90% sensitivity in identifying base pair interactions among 16S ribosomal RNA sequences from Bacteria, at a scale 40 times larger, and with 50% better sensitivity, than a previous study. The results also indicated covariation signals for a few sets of cross-strand base stacking pairs in secondary structure helices, and other subtle constraints in the RNA structure. PMID:20502534

  6. National Databases for Neurosurgical Outcomes Research: Options, Strengths, and Limitations.

    PubMed

    Karhade, Aditya V; Larsen, Alexandra M G; Cote, David J; Dubois, Heloise M; Smith, Timothy R

    2017-08-05

    Quality improvement, value-based care delivery, and personalized patient care depend on robust clinical, financial, and demographic data streams of neurosurgical outcomes. The neurosurgical literature lacks a comprehensive review of large national databases. To assess the strengths and limitations of various resources for outcomes research in neurosurgery. A review of the literature was conducted to identify surgical outcomes studies using national data sets. The databases were assessed for the availability of patient demographics and clinical variables, longitudinal follow-up of patients, strengths, and limitations. The number of unique patients contained within each data set ranged from thousands (Quality Outcomes Database [QOD]) to hundreds of millions (MarketScan). Databases with both clinical and financial data included PearlDiver, Premier Healthcare Database, Vizient Clinical Data Base and Resource Manager, and the National Inpatient Sample. Outcomes collected by databases included patient-reported outcomes (QOD); 30-day morbidity, readmissions, and reoperations (National Surgical Quality Improvement Program); and disease incidence and disease-specific survival (Surveillance, Epidemiology, and End Results-Medicare). The strengths of large databases included large numbers of rare pathologies and multi-institutional nationally representative sampling; the limitations of these databases included variable data veracity, variable data completeness, and missing disease-specific variables. The improvement of existing large national databases and the establishment of new registries will be crucial to the future of neurosurgical outcomes research. Copyright © 2017 by the Congress of Neurological Surgeons

  7. On the connection of gamma-ray bursts and X-ray flashes in the BATSE and RHESSI databases

    NASA Astrophysics Data System (ADS)

    Řípa, J.; Mészáros, A.

    2016-12-01

    Classification of gamma-ray bursts (GRBs) into groups has been intensively studied with various statistical tests in previous years. It has been suggested that there is a distinct group of GRBs, beyond the long and short ones, with intermediate durations. However, such a group has not been securely confirmed yet. Strangely, concerning spectral hardness, the observations from the Swift and RHESSI satellites give different results. For the Swift/BAT database it is found that the intermediate-duration bursts might well be related to so-called X-ray flashes (XRFs). On the other hand, for the RHESSI dataset the intermediate-duration bursts seem to be spectrally too hard to be XRFs. The connection between the intermediate-duration bursts and XRFs in the BATSE database is likewise unclear. The purpose of this article is to check the relation between XRFs and GRBs in the BATSE and RHESSI databases. We use an empirical definition of XRFs introduced earlier by other authors. For the RHESSI database we also use a transformation between the detected counts and the fluences, based on the simulated detector response function, in order to compare the hardnesses of GRBs against the definition of XRFs. There is a 1.3-4.2% fraction of XRFs in the whole BATSE database. The vast majority of the BATSE short bursts are not XRFs, because only 0.7-5.7% of the short bursts can be XRFs. However, there is a large uncertainty in the fraction of XRFs among the intermediate-duration bursts: between 1% and 85% of the BATSE intermediate-duration bursts can be related to XRFs. For the long bursts this fraction is between 1.0% and 3.4%. Although the uncertainties in these fractions are large, it can be claimed that not all BATSE intermediate-duration bursts can be XRFs. At least 79% of RHESSI short bursts, at least 53% of RHESSI intermediate-duration bursts, and at least 45% of RHESSI long bursts should not be XRFs. A simulation of XRFs observed by HETE-2 and Swift has shown that RHESSI would detect, and in fact detected, only one long-duration XRF out of the 26 observed by those two satellites. We arrive at the conclusion that the intermediate-duration bursts in the BATSE database can be partly populated by XRFs, but the RHESSI intermediate-duration bursts are most likely not XRFs. The results claiming that the Swift/BAT intermediate-duration bursts are closely related to XRFs do not hold for the BATSE and RHESSI databases.
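
Schematically, applying an empirical hardness-based XRF definition to a burst catalog reduces to a fluence-ratio threshold, as in the sketch below; the energy bands, the threshold of 1.0, and the catalog entries are illustrative assumptions, not the exact definition used in the paper.

```python
# Label a burst an XRF when its X-ray fluence dominates its gamma-ray
# fluence (illustrative threshold; real definitions fix specific bands).
bursts = [
    {'id': 'GRB1', 'fluence_x': 3.2e-7, 'fluence_gamma': 1.1e-6},
    {'id': 'GRB2', 'fluence_x': 8.5e-7, 'fluence_gamma': 4.0e-7},
]

for b in bursts:
    ratio = b['fluence_x'] / b['fluence_gamma']
    b['class'] = 'XRF' if ratio > 1.0 else 'GRB'
    print(b['id'], round(ratio, 2), b['class'])
```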

  8. Frequency of pacemaker malfunction associated with monopolar electrosurgery during pulse generator replacement or upgrade surgery.

    PubMed

    Lin, Yun; Melby, Daniel P; Krishnan, Balaji; Adabag, Selcuk; Tholakanahalli, Venkatakrishna; Li, Jian-Ming

    2017-08-01

    The aim of this study is to investigate the frequency of electrosurgery-related pacemaker malfunction. A retrospective study was conducted to investigate electrosurgery-related pacemaker malfunction in consecutive patients undergoing pulse generator (PG) replacement or upgrade at two large hospitals in Minneapolis, MN between January 2011 and January 2014. The occurrence of this pacemaker malfunction was then studied using the MAUDE database for all four major device vendors. A total of 1398 consecutive patients from the two large tertiary referral centers in Minneapolis, MN undergoing PG replacement or upgrade surgery were retrospectively studied. Four patients (0.3% of all patients), all with pacemakers from St Jude Medical (2.8%, 4 of 142), had output failure or an inappropriately low pacing rate below 30 bpm during electrosurgery, despite being programmed in an asynchronous mode. During the same period, 1174 cases of pacemaker malfunction were reported for the same models in the MAUDE database, 37 of which (3.2%) were electrosurgery-related. Twenty-four cases (65%) had output failure or an inappropriately low pacing rate. The distribution of adverse events was loss of pacing (59.5%), reversion to backup pacing (32.4%), inappropriately low pacing rate (5.4%), and ventricular fibrillation (2.7%). The majority of these (78.5%) occurred during PG replacement at ERI or upgrade surgery. No electrosurgery-related malfunction was found in the MAUDE database among the 862 pacemaker malfunction cases reported during the same period for other vendors. Electrosurgery during PG replacement or upgrade surgery can trigger output failure or an inappropriately low pacing rate in certain models of modern pacemakers. Caution should be exercised for pacemaker-dependent patients.

  9. Terrestrial Sediments of the Earth: Development of a Global Unconsolidated Sediments Map Database (GUM)

    NASA Astrophysics Data System (ADS)

    Börker, J.; Hartmann, J.; Amann, T.; Romero-Mujalli, G.

    2018-04-01

    Mapped unconsolidated sediments cover half of the global land surface. They are of considerable importance for many Earth surface processes like weathering, hydrological fluxes or biogeochemical cycles. Ignoring their characteristics or spatial extent may lead to misinterpretations in Earth System studies. Therefore, a new Global Unconsolidated Sediments Map database (GUM) was compiled, using regional maps specifically representing unconsolidated and Quaternary sediments. The new GUM database provides insights into the regional distribution of unconsolidated sediments and their properties. The GUM comprises 911,551 polygons and describes not only sediment types and subtypes, but also parameters like grain size, mineralogy, age and thickness where available. Previous global lithological maps or databases lacked detail for reported unconsolidated sediment areas or missed large areas, and reported a global coverage of 25 to 30%, considering the ice-free land area. Here, alluvial sediments cover about 23% of the mapped total ice-free area, followed by aeolian sediments (~21%), glacial sediments (~20%), and colluvial sediments (~16%). A specific focus during the creation of the database was on the distribution of loess deposits, since loess is highly reactive and relevant to understanding geochemical cycles related to dust deposition and weathering processes. An additional layer compiling pyroclastic sediments is added, which merges consolidated and unconsolidated pyroclastic sediments. The compilation shows latitudinal abundances of sediment types related to past climate. The GUM database is available at the PANGAEA database (https://doi.org/10.1594/PANGAEA.884822).

  10. Assessment of COPD-related outcomes via a national electronic medical record database.

    PubMed

    Asche, Carl; Said, Quayyim; Joish, Vijay; Hall, Charles Oaxaca; Brixner, Diana

    2008-01-01

    The technology and sophistication of healthcare utilization databases have expanded over the last decade to include results of lab tests, vital signs, and other clinical information. This review assesses the methodological and analytical challenges of conducting chronic obstructive pulmonary disease (COPD) outcomes research in a national electronic medical records (EMR) dataset, its potential application to the assessment of national health policy issues, and its limitations. An EMR database and its application to measuring outcomes for COPD are described. The ability to measure, in this database, adherence to the evidence-based COPD practice guidelines generated by the NIH and to HEDIS quality indicators was examined. Case studies were used, before and after their publication, to assess adherence to the guidelines and gauge conformity to the quality indicators. The EMR was the only source of information for pulmonary function tests, although the low frequency with which primary care providers ordered them was an issue. The EMR data can be used to explore the impact of variation in healthcare provision on clinical outcomes. The EMR database permits access to specific lab data and biometric information. The richness and depth of information on "real world" use of health services for large population-based analytical studies at relatively low cost render such databases an attractive resource for outcomes research. Various sources of information exist for performing outcomes research. It is important to understand the desired endpoints of such research and choose the appropriate database source.

  11. Prescription-drug-related risk in driving: comparing conventional and lasso shrinkage logistic regressions.

    PubMed

    Avalos, Marta; Adroher, Nuria Duran; Lagarde, Emmanuel; Thiessard, Frantz; Grandvalet, Yves; Contrand, Benjamin; Orriols, Ludivine

    2012-09-01

    Large data sets with many variables present particular challenges when constructing analytic models. Lasso-related methods provide a useful tool, although one that remains unfamiliar to most epidemiologists. We illustrate the application of lasso methods in an analysis of the impact of prescribed drugs on the risk of a road traffic crash, using a large French nationwide database (PLoS Med 2010;7:e1000366). In the original case-control study, the authors analyzed each exposure separately. We use the lasso method, which can simultaneously perform estimation and variable selection in a single model. We compare point estimates and confidence intervals using (1) a separate logistic regression model for each drug with a Bonferroni correction and (2) lasso shrinkage logistic regression analysis. Shrinkage regression had little effect on (bias-corrected) point estimates but led to less conservative results, noticeably for drugs with moderate levels of exposure. Carbamates, carboxamide-derivative and fatty-acid-derivative antiepileptics, drugs used in opioid dependence, and mineral supplements of potassium showed stronger associations. Lasso is a relevant method for the analysis of databases with large numbers of exposures and can be recommended as an alternative to conventional strategies.
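
A minimal sketch of the comparison on synthetic data: a single L1-penalized ("lasso") logistic regression over all exposures at once, in place of one conventional model per drug. The sample sizes, penalty strength, and effect sizes are invented for illustration.

```python
# L1-penalized logistic regression performing simultaneous estimation
# and variable selection over many sparse binary exposures.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
n, p = 5000, 50
X = rng.binomial(1, 0.05, size=(n, p)).astype(float)   # sparse exposures
logit = -2.0 + 0.8 * X[:, 0] + 0.5 * X[:, 1]            # two real effects
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))           # crash responsibility

lasso = LogisticRegression(penalty='l1', solver='liblinear', C=0.5)
lasso.fit(X, y)
selected = np.flatnonzero(lasso.coef_[0])
print(selected)   # exposures retained after simultaneous selection
```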

  12. Cross-checking of Large Evaluated and Experimental Nuclear Reaction Databases

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Zeydina, O.; Koning, A.J.; Soppera, N.

    2014-06-15

    Automated methods are presented for the verification of large experimental and evaluated nuclear reaction databases (e.g. EXFOR, JEFF, TENDL). These methods allow an assessment of the overall consistency of the data and detect aberrant values in both evaluated and experimental databases.
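
One plausible form such an automated check could take (an assumption, since the abstract does not spell out the algorithms): flag measurements that deviate strongly from the median of all values reported at the same point, using a robust MAD criterion.

```python
# Flag aberrant experimental values with a robust median/MAD criterion.
import numpy as np

def aberrant(values, threshold=5.0):
    """Return indices of points > threshold robust sigmas from the median."""
    med = np.median(values)
    mad = np.median(np.abs(values - med)) or 1e-12   # guard against MAD == 0
    return np.flatnonzero(np.abs(values - med) / (1.4826 * mad) > threshold)

# Cross sections (barns) reported by different experiments at one energy:
xs = np.array([2.31, 2.28, 2.35, 2.30, 9.40, 2.33])
print(aberrant(xs))   # -> [4], the outlying measurement
```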

  13. Recent advances in the compilation of holocene relative Sea-level database in North America

    NASA Astrophysics Data System (ADS)

    Horton, B.; Vacchi, M.; Engelhart, S. E.; Nikitina, D.

    2015-12-01

    Reconstruction of relative sea level (RSL) has implications for the investigation of crustal movements, the calibration of earth rheology models, and the reconstruction of ice sheets. In recent years, efforts have been made to create RSL databases following a standardized methodology. These regional databases provide a framework for developing our understanding of the primary mechanisms of RSL change since the Last Glacial Maximum, and a long-term baseline against which to gauge changes in sea level during the 20th century and forecasts for the 21st. Here we present two quality-controlled Holocene RSL databases compiled for North America. Along the Pacific coast of North America (British Columbia, Canada to California, USA), our re-evaluation of sea-level indicators from geological and archaeological investigations yielded 841 RSL data points, mainly from salt and freshwater wetlands or adjacent estuarine sediment, as well as from isolation basins. Along the Atlantic coast of North America (Hudson Bay, Canada to South Carolina, USA), we are currently compiling a database including more than 2000 RSL data points from isolation basins, salt and freshwater wetlands, beach ridges, and intertidal deposits. We outline the difficulties encountered, and the solutions we adopted, in compiling databases across such different depositional environments. We address complex tectonics and a framework for comparing such a large variety of RSL data points. We discuss the implications of our results for glacio-isostatic adjustment (GIA) models in the two study regions.

  14. Appropriateness of the food-pics image database for experimental eating and appetite research with adolescents.

    PubMed

    Jensen, Chad D; Duraccio, Kara M; Barnett, Kimberly A; Stevens, Kimberly S

    2016-12-01

    Research examining the effects of visual food cues on appetite-related brain processes and eating behavior has proliferated. Recently, investigators have developed food-image databases for use across experimental studies examining appetite and eating behavior. The food-pics image database represents a standardized, freely available image library originally validated in a large sample consisting primarily of adults. The suitability of the images for use with adolescents has not been investigated. The aim of the present study was to evaluate the appropriateness of the food-pics image library for appetite and eating research with adolescents. Three hundred and seven adolescents (ages 12-17) provided ratings of recognizability, palatability, and desire to eat for images from the food-pics database. Moreover, participants rated the caloric content (high vs. low) and healthiness (healthy vs. unhealthy) of each image. Adolescents rated approximately 75% of the food images as recognizable. Approximately 65% of recognizable images were correctly categorized as high vs. low calorie, and 63% were correctly classified as healthy vs. unhealthy, in 80% or more of image ratings. These results suggest that a smaller subset of the food-pics image database is appropriate for use with adolescents. With some modifications to the included images, the food-pics image database appears to be appropriate for use in experimental appetite- and eating-related research conducted with adolescents. Copyright © 2016 Elsevier Ltd. All rights reserved.

  15. Bundled payment reimbursement for anterior and posterior approaches for cervical spondylotic myelopathy: an analysis of private payer and Medicare databases.

    PubMed

    Virk, Sohrab S; Phillips, Frank M; Khan, Safdar N

    2018-03-01

    OBJECTIVE Cervical spondylotic myelopathy (CSM) is a progressive spinal condition that often requires surgery. Studies have shown the clinical equivalency of anterior versus posterior approaches for CSM surgery. The purpose of this study was to determine the amount and type of resources used for anterior and posterior surgical treatment of CSM by using large national databases of clinical and financial information from patients. METHODS This study consists of 2 large cohorts of patients who underwent either an anterior or posterior approach for treatment of CSM. These patients were selected from the Medicare 5% National Sample Administrative Database (SAF5) and the Humana orthopedic database (HORTHO), which is a database of patients with private payer health insurance. The outcome measures were the cost of a 90-day episode of care, as well as a breakdown of the cost components for each surgical procedure between 2005 and 2014. RESULTS A total of 16,444 patients were included in this analysis. In HORTHO, there were 10,332 and 1556 patients treated with an anterior or posterior approach for CSM, respectively. In SAF5, there were 3851 and 705 patients who were treated by an anterior or posterior approach for CSM, respectively. The mean ± SD reimbursements for anterior and posterior approaches in the HORTHO database were $20,863 ± $2014 and $23,813 ± $4258, respectively (p = 0.048). The mean ± SD reimbursements for anterior and posterior approaches in the SAF5 database were $18,219 ± $1053 and $25,598 ± $1686, respectively (p < 0.0001). There were also significantly higher reimbursements for a rehabilitation/skilled nursing facility and hospital/inpatient care for patients who underwent a posterior approach in both the private payer and Medicare databases. In all cohorts in this study, the hospital-related reimbursement was more than double the surgeon-related reimbursement. CONCLUSIONS This study provides resource utilization information for a 90-day episode of care for both anterior and posterior approaches for CSM surgery. There is a statistically significant higher resource utilization for patients undergoing the posterior approach for CSM, which is consistent with the literature. Understanding the reimbursement patterns for anterior versus posterior approaches for CSM will help providers design a bundled payment for patients requiring surgery for CSM, and this study suggests that a subset of patients who require the posterior approach for treatment also require greater resources. The data also suggest that hospital-related reimbursement is the major driver of payments.

  16. Surgical research using national databases

    PubMed Central

    Leland, Hyuma; Heckmann, Nathanael

    2016-01-01

    Recent changes in healthcare and advances in technology have increased the use of large-volume national databases in surgical research. These databases have been used to develop perioperative risk stratification tools, assess postoperative complications, calculate costs, and investigate numerous other topics across multiple surgical specialties. The results of these studies contain variable information but are subject to unique limitations. The use of large-volume national databases is increasing in popularity, and thorough understanding of these databases will allow for a more sophisticated and better educated interpretation of studies that utilize such databases. This review will highlight the composition, strengths, and weaknesses of commonly used national databases in surgical research. PMID:27867945

  17. Surgical research using national databases.

    PubMed

    Alluri, Ram K; Leland, Hyuma; Heckmann, Nathanael

    2016-10-01

    Recent changes in healthcare and advances in technology have increased the use of large-volume national databases in surgical research. These databases have been used to develop perioperative risk stratification tools, assess postoperative complications, calculate costs, and investigate numerous other topics across multiple surgical specialties. The results of these studies contain variable information but are subject to unique limitations. The use of large-volume national databases is increasing in popularity, and thorough understanding of these databases will allow for a more sophisticated and better educated interpretation of studies that utilize such databases. This review will highlight the composition, strengths, and weaknesses of commonly used national databases in surgical research.

  18. DWARF – a data warehouse system for analyzing protein families

    PubMed Central

    Fischer, Markus; Thai, Quan K; Grieb, Melanie; Pleiss, Jürgen

    2006-01-01

    Background The emerging field of integrative bioinformatics provides the tools to organize and systematically analyze vast amounts of highly diverse biological data and thus allows researchers to gain a novel understanding of complex biological systems. The data warehouse DWARF applies integrative bioinformatics approaches to the analysis of large protein families. Description The data warehouse system DWARF integrates data on sequence, structure, and functional annotation for protein fold families. The underlying relational data model consists of three major sections representing entities related to the protein (biochemical function, source organism, classification to homologous families and superfamilies), the protein sequence (position-specific annotation, mutant information), and the protein structure (secondary structure information, superimposed tertiary structure). Tools for extracting, transforming and loading data from publicly available resources (ExPDB, GenBank, DSSP) are provided to populate the database. The data can be accessed by an interface for searching and browsing, and by analysis tools that operate on annotation, sequence, or structure. We applied DWARF to the family of α/β-hydrolases to host the Lipase Engineering database. Release 2.3 contains 6138 sequences and 167 experimentally determined protein structures, which are assigned to 37 superfamilies and 103 homologous families. Conclusion DWARF has been designed for constructing databases of large structurally related protein families and for evaluating their sequence-structure-function relationships by a systematic analysis of sequence, structure and functional annotation. It has been applied to predict biochemical properties from sequence, and serves as a valuable tool for protein engineering. PMID:17094801
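
    A minimal sketch of the three-section relational model described above (protein / sequence / structure) is shown below, using SQLite from Python. Table and column names are illustrative assumptions, not the actual DWARF schema.

      # Illustrative three-section schema: protein, sequence annotation,
      # and structure. Names are assumptions, not the DWARF data model.
      import sqlite3

      conn = sqlite3.connect(":memory:")
      conn.executescript("""
      -- Protein section: function, organism, family classification
      CREATE TABLE protein (
          protein_id  INTEGER PRIMARY KEY,
          name        TEXT,
          organism    TEXT,
          superfamily TEXT,
          family      TEXT
      );
      -- Sequence section: position-specific annotation such as mutants
      CREATE TABLE sequence_annotation (
          protein_id  INTEGER REFERENCES protein(protein_id),
          position    INTEGER,
          annotation  TEXT      -- e.g. 'active site', 'mutant: S105A'
      );
      -- Structure section: secondary structure per deposited structure
      CREATE TABLE structure (
          protein_id  INTEGER REFERENCES protein(protein_id),
          pdb_id      TEXT,
          sec_struct  TEXT      -- DSSP-style secondary structure string
      );
      """)

      # Example query: all annotated positions within one superfamily
      rows = conn.execute("""
      SELECT p.name, a.position, a.annotation
      FROM protein p JOIN sequence_annotation a USING (protein_id)
      WHERE p.superfamily = ?;
      """, ("alpha/beta-hydrolases",)).fetchall()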

  19. A generic method for improving the spatial interoperability of medical and ecological databases.

    PubMed

    Ghenassia, A; Beuscart, J B; Ficheur, G; Occelli, F; Babykina, E; Chazard, E; Genin, M

    2017-10-03

    The availability of big data in healthcare and the intensive development of data reuse and georeferencing have opened up perspectives for health spatial analysis. However, fine-scale spatial studies of ecological and medical databases are limited by the change of support problem and thus a lack of spatial unit interoperability. The use of spatial disaggregation methods to solve this problem introduces errors into the spatial estimations. Here, we present a generic, two-step method for merging medical and ecological databases that avoids the use of spatial disaggregation methods, while maximizing the spatial resolution. Firstly, a mapping table is created after one or more transition matrices have been defined. The latter link the spatial units of the original databases to the spatial units of the final database. Secondly, the mapping table is validated by (1) comparing the covariates contained in the two original databases, and (2) checking the spatial validity with a spatial continuity criterion and a spatial resolution index. We used our novel method to merge a medical database (the French national diagnosis-related group database, containing 5644 spatial units) with an ecological database (produced by the French National Institute of Statistics and Economic Studies, and containing 36,594 spatial units). The mapping table yielded 5632 final spatial units. The mapping table's validity was evaluated by comparing the number of births in the medical database and the ecological database in each final spatial unit. The median [interquartile range] relative difference was 2.3% [0; 5.7]. The spatial continuity criterion was low (2.4%), and the spatial resolution index was greater than for most French administrative areas. Our innovative approach improves interoperability between medical and ecological databases and facilitates fine-scale spatial analyses. We have shown that disaggregation models and large aggregation techniques are not necessarily the best ways to tackle the change of support problem.
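
    The validation step lends itself to a short sketch: compare a covariate (here, birth counts) between the two source databases within each final spatial unit of the mapping table. All names and values below are illustrative assumptions.

      # Sketch of the validation step: per-unit relative difference of a
      # covariate between the two source databases. Values are synthetic.
      births_medical = {"unit_001": 120, "unit_002": 85, "unit_003": 40}
      births_ecological = {"unit_001": 118, "unit_002": 90, "unit_003": 41}

      def relative_difference(a: float, b: float) -> float:
          """Relative difference between two counts, in percent."""
          return abs(a - b) / max(a, b) * 100.0

      diffs = sorted(
          relative_difference(births_medical[u], births_ecological[u])
          for u in births_medical
      )
      median = diffs[len(diffs) // 2]  # crude median, fine for a sketch
      print(f"median relative difference: {median:.1f}%")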

  20. The database of the PREDICTS (Projecting Responses of Ecological Diversity In Changing Terrestrial Systems) project.

    PubMed

    Hudson, Lawrence N; Newbold, Tim; Contu, Sara; Hill, Samantha L L; Lysenko, Igor; De Palma, Adriana; Phillips, Helen R P; Alhusseini, Tamera I; Bedford, Felicity E; Bennett, Dominic J; Booth, Hollie; Burton, Victoria J; Chng, Charlotte W T; Choimes, Argyrios; Correia, David L P; Day, Julie; Echeverría-Londoño, Susy; Emerson, Susan R; Gao, Di; Garon, Morgan; Harrison, Michelle L K; Ingram, Daniel J; Jung, Martin; Kemp, Victoria; Kirkpatrick, Lucinda; Martin, Callum D; Pan, Yuan; Pask-Hale, Gwilym D; Pynegar, Edwin L; Robinson, Alexandra N; Sanchez-Ortiz, Katia; Senior, Rebecca A; Simmons, Benno I; White, Hannah J; Zhang, Hanbin; Aben, Job; Abrahamczyk, Stefan; Adum, Gilbert B; Aguilar-Barquero, Virginia; Aizen, Marcelo A; Albertos, Belén; Alcala, E L; Del Mar Alguacil, Maria; Alignier, Audrey; Ancrenaz, Marc; Andersen, Alan N; Arbeláez-Cortés, Enrique; Armbrecht, Inge; Arroyo-Rodríguez, Víctor; Aumann, Tom; Axmacher, Jan C; Azhar, Badrul; Azpiroz, Adrián B; Baeten, Lander; Bakayoko, Adama; Báldi, András; Banks, John E; Baral, Sharad K; Barlow, Jos; Barratt, Barbara I P; Barrico, Lurdes; Bartolommei, Paola; Barton, Diane M; Basset, Yves; Batáry, Péter; Bates, Adam J; Baur, Bruno; Bayne, Erin M; Beja, Pedro; Benedick, Suzan; Berg, Åke; Bernard, Henry; Berry, Nicholas J; Bhatt, Dinesh; Bicknell, Jake E; Bihn, Jochen H; Blake, Robin J; Bobo, Kadiri S; Bóçon, Roberto; Boekhout, Teun; Böhning-Gaese, Katrin; Bonham, Kevin J; Borges, Paulo A V; Borges, Sérgio H; Boutin, Céline; Bouyer, Jérémy; Bragagnolo, Cibele; Brandt, Jodi S; Brearley, Francis Q; Brito, Isabel; Bros, Vicenç; Brunet, Jörg; Buczkowski, Grzegorz; Buddle, Christopher M; Bugter, Rob; Buscardo, Erika; Buse, Jörn; Cabra-García, Jimmy; Cáceres, Nilton C; Cagle, Nicolette L; Calviño-Cancela, María; Cameron, Sydney A; Cancello, Eliana M; Caparrós, Rut; Cardoso, Pedro; Carpenter, Dan; Carrijo, Tiago F; Carvalho, Anelena L; Cassano, Camila R; Castro, Helena; Castro-Luna, Alejandro A; Rolando, Cerda B; Cerezo, Alexis; Chapman, Kim Alan; Chauvat, Matthieu; Christensen, Morten; Clarke, Francis M; Cleary, Daniel F R; Colombo, Giorgio; Connop, Stuart P; Craig, Michael D; Cruz-López, Leopoldo; Cunningham, Saul A; D'Aniello, Biagio; D'Cruze, Neil; da Silva, Pedro Giovâni; Dallimer, Martin; Danquah, Emmanuel; Darvill, Ben; Dauber, Jens; Davis, Adrian L V; Dawson, Jeff; de Sassi, Claudio; de Thoisy, Benoit; Deheuvels, Olivier; Dejean, Alain; Devineau, Jean-Louis; Diekötter, Tim; Dolia, Jignasu V; Domínguez, Erwin; Dominguez-Haydar, Yamileth; Dorn, Silvia; Draper, Isabel; Dreber, Niels; Dumont, Bertrand; Dures, Simon G; Dynesius, Mats; Edenius, Lars; Eggleton, Paul; Eigenbrod, Felix; Elek, Zoltán; Entling, Martin H; Esler, Karen J; de Lima, Ricardo F; Faruk, Aisyah; Farwig, Nina; Fayle, Tom M; Felicioli, Antonio; Felton, Annika M; Fensham, Roderick J; Fernandez, Ignacio C; Ferreira, Catarina C; Ficetola, Gentile F; Fiera, Cristina; Filgueiras, Bruno K C; Fırıncıoğlu, Hüseyin K; Flaspohler, David; Floren, Andreas; Fonte, Steven J; Fournier, Anne; Fowler, Robert E; Franzén, Markus; Fraser, Lauchlan H; Fredriksson, Gabriella M; Freire, Geraldo B; Frizzo, Tiago L M; Fukuda, Daisuke; Furlani, Dario; Gaigher, René; Ganzhorn, Jörg U; García, Karla P; Garcia-R, Juan C; Garden, Jenni G; Garilleti, Ricardo; Ge, Bao-Ming; Gendreau-Berthiaume, Benoit; Gerard, Philippa J; Gheler-Costa, Carla; Gilbert, Benjamin; Giordani, Paolo; Giordano, Simonetta; Golodets, Carly; Gomes, Laurens G L; Gould, Rachelle K; Goulson, Dave; Gove, Aaron D; Granjon, Laurent; Grass, 
Ingo; Gray, Claudia L; Grogan, James; Gu, Weibin; Guardiola, Moisès; Gunawardene, Nihara R; Gutierrez, Alvaro G; Gutiérrez-Lamus, Doris L; Haarmeyer, Daniela H; Hanley, Mick E; Hanson, Thor; Hashim, Nor R; Hassan, Shombe N; Hatfield, Richard G; Hawes, Joseph E; Hayward, Matt W; Hébert, Christian; Helden, Alvin J; Henden, John-André; Henschel, Philipp; Hernández, Lionel; Herrera, James P; Herrmann, Farina; Herzog, Felix; Higuera-Diaz, Diego; Hilje, Branko; Höfer, Hubert; Hoffmann, Anke; Horgan, Finbarr G; Hornung, Elisabeth; Horváth, Roland; Hylander, Kristoffer; Isaacs-Cubides, Paola; Ishida, Hiroaki; Ishitani, Masahiro; Jacobs, Carmen T; Jaramillo, Víctor J; Jauker, Birgit; Hernández, F Jiménez; Johnson, McKenzie F; Jolli, Virat; Jonsell, Mats; Juliani, S Nur; Jung, Thomas S; Kapoor, Vena; Kappes, Heike; Kati, Vassiliki; Katovai, Eric; Kellner, Klaus; Kessler, Michael; Kirby, Kathryn R; Kittle, Andrew M; Knight, Mairi E; Knop, Eva; Kohler, Florian; Koivula, Matti; Kolb, Annette; Kone, Mouhamadou; Kőrösi, Ádám; Krauss, Jochen; Kumar, Ajith; Kumar, Raman; Kurz, David J; Kutt, Alex S; Lachat, Thibault; Lantschner, Victoria; Lara, Francisco; Lasky, Jesse R; Latta, Steven C; Laurance, William F; Lavelle, Patrick; Le Féon, Violette; LeBuhn, Gretchen; Légaré, Jean-Philippe; Lehouck, Valérie; Lencinas, María V; Lentini, Pia E; Letcher, Susan G; Li, Qi; Litchwark, Simon A; Littlewood, Nick A; Liu, Yunhui; Lo-Man-Hung, Nancy; López-Quintero, Carlos A; Louhaichi, Mounir; Lövei, Gabor L; Lucas-Borja, Manuel Esteban; Luja, Victor H; Luskin, Matthew S; MacSwiney G, M Cristina; Maeto, Kaoru; Magura, Tibor; Mallari, Neil Aldrin; Malone, Louise A; Malonza, Patrick K; Malumbres-Olarte, Jagoba; Mandujano, Salvador; Måren, Inger E; Marin-Spiotta, Erika; Marsh, Charles J; Marshall, E J P; Martínez, Eliana; Martínez Pastur, Guillermo; Moreno Mateos, David; Mayfield, Margaret M; Mazimpaka, Vicente; McCarthy, Jennifer L; McCarthy, Kyle P; McFrederick, Quinn S; McNamara, Sean; Medina, Nagore G; Medina, Rafael; Mena, Jose L; Mico, Estefania; Mikusinski, Grzegorz; Milder, Jeffrey C; Miller, James R; Miranda-Esquivel, Daniel R; Moir, Melinda L; Morales, Carolina L; Muchane, Mary N; Muchane, Muchai; Mudri-Stojnic, Sonja; Munira, A Nur; Muoñz-Alonso, Antonio; Munyekenye, B F; Naidoo, Robin; Naithani, A; Nakagawa, Michiko; Nakamura, Akihiro; Nakashima, Yoshihiro; Naoe, Shoji; Nates-Parra, Guiomar; Navarrete Gutierrez, Dario A; Navarro-Iriarte, Luis; Ndang'ang'a, Paul K; Neuschulz, Eike L; Ngai, Jacqueline T; Nicolas, Violaine; Nilsson, Sven G; Noreika, Norbertas; Norfolk, Olivia; Noriega, Jorge Ari; Norton, David A; Nöske, Nicole M; Nowakowski, A Justin; Numa, Catherine; O'Dea, Niall; O'Farrell, Patrick J; Oduro, William; Oertli, Sabine; Ofori-Boateng, Caleb; Oke, Christopher Omamoke; Oostra, Vicencio; Osgathorpe, Lynne M; Otavo, Samuel Eduardo; Page, Navendu V; Paritsis, Juan; Parra-H, Alejandro; Parry, Luke; Pe'er, Guy; Pearman, Peter B; Pelegrin, Nicolás; Pélissier, Raphaël; Peres, Carlos A; Peri, Pablo L; Persson, Anna S; Petanidou, Theodora; Peters, Marcell K; Pethiyagoda, Rohan S; Phalan, Ben; Philips, T Keith; Pillsbury, Finn C; Pincheira-Ulbrich, Jimmy; Pineda, Eduardo; Pino, Joan; Pizarro-Araya, Jaime; Plumptre, A J; Poggio, Santiago L; Politi, Natalia; Pons, Pere; Poveda, Katja; Power, Eileen F; Presley, Steven J; Proença, Vânia; Quaranta, Marino; Quintero, Carolina; Rader, Romina; Ramesh, B R; Ramirez-Pinilla, Martha P; Ranganathan, Jai; Rasmussen, Claus; Redpath-Downing, Nicola A; Reid, J Leighton; Reis, 
Yana T; Rey Benayas, José M; Rey-Velasco, Juan Carlos; Reynolds, Chevonne; Ribeiro, Danilo Bandini; Richards, Miriam H; Richardson, Barbara A; Richardson, Michael J; Ríos, Rodrigo Macip; Robinson, Richard; Robles, Carolina A; Römbke, Jörg; Romero-Duque, Luz Piedad; Rös, Matthias; Rosselli, Loreta; Rossiter, Stephen J; Roth, Dana S; Roulston, T'ai H; Rousseau, Laurent; Rubio, André V; Ruel, Jean-Claude; Sadler, Jonathan P; Sáfián, Szabolcs; Saldaña-Vázquez, Romeo A; Sam, Katerina; Samnegård, Ulrika; Santana, Joana; Santos, Xavier; Savage, Jade; Schellhorn, Nancy A; Schilthuizen, Menno; Schmiedel, Ute; Schmitt, Christine B; Schon, Nicole L; Schüepp, Christof; Schumann, Katharina; Schweiger, Oliver; Scott, Dawn M; Scott, Kenneth A; Sedlock, Jodi L; Seefeldt, Steven S; Shahabuddin, Ghazala; Shannon, Graeme; Sheil, Douglas; Sheldon, Frederick H; Shochat, Eyal; Siebert, Stefan J; Silva, Fernando A B; Simonetti, Javier A; Slade, Eleanor M; Smith, Jo; Smith-Pardo, Allan H; Sodhi, Navjot S; Somarriba, Eduardo J; Sosa, Ramón A; Soto Quiroga, Grimaldo; St-Laurent, Martin-Hugues; Starzomski, Brian M; Stefanescu, Constanti; Steffan-Dewenter, Ingolf; Stouffer, Philip C; Stout, Jane C; Strauch, Ayron M; Struebig, Matthew J; Su, Zhimin; Suarez-Rubio, Marcela; Sugiura, Shinji; Summerville, Keith S; Sung, Yik-Hei; Sutrisno, Hari; Svenning, Jens-Christian; Teder, Tiit; Threlfall, Caragh G; Tiitsaar, Anu; Todd, Jacqui H; Tonietto, Rebecca K; Torre, Ignasi; Tóthmérész, Béla; Tscharntke, Teja; Turner, Edgar C; Tylianakis, Jason M; Uehara-Prado, Marcio; Urbina-Cardona, Nicolas; Vallan, Denis; Vanbergen, Adam J; Vasconcelos, Heraldo L; Vassilev, Kiril; Verboven, Hans A F; Verdasca, Maria João; Verdú, José R; Vergara, Carlos H; Vergara, Pablo M; Verhulst, Jort; Virgilio, Massimiliano; Vu, Lien Van; Waite, Edward M; Walker, Tony R; Wang, Hua-Feng; Wang, Yanping; Watling, James I; Weller, Britta; Wells, Konstans; Westphal, Catrin; Wiafe, Edward D; Williams, Christopher D; Willig, Michael R; Woinarski, John C Z; Wolf, Jan H D; Wolters, Volkmar; Woodcock, Ben A; Wu, Jihua; Wunderle, Joseph M; Yamaura, Yuichi; Yoshikura, Satoko; Yu, Douglas W; Zaitsev, Andrey S; Zeidler, Juliane; Zou, Fasheng; Collen, Ben; Ewers, Rob M; Mace, Georgina M; Purves, Drew W; Scharlemann, Jörn P W; Purvis, Andy

    2017-01-01

    The PREDICTS project (Projecting Responses of Ecological Diversity In Changing Terrestrial Systems; www.predicts.org.uk) has collated from published studies a large, reasonably representative database of comparable samples of biodiversity from multiple sites that differ in the nature or intensity of human impacts relating to land use. We have used this evidence base to develop global and regional statistical models of how local biodiversity responds to these measures. We describe and make freely available this 2016 release of the database, containing more than 3.2 million records sampled at over 26,000 locations and representing over 47,000 species. We outline how the database can help in answering a range of questions in ecology and conservation biology. To our knowledge, this is the largest and most geographically and taxonomically representative database of spatial comparisons of biodiversity that has been collated to date; it will be useful to researchers and international efforts wishing to model and understand the global status of biodiversity.

  1. De-identifying an EHR database - anonymity, correctness and readability of the medical record.

    PubMed

    Pantazos, Kostas; Lauesen, Soren; Lippert, Soren

    2011-01-01

    Electronic health records (EHR) contain a large amount of structured data and free text. Exploring and sharing clinical data can improve healthcare and facilitate the development of medical software. However, revealing confidential information is against ethical principles and laws. We de-identified a Danish EHR database with 437,164 patients. The goal was to generate a version with real medical records, but related to artificial persons. We developed a de-identification algorithm that uses lists of named entities, simple language analysis, and special rules. Our algorithm consists of 3 steps: collect lists of identifiers from the database and external resources, define a replacement for each identifier, and replace identifiers in structured data and free text. Some patient records could not be safely de-identified, so the de-identified database has 323,122 patient records with an acceptable degree of anonymity, readability and correctness (F-measure of 95%). The algorithm has to be adjusted for each culture, language and database.
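
    A minimal sketch of the three-step replacement strategy described above, in Python; the identifier list, pseudonyms and example sentence are illustrative, not drawn from the Danish EHR.

      # Sketch of the 3-step de-identification: collect identifiers,
      # define a stable replacement for each, substitute in free text.
      import re

      # Step 1: identifiers collected from the database and name lists
      identifiers = ["Jens Hansen", "Hansen", "Odense"]

      # Step 2: a fixed replacement per identifier, so the record stays
      # readable and consistent (the same person keeps one pseudonym)
      replacements = {"Jens Hansen": "Karl Nielsen",
                      "Hansen": "Nielsen",
                      "Odense": "Aarhus"}

      # Step 3: replace identifiers in free text, longest match first so
      # 'Jens Hansen' is not partially rewritten by the 'Hansen' rule
      pattern = re.compile("|".join(
          re.escape(k) for k in sorted(replacements, key=len, reverse=True)))

      def deidentify(text: str) -> str:
          return pattern.sub(lambda m: replacements[m.group(0)], text)

      print(deidentify("Jens Hansen was admitted in Odense."))
      # -> "Karl Nielsen was admitted in Aarhus."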

  2. GenBank.

    PubMed

    Benson, Dennis A; Karsch-Mizrachi, Ilene; Lipman, David J; Ostell, James; Wheeler, David L

    2008-01-01

    GenBank (R) is a comprehensive database that contains publicly available nucleotide sequences for more than 260 000 named organisms, obtained primarily through submissions from individual laboratories and batch submissions from large-scale sequencing projects. Most submissions are made using the web-based BankIt or standalone Sequin programs and accession numbers are assigned by GenBank staff upon receipt. Daily data exchange with the European Molecular Biology Laboratory Nucleotide Sequence Database in Europe and the DNA Data Bank of Japan ensures worldwide coverage. GenBank is accessible through NCBI's retrieval system, Entrez, which integrates data from the major DNA and protein sequence databases along with taxonomy, genome, mapping, protein structure and domain information, and the biomedical journal literature via PubMed. BLAST provides sequence similarity searches of GenBank and other sequence databases. Complete bimonthly releases and daily updates of the GenBank database are available by FTP. To access GenBank and its related retrieval and analysis services, begin at the NCBI Homepage: www.ncbi.nlm.nih.gov.
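
    Records like these can also be retrieved programmatically; below is a hedged example using Biopython's Entrez module (assumes Biopython is installed; the accession number is the standard tutorial example and otherwise illustrative).

      # Fetching one GenBank flat-file record through NCBI Entrez.
      from Bio import Entrez

      Entrez.email = "you@example.org"  # NCBI asks for a contact address
      handle = Entrez.efetch(db="nucleotide", id="U49845",
                             rettype="gb", retmode="text")
      print(handle.read()[:500])  # first lines of the flat-file record
      handle.close()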

  3. GenBank

    PubMed Central

    Benson, Dennis A.; Karsch-Mizrachi, Ilene; Lipman, David J.; Ostell, James; Wheeler, David L.

    2008-01-01

    GenBank (R) is a comprehensive database that contains publicly available nucleotide sequences for more than 260 000 named organisms, obtained primarily through submissions from individual laboratories and batch submissions from large-scale sequencing projects. Most submissions are made using the web-based BankIt or standalone Sequin programs and accession numbers are assigned by GenBank staff upon receipt. Daily data exchange with the European Molecular Biology Laboratory Nucleotide Sequence Database in Europe and the DNA Data Bank of Japan ensures worldwide coverage. GenBank is accessible through NCBI's retrieval system, Entrez, which integrates data from the major DNA and protein sequence databases along with taxonomy, genome, mapping, protein structure and domain information, and the biomedical journal literature via PubMed. BLAST provides sequence similarity searches of GenBank and other sequence databases. Complete bimonthly releases and daily updates of the GenBank database are available by FTP. To access GenBank and its related retrieval and analysis services, begin at the NCBI Homepage: www.ncbi.nlm.nih.gov PMID:18073190

  4. Artemis and ACT: viewing, annotating and comparing sequences stored in a relational database.

    PubMed

    Carver, Tim; Berriman, Matthew; Tivey, Adrian; Patel, Chinmay; Böhme, Ulrike; Barrell, Barclay G; Parkhill, Julian; Rajandream, Marie-Adèle

    2008-12-01

    Artemis and Artemis Comparison Tool (ACT) have become mainstream tools for viewing and annotating sequence data, particularly for microbial genomes. Since its first release, Artemis has been continuously developed and supported with additional functionality for editing and analysing sequences based on feedback from an active user community of laboratory biologists and professional annotators. Nevertheless, its utility has been somewhat restricted by its limitation to reading and writing from flat files. Therefore, a new version of Artemis has been developed, which reads from and writes to a relational database schema, and allows users to annotate more complex, often large and fragmented, genome sequences. Artemis and ACT have now been extended to read and write directly to the Generic Model Organism Database (GMOD, http://www.gmod.org) Chado relational database schema. In addition, a Gene Builder tool has been developed to provide structured forms and tables to edit coordinates of gene models and edit functional annotation, based on standard ontologies, controlled vocabularies and free text. Artemis and ACT are freely available (under a GPL licence) for download (for MacOSX, UNIX and Windows) at the Wellcome Trust Sanger Institute web sites: http://www.sanger.ac.uk/Software/Artemis/ http://www.sanger.ac.uk/Software/ACT/

  5. Some Reliability Issues in Very Large Databases.

    ERIC Educational Resources Information Center

    Lynch, Clifford A.

    1988-01-01

    Describes the unique reliability problems of very large databases that necessitate specialized techniques for hardware problem management. The discussion covers the use of controlled partial redundancy to improve reliability, issues in operating systems and database management systems design, and the impact of disk technology on very large…

  6. The Marshall Islands Data Management Program

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Stoker, A.C.; Conrado, C.L.

    1995-09-01

    This report is a resource document of the methods and procedures used currently in the Data Management Program of the Marshall Islands Dose Assessment and Radioecology Project. Since 1973, over 60,000 environmental samples have been collected. Our program includes relational database design, programming and maintenance; sample and information management; sample tracking; quality control; and data entry, evaluation and reduction. A useful scientific database requires careful planning in order to fulfill the requirements of any large research program. Compilation of scientific results requires consolidation of information from several databases, and incorporation of new information as it is generated. The success in combining and organizing all radionuclide analyses, sample information and statistical results into a readily accessible form is critical to our project.

  7. The Kepler DB: a database management system for arrays, sparse arrays, and binary data

    NASA Astrophysics Data System (ADS)

    McCauliff, Sean; Cote, Miles T.; Girouard, Forrest R.; Middour, Christopher; Klaus, Todd C.; Wohler, Bill

    2010-07-01

    The Kepler Science Operations Center stores pixel values on approximately six million pixels collected every 30 minutes, as well as data products that are generated as a result of running the Kepler science processing pipeline. The Kepler Database management system (Kepler DB) was created to act as the repository of this information. After one year of flight usage, Kepler DB is managing 3 TiB of data and is expected to grow to over 10 TiB over the course of the mission. Kepler DB is a non-relational, transactional database where data are represented as one-dimensional arrays, sparse arrays or binary large objects. We will discuss Kepler DB's APIs, implementation, usage and deployment at the Kepler Science Operations Center.
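
    The data model, though not the Kepler DB implementation itself, can be illustrated with a short sketch that stores one-dimensional arrays as binary large objects keyed by pixel and cadence (requires numpy; all names are assumptions).

      # Illustrative only, not the Kepler DB API: 1-D arrays stored as
      # BLOBs in SQLite, keyed by pixel id and starting cadence.
      import sqlite3
      import numpy as np

      conn = sqlite3.connect(":memory:")
      conn.execute("""CREATE TABLE timeseries (
          pixel_id INTEGER, start_cadence INTEGER, samples BLOB)""")

      flux = np.random.rand(4320)  # ~90 days of 30-minute samples
      conn.execute("INSERT INTO timeseries VALUES (?, ?, ?)",
                   (42, 0, flux.astype(np.float32).tobytes()))

      blob, = conn.execute(
          "SELECT samples FROM timeseries WHERE pixel_id = 42").fetchone()
      restored = np.frombuffer(blob, dtype=np.float32)
      assert restored.shape == (4320,)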

  8. The Kepler DB, a Database Management System for Arrays, Sparse Arrays and Binary Data

    NASA Technical Reports Server (NTRS)

    McCauliff, Sean; Cote, Miles T.; Girouard, Forrest R.; Middour, Christopher; Klaus, Todd C.; Wohler, Bill

    2010-01-01

    The Kepler Science Operations Center stores pixel values on approximately six million pixels collected every 30 minutes, as well as data products that are generated as a result of running the Kepler science processing pipeline. The Kepler Database (Kepler DB) management system was created to act as the repository of this information. After one year of flight usage, Kepler DB is managing 3 TiB of data and is expected to grow to over 10 TiB over the course of the mission. Kepler DB is a non-relational, transactional database where data are represented as one-dimensional arrays, sparse arrays or binary large objects. We will discuss Kepler DB's APIs, implementation, usage and deployment at the Kepler Science Operations Center.

  9. Global Distribution of Outbreaks of Water-Associated Infectious Diseases

    PubMed Central

    Yang, Kun; LeJeune, Jeffrey; Alsdorf, Doug; Lu, Bo; Shum, C. K.; Liang, Song

    2012-01-01

    Background Water plays an important role in the transmission of many infectious diseases, which pose a great burden on global public health. However, the global distribution of these water-associated infectious diseases and underlying factors remain largely unexplored. Methods and Findings Based on the Global Infectious Disease and Epidemiology Network (GIDEON), a global database including water-associated pathogens and diseases was developed. In this study, reported outbreak events associated with corresponding water-associated infectious diseases from 1991 to 2008 were extracted from the database. The location of each reported outbreak event was identified and geocoded into a GIS database. The GIS database also included geo-referenced socio-environmental information: population density (2000), annual accumulated temperature, surface water area, and average annual precipitation. Poisson models with Bayesian inference were developed to explore the association between these socio-environmental factors and distribution of the reported outbreak events. Based on model predictions, a global relative risk map was generated. A total of 1,428 reported outbreak events were retrieved from the database. The analysis suggested that outbreaks of water-associated diseases are significantly correlated with socio-environmental factors. Population density is a significant risk factor for all categories of reported outbreaks of water-associated diseases; water-related diseases (e.g., vector-borne diseases) are associated with accumulated temperature; water-washed diseases (e.g., conjunctivitis) are inversely related to surface water area; both water-borne and water-related diseases are inversely related to average annual rainfall. Based on the model predictions, “hotspots” of risks for all categories of water-associated diseases were explored. Conclusions At the global scale, water-associated infectious diseases are significantly correlated with socio-environmental factors; all regions are impacted, though disproportionately, by different categories of water-associated infectious diseases. PMID:22348158
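
    The modelling idea can be sketched as follows with a frequentist Poisson GLM on synthetic data (the study itself used Bayesian inference; variable names, coefficients and data are illustrative assumptions).

      # Poisson regression of outbreak counts on socio-environmental
      # covariates; synthetic data, statsmodels required.
      import numpy as np
      import statsmodels.api as sm

      rng = np.random.default_rng(0)
      n = 200
      pop_density = rng.gamma(2.0, 50.0, n)    # persons / km^2
      rainfall = rng.normal(1000.0, 250.0, n)  # mm / year
      X = sm.add_constant(np.column_stack([np.log(pop_density), rainfall]))

      # Synthetic 'true' process: counts rise with density, fall with rain
      lam = np.exp(-1.0 + 0.4 * np.log(pop_density) - 0.0005 * rainfall)
      outbreaks = rng.poisson(lam)

      model = sm.GLM(outbreaks, X, family=sm.families.Poisson()).fit()
      print(model.summary())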

  10. Use of large healthcare databases for rheumatology clinical research.

    PubMed

    Desai, Rishi J; Solomon, Daniel H

    2017-03-01

    Large healthcare databases, which contain data collected during routinely delivered healthcare to patients, can serve as a valuable resource for generating actionable evidence to assist medical and healthcare policy decision-making. In this review, we summarize use of large healthcare databases in rheumatology clinical research. Large healthcare data are critical to evaluate medication safety and effectiveness in patients with rheumatologic conditions. Three major sources of large healthcare data are: first, electronic medical records, second, health insurance claims, and third, patient registries. Each of these sources offers unique advantages, but also has some inherent limitations. To address some of these limitations and maximize the utility of these data sources for evidence generation, recent efforts have focused on linking different data sources. Innovations such as randomized registry trials, which aim to facilitate design of low-cost randomized controlled trials built on existing infrastructure provided by large healthcare databases, are likely to make clinical research more efficient in coming years. Harnessing the power of information contained in large healthcare databases, while paying close attention to their inherent limitations, is critical to generate a rigorous evidence-base for medical decision-making and ultimately enhancing patient care.

  11. The development of a prototype intelligent user interface subsystem for NASA's scientific database systems

    NASA Technical Reports Server (NTRS)

    Campbell, William J.; Roelofs, Larry H.; Short, Nicholas M., Jr.

    1987-01-01

    The National Space Science Data Center (NSSDC) has initiated an Intelligent Data Management (IDM) research effort which has as one of its components the development of an Intelligent User Interface (IUI). The intent of the latter is to develop a friendly and intelligent user interface service that is based on expert systems and natural language processing technologies. The purpose is to support the large number of potential scientific and engineering users presently having need of space- and land-related research and technical data but who have little or no experience in query languages or understanding of the information content or architecture of the databases involved. This technical memorandum presents a prototype Intelligent User Interface Subsystem (IUIS) that uses the Crustal Dynamics Project Database as a test bed for the implementation of CRUDDES (the Crustal Dynamics Expert System). The knowledge base has more than 200 rules and represents a single application view and the architectural view. Operational performance using CRUDDES has allowed nondatabase users to obtain useful information from the database previously accessible only to an expert database user or the database designer.

  12. Perianesthetic and Anesthesia-Related Mortality in a Southeastern United States Population: A Longitudinal Review of a Prospectively Collected Quality Assurance Data Base.

    PubMed

    Pollard, Richard J; Hopkins, Thomas; Smith, C Tyler; May, Bryan V; Doyle, James; Chambers, C Labron; Clark, Reese; Buhrman, William

    2018-05-21

    Perianesthetic mortality (death occurring within 48 hours of an anesthetic) continues to vary widely depending on the study population examined. The authors studied a private practice physician group that covers multiple anesthetizing locations in the Southeastern United States. This group has in place a robust quality assurance (QA) database to follow all patients undergoing anesthesia. With this study, we estimate the incidence of anesthesia-related and perianesthetic mortality in this QA database. Following institutional review board approval, data from 2011 to 2016 were obtained from the QA database of a large, community-based anesthesiology group practice. The physician practice covers 233 anesthetizing locations across 20 facilities in 2 US states. All detected cases of perianesthetic death were extracted from the database and compared to the patients' electronic medical records. These cases were further examined by a committee of 3 anesthesiologists to determine whether the death was anesthesia related (a perioperative death solely attributable to either the anesthesia provider or anesthetic technique), anesthesia contributory (a perioperative death in which anesthesia's role could not be entirely excluded), or not due to anesthesia. A total of 785,467 anesthesia procedures were examined from the study period. A total of 592 cases of perianesthetic death were detected, giving an overall death rate of 75.37 in 100,000 cases (95% CI, 69.5-81.7). Mortality judged to be anesthesia related was found in 4 cases, giving a mortality rate of 0.509 in 100,000 (95% CI, 0.198-1.31). Mortality judged to be anesthesia contributory was found in 18 cases, giving a mortality rate of 2.29 in 100,000 patients (95% CI, 1.45-3.7). A total of 570 cases were judged to be nonanesthesia related, giving an incidence of 72.6 per 100,000 anesthetics (95% CI, 69.3-75.7). In a large, comprehensive database representing the full range of anesthesia practices and locations in the Southeastern United States, the rate of perianesthetic death was 0.509 in 100,000 (95% CI, 0.198-1.31). Future in-depth analysis of the epidemiology of perianesthetic deaths will be reported in later studies.
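
    The reported rates are straightforward to reproduce; the sketch below computes a rate per 100,000 with an exact (Garwood) Poisson confidence interval, one plausible method, which gives an interval close to, though not identical with, the one reported.

      # Event rate per 100,000 with an exact Poisson CI (requires scipy).
      from scipy.stats import chi2

      def rate_per_100k(events: int, total: int, alpha: float = 0.05):
          """Rate per 100,000 plus exact (Garwood) Poisson interval."""
          lo = chi2.ppf(alpha / 2, 2 * events) / 2 if events > 0 else 0.0
          hi = chi2.ppf(1 - alpha / 2, 2 * (events + 1)) / 2
          scale = 100_000 / total
          return events * scale, lo * scale, hi * scale

      # Anesthesia-related deaths: 4 events in 785,467 anesthetics
      rate, lo, hi = rate_per_100k(4, 785_467)
      print(f"{rate:.3f} per 100,000 (95% CI {lo:.3f}-{hi:.3f})")
      # -> roughly 0.509 per 100,000 (95% CI ~0.14-1.30)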

  13. An algorithm of discovering signatures from DNA databases on a computer cluster.

    PubMed

    Lee, Hsiao Ping; Sheu, Tzu-Fang

    2014-10-05

    Signatures are short sequences that are unique and not similar to any other sequence in a database, and they can be used as the basis to identify different species. Even though several signature discovery algorithms have been proposed in the past, these algorithms require the entirety of a database to be loaded into memory, which restricts the amount of data they can process and makes them unable to handle large databases. Moreover, those algorithms are sequential, so their discovery speed leaves room for improvement. In this research, we introduce a divide-and-conquer strategy to signature discovery and propose a parallel signature discovery algorithm that runs on a computer cluster. The algorithm applies the divide-and-conquer strategy to overcome the limitation of existing algorithms on large databases and uses a parallel computing mechanism to effectively improve the efficiency of signature discovery. Even when run with just the memory of regular personal computers, the algorithm can still process large databases, such as the human whole-genome EST database, which existing algorithms could not process. The algorithm proposed in this research is not limited by the amount of usable memory and can rapidly find signatures in large databases, making it useful in applications such as Next Generation Sequencing and other large database analysis and processing. The implementation of the proposed algorithm is available at http://www.cs.pu.edu.tw/~fang/DDCSDPrograms/DDCSD.htm.
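
    A greatly simplified sketch of the divide-and-conquer idea is shown below: stream the database one partition at a time and keep k-mers that occur exactly once as signature candidates. The real algorithm distributes the work across a cluster and bounds memory far more carefully; everything here is illustrative.

      # Simplified signature-candidate discovery: unique k-mers across a
      # partitioned sequence database. In the real algorithm the counting
      # itself is sharded over cluster nodes.
      from collections import Counter
      from typing import Iterable, Iterator

      K = 8  # signature length (illustrative)

      def kmers(seq: str) -> Iterator[str]:
          for i in range(len(seq) - K + 1):
              yield seq[i:i + K]

      def candidate_signatures(chunks: Iterable[list[str]]) -> set[str]:
          counts: Counter = Counter()
          for chunk in chunks:          # 'divide': one partition at a time
              for seq in chunk:
                  counts.update(kmers(seq))
          # 'conquer': k-mers seen exactly once are signature candidates
          return {kmer for kmer, n in counts.items() if n == 1}

      db_chunks = [["ACGTACGTGG", "TTGGCCAACA"], ["ACGTACGTCC"]]
      print(sorted(candidate_signatures(db_chunks))[:5])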

  14. BioPepDB: an integrated data platform for food-derived bioactive peptides.

    PubMed

    Li, Qilin; Zhang, Chao; Chen, Hongjun; Xue, Jitong; Guo, Xiaolei; Liang, Ming; Chen, Ming

    2018-03-12

    Food-derived bioactive peptides play critical roles in regulating most biological processes and have considerable biological, medical and industrial importance. However, data on a large number of active peptides, including sequence, function, source, commercial product information and references, are poorly integrated. BioPepDB is a searchable database of food-derived bioactive peptides and their related articles, including more than four thousand bioactive peptide entries. Moreover, BioPepDB provides modules of prediction and hydrolysis-simulation for discovering novel peptides. It can serve as a reference database to investigate the function of different bioactive peptides. BioPepDB is available at http://bis.zju.edu.cn/biopepdbr/. The web page utilises Apache, PHP5 and MySQL to provide the user interface for accessing the database and predicting novel peptides. The database itself is operated on a specialised server.

  15. GenBank

    PubMed Central

    Benson, Dennis A.; Karsch-Mizrachi, Ilene; Lipman, David J.; Ostell, James; Wheeler, David L.

    2007-01-01

    GenBank (R) is a comprehensive database that contains publicly available nucleotide sequences for more than 240 000 named organisms, obtained primarily through submissions from individual laboratories and batch submissions from large-scale sequencing projects. Most submissions are made using the web-based BankIt or standalone Sequin programs and accession numbers are assigned by GenBank staff upon receipt. Daily data exchange with the EMBL Data Library in Europe and the DNA Data Bank of Japan ensures worldwide coverage. GenBank is accessible through NCBI's retrieval system, Entrez, which integrates data from the major DNA and protein sequence databases along with taxonomy, genome, mapping, protein structure and domain information, and the biomedical journal literature via PubMed. BLAST provides sequence similarity searches of GenBank and other sequence databases. Complete bimonthly releases and daily updates of the GenBank database are available by FTP. To access GenBank and its related retrieval and analysis services, begin at the NCBI Homepage: www.ncbi.nlm.nih.gov. PMID:17202161

  16. SkyDOT: a publicly accessible variability database, containing multiple sky surveys and real-time data

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Starr, D. L.; Wozniak, P. R.; Vestrand, W. T.

    2002-01-01

    SkyDOT (Sky Database for Objects in Time-Domain) is a Virtual Observatory currently comprising data from the RAPTOR, ROTSE I, and OGLE II survey projects. This makes it a very large time domain database. In addition, the RAPTOR project provides SkyDOT with real-time variability data as well as stereoscopic information. With its web interface, we believe SkyDOT will be a very useful tool for both astronomers and the public. Our main task has been to construct an efficient relational database containing all existing data, while handling a real-time inflow of data. We also provide a useful web interface allowing easy access to both astronomers and the public. Initially, this server will allow common searches, specific queries, and access to light curves. In the future we will include machine learning classification tools and access to spectral information.

  17. SPSmart: adapting population based SNP genotype databases for fast and comprehensive web access.

    PubMed

    Amigo, Jorge; Salas, Antonio; Phillips, Christopher; Carracedo, Angel

    2008-10-10

    In the last five years large online resources of human variability have appeared, notably HapMap, Perlegen and the CEPH foundation. These databases of genotypes with population information act as catalogues of human diversity, and are widely used as reference sources for population genetics studies. Although many useful conclusions may be extracted by querying databases individually, the lack of flexibility for combining data from within and between each database does not allow the calculation of key population variability statistics. We have developed a novel tool for accessing and combining large-scale genomic databases of single nucleotide polymorphisms (SNPs) in widespread use in human population genetics: SPSmart (SNPs for Population Studies). A fast pipeline creates and maintains a data mart from the most commonly accessed databases of genotypes containing population information: data is mined, summarized into the standard statistical reference indices, and stored into a relational database that currently handles as many as 4 x 10^9 genotypes and that can be easily extended to new database initiatives. We have also built a web interface to the data mart that allows the browsing of underlying data indexed by population and the combining of populations, allowing intuitive and straightforward comparison of population groups. All the information served is optimized for web display, and most of the computations are already pre-processed in the data mart to speed up the data browsing and any computational treatment requested. In practice, SPSmart allows populations to be combined into user-defined groups, while multiple databases can be accessed and compared in a few simple steps from a single query. It performs the queries rapidly and gives straightforward graphical summaries of SNP population variability through visual inspection of allele frequencies outlined in standard pie-chart format. In addition, full numerical description of the data is output in statistical results panels that include common population genetics metrics such as heterozygosity, Fst and In.
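
    As an example of the summary statistics such a data mart pre-computes, the sketch below derives expected heterozygosity from allele frequencies (population labels and frequencies are illustrative).

      # Expected heterozygosity per population: H_exp = 1 - sum(p_i^2).
      def expected_heterozygosity(freqs: list) -> float:
          """1 minus the sum of squared allele frequencies."""
          assert abs(sum(freqs) - 1.0) < 1e-6  # frequencies must sum to 1
          return 1.0 - sum(p * p for p in freqs)

      # Illustrative SNP with allele frequencies in two populations
      populations = {"CEU": [0.73, 0.27], "YRI": [0.41, 0.59]}
      for pop, freqs in populations.items():
          print(pop, round(expected_heterozygosity(freqs), 3))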

  18. How Artificial Intelligence Can Improve Our Understanding of the Genes Associated with Endometriosis: Natural Language Processing of the PubMed Database

    PubMed Central

    Mashiach, R.; Cohen, S.; Kedem, A.; Baron, A.; Zajicek, M.; Feldman, I.; Seidman, D.; Soriano, D.

    2018-01-01

    Endometriosis is a disease characterized by the development of endometrial tissue outside the uterus, but its cause remains largely unknown. Numerous genes have been studied and proposed to help explain its pathogenesis. However, the large number of these candidate genes has made functional validation through experimental methodologies nearly impossible. Computational methods could provide a useful alternative for prioritizing those most likely to be susceptibility genes. Using artificial intelligence applied to text mining, this study analyzed the genes involved in the pathogenesis, development, and progression of endometriosis. The data extraction by text mining of the endometriosis-related genes in the PubMed database was based on natural language processing, and the data were filtered to remove false positives. Using data from the text mining and gene network information as input for the web-based tool, 15,207 endometriosis-related genes were ranked according to their score in the database. Characterization of the filtered gene set through gene ontology, pathway, and network analysis provided information about the numerous mechanisms hypothesized to be responsible for the establishment of ectopic endometrial tissue, as well as the migration, implantation, survival, and proliferation of ectopic endometrial cells. Finally, the human genome was scanned through various databases using filtered genes as a seed to determine novel genes that might also be involved in the pathogenesis of endometriosis but which have not yet been characterized. These genes could be promising candidates to serve as useful diagnostic biomarkers and therapeutic targets in the management of endometriosis. PMID:29750165

  19. How Artificial Intelligence Can Improve Our Understanding of the Genes Associated with Endometriosis: Natural Language Processing of the PubMed Database.

    PubMed

    Bouaziz, J; Mashiach, R; Cohen, S; Kedem, A; Baron, A; Zajicek, M; Feldman, I; Seidman, D; Soriano, D

    2018-01-01

    Endometriosis is a disease characterized by the development of endometrial tissue outside the uterus, but its cause remains largely unknown. Numerous genes have been studied and proposed to help explain its pathogenesis. However, the large number of these candidate genes has made functional validation through experimental methodologies nearly impossible. Computational methods could provide a useful alternative for prioritizing those most likely to be susceptibility genes. Using artificial intelligence applied to text mining, this study analyzed the genes involved in the pathogenesis, development, and progression of endometriosis. The data extraction by text mining of the endometriosis-related genes in the PubMed database was based on natural language processing, and the data were filtered to remove false positives. Using data from the text mining and gene network information as input for the web-based tool, 15,207 endometriosis-related genes were ranked according to their score in the database. Characterization of the filtered gene set through gene ontology, pathway, and network analysis provided information about the numerous mechanisms hypothesized to be responsible for the establishment of ectopic endometrial tissue, as well as the migration, implantation, survival, and proliferation of ectopic endometrial cells. Finally, the human genome was scanned through various databases using filtered genes as a seed to determine novel genes that might also be involved in the pathogenesis of endometriosis but which have not yet been characterized. These genes could be promising candidates to serve as useful diagnostic biomarkers and therapeutic targets in the management of endometriosis.

  20. The use of a computerized database to monitor vaccine safety in Viet Nam.

    PubMed Central

    Ali, Mohammad; Canh, Gia Do; Clemens, John D.; Park, Jin-Kyung; von Seidlein, Lorenz; Minh, Tan Truong; Thiem, Dinh Vu; Tho, Huu Le; Trach, Duc Dang

    2005-01-01

    Health information systems to monitor vaccine safety are used in industrialized countries to detect adverse medical events related to vaccinations or to prove the safety of vaccines. There are no such information systems in the developing world, but they are urgently needed. A large linked database for the monitoring of vaccine-related adverse events has been established in Khanh Hoa province, Viet Nam. Data collected during the first 2 years of surveillance, a period which included a mass measles vaccination campaign, were used to evaluate the system. For this purpose the discharge diagnoses of individuals admitted to polyclinics and hospitals were coded according to the International Classification of Diseases (ICD)-10 guidelines and linked in a dynamic population database with vaccination histories. A case-series analysis was applied to the cohort of children vaccinated during the mass measles vaccination campaign. The study recorded 107,022 immunizations in a catchment area with a population of 357,458 and confirmed vaccine coverage of 87% or higher for completed routine childhood vaccinations. The measles vaccination campaign immunized at least 86% of the targeted children aged 9 months to 10 years. No medical event was detected significantly more frequently during the 14 days after measles vaccination than before it. The experience in Viet Nam confirmed the safety of a measles vaccination campaign and shows that it is feasible to establish health information systems such as a large linked database which can provide reliable data in a developing country for a modest increase in use of resources. PMID:16193545

  1. Does filler database size influence identification accuracy?

    PubMed

    Bergold, Amanda N; Heaton, Paul

    2018-06-01

    Police departments increasingly use large photo databases to select lineup fillers using facial recognition software, but this technological shift's implications have been largely unexplored in eyewitness research. Database use, particularly if coupled with facial matching software, could enable lineup constructors to increase filler-suspect similarity and thus enhance eyewitness accuracy (Fitzgerald, Oriet, Price, & Charman, 2013). However, with a large pool of potential fillers, such technologies might theoretically produce lineup fillers too similar to the suspect (Fitzgerald, Oriet, & Price, 2015; Luus & Wells, 1991; Wells, Rydell, & Seelau, 1993). This research proposes a new factor, filler database size, as a lineup feature affecting eyewitness accuracy. In a facial recognition experiment, we select lineup fillers in a legally realistic manner using facial matching software applied to filler databases of 5,000, 25,000, and 125,000 photos, and find that larger databases are associated with a higher objective similarity rating between suspects and fillers and lower overall identification accuracy. In target present lineups, witnesses viewing lineups created from the larger databases were less likely to make correct identifications and more likely to select known innocent fillers. When the target was absent, database size was associated with a lower rate of correct rejections and a higher rate of filler identifications. Higher algorithmic similarity ratings were also associated with decreases in eyewitness identification accuracy. The results suggest that using facial matching software to select fillers from large photograph databases may reduce identification accuracy, and provides support for filler database size as a meaningful system variable. (PsycINFO Database Record (c) 2018 APA, all rights reserved).

  2. Role of Teacher Characteristics and School Resources in Early Mathematics Learning

    ERIC Educational Resources Information Center

    Jung, Eunjoo; Brown, Elizabeth T.; Karp, Karen S.

    2014-01-01

    The authors investigated the degree to which school-level teacher characteristics and resources are related to the mathematics learning of kindergarten children using a sample drawn from a large US database. Kindergarten-level teacher characteristics were operationalised as collective teacher efficacy, teacher experience and teacher preparation;…

  3. Computing Properties of Hadrons, Nuclei and Nuclear Matter from Quantum Chromodynamics

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Savage, Martin J.

    This project was part of a coordinated software development effort which the nuclear physics lattice QCD community pursues to ensure that lattice calculations can make optimal use of present and forthcoming leadership-class and dedicated hardware, including those of the national laboratories, and to prepare for the exploitation of future computational resources in the exascale era. The UW team improved and extended software libraries used in lattice QCD calculations related to multi-nucleon systems, enhanced production codes related to load balancing multi-nucleon production on large-scale computing platforms, developed SQLite (addressable database) interfaces to efficiently archive and analyze multi-nucleon data, and developed a Mathematica interface for the SQLite databases.
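
    An SQLite archive for per-configuration measurements might look like the sketch below; the table layout is an assumption for illustration, not the project's actual schema.

      # Illustrative SQLite archive for per-configuration lattice
      # measurements; table and column names are assumptions.
      import sqlite3

      conn = sqlite3.connect(":memory:")
      conn.execute("""CREATE TABLE correlator (
          ensemble  TEXT,     -- gauge ensemble label
          config_id INTEGER,  -- configuration number
          channel   TEXT,     -- e.g. 'NN_1S0'
          t         INTEGER,  -- source-sink separation
          value     REAL      -- measured correlator value
      )""")
      conn.execute("INSERT INTO correlator VALUES (?,?,?,?,?)",
                   ("a12m220", 1005, "NN_1S0", 8, 1.234e-9))
      conn.commit()

      # Average over configurations at fixed separation, per channel
      for row in conn.execute("""SELECT channel, t, AVG(value), COUNT(*)
                                 FROM correlator GROUP BY channel, t"""):
          print(row)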

  4. Recognizing vocal emotions in Mandarin Chinese: a validated database of Chinese vocal emotional stimuli.

    PubMed

    Liu, Pan; Pell, Marc D

    2012-12-01

    To establish a valid database of vocal emotional stimuli in Mandarin Chinese, a set of Chinese pseudosentences (i.e., semantically meaningless sentences that resembled real Chinese) were produced by four native Mandarin speakers to express seven emotional meanings: anger, disgust, fear, sadness, happiness, pleasant surprise, and neutrality. These expressions were identified by a group of native Mandarin listeners in a seven-alternative forced choice task, and items reaching a recognition rate of at least three times chance performance in the seven-choice task were selected as a valid database and then subjected to acoustic analysis. The results demonstrated expected variations in both perceptual and acoustic patterns of the seven vocal emotions in Mandarin. For instance, fear, anger, sadness, and neutrality were associated with relatively high recognition, whereas happiness, disgust, and pleasant surprise were recognized less accurately. Acoustically, anger and pleasant surprise exhibited relatively high mean f0 values and large variation in f0 and amplitude; in contrast, sadness, disgust, fear, and neutrality exhibited relatively low mean f0 values and small amplitude variations, and happiness exhibited a moderate mean f0 value and f0 variation. Emotional expressions varied systematically in speech rate and harmonics-to-noise ratio values as well. This validated database is available to the research community and will contribute to future studies of emotional prosody for a number of purposes. To access the database, please contact pan.liu@mail.mcgill.ca.

  5. IDAAPM: integrated database of ADMET and adverse effects of predictive modeling based on FDA approved drug data.

    PubMed

    Legehar, Ashenafi; Xhaard, Henri; Ghemtio, Leo

    2016-01-01

    The disposition of a pharmaceutical compound within an organism, i.e. its Absorption, Distribution, Metabolism, Excretion, Toxicity (ADMET) properties and adverse effects, critically affects late stage failure of drug candidates and has led to the withdrawal of approved drugs. Computational methods are effective approaches to reduce the number of safety issues by analyzing possible links between chemical structures and ADMET or adverse effects, but this is limited by the size, quality, and heterogeneity of the data available from individual sources. Thus, large, clean and integrated databases of approved drug data, associated with fast and efficient predictive tools are desirable early in the drug discovery process. We have built a relational database (IDAAPM) to integrate available approved drug data such as drug approval information, ADMET and adverse effects, chemical structures and molecular descriptors, targets, bioactivity and related references. The database has been coupled with a searchable web interface and modern data analytics platform (KNIME) to allow data access, data transformation, initial analysis and further predictive modeling. Data were extracted from FDA resources and supplemented from other publicly available databases. Currently, the database contains information regarding about 19,226 FDA approval applications for 31,815 products (small molecules and biologics) with their approval history, 2505 active ingredients, together with as many ADMET properties, 1629 molecular structures, 2.5 million adverse effects and 36,963 experimental drug-target bioactivity data. IDAAPM is a unique resource that, in a single relational database, provides detailed information on FDA approved drugs including their ADMET properties and adverse effects, the corresponding targets with bioactivity data, coupled with a data analytics platform. It can be used to perform basic to complex drug-target ADMET or adverse effects analysis and predictive modeling. IDAAPM is freely accessible at http://idaapm.helsinki.fi and can be exploited through a KNIME workflow connected to the database. Graphical abstract: FDA approved drug data integration for predictive modeling.
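
    The kind of drug-to-adverse-effect join such a database supports can be sketched as follows; the table and column names are assumptions for illustration, not the IDAAPM schema.

      # Hypothetical join of a molecular descriptor with adverse-effect
      # report counts; schema and data are illustrative only.
      import sqlite3

      conn = sqlite3.connect(":memory:")
      conn.executescript("""
      CREATE TABLE drug (drug_id INTEGER PRIMARY KEY, name TEXT, logp REAL);
      CREATE TABLE adverse_effect (drug_id INTEGER REFERENCES drug(drug_id),
                                   effect TEXT, report_count INTEGER);
      INSERT INTO drug VALUES (1, 'drug_a', 2.3), (2, 'drug_b', 4.1);
      INSERT INTO adverse_effect VALUES (1, 'nausea', 120), (2, 'rash', 45);
      """)

      rows = conn.execute("""
      SELECT d.name, d.logp, a.effect, a.report_count
      FROM drug d JOIN adverse_effect a USING (drug_id)
      ORDER BY a.report_count DESC;
      """).fetchall()
      print(rows)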

  6. A Chronostratigraphic Relational Database Ontology

    NASA Astrophysics Data System (ADS)

    Platon, E.; Gary, A.; Sikora, P.

    2005-12-01

    A chronostratigraphic research database was donated by British Petroleum to the Stratigraphy Group at the Energy and Geoscience Institute (EGI), University of Utah. These data consist of over 2,000 measured sections representing over three decades of research into the application of the graphic correlation method. The data are global and include both microfossil (foraminifera, calcareous nannoplankton, spores, pollen, dinoflagellate cysts, etc.) and macrofossil data. The objective of the donation was to make the research data available to the public in order to encourage additional chronostratigraphy studies, specifically regarding graphic correlation. As part of the National Science Foundation's Cyberinfrastructure for the Geosciences (GEON) initiative these data have been made available to the public at http://css.egi.utah.edu. To encourage further research using the graphic correlation method, EGI has developed a software package, StrataPlot, that will soon be publicly available from the GEON website as a standalone software download. The EGI chronostratigraphy research database, although relatively large, has many data holes relative to some paleontological disciplines and geographical areas, so the challenge becomes how to expand the data available for chronostratigraphic studies using graphic correlation. There are several public or soon-to-be public databases available to chronostratigraphic research, but they have their own data structures and modes of presentation. The heterogeneous nature of these database schemas hinders their integration and makes it difficult for the user to retrieve and consolidate potentially valuable chronostratigraphic data. The integration of these data sources would facilitate rapid and comprehensive data searches, thus helping advance studies in chronostratigraphy. The GEON project will host a number of databases within the geology domain, some of which contain biostratigraphic data. Ontologies are being developed to provide an integrated query system for searching across GEON's biostratigraphy databases, as well as databases available in the public domain. Although creating an ontology directly from the existing database metadata would have been effective and straightforward, our effort was directed towards creating a more efficient representation of our database, as well as a general representation of the biostratigraphic domain.

  7. Meta-Storms: efficient search for similar microbial communities based on a novel indexing scheme and similarity score for metagenomic data.

    PubMed

    Su, Xiaoquan; Xu, Jian; Ning, Kang

    2012-10-01

    Comparing different microbial communities (also referred to as 'metagenomic samples' here) on a large scale has long intrigued scientists: given a set of unknown samples, find similar metagenomic samples from a large repository and examine how similar these samples are. With the metagenomic samples accumulated to date, it is possible to build a database of metagenomic samples of interest. Any metagenomic sample could then be searched against this database to find the most similar metagenomic sample(s). However, on one hand, current databases with a large number of metagenomic samples mostly serve as data repositories that offer few functionalities for analysis; on the other hand, methods to measure the similarity of metagenomic data work well only for small sets of samples by pairwise comparison. It is not yet clear how to efficiently search for metagenomic samples against a large metagenomic database. In this study, we propose a novel method, Meta-Storms, that can systematically and efficiently organize and search metagenomic data. It includes the following components: (i) creating a database of metagenomic samples based on their taxonomical annotations, (ii) efficient indexing of samples in the database based on a hierarchical taxonomy indexing strategy, (iii) searching for a metagenomic sample against the database by a fast scoring function based on quantitative phylogeny and (iv) managing the database by index export, index import, data insertion, data deletion and database merging. We have collected more than 1300 metagenomic datasets from the public domain and in-house facilities, and tested the Meta-Storms method on them. Our experimental results show that Meta-Storms is capable of database creation and effective searching for a large number of metagenomic samples, and that it achieves accuracies similar to the current popular significance testing-based methods. Meta-Storms would serve as a suitable database management and search system to quickly identify similar metagenomic samples from a large pool of samples. ningkang@qibebt.ac.cn Supplementary data are available at Bioinformatics online.
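
    The scoring idea in (iii) can be illustrated with a toy quantitative-phylogeny comparison: aggregate leaf abundances up the taxonomy and sum the shared abundance over every node, so that agreement at deeper (more specific) ranks is rewarded at every level above it. This is a minimal sketch in the spirit of the method, not the authors' actual scoring function:

      # Toy quantitative-phylogeny similarity in the spirit of Meta-Storms;
      # the published scoring function differs in detail.
      def rollup(tree, root, leaf_abund):
          """Return {node: aggregated abundance}, summing leaves upward."""
          totals = {}
          def rec(node):
              children = tree.get(node, [])
              if not children:
                  totals[node] = leaf_abund.get(node, 0.0)
              else:
                  totals[node] = sum(rec(c) for c in children)
              return totals[node]
          rec(root)
          return totals

      def similarity(tree, root, a, b):
          """Shared abundance summed over every taxonomy node. A species-level
          match also counts at genus, family, ... so deep agreement scores
          higher than agreement only at coarse ranks."""
          ta, tb = rollup(tree, root, a), rollup(tree, root, b)
          return sum(min(v, tb.get(n, 0.0)) for n, v in ta.items())

      # Toy taxonomy: root -> two genera -> species leaves, relative abundances.
      tree = {"root": ["g1", "g2"], "g1": ["s1", "s2"], "g2": ["s3"]}
      a = {"s1": 0.6, "s2": 0.1, "s3": 0.3}
      b = {"s1": 0.2, "s2": 0.5, "s3": 0.3}
      print(similarity(tree, "root", a, b))  # higher = more similar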

  8. Temporal and Fine-Grained Pedestrian Action Recognition on Driving Recorder Database

    PubMed Central

    Satoh, Yutaka; Aoki, Yoshimitsu; Oikawa, Shoko; Matsui, Yasuhiro

    2018-01-01

    The paper addresses the emerging issue of fine-grained pedestrian action recognition, which supports advanced pre-crash safety by estimating a pedestrian's intention in advance. Fine-grained pedestrian actions include visually slight differences (e.g., walking straight and crossing), which are difficult to distinguish from each other. Fine-grained action recognition thus enables pedestrian intention estimation for helpful advanced driver-assistance systems (ADAS). The following difficulties must be addressed to achieve fine-grained and accurate pedestrian action recognition: (i) to analyze the fine-grained motion of a pedestrian appearing in a vehicle-mounted drive recorder, a method to describe subtle changes of motion characteristics occurring in a short time is necessary; (ii) even when the background moves greatly due to the driving of the vehicle, changes in the subtle motion of the pedestrian must still be detected; (iii) collecting large-scale fine-grained actions is very difficult, so work must focus on a relatively small database. We investigate how to learn an effective recognition model with only a small-scale database, thoroughly evaluating several types of configurations to explore an effective approach to fine-grained pedestrian action recognition without a large-scale database. Moreover, two different datasets have been collected in order to raise the issue. Finally, our proposal attained 91.01% on the National Traffic Science and Environment Laboratory database (NTSEL) and 53.23% on the near-miss driving recorder database (NDRDB), improvements of +8.28% and +6.53% over the baseline two-stream fusion convnets. PMID:29461473

  9. Image segmentation evaluation for very-large datasets

    NASA Astrophysics Data System (ADS)

    Reeves, Anthony P.; Liu, Shuang; Xie, Yiting

    2016-03-01

    With the advent of modern machine learning methods and fully automated image analysis, there is a need for very large image datasets with documented segmentations for both computer algorithm training and evaluation. Current approaches of visual inspection and manual marking do not scale well to big data. We present a new approach that depends on fully automated algorithm outcomes for segmentation documentation, requires no manual marking, and provides quantitative evaluation for computer algorithms. The documentation of new image segmentations and new algorithm outcomes is achieved by visual inspection. The burden of visual inspection on large datasets is minimized by (a) customized visualizations for rapid review and (b) reducing the number of cases to be reviewed through analysis of quantitative segmentation evaluation. This method has been applied to a dataset of 7,440 whole-lung CT images for 6 different segmentation algorithms designed to fully automatically facilitate the measurement of a number of very important quantitative image biomarkers. The results indicate 93% to 99% successful segmentation for these algorithms on this relatively large image database. The presented evaluation method may be scaled to much larger image databases.
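
    Step (b) amounts to a triage rule: run every case through the automated segmenter, compute a quantitative measure per case, and send only statistical outliers to visual review. A minimal sketch, in which both the metric (segmented volume) and the z-score threshold are chosen purely for illustration:

      # Triage sketch: only statistical outliers go to visual review.
      import statistics

      def flag_for_review(volumes, z_cut=1.5):
          """Indices of cases whose segmented volume deviates strongly from the
          cohort mean; the metric and threshold are illustrative assumptions."""
          mu = statistics.mean(volumes)
          sd = statistics.stdev(volumes)
          return [i for i, v in enumerate(volumes) if abs(v - mu) > z_cut * sd]

      volumes = [4.1, 4.3, 4.0, 9.8, 4.2, 4.4, 0.3, 4.1]  # arbitrary units
      print(flag_for_review(volumes))  # -> [3, 6]: review the two outliers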

  10. Differentiating the Bishop ash bed and related tephra layers by elemental-based similarity coefficients of volcanic glass shards using solution inductively coupled plasma-mass spectrometry (S-ICP-MS)

    USGS Publications Warehouse

    Knott, J.R.; Sarna-Wojcicki, A. M.; Montanez, I.P.; Wan, E.

    2007-01-01

    Volcanic glass samples from the same volcanic center (intra-source) often have a similar major-element composition. Thus, it can be difficult to distinguish between individual tephra layers, particularly when using similarity coefficients calculated from electron microprobe major-element measurements. Minor/trace element concentrations in glass can be determined by solution inductively coupled plasma mass spectrometry (S-ICP-MS), but have not previously been shown to be suitable for use in large tephrochronologic databases. Here, we present minor/trace-element concentrations measured by S-ICP-MS and compare these data by similarity coefficients, the method commonly used in large databases. Trial samples from the Bishop Tuff, the upper and lower tuffs of Glass Mountain and the tuffs of Mesquite Spring suites from eastern California, USA, which have indistinguishable major-element compositions, were analyzed using S-ICP-MS. The resulting minor/trace element similarity coefficients clearly separated the suites of tephra layers and, in most cases, individual tephra layers within each suite. Comparisons with previous instrumental neutron activation analysis (INAA) elemental measurements were marginally successful. This is an important step toward quantitative correlation in large tephrochronologic databases, definitive identification of volcanic glass samples, and high-resolution age determinations. © 2007 Elsevier Ltd and INQUA.
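
    The similarity coefficients referred to here are conventionally computed, in large tephra databases, as the mean over elements of the smaller-to-larger concentration ratio (a form often attributed to Borchardt and co-workers); whether this exact form was used in the study is an assumption. A minimal version:

      def similarity_coefficient(x, y):
          """Mean over elements of min/max concentration ratios; 1.0 means
          identical analyses. Cutoffs near 0.92 for calling two layers 'the
          same' are a database convention, not part of the formula."""
          ratios = [min(a, b) / max(a, b) for a, b in zip(x, y) if max(a, b) > 0]
          return sum(ratios) / len(ratios)

      # Trace-element concentrations (ppm) for two glass shards (made-up values).
      shard1 = [120.0, 35.5, 8.2, 410.0]
      shard2 = [118.0, 36.0, 7.9, 395.0]
      print(round(similarity_coefficient(shard1, shard2), 3))  # -> 0.974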

  11. Epidemiological considerations for the use of databases in transfusion research: a Scandinavian perspective.

    PubMed

    Edgren, Gustaf; Hjalgrim, Henrik

    2010-11-01

    At current safety levels, with adverse events from transfusions being relatively rare, further progress in risk reductions will require large-scale investigations. Thus, truly prospective studies may prove unfeasible and other alternatives deserve consideration. In this review, we will try to give an overview of recent and historical developments in the use of blood donation and transfusion databases in research. In addition, we will go over important methodological issues. There are at least three nationwide or near-nationwide donation/transfusion databases with the possibility for long-term follow-up of donors and recipients. During the past few years, a large number of reports have been published utilizing such data sources to investigate transfusion-associated risks. In addition, numerous clinics systematically collect and use such data on a smaller scale. Combining systematically recorded donation and transfusion data with long-term health follow-up opens up exciting opportunities for transfusion medicine research. However, the correct analysis of such data requires close attention to methodological issues, especially including the indication for transfusion and reverse causality.

  12. Connecting proteins with drug-like compounds: Open source drug discovery workflows with BindingDB and KNIME

    PubMed Central

    Berthold, Michael R.; Hedrick, Michael P.; Gilson, Michael K.

    2015-01-01

    Today’s large, public databases of protein–small molecule interaction data are creating important new opportunities for data mining and integration. At the same time, new graphical user interface-based workflow tools offer facile alternatives to custom scripting for informatics and data analysis. Here, we illustrate how the large protein-ligand database BindingDB may be incorporated into KNIME workflows as a step toward the integration of pharmacological data with broader biomolecular analyses. Thus, we describe a collection of KNIME workflows that access BindingDB data via RESTful webservices and, for more intensive queries, via a local distillation of the full BindingDB dataset. We focus in particular on the KNIME implementation of knowledge-based tools to generate informed hypotheses regarding protein targets of bioactive compounds, based on notions of chemical similarity. A number of variants of this basic approach are tested for seven existing drugs with relatively ill-defined therapeutic targets, leading to replication of some previously confirmed results and discovery of new, high-quality hits. Implications for future development are discussed. Database URL: www.bindingdb.org PMID:26384374
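
    The RESTful access pattern described above reduces to parameterized HTTP GET requests. The service root and parameter names below are illustrative assumptions; consult www.bindingdb.org for the current web-service interface before relying on them:

      # Endpoint and parameters are assumptions, not a documented contract.
      import requests

      BASE = "https://bindingdb.org/axis2/services/BDBService"  # assumed root

      def targets_by_compound(smiles, cutoff=0.85):
          """Query a BindingDB-style service for targets of compounds whose
          chemical similarity to `smiles` exceeds `cutoff`."""
          resp = requests.get(f"{BASE}/getTargetByCompound",
                              params={"smiles": smiles, "cutoff": cutoff},
                              timeout=30)
          resp.raise_for_status()
          return resp.text  # parse the returned payload as needed

      print(targets_by_compound("CC(=O)Oc1ccccc1C(=O)O")[:200])  # aspirin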

  13. SPMBR: a scalable algorithm for mining sequential patterns based on bitmaps

    NASA Astrophysics Data System (ADS)

    Xu, Xiwei; Zhang, Changhai

    2013-12-01

    Many existing sequential pattern mining algorithms generate too many candidate sequences, which increases the processing cost of support counting. We therefore present an effective and scalable algorithm called SPMBR (Sequential Patterns Mining based on Bitmap Representation) to mine sequential patterns from large databases. Our method differs from previous work on mining sequential patterns mainly in that the sequence database is represented by bitmaps, for which a simplified bitmap structure is first presented. The algorithm generates candidate sequences by SE (Sequence Extension) and IE (Item Extension), and then obtains all frequent sequences by comparing the original bitmap with the extended item bitmaps. This method simplifies the problem of mining sequential patterns and avoids the high processing cost of support counting. Both analysis and experiments indicate that SPMBR performs well for large transaction databases, requires much less memory for temporary data during mining, and can feasibly mine all sequential patterns.
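
    The bitmap idea can be illustrated with plain integers: give each item a bitmap whose bit i is set when transaction i contains the item, so support counting becomes an AND plus a popcount. This is a minimal sketch of the representation SPMBR builds on (compare SPAM), not the authors' simplified bitmap structure:

      # Bitmap-based support counting: one integer bitmap per item.
      transactions = [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}, {"b"}]

      def build_bitmaps(transactions):
          bitmaps = {}
          for i, t in enumerate(transactions):
              for item in t:
                  bitmaps[item] = bitmaps.get(item, 0) | (1 << i)
          return bitmaps

      def support(bitmaps, itemset):
          """Transactions containing every item: AND the bitmaps, count bits."""
          combined = ~0
          for item in itemset:
              combined &= bitmaps.get(item, 0)
          return bin(combined & ((1 << len(transactions)) - 1)).count("1")

      bm = build_bitmaps(transactions)
      print(support(bm, {"a", "b"}))  # -> 2 (transactions 0 and 2)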

  14. Database Dictionary for Ethiopian National Ground-Water DAtabase (ENGDA) Data Fields

    USGS Publications Warehouse

    Kuniansky, Eve L.; Litke, David W.; Tucci, Patrick

    2007-01-01

    Introduction This document describes the data fields that are used for both field forms and the Ethiopian National Ground-water Database (ENGDA) tables associated with information stored about production wells, springs, test holes, test wells, and water-level or water-quality observation wells. Several different words are used in this database dictionary and in the ENGDA database to describe a narrow shaft constructed in the ground. The most general term is borehole, which is applicable to any type of hole. A well is a borehole specifically constructed to extract water from the ground; however, for this data dictionary and for the ENGDA database, the words well and borehole are used interchangeably. A production well is defined as any well used for water supply and includes hand-dug wells, small-diameter bored wells equipped with hand pumps, and large-diameter bored wells equipped with large-capacity motorized pumps. Test holes are borings made to collect information about the subsurface, with continuous or non-continuous core and/or geophysical logs; test holes are not converted into wells. A test well is a well constructed for hydraulic testing of an aquifer in order to plan a larger ground-water production system. A water-level or water-quality observation well is a well that is used to collect information about an aquifer and is not used for water supply. A spring is any naturally flowing, local, ground-water discharge site. The database dictionary is designed to help define all fields on both the field data collection forms (provided in attachment 2 of this report) and the ENGDA software screen entry forms (described in Litke, 2007). The data entered into each screen entry field are stored in relational database tables within the computer database. The organization of the database dictionary is based on field data collection and the field forms, because these are what the majority of people will use. After each field, however, the ENGDA database field name and relational database table are designated, along with the ENGDA screen entry form(s) and the ENGDA field form (attachment 2). The database dictionary is separated into sections. The first section, Basic Site Data Fields, describes the basic site information that is similar for all of the different types of sites. The remaining sections may be applicable to only one type of site; for example, the Well Drilling and Construction Data Fields and Lithologic Description Data Fields are applicable to boreholes and not to springs. Attachment 1 contains a table for conversion from English to metric units. Attachment 2 contains selected field forms used in conjunction with ENGDA. A separate document, 'Users Reference Manual for the Ethiopian National Ground-Water DAtabase (ENGDA),' by David W. Litke, was developed as a users guide for the computer database and screen entry. This database dictionary serves as a reference for both the field forms and the computer database. Every effort has been made to keep field names identical between the field forms and the screen entry forms in order to avoid confusion.

  15. "Mr. Database" : Jim Gray and the History of Database Technologies.

    PubMed

    Hanwahr, Nils C

    2017-12-01

    Although the widespread use of the term "Big Data" is comparatively recent, it invokes a phenomenon in the developments of database technology with distinct historical contexts. The database engineer Jim Gray, known as "Mr. Database" in Silicon Valley before his disappearance at sea in 2007, was involved in many of the crucial developments since the 1970s that constitute the foundation of exceedingly large and distributed databases. Jim Gray was involved in the development of relational database systems based on the concepts of Edgar F. Codd at IBM in the 1970s before he went on to develop principles of Transaction Processing that enable the parallel and highly distributed performance of databases today. He was also involved in creating forums for discourse between academia and industry, which influenced industry performance standards as well as database research agendas. As a co-founder of the San Francisco branch of Microsoft Research, Gray increasingly turned toward scientific applications of database technologies, e.g., leading the TerraServer project, an online database of satellite images. Inspired by Vannevar Bush's idea of the memex, Gray laid out his vision of a Personal Memex as well as a World Memex, eventually postulating a new era of data-based scientific discovery termed "Fourth Paradigm Science". This article gives an overview of Gray's contributions to the development of database technology as well as his research agendas and shows that central notions of Big Data have been occupying database engineers for much longer than the actual term has been in use.

  16. [Adverse Effect Predictions Based on Computational Toxicology Techniques and Large-scale Databases].

    PubMed

    Uesawa, Yoshihiro

    2018-01-01

    Understanding the features of chemical structures related to the adverse effects of drugs is useful for identifying potential adverse effects of new drugs based on the limited information available from post-marketing surveillance; for assessing the potential toxicities of metabolites and illegal drugs with unclear characteristics; for screening lead compounds at the drug discovery stage; and for identifying leads for the discovery of new pharmacological mechanisms. This paper describes techniques used in computational toxicology to investigate the content of large-scale spontaneous report databases of adverse effects, illustrated with examples. Furthermore, volcano plotting, a new visualization method for clarifying the relationships between drugs and adverse effects via comprehensive analyses, is introduced. These analyses may produce a great amount of data that can be applied to drug repositioning.
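
    As a concrete reading of the volcano-plot idea, each drug-effect pair in a spontaneous-report database can be reduced to a 2x2 table and plotted as (effect size, significance). The choice of the reporting odds ratio and the normal approximation below are assumptions made for illustration; the author's exact computation may differ:

      import math

      def volcano_point(a, b, c, d):
          """a: reports with drug and effect; b: drug, no effect;
          c: other drugs, effect; d: other drugs, no effect.
          Returns (log2 ROR, -log10 p) using a normal approximation."""
          ror = (a * d) / (b * c)
          se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
          z = abs(math.log(ror)) / se
          p = 2 * (1 - 0.5 * (1 + math.erf(z / math.sqrt(2))))
          return math.log2(ror), -math.log10(max(p, 1e-300))

      # One drug-effect pair; plotting many such points gives the volcano.
      print(volcano_point(40, 960, 200, 98800))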

  17. Effect of the curvature parameter on least-squares prediction within poor data coverage: case study for Africa

    NASA Astrophysics Data System (ADS)

    Abd-Elmotaal, Hussein; Kühtreiber, Norbert

    2016-04-01

    In the framework of the IAG African Geoid Project, the gravity database contains many large data gaps. These gaps are filled initially using an unequal-weight least-squares prediction technique. This technique uses a generalized Hirvonen covariance function model in place of the empirically determined covariance function. The generalized Hirvonen model has a sensitive parameter related to the curvature of the covariance function at the origin. This paper studies the effect of the curvature parameter on the least-squares prediction results, especially in the large data gaps appearing in the African gravity database. An optimum estimate of the curvature parameter is also derived. The results, together with their accuracies, are compared and thoroughly discussed.
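
    One common form of the generalized Hirvonen model, given here as a plausible reading of the covariance family the abstract refers to (the authors' exact parameterization may differ), is

      C(s) = \frac{C_0}{\bigl(1 + (s/d)^2\bigr)^{p}},
      \qquad C''(0) = -\frac{2p\,C_0}{d^2},

    where C_0 is the variance at the origin, d is a length parameter, and the exponent p controls the curvature of C(s) at the origin; the classical Hirvonen function is the case p = 1.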

  18. WholeCellSimDB: a hybrid relational/HDF database for whole-cell model predictions

    PubMed Central

    Karr, Jonathan R.; Phillips, Nolan C.; Covert, Markus W.

    2014-01-01

    Mechanistic ‘whole-cell’ models are needed to develop a complete understanding of cell physiology. However, extracting biological insights from whole-cell models requires running and analyzing large numbers of simulations. We developed WholeCellSimDB, a database for organizing whole-cell simulations. WholeCellSimDB was designed to enable researchers to search simulation metadata to identify simulations for further analysis, and quickly slice and aggregate simulation results data. In addition, WholeCellSimDB enables users to share simulations with the broader research community. The database uses a hybrid relational/hierarchical data format architecture to efficiently store and retrieve both simulation setup metadata and results data. WholeCellSimDB provides a graphical Web-based interface to search, browse, plot and export simulations; a JavaScript Object Notation (JSON) Web service to retrieve data for Web-based visualizations; a command-line interface to deposit simulations; and a Python API to retrieve data for advanced analysis. Overall, we believe WholeCellSimDB will help researchers use whole-cell models to advance basic biological science and bioengineering. Database URL: http://www.wholecellsimdb.org Source code repository URL: http://github.com/CovertLab/WholeCellSimDB PMID:25231498
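
    Retrieving results through the JSON web service mentioned above would look roughly as follows; the URL path and query parameters are illustrative assumptions, so consult www.wholecellsimdb.org for the actual interface:

      # URL path and parameter names below are hypothetical, for illustration.
      import requests

      def fetch_timecourse(base_url, simulation_id, state, prop):
          """Pull one simulated property's time course as JSON."""
          resp = requests.get(
              f"{base_url}/api/simulation/{simulation_id}/timecourse",
              params={"state": state, "property": prop},  # assumed names
              timeout=60)
          resp.raise_for_status()
          return resp.json()

      data = fetch_timecourse("http://www.wholecellsimdb.org", 1, "Mass", "total")
      print(type(data))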

  19. An image database management system for conducting CAD research

    NASA Astrophysics Data System (ADS)

    Gruszauskas, Nicholas; Drukker, Karen; Giger, Maryellen L.

    2007-03-01

    The development of image databases for CAD research is not a trivial task. The collection and management of images and their related metadata from multiple sources is a time-consuming but necessary process. By standardizing and centralizing the methods in which these data are maintained, one can generate subsets of a larger database that match the specific criteria needed for a particular research project in a quick and efficient manner. A research-oriented management system of this type is highly desirable in a multi-modality CAD research environment. An online, web-based database system for the storage and management of research-specific medical image metadata was designed for use with four modalities of breast imaging: screen-film mammography, full-field digital mammography, breast ultrasound and breast MRI. The system was designed to consolidate data from multiple clinical sources and provide the user with the ability to anonymize the data. Input concerning the type of data to be stored as well as desired searchable parameters was solicited from researchers in each modality. The backbone of the database was created using MySQL. A robust and easy-to-use interface for entering, removing, modifying and searching information in the database was created using HTML and PHP. This standardized system can be accessed using any modern web-browsing software and is fundamental for our various research projects on computer-aided detection, diagnosis, cancer risk assessment, multimodality lesion assessment, and prognosis. Our CAD database system stores large amounts of research-related metadata and successfully generates subsets of cases that match the user's desired search criteria.
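
    The subset-generation workflow described above is easy to picture as a query over a metadata table. A toy version using SQLite in place of the system's MySQL backbone, with representative rather than actual fields:

      # Fields are representative, not the system's actual schema.
      import sqlite3

      conn = sqlite3.connect(":memory:")
      conn.execute("""CREATE TABLE image_meta (
          image_id INTEGER PRIMARY KEY,
          modality TEXT,          -- e.g. 'FFDM', 'ultrasound', 'MRI'
          finding TEXT,           -- e.g. 'mass', 'calcification'
          biopsy_proven INTEGER   -- 1 = pathology-confirmed
      )""")
      conn.executemany(
          "INSERT INTO image_meta VALUES (?, ?, ?, ?)",
          [(1, "FFDM", "mass", 1), (2, "ultrasound", "mass", 0),
           (3, "FFDM", "calcification", 1), (4, "MRI", "mass", 1)])
      # Pull the subset matching a study's criteria: biopsy-proven FFDM masses.
      rows = conn.execute("""SELECT image_id FROM image_meta
                             WHERE modality = 'FFDM' AND finding = 'mass'
                               AND biopsy_proven = 1""").fetchall()
      print(rows)  # -> [(1,)]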

  20. Follicle Online: an integrated database of follicle assembly, development and ovulation.

    PubMed

    Hua, Juan; Xu, Bo; Yang, Yifan; Ban, Rongjun; Iqbal, Furhan; Cooke, Howard J; Zhang, Yuanwei; Shi, Qinghua

    2015-01-01

    Folliculogenesis is an important part of ovarian function, as it provides the oocytes for female reproductive life. Characterizing the genes/proteins involved in folliculogenesis is fundamental for understanding the mechanisms associated with this biological function and for treating the diseases associated with folliculogenesis. A large number of genes/proteins associated with folliculogenesis have been identified from different species. However, no dedicated public resource is currently available for folliculogenesis-related genes/proteins that are validated by experiments. Here, we report a database, 'Follicle Online', that provides an experimentally validated gene/protein map of folliculogenesis in a number of species. Follicle Online is a web-based database system for storing and retrieving folliculogenesis-related experimental data. It provides detailed information for 580 genes/proteins (from 23 model organisms, including Homo sapiens, Mus musculus, Rattus norvegicus, Mesocricetus auratus, Bos taurus, Drosophila and Xenopus laevis) that have been reported to be involved in folliculogenesis, POF (premature ovarian failure) and PCOS (polycystic ovary syndrome). The literature was manually curated from more than 43,000 published articles (up to 1 March 2014). The Follicle Online database is implemented in PHP + MySQL + JavaScript, and this user-friendly web application provides access to the stored data. In summary, we have developed a centralized database that provides users with comprehensive information about the genes/proteins involved in folliculogenesis. This database can be accessed freely and all the stored data can be viewed without any registration. Database URL: http://mcg.ustc.edu.cn/sdap1/follicle/index.php © The Author(s) 2015. Published by Oxford University Press.

  1. Follicle Online: an integrated database of follicle assembly, development and ovulation

    PubMed Central

    Hua, Juan; Xu, Bo; Yang, Yifan; Ban, Rongjun; Iqbal, Furhan; Zhang, Yuanwei; Shi, Qinghua

    2015-01-01

    Folliculogenesis is an important part of ovarian function, as it provides the oocytes for female reproductive life. Characterizing the genes/proteins involved in folliculogenesis is fundamental for understanding the mechanisms associated with this biological function and for treating the diseases associated with folliculogenesis. A large number of genes/proteins associated with folliculogenesis have been identified from different species. However, no dedicated public resource is currently available for folliculogenesis-related genes/proteins that are validated by experiments. Here, we report a database, ‘Follicle Online’, that provides an experimentally validated gene/protein map of folliculogenesis in a number of species. Follicle Online is a web-based database system for storing and retrieving folliculogenesis-related experimental data. It provides detailed information for 580 genes/proteins (from 23 model organisms, including Homo sapiens, Mus musculus, Rattus norvegicus, Mesocricetus auratus, Bos taurus, Drosophila and Xenopus laevis) that have been reported to be involved in folliculogenesis, POF (premature ovarian failure) and PCOS (polycystic ovary syndrome). The literature was manually curated from more than 43,000 published articles (up to 1 March 2014). The Follicle Online database is implemented in PHP + MySQL + JavaScript, and this user-friendly web application provides access to the stored data. In summary, we have developed a centralized database that provides users with comprehensive information about the genes/proteins involved in folliculogenesis. This database can be accessed freely and all the stored data can be viewed without any registration. Database URL: http://mcg.ustc.edu.cn/sdap1/follicle/index.php PMID:25931457

  2. A pilot GIS database of active faults of Mt. Etna (Sicily): A tool for integrated hazard evaluation

    NASA Astrophysics Data System (ADS)

    Barreca, Giovanni; Bonforte, Alessandro; Neri, Marco

    2013-02-01

    A pilot GIS-based system has been implemented for the assessment and analysis of hazard related to active faults affecting the eastern and southern flanks of Mt. Etna. The system structure was developed in the ArcGis® environment and consists of different thematic datasets that include spatially referenced arc features and an associated database. The arc-type features, georeferenced in the WGS84 Ellipsoid UTM zone 33 Projection, represent the five main fault systems that develop in the analysed region. The backbone of the GIS-based system is the large amount of information collected from the literature and then stored and properly geocoded in a digital database. This consists of thirty-five alphanumeric fields covering all fault parameters available from the literature, such as location, kinematics, landform, slip rate, etc. Although the system has been implemented according to the most common procedures used by GIS developers, the architecture and content of the database represent a pilot backbone for the digital storage of fault parameters, providing a powerful tool for modelling hazard related to the active tectonics of Mt. Etna. The database collects, organises and shares all currently available scientific information about the active faults of the volcano. Furthermore, thanks to the strong effort spent on defining the fields of the database, the structure proposed in this paper is open to the collection of further data arising from future improvements in knowledge of the fault systems. By layering additional user-specific geographic information and managing the proposed database (topological querying), the user can produce a great diversity of hazard and vulnerability maps. This database is proposed as a backbone for a comprehensive geographical database of fault systems, universally applicable to other sites.

  3. MTO-like reference mask modeling for advanced inverse lithography technology patterns

    NASA Astrophysics Data System (ADS)

    Park, Jongju; Moon, Jongin; Son, Suein; Chung, Donghoon; Kim, Byung-Gook; Jeon, Chan-Uk; LoPresti, Patrick; Xue, Shan; Wang, Sonny; Broadbent, Bill; Kim, Soonho; Hur, Jiuk; Choo, Min

    2017-07-01

    Advanced Inverse Lithography Technology (ILT) can result in mask post-OPC databases with very small address units, all-angle figures, and very high vertex counts. This creates mask inspection issues for existing mask inspection database rendering. These issues include: large data volumes, low transfer rate, long data preparation times, slow inspection throughput, and marginal rendering accuracy leading to high false detections. This paper demonstrates the application of a new rendering method including a new OASIS-like mask inspection format, new high-speed rendering algorithms, and related hardware to meet the inspection challenges posed by Advanced ILT masks.

  4. Evaluation of Glaucoma Progression in Large-Scale Clinical Data: The Japanese Archive of Multicentral Databases in Glaucoma (JAMDIG).

    PubMed

    Fujino, Yuri; Asaoka, Ryo; Murata, Hiroshi; Miki, Atsuya; Tanito, Masaki; Mizoue, Shiro; Mori, Kazuhiko; Suzuki, Katsuyoshi; Yamashita, Takehiro; Kashiwagi, Kenji; Shoji, Nobuyuki

    2016-04-01

    To develop a large-scale real clinical database of glaucoma (Japanese Archive of Multicentral Databases in Glaucoma: JAMDIG) and to investigate the effect of treatment. The study included a total of 1348 eyes of 805 primary open-angle glaucoma patients with 10 visual fields (VFs) measured with the 24-2 or 30-2 Humphrey Field Analyzer (HFA) and intraocular pressure (IOP) records in 10 institutes in Japan. Those with 10 reliable VFs were further identified (638 eyes of 417 patients). The mean total deviation (mTD) of the 52 test points in the 24-2 HFA VF was calculated, and the relationship between mTD progression rate and seven variables (age, mTD of the baseline VF, average IOP, standard deviation (SD) of IOP, previous argon/selective laser trabeculoplasties (ALT/SLT), previous trabeculectomy, and previous trabeculotomy) was analyzed. The mTD in the initial VF was -6.9 ± 6.2 dB and the mTD progression rate was -0.26 ± 0.46 dB/year. Mean IOP during the follow-up period was 13.5 ± 2.2 mm Hg. Age and SD of IOP were related to the mTD progression rate. However, in eyes with average IOP below 15 mm Hg (and also below 13 mm Hg), only age and baseline VF mTD were related to the mTD progression rate. Age and the degree of VF damage were related to future progression. Average IOP was not related to the progression rate; however, fluctuation of IOP was associated with faster progression, although this was not the case when average IOP was below 15 mm Hg.

  5. European multicentre database of healthy controls for [123I]FP-CIT SPECT (ENC-DAT): age-related effects, gender differences and evaluation of different methods of analysis.

    PubMed

    Varrone, Andrea; Dickson, John C; Tossici-Bolt, Livia; Sera, Terez; Asenbaum, Susanne; Booij, Jan; Kapucu, Ozlem L; Kluge, Andreas; Knudsen, Gitte M; Koulibaly, Pierre Malick; Nobili, Flavio; Pagani, Marco; Sabri, Osama; Vander Borght, Thierry; Van Laere, Koen; Tatsch, Klaus

    2013-01-01

    Dopamine transporter (DAT) imaging with [123I]FP-CIT (DaTSCAN) is an established diagnostic tool in parkinsonism and dementia. Although qualitative assessment criteria are available, DAT quantification is important for research and for completion of a diagnostic evaluation. One critical aspect of quantification is the availability of normative data, considering possible age and gender effects on DAT availability. The aim of the European Normal Control Database of DaTSCAN (ENC-DAT) study was to generate a large database of [123I]FP-CIT SPECT scans in healthy controls. SPECT data from 139 healthy controls (74 men, 65 women; age range 20-83 years, mean 53 years) acquired in 13 different centres were included. Images were reconstructed using the ordered-subset expectation-maximization algorithm without correction (NOACSC), with attenuation correction (AC), and with both attenuation and scatter correction using the triple-energy window method (ACSC). Region-of-interest analysis was performed using the BRASS software (caudate and putamen) and the Southampton method (striatum). The outcome measure was the specific binding ratio (SBR). A significant effect of age on SBR was found for all data. Gender had a significant effect on SBR in the caudate and putamen for the NOACSC and AC data, and only in the left caudate for the ACSC data (BRASS method). Significant effects of age and gender on striatal SBR were observed for all data analysed with the Southampton method. Overall, there was a significant age-related decline in SBR of between 4 % and 6.7 % per decade. This study provides a large database of [123I]FP-CIT SPECT scans in healthy controls across a wide age range and with balanced gender representation. Higher DAT availability was found in women than in men. An average age-related decline in DAT availability of 5.5 % per decade was found for both genders, in agreement with previous reports. The data collected in this study may serve as a reference database for nuclear medicine centres and for clinical trials using [123I]FP-CIT SPECT as the imaging marker.

  6. High-Latitude Topside Ionospheric Vertical Electron-Density-Profile Changes in Response to Large Magnetic Storms

    NASA Technical Reports Server (NTRS)

    Benson, Robert F.; Fainberg, Joseph; Osherovich, Vladimir A.; Truhlik, Vladimir; Wang, Yongli; Bilitza, Dieter; Fung, Shing F.

    2015-01-01

    Large magnetic-storm-induced changes have been detected in high-latitude topside vertical electron-density profiles Ne(h). The investigation was based on the large database of topside Ne(h) profiles and digital topside ionograms from the International Satellites for Ionospheric Studies (ISIS) program available from the NASA Space Physics Data Facility (SPDF) at http://spdf.gsfc.nasa.gov/isis/isis-status.html. This large database enabled Ne(h) profiles to be obtained when an ISIS satellite passed through nearly the same region of space before, during, and after a major magnetic storm. A major goal was to relate the magnetic-storm-induced high-latitude Ne(h) profile changes to solar-wind parameters. Thus an additional data constraint was to consider only storms where solar-wind data were available from the NASA/SPDF OMNIWeb database. Ten large magnetic storms (with Dst less than -100 nT) were identified that satisfied both the Ne(h) profile and the solar-wind data constraints. During five of these storms topside ionospheric Ne(h) profiles were available in the high-latitude northern hemisphere, and during the other five storms similar ionospheric data were available in the southern hemisphere. Large Ne(h) changes were observed during each of these storms. This paper concentrates on the northern hemisphere, where the data coverage was best for winter. Here Ne(h) profile enhancements were always observed when the magnetic local time (MLT) was between 00 and 03, and Ne(h) profile depletions were always observed between 08 and 10 MLT. The observed Ne(h) deviations were compared with solar-wind parameters, with appropriate time shifts, for four storms.

  7. Data model and relational database design for the New England Water-Use Data System (NEWUDS)

    USGS Publications Warehouse

    Tessler, Steven

    2001-01-01

    The New England Water-Use Data System (NEWUDS) is a database for the storage and retrieval of water-use data. NEWUDS can handle data covering many facets of water use, including (1) tracking various types of water-use activities (withdrawals, returns, transfers, distributions, consumptive-use, wastewater collection, and treatment); (2) the description, classification and location of places and organizations involved in water-use activities; (3) details about measured or estimated volumes of water associated with water-use activities; and (4) information about data sources and water resources associated with water use. In NEWUDS, each water transaction occurs unidirectionally between two site objects, and the sites and conveyances form a water network. The core entities in the NEWUDS model are site, conveyance, transaction/rate, location, and owner. Other important entities include water resources (used for withdrawals and returns), data sources, and aliases. Multiple water-exchange estimates can be stored for individual transactions based on different methods or data sources. Storage of user-defined details is accommodated for several of the main entities. Numerous tables containing classification terms facilitate detailed descriptions of data items and can be used for routine or custom data summarization. NEWUDS handles single-user and aggregate-user water-use data, can be used for large or small water-network projects, and is available as a stand-alone Microsoft Access database structure. Users can customize and extend the database, link it to other databases, or implement the design in other relational database applications.
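
    The core entities lend themselves to a compact relational layout. The DDL below is an illustrative compression of the site/conveyance/transaction core, not the published NEWUDS design:

      # Names are illustrative, not the published NEWUDS tables.
      import sqlite3

      ddl = """
      CREATE TABLE site        (site_id INTEGER PRIMARY KEY, name TEXT, owner TEXT);
      CREATE TABLE conveyance  (conveyance_id INTEGER PRIMARY KEY,
                                from_site INTEGER REFERENCES site(site_id),
                                to_site   INTEGER REFERENCES site(site_id));
      -- Each transaction moves water one way along a conveyance; multiple
      -- volume estimates per transaction are allowed (different methods/sources).
      CREATE TABLE water_txn   (txn_id INTEGER PRIMARY KEY,
                                conveyance_id INTEGER
                                    REFERENCES conveyance(conveyance_id),
                                activity TEXT);  -- withdrawal, return, transfer
      CREATE TABLE txn_estimate(txn_id INTEGER REFERENCES water_txn(txn_id),
                                method TEXT, volume_mgd REAL);
      """
      conn = sqlite3.connect(":memory:")
      conn.executescript(ddl)
      print([r[0] for r in conn.execute(
          "SELECT name FROM sqlite_master WHERE type='table'")])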

  8. Cutaneous lichen planus: A systematic review of treatments.

    PubMed

    Fazel, Nasim

    2015-06-01

    Various treatment modalities are available for cutaneous lichen planus. PubMed, EMBASE, the Cochrane Database of Systematic Reviews, the Cochrane Central Register of Controlled Trials, the Database of Abstracts of Reviews of Effects, and the Health Technology Assessment Database were searched for all systematic reviews and randomized controlled trials related to cutaneous lichen planus. Two systematic reviews and nine relevant randomized controlled trials were identified. Acitretin, griseofulvin, hydroxychloroquine and narrow-band ultraviolet B are demonstrated to be effective in the treatment of cutaneous lichen planus. Sulfasalazine is effective, but has an unfavorable safety profile. KH1060, a vitamin D analogue, is not beneficial in the management of cutaneous lichen planus. Evidence from large-scale randomized trials demonstrating the safety and efficacy of many other treatment modalities used to treat cutaneous lichen planus is simply not available.

  9. Adverse Events Associated with Prolonged Antibiotic Use

    PubMed Central

    Meropol, Sharon B.; Chan, K. Arnold; Chen, Zhen; Finkelstein, Jonathan A.; Hennessy, Sean; Lautenbach, Ebbing; Platt, Richard; Schech, Stephanie D.; Shatin, Deborah; Metlay, Joshua P.

    2014-01-01

    Purpose: The Infectious Diseases Society of America and US CDC recommend 60 days of ciprofloxacin, doxycycline or amoxicillin for anthrax prophylaxis. It is not possible to determine severe adverse drug event (ADE) risks from the few people thus far exposed to anthrax prophylaxis. This study’s objective was to estimate risks of severe ADEs associated with long-term ciprofloxacin, doxycycline and amoxicillin exposure using 3 large databases: one electronic medical record (General Practice Research Database) and two claims databases (UnitedHealthcare, HMO Research Network). Methods: We include office visit, hospital admission and prescription data for 1/1/1999–6/30/2001. The exposure variable was oral antibiotic person-days (pds). The primary outcome was hospitalization during exposure with ADE diagnoses: anaphylaxis, phototoxicity, hepatotoxicity, nephrotoxicity, seizures, ventricular arrhythmia or infectious colitis. Results: We randomly sampled 999,773, 1,047,496 and 1,819,004 patients from Databases A, B and C respectively. 33,183 amoxicillin, 15,250 ciprofloxacin and 50,171 doxycycline prescriptions continued ≥30 days. ADE hospitalizations during long-term exposure were not observed in Database A. ADEs during long-term amoxicillin were seen only in Database C, with 5 ADEs or 1.2(0.4–2.7) ADEs/100,000 pds of exposure. Long-term ciprofloxacin showed 3 and 4 ADEs with 5.7(1.2–16.6) and 3.5(1.0–9.0) ADEs/100,000 pds in Databases B and C, respectively. Only Database B had ADEs during long-term doxycycline, with 3 ADEs or 0.9(0.2–2.6) ADEs/100,000 pds. For most events, the incidence rate ratio comparing >28 vs. 1–28 pds of exposure was <1, showing limited evidence for cumulative dose-related ADEs from long-term exposure. Conclusions: Long-term amoxicillin, ciprofloxacin and doxycycline appear safe, supporting use of these medications if needed for large-scale post-exposure anthrax prophylaxis. PMID:18215001

  10. A Data Analysis Expert System For Large Established Distributed Databases

    NASA Astrophysics Data System (ADS)

    Gnacek, Anne-Marie; An, Y. Kim; Ryan, J. Patrick

    1987-05-01

    The purpose of this work is to analyze the applicability of artificial intelligence techniques for developing a user-friendly, parallel interface to large isolated, incompatible NASA databases for the purpose of assisting the management decision process. To carry out this work, a survey was conducted to establish the data access requirements of several key NASA user groups. In addition, current NASA database access methods were evaluated. The results of this work are presented in the form of a design for a natural language database interface system, called the Deductively Augmented NASA Management Decision Support System (DANMDS). This design is feasible principally because of recently announced commercial hardware and software product developments which allow cross-vendor compatibility. The goal of the DANMDS system is commensurate with the central dilemma confronting most large companies and institutions in America, the retrieval of information from large, established, incompatible database systems. The DANMDS system implementation would represent a significant first step toward this problem's resolution.

  11. Pattern-based, multi-scale segmentation and regionalization of EOSD land cover

    NASA Astrophysics Data System (ADS)

    Niesterowicz, Jacek; Stepinski, Tomasz F.

    2017-10-01

    The Earth Observation for Sustainable Development of Forests (EOSD) map is a 25 m resolution thematic map of Canadian forests. Because of its large spatial extent and relatively high resolution, the EOSD is difficult to analyze using standard GIS methods. In this paper we propose multi-scale segmentation and regionalization of EOSD as new methods for analyzing EOSD on large spatial scales. Segments, which we refer to as forest land units (FLUs), are delineated as tracts of forest characterized by cohesive patterns of EOSD categories; we delineated from 727 to 91,885 FLUs within the spatial extent of EOSD, depending on the selected scale of a pattern. The pattern of EOSD categories within each FLU is described by 1037 landscape metrics. A shapefile containing the boundaries of all FLUs, together with an attribute table listing landscape metrics, makes up an SQL-searchable spatial database providing detailed information on the composition and pattern of land cover types in Canadian forests. The shapefile format and extensive attribute table covering the entire legend of EOSD are designed to facilitate a broad range of investigations in which assessment of the composition and pattern of forest over large areas is needed. We calculated four such databases using different spatial scales of pattern. We illustrate the use of the FLU database by producing forest regionalization maps of two Canadian provinces, Quebec and Ontario. Such maps capture the broad-scale variability of forest at the spatial scale of the entire province. We also demonstrate how the FLU database can be used to map the variability of landscape metrics, and thus the character of the landscape, across all of Canada.

  12. The Cancer Epidemiology Descriptive Cohort Database: A Tool to Support Population-Based Interdisciplinary Research

    PubMed Central

    Kennedy, Amy E.; Khoury, Muin J.; Ioannidis, John P.A.; Brotzman, Michelle; Miller, Amy; Lane, Crystal; Lai, Gabriel Y.; Rogers, Scott D.; Harvey, Chinonye; Elena, Joanne W.; Seminara, Daniela

    2017-01-01

    Background We report on the establishment of a web-based Cancer Epidemiology Descriptive Cohort Database (CEDCD). The CEDCD’s goals are to enhance awareness of resources, facilitate interdisciplinary research collaborations, and support existing cohorts for the study of cancer-related outcomes. Methods Comprehensive descriptive data were collected from large cohorts established to study cancer as primary outcome using a newly developed questionnaire. These included an inventory of baseline and follow-up data, biospecimens, genomics, policies, and protocols. Additional descriptive data extracted from publicly available sources were also collected. This information was entered in a searchable and publicly accessible database. We summarized the descriptive data across cohorts and reported the characteristics of this resource. Results As of December 2015, the CEDCD includes data from 46 cohorts representing more than 6.5 million individuals (29% ethnic/racial minorities). Overall, 78% of the cohorts have collected blood at least once, 57% at multiple time points, and 46% collected tissue samples. Genotyping has been performed by 67% of the cohorts, while 46% have performed whole-genome or exome sequencing in subsets of enrolled individuals. Information on medical conditions other than cancer has been collected in more than 50% of the cohorts. More than 600,000 incident cancer cases and more than 40,000 prevalent cases are reported, with 24 cancer sites represented. Conclusions The CEDCD assembles detailed descriptive information on a large number of cancer cohorts in a searchable database. Impact Information from the CEDCD may assist the interdisciplinary research community by facilitating identification of well-established population resources and large-scale collaborative and integrative research. PMID:27439404

  13. Comparative effectiveness analysis of anticoagulant strategies in a large observational database of percutaneous coronary interventions.

    PubMed

    Wise, Gregory R; Schwartz, Brian P; Dittoe, Nathaniel; Safar, Ammar; Sherman, Steven; Bowdy, Bruce; Hahn, Harvey S

    2012-06-01

    Percutaneous coronary intervention (PCI) is the most commonly used procedure for coronary revascularization. There are multiple adjuvant anticoagulation strategies available. In this era of cost containment, we performed a comparative effectiveness analysis of clinical outcomes and cost of the major anticoagulant strategies across all types of PCI procedures in a large observational database. A retrospective, comparative effectiveness analysis of the Premier observational database was conducted to determine the impact of anticoagulant treatment on outcomes. Multiple linear regression and logistic regression models were used to assess the association of initial antithrombotic treatment with outcomes while controlling for other factors. A total of 458,448 inpatient PCI procedures with known antithrombotic regimen from 299 hospitals between January 1, 2004 and March 31, 2008 were identified. Compared to patients treated with heparin plus glycoprotein IIb/IIIa inhibitor (GPI), bivalirudin was associated with a 41% relative risk reduction (RRR) for inpatient mortality, a 44% RRR for clinically apparent bleeding, and a 37% RRR for any transfusion. Furthermore, treatment with bivalirudin alone resulted in a cost savings of $976 per case. Similar results were seen between bivalirudin and heparin in all end-points. Combined use of both bivalirudin and GPI substantially attenuated the cost benefits demonstrated with bivalirudin alone. Bivalirudin use was associated with both improved clinical outcomes and decreased hospital costs in this large "real-world" database. To our knowledge, this study is the first to demonstrate the ideal comparative effectiveness end-point of both improved clinical outcomes with decreased costs in PCI. ©2012, Wiley Periodicals, Inc.

  14. Fast Updating National Geo-Spatial Databases with High Resolution Imagery: China's Methodology and Experience

    NASA Astrophysics Data System (ADS)

    Chen, J.; Wang, D.; Zhao, R. L.; Zhang, H.; Liao, A.; Jiu, J.

    2014-04-01

    Geospatial databases are an irreplaceable national treasure of immense importance. Their up-to-dateness, i.e., their consistency with respect to the real world, plays a critical role in their value and applications. The continuous updating of map databases at 1:50,000 scale is a massive and difficult task for larger countries covering several million square kilometers. This paper presents the research and technological development supporting national map updating at 1:50,000 scale in China, including the development of updating models and methods, production tools and systems for large-scale and rapid updating, and the design and implementation of the continuous updating workflow. The use of many data sources, and the integration of these data to form a high-accuracy, quality-checked product, was required. This in turn required up-to-date techniques of image matching, semantic integration, generalization, database management and conflict resolution. Specific software tools and packages were designed and developed to support large-scale updating production with high-resolution imagery and large-scale data generalization, such as map generalization, GIS-supported change interpretation from imagery, DEM interpolation, image matching-based orthophoto generation, and data control at different levels. A national 1:50,000 database updating strategy and its production workflow were designed, including a full-coverage updating pattern characterized by all-element topographic data modeling, change detection in all related areas, and whole-process data quality control; a series of technical production specifications; and a network of updating production units in different geographic places in the country.

  15. Supporting Building Portfolio Investment and Policy Decision Making through an Integrated Building Utility Data Platform

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Aziz, Azizan; Lasternas, Bertrand; Alschuler, Elena

    The American Recovery and Reinvestment Act stimulus funding of 2009 for smart grid projects resulted in the tripling of smart meter deployment. In 2012, the Green Button initiative provided utility customers with access to their real-time energy usage. The availability of finely granular data provides an enormous potential for energy data analytics and energy benchmarking. The sheer volume of time-series utility data from a large number of buildings also poses challenges in data collection, quality control, and database management for rigorous and meaningful analyses. In this paper, we will describe a building portfolio-level data analytics tool for operational optimization, business investment and policy assessment using 15-minute to monthly interval utility data. The analytics tool is developed on top of the U.S. Department of Energy’s Standard Energy Efficiency Data (SEED) platform, an open source software application that manages energy performance data of large groups of buildings. To support the significantly large volume of granular interval data, we integrated a parallel time-series database with the existing relational database. The time-series database improves on the current utility data input, focusing on real-time data collection, storage, analytics and data quality control. The fully integrated data platform supports APIs for utility app development by third-party software developers. These apps will provide actionable intelligence for building owners and facilities managers. Unlike a commercial system, this platform is an open source platform funded by the U.S. Government, accessible to the public, researchers and other developers, to support initiatives in reducing building energy consumption.
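
    The hybrid layout described above keeps building records relational while routing interval readings to a table tuned for time-range scans. In this sketch SQLite stands in for both SEED's relational store and the parallel time-series database, and all names are illustrative:

      # Illustrative hybrid layout; names are assumptions, not SEED's schema.
      import sqlite3, datetime

      conn = sqlite3.connect(":memory:")
      conn.executescript("""
      CREATE TABLE building (building_id INTEGER PRIMARY KEY, address TEXT);
      CREATE TABLE meter_reading (
          building_id INTEGER REFERENCES building(building_id),
          ts TEXT,            -- ISO-8601 timestamp of the 15-minute interval
          kwh REAL
      );
      CREATE INDEX idx_reading ON meter_reading (building_id, ts);
      """)
      conn.execute("INSERT INTO building VALUES (1, '123 Main St')")
      t0 = datetime.datetime(2015, 7, 1)
      conn.executemany(
          "INSERT INTO meter_reading VALUES (1, ?, ?)",
          [((t0 + datetime.timedelta(minutes=15 * i)).isoformat(),
            0.4 + 0.01 * i) for i in range(96)])  # one day of 15-minute data
      # Roll 15-minute readings up to a daily total for benchmarking.
      print(conn.execute(
          "SELECT date(ts), round(sum(kwh), 2) FROM meter_reading "
          "GROUP BY date(ts)").fetchone())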

  16. The National Hospital Discharge Survey and Nationwide Inpatient Sample: the databases used affect results in THA research.

    PubMed

    Bekkers, Stijn; Bot, Arjan G J; Makarawung, Dennis; Neuhaus, Valentin; Ring, David

    2014-11-01

    The National Hospital Discharge Survey (NHDS) and the Nationwide Inpatient Sample (NIS) collect sample data and publish annual estimates of inpatient care in the United States, and both are commonly used in orthopaedic research. However, there are important differences between the databases, and because of these differences, asking these two databases the same question may result in different answers. The degree to which this is true for arthroplasty-related research has, to our knowledge, not been characterized. We tested the following null hypotheses: (1) there are no differences between the NHDS and NIS in patient characteristics, comorbidities, and adverse events in patients with hip osteoarthritis treated with THA, and (2) there are no differences between databases in factors associated with inpatient mortality, adverse events, and length of hospital stay after THA. The NHDS and NIS databases use different methods of data collection and weighting to provide data representative of all nonfederal hospital discharges in the United States. In 2006 the NHDS database contained 203,149 patients with hip arthritis treated with hip arthroplasty, and the NIS database included 193,879 patients. Multivariable analyses for factors associated with inpatient mortality, adverse events, and days of care were constructed for each database. We found that 26 of 42 factors in demographics, comorbidities, and adverse events after THA in the NIS and NHDS databases differed by more than 10%. Age and days of care were associated with inpatient mortality in both the NHDS and the NIS, although the effect rates differed by more than 10%. The NIS identified several other factors not identified by the NHDS (wound complications, congestive heart failure, new mental disorder, chronic pulmonary disease, dementia, geographic region Northeast, acute postoperative anemia, and sex) that were associated with inpatient mortality even after controlling for potentially confounding variables. For inpatient adverse events, atrial fibrillation, osteoporosis, and female sex were associated in both the NHDS and the NIS, although the effect rates differed by more than 10%. There were different directions for sources of payment, dementia, congestive heart failure, and geographic region. For longer length of stay, common factors differing by more than 10% in effect rate included chronic pulmonary disease, atrial fibrillation, complication not elsewhere classified, congestive heart failure, transfusion, nonroutine compared with routine discharge, acute postoperative anemia, hypertension, wound adverse events, and diabetes mellitus, whereas discrepant factors included geographic region, payment method, dementia, sex, and iatrogenic hypotension. Studies that use large databases intended to be representative of the entire United States population can produce different results, likely related to differences in the databases, such as the number of comorbidities and procedures that can be entered in the database. In other words, analyses of large databases can have limited reliability and should be interpreted with caution. Level II, prognostic study. See the Instructions for Authors for a complete description of levels of evidence.

  17. Information Retrieval Strategies of Millennial Undergraduate Students in Web and Library Database Searches

    ERIC Educational Resources Information Center

    Porter, Brandi

    2009-01-01

    Millennial students make up a large portion of undergraduate students attending colleges and universities, and they have a variety of online resources available to them to complete academically related information searches, primarily Web based and library-based online information retrieval systems. The content, ease of use, and required search…

  18. Application of a large electronic health record database to study effects of air quality on mortality in a heart failure population

    EPA Science Inventory

    Background: Electronic health records (EHRs) are now a ubiquitous component of the US healthcare system and are attractive for secondary data analysis as they contain detailed and longitudinal clinical records on potentially millions of individuals. However, due to their relative...

  19. Relative Performance of Hardwood Sawing Machines

    Treesearch

    Philip H. Steele; Michael W. Wade; Steven H. Bullard; Philip A. Araman

    1991-01-01

    Only limited information has been available to hardwood sawmillers on the performance of their sawing machines. This study analyzes a large database of individual machine studies to provide detailed information on 6 machine types. These machine types were band headrig, circular headrig, band linebar resaw, vertical band splitter resaw, single arbor gang resaw and...

  20. May I Recruit through Your Agency? Considerations for Researchers of Spousal Caregivers

    ERIC Educational Resources Information Center

    Pote, Steven

    2016-01-01

    Few resources within social science literature provide helpful information related to participant recruitment for research on spousal caregivers of individuals with dementia. Independent researchers with limited resources may not have access to large databases or centers in which access to caregivers is readily available. As a result, these…

  1. Use of Patient Registries and Administrative Datasets for the Study of Pediatric Cancer

    PubMed Central

    Rice, Henry E.; Englum, Brian R.; Gulack, Brian C.; Adibe, Obinna O.; Tracy, Elizabeth T.; Kreissman, Susan G.; Routh, Jonathan C.

    2015-01-01

    Analysis of data from large administrative databases and patient registries is increasingly being used to study childhood cancer care, although the value of these data sources remains unclear to many clinicians. Interpretation of large databases requires a thorough understanding of how the dataset was designed, how data were collected, and how to assess data quality. This review will detail the role of administrative databases and registry databases for the study of childhood cancer, tools to maximize information from these datasets, and recommendations to improve the use of these databases for the study of pediatric oncology. PMID:25807938

  2. Fractal patterns in Stock Intertrading Times

    NASA Astrophysics Data System (ADS)

    White, Ainslie; Lee, Youngki; Ivanov, Plamen Ch.

    2003-03-01

    We study intertrade times (ITTs) of stock trades for a range of companies included in the New York Stock Exchange's Trades and Quotes (TAQ) database. The time between transactions is an indicator of the dynamics of the market, and in the field of econometrics, intertrade durations play a key role in the understanding of market activity and microstructure. Previous work has mainly focused on the properties of price changes of individual company stocks as well as global financial indices (e.g., the S&P 500 and the Dow Jones). We hypothesize that there is a relation between the dynamics of price change and the trading activity. To investigate this relation we first study the statistical features of ITT data. The TAQ database covers all transactions on the NYSE, AMEX, NASDAQ and the US regional exchanges. We have performed a preliminary analysis of 100 company stocks from a range of industries of the US economy, selecting predominantly those companies which have large market capitalisations (MC). We focus on companies with large MC, since the dynamics of the price change and trading activity of stocks of such companies has a considerable impact on the market behaviour.
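
    The ITT series itself is simply the first difference of consecutive trade timestamps, as in this toy Python sketch (timestamps are invented, not TAQ data):

    ```python
    # Intertrade times are first differences of consecutive trade timestamps.
    # Timestamps here are illustrative, in milliseconds after the opening bell.
    trade_times_ms = [0, 400, 1100, 3000, 3100]
    itt_ms = [t2 - t1 for t1, t2 in zip(trade_times_ms, trade_times_ms[1:])]
    print(itt_ms)  # [400, 700, 1900, 100]
    ```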

  3. A multi-user real time inventorying system for radioactive materials: a networking approach.

    PubMed

    Mehta, S; Bandyopadhyay, D; Hoory, S

    1998-01-01

    A computerized system for radioisotope management and real time inventory coordinated across a large organization is reported. It handles hundreds of individual users and their separate inventory records. Use of highly efficient computer network and database technologies makes it possible to accept, maintain, and furnish all records related to receipt, usage, and disposal of the radioactive materials for the users separately and collectively. The system's central processor is an HP-9000/800 G60 RISC server, and users from across the organization use their personal computers to log in to this server using the TCP/IP networking protocol, which makes distributed use of the system possible. Radioisotope decay is automatically calculated by the program, so that it can make the up-to-date radioisotope inventory data of an entire institution available immediately. The system is specifically designed to allow use by large numbers of users (about 300) and accommodates high volumes of data input and retrieval without compromising simplicity and accuracy. Overall, it is an example of a true multi-user, on-line, relational database information system that makes the functioning of a radiation safety department efficient.
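
    As a rough illustration of the decay bookkeeping such a system performs, here is a minimal Python sketch; the function name, units and isotope values are illustrative, not taken from the paper:

    ```python
    import math
    from datetime import datetime

    def current_activity(initial_activity_mbq, half_life_days, received, now):
        """Decay-correct an inventory record: A(t) = A0 * exp(-ln(2) * t / T_half)."""
        elapsed_days = (now - received).total_seconds() / 86400.0
        decay_constant = math.log(2) / half_life_days
        return initial_activity_mbq * math.exp(-decay_constant * elapsed_days)

    # e.g. a 100 MBq shipment of P-32 (half-life ~14.3 days), 30 days after receipt
    print(current_activity(100.0, 14.3, datetime(2024, 1, 1), datetime(2024, 1, 31)))
    # -> roughly 23 MBq
    ```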

  4. Relations between some horizontal‐component ground‐motion intensity measures used in practice

    USGS Publications Warehouse

    Boore, David; Kishida, Tadahiro

    2017-01-01

    Various measures using the two horizontal components of recorded ground motions have been used in a number of studies that derive ground‐motion prediction equations and construct maps of shaking intensity. We update relations between a number of these measures, including those in Boore et al. (2006) and Boore (2010), using the large and carefully constructed global database of ground motions from crustal earthquakes in active tectonic regions developed as part of the Pacific Earthquake Engineering Research Center–Next Generation Attenuation‐West2 project. The ratios from the expanded datasets generally agree to within a few percent of the previously published ratios. We also provide some ratios that were not considered before, some of which will be useful in applications such as constructing ShakeMaps. Finally, we compare two important ratios with those from a large central and eastern North American database and from many records from subduction earthquakes in Japan and Taiwan. In general, the ratios from these regions are within several percent of those from crustal earthquakes in active tectonic regions.

  5. Modeling and Databases for Teaching Petrology

    NASA Astrophysics Data System (ADS)

    Asher, P.; Dutrow, B.

    2003-12-01

    With the widespread availability of high-speed computers with massive storage and ready transport capability of large amounts of data, computational and petrologic modeling and the use of databases provide new tools with which to teach petrology. Modeling can be used to gain insights into a system, predict system behavior, describe a system's processes, compare with a natural system or simply to be illustrative. These insights may come from data-driven (empirical), analytical or numerical models, or from the concurrent examination of multiple lines of evidence. At the same time, use of models can enhance core foundations of the geosciences by improving critical thinking skills and by reinforcing prior knowledge. However, the use of modeling to teach petrology is dictated by the level of expectation we have for students and their facility with modeling approaches. For example, do we expect students to push buttons and navigate a program, understand the conceptual model, and/or evaluate the results of a model? Whatever the desired level of sophistication, specific elements of design should be incorporated into a modeling exercise for effective teaching. These include, but are not limited to: use of the scientific method, use of prior knowledge, a clear statement of purpose and goals, attainable goals, a connection to the natural/actual system, a demonstration that complex heterogeneous natural systems are amenable to analysis by these techniques and, ideally, connections to other disciplines and the larger earth system. Databases offer another avenue with which to explore petrology. Large datasets are available that allow integration of multiple lines of evidence to attack a petrologic problem or understand a petrologic process. These are collected into a database that offers a tool for exploring, organizing and analyzing the data. For example, datasets may be geochemical, mineralogic, experimental and/or visual in nature, covering global, regional to local scales. These datasets provide students with access to large amounts of related data through space and time. Goals of the database working group include educating earth scientists about information systems in general, about the importance of metadata, about ways of using databases and datasets as educational tools, and about the availability of existing datasets and databases. The modeling and databases groups hope to create additional petrologic teaching tools using these aspects and invite the community to contribute to the effort.

  6. GIDL: a rule based expert system for GenBank Intelligent Data Loading into the Molecular Biodiversity database

    PubMed Central

    2012-01-01

    Background In the scientific biodiversity community, the need to build a bridge between molecular and traditional biodiversity studies is increasingly perceived. We believe that information technology could have a preeminent role in integrating the information generated by these studies with the large amount of molecular data available in public bioinformatics databases. This work is primarily aimed at building a bioinformatic infrastructure for the integration of public and private biodiversity data through the development of GIDL, an Intelligent Data Loader coupled with the Molecular Biodiversity Database. The system presented here organizes in an ontological way and locally stores the sequence and annotation data contained in the GenBank primary database. Methods The GIDL architecture consists of a relational database and an intelligent data loader software. The relational database schema is designed to manage biodiversity information (Molecular Biodiversity Database) and is organized in four areas: MolecularData, Experiment, Collection and Taxonomy. The MolecularData area is inspired by an established standard in Generic Model Organism Databases, the Chado relational schema. The peculiarity of Chado, and also its strength, is the adoption of an ontological schema which makes use of the Sequence Ontology. The Intelligent Data Loader (IDL) component of GIDL is an Extract, Transform and Load software able to parse data, to discover hidden information in the GenBank entries and to populate the Molecular Biodiversity Database. The IDL is composed of three main modules: the Parser, able to parse GenBank flat files; the Reasoner, which automatically builds CLIPS facts mapping the biological knowledge expressed by the Sequence Ontology; and the DBFiller, which translates the CLIPS facts into ordered SQL statements used to populate the database. In GIDL, Semantic Web technologies have been adopted due to their advantages in data representation, integration and processing. Results and conclusions Entries coming from the Virus (814,122), Plant (1,365,360) and Invertebrate (959,065) divisions of GenBank rel.180 have been loaded into the Molecular Biodiversity Database by GIDL. Our system, combining the Sequence Ontology and the Chado schema, allows more powerful query expressiveness than the most commonly used sequence retrieval systems such as Entrez or SRS. PMID:22536971
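
    The final DBFiller step (facts translated into ordered SQL statements) can be pictured with a minimal Python/sqlite3 sketch; the two-table schema and the parsed entry below are invented stand-ins for the much richer Chado-based schema:

    ```python
    import sqlite3

    # Hypothetical, drastically simplified stand-in for the real schema: a
    # taxonomy row must exist before the sequence row that references it,
    # which is why the loader emits its SQL statements in dependency order.
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    CREATE TABLE taxonomy (taxon_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE sequence (acc TEXT PRIMARY KEY,
                           taxon_id INTEGER REFERENCES taxonomy,
                           length INTEGER, definition TEXT);
    """)

    # A parsed GenBank entry as the Parser/Reasoner might hand it to the DBFiller.
    entry = {"acc": "AB000001", "taxon_id": 562, "taxon_name": "Escherichia coli",
             "length": 1024, "definition": "hypothetical demo record"}

    conn.execute("INSERT OR IGNORE INTO taxonomy VALUES (?, ?)",
                 (entry["taxon_id"], entry["taxon_name"]))
    conn.execute("INSERT INTO sequence VALUES (?, ?, ?, ?)",
                 (entry["acc"], entry["taxon_id"], entry["length"],
                  entry["definition"]))
    conn.commit()
    ```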

  7. PedNavigator: a pedigree drawing servlet for large and inbred populations.

    PubMed

    Mancosu, Gianmaria; Ledda, Giuseppe; Melis, Paola M

    2003-03-22

    PedNavigator is a pedigree drawing application for large and complex pedigrees. It has been developed especially for genetic and epidemiological studies of isolated populations characterized by high inbreeding and multiple marriages. PedNavigator is written in Java and is intended as a server-side web application, allowing researchers to 'walk' through family ties by pointing and clicking on individuals' symbols. The application is able to enrich the pedigree drawings with genotypic and phenotypic information taken from the underlying relational database.

  8. Very Large Data Volumes Analysis of Collaborative Systems with Finite Number of States

    ERIC Educational Resources Information Center

    Ivan, Ion; Ciurea, Cristian; Pavel, Sorin

    2010-01-01

    A collaborative system with a finite number of states is defined. A very large database is structured. Operations on large databases are identified. Repetitive procedures for collaborative systems operations are derived. The efficiency of such procedures is analyzed. (Contains 6 tables, 5 footnotes and 3 figures.)

  9. A Relational/Object-Oriented Database Management System: R/OODBMS

    DTIC Science & Technology

    1992-09-01

    In 1968, Dr. Edgar F. Codd had the idea that "predicate logic could be applied to maintaining the logical integrity of the data" in a DBMS [CD90]. ... [Co70] Codd, E. F., "A Relational Model for Large Shared Data Banks," Communications of the ACM, v. 13, no. 6, pp. 377-387, Jun 1970. [CD90] Interview between E. F. Codd and DBMS, "Relational philosopher: the creator of the relational model talks about his

  10. MIMIC II: a massive temporal ICU patient database to support research in intelligent patient monitoring

    NASA Technical Reports Server (NTRS)

    Saeed, M.; Lieu, C.; Raber, G.; Mark, R. G.

    2002-01-01

    Development and evaluation of Intensive Care Unit (ICU) decision-support systems would be greatly facilitated by the availability of a large-scale ICU patient database. Following our previous efforts with the MIMIC (Multi-parameter Intelligent Monitoring for Intensive Care) Database, we have leveraged advances in networking and storage technologies to develop a far more massive temporal database, MIMIC II. MIMIC II is an ongoing effort: data is continuously and prospectively archived from all ICU patients in our hospital. MIMIC II now consists of over 800 ICU patient records including over 120 gigabytes of data and is growing. A customized archiving system was used to store continuously up to four waveforms and 30 different parameters from ICU patient monitors. An integrated user-friendly relational database was developed for browsing of patients' clinical information (lab results, fluid balance, medications, nurses' progress notes). Based upon its unprecedented size and scope, MIMIC II will prove to be an important resource for intelligent patient monitoring research, and will support efforts in medical data mining and knowledge-discovery.

  11. OrChem - An open source chemistry search engine for Oracle(R).

    PubMed

    Rijnbeek, Mark; Steinbeck, Christoph

    2009-10-22

    Registration, indexing and searching of chemical structures in relational databases is one of the core areas of cheminformatics. However, little detail has been published on the inner workings of search engines and their development has been mostly closed-source. We decided to develop an open source chemistry extension for Oracle, the de facto database platform in the commercial world. Here we present OrChem, an extension for the Oracle 11G database that adds registration and indexing of chemical structures to support fast substructure and similarity searching. The cheminformatics functionality is provided by the Chemistry Development Kit. OrChem provides similarity searching with response times in the order of seconds for databases with millions of compounds, depending on a given similarity cut-off. For substructure searching, it can make use of multiple processor cores on today's powerful database servers to provide fast response times in equally large data sets. OrChem is free software and can be redistributed and/or modified under the terms of the GNU Lesser General Public License as published by the Free Software Foundation. All software is available via http://orchem.sourceforge.net.
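
    At its core, the similarity search described above compares precomputed fingerprints against a Tanimoto cut-off. A minimal Python sketch of that principle (the fingerprints here are toy bit sets, not CDK output):

    ```python
    def tanimoto(fp_a, fp_b):
        """Tanimoto coefficient of two fingerprint bit sets: intersection / union."""
        inter = len(fp_a & fp_b)
        return inter / (len(fp_a) + len(fp_b) - inter)

    query = {1, 4, 9, 23, 42}
    database = {"mol-1": {1, 4, 9, 23, 42, 57}, "mol-2": {2, 8, 16}}
    cutoff = 0.7
    hits = {name: score for name, fp in database.items()
            if (score := tanimoto(query, fp)) >= cutoff}
    print(hits)  # {'mol-1': 0.8333...}
    ```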

  12. A geospatial database model for the management of remote sensing datasets at multiple spectral, spatial, and temporal scales

    NASA Astrophysics Data System (ADS)

    Ifimov, Gabriela; Pigeau, Grace; Arroyo-Mora, J. Pablo; Soffer, Raymond; Leblanc, George

    2017-10-01

    In this study the development and implementation of a geospatial database model for the management of multiscale datasets encompassing airborne imagery and associated metadata is presented. To develop the multi-source geospatial database we have used a Relational Database Management System (RDBMS) on a Structured Query Language (SQL) server, which was then integrated into ArcGIS and implemented as a geodatabase. The acquired datasets were compiled, standardized, and integrated into the RDBMS, where logical associations between different types of information (e.g. location, date, and instrument) were established. Airborne data, at different processing levels (digital numbers through geocorrected reflectance), were implemented in the geospatial database, where the datasets are linked spatially and temporally. An example dataset consisting of airborne hyperspectral imagery, collected for inter- and intra-annual vegetation characterization and detection of potential hydrocarbon seepage events over pipeline areas, is presented. Our work provides a model for the management of airborne imagery, which is a challenging aspect of data management in remote sensing, especially when large volumes of data are collected.
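
    A minimal sketch of the kind of logical associations described (location, date, instrument), using Python's sqlite3 in place of the SQL Server geodatabase; all table, column and instrument names are illustrative:

    ```python
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    CREATE TABLE instrument (instrument_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE flight (flight_id INTEGER PRIMARY KEY, flight_date TEXT,
                         instrument_id INTEGER REFERENCES instrument);
    CREATE TABLE image (image_id INTEGER PRIMARY KEY,
                        flight_id INTEGER REFERENCES flight,
                        processing_level TEXT,  -- 'DN' .. 'geocorrected reflectance'
                        lon REAL, lat REAL, path TEXT);
    """)
    # All images from one (hypothetical) instrument over a season,
    # regardless of processing level:
    rows = conn.execute("""
        SELECT i.path FROM image i
        JOIN flight f ON f.flight_id = i.flight_id
        JOIN instrument s ON s.instrument_id = f.instrument_id
        WHERE s.name = ? AND f.flight_date BETWEEN ? AND ?
    """, ("CASI", "2015-06-01", "2015-09-30")).fetchall()
    ```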

  13. Total choline and choline-containing moieties of commercially available pulses.

    PubMed

    Lewis, Erin D; Kosik, Sarah J; Zhao, Yuan-Yuan; Jacobs, René L; Curtis, Jonathan M; Field, Catherine J

    2014-06-01

    Estimating dietary choline intake can be challenging due to missing foods in the current United States Department of Agriculture (USDA) database. The objectives of the study were to quantify the choline-containing moieties and the total choline content of a variety of pulses available in North America and to use the expanded compositional database to determine the potential contribution of pulses to dietary choline intake. Commonly consumed pulses (n = 32) were analyzed by hydrophilic interaction liquid chromatography-tandem mass spectrometry (HILIC LC-MS/MS) and compared to the current USDA database. Cooking was found to reduce the relative percentage of total choline contributed by free choline and to increase the contribution of phosphatidylcholine for most pulses (P < 0.05). Using the expanded database to estimate the choline content of recipes using pulses as meat alternatives resulted in a different estimation of choline content per serving (±30%) compared to the USDA database. These results suggest that when pulses are a large part of a meal or diet, accurate food composition data should be used.

  14. Tests of methods for evaluating bibliographic databases: an analysis of the National Library of Medicine's handling of literatures in the medical behavioral sciences.

    PubMed

    Griffith, B C; White, H D; Drott, M C; Saye, J D

    1986-07-01

    This article reports on five separate studies designed for the National Library of Medicine (NLM) to develop and test methodologies for evaluating the products of large databases. The methodologies were tested on literatures of the medical behavioral sciences (MBS). One of these studies examined how well NLM covered MBS monographic literature using CATLINE and OCLC. Another examined MBS journal and serial literature coverage in MEDLINE and other MBS-related databases available through DIALOG. These two studies used 1010 items derived from the reference lists of sixty-one journals, and tested for gaps and overlaps in coverage in the various databases. A third study examined the quality of the indexing NLM provides to MBS literatures and developed a measure of indexing as a system component. The final two studies explored how well MEDLINE retrieved documents on topics submitted by MBS professionals and how online searchers viewed MEDLINE (and other systems and databases) in handling MBS topics. The five studies yielded both broad research outcomes and specific recommendations to NLM.

  15. The VirusBanker database uses a Java program to allow flexible searching through Bunyaviridae sequences.

    PubMed

    Fourment, Mathieu; Gibbs, Mark J

    2008-02-05

    Viruses of the Bunyaviridae have segmented negative-stranded RNA genomes and several of them cause significant disease. Many partial sequences have been obtained from the segments so that GenBank searches give complex results. Sequence databases usually use HTML pages to mediate remote sorting, but this approach can be limiting and may discourage a user from exploring a database. The VirusBanker database contains Bunyaviridae sequences and alignments and is presented as two spreadsheets generated by a Java program that interacts with a MySQL database on a server. Sequences are displayed in rows and may be sorted using information that is displayed in columns and includes data relating to the segment, gene, protein, species, strain, sequence length, terminal sequence and date and country of isolation. Bunyaviridae sequences and alignments may be downloaded from the second spreadsheet with titles defined by the user from the columns, or viewed when passed directly to the sequence editor, Jalview. VirusBanker allows large datasets of aligned nucleotide and protein sequences from the Bunyaviridae to be compiled and winnowed rapidly using criteria that are formulated heuristically.

  16. Update of the Diatom EST Database: a new tool for digital transcriptomics

    PubMed Central

    Maheswari, Uma; Mock, Thomas; Armbrust, E. Virginia; Bowler, Chris

    2009-01-01

    The Diatom Expressed Sequence Tag (EST) Database was constructed to provide integral access to ESTs from these ecologically and evolutionarily interesting microalgae. It has now been updated with 130 000 Phaeodactylum tricornutum ESTs from 16 cDNA libraries and 77 000 Thalassiosira pseudonana ESTs from seven libraries, derived from cells grown in different nutrient and stress regimes. The updated relational database incorporates results from statistical analyses such as log-likelihood ratios and hierarchical clustering, which help to identify differentially expressed genes under different conditions, and allow similarities in gene expression in different libraries to be investigated in a functional context. The database also incorporates links to the recently sequenced genomes of P. tricornutum and T. pseudonana, enabling easy cross-talk between the expression patterns of diatom orthologs and the genome browsers. These improvements will facilitate exploration of diatom responses to conditions of ecological relevance and will aid gene function identification of diatom-specific genes and in silico gene prediction in this largely unexplored class of eukaryotes. The updated Diatom EST Database is available at http://www.biologie.ens.fr/diatomics/EST3. PMID:19029140

  17. GenBank.

    PubMed

    Benson, Dennis A; Karsch-Mizrachi, Ilene; Lipman, David J; Ostell, James; Sayers, Eric W

    2010-01-01

    GenBank is a comprehensive database that contains publicly available nucleotide sequences for more than 300,000 organisms named at the genus level or lower, obtained primarily through submissions from individual laboratories and batch submissions from large-scale sequencing projects, including whole genome shotgun (WGS) and environmental sampling projects. Most submissions are made using the web-based BankIt or standalone Sequin programs, and accession numbers are assigned by GenBank staff upon receipt. Daily data exchange with the European Molecular Biology Laboratory Nucleotide Sequence Database in Europe and the DNA Data Bank of Japan ensures worldwide coverage. GenBank is accessible through the NCBI Entrez retrieval system, which integrates data from the major DNA and protein sequence databases along with taxonomy, genome, mapping, protein structure and domain information, and the biomedical journal literature via PubMed. BLAST provides sequence similarity searches of GenBank and other sequence databases. Complete bi-monthly releases and daily updates of the GenBank database are available by FTP. To access GenBank and its related retrieval and analysis services, begin at the NCBI homepage: www.ncbi.nlm.nih.gov.

  18. GenBank.

    PubMed

    Benson, Dennis A; Karsch-Mizrachi, Ilene; Lipman, David J; Ostell, James; Sayers, Eric W

    2009-01-01

    GenBank is a comprehensive database that contains publicly available nucleotide sequences for more than 300,000 organisms named at the genus level or lower, obtained primarily through submissions from individual laboratories and batch submissions from large-scale sequencing projects. Most submissions are made using the web-based BankIt or standalone Sequin programs, and accession numbers are assigned by GenBank(R) staff upon receipt. Daily data exchange with the European Molecular Biology Laboratory Nucleotide Sequence Database in Europe and the DNA Data Bank of Japan ensures worldwide coverage. GenBank is accessible through the National Center for Biotechnology Information (NCBI) Entrez retrieval system, which integrates data from the major DNA and protein sequence databases along with taxonomy, genome, mapping, protein structure and domain information, and the biomedical journal literature via PubMed. BLAST provides sequence similarity searches of GenBank and other sequence databases. Complete bimonthly releases and daily updates of the GenBank database are available by FTP. To access GenBank and its related retrieval and analysis services, begin at the NCBI Homepage: www.ncbi.nlm.nih.gov.

  19. Expert searching in public health

    PubMed Central

    Alpi, Kristine M.

    2005-01-01

    Objective: The article explores the characteristics of public health information needs and the resources available to address those needs that distinguish it as an area of searching requiring particular expertise. Methods: Public health searching activities from reference questions and literature search requests at a large, urban health department library were reviewed to identify the challenges in finding relevant public health information. Results: The terminology of the information request frequently differed from the vocabularies available in the databases. Searches required the use of multiple databases and/or Web resources with diverse interfaces. Issues of the scope and features of the databases relevant to the search questions were considered. Conclusion: Expert searching in public health differs from other types of expert searching in the subject breadth and technical demands of the databases to be searched, the fluidity and lack of standardization of the vocabulary, and the relative scarcity of high-quality investigations at the appropriate level of geographic specificity. Health sciences librarians require a broad exposure to databases, gray literature, and public health terminology to perform as expert searchers in public health. PMID:15685281

  20. Optimization of the efficiency of search operations in the relational database of radio electronic systems

    NASA Astrophysics Data System (ADS)

    Wajszczyk, Bronisław; Biernacki, Konrad

    2018-04-01

    The increasing interoperability of radio electronic systems used in the Armed Forces requires the processing of very large amounts of data. Requirements for the integration of information from many systems and sensors, including radar, electronic and optical reconnaissance, force a search for more efficient methods of supporting information retrieval in ever-larger database resources. This paper presents the results of research on methods of improving the efficiency of databases using various types of indexes. The data structure indexing technique is a solution used in RDBMS (relational database management) systems. However, studies analyzing the performance of indexes, describing their potential applications and, in particular, presenting the specific scale of performance growth for individual indexes remain few in this field. This paper contains an analysis of methods affecting the efficiency of a relational database management system. As a result of the research, a significant increase in the efficiency of operations on data was achieved through the strategy of indexing data structures. The research mainly consists of testing the operation of various indexes against different queries and data structures. The conclusions from the conducted experiments allow an assessment of the effectiveness of the solutions proposed and applied in the research. The results indicate a real increase in the performance of operations on data when data structures are indexed; in addition, the level of this increase is presented, broken down by index type.
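
    The core effect studied here, an index turning a full table scan into an index search, is easy to reproduce. A minimal Python/sqlite3 sketch with an invented track table:

    ```python
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE track (track_id INTEGER, ts TEXT, freq_mhz REAL)")
    conn.executemany("INSERT INTO track VALUES (?, ?, ?)",
                     [(i, f"2018-01-{i % 28 + 1:02d}", 1000.0 + i % 500)
                      for i in range(100_000)])

    def plan(sql):
        return conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()

    query = "SELECT * FROM track WHERE freq_mhz = 1234.0"
    print(plan(query))   # full table SCAN
    conn.execute("CREATE INDEX idx_freq ON track(freq_mhz)")
    print(plan(query))   # now a SEARCH using idx_freq
    ```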

  1. Teaching Case: Adapting the Access Northwind Database to Support a Database Course

    ERIC Educational Resources Information Center

    Dyer, John N.; Rogers, Camille

    2015-01-01

    A common problem encountered when teaching database courses is that few large illustrative databases exist to support teaching and learning. Most database textbooks have small "toy" databases that are chapter objective specific, and thus do not support application over the complete domain of design, implementation and management concepts…

  2. Large-Scale 1:1 Computing Initiatives: An Open Access Database

    ERIC Educational Resources Information Center

    Richardson, Jayson W.; McLeod, Scott; Flora, Kevin; Sauers, Nick J.; Kannan, Sathiamoorthy; Sincar, Mehmet

    2013-01-01

    This article details the spread and scope of large-scale 1:1 computing initiatives around the world. What follows is a review of the existing literature around 1:1 programs followed by a description of the large-scale 1:1 database. Main findings include: 1) the XO and the Classmate PC dominate large-scale 1:1 initiatives; 2) if professional…

  3. GenoMycDB: a database for comparative analysis of mycobacterial genes and genomes.

    PubMed

    Catanho, Marcos; Mascarenhas, Daniel; Degrave, Wim; Miranda, Antonio Basílio de

    2006-03-31

    Several databases and computational tools have been created with the aim of organizing, integrating and analyzing the wealth of information generated by large-scale sequencing projects of mycobacterial genomes and those of other organisms. However, with very few exceptions, these databases and tools do not allow for massive and/or dynamic comparison of these data. GenoMycDB (http://www.dbbm.fiocruz.br/GenoMycDB) is a relational database built for large-scale comparative analyses of completely sequenced mycobacterial genomes, based on their predicted protein content. Its central structure is composed of the results obtained after pair-wise sequence alignments among all the predicted proteins coded by the genomes of six mycobacteria: Mycobacterium tuberculosis (strains H37Rv and CDC1551), M. bovis AF2122/97, M. avium subsp. paratuberculosis K10, M. leprae TN, and M. smegmatis MC2 155. The database stores the computed similarity parameters of every aligned pair, providing for each protein sequence the predicted subcellular localization, the assigned cluster of orthologous groups, the features of the corresponding gene, and links to several important databases. Tables containing pairs or groups of potential homologs between selected species/strains can be produced dynamically by user-defined criteria, based on one or multiple sequence similarity parameters. In addition, searches can be restricted according to the predicted subcellular localization of the protein, the DNA strand of the corresponding gene and/or the description of the protein. Massive data search and/or retrieval are available, and different ways of exporting the result are offered. GenoMycDB provides an on-line resource for the functional classification of mycobacterial proteins as well as for the analysis of genome structure, organization, and evolution.

  4. In-Memory Graph Databases for Web-Scale Data

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Castellana, Vito G.; Morari, Alessandro; Weaver, Jesse R.

    RDF databases have emerged as one of the most relevant ways of organizing, integrating, and managing exponentially growing, often heterogeneous, and not rigidly structured data for a variety of scientific and commercial fields. In this paper we discuss the solutions integrated in GEMS (Graph database Engine for Multithreaded Systems), a software framework for implementing RDF databases on commodity, distributed-memory high-performance clusters. Unlike the majority of current RDF databases, GEMS has been designed from the ground up to primarily employ graph-based methods. This is reflected in all the layers of its stack. The GEMS framework is composed of: a SPARQL-to-C++ compiler, a library of data structures and related methods to access and modify them, and a custom runtime providing lightweight software multithreading, network message aggregation and a partitioned global address space. We provide an overview of the framework, detailing its components and how they have been closely designed and customized to address issues of graph methods applied to large-scale datasets on clusters. We discuss in detail the principles that enable automatic translation of the queries (expressed in SPARQL, the query language of choice for RDF databases) to graph methods, and identify differences with respect to other RDF databases.

  5. Automatic initialization and quality control of large-scale cardiac MRI segmentations.

    PubMed

    Albà, Xènia; Lekadir, Karim; Pereañez, Marco; Medrano-Gracia, Pau; Young, Alistair A; Frangi, Alejandro F

    2018-01-01

    Continuous advances in imaging technologies enable ever more comprehensive phenotyping of human anatomy and physiology. Concomitant reduction of imaging costs has resulted in widespread use of imaging in large clinical trials and population imaging studies. Magnetic Resonance Imaging (MRI), in particular, offers one-stop-shop multidimensional biomarkers of cardiovascular physiology and pathology. A wide range of analysis methods offer sophisticated cardiac image assessment and quantification for clinical and research studies. However, most methods have only been evaluated on relatively small databases often not accessible for open and fair benchmarking. Consequently, published performance indices are not directly comparable across studies and their translation and scalability to large clinical trials or population imaging cohorts is uncertain. Most existing techniques still rely on considerable manual intervention for the initialization and quality control of the segmentation process, becoming prohibitive when dealing with thousands of images. The contributions of this paper are three-fold. First, we propose a fully automatic method for initializing cardiac MRI segmentation, by using image features and random forests regression to predict an initial position of the heart and key anatomical landmarks in an MRI volume. In processing a full imaging database, the technique predicts the optimal corrective displacements and positions in relation to the initial rough intersections of the long and short axis images. Second, we introduce for the first time a quality control measure capable of identifying incorrect cardiac segmentations with no visual assessment. The method uses statistical, pattern and fractal descriptors in a random forest classifier to detect failures to be corrected or removed from subsequent statistical analysis. Finally, we validate these new techniques within a full pipeline for cardiac segmentation applicable to large-scale cardiac MRI databases. The results obtained based on over 1200 cases from the Cardiac Atlas Project show the promise of fully automatic initialization and quality control for population studies. Copyright © 2017 Elsevier B.V. All rights reserved.
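
    As a rough sketch of the initialization idea (random forests regression predicting landmark positions from image features), assuming scikit-learn and purely synthetic placeholder data; the paper's actual features and training protocol differ:

    ```python
    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    # Synthetic placeholders: in the paper the features come from the MRI volumes
    # and the targets are true landmark positions from a training database.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 32))   # one feature vector per training volume
    y = rng.normal(size=(200, 3))    # (x, y, z) of one cardiac landmark
    model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
    initial_guess = model.predict(X[:1])  # rough landmark position for a new volume
    print(initial_guess.shape)            # (1, 3)
    ```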

  6. Large Survey Database: A Distributed Framework for Storage and Analysis of Large Datasets

    NASA Astrophysics Data System (ADS)

    Juric, Mario

    2011-01-01

    The Large Survey Database (LSD) is a Python framework and DBMS for distributed storage, cross-matching and querying of large survey catalogs (>10^9 rows, >1 TB). The primary driver behind its development is the analysis of Pan-STARRS PS1 data. It is specifically optimized for fast queries and parallel sweeps of positionally and temporally indexed datasets. It transparently scales to more than 10^2 nodes, and can be made to function in "shared nothing" architectures. An LSD database consists of a set of vertically and horizontally partitioned tables, physically stored as compressed HDF5 files. Vertically, we partition the tables into groups of related columns ('column groups'), storing together logically related data (e.g., astrometry, photometry). Horizontally, the tables are partitioned into partially overlapping "cells" by position in space (lon, lat) and time (t). This organization allows for fast lookups based on spatial and temporal coordinates, as well as data and task distribution. The design was inspired by the success of Google BigTable (Chang et al., 2006). Our programming model is a pipelined extension of MapReduce (Dean and Ghemawat, 2004). An SQL-like query language is used to access data. For complex tasks, map-reduce "kernels" that operate on query results on a per-cell basis can be written, with the framework taking care of scheduling and execution. The combination leverages users' familiarity with SQL, while offering a fully distributed computing environment. LSD adds little overhead compared to direct Python file I/O. In tests, we swept through 1.1 billion rows of Pan-STARRS+SDSS data (220 GB) in less than 15 minutes on a dual-CPU machine. In a cluster environment, we achieved bandwidths of 17 Gbit/s (I/O limited). Based on current experience, we believe LSD should scale to be useful for analysis and storage of LSST-scale datasets. It can be downloaded from http://mwscience.net/lsd.
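
    The horizontal partitioning can be pictured as a mapping from (lon, lat, t) to a cell index, as in this toy Python sketch; the flat binning and the cell sizes are invented, and LSD's real scheme (partially overlapping cells) is more elaborate:

    ```python
    import math

    def cell_key(lon, lat, t_mjd, dx_deg=1.0, dt_days=90.0):
        """Illustrative (lon, lat, t) -> cell index for spatial/temporal binning."""
        return (int(math.floor(lon / dx_deg)),
                int(math.floor(lat / dx_deg)),
                int(math.floor(t_mjd / dt_days)))

    # Rows sharing a key land in the same cell and can be swept in parallel.
    print(cell_key(123.4, -12.7, 55197.0))  # (123, -13, 613)
    ```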

  7. Automated biosurveillance data from England and Wales, 1991-2011.

    PubMed

    Enki, Doyo G; Noufaily, Angela; Garthwaite, Paul H; Andrews, Nick J; Charlett, André; Lane, Chris; Farrington, C Paddy

    2013-01-01

    Outbreak detection systems for use with very large multiple surveillance databases must be suited both to the data available and to the requirements of full automation. To inform the development of more effective outbreak detection algorithms, we analyzed 20 years of data (1991-2011) from a large laboratory surveillance database used for outbreak detection in England and Wales. The data relate to 3,303 distinct types of infectious pathogens, with a frequency range spanning 6 orders of magnitude. Several hundred organism types were reported each week. We describe the diversity of seasonal patterns, trends, artifacts, and extra-Poisson variability to which an effective multiple laboratory-based outbreak detection system must adjust. We provide empirical information to guide the selection of simple statistical models for automated surveillance of multiple organisms, in the light of the key requirements of such outbreak detection systems, namely, robustness, flexibility, and sensitivity.
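
    A deliberately crude Python sketch of the kind of exceedance rule such systems start from; production algorithms additionally model trend, seasonality and the extra-Poisson variability noted above:

    ```python
    from statistics import mean, stdev

    def exceeds_threshold(history, current, z=2.0):
        """Crude aberration detector: flag counts above mean + z*sd of baseline
        weeks. Real systems also adjust for trend, season and overdispersion."""
        return current > mean(history) + z * stdev(history)

    baseline = [3, 5, 4, 6, 2, 5, 4, 3]     # weekly counts for one organism
    print(exceeds_threshold(baseline, 14))  # True: flag for investigation
    ```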

  8. Indicators for the automated analysis of drug prescribing quality.

    PubMed

    Coste, J; Séné, B; Milstein, C; Bouée, S; Venot, A

    1998-01-01

    Irrational and inconsistent drug prescription has considerable impact on morbidity, mortality, health service utilization, and community burden. However, few studies have addressed the methodology of processing the information contained in drug orders to study the quality of drug prescriptions and prescriber behavior. We present a comprehensive set of quantitative indicators for the quality of drug prescriptions which can be derived from a drug order. These indicators were constructed using explicit a priori criteria which were previously validated on the basis of scientific data. Automatic computation is straightforward, using a relational database system, such that large sets of prescriptions can be processed with minimal human effort. We illustrate the feasibility and value of this approach using a large set of 23,000 prescriptions for several diseases, selected from a nationally representative prescriptions database. Our study may have direct and wide applications in the epidemiology of medical practice and in quality control procedures.
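
    Automatic computation of an indicator over a relational store can be sketched in a few lines of Python/sqlite3; the schema and the warfarin-aspirin co-prescription rule are invented examples, not indicators from the paper:

    ```python
    import sqlite3

    # Hypothetical indicator: proportion of orders co-prescribing two
    # interacting drugs. Schema and drug pair are illustrative only.
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    CREATE TABLE prescription (rx_id INTEGER, drug TEXT);
    INSERT INTO prescription VALUES
      (1,'warfarin'),(1,'aspirin'),(2,'amoxicillin'),(3,'warfarin');
    """)
    rate = conn.execute("""
      SELECT AVG(flag) FROM (
        SELECT rx_id, MAX(drug='warfarin') * MAX(drug='aspirin') AS flag
        FROM prescription GROUP BY rx_id)
    """).fetchone()[0]
    print(rate)  # 1 of 3 orders -> ~0.33
    ```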

  9. Automated Biosurveillance Data from England and Wales, 1991–2011

    PubMed Central

    Enki, Doyo G.; Noufaily, Angela; Garthwaite, Paul H.; Andrews, Nick J.; Charlett, André; Lane, Chris

    2013-01-01

    Outbreak detection systems for use with very large multiple surveillance databases must be suited both to the data available and to the requirements of full automation. To inform the development of more effective outbreak detection algorithms, we analyzed 20 years of data (1991–2011) from a large laboratory surveillance database used for outbreak detection in England and Wales. The data relate to 3,303 distinct types of infectious pathogens, with a frequency range spanning 6 orders of magnitude. Several hundred organism types were reported each week. We describe the diversity of seasonal patterns, trends, artifacts, and extra-Poisson variability to which an effective multiple laboratory-based outbreak detection system must adjust. We provide empirical information to guide the selection of simple statistical models for automated surveillance of multiple organisms, in the light of the key requirements of such outbreak detection systems, namely, robustness, flexibility, and sensitivity. PMID:23260848

  10. TabSQL: a MySQL tool to facilitate mapping user data to public databases.

    PubMed

    Xia, Xiao-Qin; McClelland, Michael; Wang, Yipeng

    2010-06-23

    With advances in high-throughput genomics and proteomics, it is challenging for biologists to deal with large data files and to map their data to annotations in public databases. We developed TabSQL, a MySQL-based application tool, for viewing, filtering and querying data files with large numbers of rows. TabSQL provides functions for downloading and installing table files from public databases including the Gene Ontology database (GO), the Ensembl databases, and genome databases from the UCSC genome bioinformatics site. Any other database that provides tab-delimited flat files can also be imported. The downloaded gene annotation tables can be queried together with users' data in TabSQL using either a graphic interface or command line. TabSQL allows queries across the user's data and public databases without programming. It is a convenient tool for biologists to annotate and enrich their data.
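
    The workflow can be sketched with Python's sqlite3 standing in for MySQL: import a tab-delimited annotation table, then join it against the user's own data (file contents and column names are invented):

    ```python
    import csv, io, sqlite3

    annotation_tsv = "gene\tdescription\nBRCA1\tDNA repair\nTP53\ttumor suppressor\n"

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE annotation (gene TEXT, description TEXT)")
    reader = csv.reader(io.StringIO(annotation_tsv), delimiter="\t")
    next(reader)  # skip the header row
    conn.executemany("INSERT INTO annotation VALUES (?, ?)", reader)

    conn.execute("CREATE TABLE user_hits (gene TEXT, fold_change REAL)")
    conn.execute("INSERT INTO user_hits VALUES ('TP53', 2.4)")
    print(conn.execute("""SELECT u.gene, u.fold_change, a.description
                          FROM user_hits u JOIN annotation a USING (gene)""")
          .fetchall())  # [('TP53', 2.4, 'tumor suppressor')]
    ```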

  11. TabSQL: a MySQL tool to facilitate mapping user data to public databases

    PubMed Central

    2010-01-01

    Background With advances in high-throughput genomics and proteomics, it is challenging for biologists to deal with large data files and to map their data to annotations in public databases. Results We developed TabSQL, a MySQL-based application tool, for viewing, filtering and querying data files with large numbers of rows. TabSQL provides functions for downloading and installing table files from public databases including the Gene Ontology database (GO), the Ensembl databases, and genome databases from the UCSC genome bioinformatics site. Any other database that provides tab-delimited flat files can also be imported. The downloaded gene annotation tables can be queried together with users' data in TabSQL using either a graphic interface or command line. Conclusions TabSQL allows queries across the user's data and public databases without programming. It is a convenient tool for biologists to annotate and enrich their data. PMID:20573251

  12. Integrated database for rapid mass movements in Norway

    NASA Astrophysics Data System (ADS)

    Jaedicke, C.; Lied, K.; Kronholm, K.

    2009-03-01

    Rapid gravitational slope mass movements include all kinds of short-term relocation of geological material, snow or ice. Traditionally, information about such events is collected separately in different databases covering selected geographical regions and types of movement. In Norway the terrain is susceptible to all types of rapid gravitational slope mass movements, ranging from single rocks hitting roads and houses to large snow avalanches and rock slides where entire mountainsides collapse into fjords, creating flood waves and endangering large areas. In addition, quick clay slides occur in desalinated marine sediments in South Eastern and Mid Norway. For the authorities and inhabitants of endangered areas, the type of threat is of minor importance and mitigation measures have to consider several types of rapid mass movements simultaneously. An integrated national database for all types of rapid mass movements, built around individual events, has been established. Only three data entries are mandatory: time, location and type of movement. The remaining optional parameters enable recording of detailed information about the terrain, materials involved and damages caused. Pictures, movies and other documentation can be uploaded into the database. A web-based graphical user interface has been developed allowing new events to be entered, as well as editing and querying of all events. An integration of the database into a GIS system is currently under development. Datasets from various national sources like the road authorities and the Geological Survey of Norway were imported into the database. Today, the database contains 33 000 rapid mass movement events from the last five hundred years covering the entire country. A first analysis of the data shows that the most frequent types of recorded rapid mass movement are rock slides and snow avalanches, followed by debris slides. Most events are recorded in the steep fjord terrain of the Norwegian west coast, but major events are recorded all over the country. Snow avalanches account for most fatalities, while large rock slides causing flood waves and huge quick clay slides are the most damaging individual events in terms of damage to infrastructure and property and for causing multiple fatalities. The quality of the data is strongly influenced by the personal engagement of local observers and varying observation routines. This database is a unique source for statistical analyses, including risk analysis and the relation between rapid mass movements and climate. The database of rapid mass movement events will also facilitate validation of national hazard and risk maps.
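
    A minimal Python/sqlite3 sketch of the event-centred design, with only the three mandatory fields constrained and everything else optional; column names are illustrative, not from the paper:

    ```python
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("""
    CREATE TABLE event (
      event_id  INTEGER PRIMARY KEY,
      time      TEXT NOT NULL,   -- mandatory
      lon       REAL NOT NULL,   -- mandatory (location)
      lat       REAL NOT NULL,
      type      TEXT NOT NULL,   -- mandatory, e.g. 'snow avalanche', 'rock slide'
      volume_m3 REAL,            -- optional detail
      damage    TEXT             -- optional free-text description
    )""")
    conn.execute("INSERT INTO event (time, lon, lat, type) VALUES (?, ?, ?, ?)",
                 ("1905-01-15", 6.8, 61.9, "rock slide"))
    conn.commit()
    ```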

  13. The Determinants of the Technical Efficiency of Acute Inpatient Care in Canada.

    PubMed

    Wang, Li; Grignon, Michel; Perry, Sheril; Chen, Xi-Kuan; Ytsma, Alison; Allin, Sara; Gapanenko, Katerina

    2018-04-17

    To evaluate the technical efficiency of acute inpatient care at the pan-Canadian level and to explore the factors associated with inefficiency, that is, why hospitals are not on their production frontier. Canadian Management Information System (MIS) database (CMDB) and Discharge Abstract Database (DAD) for the fiscal year 2012-2013. We use a nonparametric approach (data envelopment analysis) applied to three peer groups (teaching, large, and medium hospitals, focusing on their acute inpatient care only). The double bootstrap procedure (Simar and Wilson 2007) is adopted in the regression. Information on inpatient episodes of care (number and quality of outcomes) was extracted from the DAD. The cost of the inpatient care was extracted from the CMDB. On average, acute hospitals in Canada operate at about 75 percent efficiency and could thus potentially increase their level of outcomes (quantity and quality) by addressing inefficiencies. In some cases, such as for teaching hospitals, the factors significantly correlated with efficiency scores were not related to management but to the social composition of the caseload. In contrast, for large and medium nonteaching hospitals, efficiency related more to the ability to discharge patients to postacute care facilities. The efficiency of medium hospitals is also positively related to treating more clinically noncomplex patients. The main drivers of efficiency of acute inpatient care vary by hospital peer group. Thus, the results provide different policy and managerial implications for teaching, large, and medium hospitals to achieve efficiency gains. © Health Research and Educational Trust.

  14. Association of Lifestyle-Related Comorbidities With Periodontitis

    PubMed Central

    Lee, Jae-Hong; Lee, Jung-Seok; Park, Jin-Young; Choi, Jung-Kyu; Kim, Dong-Wook; Kim, Young-Taek; Choi, Seong-Ho

    2015-01-01

    Abstract The aim of this study was to determine the association of periodontitis with lifestyle-related comorbidities (LCs) using data in the Korean National Health Insurance Cohort Database from 2002 to 2013. This was a retrospective study involving a large national cohort with patient samples (representing 2% of the total Korean population) stratified on the basis of sociodemographic information. Using this precisely extracted database, the correlations between LCs (cerebral infarction, angina pectoris, myocardial infarction, hypertension, diabetes mellitus, rheumatoid arthritis, erectile dysfunction, osteoporosis, and obesity) and periodontitis were investigated while adjusting for confounding bias. Univariate and multiple logistic regression analyses were used to evaluate differences in variable factors. Among a total of 1,025,340 samples, 321,103 (31.3%) cases were diagnosed with periodontitis. Statistically significant associations were found between all LCs except myocardial infarction and periodontitis (P < 0.005). Periodontitis is significantly and positively correlated with LCs (except for myocardial infarction) after adjusting for confounding bias. In particular, lifestyle-related diseases, erectile dysfunction, and osteoporosis seem to be intimately related to periodontitis. PMID:26376407

  15. Pharmacoepidemiology resources in Ireland-an introduction to pharmacy claims data.

    PubMed

    Sinnott, Sarah-Jo; Bennett, Kathleen; Cahir, Caitriona

    2017-11-01

    Administrative health data, such as pharmacy claims data, present a valuable resource for conducting pharmacoepidemiological and health services research. Often, data are available for whole populations, allowing population-level analyses. Moreover, their routine collection ensures that the data reflect health care utilisation in the real-world setting, in contrast to data collected in clinical trials. The Irish Health Service Executive-Primary Care Reimbursement Service (HSE-PCRS) community pharmacy claims database is described. The availability of demographic variables and drug-related information is discussed. The strengths and limitations associated with using this database for conducting research are presented, in particular its internal and external validity. Examples of recently conducted research using the HSE-PCRS pharmacy claims database illustrate the breadth of its use. The HSE-PCRS national pharmacy claims database is a large, high-quality, valid and accurate data source for measuring drug exposure in specific populations in Ireland. The main limitations are the lack of generalisability for those aged <70 years and the lack of information on indication or outcome.

  16. Guidelines for the Effective Use of Entity-Attribute-Value Modeling for Biomedical Databases

    PubMed Central

    Dinu, Valentin; Nadkarni, Prakash

    2007-01-01

    Purpose To introduce the goals of EAV database modeling, to describe the situations where Entity-Attribute-Value (EAV) modeling is a useful alternative to conventional relational methods of database modeling, and to describe the fine points of implementation in production systems. Methods We analyze the following circumstances: 1) data are sparse and have a large number of applicable attributes, but only a small fraction will apply to a given entity; 2) numerous classes of data need to be represented, each class has a limited number of attributes, but the number of instances of each class is very small. We also consider situations calling for a mixed approach where both conventional and EAV design are used for appropriate data classes. Results and Conclusions In robust production systems, EAV-modeled databases trade a modest data sub-schema for a complex metadata sub-schema. The need to design the metadata effectively makes EAV design potentially more challenging than conventional design. PMID:17098467
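
    A minimal Python/sqlite3 sketch of the EAV pattern itself: sparse attributes stored as rows in one three-column table, pivoted back into columns at query time (all names are illustrative):

    ```python
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    CREATE TABLE eav (entity_id INTEGER, attribute TEXT, value TEXT);
    INSERT INTO eav VALUES
      (1,'heart_rate','72'),(1,'allergy','penicillin'),(2,'heart_rate','88');
    """)
    rows = conn.execute("""
      SELECT entity_id,
             MAX(CASE WHEN attribute='heart_rate' THEN value END) AS heart_rate,
             MAX(CASE WHEN attribute='allergy'    THEN value END) AS allergy
      FROM eav GROUP BY entity_id
    """).fetchall()
    print(rows)  # [(1, '72', 'penicillin'), (2, '88', None)]
    ```

    The trade-off the authors describe is visible even here: the data table stays trivially simple, but knowing which attributes exist and how to type and pivot them must live in metadata.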

  17. Providing R-Tree Support for Mongodb

    NASA Astrophysics Data System (ADS)

    Xiang, Longgang; Shao, Xiaotian; Wang, Dehao

    2016-06-01

    Supporting large amounts of spatial data is a significant characteristic of modern databases. However, unlike mature relational databases such as Oracle and PostgreSQL, most of the current burgeoning NoSQL databases are not well designed for storing geospatial data, which is becoming increasingly important in various fields. In this paper, we propose a novel method to provide an R-tree index, as well as corresponding spatial range query and nearest neighbour query functions, for MongoDB, one of the most prevalent NoSQL databases. First, after an in-depth analysis of MongoDB's features, we devise an efficient tabular document structure which flattens the R-tree index into MongoDB collections. Further, the relevant mechanisms of R-tree operations are presented, and we then discuss in detail how to integrate the R-tree into MongoDB. Finally, we present experimental results, which show that our proposed method outperforms the built-in spatial index of MongoDB. Our research will greatly facilitate big data management in a variety of geospatial information applications with MongoDB.
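
    One way to picture flattening an R-tree into documents, shown as plain Python dicts; the field names are invented, and the paper's actual tabular document structure and query mechanics differ:

    ```python
    # Each R-tree node becomes one document keyed by its own id; internal nodes
    # point at child documents, leaves hold the indexed spatial entries.
    node = {
        "_id": "node-17",
        "is_leaf": False,
        "mbr": {"min": [8.1, 45.0], "max": [9.3, 46.2]},  # minimum bounding rect
        "children": ["node-41", "node-42"],               # child node _ids
    }
    leaf = {
        "_id": "node-41",
        "is_leaf": True,
        "mbr": {"min": [8.1, 45.0], "max": [8.6, 45.5]},
        "entries": [{"feature_id": "poi-1001", "bbox": [8.2, 45.1, 8.3, 45.2]}],
    }

    def intersects(a, b):
        """MBR overlap test used when descending the tree for a range query."""
        return all(a["min"][i] <= b["max"][i] and b["min"][i] <= a["max"][i]
                   for i in range(2))

    query_mbr = {"min": [8.0, 45.0], "max": [8.4, 45.3]}
    print(intersects(node["mbr"], query_mbr))  # True -> descend into this node
    ```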

  18. GenBank.

    PubMed

    Benson, Dennis A; Karsch-Mizrachi, Ilene; Lipman, David J; Ostell, James; Wheeler, David L

    2007-01-01

    GenBank (R) is a comprehensive database that contains publicly available nucleotide sequences for more than 240 000 named organisms, obtained primarily through submissions from individual laboratories and batch submissions from large-scale sequencing projects. Most submissions are made using the web-based BankIt or standalone Sequin programs and accession numbers are assigned by GenBank staff upon receipt. Daily data exchange with the EMBL Data Library in Europe and the DNA Data Bank of Japan ensures worldwide coverage. GenBank is accessible through NCBI's retrieval system, Entrez, which integrates data from the major DNA and protein sequence databases along with taxonomy, genome, mapping, protein structure and domain information, and the biomedical journal literature via PubMed. BLAST provides sequence similarity searches of GenBank and other sequence databases. Complete bimonthly releases and daily updates of the GenBank database are available by FTP. To access GenBank and its related retrieval and analysis services, begin at the NCBI Homepage (www.ncbi.nlm.nih.gov).

  19. GenBank.

    PubMed

    Benson, Dennis A; Karsch-Mizrachi, Ilene; Lipman, David J; Ostell, James; Wheeler, David L

    2005-01-01

    GenBank is a comprehensive database that contains publicly available DNA sequences for more than 165,000 named organisms, obtained primarily through submissions from individual laboratories and batch submissions from large-scale sequencing projects. Most submissions are made using the web-based BankIt or standalone Sequin programs and accession numbers are assigned by GenBank staff upon receipt. Daily data exchange with the EMBL Data Library in the UK and the DNA Data Bank of Japan helps to ensure worldwide coverage. GenBank is accessible through NCBI's retrieval system, Entrez, which integrates data from the major DNA and protein sequence databases along with taxonomy, genome, mapping, protein structure and domain information, and the biomedical journal literature via PubMed. BLAST provides sequence similarity searches of GenBank and other sequence databases. Complete bimonthly releases and daily updates of the GenBank database are available by FTP. To access GenBank and its related retrieval and analysis services, go to the NCBI Homepage at http://www.ncbi.nlm.nih.gov.

  20. GenBank.

    PubMed

    Benson, Dennis A; Karsch-Mizrachi, Ilene; Lipman, David J; Ostell, James; Wheeler, David L

    2006-01-01

    GenBank (R) is a comprehensive database that contains publicly available DNA sequences for more than 205 000 named organisms, obtained primarily through submissions from individual laboratories and batch submissions from large-scale sequencing projects. Most submissions are made using the Web-based BankIt or standalone Sequin programs and accession numbers are assigned by GenBank staff upon receipt. Daily data exchange with the EMBL Data Library in Europe and the DNA Data Bank of Japan ensures worldwide coverage. GenBank is accessible through NCBI's retrieval system, Entrez, which integrates data from the major DNA and protein sequence databases along with taxonomy, genome, mapping, protein structure and domain information, and the biomedical journal literature via PubMed. BLAST provides sequence similarity searches of GenBank and other sequence databases. Complete bimonthly releases and daily updates of the GenBank database are available by FTP. To access GenBank and its related retrieval and analysis services, go to the NCBI Homepage at www.ncbi.nlm.nih.gov.

  1. Genomics Portals: integrative web-platform for mining genomics data.

    PubMed

    Shinde, Kaustubh; Phatak, Mukta; Freudenberg, Johannes M; Chen, Jing; Li, Qian; Joshi, Vineet K; Hu, Zhen; Ghosh, Krishnendu; Meller, Jaroslaw; Medvedovic, Mario

    2010-01-13

    A large amount of experimental data generated by modern high-throughput technologies is available through various public repositories. Our knowledge about molecular interaction networks, functional biological pathways and transcriptional regulatory modules is rapidly expanding, and is being organized in lists of functionally related genes. Jointly, these two sources of information hold a tremendous potential for gaining new insights into the functioning of living systems. The Genomics Portals platform integrates access to an extensive knowledge base and a large database of human, mouse, and rat genomics data with basic analytical and visualization tools. It provides the context for analyzing and interpreting new experimental data and a tool for effective mining of the large number of publicly available genomics datasets stored in the back-end databases. The uniqueness of this platform lies in the volume and the diversity of genomics data that can be accessed and analyzed (gene expression, ChIP-chip, ChIP-seq, epigenomics, computationally predicted binding sites, etc.), and in the integration with an extensive knowledge base that can be used in such analyses. The integrated access to primary genomics data, functional knowledge and analytical tools makes the Genomics Portals platform a unique tool for interpreting results of new genomics experiments and for mining the vast amount of data stored in the Genomics Portals back-end databases. Genomics Portals can be accessed and used freely at http://GenomicsPortals.org.

  2. Genomics Portals: integrative web-platform for mining genomics data

    PubMed Central

    2010-01-01

    Background: A large amount of experimental data generated by modern high-throughput technologies is available through various public repositories. Our knowledge about molecular interaction networks, functional biological pathways and transcriptional regulatory modules is rapidly expanding, and is being organized in lists of functionally related genes. Jointly, these two sources of information hold a tremendous potential for gaining new insights into the functioning of living systems. Results: The Genomics Portals platform integrates access to an extensive knowledge base and a large database of human, mouse, and rat genomics data with basic analytical and visualization tools. It provides the context for analyzing and interpreting new experimental data and a tool for effective mining of the large number of publicly available genomics datasets stored in the back-end databases. The uniqueness of this platform lies in the volume and the diversity of genomics data that can be accessed and analyzed (gene expression, ChIP-chip, ChIP-seq, epigenomics, computationally predicted binding sites, etc.), and in the integration with an extensive knowledge base that can be used in such analyses. Conclusion: The integrated access to primary genomics data, functional knowledge and analytical tools makes the Genomics Portals platform a unique tool for interpreting results of new genomics experiments and for mining the vast amount of data stored in the Genomics Portals back-end databases. Genomics Portals can be accessed and used freely at http://GenomicsPortals.org. PMID:20070909

  3. Correcting Inconsistencies and Errors in Bacterial Genome Metadata Using an Automated Curation Tool in Excel (AutoCurE).

    PubMed

    Schmedes, Sarah E; King, Jonathan L; Budowle, Bruce

    2015-01-01

    Whole-genome data are invaluable for large-scale comparative genomic studies. Current sequencing technologies have made it feasible to sequence entire bacterial genomes with relative ease and speed, and at a substantially reduced cost per nucleotide, hence cost per genome. More than 3,000 bacterial genomes have been sequenced and are available at the finished status. Publicly available genomes can be readily downloaded; however, there are challenges in verifying the specific supporting data contained within the download and in identifying errors and inconsistencies that may be present within the organizational data content and metadata. AutoCurE, an automated tool for bacterial genome database curation in Excel, was developed to facilitate local database curation of the supporting data that accompany genomes downloaded from the National Center for Biotechnology Information. AutoCurE provides an automated approach to curating local genomic databases by flagging inconsistencies or errors, comparing the downloaded supporting data to the genome reports to verify genome names, RefSeq accession numbers, the presence of archaea, BioProject/UIDs, and sequence file descriptions. Flags are generated for nine metadata fields if there are inconsistencies between the downloaded genomes and the genome reports, or if erroneous or missing data are evident. AutoCurE is an easy-to-use tool for local database curation of large-scale genome data prior to downstream analyses.
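
    The flagging logic described above reduces to a field-by-field comparison between the downloaded supporting data and the genome report. A hypothetical sketch in Python follows (AutoCurE itself runs in Excel; the field names and example records here are invented for illustration):

    # Illustration of the flagging idea behind AutoCurE; field names and
    # records are hypothetical, not the tool's actual metadata layout.
    FIELDS_TO_CHECK = ["genome_name", "refseq_accession", "bioproject_uid",
                       "domain", "sequence_file_description"]

    def flag_inconsistencies(downloaded, genome_report):
        """Compare a downloaded metadata record against the genome report and
        return a flag for every mismatched or missing field."""
        flags = []
        for field in FIELDS_TO_CHECK:
            dl, rep = downloaded.get(field), genome_report.get(field)
            if dl is None or rep is None:
                flags.append((field, "missing data"))
            elif dl != rep:
                flags.append((field, f"mismatch: {dl!r} != {rep!r}"))
        return flags

    downloaded = {"genome_name": "Escherichia coli K-12",
                  "refseq_accession": "NC_000913", "domain": "Bacteria"}
    report = {"genome_name": "Escherichia coli K-12",
              "refseq_accession": "NC_000913.3", "domain": "Bacteria",
              "bioproject_uid": "PRJNA57779"}
    for field, problem in flag_inconsistencies(downloaded, report):
        print(f"FLAG {field}: {problem}")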

  4. [Privacy and public benefit in using large scale health databases].

    PubMed

    Yamamoto, Ryuichi

    2014-01-01

    In Japan, large-scale health databases have been constructed within a few years, such as the National Database of health insurance claims and health checkups (NDB) and the Japanese Sentinel project. However, several legal issues must be resolved to strike an adequate balance between privacy and public benefit when using such databases. The NDB operates under the act on health care for elderly persons, but this act says nothing about using the database for the general public benefit. Researchers who use it are therefore forced to devote considerable attention to anonymization and information security, which can hamper the research work itself. The Japanese Sentinel project is a national project for detecting adverse drug reactions using large, distributed clinical databases of large hospitals. Although patients give broad consent for such future use for the public good, whether insufficiently anonymized data may be used is still under discussion. Generally speaking, research conducted for the public benefit will not infringe patients' privacy, but vague and complex legal requirements for personal data protection may obstruct the research. Medical science does not progress without the use of clinical information; therefore, legislation that is simple and clear for both researchers and patients is strongly required. In Japan, a specific act for balancing privacy and public benefit is now under discussion. The author recommends that researchers, including those in the field of pharmacology, pay attention to, participate in the discussion of, and make suggestions on such acts and regulations.

  5. Examining database persistence of ISO/EN 13606 standardized electronic health record extracts: relational vs. NoSQL approaches.

    PubMed

    Sánchez-de-Madariaga, Ricardo; Muñoz, Adolfo; Lozano-Rubí, Raimundo; Serrano-Balazote, Pablo; Castro, Antonio L; Moreno, Oscar; Pascual, Mario

    2017-08-18

    The objective of this research is to compare relational and non-relational (NoSQL) database system approaches for storing, recovering, querying and persisting standardized medical information in the form of ISO/EN 13606 normalized Electronic Health Record XML extracts, both in isolation and concurrently. NoSQL database systems have recently attracted much attention, but few studies in the literature address their direct comparison with relational databases when applied to build the persistence layer of a standardized medical information system. One relational and two NoSQL databases (one document-based and one native XML database) of three different sizes were created in order to evaluate and compare the response times (algorithmic complexity) of six queries of increasing complexity that were performed on them. Comparable results available in the literature were also considered. Relational and non-relational NoSQL database systems both show almost linear query-execution complexity, but with very different linear slopes, the relational slope being much steeper than those of the two NoSQL systems. Document-based NoSQL databases perform better in concurrency than in isolation, and also better than relational databases in concurrency. Non-relational NoSQL databases appear more appropriate than standard relational SQL databases when database size is extremely high (secondary use, research applications). Document-based NoSQL databases generally perform better than native XML NoSQL databases. Visualization and editing of EHR extracts are likewise document-oriented tasks better suited to NoSQL database systems. However, the appropriate database solution depends heavily on each particular situation and specific problem.
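
    The benchmarking methodology lends itself to a compact sketch: build otherwise identical databases of several sizes, time the same query on each, and compare how response time grows with size. A minimal illustration using SQLite as a stand-in engine follows (the table, sizes, and query are invented, not those of the study):

    # Minimal sketch of the measurement design: databases of three sizes,
    # the same query timed on each; comparing elapsed times across sizes
    # gives the growth (slope) the study analyses. SQLite is a stand-in.
    import sqlite3, time, random

    def build_db(n_rows):
        con = sqlite3.connect(":memory:")
        con.execute("CREATE TABLE extract (id INTEGER PRIMARY KEY, patient TEXT, value REAL)")
        rows = ((i, f"patient-{i % 1000}", random.random()) for i in range(n_rows))
        con.executemany("INSERT INTO extract VALUES (?, ?, ?)", rows)
        con.commit()
        return con

    def time_query(con, sql):
        t0 = time.perf_counter()
        con.execute(sql).fetchall()
        return time.perf_counter() - t0

    for size in (10_000, 100_000, 1_000_000):
        con = build_db(size)
        elapsed = time_query(con, "SELECT patient, AVG(value) FROM extract GROUP BY patient")
        print(f"{size:>9} rows: {elapsed * 1000:7.1f} ms")
        con.close()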

  6. The research infrastructure of Chinese foundations, a database for Chinese civil society studies

    PubMed Central

    Ma, Ji; Wang, Qun; Dong, Chao; Li, Huafang

    2017-01-01

    This paper provides technical details and user guidance on the Research Infrastructure of Chinese Foundations (RICF), a database of Chinese foundations, civil society, and social development in general. The structure of the RICF is deliberately designed and normalized according to the Three Normal Forms. The database schema consists of three major themes: foundations’ basic organizational profile (i.e., basic profile, board member, supervisor, staff, and related party tables), program information (i.e., program information, major program, program relationship, and major recipient tables), and financial information (i.e., financial position, financial activities, cash flow, activity overview, and large donation tables). The RICF’s data quality can be measured by four criteria: data source reputation and credibility, completeness, accuracy, and timeliness. Data records are properly versioned, allowing verification and replication for research purposes. PMID:28742065
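
    As an illustration of the normalization described above, here is a sketch of a small schema in the same spirit (table and column names are hypothetical, not the actual RICF schema): each theme gets its own table keyed on the foundation, so no non-key fact is stored twice.

    # Hypothetical sketch of a Third-Normal-Form layout for foundation data.
    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
    CREATE TABLE foundation (
        foundation_id INTEGER PRIMARY KEY,
        name          TEXT NOT NULL,
        province      TEXT
    );
    CREATE TABLE board_member (
        member_id     INTEGER PRIMARY KEY,
        foundation_id INTEGER NOT NULL REFERENCES foundation(foundation_id),
        name          TEXT NOT NULL,
        role          TEXT
    );
    CREATE TABLE financial_position (
        foundation_id INTEGER NOT NULL REFERENCES foundation(foundation_id),
        fiscal_year   INTEGER NOT NULL,
        total_assets  REAL,
        PRIMARY KEY (foundation_id, fiscal_year)  -- one row per foundation-year
    );
    """)
    # A board member's details live in one place; financial rows never repeat them.
    con.execute("INSERT INTO foundation VALUES (1, 'Example Foundation', 'Beijing')")
    con.execute("INSERT INTO board_member VALUES (1, 1, 'Zhang Wei', 'chair')")
    con.execute("INSERT INTO financial_position VALUES (1, 2015, 1.2e7)")
    print(con.execute("""SELECT f.name, fp.fiscal_year, fp.total_assets
                         FROM foundation f
                         JOIN financial_position fp USING (foundation_id)""").fetchall())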

  7. Building a multi-scaled geospatial temporal ecology database from disparate data sources: fostering open science and data reuse.

    PubMed

    Soranno, Patricia A; Bissell, Edward G; Cheruvelil, Kendra S; Christel, Samuel T; Collins, Sarah M; Fergus, C Emi; Filstrup, Christopher T; Lapierre, Jean-Francois; Lottig, Noah R; Oliver, Samantha K; Scott, Caren E; Smith, Nicole J; Stopyak, Scott; Yuan, Shuai; Bremigan, Mary Tate; Downing, John A; Gries, Corinna; Henry, Emily N; Skaff, Nick K; Stanley, Emily H; Stow, Craig A; Tan, Pang-Ning; Wagner, Tyler; Webster, Katherine E

    2015-01-01

    Although there are considerable site-based data for individual or groups of ecosystems, these datasets are widely scattered, have different data formats and conventions, and often have limited accessibility. At the broader scale, national datasets exist for a large number of geospatial features of land, water, and air that are needed to fully understand variation among these ecosystems. However, such datasets originate from different sources and have different spatial and temporal resolutions. By taking an open-science perspective and by combining site-based ecosystem datasets and national geospatial datasets, science gains the ability to ask important research questions related to grand environmental challenges that operate at broad scales. Documentation of such complicated database integration efforts, through peer-reviewed papers, is recommended to foster reproducibility and future use of the integrated database. Here, we describe the major steps, challenges, and considerations in building an integrated database of lake ecosystems, called LAGOS (LAke multi-scaled GeOSpatial and temporal database), that was developed at the sub-continental study extent of 17 US states (1,800,000 km²). LAGOS includes two modules: LAGOSGEO, with geospatial data on every lake with surface area larger than 4 ha in the study extent (~50,000 lakes), including climate, atmospheric deposition, land use/cover, hydrology, geology, and topography measured across a range of spatial and temporal extents; and LAGOSLIMNO, with lake water quality data compiled from ~100 individual datasets for a subset of lakes in the study extent (~10,000 lakes). Procedures for the integration of datasets included: creating a flexible database design; authoring and integrating metadata; documenting data provenance; quantifying spatial measures of geographic data; quality-controlling integrated and derived data; and extensively documenting the database. Our procedures make a large, complex, and integrated database reproducible and extensible, allowing users to ask new research questions with the existing database or through the addition of new data. The largest challenge of this task was the heterogeneity of the data, formats, and metadata. Many steps of data integration need manual input from experts in diverse fields, requiring close collaboration.

  8. Building a multi-scaled geospatial temporal ecology database from disparate data sources: Fostering open science through data reuse

    USGS Publications Warehouse

    Soranno, Patricia A.; Bissell, E.G.; Cheruvelil, Kendra S.; Christel, Samuel T.; Collins, Sarah M.; Fergus, C. Emi; Filstrup, Christopher T.; Lapierre, Jean-Francois; Lottig, Noah R.; Oliver, Samantha K.; Scott, Caren E.; Smith, Nicole J.; Stopyak, Scott; Yuan, Shuai; Bremigan, Mary Tate; Downing, John A.; Gries, Corinna; Henry, Emily N.; Skaff, Nick K.; Stanley, Emily H.; Stow, Craig A.; Tan, Pang-Ning; Wagner, Tyler; Webster, Katherine E.

    2015-01-01

    Although there are considerable site-based data for individual or groups of ecosystems, these datasets are widely scattered, have different data formats and conventions, and often have limited accessibility. At the broader scale, national datasets exist for a large number of geospatial features of land, water, and air that are needed to fully understand variation among these ecosystems. However, such datasets originate from different sources and have different spatial and temporal resolutions. By taking an open-science perspective and by combining site-based ecosystem datasets and national geospatial datasets, science gains the ability to ask important research questions related to grand environmental challenges that operate at broad scales. Documentation of such complicated database integration efforts, through peer-reviewed papers, is recommended to foster reproducibility and future use of the integrated database. Here, we describe the major steps, challenges, and considerations in building an integrated database of lake ecosystems, called LAGOS (LAke multi-scaled GeOSpatial and temporal database), that was developed at the sub-continental study extent of 17 US states (1,800,000 km2). LAGOS includes two modules: LAGOSGEO, with geospatial data on every lake with surface area larger than 4 ha in the study extent (~50,000 lakes), including climate, atmospheric deposition, land use/cover, hydrology, geology, and topography measured across a range of spatial and temporal extents; and LAGOSLIMNO, with lake water quality data compiled from ~100 individual datasets for a subset of lakes in the study extent (~10,000 lakes). Procedures for the integration of datasets included: creating a flexible database design; authoring and integrating metadata; documenting data provenance; quantifying spatial measures of geographic data; quality-controlling integrated and derived data; and extensively documenting the database. Our procedures make a large, complex, and integrated database reproducible and extensible, allowing users to ask new research questions with the existing database or through the addition of new data. The largest challenge of this task was the heterogeneity of the data, formats, and metadata. Many steps of data integration need manual input from experts in diverse fields, requiring close collaboration.

  9. The Reliability of College Grades

    ERIC Educational Resources Information Center

    Beatty, Adam S.; Walmsley, Philip T.; Sackett, Paul R.; Kuncel, Nathan R.; Koch, Amanda J.

    2015-01-01

    Little is known about the reliability of college grades relative to how prominently they are used in educational research, and the results to date tend to be based on small sample studies or are decades old. This study uses two large databases (N > 800,000) from over 200 educational institutions spanning 13 years and finds that both first-year…

  10. Does High School Facility Quality Affect Student Achievement? A Two-Level Hierarchical Linear Model

    ERIC Educational Resources Information Center

    Bowers, Alex J.; Urick, Angela

    2011-01-01

    The purpose of this study is to isolate the independent effects of high school facility quality on student achievement using a large, nationally representative U.S. database of student achievement and school facility quality. Prior research on linking school facility quality to student achievement has been mixed. Studies that relate overall…

  11. How To Use the SilverPlatter Software To Search the ERIC CD ROM.

    ERIC Educational Resources Information Center

    Merrill, Paul F.

    This manual provides detailed instructions for using SilverPlatter software to search the ERIC CD ROM (Compact Disk Read Only Memory), a large bibliographic database relating to education which contains reference information on numerous journal articles from over 750 journals cited in the "Current Index to Journals in Education" (CIJE),…

  12. Supporting user-defined granularities in a spatiotemporal conceptual model

    USGS Publications Warehouse

    Khatri, V.; Ram, S.; Snodgrass, R.T.; O'Brien, G. M.

    2002-01-01

    Granularities are integral to spatial and temporal data. A large number of applications require storage of facts along with their temporal and spatial context, which needs to be expressed in terms of appropriate granularities. For many real-world applications, a single granularity in the database is insufficient. In order to support any type of spatial or temporal reasoning, the semantics related to granularities need to be embedded in the database. Specifying granularities related to facts is an important part of conceptual database design because under-specifying the granularity can restrict an application, affect the relative ordering of events and impact the topological relationships. Closely related to granularities is indeterminacy, i.e., an occurrence time or location associated with a fact that is not known exactly. In this paper, we present an ontology for spatial granularities that is a natural analog of temporal granularities. We propose an upward-compatible, annotation-based spatiotemporal conceptual model that can comprehensively capture the semantics related to spatial and temporal granularities, and indeterminacy, without requiring new spatiotemporal constructs. We specify the formal semantics of this spatiotemporal conceptual model via translation to a conventional conceptual model. To underscore the practical focus of our approach, we describe an on-going case study. We apply our approach to a hydrogeologic application at the United States Geological Survey and demonstrate that our proposed granularity-based spatiotemporal conceptual model is straightforward to use and is comprehensive.
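
    The ordering problem mentioned above can be made concrete with a toy sketch (an illustration only, not the authors' annotation-based model): facts recorded at different granularities must be coarsened to a common granularity before comparison, and at the coarser granularity their relative order can become indeterminate.

    # Toy sketch: store each fact with an explicit granularity and coarsen to
    # a common granularity before comparing orderings. The granularity
    # lattice (day -> month -> year) is an assumption for illustration.
    from datetime import date

    def coarsen(d: date, granularity: str):
        if granularity == "day":
            return (d.year, d.month, d.day)
        if granularity == "month":
            return (d.year, d.month)
        if granularity == "year":
            return (d.year,)
        raise ValueError(granularity)

    fact_a = (date(2002, 3, 15), "day")    # known to the day
    fact_b = (date(2002, 3, 1), "month")   # known only to the month

    common = "month"  # compare at the coarser of the two granularities
    a, b = coarsen(fact_a[0], common), coarsen(fact_b[0], common)
    # At month granularity the relative order of the two facts is indeterminate.
    print("same month" if a == b else ("a earlier" if a < b else "b earlier"))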

  13. Comparative effectiveness research in hand surgery.

    PubMed

    Johnson, Shepard P; Chung, Kevin C

    2014-08-01

    Comparative effectiveness research (CER) is a concept initiated by the Institute of Medicine and financially supported by the federal government. The primary objective of CER is to improve decision making in medicine. This research is intended to evaluate the effectiveness, benefits, and harmful effects of alternative interventions. CER studies are commonly large, simple, observational, and conducted using electronic databases. To date, there is little comparative effectiveness evidence within hand surgery to guide therapeutic decisions. To draw conclusions on effectiveness through electronic health records, databases must contain clinical information and outcomes relevant to hand surgery interventions, such as patient-related outcomes. Copyright © 2014 Elsevier Inc. All rights reserved.

  14. Solving Relational Database Problems with ORDBMS in an Advanced Database Course

    ERIC Educational Resources Information Center

    Wang, Ming

    2011-01-01

    This paper introduces how to use the object-relational database management system (ORDBMS) to solve relational database (RDB) problems in an advanced database course. The purpose of the paper is to provide a guideline for database instructors who desire to incorporate the ORDB technology in their traditional database courses. The paper presents…

  15. WikiPEATia - a web based platform for assembling peatland data through ‘crowd sourcing’

    NASA Astrophysics Data System (ADS)

    Wisser, D.; Glidden, S.; Fieseher, C.; Treat, C. C.; Routhier, M.; Frolking, S. E.

    2009-12-01

    The Earth System Science community is realizing that peatlands are an important and unique terrestrial ecosystem that has not yet been well integrated into large-scale earth system analyses. A major hurdle is the lack of accessible, geospatial data on peatland distribution, coupled with data on peatland properties (e.g., vegetation composition, peat depth, basal dates, soil chemistry, peatland class) at the global scale. These data, however, are available at the local scale. Although a comprehensive global database on peatlands probably lags similar data on more economically important ecosystems such as forests, grasslands, and croplands, a large amount of field data has been collected over the past several decades. A few efforts have been made to map peatlands at large scales, but the existing data have not been assembled into a single geospatial database that is publicly accessible, or do not depict data at the level of detail needed by the Earth System Science community. A global peatland database would contribute to advances in a number of research fields such as hydrology, vegetation and ecosystem modeling, permafrost modeling, and earth system modeling. We present a Web 2.0 approach that uses state-of-the-art webserver and innovative online mapping technologies and is designed to create such a global database through 'crowd-sourcing'. Primary functions of the online system include form-driven textual user input of peatland research metadata, spatial input of peatland areas via a mapping interface, database editing and querying capabilities, as well as advanced visualization and data analysis tools. WikiPEATia provides an integrated information technology platform for assembling, integrating, and posting peatland-related geospatial datasets that facilitates and encourages research community involvement. A successful effort will make existing peatland data much more useful to the research community, and will help to identify significant data gaps.

  16. Design and Implementation of a Metadata-rich File System

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Ames, S; Gokhale, M B; Maltzahn, C

    2010-01-19

    Despite continual improvements in the performance and reliability of large scale file systems, the management of user-defined file system metadata has changed little in the past decade. The mismatch between the size and complexity of large scale data stores and their ability to organize and query their metadata has led to a de facto standard in which raw data is stored in traditional file systems, while related, application-specific metadata is stored in relational databases. This separation of data and semantic metadata requires considerable effort to maintain consistency and can result in complex, slow, and inflexible system operation. To address these problems, we have developed the Quasar File System (QFS), a metadata-rich file system in which files, user-defined attributes, and file relationships are all first class objects. In contrast to hierarchical file systems and relational databases, QFS defines a graph data model composed of files and their relationships. QFS incorporates Quasar, an XPATH-extended query language for searching the file system. Results from our QFS prototype show the effectiveness of this approach. Compared to the de facto standard, the QFS prototype shows superior ingest performance and comparable query performance on user metadata-intensive operations and superior performance on normal file metadata operations.
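
    A toy sketch of the graph data model in the spirit of QFS (not the QFS or Quasar API; file names, attributes, and relationship types are invented): files, user-defined attributes, and typed file relationships are all first-class, and queries traverse the resulting graph rather than a directory tree.

    # Invented illustration: files with user-defined attributes, plus typed
    # (source, relationship, target) edges instead of a fixed hierarchy.
    files = {
        "run42.raw":  {"instrument": "sensor-A", "quality": "good"},
        "run42.csv":  {"stage": "calibrated"},
        "report.pdf": {"author": "jdoe"},
    }
    edges = [
        ("run42.csv", "derived_from", "run42.raw"),
        ("report.pdf", "summarizes", "run42.csv"),
    ]

    def provenance(name):
        """Walk 'derived_from' edges back to the raw inputs of a file."""
        for src, rel, dst in edges:
            if src == name and rel == "derived_from":
                yield dst
                yield from provenance(dst)

    print(list(provenance("run42.csv")))   # ['run42.raw']
    # Attribute query: all files carrying a user-defined 'quality' attribute
    print([f for f, attrs in files.items() if "quality" in attrs])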

  17. A two-step database search method improves sensitivity in peptide sequence matches for metaproteomics and proteogenomics studies.

    PubMed

    Jagtap, Pratik; Goslinga, Jill; Kooren, Joel A; McGowan, Thomas; Wroblewski, Matthew S; Seymour, Sean L; Griffin, Timothy J

    2013-04-01

    Large databases (>10⁶ sequences) used in metaproteomic and proteogenomic studies present challenges in matching peptide sequences to MS/MS data using database-search programs. Most notably, strict filtering to avoid false-positive matches leads to more false negatives, thus constraining the number of peptide matches. To address this challenge, we developed a two-step method wherein matches derived from a primary search against a large database were used to create a smaller subset database. The second search was performed against a target-decoy version of this subset database merged with a host database. High confidence peptide sequence matches were then used to infer protein identities. Applying our two-step method for both metaproteomic and proteogenomic analysis resulted in twice the number of high confidence peptide sequence matches in each case, as compared to the conventional one-step method. The two-step method captured almost all of the same peptides matched by the one-step method, with a majority of the additional matches being false negatives from the one-step method. Furthermore, the two-step method improved results regardless of the database search program used. Our results show that our two-step method maximizes the peptide matching sensitivity for applications requiring large databases, especially valuable for proteogenomics and metaproteomics studies. © 2013 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.
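
    The two-step workflow can be sketched schematically. The toy below substitutes exact substring matching for a real MS/MS database-search program and uses invented sequences, but it follows the same shape: a lenient first pass against the large database selects the proteins used to build the subset, and the strict second pass runs against the subset merged with decoys and a host database.

    # Toy, runnable sketch of the two-step idea; a stand-in for a real search
    # engine, with invented databases and peptides.
    def search(peptides, proteins):
        """Return {peptide: [ids of proteins containing it]}."""
        return {pep: [pid for pid, seq in proteins.items() if pep in seq]
                for pep in peptides}

    large_db = {f"prot{i}": seq for i, seq in enumerate(
        ["MKTAYIAKQR", "GILGFVFTLK", "VLDALDSIKR", "MKWVTFISLL"] * 2500)}
    host_db  = {"host1": "LLVVYPWTQR"}
    peptides = ["GILGFVFTLK", "LLVVYPWTQR"]

    # Step 1: primary search against the full database keeps only hit proteins.
    hits = {pid for matched in search(peptides, large_db).values() for pid in matched}
    subset_db = {pid: large_db[pid] for pid in hits}

    # Step 2: strict search against subset + reversed decoys + host database.
    decoy_db = {f"decoy_{pid}": seq[::-1] for pid, seq in subset_db.items()}
    second_pass = search(peptides, {**subset_db, **decoy_db, **host_db})
    for pep, matched in second_pass.items():
        print(pep, "->", matched[:3], "..." if len(matched) > 3 else "")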

  18. Is rigorous retrospective harmonization possible? Application of the DataSHaPER approach across 53 large studies

    PubMed Central

    Fortier, Isabel; Doiron, Dany; Little, Julian; Ferretti, Vincent; L’Heureux, François; Stolk, Ronald P; Knoppers, Bartha M; Hudson, Thomas J; Burton, Paul R

    2011-01-01

    Background Proper understanding of the roles of, and interactions between genetic, lifestyle, environmental and psycho-social factors in determining the risk of development and/or progression of chronic diseases requires access to very large high-quality databases. Because of the financial, technical and time burdens related to developing and maintaining very large studies, the scientific community is increasingly synthesizing data from multiple studies to construct large databases. However, the data items collected by individual studies must be inferentially equivalent to be meaningfully synthesized. The DataSchema and Harmonization Platform for Epidemiological Research (DataSHaPER; http://www.datashaper.org) was developed to enable the rigorous assessment of the inferential equivalence, i.e. the potential for harmonization, of selected information from individual studies. Methods This article examines the value of using the DataSHaPER for retrospective harmonization of established studies. Using the DataSHaPER approach, the potential to generate 148 harmonized variables from the questionnaires and physical measures collected in 53 large population-based studies (6.9 million participants) was assessed. Variable and study characteristics that might influence the potential for data synthesis were also explored. Results Out of all assessment items evaluated (148 variables for each of the 53 studies), 38% could be harmonized. Certain characteristics of variables (i.e. relative importance, individual targeted, reference period) and of studies (i.e. observational units, data collection start date and mode of questionnaire administration) were associated with the potential for harmonization. For example, for variables deemed to be essential, 62% of assessment items paired could be harmonized. Conclusion The current article shows that the DataSHaPER provides an effective and flexible approach for the retrospective harmonization of information across studies. To implement data synthesis, some additional scientific, ethico-legal and technical considerations must be addressed. The success of the DataSHaPER as a harmonization approach will depend on its continuing development and on the rigour and extent of its use. The DataSHaPER has the potential to take us closer to a truly collaborative epidemiology and offers the promise of enhanced research potential generated through synthesized databases. PMID:21804097

  19. Uses of tuberculosis mortality surveillance to identify programme errors and improve database reporting.

    PubMed

    Selig, L; Guedes, R; Kritski, A; Spector, N; Lapa E Silva, J R; Braga, J U; Trajman, A

    2009-08-01

    In 2006, 848 persons died from tuberculosis (TB) in Rio de Janeiro, Brazil, corresponding to a mortality rate of 5.4 per 100 000 population. No specific TB death surveillance actions are currently in place in Brazil. Setting: Two public general hospitals with large open emergency rooms in Rio de Janeiro City. Objective: To evaluate the contribution of TB death surveillance in detecting gaps in TB control. Design: We conducted a survey of TB deaths from September 2005 to August 2006. Records of TB-related deaths and deaths due to undefined causes were investigated. Complementary data were gathered from the mortality and TB notification databases. Results: Seventy-three TB-related deaths were investigated. Transmission hazards were identified among firefighters, health care workers and in-patients. Management errors included failure to isolate suspected cases, to confirm TB, to correct drug doses in underweight patients and to trace contacts. Following the survey, 36 cases that had not previously been notified were included in the national TB notification database and the outcome of 29 notified cases was corrected. Conclusion: TB mortality surveillance can contribute to TB monitoring and evaluation by detecting correctable and specific programme- and hospital-based care errors, and by improving the accuracy of TB database reporting. Specific local and programmatic interventions can be proposed as a result.

  20. In-database processing of a large collection of remote sensing data: applications and implementation

    NASA Astrophysics Data System (ADS)

    Kikhtenko, Vladimir; Mamash, Elena; Chubarov, Dmitri; Voronina, Polina

    2016-04-01

    Large archives of remote sensing data are now available to scientists, yet the need to work with individual satellite scenes or product files constrains studies that span a wide temporal range or spatial extent. The resources (storage capacity, computing power and network bandwidth) required for such studies are often beyond the capabilities of individual geoscientists. This problem has been tackled before in remote sensing research and inspired several information systems. Some of them, such as NASA Giovanni [1] and Google Earth Engine, have already proved their utility for science. Analysis tasks involving large volumes of numerical data are not unique to Earth Sciences. Recent advances in data science are enabled by the development of in-database processing engines that bring processing closer to storage, use declarative query languages to facilitate parallel scalability and provide high-level abstraction of the whole dataset. We build on the idea of bridging the gap between file archives containing remote sensing data and databases by integrating files into a relational database as foreign data sources and performing analytical processing inside the database engine. Thereby a higher level query language can efficiently address problems of arbitrary size: from accessing the data associated with a specific pixel or a grid cell to complex aggregation over spatial or temporal extents over a large number of individual data files. This approach was implemented using PostgreSQL for a Siberian regional archive of satellite data products holding hundreds of terabytes of measurements from multiple sensors and missions taken over a decade-long span. While preserving the original storage layout, and therefore compatibility with existing applications, the in-database processing engine provides a toolkit for provisioning remote sensing data in scientific workflows and applications. The use of SQL - a widely used higher level declarative query language - simplifies interoperability between desktop GIS, web applications, geographic web services and interactive scientific applications (MATLAB, IPython). The system is also automatically ingesting direct readout data from meteorological and research satellites in near-real time, with distributed acquisition workflows managed by the Taverna workflow engine [2]. The system has demonstrated its utility in performing non-trivial analytic processing such as the computation of the Robust Satellite Technique (RST) indices [3]. It has been useful in tasks such as studying urban heat islands, analyzing patterns in the distribution of wildfire occurrences, and detecting phenomena related to seismic and earthquake activity. Initial experience has highlighted several limitations of the proposed approach, yet it has demonstrated the ability to facilitate the use of large archives of remote sensing data by geoscientists. 1. J.G. Acker, G. Leptoukh, Online analysis enhances use of NASA Earth science data. EOS Trans. AGU, 2007, 88(2), P. 14-17. 2. D. Hull, K. Wolstencroft, R. Stevens, C. Goble, M.R. Pocock, P. Li and T. Oinn, Taverna: a tool for building and running workflows of services. Nucleic Acids Research. 2006. V. 34. P. W729-W732. 3. V. Tramutoli, G. Di Bello, N. Pergola, S. Piscitelli, Robust satellite techniques for remote sensing of seismically active areas // Annals of Geophysics. 2001. no. 44(2). P. 295-312.
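
    The core idea, moving computation into the database engine rather than downloading files to client code, can be illustrated with a user-defined function. Below is a minimal sketch using SQLite and an invented NDVI example; the system described above uses PostgreSQL with files integrated as foreign data sources, so none of the names below are from it.

    # Minimal illustration of in-database processing: register a Python
    # function inside the engine and aggregate over pixels with SQL.
    import sqlite3

    def ndvi(nir: float, red: float) -> float:
        """Normalized Difference Vegetation Index for one pixel."""
        return (nir - red) / (nir + red) if (nir + red) else 0.0

    con = sqlite3.connect(":memory:")
    con.create_function("ndvi", 2, ndvi)   # the UDF now runs inside the engine
    con.execute("CREATE TABLE pixel (scene TEXT, x INT, y INT, red REAL, nir REAL)")
    con.executemany("INSERT INTO pixel VALUES (?, ?, ?, ?, ?)", [
        ("S1", 0, 0, 0.10, 0.60), ("S1", 1, 0, 0.12, 0.55),
        ("S2", 0, 0, 0.30, 0.35), ("S2", 1, 0, 0.28, 0.33),
    ])
    # Aggregation over an arbitrary extent happens inside the database, not in
    # client code that would first have to fetch every scene.
    for scene, mean_ndvi in con.execute(
            "SELECT scene, AVG(ndvi(nir, red)) FROM pixel GROUP BY scene"):
        print(scene, round(mean_ndvi, 3))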

  1. Application of a 5-tiered scheme for standardized classification of 2,360 unique mismatch repair gene variants in the InSiGHT locus-specific database.

    PubMed

    Thompson, Bryony A; Spurdle, Amanda B; Plazzer, John-Paul; Greenblatt, Marc S; Akagi, Kiwamu; Al-Mulla, Fahd; Bapat, Bharati; Bernstein, Inge; Capellá, Gabriel; den Dunnen, Johan T; du Sart, Desiree; Fabre, Aurelie; Farrell, Michael P; Farrington, Susan M; Frayling, Ian M; Frebourg, Thierry; Goldgar, David E; Heinen, Christopher D; Holinski-Feder, Elke; Kohonen-Corish, Maija; Robinson, Kristina Lagerstedt; Leung, Suet Yi; Martins, Alexandra; Moller, Pal; Morak, Monika; Nystrom, Minna; Peltomaki, Paivi; Pineda, Marta; Qi, Ming; Ramesar, Rajkumar; Rasmussen, Lene Juel; Royer-Pokora, Brigitte; Scott, Rodney J; Sijmons, Rolf; Tavtigian, Sean V; Tops, Carli M; Weber, Thomas; Wijnen, Juul; Woods, Michael O; Macrae, Finlay; Genuardi, Maurizio

    2014-02-01

    The clinical classification of hereditary sequence variants identified in disease-related genes directly affects clinical management of patients and their relatives. The International Society for Gastrointestinal Hereditary Tumours (InSiGHT) undertook a collaborative effort to develop, test and apply a standardized classification scheme to constitutional variants in the Lynch syndrome-associated genes MLH1, MSH2, MSH6 and PMS2. Unpublished data submission was encouraged to assist in variant classification and was recognized through microattribution. The scheme was refined by multidisciplinary expert committee review of the clinical and functional data available for variants, applied to 2,360 sequence alterations, and disseminated online. Assessment using validated criteria altered classifications for 66% of 12,006 database entries. Clinical recommendations based on transparent evaluation are now possible for 1,370 variants that were not obviously protein truncating from nomenclature. This large-scale endeavor will facilitate the consistent management of families suspected to have Lynch syndrome and demonstrates the value of multidisciplinary collaboration in the curation and classification of variants in public locus-specific databases.

  2. Quality of life in maltreated children and adult survivors of child maltreatment: a systematic review.

    PubMed

    Weber, S; Jud, A; Landolt, M A

    2016-02-01

    To review data on health-related quality of life (HRQoL) in individuals with childhood trauma, including psychological maltreatment, physical maltreatment, sexual abuse, and neglect. The literature search was conducted with pre-defined keywords using the following electronic bibliographic databases: EMBASE, PubMed, MEDLINE, CINAHL, PsycINFO, PSYNDEX, and Cochrane Database of Systematic Reviews. Further databases were searched for relevant dissertations. Study selection and data extraction were completed by two independent reviewers. The literature search yielded 1568 entries. Nineteen articles met all inclusion criteria and were retained for further analysis. Findings quite consistently showed significant negative associations between child maltreatment and both self- and proxy-rated HRQoL. Effect sizes range from small to large. Number of types of maltreatment and HRQoL were found to be negatively related. Data on HRQoL for maltreated children are still rare. Studies often investigate adult survivors of child maltreatment. Considering HRQoL in children and adolescents who suffered maltreatment would allow the planning of effective interventions and the evaluation of treatments to improve HRQoL of these children.

  3. Orthographic and Phonological Neighborhood Databases across Multiple Languages.

    PubMed

    Marian, Viorica

    2017-01-01

    The increased globalization of science and technology and the growing number of bilinguals and multilinguals in the world have made research with multiple languages a mainstay for scholars who study human function and especially those who focus on language, cognition, and the brain. Such research can benefit from large-scale databases and online resources that describe and measure lexical, phonological, orthographic, and semantic information. The present paper discusses currently-available resources and underscores the need for tools that enable measurements both within and across multiple languages. A general review of language databases is followed by a targeted introduction to databases of orthographic and phonological neighborhoods. A specific focus on CLEARPOND illustrates how databases can be used to assess and compare neighborhood information across languages, to develop research materials, and to provide insight into broad questions about language. As an example of how using large-scale databases can answer questions about language, a closer look at neighborhood effects on lexical access reveals that not only orthographic, but also phonological neighborhoods can influence visual lexical access both within and across languages. We conclude that capitalizing upon large-scale linguistic databases can advance, refine, and accelerate scientific discoveries about the human linguistic capacity.
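
    As an illustration of what a neighborhood database stores: the standard orthographic neighborhood measure (Coltheart's N) counts words of the same length differing by exactly one letter substitution, and phonological neighborhoods are defined analogously over phonemes. A minimal sketch with an invented mini-lexicon follows (not CLEARPOND's data or API):

    # Orthographic neighbors: same length, exactly one differing letter.
    def is_neighbor(w1: str, w2: str) -> bool:
        return (len(w1) == len(w2)
                and sum(a != b for a, b in zip(w1, w2)) == 1)

    lexicon = ["cat", "bat", "hat", "cot", "car", "dog", "cart"]

    def neighborhood(word, words):
        return [w for w in words if is_neighbor(word, w)]

    print(neighborhood("cat", lexicon))       # ['bat', 'hat', 'cot', 'car']
    print(len(neighborhood("dog", lexicon)))  # 0 -- a "hermit" word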

  4. Identifying work-related motor vehicle crashes in multiple databases.

    PubMed

    Thomas, Andrea M; Thygerson, Steven M; Merrill, Ray M; Cook, Lawrence J

    2012-01-01

    To compare and estimate the magnitude of work-related motor vehicle crashes in Utah using 2 probabilistically linked statewide databases. Data from 2006 and 2007 motor vehicle crash and hospital databases were joined through probabilistic linkage. Summary statistics and capture-recapture were used to describe occupants injured in work-related motor vehicle crashes and estimate the size of this population. There were 1597 occupants in the motor vehicle crash database and 1673 patients in the hospital database identified as being in a work-related motor vehicle crash. We identified 1443 occupants with at least one record from either the motor vehicle crash or hospital database indicating work-relatedness that linked to any record in the opposing database. We found that 38.7 percent of occupants injured in work-related motor vehicle crashes identified in the motor vehicle crash database did not have a primary payer code of workers' compensation in the hospital database and 40.0 percent of patients injured in work-related motor vehicle crashes identified in the hospital database did not meet our definition of a work-related motor vehicle crash in the motor vehicle crash database. Depending on how occupants injured in work-related motor crashes are identified, we estimate the population to be between 1852 and 8492 in Utah for the years 2006 and 2007. Research on single databases may lead to biased interpretations of work-related motor vehicle crashes. Combining 2 population based databases may still result in an underestimate of the magnitude of work-related motor vehicle crashes. Improved coding of work-related incidents is needed in current databases.
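
    The population estimate above follows from capture-recapture logic: treating the two databases as independent samples, the Lincoln-Petersen estimator (here in Chapman's bias-corrected form) gives N ≈ (n1+1)(n2+1)/(m+1) - 1, where m is the number captured in both. Using the counts reported in the abstract, a quick check approximately reproduces the lower bound of 1852; the assumption of independent, equally ascertained samples is the method's key caveat.

    # Capture-recapture check with the counts reported in the abstract.
    n_crash    = 1597   # work-related occupants in the crash database
    n_hospital = 1673   # work-related patients in the hospital database
    n_linked   = 1443   # occupants identified in both via probabilistic linkage

    # Chapman's bias-corrected Lincoln-Petersen estimator
    n_hat = (n_crash + 1) * (n_hospital + 1) / (n_linked + 1) - 1
    print(f"Estimated total population: {n_hat:.0f}")   # ~1852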

  5. Large Scale Landslide Database System Established for the Reservoirs in Southern Taiwan

    NASA Astrophysics Data System (ADS)

    Tsai, Tsai-Tsung; Tsai, Kuang-Jung; Shieh, Chjeng-Lun

    2017-04-01

    Typhoon Morakot's severe impact on southern Taiwan awakened public awareness of large-scale landslide disasters. Such disasters produce large quantities of sediment that negatively affect the operating functions of reservoirs. In order to reduce the risk of these disasters within the study area, the establishment of a database for hazard mitigation and disaster prevention is necessary. Real-time data and numerous archives of engineering data, environmental information, photos, and video will not only help people make appropriate decisions, but also allow the material of greatest public concern to be processed and given added value. The study defined basic data formats and standards for the various types of data collected about these reservoirs and then provided a management platform based on those formats and standards. Meanwhile, for practicality and convenience, the large-scale landslide disaster database system is built with the ability both to provide and to receive information, so that users can work with it on different types of devices. Information technology progresses extremely quickly, and even the most modern system may become outdated at any time. In order to provide long-term service, the system therefore reserves the possibility of user-defined data formats and standards and a user-defined system structure. The system established by this study is based on the HTML5 standard and uses responsive web design technology, which makes the large-scale landslide disaster database system easy for users to operate and extend.

  6. SensorDB: a virtual laboratory for the integration, visualization and analysis of varied biological sensor data.

    PubMed

    Salehi, Ali; Jimenez-Berni, Jose; Deery, David M; Palmer, Doug; Holland, Edward; Rozas-Larraondo, Pablo; Chapman, Scott C; Georgakopoulos, Dimitrios; Furbank, Robert T

    2015-01-01

    To our knowledge, there is no software or database solution that supports large volumes of biological time series sensor data efficiently and enables data visualization and analysis in real time. Existing solutions for managing data typically use unstructured file systems or relational databases. These systems are not designed to provide instantaneous response to user queries. Furthermore, they do not support rapid data analysis and visualization to enable interactive experiments. In large scale experiments, this behaviour slows research discovery, discourages the widespread sharing and reuse of data that could otherwise inform critical decisions in a timely manner and encourage effective collaboration between groups. In this paper we present SensorDB, a web based virtual laboratory that can manage large volumes of biological time series sensor data while supporting rapid data queries and real-time user interaction. SensorDB is sensor agnostic and uses web-based, state-of-the-art cloud and storage technologies to efficiently gather, analyse and visualize data. Collaboration and data sharing between different agencies and groups is thereby facilitated. SensorDB is available online at http://sensordb.csiro.au.

  7. The Cancer Epidemiology Descriptive Cohort Database: A Tool to Support Population-Based Interdisciplinary Research.

    PubMed

    Kennedy, Amy E; Khoury, Muin J; Ioannidis, John P A; Brotzman, Michelle; Miller, Amy; Lane, Crystal; Lai, Gabriel Y; Rogers, Scott D; Harvey, Chinonye; Elena, Joanne W; Seminara, Daniela

    2016-10-01

    We report on the establishment of a web-based Cancer Epidemiology Descriptive Cohort Database (CEDCD). The CEDCD's goals are to enhance awareness of resources, facilitate interdisciplinary research collaborations, and support existing cohorts for the study of cancer-related outcomes. Comprehensive descriptive data were collected from large cohorts established to study cancer as primary outcome using a newly developed questionnaire. These included an inventory of baseline and follow-up data, biospecimens, genomics, policies, and protocols. Additional descriptive data extracted from publicly available sources were also collected. This information was entered in a searchable and publicly accessible database. We summarized the descriptive data across cohorts and reported the characteristics of this resource. As of December 2015, the CEDCD includes data from 46 cohorts representing more than 6.5 million individuals (29% ethnic/racial minorities). Overall, 78% of the cohorts have collected blood at least once, 57% at multiple time points, and 46% collected tissue samples. Genotyping has been performed by 67% of the cohorts, while 46% have performed whole-genome or exome sequencing in subsets of enrolled individuals. Information on medical conditions other than cancer has been collected in more than 50% of the cohorts. More than 600,000 incident cancer cases and more than 40,000 prevalent cases are reported, with 24 cancer sites represented. The CEDCD assembles detailed descriptive information on a large number of cancer cohorts in a searchable database. Information from the CEDCD may assist the interdisciplinary research community by facilitating identification of well-established population resources and large-scale collaborative and integrative research. Cancer Epidemiol Biomarkers Prev; 25(10); 1392-401. ©2016 American Association for Cancer Research.

  8. The Top 50 Articles on Minimally Invasive Spine Surgery.

    PubMed

    Virk, Sohrab S; Yu, Elizabeth

    2017-04-01

    Bibliometric study of current literature. To catalog the most important minimally invasive spine (MIS) surgery articles, using the number of citations as a marker of relevance. MIS surgery is a relatively new tool used by spinal surgeons. There is a dynamic and evolving field of research related to MIS techniques, clinical outcomes, and basic science research. To date, there is no comprehensive review of the most cited articles related to MIS surgery. A systematic search was performed over three widely used literature databases: Web of Science, Scopus, and Google Scholar. Four searches were performed using the terms "minimally invasive spine surgery," "endoscopic spine surgery," "percutaneous spinal surgery," and "lateral interbody surgery." The number of citations was averaged amongst the three databases to rank each article. The query of the three databases was performed in November 2015. Fifty articles were selected based upon the number of citations each averaged amongst the three databases. The most cited article was titled "Extreme Lateral Interbody Fusion (XLIF): a novel surgical technique for anterior lumbar interbody fusion" by Ozgur et al and was credited with 447, 239, and 279 citations in Google Scholar, Web of Science, and Scopus, respectively. Citations ranged from 27 to 239 for Web of Science, 60 to 279 for Scopus, and 104 to 462 for Google Scholar. There was a large variety of articles written, spanning over 14 different topics, with the majority dealing with clinical outcomes related to MIS surgery. The majority of the most cited articles were level III and level IV studies, likely due to the relatively recent nature of technological advances in the field. Level I and level II studies are therefore required in MIS surgery in the years ahead. Level of evidence: 5.

  9. Performance assessment of EMR systems based on post-relational database.

    PubMed

    Yu, Hai-Yan; Li, Jing-Song; Zhang, Xiao-Guang; Tian, Yu; Suzuki, Muneou; Araki, Kenji

    2012-08-01

    Post-relational databases provide high performance and are currently widely used in American hospitals. As few hospital information systems (HIS) in either China or Japan are based on post-relational databases, here we introduce a new-generation electronic medical records (EMR) system called Hygeia, which was developed with the post-relational database Caché and the latest platform Ensemble. Utilizing the benefits of a post-relational database, Hygeia is equipped with an "integration" feature that allows all the system users to access data-with a fast response time-anywhere and at anytime. Performance tests of databases in EMR systems were implemented in both China and Japan. First, a comparison test was conducted between a post-relational database, Caché, and a relational database, Oracle, embedded in the EMR systems of a medium-sized first-class hospital in China. Second, a user terminal test was done on the EMR system Izanami, which is based on the identical database Caché and operates efficiently at the Miyazaki University Hospital in Japan. The results proved that the post-relational database Caché works faster than the relational database Oracle and showed perfect performance in the real-time EMR system.

  10. Regional spatial-temporal spread of citrus huanglongbing is affected by rain in Florida.

    PubMed

    Shimwela, Mpoki; Schubert, Timothy S; Albritton, Matthew; Halbert, Susan E; Jones, Debra J; Sun, Xiaoan; Roberts, Pamela; Singer, Burton; Lee, Wen Suk; Jones, Jeffrey B; Ploetz, Randy; van Bruggen, Ariena H C

    2018-06-06

    Citrus huanglongbing (HLB), associated with Candidatus Liberibacter asiaticus (Las), disseminated by Asian Citrus Psyllid (ACP), has devastated citrus in Florida since 2005. Data on HLB occurrence were stored in databases (2005-2012). Cumulative HLB-positive citrus blocks were subjected to kernel density analysis and kriging. Relative disease incidence per county was calculated by dividing HLB numbers by relative tree numbers and maximum incidence. Spatio-temporal HLB distributions were correlated with weather. Relative HLB incidence correlated positively with rainfall. The focus expansion rate was 1626 m month-1, similar to that in Brazil. Relative HLB incidence in counties with primarily large groves increased at a lower rate (0.24 year-1) than in counties with smaller groves in hotspot areas (0.67 year-1), confirming reports that large-scale HLB management may slow epidemic progress.

  11. The USA-NPN Information Management System: A tool in support of phenological assessments

    NASA Astrophysics Data System (ADS)

    Rosemartin, A.; Vazquez, R.; Wilson, B. E.; Denny, E. G.

    2009-12-01

    The USA National Phenology Network (USA-NPN) serves science and society by promoting a broad understanding of plant and animal phenology and the relationships among phenological patterns and all aspects of environmental change. Data management and information sharing are central to the USA-NPN mission. The USA-NPN develops, implements, and maintains a comprehensive Information Management System (IMS) to serve the needs of the network, including the collection, storage and dissemination of phenology data, access to phenology-related information, tools for data interpretation, and communication among partners of the USA-NPN. The IMS includes components for data storage, such as the National Phenology Database (NPD), and several online user interfaces to accommodate data entry, data download, data visualization and catalog searches for phenology-related information. The IMS is governed by a set of standards to ensure security, privacy, data access, and data quality. The National Phenology Database is designed to efficiently accommodate large quantities of phenology data, to be flexible to the changing needs of the network, and to provide for quality control. The database stores phenology data from multiple sources (e.g., partner organizations, researchers and citizen observers), and provides for integration with legacy datasets. Several services will be created to provide access to the data, including reports, visualization interfaces, and web services. These services will provide integrated access to phenology and related information for scientists, decision-makers and general audiences. Phenological assessments at any scale will rely on secure and flexible information management systems for the organization and analysis of phenology data. The USA-NPN’s IMS can serve phenology assessments directly, through data management and indirectly as a model for large-scale integrated data management.

  12. New perspectives in toxicological information management, and the role of ISSTOX databases in assessing chemical mutagenicity and carcinogenicity.

    PubMed

    Benigni, Romualdo; Battistelli, Chiara Laura; Bossa, Cecilia; Tcheremenskaia, Olga; Crettaz, Pierre

    2013-07-01

    Currently, the public has access to a variety of databases containing mutagenicity and carcinogenicity data. These resources are crucial for the toxicologists and regulators involved in the risk assessment of chemicals, which necessitates access to all the relevant literature and the capability to search across toxicity databases using both biological and chemical criteria. Towards the larger goal of screening chemicals for a wide range of toxicity end points of potential interest, publicly available resources across a large spectrum of biological and chemical data space must be effectively harnessed with current and evolving information technologies (i.e. systematised, integrated and mined), if long-term screening and prediction objectives are to be achieved. A key to rapid progress in the field of chemical toxicity databases is that of combining information technology with the chemical structure as the identifier of molecules. This permits an enormous range of operations (e.g. retrieving chemicals or chemical classes, describing the content of databases, finding similar chemicals, crossing biological and chemical interrogations, etc.) that more classical databases do not permit. This article describes the progress in the technology of toxicity databases, including the concepts of Chemical Relational Database and Toxicological Standardized Controlled Vocabularies (Ontology). It then describes the ISSTOX cluster of toxicological databases at the Istituto Superiore di Sanità, which consists of freely available databases characterised by the use of modern information technologies and by curation of the quality of the biological data. Finally, this article provides examples of analyses and results made possible by ISSTOX.

  13. LiverAtlas: a unique integrated knowledge database for systems-level research of liver and hepatic disease.

    PubMed

    Zhang, Yanqiong; Yang, Chunyuan; Wang, Shaochuang; Chen, Tao; Li, Mansheng; Wang, Xue; Li, Dongsheng; Wang, Kang; Ma, Jie; Wu, Songfeng; Zhang, Xueli; Zhu, Yunping; Wu, Jinsheng; He, Fuchu

    2013-09-01

    A large amount of liver-related physiological and pathological data exist in publicly available biological and bibliographic databases, which are usually far from comprehensive or integrated. Data collection, integration and mining processes pose a great challenge to scientific researchers and clinicians interested in the liver. To address these problems, we constructed LiverAtlas (http://liveratlas.hupo.org.cn), a comprehensive resource of biomedical knowledge related to the liver and various hepatic diseases by incorporating 53 databases. In the present version, LiverAtlas covers data on liver-related genomics, transcriptomics, proteomics, metabolomics and hepatic diseases. Additionally, LiverAtlas provides a wealth of manually curated information, relevant literature citations and cross-references to other databases. Importantly, an expert-confirmed Human Liver Disease Ontology, including relevant information for 227 types of hepatic disease, has been constructed and is used to annotate LiverAtlas data. Furthermore, we have demonstrated two examples of applying LiverAtlas data to identify candidate markers for hepatocellular carcinoma (HCC) at the systems level and to develop a systems biology-based classifier by combining the differential gene expression with topological features of human protein interaction networks to enhance the ability of HCC differential diagnosis. LiverAtlas is the most comprehensive liver and hepatic disease resource, which helps biologists and clinicians to analyse their data at the systems level and will contribute much to the biomarker discovery and diagnostic performance enhancement for liver diseases. © 2013 John Wiley & Sons A/S. Published by John Wiley & Sons Ltd.

  14. Database Objects vs Files: Evaluation of alternative strategies for managing large remote sensing data

    NASA Astrophysics Data System (ADS)

    Baru, Chaitan; Nandigam, Viswanath; Krishnan, Sriram

    2010-05-01

    Increasingly, the geoscience user community expects modern IT capabilities to be available in service of their research and education activities, including the ability to easily access and process large remote sensing datasets via online portals such as GEON (www.geongrid.org) and OpenTopography (opentopography.org). However, serving such datasets via online data portals presents a number of challenges. In this talk, we will evaluate the pros and cons of alternative storage strategies for the management and processing of such datasets: binary large objects (BLOBs) in database systems versus files in the Hadoop Distributed File System (HDFS). The storage and I/O requirements for providing online access to large datasets dictate the need for declustering data across multiple disks, for capacity as well as for bandwidth and response-time performance. This requires partitioning larger files into sets of smaller files, and is accompanied by the concomitant requirement of managing large numbers of files. Storing these sub-files as BLOBs in a shared-nothing database implemented across a cluster provides the advantage that all the distributed storage management is done by the DBMS. Furthermore, subsetting and processing routines can be implemented as user-defined functions (UDFs) on these BLOBs and would run in parallel across the set of nodes in the cluster. On the other hand, such an implementation creates storage overheads and constraints, as well as software licensing dependencies. Another approach is to store the files in an external filesystem with pointers to them from within database tables. The filesystem may be a regular UNIX filesystem, a parallel filesystem, or HDFS. In the HDFS case, HDFS would provide the file management capability, while the subsetting and processing routines would be implemented as Hadoop programs using the MapReduce model. Hadoop and its related software libraries are freely available. Another consideration is the strategy used for partitioning large data collections, and large datasets within collections, using round-robin, hash, or range partitioning methods. Each has different characteristics in terms of spatial locality of data and the resultant degree of declustering of the computations on the data. Furthermore, we have observed that, in practice, there can be large variations in the frequency of access to different parts of a large data collection and/or dataset, thereby creating "hotspots" in the data. We will evaluate the ability of the different approaches to deal effectively with such hotspots.
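
    The three partitioning strategies mentioned above can be sketched in a few lines. The node count, tile names, and coordinate bounds below are illustrative assumptions; a production system would partition on spatial keys rather than toy indices.

        import hashlib

        NODES = 4

        def round_robin(tile_index):
            # Even capacity balance, but no spatial locality.
            return tile_index % NODES

        def hash_partition(tile_key):
            # Deterministic scatter; spreads hotspots but destroys locality.
            return int(hashlib.md5(tile_key.encode()).hexdigest(), 16) % NODES

        def range_partition(x, x_min=0.0, x_max=1.0):
            # Preserves spatial locality: adjacent tiles land on the same node,
            # which helps subsetting but concentrates access on "hot" regions.
            span = (x_max - x_min) / NODES
            return min(int((x - x_min) / span), NODES - 1)

        tiles = [("tile_0_0", 0.10), ("tile_0_1", 0.35),
                 ("tile_1_0", 0.60), ("tile_1_1", 0.90)]
        for i, (key, x) in enumerate(tiles):
            print(key, round_robin(i), hash_partition(key), range_partition(x))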

  15. Sharing Water Data to Encourage Sustainable Choices in Areas of the Marcellus Shale

    NASA Astrophysics Data System (ADS)

    Brantley, S. L.; Abad, J. D.; Vastine, J.; Yoxtheimer, D.; Wilderman, C.; Vidic, R.; Hooper, R. P.; Brasier, K.

    2012-12-01

    Natural gas sourced from shales but stored in more permeable formations has long been exploited as an energy resource. Now, however, gas is exploited directly from the low-porosity and low-permeability shale reservoirs through the use of hydrofracturing. Hydrofracturing is not a new technique: it has long been utilized in the energy industry to promote flow of oil and gas from traditional reservoirs. To exploit gas in reservoirs such as the Marcellus shale in PA, hydrofracturing is paired with directional drilling. Such hydrofracturing utilizes large volumes of water to increase porosity in the shale formations at depth. Small concentrations of chemicals are added to the water to improve the formation and maintenance of the fractures. Significant public controversy has developed in response to the use of hydrofracturing, especially in the northeastern states underlain by the Marcellus shale, where some citizens and scientists question whether shale gas recovery will contaminate local surface and ground waters. Researchers, government agencies, and citizen scientists in Pennsylvania are teaming up to run the Shale Network (www.shalenetwork.org), an NSF-funded research collaboration network that is currently finding, collating, sharing, publishing, and exploring data related to water quality and quantity in areas that are exploiting shale gas. The effort, focussed initially on Pennsylvania, is now developing the Shale Network database that can be accessed through HydroDesktop in the CUAHSI Hydrologic Information System. In the first year since inception, the Shale Network ran a workshop and reached eight conclusions, largely focussed on issues related to the sources, entry, and use of data. First, the group discovered that extensive water data are available in areas of shale gas development. Second, participants agreed that the Shale Network team should partner with state agencies and industry to move datasets online. Third, participants discovered that the database allows participants to assess data gaps. Fourth, the team was encouraged to search for data that plug these gaps. Fifth, the database should be easily sustained by others long-term if the Shale Network team simplifies the process of uploading data and finds ways to create community buy-in or incentives for data uploads. Sixth, the database itself and the workshops for the database should drive future agreement about analytical protocols. Seventh, the database is already encouraging other groups to publish data online. Finally, a user interface is needed that is easier and more accessible for citizens to use. Overall, it is clear that sharing data is one way to build bridges among decision makers, scientists, and citizens seeking to understand the sustainable development of energy resources in the face of concerns about water quality and quantity.

  16. [Benefits of large healthcare databases for drug risk research].

    PubMed

    Garbe, Edeltraut; Pigeot, Iris

    2015-08-01

    Large electronic healthcare databases have become an important worldwide data resource for drug safety research after approval. Signal generation methods and drug safety studies based on these data facilitate the prospective monitoring of drug safety after approval, as has recently been required by EU law and the German Medicines Act. Despite its large size, a single healthcare database may include too few drug-exposed patients to study rarely used drugs or to investigate very rare drug risks. For that reason, efforts have been made in the United States to develop models that link data from different electronic healthcare databases for monitoring the safety of medicines after authorization, in (i) the Sentinel Initiative and (ii) the Observational Medical Outcomes Partnership (OMOP). In July 2014, the pilot project Mini-Sentinel included a total of 178 million people from 18 different US databases. The merging of the data is based on a distributed data network with a common data model. In the European Network of Centres for Pharmacoepidemiology and Pharmacovigilance (ENCePP) there has been no comparable merging of data from different databases; however, first experiences have been gained in various EU drug safety projects. In Germany, the data of the statutory health insurance providers constitute the most important resource for establishing a large healthcare database. Their use for this purpose has so far been severely restricted by the Code of Social Law (Section 75, Book 10). Therefore, a reform of this section is absolutely necessary.
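
    The distributed data network idea can be pictured in a few lines: every site maps its local data to the same table layout (the common data model) and answers only aggregate queries, so patient-level records never leave the site. The table and column names below are illustrative stand-ins, not the actual Sentinel or OMOP schemas.

        import sqlite3

        def make_site(rows):
            # Each site holds its own data, mapped to the shared table layout.
            db = sqlite3.connect(":memory:")
            db.execute("CREATE TABLE drug_exposure (person_id INT, drug TEXT)")
            db.executemany("INSERT INTO drug_exposure VALUES (?, ?)", rows)
            return db

        sites = [make_site([(1, "warfarin"), (2, "aspirin")]),
                 make_site([(1, "warfarin"), (3, "warfarin")])]

        # The coordinating centre distributes one query and pools only counts.
        query = "SELECT COUNT(DISTINCT person_id) FROM drug_exposure WHERE drug = ?"
        total = sum(db.execute(query, ("warfarin",)).fetchone()[0] for db in sites)
        print("warfarin-exposed patients across the network:", total)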

  17. Visualizing the semantic content of large text databases using text maps

    NASA Technical Reports Server (NTRS)

    Combs, Nathan

    1993-01-01

    A methodology for generating text map representations of the semantic content of text databases is presented. Text maps provide a graphical metaphor for conceptualizing and visualizing the contents and data interrelationships of large text databases. A set of experiments conducted against the TIPSTER corpora of Wall Street Journal articles is described. These experiments provide an introduction to current work in the representation and visualization of documents by way of their semantic content.

  18. hEIDI: An Intuitive Application Tool To Organize and Treat Large-Scale Proteomics Data.

    PubMed

    Hesse, Anne-Marie; Dupierris, Véronique; Adam, Claire; Court, Magali; Barthe, Damien; Emadali, Anouk; Masselon, Christophe; Ferro, Myriam; Bruley, Christophe

    2016-10-07

    Advances in high-throughput proteomics have led to a rapid increase in the number, size, and complexity of the associated data sets. Managing and extracting reliable information from such large series of data sets require the use of dedicated software organized in a consistent pipeline to reduce, validate, exploit, and ultimately export data. The compilation of multiple mass-spectrometry-based identification and quantification results obtained in the context of a large-scale project represents a real challenge for developers of bioinformatics solutions. In response to this challenge, we developed a dedicated software suite called hEIDI to manage and combine both identifications and semiquantitative data related to multiple LC-MS/MS analyses. This paper describes how, through a user-friendly interface, hEIDI can be used to compile analyses and retrieve lists of nonredundant protein groups. Moreover, hEIDI allows direct comparison of series of analyses, on the basis of protein groups, while ensuring consistent protein inference and also computing spectral counts. hEIDI ensures that validated results are compliant with MIAPE guidelines as all information related to samples and results is stored in appropriate databases. Thanks to the database structure, validated results generated within hEIDI can be easily exported in the PRIDE XML format for subsequent publication. hEIDI can be downloaded from http://biodev.extra.cea.fr/docs/heidi.
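
    One of the computations mentioned above, spectral counting per protein group across several analyses, reduces to a small aggregation once peptide-spectrum matches are grouped consistently. The identifiers and tuple layout in this sketch are invented for the example and are not hEIDI's data model.

        from collections import defaultdict

        # (analysis, peptide, protein_group) for validated peptide-spectrum matches
        psms = [
            ("run1", "LVNELTEFAK",   "ALBU_HUMAN"),
            ("run1", "LVNELTEFAK",   "ALBU_HUMAN"),
            ("run1", "QTALVELVK",    "ALBU_HUMAN"),
            ("run2", "LVNELTEFAK",   "ALBU_HUMAN"),
            ("run2", "AVMDDFAAFVEK", "TRFE_HUMAN"),
        ]

        # Spectral count = validated spectra per protein group per analysis,
        # computed against the same protein grouping for every run.
        counts = defaultdict(int)
        for analysis, _peptide, group in psms:
            counts[(group, analysis)] += 1

        for (group, analysis), n in sorted(counts.items()):
            print(group, analysis, n)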

  19. Quality Control Algorithms for the Kennedy Space Center 50-Megahertz Doppler Radar Wind Profiler Winds Database

    NASA Technical Reports Server (NTRS)

    Barbre, Robert E., Jr.

    2012-01-01

    This paper presents the process used by the Marshall Space Flight Center Natural Environments Branch (EV44) to quality control (QC) data from the Kennedy Space Center's 50-MHz Doppler Radar Wind Profiler (DRWP) for use in vehicle wind loads and steering commands. The database has been built to mitigate limitations of using the currently archived databases from weather balloons. The DRWP database contains wind measurements from approximately 2.7-18.6 km altitude at roughly five-minute intervals for the August 1997 to December 2009 period of record, and the extensive QC process was designed to remove spurious data from various forms of atmospheric and non-atmospheric artifacts. The QC process is largely based on DRWP literature, but two new algorithms have been developed to remove data contaminated by convection and by excessive first-guess propagations from the Median Filter First Guess Algorithm. In addition to describing the automated and manual QC process in detail, this paper describes the extent of the data retained. Roughly 58% of all possible wind observations exist in the database, which contains approximately 100 times as many complete profile sets as the EV44 balloon databases. This increased sample of near-continuous wind profile measurements may help increase launch availability by reducing the uncertainty of wind changes during launch countdown.
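
    The flavor of a median-filter first-guess check can be conveyed with a short sketch: each sample is compared against a running median of its neighbours and rejected when it deviates too far. The window length and threshold here are illustrative, not the EV44 values.

        import numpy as np

        def median_filter_qc(u, window=5, max_dev=10.0):
            """Return a boolean mask: True = keep sample, False = reject."""
            u = np.asarray(u, dtype=float)
            half = window // 2
            keep = np.ones(u.size, dtype=bool)
            for i in range(u.size):
                lo, hi = max(0, i - half), min(u.size, i + half + 1)
                first_guess = np.median(u[lo:hi])  # local median as first guess
                keep[i] = abs(u[i] - first_guess) <= max_dev
            return keep

        wind = [12.1, 12.4, 12.2, 45.0, 12.6, 12.5]  # m/s; 45.0 is a bad gate
        print(median_filter_qc(wind))                # only the outlier is False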

  20. Database citation in supplementary data linked to Europe PubMed Central full text biomedical articles.

    PubMed

    Kafkas, Şenay; Kim, Jee-Hyub; Pi, Xingjun; McEntyre, Johanna R

    2015-01-01

    In this study, we present an analysis of data citation practices in full text research articles and their corresponding supplementary data files, made available in the Open Access set of articles from Europe PubMed Central. Our aim is to investigate whether supplementary data files should be considered as a source of information for integrating the literature with biomolecular databases. Using text-mining methods to identify and extract a variety of core biological database accession numbers, we found that the supplemental data files contain many more database citations than the body of the article, and that those citations often take the form of a relatively small number of articles citing large collections of accession numbers in text-based files. Moreover, citation of value-added databases derived from submission databases (such as Pfam, UniProt or Ensembl) is common, demonstrating the reuse of these resources as datasets in themselves. All the database accession numbers extracted from the supplementary data are publicly accessible from http://dx.doi.org/10.5281/zenodo.11771. Our study suggests that supplementary data should be considered when linking articles with data, in curation pipelines, and in information retrieval tasks in order to make full use of the entire research article. These observations highlight the need to improve the management of supplemental data in general, in order to make this information more discoverable and useful.
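
    Extraction of accession numbers of this kind is typically done with identifier-shape patterns, as in the sketch below. These regular expressions are simplified illustrations, not the validated rules used in the study; real pipelines also disambiguate by context, since identifier shapes overlap across databases.

        import re

        PATTERNS = {
            "GenBank nucleotide": r"\b[A-Z]{1,2}\d{5,6}(?:\.\d+)?\b",
            "UniProt":            r"\b[OPQ][0-9][A-Z0-9]{3}[0-9]\b",
            "Pfam":               r"\bPF\d{5}\b",
            "Ensembl gene":       r"\bENSG\d{11}\b",
        }

        text = ("Alignments used Pfam domain PF00069; sequences AF231982 and "
                "U49845.1 were retrieved, and ENSG00000146648 was mapped to "
                "UniProt P00533.")

        # Note the deliberate overlap: PF00069 and P00533 also fit the loose
        # GenBank shape, which is why real extractors need disambiguation.
        for db, pattern in PATTERNS.items():
            hits = re.findall(pattern, text)
            if hits:
                print(db, "->", hits)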

  1. The EpiSLI Database: A Publicly Available Database on Speech and Language

    ERIC Educational Resources Information Center

    Tomblin, J. Bruce

    2010-01-01

    Purpose: This article describes a database that was created in the process of conducting a large-scale epidemiologic study of specific language impairment (SLI). As such, this database will be referred to as the EpiSLI database. Children with SLI have unexpected and unexplained difficulties learning and using spoken language. Although there is no…

  2. A Reference Viral Database (RVDB) To Enhance Bioinformatics Analysis of High-Throughput Sequencing for Novel Virus Detection

    PubMed Central

    Goodacre, Norman; Aljanahi, Aisha; Nandakumar, Subhiksha; Mikailov, Mike

    2018-01-01

    Detection of distantly related viruses by high-throughput sequencing (HTS) is bioinformatically challenging because of the lack of a public database containing all viral sequences, without abundant nonviral sequences, which can extend runtime and obscure viral hits. Our reference viral database (RVDB) includes all viral, virus-related, and virus-like nucleotide sequences (excluding bacterial viruses), regardless of length, and with overall reduced cellular sequences. Semantic selection criteria (SEM-I) were used to select viral sequences from GenBank, resulting in a first-generation viral database (VDB). This database was manually and computationally reviewed, resulting in refined semantic selection criteria (SEM-R), which were applied to a new download of updated GenBank sequences to create a second-generation VDB. Viral entries in the latter were clustered at 98% by CD-HIT-EST to reduce redundancy while retaining high viral sequence diversity. The viral identity of the clustered representative sequences (creps) was confirmed by BLAST searches in NCBI databases and HMMER searches in PFAM and DFAM databases. The resulting RVDB contained a broad representation of viral families, sequence diversity, and a reduced cellular content; it includes full-length and partial sequences and endogenous nonretroviral elements, endogenous retroviruses, and retrotransposons. Testing of RVDBv10.2 with an in-house HTS transcriptomic data set indicated a significantly faster run for virus detection than interrogating the entirety of the NCBI nonredundant nucleotide database, which contains all viral sequences but also nonviral sequences. RVDB is publicly available for facilitating HTS analysis, particularly for novel virus detection. It is meant to be updated on a regular basis to include new viral sequences added to GenBank. IMPORTANCE To facilitate bioinformatics analysis of high-throughput sequencing (HTS) data for the detection of both known and novel viruses, we have developed a new reference viral database (RVDB) that provides a broad representation of different virus species from eukaryotes by including all viral, virus-like, and virus-related sequences (excluding bacteriophages), regardless of their size. In particular, RVDB contains endogenous nonretroviral elements, endogenous retroviruses, and retrotransposons. Sequences were clustered to reduce redundancy while retaining high viral sequence diversity. A particularly useful feature of RVDB is the reduction of cellular sequences, which can enhance the run efficiency of large transcriptomic and genomic data analysis and increase the specificity of virus detection. PMID:29564396
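
    The semantic selection step can be pictured as keyword-driven inclusion and exclusion over GenBank definition lines, as in the toy sketch below; the keyword lists are stand-ins for the actual SEM-R criteria.

        import re

        INCLUDE = re.compile(r"virus|viral|retrotransposon|endogenous retrovirus", re.I)
        EXCLUDE = re.compile(r"phage|bacteriophage", re.I)  # bacterial viruses are out

        deflines = [
            "Human endogenous retrovirus K113 complete genome",
            "Enterobacteria phage lambda, complete genome",
            "Homo sapiens actin beta (ACTB), mRNA",
            "Hepatitis B virus strain 2-18, complete genome",
        ]

        selected = [d for d in deflines
                    if INCLUDE.search(d) and not EXCLUDE.search(d)]
        print(selected)  # the two viral entries; phage and cellular mRNA dropped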

  3. A Reference Viral Database (RVDB) To Enhance Bioinformatics Analysis of High-Throughput Sequencing for Novel Virus Detection.

    PubMed

    Goodacre, Norman; Aljanahi, Aisha; Nandakumar, Subhiksha; Mikailov, Mike; Khan, Arifa S

    2018-01-01

    Detection of distantly related viruses by high-throughput sequencing (HTS) is bioinformatically challenging because of the lack of a public database containing all viral sequences, without abundant nonviral sequences, which can extend runtime and obscure viral hits. Our reference viral database (RVDB) includes all viral, virus-related, and virus-like nucleotide sequences (excluding bacterial viruses), regardless of length, and with overall reduced cellular sequences. Semantic selection criteria (SEM-I) were used to select viral sequences from GenBank, resulting in a first-generation viral database (VDB). This database was manually and computationally reviewed, resulting in refined semantic selection criteria (SEM-R), which were applied to a new download of updated GenBank sequences to create a second-generation VDB. Viral entries in the latter were clustered at 98% by CD-HIT-EST to reduce redundancy while retaining high viral sequence diversity. The viral identity of the clustered representative sequences (creps) was confirmed by BLAST searches in NCBI databases and HMMER searches in PFAM and DFAM databases. The resulting RVDB contained a broad representation of viral families, sequence diversity, and a reduced cellular content; it includes full-length and partial sequences and endogenous nonretroviral elements, endogenous retroviruses, and retrotransposons. Testing of RVDBv10.2 with an in-house HTS transcriptomic data set indicated a significantly faster run for virus detection than interrogating the entirety of the NCBI nonredundant nucleotide database, which contains all viral sequences but also nonviral sequences. RVDB is publicly available for facilitating HTS analysis, particularly for novel virus detection. It is meant to be updated on a regular basis to include new viral sequences added to GenBank. IMPORTANCE To facilitate bioinformatics analysis of high-throughput sequencing (HTS) data for the detection of both known and novel viruses, we have developed a new reference viral database (RVDB) that provides a broad representation of different virus species from eukaryotes by including all viral, virus-like, and virus-related sequences (excluding bacteriophages), regardless of their size. In particular, RVDB contains endogenous nonretroviral elements, endogenous retroviruses, and retrotransposons. Sequences were clustered to reduce redundancy while retaining high viral sequence diversity. A particularly useful feature of RVDB is the reduction of cellular sequences, which can enhance the run efficiency of large transcriptomic and genomic data analysis and increase the specificity of virus detection.

  4. Variant Alleles, Triallelic Patterns, and Point Mutations Observed in Nuclear Short Tandem Repeat Typing of Populations in Bosnia and Serbia

    PubMed Central

    Huel, René L. M.; Bašić, Lara; Madacki-Todorović, Kamelija; Smajlović, Lejla; Eminović, Izet; Berbić, Irfan; Miloš, Ana; Parsons, Thomas J.

    2007-01-01

    Aim To present a compendium of off-ladder alleles and other genotyping irregularities relating to rare/unexpected population genetic variation, observed in a large short tandem repeat (STR) database from Bosnia and Serbia. Methods DNA was extracted from blood stain cards relating to reference samples from a population of 32 800 individuals from Bosnia and Serbia, and typed using Promega's PowerPlex®16 STR kit. Results There were 31 distinct off-ladder alleles observed in 10 of the 15 STR loci amplified with the PowerPlex®16 STR kit. Of these 31 alleles, 3 have not been previously reported. Furthermore, 16 instances of triallelic patterns were observed in 9 of the 15 loci. Primer binding site mismatches that affected amplification were observed in two loci, D5S818 and D8S1179. Conclusion Instances of deviations from the manufacturer's allelic ladders should be expected, and caution should be taken to properly designate the correct alleles in large DNA databases. Particular care should be taken in kinship matching or paternity cases, as incorrect designation of any of these deviations from allelic ladders could lead to false exclusions. PMID:17696304

  5. The Transporter Classification Database: recent advances.

    PubMed

    Saier, Milton H; Yen, Ming Ren; Noto, Keith; Tamang, Dorjee G; Elkan, Charles

    2009-01-01

    The Transporter Classification Database (TCDB), freely accessible at http://www.tcdb.org, is a relational database containing sequence, structural, functional and evolutionary information about transport systems from a variety of living organisms, based on the International Union of Biochemistry and Molecular Biology-approved transporter classification (TC) system. It is a curated repository for factual information compiled largely from published references. It uses a functional/phylogenetic system of classification, and currently encompasses about 5000 representative transporters and putative transporters in more than 500 families. We here describe novel software designed to support and extend the usefulness of TCDB. Our recent efforts render it more user friendly, incorporate machine learning to input novel data in a semiautomatic fashion, and allow analyses that are more accurate and less time consuming. The availability of these tools has resulted in recognition of distant phylogenetic relationships and tremendous expansion of the information available to TCDB users.

  6. Automation of a N-S S and C Database Generation for the Harrier in Ground Effect

    NASA Technical Reports Server (NTRS)

    Murman, Scott M.; Chaderjian, Neal M.; Pandya, Shishir; Kwak, Dochan (Technical Monitor)

    2001-01-01

    A method of automating the generation of a time-dependent, Navier-Stokes static stability and control database for the Harrier aircraft in ground effect is outlined. Reusable, lightweight components are described which allow different facets of the computational fluid dynamics simulation process to utilize a consistent interface to a remote database. These components also allow changes and customizations to be easily incorporated into the solution process to enhance performance, without relying upon third-party support. An analysis of the multi-level parallel solver OVERFLOW-MLP is presented, and the results indicate that it is feasible to utilize large numbers of processors (approximately 100) even with a grid system with a relatively small number of cells (approximately 10^6). A more detailed discussion of the simulation process, as well as refined data for the scaling of the OVERFLOW-MLP flow solver, will be included in the full paper.

  7. DynGO: a tool for visualizing and mining of Gene Ontology and its associations

    PubMed Central

    Liu, Hongfang; Hu, Zhang-Zhi; Wu, Cathy H

    2005-01-01

    Background A large volume of data and information about genes and gene products has been stored in various molecular biology databases. A major challenge for knowledge discovery using these databases is to identify related genes and gene products in disparate databases. The development of Gene Ontology (GO) as a common vocabulary for annotation allows integrated queries across multiple databases and identification of semantically related genes and gene products (i.e., genes and gene products that have similar GO annotations). Meanwhile, dozens of tools have been developed for browsing, mining or editing GO terms, their hierarchical relationships, or their "associated" genes and gene products (i.e., genes and gene products annotated with GO terms). Tools that allow users to directly search and inspect relations among all GO terms and their associated genes and gene products from multiple databases are needed. Results We present a standalone package called DynGO, which provides several advanced functionalities in addition to the standard browsing capability of the official GO browsing tool (AmiGO). DynGO allows users to conduct batch retrieval of GO annotations for a list of genes and gene products, and semantic retrieval of genes and gene products sharing similar GO annotations. The results are shown in an association tree organized according to GO hierarchies and supported with many dynamic display options such as sorting tree nodes or changing orientation of the tree. For GO curators and frequent GO users, DynGO provides fast and convenient access to GO annotation data. DynGO is generally applicable to any data set where the records are annotated with GO terms, as illustrated by two examples. Conclusion We have presented a standalone package DynGO that provides functionalities to search and browse GO and its association databases as well as several additional functions such as batch retrieval and semantic retrieval. The complete documentation and software are freely available for download from the website. PMID:16091147
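
    Semantic retrieval of genes sharing similar GO annotations can be sketched as a set-similarity query over gene-to-term associations. The gene names, terms, and the Jaccard measure below are illustrative assumptions, not DynGO's actual algorithm.

        annotations = {
            "geneA": {"GO:0006397", "GO:0003723", "GO:0005634"},
            "geneB": {"GO:0006397", "GO:0003723", "GO:0005737"},
            "geneC": {"GO:0016301", "GO:0005524"},
        }

        def jaccard(a, b):
            return len(a & b) / len(a | b)

        def semantic_retrieve(query_gene, cutoff=0.3):
            # Rank other genes by overlap of their GO annotation sets.
            q = annotations[query_gene]
            hits = [(jaccard(q, terms), g) for g, terms in annotations.items()
                    if g != query_gene]
            return sorted((h for h in hits if h[0] >= cutoff), reverse=True)

        for score, gene in semantic_retrieve("geneA"):
            print(gene, round(score, 2))  # geneB shares 2 of 4 terms -> 0.5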

  8. A web-based, relational database for studying glaciers in the Italian Alps

    NASA Astrophysics Data System (ADS)

    Nigrelli, G.; Chiarle, M.; Nuzzi, A.; Perotti, L.; Torta, G.; Giardino, M.

    2013-02-01

    Glaciers are among the best terrestrial indicators of climate change and thus glacier inventories have attracted a growing, worldwide interest in recent years. In Italy, the first official glacier inventory was completed in 1925 and 774 glacial bodies were identified. As the amount of data continues to increase, and new techniques become available, there is a growing demand for computer tools that can efficiently manage the collected data. The Research Institute for Geo-hydrological Protection of the National Research Council, in cooperation with the Departments of Computer Science and Earth Sciences of the University of Turin, created a database that provides a modern tool for storing, processing and sharing glaciological data. The database was developed to meet the need to store heterogeneous information, which can be retrieved through a set of web search queries. The database's architecture is server-side, and was designed using open source software. The website interface, simple and intuitive, was intended to meet the needs of a distributed public: through this interface, any type of glaciological data can be managed, specific queries can be performed, and the results can be exported in a standard format. The use of a relational database to store and organize a large variety of information about Italian glaciers collected over the last hundred years constitutes a significant step forward in ensuring the safety and accessibility of such data. Moreover, the same benefits also apply to the enhanced operability of handling information in the future, including new and emerging types of data formats, such as geographic and multimedia files. Future developments include the integration of cartographic data, such as base maps, satellite images and vector data. The relational database described in this paper will be the heart of a new geographic system that will merge data, data attributes and maps, leading to a complete description of Italian glacial environments.

  9. Tomato functional genomics database (TFGD): a comprehensive collection and analysis package for tomato functional genomics

    USDA-ARS?s Scientific Manuscript database

    Tomato Functional Genomics Database (TFGD; http://ted.bti.cornell.edu) provides a comprehensive systems biology resource to store, mine, analyze, visualize and integrate large-scale tomato functional genomics datasets. The database is expanded from the previously described Tomato Expression Database...

  10. OrChem - An open source chemistry search engine for Oracle®

    PubMed Central

    2009-01-01

    Background Registration, indexing and searching of chemical structures in relational databases is one of the core areas of cheminformatics. However, little detail has been published on the inner workings of search engines and their development has been mostly closed-source. We decided to develop an open source chemistry extension for Oracle, the de facto database platform in the commercial world. Results Here we present OrChem, an extension for the Oracle 11G database that adds registration and indexing of chemical structures to support fast substructure and similarity searching. The cheminformatics functionality is provided by the Chemistry Development Kit. OrChem provides similarity searching with response times in the order of seconds for databases with millions of compounds, depending on a given similarity cut-off. For substructure searching, it can make use of multiple processor cores on today's powerful database servers to provide fast response times in equally large data sets. Availability OrChem is free software and can be redistributed and/or modified under the terms of the GNU Lesser General Public License as published by the Free Software Foundation. All software is available via http://orchem.sourceforge.net. PMID:20298521
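
    Similarity engines of this kind typically rank compounds by a Tanimoto coefficient computed over structural fingerprints and apply a cut-off. The sketch below shows the idea in Python with a deliberately toy fingerprint; OrChem itself relies on the Chemistry Development Kit's fingerprints inside Oracle.

        def fingerprint(smiles):
            # Toy "fingerprint": the set of 3-character SMILES substrings.
            # Real engines use hashed structural fingerprints instead.
            return {smiles[i:i + 3] for i in range(len(smiles) - 2)}

        def tanimoto(a, b):
            return len(a & b) / len(a | b)

        library = {"ethanol": "CCO", "propanol": "CCCO",
                   "acetic acid": "CC(=O)O", "benzene": "c1ccccc1"}
        query = fingerprint("CCCO")

        # Rank by similarity to the query and apply a similarity cut-off.
        for score, name in sorted(((tanimoto(query, fingerprint(s)), name)
                                   for name, s in library.items()), reverse=True):
            if score >= 0.3:
                print(f"{name}: {score:.2f}")   # propanol 1.00, ethanol 0.50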

  11. Making your database available through Wikipedia: the pros and cons.

    PubMed

    Finn, Robert D; Gardner, Paul P; Bateman, Alex

    2012-01-01

    Wikipedia, the online encyclopedia, is the most famous wiki in use today. It contains over 3.7 million pages of content, with many pages on scientific subjects that include peer-reviewed citations yet are written in an accessible manner and generally reflect the consensus opinion of the community. In this, the 19th Annual Database Issue of Nucleic Acids Research, there are 11 articles that describe the use of a wiki in relation to a biological database. In this commentary, we discuss how biological databases can be integrated with Wikipedia, thereby utilising the pre-existing infrastructure, tools and, above all, large community of authors (or Wikipedians). The limitations to the content that can be included in Wikipedia are highlighted, with examples drawn from articles found in this issue and other wiki-based resources, indicating why other wiki solutions are necessary. We discuss the merits of using open wikis, like Wikipedia, versus other models, with particular reference to potential vandalism. Finally, we raise the question of the future role of dedicated database biocurators in the context of the thousands of crowdsourced, community annotations that are now being stored in wikis.

  12. The VirusBanker database uses a Java program to allow flexible searching through Bunyaviridae sequences

    PubMed Central

    Fourment, Mathieu; Gibbs, Mark J

    2008-01-01

    Background Viruses of the Bunyaviridae have segmented negative-stranded RNA genomes and several of them cause significant disease. Many partial sequences have been obtained from the segments so that GenBank searches give complex results. Sequence databases usually use HTML pages to mediate remote sorting, but this approach can be limiting and may discourage a user from exploring a database. Results The VirusBanker database contains Bunyaviridae sequences and alignments and is presented as two spreadsheets generated by a Java program that interacts with a MySQL database on a server. Sequences are displayed in rows and may be sorted using information that is displayed in columns and includes data relating to the segment, gene, protein, species, strain, sequence length, terminal sequence and date and country of isolation. Bunyaviridae sequences and alignments may be downloaded from the second spreadsheet with titles defined by the user from the columns, or viewed when passed directly to the sequence editor, Jalview. Conclusion VirusBanker allows large datasets of aligned nucleotide and protein sequences from the Bunyaviridae to be compiled and winnowed rapidly using criteria that are formulated heuristically. PMID:18251994

  13. MetReS, an Efficient Database for Genomic Applications.

    PubMed

    Vilaplana, Jordi; Alves, Rui; Solsona, Francesc; Mateo, Jordi; Teixidó, Ivan; Pifarré, Marc

    2018-02-01

    MetReS (Metabolic Reconstruction Server) is a genomic database that is shared between two software applications that address important biological problems. Biblio-MetReS is a data-mining tool that enables the reconstruction of molecular networks based on automated text-mining analysis of published scientific literature. Homol-MetReS allows functional (re)annotation of proteomes, to properly identify both the individual proteins involved in the processes of interest and their function. The main goal of this work was to identify the areas where the performance of the MetReS database could be improved and to test whether this improvement would scale to larger datasets and more complex types of analysis. The study was started with the relational database MySQL, which is the current database server used by the applications. We also tested the performance of an alternative data-handling framework, Apache Hadoop, which is currently used for large-scale data processing. We found that this data-handling framework is likely to greatly improve the efficiency of the MetReS applications as the dataset size and the processing needs increase by several orders of magnitude, as is expected to happen in the near future.

  14. Making your database available through Wikipedia: the pros and cons

    PubMed Central

    Finn, Robert D.; Gardner, Paul P.; Bateman, Alex

    2012-01-01

    Wikipedia, the online encyclopedia, is the most famous wiki in use today. It contains over 3.7 million pages of content, with many pages on scientific subjects that include peer-reviewed citations yet are written in an accessible manner and generally reflect the consensus opinion of the community. In this, the 19th Annual Database Issue of Nucleic Acids Research, there are 11 articles that describe the use of a wiki in relation to a biological database. In this commentary, we discuss how biological databases can be integrated with Wikipedia, thereby utilising the pre-existing infrastructure, tools and, above all, large community of authors (or Wikipedians). The limitations to the content that can be included in Wikipedia are highlighted, with examples drawn from articles found in this issue and other wiki-based resources, indicating why other wiki solutions are necessary. We discuss the merits of using open wikis, like Wikipedia, versus other models, with particular reference to potential vandalism. Finally, we raise the question of the future role of dedicated database biocurators in the context of the thousands of crowdsourced, community annotations that are now being stored in wikis. PMID:22144683

  15. Improving retrospective characterization of the food environment for a large region in the United States during a historic time period.

    PubMed

    Auchincloss, Amy H; Moore, Kari A B; Moore, Latetia V; Diez Roux, Ana V

    2012-11-01

    Access to healthy foods has received increasing attention due to the growing prevalence of obesity and diet-related health conditions, yet there are major obstacles in characterizing the local food environment. This study developed a method to retrospectively characterize supermarkets for a single historic year, 2005, in 19 counties in 6 states in the USA using a supermarket chain-name list and two business databases. Data preparation, merging, overlaps, the added value of the various approaches, and differences by census-tract socio-demographic characteristics are described. Agreement between the two food store databases was modest: 63%. Only 55% of the final list of supermarkets were identified by a single business database and selection criteria that included industry classification codes and sales revenue ≥$2 million. The added value of using a supermarket chain-name list and a second business database was the identification of an additional 14% and 30% of supermarkets, respectively. These methods are particularly useful for retrospectively characterizing access to supermarkets during a historic period, when field observations are not feasible and business databases are used. Copyright © 2012 Elsevier Ltd. All rights reserved.

  16. Short Fiction on Film: A Relational DataBase.

    ERIC Educational Resources Information Center

    May, Charles

    Short Fiction on Film is a database that was created and will run on DataRelator, a relational database manager created by Bill Finzer for the California State Department of Education in 1986. DataRelator was designed for use in teaching students database management skills and to provide teachers with examples of how a database manager might be…

  17. Class dependency of fuzzy relational database using relational calculus and conditional probability

    NASA Astrophysics Data System (ADS)

    Deni Akbar, Mohammad; Mizoguchi, Yoshihiro; Adiwijaya

    2018-03-01

    In this paper, we propose a design of a fuzzy relational database that handles a conditional probability relation using fuzzy relational calculus. Previous research has investigated equivalence classes in fuzzy databases using similarity or approximate relations, and investigating fuzzy dependency using equivalence classes is an interesting topic. Our goal is to introduce a formulation of a fuzzy relational database model using the relational calculus on the category of fuzzy relations. We also introduce general formulas of the relational calculus for database operations such as 'projection', 'selection', 'injection' and 'natural join'. Using the fuzzy relational calculus and conditional probabilities, we introduce the notions of equivalence class, redundancy, and dependency in the theory of fuzzy relational databases.
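
    The flavour of these operations can be shown with a small sketch in which every tuple carries a membership degree in [0, 1] and the operators combine degrees with min and max. The formulation is a generic fuzzy-relational one, not the paper's exact calculus.

        def selection(rel, pred):
            # Keep tuples satisfying a crisp predicate; degrees unchanged.
            return {t: mu for t, mu in rel.items() if pred(t)}

        def projection(rel, idx):
            # Project onto attribute positions idx; a result tuple produced by
            # several source tuples takes the maximum of their degrees.
            out = {}
            for t, mu in rel.items():
                key = tuple(t[i] for i in idx)
                out[key] = max(out.get(key, 0.0), mu)
            return out

        def natural_join(r, s):
            # Join on the first attribute; a joined tuple's degree is the
            # minimum of the two source degrees.
            out = {}
            for (k1, *rest1), mu1 in r.items():
                for (k2, *rest2), mu2 in s.items():
                    if k1 == k2:
                        out[(k1, *rest1, *rest2)] = min(mu1, mu2)
            return out

        patients = {("p1", "fever"): 0.9, ("p2", "cough"): 0.6}
        diagnoses = {("p1", "flu"): 0.7}
        print(selection(patients, lambda t: t[1] == "fever"))
        print(projection(patients, [1]))
        print(natural_join(patients, diagnoses))  # {('p1','fever','flu'): 0.7}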

  18. 'The surface management system' (SuMS) database: a surface-based database to aid cortical surface reconstruction, visualization and analysis

    NASA Technical Reports Server (NTRS)

    Dickson, J.; Drury, H.; Van Essen, D. C.

    2001-01-01

    Surface reconstructions of the cerebral cortex are increasingly widely used in the analysis and visualization of cortical structure, function and connectivity. From a neuroinformatics perspective, dealing with surface-related data poses a number of challenges. These include the multiplicity of configurations in which surfaces are routinely viewed (e.g. inflated maps, spheres and flat maps), plus the diversity of experimental data that can be represented on any given surface. To address these challenges, we have developed a surface management system (SuMS) that allows automated storage and retrieval of complex surface-related datasets. SuMS provides a systematic framework for the classification, storage and retrieval of many types of surface-related data and associated volume data. Within this classification framework, it serves as a version-control system capable of handling large numbers of surface and volume datasets. With built-in database management system support, SuMS provides rapid search and retrieval capabilities across all the datasets, while also incorporating multiple security levels to regulate access. SuMS is implemented in Java and can be accessed via a Web interface (WebSuMS) or using downloaded client software. Thus, SuMS is well positioned to act as a multiplatform, multi-user 'surface request broker' for the neuroscience community.

  19. Development and Operation of a Database Machine for Online Access and Update of a Large Database.

    ERIC Educational Resources Information Center

    Rush, James E.

    1980-01-01

    Reviews the development of a fault tolerant database processor system which replaced OCLC's conventional file system. A general introduction to database management systems and the operating environment is followed by a description of the hardware selection, software processes, and system characteristics. (SW)

  20. Hypersonic and Supersonic Flow Roadmaps Using Bibliometrics and Database Tomography.

    ERIC Educational Resources Information Center

    Kostoff, R. N.; Eberhart, Henry J.; Toothman, Darrell Ray

    1999-01-01

    Database Tomography (DT) is a textual database-analysis system consisting of algorithms for extracting multiword phrase frequencies and proximities from a large textual database, to augment interpretative capabilities of the expert human analyst. Describes use of the DT process, supplemented by literature bibliometric analyses, to derive technical…

  1. Comparison of the NCI open database with seven large chemical structural databases.

    PubMed

    Voigt, J H; Bienfait, B; Wang, S; Nicklaus, M C

    2001-01-01

    Eight large chemical databases have been analyzed and compared to each other. Central to this comparison is the open National Cancer Institute (NCI) database, consisting of approximately 250 000 structures. The other databases analyzed are the Available Chemicals Directory ("ACD," from MDL, release 1.99, 3D-version); the ChemACX ("ACX," from CamSoft, Version 4.5); the Maybridge Catalog and the Asinex database (both as distributed by CamSoft as part of ChemInfo 4.5); the Sigma-Aldrich Catalog (CD-ROM, 1999 Version); the World Drug Index ("WDI," Derwent, version 1999.03); and the organic part of the Cambridge Crystallographic Database ("CSD," from Cambridge Crystallographic Data Center, 1999 Version 5.18). The database properties analyzed are internal duplication rates; compounds unique to each database; cumulative occurrence of compounds in an increasing number of databases; overlap of identical compounds between two databases; similarity overlap; diversity; and others. The crystallographic database CSD and the WDI show somewhat less overlap with the other databases than those with each other. In particular the collections of commercial compounds and compilations of vendor catalogs have a substantial degree of overlap among each other. Still, no database is completely a subset of any other, and each appears to have its own niche and thus "raison d'être". The NCI database has by far the highest number of compounds that are unique to it. Approximately 200 000 of the NCI structures were not found in any of the other analyzed databases.

  2. Mitochondrial DNA identification of game and harvested freshwater fish species.

    PubMed

    Kyle, C J; Wilson, C C

    2007-02-14

    The use of DNA in forensics has grown rapidly for human applications along with the concomitant development of bioinformatics and demographic databases to help fully realize the potential of this molecular information. Similar techniques are also used routinely in many wildlife cases, such as species identification in food products, poaching and the illegal trade of endangered species. The use of molecular techniques in forensic cases related to wildlife and the development of associated databases has, however, mainly focused on large mammals with the exception of a few high-profile species. There is a need to develop similar databases for aquatic species for fisheries enforcement, given the large number of exploited and endangered fish species, the intensity of exploitation, and challenges in identifying species and their derived products. We sequenced a 500 bp fragment of the mitochondrial cytochrome b gene from representative individuals from 26 harvested fish taxa from Ontario, Canada, focusing on species that support major commercial and recreational fisheries. Ontario provides a unique model system for the development of a fish species database, as the province contains an evolutionarily diverse array of freshwater fish families representing more than one third of all freshwater fish in Canada. Inter- and intraspecific sequence comparisons using phylogenetic analysis and a BLAST search algorithm provided rigorous statistical metrics for species identification. This methodology and these data will aid in fisheries enforcement, providing a tool to easily and accurately identify fish species in enforcement investigations that would have otherwise been difficult or impossible to pursue.

  3. A Database of Supercooled Large Droplet Ice Accretions [Supplement]

    NASA Technical Reports Server (NTRS)

    VanZante, Judith Foss

    2007-01-01

    A unique, publicly available database regarding supercooled large droplet (SLD) ice accretions has been developed in NASA Glenn's Icing Research Tunnel. Identical cloud and flight conditions were generated for five different airfoil models. The models chosen represent a variety of aircraft types from the horizontal stabilizer of a large transport aircraft to the wings of regional, business, and general aviation aircraft. In addition to the standard documentation methods of 2D ice shape tracing and imagery, ice mass measurements were also taken. This database will also be used to validate and verify the extension of the ice accretion code, LEWICE, into the SLD realm.

  4. A Database of Supercooled Large Droplet Ice Accretions

    NASA Technical Reports Server (NTRS)

    VanZante, Judith Foss

    2007-01-01

    A unique, publicly available database regarding supercooled large droplet ice accretions has been developed in NASA Glenn's Icing Research Tunnel. Identical cloud and flight conditions were generated for five different airfoil models. The models chosen represent a variety of aircraft types from the horizontal stabilizer of a large transport aircraft to the wings of regional, business, and general aviation aircraft. In addition to the standard documentation methods of 2D ice shape tracing and imagery, ice mass measurements were also taken. This database will also be used to validate and verify the extension of the ice accretion code, LEWICE, into the SLD realm.

  5. Spatiotemporal database of US congressional elections, 1896–2014

    PubMed Central

    Wolf, Levi John

    2017-01-01

    High-quality historical data about US Congressional elections have long provided common ground for electoral studies. However, advances in geographic information science have recently made it efficient to compile, distribute, and analyze large spatio-temporal data sets on the structure of US Congressional districts. A single spatio-temporal data set that relates US Congressional election results to the spatial extent of the constituencies has not yet been developed. To address this, existing high-quality data sets of election returns were combined with a spatio-temporal data set on Congressional district boundaries to generate a new spatio-temporal database of US Congressional election results that are explicitly linked to the geospatial data about the districts themselves. PMID:28809849

  6. Integrating diverse databases into a unified analysis framework: a Galaxy approach

    PubMed Central

    Blankenberg, Daniel; Coraor, Nathan; Von Kuster, Gregory; Taylor, James; Nekrutenko, Anton

    2011-01-01

    Recent technological advances have led to the ability to generate large amounts of data for model and non-model organisms. Whereas in the past a relatively small number of central repositories served genomic data, an increasing number of distinct, specialized data repositories and resources have been established. Here, we describe a generic approach that provides for the integration of a diverse spectrum of data resources into a unified analysis framework, Galaxy (http://usegalaxy.org). This approach allows the simplified coupling of external data resources with the data analysis tools available to Galaxy users, while leveraging the native data mining facilities of the external data resources. Database URL: http://usegalaxy.org PMID:21531983

  7. Immunoinformatics: an integrated scenario

    PubMed Central

    Tomar, Namrata; De, Rajat K

    2010-01-01

    Genome sequencing of humans and other organisms has led to the accumulation of huge amounts of data, which include immunologically relevant data. A large volume of clinical data has been deposited in several immunological databases and as a result immunoinformatics has emerged as an important field which acts as an intersection between experimental immunology and computational approaches. It not only helps in dealing with the huge amount of data but also plays a role in defining new hypotheses related to immune responses. This article reviews classical immunology, different databases and prediction tools. It also describes applications of immunoinformatics in designing in silico vaccination and immune system modelling. All these efforts save time and reduce cost. PMID:20722763

  8. Data model and relational database design for the New Jersey Water-Transfer Data System (NJWaTr)

    USGS Publications Warehouse

    Tessler, Steven

    2003-01-01

    The New Jersey Water-Transfer Data System (NJWaTr) is a database design for the storage and retrieval of water-use data. NJWaTr can manage data encompassing many facets of water use, including (1) the tracking of various types of water-use activities (withdrawals, returns, transfers, distributions, consumptive-use, wastewater collection, and treatment); (2) the storage of descriptions, classifications and locations of places and organizations involved in water-use activities; (3) the storage of details about measured or estimated volumes of water associated with water-use activities; and (4) the storage of information about data sources and water resources associated with water use. In NJWaTr, each water transfer occurs unidirectionally between two site objects, and the sites and conveyances form a water network. The core entities in the NJWaTr model are site, conveyance, transfer/volume, location, and owner. Other important entities include water resource (used for withdrawals and returns), data source, permit, and alias. Multiple water-exchange estimates based on different methods or data sources can be stored for individual transfers. Storage of user-defined details is accommodated for several of the main entities. Many tables contain classification terms to facilitate the detailed description of data items and can be used for routine or custom data summarization. NJWaTr accommodates single-user and aggregate-user water-use data, can be used for large or small water-network projects, and is available as a stand-alone Microsoft® Access database. Data stored in the NJWaTr structure can be retrieved in user-defined combinations to serve visualization and analytical applications. Users can customize and extend the database, link it to other databases, or implement the design in other relational database applications.
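
    The core entities translate naturally into relational tables, with each transfer pointing at a conveyance between two sites so that the sites and conveyances form a water network. The sketch below uses illustrative column choices, not the published NJWaTr schema.

        import sqlite3

        db = sqlite3.connect(":memory:")
        db.executescript("""
        CREATE TABLE site       (site_id INTEGER PRIMARY KEY, name TEXT, owner TEXT);
        CREATE TABLE conveyance (conveyance_id INTEGER PRIMARY KEY,
                                 from_site INTEGER REFERENCES site,
                                 to_site   INTEGER REFERENCES site);
        CREATE TABLE transfer   (conveyance_id INTEGER REFERENCES conveyance,
                                 volume_mgd REAL,   -- million gallons per day
                                 method TEXT);      -- measured vs estimated
        """)
        db.execute("INSERT INTO site VALUES (1, 'Well field A', 'Town of X')")
        db.execute("INSERT INTO site VALUES (2, 'Treatment plant', 'Town of X')")
        db.execute("INSERT INTO conveyance VALUES (10, 1, 2)")
        # Multiple volume estimates can be stored for the same transfer:
        db.execute("INSERT INTO transfer VALUES (10, 1.20, 'metered')")
        db.execute("INSERT INTO transfer VALUES (10, 1.15, 'estimated')")

        # Summarize metered flow into each site across the network.
        for row in db.execute("""
                SELECT s.name, SUM(t.volume_mgd)
                FROM transfer t
                JOIN conveyance c USING (conveyance_id)
                JOIN site s ON s.site_id = c.to_site
                WHERE t.method = 'metered'
                GROUP BY s.name"""):
            print(row)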

  9. A Review of Stellar Abundance Databases and the Hypatia Catalog Database

    NASA Astrophysics Data System (ADS)

    Hinkel, Natalie Rose

    2018-01-01

    The astronomical community is interested in elements from lithium to thorium, from solar twins to peculiarities of stellar evolution, because they give insight into different regimes of star formation and evolution. However, while some trends between elements and other stellar or planetary properties are well known, many other trends are not as obvious and are a point of conflict. For example, stars that host giant planets are found to be consistently enriched in iron, but the same cannot be definitively said for any other element. Therefore, it is time to take advantage of large stellar abundance databases in order to better understand not only the large-scale patterns, but also the more subtle, small-scale trends within the data. In this overview of the special session, I will present a review of large stellar abundance databases that are currently available (e.g. RAVE, APOGEE) and those that will soon be online (e.g. Gaia-ESO, GALAH). Additionally, I will discuss the Hypatia Catalog Database (www.hypatiacatalog.com), which includes abundances from individual literature sources that observed stars within 150 pc. The Hypatia Catalog currently contains 72 elements as measured within ~6000 stars, with a total of ~240,000 unique abundance determinations. The online database offers a variety of solar normalizations, stellar properties, and planetary properties (where applicable) that can all be viewed through multiple interactive plotting interfaces as well as in a tabular format. By analyzing stellar abundances for large populations of stars and from a variety of different perspectives, a wealth of information can be revealed on both large and small scales.

  10. DataHub knowledge based assistance for science visualization and analysis using large distributed databases

    NASA Technical Reports Server (NTRS)

    Handley, Thomas H., Jr.; Collins, Donald J.; Doyle, Richard J.; Jacobson, Allan S.

    1991-01-01

    Viewgraphs on DataHub knowledge based assistance for science visualization and analysis using large distributed databases. Topics covered include: DataHub functional architecture; data representation; logical access methods; preliminary software architecture; LinkWinds; data knowledge issues; expert systems; and data management.

  11. Exploring Large-Scale Cross-Correlation for Teleseismic and Regional Seismic Event Characterization

    NASA Astrophysics Data System (ADS)

    Dodge, Doug; Walter, William; Myers, Steve; Ford, Sean; Harris, Dave; Ruppert, Stan; Buttler, Dave; Hauk, Terri

    2013-04-01

    The decrease in costs of both digital storage space and computation power invites new methods of seismic data processing. At Lawrence Livermore National Laboratory (LLNL) we operate a growing research database of seismic events and waveforms for nuclear explosion monitoring and other applications. Currently the LLNL database contains several million events associated with tens of millions of waveforms at thousands of stations. We are making use of this database to explore the power of seismic waveform correlation to quantify signal similarities, to discover new events not in catalogs, and to more accurately locate events and identify source types. Building on the very efficient correlation methodologies of Harris and Dodge (2011) we computed the waveform correlation for event pairs in the LLNL database in two ways. First we performed entire waveform cross-correlation over seven distinct frequency bands. The correlation coefficient exceeds 0.6 for more than 40 million waveform pairs for several hundred thousand events at more than a thousand stations. These correlations reveal clusters of mining events and aftershock sequences, which can be used to readily identify and locate events. Second we determine relative pick times by correlating signals in time windows for distinct seismic phases. These correlated picks are then used to perform very high accuracy event relocations. We are examining the percentage of events that correlate as a function of magnitude and observing station distance in selected high seismicity regions. Combining these empirical results and those using synthetic data, we are working to quantify relationships between correlation and event pair separation (in epicenter and depth) as well as mechanism differences. Our exploration of these techniques on a large seismic database is in process and we will report on our findings in more detail at the meeting.
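
    The core measurement is a normalized cross-correlation whose peak gives both the similarity coefficient and the relative time shift used for pick alignment. The numpy sketch below uses a synthetic wavelet and illustrative parameters.

        import numpy as np

        def xcorr_max(a, b):
            """Normalized cross-correlation: (peak coefficient, lag in samples)."""
            a = (a - a.mean()) / (a.std() * len(a))
            b = (b - b.mean()) / b.std()
            cc = np.correlate(a, b, mode="full")    # coefficient at every lag
            lag = int(cc.argmax()) - (len(b) - 1)   # 0 means already aligned
            return cc.max(), lag

        # A windowed wavelet and a copy delayed by 40 samples stand in for the
        # same phase recorded with two different pick times.
        t = np.linspace(0.0, 1.0, 500)
        wave = np.exp(-((t - 0.3) / 0.05) ** 2) * np.sin(2 * np.pi * 25.0 * t)
        delayed = np.roll(wave, 40)

        coeff, lag = xcorr_max(wave, delayed)
        print(f"correlation {coeff:.2f} at lag {lag} samples")  # ~1.00, lag -40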

  12. Exploring Large-Scale Cross-Correlation for Teleseismic and Regional Seismic Event Characterization

    NASA Astrophysics Data System (ADS)

    Dodge, D.; Walter, W. R.; Myers, S. C.; Ford, S. R.; Harris, D.; Ruppert, S.; Buttler, D.; Hauk, T. F.

    2012-12-01

    The decrease in costs of both digital storage space and computation power invites new methods of seismic data processing. At Lawrence Livermore National Laboratory (LLNL) we operate a growing research database of seismic events and waveforms for nuclear explosion monitoring and other applications. Currently the LLNL database contains several million events associated with tens of millions of waveforms at thousands of stations. We are making use of this database to explore the power of seismic waveform correlation to quantify signal similarities, to discover new events not in catalogs, and to more accurately locate events and identify source types. Building on the very efficient correlation methodologies of Harris and Dodge (2011) we computed the waveform correlation for event pairs in the LLNL database in two ways. First we performed entire waveform cross-correlation over seven distinct frequency bands. The correlation coefficient exceeds 0.6 for more than 40 million waveform pairs for several hundred thousand events at more than a thousand stations. These correlations reveal clusters of mining events and aftershock sequences, which can be used to readily identify and locate events. Second we determine relative pick times by correlating signals in time windows for distinct seismic phases. These correlated picks are then used to perform very high accuracy event relocations. We are examining the percentage of events that correlate as a function of magnitude and observing station distance in selected high seismicity regions. Combining these empirical results and those using synthetic data, we are working to quantify relationships between correlation and event pair separation (in epicenter and depth) as well as mechanism differences. Our exploration of these techniques on a large seismic database is in process and we will report on our findings in more detail at the meeting.

  13. Large-scale mass spectrometric detection of variant peptides resulting from non-synonymous nucleotide differences

    PubMed Central

    Sheynkman, Gloria M.; Shortreed, Michael R.; Frey, Brian L.; Scalf, Mark; Smith, Lloyd M.

    2013-01-01

    Each individual carries thousands of non-synonymous single nucleotide variants (nsSNVs) in their genome, each corresponding to a single amino acid polymorphism (SAP) in the encoded proteins. It is important to be able to directly detect and quantify these variations at the protein level in order to study post-transcriptional regulation, differential allelic expression, and other important biological processes. However, such variant peptides are not generally detected in standard proteomic analyses, due to their absence from the generic databases that are employed for mass spectrometry searching. Here, we extend previous work that demonstrated the use of customized SAP databases constructed from sample-matched RNA-Seq data. We collected deep coverage RNA-Seq data from the Jurkat cell line, compiled the set of nsSNVs that are expressed, used this information to construct a customized SAP database, and searched it against deep coverage shotgun MS data obtained from the same sample. This approach enabled detection of 421 SAP peptides mapping to 395 nsSNVs. We compared these peptides to peptides identified from a large generic search database containing all known nsSNVs (dbSNP) and found that more than 70% of the SAP peptides from this dbSNP-derived search were not supported by the RNA-Seq data, and thus are likely false positives. Next, we increased the SAP coverage from the RNA-Seq derived database by utilizing multiple protease digestions, thereby increasing variant detection to 695 SAP peptides mapping to 504 nsSNV sites. These detected SAP peptides corresponded to moderate to high abundance transcripts (30+ transcripts per million, TPM). The SAP peptides included 192 allelic pairs; the relative expression levels of the two alleles were evaluated for 51 of those pairs, and found to be comparable in all cases. PMID:24175627
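
    A minimal sketch of the database-construction idea, not the authors' pipeline: given one protein and one amino-acid substitution, generate the tryptic peptides covering the SAP site that a customized search database would need to include. The sequence, the variant position, and the naive digestion rule are illustrative assumptions.

      import re

      def tryptic_peptides(protein):
          """Naive in-silico trypsin digest: cleave after K or R, not before P."""
          return [p for p in re.split(r"(?<=[KR])(?!P)", protein) if p]

      def sap_peptides(protein, pos, alt):
          """Digest peptides that contain a single amino acid polymorphism (SAP).

          pos is a 0-based residue index; alt is the variant residue. Only the
          peptides covering the SAP site need to be added to a search database.
          """
          variant = protein[:pos] + alt + protein[pos + 1:]
          peptides, start = [], 0
          for pep in tryptic_peptides(variant):
              if start <= pos < start + len(pep):
                  peptides.append(pep)
              start += len(pep)
          return peptides

      # Hypothetical protein with an A->V substitution at residue 6.
      print(sap_peptides("MKTAYIAKQRQISFVKSHFSR", 6, "V"))   # ['TAYIVK']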

  14. Toxicity tests aiming to protect Brazilian aquatic systems: current status and implications for management.

    PubMed

    Martins, Samantha Eslava; Bianchini, Adalto

    2011-07-01

    The current status of toxicological tests performed with Brazilian native species was evaluated through a survey of the scientific data available in the literature. The information gathered was processed and an electronic toxicology database (http://www.inct-ta.furg.br/bd_toxicologico.php) was generated. This database provides valuable information for researchers to identify aquatic species that are sensitive or tolerant to a large variety of aquatic pollutants. Furthermore, the toxicology database allows researchers to select species representative of an ecosystem of interest. Analysis of the toxicology database showed that ecotoxicological assays have significantly improved in Brazil over the last decade, in spite of the still relatively low number of tests performed and the restricted number of native species tested. This is because most of the research is conducted in a few laboratories concentrated in certain regions of Brazil, especially the Southern and Southeastern regions. Considering the extremely rich biodiversity and the large variety of aquatic ecosystems in Brazil, this finding points to the urgent need for ecotoxicological studies of other groups of aquatic animals, such as insects, foraminifera, cnidarians, worms and amphibians, among others. This would help to derive more realistic water quality criteria (WQC) values, which would better protect the different aquatic ecosystems in Brazil. Finally, the toxicology database presents solid, science-based information, which can encourage and drive the Environmental Regulatory Agencies in Brazil to derive WQC based on native species. In this context, the present paper discusses the historical evolution of ecotoxicological studies in Brazil and how they have contributed to the improvement of Brazilian federal and regional environmental regulations.

  15. Comparative Evaluation of Registration Algorithms in Different Brain Databases With Varying Difficulty: Results and Insights

    PubMed Central

    Akbari, Hamed; Bilello, Michel; Da, Xiao; Davatzikos, Christos

    2015-01-01

    Evaluating various algorithms for the inter-subject registration of brain magnetic resonance images (MRI) is a necessary topic receiving growing attention. Existing studies evaluated image registration algorithms in specific tasks or using specific databases (e.g., only for skull-stripped images, only for single-site images, etc.). Consequently, the choice of registration algorithms seems task- and usage/parameter-dependent. Nevertheless, recent large-scale, often multi-institutional imaging-related studies create the need and raise the question whether some registration algorithms can 1) generally apply to various tasks/databases posing various challenges; 2) perform consistently well, and while doing so, 3) require minimal or ideally no parameter tuning. In seeking answers to this question, we evaluated 12 general-purpose registration algorithms, for their generality, accuracy and robustness. We fixed their parameters at values suggested by algorithm developers as reported in the literature. We tested them in 7 databases/tasks, which present one or more of 4 commonly-encountered challenges: 1) inter-subject anatomical variability in skull-stripped images; 2) intensity inhomogeneity, noise and large structural differences in raw images; 3) imaging protocol and field-of-view (FOV) differences in multi-site data; and 4) missing correspondences in pathology-bearing images. In total, 7,562 registrations were performed. Registration accuracies were measured by (multi-)expert-annotated landmarks or regions of interest (ROIs). To ensure reproducibility, we used public software tools, public databases (whenever possible), and we fully disclose the parameter settings. We show evaluation results, and discuss the performances in light of algorithms' similarity metrics, transformation models and optimization strategies. We also discuss future directions for algorithm development and evaluations. PMID:24951685
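
    A sketch of how landmark-based registration accuracy of the kind described above can be scored, assuming hypothetical landmark coordinates; the study's actual evaluation protocols are considerably richer than this.

      import numpy as np

      def mean_landmark_error(warped, reference):
          """Mean Euclidean distance (in mm) between corresponding landmarks.

          warped:    (N, 3) landmark coordinates after registration
          reference: (N, 3) expert-annotated target coordinates
          """
          return float(np.linalg.norm(warped - reference, axis=1).mean())

      # Hypothetical landmarks for one subject pair, coordinates in mm.
      ref = np.array([[10.0, 22.0, 31.0], [45.0, 12.0, 60.0], [33.0, 40.0, 18.0]])
      warped = ref + np.array([[0.5, -0.3, 0.2], [1.1, 0.4, -0.8], [-0.2, 0.9, 0.6]])
      print(f"mean landmark error: {mean_landmark_error(warped, ref):.2f} mm")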

  16. Drug2Gene: an exhaustive resource to explore effectively the drug-target relation network.

    PubMed

    Roider, Helge G; Pavlova, Nadia; Kirov, Ivaylo; Slavov, Stoyan; Slavov, Todor; Uzunov, Zlatyo; Weiss, Bertram

    2014-03-11

    Information about drug-target relations is at the heart of drug discovery. There are now dozens of databases providing drug-target interaction data with varying scope and focus. Because of this, and due to the large chemical space, the overlap of the different data sets is surprisingly small. As searching through these sources manually is cumbersome, time-consuming and error-prone, integrating all the data is highly desirable. Despite a few attempts, integration has been hampered by the diversity of descriptions of compounds and by the fact that the reported activity values, coming from different data sets, are not always directly comparable due to the use of different metrics or data formats. We have built Drug2Gene, a knowledge base which combines the compound/drug-gene/protein information from 19 publicly available databases. A key feature is our rigorous unification and standardization process, which makes the data truly comparable on a large scale, allowing for the first time effective data mining in such a large knowledge corpus. As of version 3.2, Drug2Gene contains 4,372,290 unified relations between compounds and their targets, most of which include reported bioactivity data. We extend this set with putative (i.e. homology-inferred) relations where sufficient sequence homology between proteins suggests they may bind similar compounds. Drug2Gene provides powerful search functionalities, very flexible export procedures, and a user-friendly web interface. Drug2Gene v3.2 has become a mature and comprehensive knowledge base providing unified, standardized drug-target related information gathered from publicly available data sources. It can be used to integrate proprietary data sets with publicly available data sets. Its main goal is to be a 'one-stop shop' for identifying tool compounds targeting a given gene product or for finding all known targets of a drug. Drug2Gene, with its integrated set of public compound-target relations, is freely accessible without restrictions at http://www.drug2gene.com.

  17. MapReduce implementation of a hybrid spectral library-database search method for large-scale peptide identification.

    PubMed

    Kalyanaraman, Ananth; Cannon, William R; Latt, Benjamin; Baxter, Douglas J

    2011-11-01

    A MapReduce-based implementation called MR-MSPolygraph for parallelizing peptide identification from mass spectrometry data is presented. The underlying serial method, MSPolygraph, uses a novel hybrid approach to match an experimental spectrum against a combination of a protein sequence database and a spectral library. Our MapReduce implementation can run on any Hadoop cluster environment. Experimental results demonstrate that, relative to the serial version, MR-MSPolygraph reduces the time to solution from weeks to hours for processing tens of thousands of experimental spectra. Speedup and other related performance studies are also reported on a 400-core Hadoop cluster using spectral datasets from environmental microbial communities as inputs. The source code, along with user documentation, is available at http://compbio.eecs.wsu.edu/MR-MSPolygraph. Contact: ananth@eecs.wsu.edu; william.cannon@pnnl.gov. Supplementary data are available at Bioinformatics online.
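
    The MapReduce decomposition described above can be illustrated, very loosely, with Python's standard multiprocessing pool: spectra are partitioned across workers (the map step) and the best hits are collected (the reduce step). The scoring function and peak lists below are hypothetical stand-ins, not MSPolygraph's hybrid scoring, and this is not the Hadoop implementation.

      from multiprocessing import Pool

      # Stand-in similarity score; the real hybrid scoring is far richer.
      def score(spectrum, candidate):
          return len(set(spectrum) & set(candidate)) / max(len(candidate), 1)

      def best_match(args):
          """Map step: score one experimental spectrum against all candidates."""
          spectrum, candidates = args
          return max(((score(spectrum, c), name) for name, c in candidates.items()))

      if __name__ == "__main__":
          # Hypothetical peak lists (m/z values rounded to integers).
          candidates = {"PEPTIDE_A": [114, 227, 341], "PEPTIDE_B": [99, 186, 300]}
          spectra = [[114, 227, 350], [99, 186, 301], [115, 225, 341]]
          with Pool(2) as pool:   # partitions the spectra across workers
              # The "reduce" here is simply collecting one best hit per spectrum.
              hits = pool.map(best_match, [(s, candidates) for s in spectra])
          for s, hit in zip(spectra, hits):
              print(s, "->", hit)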

  18. MOPED 2.5—An Integrated Multi-Omics Resource: Multi-Omics Profiling Expression Database Now Includes Transcriptomics Data

    PubMed Central

    Montague, Elizabeth; Stanberry, Larissa; Higdon, Roger; Janko, Imre; Lee, Elaine; Anderson, Nathaniel; Choiniere, John; Stewart, Elizabeth; Yandl, Gregory; Broomall, William; Kolker, Natali

    2014-01-01

    Multi-omics data-driven scientific discovery crucially rests on high-throughput technologies and data sharing. Currently, data are scattered across single omics repositories, stored in varying raw and processed formats, and are often accompanied by limited or no metadata. The Multi-Omics Profiling Expression Database (MOPED, http://moped.proteinspire.org) version 2.5 is a freely accessible multi-omics expression database. Continual improvement and expansion of MOPED is driven by feedback from the Life Sciences Community. In order to meet the emergent need for an integrated multi-omics data resource, MOPED 2.5 now includes gene relative expression data in addition to protein absolute and relative expression data from over 250 large-scale experiments. To facilitate accurate integration of experiments and increase reproducibility, MOPED provides extensive metadata through the Data-Enabled Life Sciences Alliance (DELSA Global, http://delsaglobal.org) metadata checklist. MOPED 2.5 has greatly increased the number of proteomics absolute and relative expression records to over 500,000, in addition to adding more than four million transcriptomics relative expression records. MOPED has an intuitive user interface with tabs for querying different types of omics expression data and new tools for data visualization. Summary information including expression data, pathway mappings, and direct connection between proteins and genes can be viewed on Protein and Gene Details pages. These connections in MOPED provide a context for multi-omics expression data exploration. Researchers are encouraged to submit omics data which will be consistently processed into expression summaries. MOPED as a multi-omics data resource is a pivotal public database, interdisciplinary knowledge resource, and platform for multi-omics understanding. PMID:24910945

  19. Sports medicine clinical trial research publications in academic medical journals between 1996 and 2005: an audit of the PubMed MEDLINE database.

    PubMed

    Nichols, A W

    2008-11-01

    To identify sports medicine-related clinical trial research articles in the PubMed MEDLINE database published between 1996 and 2005 and conduct a review and analysis of topics of research, experimental designs, journals of publication and the internationality of authorships. Sports medicine research is international in scope, with improving study methodology and an evolution of topics. Structured review of articles identified in a search of a large electronic medical database. PubMed MEDLINE database. Sports medicine-related clinical research trials published between 1996 and 2005. Review and analysis of articles that met inclusion criteria. Articles were examined for study topics, research methods, experimental subject characteristics, journal of publication, lead authors and journal countries of origin, and language of publication. The search retrieved 414 articles, of which 379 (345 English language and 34 non-English language) met the inclusion criteria. The number of publications increased steadily during the study period. Randomised clinical trials were the most common study type, and the "diagnosis, management and treatment of sports-related injuries and conditions" was the most popular study topic. The knee, ankle/foot and shoulder were the most frequent anatomical sites of study. Soccer players and runners were the favourite study subjects. The American Journal of Sports Medicine had the highest number of publications and shared the greatest international diversity of authorships with the British Journal of Sports Medicine. The USA, Australia, Germany and the UK produced a large share of the lead authorships. In all, 91% of articles and 88% of journals were published in English. Sports medicine-related research is internationally diverse, clinical trial publications are increasing and the sophistication of research design may be improving.

  20. MEGALEX: A megastudy of visual and auditory word recognition.

    PubMed

    Ferrand, Ludovic; Méot, Alain; Spinelli, Elsa; New, Boris; Pallier, Christophe; Bonin, Patrick; Dufau, Stéphane; Mathôt, Sebastiaan; Grainger, Jonathan

    2018-06-01

    Using the megastudy approach, we report a new database (MEGALEX) of visual and auditory lexical decision times and accuracy rates for tens of thousands of words. We collected visual lexical decision data for 28,466 French words and the same number of pseudowords, and auditory lexical decision data for 17,876 French words and the same number of pseudowords (synthesized tokens were used for the auditory modality). This constitutes the first large-scale database for auditory lexical decision, and the first database to enable a direct comparison of word recognition in different modalities. Different regression analyses were conducted to illustrate potential ways to exploit this megastudy database. First, we compared the proportions of variance accounted for by five word frequency measures. Second, we conducted item-level regression analyses to examine the relative importance of the lexical variables influencing performance in the different modalities (visual and auditory). Finally, we compared the similarities and differences between the two modalities. All data are freely available on our website (https://sedufau.shinyapps.io/megalex/) and are searchable at www.lexique.org, inside the Open Lexique search engine.
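
    A minimal sketch of an item-level regression of the kind mentioned above, using ordinary least squares in NumPy on hypothetical word frequency, length, and response-time values; the actual analyses involve many more items and predictors.

      import numpy as np

      # Hypothetical item-level data: log10 word frequency, length, mean RT (ms).
      log_freq = np.array([0.5, 1.2, 2.0, 2.8, 3.5])
      length = np.array([9, 7, 6, 5, 4])
      rt = np.array([820.0, 765.0, 700.0, 655.0, 610.0])

      # Design matrix with an intercept; solve ordinary least squares.
      X = np.column_stack([np.ones_like(log_freq), log_freq, length])
      coef, *_ = np.linalg.lstsq(X, rt, rcond=None)
      pred = X @ coef
      r2 = 1 - ((rt - pred) ** 2).sum() / ((rt - rt.mean()) ** 2).sum()
      print("intercept, freq, length:", np.round(coef, 1), " R^2:", round(r2, 3))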

  1. Using GenBank.

    PubMed

    Wheeler, David

    2007-01-01

    GenBank® is a comprehensive database of publicly available DNA sequences for more than 205,000 named organisms and for more than 60,000 within the embryophyta, obtained through submissions from individual laboratories and batch submissions from large-scale sequencing projects. Daily data exchange with the European Molecular Biology Laboratory (EMBL) in Europe and the DNA Data Bank of Japan ensures worldwide coverage. GenBank is accessible through the National Center for Biotechnology Information (NCBI) retrieval system, Entrez, which integrates data from the major DNA and protein sequence databases with taxonomy, genome, mapping, protein structure, and domain information and the biomedical journal literature through PubMed. BLAST provides sequence similarity searches of GenBank and other sequence databases. Complete bimonthly releases and daily updates of the GenBank database are available through FTP. GenBank usage scenarios ranging from local analyses of the data available through FTP to online analyses supported by the NCBI Web-based tools are discussed. To access GenBank and its related retrieval and analysis services, go to the NCBI Homepage at http://www.ncbi.nlm.nih.gov.

  2. GenBank.

    PubMed

    Benson, Dennis A; Karsch-Mizrachi, Ilene; Lipman, David J; Ostell, James; Sayers, Eric W

    2011-01-01

    GenBank® is a comprehensive database that contains publicly available nucleotide sequences for more than 380,000 organisms named at the genus level or lower, obtained primarily through submissions from individual laboratories and batch submissions from large-scale sequencing projects, including whole genome shotgun (WGS) and environmental sampling projects. Most submissions are made using the web-based BankIt or standalone Sequin programs, and accession numbers are assigned by GenBank staff upon receipt. Daily data exchange with the European Nucleotide Archive (ENA) and the DNA Data Bank of Japan (DDBJ) ensures worldwide coverage. GenBank is accessible through the NCBI Entrez retrieval system that integrates data from the major DNA and protein sequence databases along with taxonomy, genome, mapping, protein structure and domain information, and the biomedical journal literature via PubMed. BLAST provides sequence similarity searches of GenBank and other sequence databases. Complete bimonthly releases and daily updates of the GenBank database are available by FTP. To access GenBank and its related retrieval and analysis services, begin at the NCBI Homepage: www.ncbi.nlm.nih.gov.
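
    Both GenBank records above describe access through the Entrez retrieval system. As one hedged illustration (not part of either record), the third-party Biopython package can fetch a single GenBank record programmatically; the e-mail address is a placeholder, the accession is only an example, and the call requires network access to NCBI.

      from Bio import Entrez, SeqIO

      Entrez.email = "you@example.org"   # NCBI asks for a contact address

      # Fetch one GenBank record by accession (example accession shown).
      handle = Entrez.efetch(db="nucleotide", id="NM_000546",
                             rettype="gb", retmode="text")
      record = SeqIO.read(handle, "genbank")
      handle.close()

      print(record.id, record.description)
      print(len(record.seq), "bp,", len(record.features), "features")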

  3. Distributed data collection for a database of radiological image interpretations

    NASA Astrophysics Data System (ADS)

    Long, L. Rodney; Ostchega, Yechiam; Goh, Gin-Hua; Thoma, George R.

    1997-01-01

    The National Library of Medicine, in collaboration with the National Center for Health Statistics and the National Institute for Arthritis and Musculoskeletal and Skin Diseases, has built a system for collecting radiological interpretations for a large set of x-ray images acquired as part of the data gathered in the second National Health and Nutrition Examination Survey. This system is capable of delivering across the Internet 5- and 10-megabyte x-ray images to Sun workstations equipped with X Window-based 2048 x 2560 image displays, for the purpose of having these images interpreted for the degree of presence of particular osteoarthritic conditions in the cervical and lumbar spines. The collected interpretations can then be stored in a database at the National Library of Medicine, under control of the Illustra DBMS. This system is a client/server database application which integrates (1) distributed server processing of client requests, (2) a customized image transmission method for faster Internet data delivery, (3) distributed client workstations with high-resolution displays, image processing functions and an on-line digital atlas, and (4) relational database management of the collected data.

  4. The crustal dynamics intelligent user interface anthology

    NASA Technical Reports Server (NTRS)

    Short, Nicholas M., Jr.; Campbell, William J.; Roelofs, Larry H.; Wattawa, Scott L.

    1987-01-01

    The National Space Science Data Center (NSSDC) has initiated an Intelligent Data Management (IDM) research effort which has, as one of its components, the development of an Intelligent User Interface (IUI). The intent of the IUI is to develop a friendly and intelligent user interface service based on expert systems and natural language processing technologies. The purpose of such a service is to support the large number of potential scientific and engineering users that have need of space and land-related research and technical data, but have little or no experience in query languages or understanding of the information content or architecture of the databases of interest. This document presents the design concepts, development approach and evaluation of the performance of a prototype IUI system for the Crustal Dynamics Project Database, which was developed using a microcomputer-based expert system tool (M.1), the natural language query processor THEMIS, and the graphics software system GSS. The IUI design is based on a multiple-view representation of a database from both the user and database perspective, with intelligent processes to translate between the views.

  5. Brief Report: The Negev Hospital-University-Based (HUB) Autism Database

    ERIC Educational Resources Information Center

    Meiri, Gal; Dinstein, Ilan; Michaelowski, Analya; Flusser, Hagit; Ilan, Michal; Faroy, Michal; Bar-Sinai, Asif; Manelis, Liora; Stolowicz, Dana; Yosef, Lili Lea; Davidovitch, Nadav; Golan, Hava; Arbelle, Shosh; Menashe, Idan

    2017-01-01

    Elucidating the heterogeneous etiologies of autism will require investment in comprehensive longitudinal data acquisition from large community based cohorts. With this in mind, we have established a hospital-university-based (HUB) database of autism which incorporates prospective and retrospective data from a large and ethnically diverse…

  6. Improving Decisions with Data

    ERIC Educational Resources Information Center

    Johnson, Doug

    2004-01-01

    Schools gather, store and use an increasingly large amount of data. Keeping track of everything from bus routes to building access codes to test scores to sports equipment is done with the help of electronic database programs. Large databases designed for budgeting and student record keeping have long been an integral part of the educational…

  7. Relational Database for the Geology of the Northern Rocky Mountains - Idaho, Montana, and Washington

    USGS Publications Warehouse

    Causey, J. Douglas; Zientek, Michael L.; Bookstrom, Arthur A.; Frost, Thomas P.; Evans, Karl V.; Wilson, Anna B.; Van Gosen, Bradley S.; Boleneus, David E.; Pitts, Rebecca A.

    2008-01-01

    A relational database was created to prepare and organize geologic map-unit and lithologic descriptions for input into a spatial database for the geology of the northern Rocky Mountains, a compilation of forty-three geologic maps for parts of Idaho, Montana, and Washington in U.S. Geological Survey Open File Report 2005-1235. Not all of the information was transferred to and incorporated in the spatial database due to physical file limitations. This report releases that part of the relational database that was completed for that earlier product. In addition to descriptive geologic information for the northern Rocky Mountains region, the relational database contains a substantial bibliography of geologic literature for the area. The relational database nrgeo.mdb (linked below) is available in Microsoft Access version 2000, a proprietary database program. The relational database contains data tables and other tables used to define terms, relationships between the data tables, and hierarchical relationships in the data; forms used to enter data; and queries used to extract data.

  8. A MySQL Based EPICS Archiver

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Christopher Slominski

    2009-10-01

    Archiving a large fraction of the EPICS signals within the Jefferson Lab (JLAB) Accelerator control system is vital for postmortem and real-time analysis of the accelerator performance. This analysis is performed on a daily basis by scientists, operators, engineers, technicians, and software developers. Archiving poses unique challenges due to the magnitude of the control system. A MySQL Archiving system (Mya) was developed to scale to the needs of the control system; currently archiving 58,000 EPICS variables, updating at a rate of 11,000 events per second. In addition to the large collection rate, retrieval of the archived data must also be fast and robust. Archived data retrieval clients obtain data at a rate over 100,000 data points per second. Managing the data in a relational database provides a number of benefits. This paper describes an archiving solution that uses an open source database and standard off-the-shelf hardware to reach high performance archiving needs. Mya has been in production at Jefferson Lab since February of 2007.
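
    As an illustration of the relational layout such an archiver might use (not the actual Mya schema), a channel table plus an event table keyed by channel and timestamp supports the kind of time-window retrieval described above. SQLite stands in for MySQL here so the sketch is self-contained; all names are hypothetical.

      import sqlite3, time

      # Illustrative schema only -- not the actual Mya table layout.
      db = sqlite3.connect(":memory:")
      db.executescript("""
          CREATE TABLE channel (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
          CREATE TABLE event (
              channel_id INTEGER REFERENCES channel(id),
              t REAL,             -- UNIX timestamp of the update
              value REAL,
              PRIMARY KEY (channel_id, t)
          );
      """)
      db.execute("INSERT INTO channel(name) VALUES ('BPM01:X')")
      now = time.time()
      db.executemany(
          "INSERT INTO event VALUES ((SELECT id FROM channel WHERE name='BPM01:X'), ?, ?)",
          [(now + i, 0.1 * i) for i in range(5)],
      )

      # Typical retrieval: all samples of one signal in a time window, in order.
      rows = db.execute("""
          SELECT e.t, e.value FROM event e
          JOIN channel c ON c.id = e.channel_id
          WHERE c.name = 'BPM01:X' AND e.t BETWEEN ? AND ?
          ORDER BY e.t
      """, (now, now + 3)).fetchall()
      print(rows)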

  9. Effects of Childhood and Middle-Adulthood Family Conditions on Later-Life Mortality: Evidence from the Utah Population Database, 1850-2002

    PubMed Central

    Mineau, Geraldine P.; Garibotti, Gilda; Kerber, Richard

    2014-01-01

    We examine how key early family circumstances affect mortality risks decades later. Early life conditions are measured by parental mortality, parental fertility (e.g., offspring sibship size, parental age at offspring birth), religious upbringing, and parental socioeconomic status. Prior to these early life conditions are familial and genetic factors that affect life-span. Accordingly, we consider the role of parental and familial longevity on adult mortality risks. We analyze the large Utah Population Database which contains a vast amount of genealogical and other vital/health data that contain full life histories of individuals and hundreds of their relatives. To control for unobserved heterogeneity, we analyze sib-pair data for 12,000 sib-pairs using frailty models. We found modest effects of key childhood conditions (birth order, sibship size, parental religiosity, parental SES, and parental death in childhood). Our measures of familial aggregation of longevity were large and suggest an alternative view of early life conditions. PMID:19278766

  10. The HARPS-N archive through a Cassandra, NoSQL database suite?

    NASA Astrophysics Data System (ADS)

    Molinari, Emilio; Guerra, Jose; Harutyunyan, Avet; Lodi, Marcello; Martin, Adrian

    2016-07-01

    The TNG-INAF is developing the science archive for the WEAVE instrument. The underlying architecture of the archive is based on a non-relational database, more precisely on an Apache Cassandra cluster, which uses a NoSQL technology. In order to test and validate the use of this architecture, we created a local archive which we populated with all the HARPS-N spectra collected at the TNG since the instrument's start of operations in mid-2012, and developed tools for the analysis of this data set. The HARPS-N data set is two orders of magnitude smaller than WEAVE, but we want to demonstrate the ability to walk through a complete data set and produce scientific output as valuable as that produced by an ordinary pipeline, though without directly accessing the FITS files. The analytics are performed with Apache Solr and Spark and on a relational PostgreSQL database. As an example, we produce observables such as metallicity indexes for the targets in the archive and compare the results with those coming from the HARPS-N regular data reduction software. The aim of this experiment is to explore the viability of a high availability cluster and distributed NoSQL database as a platform for complex scientific analytics on a large data set, which will then be ported to the WEAVE Archive System (WAS) that we are developing for the WEAVE multi-object fiber spectrograph.

  11. Cyclebase 3.0: a multi-organism database on cell-cycle regulation and phenotypes.

    PubMed

    Santos, Alberto; Wernersson, Rasmus; Jensen, Lars Juhl

    2015-01-01

    The eukaryotic cell division cycle is a highly regulated process that consists of a complex series of events and involves thousands of proteins. Researchers have studied the regulation of the cell cycle in several organisms, employing a wide range of high-throughput technologies, such as microarray-based mRNA expression profiling and quantitative proteomics. Due to its complexity, the cell cycle can also fail or otherwise change in many different ways if important genes are knocked out, which has been studied in several microscopy-based knockdown screens. The data from these many large-scale efforts are not easily accessed, analyzed and combined due to their inherent heterogeneity. To address this, we have created Cyclebase--available at http://www.cyclebase.org--an online database that allows users to easily visualize and download results from genome-wide cell-cycle-related experiments. In Cyclebase version 3.0, we have updated the content of the database to reflect changes to genome annotation, added new mRNA and protein expression data, and integrated cell-cycle phenotype information from high-content screens and model-organism databases. The new version of Cyclebase also features a new web interface, designed around an overview figure that summarizes all the cell-cycle-related data for a gene. © The Author(s) 2014. Published by Oxford University Press on behalf of Nucleic Acids Research.

  12. Teaching Advanced SQL Skills: Text Bulk Loading

    ERIC Educational Resources Information Center

    Olsen, David; Hauser, Karina

    2007-01-01

    Studies show that advanced database skills are important for students to be prepared for today's highly competitive job market. A common task for database administrators is to insert a large amount of data into a database. This paper illustrates how an up-to-date, advanced database topic, namely bulk insert, can be incorporated into a database…

  13. Sagace: A web-based search engine for biomedical databases in Japan

    PubMed Central

    2012-01-01

    Background In the big data era, biomedical research continues to generate a large amount of data, and the generated information is often stored in a database and made publicly available. Although combining data from multiple databases should accelerate further studies, the current number of life sciences databases is too large for researchers to grasp the features and contents of each one. Findings We have developed Sagace, a web-based search engine that enables users to retrieve information from a range of biological databases (such as gene expression profiles and proteomics data) and biological resource banks (such as mouse models of disease and cell lines). With Sagace, users can search more than 300 databases in Japan. Sagace offers features tailored to biomedical research, including manually tuned ranking, faceted navigation to refine search results, and rich snippets constructed with retrieved metadata for each database entry. Conclusions Sagace will be valuable for experts who are involved in biomedical research and drug development in both academia and industry. Sagace is freely available at http://sagace.nibio.go.jp/en/. PMID:23110816

  14. Large-scale patterns of insect and disease activity in the conterminous United States and Alaska from the National Insect and Disease Detection Survey Database, 2010

    Treesearch

    Kevin M. Potter; Jeanine L. Paschke

    2013-01-01

    Analyzing patterns of forest pest infestations, disease occurrences, forest declines and related biotic stress factors is necessary to monitor the health of forested ecosystems and their potential impacts on forest structure, composition, biodiversity, and species distributions (Castello and others 1995). Introduced nonnative insects and diseases, in particular, can...

  15. Feedback Conversations: Creating Feedback Dialogues with a New Textual Tool for Industrial Design Student Feedback

    ERIC Educational Resources Information Center

    Funk, Mathias; van Diggelen, Migchiel

    2017-01-01

    In this paper, the authors describe how a study of a large database of written university teacher feedback in the department of Industrial Design led to the development of a new conceptual framework for feedback and the design of a new feedback tool. This paper focuses on the translation of related work in the area of feedback mechanisms for…

  16. Stability and Change in Interests: A Longitudinal Study of Adolescents from Grades 8 through 12

    ERIC Educational Resources Information Center

    Tracey, Terence J. G.; Robbins, Steven B.; Hofsess, Christy D.

    2005-01-01

    The patterns of RIASEC interests and academic skills were assessed longitudinally in a large-scale national database at three time points: 8th grade, 10th grade, and 12th grade. Validation and cross-validation samples of 1000 males and 1000 females in each set were used to test the pattern of these scores over time relative to mean changes,…

  17. Migration from relational to NoSQL database

    NASA Astrophysics Data System (ADS)

    Ghotiya, Sunita; Mandal, Juhi; Kandasamy, Saravanakumar

    2017-11-01

    Data generated by real-time applications, social networking sites and sensor devices is huge in volume and largely unstructured, which makes it difficult for relational database management systems to handle. Data is a precious component of any application and needs to be analysed after being arranged in some structure. Relational databases can only deal with structured data, so there is a need for NoSQL database management systems, which can also deal with semi-structured data. Relational databases provide the easiest way to manage data, but as the use of NoSQL increases it is becoming necessary to migrate data from relational to NoSQL databases. Various frameworks have been proposed that provide mechanisms for migrating data stored in SQL warehouses, as well as middle-layer solutions that allow unstructured data to be stored in NoSQL databases. This paper provides a literature review of some of the recent approaches proposed by various researchers to migrate data from relational to NoSQL databases. Some researchers have proposed mechanisms for the coexistence of NoSQL and relational databases. The paper summarises mechanisms that can be used for mapping data stored in relational databases to NoSQL databases, along with various techniques for data transformation and middle-layer solutions.
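
    One mapping common to several of the surveyed approaches is denormalization: each parent row and its child rows become a single document. A minimal sketch with a hypothetical customer/orders schema, using SQLite as the relational source and JSON documents as the target shape a document store would accept:

      import json, sqlite3

      # Hypothetical relational source: customers and their orders.
      src = sqlite3.connect(":memory:")
      src.executescript("""
          CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);
          CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
          INSERT INTO customer VALUES (1, 'Ada'), (2, 'Lin');
          INSERT INTO orders VALUES (10, 1, 25.0), (11, 1, 40.0), (12, 2, 9.5);
      """)

      # Denormalize: embed each customer's orders in one document, the shape
      # a document store (e.g. MongoDB) would accept.
      docs = []
      for cid, name in src.execute("SELECT id, name FROM customer"):
          orders = [{"order_id": oid, "total": total}
                    for oid, total in src.execute(
                        "SELECT id, total FROM orders WHERE customer_id = ?", (cid,))]
          docs.append({"_id": cid, "name": name, "orders": orders})

      print(json.dumps(docs, indent=2))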

  18. Access to Emissions Distributions and Related Ancillary Data through the ECCAD database

    NASA Astrophysics Data System (ADS)

    Darras, Sabine; Granier, Claire; Liousse, Catherine; De Graaf, Erica; Enriquez, Edgar; Boulanger, Damien; Brissebrat, Guillaume

    2017-04-01

    The ECCAD database (Emissions of atmospheric Compounds and Compilation of Ancillary Data) provides user-friendly access to global and regional surface emissions for a large set of chemical compounds and to ancillary data (land use, active fires, burned areas, population, etc.). The emission inventories are time-series gridded data at spatial resolutions from 1x1 to 0.1x0.1 degrees. ECCAD is the emissions database of the GEIA (Global Emissions InitiAtive) project and a sub-project of the French Atmospheric Data Center AERIS (http://www.aeris-data.fr). ECCAD currently has more than 2,200 users originating from more than 80 countries. The project benefits from this large international community of users to expand the number of emission datasets made available. ECCAD provides detailed metadata for each of the datasets and various tools for data visualization, for computing global and regional totals, and for interactive spatial and temporal analysis. The data can be downloaded as interoperable NetCDF CF-compliant files, i.e. the data are compatible with many other client interfaces. The presentation will provide information on the datasets available within ECCAD, as well as examples of the analysis work that can be done online through the website: http://eccad.aeris-data.fr.
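
    Since the distributed files are CF-compliant NetCDF, a downloaded inventory can be read with standard tools. The sketch below uses the third-party netCDF4 package with a hypothetical file name and variable names; actual ECCAD files declare their own CF-compliant names and units in their metadata.

      from netCDF4 import Dataset
      import numpy as np

      # Hypothetical file and variable names (illustration only).
      with Dataset("emissions_co_2016.nc") as nc:
          lat = nc.variables["lat"][:]
          lon = nc.variables["lon"][:]
          emis = nc.variables["emission"][:]        # e.g. (time, lat, lon)
          units = nc.variables["emission"].units

      # Area-naive global sum for the first time step; a real total must
      # weight each grid cell by its true surface area.
      print("grid:", emis.shape, "units:", units)
      print("unweighted sum, t=0:", float(np.sum(emis[0])))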

  19. Access to Emissions Distributions and Related Ancillary Data through the ECCAD database

    NASA Astrophysics Data System (ADS)

    Darras, Sabine; Enriquez, Edgar; Granier, Claire; Liousse, Catherine; Boulanger, Damien; Fontaine, Alain

    2016-04-01

    The ECCAD database (Emissions of atmospheric Compounds and Compilation of Ancillary Data) provides user-friendly access to global and regional surface emissions for a large set of chemical compounds and to ancillary data (land use, active fires, burned areas, population, etc.). The emission inventories are time-series gridded data at spatial resolutions from 1x1 to 0.1x0.1 degrees. ECCAD is the emissions database of the GEIA (Global Emissions InitiAtive) project and a sub-project of the French Atmospheric Data Center AERIS (http://www.aeris-data.fr). ECCAD currently has more than 2,200 users originating from more than 80 countries. The project benefits from this large international community of users to expand the number of emission datasets made available. ECCAD provides detailed metadata for each of the datasets and various tools for data visualization, for computing global and regional totals, and for interactive spatial and temporal analysis. The data can be downloaded as interoperable NetCDF CF-compliant files, i.e. the data are compatible with many other client interfaces. The presentation will provide information on the datasets available within ECCAD, as well as examples of the analysis work that can be done online through the website: http://eccad.aeris-data.fr.

  20. The European Southern Observatory-MIDAS table file system

    NASA Technical Reports Server (NTRS)

    Peron, M.; Grosbol, P.

    1992-01-01

    The new and substantially upgraded version of the Table File System (TFS) in MIDAS is presented as a scientific database system. MIDAS applications for performing database operations on tables are discussed, for instance, the exchange of data to and from the TFS, the selection of objects, the uncertainty joins across tables, and the graphical representation of data. This upgraded version of the TFS is a full implementation of the binary table extension of the FITS format; in addition, it also supports arrays of strings. Different storage strategies for optimal access of very large data sets are implemented and are addressed in detail. As a simple relational database, the TFS may be used for the management of personal data files. This opens the way to intelligent pipeline processing of large amounts of data. One of the key features of the Table File System is to provide an extensive set of tools for the analysis of the final results of a reduction process. Column operations using standard and special mathematical functions as well as statistical distributions can be carried out; commands for linear regression and model fitting using nonlinear least-squares methods and user-defined functions are available. Finally, statistical hypothesis tests and multivariate methods can also operate on tables.

  1. Integration of NASA/GSFC and USGS Rock Magnetic Databases.

    NASA Astrophysics Data System (ADS)

    Nazarova, K. A.; Glen, J. M.

    2004-05-01

    A global Magnetic Petrology Database (MPDB) was developed and continues to be updated at NASA/Goddard Space Flight Center. The purpose of this database is to provide the geomagnetic community with a comprehensive and user-friendly method of accessing magnetic petrology data via the Internet for a more realistic interpretation of satellite (as well as aeromagnetic and ground) lithospheric magnetic anomalies. The MPDB contains data on rocks from localities around the world (about 19,000 samples) including the Ukrainian and Baltic Shields, Kamchatka, Iceland, the Ural Mountains, etc. The MPDB is designed, managed and presented on the web as a research-oriented database. Several database applications have been specifically developed for data manipulation and analysis of the MPDB. The geophysics unit at the USGS in Menlo Park has over 17,000 rock-property data, largely from sites within the western U.S. This database contains rock-density and rock-magnetic parameters collected for use in gravity and magnetic field modeling, and paleomagnetic studies. Most of these data were taken from surface outcrops and together they span a broad range of rock types. Measurements were made either in-situ at the outcrop, or in the laboratory on hand samples and paleomagnetic cores acquired in the field. The USGS and NASA/GSFC data will be integrated as part of an effort to provide public access to a single, uniformly maintained database. Due to the large number of data and the very large area sampled, the database can yield rock-property statistics on a broad range of rock types; it is thus applicable to study areas beyond the geographic scope of the database. The intent of this effort is to provide incentive for others to further contribute to the database, and a tool with which the geophysical community can entertain studies formerly precluded.

  2. Automating Relational Database Design for Microcomputer Users.

    ERIC Educational Resources Information Center

    Pu, Hao-Che

    1991-01-01

    Discusses issues involved in automating the relational database design process for microcomputer users and presents a prototype of a microcomputer-based system (RA, Relation Assistant) that is based on expert systems technology and helps avoid database maintenance problems. Relational database design is explained and the importance of easy input…

  3. Compressing DNA sequence databases with coil.

    PubMed

    White, W Timothy J; Hendy, Michael D

    2008-05-20

    Publicly available DNA sequence databases such as GenBank are large, and are growing at an exponential rate. The sheer volume of data being dealt with presents serious storage and data communications problems. Currently, sequence data is usually kept in large "flat files," which are then compressed using standard Lempel-Ziv (gzip) compression - an approach which rarely achieves good compression ratios. While much research has been done on compressing individual DNA sequences, surprisingly little has focused on the compression of entire databases of such sequences. In this study we introduce the sequence database compression software coil. We have designed and implemented a portable software package, coil, for compressing and decompressing DNA sequence databases based on the idea of edit-tree coding. coil is geared towards achieving high compression ratios at the expense of execution time and memory usage during compression - the compression time represents a "one-off investment" whose cost is quickly amortised if the resulting compressed file is transmitted many times. Decompression requires little memory and is extremely fast. We demonstrate a 5% improvement in compression ratio over state-of-the-art general-purpose compression tools for a large GenBank database file containing Expressed Sequence Tag (EST) data. Finally, coil can efficiently encode incremental additions to a sequence database. coil presents a compelling alternative to conventional compression of flat files for the storage and distribution of DNA sequence databases having a narrow distribution of sequence lengths, such as EST data. Increasing compression levels for databases having a wide distribution of sequence lengths is a direction for future work.
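
    coil's edit-tree coder is not reproduced here, but the underlying intuition (store a sequence as edit operations against a similar reference, so that only the differences need to be kept) can be sketched with the standard-library difflib on two made-up sequences:

      import difflib

      reference = "ACGTACGTACGTTTGACCA"
      target    = "ACGTACGAACGTTTGCCCA"

      # Represent the target as edit operations against the reference --
      # the basic idea behind reference/edit-based sequence compression,
      # though coil's actual edit-tree coding is far more sophisticated.
      ops = [(tag, reference[i1:i2], target[j1:j2])
             for tag, i1, i2, j1, j2 in
             difflib.SequenceMatcher(a=reference, b=target).get_opcodes()
             if tag != "equal"]
      print(ops)   # only the differences need to be stored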

  4. Compressing DNA sequence databases with coil

    PubMed Central

    White, W Timothy J; Hendy, Michael D

    2008-01-01

    Background Publicly available DNA sequence databases such as GenBank are large, and are growing at an exponential rate. The sheer volume of data being dealt with presents serious storage and data communications problems. Currently, sequence data is usually kept in large "flat files," which are then compressed using standard Lempel-Ziv (gzip) compression – an approach which rarely achieves good compression ratios. While much research has been done on compressing individual DNA sequences, surprisingly little has focused on the compression of entire databases of such sequences. In this study we introduce the sequence database compression software coil. Results We have designed and implemented a portable software package, coil, for compressing and decompressing DNA sequence databases based on the idea of edit-tree coding. coil is geared towards achieving high compression ratios at the expense of execution time and memory usage during compression – the compression time represents a "one-off investment" whose cost is quickly amortised if the resulting compressed file is transmitted many times. Decompression requires little memory and is extremely fast. We demonstrate a 5% improvement in compression ratio over state-of-the-art general-purpose compression tools for a large GenBank database file containing Expressed Sequence Tag (EST) data. Finally, coil can efficiently encode incremental additions to a sequence database. Conclusion coil presents a compelling alternative to conventional compression of flat files for the storage and distribution of DNA sequence databases having a narrow distribution of sequence lengths, such as EST data. Increasing compression levels for databases having a wide distribution of sequence lengths is a direction for future work. PMID:18489794

  5. Using administrative data to track fall-related ambulatory care services in the Veterans Administration Healthcare system.

    PubMed

    Luther, Stephen L; French, Dustin D; Powell-Cope, Gail; Rubenstein, Laurence Z; Campbell, Robert

    2005-10-01

    The Veterans Administration (VA) Healthcare system, containing hospital and community-based outpatient clinics, provides the setting for the study. Summary data was obtained from the VA Ambulatory Events Database for fiscal years (FY) 1997-2001 and in-depth data for FY 2001. In FY 2001, the database included approximately 4 million unique patients with 60 million encounters. The purpose of this study was: 1) to quantify injuries and use of services associated with falls among the elderly treated in Veterans Administration (VA) ambulatory care settings using administrative data; 2) to compare fall-related services provided to elderly veterans with those provided to younger veterans. Retrospective analysis of administrative data. This study describes the trends (FY 1997-2001) and patterns of fall-related ambulatory care encounters (FY 2001) in the VA Healthcare System. An approximately four-fold increase in both encounters and patients seen was observed in FY 1997-2001, largely paralleling the growth of VA ambulatory care services. More than two-thirds of the patients treated were found to be over the age of 65. Veterans over the age of 65 were found to be more likely to receive care in the non-urgent setting and had higher numbers of co-morbid conditions than younger veterans. While nearly half of the encounters occurred in the Emergency/Urgent Care setting, fall-related injuries led to services across a wide spectrum of medical and surgical providers/departments. This study represents the first attempt to use the VA Ambulatory Events Database to study fall-related services provided to elderly veterans. In view of the aging population served by the VA and the movement to provide increased services in the outpatient setting, this database provides an important resource for researchers and administrators interested in the prevention and treatment of fall-related injuries.

  6. Reducing Information Overload in Large Seismic Data Sets

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    HAMPTON,JEFFERY W.; YOUNG,CHRISTOPHER J.; MERCHANT,BION J.

    2000-08-02

    Event catalogs for seismic data can become very large. Furthermore, as researchers collect multiple catalogs and reconcile them into a single catalog that is stored in a relational database, the reconciled set becomes even larger. The sheer number of these events makes searching for relevant events to compare with events of interest problematic. Information overload in this form can lead to the data sets being under-utilized and/or used incorrectly or inconsistently. Thus, efforts have been initiated to research techniques and strategies for helping researchers to make better use of large data sets. In this paper, the authors present their efforts to do so in two ways: (1) the Event Search Engine, which is a waveform correlation tool, and (2) some content analysis tools, which are a combination of custom-built and commercial off-the-shelf tools for accessing, managing, and querying seismic data stored in a relational database. The current Event Search Engine is based on a hierarchical clustering tool known as the dendrogram tool, which is written as a MatSeis graphical user interface. The dendrogram tool allows the user to build dendrogram diagrams for a set of waveforms by controlling phase windowing, down-sampling, filtering, enveloping, and the clustering method (e.g. single linkage, complete linkage, flexible method). It also allows the clustering to be based on two or more stations simultaneously, which is important to bridge gaps in the sparsely recorded event sets anticipated in such a large reconciled event set. Current efforts are focusing on tools to help the researcher winnow the clusters defined using the dendrogram tool down to the minimum optimal identification set. This will become critical as the number of reference events in the reconciled event set continually grows. The dendrogram tool is part of the MatSeis analysis package, which is available on the Nuclear Explosion Monitoring Research and Engineering Program Web Site. As part of the research into how to winnow the reference events in these large reconciled event sets, additional database query approaches have been developed to provide windows into these datasets. These custom-built content analysis tools help identify dataset characteristics that can potentially aid in providing a basis for comparing similar reference events in these large reconciled event sets. Once these characteristics can be identified, algorithms can be developed to create and add to the reduced set of events used by the Event Search Engine. These content analysis tools have already been useful in providing information on station coverage of the referenced events and basic statistical information on events in the research datasets. The tools can also provide researchers with a quick way to find interesting and useful events within the research datasets. The tools could also be used as a means to review reference event datasets as part of a dataset delivery verification process. There has also been an effort to explore the usefulness of commercially available web-based software to help with this problem. The advantages of using off-the-shelf software applications, such as Oracle's WebDB, to manipulate, customize, and manage research data are being investigated. These types of applications are being examined to provide access to large integrated data sets for regional seismic research in Asia. All of these software tools would provide the researcher with unprecedented power without having to learn the intricacies and complexities of relational database systems.
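
    The clustering step the dendrogram tool performs can be sketched with SciPy: given a (hypothetical) matrix of pairwise waveform correlations, convert it to distances, build a linkage tree, and cut it into event clusters. This mirrors the role of the tool, not its MatSeis implementation.

      import numpy as np
      from scipy.cluster.hierarchy import linkage, fcluster
      from scipy.spatial.distance import squareform

      # Hypothetical pairwise correlation coefficients for five events.
      corr = np.array([
          [1.0, 0.9, 0.2, 0.8, 0.1],
          [0.9, 1.0, 0.3, 0.7, 0.2],
          [0.2, 0.3, 1.0, 0.2, 0.9],
          [0.8, 0.7, 0.2, 1.0, 0.1],
          [0.1, 0.2, 0.9, 0.1, 1.0],
      ])
      dist = squareform(1.0 - corr, checks=False)   # condensed distance matrix

      # Complete linkage, one of the methods the dendrogram tool offers.
      tree = linkage(dist, method="complete")
      print(fcluster(tree, t=0.5, criterion="distance"))   # e.g. [1 1 2 1 2]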

  7. BioMart: a data federation framework for large collaborative projects.

    PubMed

    Zhang, Junjun; Haider, Syed; Baran, Joachim; Cros, Anthony; Guberman, Jonathan M; Hsu, Jack; Liang, Yong; Yao, Long; Kasprzyk, Arek

    2011-01-01

    BioMart is a freely available, open source, federated database system that provides a unified access to disparate, geographically distributed data sources. It is designed to be data agnostic and platform independent, such that existing databases can easily be incorporated into the BioMart framework. BioMart allows databases hosted on different servers to be presented seamlessly to users, facilitating collaborative projects between different research groups. BioMart contains several levels of query optimization to efficiently manage large data sets and offers a diverse selection of graphical user interfaces and application programming interfaces to ensure that queries can be performed in whatever manner is most convenient for the user. The software has now been adopted by a large number of different biological databases spanning a wide range of data types and providing a rich source of annotation available to bioinformaticians and biologists alike.

  8. Database of significant deposits of gold, silver, copper, lead, and zinc in the United States

    USGS Publications Warehouse

    Long, Keith R.; DeYoung, John H.; Ludington, Stephen

    1998-01-01

    It has long been recognized that the largest mineral deposits contain most of the known mineral endowment (Singer and DeYoung, 1980). Sometimes called giant or world-class deposits, these largest deposits account for a very large share of historic and current mineral production and resources in industrial society (Singer, 1995). For example, Singer (1995) shows that the largest 10 percent of the world's gold deposits contain 86 percent of the gold discovered to date. Many mineral resource issues and investigations are more easily addressed if limited to the relatively small number of deposits that contain most of the known mineral resources. An estimate of known resources using just these deposits would normally be sufficient, because considering smaller deposits would not add significantly to the total estimate. Land-use planning should deal mainly with these deposits due to their relative scarcity, the large share of known resources they contain, and the fact that economies of scale allow minerals to be produced much more cheaply from larger deposits. Investigation of environmental and other hazards that result from mining operations can likewise be limited to these largest deposits because they account for most of past and current production. The National Mineral Resource Assessment project of the U.S. Geological Survey (USGS) has compiled a database on the largest known deposits of gold, silver, copper, lead, and zinc in the United States to complement the 1996 national assessment of undiscovered deposits of these same metals (Ludington and Cox, 1996). The deposits in this database account for approximately 99 percent of domestic production of these metals and probably a similar share of identified resources. These data may be compared with results of the assessment of undiscovered resources to characterize the nation's total mineral endowment for these metals. This database is a starting point for any national or regional mineral-resource or mineral-environmental investigation.

  9. A fully automatic end-to-end method for content-based image retrieval of CT scans with similar liver lesion annotations.

    PubMed

    Spanier, A B; Caplan, N; Sosna, J; Acar, B; Joskowicz, L

    2018-01-01

    The goal of medical content-based image retrieval (M-CBIR) is to assist radiologists in the decision-making process by retrieving medical cases similar to a given image. One of the key interests of radiologists is lesions and their annotations, since the patient treatment depends on the lesion diagnosis. Therefore, a key feature of M-CBIR systems is the retrieval of scans with the most similar lesion annotations. To be of value, M-CBIR systems should be fully automatic to handle large case databases. We present a fully automatic end-to-end method for the retrieval of CT scans with similar liver lesion annotations. The input is a database of abdominal CT scans labeled with liver lesions, a query CT scan, and optionally one radiologist-specified lesion annotation of interest. The output is an ordered list of the database CT scans with the most similar liver lesion annotations. The method starts by automatically segmenting the liver in the scan. It then extracts a histogram-based features vector from the segmented region, learns the features' relative importance, and ranks the database scans according to the relative importance measure. The main advantages of our method are that it fully automates the end-to-end querying process, that it uses simple and efficient techniques that are scalable to large datasets, and that it produces quality retrieval results using an unannotated CT scan. Our experimental results on 9 CT queries on a dataset of 41 volumetric CT scans from the 2014 Image CLEF Liver Annotation Task yield an average retrieval accuracy (Normalized Discounted Cumulative Gain index) of 0.77 and 0.84 without/with annotation, respectively. Fully automatic end-to-end retrieval of similar cases based on image information alone, rather than on disease diagnosis, may help radiologists to better diagnose liver lesions.
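
    A sketch of the ranking stage described above, assuming hypothetical normalized intensity histograms and learned per-bin weights; the paper's feature set and importance learning are more elaborate than this.

      import numpy as np

      def rank_scans(query_hist, db_hists, weights):
          """Order database scans by weighted L1 distance to the query histogram."""
          dists = {name: float(np.sum(weights * np.abs(query_hist - h)))
                   for name, h in db_hists.items()}
          return sorted(dists, key=dists.get)

      # Hypothetical 4-bin intensity histograms (normalized) and learned weights.
      w = np.array([0.1, 0.4, 0.4, 0.1])     # bins covering lesion intensities weigh more
      query = np.array([0.2, 0.5, 0.2, 0.1])
      db = {
          "scan_03": np.array([0.25, 0.45, 0.2, 0.1]),
          "scan_17": np.array([0.6, 0.1, 0.1, 0.2]),
          "scan_21": np.array([0.2, 0.5, 0.25, 0.05]),
      }
      print(rank_scans(query, db, w))   # most similar scan first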

  10. Anthropometry of Brazilian Air Force pilots.

    PubMed

    da Silva, Gilvan V; Halpern, Manny; Gordon, Claire C

    2017-10-01

    Anthropometric data are essential for the design of military equipment including sizing of aircraft cockpits and personal gear. Currently, there are no anthropometric databases specific to Brazilian military personnel. The aim of this study was to create a Brazilian anthropometric database of Air Force pilots. The methods, protocols, descriptions, definitions, landmarks, tools and measurement procedures followed the instructions outlined in Measurer's Handbook: US Army and Marine Corps Anthropometric Surveys, 2010-2011 - NATICK/TR-11/017. The participants were measured countrywide, in all five Brazilian Geographical Regions. Thirty-nine anthropometric measurements related to cockpit design were selected. The results of 2133 males and 206 females aged 16-52 years constitute a set of basic data for cockpit design, space arrangement issues and adjustments, protective gear and equipment design, as well as for digital human modelling. Another important implication is that this study can be considered a starting point for reducing gender bias in women's careers as pilots. Practitioner Summary: This paper describes the first large-scale anthropometric survey of Brazilian Air Force pilots and the development of the related database. This study provides critical data for improving aircraft cockpit design for ergonomics and comprehensive pilot accommodation, protective gear and uniform design, as well as digital human modelling.

  11. Multi-Sensor Scene Synthesis and Analysis

    DTIC Science & Technology

    1981-09-01

    [Table-of-contents excerpt; page numbers omitted] Topics include: quad trees for image representation and processing; databases, definitions and basic concepts; the use of databases in hierarchical scene analysis; the use of relational tables; Multisensor Image Database Systems (MIDAS); a relational database system for pictures; and relational pictorial databases.

  12. Collision Cross Section (CCS) Database: An Additional Measure to Characterize Steroids.

    PubMed

    Hernández-Mesa, Maykel; Le Bizec, Bruno; Monteau, Fabrice; García-Campaña, Ana M; Dervilly-Pinel, Gaud

    2018-04-03

    Ion mobility spectrometry enhances the performance characteristics of liquid chromatography-mass spectrometry workflows intended for steroid profiling by providing a new separation dimension and a novel characterization parameter, the so-called collision cross section (CCS). This work proposes the first CCS database for 300 steroids (i.e., endogenous compounds, including phase I and phase II metabolites, and exogenous synthetic compounds); it comprises 1080 ions and covers the CCS of 127 androgens, 84 estrogens, 50 corticosteroids, and 39 progestagens. This large database provides information on all the ionized species identified for each steroid in positive electrospray ionization mode, as well as for estrogens in negative ionization mode. CCS values have been measured using nitrogen as the drift gas in the ion mobility cell. Generally, a direct correlation exists between mass-to-charge ratio (m/z) and CCS because the two parameters are related. However, several steroids, mainly steroid glucuronides and steroid esters, have been characterized as more compact or more elongated molecules than expected. In such cases, CCS provides relevant information complementary to retention time and mass spectral data for the identification of steroids. Moreover, several isomeric steroid pairs (e.g., 5β-androstane-3,17-dione and 5α-androstane-3,17-dione) have been separated based on their CCS differences. These results indicate that adding CCS to databases in analytical workflows increases selectivity, thus improving confidence in steroid analysis. Consequences in terms of identification and quantification are discussed. Quality criteria and an interlaboratory reproducibility assessment are also reported for the obtained CCS values. The CCS database described here is made publicly available.
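
    The reported deviation of glucuronides and esters from the m/z trend suggests a simple screening idea: fit the global CCS-versus-m/z trend and flag compounds with large residuals. The sketch below assumes a power-law fit and a 2-sigma cutoff; the ion values are invented.

```python
# Sketch: flag steroids whose CCS deviates from the global CCS-vs-m/z trend,
# i.e. candidates that are more compact or elongated than expected.
# The power-law fit and 2-sigma cutoff are illustrative assumptions.
import numpy as np

mz  = np.array([289.2, 363.2, 465.3, 467.3, 591.4])   # hypothetical ions
ccs = np.array([168.0, 186.5, 205.1, 232.0, 243.8])   # hypothetical CCS (A^2)

# fit CCS ~ a * mz^b in log-log space
b, log_a = np.polyfit(np.log(mz), np.log(ccs), 1)
predicted = np.exp(log_a) * mz ** b
residuals = ccs - predicted

outliers = np.abs(residuals) > 2 * residuals.std()
for m, c, flag in zip(mz, ccs, outliers):
    print(f"m/z {m:7.1f}  CCS {c:6.1f}  {'atypical' if flag else 'on-trend'}")
```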

  13. Enhanced DIII-D Data Management Through a Relational Database

    NASA Astrophysics Data System (ADS)

    Burruss, J. R.; Peng, Q.; Schachter, J.; Schissel, D. P.; Terpstra, T. B.

    2000-10-01

    A relational database is being used to serve data about DIII-D experiments. The database is optimized for queries across multiple shots, allowing rapid data mining by SQL-literate researchers. The relational database relates different experiments and datasets, thus providing a big picture of DIII-D operations. Users are encouraged to add their own tables to the database. Summary physics quantities about DIII-D discharges are collected and stored in the database automatically. Metadata about code runs, MDSplus usage, and visualization tool usage are collected, stored in the database, and later analyzed to improve computing. The database may be accessed through programming languages such as C, Java, and IDL, or through ODBC-compliant applications such as Excel and Access. A database-driven web page also provides a convenient means for viewing database quantities through the World Wide Web. Demonstrations will be given at the poster.
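
    As a rough illustration of the cross-shot queries such a database enables, the sketch below uses an in-memory SQLite stand-in; the shots summary table and its columns are hypothetical, not the actual DIII-D schema.

```python
# Sketch of a cross-shot query over a summary table, using SQLite as a
# stand-in; table name, columns, and values are hypothetical.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE shots (
    shot INTEGER PRIMARY KEY, date TEXT,
    plasma_current_ma REAL, beta_n REAL)""")
con.executemany("INSERT INTO shots VALUES (?,?,?,?)", [
    (100001, "2000-06-01", 1.2, 1.8),
    (100002, "2000-06-01", 1.5, 2.4),
    (100003, "2000-06-02", 0.9, 1.1),
])

# one statement spanning many shots: the point of a relational layout
for row in con.execute("""SELECT shot, beta_n FROM shots
                          WHERE plasma_current_ma > 1.0
                          ORDER BY beta_n DESC"""):
    print(row)
```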

  14. Application of Large-Scale Database-Based Online Modeling to Plant State Long-Term Estimation

    NASA Astrophysics Data System (ADS)

    Ogawa, Masatoshi; Ogai, Harutoshi

    Recently, attention has been drawn to a class of local modeling techniques based on a new idea called “Just-In-Time (JIT) modeling”. To apply JIT modeling online to a large database, “Large-scale database-based Online Modeling (LOM)” has been proposed. LOM is a technique that makes the retrieval of neighboring data more efficient by using both “stepwise selection” and quantization. In order to predict the long-term state of a plant without using future data of manipulated variables, an Extended Sequential Prediction method of LOM (ESP-LOM) has been proposed. In this paper, LOM and ESP-LOM are introduced.
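
    The core JIT idea, retrieving neighbors of the query point and fitting a model on the spot, can be sketched briefly. The example below uses plain k-nearest-neighbor retrieval and a local least-squares fit; LOM's stepwise selection and quantization, which make the retrieval efficient at scale, are not reproduced here.

```python
# Minimal sketch of "Just-In-Time" local modeling: for each query, retrieve
# the nearest stored samples and fit a local linear model on the spot.
# The neighbor count and plain k-NN search are simplifying assumptions.
import numpy as np

def jit_predict(X_db, y_db, x_query, k=20):
    """Predict y at x_query from a local least-squares fit on k neighbors."""
    d = np.linalg.norm(X_db - x_query, axis=1)
    idx = np.argsort(d)[:k]                      # nearest stored samples
    A = np.hstack([X_db[idx], np.ones((k, 1))])  # affine design matrix
    coef, *_ = np.linalg.lstsq(A, y_db[idx], rcond=None)
    return np.append(x_query, 1.0) @ coef

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, (500, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1]              # toy plant response
print(jit_predict(X, y, np.array([0.2, -0.3])))
```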

  15. Long-term citizen-collected data reveal geographical patterns and temporal trends in lake water clarity

    USGS Publications Warehouse

    Lottig, Noah R.; Wagner, Tyler; Henry, Emily N.; Cheruvelil, Kendra Spence; Webster, Katherine E.; Downing, John A.; Stow, Craig A.

    2014-01-01

    We compiled a lake-water clarity database using publicly available, citizen volunteer observations made between 1938 and 2012 across eight states in the Upper Midwest, USA. Our objectives were to determine (1) whether temporal trends in lake-water clarity existed across this large geographic area and (2) whether trends were related to the lake-specific characteristics of latitude, lake size, or the time period over which the lake was monitored. Our database consisted of >140,000 individual Secchi observations from 3,251 lakes that we summarized per lake-year, resulting in 21,020 summer averages. Using Bayesian hierarchical modeling, we found approximately a 1% per year increase in water clarity (quantified as Secchi depth) for the entire population of lakes. On an individual lake basis, 7% of lakes showed increased water clarity and 4% showed decreased clarity. Trend direction and strength were related to latitude and median sample date. Lakes in the southern part of our study region had lower average annual summer water clarity, more negative long-term trends, and greater inter-annual variability in water clarity compared to northern lakes. Increasing trends were strongest for lakes with median sample dates earlier in the period of record (1938–2012). Our ability to identify specific mechanisms for these trends is currently hampered by the lack of a large, multi-thematic database of variables that drive water clarity (e.g., climate, land use/cover). Our results demonstrate, however, that citizen science can provide the critical monitoring data needed to address environmental questions at large spatial and long temporal scales. Collaborations among citizens, research scientists, and government agencies may be important for developing the data sources and analytical tools necessary to move toward an understanding of the factors influencing macro-scale patterns such as those shown here for lake water clarity.
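
    A much-simplified version of the trend question can be sketched with per-lake least-squares slopes on log Secchi depth, where a slope of about 0.01 per year corresponds to the reported ~1% per year change. The actual study fit a Bayesian hierarchical model that shares information across lakes; the data below are synthetic.

```python
# Simplified sketch of per-lake clarity trends: least-squares slopes on
# log(Secchi depth). The real study used a Bayesian hierarchical model;
# all data here are synthetic.
import numpy as np

rng = np.random.default_rng(2)
slopes = []
for _ in range(200):                       # 200 synthetic lakes
    years = np.arange(1980, 2012)
    log_secchi = 0.6 + 0.01 * (years - 1980) + rng.normal(0, 0.15, years.size)
    slope, _ = np.polyfit(years, log_secchi, 1)
    slopes.append(slope)

slopes = np.array(slopes)
print(f"median trend: {100 * np.median(slopes):.2f}% per year")
print(f"lakes trending up: {(slopes > 0).mean():.0%}")
```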

  16. Long-Term Citizen-Collected Data Reveal Geographical Patterns and Temporal Trends in Lake Water Clarity

    PubMed Central

    Lottig, Noah R.; Wagner, Tyler; Norton Henry, Emily; Spence Cheruvelil, Kendra; Webster, Katherine E.; Downing, John A.; Stow, Craig A.

    2014-01-01

    We compiled a lake-water clarity database using publicly available, citizen volunteer observations made between 1938 and 2012 across eight states in the Upper Midwest, USA. Our objectives were to determine (1) whether temporal trends in lake-water clarity existed across this large geographic area and (2) whether trends were related to the lake-specific characteristics of latitude, lake size, or the time period over which the lake was monitored. Our database consisted of >140,000 individual Secchi observations from 3,251 lakes that we summarized per lake-year, resulting in 21,020 summer averages. Using Bayesian hierarchical modeling, we found approximately a 1% per year increase in water clarity (quantified as Secchi depth) for the entire population of lakes. On an individual lake basis, 7% of lakes showed increased water clarity and 4% showed decreased clarity. Trend direction and strength were related to latitude and median sample date. Lakes in the southern part of our study region had lower average annual summer water clarity, more negative long-term trends, and greater inter-annual variability in water clarity compared to northern lakes. Increasing trends were strongest for lakes with median sample dates earlier in the period of record (1938–2012). Our ability to identify specific mechanisms for these trends is currently hampered by the lack of a large, multi-thematic database of variables that drive water clarity (e.g., climate, land use/cover). Our results demonstrate, however, that citizen science can provide the critical monitoring data needed to address environmental questions at large spatial and long temporal scales. Collaborations among citizens, research scientists, and government agencies may be important for developing the data sources and analytical tools necessary to move toward an understanding of the factors influencing macro-scale patterns such as those shown here for lake water clarity. PMID:24788722

  17. Integrated database for identifying candidate genes for Aspergillus flavus resistance in maize

    PubMed Central

    2010-01-01

    Background Aspergillus flavus Link:Fr, an opportunistic fungus that produces aflatoxin, is pathogenic to maize and other oilseed crops. Aflatoxin is a potent carcinogen, and its presence markedly reduces the value of grain. Understanding and enhancing host resistance to A. flavus infection and/or subsequent aflatoxin accumulation is generally considered an efficient means of reducing grain losses to aflatoxin. Different proteomic, genomic and genetic studies of maize (Zea mays L.) have generated large data sets with the goal of identifying genes responsible for conferring resistance to A. flavus, or aflatoxin. Results In order to maximize the usage of different data sets in new studies, including association mapping, we have constructed a relational database with a web interface integrating the results of gene expression, proteomic (both gel-based and shotgun), Quantitative Trait Loci (QTL) genetic mapping studies, and sequence data from the literature to facilitate selection of candidate genes for continued investigation. The Corn Fungal Resistance Associated Sequences Database (CFRAS-DB) (http://agbase.msstate.edu/) was created with the main goal of identifying genes important to aflatoxin resistance. CFRAS-DB is implemented using MySQL as the relational database management system running on a Linux server, using an Apache web server, and Perl CGI scripts as the web interface. The database and the associated web-based interface allow researchers to examine many lines of evidence (e.g. microarray, proteomics, QTL studies, SNP data) to assess the potential role of a gene or group of genes in the response of different maize lines to A. flavus infection and subsequent production of aflatoxin by the fungus. Conclusions CFRAS-DB provides the first opportunity to integrate data pertaining to the problem of A. flavus and aflatoxin resistance in maize in one resource and to support queries across different datasets. The web-based interface gives researchers different query options for mining the database across different types of experiments. The database is publicly available at http://agbase.msstate.edu. PMID:20946609
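
    The kind of cross-evidence query CFRAS-DB supports might look like the following sketch, which uses SQLite as a stand-in for the MySQL back end; the table and column names are hypothetical.

```python
# Sketch of a cross-evidence candidate-gene query; SQLite stands in for
# MySQL, and the schema is hypothetical, not CFRAS-DB's actual layout.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE genes      (gene_id TEXT PRIMARY KEY);
CREATE TABLE microarray (gene_id TEXT, fold_change REAL);
CREATE TABLE proteomics (gene_id TEXT, spectral_count INTEGER);
CREATE TABLE qtl        (gene_id TEXT, bin TEXT);
INSERT INTO genes VALUES ('gA'), ('gB'), ('gC');
INSERT INTO microarray VALUES ('gA', 2.4), ('gB', 0.8);
INSERT INTO proteomics VALUES ('gA', 12), ('gC', 5);
INSERT INTO qtl VALUES ('gA', 'bin5.03'), ('gB', 'bin2.04');
""")

# candidate genes supported by all three lines of evidence
rows = con.execute("""
    SELECT g.gene_id FROM genes g
    JOIN microarray m ON m.gene_id = g.gene_id AND m.fold_change > 1.5
    JOIN proteomics p ON p.gene_id = g.gene_id
    JOIN qtl q        ON q.gene_id = g.gene_id
""").fetchall()
print(rows)   # [('gA',)]
```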

  18. RaMP: A Comprehensive Relational Database of Metabolomics Pathways for Pathway Enrichment Analysis of Genes and Metabolites

    PubMed Central

    Zhang, Bofei; Hu, Senyang; Baskin, Elizabeth; Patt, Andrew; Siddiqui, Jalal K.

    2018-01-01

    The value of metabolomics in translational research is undeniable, and metabolomics data are increasingly generated in large cohorts. The functional interpretation of disease-associated metabolites, though, is difficult, and the biological mechanisms that underlie cell type or disease-specific metabolomics profiles are oftentimes unknown. To help fully exploit metabolomics data and to aid in its interpretation, analysis of metabolomics data with other complementary omics data, including transcriptomics, is helpful. To facilitate such analyses at a pathway level, we have developed RaMP (Relational database of Metabolomics Pathways), which combines biological pathways from the Kyoto Encyclopedia of Genes and Genomes (KEGG), Reactome, WikiPathways, and the Human Metabolome DataBase (HMDB). To the best of our knowledge, an off-the-shelf, public database that maps genes and metabolites to biochemical/disease pathways and can readily be integrated into other existing software is currently lacking. For consistent and comprehensive analysis, RaMP enables batch and complex queries (e.g., list all metabolites involved in glycolysis and lung cancer), can readily be integrated into pathway analysis tools, and supports pathway overrepresentation analysis given a list of genes and/or metabolites of interest. For usability, we have developed a RaMP R package (https://github.com/Mathelab/RaMP-DB), including a user-friendly RShiny web application, that supports simple and batch queries, pathway overrepresentation analysis given a list of genes or metabolites of interest, and network visualization of gene-metabolite relationships. The package also includes the raw database file (MySQL dump), thereby providing a stand-alone downloadable framework for public use and integration with other tools. In addition, the Python code needed to recreate the database on another system is also publicly available (https://github.com/Mathelab/RaMP-BackEnd). Updates for databases in RaMP will be checked multiple times a year and RaMP will be updated accordingly. PMID:29470400
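
    The overrepresentation test described here is typically a hypergeometric/Fisher's exact test on the overlap between a user's list and a pathway's members. A minimal sketch, with invented counts (the test choice is a standard assumption, not a statement about RaMP's internals):

```python
# Sketch of pathway overrepresentation via a one-sided Fisher's exact test;
# the counts are invented for illustration.
from scipy.stats import fisher_exact

N = 4000   # metabolites in the database universe
K = 60     # of which annotated to the pathway of interest
n = 100    # metabolites in the user's list
k = 9      # overlap: user metabolites in that pathway

table = [[k, K - k],
         [n - k, (N - K) - (n - k)]]
odds, p = fisher_exact(table, alternative="greater")
print(f"enrichment p-value: {p:.3g}")   # small p => pathway overrepresented
```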

  19. RaMP: A Comprehensive Relational Database of Metabolomics Pathways for Pathway Enrichment Analysis of Genes and Metabolites.

    PubMed

    Zhang, Bofei; Hu, Senyang; Baskin, Elizabeth; Patt, Andrew; Siddiqui, Jalal K; Mathé, Ewy A

    2018-02-22

    The value of metabolomics in translational research is undeniable, and metabolomics data are increasingly generated in large cohorts. The functional interpretation of disease-associated metabolites, though, is difficult, and the biological mechanisms that underlie cell type or disease-specific metabolomics profiles are oftentimes unknown. To help fully exploit metabolomics data and to aid in its interpretation, analysis of metabolomics data with other complementary omics data, including transcriptomics, is helpful. To facilitate such analyses at a pathway level, we have developed RaMP (Relational database of Metabolomics Pathways), which combines biological pathways from the Kyoto Encyclopedia of Genes and Genomes (KEGG), Reactome, WikiPathways, and the Human Metabolome DataBase (HMDB). To the best of our knowledge, an off-the-shelf, public database that maps genes and metabolites to biochemical/disease pathways and can readily be integrated into other existing software is currently lacking. For consistent and comprehensive analysis, RaMP enables batch and complex queries (e.g., list all metabolites involved in glycolysis and lung cancer), can readily be integrated into pathway analysis tools, and supports pathway overrepresentation analysis given a list of genes and/or metabolites of interest. For usability, we have developed a RaMP R package (https://github.com/Mathelab/RaMP-DB), including a user-friendly RShiny web application, that supports simple and batch queries, pathway overrepresentation analysis given a list of genes or metabolites of interest, and network visualization of gene-metabolite relationships. The package also includes the raw database file (MySQL dump), thereby providing a stand-alone downloadable framework for public use and integration with other tools. In addition, the Python code needed to recreate the database on another system is also publicly available (https://github.com/Mathelab/RaMP-BackEnd). Updates for databases in RaMP will be checked multiple times a year and RaMP will be updated accordingly.

  20. Integrated database for identifying candidate genes for Aspergillus flavus resistance in maize.

    PubMed

    Kelley, Rowena Y; Gresham, Cathy; Harper, Jonathan; Bridges, Susan M; Warburton, Marilyn L; Hawkins, Leigh K; Pechanova, Olga; Peethambaran, Bela; Pechan, Tibor; Luthe, Dawn S; Mylroie, J E; Ankala, Arunkanth; Ozkan, Seval; Henry, W B; Williams, W P

    2010-10-07

    Aspergillus flavus Link:Fr, an opportunistic fungus that produces aflatoxin, is pathogenic to maize and other oilseed crops. Aflatoxin is a potent carcinogen, and its presence markedly reduces the value of grain. Understanding and enhancing host resistance to A. flavus infection and/or subsequent aflatoxin accumulation is generally considered an efficient means of reducing grain losses to aflatoxin. Different proteomic, genomic and genetic studies of maize (Zea mays L.) have generated large data sets with the goal of identifying genes responsible for conferring resistance to A. flavus, or aflatoxin. In order to maximize the usage of different data sets in new studies, including association mapping, we have constructed a relational database with a web interface integrating the results of gene expression, proteomic (both gel-based and shotgun), Quantitative Trait Loci (QTL) genetic mapping studies, and sequence data from the literature to facilitate selection of candidate genes for continued investigation. The Corn Fungal Resistance Associated Sequences Database (CFRAS-DB) (http://agbase.msstate.edu/) was created with the main goal of identifying genes important to aflatoxin resistance. CFRAS-DB is implemented using MySQL as the relational database management system running on a Linux server, using an Apache web server, and Perl CGI scripts as the web interface. The database and the associated web-based interface allow researchers to examine many lines of evidence (e.g. microarray, proteomics, QTL studies, SNP data) to assess the potential role of a gene or group of genes in the response of different maize lines to A. flavus infection and subsequent production of aflatoxin by the fungus. CFRAS-DB provides the first opportunity to integrate data pertaining to the problem of A. flavus and aflatoxin resistance in maize in one resource and to support queries across different datasets. The web-based interface gives researchers different query options for mining the database across different types of experiments. The database is publicly available at http://agbase.msstate.edu.

  1. Improving data management and dissemination in web based information systems by semantic enrichment of descriptive data aspects

    NASA Astrophysics Data System (ADS)

    Gebhardt, Steffen; Wehrmann, Thilo; Klinger, Verena; Schettler, Ingo; Huth, Juliane; Künzer, Claudia; Dech, Stefan

    2010-10-01

    The German-Vietnamese water-related information system for the Mekong Delta (WISDOM) project supports business processes in Integrated Water Resources Management in Vietnam. Multiple disciplines bring together earth observation and ground-based themes, such as environmental monitoring, water management, demographics, economy, information technology, and infrastructural systems. This paper introduces the components of the web-based WISDOM system, including the data, logic, and presentation tiers. It focuses on the data models upon which the database management system is built, including techniques for tagging or linking metadata with the stored information. The model also uses ordered groupings of spatial, thematic and temporal reference objects to semantically tag datasets and enable fast data retrieval, such as finding all data in a specific administrative unit belonging to a specific theme. A spatial extension is employed on top of the PostgreSQL database. This object-relational database was chosen over a purely relational database so that spatial objects can be tied to tabular data, improving the retrieval of census and observational data at regional, provincial, and local levels. While the spatial database hinders the processing of raster data, a work-around was built into WISDOM to permit efficient management of both raster and vector data. The data model also incorporates styling aspects of the spatial datasets through styled layer descriptors (SLD) and web mapping service (WMS) layer specifications, allowing retrieval of rendered maps. Metadata elements of the spatial data are based on the ISO 19115 standard. XML-structured information from the SLD and metadata is stored in an XML database. The data models and the data management system are robust for managing the large quantity of spatial objects, sensor observations, census and document data. The operational WISDOM information system prototype contains modules for data management, automatic data integration, and web services for data retrieval, analysis, and distribution. The graphical user interfaces facilitate metadata cataloguing, data warehousing, web sensor data analysis and thematic mapping.
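
    The example query mentioned above, all data in a given administrative unit belonging to a given theme, reduces to a join over tag tables. A minimal SQLite sketch with hypothetical stand-ins for WISDOM's reference-object model:

```python
# Sketch of semantic-tag retrieval: find datasets tagged with a given
# spatial unit AND a given theme. Tables and values are hypothetical.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE datasets (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE tags (dataset_id INTEGER, kind TEXT, value TEXT);
INSERT INTO datasets VALUES (1,'census_2009'), (2,'water_quality_2009');
INSERT INTO tags VALUES
  (1,'spatial','Can Tho'), (1,'thematic','demographics'),
  (2,'spatial','Can Tho'), (2,'thematic','water management');
""")

rows = con.execute("""
    SELECT d.name FROM datasets d
    JOIN tags s ON s.dataset_id = d.id
               AND s.kind = 'spatial'  AND s.value = 'Can Tho'
    JOIN tags t ON t.dataset_id = d.id
               AND t.kind = 'thematic' AND t.value = 'water management'
""").fetchall()
print(rows)   # [('water_quality_2009',)]
```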

  2. Using an international p53 mutation database as a foundation for an online laboratory in an upper level undergraduate biology class.

    PubMed

    Melloy, Patricia G

    2015-01-01

    A two-part laboratory exercise was developed to enhance classroom instruction on the significance of p53 mutations in cancer development. Students were asked to mine key information from an international database of p53 genetic changes related to cancer, the IARC TP53 database. Using this database, students designed several data mining activities to look at the changes in the p53 gene from a number of perspectives, including potential cancer-causing agents leading to particular changes and the prevalence of certain p53 variations in certain cancers. In addition, students gained a global perspective on cancer prevalence in different parts of the world. Students learned how to use the database in the first part of the exercise, and then used that knowledge to search particular cancers and cancer-causing agents of their choosing in the second part of the exercise. Students also connected the information gathered from the p53 exercise to a previous laboratory exercise looking at risk factors for cancer development. The goal of the experience was to increase student knowledge of the link between p53 genetic variation and cancer. Students also were able to walk a similar path through the website as a cancer researcher using the database to enhance bench work-based experiments with complementary large-scale database p53 variation information. © 2014 The International Union of Biochemistry and Molecular Biology.

  3. Mining the Galaxy Zoo Database: Machine Learning Applications

    NASA Astrophysics Data System (ADS)

    Borne, Kirk D.; Wallin, J.; Vedachalam, A.; Baehr, S.; Lintott, C.; Darg, D.; Smith, A.; Fortson, L.

    2010-01-01

    The new Zooniverse initiative is addressing the data flood in the sciences through a transformative partnership between professional scientists, volunteer citizen scientists, and machines. As part of this project, we are exploring the application of machine learning techniques to data mining problems associated with the large and growing database of volunteer science results gathered by the Galaxy Zoo citizen science project. We will describe the basic challenge, some machine learning approaches, and early results. One of the motivators for this study is the acquisition (through the Galaxy Zoo results database) of approximately 100 million classification labels for roughly one million galaxies, yielding a tremendously large and rich set of training examples for improving automated galaxy morphological classification algorithms. In our first case study, the goal is to learn which morphological and photometric features in the Sloan Digital Sky Survey (SDSS) database correlate most strongly with user-selected galaxy morphological class. As a corollary to this study, we are also aiming to identify which galaxy parameters in the SDSS database correspond to galaxies that have been the most difficult to classify (based upon large dispersion in their volunteer-provided classifications). Our second case study will focus on similar data mining analyses and machine learning algorithms applied to the Galaxy Zoo catalog of merging and interacting galaxies. The outcomes of this project will have applications in future large sky surveys, such as the LSST (Large Synoptic Survey Telescope) project, which will generate a catalog of 20 billion galaxies and will produce an additional astronomical alert database of approximately 100 thousand events each night for 10 years -- the capabilities and algorithms that we are exploring will assist in the rapid characterization and classification of such massive data streams. This research has been supported in part through NSF award #0941610.
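
    The first case study's learning task, predicting a volunteer-assigned morphology label from catalog features, can be sketched with any standard classifier. The example below uses a random forest on synthetic placeholders for SDSS quantities; it is illustrative, not the project's pipeline.

```python
# Sketch of the learning task: predict a volunteer-assigned morphology
# label from photometric/morphological features. All data are synthetic
# placeholders for SDSS quantities; not the project's actual pipeline.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
n = 5000
X = np.column_stack([
    rng.normal(2.5, 0.8, n),    # stand-in for concentration index
    rng.normal(0.0, 1.0, n),    # stand-in for u-r color
])
# toy rule: concentrated, red objects tend to be labeled "elliptical"
y = ((X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 0.7, n)) > 2.5).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print(f"held-out accuracy: {clf.score(X_te, y_te):.2f}")
print("feature importances:", clf.feature_importances_)
```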

  4. An experimental investigation of masking in the US FDA adverse event reporting system database.

    PubMed

    Wang, Hsin-wei; Hochberg, Alan M; Pearson, Ronald K; Hauben, Manfred

    2010-12-01

    A phenomenon of 'masking' or 'cloaking' in pharmacovigilance data mining has been described, which can potentially cause signals of disproportionate reporting (SDRs) to be missed, particularly in pharmaceutical company databases. Masking has been predicted theoretically, observed anecdotally or studied to a limited extent in both pharmaceutical company and health authority databases, but no previous publication systematically assesses its occurrence in a large health authority database. To explore the nature, extent and possible consequences of masking in the US FDA Adverse Event Reporting System (AERS) database by applying various experimental unmasking protocols to a set of drugs and events representing realistic pharmacovigilance analysis conditions. This study employed AERS data from 2001 through 2005. For a set of 63 Medical Dictionary for Regulatory Activities (MedDRA®) Preferred Terms (PTs), disproportionality analysis was carried out with respect to all drugs included in the AERS database, using a previously described urn-model-based algorithm. We specifically sought masking in which drug removal induced an increase in the statistical representation of a drug-event combination (DEC) that resulted in the emergence of a new SDR. We performed a series of unmasking experiments selecting drugs for removal using rational statistical decision rules based on the requirement of a reporting ratio (RR) >1, top-ranked statistical unexpectedness (SU) and relatedness as reflected in the WHO Anatomical Therapeutic Chemical level 4 (ATC4) grouping. In order to assess the possible extent of residual masking we performed two supplemental purely empirical analyses on a limited subset of data. This entailed testing every drug and drug group to determine which was most influential in uncovering masked SDRs. We assessed the strength of external evidence for a causal association for a small number of masked SDRs involving a subset of 29 drugs for which level of evidence adjudication was available from a previous study. The original disproportionality analysis identified 8719 SDRs for the 63 PTs. The SU-based unmasking protocols generated variable numbers of masked SDRs ranging from 38 to 156, representing a 0.43-1.8% increase over the number of baseline SDRs. A significant number of baseline SDRs were also lost in the course of our experiments. The trend in the number of gained SDRs per report removed was inversely related to the number of lost SDRs per protocol. Both the number and nature of the reports removed influenced the number of gained SDRs observed. The purely empirical protocols unmasked up to ten times as many SDRs. None of the masked SDRs had strong external evidence supporting a causal association. Most involved associations for which there was no external supporting evidence or were in the original product label. For two masked SDRs, there was external evidence of a possible causal association. We documented masking in the FDA AERS database. Attempts at unmasking SDRs using practically implementable protocols produced only small changes in the output of SDRs in our analysis. This is undoubtedly related to the large size and diversity of the database, but the complex interdependencies between drugs and events in authentic spontaneous reporting system (SRS) databases, and the impact of measures of statistical variability that are typically used in real-world disproportionality analysis, may be additional factors that constrain the discovery of masked SDRs and which may also operate in pharmaceutical company databases. Empirical determination of the most influential drugs may uncover significantly more SDRs than protocols based on predetermined statistical selection rules but is impractical except possibly for evaluating specific events. Routine global exercises to elicit masking, especially in large health authority databases, are not justified based on results available to date. Exercises to elicit unmasking should be driven by prior knowledge or obvious data imbalances.
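
    The masking mechanism itself is easy to demonstrate with a simplified observed/expected reporting ratio: removing a high-volume drug's reports from the background shrinks the expected count for a drug-event combination, and its ratio can cross the signaling threshold. Counts below are invented, and the statistic is a simplified stand-in for the paper's urn-model algorithm.

```python
# Sketch of the masking effect: a disproportionality statistic for one
# drug-event combination (DEC), recomputed after removing a high-volume
# drug's reports from the background. Counts are invented, and this
# observed/expected ratio is a simplified stand-in for the urn model.
def reporting_ratio(a, drug_total, event_total, grand_total):
    """Observed / expected report count for one drug-event combination."""
    expected = drug_total * event_total / grand_total
    return a / expected

# baseline database-wide counts (hypothetical)
a, drug_total, event_total, N = 20, 4000, 9000, 1_000_000
print(f"RR before: {reporting_ratio(a, drug_total, event_total, N):.2f}")

# remove a masking drug contributing 6000 of the event's 9000 reports
# (and 50,000 reports overall); the DEC's own counts are unchanged
masker_event, masker_reports = 6000, 50_000
rr_after = reporting_ratio(a, drug_total,
                           event_total - masker_event, N - masker_reports)
print(f"RR after:  {rr_after:.2f}")   # higher -> an SDR may emerge
```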

  5. Variability sensitivity of dynamic texture based recognition in clinical CT data

    NASA Astrophysics Data System (ADS)

    Kwitt, Roland; Razzaque, Sharif; Lowell, Jeffrey; Aylward, Stephen

    2014-03-01

    Dynamic texture recognition using a database of template models has recently shown promising results for the task of localizing anatomical structures in Ultrasound video. In order to understand its clinical value, it is imperative to study the sensitivity with respect to inter-patient variability as well as sensitivity to acquisition parameters such as Ultrasound probe angle. Fully addressing patient and acquisition variability issues, however, would require a large database of clinical Ultrasound from many patients, acquired in a multitude of controlled conditions, e.g., using a tracked transducer. Since such data is not readily attainable, we advocate an alternative evaluation strategy using abdominal CT data as a surrogate. In this paper, we describe how to replicate Ultrasound variabilities by extracting subvolumes from CT and interpreting the image material as an ordered sequence of video frames. Utilizing this technique, and based on a database of abdominal CT from 45 patients, we report results on an organ (kidney) recognition task, where we try to discriminate kidney subvolumes/videos from a collection of randomly sampled negative instances. We demonstrate that (1) dynamic texture recognition is relatively insensitive to inter-patient variation while (2) viewing angle variability needs to be accounted for in the template database. Since naively extending the template database to counteract variability issues can lead to impractical database sizes, we propose an alternative strategy based on automated identification of a small set of representative models.

  6. Historical reconstructions of California wildfires vary by data source

    USGS Publications Warehouse

    Syphard, Alexandra D.; Keeley, Jon E.

    2016-01-01

    Historical data are essential for understanding how fire activity responds to different drivers. It is important that the source of data is commensurate with the spatial and temporal scale of the question addressed, but fire history databases are derived from different sources with different restrictions. In California, a frequently used fire history dataset is the State of California Fire and Resource Assessment Program (FRAP) fire history database, which circumscribes fire perimeters at a relatively fine scale. It includes large fires on both state and federal lands but only covers fires that were mapped or had other spatially explicit data. A different database is the state and federal governments’ annual reports of all fires. They are more complete than the FRAP database but are only spatially explicit to the level of county (California Department of Forestry and Fire Protection – Cal Fire) or forest (United States Forest Service – USFS). We found substantial differences between the FRAP database and the annual summaries, with the largest and most consistent discrepancy being in fire frequency. The FRAP database missed the majority of fires and is thus a poor indicator of fire frequency or of ignition sources. The FRAP database is also deficient in area burned, especially before 1950. Even in contemporary records, the huge number of smaller fires not included in the FRAP database account for substantial cumulative differences in area burned. Wildfires in California account for nearly half of the western United States fire suppression budget. Therefore, the conclusions about data discrepancies and the implications for fire research are of broad importance.

  7. Improving the Scalability of an Exact Approach for Frequent Item Set Hiding

    ERIC Educational Resources Information Center

    LaMacchia, Carolyn

    2013-01-01

    Technological advances have led to the generation of large databases of organizational data recognized as an information-rich, strategic asset for internal analysis and sharing with trading partners. Data mining techniques can discover patterns in large databases including relationships considered strategically relevant to the owner of the data.…

  8. Reflections on CD-ROM: Bridging the Gap between Technology and Purpose.

    ERIC Educational Resources Information Center

    Saviers, Shannon Smith

    1987-01-01

    Provides a technological overview of CD-ROM (Compact Disc-Read Only Memory), an optically-based medium for data storage offering large storage capacity, computer-based delivery system, read-only medium, and economic mass production. CD-ROM database attributes appropriate for information delivery are also reviewed, including large database size,…

  9. Cost and cost-effectiveness studies in urologic oncology using large administrative databases.

    PubMed

    Wang, Ye; Mossanen, Matthew; Chang, Steven L

    2018-04-01

    Urologic cancers are not only among the most common types of cancers, but also among the most expensive cancers to treat in the United States. This study aimed to review the use of cost-effectiveness analyses (CEAs) and other cost analyses in urologic oncology using large databases to better understand the value of management strategies for these cancers. A literature review on CEAs and other cost analyses in urologic oncology using large databases. The options for and costs of diagnosing, treating, and following patients with urologic cancers can be expected to rise in the coming years. There are numerous opportunities in each urologic cancer to use CEAs to both lower costs and provide high-quality services. Improved cancer care must balance the integration of novelty with ensuring reasonable costs to patients and the health care system. With the increasing focus on cost containment, appreciating the value of competing strategies in caring for our patients is pivotal. Leveraging methods such as CEAs and harnessing large databases may help evaluate the merit of established or emerging strategies. Copyright © 2018 Elsevier Inc. All rights reserved.
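
    The central quantity in such cost-effectiveness comparisons is the incremental cost-effectiveness ratio (ICER). A minimal sketch with invented numbers (illustrative of the method, not drawn from the study):

```python
# Sketch of the core cost-effectiveness quantity: the incremental
# cost-effectiveness ratio (ICER) of one strategy over another.
# All numbers are invented for illustration.
def icer(cost_new, cost_old, qaly_new, qaly_old):
    """Incremental cost per quality-adjusted life-year gained."""
    return (cost_new - cost_old) / (qaly_new - qaly_old)

value = icer(cost_new=48_000, cost_old=30_000, qaly_new=6.1, qaly_old=5.6)
print(f"ICER: ${value:,.0f} per QALY gained")
# compare against a willingness-to-pay threshold, e.g. $100,000/QALY
```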

  10. DaVIE: Database for the Visualization and Integration of Epigenetic data

    PubMed Central

    Fejes, Anthony P.; Jones, Meaghan J.; Kobor, Michael S.

    2014-01-01

    One of the challenges in the analysis of large data sets, particularly in a population-based setting, is the ability to perform comparisons across projects. This has to be done in such a way that the integrity of each individual project is maintained, while ensuring that the data are comparable across projects. These issues are beginning to be observed in human DNA methylation studies, as the Illumina 450k platform and next generation sequencing-based assays grow in popularity and decrease in price. This increase in productivity is enabling new insights into epigenetics, but also requires the development of pipelines and software capable of handling the large volumes of data. The specific problems inherent in creating a platform for the storage, comparison, integration, and visualization of DNA methylation data include data storage, algorithm efficiency, and the ability to interpret the results to derive biological meaning from them. Databases provide a ready-made solution to these issues, but as yet no tools exist that leverage these advantages while providing an intuitive user interface for interpreting results in a genomic context. We have addressed this void by integrating a database to store DNA methylation data with a web interface to query and visualize the database and a set of libraries for more complex analysis. The resulting platform is called DaVIE: Database for the Visualization and Integration of Epigenetics data. DaVIE can use data culled from a variety of sources, and the web interface includes the ability to group samples by sub-type, compare multiple projects and visualize genomic features in relation to sites of interest. We have used DaVIE to identify patterns of DNA methylation in specific projects and across different projects, identify outlier samples, and cross-check differentially methylated CpG sites identified in specific projects across large numbers of samples. A demonstration server has been set up using GEO data at http://echelon.cmmt.ubc.ca/dbaccess/, with login “guest” and password “guest.” Groups may download and install their own version of the server following the instructions on the project's wiki. PMID:25278960

  11. LSD: Large Survey Database framework

    NASA Astrophysics Data System (ADS)

    Juric, Mario

    2012-09-01

    The Large Survey Database (LSD) is a Python framework and DBMS for distributed storage, cross-matching and querying of large survey catalogs (>10^9 rows, >1 TB). The primary driver behind its development is the analysis of Pan-STARRS PS1 data. It is specifically optimized for fast queries and parallel sweeps of positionally and temporally indexed datasets. It transparently scales to more than 10^2 nodes, and can be made to function in "shared nothing" architectures.
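
    Positional cross-matching of the kind such a framework performs can be sketched by converting sky coordinates to unit vectors and querying a k-d tree within a chord-length radius. This is a generic illustration, not LSD's implementation; the catalogs and match radius are invented.

```python
# Sketch of positional cross-matching: match two catalogs by sky position
# via unit vectors and a k-d tree. Generic illustration, not LSD's code.
import numpy as np
from scipy.spatial import cKDTree

def unit_vectors(ra_deg, dec_deg):
    ra, dec = np.radians(ra_deg), np.radians(dec_deg)
    return np.column_stack([np.cos(dec) * np.cos(ra),
                            np.cos(dec) * np.sin(ra),
                            np.sin(dec)])

cat1 = unit_vectors(np.array([10.0, 10.5]), np.array([41.0, 41.2]))
cat2 = unit_vectors(np.array([10.00005, 99.0]), np.array([41.00002, -5.0]))

radius_arcsec = 1.0
chord = 2 * np.sin(np.radians(radius_arcsec / 3600) / 2)  # angle -> chord
dist, idx = cKDTree(cat1).query(cat2, distance_upper_bound=chord)
matches = [(j, i) for j, (d, i) in enumerate(zip(dist, idx)) if np.isfinite(d)]
print(matches)   # pairs (cat2 row, cat1 row) within 1 arcsec
```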

  12. Ultra-Structure database design methodology for managing systems biology data and analyses

    PubMed Central

    Maier, Christopher W; Long, Jeffrey G; Hemminger, Bradley M; Giddings, Morgan C

    2009-01-01

    Background Modern, high-throughput biological experiments generate copious, heterogeneous, interconnected data sets. Research is dynamic, with frequently changing protocols, techniques, instruments, and file formats. Because of these factors, systems designed to manage and integrate modern biological data sets often end up as large, unwieldy databases that become difficult to maintain or evolve. The novel rule-based approach of the Ultra-Structure design methodology presents a potential solution to this problem. By representing both data and processes as formal rules within a database, an Ultra-Structure system constitutes a flexible framework that enables users to explicitly store domain knowledge in both a machine- and human-readable form. End users themselves can change the system's capabilities without programmer intervention, simply by altering database contents; no computer code or schemas need be modified. This provides flexibility in adapting to change, and allows integration of disparate, heterogeneous data sets within a small core set of database tables, facilitating joint analysis and visualization without becoming unwieldy. Here, we examine the application of Ultra-Structure to our ongoing research program for the integration of large proteomic and genomic data sets (proteogenomic mapping). Results We transitioned our proteogenomic mapping information system from a traditional entity-relationship design to one based on Ultra-Structure. Our system integrates tandem mass spectrum data, genomic annotation sets, and spectrum/peptide mappings, all within a small, general framework implemented within a standard relational database system. General software procedures driven by user-modifiable rules can perform tasks such as logical deduction and location-based computations. The system is not tied specifically to proteogenomic research, but is rather designed to accommodate virtually any kind of biological research. Conclusion We find Ultra-Structure offers substantial benefits for biological information systems, the largest being the integration of diverse information sources into a common framework. This facilitates systems biology research by integrating data from disparate high-throughput techniques. It also enables us to readily incorporate new data types, sources, and domain knowledge with no change to the database structure or associated computer code. Ultra-Structure may be a significant step towards solving the hard problem of data management and integration in the systems biology era. PMID:19691849

  13. Unified Access Architecture for Large-Scale Scientific Datasets

    NASA Astrophysics Data System (ADS)

    Karna, Risav

    2014-05-01

    Data-intensive sciences have to deploy diverse large-scale database technologies for data analytics, as scientists now deal with much larger data volumes than ever before. While array databases have bridged many gaps between the needs of data-intensive research fields and DBMS technologies (Zhang 2011), invocation of the other big data tools that accompany these databases is still manual and separate from the database management interface. We identify this as an architectural challenge that will increasingly complicate the user's workflow owing to the growing number of useful but isolated and niche database tools. Such use of data analysis tools in effect leaves the burden on the user's end to synchronize the results from other data manipulation and analysis tools with the database management system. To this end, we propose a unified access interface for using big data tools within a large-scale scientific array database, using the database queries themselves to embed foreign routines belonging to the big data tools. Such an invocation of foreign data manipulation routines inside a query can be made possible through a user-defined function (UDF). UDFs that allow such levels of freedom as to call modules from another language and to interface back and forth between the query body and the side-loaded functions would be needed for this purpose. For the purpose of this research we attempt coupling of four widely used tools, Hadoop (hadoop1), Matlab (matlab1), R (r1) and ScaLAPACK (scalapack1), with the UDF feature of rasdaman (Baumann 98), an array-based data manager, to investigate this concept. The native array data model used by an array-based data manager provides compact data storage and high-performance operations on ordered data such as spatial data, temporal data, and matrix-based data for linear algebra operations (scidbusr1). Performance issues arising from the coupling of tools with different paradigms, niche functionalities, separate processes, and output data formats have been anticipated and considered during the design of the unified architecture. The research focuses on the feasibility of the designed coupling mechanism and on the evaluation of the efficiency and benefits of our proposed unified access architecture. References: Zhang 2011: Zhang, Ying; Kersten, Martin; Ivanova, Milena; Nes, Niels, "SciQL: Bridging the Gap Between Science and Relational DBMS", Proceedings of the 15th Symposium on International Database Engineering and Applications, 2011. Baumann 98: Baumann, P.; Dehmel, A.; Furtado, P.; Ritsch, R.; Widmann, N., "The Multidimensional Database System RasDaMan", Proceedings ACM SIGMOD International Conference on Management of Data, Seattle, Washington, June 2-4, 1998. hadoop1: "Hadoop", http://hadoop.apache.org/ [Online; accessed 12-Jan-2014]. scalapack1: "ScaLAPACK", http://www.netlib.org/scalapack [Online; accessed 12-Jan-2014]. r1: "R", http://www.r-project.org/ [Online; accessed 12-Jan-2014]. matlab1: "Matlab Documentation", http://www.mathworks.de/de/help/matlab/ [Online; accessed 12-Jan-2014]. scidbusr1: "SciDB User's Guide", http://scidb.org/HTMLmanual/13.6/scidb_ug [Online; accessed 01-Dec-2013].

  14. Technical Aspects of Interfacing MUMPS to an External SQL Relational Database Management System

    PubMed Central

    Kuzmak, Peter M.; Walters, Richard F.; Penrod, Gail

    1988-01-01

    This paper describes an interface connecting InterSystems MUMPS (M/VX) to an external relational DBMS, the SYBASE Database Management System. The interface enables MUMPS to operate in a relational environment and gives the MUMPS language full access to a complete set of SQL commands. MUMPS generates SQL statements as ASCII text and sends them to the RDBMS. The RDBMS executes the statements and returns ASCII results to MUMPS. The interface suggests that the language features of MUMPS make it an attractive tool for use in the relational database environment. The approach described in this paper separates MUMPS from the relational database. Positioning the relational database outside of MUMPS promotes data sharing and permits a number of different options to be used for working with the data. Other languages like C, FORTRAN, and COBOL can access the RDBMS database. Advanced tools provided by the relational database vendor can also be used. SYBASE is an advanced high-performance transaction-oriented relational database management system for the VAX/VMS and UNIX operating systems. SYBASE is designed using a distributed open-systems architecture, and is relatively easy to interface with MUMPS.

  15. Exploring performance issues for a clinical database organized using an entity-attribute-value representation.

    PubMed

    Chen, R S; Nadkarni, P; Marenco, L; Levin, F; Erdos, J; Miller, P L

    2000-01-01

    The entity-attribute-value representation with classes and relationships (EAV/CR) provides a flexible and simple database schema to store heterogeneous biomedical data. In certain circumstances, however, the EAV/CR model is known to retrieve data less efficiently than conventional database schemas. To perform a pilot study that systematically quantifies performance differences for database queries directed at real-world microbiology data modeled with EAV/CR and conventional representations, and to explore the relative merits of different EAV/CR query implementation strategies. Clinical microbiology data obtained over a ten-year period were stored using both database models. Query execution times were compared for four clinically oriented attribute-centered and entity-centered queries operating under varying conditions of database size and system memory. The performance characteristics of three different EAV/CR query strategies were also examined. Performance was similar for entity-centered queries in the two database models. Performance in the EAV/CR model was approximately three to five times less efficient than its conventional counterpart for attribute-centered queries. The differences in query efficiency became slightly greater as database size increased, although they were reduced with the addition of system memory. The authors found that EAV/CR queries formulated using multiple, simple SQL statements executed in batch were more efficient than single, large SQL statements. This paper describes a pilot project to explore issues in and compare query performance for EAV/CR and conventional database representations. Although attribute-centered queries were less efficient in the EAV/CR model, these inefficiencies may be addressable, at least in part, by the use of more powerful hardware or more memory, or both.
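
    The attribute-centered inefficiency discussed here comes from the shape of the EAV table: every additional attribute predicate requires another self-join on the long, narrow table, whereas a conventional wide table needs only one WHERE clause. A minimal SQLite sketch with a hypothetical schema:

```python
# Sketch of why attribute-centered queries cost more under EAV: each
# attribute predicate needs its own self-join on the long, narrow table.
# Schema and data are hypothetical.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE eav (entity INTEGER, attribute TEXT, value TEXT);
INSERT INTO eav VALUES
  (1,'organism','E. coli'), (1,'specimen','blood'),
  (2,'organism','E. coli'), (2,'specimen','urine'),
  (3,'organism','S. aureus'), (3,'specimen','blood');
""")

# "all entities where organism = E. coli AND specimen = blood":
# one self-join per attribute, vs. a single WHERE clause on a wide table
rows = con.execute("""
    SELECT a.entity FROM eav a
    JOIN eav b ON b.entity = a.entity
    WHERE a.attribute='organism' AND a.value='E. coli'
      AND b.attribute='specimen' AND b.value='blood'
""").fetchall()
print(rows)   # [(1,)]
```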

  16. Studies of Big Data metadata segmentation between relational and non-relational databases

    NASA Astrophysics Data System (ADS)

    Golosova, M. V.; Grigorieva, M. A.; Klimentov, A. A.; Ryabinkin, E. A.; Dimitrov, G.; Potekhin, M.

    2015-12-01

    In recent years the concepts of Big Data have become well established in IT. Systems managing large data volumes produce metadata that describe data and workflows. These metadata are used to obtain information about current system state and for statistical and trend analysis of the processes these systems drive. Over time, the amount of stored metadata can grow dramatically. In this article we present our studies to demonstrate how metadata storage scalability and performance can be improved by using a hybrid RDBMS/NoSQL architecture.

  17. Assignment to database industry

    NASA Astrophysics Data System (ADS)

    Abe, Kohichiroh

    Various kinds of databases are considered an essential part of future large-scale systems. Information provision by databases alone is also expected to grow as the market matures. This paper discusses how these circumstances have developed and how they are likely to evolve.

  18. FReD: the floral reflectance database--a web portal for analyses of flower colour.

    PubMed

    Arnold, Sarah E J; Faruq, Samia; Savolainen, Vincent; McOwan, Peter W; Chittka, Lars

    2010-12-10

    Flower colour is of great importance in various fields relating to floral biology and pollinator behaviour. However, subjective human judgements of flower colour may be inaccurate and are irrelevant to the ecology and vision of the flower's pollinators. For precise, detailed information about the colours of flowers, a full reflectance spectrum for the flower of interest should be used rather than relying on such human assessments. The Floral Reflectance Database (FReD) has been developed to make an extensive collection of such data available to researchers. It is freely available at http://www.reflectance.co.uk. The database allows users to download spectral reflectance data for flower species collected from all over the world. These could, for example, be used in modelling interactions between pollinator vision and plant signals, or analyses of flower colours in various habitats. The database contains functions for calculating flower colour loci according to widely-used models of bee colour space, reflectance graphs of the spectra and an option to search for flowers with similar colours in bee colour space. The Floral Reflectance Database is a valuable new tool for researchers interested in the colours of flowers and their association with pollinator colour vision, containing raw spectral reflectance data for a large number of flower species.

  19. HANZE: a pan-European database of exposure to natural hazards and damaging historical floods since 1870

    NASA Astrophysics Data System (ADS)

    Paprotny, Dominik; Morales-Nápoles, Oswaldo; Jonkman, Sebastiaan N.

    2018-03-01

    The influence of social and economic change on the consequences of natural hazards has been a matter of much interest recently. However, there is a lack of comprehensive, high-resolution data on historical changes in land use, population, or assets available to study this topic. Here, we present the Historical Analysis of Natural Hazards in Europe (HANZE) database, which contains two parts: (1) HANZE-Exposure with maps for 37 countries and territories from 1870 to 2020 in 100 m resolution and (2) HANZE-Events, a compilation of past disasters with information on dates, locations, and losses, currently limited to floods only. The database was constructed using high-resolution maps of present land use and population, a large compilation of historical statistics, and relatively simple disaggregation techniques and rule-based land use reallocation schemes. Data encompassed in HANZE allow one to "normalize" information on losses due to natural hazards by taking into account inflation as well as changes in population, production, and wealth. This database of past events currently contains 1564 records (1870-2016) of flash, river, coastal, and compound floods. The HANZE database is freely available at https://data.4tu.nl/repository/collection:HANZE.
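
    The normalization described here typically rescales a nominal historical loss by changes in price level, population, and per-capita wealth. The sketch below states one common form of that adjustment as an assumption; it is not necessarily HANZE's exact formula, and the inputs are invented.

```python
# Sketch of loss "normalization": restate a historical flood loss in
# today's terms by scaling for price level, population, and per-capita
# wealth. A common scheme stated as an assumption, not HANZE's exact
# formula; all inputs are invented.
def normalize_loss(loss, deflator_then, deflator_now,
                   pop_then, pop_now, wealth_pc_then, wealth_pc_now):
    inflation = deflator_now / deflator_then
    population = pop_now / pop_then
    wealth = wealth_pc_now / wealth_pc_then
    return loss * inflation * population * wealth

# hypothetical 1912 flood loss of 2 million (nominal currency units)
print(f"{normalize_loss(2e6, 1.0, 21.0, 5.2e6, 8.1e6, 1.0, 4.5):.3g}")
```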

  20. Interacting with the National Database for Autism Research (NDAR) via the LONI Pipeline workflow environment.

    PubMed

    Torgerson, Carinna M; Quinn, Catherine; Dinov, Ivo; Liu, Zhizhong; Petrosyan, Petros; Pelphrey, Kevin; Haselgrove, Christian; Kennedy, David N; Toga, Arthur W; Van Horn, John Darrell

    2015-03-01

    Under the umbrella of the National Database for Clinical Trials (NDCT) related to mental illnesses, the National Database for Autism Research (NDAR) seeks to gather, curate, and make openly available neuroimaging data from NIH-funded studies of autism spectrum disorder (ASD). NDAR has recently made its database accessible through the LONI Pipeline workflow design and execution environment to enable large-scale analyses of cortical architecture and function via local, cluster, or "cloud"-based computing resources. This presents a unique opportunity to overcome many of the customary limitations to fostering biomedical neuroimaging as a science of discovery. Providing open access to primary neuroimaging data, workflow methods, and high-performance computing will increase uniformity in data collection protocols, encourage greater reliability of published data and replication of results, and broaden the range of researchers now able to perform larger studies than ever before. To illustrate the use of NDAR and LONI Pipeline for several common neuroimaging processing steps and analyses, this paper presents example workflows useful for ASD neuroimaging researchers seeking to begin using this valuable combination of online data and computational resources. We discuss the utility of such database and workflow processing interactivity as a motivation for the sharing of additional primary data in ASD research and elsewhere.

  1. CCDB: a curated database of genes involved in cervix cancer.

    PubMed

    Agarwal, Subhash M; Raghav, Dhwani; Singh, Harinder; Raghava, G P S

    2011-01-01

    The Cervical Cancer gene DataBase (CCDB, http://crdd.osdd.net/raghava/ccdb) is a manually curated catalog of experimentally validated genes that are thought, or are known, to be involved in the different stages of cervical carcinogenesis. Despite the large population of women affected by this malignancy, no database previously existed that catalogs information on genes associated with cervical cancer. Therefore, we have compiled 537 genes in CCDB that are linked with cervical cancer causation processes such as methylation, gene amplification, mutation, polymorphism and change in expression level, as evident from published literature. Each record contains gene-related details such as architecture (exon-intron structure), location, function, sequences (mRNA/CDS/protein), ontology, interacting partners, homology to other eukaryotic genomes, structure and links to other public databases, thus augmenting CCDB with external data. Also, manually curated literature references have been provided to support the inclusion of each gene in the database and establish its association with cervix cancer. In addition, CCDB provides information on microRNAs altered in cervical cancer, as well as a search facility for querying, several browse options and an online tool for sequence similarity search, thereby providing researchers with easy access to the latest information on genes involved in cervix cancer.

  2. Exploring molecular networks using MONET ontology.

    PubMed

    Silva, João Paulo Müller da; Lemke, Ney; Mombach, José Carlos; Souza, José Guilherme Camargo de; Sinigaglia, Marialva; Vieira, Renata

    2006-03-31

    The description of the complex molecular network responsible for cell behavior requires new tools to integrate large quantities of experimental data in the design of biological information systems. These tools could be used in the characterization of these networks and in the formulation of relevant biological hypotheses. Building an ontology is a crucial step, because it integrates in a coherent framework the concepts necessary to accomplish such a task. We present MONET (molecular network), an extensible ontology and an architecture designed to facilitate the integration of data originating from different public databases into a single, well-documented relational database that is compatible with the MONET formal definition. We also present an example of an application that can easily be implemented using these tools.

  3. Performance evaluation of redundant disk array support for transaction recovery

    NASA Technical Reports Server (NTRS)

    Mourad, Antoine N.; Fuchs, W. Kent; Saab, Daniel G.

    1991-01-01

    Redundant disk arrays provide a way of achieving rapid recovery from media failures with a relatively low storage cost for large scale data systems requiring high availability. Here, we propose a method for using redundant disk arrays to support rapid recovery from system crashes and transaction aborts in addition to their role in providing media failure recovery. A twin page scheme is used to store the parity information in the array so that the time for transaction commit processing is not degraded. Using an analytical model, we show that the proposed method achieves a significant increase in the throughput of database systems using redundant disk arrays by reducing the number of recovery operations needed to maintain the consistency of the database.
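
    The twin-page idea can be illustrated with a toy sketch. This is a reconstruction for illustration only, assuming XOR parity and two alternately written parity pages; it is not the authors' implementation:

```python
# Illustrative sketch (not the paper's implementation) of XOR parity with a
# "twin page" for the parity block: parity is written alternately to one of
# two pages, so a crash during a parity write never destroys the only copy.

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

class TwinPageParity:
    def __init__(self, page_size: int):
        self.twins = [bytes(page_size), bytes(page_size)]  # two parity pages
        self.versions = [0, 0]                             # monotonically increasing
        self.current = 0

    def parity(self) -> bytes:
        # After a crash, the twin with the highest version is the valid one.
        return self.twins[self.current]

    def update(self, old_data: bytes, new_data: bytes) -> None:
        # New parity = old parity XOR old data XOR new data.
        new_parity = xor(xor(self.parity(), old_data), new_data)
        target = 1 - self.current            # overwrite the *older* twin
        self.twins[target] = new_parity
        self.versions[target] = max(self.versions) + 1
        self.current = target                # commit point
```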

  4. Bibliometrics of NIHR HTA monographs and their related journal articles

    PubMed Central

    Royle, Pamela

    2015-01-01

    Objectives A bibliometric analysis of the UK National Institute for Health Research (NIHR) Health Technology Assessment (HTA) monographs and their related journal articles by: (1) exploring the differences in citations to the HTA monographs in Google Scholar (GS), Scopus and Web of Science (WoS), and (2) comparing Scopus citations to the monographs with those to their related journal articles. Setting A study of 111 HTA monographs published in 2010 and 2011, and their external journal articles. Main outcome measures Citations to the monographs in GS, Scopus and WoS, and to their external journal articles in Scopus. Results The number of citations varied among the three databases, with GS having the highest and WoS the lowest; however, the citation-based rankings among the databases were highly correlated. Overall, 56% of monographs had a related publication, with the highest proportion for primary research (76%) and lowest for evidence syntheses (43%). There was a large variation in how the monographs were cited, compared to journal articles, resulting in more frequent problems with unlinked citations in Scopus and WoS. When comparing the number of citations between monographs and their related journal articles from the same project, we found that monographs received more citations than their journal articles for evidence syntheses and methodology projects; by contrast, journal articles related to primary research monographs were more highly cited than their monograph. Conclusions The numbers of citations to the HTA monographs differed considerably between the databases, but were highly correlated. When an HTA monograph had a journal article from the same study, there were more citations to the journal article for primary research, but more to the monograph for evidence syntheses. Citations to the related journal articles were more reliably recorded than citations to the HTA monographs. PMID:25694457

  5. Risk and Safety in Post-Soviet Russia

    DTIC Science & Technology

    2008-09-01

    Topics covered include radiation exposure databases from Chernobyl, radioactive contamination from long-term operation of large radiochemical atomic plants, single and chronic irradiation of the population and personnel, Chernobyl-related radiation risk for the public, and radioactive contamination of Russian territories as a result of the Chernobyl accident.

  6. IUEAGN: A database of ultraviolet spectra of active galactic nuclei

    NASA Technical Reports Server (NTRS)

    Pike, G.; Edelson, R.; Shull, J. M.; Saken, J.

    1993-01-01

    In 13 years of operation, IUE has gathered approximately 5000 spectra of almost 600 Active Galactic Nuclei (AGN). In order to undertake AGN studies which require large amounts of data, we are consistently reducing this entire archive and creating a homogeneous, easy-to-use database. First, the spectra are extracted using the Optimal extraction algorithm. Continuum fluxes are then measured across predefined bands, and line fluxes are measured with a multi-component fit. These results, along with source information such as redshifts and positions, are placed in the IUEAGN relational database. Analysis algorithms, statistical tests, and plotting packages run within the structure, and this flexible database can accommodate future data when they are released. This archival approach has already been used to survey line and continuum variability in six bright Seyfert 1s and rapid continuum variability in 14 blazars. Among the results that could only be obtained using a large archival study is evidence that blazars show a positive correlation between degree of variability and apparent luminosity, while Seyfert 1s show an anti-correlation. This suggests that beaming dominates the ultraviolet properties for blazars, while thermal emission from an accretion disk dominates for Seyfert 1s. Our future plans include a survey of line ratios in Seyfert 1s, to be fitted with photoionization models to test the models and determine the range of temperatures, densities and ionization parameters. We will also include data from IRAS, Einstein, EXOSAT, and ground-based telescopes to measure multi-wavelength correlations and broadband spectral energy distributions.

  7. Geolocation of man-made reservoirs across terrains of varying complexity using GIS

    NASA Astrophysics Data System (ADS)

    Mixon, David M.; Kinner, David A.; Stallard, Robert F.; Syvitski, James P. M.

    2008-10-01

    The Reservoir Sedimentation Survey Information System (RESIS) is one of the world's most comprehensive databases of reservoir sedimentation rates, comprising nearly 6000 surveys for 1819 reservoirs across the continental United States. Sediment surveys in the database date from 1904 to 1999, though more than 95% of surveys were entered prior to 1980, making RESIS largely a historical database. The use of this database for large-scale studies has been limited by the lack of precise coordinates for the reservoirs. Many of the reservoirs are relatively small structures and do not appear on current USGS topographic maps. Others have been renamed or have only approximate (i.e. township and range) coordinates. This paper presents a method scripted in ESRI's ARC Macro Language (AML) to locate the reservoirs on digital elevation models using information available in RESIS. The script also delineates the contributing watersheds and compiles several hydrologically important parameters for each reservoir. Evaluation of the method indicates that, for watersheds larger than 5 km², the correct outlet is identified over 80% of the time. The importance of identifying the watershed outlet correctly depends on the application. Our intent is to collect spatial data for watersheds across the continental United States and describe the land use, soils, and topography for each reservoir's watershed. Because of local landscape similarity in these properties, we show that choosing the incorrect watershed does not necessarily mean that the watershed characteristics will be misrepresented. We present a measure termed terrain complexity and examine its relationship to geolocation success rate and its influence on the similarity of nearby watersheds.
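
    A rough Python rendering of the geolocation step (the published method is an AML script; the grid names, window size, and matching criterion here are assumptions):

```python
# Sketch of the geolocation idea, reimplemented in Python rather than AML:
# near a reservoir's approximate coordinates, pick the cell of a
# flow-accumulation grid whose contributing area best matches the drainage
# area reported in RESIS. Grid names and the search radius are assumptions.
import numpy as np

def locate_outlet(flowacc, cell_area_km2, row0, col0, reported_km2, radius=20):
    """Return (row, col) of the best-matching outlet cell in a search window."""
    r_lo, r_hi = max(0, row0 - radius), min(flowacc.shape[0], row0 + radius + 1)
    c_lo, c_hi = max(0, col0 - radius), min(flowacc.shape[1], col0 + radius + 1)
    window = flowacc[r_lo:r_hi, c_lo:c_hi] * cell_area_km2  # upstream area per cell
    err = np.abs(window - reported_km2) / reported_km2      # relative mismatch
    r, c = np.unravel_index(np.argmin(err), err.shape)
    return r + r_lo, c + c_lo
```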

  8. Novel primers for complete mitochondrial cytochrome b gene sequencing in mammals

    USGS Publications Warehouse

    Naidu, Ashwin; Fitak, Robert R.; Munguia-Vega, Adrian; Culver, Melanie

    2011-01-01

    Sequence-based species identification relies on the extent and integrity of sequence data available in online databases such as GenBank. When identifying species from a sample of unknown origin, partial DNA sequences obtained from the sample are aligned against existing sequences in databases. When the sequence from the matching species is not present in the database, high-scoring alignments with closely related sequences might produce unreliable results on species identity. For species identification in mammals, the cytochrome b (cyt b) gene has been identified to be highly informative; thus, large amounts of reference sequence data from the cyt b gene are much needed. To enhance availability of cyt b gene sequence data on a large number of mammalian species in GenBank and other such publicly accessible online databases, we identified a primer pair for complete cyt b gene sequencing in mammals. Using this primer pair, we successfully PCR amplified and sequenced the complete cyt b gene from 40 of 44 mammalian species representing 10 orders of mammals. We submitted 40 complete, correctly annotated, cyt b protein coding sequences to GenBank. To our knowledge, this is the first single primer pair to amplify the complete cyt b gene in a broad range of mammalian species. This primer pair can be used for the addition of new cyt b gene sequences and to enhance data available on species represented in GenBank. The availability of novel and complete gene sequences as high-quality reference data can improve the reliability of sequence-based species identification.

  9. [Pharmacovigilance in Germany : It is about time].

    PubMed

    Douros, A; Schaefer, C; Kreutz, R; Garbe, E

    2016-06-01

    Pharmacovigilance is defined as the activities relating to the detection, assessment, and prevention of adverse drug reactions (ADRs). Although its beginnings in Germany date back more than 50 years, a stagnation in this field has been observed lately. Different tools of pharmacovigilance are illustrated here, and the reasons for its stagnation in Germany are elucidated. Spontaneous reporting systems are an important tool in pharmacovigilance and are based on reports of ADRs from treating physicians, other healthcare professionals, or patients. Due to several weaknesses of spontaneous reporting systems, such as underreporting, media bias, confounding by comorbidity or comedication, and the limited quality of the reports, the development of electronic healthcare databases for pharmacovigilance research was publicly funded in recent years. In the US, different electronic healthcare databases were merged in a publicly sponsored project covering more than 193 million individuals. In Germany, the establishment of large longitudinal databases was never conceived as a public duty and has not been implemented so far. Further attempts to use administrative healthcare data for pharmacovigilance purposes are severely restricted by the Code of Social Law (Section 75, Book 10). This situation has led to a stagnation of pharmacovigilance research in Germany. Without publicly funded large longitudinal healthcare databases and an amendment of Section 75, Book 10, of the Code of Social Law, the use of healthcare data in pharmacovigilance research in Germany will remain a rarity. This could have negative effects on the medical care of the general population.

  10. Impact of data base structure in a successful in vitro-in vivo correlation for pharmaceutical products.

    PubMed

    Roudier, B; Davit, B; Schütz, H; Cardot, J-M

    2015-01-01

    In vitro-in vivo correlation (IVIVC) (Food and Drug Administration 1997) aims to predict the in vivo performance of a pharmaceutical formulation based on its in vitro characteristics. It is a complex process that (i) incorporates a large amount of information in a gradual and incremental way and (ii) requires information on different properties (formulation, analytical, clinical) together with dedicated treatments of that information (statistics, modeling, simulation). This results in many studies that are initiated and integrated into the specifications (quality target product profile, QTPP). The QTPP in turn defines the appropriate experimental designs (quality by design, QbD) (Food and Drug Administration 2011, 2012), whose main objectives are the determination (i) of the key factors of development and manufacturing (critical process parameters, CPPs) and (ii) of the critical physicochemical attributes of the active pharmaceutical ingredient (API), the critical quality attributes (CQAs), which, if not accounted for, may have implications for the efficacy and safety of the product for the patient. These processes generate a very large amount of data that must be structured. In this context, the storage of information in a database (DB) and the management of this database (database management system, DBMS) become an important issue for the management of IVIVC projects and, more generally, for the development of new pharmaceutical forms. This article describes the implementation of a prototype object-oriented database (OODB), conceived as a decision-support tool that responds in a structured and consistent way to the project management issues of IVIVC (including bioequivalence and bioavailability) (Food and Drug Administration 2003) necessary for the implementation of the QTPP.

  11. Database of synesthetic color associations for Japanese kanji.

    PubMed

    Hamada, Daisuke; Yamamoto, Hiroki; Saiki, Jun

    2017-02-01

    Synesthesia is a neurological phenomenon in which certain types of stimuli elicit involuntary perceptions in an unrelated pathway. A common type of synesthesia is grapheme-color synesthesia, in which the visual perception of letters and numbers stimulates the perception of a specific color. Previous studies have often collected relatively small numbers of grapheme-color associations per synesthete, but the accumulation of a large quantity of data holds greater promise for uncovering the mechanisms underlying synesthetic association. In this study, we therefore collected large samples of data from a total of eight synesthetes. All told, we obtained over 1000 synesthetic colors associated with Japanese kanji characters from each of two synesthetes, over 100 synesthetic colors from each of three synesthetes, and about 80 synesthetic colors associated with Japanese hiragana, Latin letters, and Arabic numerals from each of three synesthetes. We then compiled the data into a database, called the KANJI-Synesthetic Colors Database (K-SCD), which has a total of 5122 colors for 483, 46, and 46 Japanese kanji, hiragana, and katakana characters, respectively, as well as for 26 Latin letters and ten Arabic numerals. In addition to introducing the K-SCD, this article demonstrates the database's merits by using two examples, in which two new rules for synesthetic association, "shape similarity" and "synesthetic color clustering," were found. The K-SCD is publicly accessible ( www.cv.jinkan.kyoto-u.ac.jp/site/uploads/K-SCD.xlsm ) and will be a valuable resource for those who wish to conduct statistical analyses using a rich dataset in order to uncover the rules governing synesthetic association and to understand its mechanisms.

  12. Sensing emotion in voices: Negativity bias and gender differences in a validation study of the Oxford Vocal ('OxVoc') sounds database.

    PubMed

    Young, Katherine S; Parsons, Christine E; LeBeau, Richard T; Tabak, Benjamin A; Sewart, Amy R; Stein, Alan; Kringelbach, Morten L; Craske, Michelle G

    2017-08-01

    Emotional expressions are an essential element of human interactions. Recent work has increasingly recognized that emotional vocalizations can color and shape interactions between individuals. Here we present data on the psychometric properties of a recently developed database of authentic nonlinguistic emotional vocalizations from human adults and infants (the Oxford Vocal 'OxVoc' Sounds Database; Parsons, Young, Craske, Stein, & Kringelbach, 2014). In a large sample (n = 562), we demonstrate that adults can reliably categorize these sounds (as 'positive,' 'negative,' or 'sounds with no emotion'), and rate valence in these sounds consistently over time. In an extended sample (n = 945, including the initial n = 562), we also investigated a number of individual difference factors in relation to valence ratings of these vocalizations. Results demonstrated small but significant effects of (a) symptoms of depression and anxiety with more negative ratings of adult neutral vocalizations (R2 = .011 and R2 = .008, respectively) and (b) gender differences in perceived valence such that female listeners rated adult neutral vocalizations more positively and infant cry vocalizations more negatively than male listeners (R2 = .021, R2 = .010, respectively). Of note, we did not find evidence of negativity bias among other affective vocalizations or gender differences in perceived valence of adult laughter, adult cries, infant laughter, or infant neutral vocalizations. Together, these findings largely converge with factors previously shown to impact processing of emotional facial expressions, suggesting a modality-independent impact of depression, anxiety, and listener gender, particularly among vocalizations with more ambiguous valence. (PsycINFO Database Record (c) 2017 APA, all rights reserved).

  13. Recent advances on terrain database correlation testing

    NASA Astrophysics Data System (ADS)

    Sakude, Milton T.; Schiavone, Guy A.; Morelos-Borja, Hector; Martin, Glenn; Cortes, Art

    1998-08-01

    Terrain database correlation is a major requirement for interoperability in distributed simulation. There are numerous situations in which terrain database correlation problems can occur that, in turn, lead to a lack of interoperability in distributed training simulations. Examples are the use of different run-time terrain databases derived from inconsistent source data, the use of different resolutions, and the use of different data models for both terrain and culture data. IST has been developing a suite of software tools, named ZCAP, to address terrain database interoperability issues. In this paper we discuss recent enhancements made to this suite, including improved algorithms for sampling and calculating line-of-sight, an improved method for measuring terrain roughness, and the application of a sparse matrix method to the terrain remediation solution developed at the Visual Systems Lab of the Institute for Simulation and Training. We review the application of some of these new algorithms to the terrain correlation measurement process. These new algorithms improve our support for very large terrain databases and provide the capability of performing test replications to estimate the sampling error of the tests. With this set of tools, a user can quantitatively assess the degree of correlation between large terrain databases.

  14. Data-driven indexing mechanism for the recognition of polyhedral objects

    NASA Astrophysics Data System (ADS)

    McLean, Stewart; Horan, Peter; Caelli, Terry M.

    1992-02-01

    This paper is concerned with the problem of searching large model databases. To date, most object recognition systems have concentrated on the problem of matching using simple searching algorithms. This is quite acceptable when the number of object models is small. However, in the future, general purpose computer vision systems will be required to recognize hundreds or perhaps thousands of objects and, in such circumstances, efficient searching algorithms will be needed. The problem of searching a large model database is one which must be addressed if future computer vision systems are to be at all effective. In this paper we present a method we call data-driven feature-indexed hypothesis generation as one solution to the problem of searching large model databases.
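
    The indexing idea can be illustrated with a toy sketch; the quantization step and voting scheme below are illustrative assumptions rather than the paper's exact method:

```python
# Illustrative sketch of feature indexing for model-database search:
# quantized invariant features vote for the models that contain them, so
# only a few candidate models need detailed matching.
from collections import defaultdict

def quantize(feature, step=5.0):
    return tuple(round(v / step) for v in feature)

def build_index(models):
    """models: dict model_id -> list of invariant feature vectors."""
    index = defaultdict(set)
    for model_id, features in models.items():
        for f in features:
            index[quantize(f)].add(model_id)
    return index

def hypotheses(index, scene_features):
    votes = defaultdict(int)
    for f in scene_features:
        for model_id in index.get(quantize(f), ()):
            votes[model_id] += 1
    # Rank candidate models by vote count; verify only the best few.
    return sorted(votes, key=votes.get, reverse=True)
```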

  15. Decreasing incidence of type 2 diabetes mellitus in the United States, 2007-2012: Epidemiologic findings from a large US claims database.

    PubMed

    Weng, Wayne; Liang, Yuanjie; Kimball, Edward S; Hobbs, Todd; Kong, Sheldon X; Sakurada, Brian; Bouchard, Jonathan

    2016-07-01

    To explore epidemiological trends in type 2 diabetes mellitus (T2D) in the US between 2007 and 2012 using a large US claims database, with a particular focus on demographics, prevalence, newly diagnosed cases, and comorbidities. Truven Health MarketScan® Databases were used to identify patients with claims evidence of T2D in the years 2007 and 2012. Newly diagnosed T2D was characterized by an absence of any T2D claims or related drug claims for 6 months preceding the index claim. Demographic and comorbidity characteristics of the prevalent and new-onset T2D groups were compared and analyzed descriptively for trends over time. The overall prevalence of T2D remained stable from 2007 (1.24 million cases/15.07 million enrolled; 8.2%) to 2012 (2.04 million cases/24.52 million enrolled; 8.3%), while the percentage of newly diagnosed cases fell dramatically from 2007 (152,252 cases; 1.1%) to 2012 (147,011 cases; 0.65%). The mean age of patients with prevalent T2D was similar in 2007 (60.6 y) and 2012 (60.0 y), while the mean age of newly diagnosed T2D patients decreased by 3 years from 2007 (57.7 y) to 2012 (54.8 y). Hypertension and hyperlipidemia were the most common comorbidities, evident in 50-75% of T2D patients, and increased markedly from 2007 to 2012 in both prevalent and new-onset T2D populations. Cardiovascular disease decreased slightly in prevalent (-0.9%) and new-onset (-2.8%) cases. This large US health claims database analysis suggests a stabilization in prevalence and declining incidence of T2D over a recent 5-year period, a downward shift in age at T2D diagnosis, but increases in several comorbidities. Copyright © 2016 Elsevier Ireland Ltd. All rights reserved.

  16. Automated identification and geometrical features extraction of individual trees from Mobile Laser Scanning data in Budapest

    NASA Astrophysics Data System (ADS)

    Koma, Zsófia; Székely, Balázs; Folly-Ritvay, Zoltán; Skobrák, Ferenc; Koenig, Kristina; Höfle, Bernhard

    2016-04-01

    Mobile Laser Scanning (MLS) is an evolving operational measurement technique for urban environments, providing large amounts of high-resolution information about trees, street features and pole-like objects along street sides and motorways. In this study we investigate a robust segmentation method for extracting individual trees automatically in order to build an object-based tree database system. We focused on large urban parks in Budapest (Margitsziget and Városliget; KARESZ project), which contain a large diversity of tree species. The MLS data comprised high-density point clouds with 1-8 cm mean absolute accuracy at 80-100 m distance from the streets. The segmentation method consists of the following steps: first, the ground points are determined. Second, cylinders are fitted in a vertical slice 1-1.5 m above ground, which determines the potential location of each single tree trunk and other cylinder-like objects. Finally, residual values are calculated as the deviation of each point from the vertically expanded fitted cylinder; these residuals are used to separate cylinder-like objects from individual trees. After successful parameterization, the model parameters and the corresponding residual values of each fitted object are extracted and imported into the tree database. Additionally, geometric features are calculated for each segmented individual tree, such as crown base, crown width, crown length, trunk diameter and volume. For incompletely scanned trees, the extraction of geometric features is based on fitted circles. The result of the study is a tree database containing detailed information about urban trees, which can be a valuable dataset for ecologists, city planners, and planting and mapping purposes. Furthermore, the established database will be the starting point for classifying the trees into single species. The MLS data used in this project were measured in the framework of the KARESZ project for the whole of Budapest. BSz contributed as an Alexander von Humboldt Research Fellow.
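
    The trunk-detection step rests on fitting circular cross-sections. Below is a minimal least-squares (Kåsa) circle fit, shown as an illustration of that step rather than the project's actual code:

```python
# Minimal least-squares circle fit (Kåsa method) to a horizontal slice of
# trunk points, illustrating the cylinder-fitting step described above.
import numpy as np

def fit_circle(x, y):
    """Fit x^2 + y^2 + D*x + E*y + F = 0; return center (cx, cy) and radius."""
    A = np.column_stack([x, y, np.ones_like(x)])
    b = -(x**2 + y**2)
    (D, E, F), *_ = np.linalg.lstsq(A, b, rcond=None)
    cx, cy = -D / 2.0, -E / 2.0
    r = np.sqrt(cx**2 + cy**2 - F)
    return (cx, cy), r

# Points from a 1-1.5 m slice of one trunk yield its center and diameter;
# residuals from the fitted circle help separate trunks from other objects.
```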

  17. Advances in Satellite Microwave Precipitation Retrieval Algorithms Over Land

    NASA Astrophysics Data System (ADS)

    Wang, N. Y.; You, Y.; Ferraro, R. R.

    2015-12-01

    Precipitation plays a key role in the earth's climate system, particularly in its water and energy balance. Satellite microwave (MW) observations of precipitation provide a viable means to achieve global measurement of precipitation with sufficient sampling density and accuracy. However, obtaining accurate precipitation information over land from satellite MW observations is a challenging problem. The Goddard Profiling Algorithm (GPROF) for the Global Precipitation Measurement (GPM) mission is built around a Bayesian formulation (Evans et al., 1995; Kummerow et al., 1996). GPROF uses the likelihood function and the prior probability distribution function to calculate the expected value of the precipitation rate, given the observed brightness temperatures. It is particularly convenient to draw samples for the prior PDF from a predefined database of observations or models. The GPROF algorithm does not search all database entries but only the subset thought to correspond to the actual observation. The GPM GPROF V1 database is stratified by surface emissivity class, land surface temperature and total precipitable water. However, there is much uncertainty as to what information is optimal for subsetting the database under different conditions. To this end, we conducted a database stratification study using National Mosaic and Multi-Sensor Quantitative Precipitation Estimation data, Special Sensor Microwave Imager/Sounder (SSMIS) and Advanced Technology Microwave Sounder (ATMS) observations, and reanalysis data from the Modern-Era Retrospective Analysis for Research and Applications (MERRA). Our database study (You et al., 2015) shows that environmental factors such as surface elevation, relative humidity, storm vertical structure and height, and ice thickness can help stratify a single large database into smaller and more homogeneous subsets, in which the surface conditions and precipitation vertical profiles are similar. The probability of detection (POD) increases by about 8% and 12% when stratified databases are used for rainfall and snowfall detection, respectively. In addition, by considering the relative humidity in the lower troposphere and the vertical velocity at 700 hPa in the precipitation detection process, the POD for snowfall detection is further increased by 20.4%, from 56.0% to 76.4%.
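
    The Bayesian expected-value retrieval at the core of GPROF can be sketched compactly. The diagonal channel-error model and the toy arrays below are assumptions, not the operational configuration:

```python
# Minimal sketch of the Bayesian retrieval idea: the retrieved rain rate is
# a database average weighted by how well each entry's simulated brightness
# temperatures match the observation.
import numpy as np

def bayesian_rain_rate(tb_obs, tb_db, rain_db, channel_sigma):
    """E[R | Tb] = sum_i w_i R_i / sum_i w_i, with w_i = exp(-0.5 * chi2_i)."""
    chi2 = np.sum(((tb_db - tb_obs) / channel_sigma) ** 2, axis=1)
    w = np.exp(-0.5 * (chi2 - chi2.min()))  # subtract min for numerical stability
    return np.sum(w * rain_db) / np.sum(w)

# tb_db: (N, n_channels) simulated Tb for N database profiles; rain_db: (N,).
# "Stratifying" the database simply means restricting tb_db/rain_db to the
# subset matching the observed environment (emissivity class, humidity, etc.).
```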

  18. HypoxiaDB: a database of hypoxia-regulated proteins

    PubMed Central

    Khurana, Pankaj; Sugadev, Ragumani; Jain, Jaspreet; Singh, Shashi Bala

    2013-01-01

    There has been intense interest in the cellular response to hypoxia, and a large number of differentially expressed proteins have been identified through various high-throughput experiments. These valuable data are scattered, and there have been no systematic attempts to document the various proteins regulated by hypoxia. Compilation, curation and annotation of these data are important in deciphering their role in hypoxia and hypoxia-related disorders. Therefore, we have compiled HypoxiaDB, a database of hypoxia-regulated proteins. It is a comprehensive, manually curated, non-redundant catalog of proteins whose expression has been shown experimentally to be altered at different levels and durations of hypoxia. The database currently contains 72,000 manually curated entries on 3500 proteins extracted from 73 peer-reviewed publications selected from PubMed. HypoxiaDB is distinctive from other generalized databases: (i) it compiles tissue-specific protein expression changes under different levels and durations of hypoxia, and provides manually curated literature references to support the inclusion of each protein in the database and establish its association with hypoxia. (ii) For each protein, HypoxiaDB integrates data on gene ontology, KEGG (Kyoto Encyclopedia of Genes and Genomes) pathway, protein-protein interactions, protein family (Pfam), OMIM (Online Mendelian Inheritance in Man), PDB (Protein Data Bank) structures and homology to other sequenced genomes. (iii) It also provides pre-compiled information on hypoxia proteins that otherwise requires tedious computational analysis, including chromosomal location and identifiers such as Entrez, HGNC, Unigene, Uniprot, Ensembl, Vega, GI numbers and GenBank accession numbers associated with the protein; these are cross-linked to the respective public databases, connecting HypoxiaDB to external repositories. (iv) In addition, HypoxiaDB provides an online sequence-similarity search tool for users to compare their protein sequences against the HypoxiaDB protein database. We hope that HypoxiaDB will enrich our knowledge about hypoxia-related biology and eventually lead to the development of novel hypotheses and advances in diagnostic and therapeutic activities. HypoxiaDB is freely accessible for academic and non-profit users via http://www.hypoxiadb.com. Database URL: http://www.hypoxiadb.com PMID:24178989

  19. Geospatial Data Management Platform for Urban Groundwater

    NASA Astrophysics Data System (ADS)

    Gaitanaru, D.; Priceputu, A.; Gogu, C. R.

    2012-04-01

    Due to the large number of civil works projects and research studies, large quantities of geo-data are produced for urban environments. These data are usually redundant and are spread across different institutions and private companies. Time-consuming operations like data processing and information harmonisation are the main reasons why the re-use of data is systematically avoided. Urban groundwater data show the same complex situation. The underground structures (subway lines, deep foundations, underground parkings, and others), the urban facility networks (sewer systems, water supply networks, heating conduits, etc.), the drainage systems, the surface water works and many others change continuously; as a consequence, their influence on groundwater changes systematically. These activities nevertheless provide a large quantity of data, so aquifer modelling and behaviour prediction can be carried out using monitored quantitative and qualitative parameters. Due to the rapid evolution of technology in the past few years, transferring large amounts of information through the internet has become a feasible solution for sharing geoscience data, and standard platform-independent means to do this have been developed (specific mark-up languages like GML, GeoSciML, WaterML, GWML, CityML). They allow large geospatial databases to be easily updated and shared through the internet, even between different companies or research centres that do not necessarily use the same database structures. For Bucharest City (Romania), an integrated platform for groundwater geospatial data management is being developed in the framework of a national research project, "Sedimentary media modeling platform for groundwater management in urban areas" (SIMPA), financed by the National Authority for Scientific Research of Romania. The platform architecture is based on three components: a geospatial database, a desktop application (a complex set of hydrogeological and geological analysis tools) and a front-end geoportal service. The SIMPA platform makes use of mark-up transfer standards to provide a user-friendly application that can be accessed through the internet to query, analyse, and visualise geospatial data related to urban groundwater. The platform holds the information within the local groundwater geospatial databases, and the user is able to access these data through a geoportal service. The database architecture allows storing accurate and very detailed geological, hydrogeological, and infrastructure information that can be straightforwardly generalized and upscaled. The geoportal service offers the possibility of querying a dataset from the spatial database. The query is coded in a standard mark-up language and sent to the server through the standard Hypertext Transfer Protocol (HTTP) to be processed by the local application. After validation of the query, the results are sent back to the user to be displayed by the geoportal application. The main advantage of the SIMPA platform is that it offers the user the possibility of making a primary multi-criteria query, which results in a smaller set of data to be analysed afterwards. This improves both the transfer process and the user's means of creating the desired query.
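
    The query round-trip described above can be sketched as follows; the endpoint, parameter names, and response tags are placeholders, since the actual SIMPA service interface is not reproduced here:

```python
# Hypothetical sketch of a multi-criteria geoportal query over HTTP, with an
# XML response parsed on the client side. All names are assumptions.
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

BASE_URL = "http://example.org/simpa/geoportal"  # placeholder endpoint

def query_boreholes(bbox, min_depth_m):
    params = urllib.parse.urlencode({
        "request": "GetFeature",
        "layer": "boreholes",
        "bbox": ",".join(map(str, bbox)),  # minx, miny, maxx, maxy
        "minDepth": min_depth_m,
    })
    with urllib.request.urlopen(f"{BASE_URL}?{params}") as resp:
        tree = ET.parse(resp)              # server answers in a mark-up schema
    return [el.attrib for el in tree.iter("borehole")]
```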

  20. [Establishment of a comprehensive database for laryngeal cancer related genes and the miRNAs].

    PubMed

    Li, Mengjiao; E, Qimin; Liu, Jialin; Huang, Tingting; Liang, Chuanyu

    2015-09-01

    By collecting and analyzing laryngeal cancer-related genes and miRNAs, we aimed to build a comprehensive laryngeal cancer-related gene database that, unlike current biological information databases with complex and unwieldy structures, focuses on the themes of genes and miRNAs, making research and teaching more convenient and efficient. Based on a browser/server (B/S) architecture, using Apache as the web server, MySQL as the database management system and PHP as the web scripting language, a comprehensive database for laryngeal cancer-related genes was established, providing gene tables, protein tables, miRNA tables and clinical information tables for patients with laryngeal cancer. The established database contains 207 laryngeal cancer-related genes, 243 proteins and 26 miRNAs, together with detailed information such as mutations, methylation, differential expression, and the empirical references for laryngeal cancer-relevant molecules. The database can be accessed and operated via the internet for browsing and retrieval of information, and is maintained and updated regularly. The database for laryngeal cancer-related genes is resource-integrated and user-friendly, providing a genetic information query tool for the study of laryngeal cancer.
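
    The kind of schema the record describes can be sketched as follows (sqlite3 stands in here for the MySQL backend; table and column names are illustrative, not the database's actual schema):

```python
# Illustrative gene/miRNA schema with a join query of the sort a research or
# teaching front end would issue.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE gene (
    gene_id     INTEGER PRIMARY KEY,
    symbol      TEXT NOT NULL,
    mutation    TEXT,
    methylation TEXT,
    expression  TEXT,
    reference   TEXT                      -- empirical literature reference
);
CREATE TABLE mirna (
    mirna_id    INTEGER PRIMARY KEY,
    name        TEXT NOT NULL,
    target_gene INTEGER REFERENCES gene(gene_id)
);
""")
conn.execute("INSERT INTO gene (symbol, expression) VALUES ('TP53', 'down')")
conn.execute("INSERT INTO mirna (name, target_gene) VALUES ('miR-21', 1)")
for row in conn.execute("""SELECT g.symbol, m.name FROM gene g
                           JOIN mirna m ON m.target_gene = g.gene_id"""):
    print(row)
```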

  1. Generation and analysis of expressed sequence tags from six developing xylem libraries in Pinus radiata D. Don

    PubMed Central

    Li, Xinguo; Wu, Harry X; Dillon, Shannon K; Southerton, Simon G

    2009-01-01

    Background Wood is a major renewable natural resource for the timber, fibre and bioenergy industry. Pinus radiata D. Don is the most important commercial plantation tree species in Australia and several other countries; however, genomic resources for this species are very limited in public databases. Our primary objective was to sequence a large number of expressed sequence tags (ESTs) from genes involved in wood formation in radiata pine. Results Six developing xylem cDNA libraries were constructed from earlywood and latewood tissues sampled at juvenile (7 yrs), transition (11 yrs) and mature (30 yrs) ages, respectively. These xylem tissues represent six typical development stages in a rotation period of radiata pine. A total of 6,389 high quality ESTs were collected from 5,952 cDNA clones. Assembly of the 5,952 ESTs from 5' end sequences generated 3,304 unigenes, including 952 contigs and 2,352 singletons. About 97.0% of the 5,952 ESTs and 96.1% of the unigenes have matches in the UniProt and TIGR databases. Of the 3,174 unigenes with matches, 42.9% were not assigned GO (Gene Ontology) terms and their functions are unknown or unclassified. More than half (52.1%) of the 5,952 ESTs have matches in the Pfam database and represent 772 known protein families. About 18.0% of the 5,952 ESTs matched cell wall related genes in the MAIZEWALL database, representing all 18 categories, 91 of all 174 families and possibly 557 genes. Fifteen cell wall-related genes rank among the 30 most abundant genes, including CesA, tubulin, AGP, SAMS, actin, laccase, CCoAMT, MetE, phytocyanin, pectate lyase, cellulase, SuSy, expansin, chitinase and UDP-glucose dehydrogenase. Based on the PlantTFDB database, 41 of the 64 transcription factor families in the poplar genome were identified as being involved in radiata pine wood formation. Comparative analysis of GO term abundance revealed a distinct transcriptome in juvenile earlywood formation compared to other stages of wood development. Conclusion The first large-scale genomic resource in radiata pine was generated from six developing xylem cDNA libraries. Cell wall-related genes and transcription factors were identified. Juvenile earlywood has a distinct transcriptome, which is likely to contribute to the undesirable properties of juvenile wood in radiata pine. This publicly available radiata pine resource will also be valuable for gene function studies and comparative genomics in forest trees. PMID:19159482

  2. A Chemoinformatics Approach to the Discovery of Lead-Like Molecules from Marine and Microbial Sources En Route to Antitumor and Antibiotic Drugs

    PubMed Central

    Pereira, Florbela; Latino, Diogo A. R. S.; Gaudêncio, Susana P.

    2014-01-01

    The comprehensive information on small molecules and their biological activities in the PubChem database allows chemoinformatics researchers to access and make use of large-scale biological activity data to improve the precision of drug profiling. A Quantitative Structure-Activity Relationship (QSAR) classification approach was used to predict active/inactive compounds with respect to overall biological activity, antitumor activity and antibiotic activity, using a data set of 1804 compounds from PubChem. Using the best classification models for antibiotic and antitumor activities, a data set of marine and microbial natural products from the AntiMarin database was screened, and 57 and 16 new lead compounds were proposed for antibiotic and antitumor drug design, respectively. All compounds proposed by our approach are classified as non-antibiotic and non-antitumor compounds in the AntiMarin database. Several of the lead-like compounds proposed by us have recently been reported as active in the literature. PMID:24473174
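
    A hedged sketch of a QSAR classification workflow of this kind (the authors' exact descriptors and models are not reproduced here): Morgan fingerprints from RDKit feed a random-forest active/inactive classifier.

```python
# Toy QSAR classification: circular fingerprints -> random forest.
from rdkit import Chem
from rdkit.Chem import AllChem
from sklearn.ensemble import RandomForestClassifier

def fingerprint(smiles, n_bits=2048):
    mol = Chem.MolFromSmiles(smiles)
    return list(AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=n_bits))

# Toy training pairs (SMILES, active); the real work used 1804 PubChem compounds.
train = [("CCO", 0), ("c1ccccc1O", 1), ("CC(=O)Oc1ccccc1C(=O)O", 1), ("C", 0)]
X = [fingerprint(s) for s, _ in train]
y = [label for _, label in train]
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(clf.predict([fingerprint("CCN")]))  # screen a new natural product
```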

  3. EU Laws on Privacy in Genomic Databases and Biobanking.

    PubMed

    Townend, David

    2016-03-01

    Both the European Union and the Council of Europe have a bearing on privacy in genomic databases and biobanking. In terms of legislation, the processing of personal data as it relates to the right to privacy is currently largely regulated in Europe by Directive 95/46/EC, which requires that processing be "fair and lawful" and follow a set of principles, meaning that the data be processed only for stated purposes, be sufficient for the purposes of the processing, be kept only for so long as is necessary to achieve those purposes, and be kept securely and only in an identifiable state for such time as is necessary for the processing. The European privacy regime does not require the de-identification (anonymization) of personal data used in genomic databases or biobanks, and alongside this practice informed consent as well as governance and oversight mechanisms provide for the protection of genomic data. © 2016 American Society of Law, Medicine & Ethics.

  4. ERAIZDA: a model for holistic annotation of animal infectious and zoonotic diseases

    PubMed Central

    Buza, Teresia M.; Jack, Sherman W.; Kirunda, Halid; Khaitsa, Margaret L.; Lawrence, Mark L.; Pruett, Stephen; Peterson, Daniel G.

    2015-01-01

    There is an urgent need for a unified resource that integrates trans-disciplinary annotations of emerging and reemerging animal infectious and zoonotic diseases. Such data integration will provide a valuable opportunity for epidemiologists, researchers and health policy makers to make data-driven decisions designed to improve animal health. Integrating emerging and reemerging animal infectious and zoonotic disease data from a large variety of sources into a unified open-access resource provides a stronger basis for achieving a better understanding of infectious and zoonotic diseases. We have developed a model for interlinking annotations of these diseases, which are of particular interest because of the threats they pose to animal health, human health and global health security. We demonstrated the application of this model using brucellosis, an infectious and zoonotic disease. Preliminary annotations were deposited into the VetBioBase database (http://vetbiobase.igbb.msstate.edu). This database is associated with user-friendly tools to facilitate searching, retrieving and downloading of disease-related information. Database URL: http://vetbiobase.igbb.msstate.edu PMID:26581408

  5. Estrogen alters the profile of the transcriptome in river snail Bellamya aeruginosa.

    PubMed

    Lei, Kun; Liu, Ruizhi; An, Li-Hui; Luo, Ying-Feng; LeBlanc, Gerald A

    2015-03-01

    We evaluated the transcriptome dynamics of the freshwater river snail Bellamya aeruginosa exposed to 17β-estradiol (E2) using the Roche/454 GS-FLX platform. In total, 41,869 unigenes, with an average length of 586 bp, representing 36,181 contigs and 5,688 singlets, were obtained. Among them, 18.08%, 36.85%, and 25.47% matched sequences in the GenBank non-redundant nucleic acid database, non-redundant protein database, and Swiss protein database, respectively. Annotation of the unigenes with gene ontology, and mapping them to biological pathways, revealed large groups of genes related to growth, development, reproduction, signal transduction, and defense mechanisms. Significant differences in gene expression were found in both liver and testicular tissues between control and E2-exposed organisms. These changes in gene expression will help in understanding the molecular mechanisms of the response to physiological stress in the river snail exposed to estrogen, and will facilitate research into biological processes and the underlying physiological adaptations to xenoestrogen exposure in gastropods.

  6. PubSearch and PubFetch: a simple management system for semiautomated retrieval and annotation of biological information from the literature.

    PubMed

    Yoo, Danny; Xu, Iris; Berardini, Tanya Z; Rhee, Seung Yon; Narayanasamy, Vijay; Twigger, Simon

    2006-03-01

    For most systems in biology, a large body of literature exists that describes the complexity of the system based on experimental results. Manual review of this literature to extract targeted information into biological databases is difficult and time consuming. To address this problem, we developed PubSearch and PubFetch, which store literature, keyword, and gene information in a relational database, index the literature with keywords and gene names, and provide a Web user interface for annotating the genes from experimental data found in the associated literature. A set of protocols is provided in this unit for installing, populating, running, and using PubSearch and PubFetch. In addition, we provide support protocols for performing controlled vocabulary annotations. Intended users of PubSearch and PubFetch are database curators and biology researchers interested in tracking the literature and capturing information about genes of interest in a more effective way than with conventional spreadsheets and lab notebooks.
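
    A minimal sketch of the storage layout the unit describes (literature, genes, and their index links in a relational database); sqlite3 is used here for self-containment, and all names and values are illustrative:

```python
# Toy literature/gene index: papers are indexed with gene names so curators
# can retrieve everything written about a gene of interest.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE paper   (pmid INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE gene    (gene_id TEXT PRIMARY KEY, name TEXT);
CREATE TABLE indexed (pmid    INTEGER REFERENCES paper(pmid),
                      gene_id TEXT    REFERENCES gene(gene_id));
""")
db.execute("INSERT INTO paper VALUES (1, 'Example paper about NAC001')")
db.execute("INSERT INTO gene VALUES ('AT1G01010', 'NAC001')")
db.execute("INSERT INTO indexed VALUES (1, 'AT1G01010')")

# A curator asks: which papers mention my gene of interest?
rows = db.execute("""SELECT p.pmid, p.title FROM paper p
                     JOIN indexed i ON i.pmid = p.pmid
                     WHERE i.gene_id = ?""", ("AT1G01010",)).fetchall()
print(rows)
```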

  7. Knowledge Discovery in Biological Databases for Revealing Candidate Genes Linked to Complex Phenotypes.

    PubMed

    Hassani-Pak, Keywan; Rawlings, Christopher

    2017-06-13

    Genetics and "omics" studies designed to uncover genotype to phenotype relationships often identify large numbers of potential candidate genes, among which the causal genes are hidden. Scientists generally lack the time and technical expertise to review all relevant information available from the literature, from key model species and from a potentially wide range of related biological databases in a variety of data formats with variable quality and coverage. Computational tools are needed for the integration and evaluation of heterogeneous information in order to prioritise candidate genes and components of interaction networks that, if perturbed through potential interventions, have a positive impact on the biological outcome in the whole organism without producing negative side effects. Here we review several bioinformatics tools and databases that play an important role in biological knowledge discovery and candidate gene prioritization. We conclude with several key challenges that need to be addressed in order to facilitate biological knowledge discovery in the future.

  8. First web-based database on total phenolics and oxygen radical absorbance capacity (ORAC) of fruits produced and consumed within the south Andes region of South America.

    PubMed

    Speisky, Hernan; López-Alarcón, Camilo; Gómez, Maritza; Fuentes, Jocelyn; Sandoval-Acuña, Cristian

    2012-09-12

    This paper reports the first database on antioxidants contained in fruits produced and consumed within the south Andes region of South America. The database ( www.portalantioxidantes.com ) contains over 500 total phenolics (TP) and ORAC values for more than 120 species/varieties of fruits. All analyses were conducted by a single ISO/IEC 17025-certified laboratory. The characterization comprised native berries such as maqui ( Aristotelia chilensis ), murtilla ( Ugni molinae ), and calafate ( Barberis microphylla ), which largely outscored all other studied fruits. Major differences in TP and ORAC were observed as a function of the fruit variety in berries, avocado, cherries, and apples. In fruits such as pears, apples, apricots, and peaches, a significant part of the TP and ORAC was accounted for by the antioxidants present in the peel. These data should be useful to estimate the fruit-based intake of TP and, through the ORAC data, their antioxidant-related contribution to the diet of south Andes populations.

  9. CHEMICAL STRUCTURE INDEXING OF TOXICITY DATA ON ...

    EPA Pesticide Factsheets

    Standardized chemical structure annotation of public toxicity databases and information resources is playing an increasingly important role in the 'flattening' and integration of diverse sets of biological activity data on the Internet. This review discusses public initiatives that are accelerating the pace of this transformation, with particular reference to toxicology-related chemical information. Chemical content annotators, structure locator services, large structure/data aggregator web sites, structure browsers, International Union of Pure and Applied Chemistry (IUPAC) International Chemical Identifier (InChI) codes, toxicity data models and public chemical/biological activity profiling initiatives are all playing a role in overcoming barriers to the integration of toxicity data, and are bringing researchers closer to the reality of a mineable chemical Semantic Web. An example of this integration of data is provided by the collaboration among researchers involved with the Distributed Structure-Searchable Toxicity (DSSTox) project, the Carcinogenic Potency Project, projects at the National Cancer Institute and the PubChem database.
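
    The role InChI codes play in such integration is easy to demonstrate: the same structure yields the same identifier however it was drawn. RDKit is used here as one freely available toolkit; this is an illustration, not part of the review:

```python
# Two different SMILES drawings of phenol map to one InChI, which is what
# makes InChI useful as a cross-database linking key.
from rdkit import Chem

for smiles in ("c1ccccc1O", "Oc1ccccc1"):
    mol = Chem.MolFromSmiles(smiles)
    print(Chem.MolToInchi(mol))     # identical InChI both times
    print(Chem.MolToInchiKey(mol))  # hashed key often used for lookup
```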

  10. The Molecule Pages database

    PubMed Central

    Saunders, Brian; Lyon, Stephen; Day, Matthew; Riley, Brenda; Chenette, Emily; Subramaniam, Shankar

    2008-01-01

    The UCSD-Nature Signaling Gateway Molecule Pages (http://www.signaling-gateway.org/molecule) provides essential information on more than 3800 mammalian proteins involved in cellular signaling. The Molecule Pages contain expert-authored and peer-reviewed information based on the published literature, complemented by regularly updated information derived from public data source references and sequence analysis. The expert-authored data includes both a full-text review about the molecule, with citations, and highly structured data for bioinformatics interrogation, including information on protein interactions and states, transitions between states and protein function. The expert-authored pages are anonymously peer reviewed by the Nature Publishing Group. The Molecule Pages data is present in an object-relational database format and is freely accessible to the authors, the reviewers and the public from a web browser that serves as a presentation layer. The Molecule Pages are supported by several applications that along with the database and the interfaces form a multi-tier architecture. The Molecule Pages and the Signaling Gateway are routinely accessed by a very large research community. PMID:17965093

  12. Medical data mining: knowledge discovery in a clinical data warehouse.

    PubMed Central

    Prather, J. C.; Lobach, D. F.; Goodwin, L. K.; Hales, J. W.; Hage, M. L.; Hammond, W. E.

    1997-01-01

    Clinical databases have accumulated large quantities of information about patients and their medical conditions. Relationships and patterns within this data could provide new medical knowledge. Unfortunately, few methodologies have been developed and applied to discover this hidden knowledge. In this study, the techniques of data mining (also known as Knowledge Discovery in Databases) were used to search for relationships in a large clinical database. Specifically, data accumulated on 3,902 obstetrical patients were evaluated for factors potentially contributing to preterm birth using exploratory factor analysis. Three factors were identified by the investigators for further exploration. This paper describes the processes involved in mining a clinical database including data warehousing, data query and cleaning, and data analysis. PMID:9357597
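
    The exploratory-factor-analysis step can be sketched with scikit-learn on made-up data; the study's 3,902-patient warehouse and its variable set are not reproduced here:

```python
# Toy exploratory factor analysis: find groups of clinical variables that
# move together, as candidates for clinically meaningful factors.
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))  # rows = patients, columns = cleaned variables

fa = FactorAnalysis(n_components=3, random_state=0)
fa.fit(X)
# Loadings show which variables load on which factor; high-loading groups
# are what investigators would review for clinical meaning.
print(np.round(fa.components_, 2))
```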

  13. Nosql for Storage and Retrieval of Large LIDAR Data Collections

    NASA Astrophysics Data System (ADS)

    Boehm, J.; Liu, K.

    2015-08-01

    Developments in LiDAR technology over the past decades have made LiDAR a mature and widely accepted source of geospatial information. This in turn has led to an enormous growth in data volume. The central idea behind a file-centric storage of LiDAR point clouds is the observation that large collections of LiDAR data are typically delivered as large collections of files, rather than single files of terabyte size. This split of the dataset, commonly referred to as tiling, is usually done to accommodate a specific processing pipeline, so it makes sense to preserve it. A document-oriented NoSQL database can easily emulate this data partitioning by representing each tile (file) in a separate document. The document stores the metadata of the tile, while the actual files are stored in a distributed file system emulated by the NoSQL database. We demonstrate the use of MongoDB, a highly scalable document-oriented NoSQL database, for storing large LiDAR files. MongoDB, like any NoSQL database, allows queries on the attributes of the document, and notably it also supports spatial queries. Hence we can perform spatial queries on the bounding boxes of the LiDAR tiles. The speed of inserting and retrieving files on a cloud-based database is compared to that of a native file system and cloud storage.
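
    A minimal pymongo sketch of this file-centric layout: one document per tile holding metadata plus a GridFS handle to the file, with a 2dsphere index enabling the spatial queries mentioned above. The connection string, database, and field names are illustrative:

```python
# One LiDAR tile = one file = one document; GridFS plays the role of the
# distributed file store, and a 2dsphere index serves bounding-box queries.
import gridfs
from pymongo import MongoClient, GEOSPHERE

db = MongoClient("mongodb://localhost:27017")["lidar"]
fs = gridfs.GridFS(db)

with open("tile_001.las", "rb") as f:
    file_id = fs.put(f, filename="tile_001.las")

bbox = {"type": "Polygon", "coordinates": [[
    [-0.14, 51.50], [-0.12, 51.50], [-0.12, 51.52], [-0.14, 51.52], [-0.14, 51.50]
]]}
db.tiles.insert_one({"filename": "tile_001.las", "file_id": file_id,
                     "points": 12_500_000, "bbox": bbox})
db.tiles.create_index([("bbox", GEOSPHERE)])

# Spatial query: which tiles intersect an area of interest?
aoi = {"type": "Polygon", "coordinates": [[
    [-0.135, 51.505], [-0.125, 51.505], [-0.125, 51.515], [-0.135, 51.515],
    [-0.135, 51.505]
]]}
for tile in db.tiles.find({"bbox": {"$geoIntersects": {"$geometry": aoi}}}):
    print(tile["filename"])  # fetch the bytes via fs.get(tile["file_id"])
```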

  14. Pathology report data extraction from relational database using R, with extraction from reports on melanoma of skin as an example.

    PubMed

    Ye, Jay J

    2016-01-01

    Different methods have been described for data extraction from pathology reports, with varying degrees of success. Here a technique for directly extracting data from a relational database is described. Our department uses synoptic reports, modified from College of American Pathologists (CAP) Cancer Protocol Templates, to report most of our cancer diagnoses. Choosing the melanoma of skin synoptic report as an example, the R scripting language, extended with the RODBC package, was used to query the pathology information system database. Reports containing the melanoma of skin synoptic report from the past four and a half years were retrieved, and individual data elements were extracted. Using the retrieved list of cases, the database was queried a second time to retrieve/extract the lymph node staging information in the subsequent reports from the same patients. 426 synoptic reports corresponding to unique lesions of melanoma of skin were retrieved, and the data elements of interest were extracted into an R data frame. The distribution of Breslow depth of melanomas, grouped by year, is used as an example of intra-report data extraction and analysis. When new pN staging information was present in the subsequent reports, 82% (77/94) was retrieved precisely (pN0, pN1, pN2 and pN3); an additional 15% (14/94) was retrieved with some ambiguity (positive, or knowing there had been an update). The specificity was 100% for both. The relationship between Breslow depth and lymph node status is graphed as an example of lesion-specific, multi-report data extraction and analysis. R extended with the RODBC package is a simple and versatile approach well suited to the above tasks. The success or failure of retrieval and extraction depends largely on whether the reports are consistently formatted and whether the contents of the elements are consistently phrased. This approach can easily be modified and adopted for other pathology information systems that use a relational database for data management.
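
    Since the unit's R/RODBC code is not reproduced here, the following is a rough Python analogue (pyodbc plus regular expressions) of the same idea; the DSN, table, and column names are hypothetical:

```python
# Query the pathology information system directly, then parse one synoptic
# data element (Breslow depth) out of each matching report.
import re
import pyodbc

conn = pyodbc.connect("DSN=pathology_lis;UID=reader;PWD=secret")  # placeholder DSN
cur = conn.cursor()
cur.execute("""SELECT accession_no, report_text FROM final_reports
               WHERE report_text LIKE '%Melanoma of skin synoptic%'""")

breslow = re.compile(r"Breslow depth[:\s]+([\d.]+)\s*mm", re.IGNORECASE)
for accession_no, text in cur.fetchall():
    m = breslow.search(text)
    if m:
        print(accession_no, float(m.group(1)))  # one data element per report
```

    As the record notes, the fragility of this approach lies in the regular expressions: they only work if the synoptic fields are consistently formatted and phrased.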

  15. Incorporating Aquatic Interspecies Toxicity Estimates into Large Databases: Model Evaluations and Data Gains

    EPA Science Inventory

    The Chemical Aquatic Fate and Effects (CAFE) database, developed by NOAA’s Emergency Response Division (ERD), is a centralized data repository that allows for unrestricted access to fate and effects data. While this database was originally designed to help support decisions...

  16. A Relational Database System for Student Use.

    ERIC Educational Resources Information Center

    Fertuck, Len

    1982-01-01

    Describes an APL implementation of a relational database system suitable for use in a teaching environment in which database development and database administration are studied, and discusses the functions of the user and the database administrator. An appendix illustrating system operation and an eight-item reference list are attached. (Author/JL)

  17. Fossil-Fuel CO2 Emissions Database and Exploration System

    NASA Astrophysics Data System (ADS)

    Krassovski, M.; Boden, T.

    2012-04-01

    The Carbon Dioxide Information Analysis Center (CDIAC) at Oak Ridge National Laboratory (ORNL) quantifies the release of carbon from fossil-fuel use and cement production each year at global, regional, and national spatial scales. These estimates are vital to climate change research given the strong evidence that fossil-fuel emissions are responsible for unprecedented levels of carbon dioxide (CO2) in the atmosphere. The CDIAC fossil-fuel emissions time series are based largely on annual energy statistics published for all nations by the United Nations (UN). Publications containing historical energy statistics make it possible to estimate fossil-fuel CO2 emissions back to 1751, before the Industrial Revolution. From these core fossil-fuel CO2 emission time series, CDIAC has developed a number of additional data products to satisfy modeling needs and to address other questions aimed at improving our understanding of the global carbon cycle budget. For example, CDIAC also produces a time series of gridded fossil-fuel CO2 emission estimates and isotopic (e.g., 13C) emission estimates. The gridded data are generated using the methodology described in Andres et al. (2011) and provide monthly and annual estimates for 1751-2008 at 1° latitude by 1° longitude resolution. These gridded emission estimates are being used in the latest IPCC Scientific Assessment (AR4). Isotopic estimates are possible thanks to detailed information for individual nations regarding the carbon content of select fuels (e.g., the carbon signature of natural gas from Russia). CDIAC has recently developed a relational database to house these baseline emission estimates and associated derived products, together with a web-based interface to help users worldwide query these data holdings. Users can identify, explore, and download desired CDIAC fossil-fuel CO2 emissions data. This presentation introduces the architecture and design of the new relational database and web interface, summarizes the present state and functionality of the Fossil-Fuel CO2 Emissions Database and Exploration System, and highlights future plans for expansion of the relational database and interface.
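
    As a rough illustration of what a relational back end for such a system might look like, here is a minimal sketch in Python/SQLite; the schema and names are assumptions for illustration, not the actual CDIAC design.

    ```python
    import sqlite3

    conn = sqlite3.connect("cdiac_demo.db")
    conn.executescript("""
    CREATE TABLE IF NOT EXISTS nation (code TEXT PRIMARY KEY, name TEXT);
    CREATE TABLE IF NOT EXISTS emission (
        nation_code TEXT REFERENCES nation(code),
        year        INTEGER,          -- 1751 onward
        total_kt_c  REAL,             -- thousand metric tons of carbon
        PRIMARY KEY (nation_code, year)
    );
    """)

    # The web interface described above would translate a user's selection
    # into a query like this one: a national fossil-fuel CO2 time series.
    rows = conn.execute(
        "SELECT year, total_kt_c FROM emission "
        "WHERE nation_code = ? AND year BETWEEN ? AND ? ORDER BY year",
        ("USA", 1950, 2008),
    ).fetchall()
    ```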

  18. Idaho and Montana non-fuel exploration database 1980-1997

    USGS Publications Warehouse

    Buckingham, David A.; DiFrancesco, Carl A.; Porter, Kenneth E.; Bleiwas, Donald I.; Causey, J. Douglas; Ferguson, William B.

    2006-01-01

    This report describes a relational database containing information about mineral exploration projects in the States of Idaho and Montana for the years 1980 through 1997, and a spatial (geographic) database constructed using data from the relational database. The focus of this project was to collect information on exploration for mineral commodities, with the exception of sand, gravel, coal, geothermal resources, oil, and gas. The associated databases supplied with this report are prototypes that can be used or modified as needed. The following sources were used to create the databases: serial mining periodicals; annual mineral publications; mining company reports; U.S. Bureau of Mines (USBM) and U.S. Geological Survey (USGS) publications; an Idaho mineral property database developed by Dave Boleneus, USGS, Spokane, Washington; Montana state publications; and discussions with representatives of Montana, principally the Montana Bureau of Mines and Geology and the Department of Environmental Quality. Fifty commodity groups were reported among the 596 exploration projects identified in this study. Precious metals (gold, silver, or platinum group elements) were the primary targets for about 67 percent of the exploration projects. Information on 17 of the projects did not include commodities. No location could be determined for 51 projects, all in Idaho. During the time period evaluated, some mineral properties were developed into large mining operations (for example, the Beal Mountain, Stillwater, Troy, and Montana Tunnels Mines) and six properties were reclaimed. Environmental Impact Statements were completed for four properties. Some operating mines either closed or went through one or more shutdowns and re-openings. Other properties, where significant resources were delineated by exploration during this time frame, await the outcome of factors important for development, such as the definition of additional reserves, higher metal prices, and the permitting process. Many of these projects examined relatively minor mineral occurrences. Approximately half of the exploration projects are located on Federal lands, and about 40 percent were on lands managed by the U.S. Forest Service. More than 75 percent of the exploration occurred in areas with significant previous mineral activity.

  19. The LUCK study: Laxative Usage in patients with GP-diagnosed Constipation in the UK, within the general population and in pregnancy. An epidemiological study using the General Practice Research Database (GPRD)

    PubMed Central

    Shafe, Anna C. E.; Lee, Sally; Dalrymple, Jamie S. O.; Whorwell, Peter J.

    2011-01-01

    Background: Despite the high prevalence of constipation and its related public health implications, there is relatively little research available on the condition from large epidemiological studies. The aim of this study was to investigate the epidemiology of general practitioner (GP)-diagnosed constipation and the prescribing trends for laxatives in the UK, within the general population and during pregnancy. Methods: A cohort study for the period from 2005 to 2009 was performed using the UK primary care database (General Practice Research Database), which contains information on over 3 million individuals. Results: The prevalence of GP-diagnosed constipation ranged from 12 per 1000 persons in 2005 (0.012 per person year) to 12.8 per 1000 in 2009 (0.013 per person year). The prevalence was almost twice as high in women as in men, and was higher in older patients. In 2005 the most commonly prescribed laxatives were lactulose (37%), senna (26%), macrogol (19%), ispaghula (6%), docusate sodium (5%), bisacodyl (4%) and glycerol suppositories (2%). By 2009, this pattern had changed: macrogol (31%), lactulose (29%), senna (22%), ispaghula (5%), docusate sodium (6%), bisacodyl (3%) and glycerol suppositories (3%). In pregnancy, lactulose accounted for 81% of laxative use in 2005, falling to 64% by 2009. In contrast, macrogol use in pregnancy rose from 13% in 2005 to 32% in 2009. Conclusions: GP-diagnosed constipation is common, accounting for a large number of consultations. Laxative prescribing trends have changed over the 5-year study period, prescriptions for macrogol becoming increasingly common and prescriptions for lactulose and senna less common. Macrogol also appears to have been replacing lactulose for treating constipation in pregnant women. PMID:22043228

  20. Attenuation relation for strong motion in Eastern Java based on appropriate database and method

    NASA Astrophysics Data System (ADS)

    Mahendra, Rian; Rohadi, Supriyanto; Rudyanto, Ariska

    2017-07-01

    The selection and determination of an attenuation relation is important for seismic hazard assessment in active seismic regions. This research first constructs an appropriate strong-motion database, including site condition and earthquake type. The data set consists of a large number of earthquakes of 5 ≤ Mw ≤ 9 at distances of less than 500 km that occurred around Java from 2009 to 2016. Earthquake locations and depths were relocated using the double-difference method to improve the quality of the database. Strong-motion data from twelve BMKG accelerographs located in East Java are used. Site condition is characterized using the dominant period and Vs30. Earthquake type is classified into crustal, interface, and intraslab events based on slab geometry analysis. A total of 10 Ground Motion Prediction Equations (GMPEs) are tested against the database using the Likelihood method (Scherbaum et al., 2004) and the Euclidean Distance Ranking method (Kale and Akkar, 2012). The evaluation leads to a set of GMPEs that can be applied for seismic hazard in East Java, where the strong-motion data were collected. Because the candidate GMPEs still showed high deviations, some of them were modified using an inversion method. Validation was performed by analyzing the attenuation curves of the selected GMPE against observation data for the period 2015 to 2016. The results show that the selected GMPE is suitable for estimating PGA values in East Java.
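
    As a highly simplified illustration of distance-based GMPE ranking, the sketch below ranks candidate models by the Euclidean distance between observed and predicted log-PGA vectors; the full method of Kale and Akkar (2012) additionally accounts for model sigma, and the arrays here are placeholder data.

    ```python
    import numpy as np

    obs_log_pga = np.log(np.array([0.012, 0.034, 0.008, 0.051]))  # observed (g)

    def rank_gmpes(predictions: dict) -> list:
        """Return (name, distance) pairs sorted from best- to worst-fitting."""
        scores = {
            name: float(np.linalg.norm(obs_log_pga - np.log(pred)))
            for name, pred in predictions.items()
        }
        return sorted(scores.items(), key=lambda kv: kv[1])

    candidates = {
        "GMPE_A": np.array([0.010, 0.040, 0.007, 0.060]),
        "GMPE_B": np.array([0.020, 0.070, 0.015, 0.090]),
    }
    print(rank_gmpes(candidates))   # GMPE_A fits these placeholder records better
    ```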

  1. Copycats in Pilot Aircraft-Assisted Suicides after the Germanwings Incident

    PubMed Central

    Vuorio, Alpo; Bor, Robert; Budowle, Bruce; Navathe, Pooshan; Pukkala, Eero; Sajantila, Antti

    2018-01-01

    Aircraft-assisted pilot suicide is a rare but serious phenomenon. The aim of this study was to evaluate changes in pilot aircraft-assisted suicide risk, i.e., a copycat effect, in the U.S. and Germany after the Germanwings 2015 incident in the French Alps. Aircraft-assisted pilot suicides were searched in the U.S. National Transportation Safety Board (NTSB) accident investigation database and in the German Bundesstelle für Flugunfalluntersuchung (BFU) Reports of Investigation database for the five years before and two years after the deliberate crash of the Germanwings flight into the French Alps in 2015. The relative risk (RR) of aircraft-assisted pilot suicide was calculated. In the two years after the incident, three out of 454 (0.66%) fatal incidents were aircraft-assisted suicides, compared with six out of 1292 (0.46%) in the prior five years in the NTSB database. There were no aircraft-assisted pilot suicides in the German database during the two years after or five years prior to the Germanwings crash. The relative aircraft-assisted pilot suicide risk for the U.S. was 1.4 (95% CI 0.3–4.2), which was not statistically significant. Six of the pilots who died by suicide had told someone of their suicidal intentions. We consider the changes in the rate to be within normal variation. Responsible media coverage of aircraft incidents is important given the large amount of publicity these events attract. PMID:29534475

  2. Compiling Holocene RSL databases from near- to far-field regions: proxies, difficulties and possible solutions

    NASA Astrophysics Data System (ADS)

    Vacchi, M.; Horton, B.; Mann, T.; Engelhart, S. E.; Rovere, A.; Nikitina, D.; Bender, M.; Roy, K.; Peltier, W. R.

    2017-12-01

    Reconstructions of relative sea level (RSL) have implications for the investigation of crustal movements, the calibration of earth rheology models, and the reconstruction of ice sheets. In recent years, efforts have been made to create RSL databases following a standardized methodology. These regional databases provide a framework for developing our understanding of the primary mechanisms of RSL change since the Last Glacial Maximum, and a long-term baseline against which to gauge changes in sea level during the 20th century and forecasts for the 21st. We report here the results of recently compiled databases from very different climatic and geographic contexts: the northeastern Canadian coast, the Mediterranean Sea, and Southeast Asia. Our re-evaluation of sea-level indicators from geological and archaeological investigations has yielded more than 3000 RSL data points, mainly from salt and freshwater wetlands or adjacent estuarine sediment, isolation basins, beach ridges, fixed biological indicators, and beachrocks, as well as coastal archaeological structures. We outline some of the inherent difficulties, and potential solutions, in analysing sea-level data from such different depositional environments. In particular, we discuss problems related to the definition of a standardized indicative meaning and to the re-evaluation of old radiocarbon samples. We further address complex tectonic influences and a framework for comparing such a large variety of RSL data points. Finally, we discuss the implications of our results for the patterns of glacio-isostatic adjustment in these regions.

  3. Relational-database model for improving quality assurance and process control in a composite manufacturing environment

    NASA Astrophysics Data System (ADS)

    Gentry, Jeffery D.

    2000-05-01

    A relational database is a powerful tool for collecting and analyzing the vast amounts of interrelated data associated with the manufacture of composite materials. A relational database contains many individual database tables that store data related in some fashion. Manufacturing process variables as well as quality assurance measurements can be collected and stored in database tables indexed according to lot numbers, part type, or individual serial numbers. Relationships between manufacturing process and product quality can then be correlated over a wide range of product types and process variations. This paper presents details on how relational databases are used to collect, store, and analyze process variables and quality assurance data associated with the manufacture of advanced composite materials. Important considerations are covered, including how the various types of data are organized and how relationships between the data are defined. Employing relational database techniques to establish correlative relationships between process variables and quality assurance measurements is then explored. Finally, the benefits of database techniques such as data warehousing, data mining, and web-based client/server architectures are discussed in the context of composite material manufacturing.

  4. Routine health insurance data for scientific research: potential and limitations of the Agis Health Database.

    PubMed

    Smeets, Hugo M; de Wit, Niek J; Hoes, Arno W

    2011-04-01

    Observational studies performed within routine health care databases have the advantage of their large size and, when the aim is to assess the effect of interventions, can offer a complement to randomized controlled trials, which usually involve small samples in experimental settings. Institutional Health Insurance Databases (HIDs) are attractive for research because of their large size, their longitudinal perspective, and their practice-based information. As they are based on financial reimbursement, the information is generally reliable. The database of one of the major insurance companies in the Netherlands, the Agis Health Database (AHD), is described in detail. Whether the AHD data sets meet the specific requirements for conducting several types of clinical studies is discussed according to the classification of the four different types of clinical research: diagnostic, etiologic, prognostic, and intervention research. The potential of the AHD for these various types of research is illustrated using examples of studies recently conducted in the AHD. HIDs such as the AHD offer large potential for several types of clinical research, in particular etiologic and intervention studies, but at present the lack of detailed clinical information is an important limitation. Copyright © 2011 Elsevier Inc. All rights reserved.

  5. Image-based query-by-example for big databases of galaxy images

    NASA Astrophysics Data System (ADS)

    Shamir, Lior; Kuminski, Evan

    2017-01-01

    Very large astronomical databases containing millions or even billions of galaxy images have become increasingly important tools in astronomy research. However, in many cases their very large size makes it difficult to analyze these data manually, reinforcing the need for computer algorithms that can automate the analysis. An example of such a task is the identification of galaxies of a certain morphology of interest. For instance, if a rare galaxy is identified, it is reasonable to expect that more galaxies of similar morphology exist in the database, but it is virtually impossible to search these databases manually to identify them. Here we describe computer vision and pattern recognition methodology that receives a galaxy image as input and automatically searches a large dataset of galaxies, returning a list of galaxies that are visually similar to the query galaxy. The returned list is not necessarily complete or clean, but it provides a substantial reduction of the original database into a smaller dataset in which the frequency of objects visually similar to the query galaxy is much higher. Experimental results show that the algorithm can identify rare galaxies such as ring galaxies among datasets of 10,000 astronomical objects.
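
    A minimal sketch of the query-by-example idea follows: compute a compact descriptor for every image offline, then answer queries with a nearest-neighbour search. The radial-brightness descriptor here is a stand-in for the paper's morphology features, not its actual method.

    ```python
    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def descriptor(img: np.ndarray, n_bins: int = 16) -> np.ndarray:
        """Mean brightness in concentric rings around the image centre."""
        h, w = img.shape
        yy, xx = np.mgrid[0:h, 0:w]
        r = np.hypot(yy - h / 2, xx - w / 2)
        bins = np.linspace(0, r.max() + 1e-9, n_bins + 1)
        which = np.digitize(r.ravel(), bins) - 1
        sums = np.bincount(which, weights=img.ravel(), minlength=n_bins)[:n_bins]
        counts = np.maximum(np.bincount(which, minlength=n_bins)[:n_bins], 1)
        return sums / counts

    # Offline: index the whole collection (random stand-in images here).
    library = np.random.rand(10000, 64, 64)
    features = np.array([descriptor(im) for im in library])
    index = NearestNeighbors(n_neighbors=50).fit(features)

    # Online: return the 50 images most visually similar to the query galaxy.
    _, neighbours = index.kneighbors(descriptor(library[0]).reshape(1, -1))
    ```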

  6. Relational Databases and Biomedical Big Data.

    PubMed

    de Silva, N H Nisansa D

    2017-01-01

    In various biomedical applications that collect, handle, and manipulate data, the amounts of data tend to build up and venture into the range identified as big data. In such cases, a design decision has to be taken as to what type of database will be used to handle the data. More often than not, according to past research, the default and classical solution in the biomedical domain has been relational databases. While this was the norm for a long while, there is an evident trend to move away from relational databases in favor of other types and paradigms of databases. However, it remains of paramount importance to understand the interrelation between biomedical big data and relational databases. This chapter reviews the pros and cons of using relational databases to store biomedical big data that previous research has discussed and used.

  7. DBMap: a TreeMap-based framework for data navigation and visualization of brain research registry

    NASA Astrophysics Data System (ADS)

    Zhang, Ming; Zhang, Hong; Tjandra, Donny; Wong, Stephen T. C.

    2003-05-01

    The purpose of this study is to investigate and apply a new, intuitive, and space-conscious visualization framework to facilitate efficient data presentation and exploration of large-scale data warehouses. We have implemented the DBMap framework for the UCSF Brain Research Registry. Such a utility helps medical specialists and clinical researchers better explore and evaluate the many attributes organized in the brain research registry. The current UCSF Brain Research Registry consists of a federation of disease-oriented database modules, including epilepsy, brain tumor, intracerebral hemorrhage, and CJD (Creutzfeldt-Jakob disease). These database modules organize large volumes of imaging and non-imaging data to support web-based clinical research. While the data warehouse supports general information retrieval and analysis, it has lacked an effective way to visualize and present the voluminous and complex data stored. This study investigates whether the TreeMap algorithm can be adapted to display and navigate a categorical biomedical data warehouse or registry. TreeMap is a space-constrained graphical representation of large hierarchical data sets, mapped to a matrix of rectangles whose size and color represent database fields of interest. It allows the display of a large amount of numerical and categorical information within the limited real estate of a computer screen, with an intuitive user interface. The paper describes DBMap, the proposed data visualization framework for large biomedical databases. Built upon XML, Java, and JDBC technologies, the prototype system includes a set of software modules that reside in the application server tier and interface with the back-end database tier and the front-end web tier of the brain registry.
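
    For illustration, here is a minimal recursive slice-and-dice treemap, the simplest member of the TreeMap family the study builds on: each leaf gets a rectangle whose area is proportional to its weight, with horizontal and vertical splits alternating per level. The weights and screen size are made up.

    ```python
    def treemap(node, x, y, w, h, horizontal=True):
        """node is a weight or a list of child nodes; returns leaf rectangles."""
        if not isinstance(node, list):
            return [(x, y, w, h)]

        def weight(n):
            return sum(weight(c) for c in n) if isinstance(n, list) else n

        total = float(weight(node))
        rects, offset = [], 0.0
        for child in node:
            frac = weight(child) / total
            if horizontal:        # split this strip left-to-right
                rects += treemap(child, x + offset, y, w * frac, h, False)
                offset += w * frac
            else:                 # split this strip top-to-bottom
                rects += treemap(child, x, y + offset, w, h * frac, True)
                offset += h * frac
        return rects

    # Hypothetical registry modules with per-attribute leaf weights,
    # laid out in an 800x600 pixel screen area.
    print(treemap([[120, 45], [30, 15]], 0, 0, 800, 600))
    ```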

  8. Individual differences in loss aversion and preferences for skewed risks across adulthood.

    PubMed

    Seaman, Kendra L; Green, Mikella A; Shu, Stephen; Samanez-Larkin, Gregory R

    2018-06-01

    In a previous study, we found adult age differences in the tendency to accept more positively skewed gambles (with a small chance of a large win) than other equivalent risks, or an age-related positive-skew bias. In the present study, we examined whether loss aversion explained this bias. A total of 508 healthy participants (ages 21-82) completed measures of loss aversion and skew preference. Age was not related to loss aversion. Although loss aversion was a significant predictor of gamble acceptance, it did not influence the age-related positive-skew bias. (PsycINFO Database Record (c) 2018 APA, all rights reserved).

  9. Publications of the Western Geologic Mapping Team 1997-1998

    USGS Publications Warehouse

    Stone, Paul; Powell, C.L.

    1999-01-01

    The Western Geologic Mapping Team (WGMT) of the U.S. Geological Survey, Geologic Division (USGS, GD), conducts geologic mapping and related topical earth-science studies in the western United States. This work is focused on areas where modern geologic maps and associated earth-science data are needed to address key societal and environmental issues such as ground-water quality, potential geologic hazards, and land-use decisions. Areas of primary emphasis currently include southern California, the San Francisco Bay region, the Pacific Northwest, the Las Vegas urban corridor, and selected National Park lands. The team has its headquarters in Menlo Park, California, and maintains smaller field offices at several other locations in the western United States. The results of research conducted by the WGMT are released to the public as a variety of databases, maps, text reports, and abstracts, both through the internal publication system of the USGS and in diverse external publications such as scientific journals and books. This report lists publications of the WGMT released in calendar years 1997 and 1998. Most of the publications listed were authored or coauthored by WGMT staff. However, the list also includes some publications authored by formal non-USGS cooperators with the WGMT, as well as some authored by USGS staff outside the WGMT in cooperation with WGMT projects. Several of the publications listed are available on the World Wide Web; for these, URL addresses are provided. Most of these Web publications are USGS open-file reports that contain large digital databases of geologic map and related information. For these, the bibliographic citation refers specifically to an explanatory pamphlet containing information about the content and accessibility of the database, not to the actual map or related information comprising the database itself.

  10. Analysis of Outcomes After TKA: Do All Databases Produce Similar Findings?

    PubMed

    Bedard, Nicholas A; Pugely, Andrew J; McHugh, Michael; Lux, Nathan; Otero, Jesse E; Bozic, Kevin J; Gao, Yubo; Callaghan, John J

    2018-01-01

    Use of large clinical and administrative databases for orthopaedic research has increased exponentially. Each database represents a unique patient population and varies in its methodology of data acquisition, which makes it possible that similar research questions posed to different databases might result in answers that differ in important ways. (1) What are the differences in reported demographics, comorbidities, and complications for patients undergoing primary TKA among four databases commonly used in orthopaedic research? (2) How does the difference in reported complication rates vary depending on whether only inpatient data or 30-day postoperative data are analyzed? Patients who underwent primary TKA during 2010 to 2012 were identified within the National Surgical Quality Improvement Program (NSQIP), the Nationwide Inpatient Sample (NIS), the Medicare Standard Analytic Files (MED), and the Humana Administrative Claims database (HAC). NSQIP is a clinical registry that captures both inpatient and outpatient events up to 30 days after surgery using clinical reviewers and strict definitions for each variable. The other databases are administrative claims databases whose comorbidity and adverse event data are defined by the diagnosis and procedure codes used for reimbursement. NIS is limited to inpatient data only, whereas HAC and MED also include outpatient data. The number of patients undergoing primary TKA from each database was 48,248 in HAC, 783,546 in MED, 393,050 in NIS, and 43,220 in NSQIP. NSQIP definitions for comorbidities and surgical complications were matched to corresponding International Classification of Diseases, 9th Revision/Current Procedural Terminology codes, and these coding algorithms were used to query NIS, MED, and HAC. Age, sex, comorbidities, and inpatient versus 30-day postoperative complications were compared across the four databases. Given the large sample sizes, statistical significance was often detected for small, clinically unimportant differences; thus, comparisons focused on whether a difference reached an absolute twofold threshold signifying clinical importance. Although there was a higher proportion of males in NIS and NSQIP and patients in NIS were younger, the differences were slight and well below our predefined threshold for clinical importance. There was variation in the prevalence of comorbidities and rates of postoperative complications among databases. The prevalence of chronic obstructive pulmonary disease (COPD) and coagulopathy in HAC and MED was more than twice that in NIS and NSQIP (relative risk [RR] for COPD: MED versus NIS 3.1, MED versus NSQIP 4.5, HAC versus NIS 3.6, HAC versus NSQIP 5.3; RR for coagulopathy: MED versus NIS 3.9, MED versus NSQIP 3.1, HAC versus NIS 3.3, HAC versus NSQIP 2.7; p < 0.001 for all comparisons). NSQIP recorded more than twice the obesity of NIS (RR 0.35). Rates of stroke within 30 days of TKA differed more than twofold among all databases (p < 0.001). HAC had more than twice the rate of 30-day complications at all endpoints compared with NSQIP, and more than twice the 30-day infections of MED. A comparison of inpatient and 30-day complication rates demonstrated that more than twice as many wound infections and deep vein thromboses are captured when data are analyzed out to 30 days after TKA (p < 0.001 for all comparisons). When evaluating research utilizing large databases, one must pay particular attention to the type of database used (administrative claims, clinical registry, or other kinds of databases), the time period included, the definitions utilized for specific variables, and the population captured, to ensure it is best suited for the specific research question. Furthermore, with the advent of bundled payments, policymakers must meticulously consider the data sources used to ensure the data analytics match historical sources. Level III, therapeutic study.

  11. Exploring the feasibility of traditional image querying tasks for industrial radiographs

    NASA Astrophysics Data System (ADS)

    Bray, Iliana E.; Tsai, Stephany J.; Jimenez, Edward S.

    2015-08-01

    Although there have been great strides in object recognition with optical images (photographs), there has been comparatively little research into object recognition for X-ray radiographs. Our exploratory work contributes to this area by creating an object recognition system designed to recognize components from a related database of radiographs. Object recognition for radiographs must be approached differently than for optical images, because radiographs have much less color-based information with which to distinguish objects, and they exhibit transmission overlap that alters perceived object shapes. The dataset used in this work contained more than 55,000 intermixed radiographs and photographs, all in compressed JPEG form and with multiple ways of describing pixel information. A robust and efficient system is therefore needed to combat the problems presented by the properties of the X-ray imaging modality, the large size of the database, and the quality of the images it contains. We have explored various pre-processing techniques to clean the cluttered, low-quality images in the database, and we have developed our object recognition system by combining multiple object detection and feature extraction methods. We present preliminary results from the still-evolving hybrid object recognition system.
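
    Below is a hedged sketch of the kind of pre-processing and feature-extraction pipeline described above, using common OpenCV operators (non-local-means denoising, CLAHE contrast enhancement, ORB features). These are plausible stand-ins, not necessarily the authors' exact choices.

    ```python
    import cv2

    img = cv2.imread("radiograph.jpg", cv2.IMREAD_GRAYSCALE)

    denoised = cv2.fastNlMeansDenoising(img, h=10)          # suppress sensor noise
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    enhanced = clahe.apply(denoised)                        # local contrast boost

    orb = cv2.ORB_create(nfeatures=500)                     # colour-free features
    keypoints, descriptors = orb.detectAndCompute(enhanced, None)

    # Descriptors from all database radiographs can then be matched against a
    # query image with a Hamming-distance matcher.
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    ```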

  12. DSSTOX WEBSITE LAUNCH: IMPROVING PUBLIC ACCESS ...

    EPA Pesticide Factsheets

    DSSTox Website Launch: Improving Public Access to Databases for Building Structure-Toxicity Prediction Models. Ann M. Richard, US Environmental Protection Agency, Research Triangle Park, NC, USA. Distributed: a decentralized set of standardized, field-delimited databases, each separately authored and maintained, able to accommodate diverse toxicity data content. Structure-Searchable: standard-format (SDF) structure-data files that can be readily imported into available chemical relational databases and structure-searched. Tox: toxicity data as it exists in widely disparate forms in current public databases, spanning diverse toxicity endpoints, test systems, levels of biological content, degrees of summarization, and information content. INTRODUCTION: The economic and social pressures to reduce the need for animal testing and to better anticipate the potential for human and eco-toxicity of environmental, industrial, or pharmaceutical chemicals are as pressing today as at any time prior. However, the goal of predicting chemical toxicity in its many manifestations, the 'T' in 'ADMET' (absorption, distribution, metabolism, elimination, toxicity), remains one of the most difficult and largely unmet challenges in a chemical screening paradigm [1]. It is widely acknowledged that the single greatest hurdle to improving structure-activity relationship (SAR) toxicity prediction capabilities, in both the pharmaceutical and environmental regulation arenas, is the lack of suffici...

  13. HUNT: launch of a full-length cDNA database from the Helix Research Institute.

    PubMed

    Yudate, H T; Suwa, M; Irie, R; Matsui, H; Nishikawa, T; Nakamura, Y; Yamaguchi, D; Peng, Z Z; Yamamoto, T; Nagai, K; Hayashi, K; Otsuki, T; Sugiyama, T; Ota, T; Suzuki, Y; Sugano, S; Isogai, T; Masuho, Y

    2001-01-01

    The Helix Research Institute (HRI) in Japan is releasing 4356 HUman Novel Transcripts and related information in the newly established HUNT database. The institute is a joint research project principally funded by the Japanese Ministry of International Trade and Industry, and the clones were sequenced in the governmental New Energy and Industrial Technology Development Organization (NEDO) Human cDNA Sequencing Project. The HUNT database contains an extensive amount of annotation from advanced analysis and represents an essential bioinformatics contribution towards understanding gene function. The HRI human cDNA clones were obtained from full-length-enriched cDNA libraries constructed with the oligo-capping method and have yielded novel full-length cDNA sequences. A large fraction has little similarity to any protein of known function, and to obtain clues about possible function we have developed original analysis procedures. Any putative function deduced here can be validated or refuted by complementary analysis results. Users can also extract information from specific categories such as PROSITE patterns, PFAM domains, PSORT localization, transmembrane helices, and clones with GENIUS structure assignments. The HUNT database can be accessed at http://www.hri.co.jp/HUNT.

  14. Arabidopsis Hormone Database: a comprehensive genetic and phenotypic information database for plant hormone research in Arabidopsis

    PubMed Central

    Peng, Zhi-yu; Zhou, Xin; Li, Linchuan; Yu, Xiangchun; Li, Hongjiang; Jiang, Zhiqiang; Cao, Guangyu; Bai, Mingyi; Wang, Xingchun; Jiang, Caifu; Lu, Haibin; Hou, Xianhui; Qu, Lijia; Wang, Zhiyong; Zuo, Jianru; Fu, Xiangdong; Su, Zhen; Li, Songgang; Guo, Hongwei

    2009-01-01

    Plant hormones are small organic molecules that influence almost every aspect of plant growth and development. Genetic and molecular studies have revealed a large number of genes that are involved in responses to numerous plant hormones, including auxin, gibberellin, cytokinin, abscisic acid, ethylene, jasmonic acid, salicylic acid, and brassinosteroid. Here, we develop an Arabidopsis hormone database, which aims to provide a systematic and comprehensive view of genes participating in plant hormonal regulation, as well as morphological phenotypes controlled by plant hormones. Based on data from mutant studies, transgenic analysis, and gene ontology (GO) annotation, we have identified a total of 1026 genes in the Arabidopsis genome that participate in plant hormone functions. Meanwhile, a phenotype ontology has been developed to precisely describe the myriad hormone-regulated morphological processes with standardized vocabularies. A web interface (http://ahd.cbi.pku.edu.cn) allows users quick access to information about these hormone-related genes, including sequences, functional category, mutant information, phenotypic description, microarray data, and linked publications. Several applications of this database in studying plant hormonal regulation and hormone cross-talk are presented and discussed. PMID:19015126

  15. Chesapeake Bay Program Water Quality Database

    EPA Pesticide Factsheets

    The Chesapeake Information Management System (CIMS), designed in 1996, is an integrated, accessible information management system for the Chesapeake Bay region. CIMS is an organized, distributed library of information and software tools designed to increase basin-wide public access to Chesapeake Bay information. The information delivered by CIMS includes technical and public information, educational material, environmental indicators, policy documents, and scientific data. Through the use of relational databases, web-based programming, and web-based GIS, a large number of Internet resources have been established. These resources include multiple distributed online databases, on-demand graphing and mapping of environmental data, and geographic search tools for environmental information. Baseline monitoring data, summarized data, and environmental indicators that document ecosystem status and trends, and that confirm linkages between water quality, habitat quality and abundance, and the distribution and integrity of biological populations, are also available. One of the major features of the CIMS network is the Chesapeake Bay Program's Data Hub, which provides users access to a suite of long-term water quality and living resources databases. Chesapeake Bay mainstem and tidal tributary water quality, benthic macroinvertebrate, toxics, plankton, and fluorescence data can be obtained for a network of over 800 monitoring stations.

  17. A Relational Algebra Query Language for Programming Relational Databases

    ERIC Educational Resources Information Center

    McMaster, Kirby; Sambasivam, Samuel; Anderson, Nicole

    2011-01-01

    In this paper, we describe a Relational Algebra Query Language (RAQL) and Relational Algebra Query (RAQ) software product we have developed that allows database instructors to teach relational algebra through programming. Instead of defining query operations using mathematical notation (the approach commonly taken in database textbooks), students…
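
    RAQL's actual syntax is not reproduced here; the sketch below simply illustrates the underlying idea of teaching relational algebra through programming, with the core operators written as plain functions over tables modelled as lists of dictionaries.

    ```python
    def select(table, predicate):                 # sigma: filter rows
        return [row for row in table if predicate(row)]

    def project(table, *attrs):                   # pi: keep named attributes
        return [{a: row[a] for a in attrs} for row in table]

    def join(left, right, on):                    # equijoin on one attribute
        return [{**l, **r} for l in left for r in right if l[on] == r[on]]

    students = [{"sid": 1, "name": "Ana"}, {"sid": 2, "name": "Ben"}]
    grades = [{"sid": 1, "course": "DB", "mark": 92}]

    # pi_name( sigma_mark>90( students join grades ) )
    print(project(select(join(students, grades, "sid"),
                         lambda r: r["mark"] > 90), "name"))
    ```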

  18. Large Differences in Global and Regional Total Soil Carbon Stock Estimates Based on SoilGrids, HWSD, and NCSCD: Intercomparison and Evaluation Based on Field Data From USA, England, Wales, and France

    NASA Astrophysics Data System (ADS)

    Tifafi, Marwa; Guenet, Bertrand; Hatté, Christine

    2018-01-01

    Soils are the major component of the terrestrial ecosystem and the largest organic carbon reservoir on Earth. However, they are a nonrenewable natural resource, especially reactive to human disturbance and climate change. Despite its importance, soil carbon dynamics is an important source of uncertainty for future climate predictions, and there is a growing need for more precise information to better understand the mechanisms controlling soil carbon dynamics and to better constrain Earth system models. The aim of our work is to compare the soil organic carbon stocks given by different existing global and regional databases. We calculated global and regional soil carbon stocks to 1 m depth from three existing databases (SoilGrids, the Harmonized World Soil Database, and the Northern Circumpolar Soil Carbon Database). We observed that the total stocks predicted by each product differ greatly: around 3,400 Pg according to SoilGrids and about 2,500 Pg according to the Harmonized World Soil Database. This difference is particularly marked for boreal regions, where it can be related to high disparities in soil organic carbon concentration. Differences in other regions are more limited and may be related to differences in bulk density estimates. Finally, evaluation of the three data sets against ground truth data shows that (i) there is a significant difference in spatial patterns between the ground truth data and the compared data sets and (ii) the data sets underestimate the soil organic carbon stock by more than 40% compared to field data.

  19. A phenome database (NEAUHLFPD) designed and constructed for broiler lines divergently selected for abdominal fat content.

    PubMed

    Li, Min; Dong, Xiang-yu; Liang, Hao; Leng, Li; Zhang, Hui; Wang, Shou-zhi; Li, Hui; Du, Zhi-Qiang

    2017-05-20

    Effective management and analysis of precisely recorded phenotypic traits are important components of the selection and breeding of superior livestock. Over two decades, we divergently selected chicken lines for abdominal fat content at Northeast Agricultural University (Northeast Agricultural University High and Low Fat, NEAUHLF) and collected a large volume of phenotypic data related to the investigation of the molecular genetic basis of adipose tissue deposition in broilers. To effectively and systematically store, manage, and analyze these phenotypic data, we built the NEAUHLF Phenome Database (NEAUHLFPD). NEAUHLFPD includes the following phenotypic records: pedigree (generations 1-19) and 29 phenotypes, such as body sizes and weights, carcass traits, and their corresponding rates. The design and construction of NEAUHLFPD were executed as follows. (1) Framework design: we used Apache as our web server, MySQL and Navicat as database management tools, and PHP as the HTML-embedded language to create a dynamic, interactive website. (2) Structural components: the main interface provides a detailed introduction to the composition, function, and index buttons of the basic structure of the database. The functional modules of NEAUHLFPD have two main components: the first is the physical storage space for phenotypic data, in which the data can be indexed, filtered, range-set, and searched; the second provides basic descriptive statistics, in which data filtered from the database can be used to compute basic statistical parameters with simultaneous conditional sorting. NEAUHLFPD can be used to effectively store and manage not only phenotypic but also genotypic and genomic data, which will facilitate further investigation of the molecular genetic basis of chicken adipose tissue growth and development and expedite the selection and breeding of broilers with low fat content.
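
    A minimal sketch of the second functional module (filtering followed by descriptive statistics and conditional sorting), rendered here with pandas and SQLAlchemy rather than the PHP/MySQL stack the authors describe; the table and column names are illustrative assumptions.

    ```python
    import pandas as pd
    from sqlalchemy import create_engine

    engine = create_engine("mysql+pymysql://user:pw@localhost/neauhlfpd")
    df = pd.read_sql(
        "SELECT line, generation, abdominal_fat_pct, body_weight_g "
        "FROM phenotype WHERE generation BETWEEN 1 AND 19",
        engine,
    )

    # Filter, then compute the basic statistical parameters the module exposes,
    # sorted conditionally on the mean.
    stats = (df[df["body_weight_g"] > 0]
             .groupby(["line", "generation"])["abdominal_fat_pct"]
             .agg(["count", "mean", "std", "min", "max"])
             .sort_values("mean", ascending=False))
    print(stats.head())
    ```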

  20. Developing a Large Lexical Database for Information Retrieval, Parsing, and Text Generation Systems.

    ERIC Educational Resources Information Center

    Conlon, Sumali Pin-Ngern; And Others

    1993-01-01

    Important characteristics of lexical databases and their applications in information retrieval and natural language processing are explained. An ongoing project using various machine-readable sources to build a lexical database is described, and detailed designs of individual entries with examples are included. (Contains 66 references.) (EAM)

  1. NREL Opens Large Database of Inorganic Thin-Film Materials | News | NREL

    Science.gov Websites

    April 3, 2018 - An extensive experimental database of inorganic thin-film materials developed by the National Renewable Energy Laboratory (NREL) is now publicly available: the High Throughput Experimental Materials (HTEM) database. (Photo credit: Schroeder / NREL.) "All existing experimental databases either contain many entries or have all this...

  2. Active Exploration of Large 3D Model Repositories.

    PubMed

    Gao, Lin; Cao, Yan-Pei; Lai, Yu-Kun; Huang, Hao-Zhi; Kobbelt, Leif; Hu, Shi-Min

    2015-12-01

    With broader availability of large-scale 3D model repositories, the need for efficient and effective exploration becomes more and more urgent. Existing model retrieval techniques do not scale well with the size of the database since often a large number of very similar objects are returned for a query, and the possibilities to refine the search are quite limited. We propose an interactive approach where the user feeds an active learning procedure by labeling either entire models or parts of them as "like" or "dislike" such that the system can automatically update an active set of recommended models. To provide an intuitive user interface, candidate models are presented based on their estimated relevance for the current query. From the methodological point of view, our main contribution is to exploit not only the similarity between a query and the database models but also the similarities among the database models themselves. We achieve this by an offline pre-processing stage, where global and local shape descriptors are computed for each model and a sparse distance metric is derived that can be evaluated efficiently even for very large databases. We demonstrate the effectiveness of our method by interactively exploring a repository containing over 100 K models.
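
    The sketch below illustrates the relevance-feedback loop in miniature: precomputed descriptor vectors are scored against the query plus the labelled examples with a simple Rocchio-style update, which stands in for the paper's learned sparse distance metric; the data are random placeholders.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    descriptors = rng.random((100_000, 128))   # one precomputed vector per model

    def recommend(query, liked, disliked, k=20):
        """Return indices of the k models to show next."""
        target = query.copy()
        if liked:                               # pull towards liked models
            target += descriptors[liked].mean(axis=0)
        if disliked:                            # push away from disliked ones
            target -= 0.5 * descriptors[disliked].mean(axis=0)
        dists = np.linalg.norm(descriptors - target, axis=1)
        return np.argsort(dists)[:k]

    # The user marks models 3 and 17 "like" and model 42 "dislike"; the
    # active set of recommendations is updated accordingly.
    print(recommend(descriptors[3], liked=[3, 17], disliked=[42]))
    ```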

  3. Vehicle-triggered video compression/decompression for fast and efficient searching in large video databases

    NASA Astrophysics Data System (ADS)

    Bulan, Orhan; Bernal, Edgar A.; Loce, Robert P.; Wu, Wencheng

    2013-03-01

    Video cameras are widely deployed along city streets, interstate highways, traffic lights, stop signs, and toll booths by entities that perform traffic monitoring and law enforcement. The videos captured by these cameras are typically compressed and stored in large databases. A rapid search for a specific vehicle within a large database of compressed videos is often required and can be a time-critical, life-or-death situation. In this paper, we propose video compression and decompression algorithms that enable fast and efficient vehicle or, more generally, event searches in large video databases. The proposed algorithm selects reference frames (i.e., I-frames) based on a vehicle having been detected at a specified position within the scene being monitored while compressing a video sequence. A search for a specific vehicle in the compressed video stream is performed across the reference frames only, which does not require decompression of the full video sequence as in traditional search algorithms. Our experimental results on videos captured on a local road show that the proposed algorithm significantly reduces the search space (thus reducing time and computational resources) in vehicle search tasks within compressed video streams, particularly those captured in light traffic conditions.
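
    A sketch of the trigger logic: while encoding, promote a frame to a reference (I-)frame whenever enough foreground is detected in a chosen zone, so that a later search can decode reference frames only. OpenCV background subtraction stands in here for the vehicle detector, and the zone and thresholds are arbitrary.

    ```python
    import cv2

    cap = cv2.VideoCapture("street.mp4")
    bg = cv2.createBackgroundSubtractorMOG2(history=500)
    trigger_zone = (slice(300, 400), slice(100, 500))   # rows, cols to watch
    reference_frames = []              # frame indices to encode as I-frames

    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        mask = bg.apply(frame)
        # Enough foreground pixels inside the zone => a vehicle is passing.
        if (mask[trigger_zone] > 200).sum() > 2000:
            reference_frames.append(idx)
        idx += 1

    # A search for a specific vehicle now decodes only `reference_frames`
    # instead of the full compressed sequence.
    ```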

  4. HAEdb: a novel interactive, locus-specific mutation database for the C1 inhibitor gene.

    PubMed

    Kalmár, Lajos; Hegedüs, Tamás; Farkas, Henriette; Nagy, Melinda; Tordai, Attila

    2005-01-01

    Hereditary angioneurotic edema (HAE) is an autosomal dominant disorder characterized by episodic local subcutaneous and submucosal edema and is caused by deficiency of the activated C1 esterase inhibitor protein (C1-INH or C1INH; approved gene symbol SERPING1). Published C1-INH mutations are represented in large universal databases (e.g., OMIM, HGMD), but these databases update their data rather infrequently, are not interactive, and do not allow searches according to different criteria. The HAEdb, a C1-INH gene mutation database (http://hae.biomembrane.hu), was created with the following aims: 1) to help the comprehensive collection of information on genetic alterations of the C1-INH gene; 2) to create a database in which data can be searched and compared according to several flexible criteria; and 3) to provide additional help in new mutation identification. The website uses MySQL, an open-source, multithreaded, relational database management system. The user-friendly graphical interface was written in the PHP web programming language. The website consists of two main parts: the freely browsable search function and the password-protected data deposition function. Mutations of the C1-INH gene are divided into two groups: gross mutations involving DNA fragments >1 kb, and micro mutations encompassing all non-gross mutations. Several attributes (e.g., affected exon, molecular consequence, family history) are collected for each mutation in a standardized form. This database may facilitate future comprehensive analyses of C1-INH mutations and also provide regular help for molecular diagnostic testing of HAE patients in different centers.

  5. Next-Generation Sequencing of the Chrysanthemum nankingense (Asteraceae) Transcriptome Permits Large-Scale Unigene Assembly and SSR Marker Discovery

    PubMed Central

    Wang, Haibin; Jiang, Jiafu; Chen, Sumei; Qi, Xiangyu; Peng, Hui; Li, Pirui; Song, Aiping; Guan, Zhiyong; Fang, Weimin; Liao, Yuan; Chen, Fadi

    2013-01-01

    Background: Simple sequence repeats (SSRs) are ubiquitous in eukaryotic genomes. Chrysanthemum is one of the largest genera in the Asteraceae family. Only a few Chrysanthemum expressed sequence tag (EST) sequences have been acquired to date, so the number of available EST-SSR markers is very low. Methodology/Principal Findings: Illumina paired-end sequencing technology produced over 53 million sequencing reads from C. nankingense mRNA. The subsequent de novo assembly yielded 70,895 unigenes, of which 45,789 (64.59%) showed similarity to sequences in the NCBI database. Of these 45,789 sequences, 107 have hits in the Chrysanthemum Nr protein database; 679 and 277 sequences have hits in the databases of Helianthus and Lactuca species, respectively. MISA software identified a large number of putative EST-SSRs, allowing 1,788 primer pairs to be designed from the de novo transcriptome sequence and a further 363 from archival EST sequences. Among 100 randomly chosen primer pairs, 81 markers produced amplicons and 20 were polymorphic in genotype analysis of Chrysanthemum. The results showed that most (but not all) of the assays were transferable across species and that they exposed a significant amount of allelic diversity. Conclusions/Significance: SSR markers acquired by transcriptome sequencing are potentially useful for marker-assisted breeding and genetic analysis in the genus Chrysanthemum and its related genera. PMID:23626799
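
    As an illustration of the SSR-detection step performed here by MISA, a minimal regex-based detector for di- to hexanucleotide repeats is sketched below; the repeat-count thresholds are common choices, not necessarily those used in the study.

    ```python
    import re

    MIN_REPEATS = {2: 6, 3: 5, 4: 5, 5: 5, 6: 5}   # motif length -> min repeats

    def find_ssrs(seq: str):
        """Yield (start, motif, n_repeats) for simple sequence repeats."""
        for unit, min_n in MIN_REPEATS.items():
            # Group 2 is the motif; it must recur at least min_n - 1 more times.
            pattern = re.compile(r"(([ACGT]{%d})\2{%d,})" % (unit, min_n - 1))
            for m in pattern.finditer(seq):
                motif = m.group(2)
                if len(set(motif)) > 1:            # skip homopolymer runs
                    yield m.start(), motif, len(m.group(1)) // unit

    unigene = "ATCG" + "AG" * 8 + "GGGCTA" + "CTT" * 6 + "ACGT"
    for start, motif, n in find_ssrs(unigene):
        print(f"{motif} x {n} at position {start}")
    ```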

  6. A cluster-based strategy for assessing the overlap between large chemical libraries and its application to a recent acquisition.

    PubMed

    Engels, Michael F M; Gibbs, Alan C; Jaeger, Edward P; Verbinnen, Danny; Lobanov, Victor S; Agrafiotis, Dimitris K

    2006-01-01

    We report on the structural comparison of the corporate collections of Johnson & Johnson Pharmaceutical Research & Development (JNJPRD) and 3-Dimensional Pharmaceuticals (3DP), performed in the context of the recent acquisition of 3DP by JNJPRD. The main objective of the study was to assess the druglikeness of the 3DP library and the extent to which it enriched the chemical diversity of the JNJPRD corporate collection. The two databases, at the time of acquisition, collectively contained more than 1.1 million compounds with a clearly defined structural description. The analysis was based on a clustering approach and aimed at providing an intuitive quantitative estimate and visual representation of this enrichment. A novel hierarchical clustering algorithm called divisive k-means was employed in combination with Kelley's cluster-level selection method to partition the combined data set into clusters, and the diversity contribution of each library was evaluated as a function of the relative occupancy of these clusters. Typical 3DP chemotypes enriching the diversity of the JNJPRD collection were catalogued and visualized using a modified maximum common substructure algorithm. The joint collection of JNJPRD and 3DP compounds was also compared to other databases of known medicinally active or druglike compounds. The potential of the methodology for the analysis of very large chemical databases is discussed.
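
    The occupancy idea can be sketched compactly: cluster the pooled libraries, then count clusters populated by only one of them. Plain k-means on generic descriptor vectors stands in here for the divisive k-means/Kelley procedure described above, and the data are random placeholders.

    ```python
    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(1)
    lib_a = rng.random((5000, 32))      # e.g. JNJPRD-style descriptor vectors
    lib_b = rng.random((2000, 32))      # e.g. 3DP-style descriptor vectors

    pooled = np.vstack([lib_a, lib_b])
    labels = KMeans(n_clusters=200, n_init=10).fit_predict(pooled)
    a_lab, b_lab = labels[: len(lib_a)], labels[len(lib_a):]

    # Clusters populated exclusively by the acquired library represent the
    # chemotypes it adds, i.e. its diversity contribution.
    only_b = set(np.unique(b_lab)) - set(np.unique(a_lab))
    print(f"{len(only_b)} of 200 clusters are occupied only by the new library")
    ```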

  7. DDMGD: the database of text-mined associations between genes methylated in diseases from different species.

    PubMed

    Bin Raies, Arwa; Mansour, Hicham; Incitti, Roberto; Bajic, Vladimir B

    2015-01-01

    Gathering information about associations between methylated genes and diseases is important for disease diagnosis and treatment decisions. Recent advancements in epigenetics research allow for large-scale discoveries of associations of genes methylated in diseases in different species. Searching manually for such information is not easy, as it is scattered across a large number of electronic publications and repositories. Therefore, we developed the DDMGD database (http://www.cbrc.kaust.edu.sa/ddmgd/) to provide a comprehensive repository of information related to genes methylated in diseases that can be found through text mining. DDMGD's scope is not limited to a particular group of genes, diseases, or species. Using the text mining system DEMGD that we developed earlier, with additional post-processing, we extracted associations of genes methylated in different diseases from PubMed Central articles and PubMed abstracts. The accuracy of the extracted associations is 82%, as estimated on 2500 hand-curated entries. DDMGD provides a user-friendly interface facilitating retrieval of these associations ranked according to confidence scores. Submission of new associations to DDMGD is supported. A comparison of DDMGD with several other databases focused on genes methylated in diseases shows that DDMGD is comprehensive and includes most of the recent information on genes methylated in diseases. © The Author(s) 2014. Published by Oxford University Press on behalf of Nucleic Acids Research.

  8. Estimation of Solvation Quantities from Experimental Thermodynamic Data: Development of the Comprehensive CompSol Databank for Pure and Mixed Solutes

    NASA Astrophysics Data System (ADS)

    Moine, Edouard; Privat, Romain; Sirjean, Baptiste; Jaubert, Jean-Noël

    2017-09-01

    The Gibbs energy of solvation measures the affinity of a solute for its solvent and is thus a key property for the selection of an appropriate solvent for a chemical synthesis or a separation process. More fundamentally, Gibbs energies of solvation are data of choice for developing and benchmarking molecular models that predict solvation effects. The Comprehensive Solvation (CompSol) database was developed with the ambition of providing very large sets of new experimental solvation chemical-potential, solvation-entropy, and solvation-enthalpy data for pure and mixed components, covering extended temperature ranges. For mixed compounds, the solvation quantities were generated under infinite-dilution conditions by combining experimental values of pure-component and binary-mixture thermodynamic properties. Three types of binary-mixture properties were considered: partition coefficients, activity coefficients at infinite dilution, and Henry's-law constants. A rigorous methodology was implemented to select data at appropriate conditions of temperature, pressure, and concentration for the estimation of solvation data. Finally, our comprehensive CompSol database contains 21,671 data entries associated with 1969 pure species and 70,062 data entries associated with 14,102 binary mixtures (including 760 solvation data related to the ionic-liquid class of solvents). On the basis of the very large amount of experimental data contained in the CompSol database, we finally discuss how solvation energies are influenced by hydrogen-bonding association effects.
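
    For orientation, one common (Ben-Naim style) route from a Henry's-law constant to an infinite-dilution solvation Gibbs energy is sketched below. Whether CompSol applies exactly this convention is an assumption here, and the CO2-in-water values are round literature-style numbers used only for illustration.

    ```python
    import math

    R = 8.314462618        # J / (mol K)

    def dG_solv(k_H_pa: float, v_solvent_m3mol: float, T: float) -> float:
        """Delta_solv g* = RT ln(k_H * v_solvent / RT), in J/mol.

        k_H is the mole-fraction Henry's-law constant (Pa) and v_solvent the
        molar volume of the solvent (m3/mol); this is one common convention,
        assumed here rather than taken from the paper.
        """
        return R * T * math.log(k_H_pa * v_solvent_m3mol / (R * T))

    # CO2 in water near 298 K: k_H ~ 1.65e8 Pa, molar volume ~ 1.807e-5 m3/mol.
    print(dG_solv(1.65e8, 1.807e-5, 298.15) / 1000, "kJ/mol")   # ~ +0.5 kJ/mol
    ```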

  9. Crystallography Open Databases and Preservation: a World-wide Initiative

    NASA Astrophysics Data System (ADS)

    Chateigner, Daniel

    In 2003, an international team of crystallographers proposed the Crystallography Open Database (COD), a fully free collection of crystal structure data, with the aim of ensuring their preservation. With nearly 250,000 entries, this database represents a large open set of data for crystallographers, academics, and industry, mirrored at five different locations worldwide and included in Thomson Reuters' ISI. As a large step towards data preservation, raw data can now be uploaded along with «digested» structure files, and COD can be queried by most crystallography-related industrial software. The COD initiative has also spurred several other open developments.

  10. Iris indexing based on local intensity order pattern

    NASA Astrophysics Data System (ADS)

    Emerich, Simina; Malutan, Raul; Crisan, Septimiu; Lefkovits, Laszlo

    2017-03-01

    In recent years, iris biometric systems have increased in popularity and have proven capable of handling large-scale databases. The main advantages of these systems are accuracy and reliability. Proper classification of iris patterns is expected to reduce matching time in huge databases. This paper presents an iris indexing technique based on the Local Intensity Order Pattern. The performance of the present approach is evaluated on the UPOL database and compared with other recent systems designed for iris indexing. The results illustrate the potential of the proposed method for large-scale iris identification.
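
    A toy version of an intensity-order code illustrates the flavour of the descriptor: rank a small ring of neighbours around each pixel and histogram the resulting permutation indices. The real LIOP descriptor adds sampling, weighting, and normalization details omitted here.

    ```python
    import numpy as np
    from itertools import permutations

    ORDER = {p: i for i, p in enumerate(permutations(range(3)))}

    def liop_histogram(img: np.ndarray) -> np.ndarray:
        """Histogram of 3-neighbour intensity-order codes over the image."""
        hist = np.zeros(len(ORDER))
        for y in range(1, img.shape[0] - 1):
            for x in range(1, img.shape[1] - 1):
                ring = (img[y, x - 1], img[y - 1, x], img[y, x + 1])
                code = ORDER[tuple(np.argsort(ring))]   # rank-order permutation
                hist[code] += 1
        return hist / hist.sum()

    # Index a (random stand-in) normalized iris image; similar histograms
    # land in the same database bucket, narrowing the subsequent match.
    iris = np.random.rand(64, 256)
    print(liop_histogram(iris))
    ```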

  11. A dynamic clinical dental relational database.

    PubMed

    Taylor, D; Naguib, R N G; Boulton, S

    2004-09-01

    The traditional approach to relational database design is based on the logical organization of data into a number of related normalized tables. One assumption is that the nature and structure of the data is known at the design stage. In the case of designing a relational database to store historical dental epidemiological data from individual clinical surveys, the structure of the data is not known until the data is presented for inclusion into the database. This paper addresses the issues concerned with the theoretical design of a clinical dynamic database capable of adapting the internal table structure to accommodate clinical survey data, and presents a prototype database application capable of processing, displaying, and querying the dental data.
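
    A minimal sketch of the dynamic-schema idea: inspect an incoming survey file and create a matching table at load time instead of fixing the structure at design time. The helper below is illustrative, with deliberately crude typing (everything stored as TEXT).

    ```python
    import csv
    import sqlite3

    def load_survey(db: sqlite3.Connection, name: str, csv_path: str) -> None:
        """Create a table shaped like the survey file and load its rows."""
        with open(csv_path, newline="") as f:
            rows = list(csv.DictReader(f))
        cols = list(rows[0].keys())
        col_defs = ", ".join(f'"{c}" TEXT' for c in cols)  # type inference elided
        db.execute(f'CREATE TABLE IF NOT EXISTS "{name}" ({col_defs})')
        placeholders = ", ".join("?" for _ in cols)
        db.executemany(
            f'INSERT INTO "{name}" VALUES ({placeholders})',
            [tuple(r[c] for c in cols) for r in rows],
        )

    conn = sqlite3.connect(":memory:")
    # Each clinical survey arrives with its own column set and becomes its
    # own table, e.g.:
    # load_survey(conn, "survey_1998_dmft", "survey_1998.csv")
    ```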

  12. Predicting Protein Relationships to Human Pathways through a Relational Learning Approach Based on Simple Sequence Features.

    PubMed

    García-Jiménez, Beatriz; Pons, Tirso; Sanchis, Araceli; Valencia, Alfonso

    2014-01-01

    Biological pathways are important elements of systems biology, and in the past decade an increasing number of pathway databases have been set up to document the growing understanding of complex cellular processes. Although more genome-sequence data are becoming available, a large fraction of it remains functionally uncharacterized. Thus, it is important to be able to predict the mapping of poorly annotated proteins to existing pathway models. We have developed a Relational Learning-based Extension (RLE) system to investigate pathway membership through a function-prediction approach that relies mainly on combinations of simple properties attributed to each protein. RLE searches for proteins with molecular similarities to specific pathway components. Using RLE, we associated 383 uncharacterized proteins with 28 predefined human Reactome pathways, with reasonable confidence after formal evaluation. Indeed, in specific cases manual inspection of the database annotations and the related literature supported the proposed classifications. Examples of possible additional components of the Electron transport system, Telomere maintenance and Integrin cell surface interactions pathways are discussed in detail. All the human predicted proteins in Reactome releases 30 (2009) and 40 (2012) are available at http://rle.bioinfo.cnio.es.

  13. Formation of Ground Truth Databases and Related Studies and Regional Seismic Monitoring Research

    DTIC Science & Technology

    2006-06-01

    [Fragmentary record: only figure-caption and table text survives from this report.] Epicentral locations from the original REB (denoted by black circles) are compared with those derived by the combined use of both IRIS and revised IDC data for all 70 Chinese events; the IRIS data provide a large number of additional Lg arrivals (Figure 45). A station list follows: Washington, USA; Haystack Fork, Wyoming, USA; Hells Canyon, Colorado, USA; Hiawatha Road, Colorado, USA; Hockley, Texas, USA; Hailey, Idaho, USA; Honiara.

  14. Big Data and Total Hip Arthroplasty: How Do Large Databases Compare?

    PubMed

    Bedard, Nicholas A; Pugely, Andrew J; McHugh, Michael A; Lux, Nathan R; Bozic, Kevin J; Callaghan, John J

    2018-01-01

    Use of large databases for orthopedic research has become extremely popular in recent years. Each database varies in the methods used to capture data and the population it represents. The purpose of this study was to evaluate how these databases differed in reported demographics, comorbidities, and postoperative complications for primary total hip arthroplasty (THA) patients. Primary THA patients were identified within National Surgical Quality Improvement Programs (NSQIP), Nationwide Inpatient Sample (NIS), Medicare Standard Analytic Files (MED), and Humana administrative claims database (HAC). NSQIP definitions for comorbidities and complications were matched to corresponding International Classification of Diseases, 9th Revision/Current Procedural Terminology codes to query the other databases. Demographics, comorbidities, and postoperative complications were compared. The number of patients from each database was 22,644 in HAC, 371,715 in MED, 188,779 in NIS, and 27,818 in NSQIP. Age and gender distribution were clinically similar. Overall, there was variation in prevalence of comorbidities and rates of postoperative complications between databases. As an example, NSQIP recorded more than twice the prevalence of obesity seen in NIS, and HAC and MED recorded more than twice the prevalence of diabetes seen in NSQIP. Rates of deep infection and stroke 30 days after THA differed by more than 2-fold across databases. Among databases commonly used in orthopedic research, there is considerable variation in complication rates following THA depending upon the database used for analysis. It is important to consider these differences when critically evaluating database research. Additionally, with the advent of bundled payments, these differences must be considered in risk adjustment models. Copyright © 2017 Elsevier Inc. All rights reserved.

  15. Differences in the Reporting of Racial and Socioeconomic Disparities among Three Large National Databases for Breast Reconstruction.

    PubMed

    Kamali, Parisa; Zettervall, Sara L; Wu, Winona; Ibrahim, Ahmed M S; Medin, Caroline; Rakhorst, Hinne A; Schermerhorn, Marc L; Lee, Bernard T; Lin, Samuel J

    2017-04-01

    Research derived from large-volume databases plays an increasing role in the development of clinical guidelines and health policy. In breast cancer research, the Surveillance, Epidemiology and End Results, National Surgical Quality Improvement Program, and Nationwide Inpatient Sample databases are widely used. This study aims to compare the trends in immediate breast reconstruction and identify the drawbacks and benefits of each database. Patients with invasive breast cancer and ductal carcinoma in situ were identified from each database (2005-2012). Trends of immediate breast reconstruction over time were evaluated. Patient demographics and comorbidities were compared. Subgroup analysis of immediate breast reconstruction use per race was conducted. Within the three databases, 1.2 million patients were studied. Immediate breast reconstruction in invasive breast cancer patients increased significantly over time in all databases. A similar significant upward trend was seen in ductal carcinoma in situ patients. Significant differences in immediate breast reconstruction rates were seen among races, and the disparity differed among the three databases. Rates of comorbidities were similar among the three databases. There has been a significant increase in immediate breast reconstruction; however, the extent of the reporting of overall immediate breast reconstruction rates and of racial disparities differs significantly among databases. The Nationwide Inpatient Sample and the National Surgical Quality Improvement Program report similar findings, with the Surveillance, Epidemiology and End Results database reporting significantly lower rates in several categories. These findings suggest that results from the Surveillance, Epidemiology and End Results database may not be generalizable to the entire U.S. population.

  16. Mean velocity and turbulence measurements in a 90 deg curved duct with thin inlet boundary layer

    NASA Technical Reports Server (NTRS)

    Crawford, R. A.; Peters, C. E.; Steinhoff, J.; Hornkohl, J. O.; Nourinejad, J.; Ramachandran, K.

    1985-01-01

    The experimental database established by this investigation of the flow in a large rectangular turning duct is of benchmark quality. The experimental Reynolds numbers, Dean numbers and boundary-layer characteristics differ significantly from previous benchmark curved-duct experimental parameters. This investigation extends the experimental database to higher Reynolds numbers and thinner entrance boundary layers. The 5% to 10% thick boundary layers, based on duct half-width, result in a large region of near-potential flow in the duct core surrounded by developing boundary layers with large crossflows. The turbulent entrance boundary layer case at Re_d = 328,000 provides an incompressible flowfield which approaches real turbine blade cascade characteristics. The results of this investigation provide a challenging benchmark database for computational fluid dynamics code development.

  17. The Odense University Pharmacoepidemiological Database (OPED)

    Cancer.gov

    The Odense University Pharmacoepidemiological Database is one of two large prescription registries in Denmark and covers a stable population that is representative of the Danish population as a whole.

  18. Assessment methodologies and statistical issues for computer-aided diagnosis of lung nodules in computed tomography: contemporary research topics relevant to the lung image database consortium.

    PubMed

    Dodd, Lori E; Wagner, Robert F; Armato, Samuel G; McNitt-Gray, Michael F; Beiden, Sergey; Chan, Heang-Ping; Gur, David; McLennan, Geoffrey; Metz, Charles E; Petrick, Nicholas; Sahiner, Berkman; Sayre, Jim

    2004-04-01

    Cancer of the lung and bronchus is the leading fatal malignancy in the United States. Five-year survival is low, but treatment of early stage disease considerably improves chances of survival. Advances in multidetector-row computed tomography technology provide detection of smaller lung nodules and offer a potentially effective screening tool. The large number of images per exam, however, requires considerable radiologist time for interpretation and is an impediment to clinical throughput. Thus, computer-aided diagnosis (CAD) methods are needed to assist radiologists with their decision making. To promote the development of CAD methods, the National Cancer Institute formed the Lung Image Database Consortium (LIDC). The LIDC is charged with developing the consensus and standards necessary to create an image database of multidetector-row computed tomography lung images as a resource for CAD researchers. To develop such a prospective database, its potential uses must be anticipated. The ultimate applications will influence the information that must be included along with the images, the relevant measures of algorithm performance, and the number of required images. In this article we outline assessment methodologies and statistical issues as they relate to several potential uses of the LIDC database. We review methods for performance assessment and discuss issues of defining "truth" as well as the complications that arise when truth information is not available. We also discuss issues about sizing and populating a database.
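
    As a pointer to what performance assessment concretely involves, the sketch below computes an ROC curve and its area for a hypothetical CAD scoring output against panel truth labels; the data and the use of scikit-learn are illustrative assumptions, not LIDC methodology.

    ```python
    import numpy as np
    from sklearn.metrics import roc_auc_score, roc_curve

    # Hypothetical CAD scores for 10 nodule candidates and "truth" labels
    # (1 = nodule confirmed by the reading panel); all values invented.
    truth = np.array([1, 0, 1, 1, 0, 0, 1, 0, 0, 1])
    cad_score = np.array([0.91, 0.40, 0.75, 0.62, 0.30,
                          0.55, 0.88, 0.12, 0.47, 0.70])

    auc = roc_auc_score(truth, cad_score)            # area under ROC curve
    fpr, tpr, thresholds = roc_curve(truth, cad_score)
    print(f"AUC = {auc:.2f} over {len(thresholds)} operating points")
    ```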

  19. [The opening of the French national health database: Opportunities and difficulties. The experience of the Gazel and Constances cohorts].

    PubMed

    Goldberg, M; Carton, M; Gourmelen, J; Genreau, M; Montourcy, M; Le Got, S; Zins, M

    2016-09-01

    In France, the national health database (SNIIRAM) is an administrative health database that collects data on hospitalizations and healthcare consumption for more than 60 million people. Although it does not record behavioral or environmental data, it is of major interest for epidemiology, surveillance and public health. One of the most interesting uses of the SNIIRAM is its linkage with surveys collecting data directly from persons. Access to the SNIIRAM data is currently relatively limited, but upcoming regulatory changes will largely facilitate open access. However, it is a huge and complex database, and its volume and architecture pose important methodological and technical difficulties for its use. We are developing tools to facilitate the linkage of the Gazel and Constances cohorts to the SNIIRAM: interactive documentation on the SNIIRAM database, software for verifying the completeness and validity of the data received from the SNIIRAM, methods for constructing indicators from the raw data in order to flag the presence of certain events (specific diagnosis, procedure, drug…), and standard queries for producing a set of variables on a specific area (drugs, diagnoses during a hospital stay…). Moreover, the recently created REDSIAM network aims to develop, evaluate and make available algorithms to identify pathologies in the SNIIRAM. In order to fully benefit from the exceptional potential of the SNIIRAM database, it is essential to develop tools that facilitate its use. Copyright © 2016 Elsevier Masson SAS. All rights reserved.
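
    A minimal sketch of the indicator-construction step described above: derive a person-level flag from raw dispensing rows. The column names, ATC prefix and use of pandas are assumptions for illustration; the actual SNIIRAM tooling is not described at this level in the record.

    ```python
    import pandas as pd

    # Hypothetical extract of raw dispensing rows; names and codes invented.
    claims = pd.DataFrame({
        "person_id": [1, 1, 2, 3],
        "atc_code":  ["N03AX09", "A10BA02", "N03AX09", "C07AB02"],
    })

    # Indicator construction: flag every person with at least one dispensing
    # whose ATC code starts with N03 (antiepileptics).
    flagged = set(claims.loc[claims["atc_code"].str.startswith("N03"),
                             "person_id"])
    claims["antiepileptic"] = claims["person_id"].isin(flagged)

    print(claims.drop_duplicates("person_id")[["person_id", "antiepileptic"]])
    ```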

  20. Educational inequalities in mortality by cause of death: first national data for the Netherlands.

    PubMed

    Kulhánová, Ivana; Hoffmann, Rasmus; Eikemo, Terje A; Menvielle, Gwenn; Mackenbach, Johan P

    2014-10-01

    Using new facilities for linking large databases, we aimed to evaluate for the first time the magnitude of relative and absolute educational inequalities in mortality by sex and cause of death in the Netherlands. We analyzed data from Dutch Labour Force Surveys (1998-2002) with mortality follow-up 1998-2007 among people aged 30-79 years. We calculated hazard ratios using Cox proportional hazards model, age-standardized mortality rates and partial life expectancy by education. We compared results for the Netherlands with those for other European countries. The relative risk of dying was about two times higher among primary educated men and women as compared to their tertiary educated counterparts, leading to a gap in partial life expectancy of 3.4 years (men) and 2.4 years (women). Inequalities in mortality are similar to those in other countries in North-Western Europe, but inequalities in lung cancer mortality are substantially larger in the Netherlands, particularly among men. The Netherlands has large inequalities in mortality, especially for smoking-related causes of death. These large inequalities require the urgent attention of policy makers.
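
    To make the named method concrete, here is a minimal Cox proportional hazards fit on invented person-level data using the lifelines package (an assumed tool; the authors' actual software is not stated in this record).

    ```python
    import pandas as pd
    from lifelines import CoxPHFitter

    # Toy person-level records: follow-up time in years, death indicator,
    # and an indicator for primary (vs. tertiary) education. All invented.
    df = pd.DataFrame({
        "years":   [9.5, 4.2, 6.0, 7.8, 10.0, 3.1],
        "died":    [1,   0,   1,   1,   0,    1],
        "primary": [1,   1,   0,   1,   0,    1],
    })

    cph = CoxPHFitter()
    cph.fit(df, duration_col="years", event_col="died")
    # Hazard ratio for primary- vs. tertiary-educated persons.
    print(cph.hazard_ratios_)
    ```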

  1. A Comparative Analysis Among the SRS M&M, NIS, and KID Databases for the Adolescent Idiopathic Scoliosis.

    PubMed

    Lee, Nathan J; Guzman, Javier Z; Kim, Jun; Skovrlj, Branko; Martin, Christopher T; Pugely, Andrew J; Gao, Yubo; Caridi, John M; Mendoza-Lattes, Sergio; Cho, Samuel K

    2016-11-01

    Retrospective cohort analysis. A growing number of publications have utilized the Scoliosis Research Society (SRS) Morbidity and Mortality (M&M) database, but none have compared it to other large databases. The objective of this study was to compare SRS complications with those in administrative databases. The Nationwide Inpatient Sample (NIS) and Kids' Inpatient Database (KID) captured a greater number of overall complications while the SRS M&M data provided a greater incidence of spine-related complications following adolescent idiopathic scoliosis (AIS) surgery. Chi-square tests were used to assess statistical significance, with p < .05 considered significant. The SRS 2004-2007 (9,904 patients), NIS 2004-2007 (20,441 patients) and KID 2003-2006 (10,184 patients) databases were analyzed for AIS patients who underwent fusion. Comparable variables were queried in all three databases, including patient demographics, surgical variables, and complications. Patients undergoing AIS in the SRS database were slightly older (SRS 14.4 years vs. NIS 13.8 years, p < .0001; KID 13.9 years, p < .0001) and less likely to be male (SRS 18.5% vs. NIS 26.3%, p < .0001; KID 24.8%, p < .0001). Revision surgery (SRS 3.3% vs. NIS 2.4%, p < .0001; KID 0.9%, p < .0001) and osteotomy (SRS 8% vs. NIS 2.3%, p < .0001; KID 2.4%, p < .0001) were more commonly reported in the SRS database. The SRS database reported fewer overall complications (SRS 3.9% vs. NIS 7.3%, p < .0001; KID 6.6%, p < .0001). However, when respiratory complications (SRS 0.5% vs. NIS 3.7%, p < .0001; KID 4.4%, p < .0001) were excluded, medical complication rates were similar across databases. In contrast, SRS reported higher spine-specific complication rates. Mortality rates were similar between SRS versus NIS (p = .280) and SRS versus KID (p = .08) databases. There are similarities and differences between the three databases. These discrepancies are likely due to the varying data-gathering methods each organization uses to collect their morbidity data. Level IV. Copyright © 2016 Scoliosis Research Society. Published by Elsevier Inc. All rights reserved.
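
    The reported chi-square comparisons have the following shape. The counts below are reconstructed from the quoted percentages and denominators, so they are approximate, and scipy is an assumed tool rather than the study's stated software.

    ```python
    from scipy.stats import chi2_contingency

    # Approximate 2x2 table of overall complications: SRS 3.9% of 9,904
    # patients vs. NIS 7.3% of 20,441 (counts back-calculated, illustrative).
    table = [[386, 9518],      # SRS: complications, no complications
             [1492, 18949]]    # NIS: complications, no complications

    chi2, p, dof, expected = chi2_contingency(table)
    print(f"chi2 = {chi2:.1f}, dof = {dof}, p = {p:.2e}")
    ```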

  2. Alternatives to relational database: comparison of NoSQL and XML approaches for clinical data storage.

    PubMed

    Lee, Ken Ka-Yin; Tang, Wai-Choi; Choi, Kup-Sze

    2013-04-01

    Clinical data are dynamic in nature, often arranged hierarchically and stored as free text and numbers. Effective management of clinical data and the transformation of the data into structured format for data analysis are therefore challenging issues in electronic health records development. Despite the popularity of relational databases, the scalability of the NoSQL database model and the document-centric data structure of XML databases appear to be promising features for effective clinical data management. In this paper, three database approaches--NoSQL, XML-enabled and native XML--are investigated to evaluate their suitability for structured clinical data. The database query performance is reported, together with our experience in the databases development. The results show that NoSQL database is the best choice for query speed, whereas XML databases are advantageous in terms of scalability, flexibility and extensibility, which are essential to cope with the characteristics of clinical data. While NoSQL and XML technologies are relatively new compared to the conventional relational database, both of them demonstrate potential to become a key database technology for clinical data management as the technology further advances. Copyright © 2012 Elsevier Ireland Ltd. All rights reserved.
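
    A minimal sketch of the document-centric alternative the paper evaluates: a hierarchical clinical record kept as one self-contained document instead of being normalized across tables. SQLite plus JSON stands in here for a real NoSQL or XML database, and all field names are invented.

    ```python
    import json
    import sqlite3

    # A hierarchical clinical record stored document-style as one JSON blob
    # rather than decomposed into normalized tables (names invented).
    record = {
        "patient_id": "P001",
        "encounters": [
            {"date": "2012-05-01",
             "notes": "free-text progress note",
             "labs": [{"test": "HbA1c", "value": 6.8}]},
        ],
    }

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE documents (id TEXT PRIMARY KEY, body TEXT)")
    conn.execute("INSERT INTO documents VALUES (?, ?)",
                 (record["patient_id"], json.dumps(record)))

    # Query back and descend the hierarchy without any joins.
    body = json.loads(conn.execute(
        "SELECT body FROM documents WHERE id = 'P001'").fetchone()[0])
    print(body["encounters"][0]["labs"][0]["value"])
    ```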

  3. Object recognition based on Google's reverse image search and image similarity

    NASA Astrophysics Data System (ADS)

    Horváth, András.

    2015-12-01

    Image classification is one of the most challenging tasks in computer vision, and a general multiclass classifier could solve many different tasks in image processing. Classification is usually done by shallow learning for predefined objects, which is a difficult task and very different from human vision; human vision is based on continuous learning of object classes, and it takes years to learn a large taxonomy of objects which are neither disjoint nor independent. In this paper I present a system based on Google's image-similarity algorithm and image database, which can classify a large set of different objects in a human-like manner, identifying related classes and taxonomies.

  4. A clear human footprint in the coral reefs of the Caribbean.

    PubMed

    Mora, Camilo

    2008-04-07

    The recent degradation of coral reefs worldwide is increasingly well documented, yet the underlying causes remain debated. In this study, we used a large-scale database on the status of coral reef communities in the Caribbean and analysed it in combination with a comprehensive set of socioeconomic and environmental databases to decouple confounding factors and identify the drivers of change in coral reef communities. Our results indicated that human activities related to agricultural land use, coastal development, overfishing and climate change had created independent and overwhelming responses in fishes, corals and macroalgae. While the effective implementation of marine protected areas (MPAs) increased the biomass of fish populations, coral reef builders and macroalgae followed patterns of change independent of MPAs. However, we also found significant ecological links among all these groups of organisms suggesting that the long-term stability of coral reefs as a whole requires a holistic and regional approach to the control of human-related stressors in addition to the improvement and establishment of new MPAs.

  5. An algorithm to identify functional groups in organic molecules.

    PubMed

    Ertl, Peter

    2017-06-07

    The concept of functional groups forms a basis of organic chemistry, medicinal chemistry, toxicity assessment, spectroscopy and also chemical nomenclature. All current software systems for identifying functional groups are based on a predefined list of substructures. We are not aware of any program that can identify all functional groups in a molecule automatically. The algorithm presented in this article is an attempt to solve this scientific challenge. An algorithm to identify functional groups in a molecule based on iterative marching through its atoms is described. The procedure is illustrated by extracting functional groups from the bioactive portion of the ChEMBL database, resulting in identification of 3080 unique functional groups. A new algorithm to identify all functional groups in organic molecules is presented. The algorithm is relatively simple, and full details with examples are provided; implementation in any cheminformatics toolkit should therefore be relatively easy. The new method allows the analysis of functional groups in large chemical databases in a way that was not possible using previous approaches.
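
    A heavily simplified sketch of the atom-marking idea (the first phase of such an algorithm) using RDKit. The rules below are a rough approximation of my own and deliberately omit most of Ertl's published conditions (acetal carbons, special aromatic-heteroatom handling, and more).

    ```python
    from rdkit import Chem

    def mark_functional_atoms(smiles):
        """Mark heteroatoms, atoms on multiple bonds, and carbons bonded to
        heteroatoms. Note: aromatic bonds are non-single, so aromatic atoms
        are also caught here; a faithful implementation treats them apart."""
        mol = Chem.MolFromSmiles(smiles)
        marked = set()
        for atom in mol.GetAtoms():
            if atom.GetAtomicNum() not in (6, 1):          # heteroatom
                marked.add(atom.GetIdx())
        for bond in mol.GetBonds():
            a, b = bond.GetBeginAtom(), bond.GetEndAtom()
            if bond.GetBondType() != Chem.BondType.SINGLE:  # multiple bond
                marked.update((a.GetIdx(), b.GetIdx()))
            elif a.GetAtomicNum() == 6 and b.GetAtomicNum() not in (6, 1):
                marked.add(a.GetIdx())                      # C next to het.
            elif b.GetAtomicNum() == 6 and a.GetAtomicNum() not in (6, 1):
                marked.add(b.GetIdx())
        return marked

    # Acetic acid: the three carboxyl atoms (C, =O, OH) get marked.
    print(mark_functional_atoms("CC(=O)O"))
    ```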

  6. Computer-aided auditing of prescription drug claims.

    PubMed

    Iyengar, Vijay S; Hermiz, Keith B; Natarajan, Ramesh

    2014-09-01

    We describe a methodology for identifying and ranking candidate audit targets from a database of prescription drug claims. The relevant audit targets may include various entities such as prescribers, patients and pharmacies, who exhibit certain statistical behavior indicative of potential fraud and abuse over the prescription claims during a specified period of interest. Our overall approach is consistent with related work in statistical methods for detection of fraud and abuse, but has a relative emphasis on three specific aspects: first, based on the assessment of domain experts, certain focus areas are selected and data elements pertinent to the audit analysis in each focus area are identified; second, specialized statistical models are developed to characterize the normalized baseline behavior in each focus area; and third, statistical hypothesis testing is used to identify entities that diverge significantly from their expected behavior according to the relevant baseline model. The application of this overall methodology to a prescription claims database from a large health plan is considered in detail.
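
    In outline, the third step (hypothesis testing against a baseline) can look like the sketch below: a population baseline is estimated and entities far in the tail are ranked as audit candidates. The Poisson baseline, threshold and data are invented for illustration and are not the paper's models.

    ```python
    import numpy as np

    # Hypothetical monthly claim counts per prescriber (illustrative only).
    rng = np.random.default_rng(0)
    counts = rng.poisson(lam=20, size=500)   # baseline population
    counts[:3] = [95, 80, 77]                # a few divergent entities

    # Normal approximation to the baseline; flag entities whose counts
    # diverge significantly (one-sided test, conservative threshold).
    # A robust (median-based) baseline would resist outlier contamination.
    mu, sigma = counts.mean(), counts.std(ddof=1)
    z = (counts - mu) / sigma
    suspects = np.where(z > 3.0)[0]
    print("audit candidates:", suspects, "z-scores:", z[suspects].round(1))
    ```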

  7. A clear human footprint in the coral reefs of the Caribbean

    PubMed Central

    Mora, Camilo

    2008-01-01

    The recent degradation of coral reefs worldwide is increasingly well documented, yet the underlying causes remain debated. In this study, we used a large-scale database on the status of coral reef communities in the Caribbean and analysed it in combination with a comprehensive set of socioeconomic and environmental databases to decouple confounding factors and identify the drivers of change in coral reef communities. Our results indicated that human activities related to agricultural land use, coastal development, overfishing and climate change had created independent and overwhelming responses in fishes, corals and macroalgae. While the effective implementation of marine protected areas (MPAs) increased the biomass of fish populations, coral reef builders and macroalgae followed patterns of change independent of MPAs. However, we also found significant ecological links among all these groups of organisms suggesting that the long-term stability of coral reefs as a whole requires a holistic and regional approach to the control of human-related stressors in addition to the improvement and establishment of new MPAs. PMID:18182370

  8. A Data Management System for International Space Station Simulation Tools

    NASA Technical Reports Server (NTRS)

    Betts, Bradley J.; DelMundo, Rommel; Elcott, Sharif; McIntosh, Dawn; Niehaus, Brian; Papasin, Richard; Mah, Robert W.; Clancy, Daniel (Technical Monitor)

    2002-01-01

    Groups associated with the design, operational, and training aspects of the International Space Station make extensive use of modeling and simulation tools. Users of these tools often need to access and manipulate large quantities of data associated with the station, ranging from design documents to wiring diagrams. Retrieving and manipulating this data directly within the simulation and modeling environment can provide substantial benefit to users. An approach for providing these kinds of data management services, including a database schema and class structure, is presented. Implementation details are also provided as a data management system is integrated into the Intelligent Virtual Station, a modeling and simulation tool developed by the NASA Ames Smart Systems Research Laboratory. One use of the Intelligent Virtual Station is generating station-related training procedures in a virtual environment. The data management component allows users to quickly and easily retrieve information related to objects on the station, enhancing their ability to generate accurate procedures. Users can associate new information with objects and have that information stored in a database.

  9. The Tully-Fisher relation for flat galaxies

    NASA Astrophysics Data System (ADS)

    Makarov, D. I.; Zaitseva, N. A.; Bizyaev, D. V.

    2018-06-01

    We construct a multiparametric Tully-Fisher (TF) relation for a large sample of edge-on galaxies from the Revised Flat Galaxy Catalog using H I data from the EDD database and parameters from the EGIS catalog. We incorporate a variety of additional parameters, including structural parameters of edge-on galaxies in different bandpasses. Besides the rotation curve maximum, only the H I-to-optical luminosity ratio and optical colours play a statistically significant role in the multiparametric TF relation. We are able to decrease the standard deviation of the multiparametric TF relation down to 0.32 mag, which is at the level of the best modern samples of galaxies used for studies of matter motion in the Universe via the TF relation.
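
    An illustrative functional form for such a multiparametric TF relation is sketched below; the K band, the (g-i) colour and the coefficients a_k are assumptions for exposition, since the record states only which parameters are significant and the final scatter of 0.32 mag.

    ```latex
    % Illustrative form only; the fitted coefficients and exact bands are
    % not given in this record.
    M_{K} = a_{0} + a_{1}\,\log_{10}\!\left(2V_{\max}\right)
          + a_{2}\,\log_{10}\!\left(\frac{M_{\mathrm{HI}}}{L}\right)
          + a_{3}\,(g-i),
    \qquad \sigma \approx 0.32~\mathrm{mag}.
    ```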

  10. NeuroTransDB: highly curated and structured transcriptomic metadata for neurodegenerative diseases.

    PubMed

    Bagewadi, Shweta; Adhikari, Subash; Dhrangadhariya, Anjani; Irin, Afroza Khanam; Ebeling, Christian; Namasivayam, Aishwarya Alex; Page, Matthew; Hofmann-Apitius, Martin; Senger, Philipp

    2015-01-01

    Neurodegenerative diseases are chronic debilitating conditions, characterized by progressive loss of neurons, that represent a significant health care burden as the global elderly population continues to grow. Over the past decade, high-throughput technologies such as the Affymetrix GeneChip microarrays have provided new perspectives into the pathomechanisms underlying neurodegeneration. Public transcriptomic data repositories, namely Gene Expression Omnibus and curated ArrayExpress, enable researchers to conduct integrative meta-analyses, increasing the power to detect differentially regulated genes in disease and to explore patterns of gene dysregulation across biologically related studies. The reliability of retrospective, large-scale integrative analyses depends on an appropriate combination of related datasets, in turn requiring detailed meta-annotations capturing the experimental setup. In most cases, we observe huge variation in compliance with defined standards for submitted metadata in public databases. Much of the information needed to complete or refine meta-annotations is distributed across the associated publications. For example, tissue preparation or comorbidity information is frequently described in an article's supplementary tables. Several value-added databases have employed additional manual efforts to overcome this limitation. However, none of these databases provides annotations that distinguish human and animal models in the context of neurodegeneration. Therefore, adopting a more specific disease focus, in combination with dedicated disease ontologies, will better empower the selection of comparable studies with refined annotations to address the research question at hand. In this article, we describe the detailed development of NeuroTransDB, a manually curated database containing metadata annotations for neurodegenerative studies. The database contains more than 20 dimensions of metadata annotations within 31 mouse, 5 rat and 45 human studies, defined in collaboration with domain disease experts. We elucidate the step-by-step guidelines used to critically prioritize studies from public archives and their metadata curation, and discuss the key challenges encountered. Curated metadata for Alzheimer's disease gene expression studies are available for download. Database URL: www.scai.fraunhofer.de/NeuroTransDB.html. © The Author(s) 2015. Published by Oxford University Press.

  11. NeuroTransDB: highly curated and structured transcriptomic metadata for neurodegenerative diseases

    PubMed Central

    Bagewadi, Shweta; Adhikari, Subash; Dhrangadhariya, Anjani; Irin, Afroza Khanam; Ebeling, Christian; Namasivayam, Aishwarya Alex; Page, Matthew; Hofmann-Apitius, Martin

    2015-01-01

    Neurodegenerative diseases are chronic debilitating conditions, characterized by progressive loss of neurons, that represent a significant health care burden as the global elderly population continues to grow. Over the past decade, high-throughput technologies such as the Affymetrix GeneChip microarrays have provided new perspectives into the pathomechanisms underlying neurodegeneration. Public transcriptomic data repositories, namely Gene Expression Omnibus and curated ArrayExpress, enable researchers to conduct integrative meta-analyses, increasing the power to detect differentially regulated genes in disease and to explore patterns of gene dysregulation across biologically related studies. The reliability of retrospective, large-scale integrative analyses depends on an appropriate combination of related datasets, in turn requiring detailed meta-annotations capturing the experimental setup. In most cases, we observe huge variation in compliance with defined standards for submitted metadata in public databases. Much of the information needed to complete or refine meta-annotations is distributed across the associated publications. For example, tissue preparation or comorbidity information is frequently described in an article’s supplementary tables. Several value-added databases have employed additional manual efforts to overcome this limitation. However, none of these databases provides annotations that distinguish human and animal models in the context of neurodegeneration. Therefore, adopting a more specific disease focus, in combination with dedicated disease ontologies, will better empower the selection of comparable studies with refined annotations to address the research question at hand. In this article, we describe the detailed development of NeuroTransDB, a manually curated database containing metadata annotations for neurodegenerative studies. The database contains more than 20 dimensions of metadata annotations within 31 mouse, 5 rat and 45 human studies, defined in collaboration with domain disease experts. We elucidate the step-by-step guidelines used to critically prioritize studies from public archives and their metadata curation, and discuss the key challenges encountered. Curated metadata for Alzheimer’s disease gene expression studies are available for download. Database URL: www.scai.fraunhofer.de/NeuroTransDB.html PMID:26475471

  12. Application of new type of distributed multimedia databases to networked electronic museum

    NASA Astrophysics Data System (ADS)

    Kuroda, Kazuhide; Komatsu, Naohisa; Komiya, Kazumi; Ikeda, Hiroaki

    1999-01-01

    Recently, various kinds of multimedia application systems have been actively developed, building on advanced high-speed communication networks, computer processing technologies, and digital content-handling technologies. Against this background, this paper proposes a new distributed multimedia database system which can effectively perform a new function of cooperative retrieval among distributed databases. The proposed system introduces a new concept of a 'Retrieval manager', which functions as an intelligent controller so that the user can treat a set of distributed databases as one logical database. The logical database dynamically generates and executes a preferred combination of retrieval parameters on the basis of both directory data and the system environment. Moreover, a concept of 'domain' is defined in the system as a managing unit of retrieval; retrieval can be performed effectively through cooperative processing among multiple domains. A communication language and protocols are also defined and are used for all communication within the system. A language interpreter in each machine translates the communication language into the internal language used by that machine. Using the language interpreter, internal modules such as the DBMS and the user-interface modules can be freely selected. A concept of a 'content-set' is also introduced. A content-set is defined as a package of contents that are related to each other, and the system handles a content-set as one object. The user terminal can effectively control the display of retrieved contents by referring to data indicating the relations among the contents in the content-set. In order to verify the function of the proposed system, a networked electronic museum was built experimentally. The results of this experiment indicate that the proposed system can effectively retrieve the target contents under the control of a number of distributed domains, and that the system works effectively even when it becomes large.

  13. WholeCellSimDB: a hybrid relational/HDF database for whole-cell model predictions.

    PubMed

    Karr, Jonathan R; Phillips, Nolan C; Covert, Markus W

    2014-01-01

    Mechanistic 'whole-cell' models are needed to develop a complete understanding of cell physiology. However, extracting biological insights from whole-cell models requires running and analyzing large numbers of simulations. We developed WholeCellSimDB, a database for organizing whole-cell simulations. WholeCellSimDB was designed to enable researchers to search simulation metadata to identify simulations for further analysis, and quickly slice and aggregate simulation results data. In addition, WholeCellSimDB enables users to share simulations with the broader research community. The database uses a hybrid relational/hierarchical data format architecture to efficiently store and retrieve both simulation setup metadata and results data. WholeCellSimDB provides a graphical Web-based interface to search, browse, plot and export simulations; a JavaScript Object Notation (JSON) Web service to retrieve data for Web-based visualizations; a command-line interface to deposit simulations; and a Python API to retrieve data for advanced analysis. Overall, we believe WholeCellSimDB will help researchers use whole-cell models to advance basic biological science and bioengineering. Database URL: http://www.wholecellsimdb.org. Source code repository: http://github.com/CovertLab/WholeCellSimDB. © The Author(s) 2014. Published by Oxford University Press.
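
    The hybrid architecture can be sketched generically: relational storage for searchable setup metadata, HDF5 for bulky results arrays. The snippet below uses sqlite3 and h5py with invented names; it illustrates the design idea, not WholeCellSimDB's actual schema or API.

    ```python
    import sqlite3
    import numpy as np
    import h5py

    # Searchable setup metadata lives in a relational table (names invented).
    meta = sqlite3.connect("simulations.db")
    meta.execute("""CREATE TABLE IF NOT EXISTS simulation
                    (id INTEGER PRIMARY KEY, organism TEXT, seed INTEGER)""")
    meta.execute("INSERT INTO simulation (organism, seed) VALUES (?, ?)",
                 ("M. genitalium", 42))
    sim_id = meta.execute("SELECT last_insert_rowid()").fetchone()[0]
    meta.commit()

    # Bulky time-series results live in HDF5, one file per run.
    with h5py.File(f"sim_{sim_id}.h5", "w") as f:
        f.create_dataset("states/mass", data=np.linspace(13.9, 27.8, 1000))

    # Later: find runs by metadata, then slice only the needed results.
    row = meta.execute("SELECT id FROM simulation WHERE seed = 42").fetchone()
    with h5py.File(f"sim_{row[0]}.h5", "r") as f:
        print(f["states/mass"][:10])   # read a slice, not the whole run
    ```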

  14. Content Is King: Databases Preserve the Collective Information of Science.

    PubMed

    Yates, John R

    2018-04-01

    Databases store sequence information experimentally gathered to create resources that further science. In the last 20 years databases have become critical components of fields like proteomics where they provide the basis for large-scale and high-throughput proteomic informatics. Amos Bairoch, winner of the Association of Biomolecular Resource Facilities Frederick Sanger Award, has created some of the important databases proteomic research depends upon for accurate interpretation of data.

  15. Use of a German longitudinal prescription database (LRx) in pharmacoepidemiology.

    PubMed

    Richter, Hartmut; Dombrowski, Silvia; Hamer, Hajo; Hadji, Peyman; Kostev, Karel

    2015-01-01

    Large epidemiological databases are often used to examine matters pertaining to drug utilization, health services, and drug safety. The major strength of such databases is that they include large sample sizes, which allow precise estimates to be made. The IMS® LRx database has in recent years been used as a data source for epidemiological research. The aim of this paper is to review a number of recent studies published with the aid of this database and to compare them with the results of similar studies using independent data published in the literature. Although the review was limited to studies for which comparative independent results were available, it was possible to include a wide range of possible uses of the LRx database in a variety of therapeutic fields: prevalence/incidence rate determination (diabetes, epilepsy), persistence analyses (diabetes, osteoporosis), use of comedication (diabetes), drug utilization (G-CSF market) and treatment costs (diabetes, G-CSF market). In general, the results of the LRx studies were found to be clearly in line with previously published reports. In some cases, noticeable discrepancies between the LRx results and the literature data were found (e.g. prevalence in epilepsy, persistence in osteoporosis); these are discussed and possible reasons presented. Overall, it was concluded that the IMS® LRx database is a suitable database for pharmacoepidemiological studies.

  16. Osteoporosis therapies: evidence from health-care databases and observational population studies.

    PubMed

    Silverman, Stuart L

    2010-11-01

    Osteoporosis is a well-recognized disease with severe consequences if left untreated. Randomized controlled trials are the most rigorous method for determining the efficacy and safety of therapies. Nevertheless, randomized controlled trials underrepresent the real-world patient population and are costly in both time and money. Modern technology has enabled researchers to use information gathered from large health-care or medical-claims databases to assess the practical utilization of available therapies in appropriate patients. Observational database studies lack randomization but, if carefully designed and successfully completed, can provide valuable information that complements results obtained from randomized controlled trials and extends our knowledge to real-world clinical patients. Randomized controlled trials comparing fracture outcomes among osteoporosis therapies are difficult to perform. In this regard, large observational database studies could be useful in identifying clinically important differences among therapeutic options. Database studies can also provide important information with regard to osteoporosis prevalence, health economics, and compliance and persistence with treatment. This article describes the strengths and limitations of both randomized controlled trials and observational database studies, discusses considerations for observational study design, and reviews a wealth of information generated by database studies in the field of osteoporosis.

  17. Aging assessment of large electric motors in nuclear power plants

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Villaran, M.; Subudhi, M.

    1996-03-01

    Large electric motors serve as the prime movers to drive high capacity pumps, fans, compressors, and generators in a variety of nuclear plant systems. This study examined the stressors that cause degradation and aging in large electric motors operating in various plant locations and environments. The operating history of these machines in nuclear plant service was studied by review and analysis of failure reports in the NPRDS and LER databases. This was supplemented by a review of motor designs, and their nuclear and balance of plant applications, in order to characterize the failure mechanisms that cause degradation, aging, and failure in large electric motors. A generic failure modes and effects analysis for large squirrel cage induction motors was performed to identify the degradation and aging mechanisms affecting various components of these large motors, the failure modes that result, and their effects upon the function of the motor. The effects of large motor failures upon the systems in which they are operating, and on the plant as a whole, were analyzed from failure reports in the databases. The effectiveness of the industry's large motor maintenance programs was assessed based upon the failure reports in the databases and reviews of plant maintenance procedures and programs.

  18. Digital hand atlas and computer-aided bone age assessment via the Web

    NASA Astrophysics Data System (ADS)

    Cao, Fei; Huang, H. K.; Pietka, Ewa; Gilsanz, Vicente

    1999-07-01

    A frequently used assessment method of bone age is atlas matching by a radiological examination of a hand image against a reference set of atlas patterns of normal standards. We are in the process of developing a digital hand atlas with a large standard set of normal hand and wrist images that reflect skeletal maturity, race and sex differences, and current child development. The digital hand atlas will be used for computer-aided bone age assessment via the Web. We have designed and partially implemented a computer-aided diagnostic (CAD) system for Web-based bone age assessment. The system consists of a digital hand atlas, a relational image database and a Web-based user interface. The digital atlas is based on a large standard set of normal hand and wrist images with extracted bone objects and quantitative features. The image database uses content-based indexing to organize the hand images and their attributes and present them to users in a structured way. The Web-based user interface allows users to interact with the hand image database from browsers. Users can use a Web browser to push a clinical hand image to the CAD server for a bone age assessment. Quantitative features on the examined image, which reflect the skeletal maturity, are extracted and compared with patterns from the atlas database to assess the bone age. The relevant reference images and the final assessment report are then sent back to the user's browser via the Web. The digital atlas will remove the disadvantages of the current out-of-date one and allow bone age assessment to be computerized and done conveniently via the Web. In this paper, we present the system design and Web-based client-server model for computer-assisted bone age assessment and our initial implementation of the digital atlas database.

  19. Benchmarking distributed data warehouse solutions for storing genomic variant information

    PubMed Central

    Wiewiórka, Marek S.; Wysakowicz, Dawid P.; Okoniewski, Michał J.

    2017-01-01

    Genomic-based personalized medicine encompasses storing, analysing and interpreting genomic variants as its central issues. At a time when thousands of patients' sequenced exomes and genomes are becoming available, there is a growing need for efficient database storage and querying. The answer could be the application of modern distributed storage systems and query engines. However, their application to large genomic variant databases has not yet been sufficiently explored in the literature. To investigate the effectiveness of modern columnar storage [column-oriented Database Management System (DBMS)] and query engines, we have developed a prototypic genomic variant data warehouse, populated with a large generated set of genomic variants and phenotypic data. Next, we have benchmarked the performance of a number of combinations of distributed storages and query engines on a set of SQL queries that address biological questions essential for both research and medical applications. In addition, a non-distributed, analytical database (MonetDB) has been used as a baseline. Comparison of query execution times confirms that distributed data warehousing solutions outperform classic relational DBMSs. Moreover, pre-aggregation and further denormalization of data, which reduce the number of distributed join operations, significantly improve query performance, by several orders of magnitude. Most of the distributed back-ends offer good performance for complex analytical queries, while the Optimized Row Columnar (ORC) format paired with Presto and Parquet with Spark 2 query engines provide, on average, the lowest execution times. Apache Kudu, on the other hand, is the only solution that guarantees sub-second performance for simple genome range queries returning a small subset of data, where a low-latency response is expected, while still offering decent performance for analytical queries. In summary, research and clinical applications that require the storage and analysis of variants from thousands of samples can benefit from the scalability and performance of distributed data warehouse solutions. Database URL: https://github.com/ZSI-Bio/variantsdwh PMID:29220442
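
    The "simple genome range query" class benchmarked above has the following shape. SQLite stands in for the distributed engines (Presto, Spark, Kudu) purely to keep the example self-contained; the table layout and sample rows are illustrative.

    ```python
    import sqlite3

    # A toy variants table and a genome-range query; SQLite illustrates the
    # SQL shape, not the scalability of the benchmarked engines.
    conn = sqlite3.connect(":memory:")
    conn.execute("""CREATE TABLE variants
                    (sample_id TEXT, chrom TEXT, pos INTEGER,
                     ref TEXT, alt TEXT)""")
    conn.executemany("INSERT INTO variants VALUES (?, ?, ?, ?, ?)", [
        ("S1", "chr17", 43044295, "A", "G"),
        ("S2", "chr17", 43125483, "C", "T"),
        ("S3", "chr2",  21001234, "G", "A"),
    ])

    # Range query over the BRCA1 region (GRCh38 coordinates, approximate).
    for row in conn.execute("""SELECT sample_id, pos, ref, alt FROM variants
                               WHERE chrom = 'chr17'
                                 AND pos BETWEEN 43044295 AND 43125483"""):
        print(row)
    ```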

  20. System, method and apparatus for generating phrases from a database

    NASA Technical Reports Server (NTRS)

    McGreevy, Michael W. (Inventor)

    2004-01-01

    Phrase generation is a method of generating sequences of terms, such as phrases, that may occur within a database of subsets containing sequences of terms, such as text. A database is provided and a relational model of the database is created. A query is then input; the query may include a single term, a sequence of terms, multiple individual terms, multiple sequences of terms, or combinations thereof. Next, several sequences of terms that are contextually related to the query are assembled from contextual relations in the model of the database. The sequences of terms are then sorted and output. Phrase generation can also be an iterative process used to produce sequences of terms from a relational model of a database.
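
    A toy reduction of the idea: model contextual relations as term-adjacency counts over the text database, then extend a query term with its most frequent successors. This bigram simplification is an illustration of the concept, not the patented method's full relational model.

    ```python
    from collections import defaultdict

    # Tiny text "database" (invented aviation-style snippets).
    docs = ["engine failure during takeoff",
            "engine fire during climb",
            "engine failure after takeoff"]

    # Build the contextual model: how often term b follows term a.
    follows = defaultdict(lambda: defaultdict(int))
    for doc in docs:
        terms = doc.split()
        for a, b in zip(terms, terms[1:]):
            follows[a][b] += 1

    def generate_phrase(term, length=3):
        """Greedily extend a query term with its most frequent successors."""
        phrase = [term]
        while len(phrase) < length and follows[phrase[-1]]:
            nxt = max(follows[phrase[-1]].items(), key=lambda kv: kv[1])[0]
            phrase.append(nxt)
        return " ".join(phrase)

    print(generate_phrase("engine"))   # -> "engine failure during"
    ```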
