mESAdb: microRNA Expression and Sequence Analysis Database
Kaya, Koray D.; Karakülah, Gökhan; Yakıcıer, Cengiz M.; Acar, Aybar C.; Konu, Özlen
2011-01-01
microRNA expression and sequence analysis database (http://konulab.fen.bilkent.edu.tr/mirna/) (mESAdb) is a regularly updated database for the multivariate analysis of sequences and expression of microRNAs from multiple taxa. mESAdb is modular and has a user interface implemented in PHP and JavaScript and coupled with statistical analysis and visualization packages written for the R language. The database primarily comprises mature microRNA sequences and their target data, along with selected human, mouse and zebrafish expression data sets. mESAdb analysis modules allow (i) mining of microRNA expression data sets for subsets of microRNAs selected manually or by motif; (ii) pair-wise multivariate analysis of expression data sets within and between taxa; and (iii) association of microRNA subsets with annotation databases, HUGE Navigator, KEGG and GO. The use of existing and customized R packages facilitates future addition of data sets and analysis tools. Furthermore, the ability to upload and analyze user-specified data sets makes mESAdb an interactive and expandable analysis tool for microRNA sequence and expression data. PMID:21177657
The Molecular Signatures Database (MSigDB) hallmark gene set collection.
Liberzon, Arthur; Birger, Chet; Thorvaldsdóttir, Helga; Ghandi, Mahmoud; Mesirov, Jill P; Tamayo, Pablo
2015-12-23
The Molecular Signatures Database (MSigDB) is one of the most widely used and comprehensive databases of gene sets for performing gene set enrichment analysis. Since its creation, MSigDB has grown beyond its roots in metabolic disease and cancer to include >10,000 gene sets. These better represent a wider range of biological processes and diseases, but the utility of the database is reduced by increased redundancy across, and heterogeneity within, gene sets. To address this challenge, here we use a combination of automated approaches and expert curation to develop a collection of "hallmark" gene sets as part of MSigDB. Each hallmark in this collection consists of a "refined" gene set, derived from multiple "founder" sets, that conveys a specific biological state or process and displays coherent expression. The hallmarks effectively summarize most of the relevant information of the original founder sets and, by reducing both variation and redundancy, provide more refined and concise inputs for gene set enrichment analysis.
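As a concrete illustration of how such a collection is queried, below is a minimal Python sketch of the simplest overlap-based enrichment test (a one-sided hypergeometric test, not the full GSEA procedure); the gene names and the 20,000-gene universe are hypothetical.

```python
from scipy.stats import hypergeom

def enrichment_p(query_genes, gene_set, universe_size):
    """One-sided hypergeometric p-value for the overlap between a query
    gene list and one gene set, given a background universe size."""
    query = set(query_genes)
    overlap = len(query & gene_set)
    # P(X >= overlap) when drawing len(query) genes from the universe
    return hypergeom.sf(overlap - 1, universe_size, len(gene_set), len(query))

# Hypothetical example: 3 of 4 query genes fall inside a 50-gene hallmark set.
hallmark = set(f"GENE{i}" for i in range(50))
print(enrichment_p(["GENE1", "GENE2", "GENE3", "OTHER"], hallmark, 20000))
```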
Aguilera-Mendoza, Longendri; Marrero-Ponce, Yovani; Tellez-Ibarra, Roberto; Llorente-Quesada, Monica T; Salgado, Jesús; Barigye, Stephen J; Liu, Jun
2015-08-01
The large variety of antimicrobial peptide (AMP) databases developed to date is characterized by a substantial overlap of data and similarity of sequences. Our goals are to analyze the levels of redundancy for all available AMP databases and use this information to build a new non-redundant sequence database. For this purpose, a new software tool is introduced. A comparative study of 25 AMP databases reveals the overlap and diversity among them and the internal diversity within each database. The overlap analysis shows that only one database (Peptaibol) contains exclusive data, not present in any other, whereas all sequences in the LAMP_Patent database are included in CAMP_Patent. However, the majority of databases have their own set of unique sequences, as well as some overlap with other databases. The complete set of non-duplicate sequences comprises 16,990 cases, which is almost half of the total number of reported peptides. On the other hand, the diversity analysis identifies the most and least diverse databases and proves that all databases exhibit some level of redundancy. Finally, we present a new parallel-free software, named Dover Analyzer, developed to compute the overlap and diversity between any number of databases and compile a set of non-redundant sequences. These results are useful for selecting or building a suitable representative set of AMPs, according to specific needs.
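The overlap and non-redundancy computations described here can be sketched in a few lines, assuming exact sequence identity as the redundancy criterion (Dover Analyzer's actual matching rules may be more nuanced); the database names and peptide strings below are invented.

```python
def overlap_matrix(databases):
    """Pairwise overlap counts between named sequence databases,
    where each database is given as a set of sequence strings."""
    names = list(databases)
    return {(a, b): len(databases[a] & databases[b]) for a in names for b in names}

def non_redundant(databases):
    """Union of all databases with exact duplicate sequences removed."""
    merged = set()
    for seqs in databases.values():
        merged |= seqs
    return merged

dbs = {"dbA": {"GLFDIVKKVV", "KWKLFKKIEK"}, "dbB": {"KWKLFKKIEK"}}
print(overlap_matrix(dbs)[("dbA", "dbB")], len(non_redundant(dbs)))  # 1 2
```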
Vivar, Juan C; Pemu, Priscilla; McPherson, Ruth; Ghosh, Sujoy
2013-08-01
Unparalleled technological advances have fueled an explosive growth in the scope and scale of biological data and have propelled life sciences into the realm of "Big Data" that cannot be managed or analyzed by conventional approaches. Big Data in the life sciences are driven primarily via a diverse collection of 'omics'-based technologies, including genomics, proteomics, metabolomics, transcriptomics, metagenomics, and lipidomics. Gene-set enrichment analysis is a powerful approach for interrogating large 'omics' datasets, leading to the identification of biological mechanisms associated with observed outcomes. While several factors influence the results from such analysis, the impact from the contents of pathway databases is often under-appreciated. Pathway databases often contain variously named pathways that overlap with one another to varying degrees. Ignoring such redundancies during pathway analysis can lead to the designation of several pathways as being significant due to high content-similarity, rather than truly independent biological mechanisms. Statistically, such dependencies also result in correlated p values and overdispersion, leading to biased results. We investigated the level of redundancies in multiple pathway databases and observed large discrepancies in the nature and extent of pathway overlap. This prompted us to develop the application, ReCiPa (Redundancy Control in Pathway Databases), to control redundancies in pathway databases based on user-defined thresholds. Analysis of genomic and genetic datasets, using ReCiPa-generated overlap-controlled versions of KEGG and Reactome pathways, led to a reduction in redundancy among the top-scoring gene-sets and allowed for the inclusion of additional gene-sets representing possibly novel biological mechanisms. Using obesity as an example, bioinformatic analysis further demonstrated that gene-sets identified from overlap-controlled pathway databases show stronger evidence of prior association to obesity compared to pathways identified from the original databases.
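A rough sketch of overlap control in the spirit of ReCiPa follows: gene sets whose overlap coefficient exceeds a user-defined threshold are greedily merged. This is an illustrative reconstruction under stated assumptions, not the published ReCiPa algorithm, and the pathway names are made up.

```python
def control_redundancy(pathways, threshold=0.8):
    """Greedily merge pathway gene sets whose overlap coefficient
    (|A & B| / min(|A|, |B|)) exceeds a user-defined threshold."""
    merged = []
    for name, genes in sorted(pathways.items(), key=lambda kv: -len(kv[1])):
        genes = set(genes)
        for entry in merged:
            common = len(genes & entry["genes"])
            if common / min(len(genes), len(entry["genes"])) > threshold:
                entry["genes"] |= genes          # absorb the redundant set
                entry["names"].append(name)
                break
        else:
            merged.append({"names": [name], "genes": genes})
    return merged

pathways = {"glycolysis_kegg": ["HK1", "PFKM", "PKM"],
            "glycolysis_reactome": ["HK1", "PFKM", "PKM", "ENO1"]}
print(control_redundancy(pathways, threshold=0.8))  # one merged gene set
```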
Tiered Human Integrated Sequence Search Databases for Shotgun Proteomics.
Deutsch, Eric W; Sun, Zhi; Campbell, David S; Binz, Pierre-Alain; Farrah, Terry; Shteynberg, David; Mendoza, Luis; Omenn, Gilbert S; Moritz, Robert L
2016-11-04
The results of analysis of shotgun proteomics mass spectrometry data can be greatly affected by the selection of the reference protein sequence database against which the spectra are matched. For many species there are multiple sources from which somewhat different sequence sets can be obtained. This can lead to confusion about which database is best in which circumstances, a problem especially acute in human sample analysis. All sequence databases are genome-based, with sequences for the predicted gene and their protein translation products compiled. Our goal is to create a set of primary sequence databases that comprise the union of sequences from many of the different available sources and make the result easily available to the community. We have compiled a set of four sequence databases of varying sizes, from a small database consisting of only the ~20,000 primary isoforms plus contaminants to a very large database that includes almost all nonredundant protein sequences from several sources. This set of tiered, increasingly complete human protein sequence databases suitable for mass spectrometry proteomics sequence database searching is called the Tiered Human Integrated Search Proteome set. In order to evaluate the utility of these databases, we have analyzed two different data sets, one from the HeLa cell line and the other from normal human liver tissue, with each of the four tiers of database complexity. The result is that approximately 0.8%, 1.1%, and 1.5% additional peptides can be identified for Tiers 2, 3, and 4, respectively, as compared with the Tier 1 database, at substantially increasing computational cost. This increase in computational cost may be worth bearing if the identification of sequence variants or the discovery of sequences that are not present in the reviewed knowledge base entries is an important goal of the study. We find that it is useful to search a data set against a simpler database, and then check the uniqueness of the discovered peptides against a more complex database. We have set up an automated system that downloads all the source databases on the first of each month and automatically generates a new set of search databases and makes them available for download at http://www.peptideatlas.org/thisp/.
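The union-building idea behind the tiered databases can be pictured with a small sketch that merges several sources while keeping one representative entry per distinct sequence; this is an assumption-laden illustration of deduplication, not the THISP build pipeline, and some headers below are invented.

```python
def merge_fasta_sources(sources):
    """Union of several FASTA-like {header: sequence} sources, keeping one
    representative header per distinct sequence (a non-redundant set).
    Earlier sources in the list (lower tiers) take precedence."""
    by_seq = {}
    for source in sources:
        for header, seq in source.items():
            by_seq.setdefault(seq, header)
    return {header: seq for seq, header in by_seq.items()}

tier1 = {"sp|P69905|HBA_HUMAN": "MVLSPADKTNVKAAW"}
other = {"tr|XXXX|HBA_VAR": "MVLSPADKTNVKAAW",  # duplicate sequence, dropped
         "tr|YYYY|NEW": "MSSPTAGV"}
print(len(merge_fasta_sources([tier1, other])))  # 2 distinct sequences
```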
Damming the genomic data flood using a comprehensive analysis and storage data structure
Bouffard, Marc; Phillips, Michael S.; Brown, Andrew M.K.; Marsh, Sharon; Tardif, Jean-Claude; van Rooij, Tibor
2010-01-01
Data generation, driven by rapid advances in genomic technologies, is fast outpacing our analysis capabilities. Faced with this flood of data, more hardware and software resources are added to accommodate data sets whose structure has not specifically been designed for analysis. This leads to unnecessarily lengthy processing times and excessive data handling and storage costs. Current efforts to address this have centered on developing new indexing schemas and analysis algorithms, whereas the root of the problem lies in the format of the data itself. We have developed a new data structure for storing and analyzing genotype and phenotype data. By leveraging data normalization techniques, database management system capabilities and the use of a novel multi-table, multidimensional database structure we have eliminated the following: (i) unnecessarily large data set size due to high levels of redundancy, (ii) sequential access to these data sets and (iii) common bottlenecks in analysis times. The resulting novel data structure horizontally divides the data to circumvent traditional problems associated with the use of databases for very large genomic data sets. The resulting data set required 86% less disk space and performed analytical calculations 6248 times faster compared to a standard approach without any loss of information. Database URL: http://castor.pharmacogenomics.ca PMID:21159730
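The horizontal division of data described above can be illustrated with a toy sketch: a stable hash of the SNP identifier routes each genotype record to one of several partition tables, so a query for one SNP touches a single small table instead of one monolithic data set. SQLite is used here for brevity; the paper's multi-table, multidimensional structure is considerably more elaborate, and the schema below is invented.

```python
import sqlite3
import zlib

def partitioned_insert(conn, sample_id, snp_id, genotype, n_partitions=16):
    """Route a genotype record to one of several horizontal partitions,
    chosen by a stable (non-salted) hash of the SNP identifier."""
    part = zlib.crc32(snp_id.encode()) % n_partitions
    table = f"genotype_p{part}"
    conn.execute(f"CREATE TABLE IF NOT EXISTS {table} "
                 "(sample_id TEXT, snp_id TEXT, gt INTEGER)")
    conn.execute(f"INSERT INTO {table} VALUES (?, ?, ?)",
                 (sample_id, snp_id, genotype))

conn = sqlite3.connect(":memory:")
partitioned_insert(conn, "S001", "rs12345", 2)  # hypothetical sample and SNP
```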
Recent NASA Wake-Vortex Flight Tests, Flow-Physics Database and Wake-Development Analysis
NASA Technical Reports Server (NTRS)
Vicroy, Dan D.; Vijgen, Paul M.; Reimer, Heidi M.; Gallegos, Joey L.; Spalart, Philippe R.
1998-01-01
A series of flight tests over the ocean of a four-engine turboprop airplane in the cruise configuration has provided a data set for improved understanding of wake-vortex physics and atmospheric interaction. An integrated database has been compiled for wake characterization and validation of wake-vortex computational models. This paper describes the wake-vortex flight tests, the data processing, the database development and access, and results obtained from preliminary wake-characterization analysis using the data sets.
Assessment of the SFC database for analysis and modeling
NASA Technical Reports Server (NTRS)
Centeno, Martha A.
1994-01-01
SFC is one of the four clusters that make up the Integrated Work Control System (IWCS), which will integrate the shuttle processing databases at Kennedy Space Center (KSC). The IWCS framework will enable communication among the four clusters and add new data collection protocols. The Shop Floor Control (SFC) module has been operational for two and a half years; however, at this stage automatic links to the other three modules have not yet been implemented, except for a partial link to IOS (CASPR). SFC revolves around a DB/2 database with PFORMS acting as the database management system (DBMS). PFORMS is an off-the-shelf DB/2 application that provides a set of data entry screens and query forms. The main dynamic entity in the SFC and IOS databases is a task; thus, the physical storage location and update privileges are driven by the status of the WAD. As we explored the SFC values, we realized that there was much to do before actually engaging in continuous analysis of the SFC data. Halfway into this effort, it became clear that full-scale analysis would have to be a future third phase of the work. We therefore concentrated on getting to know the contents of the database and on establishing an initial set of tools to start the continuous analysis process. Specifically, we set out to: (1) provide specific procedures for statistical models, so as to enhance the TP-OAO office's analysis and modeling capabilities; (2) design a data exchange interface; (3) prototype the interface to provide inputs to SCRAM; and (4) design a modeling database. These objectives were set with the expectation that, if met, they would provide former TP-OAO engineers with tools to help them demonstrate the importance of process-based analyses. The latter, in turn, will help them obtain the cooperation of various organizations in charting out their individual processes.
Starbase Data Tables: An ASCII Relational Database for Unix
NASA Astrophysics Data System (ADS)
Roll, John
2011-11-01
Database management is an increasingly important part of astronomical data analysis. Astronomers need easy and convenient ways of storing, editing, filtering, and retrieving data about data. Commercial databases do not provide good solutions for many of the everyday and informal types of database access astronomers need. The Starbase database system with simple data file formatting rules and command line data operators has been created to answer this need. The system includes a complete set of relational and set operators, fast search/index and sorting operators, and many formatting and I/O operators. Special features are included to enhance the usefulness of the database when manipulating astronomical data. The software runs under UNIX, MSDOS and IRAF.
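A minimal reader for a table in this spirit is sketched below; the exact layout (a tab-separated header row followed by a dashed separator row) is assumed from the description rather than taken from the Starbase documentation.

```python
import io

def read_table(stream):
    """Parse a Starbase-style ASCII table: a tab-separated header row,
    a dashed separator row, then tab-separated data rows (layout assumed)."""
    header = stream.readline().rstrip("\n").split("\t")
    stream.readline()                       # skip the row of dashes
    return [dict(zip(header, line.rstrip("\n").split("\t"))) for line in stream]

demo = io.StringIO("ra\tdec\tmag\n--\t---\t---\n10.68\t41.27\t3.4\n")
rows = read_table(demo)
print([r for r in rows if float(r["mag"]) < 5])   # a simple relational select
```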
Dziadkowiec, Oliwier; Callahan, Tiffany; Ozkaynak, Mustafa; Reeder, Blaine; Welton, John
2016-01-01
Objectives: We examine the following: (1) the appropriateness of using a data quality (DQ) framework developed for relational databases as a data-cleaning tool for a data set extracted from two EPIC databases, and (2) the differences in statistical parameter estimates between a data set cleaned with the DQ framework and a data set not cleaned with it. Background: The use of data contained within electronic health records (EHRs) has the potential to open doors for a new wave of innovative research. Without adequate preparation of such large data sets for analysis, the results might be erroneous, which might affect clinical decision-making or the results of Comparative Effectiveness Research studies. Methods: Two emergency department (ED) data sets extracted from EPIC databases (adult ED and children's ED) were used as examples for examining the five concepts of DQ based on a DQ assessment framework designed for EHR databases. The first data set contained 70,061 visits; the second data set contained 2,815,550 visits. SPSS Syntax examples as well as step-by-step instructions for applying the five key DQ concepts to these EHR database extracts are provided. Conclusions: SPSS Syntax to address each of the DQ concepts proposed by Kahn et al. (2012) was developed. The data set cleaned using Kahn's framework yielded more accurate results than the data set cleaned without this framework. Future plans involve creating functions in the R language for cleaning data extracted from the EHR as well as an R package that combines DQ checks with missing data analysis functions. PMID:27429992
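To make the framework concrete, here is a minimal pandas sketch of two of the five Kahn DQ concepts (completeness and plausibility) applied to a toy ED extract; the paper itself works in SPSS Syntax, and the column names and plausibility bounds below are hypothetical.

```python
import pandas as pd

def dq_report(df):
    """Two illustrative DQ checks: completeness (missingness per column)
    and plausibility (out-of-range ages); thresholds are assumptions."""
    report = {"missing_fraction": df.isna().mean().to_dict()}
    if "age" in df:
        report["implausible_age"] = int(((df["age"] < 0) | (df["age"] > 120)).sum())
    return report

visits = pd.DataFrame({"age": [34, -2, 130, 57], "triage": ["3", None, "2", "1"]})
print(dq_report(visits))   # 1 missing triage value, 2 implausible ages
```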
The Cardiac Safety Research Consortium ECG database.
Kligfield, Paul; Green, Cynthia L
2012-01-01
The Cardiac Safety Research Consortium (CSRC) ECG database was initiated to foster research using anonymized, XML-formatted, digitized ECGs with corresponding descriptive variables from placebo- and positive-control arms of thorough QT studies submitted to the US Food and Drug Administration (FDA) by pharmaceutical sponsors. The database can be expanded to other data that are submitted directly to CSRC from other sources, and currently includes digitized ECGs from patients with genotyped varieties of congenital long-QT syndrome; this congenital long-QT database is also linked to ambulatory electrocardiograms stored in the Telemetric and Holter ECG Warehouse (THEW). Thorough QT data sets are available from CSRC for unblinded development of algorithms for analysis of repolarization and for blinded comparative testing of algorithms developed for the identification of moxifloxacin, as used as a positive control in thorough QT studies. Policies and procedures for access to these data sets are available from CSRC, which has developed tools for statistical analysis of blinded new algorithm performance. A recently approved CSRC project will create a data set for blinded analysis of automated ECG interval measurements, whose initial focus will include comparison of four of the major manufacturers of automated electrocardiographs in the United States. CSRC welcomes application for use of the ECG database for clinical investigation.
2012-09-01
Only fragmentary front-matter text survives for this report: it evaluates the relative performance of several conventional SQL and NoSQL databases with a set of one billion file block hashes, in the context of digital forensics and sector hashing. The remaining fragments are table-of-contents and acronym-list residue (NOOP: No Operation assembly instruction; NoSQL: "Not only SQL" model for non-relational database management; NSRL: National Software Reference Library).
SASD: the Synthetic Alternative Splicing Database for identifying novel isoform from proteomics
2013-01-01
Background: Alternative splicing is an important and widespread mechanism for generating protein diversity and regulating protein expression. High-throughput identification and analysis of alternative splicing at the protein level has more advantages than at the mRNA level. The combination of an alternative splicing database and tandem mass spectrometry provides a powerful technique for identification, analysis and characterization of potential novel alternative splicing protein isoforms from proteomics. Therefore, based on the peptidomic database of human protein isoforms for proteomics experiments, our objective is to design a new alternative splicing database to 1) provide more coverage of genes, transcripts and alternative splicing, 2) exclusively focus on alternative splicing, and 3) perform context-specific alternative splicing analysis. Results: We used a three-step pipeline to create a synthetic alternative splicing database (SASD) to identify novel alternative splicing isoforms and interpret them in the context of pathway, disease, drug and organ specificity or custom gene sets, with maximum coverage and an exclusive focus on alternative splicing. First, we extracted information on gene structures of all genes in the Ensembl Genes 71 database and incorporated the Integrated Pathway Analysis Database. Then, we compiled artificial splicing transcripts. Lastly, we translated the artificial transcripts into alternative splicing peptides. The SASD is a comprehensive database containing 56,630 genes (Ensembl gene IDs), 95,260 transcripts (Ensembl transcript IDs), and 11,919,779 alternative splicing peptides, and also covers about 1,956 pathways, 6,704 diseases, 5,615 drugs, and 52 organs. The database has a web-based user interface that allows users to search, display and download alternative splicing related to a single gene/transcript/protein, custom gene set, pathway, disease, drug or organ. Moreover, the quality of the database was validated by comparison with other known databases and two case studies: 1) in liver cancer and 2) in breast cancer. Conclusions: The SASD provides the scientific community with an efficient means to identify, analyze, and characterize novel exon-skipping and intron-retention protein isoforms from mass spectrometry and interpret them in the context of pathway, disease, drug and organ specificity or custom gene sets, with maximum coverage and an exclusive focus on alternative splicing. PMID:24267658
The improved Apriori algorithm based on matrix pruning and weight analysis
NASA Astrophysics Data System (ADS)
Lang, Zhenhong
2018-04-01
Drawing on existing matrix compression and weight analysis algorithms, this paper proposes an improved Apriori algorithm based on matrix pruning and weight analysis. After a single scan of the transactional database, the algorithm constructs a boolean transaction matrix. By counting the 1s in the rows and columns of this matrix, infrequent itemsets are pruned and a new candidate itemset is formed. Item weights, transaction weights, and the weighted support of items are then calculated to obtain the frequent itemsets. Experimental results show that the improved Apriori algorithm both reduces the number of repeated database scans and improves the efficiency of association mining.
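The matrix-construction and pruning step can be sketched as follows, unweighted for clarity (the paper's item weights, transaction weights, and weighted support are omitted). Column sums of the boolean matrix replace repeated database scans.

```python
import numpy as np

def frequent_items(transactions, items, min_support):
    """A single pass builds a boolean transaction matrix; column sums then
    prune infrequent 1-itemsets without rescanning the database."""
    matrix = np.array([[item in t for item in items] for t in transactions])
    support = matrix.sum(axis=0) / len(transactions)
    return [item for item, s in zip(items, support) if s >= min_support], matrix

items = ["a", "b", "c"]
txns = [{"a", "b"}, {"a", "c"}, {"a", "b"}]
print(frequent_items(txns, items, 0.5)[0])   # ['a', 'b']
```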
A comprehensive and scalable database search system for metaproteomics.
Chatterjee, Sandip; Stupp, Gregory S; Park, Sung Kyu Robin; Ducom, Jean-Christophe; Yates, John R; Su, Andrew I; Wolan, Dennis W
2016-08-16
Mass spectrometry-based shotgun proteomics experiments rely on accurate matching of experimental spectra against a database of protein sequences. Existing computational analysis methods are limited in the size of their sequence databases, which severely restricts the proteomic sequencing depth and functional analysis of highly complex samples. The growing amount of public high-throughput sequencing data will only exacerbate this problem. We designed a broadly applicable metaproteomic analysis method (ComPIL) that addresses protein database size limitations. Our approach to overcome this significant limitation in metaproteomics was to design a scalable set of sequence databases assembled for optimal library querying speeds. ComPIL was integrated with a modified version of the search engine ProLuCID (termed "Blazmass") to permit rapid matching of experimental spectra. Proof-of-principle analysis of human HEK293 lysate with a ComPIL database derived from high-quality genomic libraries was able to detect nearly all of the same peptides as a search with a human database (~500x fewer peptides in the database), with a small reduction in sensitivity. We were also able to detect proteins from the adenovirus used to immortalize these cells. We applied our method to a set of healthy human gut microbiome proteomic samples and showed a substantial increase in the number of identified peptides and proteins compared to previous metaproteomic analyses, while retaining a high degree of protein identification accuracy and allowing for a more in-depth characterization of the functional landscape of the samples. The combination of ComPIL with Blazmass allows proteomic searches to be performed with database sizes much larger than previously possible. These large database searches can be applied to complex meta-samples with unknown composition or proteomic samples where unexpected proteins may be identified. The protein database, proteomic search engine, and the proteomic data files for the 5 microbiome samples characterized and discussed herein are open source and available for use and additional analysis.
Stochastic Modeling and Analysis of Energy Commodity Spot Price Processes
2014-06-27
Only fragmentary table and reference text survives for this dissertation: parameter estimates $\hat m_k$, $\beta_{\hat m_k,k}$, $\mu_{\hat m_k,k}$, $\delta_{\hat m_k,k}$, $\sigma_{\hat m_k,k}$, $\gamma_{\hat m_k,k}$ for U.S. Treasury Bill and U.S. dollar Eurocurrency data sets (January 1990 to December 2004) collected from a Forex database, plus a citation to the U.S. Energy Information Administration.
Ushijima, Masaru; Mashima, Tetsuo; Tomida, Akihiro; Dan, Shingo; Saito, Sakae; Furuno, Aki; Tsukahara, Satomi; Seimiya, Hiroyuki; Yamori, Takao; Matsuura, Masaaki
2013-03-01
Genome-wide transcriptional expression analysis is a powerful strategy for characterizing the biological activity of anticancer compounds. It is often instructive to identify gene sets involved in the activity of a given drug compound for comparison with different compounds. Currently, however, there is no comprehensive gene expression database and related application system that is: (i) specialized in anticancer agents; (ii) easy to use; and (iii) open to the public. To develop a public gene expression database of antitumor agents, we first examined gene expression profiles in human cancer cells after exposure to 35 compounds including 25 clinically used anticancer agents. Gene signatures were extracted that were classified as upregulated or downregulated after exposure to the drug. Hierarchical clustering showed that drugs with similar mechanisms of action, such as genotoxic drugs, were clustered. Connectivity map analysis further revealed that our gene signature data reflected modes of action of the respective agents. Together with the database, we developed analysis programs that calculate scores for ranking changes in gene expression and for searching statistically significant pathways from the Kyoto Encyclopedia of Genes and Genomes database in order to analyze the datasets more easily. Our database and the analysis programs are available online at our website (http://scads.jfcr.or.jp/db/cs/). Using these systems, we successfully showed that proteasome inhibitors are selectively classified as endoplasmic reticulum stress inducers and induce atypical endoplasmic reticulum stress. Thus, our public access database and related analysis programs constitute a set of efficient tools to evaluate the mode of action of novel compounds and identify promising anticancer lead compounds.
A data model and database for high-resolution pathology analytical image informatics.
Wang, Fusheng; Kong, Jun; Cooper, Lee; Pan, Tony; Kurc, Tahsin; Chen, Wenjin; Sharma, Ashish; Niedermayr, Cristobal; Oh, Tae W; Brat, Daniel; Farris, Alton B; Foran, David J; Saltz, Joel
2011-01-01
The systematic analysis of imaged pathology specimens often results in a vast amount of morphological information at both the cellular and sub-cellular scales. While microscopy scanners and computerized analysis are capable of capturing and analyzing data rapidly, microscopy image data remain underutilized in research and clinical settings. One major obstacle which tends to reduce wider adoption of these new technologies throughout the clinical and scientific communities is the challenge of managing, querying, and integrating the vast amounts of data resulting from the analysis of large digital pathology datasets. This paper presents a data model, which addresses these challenges, and demonstrates its implementation in a relational database system. This paper describes a data model, referred to as Pathology Analytic Imaging Standards (PAIS), and a database implementation, which are designed to support the data management and query requirements of detailed characterization of micro-anatomic morphology through many interrelated analysis pipelines on whole-slide images and tissue microarrays (TMAs). (1) Development of a data model capable of efficiently representing and storing virtual slide related image, annotation, markup, and feature information. (2) Development of a database, based on the data model, capable of supporting queries for data retrieval based on analysis and image metadata, queries for comparison of results from different analyses, and spatial queries on segmented regions, features, and classified objects. The work described in this paper is motivated by the challenges associated with characterization of micro-scale features for comparative and correlative analyses involving whole-slide tissue images and TMAs. Technologies for digitizing tissues have advanced significantly in the past decade. Slide scanners are capable of producing high-magnification, high-resolution images from whole slides and TMAs within several minutes. Hence, it is becoming increasingly feasible for basic, clinical, and translational research studies to produce thousands of whole-slide images. Systematic analysis of these large datasets requires efficient data management support for representing and indexing results from hundreds of interrelated analyses generating very large volumes of quantifications such as shape and texture and of classifications of the quantified features. We have designed a data model and a database to address the data management requirements of detailed characterization of micro-anatomic morphology through many interrelated analysis pipelines. The data model represents virtual slide related image, annotation, markup and feature information. The database supports a wide range of metadata and spatial queries on images, annotations, markups, and features. We currently have three databases running on a Dell PowerEdge T410 server with CentOS 5.5 Linux operating system. The database server is IBM DB2 Enterprise Edition 9.7.2. The set of databases consists of 1) a TMA database containing image analysis results from 4740 cases of breast cancer, with 641 MB storage size; 2) an algorithm validation database, which stores markups and annotations from two segmentation algorithms and two parameter sets on 18 selected slides, with 66 GB storage size; and 3) an in silico brain tumor study database comprising results from 307 TCGA slides, with 365 GB storage size. The latter two databases also contain human-generated annotations and markups for regions and nuclei.
Modeling and managing pathology image analysis results in a database provide immediate benefits on the value and usability of data in a research study. The database provides powerful query capabilities, which are otherwise difficult or cumbersome to support by other approaches such as programming languages. Standardized, semantic annotated data representation and interfaces also make it possible to more efficiently share image data and analysis results.
EuPathDB: the eukaryotic pathogen genomics database resource
Aurrecoechea, Cristina; Barreto, Ana; Basenko, Evelina Y.; Brestelli, John; Brunk, Brian P.; Cade, Shon; Crouch, Kathryn; Doherty, Ryan; Falke, Dave; Fischer, Steve; Gajria, Bindu; Harb, Omar S.; Heiges, Mark; Hertz-Fowler, Christiane; Hu, Sufen; Iodice, John; Kissinger, Jessica C.; Lawrence, Cris; Li, Wei; Pinney, Deborah F.; Pulman, Jane A.; Roos, David S.; Shanmugasundram, Achchuthan; Silva-Franco, Fatima; Steinbiss, Sascha; Stoeckert, Christian J.; Spruill, Drew; Wang, Haiming; Warrenfeltz, Susanne; Zheng, Jie
2017-01-01
The Eukaryotic Pathogen Genomics Database Resource (EuPathDB, http://eupathdb.org) is a collection of databases covering 170+ eukaryotic pathogens (protists & fungi), along with relevant free-living and non-pathogenic species, and select pathogen hosts. To facilitate the discovery of meaningful biological relationships, the databases couple preconfigured searches with visualization and analysis tools for comprehensive data mining via intuitive graphical interfaces and APIs. All data are analyzed with the same workflows, including creation of gene orthology profiles, so data are easily compared across data sets, data types and organisms. EuPathDB is updated with numerous new analysis tools, features, data sets and data types. New tools include GO, metabolic pathway and word enrichment analyses plus an online workspace for analysis of personal, non-public, large-scale data. Expanded data content is mostly genomic and functional genomic data while new data types include protein microarray, metabolic pathways, compounds, quantitative proteomics, copy number variation, and polysomal transcriptomics. New features include consistent categorization of searches, data sets and genome browser tracks; redesigned gene pages; effective integration of alternative transcripts; and a EuPathDB Galaxy instance for private analyses of a user's data. Forthcoming upgrades include user workspaces for private integration of data with existing EuPathDB data and improved integration and presentation of host–pathogen interactions. PMID:27903906
Mayo, Charles; Conners, Steve; Warren, Christopher; Miller, Robert; Court, Laurence; Popple, Richard
2013-01-01
Purpose: With the emergence of clinical outcomes databases as tools utilized routinely within institutions comes the need for software tools to support automated statistical analysis of these large data sets and intrainstitutional exchange from independent federated databases to support data pooling. In this paper, the authors present a design approach and analysis methodology that addresses both issues. Methods: A software application was constructed to automate analysis of patient outcomes data using a wide range of statistical metrics, by combining use of C#.Net and R code. The accuracy and speed of the code was evaluated using benchmark data sets. Results: The approach provides data needed to evaluate combinations of statistical measurements for ability to identify patterns of interest in the data. Through application of the tools to a benchmark data set for dose-response threshold and to SBRT lung data sets, an algorithm was developed that uses receiver operator characteristic curves to identify a threshold value and combines use of contingency tables, Fisher exact tests, Welch t-tests, and Kolmogorov-Smirnov tests to filter the large data set to identify values demonstrating dose-response. Kullback-Leibler divergences were used to provide additional confirmation. Conclusions: The work demonstrates the viability of the design approach and the software tool for analysis of large data sets. PMID:24320426
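The statistical filtering battery described here can be reconstructed roughly as below, using scipy on simulated data; the authors' implementation combines C#.Net and R, so this Python sketch only illustrates the Welch t-test plus Kolmogorov-Smirnov screen, and the group values are made up.

```python
import numpy as np
from scipy import stats

def dose_response_screen(responders, nonresponders, alpha=0.05):
    """Screen one candidate dose metric: Welch t-test plus two-sample
    Kolmogorov-Smirnov test; flag only when both are significant."""
    t_p = stats.ttest_ind(responders, nonresponders, equal_var=False).pvalue
    ks_p = stats.ks_2samp(responders, nonresponders).pvalue
    return {"welch_p": t_p, "ks_p": ks_p, "flagged": max(t_p, ks_p) < alpha}

rng = np.random.default_rng(0)   # simulated dose metrics for two groups
print(dose_response_screen(rng.normal(60, 5, 40), rng.normal(52, 5, 40)))
```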
Synthesis, Interdiction, and Protection of Layered Networks
2009-09-01
Only fragmentary front-matter text survives for this dissertation: it analyzes the Al Qaeda network from the Sageman database (noting that the database is based on Dr. Sageman's 2004 publication and may be dated), studies interdiction resources versus closeness centrality, and describes a decomposition that partitions a feasible set S (which may be a polyhedron, a set with discrete variables, a set with nonlinearities, and so on) into two mutually exclusive subsets.
VitisExpDB: a database resource for grape functional genomics.
Doddapaneni, Harshavardhan; Lin, Hong; Walker, M Andrew; Yao, Jiqiang; Civerolo, Edwin L
2008-02-28
The family Vitaceae consists of many different grape species that grow in a range of climatic conditions. In the past few years, several studies have generated functional genomic information on different Vitis species and cultivars, including the European grape vine, Vitis vinifera. Our goal is to develop a comprehensive web data source for Vitaceae. VitisExpDB is an online MySQL-PHP driven relational database that houses annotated EST and gene expression data for V. vinifera and non-vinifera grape species and varieties. Currently, the database stores approximately 320,000 EST sequences derived from 8 species/hybrids, their annotation (BLAST top match) details and Gene Ontology based structured vocabulary. Putative homologs for each EST in other species and varieties along with information on their percent nucleotide identities, phylogenetic relationship and common primers can be retrieved. The database also includes information on probe sequence and annotation features of the high density 60-mer gene expression chip consisting of approximately 20,000 non-redundant set of ESTs. Finally, the database includes 14 processed global microarray expression profile sets. Data from 12 of these expression profile sets have been mapped onto metabolic pathways. A user-friendly web interface with multiple search indices and extensively hyperlinked result features that permit efficient data retrieval has been developed. Several online bioinformatics tools that interact with the database along with other sequence analysis tools have been added. In addition, users can submit their ESTs to the database. The developed database provides genomic resource to grape community for functional analysis of genes in the collection and for the grape genome annotation and gene function identification. The VitisExpDB database is available through our website http://cropdisease.ars.usda.gov/vitis_at/main-page.htm.
BμG@Sbase—a microbial gene expression and comparative genomic database
Witney, Adam A.; Waldron, Denise E.; Brooks, Lucy A.; Tyler, Richard H.; Withers, Michael; Stoker, Neil G.; Wren, Brendan W.; Butcher, Philip D.; Hinds, Jason
2012-01-01
The reducing cost of high-throughput functional genomic technologies is creating a deluge of high volume, complex data, placing the burden on bioinformatics resources and tool development. The Bacterial Microarray Group at St George's (BμG@S) has been at the forefront of bacterial microarray design and analysis for over a decade and while serving as a hub of a global network of microbial research groups has developed BμG@Sbase, a microbial gene expression and comparative genomic database. BμG@Sbase (http://bugs.sgul.ac.uk/bugsbase/) is a web-browsable, expertly curated, MIAME-compliant database that stores comprehensive experimental annotation and multiple raw and analysed data formats. Consistent annotation is enabled through a structured set of web forms, which guide the user through the process following a set of best practices and controlled vocabulary. The database currently contains 86 expertly curated publicly available data sets (with a further 124 not yet published) and full annotation information for 59 bacterial microarray designs. The data can be browsed and queried using an explorer-like interface; integrating intuitive tree diagrams to present complex experimental details clearly and concisely. Furthermore the modular design of the database will provide a robust platform for integrating other data types beyond microarrays into a more Systems analysis based future. PMID:21948792
Fan, Qianrui; Wang, Wenyu; Hao, Jingcan; He, Awen; Wen, Yan; Guo, Xiong; Wu, Cuiyan; Ning, Yujie; Wang, Xi; Wang, Sen; Zhang, Feng
2017-08-01
Neuroticism is a fundamental personality trait with a significant genetic determinant. To identify novel susceptibility genes for neuroticism, we conducted an integrative analysis of genomic and transcriptomic data from a genome-wide association study (GWAS) and an expression quantitative trait locus (eQTL) study. GWAS summary data were derived from published studies of neuroticism, in total involving 170,906 subjects. The eQTL dataset, containing 927,753 eQTLs, was obtained from an eQTL meta-analysis of 5311 samples. Integrative analysis of GWAS and eQTL data was conducted with the summary data-based Mendelian randomization (SMR) analysis software. To identify neuroticism-associated gene sets, the SMR analysis results were further subjected to gene set enrichment analysis (GSEA), using the gene set annotation dataset (containing 13,311 annotated gene sets) of the GSEA Molecular Signatures Database. SMR single-gene analysis identified 6 significant genes for neuroticism: MSRA (p value = 2.27×10⁻¹⁰), MGC57346 (p value = 6.92×10⁻⁷), BLK (p value = 1.01×10⁻⁶), XKR6 (p value = 1.11×10⁻⁶), C17ORF69 (p value = 1.12×10⁻⁶) and KIAA1267 (p value = 4.00×10⁻⁶). Gene set enrichment analysis observed a significant association for the Chr8p23 gene set (false discovery rate = 0.033). Our results provide novel clues for studies of the genetic mechanisms of neuroticism.
GANESH: software for customized annotation of genome regions.
Huntley, Derek; Hummerich, Holger; Smedley, Damian; Kittivoravitkul, Sasivimol; McCarthy, Mark; Little, Peter; Sergot, Marek
2003-09-01
GANESH is a software package designed to support the genetic analysis of regions of human and other genomes. It provides a set of components that may be assembled to construct a self-updating database of DNA sequence, mapping data, and annotations of possible genome features. Once one or more remote sources of data for the target region have been identified, all sequences for that region are downloaded, assimilated, and subjected to a (configurable) set of standard database-searching and genome-analysis packages. The results are stored in compressed form in a relational database, and are updated automatically on a regular schedule so that they are always immediately available in their most up-to-date versions. A Java front-end, executed as a stand alone application or web applet, provides a graphical interface for navigating the database and for viewing the annotations. There are facilities for importing and exporting data in the format of the Distributed Annotation System (DAS), enabling a GANESH database to be used as a component of a DAS configuration. The system has been used to construct databases for about a dozen regions of human chromosomes and for three regions of mouse chromosomes.
A high-performance spatial database based approach for pathology imaging algorithm evaluation
Wang, Fusheng; Kong, Jun; Gao, Jingjing; Cooper, Lee A.D.; Kurc, Tahsin; Zhou, Zhengwen; Adler, David; Vergara-Niedermayr, Cristobal; Katigbak, Bryan; Brat, Daniel J.; Saltz, Joel H.
2013-01-01
Background: Algorithm evaluation provides a means to characterize variability across image analysis algorithms, validate algorithms by comparison with human annotations, combine results from multiple algorithms for performance improvement, and facilitate algorithm sensitivity studies. The sizes of images and image analysis results in pathology image analysis pose significant challenges in algorithm evaluation. We present an efficient parallel spatial database approach to model, normalize, manage, and query large volumes of analytical image result data. This provides an efficient platform for algorithm evaluation. Our experiments with a set of brain tumor images demonstrate the application, scalability, and effectiveness of the platform. Context: The paper describes an approach and platform for evaluation of pathology image analysis algorithms. The platform facilitates algorithm evaluation through a high-performance database built on the Pathology Analytic Imaging Standards (PAIS) data model. Aims: (1) Develop a framework to support algorithm evaluation by modeling and managing analytical results and human annotations from pathology images; (2) Create a robust data normalization tool for converting, validating, and fixing spatial data from algorithm or human annotations; (3) Develop a set of queries to support data sampling and result comparisons; (4) Achieve high performance computation capacity via a parallel data management infrastructure, parallel data loading and spatial indexing optimizations in this infrastructure. Materials and Methods: We have considered two scenarios for algorithm evaluation: (1) algorithm comparison where multiple result sets from different methods are compared and consolidated; and (2) algorithm validation where algorithm results are compared with human annotations. We have developed a spatial normalization toolkit to validate and normalize spatial boundaries produced by image analysis algorithms or human annotations. The validated data were formatted based on the PAIS data model and loaded into a spatial database. To support efficient data loading, we have implemented a parallel data loading tool that takes advantage of multi-core CPUs to accelerate data injection. The spatial database manages both geometric shapes and image features or classifications, and enables spatial sampling, result comparison, and result aggregation through expressive structured query language (SQL) queries with spatial extensions. To provide scalable and efficient query support, we have employed a shared nothing parallel database architecture, which distributes data homogenously across multiple database partitions to take advantage of parallel computation power and implements spatial indexing to achieve high I/O throughput. Results: Our work proposes a high performance, parallel spatial database platform for algorithm validation and comparison. This platform was evaluated by storing, managing, and comparing analysis results from a set of brain tumor whole slide images. The tools we develop are open source and available to download. Conclusions: Pathology image algorithm validation and comparison are essential to iterative algorithm development and refinement. One critical component is the support for queries involving spatial predicates and comparisons. In our work, we develop an efficient data model and parallel database approach to model, normalize, manage and query large volumes of analytical image result data. 
Our experiments demonstrate that the data partitioning strategy and the grid-based indexing result in good data distribution across database nodes and reduce I/O overhead in spatial join queries through parallel retrieval of relevant data and quick subsetting of datasets. The set of tools in the framework provide a full pipeline to normalize, load, manage and query analytical results for algorithm evaluation. PMID:23599905
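The grid-based spatial indexing idea is easy to picture: map each markup's bounding box to coarse grid cells and only compare shapes that share a cell. The sketch below is a toy illustration with an arbitrary cell size and invented coordinates, not the platform's actual partition-aware implementation.

```python
def grid_cells(minx, miny, maxx, maxy, cell=1024):
    """Grid-partition a markup's bounding box; the cell IDs act as a
    coarse spatial index so joins only compare shapes in shared cells."""
    return {(cx, cy)
            for cx in range(int(minx) // cell, int(maxx) // cell + 1)
            for cy in range(int(miny) // cell, int(maxy) // cell + 1)}

# Two nuclear boundaries are join candidates only if their cell sets intersect.
a = grid_cells(100, 100, 300, 250)
b = grid_cells(2900, 100, 3100, 220)
print(bool(a & b))   # False: the expensive geometric comparison is skipped
```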
Does an Otolaryngology-Specific Database Have Added Value? A Comparative Feasibility Analysis.
Bellmunt, Angela M; Roberts, Rhonda; Lee, Walter T; Schulz, Kris; Pynnonen, Melissa A; Crowson, Matthew G; Witsell, David; Parham, Kourosh; Langman, Alan; Vambutas, Andrea; Ryan, Sheila E; Shin, Jennifer J
2016-07-01
There are multiple nationally representative databases that support epidemiologic and outcomes research, and it is unknown whether an otolaryngology-specific resource would prove indispensable or superfluous. Therefore, our objective was to determine the feasibility of analyses in the National Ambulatory Medical Care Survey (NAMCS) and National Hospital Ambulatory Medical Care Survey (NHAMCS) databases as compared with the otolaryngology-specific Creating Healthcare Excellence through Education and Research (CHEER) database. Parallel analyses in 2 data sets. Ambulatory visits in the United States. To test a fixed hypothesis that could be directly compared between data sets, we focused on a condition with expected prevalence high enough to substantiate availability in both. This query also encompassed a broad span of diagnoses to sample the breadth of available information. Specifically, we compared an assessment of suspected risk factors for sensorineural hearing loss in subjects 0 to 21 years of age, according to a predetermined protocol. We also assessed the feasibility of 6 additional diagnostic queries among all age groups. In the NAMCS/NHAMCS data set, the number of measured observations was not sufficient to support reliable numeric conclusions (percentage standard error among risk factors: 38.6-92.1). Analysis of the CHEER database demonstrated that age, sex, meningitis, and cytomegalovirus were statistically significant factors associated with pediatric sensorineural hearing loss (P < .01). Among the 6 additional diagnostic queries assessed, NAMCS/NHAMCS usage was also infeasible; the CHEER database contained 1585 to 212,521 more observations per annum. An otolaryngology-specific database has added utility when compared with already available national ambulatory databases.
Kong, Xiangxing; Li, Jun; Cai, Yibo; Tian, Yu; Chi, Shengqiang; Tong, Danyang; Hu, Yeting; Yang, Qi; Li, Jingsong; Poston, Graeme; Yuan, Ying; Ding, Kefeng
2018-01-08
This study set out to revise the American Joint Committee on Cancer TNM staging system for colorectal cancer (CRC) based on a nomogram analysis of the Surveillance, Epidemiology, and End Results (SEER) database, and to test the rationale for increasing the weight of T stage in our previously proposed T-plus staging system. A total of 115,377 non-metastatic CRC patients from SEER were randomly split 1:1 into training and testing sets. The Nomo-staging system was established via three nomograms based on logistic regression analyses of 1-year, 2-year and 3-year disease-specific survival (DSS) in the training set. Its predictive value for the testing set was evaluated by the concordance index (c-index), likelihood ratio (L.R.) and Akaike information criterion (AIC) for 1-year, 2-year and 3-year overall survival (OS) and DSS. Kaplan-Meier survival curves were used to evaluate discrimination and gradient monotonicity, and external validation was performed on a database from the Second Affiliated Hospital of Zhejiang University (SAHZU). In the Nomo-staging system, patients with T1-2N1 and T1N2a disease were classified into stage II, while T4N0 patients were classified into stage III. Kaplan-Meier survival curves of OS and DSS in the testing set showed that the Nomo-staging system performed better in discrimination and gradient monotonicity, and the external validation on the SAHZU database also showed distinctly better discrimination. The Nomo-staging system showed higher L.R. and c-index values and lower AIC values when predicting OS and DSS in the testing set. Overall, the Nomo-staging system performed better in prognosis prediction, and the weight given to lymph node status in prognosis prediction should be cautiously reconsidered.
PROFESS: a PROtein Function, Evolution, Structure and Sequence database
Triplet, Thomas; Shortridge, Matthew D.; Griep, Mark A.; Stark, Jaime L.; Powers, Robert; Revesz, Peter
2010-01-01
The proliferation of biological databases and the easy access enabled by the Internet are having a beneficial impact on biological sciences and transforming the way research is conducted. There are ∼1100 molecular biology databases dispersed throughout the Internet. To assist in the functional, structural and evolutionary analysis of the abundance of novel proteins continually identified by whole-genome sequencing, we introduce the PROFESS (PROtein Function, Evolution, Structure and Sequence) database. Our database is designed to be versatile and expandable and will not confine analysis to a pre-existing set of data relationships. A fundamental component of this approach is the development of an intuitive query system that incorporates a variety of similarity functions capable of generating data relationships not conceived during the creation of the database. The utility of PROFESS is demonstrated by the analysis of the structural drift of homologous proteins and the identification of potential pancreatic cancer therapeutic targets based on the observation of protein–protein interaction networks. Database URL: http://cse.unl.edu/∼profess/ PMID:20624718
Use of large electronic health record databases for environmental epidemiology studies.
Background: Electronic health records (EHRs) are a ubiquitous component of the United States healthcare system and capture nearly all data collected in a clinic or hospital setting. EHR databases are attractive for secondary data analysis as they may contain detailed clinical rec...
LOLAweb: a containerized web server for interactive genomic locus overlap enrichment analysis.
Nagraj, V P; Magee, Neal E; Sheffield, Nathan C
2018-06-06
The past few years have seen an explosion of interest in understanding the role of regulatory DNA. This interest has driven large-scale production of functional genomics data and analytical methods. One popular analysis is to test for enrichment of overlaps between a query set of genomic regions and a database of region sets. In this way, new genomic data can be easily connected to annotations from external data sources. Here, we present an interactive interface for enrichment analysis of genomic locus overlaps using a web server called LOLAweb. LOLAweb accepts a set of genomic ranges from the user and tests it for enrichment against a database of region sets. LOLAweb renders results in an R Shiny application to provide interactive visualization features, enabling users to filter, sort, and explore enrichment results dynamically. LOLAweb is built and deployed in a Linux container, making it scalable to many concurrent users on our servers and also enabling users to download and run LOLAweb locally.
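The core statistical operation behind such a tool — testing a query region set for enrichment of overlaps against one database region set — can be illustrated with a Fisher's exact test on overlap counts. The sketch below is a simplified stand-in, not LOLAweb's implementation; in LOLA the contingency table is built over a user-supplied region universe, and the universe_size and db_hits values here are made-up numbers.

    import scipy.stats as stats

    def overlaps(a, b):
        # Count regions in list a that overlap any region in list b.
        # Regions are (chrom, start, end) tuples; naive O(n*m) scan.
        hits = 0
        for (ca, sa, ea) in a:
            if any(ca == cb and sa < eb and sb < ea for (cb, sb, eb) in b):
                hits += 1
        return hits

    query = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 50, 150)]
    dbset = [("chr1", 150, 250), ("chr2", 400, 500)]
    universe_size = 1000   # total regions in the assumed universe
    db_hits = 40           # universe regions overlapping the DB set (assumed)

    support = overlaps(query, dbset)
    table = [[support, len(query) - support],
             [db_hits - support, universe_size - len(query) - db_hits + support]]
    odds, p = stats.fisher_exact(table, alternative="greater")
    print(f"overlap={support}, odds ratio={odds:.2f}, p={p:.3g}")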
MIPS: a database for genomes and protein sequences
Mewes, H. W.; Frishman, D.; Güldener, U.; Mannhaupt, G.; Mayer, K.; Mokrejs, M.; Morgenstern, B.; Münsterkötter, M.; Rudd, S.; Weil, B.
2002-01-01
The Munich Information Center for Protein Sequences (MIPS-GSF, Neuherberg, Germany) continues to provide genome-related information in a systematic way. MIPS supports both national and European sequencing and functional analysis projects, develops and maintains automatically generated and manually annotated genome-specific databases, develops systematic classification schemes for the functional annotation of protein sequences, and provides tools for the comprehensive analysis of protein sequences. This report updates the information on the yeast genome (CYGD), the Neurospora crassa genome (MNCDB), the databases for the comprehensive set of genomes (PEDANT genomes), the database of annotated human EST clusters (HIB), the database of complete cDNAs from the DHGP (German Human Genome Project), as well as the project-specific databases for the GABI (Genome Analysis in Plants) and HNB (Helmholtz–Netzwerk Bioinformatik) networks. The Arabidopsis thaliana database (MATDB), the database of mitochondrial proteins (MITOP) and our contribution to the PIR International Protein Sequence Database have been described elsewhere [Schoof et al. (2002) Nucleic Acids Res., 30, 91–93; Scharfe et al. (2000) Nucleic Acids Res., 28, 155–158; Barker et al. (2001) Nucleic Acids Res., 29, 29–32]. All databases described, the protein analysis tools provided and the detailed descriptions of our projects can be accessed through the MIPS World Wide Web server (http://mips.gsf.de). PMID:11752246
Aßmann, C
2016-06-01
Besides the large effort involved in field work, the provision of valid databases requires statistical and informational infrastructure to enable long-term access to longitudinal data sets on height, weight and related issues. To foster use of longitudinal data sets within the scientific community, the provision of valid databases also has to address data-protection regulations. It is therefore of major importance to prevent the identification of individuals from publicly available databases. One possible strategy to reach this goal is to provide a synthetic database to the public, allowing users to pretest strategies for data analysis. Such synthetic databases can be established using multiple imputation tools. Once the strategy is approved, verification is based on the original data. Multiple imputation by chained equations is illustrated as a way to provide synthetic databases, as it captures a wide range of statistical interdependencies. Missing values, which typically occur within longitudinal databases for reasons of item non-response, can also be addressed via multiple imputation when providing databases. The provision of synthetic databases using multiple imputation techniques is one possible strategy to ensure data protection, increase the visibility of longitudinal databases and enhance their analytical potential.
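As a rough illustration of the chained-equations idea behind synthetic-database release, the sketch below masks half of the cells of a toy height/weight data set and refills them with scikit-learn's IterativeImputer, a MICE-style imputer that regresses each variable on the others in turn. A production disclosure-protection pipeline would draw multiple synthetic replicates and handle many more variables; everything here is illustrative.

    import numpy as np
    from sklearn.experimental import enable_iterative_imputer  # noqa: F401
    from sklearn.impute import IterativeImputer

    rng = np.random.default_rng(42)
    n = 500
    height = rng.normal(170, 10, n)
    weight = 0.9 * height - 90 + rng.normal(0, 8, n)
    data = np.column_stack([height, weight])

    # Mask half the cells, then impute them with posterior draws so the
    # released values are synthetic rather than observed.
    synthetic = data.copy()
    mask = rng.random(synthetic.shape) < 0.5
    synthetic[mask] = np.nan
    imputer = IterativeImputer(sample_posterior=True, random_state=0)
    synthetic = imputer.fit_transform(synthetic)
    print("original corr: ", np.corrcoef(data.T)[0, 1].round(2))
    print("synthetic corr:", np.corrcoef(synthetic.T)[0, 1].round(2))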
ISAAC - InterSpecies Analysing Application using Containers.
Baier, Herbert; Schultz, Jörg
2014-01-15
Information about genes, transcripts and proteins is spread over a wide variety of databases. Different tools have been developed using these databases to identify biological signals in gene lists from large-scale analyses. Mostly, they search for enrichments of specific features. However, these tools do not allow an explorative walk through different views or changes to the gene lists as new questions come up. To fill this niche, we have developed ISAAC, the InterSpecies Analysing Application using Containers. The central idea of this web-based tool is to enable the analysis of sets of genes, transcripts and proteins under different biological viewpoints and to interactively modify these sets at any point of the analysis. Detailed history and snapshot information allows tracing of each action. Furthermore, one can easily switch back to previous states and perform new analyses. Currently, sets can be viewed in the context of genomes, protein functions, protein interactions, pathways, regulation, diseases and drugs. Additionally, users can switch between species with an automatic, orthology-based translation of existing gene sets. As today's research is usually performed in larger teams and consortia, ISAAC provides group-based functionalities. Here, sets as well as results of analyses can be exchanged between members of groups. ISAAC fills the gap between primary databases and tools for the analysis of large gene lists. With its highly modular, JavaEE-based design, the implementation of new modules is straightforward. Furthermore, ISAAC comes with an extensive web-based administration interface, including tools for the integration of third-party data. Thus, a local installation is easily feasible. In summary, ISAAC is tailor-made for highly explorative, interactive analyses of gene, transcript and protein sets in a collaborative environment.
MIPS: analysis and annotation of proteins from whole genomes
Mewes, H. W.; Amid, C.; Arnold, R.; Frishman, D.; Güldener, U.; Mannhaupt, G.; Münsterkötter, M.; Pagel, P.; Strack, N.; Stümpflen, V.; Warfsmann, J.; Ruepp, A.
2004-01-01
The Munich Information Center for Protein Sequences (MIPS-GSF), Neuherberg, Germany, provides protein sequence-related information based on whole-genome analysis. The main focus of the work is directed toward the systematic organization of sequence-related attributes gathered by a variety of algorithms, together with primary information from experimental data and information compiled from the scientific literature. MIPS maintains automatically generated and manually annotated genome-specific databases, develops systematic classification schemes for the functional annotation of protein sequences and provides tools for the comprehensive analysis of protein sequences. This report updates the information on the yeast genome (CYGD), the Neurospora crassa genome (MNCDB), the database of complete cDNAs (German Human Genome Project, NGFN), the database of mammalian protein–protein interactions (MPPI), the database of FASTA homologies (SIMAP), and the interface for the fast retrieval of protein-associated information (QUIPOS). The Arabidopsis thaliana database, the rice database, the plant EST databases (MATDB, MOsDB, SPUTNIK), as well as the databases for the comprehensive set of genomes (PEDANT genomes) are described elsewhere in the 2003 and 2004 NAR database issues, respectively. All databases described, and the detailed descriptions of our projects can be accessed through the MIPS web server (http://mips.gsf.de). PMID:14681354
NASA Technical Reports Server (NTRS)
Smith, Andrew; LaVerde, Bruce; Hunt, Ron; Fulcher, Clay; Towner, Robert; McDonald, Emmett
2012-01-01
The design and theoretical basis of a new database tool that quickly generates vibroacoustic response estimates using a library of transfer functions (TFs) is discussed. During the early stages of a launch vehicle development program, these response estimates can be used to provide vibration environment specifications to hardware vendors. The tool accesses TFs from a database, combines the TFs, and multiplies these by input excitations to estimate vibration responses. The database is populated with two sets of uncoupled TFs; the first set represents the vibration response of a bare panel, designated H^s, and the second set represents the response of the free-free component equipment by itself, designated H^c. For a particular configuration undergoing analysis, the appropriate H^s and H^c are selected and coupled to generate an integrated TF, designated H^(s+c). This integrated TF is then used with the appropriate input excitations to estimate vibration responses. This simple yet powerful tool enables a user to estimate vibration responses without directly using finite element models, so long as suitable H^s and H^c sets are defined in the database libraries. The paper discusses the preparation of the database tool and provides the assumptions and methodologies necessary to combine H^s and H^c sets into an integrated H^(s+c). An experimental validation of the approach is also presented.
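The final step of the tool — multiplying an integrated transfer function by an input excitation — follows the standard single-input relation for power spectral densities, S_out(f) = |H(f)|^2 S_in(f). The sketch below applies that relation with a toy H^(s+c) and a flat input PSD; the coupling of H^s and H^c itself is the subject of the paper and is not reproduced here, and all numbers are illustrative.

    import numpy as np

    freqs = np.linspace(20, 2000, 100)           # Hz
    H_sc = 1.0 / (1.0 + 1j * freqs / 400.0)      # toy integrated TF H^(s+c)
    S_in = np.full_like(freqs, 0.01)             # input acceleration PSD (g^2/Hz)
    S_out = np.abs(H_sc) ** 2 * S_in             # response PSD

    # Overall response level: trapezoidal integration of the response PSD.
    df = np.diff(freqs)
    grms = np.sqrt(np.sum(0.5 * (S_out[1:] + S_out[:-1]) * df))
    print(f"response ~ {grms:.3f} grms")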
Expert system development for commonality analysis in space programs
NASA Technical Reports Server (NTRS)
Yeager, Dorian P.
1987-01-01
This report is a combination of foundational mathematics and software design. A mathematical model of the Commonality Analysis problem was developed and some important properties discovered. The complexity of the problem is described herein and techniques, both deterministic and heuristic, for reducing that complexity are presented. Weaknesses are pointed out in the existing software (System Commonality Analysis Tool) and several improvements are recommended. It is recommended that: (1) an expert system for guiding the design of new databases be developed; (2) a distributed knowledge base be created and maintained for the purpose of encoding the commonality relationships between design items in commonality databases; (3) a software module be produced which automatically generates commonality alternative sets from commonality databases using the knowledge associated with those databases; and (4) a more complete commonality analysis module be written which is capable of generating any type of feasible solution.
Ho, Lap; Cheng, Haoxiang; Wang, Jun; Simon, James E; Wu, Qingli; Zhao, Danyue; Carry, Eileen; Ferruzzi, Mario G; Faith, Jeremiah; Valcarcel, Breanna; Hao, Ke; Pasinetti, Giulio M
2018-03-05
The development of a given botanical preparation for eventual clinical application requires extensive, detailed characterization of its chemical composition, as well as its biological availability, biological activity, and safety profiles. These issues are typically addressed using diverse experimental protocols and model systems. Based on this consideration, in this study we established a comprehensive database and analysis framework for the collection, collation, and integrative analysis of diverse, multiscale data sets. Using this framework, we conducted an integrative analysis of heterogeneous data from in vivo and in vitro investigations of a complex bioactive dietary polyphenol-rich preparation (BDPP) and built an integrated network linking data sets generated from this multitude of diverse experimental paradigms. We established a comprehensive database and analysis framework as well as a systematic and logical means to catalogue and collate the diverse array of information gathered, which is securely stored and added to in a standardized manner to enable fast queries. We demonstrated the utility of the database in (1) a statistical ranking scheme to prioritize responses to treatments and (2) in-depth reconstruction of functionality studies. By examination of these data sets, the system allows analytical querying of heterogeneous data and access to information related to interactions, mechanisms of action, functions, etc., which ultimately provides a global overview of complex biological responses. Collectively, we present an integrative analysis framework that leads to novel insights on the biological activities of a complex botanical such as BDPP, based on data-driven characterization of interactions between BDPP-derived phenolic metabolites and their mechanisms of action, as well as synergism and/or potential cancellation of biological functions. Our integrative analytical approach provides novel means for a systematic integrative analysis of heterogeneous data types in the development of complex botanicals such as polyphenols for eventual clinical and translational applications.
CREDO: a structural interactomics database for drug discovery
Schreyer, Adrian M.; Blundell, Tom L.
2013-01-01
CREDO is a unique relational database storing all pairwise atomic interactions of inter- as well as intra-molecular contacts between small molecules and macromolecules found in experimentally determined structures from the Protein Data Bank. These interactions are integrated with further chemical and biological data. The database implements useful data structures and algorithms such as cheminformatics routines to create a comprehensive analysis platform for drug discovery. The database can be accessed through a web-based interface, downloads of data sets and web services at http://www-cryst.bioc.cam.ac.uk/credo. Database URL: http://www-cryst.bioc.cam.ac.uk/credo PMID:23868908
Costa, Raquel L; Gadelha, Luiz; Ribeiro-Alves, Marcelo; Porto, Fábio
2017-01-01
There are many steps in analyzing transcriptome data, from the acquisition of raw data to the selection of a subset of representative genes that explain a scientific hypothesis. The data produced can be represented as networks of interactions among genes, and these may additionally be integrated with other biological databases, such as protein-protein interactions, transcription factors and gene annotation. However, the results of these analyses remain fragmented, imposing difficulties either for later inspection of results or for meta-analysis through the incorporation of new related data. Integrating databases and tools into scientific workflows, orchestrating their execution, and managing the resulting data and its respective metadata are challenging tasks. Additionally, a great amount of effort is equally required to run in silico experiments to structure and compose the information as needed for analysis. Different programs may need to be applied and different files are produced during the experiment cycle. In this context, the availability of a platform supporting experiment execution is paramount. We present GeNNet, an integrated transcriptome analysis platform that unifies scientific workflows with graph databases for selecting relevant genes according to the evaluated biological systems. It includes GeNNet-Wf, a scientific workflow that pre-loads biological data, pre-processes raw microarray data and conducts a series of analyses including normalization, differential expression inference, clustering and gene set enrichment analysis. A user-friendly web interface, GeNNet-Web, allows for setting parameters, executing, and visualizing the results of GeNNet-Wf executions. To demonstrate the features of GeNNet, we performed case studies with data retrieved from GEO, particularly using a single-factor experiment in different analysis scenarios. As a result, we obtained differentially expressed genes for which biological functions were analyzed. The results are integrated into GeNNet-DB, a database about genes, clusters, experiments and their properties and relationships. The resulting graph database is explored with queries that demonstrate the expressiveness of this data model for reasoning about gene interaction networks. GeNNet is the first platform to integrate the analytical process of transcriptome data with graph databases. It provides a comprehensive set of tools that would otherwise be challenging for non-expert users to install and use. Developers can add new functionality to components of GeNNet. The derived data allow for testing previous hypotheses about an experiment and exploring new ones through the interactive graph database environment. It enables the analysis of human, rhesus monkey, mouse and rat data from Affymetrix platforms. GeNNet is available as an open source platform at https://github.com/raquele/GeNNet and can be retrieved as a software container with the command docker pull quelopes/gennet. PMID:28695067
Improving the Scalability of an Exact Approach for Frequent Item Set Hiding
ERIC Educational Resources Information Center
LaMacchia, Carolyn
2013-01-01
Technological advances have led to the generation of large databases of organizational data recognized as an information-rich, strategic asset for internal analysis and sharing with trading partners. Data mining techniques can discover patterns in large databases including relationships considered strategically relevant to the owner of the data.…
Chen, Honglin; Wang, Lixia; Liu, Xiaoyan; Hu, Liangliang; Wang, Suhua; Cheng, Xuzhen
2017-07-11
Cowpea [Vigna unguiculata (L.) Walp.] is one of the most important legumes in tropical and semi-arid regions. However, there is relatively little genomic information available for genetic research on and breeding of cowpea. The objectives of this study were to analyse the cowpea transcriptome and develop genic molecular markers for future genetic studies of this genus. Approximately 54 million high-quality cDNA sequence reads were obtained from cowpea based on Illumina paired-end sequencing technology and were de novo assembled to generate 47,899 unigenes with an N50 length of 1534 bp. Sequence similarity analysis revealed 36,289 unigenes (75.8%) with significant similarity to known proteins in the non-redundant (Nr) protein database, 23,471 unigenes (49.0%) with BLAST hits in the Swiss-Prot database, and 20,654 unigenes (43.1%) with high similarity in the Kyoto Encyclopedia of Genes and Genomes (KEGG) database. Further analysis identified 5560 simple sequence repeats (SSRs) as potential genic molecular markers. Validating a random set of 500 SSR markers yielded 54 polymorphic markers among 32 cowpea accessions. This transcriptomic analysis of cowpea provided a valuable set of genomic data for characterizing genes with important agronomic traits in Vigna unguiculata and a new set of genic SSR markers for further genetic studies and breeding in cowpea and related Vigna species.
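SSR mining of the kind described above boils down to scanning unigene sequences for short motifs repeated in tandem. The sketch below does this with regular expressions; the motif lengths and repeat thresholds are illustrative, not the study's actual criteria.

    import re

    def find_ssrs(seq, min_repeats=None):
        # Report simple sequence repeats: by default, di-nucleotide motifs
        # repeated >= 6 times and tri-nucleotide motifs repeated >= 5 times.
        if min_repeats is None:
            min_repeats = {2: 6, 3: 5}
        hits = []
        for unit, n in min_repeats.items():
            for m in re.finditer(rf"(([ACGT]{{{unit}}})\2{{{n - 1},}})", seq):
                hits.append((m.start(), m.group(2), len(m.group(1)) // unit))
        return hits

    unigene = "ATGC" + "AG" * 8 + "TTACG" + "CTT" * 6 + "GGA"
    for pos, motif, count in find_ssrs(unigene):
        print(f"pos {pos}: ({motif})x{count}")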
dBBQs: dataBase of Bacterial Quality scores.
Wanchai, Visanu; Patumcharoenpol, Preecha; Nookaew, Intawat; Ussery, David
2017-12-28
It is well known that genome sequencing technologies are becoming significantly cheaper and faster. As a result, the exponential growth of sequencing data in public databases allows us to explore ever-growing collections of genome sequences. However, it is less well known that the majority of sequenced genomes in public databases are not complete, but rather drafts of varying quality. We have calculated quality scores for around 100,000 bacterial genomes from all major genome repositories and put them in a fast and easy-to-use database. Prokaryotic genomic data from all sources were collected and combined to make a non-redundant set of bacterial genomes. The genome quality score for each was calculated from four different measurements: assembly quality, number of rRNA genes, number of tRNA genes, and the occurrence of conserved functional domains. The dataBase of Bacterial Quality scores (dBBQs) was designed to store and retrieve quality scores. It offers fast searching and download features, and the results can be used for further analysis. In addition, the search results are shown in an interactive JavaScript chart framework using DC.js. The analysis of quality scores across major public genome databases finds that around 68% of the genomes are of acceptable quality for many uses. dBBQs (available at http://arc-gem.uams.edu/dbbqs) provides genome quality scores for all available prokaryotic genome sequences with a user-friendly web interface. These scores can be used as cut-offs to obtain a high-quality set of genomes for testing bioinformatics tools or improving analyses. Moreover, dBBQs stores the data for all four measurements that are combined to make each genome's quality score, which can potentially be used for further analysis. dBBQs will be updated regularly and is free to use for non-commercial purposes.
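The exact dBBQs scoring formula is not given in the abstract, but a score combining the four named measurements might look like the hedged sketch below, where the normalizations, thresholds and equal weights are assumptions of this illustration only.

    # Combine assembly quality, rRNA count, tRNA count and conserved-domain
    # coverage into a 0-100 score. All cut-offs are illustrative.
    def quality_score(n50, n_contigs, rrna, trna, domain_frac):
        assembly = min(n50 / 100_000, 1.0) * (1.0 if n_contigs < 500 else 0.5)
        rrna_s = min(rrna / 3, 1.0)    # expect at least one each of 5S/16S/23S
        trna_s = min(trna / 20, 1.0)   # expect at least ~20 distinct tRNAs
        return 100 * (assembly + rrna_s + trna_s + domain_frac) / 4

    print(quality_score(n50=250_000, n_contigs=80, rrna=3, trna=42, domain_frac=0.95))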
SALAD database: a motif-based database of protein annotations for plant comparative genomics
Mihara, Motohiro; Itoh, Takeshi; Izawa, Takeshi
2010-01-01
Proteins often have several motifs with distinct evolutionary histories. Proteins with similar motifs have similar biochemical properties and thus related biological functions. We constructed a unique comparative genomics database termed the SALAD database (http://salad.dna.affrc.go.jp/salad/) from plant-genome-based proteome data sets. We extracted evolutionarily conserved motifs by MEME software from 209 529 protein-sequence annotation groups selected by BLASTP from the proteome data sets of 10 species: rice, sorghum, Arabidopsis thaliana, grape, a lycophyte, a moss, 3 algae, and yeast. Similarity clustering of each protein group was performed by pairwise scoring of the motif patterns of the sequences. The SALAD database provides a user-friendly graphical viewer that displays a motif pattern diagram linked to the resulting bootstrapped dendrogram for each protein group. Amino-acid-sequence-based and nucleotide-sequence-based phylogenetic trees for motif combination alignment, a logo comparison diagram for each clade in the tree, and a Pfam-domain pattern diagram are also available. We also developed a viewer named ‘SALAD on ARRAYs’ to view arbitrary microarray data sets of paralogous genes linked to the same dendrogram in a window. The SALAD database is a powerful tool for comparing protein sequences and can provide valuable hints for biological analysis. PMID:19854933
Bagger, Frederik Otzen; Sasivarevic, Damir; Sohi, Sina Hadi; Laursen, Linea Gøricke; Pundhir, Sachin; Sønderby, Casper Kaae; Winther, Ole; Rapin, Nicolas; Porse, Bo T.
2016-01-01
Research on human and murine haematopoiesis has resulted in a vast number of gene-expression data sets that can potentially answer questions regarding normal and aberrant blood formation. To researchers and clinicians with limited bioinformatics experience, these data have remained available, yet largely inaccessible. Current databases provide information about gene-expression but fail to answer key questions regarding co-regulation, genetic programs or effect on patient survival. To address these shortcomings, we present BloodSpot (www.bloodspot.eu), which includes and greatly extends our previously released database HemaExplorer, a database of gene expression profiles from FACS-sorted healthy and malignant haematopoietic cells. A revised interactive interface simultaneously provides a plot of gene expression along with a Kaplan–Meier analysis and a hierarchical tree depicting the relationship between different cell types in the database. The database now includes 23 high-quality curated data sets relevant to normal and malignant blood formation and, in addition, we have assembled and built a unique integrated data set, BloodPool. BloodPool contains more than 2000 samples assembled from six independent studies on acute myeloid leukemia. Furthermore, we have devised a robust sample integration procedure that allows for sensitive comparison of user-supplied patient samples in a well-defined haematopoietic cellular space. PMID:26507857
Space Launch System Ascent Static Aerodynamic Database Development
NASA Technical Reports Server (NTRS)
Pinier, Jeremy T.; Bennett, David W.; Blevins, John A.; Erickson, Gary E.; Favaregh, Noah M.; Houlden, Heather P.; Tomek, William G.
2014-01-01
This paper describes the wind tunnel testing work and data analysis required to characterize the static aerodynamic environment of the ascent portion of flight for NASA's Space Launch System (SLS). Scaled models of the SLS have been tested in transonic and supersonic wind tunnels to gather the high-fidelity data used to build aerodynamic databases. A detailed description of the wind tunnel test that was conducted to produce the latest version of the database is presented, and a representative set of aerodynamic data is shown. The wind tunnel data quality remains very high; however, some concerns with wall interference effects at transonic Mach numbers are also discussed. Post-processing and analysis of the wind tunnel data set are crucial for the development of a formal ascent aerodynamics database.
Tcof1-Related Molecular Networks in Treacher Collins Syndrome.
Dai, Jiewen; Si, Jiawen; Wang, Minjiao; Huang, Li; Fang, Bing; Shi, Jun; Wang, Xudong; Shen, Guofang
2016-09-01
Treacher Collins syndrome (TCS) is a rare, autosomal-dominant disorder characterized by craniofacial deformities and primarily caused by mutations in the Tcof1 gene. This article aimed to provide a comprehensive literature review and systematic bioinformatic analysis of Tcof1-related molecular networks in TCS. First, the up- and down-regulated genes in Tcof1 heterozygous haploinsufficient mutant mouse embryos and in Tcof1 knockdown and Tcof1 over-expressing neuroblastoma N1E-115 cells were obtained from the Gene Expression Omnibus database. The GeneDecks database was used to identify the 500 genes most closely related to Tcof1. Then, the relationships between the 4 gene sets (a predicted set and sets comparing the wildtype with the 3 Gene Expression Omnibus datasets) were analyzed using the DAVID, GeneMANIA and STRING databases. The analysis showed that the Tcof1-related genes were enriched in various biological processes, including cell proliferation, apoptosis, cell cycle, differentiation, and migration. They were also enriched in several signaling pathways, such as the ribosome, p53, cell cycle, and WNT signaling pathways. Additionally, these genes had direct or indirect interactions with Tcof1 and with each other. The literature review and bioinformatic analysis imply that special attention should be given to these pathways, as they may offer target points for TCS therapies.
Catlin, Ann Christine; Fernando, Sumudinie; Gamage, Ruwan; Renner, Lorna; Antwi, Sampson; Tettey, Jonas Kusah; Amisah, Kofi Aikins; Kyriakides, Tassos; Cong, Xiangyu; Reynolds, Nancy R; Paintsil, Elijah
2015-01-01
Prevalence of pediatric HIV disclosure is low in resource-limited settings. Innovative, culturally sensitive, and patient-centered disclosure approaches are needed. Conducting such studies in resource-limited settings is not trivial considering the challenges of capturing, cleaning, and storing clinical research data. To overcome some of these challenges, the Sankofa pediatric disclosure intervention adopted an interactive cyberinfrastructure for data capture and analysis. The Sankofa Project database system is built on the HUBzero cyberinfrastructure (https://hubzero.org), an open source software platform. The hub database components support: (1) data management - the "databases" component creates, configures, and manages database access, backup, repositories, applications, and access control; (2) data collection - the "forms" component is used to build customized web case report forms that incorporate common data elements and include tailored form submit processing to handle error checking, data validation, and data linkage as the data are stored to the database; and (3) data exploration - the "dataviewer" component provides powerful methods for users to view, search, sort, navigate, explore, map, graph, visualize, aggregate, drill-down, compute, and export data from the database. The Sankofa cyber data management tool supports a user-friendly, secure, and systematic collection of all data. We have screened more than 400 child-caregiver dyads and enrolled nearly 300 dyads, with tens of thousands of data elements. The dataviews have successfully supported all data exploration and analysis needs of the Sankofa Project. Moreover, the ability of the sites to query and view data summaries has proven to be an incentive for collecting complete and accurate data. The data system has all the desirable attributes of an electronic data capture tool. It also provides an added advantage of building data management capacity in resource-limited settings due to its innovative data query and summary views and availability of real-time support by the data management team.
Strategies to explore functional genomics data sets in NCBI's GEO database.
Wilhite, Stephen E; Barrett, Tanya
2012-01-01
The Gene Expression Omnibus (GEO) database is a major repository that stores high-throughput functional genomics data sets that are generated using both microarray-based and sequence-based technologies. Data sets are submitted to GEO primarily by researchers who are publishing their results in journals that require original data to be made freely available for review and analysis. In addition to serving as a public archive for these data, GEO has a suite of tools that allow users to identify, analyze, and visualize data relevant to their specific interests. These tools include sample comparison applications, gene expression profile charts, data set clusters, genome browser tracks, and a powerful search engine that enables users to construct complex queries.
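Alongside GEO's own web tools, the repository can also be queried programmatically through NCBI E-utilities. The sketch below uses Biopython's Entrez module against the 'gds' database (which indexes GEO records); the search term, including the gse[Entry Type] filter, is just an example query.

    from Bio import Entrez

    Entrez.email = "you@example.org"  # NCBI asks for a contact address
    # 'gds' is the Entrez database indexing GEO DataSets/Series/Samples.
    handle = Entrez.esearch(db="gds", term="breast cancer AND gse[Entry Type]",
                            retmax=5)
    record = Entrez.read(handle)
    handle.close()
    print(record["Count"], "matching records; first IDs:", record["IdList"])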
Weinreb, Jeffrey H; Yoshida, Ryu; Cote, Mark P; O'Sullivan, Michael B; Mazzocca, Augustus D
2017-01-01
The purpose of this study was to evaluate how database use has changed over time in Arthroscopy: The Journal of Arthroscopic and Related Surgery and to inform readers about available databases used in orthopaedic literature. An extensive literature search was conducted to identify databases used in Arthroscopy and other orthopaedic literature. All articles published in Arthroscopy between January 1, 2006, and December 31, 2015, were reviewed. A database was defined as a national, widely available set of individual patient encounters, applicable to multiple patient populations, used in orthopaedic research in a peer-reviewed journal, not restricted by encounter setting or visit duration, and with information available in English. Databases used in Arthroscopy included PearlDiver, the American College of Surgeons National Surgical Quality Improvement Program, the Danish Common Orthopaedic Database, the Swedish National Knee Ligament Register, the Hospital Episodes Statistics database, and the National Inpatient Sample. Database use increased significantly from 4 articles in 2013 to 11 articles in 2015 (P = .012), with no database use between January 1, 2006, and December 31, 2012. Database use increased significantly between January 1, 2006, and December 31, 2015, in Arthroscopy. Level IV, systematic review of Level II through IV studies.
Dhanasekaran, A Ranjitha; Pearson, Jon L; Ganesan, Balasubramanian; Weimer, Bart C
2015-02-25
Mass spectrometric analysis of microbial metabolism provides a long list of possible compounds. Restricting the identification of the possible compounds to those produced by the specific organism would benefit the identification process. Currently, identification of mass spectrometry (MS) data is commonly done using empirically derived compound databases. Unfortunately, most databases contain relatively few compounds, leaving long lists of unidentified molecules. Incorporating genome-encoded metabolism enables MS output identification that may not be included in databases. Using an organism's genome as a database restricts metabolite identification to only those compounds that the organism can produce. To address the challenge of metabolomic analysis from MS data, a web-based application to directly search genome-constructed metabolic databases was developed. The user query returns a genome-restricted list of possible compound identifications along with the putative metabolic pathways based on the name, formula, SMILES structure, and the compound mass as defined by the user. Multiple queries can be done simultaneously by submitting a text file created by the user or obtained from the MS analysis software. The user can also provide parameters specific to the experiment's MS analysis conditions, such as mass deviation, adducts, and detection mode during the query so as to provide additional levels of evidence to produce the tentative identification. The query results are provided as an HTML page and downloadable text file of possible compounds that are restricted to a specific genome. Hyperlinks provided in the HTML file connect the user to the curated metabolic databases housed in ProCyc, a Pathway Tools platform, as well as the KEGG Pathway database for visualization and metabolic pathway analysis. Metabolome Searcher, a web-based tool, facilitates putative compound identification of MS output based on genome-restricted metabolic capability. This enables researchers to rapidly extend the possible identifications of large data sets for metabolites that are not in compound databases. Putative compound names with their associated metabolic pathways from metabolomics data sets are returned to the user for additional biological interpretation and visualization. This novel approach enables compound identification by restricting the possible masses to those encoded in the genome.
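The heart of such a query is matching an observed m/z, corrected for an adduct, against genome-restricted compound masses within a ppm tolerance. The sketch below shows that matching step only; the compound list, adduct table and tolerance are illustrative, not Metabolome Searcher's internals.

    # Common adduct mass shifts (proton/sodium, electron-corrected).
    ADDUCT_SHIFT = {"[M+H]+": 1.00728, "[M-H]-": -1.00728, "[M+Na]+": 22.98922}

    # Monoisotopic masses of compounds the genome can in principle produce.
    compounds = [("lactate", 90.03169), ("pyruvate", 88.01604), ("alanine", 89.04768)]

    def match_mass(mz, adduct, ppm=10.0):
        hits = []
        for name, mono in compounds:
            expected = mono + ADDUCT_SHIFT[adduct]
            if abs(mz - expected) / expected * 1e6 <= ppm:
                hits.append((name, round(expected, 5)))
        return hits

    print(match_mass(91.0390, "[M+H]+"))   # -> lactate, within ~1 ppm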
Bias-Free Chemically Diverse Test Sets from Machine Learning.
Swann, Ellen T; Fernandez, Michael; Coote, Michelle L; Barnard, Amanda S
2017-08-14
Current benchmarking methods in quantum chemistry rely on databases that are built using a chemist's intuition. It is not fully understood how diverse or representative these databases truly are. Multivariate statistical techniques like archetypal analysis and K-means clustering have previously been used to summarize large sets of nanoparticles; however, molecules are more diverse and not as easily characterized by descriptors. In this work, we compare three sets of descriptors based on the one-, two-, and three-dimensional structure of a molecule. Using data from the NIST Computational Chemistry Comparison and Benchmark Database and machine learning techniques, we demonstrate the functional relationship between these structural descriptors and the electronic energy of molecules. Archetypes and prototypes found with topological or Coulomb matrix descriptors can be used to identify smaller, statistically significant test sets that better capture the diversity of chemical space. We apply the same method to find a diverse subset of organic molecules to demonstrate how these methods can easily be reapplied to individual research projects. Finally, we use our bias-free test sets to assess the performance of density functional theory and quantum Monte Carlo methods.
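The prototype idea can be sketched as follows: cluster molecules in descriptor space with K-means and keep the molecule nearest each centroid as a small, diverse test set. Random descriptors stand in for the paper's topological or Coulomb matrix features, and the cluster count is arbitrary.

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import pairwise_distances_argmin

    rng = np.random.default_rng(7)
    X = rng.normal(size=(500, 20))   # 500 molecules x 20 stand-in descriptors

    # Ten clusters -> a ten-molecule test set spread across descriptor space.
    km = KMeans(n_clusters=10, n_init=10, random_state=0).fit(X)
    prototypes = pairwise_distances_argmin(km.cluster_centers_, X)
    print("indices of a 10-molecule diverse test set:", prototypes)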
AphasiaBank: a resource for clinicians.
Forbes, Margaret M; Fromm, Davida; Macwhinney, Brian
2012-08-01
AphasiaBank is a shared, multimedia database containing videos and transcriptions of ~180 aphasic individuals and 140 nonaphasic controls performing a uniform set of discourse tasks. The language in the videos is transcribed in Codes for the Human Analysis of Transcripts (CHAT) format and coded for analysis with Computerized Language ANalysis (CLAN) programs, which can perform a wide variety of language analyses. The database and the CLAN programs are freely available to aphasia researchers and clinicians for educational, clinical, and scholarly uses. This article describes the database, suggests some ways in which clinicians and clinician researchers might find these materials useful, and introduces a new language analysis program, EVAL, designed to streamline the transcription and coding processes, while still producing an extensive and useful language profile.
Bonfill, Xavier; Osorio, Dimelza; Solà, Ivan; Pijoan, Jose Ignacio; Balasso, Valentina; Quintana, Maria Jesús; Puig, Teresa; Bolibar, Ignasi; Urrútia, Gerard; Zamora, Javier; Emparanza, José Ignacio; Gómez de la Cámara, Agustín; Ferreira-González, Ignacio
2016-01-01
To describe the development of a novel on-line database that serves as a source of information on healthcare interventions appraised for their clinical value and appropriateness by several initiatives worldwide, and to present a retrospective analysis of the appraisals already included in the database. Database development and a retrospective analysis. The database, DianaHealth.com, is already on-line and is regularly updated, independent, open access, and available in English and Spanish. Initiatives are identified in medical news, in article references, and by contacting experts in the field. We include appraisals in the form of clinical recommendations, expert analyses, conclusions from systematic reviews, and original research that label any healthcare intervention as low-value or inappropriate. We obtain the information necessary to classify the appraisals according to type of intervention, specialties involved, publication year, authoring initiative, and key words. The database is accessible through a search engine which retrieves a list of appraisals and a link to the website where they were published. DianaHealth.com also provides a brief description of the initiatives and a section where users can report new appraisals or suggest new initiatives. From January 2014 to July 2015, the on-line database included 2940 appraisals from 22 initiatives: eleven campaigns gathering clinical recommendations from scientific societies, five sets of conclusions from literature reviews, three sets of recommendations from guidelines, two collections of articles on low clinical value in medical journals, and an initiative of our own. We have developed an open-access on-line database of appraisals about healthcare interventions considered of low clinical value or inappropriate. DianaHealth.com could help physicians and other stakeholders make better decisions concerning patient care and healthcare system sustainability. Future efforts should focus on assessing the impact of these appraisals on clinical practice.
Rule-Based Statistical Calculations on a Database Abstract.
1983-06-01
The report covers methods for rule-based statistical calculations on a database abstract, including their use in distributed systems, easy extensions, and treating the database abstract itself as a database. For example, the largest item in the intersection of two sets cannot be any larger than the minima of the maxima of the two sets for some numeric attribute. Further inferences follow from "range analysis" of arbitrary numeric attributes: suppose the length range of tankers is from 300 to 1000 feet and that of American ships 50 to …
Improved Data Analysis Tools for the Thermal Emission Spectrometer
NASA Astrophysics Data System (ADS)
Rodriguez, K.; Laura, J.; Fergason, R.; Bogle, R.
2017-06-01
We plan to stand up three different database systems to test a new datastore for MGS TES data, enabling more accessible tools that support high-throughput analysis of this high-dimensionality hyperspectral data set.
Arntzen, Magnus Ø; Thiede, Bernd
2012-02-01
Apoptosis is the most commonly described form of programmed cell death, and its dysfunction is implicated in a large number of human diseases. Many quantitative proteome analyses of apoptosis have been performed to gain insight into the proteins involved in the process. This has resulted in large and complex data sets that are difficult to evaluate. Therefore, we developed the ApoptoProteomics database for storage, browsing, and analysis of the outcome of large-scale proteome analyses of apoptosis derived from human, mouse, and rat. The proteomics data of 52 publications were integrated and unified with protein annotations from UniProt-KB, the caspase substrate database homepage (CASBAH), and gene ontology. Currently, more than 2300 records of more than 1500 unique proteins are included, covering a large proportion of the core signaling pathways of apoptosis. Analysis of the data set revealed a high level of agreement between the changes in directionality reported in proteomics studies and expected apoptosis-related function, and it may disclose proteins without a currently recognized involvement in apoptosis based on gene ontology. Comparison between induction of apoptosis by the intrinsic and the extrinsic apoptotic signaling pathway revealed slight differences. Furthermore, proteomics has significantly contributed to the field of apoptosis in identifying hundreds of caspase substrates. The database is available at http://apoptoproteomics.uio.no.
Unique effects of setting goals on behavior change: Systematic review and meta-analysis.
Epton, Tracy; Currie, Sinead; Armitage, Christopher J
2017-12-01
Goal setting is a common feature of behavior change interventions, but it is unclear when goal setting is optimally effective. The aims of this systematic review and meta-analysis were to evaluate: (a) the unique effects of goal setting on behavior change, and (b) under what circumstances and for whom goal setting works best. Four databases were searched for articles that assessed the unique effects of goal setting on behavior change using randomized controlled trials. One hundred forty-one papers were identified, from which 384 effect sizes (N = 16,523) were extracted and analyzed. A moderator analysis of sample characteristics, intervention characteristics, inclusion of other behavior change techniques, study design and delivery, quality of study, outcome measures, and behavior targeted was conducted. A random effects model indicated a small positive unique effect of goal setting across a range of behaviors, d = .34 (CI [.28, .41]). Moderator analyses indicated that goal setting was particularly effective if the goal was: (a) difficult, (b) set publicly, and (c) a group goal. There was weaker evidence that goal setting was more effective when paired with external monitoring of the behavior/outcome by others without feedback, and when delivered face-to-face. Goal setting is an effective behavior change technique that has the potential to be considered a fundamental component of successful interventions. The present review adds novel insights into the means by which goal setting might be augmented to maximize behavior change and sets the agenda for future programs of research.
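For readers unfamiliar with the pooling step, the sketch below shows a DerSimonian-Laird random-effects meta-analysis of the kind reported above, computing a pooled d and its 95% CI from per-study effect sizes and variances. The input numbers are invented, not the review's data.

    import numpy as np

    d = np.array([0.20, 0.45, 0.31, 0.55, 0.12])   # per-study effect sizes
    v = np.array([0.02, 0.05, 0.03, 0.04, 0.02])   # their sampling variances

    w = 1 / v                                       # fixed-effect weights
    d_fe = np.sum(w * d) / np.sum(w)
    Q = np.sum(w * (d - d_fe) ** 2)                 # heterogeneity statistic
    tau2 = max(0.0, (Q - (len(d) - 1)) / (np.sum(w) - np.sum(w**2) / np.sum(w)))
    w_re = 1 / (v + tau2)                           # random-effects weights
    d_re = np.sum(w_re * d) / np.sum(w_re)
    se = np.sqrt(1 / np.sum(w_re))
    print(f"pooled d = {d_re:.2f}, 95% CI [{d_re - 1.96*se:.2f}, {d_re + 1.96*se:.2f}]")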
Family Support in Nursing Homes Serving Residents with a Mental Health History
ERIC Educational Resources Information Center
Frahm, Kathryn; Gammonley, Denise; Zhang, Ning Jackie; Paek, Seung Chun
2010-01-01
Using 2003 nursing home data from the Minimum Data Set (MDS) database, this study investigated the role of family support among nursing homes serving residents with a mental health history. Exploratory factor analysis was used to create and test a conceptual model of family support using indicators located within the MDS database. Families were…
Computer Administering of the Psychological Investigations: Set-Relational Representation
NASA Astrophysics Data System (ADS)
Yordzhev, Krasimir
Computer administering of a psychological investigation is the computer representation of the entire procedure of psychological assessment - test construction, test implementation, results evaluation, storage and maintenance of the developed database, and its statistical processing, analysis and interpretation. A mathematical description of psychological assessment with the aid of personality tests is discussed in this article. Set theory and relational algebra are used in this description. A relational model of the data needed to design a computer system for automating certain psychological assessments is given. Some finite sets, and relations on them, which are necessary for creating a personality psychological test, are described. The described model could be used to develop real software for computer administering of any psychological test, with full automation of the whole process: test construction, test implementation, result evaluation, storage of the developed database, statistical processing, analysis and interpretation. A software project for computer administering personality psychological tests is suggested.
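To make the set-relational formalism concrete, here is a minimal Python sketch (an illustration of the idea only, not the author's software): a test is modeled as a finite set of items, a finite set of scales, and a relation between them, from which scale scores are evaluated. All names and values are invented.

# Illustrative set-relational model of a personality test (toy data).
ITEMS = {1, 2, 3, 4}                      # question identifiers
SCALES = {"extraversion", "neuroticism"}  # psychological scales

# Relation R, a subset of ITEMS x SCALES: which item scores on which scale.
R = {(1, "extraversion"), (2, "extraversion"),
     (3, "neuroticism"), (4, "neuroticism")}

def score(answers):
    # Evaluate raw scale scores from item answers via the relation R.
    return {s: sum(answers[q] for (q, t) in R if t == s and q in answers)
            for s in SCALES}

print(score({1: 3, 2: 2, 3: 1, 4: 4}))  # {'extraversion': 5, 'neuroticism': 5}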
Using SQL Databases for Sequence Similarity Searching and Analysis.
Pearson, William R; Mackey, Aaron J
2017-09-13
Relational databases can integrate diverse types of information and manage large sets of similarity search results, greatly simplifying genome-scale analyses. By focusing on taxonomic subsets of sequences, relational databases can reduce the size and redundancy of sequence libraries and improve the statistical significance of homologs. In addition, by loading similarity search results into a relational database, it becomes possible to explore and summarize the relationships between all of the proteins in an organism and those in other biological kingdoms. This unit describes how to use relational databases to improve the efficiency of sequence similarity searching and demonstrates various large-scale genomic analyses of homology-related data. It also describes the installation and use of a simple protein sequence database, seqdb_demo, which is used as a basis for the other protocols. The unit also introduces search_demo, a database that stores sequence similarity search results. The search_demo database is then used to explore the evolutionary relationships between E. coli proteins and proteins in other organisms in a large-scale comparative genomic analysis. © 2017 by John Wiley & Sons, Inc.
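To illustrate the approach, the following Python sketch loads toy similarity-search hits into SQLite and summarizes the best hit per organism; the accessions are invented, and the unit's actual seqdb_demo and search_demo schemas differ from this minimal table.

# Hedged sketch: similarity-search results in a relational database.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE hits (query TEXT, subject TEXT, evalue REAL, organism TEXT);
INSERT INTO hits VALUES
  ('QUERY1', 'SUBJ_A', 1e-40, 'Homo sapiens'),
  ('QUERY1', 'SUBJ_B', 1e-5,  'Saccharomyces cerevisiae'),
  ('QUERY1', 'SUBJ_C', 0.5,   'Escherichia coli');
""")

# Per query, report the best (lowest e-value) hit in each organism --
# the kind of cross-kingdom roll-up the unit describes.
for row in con.execute("""
        SELECT query, organism, MIN(evalue) FROM hits
        GROUP BY query, organism ORDER BY MIN(evalue)"""):
    print(row)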
Analysis of high accuracy, quantitative proteomics data in the MaxQB database.
Schaab, Christoph; Geiger, Tamar; Stoehr, Gabriele; Cox, Juergen; Mann, Matthias
2012-03-01
MS-based proteomics generates rapidly increasing amounts of precise and quantitative information. Analysis of individual proteomic experiments has made great strides, but the crucial ability to compare and store information across different proteome measurements still presents many challenges. For example, it has been difficult to avoid contamination of databases with low-quality peptide identifications, to control for the inflation in false positive identifications when combining data sets, and to integrate quantitative data. Although the contamination with low-quality identifications has been addressed by joint analysis of deposited raw data in some public repositories, we reasoned that there should be a role for a database specifically designed for high resolution and quantitative data. Here we describe a novel database termed MaxQB that stores and displays collections of large proteomics projects and allows joint analysis and comparison. We demonstrate the analysis tools of MaxQB using proteome data of 11 different human cell lines and 28 mouse tissues. The database-wide false discovery rate is controlled by adjusting the project specific cutoff scores for the combined data sets. The 11 cell line proteomes together identify proteins expressed from more than half of all human genes. For each protein of interest, expression levels estimated by label-free quantification can be visualized across the cell lines. Similarly, the expression rank order and estimated amount of each protein within each proteome are plotted. We used MaxQB to calculate the signal reproducibility of the detected peptides for the same proteins across different proteomes. Spearman rank correlation between peptide intensity and detection probability of identified proteins was greater than 0.8 for 64% of the proteome, whereas a minority of proteins had negative correlations. This information can be used to pinpoint false protein identifications, independently of peptide database scores. The information contained in MaxQB, including high resolution fragment spectra, is accessible to the community via a user-friendly web interface at http://www.biochem.mpg.de/maxqb.
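The reproducibility analysis can be pictured with a small Python sketch; the numbers below are toy values, not MaxQB data, and the per-protein computation is simplified to a single protein.

# Illustrative only: Spearman correlation between peptide intensity and how
# often each peptide of a protein is detected across proteomes (toy numbers).
from scipy.stats import spearmanr

intensity = [2.1e6, 8.3e5, 5.0e5, 1.2e5, 4.0e4]   # median peptide intensities
detection_freq = [1.0, 0.9, 0.7, 0.4, 0.1]         # fraction of runs with an ID

rho, pval = spearmanr(intensity, detection_freq)
print(f"Spearman rho = {rho:.2f} (p = {pval:.3f})")
# Per the abstract, proteins whose peptides show negative correlations are
# candidates for closer scrutiny as possible false identifications.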
SINEBase: a database and tool for SINE analysis
Vassetzky, Nikita S.; Kramerov, Dmitri A.
2013-01-01
SINEBase (http://sines.eimb.ru) integrates the revisited body of knowledge about short interspersed elements (SINEs). A set of formal definitions concerning SINEs was introduced. All available sequence data were screened through these definitions and the genetic elements misidentified as SINEs were discarded. As a result, 175 SINE families have been recognized in animals, flowering plants and green algae. These families were classified by the modular structure of their nucleotide sequences and the frequencies of different patterns were evaluated. These data formed the basis for the database of SINEs. The SINEBase website can be used in two ways: first, to explore the database of SINE families, and second, to analyse candidate SINE sequences using specifically developed tools. This article presents an overview of the database and the process of SINE identification and analysis. PMID:23203982
An Analysis Platform for Mobile Ad Hoc Network (MANET) Scenario Execution Log Data
2016-01-01
these technologies. Backend technologies: Java 1.8, mysql-connector-java-5.0.8.jar, Tomcat, VirtualBox, Kali MANET Virtual Machine. Frontend technologies: LAMPP. Database: MySQL Server. The SEDAP database settings and structure are described in this section. … contains all the backend Java functionality, including the web services, and should be placed in the webapps directory inside the Tomcat installation.
Mackey, Aaron J; Pearson, William R
2004-10-01
Relational databases are designed to integrate diverse types of information and manage large sets of search results, greatly simplifying genome-scale analyses. Relational databases are essential for management and analysis of large-scale sequence analyses, and can also be used to improve the statistical significance of similarity searches by focusing on subsets of sequence libraries most likely to contain homologs. This unit describes using relational databases to improve the efficiency of sequence similarity searching and to demonstrate various large-scale genomic analyses of homology-related data. It also describes the installation and use of a simple protein sequence database, seqdb_demo, which is used as a basis for the other protocols. These protocols include basic use of the database to generate a novel sequence library subset, extending seqdb_demo to store sequence similarity search results, and using various kinds of stored search results to address aspects of comparative genomic analysis.
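A hedged Python sketch of the subset-library idea follows; the table and column names are assumptions for illustration, not the real seqdb_demo schema, and the sequences are toy strings.

# Sketch: derive a taxon-restricted FASTA library from a relational database,
# the "subset library" that narrows search space and sharpens statistics.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE taxon (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE protein (accession TEXT, seq TEXT, taxon_id INTEGER);
INSERT INTO taxon VALUES (1, 'Escherichia coli'), (2, 'Homo sapiens');
INSERT INTO protein VALUES ('ECP1', 'MKTAYIAK', 1), ('HSP1', 'MADQLTEE', 2);
""")

with open("ecoli_subset.fasta", "w") as out:
    for acc, seq in con.execute("""
            SELECT p.accession, p.seq FROM protein p
            JOIN taxon t ON p.taxon_id = t.id WHERE t.name = ?""",
            ("Escherichia coli",)):
        out.write(f">{acc}\n{seq}\n")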
Database resources of the National Center for Biotechnology Information: 2002 update
Wheeler, David L.; Church, Deanna M.; Lash, Alex E.; Leipe, Detlef D.; Madden, Thomas L.; Pontius, Joan U.; Schuler, Gregory D.; Schriml, Lynn M.; Tatusova, Tatiana A.; Wagner, Lukas; Rapp, Barbara A.
2002-01-01
In addition to maintaining the GenBank nucleic acid sequence database, the National Center for Biotechnology Information (NCBI) provides data analysis and retrieval resources that operate on the data in GenBank and a variety of other biological data made available through NCBI’s web site. NCBI data retrieval resources include Entrez, PubMed, LocusLink and the Taxonomy Browser. Data analysis resources include BLAST, Electronic PCR, OrfFinder, RefSeq, UniGene, HomoloGene, Database of Single Nucleotide Polymorphisms (dbSNP), Human Genome Sequencing, Human MapViewer, Human–Mouse Homology Map, Cancer Chromosome Aberration Project (CCAP), Entrez Genomes, Clusters of Orthologous Groups (COGs) database, Retroviral Genotyping Tools, SAGEmap, Gene Expression Omnibus (GEO), Online Mendelian Inheritance in Man (OMIM), the Molecular Modeling Database (MMDB) and the Conserved Domain Database (CDD). Augmenting many of the web applications are custom implementations of the BLAST program optimized to search specialized data sets. All of the resources can be accessed through the NCBI home page at http://www.ncbi.nlm.nih.gov. PMID:11752242
Bagger, Frederik Otzen; Sasivarevic, Damir; Sohi, Sina Hadi; Laursen, Linea Gøricke; Pundhir, Sachin; Sønderby, Casper Kaae; Winther, Ole; Rapin, Nicolas; Porse, Bo T
2016-01-04
Research on human and murine haematopoiesis has resulted in a vast number of gene-expression data sets that can potentially answer questions regarding normal and aberrant blood formation. To researchers and clinicians with limited bioinformatics experience, these data have remained available, yet largely inaccessible. Current databases provide information about gene-expression but fail to answer key questions regarding co-regulation, genetic programs or effect on patient survival. To address these shortcomings, we present BloodSpot (www.bloodspot.eu), which includes and greatly extends our previously released database HemaExplorer, a database of gene expression profiles from FACS-sorted healthy and malignant haematopoietic cells. A revised interactive interface simultaneously provides a plot of gene expression along with a Kaplan-Meier analysis and a hierarchical tree depicting the relationship between different cell types in the database. The database now includes 23 high-quality curated data sets relevant to normal and malignant blood formation and, in addition, we have assembled and built a unique integrated data set, BloodPool. BloodPool contains more than 2000 samples assembled from six independent studies on acute myeloid leukemia. Furthermore, we have devised a robust sample integration procedure that allows for sensitive comparison of user-supplied patient samples in a well-defined haematopoietic cellular space. © The Author(s) 2015. Published by Oxford University Press on behalf of Nucleic Acids Research.
Bate, A; Lindquist, M; Edwards, I R
2008-04-01
After market launch, new information on adverse effects of medicinal products is almost exclusively first highlighted by spontaneous reporting. As data sets of spontaneous reports have become larger, and computational capability has increased, quantitative methods have been increasingly applied to such data sets. The screening of such data sets is an application of knowledge discovery in databases (KDD). Effective KDD is an iterative and interactive process made up of the following steps: developing an understanding of an application domain, creating a target data set, data cleaning and pre-processing, data reduction and projection, choosing the data mining task, choosing the data mining algorithm, data mining, interpretation of results and consolidating and using acquired knowledge. The process of KDD as it applies to the analysis of spontaneous reports can be exemplified by its routine use on the 3.5 million suspected adverse drug reaction (ADR) reports in the WHO ADR database. Examples of new adverse effects first highlighted by the KDD process on WHO data include topiramate glaucoma, infliximab vasculitis and the association of selective serotonin reuptake inhibitors (SSRIs) and neonatal convulsions. The KDD process has already improved our ability to highlight previously unsuspected ADRs for clinical review in spontaneous reporting, and we anticipate that such techniques will be increasingly used in the successful screening of other healthcare data sets such as patient records in the future.
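The quantitative screening step can be illustrated in Python. The WHO centre's routine measure is Bayesian (the information component); the simpler proportional reporting ratio (PRR) below stands in for the general idea, and all counts are toy values.

# Illustration of quantitative signal screening on spontaneous reports.
def prr(a: int, b: int, c: int, d: int) -> float:
    """2x2 table: a = drug & event, b = drug & other events,
    c = other drugs & event, d = other drugs & other events."""
    return (a / (a + b)) / (c / (c + d))

# Toy counts for one drug-event pair against the rest of the database.
print(f"PRR = {prr(30, 970, 200, 98800):.1f}")   # ~14.9, flagged for review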
LFQuant: a label-free fast quantitative analysis tool for high-resolution LC-MS/MS proteomics data.
Zhang, Wei; Zhang, Jiyang; Xu, Changming; Li, Ning; Liu, Hui; Ma, Jie; Zhu, Yunping; Xie, Hongwei
2012-12-01
Database searching based methods for label-free quantification aim to reconstruct the peptide extracted ion chromatogram based on the identification information, which can limit the search space and thus make the data processing much faster. The random effect of the MS/MS sampling can be remedied by cross-assignment among different runs. Here, we present a new label-free fast quantitative analysis tool, LFQuant, for high-resolution LC-MS/MS proteomics data based on database searching. It is designed to accept raw data in two common formats (mzXML and Thermo RAW), and database search results from mainstream tools (MASCOT, SEQUEST, and X!Tandem), as input data. LFQuant can handle large-scale label-free data with fractionation such as SDS-PAGE and 2D LC. It is easy to use and provides handy user interfaces for data loading, parameter setting, quantitative analysis, and quantitative data visualization. LFQuant was compared with two common quantification software packages, MaxQuant and IDEAL-Q, on the replication data set and the UPS1 standard data set. The results show that LFQuant performs better than them in terms of both precision and accuracy, and consumes significantly less processing time. LFQuant is freely available under the GNU General Public License v3.0 at http://sourceforge.net/projects/lfquant/. © 2012 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.
Qiao, Yuanhua; Keren, Nir; Mannan, M Sam
2009-08-15
Risk assessment and management of transportation of hazardous materials (HazMat) require the estimation of accident frequency. This paper presents a methodology to estimate hazardous materials transportation accident frequency by utilizing publicly available databases and expert knowledge. The estimation process addresses route-dependent and route-independent variables. Negative binomial regression is applied to an analysis of the Department of Public Safety (DPS) accident database to derive basic accident frequency as a function of route-dependent variables, while the effects of route-independent variables are modeled by fuzzy logic. The integrated methodology provides the basis for an overall transportation risk analysis, which can be used later to develop a decision support system.
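A minimal Python sketch of the route-dependent step follows, fitting a negative binomial regression to synthetic accident counts with statsmodels; the predictor names, coefficients, and dispersion are invented, not the DPS analysis.

# Sketch: negative binomial regression of accident counts on route attributes.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 500
traffic = rng.uniform(1, 10, n)      # e.g. traffic volume, in thousands
length = rng.uniform(0.5, 20, n)     # route segment length, miles

mu = np.exp(-3.0 + 0.15 * traffic + 0.08 * length)   # true mean counts
counts = rng.negative_binomial(n=2.0, p=2.0 / (2.0 + mu))  # overdispersed

X = sm.add_constant(np.column_stack([traffic, length]))
model = sm.GLM(counts, X, family=sm.families.NegativeBinomial(alpha=0.5))
print(model.fit().summary())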
Wang, Qi; Zhao, Xiao-Juan; Wang, Zi-Wei; Liu, Li; Wei, Yong-Xin; Han, Xiao; Zeng, Jing; Liao, Wan-Jin
2017-08-01
Rapid and precise identification of Cronobacter species is important for foodborne pathogen detection, however, commercial biochemical methods can only identify Cronobacter strains to genus level in most cases. To evaluate the power of mass spectrometry based on matrix-assisted laser desorption/ionization time-of-flight (MALDI-TOF MS) for Cronobacter species identification, 51 Cronobacter strains (eight reference and 43 wild strains) were identified by both MALDI-TOF MS and 16S rRNA gene sequencing. Biotyper RTC provided by Bruker identified all eight reference and 43 wild strains as Cronobacter species, which demonstrated the power of MALDI-TOF MS to identify Cronobacter strains to genus level. However, using the Bruker's database (6903 main spectra products) and Biotyper software, the MALDI-TOF MS analysis could not identify the investigated strains to species level. When MALDI-TOF MS analysis was performed using the combined in-house Cronobacter database and Bruker's database, bin setting, and unweighted pair group method with arithmetic mean (UPGMA) clustering, all the 51 strains were clearly identified into six Cronobacter species and the identification accuracy increased from 60% to 100%. We demonstrated that MALDI-TOF MS was reliable and easy-to-use for Cronobacter species identification and highlighted the importance of establishing a reliable database and improving the current data analysis methods by integrating the bin setting and UPGMA clustering. Copyright © 2017. Published by Elsevier B.V.
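The bin-setting and UPGMA steps can be sketched in Python; the bin range, bin width, and toy peak lists below are assumptions for illustration, not the authors' settings.

# Hedged sketch: bin MALDI-TOF peak lists into fixed m/z bins, then cluster
# the spectra with UPGMA (average linkage).
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

def binned(peaks, lo=2000, hi=12000, width=10):
    # 0/1 vector: which m/z bins contain at least one peak.
    vec = np.zeros((hi - lo) // width)
    for mz in peaks:
        if lo <= mz < hi:
            vec[int((mz - lo) // width)] = 1
    return vec

spectra = {"strain_A": [3005, 4410, 6220], "strain_B": [3008, 4410, 7100],
           "strain_C": [5120, 8340, 9990]}
mat = np.array([binned(p) for p in spectra.values()])
Z = linkage(pdist(mat, metric="jaccard"), method="average")  # UPGMA
print(Z)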
Uchiyama, Ikuo; Mihara, Motohiro; Nishide, Hiroyo; Chiba, Hirokazu
2015-01-01
The microbial genome database for comparative analysis (MBGD) (available at http://mbgd.genome.ad.jp/) is a comprehensive ortholog database for flexible comparative analysis of microbial genomes, where the users are allowed to create an ortholog table among any specified set of organisms. Because of the rapid increase in microbial genome data owing to the next-generation sequencing technology, it becomes increasingly challenging to maintain high-quality orthology relationships while allowing the users to incorporate the latest genomic data available into an analysis. Because many of the recently accumulating genomic data are draft genome sequences for which some complete genome sequences of the same or closely related species are available, MBGD now stores draft genome data and allows the users to incorporate them into a user-specific ortholog database using the MyMBGD functionality. In this function, draft genome data are incorporated into an existing ortholog table created only from the complete genome data in an incremental manner to prevent low-quality draft data from affecting clustering results. In addition, to provide high-quality orthology relationships, the standard ortholog table containing all the representative genomes, which is first created by the rapid classification program DomClust, is now refined using DomRefine, a recently developed program for improving domain-level clustering using multiple sequence alignment information. © The Author(s) 2014. Published by Oxford University Press on behalf of Nucleic Acids Research.
A Public Database of Memory and Naive B-Cell Receptor Sequences.
DeWitt, William S; Lindau, Paul; Snyder, Thomas M; Sherwood, Anna M; Vignali, Marissa; Carlson, Christopher S; Greenberg, Philip D; Duerkopp, Natalie; Emerson, Ryan O; Robins, Harlan S
2016-01-01
The vast diversity of B-cell receptors (BCR) and secreted antibodies enables the recognition of, and response to, a wide range of epitopes, but this diversity has also limited our understanding of humoral immunity. We present a public database of more than 37 million unique BCR sequences from three healthy adult donors that is many fold deeper than any existing resource, together with a set of online tools designed to facilitate the visualization and analysis of the annotated data. We estimate the clonal diversity of the naive and memory B-cell repertoires of healthy individuals, and provide a set of examples that illustrate the utility of the database, including several views of the basic properties of immunoglobulin heavy chain sequences, such as rearrangement length, subunit usage, and somatic hypermutation positions and dynamics.
IMPROVED SEARCH OF PRINCIPAL COMPONENT ANALYSIS DATABASES FOR SPECTRO-POLARIMETRIC INVERSION
DOE Office of Scientific and Technical Information (OSTI.GOV)
Casini, R.; Lites, B. W.; Ramos, A. Asensio
2013-08-20
We describe a simple technique for the acceleration of spectro-polarimetric inversions based on principal component analysis (PCA) of Stokes profiles. This technique involves the indexing of the database models based on the sign of the projections (PCA coefficients) of the first few relevant orders of principal components of the four Stokes parameters. In this way, each model in the database can be attributed a distinctive binary number of 4n bits, where n is the number of PCA orders used for the indexing. Each of these binary numbers (indices) identifies a group of "compatible" models for the inversion of a given set of observed Stokes profiles sharing the same index. The complete set of the binary numbers so constructed evidently determines a partition of the database. The search of the database for the PCA inversion of spectro-polarimetric data can profit greatly from this indexing. In practical cases it becomes possible to approach the ideal acceleration factor of 2^(4n) as compared to the systematic search of a non-indexed database for a traditional PCA inversion. This indexing method relies on the existence of a physical meaning in the sign of the PCA coefficients of a model. For this reason, the presence of model ambiguities and of spectro-polarimetric noise in the observations limits in practice the number n of relevant PCA orders that can be used for the indexing.
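A small Python sketch of the indexing scheme: the signs of the first n PCA coefficients of each of the four Stokes parameters form a 4n-bit bucket index. The coefficient values are invented.

# Sketch: sign pattern of PCA coefficients -> integer index in [0, 2**(4n)).
import numpy as np

def pca_index(coeffs: np.ndarray, n: int = 2) -> int:
    # coeffs: shape (4, >=n), PCA projections of Stokes I, Q, U, V.
    bits = (coeffs[:, :n] >= 0).astype(int).ravel()  # 4n sign bits
    return int("".join(map(str, bits)), 2)

# A database partitioned on this index is searched only within the bucket
# matching the observed profile, approaching a 2**(4n) speed-up.
coeffs = np.array([[0.8, -0.1], [-0.3, 0.2], [0.05, 0.4], [-0.9, -0.2]])
print(pca_index(coeffs))  # bucket id for this model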
Analysis of human serum phosphopeptidome by a focused database searching strategy.
Zhu, Jun; Wang, Fangjun; Cheng, Kai; Song, Chunxia; Qin, Hongqiang; Hu, Lianghai; Figeys, Daniel; Ye, Mingliang; Zou, Hanfa
2013-01-14
As human serum is an important source for early diagnosis of many serious diseases, analysis of the serum proteome and peptidome has been extensively performed. However, the serum phosphopeptidome has been less explored, probably because an effective method for database searching was lacking. Conventional database searching strategies always use the whole proteome database, which is very time-consuming for phosphopeptidome searches due to the huge search space resulting from the high redundancy of the database and the setting of dynamic modifications during searching. In this work, a focused database searching strategy using an in-house collected human serum pro-peptidome target/decoy database (HuSPep) was established. It was found that the searching time was significantly decreased without compromising the identification sensitivity. By combining size-selective Ti (IV)-MCM-41 enrichment, RP-RP off-line separation, and complementary CID and ETD fragmentation with the new searching strategy, 143 unique endogenous phosphopeptides and 133 phosphorylation sites (109 novel sites) were identified from human serum with high reliability. Copyright © 2012 Elsevier B.V. All rights reserved.
A BRDF-BPDF database for the analysis of Earth target reflectances
NASA Astrophysics Data System (ADS)
Breon, Francois-Marie; Maignan, Fabienne
2017-01-01
Land surface reflectance is not isotropic. It varies with the observation geometry that is defined by the sun and view zenith angles and the relative azimuth. In addition, the reflectance is linearly polarized. The reflectance anisotropy is quantified by the bidirectional reflectance distribution function (BRDF), while its polarization properties are defined by the bidirectional polarization distribution function (BPDF). The POLDER radiometer that flew onboard the PARASOL microsatellite remains the only space instrument that measured numerous samples of the BRDF and BPDF of Earth targets. Here, we describe a database of representative BRDFs and BPDFs derived from the POLDER measurements. From the huge number of data acquired by the spaceborne instrument over a period of 7 years, we selected a set of targets with high-quality observations. The selection aimed for a large number of observations, free of significant cloud or aerosol contamination, acquired in diverse observation geometries with a focus on the backscatter direction that shows the specific hot spot signature. The targets are sorted according to the 16-class International Geosphere-Biosphere Programme (IGBP) land cover classification system, and the target selection aims at a spatial representativeness within the class. The database thus provides a set of high-quality BRDF and BPDF samples that can be used to assess the typical variability of natural surface reflectances or to evaluate models. It is available freely from the PANGAEA website (doi:10.1594/PANGAEA.864090). In addition to the database, we provide a visualization and analysis tool based on the Interactive Data Language (IDL). It allows an interactive analysis of the measurements and a comparison against various BRDF and BPDF analytical models. The present paper describes the input data, the selection principles, the database format, and the analysis tool.
Ultra-Structure database design methodology for managing systems biology data and analyses
Maier, Christopher W; Long, Jeffrey G; Hemminger, Bradley M; Giddings, Morgan C
2009-01-01
Background Modern, high-throughput biological experiments generate copious, heterogeneous, interconnected data sets. Research is dynamic, with frequently changing protocols, techniques, instruments, and file formats. Because of these factors, systems designed to manage and integrate modern biological data sets often end up as large, unwieldy databases that become difficult to maintain or evolve. The novel rule-based approach of the Ultra-Structure design methodology presents a potential solution to this problem. By representing both data and processes as formal rules within a database, an Ultra-Structure system constitutes a flexible framework that enables users to explicitly store domain knowledge in both a machine- and human-readable form. End users themselves can change the system's capabilities without programmer intervention, simply by altering database contents; no computer code or schemas need be modified. This provides flexibility in adapting to change, and allows integration of disparate, heterogeneous data sets within a small core set of database tables, facilitating joint analysis and visualization without becoming unwieldy. Here, we examine the application of Ultra-Structure to our ongoing research program for the integration of large proteomic and genomic data sets (proteogenomic mapping). Results We transitioned our proteogenomic mapping information system from a traditional entity-relationship design to one based on Ultra-Structure. Our system integrates tandem mass spectrum data, genomic annotation sets, and spectrum/peptide mappings, all within a small, general framework implemented within a standard relational database system. General software procedures driven by user-modifiable rules can perform tasks such as logical deduction and location-based computations. The system is not tied specifically to proteogenomic research, but is rather designed to accommodate virtually any kind of biological research. Conclusion We find Ultra-Structure offers substantial benefits for biological information systems, the largest being the integration of diverse information sources into a common framework. This facilitates systems biology research by integrating data from disparate high-throughput techniques. It also enables us to readily incorporate new data types, sources, and domain knowledge with no change to the database structure or associated computer code. Ultra-Structure may be a significant step towards solving the hard problem of data management and integration in the systems biology era. PMID:19691849
Maiers, M; Gragert, L; Madbouly, A; Steiner, D; Marsh, S G E; Gourraud, P-A; Oudshoorn, M; Zanden, H; Schmidt, A H; Pingel, J; Hofmann, J; Müller, C; Eberhard, H-P
2013-01-01
This project has the goal to validate bioinformatics methods and tools for HLA haplotype frequency analysis specifically addressing unique issues of haematopoietic stem cell registry data sets. In addition to generating new methods and tools for the analysis of registry data sets, the intent is to produce a comprehensive analysis of HLA data from 20 million donors from the Bone Marrow Donors Worldwide (BMDW) database. This report summarizes the activity on this project as of the 16IHIW meeting in Liverpool. PMID:23280139
MBGD update 2013: the microbial genome database for exploring the diversity of microbial world.
Uchiyama, Ikuo; Mihara, Motohiro; Nishide, Hiroyo; Chiba, Hirokazu
2013-01-01
The microbial genome database for comparative analysis (MBGD, available at http://mbgd.genome.ad.jp/) is a platform for microbial genome comparison based on orthology analysis. As its unique feature, MBGD allows users to conduct orthology analysis among any specified set of organisms; this flexibility allows MBGD to adapt to a variety of microbial genomic studies. Reflecting the huge diversity of the microbial world, the number of microbial genome projects has now reached several thousand. To efficiently explore the diversity of the entire body of microbial genomic data, MBGD now provides summary pages for pre-calculated ortholog tables among various taxonomic groups. For some closely related taxa, MBGD also provides conserved synteny information (core genome alignment) pre-calculated using the CoreAligner program. In addition, an efficient incremental updating procedure can create an extended ortholog table by adding genomes to the default ortholog table generated from the representative set of genomes. Combined with the functionality of dynamic orthology calculation for any specified set of organisms, MBGD is an efficient and flexible tool for exploring microbial genome diversity.
Sub-Audible Speech Recognition Based upon Electromyographic Signals
NASA Technical Reports Server (NTRS)
Jorgensen, Charles C. (Inventor); Agabon, Shane T. (Inventor); Lee, Diana D. (Inventor)
2012-01-01
Method and system for processing and identifying a sub-audible signal formed by a source of sub-audible sounds. Sequences of samples of sub-audible sound patterns ("SASPs") for known words/phrases in a selected database are received for overlapping time intervals, and Signal Processing Transforms ("SPTs") are formed for each sample, as part of a matrix of entry values. The matrix is decomposed into contiguous, non-overlapping two-dimensional cells of entries, and neural net analysis is applied to estimate reference sets of weight coefficients that provide sums with optimal matches to reference sets of values. The reference sets of weight coefficients are used to determine a correspondence between a new (unknown) word/phrase and a word/phrase in the database.
Experience with ATLAS MySQL PanDA database service
NASA Astrophysics Data System (ADS)
Smirnov, Y.; Wlodek, T.; De, K.; Hover, J.; Ozturk, N.; Smith, J.; Wenaus, T.; Yu, D.
2010-04-01
The PanDA distributed production and analysis system has been in production use for ATLAS data processing and analysis since late 2005 in the US, and globally throughout ATLAS since early 2008. Its core architecture is based on a set of stateless web services served by Apache and backed by a suite of MySQL databases that are the repository for all PanDA information: active and archival job queues, dataset and file catalogs, site configuration information, monitoring information, system control parameters, and so on. This database system is one of the most critical components of PanDA, and has successfully delivered the functional and scaling performance required by PanDA, currently operating at a scale of half a million jobs per week, with much growth still to come. In this paper we describe the design and implementation of the PanDA database system, its architecture of MySQL servers deployed at BNL and CERN, backup strategy and monitoring tools. The system has been developed, thoroughly tested, and brought to production to provide highly reliable, scalable, flexible and available database services for ATLAS Monte Carlo production, reconstruction and physics analysis.
Catlin, Ann Christine; Fernando, Sumudinie; Gamage, Ruwan; Renner, Lorna; Antwi, Sampson; Tettey, Jonas Kusah; Amisah, Kofi Aikins; Kyriakides, Tassos; Cong, Xiangyu; Reynolds, Nancy R.; Paintsil, Elijah
2015-01-01
Prevalence of pediatric HIV disclosure is low in resource-limited settings. Innovative, culturally sensitive, and patient-centered disclosure approaches are needed. Conducting such studies in resource-limited settings is not trivial considering the challenges of capturing, cleaning, and storing clinical research data. To overcome some of these challenges, the Sankofa pediatric disclosure intervention adopted an interactive cyber infrastructure for data capture and analysis. The Sankofa Project database system is built on the HUBzero cyber infrastructure (https://hubzero.org), an open source software platform. The hub database components support: (1) data management – the “databases” component creates, configures, and manages database access, backup, repositories, applications, and access control; (2) data collection – the “forms” component is used to build customized web case report forms that incorporate common data elements and include tailored form submit processing to handle error checking, data validation, and data linkage as the data are stored to the database; and (3) data exploration – the “dataviewer” component provides powerful methods for users to view, search, sort, navigate, explore, map, graph, visualize, aggregate, drill-down, compute, and export data from the database. The Sankofa cyber data management tool supports a user-friendly, secure, and systematic collection of all data. We have screened more than 400 child–caregiver dyads and enrolled nearly 300 dyads, with tens of thousands of data elements. The dataviews have successfully supported all data exploration and analysis needs of the Sankofa Project. Moreover, the ability of the sites to query and view data summaries has proven to be an incentive for collecting complete and accurate data. The data system has all the desirable attributes of an electronic data capture tool. It also provides an added advantage of building data management capacity in resource-limited settings due to its innovative data query and summary views and availability of real-time support by the data management team. PMID:26616131
Ben-Ari Fuchs, Shani; Lieder, Iris; Stelzer, Gil; Mazor, Yaron; Buzhor, Ella; Kaplan, Sergey; Bogoch, Yoel; Plaschkes, Inbar; Shitrit, Alina; Rappaport, Noa; Kohn, Asher; Edgar, Ron; Shenhav, Liraz; Safran, Marilyn; Lancet, Doron; Guan-Golan, Yaron; Warshawsky, David; Shtrichman, Ronit
2016-03-01
Postgenomics data are produced in large volumes by life sciences and clinical applications of novel omics diagnostics and therapeutics for precision medicine. To move from "data-to-knowledge-to-innovation," a crucial missing step in the current era is, however, our limited understanding of biological and clinical contexts associated with data. Prominent among the emerging remedies to this challenge are the gene set enrichment tools. This study reports on GeneAnalytics™ (geneanalytics.genecards.org), a comprehensive and easy-to-apply gene set analysis tool for rapid contextualization of expression patterns and functional signatures embedded in the postgenomics Big Data domains, such as Next Generation Sequencing (NGS), RNAseq, and microarray experiments. GeneAnalytics' differentiating features include in-depth evidence-based scoring algorithms, an intuitive user interface and proprietary unified data. GeneAnalytics employs LifeMap Science's GeneCards suite, including GeneCards®, the human gene database; MalaCards, the human diseases database; and PathCards, the biological pathways database. Expression-based analysis in GeneAnalytics relies on LifeMap Discovery®, the embryonic development and stem cells database, which includes manually curated expression data for normal and diseased tissues, enabling an advanced matching algorithm for gene-tissue association. This assists in evaluating differentiation protocols and discovering biomarkers for tissues and cells. Results are directly linked to gene, disease, or cell "cards" in the GeneCards suite. Future developments aim to enhance the GeneAnalytics algorithm as well as visualizations, employing varied graphical display items. Such attributes make GeneAnalytics a broadly applicable postgenomics data analysis and interpretation tool for translation of data to knowledge-based innovation in various Big Data fields such as precision medicine, ecogenomics, nutrigenomics, pharmacogenomics, vaccinomics, and others yet to emerge on the postgenomics horizon.
Initial Flight Test Evaluation of the F-15 ACTIVE Axisymmetric Vectoring Nozzle Performance
NASA Technical Reports Server (NTRS)
Orme, John S.; Hathaway, Ross; Ferguson, Michael D.
1998-01-01
A full envelope database of thrust-vectoring axisymmetric nozzle performance for the Pratt & Whitney Pitch/Yaw Balance Beam Nozzle (P/YBBN) is being developed using the F-15 Advanced Control Technology for Integrated Vehicles (ACTIVE) aircraft. At this time, flight research has been completed for steady-state pitch vector angles up to 20° at an altitude of 30,000 ft from low power settings to maximum afterburner power. The nozzle performance database includes vector forces, internal nozzle pressures, and temperatures, all of which can be used for regression analysis modeling. The database was used to substantiate a set of nozzle performance data from wind tunnel testing and computational fluid dynamic analyses. Findings from initial flight research at Mach 0.9 and 1.2 are presented in this paper. The results show that vector efficiency is strongly influenced by power setting. A significant discrepancy in nozzle performance has been discovered between predicted and measured results during vectoring.
Govindaraj, Mahalingam
2015-01-01
The number of sequenced crop genomes and associated genomic resources is growing rapidly with the advent of inexpensive next generation sequencing methods. Databases have become an integral part of all aspects of science research, including basic and applied plant and animal sciences. The importance of databases keeps increasing as the volume of datasets from direct and indirect genomics, as well as other omics approaches, keeps expanding in recent years. The databases and associated web portals provide at a minimum a uniform set of tools and automated analysis across a wide range of crop plant genomes. This paper reviews some basic terms and considerations in utilizing crop plant databases in the advancing genomics era. The utilization of databases for variation analysis with other comparative genomics tools, and data interpretation platforms, is well described. The major focus of this review is to provide knowledge on platforms and databases for genome-based investigations of agriculturally important crop plants. The utilization of these databases in applied crop improvement programs has yet to be achieved widely; without it, the full benefit of sequencing will not be realized. PMID:25874133
Big Data Mining and Adverse Event Pattern Analysis in Clinical Drug Trials.
Federer, Callie; Yoo, Minjae; Tan, Aik Choon
2016-12-01
Drug adverse events (AEs) are a major health threat to patients seeking medical treatment and a significant barrier in drug discovery and development. AEs are now required to be submitted during clinical trials and can be extracted from ClinicalTrials.gov (https://clinicaltrials.gov/), a database of clinical studies around the world. By extracting drug and AE information from ClinicalTrials.gov and structuring it into a database, drug-AE relationships could be established for future drug development and repositioning. To our knowledge, current AE databases contain mainly U.S. Food and Drug Administration (FDA)-approved drugs. However, our database contains both FDA-approved and experimental compounds extracted from ClinicalTrials.gov. Our database contains 8,161 clinical trials of 3,102,675 patients and 713,103 reported AEs. We extracted the information from ClinicalTrials.gov using a set of python scripts, and then used regular expressions and a drug dictionary to process and structure relevant information into a relational database. We performed data mining and pattern analysis of drug-AEs in our database. Our database can serve as a tool to assist researchers to discover drug-AE relationships for developing, repositioning, and repurposing drugs. PMID:27631620
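A hedged Python sketch of the described pipeline follows, with a toy drug dictionary, toy text, and an illustrative regular expression; the authors' actual scripts parse real ClinicalTrials.gov records.

# Sketch: regular expression + drug dictionary -> rows in a relational DB.
import re
import sqlite3

DRUGS = {"aspirin", "warfarin"}          # stand-in drug dictionary
text = "Arm 1 (aspirin): nausea 12/100; headache 7/100."

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE drug_ae (drug TEXT, event TEXT, rate TEXT)")

drug = next((d for d in DRUGS if d in text.lower()), None)
for event, rate in re.findall(r"(\w+) (\d+/\d+)", text):
    if drug:
        con.execute("INSERT INTO drug_ae VALUES (?, ?, ?)", (drug, event, rate))

print(con.execute("SELECT * FROM drug_ae").fetchall())
# [('aspirin', 'nausea', '12/100'), ('aspirin', 'headache', '7/100')]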
Long-term cycles in the history of life: periodic biodiversity in the paleobiology database.
Melott, Adrian L
2008-01-01
Time series analysis of fossil biodiversity of marine invertebrates in the Paleobiology Database (PBDB) shows a significant periodicity at approximately 63 My, in agreement with previous analyses based on the Sepkoski database. I discuss how this result did not appear in a previous analysis of the PBDB. The existence of the 63 My periodicity, despite very different treatment of systematic error in both the PBDB and Sepkoski databases, strongly argues for consideration of its reality in the fossil record. Cross-spectral analysis of the two datasets finds that a 62 My periodicity coincides in phase to within 1.6 My, better than the errors in either measurement. Consequently, the two data sets not only contain the same strong periodicity, but its peaks and valleys closely correspond in time. Two other spectral peaks appear in the PBDB analysis, but appear to be artifacts associated with detrending and with the increased interval length. Sampling-standardization procedures implemented by the PBDB collaboration suggest that the signal is not an artifact of sampling bias. Further work should focus on finding the cause of the 62 My periodicity.
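The detrend-then-find-a-peak procedure can be reproduced on synthetic data; the series below is invented, with a 62 My cycle injected on a smooth trend, and is not the PBDB itself.

# Toy reconstruction of the method: detrend a diversity series, then look
# for the dominant spectral peak.
import numpy as np

t = np.arange(0, 558, 3.0)                      # time in My, 3-My bins
series = 0.002 * t**2 + 10 * np.sin(2 * np.pi * t / 62) \
         + np.random.default_rng(1).normal(0, 3, t.size)

detrended = series - np.polyval(np.polyfit(t, series, 2), t)
power = np.abs(np.fft.rfft(detrended))**2
freqs = np.fft.rfftfreq(t.size, d=3.0)          # cycles per My

peak = freqs[np.argmax(power[1:]) + 1]          # skip the zero frequency
print(f"dominant period ~ {1 / peak:.0f} My")   # ~62 My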
EasyKSORD: A Platform of Keyword Search Over Relational Databases
NASA Astrophysics Data System (ADS)
Peng, Zhaohui; Li, Jing; Wang, Shan
Keyword Search Over Relational Databases (KSORD) enables casual users to use keyword queries (a set of keywords) to search relational databases just like searching the Web, without any knowledge of the database schema or any need of writing SQL queries. Based on our previous work, we design and implement a novel KSORD platform named EasyKSORD for users and system administrators to use and manage different KSORD systems in a novel and simple manner. EasyKSORD supports advanced queries, efficient data-graph-based search engines, multiform result presentations, and system logging and analysis. Through EasyKSORD, users can search relational databases easily and read search results conveniently, and system administrators can easily monitor and analyze the operations of KSORD and manage KSORD systems much better.
Mathematical models for exploring different aspects of genotoxicity and carcinogenicity databases.
Benigni, R; Giuliani, A
1991-12-01
One great obstacle to understanding and using the information contained in the genotoxicity and carcinogenicity databases is the very size of such databases. Their vastness makes them difficult to read; this leads to inadequate exploitation of the information, which becomes costly in terms of time, labor, and money. In its search for adequate approaches to the problem, the scientific community has, curiously, almost entirely neglected an existent series of very powerful methods of data analysis: the multivariate data analysis techniques. These methods were specifically designed for exploring large data sets. This paper presents the multivariate techniques and reports a number of applications to genotoxicity problems. These studies show how biology and mathematical modeling can be combined and how successful this combination is.
Rudnick, Paul A; Markey, Sanford P; Roth, Jeri; Mirokhin, Yuri; Yan, Xinjian; Tchekhovskoi, Dmitrii V; Edwards, Nathan J; Thangudu, Ratna R; Ketchum, Karen A; Kinsinger, Christopher R; Mesri, Mehdi; Rodriguez, Henry; Stein, Stephen E
2016-03-04
The Clinical Proteomic Tumor Analysis Consortium (CPTAC) has produced large proteomics data sets from the mass spectrometric interrogation of tumor samples previously analyzed by The Cancer Genome Atlas (TCGA) program. The availability of the genomic and proteomic data is enabling proteogenomic study for both reference (i.e., contained in major sequence databases) and nonreference markers of cancer. The CPTAC laboratories have focused on colon, breast, and ovarian tissues in the first round of analyses; spectra from these data sets were produced from 2D liquid chromatography-tandem mass spectrometry analyses and represent deep coverage. To reduce the variability introduced by disparate data analysis platforms (e.g., software packages, versions, parameters, sequence databases, etc.), the CPTAC Common Data Analysis Platform (CDAP) was created. The CDAP produces both peptide-spectrum-match (PSM) reports and gene-level reports. The pipeline processes raw mass spectrometry data according to the following: (1) peak-picking and quantitative data extraction, (2) database searching, (3) gene-based protein parsimony, and (4) false-discovery rate-based filtering. The pipeline also produces localization scores for the phosphopeptide enrichment studies using the PhosphoRS program. Quantitative information for each of the data sets is specific to the sample processing, with PSM and protein reports containing the spectrum-level or gene-level ("rolled-up") precursor peak areas and spectral counts for label-free or reporter ion log-ratios for 4plex iTRAQ. The reports are available in simple tab-delimited formats and, for the PSM-reports, in mzIdentML. The goal of the CDAP is to provide standard, uniform reports for all of the CPTAC data to enable comparisons between different samples and cancer types as well as across the major omics fields.
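Step (4) of the pipeline can be illustrated with a toy target-decoy filter in Python; this is a deliberate simplification (it stops at the first FDR violation rather than computing q-values) and not the CDAP's actual implementation.

# Sketch of FDR-based filtering: raise the score cutoff until the decoy
# count implies an acceptable false discovery rate.
def fdr_filter(psms, fdr=0.01):
    # psms: list of (score, is_decoy); returns target PSMs passing the filter.
    psms = sorted(psms, key=lambda p: -p[0])
    decoys = targets = 0
    kept = []
    for score, is_decoy in psms:
        decoys += is_decoy
        targets += not is_decoy
        if targets and decoys / targets > fdr:
            break
        if not is_decoy:
            kept.append(score)
    return kept

hits = [(98, False), (95, False), (90, True), (88, False), (80, True)]
print(len(fdr_filter(hits, fdr=0.5)))  # 3 targets retained at 50% FDR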
Development of a prototype commonality analysis tool for use in space programs
NASA Technical Reports Server (NTRS)
Yeager, Dorian P.
1988-01-01
A software tool to aid in performing commonality analyses, called the Commonality Analysis Problem Solver (CAPS), was designed, and a prototype version (CAPS 1.0) was implemented and tested. CAPS 1.0 runs in an MS-DOS or IBM PC-DOS environment. CAPS is designed around a simple input language which provides a natural syntax for the description of feasibility constraints. It provides its users with the ability to load a database representing a set of design items, describe the feasibility constraints on items in that database, and perform a comprehensive cost analysis to find the most economical substitution pattern.
An architecture for a brain-image database
NASA Technical Reports Server (NTRS)
Herskovits, E. H.
2000-01-01
The widespread availability of methods for noninvasive assessment of brain structure has enabled researchers to investigate neuroimaging correlates of normal aging, cerebrovascular disease, and other processes; we designate such studies as image-based clinical trials (IBCTs). We propose an architecture for a brain-image database, which integrates image processing and statistical operators, and thus supports the implementation and analysis of IBCTs. The implementation of this architecture is described and results from the analysis of image and clinical data from two IBCTs are presented. We expect that systems such as this will play a central role in the management and analysis of complex research data sets.
Albreht, T; Paulin, M
1999-01-01
The article describes the possibilities for planning the health care providers' network enabled by the use of information technology. The cornerstone of such planning is the development and establishment of a quality database on health care providers, health care professionals and their employment statuses. Based on the analysis of information needs, a new database was developed for various users in health care delivery as well as for those in health insurance. The method of information engineering was used in the standard four steps of information system construction, while the whole project was run in accordance with the principles of two internationally approved project management methods. Special attention was dedicated to a careful analysis of the users' requirements, and we believe the latter to be fulfilled to a very large degree. The new NHCPD is a relational database which is set up in two important state institutions, the National Institute of Public Health and the Health Insurance Institute of Slovenia. The former is responsible for updating the database, while the latter is responsible for the technological side as well as for the implementation of data security and protection. NHCPD will be interlinked with several other existing applications in the area of health care, public health and health insurance. Several important state institutions and professional chambers are users of the database in question, thus integrating various aspects of the health care system in Slovenia. The setting up of a completely revised health care providers' database in Slovenia is an important step in the development of a uniform and integrated information system that would support top decision-making processes at the national level.
Interactive Multi-Instrument Database of Solar Flares (IMIDSF)
NASA Astrophysics Data System (ADS)
Sadykov, Viacheslav M.; Nita, Gelu M.; Oria, Vincent; Kosovichev, Alexander G.
2017-08-01
Solar flares represent a complicated physical phenomenon observed in a broad range of the electromagnetic spectrum, from radio waves to gamma-rays. For a complete understanding of flares it is necessary to perform a combined multi-wavelength analysis using observations from many satellites and ground-based observatories. For efficient data search, integration of different flare lists and representation of observational data, we have developed the Interactive Multi-Instrument Database of Solar Flares (https://solarflare.njit.edu/). The web database is fully functional and allows the user to search for uniquely identified flare events based on their physical descriptors and the availability of observations from a particular set of instruments. Currently, data from three primary flare lists (GOES, RHESSI and HEK) and a variety of other event catalogs (Hinode, Fermi GBM, Konus-Wind, OVSA flare catalogs, CACTus CME catalog, Filament eruption catalog) and observing logs (IRIS and Nobeyama coverage) are integrated. An additional set of physical descriptors (temperature and emission measure), along with an observing summary, data links and multi-wavelength light curves, is provided for each flare event since January 2002. Results of an initial statistical analysis will be presented.
Hahn, Lars; Leimeister, Chris-André; Ounit, Rachid; Lonardi, Stefano; Morgenstern, Burkhard
2016-10-01
Many algorithms for sequence analysis rely on word matching or word statistics. Often, these approaches can be improved if binary patterns representing match and don't-care positions are used as a filter, such that only those positions of words are considered that correspond to the match positions of the patterns. The performance of these approaches, however, depends on the underlying patterns. Herein, we show that the overlap complexity of a pattern set that was introduced by Ilie and Ilie is closely related to the variance of the number of matches between two evolutionarily related sequences with respect to this pattern set. We propose a modified hill-climbing algorithm to optimize pattern sets for database searching, read mapping and alignment-free sequence comparison of nucleic-acid sequences; our implementation of this algorithm is called rasbhari. Depending on the application at hand, rasbhari can either minimize the overlap complexity of pattern sets, maximize their sensitivity in database searching or minimize the variance of the number of pattern-based matches in alignment-free sequence comparison. We show that, for database searching, rasbhari generates pattern sets with slightly higher sensitivity than existing approaches. In our Spaced Words approach to alignment-free sequence comparison, pattern sets calculated with rasbhari led to more accurate estimates of phylogenetic distances than the randomly generated pattern sets that we previously used. Finally, we used rasbhari to generate patterns for short read classification with CLARK-S. Here too, the sensitivity of the results could be improved, compared to the default patterns of the program. We integrated rasbhari into Spaced Words; the source code of rasbhari is freely available at http://rasbhari.gobics.de/.
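A small Python sketch of the spaced-words primitive that such patterns feed into follows; the example pattern and sequences are arbitrary, and rasbhari's actual contribution, optimizing the patterns themselves, is not reproduced here.

# Sketch: count word matches between two sequences at the match ('1')
# positions of a binary pattern; '0' positions are don't-care.
def spaced_word(seq: str, pos: int, pattern: str) -> str:
    return "".join(seq[pos + i] for i, c in enumerate(pattern) if c == "1")

def match_count(s1: str, s2: str, pattern: str) -> int:
    # Number of position pairs whose spaced words are identical.
    L = len(pattern)
    w1 = [spaced_word(s1, i, pattern) for i in range(len(s1) - L + 1)]
    w2 = [spaced_word(s2, i, pattern) for i in range(len(s2) - L + 1)]
    return sum(a == b for a in w1 for b in w2)

print(match_count("ACGTACGT", "ACGAACGT", "1101"))  # don't-care at position 3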
Cochrane pregnancy and childbirth database: resource for evidence-based practice.
Callister, L C; Hobbins-Garbett, D
2000-01-01
The Cochrane Pregnancy and Childbirth database is an ongoing meta-analysis of evidence documenting effective health care practices for childbearing women and their neonates. It is proving invaluable to nurse educators, researchers, clinicians, and administrators working in a variety of health care delivery settings. Evidence-based nursing practice that is safe and effective can enhance rather than overpower pivotal and celebratory life events such as childbirth.
Constructing Benchmark Databases and Protocols for Medical Image Analysis: Diabetic Retinopathy
Kauppi, Tomi; Kämäräinen, Joni-Kristian; Kalesnykiene, Valentina; Sorri, Iiris; Uusitalo, Hannu; Kälviäinen, Heikki
2013-01-01
We address the performance evaluation practices for developing medical image analysis methods, in particular, how to establish and share databases of medical images with verified ground truth and solid evaluation protocols. Such databases support the development of better algorithms, execution of profound method comparisons, and, consequently, technology transfer from research laboratories to clinical practice. For this purpose, we propose a framework consisting of reusable methods and tools for the laborious task of constructing a benchmark database. We provide a software tool for medical image annotation helping to collect class label, spatial span, and expert's confidence on lesions and a method to appropriately combine the manual segmentations from multiple experts. The tool and all necessary functionality for method evaluation are provided as public software packages. As a case study, we utilized the framework and tools to establish the DiaRetDB1 V2.1 database for benchmarking diabetic retinopathy detection algorithms. The database contains a set of retinal images, ground truth based on information from multiple experts, and a baseline algorithm for the detection of retinopathy lesions. PMID:23956787
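A minimal Python sketch of combining manual segmentations from multiple experts follows; pixel-wise majority voting is shown as a stand-in, since the paper's exact combination method may weight expert confidence differently.

# Illustration: fuse binary lesion masks from several experts by majority vote.
import numpy as np

expert_masks = [
    np.array([[0, 1], [1, 1]]),
    np.array([[0, 1], [0, 1]]),
    np.array([[1, 1], [0, 1]]),
]
votes = np.sum(expert_masks, axis=0)
ground_truth = (votes >= 2).astype(np.uint8)   # majority of 3 experts
print(ground_truth)                            # [[0 1] [0 1]]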
Database resources of the National Center for Biotechnology Information
Wheeler, David L.; Church, Deanna M.; Lash, Alex E.; Leipe, Detlef D.; Madden, Thomas L.; Pontius, Joan U.; Schuler, Gregory D.; Schriml, Lynn M.; Tatusova, Tatiana A.; Wagner, Lukas; Rapp, Barbara A.
2001-01-01
In addition to maintaining the GenBank® nucleic acid sequence database, the National Center for Biotechnology Information (NCBI) provides data analysis and retrieval resources that operate on the data in GenBank and a variety of other biological data made available through NCBI’s Web site. NCBI data retrieval resources include Entrez, PubMed, LocusLink and the Taxonomy Browser. Data analysis resources include BLAST, Electronic PCR, OrfFinder, RefSeq, UniGene, HomoloGene, Database of Single Nucleotide Polymorphisms (dbSNP), Human Genome Sequencing, Human MapViewer, GeneMap’99, Human–Mouse Homology Map, Cancer Chromosome Aberration Project (CCAP), Entrez Genomes, Clusters of Orthologous Groups (COGs) database, Retroviral Genotyping Tools, Cancer Genome Anatomy Project (CGAP), SAGEmap, Gene Expression Omnibus (GEO), Online Mendelian Inheritance in Man (OMIM), the Molecular Modeling Database (MMDB) and the Conserved Domain Database (CDD). Augmenting many of the Web applications are custom implementations of the BLAST program optimized to search specialized data sets. All of the resources can be accessed through the NCBI home page at: http://www.ncbi.nlm.nih.gov. PMID:11125038
VKCDB: voltage-gated K+ channel database updated and upgraded.
Gallin, Warren J; Boutet, Patrick A
2011-01-01
The Voltage-gated K(+) Channel DataBase (VKCDB) (http://vkcdb.biology.ualberta.ca) makes a comprehensive set of sequence data readily available for phylogenetic and comparative analysis. The current update contains 2063 entries for full-length or nearly full-length unique channel sequences from Bacteria (477), Archaea (18) and Eukaryotes (1568), an increase from 346 solely eukaryotic entries in the original release. In addition to channel protein sequences, the nucleotide sequences of the open reading frames encoding those proteins are now available and can be extracted in parallel with sets of protein sequences. Channels are categorized into subfamilies by phylogenetic analysis and by using hidden Markov model analyses. Although the raw database contains a number of fragmentary, duplicated, obsolete and non-channel sequences that were collected in early steps of data collection, the web interface only returns entries that have been validated as likely K(+) channels. The retrieval function of the web interface allows retrieval of entries that contain a substantial fraction of the core structural elements of VKCs, fragmentary entries, or both. The full database can be downloaded as either a MySQL dump or an XML dump from the web site. We have now implemented automated updates at quarterly intervals.
The HARPS-N archive through a Cassandra, NoSQL database suite?
NASA Astrophysics Data System (ADS)
Molinari, Emilio; Guerra, Jose; Harutyunyan, Avet; Lodi, Marcello; Martin, Adrian
2016-07-01
The TNG-INAF is developing the science archive for the WEAVE instrument. The underlying architecture of the archive is based on a non-relational database, more precisely on an Apache Cassandra cluster, which uses NoSQL technology. In order to test and validate this architecture, we created a local archive populated with all the HARPS-N spectra collected at the TNG since the instrument's start of operations in mid-2012, and developed tools for the analysis of this data set. The HARPS-N data set is two orders of magnitude smaller than WEAVE's, but we want to demonstrate the ability to walk through a complete data set and produce scientific output as valuable as that produced by an ordinary pipeline, without directly accessing the FITS files. The analytics are done with Apache Solr and Spark, and on a relational PostgreSQL database. As an example, we produce observables such as metallicity indexes for the targets in the archive and compare the results with those coming from the HARPS-N regular data reduction software. The aim of this experiment is to explore the viability of a high-availability cluster and distributed NoSQL database as a platform for complex scientific analytics on a large data set, which will then be ported to the WEAVE Archive System (WAS) that we are developing for the WEAVE multi-object fiber spectrograph.
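For readers unfamiliar with Cassandra, the sketch below shows the flavor of such an archive using the DataStax Python driver. The keyspace, table layout and column names are hypothetical, not the actual TNG/WEAVE schema.

```python
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])   # contact point of the Cassandra cluster
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS harpsn
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
session.execute("""
    CREATE TABLE IF NOT EXISTS harpsn.spectra (
        target text, obs_date timestamp, file_lfn text,
        snr float, metallicity_index float,
        PRIMARY KEY (target, obs_date))
""")

# Partitioning by target makes per-target time series cheap to scan,
# a typical NoSQL trade-off versus ad hoc relational queries.
rows = session.execute(
    "SELECT obs_date, metallicity_index FROM harpsn.spectra WHERE target = %s",
    ("HD209458",))
for r in rows:
    print(r.obs_date, r.metallicity_index)
```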
A meta-data based method for DNA microarray imputation.
Jörnsten, Rebecka; Ouyang, Ming; Wang, Hui-Yu
2007-03-29
DNA microarray experiments are conducted in logical sets, such as time course profiling after a treatment is applied to the samples, or comparisons of the samples under two or more conditions. Due to cost and design constraints of spotted cDNA microarray experiments, each logical set commonly includes only a small number of replicates per condition. Despite the vast improvement of the microarray technology in recent years, missing values are prevalent. Intuitively, imputation of missing values is best done using many replicates within the same logical set. In practice, there are few replicates and thus reliable imputation within logical sets is difficult. However, it is in the case of few replicates that the presence of missing values, and how they are imputed, can have the most profound impact on the outcome of downstream analyses (e.g. significance analysis and clustering). This study explores the feasibility of imputation across logical sets, using the vast amount of publicly available microarray data to improve imputation reliability in the small sample size setting. We download all cDNA microarray data of Saccharomyces cerevisiae, Arabidopsis thaliana, and Caenorhabditis elegans from the Stanford Microarray Database. Through cross-validation and simulation, we find that, for all three species, our proposed imputation using data from public databases is far superior to imputation within a logical set, sometimes to an astonishing degree. Furthermore, the imputation root mean square error for significant genes is generally much lower than that for non-significant ones. Since downstream analysis of significant genes, such as clustering and network analysis, can be very sensitive to small perturbations of estimated gene effects, it is highly recommended that researchers apply reliable data imputation prior to further analysis. Our method can also be applied to cDNA microarray experiments from other species, provided good reference data are available.
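The core idea, borrowing gene-gene similarity from a large public compendium to impute a small logical set, can be sketched as follows. This is a simplified nearest-neighbor stand-in for the paper's meta-data based method; the genes-by-arrays layout and the choice of k are assumptions.

```python
import numpy as np

def impute_from_reference(target, reference, k=10):
    """Fill NaNs in `target` (genes x few arrays) using the k most correlated
    genes, where correlation is computed on a large public `reference`
    compendium (same gene order, genes x many arrays)."""
    filled = target.copy()
    corr = np.corrcoef(reference)       # gene-gene similarity from public data
    np.fill_diagonal(corr, -np.inf)     # never pick a gene as its own neighbor
    for g in range(target.shape[0]):
        miss = np.isnan(target[g])
        if not miss.any():
            continue
        neighbors = np.argsort(corr[g])[-k:]
        # nanmean tolerates neighbors that are themselves missing at a column
        filled[g, miss] = np.nanmean(target[neighbors][:, miss], axis=0)
    return filled
```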
Stability assessment of structures under earthquake hazard through GRID technology
NASA Astrophysics Data System (ADS)
Prieto Castrillo, F.; Boton Fernandez, M.
2009-04-01
This work presents a GRID framework to estimate the vulnerability of structures under earthquake hazard. The tool has been designed to cover the needs of a typical earthquake engineering stability analysis: preparation of input data (pre-processing), response computation and stability analysis (post-processing). In order to validate the application over GRID, a simplified model of a structure under artificially generated earthquake records has been implemented. To achieve this goal, the proposed scheme exploits the GRID technology and its main advantages (parallel intensive computing, huge storage capacity and collaboration analysis among institutions) through intensive interaction among the GRID elements (Computing Element, Storage Element, LHC File Catalogue, federated database, etc.). The dynamical model is described by a set of ordinary differential equations (ODEs) and by a set of parameters. Both elements, along with the integration engine, are encapsulated into Java classes. With this high-level design, subsequent improvements/changes of the model can be addressed with little effort. In the procedure, an earthquake record database is prepared and stored (pre-processing) in the GRID Storage Element (SE). The metadata of these records is also stored in the GRID federated database. This metadata contains both relevant information about the earthquake (as is usual in a seismic repository) and the Logical File Name (LFN) of the record for its later retrieval. Then, from the available set of accelerograms in the SE, the user can specify a range of earthquake parameters to carry out a dynamic analysis. This way, a GRID job is created for each selected accelerogram in the database. At the GRID Computing Element (CE), displacements are then obtained by numerical integration of the ODEs over time. The resulting response for that configuration is stored in the GRID Storage Element (SE) and the maximum structure displacement is computed. Then, the corresponding metadata containing the response LFN, earthquake magnitude and maximum structure displacement is also stored. Finally, the displacements are post-processed through a statistically based algorithm from the available metadata to obtain the probability of collapse of the structure for different earthquake magnitudes. From this study, it is possible to build a vulnerability report for the structure type and seismic data. The proposed methodology can be combined with the on-going initiatives to build a European earthquake record database. In this context, GRID enables collaboration analysis over shared seismic data and results among different institutions.
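The response-computation step executed in each GRID job amounts to integrating the equations of motion for one accelerogram and recording the peak displacement. A minimal sketch for a damped single-degree-of-freedom oscillator is shown below; the paper's actual structural model and parameters are more elaborate, and the values here are placeholders.

```python
import numpy as np
from scipy.integrate import solve_ivp

def max_displacement(accel, dt, omega=2 * np.pi, zeta=0.05):
    """Peak displacement of a damped single-degree-of-freedom oscillator,
    x'' + 2*zeta*omega*x' + omega**2 * x = -a_g(t),
    driven by a ground-acceleration record sampled at interval dt."""
    t = np.arange(len(accel)) * dt
    def rhs(tt, y):
        x, v = y
        a_g = np.interp(tt, t, accel)          # forcing from the accelerogram
        return [v, -2 * zeta * omega * v - omega ** 2 * x - a_g]
    sol = solve_ivp(rhs, (0.0, t[-1]), [0.0, 0.0], max_step=dt)
    return float(np.abs(sol.y[0]).max())

# Toy record: 20 s of white noise at 100 Hz (a stand-in for a real accelerogram)
rng = np.random.default_rng(0)
print(max_displacement(rng.normal(0.0, 1.0, 2000), dt=0.01))
```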
Mapping Creativity: Creativity Measurements Network Analysis
ERIC Educational Resources Information Center
Pinheiro, Igor Reszka; Cruz, Roberto Moraes
2014-01-01
This article borrowed network analysis tools to discover how the construct formed by the set of all measures of creativity configures itself. To this end, using a variant of the meta-analytical method, a database was compiled simulating 42,381 responses to 974 variables centered on 64 creativity measures. Results, although preliminary, indicate…
Natural language processing and the representation of clinical data.
Sager, N; Lyman, M; Bucknall, C; Nhan, N; Tick, L J
1994-01-01
OBJECTIVE: Develop a representation of clinical observations and actions and a method of processing free-text patient documents to facilitate applications such as quality assurance. DESIGN: The Linguistic String Project (LSP) system of New York University utilizes syntactic analysis, augmented by a sublanguage grammar and an information structure that are specific to the clinical narrative, to map free-text documents into a database for querying. MEASUREMENTS: Information precision (I-P) and information recall (I-R) were measured for queries for the presence of 13 asthma-health-care quality assurance criteria in a database generated from 59 discharge letters. RESULTS: I-P, using counts of major errors only, was 95.7% for the 28-letter training set and 98.6% for the 31-letter test set. I-R, using counts of major omissions only, was 93.9% for the training set and 92.5% for the test set. PMID:7719796
An Optimized Informatics Pipeline for Mass Spectrometry-Based Peptidomics
DOE Office of Scientific and Technical Information (OSTI.GOV)
Wu, Chaochao; Monroe, Matthew E.; Xu, Zhe
2015-12-26
Comprehensive MS analysis of the peptidome, the intracellular and intercellular products of protein degradation, has the potential to provide novel insights into endogenous proteolytic processing and its utility in disease diagnosis and prognosis. Along with the advances in MS instrumentation, a plethora of proteomics data analysis tools have been applied for direct use in peptidomics; however, an evaluation of the currently available informatics pipelines for peptidomics data analysis has yet to be reported. In this study, we set off by evaluating the results of several popular MS/MS database search engines, including MS-GF+, SEQUEST and MS-Align+, for peptidomics data analysis, followed by identification and label-free quantification using the well-established accurate mass and time (AMT) tag and newly developed informed quantification (IQ) approaches, both based on direct LC-MS analysis. Our results demonstrated that MS-GF+ outperformed both SEQUEST and MS-Align+ in identifying peptidome peptides. Using a database established from the MS-GF+ peptide identifications, both the AMT tag and IQ approaches provided significantly deeper peptidome coverage and fewer missing values for each individual data set than the MS/MS methods, while achieving robust label-free quantification. Besides having an excellent correlation with the AMT tag quantification results, IQ also provided slightly higher peptidome coverage than AMT. Taken together, we propose an optimal informatics pipeline combining MS-GF+ for initial database searching with IQ (or AMT) for identification and label-free quantification for high-throughput, comprehensive and quantitative peptidomics analysis.
Norén, G Niklas; Bergvall, Tomas; Ryan, Patrick B; Juhlin, Kristina; Schuemie, Martijn J; Madigan, David
2013-10-01
Observational healthcare data offer the potential to identify adverse drug reactions that may be missed by spontaneous reporting. The self-controlled cohort analysis within the Temporal Pattern Discovery framework compares the observed-to-expected ratios of medical outcomes during post-exposure surveillance periods with those during a set of distinct pre-exposure control periods in the same patients. It utilizes an external control group to account for systematic differences between the different time periods, thus combining within- and between-patient confounder adjustment in a single measure. Our objective was to evaluate the performance of the calibrated self-controlled cohort analysis within Temporal Pattern Discovery as a tool for risk identification in observational healthcare data. Different implementations of the calibrated self-controlled cohort analysis were applied to 399 drug-outcome pairs (165 positive and 234 negative test cases across 4 health outcomes of interest) in 5 real observational databases (four with administrative claims and one with electronic health records). Performance was evaluated on real data through sensitivity/specificity, the area under the receiver operator characteristics curve (AUC), and bias. The calibrated self-controlled cohort analysis achieved good predictive accuracy across the outcomes and databases under study. The optimal design based on this reference set uses a 360-day surveillance period and a single control period 180 days prior to new prescriptions. It achieved an average AUC of 0.75 and AUC >0.70 in all but one scenario. A design with three separate control periods performed better for the electronic health records database and for acute renal failure across all data sets. The estimates for negative test cases were generally unbiased, but a minor negative bias of up to 0.2 on the RR-scale was observed with the configurations using multiple control periods, for acute liver injury and upper gastrointestinal bleeding. The calibrated self-controlled cohort analysis within Temporal Pattern Discovery shows promise as a tool for risk identification; it performs well at discriminating positive from negative test cases. The optimal parameter configuration may vary with the data set and medical outcome of interest.
Vadigepalli, Rajanikanth; Chakravarthula, Praveen; Zak, Daniel E; Schwaber, James S; Gonye, Gregory E
2003-01-01
We have developed a bioinformatics tool named PAINT that automates the promoter analysis of a given set of genes for the presence of transcription factor binding sites. Based on coincidence of regulatory sites, this tool produces an interaction matrix that represents a candidate transcriptional regulatory network. This tool currently consists of (1) a database of promoter sequences of known or predicted genes in the Ensembl annotated mouse genome database, (2) various modules that can retrieve and process the promoter sequences for binding sites of known transcription factors, and (3) modules for visualization and analysis of the resulting set of candidate network connections. This information provides a substantially pruned list of genes and transcription factors that can be examined in detail in further experimental studies on gene regulation. Also, the candidate network can be incorporated into network identification methods in the form of constraints on feasible structures in order to render the algorithms tractable for large-scale systems. The tool can also produce output in various formats suitable for use in external visualization and analysis software. In this manuscript, PAINT is demonstrated in two case studies involving analysis of differentially regulated genes chosen from two microarray data sets. The first set is from a neuroblastoma N1E-115 cell differentiation experiment, and the second set is from neuroblastoma N1E-115 cells at different time intervals following exposure to neuropeptide angiotensin II. PAINT is available for use as an agent in BioSPICE simulation and analysis framework (www.biospice.org), and can also be accessed via a WWW interface at www.dbi.tju.edu/dbi/tools/paint/.
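The basic operation behind such a tool, scanning promoter sequences with position weight matrices and assembling a gene-by-factor candidate interaction matrix, can be sketched as below. The scoring scheme and threshold are simplified assumptions; PAINT itself works with curated binding-site matrices and Ensembl genome annotation.

```python
import numpy as np

def pwm_hit(promoter, pwm, alphabet="ACGT", frac=0.8):
    """True if the position weight matrix (4 x width array) scores at least
    `frac` of its maximum achievable score anywhere on the promoter."""
    idx = {c: i for i, c in enumerate(alphabet)}
    width = pwm.shape[1]
    best = pwm.max(axis=0).sum()
    for s in range(len(promoter) - width + 1):
        window = promoter[s:s + width]
        score = sum(pwm[idx[c], j] for j, c in enumerate(window))
        if score >= frac * best:
            return True
    return False

def interaction_matrix(promoters, pwms):
    """Candidate network: genes (rows) x transcription factors (columns),
    1 where a factor's binding site is detected on a gene's promoter."""
    return np.array([[int(pwm_hit(seq, pwm)) for pwm in pwms]
                     for seq in promoters])
```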
Madsen, Heidi Holst; Madsen, Dicte; Gauffriau, Marianne
2016-01-01
Unique identifiers (UID) are seen as an effective key to match identical publications across databases or identify duplicates in a database. The objective of the present study is to investigate how well UIDs work as match keys in the integration between Pure and SciVal, based on a case with publications from the health sciences. We evaluate the matching process based on information about coverage, precision, and characteristics of publications matched versus not matched with UIDs as the match keys. We analyze this information to detect errors, if any, in the matching process. As an example we also briefly discuss how publication sets formed by using UIDs as the match keys may affect the bibliometric indicators number of publications, number of citations, and the average number of citations per publication. The objective is addressed in a literature review and a case study. The literature review shows that only a few studies evaluate how well UIDs work as a match key. From the literature we identify four error types: Duplicate digital object identifiers (DOI), incorrect DOIs in reference lists and databases, DOIs not registered by the database where a bibliometric analysis is performed, and erroneous optical or special character recognition. The case study explores the use of UIDs in the integration between the databases Pure and SciVal. Specifically journal publications in English are matched between the two databases. We find all error types except erroneous optical or special character recognition in our publication sets. In particular the duplicate DOIs constitute a problem for the calculation of bibliometric indicators as both keeping the duplicates to improve the reliability of citation counts and deleting them to improve the reliability of publication counts will distort the calculation of average number of citations per publication. The use of UIDs as a match key in citation linking is implemented in many settings, and the availability of UIDs may become critical for the inclusion of a publication or a database in a bibliometric analysis. PMID:27635223
CARINA data synthesis project: pH data scale unification and cruise adjustments
NASA Astrophysics Data System (ADS)
Velo, A.; Pérez, F. F.; Lin, X.; Key, R. M.; Tanhua, T.; de La Paz, M.; van Heuven, S.; Jutterström, S.; Ríos, A. F.
2009-10-01
Data on carbon and carbon-relevant hydrographic and hydrochemical parameters from previously non-publicly available cruise data sets in the Arctic Mediterranean Seas (AMS), Atlantic and Southern Ocean have been retrieved and merged into a new database: CARINA (CARbon IN the Atlantic). These data have gone through rigorous quality control (QC) procedures to assure the highest possible quality and consistency. The data for most of the measured parameters in the CARINA database were objectively examined in order to quantify systematic differences in the reported values, i.e. secondary quality control. Systematic biases found in the data have been corrected in the data products, i.e. three merged data files with measured, calculated and interpolated data for each of the three CARINA regions: AMS, Atlantic and Southern Ocean. Out of a total of 188 cruise entries in the CARINA database, 59 reported measured pH values. Here we present details of the secondary QC on pH for the CARINA database. Procedures of quality control, including crossover analysis between cruises and inversion analysis of all crossover data, are briefly described. Adjustments were applied to the pH values for 21 of the cruises in the CARINA dataset. With these adjustments the CARINA database is consistent both internally and with GLODAP data, an oceanographic data set based on the World Hydrographic Program in the 1990s. Based on our analysis we estimate the internal accuracy of the CARINA pH data to be 0.005 pH units. The CARINA data are now suitable for accurate assessments of, for example, oceanic carbon inventories and uptake rates and for model validation.
IDD Info: a software to manage surveillance data of Iodine Deficiency Disorders.
Liu, Peng; Teng, Bai-Jun; Zhang, Shu-Bin; Su, Xiao-Hui; Yu, Jun; Liu, Shou-Jun
2011-08-01
IDD Info, a new software package for managing survey data on Iodine Deficiency Disorders (IDD), is presented in this paper. IDD Info aims to create IDD project databases, process and analyze various national or regional surveillance data, and produce a final report. It provides a series of functions for choosing a database from existing ones, revising it, choosing indicators from a pool to establish a database, and adding indicators to the pool. It also provides simple tools to scan one database and compare two databases, to set IDD standard parameters, to analyze data by single or multiple indicators, and finally to produce a typeset report with customized content. IDD Info was developed using the Chinese national IDD surveillance data of 2005. Its validity was evaluated by comparison with the survey report issued by China CDC. IDD Info is a professional analysis tool that speeds up IDD data analysis by about 14.28% relative to standard reference routines, thereby enhancing analysis performance and user compliance. IDD Info is a practical and accurate means of managing the multifarious IDD surveillance data and can be widely used by non-statisticians in national and regional IDD surveillance. Copyright © 2010 Elsevier Ireland Ltd. All rights reserved.
Morphology-based Query for Galaxy Image Databases
NASA Astrophysics Data System (ADS)
Shamir, Lior
2017-02-01
Galaxies of rare morphology are of paramount scientific interest, as they carry important information about the past, present, and future Universe. Once a rare galaxy is identified, studying it more effectively requires a set of galaxies of similar morphology, allowing generalization and statistical analysis that cannot be done when N=1. Databases generated by digital sky surveys can contain a very large number of galaxy images, and therefore once a rare galaxy of interest is identified it is possible that more instances of the same morphology are also present in the database. However, when a researcher identifies a certain galaxy of rare morphology in the database, it is virtually impossible to mine the database manually in the search for galaxies of similar morphology. Here we propose a computer method that can automatically search databases of galaxy images and identify galaxies that are morphologically similar to a certain user-defined query galaxy. That is, the researcher provides an image of a galaxy of interest, and the pattern recognition system automatically returns a list of galaxies that are visually similar to the target galaxy. The algorithm uses a comprehensive set of descriptors, allowing it to support different types of galaxies, and it is not limited to a finite set of known morphologies. While the list of returned galaxies is neither clean nor complete, it contains a far higher frequency of galaxies of the morphology of interest, providing a substantial reduction of the data. Such algorithms can be integrated into data management systems of autonomous digital sky surveys such as the Large Synoptic Survey Telescope (LSST), where the number of galaxies in the database is extremely large. The source code of the method is available at http://vfacstaff.ltu.edu/lshamir/downloads/udat.
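The retrieval step of such a system reduces to ranking database galaxies by distance to the query in a normalized descriptor space, as sketched below. The published method computes a much richer set of image descriptors than this stand-in assumes; all names here are illustrative.

```python
import numpy as np

def rank_by_similarity(query_features, database_features, top_n=100):
    """Rank database galaxies by Euclidean distance to the query in
    descriptor space. Features are z-scored against the database so that
    no single descriptor dominates the metric."""
    mu = database_features.mean(axis=0)
    sigma = database_features.std(axis=0) + 1e-12
    db = (database_features - mu) / sigma
    q = (query_features - mu) / sigma
    dist = np.linalg.norm(db - q, axis=1)
    return np.argsort(dist)[:top_n]   # indices of the most similar galaxies

# Toy usage: 1000 galaxies with 32 descriptors each, query = galaxy 0
rng = np.random.default_rng(0)
feats = rng.normal(size=(1000, 32))
print(rank_by_similarity(feats[0], feats, top_n=5))
```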
DOE Office of Scientific and Technical Information (OSTI.GOV)
Roberts, D
Purpose: A unified database system was developed to allow accumulation, review and analysis of quality assurance (QA) data for measurement, treatment, imaging and simulation equipment in our department. Recording these data in a database allows a unified and structured approach to review and analysis of data gathered using commercial database tools. Methods: A clinical database was developed to track records of quality assurance operations on linear accelerators, a computed tomography (CT) scanner, a high dose rate (HDR) afterloader and imaging systems such as on-board imaging (OBI) and Calypso in our department. The database was developed using a Microsoft Access database and the Visual Basic for Applications (VBA) programming interface. Separate modules were written for accumulation, review and analysis of daily, monthly and annual QA data. All modules were designed to use structured query language (SQL) as the basis of data accumulation and review. The SQL strings are dynamically rewritten at run time. The database also features embedded documentation, storage of documents produced during QA activities and the ability to annotate all data within the database. Tests are defined in a set of tables that define test type, specific value, and schedule. Results: Daily, monthly and annual QA data have been taken in parallel with established procedures to test MQA. The database has been used to aggregate data across machines to examine the consistency of machine parameters and operations within the clinic for several months. Conclusion: The MQA application has been developed as an interface to a commercially available SQL engine (JET 5.0) and a standard database back-end. The MQA system has been used for several months for routine data collection. The system is robust, relatively simple to extend and can be migrated to a commercial SQL server.
Kanapathipillai, Rupa; McManus, Hamish; Kamarulzaman, Adeeba; Lim, Poh Lian; Templeton, David J; Law, Matthew; Woolley, Ian
2014-01-01
The magnitude and frequency of HIV viral load blips in resource-limited settings have not previously been assessed. This study was undertaken in a cohort from a high-income country (Australia) known as AHOD (Australian HIV Observational Database) and another cohort from a mixture of Asian countries of varying national income per capita, TAHOD (TREAT Asia HIV Observational Database). Blips were defined as detectable VL (≥ 50 copies/mL) preceded and followed by undetectable VL (<50 copies/mL). Virological failure (VF) was defined as two consecutive VL ≥50 copies/mL. Cox proportional hazard models of time to first VF after entry were developed. 5040 patients (AHOD n = 2597 and TAHOD n = 2521) were included; 910 (18%) of patients experienced blips. 744 (21%) and 166 (11%) of high- and middle/low-income participants, respectively, ever experienced blips. 711 (14%) experienced blips prior to virological failure. 559 (16%) and 152 (10%) of high- and middle/low-income participants, respectively, experienced blips prior to virological failure. VL testing occurred at a median frequency of 175 and 91 days in middle/low- and high-income sites, respectively. Longer time to VF occurred in middle/low-income sites, compared with high-income sites (adjusted hazards ratio (AHR) 0.41; p<0.001), adjusted for year of first cART, Hepatitis C co-infection, cART regimen, and prior blips. Prior blips were not a significant predictor of VF in univariate analysis (AHR 0.97, p = 0.82). Differing magnitudes of blips were not significant in univariate analyses as predictors of virological failure (p = 0.360 for blip 50-≤1000, p = 0.309 for blip 50-≤400 and p = 0.300 for blip 50-≤200). 209 of 866 (24%) patients were switched to an alternate regimen in the setting of a blip. Despite a lower proportion of blips occurring in low/middle-income settings, no significant difference was found between settings. Nonetheless, a substantial number of participants were switched to alternative regimens in the setting of blips.
ERIC Educational Resources Information Center
United Nations Educational, Scientific, and Cultural Organization, Bangkok (Thailand). Regional Office for Education in Asia and the Pacific.
The objectives of the Expert Group Consultation Meeting for Developing a Non-Formal Education and Literacy Database in the Asia-Pacific Region were: to exchange information and review the state-of-the-art in the field of data collection, analysis and indicators of non-formal education and literacy programs; to examine and review the set of…
Shift-invariant discrete wavelet transform analysis for retinal image classification.
Khademi, April; Krishnan, Sridhar
2007-12-01
This work addresses retinal image classification through a novel analysis system. From the compressed domain, the proposed scheme extracts textural features from wavelet coefficients, which describe the relative homogeneity of localized areas of the retinal images. Since the discrete wavelet transform (DWT) is shift-variant, a shift-invariant DWT was explored to ensure that a robust feature set was extracted. To combat the small database size, linear discriminant analysis classification was used with the leave-one-out method. 38 normal and 48 abnormal images (exudates, large drusens, fine drusens, choroidal neovascularization, central vein and artery occlusion, histoplasmosis, arteriosclerotic retinopathy, hemi-central retinal vein occlusion and more) were used, and a specificity of 79% and sensitivity of 85.4% were achieved (the average classification rate is 82.2%). The success of the system can be attributed to the highly robust feature set, which included translation-, scale- and semi-rotation-invariant features. Additionally, this technique is database independent since the features were specifically tuned to the pathologies of the human eye.
Automated 3D Phenotype Analysis Using Data Mining
Plyusnin, Ilya; Evans, Alistair R.; Karme, Aleksis; Gionis, Aristides; Jernvall, Jukka
2008-01-01
The ability to analyze and classify three-dimensional (3D) biological morphology has lagged behind the analysis of other biological data types such as gene sequences. Here, we introduce the techniques of data mining to the study of 3D biological shapes to bring the analyses of phenomes closer to the efficiency of studying genomes. We compiled five training sets of highly variable morphologies of mammalian teeth from the MorphoBrowser database. Samples were labeled either by dietary class or by conventional dental types (e.g. carnassial, selenodont). We automatically extracted a multitude of topological attributes using Geographic Information Systems (GIS)-like procedures that were then used in several combinations of feature selection schemes and probabilistic classification models to build and optimize classifiers for predicting the labels of the training sets. In terms of classification accuracy, computational time and size of the feature sets used, non-repeated best-first search combined with 1-nearest neighbor classifier was the best approach. However, several other classification models combined with the same searching scheme proved practical. The current study represents a first step in the automatic analysis of 3D phenotypes, which will be increasingly valuable with the future increase in 3D morphology and phenomics databases. PMID:18320060
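A rough analogue of the winning combination, search-based feature selection wrapped around a 1-nearest-neighbor classifier, can be put together with scikit-learn (0.24 or later), as sketched below. Greedy forward selection is used here as a stand-in for the paper's non-repeated best-first search, and all parameter values are assumptions.

```python
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def build_shape_classifier(X, y, n_features=10):
    """Greedy forward feature selection around a 1-NN classifier.
    X: samples x GIS-style shape attributes; y: diet or dental-type labels."""
    knn = KNeighborsClassifier(n_neighbors=1)
    selector = SequentialFeatureSelector(
        knn, n_features_to_select=n_features, direction="forward")
    selector.fit(X, y)
    Xs = selector.transform(X)
    acc = cross_val_score(knn, Xs, y, cv=5).mean()   # cross-validated accuracy
    return selector, acc
```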
Analysis of prescription database extracted from standard textbooks of traditional Dai medicine.
Zhang, Chuang; Chongsuvivatwong, Virasakdi; Keawpradub, Niwat; Lin, Yanfang
2012-08-29
Traditional Dai Medicine (TDM) is one of the four major ethnomedicines of China. In 2007 a group of experts produced a set of seven Dai medical textbooks on this subject. The first two were selected as the main data source for analysing well-recognized prescriptions. The aim was to quantify patterns of prescriptions, common ingredients, indications and usages of TDM. A relational database linking the prescriptions, ingredients, herb names, indications, and usages was set up. Frequencies of combination patterns and common ingredients were tabulated. A total of 200 prescriptions and 402 herbs were compiled. Prescriptions based on "wind" disorders, a detoxification theory that most commonly deals with symptoms of digestive system diseases, accounted for over one third of all prescriptions. The major methods of preparation mostly used roots and whole herbs. The information extracted from the relational database may be useful for understanding symptomatic treatments. The antidote and detoxification theory deserves further research.
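A toy version of such a relational design, with hypothetical table and column names and placeholder rows, illustrates how ingredient frequencies are tabulated with a single SQL join; Python's built-in sqlite3 is used only for self-containment.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE prescription (id INTEGER PRIMARY KEY, name TEXT, indication TEXT);
    CREATE TABLE herb         (id INTEGER PRIMARY KEY, name TEXT, part_used TEXT);
    CREATE TABLE ingredient   (prescription_id INTEGER REFERENCES prescription(id),
                               herb_id         INTEGER REFERENCES herb(id));
""")
# Placeholder rows, purely illustrative (not actual TDM content)
con.executemany("INSERT INTO prescription VALUES (?, ?, ?)",
                [(1, "prescription_1", "wind disorder"),
                 (2, "prescription_2", "detoxification")])
con.executemany("INSERT INTO herb VALUES (?, ?, ?)",
                [(1, "herb_a", "root"), (2, "herb_b", "whole herb")])
con.executemany("INSERT INTO ingredient VALUES (?, ?)", [(1, 1), (1, 2), (2, 1)])

# Frequency of each herb across prescriptions (the "common ingredients" table)
for name, n in con.execute("""
        SELECT h.name, COUNT(*) AS n_prescriptions
        FROM ingredient i JOIN herb h ON h.id = i.herb_id
        GROUP BY h.id ORDER BY n_prescriptions DESC"""):
    print(name, n)
```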
Harb, Omar S; Roos, David S
2015-01-01
Over the past 20 years, advances in high-throughput biological techniques and the availability of computational resources, including fast Internet access, have resulted in an explosion of large genome-scale data sets ("big data"). While such data are readily available for download, personal use and analysis from a variety of repositories, such analysis often requires computational skills that are not widely available. As a result, a number of databases have emerged to provide scientists with online tools enabling the interrogation of data without the need for sophisticated computational skills beyond basic knowledge of Internet browser utility. This chapter focuses on the Eukaryotic Pathogen Databases (EuPathDB: http://eupathdb.org) Bioinformatic Resource Center (BRC) and illustrates some of the available tools and methods.
STBase: one million species trees for comparative biology.
McMahon, Michelle M; Deepak, Akshay; Fernández-Baca, David; Boss, Darren; Sanderson, Michael J
2015-01-01
Comprehensively sampled phylogenetic trees provide the most compelling foundations for strong inferences in comparative evolutionary biology. Mismatches are common, however, between the taxa for which comparative data are available and the taxa sampled by published phylogenetic analyses. Moreover, many published phylogenies are gene trees, which cannot always be adapted immediately for species level comparisons because of discordance, gene duplication, and other confounding biological processes. A new database, STBase, lets comparative biologists quickly retrieve species level phylogenetic hypotheses in response to a query list of species names. The database consists of 1 million single- and multi-locus data sets, each with a confidence set of 1000 putative species trees, computed from GenBank sequence data for 413,000 eukaryotic taxa. Two bodies of theoretical work are leveraged to aid in the assembly of multi-locus concatenated data sets for species tree construction. First, multiply labeled gene trees are pruned to conflict-free singly-labeled species-level trees that can be combined between loci. Second, impacts of missing data in multi-locus data sets are ameliorated by assembling only decisive data sets. Data sets overlapping with the user's query are ranked using a scheme that depends on user-provided weights for tree quality and for taxonomic overlap of the tree with the query. Retrieval times are independent of the size of the database, typically a few seconds. Tree quality is assessed by a real-time evaluation of bootstrap support on just the overlapping subtree. Associated sequence alignments, tree files and metadata can be downloaded for subsequent analysis. STBase provides a tool for comparative biologists interested in exploiting the most relevant sequence data available for the taxa of interest. It may also serve as a prototype for future species tree oriented databases and as a resource for assembly of larger species phylogenies from precomputed trees.
A Review of Research on Environmental Education in Non-Traditional Settings in Turkey, 2000 and 2011
ERIC Educational Resources Information Center
Erdogan, Mehmet; Usak, Muhammet; Bahar, Mehmet
2013-01-01
The purpose of the present study was to collect and analyze the research on environmental education in non-traditional settings in Turkey undertaken with various subjects (e.g. students, graduates and teachers) and published over the years of 2000-2011. For systematic analysis, selected data-bases and journals were scrutinized across five…
A Tool for Conditions Tag Management in ATLAS
NASA Astrophysics Data System (ADS)
Sharmazanashvili, A.; Batiashvili, G.; Gvaberidze, G.; Shekriladze, L.; Formica, A.; Atlas Collaboration
2014-06-01
ATLAS Conditions data include about 2 TB in a relational database and 400 GB of files referenced from the database. Conditions data are entered and retrieved using COOL, the API for accessing data in the LCG Conditions Database infrastructure. They are managed using an ATLAS-customized Python-based tool set. Conditions data are required for every reconstruction and simulation job, so access to them is crucial for all aspects of ATLAS data taking and analysis, as well as for preceding tasks to derive optimal corrections to reconstruction. Optimized sets of conditions for processing are accomplished using strict version control on those conditions: a process which assigns COOL Tags to sets of conditions, and then unifies those conditions over data-taking intervals into a COOL Global Tag. This Global Tag identifies the set of conditions used to process data so that the underlying conditions can be uniquely identified with 100% reproducibility should the processing be executed again. Understanding shifts in the underlying conditions from one tag to another and ensuring interval completeness for all detectors for a set of runs to be processed is a complex task, requiring tools beyond the above-mentioned Python utilities. Therefore, a JavaScript/PHP-based utility called the Conditions Tag Browser (CTB) has been developed. CTB gives detector and conditions experts the possibility to navigate through the different databases and COOL folders; explore the content of given tags and the differences between them, as well as their extent in time; and visualize the content of channels associated with leaf tags. This report describes the structure and the PHP/JavaScript function classes of the CTB.
A Bioinformatics Workflow for Variant Peptide Detection in Shotgun Proteomics*
Li, Jing; Su, Zengliu; Ma, Ze-Qiang; Slebos, Robbert J. C.; Halvey, Patrick; Tabb, David L.; Liebler, Daniel C.; Pao, William; Zhang, Bing
2011-01-01
Shotgun proteomics data analysis usually relies on database search. However, commonly used protein sequence databases do not contain information on protein variants and thus prevent variant peptides and proteins from being identified. Including known coding variations in protein sequence databases could help alleviate this problem. Based on our recently published human Cancer Proteome Variation Database, we have created a protein sequence database that comprehensively annotates thousands of cancer-related coding variants collected in the Cancer Proteome Variation Database as well as noncancer-specific ones from the Single Nucleotide Polymorphism Database (dbSNP). Using this database, we then developed a data analysis workflow for variant peptide identification in shotgun proteomics. The high risk of false positive variant identifications was addressed by a modified false discovery rate estimation method. Analysis of colorectal cancer cell lines SW480, RKO, and HCT-116 revealed a total of 81 peptides that contain either noncancer-specific or cancer-related variations. Twenty-three out of 26 variants randomly selected from the 81 were confirmed by genomic sequencing. We further applied the workflow on data sets from three individual colorectal tumor specimens. A total of 204 distinct variant peptides were detected, and five carried known cancer-related mutations. Each individual showed a specific pattern of cancer-related mutations, suggesting potential use of this type of information for personalized medicine. Compatibility of the workflow has been tested with four popular database search engines including Sequest, Mascot, X!Tandem, and MyriMatch. In summary, we have developed a workflow that effectively uses existing genomic data to enable variant peptide detection in proteomics. PMID:21389108
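For orientation, the conventional target-decoy estimate that such workflows start from is essentially a one-liner; the paper's modified estimator addresses the much higher false-positive risk of rare variant peptides, which this baseline does not.

```python
def target_decoy_fdr(target_scores, decoy_scores, threshold):
    """Standard target-decoy FDR at a score threshold: the number of decoy
    hits above the threshold estimates the number of false target hits.
    This is only the conventional baseline, not the modified variant-aware
    estimator described above."""
    t = sum(s >= threshold for s in target_scores)
    d = sum(s >= threshold for s in decoy_scores)
    return (d / t) if t else 0.0

# Toy usage with made-up search-engine scores
print(target_decoy_fdr([9.1, 8.4, 7.7, 5.2], [6.0, 4.8], threshold=5.0))
```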
Simulation analysis of operation respond in a field setting
DOT National Transportation Integrated Search
1997-01-01
The Operation Respond system aims to facilitate rapid access to transportation carrier databases containing information on hazardous material cargo. As a consequence, first responders, such as police and fire department personnel, are expected to res...
Transcriptome analysis and related databases of Lactococcus lactis.
Kuipers, Oscar P; de Jong, Anne; Baerends, Richard J S; van Hijum, Sacha A F T; Zomer, Aldert L; Karsens, Harma A; den Hengst, Chris D; Kramer, Naomi E; Buist, Girbe; Kok, Jan
2002-08-01
Several complete genome sequences of Lactococcus lactis and their annotations will become available in the near future, next to the already published genome sequence of L. lactis ssp. lactis IL 1403. This will allow intraspecies comparative genomics studies as well as functional genomics studies aimed at a better understanding of physiological processes and regulatory networks operating in lactococci. This paper describes the initial set-up of a DNA-microarray facility in our group, to enable transcriptome analysis of various Gram-positive bacteria, including a ssp. lactis and a ssp. cremoris strain of Lactococcus lactis. Moreover a global description will be given of the hardware and software requirements for such a set-up, highlighting the crucial integration of relevant bioinformatics tools and methods. This includes the development of MolGenIS, an information system for transcriptome data storage and retrieval, and LactococCye, a metabolic pathway/genome database of Lactococcus lactis.
Research on distributed heterogeneous data PCA algorithm based on cloud platform
NASA Astrophysics Data System (ADS)
Zhang, Jin; Huang, Gang
2018-05-01
Principal component analysis (PCA) of heterogeneous data sets can address the limited scalability of centralized data analysis. In order to reduce the generation of intermediate data and the error components of distributed heterogeneous data sets, a principal component analysis algorithm for heterogeneous data sets on a cloud platform is proposed. The algorithm performs eigenvalue processing using Householder tridiagonalization and QR factorization to calculate the error component of the heterogeneous database associated with the public key, obtaining the intermediate data set and the lost information. Experiments on distributed DBM heterogeneous data sets show that the method is feasible and reliable in terms of execution time and accuracy.
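Whatever the specifics of the proposed algorithm, the communication pattern of distributed PCA is easy to sketch: each site reduces its data block to sufficient statistics, and only those are combined and eigendecomposed centrally. (LAPACK's symmetric eigensolvers, as called by np.linalg.eigh, begin with exactly the Householder tridiagonalization mentioned above.) All names below are illustrative, not the paper's method.

```python
import numpy as np

def local_summary(X):
    """Each site reduces its block (samples x features) to sufficient
    statistics, so raw records never leave the site."""
    return X.shape[0], X.sum(axis=0), X.T @ X

def global_pca(summaries, n_components=2):
    """Combine per-site summaries into the global covariance and
    eigendecompose it."""
    n = sum(s[0] for s in summaries)
    total = sum(s[1] for s in summaries)
    gram = sum(s[2] for s in summaries)
    mean = total / n
    cov = (gram - n * np.outer(mean, mean)) / (n - 1)
    vals, vecs = np.linalg.eigh(cov)
    order = np.argsort(vals)[::-1][:n_components]
    return vals[order], vecs[:, order]

# Toy usage: two "sites" holding blocks of the same 5-feature data set
rng = np.random.default_rng(0)
blocks = [rng.normal(size=(100, 5)), rng.normal(size=(200, 5))]
print(global_pca([local_summary(b) for b in blocks])[0])
```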
Functional Analysis in Public Schools: A Summary of 90 Functional Analyses
ERIC Educational Resources Information Center
Mueller, Michael M.; Nkosi, Ajamu; Hine, Jeffrey F.
2011-01-01
Several review and epidemiological studies have been conducted over recent years to inform behavior analysts of functional analysis outcomes. None to date have closely examined demographic and clinical data for functional analyses conducted exclusively in public school settings. The current paper presents a data-based summary of 90 functional…
Lacagnina, Valerio; Leto-Barone, Maria S; La Piana, Simona; Seidita, Aurelio; Pingitore, Giuseppe; Di Lorenzo, Gabriele
2014-01-01
This article uses a logistic regression model for diagnostic decision making in patients with chronic nasal symptoms. We studied the ability of the logistic regression model, obtained by the evaluation of a database, to detect patients with a positive allergy skin-prick test (SPT) and patients with a negative SPT. The model developed was validated using a data set obtained from another medical institution. The analysis was performed using a database obtained from a questionnaire administered to patients with nasal symptoms, containing personal data, clinical data, and results of allergy testing (SPT). All variables found to be significantly different between patients with positive and negative SPT (p < 0.05) were selected for the logistic regression models and were analyzed with backward stepwise logistic regression, evaluated with the area under the receiver operating characteristic curve. A second set of patients from another institution was used to test the model. The accuracy of the model on this second set, in identifying both patients whose SPT would be positive and those whose SPT would be negative, was high. The model detected 96% of patients with nasal symptoms and a positive SPT and classified 94% of those with a negative SPT. This study is preliminary to the creation of software that could help primary care doctors in the diagnostic decision-making process (the need for allergy testing) in patients complaining of chronic nasal symptoms.
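The modeling-and-external-validation loop described above maps directly onto standard tooling; a compact sketch with scikit-learn is shown below. Variable pre-selection (p < 0.05, backward stepwise) is assumed to have happened upstream, and all names are placeholders.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def fit_and_validate(X_train, y_train, X_external, y_external):
    """Fit a logistic model on one institution's questionnaire data and
    validate it on a second institution's data set, scoring by ROC AUC.
    y = 1 for a positive skin-prick test, 0 for a negative one."""
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    auc = roc_auc_score(y_external, model.predict_proba(X_external)[:, 1])
    return model, auc
```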
Reducing Information Overload in Large Seismic Data Sets
DOE Office of Scientific and Technical Information (OSTI.GOV)
HAMPTON,JEFFERY W.; YOUNG,CHRISTOPHER J.; MERCHANT,BION J.
2000-08-02
Event catalogs for seismic data can become very large. Furthermore, as researchers collect multiple catalogs and reconcile them into a single catalog that is stored in a relational database, the reconciled set becomes even larger. The sheer number of these events makes searching for relevant events to compare with events of interest problematic. Information overload in this form can lead to the data sets being under-utilized and/or used incorrectly or inconsistently. Thus, efforts have been initiated to research techniques and strategies for helping researchers to make better use of large data sets. In this paper, the authors present their efforts to do so in two ways: (1) the Event Search Engine, which is a waveform correlation tool, and (2) some content analysis tools, which are a combination of custom-built and commercial off-the-shelf tools for accessing, managing, and querying seismic data stored in a relational database. The current Event Search Engine is based on a hierarchical clustering tool known as the dendrogram tool, which is written as a MatSeis graphical user interface. The dendrogram tool allows the user to build dendrogram diagrams for a set of waveforms by controlling phase windowing, down-sampling, filtering, enveloping, and the clustering method (e.g. single linkage, complete linkage, flexible method). It also allows the clustering to be based on two or more stations simultaneously, which is important to bridge gaps in the sparsely recorded event sets anticipated in such a large reconciled event set. Current efforts are focusing on tools to help the researcher winnow the clusters defined using the dendrogram tool down to the minimum optimal identification set. This will become critical as the number of reference events in the reconciled event set continually grows. The dendrogram tool is part of the MatSeis analysis package, which is available on the Nuclear Explosion Monitoring Research and Engineering Program Web Site. As part of the research into how to winnow the reference events in these large reconciled event sets, additional database query approaches have been developed to provide windows into these datasets. These custom-built content analysis tools help identify dataset characteristics that can potentially aid in providing a basis for comparing similar reference events in these large reconciled event sets. Once these characteristics can be identified, algorithms can be developed to create and add to the reduced set of events used by the Event Search Engine. These content analysis tools have already been useful in providing information on station coverage of the referenced events and basic statistical information on events in the research datasets. The tools can also provide researchers with a quick way to find interesting and useful events within the research datasets. The tools could also be used as a means to review reference event datasets as part of a dataset delivery verification process. There has also been an effort to explore the usefulness of commercially available web-based software to help with this problem. The advantages of using off-the-shelf software applications, such as Oracle's WebDB, to manipulate, customize and manage research data are being investigated. These types of applications are being examined to provide access to large integrated data sets for regional seismic research in Asia. All of these software tools would provide the researcher with unprecedented power without having to learn the intricacies and complexities of relational database systems.
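The dendrogram tool's core computation, hierarchical clustering of waveforms by correlation, can be reproduced in a few lines of SciPy, as sketched below; the real tool adds phase windowing, filtering, enveloping and multi-station combination on top of this.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import squareform

def waveform_dendrogram(waveforms, method="complete"):
    """Hierarchical clustering of event waveforms (rows) by correlation
    distance; `method` may be 'single', 'complete', etc."""
    corr = np.corrcoef(waveforms)       # pairwise waveform correlation
    dist = 1.0 - corr
    np.fill_diagonal(dist, 0.0)
    z = linkage(squareform(dist, checks=False), method=method)
    return dendrogram(z, no_plot=True)

# Toy example: two similar events and one outlier
rng = np.random.default_rng(1)
base = np.sin(np.linspace(0, 20, 500))
w = np.vstack([base + 0.1 * rng.normal(size=500),
               base + 0.1 * rng.normal(size=500),
               rng.normal(size=500)])
print(waveform_dendrogram(w)["ivl"])    # leaf order reflects the clustering
```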
Ravikumar, Komandur Elayavilli; Wagholikar, Kavishwar B; Li, Dingcheng; Kocher, Jean-Pierre; Liu, Hongfang
2015-06-06
Advances in next-generation sequencing technology have accelerated the pace of individualized medicine (IM), which aims to incorporate genetic/genomic information into medicine. One immediate need in interpreting sequencing data is the assembly of information about genetic variants and their corresponding associations with other entities (e.g., diseases or medications). Even with dedicated effort to capture such information in biological databases, much of it remains 'locked' in the unstructured text of biomedical publications. There is a substantial lag between publication and the subsequent abstraction of such information into databases. Multiple text mining systems have been developed, but most of them focus on sentence-level association extraction, with performance evaluation based on gold standard text annotations specifically prepared for text mining systems. We developed and evaluated a text mining system, MutD, which extracts protein mutation-disease associations from MEDLINE abstracts by incorporating discourse-level analysis, using a benchmark data set extracted from curated database records. MutD achieves an F-measure of 64.3% for reconstructing protein mutation disease associations in curated database records. The discourse-level analysis component of MutD contributed a gain of more than 10% in F-measure when compared against sentence-level association extraction. Our error analysis indicates that 23 of the 64 precision errors are true associations that were not captured by database curators and 68 of the 113 recall errors are caused by the absence of associated disease entities in the abstract. After adjusting for the defects in the curated database, the revised F-measure of MutD in association detection reaches 81.5%. Our quantitative analysis reveals that MutD can effectively extract protein mutation disease associations when benchmarking based on curated database records. The analysis also demonstrates that incorporating discourse-level analysis significantly improved the performance of extracting the protein-mutation-disease association. Future work includes the extension of MutD to full-text articles.
Tomato Expression Database (TED): a suite of data presentation and analysis tools.
Fei, Zhangjun; Tang, Xuemei; Alba, Rob; Giovannoni, James
2006-01-01
The Tomato Expression Database (TED) includes three integrated components. The Tomato Microarray Data Warehouse serves as a central repository for raw gene expression data derived from the public tomato cDNA microarray. In addition to expression data, TED stores experimental design and array information in compliance with the MIAME guidelines and provides web interfaces for researchers to retrieve data for their own analysis and use. The Tomato Microarray Expression Database contains normalized and processed microarray data for ten time points with nine pair-wise comparisons during fruit development and ripening in a normal tomato variety and nearly isogenic single gene mutants impacting fruit development and ripening. Finally, the Tomato Digital Expression Database contains raw and normalized digital expression (EST abundance) data derived from analysis of the complete public tomato EST collection containing >150,000 ESTs derived from 27 different non-normalized EST libraries. This last component also includes tools for the comparison of tomato and Arabidopsis digital expression data. A set of query interfaces and analysis, and visualization tools have been developed and incorporated into TED, which aid users in identifying and deciphering biologically important information from our datasets. TED can be accessed at http://ted.bti.cornell.edu.
Le, T Hoang Ngan; Luu, Khoa; Savvides, Marios
2013-08-01
Robust facial hair detection and segmentation is a highly valued soft biometric attribute for carrying out forensic facial analysis. In this paper, we propose a novel and fully automatic system, called SparCLeS, for beard/moustache detection and segmentation in challenging facial images. SparCLeS uses the multiscale self-quotient (MSQ) algorithm to preprocess facial images and deal with illumination variation. Histogram of oriented gradients (HOG) features are extracted from the preprocessed images and a dynamic sparse classifier is built using these features to classify a facial region as either containing skin or facial hair. A level set based approach, which makes use of the advantages of both global and local information, is then used to segment the regions of a face containing facial hair. Experimental results demonstrate the effectiveness of our proposed system in detecting and segmenting facial hair regions in images drawn from three databases, i.e., the NIST Multiple Biometric Grand Challenge (MBGC) still face database, the NIST Color Facial Recognition Technology FERET database, and the Labeled Faces in the Wild (LFW) database.
Lepoivre, Cyrille; Bergon, Aurélie; Lopez, Fabrice; Perumal, Narayanan B; Nguyen, Catherine; Imbert, Jean; Puthier, Denis
2012-01-31
Deciphering gene regulatory networks by in silico approaches is a crucial step in the study of the molecular perturbations that occur in diseases. The development of regulatory maps is a tedious process requiring the comprehensive integration of various lines of evidence scattered over biological databases. Thus, the research community would greatly benefit from having a unified database storing known and predicted molecular interactions. Furthermore, given the intrinsic complexity of the data, the development of new tools offering integrated and meaningful visualizations of molecular interactions is necessary to help users draw new hypotheses without being overwhelmed by the density of the resulting graph. We extend the previously developed TranscriptomeBrowser database with a set of tables containing 1,594,978 human and mouse molecular interactions. The database includes: (i) predicted regulatory interactions (computed by scanning vertebrate alignments with a set of 1,213 position weight matrices), (ii) potential regulatory interactions inferred from systematic analysis of ChIP-seq experiments, (iii) regulatory interactions curated from the literature, (iv) predicted post-transcriptional regulation by micro-RNA, (v) protein kinase-substrate interactions and (vi) physical protein-protein interactions. In order to easily retrieve and efficiently analyze these interactions, we developed InteractomeBrowser, a graph-based knowledge browser that comes as a plug-in for TranscriptomeBrowser. The first objective of InteractomeBrowser is to provide a user-friendly tool to get new insight into any gene list by providing a context-specific display of putative regulatory and physical interactions. To achieve this, InteractomeBrowser relies on a "cell compartments-based layout" that makes use of a subset of the Gene Ontology to map gene products onto relevant cell compartments. This layout is particularly powerful for visual integration of heterogeneous biological information and is a productive avenue in generating new hypotheses. The second objective of InteractomeBrowser is to fill the gap between interaction databases and dynamic modeling. It is thus compatible with the network analysis software Cytoscape and with the Gene Interaction Network simulation software (GINsim). We provide examples illustrating the benefits of this visualization tool for large gene set analysis related to thymocyte differentiation. The InteractomeBrowser plugin is a powerful tool to get quick access to a knowledge database that includes both predicted and validated molecular interactions. InteractomeBrowser is available through the TranscriptomeBrowser framework and can be found at: http://tagc.univ-mrs.fr/tbrowser/. Our database is updated on a regular basis.
NASA Technical Reports Server (NTRS)
Ho, C. Y.; Li, H. H.
1989-01-01
A computerized comprehensive numerical database system on the mechanical, thermophysical, electronic, electrical, magnetic, optical, and other properties of various types of technologically important materials such as metals, alloys, composites, dielectrics, polymers, and ceramics has been established and is operational at the Center for Information and Numerical Data Analysis and Synthesis (CINDAS) of Purdue University. This is an on-line, interactive, menu-driven, user-friendly database system. Users can easily search, retrieve, and manipulate the data from the database system without learning a special query language, special commands, or standardized names of materials, properties, variables, etc. It enables both the direct mode of search/retrieval of data for specified materials, properties, independent variables, etc., and the inverted mode of search/retrieval of candidate materials that meet a set of specified requirements (which is computer-aided materials selection). It also enables tabular and graphical displays and on-line data manipulations such as units conversion, variables transformation, statistical analysis, etc., of the retrieved data. The development, content, accessibility, etc., of the database system are presented and discussed.
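The "inverted mode" described above is essentially a filter of a property table against range constraints. The following Python sketch illustrates the idea with hypothetical material records and property names, not data from the CINDAS system:

    # Inverted-mode search sketch: given property constraints, return candidate
    # materials (computer-aided materials selection). Records are hypothetical.
    materials = [
        {"name": "Alloy A", "density_g_cm3": 2.8, "thermal_cond_W_mK": 170.0, "max_temp_C": 300},
        {"name": "Ceramic B", "density_g_cm3": 3.9, "thermal_cond_W_mK": 30.0, "max_temp_C": 1500},
        {"name": "Polymer C", "density_g_cm3": 1.2, "thermal_cond_W_mK": 0.3, "max_temp_C": 120},
    ]

    def select_candidates(records, constraints):
        """Return names of materials whose properties satisfy every (min, max) range."""
        hits = []
        for rec in records:
            if all(lo <= rec.get(prop, float("nan")) <= hi
                   for prop, (lo, hi) in constraints.items()):
                hits.append(rec["name"])
        return hits

    # Light-weight, reasonably conductive candidates:
    print(select_candidates(materials, {"density_g_cm3": (0.0, 3.0),
                                        "thermal_cond_W_mK": (50.0, 500.0)}))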
Mallik, Saurav; Maulik, Ujjwal
2015-10-01
Gene ranking is an important problem in bioinformatics. Here, we propose a new framework for ranking biomolecules (viz., miRNAs, transcription factors/TFs and genes) in a multi-informative uterine leiomyoma dataset having both gene expression and methylation data, using a (statistical) eigenvector-centrality-based approach. First, genes that are both differentially expressed and methylated are identified using the Limma statistical test. A network comprising these genes, corresponding TFs from the TRANSFAC and ITFP databases, and targeting miRNAs from the miRWalk database is then built. The biomolecules are then ranked based on eigenvector centrality. Our proposed method provides better average accuracy in hub gene and non-hub gene classification than other methods. Furthermore, pre-ranked gene set enrichment analysis is applied to the pathway and GO-term databases of the Molecular Signatures Database, providing a pre-ranked gene list based on different centrality values to compare the ranking methods. Finally, top novel potential gene markers for uterine leiomyoma are provided. Copyright © 2015 Elsevier Inc. All rights reserved.
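A small sketch of the centrality step described above, using networkx on a toy network of hypothetical regulators and targets (not data from the study):

    # Rank nodes of a small regulator/target network by eigenvector centrality.
    import networkx as nx

    G = nx.Graph()
    G.add_edges_from([
        ("miR-1", "GENE_A"), ("miR-1", "GENE_B"), ("TF_X", "GENE_A"),
        ("TF_X", "GENE_C"), ("GENE_A", "GENE_B"), ("GENE_C", "GENE_B"),
    ])

    centrality = nx.eigenvector_centrality(G)  # power iteration on the adjacency matrix
    for node, score in sorted(centrality.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{node}\t{score:.3f}")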
Austvoll-Dahlgren, Astrid; Guttersrud, Øystein; Nsangi, Allen; Semakula, Daniel; Oxman, Andrew D
2017-01-01
Background The Claim Evaluation Tools database contains multiple-choice items for measuring people’s ability to apply the key concepts they need to know to be able to assess treatment claims. We assessed items from the database using Rasch analysis to develop an outcome measure to be used in two randomised trials in Uganda. Rasch analysis is a form of psychometric testing relying on Item Response Theory. It is a dynamic way of developing outcome measures that are valid and reliable. Objectives To assess the validity, reliability and responsiveness of 88 items addressing 22 key concepts using Rasch analysis. Participants We administered four sets of multiple-choice items in English to 1114 people in Uganda and Norway, of whom 685 were children and 429 were adults (including 171 health professionals). We scored all items dichotomously. We explored summary and individual fit statistics using the RUMM2030 analysis package. We used SPSS to perform distractor analysis. Results Most items conformed well to the Rasch model, but some items needed revision. Overall, the four item sets had satisfactory reliability. We did not identify significant response dependence between any pairs of items and, overall, the magnitude of multidimensionality in the data was acceptable. The items had a high level of difficulty. Conclusion Most of the items conformed well to the Rasch model’s expectations. Following revision of some items, we concluded that most of the items were suitable for use in an outcome measure for evaluating the ability of children or adults to assess treatment claims. PMID:28550019
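For reference, the dichotomous Rasch model underlying this kind of analysis expresses the probability of a correct answer as a logistic function of the difference between person ability theta and item difficulty b. A minimal sketch with illustrative values, not estimates from the study:

    # Dichotomous Rasch model: P(correct) = exp(theta - b) / (1 + exp(theta - b)).
    import math

    def rasch_p(theta, b):
        return 1.0 / (1.0 + math.exp(-(theta - b)))

    for b in (-1.0, 0.0, 1.5):  # easy, medium and hard items
        print(f"item difficulty {b:+.1f}: P(correct) = {rasch_p(theta=0.5, b=b):.2f}")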
Spectral signature verification using statistical analysis and text mining
NASA Astrophysics Data System (ADS)
DeCoster, Mallory E.; Firpi, Alexe H.; Jacobs, Samantha K.; Cone, Shelli R.; Tzeng, Nigel H.; Rodriguez, Benjamin M.
2016-05-01
In the spectral science community, numerous spectral signatures are stored in databases representative of many sample materials collected from a variety of spectrometers and spectroscopists. Due to the variety and variability of the spectra that comprise many spectral databases, it is necessary to establish a metric for validating the quality of spectral signatures. This has been an area of great discussion and debate in the spectral science community. This paper discusses a method that independently validates two different aspects of a spectral signature to arrive at a final qualitative assessment: the textual meta-data and the numerical spectral data. Results associated with the spectral data stored in the Signature Database (SigDB) are proposed. The numerical data comprising a sample material's spectrum is validated based on statistical properties derived from an ideal population set. The quality of the test spectrum is ranked based on a spectral angle mapper (SAM) comparison to the mean spectrum derived from the population set. Additionally, the contextual data of a test spectrum is qualitatively analyzed using lexical analysis text mining. This technique analyzes the syntax of the meta-data to uncover local learning patterns and trends within the spectral data, indicative of the test spectrum's quality. Text mining has successfully been applied in security (text encryption/decryption), biomedical, and marketing applications. The text mining lexical analysis algorithm is trained on the meta-data patterns of a subset of high and low quality spectra, in order to have a model to apply to the entire SigDB data set. The statistical and textual methods combine to assess the quality of a test spectrum existing in a database without the need for an expert user. This method has been compared to other validation methods accepted by the spectral science community, and has provided promising results when a baseline spectral signature is present for comparison. The spectral validation method proposed is described from a practical application and analytical perspective.
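A minimal sketch of the spectral angle mapper (SAM) comparison mentioned above, scoring a test spectrum against the mean of a population set; the spectra are synthetic stand-ins, not SigDB data:

    # SAM: the angle between two spectra treated as vectors; smaller is better.
    import numpy as np

    def spectral_angle(a, b):
        cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
        return np.arccos(np.clip(cos, -1.0, 1.0))

    rng = np.random.default_rng(0)
    population = rng.normal(1.0, 0.02, size=(50, 200))  # 50 reference spectra, 200 bands
    mean_spectrum = population.mean(axis=0)

    good = mean_spectrum + rng.normal(0.0, 0.02, 200)
    bad = mean_spectrum + np.linspace(0.0, 0.5, 200)    # drifted, low-quality spectrum

    print(f"SAM(good) = {spectral_angle(good, mean_spectrum):.4f} rad")
    print(f"SAM(bad)  = {spectral_angle(bad, mean_spectrum):.4f} rad")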
Atlantic Ocean CARINA data: overview and salinity adjustments
NASA Astrophysics Data System (ADS)
Tanhua, T.; Steinfeldt, R.; Key, R. M.; Brown, P.; Gruber, N.; Wanninkhof, R.; Perez, F.; Körtzinger, A.; Velo, A.; Schuster, U.; van Heuven, S.; Bullister, J. L.; Stendardo, I.; Hoppema, M.; Olsen, A.; Kozyr, A.; Pierrot, D.; Schirnick, C.; Wallace, D. W. R.
2009-08-01
Water column data of carbon and carbon-relevant hydrographic and hydrochemical parameters from 188 previously non-publicly available cruise data sets in the Arctic, Atlantic and Southern Ocean have been retrieved and merged into a new database: CARINA (CARbon IN the Atlantic). The data have gone through rigorous quality control procedures to assure the highest possible quality and consistency. The data for the pertinent parameters in the CARINA database were objectively examined in order to quantify systematic differences in the reported values, i.e. secondary quality control. Systematic biases found in the data have been corrected in the data products, i.e. three merged data files with measured, calculated and interpolated data for each of the three CARINA regions, i.e. Arctic, Atlantic and Southern Ocean. Ninety-eight of the cruises in the CARINA database were conducted in the Atlantic Ocean, defined here as the region south of the Greenland-Iceland-Scotland Ridge and north of about 30° S. Here we present an overview of the Atlantic Ocean synthesis of the CARINA data and the adjustments that were applied to the data product. We also report details of the secondary QC for salinity for this data set. Procedures of quality control - including crossover analysis between stations and inversion analysis of all crossover data - are briefly described. Adjustments to salinity measurements were applied to the data from 10 cruises in the Atlantic Ocean region. Based on our analysis we estimate the internal accuracy of the CARINA-ATL salinity data to be 4.1 ppm. With these adjustments the CARINA database is consistent both internally as well as with GLODAP data, an oceanographic data set based on the World Hydrographic Program in the 1990s (Key et al., 2004), and is now suitable for accurate assessments of, for example, oceanic carbon inventories and uptake rates and for model validation.
ProtaBank: A repository for protein design and engineering data.
Wang, Connie Y; Chang, Paul M; Ary, Marie L; Allen, Benjamin D; Chica, Roberto A; Mayo, Stephen L; Olafson, Barry D
2018-03-25
We present ProtaBank, a repository for storing, querying, analyzing, and sharing protein design and engineering data in an actively maintained and updated database. ProtaBank provides a format to describe and compare all types of protein mutational data, spanning a wide range of properties and techniques. It features a user-friendly web interface and programming layer that streamlines data deposition and allows for batch input and queries. The database schema design incorporates a standard format for reporting protein sequences and experimental data that facilitates comparison of results across different data sets. A suite of analysis and visualization tools is provided to facilitate discovery, to guide future designs, and to benchmark and train new predictive tools and algorithms. ProtaBank will provide a valuable resource to the protein engineering community by storing and safeguarding newly generated data, allowing for fast searching and identification of relevant data from the existing literature, and exploring correlations between disparate data sets. ProtaBank invites researchers to contribute data to the database to make it accessible for search and analysis. ProtaBank is available at https://protabank.org. © 2018 The Authors Protein Science published by Wiley Periodicals, Inc. on behalf of The Protein Society.
NONATObase: a database for Polychaeta (Annelida) from the Southwestern Atlantic Ocean.
Pagliosa, Paulo R; Doria, João G; Misturini, Dairana; Otegui, Mariana B P; Oortman, Mariana S; Weis, Wilson A; Faroni-Perez, Larisse; Alves, Alexandre P; Camargo, Maurício G; Amaral, A Cecília Z; Marques, Antonio C; Lana, Paulo C
2014-01-01
Networks can greatly advance data sharing attitudes by providing organized and useful data sets on marine biodiversity in a friendly and shared scientific environment. NONATObase, the interactive database on polychaetes presented herein, will provide new macroecological and taxonomic insights into the Southwestern Atlantic region. The database was developed by the NONATO network, a team of South American researchers, who integrated available information on polychaetes from between 5°N and 80°S in the Atlantic Ocean and near the Antarctic. The guiding principle of the database is to keep free and open access to data based on partnerships. Its architecture consists of a relational database integrated in the MySQL and PHP framework. Its web application allows access to the data from three different directions: species (qualitative data), abundance (quantitative data) and data set (reference data). The database has built-in functionality, such as the filter of data on user-defined taxonomic levels, characteristics of site, sample, sampler, and mesh size used. Considering that there are still many taxonomic issues related to poorly known regional fauna, a scientific committee was created to work out consistent solutions to current misidentifications and the equivocal taxonomic status of some species. Expertise from this committee will be incorporated into NONATObase continually. The use of quantitative data was made possible by the standardization of a sample unit. All data, maps of distribution and references from a data set or a specified query can be visualized and exported to a commonly used data format in statistical analysis or reference manager software. The NONATO network has begun with NONATObase, a valuable resource for marine ecologists and taxonomists. The database is expected to grow in functionality as it proves useful, particularly regarding the challenges of dealing with molecular genetic data and tools to assess the effects of global environmental change. Database URL: http://nonatobase.ufsc.br/.
NONATObase: a database for Polychaeta (Annelida) from the Southwestern Atlantic Ocean
Pagliosa, Paulo R.; Doria, João G.; Misturini, Dairana; Otegui, Mariana B. P.; Oortman, Mariana S.; Weis, Wilson A.; Faroni-Perez, Larisse; Alves, Alexandre P.; Camargo, Maurício G.; Amaral, A. Cecília Z.; Marques, Antonio C.; Lana, Paulo C.
2014-01-01
Networks can greatly advance data sharing attitudes by providing organized and useful data sets on marine biodiversity in a friendly and shared scientific environment. NONATObase, the interactive database on polychaetes presented herein, will provide new macroecological and taxonomic insights into the Southwestern Atlantic region. The database was developed by the NONATO network, a team of South American researchers, who integrated available information on polychaetes from between 5°N and 80°S in the Atlantic Ocean and near the Antarctic. The guiding principle of the database is to keep free and open access to data based on partnerships. Its architecture consists of a relational database integrated in the MySQL and PHP framework. Its web application allows access to the data from three different directions: species (qualitative data), abundance (quantitative data) and data set (reference data). The database has built-in functionality, such as the filter of data on user-defined taxonomic levels, characteristics of site, sample, sampler, and mesh size used. Considering that there are still many taxonomic issues related to poorly known regional fauna, a scientific committee was created to work out consistent solutions to current misidentifications and the equivocal taxonomic status of some species. Expertise from this committee will be incorporated into NONATObase continually. The use of quantitative data was made possible by the standardization of a sample unit. All data, maps of distribution and references from a data set or a specified query can be visualized and exported to a commonly used data format in statistical analysis or reference manager software. The NONATO network has begun with NONATObase, a valuable resource for marine ecologists and taxonomists. The database is expected to grow in functionality as it proves useful, particularly regarding the challenges of dealing with molecular genetic data and tools to assess the effects of global environmental change. Database URL: http://nonatobase.ufsc.br/ PMID:24573879
The LANL hemorrhagic fever virus database, a new platform for analyzing biothreat viruses.
Kuiken, Carla; Thurmond, Jim; Dimitrijevic, Mira; Yoon, Hyejin
2012-01-01
Hemorrhagic fever viruses (HFVs) are a diverse set of over 80 viral species, found in 10 different genera comprising five different families: arena-, bunya-, flavi-, filo- and togaviridae. All these viruses are highly variable and evolve rapidly, making them elusive targets for the immune system and for vaccine and drug design. About 55,000 HFV sequences exist in the public domain today. A central website that provides annotated sequences and analysis tools will be helpful to HFV researchers worldwide. The HFV sequence database collects and stores sequence data and provides a user-friendly search interface and a large number of sequence analysis tools, following the model of the highly regarded and widely used Los Alamos HIV database [Kuiken, C., B. Korber, and R.W. Shafer, HIV sequence databases. AIDS Rev, 2003. 5: p. 52-61]. The database uses an algorithm that aligns each sequence to a species-wide reference sequence. The NCBI RefSeq database [Sayers et al. (2011) Database resources of the National Center for Biotechnology Information. Nucleic Acids Res., 39, D38-D51.] is used for this; if a reference sequence is not available, a Blast search finds the best candidate. Using this method, sequences in each genus can be retrieved pre-aligned. The HFV website can be accessed via http://hfv.lanl.gov.
MIPS: analysis and annotation of proteins from whole genomes in 2005
Mewes, H. W.; Frishman, D.; Mayer, K. F. X.; Münsterkötter, M.; Noubibou, O.; Pagel, P.; Rattei, T.; Oesterheld, M.; Ruepp, A.; Stümpflen, V.
2006-01-01
The Munich Information Center for Protein Sequences (MIPS at the GSF), Neuherberg, Germany, provides resources related to genome information. Manually curated databases for several reference organisms are maintained. Several of these databases are described elsewhere in this and other recent NAR database issues. In a complementary effort, a comprehensive set of >400 genomes automatically annotated with the PEDANT system is maintained. The main goal of our current work on creating and maintaining genome databases is to extend gene-centered information to information on interactions within a generic comprehensive framework. We have concentrated our efforts along three lines: (i) the development of suitable comprehensive data structures and database technology, communication and query tools to include a wide range of different types of information, enabling the representation of complex information such as functional modules or networks (Genome Research Environment System); (ii) the development of databases covering computable information such as the basic evolutionary relations among all genes, namely SIMAP, the sequence similarity matrix, and the CABiNet network analysis framework; and (iii) the compilation and manual annotation of information related to interactions such as protein–protein interactions or other types of relations (e.g. MPCDB, MPPI, CYGD). All databases described and the detailed descriptions of our projects can be accessed through the MIPS WWW server (http://mips.gsf.de). PMID:16381839
MIPS: analysis and annotation of proteins from whole genomes in 2005.
Mewes, H W; Frishman, D; Mayer, K F X; Münsterkötter, M; Noubibou, O; Pagel, P; Rattei, T; Oesterheld, M; Ruepp, A; Stümpflen, V
2006-01-01
The Munich Information Center for Protein Sequences (MIPS at the GSF), Neuherberg, Germany, provides resources related to genome information. Manually curated databases for several reference organisms are maintained. Several of these databases are described elsewhere in this and other recent NAR database issues. In a complementary effort, a comprehensive set of >400 genomes automatically annotated with the PEDANT system is maintained. The main goal of our current work on creating and maintaining genome databases is to extend gene-centered information to information on interactions within a generic comprehensive framework. We have concentrated our efforts along three lines: (i) the development of suitable comprehensive data structures and database technology, communication and query tools to include a wide range of different types of information, enabling the representation of complex information such as functional modules or networks (Genome Research Environment System); (ii) the development of databases covering computable information such as the basic evolutionary relations among all genes, namely SIMAP, the sequence similarity matrix, and the CABiNet network analysis framework; and (iii) the compilation and manual annotation of information related to interactions such as protein-protein interactions or other types of relations (e.g. MPCDB, MPPI, CYGD). All databases described and the detailed descriptions of our projects can be accessed through the MIPS WWW server (http://mips.gsf.de).
Habermann, Bianca; Bebin, Anne-Gaelle; Herklotz, Stephan; Volkmer, Michael; Eckelt, Kay; Pehlke, Kerstin; Epperlein, Hans Henning; Schackert, Hans Konrad; Wiebe, Glenis; Tanaka, Elly M
2004-01-01
Background The ambystomatid salamander, Ambystoma mexicanum (axolotl), is an important model organism in evolutionary and regeneration research but relatively little sequence information has so far been available. This is a major limitation for molecular studies on caudate development, regeneration and evolution. To address this lack of sequence information we have generated an expressed sequence tag (EST) database for A. mexicanum. Results Two cDNA libraries, one made from stage 18-22 embryos and the other from day-6 regenerating tail blastemas, generated 17,352 sequences. From the sequenced ESTs, 6,377 contigs were assembled that probably represent 25% of the expressed genes in this organism. Sequence comparison revealed significant homology to entries in the NCBI non-redundant database. Further examination of this gene set revealed the presence of genes involved in important cell and developmental processes, including cell proliferation, cell differentiation and cell-cell communication. On the basis of these data, we have performed phylogenetic analysis of key cell-cycle regulators. Interestingly, while cell-cycle proteins such as the cyclin B family display expected evolutionary relationships, the cyclin-dependent kinase inhibitor 1 gene family shows an unusual evolutionary behavior among the amphibians. Conclusions Our analysis reveals the importance of a comprehensive sequence set from a representative of the Caudata and illustrates that the EST sequence database is a rich resource for molecular, developmental and regeneration studies. To aid in data mining, the ESTs have been organized into an easily searchable database that is freely available online. PMID:15345051
NASA Astrophysics Data System (ADS)
Barette, Florian; Poppe, Sam; Smets, Benoît; Benbakkar, Mhammed; Kervyn, Matthieu
2017-10-01
We present an integrated, spatially-explicit database of existing geochemical major-element analyses available from (post-) colonial scientific reports, PhD Theses and international publications for the Virunga Volcanic Province, located in the western branch of the East African Rift System. This volcanic province is characterised by alkaline volcanism, including silica-undersaturated, alkaline and potassic lavas. The database contains a total of 908 geochemical analyses of eruptive rocks for the entire volcanic province with a localisation for most samples. A preliminary analysis of the overall consistency of the database, using statistical techniques on sets of geochemical analyses with contrasted analytical methods or dates, demonstrates that the database is consistent. We applied a principal component analysis and cluster analysis on whole-rock major element compositions included in the database to study the spatial variation of the chemical composition of eruptive products in the Virunga Volcanic Province. These statistical analyses identify spatially distributed clusters of eruptive products. The known geochemical contrasts are highlighted by the spatial analysis, such as the unique geochemical signature of Nyiragongo lavas compared to other Virunga lavas, the geochemical heterogeneity of the Bulengo area, and the trachyte flows of Karisimbi volcano. Most importantly, we identified separate clusters of eruptive products which originate from primitive magmatic sources. These lavas of primitive composition are preferentially located along NE-SW inherited rift structures, often at distance from the central Virunga volcanoes. Our results illustrate the relevance of a spatial analysis on integrated geochemical data for a volcanic province, as a complement to classical petrological investigations. This approach indeed helps to characterise geochemical variations within a complex of magmatic systems and to identify specific petrologic and geochemical investigations that should be tackled within a study area.
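A compact sketch of the statistical workflow described above: PCA followed by clustering of whole-rock major-element compositions. The compositions are synthetic, not Virunga data, and scikit-learn is assumed:

    # PCA + k-means on synthetic major-element data (wt% SiO2, MgO, K2O).
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(1)
    primitive = rng.normal([45.0, 8.0, 2.0], 0.5, size=(40, 3))
    evolved = rng.normal([60.0, 2.0, 4.0], 0.5, size=(40, 3))
    X = StandardScaler().fit_transform(np.vstack([primitive, evolved]))

    scores = PCA(n_components=2).fit_transform(X)       # principal component scores
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(scores)
    print("cluster sizes:", np.bincount(labels))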
Zhu, Zhikai; Su, Xiaomeng; Go, Eden P; Desaire, Heather
2014-09-16
Glycoproteins are biologically significant large molecules that participate in numerous cellular activities. In order to obtain site-specific protein glycosylation information, intact glycopeptides, with the glycan attached to the peptide sequence, are characterized by tandem mass spectrometry (MS/MS) methods such as collision-induced dissociation (CID) and electron transfer dissociation (ETD). While several automated tools are emerging, there is no consensus in the field about the best way to determine the reliability of the tools and/or provide the false discovery rate (FDR). A common approach to calculate FDRs for glycopeptide analysis, adopted from the target-decoy strategy in proteomics, employs a decoy database that is created based on the target protein sequence database. Nonetheless, this approach is not optimal in measuring the confidence of N-linked glycopeptide matches, because the glycopeptide data set is considerably smaller compared to that of peptides, and the requirement of a consensus sequence for N-glycosylation further limits the number of possible decoy glycopeptides tested in a database search. To address the need to accurately determine FDRs for automated glycopeptide assignments, we developed GlycoPep Evaluator (GPE), a tool that helps to measure FDRs in identifying glycopeptides without using a decoy database. GPE generates decoy glycopeptides de novo for every target glycopeptide, in a 1:20 target-to-decoy ratio. The decoys, along with target glycopeptides, are scored against the ETD data, from which FDRs can be calculated accurately, based on the number of decoy matches and the ratio of the number of targets to decoys, even for small data sets. GPE is freely accessible for download and can work with any search engine that interprets ETD data of N-linked glycopeptides. The software is provided at https://desairegroup.ku.edu/research.
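The decoy-based estimate described above reduces to simple arithmetic: with 20 decoys generated per target, the expected number of false target matches is the decoy hit count divided by 20. A minimal sketch with illustrative counts, not GPE's implementation:

    # FDR estimate from decoy matches at a 1:20 target-to-decoy ratio.
    def fdr(target_hits, decoy_hits, decoys_per_target=20):
        return (decoy_hits / decoys_per_target) / target_hits if target_hits else 0.0

    # 40 decoy hits imply ~2 false target matches among 100 accepted targets.
    print(f"estimated FDR = {fdr(target_hits=100, decoy_hits=40):.3f}")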
Encryption Characteristics of Two USB-based Personal Health Record Devices
Wright, Adam; Sittig, Dean F.
2007-01-01
Personal health records (PHRs) hold great promise for empowering patients and increasing the accuracy and completeness of health information. We reviewed two small USB-based PHR devices that allow a patient to easily store and transport their personal health information. Both devices offer password protection and encryption features. Analysis of the devices shows that they store their data in a Microsoft Access database. Due to a flaw in the encryption of this database, recovering the user’s password can be accomplished with minimal effort. Our analysis also showed that, rather than encrypting health information with the password chosen by the user, the devices stored the user’s password as a string in the database and then encrypted that database with a common password set by the manufacturer. This is another serious vulnerability. This article describes the weaknesses we discovered, outlines three critical flaws with the security model used by the devices, and recommends four guidelines for improving the security of similar devices. PMID:17460132
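One standard remedy for the flaw described above is to derive the encryption key from the user's password rather than storing the password and encrypting with a manufacturer-wide key. A minimal sketch using PBKDF2; the parameters are illustrative, not a vendor recommendation:

    # Derive a per-user key from the password; store only the random salt.
    import hashlib
    import os

    def derive_key(password: str, salt: bytes, iterations: int = 200_000) -> bytes:
        return hashlib.pbkdf2_hmac("sha256", password.encode(), salt, iterations)

    salt = os.urandom(16)                 # stored alongside the ciphertext
    key = derive_key("correct horse battery", salt)
    print(key.hex())                      # key material for encrypting the database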
Mining functionally relevant gene sets for analyzing physiologically novel clinical expression data.
Turcan, Sevin; Vetter, Douglas E; Maron, Jill L; Wei, Xintao; Slonim, Donna K
2011-01-01
Gene set analyses have become a standard approach for increasing the sensitivity of transcriptomic studies. However, analytical methods incorporating gene sets require the availability of pre-defined gene sets relevant to the underlying physiology being studied. For novel physiological problems, relevant gene sets may be unavailable or existing gene set databases may bias the results towards only the best-studied of the relevant biological processes. We describe a successful attempt to mine novel functional gene sets for translational projects where the underlying physiology is not necessarily well characterized in existing annotation databases. We choose targeted training data from public expression data repositories and define new criteria for selecting biclusters to serve as candidate gene sets. Many of the discovered gene sets show little or no enrichment for informative Gene Ontology terms or other functional annotation. However, we observe that such gene sets show coherent differential expression in new clinical test data sets, even if derived from different species, tissues, and disease states. We demonstrate the efficacy of this method on a human metabolic data set, where we discover novel, uncharacterized gene sets that are diagnostic of diabetes, and on additional data sets related to neuronal processes and human development. Our results suggest that our approach may be an efficient way to generate a collection of gene sets relevant to the analysis of data for novel clinical applications where existing functional annotation is relatively incomplete.
Andrejašič, Miha; Pražnikar, Jure; Turk, Dušan
2008-11-01
The number and variety of macromolecular structures in complex with 'hetero' ligands is growing. The need for rapid delivery of correct geometric parameters for their refinement, which is often crucial for understanding the biological relevance of the structure, is growing correspondingly. The current standard for describing protein structures is the Engh-Huber parameter set. It is an expert data set resulting from selection and analysis of the crystal structures gathered in the Cambridge Structural Database (CSD). Clearly, such a manual approach cannot be applied to the vast and ever-growing number of chemical compounds. Therefore, a database, named PURY, of geometric parameters of chemical compounds has been developed, together with a server that accesses it. PURY is a compilation of the whole CSD. It contains lists of atom classes and bonds connecting them, as well as angle, chirality, planarity and conformation parameters. The current compilation is based on CSD 5.28 and contains 1978 atom classes and 32,702 bonding, 237,068 angle, 201,860 dihedral and 64,193 improper geometric restraints. Analysis has confirmed that the restraints from the PURY database are suitable for use in macromolecular crystal structure refinement and should be of value to the crystallographic community. The database can be accessed through the web server http://pury.ijs.si/, which creates topology and parameter files from deposited coordinates in suitable forms for the refinement programs MAIN, CNS and REFMAC. In the near future, the server will move to the CSD website http://pury.ccdc.cam.ac.uk/.
Analysis of Landslide Hazard Impact Using the Landslide Database for Germany
NASA Astrophysics Data System (ADS)
Klose, M.; Damm, B.
2014-12-01
The Federal Republic of Germany has long been among the few European countries that lack a national landslide database. Systematic collection and inventorying of landslide data nevertheless has a comprehensive research history in Germany, albeit one focused on developing databases with local or regional coverage. This has changed in recent years with the launch of a database initiative aimed at closing the data gap existing at the national level. The present contribution reports on this project, which builds on a landslide database that has evolved over the last 15 years into one covering large parts of Germany. A strategy of systematic retrieval, extraction, and fusion of landslide data is at the heart of the methodology, providing the basis for a database with broad potential for application. The database offers a data pool of more than 4,200 landslide data sets with over 13,000 single data files and dates back to the 12th century. All types of landslides are covered by the database, which stores not only core attributes but also various complementary data, including data on landslide causes, impacts, and mitigation. The current database migration to PostgreSQL/PostGIS is focused on unlocking the full scientific potential of the database, while enabling data sharing and knowledge transfer via a web GIS platform. In this contribution, the goals and the research strategy of the database project are highlighted first, with a summary of best practices in database development providing perspective. Next, the focus is on key aspects of the methodology, which is followed by the results of different case studies in the German Central Uplands. The case study results exemplify database application in the analysis of vulnerability to landslides, impact statistics, and hazard or cost modeling.
[Review of digital ground object spectral library].
Zhou, Xiao-Hu; Zhou, Ding-Wu
2009-06-01
Higher spectral resolution is a main direction in the development of remote sensing technology, and establishing digital ground object reflectance spectral database libraries is one of the fundamental research fields in remote sensing application. Remote sensing applications have come to rely increasingly on ground object spectral characteristics, and quantitative analysis has developed to a new stage. The present article summarizes and systematically introduces the research status quo and development trend of digital ground object reflectance spectral libraries at home and abroad in recent years. The spectral libraries that have been established are introduced, including the desertification spectral database library, plant spectral database library, geological spectral database library, soil spectral database library, mineral spectral database library, cloud spectral database library, snow spectral database library, atmosphere spectral database library, rock spectral database library, water spectral database library, meteorite spectral database library, moon rock spectral database library, man-made materials spectral database library, mixture spectral database library, volatile compounds spectral database library, and liquids spectral database library. In the process of establishing these spectral database libraries, problems have arisen, such as the lack of a uniform national spectral database standard, the lack of uniform standards for ground object features, limited comparability between different databases, and the absence of workable data-sharing mechanisms. This article also puts forward suggestions on those problems.
Geoscience research databases for coastal Alabama ecosystem management
Hummell, Richard L.
1995-01-01
Effective management of complex coastal ecosystems necessitates access to scientific knowledge that can be acquired through a multidisciplinary approach involving Federal and State scientists who take advantage of agency expertise and resources for the benefit of all participants working toward a set of common research and management goals. Cooperative geologic investigations have led toward building databases of fundamental scientific knowledge that can be utilized to manage coastal Alabama's natural resources and future development. These databases have been used to assess the occurrence and economic potential of hard mineral resources in the Alabama EEZ, and to support oil spill contingency planning and environmental analysis for coastal Alabama.
Cyber-Bullying in School Settings: A Research Citation Analysis
ERIC Educational Resources Information Center
Piotrowski, Chris
2011-01-01
Research on the topic of cyber-bullying has proliferated over the past decade, particularly on its impact on school-aged children. Thus, it would be of interest to examine the scope and extent of research interest in the topic in scholarly publications. This paper reports on a reference citation analysis of the database PsycINFO, using…
dbWFA: a web-based database for functional annotation of Triticum aestivum transcripts
Vincent, Jonathan; Dai, Zhanwu; Ravel, Catherine; Choulet, Frédéric; Mouzeyar, Said; Bouzidi, M. Fouad; Agier, Marie; Martre, Pierre
2013-01-01
The functional annotation of genes based on sequence homology with genes from model species genomes is time-consuming because it is necessary to mine several unrelated databases. The aim of the present work was to develop a functional annotation database for common wheat Triticum aestivum (L.). The database, named dbWFA, is based on the reference NCBI UniGene set, an expressed gene catalogue built by expressed sequence tag clustering, and on full-length coding sequences retrieved from the TriFLDB database. Information from good-quality heterogeneous sources, including annotations for model plant species Arabidopsis thaliana (L.) Heynh. and Oryza sativa L., was gathered and linked to T. aestivum sequences through BLAST-based homology searches. Even though the complexity of the transcriptome cannot yet be fully appreciated, we developed a tool to easily and promptly obtain information from multiple functional annotation systems (Gene Ontology, MapMan bin codes, MIPS Functional Categories, PlantCyc pathway reactions and TAIR gene families). The use of dbWFA is illustrated here with several query examples. We were able to assign a putative function to 45% of the UniGenes and 81% of the full-length coding sequences from TriFLDB. Moreover, comparison of the annotation of the whole T. aestivum UniGene set along with curated annotations of the two model species assessed the accuracy of the annotation provided by dbWFA. To further illustrate the use of dbWFA, genes specifically expressed during the early cell division or late storage polymer accumulation phases of T. aestivum grain development were identified using a clustering analysis and then annotated using dbWFA. The annotation of these two sets of genes was consistent with previous analyses of T. aestivum grain transcriptomes and proteomes. Database URL: urgi.versailles.inra.fr/dbWFA/ PMID:23660284
Using databases in medical education research: AMEE Guide No. 77.
Cleland, Jennifer; Scott, Neil; Harrild, Kirsten; Moffat, Mandy
2013-05-01
This AMEE Guide offers an introduction to the use of databases in medical education research. It is intended for those who are contemplating conducting research in medical education but are new to the field. The Guide is structured around the process of planning your research so that data collection, management and analysis are appropriate for the research question. Throughout we consider contextual possibilities and constraints to educational research using databases, such as the resources available, and provide concrete examples of medical education research to illustrate many points. The first section of the Guide explains the difference between different types of data and classifying data, and addresses the rationale for research using databases in medical education. We explain the difference between qualitative research and qualitative data, the difference between categorical and quantitative data, and the different types of data that fall into these categories. The Guide reviews the strengths and weaknesses of qualitative and quantitative research. The next section is structured around how to work with quantitative and qualitative databases and provides guidance on the many practicalities of setting up a database. This includes how to organise your database, including anonymising data and coding, as well as preparing and describing your data so it is ready for analysis. The critical matter of the ethics of using databases in medical educational research, including using routinely collected data versus data collected for research purposes, and issues of confidentiality, is discussed. Core to the Guide is drawing out the similarities and differences in working with different types of data and different types of databases. Future AMEE Guides in the research series will address statistical analysis of data in more detail.
GMOMETHODS: the European Union database of reference methods for GMO analysis.
Bonfini, Laura; Van den Bulcke, Marc H; Mazzara, Marco; Ben, Enrico; Patak, Alexandre
2012-01-01
In order to provide reliable and harmonized information on methods for GMO (genetically modified organism) analysis we have published a database called "GMOMETHODS" that supplies information on PCR assays validated according to the principles and requirements of ISO 5725 and/or the International Union of Pure and Applied Chemistry protocol. In addition, the database contains methods that have been verified by the European Union Reference Laboratory for Genetically Modified Food and Feed in the context of compliance with a European Union legislative act. The web application provides search capabilities to retrieve primers and probes sequence information on the available methods. It further supplies core data required by analytical labs to carry out GM tests and comprises information on the applied reference material and plasmid standards. The GMOMETHODS database currently contains 118 different PCR methods allowing identification of 51 single GM events and 18 taxon-specific genes in a sample. It also provides screening assays for detection of eight different genetic elements commonly used for the development of GMOs. The application is referred to by the Biosafety Clearing House, a global mechanism set up by the Cartagena Protocol on Biosafety to facilitate the exchange of information on Living Modified Organisms. The publication of the GMOMETHODS database can be considered an important step toward worldwide standardization and harmonization in GMO analysis.
Down-weighting overlapping genes improves gene set analysis
2012-01-01
Background The identification of gene sets that are significantly impacted in a given condition based on microarray data is a crucial step in current life science research. Most gene set analysis methods treat genes equally, regardless how specific they are to a given gene set. Results In this work we propose a new gene set analysis method that computes a gene set score as the mean of absolute values of weighted moderated gene t-scores. The gene weights are designed to emphasize the genes appearing in few gene sets, versus genes that appear in many gene sets. We demonstrate the usefulness of the method when analyzing gene sets that correspond to the KEGG pathways, and hence we called our method Pathway Analysis with Down-weighting of Overlapping Genes (PADOG). Unlike most gene set analysis methods which are validated through the analysis of 2-3 data sets followed by a human interpretation of the results, the validation employed here uses 24 different data sets and a completely objective assessment scheme that makes minimal assumptions and eliminates the need for possibly biased human assessments of the analysis results. Conclusions PADOG significantly improves gene set ranking and boosts sensitivity of analysis using information already available in the gene expression profiles and the collection of gene sets to be analyzed. The advantages of PADOG over other existing approaches are shown to be stable to changes in the database of gene sets to be analyzed. PADOG was implemented as an R package available at: http://bioinformaticsprb.med.wayne.edu/PADOG/ or http://www.bioconductor.org. PMID:22713124
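A toy sketch of the down-weighting idea above: genes appearing in many gene sets receive smaller weights, and a gene set is scored as the mean of absolute weighted t-scores. The weight formula follows the general form described for PADOG, but the data are illustrative and the published method should be consulted for details:

    # Score one gene set from moderated t-scores and gene-set membership counts.
    import numpy as np

    def gene_weights(freq):
        """Weights of the form 1 + sqrt((f_max - f) / (f_max - f_min)); rarer genes weigh more."""
        f = np.asarray(freq, dtype=float)
        span = f.max() - f.min()
        return 1.0 + np.sqrt((f.max() - f) / span) if span > 0 else np.ones_like(f)

    t_scores = np.array([2.5, -1.8, 0.4, 3.1])  # moderated t-scores of the set's genes
    freq = np.array([1, 12, 30, 2])             # number of gene sets each gene appears in
    print("gene-set score:", np.mean(np.abs(t_scores) * gene_weights(freq)))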
Rice, Michael; Gladstone, William; Weir, Michael
2004-01-01
We discuss how relational databases constitute an ideal framework for representing and analyzing large-scale genomic data sets in biology. As a case study, we describe a Drosophila splice-site database that we recently developed at Wesleyan University for use in research and teaching. The database stores data about splice sites computed by a custom algorithm using Drosophila cDNA transcripts and genomic DNA and supports a set of procedures for analyzing splice-site sequence space. A generic Web interface permits the execution of the procedures with a variety of parameter settings and also supports custom structured query language queries. Moreover, new analytical procedures can be added by updating special metatables in the database without altering the Web interface. The database provides a powerful setting for students to develop informatic thinking skills.
2004-01-01
We discuss how relational databases constitute an ideal framework for representing and analyzing large-scale genomic data sets in biology. As a case study, we describe a Drosophila splice-site database that we recently developed at Wesleyan University for use in research and teaching. The database stores data about splice sites computed by a custom algorithm using Drosophila cDNA transcripts and genomic DNA and supports a set of procedures for analyzing splice-site sequence space. A generic Web interface permits the execution of the procedures with a variety of parameter settings and also supports custom structured query language queries. Moreover, new analytical procedures can be added by updating special metatables in the database without altering the Web interface. The database provides a powerful setting for students to develop informatic thinking skills. PMID:15592597
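A minimal illustration of the relational approach the two abstracts above describe: a toy splice-site table queried with SQL through sqlite3. The schema and rows are hypothetical, not the Wesleyan database:

    # Create an in-memory splice-site table and run an example analysis query.
    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("""CREATE TABLE splice_site (
        gene TEXT, site_type TEXT, position INTEGER, sequence TEXT)""")
    con.executemany("INSERT INTO splice_site VALUES (?, ?, ?, ?)", [
        ("dpp", "donor", 1204, "GTAAGT"),
        ("dpp", "acceptor", 1899, "TTTCAG"),
        ("en", "donor", 512, "GTGAGT"),
    ])

    # Count donor sites starting with the GT consensus dinucleotide.
    (n,) = con.execute("""SELECT COUNT(*) FROM splice_site
                          WHERE site_type = 'donor' AND sequence LIKE 'GT%'""").fetchone()
    print("consensus donor sites:", n)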
VIEWCACHE: An incremental pointer-based access method for autonomous interoperable databases
NASA Technical Reports Server (NTRS)
Roussopoulos, N.; Sellis, Timos
1992-01-01
One of the biggest problems facing NASA today is to provide scientists efficient access to a large number of distributed databases. Our pointer-based incremental database access method, VIEWCACHE, provides such an interface for accessing distributed data sets and directories. VIEWCACHE allows database browsing and searches that perform inter-database cross-referencing with no actual data movement between database sites. This organization and processing is especially suitable for managing Astrophysics databases which are physically distributed all over the world. Once the search is complete, the set of collected pointers pointing to the desired data are cached. VIEWCACHE includes spatial access methods for accessing image data sets, which provide much easier query formulation by referring directly to the image and very efficient search for objects contained within a two-dimensional window. We will develop and optimize a VIEWCACHE External Gateway Access to database management systems to facilitate distributed database search.
Analysis of NASA Common Research Model Dynamic Data
NASA Technical Reports Server (NTRS)
Balakrishna, S.; Acheson, Michael J.
2011-01-01
Recent NASA Common Research Model (CRM) tests at the Langley National Transonic Facility (NTF) and Ames 11-foot Transonic Wind Tunnel (11-foot TWT) have generated an experimental database for CFD code validation. The database consists of force and moment, surface pressures and wideband wing-root dynamic strain/wing Kulite data from continuous sweep pitch polars. The dynamic data sets, acquired at 12,800 Hz sampling rate, are analyzed in this study to evaluate CRM wing buffet onset and potential CRM wing flow separation.
MIPS plant genome information resources.
Spannagl, Manuel; Haberer, Georg; Ernst, Rebecca; Schoof, Heiko; Mayer, Klaus F X
2007-01-01
The Munich Institute for Protein Sequences (MIPS) has been involved in maintaining plant genome databases since the Arabidopsis thaliana genome project. Genome databases and analysis resources have focused on individual genomes and aim to provide flexible and maintainable data sets for model plant genomes as a backbone against which experimental data, for example from high-throughput functional genomics, can be organized and evaluated. In addition, model genomes also form a scaffold for comparative genomics, and much can be learned from genome-wide evolutionary studies.
Using the Saccharomyces Genome Database (SGD) for analysis of genomic information
Skrzypek, Marek S.; Hirschman, Jodi
2011-01-01
Analysis of genomic data requires access to software tools that place the sequence-derived information in the context of biology. The Saccharomyces Genome Database (SGD) integrates functional information about budding yeast genes and their products with a set of analysis tools that facilitate exploring their biological details. This unit describes how the various types of functional data available at SGD can be searched, retrieved, and analyzed. Starting with the guided tour of the SGD Home page and Locus Summary page, this unit highlights how to retrieve data using YeastMine, how to visualize genomic information with GBrowse, how to explore gene expression patterns with SPELL, and how to use Gene Ontology tools to characterize large-scale datasets. PMID:21901739
Improved Bond Equations for Fiber-Reinforced Polymer Bars in Concrete.
Pour, Sadaf Moallemi; Alam, M Shahria; Milani, Abbas S
2016-08-30
This paper explores a set of new equations to predict the bond strength between fiber-reinforced polymer (FRP) rebar and concrete. The proposed equations are based on a comprehensive statistical analysis of existing experimental results in the literature. Namely, the parameters with the greatest effect on the bond behavior of FRP-reinforced concrete were first identified by applying a factorial analysis to part of the available database. Then the database, which contains 250 pullout tests, was divided into four groups based on the concrete compressive strength and the rebar surface. Afterward, nonlinear regression analysis was performed for each study group in order to determine the bond equations. The results show that the proposed equations predict bond strengths more accurately than previously reported models.
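A sketch of the kind of nonlinear regression described above, fitted with scipy's curve_fit. The power-law model form, the predictor set, and the synthetic data are assumptions for illustration, not the paper's fitted equations:

    # Fit tau = a * sqrt(fc) * (cover/d)**b * (embed/d)**c to synthetic pullout data.
    import numpy as np
    from scipy.optimize import curve_fit

    def bond_model(X, a, b, c):
        fc, cover_d, embed_d = X
        return a * np.sqrt(fc) * cover_d**b * embed_d**c

    rng = np.random.default_rng(2)
    fc = rng.uniform(25.0, 60.0, 50)       # concrete compressive strength (MPa)
    cover_d = rng.uniform(1.0, 4.0, 50)    # cover-to-diameter ratio
    embed_d = rng.uniform(3.0, 10.0, 50)   # embedment-to-diameter ratio
    tau = 1.2 * np.sqrt(fc) * cover_d**0.3 * embed_d**-0.2 + rng.normal(0.0, 0.2, 50)

    params, _ = curve_fit(bond_model, (fc, cover_d, embed_d), tau, p0=(1.0, 0.5, -0.5))
    print("fitted (a, b, c):", np.round(params, 3))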
Indicators for the automated analysis of drug prescribing quality.
Coste, J; Séné, B; Milstein, C; Bouée, S; Venot, A
1998-01-01
Irrational and inconsistent drug prescription has considerable impact on morbidity, mortality, health service utilization, and community burden. However, few studies have addressed the methodology of processing the information contained in drug orders to study the quality of drug prescriptions and prescriber behavior. We present a comprehensive set of quantitative indicators of drug prescription quality that can be derived from a drug order. These indicators were constructed using explicit a priori criteria previously validated on the basis of scientific data. Automatic computation is straightforward using a relational database system, such that large sets of prescriptions can be processed with minimal human effort. We illustrate the feasibility and value of this approach using a large set of 23,000 prescriptions for several diseases, selected from a nationally representative prescriptions database. Our study may result in direct and wide applications in the epidemiology of medical practice and in quality control procedures.
NASA Technical Reports Server (NTRS)
Barrett, Charles A.
2003-01-01
The cyclic oxidation test results for some 1000 high-temperature commercial and experimental alloys have been collected in an EXCEL database. This database represents over thirty years of research at NASA Glenn Research Center in Cleveland, Ohio. The data are in the form of a series of runs of specific weight change versus time values for a set of samples tested at a given temperature, cycle time, and exposure time. Included with each run is a set of embedded plots of the critical data. The nature of the data is discussed along with an analysis of the cyclic oxidation process. In addition, examples are given of how a set of results can be analyzed. The data are assembled on a read-only compact disk, which is available on request from the Materials Durability Branch, NASA Glenn Research Center, Cleveland, Ohio.
Accountability in action?: the case of a database purchasing decision.
Neyland, Daniel; Woolgar, Steve
2002-06-01
The increasing prevalence of audit in university settings has raised concerns about the potentially adverse effects of invasive measures of performance upon the conduct of research and generation of knowledge. What sustains the current commitment to audit? It is argued that in order to address this question we need to understand how and to what extent notions of accountability are played out in practice. This is illustrated through the analysis of materials from an ethnographic study of 'good management practice' in the deployment of technologies in university settings. The paper examines the ways in which ideas of accountability - involving considerations such as 'value for money' - inform the practical processes of deciding about the purchase of a new database technology.
Yue, Zongliang; Zheng, Qi; Neylon, Michael T; Yoo, Minjae; Shin, Jimin; Zhao, Zhiying; Tan, Aik Choon
2018-01-01
Integrative Gene-set, Network and Pathway Analysis (GNPA) is a powerful data analysis approach developed to help interpret high-throughput omics data. In PAGER 1.0, we demonstrated that researchers can gain unbiased and reproducible biological insights with the introduction of PAGs (Pathways, Annotated-lists and Gene-signatures) as the basic data representation elements. In PAGER 2.0, we improve the utility of integrative GNPA by significantly expanding the coverage of PAGs and PAG-to-PAG relationships in the database, defining a new metric to quantify PAG data qualities, and developing new software features to simplify online integrative GNPA. Specifically, we included 84 282 PAGs spanning 24 different data sources that cover human diseases, published gene-expression signatures, drug–gene, miRNA–gene interactions, pathways and tissue-specific gene expressions. We introduced a new normalized Cohesion Coefficient (nCoCo) score to assess the biological relevance of genes inside a PAG, and RP-score to rank genes and assign gene-specific weights inside a PAG. The companion web interface contains numerous features to help users query and navigate the database content. The database content can be freely downloaded and is compatible with third-party Gene Set Enrichment Analysis tools. We expect PAGER 2.0 to become a major resource in integrative GNPA. PAGER 2.0 is available at http://discovery.informatics.uab.edu/PAGER/. PMID:29126216
Improving accuracy and power with transfer learning using a meta-analytic database.
Schwartz, Yannick; Varoquaux, Gaël; Pallier, Christophe; Pinel, Philippe; Poline, Jean-Baptiste; Thirion, Bertrand
2012-01-01
Typical cohorts in brain imaging studies are not large enough for systematic testing of all the information contained in the images. To build testable working hypotheses, investigators thus rely on analysis of previous work, sometimes formalized in a so-called meta-analysis. In brain imaging, this approach underlies the specification of regions of interest (ROIs) that are usually selected on the basis of the coordinates of previously detected effects. In this paper, we propose to use a database of images, rather than coordinates, and frame the problem as transfer learning: learning a discriminant model on a reference task to apply it to a different but related new task. To facilitate statistical analysis of small cohorts, we use a sparse discriminant model that selects predictive voxels on the reference task and thus provides a principled procedure to define ROIs. The benefits of our approach are twofold. First it uses the reference database for prediction, i.e., to provide potential biomarkers in a clinical setting. Second it increases statistical power on the new task. We demonstrate on a set of 18 pairs of functional MRI experimental conditions that our approach gives good prediction. In addition, on a specific transfer situation involving different scanners at different locations, we show that voxel selection based on transfer learning leads to higher detection power on small cohorts.
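A minimal sketch of the transfer-learning recipe described above, using scikit-learn and simulated arrays in place of real fMRI data: an l1-penalized (sparse) discriminant model fitted on the reference task selects predictive voxels, which then serve as the ROI for the small new cohort.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_ref, y_ref = rng.normal(size=(200, 5000)), rng.integers(0, 2, 200)  # reference task
X_new, y_new = rng.normal(size=(20, 5000)), rng.integers(0, 2, 20)    # small cohort

# Sparse discriminant model on the reference task; nonzero weights define the ROI.
ref_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.5).fit(X_ref, y_ref)
roi = np.flatnonzero(ref_model.coef_.ravel())
print(f"{roi.size} voxels selected as ROI")

# Model on the new task restricted to the ROI: far fewer features, higher power.
new_model = LogisticRegression().fit(X_new[:, roi], y_new)
```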
MIPS: curated databases and comprehensive secondary data resources in 2010.
Mewes, H Werner; Ruepp, Andreas; Theis, Fabian; Rattei, Thomas; Walter, Mathias; Frishman, Dmitrij; Suhre, Karsten; Spannagl, Manuel; Mayer, Klaus F X; Stümpflen, Volker; Antonov, Alexey
2011-01-01
The Munich Information Center for Protein Sequences (MIPS at the Helmholtz Center for Environmental Health, Neuherberg, Germany) has many years of experience in providing annotated collections of biological data. Selected data sets of high relevance, such as model genomes, are subjected to careful manual curation, while the bulk of high-throughput data is annotated by automatic means. High-quality reference resources developed in the past and still actively maintained include Saccharomyces cerevisiae, Neurospora crassa and Arabidopsis thaliana genome databases as well as several protein interaction data sets (MPACT, MPPI and CORUM). More recent projects are PhenomiR, the database on microRNA-related phenotypes, and MIPS PlantsDB for integrative and comparative plant genome research. The interlinked resources SIMAP and PEDANT provide homology relationships as well as up-to-date and consistent annotation for 38,000,000 protein sequences. PPLIPS and CCancer are versatile tools for proteomics and functional genomics interfacing to a database of compilations from gene lists extracted from literature. A novel literature-mining tool, EXCERBT, gives access to structured information on classified relations between genes, proteins, phenotypes and diseases extracted from Medline abstracts by semantic analysis. All databases described here, as well as the detailed descriptions of our projects can be accessed through the MIPS WWW server (http://mips.helmholtz-muenchen.de).
MIPS: curated databases and comprehensive secondary data resources in 2010
Mewes, H. Werner; Ruepp, Andreas; Theis, Fabian; Rattei, Thomas; Walter, Mathias; Frishman, Dmitrij; Suhre, Karsten; Spannagl, Manuel; Mayer, Klaus F.X.; Stümpflen, Volker; Antonov, Alexey
2011-01-01
The Munich Information Center for Protein Sequences (MIPS at the Helmholtz Center for Environmental Health, Neuherberg, Germany) has many years of experience in providing annotated collections of biological data. Selected data sets of high relevance, such as model genomes, are subjected to careful manual curation, while the bulk of high-throughput data is annotated by automatic means. High-quality reference resources developed in the past and still actively maintained include Saccharomyces cerevisiae, Neurospora crassa and Arabidopsis thaliana genome databases as well as several protein interaction data sets (MPACT, MPPI and CORUM). More recent projects are PhenomiR, the database on microRNA-related phenotypes, and MIPS PlantsDB for integrative and comparative plant genome research. The interlinked resources SIMAP and PEDANT provide homology relationships as well as up-to-date and consistent annotation for 38 000 000 protein sequences. PPLIPS and CCancer are versatile tools for proteomics and functional genomics interfacing to a database of compilations from gene lists extracted from literature. A novel literature-mining tool, EXCERBT, gives access to structured information on classified relations between genes, proteins, phenotypes and diseases extracted from Medline abstracts by semantic analysis. All databases described here, as well as the detailed descriptions of our projects can be accessed through the MIPS WWW server (http://mips.helmholtz-muenchen.de). PMID:21109531
Trials and tribulations: how we established a major incident database.
Hardy, S E J; Fattah, S
2017-01-25
We describe the process of setting up a database of major incident reports and its potential future application. A template for reporting on major incidents was developed using a consensus-based process involving a team of experts in the field. A website was set up as a platform from which to launch the template and as a database of submitted reports. This paper describes the processes involved in setting up a major incident reporting database. It describes how specific difficulties have been overcome and anticipates challenges for the future. We have successfully set up a major incident database, the main purpose of which is to have a repository of standardised major incident reports that can be analysed and compared in order to learn from them.
Pathway Distiller - multisource biological pathway consolidation
2012-01-01
Background One method to understand and evaluate an experiment that produces a large set of genes, such as a gene expression microarray analysis, is to identify overrepresentation or enrichment for biological pathways. Because pathways are able to functionally describe the set of genes, much effort has been made to collect curated biological pathways into publicly accessible databases. When combining disparate databases, highly related or redundant pathways exist, making their consolidation into pathway concepts essential. This will facilitate unbiased, comprehensive yet streamlined analysis of experiments that result in large gene sets. Methods After gene set enrichment finds representative pathways for large gene sets, pathways are consolidated into representative pathway concepts. Three complementary but different methods of pathway consolidation are explored. Enrichment Consolidation combines the set of pathways enriched for the signature gene list through iterative combining of enriched pathways with other pathways with similar signature gene sets; Weighted Consolidation utilizes a Protein-Protein Interaction network-based gene-weighting approach that finds clusters of both enriched and non-enriched pathways limited to the experiment's resultant gene list; and finally, de novo Consolidation uses several measurements of pathway similarity to find static pathway clusters independent of any given experiment. Results We demonstrate that the three consolidation methods provide unified yet different functional insights into a resultant gene set derived from a genome-wide profiling experiment. Results from the methods are presented, demonstrating their applications in biological studies, and compared with a web-based pathway framework that also combines several pathway databases. Additionally, a web-based consolidation framework that encompasses all three methods discussed in this paper, Pathway Distiller (http://cbbiweb.uthscsa.edu/PathwayDistiller), is established to give researchers access to the methods and example microarray data described in this manuscript, and the ability to analyze their own gene lists using our unique consolidation methods. Conclusions By combining several pathway systems, implementing different but complementary pathway consolidation methods, and providing a user-friendly web-accessible tool, we have enabled users to extract functional explanations of their genome-wide experiments. PMID:23134636
Pathway Distiller - multisource biological pathway consolidation.
Doderer, Mark S; Anguiano, Zachry; Suresh, Uthra; Dashnamoorthy, Ravi; Bishop, Alexander J R; Chen, Yidong
2012-01-01
One method to understand and evaluate an experiment that produces a large set of genes, such as a gene expression microarray analysis, is to identify overrepresentation or enrichment for biological pathways. Because pathways are able to functionally describe the set of genes, much effort has been made to collect curated biological pathways into publicly accessible databases. When combining disparate databases, highly related or redundant pathways exist, making their consolidation into pathway concepts essential. This will facilitate unbiased, comprehensive yet streamlined analysis of experiments that result in large gene sets. After gene set enrichment finds representative pathways for large gene sets, pathways are consolidated into representative pathway concepts. Three complementary but different methods of pathway consolidation are explored. Enrichment Consolidation combines the set of pathways enriched for the signature gene list through iterative combining of enriched pathways with other pathways with similar signature gene sets; Weighted Consolidation utilizes a Protein-Protein Interaction network-based gene-weighting approach that finds clusters of both enriched and non-enriched pathways limited to the experiment's resultant gene list; and finally, de novo Consolidation uses several measurements of pathway similarity to find static pathway clusters independent of any given experiment. We demonstrate that the three consolidation methods provide unified yet different functional insights into a resultant gene set derived from a genome-wide profiling experiment. Results from the methods are presented, demonstrating their applications in biological studies, and compared with a web-based pathway framework that also combines several pathway databases. Additionally, a web-based consolidation framework that encompasses all three methods discussed in this paper, Pathway Distiller (http://cbbiweb.uthscsa.edu/PathwayDistiller), is established to give researchers access to the methods and example microarray data described in this manuscript, and the ability to analyze their own gene lists using our unique consolidation methods. By combining several pathway systems, implementing different but complementary pathway consolidation methods, and providing a user-friendly web-accessible tool, we have enabled users to extract functional explanations of their genome-wide experiments.
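As a rough illustration of the Enrichment Consolidation step described above, here is a greedy merge of enriched pathways whose signature gene sets overlap strongly. The Jaccard threshold and the greedy strategy are simplifying assumptions, not the authors' exact algorithm.

```python
# Merge pathways into concepts when their gene sets are sufficiently similar.
def consolidate(pathways: dict[str, set[str]], min_jaccard: float = 0.5) -> list[set[str]]:
    concepts: list[tuple[set[str], set[str]]] = []  # (merged names, union of genes)
    for name, genes in pathways.items():
        for names, pool in concepts:
            if len(genes & pool) / len(genes | pool) >= min_jaccard:
                names.add(name)   # absorb this pathway into an existing concept
                pool |= genes
                break
        else:
            concepts.append(({name}, set(genes)))
    return [names for names, _ in concepts]

demo = {"KEGG:apoptosis": {"TP53", "BAX", "CASP3"},
        "Reactome:apoptosis": {"TP53", "BAX", "CASP9"},
        "KEGG:glycolysis": {"HK1", "PFKL"}}
print(consolidate(demo))  # the two apoptosis entries merge; glycolysis stays separate
```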
Quality assurance software inspections at NASA Ames: Metrics for feedback and modification
NASA Technical Reports Server (NTRS)
Wenneson, G.
1985-01-01
Software inspections--a set of formal technical review procedures held at selected key points during software development in order to find defects in software documents--are described in terms of history, participants, tools, procedures, statistics, and database analysis.
Scale out databases for CERN use cases
NASA Astrophysics Data System (ADS)
Baranowski, Zbigniew; Grzybek, Maciej; Canali, Luca; Lanza Garcia, Daniel; Surdy, Kacper
2015-12-01
Data generation rates are expected to grow very fast for some database workloads going into LHC run 2 and beyond. In particular this is expected for data coming from controls, logging and monitoring systems. Storing, administering and accessing big data sets in a relational database system can quickly become a very hard technical challenge, as the size of the active data set and the number of concurrent users increase. Scale-out database technologies are a rapidly developing set of solutions for deploying and managing very large data warehouses on commodity hardware and with open source software. In this paper we will describe the architecture and tests on database systems based on Hadoop and the Cloudera Impala engine. We will discuss the results of our tests, including tests of data loading and integration with existing data sources and in particular with relational databases. We will report on query performance tests done with various data sets of interest at CERN, notably data from the accelerator log database.
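For readers unfamiliar with this setup, the sketch below shows the kind of analytical query such tests exercise, issued through the impyla Python client for Impala. The host, table and column names are placeholders, not CERN's actual services.

```python
from impala.dbapi import connect  # impyla client; assumed installed

# Aggregate a month of (hypothetical) accelerator logging data through Impala.
conn = connect(host="impala-gateway.example.org", port=21050)
cur = conn.cursor()
cur.execute("""
    SELECT variable_id, COUNT(*) AS n, AVG(value) AS mean_value
    FROM accelerator_logs
    WHERE ts BETWEEN '2015-01-01' AND '2015-02-01'
    GROUP BY variable_id
    ORDER BY n DESC
    LIMIT 10
""")
for row in cur.fetchall():
    print(row)
```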
Miller, Stephan W.
1981-01-01
A second set of related problems deals with how this format and other representations of spatial entities, such as vector formats for point and line features, can be interrelated for manipulation, retrieval, and analysis by a spatial database management subsystem. Methods have been developed for interrelating areal data sets in the raster format with point and line data in a vector format, and these are described.
IPRStats: visualization of the functional potential of an InterProScan run.
Kelly, Ryan J; Vincent, David E; Friedberg, Iddo
2010-12-21
InterPro is a collection of protein signatures for the classification and automated annotation of proteins. InterProScan is a software tool that scans protein sequences against InterPro member databases using a variety of profile-based, hidden Markov model and position-specific scoring matrix methods. It not only combines a set of analysis tools, but also performs data look-up from various sources, as well as some redundancy removal. InterProScan is robust and scalable, able to perform on any machine from a netbook to a large cluster. However, when performing whole-genome or metagenome analysis, there is a need for a fast statistical visualization of the results to get a good initial grasp of the functional potential of the sequences in the analyzed data set. This is especially important when analyzing and comparing metagenomic or metaproteomic data sets. IPRStats is a tool for the visualization of InterProScan results. InterProScan results are parsed from the InterProScan XML or EBIXML file into an SQLite or MySQL database. The results for each signature database scan are read and displayed as pie charts or bar charts as summary statistics. A table is also provided, where each entry is a signature (e.g. a Pfam entry) accompanied by one or more Gene Ontology terms, if InterProScan was run using the Gene Ontology option. We present a platform-independent, open source licensed tool that is useful for InterProScan users who wish to view the summary of their results in a rapid and concise fashion.
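A condensed sketch of the pipeline stages named above: parse the InterProScan XML, tally hits per signature into SQLite, and report summary counts. The XML element and attribute names used here are assumptions and should be checked against the output of your InterProScan version.

```python
import sqlite3
import xml.etree.ElementTree as ET
from collections import Counter

# Count occurrences of each signature accession in an InterProScan result file.
# "signature"/"ac" are placeholder names for the real schema.
counts = Counter(
    elem.get("ac")
    for _, elem in ET.iterparse("interproscan_output.xml")
    if elem.tag.endswith("signature")
)

db = sqlite3.connect("iprstats.sqlite")
db.execute("CREATE TABLE IF NOT EXISTS hits (signature TEXT PRIMARY KEY, n INTEGER)")
db.executemany("INSERT OR REPLACE INTO hits VALUES (?, ?)", counts.most_common())
db.commit()
print(counts.most_common(5))  # top signatures, e.g. for a quick bar chart
```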
SSBD: a database of quantitative data of spatiotemporal dynamics of biological phenomena
Tohsato, Yukako; Ho, Kenneth H. L.; Kyoda, Koji; Onami, Shuichi
2016-01-01
Motivation: Rapid advances in live-cell imaging analysis and mathematical modeling have produced a large amount of quantitative data on spatiotemporal dynamics of biological objects ranging from molecules to organisms. There is now a crucial need to bring these large amounts of quantitative biological dynamics data together centrally in a coherent and systematic manner. This will facilitate the reuse of this data for further analysis. Results: We have developed the Systems Science of Biological Dynamics database (SSBD) to store and share quantitative biological dynamics data. SSBD currently provides 311 sets of quantitative data for single molecules, nuclei and whole organisms in a wide variety of model organisms from Escherichia coli to Mus musculus. The data are provided in Biological Dynamics Markup Language format and also through a REST API. In addition, SSBD provides 188 sets of time-lapse microscopy images from which the quantitative data were obtained and software tools for data visualization and analysis. Availability and Implementation: SSBD is accessible at http://ssbd.qbic.riken.jp. Contact: sonami@riken.jp PMID:27412095
SSBD: a database of quantitative data of spatiotemporal dynamics of biological phenomena.
Tohsato, Yukako; Ho, Kenneth H L; Kyoda, Koji; Onami, Shuichi
2016-11-15
Rapid advances in live-cell imaging analysis and mathematical modeling have produced a large amount of quantitative data on spatiotemporal dynamics of biological objects ranging from molecules to organisms. There is now a crucial need to bring these large amounts of quantitative biological dynamics data together centrally in a coherent and systematic manner. This will facilitate the reuse of this data for further analysis. We have developed the Systems Science of Biological Dynamics database (SSBD) to store and share quantitative biological dynamics data. SSBD currently provides 311 sets of quantitative data for single molecules, nuclei and whole organisms in a wide variety of model organisms from Escherichia coli to Mus musculus. The data are provided in Biological Dynamics Markup Language format and also through a REST API. In addition, SSBD provides 188 sets of time-lapse microscopy images from which the quantitative data were obtained and software tools for data visualization and analysis. SSBD is accessible at http://ssbd.qbic.riken.jp. Contact: sonami@riken.jp
Crone, Damien L; Bode, Stefan; Murawski, Carsten; Laham, Simon M
2018-01-01
A major obstacle for the design of rigorous, reproducible studies in moral psychology is the lack of suitable stimulus sets. Here, we present the Socio-Moral Image Database (SMID), the largest standardized moral stimulus set assembled to date, containing 2,941 freely available photographic images, representing a wide range of morally (and affectively) positive, negative and neutral content. The SMID was validated with over 820,525 individual judgments from 2,716 participants, with normative ratings currently available for all images on affective valence and arousal, moral wrongness, and relevance to each of the five moral values posited by Moral Foundations Theory. We present a thorough analysis of the SMID regarding (1) inter-rater consensus, (2) rating precision, and (3) breadth and variability of moral content. Additionally, we provide recommendations for use aimed at efficient study design and reproducibility, and outline planned extensions to the database. We anticipate that the SMID will serve as a useful resource for psychological, neuroscientific and computational (e.g., natural language processing or computer vision) investigations of social, moral and affective processes. The SMID images, along with associated normative data and additional resources are available at https://osf.io/2rqad/.
A Database of Woody Vegetation Responses to Elevated Atmospheric CO2 (NDP-072)
Curtis, Peter S [The Ohio State Univ., Columbus, OH (United States); Cushman, Robert M [Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States); Brenkert, Antoinette L [Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
1999-01-01
To perform a statistically rigorous meta-analysis of research results on the response by woody vegetation to increased atmospheric CO2 levels, a multiparameter database of responses was compiled. Eighty-four independent CO2-enrichment studies, covering 65 species and 35 response parameters, met the necessary criteria for inclusion in the database: reporting mean response, sample size, and variance of the response (either as standard deviation or standard error). Data were retrieved from the published literature and unpublished reports. This numeric data package contains a 29-field data set of CO2-exposure experiment responses by woody plants (as both a flat ASCII file and a spreadsheet file), files listing the references to the CO2-exposure experiments and specific comments relevant to the data in the data set, and this documentation file (which includes SAS and Fortran codes to read the ASCII data file; SAS is a registered trademark of the SAS Institute, Inc., Cary, North Carolina 27511).
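The database fields listed above (mean response, sample size, and variance per study) are exactly what an inverse-variance weighted meta-analysis consumes. The sketch below pools log response ratios from three made-up studies to show the computation; the numbers are not from the database.

```python
import numpy as np

# Invented per-study summaries: elevated vs. ambient CO2 (mean, SD, n).
mean_elev = np.array([12.1, 8.4, 15.0]); sd_elev = np.array([2.0, 1.5, 3.1]); n_elev = np.array([10, 8, 12])
mean_amb  = np.array([10.0, 7.9, 11.2]); sd_amb  = np.array([1.8, 1.4, 2.5]); n_amb  = np.array([10, 8, 12])

lrr = np.log(mean_elev / mean_amb)                                  # effect size per study
var = sd_elev**2 / (n_elev * mean_elev**2) + sd_amb**2 / (n_amb * mean_amb**2)
w = 1.0 / var                                                       # inverse-variance weights
pooled = np.sum(w * lrr) / np.sum(w)
se = np.sqrt(1.0 / np.sum(w))
print(f"pooled log response ratio = {pooled:.3f} +/- {1.96 * se:.3f}")
```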
Analysis of prescription database extracted from standard textbooks of traditional Dai medicine
2012-01-01
Background Traditional Dai Medicine (TDM) is one of the four major ethnomedicines of China. In 2007 a group of experts produced a set of seven Dai medical textbooks on this subject. The first two were selected as the main data source to analyse well-recognized prescriptions. Objective To quantify patterns of prescriptions, common ingredients, indications and usages of TDM. Methods A relational database linking the prescriptions, ingredients, herb names, indications, and usages was set up. Frequencies of combination patterns and of common ingredients were tabulated. Results A total of 200 prescriptions and 402 herbs were compiled. Prescriptions based on "wind" disorders, a detoxification theory that most commonly deals with symptoms of digestive system diseases, accounted for over one third of all prescriptions. The major methods of preparation mostly used roots and whole herbs. Conclusion The information extracted from the relational database may be useful for understanding symptomatic treatments. Antidote and detoxification theory deserves further research. PMID:22931752
The BaMM web server for de-novo motif discovery and regulatory sequence analysis.
Kiesel, Anja; Roth, Christian; Ge, Wanwan; Wess, Maximilian; Meier, Markus; Söding, Johannes
2018-05-28
The BaMM web server offers four tools: (i) de-novo discovery of enriched motifs in a set of nucleotide sequences, (ii) scanning a set of nucleotide sequences with motifs to find motif occurrences, (iii) searching with an input motif for similar motifs in our BaMM database with motifs for >1000 transcription factors, trained from the GTRD ChIP-seq database and (iv) browsing and keyword searching the motif database. In contrast to most other servers, we represent sequence motifs not by position weight matrices (PWMs) but by Bayesian Markov Models (BaMMs) of order 4, which we showed previously to perform substantially better in ROC analyses than PWMs or first order models. To address the inadequacy of P- and E-values as measures of motif quality, we introduce the AvRec score, the average recall over the TP-to-FP ratio between 1 and 100. The BaMM server is freely accessible without registration at https://bammmotif.mpibpc.mpg.de.
GARNET--gene set analysis with exploration of annotation relations.
Rho, Kyoohyoung; Kim, Bumjin; Jang, Youngjun; Lee, Sanghyun; Bae, Taejeong; Seo, Jihae; Seo, Chaehwa; Lee, Jihyun; Kang, Hyunjung; Yu, Ungsik; Kim, Sunghoon; Lee, Sanghyuk; Kim, Wan Kyu
2011-02-15
Gene set analysis is a powerful method of deducing biological meaning for an a priori defined set of genes. Numerous tools have been developed to test statistical enrichment or depletion in specific pathways or gene ontology (GO) terms. Major difficulties in biological interpretation are integrating diverse types of annotation categories and exploring the relationships between annotation terms carrying similar information. GARNET (Gene Annotation Relationship NEtwork Tools) is an integrative platform for gene set analysis with many novel features. It includes tools for retrieval of genes from annotation databases, statistical analysis & visualization of annotation relationships, and managing gene sets. In an effort to allow access to a full spectrum of amassed biological knowledge, we have integrated a variety of annotation data that include GO, domain, disease, drug, chromosomal location, and custom-defined annotations. Diverse types of molecular networks (pathways, transcription and microRNA regulations, protein-protein interaction) are also included. The pair-wise relationship between annotation gene sets was calculated using kappa statistics. GARNET consists of three modules--gene set manager, gene set analysis and gene set retrieval--which are tightly integrated to provide virtually automatic analysis for gene sets. A dedicated viewer for annotation networks has been developed to facilitate exploration of the related annotations. GARNET is an integrative platform for diverse types of gene set analysis, where complex relationships among gene annotations can be easily explored with an intuitive network visualization tool (http://garnet.isysbio.org/ or http://ercsb.ewha.ac.kr/garnet/).
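The kappa statistic GARNET applies to pairs of annotation gene sets can be computed by treating membership in each set, over a common gene universe, as two binary ratings. A small self-contained sketch with invented gene identifiers:

```python
# Cohen's kappa between two gene sets over a shared universe of genes.
def kappa(set_a: set, set_b: set, universe: set) -> float:
    n = len(universe)
    both = len(set_a & set_b)
    only_a = len(set_a - set_b)
    only_b = len(set_b - set_a)
    neither = n - both - only_a - only_b
    p_obs = (both + neither) / n                      # observed agreement
    p_exp = (len(set_a) * len(set_b)                  # chance agreement
             + (n - len(set_a)) * (n - len(set_b))) / n**2
    return (p_obs - p_exp) / (1 - p_exp)

universe = {f"g{i}" for i in range(1000)}
go_term = {f"g{i}" for i in range(50)}
pathway = {f"g{i}" for i in range(30, 80)}
print(f"kappa = {kappa(go_term, pathway, universe):.2f}")  # ~0.37 for this overlap
```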
Nuclear Forensic Inferences Using Iterative Multidimensional Statistics
DOE Office of Scientific and Technical Information (OSTI.GOV)
Robel, M; Kristo, M J; Heller, M A
2009-06-09
Nuclear forensics involves the analysis of interdicted nuclear material for specific material characteristics (referred to as 'signatures') that imply specific geographical locations, production processes, culprit intentions, etc. Predictive signatures rely on expert knowledge of physics, chemistry, and engineering to develop inferences from these material characteristics. Comparative signatures, on the other hand, rely on comparison of the material characteristics of the interdicted sample (the 'questioned sample' in FBI parlance) with those of a set of known samples. In the ideal case, the set of known samples would be a comprehensive nuclear forensics database, a database which does not currently exist. In fact, our ability to analyze interdicted samples and produce an extensive list of precise materials characteristics far exceeds our ability to interpret the results. Therefore, as we seek to develop the extensive databases necessary for nuclear forensics, we must also develop the methods needed to draw inferences from comparison of our analytical results with these large, multidimensional sets of data. In the work reported here, we used a large, multidimensional dataset of results from quality control analyses of uranium ore concentrate (UOC, sometimes called 'yellowcake'). We have found that traditional multidimensional techniques, such as principal components analysis (PCA), are especially useful for understanding such datasets and drawing relevant conclusions. In particular, we have developed an iterative partial least squares-discriminant analysis (PLS-DA) procedure that has proven especially adept at identifying the production location of unknown UOC samples. By removing classes which fell far outside the initial decision boundary, and then rebuilding the PLS-DA model, we have consistently produced better and more definitive attributions than with a single-pass classification approach. Performance of the iterative PLS-DA method compared favorably to that of classification and regression tree (CART) and k-nearest neighbor (KNN) algorithms, with the best combination of accuracy and robustness, as tested by classifying samples measured independently in our laboratories against the vendor-QC-based reference set.
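A toy version of the iterative PLS-DA loop described above, using scikit-learn's PLSRegression on one-hot class labels with simulated data. The pruning rule (drop classes scoring far below the best) and its cutoff are illustrative stand-ins for the authors' decision-boundary criterion.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(120, 20)) + np.repeat(np.arange(6), 20)[:, None] * 0.5
y = np.repeat(np.arange(6), 20)                       # 6 candidate production sites
questioned = X[25] + rng.normal(scale=0.1, size=20)   # the "questioned" UOC sample

classes = list(range(6))
for _ in range(3):                                    # a few refinement passes
    mask = np.isin(y, classes)
    Y = (y[mask][:, None] == np.array(classes)).astype(float)  # one-hot labels
    pls = PLSRegression(n_components=3).fit(X[mask], Y)
    scores = pls.predict(questioned[None, :])[0]
    keep = [c for c, s in zip(classes, scores) if s > scores.max() - 0.5]
    if keep == classes:                               # nothing pruned: converged
        break
    classes = keep                                    # rebuild model on survivors
print("most plausible source class(es):", classes)
```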
Kanapathipillai, Rupa; McManus, Hamish; Kamarulzaman, Adeeba; Lim, Poh Lian; Templeton, David J.; Law, Matthew; Woolley, Ian
2014-01-01
Introduction The magnitude and frequency of HIV viral load blips in resource-limited settings have not previously been assessed. This study was undertaken in a cohort from a high-income country (Australia), known as AHOD (Australian HIV Observational Database), and another cohort from a mixture of Asian countries of varying national income per capita, TAHOD (TREAT Asia HIV Observational Database). Methods Blips were defined as detectable VL (≥50 copies/mL) preceded and followed by undetectable VL (<50 copies/mL). Virological failure (VF) was defined as two consecutive VL ≥50 copies/mL. Cox proportional hazards models of time to first VF after entry were developed. Results 5040 patients (AHOD n = 2597 and TAHOD n = 2521) were included; 910 (18%) of patients experienced blips. 744 (21%) and 166 (11%) of high- and middle/low-income participants, respectively, ever experienced blips. 711 (14%) experienced blips prior to virological failure. 559 (16%) and 152 (10%) of high- and middle/low-income participants, respectively, experienced blips prior to virological failure. VL testing occurred at a median frequency of 175 and 91 days in middle/low- and high-income sites, respectively. Longer time to VF occurred in middle/low-income sites compared with high-income sites (adjusted hazard ratio (AHR) 0.41; p<0.001), adjusted for year of first cART, Hepatitis C co-infection, cART regimen, and prior blips. Prior blips were not a significant predictor of VF in univariate analysis (AHR 0.97, p = 0.82). Differing magnitudes of blips were not significant predictors of virological failure in univariate analyses (p = 0.360 for blip 50–≤1000, p = 0.309 for blip 50–≤400 and p = 0.300 for blip 50–≤200). 209 of 866 (24%) patients were switched to an alternate regimen in the setting of a blip. Conclusion Despite a lower proportion of blips occurring in low/middle-income settings, no significant difference was found between settings. Nonetheless, a substantial number of participants were switched to alternative regimens in the setting of blips. PMID:24516527
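The blip and virological-failure definitions above translate directly into a small scanning function. The sketch below encodes them for a single patient's VL series (values in copies/mL); the series itself is invented.

```python
# A blip: detectable VL (>=50) preceded and followed by undetectable VL (<50).
# Virological failure: two consecutive detectable VL measurements.
def classify(vl_series: list[int], limit: int = 50):
    blips, failure_at = [], None
    for i, vl in enumerate(vl_series):
        if vl >= limit and i > 0 and vl_series[i - 1] >= limit:
            failure_at = i          # second consecutive detectable VL
            break
        if (0 < i < len(vl_series) - 1 and vl >= limit
                and vl_series[i - 1] < limit and vl_series[i + 1] < limit):
            blips.append((i, vl))
    return blips, failure_at

print(classify([40, 45, 120, 30, 48, 600, 800]))  # one blip (120); failure at index 6
```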
Bovine Genome Database: supporting community annotation and analysis of the Bos taurus genome
2010-01-01
Background A goal of the Bovine Genome Database (BGD; http://BovineGenome.org) has been to support the Bovine Genome Sequencing and Analysis Consortium (BGSAC) in the annotation and analysis of the bovine genome. We were faced with several challenges, including the need to maintain consistent quality despite diversity in annotation expertise in the research community, the need to maintain consistent data formats, and the need to minimize the potential duplication of annotation effort. With new sequencing technologies allowing many more eukaryotic genomes to be sequenced, the demand for collaborative annotation is likely to increase. Here we present our approach, challenges and solutions facilitating a large distributed annotation project. Results and Discussion BGD has provided annotation tools that supported 147 members of the BGSAC in contributing 3,871 gene models over a fifteen-week period, and these annotations have been integrated into the bovine Official Gene Set. Our approach has been to provide an annotation system, which includes a BLAST site, multiple genome browsers, an annotation portal, and the Apollo Annotation Editor configured to connect directly to our Chado database. In addition to implementing and integrating components of the annotation system, we have performed computational analyses to create gene evidence tracks and a consensus gene set, which can be viewed on individual gene pages at BGD. Conclusions We have provided annotation tools that alleviate challenges associated with distributed annotation. Our system provides a consistent set of data to all annotators and eliminates the need for annotators to format data. Involving the bovine research community in genome annotation has allowed us to leverage expertise in various areas of bovine biology to provide biological insight into the genome sequence. PMID:21092105
Gao, Jianyong; Tian, Gang; Han, Xu; Zhu, Qiang
2018-01-01
Oral squamous cell carcinoma (OSCC) is the sixth most common type of cancer worldwide, with poor prognosis. The present study aimed to identify gene signatures that could classify OSCC and predict prognosis in different stages. A training data set (GSE41613) and two validation data sets (GSE42743 and GSE26549) were acquired from the online Gene Expression Omnibus database. In the training data set, patients were classified based on the tumor-node-metastasis staging system, and subsequently grouped into low stage (L) or high stage (H). Signature genes between L and H stages were selected by disparity index analysis, and classification was performed on the expression of these signature genes. The established classification was compared with the L and H classification, and fivefold cross-validation was used to evaluate its stability. Enrichment analysis for the signature genes was implemented with the Database for Annotation, Visualization and Integrated Discovery. The two validation data sets were used to determine the precision of the classification. Survival analysis was conducted following each classification using the package ‘survival’ in R. A set of 24 signature genes was identified based on the classification model, with an Fi value of 0.47, which was used to distinguish OSCC samples in two different stages. Overall survival of patients in the H stage was higher than that of patients in the L stage. Signature genes were primarily enriched in the ‘ether lipid metabolism’ pathway and biological processes such as ‘positive regulation of adaptive immune response’ and ‘apoptotic cell clearance’. The results provide a novel 24-gene set that may serve as a biomarker panel to predict OSCC prognosis with high accuracy and help determine an appropriate treatment program for patients with OSCC, in addition to the traditional evaluation indices. PMID:29257303
Handwritten word preprocessing for database adaptation
NASA Astrophysics Data System (ADS)
Oprean, Cristina; Likforman-Sulem, Laurence; Mokbel, Chafic
2013-01-01
Handwriting recognition systems are typically trained using publicly available databases, where data have been collected in controlled conditions (image resolution, paper background, noise level,...). Since this is not often the case in real-world scenarios, classification performance can be affected when novel data is presented to the word recognition system. To overcome this problem, we present in this paper a new approach called database adaptation. It consists of processing one set (training or test) in order to adapt it to the other set (test or training, respectively). Specifically, two kinds of preprocessing, namely stroke thickness normalization and pixel intensity normalization are considered. The advantage of such approach is that we can re-use the existing recognition system trained on controlled data. We conduct several experiments with the Rimes 2011 word database and with a real-world database. We adapt either the test set or the training set. Results show that training set adaptation achieves better results than test set adaptation, at the cost of a second training stage on the adapted data. Accuracy of data set adaptation is increased by 2% to 3% in absolute value over no adaptation.
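A minimal sketch of the two preprocessing steps named above, using scikit-image: intensity normalization by rescaling, and stroke thickness normalization by skeletonizing the binarized word and re-dilating to a fixed width. The exact normalizations used in the paper may differ; this is one plausible realization on a stand-in image.

```python
import numpy as np
from skimage.filters import threshold_otsu
from skimage.morphology import skeletonize, dilation, disk

def adapt(img: np.ndarray, target_halfwidth: int = 1) -> np.ndarray:
    # Pixel intensity normalization: rescale grayscale values to [0, 1].
    gray = (img - img.min()) / (img.max() - img.min() + 1e-9)
    # Binarize (dark strokes on light paper), then normalize stroke thickness
    # by reducing strokes to their skeleton and re-dilating to a fixed width.
    ink = gray < threshold_otsu(gray)
    strokes = dilation(skeletonize(ink), disk(target_halfwidth))
    return np.where(strokes, 0.0, 1.0)                # clean binary word image

word = np.random.rand(64, 256)                        # stand-in for a scanned word
normalized = adapt(word)
```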
A simplified Forest Inventory and Analysis database: FIADB-Lite
Patrick D. Miles
2008-01-01
This publication is a simplified version of the Forest Inventory and Analysis Data Base (FIADB) for users who do not need to compute sampling errors and may find the FIADB unnecessarily complex. Possible users include GIS specialists who may be interested only in identifying and retrieving geographic information and per acre values for the set of plots used in...
Austvoll-Dahlgren, Astrid; Guttersrud, Øystein; Nsangi, Allen; Semakula, Daniel; Oxman, Andrew D
2017-05-25
The Claim Evaluation Tools database contains multiple-choice items for measuring people's ability to apply the key concepts they need to know to be able to assess treatment claims. We assessed items from the database using Rasch analysis to develop an outcome measure to be used in two randomised trials in Uganda. Rasch analysis is a form of psychometric testing relying on Item Response Theory. It is a dynamic way of developing outcome measures that are valid and reliable. Our objective was to assess the validity, reliability and responsiveness of 88 items addressing 22 key concepts using Rasch analysis. We administered four sets of multiple-choice items in English to 1114 people in Uganda and Norway, of whom 685 were children and 429 were adults (including 171 health professionals). We scored all items dichotomously. We explored summary and individual fit statistics using the RUMM2030 analysis package. We used SPSS to perform distractor analysis. Most items conformed well to the Rasch model, but some items needed revision. Overall, the four item sets had satisfactory reliability. We did not identify significant response dependence between any pairs of items and, overall, the magnitude of multidimensionality in the data was acceptable. The items had a high level of difficulty. Most of the items conformed well to the Rasch model's expectations. Following revision of some items, we concluded that most of the items were suitable for use in an outcome measure for evaluating the ability of children or adults to assess treatment claims.
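For reference, the dichotomous Rasch model that underlies this kind of item analysis: the probability of a correct response depends only on the difference between the person's ability and the item's difficulty.

```latex
% Dichotomous Rasch model: person p with ability \theta_p answering
% item i of difficulty b_i.
P(X_{pi} = 1 \mid \theta_p, b_i) = \frac{e^{\theta_p - b_i}}{1 + e^{\theta_p - b_i}}
```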
Combined use of computational chemistry and chemoinformatics methods for chemical discovery
DOE Office of Scientific and Technical Information (OSTI.GOV)
Sugimoto, Manabu, E-mail: sugimoto@kumamoto-u.ac.jp; Institute for Molecular Science, 38 Nishigo-Naka, Myodaiji, Okazaki 444-8585; CREST, Japan Science and Technology Agency, 4-1-8 Honcho, Kawaguchi, Saitama 332-0012
2015-12-31
Data analysis on numerical data from computational chemistry calculations is carried out to obtain knowledge information about molecules. A molecular database is developed to systematically store chemical, electronic-structure, and knowledge-based information. The database is used to find molecules related to the keyword “cancer”. Then electronic-structure calculations are performed to quantitatively evaluate the quantum chemical similarity of the molecules. Among the 377 compounds registered in the database, 24 molecules are found to be “cancer”-related. This set of molecules includes both carcinogens and anticancer drugs. The quantum chemical similarity analysis, which is carried out using numerical results of the density-functional theory calculations, shows that, when some energy spectra are referred to, carcinogens are reasonably distinguished from the anticancer drugs. These spectral properties are therefore considered important measures for classification.
Sweetness prediction of natural compounds.
Chéron, Jean-Baptiste; Casciuc, Iuri; Golebiowski, Jérôme; Antonczak, Serge; Fiorucci, Sébastien
2017-04-15
Based on the most exhaustive database of sweeteners with known sweetness values, a new quantitative structure-activity relationship model for sweetness prediction has been set up. Analysis of the physico-chemical properties of sweeteners in the database indicates that the structure of most potent sweeteners combines a hydrophobic scaffold functionalized by a limited number of hydrogen bond sites (fewer than 4 hydrogen bond donors and 10 acceptors), with a moderate molecular weight ranging from 350 to 450 g·mol⁻¹. Prediction of sweetness, bitterness and toxicity properties of the largest database of natural compounds has been performed. In silico screening reveals that the majority of the predicted natural intense sweeteners comprise saponin or stevioside scaffolds. The model highlights that their sweetness potency is comparable to known natural sweeteners. The identified compounds provide a rational basis to initiate the design and chemosensory analysis of new low-calorie sweeteners.
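The physico-chemical profile quoted above can be turned into a quick screening filter with RDKit. Treating the reported ranges as hard cutoffs is our simplification, not the paper's QSAR model.

```python
from rdkit import Chem
from rdkit.Chem import Descriptors, Lipinski

# Filter: <4 H-bond donors, <=10 acceptors, MW between 350 and 450 g/mol,
# the property window reported for potent sweeteners in the abstract.
def sweetener_like(smiles: str) -> bool:
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return False
    return (Lipinski.NumHDonors(mol) < 4
            and Lipinski.NumHAcceptors(mol) <= 10
            and 350 <= Descriptors.MolWt(mol) <= 450)

print(sweetener_like("OC[C@H]1OC(O)[C@H](O)[C@@H](O)[C@@H]1O"))  # glucose: too small
```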
Database constraints applied to metabolic pathway reconstruction tools.
Vilaplana, Jordi; Solsona, Francesc; Teixido, Ivan; Usié, Anabel; Karathia, Hiren; Alves, Rui; Mateo, Jordi
2014-01-01
Our group developed two biological applications, Biblio-MetReS and Homol-MetReS, accessing the same database of organisms with annotated genes. Biblio-MetReS is a data-mining application that facilitates the reconstruction of molecular networks based on automated text-mining analysis of published scientific literature. Homol-MetReS allows functional (re)annotation of proteomes, to properly identify both the individual proteins involved in the process(es) of interest and their function. It also enables the sets of proteins involved in the process(es) in different organisms to be compared directly. The efficiency of these biological applications is directly related to the design of the shared database. We classified and analyzed the different kinds of access to the database. Based on this study, we tried to adjust and tune the configurable parameters of the database server to reach the best performance of the communication data link to/from the database system. Different database technologies were analyzed. We started the study with a public relational SQL database, MySQL. Then, the same database was implemented by a MapReduce-based database named HBase. The results indicated that the standard configuration of MySQL gives an acceptable performance for low or medium size databases. Nevertheless, tuning database parameters can greatly improve the performance and lead to very competitive runtimes.
CARINA data synthesis project: pH data scale unification and cruise adjustments
NASA Astrophysics Data System (ADS)
Velo, A.; Pérez, F. F.; Lin, X.; Key, R. M.; Tanhua, T.; de La Paz, M.; Olsen, A.; van Heuven, S.; Jutterström, S.; Ríos, A. F.
2010-05-01
Data on carbon and carbon-relevant hydrographic and hydrochemical parameters from 188 previously non-publicly available cruise data sets in the Arctic Mediterranean Seas (AMS), Atlantic Ocean and Southern Ocean have been retrieved and merged into a new database: CARINA (CARbon IN the Atlantic Ocean). These data have gone through rigorous quality control (QC) procedures to assure the highest possible quality and consistency. The data for most of the measured parameters in the CARINA database were objectively examined in order to quantify systematic differences in the reported values. Systematic biases found in the data have been corrected in the data products, three merged data files with measured, calculated and interpolated data for each of the three CARINA regions: AMS, Atlantic Ocean and Southern Ocean. Out of a total of 188 cruise entries in the CARINA database, 59 reported measured pH values. All reported pH data have been unified to the Sea-Water Scale (SWS) at 25 °C. Here we present details of the secondary QC of pH in the CARINA database and the scale unification to SWS at 25 °C. The pH scale has been converted for 36 cruises. Procedures of quality control, including crossover analysis between cruises and inversion analysis, are described. Adjustments were applied to the pH values for 21 of the cruises in the CARINA dataset. With these adjustments the CARINA database is consistent both internally as well as with the GLODAP data, an oceanographic data set based on the World Hydrographic Program in the 1990s. Based on our analysis we estimate the internal consistency of the CARINA pH data to be 0.005 pH units. The CARINA data are now suitable for accurate assessments of, for example, oceanic carbon inventories and uptake rates, for ocean acidification assessment and for model validation.
Molecular scaffold analysis of natural products databases in the public domain.
Yongye, Austin B; Waddell, Jacob; Medina-Franco, José L
2012-11-01
Natural products represent important sources of bioactive compounds in drug discovery efforts. In this work, we compiled five natural products databases available in the public domain and performed a comprehensive chemoinformatic analysis focused on the content and diversity of the scaffolds with an overview of the diversity based on molecular fingerprints. The natural products databases were compared with each other and with a set of molecules obtained from in-house combinatorial libraries, and with a general screening commercial library. It was found that publicly available natural products databases have different scaffold diversity. In contrast to the common concept that larger libraries have the largest scaffold diversity, the largest natural products collection analyzed in this work was not the most diverse. The general screening library showed, overall, the highest scaffold diversity. However, considering the most frequent scaffolds, the general reference library was the least diverse. In general, natural products databases in the public domain showed low molecule overlap. In addition to benzene and acyclic compounds, flavones, coumarins, and flavanones were identified as the most frequent molecular scaffolds across the different natural products collections. The results of this work have direct implications in the computational and experimental screening of natural product databases for drug discovery.
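Scaffold-content analyses of this kind are commonly done with Bemis-Murcko scaffolds. A small RDKit sketch with a stand-in molecule list follows; the paper's own scaffold definition may differ in detail.

```python
from collections import Counter
from rdkit import Chem
from rdkit.Chem.Scaffolds import MurckoScaffold

# Reduce each molecule to its Murcko scaffold and tabulate frequencies.
smiles = ["CC1=CC(=O)Oc2ccccc21",   # 4-methylcoumarin
          "O=C1C=Cc2ccccc2O1",      # coumarin
          "CCO"]                    # ethanol (acyclic)
scaffolds = Counter()
for smi in smiles:
    mol = Chem.MolFromSmiles(smi)
    if mol is None:
        continue
    core = MurckoScaffold.GetScaffoldForMol(mol)
    scaffolds[Chem.MolToSmiles(core)] += 1            # canonical scaffold SMILES

for scaf, n in scaffolds.most_common():
    print(n, scaf or "(acyclic)")                     # acyclics give an empty scaffold
```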
Database Software for Social Studies. A MicroSIFT Quarterly Report.
ERIC Educational Resources Information Center
Weaver, Dave
The report describes and evaluates the use of a set of learning tools called database managers, and the creation of databases with them, to help teach problem-solving skills in social studies. Details include the design, building, and use of databases in a social studies setting, along with advantages and disadvantages of using them. The three types of…
The Gene Set Builder: collation, curation, and distribution of sets of genes
Yusuf, Dimas; Lim, Jonathan S; Wasserman, Wyeth W
2005-01-01
Background In bioinformatics and genomics, there are many applications designed to investigate the common properties of a set of genes. Often, these multi-gene analysis tools attempt to reveal sequential, functional, and expressional ties. However, while tremendous effort has been invested in developing tools that can analyze a set of genes, minimal effort has been invested in developing tools that can help researchers compile, store, and annotate gene sets in the first place. As a result, the process of making or accessing a set often involves tedious and time-consuming steps such as finding identifiers for each individual gene. These steps are often repeated extensively to shift from one identifier type to another, or to recreate a published set. In this paper, we present a simple online tool which – with the help of the gene catalogs Ensembl and GeneLynx – can help researchers build and annotate sets of genes quickly and easily. Description The Gene Set Builder is a database-driven, web-based tool designed to help researchers compile, store, export, and share sets of genes. This application supports the 17 eukaryotic genomes found in version 32 of the Ensembl database, which includes species from yeast to human. User-created information such as sets and customized annotations are stored to facilitate easy access. Gene sets stored in the system can be "exported" in a variety of output formats – as lists of identifiers, in tables, or as sequences. In addition, gene sets can be "shared" with specific users to facilitate collaborations or fully released to provide access to published results. The application also features a Perl API (Application Programming Interface) for direct connectivity to custom analysis tools. A downloadable Quick Reference guide and an online tutorial are available to help new users learn its functionalities. Conclusion The Gene Set Builder is an Ensembl-facilitated online tool designed to help researchers compile and manage sets of genes in a user-friendly environment. PMID:16371163
GeneSCF: a real-time based functional enrichment tool with support for multiple organisms.
Subhash, Santhilal; Kanduri, Chandrasekhar
2016-09-13
High-throughput technologies such as ChIP-sequencing, RNA-sequencing, DNA sequencing and quantitative metabolomics generate huge volumes of data. Researchers often rely on functional enrichment tools to interpret the biological significance of the affected genes from these high-throughput studies. However, currently available functional enrichment tools need to be updated frequently to adapt to new entries in the functional database repositories. Hence there is a need for a simplified tool that can perform functional enrichment analysis using updated information directly from source databases such as KEGG, Reactome or Gene Ontology. In this study, we focused on designing a command-line tool called GeneSCF (Gene Set Clustering based on Functional annotations) that can predict functionally relevant biological information for a set of genes in a real-time updated manner. It is designed to handle information for more than 4000 organisms from freely available prominent functional databases such as KEGG, Reactome and Gene Ontology. We successfully employed our tool on two published datasets to predict biologically relevant functional information. The core features of this tool were tested on Linux machines without the need to install additional dependencies. GeneSCF is more reliable than other enrichment tools because of its ability to use reference functional databases in real time to perform enrichment analysis. It integrates easily with other pipelines for downstream analysis of high-throughput data. More importantly, GeneSCF can run multiple gene lists simultaneously on different organisms, thereby saving time for users. Since the tool is designed to be ready to use, there is no need for complex compilation and installation procedures.
Chen, Tsute; Yu, Wen-Han; Izard, Jacques; Baranova, Oxana V.; Lakshmanan, Abirami; Dewhirst, Floyd E.
2010-01-01
The human oral microbiome is the most studied human microflora, but 53% of the species have not yet been validly named and 35% remain uncultivated. The uncultivated taxa are known primarily from 16S rRNA sequence information. Sequence information tied solely to obscure isolate or clone numbers, and usually lacking accurate phylogenetic placement, is a major impediment to working with human oral microbiome data. The goal of creating the Human Oral Microbiome Database (HOMD) is to provide the scientific community with a body site-specific comprehensive database for the more than 600 prokaryote species that are present in the human oral cavity based on a curated 16S rRNA gene-based provisional naming scheme. Currently, two primary types of information are provided in HOMD—taxonomic and genomic. Named oral species and taxa identified from 16S rRNA gene sequence analysis of oral isolates and cloning studies were placed into defined 16S rRNA phylotypes and each given a unique Human Oral Taxon (HOT) number. The HOT interlinks phenotypic, phylogenetic, genomic, clinical and bibliographic information for each taxon. A BLAST search tool is provided to match user 16S rRNA gene sequences to a curated, full-length, 16S rRNA gene reference data set. For genomic analysis, HOMD provides a comprehensive set of analysis tools and maintains frequently updated annotations for all the human oral microbial genomes that have been sequenced and publicly released. Oral bacterial genome sequences, determined as part of the Human Microbiome Project, are being added to the HOMD as they become available. We provide HOMD as a conceptual model for the presentation of microbiome data for other human body sites. Database URL: http://www.homd.org PMID:20624719
DataSpread: Unifying Databases and Spreadsheets.
Bendre, Mangesh; Sun, Bofan; Zhang, Ding; Zhou, Xinyan; Chang, Kevin ChenChuan; Parameswaran, Aditya
2015-08-01
Spreadsheet software is often the tool of choice for ad-hoc tabular data management, processing, and visualization, especially on tiny data sets. On the other hand, relational database systems offer significant power, expressivity, and efficiency over spreadsheet software for data management, while lacking in the ease of use and ad-hoc analysis capabilities. We demonstrate DataSpread, a data exploration tool that holistically unifies databases and spreadsheets. It continues to offer a Microsoft Excel-based spreadsheet front-end, while in parallel managing all the data in a back-end database, specifically, PostgreSQL. DataSpread retains all the advantages of spreadsheets, including ease of use, ad-hoc analysis and visualization capabilities, and a schema-free nature, while also adding the advantages of traditional relational databases, such as scalability and the ability to use arbitrary SQL to import, filter, or join external or internal tables and have the results appear in the spreadsheet. DataSpread needs to reason about and reconcile differences in the notions of schema, addressing of cells and tuples, and the current "pane" (which exists in spreadsheets but not in traditional databases), and support data modifications at both the front-end and the back-end. Our demonstration will center on our first and early prototype of the DataSpread, and will give the attendees a sense for the enormous data exploration capabilities offered by unifying spreadsheets and databases.
DataSpread: Unifying Databases and Spreadsheets
Bendre, Mangesh; Sun, Bofan; Zhang, Ding; Zhou, Xinyan; Chang, Kevin ChenChuan; Parameswaran, Aditya
2015-01-01
Spreadsheet software is often the tool of choice for ad-hoc tabular data management, processing, and visualization, especially on tiny data sets. On the other hand, relational database systems offer significant power, expressivity, and efficiency over spreadsheet software for data management, while lacking in the ease of use and ad-hoc analysis capabilities. We demonstrate DataSpread, a data exploration tool that holistically unifies databases and spreadsheets. It continues to offer a Microsoft Excel-based spreadsheet front-end, while in parallel managing all the data in a back-end database, specifically, PostgreSQL. DataSpread retains all the advantages of spreadsheets, including ease of use, ad-hoc analysis and visualization capabilities, and a schema-free nature, while also adding the advantages of traditional relational databases, such as scalability and the ability to use arbitrary SQL to import, filter, or join external or internal tables and have the results appear in the spreadsheet. DataSpread needs to reason about and reconcile differences in the notions of schema, addressing of cells and tuples, and the current “pane” (which exists in spreadsheets but not in traditional databases), and support data modifications at both the front-end and the back-end. Our demonstration will center on our first and early prototype of the DataSpread, and will give the attendees a sense for the enormous data exploration capabilities offered by unifying spreadsheets and databases. PMID:26900487
Translation from the collaborative OSM database to cartography
NASA Astrophysics Data System (ADS)
Hayat, Flora
2018-05-01
The OpenStreetMap (OSM) database includes original items that are very useful for geographical analysis and for creating thematic maps. Contributors record in the open database various themes regarding amenities, leisure, transport, buildings and boundaries. The Michelin mapping department develops map prototypes to test the feasibility of mapping based on OSM. A research project is in development to translate the OSM database structure into a database structure fitted to Michelin graphic guidelines. It aims at defining the right structure for Michelin's uses. The research project relies on the analysis of semantic and geometric heterogeneities in OSM data. To that end, Michelin implements methods to transform the input geographical database into a cartographic image dedicated to specific uses (routing and tourist maps). The paper focuses on the mapping tools available to produce a personalised spatial database. Based on the processed data, paper and web maps can be displayed. Two prototypes are described in this article: a vector tile web map and a mapping method to produce paper maps on a regional scale. The vector tile mapping method offers easy navigation within the map and within graphic and thematic guidelines. Paper maps can be partly drawn automatically. The drawing automation and data management are part of the map creation process, as well as the final hand-drawing phase. Both prototypes have been set up using the OSM technical ecosystem.
The LANL hemorrhagic fever virus database, a new platform for analyzing biothreat viruses
Kuiken, Carla; Thurmond, Jim; Dimitrijevic, Mira; Yoon, Hyejin
2012-01-01
Hemorrhagic fever viruses (HFVs) are a diverse set of over 80 viral species, found in 10 different genera comprising five different families: arena-, bunya-, flavi-, filo- and togaviridae. All these viruses are highly variable and evolve rapidly, making them elusive targets for the immune system and for vaccine and drug design. About 55 000 HFV sequences exist in the public domain today. A central website that provides annotated sequences and analysis tools will be helpful to HFV researchers worldwide. The HFV sequence database collects and stores sequence data and provides a user-friendly search interface and a large number of sequence analysis tools, following the model of the highly regarded and widely used Los Alamos HIV database [Kuiken, C., B. Korber, and R.W. Shafer, HIV sequence databases. AIDS Rev, 2003. 5: p. 52–61]. The database uses an algorithm that aligns each sequence to a species-wide reference sequence. The NCBI RefSeq database [Sayers et al. (2011) Database resources of the National Center for Biotechnology Information. Nucleic Acids Res., 39, D38–D51.] is used for this; if a reference sequence is not available, a Blast search finds the best candidate. Using this method, sequences in each genus can be retrieved pre-aligned. The HFV website can be accessed via http://hfv.lanl.gov. PMID:22064861
probeBase—an online resource for rRNA-targeted oligonucleotide probes and primers: new features 2016
Greuter, Daniel; Loy, Alexander; Horn, Matthias; Rattei, Thomas
2016-01-01
probeBase http://www.probebase.net is a manually maintained and curated database of rRNA-targeted oligonucleotide probes and primers. Contextual information and multiple options for evaluating in silico hybridization performance against the most recent rRNA sequence databases are provided for each oligonucleotide entry, which makes probeBase an important and frequently used resource for microbiology research and diagnostics. Here we present a major update of probeBase, which was last featured in the NAR Database Issue 2007. This update describes a complete remodeling of the database architecture and environment to accommodate computationally efficient access. Improved search functions, sequence match tools and data output now extend the opportunities for finding suitable hierarchical probe sets that target an organism or taxon at different taxonomic levels. To facilitate the identification of complementary probe sets for organisms represented by short rRNA sequence reads generated by amplicon sequencing or metagenomic analysis with next generation sequencing technologies such as Illumina and IonTorrent, we introduce a novel tool that recovers surrogate near full-length rRNA sequences for short query sequences and finds matching oligonucleotides in probeBase. PMID:26586809
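In silico hybridization evaluation, at its core, slides an oligonucleotide along a target sequence and reports hits within a mismatch budget. The following is a minimal sketch of that idea with toy sequences; probeBase's actual matching runs against curated rRNA databases and models hybridization far more carefully.

    # Report (position, mismatch-count) hits of a probe against a target.
    def probe_hits(probe, target, max_mismatches=1):
        hits = []
        for i in range(len(target) - len(probe) + 1):
            window = target[i:i + len(probe)]
            mm = sum(a != b for a, b in zip(probe, window))
            if mm <= max_mismatches:
                hits.append((i, mm))
        return hits

    print(probe_hits("GATCCTG", "TTGATCCTGGCTCAGGATCATG"))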
Gaspar, Héléna A; Baskin, Igor I; Marcou, Gilles; Horvath, Dragos; Varnek, Alexandre
2015-01-26
This paper is devoted to the analysis and visualization in 2-dimensional space of large data sets of millions of compounds using the incremental version of generative topographic mapping (iGTM). The iGTM algorithm implemented in the in-house ISIDA-GTM program was applied to a database of more than 2 million compounds combining data sets of 36 chemical suppliers and the NCI collection, encoded either by MOE descriptors or by MACCS keys. Taking advantage of the probabilistic nature of GTM, several approaches to data analysis were proposed. The chemical space coverage was evaluated using the normalized Shannon entropy. Different views of the data (property landscapes) were obtained by mapping various physical and chemical properties (molecular weight, aqueous solubility, LogP, etc.) onto the iGTM map. The superposition of these views helped to identify the regions in the chemical space populated by compounds with desirable physicochemical profiles and the suppliers providing them. The similarity of the data sets in the latent space was assessed by applying several metrics (Euclidean distance, Tanimoto and Bhattacharyya coefficients) to data probability distributions based on cumulated responsibility vectors. As a complementary approach, data sets were compared by considering them as individual objects on a meta-GTM map, built on cumulated responsibility vectors or property landscapes produced with iGTM. We believe that the iGTM methodology described in this article represents a fast and reliable way to analyze and visualize large chemical databases.
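Two of the statistics named above are simple to state: the normalized Shannon entropy of a data set's distribution over map nodes (chemical space coverage), and the Bhattacharyya coefficient between two such distributions (data set similarity). A minimal sketch follows; the probability vectors are toy stand-ins for cumulated responsibility vectors.

    import numpy as np

    def normalized_entropy(p):
        # Shannon entropy divided by log(n), so 1.0 means uniform coverage.
        p = np.asarray(p, dtype=float)
        p = p / p.sum()
        nz = p[p > 0]
        return float(-(nz * np.log(nz)).sum() / np.log(len(p)))

    def bhattacharyya(p, q):
        # Overlap of two discrete distributions; 1.0 means identical.
        p = np.asarray(p, float) / np.sum(p)
        q = np.asarray(q, float) / np.sum(q)
        return float(np.sum(np.sqrt(p * q)))

    a = [0.4, 0.3, 0.2, 0.1]   # supplier A over 4 GTM nodes (toy values)
    b = [0.1, 0.2, 0.3, 0.4]   # supplier B
    print(normalized_entropy(a), bhattacharyya(a, b))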
A web-based system architecture for ontology-based data integration in the domain of IT benchmarking
NASA Astrophysics Data System (ADS)
Pfaff, Matthias; Krcmar, Helmut
2018-03-01
In the domain of IT benchmarking (ITBM), a variety of data and information are collected. Although these data serve as the basis for business analyses, no unified semantic representation of such data yet exists. Consequently, data analysis across different distributed data sets and different benchmarks is almost impossible. This paper presents a system architecture and prototypical implementation for an integrated data management of distributed databases based on a domain-specific ontology. To preserve the semantic meaning of the data, the ITBM ontology is linked to data sources and functions as the central concept for database access. Thus, additional databases can be integrated by linking them to this domain-specific ontology and are directly available for further business analyses. Moreover, the web-based system supports the process of mapping ontology concepts to external databases by introducing a semi-automatic mapping recommender and by visualizing possible mapping candidates. The system also provides a natural language interface to easily query linked databases. The expected result of this ontology-based approach of knowledge representation and data access is an increase in knowledge and data sharing in this domain, which will enhance existing business analysis methods.
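The central mechanism, an ontology concept linked to concrete database locations so queries can be routed through the concept, can be sketched in a few lines. All names below are invented placeholders; the ITBM ontology and its mapping layer are far richer than this toy lookup.

    # Route a concept-level query to whichever benchmark database holds it.
    CONCEPT_MAP = {
        "it_cost_per_user": ("benchmark_2016", "costs", "cost_per_user"),
        "helpdesk_tickets": ("benchmark_2017", "service_desk", "ticket_count"),
    }

    def sql_for(concept):
        db, table, column = CONCEPT_MAP[concept]
        return f"SELECT {column} FROM {db}.{table}"

    print(sql_for("it_cost_per_user"))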
Manasa, Justen; Lessells, Richard; Rossouw, Theresa; Naidu, Kevindra; Van Vuuren, Cloete; Goedhals, Dominique; van Zyl, Gert; Bester, Armand; Skingsley, Andrew; Stott, Katharine; Danaviah, Siva; Chetty, Terusha; Singh, Lavanya; Moodley, Pravi; Iwuji, Collins; McGrath, Nuala; Seebregts, Christopher J.; de Oliveira, Tulio
2014-01-01
Substantial amounts of data have been generated from patient management and academic exercises designed to better understand the human immunodeficiency virus (HIV) epidemic and design interventions to control it. A number of specialized databases have been designed to manage huge data sets from HIV cohort, vaccine, host genomic and drug resistance studies. Besides databases from cohort studies, most of the online databases contain limited curated data and are thus sequence repositories. HIV drug resistance has been shown to have a great potential to derail the progress made thus far through antiretroviral therapy. Thus, a lot of resources have been invested in generating drug resistance data for patient management and surveillance purposes. Unfortunately, most of the data currently available relate to subtype B, even though >60% of the epidemic is caused by HIV-1 subtype C. A consortium of clinicians, scientists, public health experts and policy makers working in southern Africa came together and formed a network, the Southern African Treatment and Resistance Network (SATuRN), with the aim of increasing curated HIV-1 subtype C and tuberculosis drug resistance data. This article describes the HIV-1 data curation process using the SATuRN Rega database. The data curation is a manual and time-consuming process done by clinical, laboratory and data curation specialists. Access to the highly curated data sets is through applications that are reviewed by the SATuRN executive committee. Examples of research outputs from the analysis of the curated data include trends in the level of transmitted drug resistance in South Africa, analysis of the levels of acquired resistance among patients failing therapy and factors associated with the absence of genotypic evidence of drug resistance among patients failing therapy. All these studies have been important for informing first- and second-line therapy. The database is a free, password-protected, open-source resource available at www.bioafrica.net. Database URL: http://www.bioafrica.net/regadb/ PMID:24504151
STBase: One Million Species Trees for Comparative Biology
McMahon, Michelle M.; Deepak, Akshay; Fernández-Baca, David; Boss, Darren; Sanderson, Michael J.
2015-01-01
Comprehensively sampled phylogenetic trees provide the most compelling foundations for strong inferences in comparative evolutionary biology. Mismatches are common, however, between the taxa for which comparative data are available and the taxa sampled by published phylogenetic analyses. Moreover, many published phylogenies are gene trees, which cannot always be adapted immediately for species level comparisons because of discordance, gene duplication, and other confounding biological processes. A new database, STBase, lets comparative biologists quickly retrieve species level phylogenetic hypotheses in response to a query list of species names. The database consists of 1 million single- and multi-locus data sets, each with a confidence set of 1000 putative species trees, computed from GenBank sequence data for 413,000 eukaryotic taxa. Two bodies of theoretical work are leveraged to aid in the assembly of multi-locus concatenated data sets for species tree construction. First, multiply labeled gene trees are pruned to conflict-free singly-labeled species-level trees that can be combined between loci. Second, impacts of missing data in multi-locus data sets are ameliorated by assembling only decisive data sets. Data sets overlapping with the user’s query are ranked using a scheme that depends on user-provided weights for tree quality and for taxonomic overlap of the tree with the query. Retrieval times are independent of the size of the database, typically a few seconds. Tree quality is assessed by a real-time evaluation of bootstrap support on just the overlapping subtree. Associated sequence alignments, tree files and metadata can be downloaded for subsequent analysis. STBase provides a tool for comparative biologists interested in exploiting the most relevant sequence data available for the taxa of interest. It may also serve as a prototype for future species tree oriented databases and as a resource for assembly of larger species phylogenies from precomputed trees. PMID:25679219
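The user-weighted ranking step described above lends itself to a compact illustration. The sketch below assumes a simple linear combination of tree quality and taxonomic overlap; the field names, weights, and linear form are invented for illustration, and STBase's actual scoring scheme may differ.

    # Rank candidate trees by weighted quality and overlap with the query.
    def rank_trees(trees, query_taxa, w_quality=0.5, w_overlap=0.5):
        query = set(query_taxa)
        def score(t):
            overlap = len(query & t["taxa"]) / len(query)
            return w_quality * t["quality"] + w_overlap * overlap
        return sorted(trees, key=score, reverse=True)

    trees = [{"id": 1, "quality": 0.9, "taxa": {"a", "b"}},
             {"id": 2, "quality": 0.6, "taxa": {"a", "b", "c", "d"}}]
    print([t["id"] for t in rank_trees(trees, ["a", "b", "c", "d"])])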
Buell, Gary R.; Wehmeyer, Loren L.; Calhoun, Daniel L.
2012-01-01
A hydrologic and landscape database was developed by the U.S. Geological Survey, in cooperation with the U.S. Fish and Wildlife Service, for the Cache River and White River National Wildlife Refuges and their contributing watersheds in Arkansas, Missouri, and Oklahoma. The database is composed of a set of ASCII files, Microsoft Access® files, Microsoft Excel® files, an Environmental Systems Research Institute (ESRI) ArcGIS® geodatabase, ESRI ArcGRID® raster datasets, and an ESRI ArcReader® published map. The database was developed as an assessment and evaluation tool to use in examining refuge-specific hydrologic patterns and trends as related to water availability for refuge ecosystems, habitats, and target species; and includes hydrologic time-series data, statistics, and hydroecological metrics that can be used to assess refuge hydrologic conditions and the availability of aquatic and riparian habitat. Landscape data that describe the refuge physiographic setting and the locations of hydrologic-data collection stations are also included in the database. Categories of landscape data include land cover, soil hydrologic characteristics, physiographic features, geographic and hydrographic boundaries, hydrographic features, regional runoff estimates, and gaging-station locations. The database geographic extent covers three hydrologic subregions—the Lower Mississippi–St Francis (0802), the Upper White (1101), and the Lower Arkansas (1111)—within which human activities, climatic variation, and hydrologic processes can potentially affect the hydrologic regime of the refuges and adjacent areas. Database construction has been automated to facilitate periodic updates with new data. The database report (1) serves as a user guide for the database, (2) describes the data-collection, data-reduction, and data-analysis methods used to construct the database, (3) provides a statistical and graphical description of the database, and (4) provides detailed information on the development of analytical techniques designed to assess water availability for ecological needs.
Pathway results from the chicken data set using GOTM, Pathway Studio and Ingenuity softwares
Bonnet, Agnès; Lagarrigue, Sandrine; Liaubet, Laurence; Robert-Granié, Christèle; SanCristobal, Magali; Tosser-Klopp, Gwenola
2009-01-01
Background As presented in the introduction paper, three sets of differentially regulated genes were found after the analysis of the chicken infection data set from EADGENE. Different methods were used to interpret these results. Results GOTM, Pathway Studio and Ingenuity software packages were used to investigate the three lists of genes. The three packages allowed the analysis of the data and highlighted different networks. However, only one set of genes, showing a differential expression between the primary and secondary response, gave a significant biological interpretation. Conclusion Combining these databases, which were developed independently on different annotation sources, supplies a useful tool for a global biological interpretation of microarray data, even if they may contain some imperfections (e.g., genes that are unannotated or poorly annotated). PMID:19615111
Bowden, Peter; Beavis, Ron; Marshall, John
2009-11-02
A goodness of fit test may be used to assign tandem mass spectra of peptides to amino acid sequences and to directly calculate the expected probability of mis-identification. The product of the peptide expectation values directly yields the probability that the parent protein has been mis-identified. A relational database could capture the mass spectral data, the best fit results, and permit subsequent calculations by a general statistical analysis system. The many files of the HUPO blood protein data correlated by X!TANDEM against the proteins of ENSEMBL were collected into a relational database. A redundant set of 247,077 proteins and peptides was correlated by X!TANDEM, and that was collapsed to a set of 34,956 peptides from 13,379 distinct proteins. About 6875 distinct proteins were represented by only a single distinct peptide, 2866 proteins showed 2 distinct peptides, and 3454 proteins showed at least three distinct peptides by X!TANDEM. More than 99% of the peptides were associated with proteins that had cumulative expectation values, i.e. probability of false positive identification, of one in one hundred or less. The distribution of peptides per protein from X!TANDEM was significantly different from that expected from random assignment of peptides.
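The key arithmetic here, that the product of peptide expectation values gives the protein mis-identification probability, is best computed in log space to avoid underflow for proteins with many peptides. A minimal sketch with invented e-values:

    import math

    def protein_log10_expectation(peptide_evalues):
        # log10 of the product of per-peptide expectation values.
        return sum(math.log10(e) for e in peptide_evalues)

    peptides = [1e-3, 5e-4, 2e-2]   # toy X!TANDEM peptide e-values
    log_e = protein_log10_expectation(peptides)
    print(f"protein expectation ~ 1e{log_e:.1f}")   # ~1e-8.0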
NASA Astrophysics Data System (ADS)
Damm, Bodo; Klose, Martin
2014-05-01
This contribution presents an initiative to develop a national landslide database for the Federal Republic of Germany. It highlights the structure and contents of the landslide database and outlines its major data sources and the strategy of information retrieval. Furthermore, the contribution exemplifies the database's potential in applied landslide impact research, including statistics of landslide damage, repair, and mitigation. Thanks to systematic regional data compilation, the landslide database offers a differentiated data pool of more than 5,000 data sets and over 13,000 single data files. It dates back to 1137 AD and covers landslide sites throughout Germany. In seven main data blocks, the landslide database stores, besides information on landslide types, dimensions, and processes, additional data on soil and bedrock properties, geomorphometry, and climatic or other major triggering events. A peculiarity of this landslide database is its storage of data sets on land use effects, damage impacts, hazard mitigation, and landslide costs. Compilation of landslide data is based on a two-tier strategy of data collection. The first step of information retrieval includes systematic web content mining and exploration of online archives of emergency agencies, fire and police departments, and news organizations. Using web and RSS feeds, and soon also a focused web crawler, this enables effective nationwide data collection for recent landslides. On the basis of this information, in-depth data mining is performed to deepen and diversify the data pool in key landslide areas. This makes it possible to gather detailed landslide information from, amongst others, agency records, geotechnical reports, climate statistics, maps, and satellite imagery. Landslide data are extracted from these information sources using a mix of methods, including statistical techniques, imagery analysis, and qualitative text interpretation. The landslide database is currently being migrated to a spatial database system running on PostgreSQL/PostGIS. This provides advanced functionality for spatial data analysis and forms the basis for future data provision and visualization using a WebGIS application. Analysis of landslide database contents shows that in most parts of Germany landslides primarily affect transportation infrastructure. Although with distinctly lower frequency, recent landslides are also recorded to cause serious damage to hydraulic facilities and waterways, supply and disposal infrastructure, sites of cultural heritage, as well as forest, agricultural, and mining areas. The main types of landslide damage are failure of cut and fill slopes, destruction of retaining walls, street lights, and forest stocks, burial of roads, backyards, and garden areas, as well as crack formation in foundations, sewer lines, and building walls. Landslide repair and mitigation at transportation infrastructure is dominated by simple solutions such as catch barriers or rock fall drapery. These solutions are often undersized and fail under stress. The use of costly slope stabilization or protection systems has been proven to reduce these risks effectively over longer maintenance cycles. The right balancing of landslide mitigation is thus a crucial problem in managing landslide risks. Development and analysis of such landslide databases helps to support decision-makers in finding efficient solutions to minimize landslide risks for human beings, infrastructure, and financial assets.
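The first tier of the collection strategy, monitoring web and RSS feeds for landslide reports, can be sketched briefly. The feed URL and keyword list below are placeholders; the actual project combines many such sources with focused crawling and manual follow-up.

    # Poll an RSS feed and keep items mentioning landslide-related keywords.
    import feedparser   # pip install feedparser

    KEYWORDS = ("landslide", "rockfall", "slope failure", "debris flow")

    def landslide_items(feed_url):
        feed = feedparser.parse(feed_url)
        for entry in feed.entries:
            text = (entry.get("title", "") + " " + entry.get("summary", "")).lower()
            if any(k in text for k in KEYWORDS):
                yield entry.get("title"), entry.get("link")

    for title, link in landslide_items("https://example.org/regional-news.rss"):
        print(title, link)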
Lack, N
2001-08-01
The introduction of the modified data set for quality assurance in obstetrics (formerly the perinatal survey) in Lower Saxony and Bavaria as early as 1999 created an urgent requirement for a correspondingly new statistical analysis of the revised data. The general outline of a new data reporting concept was originally presented by the Bavarian Commission for Perinatology and Neonatology at the Munich Perinatal Conference in November 1997. These ideas informed the content and layout of the new quality report for obstetrics, currently in its nationwide harmonisation phase coordinated by the federal office for quality assurance in hospital care. A flexible and modular database-oriented analysis tool developed in Bavaria is now in its second year of successful operation. The functionalities of this system are described in detail.
Improved Bond Equations for Fiber-Reinforced Polymer Bars in Concrete
Pour, Sadaf Moallemi; Alam, M. Shahria; Milani, Abbas S.
2016-01-01
This paper explores a set of new equations to predict the bond strength between fiber-reinforced polymer (FRP) rebar and concrete. The proposed equations are based on a comprehensive statistical analysis of existing experimental results in the literature. Namely, the parameters with the greatest effect on the bond behavior between FRP and concrete were first identified by applying a factorial analysis to a part of the available database. Then the database, which contains 250 pullout tests, was divided into four groups based on the concrete compressive strength and the rebar surface. Afterward, nonlinear regression analysis was performed for each study group in order to determine the bond equations. The results show that the proposed equations can predict bond strengths more accurately than other previously reported models. PMID:28773859
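The per-group regression step can be sketched with scipy. This is a hedged illustration only: the model form, parameters, and data points below are invented placeholders, not the paper's fitted equations or its pullout-test database.

    import numpy as np
    from scipy.optimize import curve_fit

    def bond_model(X, a, b, c):
        # Hypothetical bond-strength model in concrete strength fc and
        # bar diameter d; the paper fits its own forms per group.
        fc, d = X
        return a * np.sqrt(fc) + b / d + c

    fc = np.array([30.0, 40.0, 50.0, 60.0])     # MPa, toy values
    d = np.array([12.0, 16.0, 12.0, 16.0])      # mm, toy values
    tau = np.array([8.1, 8.6, 9.9, 10.4])       # measured bond strength, toy
    params, _ = curve_fit(bond_model, (fc, d), tau)
    print("fitted a, b, c:", params)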
Semantic memory: a feature-based analysis and new norms for Italian.
Montefinese, Maria; Ambrosini, Ettore; Fairfield, Beth; Mammarella, Nicola
2013-06-01
Semantic norms for properties produced by native speakers are valuable tools for researchers interested in the structure of semantic memory and in category-specific semantic deficits in individuals following brain damage. The aims of this study were threefold. First, we sought to extend existing semantic norms by adopting an empirical approach to category (Exp. 1) and concept (Exp. 2) selection, in order to obtain a more representative set of semantic memory features. Second, we extensively outlined a new set of semantic production norms collected from Italian native speakers for 120 artifactual and natural basic-level concepts, using numerous measures and statistics following a feature-listing task (Exp. 3b). Finally, we aimed to create a new publicly accessible database, since only a few existing databases are publicly available online.
Extending GIS Technology to Study Karst Features of Southeastern Minnesota
NASA Astrophysics Data System (ADS)
Gao, Y.; Tipping, R. G.; Alexander, E. C.; Alexander, S. C.
2001-12-01
This paper summarizes ongoing research on karst feature distribution in southeastern Minnesota. The main goals of this interdisciplinary research are: 1) to look for large-scale patterns in the rate and distribution of sinkhole development; 2) to conduct statistical tests of hypotheses about the formation of sinkholes; 3) to create management tools for land-use managers and planners; and 4) to deliver geomorphic and hydrogeologic criteria for making scientifically valid land-use policies and ethical decisions in karst areas of southeastern Minnesota. Existing county and sub-county karst feature datasets of southeastern Minnesota have been assembled into a large GIS-based database capable of analyzing the entire data set. The central database management system (DBMS) is a relational GIS-based system interacting with three modules: GIS, statistical and hydrogeologic. ArcInfo and ArcView were used to generate a series of 2D and 3D maps depicting karst feature distributions in southeastern Minnesota. IRIS Explorer™ was used to produce high-quality 3D maps and animations using data exported from the GIS-based database. Nearest-neighbor analysis has been used to test sinkhole distributions in different topographic and geologic settings. All current nearest-neighbor analyses indicate that sinkholes in southeastern Minnesota are not evenly distributed in this area (i.e., they tend to be clustered). More detailed statistical methods such as cluster analysis, histograms, probability estimation, correlation and regression have been used to study the spatial distributions of some mapped karst features of southeastern Minnesota. A sinkhole probability map for Goodhue County has been constructed based on sinkhole distribution, bedrock geology, depth to bedrock, GIS buffer analysis and nearest-neighbor analysis. A series of karst features for Winona County including sinkholes, springs, seeps, stream sinks and outcrops has been mapped and entered into the Karst Feature Database of Southeastern Minnesota. The Karst Feature Database of Winona County is being expanded to include all the mapped karst features of southeastern Minnesota. Air photos from the 1930s to the 1990s of the Spring Valley Cavern Area in Fillmore County were scanned and geo-referenced into our GIS system. This technique has proven very useful for identifying sinkholes and studying the rate of sinkhole development.
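The clustering test mentioned above is commonly formulated as the Clark-Evans nearest-neighbor ratio R, which compares the observed mean nearest-neighbor distance with the expectation under complete spatial randomness; R < 1 indicates clustering. A minimal sketch with synthetic coordinates (not the Minnesota sinkhole data):

    import numpy as np
    from scipy.spatial import cKDTree

    def clark_evans(points, area):
        pts = np.asarray(points, float)
        tree = cKDTree(pts)
        dist, _ = tree.query(pts, k=2)     # k=2: nearest neighbor besides self
        observed = dist[:, 1].mean()
        expected = 0.5 / np.sqrt(len(pts) / area)
        return observed / expected

    rng = np.random.default_rng(0)
    # Two tight synthetic clusters inside a ~6 x 6 study window.
    clustered = rng.normal(0, 0.1, size=(100, 2)) + rng.choice([0, 5], size=(100, 1))
    print("R =", round(clark_evans(clustered, area=36.0), 2))   # well below 1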
NASA Technical Reports Server (NTRS)
Brenton, James C.; Barbre, Robert E., Jr.; Decker, Ryan K.; Orcutt, John M.
2018-01-01
The National Aeronautics and Space Administration's (NASA) Marshall Space Flight Center (MSFC) Natural Environments Branch (EV44) has provided atmospheric databases and analysis in support of space vehicle design and day-of-launch operations for NASA and commercial launch vehicle programs launching from the NASA Kennedy Space Center (KSC), co-located on the United States Air Force's Eastern Range (ER) at the Cape Canaveral Air Force Station. The ER complex is one of the most heavily instrumented sites in the United States, with over 31 towers measuring various atmospheric parameters on a continuous basis. An inherent challenge with large data sets is ensuring that erroneous data are removed from databases, and thus excluded from launch vehicle design analyses. EV44 has put forth great effort in developing quality control (QC) procedures for individual meteorological instruments; however, no standard QC procedures for all databases currently exist, resulting in QC'd databases that have inconsistencies in variables, methodologies, and periods of record. The goal of this activity is to use the previous efforts by EV44 to develop a standardized set of QC procedures from which to build meteorological databases from KSC and the ER, while maintaining open communication with end users from the launch community to develop ways to improve, adapt and grow the QC database. Details of the QC procedures will be described. As the launch rate increases with additional launch vehicle programs, it is becoming more important that weather databases are continually updated and checked for data quality before use in launch vehicle design and certification analyses.
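Two checks of the kind such QC procedures standardize are a climatological range test and a simple spike (step) test on a tower time series. The sketch below is illustrative; the thresholds and sentinel value are placeholders, not EV44's actual criteria.

    import numpy as np

    def qc_flags(values, lo, hi, max_step):
        # Flag values outside [lo, hi] or jumping more than max_step
        # between consecutive observations.
        v = np.asarray(values, float)
        out_of_range = (v < lo) | (v > hi)
        spike = np.zeros_like(v, dtype=bool)
        spike[1:] = np.abs(np.diff(v)) > max_step
        return out_of_range | spike

    temps = [22.1, 22.3, 22.2, 35.9, 22.4, -99.0]   # -99.0: sentinel value
    print(qc_flags(temps, lo=-10.0, hi=45.0, max_step=5.0))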
The impact of database quality on keystroke dynamics authentication
NASA Astrophysics Data System (ADS)
Panasiuk, Piotr; Rybnik, Mariusz; Saeed, Khalid; Rogowski, Marcin
2016-06-01
This paper concerns keystroke dynamics, also partially in the context of touchscreen devices. The authors concentrate on the impact of database quality and propose their algorithm to test database quality issues. The algorithm is used on their own
QSAR Modeling Using Large-Scale Databases: Case Study for HIV-1 Reverse Transcriptase Inhibitors.
Tarasova, Olga A; Urusova, Aleksandra F; Filimonov, Dmitry A; Nicklaus, Marc C; Zakharov, Alexey V; Poroikov, Vladimir V
2015-07-27
Large-scale databases are important sources of training sets for various QSAR modeling approaches. Generally, these databases contain information extracted from different sources. This variety of sources can produce inconsistency in the data, defined as sometimes widely diverging activity results for the same compound against the same target. Because such inconsistency can reduce the accuracy of predictive models built from these data, we are addressing the question of how best to use data from publicly and commercially accessible databases to create accurate and predictive QSAR models. We investigate the suitability of commercially and publicly available databases to QSAR modeling of antiviral activity (HIV-1 reverse transcriptase (RT) inhibition). We present several methods for the creation of modeling (i.e., training and test) sets from two, either commercially or freely available, databases: Thomson Reuters Integrity and ChEMBL. We found that the typical predictivities of QSAR models obtained using these different modeling set compilation methods differ significantly from each other. The best results were obtained using training sets compiled for compounds tested using only one method and material (i.e., a specific type of biological assay). Compound sets aggregated by target only typically yielded poorly predictive models. We discuss the possibility of "mix-and-matching" assay data across aggregating databases such as ChEMBL and Integrity and their current severe limitations for this purpose. One of them is the general lack of complete and semantic/computer-parsable descriptions of assay methodology carried by these databases that would allow one to determine mix-and-matchability of result sets at the assay level.
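The best-performing compilation strategy above, keeping only compounds tested with a single assay type rather than pooling all records per target, is easy to sketch with pandas. The column names below are invented stand-ins for ChEMBL-style fields, not the databases' actual schemas.

    import pandas as pd

    records = pd.DataFrame({
        "compound": ["c1", "c1", "c2", "c3", "c3"],
        "assay_type": ["enzymatic", "cell", "enzymatic", "enzymatic", "cell"],
        "ic50_nM": [12.0, 480.0, 35.0, 7.0, 300.0],
    })
    # Restrict to one assay type, then reconcile replicates per compound.
    one_assay = records[records["assay_type"] == "enzymatic"]
    training = one_assay.groupby("compound", as_index=False)["ic50_nM"].median()
    print(training)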
ExAtlas: An interactive online tool for meta-analysis of gene expression data.
Sharov, Alexei A; Schlessinger, David; Ko, Minoru S H
2015-12-01
We have developed ExAtlas, an on-line software tool for meta-analysis and visualization of gene expression data. In contrast to existing software tools, ExAtlas compares multi-component data sets and generates results for all combinations (e.g. all gene expression profiles versus all Gene Ontology annotations). ExAtlas handles both users' own data and data extracted semi-automatically from the public repository (GEO/NCBI database). ExAtlas provides a variety of tools for meta-analyses: (1) standard meta-analysis (fixed effects, random effects, z-score, and Fisher's methods); (2) analyses of global correlations between gene expression data sets; (3) gene set enrichment; (4) gene set overlap; (5) gene association by expression profile; (6) gene specificity; and (7) statistical analysis (ANOVA, pairwise comparison, and PCA). ExAtlas produces graphical outputs, including heatmaps, scatter-plots, bar-charts, and three-dimensional images. Some of the most widely used public data sets (e.g. GNF/BioGPS, Gene Ontology, KEGG, GAD phenotypes, BrainScan, ENCODE ChIP-seq, and protein-protein interaction) are pre-loaded and can be used for functional annotations.
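Two of the standard combiners in the list above, Fisher's method and Stouffer's z-score method, are available directly in scipy. The per-study p-values below are illustrative only.

    # Combine per-study p-values for one gene across data sets.
    from scipy.stats import combine_pvalues

    pvals = [0.04, 0.10, 0.003]
    stat_f, p_fisher = combine_pvalues(pvals, method="fisher")
    stat_s, p_stouffer = combine_pvalues(pvals, method="stouffer")
    print(p_fisher, p_stouffer)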
NASA Astrophysics Data System (ADS)
Liang, Y.; Gallaher, D. W.; Grant, G.; Lv, Q.
2011-12-01
Change over time is the central driver of climate change detection. The goal is to diagnose the underlying causes and make projections into the future. In an effort to optimize this process we have developed the Data Rod model, an object-oriented approach that provides the ability to query grid cell changes and their relationships to neighboring grid cells through time. The time series data are organized in time-centric structures called "data rods." A single data rod can be pictured as the multi-spectral data history at one grid cell: a vertical column of data through time. This resolves the long-standing problem of managing time-series data and opens new possibilities for temporal data analysis. This structure enables rapid time-centric analysis at any grid cell across multiple sensors and satellite platforms. Collections of data rods can be spatially and temporally filtered, statistically analyzed, and aggregated for use with pattern matching algorithms. Likewise, individual image pixels can be extracted to generate multi-spectral imagery at any spatial and temporal location. The Data Rods project has created a series of prototype databases to store and analyze massive datasets containing multi-modality remote sensing data. Using object-oriented technology, this method overcomes the operational limitations of traditional relational databases. To demonstrate the speed and efficiency of time-centric analysis using the Data Rods model, we have developed a sea ice detection algorithm. This application determines the concentration of sea ice in a small spatial region across a long temporal window. If performed using traditional analytical techniques, this task would typically require extensive data downloads and spatial filtering. Using Data Rods databases, the exact spatio-temporal data set is immediately available; no extraneous data are downloaded, and all selected data querying occurs transparently on the server side. Moreover, fundamental statistical calculations such as running averages are easily implemented against the time-centric columns of data.
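The closing point, that temporal statistics become one-liners once a grid cell's full history is held as a single column, can be shown directly. The values below are synthetic stand-ins for one cell's multi-year record.

    import numpy as np

    # One "data rod": the complete time series at a single grid cell.
    rod = np.array([240.1, 241.0, 239.5, 238.8, 240.6, 242.2])
    window = 3
    running_mean = np.convolve(rod, np.ones(window) / window, mode="valid")
    print(running_mean)   # mean over each 3-step window along the rod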
CycADS: an annotation database system to ease the development and update of BioCyc databases
Vellozo, Augusto F.; Véron, Amélie S.; Baa-Puyoulet, Patrice; Huerta-Cepas, Jaime; Cottret, Ludovic; Febvay, Gérard; Calevro, Federica; Rahbé, Yvan; Douglas, Angela E.; Gabaldón, Toni; Sagot, Marie-France; Charles, Hubert; Colella, Stefano
2011-01-01
In recent years, genomes from an increasing number of organisms have been sequenced, but their annotation remains a time-consuming process. The BioCyc databases offer a framework for the integrated analysis of metabolic networks. The Pathway Tools software suite allows the automated construction of a database starting from an annotated genome, but it requires prior integration of all annotations into a specific summary file or into a GenBank file. To allow the easy creation and updating of a BioCyc database starting from the multiple genome annotation resources available over time, we have developed an ad hoc data management system that we called the Cyc Annotation Database System (CycADS). CycADS is centred on a specific database model and on a set of Java programs to import, filter and export relevant information. Data from GenBank and other annotation sources (including, for example, KAAS, PRIAM, Blast2GO and PhylomeDB) are collected into a database to be subsequently filtered and extracted to generate a complete annotation file. This file is then used to build an enriched BioCyc database using the PathoLogic program of Pathway Tools. The CycADS pipeline for annotation management was used to build the AcypiCyc database for the pea aphid (Acyrthosiphon pisum), whose genome was recently sequenced. The AcypiCyc database webpage also includes, for comparative analyses, two other metabolic reconstruction BioCyc databases generated using CycADS: TricaCyc for Tribolium castaneum and DromeCyc for Drosophila melanogaster. Linked to its flexible design, CycADS offers a powerful software tool for the generation and regular updating of enriched BioCyc databases. The CycADS system is particularly suited for metabolic gene annotation and network reconstruction in newly sequenced genomes. Because of the uniform annotation used for metabolic network reconstruction, CycADS is particularly useful for comparative analysis of the metabolism of different organisms. Database URL: http://www.cycadsys.org PMID:21474551
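The collect-filter-export idea at the heart of the pipeline can be sketched in miniature: gather per-gene annotations from several sources into one structure, then emit a single summary file for downstream database building. The source names echo those cited above, but the data and the output format are invented for illustration, not PathoLogic's actual input format.

    # Merge multi-source gene annotations and write one summary file.
    annotations = {
        "gene_001": {"KAAS": "K00845", "PRIAM": "2.7.1.2", "Blast2GO": "GO:0005524"},
        "gene_002": {"KAAS": "K01810"},
    }

    with open("annotation_summary.tsv", "w") as out:
        out.write("gene\tsource\tannotation\n")
        for gene, per_source in sorted(annotations.items()):
            for source, ann in sorted(per_source.items()):
                out.write(f"{gene}\t{source}\t{ann}\n")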
Seventy Years of RN Effectiveness: A Database Development Project to Inform Best Practice.
Lulat, Zainab; Blain-McLeod, Julie; Grinspun, Doris; Penney, Tasha; Harripaul-Yhap, Anastasia; Rey, Michelle
2018-03-23
The appropriate nursing staff mix is imperative to the provision of quality care. Nurse staffing levels and staff mix vary from country to country, as well as between care settings. Understanding how staffing skill mix impacts patient, organizational, and financial outcomes is critical in order to allow policymakers and clinicians to make evidence-informed staffing decisions. This paper reports on the methodology for creation of an electronic database of studies exploring the effectiveness of Registered Nurses (RNs) on clinical and patient outcomes, organizational and nurse outcomes, and financial outcomes. Comprehensive literature searches were conducted in four electronic databases. Inclusion criteria for the database included studies published from 1946 to 2016, peer-reviewed international literature, and studies focused on RNs in all health-care disciplines, settings, and sectors. Masters-prepared nurse researchers conducted title and abstract screening and relevance review to determine eligibility of studies for the database. High-level analysis was conducted to determine key outcomes and the frequency at which they appeared within the database. Of the initial 90,352 records, a total of 626 abstracts were included within the database. Studies were organized into three groups corresponding to clinical and patient outcomes, organizational and nurse-related outcomes, and financial outcomes. Organizational and nurse-related outcomes represented the largest category in the database with 282 studies, followed by clinical and patient outcomes with 244 studies, and lastly financial outcomes, which included 124 studies. The comprehensive database of evidence for RN effectiveness is freely available at https://rnao.ca/bpg/initiatives/RNEffectiveness. The database will serve as a resource for the Registered Nurses' Association of Ontario, as well as a tool for researchers, clinicians, and policymakers for making evidence-informed staffing decisions. © 2018 The Authors. Worldviews on Evidence-Based Nursing published by Wiley Periodicals, Inc. on behalf of Sigma Theta Tau International The Honor Society of Nursing.
DaVIE: Database for the Visualization and Integration of Epigenetic data
Fejes, Anthony P.; Jones, Meaghan J.; Kobor, Michael S.
2014-01-01
One of the challenges in the analysis of large data sets, particularly in a population-based setting, is the ability to perform comparisons across projects. This has to be done in such a way that the integrity of each individual project is maintained, while ensuring that the data are comparable across projects. These issues are beginning to be observed in human DNA methylation studies, as the Illumina 450k platform and next generation sequencing-based assays grow in popularity and decrease in price. This increase in productivity is enabling new insights into epigenetics, but also requires the development of pipelines and software capable of handling the large volumes of data. The specific problems inherent in creating a platform for the storage, comparison, integration, and visualization of DNA methylation data include data storage, algorithm efficiency and the ability to interpret the results to derive biological meaning from them. Databases provide a ready-made solution to these issues, but as yet no tools exist that leverage these advantages while providing an intuitive user interface for interpreting results in a genomic context. We have addressed this void by integrating a database to store DNA methylation data with a web interface to query and visualize the database and a set of libraries for more complex analysis. The resulting platform is called DaVIE: Database for the Visualization and Integration of Epigenetics data. DaVIE can use data culled from a variety of sources, and the web interface includes the ability to group samples by sub-type, compare multiple projects and visualize genomic features in relation to sites of interest. We have used DaVIE to identify patterns of DNA methylation in specific projects and across different projects, identify outlier samples, and cross-check differentially methylated CpG sites identified in specific projects across large numbers of samples. A demonstration server has been set up using GEO data at http://echelon.cmmt.ubc.ca/dbaccess/, with login “guest” and password “guest.” Groups may download and install their own version of the server following the instructions on the project's wiki. PMID:25278960
FTree query construction for virtual screening: a statistical analysis.
Gerlach, Christof; Broughton, Howard; Zaliani, Andrea
2008-02-01
FTrees (FT) is a known chemoinformatic tool able to condense molecular descriptions into a graph object and to search for actives in large databases using graph similarity. The query graph is classically derived from a known active molecule, or a set of actives, for which a similar compound has to be found. Recently, FT similarity has been extended to fragment space, widening its capabilities. If a user were able to build a knowledge-based FT query from information other than a known active structure, the similarity search could be combined with other, normally separate, fields like de-novo design or pharmacophore searches. With this aim in mind, we performed a comprehensive analysis of several databases in terms of FT description and provide a basic statistical analysis of the FT spaces so far at hand. Vendors' catalogue collections and MDDR as a source of potential or known "actives", respectively, have been used. With the results reported herein, a set of ranges, mean values and standard deviations for several query parameters are presented in order to set a reference guide for the users. Applications on how to use this information in FT query building are also provided, using a newly built 3D-pharmacophore from 57 5HT-1F agonists and a published one which was used for virtual screening for tRNA-guanine transglycosylase (TGT) inhibitors.
González-Calabozo, Jose M; Valverde-Albacete, Francisco J; Peláez-Moreno, Carmen
2016-09-15
Gene Expression Data (GED) analysis poses a great challenge to the scientific community that can be framed into the Knowledge Discovery in Databases (KDD) and Data Mining (DM) paradigm. Biclustering has emerged as the machine learning method of choice to solve this task, but its unsupervised nature makes result assessment problematic. This is often addressed by means of Gene Set Enrichment Analysis (GSEA). We put forward a framework in which GED analysis is understood as an Exploratory Data Analysis (EDA) process where we provide support for continuous human interaction with data aiming at improving the step of hypothesis abduction and assessment. We focus on the adaptation to human cognition of data interpretation and visualization of the output of EDA. First, we give a proper theoretical background to biclustering using Lattice Theory and provide a set of analysis tools revolving around K-Formal Concept Analysis (K-FCA), a lattice-theoretic unsupervised learning technique for real-valued matrices. By using different kinds of cost structures to quantify expression we obtain different sequences of hierarchical biclusterings for gene under- and over-expression using thresholds. Consequently, we provide a method with interleaved analysis steps and visualization devices so that the sequences of lattices for a particular experiment summarize the researcher's vision of the data. This also allows us to define measures of persistence and robustness of biclusters to assess them. Second, the resulting biclusters are used to index external omics databases, for instance Gene Ontology (GO), thus offering a new way of accessing publicly available resources. This provides different flavors of gene set enrichment against which to assess the biclusters, by obtaining their p-values according to the terminology of those resources. We illustrate the exploration procedure on a real data example confirming results previously published. The GED analysis problem gets transformed into the exploration of a sequence of lattices enabling the visualization of the hierarchical structure of the biclusters with a certain degree of granularity. The ability of FCA-based biclustering methods to index external databases such as GO allows us to obtain a quality measure of the biclusters, to observe the evolution of a gene throughout the different biclusters it appears in, and to look for relevant biclusters, by observing their genes and their persistence, in order to infer, for instance, hypotheses on their function.
Database Citation in Full Text Biomedical Articles
Kafkas, Şenay; Kim, Jee-Hyub; McEntyre, Johanna R.
2013-01-01
Molecular biology and literature databases represent essential infrastructure for life science research. Effective integration of these data resources requires that there are structured cross-references at the level of individual articles and biological records. Here, we describe the current patterns of how database entries are cited in research articles, based on analysis of the full text Open Access articles available from Europe PMC. Focusing on citation of entries in the European Nucleotide Archive (ENA), UniProt and Protein Data Bank, Europe (PDBe), we demonstrate that text mining doubles the number of structured annotations of database record citations supplied in journal articles by publishers. Many thousands of new literature-database relationships are found by text mining, since these relationships are also not present in the set of articles cited by database records. We recommend that structured annotation of database records in articles is extended to other databases, such as ArrayExpress and Pfam, entries from which are also cited widely in the literature. The very high precision and high-throughput of this text-mining pipeline makes this activity possible both accurately and at low cost, which will allow the development of new integrated data services. PMID:23734176
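A first step in such a text-mining pipeline is spotting accession-like strings in article text with regular expressions. The patterns below are simplified approximations of the real ENA/UniProt/PDB identifier formats, written for illustration; they will both miss and over-match edge cases that a production pipeline must handle.

    import re

    PATTERNS = {
        "ENA": re.compile(r"\b[A-Z]{1,2}\d{5,6}\b"),
        "UniProt": re.compile(r"\b[OPQ][0-9][A-Z0-9]{3}[0-9]\b"),
        "PDBe": re.compile(r"\b[1-9][a-z0-9]{3}\b", re.IGNORECASE),
    }

    text = "The sequence (GenBank/ENA X52700; UniProt P12345) matches PDB 1TIM."
    for db, pattern in PATTERNS.items():
        print(db, pattern.findall(text))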
Comprehensive analysis of orthologous protein domains using the HOPS database.
Storm, Christian E V; Sonnhammer, Erik L L
2003-10-01
One of the most reliable methods for protein function annotation is to transfer experimentally known functions from orthologous proteins in other organisms. Most methods for identifying orthologs operate on a subset of organisms with a completely sequenced genome, and treat proteins as single-domain units. However, it is well known that proteins are often made up of several independent domains, and there is a wealth of protein sequences from genomes that are not completely sequenced. A comprehensive set of protein domain families is found in the Pfam database. We wanted to apply orthology detection to Pfam families, but first some issues needed to be addressed. First, orthology detection becomes impractical and unreliable when too many species are included. Second, shorter domains contain less information. It is therefore important to assess the quality of the orthology assignment and avoid very short domains altogether. We present a database of orthologous protein domains in Pfam called HOPS: Hierarchical grouping of Orthologous and Paralogous Sequences. Orthology is inferred in a hierarchic system of phylogenetic subgroups using ortholog bootstrapping. To avoid the frequent errors stemming from horizontally transferred genes in bacteria, the analysis is presently limited to eukaryotic genes. The results are accessible in the graphical browser NIFAS, a Java tool originally developed for analyzing phylogenetic relations within Pfam families. The method was tested on a set of curated orthologs with experimentally verified function. In comparison to tree reconciliation with a complete species tree, our approach finds significantly more orthologs in the test set. Examples for investigating gene fusions and domain recombination using HOPS are given.
Implementation of Goal Attainment Scaling in Community Intellectual Disability Services
ERIC Educational Resources Information Center
Chapman, Melanie; Burton, Mark; Hunt, Victoria; Reeves, David
2006-01-01
The authors describe the evaluation of the implementation of an outcome measurement system (Goal Attainment Scaling-GAS) within the context of an interdisciplinary and interagency intellectual disability services setting. The GAS database allowed analysis of follow-up goals and indicated the extent of implementation, while a rater study evaluated…
ERIC Educational Resources Information Center
Burton, Hilary D.
TIS (Technology Information System) is an intelligent gateway system capable of performing quantitative evaluation and analysis of bibliographic citations using a set of Process functions. Originally developed by Lawrence Livermore National Laboratory (LLNL) to analyze information retrieved from three major federal databases, DOE/RECON,…
Secondary Data Analysis: An Important Tool for Addressing Developmental Questions
ERIC Educational Resources Information Center
Greenhoot, Andrea Follmer; Dowsett, Chantelle J.
2012-01-01
Existing data sets can be an efficient, powerful, and readily available resource for addressing questions about developmental science. Many of the available databases contain hundreds of variables of interest to developmental psychologists, track participants longitudinally, and have representative samples. In this article, the authors discuss the…
Version VI of the ESTree db: an improved tool for peach transcriptome analysis
Lazzari, Barbara; Caprera, Andrea; Vecchietti, Alberto; Merelli, Ivan; Barale, Francesca; Milanesi, Luciano; Stella, Alessandra; Pozzi, Carlo
2008-01-01
Background The ESTree database (db) is a collection of Prunus persica and Prunus dulcis EST sequences that in its current version encompasses 75,404 sequences from 3 almond and 19 peach libraries. Nine peach genotypes and four peach tissues are represented, from four fruit developmental stages. The aim of this work was to extend the existing ESTree db by adding new sequences and analysis programs. Particular care was given to the implementation of the web interface, which allows querying each of the database features. Results A Perl modular pipeline is the backbone of sequence analysis in the ESTree db project. Outputs obtained during the pipeline steps are automatically arrayed into the fields of a MySQL database. Apart from standard clustering and annotation analyses, version VI of the ESTree db encompasses new tools for tandem repeat identification, annotation against genomic Rosaceae sequences, and positioning on the database of oligomer sequences that were used in a peach microarray study. Furthermore, known protein patterns and motifs were identified by comparison to PROSITE. Based on data retrieved from sequence annotation against the UniProtKB database, a script was prepared to track positions of homologous hits on the GO tree and build statistics on the distribution of ontologies in GO functional categories. EST mapping data were also integrated into the database. The PHP-based web interface was upgraded and extended. The aim of the authors was to enable querying the database according to all the biological aspects that can be investigated from the analysis of data available in the ESTree db. This is achieved by allowing multiple searches on logical subsets of sequences that represent different biological situations or features. Conclusions Version VI of the ESTree db offers a broad overview of peach gene expression. Sequence analysis results contained in the database, extensively linked to external related resources, represent a large amount of information that can be queried via the tools offered in the web interface. The flexibility and modularity of the ESTree analysis pipeline and of the web interface allowed the authors to set up similar structures for different datasets, with limited manual intervention. PMID:18387211
Kaushik, Abhinav; Bhatia, Yashuma; Ali, Shakir; Gupta, Dinesh
2015-01-01
Metastatic melanoma patients have a poor prognosis, mainly attributable to the underlying heterogeneity in melanoma driver genes and altered gene expression profiles. These characteristics of melanoma also make the development of drugs and identification of novel drug targets for metastatic melanoma a daunting task. Systems biology offers an alternative approach to re-explore the genes or gene sets that display dysregulated behaviour without being differentially expressed. In this study, we have performed systems biology studies to enhance our knowledge about the conserved property of disease genes or gene sets among mutually exclusive datasets representing melanoma progression. We meta-analysed 642 microarray samples to generate melanoma reconstructed networks representing four different stages of melanoma progression to extract genes with altered molecular circuitry wiring as compared to a normal cellular state. Intriguingly, a majority of the melanoma network-rewired genes are not differentially expressed and the disease genes involved in melanoma progression consistently modulate its activity by rewiring network connections. We found that the shortlisted disease genes in the study show strong and abnormal network connectivity, which enhances with the disease progression. Moreover, the deviated network properties of the disease gene sets allow ranking/prioritization of different enriched, dysregulated and conserved pathway terms in metastatic melanoma, in agreement with previous findings. Our analysis also reveals presence of distinct network hubs in different stages of metastasizing tumor for the same set of pathways in the statistically conserved gene sets. The study results are also presented as a freely available database at http://bioinfo.icgeb.res.in/m3db/. The web-based database resource consists of results from the analysis presented here, integrated with cytoscape web and user-friendly tools for visualization, retrieval and further analysis. PMID:26558755
Matsuda, Fumio; Shinbo, Yoko; Oikawa, Akira; Hirai, Masami Yokota; Fiehn, Oliver; Kanaya, Shigehiko; Saito, Kazuki
2009-01-01
Background In metabolomics research using mass spectrometry (MS), systematic searching of high-resolution mass data against compound databases is often the first step of metabolite annotation to determine elemental compositions possessing similar theoretical mass numbers. However, incorrect hits derived from errors in mass analyses will be included in the results of elemental composition searches. To assess the quality of peak annotation information, a novel methodology for false discovery rate (FDR) evaluation is presented in this study. Based on the FDR analyses, several aspects of an elemental composition search, including setting a threshold, estimating FDR, and the types of elemental composition databases most reliable for searching, are discussed. Methodology/Principal Findings The FDR can be determined from one measured value (i.e., the hit rate for search queries) and four parameters determined by Monte Carlo simulation. The results indicate that relatively high FDR values (30–50%) were obtained when searching time-of-flight (TOF)/MS data using the KNApSAcK and KEGG databases. In addition, searches against large all-in-one databases (e.g., PubChem) always produced unacceptable results (FDR >70%). The estimated FDRs suggest that the quality of search results can be improved not only by performing more accurate mass analysis but also by modifying the properties of the compound database. A theoretical analysis indicates that FDR could be improved by using a compound database that is smaller but has higher completeness of entries. Conclusions/Significance High accuracy mass analysis, such as Fourier transform (FT)-MS, is needed for reliable annotation (FDR <10%). In addition, a small, customized compound database is preferable for high-quality annotation of metabolome data. PMID:19847304
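The Monte Carlo component can be illustrated by estimating how often random (decoy) masses hit a formula database within a mass tolerance, which is the chance-hit ingredient of the FDR. The database masses, tolerance, and sizes below are toy values, and this sketch omits the paper's four-parameter model.

    import bisect
    import random

    random.seed(1)
    # Toy "formula database" of sorted monoisotopic masses (Da).
    db_masses = sorted(random.uniform(100, 1000) for _ in range(20000))

    def n_hits(mass, tol_ppm=5.0):
        tol = mass * tol_ppm * 1e-6
        lo = bisect.bisect_left(db_masses, mass - tol)
        hi = bisect.bisect_right(db_masses, mass + tol)
        return hi - lo

    decoys = [random.uniform(100, 1000) for _ in range(5000)]
    chance_hit_rate = sum(n_hits(m) > 0 for m in decoys) / len(decoys)
    print("chance hit rate at 5 ppm:", round(chance_hit_rate, 3))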
Nascimento, Leandro Costa; Salazar, Marcela Mendes; Lepikson-Neto, Jorge; Camargo, Eduardo Leal Oliveira; Parreiras, Lucas Salera; Carazzolle, Marcelo Falsarella
2017-01-01
Tree species of the genus Eucalyptus are the most valuable and widely planted hardwoods in the world. Given the economic importance of Eucalyptus trees, much effort has been made towards the generation of specimens with superior forestry properties that can deliver high-quality feedstocks, customized to the industry's needs for both cellulosic (paper) and lignocellulosic biomass production. In line with these efforts, large sets of molecular data have been generated by several scientific groups, providing invaluable information that can be applied in the development of improved specimens. In order to fully explore the potential of available datasets, the development of a public database that provides integrated access to genomic and transcriptomic data from Eucalyptus is needed. EUCANEXT is a database that analyses and integrates publicly available Eucalyptus molecular data, such as the E. grandis genome assembly and predicted genes, ESTs from several species and digital gene expression from 26 RNA-Seq libraries. The database has been implemented on a Fedora Linux machine running MySQL and Apache, while Perl CGI was used for the web interfaces. EUCANEXT provides a user-friendly web interface for easy access and analysis of publicly available molecular data from Eucalyptus species. This integrated database allows for complex searches by gene name, keyword or sequence similarity and is publicly accessible at http://www.lge.ibi.unicamp.br/eucalyptusdb. Through EUCANEXT, users can perform complex analyses to identify genes related to traits of interest using RNA-Seq libraries and tools for differential expression analysis. Moreover, the entire bioinformatics pipeline described here, including the database schema and Perl scripts, is readily available and can be applied to any genomic and transcriptomic project, regardless of the organism. Database URL: http://www.lge.ibi.unicamp.br/eucalyptusdb PMID:29220468
Database Constraints Applied to Metabolic Pathway Reconstruction Tools
Vilaplana, Jordi; Solsona, Francesc; Teixido, Ivan; Usié, Anabel; Karathia, Hiren; Alves, Rui; Mateo, Jordi
2014-01-01
Our group developed two biological applications, Biblio-MetReS and Homol-MetReS, which access the same database of organisms with annotated genes. Biblio-MetReS is a data-mining application that facilitates the reconstruction of molecular networks based on automated text-mining analysis of published scientific literature. Homol-MetReS allows functional (re)annotation of proteomes, to properly identify both the individual proteins involved in the process(es) of interest and their function. It also enables the sets of proteins involved in the process(es) in different organisms to be compared directly. The efficiency of these biological applications is directly related to the design of the shared database. We classified and analyzed the different kinds of access to the database. Based on this study, we adjusted and tuned the configurable parameters of the database server to maximize the performance of the communication data link to/from the database system. Different database technologies were analyzed. We started the study with a public relational SQL database, MySQL. The same database was then re-implemented in HBase, a MapReduce-based database. The results indicated that the standard configuration of MySQL gives acceptable performance for low- or medium-sized databases. Nevertheless, tuning database parameters can greatly improve the performance and lead to very competitive runtimes. PMID:25202745
Analyzing high energy physics data using database computing: Preliminary report
NASA Technical Reports Server (NTRS)
Baden, Andrew; Day, Chris; Grossman, Robert; Lifka, Dave; Lusk, Ewing; May, Edward; Price, Larry
1991-01-01
A proof-of-concept system is described for analyzing high energy physics (HEP) data using database computing. The system is designed to scale up to the size required for HEP experiments at the Superconducting Super Collider (SSC) laboratory. These experiments will require collecting and analyzing approximately 10 to 100 million 'events' per year from proton colliding-beam collisions. Each 'event' consists of a set of vectors with a total length of approximately one megabyte. This represents an increase of approximately 2 to 3 orders of magnitude in the amount of data accumulated by present HEP experiments. The system is called the HEPDBC System (High Energy Physics Database Computing System). At present, the Mark 0 HEPDBC System is complete and can analyze HEP experimental data approximately an order of magnitude faster than current production software on data sets of approximately 1 GB. The Mark 1 HEPDBC System is currently undergoing testing and is designed to analyze data sets 10 to 100 times larger.
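For scale, the raw data volume implied by those figures is easy to work out; the lines below simply restate the abstract's own estimates.

```python
# Back-of-envelope volume from the figures above: up to 100 million events
# per year at ~1 MB per event.
events_per_year = 100e6   # upper estimate from the abstract
event_size_mb = 1.0       # ~1 MB per event
total_tb = events_per_year * event_size_mb / 1e6  # MB -> TB
print(f"~{total_tb:.0f} TB per year")  # ~100 TB/year
```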
Storm, Lance; Tressoldi, Patrizio E; Di Risio, Lorenzo
2010-07-01
We report the results of meta-analyses on 3 types of free-response study: (a) ganzfeld (a technique that enhances a communication anomaly referred to as "psi"); (b) nonganzfeld noise reduction using alleged psi-enhancing techniques such as dream psi, meditation, relaxation, or hypnosis; and (c) standard free response (nonganzfeld, no noise reduction). For the period 1997-2008, a homogeneous data set of 29 ganzfeld studies yielded a mean effect size of 0.142 (Stouffer Z = 5.48, p = 2.13 x 10^-8). A homogeneous nonganzfeld noise reduction data set of 16 studies yielded a mean effect size of 0.110 (Stouffer Z = 3.35, p = 2.08 x 10^-4), and a homogeneous data set of 14 standard free-response studies produced a weak negative mean effect size of -0.029 (Stouffer Z = -2.29, p = .989). The mean effect size value of the ganzfeld database was significantly higher than the mean effect size of the standard free-response database but was not higher than the effect size of the nonganzfeld noise reduction database [corrected]. We also found that selected participants (believers in the paranormal, meditators, etc.) had a performance advantage over unselected participants, but only if they were in the ganzfeld condition.
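The Stouffer Z quoted throughout this record is the standard unweighted combination of per-study z-scores; a minimal sketch follows, using hypothetical z-scores rather than the actual study data.

```python
import math
from scipy import stats

def stouffer(z_scores):
    """Unweighted Stouffer combination: Z = sum(z_i) / sqrt(k)."""
    k = len(z_scores)
    z = sum(z_scores) / math.sqrt(k)
    p_one_sided = stats.norm.sf(z)  # upper-tail p-value
    return z, p_one_sided

# hypothetical per-study z-scores, not the 29-study ganzfeld data
z, p = stouffer([1.2, 0.8, 2.1, 1.5, 0.3])
print(f"Stouffer Z = {z:.2f}, one-sided p = {p:.3g}")
```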
Bode, Stefan; Murawski, Carsten; Laham, Simon M.
2018-01-01
A major obstacle for the design of rigorous, reproducible studies in moral psychology is the lack of suitable stimulus sets. Here, we present the Socio-Moral Image Database (SMID), the largest standardized moral stimulus set assembled to date, containing 2,941 freely available photographic images, representing a wide range of morally (and affectively) positive, negative and neutral content. The SMID was validated with over 820,525 individual judgments from 2,716 participants, with normative ratings currently available for all images on affective valence and arousal, moral wrongness, and relevance to each of the five moral values posited by Moral Foundations Theory. We present a thorough analysis of the SMID regarding (1) inter-rater consensus, (2) rating precision, and (3) breadth and variability of moral content. Additionally, we provide recommendations for use aimed at efficient study design and reproducibility, and outline planned extensions to the database. We anticipate that the SMID will serve as a useful resource for psychological, neuroscientific and computational (e.g., natural language processing or computer vision) investigations of social, moral and affective processes. The SMID images, along with associated normative data and additional resources, are available at https://osf.io/2rqad/. PMID:29364985
Computation and application of tissue-specific gene set weights.
Frost, H Robert
2018-04-06
Gene set testing, or pathway analysis, has become a critical tool for the analysis of high-dimensional genomic data. Although the function and activity of many genes and higher-level processes are tissue-specific, gene set testing is typically performed in a tissue-agnostic fashion, which impacts statistical power and the interpretation and replication of results. To address this challenge, we have developed a bioinformatics approach to compute tissue-specific weights for individual gene sets using information on tissue-specific gene activity from the Human Protein Atlas (HPA). We used this approach to create a public repository of tissue-specific gene set weights for 37 different human tissue types from the HPA and all collections in the Molecular Signatures Database (MSigDB). To demonstrate the validity and utility of these weights, we explored three different applications: the functional characterization of human tissues, multi-tissue analysis for systemic diseases and tissue-specific gene set testing. All data used in the reported analyses is publicly available. An R implementation of the method and tissue-specific weights for MSigDB gene set collections can be downloaded at http://www.dartmouth.edu/∼hrfrost/TissueSpecificGeneSets. rob.frost@dartmouth.edu.
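The abstract does not give the weighting formula; one simple candidate, sketched below under that assumption, scores a gene set by the fraction of its members flagged as active in the tissue. The gene names and the HPA-style call set are toy data, not the paper's repository.

```python
def tissue_weight(gene_set, expressed_in_tissue):
    """One plausible weight: fraction of a gene set's members that are
    expressed in the tissue (HPA-style presence calls assumed)."""
    gene_set = set(gene_set)
    return len(gene_set & expressed_in_tissue) / len(gene_set)

liver_expressed = {"ALB", "APOB", "CYP3A4", "F2"}      # toy HPA-style calls
coagulation = ["F2", "F7", "ALB", "VWF", "SERPINC1"]   # toy MSigDB-style set
print(tissue_weight(coagulation, liver_expressed))     # 0.4
```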
A scattering database of marine particles and its application in optical analysis
NASA Astrophysics Data System (ADS)
Xu, G.; Yang, P.; Kattawar, G.; Zhang, X.
2016-12-01
In modeling the scattering properties of marine particles (e.g., phytoplankton), laboratory studies imply a need to properly account for the influence of particle morphology, in addition to size and composition. In this study, a marine particle scattering database is constructed using a collection of distorted hexahedral shapes. Specifically, the scattering properties for each size bin and refractive index are obtained by ensemble-averaging over distorted hexahedra with randomly tilted facets and selected aspect ratios (from elongated to flattened). The degree of randomness in the shape-generation process defines the geometric irregularity of the particles in the group. The geometric irregularity and particle aspect ratios constitute a set of "shape factors" to be accounted for (e.g., in best-fit analysis). To cover most of the marine particle size range, we combine the Invariant Imbedding T-matrix (II-TM) method and the Physical-Geometric Optics Hybrid (PGOH) method in the calculations. The simulated optical properties are shown and compared with those obtained from Lorenz-Mie theory. Using the scattering database, we present a preliminary optical analysis of laboratory-measured optical properties of marine particles.
Inferring gene and protein interactions using PubMed citations and consensus Bayesian networks.
Deeter, Anthony; Dalman, Mark; Haddad, Joseph; Duan, Zhong-Hui
2017-01-01
The PubMed database offers an extensive set of publication data that can be useful, yet inherently complex to use without automated computational techniques. Data repositories such as the Genomic Data Commons (GDC) and the Gene Expression Omnibus (GEO) offer experimental data storage and retrieval as well as curated gene expression profiles. Genetic interaction databases, including Reactome and Ingenuity Pathway Analysis, offer pathway and experiment data analysis using data curated from these publications and data repositories. We have created a method to generate and analyze consensus networks, inferring potential gene interactions, using large numbers of Bayesian networks generated by data mining publications in the PubMed database. Through the concept of network resolution, these consensus networks can be tailored to represent possible genetic interactions. We designed a set of experiments to confirm that our method is stable across variation in both sample and topological input sizes. Using gene product interactions from the KEGG pathway database and data mining PubMed publication abstracts, we verify that regardless of the network resolution or the inferred consensus network, our method is capable of inferring meaningful gene interactions through consensus Bayesian network generation with multiple, randomized topological orderings. Our method can not only confirm the existence of currently accepted interactions, but has the potential to hypothesize new ones as well. We show that our method confirms the existence of known gene interactions such as JAK-STAT-PI3K-AKT-mTOR, infers novel gene interactions such as RAS-Bcl-2 and RAS-AKT, and finds significant pathway-pathway interactions between the JAK-STAT signaling and Cardiac Muscle Contraction KEGG pathways.
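One way to picture the consensus step is edge-frequency thresholding across the many generated networks, with the threshold playing the role of the "network resolution" the abstract mentions; the sketch below assumes that reading, and the gene names are illustrative only.

```python
from collections import Counter

def consensus_edges(networks, resolution):
    """Keep an edge if it appears in at least `resolution` fraction of the
    individual networks (each network = a set of directed edges)."""
    counts = Counter(e for net in networks for e in net)
    n = len(networks)
    return {e for e, c in counts.items() if c / n >= resolution}

nets = [{("JAK", "STAT"), ("PI3K", "AKT")},
        {("JAK", "STAT"), ("AKT", "mTOR")},
        {("JAK", "STAT"), ("PI3K", "AKT"), ("RAS", "AKT")}]
print(consensus_edges(nets, resolution=0.66))
# {('JAK', 'STAT'), ('PI3K', 'AKT')} -- raising the resolution prunes edges
```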
DOE Office of Scientific and Technical Information (OSTI.GOV)
Karpinets, Tatiana V; Park, Byung; Syed, Mustafa H
2010-01-01
The Carbohydrate-Active Enzyme (CAZy) database provides a rich set of manually annotated enzymes that degrade, modify, or create glycosidic bonds. Despite rich and invaluable information stored in the database, software tools utilizing this information for annotation of newly sequenced genomes by CAZy families are limited. We have employed two annotation approaches to fill the gap between manually curated high-quality protein sequences collected in the CAZy database and the growing number of other protein sequences produced by genome or metagenome sequencing projects. The first approach is based on a similarity search against the entire non-redundant sequences of the CAZy database. The second approach performs annotation using links or correspondences between the CAZy families and protein family domains. The links were discovered using the association rule learning algorithm applied to sequences from the CAZy database. The approaches complement each other and in combination achieved high specificity and sensitivity when cross-evaluated with the manually curated genomes of Clostridium thermocellum ATCC 27405 and Saccharophagus degradans 2-40. The capability of the proposed framework to predict the function of unknown protein domains (DUF) and of hypothetical proteins in the genome of Neurospora crassa is demonstrated. The framework is implemented as a Web service, the CAZymes Analysis Toolkit (CAT), and is available at http://cricket.ornl.gov/cgi-bin/cat.cgi.
Park, Byung H; Karpinets, Tatiana V; Syed, Mustafa H; Leuze, Michael R; Uberbacher, Edward C
2010-12-01
The Carbohydrate-Active Enzyme (CAZy) database provides a rich set of manually annotated enzymes that degrade, modify, or create glycosidic bonds. Despite rich and invaluable information stored in the database, software tools utilizing this information for annotation of newly sequenced genomes by CAZy families are limited. We have employed two annotation approaches to fill the gap between manually curated high-quality protein sequences collected in the CAZy database and the growing number of other protein sequences produced by genome or metagenome sequencing projects. The first approach is based on a similarity search against the entire nonredundant sequences of the CAZy database. The second approach performs annotation using links or correspondences between the CAZy families and protein family domains. The links were discovered using the association rule learning algorithm applied to sequences from the CAZy database. The approaches complement each other and in combination achieved high specificity and sensitivity when cross-evaluated with the manually curated genomes of Clostridium thermocellum ATCC 27405 and Saccharophagus degradans 2-40. The capability of the proposed framework to predict the function of unknown protein domains and of hypothetical proteins in the genome of Neurospora crassa is demonstrated. The framework is implemented as a Web service, the CAZymes Analysis Toolkit, and is available at http://cricket.ornl.gov/cgi-bin/cat.cgi.
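A toy sketch of how the two complementary approaches described in both records above could be merged per sequence: direct similarity hits contribute families, and domain hits contribute families through learned domain-to-family rules. All identifiers and mappings here are invented for illustration, not CAT's actual data.

```python
def annotate(seq_id, blast_hits, domain_hits, domain_to_family):
    """Union of approach 1 (similarity hits to CAZy members) and approach 2
    (protein family domains mapped to CAZy families via learned rules)."""
    families = set(blast_hits.get(seq_id, []))      # approach 1
    for dom in domain_hits.get(seq_id, []):         # approach 2
        families.update(domain_to_family.get(dom, []))
    return sorted(families)

blast = {"geneA": ["GH5"]}                               # toy similarity hits
domains = {"geneA": ["PF00150"], "geneB": ["PF00553"]}   # toy domain hits
rules = {"PF00150": ["GH5"], "PF00553": ["CBM2"]}        # toy learned rules
print(annotate("geneA", blast, domains, rules))  # ['GH5']
print(annotate("geneB", blast, domains, rules))  # ['CBM2']
```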
BIGNASim: a NoSQL database structure and analysis portal for nucleic acids simulation data
Hospital, Adam; Andrio, Pau; Cugnasco, Cesare; Codo, Laia; Becerra, Yolanda; Dans, Pablo D.; Battistini, Federica; Torres, Jordi; Goñi, Ramón; Orozco, Modesto; Gelpí, Josep Ll.
2016-01-01
Molecular dynamics simulation (MD) is, just behind genomics, the bioinformatics tool that generates the largest amount of data and uses the largest amount of CPU time in supercomputing centres. MD trajectories are obtained after months of calculations, analysed in situ, and in practice forgotten. Several projects to generate stable trajectory databases have been developed for proteins, but no equivalent exists in the nucleic acids world. We present here a novel database system to store MD trajectories and analyses of nucleic acids. The initial data set available consists mainly of the benchmark of the new molecular dynamics force-field, parmBSC1. It contains 156 simulations, with over 120 μs of total simulation time. A deposition protocol is available to accept the submission of new trajectory data. The database is based on the combination of two NoSQL engines, Cassandra for storing trajectories and MongoDB for storing analysis results and simulation metadata. The analyses available include backbone geometries, helical analysis, NMR observables and a variety of mechanical analyses. Individual trajectories and combined meta-trajectories can be downloaded from the portal. The system is accessible through http://mmb.irbbarcelona.org/BIGNASim/. Supplementary Material is also available on-line at http://mmb.irbbarcelona.org/BIGNASim/SuppMaterial/. PMID:26612862
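The split between the two NoSQL engines can be sketched as a two-step lookup: find a simulation's metadata in MongoDB, then pull its frames from Cassandra. The collection, keyspace, table and column names below are hypothetical, not BIGNASim's actual schema.

```python
# Hypothetical client for a BIGNASim-like split store: metadata/analyses in
# MongoDB, bulky trajectory frames in Cassandra.
from pymongo import MongoClient
from cassandra.cluster import Cluster

mongo = MongoClient("mongodb://localhost:27017")
# assumed database/collection/field names, for illustration only
meta = mongo["bignasim"]["simulations"].find_one({"forcefield": "parmBSC1"})
traj_id = meta["traj_id"]

session = Cluster(["localhost"]).connect("trajectories")  # assumed keyspace
rows = session.execute(
    "SELECT frame_no, coords FROM frames WHERE traj_id = %s LIMIT 10",
    (traj_id,))
for row in rows:
    print(row.frame_no, len(row.coords))
```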
VerSeDa: vertebrate secretome database
Cortazar, Ana R.; Oguiza, José A.
2017-01-01
Based on current tools, de novo secretome (the full set of proteins secreted by an organism) prediction is a time-consuming bioinformatic task that requires a multifactorial analysis in order to obtain reliable in silico predictions. Hence, to accelerate this process and offer researchers a reliable repository where secretome information can be obtained for vertebrates and model organisms, we have developed VerSeDa (Vertebrate Secretome Database). This freely available database stores information about proteins that are predicted to be secreted through the classical and non-classical mechanisms, for the wide range of vertebrate species deposited at the NCBI, UCSC and ENSEMBL sites. To our knowledge, VerSeDa is the only state-of-the-art database designed to store secretome data from multiple vertebrate genomes, thus saving a substantial amount of time otherwise spent predicting protein features, which can instead be retrieved directly from this repository. Database URL: VerSeDa is freely available at http://genomics.cicbiogune.es/VerSeDa/index.php PMID:28365718
Data Preparation Process for the Buildings Performance Database
DOE Office of Scientific and Technical Information (OSTI.GOV)
Walter, Travis; Dunn, Laurel; Mercado, Andrea
2014-06-30
The Buildings Performance Database (BPD) includes empirically measured data from a variety of data sources with varying degrees of data quality and data availability. The purpose of the data preparation process is to maintain data quality within the database and to ensure that all database entries have sufficient data for meaningful analysis and for the database API. Data preparation is a systematic process of mapping data into the Building Energy Data Exchange Specification (BEDES), cleansing data using a set of criteria and rules of thumb, and deriving values such as energy totals and dominant asset types. The data preparation process takes the greatest amount of effort and time; therefore, most of the cleansing process has been automated. The process also needs to adapt as more data is contributed to the BPD and as building technologies evolve over time. The data preparation process is an essential step between data contributed by providers and data published to the public in the BPD.
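A minimal sketch of the three steps named above: map raw fields onto a BEDES-like vocabulary, cleanse with a rule of thumb, and derive a total. The field names, the range rule and the fuel categories are invented for illustration, not the BPD's actual criteria.

```python
def prepare(record, field_map):
    """Map -> cleanse -> derive, on one raw building record (a dict)."""
    # 1. map raw field names onto a BEDES-like vocabulary
    mapped = {field_map.get(k, k): v for k, v in record.items()}
    # 2. cleansing rule of thumb: drop records with implausible floor area
    if not (50 <= mapped.get("gross_floor_area_m2", 0) <= 1e6):
        return None
    # 3. derived value: total site energy from fuel-level entries
    mapped["site_energy_kwh"] = (mapped.get("electricity_kwh", 0)
                                 + mapped.get("gas_kwh", 0))
    return mapped

raw = {"area": 1200, "elec": 52000, "gas": 18000}
fmap = {"area": "gross_floor_area_m2", "elec": "electricity_kwh",
        "gas": "gas_kwh"}
print(prepare(raw, fmap))  # cleansed record with site_energy_kwh = 70000
```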
Murumkar, Prashant R; Giridhar, Rajani; Yadav, Mange Ram
2008-04-01
A set of 29 benzothiadiazepine hydroxamates having selective tumor necrosis factor-alpha converting enzyme inhibitory activity was used to compare the quality and predictive power of 3D-quantitative structure-activity relationship comparative molecular field analysis and comparative molecular similarity indices models for the atom-based, centroid/atom-based, database, and docked conformer-based alignments. Removal of two outliers from the initial training set of molecules improved the predictivity of the models. Among the 3D-quantitative structure-activity relationship models developed using the above four alignments, the database alignment provided the optimal predictive comparative molecular field analysis model for the training set, with cross-validated r(2) (q(2)) = 0.510, non-cross-validated r(2) = 0.972, standard error of estimate (s) = 0.098, and F = 215.44, and the optimal comparative molecular similarity indices model, with cross-validated r(2) (q(2)) = 0.556, non-cross-validated r(2) = 0.946, standard error of estimate (s) = 0.163, and F = 99.785. These models also showed the best test set prediction for six compounds, with predictive r(2) values of 0.460 and 0.535, respectively. The contour maps obtained from the 3D-quantitative structure-activity relationship studies were appraised for activity trends for the molecules analyzed. The comparative molecular similarity indices models exhibited good external predictivity compared with that of the comparative molecular field analysis models. The data generated from the present study helped us to further design and report some novel and potent tumor necrosis factor-alpha converting enzyme inhibitors.
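The q(2) statistic reported above is conventionally defined as 1 - PRESS/SS over leave-one-out cross-validated predictions; a short sketch with made-up activity values follows.

```python
def q_squared(y_true, y_pred_loo):
    """Cross-validated q^2 = 1 - PRESS / SS, where PRESS sums squared
    leave-one-out prediction errors and SS is the total sum of squares
    about the mean of the observed values."""
    mean_y = sum(y_true) / len(y_true)
    press = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred_loo))
    ss = sum((yt - mean_y) ** 2 for yt in y_true)
    return 1 - press / ss

# toy pIC50 values and leave-one-out predictions (illustrative numbers)
print(round(q_squared([6.1, 7.3, 5.8, 8.0], [6.4, 7.0, 6.2, 7.5]), 3))  # 0.814
```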
Luther, Stephen L; French, Dustin D; Powell-Cope, Gail; Rubenstein, Laurence Z; Campbell, Robert
2005-10-01
The Veterans Administration (VA) Healthcare system, containing hospital and community-based outpatient clinics, provides the setting for the study. Summary data was obtained from the VA Ambulatory Events Database for fiscal years (FY) 1997-2001 and in-depth data for FY 2001. In FY 2001, the database included approximately 4 million unique patients with 60 million encounters. The purpose of this study was: 1) to quantify injuries and use of services associated with falls among the elderly treated in Veterans Administration (VA) ambulatory care settings using administrative data; 2) to compare fall-related services provided to elderly veterans with those provided to younger veterans. Retrospective analysis of administrative data. This study describes the trends (FY 1997-2001) and patterns of fall-related ambulatory care encounters (FY 2001) in the VA Healthcare System. An approximately four-fold increase in both encounters and patients seen was observed in FY 1997-2001, largely paralleling the growth of VA ambulatory care services. More than two-thirds of the patients treated were found to be over the age of 65. Veterans over the age of 65 were found to be more likely to receive care in the non-urgent setting and had higher numbers of co-morbid conditions than younger veterans. While nearly half of the encounters occurred in the Emergency/Urgent Care setting, fall-related injuries led to services across a wide spectrum of medical and surgical providers/departments. This study represents the first attempt to use the VA Ambulatory Events Database to study fall-related services provided to elderly veterans. In view of the aging population served by the VA and the movement to provide increased services in the outpatient setting, this database provides an important resource for researchers and administrators interested in the prevention and treatment of fall-related injuries.
Tissue Non-Specific Genes and Pathways Associated with Diabetes: An Expression Meta-Analysis.
Mei, Hao; Li, Lianna; Liu, Shijian; Jiang, Fan; Griswold, Michael; Mosley, Thomas
2017-01-21
We performed expression studies to identify tissue non-specific genes and pathways of diabetes by meta-analysis. We searched curated datasets of the Gene Expression Omnibus (GEO) database and identified 13 and five expression studies of diabetes and insulin responses in various tissues, respectively. We tested differential gene expression by an empirical Bayes-based linear method and investigated gene set expression association by knowledge-based enrichment analysis. Meta-analysis by different methods was applied to identify tissue non-specific genes and gene sets. We also proposed pathway mapping analysis to infer functions of the identified gene sets, and correlation and independence analysis to evaluate the expression association profile of genes and gene sets between studies and tissues. Our analysis showed that the PGRMC1 and HADH genes were significant over diabetes studies, while the IRS1 and MPST genes were significant over insulin response studies, and joint analysis showed that the HADH and MPST genes were significant over all combined data sets. The pathway analysis identified six significant gene sets over all studies. The KEGG pathway mapping indicated that the significant gene sets are related to diabetes pathogenesis. The results also showed that 12.8% and 59.0% of pairwise studies had significantly correlated expression association for genes and gene sets, respectively; moreover, 12.8% of pairwise studies had independent expression association for genes, but no pairwise studies differed significantly in expression association of gene sets. Our analysis indicated that there are both tissue-specific and non-specific genes and pathways associated with diabetes pathogenesis. Compared to gene expression, pathway association tends to be tissue non-specific, and a common pathway influencing diabetes development is activated through different genes in different tissues.
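Combining evidence for one gene across independent studies, the core move in this kind of meta-analysis, can be illustrated with Fisher's method; the per-study p-values below are hypothetical, and the paper's own combination method may differ.

```python
import math
from scipy import stats

def fisher_combined(p_values):
    """Fisher's method: X^2 = -2 * sum(ln p_i) follows a chi-squared
    distribution with 2k degrees of freedom under the null."""
    x2 = -2 * sum(math.log(p) for p in p_values)
    return stats.chi2.sf(x2, df=2 * len(p_values))

# hypothetical per-tissue p-values for one gene across four studies
print(fisher_combined([0.04, 0.20, 0.01, 0.33]))
```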
Squires, R. Burke; Noronha, Jyothi; Hunt, Victoria; García‐Sastre, Adolfo; Macken, Catherine; Baumgarth, Nicole; Suarez, David; Pickett, Brett E.; Zhang, Yun; Larsen, Christopher N.; Ramsey, Alvin; Zhou, Liwei; Zaremba, Sam; Kumar, Sanjeev; Deitrich, Jon; Klem, Edward; Scheuermann, Richard H.
2012-01-01
Background The recent emergence of the 2009 pandemic influenza A/H1N1 virus has highlighted the value of free and open access to influenza virus genome sequence data integrated with information about other important virus characteristics. Design The Influenza Research Database (IRD, http://www.fludb.org) is a free, open, publicly‐accessible resource funded by the U.S. National Institute of Allergy and Infectious Diseases through the Bioinformatics Resource Centers program. IRD provides a comprehensive, integrated database and analysis resource for influenza sequence, surveillance, and research data, including user‐friendly interfaces for data retrieval, visualization and comparative genomics analysis, together with personal login‐protected 'workbench' spaces for saving data sets and analysis results. IRD integrates genomic, proteomic, immune epitope, and surveillance data from a variety of sources, including public databases, computational algorithms, external research groups, and the scientific literature. Results To demonstrate the utility of the data and analysis tools available in IRD, two scientific use cases are presented. A comparison of hemagglutinin sequence conservation and epitope coverage information revealed highly conserved protein regions that can be recognized by the human adaptive immune system as possible targets for inducing cross‐protective immunity. Phylogenetic and geospatial analysis of sequences from wild bird surveillance samples revealed a possible evolutionary connection between influenza virus from Delaware Bay shorebirds and Alberta ducks. Conclusions The IRD provides a wealth of integrated data and information about influenza virus to support research of the genetic determinants dictating virus pathogenicity, host range restriction and transmission, and to facilitate development of vaccines, diagnostics, and therapeutics. PMID:22260278
Structural Pattern Recognition Techniques for Data Retrieval in Massive Fusion Databases
NASA Astrophysics Data System (ADS)
Vega, J.; Murari, A.; Rattá, G. A.; Castro, P.; Pereira, A.; Portas, A.
2008-03-01
Diagnostics of present-day reactor-class fusion experiments, like the Joint European Torus (JET), generate thousands of signals (time series and video images) in each discharge. There is a direct correspondence between the physical phenomena taking place in the plasma and the set of structural shapes (patterns) that they form in the signals: bumps, unexpected amplitude changes, abrupt peaks, periodic components, high-intensity zones or specific edge contours. A major difficulty related to data analysis is the identification, in a rapid and automated way, of a set of discharges with comparable behavior, i.e. discharges with "similar" patterns. Pattern recognition techniques are efficient tools to search for similar structural forms within the database in a fast and intelligent way. To this end, classification systems must be developed to be used as indexation methods to directly fetch the most similar patterns.
Algorithms for database-dependent search of MS/MS data.
Matthiesen, Rune
2013-01-01
The frequently used bottom-up strategy for identification of proteins and their associated modifications nowadays typically generates thousands of MS/MS spectra that are normally matched automatically against a protein sequence database. Search engines that take as input MS/MS spectra and a protein sequence database are referred to as database-dependent search engines. Many programs, both commercial and freely available, exist for database-dependent search of MS/MS spectra, and most of them have excellent user documentation. The aim here is therefore to outline the algorithmic strategies behind different search engines rather than to provide software user manuals. The process of database-dependent search can be divided into search strategy, peptide scoring, protein scoring and, finally, protein inference. Most efforts in the literature have gone into comparing results from different software rather than discussing the underlying algorithms. Such practical comparisons can be cluttered by suboptimal implementations, and the observed differences are frequently caused by software parameter settings that have not been set properly to allow an even comparison. In other words, an algorithmic idea can still be worth considering even if its software implementation has been demonstrated to be suboptimal. The aim in this chapter is therefore to split the algorithms for database-dependent searching of MS/MS data into the above steps so that the different algorithmic ideas become more transparent and comparable. Most search engines provide good implementations of the first three data analysis steps mentioned above, whereas the final step of protein inference is much less developed for most search engines and is in many cases performed by external software. The final part of this chapter illustrates how protein inference is built into the VEMS search engine and discusses a stand-alone program, SIR, for protein inference that can import a Mascot search result.
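At the heart of the peptide-scoring step most engines share is a count of theoretical fragment masses matched by observed peaks within a tolerance; the sketch below shows that baseline score with made-up m/z values (real engines refine it with probabilistic models).

```python
import bisect

def count_matches(spectrum_mz, theoretical_mz, tol=0.02):
    """Baseline peptide score: number of theoretical fragment m/z values
    matched by at least one observed peak within +/- tol (Th)."""
    peaks = sorted(spectrum_mz)
    score = 0
    for mz in theoretical_mz:
        i = bisect.bisect_left(peaks, mz - tol)
        if i < len(peaks) and peaks[i] <= mz + tol:
            score += 1
    return score

observed = [175.12, 262.15, 333.18, 446.27]          # toy spectrum peaks
theoretical = [175.119, 262.150, 390.21, 446.272]    # toy fragment ladder
print(count_matches(observed, theoretical))  # 3 of 4 fragments matched
```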
Abedini, Atosa A.; Hurwitz, S.; Evans, William C.
2006-01-01
The database (Version 1.0) is an MS-Excel file that contains close to 5,000 entries of published information on noble gas concentrations and isotopic ratios from volcanic systems in mid-ocean ridges, ocean islands, seamounts, and oceanic and continental arcs (location map). Where available, we also included the isotopic ratios of strontium, neodymium, and carbon. The database is sub-divided both by material sampled (e.g., volcanic glass, different minerals, fumarole, spring) and by tectonic setting (MOR, ocean islands, volcanic arcs). Also included is a reference list, in MS-Word and PDF formats, from which the data were derived. The database extends previous compilations by Ozima (1994), Farley and Neroda (1998), and Graham (2002). The extended database allows scientists to test competing hypotheses, and it provides a framework for analysis of noble gas data during periods of volcanic unrest.
Forensic DNA databases in Western Balkan region: retrospectives, perspectives, and initiatives
Marjanović, Damir; Konjhodžić, Rijad; Butorac, Sara Sanela; Drobnič, Katja; Merkaš, Siniša; Lauc, Gordan; Primorac, Damir; Anđelinović, Šimun; Milosavljević, Mladen; Karan, Željko; Vidović, Stojko; Stojković, Oliver; Panić, Bojana; Vučetić Dragović, Anđelka; Kovačević, Sandra; Jakovski, Zlatko; Asplen, Chris; Primorac, Dragan
2011-01-01
The European Network of Forensic Science Institutes (ENFSI) recommended the establishment of forensic DNA databases and specific implementation and management legislations for all EU/ENFSI members. Therefore, forensic institutions from Bosnia and Herzegovina, Serbia, Montenegro, and Macedonia launched a wide set of activities to support these recommendations. To assess the current state, a regional expert team completed detailed screening and investigation of the existing forensic DNA data repositories and associated legislation in these countries. The scope also included relevant concurrent projects and a wide spectrum of different activities in relation to forensic DNA use. The state of forensic DNA analysis was also determined in the neighboring Slovenia and Croatia, which already have functional national DNA databases. There is a need for a ‘regional supplement’ to the current documentation and standards pertaining to forensic application of DNA databases, which should include regional-specific preliminary aims and recommendations. PMID:21674821
Forensic DNA databases in Western Balkan region: retrospectives, perspectives, and initiatives.
Marjanović, Damir; Konjhodzić, Rijad; Butorac, Sara Sanela; Drobnic, Katja; Merkas, Sinisa; Lauc, Gordan; Primorac, Damir; Andjelinović, Simun; Milosavljević, Mladen; Karan, Zeljko; Vidović, Stojko; Stojković, Oliver; Panić, Bojana; Vucetić Dragović, Andjelka; Kovacević, Sandra; Jakovski, Zlatko; Asplen, Chris; Primorac, Dragan
2011-06-01
The European Network of Forensic Science Institutes (ENFSI) recommended the establishment of forensic DNA databases and specific implementation and management legislations for all EU/ENFSI members. Therefore, forensic institutions from Bosnia and Herzegovina, Serbia, Montenegro, and Macedonia launched a wide set of activities to support these recommendations. To assess the current state, a regional expert team completed detailed screening and investigation of the existing forensic DNA data repositories and associated legislation in these countries. The scope also included relevant concurrent projects and a wide spectrum of different activities in relation to forensic DNA use. The state of forensic DNA analysis was also determined in the neighboring Slovenia and Croatia, which already have functional national DNA databases. There is a need for a 'regional supplement' to the current documentation and standards pertaining to forensic application of DNA databases, which should include regional-specific preliminary aims and recommendations.
A prototypic small molecule database for bronchoalveolar lavage-based metabolomics
NASA Astrophysics Data System (ADS)
Walmsley, Scott; Cruickshank-Quinn, Charmion; Quinn, Kevin; Zhang, Xing; Petrache, Irina; Bowler, Russell P.; Reisdorph, Richard; Reisdorph, Nichole
2018-04-01
The analysis of bronchoalveolar lavage fluid (BALF) using mass spectrometry-based metabolomics can provide insight into lung diseases, such as asthma. However, the important step of compound identification is hindered by the lack of a small molecule database that is specific for BALF. Here we describe prototypic small molecule databases derived from human BALF samples (n=117). Human BALF was extracted into lipid and aqueous fractions and analyzed using liquid chromatography mass spectrometry. Following filtering to reduce contaminants and artifacts, the resulting BALF databases (BALF-DBs) contain 11,736 lipid and 658 aqueous compounds. Over 10% of these were found in 100% of samples. Testing the BALF-DBs using nested test sets produced a 99% match rate for lipids and a 47% match rate for aqueous molecules. Searching an independent dataset resulted in 45% matching to the lipid BALF-DB, compared to <25% when general databases are searched. The BALF-DBs are available for download from MetaboLights. Overall, the BALF-DBs can reduce false positives and improve confidence in compound identification compared to when general databases are used.
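Database matching of the kind reported here usually means finding database m/z values within a ppm tolerance of each measured feature; a minimal sketch follows, with a four-entry toy list standing in for the BALF-DB.

```python
import bisect

def match_ppm(query_mz, db_mz_sorted, tol_ppm=10):
    """Return all database m/z values within +/- tol_ppm of the query."""
    tol = query_mz * tol_ppm * 1e-6
    lo = bisect.bisect_left(db_mz_sorted, query_mz - tol)
    hi = bisect.bisect_right(db_mz_sorted, query_mz + tol)
    return db_mz_sorted[lo:hi]

balf_db = sorted([118.0865, 132.1019, 760.5851, 782.5670])  # toy entries
print(match_ppm(760.5850, balf_db))  # [760.5851] at 10 ppm
```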
The New Zealand Tsunami Database: historical and modern records
NASA Astrophysics Data System (ADS)
Barberopoulou, A.; Downes, G. L.; Cochran, U. A.; Clark, K.; Scheele, F.
2016-12-01
A database of historical (pre-instrumental) and modern (instrumentally recorded) tsunamis that have impacted or been observed in New Zealand has been compiled and published online. New Zealand's tectonic setting, astride an obliquely convergent tectonic boundary on the Pacific Rim, means that it is vulnerable to local, regional and circum-Pacific tsunamis. Despite New Zealand's comparatively short written historical record of c. 200 years, there is a wealth of information about the impact of past tsunamis. The New Zealand Tsunami Database currently has 800+ entries that describe >50 high-validity tsunamis. Sources of historical information include witness reports recorded in diaries, notes, newspapers, books, and photographs. Information on recent events comes from tide gauges and other instrumental recordings such as DART® buoys, and media of greater variety, for example, video and online surveys. The New Zealand Tsunami Database is an ongoing project with information added as further historical records come to light. Modern tsunamis are also added to the database once the relevant data for an event has been collated and edited. This paper briefly overviews the procedures and tools used in the recording and analysis of New Zealand's historical tsunamis, with emphasis on database content.
Database resources of the National Center for Biotechnology Information.
Sayers, Eric W; Barrett, Tanya; Benson, Dennis A; Bolton, Evan; Bryant, Stephen H; Canese, Kathi; Chetvernin, Vyacheslav; Church, Deanna M; Dicuccio, Michael; Federhen, Scott; Feolo, Michael; Fingerman, Ian M; Geer, Lewis Y; Helmberg, Wolfgang; Kapustin, Yuri; Krasnov, Sergey; Landsman, David; Lipman, David J; Lu, Zhiyong; Madden, Thomas L; Madej, Tom; Maglott, Donna R; Marchler-Bauer, Aron; Miller, Vadim; Karsch-Mizrachi, Ilene; Ostell, James; Panchenko, Anna; Phan, Lon; Pruitt, Kim D; Schuler, Gregory D; Sequeira, Edwin; Sherry, Stephen T; Shumway, Martin; Sirotkin, Karl; Slotta, Douglas; Souvorov, Alexandre; Starchenko, Grigory; Tatusova, Tatiana A; Wagner, Lukas; Wang, Yanli; Wilbur, W John; Yaschenko, Eugene; Ye, Jian
2012-01-01
In addition to maintaining the GenBank® nucleic acid sequence database, the National Center for Biotechnology Information (NCBI) provides analysis and retrieval resources for the data in GenBank and other biological data made available through the NCBI Website. NCBI resources include Entrez, the Entrez Programming Utilities, MyNCBI, PubMed, PubMed Central (PMC), Gene, the NCBI Taxonomy Browser, BLAST, BLAST Link (BLink), Primer-BLAST, COBALT, Splign, RefSeq, UniGene, HomoloGene, ProtEST, dbMHC, dbSNP, dbVar, Epigenomics, Genome and related tools, the Map Viewer, Model Maker, Evidence Viewer, Trace Archive, Sequence Read Archive, BioProject, BioSample, Retroviral Genotyping Tools, HIV-1/Human Protein Interaction Database, Gene Expression Omnibus (GEO), Probe, Online Mendelian Inheritance in Animals (OMIA), the Molecular Modeling Database (MMDB), the Conserved Domain Database (CDD), the Conserved Domain Architecture Retrieval Tool (CDART), Biosystems, Protein Clusters and the PubChem suite of small molecule databases. Augmenting many of the Web applications are custom implementations of the BLAST program optimized to search specialized data sets. All of these resources can be accessed through the NCBI home page at www.ncbi.nlm.nih.gov.
Database resources of the National Center for Biotechnology Information
Acland, Abigail; Agarwala, Richa; Barrett, Tanya; Beck, Jeff; Benson, Dennis A.; Bollin, Colleen; Bolton, Evan; Bryant, Stephen H.; Canese, Kathi; Church, Deanna M.; Clark, Karen; DiCuccio, Michael; Dondoshansky, Ilya; Federhen, Scott; Feolo, Michael; Geer, Lewis Y.; Gorelenkov, Viatcheslav; Hoeppner, Marilu; Johnson, Mark; Kelly, Christopher; Khotomlianski, Viatcheslav; Kimchi, Avi; Kimelman, Michael; Kitts, Paul; Krasnov, Sergey; Kuznetsov, Anatoliy; Landsman, David; Lipman, David J.; Lu, Zhiyong; Madden, Thomas L.; Madej, Tom; Maglott, Donna R.; Marchler-Bauer, Aron; Karsch-Mizrachi, Ilene; Murphy, Terence; Ostell, James; O'Sullivan, Christopher; Panchenko, Anna; Phan, Lon; Preuss, Don; Pruitt, Kim D.; Rubinstein, Wendy; Sayers, Eric W.; Schneider, Valerie; Schuler, Gregory D.; Sequeira, Edwin; Sherry, Stephen T.; Shumway, Martin; Sirotkin, Karl; Siyan, Karanjit; Slotta, Douglas; Soboleva, Alexandra; Soussov, Vladimir; Starchenko, Grigory; Tatusova, Tatiana A.; Trawick, Bart W.; Vakatov, Denis; Wang, Yanli; Ward, Minghong; Wilbur, W. John; Yaschenko, Eugene; Zbicz, Kerry
2014-01-01
In addition to maintaining the GenBank® nucleic acid sequence database, the National Center for Biotechnology Information (NCBI, http://www.ncbi.nlm.nih.gov) provides analysis and retrieval resources for the data in GenBank and other biological data made available through the NCBI Web site. NCBI resources include Entrez, the Entrez Programming Utilities, MyNCBI, PubMed, PubMed Central, PubReader, Gene, the NCBI Taxonomy Browser, BLAST, BLAST Link, Primer-BLAST, COBALT, RefSeq, UniGene, HomoloGene, ProtEST, dbMHC, dbSNP, dbVar, Epigenomics, the Genetic Testing Registry, Genome and related tools, the Map Viewer, Trace Archive, Sequence Read Archive, BioProject, BioSample, ClinVar, MedGen, HIV-1/Human Protein Interaction Database, Gene Expression Omnibus, Probe, Online Mendelian Inheritance in Animals, the Molecular Modeling Database, the Conserved Domain Database, the Conserved Domain Architecture Retrieval Tool, Biosystems, Protein Clusters and the PubChem suite of small molecule databases. Augmenting many of the Web applications are custom implementations of the BLAST program optimized to search specialized data sets. All these resources can be accessed through the NCBI home page. PMID:24259429
Database resources of the National Center for Biotechnology Information
Wheeler, David L.; Church, Deanna M.; Federhen, Scott; Lash, Alex E.; Madden, Thomas L.; Pontius, Joan U.; Schuler, Gregory D.; Schriml, Lynn M.; Sequeira, Edwin; Tatusova, Tatiana A.; Wagner, Lukas
2003-01-01
In addition to maintaining the GenBank(R) nucleic acid sequence database, the National Center for Biotechnology Information (NCBI) provides data analysis and retrieval resources for the data in GenBank and other biological data made available through NCBI's Web site. NCBI resources include Entrez, PubMed, PubMed Central (PMC), LocusLink, the NCBI Taxonomy Browser, BLAST, BLAST Link (BLink), Electronic PCR (e-PCR), Open Reading Frame (ORF) Finder, Reference Sequence (RefSeq), UniGene, HomoloGene, ProtEST, Database of Single Nucleotide Polymorphisms (dbSNP), Human/Mouse Homology Map, Cancer Chromosome Aberration Project (CCAP), Entrez Genomes and related tools, the Map Viewer, Model Maker (MM), Evidence Viewer (EV), Clusters of Orthologous Groups (COGs) database, Retroviral Genotyping Tools, SAGEmap, Gene Expression Omnibus (GEO), Online Mendelian Inheritance in Man (OMIM), the Molecular Modeling Database (MMDB), the Conserved Domain Database (CDD), and the Conserved Domain Architecture Retrieval Tool (CDART). Augmenting many of the Web applications are custom implementations of the BLAST program optimized to search specialized data sets. All of the resources can be accessed through the NCBI home page at: http://www.ncbi.nlm.nih.gov. PMID:12519941
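The Entrez Programming Utilities mentioned in these records are a public HTTP API; the small example below runs a PubMed search through the real esearch endpoint (the query term is arbitrary, and the JSON field layout reflects NCBI's current documented response, which may evolve).

```python
import requests

resp = requests.get(
    "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi",
    params={"db": "pubmed",
            "term": "gene set enrichment analysis",
            "retmode": "json",
            "retmax": 5},
    timeout=30)
resp.raise_for_status()
ids = resp.json()["esearchresult"]["idlist"]
print(ids)  # up to five PubMed IDs matching the query
```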
GenColors-based comparative genome databases for small eukaryotic genomes.
Felder, Marius; Romualdi, Alessandro; Petzold, Andreas; Platzer, Matthias; Sühnel, Jürgen; Glöckner, Gernot
2013-01-01
Many sequence data repositories can give a quick and easily accessible overview on genomes and their annotations. Less widespread is the possibility to compare related genomes with each other in a common database environment. We have previously described the GenColors database system (http://gencolors.fli-leibniz.de) and its applications to a number of bacterial genomes such as Borrelia, Legionella, Leptospira and Treponema. This system has an emphasis on genome comparison. It combines data from related genomes and provides the user with an extensive set of visualization and analysis tools. Eukaryote genomes are normally larger than prokaryote genomes and thus pose additional challenges for such a system. We have, therefore, adapted GenColors to also handle larger datasets of small eukaryotic genomes and to display eukaryotic gene structures. Further recent developments include whole genome views, genome list options and, for bacterial genome browsers, the display of horizontal gene transfer predictions. Two new GenColors-based databases for two fungal species (http://fgb.fli-leibniz.de) and for four social amoebas (http://sacgb.fli-leibniz.de) were set up. Both new resources open up a single entry point for related genomes for the amoebozoa and fungal research communities and other interested users. Comparative genomics approaches are greatly facilitated by these resources.
Building a medical image processing algorithm verification database
NASA Astrophysics Data System (ADS)
Brown, C. Wayne
2000-06-01
The design of a database containing head Computed Tomography (CT) studies is presented, along with a justification for the database's composition. The database will be used to validate software algorithms that screen normal head CT studies from studies that contain pathology. The database is designed to have the following major properties: (1) a size sufficient for statistical viability, (2) inclusion of both normal (no pathology) and abnormal scans, (3) inclusion of scans due to equipment malfunction, technologist error, and uncooperative patients, (4) inclusion of data sets from multiple scanner manufacturers, (5) inclusion of data sets from different gender and age groups, and (6) three independent diagnoses of each data set. Designed correctly, the database will provide a partial basis for FDA (United States Food and Drug Administration) approval of image processing algorithms for clinical use. Our goal for the database is the proof of viability of screening head CTs for normal anatomy using computer algorithms. To put this work into context, a classification scheme for 'computer aided diagnosis' systems is proposed.
PyPathway: Python Package for Biological Network Analysis and Visualization.
Xu, Yang; Luo, Xiao-Chun
2018-05-01
Life science studies represent one of the biggest generators of large data sets, mainly because of rapid sequencing technological advances. Biological networks, including interaction networks and human-curated pathways, are essential to understand these high-throughput data sets. Biological network analysis offers a method to explore systematically not only the molecular complexity of a particular disease but also the molecular relationships among apparently distinct phenotypes. Currently, several packages for the Python community have been developed, such as BioPython and Goatools. However, tools to perform comprehensive network analysis and visualization are still needed. Here, we have developed PyPathway, an extensible, free and open source Python package for functional enrichment analysis, network modeling, and network visualization. The network process module supports various interaction network and pathway databases such as Reactome, WikiPathway, STRING, and BioGRID. The network analysis module implements overrepresentation analysis, gene set enrichment analysis, network-based enrichment, and de novo network modeling. Finally, the visualization and data publishing modules enable users to share their analyses through an easy web application. For package availability, see the first Reference.
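Overrepresentation analysis, one of the methods the package implements, reduces to a hypergeometric tail test; the generic sketch below shows the standard statistic (not PyPathway's actual code), with made-up counts.

```python
from scipy.stats import hypergeom

def ora_p(hits_in_set, set_size, hits_total, universe):
    """Over-representation p-value: probability of drawing at least
    `hits_in_set` pathway genes among `hits_total` study genes, from a
    `universe` containing `set_size` pathway genes in total."""
    return hypergeom.sf(hits_in_set - 1, universe, set_size, hits_total)

# e.g. 8 of 50 differentially expressed genes fall in a 120-gene pathway
print(ora_p(8, 120, 50, 20000))
```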
Computer systems and methods for the query and visualization of multidimensional databases
Stolte, Chris; Tang, Diane L.; Hanrahan, Patrick
2006-08-08
A method and system for producing graphics. A hierarchical structure of a database is determined. A visual table, comprising a plurality of panes, is constructed by providing a specification that is in a language based on the hierarchical structure of the database. In some cases, this language can include fields that are in the database schema. The database is queried to retrieve a set of tuples in accordance with the specification. A subset of the set of tuples is associated with a pane in the plurality of panes.
Computer systems and methods for the query and visualization of multidimensional database
Stolte, Chris; Tang, Diane L.; Hanrahan, Patrick
2010-05-11
A method and system for producing graphics. A hierarchical structure of a database is determined. A visual table, comprising a plurality of panes, is constructed by providing a specification that is in a language based on the hierarchical structure of the database. In some cases, this language can include fields that are in the database schema. The database is queried to retrieve a set of tuples in accordance with the specification. A subset of the set of tuples is associated with a pane in the plurality of panes.
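The core step both patent records above describe, associating subsets of the retrieved tuples with panes, can be pictured as grouping tuples by the row and column fields of the specification. The sketch below is a schematic reading with arbitrary field names, not the patented system's implementation.

```python
def panes(tuples, row_field, col_field):
    """Assign each tuple to the pane addressed by its (row, column) values,
    yielding a mapping from pane coordinates to member tuples."""
    by_pane = {}
    for t in tuples:
        by_pane.setdefault((t[row_field], t[col_field]), []).append(t)
    return by_pane

rows = [{"region": "East", "year": 2005, "sales": 10},
        {"region": "East", "year": 2006, "sales": 12},
        {"region": "West", "year": 2005, "sales": 7}]
for pane, members in panes(rows, "region", "year").items():
    print(pane, len(members))  # each pane of the visual table gets its tuples
```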
CB Database: A change blindness database for objects in natural indoor scenes.
Sareen, Preeti; Ehinger, Krista A; Wolfe, Jeremy M
2016-12-01
Change blindness has been a topic of interest in the cognitive sciences for decades. Change detection experiments are frequently used for studying various research topics such as attention and perception. However, creating change detection stimuli is tedious, and there is no open repository of such stimuli using natural scenes. We introduce the Change Blindness (CB) Database with object changes in 130 colored images of natural indoor scenes. Size and eccentricity are provided for all changes, as well as reaction time data from a baseline experiment. In addition, we have two specialized satellite databases that are subsets of the 130 images. In one set, changes are seen in rooms or in mirrors in those rooms (Mirror Change Database). In the other, changes occur in a room or out a window (Window Change Database). Both sets have controlled background, change size, and eccentricity. The CB Database is intended to provide researchers with a stimulus set of natural scenes with defined stimulus parameters that can be used for a wide range of experiments. The CB Database can be found at http://search.bwh.harvard.edu/new/CBDatabase.html.
Manchester visual query language
NASA Astrophysics Data System (ADS)
Oakley, John P.; Davis, Darryl N.; Shann, Richard T.
1993-04-01
We report a database language for visual retrieval which allows queries on image feature information that has been computed and stored along with images. The language is novel in that it provides facilities for dealing with feature data which has actually been obtained from image analysis. Each line in the Manchester Visual Query Language (MVQL) takes a set of objects as input and produces another, usually smaller, set as output. The MVQL constructs are mainly based on proven operators from the field of digital image analysis. An example is the Hough-group operator, which takes as input a specification for the objects to be grouped, a specification for the relevant Hough space, and a definition of the voting rule. The output is a ranked list of high-scoring bins. The query could be directed towards one particular image or an entire image database; in the latter case, the bins in the output list would in general be associated with different images. We have implemented MVQL in two layers. The command interpreter is a Lisp program which maps each MVQL line to a sequence of commands used to control a specialized database engine. The latter is a hybrid graph/relational system which provides low-level support for inheritance and schema evolution. In the paper we outline the language and provide examples of useful queries. We also describe our solution to the engineering problems associated with the implementation of MVQL.
Analysis of the diffuse ionized gas database: DIGEDA
NASA Astrophysics Data System (ADS)
Flores-Fajardo, N.; Morisset, C.; Binette, L.
2009-10-01
Studies of the Diffuse Ionized Gas (DIG) have so far progressed without providing any strict criterion to distinguish DIGs from H II regions. In this work, we compile the emission line measurements of 29 galaxies that are available in the scientific literature, thereby setting up the first DIG database (DIGEDA). Making use of this database, we proceed to analyze the global properties of the DIG using the [N II]λ6583/Hα, [O I]λ6300/Hα, [O III]λ5007/Hβ and [S II]λ6716/Hα line ratios, as well as the Hα emission measure. This analysis leads us to conclude that the [N II]/Hα ratio provides an objective criterion for distinguishing whether an emission region is a DIG or an H II region, while the EM(Hα) is a useful quantity only when the galaxies are considered individually. Finally, we find that the emission regions of Irr galaxies classified as DIG in the literature appear in fact to be much more similar to H II regions than to the DIGs of spiral galaxies.
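Used as a cut, the [N II]/Hα criterion reduces to a one-line classifier; the threshold below is only a placeholder, since the actual cut value is the paper's result.

```python
def classify(nii_ha_ratio, threshold=0.6):
    """Flag a region as DIG when [N II]6583/Halpha exceeds a threshold
    (DIG shows enhanced [N II]/Halpha; 0.6 is a placeholder value)."""
    return "DIG" if nii_ha_ratio > threshold else "H II region"

for r in (0.3, 0.9):
    print(r, "->", classify(r))
```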
David, Matthieu; Fertin, Guillaume; Rogniaux, Hélène; Tessier, Dominique
2017-08-04
The analysis of discovery proteomics experiments relies on algorithms that identify peptides from their tandem mass spectra. The almost exhaustive interpretation of these spectra remains an unresolved issue. At present, an important number of missing interpretations is probably due to peptides displaying post-translational modifications and variants that yield spectra that are particularly difficult to interpret. However, the emergence of a new generation of mass spectrometers that provide high fragment ion accuracy has paved the way for more efficient algorithms. We present a new software tool, SpecOMS, that can handle the computational complexity of pairwise comparisons of spectra in the context of large volumes. SpecOMS can compare a whole set of experimental spectra generated by a discovery proteomics experiment to a whole set of theoretical spectra deduced from a protein database in a few minutes on a standard workstation. SpecOMS can ingeniously exploit those capabilities to improve the peptide identification process, allowing strong competition between all possible peptides for spectrum interpretation. Remarkably, this software resolves the drawbacks (i.e., efficiency problems and decreased sensitivity) that usually accompany open modification searches. We highlight this promising approach using results obtained from the analysis of a public human data set downloaded from the PRIDE (PRoteomics IDEntification) database.
PANGEA: pipeline for analysis of next generation amplicons
Giongo, Adriana; Crabb, David B; Davis-Richardson, Austin G; Chauliac, Diane; Mobberley, Jennifer M; Gano, Kelsey A; Mukherjee, Nabanita; Casella, George; Roesch, Luiz FW; Walts, Brandon; Riva, Alberto; King, Gary; Triplett, Eric W
2010-01-01
High-throughput DNA sequencing can identify organisms and describe population structures in many environmental and clinical samples. Current technologies generate millions of reads in a single run, requiring extensive computational strategies to organize, analyze and interpret those sequences. A series of bioinformatics tools for high-throughput sequencing analysis, including preprocessing, clustering, database matching and classification, have been compiled into a pipeline called PANGEA. The PANGEA pipeline was written in Perl and can be run on Mac OSX, Windows or Linux. With PANGEA, sequences obtained directly from the sequencer can be processed quickly to provide the files needed for sequence identification by BLAST and for comparison of microbial communities. Two different sets of bacterial 16S rRNA sequences were used to show the efficiency of this workflow. The first set of 16S rRNA sequences is derived from various soils from Hawaii Volcanoes National Park. The second set is derived from stool samples collected from diabetes-resistant and diabetes-prone rats. The workflow described here allows the investigator to quickly assess libraries of sequences on personal computers with customized databases. PANGEA is provided for users as individual scripts for each step in the process or as a single script where all processes, except the χ2 step, are joined into one program called the ‘backbone’. PMID:20182525
PANGEA: pipeline for analysis of next generation amplicons.
Giongo, Adriana; Crabb, David B; Davis-Richardson, Austin G; Chauliac, Diane; Mobberley, Jennifer M; Gano, Kelsey A; Mukherjee, Nabanita; Casella, George; Roesch, Luiz F W; Walts, Brandon; Riva, Alberto; King, Gary; Triplett, Eric W
2010-07-01
High-throughput DNA sequencing can identify organisms and describe population structures in many environmental and clinical samples. Current technologies generate millions of reads in a single run, requiring extensive computational strategies to organize, analyze and interpret those sequences. A series of bioinformatics tools for high-throughput sequencing analysis, including pre-processing, clustering, database matching and classification, have been compiled into a pipeline called PANGEA. The PANGEA pipeline was written in Perl and can be run on Mac OSX, Windows or Linux. With PANGEA, sequences obtained directly from the sequencer can be processed quickly to provide the files needed for sequence identification by BLAST and for comparison of microbial communities. Two different sets of bacterial 16S rRNA sequences were used to show the efficiency of this workflow. The first set of 16S rRNA sequences is derived from various soils from Hawaii Volcanoes National Park. The second set is derived from stool samples collected from diabetes-resistant and diabetes-prone rats. The workflow described here allows the investigator to quickly assess libraries of sequences on personal computers with customized databases. PANGEA is provided for users as individual scripts for each step in the process or as a single script where all processes, except the χ2 step, are joined into one program called the 'backbone'.
Radio-science performance analysis software
NASA Astrophysics Data System (ADS)
Morabito, D. D.; Asmar, S. W.
1995-02-01
The Radio Science Systems Group (RSSG) provides various support functions for several flight project radio-science teams. Among these support functions are uplink and sequence planning, real-time operations monitoring and support, data validation, archiving and distribution functions, and data processing and analysis. This article describes the support functions that encompass radio-science data performance analysis. The primary tool used by the RSSG to fulfill this support function is the STBLTY program set. STBLTY is used to reconstruct observable frequencies and calculate model frequencies, frequency residuals, frequency stability in terms of Allan deviation, reconstructed phase, frequency and phase power spectral density, and frequency drift rates. In the case of one-way data, using an ultrastable oscillator (USO) as a frequency reference, the program set computes the spacecraft transmitted frequency and maintains a database containing the in-flight history of the USO measurements. The program set also produces graphical displays. Some examples and discussions on operating the program set on Galileo and Ulysses data will be presented.
Radio-Science Performance Analysis Software
NASA Astrophysics Data System (ADS)
Morabito, D. D.; Asmar, S. W.
1994-10-01
The Radio Science Systems Group (RSSG) provides various support functions for several flight project radio-science teams. Among these support functions are uplink and sequence planning, real-time operations monitoring and support, data validation, archiving and distribution functions, and data processing and analysis. This article describes the support functions that encompass radio science data performance analysis. The primary tool used by the RSSG to fulfill this support function is the STBLTY program set. STBLTY is used to reconstruct observable frequencies and calculate model frequencies, frequency residuals, frequency stability in terms of Allan deviation, reconstructed phase, frequency and phase power spectral density, and frequency drift rates. In the case of one-way data, using an ultrastable oscillator (USO) as a frequency reference, the program set computes the spacecraft transmitted frequency and maintains a database containing the in-flight history of the USO measurements. The program set also produces graphical displays. Some examples and discussion on operating the program set on Galileo and Ulysses data will be presented.
Radio-science performance analysis software
NASA Technical Reports Server (NTRS)
Morabito, D. D.; Asmar, S. W.
1995-01-01
The Radio Science Systems Group (RSSG) provides various support functions for several flight project radio-science teams. Among these support functions are uplink and sequence planning, real-time operations monitoring and support, data validation, archiving and distribution functions, and data processing and analysis. This article describes the support functions that encompass radio-science data performance analysis. The primary tool used by the RSSG to fulfill this support function is the STBLTY program set. STBLTY is used to reconstruct observable frequencies and calculate model frequencies, frequency residuals, frequency stability in terms of Allan deviation, reconstructed phase, frequency and phase power spectral density, and frequency drift rates. In the case of one-way data, using an ultrastable oscillator (USO) as a frequency reference, the program set computes the spacecraft transmitted frequency and maintains a database containing the in-flight history of the USO measurements. The program set also produces graphical displays. Some examples and discussions on operating the program set on Galileo and Ulysses data will be presented.
Database resources of the National Center for Biotechnology Information
Wheeler, David L.; Barrett, Tanya; Benson, Dennis A.; Bryant, Stephen H.; Canese, Kathi; Chetvernin, Vyacheslav; Church, Deanna M.; DiCuccio, Michael; Edgar, Ron; Federhen, Scott; Feolo, Michael; Geer, Lewis Y.; Helmberg, Wolfgang; Kapustin, Yuri; Khovayko, Oleg; Landsman, David; Lipman, David J.; Madden, Thomas L.; Maglott, Donna R.; Miller, Vadim; Ostell, James; Pruitt, Kim D.; Schuler, Gregory D.; Shumway, Martin; Sequeira, Edwin; Sherry, Steven T.; Sirotkin, Karl; Souvorov, Alexandre; Starchenko, Grigory; Tatusov, Roman L.; Tatusova, Tatiana A.; Wagner, Lukas; Yaschenko, Eugene
2008-01-01
In addition to maintaining the GenBank(R) nucleic acid sequence database, the National Center for Biotechnology Information (NCBI) provides analysis and retrieval resources for the data in GenBank and other biological data available through NCBI's web site. NCBI resources include Entrez, the Entrez Programming Utilities, My NCBI, PubMed, PubMed Central, Entrez Gene, the NCBI Taxonomy Browser, BLAST, BLAST Link, Electronic PCR, OrfFinder, Spidey, Splign, RefSeq, UniGene, HomoloGene, ProtEST, dbMHC, dbSNP, Cancer Chromosomes, Entrez Genome, Genome Project and related tools, the Trace, Assembly, and Short Read Archives, the Map Viewer, Model Maker, Evidence Viewer, Clusters of Orthologous Groups, Influenza Viral Resources, HIV-1/Human Protein Interaction Database, Gene Expression Omnibus, Entrez Probe, GENSAT, Database of Genotype and Phenotype, Online Mendelian Inheritance in Man, Online Mendelian Inheritance in Animals, the Molecular Modeling Database, the Conserved Domain Database, the Conserved Domain Architecture Retrieval Tool and the PubChem suite of small molecule databases. Augmenting the web applications are custom implementations of the BLAST program optimized to search specialized data sets. These resources can be accessed through the NCBI home page at www.ncbi.nlm.nih.gov. PMID:18045790
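Most of the resources listed above are also reachable programmatically through the Entrez E-utilities. The sketch below queries the documented esearch endpoint with Python's standard library only; the search term is an arbitrary example, not part of the record above.

    # Query PubMed via the NCBI Entrez E-utilities esearch endpoint.
    from urllib.parse import urlencode
    from urllib.request import urlopen

    params = urlencode({"db": "pubmed", "term": "microRNA[Title]", "retmax": 5})
    url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?" + params
    with urlopen(url) as resp:
        print(resp.read().decode()[:500])   # XML listing matching PubMed IDs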
ERIC Educational Resources Information Center
Chudgar, Amita; Luschei, Thomas F.
2016-01-01
The objective of this commentary is to call attention to the feasibility and importance of large-scale, systematic, quantitative analysis in international and comparative education research. We contend that although many existing databases are under- or unutilized in quantitative international-comparative research, these resources present the…
Epidemiology of Abusive Abdominal Trauma Hospitalizations in United States Children
ERIC Educational Resources Information Center
Lane, Wendy Gwirtzman; Dubowitz, Howard; Langenberg, Patricia; Dischinger, Patricia
2012-01-01
Objectives: (1) To estimate the incidence of abusive abdominal trauma (AAT) hospitalizations among US children age 0-9 years. (2) To identify demographic characteristics of children at highest risk for AAT. Design: Secondary data analysis of a cross-sectional, national hospitalization database. Setting: Hospitalization data from the 2003 and 2006…
A Database of Herbaceous Vegetation Responses to Elevated Atmospheric CO2 (NDP-073)
Jones, Michael H [The Ohio State Univ., Columbus, OH (United States); Curtis, Peter S [The Ohio State Univ., Columbus, OH (United States); Cushman, Robert M [Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States); Brenkert, Antoinette L [Oak Ridge National Lab. (ORNL), Oak Ridge, TN (United States)
1999-01-01
To perform a statistically rigorous meta-analysis of research results on the response by herbaceous vegetation to increased atmospheric CO2 levels, a multiparameter database of responses was compiled from the published literature. Seventy-eight independent CO2-enrichment studies, covering 53 species and 26 response parameters, reported mean response, sample size, and variance of the response (either as standard deviation or standard error). An additional 43 studies, covering 25 species and 6 response parameters, did not report variances. This numeric data package accompanies the Carbon Dioxide Information Analysis Center's (CDIAC's) NDP-072, which provides similar information for woody vegetation. This numeric data package contains a 30-field data set of CO2-exposure experiment responses by herbaceous plants (as both a flat ASCII file and a spreadsheet file), files listing the references to the CO2-exposure experiments and specific comments relevant to the data in the data sets, and this documentation file (which includes SAS and Fortran codes to read the ASCII data file; SAS is a registered trademark of the SAS Institute, Inc., Cary, North Carolina 27511).
Su, Xiaoquan; Xu, Jian; Ning, Kang
2012-10-01
It has long intrigued scientists to effectively compare different microbial communities (also referred to as 'metagenomic samples' here) on a large scale: given a set of unknown samples, find similar metagenomic samples from a large repository and examine how similar these samples are. With the metagenomic samples accumulated to date, it is possible to build a database of metagenomic samples of interest. Any metagenomic sample could then be searched against this database to find the most similar metagenomic sample(s). However, on one hand, current databases with a large number of metagenomic samples mostly serve as data repositories that offer few functionalities for analysis; and on the other hand, methods to measure the similarity of metagenomic data work well only for small sets of samples by pairwise comparison. It is not yet clear how to efficiently search for metagenomic samples against a large metagenomic database. In this study, we have proposed a novel method, Meta-Storms, that could systematically and efficiently organize and search metagenomic data. It includes the following components: (i) creating a database of metagenomic samples based on their taxonomical annotations, (ii) efficient indexing of samples in the database based on a hierarchical taxonomy indexing strategy, (iii) searching for a metagenomic sample against the database by a fast scoring function based on quantitative phylogeny and (iv) managing the database by index export, index import, data insertion, data deletion and database merging. We have collected more than 1300 metagenomic data sets from the public domain and in-house facilities, and tested the Meta-Storms method on these datasets. Our experimental results show that Meta-Storms is capable of database creation and effective searching for a large number of metagenomic samples, and it could achieve similar accuracies compared with the current popular significance testing-based methods. The Meta-Storms method would serve as a suitable database management and search system to quickly identify similar metagenomic samples from a large pool of samples. ningkang@qibebt.ac.cn Supplementary data are available at Bioinformatics online.
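The hierarchy-aware scoring idea can be illustrated with a toy similarity in which abundance shared at deeper taxonomic ranks counts for more than agreement at shallow ranks. This is an illustration of the principle only, not the published Meta-Storms scoring function; the lineages and weights are invented.

    # Toy hierarchy-aware similarity between two taxonomic profiles.
    def shared_at_rank(a, b, rank):
        """Abundance two lineage->abundance maps share at `rank` levels."""
        def collapse(sample):
            agg = {}
            for lineage, x in sample.items():
                agg[lineage[:rank]] = agg.get(lineage[:rank], 0.0) + x
            return agg
        ca, cb = collapse(a), collapse(b)
        return sum(min(ca[k], cb.get(k, 0.0)) for k in ca)

    def similarity(a, b, max_rank=3):
        # Weight deeper (more specific) agreement more strongly.
        total = sum(range(1, max_rank + 1))
        return sum((r / total) * shared_at_rank(a, b, r)
                   for r in range(1, max_rank + 1))

    s1 = {("Bacteria", "Firmicutes", "Bacilli"): 0.6,
          ("Bacteria", "Proteobacteria", "Gamma"): 0.4}
    s2 = {("Bacteria", "Firmicutes", "Clostridia"): 0.5,
          ("Bacteria", "Proteobacteria", "Gamma"): 0.5}
    print(similarity(s1, s2))   # 1.0 would mean identical profiles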
New insights from DEM's into form, process and causality in Distributive Fluvial Systems
NASA Astrophysics Data System (ADS)
Scuderi, Louis; Weissmann, Gary; Hartley, Adrian; Kindilien, Peter
2014-05-01
Recent developments in platforms and sensors, as well as advances in our ability to access these rich data sources in near real time presents geoscientists with both opportunities and problems. We currently record raster and point cloud data about the physical world at unprecedented rates with extremely high spatial and spectral resolution. Yet the ability to extract scientifically useful knowledge from such immense data sets has lagged considerably. The interrelated fields of database creation, data mining and modern geostatistics all focus on such interdisciplinary data analysis problems. In recent years these fields have made great advances in analyzing the complex real-world data such as that captured in Digital Elevation Models (DEM's) and satellite imagery and by LIDAR and other geospatially referenced data sets. However, even considering the vast increase in the use of these data sets in the past decade these methods have enjoyed only a relatively modest penetration into the geosciences when compared to data analysis in other scientific disciplines. In part, a great deal of the current research weakness is due to the lack of a unifying conceptual approach and the failure to appreciate the value of highly structured and synthesized compilations of data, organized in user-friendly formats. We report on the application of these new technologies and database approaches to global scale parameterization of Distributive Fluvial Systems (DFS) within continental sedimentary basins and illustrate the value of well-constructed databases and tool-rich analysis environments for understanding form, process and causality in these systems. We analyzed the characteristics of aggradational fluvial systems in more than 700 modern continental sedimentary basins and the links between DFS within these systems and their contributing drainage basins. Our studies show that in sedimentary basins, distributive fluvial and alluvial systems dominate the depositional environment. Consequently, we have found that studies of modern tributary drainage systems in degradational settings are likely insufficient for understanding the geomorphology expressed within these basins and ultimately for understanding the basin-scale architecture of dominantly distributive fluvial deposits preserved in the rock record.
Interactive Database of Pulsar Flux Density Measurements
NASA Astrophysics Data System (ADS)
Koralewska, O.; Krzeszowski, K.; Kijak, J.; Lewandowski, W.
2012-12-01
The number of astronomical observations is steadily growing, giving rise to the need of cataloguing the obtained results. Many databases have been created to store different types of data and serve a variety of purposes, e.g. databases providing basic data for astronomical objects (SIMBAD Astronomical Database), databases devoted to one type of astronomical object (ATNF Pulsar Database) or to a set of values of a specific parameter (Lorimer 1995 - a database of flux density measurements for 280 pulsars at frequencies up to 1606 MHz), etc. We found that creating an online database of pulsar flux measurements, provided with facilities for plotting diagrams and histograms, calculating mean values for a chosen set of data, filtering parameter values and adding new measurements by registered users, could be useful in further studies on pulsar spectra.
Chang, Suhua; Zhang, Jiajie; Liao, Xiaoyun; Zhu, Xinxing; Wang, Dahai; Zhu, Jiang; Feng, Tao; Zhu, Baoli; Gao, George F; Wang, Jian; Yang, Huanming; Yu, Jun; Wang, Jing
2007-01-01
Frequent outbreaks of highly pathogenic avian influenza and the increasing data available for comparative analysis require a central database specialized in influenza viruses (IVs). We have established the Influenza Virus Database (IVDB) to integrate information and create an analysis platform for genetic, genomic, and phylogenetic studies of the virus. IVDB hosts complete genome sequences of influenza A virus generated by Beijing Institute of Genomics (BIG) and curates all other published IV sequences after expert annotation. Our Q-Filter system classifies and ranks all nucleotide sequences into seven categories according to sequence content and integrity. IVDB provides a series of tools and viewers for comparative analysis of the viral genomes, genes, genetic polymorphisms and phylogenetic relationships. A search system has been developed for users to retrieve a combination of different data types by setting search options. To facilitate analysis of global viral transmission and evolution, the IV Sequence Distribution Tool (IVDT) has been developed to display the worldwide geographic distribution of chosen viral genotypes and to couple genomic data with epidemiological data. The BLAST, multiple sequence alignment and phylogenetic analysis tools were integrated for online data analysis. Furthermore, IVDB offers instant access to pre-computed alignments and polymorphisms of IV genes and proteins, and presents the results as SNP distribution plots and minor allele distributions. IVDB is publicly available at http://influenza.genomics.org.cn.
Structured Forms Reference Set of Binary Images (SFRS)
National Institute of Standards and Technology Data Gateway
NIST Structured Forms Reference Set of Binary Images (SFRS) (Web, free access) The NIST Structured Forms Database (Special Database 2) consists of 5,590 pages of binary, black-and-white images of synthesized documents. The documents in this database are 12 different tax forms from the IRS 1040 Package X for the year 1988.
The need for a juvenile fire setting database.
Klein, Julianne J; Mondozzi, Mary A; Andrews, David A
2008-01-01
A juvenile fire setter can be classified as any youth setting a fire regardless of the reason. Many communities have programs to deal with this problem, most based on models developed by the United States Fire Administration. We reviewed our program's data to compare it with that published nationally. Currently there is not a nationwide database to compare fire setter data. A single-institution, retrospective chart review of all fire setters between January 1, 2003 and December 31, 2005 was completed. There were 133 participants ages 3 to 17. Information obtained included age, location, ignition source, court order and recidivism. Analysis of our data set found the peak ages for fire involvement to be 12 and 14 (26%). Location, ignition source, and court-ordered participants were divided into two age groups: 3 to 10 (N = 58) and 11 to 17 (N = 75). Bedrooms ranked first for the younger population and schools for the latter. Fifty-four percent of the 133 participants used lighters over matches. Twelve percent of the 3- to 10-year-olds were court mandated, compared with 52% of the 11- to 17-year-olds. Recidivism rates were 4 to 10% with a 33 to 38% survey return rate. Currently there is no state or nationwide, time-honored database to compare facts from which conclusions can be drawn. Starting small with a statewide database could provide a stimulus for a national database. This could also enhance the information provided by the United States Fire Administration's National Fire Data Center, beginning one juvenile fire setter program and State Fire Marshal's office at a time.
Komatsu, Setsuko; Wang, Xin; Yin, Xiaojian; Nanjo, Yohei; Ohyanagi, Hajime; Sakata, Katsumi
2017-06-23
The Soybean Proteome Database (SPD) stores data on soybean proteins obtained with gel-based and gel-free proteomic techniques. The database was constructed to provide information on proteins for functional analyses. The majority of the data is focused on soybean (Glycine max 'Enrei'). The growth and yield of soybean are strongly affected by environmental stresses such as flooding. The database was originally constructed using data on soybean proteins separated by two-dimensional polyacrylamide gel electrophoresis, which is a gel-based proteomic technique. Since 2015, the database has been expanded to incorporate data obtained by label-free mass spectrometry-based quantitative proteomics, which is a gel-free proteomic technique. Here, the portions of the database consisting of gel-free proteomic data are described. The gel-free proteomic database contains 39,212 proteins identified in 63 sample sets, such as temporal and organ-specific samples of soybean plants grown under flooding stress or non-stressed conditions. In addition, data on organellar proteins identified in mitochondria, nuclei, and endoplasmic reticulum are stored. Furthermore, the database integrates multiple omics data such as genomics, transcriptomics, metabolomics, and proteomics. The SPD database is accessible at http://proteome.dc.affrc.go.jp/Soybean/. The Soybean Proteome Database stores data obtained from both gel-based and gel-free proteomic techniques. The gel-free proteomic database comprises 39,212 proteins identified in 63 sample sets, such as different organs of soybean plants grown under flooding stress or non-stressed conditions in a time-dependent manner. In addition, organellar proteins identified in mitochondria, nuclei, and endoplasmic reticulum are stored in the gel-free proteomics database. A total of 44,704 proteins, including 5490 proteins identified using a gel-based proteomic technique, are stored in the SPD. This accounts for approximately 80% of all proteins predicted from the genome sequences, though there are overlapping proteins. Based on the demonstrated application of data stored in the database to functional analyses, it is suggested that these data will be useful for analyses of biological mechanisms in soybean. Furthermore, coupled with recent advances in information and communication technology, the usefulness of this database will increase for analyses of biological mechanisms. Copyright © 2017 Elsevier B.V. All rights reserved.
Building An Integrated Neurodegenerative Disease Database At An Academic Health Center
Xie, Sharon X.; Baek, Young; Grossman, Murray; Arnold, Steven E.; Karlawish, Jason; Siderowf, Andrew; Hurtig, Howard; Elman, Lauren; McCluskey, Leo; Van Deerlin, Vivianna; Lee, Virginia M.-Y.; Trojanowski, John Q.
2010-01-01
Background It is becoming increasingly important to study common and distinct etiologies, clinical and pathological features, and mechanisms related to neurodegenerative diseases such as Alzheimer's disease (AD), Parkinson's disease (PD), amyotrophic lateral sclerosis (ALS), and frontotemporal lobar degeneration (FTLD). These comparative studies rely on powerful database tools to quickly generate data sets that match diverse and complementary criteria set by the studies. Methods In this paper, we present a novel Integrated NeuroDegenerative Disease (INDD) database developed at the University of Pennsylvania (Penn) through a consortium of Penn investigators. Since these investigators work on AD, PD, ALS and FTLD, this allowed us to achieve the goal of developing an INDD database for these major neurodegenerative disorders. We used Microsoft SQL Server as the platform, with built-in “backwards” functionality to provide Access as a front-end client to interface with the database. We used PHP Hypertext Preprocessor to create the “front end” web interface and then integrated the individual neurodegenerative disease databases using a master lookup table. We also present methods of data entry, database security, database backups, and database audit trails for this INDD database. Results We compare the results of a biomarker study using the INDD database to those of an alternative approach querying the individual databases separately. Conclusions We have demonstrated that the Penn INDD database has the ability to query multiple database tables from a single console with high accuracy and reliability. The INDD database provides a powerful tool for generating data sets in comparative studies across several neurodegenerative diseases. PMID:21784346
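The master-lookup-table design described above can be sketched in a few lines. The example uses SQLite so it is self-contained (the actual system runs on Microsoft SQL Server), and every table and column name is invented for illustration.

    # One query spanning disease-specific tables via a master lookup table.
    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
    CREATE TABLE master_lookup (patient_id INTEGER, disease_db TEXT);
    CREATE TABLE ad_db (patient_id INTEGER, mmse INTEGER);
    CREATE TABLE pd_db (patient_id INTEGER, updrs INTEGER);
    INSERT INTO master_lookup VALUES (1, 'AD'), (2, 'PD');
    INSERT INTO ad_db VALUES (1, 24);
    INSERT INTO pd_db VALUES (2, 31);
    """)
    rows = con.execute("""
    SELECT m.patient_id, m.disease_db, a.mmse, p.updrs
    FROM master_lookup m
    LEFT JOIN ad_db a ON a.patient_id = m.patient_id
    LEFT JOIN pd_db p ON p.patient_id = m.patient_id
    """).fetchall()
    print(rows)   # [(1, 'AD', 24, None), (2, 'PD', None, 31)]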
Scabbio, Camilla; Zoccarato, Orazio; Malaspina, Simona; Lucignani, Giovanni; Del Sole, Angelo; Lecchi, Michela
2017-10-17
To evaluate the impact of non-specific normal databases on the percent summed rest score (SR%) and stress score (SS%) from simulated low-dose SPECT studies by shortening the acquisition time/projection. Forty normal-weight and 40 overweight/obese patients underwent myocardial studies with a conventional gamma-camera (BrightView, Philips) using three different acquisition times/projection: 30, 15, and 8 s (100%-counts, 50%-counts, and 25%-counts scans, respectively) and reconstructed using the iterative algorithm with resolution recovery (IRR) Astonish™ (Philips). Three sets of normal databases were used: (1) full-counts IRR; (2) half-counts IRR; and (3) full-counts traditional reconstruction algorithm databases (TRAD). The impact of these databases and the acquired count statistics on the SR% and SS% was assessed by ANOVA analysis and the Tukey test (P < 0.05). Significantly higher SR% and SS% values (> 40%) were found for the full-counts TRAD databases with respect to the IRR databases. For overweight/obese patients, significantly higher SS% values for 25%-counts scans (+19%) were confirmed compared to those of 50%-counts scans, independently of using the half-counts or the full-counts IRR databases. Astonish™ requires the adoption of its own specific normal databases in order to prevent large overestimation of both stress and rest perfusion scores. Conversely, the count statistics of the normal databases seem not to influence the quantification scores.
Evaluation and comparison of bioinformatic tools for the enrichment analysis of metabolomics data.
Marco-Ramell, Anna; Palau-Rodriguez, Magali; Alay, Ania; Tulipani, Sara; Urpi-Sarda, Mireia; Sanchez-Pla, Alex; Andres-Lacueva, Cristina
2018-01-02
Bioinformatic tools for the enrichment of 'omics' datasets facilitate the interpretation and understanding of data. To date, few are suitable for metabolomics datasets. The main objective of this work is to give a critical overview, for the first time, of the performance of these tools. To that aim, datasets from metabolomic repositories were selected and enriched data were created. Both types of data were analysed with these tools and the outputs were thoroughly examined. An exploratory multivariate analysis of the most used tools for the enrichment of metabolite sets, based on non-metric multidimensional scaling (NMDS) of Jaccard's distances, was performed and mirrored their diversity. Codes (identifiers) of the metabolites of the datasets were searched in different metabolite databases (HMDB, KEGG, PubChem, ChEBI, BioCyc/HumanCyc, LipidMAPS, ChemSpider, METLIN and Recon2). The databases that presented more identifiers of the metabolites of the dataset were PubChem, followed by METLIN and ChEBI. However, these databases had duplicated entries and might present false positives. The performance of over-representation analysis (ORA) tools, including BioCyc/HumanCyc, ConsensusPathDB, IMPaLA, MBRole, MetaboAnalyst, Metabox, MetExplore, MPEA, PathVisio and Reactome, and of the mapping tool KEGGREST, was examined. Results were mostly consistent among tools and between real and enriched data despite the variability of the tools. Nevertheless, a few controversial results such as differences in the total number of metabolites were also found. Disease-based enrichment analyses were also assessed, but they were not found to be accurate, probably because metabolite disease sets are not up to date and because of the difficulty of predicting diseases from a list of metabolites. We have extensively reviewed the state of the art of the available range of tools for metabolomic datasets, the completeness of metabolite databases, the performance of ORA methods and disease-based analyses. Despite the variability of the tools, they provided consistent results independent of their analytic approach. However, more work on the completeness of metabolite and pathway databases is required, which strongly affects the accuracy of enrichment analyses. Improvements will be translated into more accurate and global insights into the metabolome.
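Most of the ORA tools compared here rest on the same statistic: a hypergeometric (one-sided Fisher) test of whether k hits from a metabolite list over-represent a pathway of K members within a background of N annotated metabolites. A minimal sketch with illustrative numbers:

    # Over-representation analysis p-value via the hypergeometric tail.
    from scipy.stats import hypergeom

    N, K, n, k = 2000, 40, 100, 8   # background, pathway size, list size, hits
    p = hypergeom.sf(k - 1, N, K, n)   # P(X >= k)
    print(f"ORA p-value: {p:.3g}")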
National Transportation Atlas Databases : 1999
DOT National Transportation Integrated Search
1999-01-01
The National Transportation Atlas Databases -- 1999 (NTAD99) is a set of national : geographic databases of transportation facilities. These databases include geospatial : information for transportation modal networks and intermodal terminals, and re...
National Transportation Atlas Databases : 2001
DOT National Transportation Integrated Search
2001-01-01
The National Transportation Atlas Databases-2001 (NTAD-2001) is a set of national geographic databases of transportation facilities. These databases include geospatial information for transportation modal networks and intermodal terminals and related...
National Transportation Atlas Databases : 1996
DOT National Transportation Integrated Search
1996-01-01
The National Transportation Atlas Databases -- 1996 (NTAD96) is a set of national : geographic databases of transportation facilities. These databases include geospatial : information for transportation modal networks and intermodal terminals, and re...
National Transportation Atlas Databases : 2000
DOT National Transportation Integrated Search
2000-01-01
The National Transportation Atlas Databases-2000 (NTAD-2000) is a set of national geographic databases of transportation facilities. These databases include geospatial information for transportation modal networks and intermodal terminals and related...
National Transportation Atlas Databases : 1997
DOT National Transportation Integrated Search
1997-01-01
The National Transportation Atlas Databases -- 1997 (NTAD97) is a set of national : geographic databases of transportation facilities. These databases include geospatial : information for transportation modal networks and intermodal terminals, and re...
Bio-optical data integration based on a 4 D database system approach
NASA Astrophysics Data System (ADS)
Imai, N. N.; Shimabukuro, M. H.; Carmo, A. F. C.; Alcantara, E. H.; Rodrigues, T. W. P.; Watanabe, F. S. Y.
2015-04-01
Bio-optical characterization of water bodies requires spatio-temporal data about inherent optical properties and apparent optical properties, which allow the comprehension of the underwater light field and support the development of models for monitoring water quality. Measurements are taken to represent optical properties along a column of water, and the spectral data must then be related to depth. However, the spatial positions of measurements may differ since collecting instruments vary. In addition, the records may not refer to the same wavelengths. An additional difficulty is that distinct instruments store data in different formats. A data integration approach is needed to make these large, multi-source data sets suitable for analysis. It then becomes possible to evaluate semi-empirical models, even automatically, preceded by preliminary quality-control tasks. In this work we present a solution for this scenario based on a spatial (geographic) database approach, with the adoption of an object-relational Database Management System (DBMS), owing to its ability to represent all data collected in the field together with data obtained by laboratory analysis and Remote Sensing images taken at the time of field data collection. This data integration approach leads to a 4D representation, since its coordinate system includes 3D spatial coordinates (planimetric and depth) and the time when each datum was taken. The PostgreSQL DBMS, extended with the PostGIS module to provide the ability to manage spatial/geospatial data, was adopted. A prototype was developed that offers the main tools an analyst needs to prepare the data sets for analysis.
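A minimal sketch of the 4D table design described above, assuming a PostgreSQL server with the PostGIS extension already installed; the connection parameters, table and column names are illustrative, not those of the prototype.

    # 3-D position (lon, lat, depth as Z) plus acquisition time = 4-D record.
    import psycopg2

    con = psycopg2.connect("dbname=biooptics user=analyst")
    cur = con.cursor()
    cur.execute("""
    CREATE TABLE IF NOT EXISTS radiometry (
        id            SERIAL PRIMARY KEY,
        pos           GEOMETRY(PointZ, 4326),   -- requires PostGIS
        taken_at      TIMESTAMPTZ,
        wavelength_nm REAL,
        value         REAL
    );
    """)
    # ST_MakePoint / ST_SetSRID are standard PostGIS constructors.
    cur.execute(
        "INSERT INTO radiometry (pos, taken_at, wavelength_nm, value) "
        "VALUES (ST_SetSRID(ST_MakePoint(%s, %s, %s), 4326), %s, %s, %s)",
        (-51.4, -22.5, 3.0, "2014-08-11 14:02:00+00", 550.0, 0.031))
    con.commit()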
Fatigue Crack Growth Database for Damage Tolerance Analysis
NASA Technical Reports Server (NTRS)
Forman, R. G.; Shivakumar, V.; Cardinal, J. W.; Williams, L. C.; McKeighan, P. C.
2005-01-01
The objective of this project was to begin the process of developing a fatigue crack growth database (FCGD) of metallic materials for use in damage tolerance analysis of aircraft structure. For this initial effort, crack growth rate data in the NASGRO (Registered trademark) database, the United States Air Force Damage Tolerant Design Handbook, and other publicly available sources were examined and used to develop a database that characterizes crack growth behavior for specific applications (materials). The focus of this effort was on materials for general commercial aircraft applications, including large transport airplanes, small transport commuter airplanes, general aviation airplanes, and rotorcraft. The end products of this project are the FCGD software and this report. The specific goal of this effort was to present fatigue crack growth data in three usable formats: (1) NASGRO equation parameters, (2) Walker equation parameters, and (3) tabular data points. The development of this FCGD will begin the process of developing a consistent set of standard fatigue crack growth material properties. It is envisioned that the end product of the process will be a general repository for credible and well-documented fracture properties that may be used as a default standard in damage tolerance analyses.
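One common form of the Walker relation mentioned above folds the stress ratio R into an effective stress-intensity range, da/dN = C*(dK/(1-R)**(1-gamma))**n. The sketch below uses placeholder coefficients, not values from the FCGD.

    # Walker-equation crack growth rate with placeholder coefficients.
    def walker_growth_rate(delta_K, R, C=1e-10, n=3.0, gamma=0.5):
        """da/dN for stress-intensity range delta_K at stress ratio R."""
        dK_eff = delta_K / (1.0 - R) ** (1.0 - gamma)
        return C * dK_eff ** n

    for R in (0.0, 0.1, 0.5):
        print(R, walker_growth_rate(delta_K=10.0, R=R))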
Neighborhood Structural Similarity Mapping for the Classification of Masses in Mammograms.
Rabidas, Rinku; Midya, Abhishek; Chakraborty, Jayasree
2018-05-01
In this paper, two novel feature extraction methods, using neighborhood structural similarity (NSS), are proposed for the characterization of mammographic masses as benign or malignant. Since the gray-level distribution of pixels differs between benign and malignant masses, with more regular and homogeneous patterns visible in benign masses than in malignant ones, the proposed method exploits the similarity between neighboring regions of masses by designing two new features, namely NSS-I and NSS-II, which capture global similarity at different scales. Complementary to these global features, uniform local binary patterns are computed to enhance the classification efficiency by combining them with the proposed features. The performance of the features is evaluated using images from the mini-mammographic image analysis society (mini-MIAS) and digital database for screening mammography (DDSM) databases, where a tenfold cross-validation technique is incorporated with Fisher linear discriminant analysis, after selecting the optimal set of features using a stepwise logistic regression method. The best area under the receiver operating characteristic curve of 0.98 is achieved with the mini-MIAS database, while the same for the DDSM database is 0.93.
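The uniform local binary patterns combined with the NSS features have a standard implementation in scikit-image; the sketch below computes the usual normalized-histogram feature vector on a synthetic patch standing in for a mass region of interest.

    # Uniform LBP histogram as a texture feature vector.
    import numpy as np
    from skimage.feature import local_binary_pattern

    rng = np.random.default_rng(1)
    patch = rng.random((64, 64))     # stand-in for a mass ROI
    P, R = 8, 1                      # 8 neighbors at radius 1
    lbp = local_binary_pattern(patch, P, R, method="uniform")
    # "uniform" yields P + 2 distinct codes; normalize their histogram.
    hist, _ = np.histogram(lbp, bins=np.arange(P + 3), density=True)
    print(hist)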
Nakajima, Kenichi; Matsumoto, Naoya; Kasai, Tokuo; Matsuo, Shinro; Kiso, Keisuke; Okuda, Koichi
2016-04-01
As a 2-year project of the Japanese Society of Nuclear Medicine working group activity, normal myocardial imaging databases were accumulated and summarized. Stress-rest gated and non-gated image sets were accumulated for myocardial perfusion imaging and could be used for perfusion defect scoring and normal left ventricular (LV) function analysis. For single-photon emission computed tomography (SPECT) with a multi-focal collimator design, databases for supine and prone positions and computed tomography (CT)-based attenuation correction were created. The CT-based correction provided similar perfusion patterns between genders. In phase analysis of gated myocardial perfusion SPECT, a new approach for analyzing dyssynchrony, normal ranges of parameters for phase bandwidth, standard deviation and entropy were determined in four software programs. Although the results were not interchangeable, dependency on gender, ejection fraction and volumes was a common characteristic of these parameters. Standardization of (123)I-MIBG sympathetic imaging was performed regarding the heart-to-mediastinum ratio (HMR) using a calibration phantom method. The HMR from any collimator type could be converted to the value for medium-energy comparable collimators. Appropriate quantification based on common normal databases and standard technology could play a pivotal role in clinical practice and research.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Sewell, Christopher Meyer
This is a set of slides from a guest lecture for a class at the University of Texas, El Paso on visualization and data analysis for high-performance computing. The topics covered are the following: trends in high-performance computing; scientific visualization, such as OpenGL, ray tracing and volume rendering, VTK, and ParaView; data science at scale, such as in-situ visualization, image databases, distributed memory parallelism, shared memory parallelism, VTK-m, "big data", and then an analysis example.
Atlantic Ocean CARINA data: overview and salinity adjustments
DOE Office of Scientific and Technical Information (OSTI.GOV)
Tanhua, T.; Steinfeldt, R.; Key, Robert
2010-01-01
Water column data of carbon and carbon-relevant hydrographic and hydrochemical parameters from 188 previously non-publicly available cruise data sets in the Arctic Mediterranean Seas, Atlantic and Southern Ocean have been retrieved and merged into a new database: CARINA (CARbon dioxide IN the Atlantic Ocean). The data have gone through rigorous quality control procedures to assure the highest possible quality and consistency. The data for the pertinent parameters in the CARINA database were objectively examined in order to quantify systematic differences in the reported values, i.e. secondary quality control. Systematic biases found in the data have been corrected in the three data products: merged data files with measured, calculated and interpolated data for each of the three CARINA regions, i.e. the Arctic Mediterranean Seas, the Atlantic and the Southern Ocean. These products have been corrected to be internally consistent. Ninety-eight of the cruises in the CARINA database were conducted in the Atlantic Ocean, defined here as the region south of the Greenland-Iceland-Scotland Ridge and north of about 30° S. Here we present an overview of the Atlantic Ocean synthesis of the CARINA data and the adjustments that were applied to the data product. We also report the details of the secondary QC (Quality Control) for salinity for this data set. Procedures of quality control, including crossover analysis between stations and inversion analysis of all crossover data, are briefly described. Adjustments to salinity measurements were applied to the data from 10 cruises in the Atlantic Ocean region. Based on our analysis we estimate the internal consistency of the CARINA-ATL salinity data to be 4.1 ppm. With these adjustments the CARINA data products are consistent both internally as well as with GLODAP data, an oceanographic data set based on the World Hydrographic Program in the 1990s, and are now suitable for accurate assessments of, for example, oceanic carbon inventories and uptake rates and for model validation.
Postorino, Maurizio; Amato, Claudia; Mancini, Elena; Carioni, Paola; D'Arrigo, Graziella; Di Benedetto, Attilio; Cerino, Fabrizio; Marino, Carmela; Vilasi, Antonio; Tripepi, Giovanni; Stuard, Stefano; Capasso, Giovanbattista; Santoro, Antonio; Zoccali, Carmine
2018-02-01
In 2013, the Italian Society of Nephrology joined forces with Nephrocare-Italy to create a clinical research cohort of patients on file in the data-rich clinical management system (EUCLID) of this organization for the performance of observational studies in the hemodialysis (HD) population. To see whether patients in EUCLID are representative of the HD population in Italy, we set out to compare the whole EUCLID population with patients included in the regional HD registries in Emilia-Romagna (Northern Italy) and in Calabria (Southern Italy), the sole regions in Italy which have systematically collected an enlarged clinical data set allowing comparison with the data-rich EUCLID system. An analysis of prevalent and incident patients in 2010 and 2011 showed that EUCLID patients had a lower prevalence of coronary heart disease, peripheral vascular disease, heart failure, valvular heart disease, liver disease, peptic ulcer and other comorbidities and risk factors and a higher fractional urea clearance (Kt/V) than those in the Emilia Romagna and Calabria registries. Accordingly, survival analysis showed a lower mortality risk in the EUCLID 2010 and 2011 cohorts than in the combined two regional registries in the corresponding years: for 2010, hazard ratio (HR) EUCLID vs. Regional registries: 0.80 [95% confidence interval: 0.71-0.90]; for 2011, HR: 0.76 [0.65-0.90]. However, this difference was nullified by statistical adjustment for the difference in comorbidities and risk factors, indicating that the longer survival in the EUCLID database was attributable to the lower risk profile of patients included in that database. This preliminary analysis sets the stage for future observational studies and indicates that appropriate adjustment for difference in comorbidities and risk factors is needed to generalize to the Italian HD population analyses based on the data-rich EUCLID database.
Atlantic Ocean CARINA data: overview and salinity adjustments
NASA Astrophysics Data System (ADS)
Tanhua, T.; Steinfeldt, R.; Key, R. M.; Brown, P.; Gruber, N.; Wanninkhof, R.; Perez, F.; Körtzinger, A.; Velo, A.; Schuster, U.; van Heuven, S.; Bullister, J. L.; Stendardo, I.; Hoppema, M.; Olsen, A.; Kozyr, A.; Pierrot, D.; Schirnick, C.; Wallace, D. W. R.
2010-02-01
Water column data of carbon and carbon-relevant hydrographic and hydrochemical parameters from 188 previously non-publicly available cruise data sets in the Arctic Mediterranean Seas, Atlantic and Southern Ocean have been retrieved and merged into a new database: CARINA (CARbon dioxide IN the Atlantic Ocean). The data have gone through rigorous quality control procedures to assure the highest possible quality and consistency. The data for the pertinent parameters in the CARINA database were objectively examined in order to quantify systematic differences in the reported values, i.e. secondary quality control. Systematic biases found in the data have been corrected in the three data products: merged data files with measured, calculated and interpolated data for each of the three CARINA regions, i.e. the Arctic Mediterranean Seas, the Atlantic and the Southern Ocean. These products have been corrected to be internally consistent. Ninety-eight of the cruises in the CARINA database were conducted in the Atlantic Ocean, defined here as the region south of the Greenland-Iceland-Scotland Ridge and north of about 30° S. Here we present an overview of the Atlantic Ocean synthesis of the CARINA data and the adjustments that were applied to the data product. We also report the details of the secondary QC (Quality Control) for salinity for this data set. Procedures of quality control - including crossover analysis between stations and inversion analysis of all crossover data - are briefly described. Adjustments to salinity measurements were applied to the data from 10 cruises in the Atlantic Ocean region. Based on our analysis we estimate the internal consistency of the CARINA-ATL salinity data to be 4.1 ppm. With these adjustments the CARINA data products are consistent both internally as well as with GLODAP data, an oceanographic data set based on the World Hydrographic Program in the 1990s, and is now suitable for accurate assessments of, for example, oceanic carbon inventories and uptake rates and for model validation.
PANDORA: keyword-based analysis of protein sets by integration of annotation sources.
Kaplan, Noam; Vaaknin, Avishay; Linial, Michal
2003-10-01
Recent advances in high-throughput methods and the application of computational tools for automatic classification of proteins have made it possible to carry out large-scale proteomic analyses. Biological analysis and interpretation of sets of proteins is a time-consuming undertaking carried out manually by experts. We have developed PANDORA (Protein ANnotation Diagram ORiented Analysis), a web-based tool that provides an automatic representation of the biological knowledge associated with any set of proteins. PANDORA uses a unique approach of keyword-based graphical analysis that focuses on detecting subsets of proteins that share unique biological properties and the intersections of such sets. PANDORA currently supports SwissProt keywords, NCBI Taxonomy, InterPro entries and the hierarchical classification terms from ENZYME, SCOP and GO databases. The integrated study of several annotation sources simultaneously allows a representation of biological relations of structure, function, cellular location, taxonomy, domains and motifs. PANDORA is also integrated into the ProtoNet system, thus allowing the testing of thousands of automatically generated clusters. We illustrate how PANDORA enhances the biological understanding of large, non-uniform sets of proteins originating from experimental and computational sources, without the need for prior biological knowledge on individual proteins.
NONCODE v2.0: decoding the non-coding.
He, Shunmin; Liu, Changning; Skogerbø, Geir; Zhao, Haitao; Wang, Jie; Liu, Tao; Bai, Baoyan; Zhao, Yi; Chen, Runsheng
2008-01-01
The NONCODE database is an integrated knowledge database designed for the analysis of non-coding RNAs (ncRNAs). Since NONCODE was first released 3 years ago, the number of known ncRNAs has grown rapidly, and there is growing recognition that ncRNAs play important regulatory roles in most organisms. In the updated version of NONCODE (NONCODE v2.0), the number of collected ncRNAs has reached 206,226, including a wide range of microRNAs, Piwi-interacting RNAs and mRNA-like ncRNAs. The improvements brought to the database include not only new and updated ncRNA data sets, but also an incorporation of BLAST alignment search service and access through our custom UCSC Genome Browser. NONCODE can be found under http://www.noncode.org or http://noncode.bioinfo.org.cn.
A gene network bioinformatics analysis for pemphigoid autoimmune blistering diseases.
Barone, Antonio; Toti, Paolo; Giuca, Maria Rita; Derchi, Giacomo; Covani, Ugo
2015-07-01
In this theoretical study, a text-mining search and clustering analysis of data related to genes potentially involved in human pemphigoid autoimmune blistering diseases (PAIBD) was performed using web tools to create a gene/protein interaction network. The Search Tool for the Retrieval of Interacting Genes/Proteins (STRING) database was employed to identify a final set of PAIBD-involved genes and to calculate the overall significant interactions among genes: for each gene, the weighted number of links (WNL) was registered, and a clustering procedure was performed using WNL analysis. Genes were ranked in classes (leader, B, C, D and so on, down to orphans). An ontological analysis was performed for the set of 'leader' genes. Using the above-mentioned data network, 115 genes represented the final set; leader genes numbered 7 (intercellular adhesion molecule 1 (ICAM-1), interferon gamma (IFNG), interleukin (IL)-2, IL-4, IL-6, IL-8 and tumour necrosis factor (TNF)), class B genes numbered 13, whereas the orphans numbered 24. The ontological analysis showed that the molecular action was focused on the extracellular space and cell surface, whereas activation and regulation of the immune system were widely involved. Despite the limited knowledge of the present pathologic phenomenon, evidenced by the presence of 24 genes with no direct or indirect protein-protein interactions, the network showed significant pathways gathered in several subgroups: cellular components, molecular functions, biological processes and the pathologic phenomenon obtained from the Kyoto Encyclopaedia of Genes and Genomes (KEGG) database. The molecular basis for PAIBD was summarised and expanded, which will perhaps give researchers promising directions for the identification of new therapeutic targets.
Memory and long-range correlations in chess games
NASA Astrophysics Data System (ADS)
Schaigorodsky, Ana L.; Perotti, Juan I.; Billoni, Orlando V.
2014-01-01
In this paper we report the existence of long-range memory in the opening moves of a chronologically ordered set of chess games using an extensive chess database. We used two mapping rules to build discrete time series and analyzed them using two methods for detecting long-range correlations; rescaled range analysis and detrended fluctuation analysis. We found that long-range memory is related to the level of the players. When the database is filtered according to player levels we found differences in the persistence of the different subsets. For high level players, correlations are stronger at long time scales; whereas in intermediate and low level players they reach the maximum value at shorter time scales. This can be interpreted as a signature of the different strategies used by players with different levels of expertise. These results are robust against the assignation rules and the method employed in the analysis of the time series.
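Rescaled range (R/S) analysis, one of the two estimators used in the paper, fits the slope of log(R/S) against log(window size) to obtain the Hurst exponent; a value near 0.5 indicates no long-range memory. A numpy sketch on synthetic data (not the chess database):

    # Hurst exponent by rescaled range analysis.
    import numpy as np

    def rs_hurst(x, window_sizes=(16, 32, 64, 128, 256)):
        x = np.asarray(x, dtype=float)
        log_n, log_rs = [], []
        for n in window_sizes:
            rs_vals = []
            for start in range(0, len(x) - n + 1, n):
                w = x[start:start + n]
                z = np.cumsum(w - w.mean())   # cumulative deviations
                if w.std() > 0:
                    rs_vals.append((z.max() - z.min()) / w.std())
            log_n.append(np.log(n))
            log_rs.append(np.log(np.mean(rs_vals)))
        return np.polyfit(log_n, log_rs, 1)[0]   # slope = Hurst exponent

    rng = np.random.default_rng(2)
    print(rs_hurst(rng.standard_normal(4096)))   # ~0.5 for white noise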
Nagaraj, Shivashankar H; Gasser, Robin B; Nisbet, Alasdair J; Ranganathan, Shoba
2008-01-01
The analysis of expressed sequence tags (EST) offers a rapid and cost effective approach to elucidate the transcriptome of an organism, but requires several computational methods for assembly and annotation. Researchers frequently analyse each step manually, which is laborious and time consuming. We have recently developed ESTExplorer, a semi-automated computational workflow system, in order to achieve the rapid analysis of EST datasets. In this study, we evaluated EST data analysis for the parasitic nematode Trichostrongylus vitrinus (order Strongylida) using ESTExplorer, compared with database matching alone. We functionally annotated 1776 ESTs obtained via suppressive-subtractive hybridisation from T. vitrinus, an important parasitic trichostrongylid of small ruminants. Cluster and comparative genomic analyses of the transcripts using ESTExplorer indicated that 290 (41%) sequences had homologues in Caenorhabditis elegans, 329 (42%) in parasitic nematodes, 202 (28%) in organisms other than nematodes, and 218 (31%) had no significant match to any sequence in the current databases. Of the C. elegans homologues, 90 were associated with 'non-wildtype' double-stranded RNA interference (RNAi) phenotypes, including embryonic lethality, maternal sterility, sterile progeny, larval arrest and slow growth. We could functionally classify 267 (38%) sequences using the Gene Ontologies (GO) and establish pathway associations for 230 (33%) sequences using the Kyoto Encyclopedia of Genes and Genomes (KEGG). Further examination of this EST dataset revealed a number of signalling molecules, proteases, protease inhibitors, enzymes, ion channels and immune-related genes. In addition, we identified 40 putative secreted proteins that could represent potential candidates for developing novel anthelmintics or vaccines. We further compared the automated EST sequence annotations, using ESTExplorer, with database search results for individual T. vitrinus ESTs. ESTExplorer reliably and rapidly annotated 301 ESTs, with pathway and GO information, eliminating 60 low quality hits from database searches. We evaluated the efficacy of ESTExplorer in analysing EST data, and demonstrate that computational tools can be used to accelerate the process of gene discovery in EST sequencing projects. The present study has elucidated sets of relatively conserved and potentially novel genes for biological investigation, and the annotated EST set provides further insight into the molecular biology of T. vitrinus, towards the identification of novel drug targets.
Park, Hyun-Seok
2012-12-01
Whereas a vast amount of new information on bioinformatics is made available to the public through patents, only a small set of patents are cited in academic papers. A detailed analysis of registered bioinformatics patents, using the existing patent search system, can provide valuable information links between science and technology. However, it is extremely difficult to select keywords to capture bioinformatics patents, reflecting the convergence of several underlying technologies. No single word or even several words are sufficient to identify such patents. The analysis of patent subclasses can provide valuable information. In this paper, I did a preliminary study of the current status of bioinformatics patents and their International Patent Classification (IPC) groups registered in the Korea Intellectual Property Rights Information Service (KIPRIS) database.
Wang, Wei; Xia, Minxuan; Chen, Jie; Deng, Fenni; Yuan, Rui; Zhang, Xiaopei; Shen, Fafu
2016-12-01
The data presented in this paper support the research article "Genome-Wide Analysis of Superoxide Dismutase Gene Family in Gossypium raimondii and G. arboreum" [1]. In this data article, we present a phylogenetic tree showing a dichotomy with two different clusters of SODs, inferred by the Bayesian method of MrBayes (version 3.2.4), "Bayesian phylogenetic inference under mixed models" [2]; Ramachandran plots of G. raimondii and G. arboreum SODs; the protein sequences used to generate 3D structures of the proteins and the template accessions via the SWISS-MODEL server, "SWISS-MODEL: modelling protein tertiary and quaternary structure using evolutionary information" [3]; and motif sequences of SODs identified by InterProScan (version 4.8) with the Pfam database, "Pfam: the protein families database" [4].
oPOSSUM: integrated tools for analysis of regulatory motif over-representation
Ho Sui, Shannan J.; Fulton, Debra L.; Arenillas, David J.; Kwon, Andrew T.; Wasserman, Wyeth W.
2007-01-01
The identification of over-represented transcription factor binding sites from sets of co-expressed genes provides insights into the mechanisms of regulation for diverse biological contexts. oPOSSUM, an internet-based system for such studies of regulation, has been improved and expanded in this new release. New features include a worm-specific version for investigating binding sites conserved between Caenorhabditis elegans and C. briggsae, as well as a yeast-specific version for the analysis of co-expressed sets of Saccharomyces cerevisiae genes. The human and mouse applications feature improvements in ortholog mapping, sequence alignments and the delineation of multiple alternative promoters. oPOSSUM2, introduced for the analysis of over-represented combinations of motifs in human and mouse genes, has been integrated with the original oPOSSUM system. Analysis using user-defined background gene sets is now supported. The transcription factor binding site models have been updated to include new profiles from the JASPAR database. oPOSSUM is available at http://www.cisreg.ca/oPOSSUM/ PMID:17576675
Upper bound of abutment scour in laboratory and field data
Benedict, Stephen
2016-01-01
The U.S. Geological Survey, in cooperation with the South Carolina Department of Transportation, conducted a field investigation of abutment scour in South Carolina and used those data to develop envelope curves that define the upper bound of abutment scour. To expand on this previous work, an additional cooperative investigation was initiated to combine the South Carolina data with abutment scour data from other sources and evaluate upper bound patterns with this larger data set. To facilitate this analysis, 446 laboratory and 331 field measurements of abutment scour were compiled into a digital database. This extensive database was used to evaluate the South Carolina abutment scour envelope curves and to develop additional envelope curves that reflected the upper bound of abutment scour depth for the laboratory and field data. The envelope curves provide simple but useful supplementary tools for assessing the potential maximum abutment scour depth in the field setting.
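An upper-bound envelope of the kind described can be traced by binning the explanatory variable and taking the maximum observed scour in each bin. The synthetic data below merely stand in for the 777 compiled measurements, and the variable names are illustrative.

    # Trace an upper-bound envelope by per-bin maxima.
    import numpy as np

    rng = np.random.default_rng(3)
    flow = rng.uniform(0.5, 5.0, 777)            # e.g., flow depth (m)
    scour = flow * rng.uniform(0.2, 1.5, 777)    # observed scour depth (m)

    bins = np.linspace(flow.min(), flow.max(), 11)
    idx = np.digitize(flow, bins)
    envelope = [(bins[i - 1], scour[idx == i].max()) for i in np.unique(idx)]
    print(envelope)   # (bin lower edge, max scour) points tracing the bound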
Prediction of pelvic organ prolapse using an artificial neural network.
Robinson, Christopher J; Swift, Steven; Johnson, Donna D; Almeida, Jonas S
2008-08-01
The objective of this investigation was to test the ability of a feedforward artificial neural network (ANN) to differentiate patients who have pelvic organ prolapse (POP) from those who retain good pelvic organ support. Following institutional review board approval, patients with POP (n = 87) and controls with good pelvic organ support (n = 368) were identified from the urogynecology research database. Historical and clinical information was extracted from the database. Data analysis included the training of a feedforward ANN, variable selection, and external validation of the model with an independent data set. Twenty variables were used. The median-performing ANN model used a median of 3 variables (quartile 1: 3; quartile 3: 5) and achieved an area under the receiver operator curve of 0.90 (external, independent validation set). Ninety percent sensitivity and 83% specificity were obtained in the external validation by ANN classification. Feedforward ANN modeling is applicable to the identification and prediction of POP.
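The overall modeling recipe (feedforward network, held-out validation, ROC analysis) can be sketched with scikit-learn standing in for the authors' own ANN code; the synthetic 20-variable data set is illustrative.

    # Feedforward ANN with external validation and ROC analysis.
    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPClassifier
    from sklearn.metrics import roc_auc_score

    rng = np.random.default_rng(4)
    X = rng.standard_normal((455, 20))   # 455 patients, 20 variables
    y = (X[:, 0] + 0.5 * X[:, 1] + rng.standard_normal(455) > 1).astype(int)

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                              random_state=0)
    clf = MLPClassifier(hidden_layer_sizes=(5,), max_iter=2000,
                        random_state=0).fit(X_tr, y_tr)
    print("AUC:", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))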
Inferring gene and protein interactions using PubMed citations and consensus Bayesian networks
Dalman, Mark; Haddad, Joseph; Duan, Zhong-Hui
2017-01-01
The PubMed database offers an extensive set of publication data that can be useful, yet inherently complex to use without automated computational techniques. Data repositories such as the Genomic Data Commons (GDC) and the Gene Expression Omnibus (GEO) offer experimental data storage and retrieval as well as curated gene expression profiles. Genetic interaction databases, including Reactome and Ingenuity Pathway Analysis, offer pathway and experiment data analysis using data curated from these publications and data repositories. We have created a method to generate and analyze consensus networks, inferring potential gene interactions, using large numbers of Bayesian networks generated by data mining publications in the PubMed database. Through the concept of network resolution, these consensus networks can be tailored to represent possible genetic interactions. We designed a set of experiments to confirm that our method is stable across variation in both sample and topological input sizes. Using gene product interactions from the KEGG pathway database and data mining PubMed publication abstracts, we verify that regardless of the network resolution or the inferred consensus network, our method is capable of inferring meaningful gene interactions through consensus Bayesian network generation with multiple, randomized topological orderings. Our method can not only confirm the existence of currently accepted interactions, but has the potential to hypothesize new ones as well. We show our method confirms the existence of known gene interactions such as JAK-STAT-PI3K-AKT-mTOR, infers novel gene interactions such as RAS-Bcl-2 and RAS-AKT, and found significant pathway-pathway interactions between the JAK-STAT signaling and Cardiac Muscle Contraction KEGG pathways. PMID:29049295
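At its simplest, the consensus idea reduces to keeping edges whose frequency across the generated networks clears a resolution threshold. This toy sketch (not the authors' pipeline) borrows gene names from the abstract:

    # Consensus edges by frequency across learned networks.
    from collections import Counter

    networks = [
        {("RAS", "AKT"), ("JAK", "STAT")},
        {("RAS", "AKT"), ("RAS", "Bcl-2")},
        {("JAK", "STAT"), ("RAS", "AKT")},
    ]
    counts = Counter(e for net in networks for e in net)
    resolution = 2 / 3   # keep edges present in at least 2 of 3 networks
    consensus = {e for e, c in counts.items()
                 if c / len(networks) >= resolution}
    print(consensus)     # {('RAS', 'AKT'), ('JAK', 'STAT')}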
The Function Biomedical Informatics Research Network Data Repository.
Keator, David B; van Erp, Theo G M; Turner, Jessica A; Glover, Gary H; Mueller, Bryon A; Liu, Thomas T; Voyvodic, James T; Rasmussen, Jerod; Calhoun, Vince D; Lee, Hyo Jong; Toga, Arthur W; McEwen, Sarah; Ford, Judith M; Mathalon, Daniel H; Diaz, Michele; O'Leary, Daniel S; Jeremy Bockholt, H; Gadde, Syam; Preda, Adrian; Wible, Cynthia G; Stern, Hal S; Belger, Aysenil; McCarthy, Gregory; Ozyurt, Burak; Potkin, Steven G
2016-01-01
The Function Biomedical Informatics Research Network (FBIRN) developed methods and tools for conducting multi-scanner functional magnetic resonance imaging (fMRI) studies. Method and tool development were based on two major goals: 1) to assess the major sources of variation in fMRI studies conducted across scanners, including instrumentation, acquisition protocols, challenge tasks, and analysis methods, and 2) to provide a distributed network infrastructure and an associated federated database to host and query large, multi-site, fMRI and clinical data sets. In the process of achieving these goals the FBIRN test bed generated several multi-scanner brain imaging data sets to be shared with the wider scientific community via the BIRN Data Repository (BDR). The FBIRN Phase 1 data set consists of a traveling subject study of 5 healthy subjects, each scanned on 10 different 1.5 to 4 T scanners. The FBIRN Phase 2 and Phase 3 data sets consist of subjects with schizophrenia or schizoaffective disorder along with healthy comparison subjects scanned at multiple sites. In this paper, we provide concise descriptions of FBIRN's multi-scanner brain imaging data sets and details about the BIRN Data Repository instance of the Human Imaging Database (HID) used to publicly share the data. Copyright © 2015 Elsevier Inc. All rights reserved.
Are there two processes in reasoning? The dimensionality of inductive and deductive inferences.
Stephens, Rachel G; Dunn, John C; Hayes, Brett K
2018-03-01
Single-process accounts of reasoning propose that the same cognitive mechanisms underlie inductive and deductive inferences. In contrast, dual-process accounts propose that these inferences depend upon 2 qualitatively different mechanisms. To distinguish between these accounts, we derived a set of single-process and dual-process models based on an overarching signal detection framework. We then used signed difference analysis to test each model against data from an argument evaluation task, in which induction and deduction judgments are elicited for sets of valid and invalid arguments. Three data sets were analyzed: data from Singmann and Klauer (2011), a database of argument evaluation studies, and the results of an experiment designed to test model predictions. Of the large set of testable models, we found that almost all could be rejected, including all 2-dimensional models. The only testable model able to account for all 3 data sets was a model with 1 dimension of argument strength and independent decision criteria for induction and deduction judgments. We conclude that despite the popularity of dual-process accounts, current results from the argument evaluation task are best explained by a single-process account that incorporates separate decision thresholds for inductive and deductive inferences. (PsycINFO Database Record (c) 2018 APA, all rights reserved).
Neural networks for computer-aided diagnosis: detection of lung nodules in chest radiograms.
Coppini, Giuseppe; Diciotti, Stefano; Falchini, Massimo; Villari, Natale; Valli, Guido
2003-12-01
The paper describes a neural-network-based system for the computer-aided detection of lung nodules in chest radiograms. Our approach is based on multiscale processing and artificial neural networks (ANNs). The problem of nodule detection is faced by using a two-stage architecture including: 1) an attention-focusing subsystem that processes whole radiographs to locate possible nodular regions while ensuring high sensitivity; 2) a validation subsystem that processes regions of interest to evaluate the likelihood of the presence of a nodule, so as to reduce false alarms and increase detection specificity. Biologically inspired filters (both LoG and Gabor kernels) are used to enhance salient image features. ANNs of the feedforward type are employed, which allow an efficient use of a priori knowledge about the shape of nodules and the background structure. The images from the public JSRT database, including 247 radiograms, were used to build and test the system. We performed a further test by using a second, private database with 65 radiograms collected and annotated at the Radiology Department of the University of Florence. Both data sets include nodule and nonnodule radiographs. The use of a public data set along with independent testing with a different image set makes the comparison with other systems easier and allows a deeper understanding of system behavior. Experimental results are described by ROC/FROC analysis. For the JSRT database, we observed that by varying sensitivity from 60 to 75% the number of false alarms per image lies in the range 4-10, while accuracy is in the range 95.7-98.0%. When the second data set was used, comparable results were obtained. The observed system performances support the undertaking of system validation in clinical settings.
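A minimal sketch of the attention-focusing stage, assuming a multiscale Laplacian-of-Gaussian filter (via SciPy) and a simple percentile threshold in place of the paper's trained subsystem:

```python
import numpy as np
from scipy.ndimage import gaussian_laplace, label

def candidate_regions(image, sigmas=(2, 4, 8), percentile=99.5):
    """Enhance blob-like structures at several scales with a scale-normalized
    LoG filter, then threshold the strongest responses to flag candidates."""
    response = np.max([-gaussian_laplace(image, s) * s**2 for s in sigmas], axis=0)
    mask = response > np.percentile(response, percentile)
    labeled, n = label(mask)          # connected components = candidate regions
    return labeled, n

img = np.random.rand(256, 256)        # stand-in for a chest radiograph
regions, n_found = candidate_regions(img)
print(n_found, "candidate regions")
```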
BIGNASim: a NoSQL database structure and analysis portal for nucleic acids simulation data.
Hospital, Adam; Andrio, Pau; Cugnasco, Cesare; Codo, Laia; Becerra, Yolanda; Dans, Pablo D; Battistini, Federica; Torres, Jordi; Goñi, Ramón; Orozco, Modesto; Gelpí, Josep Ll
2016-01-04
Molecular dynamics simulation (MD) is, just behind genomics, the bioinformatics tool that generates the largest amounts of data and that uses the largest amount of CPU time in supercomputing centres. MD trajectories are obtained after months of calculations, analysed in situ, and in practice forgotten. Several projects to generate stable trajectory databases have been developed for proteins, but no equivalent exists in the nucleic acids world. We present here a novel database system to store MD trajectories and analyses of nucleic acids. The initial data set available consists mainly of the benchmark of the new molecular dynamics force-field, parmBSC1. It contains 156 simulations, with over 120 μs of total simulation time. A deposition protocol is available to accept the submission of new trajectory data. The database is based on the combination of two NoSQL engines, Cassandra for storing trajectories and MongoDB to store analysis results and simulation metadata. The analyses available include backbone geometries, helical analysis, NMR observables and a variety of mechanical analyses. Individual trajectories and combined meta-trajectories can be downloaded from the portal. The system is accessible through http://mmb.irbbarcelona.org/BIGNASim/. Supplementary material is also available online at http://mmb.irbbarcelona.org/BIGNASim/SuppMaterial/. © The Author(s) 2015. Published by Oxford University Press on behalf of Nucleic Acids Research.
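The MongoDB half of such a design can be sketched with pymongo; the database, collection, and field names below are illustrative inventions, not BIGNASim's actual schema, and the Cassandra trajectory store is omitted.

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")   # hypothetical local instance
db = client["bignasim_demo"]                        # names are illustrative only

# Store per-simulation metadata together with an analysis result
db.simulations.insert_one({
    "sim_id": "parmBSC1_benchmark_001",
    "force_field": "parmBSC1",
    "length_us": 1.2,
    "analyses": {"helical_twist_deg": 34.1},
})

# Query: all simulations longer than one microsecond
for sim in db.simulations.find({"length_us": {"$gt": 1.0}}):
    print(sim["sim_id"], sim["analyses"]["helical_twist_deg"])
```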
Building an integrated neurodegenerative disease database at an academic health center.
Xie, Sharon X; Baek, Young; Grossman, Murray; Arnold, Steven E; Karlawish, Jason; Siderowf, Andrew; Hurtig, Howard; Elman, Lauren; McCluskey, Leo; Van Deerlin, Vivianna; Lee, Virginia M-Y; Trojanowski, John Q
2011-07-01
It is becoming increasingly important to study common and distinct etiologies, clinical and pathological features, and mechanisms related to neurodegenerative diseases such as Alzheimer's disease, Parkinson's disease, amyotrophic lateral sclerosis, and frontotemporal lobar degeneration. These comparative studies rely on powerful database tools to quickly generate data sets that match the diverse and complementary criteria set by investigators. In this article, we present a novel integrated neurodegenerative disease (INDD) database, which was developed at the University of Pennsylvania (Penn) with the help of a consortium of Penn investigators. Because the work of these investigators is based on Alzheimer's disease, Parkinson's disease, amyotrophic lateral sclerosis, and frontotemporal lobar degeneration, we were able to develop an INDD database for these major neurodegenerative disorders. We used Microsoft SQL Server as a platform, with built-in backward compatibility that allows Microsoft Access to serve as a front-end client to interface with the database. We used PHP to create the web front end and a master lookup table to integrate the individual neurodegenerative disease databases. We also present methods of data entry, database security, database backups, and database audit trails for this INDD database. Using the INDD database, we compared the results of a biomarker study with those of an alternative approach that queried individual databases separately. We have demonstrated that the Penn INDD database has the ability to query multiple database tables from a single console with high accuracy and reliability. The INDD database provides a powerful tool for generating data sets in comparative studies on several neurodegenerative diseases. Copyright © 2011 The Alzheimer's Association. Published by Elsevier Inc. All rights reserved.
FIA Estimation in the New Millennium
Francis A. Roesch
2001-01-01
In the new millennium, Forest Inventory and Analysis (FIA) will deliver most of its database information directly to the users over the Internet. This assumption indicates the need for a GIS-based estimation system to support the information delivery system. Presumably, as the data set evolves, it will free FIA and the users from exclusive estimation within political...
Data-Based Decision Making: The Impact of Data Variability, Training, and Context
ERIC Educational Resources Information Center
Vanselow, Nicholas R.; Thompson, Rachel; Karsina, Allen
2011-01-01
The current study examines agreement among individuals with varying expertise in behavior analysis about the length of baseline when data were presented point by point. Participants were asked to respond to baseline data and to indicate when to terminate the baseline phase. When only minimal information was provided about the data set, experts and…
A geo-spatial data management system for potentially active volcanoes—GEOWARN project
NASA Astrophysics Data System (ADS)
Gogu, Radu C.; Dietrich, Volker J.; Jenny, Bernhard; Schwandner, Florian M.; Hurni, Lorenz
2006-02-01
Integrated studies of active volcanic systems for the purpose of long-term monitoring and forecast and short-term eruption prediction require large numbers of data-sets from various disciplines. A modern database concept has been developed for managing and analyzing multi-disciplinary volcanological data-sets. The GEOWARN project (choosing the "Kos-Yali-Nisyros-Tilos volcanic field, Greece" and the "Campi Flegrei, Italy" as test sites) is oriented toward potentially active volcanoes situated in regions of high geodynamic unrest. This article describes the volcanological database of the spatial and temporal data acquired within the GEOWARN project. As a first step, a spatial database embedded in a Geographic Information System (GIS) environment was created. Digital data of different spatial resolution, and time-series data collected at different intervals or periods, were unified in a common, four-dimensional representation of space and time. The database scheme comprises various information layers containing geographic data (e.g. seafloor and land digital elevation model, satellite imagery, anthropogenic structures, land-use), geophysical data (e.g. from active and passive seismicity, gravity, tomography, SAR interferometry, thermal imagery, differential GPS), geological data (e.g. lithology, structural geology, oceanography), and geochemical data (e.g. from hydrothermal fluid chemistry and diffuse degassing features). As a second step based on the presented database, spatial data analysis has been performed using custom-programmed interfaces that execute query scripts resulting in a graphical visualization of data. These query tools were designed and compiled following scenarios of known "behavior" patterns of dormant volcanoes and first candidate signs of potential unrest. The spatial database and query approach is intended to facilitate scientific research on volcanic processes and phenomena, and volcanic surveillance.
A Quality-Control-Oriented Database for a Mesoscale Meteorological Observation Network
NASA Astrophysics Data System (ADS)
Lussana, C.; Ranci, M.; Uboldi, F.
2012-04-01
In the operational context of a local weather service, data accessibility and quality-related issues must be managed by taking into account a wide set of user needs. This work describes the structure and the operational choices made for the operational implementation of a database system storing data from highly automated observing stations, metadata and information on data quality. Lombardy's environmental protection agency, ARPA Lombardia, manages a highly automated mesoscale meteorological network. A Quality Assurance System (QAS) ensures that reliable observational information is collected and disseminated to the users. The weather unit in ARPA Lombardia, at the same time an important QAS component and an intensive data user, has developed a database specifically aimed at: 1) providing quick access to data for operational activities and 2) ensuring data quality for real-time applications, by means of an Automatic Data Quality Control (ADQC) procedure. Quantities stored in the archive include hourly aggregated observations of precipitation amount, temperature, wind, relative humidity, pressure, and global and net solar radiation. The ADQC performs several independent tests on raw data and compares their results in a decision-making procedure. An important ADQC component is the Spatial Consistency Test based on Optimal Interpolation. Interpolated and cross-validation analysis values are also stored in the database, providing further information to human operators and useful estimates in case of missing data. The technical solution adopted is based on a LAMP (Linux, Apache, MySQL and PHP) system, constituting an open-source environment suitable for both development and operational practice. The ADQC procedure itself is performed by R scripts directly interacting with the MySQL database. Users and network managers can access the database by using a set of web-based PHP applications.
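A simplified stand-in for the Spatial Consistency Test: leave each station out, predict it from its neighbours (here by inverse-distance weighting rather than full Optimal Interpolation), and flag large residuals as suspect. The threshold, coordinates, and observations below are synthetic.

```python
import numpy as np

def spatial_consistency_flags(xy, obs, tol=3.0):
    """Leave-one-out check: predict each station from the others by
    inverse-distance weighting and flag large residuals. A simplified
    stand-in for the Optimal Interpolation based test."""
    flags = []
    for i in range(len(obs)):
        d = np.linalg.norm(xy - xy[i], axis=1)
        w = np.where(d > 0, 1.0 / np.maximum(d, 1e-6) ** 2, 0.0)  # zero self-weight
        cv = np.sum(w * obs) / np.sum(w)        # cross-validation analysis value
        flags.append(abs(obs[i] - cv) > tol)
    return np.array(flags)

xy = np.random.rand(30, 2) * 100     # station coordinates (km), synthetic
temp = 15 + np.random.randn(30)      # hourly temperature (deg C), synthetic
temp[7] += 12                        # inject a gross error at station 7
print(np.where(spatial_consistency_flags(xy, temp))[0])
```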
The MIND PALACE: A Multi-Spectral Imaging and Spectroscopy Database for Planetary Science
NASA Astrophysics Data System (ADS)
Eshelman, E.; Doloboff, I.; Hara, E. K.; Uckert, K.; Sapers, H. M.; Abbey, W.; Beegle, L. W.; Bhartia, R.
2017-12-01
The Multi-Instrument Database (MIND) is the web-based home to a well-characterized set of analytical data collected by a suite of deep-UV fluorescence/Raman instruments built at the Jet Propulsion Laboratory (JPL). Samples derive from a growing body of planetary surface analogs, mineral and microbial standards, meteorites, spacecraft materials, and other astrobiologically relevant materials. In addition to deep-UV spectroscopy, datasets stored in MIND are obtained from a variety of analytical techniques spanning multiple spatial and spectral scales, including electron microscopy, optical microscopy, infrared spectroscopy, X-ray fluorescence, and direct fluorescence imaging. Multivariate statistical analysis techniques, primarily Principal Component Analysis (PCA), are used to guide interpretation of these large multi-analytical spectral datasets. Spatial co-referencing of integrated spectral/visual maps is performed using QGIS (geographic information system software). Georeferencing techniques transform individual instrument data maps into a layered, co-registered data cube for analysis across spectral and spatial scales. The body of data in MIND is intended to serve as a permanent, reliable, and expanding database of deep-UV spectroscopy datasets generated by this unique suite of JPL-based instruments on samples of broad planetary science interest.
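PCA-guided interpretation of such spectral libraries reduces to a few lines with scikit-learn; the random matrix below merely stands in for a set of deep-UV spectra.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in: 50 spectra sampled at 500 wavelength channels
spectra = np.random.rand(50, 500)

pca = PCA(n_components=3)
scores = pca.fit_transform(spectra)   # per-sample coordinates in PC space
print("explained variance ratios:", pca.explained_variance_ratio_)
```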
Sanz-Valero, J; Gil, Á; Wanden-Berghe, C; Martínez de Victoria, E
2012-11-01
To evaluate, by bibliometric and thematic analysis, the scientific literature on omega-3 fatty acids indexed in international health sciences databases, and to establish a comparative baseline for future analyses. Searches were conducted with the descriptor (MeSH, as Major Topic) "Fatty Acids, Omega-3" from the first date available until December 31, 2010. Databases consulted: MEDLINE (via PubMed), EMBASE, ISI Web of Knowledge, CINAHL and LILACS. The most common type of document was original articles. Obsolescence was set at 5 years. The most frequent country of first authors was the USA, and the articles were written predominantly in English. The study population was 90.98% (95% CI 89.25 to 92.71) adult humans. The documents were classified into 59 subject areas; the most studied topic associated with omega-3, at 16.24% (95% CI 14.4 to 18.04), was cardiovascular disease. This study indicates that the scientific literature on omega-3 fatty acids is a fully active area of knowledge. Anglo-Saxon institutions dominate the scientific production, which is mainly oriented to the study of cardiovascular disease.
Yu, Kebing; Salomon, Arthur R
2009-12-01
Recently, dramatic progress has been achieved in expanding the sensitivity, resolution, mass accuracy, and scan rate of mass spectrometers able to fragment and identify peptides through MS/MS. Unfortunately, this enhanced ability to acquire proteomic data has not been accompanied by a concomitant increase in the availability of flexible tools allowing users to rapidly assimilate, explore, and analyze this data and adapt to various experimental workflows with minimal user intervention. Here we fill this critical gap by providing a flexible relational database called PeptideDepot for organization of expansive proteomic data sets, collation of proteomic data with available protein information resources, and visual comparison of multiple quantitative proteomic experiments. Our software design, built upon the synergistic combination of a MySQL database for safe warehousing of proteomic data with a FileMaker-driven graphical user interface for flexible adaptation to diverse workflows, enables proteomic end-users to directly tailor the presentation of proteomic data to the unique analysis requirements of the individual proteomics lab. PeptideDepot may be deployed as an independent software tool or integrated directly with our high throughput autonomous proteomic pipeline used in the automated acquisition and post-acquisition analysis of proteomic data.
1990-09-01
conflicts. The current prototyping tool also provides a multiversion data object control mechanism. From a series of experiments, we found that the performance of a multiversion distributed database system is quite sensitive to the size of read-sets and write-sets of transactions.
Structured Forms Reference Set of Binary Images II (SFRS2)
National Institute of Standards and Technology Data Gateway
NIST Structured Forms Reference Set of Binary Images II (SFRS2) (Web, free access) The second NIST database of structured forms (Special Database 6) consists of 5,595 pages of binary, black-and-white images of synthesized documents containing hand-print. The documents in this database are 12 different tax forms from the IRS 1040 Package X for the year 1988.
Ortholog Identification and Comparative Analysis of Microbial Genomes Using MBGD and RECOG.
Uchiyama, Ikuo
2017-01-01
Comparative genomics is becoming an essential approach for identifying genes associated with a specific function or phenotype. Here, we introduce the microbial genome database for comparative analysis (MBGD), a comprehensive ortholog database covering the microbial genomes available to date. MBGD contains several precomputed ortholog tables, including the standard ortholog table covering the entire taxonomic range and taxon-specific ortholog tables for various major taxa. In addition, MBGD allows users to create an ortholog table within any specified set of genomes through dynamic calculations. In particular, MBGD has a "My MBGD" mode where users can upload their original genome sequences and incorporate them into the orthology analysis. The created ortholog table can serve as the basis for various comparative analyses. Here, we describe the use of MBGD and briefly explain how to utilize the orthology information during comparative genome analysis in combination with the stand-alone comparative genomics software RECOG, focusing on the comparison of closely related microbial genomes.
Goodacre, Norman; Aljanahi, Aisha; Nandakumar, Subhiksha; Mikailov, Mike; Khan, Arifa S
2018-01-01
Detection of distantly related viruses by high-throughput sequencing (HTS) is bioinformatically challenging because of the lack of a public database containing all viral sequences, without abundant nonviral sequences, which can extend runtime and obscure viral hits. Our reference viral database (RVDB) includes all viral, virus-related, and virus-like nucleotide sequences (excluding bacterial viruses), regardless of length, and with overall reduced cellular sequences. Semantic selection criteria (SEM-I) were used to select viral sequences from GenBank, resulting in a first-generation viral database (VDB). This database was manually and computationally reviewed, resulting in refined semantic selection criteria (SEM-R), which were applied to a new download of updated GenBank sequences to create a second-generation VDB. Viral entries in the latter were clustered at 98% identity by CD-HIT-EST to reduce redundancy while retaining high viral sequence diversity. The viral identity of the clustered representative sequences (creps) was confirmed by BLAST searches in NCBI databases and HMMER searches in PFAM and DFAM databases. The resulting RVDB contained a broad representation of viral families, sequence diversity, and a reduced cellular content; it includes full-length and partial sequences and endogenous nonretroviral elements, endogenous retroviruses, and retrotransposons. Testing of RVDBv10.2 with an in-house HTS transcriptomic data set indicated a significantly faster run for virus detection than interrogating the entirety of the NCBI nonredundant nucleotide database, which contains all viral sequences but also nonviral sequences. RVDB is publicly available to facilitate HTS analysis, particularly for novel virus detection. It is meant to be updated on a regular basis to include new viral sequences added to GenBank. IMPORTANCE To facilitate bioinformatics analysis of high-throughput sequencing (HTS) data for the detection of both known and novel viruses, we have developed a new reference viral database (RVDB) that provides a broad representation of different virus species from eukaryotes by including all viral, virus-like, and virus-related sequences (excluding bacteriophages), regardless of their size. In particular, RVDB contains endogenous nonretroviral elements, endogenous retroviruses, and retrotransposons. Sequences were clustered to reduce redundancy while retaining high viral sequence diversity. A particularly useful feature of RVDB is the reduction of cellular sequences, which can enhance the run efficiency of large transcriptomic and genomic data analysis and increase the specificity of virus detection.
Identifying functionally informative evolutionary sequence profiles.
Gil, Nelson; Fiser, Andras
2018-04-15
Multiple sequence alignments (MSAs) can provide essential input to many bioinformatics applications, including protein structure prediction and functional annotation. However, the optimal selection of sequences to obtain biologically informative MSAs for such purposes is poorly explored and has traditionally been performed manually. We present Selection of Alignment by Maximal Mutual Information (SAMMI), an automated, sequence-based approach to objectively select an optimal MSA from a large set of alternatives sampled from a general sequence database search. The hypothesis of this approach is that the mutual information among MSA columns will be maximal for those MSAs that contain the most diverse set possible of the most structurally and functionally homogeneous protein sequences. SAMMI was tested to select MSAs for functional site residue prediction by analysis of conservation patterns on a set of 435 proteins obtained from protein-ligand (peptides, nucleic acids and small substrates) and protein-protein interaction databases. Availability and implementation: a freely accessible program, including source code, implementing SAMMI is available at https://github.com/nelsongil92/SAMMI.git. Contact: andras.fiser@einstein.yu.edu. Supplementary data are available at Bioinformatics online.
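The quantity SAMMI maximizes, per the abstract, is the mutual information among MSA columns; a minimal version of that computation on a toy alignment:

```python
import math
from collections import Counter

def column_mi(col_a, col_b):
    """Mutual information (natural log) between two MSA columns."""
    n = len(col_a)
    pa, pb = Counter(col_a), Counter(col_b)
    pab = Counter(zip(col_a, col_b))
    return sum(
        (c / n) * math.log((c / n) / ((pa[a] / n) * (pb[b] / n)))
        for (a, b), c in pab.items()
    )

msa = ["ACDE", "ACDF", "AGHE", "AGHF"]   # toy alignment, one row per sequence
cols = list(zip(*msa))                   # transpose rows to columns
print(column_mi(cols[1], cols[2]))       # columns 1 and 2 covary -> MI = ln 2
```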
Bflinks: Reliable Bugfix Links via Bidirectional References and Tuned Heuristics
2014-01-01
Background. Data from software version archives and defect databases can be used for defect insertion circumstance analysis and defect prediction. The first step in such analyses is identifying defect-correcting changes in the version archive (bugfix commits) and enriching them with additional metadata by establishing bugfix links to corresponding entries in the defect database. Candidate bugfix commits are typically identified via heuristic string matching on the commit message. Research Questions. Which filters could be used to obtain a set of bugfix links? How should their parameters be tuned? What accuracy is achieved? Method. We analyze a modular set of seven independent filters, including new ones that make use of reverse links, and evaluate visual heuristics for setting cutoff parameters. For a commercial repository, a product expert manually verifies over 2500 links to validate the results with unprecedented accuracy. Results. The heuristics pick a very good parameter value for five filters and a reasonably good one for the sixth. The combined filtering, called bflinks, provides 93% precision and only 7% results loss. Conclusion. Bflinks can provide high-quality results and adapts to repositories with different properties. PMID:27433506
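The heuristic string-matching step that produces candidate bugfix commits can be sketched with a regular expression; the keyword pattern below is illustrative, not the paper's exact filter set.

```python
import re

# Illustrative pattern: a bugfix keyword followed by a defect-database ticket id
BUGFIX_RE = re.compile(r"\b(?:fix(?:es|ed)?|close[sd]?|bug)\b.*?#(\d+)",
                       re.IGNORECASE)

def candidate_bugfix_links(commit_messages):
    """Return (message, ticket-id) pairs for commits that look like bugfixes."""
    links = []
    for msg in commit_messages:
        m = BUGFIX_RE.search(msg)
        if m:
            links.append((msg, int(m.group(1))))
    return links

msgs = ["Fixes #1234: null pointer in parser",
        "Refactor build scripts",
        "closed #99 by adding missing lock"]
print(candidate_bugfix_links(msgs))
```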
Sun, Shulei; Chen, Jing; Li, Weizhong; Altintas, Ilkay; Lin, Abel; Peltier, Steve; Stocks, Karen; Allen, Eric E.; Ellisman, Mark; Grethe, Jeffrey; Wooley, John
2011-01-01
The Community Cyberinfrastructure for Advanced Microbial Ecology Research and Analysis (CAMERA, http://camera.calit2.net/) is a database and associated computational infrastructure that provides a single system for depositing, locating, analyzing, visualizing and sharing data about microbial biology through an advanced web-based analysis portal. CAMERA collects and links metadata relevant to environmental metagenome data sets with annotation in a semantically-aware environment allowing users to write expressive semantic queries against the database. To meet the needs of the research community, users are able to query metadata categories such as habitat, sample type, time, location and other environmental physicochemical parameters. CAMERA is compliant with the standards promulgated by the Genomic Standards Consortium (GSC), and sustains a role within the GSC in extending standards for content and format of the metagenomic data and metadata and its submission to the CAMERA repository. To ensure wide, ready access to data and annotation, CAMERA also provides data submission tools to allow researchers to share and forward data to other metagenomics sites and community data archives such as GenBank. It has multiple interfaces for easy submission of large or complex data sets, and supports pre-registration of samples for sequencing. CAMERA integrates a growing list of tools and viewers for querying, analyzing, annotating and comparing metagenome and genome data. PMID:21045053
Analysis Tool Web Services from the EMBL-EBI.
McWilliam, Hamish; Li, Weizhong; Uludag, Mahmut; Squizzato, Silvano; Park, Young Mi; Buso, Nicola; Cowley, Andrew Peter; Lopez, Rodrigo
2013-07-01
Since 2004 the European Bioinformatics Institute (EMBL-EBI) has provided access to a wide range of databases and analysis tools via Web Services interfaces. This comprises services to search across the databases available from the EMBL-EBI and to explore the network of cross-references present in the data (e.g. EB-eye), services to retrieve entry data in various data formats and to access the data in specific fields (e.g. dbfetch), and analysis tool services, for example, sequence similarity search (e.g. FASTA and NCBI BLAST), multiple sequence alignment (e.g. Clustal Omega and MUSCLE), pairwise sequence alignment and protein functional analysis (e.g. InterProScan and Phobius). The REST/SOAP Web Services (http://www.ebi.ac.uk/Tools/webservices/) interfaces to these databases and tools allow their integration into other tools, applications, web sites, pipeline processes and analytical workflows. To get users started using the Web Services, sample clients are provided covering a range of programming languages and popular Web Service tool kits, and a brief guide to Web Services technologies, including a set of tutorials, is available for those wishing to learn more and develop their own clients. Users of the Web Services are informed of improvements and updates via a range of methods.
Developing and Implementing the Data Mining Algorithms in RAVEN
DOE Office of Scientific and Technical Information (OSTI.GOV)
Sen, Ramazan Sonat; Maljovec, Daniel Patrick; Alfonsi, Andrea
The RAVEN code is becoming a comprehensive tool to perform probabilistic risk assessment, uncertainty quantification, and verification and validation. The RAVEN code is being developed to support many programs and to provide a set of methodologies and algorithms for advanced analysis. Scientific computer codes can generate enormous amounts of data. To post-process and analyze such data might, in some cases, take longer than the initial software runtime. Data mining algorithms and methods help in recognizing and understanding patterns in the data, and thus discover knowledge in databases. The methodologies used in dynamic probabilistic risk assessment or in uncertainty and error quantification analysis couple system/physics codes with simulation controller codes, such as RAVEN. RAVEN introduces both deterministic and stochastic elements into the simulation while the system/physics code models the dynamics deterministically. A typical analysis is performed by sampling values of a set of parameters. A major challenge in using dynamic probabilistic risk assessment or uncertainty and error quantification analysis for a complex system is analyzing the large number of scenarios generated. Data mining techniques are typically used to better organize and understand data, i.e., to recognize patterns in the data. This report focuses on the development and implementation of Application Programming Interfaces (APIs) for different data mining algorithms, and the application of these algorithms to different databases.
Asynchronous data change notification between database server and accelerator controls system
DOE Office of Scientific and Technical Information (OSTI.GOV)
Fu, W.; Morris, J.; Nemesure, S.
2011-10-10
Database data change notification (DCN) is a commonly used feature. Not all database management systems (DBMS) provide an explicit DCN mechanism. Even for those DBMSs which support DCN (such as Oracle and MS SQL Server), some server-side and/or client-side programming may be required to make the DCN system work. This makes the setup of DCN between a database server and interested clients tedious and time consuming. In accelerator control systems, there are many well-established software client/server architectures (such as CDEV, EPICS, and ADO) that can be used to implement data reflection servers that transfer data asynchronously to any client using the standard SET/GET API. This paper describes a method for using such a data reflection server to set up asynchronous DCN (ADCN) between a DBMS and clients. This method works well for all DBMS systems which provide database trigger functionality. ADCN between a database server and clients can thus be realized by combining a database trigger mechanism, which is supported by major DBMS systems, with server processes built on the client/server software architectures familiar in the accelerator controls community (such as EPICS, CDEV or ADO). This approach makes the ADCN system easy to set up and integrate into an accelerator controls system. Several ADCN systems have been set up and used in the RHIC-AGS controls system.
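A hedged sketch of the trigger-driven pattern, using PostgreSQL's LISTEN/NOTIFY via psycopg2 as a stand-in (the paper's setups use Oracle/MS SQL with EPICS/CDEV/ADO reflection servers). The DSN and channel name are hypothetical, and a trigger on the watched table is assumed to issue the NOTIFY.

```python
import select
import psycopg2

# Hypothetical DSN and channel; a database trigger is assumed to run
# NOTIFY settings_changed, '<row id>' whenever the watched table changes.
conn = psycopg2.connect("dbname=controls user=ops")
conn.autocommit = True
cur = conn.cursor()
cur.execute("LISTEN settings_changed;")

while True:
    # Block until the server pushes a notification (5 s timeout)
    if select.select([conn], [], [], 5) == ([], [], []):
        continue
    conn.poll()
    while conn.notifies:
        note = conn.notifies.pop(0)
        print("change on channel", note.channel, "payload:", note.payload)
```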
Integrating the Allen Brain Institute Cell Types Database into Automated Neuroscience Workflow.
Stockton, David B; Santamaria, Fidel
2017-10-01
We developed software tools to download, extract features, and organize the Cell Types Database from the Allen Brain Institute (ABI) in order to integrate its whole cell patch clamp characterization data into the automated modeling/data analysis cycle. To expand the potential user base we employed both Python and MATLAB. The basic set of tools downloads selected raw data and extracts cell, sweep, and spike features, using ABI's feature extraction code. To facilitate data manipulation we added a tool to build a local specialized database of raw data plus extracted features. Finally, to maximize automation, we extended our NeuroManager workflow automation suite to include these tools plus a separate investigation database. The extended suite allows the user to integrate ABI experimental and modeling data into an automated workflow deployed on heterogeneous computer infrastructures, from local servers, to high performance computing environments, to the cloud. Since our approach is focused on workflow procedures our tools can be modified to interact with the increasing number of neuroscience databases being developed to cover all scales and properties of the nervous system.
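The "local specialized database of raw data plus extracted features" can be illustrated with Python's standard-library sqlite3 module; the schema and values below are placeholders, not the tools' actual layout.

```python
import sqlite3

conn = sqlite3.connect("cell_types_features.db")    # local file, name illustrative
conn.execute("""CREATE TABLE IF NOT EXISTS sweep_features (
                    specimen_id INTEGER,
                    sweep_number INTEGER,
                    spike_count INTEGER,
                    mean_rate_hz REAL)""")

# Insert features extracted upstream (values are placeholders)
conn.execute("INSERT INTO sweep_features VALUES (?, ?, ?, ?)",
             (464212183, 35, 12, 24.0))
conn.commit()

for row in conn.execute(
        "SELECT specimen_id, mean_rate_hz FROM sweep_features WHERE spike_count > 10"):
    print(row)
```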
Database resources of the National Center for Biotechnology Information
Sayers, Eric W.; Barrett, Tanya; Benson, Dennis A.; Bolton, Evan; Bryant, Stephen H.; Canese, Kathi; Chetvernin, Vyacheslav; Church, Deanna M.; DiCuccio, Michael; Federhen, Scott; Feolo, Michael; Fingerman, Ian M.; Geer, Lewis Y.; Helmberg, Wolfgang; Kapustin, Yuri; Krasnov, Sergey; Landsman, David; Lipman, David J.; Lu, Zhiyong; Madden, Thomas L.; Madej, Tom; Maglott, Donna R.; Marchler-Bauer, Aron; Miller, Vadim; Karsch-Mizrachi, Ilene; Ostell, James; Panchenko, Anna; Phan, Lon; Pruitt, Kim D.; Schuler, Gregory D.; Sequeira, Edwin; Sherry, Stephen T.; Shumway, Martin; Sirotkin, Karl; Slotta, Douglas; Souvorov, Alexandre; Starchenko, Grigory; Tatusova, Tatiana A.; Wagner, Lukas; Wang, Yanli; Wilbur, W. John; Yaschenko, Eugene; Ye, Jian
2012-01-01
In addition to maintaining the GenBank® nucleic acid sequence database, the National Center for Biotechnology Information (NCBI) provides analysis and retrieval resources for the data in GenBank and other biological data made available through the NCBI Website. NCBI resources include Entrez, the Entrez Programming Utilities, MyNCBI, PubMed, PubMed Central (PMC), Gene, the NCBI Taxonomy Browser, BLAST, BLAST Link (BLink), Primer-BLAST, COBALT, Splign, RefSeq, UniGene, HomoloGene, ProtEST, dbMHC, dbSNP, dbVar, Epigenomics, Genome and related tools, the Map Viewer, Model Maker, Evidence Viewer, Trace Archive, Sequence Read Archive, BioProject, BioSample, Retroviral Genotyping Tools, HIV-1/Human Protein Interaction Database, Gene Expression Omnibus (GEO), Probe, Online Mendelian Inheritance in Animals (OMIA), the Molecular Modeling Database (MMDB), the Conserved Domain Database (CDD), the Conserved Domain Architecture Retrieval Tool (CDART), Biosystems, Protein Clusters and the PubChem suite of small molecule databases. Augmenting many of the Web applications are custom implementations of the BLAST program optimized to search specialized data sets. All of these resources can be accessed through the NCBI home page at www.ncbi.nlm.nih.gov. PMID:22140104
Database resources of the National Center for Biotechnology Information
2013-01-01
In addition to maintaining the GenBank® nucleic acid sequence database, the National Center for Biotechnology Information (NCBI, http://www.ncbi.nlm.nih.gov) provides analysis and retrieval resources for the data in GenBank and other biological data made available through the NCBI web site. NCBI resources include Entrez, the Entrez Programming Utilities, MyNCBI, PubMed, PubMed Central, Gene, the NCBI Taxonomy Browser, BLAST, BLAST Link (BLink), Primer-BLAST, COBALT, Splign, RefSeq, UniGene, HomoloGene, ProtEST, dbMHC, dbSNP, dbVar, Epigenomics, the Genetic Testing Registry, Genome and related tools, the Map Viewer, Model Maker, Evidence Viewer, Trace Archive, Sequence Read Archive, BioProject, BioSample, Retroviral Genotyping Tools, HIV-1/Human Protein Interaction Database, Gene Expression Omnibus, Probe, Online Mendelian Inheritance in Animals, the Molecular Modeling Database, the Conserved Domain Database, the Conserved Domain Architecture Retrieval Tool, Biosystems, Protein Clusters and the PubChem suite of small molecule databases. Augmenting many of the web applications are custom implementations of the BLAST program optimized to search specialized data sets. All of these resources can be accessed through the NCBI home page. PMID:23193264
Database resources of the National Center for Biotechnology Information.
Wheeler, David L; Barrett, Tanya; Benson, Dennis A; Bryant, Stephen H; Canese, Kathi; Chetvernin, Vyacheslav; Church, Deanna M; DiCuccio, Michael; Edgar, Ron; Federhen, Scott; Geer, Lewis Y; Kapustin, Yuri; Khovayko, Oleg; Landsman, David; Lipman, David J; Madden, Thomas L; Maglott, Donna R; Ostell, James; Miller, Vadim; Pruitt, Kim D; Schuler, Gregory D; Sequeira, Edwin; Sherry, Steven T; Sirotkin, Karl; Souvorov, Alexandre; Starchenko, Grigory; Tatusov, Roman L; Tatusova, Tatiana A; Wagner, Lukas; Yaschenko, Eugene
2007-01-01
In addition to maintaining the GenBank nucleic acid sequence database, the National Center for Biotechnology Information (NCBI) provides analysis and retrieval resources for the data in GenBank and other biological data made available through NCBI's Web site. NCBI resources include Entrez, the Entrez Programming Utilities, My NCBI, PubMed, PubMed Central, Entrez Gene, the NCBI Taxonomy Browser, BLAST, BLAST Link(BLink), Electronic PCR, OrfFinder, Spidey, Splign, RefSeq, UniGene, HomoloGene, ProtEST, dbMHC, dbSNP, Cancer Chromosomes, Entrez Genome, Genome Project and related tools, the Trace and Assembly Archives, the Map Viewer, Model Maker, Evidence Viewer, Clusters of Orthologous Groups (COGs), Viral Genotyping Tools, Influenza Viral Resources, HIV-1/Human Protein Interaction Database, Gene Expression Omnibus (GEO), Entrez Probe, GENSAT, Online Mendelian Inheritance in Man (OMIM), Online Mendelian Inheritance in Animals (OMIA), the Molecular Modeling Database (MMDB), the Conserved Domain Database (CDD), the Conserved Domain Architecture Retrieval Tool (CDART) and the PubChem suite of small molecule databases. Augmenting many of the Web applications are custom implementations of the BLAST program optimized to search specialized data sets. These resources can be accessed through the NCBI home page at www.ncbi.nlm.nih.gov.
Database resources of the National Center for Biotechnology Information.
Sayers, Eric W; Barrett, Tanya; Benson, Dennis A; Bryant, Stephen H; Canese, Kathi; Chetvernin, Vyacheslav; Church, Deanna M; DiCuccio, Michael; Edgar, Ron; Federhen, Scott; Feolo, Michael; Geer, Lewis Y; Helmberg, Wolfgang; Kapustin, Yuri; Landsman, David; Lipman, David J; Madden, Thomas L; Maglott, Donna R; Miller, Vadim; Mizrachi, Ilene; Ostell, James; Pruitt, Kim D; Schuler, Gregory D; Sequeira, Edwin; Sherry, Stephen T; Shumway, Martin; Sirotkin, Karl; Souvorov, Alexandre; Starchenko, Grigory; Tatusova, Tatiana A; Wagner, Lukas; Yaschenko, Eugene; Ye, Jian
2009-01-01
In addition to maintaining the GenBank nucleic acid sequence database, the National Center for Biotechnology Information (NCBI) provides analysis and retrieval resources for the data in GenBank and other biological data made available through the NCBI web site. NCBI resources include Entrez, the Entrez Programming Utilities, MyNCBI, PubMed, PubMed Central, Entrez Gene, the NCBI Taxonomy Browser, BLAST, BLAST Link (BLink), Electronic PCR, OrfFinder, Spidey, Splign, RefSeq, UniGene, HomoloGene, ProtEST, dbMHC, dbSNP, Cancer Chromosomes, Entrez Genomes and related tools, the Map Viewer, Model Maker, Evidence Viewer, Clusters of Orthologous Groups (COGs), Retroviral Genotyping Tools, HIV-1/Human Protein Interaction Database, Gene Expression Omnibus (GEO), Entrez Probe, GENSAT, Online Mendelian Inheritance in Man (OMIM), Online Mendelian Inheritance in Animals (OMIA), the Molecular Modeling Database (MMDB), the Conserved Domain Database (CDD), the Conserved Domain Architecture Retrieval Tool (CDART) and the PubChem suite of small molecule databases. Augmenting many of the web applications is custom implementation of the BLAST program optimized to search specialized data sets. All of the resources can be accessed through the NCBI home page at www.ncbi.nlm.nih.gov.
Angermeier, Paul L.; Frimpong, Emmanuel A.
2011-01-01
The need for integrated and widely accessible sources of species traits data to facilitate studies of ecology, conservation, and management has motivated development of traits databases for various taxa. In spite of the increasing number of traits-based analyses of freshwater fishes in the United States, no consolidated database of traits of this group exists publicly, and much useful information on these species is documented only in obscure sources. The largely inaccessible and unconsolidated traits information makes large-scale analysis involving many fishes and/or traits particularly challenging. We have compiled a database of > 100 traits for 809 (731 native and 78 nonnative) fish species found in freshwaters of the conterminous United States, including 37 native families and 145 native genera. The database, named Fish Traits, contains information on four major categories of traits: (1) trophic ecology; (2) body size, reproductive ecology, and life history; (3) habitat preferences; and (4) salinity and temperature tolerances. Information on geographic distribution and conservation status was also compiled. The database enhances many opportunities for conducting research on fish species traits and constitutes the first step toward establishing a central repository for a continually expanding set of traits of North American fishes.
Data resource profile: United Nations Children's Fund (UNICEF).
Murray, Colleen; Newby, Holly
2012-12-01
The United Nations Children's Fund (UNICEF) plays a leading role in the collection, compilation, analysis and dissemination of data to inform sound policies, legislation and programmes for promoting children's rights and well-being, and for global monitoring of progress towards the Millennium Development Goals. UNICEF maintains a set of global databases representing nearly 200 countries and covering the areas of child mortality, child health, maternal health, nutrition, immunization, water and sanitation, HIV/AIDS, education and child protection. These databases consist of internationally comparable and statistically sound data, and are updated annually through a process that draws on a wealth of data provided by UNICEF's wide network of >150 field offices. The databases are composed primarily of estimates from household surveys, with data from censuses, administrative records, vital registration systems and statistical models contributing to some key indicators as well. The data are assessed for quality based on a set of objective criteria to ensure that only the most reliable nationally representative information is included. For most indicators, data are available at the global, regional and national levels, plus sub-national disaggregation by sex, urban/rural residence and household wealth. The global databases are featured in UNICEF's flagship publications, inter-agency reports, including the Secretary General's Millennium Development Goals Report and Countdown to 2015, sector-specific reports and statistical country profiles. They are also publicly available on www.childinfo.org, together with trend data and equity analyses.
Finding Protein and Nucleotide Similarities with FASTA
Pearson, William R.
2016-01-01
The FASTA programs provide a comprehensive set of rapid similarity searching tools ( fasta36, fastx36, tfastx36, fasty36, tfasty36), similar to those provided by the BLAST package, as well as programs for slower, optimal, local and global similarity searches ( ssearch36, ggsearch36) and for searching with short peptides and oligonucleotides ( fasts36, fastm36). The FASTA programs use an empirical strategy for estimating statistical significance that accommodates a range of similarity scoring matrices and gap penalties, improving alignment boundary accuracy and search sensitivity (Unit 3.5). The FASTA programs can produce “BLAST-like” alignment and tabular output, for ease of integration into existing analysis pipelines, and can search small, representative databases, and then report results for a larger set of sequences, using links from the smaller dataset. The FASTA programs work with a wide variety of database formats, including mySQL and postgreSQL databases (Unit 9.4). The programs also provide a strategy for integrating domain and active site annotations into alignments and highlighting the mutational state of functionally critical residues. These protocols describe how to use the FASTA programs to characterize protein and DNA sequences, using protein:protein, protein:DNA, and DNA:DNA comparisons. PMID:27010337
HodDB: Design and Analysis of a Query Processor for Brick.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Fierro, Gabriel; Culler, David
Brick is a recently proposed metadata schema and ontology for describing building components and the relationships between them. It represents buildings as directed labeled graphs using the RDF data model. Using the SPARQL query language, building-agnostic applications query a Brick graph to discover the set of resources and relationships they require to operate. Latency-sensitive applications, such as user interfaces, demand response and model-predictive control, require fast queries — conventionally less than 100 ms. We benchmark a set of popular open-source and commercial SPARQL databases against three real Brick models using seven application queries and find that none of them meet this performance target. This lack of performance can be attributed to design decisions that optimize for queries over large graphs consisting of billions of triples, but give poor spatial locality and join performance on the small, dense graphs typical of Brick. We present the design and evaluation of HodDB, an RDF/SPARQL database for Brick built over a node-based index structure. HodDB performs Brick queries 3-700x faster than leading SPARQL databases and consistently meets the 100 ms threshold, enabling the portability of important latency-sensitive building applications.
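The kind of discovery query such applications issue can be sketched with rdflib; the model file is hypothetical, and the class and relationship names follow the Brick schema's usual style but should be treated as assumptions.

```python
import rdflib

g = rdflib.Graph()
g.parse("building.ttl", format="turtle")    # hypothetical Brick model file

# Find temperature sensors that are points of some VAV box
q = """
PREFIX brick: <https://brickschema.org/schema/Brick#>
SELECT ?sensor ?vav WHERE {
    ?sensor a brick:Zone_Air_Temperature_Sensor .
    ?vav    a brick:VAV .
    ?sensor brick:isPointOf ?vav .
}
"""
for sensor, vav in g.query(q):
    print(sensor, "->", vav)
```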
NASA Astrophysics Data System (ADS)
Schwortz, Andria C.; Burrows, Andrea C.; Myers, Adam D.
2015-01-01
Astronomy is increasingly moving towards working with large databases, from the state-of-the-art Sloan Digital Sky Survey Data Release 10 to the historical Digital Access to a Sky Century at Harvard. Fields outside astronomy likewise tend to work with large datasets, be it in the form of warehouse inventory, health trends, or the stock market. However, very few fields explicitly teach students the necessary skills to analyze such data. The authors studied a matched set of 37 participants working with 200-entry databases in astronomy using Google Spreadsheets, with limited information about a random set of quasars drawn from SDSS DR5. Here the authors present the quantitative results from an eight-question pre-/post-test, with questions designed to span Bloom's taxonomy, on both the skills of using spreadsheets and the content of quasars. Participants included both Astro 101 summer students and professionals, including in-service K-12 teachers and science communicators. All groups showed statistically significant normalized gains (as per Hake, 1998), with the largest contrast between women's average gain of 0.196 and men's of 0.480.
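The gain statistic referenced here is Hake's (1998) normalized gain, g = (post - pre)/(100 - pre) on percent scores; the pre/post values below are hypothetical, chosen only to reproduce the reported gains.

```python
def normalized_gain(pre_pct, post_pct):
    """Hake (1998) normalized gain: the fraction of the possible
    improvement actually achieved, g = (post - pre) / (100 - pre)."""
    return (post_pct - pre_pct) / (100.0 - pre_pct)

# Hypothetical class averages illustrating the reported gap
print(normalized_gain(50.0, 59.8))   # ~0.196, like the women's average gain
print(normalized_gain(50.0, 74.0))   # 0.480, like the men's average gain
```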
Automatic detection of anomalies in screening mammograms
2013-01-01
Background. Diagnostic performance in breast screening programs may be influenced by the prior probability of disease. Since breast cancer incidence is roughly half a percent in the general population, there is a large probability that the screening exam will be normal. That factor may contribute to false negatives. Screening programs typically exhibit about 83% sensitivity and 91% specificity. This investigation was undertaken to determine if a system could be developed to pre-sort screening images into normal and suspicious bins based on their likelihood to contain disease. Wavelets were investigated as a method to parse the image data, potentially removing confounding information. The development of a classification system based on features extracted from wavelet-transformed mammograms is reported. Methods. In the multi-step procedure, images were processed using 2D discrete wavelet transforms to create a set of maps at different size scales. Next, statistical features were computed from each map, and a subset of these features was the input for a concerted-effort set of naïve Bayesian classifiers. The classifier network was constructed to calculate the probability that the parent mammography image contained an abnormality. The abnormalities were not identified, nor were they regionalized. The algorithm was tested on two publicly available databases: the Digital Database for Screening Mammography (DDSM) and the Mammographic Image Analysis Society's database (MIAS). These databases contain radiologist-verified images and feature common abnormalities including spiculations, masses, geometric deformations and fibroid tissues. Results. The classifier-network designs tested achieved sensitivities and specificities sufficient to be potentially useful in a clinical setting. This first series of tests identified networks with 100% sensitivity and up to 79% specificity for abnormalities. This performance significantly exceeds the mean sensitivity reported in the literature for the unaided human expert. Conclusions. Classifiers based on wavelet-derived features proved to be highly sensitive to a range of pathologies; as a result, Type II errors were nearly eliminated. Pre-sorting the images changed the prior probability in the sorted database from 37% to 74%. PMID:24330643
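A hedged sketch of the described pipeline, assuming PyWavelets for the 2D DWT, simple subband statistics as features, and a single Gaussian naïve Bayes classifier in place of the paper's network of classifiers; the images and labels are synthetic stand-ins.

```python
import numpy as np
import pywt
from sklearn.naive_bayes import GaussianNB

def wavelet_features(image, wavelet="db2", level=3):
    """Statistical features (mean, std) of each 2D DWT subband."""
    coeffs = pywt.wavedec2(image, wavelet, level=level)
    bands = [coeffs[0]] + [b for triple in coeffs[1:] for b in triple]
    return np.array([f(np.abs(b)) for b in bands for f in (np.mean, np.std)])

rng = np.random.default_rng(1)
images = rng.random((40, 64, 64))        # stand-ins for mammogram patches
labels = rng.integers(0, 2, 40)          # 0 = normal, 1 = suspicious (synthetic)
X = np.array([wavelet_features(im) for im in images])

clf = GaussianNB().fit(X, labels)
print("P(suspicious) for first image:", clf.predict_proba(X[:1])[0, 1])
```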
Dark, Paul; Wilson, Claire; Blackwood, Bronagh; McAuley, Danny F; Perkins, Gavin D; McMullan, Ronan; Gates, Simon; Warhurst, Geoffrey
2012-01-01
Background: There is growing interest in the potential utility of molecular diagnostics in improving the detection of life-threatening infection (sepsis). LightCycler® SeptiFast is a multipathogen, probe-based real-time PCR system that targets DNA sequences of bacteria and fungi in blood samples and returns results within a few hours. We report here the protocol of the first systematic review of published clinical diagnostic accuracy studies of this technology when compared with blood culture in the setting of suspected sepsis. Methods/design: Data sources: the Cochrane Database of Systematic Reviews, the Database of Abstracts of Reviews of Effects (DARE), the Health Technology Assessment Database (HTA), the NHS Economic Evaluation Database (NHSEED), The Cochrane Library, MEDLINE, EMBASE, ISI Web of Science, BIOSIS Previews, MEDION and the Aggressive Research Intelligence Facility Database (ARIF). Study selection: diagnostic accuracy studies that compare the real-time PCR technology with standard culture results performed on a patient's blood sample during the management of sepsis. Data extraction: three reviewers, working independently, will determine the level of evidence, methodological quality and a standard data set relating to demographics and diagnostic accuracy metrics for each study. Statistical analysis/data synthesis: heterogeneity of studies will be investigated using a coupled forest plot of sensitivity and specificity and a scatter plot in Receiver Operating Characteristic (ROC) space. A bivariate model will be used to estimate summary sensitivity and specificity. The authors will investigate reporting biases using funnel plots based on effective sample size and regression tests of asymmetry. Subgroup analyses are planned for adults, children and infection setting (hospital vs community) if sufficient data are uncovered. Dissemination: Recommendations will be made to the Department of Health (as part of an open-access HTA report) as to whether the real-time PCR technology has sufficient clinical diagnostic accuracy potential to move forward to efficacy testing during the provision of routine clinical care. Registration: PROSPERO-NIHR Prospective Register of Systematic Reviews (CRD42011001289).
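The planned coupled sensitivity/specificity summary can be illustrated with a toy calculation; the 2×2 counts below are invented placeholders, not data from any SeptiFast study:

```python
# Per-study sensitivity and specificity from 2x2 tables (TP, FP, FN, TN),
# reported as points in ROC space. All counts are invented for illustration.
studies = {  # study_id: (TP, FP, FN, TN)
    "A": (40, 12, 8, 140),
    "B": (25, 9, 10, 96),
}
for name, (tp, fp, fn, tn) in studies.items():
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    print(f"{name}: sensitivity={sens:.2f}, specificity={spec:.2f}, "
          f"ROC point=({1 - spec:.2f}, {sens:.2f})")
```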
Integer sequence discovery from small graphs
Hoppe, Travis; Petrone, Anna
2015-01-01
We have exhaustively enumerated all simple, connected graphs of a finite order and have computed a selection of invariants over this set. Integer sequences were constructed from these invariants and checked against the Online Encyclopedia of Integer Sequences (OEIS); 141 new sequences were added and six existing sequences were extended. From the graph database, we were able to programmatically suggest relationships among the invariants. We also show that any sequence of graphs matching given criteria can be readily visualized. The code has been released as an open-source framework for further analysis, and the database was constructed to be extensible to invariants not considered in this work. PMID:27034526
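A small sketch of the pipeline: exhaustively enumerate simple connected graphs of order n and read off an integer sequence from an invariant. The invariant chosen here (number of connected graphs up to isomorphism, OEIS A001349) is one simple example, not necessarily one the paper adds:

```python
# Enumerate all simple connected graphs on n nodes up to isomorphism,
# then count them to produce an integer sequence (OEIS A001349: 1, 1, 2, 6, 21, ...).
from itertools import combinations
import networkx as nx

def connected_graphs(n):
    """All simple connected graphs on n nodes, one representative per class."""
    edges = list(combinations(range(n), 2))
    seen = []
    for mask in range(1 << len(edges)):
        g = nx.Graph([e for i, e in enumerate(edges) if mask >> i & 1])
        g.add_nodes_from(range(n))
        if nx.is_connected(g) and not any(nx.is_isomorphic(g, h) for h in seen):
            seen.append(g)
    return seen

print([len(connected_graphs(n)) for n in range(1, 6)])  # [1, 1, 2, 6, 21]
```

Brute-force enumeration over edge subsets is only feasible for very small n; exhaustive studies like the one above rely on specialized generators, but the sequence-building step is the same.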
Learning lessons from Natech accidents - the eNATECH accident database
NASA Astrophysics Data System (ADS)
Krausmann, Elisabeth; Girgin, Serkan
2016-04-01
When natural hazards impact industrial facilities that house or process hazardous materials, fires, explosions and toxic releases can occur. This type of accident is commonly referred to as a Natech accident. In order to prevent the recurrence of accidents or to better mitigate their consequences, lessons-learned-type studies using available accident data are usually carried out. Through post-accident analysis, conclusions can be drawn on the most common damage and failure modes and hazmat release paths, particularly vulnerable storage and process equipment, and the hazardous materials most commonly involved in these types of accidents. These analyses also lend themselves to identifying technical and organisational risk-reduction measures that are missing or require improvement. Industrial accident databases are commonly used for retrieving sets of Natech accident case histories for further analysis. These databases contain accident data from the open literature, government authorities or in-company sources. The quality of the reported information is not uniform and exhibits different levels of detail and accuracy. This is due to the difficulty of finding qualified information sources, especially in situations where accident reporting by industry or by authorities is not compulsory, e.g. when spill quantities are below the reporting threshold. Data collection then has to rely on voluntary record keeping, often by non-experts. The level of detail is particularly non-uniform for Natech accident data, depending on whether the consequences of the Natech event were major or minor and whether comprehensive information was available for reporting. In addition to the reporting bias towards high-consequence events, industrial accident databases frequently lack information on the severity of the triggering natural hazard, as well as on the failure modes that led to the hazmat release. This makes it difficult to reconstruct the dynamics of the accident and renders the development of equipment vulnerability models linking natural-hazard severity to observed damage almost impossible. As a consequence, the European Commission has set up the eNATECH database for the systematic collection of Natech accident and near-miss data. The database implements the more sophisticated accident representation required to capture the characteristics of Natech events and is publicly accessible at http://enatech.jrc.ec.europa.eu. This presentation outlines the general lessons-learning process, introduces the eNATECH database and its specific structure, and discusses natural-hazard-specific lessons learned and features common to Natech accidents triggered by different natural hazards.
Ojima-Kato, Teruyo; Yamamoto, Naomi; Takahashi, Hajime; Tamura, Hiroto
2016-01-01
The genetic lineages of Listeria monocytogenes and other species of the genus Listeria are correlated with pathogenesis in humans. Although matrix-assisted laser desorption/ionization time-of-flight mass spectrometry (MALDI-TOF MS) has become a prevailing tool for rapid and reliable microbial identification, the precise discrimination of Listeria species and lineages remains a crucial issue in clinical settings and for food safety. In this study, we constructed an accurate and reliable MS database to discriminate the lineages of L. monocytogenes and the species of Listeria (L. monocytogenes, L. innocua, L. welshimeri, L. seeligeri, L. ivanovii, L. grayi, and L. rocourtiae) based on the S10-spc-alpha operon gene-encoded ribosomal protein mass spectrum (S10-GERMS) proteotyping method, which relies on both genetic information (genomics) and observed MS peaks in MALDI-TOF MS (proteomics). A specific set of eight biomarkers (ribosomal proteins L24, L6, L18, L15, S11, S9, L31 type B, and S16) yielded characteristic MS patterns for the lineages of L. monocytogenes and the different species of Listeria, and led to the construction of an MS database that successfully discriminated between these organisms in MALDI-TOF MS fingerprinting analysis followed by analysis with the advanced proteotyping software Strain Solution. We further validated the constructed database in Strain Solution using 23 Listeria strains collected from natural sources. PMID:27442502
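A schematic sketch of this style of proteotyping: match observed MALDI-TOF peaks against theoretical biomarker masses per lineage and score each candidate by matched biomarkers. All masses below are invented placeholders, not the study's biomarker values:

```python
# Toy S10-GERMS-style matching: count reference masses hit by observed m/z
# peaks within a tolerance. Masses and lineages are hypothetical examples.
REFERENCE = {  # lineage -> theoretical biomarker masses (Da), placeholders
    "L. monocytogenes lineage I": [4342.1, 5687.3, 6508.9],
    "L. innocua": [4356.2, 5687.3, 6521.0],
}

def score(observed, reference, tol=0.5):
    """Count reference masses matched by any observed peak within +/- tol Da."""
    return sum(any(abs(o - m) <= tol for o in observed) for m in reference)

peaks = [4342.3, 5687.1, 6520.8]  # hypothetical spectrum
best = max(REFERENCE, key=lambda k: score(peaks, REFERENCE[k]))
print(best, score(peaks, REFERENCE[best]))
```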
cPath: open source software for collecting, storing, and querying biological pathways.
Cerami, Ethan G; Bader, Gary D; Gross, Benjamin E; Sander, Chris
2006-11-13
Biological pathways, including metabolic pathways, protein interaction networks, signal transduction pathways, and gene regulatory networks, are currently represented in over 220 diverse databases. These data are crucial for the study of specific biological processes, including human diseases. Standard exchange formats for pathway information, such as BioPAX, CellML, SBML and PSI-MI, enable convenient collection of this data for biological research, but mechanisms for common storage and communication are required. We have developed cPath, an open source database and web application for collecting, storing, and querying biological pathway data. cPath makes it easy to aggregate custom pathway data sets available in standard exchange formats from multiple databases, present pathway data to biologists via a customizable web interface, and export pathway data via a web service to third-party software, such as Cytoscape, for visualization and analysis. cPath is software only, and does not include new pathway information. Key features include: a built-in identifier mapping service for linking identical interactors and linking to external resources; built-in support for PSI-MI and BioPAX standard pathway exchange formats; a web service interface for searching and retrieving pathway data sets; and thorough documentation. The cPath software is freely available under the LGPL open source license for academic and commercial use. cPath is a robust, scalable, modular, professional-grade software platform for collecting, storing, and querying biological pathways. It can serve as the core data handling component in information systems for pathway visualization, analysis and modeling.
An integrated approach to reservoir modeling
DOE Office of Scientific and Technical Information (OSTI.GOV)
Donaldson, K.
1993-08-01
The purpose of this research is to evaluate the usefulness of the following procedural and analytical methods in investigating the heterogeneity of the oil reserve of the Mississippian Big Injun Sandstone of the Granny Creek field, Clay and Roane counties, West Virginia: (1) a relational database, (2) two-dimensional cross sections, (3) true three-dimensional modeling, (4) geohistory analysis, (5) a rule-based expert system, and (6) geographical information systems. The large data set could not be effectively integrated and interpreted without this approach. A relational database was designed to fully integrate three- and four-dimensional data. The database provides an effective means for maintaining and manipulating the data. A two-dimensional cross-section program was designed to correlate stratigraphy, depositional environments, porosity, permeability, and petrographic data. This flexible design allows for additional four-dimensional data. Dynamic Graphics™ …
A series of PDB related databases for everyday needs.
Joosten, Robbie P; te Beek, Tim A H; Krieger, Elmar; Hekkelman, Maarten L; Hooft, Rob W W; Schneider, Reinhard; Sander, Chris; Vriend, Gert
2011-01-01
The Protein Data Bank (PDB) is the worldwide repository of macromolecular structure information. We present a series of databases that run parallel to the PDB. Each database holds, where possible, one entry for each PDB entry. DSSP holds the secondary structure of the proteins. PDBREPORT holds reports on structure quality and lists errors. HSSP holds a multiple sequence alignment for all proteins. PDBFINDER holds easy-to-parse summaries of the PDB file content, augmented with essentials from the other systems. PDB_REDO holds re-refined, and often improved, copies of all structures solved by X-ray. WHY_NOT summarizes why certain files could not be produced. All these systems are updated weekly. The data sets can be used for the analysis of properties of protein structures in areas ranging from structural genomics to cancer biology and protein design.
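For consumers of DSSP-style secondary-structure assignments, Biopython ships a wrapper. A minimal sketch, assuming the external dssp/mkdssp program is installed and with "example.pdb" as a placeholder file name:

```python
# Read per-residue secondary structure via Biopython's DSSP wrapper.
# Requires the dssp (mkdssp) executable; file name is a placeholder.
from Bio.PDB import PDBParser
from Bio.PDB.DSSP import DSSP

structure = PDBParser(QUIET=True).get_structure("example", "example.pdb")
model = structure[0]
dssp = DSSP(model, "example.pdb")  # runs dssp and parses the result

for key in list(dssp.keys())[:5]:
    aa, ss = dssp[key][1], dssp[key][2]  # amino acid, secondary-structure code
    print(key, aa, ss)
```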
GenomeHubs: simple containerized setup of a custom Ensembl database and web server for any species
Kumar, Sujai; Stevens, Lewis; Blaxter, Mark
2017-01-01
As the generation and use of genomic datasets becomes increasingly common in all areas of biology, the need for resources to collate, analyse and present data from one or more genome projects is becoming more pressing. The Ensembl platform is a powerful tool for making genome data and cross-species analyses easily accessible through a web interface and a comprehensive application programming interface. Here we introduce GenomeHubs, which provide a containerized environment to facilitate the setup and hosting of custom Ensembl genome browsers. This simplifies mirroring of existing content and import of new genomic data into the Ensembl database schema. GenomeHubs also provide a set of analysis containers to decorate imported genomes with the results of standard analyses and functional annotations, and support export to flat files, including EMBL format for submission of assemblies and annotations to the International Nucleotide Sequence Database Collaboration. Database URL: http://GenomeHubs.org PMID:28605774
A biomedical information system for retrieval and manipulation of NHANES data.
Mukherjee, Sukrit; Martins, David; Norris, Keith C; Jenders, Robert A
2013-01-01
The retrieval and manipulation of data from large public databases like the U.S. National Health and Nutrition Examination Survey (NHANES) may require sophisticated statistical software and significant expertise that may be unavailable in the university setting. In response, we have developed the Data Retrieval And Manipulation System (DReAMS), an automated information system that handles all processes of data extraction and cleaning and then joins different subsets to produce analysis-ready output. The system is a browser-based data warehouse application in which input data from flat files or operational systems are aggregated in a structured way so that the desired data can be read, recoded, queried and extracted efficiently. The current pilot implementation provides access to a limited portion of the NHANES database. We plan to increase the amount of data available through the system in the near future and to extend the techniques to other large databases from the CDU archive, which currently holds about 53 databases.
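The extract-clean-join workflow that DReAMS automates can be sketched by hand with pandas; the file and variable names below are illustrative NHANES-style names, not a prescription of what the system uses internally:

```python
# Read two NHANES SAS transport files, merge on the SEQN respondent ID,
# and drop records missing the outcome of interest. Names are illustrative.
import pandas as pd

demo = pd.read_sas("DEMO_J.XPT", format="xport")   # demographics
bpx = pd.read_sas("BPX_J.XPT", format="xport")     # blood-pressure exam

merged = demo.merge(bpx, on="SEQN", how="inner")   # analysis-ready subset
merged = merged.dropna(subset=["BPXSY1"])          # drop missing systolic BP
print(merged[["SEQN", "RIAGENDR", "BPXSY1"]].head())
```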
NASA Astrophysics Data System (ADS)
Wang, Jian
2017-01-01
In order to change the traditional PE teaching mode and realize the interconnection, interworking and sharing of PE teaching resources, a distance PE teaching platform based on a broadband network is designed and a PE teaching information resource database is set up. The database design uses Windows NT 4/2000 Server as the operating-system platform and Microsoft SQL Server 7.0 as the RDBMS, and adopts NAS technology for data storage and streaming technology for video services. Analysis of the system design and implementation shows that a dynamic PE teaching information resource sharing platform based on Web Services supports loosely coupled collaboration, realizes dynamic and active integration, and offers good integration, openness and encapsulation. The distance PE teaching platform based on Web Services and the design scheme of the PE teaching information resource database can effectively realize the interconnection, interworking and sharing of PE teaching resources and meet the demands of the ongoing informatization of PE teaching.
A psycholinguistic database for traditional Chinese character naming.
Chang, Ya-Ning; Hsu, Chun-Hsien; Tsai, Jie-Li; Chen, Chien-Liang; Lee, Chia-Ying
2016-03-01
In this study, we aimed to provide a large-scale set of psycholinguistic norms for 3,314 traditional Chinese characters, along with their naming reaction times (RTs), collected from 140 Chinese speakers. The lexical and semantic variables in the database include frequency, regularity, familiarity, consistency, number of strokes, homophone density, semantic ambiguity rating, phonetic combinability, semantic combinability, and the number of disyllabic compound words formed by a character. Multiple regression analyses were conducted to examine the predictive powers of these variables for the naming RTs. The results demonstrated that these variables could account for a significant portion of variance (55.8%) in the naming RTs. An additional multiple regression analysis was conducted to demonstrate the effects of consistency and character frequency. Overall, the regression results were consistent with the findings of previous studies on Chinese character naming. This database should be useful for research into Chinese language processing, Chinese education, or cross-linguistic comparisons. The database can be accessed via an online inquiry system (http://ball.ling.sinica.edu.tw/namingdatabase/index.html).
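A sketch of the kind of multiple regression reported above, fit on a synthetic data frame whose column names merely echo some of the listed predictors; the R² printed here is an artifact of the simulation, not the study's 55.8%:

```python
# Multiple regression of naming RTs on lexical variables, reporting R^2.
# Data are synthetic placeholders generated for illustration only.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n = 500
df = pd.DataFrame({
    "log_frequency": rng.normal(3, 1, n),
    "strokes": rng.integers(1, 30, n),
    "consistency": rng.uniform(0, 1, n),
})
# Synthetic RTs loosely driven by frequency and consistency
df["naming_rt"] = (
    700
    - 40 * df["log_frequency"]
    - 60 * df["consistency"]
    + 2 * df["strokes"]
    + rng.normal(0, 30, n)
)

X, y = df[["log_frequency", "strokes", "consistency"]], df["naming_rt"]
model = LinearRegression().fit(X, y)
print("R^2 =", round(model.score(X, y), 3))  # variance explained, cf. 55.8%
```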
NASA Astrophysics Data System (ADS)
Bauschlicher, Charles W., Jr.; Ricca, A.; Boersma, C.; Allamandola, L. J.
2018-02-01
Version 3.00 of the library of computed spectra in the NASA Ames PAH IR Spectroscopic Database (PAHdb) is described. Version 3.00 introduces the use of multiple scale factors, instead of the single scaling factor used previously, to align the theoretical harmonic frequencies with the experimental fundamentals. The use of multiple scale factors permits the use of a variety of basis sets; this allows new PAH species to be included in the database, such as those containing oxygen, and yields an improved treatment of strained species and those containing nitrogen. In addition, the computed spectra of 2439 new PAH species have been added. The impact of these changes on the analysis of an astronomical spectrum through database-fitting is considered and compared with a fit using Version 2.00 of the library of computed spectra. Finally, astronomical constraints are defined for the PAH spectral libraries in PAHdb.
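In general terms, database-fitting of an astronomical spectrum amounts to decomposing the observation into a non-negative combination of library spectra. A minimal sketch using non-negative least squares on random placeholder spectra; the actual PAHdb tools implement their own fitting and are not reproduced here:

```python
# Generic database fitting: observed spectrum ~ library @ weights, weights >= 0.
# Library and observation are random placeholders, not PAHdb spectra.
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(2)
library = rng.random((200, 50))  # 50 "computed spectra" on a common grid
true_w = np.zeros(50)
true_w[[3, 17]] = [0.7, 0.3]
observed = library @ true_w + rng.normal(0, 0.01, 200)

weights, residual = nnls(library, observed)  # non-negative weight per species
# Indices of species with appreciable weight (ideally 3 and 17)
print(np.flatnonzero(weights > 0.05), round(residual, 3))
```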
VerSeDa: vertebrate secretome database.
Cortazar, Ana R; Oguiza, José A; Aransay, Ana M; Lavín, José L
2017-01-01
With current tools, de novo prediction of the secretome (the full set of proteins secreted by an organism) is a time-consuming bioinformatic task that requires a multifactorial analysis in order to obtain reliable in silico predictions. Hence, to accelerate this process and offer researchers a reliable repository where secretome information can be obtained for vertebrates and model organisms, we have developed VerSeDa (Vertebrate Secretome Database). This freely available database stores information about proteins that are predicted to be secreted through the classical and non-classical mechanisms, for the wide range of vertebrate species deposited at the NCBI, UCSC and ENSEMBL sites. To our knowledge, VerSeDa is the only state-of-the-art database designed to store secretome data from multiple vertebrate genomes, thus saving a significant amount of time otherwise spent predicting protein features that can be retrieved from this repository directly. VerSeDa is freely available at http://genomics.cicbiogune.es/VerSeDa/index.php.
Correction to Storm, Tressoldi, and Di Risio (2010).
2015-03-01
Reports an error in "Meta-analysis of free-response studies, 1992-2008: Assessing the noise reduction model in parapsychology" by Lance Storm, Patrizio E. Tressoldi and Lorenzo Di Risio (Psychological Bulletin, 2010[Jul], Vol 136[4], 471-485). In the article, the sentence giving the formula in the second paragraph on p. 479 was stated incorrectly. The corrected sentence is included. (The following abstract of the original article appeared in record 2010-12718-001.) [Correction Notice: An erratum for this article was reported in Vol 136(5) of Psychological Bulletin (see record 2010-17510-009). In the article, the second-to-last sentence of the abstract (p. 471) was stated incorrectly. The sentence should read as follows: "The mean effect size value of the ganzfeld database was significantly higher than the mean effect size of the standard free-response database but was not higher than the effect size of the nonganzfeld noise reduction database."] We report the results of meta-analyses on 3 types of free-response study: (a) ganzfeld (a technique that enhances a communication anomaly referred to as "psi"); (b) nonganzfeld noise reduction using alleged psi-enhancing techniques such as dream psi, meditation, relaxation, or hypnosis; and (c) standard free response (nonganzfeld, no noise reduction). For the period 1997-2008, a homogeneous data set of 29 ganzfeld studies yielded a mean effect size of 0.142 (Stouffer Z = 5.48, p = 2.13 × 10⁻⁸). A homogeneous nonganzfeld noise reduction data set of 16 studies yielded a mean effect size of 0.110 (Stouffer Z = 3.35, p = 2.08 × 10⁻⁴), and a homogeneous data set of 14 standard free-response studies produced a weak negative mean effect size of -0.029 (Stouffer Z = -2.29, p = .989). The mean effect size value of the ganzfeld database was significantly higher than the mean effect sizes of the nonganzfeld noise reduction and the standard free-response databases. We also found that selected participants (believers in the paranormal, meditators, etc.) had a performance advantage over unselected participants, but only if they were in the ganzfeld condition.
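The Stouffer Z values quoted above combine per-study z-scores as Z = Σzᵢ/√k for k independent studies. A minimal sketch with invented z-scores:

```python
# Stouffer's method for combining k independent study z-scores.
# The z-scores below are invented for illustration.
import math

def stouffer_z(zs):
    return sum(zs) / math.sqrt(len(zs))

print(round(stouffer_z([1.2, 0.8, 2.1, 1.5]), 3))  # combined Z across 4 studies
```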
Computational Chemistry Comparison and Benchmark Database
National Institute of Standards and Technology Data Gateway
SRD 101 NIST Computational Chemistry Comparison and Benchmark Database (Web, free access) The NIST Computational Chemistry Comparison and Benchmark Database is a collection of experimental and ab initio thermochemical properties for a selected set of molecules. The goals are to provide a benchmark set of molecules for the evaluation of ab initio computational methods and allow the comparison between different ab initio computational methods for the prediction of thermochemical properties.
GIS Toolsets for Planetary Geomorphology and Landing-Site Analysis
NASA Astrophysics Data System (ADS)
Nass, Andrea; van Gasselt, Stephan
2015-04-01
Modern Geographic Information Systems (GIS) allow expert and lay users alike to load and position geographic data and to perform simple to highly complex surface analyses. For many applications, dedicated and ready-to-use GIS tools are available in standard software systems, while other applications require the modular combination of available basic tools to answer more specific questions. This also applies to analyses in modern planetary geomorphology, where many such basic tools can be combined into complex analysis tools, e.g. in image and terrain-model analysis. Apart from the simple application of sets of different tools, many complex tasks require a more sophisticated design for storing and accessing data using databases (e.g. ArcHydro for hydrological data analysis). In planetary sciences, complex database-driven models are often required to efficiently analyse potential landing sites or store rover data; geologic mapping data, too, can be stored and accessed more efficiently using database models rather than stand-alone shapefiles. For landing-site analyses, relief and surface-roughness estimates are two concepts of particular interest, and for both a number of different definitions coexist. Here we present an advanced toolset for the analysis of image and terrain-model data with an emphasis on the extraction of landing-site characteristics using established criteria. We provide working examples and particularly focus on the concepts of terrain roughness as interpreted in geomorphology and engineering studies.
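Among the coexisting roughness definitions mentioned above, one common choice is the local standard deviation of elevation in a moving window over a digital elevation model. A sketch with a synthetic grid; the window size is an arbitrary assumption:

```python
# Terrain roughness as the local standard deviation of elevation,
# computed with box filters over a synthetic DEM.
import numpy as np
from scipy.ndimage import uniform_filter

def local_std(dem: np.ndarray, size: int = 5) -> np.ndarray:
    mean = uniform_filter(dem, size)
    mean_sq = uniform_filter(dem * dem, size)
    return np.sqrt(np.maximum(mean_sq - mean * mean, 0.0))

dem = np.random.default_rng(3).normal(0, 1, (100, 100)).cumsum(axis=0)
roughness = local_std(dem, size=5)
print(roughness.mean(), roughness.max())
```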
DBMap: a TreeMap-based framework for data navigation and visualization of brain research registry
NASA Astrophysics Data System (ADS)
Zhang, Ming; Zhang, Hong; Tjandra, Donny; Wong, Stephen T. C.
2003-05-01
The purpose of this study is to investigate and apply a new, intuitive and space-conscious visualization framework to facilitate efficient data presentation and exploration of large-scale data warehouses. We have implemented the DBMap framework for the UCSF Brain Research Registry. Such a utility helps medical specialists and clinical researchers better explore and evaluate the many attributes organized in the brain research registry. The current UCSF Brain Research Registry consists of a federation of disease-oriented database modules, including Epilepsy, Brain Tumor, Intracerebral Hemorrhage, and CJD (Creutzfeldt-Jakob disease). These database modules organize large volumes of imaging and non-imaging data to support Web-based clinical research. While the data warehouse supports general information retrieval and analysis, it lacks an effective way to visualize and present the voluminous and complex data it stores. This study investigates whether the TreeMap algorithm can be adapted to display and navigate a categorical biomedical data warehouse or registry. TreeMap is a space-constrained graphical representation of large hierarchical data sets, mapped to a matrix of rectangles whose size and color represent database fields of interest. It allows the display of a large amount of numerical and categorical information within the limited real estate of a computer screen, with an intuitive user interface. The paper describes DBMap, the proposed data visualization framework for large biomedical databases. Built upon XML, Java and JDBC technologies, the prototype system includes a set of software modules that reside in the application-server tier and provide interfaces to the back-end database tier and the front-end Web tier of the brain registry.
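A minimal slice-and-dice TreeMap layout in the spirit described: split a rectangle into sub-rectangles proportional to each node's weight (nested levels would recurse with the cut direction flipped). Production TreeMaps typically use squarified layouts; the registry-like weights here are invented:

```python
# One level of a slice-and-dice treemap: rectangles sized by weight.
def treemap(items, x, y, w, h, vertical=True):
    """items: list of (label, weight). Returns (label, x, y, w, h) tuples."""
    total = sum(wt for _, wt in items)
    rects, offset = [], 0.0
    for label, wt in items:
        frac = wt / total
        if vertical:   # cut along x
            rects.append((label, x + offset * w, y, frac * w, h))
        else:          # cut along y
            rects.append((label, x, y + offset * h, w, frac * h))
        offset += frac
    return rects

modules = [("Epilepsy", 420), ("Brain Tumor", 310), ("ICH", 150), ("CJD", 40)]
for r in treemap(modules, 0, 0, 1, 1):
    print(r)
```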
Challenges in Identifying Refugees in National Health Data Sets.
Semere, Wagahta; Yun, Katherine; Ahalt, Cyrus; Williams, Brie; Wang, Emily A
2016-07-01
To evaluate publicly available data sets to determine their utility for studying refugee health. We searched for keywords describing refugees in data sets within the Society of General Internal Medicine Dataset Compendium and the Inter-University Consortium for Political and Social Research database. We included in our analysis US-based data sets with publicly available documentation and a self-defined, health-related focus that allowed for an examination of patient-level factors. Of the 68 data sets that met the study criteria, 37 (54%) registered keyword matches related to refugees, but only 2 uniquely identified refugees. Few health data sets identify refugee status among participants, presenting barriers to understanding refugees' health and health care needs. Information about refugee status in national health surveys should include expanded demographic questions and focus on mental health and chronic disease.
BioWarehouse: a bioinformatics database warehouse toolkit
Lee, Thomas J; Pouliot, Yannick; Wagner, Valerie; Gupta, Priyanka; Stringer-Calvert, David WJ; Tenenbaum, Jessica D; Karp, Peter D
2006-01-01
Background: This article addresses the problem of interoperation of heterogeneous bioinformatics databases. Results: We introduce BioWarehouse, an open source toolkit for constructing bioinformatics database warehouses using the MySQL and Oracle relational database managers. BioWarehouse integrates its component databases into a common representational framework within a single database management system, thus enabling multi-database queries using the Structured Query Language (SQL) but also facilitating a variety of database integration tasks such as comparative analysis and data mining. BioWarehouse currently supports the integration of a pathway-centric set of databases including ENZYME, KEGG, and BioCyc, and in addition the UniProt, GenBank, NCBI Taxonomy, and CMR databases, and the Gene Ontology. Loader tools, written in the C and Java languages, parse and load these databases into a relational database schema. The loaders also apply a degree of semantic normalization to their respective source data, decreasing semantic heterogeneity. The schema supports the following bioinformatics datatypes: chemical compounds, biochemical reactions, metabolic pathways, proteins, genes, nucleic acid sequences, features on protein and nucleic-acid sequences, organisms, organism taxonomies, and controlled vocabularies. As an application example, we applied BioWarehouse to determine the fraction of biochemically characterized enzyme activities for which no sequences exist in the public sequence databases. The answer is that no sequence exists for 36% of enzyme activities for which EC numbers have been assigned. These gaps in sequence data significantly limit the accuracy of genome annotation and metabolic pathway prediction, and are a barrier for metabolic engineering. Complex queries of this type provide examples of the value of the data warehousing approach to bioinformatics research. Conclusion: BioWarehouse embodies significant progress on the database integration problem for bioinformatics. PMID:16556315
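To illustrate why warehousing helps, the enzyme-coverage question above can be phrased as a single SQL join. The sketch below uses sqlite3 with hypothetical table and column names; the real BioWarehouse schema differs:

```python
# Cross-database question as one SQL query, on a toy in-memory schema.
# Table/column names are hypothetical, not BioWarehouse's actual schema.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE enzyme (ec TEXT PRIMARY KEY);
CREATE TABLE protein_seq (ec TEXT);
INSERT INTO enzyme VALUES ('1.1.1.1'), ('2.7.1.1'), ('4.2.1.11');
INSERT INTO protein_seq VALUES ('1.1.1.1'), ('2.7.1.1');
""")
# Fraction of EC-numbered activities with no sequence in any loaded database
row = con.execute("""
SELECT 1.0 * SUM(s.ec IS NULL) / COUNT(*)
FROM enzyme e LEFT JOIN protein_seq s ON s.ec = e.ec
""").fetchone()
print(row[0])  # -> 0.333..., cf. the 36% reported above
```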
Molecular signatures database (MSigDB) 3.0.
Liberzon, Arthur; Subramanian, Aravind; Pinchback, Reid; Thorvaldsdóttir, Helga; Tamayo, Pablo; Mesirov, Jill P
2011-06-15
Well-annotated gene sets representing the universe of the biological processes are critical for meaningful and insightful interpretation of large-scale genomic data. The Molecular Signatures Database (MSigDB) is one of the most widely used repositories of such sets. We report the availability of a new version of the database, MSigDB 3.0, with over 6700 gene sets, a complete revision of the collection of canonical pathways and experimental signatures from publications, enhanced annotations and upgrades to the web site. MSigDB is freely available for non-commercial use at http://www.broadinstitute.org/msigdb.
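MSigDB gene sets are distributed in the simple tab-separated GMT format: set name, description, then member genes, one set per line. A minimal reader, with a placeholder file name:

```python
# Parse a GMT file into {set_name: set_of_genes}. File name is a placeholder.
def read_gmt(path):
    sets = {}
    with open(path) as fh:
        for line in fh:
            name, _desc, *genes = line.rstrip("\n").split("\t")
            sets[name] = set(genes)
    return sets

gene_sets = read_gmt("msigdb.v3.0.symbols.gmt")
print(len(gene_sets), "gene sets loaded")
```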
BioBenchmark Toyama 2012: an evaluation of the performance of triple stores on biological data.
Wu, Hongyan; Fujiwara, Toyofumi; Yamamoto, Yasunori; Bolleman, Jerven; Yamaguchi, Atsuko
2014-01-01
Biological databases vary enormously in size and data complexity, from small databases that contain a few million Resource Description Framework (RDF) triples to large databases that contain billions of triples. In this paper, we evaluate whether RDF native stores can be used to meet the needs of a biological database provider. Prior evaluations have used synthetic data of limited database size. For example, the largest BSBM benchmark uses 1 billion synthetic e-commerce knowledge RDF triples on a single node. However, real-world biological data differs considerably from such simple synthetic data, and it is difficult to determine whether synthetic e-commerce data is representative enough of biological databases. Therefore, for this evaluation, we used five real data sets from biological databases. We evaluated five triple stores, 4store, Bigdata, Mulgara, Virtuoso, and OWLIM-SE, with five biological data sets, Cell Cycle Ontology, Allie, PDBj, UniProt, and DDBJ, ranging in size from approximately 10 million to 8 billion triples. For each database, we loaded all the data onto our single node and prepared the database for use in a classical data warehouse scenario. Then, we ran a series of SPARQL queries against each endpoint and recorded the execution time and the accuracy of the query response. Our paper shows that, with appropriate configuration, Virtuoso and OWLIM-SE can satisfy the basic requirements for loading and querying biological data of up to roughly 8 billion triples on a single node, under simultaneous access by 64 clients. OWLIM-SE performs best for databases with approximately 11 million triples; for data sets containing 94 million and 590 million triples, OWLIM-SE and Virtuoso perform best, neither showing an overwhelming advantage over the other; for data sets over 4 billion triples, Virtuoso works best. 4store performs well on small data sets of fewer than 100 million triples, albeit with limited features, and our tests show that its scalability is poor; Bigdata demonstrates average performance and is a good open-source triple store for mid-sized data sets (around 500 million triples); Mulgara showed some fragility. PMID:25089180
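The benchmark's workload boils down to timing SPARQL queries against an endpoint. A sketch using the SPARQLWrapper package, assuming UniProt's public SPARQL service is reachable; the query is a trivial placeholder, not one of the benchmark queries:

```python
# Time a SPARQL query against a public endpoint and count returned rows.
import time
from SPARQLWrapper import SPARQLWrapper, JSON

endpoint = SPARQLWrapper("https://sparql.uniprot.org/sparql")
endpoint.setQuery("SELECT ?s WHERE { ?s ?p ?o } LIMIT 10")
endpoint.setReturnFormat(JSON)

start = time.perf_counter()
results = endpoint.query().convert()
elapsed = time.perf_counter() - start
print(len(results["results"]["bindings"]), "rows in", round(elapsed, 3), "s")
```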
Comparative Bacterial Proteomics: Analysis of the Core Genome Concept
Callister, Stephen J.; McCue, Lee Ann; Turse, Joshua E.; Monroe, Matthew E.; Auberry, Kenneth J.; Smith, Richard D.; Adkins, Joshua N.; Lipton, Mary S.
2008-01-01
While comparative bacterial genomic studies commonly predict a set of genes indicative of common ancestry, experimental validation of the existence of this core genome requires extensive measurement and is typically not undertaken. Enabled by an extensive proteome database developed over six years, we have experimentally verified the expression of proteins predicted from genomic ortholog comparisons among 17 environmental and pathogenic bacteria. More exclusive relationships were observed among the expressed protein content of phenotypically related bacteria, which is indicative of the specific lifestyles associated with these organisms. Although genomic studies can establish relative orthologous relationships among a set of bacteria and propose a set of ancestral genes, our proteomics study establishes expressed lifestyle differences among conserved genes and proposes a set of expressed ancestral traits. PMID:18253490
MetPetDB: A database for metamorphic geochemistry
NASA Astrophysics Data System (ADS)
Spear, Frank S.; Hallett, Benjamin; Pyle, Joseph M.; Adalı, Sibel; Szymanski, Boleslaw K.; Waters, Anthony; Linder, Zak; Pearce, Shawn O.; Fyffe, Matthew; Goldfarb, Dennis; Glickenhouse, Nickolas; Buletti, Heather
2009-12-01
We present a data model for the initial implementation of MetPetDB, a geochemical database specific to metamorphic rock samples. The database is designed around the concept of preservation of spatial relationships, at all scales, of chemical analyses and their textural setting. Objects in the database (samples) represent physical rock samples; each sample may contain one or more subsamples with associated geochemical and image data. Samples, subsamples, geochemical data, and images are described with attributes (some required, some optional); these attributes also serve as search delimiters. All data in the database are classified as published (i.e., archived or published data), public or private. Public and published data may be freely searched and downloaded. All private data is owned; permission to view, edit, download and otherwise manipulate private data may be granted only by the data owner; all such editing operations are recorded by the database to create a data version log. The sharing of data permissions among a group of collaborators researching a common sample is done by the sample owner through the project manager. User interaction with MetPetDB is hosted by a web-based platform based upon the Java servlet application programming interface, with the PostgreSQL relational database. The database web portal includes modules that allow the user to interact with the database: registered users may save and download public and published data, upload private data, create projects, and assign permission levels to project collaborators. An Image Viewer module provides for spatial integration of image and geochemical data. A toolkit consisting of plotting and geochemical calculation software for data analysis and a mobile application for viewing the public and published data is being developed. Future issues to address include population of the database, integration with other geochemical databases, development of the analysis toolkit, creation of data models for derivative data, and building a community-wide user base. It is believed that this and other geochemical databases will enable more productive collaborations, generate more efficient research efforts, and foster new developments in basic research in the field of solid earth geochemistry.
CARINA alkalinity data in the Atlantic Ocean
NASA Astrophysics Data System (ADS)
Velo, A.; Perez, F. F.; Brown, P.; Tanhua, T.; Schuster, U.; Key, R. M.
2009-08-01
Data on carbon and carbon-relevant hydrographic and hydrochemical parameters from previously non-publicly available cruise data sets in the Arctic, Atlantic and Southern Ocean have been retrieved and merged into a new database: CARINA (CARbon IN the Atlantic). These data have gone through rigorous quality control (QC) procedures to assure the highest possible quality and consistency. The data for most of the measured parameters in the CARINA database were objectively examined in order to quantify systematic differences in the reported values, i.e. secondary quality control. Systematic biases found in the data have been corrected in the data products, i.e. three merged data files with measured, calculated and interpolated data for each of the three CARINA regions: Arctic, Atlantic and Southern Ocean. Out of a total of 188 cruise entries in the CARINA database, 98 were conducted in the Atlantic Ocean and of these, 75 cruises report alkalinity values. Here we present details of the secondary QC on alkalinity for the Atlantic Ocean part of CARINA. Procedures of quality control, including crossover analysis between cruises and inversion analysis of all crossover data, are briefly described. Adjustments were applied to the alkalinity values for 16 of the cruises in the Atlantic Ocean region. With these adjustments the CARINA database is consistent both internally and with GLODAP data, an oceanographic data set based on the World Hydrographic Program in the 1990s. Based on our analysis we estimate the internal accuracy of the CARINA-ATL alkalinity data to be 3.3 μmol kg⁻¹. The CARINA data are now suitable for accurate assessments of, for example, oceanic carbon inventories and uptake rates, and for model validation.
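The gist of the crossover analysis mentioned above: compare deep-water measurements from two cruises at (near-)coincident stations and estimate an offset that becomes a candidate adjustment. A simplified additive sketch with invented values:

```python
# Toy crossover: additive offset between two cruises' deep-water alkalinity
# at matched locations. Values are invented placeholders.
import numpy as np

cruise_a = np.array([2301.0, 2310.5, 2295.2])  # umol/kg, deep samples
cruise_b = np.array([2304.1, 2313.9, 2298.8])  # same locations, other cruise

diff = cruise_b - cruise_a
offset, sd = diff.mean(), diff.std(ddof=1)
print(f"offset = {offset:.1f} +/- {sd:.1f} umol/kg")  # candidate adjustment
```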
CARINA: nutrient data in the Atlantic Ocean
NASA Astrophysics Data System (ADS)
Tanhua, T.; Brown, P. J.; Key, R. M.
2009-11-01
Data on carbon and carbon-relevant hydrographic and hydrochemical parameters from previously non-publicly available cruise data sets in the Arctic, Atlantic and Southern Ocean have been retrieved and merged into a new database: CARINA (CARbon IN the Atlantic). These data have gone through rigorous quality control (QC) procedures to assure the highest possible quality and consistency. The data for most of the measured parameters in the CARINA database were objectively examined in order to quantify systematic differences in the reported values, i.e. secondary quality control. Systematic biases found in the data have been corrected in the data products, i.e. three merged data files with measured, calculated and interpolated data for each of the three CARINA regions: Arctic Mediterranean Seas, Atlantic and Southern Ocean. Out of a total of 188 cruise entries in the CARINA database, 98 were conducted in the Atlantic Ocean and of these, 84 cruises report nitrate values, 79 silicate, and 78 phosphate. Here we present details of the secondary QC for nutrients for the Atlantic Ocean part of CARINA. Procedures of quality control, including crossover analysis between cruises and inversion analysis of all crossover data, are briefly described. Adjustments were applied to the nutrient values for 43 of the cruises in the Atlantic Ocean region. With these adjustments the CARINA database is consistent both internally and with GLODAP data, an oceanographic data set based on the World Hydrographic Program in the 1990s (Key et al., 2004). Based on our analysis we estimate the internal accuracy of the CARINA-ATL nutrient data to be: nitrate 1.5%; phosphate 2.6%; silicate 3.1%. The CARINA data are now suitable for accurate assessments of, for example, oceanic carbon inventories and uptake rates, and for model validation.
An experimental investigation of masking in the US FDA adverse event reporting system database.
Wang, Hsin-wei; Hochberg, Alan M; Pearson, Ronald K; Hauben, Manfred
2010-12-01
A phenomenon of 'masking' or 'cloaking' in pharmacovigilance data mining has been described, which can potentially cause signals of disproportionate reporting (SDRs) to be missed, particularly in pharmaceutical company databases. Masking has been predicted theoretically, observed anecdotally or studied to a limited extent in both pharmaceutical company and health authority databases, but no previous publication systematically assesses its occurrence in a large health authority database. To explore the nature, extent and possible consequences of masking in the US FDA Adverse Event Reporting System (AERS) database by applying various experimental unmasking protocols to a set of drugs and events representing realistic pharmacovigilance analysis conditions. This study employed AERS data from 2001 through 2005. For a set of 63 Medical Dictionary for Regulatory Activities (MedDRA®) Preferred Terms (PTs), disproportionality analysis was carried out with respect to all drugs included in the AERS database, using a previously described urn-model-based algorithm. We specifically sought masking in which drug removal induced an increase in the statistical representation of a drug-event combination (DEC) that resulted in the emergence of a new SDR. We performed a series of unmasking experiments selecting drugs for removal using rational statistical decision rules based on the requirement of a reporting ratio (RR) >1, top-ranked statistical unexpectedness (SU) and relatedness as reflected in the WHO Anatomical Therapeutic Chemical level 4 (ATC4) grouping. In order to assess the possible extent of residual masking we performed two supplemental purely empirical analyses on a limited subset of data. This entailed testing every drug and drug group to determine which was most influential in uncovering masked SDRs. We assessed the strength of external evidence for a causal association for a small number of masked SDRs involving a subset of 29 drugs for which level of evidence adjudication was available from a previous study. The original disproportionality analysis identified 8719 SDRs for the 63 PTs. The SU-based unmasking protocols generated variable numbers of masked SDRs ranging from 38 to 156, representing a 0.43-1.8% increase over the number of baseline SDRs. A significant number of baseline SDRs were also lost in the course of our experiments. The trend in the number of gained SDRs per report removed was inversely related to the number of lost SDRs per protocol. Both the number and nature of the reports removed influenced the number of gained SDRs observed. The purely empirical protocols unmasked up to ten times as many SDRs. None of the masked SDRs had strong external evidence supporting a causal association. Most involved associations for which there was no external supporting evidence or were in the original product label. For two masked SDRs, there was external evidence of a possible causal association. We documented masking in the FDA AERS database. Attempts at unmasking SDRs using practically implementable protocols produced only small changes in the output of SDRs in our analysis. 
This is undoubtedly related to the large size and diversity of the database, but the complex interdependencies between drugs and events in authentic spontaneous reporting system (SRS) databases, and the impact of the measures of statistical variability typically used in real-world disproportionality analysis, may be additional factors that constrain the discovery of masked SDRs and that may also operate in pharmaceutical company databases. Empirical determination of the most influential drugs may uncover significantly more SDRs than protocols based on predetermined statistical selection rules, but it is impractical except possibly for evaluating specific events. Routine global exercises to elicit masking, especially in large health authority databases, are not justified based on the results available to date. Exercises to elicit unmasking should instead be driven by prior knowledge or obvious data imbalances.
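The core disproportionality quantity can be illustrated with a toy reporting-ratio calculation, RR = observed / expected, where "expected" assumes independence of drug and event in the database. Note the study's actual algorithm is urn-model based, so this simplified ratio and all counts below are purely illustrative:

```python
# Toy masking demonstration: removing a dominant reporter of an event
# lowers the event's background rate, which can unmask an SDR elsewhere.
def reporting_ratio(n_de, n_d, n_e, n_total):
    expected = n_d * n_e / n_total
    return n_de / expected

# Full database: 40 reports of the DEC; drug on 5,000 reports, event on 20,000
print(round(reporting_ratio(40, 5_000, 20_000, 2_000_000), 2))  # 0.8: no SDR
# After removing 15,000 event reports attributed to one masking drug:
print(round(reporting_ratio(40, 5_000, 5_000, 1_985_000), 2))   # 3.18: SDR emerges
```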
National Transportation Atlas Databases : 2002
DOT National Transportation Integrated Search
2002-01-01
The National Transportation Atlas Databases 2002 (NTAD2002) is a set of nationwide geographic databases of transportation facilities, transportation networks, and associated infrastructure. These datasets include spatial information for transportatio...
National Transportation Atlas Databases : 2010
DOT National Transportation Integrated Search
2010-01-01
The National Transportation Atlas Databases 2010 (NTAD2010) is a set of nationwide geographic databases of transportation facilities, transportation networks, and associated infrastructure. These datasets include spatial information for transportatio...
National Transportation Atlas Databases : 2006
DOT National Transportation Integrated Search
2006-01-01
The National Transportation Atlas Databases 2006 (NTAD2006) is a set of nationwide geographic databases of transportation facilities, transportation networks, and associated infrastructure. These datasets include spatial information for transportatio...
National Transportation Atlas Databases : 2005
DOT National Transportation Integrated Search
2005-01-01
The National Transportation Atlas Databases 2005 (NTAD2005) is a set of nationwide geographic databases of transportation facilities, transportation networks, and associated infrastructure. These datasets include spatial information for transportatio...
National Transportation Atlas Databases : 2008
DOT National Transportation Integrated Search
2008-01-01
The National Transportation Atlas Databases 2008 (NTAD2008) is a set of nationwide geographic databases of transportation facilities, transportation networks, and associated infrastructure. These datasets include spatial information for transportatio...
National Transportation Atlas Databases : 2003
DOT National Transportation Integrated Search
2003-01-01
The National Transportation Atlas Databases 2003 (NTAD2003) is a set of nationwide geographic databases of transportation facilities, transportation networks, and associated infrastructure. These datasets include spatial information for transportatio...
National Transportation Atlas Databases : 2004
DOT National Transportation Integrated Search
2004-01-01
The National Transportation Atlas Databases 2004 (NTAD2004) is a set of nationwide geographic databases of transportation facilities, transportation networks, and associated infrastructure. These datasets include spatial information for transportatio...
National Transportation Atlas Databases : 2009
DOT National Transportation Integrated Search
2009-01-01
The National Transportation Atlas Databases 2009 (NTAD2009) is a set of nationwide geographic databases of transportation facilities, transportation networks, and associated infrastructure. These datasets include spatial information for transportatio...
National Transportation Atlas Databases : 2007
DOT National Transportation Integrated Search
2007-01-01
The National Transportation Atlas Databases 2007 (NTAD2007) is a set of nationwide geographic databases of transportation facilities, transportation networks, and associated infrastructure. These datasets include spatial information for transportatio...
National Transportation Atlas Databases : 2012
DOT National Transportation Integrated Search
2012-01-01
The National Transportation Atlas Databases 2012 (NTAD2012) is a set of nationwide geographic databases of transportation facilities, transportation networks, and associated infrastructure. These datasets include spatial information for transportatio...
National Transportation Atlas Databases : 2011
DOT National Transportation Integrated Search
2011-01-01
The National Transportation Atlas Databases 2011 (NTAD2011) is a set of nationwide geographic databases of transportation facilities, transportation networks, and associated infrastructure. These datasets include spatial information for transportatio...
The HISTMAG database: combining historical, archaeomagnetic and volcanic data
NASA Astrophysics Data System (ADS)
Arneitz, Patrick; Leonhardt, Roman; Schnepp, Elisabeth; Heilig, Balázs; Mayrhofer, Franziska; Kovacs, Peter; Hejda, Pavel; Valach, Fridrich; Vadasz, Gergely; Hammerl, Christa; Egli, Ramon; Fabian, Karl; Kompein, Niko
2017-09-01
Records of the past geomagnetic field fall into two main categories: instrumental historical observations on the one hand, and field estimates based on the magnetization acquired by rocks, sediments and archaeological artefacts on the other. In this paper, a new database combining historical, archaeomagnetic and volcanic records is presented. HISTMAG is a relational database, implemented in MySQL, and can be accessed via a web-based interface (http://www.conrad-observatory.at/zamg/index.php/data-en/histmag-database). It combines the available global historical data compilations covering the last ∼500 yr with archaeomagnetic and volcanic data collections from the last 50 000 yr. Furthermore, new historical and archaeomagnetic records, mainly from central Europe, have been acquired. In total, 190 427 records are currently available in the HISTMAG database, the majority of which (155 525) are historical declination measurements. The original database structure was complemented by new fields that allow a detailed description of the different data types, and a user-comment function enables scientific discussion of individual records. The HISTMAG database therefore supports thorough reliability and uncertainty assessments of these widely differing data sets, which are an essential basis for geomagnetic field reconstructions. A database analysis comparing declination records derived from compass roses on historical geographical maps with other historical records revealed a systematic offset, whereas maps created for mining activities proved to be a reliable source.
NASA Astrophysics Data System (ADS)
Moncoulon, D.; Labat, D.; Ardon, J.; Onfroy, T.; Leblois, E.; Poulard, C.; Aji, S.; Rémy, A.; Quantin, A.
2013-07-01
The analysis of flood exposure at a national scale for the French insurance market must combine the generation of a probabilistic event set of all possible but not yet occurred flood situations with hazard and damage modeling. In this study, hazard and damage models are calibrated on a 1995-2012 historical event set, both for hazard results (river flow, flooded areas) and loss estimations. Thus, uncertainties in the deterministic estimation of a single event loss are known before simulating a probabilistic event set. To take into account at least 90% of the insured flood losses, the probabilistic event set must combine river overflow (small and large catchments) with surface runoff due to heavy rainfall on the slopes of the watershed. Indeed, internal studies of the CCR claims database have shown that approximately 45% of insured flood losses are located inside floodplains and 45% outside; the remaining 10% are due to sea-surge floods and groundwater rise. In this approach, two independent probabilistic methods are combined to create a single flood loss distribution: generation of fictive river flows based on the historical records of the river gauge network, and generation of fictive rain fields on small catchments calibrated on the 1958-2010 Météo-France rain database SAFRAN. All the events in the probabilistic event sets are simulated with the deterministic model. This hazard and damage distribution is used to simulate flood losses at the national scale for an insurance company (MACIF) and to generate flood areas associated with hazard return periods. The flood maps cover river overflow and surface water runoff. Validation of these maps is conducted by comparison with address-located claim data on a small catchment (downstream Argens).
ACLAME: a CLAssification of Mobile genetic Elements, update 2010.
Leplae, Raphaël; Lima-Mendez, Gipsi; Toussaint, Ariane
2010-01-01
The ACLAME database is dedicated to the collection, analysis and classification of sequenced mobile genetic elements (MGEs), in particular phages and plasmids. In addition to providing information on MGE content, classifications are available at various levels of organization. At the gene/protein level, families group similar sequences that are expected to share the same function. Families of four or more proteins are manually assigned a functional annotation using the Gene Ontology and MeGO, a locally developed ontology dedicated to MGEs. At the genome level, evolutionarily cohesive modules group sets of protein families shared among MGEs. At the population level, networks display the reticulate evolutionary relationships among MGEs. To increase the coverage of the phage sequence space, ACLAME version 0.4 incorporates 760 high-quality predicted prophages selected from the Prophinder database. Most of the data can be downloaded from the freely accessible ACLAME web site (http://aclame.ulb.ac.be). The BLAST interface for querying the database has been extended and numerous tools for in-depth analysis of the results have been added.
Pattern database applications from design to manufacturing
NASA Astrophysics Data System (ADS)
Zhuang, Linda; Zhu, Annie; Zhang, Yifan; Sweis, Jason; Lai, Ya-Chieh
2017-03-01
Pattern-based approaches are becoming more common and popular as the industry moves to advanced technology nodes. At the beginning of a new technology node, a library of process weak-point patterns for physical and electrical verification is built up and used to prevent known hotspots from recurring in new designs. The pattern set is then expanded to create test keys for process development, in order to verify manufacturing capability and to pre-check new tape-out designs for potential yield detractors. As the database grows, the adoption of pattern-based approaches has expanded from design flows to technology development and on to mass-production use. This paper presents the complete downstream working flows of a design pattern database (PDB). This pattern-based data analysis flow covers applications across different functional teams: generating enhancement kits to improve design manufacturability, populating new test design data based on previous learning, generating analysis data to improve mass-production efficiency, and in-line control of manufacturing equipment to check machine-status consistency across different fab sites.
Newt-omics: a comprehensive repository for omics data from the newt Notophthalmus viridescens
Bruckskotten, Marc; Looso, Mario; Reinhardt, Richard; Braun, Thomas; Borchardt, Thilo
2012-01-01
Notophthalmus viridescens, a member of the salamander family, is an excellent model organism for studying regenerative processes due to its unique ability to replace lost appendages and to repair internal organs. Molecular insights into regenerative events have been severely hampered by the lack of genomic, transcriptomic and proteomic data, as well as an appropriate database to store such novel information. Here, we describe ‘Newt-omics’ (http://newt-omics.mpi-bn.mpg.de), a database which enables researchers to locate, retrieve and store data sets dedicated to the molecular characterization of newts. Newt-omics is a transcript-centred database, based on an Expressed Sequence Tag (EST) data set from the newt covering ∼50 000 Sanger-sequenced transcripts and a set of high-density microarray data generated from regenerating hearts. Newt-omics also contains a large set of peptides identified by mass spectrometry, which was used to validate 13 810 ESTs as true protein coding. Newt-omics is open to the implementation of additional high-throughput data sets without changing the database structure. Via a user-friendly interface, Newt-omics allows access to a huge set of molecular data without the need for prior bioinformatics expertise. PMID:22039101
The MPI Emotional Body Expressions Database for Narrative Scenarios
Volkova, Ekaterina; de la Rosa, Stephan; Bülthoff, Heinrich H.; Mohler, Betty
2014-01-01
Emotion expression in human-human interaction takes place via various types of information, including body motion. Research on the perceptual-cognitive mechanisms underlying the processing of natural emotional body language can benefit greatly from datasets of natural emotional body expressions that facilitate stimulus manipulation and analysis. The existing databases have so far focused on few emotion categories which display predominantly prototypical, exaggerated emotion expressions. Moreover, many of these databases consist of video recordings which limit the ability to manipulate and analyse the physical properties of these stimuli. We present a new database consisting of a large set (over 1400) of natural emotional body expressions typical of monologues. To achieve close-to-natural emotional body expressions, amateur actors were narrating coherent stories while their body movements were recorded with motion capture technology. The resulting 3-dimensional motion data recorded at a high frame rate (120 frames per second) provides fine-grained information about body movements and allows the manipulation of movement on a body joint basis. For each expression it gives the positions and orientations in space of 23 body joints for every frame. We report the results of physical motion properties analysis and of an emotion categorisation study. The reactions of observers from the emotion categorisation study are included in the database. Moreover, we recorded the intended emotion expression for each motion sequence from the actor to allow for investigations regarding the link between intended and perceived emotions. The motion sequences along with the accompanying information are made available in a searchable MPI Emotional Body Expression Database. We hope that this database will enable researchers to study expression and perception of naturally occurring emotional body expressions in greater depth. PMID:25461382
Active browsing using similarity pyramids
NASA Astrophysics Data System (ADS)
Chen, Jau-Yuen; Bouman, Charles A.; Dalton, John C.
1998-12-01
In this paper, we describe a new approach to managing large image databases, which we call active browsing. Active browsing integrates relevance feedback into the browsing environment, so that users can modify the database's organization to suit the desired task. Our method is based on a similarity pyramid data structure, which hierarchically organizes the database, so that it can be efficiently browsed. At coarse levels, the similarity pyramid allows users to view the database as large clusters of similar images. Alternatively, users can 'zoom into' finer levels to view individual images. We discuss relevance feedback for the browsing process, and argue that it is fundamentally different from relevance feedback for more traditional search-by-query tasks. We propose two fundamental operations for active browsing: pruning and reorganization. Both of these operations depend on a user-defined relevance set, which represents the image or set of images desired by the user. We present statistical methods for accurately pruning the database, and we propose a new 'worm hole' distance metric for reorganizing the database, so that members of the relevance set are grouped together.
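To make the similarity-pyramid idea concrete, here is a minimal sketch (not the authors' implementation): hierarchical clustering of image feature vectors yields a tree whose cuts at different depths give the coarse and fine levels of the pyramid. Feature extraction and the browsing front end are assumed and out of scope.

```python
# Toy similarity pyramid via agglomerative clustering: coarse levels are
# large clusters of similar images, fine levels approach individual images.
# The random 64-dim feature vectors stand in for real image descriptors.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
features = rng.random((200, 64))              # 200 images x 64-dim features

tree = linkage(features, method="average")    # hierarchical organization

def pyramid_level(n_clusters):
    """Cluster labels for one pyramid level (small n_clusters = coarse view)."""
    return fcluster(tree, t=n_clusters, criterion="maxclust")

coarse = pyramid_level(4)    # top of the pyramid: 4 large clusters
fine = pyramid_level(50)     # 'zoomed-in' view: 50 small clusters
```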
ERIC Educational Resources Information Center
Ingram, Jenny; Salmon, Debra
2010-01-01
Objective: To describe patterns of and reasons for attendance, and young people's views of the drop-in service. Design: Analysis of a prospective database, questionnaire survey, and qualitative interviews and discussions. Setting: Sexual health drop-in clinics in 16 secondary schools (including three pupil-referral units) in deprived areas of a city in…
Interactive access to forest inventory data for the South Central United States
William H. McWilliams
1990-01-01
On-line access to USDA, Forest Service successive forest inventory data for the South Central United States is provided by two computer systems. The Easy Access to Forest Inventory and Analysis Tables program (EZTAB) produces a set of tables for specific geographic areas. The Interactive Graphics and Retrieval System (INGRES) is a database management system that...
Predicting Host Level Reachability via Static Analysis of Routing Protocol Configuration
2007-09-01
Naturally Occurring Human Urinary Peptides for Use in Diagnosis of Chronic Kidney Disease*
Good, David M.; Zürbig, Petra; Argilés, Àngel; Bauer, Hartwig W.; Behrens, Georg; Coon, Joshua J.; Dakna, Mohammed; Decramer, Stéphane; Delles, Christian; Dominiczak, Anna F.; Ehrich, Jochen H. H.; Eitner, Frank; Fliser, Danilo; Frommberger, Moritz; Ganser, Arnold; Girolami, Mark A.; Golovko, Igor; Gwinner, Wilfried; Haubitz, Marion; Herget-Rosenthal, Stefan; Jankowski, Joachim; Jahn, Holger; Jerums, George; Julian, Bruce A.; Kellmann, Markus; Kliem, Volker; Kolch, Walter; Krolewski, Andrzej S.; Luppi, Mario; Massy, Ziad; Melter, Michael; Neusüss, Christian; Novak, Jan; Peter, Karlheinz; Rossing, Kasper; Rupprecht, Harald; Schanstra, Joost P.; Schiffer, Eric; Stolzenburg, Jens-Uwe; Tarnow, Lise; Theodorescu, Dan; Thongboonkerd, Visith; Vanholder, Raymond; Weissinger, Eva M.; Mischak, Harald; Schmitt-Kopplin, Philippe
2010-01-01
Because of its availability, ease of collection, and correlation with physiology and pathology, urine is an attractive source for clinical proteomics/peptidomics. However, the lack of comparable data sets from large cohorts has greatly hindered the development of clinical proteomics. Here, we report the establishment of a reproducible, high resolution method for peptidome analysis of naturally occurring human urinary peptides and proteins, ranging from 800 to 17,000 Da, using samples from 3,600 individuals analyzed by capillary electrophoresis coupled to MS. All processed data were deposited in a Structured Query Language (SQL) database. This database currently contains 5,010 relevant unique urinary peptides that serve as a pool of potential classifiers for diagnosis and monitoring of various diseases. As an example, by using this source of information, we were able to define urinary peptide biomarkers for chronic kidney diseases, allowing diagnosis of these diseases with high accuracy. Application of the chronic kidney disease-specific biomarker set to an independent test cohort in the subsequent replication phase resulted in 85.5% sensitivity and 100% specificity. These results indicate the potential usefulness of capillary electrophoresis coupled to MS for clinical applications in the analysis of naturally occurring urinary peptides. PMID:20616184
Staff nurse clinical leadership: a concept analysis.
Chávez, Eduardo C; Yoder, Linda H
2015-01-01
The purpose of this article is to provide a concept analysis of staff nurse clinical leadership (SNCL). A clear delineation of SNCL will promote understanding and encourage communication of the phenomenon. Clarification of the concept will establish a common understanding of it, and advance the practice, education, and research of this phenomenon. A review of the literature was conducted using several databases. The databases were searched using the following keywords: clinical leadership, nursing, bedside, staff nurse, front-line, front line, and leadership. The search yielded several sources; however, only those that focused on clinical leadership demonstrated by staff nurses in acute care hospital settings were selected for review. SNCL is defined as staff nurses who exert significant influence over other individuals in the healthcare team and who, although no formal authority has been vested in them, facilitate individual and collective efforts to accomplish shared clinical objectives. The theoretical definition of SNCL within the team context will provide a common understanding of this concept and differentiate it from other types of leadership in the nursing profession. This clarification and conceptualization of the concept will assist further research of the concept and advance its practical application in acute care hospital settings. © 2014 Wiley Periodicals, Inc.
searchSCF: Using MongoDB to Enable Richer Searches of Locally Hosted Science Data Repositories
NASA Astrophysics Data System (ADS)
Knosp, B.
2016-12-01
Science teams today are in the unusual position of almost having too much data available to them. Modern sensors and models are capable of outputting terabytes of data per day, which can make it difficult to find specific subsets of data. The sheer size of files can also make it time-consuming to retrieve this big data from national data archive centers. Thus, many science teams choose to store what data they can on their local systems, but they are not always equipped with tools to help them intelligently organize and search their data. In its local data repository, the Aura Microwave Limb Sounder (MLS) science team at NASA's Jet Propulsion Laboratory has collected over 300TB of atmospheric science data from 71 missions/models that aid in validation, algorithm development, and research activities. When the project began, the team developed a MySQL database to aid in data queries, but this database was only designed to keep track of MLS and a few ancillary data sets, leaving much of the data uncatalogued. The team has also seen database query time rise over the life of the mission. Even though the MLS science team's data holdings are not the size of a national data center's, team members still need tools to help them discover and utilize the data that they have on-hand. Over the past year, members of the science team have been looking for solutions to (1) store information on all the data sets they have collected in a single database, (2) store more metadata about each data file, (3) develop queries that can find relationships among these disparate data types, and (4) plug any new functions developed around this database into existing analysis, visualization, and web tools, transparently to users. In this presentation, I will discuss the searchSCF package that is currently under development. This package includes a NoSQL database management system (MongoDB) and a set of Python tools that both ingest data into the database and support user queries. I will also highlight case studies of how this system could be used by the MLS science team, and how it could be implemented by other science teams with local data repositories.
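As a rough illustration of the approach described (a MongoDB catalogue of local files plus Python ingestion and query tools), here is a minimal sketch using pymongo; the database, collection, and field names are invented, not the actual searchSCF schema.

```python
# Minimal sketch of a MongoDB catalogue of local data files, in the spirit
# of searchSCF; names are illustrative assumptions.
from datetime import datetime
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
catalog = client["local_repo"]["files"]

def ingest(path, mission, product, date, extra=None):
    """Insert one file record; per-product metadata goes in a free-form subdocument."""
    catalog.insert_one({"path": path, "mission": mission, "product": product,
                        "date": date, "meta": extra or {}})

def find_products(keyword, start, end):
    """Search across heterogeneous data sets by product keyword and date range."""
    return list(catalog.find({"product": {"$regex": keyword, "$options": "i"},
                              "date": {"$gte": start, "$lte": end}}))

ingest("/data/mls/temp_v5.he5", "Aura MLS", "Temperature", datetime(2016, 7, 1))
print(find_products("temp", datetime(2016, 1, 1), datetime(2016, 12, 31)))
```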
SWS: accessing SRS sites contents through Web Services.
Romano, Paolo; Marra, Domenico
2008-03-26
Web Services and Workflow Management Systems can support the creation and deployment of network systems able to automate data analysis and retrieval processes in biomedical research. Web Services have been implemented at bioinformatics centres and workflow systems have been proposed for biological data analysis. New databanks are often developed by taking these technologies into account, but many existing databases do not allow programmatic access. Only a fraction of available databanks can thus be queried through programmatic interfaces. SRS is a well-known indexing and search engine for biomedical databanks offering public access to many databanks and analysis tools. Unfortunately, these data are not easily and efficiently accessible through Web Services. We have developed 'SRS by WS' (SWS), a tool that makes information available in SRS sites accessible through Web Services. Information on known sites is maintained in a database, srsdb. SWS consists of a suite of Web Services that can query both srsdb, for information on sites and databases, and SRS sites. SWS returns results in a text-only format and can be accessed through a WSDL-compliant client. SWS enables interoperability between workflow systems and SRS implementations by also managing access to alternative sites, in order to cope with network and maintenance problems, and selecting the most up-to-date among available systems. The development and implementation of Web Services allowing programmatic access to an exhaustive set of biomedical databases can significantly improve the automation of in-silico analysis. SWS supports this activity by making biological databanks that are managed in public SRS sites available through a programmatic interface.
A practical implementation for a data dictionary in an environment of diverse data sets
Sprenger, Karla K.; Larsen, Dana M.
1993-01-01
The need for a data dictionary database at the U.S. Geological Survey's EROS Data Center (EDC) was reinforced with the Earth Observing System Data and Information System (EOSDIS) requirement for consistent field definitions of data sets residing at more than one archive center. The EDC requirement addresses the existence of multiple sets with identical field definitions using various naming conventions. The EDC is developing a data dictionary database to accomplish the following goals: to standardize field names for ease in software development; to facilitate querying and updating of the data; and to generate ad hoc reports. The structure of the EDC electronic data dictionary database supports different metadata systems as well as many different data sets. A series of reports is used to keep consistency among data sets and various metadata systems.
Human Chromosome Y and Haplogroups; introducing YDHS Database.
Tiirikka, Timo; Moilanen, Jukka S
2015-12-01
As high-throughput sequencing efforts generate more biological information, scientists from different disciplines are interpreting the polymorphisms that make us unique. In addition, there is an increasing trend among the general public to research their own genealogy, find distant relatives and learn more about their biological background. Commercial vendors provide analyses of mitochondrial and Y-chromosomal markers for such purposes. Clearly, an easy-to-use, free interface to the existing data on identified variants would be in the interest of both the general public and professionals less familiar with the field. Here we introduce a novel metadatabase, YDHS, that aims to provide such an interface for Y-chromosomal DNA (Y-DNA) haplogroups and sequence variants. The database uses the ISOGG Y-DNA tree as the source of mutations and haplogroups, and by using the genomic positions of the mutations it links them to genes and other biological entities. YDHS contains analysis tools for deeper Y-SNP analysis, addressing the shortage of Y-DNA-related databases. We have tested our database using a set of different cases from the literature, ranging from infertility to autism. The database is available at http://www.semanticgen.net/ydhs. Y-DNA haplogroups and sequence variants have not been in the scientific limelight, excluding certain specialized fields like forensics, mainly because there is not much freely available information or it is scattered across different sources. However, as we have demonstrated, Y-SNPs do play a role in various cases at the haplogroup level, and it is possible to create a free Y-DNA-dedicated bioinformatics resource.
A reference system for animal biometrics: application to the northern leopard frog
Petrovska-Delacretaz, D.; Edwards, A.; Chiasson, J.; Chollet, G.; Pilliod, D.S.
2014-01-01
Reference systems and public databases are available for human biometrics, but to our knowledge nothing is available for animal biometrics. This is surprising because animals are not required to give their agreement to be in a database. This paper proposes a reference system and database for the northern leopard frog (Lithobates pipiens). Both are available for reproducible experiments. Results of both open set and closed set experiments are given.
NASA Technical Reports Server (NTRS)
Germany, G. A.
2001-01-01
The primary goal of the funded task was to restore and distribute the ISO ATLAS-1 space science data set with enhanced software and database utilities. The first year was primarily dedicated to physically transferring the data from its original format to its initial CD archival format. The remainder of the first year was devoted to verification of the restored data set and database. The second year was devoted to enhancement of the data set, especially the development of IDL utilities and redesign of the database and search interface as needed. This period was also devoted to distribution of the rescued data set, principally the creation and maintenance of a web interface to the data set. The final six months were dedicated to working with NSSDC to create a permanent, off-site archive of the data set and supporting utilities. This time was also used to resolve last-minute quality and design issues.
Bagging Voronoi classifiers for clustering spatial functional data
NASA Astrophysics Data System (ADS)
Secchi, Piercesare; Vantini, Simone; Vitelli, Valeria
2013-06-01
We propose a bagging strategy based on random Voronoi tessellations for the exploration of geo-referenced functional data, suitable for different purposes (e.g., classification, regression, dimensional reduction, …). Urged by an application to environmental data contained in the Surface Solar Energy database, we focus in particular on the problem of clustering functional data indexed by the sites of a spatial finite lattice. We thus illustrate our strategy by implementing a specific algorithm whose rationale is to (i) replace the original data set with a reduced one, composed of local representatives of neighborhoods covering the entire investigated area; (ii) analyze the local representatives; (iii) repeat the previous analysis many times for different reduced data sets associated with different randomly generated sets of neighborhoods, thus obtaining many different weak formulations of the analysis; (iv) finally, bag together the weak analyses to obtain a conclusive strong analysis. Through an extensive simulation study, we show that this new procedure - which does not require an explicit model for spatial dependence - is statistically and computationally efficient.
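A minimal sketch of steps (i)-(iv) in Python (an illustrative re-implementation under simplifying assumptions, not the authors' code): each repetition tessellates the lattice around random seed sites, clusters the cell-mean curves, and the weak labelings are bagged through a co-assignment frequency matrix.

```python
# Illustrative bagging-Voronoi clustering; sites: (n, 2) lattice coordinates,
# curves: (n, T) functional observations at those sites.
import numpy as np
from scipy.spatial import cKDTree
from sklearn.cluster import KMeans, SpectralClustering

def bagging_voronoi(sites, curves, n_cells=50, n_rep=100, k=3, seed=0):
    rng = np.random.default_rng(seed)
    n = len(sites)
    co = np.zeros((n, n))                       # co-assignment counts
    for _ in range(n_rep):
        # (i) random Voronoi tessellation; each cell's representative is its mean curve
        seeds = sites[rng.choice(n, n_cells, replace=False)]
        cell = cKDTree(seeds).query(sites)[1]
        reps = np.vstack([curves[cell == c].mean(axis=0) for c in range(n_cells)])
        # (ii)-(iii) one weak analysis: cluster the representatives
        weak = KMeans(n_clusters=k, n_init=5).fit_predict(reps)[cell]
        co += weak[:, None] == weak[None, :]
    # (iv) bag the weak labelings: cluster the co-assignment frequency matrix
    return SpectralClustering(n_clusters=k, affinity="precomputed").fit_predict(co / n_rep)
```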
47 CFR 0.241 - Authority delegated.
Code of Federal Regulations, 2012 CFR
2012-10-01
... database functions for unlicensed devices operating in the television broadcast bands (TV bands) as set... methods that will be used to designate TV bands database managers, to designate these database managers; to develop procedures that these database managers will use to ensure compliance with the...
47 CFR 0.241 - Authority delegated.
Code of Federal Regulations, 2013 CFR
2013-10-01
... database functions for unlicensed devices operating in the television broadcast bands (TV bands) as set... methods that will be used to designate TV bands database managers, to designate these database managers; to develop procedures that these database managers will use to ensure compliance with the...
47 CFR 0.241 - Authority delegated.
Code of Federal Regulations, 2011 CFR
2011-10-01
... database functions for unlicensed devices operating in the television broadcast bands (TV bands) as set... methods that will be used to designate TV bands database managers, to designate these database managers; to develop procedures that these database managers will use to ensure compliance with the...
IPD—the Immuno Polymorphism Database
Robinson, James; Halliwell, Jason A.; McWilliam, Hamish; Lopez, Rodrigo; Marsh, Steven G. E.
2013-01-01
The Immuno Polymorphism Database (IPD), http://www.ebi.ac.uk/ipd/ is a set of specialist databases related to the study of polymorphic genes in the immune system. The IPD project works with specialist groups or nomenclature committees who provide and curate individual sections before they are submitted to IPD for online publication. The IPD project stores all the data in a set of related databases. IPD currently consists of four databases: IPD-KIR, which contains the allelic sequences of killer-cell immunoglobulin-like receptors; IPD-MHC, a database of sequences of the major histocompatibility complex of different species; IPD-HPA, alloantigens expressed only on platelets; and IPD-ESTDAB, which provides access to the European Searchable Tumour Cell-Line Database, a cell bank of immunologically characterized melanoma cell lines. The data are currently available online from the website and FTP directory. This article describes the latest updates and additional tools added to the IPD project. PMID:23180793
snpGeneSets: An R Package for Genome-Wide Study Annotation
Mei, Hao; Li, Lianna; Jiang, Fan; Simino, Jeannette; Griswold, Michael; Mosley, Thomas; Liu, Shijian
2016-01-01
Genome-wide studies (GWS) of SNP associations and differential gene expressions have generated abundant results; next-generation sequencing technology has further boosted the number of variants and genes identified. Effective interpretation requires massive annotation and downstream analysis of these genome-wide results, a computationally challenging task. We developed the snpGeneSets package to simplify annotation and analysis of GWS results. Our package integrates local copies of knowledge bases for SNPs, genes, and gene sets, and implements wrapper functions in the R language to enable transparent access to low-level databases for efficient annotation of large genomic data. The package contains functions that execute three types of annotations: (1) genomic mapping annotation for SNPs and genes and functional annotation for gene sets; (2) bidirectional mapping between SNPs and genes, and genes and gene sets; and (3) calculation of gene effect measures from SNP associations and performance of gene set enrichment analyses to identify functional pathways. We applied snpGeneSets to type 2 diabetes (T2D) results from the NHGRI genome-wide association study (GWAS) catalog, a Finnish GWAS, and a genome-wide expression study (GWES). These studies demonstrate the usefulness of snpGeneSets for annotating and performing enrichment analysis of GWS results. The package is open-source, free, and can be downloaded at: https://www.umc.edu/biostats_software/. PMID:27807048
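The package itself is in R; as a language-neutral illustration of the enrichment step it performs, the sketch below runs a hypergeometric over-representation test in Python (gene symbols and set sizes are made up).

```python
# Hypergeometric over-representation test, the statistic behind a basic
# gene set enrichment analysis.
from scipy.stats import hypergeom

def enrichment_p(study_genes, gene_set, background_size):
    hits = len(study_genes & gene_set)
    # P(X >= hits) for X ~ Hypergeom(background, |gene_set|, |study_genes|)
    return hypergeom.sf(hits - 1, background_size, len(gene_set), len(study_genes))

study = {"TCF7L2", "PPARG", "KCNJ11", "SLC30A8"}   # e.g. T2D-associated genes
pathway = {"TCF7L2", "PPARG", "IRS1", "GCK"}       # a hypothetical gene set
print(enrichment_p(study, pathway, background_size=20000))
```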
PERFORMANCE AUDITING OF A HUMAN AIR POLLUTION EXPOSURE CHAMBER FOR PM2.5
Databases derived from human health effects research play a vital role in setting environmental standards. An underlying assumption in using these databases for standard setting purposes is that they are of adequate quality. The performance auditing program described provides n...
Resource Purpose: The Watershed Information Network is a set of about 30 web pages that are organized by topic. These pages access existing databases like the American Heritage Rivers Services database and Surf Your Watershed. WIN in itself has no data or data sets.
A Quantitative Approach to Determine Analogous Areas Using Environmental Parameters
2008-03-01
...consolidation of a marine database. Out of this effort came the Comprehensive Ocean-Atmosphere Data Set (COADS). The original 17 data sets were... The National Oceanic and Atmospheric Administration (NOAA) has compiled a database of total sediment thickness of the global oceans and seas. These data are...
Carazo, J M; Stelzer, E H
1999-01-01
The BioImage Database Project collects and structures multidimensional data sets recorded by various microscopic techniques relevant to modern life sciences. It provides, as precisely as possible, the circumstances in which the sample was prepared and the data were recorded. It grants access to the actual data and maintains links between related data sets. In order to promote the interdisciplinary approach of modern science, it offers a large set of key words, which covers essentially all aspects of microscopy. Nonspecialists can, therefore, access and retrieve significant information recorded and submitted by specialists in other areas. A key issue of the undertaking is to exploit the available technology and to provide a well-defined yet flexible structure for dealing with data. Its pivotal element is, therefore, a modern object relational database that structures the metadata and ameliorates the provision of a complete service. The BioImage database can be accessed through the Internet. Copyright 1999 Academic Press.
Nicholson, Suzanne W.; Dicken, Connie L.; Horton, John D.; Foose, Michael P.; Mueller, Julia A.L.; Hon, Rudi
2006-01-01
The rapid growth in the use of Geographic Information Systems (GIS) has highlighted the need for regional- and national-scale digital geologic maps that have standardized information about geologic age and lithology. Such maps can be conveniently used to generate derivative maps for manifold special purposes such as mineral-resource assessment, metallogenic studies, tectonic studies, and environmental research. Although two digital geologic maps (Schruben and others, 1994; Reed and Bush, 2004) of the United States currently exist, their scales (1:2,500,000 and 1:5,000,000) are too general for many regional applications. Most states have digital geologic maps at scales of about 1:500,000, but the databases are not comparably structured and, thus, it is difficult to use the digital database for more than one state at a time. This report describes the results, for a seven-state region, of an effort by the U.S. Geological Survey to produce a series of integrated and standardized state geologic map databases covering the entire United States. In 1997, the United States Geological Survey's Mineral Resources Program initiated the National Surveys and Analysis (NSA) Project to develop national digital databases. One primary activity of this project was to compile a national digital geologic map database, utilizing state geologic maps, to support studies in the 1:250,000- to 1:1,000,000-scale range. To accomplish this, state databases were prepared using a common standard for the database structure, fields, attribution, and data dictionaries. For Alaska and Hawaii, new state maps are being prepared, and the preliminary work for Alaska is being released as a series of 1:250,000-scale quadrangle reports. This document provides background information and documentation for the integrated geologic map databases of this report, which is one of a series releasing preliminary standardized geologic map databases for the United States. The data products of the project consist of two main parts: the spatial databases and a set of supplemental tables relating to geologic map units. The datasets serve as a data resource to generate a variety of stratigraphic, age, and lithologic maps. This documentation is divided into four main sections: (1) description of the set of data files provided in this report, (2) specifications of the spatial databases, (3) specifications of the supplemental tables, and (4) an appendix containing the data dictionaries used to populate some fields of the spatial database and supplemental tables.
Jones, Andrew R; Siepen, Jennifer A; Hubbard, Simon J; Paton, Norman W
2009-03-01
LC-MS experiments can generate large quantities of data, for which a variety of database search engines are available to make peptide and protein identifications. Decoy databases are becoming widely used to place statistical confidence in result sets, allowing the false discovery rate (FDR) to be estimated. Different search engines produce different identification sets, so employing more than one search engine could result in an increased number of peptides (and proteins) being identified, if an appropriate mechanism for combining data can be defined. We have developed a search-engine-independent score based on FDR, called the FDR Score, which allows peptide identifications from different search engines to be combined. The results demonstrate that the observed FDR is significantly different when analysing the set of identifications made by all three search engines, by each pair of search engines or by a single search engine. Our algorithm assigns identifications to groups according to the set of search engines that have made the identification, and re-assigns the score (combined FDR Score). The combined FDR Score can differentiate between correct and incorrect peptide identifications with high accuracy, allowing on average 35% more peptide identifications to be made at a fixed FDR than using a single search engine.
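The combined FDR Score algorithm itself is more involved; the sketch below only illustrates the underlying decoy-based FDR estimate that the score builds on, assuming each identification carries an engine score and a target/decoy flag.

```python
# Decoy-based FDR estimation: walk down the ranked list and estimate the FDR
# at each score threshold as (#decoy hits) / (#target hits).
def fdr_curve(identifications):
    """identifications: iterable of (score, is_decoy), higher score = better."""
    out, decoys = [], 0
    for rank, (score, is_decoy) in enumerate(sorted(identifications, reverse=True), 1):
        decoys += is_decoy
        targets = rank - decoys
        out.append((score, decoys / max(targets, 1)))
    return out

print(fdr_curve([(9.1, False), (8.7, False), (7.2, True), (6.5, False)]))
```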
A Statistical Analysis of Corona Topography: New Insights into Corona Formation and Evolution
NASA Technical Reports Server (NTRS)
Stofan, E. R.; Glaze, L. S.; Smrekar, S. E.; Baloga, S. M.
2003-01-01
Extensive mapping of the surface of Venus and continued analysis of Magellan data have allowed a more comprehensive survey of coronae to be conducted. Our updated corona database contains 514 features, an increase from the 326 coronae of the previous survey. We include a new set of 106 Type 2 or stealth coronae, which have a topographic rather than a fracture annulus. The large increase in the number of coronae over the 1992 survey results from several factors, including the use of the full Magellan data set and the addition of features identified as part of the systematic geologic mapping of Venus. Parameters of the population that we have analyzed to date include size and topography.
The LSST metrics analysis framework (MAF)
NASA Astrophysics Data System (ADS)
Jones, R. L.; Yoachim, Peter; Chandrasekharan, Srinivasan; Connolly, Andrew J.; Cook, Kem H.; Ivezic, Željko; Krughoff, K. S.; Petry, Catherine; Ridgway, Stephen T.
2014-07-01
We describe the Metrics Analysis Framework (MAF), an open-source Python framework developed to provide a user-friendly, customizable, easily extensible set of tools for analyzing data sets. MAF is part of the Large Synoptic Survey Telescope (LSST) Simulations effort. Its initial goal is to provide a tool to evaluate LSST Operations Simulation (OpSim) simulated surveys to help understand the effects of telescope scheduling on survey performance; however, MAF can be applied to a much wider range of datasets. The building blocks of the framework are Metrics (algorithms to analyze a given quantity of data), Slicers (subdividing the overall data set into smaller data slices as relevant for each Metric), and Database classes (to access the dataset and read data into memory). We describe how these building blocks work together, and provide an example of using MAF to evaluate different dithering strategies. We also outline how users can write their own custom Metrics and use these within the framework.
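To illustrate the Metric/Slicer composition in miniature (the real MAF classes have richer interfaces; all names, column labels, and the pandas layout below are invented):

```python
# Toy version of the MAF building blocks: a Metric computes one quantity,
# a Slicer subdivides the data set, and a driver applies the metric to
# every slice.
import numpy as np
import pandas as pd

class MeanMetric:
    def __init__(self, column):
        self.column = column
    def run(self, data_slice):
        return np.mean(data_slice[self.column])

class ColumnSlicer:
    """Subdivide observations by the distinct values of one column,
    e.g. an integer sky-pixel ID."""
    def __init__(self, column):
        self.column = column
    def slices(self, data):
        for key, group in data.groupby(self.column):
            yield key, group

def run_metric(metric, slicer, data):
    return {key: metric.run(s) for key, s in slicer.slices(data)}

# e.g. mean seeing per sky pixel from a simulated survey table
data = pd.DataFrame({"pixel": np.random.randint(0, 8, 1000),
                     "seeing": np.random.lognormal(0, 0.1, 1000)})
print(run_metric(MeanMetric("seeing"), ColumnSlicer("pixel"), data))
```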
PERFORMANCE AUDITING OF A HUMAN AIR POLLUTION EXPOSURE SYSTEM FOR PM2.5
Databases derived from human health effects research play a vital role in setting environmental standards. An underlying assumption in using these databases for standard setting purposes is that they are of adequate quality. The performance auditing program described in this ma...
Accelerating Pathology Image Data Cross-Comparison on CPU-GPU Hybrid Systems
Wang, Kaibo; Huai, Yin; Lee, Rubao; Wang, Fusheng; Zhang, Xiaodong; Saltz, Joel H.
2012-01-01
As an important application of spatial databases in pathology imaging analysis, cross-comparing the spatial boundaries of a huge amount of segmented micro-anatomic objects demands extremely data- and compute-intensive operations, requiring high throughput at an affordable cost. However, the performance of spatial database systems has not been satisfactory since their implementations of spatial operations cannot fully utilize the power of modern parallel hardware. In this paper, we provide a customized software solution that exploits GPUs and multi-core CPUs to accelerate spatial cross-comparison in a cost-effective way. Our solution consists of an efficient GPU algorithm and a pipelined system framework with task migration support. Extensive experiments with real-world data sets demonstrate the effectiveness of our solution, which improves the performance of spatial cross-comparison by over 18 times compared with a parallelized spatial database approach. PMID:23355955
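The core operation being accelerated is boundary cross-comparison between segmented objects. As a CPU-side illustration only (the paper's contribution is the GPU/multi-core pipeline, which this sketch does not reproduce), the overlap metric for one pair of boundaries might be computed with shapely:

```python
# Area-overlap (Jaccard) ratio between two segmented object boundaries.
from shapely.geometry import Polygon

def jaccard(poly_a, poly_b):
    inter = poly_a.intersection(poly_b).area
    union = poly_a.union(poly_b).area
    return inter / union if union else 0.0

a = Polygon([(0, 0), (4, 0), (4, 4), (0, 4)])
b = Polygon([(2, 2), (6, 2), (6, 6), (2, 6)])
print(jaccard(a, b))   # 4 / 28 ≈ 0.143
```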
PGMapper: a web-based tool linking phenotype to genes.
Xiong, Qing; Qiu, Yuhui; Gu, Weikuan
2008-04-01
With the availability of whole genome sequence in many species, linkage analysis, positional cloning and microarray are gradually becoming powerful tools for investigating the links between phenotype and genotype or genes. However, in these methods, causative genes underlying a quantitative trait locus, or a disease, are usually located within a large genomic region or a large set of genes. Examining the function of every gene is very time consuming and needs to retrieve and integrate the information from multiple databases or genome resources. PGMapper is a software tool for automatically matching phenotype to genes from a defined genome region or a group of given genes by combining the mapping information from the Ensembl database and gene function information from the OMIM and PubMed databases. PGMapper is currently available for candidate gene search of human, mouse, rat, zebrafish and 12 other species. Available online at http://www.genediscovery.org/pgmapper/index.jsp.
An engineering database management system for spacecraft operations
NASA Technical Reports Server (NTRS)
Cipollone, Gregorio; Mckay, Michael H.; Paris, Joseph
1993-01-01
Studies at ESOC have demonstrated the feasibility of a flexible and powerful Engineering Database Management System (EDMS) in support of spacecraft operations documentation. The objectives set out were three-fold: first, an analysis of the problems encountered by the operations team in obtaining and managing operations documents; secondly, the definition of a concept for operations documentation and the implementation of a prototype to prove the feasibility of the concept; and thirdly, the definition of standards and protocols required for the exchange of data between the top-level partners in a satellite project. The EDMS prototype was populated with ERS-1 satellite design data and has been used by the operations team at ESOC to gather operational experience. An operational EDMS would be implemented at the satellite prime contractor's site as a common database for all technical information surrounding a project and would be accessible by the co-contractors' and ESA teams.
Recent advances in proteomics of cereals.
Bansal, Monika; Sharma, Madhu; Kanwar, Priyanka; Goyal, Aakash
Cereals contribute a major part of human nutrition and are considered an integral source of energy in human diets. With genomic databases already available for cereals such as rice, wheat, barley, and maize, the focus has now moved to proteome analysis. Proteomics studies involve the development of appropriate databases, based on suitable separation and purification protocols and the identification of protein functions, and can confirm functional networks using data already available from other sources. Tremendous progress has been made in the past decade in generating huge data sets covering interactions among proteins, the protein composition of various organs and organelles, and quantitative and qualitative analyses of proteins, and in characterizing their modulation during plant development and under biotic and abiotic stresses. Proteomics platforms have been used to identify and improve our understanding of various metabolic pathways. This article gives a brief review of the efforts made by different research groups on comparative, descriptive, and functional analysis of proteomics applications in cereal science so far.
The CMS dataset bookkeeping service
NASA Astrophysics Data System (ADS)
Afaq, A.; Dolgert, A.; Guo, Y.; Jones, C.; Kosyakov, S.; Kuznetsov, V.; Lueking, L.; Riley, D.; Sekhri, V.
2008-07-01
The CMS Dataset Bookkeeping Service (DBS) has been developed to catalog all CMS event data from Monte Carlo and Detector sources. It provides the ability to identify MC or trigger source, track data provenance, construct datasets for analysis, and discover interesting data. CMS requires processing and analysis activities at various service levels and the DBS system provides support for localized processing or private analysis, as well as global access for CMS users at large. Catalog entries can be moved among the various service levels with a simple set of migration tools, thus forming a loose federation of databases. DBS is available to CMS users via a Python API, Command Line, and a Discovery web page interfaces. The system is built as a multi-tier web application with Java servlets running under Tomcat, with connections via JDBC to Oracle or MySQL database backends. Clients connect to the service through HTTP or HTTPS with authentication provided by GRID certificates and authorization through VOMS. DBS is an integral part of the overall CMS Data Management and Workflow Management systems.
High-throughput determination of structural phase diagram and constituent phases using GRENDEL
NASA Astrophysics Data System (ADS)
Kusne, A. G.; Keller, D.; Anderson, A.; Zaban, A.; Takeuchi, I.
2015-11-01
Advances in high-throughput materials fabrication and characterization techniques have resulted in faster rates of data collection and rapidly growing volumes of experimental data. To convert this mass of information into actionable knowledge of material process-structure-property relationships requires high-throughput data analysis techniques. This work explores the use of the Graph-based endmember extraction and labeling (GRENDEL) algorithm as a high-throughput method for analyzing structural data from combinatorial libraries, specifically, to determine phase diagrams and constituent phases from both x-ray diffraction and Raman spectral data. The GRENDEL algorithm utilizes a set of physical constraints to optimize results and provides a framework by which additional physics-based constraints can be easily incorporated. GRENDEL also permits the integration of database data as shown by the use of critically evaluated data from the Inorganic Crystal Structure Database in the x-ray diffraction data analysis. Also the Sunburst radial tree map is demonstrated as a tool to visualize material structure-property relationships found through graph based analysis.
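GRENDEL itself is graph-based and constraint-driven; as a simplified stand-in for the endmember-extraction step it performs, the sketch below factors a library of diffraction patterns with plain non-negative matrix factorization (synthetic data, no physical constraints, and not the GRENDEL algorithm):

```python
# Factor a pattern library into constituent-phase endmembers and abundances.
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
patterns = rng.random((100, 2048))         # 100 library samples x 2048 2-theta bins

model = NMF(n_components=4, init="nndsvd", max_iter=500)
abundances = model.fit_transform(patterns) # per-sample phase weights
phases = model.components_                 # endmember (constituent-phase) patterns
phase_regions = abundances.argmax(axis=1)  # crude phase-diagram labeling
```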
KA-SB: from data integration to large scale reasoning
Roldán-García, María del Mar; Navas-Delgado, Ismael; Kerzazi, Amine; Chniber, Othmane; Molina-Castro, Joaquín; Aldana-Montes, José F
2009-01-01
Background The analysis of information in the biological domain is usually focused on data from single online data sources. Unfortunately, studying a biological process requires access to disperse, heterogeneous, autonomous data sources. In this context, analysis of the information is not possible without the integration of such data. Methods KA-SB is a querying and analysis system for final users based on combining a data integration solution with a reasoner. The tool has been created in a process divided into two steps: 1) KOMF, the Khaos Ontology-based Mediator Framework, is used to retrieve information from heterogeneous and distributed databases; 2) the integrated information is crystallized in a (persistent and high-performance) reasoner (DBOWL), where it can be further analyzed (by means of querying and reasoning). Results In this paper we present a novel system that combines the use of a mediation system with the reasoning capabilities of a large-scale reasoner to provide a way of finding new knowledge and of analyzing the integrated information from different databases, which is retrieved as a set of ontology instances. The tool provides a graphical query interface that shows a graphical representation of the ontology and allows users to build queries by clicking on the ontology concepts. Conclusion These kinds of systems (based on KOMF) will provide users with very large amounts of information (interpreted as ontology instances once retrieved), which cannot be managed using traditional main-memory-based reasoners. We propose a process for creating persistent and scalable knowledge bases from sets of OWL instances obtained by integrating heterogeneous data sources with KOMF. This process has been applied to develop a demo tool, which uses the BioPax Level 3 ontology as the integration schema, and integrates the UNIPROT, KEGG, CHEBI, BRENDA and SABIORK databases. PMID:19796402
Performance of Stratified and Subgrouped Disproportionality Analyses in Spontaneous Databases.
Seabroke, Suzie; Candore, Gianmario; Juhlin, Kristina; Quarcoo, Naashika; Wisniewski, Antoni; Arani, Ramin; Painter, Jeffery; Tregunno, Philip; Norén, G Niklas; Slattery, Jim
2016-04-01
Disproportionality analyses are used in many organisations to identify adverse drug reactions (ADRs) from spontaneous report data. Reporting patterns vary over time, with patient demographics, and between different geographical regions, and therefore subgroup analyses or adjustment by stratification may be beneficial. The objective of this study was to evaluate the performance of subgroup and stratified disproportionality analyses for a number of key covariates within spontaneous report databases of differing sizes and characteristics. Using a reference set of established ADRs, signal detection performance (sensitivity and precision) was compared for stratified, subgroup and crude (unadjusted) analyses within five spontaneous report databases (two company, one national and two international databases). Analyses were repeated for a range of covariates: age, sex, country/region of origin, calendar time period, event seriousness, vaccine/non-vaccine, reporter qualification and report source. Subgroup analyses consistently performed better than stratified analyses in all databases. Subgroup analyses also showed benefits in both sensitivity and precision over crude analyses for the larger international databases, whilst for the smaller databases a gain in precision tended to result in some loss of sensitivity. Additionally, stratified analyses did not increase sensitivity or precision beyond that associated with analytical artefacts of the analysis. The most promising subgroup covariates were age and region/country of origin, although this varied between databases. Subgroup analyses perform better than stratified analyses and should be considered over the latter in routine first-pass signal detection. Subgroup analyses are also clearly beneficial over crude analyses for larger databases, but further validation is required for smaller databases.
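For concreteness, a disproportionality statistic such as the proportional reporting ratio (PRR) can be computed crudely or per subgroup as below; this is a generic sketch with invented column names, not the signal detection pipeline of any of the five databases studied.

```python
# PRR from a 2x2 contingency table, optionally within subgroups.
import pandas as pd

def prr(reports, drug, event):
    a = len(reports[(reports.drug == drug) & (reports.event == event)])
    b = len(reports[(reports.drug == drug) & (reports.event != event)])
    c = len(reports[(reports.drug != drug) & (reports.event == event)])
    d = len(reports[(reports.drug != drug) & (reports.event != event)])
    return (a / (a + b)) / (c / (c + d)) if a and c else float("nan")

def subgroup_prr(reports, drug, event, by="age_group"):
    """PRR computed separately within each subgroup (cf. subgroup analyses)."""
    return {group: prr(sub, drug, event) for group, sub in reports.groupby(by)}
```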
47 CFR 0.241 - Authority delegated.
Code of Federal Regulations, 2014 CFR
2014-10-01
... individual database managers; and to perform other functions as needed for the administration of the TV bands... database functions for unlicensed devices operating in the television broadcast bands (TV bands) as set... methods that will be used to designate TV bands database managers, to designate these database managers...
DOE Office of Scientific and Technical Information (OSTI.GOV)
Uehara, Takeki, E-mail: takeki.uehara@shionogi.co.jp; Toxicogenomics Informatics Project, National Institute of Biomedical Innovation, 7-6-8 Asagi, Ibaraki, Osaka 567-0085; Minowa, Yohsuke
2011-09-15
The present study was performed to develop a robust gene-based prediction model for early assessment of the potential hepatocarcinogenicity of chemicals in rats by using our toxicogenomics database, TG-GATEs (Genomics-Assisted Toxicity Evaluation System developed by the Toxicogenomics Project in Japan). The positive training set consisted of high- or middle-dose groups that received 6 different non-genotoxic hepatocarcinogens during a 28-day period. The negative training set consisted of high- or middle-dose groups of 54 non-carcinogens. A support vector machine combined with wrapper-type gene selection algorithms was used for modeling. Consequently, our best classifier yielded prediction accuracies for hepatocarcinogenicity of 99% sensitivity and 97% specificity in the training data set, and false positive prediction was almost completely eliminated. Pathway analysis of feature genes revealed that the mitogen-activated protein kinase p38- and phosphatidylinositol-3-kinase-centered interactome and the v-myc myelocytomatosis viral oncogene homolog-centered interactome were the 2 most significant networks. The usefulness and robustness of our predictor were further confirmed in an independent validation data set obtained from the public database. Interestingly, similar positive predictions were obtained for several genotoxic as well as non-genotoxic hepatocarcinogens. These results indicate that the expression profiles of our newly selected candidate biomarker genes might be common characteristics of the early stage of carcinogenesis for both genotoxic and non-genotoxic carcinogens in the rat liver. Our toxicogenomic model might be useful for prospective screening of the hepatocarcinogenicity of compounds and prioritization of compounds for carcinogenicity testing. Highlights: We developed a toxicogenomic model to predict the hepatocarcinogenicity of chemicals. The optimized model, consisting of 9 probes, had 99% sensitivity and 97% specificity. It enables detection of genotoxic as well as non-genotoxic hepatocarcinogens.
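A schematic reconstruction of the modeling approach (support vector machine with wrapper-type feature selection) on synthetic expression data; this is not the authors' code, their data, or their selected 9-probe signature.

```python
# Linear SVM with recursive feature elimination down to 9 probes on
# made-up expression data (60 dose groups x 500 probe sets).
import numpy as np
from sklearn.svm import SVC
from sklearn.feature_selection import RFE
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.standard_normal((60, 500))
y = rng.integers(0, 2, 60)            # 1 = hepatocarcinogen, 0 = non-carcinogen

selector = RFE(SVC(kernel="linear"), n_features_to_select=9, step=50).fit(X, y)
X9 = X[:, selector.support_]          # the 9 selected probe sets
print(cross_val_score(SVC(kernel="linear"), X9, y, cv=5).mean())
```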
Attenuation relation for strong motion in Eastern Java based on appropriate database and method
NASA Astrophysics Data System (ADS)
Mahendra, Rian; Rohadi, Supriyanto; Rudyanto, Ariska
2017-07-01
The selection and determination of attenuation relations has become important for seismic hazard assessment in active seismic regions. This research initially constructs an appropriate strong motion database, including site condition and earthquake type. The data set consists of a large number of earthquakes of 5 ≤ Mw ≤ 9 at distances of less than 500 km that occurred around Java from 2009 until 2016. Earthquake locations and depths were relocated using the double-difference method to improve the quality of the database. Strong motion data from twelve BMKG accelerographs located in East Java are used. Site condition is characterized using the dominant period and Vs30. Earthquake type is classified into crustal, interface, and intraslab events based on slab geometry analysis. A total of 10 Ground Motion Prediction Equations (GMPEs) are tested against this database using the likelihood method (Scherbaum et al., 2004) and the Euclidean distance ranking method (Kale and Akkar, 2012). The evaluation leads to a set of GMPEs that can be applied for seismic hazard in East Java, where the strong motion data were collected. Because substantial deviations remained for the tested GMPEs, some were modified using an inversion method. Validation was performed by analysing the attenuation curve of the selected GMPE against observation data from 2015 to 2016. The results show that the selected GMPE is suitable for estimating PGA values in East Java.
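As an illustration of the first of these tests, the likelihood (LH) measure of Scherbaum et al. (2004) can be sketched as follows; the ranking helper and all variable names are illustrative, not the authors' implementation.

```python
# LH values from normalized residuals; a well-calibrated GMPE gives a
# median LH near 0.5.
import numpy as np
from scipy.stats import norm

def lh_values(ln_observed, ln_predicted, sigma_ln):
    z = (np.asarray(ln_observed) - np.asarray(ln_predicted)) / sigma_ln
    return 2.0 * norm.sf(np.abs(z))          # P(|Z| >= |z|), Z ~ N(0, 1)

def rank_gmpes(gmpes, ln_obs):
    """gmpes: {name: (ln_predicted, sigma_ln)}; best = median LH closest to 0.5."""
    score = {name: abs(np.median(lh_values(ln_obs, pred, sig)) - 0.5)
             for name, (pred, sig) in gmpes.items()}
    return sorted(score, key=score.get)
```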
Photo-z-SQL: Integrated, flexible photometric redshift computation in a database
NASA Astrophysics Data System (ADS)
Beck, R.; Dobos, L.; Budavári, T.; Szalay, A. S.; Csabai, I.
2017-04-01
We present a flexible template-based photometric redshift estimation framework, implemented in C#, that can be seamlessly integrated into a SQL database (or DB) server and executed on-demand in SQL. The DB integration eliminates the need to move large photometric datasets outside a database for redshift estimation, and utilizes the computational capabilities of DB hardware. The code is able to perform both maximum likelihood and Bayesian estimation, and can handle inputs of variable photometric filter sets and corresponding broad-band magnitudes. It is possible to take into account the full covariance matrix between filters, and filter zero points can be empirically calibrated using measurements with given redshifts. The list of spectral templates and the prior can be specified flexibly, and the expensive synthetic magnitude computations are done via lazy evaluation, coupled with a caching of results. Parallel execution is fully supported. For large upcoming photometric surveys such as the LSST, the ability to perform in-place photo-z calculation would be a significant advantage. Also, the efficient handling of variable filter sets is a necessity for heterogeneous databases, for example the Hubble Source Catalog, and for cross-match services such as SkyQuery. We illustrate the performance of our code on two reference photo-z estimation testing datasets, and provide an analysis of execution time and scalability with respect to different configurations. The code is available for download at https://github.com/beckrob/Photo-z-SQL.
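The maximum-likelihood branch of template fitting reduces to a chi-square minimization over templates and redshifts. A minimal NumPy sketch, ignoring template scaling, priors, and the filter covariance handling the paper supports, and assuming a precomputed grid of synthetic magnitudes (the quantity whose recomputation the lazy evaluation and caching avoid):

```python
# Chi-square template fitting over a (templates x redshifts x filters) grid.
import numpy as np

def photoz_ml(obs_mag, obs_err, synth_grid, z_grid):
    chi2 = (((synth_grid - obs_mag) / obs_err) ** 2).sum(axis=-1)
    t_best, z_best = np.unravel_index(chi2.argmin(), chi2.shape)
    return z_grid[z_best], t_best            # best-fit redshift and template
```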
ARIADNE: a Tracking System for Relationships in LHCb Metadata
NASA Astrophysics Data System (ADS)
Shapoval, I.; Clemencic, M.; Cattaneo, M.
2014-06-01
The data processing model of the LHCb experiment implies handling of an evolving set of heterogeneous metadata entities and relationships between them. The entities range from software and database states to architecture specifications and software/data deployment locations. For instance, there is an important relationship between the LHCb Conditions Database (CondDB), which provides versioned, time-dependent geometry and conditions data, and the LHCb software, i.e. the data processing applications (used for simulation, high-level triggering, reconstruction and analysis of physics data). The evolution of CondDB and of the LHCb applications is a weakly homomorphic process, meaning that relationships between a CondDB state and an LHCb application state may not be preserved across different database and application generations. These issues may lead to various kinds of problems in LHCb production, varying from unexpected application crashes to incorrect data processing results. In this paper we present Ariadne, a generic metadata relationship tracking system based on the novel NoSQL Neo4j graph database. Its aim is to track and analyze many thousands of evolving relationships for cases such as the one described above, and several others, which would otherwise remain unmanaged and potentially harmful. The highlights of the paper include the system's implementation and management details, the infrastructure needed for running it, security issues, first experience of usage in LHCb production and the potential of the system to be applied to a wider set of LHCb tasks.
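A minimal sketch of how such a relationship could be recorded in Neo4j from Python follows; the node labels, property names and credentials are hypothetical, and this is not Ariadne's actual schema.

```python
# Sketch of recording a CondDB-state -> application-state relationship in a
# Neo4j graph, as Ariadne does; labels and credentials here are hypothetical.
from neo4j import GraphDatabase  # official Neo4j Python driver (v5 API assumed)

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "secret"))

def link_states(tx, conddb_tag, app_version):
    # MERGE creates the nodes and the edge only if they do not already exist.
    tx.run(
        "MERGE (c:CondDBState {tag: $tag}) "
        "MERGE (a:AppState {version: $ver}) "
        "MERGE (c)-[:COMPATIBLE_WITH]->(a)",
        tag=conddb_tag, ver=app_version,
    )

with driver.session() as session:
    session.execute_write(link_states, "cond-20140601", "Brunel-v45r0")
driver.close()
```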
Data Resource Profile: United Nations Children’s Fund (UNICEF)
Murray, Colleen; Newby, Holly
2012-01-01
The United Nations Children’s Fund (UNICEF) plays a leading role in the collection, compilation, analysis and dissemination of data to inform sound policies, legislation and programmes for promoting children’s rights and well-being, and for global monitoring of progress towards the Millennium Development Goals. UNICEF maintains a set of global databases representing nearly 200 countries and covering the areas of child mortality, child health, maternal health, nutrition, immunization, water and sanitation, HIV/AIDS, education and child protection. These databases consist of internationally comparable and statistically sound data, and are updated annually through a process that draws on a wealth of data provided by UNICEF’s wide network of >150 field offices. The databases are composed primarily of estimates from household surveys, with data from censuses, administrative records, vital registration systems and statistical models contributing to some key indicators as well. The data are assessed for quality based on a set of objective criteria to ensure that only the most reliable nationally representative information is included. For most indicators, data are available at the global, regional and national levels, plus sub-national disaggregation by sex, urban/rural residence and household wealth. The global databases are featured in UNICEF’s flagship publications, inter-agency reports, including the Secretary General’s Millennium Development Goals Report and Countdown to 2015, sector-specific reports and statistical country profiles. They are also publicly available on www.childinfo.org, together with trend data and equity analyses. PMID:23211414
Preliminary assessment of the robustness of dynamic inversion based flight control laws
NASA Technical Reports Server (NTRS)
Snell, S. A.
1992-01-01
Dynamic-inversion-based flight control laws present an attractive alternative to conventional gain-scheduled designs for high-angle-of-attack maneuvering, where nonlinearities dominate the dynamics. Dynamic inversion is easily applied to the aircraft dynamics, requiring knowledge of the nonlinear equations of motion alone rather than an extensive set of linearizations. However, the robustness properties of dynamic inversion are questionable, especially considering the uncertainties in the aerodynamic database during post-stall flight. This paper presents a simple analysis and some preliminary results of simulations with a perturbed database. It is shown that incorporating integrators into the control loops helps to improve performance in the presence of these perturbations.
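A scalar toy version of the idea, with a deliberately perturbed model standing in for aerodynamic-database uncertainty and an integrator in the outer loop, might look like this (illustrative only, not the paper's aircraft model):

```python
# Toy illustration of dynamic inversion with an integrator augmenting the loop,
# for a scalar system xdot = f(x) + g(x) u whose model is deliberately perturbed.
import numpy as np

f = lambda x: -x + x**3            # true dynamics
f_model = lambda x: -x + 0.8*x**3  # perturbed "database" used by the inverse
g = lambda x: 1.0

dt, x, x_ref, i_err = 0.001, 0.5, 1.0, 0.0
for _ in range(10000):
    e = x_ref - x
    i_err += e * dt
    v = 5.0 * e + 2.0 * i_err       # PI outer loop: integrator fights model error
    u = (v - f_model(x)) / g(x)     # dynamic inversion using the (wrong) model
    x += (f(x) + g(x) * u) * dt     # true plant response
print("steady-state tracking error: %.4f" % (x_ref - x))
```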
Database resources of the National Center for Biotechnology Information.
Sayers, Eric W; Barrett, Tanya; Benson, Dennis A; Bolton, Evan; Bryant, Stephen H; Canese, Kathi; Chetvernin, Vyacheslav; Church, Deanna M; DiCuccio, Michael; Federhen, Scott; Feolo, Michael; Fingerman, Ian M; Geer, Lewis Y; Helmberg, Wolfgang; Kapustin, Yuri; Landsman, David; Lipman, David J; Lu, Zhiyong; Madden, Thomas L; Madej, Tom; Maglott, Donna R; Marchler-Bauer, Aron; Miller, Vadim; Mizrachi, Ilene; Ostell, James; Panchenko, Anna; Phan, Lon; Pruitt, Kim D; Schuler, Gregory D; Sequeira, Edwin; Sherry, Stephen T; Shumway, Martin; Sirotkin, Karl; Slotta, Douglas; Souvorov, Alexandre; Starchenko, Grigory; Tatusova, Tatiana A; Wagner, Lukas; Wang, Yanli; Wilbur, W John; Yaschenko, Eugene; Ye, Jian
2011-01-01
In addition to maintaining the GenBank® nucleic acid sequence database, the National Center for Biotechnology Information (NCBI) provides analysis and retrieval resources for the data in GenBank and other biological data made available through the NCBI Web site. NCBI resources include Entrez, the Entrez Programming Utilities, MyNCBI, PubMed, PubMed Central (PMC), Entrez Gene, the NCBI Taxonomy Browser, BLAST, BLAST Link (BLink), Primer-BLAST, COBALT, Electronic PCR, OrfFinder, Splign, ProSplign, RefSeq, UniGene, HomoloGene, ProtEST, dbMHC, dbSNP, dbVar, Epigenomics, Cancer Chromosomes, Entrez Genomes and related tools, the Map Viewer, Model Maker, Evidence Viewer, Trace Archive, Sequence Read Archive, Retroviral Genotyping Tools, HIV-1/Human Protein Interaction Database, Gene Expression Omnibus (GEO), Entrez Probe, GENSAT, Online Mendelian Inheritance in Man (OMIM), Online Mendelian Inheritance in Animals (OMIA), the Molecular Modeling Database (MMDB), the Conserved Domain Database (CDD), the Conserved Domain Architecture Retrieval Tool (CDART), IBIS, Biosystems, Peptidome, OMSSA, Protein Clusters and the PubChem suite of small molecule databases. Augmenting many of the Web applications are custom implementations of the BLAST program optimized to search specialized data sets. All of these resources can be accessed through the NCBI home page at www.ncbi.nlm.nih.gov.
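Many of these resources are scriptable through the Entrez Programming Utilities named above; a small example query against PubMed, using the public E-utilities endpoint, could look like this (the search term is arbitrary):

```python
# Hedged example of querying one NCBI resource (PubMed) via the Entrez
# Programming Utilities; see the E-utilities documentation for usage limits.
import requests

base = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
params = {"db": "pubmed", "term": "BLAST[title]", "retmax": 5, "retmode": "json"}
r = requests.get(base, params=params, timeout=30)
ids = r.json()["esearchresult"]["idlist"]
print("first PubMed IDs for the query:", ids)
```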
Yu, Kebing; Salomon, Arthur R.
2010-01-01
Recently, dramatic progress has been achieved in expanding the sensitivity, resolution, mass accuracy, and scan rate of mass spectrometers able to fragment and identify peptides through tandem mass spectrometry (MS/MS). Unfortunately, this enhanced ability to acquire proteomic data has not been accompanied by a concomitant increase in the availability of flexible tools allowing users to rapidly assimilate, explore, and analyze this data and adapt to a variety of experimental workflows with minimal user intervention. Here we fill this critical gap by providing a flexible relational database called PeptideDepot for organization of expansive proteomic data sets, collation of proteomic data with available protein information resources, and visual comparison of multiple quantitative proteomic experiments. Our software design, built upon the synergistic combination of a MySQL database for safe warehousing of proteomic data with a FileMaker-driven graphical user interface for flexible adaptation to diverse workflows, enables proteomic end-users to directly tailor the presentation of proteomic data to the unique analysis requirements of the individual proteomics lab. PeptideDepot may be deployed as an independent software tool or integrated directly with our High Throughput Autonomous Proteomic Pipeline (HTAPP) used in the automated acquisition and post-acquisition analysis of proteomic data. PMID:19834895
Towards an integrated set of surface meteorological observations for climate science and applications
NASA Astrophysics Data System (ADS)
Dunn, Robert; Thorne, Peter
2017-04-01
We cannot predict what is not observed, and we cannot analyse what is not archived. To meet current scientific and societal demands, as well as future requirements for climate services, it is vital that the management and curation of land-based meteorological data holdings are improved. A comprehensive global set of data holdings, of known provenance, integrated across both climate variables and timescales, is required to meet the wide range of user needs. Presently, the land-based holdings are highly fractured into global, regional and national holdings for different variables and timescales, from a variety of sources, and in a mixture of formats. We present a high-level overview, based on broad community input, of the steps required to bring about this integration and progress towards such a database. Any long-term, international program creating such an integrated database will transform our collective ability to provide societally relevant research, analysis and predictions across the globe.
NASA Technical Reports Server (NTRS)
Rilee, Michael Lee; Kuo, Kwo-Sen
2017-01-01
The SpatioTemporal Adaptive Resolution Encoding (STARE) is a unifying scheme that encodes geospatial and temporal information for organizing data on scalable computing/storage resources while minimizing expensive data transfers. STARE provides a compact representation that turns set-logic functions, e.g. conditional subsetting, into integer operations, taking into account the representative spatiotemporal resolutions of the data in the datasets. STARE geo-spatiotemporally aligns the placement of diverse data on massively parallel resources to maximize performance. Automating important scientific functions (e.g. regridding) and computational functions (e.g. data placement) allows scientists to focus on domain-specific questions instead of expending their efforts and expertise on data processing. With STARE-enabled automation, SciDB (Scientific Database) plus STARE provides a database interface, reducing costly data preparation, increasing the volume and variety of interoperable data, and easing result sharing. Using SciDB plus STARE as part of an integrated analysis infrastructure dramatically eases combining diametrically different datasets.
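As a rough intuition for how such an encoding turns spatial subsetting into integer operations, here is a simplified Morton (Z-order) analogue; the actual STARE scheme differs in detail, so treat this purely as an illustration of the prefix-sharing idea.

```python
# Simplified analogue of a STARE-like index: a Morton (Z-order) code that
# interleaves discretized lat/lon bits so that nearby points typically share a
# long integer prefix, making spatial subsetting a prefix/range comparison.
def morton(lat_idx: int, lon_idx: int, bits: int = 16) -> int:
    code = 0
    for i in range(bits):
        code |= ((lat_idx >> i) & 1) << (2 * i + 1)
        code |= ((lon_idx >> i) & 1) << (2 * i)
    return code

def cell(lat: float, lon: float, bits: int = 16) -> int:
    lat_idx = int((lat + 90.0) / 180.0 * (2**bits - 1))
    lon_idx = int((lon + 180.0) / 360.0 * (2**bits - 1))
    return morton(lat_idx, lon_idx, bits)

a, b = cell(52.52, 13.40), cell(52.53, 13.41)  # two nearby points...
print(bin(a ^ b).bit_length())  # ...typically differ only in low-order bits
```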
Databases derived from human health effects research play a vital role in setting environmental standards. An underlying assumption in using these databases for standard setting purposes is that they are of adequate quality. The performance auditing program described in this ma...
DOE Office of Scientific and Technical Information (OSTI.GOV)
Chain, Patrick
Genomics — the genetic mapping and DNA sequencing of sets of genes or the complete genomes of organisms, along with related genome analysis and database work — is emerging as one of the transformative sciences of the 21st century. But current bioinformatics tools are not accessible to most biological researchers. Now, a new computational and web-based tool called EDGE Bioinformatics is working to fulfill the promise of democratizing genomics.
TARA: Tool Assisted Requirements Analysis
1988-05-01
provided during the project and to aid tool integration. Chapter 6 provides a brief discussion of the experience of specifying the ASET case study in CORE... set of Prolog clauses. This includes the context-free grammar rules depicted in Figure 2.1, integrity constraints such as those defining the binding... Jeremaes (1986). This was developed originally for specifying database management semantics (for example, the preservation of integrity constraints)
GeneSigDB: a manually curated database and resource for analysis of gene expression signatures
Culhane, Aedín C.; Schröder, Markus S.; Sultana, Razvan; Picard, Shaita C.; Martinelli, Enzo N.; Kelly, Caroline; Haibe-Kains, Benjamin; Kapushesky, Misha; St Pierre, Anne-Alyssa; Flahive, William; Picard, Kermshlise C.; Gusenleitner, Daniel; Papenhausen, Gerald; O'Connor, Niall; Correll, Mick; Quackenbush, John
2012-01-01
GeneSigDB (http://www.genesigdb.org or http://compbio.dfci.harvard.edu/genesigdb/) is a database of gene signatures that have been extracted and manually curated from the published literature. It provides a standardized resource of published prognostic, diagnostic and other gene signatures of cancer and related disease to the community so they can compare the predictive power of gene signatures or use these in gene set enrichment analysis. Since GeneSigDB release 1.0, we have expanded from 575 to 3515 gene signatures, which were collected and transcribed from 1604 published articles largely focused on gene expression in cancer, stem cells, immune cells, development and lung disease. We have made substantial upgrades to the GeneSigDB website to improve accessibility and usability, including adding a tag cloud browse function, facetted navigation and a ‘basket’ feature to store genes or gene signatures of interest. Users can analyze GeneSigDB gene signatures, or upload their own gene list, to identify gene signatures with significant gene overlap and results can be viewed on a dynamic editable heatmap that can be downloaded as a publication quality image. All data in GeneSigDB can be downloaded in numerous formats including .gmt file format for gene set enrichment analysis or as a R/Bioconductor data file. GeneSigDB is available from http://www.genesigdb.org. PMID:22110038
Kafkas, Şenay; Kim, Jee-Hyub; Pi, Xingjun; McEntyre, Johanna R
2015-01-01
In this study, we present an analysis of data citation practices in full text research articles and their corresponding supplementary data files, made available in the Open Access set of articles from Europe PubMed Central. Our aim is to investigate whether supplementary data files should be considered as a source of information for integrating the literature with biomolecular databases. Using text-mining methods to identify and extract a variety of core biological database accession numbers, we found that the supplemental data files contain many more database citations than the body of the article, and that those citations often take the form of a relatively small number of articles citing large collections of accession numbers in text-based files. Moreover, citation of value-added databases derived from submission databases (such as Pfam, UniProt or Ensembl) is common, demonstrating the reuse of these resources as datasets in themselves. All the database accession numbers extracted from the supplementary data are publicly accessible from http://dx.doi.org/10.5281/zenodo.11771. Our study suggests that supplementary data should be considered when linking articles with data, in curation pipelines, and in information retrieval tasks in order to make full use of the entire research article. These observations highlight the need to improve the management of supplemental data in general, in order to make this information more discoverable and useful.
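The kind of accession-number extraction described here can be approximated with regular expressions; the patterns below are deliberately simplified illustrations, not the authors' production rules (real accession grammars, e.g. for GenBank or PDB, have more variants and more context checks than shown).

```python
# Illustrative extraction of database accession numbers from supplementary text,
# with deliberately simplified regular expressions.
import re

patterns = {
    "GenBank-like": r"\b[A-Z]{1,2}\d{5,6}\b",   # e.g. U49845, AF231982
    "Pfam-like": r"\bPF\d{5}\b",                # e.g. PF00069
    "UniProt-like": r"\b[OPQ]\d[A-Z0-9]{3}\d\b" # e.g. P12345 (simplified)
}
text = "Sequences were deposited under AF231982; the kinase domain is PF00069."
for name, pat in patterns.items():
    print(name, re.findall(pat, text))
```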
Improved Infrastructure for CDMS and JPL Molecular Spectroscopy Catalogues
NASA Astrophysics Data System (ADS)
Endres, Christian; Schlemmer, Stephan; Drouin, Brian; Pearson, John; Müller, Holger S. P.; Schilke, P.; Stutzki, Jürgen
2014-06-01
Over the past years a new infrastructure for atomic and molecular databases has been developed within the framework of the Virtual Atomic and Molecular Data Centre (VAMDC). Standards for the representation of atomic and molecular data, as well as a set of protocols, have been established which now allow data to be retrieved from various databases through one portal and combined easily. Apart from spectroscopic databases such as the Cologne Database for Molecular Spectroscopy (CDMS), the Jet Propulsion Laboratory microwave, millimeter and submillimeter spectral line catalogue (JPL) and the HITRAN database, various databases on molecular collisions (BASECOL, KIDA) and reactions (UMIST) are connected. Together with other groups within the VAMDC consortium we are working on common user tools to simplify access for new users and to tailor data requests for users with specific needs. This comprises in particular tools to support the analysis of complex observational data obtained with the ALMA telescope. In this presentation, requests to CDMS and JPL will be used to explain the basic concepts and the tools provided by VAMDC. In addition, a new portal to CDMS will be presented which has a number of new features, in particular meaningful quantum numbers, references linked to data points, access to state energies and improved documentation. Fit files are accessible for download and queries to other databases are possible.
GeneTools--application for functional annotation and statistical hypothesis testing.
Beisvag, Vidar; Jünge, Frode K R; Bergum, Hallgeir; Jølsum, Lars; Lydersen, Stian; Günther, Clara-Cecilie; Ramampiaro, Heri; Langaas, Mette; Sandvik, Arne K; Laegreid, Astrid
2006-10-24
Modern biology has shifted from "one gene" approaches to methods for genomic-scale analysis like microarray technology, which allow simultaneous measurement of thousands of genes. This has created a need for tools facilitating interpretation of biological data in "batch" mode. However, such tools often leave the investigator with large volumes of apparently unorganized information. To meet this interpretation challenge, gene-set (or cluster) testing has become a popular analytical tool. Many gene-set testing methods and software packages are now available, most of which use a variety of statistical tests to assess the genes in a set for biological information. However, the field is still evolving, and there is a great need for "integrated" solutions. GeneTools is a web service providing access to a database that brings together information from a broad range of resources. The annotation data are updated weekly, guaranteeing that users get the most recently available data. Data submitted by the user are stored in the database, where they can easily be updated, shared between users and exported in various formats. GeneTools provides three different tools: i) the NMC Annotation Tool, which offers annotations from several databases like UniGene, Entrez Gene, SwissProt and GeneOntology, in both single- and batch-search mode; ii) the GO Annotator Tool, where users can add new gene ontology (GO) annotations to genes of interest, and these user-defined GO annotations can be used in further analysis or exported for public distribution; and iii) eGOn, a tool for visualization and statistical hypothesis testing of GO category representation. As the first GO tool, eGOn supports hypothesis testing for three different situations (master-target situation, mutually exclusive target-target situation and intersecting target-target situation). An important additional function is an evidence-code filter that allows users to select the GO annotations for the analysis. GeneTools is the first "all in one" annotation tool, providing users with rapid extraction of highly relevant gene annotation data for, e.g., thousands of genes or clones at once. It allows a user to define and archive new GO annotations and it supports hypothesis testing related to GO category representations. GeneTools is freely available through www.genetools.no
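The "master-target situation" test that eGOn supports is, at its core, a contingency-table test; a minimal sketch with invented counts, assuming a simple 2x2 Fisher's exact formulation of category over-representation:

```python
# Minimal version of a master-target GO category test: a 2x2 Fisher's exact
# test asking whether a GO category is over-represented in a target gene list
# relative to the master list. Counts here are invented.
from scipy.stats import fisher_exact

in_cat_target, target_size = 30, 200     # category genes within the target list
in_cat_master, master_size = 300, 10000  # category genes within the master list

table = [
    [in_cat_target, target_size - in_cat_target],
    [in_cat_master - in_cat_target,
     (master_size - target_size) - (in_cat_master - in_cat_target)],
]
odds, p = fisher_exact(table, alternative="greater")
print("odds ratio %.2f, p = %.2e" % (odds, p))
```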
The landslide database for Germany: Closing the gap at national level
NASA Astrophysics Data System (ADS)
Damm, Bodo; Klose, Martin
2015-11-01
The Federal Republic of Germany has long been among the few European countries that lack a national landslide database. Systematic collection and inventory of landslide data has a long research history in Germany, but one focussed on the development of databases with local or regional coverage. This has changed in recent years with the launch of a database initiative aimed at closing the data gap existing at national level. The present paper reports on this project, which is based on a landslide database that evolved over the last 15 years into one covering large parts of Germany. A strategy of systematic retrieval, extraction, and fusion of landslide data is at the heart of the methodology, providing the basis for a database with broad application potential. The database offers a data pool of more than 4,200 landslide data sets with over 13,000 single data files and dates back to the 12th century. All types of landslides are covered by the database, which stores not only core attributes, but also various complementary data, including data on landslide causes, impacts, and mitigation. The current database migration to PostgreSQL/PostGIS is focused on unlocking the full scientific potential of the database, while enabling data sharing and knowledge transfer via a web GIS platform. In this paper, the goals and the research strategy of the database project are highlighted first, with a summary of best practices in database development providing perspective. Next, the focus is on key aspects of the methodology, followed by the results of three case studies in the German Central Uplands. The case study results exemplify database application in the analysis of landslide frequency and causes, impact statistics, and landslide susceptibility modeling. Using the example of these case studies, strengths and weaknesses of the database are discussed in detail. The paper concludes with a summary of the database project with regard to previous achievements and the strategic roadmap.
Wang, Luman; Mo, Qiaochu; Wang, Jianxin
2015-01-01
Most current gene coexpression databases support analysis of the linear correlation of gene pairs, but not their nonlinear correlation, which hinders precise evaluation of gene-gene coexpression strengths. Here, we report a new database, MIrExpress, which takes advantage of information theory, as well as the Pearson linear correlation method, to measure the linear correlation, nonlinear correlation, and their hybrid for cell-specific gene coexpressions in immune cells. For a given gene pair or probe set pair input by web users, both mutual information (MI) and the Pearson correlation coefficient (r) are calculated, and several corresponding values are reported to reflect the nature of their coexpression correlation, including the MI and r values, their respective rank orderings, their rank comparison, and their hybrid correlation value. Furthermore, for a given gene, the top 10 most relevant genes are displayed from the MI, r, or hybrid perspective, respectively. Currently, the database includes 16 human cell groups, involving 20,283 human genes. The expression data and the calculated correlation results from the database are interactively accessible on the web page and can be used for other related applications and research. PMID:26881263
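The two measures reported per gene pair can be sketched as follows; the expression vectors are simulated, and the binned-MI estimator is an assumption for illustration rather than MIrExpress's exact procedure.

```python
# Sketch of the two coexpression measures reported for a gene pair: Pearson r
# for linear correlation and mutual information (on binned values) for
# nonlinear dependence. Expression vectors are simulated.
import numpy as np
from sklearn.metrics import mutual_info_score

rng = np.random.default_rng(2)
x = rng.normal(size=500)
y = x**2 + rng.normal(scale=0.1, size=500)  # strongly dependent, but nonlinear

r = np.corrcoef(x, y)[0, 1]
mi = mutual_info_score(None, None,
                       contingency=np.histogram2d(x, y, bins=10)[0])
print("Pearson r = %.2f (misses it), MI = %.2f nats (catches it)" % (r, mi))
```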
Implementation of an open adoption research data management system for clinical studies.
Müller, Jan; Heiss, Kirsten Ingmar; Oberhoffer, Renate
2017-07-06
Research institutions need to manage multiple studies with individual data sets, processing rules and different permissions. So far, there is no standard technology that provides an easy-to-use environment to create databases and user interfaces for clinical trials or research studies. Therefore various software solutions are being used, ranging from custom software explicitly designed for a specific study, to cost-intensive commercial Clinical Trial Management Systems (CTMS), to very basic approaches with self-designed Microsoft® databases. The technology applied to conduct those studies varies tremendously from study to study, making it difficult to evaluate data across studies (meta-analysis) and to keep a defined level of quality in database design, data processing, display and export. Furthermore, the systems used to collect study data are often operated redundantly to systems used in patient care. As a consequence, data collection in studies is inefficient and data quality may suffer from unsynchronized datasets, non-normalized database scenarios and manually executed data transfers. With OpenCampus Research we implemented an open adoption software (OAS) solution on an open source basis, which provides a standard environment for state-of-the-art research database management at low cost.
Kin-Driver: a database of driver mutations in protein kinases.
Simonetti, Franco L; Tornador, Cristian; Nabau-Moretó, Nuria; Molina-Vila, Miguel A; Marino-Buslje, Cristina
2014-01-01
Somatic mutations in protein kinases (PKs) are frequent driver events in many human tumors, while germ-line mutations are associated with hereditary diseases. Here we present Kin-driver, the first database that compiles driver mutations in PKs with experimental evidence demonstrating their functional role. Kin-driver is a manual expert-curated database that pays special attention to activating mutations (AMs) and can serve as a validation set to develop new generation tools focused on the prediction of gain-of-function driver mutations. It also offers an easy and intuitive environment to facilitate the visualization and analysis of mutations in PKs. Because all mutations are mapped onto a multiple sequence alignment, analogue positions between kinases can be identified and tentative new mutations can be proposed for studying by transferring annotation. Finally, our database can also be of use to clinical and translational laboratories, helping them to identify uncommon AMs that can correlate with response to new antitumor drugs. The website was developed using PHP and JavaScript, which are supported by all major browsers; the database was built using MySQL server. Kin-driver is available at: http://kin-driver.leloir.org.ar/
NASA Technical Reports Server (NTRS)
Okong'o, Nora; Bellan, Josette
2005-01-01
Models for large eddy simulation (LES) are assessed on a database obtained from direct numerical simulations (DNS) of supercritical binary-species temporal mixing layers. The analysis is performed at the DNS transitional states for heptane/nitrogen, oxygen/hydrogen and oxygen/helium mixing layers. The incorporation of simplifying assumptions that are validated on the DNS database leads to a set of LES equations that requires only models for the subgrid scale (SGS) fluxes, which arise from filtering the convective terms in the DNS equations. Constant-coefficient versions of three different models for the SGS fluxes are assessed and calibrated. The Smagorinsky SGS-flux model shows poor correlations with the SGS fluxes, while the Gradient and Similarity models have high correlations, as well as good quantitative agreement with the SGS fluxes when the calibrated coefficients are used.
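An a priori assessment of this kind can be miniaturized to one dimension: filter a synthetic field, form the exact SGS term, and correlate it against model predictions. The constants and the 1D setting are toy choices, not the paper's compressible binary-species formulation.

```python
# 1D toy a priori test: exact SGS "stress" from a filtered synthetic field,
# correlated against Smagorinsky-type and Gradient-type closures.
import numpy as np

rng = np.random.default_rng(3)
n, width = 4096, 16
u = np.cumsum(rng.normal(size=n)); u -= u.mean()   # rough multiscale signal

kern = np.ones(width) / width
bar = lambda f: np.convolve(f, kern, mode="same")  # top-hat filter

tau_exact = bar(u * u) - bar(u) * bar(u)           # exact SGS term
dudx = np.gradient(bar(u))
tau_smag = -2 * (0.17 * width) ** 2 * np.abs(dudx) * dudx  # Smagorinsky-like
tau_grad = (width**2 / 12.0) * dudx * dudx                 # Gradient-model-like

for name, tau in [("Smagorinsky", tau_smag), ("Gradient", tau_grad)]:
    print(name, "corr = %.2f" % np.corrcoef(tau_exact, tau)[0, 1])
```

In this toy setting the Gradient-type closure correlates far better with the exact term than the Smagorinsky-type one, mirroring the qualitative finding reported in the abstract.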
Grubb, Stephen C.; Bult, Carol J.; Bogue, Molly A.
2014-01-01
The Mouse Phenome Database (MPD; phenome.jax.org) was launched in 2001 as the data coordination center for the international Mouse Phenome Project. MPD integrates quantitative phenotype, gene expression and genotype data into a common annotated framework to facilitate query and analysis. MPD contains >3500 phenotype measurements or traits relevant to human health, including cancer, aging, cardiovascular disorders, obesity, infectious disease susceptibility, blood disorders, neurosensory disorders, drug addiction and toxicity. Since our 2012 NAR report, we have added >70 new data sets, including data from Collaborative Cross lines and Diversity Outbred mice. During this time we have completely revamped our homepage, improved search and navigational aspects of the MPD application, developed several web-enabled data analysis and visualization tools, annotated phenotype data to public ontologies, developed an ontology browser and released new single nucleotide polymorphism query functionality with much higher density coverage than before. Here, we summarize recent data acquisitions and describe our latest improvements. PMID:24243846
The Amateurs' Love Affair with Large Datasets
NASA Astrophysics Data System (ADS)
Price, Aaron; Jacoby, S. H.; Henden, A.
2006-12-01
Amateur astronomers are professionals in other areas. They bring expertise from such varied and technical careers as computer science, mathematics, engineering, and marketing. These skills, coupled with an enthusiasm for astronomy, can be used to help manage the large data sets coming online in the next decade. We will show specific examples where teams of amateurs have been involved in mining large, online data sets and have authored and published their own papers in peer-reviewed astronomical journals. Using the proposed LSST database as an example, we will outline a framework for involving amateurs in data analysis and education with large astronomical surveys.
Akiyama, Kenji; Kurotani, Atsushi; Iida, Kei; Kuromori, Takashi; Shinozaki, Kazuo; Sakurai, Tetsuya
2014-01-01
Arabidopsis thaliana is one of the most popular experimental plants. However, only 40% of its genes have at least one experimental Gene Ontology (GO) annotation assigned. Systematic observation of mutant phenotypes is an important technique for elucidating gene functions. Indeed, several large-scale phenotypic analyses have been performed and have generated phenotypic data sets from many Arabidopsis mutant lines and overexpressing lines, which are freely available online. Since each Arabidopsis mutant-line database describes phenotypes in its own way, differences in the structured term sets used by each database make it difficult to compare data sets and impossible to search across databases. Therefore, we obtained publicly available information for a total of 66,209 Arabidopsis mutant lines, including loss-of-function (RATM and TARAPPER) and gain-of-function (AtFOX and OsFOX) lines, and integrated the phenotype data by mapping the descriptions onto Plant Ontology (PO) and Phenotypic Quality Ontology (PATO) terms. This approach made it possible to manage the four different phenotype databases as one large data set. Here, we report a publicly accessible web-based database, the RIKEN Arabidopsis Genome Encyclopedia II (RARGE II; http://rarge-v2.psc.riken.jp/), in which all of the data described in this study are included. Using the database, we demonstrated consistency (in terms of protein function) with a previous study and identified the presumed function of an unknown gene. We provide the examples of AT1G21600, which is a subunit of the plastid-encoded RNA polymerase complex, and AT5G56980, which is related to the jasmonic acid signaling pathway.
Database Systems. Course Three. Information Systems Curriculum.
ERIC Educational Resources Information Center
O'Neil, Sharon Lund; Everett, Donna R.
This course is the third of seven in the Information Systems curriculum. The purpose of the course is to familiarize students with database management concepts and standard database management software. Databases and their roles, advantages, and limitations are explained. An overview of the course sets forth the condition and performance standard…
ERIC Educational Resources Information Center
Moore, Pam
2010-01-01
The Internet and electronic commerce (e-commerce) generate lots of data. Data must be stored, organized, and managed. Database administrators, or DBAs, work with database software to find ways to do this. They identify user needs, set up computer databases, and test systems. They ensure that systems perform as they should and add people to the…
16 CFR 1102.28 - Publication of reports of harm.
Code of Federal Regulations, 2011 CFR
2011-01-01
... REGULATIONS PUBLICLY AVAILABLE CONSUMER PRODUCT SAFETY INFORMATION DATABASE (Eff. Jan. 10, 2011) Procedural..., the Commission will publish reports of harm that meet the requirements for publication in the Database...(d) in the Database beyond the 10-business-day time frame set forth in paragraph (a) of this section...
NASA Astrophysics Data System (ADS)
Shao, Weber; Kupelian, Patrick A.; Wang, Jason; Low, Daniel A.; Ruan, Dan
2014-03-01
We devise a paradigm for representing DICOM-RT structure sets in a database management system in such a way that secondary calculations of geometric information can be performed quickly from the existing contour definitions. The implementation of this paradigm is achieved using the PostgreSQL database system and the PostGIS extension, a geographic information system commonly used for encoding geographical map data. The proposed paradigm eliminates the overhead of retrieving large data records from the database, as well as the need to implement various numerical and data parsing routines, when additional information related to the geometry of the anatomy is desired.
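A hedged sketch of what querying such a representation might look like follows; the table and column names ("structure_contours", "geom", "roi_name") are hypothetical, while ST_Area is a standard PostGIS function.

```python
# Sketch of the paradigm above: contours stored as PostGIS geometries so that
# geometric quantities come from SQL instead of re-parsing DICOM-RT files.
import psycopg2  # PostgreSQL driver

conn = psycopg2.connect("dbname=rtdb user=rt")
with conn, conn.cursor() as cur:
    cur.execute(
        """
        SELECT roi_name, SUM(ST_Area(geom)) AS total_area
        FROM structure_contours      -- hypothetical table of per-slice polygons
        GROUP BY roi_name
        ORDER BY total_area DESC;
        """
    )
    for roi, area in cur.fetchall():
        print(roi, area)
```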
Baux, David; Faugère, Valérie; Larrieu, Lise; Le Guédard-Méreuze, Sandie; Hamroun, Dalil; Béroud, Christophe; Malcolm, Sue; Claustres, Mireille; Roux, Anne-Françoise
2008-08-01
Using the Universal Mutation Database (UMD) software, we have constructed "UMD-USHbases", a set of relational databases of nucleotide variations for seven genes involved in Usher syndrome (MYO7A, CDH23, PCDH15, USH1C, USH1G, USH3A and USH2A). Mutations in the genes causing Usher syndrome type I are also recorded in non-syndromic hearing loss cases, and mutations in USH2A in non-syndromic retinitis pigmentosa. Usher syndrome provides a particular challenge for molecular diagnostics because of its clinical and molecular heterogeneity. As many mutations are missense changes, and all the genes also contain apparently non-pathogenic polymorphisms, well-curated databases are crucial for accurate interpretation of pathogenicity. Tools are provided to assess the pathogenicity of mutations, including conservation of amino acids and analysis of splice sites. Reference amino acid alignments are provided. Apparently non-pathogenic variants in patients with Usher syndrome, at both the nucleotide and amino acid level, are included. The UMD-USHbases currently contain more than 2,830 entries including disease-causing mutations, unclassified variants or non-pathogenic polymorphisms identified in over 938 patients. In addition to data collected from 89 publications, 15 novel mutations identified in our laboratory are recorded in the MYO7A (6), CDH23 (8), or PCDH15 (1) genes. Information is given on the relative involvement of the seven genes and the number and distribution of variants in each gene. UMD-USHbases give access to a software package that provides specific routines and optimized multicriteria search and sorting tools. These databases should assist clinicians and geneticists seeking information about mutations responsible for Usher syndrome.
Slimani, N; Deharveng, G; Unwin, I; Southgate, D A T; Vignat, J; Skeie, G; Salvini, S; Parpinel, M; Møller, A; Ireland, J; Becker, W; Farran, A; Westenbrink, S; Vasilopoulou, E; Unwin, J; Borgejordet, A; Rohrmann, S; Church, S; Gnagnarella, P; Casagrande, C; van Bakel, M; Niravong, M; Boutron-Ruault, M C; Stripp, C; Tjønneland, A; Trichopoulou, A; Georga, K; Nilsson, S; Mattisson, I; Ray, J; Boeing, H; Ocké, M; Peeters, P H M; Jakszyn, P; Amiano, P; Engeset, D; Lund, E; de Magistris, M Santucci; Sacerdote, C; Welch, A; Bingham, S; Subar, A F; Riboli, E
2007-09-01
This paper describes the ad hoc methodological concepts and procedures developed to improve the comparability of Nutrient databases (NDBs) across the 10 European countries participating in the European Prospective Investigation into Cancer and Nutrition (EPIC). This was required because there is currently no European reference NDB available. A large network involving national compilers, nutritionists and experts on food chemistry and computer science was set up for the 'EPIC Nutrient DataBase' (ENDB) project. A total of 550-1500 foods derived from about 37,000 standardized EPIC 24-h dietary recalls (24-HDRS) were matched as closely as possible to foods available in the 10 national NDBs. The resulting national data sets (NDS) were then successively documented, standardized and evaluated according to common guidelines and using a DataBase Management System specifically designed for this project. The nutrient values of foods unavailable or not readily available in NDSs were approximated by recipe calculation, weighted averaging or adjustment for weight changes and vitamin/mineral losses, using common algorithms. The final ENDB contains about 550-1500 foods depending on the country and 26 common components. Each component value was documented and standardized for unit, mode of expression, definition and chemical method of analysis, as far as possible. Furthermore, the overall completeness of NDSs was improved (≥99%), particularly for beta-carotene and vitamin E. The ENDB constitutes a first real attempt to improve the comparability of NDBs across European countries. This methodological work will provide a useful tool for nutritional research as well as end-user recommendations to improve NDBs in the future.
An expression database for roots of the model legume Medicago truncatula under salt stress
2009-01-01
Background Medicago truncatula is a model legume whose genome is currently being sequenced by an international consortium. Abiotic stresses such as salt stress limit plant growth and crop productivity, including those of legumes. We anticipate that studies on M. truncatula will shed light on other economically important legumes across the world. Here, we report the development of a database called MtED that contains gene expression profiles of the roots of M. truncatula based on time-course salt stress experiments using the Affymetrix Medicago GeneChip. Our hope is that MtED will provide information to assist in improving abiotic stress resistance in legumes. Description The results of our microarray experiment with roots of M. truncatula under 180 mM sodium chloride were deposited in the MtED database. Additionally, sequence and annotation information regarding microarray probe sets were included. MtED provides functional category analysis based on Gene and GeneBins Ontology, and other Web-based tools for querying and retrieving query results, browsing pathways and transcription factor families, showing metabolic maps, and comparing and visualizing expression profiles. Utilities such as mapping probe sets to the genome of M. truncatula and in-silico PCR were implemented with the BLAT software suite and are also available through the MtED database. Conclusion MtED was built in the PHP script language and as a MySQL relational database system on a Linux server. It has an integrated Web interface, which facilitates ready examination and interpretation of the results of microarray experiments. It is intended to help in selecting gene markers to improve abiotic stress resistance in legumes. MtED is available at http://bioinformatics.cau.edu.cn/MtED/. PMID:19906315
An expression database for roots of the model legume Medicago truncatula under salt stress.
Li, Daofeng; Su, Zhen; Dong, Jiangli; Wang, Tao
2009-11-11
Medicago truncatula is a model legume whose genome is currently being sequenced by an international consortium. Abiotic stresses such as salt stress limit plant growth and crop productivity, including those of legumes. We anticipate that studies on M. truncatula will shed light on other economically important legumes across the world. Here, we report the development of a database called MtED that contains gene expression profiles of the roots of M. truncatula based on time-course salt stress experiments using the Affymetrix Medicago GeneChip. Our hope is that MtED will provide information to assist in improving abiotic stress resistance in legumes. The results of our microarray experiment with roots of M. truncatula under 180 mM sodium chloride were deposited in the MtED database. Additionally, sequence and annotation information regarding microarray probe sets were included. MtED provides functional category analysis based on Gene and GeneBins Ontology, and other Web-based tools for querying and retrieving query results, browsing pathways and transcription factor families, showing metabolic maps, and comparing and visualizing expression profiles. Utilities such as mapping probe sets to the genome of M. truncatula and in-silico PCR were implemented with the BLAT software suite and are also available through the MtED database. MtED was built in the PHP script language and as a MySQL relational database system on a Linux server. It has an integrated Web interface, which facilitates ready examination and interpretation of the results of microarray experiments. It is intended to help in selecting gene markers to improve abiotic stress resistance in legumes. MtED is available at http://bioinformatics.cau.edu.cn/MtED/.
Bouadjenek, Mohamed Reda; Verspoor, Karin; Zobel, Justin
2017-07-01
We investigate and analyse the data quality of nucleotide sequence databases with the objective of automatic detection of data anomalies and suspicious records. Specifically, we demonstrate that the published literature associated with each data record can be used to automatically evaluate its quality, by cross-checking the consistency of the key content of the database record with the referenced publications. Focusing on GenBank, we describe a set of quality indicators based on the relevance paradigm of information retrieval (IR). Then, we use these quality indicators to train an anomaly detection algorithm to classify records as "confident" or "suspicious". Our experiments on the PubMed Central collection show that assessing the coherence between the literature and database records through our algorithms is an effective mechanism for assisting curators in data cleansing. Although fewer than 0.25% of the records in our data set are known to be faulty, we would expect that there are many more in GenBank that have not yet been identified. By automated comparison with the literature, such records can be identified with a precision of up to 10% and a recall of up to 30%, while strongly outperforming several baselines. While these results leave substantial room for improvement, they reflect both the very imbalanced nature of the data and the limited explicitly labelled data that is available. Overall, the obtained results show promise for the development of a new kind of approach to detecting low-quality and suspicious sequence records based on literature analysis and consistency. From a practical point of view, this will greatly help curators in identifying inconsistent records in large-scale sequence databases by highlighting records that are likely to be inconsistent with the literature.
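A stand-in for the screening step might look like the following; the three indicator features and the use of IsolationForest are assumptions made for illustration, not the authors' exact algorithm.

```python
# Illustrative record screening: score GenBank-like records by IR-derived
# quality indicators and flag outliers with a one-class model.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(4)
# columns: literature-record similarity, title overlap, accession-mention score
X = rng.normal(loc=[0.8, 0.6, 0.9], scale=0.05, size=(1000, 3))
X[:5] = rng.normal(loc=[0.2, 0.1, 0.3], scale=0.05, size=(5, 3))  # faulty records

clf = IsolationForest(contamination=0.01, random_state=0).fit(X)
labels = clf.predict(X)               # -1 = "suspicious", 1 = "confident"
print("flagged records:", np.flatnonzero(labels == -1))
```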
'Seed + expand': a general methodology for detecting publication oeuvres of individual researchers.
Reijnhoudt, Linda; Costas, Rodrigo; Noyons, Ed; Börner, Katy; Scharnhorst, Andrea
2014-01-01
The study of science at the individual scholar level requires the disambiguation of author names. The creation of authors' publication oeuvres involves matching the list of unique author names to names used in publication databases. Despite recent progress in the development of unique author identifiers, e.g., ORCID, VIVO, or DAI, author disambiguation remains a key problem when it comes to large-scale bibliometric analysis using data from multiple databases. This study introduces and tests a new methodology called seed + expand for semi-automatic bibliographic data collection for a given set of individual authors. Specifically, we identify the oeuvre of a set of Dutch full professors during the period 1980-2011. In particular, we combine author records from the Dutch National Research Information System (NARCIS) with publication records from the Web of Science. Starting with an initial list of 8,378 names, we identify 'seed publications' for each author using five different approaches. Subsequently, we 'expand' the set of publications using three different approaches. The different approaches are compared and the resulting oeuvres are evaluated on precision and recall using a 'gold standard' dataset of authors for which verified publications in the period 2001-2010 are available.
Secure searching of biomarkers through hybrid homomorphic encryption scheme.
Kim, Miran; Song, Yongsoo; Cheon, Jung Hee
2017-07-26
As genome sequencing technology develops rapidly, there has lately been an increasing need to keep genomic data secure even when stored in the cloud and still used for research. We are interested in designing a protocol for the secure outsourcing of the matching problem on encrypted data. We propose an efficient method to securely search a matching position with the query data and extract some information at the position. After decryption, only a small number of comparisons with the query information need to be performed in plaintext state. We apply this method to find a set of biomarkers in encrypted genomes. The important feature of our method is to encode a genomic database as a single element of a polynomial ring. Since our method requires a single homomorphic multiplication of the hybrid scheme for query computation, it has an advantage over previous methods in parameter size, computation complexity, and communication cost. In particular, the extraction procedure not only prevents leakage of database information that has not been queried by the user but also reduces the communication cost by half. We evaluate the performance of our method and verify that the computation on large-scale personal data can be securely and practically outsourced to a cloud environment during data analysis. It takes about 3.9 s to search-and-extract the reference and alternate sequences at the queried position in a database of size 4M. Our solution for finding a set of biomarkers in DNA sequences shows the progress of cryptographic techniques in terms of their capability to support real-world genome data analysis in a cloud environment.
Automated generation and ensemble-learned matching of X-ray absorption spectra
NASA Astrophysics Data System (ADS)
Zheng, Chen; Mathew, Kiran; Chen, Chi; Chen, Yiming; Tang, Hanmei; Dozier, Alan; Kas, Joshua J.; Vila, Fernando D.; Rehr, John J.; Piper, Louis F. J.; Persson, Kristin A.; Ong, Shyue Ping
2018-12-01
X-ray absorption spectroscopy (XAS) is a widely used materials characterization technique to determine oxidation states, coordination environment, and other local atomic structure information. Analysis of XAS relies on comparison of measured spectra to reliable reference spectra. However, existing databases of XAS spectra are highly limited both in terms of the number of reference spectra available and the breadth of chemistry coverage. In this work, we report the development of XASdb, a large database of computed reference XAS, and an Ensemble-Learned Spectra IdEntification (ELSIE) algorithm for the matching of spectra. XASdb currently hosts more than 800,000 K-edge X-ray absorption near-edge spectra (XANES) for over 40,000 materials from the open-science Materials Project database. We discuss a high-throughput automation framework for FEFF calculations, built on robust, rigorously benchmarked parameters. FEFF is a computer program that uses a real-space Green's function approach to calculate X-ray absorption spectra. We demonstrate that the ELSIE algorithm, which combines 33 weak "learners", each comprising a set of preprocessing steps and a similarity metric, can achieve up to 84.2% accuracy in identifying the correct oxidation state and coordination environment on a test set of 19 K-edge XANES spectra encompassing a diverse range of chemistries and crystal structures. The XASdb with the ELSIE algorithm has been integrated into a web application in the Materials Project, providing an important new public resource for the analysis of XAS to all materials researchers. Finally, the ELSIE algorithm itself has been made available as part of veidt, an open source machine-learning library for materials science.
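The ensemble idea can be miniaturized as below: each weak learner is a (preprocessing, similarity) pair, and reference spectra are ranked by votes. The spectra here are synthetic Gaussians and the real 33 learners differ, so this is an illustration of the voting structure only.

```python
# Toy rendering of an ELSIE-style ensemble: (preprocessing, similarity) pairs
# each vote for a best-matching reference spectrum.
import numpy as np

rng = np.random.default_rng(5)
grid = np.linspace(0, 1, 200)
refs = {f"ref{i}": np.exp(-((grid - c) ** 2) / 0.01)
        for i, c in enumerate([0.3, 0.5, 0.7])}
query = refs["ref1"] + rng.normal(scale=0.05, size=grid.size)  # noisy "ref1"

preps = [lambda s: s,
         lambda s: (s - s.mean()) / s.std(),   # standardize
         lambda s: np.gradient(s)]             # derivative spectrum
sims = [lambda a, b: -np.sum((a - b) ** 2),    # negative squared distance
        lambda a, b: np.corrcoef(a, b)[0, 1]]  # correlation

votes = {}
for prep in preps:
    for sim in sims:
        best = max(refs, key=lambda k: sim(prep(query), prep(refs[k])))
        votes[best] = votes.get(best, 0) + 1
print("ensemble match:", max(votes, key=votes.get), votes)
```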
Rémy, Vanessa; Mathevet, Patrice; Vainchtock, Alexandre
2009-12-01
Literature on the epidemiology of vulvar and vaginal cancers is scarce. The incidence of these diseases seems to be increasing. It has been reported that about 40% of vulvar and 70% of vaginal cancers may be linked to human papillomavirus (HPV). This study aimed to assess the medical burden associated with hospitalizations and management of vulvar and vaginal cancers and dysplasia (VIN and VaIN) in France. A retrospective analysis using the French national hospital database (PMSI) was performed to assess the annual number of patients hospitalized for vulvar and vaginal cancers and VIN/VaIN, based on hospital admissions in 2006. Data for all stays and chemotherapy/radiotherapy sessions were extracted. SAE database (Statistiques annuelles des établissements de santé) was used to take into account patients who had radiotherapy sessions performed in the private sector which are not reported in the PMSI. In 2006, 1237 and 623 patients were hospitalized for vulvar cancer and VIN, respectively. There were also 728 and 244 patients hospitalized for vaginal cancer and VaIN, respectively. Overall, about 35% of all patients were hospitalized in the private setting. For all lesions except vaginal cancer, surgery was the most common type of management. For vaginal cancer, medical care was the most prevalent (52%), followed by surgery (31%). The burden of hospitalizations due to vulvar and vaginal cancers is substantial. Further research is needed to assess the outpatient burden due to these diseases especially for precancerous dysplasia which may be mostly managed in an outpatient setting.
cPath: open source software for collecting, storing, and querying biological pathways
Cerami, Ethan G; Bader, Gary D; Gross, Benjamin E; Sander, Chris
2006-01-01
Background Biological pathways, including metabolic pathways, protein interaction networks, signal transduction pathways, and gene regulatory networks, are currently represented in over 220 diverse databases. These data are crucial for the study of specific biological processes, including human diseases. Standard exchange formats for pathway information, such as BioPAX, CellML, SBML and PSI-MI, enable convenient collection of this data for biological research, but mechanisms for common storage and communication are required. Results We have developed cPath, an open source database and web application for collecting, storing, and querying biological pathway data. cPath makes it easy to aggregate custom pathway data sets available in standard exchange formats from multiple databases, present pathway data to biologists via a customizable web interface, and export pathway data via a web service to third-party software, such as Cytoscape, for visualization and analysis. cPath is software only, and does not include new pathway information. Key features include: a built-in identifier mapping service for linking identical interactors and linking to external resources; built-in support for PSI-MI and BioPAX standard pathway exchange formats; a web service interface for searching and retrieving pathway data sets; and thorough documentation. The cPath software is freely available under the LGPL open source license for academic and commercial use. Conclusion cPath is a robust, scalable, modular, professional-grade software platform for collecting, storing, and querying biological pathways. It can serve as the core data handling component in information systems for pathway visualization, analysis and modeling. PMID:17101041
NASA Astrophysics Data System (ADS)
Kadow, Christopher; Illing, Sebastian; Kunst, Oliver; Schartner, Thomas; Kirchner, Ingo; Rust, Henning W.; Cubasch, Ulrich; Ulbrich, Uwe
2016-04-01
The Freie Univ Evaluation System Framework (Freva - freva.met.fu-berlin.de) is a software infrastructure for standardized data and tool solutions in Earth system science. Freva runs on high performance computers to handle customizable evaluation systems of research projects, institutes or universities. It combines different software technologies into one common hybrid infrastructure, including all features present in the shell and web environment. The database interface satisfies the international standards provided by the Earth System Grid Federation (ESGF). Freva indexes different data projects into one common search environment by storing the metadata information of the self-describing model, reanalysis and observational data sets in a database. This metadata system, with its advanced but easy-to-handle search tool, supports users, developers and their plugins in retrieving the required information. A generic application programming interface (API) allows scientific developers to connect their analysis tools with the evaluation system independently of the programming language used. Users of the evaluation techniques benefit from the common interface of the evaluation system without any need to understand the different scripting languages. Facilitating the provision and usage of tools and climate data automatically increases the number of scientists working with the data sets and identifying discrepancies. The integrated web shell (shellinabox) adds a degree of freedom in the choice of the working environment and can be used as a gate to the research project's HPC resources. Plugins are able to integrate, for example, their post-processed results into the database of the user. This allows post-processing plugins to feed statistical analysis plugins, which fosters an active exchange between plugin developers of a research project. Additionally, the history and configuration sub-system stores every analysis performed with the evaluation system in a database. Configurations and results of the tools can be shared among scientists via the shell or web system, so plugged-in tools benefit from transparency and reproducibility. Furthermore, if configurations match when an evaluation plugin is started, the system suggests using results already produced by other users - saving CPU hours, I/O, disk space and time. The efficient interaction between different technologies improves the Earth system modeling science framed by Freva.
Salgueiro, Ana Rita; Pereira, Henrique Garcia; Rico, Maria-Teresa; Benito, Gerado; Díez-Herreo, Andrés
2008-02-01
A new statistical approach for preliminary risk evaluation of tailings dam breakage is presented and illustrated by a case study regarding the Mediterranean region. The objective of the proposed method is to establish an empirical scale of risk, from which guidelines for prioritizing the collection of further specific information can be derived. The method relies on a historical database containing, in essence, two sets of qualitative data: the first set concerns the variables that are observable before the disaster (e.g., type and size of the dam, its location, and state of activity), and the second refers to the consequences of the disaster (e.g., failure type, sludge characteristics, fatalities categorization, and downstream range of damage). Based on a modified form of correspondence analysis, where the second set of attributes is projected as "supplementary variables" onto the axes provided by the eigenvalue decomposition of the matrix referring to the first set, a "qualitative regression" is performed, relating the variables to be predicted (contained in the second set) with the "predictors" (the observable variables). On the grounds of the previously derived relationship, the risk of breakage in a new case can be evaluated from its observable variables. The method was applied in a case study regarding a set of 13 test sites, where the ranking of risk obtained was validated by expert knowledge. Once validated, the procedure was included in the final output of the e-EcoRisk UE project (A Regional Enterprise Network Decision-Support System for Environmental Risk and Disaster Management of Large-Scale Industrial Spills), allowing for dynamic historical database updating and providing a prompt rough risk evaluation for a new case. The aim of this section of the global project is to provide a quantified context, based on failure cases that occurred in the past, for supporting analogue reasoning in preventing similar situations.
Maintaining Multimedia Data in a Geospatial Database
2012-09-01
A different look at PostgreSQL and MySQL as spatial databases was offered: in benchmarks, each database produced result sets ranging from zero to 100,000 rows, and each excelled under different query conditions.
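As a rough illustration of what such a spatial-database benchmark involves, the sketch below times a query on a DB-API connection and records the result-set size; the connection string, table and column names are hypothetical placeholders, not the study's actual schema.

```python
# Sketch of a timing harness for comparing spatial queries across PostgreSQL
# (PostGIS) and MySQL; schema and connection details are placeholders.
import time

def time_query(conn, sql, params=(), repeats=5):
    """Run a query several times on a DB-API connection; return the best wall
    time and the size of the result set (0 to 100,000 rows in such tests)."""
    best, rows = float("inf"), 0
    cur = conn.cursor()
    for _ in range(repeats):
        start = time.perf_counter()
        cur.execute(sql, params)
        rows = len(cur.fetchall())
        best = min(best, time.perf_counter() - start)
    return best, rows

# Example (PostGIS flavour); a MySQL run would swap the connection and the
# spatial predicate, e.g. ST_Distance_Sphere instead of ST_DWithin.
# import psycopg2
# conn = psycopg2.connect("dbname=geodb")
# t, n = time_query(conn,
#     "SELECT id FROM landmarks "
#     "WHERE ST_DWithin(geom::geography, ST_MakePoint(%s, %s)::geography, %s)",
#     (-121.87, 36.60, 1000.0))
```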
Monitoring of IaaS and scientific applications on the Cloud using the Elasticsearch ecosystem
NASA Astrophysics Data System (ADS)
Bagnasco, S.; Berzano, D.; Guarise, A.; Lusso, S.; Masera, M.; Vallero, S.
2015-05-01
The private cloud at the Torino INFN computing centre offers IaaS services to different scientific computing applications. The infrastructure is managed with the OpenNebula cloud controller. The main stakeholders of the facility are a grid Tier-2 site for the ALICE collaboration at the LHC, an interactive analysis facility for the same experiment, and a grid Tier-2 site for the BES-III collaboration, plus an increasing number of other small tenants. Besides keeping track of usage, the automation of dynamic resource allocation to tenants requires detailed monitoring and accounting of resource usage. As a first step towards this, we set up a monitoring system to inspect the site's activities both in terms of IaaS and of the applications running on the hosted virtual instances. For this purpose we used the Elasticsearch, Logstash and Kibana stack. In the current implementation, the heterogeneous accounting information is fed to different MySQL databases and sent to Elasticsearch via a custom Logstash plugin. For the IaaS metering, we developed sensors for the OpenNebula API; the IaaS-level information gathered through the API is sent to the MySQL database through a purpose-built RESTful web service, which is also used for other accounting purposes. At the application level, we used the ROOT plugin TProofMonSenderSQL to collect accounting data from the interactive analysis facility. The BES-III virtual instances were previously monitored with Zabbix; as a proof of concept, we also retrieve the information contained in the Zabbix database. Each of these three cases is indexed separately in Elasticsearch. We are now considering dropping the intermediate SQL layer and evaluating a NoSQL option as a single central database for all monitoring information. We set up Kibana dashboards with pre-defined queries to monitor the relevant information in each case. In this way we have achieved a uniform monitoring interface for both the IaaS and the scientific applications, mostly leveraging off-the-shelf tools.
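As an illustration of the accounting flow described above, the sketch below reads rows from a MySQL accounting table and indexes them into a dedicated Elasticsearch index. In the production setup this step is performed by the custom Logstash plugin; the index, table and credential names here are placeholders.

```python
# Minimal sketch of the MySQL -> Elasticsearch accounting flow; index names,
# table names and credentials are illustrative placeholders.
import mysql.connector
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
db = mysql.connector.connect(user="monitor", password="secret",
                             host="localhost", database="accounting")

cur = db.cursor(dictionary=True)
cur.execute("SELECT vm_id, tenant, cpu_hours, recorded_at FROM iaas_usage")

# Each source (IaaS metering, PROOF analysis facility, Zabbix) goes to its
# own index, so the Kibana dashboards can query them independently.
for row in cur:
    row["recorded_at"] = row["recorded_at"].isoformat()
    es.index(index="iaas-accounting", document=row)
```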
Spreadsheets for Analyzing and Optimizing Space Missions
NASA Technical Reports Server (NTRS)
Some, Raphael R.; Agrawal, Anil K.; Czikmantory, Akos J.; Weisbin, Charles R.; Hua, Hook; Neff, Jon M.; Cowdin, Mark A.; Lewis, Brian S.; Iroz, Juana; Ross, Rick
2009-01-01
XCALIBR (XML Capability Analysis LIBRary) is a set of Extensible Markup Language (XML) database and spreadsheet-based analysis software tools designed to assist in technology return-on-investment analysis and optimization of technology portfolios pertaining to outer-space missions. XCALIBR is also being examined for use in planning, tracking, and documentation of projects. An XCALIBR database contains information on mission requirements and technological capabilities, which are related through an XML taxonomy. XCALIBR incorporates a standardized interface for exporting data and analysis templates to an Excel spreadsheet. Unique features of XCALIBR include the following: it is inherently hierarchical by virtue of its XML basis; the XML taxonomy codifies a comprehensive data structure and data dictionary that includes performance metrics for spacecraft, sensors, and spacecraft systems other than sensors; the taxonomy contains >700 nodes representing all levels, from system through subsystem to individual parts; all entries are searchable and machine readable; there is an intuitive Web-based user interface; the software automatically matches technologies to mission requirements; and the software automatically generates an Excel return-on-investment analysis tool and makes the required entries in it. The results of an analysis are presented in both tabular and graphical displays.
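To illustrate the requirement-to-capability matching that XCALIBR automates, the sketch below walks a toy XML taxonomy and pairs each mission requirement with the technologies whose performance metric satisfies it. The element and attribute names are invented for illustration and do not reflect the actual XCALIBR schema.

```python
# Sketch of requirement-to-capability matching over an XML taxonomy in the
# spirit of XCALIBR; the schema below is invented for illustration -- the
# real taxonomy has >700 nodes and its own data dictionary.
import xml.etree.ElementTree as ET

DOC = """
<taxonomy>
  <requirement id="R1" metric="sensor.resolution_m" needed="0.5"/>
  <capability id="T7" metric="sensor.resolution_m" achieved="0.3"/>
  <capability id="T9" metric="sensor.resolution_m" achieved="1.2"/>
</taxonomy>
"""

root = ET.fromstring(DOC)
for req in root.iter("requirement"):
    # A capability matches when it addresses the same metric node and meets
    # or beats the required value (for resolution, smaller is better).
    matches = [cap.get("id") for cap in root.iter("capability")
               if cap.get("metric") == req.get("metric")
               and float(cap.get("achieved")) <= float(req.get("needed"))]
    print(req.get("id"), "->", matches)   # R1 -> ['T7']
```

The matched pairs would then be written into the Excel analysis template through the standardized export interface mentioned above.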