Sample records for complex biological databases

  1. Use of Graph Database for the Integration of Heterogeneous Biological Data.

    PubMed

    Yoon, Byoung-Ha; Kim, Seon-Kyu; Kim, Seon-Young

    2017-03-01

    Understanding complex relationships among heterogeneous biological data is one of the fundamental goals in biology. In most cases, diverse biological data are stored in relational databases, such as MySQL and Oracle, which store data in multiple tables and then infer relationships by multiple-join statements. Recently, a new type of database, called the graph-based database, was developed to natively represent various kinds of complex relationships, and it is widely used among computer science communities and IT industries. Here, we demonstrate the feasibility of using a graph-based database for complex biological relationships by comparing the performance between MySQL and Neo4j, one of the most widely used graph databases. We collected various biological data (protein-protein interaction, drug-target, gene-disease, etc.) from several existing sources, removed duplicate and redundant data, and finally constructed a graph database containing 114,550 nodes and 82,674,321 relationships. When we tested the query execution performance of MySQL versus Neo4j, we found that Neo4j outperformed MySQL in all cases. While Neo4j exhibited a very fast response for various queries, MySQL exhibited latent or unfinished responses for complex queries with multiple-join statements. These results show that using graph-based databases, such as Neo4j, is an efficient way to store complex biological relationships. Moreover, querying a graph database in diverse ways has the potential to reveal novel relationships among heterogeneous biological data.
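The contrast drawn in this abstract between multiple-join statements and native relationship traversal can be illustrated with a minimal Python sketch. It uses the standard-library `sqlite3` module as a stand-in for a relational engine and a plain adjacency dictionary as a stand-in for a graph store; it is not the authors' benchmark, and the node names, relationship labels, and function names are hypothetical.

```python
import sqlite3
from collections import defaultdict

# Toy edges (hypothetical identifiers, not taken from the paper's data set).
EDGES = [
    ("drugA", "targets", "protein1"),
    ("protein1", "interacts_with", "protein2"),
    ("protein2", "associated_with", "diseaseX"),
    ("drugB", "targets", "protein3"),
]

def diseases_via_sql(drug):
    """Relational style: one self-join of the edge table per hop."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE edge (src TEXT, rel TEXT, dst TEXT)")
    con.executemany("INSERT INTO edge VALUES (?, ?, ?)", EDGES)
    rows = con.execute(
        """
        SELECT e3.dst
        FROM edge AS e1
        JOIN edge AS e2 ON e2.src = e1.dst
        JOIN edge AS e3 ON e3.src = e2.dst
        WHERE e1.src = ?
          AND e1.rel = 'targets'
          AND e2.rel = 'interacts_with'
          AND e3.rel = 'associated_with'
        """,
        (drug,),
    ).fetchall()
    con.close()
    return sorted({row[0] for row in rows})

def diseases_via_graph(drug):
    """Graph style: follow adjacency lists hop by hop from the start node."""
    adjacency = defaultdict(list)
    for src, rel, dst in EDGES:
        adjacency[(src, rel)].append(dst)
    frontier = [drug]
    for rel in ("targets", "interacts_with", "associated_with"):
        frontier = [dst for node in frontier for dst in adjacency[(node, rel)]]
    return sorted(set(frontier))
```

Both functions answer the same drug-to-disease question. The SQL version needs one additional join for every extra hop, whereas the traversal version only extends the list of relationship labels it follows.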

  2. Use of Graph Database for the Integration of Heterogeneous Biological Data

    PubMed Central

    Yoon, Byoung-Ha; Kim, Seon-Kyu

    2017-01-01

    Understanding complex relationships among heterogeneous biological data is one of the fundamental goals in biology. In most cases, diverse biological data are stored in relational databases, such as MySQL and Oracle, which store data in multiple tables and then infer relationships by multiple-join statements. Recently, a new type of database, called the graph-based database, was developed to natively represent various kinds of complex relationships, and it is widely used among computer science communities and IT industries. Here, we demonstrate the feasibility of using a graph-based database for complex biological relationships by comparing the performance between MySQL and Neo4j, one of the most widely used graph databases. We collected various biological data (protein-protein interaction, drug-target, gene-disease, etc.) from several existing sources, removed duplicate and redundant data, and finally constructed a graph database containing 114,550 nodes and 82,674,321 relationships. When we tested the query execution performance of MySQL versus Neo4j, we found that Neo4j outperformed MySQL in all cases. While Neo4j exhibited a very fast response for various queries, MySQL exhibited latent or unfinished responses for complex queries with multiple-join statements. These results show that using graph-based databases, such as Neo4j, is an efficient way to store complex biological relationships. Moreover, querying a graph database in diverse ways has the potential to reveal novel relationships among heterogeneous biological data. PMID:28416946

  3. The 2015 Nucleic Acids Research Database Issue and molecular biology database collection.

    PubMed

    Galperin, Michael Y; Rigden, Daniel J; Fernández-Suárez, Xosé M

    2015-01-01

    The 2015 Nucleic Acids Research Database Issue contains 172 papers that include descriptions of 56 new molecular biology databases, and updates on 115 databases whose descriptions have been previously published in NAR or other journals. Following the classification that has been introduced last year in order to simplify navigation of the entire issue, these articles are divided into eight subject categories. This year's highlights include RNAcentral, an international community portal to various databases on noncoding RNA; ValidatorDB, a validation database for protein structures and their ligands; SASBDB, a primary repository for small-angle scattering data of various macromolecular complexes; MoonProt, a database of 'moonlighting' proteins, and two new databases of protein-protein and other macromolecular complexes, ComPPI and the Complex Portal. This issue also includes an unusually high number of cancer-related databases and other databases dedicated to genomic basics of disease and potential drugs and drug targets. The size of NAR online Molecular Biology Database Collection, http://www.oxfordjournals.org/nar/database/a/, remained approximately the same, following the addition of 74 new resources and removal of 77 obsolete web sites. The entire Database Issue is freely available online on the Nucleic Acids Research web site (http://nar.oxfordjournals.org/). Published by Oxford University Press on behalf of Nucleic Acids Research 2014. This work is written by (a) US Government employee(s) and is in the public domain in the US.

  4. Reactome graph database: Efficient access to complex pathway data

    PubMed Central

    Korninger, Florian; Viteri, Guilherme; Marin-Garcia, Pablo; Ping, Peipei; Wu, Guanming; Stein, Lincoln; D’Eustachio, Peter

    2018-01-01

    Reactome is a free, open-source, open-data, curated and peer-reviewed knowledgebase of biomolecular pathways. One of its main priorities is to provide easy and efficient access to its high quality curated data. At present, biological pathway databases typically store their contents in relational databases. This limits access efficiency because there are performance issues associated with queries traversing highly interconnected data. The same data in a graph database can be queried more efficiently. Here we present the rationale behind the adoption of a graph database (Neo4j) as well as the new ContentService (REST API) that provides access to these data. The Neo4j graph database and its query language, Cypher, provide efficient access to the complex Reactome data model, facilitating easy traversal and knowledge discovery. The adoption of this technology greatly improved query efficiency, reducing the average query time by 93%. The web service built on top of the graph database provides programmatic access to Reactome data by object oriented queries, but also supports more complex queries that take advantage of the new underlying graph-based data storage. By adopting graph database technology we are providing a high performance pathway data resource to the community. The Reactome graph database use case shows the power of NoSQL database engines for complex biological data types. PMID:29377902
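The kind of traversal over highly interconnected data that this abstract describes can be sketched in Python as a breadth-first walk over a pathway hierarchy. This is only an illustration under stated assumptions: the identifiers and the `HAS_EVENT` mapping are hypothetical and do not reproduce the Reactome data model, and in a graph query language such as Cypher the same walk would typically be written as a single variable-length path pattern.

```python
from collections import deque

# Hypothetical pathway hierarchy: parent event -> child events.
HAS_EVENT = {
    "pathway:immune_system": ["pathway:innate", "pathway:adaptive"],
    "pathway:innate": ["reaction:r1", "reaction:r2"],
    "pathway:adaptive": ["pathway:tcr_signaling"],
    "pathway:tcr_signaling": ["reaction:r3"],
}

def descendant_reactions(root):
    """Breadth-first walk over has-event edges, collecting reaction leaves."""
    found, seen, queue = [], {root}, deque([root])
    while queue:
        node = queue.popleft()
        for child in HAS_EVENT.get(node, []):
            if child in seen:
                continue  # guard against revisiting shared sub-pathways
            seen.add(child)
            if child.startswith("reaction:"):
                found.append(child)
            else:
                queue.append(child)
    return found
```

Calling `descendant_reactions("pathway:immune_system")` collects reactions at any depth below the root without the caller needing to know how deep the hierarchy is, which is the access pattern that becomes awkward when each level requires its own relational join.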

  5. Reactome graph database: Efficient access to complex pathway data.

    PubMed

    Fabregat, Antonio; Korninger, Florian; Viteri, Guilherme; Sidiropoulos, Konstantinos; Marin-Garcia, Pablo; Ping, Peipei; Wu, Guanming; Stein, Lincoln; D'Eustachio, Peter; Hermjakob, Henning

    2018-01-01

    Reactome is a free, open-source, open-data, curated and peer-reviewed knowledgebase of biomolecular pathways. One of its main priorities is to provide easy and efficient access to its high quality curated data. At present, biological pathway databases typically store their contents in relational databases. This limits access efficiency because there are performance issues associated with queries traversing highly interconnected data. The same data in a graph database can be queried more efficiently. Here we present the rationale behind the adoption of a graph database (Neo4j) as well as the new ContentService (REST API) that provides access to these data. The Neo4j graph database and its query language, Cypher, provide efficient access to the complex Reactome data model, facilitating easy traversal and knowledge discovery. The adoption of this technology greatly improved query efficiency, reducing the average query time by 93%. The web service built on top of the graph database provides programmatic access to Reactome data by object oriented queries, but also supports more complex queries that take advantage of the new underlying graph-based data storage. By adopting graph database technology we are providing a high performance pathway data resource to the community. The Reactome graph database use case shows the power of NoSQL database engines for complex biological data types.

  6. Data warehousing in molecular biology.

    PubMed

    Schönbach, C; Kowalski-Saunders, P; Brusic, V

    2000-05-01

    In the business and healthcare sectors data warehousing has provided effective solutions for information usage and knowledge discovery from databases. However, data warehousing applications in the biological research and development (R&D) sector are lagging far behind. The fuzziness and complexity of biological data represent a major challenge in data warehousing for molecular biology. By combining experiences in other domains with our findings from building a model database, we have defined the requirements for data warehousing in molecular biology.

  7. BIOSPIDA: A Relational Database Translator for NCBI.

    PubMed

    Hagen, Matthew S; Lee, Eva K

    2010-11-13

    As the volume and availability of biological databases continue their widespread growth, it has become increasingly difficult for research scientists to identify all relevant information for biological entities of interest. Details of nucleotide sequences, gene expression, molecular interactions, and three-dimensional structures are maintained across many different databases. Retrieving all necessary information requires an integrated system that can query multiple databases with minimal overhead. This paper introduces a universal parser and relational schema translator that can be used for all NCBI databases in Abstract Syntax Notation (ASN.1). The data models for OMIM, Entrez-Gene, Pubmed, MMDB and GenBank have been successfully converted into relational databases, and all are easily linkable, helping to answer complex biological questions. These tools enable research scientists to integrate NCBI databases locally without significant workload or development time.
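The general idea of translating a nested record into linkable relational tables can be sketched as follows. This is a minimal illustration, not BIOSPIDA's implementation: it does not parse ASN.1, the nested dictionary merely stands in for an already-parsed record, and the field names, table names, and values are all made up for the example.

```python
import sqlite3

# A nested record standing in for a parsed ASN.1 value (all values made up).
RECORD = {
    "gene_id": 1,
    "symbol": "GENE1",
    "synonyms": ["G1", "GENE-ONE"],
    "pubmed_ids": [101, 102],
}

def load_record(con, record):
    """Scalar fields go to the parent table; repeated fields go to child tables."""
    con.execute("CREATE TABLE IF NOT EXISTS gene (gene_id INTEGER PRIMARY KEY, symbol TEXT)")
    con.execute("CREATE TABLE IF NOT EXISTS gene_synonym (gene_id INTEGER, synonym TEXT)")
    con.execute("CREATE TABLE IF NOT EXISTS gene_pubmed (gene_id INTEGER, pmid INTEGER)")
    gid = record["gene_id"]
    con.execute("INSERT INTO gene VALUES (?, ?)", (gid, record["symbol"]))
    con.executemany(
        "INSERT INTO gene_synonym VALUES (?, ?)",
        [(gid, s) for s in record["synonyms"]],
    )
    con.executemany(
        "INSERT INTO gene_pubmed VALUES (?, ?)",
        [(gid, p) for p in record["pubmed_ids"]],
    )

def pmids_for_symbol(con, symbol):
    """Link the gene table to its literature table through the shared key."""
    rows = con.execute(
        "SELECT p.pmid FROM gene AS g "
        "JOIN gene_pubmed AS p ON p.gene_id = g.gene_id "
        "WHERE g.symbol = ? ORDER BY p.pmid",
        (symbol,),
    ).fetchall()
    return [row[0] for row in rows]
```

After `load_record(sqlite3.connect(":memory:"), RECORD)`, the shared `gene_id` key is what makes the tables linkable, so a question that spans gene and literature data becomes a single join.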

  8. Structure-Based Characterization of Multiprotein Complexes

    PubMed Central

    Wiederstein, Markus; Gruber, Markus; Frank, Karl; Melo, Francisco; Sippl, Manfred J.

    2014-01-01

    Summary: Multiprotein complexes govern virtually all cellular processes. Their 3D structures provide important clues to their biological roles, especially through structural correlations among protein molecules and complexes. The detection of such correlations generally requires comprehensive searches in databases of known protein structures by means of appropriate structure-matching techniques. Here, we present a high-speed structure search engine capable of instantly matching large protein oligomers against the complete and up-to-date database of biologically functional assemblies of protein molecules. We use this tool to reveal unseen structural correlations on the level of protein quaternary structure and demonstrate its general usefulness for efficiently exploring complex structural relationships among known protein assemblies. PMID:24954616

  9. BIOSPIDA: A Relational Database Translator for NCBI

    PubMed Central

    Hagen, Matthew S.; Lee, Eva K.

    2010-01-01

    As the volume and availability of biological databases continue their widespread growth, it has become increasingly difficult for research scientists to identify all relevant information for biological entities of interest. Details of nucleotide sequences, gene expression, molecular interactions, and three-dimensional structures are maintained across many different databases. Retrieving all necessary information requires an integrated system that can query multiple databases with minimal overhead. This paper introduces a universal parser and relational schema translator that can be used for all NCBI databases in Abstract Syntax Notation (ASN.1). The data models for OMIM, Entrez-Gene, Pubmed, MMDB and GenBank have been successfully converted into relational databases, and all are easily linkable, helping to answer complex biological questions. These tools enable research scientists to integrate NCBI databases locally without significant workload or development time. PMID:21347013

  10. Structure-based characterization of multiprotein complexes.

    PubMed

    Wiederstein, Markus; Gruber, Markus; Frank, Karl; Melo, Francisco; Sippl, Manfred J

    2014-07-08

    Multiprotein complexes govern virtually all cellular processes. Their 3D structures provide important clues to their biological roles, especially through structural correlations among protein molecules and complexes. The detection of such correlations generally requires comprehensive searches in databases of known protein structures by means of appropriate structure-matching techniques. Here, we present a high-speed structure search engine capable of instantly matching large protein oligomers against the complete and up-to-date database of biologically functional assemblies of protein molecules. We use this tool to reveal unseen structural correlations on the level of protein quaternary structure and demonstrate its general usefulness for efficiently exploring complex structural relationships among known protein assemblies. Copyright © 2014 The Authors. Published by Elsevier Inc. All rights reserved.

  11. CORUM: the comprehensive resource of mammalian protein complexes

    PubMed Central

    Ruepp, Andreas; Brauner, Barbara; Dunger-Kaltenbach, Irmtraud; Frishman, Goar; Montrone, Corinna; Stransky, Michael; Waegele, Brigitte; Schmidt, Thorsten; Doudieu, Octave Noubibou; Stümpflen, Volker; Mewes, H. Werner

    2008-01-01

    Protein complexes are key molecular entities that integrate multiple gene products to perform cellular functions. The CORUM (http://mips.gsf.de/genre/proj/corum/index.html) database is a collection of experimentally verified mammalian protein complexes. Information is manually derived by expert annotators through critical reading of the scientific literature. Information about protein complexes includes protein complex names, subunits, literature references as well as the function of the complexes. For functional annotation, we use the FunCat catalogue, which makes it possible to organize the protein complex space into biologically meaningful subsets. The database contains more than 1750 protein complexes that are built from 2400 different genes, thus representing 12% of the protein-coding genes in human. A web-based system is available to query, view and download the data. CORUM provides a comprehensive dataset of protein complexes for discoveries in systems biology, analyses of protein networks and protein complex-associated diseases. Comparable to the MIPS reference dataset of protein complexes from yeast, CORUM intends to serve as a reference for mammalian protein complexes. PMID:17965090

  12. A dedicated database system for handling multi-level data in systems biology.

    PubMed

    Pornputtapong, Natapol; Wanichthanarak, Kwanjeera; Nilsson, Avlant; Nookaew, Intawat; Nielsen, Jens

    2014-01-01

    Advances in high-throughput technologies have enabled extensive generation of multi-level omics data. These data are crucial for systems biology research, though they are complex, heterogeneous, highly dynamic, incomplete and distributed among public databases. This leads to difficulties in data accessibility and often results in errors when data are merged and integrated from varied resources. Therefore, integration and management of systems biological data remain very challenging. To overcome this, we designed and developed a dedicated database system that can serve and solve the vital issues in data management and hereby facilitate data integration, modeling and analysis in systems biology within a sole database. In addition, a yeast data repository was implemented as an integrated database environment which is operated by the database system. Two applications were implemented to demonstrate extensibility and utilization of the system. Both illustrate how the user can access the database via the web query function and implemented scripts. These scripts are specific for two sample cases: 1) Detecting the pheromone pathway in protein interaction networks; and 2) Finding metabolic reactions regulated by Snf1 kinase. In this study we present the design of database system which offers an extensible environment to efficiently capture the majority of biological entities and relations encountered in systems biology. Critical functions and control processes were designed and implemented to ensure consistent, efficient, secure and reliable transactions. The two sample cases on the yeast integrated data clearly demonstrate the value of a sole database environment for systems biology research.

  13. MIMO: an efficient tool for molecular interaction maps overlap

    PubMed Central

    2013-01-01

    Background: Molecular pathways represent an ensemble of interactions occurring among molecules within the cell and between cells. The identification of similarities between molecular pathways across organisms and functions has a critical role in understanding complex biological processes. To infer such novel information, the comparison of molecular pathways must account for imperfect matches (flexibility) and efficiently handle complex network topologies. To date, these characteristics are only partially available in tools designed to compare molecular interaction maps. Results: Our approach, MIMO (Molecular Interaction Maps Overlap), addresses the first problem by allowing the introduction of gaps and mismatches between query and template pathways and permits, when necessary, supervised queries incorporating a priori biological information. It then addresses the second issue by relying directly on the rich graph topology described in the Systems Biology Markup Language (SBML) standard, and uses multidigraphs to efficiently handle multiple queries on biological graph databases. The algorithm has been successfully used here to highlight the contact point between various human pathways in the Reactome database. Conclusions: MIMO offers a flexible and efficient graph-matching tool for comparing complex biological pathways. PMID:23672344
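The notion of matching a query pathway against a template while tolerating gaps can be sketched with a simple path check over a multidigraph. This is not the MIMO algorithm, whose details are not given in the abstract; it is a simplified stand-in that counts skipped intermediate template nodes as gaps. The edge list, labels, and function names are hypothetical.

```python
from collections import deque

# Template pathway as a multidigraph: parallel edges between the same pair
# of nodes are allowed and distinguished by label (all names hypothetical).
TEMPLATE_EDGES = [
    ("A", "B", "activates"),
    ("A", "B", "binds"),           # parallel edge between A and B
    ("B", "C", "phosphorylates"),
    ("C", "D", "activates"),
]

def hops(edges, start, goal):
    """Fewest edges from start to goal, or None if goal is unreachable."""
    successors = {}
    for src, dst, _label in edges:
        successors.setdefault(src, set()).add(dst)
    dist, queue = {start: 0}, deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return dist[node]
        for nxt in successors.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return None

def matches_with_gaps(edges, query, max_gaps):
    """True if consecutive query nodes are connected in the template while
    skipping at most max_gaps intermediate template nodes in total."""
    gaps = 0
    for a, b in zip(query, query[1:]):
        n = hops(edges, a, b)
        if n is None or n == 0:    # unreachable, or the same node repeated
            return False
        gaps += n - 1              # a path of n edges skips n - 1 nodes
    return gaps <= max_gaps
```

For example, the query `["A", "C", "D"]` fails with `max_gaps=0` because reaching C from A skips B, but succeeds with `max_gaps=1`. A fuller treatment would also score label mismatches, which this sketch ignores.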

  14. NPIDB: Nucleic acid-Protein Interaction DataBase.

    PubMed

    Kirsanov, Dmitry D; Zanegina, Olga N; Aksianov, Evgeniy A; Spirin, Sergei A; Karyagina, Anna S; Alexeevski, Andrei V

    2013-01-01

    The Nucleic acid-Protein Interaction DataBase (http://npidb.belozersky.msu.ru/) contains information derived from structures of DNA-protein and RNA-protein complexes extracted from the Protein Data Bank (3846 complexes in October 2012). It provides a web interface and a set of tools for extracting biologically meaningful characteristics of nucleoprotein complexes. The content of the database is updated weekly. The current version of the Nucleic acid-Protein Interaction DataBase is an upgrade of the version published in 2007. The improvements include a new web interface, new tools for calculation of intermolecular interactions, a classification of SCOP families that contains DNA-binding protein domains and data on conserved water molecules on the DNA-protein interface.

  15. microRNAs Databases: Developmental Methodologies, Structural and Functional Annotations.

    PubMed

    Singh, Nagendra Kumar

    2017-09-01

    microRNA (miRNA) is an endogenous, evolutionarily conserved non-coding RNA involved in post-transcriptional gene repression and mRNA cleavage through formation of the RNA-induced silencing complex (RISC). In RISC, miRNA binds by complementary base pairing to the targeted mRNA together with the Argonaute protein complex, causes gene repression or endonucleolytic cleavage of mRNAs, and results in many diseases and syndromes. After the discovery of the miRNAs lin-4 and let-7, large numbers of miRNAs involved in various biological and metabolic processes were discovered by low-throughput and high-throughput experimental techniques together with computational methods. miRNAs are important non-coding RNAs for understanding the complex biological phenomena of an organism because they control gene regulation. This paper reviews miRNA databases with structural and functional annotations developed by various researchers. These databases contain structural and functional information on animal, plant and virus miRNAs, including miRNA-associated diseases, stress resistance in plants, the roles of miRNAs in various biological processes, the effects of miRNA interactions on drugs and the environment, the effects of variance on miRNAs, miRNA gene expression analysis, and the sequences and structures of miRNAs. This review focuses on the developmental methodology of miRNA databases, such as the computational tools and methods used to extract miRNA annotations from different resources or through experiment. This study also discusses the efficiency of the user interface design of every database, along with the current entries and annotations of miRNAs (pathways, gene ontology, disease ontology, etc.). An integrated schematic diagram of the construction process for databases is also drawn, along with tabular and graphical comparisons of various types of entries in different databases. The aim of this paper is to present the importance of miRNA-related resources in a single place.

  16. A Comprehensive Database and Analysis Framework To Incorporate Multiscale Data Types and Enable Integrated Analysis of Bioactive Polyphenols.

    PubMed

    Ho, Lap; Cheng, Haoxiang; Wang, Jun; Simon, James E; Wu, Qingli; Zhao, Danyue; Carry, Eileen; Ferruzzi, Mario G; Faith, Jeremiah; Valcarcel, Breanna; Hao, Ke; Pasinetti, Giulio M

    2018-03-05

    The development of a given botanical preparation for eventual clinical application requires extensive, detailed characterizations of the chemical composition, as well as the biological availability, biological activity, and safety profiles of the botanical. These issues are typically addressed using diverse experimental protocols and model systems. Based on this consideration, in this study we established a comprehensive database and analysis framework for the collection, collation, and integrative analysis of diverse, multiscale data sets. Using this framework, we conducted an integrative analysis of heterogeneous data from in vivo and in vitro investigation of a complex bioactive dietary polyphenol-rich preparation (BDPP) and built an integrated network linking data sets generated from this multitude of diverse experimental paradigms. We established a comprehensive database and analysis framework as well as a systematic and logical means to catalogue and collate the diverse array of information gathered, which is securely stored and added to in a standardized manner to enable fast query. We demonstrated the utility of the database in (1) a statistical ranking scheme to prioritize response to treatments and (2) in depth reconstruction of functionality studies. By examination of these data sets, the system allows analytical querying of heterogeneous data and the access of information related to interactions, mechanism of actions, functions, etc., which ultimately provide a global overview of complex biological responses. Collectively, we present an integrative analysis framework that leads to novel insights on the biological activities of a complex botanical such as BDPP that is based on data-driven characterizations of interactions between BDPP-derived phenolic metabolites and their mechanisms of action, as well as synergism and/or potential cancellation of biological functions. 
    Our integrative analytical approach provides novel means for a systematic integrative analysis of heterogeneous data types in the development of complex botanicals such as polyphenols for eventual clinical and translational applications.

  17. Just Working with the Cellular Machine: A High School Game for Teaching Molecular Biology

    ERIC Educational Resources Information Center

    Cardoso, Fernanda Serpa; Dumpel, Renata; Gomes da Silva, Luisa B.; Rodrigues, Carlos R.; Santos, Dilvani O.; Cabral, Lucio Mendes; Castro, Helena C.

    2008-01-01

    Molecular biology is a difficult subject to comprehend because of its high complexity, and it therefore requires new teaching approaches. Herein, we developed an interdisciplinary board game involving the human immune system response against a bacterial infection for teaching molecular biology at high school. Initially, we created a database with several…

  18. Protein-protein interaction predictions using text mining methods.

    PubMed

    Papanikolaou, Nikolas; Pavlopoulos, Georgios A; Theodosiou, Theodosios; Iliopoulos, Ioannis

    2015-03-01

    It is beyond any doubt that proteins and their interactions play an essential role in most complex biological processes. The understanding of their function individually, but also in the form of protein complexes is of a great importance. Nowadays, despite the plethora of various high-throughput experimental approaches for detecting protein-protein interactions, many computational methods aiming to predict new interactions have appeared and gained interest. In this review, we focus on text-mining based computational methodologies, aiming to extract information for proteins and their interactions from public repositories such as literature and various biological databases. We discuss their strengths, their weaknesses and how they complement existing experimental techniques by simultaneously commenting on the biological databases which hold such information and the benchmark datasets that can be used for evaluating new tools. Copyright © 2014 Elsevier Inc. All rights reserved.

  19. Knowledge Discovery in Biological Databases for Revealing Candidate Genes Linked to Complex Phenotypes.

    PubMed

    Hassani-Pak, Keywan; Rawlings, Christopher

    2017-06-13

    Genetics and "omics" studies designed to uncover genotype to phenotype relationships often identify large numbers of potential candidate genes, among which the causal genes are hidden. Scientists generally lack the time and technical expertise to review all relevant information available from the literature, from key model species and from a potentially wide range of related biological databases in a variety of data formats with variable quality and coverage. Computational tools are needed for the integration and evaluation of heterogeneous information in order to prioritise candidate genes and components of interaction networks that, if perturbed through potential interventions, have a positive impact on the biological outcome in the whole organism without producing negative side effects. Here we review several bioinformatics tools and databases that play an important role in biological knowledge discovery and candidate gene prioritization. We conclude with several key challenges that need to be addressed in order to facilitate biological knowledge discovery in the future.

  20. From metaphor to practices: The introduction of "information engineers" into the first DNA sequence database.

    PubMed

    García-Sancho, Miguel

    2011-01-01

    This paper explores the introduction of professional systems engineers and information management practices into the first centralized DNA sequence database, developed at the European Molecular Biology Laboratory (EMBL) during the 1980s. In so doing, it complements the literature on the emergence of an information discourse after World War II and its subsequent influence in biological research. By analyzing the careers of the database creators and the computer algorithms they designed, I show that from the mid-1960s onwards information in biology gradually shifted from a pervasive metaphor to being embodied in practices and professionals such as those incorporated at the EMBL. I then investigate the reception of these database professionals by the EMBL biological staff, which evolved from initial disregard to necessary collaboration as the relationship between DNA, genes, and proteins turned out to be more complex than expected. The trajectories of the database professionals at the EMBL suggest that the initial subject matter of the historiography of genomics should be the long-standing practices that emerged after World War II and to a large extent originated outside biomedicine and academia. Only after addressing these practices may historians turn to their further disciplinary assemblage in fields such as bioinformatics or biotechnology.

  1. BioBenchmark Toyama 2012: an evaluation of the performance of triple stores on biological data

    PubMed Central

    2014-01-01

    Background: Biological databases vary enormously in size and data complexity, from small databases that contain a few million Resource Description Framework (RDF) triples to large databases that contain billions of triples. In this paper, we evaluate whether RDF native stores can be used to meet the needs of a biological database provider. Prior evaluations have used synthetic data with a limited database size. For example, the largest BSBM benchmark uses 1 billion synthetic e-commerce knowledge RDF triples on a single node. However, real-world biological data differs considerably from the simple synthetic data. It is difficult to determine whether the synthetic e-commerce data is efficient enough to represent biological databases. Therefore, for this evaluation, we used five real data sets from biological databases. Results: We evaluated five triple stores, 4store, Bigdata, Mulgara, Virtuoso, and OWLIM-SE, with five biological data sets, Cell Cycle Ontology, Allie, PDBj, UniProt, and DDBJ, ranging in size from approximately 10 million to 8 billion triples. For each database, we loaded all the data into our single node and prepared the database for use in a classical data warehouse scenario. Then, we ran a series of SPARQL queries against each endpoint and recorded the execution time and the accuracy of the query response. Conclusions: Our paper shows that, with appropriate configuration, Virtuoso and OWLIM-SE can satisfy the basic requirements to load and query biological data of fewer than about 8 billion triples on a single node, with simultaneous access by 64 clients. OWLIM-SE performs best for databases with approximately 11 million triples; for data sets that contain 94 million and 590 million triples, OWLIM-SE and Virtuoso perform best, and neither shows an overwhelming advantage over the other; for data over 4 billion triples, Virtuoso works best. 4store performs well on small data sets with limited features when the number of triples is less than 100 million, but our tests show that its scalability is poor; Bigdata demonstrates average performance and is a good open-source triple store for mid-sized data sets (500 million triples or so); Mulgara shows some fragility. PMID:25089180
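The measurement loop described in this abstract, running a query and recording both execution time and the correctness of the response, can be sketched with a tiny in-memory triple set. This is an illustration only: it does not use any of the benchmarked stores or a SPARQL engine, the matching function handles just a single triple pattern, and all terms and function names are made up.

```python
import time

# Tiny in-memory triple set (made-up terms, not from the benchmark data).
TRIPLES = {
    ("ex:P1", "rdf:type", "ex:Protein"),
    ("ex:P2", "rdf:type", "ex:Protein"),
    ("ex:P1", "ex:encodedBy", "ex:G1"),
    ("ex:G1", "rdf:type", "ex:Gene"),
}

def match(pattern, triples=TRIPLES):
    """Return triples matching (s, p, o); None acts as a wildcard variable."""
    return sorted(
        t for t in triples
        if all(want is None or want == got for want, got in zip(pattern, t))
    )

def timed(fn, *args):
    """Run fn once and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start
```

A call such as `timed(match, (None, "rdf:type", "ex:Protein"))` returns the matching triples together with the elapsed time, so the result can be compared against an expected answer while the timing is logged. A real benchmark would repeat each query many times and against far larger data.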

  2. BioBenchmark Toyama 2012: an evaluation of the performance of triple stores on biological data.

    PubMed

    Wu, Hongyan; Fujiwara, Toyofumi; Yamamoto, Yasunori; Bolleman, Jerven; Yamaguchi, Atsuko

    2014-01-01

    Biological databases vary enormously in size and data complexity, from small databases that contain a few million Resource Description Framework (RDF) triples to large databases that contain billions of triples. In this paper, we evaluate whether RDF native stores can be used to meet the needs of a biological database provider. Prior evaluations have used synthetic data with a limited database size. For example, the largest BSBM benchmark uses 1 billion synthetic e-commerce knowledge RDF triples on a single node. However, real-world biological data differs considerably from the simple synthetic data. It is difficult to determine whether the synthetic e-commerce data is efficient enough to represent biological databases. Therefore, for this evaluation, we used five real data sets from biological databases. We evaluated five triple stores, 4store, Bigdata, Mulgara, Virtuoso, and OWLIM-SE, with five biological data sets, Cell Cycle Ontology, Allie, PDBj, UniProt, and DDBJ, ranging in size from approximately 10 million to 8 billion triples. For each database, we loaded all the data into our single node and prepared the database for use in a classical data warehouse scenario. Then, we ran a series of SPARQL queries against each endpoint and recorded the execution time and the accuracy of the query response. Our paper shows that, with appropriate configuration, Virtuoso and OWLIM-SE can satisfy the basic requirements to load and query biological data of fewer than about 8 billion triples on a single node, with simultaneous access by 64 clients. OWLIM-SE performs best for databases with approximately 11 million triples; for data sets that contain 94 million and 590 million triples, OWLIM-SE and Virtuoso perform best, and neither shows an overwhelming advantage over the other; for data over 4 billion triples, Virtuoso works best. 4store performs well on small data sets with limited features when the number of triples is less than 100 million, but our tests show that its scalability is poor; Bigdata demonstrates average performance and is a good open-source triple store for mid-sized data sets (500 million triples or so); Mulgara shows some fragility.

  3. MultitaskProtDB: a database of multitasking proteins.

    PubMed

    Hernández, Sergio; Ferragut, Gabriela; Amela, Isaac; Perez-Pons, JosepAntoni; Piñol, Jaume; Mozo-Villarias, Angel; Cedano, Juan; Querol, Enrique

    2014-01-01

    We have compiled MultitaskProtDB, available online at http://wallace.uab.es/multitask, to provide a repository where the many multitasking proteins found in the literature can be stored. Multitasking or moonlighting is the capability of some proteins to execute two or more biological functions. Usually, multitasking proteins are experimentally revealed by serendipity. This ability of proteins to perform multitasking functions helps us to understand one of the ways used by cells to perform many complex functions with a limited number of genes. Even so, the study of this phenomenon is complex because, among other things, there is no database of moonlighting proteins. The existence of such a tool facilitates the collection and dissemination of these important data. This work reports the database, MultitaskProtDB, which is designed as a user-friendly web page containing >288 multitasking proteins with their NCBI and UniProt accession numbers, canonical and additional biological functions, monomeric/oligomeric states, PDB codes when available and bibliographic references. This database also serves to gain insight into some characteristics of multitasking proteins such as frequencies of the different pairs of functions, phylogenetic conservation and so forth.

  4. PDB-wide collection of binding data: current status of the PDBbind database.

    PubMed

    Liu, Zhihai; Li, Yan; Han, Li; Li, Jie; Liu, Jie; Zhao, Zhixiong; Nie, Wei; Liu, Yuchen; Wang, Renxiao

    2015-02-01

    Molecular recognition between biological macromolecules and organic small molecules plays an important role in various life processes. Both structural information and binding data of biomolecular complexes are indispensable for depicting the underlying mechanism in such an event. The PDBbind database was created to collect experimentally measured binding data for the biomolecular complexes throughout the Protein Data Bank (PDB). It thus provides the linkage between structural information and energetic properties of biomolecular complexes, which is especially desirable for computational studies or statistical analyses. Since its first public release in 2004, the PDBbind database has been updated on an annual basis. The latest release (version 2013) provides experimental binding affinity data for 10,776 biomolecular complexes in PDB, including 8302 protein-ligand complexes and 2474 other types of complexes. In this article, we will describe the current methods used for compiling PDBbind and the updated status of this database. We will also review some typical applications of PDBbind published in the scientific literature. All contents of this database are freely accessible at the PDBbind-CN Web server at http://www.pdbbind-cn.org/.

  5. Network-based drug discovery by integrating systems biology and computational technologies

    PubMed Central

    Leung, Elaine L.; Cao, Zhi-Wei; Jiang, Zhi-Hong; Zhou, Hua

    2013-01-01

    Network-based intervention has become a trend in treating systemic diseases, but it relies on regimen optimization and valid multi-target actions of the drugs. The complex multi-component nature of medicinal herbs may serve as a valuable resource for network-based multi-target drug discovery because of their potential synergistic treatment effects. Recently, multiple robust systems biology platforms have proven powerful in uncovering molecular mechanisms and connections between drugs and the dynamic networks they target. However, optimization methods for drug combinations remain insufficient, owing to the lack of tight integration across multiple ‘-omics’ databases. Newly developed algorithm- or network-based computational models can tightly integrate ‘-omics’ databases and optimize combinational regimens in drug development, which encourages the development of medicinal herbs into a new wave of network-based multi-target drugs. However, further integration of medicinal herb databases with multiple systems biology platforms for multi-target drug optimization remains challenging because of uncertainty about the reliability of individual data sets and about the breadth, depth and degree of standardization of herbal medicine data. Standardizing the methodology and terminology of systems biology and herbal databases would facilitate this integration. Enhancing publicly accessible databases and increasing the number of studies that apply systems biology platforms to herbal medicine would also be helpful. Further integration across various ‘-omics’ platforms and computational tools would accelerate the development of network-based drug discovery and network medicine. PMID:22877768

  6. PubSearch and PubFetch: a simple management system for semiautomated retrieval and annotation of biological information from the literature.

    PubMed

    Yoo, Danny; Xu, Iris; Berardini, Tanya Z; Rhee, Seung Yon; Narayanasamy, Vijay; Twigger, Simon

    2006-03-01

    For most systems in biology, a large body of literature exists that describes the complexity of the system based on experimental results. Manual review of this literature to extract targeted information into biological databases is difficult and time consuming. To address this problem, we developed PubSearch and PubFetch, which store literature, keyword, and gene information in a relational database, index the literature with keywords and gene names, and provide a Web user interface for annotating the genes from experimental data found in the associated literature. A set of protocols is provided in this unit for installing, populating, running, and using PubSearch and PubFetch. In addition, we provide support protocols for performing controlled vocabulary annotations. Intended users of PubSearch and PubFetch are database curators and biology researchers interested in tracking the literature and capturing information about genes of interest in a more effective way than with conventional spreadsheets and lab notebooks.

  7. Ontological interpretation of biomedical database content.

    PubMed

    Santana da Silva, Filipe; Jansen, Ludger; Freitas, Fred; Schulz, Stefan

    2017-06-26

    Biological databases store data about laboratory experiments, together with semantic annotations, in order to support data aggregation and retrieval. The exact meaning of such annotations in the context of a database record is often ambiguous. We address this problem by grounding implicit and explicit database content in a formal-ontological framework. By using a typical extract from the databases UniProt and Ensembl, annotated with content from GO, PR, ChEBI and NCBI Taxonomy, we created four ontological models (in OWL), which generate explicit, distinct interpretations under the BioTopLite2 (BTL2) upper-level ontology. The first three models interpret database entries as individuals (IND), defined classes (SUBC), and classes with dispositions (DISP), respectively; the fourth model (HYBR) is a combination of SUBC and DISP. For the evaluation of these four models, we consider (i) database content retrieval, using ontologies as query vocabulary; (ii) information completeness; and, (iii) DL complexity and decidability. The models were tested under these criteria against four competency questions (CQs). IND does not raise any ontological claim, besides asserting the existence of sample individuals and relations among them. Modelling patterns have to be created for each type of annotation referent. SUBC is interpreted regarding maximally fine-grained defined subclasses under the classes referred to by the data. DISP attempts to extract truly ontological statements from the database records, claiming the existence of dispositions. HYBR is a hybrid of SUBC and DISP and is more parsimonious regarding expressiveness and query answering complexity. For each of the four models, the four CQs were submitted as DL queries. This shows the ability to retrieve individuals with IND, and classes in SUBC and HYBR. DISP does not retrieve anything because the axioms with disposition are embedded in General Class Inclusion (GCI) statements. 
Ambiguity of biological database content is addressed by a method that identifies implicit knowledge behind semantic annotations in biological databases and grounds it in an expressive upper-level ontology. The result is a seamless representation of database structure, content and annotations as OWL models.

  8. MultitaskProtDB: a database of multitasking proteins

    PubMed Central

    Hernández, Sergio; Ferragut, Gabriela; Amela, Isaac; Perez-Pons, JosepAntoni; Piñol, Jaume; Mozo-Villarias, Angel; Cedano, Juan; Querol, Enrique

    2014-01-01

    We have compiled MultitaskProtDB, available online at http://wallace.uab.es/multitask, to provide a repository where the many multitasking proteins found in the literature can be stored. Multitasking or moonlighting is the capability of some proteins to execute two or more biological functions. Usually, multitasking proteins are experimentally revealed by serendipity. This ability of proteins to perform multitasking functions helps us to understand one of the ways used by cells to perform many complex functions with a limited number of genes. Even so, the study of this phenomenon is complex because, among other things, there is no database of moonlighting proteins. The existence of such a tool facilitates the collection and dissemination of these important data. This work reports the database, MultitaskProtDB, which is designed as a user-friendly web page containing >288 multitasking proteins with their NCBI and UniProt accession numbers, canonical and additional biological functions, monomeric/oligomeric states, PDB codes when available and bibliographic references. This database also serves to gain insight into some characteristics of multitasking proteins such as frequencies of the different pairs of functions, phylogenetic conservation and so forth. PMID:24253302

  9. Biological data warehousing system for identifying transcriptional regulatory sites from gene expressions of microarray data.

    PubMed

    Tsou, Ann-Ping; Sun, Yi-Ming; Liu, Chia-Lin; Huang, Hsien-Da; Horng, Jorng-Tzong; Tsai, Meng-Feng; Liu, Baw-Juine

    2006-07-01

    Identification of transcriptional regulatory sites plays an important role in the investigation of gene regulation. For this purpose, we designed and implemented a data warehouse to integrate multiple heterogeneous biological data sources with data types such as text-file, XML, image, MySQL database model, and Oracle database model. The utility of the biological data warehouse in predicting transcriptional regulatory sites of coregulated genes was explored using a synexpression group derived from a microarray study. Both the binding sites of known transcription factors and predicted over-represented (OR) oligonucleotides were demonstrated for the gene group. The potential biological roles of both known nucleotides and one OR nucleotide were demonstrated using bioassays. Therefore, the results from the wet-lab experiments reinforce the power and utility of the data warehouse as an approach to the genome-wide search for important transcription regulatory elements that are the key to many complex biological systems.

  10. SQL is Dead; Long-live SQL: Relational Database Technology in Science Contexts

    NASA Astrophysics Data System (ADS)

    Howe, B.; Halperin, D.

    2014-12-01

    Relational databases are often perceived as a poor fit in science contexts: rigid schemas, poor support for complex analytics, unpredictable performance, and significant maintenance and tuning requirements often make databases unattractive in science contexts characterized by heterogeneous data sources, complex analysis tasks, rapidly changing requirements, and limited IT budgets. In this talk, I'll argue that although the value proposition of typical relational database systems is weak in science, the core ideas that power relational databases have become incredibly prolific in open source science software, and are emerging as a universal abstraction for both big data and small data. In addition, I'll talk about two open source systems we are building to "jailbreak" the core technology of relational databases and adapt them for use in science. The first is SQLShare, a Database-as-a-Service system supporting collaborative data analysis and exchange by reducing database use to an Upload-Query-Share workflow with no installation, schema design, or configuration required. The second is Myria, a service that supports much larger scale data, complex analytics, and multiple back end systems. Finally, I'll describe some of the ways our collaborators in oceanography, astronomy, biology, fisheries science, and more are using these systems to replace script-based workflows for reasons of performance, flexibility, and convenience.

  11. Evaluation of "shotgun" proteomics for identification of biological threat agents in complex environmental matrixes: experimental simulations.

    PubMed

    Verberkmoes, Nathan C; Hervey, W Judson; Shah, Manesh; Land, Miriam; Hauser, Loren; Larimer, Frank W; Van Berkel, Gary J; Goeringer, Douglas E

    2005-02-01

    There is currently a great need for rapid detection and positive identification of biological threat agents, as well as microbial species in general, directly from complex environmental samples. This need is most urgent in the area of homeland security, but also extends into medical, environmental, and agricultural sciences. Mass-spectrometry-based analysis is one of the leading technologies in the field with a diversity of different methodologies for biothreat detection. Over the past few years, "shotgun" proteomics has become one method of choice for the rapid analysis of complex protein mixtures by mass spectrometry. Recently, it was demonstrated that this methodology is capable of distinguishing a target species against a large database of background species from a single-component sample or dual-component mixtures at roughly the same concentration. Here, we examine the potential of shotgun proteomics to analyze a target species in a background of four contaminant species. We tested the capability of a common commercial mass-spectrometry-based shotgun proteomics platform for the detection of the target species (Escherichia coli) at four different concentrations and four different time points of analysis. We also tested the effect of database size on positive identification of the four microbes used in this study by testing a small (13-species) database and a large (261-species) database. The results clearly indicated that this technology could easily identify the target species at 20% in the background mixture at a 60, 120, 180, or 240 min analysis time with the small database. The results also indicated that the target species could easily be identified at 20% or 6% but could not be identified at 0.6% or 0.06% in either a 240 min analysis or a 30 h analysis with the small database. The large database had a severe effect on the target species: detection above the background was impossible at any concentration used in this study, though the three other microbes used in this study were clearly identified above the background when analyzed with the large database. This study points to the potential application of this technology for biological threat agent detection but highlights many areas of needed research before the technology will be useful in real world samples.

  12. The BiolAD-DB system: an informatics system for clinical and genetic data.

    PubMed

    Nielsen, David A; Leidner, Marty; Haynes, Chad; Krauthammer, Michael; Kreek, Mary Jeanne

    2007-01-01

    The Biology of Addictive Diseases-Database (BiolAD-DB) system is a research bioinformatics system for archiving, analyzing, and processing of complex clinical and genetic data. The database schema employs design principles for handling complex clinical information, such as response items in genetic questionnaires. Data access and validation is provided by the BiolAD-DB client application, which features a data validation engine tightly coupled to a graphical user interface. Data integrity is provided by the password-protected BiolAD-DB SQL compliant server and database. BiolAD-DB tools further provide functionalities for generating customized reports and views. The BiolAD-DB system schema, client, and installation instructions are freely available at http://www.rockefeller.edu/biolad-db/.

  13. BioCarian: search engine for exploratory searches in heterogeneous biological databases.

    PubMed

    Zaki, Nazar; Tennakoon, Chandana

    2017-10-02

    There are a large number of biological databases publicly available for scientists on the web. Also, there are many private databases generated in the course of research projects. These databases are in a wide variety of formats. Web standards have evolved in recent times and semantic web technologies are now available to interconnect diverse and heterogeneous sources of data. Therefore, integration and querying of biological databases can be facilitated by techniques used in the semantic web. Heterogeneous databases can be converted into Resource Description Framework (RDF) and queried using the SPARQL language. Searching for exact queries in these databases is trivial. However, exploratory searches need customized solutions, especially when multiple databases are involved. This process is cumbersome and time consuming for those without a sufficient background in computer science. In this context, a search engine facilitating exploratory searches of databases would be of great help to the scientific community. We present BioCarian, an efficient and user-friendly search engine for performing exploratory searches on biological databases. The search engine is an interface for SPARQL queries over RDF databases. We note that many of the databases can be converted to tabular form. We first convert the tabular databases to RDF. The search engine provides a graphical interface based on facets to explore the converted databases. The facet interface is more advanced than conventional facets. It allows complex queries to be constructed, and has additional features such as ranking of facet values based on several criteria, visually indicating the relevance of a facet value and presenting the most important facet values when a large number of choices are available. For advanced users, SPARQL queries can be run directly on the databases. Using this feature, users will be able to incorporate federated searches of SPARQL endpoints. We used the search engine to do an exploratory search on previously published viral integration data and were able to deduce the main conclusions of the original publication. BioCarian is accessible via http://www.biocarian.com. We have developed a search engine to explore RDF databases that can be used by both novice and advanced users.
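
    The table-to-RDF conversion and facet ranking described in this abstract can be illustrated with a minimal sketch. It is not BioCarian's implementation: the table contents, the example.org namespace, and the functions table_to_triples and facet_values are all hypothetical, and plain Python tuples stand in for a real RDF store and SPARQL endpoint.

```python
import csv
import io
from collections import Counter

# Hypothetical tabular input; column names and values are invented for illustration.
TABLE = """id,virus,gene,tissue
s1,HBV,TERT,liver
s2,HBV,MLL4,liver
s3,HPV,MYC,cervix
s4,HBV,TERT,liver
"""

BASE = "http://example.org/"  # placeholder namespace, not a real BioCarian URI


def table_to_triples(text, key="id"):
    """Turn each non-key cell into a (subject, predicate, object) triple."""
    triples = []
    for row in csv.DictReader(io.StringIO(text)):
        subject = BASE + "row/" + row[key]
        for column, value in row.items():
            if column != key:
                triples.append((subject, BASE + "col/" + column, value))
    return triples


def facet_values(triples, predicate, selected=None):
    """Rank the values of one facet by how many matching subjects carry them.

    `selected` maps predicate -> required value (the facets already chosen).
    """
    selected = selected or {}
    subjects = {s for s, _, _ in triples}
    for pred, val in selected.items():
        subjects &= {s for s, p, o in triples if p == pred and o == val}
    counts = Counter(o for s, p, o in triples if p == predicate and s in subjects)
    return counts.most_common()


triples = table_to_triples(TABLE)
gene = BASE + "col/gene"
virus = BASE + "col/virus"
# Genes ranked by frequency among rows whose virus facet is "HBV".
ranked = facet_values(triples, gene, {virus: "HBV"})
print(ranked)
```

    Ranking by count is only one possible criterion; the abstract mentions several criteria without naming them, so the choice here is an assumption.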

  14. DNAproDB: an interactive tool for structural analysis of DNA–protein complexes

    PubMed Central

    Sagendorf, Jared M.

    2017-01-01

    Many biological processes are mediated by complex interactions between DNA and proteins. Transcription factors, various polymerases, nucleases and histones recognize and bind DNA with different levels of binding specificity. To understand the physical mechanisms that allow proteins to recognize DNA and achieve their biological functions, it is important to analyze structures of DNA–protein complexes in detail. DNAproDB is a web-based interactive tool designed to help researchers study these complexes. DNAproDB provides an automated structure-processing pipeline that extracts structural features from DNA–protein complexes. The extracted features are organized in structured data files, which are easily parsed with any programming language or viewed in a browser. We processed a large number of DNA–protein complexes retrieved from the Protein Data Bank and created the DNAproDB database to store this data. Users can search the database by combining features of the DNA, protein or DNA–protein interactions at the interface. Additionally, users can upload their own structures for processing privately and securely. DNAproDB provides several interactive and customizable tools for creating visualizations of the DNA–protein interface at different levels of abstraction that can be exported as high quality figures. All functionality is documented and freely accessible at http://dnaprodb.usc.edu. PMID:28431131

  15. BIOZON: a system for unification, management and analysis of heterogeneous biological data.

    PubMed

    Birkland, Aaron; Yona, Golan

    2006-02-15

    Integration of heterogeneous data types is a challenging problem, especially in biology, where the number of databases and data types increase rapidly. Amongst the problems that one has to face are integrity, consistency, redundancy, connectivity, expressiveness and updatability. Here we present a system (Biozon) that addresses these problems, and offers biologists a new knowledge resource to navigate through and explore. Biozon unifies multiple biological databases consisting of a variety of data types (such as DNA sequences, proteins, interactions and cellular pathways). It is fundamentally different from previous efforts as it uses a single extensive and tightly connected graph schema wrapped with hierarchical ontology of documents and relations. Beyond warehousing existing data, Biozon computes and stores novel derived data, such as similarity relationships and functional predictions. The integration of similarity data allows propagation of knowledge through inference and fuzzy searches. Sophisticated methods of query that span multiple data types were implemented and first-of-a-kind biological ranking systems were explored and integrated. The Biozon system is an extensive knowledge resource of heterogeneous biological data. Currently, it holds more than 100 million biological documents and 6.5 billion relations between them. The database is accessible through an advanced web interface that supports complex queries, "fuzzy" searches, data materialization and more, online at http://biozon.org.

  16. Library of molecular associations: curating the complex molecular basis of liver diseases.

    PubMed

    Buchkremer, Stefan; Hendel, Jasmin; Krupp, Markus; Weinmann, Arndt; Schlamp, Kai; Maass, Thorsten; Staib, Frank; Galle, Peter R; Teufel, Andreas

    2010-03-20

    Systems biology approaches offer novel insights into the development of chronic liver diseases. Current genomic databases supporting systems biology analyses are mostly based on microarray data. Although these data often cover genome wide expression, the validity of single microarray experiments remains questionable. However, for systems biology approaches addressing the interactions of molecular networks comprehensive but also highly validated data are necessary. We have therefore generated the first comprehensive database for published molecular associations in human liver diseases. It is based on PubMed published abstracts and aimed to close the gap between genome wide coverage of low validity from microarray data and individual highly validated data from PubMed. After an initial text mining process, the extracted abstracts were all manually validated to confirm content and potential genetic associations and may therefore be highly trusted. All data were stored in a publicly available database, Library of Molecular Associations http://www.medicalgenomics.org/databases/loma/news, currently holding approximately 1260 confirmed molecular associations for chronic liver diseases such as HCC, CCC, liver fibrosis, NASH/fatty liver disease, AIH, PBC, and PSC. We furthermore transformed these data into a powerful resource for molecular liver research by connecting them to multiple biomedical information resources. Together, this database is the first available database providing a comprehensive view and analysis options for published molecular associations on multiple liver diseases.

  17. Identification of a New Isoindole-2-yl Scaffold as a Qo and Qi Dual Inhibitor of Cytochrome bc1 Complex: Virtual Screening, Synthesis, and Biochemical Assay.

    PubMed

    Azizian, Homa; Bagherzadeh, Kowsar; Shahbazi, Sophia; Sharifi, Niusha; Amanlou, Massoud

    2017-09-18

    Respiratory chain ubiquinol-cytochrome (cyt) c oxidoreductase (cyt bc1 or complex III) has been shown to be a promising target for numerous antibiotic and fungicide applications. In this study, a virtual screening of the NCI diversity database was carried out in order to find novel Qo/Qi cyt bc1 complex inhibitors. Structure-based virtual screening and molecular docking methodology were employed to further screen compounds with inhibition activity against the cyt bc1 complex, after an extensive reliability validation protocol using a cross-docking method and identification of the best scoring functions. Subsequently, applying a rational filtering procedure to the target database led to the identification of a novel class of potent cyt bc1 complex inhibitors with binding energies and biological activities comparable to those of the standard inhibitor, antimycin.

  18. Design and implementation of a library-based information service in molecular biology and genetics at the University of Pittsburgh

    PubMed Central

    Chattopadhyay, Ansuman; Tannery, Nancy Hrinya; Silverman, Deborah A. L.; Bergen, Phillip; Epstein, Barbara A.

    2006-01-01

    Setting: In summer 2002, the Health Sciences Library System (HSLS) at the University of Pittsburgh initiated an information service in molecular biology and genetics to assist researchers with identifying and utilizing bioinformatics tools. Program Components: This novel information service comprises hands-on training workshops and consultation on the use of bioinformatics tools. The HSLS also provides an electronic portal and networked access to public and commercial molecular biology databases and software packages. Evaluation Mechanisms: Researcher feedback gathered during the first three years of workshops and individual consultation indicate that the information service is meeting user needs. Next Steps/Future Directions: The service's workshop offerings will expand to include emerging bioinformatics topics. A frequently asked questions database is also being developed to reuse advice on complex bioinformatics questions. PMID:16888665

  19. Partial automation of database processing of simulation outputs from L-systems models of plant morphogenesis.

    PubMed

    Chen, Yi-Ping Phoebe; Hanan, Jim

    2002-01-01

    Models of plant architecture allow us to explore how genotype-environment interactions affect the development of plant phenotypes. Such models generate masses of data organised in complex hierarchies. This paper presents a generic system for creating and automatically populating a relational database from data generated by the widely used L-system approach to modelling plant morphogenesis. Techniques from compiler technology are applied to generate attributes (new fields) in the database, to simplify query development for the recursively-structured branching relationship. Use of biological terminology in an interactive query builder contributes towards making the system biologist-friendly.
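
    As a rough illustration of the idea in this abstract, the sketch below expands a toy L-system, parses the bracketed string with a stack (as a parser would), and stores each segment in a relational table together with a derived attribute. It is not the authors' system: the rewriting rule, the segment table, and the branch_order field are assumptions made for the example, and SQLite stands in for whichever database the paper used.

```python
import sqlite3


def expand(axiom, rules, steps):
    """Rewrite every symbol in parallel, `steps` times (a basic L-system)."""
    s = axiom
    for _ in range(steps):
        s = "".join(rules.get(ch, ch) for ch in s)
    return s


def parse_segments(lstring):
    """Walk the bracketed string with a stack.

    Returns rows (id, parent_id, branch_order); branch_order is the number
    of enclosing brackets, a derived attribute stored alongside the data.
    """
    rows, stack = [], []
    parent, order, next_id = None, 0, 1
    for ch in lstring:
        if ch == "F":            # one stem segment
            rows.append((next_id, parent, order))
            parent = next_id
            next_id += 1
        elif ch == "[":          # start of a side branch
            stack.append((parent, order))
            order += 1
        elif ch == "]":          # end of a side branch
            parent, order = stack.pop()
    return rows


# Hypothetical rule chosen only for illustration.
lstring = expand("F", {"F": "F[F]F"}, 2)

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE segment (id INTEGER PRIMARY KEY, parent_id INTEGER, branch_order INTEGER)"
)
conn.executemany("INSERT INTO segment VALUES (?, ?, ?)", parse_segments(lstring))

# With the derived attribute, counting segments per branch order needs no recursion.
per_order = conn.execute(
    "SELECT branch_order, COUNT(*) FROM segment GROUP BY branch_order ORDER BY branch_order"
).fetchall()

# Without such an attribute, structural questions need a recursive query,
# here: how many segments descend from segment 3?
descendants = conn.execute(
    """
    WITH RECURSIVE sub(id) AS (
        SELECT id FROM segment WHERE parent_id = ?
        UNION ALL
        SELECT s.id FROM segment s JOIN sub ON s.parent_id = sub.id
    )
    SELECT COUNT(*) FROM sub
    """,
    (3,),
).fetchone()[0]

print(per_order, descendants)
```

    The contrast between the two queries is the point of the sketch: a field computed once at load time replaces a recursive traversal at query time.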

  20. Advancing the large-scale CCS database for metabolomics and lipidomics at the machine-learning era.

    PubMed

    Zhou, Zhiwei; Tu, Jia; Zhu, Zheng-Jiang

    2018-02-01

    Metabolomics and lipidomics aim to comprehensively measure the dynamic changes of all metabolites and lipids that are present in biological systems. The use of ion mobility-mass spectrometry (IM-MS) for metabolomics and lipidomics has facilitated the separation and the identification of metabolites and lipids in complex biological samples. The collision cross-section (CCS) value derived from IM-MS is a valuable physicochemical property for the unambiguous identification of metabolites and lipids. However, CCS values obtained from experimental measurement and computational modeling are of limited availability, which significantly restricts the application of IM-MS. In this review, we will discuss the recently developed machine-learning based prediction approach, which could efficiently generate precise CCS databases on a large scale. We will also highlight the applications of CCS databases to support metabolomics and lipidomics.

  1. Comprehensive inventory of protein complexes in the Protein Data Bank from consistent classification of interfaces.

    PubMed

    Bordner, Andrew J; Gorin, Andrey A

    2008-05-12

    Protein-protein interactions are ubiquitous and essential for all cellular processes. High-resolution X-ray crystallographic structures of protein complexes can reveal the details of their function and provide a basis for many computational and experimental approaches. Differentiation between biological and non-biological contacts and reconstruction of the intact complex is a challenging computational problem. A successful solution can provide additional insights into the fundamental principles of biological recognition and reduce errors in many algorithms and databases utilizing interaction information extracted from the Protein Data Bank (PDB). We have developed a method for identifying protein complexes in the PDB X-ray structures by a four step procedure: (1) comprehensively collecting all protein-protein interfaces; (2) clustering similar protein-protein interfaces together; (3) estimating the probability that each cluster is relevant based on a diverse set of properties; and (4) combining these scores for each PDB entry in order to predict the complex structure. The resulting clusters of biologically relevant interfaces provide a reliable catalog of evolutionary conserved protein-protein interactions. These interfaces, as well as the predicted protein complexes, are available from the Protein Interface Server (PInS) website (see Availability and requirements section). Our method demonstrates an almost two-fold reduction of the annotation error rate as evaluated on a large benchmark set of complexes validated from the literature. We also estimate relative contributions of each interface property to the accurate discrimination of biologically relevant interfaces and discuss possible directions for further improving the prediction method.

  2. The portable UNIX programming system (PUPS) and CANTOR: a computational environment for dynamical representation and analysis of complex neurobiological data.

    PubMed

    O'Neill, M A; Hilgetag, C C

    2001-08-29

    Many problems in analytical biology, such as the classification of organisms, the modelling of macromolecules, or the structural analysis of metabolic or neural networks, involve complex relational data. Here, we describe a software environment, the portable UNIX programming system (PUPS), which has been developed to allow efficient computational representation and analysis of such data. The system can also be used as a general development tool for database and classification applications. As the complexity of analytical biology problems may lead to computation times of several days or weeks even on powerful computer hardware, the PUPS environment gives support for persistent computations by providing mechanisms for dynamic interaction and homeostatic protection of processes. Biological objects and their interrelations are also represented in a homeostatic way in PUPS. Object relationships are maintained and updated by the objects themselves, thus providing a flexible, scalable and current data representation. Based on the PUPS environment, we have developed an optimization package, CANTOR, which can be applied to a wide range of relational data and which has been employed in different analyses of neuroanatomical connectivity. The CANTOR package makes use of the PUPS system features by modifying candidate arrangements of objects within the system's database. This restructuring is carried out via optimization algorithms that are based on user-defined cost functions, thus providing flexible and powerful tools for the structural analysis of the database content. The use of stochastic optimization also enables the CANTOR system to deal effectively with incomplete and inconsistent data. Prototypical forms of PUPS and CANTOR have been coded and used successfully in the analysis of anatomical and functional mammalian brain connectivity, involving complex and inconsistent experimental data. 
In addition, PUPS has been used for solving multivariate engineering optimization problems and to implement the digital identification system (DAISY), a system for the automated classification of biological objects. PUPS is implemented in ANSI-C under the POSIX.1 standard and is to a great extent architecture- and operating-system independent. The software is supported by systems libraries that allow multi-threading (the concurrent processing of several database operations), as well as the distribution of the dynamic data objects and library operations over clusters of computers. These attributes make the system easily scalable, and in principle allow the representation and analysis of arbitrarily large sets of relational data. PUPS and CANTOR are freely distributed (http://www.pups.org.uk) as open-source software under the GNU license agreement.
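
    The abstract says CANTOR rearranges objects by stochastic optimization against a user-defined cost function, but it does not name the algorithm. The sketch below uses simulated annealing as a stand-in, which is an assumption and not necessarily what CANTOR does. The function names, the object labels "a" to "f" and the connection list are all hypothetical.

```python
import math
import random


def anneal(order, cost, steps=5000, t0=1.0, cooling=0.999, seed=0):
    """Search for a low-cost arrangement by random pairwise swaps.

    `cost` is any user-supplied function mapping an arrangement to a number.
    Simulated annealing is an assumed stand-in for "stochastic optimization".
    """
    rng = random.Random(seed)
    current = list(order)
    current_cost = cost(current)
    best, best_cost = list(current), current_cost
    temperature = t0
    for _ in range(steps):
        i, j = rng.sample(range(len(current)), 2)
        current[i], current[j] = current[j], current[i]
        new_cost = cost(current)
        # Always accept improvements; accept worse moves with a probability
        # that shrinks as the temperature falls.
        accept = new_cost <= current_cost or rng.random() < math.exp(
            (current_cost - new_cost) / temperature
        )
        if accept:
            current_cost = new_cost
            if new_cost < best_cost:
                best, best_cost = list(current), new_cost
        else:
            current[i], current[j] = current[j], current[i]  # undo the swap
        temperature *= cooling
    return best, best_cost


# Hypothetical relational data: pairs of objects that are connected.
connections = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e"), ("e", "f")]


def wiring_cost(arrangement):
    """User-defined cost: total distance between connected objects."""
    position = {obj: k for k, obj in enumerate(arrangement)}
    return sum(abs(position[u] - position[v]) for u, v in connections)


initial = ["a", "d", "b", "f", "c", "e"]
best, best_cost = anneal(initial, wiring_cost)
```

    Passing the cost function as an argument mirrors the user-defined cost functions the abstract mentions: the search loop does not change when the scoring rule does.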

  3. The portable UNIX programming system (PUPS) and CANTOR: a computational environment for dynamical representation and analysis of complex neurobiological data.

    PubMed Central

    O'Neill, M A; Hilgetag, C C

    2001-01-01

    Many problems in analytical biology, such as the classification of organisms, the modelling of macromolecules, or the structural analysis of metabolic or neural networks, involve complex relational data. Here, we describe a software environment, the portable UNIX programming system (PUPS), which has been developed to allow efficient computational representation and analysis of such data. The system can also be used as a general development tool for database and classification applications. As the complexity of analytical biology problems may lead to computation times of several days or weeks even on powerful computer hardware, the PUPS environment gives support for persistent computations by providing mechanisms for dynamic interaction and homeostatic protection of processes. Biological objects and their interrelations are also represented in a homeostatic way in PUPS. Object relationships are maintained and updated by the objects themselves, thus providing a flexible, scalable and current data representation. Based on the PUPS environment, we have developed an optimization package, CANTOR, which can be applied to a wide range of relational data and which has been employed in different analyses of neuroanatomical connectivity. The CANTOR package makes use of the PUPS system features by modifying candidate arrangements of objects within the system's database. This restructuring is carried out via optimization algorithms that are based on user-defined cost functions, thus providing flexible and powerful tools for the structural analysis of the database content. The use of stochastic optimization also enables the CANTOR system to deal effectively with incomplete and inconsistent data. Prototypical forms of PUPS and CANTOR have been coded and used successfully in the analysis of anatomical and functional mammalian brain connectivity, involving complex and inconsistent experimental data. 
In addition, PUPS has been used for solving multivariate engineering optimization problems and to implement the digital identification system (DAISY), a system for the automated classification of biological objects. PUPS is implemented in ANSI-C under the POSIX.1 standard and is to a great extent architecture- and operating-system independent. The software is supported by systems libraries that allow multi-threading (the concurrent processing of several database operations), as well as the distribution of the dynamic data objects and library operations over clusters of computers. These attributes make the system easily scalable, and in principle allow the representation and analysis of arbitrarily large sets of relational data. PUPS and CANTOR are freely distributed (http://www.pups.org.uk) as open-source software under the GNU license agreement. PMID:11545702

  4. Exploring molecular networks using MONET ontology.

    PubMed

    Silva, João Paulo Müller da; Lemke, Ney; Mombach, José Carlos; Souza, José Guilherme Camargo de; Sinigaglia, Marialva; Vieira, Renata

    2006-03-31

    The description of the complex molecular network responsible for cell behavior requires new tools to integrate large quantities of experimental data in the design of biological information systems. These tools could be used in the characterization of these networks and in the formulation of relevant biological hypotheses. The building of an ontology is a crucial step because it integrates in a coherent framework the concepts necessary to accomplish such a task. We present MONET (molecular network), an extensible ontology and an architecture designed to facilitate the integration of data originating from different public databases in a single- and well-documented relational database, that is compatible with MONET formal definition. We also present an example of an application that can easily be implemented using these tools.

  5. A novel medical image data-based multi-physics simulation platform for computational life sciences.

    PubMed

    Neufeld, Esra; Szczerba, Dominik; Chavannes, Nicolas; Kuster, Niels

    2013-04-06

    Simulating and modelling complex biological systems in computational life sciences requires specialized software tools that can perform medical image data-based modelling, jointly visualize the data and computational results, and handle large, complex, realistic and often noisy anatomical models. The required novel solvers must provide the power to model the physics, biology and physiology of living tissue within the full complexity of the human anatomy (e.g. neuronal activity, perfusion and ultrasound propagation). A multi-physics simulation platform satisfying these requirements has been developed for applications including device development and optimization, safety assessment, basic research, and treatment planning. This simulation platform consists of detailed, parametrized anatomical models, a segmentation and meshing tool, a wide range of solvers and optimizers, a framework for the rapid development of specialized and parallelized finite element method solvers, a visualization toolkit-based visualization engine, a Python scripting interface for customized applications, a coupling framework, and more. Core components are cross-platform compatible and use open formats. Several examples of applications are presented: hyperthermia cancer treatment planning, tumour growth modelling, evaluating the magneto-haemodynamic effect as a biomarker and physics-based morphing of anatomical models.

  6. RRW: repeated random walks on genome-scale protein networks for local cluster discovery

    PubMed Central

    Macropol, Kathy; Can, Tolga; Singh, Ambuj K

    2009-01-01

    Background We propose an efficient and biologically sensitive algorithm based on repeated random walks (RRW) for discovering functional modules, e.g., complexes and pathways, within large-scale protein networks. Compared to existing cluster identification techniques, RRW implicitly makes use of network topology, edge weights, and long range interactions between proteins. Results We apply the proposed technique on a functional network of yeast genes and accurately identify statistically significant clusters of proteins. We validate the biological significance of the results using known complexes in the MIPS complex catalogue database and well-characterized biological processes. We find that 90% of the created clusters have the majority of their catalogued proteins belonging to the same MIPS complex, and about 80% have the majority of their proteins involved in the same biological process. We compare our method to various other clustering techniques, such as the Markov Clustering Algorithm (MCL), and find a significant improvement in the RRW clusters' precision and accuracy values. Conclusion RRW, which is a technique that exploits the topology of the network, is more precise and robust in finding local clusters. In addition, it has the added flexibility of being able to find multi-functional proteins by allowing overlapping clusters. PMID:19740439
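
    As a rough illustration of the building block behind a random-walk clustering method, the sketch below computes the stationary probabilities of a single random walk with restart on a small weighted graph. It assumes a walk with restart and does not implement the repeated cluster expansion of RRW itself. The toy graph, the restart probability of 0.3 and all function names are hypothetical.

```python
def random_walk_with_restart(adj, start, restart=0.3, tol=1e-10, max_iter=1000):
    """Stationary visiting probabilities for a walker that returns to `start`
    with probability `restart` and otherwise follows an edge chosen in
    proportion to its weight."""
    p = {n: 0.0 for n in adj}
    p[start] = 1.0
    for _ in range(max_iter):
        nxt = {n: 0.0 for n in adj}
        for u, nbrs in adj.items():
            total = sum(nbrs.values())
            if total == 0:
                # A node without edges sends its mass back to the start node.
                nxt[start] += (1 - restart) * p[u]
                continue
            for v, w in nbrs.items():
                nxt[v] += (1 - restart) * p[u] * w / total
        nxt[start] += restart  # restart mass (the probabilities sum to 1)
        delta = max(abs(nxt[n] - p[n]) for n in adj)
        p = nxt
        if delta < tol:
            break
    return p


def undirected(edges):
    """Build a symmetric adjacency dict from (u, v, weight) triples."""
    adj = {}
    for u, v, w in edges:
        adj.setdefault(u, {})[v] = w
        adj.setdefault(v, {})[u] = w
    return adj


# Hypothetical network: two tight triangles joined by one weak edge.
graph = undirected([
    ("A", "B", 1.0), ("A", "C", 1.0), ("B", "C", 1.0),
    ("D", "E", 1.0), ("D", "F", 1.0), ("E", "F", 1.0),
    ("C", "D", 0.1),
])
scores = random_walk_with_restart(graph, "A")
cluster = sorted(scores, key=scores.get, reverse=True)[:3]
```

    Because the walker keeps returning to the start node, the highest scores stay within the start node's own neighbourhood. That locality is what makes such scores usable as a measure of cluster membership, and edge weights enter directly through the transition probabilities.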

  7. FlyMine: an integrated database for Drosophila and Anopheles genomics

    PubMed Central

    Lyne, Rachel; Smith, Richard; Rutherford, Kim; Wakeling, Matthew; Varley, Andrew; Guillier, Francois; Janssens, Hilde; Ji, Wenyan; Mclaren, Peter; North, Philip; Rana, Debashis; Riley, Tom; Sullivan, Julie; Watkins, Xavier; Woodbridge, Mark; Lilley, Kathryn; Russell, Steve; Ashburner, Michael; Mizuguchi, Kenji; Micklem, Gos

    2007-01-01

    FlyMine is a data warehouse that addresses one of the important challenges of modern biology: how to integrate and make use of the diversity and volume of current biological data. Its main focus is genomic and proteomics data for Drosophila and other insects. It provides web access to integrated data at a number of different levels, from simple browsing to construction of complex queries, which can be executed on either single items or lists. PMID:17615057

  8. Integrative Systems Biology for Data Driven Knowledge Discovery

    PubMed Central

    Greene, Casey S.; Troyanskaya, Olga G.

    2015-01-01

    Integrative systems biology is an approach that brings together diverse high throughput experiments and databases to gain new insights into biological processes or systems at molecular through physiological levels. These approaches rely on diverse high-throughput experimental techniques that generate heterogeneous data by assaying varying aspects of complex biological processes. Computational approaches are necessary to provide an integrative view of these experimental results and enable data-driven knowledge discovery. Hypotheses generated from these approaches can direct definitive molecular experiments in a cost effective manner. Using integrative systems biology approaches, we can leverage existing biological knowledge and large-scale data to improve our understanding of yet unknown components of a system of interest and how its malfunction leads to disease. PMID:21044756

  9. EnsMart: A Generic System for Fast and Flexible Access to Biological Data

    PubMed Central

    Kasprzyk, Arek; Keefe, Damian; Smedley, Damian; London, Darin; Spooner, William; Melsopp, Craig; Hammond, Martin; Rocca-Serra, Philippe; Cox, Tony; Birney, Ewan

    2004-01-01

    The EnsMart system (www.ensembl.org/EnsMart) provides a generic data warehousing solution for fast and flexible querying of large biological data sets and integration with third-party data and tools. The system consists of a query-optimized database and interactive, user-friendly interfaces. EnsMart has been applied to Ensembl, where it extends its genomic browser capabilities, facilitating rapid retrieval of customized data sets. A wide variety of complex queries, on various types of annotations, for numerous species are supported. These can be applied to many research problems, ranging from SNP selection for candidate gene screening, through cross-species evolutionary comparisons, to microarray annotation. Users can group and refine biological data according to many criteria, including cross-species analyses, disease links, sequence variations, and expression patterns. Both tabulated list data and biological sequence output can be generated dynamically, in HTML, text, Microsoft Excel, and compressed formats. A wide range of sequence types, such as cDNA, peptides, coding regions, UTRs, and exons, with additional upstream and downstream regions, can be retrieved. The EnsMart database can be accessed via a public Web site, or through a Java application suite. Both implementations and the database are freely available for local installation, and can be extended or adapted to `non-Ensembl' data sets. PMID:14707178

  10. MetNetAPI: A flexible method to access and manipulate biological network data from MetNet

    PubMed Central

    2010-01-01

    Background Convenient programmatic access to different biological databases allows automated integration of scientific knowledge. Many databases support a function to download files or data snapshots, or a webservice that offers "live" data. However, the functionality that a database offers cannot be represented in a static data download file, and webservices may consume considerable computational resources from the host server. Results MetNetAPI is a versatile Application Programming Interface (API) to the MetNetDB database. It abstracts, captures and retains operations away from a biological network repository and website. A range of database functions, previously only available online, can be immediately (and independently from the website) applied to a dataset of interest. Data is available in four layers: molecular entities, localized entities (linked to a specific organelle), interactions, and pathways. Navigation between these layers is intuitive (e.g. one can request the molecular entities in a pathway, as well as request in what pathways a specific entity participates). Data retrieval can be customized: Network objects allow the construction of new and integration of existing pathways and interactions, which can be uploaded back to our server. In contrast to webservices, the computational demand on the host server is limited to processing data-related queries only. Conclusions An API provides several advantages to a systems biology software platform. MetNetAPI illustrates an interface with a central repository of data that represents the complex interrelationships of a metabolic and regulatory network. As an alternative to data-dumps and webservices, it allows access to a current and "live" database and exposes analytical functions to application developers. Yet it only requires limited resources on the server-side (thin server/fat client setup). 
The API is available for Java, Microsoft.NET and R programming environments and offers flexible query and broad data-retrieval methods. Data retrieval can be customized to client needs and the API offers a framework to construct and manipulate user-defined networks. The design principles can be used as a template to build programmable interfaces for other biological databases. The API software and tutorials are available at http://www.metnetonline.org/api. PMID:21083943

  11. PlantNATsDB: a comprehensive database of plant natural antisense transcripts.

    PubMed

    Chen, Dijun; Yuan, Chunhui; Zhang, Jian; Zhang, Zhao; Bai, Lin; Meng, Yijun; Chen, Ling-Ling; Chen, Ming

    2012-01-01

    Natural antisense transcripts (NATs), a type of regulatory RNA, occur prevalently in plant genomes and play significant roles in physiological and pathological processes. Although their important biological functions have been reported widely, a comprehensive database has so far been lacking. Consequently, we constructed a plant NAT database (PlantNATsDB) involving approximately 2 million NAT pairs in 69 plant species. GO annotation and high-throughput small RNA sequencing data currently available were integrated to investigate the biological function of NATs. PlantNATsDB provides various user-friendly web interfaces to facilitate the presentation of NATs and an integrated, graphical network browser to display the complex networks formed by different NATs. Moreover, a 'Gene Set Analysis' module based on GO annotation was designed to identify the statistically significantly overrepresented GO categories from the specific NAT network. PlantNATsDB is currently the most comprehensive resource of NATs in the plant kingdom, which can serve as a reference database to investigate the regulatory function of NATs. The PlantNATsDB is freely available at http://bis.zju.edu.cn/pnatdb/.

  12. Freshwater Biological Traits Database (Final Report)

    EPA Science Inventory

    EPA announced the release of the final report, Freshwater Biological Traits Database. This report discusses the development of a database of freshwater biological traits. The database combines several existing traits databases into an online format. The database is also...

  13. Chinese Herbal Medicine Meets Biological Networks of Complex Diseases: A Computational Perspective

    PubMed Central

    Gu, Shuo

    2017-01-01

    With the rapid development of cheminformatics, computational biology, and systems biology, great progress has been made recently in the computational research of Chinese herbal medicine with in-depth understanding towards pharmacognosy. This paper summarized these studies in the aspects of computational methods, traditional Chinese medicine (TCM) compound databases, and TCM network pharmacology. Furthermore, we chose arachidonic acid metabolic network as a case study to demonstrate the regulatory function of herbal medicine in the treatment of inflammation at network level. Finally, a computational workflow for the network-based TCM study, derived from our previous successful applications, was proposed. PMID:28690664

  14. Chinese Herbal Medicine Meets Biological Networks of Complex Diseases: A Computational Perspective.

    PubMed

    Gu, Shuo; Pei, Jianfeng

    2017-01-01

    With the rapid development of cheminformatics, computational biology, and systems biology, great progress has been made recently in the computational research of Chinese herbal medicine with in-depth understanding towards pharmacognosy. This paper summarized these studies in the aspects of computational methods, traditional Chinese medicine (TCM) compound databases, and TCM network pharmacology. Furthermore, we chose arachidonic acid metabolic network as a case study to demonstrate the regulatory function of herbal medicine in the treatment of inflammation at network level. Finally, a computational workflow for the network-based TCM study, derived from our previous successful applications, was proposed.

  15. Logic programming to infer complex RNA expression patterns from RNA-seq data.

    PubMed

    Weirick, Tyler; Militello, Giuseppe; Ponomareva, Yuliya; John, David; Döring, Claudia; Dimmeler, Stefanie; Uchida, Shizuka

    2018-03-01

    To meet the increasing demand in the field, numerous long noncoding RNA (lncRNA) databases are available. Given many lncRNAs are specifically expressed in certain cell types and/or time-dependent manners, most lncRNA databases fall short of providing such profiles. We developed a strategy using logic programming to handle the complex organization of organs, their tissues and cell types as well as gender and developmental time points. To showcase this strategy, we introduce 'RenalDB' (http://renaldb.uni-frankfurt.de), a database providing expression profiles of RNAs in major organs focusing on kidney tissues and cells. RenalDB uses logic programming to describe complex anatomy, sample metadata and logical relationships defining expression, enrichment or specificity. We validated the content of RenalDB with biological experiments and functionally characterized two long intergenic noncoding RNAs: LOC440173 is important for cell growth or cell survival, whereas PAXIP1-AS1 is a regulator of cell death. We anticipate RenalDB will be used as a first step toward functional studies of lncRNAs in the kidney.

  16. An online model composition tool for system biology models

    PubMed Central

    2013-01-01

    Background There are multiple representation formats for Systems Biology computational models, and the Systems Biology Markup Language (SBML) is one of the most widely used. SBML is used to capture, store, and distribute computational models by Systems Biology data sources (e.g., the BioModels Database) and researchers. Therefore, there is a need for all-in-one web-based solutions that support advanced SBML functionalities such as uploading, editing, composing, visualizing, simulating, querying, and browsing computational models. Results We present the design and implementation of the Model Composition Tool (Interface) within the PathCase-SB (PathCase Systems Biology) web portal. The tool helps users compose systems biology models to facilitate the complex process of merging systems biology models. We also present three tools that support the model composition tool, namely, (1) Model Simulation Interface that generates a visual plot of the simulation according to the user’s input, (2) iModel Tool as a platform for users to upload their own models to compose, and (3) SimCom Tool that provides a side-by-side comparison of models being composed in the same pathway. Finally, we provide a web site that hosts BioModels Database models and a separate web site that hosts SBML Test Suite models. Conclusions The model composition tool (and the other three tools) can be used with little or no knowledge of the SBML document structure. For this reason, students or anyone who wants to learn about systems biology will benefit from the described functionalities. SBML Test Suite models will be a nice starting point for beginners. And, for more advanced purposes, users will be able to access and employ models of the BioModels Database as well. PMID:24006914

  17. An overview of plant volatile metabolomics, sample treatment and reporting considerations with emphasis on mechanical damage and biological control of weeds.

    PubMed

    Beck, John J; Smith, Lincoln; Baig, Nausheena

    2014-01-01

    The technology for the collection and analysis of plant-emitted volatiles for understanding chemical cues of plant-plant, plant-insect or plant-microbe interactions has increased over the years. Consequently, the in situ collection, analysis and identification of volatiles are considered integral to elucidation of complex plant communications. Due to the complexity and range of emissions the conditions for consistent emission of volatiles are difficult to standardise. To discuss: evaluation of emitted volatile metabolites as a means of screening potential target- and non-target weeds/plants for insect biological control agents; plant volatile metabolomics to analyse resultant data; importance of considering volatiles from damaged plants; and use of a database for reporting experimental conditions and results. Recent literature relating to plant volatiles and plant volatile metabolomics are summarised to provide a basic understanding of how metabolomics can be applied to the study of plant volatiles. An overview of plant secondary metabolites, plant volatile metabolomics, analysis of plant volatile metabolomics data and the subsequent input into a database, the roles of plant volatiles, volatile emission as a function of treatment, and the application of plant volatile metabolomics to biological control of invasive weeds. It is recommended that in addition to a non-damaged treatment, plants be damaged prior to collecting volatiles to provide the greatest diversity of odours. For the model system provided, optimal volatile emission occurred when the leaf was punctured with a needle. Results stored in a database should include basic environmental conditions or treatments. Copyright © 2013 John Wiley & Sons, Ltd.

  18. Sensitivity assessment of freshwater macroinvertebrates to pesticides using biological traits.

    PubMed

    Ippolito, A; Todeschini, R; Vighi, M

    2012-03-01

    Assessing the sensitivity of different species to chemicals is one of the key points in predicting the effects of toxic compounds in the environment. Trait-based predicting methods have proved to be extremely efficient for assessing the sensitivity of macroinvertebrates toward compounds with non specific toxicity (narcotics). Nevertheless, predicting the sensitivity of organisms toward compounds with specific toxicity is much more complex, since it depends on the mode of action of the chemical. The aim of this work was to predict the sensitivity of several freshwater macroinvertebrates toward three classes of plant protection products: organophosphates, carbamates and pyrethroids. Two databases were built: one with sensitivity data (retrieved, evaluated and selected from the U.S. Environmental Protection Agency ECOTOX database) and the other with biological traits. Aside from the "traditional" traits usually considered in ecological analysis (i.e. body size, respiration technique, feeding habits, etc.), multivariate analysis was used to relate the sensitivity of organisms to some other characteristics which may be involved in the process of intoxication. Results confirmed that, besides traditional biological traits, related to uptake capability (e.g. body size and body shape) some traits more related to particular metabolic characteristics or patterns have a good predictive capacity on the sensitivity to these kinds of toxic substances. For example, behavioral complexity, assumed as an indicator of nervous system complexity, proved to be an important predictor of sensitivity towards these compounds. These results confirm the need for more complex traits to predict effects of highly specific substances. One key point for achieving a complete mechanistic understanding of the process is the choice of traits, whose role in the discrimination of sensitivity should be clearly interpretable, and not only statistically significant.

  19. DOE Office of Scientific and Technical Information (OSTI.GOV)

    Saccone, Scott F; Chesler, Elissa J; Bierut, Laura J

    Commercial SNP microarrays now provide comprehensive and affordable coverage of the human genome. However, some diseases have biologically relevant genomic regions that may require additional coverage. Addiction, for example, is thought to be influenced by complex interactions among many relevant genes and pathways. We have assembled a list of 486 biologically relevant genes nominated by a panel of experts on addiction. We then added 424 genes that showed evidence of association with addiction phenotypes through mouse QTL mappings and gene co-expression analysis. We demonstrate that there are a substantial number of SNPs in these genes that are not well represented by commercial SNP platforms. We address this problem by introducing a publicly available SNP database for addiction. The database is annotated using numeric prioritization scores indicating the extent of biological relevance. The scores incorporate a number of factors such as SNP/gene functional properties (including synonymy and promoter regions), data from mouse systems genetics and measures of human/mouse evolutionary conservation. We then used HapMap genotyping data to determine if a SNP is tagged by a commercial microarray through linkage disequilibrium. This combination of biological prioritization scores and LD tagging annotation will enable addiction researchers to supplement commercial SNP microarrays to ensure comprehensive coverage of biologically relevant regions.

  20. ExplorEnz: the primary source of the IUBMB enzyme list

    PubMed Central

    McDonald, Andrew G.; Boyce, Sinéad; Tipton, Keith F.

    2009-01-01

    ExplorEnz is the MySQL database that is used for the curation and dissemination of the International Union of Biochemistry and Molecular Biology (IUBMB) Enzyme Nomenclature. A simple web-based query interface is provided, along with an advanced search engine for more complex Boolean queries. The WWW front-end is accessible at http://www.enzyme-database.org, from where downloads of the database as SQL and XML are also available. An associated form-based curatorial application has been developed to facilitate the curation of enzyme data as well as the internal and public review processes that occur before an enzyme entry is made official. Suggestions for new enzyme entries, or modifications to existing ones, can be made using the forms provided at http://www.enzyme-database.org/forms.php. PMID:18776214

  1. A multidisciplinary database for global distribution

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Wolfe, P.J.

    The issue of selenium toxicity in the environment has been documented in the scientific literature for over 50 years. Recent studies reveal a complex connection between selenium and human and animal populations. This article introduces a bibliographic citation database on selenium in the environment developed for global distribution via the Internet by the University of Wyoming Libraries. The database incorporates material from commercial sources, print abstracts, indexes, and U.S. government literature, resulting in a multidisciplinary resource. Relevant disciplines include biology, medicine, veterinary science, botany, chemistry, geology, pollution, aquatic sciences, ecology, and others. It covers the years 1985-1996 for most subject material, with additional years being added as resources permit.

  2. An advanced web query interface for biological databases

    PubMed Central

    Latendresse, Mario; Karp, Peter D.

    2010-01-01

    Although most web-based biological databases (DBs) offer some type of web-based form to allow users to author DB queries, these query forms are quite restricted in the complexity of DB queries that they can formulate. They can typically query only one DB, and can query only a single type of object at a time (e.g. genes) with no possible interaction between the objects—that is, in SQL parlance, no joins are allowed between DB objects. Writing precise queries against biological DBs is usually left to a programmer skillful enough in complex DB query languages like SQL. We present a web interface for building precise queries for biological DBs that can construct much more precise queries than most web-based query forms, yet that is user friendly enough to be used by biologists. It supports queries containing multiple conditions, and connecting multiple object types without using the join concept, which is unintuitive to biologists. This interactive web interface is called the Structured Advanced Query Page (SAQP). Users interactively build up a wide range of query constructs. Interactive documentation within the SAQP describes the schema of the queried DBs. The SAQP is based on BioVelo, a query language based on list comprehension. The SAQP is part of the Pathway Tools software and is available as part of several bioinformatics web sites powered by Pathway Tools, including the BioCyc.org site that contains more than 500 Pathway/Genome DBs. PMID:20624715
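
    To give a feel for how a list-comprehension query can connect two object types without an explicit join, here is a minimal Python sketch. It is not BioVelo syntax, which the abstract does not show. The classes, gene and pathway names and sequence lengths are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Gene:
    name: str
    length: int  # sequence length in base pairs (made-up values below)


@dataclass
class Pathway:
    name: str
    genes: list = field(default_factory=list)


# Hypothetical database content.
gene_a = Gene("geneA", 800)
gene_b = Gene("geneB", 1200)
gene_c = Gene("geneC", 3000)
pathways = [
    Pathway("example biosynthesis", [gene_a, gene_b]),
    Pathway("example degradation", [gene_c]),
]

# Two conditions on two object types. The relationship is followed through
# the `genes` attribute, so no join appears in the query.
hits = [
    (p.name, g.name)
    for p in pathways
    for g in p.genes
    if "biosynthesis" in p.name and g.length > 1000
]
```

    The query reads as "for each pathway, for each of its genes, keep those matching both conditions", which is closer to how a biologist would phrase the question than a join over tables.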

  3. DNAtraffic--a new database for systems biology of DNA dynamics during the cell life.

    PubMed

    Kuchta, Krzysztof; Barszcz, Daniela; Grzesiuk, Elzbieta; Pomorski, Pawel; Krwawicz, Joanna

    2012-01-01

    DNAtraffic (http://dnatraffic.ibb.waw.pl/) is intended to be a unique, comprehensive and richly annotated database of genome dynamics during the cell life. It contains extensive data on the nomenclature, ontology, structure and function of proteins related to the DNA integrity mechanisms such as chromatin remodeling, histone modifications, DNA repair and damage response from eight organisms: Homo sapiens, Mus musculus, Drosophila melanogaster, Caenorhabditis elegans, Saccharomyces cerevisiae, Schizosaccharomyces pombe, Escherichia coli and Arabidopsis thaliana. DNAtraffic contains comprehensive information on the diseases related to the assembled human proteins. DNAtraffic is richly annotated in the systemic information on the nomenclature, chemistry and structure of DNA damage and their sources, including environmental agents or commonly used drugs targeting nucleic acids and/or proteins involved in the maintenance of genome stability. One aim of the DNAtraffic database is to create the first platform of the combinatorial complexity of DNA network analysis. The database includes illustrations of pathways, damage, proteins and drugs. Since DNAtraffic is designed to cover a broad spectrum of scientific disciplines, it has to be extensively linked to numerous external data sources. Our database represents the result of the manual annotation work aimed at making DNAtraffic much more useful for a wide range of systems biology applications.

  4. DNAtraffic—a new database for systems biology of DNA dynamics during the cell life

    PubMed Central

    Kuchta, Krzysztof; Barszcz, Daniela; Grzesiuk, Elzbieta; Pomorski, Pawel; Krwawicz, Joanna

    2012-01-01

    DNAtraffic (http://dnatraffic.ibb.waw.pl/) is intended to be a unique, comprehensive and richly annotated database of genome dynamics during the cell life. It contains extensive data on the nomenclature, ontology, structure and function of proteins related to DNA integrity mechanisms such as chromatin remodeling, histone modifications, DNA repair and damage response from eight organisms: Homo sapiens, Mus musculus, Drosophila melanogaster, Caenorhabditis elegans, Saccharomyces cerevisiae, Schizosaccharomyces pombe, Escherichia coli and Arabidopsis thaliana. DNAtraffic contains comprehensive information on the diseases related to the assembled human proteins. DNAtraffic is richly annotated with systemic information on the nomenclature, chemistry and structure of DNA damage and its sources, including environmental agents or commonly used drugs targeting nucleic acids and/or proteins involved in the maintenance of genome stability. One of the aims of the DNAtraffic database is to create the first platform for the combinatorial complexity of DNA network analysis. The database includes illustrations of pathways, damage, proteins and drugs. Since DNAtraffic is designed to cover a broad spectrum of scientific disciplines, it has to be extensively linked to numerous external data sources. Our database represents the result of manual annotation work aimed at making DNAtraffic much more useful for a wide range of systems biology applications. PMID:22110027

  5. OliveNet™: a comprehensive library of compounds from Olea europaea

    PubMed Central

    Bonvino, Natalie P; Liang, Julia; McCord, Elizabeth D; Zafiris, Elena; Benetti, Natalia; Ray, Nancy B; Hung, Andrew; Boskou, Dimitrios

    2018-01-01

    Accumulated epidemiological, clinical and experimental evidence has indicated the beneficial health effects of the Mediterranean diet, which is typified by the consumption of virgin olive oil (VOO) as a main source of dietary fat. At the cellular level, compounds derived from various olive (Olea europaea) matrices have demonstrated potent antioxidant and anti-inflammatory effects, which are thought to account, at least in part, for their biological effects. Research efforts are expanding into the characterization of compounds derived from Olea europaea; however, the considerable diversity and complexity of the vast array of chemical compounds have made their precise identification and quantification challenging. As such, only a relatively small subset of olive-derived compounds has been explored for their biological activity and potential health effects to date. Although there is adequate information describing the identification or isolation of olive-derived compounds, these are not easily searchable, especially when attempting to acquire chemical or biological properties. Therefore, we have created the OliveNet™ database containing a comprehensive catalogue of compounds identified from matrices of the olive, including the fruit, leaf and VOO, as well as in the wastewater and pomace accrued during oil production. From a total of 752 compounds, chemical analysis was sufficient for 676 individual compounds, which have been included in the database. The database is curated and comprehensively referenced, containing information for the 676 compounds, which are divided into 13 main classes and 47 subclasses. Importantly, with respect to current research trends, the database includes 222 olive phenolics, which are divided into 13 subclasses. To our knowledge, OliveNet™ is currently the only curated open access database with a comprehensive collection of compounds associated with Olea europaea. Database URL: https://www.mccordresearch.com.au PMID:29688352

  6. Radiation damage of biomolecules (RADAM) database development: current status

    NASA Astrophysics Data System (ADS)

    Denifl, S.; Garcia, G.; Huber, B. A.; Marinković, B. P.; Mason, N.; Postler, J.; Rabus, H.; Rixon, G.; Solov'yov, A. V.; Suraud, E.; Yakubovich, A. V.

    2013-06-01

    Ion beam therapy offers the possibility of excellent dose localization for treatment of malignant tumours, minimizing radiation damage in normal tissue, while maximizing cell killing within the tumour. However, as the underlying dependent physical, chemical and biological processes are too complex to treat them on a purely analytical level, most of our current and future understanding will rely on computer simulations, based on mathematical equations, algorithms and last, but not least, on the available atomic and molecular data. The viability of the simulated output and the success of any computer simulation will be determined by these data, which are treated as the input variables in each computer simulation performed. The radiation research community lacks a complete database for the cross sections of all the different processes involved in ion beam induced damage: ionization and excitation cross sections for ions with liquid water and biological molecules, all the possible electron - medium interactions, dielectric response data, electron attachment to biomolecules etc. In this paper we discuss current progress in the creation of such a database, outline the roadmap of the project and review plans for the exploitation of such a database in future simulations.

  7. Glycan fragment database: a database of PDB-based glycan 3D structures.

    PubMed

    Jo, Sunhwan; Im, Wonpil

    2013-01-01

    The glycan fragment database (GFDB), freely available at http://www.glycanstructure.org, is a database of the glycosidic torsion angles derived from the glycan structures in the Protein Data Bank (PDB). Analogous to protein structure, the structure of an oligosaccharide chain in a glycoprotein, referred to as a glycan, can be characterized by the torsion angles of glycosidic linkages between relatively rigid carbohydrate monomeric units. Knowledge of accessible conformations of biologically relevant glycans is essential in understanding their biological roles. The GFDB provides an intuitive glycan sequence search tool that allows the user to search complex glycan structures. After a glycan search is complete, each glycosidic torsion angle distribution is displayed in terms of the exact match and the fragment match. The exact match results are from the PDB entries that contain the glycan sequence identical to the query sequence. The fragment match results are from the entries with the glycan sequence whose substructure (fragment) or entire sequence is matched to the query sequence, such that the fragment results implicitly include the influences from the nearby carbohydrate residues. In addition, clustering analysis based on the torsion angle distribution can be performed to obtain the representative structures among the searched glycan structures.
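
    As an illustration of the quantity the database tabulates, a torsion (dihedral) angle can be computed from the coordinates of four consecutive atoms. The sketch below uses the standard projection formula. Which four atoms define a given glycosidic torsion depends on the convention adopted, so the function name and the test coordinates here are assumptions for illustration and are not taken from GFDB.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _scale(a, s):
    return tuple(x * s for x in a)


def _cross(a, b):
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def torsion_angle_deg(p0, p1, p2, p3):
    """Signed torsion angle (degrees) about the p1-p2 bond, in (-180, 180]."""
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    # Unit vector along the central bond
    b1 = _scale(b1, 1.0 / math.sqrt(_dot(b1, b1)))
    # Components of the outer bonds perpendicular to the central bond
    v = _sub(b0, _scale(b1, _dot(b0, b1)))
    w = _sub(b2, _scale(b1, _dot(b2, b1)))
    x = _dot(v, w)
    y = _dot(_cross(b1, v), w)
    return math.degrees(math.atan2(y, x))


# Hypothetical coordinates: p3 is rotated a quarter turn away from p0
print(torsion_angle_deg((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1)))
```

    Collecting such angles over many structures gives the kind of torsion angle distribution the abstract mentions, which can then be clustered.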

  8. Integration of Molecular Networking and In-Silico MS/MS Fragmentation for Natural Products Dereplication.

    PubMed

    Allard, Pierre-Marie; Péresse, Tiphaine; Bisson, Jonathan; Gindro, Katia; Marcourt, Laurence; Pham, Van Cuong; Roussi, Fanny; Litaudon, Marc; Wolfender, Jean-Luc

    2016-03-15

    Dereplication represents a key step for rapidly identifying known secondary metabolites in complex biological matrices. In this context, liquid chromatography coupled to high resolution mass spectrometry (LC-HRMS) is increasingly used and, via untargeted data-dependent MS/MS experiments, massive amounts of detailed information on the chemical composition of crude extracts can be generated. An efficient exploitation of such data sets requires automated data treatment and access to dedicated fragmentation databases. Various novel bioinformatics approaches such as molecular networking (MN) and in-silico fragmentation tools have emerged recently and provide new perspectives for early metabolite identification in natural products (NPs) research. Here we propose an innovative dereplication strategy based on the combination of MN with an extensive in-silico MS/MS fragmentation database of NPs. Using two case studies, we demonstrate that this combined approach offers a powerful tool to navigate through the chemistry of complex NPs extracts, dereplicate metabolites, and annotate analogues of database entries.

  9. Freshwater Biological Traits Database (Data Sources)

    EPA Science Inventory

    When EPA released the final report, Freshwater Biological Traits Database, it referenced numerous data sources that are included below. The Traits Database report covers the development of a database of freshwater biological traits with additional traits that are relevan...

  10. BNDB - the Biochemical Network Database.

    PubMed

    Küntzer, Jan; Backes, Christina; Blum, Torsten; Gerasch, Andreas; Kaufmann, Michael; Kohlbacher, Oliver; Lenhof, Hans-Peter

    2007-10-02

    Technological advances in high-throughput techniques and efficient data acquisition methods have resulted in a massive amount of life science data. The data is stored in numerous databases that have been established over the last decades and are essential resources for scientists nowadays. However, the diversity of the databases and the underlying data models make it difficult to combine this information for solving complex problems in systems biology. Currently, researchers typically have to browse several, often highly focused, databases to obtain the required information. Hence, there is a pressing need for more efficient systems for integrating, analyzing, and interpreting these data. The standardization and virtual consolidation of the databases is a major challenge resulting in a unified access to a variety of data sources. We present the Biochemical Network Database (BNDB), a powerful relational database platform, allowing a complete semantic integration of an extensive collection of external databases. BNDB is built upon a comprehensive and extensible object model called BioCore, which is powerful enough to model most known biochemical processes and at the same time easily extensible to be adapted to new biological concepts. Besides a web interface for the search and curation of the data, a Java-based viewer (BiNA) provides a powerful platform-independent visualization and navigation of the data. BiNA uses sophisticated graph layout algorithms for an interactive visualization and navigation of BNDB. BNDB allows a simple, unified access to a variety of external data sources. Its tight integration with the biochemical network library BN++ offers the possibility for import, integration, analysis, and visualization of the data. BNDB is freely accessible at http://www.bndb.org.

  11. Regulatory administrative databases in FDA's Center for Biologics Evaluation and Research: convergence toward a unified database.

    PubMed

    Smith, Jeffrey K

    2013-04-01

    Regulatory administrative database systems within the Food and Drug Administration's (FDA) Center for Biologics Evaluation and Research (CBER) are essential to supporting its core mission, as a regulatory agency. Such systems are used within FDA to manage information and processes surrounding the processing, review, and tracking of investigational and marketed product submissions. This is an area of increasing interest in the pharmaceutical industry and has been a topic at trade association conferences (Buckley 2012). Such databases in CBER are complex, not for the type or relevance of the data to any particular scientific discipline but because of the variety of regulatory submission types and processes the systems support using the data. Commonalities among different data domains of CBER's regulatory administrative databases are discussed. These commonalities have evolved enough to constitute real database convergence and provide a valuable asset for business process intelligence. Balancing review workload across staff, exploring areas of risk in review capacity, process improvement, and presenting a clear and comprehensive landscape of review obligations are just some of the opportunities of such intelligence. This convergence has been occurring in the presence of usual forces that tend to drive information technology (IT) systems development toward separate stovepipes and data silos. CBER has achieved a significant level of convergence through a gradual process, using a clear goal, agreed upon development practices, and transparency of database objects, rather than through a single, discrete project or IT vendor solution. This approach offers a path forward for FDA systems toward a unified database.

  12. Database searching and accounting of multiplexed precursor and product ion spectra from the data independent analysis of simple and complex peptide mixtures.

    PubMed

    Li, Guo-Zhong; Vissers, Johannes P C; Silva, Jeffrey C; Golick, Dan; Gorenstein, Marc V; Geromanos, Scott J

    2009-03-01

    A novel database search algorithm is presented for the qualitative identification of proteins over a wide dynamic range, both in simple and complex biological samples. The algorithm has been designed for the analysis of data originating from data independent acquisitions, whereby multiple precursor ions are fragmented simultaneously. Measurements used by the algorithm include retention time, ion intensities, charge state, and accurate masses on both precursor and product ions from LC-MS data. The search algorithm uses an iterative process whereby each iteration incrementally increases the selectivity, specificity, and sensitivity of the overall strategy. Increased specificity is obtained by utilizing a subset database search approach, whereby for each subsequent stage of the search, only those peptides from securely identified proteins are queried. Tentative peptide and protein identifications are ranked and scored by their relative correlation to a number of models of known and empirically derived physicochemical attributes of proteins and peptides. In addition, the algorithm utilizes decoy database techniques for automatically determining the false positive identification rates. The search algorithm has been tested by comparing the search results from a four-protein mixture, the same four-protein mixture spiked into a complex biological background, and a variety of other "system" type protein digest mixtures. The method was validated independently by data dependent methods, while concurrently relying on replication and selectivity. Comparisons were also performed with other commercially and publicly available peptide fragmentation search algorithms. The presented results demonstrate the ability to correctly identify peptides and proteins from data independent acquisition strategies with high sensitivity and specificity. 
    They also illustrate a more comprehensive analysis of the samples studied, providing approximately 20% more protein identifications compared to a more conventional data directed approach using the same identification criteria, with a concurrent increase in both sequence coverage and the number of modified peptides.
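
    The decoy database technique mentioned above is commonly implemented by searching spectra against both real (target) and reversed or shuffled (decoy) sequences and using the decoy hits to estimate the error rate. The sketch below shows that general estimate only. The abstract does not give the paper's exact procedure, and the function name, score scale and sample matches are hypothetical.

```python
def estimate_fdr(matches, threshold):
    """Estimate the false discovery rate among matches scoring >= threshold.

    matches: iterable of (score, is_decoy) pairs.
    The estimate is (number of decoy hits) / (number of target hits),
    assuming false target hits occur about as often as decoy hits.
    """
    targets = sum(1 for score, is_decoy in matches if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in matches if score >= threshold and is_decoy)
    if targets == 0:
        # No accepted target hits: report 1.0 if only decoys passed, else 0.0
        return 1.0 if decoys else 0.0
    return decoys / targets


# Hypothetical search results: (score, matched a decoy sequence?)
MATCHES = [
    (9.1, False), (8.7, False), (8.2, True), (7.9, False),
    (7.5, False), (6.0, True), (5.5, False),
]

print(estimate_fdr(MATCHES, 7.0))
```

    Raising the threshold usually lowers the estimated rate at the cost of fewer accepted identifications, which is why the threshold is typically chosen to meet a target rate.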

  13. The transcriptome of Lutzomyia longipalpis (Diptera: Psychodidae) male reproductive organs.

    PubMed

    Azevedo, Renata V D M; Dias, Denise B S; Bretãs, Jorge A C; Mazzoni, Camila J; Souza, Nataly A; Albano, Rodolpho M; Wagner, Glauber; Davila, Alberto M R; Peixoto, Alexandre A

    2012-01-01

    It has been suggested that genes involved in the reproductive biology of insect disease vectors are potential targets for future alternative methods of control. Little is known about the molecular biology of reproduction in phlebotomine sand flies and there is no information available concerning genes that are expressed in male reproductive organs of Lutzomyia longipalpis, the main vector of American visceral leishmaniasis and a species complex. We generated 2678 high quality ESTs ("Expressed Sequence Tags") of L. longipalpis male reproductive organs that were grouped in 1391 non-redundant sequences (1136 singlets and 255 clusters). BLAST analysis revealed that only 57% of these sequences share similarity with a L. longipalpis female EST database. Although no more than 36% of the non-redundant sequences showed similarity to protein sequences deposited in databases, more than half of them presented the best-match hits with mosquito genes. Gene ontology analysis identified subsets of genes involved in biological processes such as protein biosynthesis and DNA replication, which are probably associated with spermatogenesis. A number of non-redundant sequences were also identified as putative male reproductive gland proteins (mRGPs), also known as male accessory gland protein genes (Acps). The transcriptome analysis of L. longipalpis male reproductive organs is one step further in the study of the molecular basis of the reproductive biology of this important species complex. It has allowed the identification of genes potentially involved in spermatogenesis as well as putative mRGPs sequences, which have been studied in many insect species because of their effects on female post-mating behavior and physiology and their potential role in sexual selection and speciation. These data open a number of new avenues for further research in the molecular and evolutionary reproductive biology of sand flies.

  14. The Transcriptome of Lutzomyia longipalpis (Diptera: Psychodidae) Male Reproductive Organs

    PubMed Central

    Bretãs, Jorge A. C.; Mazzoni, Camila J.; Souza, Nataly A.; Albano, Rodolpho M.; Wagner, Glauber; Davila, Alberto M. R.; Peixoto, Alexandre A.

    2012-01-01

    Background It has been suggested that genes involved in the reproductive biology of insect disease vectors are potential targets for future alternative methods of control. Little is known about the molecular biology of reproduction in phlebotomine sand flies and there is no information available concerning genes that are expressed in male reproductive organs of Lutzomyia longipalpis, the main vector of American visceral leishmaniasis and a species complex. Methods/Principal Findings We generated 2678 high quality ESTs (“Expressed Sequence Tags”) of L. longipalpis male reproductive organs that were grouped in 1391 non-redundant sequences (1136 singlets and 255 clusters). BLAST analysis revealed that only 57% of these sequences share similarity with a L. longipalpis female EST database. Although no more than 36% of the non-redundant sequences showed similarity to protein sequences deposited in databases, more than half of them presented the best-match hits with mosquito genes. Gene ontology analysis identified subsets of genes involved in biological processes such as protein biosynthesis and DNA replication, which are probably associated with spermatogenesis. A number of non-redundant sequences were also identified as putative male reproductive gland proteins (mRGPs), also known as male accessory gland protein genes (Acps). Conclusions The transcriptome analysis of L. longipalpis male reproductive organs is one step further in the study of the molecular basis of the reproductive biology of this important species complex. It has allowed the identification of genes potentially involved in spermatogenesis as well as putative mRGPs sequences, which have been studied in many insect species because of their effects on female post-mating behavior and physiology and their potential role in sexual selection and speciation. These data open a number of new avenues for further research in the molecular and evolutionary reproductive biology of sand flies. PMID:22496818

  15. ChemProt-2.0: visual navigation in a disease chemical biology database

    PubMed Central

    Kim Kjærulff, Sonny; Wich, Louis; Kringelum, Jens; Jacobsen, Ulrik P.; Kouskoumvekaki, Irene; Audouze, Karine; Lund, Ole; Brunak, Søren; Oprea, Tudor I.; Taboureau, Olivier

    2013-01-01

    ChemProt-2.0 (http://www.cbs.dtu.dk/services/ChemProt-2.0) is a publicly available compilation of multiple chemical–protein annotation resources integrated with diseases and clinical outcomes information. The database has been updated to >1.15 million compounds with 5.32 million bioactivity measurements for 15 290 proteins. Each protein is linked to quality-scored human protein–protein interaction data based on more than half a million interactions, for studying diseases and biological outcomes (diseases, pathways and GO terms) through protein complexes. In ChemProt-2.0, therapeutic effects as well as adverse drug reactions have been integrated, making it possible to suggest proteins associated with clinical outcomes. New chemical structure fingerprints were computed based on the similarity ensemble approach. Protein sequence similarity search was also integrated to evaluate the promiscuity of proteins, which can help in the prediction of off-target effects. Finally, the database was integrated into a visual interface that enables navigation of the pharmacological space for small molecules. Filtering options were included in order to facilitate and to guide dynamic search of specific queries. PMID:23185041

  16. Seed Storage Proteins as a System for Teaching Protein Identification by Mass Spectrometry in Biochemistry Laboratory

    ERIC Educational Resources Information Center

    Wilson, Karl A.; Tan-Wilson, Anna

    2013-01-01

    Mass spectrometry (MS) has become an important tool in studying biological systems. One application is the identification of proteins and peptides by the matching of peptide and peptide fragment masses to the sequences of proteins in protein sequence databases. Often prior protein separation of complex protein mixtures by 2D-PAGE is needed,…

  17. Conversion of KEGG metabolic pathways to SBGN maps including automatic layout

    PubMed Central

    2013-01-01

    Background Biologists make frequent use of databases containing large and complex biological networks. One popular database is the Kyoto Encyclopedia of Genes and Genomes (KEGG) which uses its own graphical representation and manual layout for pathways. While some general drawing conventions exist for biological networks, arbitrary graphical representations are very common. Recently, a new standard has been established for displaying biological processes, the Systems Biology Graphical Notation (SBGN), which aims to unify the look of such maps. Ideally, online repositories such as KEGG would automatically provide networks in a variety of notations including SBGN. Unfortunately, this is non‐trivial, since converting between notations may add, remove or otherwise alter map elements so that the existing layout cannot be simply reused. Results Here we describe a methodology for automatic translation of KEGG metabolic pathways into the SBGN format. We infer important properties of the KEGG layout and treat these as layout constraints that are maintained during the conversion to SBGN maps. Conclusions This allows for the drawing and layout conventions of SBGN to be followed while creating maps that are still recognizably the original KEGG pathways. This article details the steps in this process and provides examples of the final result. PMID:23953132

  18. Palingol: a declarative programming language to describe nucleic acids' secondary structures and to scan sequence database.

    PubMed Central

    Billoud, B; Kontic, M; Viari, A

    1996-01-01

    At the DNA/RNA level, biological signals are defined by a combination of spatial structures and sequence motifs. Until now, few attempts had been made in writing general purpose search programs that take into account both sequence and structure criteria. Indeed, the most successful structure scanning programs are usually dedicated to particular structures and are written using general purpose programming languages through a complex and time consuming process where the biological problem of defining the structure and the computer engineering problem of looking for it are intimately intertwined. In this paper, we describe a general representation of structures, suitable for database scanning, together with a programming language, Palingol, designed to manipulate it. Palingol has specific data types, corresponding to structural elements (basically helices) that can be arranged in any way to form a complex structure. As a consequence of the declarative approach used in Palingol, the user need only focus on 'what to search for' while the language engine takes care of 'how to look for it'. Therefore, it becomes simpler to write a scanning program and the structural constraints that define the required structure are more clearly identified. PMID:8628670

  19. SBROME: a scalable optimization and module matching framework for automated biosystems design.

    PubMed

    Huynh, Linh; Tsoukalas, Athanasios; Köppe, Matthias; Tagkopoulos, Ilias

    2013-05-17

    The development of a scalable framework for biodesign automation is a formidable challenge given the expected increase in part availability and the ever-growing complexity of synthetic circuits. To allow for (a) the use of previously constructed and characterized circuits or modules and (b) the implementation of designs that can scale up to hundreds of nodes, we here propose a divide-and-conquer Synthetic Biology Reusable Optimization Methodology (SBROME). An abstract user-defined circuit is first transformed and matched against a module database that incorporates circuits that have previously been experimentally characterized. Then the resulting circuit is decomposed to subcircuits that are populated with the set of parts that best approximate the desired function. Finally, all subcircuits are subsequently characterized and deposited back to the module database for future reuse. We successfully applied SBROME toward two alternative designs of a modular 3-input multiplexer that utilize pre-existing logic gates and characterized biological parts.

  20. MIPS Arabidopsis thaliana Database (MAtDB): an integrated biological knowledge resource based on the first complete plant genome

    PubMed Central

    Schoof, Heiko; Zaccaria, Paolo; Gundlach, Heidrun; Lemcke, Kai; Rudd, Stephen; Kolesov, Grigory; Arnold, Roland; Mewes, H. W.; Mayer, Klaus F. X.

    2002-01-01

    Arabidopsis thaliana is the first plant for which the complete genome has been sequenced and published. Annotation of complex eukaryotic genomes requires more than the assignment of genetic elements to the sequence. Besides completing the list of genes, we need to discover their cellular roles, their regulation and their interactions in order to understand the workings of the whole plant. The MIPS Arabidopsis thaliana Database (MAtDB; http://mips.gsf.de/proj/thal/db) started out as a repository for genome sequence data in the European Scientists Sequencing Arabidopsis (ESSA) project and the Arabidopsis Genome Initiative. Our aim is to transform MAtDB into an integrated biological knowledge resource by integrating diverse data, tools, query and visualization capabilities and by creating a comprehensive resource for Arabidopsis as a reference model for other species, including crop plants. PMID:11752263

  1. ATLAS of Biochemistry: A Repository of All Possible Biochemical Reactions for Synthetic Biology and Metabolic Engineering Studies.

    PubMed

    Hadadi, Noushin; Hafner, Jasmin; Shajkofci, Adrian; Zisaki, Aikaterini; Hatzimanikatis, Vassily

    2016-10-21

    Because the complexity of metabolism cannot be intuitively understood or analyzed, computational methods are indispensable for studying biochemistry and deepening our understanding of cellular metabolism to promote new discoveries. We used the computational framework BNICE.ch along with cheminformatic tools to assemble the whole theoretical reactome from the known metabolome through expansion of the known biochemistry presented in the Kyoto Encyclopedia of Genes and Genomes (KEGG) database. We constructed the ATLAS of Biochemistry, a database of all theoretical biochemical reactions based on known biochemical principles and compounds. ATLAS includes more than 130 000 hypothetical enzymatic reactions that connect two or more KEGG metabolites through novel enzymatic reactions that have never been reported to occur in living organisms. Moreover, ATLAS reactions integrate 42% of KEGG metabolites that are not currently present in any KEGG reaction into one or more novel enzymatic reactions. The generated repository of information is organized in a Web-based database (http://lcsb-databases.epfl.ch/atlas/) that allows the user to search for all possible routes from any substrate compound to any product. The resulting pathways involve known and novel enzymatic steps that may indicate unidentified enzymatic activities and provide potential targets for protein engineering. Our approach of introducing novel biochemistry into pathway design and associated databases will be important for synthetic biology and metabolic engineering.
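
    A search for all routes from a substrate to a product can be pictured as path enumeration over a directed reaction graph. The sketch below is a generic depth-first enumeration of simple paths with a step limit. It is not the ATLAS implementation, and the compound labels, the `REACTIONS` mapping and the function name are hypothetical.

```python
def find_routes(reactions, substrate, product, max_steps):
    """Enumerate simple routes from substrate to product.

    reactions: dict mapping a compound to the compounds it can be
    converted into by one reaction step.
    max_steps: maximum number of reaction steps allowed in a route.
    """
    routes = []
    stack = [[substrate]]
    while stack:
        path = stack.pop()
        current = path[-1]
        if current == product:
            routes.append(path)
            continue
        if len(path) - 1 >= max_steps:
            continue  # step budget used up without reaching the product
        for nxt in reactions.get(current, []):
            if nxt not in path:  # avoid revisiting a compound (no cycles)
                stack.append(path + [nxt])
    # Shortest routes first, ties broken alphabetically for stable output
    return sorted(routes, key=lambda r: (len(r), r))


# Hypothetical reaction network
REACTIONS = {"A": ["B", "C"], "B": ["D"], "C": ["B", "D"], "D": []}

for route in find_routes(REACTIONS, "A", "D", max_steps=3):
    print(" -> ".join(route))
```

    The step limit keeps the enumeration bounded, since the number of simple paths can grow very quickly in a densely connected network.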

  2. BioInt: an integrative biological object-oriented application framework and interpreter.

    PubMed

    Desai, Sanket; Burra, Prasad

    2015-01-01

    BioInt, a biological programming application framework and interpreter, is an attempt to equip the researchers with seamless integration, efficient extraction and effortless analysis of the data from various biological databases and algorithms. Based on the type of biological data, algorithms and related functionalities, a biology-specific framework was developed which has nine modules. The modules are a compilation of numerous reusable BioADTs. This software ecosystem containing more than 450 biological objects underneath the interpreter makes it flexible, integrative and comprehensive. Similar to Python, BioInt eliminates the compilation and linking steps cutting the time significantly. The researcher can write the scripts using available BioADTs (following C++ syntax) and execute them interactively or use as a command line application. It has features that enable automation, extension of the framework with new/external BioADTs/libraries and deployment of complex work flows.

  3. bioNerDS: exploring bioinformatics’ database and software use through literature mining

    PubMed Central

    2013-01-01

    Background Biology-focused databases and software define bioinformatics and their use is central to computational biology. In such a complex and dynamic field, it is of interest to understand what resources are available, which are used, how much they are used, and for what they are used. While scholarly literature surveys can provide some insights, large-scale computer-based approaches to identify mentions of bioinformatics databases and software from primary literature would automate systematic cataloguing, facilitate the monitoring of usage, and provide the foundations for the recovery of computational methods for analysing biological data, with the long-term aim of identifying best/common practice in different areas of biology. Results We have developed bioNerDS, a named entity recogniser for the recovery of bioinformatics databases and software from primary literature. We identify such entities with an F-measure ranging from 63% to 91% at the mention level and 63-78% at the document level, depending on corpus. Not attaining a higher F-measure is mostly due to high ambiguity in resource naming, which is compounded by the on-going introduction of new resources. To demonstrate the software, we applied bioNerDS to full-text articles from BMC Bioinformatics and Genome Biology. General mention patterns reflect the remit of these journals, highlighting BMC Bioinformatics’s emphasis on new tools and Genome Biology’s greater emphasis on data analysis. The data also illustrate some shifts in resource usage: for example, the past decade has seen R and the Gene Ontology join BLAST and GenBank as the main components in bioinformatics processing. Conclusions We demonstrate the feasibility of automatically identifying resource names on a large scale from the scientific literature and show that the generated data can be used for exploration of bioinformatics database and software usage. For example, our results help to investigate the rate of change in resource usage and corroborate the suspicion that a vast majority of resources are created, but rarely (if ever) used thereafter. bioNerDS is available at http://bionerds.sourceforge.net/. PMID:23768135
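    The F-measure quoted in this record combines precision and recall into a single score. As a minimal sketch, assuming the balanced F1 variant (the abstract does not say which weighting was used) and using a hypothetical helper name and made-up counts, it can be computed from mention counts as follows:

```python
def f_measure(true_positives, false_positives, false_negatives):
    """Balanced F1 score from raw counts (assumed variant, not confirmed by the abstract)."""
    predicted = true_positives + false_positives   # mentions the recogniser reported
    actual = true_positives + false_negatives      # mentions in the gold standard
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    if precision + recall == 0:
        return 0.0
    # Harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)


# Illustrative counts only; these are not figures from the bioNerDS evaluation.
score = f_measure(true_positives=80, false_positives=20, false_negatives=20)
print(f"F1 = {score:.2f}")
```

    Because the harmonic mean is pulled towards the lower of the two values, a recogniser with high recall but many ambiguous false matches still scores poorly, which fits the abstract's remark about naming ambiguity.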

  4. Update of KDBI: Kinetic Data of Bio-molecular Interaction database

    PubMed Central

    Kumar, Pankaj; Han, B. C.; Shi, Z.; Jia, J.; Wang, Y. P.; Zhang, Y. T.; Liang, L.; Liu, Q. F.; Ji, Z. L.; Chen, Y. Z.

    2009-01-01

    Knowledge of the kinetics of biomolecular interactions is important for facilitating the study of cellular processes and underlying molecular events, and is essential for quantitative study and simulation of biological systems. Kinetic Data of Bio-molecular Interaction database (KDBI) has been developed to provide information about experimentally determined kinetic data of protein–protein, protein–nucleic acid, protein–ligand, nucleic acid–ligand binding or reaction events described in the literature. To accommodate increasing demand for studying and simulating biological systems, numerous improvements and updates have been made to KDBI, including new ways to access data by pathway and molecule names, data files in Systems Biology Markup Language format, a more efficient search engine, access to published parameter sets of simulation models of 63 pathways, and a 2.3-fold increase of data (19 263 entries of 10 532 distinctive biomolecular binding and 11 954 interaction events, involving 2635 proteins/protein complexes, 847 nucleic acids, 1603 small molecules and 45 multi-step processes). KDBI is publicly available at http://bidd.nus.edu.sg/group/kdbi/kdbi.asp. PMID:18971255

  5. [Establishment of a comprehensive database for laryngeal cancer related genes and the miRNAs].

    PubMed

    Li, Mengjiao; E, Qimin; Liu, Jialin; Huang, Tingting; Liang, Chuanyu

    2015-09-01

    The aim was to build a comprehensive laryngeal cancer-related gene database by collecting and analyzing laryngeal cancer related genes and miRNAs. Unlike current biological information databases, whose structure is complex and clumsy, it focuses on genes and miRNAs and is intended to make research and teaching more convenient and efficient. Based on the B/S (browser/server) architecture, with Apache as the Web server, MySQL for the database design and PHP as the coding language of the web design, a comprehensive database for laryngeal cancer-related genes was established, providing gene tables, protein tables, miRNA tables and clinical information tables for patients with laryngeal cancer. The established database contained 207 laryngeal cancer related genes, 243 proteins, 26 miRNAs, and their particular information such as mutations, methylations, diversified expressions, and the empirical references of laryngeal cancer relevant molecules. The database could be accessed and operated via the Internet, through which the information was browsed and retrieved. The database was maintained and updated regularly. The database for laryngeal cancer related genes is resource-integrated and user-friendly, providing a genetic information query tool for the study of laryngeal cancer.

  6. Report of the matrix of biological knowledge workshop

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Morowitz, H.J.; Smith, T.

    1987-10-30

    Current understanding of biology involves complex relationships rooted in enormous amounts of data. These data include entries from biochemistry, ecology, genetics, human and veterinary medicine, molecular structure studies, agriculture, embryology, systematics, and many other disciplines. The present wealth of biological data goes beyond past accumulations and now includes new understandings from molecular biology. Several important biological databases are currently being supported, and more are planned; however, major problems of interdatabase communication and management efficiency abound. Few scientists are currently capable of keeping up with this ever-increasing wealth of knowledge, let alone searching it efficiently for new or unsuspected links and important analogies. Yet this is what is required if the continued rapid generation of such data is to lead most effectively to the major conceptual, medical, and agricultural advances anticipated over the coming decades in the United States. The opportunity exists to combine the potential of modern computer science, database management, and artificial intelligence in a major effort to organize the vast wealth of biological and clinical data. The time is right because the amount of data is still manageable even in its current highly-fragmented form; important hardware and computer science tools have been greatly improved; and there have been recent fundamental advances in our comprehension of biology. This latter is particularly true at the molecular level where the information for nearly all higher structure and function is encoded. The organization of all biological experimental data coordinately within a structure incorporating our current understanding - the Matrix of Biological Knowledge - will provide the data and structure for the major advances foreseen in the years ahead.

  7. Systems Biology Approaches for Discovering Biomarkers for Traumatic Brain Injury

    PubMed Central

    Feala, Jacob D.; AbdulHameed, Mohamed Diwan M.; Yu, Chenggang; Dutta, Bhaskar; Yu, Xueping; Schmid, Kara; Dave, Jitendra; Tortella, Frank

    2013-01-01

    The rate of traumatic brain injury (TBI) in service members with wartime injuries has risen rapidly in recent years, and complex, variable links have emerged between TBI and long-term neurological disorders. The multifactorial nature of TBI secondary cellular response has confounded attempts to find cellular biomarkers for its diagnosis and prognosis or for guiding therapy for brain injury. One possibility is to apply emerging systems biology strategies to holistically probe and analyze the complex interweaving molecular pathways and networks that mediate the secondary cellular response through computational models that integrate these diverse data sets. Here, we review available systems biology strategies, databases, and tools. In addition, we describe opportunities for applying this methodology to existing TBI data sets to identify new biomarker candidates and gain insights about the underlying molecular mechanisms of TBI response. As an exemplar, we apply network and pathway analysis to a manually compiled list of 32 protein biomarker candidates from the literature, recover known TBI-related mechanisms, and generate hypothetical new biomarker candidates. PMID:23510232

  8. dbAMEPNI: a database of alanine mutagenic effects for protein–nucleic acid interactions

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Liu, Ling; Xiong, Yi; Gao, Hongyun

    Protein–nucleic acid interactions play essential roles in various biological activities such as gene regulation, transcription, DNA repair and DNA packaging. Understanding the effects of amino acid substitutions on protein–nucleic acid binding affinities can help elucidate the molecular mechanism of protein–nucleic acid recognition. Until now, no comprehensive and updated database of quantitative binding data on alanine mutagenic effects for protein–nucleic acid interactions is publicly accessible. Thus, we developed a new database of Alanine Mutagenic Effects for Protein-Nucleic Acid Interactions (dbAMEPNI). dbAMEPNI is a manually curated, literature-derived database, comprising over 577 alanine mutagenic data with experimentally determined binding affinities for protein–nucleic acid complexes. Here, it contains several important parameters, such as dissociation constant (Kd), Gibbs free energy change (ΔΔG), experimental conditions and structural parameters of mutant residues. In addition, the database provides an extended dataset of 282 single alanine mutations with only qualitative data (or descriptive effects) of thermodynamic information.

  9. dbAMEPNI: a database of alanine mutagenic effects for protein–nucleic acid interactions

    DOE PAGES

    Liu, Ling; Xiong, Yi; Gao, Hongyun; ...

    2018-04-02

    Protein–nucleic acid interactions play essential roles in various biological activities such as gene regulation, transcription, DNA repair and DNA packaging. Understanding the effects of amino acid substitutions on protein–nucleic acid binding affinities can help elucidate the molecular mechanism of protein–nucleic acid recognition. Until now, no comprehensive and updated database of quantitative binding data on alanine mutagenic effects for protein–nucleic acid interactions is publicly accessible. Thus, we developed a new database of Alanine Mutagenic Effects for Protein-Nucleic Acid Interactions (dbAMEPNI). dbAMEPNI is a manually curated, literature-derived database, comprising over 577 alanine mutagenic data with experimentally determined binding affinities for protein–nucleic acid complexes. Here, it contains several important parameters, such as dissociation constant (Kd), Gibbs free energy change (ΔΔG), experimental conditions and structural parameters of mutant residues. In addition, the database provides an extended dataset of 282 single alanine mutations with only qualitative data (or descriptive effects) of thermodynamic information.
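    The two quantities named in this record, Kd and ΔΔG, are linked by the standard thermodynamic relation ΔG = RT ln(Kd), so that ΔΔG = RT ln(Kd_mutant / Kd_wild-type). The sketch below illustrates that relation only; the function name, the default temperature and the sign convention (mutant minus wild type, so a positive value means weaker binding) are assumptions, and the abstract does not state which convention or units dbAMEPNI uses.

```python
import math

# Gas constant in kcal/(mol*K)
R_KCAL_PER_MOL_K = 1.987204e-3


def ddg_from_kd(kd_wild_type, kd_mutant, temperature_k=298.15):
    """Binding free energy change (kcal/mol) of a mutant relative to wild type.

    Uses ddG = R * T * ln(Kd_mutant / Kd_wild_type). Both Kd values must be
    in the same units (e.g. molar). Hypothetical helper for illustration.
    """
    if kd_wild_type <= 0 or kd_mutant <= 0:
        raise ValueError("dissociation constants must be positive")
    return R_KCAL_PER_MOL_K * temperature_k * math.log(kd_mutant / kd_wild_type)


# Made-up example: an alanine substitution that weakens binding tenfold.
print(round(ddg_from_kd(kd_wild_type=1e-9, kd_mutant=1e-8), 2))
```

    Under this convention each tenfold loss of affinity at room temperature corresponds to a ΔΔG of a little under 1.4 kcal/mol.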

  10. Gene-Disease Network Analysis Reveals Functional Modules in Mendelian, Complex and Environmental Diseases

    PubMed Central

    Bauer-Mehren, Anna; Bundschus, Markus; Rautschka, Michael; Mayer, Miguel A.; Sanz, Ferran; Furlong, Laura I.

    2011-01-01

    Background Scientists have been trying to understand the molecular mechanisms of diseases to design preventive and therapeutic strategies for a long time. For some diseases, it has become evident that it is not enough to obtain a catalogue of the disease-related genes but to uncover how disruptions of molecular networks in the cell give rise to disease phenotypes. Moreover, with the unprecedented wealth of information available, even obtaining such catalogue is extremely difficult. Principal Findings We developed a comprehensive gene-disease association database by integrating associations from several sources that cover different biomedical aspects of diseases. In particular, we focus on the current knowledge of human genetic diseases including mendelian, complex and environmental diseases. To assess the concept of modularity of human diseases, we performed a systematic study of the emergent properties of human gene-disease networks by means of network topology and functional annotation analysis. The results indicate a highly shared genetic origin of human diseases and show that for most diseases, including mendelian, complex and environmental diseases, functional modules exist. Moreover, a core set of biological pathways is found to be associated with most human diseases. We obtained similar results when studying clusters of diseases, suggesting that related diseases might arise due to dysfunction of common biological processes in the cell. Conclusions For the first time, we include mendelian, complex and environmental diseases in an integrated gene-disease association database and show that the concept of modularity applies for all of them. We furthermore provide a functional analysis of disease-related modules providing important new biological insights, which might not be discovered when considering each of the gene-disease association repositories independently. 
Hence, we present a suitable framework for the study of how genetic and environmental factors, such as drugs, contribute to diseases. Availability The gene-disease networks used in this study and part of the analysis are available at http://ibi.imim.es/DisGeNET/DisGeNETweb.html#Download. PMID:21695124

  11. Gene-disease network analysis reveals functional modules in mendelian, complex and environmental diseases.

    PubMed

    Bauer-Mehren, Anna; Bundschus, Markus; Rautschka, Michael; Mayer, Miguel A; Sanz, Ferran; Furlong, Laura I

    2011-01-01

    Scientists have been trying to understand the molecular mechanisms of diseases to design preventive and therapeutic strategies for a long time. For some diseases, it has become evident that it is not enough to obtain a catalogue of the disease-related genes but to uncover how disruptions of molecular networks in the cell give rise to disease phenotypes. Moreover, with the unprecedented wealth of information available, even obtaining such catalogue is extremely difficult. We developed a comprehensive gene-disease association database by integrating associations from several sources that cover different biomedical aspects of diseases. In particular, we focus on the current knowledge of human genetic diseases including mendelian, complex and environmental diseases. To assess the concept of modularity of human diseases, we performed a systematic study of the emergent properties of human gene-disease networks by means of network topology and functional annotation analysis. The results indicate a highly shared genetic origin of human diseases and show that for most diseases, including mendelian, complex and environmental diseases, functional modules exist. Moreover, a core set of biological pathways is found to be associated with most human diseases. We obtained similar results when studying clusters of diseases, suggesting that related diseases might arise due to dysfunction of common biological processes in the cell. For the first time, we include mendelian, complex and environmental diseases in an integrated gene-disease association database and show that the concept of modularity applies for all of them. We furthermore provide a functional analysis of disease-related modules providing important new biological insights, which might not be discovered when considering each of the gene-disease association repositories independently. Hence, we present a suitable framework for the study of how genetic and environmental factors, such as drugs, contribute to diseases. 
The gene-disease networks used in this study and part of the analysis are available at http://ibi.imim.es/DisGeNET/DisGeNETweb.html#Download.

  12. EssOilDB: a database of essential oils reflecting terpene composition and variability in the plant kingdom

    PubMed Central

    Kumari, Sangita; Pundhir, Sachin; Priya, Piyush; Jeena, Ganga; Punetha, Ankita; Chawla, Konika; Firdos Jafaree, Zohra; Mondal, Subhasish; Yadav, Gitanjali

    2014-01-01

    Plant essential oils are complex mixtures of volatile organic compounds, which play indispensable roles in the environment, for the plant itself, as well as for humans. The potential biological information stored in essential oil composition data can provide an insight into the silent language of plants, and the roles of these chemical emissions in defense, communication and pollinator attraction. In order to decipher volatile profile patterns from a global perspective, we have developed the ESSential OIL DataBase (EssOilDB), a continually updated, freely available electronic database designed to provide knowledge resource for plant essential oils, that enables one to address a multitude of queries on volatile profiles of native, invasive, normal or stressed plants, across taxonomic clades, geographical locations and several other biotic and abiotic influences. To our knowledge, EssOilDB is the only database in the public domain providing an opportunity for context based scientific research on volatile patterns in plants. EssOilDB presently contains 123 041 essential oil records spanning a century of published reports on volatile profiles, with data from 92 plant taxonomic families, spread across diverse geographical locations all over the globe. We hope that this huge repository of VOCs will facilitate unraveling of the true significance of volatiles in plants, along with creating potential avenues for industrial applications of essential oils. We also illustrate the use of this database in terpene biology and show how EssOilDB can be used to complement data from computational genomics to gain insights into the diversity and variability of terpenoids in the plant kingdom. 
EssOilDB would serve as a valuable information resource, for students and researchers in plant biology, in the design and discovery of new odor profiles, as well as for entrepreneurs—the potential for generating consumer specific scents being one of the most attractive and interesting topics in the cosmetic industry. Database URL: http://nipgr.res.in/Essoildb/ PMID:25534749

  13. Multi-factorial analysis of class prediction error: estimating optimal number of biomarkers for various classification rules.

    PubMed

    Khondoker, Mizanur R; Bachmann, Till T; Mewissen, Muriel; Dickinson, Paul; Dobrzelecki, Bartosz; Campbell, Colin J; Mount, Andrew R; Walton, Anthony J; Crain, Jason; Schulze, Holger; Giraud, Gerard; Ross, Alan J; Ciani, Ilenia; Ember, Stuart W J; Tlili, Chaker; Terry, Jonathan G; Grant, Eilidh; McDonnell, Nicola; Ghazal, Peter

    2010-12-01

    Machine learning and statistical model based classifiers have increasingly been used with more complex and high dimensional biological data obtained from high-throughput technologies. Understanding the impact of various factors associated with large and complex microarray datasets on the predictive performance of classifiers is computationally intensive, under investigated, yet vital in determining the optimal number of biomarkers for various classification purposes aimed towards improved detection, diagnosis, and therapeutic monitoring of diseases. We investigate the impact of microarray based data characteristics on the predictive performance for various classification rules using simulation studies. Our investigation using Random Forest, Support Vector Machines, Linear Discriminant Analysis and k-Nearest Neighbour shows that the predictive performance of classifiers is strongly influenced by training set size, biological and technical variability, replication, fold change and correlation between biomarkers. Optimal number of biomarkers for a classification problem should therefore be estimated taking account of the impact of all these factors. A database of average generalization errors is built for various combinations of these factors. The database of generalization errors can be used for estimating the optimal number of biomarkers for given levels of predictive accuracy as a function of these factors. Examples show that curves from actual biological data resemble that of simulated data with corresponding levels of data characteristics. An R package optBiomarker implementing the method is freely available for academic use from the Comprehensive R Archive Network (http://www.cran.r-project.org/web/packages/optBiomarker/).

  14. An updated version of NPIDB includes new classifications of DNA–protein complexes and their families

    PubMed Central

    Zanegina, Olga; Kirsanov, Dmitriy; Baulin, Eugene; Karyagina, Anna; Alexeevski, Andrei; Spirin, Sergey

    2016-01-01

    The recent upgrade of nucleic acid–protein interaction database (NPIDB, http://npidb.belozersky.msu.ru/) includes a newly elaborated classification of complexes of protein domains with double-stranded DNA and a classification of families of related complexes. Our classifications are based on contacting structural elements of both DNA: the major groove, the minor groove and the backbone; and protein: helices, beta-strands and unstructured segments. We took into account both hydrogen bonds and hydrophobic interaction. The analyzed material contains 1942 structures of protein domains from 748 PDB entries. We have identified 97 interaction modes of individual protein domain–DNA complexes and 17 DNA–protein interaction classes of protein domain families. We analyzed the sources of diversity of DNA–protein interaction modes in different complexes of one protein domain family. The observed interaction mode is sometimes influenced by artifacts of crystallization or diversity in secondary structure assignment. The interaction classes of domain families are more stable and thus possess more biological sense than a classification of single complexes. Integration of the classification into NPIDB allows the user to browse the database according to the interacting structural elements of DNA and protein molecules. For each family, we present average DNA shape parameters in contact zones with domains of the family. PMID:26656949

  15. EvoluCode: Evolutionary Barcodes as a Unifying Framework for Multilevel Evolutionary Data.

    PubMed

    Linard, Benjamin; Nguyen, Ngoc Hoan; Prosdocimi, Francisco; Poch, Olivier; Thompson, Julie D

    2012-01-01

    Evolutionary systems biology aims to uncover the general trends and principles governing the evolution of biological networks. An essential part of this process is the reconstruction and analysis of the evolutionary histories of these complex, dynamic networks. Unfortunately, the methodologies for representing and exploiting such complex evolutionary histories in large scale studies are currently limited. Here, we propose a new formalism, called EvoluCode (Evolutionary barCode), which allows the integration of different evolutionary parameters (eg, sequence conservation, orthology, synteny …) in a unifying format and facilitates the multilevel analysis and visualization of complex evolutionary histories at the genome scale. The advantages of the approach are demonstrated by constructing barcodes representing the evolution of the complete human proteome. Two large-scale studies are then described: (i) the mapping and visualization of the barcodes on the human chromosomes and (ii) automatic clustering of the barcodes to highlight protein subsets sharing similar evolutionary histories and their functional analysis. The methodologies developed here open the way to the efficient application of other data mining and knowledge extraction techniques in evolutionary systems biology studies. A database containing all EvoluCode data is available at: http://lbgi.igbmc.fr/barcodes.

  16. Achieving high confidence protein annotations in a sea of unknowns

    NASA Astrophysics Data System (ADS)

    Timmins-Schiffman, E.; May, D. H.; Noble, W. S.; Nunn, B. L.; Mikan, M.; Harvey, H. R.

    2016-02-01

    Increased sensitivity of mass spectrometry (MS) technology allows deep and broad insight into community functional analyses. Metaproteomics holds the promise to reveal functional responses of natural microbial communities, whereas metagenomics alone can only hint at potential functions. The complex datasets resulting from ocean MS have the potential to inform diverse realms of the biological, chemical, and physical ocean sciences, yet the extent of bacterial functional diversity and redundancy has not been fully explored. To take advantage of these impressive datasets, we need a clear bioinformatics pipeline for metaproteomics peptide identification and annotation with a database that can provide confident identifications. Researchers must consider whether it is sufficient to leverage the vast quantities of available ocean sequence data or if they must invest in site-specific metagenomic sequencing. We have sequenced, to our knowledge, the first western arctic metagenomes from the Bering Strait and the Chukchi Sea. We have addressed the long standing question: Is a metagenome required to accurately complete metaproteomics and assess the biological distribution of metabolic functions controlling nutrient acquisition in the ocean? Two different protein databases were constructed from 1) a site-specific metagenome and 2) subarctic/arctic groups available in NCBI's non-redundant database. Multiple proteomic search strategies were employed, against each individual database and against both databases combined, to determine the algorithm and approach that yielded the balance of high sensitivity and confident identification. Results yielded over 8200 confidently identified proteins. Our comparison of these results allows us to quantify the utility of investing resources in a metagenome versus using the constantly expanding and immediately available public databases for metaproteomic studies.

  17. Integrative Sparse K-Means With Overlapping Group Lasso in Genomic Applications for Disease Subtype Discovery

    PubMed Central

    Huo, Zhiguang; Tseng, George

    2017-01-01

    Cancer subtypes discovery is the first step to deliver personalized medicine to cancer patients. With the accumulation of massive multi-level omics datasets and established biological knowledge databases, omics data integration with incorporation of rich existing biological knowledge is essential for deciphering a biological mechanism behind the complex diseases. In this manuscript, we propose an integrative sparse K-means (is-K means) approach to discover disease subtypes with the guidance of prior biological knowledge via sparse overlapping group lasso. An algorithm using an alternating direction method of multiplier (ADMM) will be applied for fast optimization. Simulation and three real applications in breast cancer and leukemia will be used to compare is-K means with existing methods and demonstrate its superior clustering accuracy, feature selection, functional annotation of detected molecular features and computing efficiency. PMID:28959370

  18. Integrative Sparse K-Means With Overlapping Group Lasso in Genomic Applications for Disease Subtype Discovery.

    PubMed

    Huo, Zhiguang; Tseng, George

    2017-06-01

    Cancer subtypes discovery is the first step to deliver personalized medicine to cancer patients. With the accumulation of massive multi-level omics datasets and established biological knowledge databases, omics data integration with incorporation of rich existing biological knowledge is essential for deciphering a biological mechanism behind the complex diseases. In this manuscript, we propose an integrative sparse K-means (is-K means) approach to discover disease subtypes with the guidance of prior biological knowledge via sparse overlapping group lasso. An algorithm using an alternating direction method of multiplier (ADMM) will be applied for fast optimization. Simulation and three real applications in breast cancer and leukemia will be used to compare is-K means with existing methods and demonstrate its superior clustering accuracy, feature selection, functional annotation of detected molecular features and computing efficiency.

  19. Comprehensive coverage of cardiovascular disease data in the disease portals at the Rat Genome Database.

    PubMed

    Wang, Shur-Jen; Laulederkind, Stanley J F; Hayman, G Thomas; Petri, Victoria; Smith, Jennifer R; Tutaj, Marek; Nigam, Rajni; Dwinell, Melinda R; Shimoyama, Mary

    2016-08-01

    Cardiovascular diseases are complex diseases caused by a combination of genetic and environmental factors. To facilitate progress in complex disease research, the Rat Genome Database (RGD) provides the community with a disease portal where genome objects and biological data related to cardiovascular diseases are systematically organized. The purpose of this study is to present biocuration at RGD, including disease, genetic, and pathway data. The RGD curation team uses controlled vocabularies/ontologies to organize data curated from the published literature or imported from disease and pathway databases. These organized annotations are associated with genes, strains, and quantitative trait loci (QTLs), thus linking functional annotations to genome objects. Screen shots from the web pages are used to demonstrate the organization of annotations at RGD. The human cardiovascular disease genes identified by annotations were grouped according to data sources and their annotation profiles were compared by in-house tools and other enrichment tools available to the public. The analysis results show that the imported cardiovascular disease genes from ClinVar and OMIM are functionally different from the RGD manually curated genes in terms of pathway and Gene Ontology annotations. The inclusion of disease genes from other databases enriches the collection of disease genes not only in quantity but also in quality. Copyright © 2016 the American Physiological Society.

  20. Complex network theory for the identification and assessment of candidate protein targets.

    PubMed

    McGarry, Ken; McDonald, Sharon

    2018-06-01

    In this work we use complex network theory to provide a statistical model of the connectivity patterns of human proteins and their interaction partners. Our intention is to identify important proteins that may be predisposed to be potential candidates as drug targets for therapeutic interventions. Target proteins usually have more interaction partners than non-target proteins, but there are no hard-and-fast rules for defining the actual number of interactions. We devise a statistical measure for identifying hub proteins, and we score our target proteins with gene ontology annotations. The important druggable protein targets are likely to have similar biological functions that can be assessed for their potential therapeutic value. Our system provides a statistical analysis of the local and distant neighborhood protein interactions of the potential targets using complex network measures. This approach builds a more accurate model of drug-to-target activity and therefore the likely impact on treating diseases. We integrate high quality protein interaction data from the HINT database and disease associated proteins from the DrugTarget database. Other sources include biological knowledge from Gene Ontology and drug information from DrugBank. The problem is a very challenging one since the data is highly imbalanced between target proteins and the more numerous nontargets. We use undersampling on the training data and build Random Forest classifier models which are used to identify previously unclassified target proteins. We validate and corroborate these findings from the available literature. Copyright © 2018 Elsevier Ltd. All rights reserved.
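    The abstract does not define its statistical measure for hub proteins. As a rough illustration of the general idea only, the sketch below swaps in a much simpler stand-in: it counts each protein's interaction partners and flags proteins whose degree lies well above the network mean. The function name, the z-score rule, the threshold and the toy protein identifiers are all assumptions, not the authors' method.

```python
from collections import defaultdict
from statistics import mean, pstdev


def degree_hubs(interactions, z_threshold=2.0):
    """Return {protein: z_score} for proteins whose degree z-score >= z_threshold.

    `interactions` is an iterable of (protein_a, protein_b) pairs treated as
    undirected edges; self-interactions and duplicate pairs are ignored.
    """
    neighbours = defaultdict(set)
    for a, b in interactions:
        if a == b:
            continue
        neighbours[a].add(b)
        neighbours[b].add(a)

    degrees = {protein: len(partners) for protein, partners in neighbours.items()}
    if not degrees:
        return {}

    mu = mean(degrees.values())
    sigma = pstdev(degrees.values())
    if sigma == 0:  # every protein has the same degree, so nothing stands out
        return {}

    scores = {protein: (deg - mu) / sigma for protein, deg in degrees.items()}
    return {protein: z for protein, z in scores.items() if z >= z_threshold}


# Toy network: one protein with ten partners, each partner with a single edge.
toy_edges = [("P0", f"P{i}") for i in range(1, 11)]
print(sorted(degree_hubs(toy_edges)))
```

    Degree distributions in real interaction networks are heavy-tailed, so a z-score cut-off is a crude rule; it serves here only to show how "more interaction partners than expected" can be turned into a number.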

  1. bioDBnet - Biological Database Network

    Cancer.gov

    bioDBnet is a comprehensive resource of most of the biological databases available from different sites like NCBI, Uniprot, EMBL, Ensembl, Affymetrix. It provides a queryable interface to all the databases available, converts identifiers from one database into another and generates comprehensive reports.

  2. ApiEST-DB: analyzing clustered EST data of the apicomplexan parasites.

    PubMed

    Li, Li; Crabtree, Jonathan; Fischer, Steve; Pinney, Deborah; Stoeckert, Christian J; Sibley, L David; Roos, David S

    2004-01-01

    ApiEST-DB (http://www.cbil.upenn.edu/paradbs-servlet/) provides integrated access to publicly available EST data from protozoan parasites in the phylum Apicomplexa. The database currently incorporates a total of nearly 100,000 ESTs from several parasite species of clinical and/or veterinary interest, including Eimeria tenella, Neospora caninum, Plasmodium falciparum, Sarcocystis neurona and Toxoplasma gondii. To facilitate analysis of these data, EST sequences were clustered and assembled to form consensus sequences for each organism, and these assemblies were then subjected to automated annotation via similarity searches against protein and domain databases. The underlying relational database infrastructure, Genomics Unified Schema (GUS), enables complex biologically based queries, facilitating validation of gene models, identification of alternative splicing, detection of single nucleotide polymorphisms, identification of stage-specific genes and recognition of phylogenetically conserved and phylogenetically restricted sequences.

  3. The BRENDA enzyme information system-From a database to an expert system.

    PubMed

    Schomburg, I; Jeske, L; Ulbrich, M; Placzek, S; Chang, A; Schomburg, D

    2017-11-10

    Enzymes, representing the largest and by far most complex group of proteins, play an essential role in all processes of life, including metabolism, gene expression, cell division, the immune system, and others. Their function, which is also connected to most diseases and to stress control, makes them interesting targets for research and applications in biotechnology, medical treatments, or diagnosis. Their functional parameters and other properties are collected, integrated, and made available to the scientific community in the BRaunschweig ENzyme DAtabase (BRENDA). In the last 30 years BRENDA has developed into one of the most highly used biological databases worldwide. The data contents, the process of data acquisition, data integration and control, the ways to access the data, and visualizations provided by the website are described and discussed. Copyright © 2017 The Authors. Published by Elsevier B.V. All rights reserved.

  4. The Importance of Biological Databases in Biological Discovery.

    PubMed

    Baxevanis, Andreas D; Bateman, Alex

    2015-06-19

    Biological databases play a central role in bioinformatics. They offer scientists the opportunity to access a wide variety of biologically relevant data, including the genomic sequences of an increasingly broad range of organisms. This unit provides a brief overview of major sequence databases and portals, such as GenBank, the UCSC Genome Browser, and Ensembl. Model organism databases, including WormBase, The Arabidopsis Information Resource (TAIR), and those made available through the Mouse Genome Informatics (MGI) resource, are also covered. Non-sequence-centric databases, such as Online Mendelian Inheritance in Man (OMIM), the Protein Data Bank (PDB), MetaCyc, and the Kyoto Encyclopedia of Genes and Genomes (KEGG), are also discussed. Copyright © 2015 John Wiley & Sons, Inc.

  5. SPARQLGraph: a web-based platform for graphically querying biological Semantic Web databases.

    PubMed

    Schweiger, Dominik; Trajanoski, Zlatko; Pabinger, Stephan

    2014-08-15

    Semantic Web has established itself as a framework for using and sharing data across applications and database boundaries. Here, we present a web-based platform for querying biological Semantic Web databases in a graphical way. SPARQLGraph offers an intuitive drag & drop query builder, which converts the visual graph into a query and executes it on a public endpoint. The tool integrates several publicly available Semantic Web databases, including the databases of the just recently released EBI RDF platform. Furthermore, it provides several predefined template queries for answering biological questions. Users can easily create and save new query graphs, which can also be shared with other researchers. This new graphical way of creating queries for biological Semantic Web databases considerably facilitates usability as it removes the requirement of knowing specific query languages and database structures. The system is freely available at http://sparqlgraph.i-med.ac.at.
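The idea of converting a visual query graph into a query string can be sketched as follows. This is not SPARQLGraph's implementation; it only assumes the query graph is available as a list of (subject, predicate, object) edges. The function `graph_to_sparql` and the `example.org` predicates are hypothetical placeholders.

```python
def graph_to_sparql(edges, select_vars, limit=10):
    """Turn (subject, predicate, object) edges into a SPARQL SELECT query string."""
    # Each edge of the query graph becomes one triple pattern in the WHERE clause.
    patterns = "\n".join(f"  {s} {p} {o} ." for s, p, o in edges)
    projection = " ".join(select_vars)
    return f"SELECT {projection}\nWHERE {{\n{patterns}\n}}\nLIMIT {limit}"

# Hypothetical two-edge query graph: protein -> gene -> disease.
edges = [
    ("?protein", "<http://example.org/encodedBy>", "?gene"),
    ("?gene", "<http://example.org/associatedWith>", "?disease"),
]

query = graph_to_sparql(edges, ["?protein", "?disease"])
print(query)
```

Shared variable names such as `?gene` are what join the edges, so a drag-and-drop builder mainly has to keep node labels consistent across the edges it emits.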

  6. Beyond mitochondria, what would be the energy source of the cell?

    PubMed

    Herrera, Arturo S; Del C A Esparza, Maria; Md Ashraf, Ghulam; Zamyatnin, Andrey A; Aliev, Gjumrakch

    2015-01-01

    Currently, cell biology is based on glucose as the main source of energy. Cellular bioenergetic pathways have become unnecessarily complex in their eagerness to explain how the cell is able to generate and use energy from the oxidation of glucose, where mitochondria play an important role through oxidative phosphorylation. During a descriptive study about the three leading causes of blindness in the world, the ability of melanin to transform light energy into chemical energy through the dissociation of the water molecule was unraveled. Initially, for 2 or 3 years, we tried to link our findings with the widely accepted metabolic pathways already described in metabolic pathway databases, which have been developed to collect and organize the current knowledge on metabolism scattered across a multitude of scientific articles. However, firstly, the literature on metabolism is extensive but conclusive evidence is rarely available, and secondly, one would expect these databases to contain largely the same information, but the contrary is true. For the apparently well studied Krebs cycle, which was described as early as 1937 and is found in nearly every biology and chemistry curriculum, there is considerable disagreement between at least five databases. Of the nearly 7000 reactions contained jointly by these five databases, only 199 are described in the same way in all five. Thus, integrating chemical energy from melanin with the supposedly well-known bioenergetic pathways is easier said than done, and the lack of consensus about the metabolic network constitutes an insurmountable barrier. After years of unsuccessful results, we finally realized that the chemical energy released through the dissociation of the water molecule by melanin represents over 90% of cell energy requirements. These findings reveal a new aspect of cell biology, as glucose and ATP have biological functions related mainly to biomass and not so much to energy. Our finding about the unexpected intrinsic property of melanin to transform photon energy into chemical energy through the dissociation of the water molecule, a role performed supposedly only by chlorophyll in plants, seriously questions the sacrosanct role of glucose and thereby mitochondria as the primary source of energy and power for the cells.

  7. IntNetDB v1.0: an integrated protein-protein interaction network database generated by a probabilistic model

    PubMed Central

    Xia, Kai; Dong, Dong; Han, Jing-Dong J

    2006-01-01

    Background Although protein-protein interaction (PPI) networks have been explored by various experimental methods, the maps so built are still limited in coverage and accuracy. To further expand the PPI network and to extract more accurate information from existing maps, studies have been carried out to integrate various types of functional relationship data. A frequently updated database of computationally analyzed potential PPIs to provide biological researchers with rapid and easy access to analyze original data as a biological network is still lacking. Results By applying a probabilistic model, we integrated 27 heterogeneous genomic, proteomic and functional annotation datasets to predict PPI networks in human. In addition to previously studied data types, we show that phenotypic distances and genetic interactions can also be integrated to predict PPIs. We further built an easy-to-use, updatable integrated PPI database, the Integrated Network Database (IntNetDB) online, to provide automatic prediction and visualization of PPI network among genes of interest. The networks can be visualized in SVG (Scalable Vector Graphics) format for zooming in or out. IntNetDB also provides a tool to extract topologically highly connected network neighborhoods from a specific network for further exploration and research. Using the MCODE (Molecular Complex Detections) algorithm, 190 such neighborhoods were detected among all the predicted interactions. The predicted PPIs can also be mapped to worm, fly and mouse interologs. Conclusion IntNetDB includes 180,010 predicted protein-protein interactions among 9,901 human proteins and represents a useful resource for the research community. Our study has increased prediction coverage by five-fold. IntNetDB also provides easy-to-use network visualization and analysis tools that allow biological researchers unfamiliar with computational biology to access and analyze data over the internet. 
The web interface of IntNetDB is freely accessible at . Visualization requires Mozilla version 1.8 (or higher) or Internet Explorer with installation of SVGviewer. PMID:17112386

  8. Application of Genetic/Genomic Approaches to Allergic Disorders

    PubMed Central

    Baye, Tesfaye M.; Martin, Lisa J.; Khurana Hershey, Gurjit K.

    2010-01-01

    Completion of the human genome project and rapid progress in genetics and bioinformatics have enabled the development of large public databases, which include genetic and genomic data linked to clinical health data. With the massive amount of information available, clinicians and researchers have the unique opportunity to complement and integrate their daily practice with the existing resources to clarify the underlying etiology of complex phenotypes such as allergic diseases. The genome itself is now often utilized as a starting point for many studies and multiple innovative approaches have emerged applying genetic/genomic strategies to key questions in the field of allergy and immunology. There have been several successes, which have uncovered new insights into the biologic underpinnings of allergic disorders. Herein, we will provide an in depth review of genomic approaches to identifying genes and biologic networks involved in allergic diseases. We will discuss genetic and phenotypic variation, statistical approaches for gene discovery, public databases, functional genomics, clinical implications, and the challenges that remain. PMID:20638111

  9. ProteoLens: a visual analytic tool for multi-scale database-driven biological network data mining.

    PubMed

    Huan, Tianxiao; Sivachenko, Andrey Y; Harrison, Scott H; Chen, Jake Y

    2008-08-12

    New systems biology studies require researchers to understand how interplay among myriads of biomolecular entities is orchestrated in order to achieve high-level cellular and physiological functions. Many software tools have been developed in the past decade to help researchers visually navigate large networks of biomolecular interactions with built-in template-based query capabilities. To further advance researchers' ability to interrogate global physiological states of cells through multi-scale visual network explorations, new visualization software tools still need to be developed to empower the analysis. A robust visual data analysis platform driven by database management systems to perform bi-directional data processing-to-visualizations with declarative querying capabilities is needed. We developed ProteoLens as a JAVA-based visual analytic software tool for creating, annotating and exploring multi-scale biological networks. It supports direct database connectivity to either Oracle or PostgreSQL database tables/views, on which SQL statements using both Data Definition Languages (DDL) and Data Manipulation languages (DML) may be specified. The robust query languages embedded directly within the visualization software help users to bring their network data into a visualization context for annotation and exploration. ProteoLens supports graph/network represented data in standard Graph Modeling Language (GML) formats, and this enables interoperation with a wide range of other visual layout tools. 
The architectural design of ProteoLens enables the de-coupling of complex network data visualization tasks into two distinct phases: 1) creating network data association rules, which are mapping rules between network node IDs or edge IDs and data attributes such as functional annotations, expression levels, scores, synonyms, descriptions etc; 2) applying network data association rules to build the network and perform the visual annotation of graph nodes and edges according to associated data values. We demonstrated the advantages of these new capabilities through three biological network visualization case studies: human disease association network, drug-target interaction network and protein-peptide mapping network. The architectural design of ProteoLens makes it suitable for bioinformatics expert data analysts who are experienced with relational database management to perform large-scale integrated network visual explorations. ProteoLens is a promising visual analytic platform that will facilitate knowledge discoveries in future network and systems biology studies.

  10. Biological Databases for Human Research

    PubMed Central

    Zou, Dong; Ma, Lina; Yu, Jun; Zhang, Zhang

    2015-01-01

    The completion of the Human Genome Project lays a foundation for systematically studying the human genome from evolutionary history to precision medicine against diseases. With the explosive growth of biological data, there is an increasing number of biological databases that have been developed in aid of human-related research. Here we present a collection of human-related biological databases and provide a mini-review by classifying them into different categories according to their data types. As human-related databases continue to grow not only in count but also in volume, challenges are ahead in big data storage, processing, exchange and curation. PMID:25712261

  11. Identifying relevant data for a biological database: handcrafted rules versus machine learning.

    PubMed

    Sehgal, Aditya Kumar; Das, Sanmay; Noto, Keith; Saier, Milton H; Elkan, Charles

    2011-01-01

    With well over 1,000 specialized biological databases in use today, the task of automatically identifying novel, relevant data for such databases is increasingly important. In this paper, we describe practical machine learning approaches for identifying MEDLINE documents and Swiss-Prot/TrEMBL protein records, for incorporation into a specialized biological database of transport proteins named TCDB. We show that both learning approaches outperform rules created by hand by a human expert. As one of the first case studies involving two different approaches to updating a deployed database, both the methods compared and the results will be of interest to curators of many specialized databases.

  12. AN OVERVIEW OF COMPUTATIONAL LIFE SCIENCE DATABASES & EXCHANGE FORMATS OF RELEVANCE TO CHEMICAL BIOLOGY RESEARCH

    PubMed Central

    Hall, Aaron Smalter; Shan, Yunfeng; Lushington, Gerald; Visvanathan, Mahesh

    2016-01-01

    Databases and exchange formats describing biological entities such as chemicals and proteins, along with their relationships, are a critical component of research in life sciences disciplines, including chemical biology wherein information about small molecule properties converges with cellular and molecular biology. Databases for storing biological entities are growing not only in size, but also in type, with many similarities between them and often subtle differences. The data formats available to describe and exchange these entities are numerous as well. In general, each format is optimized for a particular purpose or database, and hence some understanding of these formats is required when choosing one for research purposes. This paper reviews a selection of different databases and data formats with the goal of summarizing their purposes, features, and limitations. Databases are reviewed under the categories of 1) protein interactions, 2) metabolic pathways, 3) chemical interactions, and 4) drug discovery. Representation formats will be discussed according to those describing chemical structures, and those describing genomic/proteomic entities. PMID:22934944

  13. An overview of computational life science databases & exchange formats of relevance to chemical biology research.

    PubMed

    Smalter Hall, Aaron; Shan, Yunfeng; Lushington, Gerald; Visvanathan, Mahesh

    2013-03-01

    Databases and exchange formats describing biological entities such as chemicals and proteins, along with their relationships, are a critical component of research in life sciences disciplines, including chemical biology wherein information about small molecule properties converges with cellular and molecular biology. Databases for storing biological entities are growing not only in size, but also in type, with many similarities between them and often subtle differences. The data formats available to describe and exchange these entities are numerous as well. In general, each format is optimized for a particular purpose or database, and hence some understanding of these formats is required when choosing one for research purposes. This paper reviews a selection of different databases and data formats with the goal of summarizing their purposes, features, and limitations. Databases are reviewed under the categories of 1) protein interactions, 2) metabolic pathways, 3) chemical interactions, and 4) drug discovery. Representation formats will be discussed according to those describing chemical structures, and those describing genomic/proteomic entities.

  14. GeneSCF: a real-time based functional enrichment tool with support for multiple organisms.

    PubMed

    Subhash, Santhilal; Kanduri, Chandrasekhar

    2016-09-13

    High-throughput technologies such as ChIP-sequencing, RNA-sequencing, DNA sequencing and quantitative metabolomics generate a huge volume of data. Researchers often rely on functional enrichment tools to interpret the biological significance of the affected genes from these high-throughput studies. However, currently available functional enrichment tools need to be updated frequently to adapt to new entries from the functional database repositories. Hence there is a need for a simplified tool that can perform functional enrichment analysis by using updated information directly from source databases such as KEGG, Reactome or Gene Ontology. In this study, we focused on designing a command-line tool called GeneSCF (Gene Set Clustering based on Functional annotations) that can predict the functionally relevant biological information for a set of genes in a real-time updated manner. It is designed to handle information from more than 4000 organisms from freely available prominent functional databases like KEGG, Reactome and Gene Ontology. We successfully employed our tool on two published datasets to predict the biologically relevant functional information. The core features of this tool were tested on Linux machines without the need to install additional dependencies. GeneSCF is more reliable compared to other enrichment tools because of its ability to use reference functional databases in real-time to perform enrichment analysis. It is easy to integrate with other pipelines available for downstream analysis of high-throughput data. More importantly, GeneSCF can run multiple gene lists simultaneously on different organisms, thereby saving time for the users. Since the tool is designed to be ready-to-use, there is no need for any complex compilation and installation procedures.

  15. Alternatives to relational databases in precision medicine: Comparison of NoSQL approaches for big data storage using supercomputers

    NASA Astrophysics Data System (ADS)

    Velazquez, Enrique Israel

    Improvements in medical and genomic technologies have dramatically increased the production of electronic data over the last decade. As a result, data management is rapidly becoming a major determinant, and urgent challenge, for the development of Precision Medicine. Although successful data management is achievable using Relational Database Management Systems (RDBMS), exponential data growth is a significant contributor to failure scenarios. Growing amounts of data can also be observed in other sectors, such as economics and business, which, together with the previous facts, suggests that alternate database approaches (NoSQL) may soon be required for efficient storage and management of big databases. However, this hypothesis has been difficult to test in the Precision Medicine field since alternate database architectures are complex to assess and means to integrate heterogeneous electronic health records (EHR) with dynamic genomic data are not easily available. In this dissertation, we present a novel set of experiments for identifying NoSQL database approaches that enable effective data storage and management in Precision Medicine using patients' clinical and genomic information from The Cancer Genome Atlas (TCGA). The first experiment draws on performance and scalability from biologically meaningful queries with differing complexity and database sizes. The second experiment measures performance and scalability in database updates without schema changes. The third experiment assesses performance and scalability in database updates with schema modifications due to dynamic data. We have identified two NoSQL approaches, based on Cassandra and Redis, which seem to be the ideal database management systems for our precision medicine queries in terms of performance and scalability. We present NoSQL approaches and show how they can be used to manage clinical and genomic big data. Our research is relevant to public health since we are focusing on one of the main challenges to the development of Precision Medicine and, consequently, investigating a potential solution to the progressively increasing demands on health care.

  16. KnotProt: a database of proteins with knots and slipknots

    PubMed Central

    Jamroz, Michal; Niemyska, Wanda; Rawdon, Eric J.; Stasiak, Andrzej; Millett, Kenneth C.; Sułkowski, Piotr; Sulkowska, Joanna I.

    2015-01-01

    The protein topology database KnotProt, http://knotprot.cent.uw.edu.pl/, collects information about protein structures with open polypeptide chains forming knots or slipknots. The knotting complexity of the cataloged proteins is presented in the form of a matrix diagram that shows users the knot type of the entire polypeptide chain and of each of its subchains. The pattern visible in the matrix gives the knotting fingerprint of a given protein and permits users to determine, for example, the minimal length of the knotted regions (knot's core size) or the depth of a knot, i.e. how many amino acids can be removed from either end of the cataloged protein structure before converting it from a knot to a different type of knot. In addition, the database presents extensive information about the biological functions, families and fold types of proteins with non-trivial knotting. As an additional feature, the KnotProt database enables users to submit protein or polymer chains and generate their knotting fingerprints. PMID:25361973

  17. EcoCyc: a comprehensive database resource for Escherichia coli

    PubMed Central

    Keseler, Ingrid M.; Collado-Vides, Julio; Gama-Castro, Socorro; Ingraham, John; Paley, Suzanne; Paulsen, Ian T.; Peralta-Gil, Martín; Karp, Peter D.

    2005-01-01

    The EcoCyc database (http://EcoCyc.org/) is a comprehensive source of information on the biology of the prototypical model organism Escherichia coli K12. The mission for EcoCyc is to contain both computable descriptions of, and detailed comments describing, all genes, proteins, pathways and molecular interactions in E.coli. Through ongoing manual curation, extensive information such as summary comments, regulatory information, literature citations and evidence types has been extracted from 8862 publications and added to Version 8.5 of the EcoCyc database. The EcoCyc database can be accessed through a World Wide Web interface, while the downloadable Pathway Tools software and data files enable computational exploration of the data and provide enhanced querying capabilities that web interfaces cannot support. For example, EcoCyc contains carefully curated information that can be used as training sets for bioinformatics prediction of entities such as promoters, operons, genetic networks, transcription factor binding sites, metabolic pathways, functionally related genes, protein complexes and protein–ligand interactions. PMID:15608210

  18. The Biological Macromolecule Crystallization Database and NASA Protein Crystal Growth Archive

    PubMed Central

    Gilliland, Gary L.; Tung, Michael; Ladner, Jane

    1996-01-01

    The NIST/NASA/CARB Biological Macromolecule Crystallization Database (BMCD), NIST Standard Reference Database 21, contains crystal data and crystallization conditions for biological macromolecules. The database entries include data abstracted from published crystallographic reports. Each entry consists of information describing the biological macromolecule crystallized and crystal data and the crystallization conditions for each crystal form. The BMCD serves as the NASA Protein Crystal Growth Archive in that it contains protocols and results of crystallization experiments undertaken in microgravity (space). These database entries report the results, whether successful or not, from NASA-sponsored protein crystal growth experiments in microgravity and from microgravity crystallization studies sponsored by other international organizations. The BMCD was designed as a tool to assist x-ray crystallographers in the development of protocols to crystallize biological macromolecules, those that have previously been crystallized, and those that have not been crystallized. PMID:11542472

  19. Protein-protein interaction analysis of Alzheimer's disease and NAFLD based on systems biology methods unhide common ancestor pathways.

    PubMed

    Karbalaei, Reza; Allahyari, Marzieh; Rezaei-Tavirani, Mostafa; Asadzadeh-Aghdaei, Hamid; Zali, Mohammad Reza

    2018-01-01

    This study reconstructs and analyzes networks for two diseases, NAFLD and Alzheimer's disease, and examines their relationship using systems biology methods. NAFLD and Alzheimer's disease are two complex diseases with increasing prevalence and high costs for countries. There are some reports on a relationship and shared spreading pathways between these two diseases. In addition, they have some similar risk factors, especially lifestyle factors such as diet and exercise. Therefore, a systems biology approach can help to discover their relationship. The DisGeNET and STRING databases were the sources for disease genes and for constructing networks. Three plugins of the Cytoscape software, ClusterONE, ClueGO and CluePedia, were used to analyze and cluster networks and to perform pathway enrichment. An R package was used to determine the best centrality method. Finally, hub and bottleneck nodes were defined based on degree and betweenness. NAFLD and Alzheimer's disease shared 190 genes, which were used to construct a network with the STRING database. The resulting network contained 182 nodes and 2591 edges and comprised four clusters. Enrichment of these clusters separately led to carbohydrate metabolism, long chain fatty acid, and regulation of the JAK-STAT and IL-17 signaling pathways, respectively. Seven genes were selected as hub-bottlenecks: IL6, AKT1, TP53, TNF, JUN, VEGFA and PPARG. Enrichment of these proteins and their first neighbors in the network using the OMIM database led to diabetes and obesity as ancestors of NAFLD and AD. Systems biology methods, specifically PPI networks, can be useful for analyzing complicated related diseases. Finding hub and bottleneck proteins should be the goal of drug design and the introduction of disease markers.
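To make the hub and bottleneck definitions concrete, the sketch below computes degree and betweenness centrality on a small undirected graph and keeps nodes that rank in the top k on both. The abstract does not describe the authors' exact procedure or cutoffs, so the top-k rule, the toy edge list, and all function names here are assumptions. Betweenness is computed with Brandes' algorithm for unweighted graphs.

```python
from collections import deque

def build_adjacency(edges):
    """Build a symmetric adjacency map from an undirected edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def degree(adj):
    return {v: len(nbrs) for v, nbrs in adj.items()}

def betweenness(adj):
    """Brandes' algorithm for unweighted, undirected graphs (unnormalized)."""
    bc = {v: 0.0 for v in adj}
    for s in adj:
        stack = []
        preds = {v: [] for v in adj}
        sigma = {v: 0 for v in adj}   # number of shortest paths from s
        dist = {v: -1 for v in adj}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:                  # breadth-first search from s
            v = queue.popleft()
            stack.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}
        while stack:                  # accumulate dependencies in reverse BFS order
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # Each unordered pair was counted from both endpoints.
    return {v: b / 2 for v, b in bc.items()}

def top_k(scores, k):
    return set(sorted(scores, key=scores.get, reverse=True)[:k])

# Hypothetical toy network: two triangles joined by the edge C-D.
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"),
         ("D", "E"), ("D", "F"), ("E", "F")]
adj = build_adjacency(edges)
deg = degree(adj)
bc = betweenness(adj)

# Assumed rule: a hub-bottleneck ranks in the top k by degree and by betweenness.
hub_bottlenecks = top_k(deg, 2) & top_k(bc, 2)
print(sorted(hub_bottlenecks))
```

In this toy graph the two bridge nodes have both the highest degree and the highest betweenness, which is the pattern a hub-bottleneck filter is meant to pick out.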

  20. 3D Complex: A Structural Classification of Protein Complexes

    PubMed Central

    Levy, Emmanuel D; Pereira-Leal, Jose B; Chothia, Cyrus; Teichmann, Sarah A

    2006-01-01

    Most of the proteins in a cell assemble into complexes to carry out their function. It is therefore crucial to understand the physicochemical properties as well as the evolution of interactions between proteins. The Protein Data Bank represents an important source of information for such studies, because more than half of the structures are homo- or heteromeric protein complexes. Here we propose the first hierarchical classification of whole protein complexes of known 3-D structure, based on representing their fundamental structural features as a graph. This classification provides the first overview of all the complexes in the Protein Data Bank and allows nonredundant sets to be derived at different levels of detail. This reveals that between one-half and two-thirds of known structures are multimeric, depending on the level of redundancy accepted. We also analyse the structures in terms of the topological arrangement of their subunits and find that they form a small number of arrangements compared with all theoretically possible ones. This is because most complexes contain four subunits or less, and the large majority are homomeric. In addition, there is a strong tendency for symmetry in complexes, even for heteromeric complexes. Finally, through comparison of Biological Units in the Protein Data Bank with the Protein Quaternary Structure database, we identified many possible errors in quaternary structure assignments. Our classification, available as a database and Web server at http://www.3Dcomplex.org, will be a starting point for future work aimed at understanding the structure and evolution of protein complexes. PMID:17112313

  1. The Human Ageing Genomic Resources: online databases and tools for biogerontologists

    PubMed Central

    de Magalhães, João Pedro; Budovsky, Arie; Lehmann, Gilad; Costa, Joana; Li, Yang; Fraifeld, Vadim; Church, George M.

    2009-01-01

    Summary Ageing is a complex, challenging phenomenon that will require multiple, interdisciplinary approaches to unravel its puzzles. To assist basic research on ageing, we developed the Human Ageing Genomic Resources (HAGR). This work provides an overview of the databases and tools in HAGR and describes how the gerontology research community can employ them. Several recent changes and improvements to HAGR are also presented. The two centrepieces in HAGR are GenAge and AnAge. GenAge is a gene database featuring genes associated with ageing and longevity in model organisms, a curated database of genes potentially associated with human ageing, and a list of genes tested for their association with human longevity. A myriad of biological data and information is included for hundreds of genes, making GenAge a reference for research that reflects our current understanding of the genetic basis of ageing. GenAge can also serve as a platform for the systems biology of ageing, and tools for the visualization of protein-protein interactions are also included. AnAge is a database of ageing in animals, featuring over 4,000 species, primarily assembled as a resource for comparative and evolutionary studies of ageing. Longevity records, developmental and reproductive traits, taxonomic information, basic metabolic characteristics, and key observations related to ageing are included in AnAge. Software is also available to aid researchers in the form of Perl modules to automate numerous tasks and as an SPSS script to analyse demographic mortality data. The Human Ageing Genomic Resources are available online at http://genomics.senescence.info. PMID:18986374

  2. A spatial classification and database for management, research, and policy making: The Great Lakes aquatic habitat framework

    USGS Publications Warehouse

    Wang, Lizhu; Riseng, Catherine M.; Mason, Lacey; Werhrly, Kevin; Rutherford, Edward; McKenna, James E.; Castiglione, Chris; Johnson, Lucinda B.; Infante, Dana M.; Sowa, Scott P.; Robertson, Mike; Schaeffer, Jeff; Khoury, Mary; Gaiot, John; Hollenhurst, Tom; Brooks, Colin N.; Coscarelli, Mark

    2015-01-01

    Managing the world's largest and most complex freshwater ecosystem, the Laurentian Great Lakes, requires a spatially hierarchical basin-wide database of ecological and socioeconomic information that is comparable across the region. To meet such a need, we developed a spatial classification framework and database — Great Lakes Aquatic Habitat Framework (GLAHF). GLAHF consists of catchments, coastal terrestrial, coastal margin, nearshore, and offshore zones that encompass the entire Great Lakes Basin. The catchments captured in the database as river pour points or coastline segments are attributed with data known to influence physicochemical and biological characteristics of the lakes from the catchments. The coastal terrestrial zone consists of 30-m grid cells attributed with data from the terrestrial region that has direct connection with the lakes. The coastal margin and nearshore zones consist of 30-m grid cells attributed with data describing the coastline conditions, coastal human disturbances, and moderately to highly variable physicochemical and biological characteristics. The offshore zone consists of 1.8-km grid cells attributed with data that are spatially less variable compared with the other aquatic zones. These spatial classification zones and their associated data are nested within lake sub-basins and political boundaries and allow the synthesis of information from grid cells to classification zones, within and among political boundaries, lake sub-basins, Great Lakes, or within the entire Great Lakes Basin. This spatially structured database could help the development of basin-wide management plans, prioritize locations for funding and specific management actions, track protection and restoration progress, and conduct research for science-based decision making.

  3. Web server to identify similarity of amino acid motifs to compounds (SAAMCO).

    PubMed

    Casey, Fergal P; Davey, Norman E; Baran, Ivan; Varekova, Radka Svobodova; Shields, Denis C

    2008-07-01

    Protein-protein interactions are fundamental in mediating biological processes including metabolism, cell growth, and signaling. To be able to selectively inhibit or induce protein activity or complex formation is a key feature in controlling disease. For those situations in which protein-protein interactions derive substantial affinity from short linear peptide sequences, or motifs, we can develop search algorithms for peptidomimetic compounds that resemble the short peptide's structure but are not compromised by poor pharmacological properties. SAAMCO is a Web service (http://bioware.ucd.ie/~saamco) that facilitates the screening of motifs with known structures against bioactive compound databases. It is built on an algorithm that defines compound similarity based on the presence of appropriate amino acid side chain fragments and a favorable Root Mean Squared Deviation (RMSD) between compound and motif structure. The methodology is efficient as the available compound databases are preprocessed and fast regular expression searches filter potential matches before time-intensive 3D superposition is performed. The required input information is minimal, and the compound databases have been selected to maximize the availability of information on biological activity. "Hits" are accompanied with a visualization window and links to source database entries. Motif matching can be defined on partial or full similarity which will increase or reduce respectively the number of potential mimetic compounds. The Web server provides the functionality for rapid screening of known or putative interaction motifs against prepared compound libraries using a novel search algorithm. The tabulated results can be analyzed by linking to appropriate databases and by visualization.

  4. Equivalent Indels – Ambiguous Functional Classes and Redundancy in Databases

    PubMed Central

    Assmus, Jens; Kleffe, Jürgen; Schmitt, Armin O.; Brockmann, Gudrun A.

    2013-01-01

    There is considerable interest in studying sequenced variations. However, while the positions of substitutions are uniquely identifiable by sequence alignment, the location of insertions and deletions still poses problems. Each insertion and deletion causes a change of sequence. Yet, due to low complexity or repetitive sequence structures, the same indel can sometimes be annotated in different ways. Two indels which differ in allele sequence and position can be one and the same, i.e. the alternative sequence of the whole chromosome is identical in both cases and, therefore, the two deletions are biologically equivalent. In such a case, it is impossible to identify the exact position of an indel merely based on sequence alignment. Thus, variation entries in a mutation database are not necessarily uniquely defined. We prove the existence of a contiguous region around an indel in which all deletions of the same length are biologically identical. Databases often show only one of several possible locations for a given variation. Furthermore, different database entries can represent equivalent variation events. We identified 1,045,590 such problematic entries of insertions and deletions out of 5,860,408 indel entries in the current human database of Ensembl. Equivalent indels are found in sequence regions of different functions like exons, introns or 5' and 3' UTRs. One and the same variation can be assigned to several different functional classifications of which only one is correct. We implemented an algorithm that determines for each indel database entry its complete set of equivalent indels which is uniquely characterized by the indel itself and a given interval of the reference sequence. PMID:23658777
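
    The contiguous region of equivalent deletions described in the abstract above can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `equivalent_deletions`, the 0-based coordinates, and the toy reference string are assumptions made for illustration. It relies only on the observation that moving a deletion window by one base leaves the resulting sequence unchanged when the base entering the window equals the base leaving it.

```python
def equivalent_deletions(ref: str, start: int, length: int) -> list[int]:
    """Return every 0-based start position whose deletion of `length` bases
    yields the same sequence as deleting ref[start:start + length].

    Hypothetical helper for illustration; coordinates and naming are assumptions.
    """
    if length <= 0 or start < 0 or start + length > len(ref):
        raise ValueError("deletion must lie inside the reference")

    # Shift left while the base that enters the window on the left
    # equals the base that leaves it on the right.
    left = start
    while left > 0 and ref[left - 1] == ref[left + length - 1]:
        left -= 1

    # Shift right while the base that leaves the window on the left
    # equals the base that enters it on the right.
    right = start
    while right + length < len(ref) and ref[right] == ref[right + length]:
        right += 1

    return list(range(left, right + 1))


if __name__ == "__main__":
    # Deleting "CA" anywhere inside the CACACA repeat gives the same result.
    print(equivalent_deletions("GCACACAT", 1, 2))
```

    In the toy reference `GCACACAT`, a two-base deletion annotated at any position inside the repeat produces `GCACAT`, so a database could list the same event under several different coordinates.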

  5. Just working with the cellular machine: A high school game for teaching molecular biology.

    PubMed

    Cardoso, Fernanda Serpa; Dumpel, Renata; da Silva, Luisa B Gomes; Rodrigues, Carlos R; Santos, Dilvani O; Cabral, Lucio Mendes; Castro, Helena C

    2008-03-01

    Molecular biology is a difficult subject to comprehend because of its high complexity, thus requiring new teaching approaches. Herein, we developed an interdisciplinary board game involving the human immune system response against a bacterial infection for teaching molecular biology at high school. Initially, we created a database with several questions and a game story that invites the students to help the human immunological system produce antibodies (IgG) and fight off a second invasion by a pathogenic bacterium. The game involves answering questions and completing the game board, in which the antibodies "are synthesized" through the molecular biology process. At the end, a problem-based learning approach is used, and a last question is raised about proteins. Biology teachers and high school students evaluated the game and considered it an easy and interesting tool for teaching the theme. An increase of about 5-30% in answering molecular biology questions revealed that the game improves learning and induced a more engaged and proactive learning profile in the high school students. Copyright © 2008 International Union of Biochemistry and Molecular Biology, Inc.

  6. A large scale Plasmodium vivax- Saimiri boliviensis trophozoite-schizont transition proteome

    PubMed Central

    Lapp, Stacey A.; Barnwell, John W.; Galinski, Mary R.

    2017-01-01

    Plasmodium vivax is a complex protozoan parasite with over 6,500 genes and stage-specific differential expression. Much of the unique biology of this pathogen remains unknown, including how it modifies and restructures the host reticulocyte. Using a recently published P. vivax reference genome, we report the proteome from two biological replicates of infected Saimiri boliviensis host reticulocytes undergoing transition from the late trophozoite to early schizont stages. Using five database search engines, we identified a total of 2000 P. vivax and 3487 S. boliviensis proteins, making this the most comprehensive P. vivax proteome to date. PlasmoDB GO-term enrichment analysis of proteins identified at least twice by a search engine highlighted core metabolic processes and molecular functions such as glycolysis, translation and protein folding, cell components such as ribosomes, proteasomes and the Golgi apparatus, and a number of vesicle and trafficking related clusters. Database for Annotation, Visualization and Integrated Discovery (DAVID) v6.8 enriched functional annotation clusters of S. boliviensis proteins highlighted vesicle and trafficking-related clusters, elements of the cytoskeleton, oxidative processes and response to oxidative stress, macromolecular complexes such as the proteasome and ribosome, metabolism, translation, and cell death. Host and parasite proteins potentially involved in cell adhesion were also identified. Over 25% of the P. vivax proteins have no functional annotation; this group includes 45 VIR members of the large PIR family. A number of host and pathogen proteins contained highly oxidized or nitrated residues, extending prior trophozoite-enriched stage observations from S. boliviensis infections, and supporting the possibility of oxidative stress in relation to the disease. This proteome significantly expands the size and complexity of the known P. vivax and Saimiri host iRBC proteomes, and provides in-depth data that will be valuable for ongoing research on this parasite’s biology and pathogenesis. PMID:28829774

  7. Network portal: a database for storage, analysis and visualization of biological networks

    PubMed Central

    Turkarslan, Serdar; Wurtmann, Elisabeth J.; Wu, Wei-Ju; Jiang, Ning; Bare, J. Christopher; Foley, Karen; Reiss, David J.; Novichkov, Pavel; Baliga, Nitin S.

    2014-01-01

    The ease of generating high-throughput data has enabled investigations into organismal complexity at the systems level through the inference of networks of interactions among the various cellular components (genes, RNAs, proteins and metabolites). The wider scientific community, however, currently has limited access to tools for network inference, visualization and analysis because these tasks often require advanced computational knowledge and expensive computing resources. We have designed the network portal (http://networks.systemsbiology.net) to serve as a modular database for the integration of user uploaded and public data, with inference algorithms and tools for the storage, visualization and analysis of biological networks. The portal is fully integrated into the Gaggle framework to seamlessly exchange data with desktop and web applications and to allow the user to create, save and modify workspaces, and it includes social networking capabilities for collaborative projects. While the current release of the database contains networks for 13 prokaryotic organisms from diverse phylogenetic clades (4678 co-regulated gene modules, 3466 regulators and 9291 cis-regulatory motifs), it will be rapidly populated with prokaryotic and eukaryotic organisms as relevant data become available in public repositories and through user input. The modular architecture, simple data formats and open API support community development of the portal. PMID:24271392

  8. Emerging new strategies for successful metabolite identification in metabolomics

    PubMed Central

    Bingol, Kerem; Bruschweiler-Li, Lei; Li, Dawei; Zhang, Bo; Xie, Mouzhe; Brüschweiler, Rafael

    2016-01-01

    This review discusses strategies for the identification of metabolites in complex biological mixtures, as encountered in metabolomics, which have emerged in the recent past. These include NMR database-assisted approaches for the identification of commonly known metabolites as well as novel combinations of NMR and MS analysis methods for the identification of unknown metabolites. The use of certain chemical additives to the NMR tube can permit identification of metabolites with specific physical chemical properties. PMID:26915807

  9. Relational Databases: A Transparent Framework for Encouraging Biology Students to Think Informatically

    ERIC Educational Resources Information Center

    Rice, Michael; Gladstone, William; Weir, Michael

    2004-01-01

    We discuss how relational databases constitute an ideal framework for representing and analyzing large-scale genomic data sets in biology. As a case study, we describe a Drosophila splice-site database that we recently developed at Wesleyan University for use in research and teaching. The database stores data about splice sites computed by a…

  10. Knowledge representation in metabolic pathway databases.

    PubMed

    Stobbe, Miranda D; Jansen, Gerbert A; Moerland, Perry D; van Kampen, Antoine H C

    2014-05-01

    The accurate representation of all aspects of a metabolic network in a structured format, such that it can be used for a wide variety of computational analyses, is a challenge faced by a growing number of researchers. Analysis of five major metabolic pathway databases reveals that each database has made widely different choices to address this challenge, including how to deal with knowledge that is uncertain or missing. In concise overviews, we show how concepts such as compartments, enzymatic complexes and the direction of reactions are represented in each database. Importantly, also concepts which a database does not represent are described. Which aspects of the metabolic network need to be available in a structured format and to what detail differs per application. For example, for in silico phenotype prediction, a detailed representation of gene-protein-reaction relations and the compartmentalization of the network is essential. Our analysis also shows that current databases are still limited in capturing all details of the biology of the metabolic network, further illustrated with a detailed analysis of three metabolic processes. Finally, we conclude that the conceptual differences between the databases, which make knowledge exchange and integration a challenge, have not been resolved, so far, by the exchange formats in which knowledge representation is standardized.

  11. Towards BioDBcore: a community-defined information specification for biological databases

    PubMed Central

    Gaudet, Pascale; Bairoch, Amos; Field, Dawn; Sansone, Susanna-Assunta; Taylor, Chris; Attwood, Teresa K.; Bateman, Alex; Blake, Judith A.; Bult, Carol J.; Cherry, J. Michael; Chisholm, Rex L.; Cochrane, Guy; Cook, Charles E.; Eppig, Janan T.; Galperin, Michael Y.; Gentleman, Robert; Goble, Carole A.; Gojobori, Takashi; Hancock, John M.; Howe, Douglas G.; Imanishi, Tadashi; Kelso, Janet; Landsman, David; Lewis, Suzanna E.; Mizrachi, Ilene Karsch; Orchard, Sandra; Ouellette, B. F. Francis; Ranganathan, Shoba; Richardson, Lorna; Rocca-Serra, Philippe; Schofield, Paul N.; Smedley, Damian; Southan, Christopher; Tan, Tin Wee; Tatusova, Tatiana; Whetzel, Patricia L.; White, Owen; Yamasaki, Chisato

    2011-01-01

    The present article proposes the adoption of a community-defined, uniform, generic description of the core attributes of biological databases, BioDBCore. The goals of these attributes are to provide a general overview of the database landscape, to encourage consistency and interoperability between resources and to promote the use of semantic and syntactic standards. BioDBCore will make it easier for users to evaluate the scope and relevance of available resources. This new resource will increase the collective impact of the information present in biological databases. PMID:21097465

  12. Towards BioDBcore: a community-defined information specification for biological databases

    PubMed Central

    Gaudet, Pascale; Bairoch, Amos; Field, Dawn; Sansone, Susanna-Assunta; Taylor, Chris; Attwood, Teresa K.; Bateman, Alex; Blake, Judith A.; Bult, Carol J.; Cherry, J. Michael; Chisholm, Rex L.; Cochrane, Guy; Cook, Charles E.; Eppig, Janan T.; Galperin, Michael Y.; Gentleman, Robert; Goble, Carole A.; Gojobori, Takashi; Hancock, John M.; Howe, Douglas G.; Imanishi, Tadashi; Kelso, Janet; Landsman, David; Lewis, Suzanna E.; Karsch Mizrachi, Ilene; Orchard, Sandra; Ouellette, B.F. Francis; Ranganathan, Shoba; Richardson, Lorna; Rocca-Serra, Philippe; Schofield, Paul N.; Smedley, Damian; Southan, Christopher; Tan, Tin W.; Tatusova, Tatiana; Whetzel, Patricia L.; White, Owen; Yamasaki, Chisato

    2011-01-01

    The present article proposes the adoption of a community-defined, uniform, generic description of the core attributes of biological databases, BioDBCore. The goals of these attributes are to provide a general overview of the database landscape, to encourage consistency and interoperability between resources, and to promote the use of semantic and syntactic standards. BioDBCore will make it easier for users to evaluate the scope and relevance of available resources. This new resource will increase the collective impact of the information present in biological databases. PMID:21205783

  13. Biological network extraction from scientific literature: state of the art and challenges.

    PubMed

    Li, Chen; Liakata, Maria; Rebholz-Schuhmann, Dietrich

    2014-09-01

    Networks of molecular interactions explain complex biological processes, and all known information on molecular events is contained in a number of public repositories including the scientific literature. Metabolic and signalling pathways are often viewed separately, even though both types are composed of interactions involving proteins and other chemical entities. It is necessary to be able to combine data from all available resources to judge the functionality, complexity and completeness of any given network overall, but especially the full integration of relevant information from the scientific literature is still an ongoing and complex task. Currently, the text-mining research community is steadily moving towards processing the full body of the scientific literature by making use of rich linguistic features such as full text parsing, to extract biological interactions. The next step will be to combine these with information from scientific databases to support hypothesis generation for the discovery of new knowledge and the extension of biological networks. The generation of comprehensive networks requires technologies such as entity grounding, coordination resolution and co-reference resolution, which are not fully solved and are required to further improve the quality of results. Here, we analyse the state of the art for the extraction of network information from the scientific literature and the evaluation of extraction methods against reference corpora, discuss challenges involved and identify directions for future research. © The Author 2013. Published by Oxford University Press. For Permissions, please email: journals.permissions@oup.com.

  14. Object-oriented parsing of biological databases with Python.

    PubMed

    Ramu, C; Gemünd, C; Gibson, T J

    2000-07-01

    While database activities in the biological area are increasing rapidly, rather little is done in the area of parsing them in a simple and object-oriented way. We present here an elegant, simple yet powerful way of parsing biological flat-file databases. We have taken EMBL, SWISSPROT and GENBANK as examples. EMBL and SWISS-PROT do not differ much in the format structure. GENBANK has a very different format structure than EMBL and SWISS-PROT. Extracting the desired fields in an entry (for example a sub-sequence with an associated feature) for later analysis is a constant need in the biological sequence-analysis community: this is illustrated with tools to make new splice-site databases. The interface to the parser is abstract in the sense that the access to all the databases is independent from their different formats, since parsing instructions are hidden.
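
    The idea of a format-independent parser interface with hidden, format-specific parsing rules can be sketched as follows. This is a simplified illustration, not the authors' library: the class names `FlatFileParser`, `EmblParser`, `GenBankParser` and the `Entry` record are hypothetical, only three fields are extracted, and the handling of EMBL two-letter line codes and GenBank keywords is reduced to the minimum needed for the example.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass
class Entry:
    """Format-independent view of one database entry (hypothetical)."""
    accession: str
    description: str
    sequence: str


class FlatFileParser:
    """Shared parsing loop; subclasses only declare format-specific details."""
    key_width = 0          # width of the keyword column at the line start
    accession_key = ""
    description_key = ""
    sequence_key = ""      # keyword that opens the sequence block

    def parse(self, lines: Iterable[str]) -> Iterator[Entry]:
        acc, desc, seq, current = "", [], [], ""
        for raw in lines:
            line = raw.rstrip("\n")
            if line.startswith("//"):            # end-of-entry marker
                yield Entry(acc, " ".join(desc), "".join(seq).upper())
                acc, desc, seq, current = "", [], [], ""
            elif current == self.sequence_key:   # inside the sequence block
                seq.append("".join(c for c in line if c.isalpha()))
            else:
                key = line[:self.key_width].strip()
                value = line[self.key_width:].strip()
                current = key or current         # blank key = continuation line
                if current == self.accession_key and value and not acc:
                    acc = value.split(";")[0].split()[0]
                elif current == self.description_key and value:
                    desc.append(value)


class EmblParser(FlatFileParser):
    key_width, accession_key, description_key, sequence_key = 5, "AC", "DE", "SQ"


class GenBankParser(FlatFileParser):
    key_width, accession_key, description_key, sequence_key = (
        12, "ACCESSION", "DEFINITION", "ORIGIN")
```

    Calling code iterates over `EmblParser().parse(handle)` or `GenBankParser().parse(handle)` and receives the same `Entry` objects in both cases, which is the sense in which access is independent of the underlying format.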

  15. Evolving Strategies for the Incorporation of Bioinformatics within the Undergraduate Cell Biology Curriculum

    ERIC Educational Resources Information Center

    Honts, Jerry E.

    2003-01-01

    Recent advances in genomics and structural biology have resulted in an unprecedented increase in biological data available from Internet-accessible databases. In order to help students effectively use this vast repository of information, undergraduate biology students at Drake University were introduced to bioinformatics software and databases in…

  16. Integrating ecological risk assessments across levels of organization using the Franklin-Noss model of biodiversity

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Brugger, K.E.; Tiebout, H.M. III

    1994-12-31

    Wildlife toxicologists pioneered methodologies for assessing ecological risk to nontarget species. Historically, ecological risk assessments (ERAs) focused on a limited array of species and were based on relatively few population-level endpoints (mortality, reproduction). Currently, risk assessment models are becoming increasingly complex, factoring in multi-species interactions (across trophic levels) and utilizing an increasingly diverse number of ecologically significant endpoints. This trend suggests the increasing importance of safeguarding not only populations of individual species, but also the overall integrity of the larger biotic systems that support them. In this sense, ERAs are in alignment with Conservation Biology, an applied science of ecological knowledge used to conserve biodiversity. A theoretical conservation biology model could be incorporated in ERAs to quantify impacts to biodiversity (structure, function or composition across levels of biological organization). The authors suggest that the Franklin-Noss model for evaluating biodiversity, with its nested, hierarchical approach, may provide a suitable paradigm for assessing and integrating the ecological risk that chemical contaminants pose to biological systems from the simplest levels (genotypes, individual organisms) to the most complex levels of organization (communities and ecosystems). The Franklin-Noss model can accommodate the existing ecotoxicological database and, perhaps more importantly, indicate new areas in which critical endpoints should be identified and investigated.

  17. Financing a future for public biological data.

    PubMed

    Ellis, L B; Kalumbi, D

    1999-09-01

    The public web-based biological database infrastructure is a source of both wonder and worry. Users delight in the ever increasing amounts of information available; database administrators and curators worry about long-term financial support. An earlier study of 153 biological databases (Ellis and Kalumbi, Nature Biotechnol., 16, 1323-1324, 1998) determined that near future (1-5 year) funding for over two-thirds of them was uncertain. More detailed data are required to determine the magnitude of the problem and offer possible solutions. This study examines the finances and use statistics of a few of these organizations in more depth, and reviews several economic models that may help sustain them. Six organizations were studied. Their administrative overhead is fairly low; non-administrative personnel and computer-related costs account for 77% of expenses. One smaller, more specialized US database, in 1997, had 60% of total access from US domains; a majority (56%) of its US accesses came from commercial domains, although only 2% of the 153 databases originally studied received any industrial support. The most popular model used to gain industrial support is asymmetric pricing: preferentially charging the commercial users of a database. At least five biological databases have recently begun using this model. Advertising is another model which may be useful for the more general, more heavily used sites. Microcommerce has promise, especially for databases that do not attract advertisers, but needs further testing. The least income reported for any of the databases studied was $50,000/year; applying this rate to 400 biological databases (a lower limit of the number of such databases, many of which require far larger resources) would mean annual support need of at least $20 million. To obtain this level of support is challenging, yet failure to accept the challenge could be catastrophic. lynda@tc.umn.edu

  18. RaMP: A Comprehensive Relational Database of Metabolomics Pathways for Pathway Enrichment Analysis of Genes and Metabolites

    PubMed Central

    Zhang, Bofei; Hu, Senyang; Baskin, Elizabeth; Patt, Andrew; Siddiqui, Jalal K.

    2018-01-01

    The value of metabolomics in translational research is undeniable, and metabolomics data are increasingly generated in large cohorts. The functional interpretation of disease-associated metabolites though is difficult, and the biological mechanisms that underlie cell type or disease-specific metabolomics profiles are oftentimes unknown. To help fully exploit metabolomics data and to aid in its interpretation, analysis of metabolomics data with other complementary omics data, including transcriptomics, is helpful. To facilitate such analyses at a pathway level, we have developed RaMP (Relational database of Metabolomics Pathways), which combines biological pathways from the Kyoto Encyclopedia of Genes and Genomes (KEGG), Reactome, WikiPathways, and the Human Metabolome DataBase (HMDB). To the best of our knowledge, an off-the-shelf, public database that maps genes and metabolites to biochemical/disease pathways and can readily be integrated into other existing software is currently lacking. For consistent and comprehensive analysis, RaMP enables batch and complex queries (e.g., list all metabolites involved in glycolysis and lung cancer), can readily be integrated into pathway analysis tools, and supports pathway overrepresentation analysis given a list of genes and/or metabolites of interest. For usability, we have developed a RaMP R package (https://github.com/Mathelab/RaMP-DB), including a user-friendly RShiny web application, that supports basic simple and batch queries, pathway overrepresentation analysis given a list of genes or metabolites of interest, and network visualization of gene-metabolite relationships. The package also includes the raw database file (mysql dump), thereby providing a stand-alone downloadable framework for public use and integration with other tools. In addition, the Python code needed to recreate the database on another system is also publicly available (https://github.com/Mathelab/RaMP-BackEnd). Updates for databases in RaMP will be checked multiple times a year and RaMP will be updated accordingly. PMID:29470400
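
    Pathway overrepresentation analysis, as mentioned in the abstract above, is commonly carried out with a one-sided hypergeometric test. The abstract does not state which test RaMP applies, so the sketch below is an assumption about the general technique rather than a description of the package; the function name `overrepresentation_p` and its parameter names are hypothetical.

```python
from math import comb


def overrepresentation_p(background: int, in_pathway: int,
                         query: int, overlap: int) -> float:
    """One-sided hypergeometric p-value, P(X >= overlap).

    background: number of analytes (genes or metabolites) in the universe
    in_pathway: how many of those are annotated to the pathway
    query:      size of the user's list of interest
    overlap:    how many query analytes fall in the pathway
    (Illustrative helper; not part of any RaMP API.)
    """
    if not (0 <= in_pathway <= background and 0 <= query <= background):
        raise ValueError("pathway and query sizes must fit in the background")
    if not 0 <= overlap <= min(in_pathway, query):
        raise ValueError("overlap cannot exceed pathway or query size")

    total = comb(background, query)
    # Sum the probabilities of observing `overlap` or more pathway members.
    tail = sum(
        comb(in_pathway, i) * comb(background - in_pathway, query - i)
        for i in range(overlap, min(in_pathway, query) + 1)
    )
    return tail / total


if __name__ == "__main__":
    # 10 analytes in total, 4 in the pathway, 3 queried, 2 of them in the pathway.
    print(overrepresentation_p(10, 4, 3, 2))
```

    In practice the p-values for all tested pathways would also be corrected for multiple testing, which is omitted here.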

  19. RaMP: A Comprehensive Relational Database of Metabolomics Pathways for Pathway Enrichment Analysis of Genes and Metabolites.

    PubMed

    Zhang, Bofei; Hu, Senyang; Baskin, Elizabeth; Patt, Andrew; Siddiqui, Jalal K; Mathé, Ewy A

    2018-02-22

    The value of metabolomics in translational research is undeniable, and metabolomics data are increasingly generated in large cohorts. The functional interpretation of disease-associated metabolites though is difficult, and the biological mechanisms that underlie cell type or disease-specific metabolomics profiles are oftentimes unknown. To help fully exploit metabolomics data and to aid in its interpretation, analysis of metabolomics data with other complementary omics data, including transcriptomics, is helpful. To facilitate such analyses at a pathway level, we have developed RaMP (Relational database of Metabolomics Pathways), which combines biological pathways from the Kyoto Encyclopedia of Genes and Genomes (KEGG), Reactome, WikiPathways, and the Human Metabolome DataBase (HMDB). To the best of our knowledge, an off-the-shelf, public database that maps genes and metabolites to biochemical/disease pathways and can readily be integrated into other existing software is currently lacking. For consistent and comprehensive analysis, RaMP enables batch and complex queries (e.g., list all metabolites involved in glycolysis and lung cancer), can readily be integrated into pathway analysis tools, and supports pathway overrepresentation analysis given a list of genes and/or metabolites of interest. For usability, we have developed a RaMP R package (https://github.com/Mathelab/RaMP-DB), including a user-friendly RShiny web application, that supports basic simple and batch queries, pathway overrepresentation analysis given a list of genes or metabolites of interest, and network visualization of gene-metabolite relationships. The package also includes the raw database file (mysql dump), thereby providing a stand-alone downloadable framework for public use and integration with other tools. In addition, the Python code needed to recreate the database on another system is also publicly available (https://github.com/Mathelab/RaMP-BackEnd). Updates for databases in RaMP will be checked multiple times a year and RaMP will be updated accordingly.

  20. The Pathway Coexpression Network: Revealing pathway relationships

    PubMed Central

    Tanzi, Rudolph E.

    2018-01-01

    A goal of genomics is to understand the relationships between biological processes. Pathways contribute to functional interplay within biological processes through complex but poorly understood interactions. However, limited functional references for global pathway relationships exist. Pathways from databases such as KEGG and Reactome provide discrete annotations of biological processes. Their relationships are currently either inferred from gene set enrichment within specific experiments, or by simple overlap, linking pathway annotations that have genes in common. Here, we provide a unifying interpretation of functional interaction between pathways by systematically quantifying coexpression between 1,330 canonical pathways from the Molecular Signatures Database (MSigDB) to establish the Pathway Coexpression Network (PCxN). We estimated the correlation between canonical pathways valid in a broad context using a curated collection of 3,207 microarrays from 72 normal human tissues. PCxN accounts for shared genes between annotations to estimate significant correlations between pathways with related functions rather than with similar annotations. We demonstrate that PCxN provides novel insight into mechanisms of complex diseases using an Alzheimer’s Disease (AD) case study. PCxN retrieved pathways significantly correlated with an expert curated AD gene list. These pathways have known associations with AD and were significantly enriched for genes independently associated with AD. As a further step, we show how PCxN complements the results of gene set enrichment methods by revealing relationships between enriched pathways, and by identifying additional highly correlated pathways. PCxN revealed that correlated pathways from an AD expression profiling study include functional clusters involved in cell adhesion and oxidative stress. PCxN provides expanded connections to pathways from the extracellular matrix. PCxN provides a powerful new framework for interrogation of global pathway relationships. Comprehensive exploration of PCxN can be performed at http://pcxn.org/. PMID:29554099

  1. HBLAST: Parallelised sequence similarity--A Hadoop MapReducable basic local alignment search tool.

    PubMed

    O'Driscoll, Aisling; Belogrudov, Vladislav; Carroll, John; Kropp, Kai; Walsh, Paul; Ghazal, Peter; Sleator, Roy D

    2015-04-01

    The recent exponential growth of genomic databases has resulted in the common task of sequence alignment becoming one of the major bottlenecks in the field of computational biology. It is typical for these large datasets and complex computations to require cost prohibitive High Performance Computing (HPC) to function. As such, parallelised solutions have been proposed but many exhibit scalability limitations and are incapable of effectively processing "Big Data" - the name attributed to datasets that are extremely large, complex and require rapid processing. The Hadoop framework, comprised of distributed storage and a parallelised programming framework known as MapReduce, is specifically designed to work with such datasets but it is not trivial to efficiently redesign and implement bioinformatics algorithms according to this paradigm. The parallelisation strategy of "divide and conquer" for alignment algorithms can be applied to both data sets and input query sequences. However, scalability is still an issue due to memory constraints or large databases, with very large database segmentation leading to additional performance decline. Herein, we present Hadoop Blast (HBlast), a parallelised BLAST algorithm that proposes a flexible method to partition both databases and input query sequences using "virtual partitioning". HBlast presents improved scalability over existing solutions and well balanced computational work load while keeping database segmentation and recompilation to a minimum. Enhanced BLAST search performance on cheap memory constrained hardware has significant implications for in field clinical diagnostic testing; enabling faster and more accurate identification of pathogenic DNA in human blood or tissue samples. Copyright © 2015 Elsevier Inc. All rights reserved.
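
    The "divide and conquer" strategy over both the database and the query set can be sketched in plain Python. This is a toy model only: it uses neither Hadoop nor BLAST, it does not reproduce HBlast's "virtual partitioning", and the shared k-mer count in `toy_score` is a stand-in for a real alignment score. All function names are hypothetical. Each (query chunk, database partition) pair is an independent map task, and the reduce step keeps the best hit per query.

```python
from itertools import product


def chunks(items, size):
    """Split a list into consecutive pieces of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def toy_score(query, subject, k=3):
    """Count distinct shared k-mers; a placeholder for an alignment score."""
    def kmers(s):
        return {s[i:i + k] for i in range(len(s) - k + 1)}
    return len(kmers(query) & kmers(subject))


def map_task(query_chunk, db_partition):
    """Score every query in the chunk against every subject in the partition."""
    for (qid, q), (sid, s) in product(query_chunk, db_partition):
        score = toy_score(q, s)
        if score:
            yield qid, (score, sid)


def search(queries, database, query_chunk_size, db_partition_size):
    """Run all map tasks and reduce to the best (score, subject) per query."""
    best = {}
    tasks = product(chunks(queries, query_chunk_size),
                    chunks(database, db_partition_size))
    for query_chunk, db_partition in tasks:   # each task could run in parallel
        for qid, hit in map_task(query_chunk, db_partition):
            if qid not in best or hit > best[qid]:
                best[qid] = hit
    return best
```

    Because the reduce step only takes a maximum, the result does not depend on how the queries and the database are partitioned, which is what makes the tasks safe to distribute.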

  2. PURY: a database of geometric restraints of hetero compounds for refinement in complexes with macromolecular structures.

    PubMed

    Andrejasic, Miha; Pražnikar, Jure; Turk, Dusan

    2008-11-01

    The number and variety of macromolecular structures in complex with 'hetero' ligands is growing. The need for rapid delivery of correct geometric parameters for their refinement, which is often crucial for understanding the biological relevance of the structure, is growing correspondingly. The current standard for describing protein structures is the Engh-Huber parameter set. It is an expert data set resulting from selection and analysis of the crystal structures gathered in the Cambridge Structural Database (CSD). Clearly, such a manual approach cannot be applied to the vast and ever-growing number of chemical compounds. Therefore, a database, named PURY, of geometric parameters of chemical compounds has been developed, together with a server that accesses it. PURY is a compilation of the whole CSD. It contains lists of atom classes and bonds connecting them, as well as angle, chirality, planarity and conformation parameters. The current compilation is based on CSD 5.28 and contains 1978 atom classes and 32,702 bonding, 237,068 angle, 201,860 dihedral and 64,193 improper geometric restraints. Analysis has confirmed that the restraints from the PURY database are suitable for use in macromolecular crystal structure refinement and should be of value to the crystallographic community. The database can be accessed through the web server http://pury.ijs.si/, which creates topology and parameter files from deposited coordinates in suitable forms for the refinement programs MAIN, CNS and REFMAC. In the near future, the server will move to the CSD website http://pury.ccdc.cam.ac.uk/.

  3. The Movable Type Method Applied to Protein-Ligand Binding.

    PubMed

    Zheng, Zheng; Ucisik, Melek N; Merz, Kenneth M

    2013-12-10

    Accurately computing the free energy for biological processes like protein folding or protein-ligand association remains a challenging problem. Both describing the complex intermolecular forces involved and sampling the requisite configuration space make understanding these processes innately difficult. Herein, we address the sampling problem using a novel methodology we term "movable type". Conceptually it can be understood by analogy with the evolution of printing and, hence, the name movable type. For example, a common approach to the study of protein-ligand complexation involves taking a database of intact drug-like molecules and exhaustively docking them into a binding pocket. This is reminiscent of early woodblock printing where each page had to be laboriously created prior to printing a book. However, printing evolved to an approach where a database of symbols (letters, numerals, etc.) was created and then assembled using a movable type system, which allowed for the creation of all possible combinations of symbols on a given page, thereby revolutionizing the dissemination of knowledge. Our movable type (MT) method involves the identification of all atom pairs seen in protein-ligand complexes and then creating two databases: one with their associated pairwise distance-dependent energies and another associated with the probability of how these pairs can combine in terms of bonds, angles, dihedrals and non-bonded interactions. Combining these two databases coupled with the principles of statistical mechanics allows us to accurately estimate binding free energies as well as the pose of a ligand in a receptor. This method, by its mathematical construction, samples all of configuration space of a selected region (the protein active site here) in one shot without resorting to brute force sampling schemes involving Monte Carlo, genetic algorithms or molecular dynamics simulations, making the methodology extremely efficient. Importantly, this method explores the free energy surface, eliminating the need to estimate the enthalpy and entropy components individually. Finally, low free energy structures can be obtained via a free energy minimization procedure yielding all low free energy poses on a given free energy surface. Besides revolutionizing the protein-ligand docking and scoring problem, this approach can be utilized in a wide range of applications in computational biology which involve the computation of free energies for systems with extensive phase spaces, including protein folding, protein-protein docking and protein design.

  4. CellBase, a comprehensive collection of RESTful web services for retrieving relevant biological information from heterogeneous sources.

    PubMed

    Bleda, Marta; Tarraga, Joaquin; de Maria, Alejandro; Salavert, Francisco; Garcia-Alonso, Luz; Celma, Matilde; Martin, Ainoha; Dopazo, Joaquin; Medina, Ignacio

    2012-07-01

    During the past years, the advances in high-throughput technologies have produced an unprecedented growth in the number and size of repositories and databases storing relevant biological data. Today, there is more biological information than ever but, unfortunately, the current status of many of these repositories is far from being optimal. Some of the most common problems are that the information is spread out in many small databases; frequently there are different standards among repositories and some databases are no longer supported or they contain too specific and unconnected information. In addition, data size is increasingly becoming an obstacle when accessing or storing biological data. All these issues make it very difficult to extract and integrate information from different sources, to analyze experiments or to access and query this information in a programmatic way. CellBase provides a solution to the growing necessity of integration by easing the access to biological data. CellBase implements a set of RESTful web services that query a centralized database containing the most relevant biological data sources. The database is hosted on our servers and is regularly updated. CellBase documentation can be found at http://docs.bioinfo.cipf.es/projects/cellbase.

  5. Comparison of human cell signaling pathway databases—evolution, drawbacks and challenges

    PubMed Central

    Chowdhury, Saikat; Sarkar, Ram Rup

    2015-01-01

    Elucidating the complexities of cell signaling pathways is of immense importance to gain understanding about various biological phenomena, such as dynamics of gene/protein expression regulation, cell fate determination, embryogenesis and disease progression. The successful completion of the Human Genome Project has also helped experimental and theoretical biologists to analyze various important pathways. To advance this study, during the past two decades, systematic collections of pathway data from experimental studies have been compiled and distributed freely by several databases, which also integrate various computational tools for further analysis. Despite significant advancements, there exist several drawbacks and challenges, such as pathway data heterogeneity, annotation, regular update and automated image reconstructions, which motivated us to perform a thorough review of 24 popular and actively functioning cell signaling databases. Based on two major characteristics, pathway information and technical details, freely accessible data from commercial and academic databases are examined to understand their evolution and enrichment. This review not only helps to identify some novel and useful features, which are not yet included in any of the databases, but also highlights their current limitations and subsequently proposes reasonable solutions for future database development, which could be useful to the whole scientific community. PMID:25632107

  6. MetReS, an Efficient Database for Genomic Applications.

    PubMed

    Vilaplana, Jordi; Alves, Rui; Solsona, Francesc; Mateo, Jordi; Teixidó, Ivan; Pifarré, Marc

    2018-02-01

    MetReS (Metabolic Reconstruction Server) is a genomic database that is shared between two software applications that address important biological problems. Biblio-MetReS is a data-mining tool that enables the reconstruction of molecular networks based on automated text-mining analysis of published scientific literature. Homol-MetReS allows functional (re)annotation of proteomes, to properly identify both the individual proteins involved in the processes of interest and their function. The main goal of this work was to identify the areas where the performance of the MetReS database could be improved and to test whether this improvement would scale to larger datasets and more complex types of analysis. The study was started with a relational database, MySQL, which is the current database server used by the applications. We also tested the performance of an alternative data-handling framework, Apache Hadoop. Hadoop is currently used for large-scale data processing. We found that this data-handling framework is likely to greatly improve the efficiency of the MetReS applications as the dataset and the processing needs increase by several orders of magnitude, as expected to happen in the near future.

  7. MelanomaDB: A Web Tool for Integrative Analysis of Melanoma Genomic Information to Identify Disease-Associated Molecular Pathways

    PubMed Central

    Trevarton, Alexander J.; Mann, Michael B.; Knapp, Christoph; Araki, Hiromitsu; Wren, Jonathan D.; Stones-Havas, Steven; Black, Michael A.; Print, Cristin G.

    2013-01-01

    Despite on-going research, metastatic melanoma survival rates remain low and treatment options are limited. Researchers can now access a rapidly growing amount of molecular and clinical information about melanoma. This information is becoming difficult to assemble and interpret due to its dispersed nature, yet as it grows it becomes increasingly valuable for understanding melanoma. Integration of this information into a comprehensive resource to aid rational experimental design and patient stratification is needed. As an initial step in this direction, we have assembled a web-accessible melanoma database, MelanomaDB, which incorporates clinical and molecular data from publicly available sources and which will be regularly updated as new information becomes available. This database allows complex links to be drawn between many different aspects of melanoma biology: genetic changes (e.g., mutations) in individual melanomas revealed by DNA sequencing, associations between gene expression and patient survival, data concerning drug targets, biomarkers, druggability, and clinical trials, as well as our own statistical analysis of relationships between molecular pathways and clinical parameters that have been produced using these data sets. The database is freely available at http://genesetdb.auckland.ac.nz/melanomadb/about.html. A subset of the information in the database can also be accessed through a freely available web application in the Illumina genomic cloud computing platform BaseSpace at http://www.biomatters.com/apps/melanoma-profiler-for-research. The MelanomaDB database illustrates dysregulation of specific signaling pathways across 310 exome-sequenced melanomas and in individual tumors and identifies the distribution of somatic variants in melanoma. We suggest that MelanomaDB can provide a context in which to interpret the tumor molecular profiles of individual melanoma patients relative to biological information and available drug therapies. PMID:23875173

  8. A Machine Reading System for Assembling Synthetic Paleontological Databases

    PubMed Central

    Peters, Shanan E.; Zhang, Ce; Livny, Miron; Ré, Christopher

    2014-01-01

    Many aspects of macroevolutionary theory and our understanding of biotic responses to global environmental change derive from literature-based compilations of paleontological data. Existing manually assembled databases are, however, incomplete and difficult to assess and enhance with new data types. Here, we develop and validate the quality of a machine reading system, PaleoDeepDive, that automatically locates and extracts data from heterogeneous text, tables, and figures in publications. PaleoDeepDive performs comparably to humans in several complex data extraction and inference tasks and generates congruent synthetic results that describe the geological history of taxonomic diversity and genus-level rates of origination and extinction. Unlike traditional databases, PaleoDeepDive produces a probabilistic database that systematically improves as information is added. We show that the system can readily accommodate sophisticated data types, such as morphological data in biological illustrations and associated textual descriptions. Our machine reading approach to scientific data integration and synthesis brings within reach many questions that are currently underdetermined and does so in ways that may stimulate entirely new modes of inquiry. PMID:25436610

  9. KnotProt: a database of proteins with knots and slipknots.

    PubMed

    Jamroz, Michal; Niemyska, Wanda; Rawdon, Eric J; Stasiak, Andrzej; Millett, Kenneth C; Sułkowski, Piotr; Sulkowska, Joanna I

    2015-01-01

    The protein topology database KnotProt, http://knotprot.cent.uw.edu.pl/, collects information about protein structures with open polypeptide chains forming knots or slipknots. The knotting complexity of the cataloged proteins is presented in the form of a matrix diagram that shows users the knot type of the entire polypeptide chain and of each of its subchains. The pattern visible in the matrix gives the knotting fingerprint of a given protein and permits users to determine, for example, the minimal length of the knotted regions (knot's core size) or the depth of a knot, i.e. how many amino acids can be removed from either end of the cataloged protein structure before converting it from a knot to a different type of knot. In addition, the database presents extensive information about the biological functions, families and fold types of proteins with non-trivial knotting. As an additional feature, the KnotProt database enables users to submit protein or polymer chains and generate their knotting fingerprints. © The Author(s) 2014. Published by Oxford University Press on behalf of Nucleic Acids Research.

  10. Text mining for metabolic pathways, signaling cascades, and protein networks.

    PubMed

    Hoffmann, Robert; Krallinger, Martin; Andres, Eduardo; Tamames, Javier; Blaschke, Christian; Valencia, Alfonso

    2005-05-10

    The complexity of the information stored in databases and publications on metabolic and signaling pathways, the high throughput of experimental data, and the growing number of publications make it imperative to provide systems to help the researcher navigate through these interrelated information resources. Text-mining methods have started to play a key role in the creation and maintenance of links between the information stored in biological databases and its original sources in the literature. These links will be extremely useful for database updating and curation, especially if a number of technical problems can be solved satisfactorily, including the identification of protein and gene names (entities in general) and the characterization of their types of interactions. The first generation of openly accessible text-mining systems, such as iHOP (Information Hyperlinked over Proteins), provides additional functions to facilitate the reconstruction of protein interaction networks, combine database and text information, and support the scientist in the formulation of novel hypotheses. The next challenge is the generation of comprehensive information regarding the general function of signaling pathways and protein interaction networks.

  11. Systems biology impact on antiepileptic drug discovery.

    PubMed

    Margineanu, Doru Georg

    2012-02-01

    Systems biology (SB), a recent trend in bioscience research to consider the complex interactions in biological systems from a holistic perspective, sees disease as a disturbed network of interactions, rather than an alteration of single molecular component(s). SB-based network pharmacology replaces the prevailing focus on specific drug-receptor interaction, and its corollary of rational drug design of "magic bullets", with the search for multi-target drugs that would act on biological networks as "magic shotguns". Since epilepsy is a multi-factorial, polygenic and dynamic pathology, the SB approach appears particularly fit and promising for antiepileptic drug (AED) discovery. In fact, long before the advent of SB, AED discovery already involved some SB-like elements. A reported SB project aimed at finding new drug targets in epilepsy relies on a relational database that integrates clinical information, recordings from deep electrodes and 3D-brain imagery with histology and molecular biology data on modified expression of specific genes in the brain regions displaying spontaneous epileptic activity. Since hitting a single target does not treat complex diseases, a proper pharmacological promiscuity might impart to an AED the merit of being multi-potent. However, multi-target drug discovery entails the complicated task of optimizing multiple activities of compounds, while having to balance drug-like properties and to control unwanted effects. Specific design tools for this new approach in drug discovery are only beginning to emerge, but computational methods making reliable in silico predictions of poly-pharmacology have appeared, and their progress might be quite rapid. The current move away from reductionism into network pharmacology raises the expectation that a proper integration of the intrinsic complexity of epileptic pathology in AED discovery might result in literally anti-epileptic drugs. Copyright © 2011 Elsevier B.V. All rights reserved.

  12. PGSB PlantsDB: updates to the database framework for comparative plant genome research.

    PubMed

    Spannagl, Manuel; Nussbaumer, Thomas; Bader, Kai C; Martis, Mihaela M; Seidel, Michael; Kugler, Karl G; Gundlach, Heidrun; Mayer, Klaus F X

    2016-01-04

    PGSB (Plant Genome and Systems Biology: formerly MIPS) PlantsDB (http://pgsb.helmholtz-muenchen.de/plant/index.jsp) is a database framework for the comparative analysis and visualization of plant genome data. The resource has been updated with new data sets and types as well as specialized tools and interfaces to address user demands for intuitive access to complex plant genome data. In its latest incarnation, we have re-worked both the layout and navigation structure and implemented new keyword search options and a new BLAST sequence search functionality. Actively involved in corresponding sequencing consortia, PlantsDB has dedicated special efforts to the integration and visualization of complex triticeae genome data, especially for barley, wheat and rye. We enhanced CrowsNest, a tool to visualize syntenic relationships between genomes, with data from the wheat sub-genome progenitor Aegilops tauschii and added functionality to the PGSB RNASeqExpressionBrowser. GenomeZipper results were integrated for the genomes of barley, rye, wheat and perennial ryegrass and interactive access is granted through PlantsDB interfaces. Data exchange and cross-linking between PlantsDB and other plant genome databases is stimulated by the transPLANT project (http://transplantdb.eu/). © The Author(s) 2015. Published by Oxford University Press on behalf of Nucleic Acids Research.

  13. Adverse drug events with hyperkalaemia during inpatient stays: evaluation of an automated method for retrospective detection in hospital databases

    PubMed Central

    2014-01-01

    Background: Adverse drug reactions and adverse drug events (ADEs) are major public health issues. Many different prospective tools for the automated detection of ADEs in hospital databases have been developed and evaluated. The objective of the present study was to evaluate an automated method for the retrospective detection of ADEs with hyperkalaemia during inpatient stays. Methods: We used a set of complex detection rules to take account of the patient’s clinical and biological context and the chronological relationship between the causes and the expected outcome. The dataset consisted of 3,444 inpatient stays in a French general hospital. An automated review was performed for all data and the results were compared with those of an expert chart review. The complex detection rules’ analytical quality was evaluated for ADEs. Results: In terms of recall, 89.5% of ADEs with hyperkalaemia “with or without an abnormal symptom” were automatically identified (including all three serious ADEs). In terms of precision, 63.7% of the automatically identified ADEs with hyperkalaemia were true ADEs. Conclusions: The use of context-sensitive rules appears to improve the automated detection of ADEs with hyperkalaemia. This type of tool may have an important role in pharmacoepidemiology via the routine analysis of large inter-hospital databases. PMID:25212108

  14. Adverse drug events with hyperkalaemia during inpatient stays: evaluation of an automated method for retrospective detection in hospital databases.

    PubMed

    Ficheur, Grégoire; Chazard, Emmanuel; Beuscart, Jean-Baptiste; Merlin, Béatrice; Luyckx, Michel; Beuscart, Régis

    2014-09-12

    Adverse drug reactions and adverse drug events (ADEs) are major public health issues. Many different prospective tools for the automated detection of ADEs in hospital databases have been developed and evaluated. The objective of the present study was to evaluate an automated method for the retrospective detection of ADEs with hyperkalaemia during inpatient stays. We used a set of complex detection rules to take account of the patient's clinical and biological context and the chronological relationship between the causes and the expected outcome. The dataset consisted of 3,444 inpatient stays in a French general hospital. An automated review was performed for all data and the results were compared with those of an expert chart review. The complex detection rules' analytical quality was evaluated for ADEs. In terms of recall, 89.5% of ADEs with hyperkalaemia "with or without an abnormal symptom" were automatically identified (including all three serious ADEs). In terms of precision, 63.7% of the automatically identified ADEs with hyperkalaemia were true ADEs. The use of context-sensitive rules appears to improve the automated detection of ADEs with hyperkalaemia. This type of tool may have an important role in pharmacoepidemiology via the routine analysis of large inter-hospital databases.
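
    The recall and precision figures above compare automatically flagged stays against an expert chart review. A minimal sketch of how the two measures are computed follows; the stay identifiers and counts are hypothetical and are not taken from the study, and `precision_recall` is an illustrative name.

```python
def precision_recall(flagged, confirmed):
    """Compare automatically flagged cases against expert-confirmed cases.

    precision = share of flagged cases that the experts confirmed
    recall    = share of expert-confirmed cases that were flagged
    """
    flagged, confirmed = set(flagged), set(confirmed)
    true_positives = len(flagged & confirmed)
    precision = true_positives / len(flagged) if flagged else 0.0
    recall = true_positives / len(confirmed) if confirmed else 0.0
    return precision, recall


# Hypothetical identifiers, chosen only to make the arithmetic easy to follow.
flagged = {"stay_01", "stay_02", "stay_03", "stay_04", "stay_05"}
confirmed = {"stay_02", "stay_03", "stay_04", "stay_09"}
p, r = precision_recall(flagged, confirmed)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.60 recall=0.75
```

    In this toy case three of five flagged stays are confirmed (precision 0.60) and three of four confirmed stays are flagged (recall 0.75).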

  15. Clique-based data mining for related genes in a biomedical database.

    PubMed

    Matsunaga, Tsutomu; Yonemori, Chikara; Tomita, Etsuji; Muramatsu, Masaaki

    2009-07-01

    Progress in the life sciences cannot be made without integrating biomedical knowledge on numerous genes in order to help formulate hypotheses on the genetic mechanisms behind various biological phenomena, including diseases. There is thus a strong need for a way to automatically and comprehensively search from biomedical databases for related genes, such as genes in the same families and genes encoding components of the same pathways. Here we address the extraction of related genes by searching for densely-connected subgraphs, which are modeled as cliques, in a biomedical relational graph. We constructed a graph whose nodes were gene or disease pages, and edges were the hyperlink connections between those pages in the Online Mendelian Inheritance in Man (OMIM) database. We obtained over 20,000 sets of related genes (called 'gene modules') by enumerating cliques computationally. The modules included genes in the same family, genes for proteins that form a complex, and genes for components of the same signaling pathway. The results of experiments using 'metabolic syndrome'-related gene modules show that the gene modules can be used to get a coherent holistic picture helpful for interpreting relations among genes. We presented a data mining approach extracting related genes by enumerating cliques. The extracted gene sets provide a holistic picture useful for comprehending complex disease mechanisms.
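
    Enumerating cliques in a gene/disease link graph can be sketched with Bron-Kerbosch enumeration with pivoting, a standard technique for listing maximal cliques. The abstract does not say which algorithm the authors used, so this is an assumption, and the node names below are hypothetical.

```python
def maximal_cliques(adj):
    """Yield every maximal clique of an undirected graph.

    adj maps each node to the set of its neighbours (no self-loops).
    """
    def expand(r, p, x):
        # r: current clique, p: candidates to extend it, x: already-tried nodes.
        if not p and not x:
            yield frozenset(r)
            return
        # Pivot on the node with the most neighbours in p to prune branches.
        pivot = max(p | x, key=lambda u: len(adj[u] & p))
        for v in list(p - adj[pivot]):
            yield from expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    yield from expand(set(), set(adj), set())


# Hypothetical page-link graph: three mutually linked genes and one disease page.
adj = {
    "GENE_A": {"GENE_B", "GENE_C"},
    "GENE_B": {"GENE_A", "GENE_C"},
    "GENE_C": {"GENE_A", "GENE_B", "DISEASE_X"},
    "DISEASE_X": {"GENE_C"},
}
for clique in maximal_cliques(adj):
    print(sorted(clique))
```

    On this toy graph the maximal cliques are the gene triangle and the GENE_C/DISEASE_X pair; the order in which they are printed may vary between runs.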

  16. sc-PDB-Frag: a database of protein-ligand interaction patterns for Bioisosteric replacements.

    PubMed

    Desaphy, Jérémy; Rognan, Didier

    2014-07-28

    Bioisosteric replacement plays an important role in medicinal chemistry by keeping the biological activity of a molecule while changing either its core scaffold or substituents, thereby facilitating lead optimization and patenting. Bioisosteres are classically chosen in order to keep the main pharmacophoric moieties of the substructure to replace. However, notably when changing a scaffold, no attention is usually paid as whether all atoms of the reference scaffold are equally important for binding to the desired target. We herewith propose a novel database for bioisosteric replacement (scPDBFrag), capitalizing on our recently published structure-based approach to scaffold hopping, focusing on interaction pattern graphs. Protein-bound ligands are first fragmented and the interaction of the corresponding fragments with their protein environment computed-on-the-fly. Using an in-house developed graph alignment tool, interaction patterns graphs can be compared, aligned, and sorted by decreasing similarity to any reference. In the herein presented sc-PDB-Frag database ( http://bioinfo-pharma.u-strasbg.fr/scPDBFrag ), fragments, interaction patterns, alignments, and pairwise similarity scores have been extracted from the sc-PDB database of 8077 druggable protein-ligand complexes and further stored in a relational database. We herewith present the database, its Web implementation, and procedures for identifying true bioisosteric replacements based on conserved interaction patterns.

  17. Biological agents database in the armed forces.

    PubMed

    Niemcewicz, Marcin; Kocik, Janusz; Bielecka, Anna; Wierciński, Michał

    2014-10-01

    Rapid detection and identification of a biological agent during either a natural or a deliberate outbreak is crucial for implementing appropriate control measures and procedures to mitigate the spread of disease. Determination of pathogen etiology may not only support epidemiological investigation and human safety, but also enhance forensic efforts in pathogen tracing, collection of evidence and correct inference. The article presents the objectives of the Biological Agents Database, which was developed for the Ministry of National Defense of the Republic of Poland under the European Defence Agency framework. The Biological Agents Database is an electronic catalogue of genetic markers of highly dangerous pathogens and biological agents of weapon of mass destruction concern, which provides full identification of biological threats emerging in Poland and in locations where Polish troops are active. The Biological Agents Database is a supportive tool used for tracing the origin of biological agents as well as for rapid identification of an agent causing a disease of unknown etiology. It also provides support in diagnosis, analysis, response and exchange of information between institutions that use the information contained in it. Therefore, it can be used not only for military purposes, but also in a civilian environment.

  18. A computational platform to maintain and migrate manual functional annotations for BioCyc databases.

    PubMed

    Walsh, Jesse R; Sen, Taner Z; Dickerson, Julie A

    2014-10-12

    BioCyc databases are an important resource for information on biological pathways and genomic data. Such databases represent the accumulation of biological data, some of which has been manually curated from literature. An essential feature of these databases is the continuing data integration as new knowledge is discovered. As functional annotations are improved, scalable methods are needed for curators to manage annotations without detailed knowledge of the specific design of the BioCyc database. We have developed CycTools, a software tool which allows curators to maintain functional annotations in a model organism database. This tool builds on existing software to improve and simplify annotation data imports of user provided data into BioCyc databases. Additionally, CycTools automatically resolves synonyms and alternate identifiers contained within the database into the appropriate internal identifiers. Automating steps in the manual data entry process can improve curation efforts for major biological databases. The functionality of CycTools is demonstrated by transferring GO term annotations from MaizeCyc to matching proteins in CornCyc, both maize metabolic pathway databases available at MaizeGDB, and by creating strain specific databases for metabolic engineering.

  19. Re-thinking organisms: The impact of databases on model organism biology.

    PubMed

    Leonelli, Sabina; Ankeny, Rachel A

    2012-03-01

    Community databases have become crucial to the collection, ordering and retrieval of data gathered on model organisms, as well as to the ways in which these data are interpreted and used across a range of research contexts. This paper analyses the impact of community databases on research practices in model organism biology by focusing on the history and current use of four community databases: FlyBase, Mouse Genome Informatics, WormBase and The Arabidopsis Information Resource. We discuss the standards used by the curators of these databases for what counts as reliable evidence, acceptable terminology, appropriate experimental set-ups and adequate materials (e.g., specimens). On the one hand, these choices are informed by the collaborative research ethos characterising most model organism communities. On the other hand, the deployment of these standards in databases reinforces this ethos and gives it concrete and precise instantiations by shaping the skills, practices, values and background knowledge required of the database users. We conclude that the increasing reliance on community databases as vehicles to circulate data is having a major impact on how researchers conduct and communicate their research, which affects how they understand the biology of model organisms and its relation to the biology of other species. Copyright © 2011 Elsevier Ltd. All rights reserved.

  20. New tools and methods for direct programmatic access to the dbSNP relational database.

    PubMed

    Saccone, Scott F; Quan, Jiaxi; Mehta, Gaurang; Bolze, Raphael; Thomas, Prasanth; Deelman, Ewa; Tischfield, Jay A; Rice, John P

    2011-01-01

    Genome-wide association studies often incorporate information from public biological databases in order to provide a biological reference for interpreting the results. The dbSNP database is an extensive source of information on single nucleotide polymorphisms (SNPs) for many different organisms, including humans. We have developed free software that will download and install a local MySQL implementation of the dbSNP relational database for a specified organism. We have also designed a system for classifying dbSNP tables in terms of common tasks we wish to accomplish using the database. For each task we have designed a small set of custom tables that facilitate task-related queries and provide entity-relationship diagrams for each task composed from the relevant dbSNP tables. In order to expose these concepts and methods to a wider audience we have developed web tools for querying the database and browsing documentation on the tables and columns to clarify the relevant relational structure. All web tools and software are freely available to the public at http://cgsmd.isi.edu/dbsnpq. Resources such as these for programmatically querying biological databases are essential for viably integrating biological information into genetic association experiments on a genome-wide scale.

  1. Functional Analysis of OMICs Data and Small Molecule Compounds in an Integrated "Knowledge-Based" Platform.

    PubMed

    Dubovenko, Alexey; Nikolsky, Yuri; Rakhmatulin, Eugene; Nikolskaya, Tatiana

    2017-01-01

    Analysis of NGS and other sequencing data, gene variants, gene expression, proteomics, and other high-throughput (OMICs) data is challenging because of its biological complexity and high level of technical and biological noise. One way to deal with both problems is to perform analysis with a high fidelity annotated knowledgebase of protein interactions, pathways, and functional ontologies. This knowledgebase has to be structured in a computer-readable format and must include software tools for managing experimental data, analysis, and reporting. Here, we present MetaCore™ and Key Pathway Advisor (KPA), an integrated platform for functional data analysis. On the content side, MetaCore and KPA encompass a comprehensive database of molecular interactions of different types, pathways, network models, and ten functional ontologies covering human, mouse, and rat genes. The analytical toolkit includes tools for gene/protein list enrichment analysis, statistical "interactome" tool for the identification of over- and under-connected proteins in the dataset, and a biological network analysis module made up of network generation algorithms and filters. The suite also features Advanced Search, an application for combinatorial search of the database content, as well as a Java-based tool called Pathway Map Creator for drawing and editing custom pathway maps. Applications of MetaCore and KPA include molecular mode of action of disease research, identification of potential biomarkers and drug targets, pathway hypothesis generation, analysis of biological effects for novel small molecule compounds and clinical applications (analysis of large cohorts of patients, and translational and personalized medicine).
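
    Gene-list enrichment analysis is commonly scored with a one-sided hypergeometric test. The abstract does not state which statistic MetaCore or KPA uses, so the sketch below is an assumption about the general technique; `enrichment_p_value` and all counts are illustrative.

```python
from math import comb


def enrichment_p_value(universe, annotated, selected, overlap):
    """One-sided hypergeometric p-value, P(X >= overlap).

    universe  : number of genes in the background set
    annotated : background genes that belong to the pathway or ontology term
    selected  : size of the experimental gene list
    overlap   : genes in the list that belong to the term
    """
    upper = min(annotated, selected)
    favourable = sum(
        comb(annotated, i) * comb(universe - annotated, selected - i)
        for i in range(overlap, upper + 1)
    )
    return favourable / comb(universe, selected)


# Hypothetical counts: 20 background genes, 5 in the term, 4 selected, 2 overlapping.
p_value = enrichment_p_value(universe=20, annotated=5, selected=4, overlap=2)
print(round(p_value, 4))
```

    A small p-value suggests the term is over-represented in the list; when many terms are tested, a multiple-testing correction would normally be applied as well.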

  2. PathCase-SB architecture and database design

    PubMed Central

    2011-01-01

    Background Integration of metabolic pathways resources and regulatory metabolic network models, and deploying new tools on the integrated platform can help perform more effective and more efficient systems biology research on understanding the regulation in metabolic networks. Therefore, the tasks of (a) integrating under a single database environment regulatory metabolic networks and existing models, and (b) building tools to help with modeling and analysis are desirable and intellectually challenging computational tasks. Description PathCase Systems Biology (PathCase-SB) is built and released. The PathCase-SB database provides data and API for multiple user interfaces and software tools. The current PathCase-SB system provides a database-enabled framework and web-based computational tools towards facilitating the development of kinetic models for biological systems. PathCase-SB aims to integrate data of selected biological data sources on the web (currently, BioModels database and KEGG), and to provide more powerful and/or new capabilities via the new web-based integrative framework. This paper describes architecture and database design issues encountered in PathCase-SB's design and implementation, and presents the current design of PathCase-SB's architecture and database. Conclusions PathCase-SB architecture and database provide a highly extensible and scalable environment with easy and fast (real-time) access to the data in the database. PathCase-SB itself is already being used by researchers across the world. PMID:22070889

  3. A systems biology-led insight into the role of the proteome in neurodegenerative diseases.

    PubMed

    Fasano, Mauro; Monti, Chiara; Alberio, Tiziana

    2016-09-01

    Multifactorial disorders are the result of nonlinear interactions of several factors; therefore, a reductionist approach does not appear to be appropriate. Proteomics is a global approach that can be efficiently used to investigate pathogenetic mechanisms of neurodegenerative diseases. Here, we report a general introduction about the systems biology approach and mechanistic insights recently obtained by over-representation analysis of proteomics data of cellular and animal models of Alzheimer's disease, Parkinson's disease and other neurodegenerative disorders, as well as of affected human tissues. Expert commentary: As an inductive method, proteomics is based on unbiased observations that further require validation of generated hypotheses. Pathway databases and over-representation analysis tools allow researchers to assign an expectation value to pathogenetic mechanisms linked to neurodegenerative diseases. The systems biology approach based on omics data may be the key to unravel the complex mechanisms underlying neurodegeneration.

  4. An analytical platform for mass spectrometry-based identification and chemical analysis of RNA in ribonucleoprotein complexes.

    PubMed

    Taoka, Masato; Yamauchi, Yoshio; Nobe, Yuko; Masaki, Shunpei; Nakayama, Hiroshi; Ishikawa, Hideaki; Takahashi, Nobuhiro; Isobe, Toshiaki

    2009-11-01

    We describe here a mass spectrometry (MS)-based analytical platform for RNA, which combines direct nano-flow reversed-phase liquid chromatography (RPLC) on a spray tip column and a high-resolution LTQ-Orbitrap mass spectrometer. Operating RPLC under a very low flow rate with volatile solvents and MS in the negative mode, we could estimate highly accurate mass values sufficient to predict the nucleotide composition of an approximately 21-nucleotide small interfering RNA, detect post-transcriptional modifications in yeast tRNA, and perform collision-induced dissociation/tandem MS-based structural analysis of nucleolytic fragments of RNA at a sub-femtomole level. Importantly, the method allowed the identification and chemical analysis of small RNAs in ribonucleoprotein (RNP) complexes, such as the pre-spliceosomal RNP complex, which was pulled down from cultured cells with a tagged protein cofactor as bait. We have recently developed a unique genome-oriented database search engine, Ariadne, which allows tandem MS-based identification of RNAs in biological samples. Thus, the method presented here has broad potential for automated analysis of RNA; it complements conventional molecular biology-based techniques and is particularly suited for simultaneous analysis of the composition, structure, interaction, and dynamics of RNA and protein components in various cellular RNP complexes.

  5. Construction of a Linux based chemical and biological information system.

    PubMed

    Molnár, László; Vágó, István; Fehér, András

    2003-01-01

    A chemical and biological information system with a Web-based easy-to-use interface and corresponding databases has been developed. The constructed system incorporates all chemical, numerical and textual data related to the chemical compounds, including numerical biological screen results. Users can search the database by traditional textual/numerical and/or substructure or similarity queries through the web interface. To build our chemical database management system, we utilized existing IT components such as ORACLE or Tripos SYBYL for database management and Zope application server for the web interface. We chose Linux as the main platform, however, almost every component can be used under various operating systems.

  6. The MIGenAS integrated bioinformatics toolkit for web-based sequence analysis

    PubMed Central

    Rampp, Markus; Soddemann, Thomas; Lederer, Hermann

    2006-01-01

    We describe a versatile and extensible integrated bioinformatics toolkit for the analysis of biological sequences over the Internet. The web portal offers convenient interactive access to a growing pool of chainable bioinformatics software tools and databases that are centrally installed and maintained by the RZG. Currently, supported tasks comprise sequence similarity searches in public or user-supplied databases, computation and validation of multiple sequence alignments, phylogenetic analysis and protein–structure prediction. Individual tools can be seamlessly chained into pipelines allowing the user to conveniently process complex workflows without the necessity to take care of any format conversions or tedious parsing of intermediate results. The toolkit is part of the Max-Planck Integrated Gene Analysis System (MIGenAS) of the Max Planck Society available at (click ‘Start Toolkit’). PMID:16844980

  7. Visualising biological data: a semantic approach to tool and database integration

    PubMed Central

    Pettifer, Steve; Thorne, David; McDermott, Philip; Marsh, James; Villéger, Alice; Kell, Douglas B; Attwood, Teresa K

    2009-01-01

    Motivation In the biological sciences, the need to analyse vast amounts of information has become commonplace. Such large-scale analyses often involve drawing together data from a variety of different databases, held remotely on the internet or locally on in-house servers. Supporting these tasks are ad hoc collections of data-manipulation tools, scripting languages and visualisation software, which are often combined in arcane ways to create cumbersome systems that have been customised for a particular purpose, and are consequently not readily adaptable to other uses. For many day-to-day bioinformatics tasks, the sizes of current databases, and the scale of the analyses necessary, now demand increasing levels of automation; nevertheless, the unique experience and intuition of human researchers is still required to interpret the end results in any meaningful biological way. Putting humans in the loop requires tools to support real-time interaction with these vast and complex data-sets. Numerous tools do exist for this purpose, but many do not have optimal interfaces, most are effectively isolated from other tools and databases owing to incompatible data formats, and many have limited real-time performance when applied to realistically large data-sets: much of the user's cognitive capacity is therefore focused on controlling the software and manipulating esoteric file formats rather than on performing the research. Methods To confront these issues, harnessing expertise in human-computer interaction (HCI), high-performance rendering and distributed systems, and guided by bioinformaticians and end-user biologists, we are building reusable software components that, together, create a toolkit that is both architecturally sound from a computing point of view, and addresses both user and developer requirements. 
    Key to the system's usability is its direct exploitation of semantics, which, crucially, gives individual components knowledge of their own functionality and allows them to interoperate seamlessly, removing many of the existing barriers and bottlenecks from standard bioinformatics tasks. Results The toolkit, named Utopia, is freely available from . PMID:19534744

  8. Visualising biological data: a semantic approach to tool and database integration.

    PubMed

    Pettifer, Steve; Thorne, David; McDermott, Philip; Marsh, James; Villéger, Alice; Kell, Douglas B; Attwood, Teresa K

    2009-06-16

    In the biological sciences, the need to analyse vast amounts of information has become commonplace. Such large-scale analyses often involve drawing together data from a variety of different databases, held remotely on the internet or locally on in-house servers. Supporting these tasks are ad hoc collections of data-manipulation tools, scripting languages and visualisation software, which are often combined in arcane ways to create cumbersome systems that have been customized for a particular purpose, and are consequently not readily adaptable to other uses. For many day-to-day bioinformatics tasks, the sizes of current databases, and the scale of the analyses necessary, now demand increasing levels of automation; nevertheless, the unique experience and intuition of human researchers is still required to interpret the end results in any meaningful biological way. Putting humans in the loop requires tools to support real-time interaction with these vast and complex data-sets. Numerous tools do exist for this purpose, but many do not have optimal interfaces, most are effectively isolated from other tools and databases owing to incompatible data formats, and many have limited real-time performance when applied to realistically large data-sets: much of the user's cognitive capacity is therefore focused on controlling the software and manipulating esoteric file formats rather than on performing the research. To confront these issues, harnessing expertise in human-computer interaction (HCI), high-performance rendering and distributed systems, and guided by bioinformaticians and end-user biologists, we are building reusable software components that, together, create a toolkit that is both architecturally sound from a computing point of view, and addresses both user and developer requirements. 
    Key to the system's usability is its direct exploitation of semantics, which, crucially, gives individual components knowledge of their own functionality and allows them to interoperate seamlessly, removing many of the existing barriers and bottlenecks from standard bioinformatics tasks. The toolkit, named Utopia, is freely available from http://utopia.cs.man.ac.uk/.

  9. GlycomeDB – integration of open-access carbohydrate structure databases

    PubMed Central

    Ranzinger, René; Herget, Stephan; Wetter, Thomas; von der Lieth, Claus-Wilhelm

    2008-01-01

    Background Although carbohydrates are the third major class of biological macromolecules, after proteins and DNA, there is neither a comprehensive database for carbohydrate structures nor an established universal structure encoding scheme for computational purposes. Funding for further development of the Complex Carbohydrate Structure Database (CCSD or CarbBank) ceased in 1997, and since then several initiatives have developed independent databases with partially overlapping foci. For each database, different encoding schemes for residues and sequence topology were designed. Therefore, it is virtually impossible to obtain an overview of all deposited structures or to compare the contents of the various databases. Results We have implemented procedures which download the structures contained in the seven major databases, e.g. GLYCOSCIENCES.de, the Consortium for Functional Glycomics (CFG), the Kyoto Encyclopedia of Genes and Genomes (KEGG) and the Bacterial Carbohydrate Structure Database (BCSDB). We have created a new database called GlycomeDB, containing all structures, their taxonomic annotations and references (IDs) for the original databases. More than 100000 datasets were imported, resulting in more than 33000 unique sequences now encoded in GlycomeDB using the universal format GlycoCT. Inconsistencies were found in all public databases, which were discussed and corrected in multiple feedback rounds with the responsible curators. Conclusion GlycomeDB is a new, publicly available database for carbohydrate sequences with a unified, all-encompassing structure encoding format and NCBI taxonomic referencing. The database is updated weekly and can be downloaded free of charge. The JAVA application GlycoUpdateDB is also available for establishing and updating a local installation of GlycomeDB. With the advent of GlycomeDB, the distributed islands of knowledge in glycomics are now bridged to form a single resource. PMID:18803830

  10. The NCBI BioSystems database.

    PubMed

    Geer, Lewis Y; Marchler-Bauer, Aron; Geer, Renata C; Han, Lianyi; He, Jane; He, Siqian; Liu, Chunlei; Shi, Wenyao; Bryant, Stephen H

    2010-01-01

    The NCBI BioSystems database, found at http://www.ncbi.nlm.nih.gov/biosystems/, centralizes and cross-links existing biological systems databases, increasing their utility and target audience by integrating their pathways and systems into NCBI resources. This integration allows users of NCBI's Entrez databases to quickly categorize proteins, genes and small molecules by metabolic pathway, disease state or other BioSystem type, without requiring time-consuming inference of biological relationships from the literature or multiple experimental datasets.

  11. BrassiBase: introduction to a novel knowledge database on Brassicaceae evolution.

    PubMed

    Kiefer, Markus; Schmickl, Roswitha; German, Dmitry A; Mandáková, Terezie; Lysak, Martin A; Al-Shehbaz, Ihsan A; Franzke, Andreas; Mummenhoff, Klaus; Stamatakis, Alexandros; Koch, Marcus A

    2014-01-01

    The Brassicaceae family (mustards or crucifers) includes Arabidopsis thaliana as one of the most important model species in plant biology and a number of important crop plants such as the various Brassica species (e.g. cabbage, canola and mustard). Moreover, the family comprises an increasing number of species that serve as study systems in many fields of plant science and evolutionary research. However, the systematics and taxonomy of the family are very complex and access to scientifically valuable and reliable information linked to species and genus names and its interpretation are often difficult. BrassiBase is a continuously developing and growing knowledge database (http://brassibase.cos.uni-heidelberg.de) that aims at providing direct access to many different types of information ranging from taxonomy and systematics to phylo- and cytogenetics. Providing critically revised key information, the database intends to optimize comparative evolutionary research in this family and supports the introduction of the Brassicaceae as the model family for evolutionary biology and plant sciences. Some features that should help to accomplish these goals within a comprehensive taxonomic framework have now been implemented in the new version 1.1.9. A 'Phylogenetic Placement Tool' should help to identify critical accessions and germplasm and provide a first visualization of phylogenetic relationships. The 'Cytogenetics Tool' provides in-depth information on genome sizes, chromosome numbers and polyploidy, and sets this information into a Brassicaceae-wide context.

  12. MUBII-TB-DB: a database of mutations associated with antibiotic resistance in Mycobacterium tuberculosis.

    PubMed

    Flandrois, Jean-Pierre; Lina, Gérard; Dumitrescu, Oana

    2014-04-14

    Tuberculosis is an infectious bacterial disease caused by Mycobacterium tuberculosis. It remains a major health threat, killing over one million people every year worldwide. An early antibiotic therapy is the basis of the treatment, and the emergence and spread of multidrug and extensively drug-resistant mutant strains raise significant challenges. As these bacteria grow very slowly, drug resistance mutations are currently detected using molecular biology techniques. Resistance mutations are identified by sequencing the resistance-linked genes followed by a comparison with the literature data. The only online database is the TB Drug Resistance Mutation database (TBDReaM database); however, it requires mutation detection before use, and its interrogation is complex due to its loose syntax and grammar. The MUBII-TB-DB database is a simple, highly structured text-based database that contains a set of Mycobacterium tuberculosis mutations (DNA and proteins) occurring at seven loci: rpoB, pncA, katG; mabA(fabG1)-inhA, gyrA, gyrB, and rrs. Resistance mutation data were extracted after the systematic review of MEDLINE referenced publications before March 2013. MUBII analyzes the query sequence obtained by PCR-sequencing using two parallel strategies: i) a BLAST search against a set of previously reconstructed mutated sequences and ii) the alignment of the query sequences (DNA and its protein translation) with the wild-type sequences. The post-treatment includes the extraction of the aligned sequences together with their descriptors (position and nature of mutations). The whole procedure is performed using the internet. The results are graphs (alignments) and text (description of the mutation, therapeutic significance). The system is quick and easy to use, even for technicians without bioinformatics training. MUBII-TB-DB is a structured database of the mutations occurring at seven loci of major therapeutic value in tuberculosis management. 
    Moreover, the system provides interpretation of the mutations in biological and therapeutic terms and can evolve by the addition of newly described mutations. Its goal is to provide easy and comprehensive access through a client-server model over the Web to an up-to-date database of mutations that lead to the resistance of M. tuberculosis to antibiotics.
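
    The second strategy described above, aligning the query with the wild-type sequence and extracting the position and nature of each mutation, can be sketched roughly as follows. This is not MUBII's code: it assumes the two sequences have already been aligned to equal length with "-" marking gaps, and the names `Mutation` and `describe_mutations` as well as the toy sequences are hypothetical.

```python
from typing import List, NamedTuple


class Mutation(NamedTuple):
    position: int   # 1-based position in the wild-type (reference) sequence
    wild_type: str  # reference residue, "-" when the query has an insertion
    observed: str   # query residue, "-" when the query has a deletion


def describe_mutations(aligned_ref: str, aligned_query: str) -> List[Mutation]:
    """List every column where a pre-aligned query differs from the reference."""
    if len(aligned_ref) != len(aligned_query):
        raise ValueError("aligned sequences must have equal length")
    mutations: List[Mutation] = []
    ref_pos = 0
    for ref_char, query_char in zip(aligned_ref.upper(), aligned_query.upper()):
        if ref_char != "-":
            ref_pos += 1  # gap columns in the reference do not advance the position
        if ref_char != query_char:
            # For an insertion, ref_pos is the last reference position before it.
            mutations.append(Mutation(ref_pos, ref_char, query_char))
    return mutations


if __name__ == "__main__":
    # Toy sequences, not real M. tuberculosis data.
    for m in describe_mutations("ACGTAC", "ACTTAC"):
        print(m.position, m.wild_type, "->", m.observed)
```

    Reporting positions in reference coordinates is a design choice made here so that the same mutation gets the same label regardless of insertions elsewhere in the query.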

  13. PROFESS: a PROtein Function, Evolution, Structure and Sequence database

    PubMed Central

    Triplet, Thomas; Shortridge, Matthew D.; Griep, Mark A.; Stark, Jaime L.; Powers, Robert; Revesz, Peter

    2010-01-01

    The proliferation of biological databases and the easy access enabled by the Internet is having a beneficial impact on biological sciences and transforming the way research is conducted. There are ∼1100 molecular biology databases dispersed throughout the Internet. To assist in the functional, structural and evolutionary analysis of the abundant number of novel proteins continually identified from whole-genome sequencing, we introduce the PROFESS (PROtein Function, Evolution, Structure and Sequence) database. Our database is designed to be versatile and expandable and will not confine analysis to a pre-existing set of data relationships. A fundamental component of this approach is the development of an intuitive query system that incorporates a variety of similarity functions capable of generating data relationships not conceived during the creation of the database. The utility of PROFESS is demonstrated by the analysis of the structural drift of homologous proteins and the identification of potential pancreatic cancer therapeutic targets based on the observation of protein–protein interaction networks. Database URL: http://cse.unl.edu/∼profess/ PMID:20624718

  14. Databases for Microbiologists

    DOE PAGES

    Zhulin, Igor B.

    2015-05-26

    Databases play an increasingly important role in biology. They archive, store, maintain, and share information on genes, genomes, expression data, protein sequences and structures, metabolites and reactions, interactions, and pathways. All these data are critically important to microbiologists. Furthermore, microbiology has its own databases that deal with model microorganisms, microbial diversity, physiology, and pathogenesis. Thousands of biological databases are currently available, and it becomes increasingly difficult to keep up with their development. The purpose of this minireview is to provide a brief survey of current databases that are of interest to microbiologists.

  15. Databases for Microbiologists

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Zhulin, Igor B.

    Databases play an increasingly important role in biology. They archive, store, maintain, and share information on genes, genomes, expression data, protein sequences and structures, metabolites and reactions, interactions, and pathways. All these data are critically important to microbiologists. Furthermore, microbiology has its own databases that deal with model microorganisms, microbial diversity, physiology, and pathogenesis. Thousands of biological databases are currently available, and it becomes increasingly difficult to keep up with their development. The purpose of this minireview is to provide a brief survey of current databases that are of interest to microbiologists.

  16. Databases for Microbiologists

    PubMed Central

    2015-01-01

    Databases play an increasingly important role in biology. They archive, store, maintain, and share information on genes, genomes, expression data, protein sequences and structures, metabolites and reactions, interactions, and pathways. All these data are critically important to microbiologists. Furthermore, microbiology has its own databases that deal with model microorganisms, microbial diversity, physiology, and pathogenesis. Thousands of biological databases are currently available, and it becomes increasingly difficult to keep up with their development. The purpose of this minireview is to provide a brief survey of current databases that are of interest to microbiologists. PMID:26013493

  17. Making species checklists understandable to machines - a shift from relational databases to ontologies.

    PubMed

    Laurenne, Nina; Tuominen, Jouni; Saarenmaa, Hannu; Hyvönen, Eero

    2014-01-01

    The scientific names of plants and animals play a major role in Life Sciences as information is indexed, integrated, and searched using scientific names. The main problem with names is their ambiguous nature, because more than one name may point to the same taxon and multiple taxa may share the same name. In addition, scientific names change over time, which makes them open to various interpretations. Applying machine-understandable semantics to these names enables efficient processing of biological content in information systems. The first step is to use unique persistent identifiers instead of name strings when referring to taxa. The most commonly used identifiers are Life Science Identifiers (LSID), which are traditionally used in relational databases, and more recently HTTP URIs, which are applied on the Semantic Web by Linked Data applications. We introduce two models for expressing taxonomic information in the form of species checklists. First, we show how species checklists are presented in a relational database system using LSIDs. Then, in order to gain a more detailed representation of taxonomic information, we introduce meta-ontology TaxMeOn to model the same content as Semantic Web ontologies where taxa are identified using HTTP URIs. We also explore how changes in scientific names can be managed over time. The use of HTTP URIs is preferable for presenting the taxonomic information of species checklists. An HTTP URI identifies a taxon and operates as a web address from which additional information about the taxon can be located, unlike LSID. This enables the integration of biological data from different sources on the web using Linked Data principles and prevents the formation of information silos. The Linked Data approach allows a user to assemble information and evaluate the complexity of taxonomical data based on conflicting views of taxonomic classifications. 
    Using HTTP URIs and Semantic Web technologies also facilitates the representation of the semantics of biological data and, in this way, the creation of more "intelligent" biological applications and services.
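
    The name ambiguity described above (several names for one taxon, one name for several taxa) is the reason for keying records on identifiers rather than name strings. The sketch below illustrates only that idea with a small in-memory index; it does not reproduce TaxMeOn, RDF, or LSID resolution, and the class `Checklist`, the example.org URIs, and the placeholder names "Aus bus" and "Aus cus" are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, Set


class Checklist:
    """Index taxa by identifier so that names can be ambiguous without data loss."""

    def __init__(self) -> None:
        self._names_by_taxon: Dict[str, Set[str]] = defaultdict(set)
        self._taxa_by_name: Dict[str, Set[str]] = defaultdict(set)

    def add_name(self, taxon_uri: str, name: str) -> None:
        # A taxon may carry many names, and a name may refer to many taxa.
        self._names_by_taxon[taxon_uri].add(name)
        self._taxa_by_name[name].add(taxon_uri)

    def taxa_for(self, name: str) -> Set[str]:
        """Return every taxon identifier the name string may refer to."""
        return set(self._taxa_by_name.get(name, set()))

    def names_for(self, taxon_uri: str) -> Set[str]:
        """Return every name recorded for one taxon identifier."""
        return set(self._names_by_taxon.get(taxon_uri, set()))


if __name__ == "__main__":
    checklist = Checklist()
    checklist.add_name("http://example.org/taxon/1", "Aus bus")
    checklist.add_name("http://example.org/taxon/1", "Aus cus")  # second name, same taxon
    checklist.add_name("http://example.org/taxon/2", "Aus bus")  # same name, other taxon
    print(sorted(checklist.taxa_for("Aus bus")))
    print(sorted(checklist.names_for("http://example.org/taxon/1")))
```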

  18. The NCBI BioSystems database

    PubMed Central

    Geer, Lewis Y.; Marchler-Bauer, Aron; Geer, Renata C.; Han, Lianyi; He, Jane; He, Siqian; Liu, Chunlei; Shi, Wenyao; Bryant, Stephen H.

    2010-01-01

    The NCBI BioSystems database, found at http://www.ncbi.nlm.nih.gov/biosystems/, centralizes and cross-links existing biological systems databases, increasing their utility and target audience by integrating their pathways and systems into NCBI resources. This integration allows users of NCBI’s Entrez databases to quickly categorize proteins, genes and small molecules by metabolic pathway, disease state or other BioSystem type, without requiring time-consuming inference of biological relationships from the literature or multiple experimental datasets. PMID:19854944

  19. New tools and methods for direct programmatic access to the dbSNP relational database

    PubMed Central

    Saccone, Scott F.; Quan, Jiaxi; Mehta, Gaurang; Bolze, Raphael; Thomas, Prasanth; Deelman, Ewa; Tischfield, Jay A.; Rice, John P.

    2011-01-01

    Genome-wide association studies often incorporate information from public biological databases in order to provide a biological reference for interpreting the results. The dbSNP database is an extensive source of information on single nucleotide polymorphisms (SNPs) for many different organisms, including humans. We have developed free software that will download and install a local MySQL implementation of the dbSNP relational database for a specified organism. We have also designed a system for classifying dbSNP tables in terms of common tasks we wish to accomplish using the database. For each task we have designed a small set of custom tables that facilitate task-related queries and provide entity-relationship diagrams for each task composed from the relevant dbSNP tables. In order to expose these concepts and methods to a wider audience we have developed web tools for querying the database and browsing documentation on the tables and columns to clarify the relevant relational structure. All web tools and software are freely available to the public at http://cgsmd.isi.edu/dbsnpq. Resources such as these for programmatically querying biological databases are essential for viably integrating biological information into genetic association experiments on a genome-wide scale. PMID:21037260
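
    The idea of a small, task-specific custom table built from the underlying relational tables can be sketched as below. This is only an illustration under stated assumptions: it uses Python's built-in sqlite3 as a stand-in for the local MySQL installation, and the tables `snp`, `snp_gene`, and `task_snp_by_gene`, their columns, and the toy rows are hypothetical and do not reflect the actual dbSNP schema or the authors' software.

```python
import sqlite3


def build_task_table(conn: sqlite3.Connection) -> None:
    """Create toy source tables and derive one task-oriented table from them."""
    conn.executescript(
        """
        CREATE TABLE snp (snp_id INTEGER PRIMARY KEY, chromosome TEXT, position INTEGER);
        CREATE TABLE snp_gene (snp_id INTEGER, gene_symbol TEXT);
        INSERT INTO snp VALUES (1, '1', 1000), (2, '1', 2500), (3, '2', 400);
        INSERT INTO snp_gene VALUES (1, 'GENE_A'), (2, 'GENE_A'), (3, 'GENE_B');

        -- Pre-join the source tables once so task queries need no joins.
        CREATE TABLE task_snp_by_gene AS
            SELECT g.gene_symbol, s.snp_id, s.chromosome, s.position
            FROM snp AS s JOIN snp_gene AS g ON g.snp_id = s.snp_id;
        CREATE INDEX idx_task_gene ON task_snp_by_gene (gene_symbol);
        """
    )


def snps_for_gene(conn: sqlite3.Connection, gene_symbol: str) -> list:
    """Return (snp_id, chromosome, position) rows for one gene, ordered by position."""
    cursor = conn.execute(
        "SELECT snp_id, chromosome, position FROM task_snp_by_gene "
        "WHERE gene_symbol = ? ORDER BY position",
        (gene_symbol,),
    )
    return cursor.fetchall()


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    build_task_table(connection)
    print(snps_for_gene(connection, "GENE_A"))
```

    Materializing the join trades extra storage for simpler and faster task queries, which seems to be the motivation for custom tables in the abstract, although the authors' actual table designs are not given there.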

  20. New perspectives in toxicological information management, and the role of ISSTOX databases in assessing chemical mutagenicity and carcinogenicity.

    PubMed

    Benigni, Romualdo; Battistelli, Chiara Laura; Bossa, Cecilia; Tcheremenskaia, Olga; Crettaz, Pierre

    2013-07-01

    Currently, the public has access to a variety of databases containing mutagenicity and carcinogenicity data. These resources are crucial for the toxicologists and regulators involved in the risk assessment of chemicals, which necessitates access to all the relevant literature, and the capability to search across toxicity databases using both biological and chemical criteria. Towards the larger goal of screening chemicals for a wide range of toxicity end points of potential interest, publicly available resources across a large spectrum of biological and chemical data space must be effectively harnessed with current and evolving information technologies (i.e. systematised, integrated and mined), if long-term screening and prediction objectives are to be achieved. A key to rapid progress in the field of chemical toxicity databases is combining information technology with the use of chemical structure as the identifier of molecules. This permits an enormous range of operations (e.g. retrieving chemicals or chemical classes, describing the content of databases, finding similar chemicals, crossing biological and chemical interrogations, etc.) that other more classical databases cannot allow. This article describes the progress in the technology of toxicity databases, including the concepts of Chemical Relational Database and Toxicological Standardized Controlled Vocabularies (Ontology). Then it describes the ISSTOX cluster of toxicological databases at the Istituto Superiore di Sanità. It consists of freely available databases characterised by the use of modern information technologies and by curation of the quality of the biological data. Finally, this article provides examples of analyses and results made possible by ISSTOX.

  1. Worldwide Protein Data Bank biocuration supporting open access to high-quality 3D structural biology data

    PubMed Central

    Westbrook, John D; Feng, Zukang; Persikova, Irina; Sala, Raul; Sen, Sanchayita; Berrisford, John M; Swaminathan, G Jawahar; Oldfield, Thomas J; Gutmanas, Aleksandras; Igarashi, Reiko; Armstrong, David R; Baskaran, Kumaran; Chen, Li; Chen, Minyu; Clark, Alice R; Di Costanzo, Luigi; Dimitropoulos, Dimitris; Gao, Guanghua; Ghosh, Sutapa; Gore, Swanand; Guranovic, Vladimir; Hendrickx, Pieter M S; Hudson, Brian P; Ikegawa, Yasuyo; Kengaku, Yumiko; Lawson, Catherine L; Liang, Yuhe; Mak, Lora; Mukhopadhyay, Abhik; Narayanan, Buvaneswari; Nishiyama, Kayoko; Patwardhan, Ardan; Sahni, Gaurav; Sanz-García, Eduardo; Sato, Junko; Sekharan, Monica R; Shao, Chenghua; Smart, Oliver S; Tan, Lihua; van Ginkel, Glen; Yang, Huanwang; Zhuravleva, Marina A; Markley, John L; Nakamura, Haruki; Kurisu, Genji; Kleywegt, Gerard J; Velankar, Sameer; Berman, Helen M; Burley, Stephen K

    2018-01-01

    Abstract The Protein Data Bank (PDB) is the single global repository for experimentally determined 3D structures of biological macromolecules and their complexes with ligands. The worldwide PDB (wwPDB) is the international collaboration that manages the PDB archive according to the FAIR principles: Findability, Accessibility, Interoperability and Reusability. The wwPDB recently developed OneDep, a unified tool for deposition, validation and biocuration of structures of biological macromolecules. All data deposited to the PDB undergo critical review by wwPDB Biocurators. This article outlines the importance of biocuration for structural biology data deposited to the PDB and describes wwPDB biocuration processes and the role of expert Biocurators in sustaining a high-quality archive. Structural data submitted to the PDB are examined for self-consistency, standardized using controlled vocabularies, cross-referenced with other biological data resources and validated for scientific/technical accuracy. We illustrate how biocuration is integral to PDB data archiving, as it facilitates accurate, consistent and comprehensive representation of biological structure data, allowing efficient and effective usage by research scientists, educators, students and the curious public worldwide. Database URL: https://www.wwpdb.org/ PMID:29688351

  2. Representability of algebraic topology for biomolecules in machine learning based scoring and virtual screening

    PubMed Central

    Mu, Lin

    2018-01-01

    This work introduces a number of algebraic topology approaches, including multi-component persistent homology, multi-level persistent homology, and electrostatic persistence for the representation, characterization, and description of small molecules and biomolecular complexes. In contrast to the conventional persistent homology, multi-component persistent homology retains critical chemical and biological information during the topological simplification of biomolecular geometric complexity. Multi-level persistent homology enables a tailored topological description of inter- and/or intra-molecular interactions of interest. Electrostatic persistence incorporates partial charge information into topological invariants. These topological methods are paired with Wasserstein distance to characterize similarities between molecules and are further integrated with a variety of machine learning algorithms, including k-nearest neighbors, ensemble of trees, and deep convolutional neural networks, to manifest their descriptive and predictive powers for protein-ligand binding analysis and virtual screening of small molecules. Extensive numerical experiments involving 4,414 protein-ligand complexes from the PDBBind database and 128,374 ligand-target and decoy-target pairs in the DUD database are performed to test respectively the scoring power and the discriminatory power of the proposed topological learning strategies. It is demonstrated that the present topological learning outperforms other existing methods in protein-ligand binding affinity prediction and ligand-decoy discrimination. PMID:29309403

  3. Evolutionary characters, phenotypes and ontologies: curating data from the systematic biology literature.

    PubMed

    Dahdul, Wasila M; Balhoff, James P; Engeman, Jeffrey; Grande, Terry; Hilton, Eric J; Kothari, Cartik; Lapp, Hilmar; Lundberg, John G; Midford, Peter E; Vision, Todd J; Westerfield, Monte; Mabee, Paula M

    2010-05-20

    The wealth of phenotypic descriptions documented in the published articles, monographs, and dissertations of phylogenetic systematics is traditionally reported in a free-text format, and it is therefore largely inaccessible for linkage to biological databases for genetics, development, and phenotypes, and difficult to manage for large-scale integrative work. The Phenoscape project aims to represent these complex and detailed descriptions with rich and formal semantics that are amenable to computation and integration with phenotype data from other fields of biology. This entails reconceptualizing the traditional free-text characters into the computable Entity-Quality (EQ) formalism using ontologies. We used ontologies and the EQ formalism to curate a collection of 47 phylogenetic studies on ostariophysan fishes (including catfishes, characins, minnows, knifefishes) and their relatives with the goal of integrating these complex phenotype descriptions with information from an existing model organism database (zebrafish, http://zfin.org). We developed a curation workflow for the collection of character, taxonomic and specimen data from these publications. A total of 4,617 phenotypic characters (10,512 states) for 3,449 taxa, primarily species, were curated into EQ formalism (for a total of 12,861 EQ statements) using anatomical and taxonomic terms from teleost-specific ontologies (Teleost Anatomy Ontology and Teleost Taxonomy Ontology) in combination with terms from a quality ontology (Phenotype and Trait Ontology). Standards and guidelines for consistently and accurately representing phenotypes were developed in response to the challenges that were evident from two annotation experiments and from feedback from curators. The challenges we encountered and many of the curation standards and methods for improving consistency that we developed are generally applicable to any effort to represent phenotypes using ontologies. 
    This is because an ontological representation of the detailed variations in phenotype, whether between mutant or wildtype, among individual humans, or across the diversity of species, requires a process by which a precise combination of terms from domain ontologies is selected and organized according to logical relations. The efficiencies that we have developed in this process will be useful for any attempt to annotate complex phenotypic descriptions using ontologies. We also discuss some ramifications of EQ representation for the domain of systematics.
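    As a rough illustration of the Entity-Quality (EQ) formalism described in this record, the sketch below stores each annotation as a taxon, an entity term and a quality term, then queries the annotations by entity/quality pair. The class name, the function name and every identifier string are hypothetical placeholders, not actual terms from the Teleost Anatomy Ontology, the Teleost Taxonomy Ontology or the Phenotype and Trait Ontology, and this is not the Phenoscape data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EQStatement:
    """One Entity-Quality annotation for a taxon (all IDs are made-up placeholders)."""
    taxon: str    # term from a taxonomy ontology
    entity: str   # term from an anatomy ontology
    quality: str  # term from a quality ontology


def taxa_with(statements, entity, quality):
    """Return the sorted taxa annotated with the given entity/quality pair."""
    return sorted(
        {s.taxon for s in statements if s.entity == entity and s.quality == quality}
    )


# Hypothetical annotations; a free-text character such as "dorsal fin absent"
# becomes one (entity, quality) pair per taxon.
statements = [
    EQStatement("TAXON:0001", "ANATOMY:dorsal_fin", "QUALITY:absent"),
    EQStatement("TAXON:0002", "ANATOMY:dorsal_fin", "QUALITY:present"),
    EQStatement("TAXON:0003", "ANATOMY:dorsal_fin", "QUALITY:absent"),
]

result = taxa_with(statements, "ANATOMY:dorsal_fin", "QUALITY:absent")
print(result)
```

    Because each statement is a fixed combination of ontology terms rather than free text, the same query can in principle be run across studies, which is the integration benefit the abstract describes.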

  4. Structural basis of substrate specificity in the serine proteases.

    PubMed Central

    Perona, J. J.; Craik, C. S.

    1995-01-01

    Structure-based mutational analysis of serine protease specificity has produced a large database of information useful in addressing biological function and in establishing a basis for targeted design efforts. Critical issues examined include the function of water molecules in providing strength and specificity of binding, the extent to which binding subsites are interdependent, and the roles of polypeptide chain flexibility and distal structural elements in contributing to specificity profiles. The studies also provide a foundation for exploring why specificity modification can be either straightforward or complex, depending on the particular system. PMID:7795518

  5. Chaos and The Changing Nature of Science and Medicine. Proceedings

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Herbert, D.E.; Croft, P.; Silver, D.S.

    1996-09-01

    These proceedings represent the lectures given at the workshop on chaos and the changing nature of science and medicine. The workshop was sponsored by the University of South Alabama and the American Association of Physicists in Medicine. The topics discussed covered nonlinear dynamical systems, complexity theory, fractals, chaos in biology and medicine and in fluid dynamics. Applications of chaotic dynamics in climatology were also discussed. There were 8 lectures at the workshop and all 8 have been abstracted for the Energy Science and Technology database.(AIP)

  6. Physics through the 1990s: Scientific interfaces and technological applications

    NASA Technical Reports Server (NTRS)

    1986-01-01

    The volume examines the scientific interfaces and technological applications of physics. Twelve areas are dealt with: biological physics (biophysics, the brain, and theoretical biology); the physics-chemistry interface (instrumentation, surfaces, neutron and synchrotron radiation, polymers, organic electronic materials); materials science; geophysics (tectonics, the atmosphere and oceans, planets, drilling and seismic exploration, and remote sensing); computational physics (complex systems and applications in basic research); mathematics (field theory and chaos); microelectronics (integrated circuits, miniaturization, future trends); optical information technologies (fiber optics and photonics); instrumentation; physics applications to energy needs and the environment; national security (devices, weapons, and arms control); medical physics (radiology, ultrasonics, NMR, and photonics). An executive summary and many chapters contain recommendations regarding funding, education, industry participation, small-group university research and large facility programs, government agency programs, and computer database needs.

  7. Deep data science to prevent and treat growth faltering in Maya children.

    PubMed

    Varela-Silva, M I; Bogin, B; Sobral, J A G; Dickinson, F; Monserrat-Revillo, S

    2016-06-01

    The Maya people are descended from the indigenous inhabitants of southern Mexico, Guatemala and adjacent regions of Central America. In Guatemala, 50% of infants and children are stunted (very low height-for-age), and in some rural Maya regions more than 70% of children are stunted. A large, longitudinal, intergenerational database was created to (1) provide deep data to prevent and treat somatic growth faltering and impaired neurocognitive development, (2) detect key dependencies and predictive relations between highly complex, time-varying, and interacting biological and cultural variables and (3) identify targeted multifactorial intervention strategies for field testing and validation. Contributions to this database included data from the Universidad del Valle de Guatemala Longitudinal Study of Child and Adolescent Development, child growth and intergenerational studies among the Maya in Mexico and studies about Maya migrants in the United States.

  8. Recent Progress in the Development of Metabolome Databases for Plant Systems Biology

    PubMed Central

    Fukushima, Atsushi; Kusano, Miyako

    2013-01-01

    Metabolomics has grown greatly as a functional genomics tool, and has become an invaluable diagnostic tool for biochemical phenotyping of biological systems. Over the past decades, a number of databases involving information related to mass spectra, compound names and structures, statistical/mathematical models and metabolic pathways, and metabolite profile data have been developed. Such databases complement each other and support efficient growth in this area, although the data resources remain scattered across the World Wide Web. Here, we review available metabolome databases and summarize the present status of development of related tools, particularly focusing on the plant metabolome. The data sharing discussed here will pave the way for robust interpretation of metabolomic data and advances in plant systems biology. PMID:23577015

  9. Integrating systems biology models and biomedical ontologies

    PubMed Central

    2011-01-01

    Background Systems biology is an approach to biology that emphasizes the structure and dynamic behavior of biological systems and the interactions that occur within them. To succeed, systems biology crucially depends on the accessibility and integration of data across domains and levels of granularity. Biomedical ontologies were developed to facilitate such an integration of data and are often used to annotate biosimulation models in systems biology. Results We provide a framework to integrate representations of in silico systems biology with those of in vivo biology as described by biomedical ontologies and demonstrate this framework using the Systems Biology Markup Language. We developed the SBML Harvester software that automatically converts annotated SBML models into OWL and we apply our software to those biosimulation models that are contained in the BioModels Database. We utilize the resulting knowledge base for complex biological queries that can bridge levels of granularity, verify models based on the biological phenomenon they represent and provide a means to establish a basic qualitative layer on which to express the semantics of biosimulation models. Conclusions We establish an information flow between biomedical ontologies and biosimulation models and we demonstrate that the integration of annotated biosimulation models and biomedical ontologies enables the verification of models as well as expressive queries. Establishing a bi-directional information flow between systems biology and biomedical ontologies has the potential to enable large-scale analyses of biological systems that span levels of granularity from molecules to organisms. PMID:21835028

  10. The 2014 Nucleic Acids Research Database Issue and an updated NAR online Molecular Biology Database Collection.

    PubMed

    Fernández-Suárez, Xosé M; Rigden, Daniel J; Galperin, Michael Y

    2014-01-01

    The 2014 Nucleic Acids Research Database Issue includes descriptions of 58 new molecular biology databases and recent updates to 123 databases previously featured in NAR or other journals. For convenience, the issue is now divided into eight sections that reflect major subject categories. Among the highlights of this issue are six databases of the transcription factor binding sites in various organisms and updates on such popular databases as CAZy, Database of Genomic Variants (DGV), dbGaP, DrugBank, KEGG, miRBase, Pfam, Reactome, SEED, TCDB and UniProt. There is a strong block of structural databases, which includes, among others, the new RNA Bricks database, updates on PDBe, PDBsum, ArchDB, Gene3D, ModBase, Nucleic Acid Database and the recently revived iPfam database. An update on the NCBI's MMDB describes VAST+, an improved tool for protein structure comparison. Two articles highlight the development of the Structural Classification of Proteins (SCOP) database: one describes SCOPe, which automates assignment of new structures to the existing SCOP hierarchy; the other one describes the first version of SCOP2, with its more flexible approach to classifying protein structures. This issue also includes a collection of articles on bacterial taxonomy and metagenomics, which includes updates on the List of Prokaryotic Names with Standing in Nomenclature (LPSN), Ribosomal Database Project (RDP), the Silva/LTP project and several new metagenomics resources. The NAR online Molecular Biology Database Collection, http://www.oxfordjournals.org/nar/database/c/, has been expanded to 1552 databases. The entire Database Issue is freely available online on the Nucleic Acids Research website (http://nar.oxfordjournals.org/).

  11. PATIKAweb: a Web interface for analyzing biological pathways through advanced querying and visualization.

    PubMed

    Dogrusoz, U; Erson, E Z; Giral, E; Demir, E; Babur, O; Cetintas, A; Colak, R

    2006-02-01

    Patikaweb provides a Web interface for retrieving and analyzing biological pathways in the Patika database, which contains data integrated from various prominent public pathway databases. It features a user-friendly interface, dynamic visualization and automated layout, advanced graph-theoretic queries for extracting biologically important phenomena, local persistence capability and exporting facilities to various pathway exchange formats.
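    The abstract does not say how PATIKAweb implements its graph-theoretic queries, so the following is only a generic sketch of one such query: a breadth-first search that collects every node within a fixed number of hops of a starting node. The function name and the toy pathway graph are hypothetical and are not taken from the Patika database.

```python
from collections import deque


def neighborhood(graph, source, max_hops):
    """Map each node reachable from `source` within `max_hops` edges to its hop distance."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if dist[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for nxt in graph.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


# Hypothetical directed pathway: each key lists the nodes it acts on.
pathway = {
    "ligand": ["receptor"],
    "receptor": ["kinase_1"],
    "kinase_1": ["kinase_2", "phosphatase"],
    "kinase_2": ["transcription_factor"],
}

hits = neighborhood(pathway, "ligand", 2)
print(hits)
```

    Limiting the search by hop count keeps the result small enough to lay out and inspect visually, which is one plausible reason to pair such queries with an automated layout.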

  12. Predicting Protein Relationships to Human Pathways through a Relational Learning Approach Based on Simple Sequence Features.

    PubMed

    García-Jiménez, Beatriz; Pons, Tirso; Sanchis, Araceli; Valencia, Alfonso

    2014-01-01

    Biological pathways are important elements of systems biology and in the past decade, an increasing number of pathway databases have been set up to document the growing understanding of complex cellular processes. Although more genome-sequence data are becoming available, a large fraction of it remains functionally uncharacterized. Thus, it is important to be able to predict the mapping of poorly annotated proteins to original pathway models. We have developed a Relational Learning-based Extension (RLE) system to investigate pathway membership through a function prediction approach that mainly relies on combinations of simple properties attributed to each protein. RLE searches for proteins with molecular similarities to specific pathway components. Using RLE, we associated 383 uncharacterized proteins to 28 pre-defined human Reactome pathways, demonstrating relative confidence after proper evaluation. Indeed, in specific cases manual inspection of the database annotations and the related literature supported the proposed classifications. Examples of possible additional components of the Electron transport system, Telomere maintenance and Integrin cell surface interactions pathways are discussed in detail. All the predicted human proteins for Reactome releases 30 (2009) and 40 (2012) are available at http://rle.bioinfo.cnio.es.

  13. BRISK--research-oriented storage kit for biology-related data.

    PubMed

    Tan, Alan; Tripp, Ben; Daley, Denise

    2011-09-01

    In genetic science, large-scale international research collaborations represent a growing trend. These collaborations have demanding and challenging database, storage, retrieval and communication needs. These studies typically involve demographic and clinical data, in addition to the results from numerous genomic studies (omics studies) such as gene expression, eQTL, genome-wide association and methylation studies. Together these present numerous challenges, hence the need for data integration platforms that can handle these complex data structures. Inefficient methods of data transfer and access control still plague research collaboration. As science becomes more and more collaborative in nature, the need for a system that adequately manages data sharing becomes paramount. Biology-Related Information Storage Kit (BRISK) is a package of several web-based data management tools that provide a cohesive data integration and management platform. It was specifically designed to provide the architecture necessary to promote collaboration and expedite data sharing between scientists. The software, documentation, Java source code and demo are available at http://genapha.icapture.ubc.ca/brisk/index.jsp. BRISK was developed in Java, and tested on an Apache Tomcat 6 server with a MySQL database. Contact: denise.daley@hli.ubc.ca.

  14. The 2018 Nucleic Acids Research database issue and the online molecular biology database collection.

    PubMed

    Rigden, Daniel J; Fernández, Xosé M

    2018-01-04

    The 2018 Nucleic Acids Research Database Issue contains 181 papers spanning molecular biology. Among them, 82 are new and 84 are updates describing resources that appeared in the Issue previously. The remaining 15 cover databases most recently published elsewhere. Databases in the area of nucleic acids include 3DIV for visualisation of data on genome 3D structure and RNArchitecture, a hierarchical classification of RNA families. Protein databases include the established SMART, ELM and MEROPS while GPCRdb and the newcomer STCRDab cover families of biomedical interest. In the area of metabolism, HMDB and Reactome both report new features while PULDB appears in NAR for the first time. This issue also contains reports on genomics resources including Ensembl, the UCSC Genome Browser and ENCODE. Update papers from the IUPHAR/BPS Guide to Pharmacology and DrugBank are highlights of the drug and drug target section while a number of proteomics databases including proteomicsDB are also covered. The entire Database Issue is freely available online on the Nucleic Acids Research website (https://academic.oup.com/nar). The NAR online Molecular Biology Database Collection has been updated, reviewing 138 entries, adding 88 new resources and eliminating 47 discontinued URLs, bringing the current total to 1737 databases. It is available at http://www.oxfordjournals.org/nar/database/c/. © The Author(s) 2018. Published by Oxford University Press on behalf of Nucleic Acids Research.

  15. CellLineNavigator: a workbench for cancer cell line analysis

    PubMed Central

    Krupp, Markus; Itzel, Timo; Maass, Thorsten; Hildebrandt, Andreas; Galle, Peter R.; Teufel, Andreas

    2013-01-01

    The CellLineNavigator database, freely available at http://www.medicalgenomics.org/celllinenavigator, is a web-based workbench for large scale comparisons of a large collection of diverse cell lines. It aims to support experimental design in the fields of genomics, systems biology and translational biomedical research. Currently, this compendium holds genome wide expression profiles of 317 different cancer cell lines, categorized into 57 different pathological states and 28 individual tissues. To enlarge the scope of CellLineNavigator, the database was furthermore closely linked to commonly used bioinformatics databases and knowledge repositories. To ensure easy data access and search ability, a simple data and an intuitive querying interface were implemented. It allows the user to explore and filter gene expression, focusing on pathological or physiological conditions. For a more complex search, the advanced query interface may be used to query for (i) differentially expressed genes; (ii) pathological or physiological conditions; or (iii) gene names or functional attributes, such as Kyoto Encyclopaedia of Genes and Genomes pathway maps. These queries may also be combined. Finally, CellLineNavigator allows additional advanced analysis of differentially regulated genes by a direct link to the Database for Annotation, Visualization and Integrated Discovery (DAVID) Bioinformatics Resources. PMID:23118487

  16. Bibliographical database of radiation biological dosimetry and risk assessment: Part 1, through June 1988

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Straume, T.; Ricker, Y.; Thut, M.

    1988-08-29

    This database was constructed to support research in radiation biological dosimetry and risk assessment. Relevant publications were identified through detailed searches of national and international electronic databases and through our personal knowledge of the subject. Publications were numbered and key worded, and referenced in an electronic data-retrieval system that permits quick access through computerized searches on publication number, authors, key words, title, year, and journal name. Photocopies of all publications contained in the database are maintained in a file that is numerically arranged by citation number. This report of the database is provided as a useful reference and overview. It should be emphasized that the database will grow as new citations are added to it. With that in mind, we arranged this report in order of ascending citation number so that follow-up reports will simply extend this document. The database cites 1212 publications. Publications are from 119 different scientific journals; 27 of these journals are cited at least 5 times. It also contains references to 42 books and published symposia, and 129 reports. Information relevant to radiation biological dosimetry and risk assessment is widely distributed among the scientific literature, although a few journals clearly dominate. The four journals publishing the largest number of relevant papers are Health Physics, Mutation Research, Radiation Research, and International Journal of Radiation Biology. Publications in Health Physics make up almost 10% of the current database.

  17. Computing health quality measures using Informatics for Integrating Biology and the Bedside.

    PubMed

    Klann, Jeffrey G; Murphy, Shawn N

    2013-04-19

    The Health Quality Measures Format (HQMF) is a Health Level 7 (HL7) standard for expressing computable Clinical Quality Measures (CQMs). Creating tools to process HQMF queries in clinical databases will become increasingly important as the United States moves forward with its Health Information Technology Strategic Plan to Stages 2 and 3 of the Meaningful Use incentive program (MU2 and MU3). Informatics for Integrating Biology and the Bedside (i2b2) is one of the analytical databases used as part of the Office of the National Coordinator (ONC)'s Query Health platform to move toward this goal. Our goal is to integrate i2b2 with the Query Health HQMF architecture, to prepare for other HQMF use-cases (such as MU2 and MU3), and to articulate the functional overlap between i2b2 and HQMF. Therefore, we analyze the structure of HQMF, and then we apply this understanding to HQMF computation on the i2b2 clinical analytical database platform. Specifically, we develop a translator between two query languages, HQMF and i2b2, so that the i2b2 platform can compute HQMF queries. We use the HQMF structure of queries for aggregate reporting, which define clinical data elements and the temporal and logical relationships between them. We use the i2b2 XML format, which allows flexible querying of a complex clinical data repository in an easy-to-understand domain-specific language. The translator can represent nearly any i2b2-XML query as HQMF and execute in i2b2 nearly any HQMF query expressible in i2b2-XML. This translator is part of the freely available reference implementation of the QueryHealth initiative. We analyze limitations of the conversion and find it covers many, but not all, of the complex temporal and logical operators required by quality measures. HQMF is an expressive language for defining quality measures, and it will be important to understand and implement for CQM computation, in both meaningful use and population health. 
    However, its current form might allow complexity that is intractable for current database systems (both in terms of implementation and computation). Our translator, which supports the subset of HQMF currently expressible in i2b2-XML, may represent the beginnings of a practical compromise. It is being pilot-tested in two Query Health demonstration projects, and it can be further expanded to balance computational tractability with the advanced features needed by measure developers.

  18. Computing Health Quality Measures Using Informatics for Integrating Biology and the Bedside

    PubMed Central

    Murphy, Shawn N

    2013-01-01

    Background The Health Quality Measures Format (HQMF) is a Health Level 7 (HL7) standard for expressing computable Clinical Quality Measures (CQMs). Creating tools to process HQMF queries in clinical databases will become increasingly important as the United States moves forward with its Health Information Technology Strategic Plan to Stages 2 and 3 of the Meaningful Use incentive program (MU2 and MU3). Informatics for Integrating Biology and the Bedside (i2b2) is one of the analytical databases used as part of the Office of the National Coordinator (ONC)’s Query Health platform to move toward this goal. Objective Our goal is to integrate i2b2 with the Query Health HQMF architecture, to prepare for other HQMF use-cases (such as MU2 and MU3), and to articulate the functional overlap between i2b2 and HQMF. Therefore, we analyze the structure of HQMF, and then we apply this understanding to HQMF computation on the i2b2 clinical analytical database platform. Specifically, we develop a translator between two query languages, HQMF and i2b2, so that the i2b2 platform can compute HQMF queries. Methods We use the HQMF structure of queries for aggregate reporting, which define clinical data elements and the temporal and logical relationships between them. We use the i2b2 XML format, which allows flexible querying of a complex clinical data repository in an easy-to-understand domain-specific language. Results The translator can represent nearly any i2b2-XML query as HQMF and execute in i2b2 nearly any HQMF query expressible in i2b2-XML. This translator is part of the freely available reference implementation of the QueryHealth initiative. We analyze limitations of the conversion and find it covers many, but not all, of the complex temporal and logical operators required by quality measures. 
    Conclusions HQMF is an expressive language for defining quality measures, and it will be important to understand and implement for CQM computation, in both meaningful use and population health. However, its current form might allow complexity that is intractable for current database systems (both in terms of implementation and computation). Our translator, which supports the subset of HQMF currently expressible in i2b2-XML, may represent the beginnings of a practical compromise. It is being pilot-tested in two Query Health demonstration projects, and it can be further expanded to balance computational tractability with the advanced features needed by measure developers. PMID:23603227

  19. Advanced techniques in placental biology -- workshop report.

    PubMed

    Nelson, D M; Sadovsky, Y; Robinson, J M; Croy, B A; Rice, G; Kniss, D A

    2006-04-01

    Major advances in placental biology have been realized as new technologies have been developed and existing methods have been refined in many areas of biological research. Classical anatomy and whole-organ physiology tools once used to analyze placental structure and function have been supplanted by more sophisticated techniques adapted from molecular biology, proteomics, and computational biology and bioinformatics. In addition, significant refinements in morphological study of the placenta and its constituent cell types have improved our ability to assess form and function in a highly integrated manner. To offer an overview of modern technologies used by investigators to study the placenta, this workshop, Advanced techniques in placental biology, assembled experts who discussed fundamental principles and real-time examples of four separate methodologies. Y. Sadovsky presented the principles of microRNA function as an endogenous mechanism of gene regulation. J. Robinson demonstrated the utility of correlative microscopy, in which light-level and transmission electron microscopy are combined to provide cellular and subcellular views of placental cells. A. Croy provided a lecture on the use of microdissection techniques, which are invaluable for isolating very small subsets of cell types for molecular analysis. Finally, G. Rice presented an overview of methods for profiling complex protein mixtures within tissue and/or fluid samples that, when refined, will offer databases that will underpin a systems approach to modern trophoblast biology.

  20. Synthetic Peptide Arrays for Pathway-Level Protein Monitoring by Liquid Chromatography-Tandem Mass Spectrometry

    PubMed Central

    Hewel, Johannes A.; Liu, Jian; Onishi, Kento; Fong, Vincent; Chandran, Shamanta; Olsen, Jonathan B.; Pogoutse, Oxana; Schutkowski, Mike; Wenschuh, Holger; Winkler, Dirk F. H.; Eckler, Larry; Zandstra, Peter W.; Emili, Andrew

    2010-01-01

    Effective methods to detect and quantify functionally linked regulatory proteins in complex biological samples are essential for investigating mammalian signaling pathways. Traditional immunoassays depend on proprietary reagents that are difficult to generate and multiplex, whereas global proteomic profiling can be tedious and can miss low abundance proteins. Here, we report a target-driven liquid chromatography-tandem mass spectrometry (LC-MS/MS) strategy for selectively examining the levels of multiple low abundance components of signaling pathways which are refractory to standard shotgun screening procedures and hence appear limited in current MS/MS repositories. Our stepwise approach consists of: (i) synthesizing microscale peptide arrays, including heavy isotope-labeled internal standards, for use as high quality references to (ii) build empirically validated high density LC-MS/MS detection assays with a retention time scheduling system that can be used to (iii) identify and quantify endogenous low abundance protein targets in complex biological mixtures with high accuracy by correlation to a spectral database using new software tools. The method offers a flexible, rapid, and cost-effective means for routine proteomic exploration of biological systems including “label-free” quantification, while minimizing spurious interferences. As proof-of-concept, we have examined the abundance of transcription factors and protein kinases mediating pluripotency and self-renewal in embryonic stem cell populations. PMID:20467045
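    The abstract mentions identifying targets "by correlation to a spectral database" without giving the scoring function, so the sketch below swaps in a common generic choice: cosine similarity between binned spectra. The function names, the 1.0 m/z bin width, the peptide labels and all peak values are assumptions for illustration only and do not reproduce the authors' software.

```python
import math


def bin_spectrum(peaks, bin_width=1.0):
    """Sum intensities into fixed-width m/z bins; `peaks` is a list of (mz, intensity)."""
    binned = {}
    for mz, intensity in peaks:
        key = int(mz // bin_width)
        binned[key] = binned.get(key, 0.0) + intensity
    return binned


def cosine_similarity(a, b):
    """Cosine of the angle between two binned spectra (0.0 if either is empty)."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_match(query_peaks, library):
    """Return (name, score) of the library spectrum most similar to the query."""
    query = bin_spectrum(query_peaks)
    scores = {
        name: cosine_similarity(query, bin_spectrum(reference))
        for name, reference in library.items()
    }
    return max(scores.items(), key=lambda item: item[1])


# Hypothetical reference spectra, e.g. acquired from synthetic peptide standards.
library = {
    "peptide_A": [(175.1, 10.0), (262.2, 40.0), (375.3, 25.0)],
    "peptide_B": [(147.1, 30.0), (276.2, 15.0), (405.2, 50.0)],
}

# Hypothetical observed spectrum: same fragments as peptide_A at twice the intensity.
query = [(175.12, 20.0), (262.18, 80.0), (375.25, 50.0)]

name, score = best_match(query, library)
print(name, round(score, 3))
```

    Cosine similarity ignores overall intensity scale, so a low-abundance target can still match its reference as long as the relative fragment pattern is preserved.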

  1. Mining biological databases for candidate disease genes

    NASA Astrophysics Data System (ADS)

    Braun, Terry A.; Scheetz, Todd; Webster, Gregg L.; Casavant, Thomas L.

    2001-07-01

    The publicly funded effort to determine the complete nucleotide sequence of the human genome, the Human Genome Project (HGP), has so far produced more than 93% of the 3 billion nucleotides of the human genome in a preliminary 'draft' format. In addition, several valuable sources of information have been developed as direct and indirect results of the HGP. These include the sequencing of model organisms (rat, mouse, fly, and others), gene discovery projects (ESTs and full-length), and new technologies such as expression analysis and resources (micro-arrays or gene chips). These resources are invaluable for the researchers identifying the functional genes of the genome that transcribe and translate into the transcriptome and proteome, both of which potentially contain orders of magnitude more complexity than the genome itself. Preliminary analyses of this data identified approximately 30,000 to 40,000 human 'genes'. However, the bulk of the effort still remains: to identify the functional and structural elements contained within the transcriptome and proteome, and to associate function in the transcriptome and proteome to genes. A fortuitous consequence of the HGP is the existence of hundreds of databases containing biological information that may contain relevant data pertaining to the identification of disease-causing genes. The task of mining these databases for information on candidate genes is a commercial application of enormous potential. We are developing a system to acquire and mine data from specific databases to aid our efforts to identify disease genes. A high-speed cluster of Linux workstations is used to analyze sequences and perform distributed sequence alignments as part of our data mining and processing. This system has been used to mine GeneMap99 sequences within specific genomic intervals to identify potential candidate disease genes associated with Bardet-Biedl Syndrome (BBS).

  2. Top-down approach to biological therapy of Crohn's disease.

    PubMed

    Hirschmann, Simon; Neurath, Markus F

    2017-03-01

    Crohn's disease (CD) is a chronic, immune-mediated condition with a potentially disabling and destructive course. Despite growing data on when to use a therapeutic 'top-down' strategy, clinical management of this complex disorder is still challenging. Currently, the discussion of 'top-down' strategy in CD mostly includes biological therapy alone or in combination. Areas covered: This article is based on a review of existing literature regarding the use of biological therapy in a 'top-down' approach for the treatment of Crohn's disease. The authors reviewed all the major databases including MEDLINE as well as DDW and ECCO abstracts, respectively. Expert opinion: A 'top-down' therapeutic approach in Crohn's disease is strongly supported by existing data in patients with several risk factors for a severe course of disease. Moreover, there is an increasing amount of published data recommending a more individualised therapeutic strategy to identify candidates for 'top-down' treatment, based on enhanced diagnostics using biomarkers. Emerging therapeutic approaches besides existing therapy concepts using biologicals may possibly redefine the 'top-down' therapeutic strategy for Crohn's disease in the future.

  3. Systematic Analysis of Arabidopsis Organelles and a Protein Localization Database for Facilitating Fluorescent Tagging of Full-Length Arabidopsis Proteins

    PubMed Central

    Li, Shijun; Ehrhardt, David W.; Rhee, Seung Y.

    2006-01-01

    Cells are organized into a complex network of subcellular compartments that are specialized for various biological functions. Subcellular location is an important attribute of protein function. To facilitate systematic elucidation of protein subcellular location, we analyzed experimentally verified protein localization data of 1,300 Arabidopsis (Arabidopsis thaliana) proteins. The 1,300 experimentally verified proteins are distributed among 40 different compartments, with most of the proteins localized to four compartments: mitochondria (36%), nucleus (28%), plastid (17%), and cytosol (13.3%). About 19% of the proteins are found in multiple compartments, in which a high proportion (36.4%) is localized to both cytosol and nucleus. Characterization of the overrepresented Gene Ontology molecular functions and biological processes suggests that the Golgi apparatus and peroxisome may play more diverse functions but are involved in more specialized processes than other compartments. To support systematic empirical determination of protein subcellular localization using a technology called fluorescent tagging of full-length proteins, we developed a database and Web application to provide preselected green fluorescent protein insertion position and primer sequences for all Arabidopsis proteins to study their subcellular localization and to store experimentally verified protein localization images, videos, and their annotations of proteins generated using the fluorescent tagging of full-length proteins technology. The database can be searched, browsed, and downloaded using a Web browser at http://aztec.stanford.edu/gfp/. The software can also be downloaded from the same Web site for local installation. PMID:16617091

  4. DWARF – a data warehouse system for analyzing protein families

    PubMed Central

    Fischer, Markus; Thai, Quan K; Grieb, Melanie; Pleiss, Jürgen

    2006-01-01

    Background The emerging field of integrative bioinformatics provides the tools to organize and systematically analyze vast amounts of highly diverse biological data and thus enables a novel understanding of complex biological systems. The data warehouse DWARF applies integrative bioinformatics approaches to the analysis of large protein families. Description The data warehouse system DWARF integrates data on sequence, structure, and functional annotation for protein fold families. The underlying relational data model consists of three major sections representing entities related to the protein (biochemical function, source organism, classification to homologous families and superfamilies), the protein sequence (position-specific annotation, mutant information), and the protein structure (secondary structure information, superimposed tertiary structure). Tools for extracting, transforming and loading data from publicly available resources (ExPDB, GenBank, DSSP) are provided to populate the database. The data can be accessed by an interface for searching and browsing, and by analysis tools that operate on annotation, sequence, or structure. We applied DWARF to the family of α/β-hydrolases to host the Lipase Engineering database. Release 2.3 contains 6138 sequences and 167 experimentally determined protein structures, which are assigned to 37 superfamilies and 103 homologous families. Conclusion DWARF has been designed for constructing databases of large structurally related protein families and for evaluating their sequence-structure-function relationships by a systematic analysis of sequence, structure and functional annotation. It has been applied to predict biochemical properties from sequence, and serves as a valuable tool for protein engineering. PMID:17094801

  5. Metabolite signal identification in accurate mass metabolomics data with MZedDB, an interactive m/z annotation tool utilising predicted ionisation behaviour 'rules'

    PubMed Central

    Draper, John; Enot, David P; Parker, David; Beckmann, Manfred; Snowdon, Stuart; Lin, Wanchang; Zubair, Hassan

    2009-01-01

    Background Metabolomics experiments using Mass Spectrometry (MS) technology measure the mass to charge ratio (m/z) and intensity of ionised molecules in crude extracts of complex biological samples to generate high dimensional metabolite 'fingerprint' or metabolite 'profile' data. High resolution MS instruments perform routinely with a mass accuracy of < 5 ppm (parts per million) thus providing potentially a direct method for signal putative annotation using databases containing metabolite mass information. Most database interfaces support only simple queries with the default assumption that molecules either gain or lose a single proton when ionised. In reality the annotation process is confounded by the fact that many ionisation products will be not only molecular isotopes but also salt/solvent adducts and neutral loss fragments of original metabolites. This report describes an annotation strategy that will allow searching based on all potential ionisation products predicted to form during electrospray ionisation (ESI). Results Metabolite 'structures' harvested from publicly accessible databases were converted into a common format to generate a comprehensive archive in MZedDB. 'Rules' were derived from chemical information that allowed MZedDB to generate a list of adducts and neutral loss fragments putatively able to form for each structure and calculate, on the fly, the exact molecular weight of every potential ionisation product to provide targets for annotation searches based on accurate mass. We demonstrate that data matrices representing populations of ionisation products generated from different biological matrices contain a large proportion (sometimes > 50%) of molecular isotopes, salt adducts and neutral loss fragments. Correlation analysis of ESI-MS data features confirmed the predicted relationships of m/z signals. An integrated isotope enumerator in MZedDB allowed verification of exact isotopic pattern distributions to corroborate experimental data. 
Conclusion We conclude that although ultra-high accurate mass instruments provide major insight into the chemical diversity of biological extracts, the facile annotation of a large proportion of signals is not possible by simple, automated query of current databases using computed molecular formulae. Parameterising MZedDB to take into account predicted ionisation behaviour and the biological source of any sample improves greatly both the frequency and accuracy of potential annotation 'hits' in ESI-MS data. PMID:19622150
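
    The accurate-mass annotation idea in the record above can be illustrated with a short sketch. It is not MZedDB's code: the adduct table, the function names, and the 5 ppm default are assumptions chosen for illustration, and the mass shifts are standard monoisotopic values for singly charged ions. It shows why a query restricted to [M+H]+ can miss a signal that a salt adduct explains.

```python
# Illustrative sketch only; not the MZedDB implementation.
# Mass shifts (Da) for singly charged ions, relative to the neutral
# monoisotopic mass M. The set of adducts here is a hypothetical "rule" list.
PROTON = 1.007276
ADDUCT_SHIFTS = {
    "[M+H]+": PROTON,
    "[M+Na]+": 22.989218,
    "[M+K]+": 38.963158,
    "[M+H-H2O]+": PROTON - 18.010565,  # protonation plus neutral loss of water
    "[M-H]-": -PROTON,
}


def adduct_mz(monoisotopic_mass):
    """Return the predicted m/z of each ionisation product for one compound."""
    return {name: monoisotopic_mass + shift for name, shift in ADDUCT_SHIFTS.items()}


def ppm_error(observed, theoretical):
    """Signed mass error in parts per million."""
    return (observed - theoretical) / theoretical * 1e6


def annotate(observed_mz, candidates, tol_ppm=5.0):
    """List (compound, adduct, ppm error) for every prediction within tolerance."""
    hits = []
    for compound, mass in candidates.items():
        for adduct, mz in adduct_mz(mass).items():
            err = ppm_error(observed_mz, mz)
            if abs(err) <= tol_ppm:
                hits.append((compound, adduct, round(err, 2)))
    return hits


# Toy candidate list: glucose, C6H12O6, monoisotopic mass 180.06339 Da.
candidates = {"glucose": 180.06339}
hits = annotate(203.0526, candidates)
```

    In this toy case the observed signal matches only the sodium adduct of glucose; a search that assumed a single gained proton would return nothing for the same m/z.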

  6. A Review of the Composition of the Essential Oils and Biological Activities of Angelica Species.

    PubMed

    Sowndhararajan, Kandasamy; Deepa, Ponnuvel; Kim, Minju; Park, Se Jin; Kim, Songmun

    2017-09-20

    A number of Angelica species have been used in traditional systems of medicine to treat many ailments. Especially, essential oils (EOs) from the Angelica species have been used for the treatment of various health problems, including malaria, gynecological diseases, fever, anemia, and arthritis. EOs are complex mixtures of low molecular weight compounds, especially terpenoids and their oxygenated compounds. These components deliver specific fragrance and biological properties to essential oils. In this review, we summarized the chemical composition and biological activities of EOs from different species of Angelica . For this purpose, a literature search was carried out to obtain information about the EOs of Angelica species and their bioactivities from electronic databases such as PubMed, Science Direct, Wiley, Springer, ACS, Google, and other journal publications. There has been a lot of variation in the EO composition among different Angelica species. EOs from Angelica species were reported for different kinds of biological activities, such as antioxidant, anti-inflammatory, antimicrobial, immunotoxic, and insecticidal activities. The present review is an attempt to consolidate the available data for different Angelica species on the basis of major constituents in the EOs and their biological activities.

  7. ESEA: Discovering the Dysregulated Pathways based on Edge Set Enrichment Analysis

    PubMed Central

    Han, Junwei; Shi, Xinrui; Zhang, Yunpeng; Xu, Yanjun; Jiang, Ying; Zhang, Chunlong; Feng, Li; Yang, Haixiu; Shang, Desi; Sun, Zeguo; Su, Fei; Li, Chunquan; Li, Xia

    2015-01-01

    Pathway analyses are playing an increasingly important role in understanding biological mechanism, cellular function and disease states. Current pathway-identification methods generally focus on only the changes of gene expression levels; however, the biological relationships among genes are also the fundamental components of pathways, and the dysregulated relationships may also alter the pathway activities. We propose a powerful computational method, Edge Set Enrichment Analysis (ESEA), for the identification of dysregulated pathways. This provides a novel way of pathway analysis by investigating the changes of biological relationships of pathways in the context of gene expression data. Simulation studies illustrate the power and performance of ESEA under various simulated conditions. Using real datasets from p53 mutation, Type 2 diabetes and lung cancer, we validate effectiveness of ESEA in identifying dysregulated pathways. We further compare our results with five other pathway enrichment analysis methods. With these analyses, we show that ESEA is able to help uncover dysregulated biological pathways underlying complex traits and human diseases via specific use of the dysregulated biological relationships. We develop a freely available R-based tool of ESEA. Currently, ESEA can support pathway analysis of the seven public databases (KEGG; Reactome; Biocarta; NCI; SPIKE; HumanCyc; Panther). PMID:26267116

  8. BiNA: A Visual Analytics Tool for Biological Network Data

    PubMed Central

    Gerasch, Andreas; Faber, Daniel; Küntzer, Jan; Niermann, Peter; Kohlbacher, Oliver; Lenhof, Hans-Peter; Kaufmann, Michael

    2014-01-01

    Interactive visual analysis of biological high-throughput data in the context of the underlying networks is an essential task in modern biomedicine with applications ranging from metabolic engineering to personalized medicine. The complexity and heterogeneity of data sets require flexible software architectures for data analysis. Concise and easily readable graphical representation of data and interactive navigation of large data sets are essential in this context. We present BiNA - the Biological Network Analyzer - a flexible open-source software for analyzing and visualizing biological networks. Highly configurable visualization styles for regulatory and metabolic network data offer sophisticated drawings and intuitive navigation and exploration techniques using hierarchical graph concepts. The generic projection and analysis framework provides powerful functionalities for visual analyses of high-throughput omics data in the context of networks, in particular for the differential analysis and the analysis of time series data. A direct interface to an underlying data warehouse provides fast access to a wide range of semantically integrated biological network databases. A plugin system allows simple customization and integration of new analysis algorithms or visual representations. BiNA is available under the 3-clause BSD license at http://bina.unipax.info/. PMID:24551056

  9. Subject and authorship of records related to the Organization for Tropical Studies (OTS) in BINABITROP, a comprehensive database about Costa Rican biology.

    PubMed

    Monge-Nájera, Julián; Nielsen-Muñoz, Vanessa; Azofeifa-Mora, Ana Beatriz

    2013-06-01

    BINABITROP is a bibliographical database of more than 38000 records about the ecosystems and organisms of Costa Rica. In contrast with commercial databases, such as Web of Knowledge and Scopus, which exclude most of the scientific journals published in tropical countries, BINABITROP is a comprehensive record of knowledge on the tropical ecosystems and organisms of Costa Rica. We analyzed its contents in three sites (La Selva, Palo Verde and Las Cruces) and recorded scientific field, taxonomic group and authorship. We found that most records dealt with ecology and systematics, and that most authors published only one article in the study period (1963-2011). Most research was published in four journals: Biotropica, Revista de Biología Tropical/International Journal of Tropical Biology and Conservation, Zootaxa and Brenesia. This may be the first study of such a comprehensive database for any case of tropical biology literature.

  10. Elucidation of the binding mechanism of renin using a wide array of computational techniques and biological assays.

    PubMed

    Tzoupis, Haralambos; Leonis, Georgios; Avramopoulos, Aggelos; Reis, Heribert; Czyżnikowska, Żaneta; Zerva, Sofia; Vergadou, Niki; Peristeras, Loukas D; Papavasileiou, Konstantinos D; Alexis, Michael N; Mavromoustakos, Thomas; Papadopoulos, Manthos G

    2015-11-01

    We investigate the binding mechanism in renin complexes, involving three drugs (remikiren, zankiren and enalkiren) and one lead compound, which was selected after screening the ZINC database. For this purpose, we used ab initio methods (the effective fragment potential, the variational perturbation theory, the energy decomposition analysis, the atoms-in-molecules), docking, molecular dynamics, and the MM-PBSA method. A biological assay for the lead compound has been performed to validate the theoretical findings. Importantly, binding free energy calculations for the three drug complexes are within 3 kcal/mol of the experimental values, thus further justifying our computational protocol, which has been validated through previous studies on 11 drug-protein systems. The main elements of the discovered mechanism are: (i) minor changes are induced to renin upon drug binding, (ii) the three drugs form an extensive network of hydrogen bonds with renin, whilst the lead compound presented diminished interactions, (iii) ligand binding in all complexes is driven by favorable van der Waals interactions and the nonpolar contribution to solvation, while the lead compound is associated with diminished van der Waals interactions compared to the drug-bound forms of renin, and (iv) the environment (H2O/Na(+)) has a small effect on the renin-remikiren interaction. Copyright © 2015 Elsevier Inc. All rights reserved.

  11. The impact of network medicine in gastroenterology and hepatology.

    PubMed

    Baffy, György

    2013-10-01

    In the footsteps of groundbreaking achievements made by biomedical research, another scientific revolution is unfolding. Systems biology draws from the chaos and complexity theory and applies computational models to predict emerging behavior of the interactions between genes, gene products, and environmental factors. Adaptation of systems biology to translational and clinical sciences has been termed network medicine, and is likely to change the way we think about preventing, predicting, diagnosing, and treating complex human diseases. Network medicine finds gene-disease associations by analyzing the unparalleled digital information discovered and created by high-throughput technologies (dubbed as "omics" science) and links genetic variance to clinical disease phenotypes through intermediate organizational levels of life such as the epigenome, transcriptome, proteome, and metabolome. Supported by large reference databases, unprecedented data storage capacity, and innovative computational analysis, network medicine is poised to find links between conditions that were thought to be distinct, uncover shared disease mechanisms and key drivers of the pathogenesis, predict individual disease outcomes and trajectories, identify novel therapeutic applications, and help avoid off-target and undesirable drug effects. Recent advances indicate that these perspectives are increasingly within our reach for understanding and managing complex diseases of the digestive system. Copyright © 2013 AGA Institute. Published by Elsevier Inc. All rights reserved.

  12. High-throughput Crystallography for Structural Genomics

    PubMed Central

    Joachimiak, Andrzej

    2009-01-01

    Protein X-ray crystallography recently celebrated its 50th anniversary. The structures of myoglobin and hemoglobin determined by Kendrew and Perutz provided the first glimpses into the complex protein architecture and chemistry. Since then, the field of structural molecular biology has experienced extraordinary progress and now over 53,000 protein structures have been deposited into the Protein Data Bank. In the past decade many advances in macromolecular crystallography have been driven by world-wide structural genomics efforts. This was made possible because of third-generation synchrotron sources, structure phasing approaches using anomalous signal and cryo-crystallography. Complementary progress in molecular biology, proteomics, hardware and software for crystallographic data collection, structure determination and refinement, computer science, databases, robotics and automation improved and accelerated many processes. These advancements provide the robust foundation for structural molecular biology and assure strong contribution to science in the future. In this report we focus mainly on reviewing structural genomics high-throughput X-ray crystallography technologies and their impact. PMID:19765976

  13. Ultra-Structure database design methodology for managing systems biology data and analyses

    PubMed Central

    Maier, Christopher W; Long, Jeffrey G; Hemminger, Bradley M; Giddings, Morgan C

    2009-01-01

    Background Modern, high-throughput biological experiments generate copious, heterogeneous, interconnected data sets. Research is dynamic, with frequently changing protocols, techniques, instruments, and file formats. Because of these factors, systems designed to manage and integrate modern biological data sets often end up as large, unwieldy databases that become difficult to maintain or evolve. The novel rule-based approach of the Ultra-Structure design methodology presents a potential solution to this problem. By representing both data and processes as formal rules within a database, an Ultra-Structure system constitutes a flexible framework that enables users to explicitly store domain knowledge in both a machine- and human-readable form. End users themselves can change the system's capabilities without programmer intervention, simply by altering database contents; no computer code or schemas need be modified. This provides flexibility in adapting to change, and allows integration of disparate, heterogenous data sets within a small core set of database tables, facilitating joint analysis and visualization without becoming unwieldy. Here, we examine the application of Ultra-Structure to our ongoing research program for the integration of large proteomic and genomic data sets (proteogenomic mapping). Results We transitioned our proteogenomic mapping information system from a traditional entity-relationship design to one based on Ultra-Structure. Our system integrates tandem mass spectrum data, genomic annotation sets, and spectrum/peptide mappings, all within a small, general framework implemented within a standard relational database system. General software procedures driven by user-modifiable rules can perform tasks such as logical deduction and location-based computations. The system is not tied specifically to proteogenomic research, but is rather designed to accommodate virtually any kind of biological research. 
Conclusion We find Ultra-Structure offers substantial benefits for biological information systems, the largest being the integration of diverse information sources into a common framework. This facilitates systems biology research by integrating data from disparate high-throughput techniques. It also enables us to readily incorporate new data types, sources, and domain knowledge with no change to the database structure or associated computer code. Ultra-Structure may be a significant step towards solving the hard problem of data management and integration in the systems biology era. PMID:19691849
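
    The central idea of the record above, that rules live in database tables and generic procedures interpret them, can be sketched with Python's standard sqlite3 module. This is a minimal illustration under stated assumptions, not the authors' system: the table names (fact, rule), the two-step chaining form of a rule, and the example identifiers are all hypothetical.

```python
import sqlite3

# Illustrative sketch only; the schema and rule form are assumptions,
# not the Ultra-Structure tables described by the authors.


def build_db():
    """Create an in-memory database with one table of facts and one of rules."""
    con = sqlite3.connect(":memory:")
    con.executescript(
        """
        CREATE TABLE fact (subject TEXT, relation TEXT, object TEXT,
                           UNIQUE (subject, relation, object));
        CREATE TABLE rule (first_rel TEXT, second_rel TEXT, derived_rel TEXT);
        """
    )
    return con


def fact_count(con):
    return con.execute("SELECT COUNT(*) FROM fact").fetchone()[0]


def apply_rules(con):
    """Generic deduction: if s -first_rel-> x and x -second_rel-> o,
    then add s -derived_rel-> o. Repeat until nothing new is derived.
    Returns the number of newly derived facts."""
    start = fact_count(con)
    while True:
        before = fact_count(con)
        con.execute(
            """
            INSERT OR IGNORE INTO fact (subject, relation, object)
            SELECT a.subject, r.derived_rel, b.object
            FROM rule r
            JOIN fact a ON a.relation = r.first_rel
            JOIN fact b ON b.relation = r.second_rel AND b.subject = a.object
            """
        )
        if fact_count(con) == before:
            return before - start


con = build_db()
con.executemany(
    "INSERT INTO fact VALUES (?, ?, ?)",
    [
        ("spectrum_17", "matches", "peptide_A"),  # toy spectrum/peptide mapping
        ("peptide_A", "maps_to", "gene_X"),       # toy genomic annotation
        ("gene_X", "located_on", "chr7"),
    ],
)
con.execute("INSERT INTO rule VALUES ('matches', 'maps_to', 'supports')")
first_pass = apply_rules(con)

# Extending the system's behaviour is a data change, not a code or schema change.
con.execute("INSERT INTO rule VALUES ('supports', 'located_on', 'has_evidence_on')")
second_pass = apply_rules(con)
```

    The second rule is added as an ordinary row, and the same apply_rules procedure derives the new relationship without any change to the code or the table definitions.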

  14. A natural language interface plug-in for cooperative query answering in biological databases.

    PubMed

    Jamil, Hasan M

    2012-06-11

    One of the many unique features of biological databases is that the mere existence of a ground data item is not always a precondition for a query response. It may be argued that from a biologist's standpoint, queries are not always best posed using a structured language. By this we mean that approximate and flexible responses to natural language like queries are well suited for this domain. This is partly due to biologists' tendency to seek simpler interfaces and partly due to the fact that questions in biology involve high level concepts that are open to interpretations computed using sophisticated tools. In such highly interpretive environments, rigidly structured databases do not always perform well. In this paper, our goal is to propose a semantic correspondence plug-in to aid natural language query processing over arbitrary biological database schema with an aim to providing cooperative responses to queries tailored to users' interpretations. Natural language interfaces for databases are generally effective when they are tuned to the underlying database schema and its semantics. Therefore, changes in database schema become impossible to support, or a substantial reorganization cost must be absorbed to reflect any change. We leverage developments in natural language parsing, rule languages and ontologies, and data integration technologies to assemble a prototype query processor that is able to transform a natural language query into a semantically equivalent structured query over the database. We allow knowledge rules and their frequent modifications as part of the underlying database schema. The approach we adopt in our plug-in overcomes some of the serious limitations of many contemporary natural language interfaces, including support for schema modifications and independence from underlying database schema. 
The plug-in introduced in this paper is generic and facilitates connecting user selected natural language interfaces to arbitrary databases using a semantic description of the intended application. We demonstrate the feasibility of our approach with a practical example.

  15. Evaluation of the Biolog MicroStation system for yeast identification

    NASA Technical Reports Server (NTRS)

    McGinnis, M. R.; Molina, T. C.; Pierson, D. L.; Mishra, S. K.

    1996-01-01

    One hundred and fifty-nine isolates representing 16 genera and 53 species of yeasts were processed with the Biolog MicroStation System for yeast identification. Thirteen genera and 38 species were included in the Biolog database. For these 129 isolates, correct identifications to the species level were 13.2, 39.5 and 48.8% after 24, 48 and 72 hours incubation at 30 degrees C, respectively. Three genera and 15 species which were not included in the Biolog database were also tested. Of the 30 isolates studied, 16.7, 53.3 and 56.7% of the isolates were given incorrect names from the system's database after 24, 48 and 72 h incubation at 30 degrees C, respectively. The remaining isolates of this group were not identified.

  16. Identifying new persistent and bioaccumulative organics among chemicals in commerce.

    PubMed

    Howard, Philip H; Muir, Derek C G

    2010-04-01

    The goal of this study was to identify commercial chemicals that might be persistent and bioaccumulative (P&B) and that were not being considered in current Great Lakes, North American, and Arctic contaminant measurement programs. We combined the Canadian Domestic Substance List (DSL), a list of 3059 substances of "unknown or variable composition complex reaction products and biological materials" (UVCBs), and the U.S. Environmental Protection Agency (U.S. EPA) Toxic Substances Control Act (TSCA) Inventory Update Rule (IUR) database for years 1986, 1990, 1994, 1998, 2002, and 2006, yielding a database of 22263 commercial chemicals. From that list, 610 chemicals were identified by estimates from U.S. EPA EPISuite software and using expert judgment. This study has yielded some interesting and probable P&B chemicals that should be considered for further study. Recent studies, following up our initial reports and presentations on this work, have confirmed the presence of many of these chemicals in the environment.

  17. Integrated Approaches to Drug Discovery for Oxidative Stress-Related Retinal Diseases.

    PubMed

    Nishimura, Yuhei; Hara, Hideaki

    2016-01-01

    Excessive oxidative stress induces dysregulation of functional networks in the retina, resulting in retinal diseases such as glaucoma, age-related macular degeneration, and diabetic retinopathy. Although various therapies have been developed to reduce oxidative stress in retinal diseases, most have failed to show efficacy in clinical trials. This may be due to oversimplification of target selection for such a complex network as oxidative stress. Recent advances in high-throughput technologies have facilitated the collection of multilevel omics data, which has driven growth in public databases and in the development of bioinformatics tools. Integration of the knowledge gained from omics databases can be used to generate disease-related biological networks and to identify potential therapeutic targets within the networks. Here, we provide an overview of integrative approaches in the drug discovery process and provide simple examples of how the approaches can be exploited to identify oxidative stress-related targets for retinal diseases.

  18. Integrated Approaches to Drug Discovery for Oxidative Stress-Related Retinal Diseases

    PubMed Central

    Hara, Hideaki

    2016-01-01

    Excessive oxidative stress induces dysregulation of functional networks in the retina, resulting in retinal diseases such as glaucoma, age-related macular degeneration, and diabetic retinopathy. Although various therapies have been developed to reduce oxidative stress in retinal diseases, most have failed to show efficacy in clinical trials. This may be due to oversimplification of target selection for such a complex network as oxidative stress. Recent advances in high-throughput technologies have facilitated the collection of multilevel omics data, which has driven growth in public databases and in the development of bioinformatics tools. Integration of the knowledge gained from omics databases can be used to generate disease-related biological networks and to identify potential therapeutic targets within the networks. Here, we provide an overview of integrative approaches in the drug discovery process and provide simple examples of how the approaches can be exploited to identify oxidative stress-related targets for retinal diseases. PMID:28053689

  19. Quality assurance for the query and distribution systems of the RCSB Protein Data Bank

    PubMed Central

    Bluhm, Wolfgang F.; Beran, Bojan; Bi, Chunxiao; Dimitropoulos, Dimitris; Prlić, Andreas; Quinn, Gregory B.; Rose, Peter W.; Shah, Chaitali; Young, Jasmine; Yukich, Benjamin; Berman, Helen M.; Bourne, Philip E.

    2011-01-01

    The RCSB Protein Data Bank (RCSB PDB, www.pdb.org) is a key online resource for structural biology and related scientific disciplines. The website is used on average by 165 000 unique visitors per month, and more than 2000 other websites link to it. The amount and complexity of PDB data as well as the expectations on its usage are growing rapidly. Therefore, ensuring the reliability and robustness of the RCSB PDB query and distribution systems is crucially important and increasingly challenging. This article describes quality assurance for the RCSB PDB website at several distinct levels, including: (i) hardware redundancy and failover, (ii) testing protocols for weekly database updates, (iii) testing and release procedures for major software updates and (iv) miscellaneous monitoring and troubleshooting tools and practices. As such it provides suggestions for how other websites might be operated. Database URL: www.pdb.org PMID:21382834

  20. Toward a New Outlook on Primate Learning and Behavior: Complex Learning and Emergent Processes in Comparative Perspective

    NASA Technical Reports Server (NTRS)

    Rumbaugh, Duane M.; Savage-Rumbaugh, E. Sue; Washburn, David A.

    1996-01-01

    Primate research of the 20th century has established the validity of Darwin's postulation of psychological as well as biological continuity between humans and other primates, notably the great apes. Its data make clear that Descartes' view of animals as unfeeling 'beast-machines' is invalid and should be discarded. Traditional behavioristic frameworks, that emphasize the concepts of stimulus, response, and reinforcement and an 'empty-organism' psychology, are in need of major revisions. Revised frameworks should incorporate the fact that, in contrast to the lifeless databases of the 'hard' sciences, the database of psychology entails properties novel to life and its attendant phenomena. The contributions of research this century, achieved by field and laboratory researchers from around the world, have been substantial, indeed revolutionary. It is time to celebrate the progress of our field, to anticipate its significance, and to emphasize conservation of primates in their natural habitats.

  1. Redundancy control in pathway databases (ReCiPa): an application for improving gene-set enrichment analysis in Omics studies and "Big data" biology.

    PubMed

    Vivar, Juan C; Pemu, Priscilla; McPherson, Ruth; Ghosh, Sujoy

    2013-08-01

    Unparalleled technological advances have fueled an explosive growth in the scope and scale of biological data and have propelled life sciences into the realm of "Big Data" that cannot be managed or analyzed by conventional approaches. Big Data in the life sciences are driven primarily via a diverse collection of 'omics'-based technologies, including genomics, proteomics, metabolomics, transcriptomics, metagenomics, and lipidomics. Gene-set enrichment analysis is a powerful approach for interrogating large 'omics' datasets, leading to the identification of biological mechanisms associated with observed outcomes. While several factors influence the results from such analysis, the impact from the contents of pathway databases is often under-appreciated. Pathway databases often contain variously named pathways that overlap with one another to varying degrees. Ignoring such redundancies during pathway analysis can lead to the designation of several pathways as being significant due to high content-similarity, rather than truly independent biological mechanisms. Statistically, such dependencies also result in correlated p values and overdispersion, leading to biased results. We investigated the level of redundancies in multiple pathway databases and observed large discrepancies in the nature and extent of pathway overlap. This prompted us to develop the application, ReCiPa (Redundancy Control in Pathway Databases), to control redundancies in pathway databases based on user-defined thresholds. Analysis of genomic and genetic datasets, using ReCiPa-generated overlap-controlled versions of KEGG and Reactome pathways, led to a reduction in redundancy among the top-scoring gene-sets and allowed for the inclusion of additional gene-sets representing possibly novel biological mechanisms. 
Using obesity as an example, bioinformatic analysis further demonstrated that gene-sets identified from overlap-controlled pathway databases show stronger evidence of prior association to obesity compared to pathways identified from the original databases.
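
    The idea of controlling pathway redundancy with a user-defined threshold can be illustrated with a minimal sketch. This is not the ReCiPa algorithm itself, whose details the abstract does not give. The overlap measure (shared genes divided by the size of the smaller set), the greedy merge order, and all pathway and gene labels below are assumptions made for illustration.

```python
def overlap(a, b):
    """Fraction of the smaller gene set that is shared with the other set.

    Assumes both sets are non-empty.
    """
    return len(a & b) / min(len(a), len(b))


def merge_redundant(pathways, threshold):
    """Greedily merge pairs of pathways whose overlap is at or above threshold."""
    sets = {name: set(genes) for name, genes in pathways.items()}
    merged = True
    while merged:
        merged = False
        names = sorted(sets)
        for i, a in enumerate(names):
            for b in names[i + 1:]:
                if overlap(sets[a], sets[b]) >= threshold:
                    # Replace the two overlapping pathways with their union.
                    sets[a + "+" + b] = sets.pop(a) | sets.pop(b)
                    merged = True
                    break
            if merged:
                break
    return sets


# Hypothetical pathways and gene labels.
pathways = {
    "p1": {"g1", "g2", "g3", "g4"},
    "p2": {"g2", "g3", "g4", "g5", "g6"},
    "p3": {"g7", "g8", "g9"},
}
result = merge_redundant(pathways, threshold=0.7)
print(sorted(result))
```

    Here p1 and p2 share three of the four genes in p1 (overlap 0.75), so they are merged, while p3 is left untouched. A lower threshold merges more aggressively; a higher one keeps more of the original pathway definitions.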

  2. PGSB/MIPS Plant Genome Information Resources and Concepts for the Analysis of Complex Grass Genomes.

    PubMed

    Spannagl, Manuel; Bader, Kai; Pfeifer, Matthias; Nussbaumer, Thomas; Mayer, Klaus F X

    2016-01-01

    PGSB (Plant Genome and Systems Biology; formerly MIPS-Munich Institute for Protein Sequences) has been involved in developing, implementing and maintaining plant genome databases for more than a decade. Genome databases and analysis resources have focused on individual genomes and aim to provide flexible and maintainable datasets for model plant genomes as a backbone against which experimental data, e.g., from high-throughput functional genomics, can be organized and analyzed. In addition, genomes from both model and crop plants form a scaffold for comparative genomics, assisted by specialized tools such as the CrowsNest viewer to explore conserved gene order (synteny) between related species on macro- and micro-levels. The genomes of many economically important Triticeae plants such as wheat, barley, and rye present a great challenge for sequence assembly and bioinformatic analysis due to their enormous complexity and large genome size. Novel concepts and strategies have been developed to deal with these difficulties and have been applied to the genomes of wheat, barley, rye, and other cereals. This includes the GenomeZipper concept, reference-guided exome assembly, and "chromosome genomics" based on flow cytometry sorted chromosomes.

  3. biochem4j: Integrated and extensible biochemical knowledge through graph databases.

    PubMed

    Swainston, Neil; Batista-Navarro, Riza; Carbonell, Pablo; Dobson, Paul D; Dunstan, Mark; Jervis, Adrian J; Vinaixa, Maria; Williams, Alan R; Ananiadou, Sophia; Faulon, Jean-Loup; Mendes, Pedro; Kell, Douglas B; Scrutton, Nigel S; Breitling, Rainer

    2017-01-01

    Biologists and biochemists have at their disposal a number of excellent, publicly available data resources such as UniProt, KEGG, and NCBI Taxonomy, which catalogue biological entities. Despite the usefulness of these resources, they remain fundamentally unconnected. While links may appear between entries across these databases, users are typically only able to follow such links by manual browsing or through specialised workflows. Although many of the resources provide web-service interfaces for computational access, performing federated queries across databases remains a non-trivial but essential activity in interdisciplinary systems and synthetic biology programmes. What is needed are integrated repositories to catalogue both biological entities and, crucially, the relationships between them. Such a resource should be extensible, such that newly discovered relationships (for example, those between novel, synthetic enzymes and non-natural products) can be added over time. With the introduction of graph databases, the barrier to the rapid generation, extension and querying of such a resource has been lowered considerably. With a particular focus on metabolic engineering as an illustrative application domain, biochem4j, freely available at http://biochem4j.org, is introduced to provide an integrated, queryable database that warehouses chemical, reaction, enzyme and taxonomic data from a range of reliable resources. The biochem4j framework establishes a starting point for the flexible integration and exploitation of an ever-wider range of biological data sources, from public databases to laboratory-specific experimental datasets, for the benefit of systems biologists, biosystems engineers and the wider community of molecular biologists and biological chemists.
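
    To make the notion of cataloguing entities together with their relationships more concrete, the following is a small in-memory sketch of a typed graph and a multi-hop query over it. It does not use biochem4j or any graph database. The node identifiers, the relationship names (`catalyses`, `has_product`, `expressed_in`) and the function `organisms_producing` are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical (source, relationship, target) triples.
edges = [
    ("enzyme:E1", "catalyses", "reaction:R1"),
    ("reaction:R1", "has_product", "chemical:C1"),
    ("enzyme:E1", "expressed_in", "organism:O1"),
    ("enzyme:E2", "catalyses", "reaction:R2"),
    ("reaction:R2", "has_product", "chemical:C2"),
    ("enzyme:E2", "expressed_in", "organism:O2"),
]

# Index edges in both directions so each hop is a dictionary lookup.
outgoing = defaultdict(list)
incoming = defaultdict(list)
for src, rel, dst in edges:
    outgoing[(src, rel)].append(dst)
    incoming[(dst, rel)].append(src)


def organisms_producing(chemical):
    """Follow chemical <- reaction <- enzyme -> organism and collect organisms."""
    found = set()
    for reaction in incoming[(chemical, "has_product")]:
        for enzyme in incoming[(reaction, "catalyses")]:
            found.update(outgoing[(enzyme, "expressed_in")])
    return sorted(found)


print(organisms_producing("chemical:C1"))
```

    Adding a newly discovered relationship in this sketch means appending one more triple, which mirrors the extensibility argument made in the abstract. A relational schema would typically need a new table or join for the same change.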

  4. biochem4j: Integrated and extensible biochemical knowledge through graph databases

    PubMed Central

    Batista-Navarro, Riza; Dunstan, Mark; Jervis, Adrian J.; Vinaixa, Maria; Ananiadou, Sophia; Faulon, Jean-Loup; Kell, Douglas B.

    2017-01-01

    Biologists and biochemists have at their disposal a number of excellent, publicly available data resources such as UniProt, KEGG, and NCBI Taxonomy, which catalogue biological entities. Despite the usefulness of these resources, they remain fundamentally unconnected. While links may appear between entries across these databases, users are typically only able to follow such links by manual browsing or through specialised workflows. Although many of the resources provide web-service interfaces for computational access, performing federated queries across databases remains a non-trivial but essential activity in interdisciplinary systems and synthetic biology programmes. What is needed are integrated repositories to catalogue both biological entities and–crucially–the relationships between them. Such a resource should be extensible, such that newly discovered relationships–for example, those between novel, synthetic enzymes and non-natural products–can be added over time. With the introduction of graph databases, the barrier to the rapid generation, extension and querying of such a resource has been lowered considerably. With a particular focus on metabolic engineering as an illustrative application domain, biochem4j, freely available at http://biochem4j.org, is introduced to provide an integrated, queryable database that warehouses chemical, reaction, enzyme and taxonomic data from a range of reliable resources. The biochem4j framework establishes a starting point for the flexible integration and exploitation of an ever-wider range of biological data sources, from public databases to laboratory-specific experimental datasets, for the benefit of systems biologists, biosystems engineers and the wider community of molecular biologists and biological chemists. PMID:28708831

  5. Metabolic pathways for the whole community.

    PubMed

    Hanson, Niels W; Konwar, Kishori M; Hawley, Alyse K; Altman, Tomer; Karp, Peter D; Hallam, Steven J

    2014-07-22

    A convergence of high-throughput sequencing and computational power is transforming biology into information science. Despite these technological advances, converting bits and bytes of sequence information into meaningful insights remains a challenging enterprise. Biological systems operate on multiple hierarchical levels from genomes to biomes. Holistic understanding of biological systems requires agile software tools that permit comparative analyses across multiple information levels (DNA, RNA, protein, and metabolites) to identify emergent properties, diagnose system states, or predict responses to environmental change. Here we adopt the MetaPathways annotation and analysis pipeline and Pathway Tools to construct environmental pathway/genome databases (ePGDBs) that describe microbial community metabolism using MetaCyc, a highly curated database of metabolic pathways and components covering all domains of life. We evaluate Pathway Tools' performance on three datasets with different complexity and coding potential, including simulated metagenomes, a symbiotic system, and the Hawaii Ocean Time-series. We define accuracy and sensitivity relationships between read length, coverage and pathway recovery and evaluate the impact of taxonomic pruning on ePGDB construction and interpretation. Resulting ePGDBs provide interactive metabolic maps, predict emergent metabolic pathways associated with biosynthesis and energy production and differentiate between genomic potential and phenotypic expression across defined environmental gradients. This multi-tiered analysis provides the user community with specific operating guidelines, performance metrics and prediction hazards for more reliable ePGDB construction and interpretation. Moreover, it demonstrates the power of Pathway Tools in predicting metabolic interactions in natural and engineered ecosystems.

  6. BioMart Central Portal: an open database network for the biological community

    PubMed Central

    Guberman, Jonathan M.; Ai, J.; Arnaiz, O.; Baran, Joachim; Blake, Andrew; Baldock, Richard; Chelala, Claude; Croft, David; Cros, Anthony; Cutts, Rosalind J.; Di Génova, A.; Forbes, Simon; Fujisawa, T.; Gadaleta, E.; Goodstein, D. M.; Gundem, Gunes; Haggarty, Bernard; Haider, Syed; Hall, Matthew; Harris, Todd; Haw, Robin; Hu, S.; Hubbard, Simon; Hsu, Jack; Iyer, Vivek; Jones, Philip; Katayama, Toshiaki; Kinsella, R.; Kong, Lei; Lawson, Daniel; Liang, Yong; Lopez-Bigas, Nuria; Luo, J.; Lush, Michael; Mason, Jeremy; Moreews, Francois; Ndegwa, Nelson; Oakley, Darren; Perez-Llamas, Christian; Primig, Michael; Rivkin, Elena; Rosanoff, S.; Shepherd, Rebecca; Simon, Reinhard; Skarnes, B.; Smedley, Damian; Sperling, Linda; Spooner, William; Stevenson, Peter; Stone, Kevin; Teague, J.; Wang, Jun; Wang, Jianxin; Whitty, Brett; Wong, D. T.; Wong-Erasmus, Marie; Yao, L.; Youens-Clark, Ken; Yung, Christina; Zhang, Junjun; Kasprzyk, Arek

    2011-01-01

    BioMart Central Portal is a first of its kind, community-driven effort to provide unified access to dozens of biological databases spanning genomics, proteomics, model organisms, cancer data, ontology information and more. Anybody can contribute an independently maintained resource to the Central Portal, allowing it to be exposed to and shared with the research community, and linking it with the other resources in the portal. Users can take advantage of the common interface to quickly utilize different sources without learning a new system for each. The system also simplifies cross-database searches that might otherwise require several complicated steps. Several integrated tools streamline common tasks, such as converting between ID formats and retrieving sequences. The combination of a wide variety of databases, an easy-to-use interface, robust programmatic access and the array of tools make Central Portal a one-stop shop for biological data querying. Here, we describe the structure of Central Portal and show example queries to demonstrate its capabilities. Database URL: http://central.biomart.org. PMID:21930507

  7. Development and Mining of a Volatile Organic Compound Database

    PubMed Central

    Abdullah, Azian Azamimi; Ono, Naoaki; Sugiura, Tadao; Morita, Aki Hirai; Katsuragi, Tetsuo; Muto, Ai; Nishioka, Takaaki; Kanaya, Shigehiko

    2015-01-01

    Volatile organic compounds (VOCs) are small molecules that exhibit high vapor pressure under ambient conditions and have low boiling points. Although VOCs contribute only a small proportion of the total metabolites produced by living organisms, they play an important role in chemical ecology, specifically in the biological interactions between organisms and ecosystems. VOCs are also important in the health care field, as they are presently used as biomarkers to detect various human diseases. Information on VOCs has so far been scattered across the literature, and no database describing VOCs and their biological activities has been available. To address this gap, we have developed the KNApSAcK Metabolite Ecology Database, which contains information on the relationships between VOCs and their emitting organisms. The KNApSAcK Metabolite Ecology Database is also linked with the KNApSAcK Core and KNApSAcK Metabolite Activity Database to provide further information on the metabolites and their biological activities. The VOC database can be accessed online. PMID:26495281

  8. IDENTIFICATION OF BIOLOGICALLY RELEVANT GENES USING A DATABASE OF RAT LIVER AND KIDNEY BASELINE GENE EXPRESSION

    EPA Science Inventory

    Microarray data from independent labs and studies can be compared to potentially identify toxicologically and biologically relevant genes. The Baseline Animal Database working group of HESI was formed to assess baseline gene expression from microarray data derived from control or...

  9. Comprehensive, comprehensible, distributed and intelligent databases: current status.

    PubMed

    Frishman, D; Heumann, K; Lesk, A; Mewes, H W

    1998-01-01

    It is only a matter of time until a user will see not many but one integrated database of information for molecular biology. Is this true? Is it a good thing? Why will it happen? Where are we now? What developments are fostering and what developments are impeding progress towards this end? A list of WWW resources devoted to database issues in molecular biology is available at http://www.mips.biochem.mpg.de. Contact: frishman@mips.biochem.mpg.de

  10. Construction of an ortholog database using the semantic web technology for integrative analysis of genomic data.

    PubMed

    Chiba, Hirokazu; Nishide, Hiroyo; Uchiyama, Ikuo

    2015-01-01

    Recently, various types of biological data, including genomic sequences, have been rapidly accumulating. To discover biological knowledge from such growing heterogeneous data, a flexible framework for data integration is necessary. Ortholog information is a central resource for interlinking corresponding genes among different organisms, and the Semantic Web provides a key technology for the flexible integration of heterogeneous data. We have constructed an ortholog database using the Semantic Web technology, aiming at the integration of numerous genomic data and various types of biological information. To formalize the structure of the ortholog information in the Semantic Web, we have constructed the Ortholog Ontology (OrthO). While the OrthO is a compact ontology for general use, it is designed to be extended to the description of database-specific concepts. On the basis of OrthO, we described the ortholog information from our Microbial Genome Database for Comparative Analysis (MBGD) in the form of Resource Description Framework (RDF) and made it available through the SPARQL endpoint, which accepts arbitrary queries specified by users. In this framework based on the OrthO, the biological data of different organisms can be integrated using the ortholog information as a hub. Besides, the ortholog information from different data sources can be compared with each other using the OrthO as a shared ontology. Here we show some examples demonstrating that the ortholog information described in RDF can be used to link various biological data such as taxonomy information and Gene Ontology. Thus, the ortholog database using the Semantic Web technology can contribute to biological knowledge discovery through integrative data analysis.

  11. Heterogeneous database integration in biomedicine.

    PubMed

    Sujansky, W

    2001-08-01

    The rapid expansion of biomedical knowledge, reduction in computing costs, and spread of internet access have created an ocean of electronic data. The decentralized nature of our scientific community and healthcare system, however, has resulted in a patchwork of diverse, or heterogeneous, database implementations, making access to and aggregation of data across databases very difficult. The database heterogeneity problem applies equally to clinical data describing individual patients and biological data characterizing our genome. Specifically, databases are highly heterogeneous with respect to the data models they employ, the data schemas they specify, the query languages they support, and the terminologies they recognize. Heterogeneous database systems attempt to unify disparate databases by providing uniform conceptual schemas that resolve representational heterogeneities, and by providing querying capabilities that aggregate and integrate distributed data. Research in this area has applied a variety of database and knowledge-based techniques, including semantic data modeling, ontology definition, query translation, query optimization, and terminology mapping. Existing systems have addressed heterogeneous database integration in the realms of molecular biology, hospital information systems, and application portability.

  12. Biofuel Database

    National Institute of Standards and Technology Data Gateway

    Biofuel Database (Web, free access)   This database brings together structural, biological, and thermodynamic data for enzymes that are either in current use or are being considered for use in the production of biofuels.

  13. Search and Discovery Strategies for Biotechnology: the Paradigm Shift

    PubMed Central

    Bull, Alan T.; Ward, Alan C.; Goodfellow, Michael

    2000-01-01

    Profound changes are occurring in the strategies that biotechnology-based industries are deploying in the search for exploitable biology and to discover new products and develop new or improved processes. The advances that have been made in the past decade in areas such as combinatorial chemistry, combinatorial biosynthesis, metabolic pathway engineering, gene shuffling, and directed evolution of proteins have caused some companies to consider withdrawing from natural product screening. In this review we examine the paradigm shift from traditional biology to bioinformatics that is revolutionizing exploitable biology. We conclude that the reinvigorated means of detecting novel organisms, novel chemical structures, and novel biocatalytic activities will ensure that natural products will continue to be a primary resource for biotechnology. The paradigm shift has been driven by a convergence of complementary technologies, exemplified by DNA sequencing and amplification, genome sequencing and annotation, proteome analysis, and phenotypic inventorying, resulting in the establishment of huge databases that can be mined in order to generate useful knowledge such as the identity and characterization of organisms and the identity of biotechnology targets. Concurrently there have been major advances in understanding the extent of microbial diversity, how uncultured organisms might be grown, and how expression of the metabolic potential of microorganisms can be maximized. The integration of information from complementary databases presents a significant challenge. Such integration should facilitate answers to complex questions involving sequence, biochemical, physiological, taxonomic, and ecological information of the sort posed in exploitable biology. 
The paradigm shift which we discuss is not absolute in the sense that it will replace established microbiology; rather, it reinforces our view that innovative microbiology is essential for releasing the potential of microbial diversity for biotechnology penetration throughout industry. Various of these issues are considered with reference to deep-sea microbiology and biotechnology. PMID:10974127

  14. Causal biological network database: a comprehensive platform of causal biological network models focused on the pulmonary and vascular systems

    PubMed Central

    Boué, Stéphanie; Talikka, Marja; Westra, Jurjen Willem; Hayes, William; Di Fabio, Anselmo; Park, Jennifer; Schlage, Walter K.; Sewer, Alain; Fields, Brett; Ansari, Sam; Martin, Florian; Veljkovic, Emilija; Kenney, Renee; Peitsch, Manuel C.; Hoeng, Julia

    2015-01-01

    With the wealth of publications and data available, powerful and transparent computational approaches are required to represent measured data and scientific knowledge in a computable and searchable format. We developed a set of biological network models, scripted in the Biological Expression Language, that reflect causal signaling pathways across a wide range of biological processes, including cell fate, cell stress, cell proliferation, inflammation, tissue repair and angiogenesis in the pulmonary and cardiovascular context. This comprehensive collection of networks is now freely available to the scientific community in a centralized web-based repository, the Causal Biological Network database, which is composed of over 120 manually curated and well annotated biological network models and can be accessed at http://causalbionet.com. The website accesses a MongoDB, which stores all versions of the networks as JSON objects and allows users to search for genes, proteins, biological processes, small molecules and keywords in the network descriptions to retrieve biological networks of interest. The content of the networks can be visualized and browsed. Nodes and edges can be filtered and all supporting evidence for the edges can be browsed and is linked to the original articles in PubMed. Moreover, networks may be downloaded for further visualization and evaluation. Database URL: http://causalbionet.com PMID:25887162
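
    The storage and search pattern described above (networks kept as JSON objects, searched by entity or keyword, with edges filtered and linked to supporting evidence) can be sketched in plain Python. This sketch does not connect to MongoDB or to the Causal Biological Network database. The field names, network names and entity labels are hypothetical and are not the site's real schema.

```python
import json

# Hypothetical network documents; the real schema is not given in the abstract.
raw = """
[
  {"name": "network_A", "version": 1,
   "description": "Cell stress signaling in lung tissue",
   "edges": [
     {"source": "protein_1", "relation": "increases", "target": "process_1",
      "evidence": ["article_1"]},
     {"source": "protein_2", "relation": "decreases", "target": "process_1",
      "evidence": ["article_2", "article_3"]}
   ]},
  {"name": "network_B", "version": 1,
   "description": "Angiogenesis in vascular tissue",
   "edges": [
     {"source": "protein_3", "relation": "increases", "target": "process_2",
      "evidence": ["article_4"]}
   ]}
]
"""
networks = json.loads(raw)


def search(networks, term):
    """Return names of networks whose description or entities match term."""
    term = term.lower()
    hits = []
    for net in networks:
        entities = {e["source"].lower() for e in net["edges"]}
        entities |= {e["target"].lower() for e in net["edges"]}
        if term in net["description"].lower() or term in entities:
            hits.append(net["name"])
    return hits


def filter_edges(network, relation):
    """Keep only the edges of one network that carry the given relation."""
    return [e for e in network["edges"] if e["relation"] == relation]


print(search(networks, "angiogenesis"))
print(filter_edges(networks[0], "decreases")[0]["evidence"])
```

    Keeping each network version as one self-contained document makes download and versioning simple, at the cost of scanning documents for cross-network entity searches unless an index is added.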

  15. CROPPER: a metagene creator resource for cross-platform and cross-species compendium studies.

    PubMed

    Paananen, Jussi; Storvik, Markus; Wong, Garry

    2006-09-22

    Current genomic research methods provide researchers with enormous amounts of data. Combining data from different high-throughput research technologies commonly available in biological databases can lead to novel findings and increase research efficiency. However, combining data from different heterogeneous sources is often a very arduous task. These sources can be different microarray technology platforms, genomic databases, or experiments performed on various species. Our aim was to develop a software program that could facilitate the combining of data from heterogeneous sources, and thus allow researchers to perform genomic cross-platform/cross-species studies and to use existing experimental data for compendium studies. We have developed a web-based software resource, called CROPPER that uses the latest genomic information concerning different data identifiers and orthologous genes from the Ensembl database. CROPPER can be used to combine genomic data from different heterogeneous sources, allowing researchers to perform cross-platform/cross-species compendium studies without the need for complex computational tools or the requirement of setting up one's own in-house database. We also present an example of a simple cross-platform/cross-species compendium study based on publicly available Parkinson's disease data derived from different sources. CROPPER is a user-friendly and freely available web-based software resource that can be successfully used for cross-species/cross-platform compendium studies.
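
    The combining step can be pictured as two lookups followed by an aggregation: platform-specific probe identifiers are mapped to genes, and genes from another species are mapped to a reference gene through an ortholog table. The sketch below is only an illustration under those assumptions. It is not CROPPER's implementation, it does not query Ensembl, and every identifier and the choice of averaging are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical identifier tables (in practice these would come from a genome database).
probe_to_gene = {
    "A_probe_1": "human_gene_1",
    "A_probe_2": "human_gene_2",
    "B_probe_9": "mouse_gene_1",
}
ortholog = {"mouse_gene_1": "human_gene_1"}  # other species -> reference species

# Hypothetical (probe, expression value) pairs from two platforms.
measurements = [("A_probe_1", 2.0), ("A_probe_2", 1.0), ("B_probe_9", 4.0)]


def combine(measurements):
    """Group values under a shared reference gene and average them."""
    grouped = defaultdict(list)
    for probe, value in measurements:
        gene = probe_to_gene.get(probe)
        if gene is None:
            continue  # probe without a known gene mapping is skipped
        gene = ortholog.get(gene, gene)  # translate to the reference species if needed
        grouped[gene].append(value)
    return {gene: mean(values) for gene, values in grouped.items()}


print(combine(measurements))
```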

  16. B-CAN: a resource sharing platform to improve the operation, visualization and integrated analysis of TCGA breast cancer data.

    PubMed

    Wen, Can-Hong; Ou, Shao-Min; Guo, Xiao-Bo; Liu, Chen-Feng; Shen, Yan-Bo; You, Na; Cai, Wei-Hong; Shen, Wen-Jun; Wang, Xue-Qin; Tan, Hai-Zhu

    2017-12-12

    Breast cancer is a high-risk heterogeneous disease with myriad subtypes and complicated biological features. The Cancer Genome Atlas (TCGA) breast cancer database provides researchers with large-scale genome and clinical data via web portals and FTP services. Researchers are able to gain new insights into their related fields, and evaluate experimental discoveries with TCGA. However, researchers who have little experience with databases and bioinformatics find TCGA difficult to access and operate because of its complex data format and diverse files. For ease of use, we built the breast cancer (B-CAN) platform, which enables data customization, data visualization, and a private data center. The B-CAN platform runs on an Apache server and interacts with a back-end MySQL database through PHP. Users can customize data based on their needs by combining tables from the original TCGA database and selecting variables from each table. The private data center is applicable for private data and two types of customized data. A key feature of B-CAN is that it provides single table display and multiple table display. Customized data with one barcode corresponding to many records and processed customized data are allowed in Multiple Tables Display. B-CAN is an intuitive and highly efficient data-sharing platform.
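
    Combining tables on a shared barcode, including the case where one barcode corresponds to many records, amounts to a relational join. The sketch below uses Python's built-in `sqlite3` module with an in-memory database as a stand-in for the MySQL back end mentioned in the abstract. The table names, column names and values are hypothetical and do not reflect the TCGA or B-CAN schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE clinical (barcode TEXT PRIMARY KEY, age INTEGER);
    CREATE TABLE sample (barcode TEXT, sample_type TEXT);
    INSERT INTO clinical VALUES ('P-01', 52), ('P-02', 61);
    INSERT INTO sample VALUES ('P-01', 'tumor'), ('P-01', 'normal'), ('P-02', 'tumor');
    """
)

# Select variables from each table and combine them on the shared barcode.
# One clinical row can match several sample rows (one barcode, many records).
rows = conn.execute(
    """
    SELECT c.barcode, c.age, s.sample_type
    FROM clinical AS c
    JOIN sample AS s ON s.barcode = c.barcode
    ORDER BY c.barcode, s.sample_type
    """
).fetchall()

for row in rows:
    print(row)
conn.close()
```

    The barcode `P-01` appears twice in the result because it has two sample records, which is the one-to-many situation the multiple table display is described as handling.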

  17. Hydroponics Database and Handbook for the Advanced Life Support Test Bed

    NASA Technical Reports Server (NTRS)

    Nash, Allen J.

    1999-01-01

    During the summer of 1998, I worked as a student assistant to Dr. Daniel J. Barta, chief plant growth expert at Johnson Space Center - NASA. We established the preliminary stages of a hydroponic crop growth database for the Advanced Life Support Systems Integration Test Bed, otherwise referred to as BIO-Plex (Biological Planetary Life Support Systems Test Complex). The database summarizes information from published technical papers by plant growth experts, and it includes bibliographical, environmental and harvest information based on plant growth under varying environmental conditions. I collected 84 lettuce entries, 14 soybean, 49 sweet potato, 16 wheat, 237 white potato, and 26 mixed crop entries. The list will grow with the publication of new research. This database will be integrated with a search and systems analysis computer program that will cross-reference multiple parameters to determine optimum edible yield under varying parameters. We have also made a preliminary effort to put together a crop handbook for BIO-Plex plant growth management. It will be a collection of information obtained from experts who provided recommendations on a particular crop's growing conditions. It includes bibliographic, environmental, nutrient solution, potential yield, harvest nutritional, and propagation procedure information. This handbook will serve as the baseline for growth conditions in the first set of experiments in the BIO-Plex facility.

  18. CHOmine: an integrated data warehouse for CHO systems biology and modeling

    PubMed Central

    Hanscho, Michael; Ruckerbauer, David E.; Zanghellini, Jürgen; Borth, Nicole

    2017-01-01

    The last decade has seen a surge in published genome-scale information for Chinese hamster ovary (CHO) cells, which are the main production vehicles for therapeutic proteins. While a single access point is available at www.CHOgenome.org, the primary data is distributed over several databases at different institutions. Currently research is frequently hampered by a plethora of gene names and IDs that vary between published draft genomes and databases, making systems biology analyses cumbersome and elaborate. Here we present CHOmine, an integrative data warehouse connecting data from various databases and links to other ones. Furthermore, we introduce CHOmodel, a web based resource that provides access to recently published CHO cell line specific metabolic reconstructions. Both resources allow users to query CHO relevant data and find interconnections between different types of data, and thus provide a simple, standardized entry point to the world of CHO systems biology. Database URL: http://www.chogenome.org PMID:28605771

  19. Towards the implementation of a spectral database for the detection of biological warfare agents

    NASA Astrophysics Data System (ADS)

    Carestia, M.; Pizzoferrato, R.; Gelfusa, M.; Cenciarelli, O.; D'Amico, F.; Malizia, A.; Scarpellini, D.; Murari, A.; Vega, J.; Gaudio, P.

    2014-10-01

    The deliberate use of biological warfare agents (BWA) and other pathogens can jeopardize the safety of population, fauna and flora, and represents a concrete concern from the military and civil perspective. At present, the only commercially available tools for fast warning of a biological attack can perform point detection and require active or passive sampling collection. The development of a stand-off detection system would be extremely valuable to minimize the risk and the possible consequences of the release of biological aerosols in the atmosphere. Biological samples can be analyzed by means of several optical techniques, covering a broad region of the electromagnetic spectrum. Strong evidence proved that the informative content of fluorescence spectra could provide good preliminary discrimination among those agents and it can also be obtained through stand-off measurements. Such a system necessitates a database and a mathematical method for the discrimination of the spectral signatures. In this work, we collected fluorescence emission spectra of the main BWA simulants, to implement a spectral signature database and apply the Universal Multi Event Locator (UMEL) statistical method. Our preliminary analysis, conducted in laboratory conditions with a standard UV lamp source, considers the main experimental setups influencing the fluorescence signature of some of the most commonly used BWA simulants. Our work represents a first step towards the implementation of a spectral database and a laser-based biological stand-off detection and identification technique.

  20. Comprehensive Transcriptome Analysis of Sex-Biased Expressed Genes Reveals Discrete Biological and Physiological Features of Male and Female Schistosoma japonicum.

    PubMed

    Cai, Pengfei; Liu, Shuai; Piao, Xianyu; Hou, Nan; Gobert, Geoffrey N; McManus, Donald P; Chen, Qijun

    2016-04-01

    Schistosomiasis is a chronic and debilitating disease caused by blood flukes (digenetic trematodes) of the genus Schistosoma. Schistosomes are sexually dimorphic and exhibit dramatic morphological changes during a complex lifecycle which requires subtle gene regulatory mechanisms to fulfil these complex biological processes. In the current study, a 41,982 features custom DNA microarray, which represents the most comprehensive probe coverage for any schistosome transcriptome study, was designed based on public domain and local databases to explore differential gene expression in S. japonicum. We found that approximately 1/10 of the total annotated genes in the S. japonicum genome are differentially expressed between adult males and females. In general, genes associated with the cytoskeleton, and motor and neuronal activities were readily expressed in male adult worms, whereas genes involved in amino acid metabolism, nucleotide biosynthesis, gluconeogenesis, glycosylation, cell cycle processes, DNA synthesis and genome fidelity and stability were enriched in females. Further, miRNAs target sites within these gene sets were predicted, which provides a scenario whereby the miRNAs potentially regulate these sex-biased expressed genes. The study significantly expands the expressional and regulatory characteristics of gender-biased expressed genes in schistosomes with high accuracy. The data provide a better appreciation of the biological and physiological features of male and female schistosome parasites, which may lead to novel vaccine targets and the development of new therapeutic interventions.

  1. The BEL information extraction workflow (BELIEF): evaluation in the BioCreative V BEL and IAT track

    PubMed Central

    Madan, Sumit; Hodapp, Sven; Senger, Philipp; Ansari, Sam; Szostak, Justyna; Hoeng, Julia; Peitsch, Manuel; Fluck, Juliane

    2016-01-01

    Network-based approaches have become extremely important in systems biology to achieve a better understanding of biological mechanisms. For network representation, the Biological Expression Language (BEL) is well designed to collate findings from the scientific literature into biological network models. To facilitate encoding and biocuration of such findings in BEL, a BEL Information Extraction Workflow (BELIEF) was developed. BELIEF provides a web-based curation interface, the BELIEF Dashboard, that incorporates text mining techniques to support the biocurator in the generation of BEL networks. The underlying UIMA-based text mining pipeline (BELIEF Pipeline) uses several named entity recognition processes and relationship extraction methods to detect concepts and BEL relationships in literature. The BELIEF Dashboard allows easy curation of the automatically generated BEL statements and their context annotations. Resulting BEL statements and their context annotations can be syntactically and semantically verified to ensure consistency in the BEL network. In summary, the workflow supports experts in different stages of systems biology network building. Based on the BioCreative V BEL track evaluation, we show that the BELIEF Pipeline automatically extracts relationships with an F-score of 36.4% and fully correct statements can be obtained with an F-score of 30.8%. Participation in the BioCreative V Interactive task (IAT) track with BELIEF revealed a systems usability scale (SUS) of 67. Considering the complexity of the task for new users—learning BEL, working with a completely new interface, and performing complex curation—a score so close to the overall SUS average highlights the usability of BELIEF. Database URL: BELIEF is available at http://www.scaiview.com/belief/ PMID:27694210
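
    For readers unfamiliar with the metric, an F-score combines precision and recall into a single number. The sketch below shows the standard F-beta formula, with beta = 1 giving the balanced harmonic mean. The abstract does not report the underlying precision and recall, and it does not state which beta was used, so the input values here are hypothetical and are not the BELIEF results.

```python
def f_score(precision, recall, beta=1.0):
    """F-beta score; beta=1 is the harmonic mean of precision and recall."""
    if precision == 0 and recall == 0:
        return 0.0  # avoid division by zero when nothing is correct
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical precision and recall values, for illustration only.
print(f_score(0.5, 0.25))
```

    Because the harmonic mean is dominated by the smaller of the two inputs, a system with high precision but low recall (or the reverse) still receives a low F-score.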

  2. The BEL information extraction workflow (BELIEF): evaluation in the BioCreative V BEL and IAT track.

    PubMed

    Madan, Sumit; Hodapp, Sven; Senger, Philipp; Ansari, Sam; Szostak, Justyna; Hoeng, Julia; Peitsch, Manuel; Fluck, Juliane

    2016-01-01

    Network-based approaches have become extremely important in systems biology to achieve a better understanding of biological mechanisms. For network representation, the Biological Expression Language (BEL) is well designed to collate findings from the scientific literature into biological network models. To facilitate encoding and biocuration of such findings in BEL, a BEL Information Extraction Workflow (BELIEF) was developed. BELIEF provides a web-based curation interface, the BELIEF Dashboard, that incorporates text mining techniques to support the biocurator in the generation of BEL networks. The underlying UIMA-based text mining pipeline (BELIEF Pipeline) uses several named entity recognition processes and relationship extraction methods to detect concepts and BEL relationships in literature. The BELIEF Dashboard allows easy curation of the automatically generated BEL statements and their context annotations. Resulting BEL statements and their context annotations can be syntactically and semantically verified to ensure consistency in the BEL network. In summary, the workflow supports experts in different stages of systems biology network building. Based on the BioCreative V BEL track evaluation, we show that the BELIEF Pipeline automatically extracts relationships with an F-score of 36.4% and fully correct statements can be obtained with an F-score of 30.8%. Participation in the BioCreative V Interactive task (IAT) track with BELIEF revealed a systems usability scale (SUS) of 67. Considering the complexity of the task for new users (learning BEL, working with a completely new interface, and performing complex curation), a score so close to the overall SUS average highlights the usability of BELIEF. Database URL: BELIEF is available at http://www.scaiview.com/belief/. © The Author(s) 2016. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com.

  3. The SBOL Stack: A Platform for Storing, Publishing, and Sharing Synthetic Biology Designs.

    PubMed

    Madsen, Curtis; McLaughlin, James Alastair; Mısırlı, Göksel; Pocock, Matthew; Flanagan, Keith; Hallinan, Jennifer; Wipat, Anil

    2016-06-17

    Recently, synthetic biologists have developed the Synthetic Biology Open Language (SBOL), a data exchange standard for descriptions of genetic parts, devices, modules, and systems. The goals of this standard are to allow scientists to exchange designs of biological parts and systems, to facilitate the storage of genetic designs in repositories, and to facilitate the description of genetic designs in publications. In order to achieve these goals, the development of an infrastructure to store, retrieve, and exchange SBOL data is necessary. To address this problem, we have developed the SBOL Stack, a Resource Description Framework (RDF) database specifically designed for the storage, integration, and publication of SBOL data. This database allows users to define a library of synthetic parts and designs as a service, to share SBOL data with collaborators, and to store designs of biological systems locally. The database also allows external data sources to be integrated by mapping them to the SBOL data model. The SBOL Stack includes two Web interfaces: the SBOL Stack API and SynBioHub. While the former is designed for developers, the latter allows users to upload new SBOL biological designs, download SBOL documents, search by keyword, and visualize SBOL data. Since the SBOL Stack is based on semantic Web technology, the inherent distributed querying functionality of RDF databases can be used to allow different SBOL stack databases to be queried simultaneously, and therefore, data can be shared between different institutes, centers, or other users.

  4. Relational databases: a transparent framework for encouraging biology students to think informatically.

    PubMed

    Rice, Michael; Gladstone, William; Weir, Michael

    2004-01-01

    We discuss how relational databases constitute an ideal framework for representing and analyzing large-scale genomic data sets in biology. As a case study, we describe a Drosophila splice-site database that we recently developed at Wesleyan University for use in research and teaching. The database stores data about splice sites computed by a custom algorithm using Drosophila cDNA transcripts and genomic DNA and supports a set of procedures for analyzing splice-site sequence space. A generic Web interface permits the execution of the procedures with a variety of parameter settings and also supports custom structured query language queries. Moreover, new analytical procedures can be added by updating special metatables in the database without altering the Web interface. The database provides a powerful setting for students to develop informatic thinking skills.

  5. Relational Databases: A Transparent Framework for Encouraging Biology Students To Think Informatically

    PubMed Central

    2004-01-01

    We discuss how relational databases constitute an ideal framework for representing and analyzing large-scale genomic data sets in biology. As a case study, we describe a Drosophila splice-site database that we recently developed at Wesleyan University for use in research and teaching. The database stores data about splice sites computed by a custom algorithm using Drosophila cDNA transcripts and genomic DNA and supports a set of procedures for analyzing splice-site sequence space. A generic Web interface permits the execution of the procedures with a variety of parameter settings and also supports custom structured query language queries. Moreover, new analytical procedures can be added by updating special metatables in the database without altering the Web interface. The database provides a powerful setting for students to develop informatic thinking skills. PMID:15592597

  6. Evaluation of Automated Yeast Identification System

    NASA Technical Reports Server (NTRS)

    McGinnis, M. R.

    1996-01-01

    One hundred and nine teleomorphic and anamorphic yeast isolates representing approximately 30 taxa were used to evaluate the accuracy of the Biolog yeast identification system. Isolates derived from nomenclatural types, as well as environmental and clinical isolates of known identity, were tested in the Biolog system. Of the isolates tested, 81 were in the Biolog database. The system correctly identified 40, incorrectly identified 29, and was unable to identify 12. Of the 28 isolates not in the database, 18 were given names, whereas 10 were not. The Biolog yeast identification system is inadequate for the identification of yeasts originating from the environment during space program activities.
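
    The counts reported in this record can be turned into rates with a few lines of arithmetic. The Python sketch below is an illustration only and is not part of the original report; the variable names are ours, and the two derived rates are simply ratios of the counts quoted in the abstract.

```python
# Counts quoted in the abstract.
total_isolates = 109
in_database = 81
correct = 40
incorrect = 29
unidentified = 12
not_in_database = 28
named_anyway = 18  # isolates absent from the database that still received a name

# Sanity checks: the reported subgroups should add up.
assert correct + incorrect + unidentified == in_database
assert in_database + not_in_database == total_isolates

# Share of database-covered isolates that were identified correctly.
accuracy_in_db = correct / in_database
# Share of isolates outside the database that were nevertheless given a name.
false_naming_rate = named_anyway / not_in_database

print(f"Correct among isolates in the database: {accuracy_in_db:.1%}")
print(f"Named despite being absent from the database: {false_naming_rate:.1%}")
```

    Under these counts, fewer than half of the isolates covered by the database were identified correctly, which is consistent with the record's conclusion that the system is inadequate for this purpose.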

  7. ATHENA: A knowledge-based hybrid backpropagation-grammatical evolution neural network algorithm for discovering epistasis among quantitative trait Loci

    PubMed Central

    2010-01-01

    Background Growing interest and burgeoning technology for discovering genetic mechanisms that influence disease processes have ushered in a flood of genetic association studies over the last decade, yet little heritability in highly studied complex traits has been explained by genetic variation. Non-additive gene-gene interactions, which are not often explored, are thought to be one source of this "missing" heritability. Methods Stochastic methods employing evolutionary algorithms have demonstrated promise in being able to detect and model gene-gene and gene-environment interactions that influence human traits. Here we demonstrate modifications to a neural network algorithm in ATHENA (the Analysis Tool for Heritable and Environmental Network Associations) resulting in clear performance improvements for discovering gene-gene interactions that influence human traits. We employed an alternative tree-based crossover, backpropagation for locally fitting neural network weights, and incorporation of domain knowledge obtainable from publicly accessible biological databases for initializing the search for gene-gene interactions. We tested these modifications in silico using simulated datasets. Results We show that the alternative tree-based crossover modification resulted in a modest increase in the sensitivity of the ATHENA algorithm for discovering gene-gene interactions. The performance increase was highly statistically significant when backpropagation was used to locally fit NN weights. We also demonstrate that using domain knowledge to initialize the search for gene-gene interactions results in a large performance increase, especially when the search space is larger than the search coverage. Conclusions We show that a hybrid optimization procedure, alternative crossover strategies, and incorporation of domain knowledge from publicly available biological databases can result in marked increases in sensitivity and performance of the ATHENA algorithm for detecting and modelling gene-gene interactions that influence a complex human trait. PMID:20875103

  8. Unraveling snake venom complexity with 'omics' approaches: challenges and perspectives.

    PubMed

    Zelanis, André; Tashima, Alexandre Keiji

    2014-09-01

    The study of snake venom proteomes (venomics) has been experiencing a burst of reports; however, obtaining comprehensive knowledge of the dynamic range of proteins present within a single venom and of the set of post-translational modifications (PTMs), as well as the lack of a comprehensive database of venom proteins, are among the main challenges in venomics research. The phenotypic plasticity of snake venom proteomes, together with their inherent toxin proteoform diversity, points to the use of integrative analysis in order to better understand their actual complexity. In this regard, such a systems venomics task should encompass the integration of data from transcriptomic and proteomic studies (especially the venom gland proteome), the identification of biological PTMs, and the estimation of artifactual proteomes and peptidomes generated by sample handling procedures. Copyright © 2014 Elsevier Ltd. All rights reserved.

  9. Semantically enabled and statistically supported biological hypothesis testing with tissue microarray databases

    PubMed Central

    2011-01-01

    Background Although many biological databases are applying semantic web technologies, meaningful biological hypothesis testing cannot be easily achieved. Database-driven high throughput genomic hypothesis testing requires both the capability to obtain semantically relevant experimental data and the capability to perform relevant statistical testing on the retrieved data. Tissue Microarray (TMA) data are semantically rich and contain many biologically important hypotheses waiting for high throughput conclusions. Methods An application-specific ontology was developed for managing TMA and DNA microarray databases by semantic web technologies. Data were represented as Resource Description Framework (RDF) according to the framework of the ontology. Applications for hypothesis testing (Xperanto-RDF) for TMA data were designed and implemented by (1) formulating the syntactic and semantic structures of the hypotheses derived from TMA experiments, (2) formulating SPARQL queries to reflect the semantic structures of the hypotheses, and (3) performing statistical tests with the result sets returned by the SPARQL queries. Results When a user designs a hypothesis in Xperanto-RDF and submits it, the hypothesis can be tested against TMA experimental data stored in Xperanto-RDF. When we evaluated four previously validated hypotheses as an illustration, all the hypotheses were supported by Xperanto-RDF. Conclusions We demonstrated the utility of high throughput biological hypothesis testing. We believe that preliminary investigation of this kind can be beneficial before performing highly controlled experiments. PMID:21342584

  10. Tomato functional genomics database (TFGD): a comprehensive collection and analysis package for tomato functional genomics

    USDA-ARS's Scientific Manuscript database

    Tomato Functional Genomics Database (TFGD; http://ted.bti.cornell.edu) provides a comprehensive systems biology resource to store, mine, analyze, visualize and integrate large-scale tomato functional genomics datasets. The database is expanded from the previously described Tomato Expression Database...

  11. Mining Rare Associations between Biological Ontologies

    PubMed Central

    Benites, Fernando; Simon, Svenja; Sapozhnikova, Elena

    2014-01-01

    The constantly increasing volume and complexity of available biological data requires new methods for their management and analysis. An important challenge is the integration of information from different sources in order to discover possible hidden relations between already known data. In this paper we introduce a data mining approach which relates biological ontologies by mining cross and intra-ontology pairwise generalized association rules. Its advantage is sensitivity to rare associations, for these are important for biologists. We propose a new class of interestingness measures designed for hierarchically organized rules. These measures allow one to select the most important rules and to take into account rare cases. They favor rules with an actual interestingness value that exceeds the expected value. The latter is calculated taking into account the parent rule. We demonstrate this approach by applying it to the analysis of data from Gene Ontology and GPCR databases. Our objective is to discover interesting relations between two different ontologies or parts of a single ontology. The association rules that are thus discovered can provide the user with new knowledge about underlying biological processes or help improve annotation consistency. The obtained results show that produced rules represent meaningful and quite reliable associations. PMID:24404165

  12. Mining rare associations between biological ontologies.

    PubMed

    Benites, Fernando; Simon, Svenja; Sapozhnikova, Elena

    2014-01-01

    The constantly increasing volume and complexity of available biological data requires new methods for their management and analysis. An important challenge is the integration of information from different sources in order to discover possible hidden relations between already known data. In this paper we introduce a data mining approach which relates biological ontologies by mining cross and intra-ontology pairwise generalized association rules. Its advantage is sensitivity to rare associations, for these are important for biologists. We propose a new class of interestingness measures designed for hierarchically organized rules. These measures allow one to select the most important rules and to take into account rare cases. They favor rules with an actual interestingness value that exceeds the expected value. The latter is calculated taking into account the parent rule. We demonstrate this approach by applying it to the analysis of data from Gene Ontology and GPCR databases. Our objective is to discover interesting relations between two different ontologies or parts of a single ontology. The association rules that are thus discovered can provide the user with new knowledge about underlying biological processes or help improve annotation consistency. The obtained results show that produced rules represent meaningful and quite reliable associations.

  13. Computer-aided identification, synthesis, and biological evaluation of novel inhibitors for botulinum neurotoxin serotype A

    DOE PAGES

    Teng, Y. G.; Berger, W. T.; Nesbitt, N. M.; ...

    2015-07-27

    Botulinum neurotoxins (BoNTs) are among the most potent biological toxins known to humans, and are classified as Category A bioterrorism agents by the Centers for Disease Control and Prevention (CDC). There are seven known BoNT serotypes (A-G) which have been thus far identified in literature. BoNTs have been shown to block neurotransmitter release by cleaving proteins of the soluble NSF attachment protein receptor (SNARE) complex. Disruption of the SNARE complex precludes motor neuron failure which ultimately results in flaccid paralysis in humans and animals. Currently, there are no effective therapeutic treatments against the neurotoxin light chain (LC) after translocation into the cytosols of motor neurons. In this work, high-throughput virtual screening was employed to screen a library of commercially available compounds from the ZINC database against BoNT/A-LC. Among the hit compounds from the in-silico screening, two lead compounds were identified and found to have potent inhibitory activity against BoNT/A-LC in vitro, as well as in Neuro-2a cells. A few analogues of the lead compounds were synthesized and their potency examined. One of these analogues showed enhanced activity compared with the lead compounds.

  14. Unraveling the Complexities of Life Sciences Data.

    PubMed

    Higdon, Roger; Haynes, Winston; Stanberry, Larissa; Stewart, Elizabeth; Yandl, Gregory; Howard, Chris; Broomall, William; Kolker, Natali; Kolker, Eugene

    2013-03-01

    The life sciences have entered into the realm of big data and data-enabled science, where data can either empower or overwhelm. These data bring the challenges of the 5 Vs of big data: volume, veracity, velocity, variety, and value. Both independently and through our involvement with DELSA Global (Data-Enabled Life Sciences Alliance, DELSAglobal.org), the Kolker Lab (kolkerlab.org) is creating partnerships that identify data challenges and solve community needs. We specialize in solutions to complex biological data challenges, as exemplified by the community resource of MOPED (Model Organism Protein Expression Database, MOPED.proteinspire.org) and the analysis pipeline of SPIRE (Systematic Protein Investigative Research Environment, PROTEINSPIRE.org). Our collaborative work extends into the computationally intensive tasks of analysis and visualization of millions of protein sequences through innovative implementations of sequence alignment algorithms and creation of the Protein Sequence Universe tool (PSU). Pushing into the future together with our collaborators, our lab is pursuing integration of multi-omics data and exploration of biological pathways, as well as assigning function to proteins and porting solutions to the cloud. Big data have come to the life sciences; discovering the knowledge in the data will bring breakthroughs and benefits.

  15. Combining computational models, semantic annotations and simulation experiments in a graph database

    PubMed Central

    Henkel, Ron; Wolkenhauer, Olaf; Waltemath, Dagmar

    2015-01-01

    Model repositories such as the BioModels Database, the CellML Model Repository or JWS Online are frequently accessed to retrieve computational models of biological systems. However, their storage concepts support only restricted types of queries and not all data inside the repositories can be retrieved. In this article we present a storage concept that meets this challenge. It grounds on a graph database, reflects the models’ structure, incorporates semantic annotations and simulation descriptions and ultimately connects different types of model-related data. The connections between heterogeneous model-related data and bio-ontologies enable efficient search via biological facts and grant access to new model features. The introduced concept notably improves the access of computational models and associated simulations in a model repository. This has positive effects on tasks such as model search, retrieval, ranking, matching and filtering. Furthermore, our work for the first time enables CellML- and Systems Biology Markup Language-encoded models to be effectively maintained in one database. We show how these models can be linked via annotations and queried. Database URL: https://sems.uni-rostock.de/projects/masymos/ PMID:25754863

  16. Detection of time delays and directional interactions based on time series from complex dynamical systems

    NASA Astrophysics Data System (ADS)

    Ma, Huanfei; Leng, Siyang; Tao, Chenyang; Ying, Xiong; Kurths, Jürgen; Lai, Ying-Cheng; Lin, Wei

    2017-07-01

    Data-based and model-free accurate identification of intrinsic time delays and directional interactions is an extremely challenging problem in complex dynamical systems and their networks reconstruction. A model-free method with new scores is proposed to be generally capable of detecting single, multiple, and distributed time delays. The method is applicable not only to mutually interacting dynamical variables but also to self-interacting variables in a time-delayed feedback loop. Validation of the method is carried out using physical, biological, and ecological models and real data sets. Especially, applying the method to air pollution data and hospital admission records of cardiovascular diseases in Hong Kong reveals the major air pollutants as a cause of the diseases and, more importantly, it uncovers a hidden time delay (about 30-40 days) in the causal influence that previous studies failed to detect. The proposed method is expected to be universally applicable to ascertaining and quantifying subtle interactions (e.g., causation) in complex systems arising from a broad range of disciplines.

  17. Insights into the Specificity of Lysine Acetyltransferases

    DOE PAGES

    Tucker, Alex C.; Taylor, Keenan C.; Rank, Katherine C.; ...

    2014-11-07

    Reversible lysine acetylation by protein acetyltransferases is a conserved regulatory mechanism that controls diverse cellular pathways. Gcn5-related N-acetyltransferases (GNATs), named after their founding member, are found in all domains of life. GNATs are known for their role as histone acetyltransferases, but non-histone bacterial protein acetyltransferases have been identified. Only structures of GNAT complexes with short histone peptide substrates are available in databases. Given the biological importance of this modification and the abundance of lysine in polypeptides, how specificity is attained for larger protein substrates is central to understanding acetyl-lysine-regulated networks. In this paper, we report the structure of a GNAT in complex with a globular protein substrate solved to 1.9 Å. GNAT binds the protein substrate with extensive surface interactions distinct from those reported for GNAT-peptide complexes. Finally, our data reveal determinants needed for the recognition of a protein substrate and provide insight into the specificity of GNATs.

  18. Identification and assembly of genomes and genetic elements in complex metagenomic samples without using reference genomes.

    PubMed

    Nielsen, H Bjørn; Almeida, Mathieu; Juncker, Agnieszka Sierakowska; Rasmussen, Simon; Li, Junhua; Sunagawa, Shinichi; Plichta, Damian R; Gautier, Laurent; Pedersen, Anders G; Le Chatelier, Emmanuelle; Pelletier, Eric; Bonde, Ida; Nielsen, Trine; Manichanh, Chaysavanh; Arumugam, Manimozhiyan; Batto, Jean-Michel; Quintanilha Dos Santos, Marcelo B; Blom, Nikolaj; Borruel, Natalia; Burgdorf, Kristoffer S; Boumezbeur, Fouad; Casellas, Francesc; Doré, Joël; Dworzynski, Piotr; Guarner, Francisco; Hansen, Torben; Hildebrand, Falk; Kaas, Rolf S; Kennedy, Sean; Kristiansen, Karsten; Kultima, Jens Roat; Léonard, Pierre; Levenez, Florence; Lund, Ole; Moumen, Bouziane; Le Paslier, Denis; Pons, Nicolas; Pedersen, Oluf; Prifti, Edi; Qin, Junjie; Raes, Jeroen; Sørensen, Søren; Tap, Julien; Tims, Sebastian; Ussery, David W; Yamada, Takuji; Renault, Pierre; Sicheritz-Ponten, Thomas; Bork, Peer; Wang, Jun; Brunak, Søren; Ehrlich, S Dusko

    2014-08-01

    Most current approaches for analyzing metagenomic data rely on comparisons to reference genomes, but the microbial diversity of many environments extends far beyond what is covered by reference databases. De novo segregation of complex metagenomic data into specific biological entities, such as particular bacterial strains or viruses, remains a largely unsolved problem. Here we present a method, based on binning co-abundant genes across a series of metagenomic samples, that enables comprehensive discovery of new microbial organisms, viruses and co-inherited genetic entities and aids assembly of microbial genomes without the need for reference sequences. We demonstrate the method on data from 396 human gut microbiome samples and identify 7,381 co-abundance gene groups (CAGs), including 741 metagenomic species (MGS). We use these to assemble 238 high-quality microbial genomes and identify affiliations between MGS and hundreds of viruses or genetic entities. Our method provides the means for comprehensive profiling of the diversity within complex metagenomic samples.
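
    The idea of binning co-abundant genes can be illustrated with a toy example. The Python sketch below is not the authors' method: the greedy seed-based grouping, the Pearson correlation criterion, the 0.9 threshold, and the gene names and abundance values are all assumptions made for illustration. It groups genes whose abundance profiles across samples rise and fall together.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length abundance profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    # A constant profile has no defined correlation; treat it as uncorrelated.
    return cov / (sd_x * sd_y) if sd_x and sd_y else 0.0


def bin_co_abundant(profiles, threshold=0.9):
    """Greedily assign each gene to the first group whose seed gene it correlates with."""
    groups = []
    for gene, profile in profiles.items():
        for group in groups:
            seed_profile = profiles[group[0]]
            if pearson(profile, seed_profile) >= threshold:
                group.append(gene)
                break
        else:
            # No existing group matched, so this gene seeds a new group.
            groups.append([gene])
    return groups


# Hypothetical abundances of four genes across four samples.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.1, 3.9, 6.2, 8.0],
    "geneC": [4.0, 3.0, 2.0, 1.0],
    "geneD": [8.1, 6.0, 3.9, 2.0],
}
groups = bin_co_abundant(profiles)
print(groups)
```

    In this toy data, geneA and geneB increase together while geneC and geneD decrease together, so the sketch separates them into two groups without consulting any reference sequence.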

  19. Toward unification of taxonomy databases in a distributed computer environment

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Kitakami, Hajime; Tateno, Yoshio; Gojobori, Takashi

    1994-12-31

    All the taxonomy databases constructed with the DNA databases of the international DNA data banks are powerful electronic dictionaries which aid in biological research by computer. The taxonomy databases are, however, not consistently unified with a relational format. If we can achieve consistent unification of the taxonomy databases, it will be useful in comparing many research results, and investigating future research directions from existent research results. In particular, it will be useful in comparing relationships between phylogenetic trees inferred from molecular data and those constructed from morphological data. The goal of the present study is to unify the existent taxonomy databases and eliminate inconsistencies (errors) that are present in them. Inconsistencies occur particularly in the restructuring of the existent taxonomy databases, since classification rules for constructing the taxonomy have rapidly changed with biological advancements. A repair system is needed to remove inconsistencies in each data bank and mismatches among data banks. This paper describes a new methodology for removing both inconsistencies and mismatches from the databases on a distributed computer environment. The methodology is implemented in a relational database management system, SYBASE.

  20. Causal biological network database: a comprehensive platform of causal biological network models focused on the pulmonary and vascular systems.

    PubMed

    Boué, Stéphanie; Talikka, Marja; Westra, Jurjen Willem; Hayes, William; Di Fabio, Anselmo; Park, Jennifer; Schlage, Walter K; Sewer, Alain; Fields, Brett; Ansari, Sam; Martin, Florian; Veljkovic, Emilija; Kenney, Renee; Peitsch, Manuel C; Hoeng, Julia

    2015-01-01

    With the wealth of publications and data available, powerful and transparent computational approaches are required to represent measured data and scientific knowledge in a computable and searchable format. We developed a set of biological network models, scripted in the Biological Expression Language, that reflect causal signaling pathways across a wide range of biological processes, including cell fate, cell stress, cell proliferation, inflammation, tissue repair and angiogenesis in the pulmonary and cardiovascular context. This comprehensive collection of networks is now freely available to the scientific community in a centralized web-based repository, the Causal Biological Network database, which is composed of over 120 manually curated and well annotated biological network models and can be accessed at http://causalbionet.com. The website accesses a MongoDB, which stores all versions of the networks as JSON objects and allows users to search for genes, proteins, biological processes, small molecules and keywords in the network descriptions to retrieve biological networks of interest. The content of the networks can be visualized and browsed. Nodes and edges can be filtered and all supporting evidence for the edges can be browsed and is linked to the original articles in PubMed. Moreover, networks may be downloaded for further visualization and evaluation. Database URL: http://causalbionet.com © The Author(s) 2015. Published by Oxford University Press.
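
    Storing networks as JSON objects and searching them by keyword can be sketched in a few lines. The Python example below is only an illustration: the field names (`name`, `version`, `description`, `nodes`), the two example networks, and the `search_networks` helper are hypothetical and do not reflect the repository's actual schema or query interface. It uses in-memory data rather than a MongoDB server so that it runs on its own.

```python
import json

# Two made-up network documents in JSON form (hypothetical schema).
networks_json = """
[
  {"name": "Example network 1", "version": 1,
   "description": "hypothetical cell stress model",
   "nodes": ["TP53", "apoptosis"]},
  {"name": "Example network 2", "version": 1,
   "description": "hypothetical angiogenesis model",
   "nodes": ["VEGFA"]}
]
"""


def search_networks(networks, term):
    """Return names of networks whose description or node labels contain the term."""
    term = term.lower()
    hits = []
    for network in networks:
        searchable = [network["description"]] + network["nodes"]
        if any(term in text.lower() for text in searchable):
            hits.append(network["name"])
    return hits


networks = json.loads(networks_json)
matches = search_networks(networks, "tp53")
print(matches)
```

    A document store suits this kind of data because each network version can be kept as one self-contained object, and a keyword search only needs to look inside that object.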

  1. Statistics and bioinformatics in nutritional sciences: analysis of complex data in the era of systems biology

    PubMed Central

    Fu, Wenjiang J.; Stromberg, Arnold J.; Viele, Kert; Carroll, Raymond J.; Wu, Guoyao

    2009-01-01

    Over the past two decades, there have been revolutionary developments in life science technologies characterized by high throughput, high efficiency, and rapid computation. Nutritionists now have the advanced methodologies for the analysis of DNA, RNA, protein, low-molecular-weight metabolites, as well as access to bioinformatics databases. Statistics, which can be defined as the process of making scientific inferences from data that contain variability, has historically played an integral role in advancing nutritional sciences. Currently, in the era of systems biology, statistics has become an increasingly important tool to quantitatively analyze information about biological macromolecules. This article describes general terms used in statistical analysis of large, complex experimental data. These terms include experimental design, power analysis, sample size calculation, and experimental errors (type I and II errors) for nutritional studies at population, tissue, cellular, and molecular levels. In addition, we highlighted various sources of experimental variations in studies involving microarray gene expression, real-time polymerase chain reaction, proteomics, and other bioinformatics technologies. Moreover, we provided guidelines for nutritionists and other biomedical scientists to plan and conduct studies and to analyze the complex data. Appropriate statistical analyses are expected to make an important contribution to solving major nutrition-associated problems in humans and animals (including obesity, diabetes, cardiovascular disease, cancer, ageing, and intrauterine fetal retardation). PMID:20233650

  2. “NaKnowBase”: A Nanomaterials Relational Database

    EPA Science Inventory

    NaKnowBase is an internal relational database populated with data from peer-reviewed ORD nanomaterials research publications. The database focuses on papers describing the actions of nanomaterials in environmental or biological media including their interactions, transformations...

  3. Resolving the problem of multiple accessions of the same transcript deposited across various public databases.

    PubMed

    Weirick, Tyler; John, David; Uchida, Shizuka

    2017-03-01

    Maintaining the consistency of genomic annotations is an increasingly complex task because of the iterative and dynamic nature of assembly and annotation, growing numbers of biological databases and insufficient integration of annotations across databases. As information exchange among databases is poor, a 'novel' sequence from one reference annotation could be annotated in another. Furthermore, relationships to nearby or overlapping annotated transcripts are even more complicated when using different genome assemblies. To better understand these problems, we surveyed current and previous versions of genomic assemblies and annotations across a number of public databases containing long noncoding RNA. We identified numerous discrepancies of transcripts regarding their genomic locations, transcript lengths and identifiers. Further investigation showed that the positional differences between reference annotations of essentially the same transcript could lead to differences in its measured expression at the RNA level. To aid in resolving these problems, we present the algorithm 'Universal Genomic Accession Hash (UGAHash)' and created an open source web tool to encourage the usage of the UGAHash algorithm. The UGAHash web tool (http://ugahash.uni-frankfurt.de) can be accessed freely without registration. The web tool allows researchers to generate Universal Genomic Accessions for genomic features or to explore annotations deposited in the public databases of the past and present versions. We anticipate that the UGAHash web tool will be a valuable tool to check for the existence of transcripts before judging the newly discovered transcripts as novel. © The Author 2016. Published by Oxford University Press. For Permissions, please email: journals.permissions@oup.com.

  4. dbMDEGA: a database for meta-analysis of differentially expressed genes in autism spectrum disorder.

    PubMed

    Zhang, Shuyun; Deng, Libin; Jia, Qiyue; Huang, Shaoting; Gu, Junwang; Zhou, Fankun; Gao, Meng; Sun, Xinyi; Feng, Chang; Fan, Guangqin

    2017-11-16

    Autism spectrum disorders (ASD) are hereditary, heterogeneous and biologically complex neurodevelopmental disorders. Individual studies on gene expression in ASD cannot provide clear consensus conclusions. Therefore, a systematic review to synthesize the current findings from brain tissues and a search tool to share the meta-analysis results are urgently needed. Here, we conducted a meta-analysis of brain gene expression profiles in the current reported human ASD expression datasets (with 84 frozen male cortex samples, 17 female cortex samples, 32 cerebellum samples and 4 formalin fixed samples) and knock-out mouse ASD model expression datasets (with 80 collective brain samples). Then, we applied R language software and developed an interactive shared and updated database (dbMDEGA) displaying the results of meta-analysis of data from ASD studies regarding differentially expressed genes (DEGs) in the brain. This database, dbMDEGA ( https://dbmdega.shinyapps.io/dbMDEGA/ ), is a publicly available web-portal for manual annotation and visualization of DEGs in the brain from data from ASD studies. This database uniquely presents meta-analysis values and homologous forest plots of DEGs in brain tissues. Gene entries are annotated with meta-values, statistical values and forest plots of DEGs in brain samples. This database aims to provide searchable meta-analysis results based on the current reported brain gene expression datasets of ASD to help detect candidate genes underlying this disorder. This new analytical tool may provide valuable assistance in the discovery of DEGs and the elucidation of the molecular pathogenicity of ASD. This database model may be replicated to study other disorders.

  5. PhytoREF: a reference database of the plastidial 16S rRNA gene of photosynthetic eukaryotes with curated taxonomy.

    PubMed

    Decelle, Johan; Romac, Sarah; Stern, Rowena F; Bendif, El Mahdi; Zingone, Adriana; Audic, Stéphane; Guiry, Michael D; Guillou, Laure; Tessier, Désiré; Le Gall, Florence; Gourvil, Priscillia; Dos Santos, Adriana L; Probert, Ian; Vaulot, Daniel; de Vargas, Colomban; Christen, Richard

    2015-11-01

    Photosynthetic eukaryotes have a critical role as the main producers in most ecosystems of the biosphere. The ongoing environmental metabarcoding revolution opens the perspective for holistic ecosystems biological studies of these organisms, in particular the unicellular microalgae that often lack distinctive morphological characters and have complex life cycles. To interpret environmental sequences, metabarcoding necessarily relies on taxonomically curated databases containing reference sequences of the targeted gene (or barcode) from identified organisms. To date, no such reference framework exists for photosynthetic eukaryotes. In this study, we built the PhytoREF database that contains 6490 plastidial 16S rDNA reference sequences that originate from a large diversity of eukaryotes representing all known major photosynthetic lineages. We compiled 3333 amplicon sequences available from public databases and 879 sequences extracted from plastidial genomes, and generated 411 novel sequences from cultured marine microalgal strains belonging to different eukaryotic lineages. A total of 1867 environmental Sanger 16S rDNA sequences were also included in the database. Stringent quality filtering and a phylogeny-based taxonomic classification were applied for each 16S rDNA sequence. The database mainly focuses on marine microalgae, but sequences from land plants (representing half of the PhytoREF sequences) and freshwater taxa were also included to broaden the applicability of PhytoREF to different aquatic and terrestrial habitats. PhytoREF, accessible via a web interface (http://phytoref.fr), is a new resource in molecular ecology to foster the discovery, assessment and monitoring of the diversity of photosynthetic eukaryotes using high-throughput sequencing. © 2015 John Wiley & Sons Ltd.

  6. Sting_RDB: a relational database of structural parameters for protein analysis with support for data warehousing and data mining.

    PubMed

    Oliveira, S R M; Almeida, G V; Souza, K R R; Rodrigues, D N; Kuser-Falcão, P R; Yamagishi, M E B; Santos, E H; Vieira, F D; Jardine, J G; Neshich, G

    2007-10-05

    An effective strategy for managing protein databases is to provide mechanisms to transform raw data into consistent, accurate and reliable information. Such mechanisms will greatly reduce operational inefficiencies and improve one's ability to better handle scientific objectives and interpret the research results. To achieve this challenging goal for the STING project, we introduce Sting_RDB, a relational database of structural parameters for protein analysis with support for data warehousing and data mining. In this article, we highlight the main features of Sting_RDB and show how a user can explore it for efficient and biologically relevant queries. Considering its importance for molecular biologists, effort has been made to advance Sting_RDB toward data quality assessment. To the best of our knowledge, Sting_RDB is one of the most comprehensive data repositories for protein analysis, now also capable of providing its users with a data quality indicator. This paper differs from our previous study in many aspects. First, we introduce Sting_RDB, a relational database with mechanisms for efficient and relevant queries using SQL. Sting_RDB evolved from the earlier, text (flat file)-based database, in which data consistency and integrity were not guaranteed. Second, we provide support for data warehousing and mining. Third, the data quality indicator was introduced. Finally, and probably most importantly, complex queries that could not be posed on a text-based database are now easily implemented. Further details are accessible at the Sting_RDB demo web page: http://www.cbi.cnptia.embrapa.br/StingRDB.

  7. Comprehensive curation and analysis of global interaction networks in Saccharomyces cerevisiae

    PubMed Central

    Reguly, Teresa; Breitkreutz, Ashton; Boucher, Lorrie; Breitkreutz, Bobby-Joe; Hon, Gary C; Myers, Chad L; Parsons, Ainslie; Friesen, Helena; Oughtred, Rose; Tong, Amy; Stark, Chris; Ho, Yuen; Botstein, David; Andrews, Brenda; Boone, Charles; Troyanskya, Olga G; Ideker, Trey; Dolinski, Kara; Batada, Nizar N; Tyers, Mike

    2006-01-01

    Background The study of complex biological networks and prediction of gene function has been enabled by high-throughput (HTP) methods for detection of genetic and protein interactions. Sparse coverage in HTP datasets may, however, distort network properties and confound predictions. Although a vast number of well substantiated interactions are recorded in the scientific literature, these data have not yet been distilled into networks that enable system-level inference. Results We describe here a comprehensive database of genetic and protein interactions, and associated experimental evidence, for the budding yeast Saccharomyces cerevisiae, as manually curated from over 31,793 abstracts and online publications. This literature-curated (LC) dataset contains 33,311 interactions, on the order of all extant HTP datasets combined. Surprisingly, HTP protein-interaction datasets currently achieve only around 14% coverage of the interactions in the literature. The LC network nevertheless shares attributes with HTP networks, including scale-free connectivity and correlations between interactions, abundance, localization, and expression. We find that essential genes or proteins are enriched for interactions with other essential genes or proteins, suggesting that the global network may be functionally unified. This interconnectivity is supported by a substantial overlap of protein and genetic interactions in the LC dataset. We show that the LC dataset considerably improves the predictive power of network-analysis approaches. The full LC dataset is available at the BioGRID and SGD databases. Conclusion Comprehensive datasets of biological interactions derived from the primary literature provide critical benchmarks for HTP methods, augment functional prediction, and reveal system-level attributes of biological networks. PMID:16762047

  8. Effects of threshold on the topology of gene co-expression networks.

    PubMed

    Couto, Cynthia Martins Villar; Comin, César Henrique; Costa, Luciano da Fontoura

    2017-09-26

    Several developments regarding the analysis of gene co-expression profiles using complex network theory have been reported recently. Such approaches usually start with the construction of an unweighted gene co-expression network, therefore requiring the selection of a suitable threshold defining which pairs of vertices will be connected. We aimed at addressing such an important problem by suggesting and comparing five different approaches for threshold selection. Each of the methods considers a respective biologically-motivated criterion for electing a potentially suitable threshold. A set of 21 microarray experiments from different biological groups was used to investigate the effect of applying the five proposed criteria to several biological situations. For each experiment, we used the Pearson correlation coefficient to measure the relationship between each gene pair, and the resulting weight matrices were thresholded considering several values, generating respective adjacency matrices (co-expression networks). Each of the five proposed criteria was then applied in order to select the respective threshold value. The effects of these thresholding approaches on the topology of the resulting networks were compared by using several measurements, and we verified that, depending on the database, the impact on the topological properties can be large. However, a group of databases was verified to be similarly affected by most of the considered criteria. Based on such results, it can be suggested that when the generated networks present similar measurements, the thresholding method can be chosen with greater freedom. If the generated networks are markedly different, the thresholding method that better suits the interests of each specific research study represents a reasonable choice.
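
    The thresholding step described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' code: the gene names and expression values are made up, and comparing the absolute value of the Pearson coefficient against the threshold is an assumption, since the abstract does not say how negative correlations were handled.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # a constant profile has no defined correlation
    return cov / (sd_x * sd_y)


def coexpression_adjacency(profiles, threshold):
    """Build an unweighted network: connect two genes when |r| >= threshold.

    Using |r| is an assumption made for this sketch.
    """
    genes = sorted(profiles)
    adjacency = {gene: set() for gene in genes}
    for i, gene in enumerate(genes):
        for other in genes[i + 1:]:
            if abs(pearson(profiles[gene], profiles[other])) >= threshold:
                adjacency[gene].add(other)
                adjacency[other].add(gene)
    return adjacency


# Hypothetical expression profiles (four conditions per gene).
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.0],
    "geneC": [4.0, 3.0, 2.0, 1.0],
    "geneD": [1.0, 3.0, 2.0, 4.0],
}
strict = coexpression_adjacency(profiles, 0.9)
loose = coexpression_adjacency(profiles, 0.7)
```

    In this toy data set the stricter threshold leaves geneD isolated while the looser one connects it to every other gene, which is the kind of topology change the study measures.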

  9. Quantum Private Queries

    NASA Astrophysics Data System (ADS)

    Giovannetti, Vittorio; Lloyd, Seth; Maccone, Lorenzo

    2008-06-01

    We propose a cheat sensitive quantum protocol to perform a private search on a classical database which is efficient in terms of communication complexity. It allows a user to retrieve an item from the database provider without revealing which item he or she retrieved: if the provider tries to obtain information on the query, the person querying the database can find it out. The protocol ensures also perfect data privacy of the database: the information that the user can retrieve in a single query is bounded and does not depend on the size of the database. With respect to the known (quantum and classical) strategies for private information retrieval, our protocol displays an exponential reduction in communication complexity and in running-time computational complexity.

  10. Development of SRS.php, a Simple Object Access Protocol-based library for data acquisition from integrated biological databases.

    PubMed

    Barbosa-Silva, A; Pafilis, E; Ortega, J M; Schneider, R

    2007-12-11

    Data integration has become an important task for biological database providers. The current model for data exchange among different sources simplifies the manner that distinct information is accessed by users. The evolution of data representation from HTML to XML enabled programs, instead of humans, to interact with biological databases. We present here SRS.php, a PHP library that can interact with the data integration Sequence Retrieval System (SRS). The library has been written using SOAP definitions, and permits the programmatic communication through webservices with the SRS. The interactions are possible by invoking the methods described in WSDL by exchanging XML messages. The current functions available in the library have been built to access specific data stored in any of the 90 different databases (such as UNIPROT, KEGG and GO) using the same query syntax format. The inclusion of the described functions in the source of scripts written in PHP enables them as webservice clients to the SRS server. The functions permit one to query the whole content of any SRS database, to list specific records in these databases, to get specific fields from the records, and to link any record among any pair of linked databases. The case study presented exemplifies the library usage to retrieve information regarding registries of a Plant Defense Mechanisms database. The Plant Defense Mechanisms database is currently being developed, and the proposal of SRS.php library usage is to enable the data acquisition for the further warehousing tasks related to its setup and maintenance.

  11. PathwayAccess: CellDesigner plugins for pathway databases.

    PubMed

    Van Hemert, John L; Dickerson, Julie A

    2010-09-15

    CellDesigner provides a user-friendly interface for graphical biochemical pathway description. Many pathway databases are not directly exportable to CellDesigner models. PathwayAccess is an extensible suite of CellDesigner plugins, which connect CellDesigner directly to pathway databases using respective Java application programming interfaces. The process is streamlined for creating new PathwayAccess plugins for specific pathway databases. Three PathwayAccess plugins, MetNetAccess, BioCycAccess and ReactomeAccess, directly connect CellDesigner to the pathway databases MetNetDB, BioCyc and Reactome. PathwayAccess plugins enable CellDesigner users to expose pathway data to analytical CellDesigner functions, curate their pathway databases and visually integrate pathway data from different databases using standard Systems Biology Markup Language and Systems Biology Graphical Notation. Implemented in Java, PathwayAccess plugins run with CellDesigner version 4.0.1 and were tested on Ubuntu Linux, Windows XP and 7, and MacOSX. Source code, binaries, documentation and video walkthroughs are freely available at http://vrac.iastate.edu/~jlv.

  12. Feature selection and classification of protein-protein complexes based on their binding affinities using machine learning approaches.

    PubMed

    Yugandhar, K; Gromiha, M Michael

    2014-09-01

    Protein-protein interactions are intrinsic to virtually every cellular process. Predicting the binding affinity of protein-protein complexes is one of the challenging problems in computational and molecular biology. In this work, we related sequence features of protein-protein complexes with their binding affinities using machine learning approaches. We set up a database of 185 protein-protein complexes for which the interacting pairs are heterodimers and their experimental binding affinities are available. On the other hand, we have developed a set of 610 features from the sequences of protein complexes and utilized Ranker search method, which is the combination of Attribute evaluator and Ranker method for selecting specific features. We have analyzed several machine learning algorithms to discriminate protein-protein complexes into high and low affinity groups based on their Kd values. Our results showed a 10-fold cross-validation accuracy of 76.1% with the combination of nine features using support vector machines. Further, we observed accuracy of 83.3% on an independent test set of 30 complexes. We suggest that our method would serve as an effective tool for identifying the interacting partners in protein-protein interaction networks and human-pathogen interactions based on the strength of interactions. © 2014 Wiley Periodicals, Inc.

  13. PyPathway: Python Package for Biological Network Analysis and Visualization.

    PubMed

    Xu, Yang; Luo, Xiao-Chun

    2018-05-01

    Life science studies represent one of the biggest generators of large data sets, mainly because of rapid sequencing technological advances. Biological networks including interactive networks and human curated pathways are essential to understand these high-throughput data sets. Biological network analysis offers a method to explore systematically not only the molecular complexity of a particular disease but also the molecular relationships among apparently distinct phenotypes. Currently, several packages for Python community have been developed, such as BioPython and Goatools. However, tools to perform comprehensive network analysis and visualization are still needed. Here, we have developed PyPathway, an extensible free and open source Python package for functional enrichment analysis, network modeling, and network visualization. The network process module supports various interaction network and pathway databases such as Reactome, WikiPathway, STRING, and BioGRID. The network analysis module implements overrepresentation analysis, gene set enrichment analysis, network-based enrichment, and de novo network modeling. Finally, the visualization and data publishing modules enable users to share their analysis by using an easy web application. For package availability, see the first Reference.

  14. The CH/π hydrogen bond: Implication in chemistry

    NASA Astrophysics Data System (ADS)

    Nishio, M.

    2012-06-01

    The CH/π hydrogen bond is the weakest extreme of hydrogen bonds that occurs between a soft acid CH and a soft base π-system. Implication in chemistry of the CH/π hydrogen bond includes issues of conformation, crystal packing, and specificity in host/guest complexes. The result obtained by analyzing the Cambridge Structural Database is reviewed. The peculiar axial preference of isopropyl group in α-phellandrene and folded conformation of levopimaric acid have been explained in terms of the CH/π hydrogen bond, by high-level ab initio MO calculations. Implication of the CH/π hydrogen bond in structural biology is also discussed, briefly.

  15. NeuPAT: an intranet database supporting translational research in neuroblastic tumors.

    PubMed

    Villamón, Eva; Piqueras, Marta; Meseguer, Javier; Blanquer, Ignacio; Berbegall, Ana P; Tadeo, Irene; Hernández, Vicente; Navarro, Samuel; Noguera, Rosa

    2013-03-01

    Translational research in oncology is directed mainly towards establishing a better risk stratification and searching for appropriate therapeutic targets. This research generates a tremendous amount of complex clinical and biological data needing speedy and effective management. The authors describe the design, implementation and early experiences of a computer-aided system for the integration and management of data for neuroblastoma patients. NeuPAT facilitates clinical and translational research, minimizes the workload in consolidating the information, reduces errors and increases correlation of data through extensive coding. This design can also be applied to other tumor types. Copyright © 2012 Elsevier Ltd. All rights reserved.

  16. A systems biology model of the regulatory network in Populus leaves reveals interacting regulators and conserved regulation

    PubMed Central

    2011-01-01

    Background Green plant leaves have always fascinated biologists as hosts for photosynthesis and providers of basic energy to many food webs. Today, comprehensive databases of gene expression data enable us to apply increasingly more advanced computational methods for reverse-engineering the regulatory network of leaves, and to begin to understand the gene interactions underlying complex emergent properties related to stress-response and development. These new systems biology methods are now also being applied to organisms such as Populus, a woody perennial tree, in order to understand the specific characteristics of these species. Results We present a systems biology model of the regulatory network of Populus leaves. The network is reverse-engineered from promoter information and expression profiles of leaf-specific genes measured over a large set of conditions related to stress and development. The network model incorporates interactions between regulators, such as synergistic and competitive relationships, by evaluating increasingly more complex regulatory mechanisms, and is therefore able to identify new regulators of leaf development not found by traditional genomics methods based on pair-wise expression similarity. The approach is shown to explain available gene function information and to provide robust prediction of expression levels in new data. We also use the predictive capability of the model to identify condition-specific regulation as well as conserved regulation between Populus and Arabidopsis. Conclusions We outline a computationally inferred model of the regulatory network of Populus leaves, and show how treating genes as interacting, rather than individual, entities identifies new regulators compared to traditional genomics analysis. Although systems biology models should be used with care considering the complexity of regulatory programs and the limitations of current genomics data, methods describing interactions can provide hypotheses about the underlying cause of emergent properties and are needed if we are to identify target genes other than those constituting the "low hanging fruit" of genomic analysis. PMID:21232107

  17. Literature classification for semi-automated updating of biological knowledgebases

    PubMed Central

    2013-01-01

    Background As the output of biological assays increases in resolution and volume, the body of specialized biological data, such as functional annotations of gene and protein sequences, enables extraction of higher-level knowledge needed for practical application in bioinformatics. Whereas common types of biological data, such as sequence data, are extensively stored in biological databases, functional annotations, such as immunological epitopes, are found primarily in semi-structured formats or free text embedded in primary scientific literature. Results We defined and applied a machine learning approach for literature classification to support updating of TANTIGEN, a knowledgebase of tumor T-cell antigens. Abstracts from PubMed were downloaded and classified as either "relevant" or "irrelevant" for database update. Training and five-fold cross-validation of a k-NN classifier on 310 abstracts yielded classification accuracy of 0.95, thus showing significant value in support of data extraction from the literature. Conclusion We here propose a conceptual framework for semi-automated extraction of epitope data embedded in scientific literature using principles from text mining and machine learning. The addition of such data will aid in the transition of biological databases to knowledgebases. PMID:24564403

  18. Nonlinear dimensionality reduction methods for synthetic biology biobricks' visualization.

    PubMed

    Yang, Jiaoyun; Wang, Haipeng; Ding, Huitong; An, Ning; Alterovitz, Gil

    2017-01-19

    Visualizing data by dimensionality reduction is an important strategy in bioinformatics, which can help to discover hidden data properties and detect data quality issues, e.g. data noise, inappropriately labeled data, etc. As crowdsourcing-based synthetic biology databases face similar data quality issues, we propose to visualize biobricks to tackle them. However, existing dimensionality reduction methods cannot be directly applied to biobricks datasets. Here, we use normalized edit distance to enhance dimensionality reduction methods, including Isomap and Laplacian Eigenmaps. By extracting biobricks from the synthetic biology database Registry of Standard Biological Parts, six combinations of various types of biobricks are tested. The visualization graphs illustrate discriminated biobricks and inappropriately labeled biobricks. The clustering algorithm K-means is adopted to quantify the reduction results. The average clustering accuracies for Isomap and Laplacian Eigenmaps are 0.857 and 0.844, respectively. In addition, Laplacian Eigenmaps is 5 times faster than Isomap, and its visualization graph is more concentrated to discriminate biobricks. By combining normalized edit distance with Isomap and Laplacian Eigenmaps, synthetic biology biobricks are successfully visualized in two-dimensional space. Various types of biobricks can be discriminated and inappropriately labeled biobricks can be identified, which could help to assess the quality of crowdsourcing-based synthetic biology databases and to select biobricks.
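
    A normalized edit distance of the kind mentioned in this abstract could look like the Python sketch below. The abstract does not give the normalization, so dividing the Levenshtein distance by the length of the longer sequence is an assumption, and the example sequences are invented.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                 # delete char_a
                current[j - 1] + 1,              # insert char_b
                previous[j - 1] + substitution,  # match or substitute
            ))
        previous = current
    return previous[-1]


def normalized_edit_distance(a, b):
    """Edit distance scaled into [0, 1] by the longer length (assumed scheme)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    return edit_distance(a, b) / longest


# Hypothetical part sequences; one substitution in four bases.
distance = normalized_edit_distance("ATGC", "ATGA")
```

    A matrix of such pairwise distances is the kind of input that distance-based methods such as Isomap can work from, which is why a length-independent scale is convenient when sequences differ widely in size.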

  19. “NaKnowBase”: A Nanomaterials Relational Database

    EPA Science Inventory

    NaKnowBase is a relational database populated with data from peer-reviewed ORD nanomaterials research publications. The database focuses on papers describing the actions of nanomaterials in environmental or biological media including their interactions, transformations and poten...

  20. Database constraints applied to metabolic pathway reconstruction tools.

    PubMed

    Vilaplana, Jordi; Solsona, Francesc; Teixido, Ivan; Usié, Anabel; Karathia, Hiren; Alves, Rui; Mateo, Jordi

    2014-01-01

    Our group developed two biological applications, Biblio-MetReS and Homol-MetReS, accessing the same database of organisms with annotated genes. Biblio-MetReS is a data-mining application that facilitates the reconstruction of molecular networks based on automated text-mining analysis of published scientific literature. Homol-MetReS allows functional (re)annotation of proteomes, to properly identify both the individual proteins involved in the process(es) of interest and their function. It also enables the sets of proteins involved in the process(es) in different organisms to be compared directly. The efficiency of these biological applications is directly related to the design of the shared database. We classified and analyzed the different kinds of access to the database. Based on this study, we tried to adjust and tune the configurable parameters of the database server to reach the best performance of the communication data link to/from the database system. Different database technologies were analyzed. We started the study with a public relational SQL database, MySQL. Then, the same database was implemented by a MapReduce-based database named HBase. The results indicated that the standard configuration of MySQL gives an acceptable performance for low or medium size databases. Nevertheless, tuning database parameters can greatly improve the performance and lead to very competitive runtimes.

  1. BISQUE: locus- and variant-specific conversion of genomic, transcriptomic and proteomic database identifiers.

    PubMed

    Meyer, Michael J; Geske, Philip; Yu, Haiyuan

    2016-05-15

    Biological sequence databases are integral to efforts to characterize and understand biological molecules and share biological data. However, when analyzing these data, scientists are often left holding disparate biological currency: molecular identifiers from different databases. For downstream applications that require converting the identifiers themselves, there are many resources available, but analyzing associated loci and variants can be cumbersome if data is not given in a form amenable to particular analyses. Here we present BISQUE, a web server and customizable command-line tool for converting molecular identifiers and their contained loci and variants between different database conventions. BISQUE uses a graph traversal algorithm to generalize the conversion process for residues in the human genome, genes, transcripts and proteins, allowing for conversion across classes of molecules and in all directions through an intuitive web interface and a URL-based web service. BISQUE is freely available via the web using any major web browser (http://bisque.yulab.org/). Source code is available in a public GitHub repository (https://github.com/hyulab/BISQUE). Contact: haiyuan.yu@cornell.edu. Supplementary data are available at Bioinformatics online. © The Author 2016. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.
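
    The idea of generalizing identifier conversion through graph traversal can be sketched in Python as below. This is not BISQUE's implementation: the node names, the edge list and the function conversion_path are hypothetical, and breadth-first search is simply one traversal that finds a shortest chain of direct mappings.

```python
from collections import deque


def conversion_path(edges, source, target):
    """Return a shortest chain of identifier classes from source to target.

    edges holds pairs of classes with a direct mapping; every mapping is
    treated as usable in both directions (an assumption of this sketch).
    """
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    if source not in graph or target not in graph:
        return None
    queue = deque([[source]])
    seen = {source}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == target:
            return path
        for neighbor in sorted(graph[node]):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None  # no chain of mappings links the two classes


# Hypothetical direct mappings between classes of molecules.
edges = [
    ("genome_position", "transcript"),
    ("gene", "transcript"),
    ("transcript", "protein"),
]
path = conversion_path(edges, "protein", "genome_position")
```

    With direct mappings stored as edges, supporting a new identifier class only means adding its edges; indirect conversions then fall out of the traversal.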

  2. TISSUES 2.0: an integrative web resource on mammalian tissue expression

    PubMed Central

    Palasca, Oana; Santos, Alberto; Stolte, Christian; Gorodkin, Jan; Jensen, Lars Juhl

    2018-01-01

    Physiological and molecular similarities between organisms make it possible to translate findings from simpler experimental systems (model organisms) into more complex ones, such as human. This translation facilitates the understanding of biological processes under normal or disease conditions. Researchers aiming to identify the similarities and differences between organisms at the molecular level need resources collecting multi-organism tissue expression data. We have developed a database of gene–tissue associations in human, mouse, rat and pig by integrating multiple sources of evidence: transcriptomics covering all four species and proteomics (human only), manually curated and mined from the scientific literature. Through a scoring scheme, these associations are made comparable across all sources of evidence and across organisms. Furthermore, the scoring produces a confidence score assigned to each of the associations. The TISSUES database (version 2.0) is publicly accessible through a user-friendly web interface and as part of the STRING app for Cytoscape. In addition, we analyzed the agreement between datasets, across and within organisms, and identified that the agreement is mainly affected by the quality of the datasets rather than by the technologies used or organisms compared. Database URL: http://tissues.jensenlab.org/ PMID:29617745

  3. Designing drugs on the internet? Free web tools and services supporting medicinal chemistry.

    PubMed

    Ertl, Peter; Jelfs, Stephen

    2007-01-01

    The drug discovery process is supported by a multitude of freely available tools on the Internet. This paper summarizes some of the databases and tools that are of particular interest to medicinal chemistry. These include numerous data collections that provide access to valuable chemical data resources, allowing complex queries of compound structures, associated physicochemical properties and biological activities to be performed and, in many cases, providing links to commercial chemical suppliers. Further applications are available for searching protein-ligand complexes and identifying important binding interactions that occur. This is particularly useful for understanding the molecular recognition of ligands in the lead optimization process. The Internet also provides access to databases detailing metabolic pathways and transformations which can provide insight into disease mechanisms and identify new target entities or the potential off-target effects of a drug candidate. Furthermore, sophisticated online cheminformatics tools are available for processing chemical structures, predicting properties, and generating 2D or 3D structure representations, which are often required prior to more advanced analyses. The Internet provides a wealth of valuable resources that, if fully exploited, can greatly benefit the drug discovery community. In this paper, we provide an overview of some of the more important of these and, in particular, the freely accessible resources that are currently available.

  4. Genetic diversity of Histoplasma and Sporothrix complexes based on sequences of their ITS1-5.8S-ITS2 regions from the BOLD System.

    PubMed

    Estrada-Bárcenas, Daniel Alfonso; Vite-Garín, Tania; Navarro-Barranco, Hortensia; de la Torre-Arciniega, Raúl; Pérez-Mejía, Amelia; Rodríguez-Arellanes, Gabriela; Ramirez, Jose Antonio; Humberto Sahaza, Jorge; Taylor, Maria Lucia; Toriello, Conchita

    2014-01-01

    High sensitivity and specificity of molecular biology techniques have proven usefulness for the detection, identification and typing of different pathogens. The ITS (Internal Transcribed Spacer) regions of the ribosomal DNA are highly conserved non-coding regions, and have been widely used in different studies including the determination of the genetic diversity of human fungal pathogens. This article wants to contribute to the understanding of the intra- and interspecific genetic diversity of isolates of the Histoplasma capsulatum and Sporothrix schenckii species complexes by an analysis of the available sequences of the ITS regions from different sequence databases. ITS1-5.8S-ITS2 sequences of each fungus, either deposited in GenBank, or from our research groups (registered in the Fungi Barcode of Life Database), were analyzed using the maximum likelihood (ML) method. ML analysis of the ITS sequences discriminated isolates from distant geographic origins and particular wild hosts, depending on the fungal species analyzed. This manuscript is part of the series of works presented at the "V International Workshop: Molecular genetic approaches to the study of human pathogenic fungi" (Oaxaca, Mexico, 2012). Copyright © 2013 Revista Iberoamericana de Micología. Published by Elsevier Espana. All rights reserved.

  5. LCR-eXXXplorer: a web platform to search, visualize and share data for low complexity regions in protein sequences.

    PubMed

    Kirmitzoglou, Ioannis; Promponas, Vasilis J

    2015-07-01

    Local compositionally biased and low complexity regions (LCRs) in amino acid sequences have initially attracted the interest of researchers due to their implication in generating artifacts in sequence database searches. There is accumulating evidence of the biological significance of LCRs both in physiological and in pathological situations. Nonetheless, LCR-related algorithms and tools have not gained wide appreciation across the research community, partly due to the fact that only a handful of user-friendly software tools are currently freely available. We developed LCR-eXXXplorer, an extensible online platform attempting to fill this gap. LCR-eXXXplorer offers tools for displaying LCRs from the UniProt/SwissProt knowledgebase, in combination with other relevant protein features, predicted or experimentally verified. Moreover, users may perform powerful queries against a custom designed sequence/LCR-centric database. We anticipate that LCR-eXXXplorer will be a useful starting point in research efforts for the elucidation of the structure, function and evolution of proteins with LCRs. LCR-eXXXplorer is freely available at the URL http://repeat.biol.ucy.ac.cy/lcr-exxxplorer. Contact: vprobon@ucy.ac.cy. Supplementary data are available at Bioinformatics online. © The Author 2015. Published by Oxford University Press.

  6. Patome: a database server for biological sequence annotation and analysis in issued patents and published patent applications.

    PubMed

    Lee, Byungwook; Kim, Taehyung; Kim, Seon-Kyu; Lee, Kwang H; Lee, Doheon

    2007-01-01

    With the advent of automated and high-throughput techniques, the number of patent applications containing biological sequences has been increasing rapidly. However, they have attracted relatively little attention compared to other sequence resources. We have built a database server called Patome, which contains biological sequence data disclosed in patents and published applications, as well as their analysis information. The analysis is divided into two steps. The first is an annotation step in which the disclosed sequences were annotated with RefSeq database. The second is an association step where the sequences were linked to Entrez Gene, OMIM and GO databases, and their results were saved as a gene-patent table. From the analysis, we found that 55% of human genes were associated with patenting. The gene-patent table can be used to identify whether a particular gene or disease is related to patenting. Patome is available at http://www.patome.org/; the information is updated bimonthly.

  7. Patome: a database server for biological sequence annotation and analysis in issued patents and published patent applications

    PubMed Central

    Lee, Byungwook; Kim, Taehyung; Kim, Seon-Kyu; Lee, Kwang H.; Lee, Doheon

    2007-01-01

    With the advent of automated and high-throughput techniques, the number of patent applications containing biological sequences has been increasing rapidly. However, they have attracted relatively little attention compared to other sequence resources. We have built a database server called Patome, which contains biological sequence data disclosed in patents and published applications, as well as their analysis information. The analysis is divided into two steps. The first is an annotation step in which the disclosed sequences were annotated with RefSeq database. The second is an association step where the sequences were linked to Entrez Gene, OMIM and GO databases, and their results were saved as a gene–patent table. From the analysis, we found that 55% of human genes were associated with patenting. The gene–patent table can be used to identify whether a particular gene or disease is related to patenting. Patome is available at http://www.patome.org/; the information is updated bimonthly. PMID:17085479

  8. Biological knowledge bases using Wikis: combining the flexibility of Wikis with the structure of databases.

    PubMed

    Brohée, Sylvain; Barriot, Roland; Moreau, Yves

    2010-09-01

    In recent years, the number of knowledge bases developed using Wiki technology has exploded. Unfortunately, next to their numerous advantages, classical Wikis present a critical limitation: the invaluable knowledge they gather is represented as free text, which hinders their computational exploitation. This is in sharp contrast with the current practice for biological databases where the data is made available in a structured way. Here, we present WikiOpener, an extension for the classical MediaWiki engine that augments Wiki pages by allowing on-the-fly querying and formatting resources external to the Wiki. Those resources may provide data extracted from databases or DAS tracks, or even results returned by local or remote bioinformatics analysis tools. This also implies that structured data can be edited via dedicated forms. Hence, this generic resource combines the structure of biological databases with the flexibility of collaborative Wikis. The source code and its documentation are freely available on the MediaWiki website: http://www.mediawiki.org/wiki/Extension:WikiOpener.

  9. BioMart Central Portal: an open database network for the biological community.

    PubMed

    Guberman, Jonathan M; Ai, J; Arnaiz, O; Baran, Joachim; Blake, Andrew; Baldock, Richard; Chelala, Claude; Croft, David; Cros, Anthony; Cutts, Rosalind J; Di Génova, A; Forbes, Simon; Fujisawa, T; Gadaleta, E; Goodstein, D M; Gundem, Gunes; Haggarty, Bernard; Haider, Syed; Hall, Matthew; Harris, Todd; Haw, Robin; Hu, S; Hubbard, Simon; Hsu, Jack; Iyer, Vivek; Jones, Philip; Katayama, Toshiaki; Kinsella, R; Kong, Lei; Lawson, Daniel; Liang, Yong; Lopez-Bigas, Nuria; Luo, J; Lush, Michael; Mason, Jeremy; Moreews, Francois; Ndegwa, Nelson; Oakley, Darren; Perez-Llamas, Christian; Primig, Michael; Rivkin, Elena; Rosanoff, S; Shepherd, Rebecca; Simon, Reinhard; Skarnes, B; Smedley, Damian; Sperling, Linda; Spooner, William; Stevenson, Peter; Stone, Kevin; Teague, J; Wang, Jun; Wang, Jianxin; Whitty, Brett; Wong, D T; Wong-Erasmus, Marie; Yao, L; Youens-Clark, Ken; Yung, Christina; Zhang, Junjun; Kasprzyk, Arek

    2011-01-01

    BioMart Central Portal is a first of its kind, community-driven effort to provide unified access to dozens of biological databases spanning genomics, proteomics, model organisms, cancer data, ontology information and more. Anybody can contribute an independently maintained resource to the Central Portal, allowing it to be exposed to and shared with the research community, and linking it with the other resources in the portal. Users can take advantage of the common interface to quickly utilize different sources without learning a new system for each. The system also simplifies cross-database searches that might otherwise require several complicated steps. Several integrated tools streamline common tasks, such as converting between ID formats and retrieving sequences. The combination of a wide variety of databases, an easy-to-use interface, robust programmatic access and the array of tools make Central Portal a one-stop shop for biological data querying. Here, we describe the structure of Central Portal and show example queries to demonstrate its capabilities.

  10. Sagace: A web-based search engine for biomedical databases in Japan

    PubMed Central

    2012-01-01

    Background: In the big data era, biomedical research continues to generate a large amount of data, and the generated information is often stored in a database and made publicly available. Although combining data from multiple databases should accelerate further studies, the current number of life sciences databases is too large to grasp features and contents of each database. Findings: We have developed Sagace, a web-based search engine that enables users to retrieve information from a range of biological databases (such as gene expression profiles and proteomics data) and biological resource banks (such as mouse models of disease and cell lines). With Sagace, users can search more than 300 databases in Japan. Sagace offers features tailored to biomedical research, including manually tuned ranking, a faceted navigation to refine search results, and rich snippets constructed with retrieved metadata for each database entry. Conclusions: Sagace will be valuable for experts who are involved in biomedical research and drug development in both academia and industry. Sagace is freely available at http://sagace.nibio.go.jp/en/. PMID:23110816

  11. From 20th century metabolic wall charts to 21st century systems biology: database of mammalian metabolic enzymes

    PubMed Central

    Corcoran, Callan C.; Grady, Cameron R.; Pisitkun, Trairak; Parulekar, Jaya

    2017-01-01

    The organization of the mammalian genome into gene subsets corresponding to specific functional classes has provided key tools for systems biology research. Here, we have created a web-accessible resource called the Mammalian Metabolic Enzyme Database (https://hpcwebapps.cit.nih.gov/ESBL/Database/MetabolicEnzymes/MetabolicEnzymeDatabase.html) keyed to the biochemical reactions represented on iconic metabolic pathway wall charts created in the previous century. Overall, we have mapped 1,647 genes to these pathways, representing ~7 percent of the protein-coding genome. To illustrate the use of the database, we apply it to the area of kidney physiology. In so doing, we have created an additional database (Database of Metabolic Enzymes in Kidney Tubule Segments: https://hpcwebapps.cit.nih.gov/ESBL/Database/MetabolicEnzymes/), mapping mRNA abundance measurements (mined from RNA-Seq studies) for all metabolic enzymes to each of 14 renal tubule segments. We carry out bioinformatics analysis of the enzyme expression pattern among renal tubule segments and mine various data sources to identify vasopressin-regulated metabolic enzymes in the renal collecting duct. PMID:27974320

  12. A global approach to analysis and interpretation of metabolic data for plant natural product discovery.

    PubMed

    Hur, Manhoi; Campbell, Alexis Ann; Almeida-de-Macedo, Marcia; Li, Ling; Ransom, Nick; Jose, Adarsh; Crispin, Matt; Nikolau, Basil J; Wurtele, Eve Syrkin

    2013-04-01

    Discovering molecular components and their functionality is key to the development of hypotheses concerning the organization and regulation of metabolic networks. The iterative experimental testing of such hypotheses is the trajectory that can ultimately enable accurate computational modelling and prediction of metabolic outcomes. This information can be particularly important for understanding the biology of natural products, whose metabolism itself is often only poorly defined. Here, we describe factors that must be in place to optimize the use of metabolomics in predictive biology. A key to achieving this vision is a collection of accurate time-resolved and spatially defined metabolite abundance data and associated metadata. One formidable challenge associated with metabolite profiling is the complexity and analytical limits associated with comprehensively determining the metabolome of an organism. Further, for metabolomics data to be efficiently used by the research community, it must be curated in publicly available metabolomics databases. Such databases require clear, consistent formats, easy access to data and metadata, data download, and accessible computational tools to integrate genome system-scale datasets. Although transcriptomics and proteomics integrate the linear predictive power of the genome, the metabolome represents the nonlinear, final biochemical products of the genome, which results from the intricate system(s) that regulate genome expression. For example, the relationship of metabolomics data to the metabolic network is confounded by redundant connections between metabolites and gene-products. However, connections among metabolites are predictable through the rules of chemistry. Therefore, enhancing the ability to integrate the metabolome with anchor-points in the transcriptome and proteome will enhance the predictive power of genomics data. 
    We detail a public database repository for metabolomics, tools and approaches for statistical analysis of metabolomics data, and methods for integrating these datasets with transcriptomic data to create hypotheses concerning specialized metabolisms that generate the diversity in natural product chemistry. We discuss the importance of close collaborations among biologists, chemists, computer scientists and statisticians throughout the development of such integrated metabolism-centric databases and software.

  13. A global approach to analysis and interpretation of metabolic data for plant natural product discovery

    PubMed Central

    Hur, Manhoi; Campbell, Alexis Ann; Almeida-de-Macedo, Marcia; Li, Ling; Ransom, Nick; Jose, Adarsh; Crispin, Matt; Nikolau, Basil J.

    2013-01-01

    Discovering molecular components and their functionality is key to the development of hypotheses concerning the organization and regulation of metabolic networks. The iterative experimental testing of such hypotheses is the trajectory that can ultimately enable accurate computational modelling and prediction of metabolic outcomes. This information can be particularly important for understanding the biology of natural products, whose metabolism itself is often only poorly defined. Here, we describe factors that must be in place to optimize the use of metabolomics in predictive biology. A key to achieving this vision is a collection of accurate time-resolved and spatially defined metabolite abundance data and associated metadata. One formidable challenge associated with metabolite profiling is the complexity and analytical limits associated with comprehensively determining the metabolome of an organism. Further, for metabolomics data to be efficiently used by the research community, it must be curated in publically available metabolomics databases. Such databases require clear, consistent formats, easy access to data and metadata, data download, and accessible computational tools to integrate genome system-scale datasets. Although transcriptomics and proteomics integrate the linear predictive power of the genome, the metabolome represents the nonlinear, final biochemical products of the genome, which results from the intricate system(s) that regulate genome expression. For example, the relationship of metabolomics data to the metabolic network is confounded by redundant connections between metabolites and gene-products. However, connections among metabolites are predictable through the rules of chemistry. Therefore, enhancing the ability to integrate the metabolome with anchor-points in the transcriptome and proteome will enhance the predictive power of genomics data. 
    We detail a public database repository for metabolomics, tools and approaches for statistical analysis of metabolomics data, and methods for integrating these datasets with transcriptomic data to create hypotheses concerning specialized metabolism that generates the diversity in natural product chemistry. We discuss the importance of close collaborations among biologists, chemists, computer scientists and statisticians throughout the development of such integrated metabolism-centric databases and software. PMID:23447050

  14. ZINC: A Free Tool to Discover Chemistry for Biology

    PubMed Central

    2012-01-01

    ZINC is a free public resource for ligand discovery. The database contains over twenty million commercially available molecules in biologically relevant representations that may be downloaded in popular ready-to-dock formats and subsets. The Web site also enables searches by structure, biological activity, physical property, vendor, catalog number, name, and CAS number. Small custom subsets may be created, edited, shared, docked, downloaded, and conveyed to a vendor for purchase. The database is maintained and curated for a high purchasing success rate and is freely available at zinc.docking.org. PMID:22587354

  15. CyBy(2): a structure-based data management tool for chemical and biological data.

    PubMed

    Höck, Stefan; Riedl, Rainer

    2012-01-01

    We report the development of a powerful data management tool for chemical and biological data: CyBy(2). CyBy(2) is a structure-based information management tool used to store and visualize structural data alongside additional information such as project assignment, physical information, spectroscopic data, biological activity, functional data and synthetic procedures. The application consists of a database, an application server, used to query and update the database, and a client application with a rich graphical user interface (GUI) used to interact with the server.

  16. Protein-protein interaction networks: unraveling the wiring of molecular machines within the cell.

    PubMed

    De Las Rivas, Javier; Fontanillo, Celia

    2012-11-01

    Mapping and understanding of the protein interaction networks with their key modules and hubs can provide deeper insights into the molecular machinery underlying complex phenotypes. In this article, we present the basic characteristics and definitions of protein networks, starting with a distinction of the different types of associations between proteins. We focus the review on protein-protein interactions (PPIs), a subset of associations defined as physical contacts between proteins that occur by selective molecular docking in a particular biological context. We present such definition as opposed to other types of protein associations derived from regulatory, genetic, structural or functional relations. To determine PPIs, a variety of binary and co-complex methods exist; however, not all the technologies provide the same information and data quality. A way of increasing confidence in a given protein interaction is to integrate orthogonal experimental evidences. The use of several complementary methods testing each single interaction assesses the accuracy of PPI data and tries to minimize the occurrence of false interactions. Following this approach there have been important efforts to unify primary databases of experimentally proven PPIs into integrated databases. These meta-databases provide a measure of the confidence of interactions based on the number of experimental proofs that report them. As a conclusion, we can state that integrated information allows the building of more reliable interaction networks. Identification of communities, cliques, modules and hubs by analysing the topological parameters and graph properties of the protein networks allows the discovery of central/critical nodes, which are candidates to regulate cellular flux and dynamics.
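The hub identification described in this abstract can be illustrated with a minimal sketch. It assumes an interaction list given as pairs of protein identifiers and treats any protein whose degree reaches a chosen threshold as a hub; the function names, the threshold, and the identifiers are hypothetical and are not taken from the article.

```python
from collections import defaultdict


def degree_table(interactions):
    """Return {protein: number of distinct interaction partners}."""
    neighbors = defaultdict(set)
    for a, b in interactions:
        if a == b:
            continue  # ignore self-interactions in this sketch
        neighbors[a].add(b)
        neighbors[b].add(a)  # physical interactions are treated as undirected
    return {protein: len(partners) for protein, partners in neighbors.items()}


def hubs(interactions, min_degree):
    """Return the sorted proteins whose degree is at least min_degree."""
    degrees = degree_table(interactions)
    return sorted(p for p, d in degrees.items() if d >= min_degree)


# Hypothetical identifiers; the repeated pair ("B", "A") is counted once.
example = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("B", "A")]
print(degree_table(example))
print(hubs(example, min_degree=3))
```

Degree is only one of the topological parameters the abstract mentions; a fuller analysis would also consider modules, cliques and communities.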

  17. Preferential attachment and growth dynamics in complex systems

    NASA Astrophysics Data System (ADS)

    Yamasaki, Kazuko; Matia, Kaushik; Buldyrev, Sergey V.; Fu, Dongfeng; Pammolli, Fabio; Riccaboni, Massimo; Stanley, H. Eugene

    2006-09-01

    Complex systems can be characterized by classes of equivalency of their elements defined according to system specific rules. We propose a generalized preferential attachment model to describe the class size distribution. The model postulates preferential growth of the existing classes and the steady influx of new classes. According to the model, the distribution changes from a pure exponential form for zero influx of new classes to a power law with an exponential cut-off form when the influx of new classes is substantial. Predictions of the model are tested through the analysis of a unique industrial database, which covers both elementary units (products) and classes (markets, firms) in a given industry (pharmaceuticals), covering the entire size distribution. The model’s predictions are in good agreement with the data. The paper sheds light on the emergence of the exponent τ≈2 observed as a universal feature of many biological, social and economic problems.
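The two ingredients named in this abstract, preferential growth of existing classes and an influx of new classes, can be explored with a simplified simulation. The sketch below is a Simon-style urn process written for illustration, not the authors' model: the parameter names (`new_class_prob`, `n_initial_classes`) and the choice to start every class with a single unit are assumptions.

```python
import random
from collections import Counter


def simulate_class_sizes(n_units, new_class_prob, n_initial_classes=1, seed=0):
    """Grow classes unit by unit and return class sizes, largest first.

    Each new unit founds a new class with probability new_class_prob;
    otherwise it joins an existing class chosen proportionally to its size.
    """
    if n_units < n_initial_classes:
        raise ValueError("n_units must be at least n_initial_classes")
    rng = random.Random(seed)
    # owner[i] is the class of unit i; every initial class starts with one unit.
    owner = list(range(n_initial_classes))
    n_classes = n_initial_classes
    for _ in range(n_units - n_initial_classes):
        if rng.random() < new_class_prob:
            owner.append(n_classes)  # influx of a new class
            n_classes += 1
        else:
            # Copying the class of a uniformly chosen unit selects a class
            # with probability proportional to its current size.
            owner.append(rng.choice(owner))
    return sorted(Counter(owner).values(), reverse=True)


sizes = simulate_class_sizes(10_000, new_class_prob=0.1, seed=1)
print(len(sizes), sizes[:5])
```

Setting `new_class_prob=0.0` with several initial classes gives the zero-influx case, in which the number of classes stays fixed and only their sizes grow.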

  18. Scaling laws and model of words organization in spoken and written language

    NASA Astrophysics Data System (ADS)

    Bian, Chunhua; Lin, Ruokuang; Zhang, Xiaoyu; Ma, Qianli D. Y.; Ivanov, Plamen Ch.

    2016-01-01

    A broad range of complex physical and biological systems exhibits scaling laws. The human language is a complex system of words organization. Studies of written texts have revealed intriguing scaling laws that characterize the frequency of words occurrence, rank of words, and growth in the number of distinct words with text length. While studies have predominantly focused on the language system in its written form, such as books, little attention is given to the structure of spoken language. Here we investigate a database of spoken language transcripts and written texts, and we uncover that words organization in both spoken language and written texts exhibits scaling laws, although with different crossover regimes and scaling exponents. We propose a model that provides insight into words organization in spoken language and written texts, and successfully accounts for all scaling laws empirically observed in both language forms.
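The quantities this abstract refers to (word frequency against rank, commonly called Zipf's law, and the growth of distinct words with text length, commonly called Heaps' law) can be measured with a few lines of code. The sketch below is a minimal illustration on a toy word list; the function names are hypothetical and the article's own model is not reproduced here.

```python
from collections import Counter


def rank_frequency(words):
    """Return (rank, word, frequency) tuples, most frequent word first."""
    counts = Counter(words)
    return [
        (rank, word, freq)
        for rank, (word, freq) in enumerate(counts.most_common(), start=1)
    ]


def vocabulary_growth(words):
    """Return the number of distinct words seen after each token."""
    seen = set()
    growth = []
    for word in words:
        seen.add(word)
        growth.append(len(seen))
    return growth


tokens = "the cat and the dog and the bird".split()
print(rank_frequency(tokens)[:2])
print(vocabulary_growth(tokens))
```

On a real corpus, plotting frequency against rank and vocabulary size against text length on log-log axes is the usual way to look for the scaling regimes and crossovers the abstract describes.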

  19. Web services-based text-mining demonstrates broad impacts for interoperability and process simplification.

    PubMed

    Wiegers, Thomas C; Davis, Allan Peter; Mattingly, Carolyn J

    2014-01-01

    The Critical Assessment of Information Extraction systems in Biology (BioCreAtIvE) challenge evaluation tasks collectively represent a community-wide effort to evaluate a variety of text-mining and information extraction systems applied to the biological domain. The BioCreative IV Workshop included five independent subject areas, including Track 3, which focused on named-entity recognition (NER) for the Comparative Toxicogenomics Database (CTD; http://ctdbase.org). Previously, CTD had organized document ranking and NER-related tasks for the BioCreative Workshop 2012; a key finding of that effort was that interoperability and integration complexity were major impediments to the direct application of the systems to CTD's text-mining pipeline. This underscored a prevailing problem with software integration efforts. Major interoperability-related issues included lack of process modularity, operating system incompatibility, tool configuration complexity and lack of standardization of high-level inter-process communications. One approach to potentially mitigate interoperability and general integration issues is the use of Web services to abstract implementation details; rather than integrating NER tools directly, HTTP-based calls from CTD's asynchronous, batch-oriented text-mining pipeline could be made to remote NER Web services for recognition of specific biological terms using BioC (an emerging family of XML formats) for inter-process communications. To test this concept, participating groups developed Representational State Transfer /BioC-compliant Web services tailored to CTD's NER requirements. Participants were provided with a comprehensive set of training materials. CTD evaluated results obtained from the remote Web service-based URLs against a test data set of 510 manually curated scientific articles. Twelve groups participated in the challenge. Recall, precision, balanced F-scores and response times were calculated. 
    Top balanced F-scores for gene, chemical and disease NER were 61, 74 and 51%, respectively. Response times ranged from fractions-of-a-second to over a minute per article. We present a description of the challenge and summary of results, demonstrating how curation groups can effectively use interoperable NER technologies to simplify text-mining pipeline implementation. Database URL: http://ctdbase.org/ © The Author(s) 2014. Published by Oxford University Press.
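For readers unfamiliar with the metrics reported in this abstract, the sketch below shows how precision, recall and the balanced F-score (the harmonic mean of precision and recall) are computed from counts of true positives, false positives and false negatives. The counts in the example are invented for illustration and are not data from the challenge.

```python
def precision_recall_f1(true_pos, false_pos, false_neg):
    """Return (precision, recall, balanced F-score) from raw counts."""
    predicted = true_pos + false_pos  # terms the NER system reported
    actual = true_pos + false_neg     # terms in the curated reference set
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical counts: 74 correct terms, 26 spurious terms, 26 missed terms.
p, r, f = precision_recall_f1(true_pos=74, false_pos=26, false_neg=26)
print(f"precision={p:.2f} recall={r:.2f} F={f:.2f}")
```

Because the balanced F-score is a harmonic mean, it falls toward the lower of the two values when precision and recall diverge.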

  20. Web services-based text-mining demonstrates broad impacts for interoperability and process simplification

    PubMed Central

    Wiegers, Thomas C.; Davis, Allan Peter; Mattingly, Carolyn J.

    2014-01-01

    The Critical Assessment of Information Extraction systems in Biology (BioCreAtIvE) challenge evaluation tasks collectively represent a community-wide effort to evaluate a variety of text-mining and information extraction systems applied to the biological domain. The BioCreative IV Workshop included five independent subject areas, including Track 3, which focused on named-entity recognition (NER) for the Comparative Toxicogenomics Database (CTD; http://ctdbase.org). Previously, CTD had organized document ranking and NER-related tasks for the BioCreative Workshop 2012; a key finding of that effort was that interoperability and integration complexity were major impediments to the direct application of the systems to CTD's text-mining pipeline. This underscored a prevailing problem with software integration efforts. Major interoperability-related issues included lack of process modularity, operating system incompatibility, tool configuration complexity and lack of standardization of high-level inter-process communications. One approach to potentially mitigate interoperability and general integration issues is the use of Web services to abstract implementation details; rather than integrating NER tools directly, HTTP-based calls from CTD's asynchronous, batch-oriented text-mining pipeline could be made to remote NER Web services for recognition of specific biological terms using BioC (an emerging family of XML formats) for inter-process communications. To test this concept, participating groups developed Representational State Transfer /BioC-compliant Web services tailored to CTD's NER requirements. Participants were provided with a comprehensive set of training materials. CTD evaluated results obtained from the remote Web service-based URLs against a test data set of 510 manually curated scientific articles. Twelve groups participated in the challenge. Recall, precision, balanced F-scores and response times were calculated. 
    Top balanced F-scores for gene, chemical and disease NER were 61, 74 and 51%, respectively. Response times ranged from fractions-of-a-second to over a minute per article. We present a description of the challenge and summary of results, demonstrating how curation groups can effectively use interoperable NER technologies to simplify text-mining pipeline implementation. Database URL: http://ctdbase.org/ PMID:24919658

  1. ChlamyCyc: an integrative systems biology database and web-portal for Chlamydomonas reinhardtii.

    PubMed

    May, Patrick; Christian, Jan-Ole; Kempa, Stefan; Walther, Dirk

    2009-05-04

    The unicellular green alga Chlamydomonas reinhardtii is an important eukaryotic model organism for the study of photosynthesis and plant growth. In the era of modern high-throughput technologies there is an imperative need to integrate large-scale data sets from high-throughput experimental techniques using computational methods and database resources to provide comprehensive information about the molecular and cellular organization of a single organism. In the framework of the German Systems Biology initiative GoFORSYS, a pathway database and web-portal for Chlamydomonas (ChlamyCyc) was established, which currently features about 250 metabolic pathways with associated genes, enzymes, and compound information. ChlamyCyc was assembled using an integrative approach combining the recently published genome sequence, bioinformatics methods, and experimental data from metabolomics and proteomics experiments. We analyzed and integrated a combination of primary and secondary database resources, such as existing genome annotations from JGI, EST collections, orthology information, and MapMan classification. ChlamyCyc provides a curated and integrated systems biology repository that will enable and assist in systematic studies of fundamental cellular processes in Chlamydomonas. The ChlamyCyc database and web-portal is freely available under http://chlamycyc.mpimp-golm.mpg.de.

  2. HTAPP: High-Throughput Autonomous Proteomic Pipeline

    PubMed Central

    Yu, Kebing; Salomon, Arthur R.

    2011-01-01

    Recent advances in the speed and sensitivity of mass spectrometers and in analytical methods, the exponential acceleration of computer processing speeds, and the availability of genomic databases from an array of species and protein information databases have led to a deluge of proteomic data. The development of a lab-based automated proteomic software platform for the automated collection, processing, storage, and visualization of expansive proteomic datasets is critically important. The high-throughput autonomous proteomic pipeline (HTAPP) described here is designed from the ground up to provide critically important flexibility for diverse proteomic workflows and to streamline the total analysis of a complex proteomic sample. This tool is comprised of software that controls the acquisition of mass spectral data along with automation of post-acquisition tasks such as peptide quantification, clustered MS/MS spectral database searching, statistical validation, and data exploration within a user-configurable lab-based relational database. The software design of HTAPP focuses on accommodating diverse workflows and providing missing software functionality to a wide range of proteomic researchers to accelerate the extraction of biological meaning from immense proteomic data sets. Although individual software modules in our integrated technology platform may have some similarities to existing tools, the true novelty of the approach described here is in the synergistic and flexible combination of these tools to provide an integrated and efficient analysis of proteomic samples. PMID:20336676

  3. Motif discovery with data mining in 3D protein structure databases: discovery, validation and prediction of the U-shape zinc binding ("Huf-Zinc") motif.

    PubMed

    Maurer-Stroh, Sebastian; Gao, He; Han, Hao; Baeten, Lies; Schymkowitz, Joost; Rousseau, Frederic; Zhang, Louxin; Eisenhaber, Frank

    2013-02-01

    Data mining in protein databases, derivatives from more fundamental protein 3D structure and sequence databases, has considerable unearthed potential for the discovery of sequence motif--structural motif--function relationships as the finding of the U-shape (Huf-Zinc) motif, originally a small student's project, exemplifies. The metal ion zinc is critically involved in universal biological processes, ranging from protein-DNA complexes and transcription regulation to enzymatic catalysis and metabolic pathways. Proteins have evolved a series of motifs to specifically recognize and bind zinc ions. Many of these, so called zinc fingers, are structurally independent globular domains with discontinuous binding motifs made up of residues mostly far apart in sequence. Through a systematic approach starting from the BRIX structure fragment database, we discovered that there exists another predictable subset of zinc-binding motifs that not only have a conserved continuous sequence pattern but also share a characteristic local conformation, despite being included in totally different overall folds. While this does not allow general prediction of all Zn binding motifs, a HMM-based web server, Huf-Zinc, is available for prediction of these novel, as well as conventional, zinc finger motifs in protein sequences. The Huf-Zinc webserver can be freely accessed through this URL (http://mendel.bii.a-star.edu.sg/METHODS/hufzinc/).

  4. GETPrime: a gene- or transcript-specific primer database for quantitative real-time PCR.

    PubMed

    Gubelmann, Carine; Gattiker, Alexandre; Massouras, Andreas; Hens, Korneel; David, Fabrice; Decouttere, Frederik; Rougemont, Jacques; Deplancke, Bart

    2011-01-01

    The vast majority of genes in humans and other organisms undergo alternative splicing, yet the biological function of splice variants is still very poorly understood in large part because of the lack of simple tools that can map the expression profiles and patterns of these variants with high sensitivity. High-throughput quantitative real-time polymerase chain reaction (qPCR) is an ideal technique to accurately quantify nucleic acid sequences including splice variants. However, currently available primer design programs do not distinguish between splice variants and also differ substantially in overall quality, functionality or throughput mode. Here, we present GETPrime, a primer database supported by a novel platform that uniquely combines and automates several features critical for optimal qPCR primer design. These include the consideration of all gene splice variants to enable either gene-specific (covering the majority of splice variants) or transcript-specific (covering one splice variant) expression profiling, primer specificity validation, automated best primer pair selection according to strict criteria and graphical visualization of the latter primer pairs within their genomic context. GETPrime primers have been extensively validated experimentally, demonstrating high transcript specificity in complex samples. Thus, the free-access, user-friendly GETPrime database allows fast primer retrieval and visualization for genes or groups of genes of most common model organisms, and is available at http://updepla1srv1.epfl.ch/getprime/. Database URL: http://deplanckelab.epfl.ch.

  5. GETPrime: a gene- or transcript-specific primer database for quantitative real-time PCR

    PubMed Central

    Gubelmann, Carine; Gattiker, Alexandre; Massouras, Andreas; Hens, Korneel; David, Fabrice; Decouttere, Frederik; Rougemont, Jacques; Deplancke, Bart

    2011-01-01

    The vast majority of genes in humans and other organisms undergo alternative splicing, yet the biological function of splice variants is still very poorly understood in large part because of the lack of simple tools that can map the expression profiles and patterns of these variants with high sensitivity. High-throughput quantitative real-time polymerase chain reaction (qPCR) is an ideal technique to accurately quantify nucleic acid sequences including splice variants. However, currently available primer design programs do not distinguish between splice variants and also differ substantially in overall quality, functionality or throughput mode. Here, we present GETPrime, a primer database supported by a novel platform that uniquely combines and automates several features critical for optimal qPCR primer design. These include the consideration of all gene splice variants to enable either gene-specific (covering the majority of splice variants) or transcript-specific (covering one splice variant) expression profiling, primer specificity validation, automated best primer pair selection according to strict criteria and graphical visualization of the latter primer pairs within their genomic context. GETPrime primers have been extensively validated experimentally, demonstrating high transcript specificity in complex samples. Thus, the free-access, user-friendly GETPrime database allows fast primer retrieval and visualization for genes or groups of genes of most common model organisms, and is available at http://updepla1srv1.epfl.ch/getprime/. Database URL: http://deplanckelab.epfl.ch. PMID:21917859

  6. B-CAN: a resource sharing platform to improve the operation, visualization and integrated analysis of TCGA breast cancer data

    PubMed Central

    Wen, Can-Hong; Ou, Shao-Min; Guo, Xiao-Bo; Liu, Chen-Feng; Shen, Yan-Bo; You, Na; Cai, Wei-Hong; Shen, Wen-Jun; Wang, Xue-Qin; Tan, Hai-Zhu

    2017-01-01

    Breast cancer is a high-risk heterogeneous disease with myriad subtypes and complicated biological features. The Cancer Genome Atlas (TCGA) breast cancer database provides researchers with large-scale genomic and clinical data via web portals and FTP services. Researchers are able to gain new insights into their related fields, and evaluate experimental discoveries with TCGA. However, TCGA’s complex data formats and diverse files make it difficult for researchers with little experience in databases and bioinformatics to access and work with the data. For ease of use, we built the breast cancer (B-CAN) platform, which provides data customization, data visualization, and a private data center. The B-CAN platform runs on an Apache server and communicates with a MySQL back-end database through PHP. Users can customize data based on their needs by combining tables from the original TCGA database and selecting variables from each table. The private data center can hold private data and two types of customized data. A key feature of B-CAN is that it offers both single-table and multiple-table display. Multiple-table display accommodates customized data in which one barcode corresponds to many records, as well as processed customized data. B-CAN is an intuitive and highly efficient data-sharing platform. PMID:29312567
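
    The table-combining step in this record can be sketched with an in-memory SQLite database standing in for the platform's MySQL back end. The table names, column names, and values below are hypothetical and are not taken from B-CAN or TCGA; the sketch only illustrates how one barcode can map to several records once two tables are joined.

```python
import sqlite3

def build_customized_table(conn):
    """Join a per-patient table to a per-sample table on the barcode."""
    query = """
        SELECT c.barcode, c.age, s.sample_type
        FROM clinical AS c
        JOIN samples AS s ON s.barcode = c.barcode
        ORDER BY c.barcode, s.sample_type
    """
    return conn.execute(query).fetchall()

# Hypothetical schema and rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE clinical (barcode TEXT PRIMARY KEY, age INTEGER);
    CREATE TABLE samples (barcode TEXT, sample_type TEXT);
    INSERT INTO clinical VALUES ('P-0001', 54), ('P-0002', 61);
    INSERT INTO samples VALUES
        ('P-0001', 'normal'), ('P-0001', 'tumor'), ('P-0002', 'tumor');
""")

rows = build_customized_table(conn)
for row in rows:
    print(row)  # barcode P-0001 appears twice: one barcode, many records
```

    Selecting only the needed columns in the SELECT list corresponds to choosing variables from each table.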

  7. Large-scale annotation of small-molecule libraries using public databases.

    PubMed

    Zhou, Yingyao; Zhou, Bin; Chen, Kaisheng; Yan, S Frank; King, Frederick J; Jiang, Shumei; Winzeler, Elizabeth A

    2007-01-01

    While many large publicly accessible databases provide excellent annotation for biological macromolecules, the same is not true for small chemical compounds. Commercial data sources also fail to encompass an annotation interface for large numbers of compounds and tend to be cost prohibitive to be widely available to biomedical researchers. Therefore, using annotation information for the selection of lead compounds from a modern day high-throughput screening (HTS) campaign presently occurs only under a very limited scale. The recent rapid expansion of the NIH PubChem database provides an opportunity to link existing biological databases with compound catalogs and provides relevant information that potentially could improve the information garnered from large-scale screening efforts. Using the 2.5 million compound collection at the Genomics Institute of the Novartis Research Foundation (GNF) as a model, we determined that approximately 4% of the library contained compounds with potential annotation in such databases as PubChem and the World Drug Index (WDI) as well as related databases such as the Kyoto Encyclopedia of Genes and Genomes (KEGG) and ChemIDplus. Furthermore, the exact structure match analysis showed 32% of GNF compounds can be linked to third party databases via PubChem. We also showed annotations such as MeSH (medical subject headings) terms can be applied to in-house HTS databases in identifying signature biological inhibition profiles of interest as well as expediting the assay validation process. The automated annotation of thousands of screening hits in batch is becoming feasible and has the potential to play an essential role in the hit-to-lead decision making process.
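
    As a rough illustration of the exact-structure-match idea in this record, the sketch below links an in-house compound list to a public annotation index through a shared structure key. The identifiers, keys, and annotation strings are placeholders invented for the example; the abstract does not specify which structure representation or matching procedure the authors used.

```python
def annotate_library(library, public_index):
    """Return {compound_id: annotations} for exact structure-key matches."""
    annotated = {}
    for compound_id, structure_key in library.items():
        hits = public_index.get(structure_key)
        if hits:
            annotated[compound_id] = hits
    return annotated

def coverage(library, annotated):
    """Fraction of the library that received at least one annotation."""
    return len(annotated) / len(library) if library else 0.0

# Hypothetical in-house library: compound ID -> canonical structure key.
library = {"C1": "KEY-A", "C2": "KEY-B", "C3": "KEY-C", "C4": "KEY-D"}
# Hypothetical public index: structure key -> annotation terms.
public_index = {
    "KEY-A": ["example annotation 1"],
    "KEY-C": ["example annotation 2", "example annotation 3"],
}

annotated = annotate_library(library, public_index)
print(sorted(annotated), coverage(library, annotated))
```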

  8. PrenDB, a Substrate Prediction Database to Enable Biocatalytic Use of Prenyltransferases.

    PubMed

    Gunera, Jakub; Kindinger, Florian; Li, Shu-Ming; Kolb, Peter

    2017-03-10

    Prenyltransferases of the dimethylallyltryptophan synthase (DMATS) superfamily catalyze the attachment of prenyl or prenyl-like moieties to diverse acceptor compounds. These acceptor molecules are generally aromatic in nature and mostly indole or indole-like. Their catalytic transformation represents a major skeletal diversification step in the biosynthesis of secondary metabolites, including the indole alkaloids. DMATS enzymes thus contribute significantly to the biological and pharmacological diversity of small molecule metabolites. Understanding the substrate specificity of these enzymes could create opportunities for their biocatalytic use in preparing complex synthetic scaffolds. However, there has been no framework to achieve this in a rational way. Here, we report a chemoinformatic pipeline to enable prenyltransferase substrate prediction. We systematically catalogued 32 unique prenyltransferases and 167 unique substrates to create possible reaction matrices and compiled these data into a browsable database named PrenDB. We then used a newly developed algorithm based on molecular fragmentation to automatically extract reactive chemical epitopes. The analysis of the collected data sheds light on the thus far explored substrate space of DMATS enzymes. To assess the predictive performance of our virtual reaction extraction tool, 38 potential substrates were tested as prenyl acceptors in assays with three prenyltransferases, and we were able to detect turnover in >55% of the cases. The database, PrenDB (www.kolblab.org/prendb.php), enables the prediction of potential substrates for chemoenzymatic synthesis through substructure similarity and virtual chemical transformation techniques. It aims at making prenyltransferases and their highly regio- and stereoselective reactions accessible to the research community for integration in synthetic work flows. © 2017 by The American Society for Biochemistry and Molecular Biology, Inc.

  9. Exploiting the Proteome to Improve the Genome-Wide Genetic Analysis of Epistasis in Common Human Diseases

    PubMed Central

    Pattin, Kristine A.; Moore, Jason H.

    2009-01-01

    One of the central goals of human genetics is the identification of loci with alleles or genotypes that confer increased susceptibility. The availability of dense maps of single-nucleotide polymorphisms (SNPs) along with high-throughput genotyping technologies has set the stage for routine genome-wide association studies that are expected to significantly improve our ability to identify susceptibility loci. Before this promise can be realized, there are some significant challenges that need to be addressed. We address here the challenge of detecting epistasis or gene-gene interactions in genome-wide association studies. Discovering epistatic interactions in high dimensional datasets remains a challenge due to the computational complexity resulting from the analysis of all possible combinations of SNPs. One potential way to overcome the computational burden of a genome-wide epistasis analysis would be to devise a logical way to prioritize the many SNPs in a dataset so that the data may be analyzed more efficiently and yet still retain important biological information. One of the strongest demonstrations of the functional relationship between genes is protein-protein interaction. Thus, it is plausible that the expert knowledge extracted from protein interaction databases may allow for a more efficient analysis of genome-wide studies as well as facilitate the biological interpretation of the data. In this review we will discuss the challenges of detecting epistasis in genome-wide genetic studies and the means by which we propose to apply expert knowledge extracted from protein interaction databases to facilitate this process. We explore some of the fundamentals of protein interactions and the databases that are publicly available. PMID:18551320
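
    The computational burden mentioned in this record comes from the number of SNP pairs, which grows as n(n-1)/2; for example, 500,000 SNPs give 124,999,750,000 pairs. The sketch below shows one way prior knowledge could shrink that search space, by keeping only SNP pairs whose genes are listed as interacting. The SNP-to-gene mapping and the interaction list are hypothetical, and this filter is an illustration of the general idea rather than the authors' method.

```python
from itertools import combinations
from math import comb

def prioritized_pairs(snp_to_gene, interactions):
    """Keep SNP pairs whose two (distinct) genes have a known interaction."""
    interacting = {frozenset(pair) for pair in interactions}
    pairs = []
    for snp_a, snp_b in combinations(sorted(snp_to_gene), 2):
        genes = frozenset((snp_to_gene[snp_a], snp_to_gene[snp_b]))
        if len(genes) == 2 and genes in interacting:
            pairs.append((snp_a, snp_b))
    return pairs

# Hypothetical mapping and interaction list, for illustration only.
snp_to_gene = {"rs1": "G1", "rs2": "G2", "rs3": "G3", "rs4": "G4"}
interactions = [("G1", "G2"), ("G3", "G4")]

kept = prioritized_pairs(snp_to_gene, interactions)
all_pairs = comb(len(snp_to_gene), 2)
print(f"{len(kept)} of {all_pairs} pairs kept: {kept}")
```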

  10. Featured Article: Genotation: Actionable knowledge for the scientific reader.

    PubMed

    Nagahawatte, Panduka; Willis, Ethan; Sakauye, Mark; Jose, Rony; Chen, Hao; Davis, Robert L

    2016-06-01

    We present an article viewer application that allows a scientific reader to easily discover and share knowledge by linking genomics-related concepts to knowledge in disparate biomedical databases. High-throughput data streams generated by technical advancements have contributed to scientific knowledge discovery at an unprecedented rate. Biomedical informaticists have created a diverse set of databases to store and retrieve the discovered knowledge. The diversity and abundance of such resources present biomedical researchers with a challenge in knowledge discovery. These challenges highlight a need for a better informatics solution. We use a text mining algorithm, Genomine, to identify gene symbols from the text of a journal article. The identified symbols are supplemented with information from the GenoDB knowledgebase. Self-updating GenoDB contains information from NCBI Gene, ClinVar, MedGen, dbSNP, KEGG, PharmGKB, UniProt, and HUGO Gene databases. The journal viewer is a web application accessible via a web browser. The features described herein are accessible at www.genotation.org. The Genomine algorithm identifies gene symbols with an F-score of 0.65. GenoDB currently contains information regarding 59,905 gene symbols, 5633 drug-gene relationships, 5981 gene-disease relationships, and 713 pathways. This application provides scientific readers with actionable knowledge related to concepts of a manuscript. The reader will be able to save and share supplements to be visualized in a graphical manner. This provides convenient access to details of complex biological phenomena, enabling biomedical researchers to generate novel hypotheses to further our knowledge in human health. This manuscript presents a novel application that integrates genomic, proteomic, and pharmacogenomic information to supplement content of a biomedical manuscript and enable readers to automatically discover actionable knowledge.
© 2016 by the Society for Experimental Biology and Medicine.
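
    For readers unfamiliar with the F-score reported in this record, the sketch below computes it for a gene-symbol tagger by comparing predicted symbols with a hand-curated set. It assumes the balanced F1 variant (the harmonic mean of precision and recall); the abstract does not say which variant was used, and the symbol sets are made up for the example.

```python
def f_score(predicted, gold):
    """Balanced F1: harmonic mean of precision and recall."""
    predicted, gold = set(predicted), set(gold)
    true_pos = len(predicted & gold)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(predicted)  # correct among predicted
    recall = true_pos / len(gold)          # found among the true symbols
    return 2 * precision * recall / (precision + recall)

# Hypothetical tagger output versus a hypothetical curated answer set.
predicted = {"BRCA1", "TP53", "CAT", "EGFR"}   # "CAT" is a false positive
gold = {"BRCA1", "TP53", "EGFR", "KRAS", "MYC"}

score = f_score(predicted, gold)
print(round(score, 3))
```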

  11. General and craniofacial development are complex adaptive processes influenced by diversity.

    PubMed

    Brook, A H; O'Donnell, M Brook; Hone, A; Hart, E; Hughes, T E; Smith, R N; Townsend, G C

    2014-06-01

    Complex systems are present in such diverse areas as social systems, economies, ecosystems and biology and, therefore, are highly relevant to dental research, education and practice. A Complex Adaptive System in biological development is a dynamic process in which, from interacting components at a lower level, higher level phenomena and structures emerge. Diversity makes substantial contributions to the performance of complex adaptive systems. It enhances the robustness of the process, allowing multiple responses to external stimuli as well as internal changes. From diversity comes variation in outcome and the possibility of major change; outliers in the distribution enhance the tipping points. The development of the dentition is a valuable, accessible model with extensive and reliable databases for investigating the role of complex adaptive systems in craniofacial and general development. The general characteristics of such systems are seen during tooth development: self-organization; bottom-up emergence; multitasking; self-adaptation; variation; tipping points; critical phases; and robustness. Dental findings are compatible with the Random Network Model, the Threshold Model and also with the Scale Free Network Model which has a Power Law distribution. In addition, dental development shows the characteristics of Modularity and Clustering to form Hierarchical Networks. The interactions between the genes (nodes) demonstrate Small World phenomena, Subgraph Motifs and Gene Regulatory Networks. Genetic mechanisms are involved in the creation and evolution of variation during development. The genetic factors interact with epigenetic and environmental factors at the molecular level and form complex networks within the cells. From these interactions emerge the higher level tissues, tooth germs and mineralized teeth. 
Approaching development in this way allows investigation of why there can be variations in phenotypes from identical genotypes; the phenotype is the outcome of perturbations in the cellular systems and networks, as well as of the genotype. Understanding and applying complexity theory will bring about substantial advances not only in dental research and education but also in the organization and delivery of oral health care. © 2014 Australian Dental Association.

  12. Strategic Integration of Multiple Bioinformatics Resources for System Level Analysis of Biological Networks.

    PubMed

    D'Souza, Mark; Sulakhe, Dinanath; Wang, Sheng; Xie, Bing; Hashemifar, Somaye; Taylor, Andrew; Dubchak, Inna; Conrad Gilliam, T; Maltsev, Natalia

    2017-01-01

    Recent technological advances in genomics allow the production of biological data at unprecedented tera- and petabyte scales. Efficient mining of these vast and complex datasets for the needs of biomedical research critically depends on a seamless integration of the clinical, genomic, and experimental information with prior knowledge about genotype-phenotype relationships. Such experimental data accumulated in publicly available databases should be accessible to a variety of algorithms and analytical pipelines that drive computational analysis and data mining. We present an integrated computational platform, Lynx (Sulakhe et al., Nucleic Acids Res 44:D882-D887, 2016) (http://lynx.cri.uchicago.edu), a web-based database and knowledge extraction engine. It provides advanced search capabilities and a variety of algorithms for enrichment analysis and network-based gene prioritization. It gives public access to the Lynx integrated knowledge base (LynxKB) and its analytical tools via user-friendly web services and interfaces. The Lynx service-oriented architecture supports annotation and analysis of high-throughput experimental data. Lynx tools assist the user in extracting meaningful knowledge from LynxKB and experimental data, and in the generation of weighted hypotheses regarding the genes and molecular mechanisms contributing to human phenotypes or conditions of interest. The goal of this integrated platform is to support the end-to-end analytical needs of various translational projects.

  13. Integrating genome-wide association study summaries and element-gene interaction datasets identified multiple associations between elements and complex diseases.

    PubMed

    He, Awen; Wang, Wenyu; Prakash, N Tejo; Tinkov, Alexey A; Skalny, Anatoly V; Wen, Yan; Hao, Jingcan; Guo, Xiong; Zhang, Feng

    2018-03-01

    Chemical elements are closely related to human health. Extensive genomic profile data of complex diseases offer us a good opportunity to systematically investigate the relationships between elements and complex diseases/traits. In this study, we applied a gene set enrichment analysis (GSEA) approach to detect associations between elements and complex diseases/traits by integrating element-gene interaction datasets and genome-wide association study (GWAS) data of complex diseases/traits. To illustrate the performance of GSEA, the element-gene interaction datasets of 24 elements were extracted from the Comparative Toxicogenomics Database (CTD). GWAS summary datasets of 24 complex diseases or traits were downloaded from the dbGaP or GEFOS websites. We observed significant associations between 7 elements and 13 complex diseases or traits (all false discovery rate (FDR) < 0.05), including reported relationships such as aluminum vs. Alzheimer's disease (FDR = 0.042), calcium vs. bone mineral density (FDR = 0.031), magnesium vs. systemic lupus erythematosus (FDR = 0.012) as well as novel associations, such as nickel vs. hypertriglyceridemia (FDR = 0.002) and bipolar disorder (FDR = 0.027). Our study results are consistent with previous biological studies, supporting the good performance of GSEA. Our results, based on the GSEA framework, provide novel clues for discovering causal relationships between elements and complex diseases. © 2017 WILEY PERIODICALS, INC.

  14. Research approaches to mass casualty incidents response: development from routine perspectives to complexity science.

    PubMed

    Shen, Weifeng; Jiang, Libing; Zhang, Mao; Ma, Yuefeng; Jiang, Guanyu; He, Xiaojun

    2014-01-01

    To systematically review research methods for mass casualty incidents (MCI) and to introduce the concepts and characteristics of complexity science and of the artificial systems, computational experiments and parallel execution (ACP) method. We searched PubMed, Web of Knowledge, China Wanfang and China Biology Medicine (CBM) databases for relevant studies. Searches were performed without year or language restrictions and used combinations of the following key words: "mass casualty incident", "MCI", "research method", "complexity science", "ACP", "approach", "science", "model", "system" and "response". Articles were searched using the above keywords, and only those involving research methods for MCI were included. The number of research methods for MCI has increased markedly over the past few decades. Currently, the dominant research methods for MCI are the theory-based approach, the empirical approach, evidence-based science, mathematical modeling and computer simulation, simulation experiments, experimental methods, the scenario approach and complexity science. This article provides an overview of the development of research methodology for MCI. Progress in routine research approaches and in complexity science is briefly presented. Furthermore, the authors conclude that the reductionism underlying the exact sciences is not suitable for complex MCI systems and that the only feasible alternative is complexity science. Finally, the review discusses how the ACP method, which combines artificial systems, computational experiments and parallel execution, offers a new approach to research on complex MCI.

  15. A unique large-scale undergraduate research experience in molecular systems biology for non-mathematics majors.

    PubMed

    Kappler, Ulrike; Rowland, Susan L; Pedwell, Rhianna K

    2017-05-01

    Systems biology is frequently taught with an emphasis on mathematical modeling approaches. This focus effectively excludes most biology, biochemistry, and molecular biology students, who are not mathematics majors. The mathematical focus can also present a misleading picture of systems biology, which is a multi-disciplinary pursuit requiring collaboration between biochemists, bioinformaticians, and mathematicians. This article describes an authentic large-scale undergraduate research experience (ALURE) in systems biology that incorporates proteomics, bacterial genomics, and bioinformatics in the one exercise. This project is designed to engage students who have a basic grounding in protein chemistry and metabolism and no mathematical modeling skills. The pedagogy around the research experience is designed to help students attack complex datasets and use their emergent metabolic knowledge to make meaning from large amounts of raw data. On completing the ALURE, participants reported a significant increase in their confidence around analyzing large datasets, while the majority of the cohort reported good or great gains in a variety of skills including "analysing data for patterns" and "conducting database or internet searches." An environmental scan shows that this ALURE is the only undergraduate-level system-biology research project offered on a large-scale in Australia; this speaks to the perceived difficulty of implementing such an opportunity for students. We argue however, that based on the student feedback, allowing undergraduate students to complete a systems-biology project is both feasible and desirable, even if the students are not maths and computing majors. © 2016 by The International Union of Biochemistry and Molecular Biology, 45(3):235-248, 2017. © 2016 The International Union of Biochemistry and Molecular Biology.

  16. Database Constraints Applied to Metabolic Pathway Reconstruction Tools

    PubMed Central

    Vilaplana, Jordi; Solsona, Francesc; Teixido, Ivan; Usié, Anabel; Karathia, Hiren; Alves, Rui; Mateo, Jordi

    2014-01-01

    Our group developed two biological applications, Biblio-MetReS and Homol-MetReS, accessing the same database of organisms with annotated genes. Biblio-MetReS is a data-mining application that facilitates the reconstruction of molecular networks based on automated text-mining analysis of published scientific literature. Homol-MetReS allows functional (re)annotation of proteomes, to properly identify both the individual proteins involved in the process(es) of interest and their function. It also enables the sets of proteins involved in the process(es) in different organisms to be compared directly. The efficiency of these biological applications is directly related to the design of the shared database. We classified and analyzed the different kinds of access to the database. Based on this study, we tried to adjust and tune the configurable parameters of the database server to reach the best performance of the communication data link to/from the database system. Different database technologies were analyzed. We started the study with a public relational SQL database, MySQL. Then, the same database was implemented by a MapReduce-based database named HBase. The results indicated that the standard configuration of MySQL gives an acceptable performance for low or medium size databases. Nevertheless, tuning database parameters can greatly improve the performance and lead to very competitive runtimes. PMID:25202745

  17. AIM: A comprehensive Arabidopsis Interactome Module database and related interologs in plants

    USDA-ARS's Scientific Manuscript database

    Systems biology analysis of protein modules is important for understanding the functional relationships between proteins in the interactome. Here, we present a comprehensive database named AIM for Arabidopsis (Arabidopsis thaliana) interactome modules. The database contains almost 250,000 modules th...

  18. DNA profiles, computer searches, and the Fourth Amendment.

    PubMed

    Kimel, Catherine W

    2013-01-01

    Pursuant to federal statutes and to laws in all fifty states, the United States government has assembled a database containing the DNA profiles of over eleven million citizens. Without judicial authorization, the government searches each of these profiles one-hundred thousand times every day, seeking to link database subjects to crimes they are not suspected of committing. Yet, courts and scholars that have addressed DNA databasing have focused their attention almost exclusively on the constitutionality of the government's seizure of the biological samples from which the profiles are generated. This Note fills a gap in the scholarship by examining the Fourth Amendment problems that arise when the government searches its vast DNA database. This Note argues that each attempt to match two DNA profiles constitutes a Fourth Amendment search because each attempted match infringes upon database subjects' expectations of privacy in their biological relationships and physical movements. The Note further argues that database searches are unreasonable as they are currently conducted, and it suggests an adaptation of computer-search procedures to remedy the constitutional deficiency.

  19. Can all heritable biology really be reduced to a single dimension?

    PubMed

    Babbitt, Gregory A; Coppola, Erin E; Alawad, Mohammed A; Hudson, André O

    2016-03-10

    A long-held presupposition in the field of bioinformatics holds that genetic, and now even epigenetic 'information' can be abstracted from the physicochemical details of the macromolecular polymers in which it resides. It is perhaps rather ironic that this basic conjecture originated upon the first observations of DNA structure itself. This static model of DNA led very quickly to the conclusion that only the nucleobase sequence itself is rich enough in molecular complexity to replicate a complex biology. This idea has been pervasive throughout genomic science, higher education and popular culture ever since; to the point that most of us would accept it unquestioningly as fact. What is more alarming is that this conjecture is driving a significant portion of the technological development in modern genomics towards methods strongly rooted in DNA sequencing, thereby reducing a dynamic multi-dimensional biology into single-dimensional forms of data. Evidence countering this central tenet of bioinformatics has been quietly mounting over many decades, prompting some to propose that the genome must be studied from the perspective of its molecular reality, rather than as a body of information to be represented symbolically. Here, we explore the epistemological boundary between bioinformatics and molecular biology, and warn against an 'overtly' bioinformatic perspective. We review a selection of new bioinformatic methods that move beyond sequence-based approaches to include consideration of databased three dimensional structures. However, we also note that these hybrid methods still ignore the most important element of gene function when attempting to improve outcomes; the fourth dimension of molecular dynamics over time. Copyright © 2015 The Authors. Published by Elsevier B.V. All rights reserved.

  20. The Escherichia coli Peripheral Inner Membrane Proteome*

    PubMed Central

    Papanastasiou, Malvina; Orfanoudaki, Georgia; Koukaki, Marina; Kountourakis, Nikos; Sardis, Marios Frantzeskos; Aivaliotis, Michalis; Karamanou, Spyridoula; Economou, Anastassios

    2013-01-01

    Biological membranes are essential for cell viability. Their functional characteristics strongly depend on their protein content, which consists of transmembrane (integral) and peripherally associated membrane proteins. Both integral and peripheral inner membrane proteins mediate a plethora of biological processes. Whereas transmembrane proteins have characteristic hydrophobic stretches and can be predicted using bioinformatics approaches, peripheral inner membrane proteins are hydrophilic, exist in equilibria with soluble pools, and carry no discernible membrane targeting signals. We experimentally determined the cytoplasmic peripheral inner membrane proteome of the model organism Escherichia coli using a multidisciplinary approach. Initially, we extensively re-annotated the theoretical proteome regarding subcellular localization using literature searches, manual curation, and multi-combinatorial bioinformatics searches of the available databases. Next we used sequential biochemical fractionations coupled to direct identification of individual proteins and protein complexes using high resolution mass spectrometry. We determined that the proposed cytoplasmic peripheral inner membrane proteome occupies a previously unsuspected ∼19% of the basic E. coli BL21(DE3) proteome, and the detected peripheral inner membrane proteome occupies ∼25% of the estimated expressed proteome of this cell grown in LB medium to mid-log phase. This value might increase when fleeting interactions, not studied here, are taken into account. Several proteins previously regarded as exclusively cytoplasmic bind membranes avidly. Many of these proteins are organized in functional or/and structural oligomeric complexes that bind to the membrane with multiple interactions. Identified proteins cover the full spectrum of biological activities, and more than half of them are essential. 
Our data suggest that the cytoplasmic proteome displays remarkably dynamic and extensive communication with biological membrane surfaces that we are only beginning to decipher. PMID:23230279

  1. Algorithm for detection the QRS complexes based on support vector machine

    NASA Astrophysics Data System (ADS)

    Van, G. V.; Podmasteryev, K. V.

    2017-11-01

    The efficiency of computer ECG analysis depends on the accurate detection of QRS complexes. This paper presents an algorithm for QRS complex detection based on a support vector machine (SVM). The proposed algorithm is evaluated on annotated standard databases such as the MIT-BIH Arrhythmia database. The QRS detector achieved a sensitivity of Se = 98.32% and a specificity of Sp = 95.46% on the MIT-BIH Arrhythmia database. This algorithm can be used as the basis for software that diagnoses the electrical activity of the heart.
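
    The two figures of merit quoted in this record follow from a detector's confusion counts. The sketch below uses the standard definitions, Se = TP / (TP + FN) and Sp = TN / (TN + FP); the counts are invented for illustration and are not the paper's data, and the abstract does not state how true negatives were counted for beat detection.

```python
def sensitivity(tp, fn):
    """Share of true QRS complexes that the detector found."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """Share of non-QRS candidates that the detector correctly rejected."""
    return tn / (tn + fp)

# Hypothetical confusion counts for a detector run on annotated beats.
tp, fn, tn, fp = 90, 10, 80, 20

se = sensitivity(tp, fn)
sp = specificity(tn, fp)
print(f"Se = {se:.2%}, Sp = {sp:.2%}")
```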

  2. The LncRNA Connectivity Map: Using LncRNA Signatures to Connect Small Molecules, LncRNAs, and Diseases.

    PubMed

    Yang, Haixiu; Shang, Desi; Xu, Yanjun; Zhang, Chunlong; Feng, Li; Sun, Zeguo; Shi, Xinrui; Zhang, Yunpeng; Han, Junwei; Su, Fei; Li, Chunquan; Li, Xia

    2017-07-27

    Well-characterized connections among diseases, long non-coding RNAs (lncRNAs) and drugs are important for elucidating the key roles of lncRNAs in biological mechanisms in various biological states. In this study, we constructed a database called LNCmap (LncRNA Connectivity Map), available at http://www.bio-bigdata.com/LNCmap/ , to establish the correlations among diseases, physiological processes, and the action of small molecule therapeutics by attempting to describe all biological states in terms of lncRNA signatures. By reannotating the microarray data from the Connectivity Map database, the LNCmap obtained 237 lncRNA signatures of 5916 instances corresponding to 1262 small molecular drugs. We provided a user-friendly interface for the convenient browsing, retrieval and download of the database, including detailed information and the associations of drugs and corresponding affected lncRNAs. Additionally, we developed two enrichment analysis methods for users to identify candidate drugs for a particular disease by inputting the corresponding lncRNA expression profiles or an associated lncRNA list and then comparing them to the lncRNA signatures in our database. Overall, LNCmap could significantly improve our understanding of the biological roles of lncRNAs and provide a unique resource to reveal the connections among drugs, lncRNAs and diseases.

  3. Use of a validated algorithm to estimate the annual cost of effective biologic treatment for rheumatoid arthritis.

    PubMed

    Curtis, Jeffrey R; Schabert, Vernon F; Yeaw, Jason; Korn, Jonathan R; Quach, Caroleen; Harrison, David J; Yun, Huifeng; Joseph, George J; Collier, David

    2014-08-01

    To estimate biologic cost per effectively treated patient with rheumatoid arthritis (RA) using a claims-based algorithm for effectiveness. Patients with RA aged 18-63 years in the IMS PharMetrics Plus database were categorized as effectively treated if they met all six criteria: (1) a medication possession ratio ≥80% (subcutaneous) or at least as many infusions as specified in US labeling (intravenous); (2) no biologic dose increase; (3) no biologic switch; (4) no new non-biologic disease-modifying anti-rheumatic drug; (5) no new or increased oral glucocorticoid; and (6) ≤1 glucocorticoid injection. Biologic cost per effectively treated patient was defined as total cost of the index biologic (drug plus intravenous administration) divided by the number of patients categorized by the algorithm as effectively treated. Similar methods were used for the index biologic in the second year and for a second biologic after a switch. Rates that the index biologic was categorized as effective in the first year were 31.0% etanercept (2243/7247), 28.6% adalimumab (1426/4991), 28.6% abatacept (332/1160), 27.2% golimumab (71/261), and 20.2% infliximab (474/2352). Mean biologic cost per effectively treated patient, per the algorithm, was $50,141 etanercept, $53,386 golimumab, $56,942 adalimumab, $73,516 abatacept, and $114,089 infliximab. Biologic cost per effectively treated patient, using this algorithm, was lower for patients who continued the index biologic in the second year and higher after switching. When a claims-based algorithm was applied to a large commercial claims database, etanercept was categorized as the most effective and had the lowest estimated 1-year biologic cost per effectively treated patient. This proxy for effectiveness from claims databases was validated against a clinical effectiveness scale, but analyses of the second year or the year after a biologic switch were not included in the validation. 
Costs of other medications were not included in cost calculations.
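
    The six criteria and the cost definition above can be sketched in a few lines. This is an illustrative reading of the abstract, not the authors' validated algorithm: the `PatientYear` record and its field names are hypothetical, and the cost figure in the usage lines is a toy number. Only the patient counts in `REPORTED_COUNTS` come from the abstract.

```python
from dataclasses import dataclass


@dataclass
class PatientYear:
    """Hypothetical one-year claims summary for a patient on an index biologic."""
    route: str                           # "sc" (subcutaneous) or "iv" (intravenous)
    medication_possession_ratio: float   # used for subcutaneous biologics
    infusions_received: int              # used for intravenous biologics
    infusions_per_label: int             # infusions specified in US labeling
    dose_increased: bool
    switched_biologic: bool
    new_nonbiologic_dmard: bool
    new_or_increased_oral_glucocorticoid: bool
    glucocorticoid_injections: int


def effectively_treated(p: PatientYear) -> bool:
    """Return True only if all six criteria from the abstract are met."""
    if p.route == "sc":
        adherent = p.medication_possession_ratio >= 0.80
    else:
        adherent = p.infusions_received >= p.infusions_per_label
    return (
        adherent
        and not p.dose_increased
        and not p.switched_biologic
        and not p.new_nonbiologic_dmard
        and not p.new_or_increased_oral_glucocorticoid
        and p.glucocorticoid_injections <= 1
    )


def cost_per_effectively_treated(total_index_biologic_cost: float, n_effective: int) -> float:
    """Total index-biologic cost divided by the number of effectively treated patients."""
    if n_effective <= 0:
        raise ValueError("no effectively treated patients")
    return total_index_biologic_cost / n_effective


def effectiveness_rate_percent(n_effective: int, n_total: int) -> float:
    """Share of patients categorized as effectively treated, in percent (1 decimal)."""
    return round(100 * n_effective / n_total, 1)


# (effectively treated, total) counts as reported in the abstract.
REPORTED_COUNTS = {
    "etanercept": (2243, 7247),
    "adalimumab": (1426, 4991),
    "abatacept": (332, 1160),
    "golimumab": (71, 261),
    "infliximab": (474, 2352),
}

for drug, (n_eff, n_total) in REPORTED_COUNTS.items():
    print(drug, effectiveness_rate_percent(n_eff, n_total))

# Toy cost figure, invented for illustration only.
print(cost_per_effectively_treated(1_000_000.0, 20))
```

    Because the denominator counts only effectively treated patients, a biologic with a lower effectiveness rate carries a higher cost per effectively treated patient even at a similar per-patient drug cost.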

  4. ODG: Omics database generator - a tool for generating, querying, and analyzing multi-omics comparative databases to facilitate biological understanding.

    PubMed

    Guhlin, Joseph; Silverstein, Kevin A T; Zhou, Peng; Tiffin, Peter; Young, Nevin D

    2017-08-10

    Rapid generation of omics data in recent years has resulted in vast amounts of disconnected datasets without systemic integration and knowledge building, while individual groups have made customized, annotated datasets available on the web with few ways to link them to in-lab datasets. With so many research groups generating their own data, the ability to relate it to the larger genomic and comparative genomic context is becoming increasingly crucial to make full use of the data. The Omics Database Generator (ODG) allows users to create customized databases that utilize published genomics data integrated with experimental data which can be queried using a flexible graph database. When provided with omics and experimental data, ODG will create a comparative, multi-dimensional graph database. ODG can import definitions and annotations from other sources such as InterProScan, the Gene Ontology, ENZYME, UniPathway, and others. This annotation data can be especially useful for studying new or understudied species for which transcripts have only been predicted, and rapidly give additional layers of annotation to predicted genes. In better studied species, ODG can perform syntenic annotation translations or rapidly identify characteristics of a set of genes or nucleotide locations, such as hits from an association study. ODG provides a web-based user-interface for configuring the data import and for querying the database. Queries can also be run from the command-line and the database can be queried directly through programming language hooks available for most languages. ODG supports most common genomic formats as well as generic, easy to use tab-separated value format for user-provided annotations. ODG is a user-friendly database generation and query tool that adapts to the supplied data to produce a comparative genomic database or multi-layered annotation database.
ODG provides rapid comparative genomic annotation and is therefore particularly useful for non-model or understudied species. For species for which more data are available, ODG can be used to conduct complex multi-omics, pattern-matching queries.

  5. Examining database persistence of ISO/EN 13606 standardized electronic health record extracts: relational vs. NoSQL approaches.

    PubMed

    Sánchez-de-Madariaga, Ricardo; Muñoz, Adolfo; Lozano-Rubí, Raimundo; Serrano-Balazote, Pablo; Castro, Antonio L; Moreno, Oscar; Pascual, Mario

    2017-08-18

    The objective of this research is to compare the relational and non-relational (NoSQL) database systems approaches in order to store, recover, query and persist standardized medical information in the form of ISO/EN 13606 normalized Electronic Health Record XML extracts, both in isolation and concurrently. NoSQL database systems have recently attracted much attention, but few studies in the literature address their direct comparison with relational databases when applied to build the persistence layer of a standardized medical information system. One relational and two NoSQL databases (one document-based and one native XML database) of three different sizes have been created in order to evaluate and compare the response times (algorithmic complexity) of six different complexity growing queries, which have been performed on them. Similar appropriate results available in the literature have also been considered. Relational and non-relational NoSQL database systems show almost linear algorithmic complexity query execution. However, they show very different linear slopes, the former being much steeper than the two latter. Document-based NoSQL databases perform better in concurrency than in isolation, and also better than relational databases in concurrency. Non-relational NoSQL databases seem to be more appropriate than standard relational SQL databases when database size is extremely high (secondary use, research applications). Document-based NoSQL databases perform in general better than native XML NoSQL databases. EHR extracts visualization and edition are also document-based tasks more appropriate to NoSQL database systems. However, the appropriate database solution much depends on each particular situation and specific problem.

  6. Designing integrated computational biology pipelines visually.

    PubMed

    Jamil, Hasan M

    2013-01-01

    The long-term cost of developing and maintaining a computational pipeline that depends upon data integration and sophisticated workflow logic is too high to even contemplate "what if" or ad hoc type queries. In this paper, we introduce a novel application building interface for computational biology research, called VizBuilder, by leveraging a recent query language called BioFlow for life sciences databases. Using VizBuilder, it is now possible to develop ad hoc complex computational biology applications at throw away costs. The underlying query language supports data integration and workflow construction almost transparently and fully automatically, using a best effort approach. Users express their application by drawing it with VizBuilder icons and connecting them in a meaningful way. Completed applications are compiled and translated as BioFlow queries for execution by the data management system LifeDB, for which VizBuilder serves as a front end. We discuss VizBuilder features and functionalities in the context of a real life application after we briefly introduce BioFlow. The architecture and design principles of VizBuilder are also discussed. Finally, we outline future extensions of VizBuilder. To our knowledge, VizBuilder is a unique system that allows visually designing computational biology pipelines involving distributed and heterogeneous resources in an ad hoc manner.

  7. Database citation in full text biomedical articles.

    PubMed

    Kafkas, Şenay; Kim, Jee-Hyub; McEntyre, Johanna R

    2013-01-01

    Molecular biology and literature databases represent essential infrastructure for life science research. Effective integration of these data resources requires that there are structured cross-references at the level of individual articles and biological records. Here, we describe the current patterns of how database entries are cited in research articles, based on analysis of the full text Open Access articles available from Europe PMC. Focusing on citation of entries in the European Nucleotide Archive (ENA), UniProt and Protein Data Bank, Europe (PDBe), we demonstrate that text mining doubles the number of structured annotations of database record citations supplied in journal articles by publishers. Many thousands of new literature-database relationships are found by text mining, since these relationships are also not present in the set of articles cited by database records. We recommend that structured annotation of database records in articles is extended to other databases, such as ArrayExpress and Pfam, entries from which are also cited widely in the literature. The very high precision and high-throughput of this text-mining pipeline makes this activity possible both accurately and at low cost, which will allow the development of new integrated data services.
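
    A toy illustration of the kind of pattern matching such text mining builds on is given below. It is a sketch only and not the Europe PMC pipeline: the UniProt pattern follows the published accession-number format, the PDB pattern assumes the identifier is preceded by a "PDB" cue to limit false positives (a bare four-character pattern would also match years), and the function name and sample sentence are invented for the example.

```python
import re

# UniProt accession format (6 or 10 characters), per the UniProt accession rules.
UNIPROT_RE = re.compile(
    r"\b(?:[OPQ][0-9][A-Z0-9]{3}[0-9]"
    r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})\b"
)

# PDB identifiers: a digit 1-9 followed by three alphanumerics.
# Assumption: only accept them after a "PDB", "PDB ID", "PDB code" or "PDB entry" cue.
PDB_RE = re.compile(
    r"\bPDB(?:\s+(?:ID|code|entry))?\s*:?\s*([1-9][A-Za-z0-9]{3})\b",
    re.IGNORECASE,
)


def find_database_citations(text: str) -> dict:
    """Return candidate database record citations found in free text."""
    return {
        "uniprot": UNIPROT_RE.findall(text),
        "pdb": [match.upper() for match in PDB_RE.findall(text)],
    }


# Sample sentence written for this example.
sample = "The structure (PDB ID: 1TUP) shows p53 (UniProt P04637) bound to DNA."
print(find_database_citations(sample))
```

    A production pipeline would add ENA accession patterns, context filtering, and validation of each candidate against the target database before recording it as a structured annotation.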

  8. Database Citation in Full Text Biomedical Articles

    PubMed Central

    Kafkas, Şenay; Kim, Jee-Hyub; McEntyre, Johanna R.

    2013-01-01

    Molecular biology and literature databases represent essential infrastructure for life science research. Effective integration of these data resources requires that there are structured cross-references at the level of individual articles and biological records. Here, we describe the current patterns of how database entries are cited in research articles, based on analysis of the full text Open Access articles available from Europe PMC. Focusing on citation of entries in the European Nucleotide Archive (ENA), UniProt and Protein Data Bank, Europe (PDBe), we demonstrate that text mining doubles the number of structured annotations of database record citations supplied in journal articles by publishers. Many thousands of new literature-database relationships are found by text mining, since these relationships are also not present in the set of articles cited by database records. We recommend that structured annotation of database records in articles is extended to other databases, such as ArrayExpress and Pfam, entries from which are also cited widely in the literature. The very high precision and high-throughput of this text-mining pipeline makes this activity possible both accurately and at low cost, which will allow the development of new integrated data services. PMID:23734176

  9. From 20th century metabolic wall charts to 21st century systems biology: database of mammalian metabolic enzymes.

    PubMed

    Corcoran, Callan C; Grady, Cameron R; Pisitkun, Trairak; Parulekar, Jaya; Knepper, Mark A

    2017-03-01

    The organization of the mammalian genome into gene subsets corresponding to specific functional classes has provided key tools for systems biology research. Here, we have created a web-accessible resource called the Mammalian Metabolic Enzyme Database ( https://hpcwebapps.cit.nih.gov/ESBL/Database/MetabolicEnzymes/MetabolicEnzymeDatabase.html) keyed to the biochemical reactions represented on iconic metabolic pathway wall charts created in the previous century. Overall, we have mapped 1,647 genes to these pathways, representing ~7 percent of the protein-coding genome. To illustrate the use of the database, we apply it to the area of kidney physiology. In so doing, we have created an additional database ( Database of Metabolic Enzymes in Kidney Tubule Segments: https://hpcwebapps.cit.nih.gov/ESBL/Database/MetabolicEnzymes/), mapping mRNA abundance measurements (mined from RNA-Seq studies) for all metabolic enzymes to each of 14 renal tubule segments. We carry out bioinformatics analysis of the enzyme expression pattern among renal tubule segments and mine various data sources to identify vasopressin-regulated metabolic enzymes in the renal collecting duct. Copyright © 2017 the American Physiological Society.

  10. TranscriptomeBrowser 3.0: introducing a new compendium of molecular interactions and a new visualization tool for the study of gene regulatory networks.

    PubMed

    Lepoivre, Cyrille; Bergon, Aurélie; Lopez, Fabrice; Perumal, Narayanan B; Nguyen, Catherine; Imbert, Jean; Puthier, Denis

    2012-01-31

    Deciphering gene regulatory networks by in silico approaches is a crucial step in the study of the molecular perturbations that occur in diseases. The development of regulatory maps is a tedious process requiring the comprehensive integration of various evidences scattered over biological databases. Thus, the research community would greatly benefit from having a unified database storing known and predicted molecular interactions. Furthermore, given the intrinsic complexity of the data, the development of new tools offering integrated and meaningful visualizations of molecular interactions is necessary to help users draw new hypotheses without being overwhelmed by the density of the subsequent graph. We extend the previously developed TranscriptomeBrowser database with a set of tables containing 1,594,978 human and mouse molecular interactions. The database includes: (i) predicted regulatory interactions (computed by scanning vertebrate alignments with a set of 1,213 position weight matrices), (ii) potential regulatory interactions inferred from systematic analysis of ChIP-seq experiments, (iii) regulatory interactions curated from the literature, (iv) predicted post-transcriptional regulation by micro-RNA, (v) protein kinase-substrate interactions and (vi) physical protein-protein interactions. In order to easily retrieve and efficiently analyze these interactions, we developed InteractomeBrowser, a graph-based knowledge browser that comes as a plug-in for TranscriptomeBrowser. The first objective of InteractomeBrowser is to provide a user-friendly tool to get new insight into any gene list by providing a context-specific display of putative regulatory and physical interactions. To achieve this, InteractomeBrowser relies on a "cell compartments-based layout" that makes use of a subset of the Gene Ontology to map gene products onto relevant cell compartments.
This layout is particularly powerful for visual integration of heterogeneous biological information and is a productive avenue in generating new hypotheses. The second objective of InteractomeBrowser is to fill the gap between interaction databases and dynamic modeling. It is thus compatible with the network analysis software Cytoscape and with the Gene Interaction Network simulation software (GINsim). We provide examples underlying the benefits of this visualization tool for large gene set analysis related to thymocyte differentiation. The InteractomeBrowser plugin is a powerful tool to get quick access to a knowledge database that includes both predicted and validated molecular interactions. InteractomeBrowser is available through the TranscriptomeBrowser framework and can be found at: http://tagc.univ-mrs.fr/tbrowser/. Our database is updated on a regular basis.

  11. MEMOPS: data modelling and automatic code generation.

    PubMed

    Fogh, Rasmus H; Boucher, Wayne; Ionides, John M C; Vranken, Wim F; Stevens, Tim J; Laue, Ernest D

    2010-03-25

    In recent years the amount of biological data has exploded to the point where much useful information can only be extracted by complex computational analyses. Such analyses are greatly facilitated by metadata standards, both in terms of the ability to compare data originating from different sources, and in terms of exchanging data in standard forms, e.g. when running processes on a distributed computing infrastructure. However, standards thrive on stability whereas science tends to constantly move, with new methods being developed and old ones modified. Therefore maintaining both metadata standards, and all the code that is required to make them useful, is a non-trivial problem. Memops is a framework that uses an abstract definition of the metadata (described in UML) to generate internal data structures and subroutine libraries for data access (application programming interfaces--APIs--currently in Python, C and Java) and data storage (in XML files or databases). For the individual project these libraries obviate the need for writing code for input parsing, validity checking or output. Memops also ensures that the code is always internally consistent, massively reducing the need for code reorganisation. Across a scientific domain a Memops-supported data model makes it easier to support complex standards that can capture all the data produced in a scientific area, share them among all programs in a complex software pipeline, and carry them forward to deposition in an archive. The principles behind the Memops generation code will be presented, along with example applications in Nuclear Magnetic Resonance (NMR) spectroscopy and structural biology.

  12. IDAAPM: integrated database of ADMET and adverse effects of predictive modeling based on FDA approved drug data.

    PubMed

    Legehar, Ashenafi; Xhaard, Henri; Ghemtio, Leo

    2016-01-01

    The disposition of a pharmaceutical compound within an organism, i.e. its Absorption, Distribution, Metabolism, Excretion, Toxicity (ADMET) properties and adverse effects, critically affects late stage failure of drug candidates and has led to the withdrawal of approved drugs. Computational methods are effective approaches to reduce the number of safety issues by analyzing possible links between chemical structures and ADMET or adverse effects, but this is limited by the size, quality, and heterogeneity of the data available from individual sources. Thus, large, clean and integrated databases of approved drug data, associated with fast and efficient predictive tools are desirable early in the drug discovery process. We have built a relational database (IDAAPM) to integrate available approved drug data such as drug approval information, ADMET and adverse effects, chemical structures and molecular descriptors, targets, bioactivity and related references. The database has been coupled with a searchable web interface and modern data analytics platform (KNIME) to allow data access, data transformation, initial analysis and further predictive modeling. Data were extracted from FDA resources and supplemented from other publicly available databases. Currently, the database contains information regarding about 19,226 FDA approval applications for 31,815 products (small molecules and biologics) with their approval history, 2505 active ingredients, together with as many ADMET properties, 1629 molecular structures, 2.5 million adverse effects and 36,963 experimental drug-target bioactivity data. IDAAPM is a unique resource that, in a single relational database, provides detailed information on FDA approved drugs including their ADMET properties and adverse effects, the corresponding targets with bioactivity data, coupled with a data analytics platform. It can be used to perform basic to complex drug-target ADMET or adverse effects analysis and predictive modeling. 
IDAAPM is freely accessible at http://idaapm.helsinki.fi and can be exploited through a KNIME workflow connected to the database. Graphical abstract: FDA approved drug data integration for predictive modeling.

  13. Information resources at the National Center for Biotechnology Information.

    PubMed Central

    Woodsmall, R M; Benson, D A

    1993-01-01

    The National Center for Biotechnology Information (NCBI), part of the National Library of Medicine, was established in 1988 to perform basic research in the field of computational molecular biology as well as build and distribute molecular biology databases. The basic research has led to new algorithms and analysis tools for interpreting genomic data and has been instrumental in the discovery of human disease genes for neurofibromatosis and Kallmann syndrome. The principal database responsibility is the National Institutes of Health (NIH) genetic sequence database, GenBank. NCBI, in collaboration with international partners, builds, distributes, and provides online and CD-ROM access to over 112,000 DNA sequences. Another major program is the integration of multiple sequences databases and related bibliographic information and the development of network-based retrieval systems for Internet access. PMID:8374583

  14. The 24th annual Nucleic Acids Research database issue: a look back and upcoming changes

    PubMed Central

    Rigden, Daniel J

    2017-01-01

    This year's Database Issue of Nucleic Acids Research contains 152 papers that include descriptions of 54 new databases and update papers on 98 databases, of which 16 have not been previously featured in NAR. As always, these databases cover a broad range of molecular biology subjects, including genome structure, gene expression and its regulation, proteins, protein domains, and protein–protein interactions. Following the recent trend, an increasing number of new and established databases deal with the issues of human health, from cancer-causing mutations to drugs and drug targets. In accordance with this trend, three recently compiled databases that have been selected by NAR reviewers and editors as ‘breakthrough’ contributions, denovo-db, the Monarch Initiative, and Open Targets, cover human de novo gene variants, disease-related phenotypes in model organisms, and a bioinformatics platform for therapeutic target identification and validation, respectively. We expect these databases to attract the attention of numerous researchers working in various areas of genetics and genomics. Looking back at the past 12 years, we present here the ‘golden set’ of databases that have consistently served as authoritative, comprehensive, and convenient data resources widely used by the entire community and offer some lessons on what makes a successful database. The Database Issue is freely available online at the https://academic.oup.com/nar web site. An updated version of the NAR Molecular Biology Database Collection is available at http://www.oxfordjournals.org/nar/database/a/. PMID:28053160

  15. Capturing cooperative interactions with the PSI-MI format

    PubMed Central

    Van Roey, Kim; Orchard, Sandra; Kerrien, Samuel; Dumousseau, Marine; Ricard-Blum, Sylvie; Hermjakob, Henning; Gibson, Toby J.

    2013-01-01

    The complex biological processes that control cellular function are mediated by intricate networks of molecular interactions. Accumulating evidence indicates that these interactions are often interdependent, thus acting cooperatively. Cooperative interactions are prevalent in and indispensable for reliable and robust control of cell regulation, as they underlie the conditional decision-making capability of large regulatory complexes. Despite an increased focus on experimental elucidation of the molecular details of cooperative binding events, as evidenced by their growing occurrence in literature, they are currently lacking from the main bioinformatics resources. One of the contributing factors to this deficiency is the lack of a computer-readable standard representation and exchange format for cooperative interaction data. To tackle this shortcoming, we added functionality to the widely used PSI-MI interchange format for molecular interaction data by defining new controlled vocabulary terms that allow annotation of different aspects of cooperativity without making structural changes to the underlying XML schema. As a result, we are able to capture cooperative interaction data in a structured format that is backward compatible with PSI-MI–based data and applications. This will facilitate the storage, exchange and analysis of cooperative interaction data, which in turn will advance experimental research on this fundamental principle in biology. Database URL: http://psi-mi-cooperativeinteractions.embl.de/ PMID:24067240

  16. A comparative cellular and molecular biology of longevity database.

    PubMed

    Stuart, Jeffrey A; Liang, Ping; Luo, Xuemei; Page, Melissa M; Gallagher, Emily J; Christoff, Casey A; Robb, Ellen L

    2013-10-01

    Discovering key cellular and molecular traits that promote longevity is a major goal of aging and longevity research. One experimental strategy is to determine which traits have been selected during the evolution of longevity in naturally long-lived animal species. This comparative approach has been applied to lifespan research for nearly four decades, yielding hundreds of datasets describing aspects of cell and molecular biology hypothesized to relate to animal longevity. Here, we introduce a Comparative Cellular and Molecular Biology of Longevity Database, available at ( http://genomics.brocku.ca/ccmbl/ ), as a compendium of comparative cell and molecular data presented in the context of longevity. This open access database will facilitate the meta-analysis of amalgamated datasets using standardized maximum lifespan (MLSP) data (from AnAge). The first edition contains over 800 data records describing experimental measurements of cellular stress resistance, reactive oxygen species metabolism, membrane composition, protein homeostasis, and genome homeostasis as they relate to vertebrate species MLSP. The purpose of this review is to introduce the database and briefly demonstrate its use in the meta-analysis of combined datasets.

  17. Use of Genomic Databases for Inquiry-Based Learning about Influenza

    ERIC Educational Resources Information Center

    Ledley, Fred; Ndung'u, Eric

    2011-01-01

    The genome projects of the past decades have created extensive databases of biological information with applications in both research and education. We describe an inquiry-based exercise that uses one such database, the National Center for Biotechnology Information Influenza Virus Resource, to advance learning about influenza. This database…

  18. Transcriptome Analysis of PA Gain and Loss of Function Mutants.

    PubMed

    Marco, Francisco; Carrasco, Pedro

    2018-01-01

    Functional genomics has become a forefront methodology for plant science thanks to the widespread development of microarray technology. While technical difficulties associated with the process of obtaining raw expression data have been diminishing, allowing the appearance of tremendous amounts of transcriptome data in different databases, a common problem using "omic" technologies remains: the interpretation of these data and the inference of their biological meaning. In order to assist with this complex task, a wide variety of software tools have been developed. In this chapter we describe our current workflow of the application of some of these analyses. We have used it to compare the transcriptome of plants with differences in their polyamine levels.

  19. Usefulness and accuracy of MALDI-TOF mass spectrometry as a supplementary tool to identify mosquito vector species and to invest in development of international database.

    PubMed

    Raharimalala, F N; Andrianinarivomanana, T M; Rakotondrasoa, A; Collard, J M; Boyer, S

    2017-09-01

    Arthropod-borne diseases are important causes of morbidity and mortality. The identification of vector species relies mainly on morphological features and/or molecular biology tools. The first method requires specific technical skills and may result in misidentifications, and the second method is time-consuming and expensive. The aim of the present study is to assess the usefulness and accuracy of matrix-assisted laser desorption/ionization time-of-flight mass spectrometry (MALDI-TOF MS) as a supplementary tool with which to identify mosquito vector species and to invest in the creation of an international database. A total of 89 specimens belonging to 10 mosquito species were selected for the extraction of proteins from legs and for the establishment of a reference database. A blind test with 123 mosquitoes was performed to validate the MS method. Results showed that: (a) the spectra obtained in the study with a given species differed from the spectra of the same species collected in another country, which highlights the need for an international database; (b) MALDI-TOF MS is an accurate method for the rapid identification of mosquito species that are referenced in a database; (c) MALDI-TOF MS allows the separation of groups or complex species, and (d) laboratory specimens undergo a loss of proteins compared with those isolated in the field. In conclusion, MALDI-TOF MS is a useful supplementary tool for mosquito identification and can help inform vector control. © 2017 The Royal Entomological Society.

  20. Curation accuracy of model organism databases

    PubMed Central

    Keseler, Ingrid M.; Skrzypek, Marek; Weerasinghe, Deepika; Chen, Albert Y.; Fulcher, Carol; Li, Gene-Wei; Lemmer, Kimberly C.; Mladinich, Katherine M.; Chow, Edmond D.; Sherlock, Gavin; Karp, Peter D.

    2014-01-01

    Manual extraction of information from the biomedical literature—or biocuration—is the central methodology used to construct many biological databases. For example, the UniProt protein database, the EcoCyc Escherichia coli database and the Candida Genome Database (CGD) are all based on biocuration. Biological databases are used extensively by life science researchers, as online encyclopedias, as aids in the interpretation of new experimental data and as golden standards for the development of new bioinformatics algorithms. Although manual curation has been assumed to be highly accurate, we are aware of only one previous study of biocuration accuracy. We assessed the accuracy of EcoCyc and CGD by manually selecting curated assertions within randomly chosen EcoCyc and CGD gene pages and by then validating that the data found in the referenced publications supported those assertions. A database assertion is considered to be in error if that assertion could not be found in the publication cited for that assertion. We identified 10 errors in the 633 facts that we validated across the two databases, for an overall error rate of 1.58%, and individual error rates of 1.82% for CGD and 1.40% for EcoCyc. These data suggest that manual curation of the experimental literature by Ph.D-level scientists is highly accurate. Database URL: http://ecocyc.org/, http://www.candidagenome.org// PMID:24923819
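
    The overall error rate quoted above follows directly from the two counts in the abstract. A short sketch of the arithmetic is given here; the function name is invented for the example. The per-database rates cannot be reproduced the same way, because the abstract does not give the per-database counts.

```python
def error_rate_percent(errors: int, facts_checked: int) -> float:
    """Percentage of validated assertions not supported by the cited publication."""
    if facts_checked <= 0:
        raise ValueError("facts_checked must be positive")
    return 100.0 * errors / facts_checked


# Counts reported in the abstract: 10 errors among 633 validated facts.
overall = error_rate_percent(10, 633)
print(f"{overall:.2f}%")
```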

  1. Rapid development of entity-based data models for bioinformatics with persistence object-oriented design and structured interfaces.

    PubMed

    Ezra Tsur, Elishai

    2017-01-01

    Databases are imperative for research in bioinformatics and computational biology. Current challenges in database design include data heterogeneity and context-dependent interconnections between data entities. These challenges drove the development of unified data interfaces and specialized databases. The curation of specialized databases is an ever-growing challenge due to the introduction of new data sources and the emergence of new relational connections between established datasets. Here, an open-source framework for the curation of specialized databases is proposed. The framework supports user-designed models of data encapsulation, object persistence and structured interfaces to local and external data sources such as MalaCards, Biomodels and the National Center for Biotechnology Information (NCBI) databases. The proposed framework was implemented using Java as the development environment, EclipseLink as the data persistence agent and Apache Derby as the database manager. Syntactic analysis was based on the J3D, jsoup, Apache Commons and w3c.dom open libraries. Finally, the construction of a specialized database for aneurysm-associated vascular diseases is demonstrated. This database contains 3-dimensional geometries of aneurysms, patients' clinical information, articles, biological models, related diseases and our recently published model of aneurysms' risk of rupture. The framework is available at http://nbel-lab.com.

  2. Clever generation of rich SPARQL queries from annotated relational schema: application to Semantic Web Service creation for biological databases.

    PubMed

    Wollbrett, Julien; Larmande, Pierre; de Lamotte, Frédéric; Ruiz, Manuel

    2013-04-15

    In recent years, a large amount of "-omics" data have been produced. However, these data are stored in many different species-specific databases that are managed by different institutes and laboratories. Biologists often need to find and assemble data from disparate sources to perform certain analyses. Searching for these data and assembling them is a time-consuming task. The Semantic Web helps to facilitate interoperability across databases. A common approach involves the development of wrapper systems that map a relational database schema onto existing domain ontologies. However, few attempts have been made to automate the creation of such wrappers. We developed a framework, named BioSemantic, for the creation of Semantic Web Services that are applicable to relational biological databases. This framework makes use of both Semantic Web and Web Services technologies and can be divided into two main parts: (i) the generation and semi-automatic annotation of an RDF view; and (ii) the automatic generation of SPARQL queries and their integration into Semantic Web Services backbones. We have used our framework to integrate genomic data from different plant databases. BioSemantic is a framework that was designed to speed integration of relational databases. We present how it can be used to speed the development of Semantic Web Services for existing relational biological databases. Currently, it creates and annotates RDF views that enable the automatic generation of SPARQL queries. Web Services are also created and deployed automatically, and the semantic annotations of our Web Services are added automatically using SAWSDL attributes. BioSemantic is downloadable at http://southgreen.cirad.fr/?q=content/Biosemantic.
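
    The idea of generating SPARQL queries from an annotated relational schema can be illustrated with a minimal sketch. It is not BioSemantic's implementation: the function `build_sparql`, the `example.org` IRIs and the column names are all hypothetical, and the annotation is assumed to be a simple mapping from column names to ontology predicates.

```python
from typing import Dict


def build_sparql(class_iri: str, column_predicates: Dict[str, str], limit: int = 10) -> str:
    """Build a SELECT query that returns one variable per annotated column."""
    variables = " ".join(f"?{col}" for col in column_predicates)
    # One triple pattern types the row; one more per annotated column.
    patterns = [f"  ?row a <{class_iri}> ."]
    patterns += [f"  ?row <{pred}> ?{col} ." for col, pred in column_predicates.items()]
    body = "\n".join(patterns)
    return f"SELECT {variables}\nWHERE {{\n{body}\n}}\nLIMIT {limit}"


# Hypothetical annotation of a "gene" table: column name -> ontology predicate.
query = build_sparql(
    "http://example.org/onto#Gene",
    {
        "name": "http://example.org/onto#hasName",
        "chromosome": "http://example.org/onto#locatedOn",
    },
)
print(query)
```

    Keeping the annotation as plain data means the same generator can serve any annotated table, which is the kind of automation the abstract describes.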

  3. Clever generation of rich SPARQL queries from annotated relational schema: application to Semantic Web Service creation for biological databases

    PubMed Central

    2013-01-01

    Background In recent years, a large amount of “-omics” data have been produced. However, these data are stored in many different species-specific databases that are managed by different institutes and laboratories. Biologists often need to find and assemble data from disparate sources to perform certain analyses. Searching for these data and assembling them is a time-consuming task. The Semantic Web helps to facilitate interoperability across databases. A common approach involves the development of wrapper systems that map a relational database schema onto existing domain ontologies. However, few attempts have been made to automate the creation of such wrappers. Results We developed a framework, named BioSemantic, for the creation of Semantic Web Services that are applicable to relational biological databases. This framework makes use of both Semantic Web and Web Services technologies and can be divided into two main parts: (i) the generation and semi-automatic annotation of an RDF view; and (ii) the automatic generation of SPARQL queries and their integration into Semantic Web Services backbones. We have used our framework to integrate genomic data from different plant databases. Conclusions BioSemantic is a framework that was designed to speed integration of relational databases. We present how it can be used to speed the development of Semantic Web Services for existing relational biological databases. Currently, it creates and annotates RDF views that enable the automatic generation of SPARQL queries. Web Services are also created and deployed automatically, and the semantic annotations of our Web Services are added automatically using SAWSDL attributes. BioSemantic is downloadable at http://southgreen.cirad.fr/?q=content/Biosemantic. PMID:23586394

  4. Saccharomyces genome database informs human biology

    PubMed Central

    Skrzypek, Marek S; Nash, Robert S; Wong, Edith D; MacPherson, Kevin A; Karra, Kalpana; Binkley, Gail; Simison, Matt; Miyasato, Stuart R

    2018-01-01

    The Saccharomyces Genome Database (SGD; http://www.yeastgenome.org) is an expertly curated database of literature-derived functional information for the model organism budding yeast, Saccharomyces cerevisiae. SGD constantly strives to synergize new types of experimental data and bioinformatics predictions with existing data, and to organize them into a comprehensive and up-to-date information resource. The primary mission of SGD is to facilitate research into the biology of yeast and to provide this wealth of information to advance, in many ways, research on other organisms, even those as evolutionarily distant as humans. To build such a bridge between biological kingdoms, SGD is curating data regarding yeast-human complementation, in which a human gene can successfully replace the function of a yeast gene, and/or vice versa. These data are manually curated from published literature, made available for download, and incorporated into a variety of analysis tools provided by SGD. PMID:29140510

  5. Assignment of Streptococcus agalactiae isolates to clonal complexes using a small set of single nucleotide polymorphisms.

    PubMed

    Honsa, Erin; Fricke, Thomas; Stephens, Alex J; Ko, Danny; Kong, Fanrong; Gilbert, Gwendolyn L; Huygens, Flavia; Giffard, Philip M

    2008-08-19

    Streptococcus agalactiae (Group B Streptococcus (GBS)) is an important human pathogen, particularly of newborns. Emerging evidence for a relationship between genotype and virulence has accentuated the need for efficient and well-defined typing methods. The objective of this study was to develop a single nucleotide polymorphism (SNP) based method for assigning GBS isolates to multilocus sequence typing (MLST)-defined clonal complexes. It was found that a SNP set derived from the MLST database on the basis of maximization of Simpsons Index of Diversity provided poor resolution and did not define groups concordant with the population structure as defined by eBURST analysis of the MLST database. This was interpreted as being a consequence of low diversity and high frequency horizontal gene transfer. Accordingly, a different approach to SNP identification was developed. This entailed use of the "Not-N" bioinformatic algorithm that identifies SNPs diagnostic for groups of known sequence variants, together with an empirical process of SNP testing. This yielded a four member SNP set that divides GBS into 10 groups that are concordant with the population structure. A fifth SNP was identified that increased the sensitivity for the clinically significant clonal complex 17 to 100%. Kinetic PCR methods for the interrogation of these SNPs were developed, and used to genotype 116 well characterized isolates. A five SNP method for dividing GBS into biologically valid groups has been developed. These SNPs are ideal for high throughput surveillance activities, and combining with more rapidly evolving loci when additional resolution is required.
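
    Simpson's Index of Diversity, used above as the criterion for deriving the first SNP set, can be computed for a single candidate SNP as in the sketch below. The unbiased form D = 1 - sum(n_i(n_i - 1)) / (N(N - 1)) is assumed here; the abstract does not state which variant the authors used, and the allele data are made up for illustration.

```python
from collections import Counter
from typing import Hashable, Iterable


def simpsons_index_of_diversity(types: Iterable[Hashable]) -> float:
    """Probability that two isolates drawn without replacement differ in type."""
    counts = Counter(types)
    n = sum(counts.values())
    if n < 2:
        raise ValueError("need at least two isolates")
    same_type_pairs = sum(c * (c - 1) for c in counts.values())
    return 1.0 - same_type_pairs / (n * (n - 1))


# Hypothetical alleles observed at one SNP position across four sequence types.
alleles_at_snp = ["A", "A", "G", "G"]
print(simpsons_index_of_diversity(alleles_at_snp))
```

    A SNP that splits the isolates evenly scores higher than one carried by a single isolate, which is why maximizing this index favours discriminatory power rather than agreement with population structure.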

  6. Assignment of Streptococcus agalactiae isolates to clonal complexes using a small set of single nucleotide polymorphisms

    PubMed Central

    Honsa, Erin; Fricke, Thomas; Stephens, Alex J; Ko, Danny; Kong, Fanrong; Gilbert, Gwendolyn L; Huygens, Flavia; Giffard, Philip M

    2008-01-01

    Background Streptococcus agalactiae (Group B Streptococcus (GBS)) is an important human pathogen, particularly of newborns. Emerging evidence for a relationship between genotype and virulence has accentuated the need for efficient and well-defined typing methods. The objective of this study was to develop a single nucleotide polymorphism (SNP) based method for assigning GBS isolates to multilocus sequence typing (MLST)-defined clonal complexes. Results It was found that a SNP set derived from the MLST database on the basis of maximisation of Simpsons Index of Diversity provided poor resolution and did not define groups concordant with the population structure as defined by eBURST analysis of the MLST database. This was interpreted as being a consequence of low diversity and high frequency horizontal gene transfer. Accordingly, a different approach to SNP identification was developed. This entailed use of the "Not-N" bioinformatic algorithm that identifies SNPs diagnostic for groups of known sequence variants, together with an empirical process of SNP testing. This yielded a four member SNP set that divides GBS into 10 groups that are concordant with the population structure. A fifth SNP was identified that increased the sensitivity for the clinically significant clonal complex 17 to 100%. Kinetic PCR methods for the interrogation of these SNPs were developed, and used to genotype 116 well characterized isolates. Conclusion A five SNP method for dividing GBS into biologically valid groups has been developed. These SNPs are ideal for high throughput surveillance activities, and combining with more rapidly evolving loci when additional resolution is required. PMID:18710585

  7. A novel approach to select differential pathways associated with hypertrophic cardiomyopathy based on gene co‑expression analysis.

    PubMed

    Chen, Xiao-Min; Feng, Ming-Jun; Shen, Cai-Jie; He, Bin; Du, Xian-Feng; Yu, Yi-Bo; Liu, Jing; Chu, Hui-Min

    2017-07-01

    The present study was designed to develop a novel method for identifying significant pathways associated with human hypertrophic cardiomyopathy (HCM), based on gene co‑expression analysis. The microarray dataset associated with HCM (E‑GEOD‑36961) was obtained from the European Molecular Biology Laboratory‑European Bioinformatics Institute database. Informative pathways were selected based on the Reactome pathway database and screening treatments. An empirical Bayes method was utilized to construct co‑expression networks for informative pathways, and a weight value was assigned to each pathway. Differential pathways were extracted based on weight threshold, which was calculated using a random model. In order to assess whether the co‑expression method was feasible, it was compared with traditional pathway enrichment analysis of differentially expressed genes, which were identified using the significance analysis of microarrays package. A total of 1,074 informative pathways were screened out for subsequent investigations and their weight values were also obtained. According to the threshold of weight value of 0.01057, 447 differential pathways, including folding of actin by chaperonin containing T‑complex protein 1 (CCT)/T‑complex protein 1 ring complex (TRiC), purine ribonucleoside monophosphate biosynthesis and ubiquinol biosynthesis, were obtained. Compared with traditional pathway enrichment analysis, the number of pathways obtained from the co‑expression approach was increased. The results of the present study demonstrated that this method may be useful to predict marker pathways for HCM. The pathways of folding of actin by CCT/TRiC and purine ribonucleoside monophosphate biosynthesis may provide evidence of the underlying molecular mechanisms of HCM, and offer novel therapeutic directions for HCM.

  8. Using an International p53 Mutation Database as a Foundation for an Online Laboratory in an Upper Level Undergraduate Biology Class

    ERIC Educational Resources Information Center

    Melloy, Patricia G.

    2015-01-01

    A two-part laboratory exercise was developed to enhance classroom instruction on the significance of p53 mutations in cancer development. Students were asked to mine key information from an international database of p53 genetic changes related to cancer, the IARC TP53 database. Using this database, students designed several data mining activities…

  9. Finding similar nucleotide sequences using network BLAST searches.

    PubMed

    Ladunga, Istvan

    2009-06-01

    The Basic Local Alignment Search Tool (BLAST) is a keystone of bioinformatics due to its performance and user-friendliness. Beginner and intermediate users will learn how to design and submit blastn and Megablast searches on the Web pages at the National Center for Biotechnology Information. We map nucleic acid sequences to genomes, find identical or similar mRNA, expressed sequence tag, and noncoding RNA sequences, and run Megablast searches, which are much faster than blastn. Understanding results is assisted by taxonomy reports, genomic views, and multiple alignments. We interpret expected frequency thresholds, biological significance, and statistical significance. Weak hits provide no evidence, but hints for further analyses. We find genes that may code for homologous proteins by translated BLAST. We reduce false positives by filtering out low-complexity regions. Parsed BLAST results can be integrated into analysis pipelines. Links in the output connect to Entrez, PUBMED, structural, sequence, interaction, and expression databases. This facilitates integration with a wide spectrum of biological knowledge.
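
    The expected frequency thresholds mentioned above rest on the Karlin-Altschul statistics: a raw score S is converted to a bit score S' = (lambda*S - ln K) / ln 2, and the expected number of chance hits is E = m*n*2^(-S'). The sketch below is a simplified illustration: the lambda and K values are hypothetical placeholders (real values depend on the scoring system), and the length corrections applied by BLAST itself are ignored.

```python
import math


def bit_score(raw_score: float, lam: float, k: float) -> float:
    """Normalize a raw alignment score using the scoring-system parameters."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def expect_value(bits: float, query_len: int, db_len: int) -> float:
    """Expected number of chance alignments with at least this bit score."""
    return query_len * db_len * 2.0 ** (-bits)


LAMBDA = 1.28  # hypothetical placeholder, not taken from the source
K = 0.46       # hypothetical placeholder, not taken from the source

bits = bit_score(40, LAMBDA, K)
e_value = expect_value(bits, query_len=500, db_len=1_000_000)
print(f"bit score = {bits:.1f}, E = {e_value:.3g}")
```

    Because E scales with the product of query and database length, the same alignment becomes less significant as the searched database grows.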

  10. How many human proteoforms are there?

    PubMed

    Aebersold, Ruedi; Agar, Jeffrey N; Amster, I Jonathan; Baker, Mark S; Bertozzi, Carolyn R; Boja, Emily S; Costello, Catherine E; Cravatt, Benjamin F; Fenselau, Catherine; Garcia, Benjamin A; Ge, Ying; Gunawardena, Jeremy; Hendrickson, Ronald C; Hergenrother, Paul J; Huber, Christian G; Ivanov, Alexander R; Jensen, Ole N; Jewett, Michael C; Kelleher, Neil L; Kiessling, Laura L; Krogan, Nevan J; Larsen, Martin R; Loo, Joseph A; Ogorzalek Loo, Rachel R; Lundberg, Emma; MacCoss, Michael J; Mallick, Parag; Mootha, Vamsi K; Mrksich, Milan; Muir, Tom W; Patrie, Steven M; Pesavento, James J; Pitteri, Sharon J; Rodriguez, Henry; Saghatelian, Alan; Sandoval, Wendy; Schlüter, Hartmut; Sechi, Salvatore; Slavoff, Sarah A; Smith, Lloyd M; Snyder, Michael P; Thomas, Paul M; Uhlén, Mathias; Van Eyk, Jennifer E; Vidal, Marc; Walt, David R; White, Forest M; Williams, Evan R; Wohlschlager, Therese; Wysocki, Vicki H; Yates, Nathan A; Young, Nicolas L; Zhang, Bing

    2018-02-14

    Despite decades of accumulated knowledge about proteins and their post-translational modifications (PTMs), numerous questions remain regarding their molecular composition and biological function. One of the most fundamental queries is the extent to which the combinations of DNA-, RNA- and PTM-level variations explode the complexity of the human proteome. Here, we outline what we know from current databases and measurement strategies including mass spectrometry-based proteomics. In doing so, we examine prevailing notions about the number of modifications displayed on human proteins and how they combine to generate the protein diversity underlying health and disease. We frame central issues regarding determination of protein-level variation and PTMs, including some paradoxes present in the field today. We use this framework to assess existing data and to ask the question, "How many distinct primary structures of proteins (proteoforms) are created from the 20,300 human genes?" We also explore prospects for improving measurements to better regularize protein-level biology and efficiently associate PTMs to function and phenotype.

  11. Omics databases on kidney disease: where they can be found and how to benefit from them.

    PubMed

    Papadopoulos, Theofilos; Krochmal, Magdalena; Cisek, Katryna; Fernandes, Marco; Husi, Holger; Stevens, Robert; Bascands, Jean-Loup; Schanstra, Joost P; Klein, Julie

    2016-06-01

    In the recent decades, the evolution of omics technologies has led to advances in all biological fields, creating a demand for effective storage, management and exchange of rapidly generated data and research discoveries. To address this need, the development of databases of experimental outputs has become a common part of scientific practice in order to serve as knowledge sources and data-sharing platforms, providing information about genes, transcripts, proteins or metabolites. In this review, we present omics databases available currently, with a special focus on their application in kidney research and possibly in clinical practice. Databases are divided into two categories: general databases with a broad information scope and kidney-specific databases distinctively concentrated on kidney pathologies. In research, databases can be used as a rich source of information about pathophysiological mechanisms and molecular targets. In the future, databases will support clinicians with their decisions, providing better and faster diagnoses and setting the direction towards more preventive, personalized medicine. We also provide a test case demonstrating the potential of biological databases in comparing multi-omics datasets and generating new hypotheses to answer a critical and common diagnostic problem in nephrology practice. In the future, employment of databases combined with data integration and data mining should provide powerful insights into unlocking the mysteries of kidney disease, leading to a potential impact on pharmacological intervention and therapeutic disease management.

  12. Prevention of data duplication for high throughput sequencing repositories

    PubMed Central

    Gabdank, Idan; Chan, Esther T; Davidson, Jean M; Hilton, Jason A; Davis, Carrie A; Baymuradov, Ulugbek K; Narayanan, Aditi; Onate, Kathrina C; Graham, Keenan; Miyasato, Stuart R; Dreszer, Timothy R; Strattan, J Seth; Jolanki, Otto; Tanaka, Forrest Y; Hitz, Benjamin C

    2018-01-01

    Prevention of unintended duplication is one of the ongoing challenges many databases have to address. Working with high-throughput sequencing data, the complexity of that challenge increases with the complexity of the definition of a duplicate. In a computational data model, a data object represents a real entity like a reagent or a biosample. This representation is similar to how a card represents a book in a paper library catalog. Duplicated data objects not only waste storage, they can mislead users into assuming the model represents more than the single entity. Even if it is clear that two objects represent a single entity, data duplication opens the door to potential inconsistencies between the objects since the content of the duplicated objects can be updated independently, allowing divergence of the metadata associated with the objects. This is analogous to a situation in which a catalog in a paper library mistakenly contains two cards for a single copy of a book. If these cards simultaneously list two different individuals as current book borrowers, it would be difficult to determine which borrower (out of the two listed) actually has the book. Unfortunately, in a large database with multiple submitters, unintended duplication is to be expected. In this article, we present three principal guidelines the Encyclopedia of DNA Elements (ENCODE) Portal follows in order to prevent unintended duplication of both actual files and data objects: definition of identifiable data objects (I), object uniqueness validation (II) and de-duplication mechanism (III). In addition to explaining our modus operandi, we elaborate on the methods used for identification of sequencing data files. Comparison of the approach taken by the ENCODE Portal vs other widely used biological data repositories is provided. Database URL: https://www.encodeproject.org/ PMID:29688363
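
    Object uniqueness validation for files can be sketched as a registry keyed by a content digest, so that the same bytes cannot be registered under two identifiers. This is an illustrative assumption about how such a check might look, not the ENCODE Portal's actual code; the `FileRegistry` class, the accession strings and the choice of SHA-256 are all hypothetical.

```python
import hashlib
from typing import Dict


class FileRegistry:
    """Reject a file whose content was already registered under another accession."""

    def __init__(self) -> None:
        self._accession_by_digest: Dict[str, str] = {}

    def register(self, accession: str, content: bytes) -> str:
        # The digest identifies the file by its bytes, not by its name or metadata.
        digest = hashlib.sha256(content).hexdigest()
        existing = self._accession_by_digest.get(digest)
        if existing is not None and existing != accession:
            raise ValueError(f"{accession} duplicates the content of {existing}")
        self._accession_by_digest[digest] = accession
        return digest


registry = FileRegistry()
registry.register("FILE001", b"@read1\nACGT\n+\nIIII\n")
try:
    registry.register("FILE002", b"@read1\nACGT\n+\nIIII\n")
except ValueError as err:
    print(err)  # prints "FILE002 duplicates the content of FILE001"
```

    Checking at submission time is what keeps the two "catalog cards" from being created in the first place, rather than having to reconcile them later.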

  13. Space-related pharma-motifs for fast search of protein binding motifs and polypharmacological targets

    PubMed Central

    2012-01-01

    Background To discover a compound inhibiting multiple proteins (i.e. polypharmacological targets) is a new paradigm for the complex diseases (e.g. cancers and diabetes). In general, the polypharmacological proteins often share similar local binding environments and motifs. As the exponential growth of the number of protein structures, to find the similar structural binding motifs (pharma-motifs) is an emergency task for drug discovery (e.g. side effects and new uses for old drugs) and protein functions. Results We have developed a Space-Related Pharmamotifs (called SRPmotif) method to recognize the binding motifs by searching against protein structure database. SRPmotif is able to recognize conserved binding environments containing spatially discontinuous pharma-motifs which are often short conserved peptides with specific physico-chemical properties for protein functions. Among 356 pharma-motifs, 56.5% interacting residues are highly conserved. Experimental results indicate that 81.1% and 92.7% polypharmacological targets of each protein-ligand complex are annotated with same biological process (BP) and molecular function (MF) terms, respectively, based on Gene Ontology (GO). Our experimental results show that the identified pharma-motifs often consist of key residues in functional (active) sites and play the key roles for protein functions. The SRPmotif is available at http://gemdock.life.nctu.edu.tw/SRP/. Conclusions SRPmotif is able to identify similar pharma-interfaces and pharma-motifs sharing similar binding environments for polypharmacological targets by rapidly searching against the protein structure database. Pharma-motifs describe the conservations of binding environments for drug discovery and protein functions. Additionally, these pharma-motifs provide the clues for discovering new sequence-based motifs to predict protein functions from protein sequence databases. We believe that SRPmotif is useful for elucidating protein functions and drug discovery. PMID:23281852

  14. Space-related pharma-motifs for fast search of protein binding motifs and polypharmacological targets.

    PubMed

    Chiu, Yi-Yuan; Lin, Chun-Yu; Lin, Chih-Ta; Hsu, Kai-Cheng; Chang, Li-Zen; Yang, Jinn-Moon

    2012-01-01

    To discover a compound inhibiting multiple proteins (i.e. polypharmacological targets) is a new paradigm for the complex diseases (e.g. cancers and diabetes). In general, the polypharmacological proteins often share similar local binding environments and motifs. As the exponential growth of the number of protein structures, to find the similar structural binding motifs (pharma-motifs) is an emergency task for drug discovery (e.g. side effects and new uses for old drugs) and protein functions. We have developed a Space-Related Pharmamotifs (called SRPmotif) method to recognize the binding motifs by searching against protein structure database. SRPmotif is able to recognize conserved binding environments containing spatially discontinuous pharma-motifs which are often short conserved peptides with specific physico-chemical properties for protein functions. Among 356 pharma-motifs, 56.5% interacting residues are highly conserved. Experimental results indicate that 81.1% and 92.7% polypharmacological targets of each protein-ligand complex are annotated with same biological process (BP) and molecular function (MF) terms, respectively, based on Gene Ontology (GO). Our experimental results show that the identified pharma-motifs often consist of key residues in functional (active) sites and play the key roles for protein functions. The SRPmotif is available at http://gemdock.life.nctu.edu.tw/SRP/. SRPmotif is able to identify similar pharma-interfaces and pharma-motifs sharing similar binding environments for polypharmacological targets by rapidly searching against the protein structure database. Pharma-motifs describe the conservations of binding environments for drug discovery and protein functions. Additionally, these pharma-motifs provide the clues for discovering new sequence-based motifs to predict protein functions from protein sequence databases. We believe that SRPmotif is useful for elucidating protein functions and drug discovery.

  15. Sequence modelling and an extensible data model for genomic database

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Li, Peter Wei-Der

    1992-01-01

    The Human Genome Project (HGP) plans to sequence the human genome by the beginning of the next century. It will generate DNA sequences of more than 10 billion bases and complex marker sequences (maps) of more than 100 million markers. All of this information will be stored in database management systems (DBMSs). However, existing data models do not have the abstraction mechanism for modelling sequences and existing DBMSs do not have operations for complex sequences. This work addresses the problem of sequence modelling in the context of the HGP and the more general problem of an extensible object data model that can incorporate the sequence model as well as existing and future data constructs and operators. First, we proposed a general sequence model that is application and implementation independent. This model is used to capture the sequence information found in the HGP at the conceptual level. In addition, abstract and biological sequence operators are defined for manipulating the modelled sequences. Second, we combined many features of semantic and object oriented data models into an extensible framework, which we called the "Extensible Object Model", to address the need of a modelling framework for incorporating the sequence data model with other types of data constructs and operators. This framework is based on the conceptual separation between constructors and constraints. We then used this modelling framework to integrate the constructs for the conceptual sequence model. The Extensible Object Model is also defined with a graphical representation, which is useful as a tool for database designers. Finally, we defined a query language to support this model and implemented the query processor to demonstrate the feasibility of the extensible framework and the usefulness of the conceptual sequence model.

  16. Sequence modelling and an extensible data model for genomic database

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Li, Peter Wei-Der

    1992-01-01

    The Human Genome Project (HGP) plans to sequence the human genome by the beginning of the next century. It will generate DNA sequences of more than 10 billion bases and complex marker sequences (maps) of more than 100 million markers. All of this information will be stored in database management systems (DBMSs). However, existing data models do not have the abstraction mechanism for modelling sequences and existing DBMSs do not have operations for complex sequences. This work addresses the problem of sequence modelling in the context of the HGP and the more general problem of an extensible object data model that can incorporate the sequence model as well as existing and future data constructs and operators. First, we proposed a general sequence model that is application and implementation independent. This model is used to capture the sequence information found in the HGP at the conceptual level. In addition, abstract and biological sequence operators are defined for manipulating the modelled sequences. Second, we combined many features of semantic and object oriented data models into an extensible framework, which we called the "Extensible Object Model", to address the need of a modelling framework for incorporating the sequence data model with other types of data constructs and operators. This framework is based on the conceptual separation between constructors and constraints. We then used this modelling framework to integrate the constructs for the conceptual sequence model. The Extensible Object Model is also defined with a graphical representation, which is useful as a tool for database designers. Finally, we defined a query language to support this model and implemented the query processor to demonstrate the feasibility of the extensible framework and the usefulness of the conceptual sequence model.

  17. Top-k similar graph matching using TraM in biological networks.

    PubMed

    Amin, Mohammad Shafkat; Finley, Russell L; Jamil, Hasan M

    2012-01-01

    Many emerging database applications entail sophisticated graph-based query manipulation, predominantly evident in large-scale scientific applications. To access the information embedded in graphs, efficient graph matching tools and algorithms have become of prime importance. Although the prohibitively expensive time complexity associated with exact subgraph isomorphism techniques has limited their efficacy in the application domain, approximate yet efficient graph matching techniques have received much attention due to their pragmatic applicability. Since public domain databases are noisy and incomplete in nature, inexact graph matching techniques have proven to be more promising in terms of inferring knowledge from numerous structural data repositories. In this paper, we propose a novel technique called TraM for approximate graph matching that off-loads a significant amount of its processing onto the database, making the approach viable for large graphs. Moreover, the vector space embedding of the graphs and efficient filtration of the search space enable computation of approximate graph similarity at a throw-away cost. We annotate nodes of the query graphs by means of their global topological properties and compare them with neighborhood-biased segments of the data graph for proper matches. We have conducted experiments on several real data sets, and have demonstrated the effectiveness and efficiency of the proposed method.

  18. Genomic Approach to Understand the Association of DNA Repair with Longevity and Healthy Aging Using Genomic Databases of Oldest-Old Population

    PubMed Central

    Kim, Hyun Soo

    2018-01-01

    The aged population is increasing worldwide due to the inevitable aging process. Accordingly, longevity and healthy aging have been spotlighted to promote the social contribution of the aged population. Many studies in the past few decades have reported on the process of aging and longevity, emphasizing the importance of maintaining genomic stability in exceptionally long-lived populations. The underlying reason for longevity remains unclear due to its complexity, which involves multiple factors. With advances in sequencing technology and human genome-associated approaches, population-based genomic studies are increasing. In this review, we summarize recent longevity and healthy aging studies of human populations, focusing on DNA repair as a major factor in maintaining genome integrity. To keep pace with recent growth in genomic research, aging- and longevity-associated genomic databases are also briefly introduced. To suggest novel approaches to investigate longevity-associated genetic variants related to DNA repair using genomic databases, gene set analysis was conducted, focusing on DNA repair- and longevity-associated genes. Their biological networks were additionally analyzed to grasp major factors containing genetic variants of human longevity and healthy aging in DNA repair mechanisms. In summary, this review emphasizes DNA repair activity in human longevity and suggests an approach for conducting DNA repair-associated genomic studies on human healthy aging.

  19. Text mining for the biocuration workflow

    PubMed Central

    Hirschman, Lynette; Burns, Gully A. P. C; Krallinger, Martin; Arighi, Cecilia; Cohen, K. Bretonnel; Valencia, Alfonso; Wu, Cathy H.; Chatr-Aryamontri, Andrew; Dowell, Karen G.; Huala, Eva; Lourenço, Anália; Nash, Robert; Veuthey, Anne-Lise; Wiegers, Thomas; Winter, Andrew G.

    2012-01-01

    Molecular biology has become heavily dependent on biological knowledge encoded in expert curated biological databases. As the volume of biological literature increases, biocurators need help in keeping up with the literature; (semi-) automated aids for biocuration would seem to be an ideal application for natural language processing and text mining. However, to date, there have been few documented successes for improving biocuration throughput using text mining. Our initial investigations took place for the workshop on ‘Text Mining for the BioCuration Workflow’ at the third International Biocuration Conference (Berlin, 2009). We interviewed biocurators to obtain workflows from eight biological databases. This initial study revealed high-level commonalities, including (i) selection of documents for curation; (ii) indexing of documents with biologically relevant entities (e.g. genes); and (iii) detailed curation of specific relations (e.g. interactions); however, the detailed workflows also showed many variabilities. Following the workshop, we conducted a survey of biocurators. The survey identified biocurator priorities, including the handling of full text indexed with biological entities and support for the identification and prioritization of documents for curation. It also indicated that two-thirds of the biocuration teams had experimented with text mining and almost half were using text mining at that time. Analysis of our interviews and survey provide a set of requirements for the integration of text mining into the biocuration workflow. These can guide the identification of common needs across curated databases and encourage joint experimentation involving biocurators, text mining developers and the larger biomedical research community. PMID:22513129

  20. Text mining for the biocuration workflow.

    PubMed

    Hirschman, Lynette; Burns, Gully A P C; Krallinger, Martin; Arighi, Cecilia; Cohen, K Bretonnel; Valencia, Alfonso; Wu, Cathy H; Chatr-Aryamontri, Andrew; Dowell, Karen G; Huala, Eva; Lourenço, Anália; Nash, Robert; Veuthey, Anne-Lise; Wiegers, Thomas; Winter, Andrew G

    2012-01-01

    Molecular biology has become heavily dependent on biological knowledge encoded in expert curated biological databases. As the volume of biological literature increases, biocurators need help in keeping up with the literature; (semi-) automated aids for biocuration would seem to be an ideal application for natural language processing and text mining. However, to date, there have been few documented successes for improving biocuration throughput using text mining. Our initial investigations took place for the workshop on 'Text Mining for the BioCuration Workflow' at the third International Biocuration Conference (Berlin, 2009). We interviewed biocurators to obtain workflows from eight biological databases. This initial study revealed high-level commonalities, including (i) selection of documents for curation; (ii) indexing of documents with biologically relevant entities (e.g. genes); and (iii) detailed curation of specific relations (e.g. interactions); however, the detailed workflows also showed many variabilities. Following the workshop, we conducted a survey of biocurators. The survey identified biocurator priorities, including the handling of full text indexed with biological entities and support for the identification and prioritization of documents for curation. It also indicated that two-thirds of the biocuration teams had experimented with text mining and almost half were using text mining at that time. Analysis of our interviews and survey provide a set of requirements for the integration of text mining into the biocuration workflow. These can guide the identification of common needs across curated databases and encourage joint experimentation involving biocurators, text mining developers and the larger biomedical research community.

  1. Conceptual Foundations of Systems Biology Explaining Complex Cardiac Diseases.

    PubMed

    Louridas, George E; Lourida, Katerina G

    2017-02-21

    Systems biology is an important concept that connects molecular biology and genomics with computing science, mathematics and engineering. An endeavor is made in this paper to associate basic conceptual ideas of systems biology with clinical medicine. Complex cardiac diseases are clinical phenotypes generated by integration of genetic, molecular and environmental factors. Basic concepts of systems biology like network construction, modular thinking, biological constraints (downward biological direction) and emergence (upward biological direction) could be applied to clinical medicine. In the field of cardiology especially, these concepts can be used to explain complex clinical cardiac phenotypes like chronic heart failure and coronary artery disease. Cardiac diseases are complex biological entities which, like other biological phenomena, can be explained by a systems biology approach. The above powerful tools of systems biology can explain robustness, growth and stability during the disease process from modulation to phenotype. The purpose of the present review paper is to implement a systems biology strategy and incorporate some conceptual issues raised by this approach into the clinical field of complex cardiac diseases. Cardiac disease process and progression can be addressed by the holistic, realistic approach of systems biology in order to better define earlier diagnosis and more effective therapy.

  2. Identification of triacylglycerol using automated annotation of high resolution multistage mass spectral trees.

    PubMed

    Wang, Xiupin; Peng, Qingzhi; Li, Peiwu; Zhang, Qi; Ding, Xiaoxia; Zhang, Wen; Zhang, Liangxiao

    2016-10-12

    The high complexity of identifying non-target triacylglycerols (TAGs) is a major challenge in lipidomics analysis. To identify non-target TAGs, accurate MS(n) spectrometry, a powerful tool that generates so-called ion trees, is used. In this paper, we present a technique for efficient structural elucidation of TAGs on MS(n) spectral trees produced by LTQ Orbitrap MS(n), which was implemented as an open source software package, TIT. The TIT software was used to support automatic annotation of non-target TAGs on MS(n) ion trees from a self-built fragment ion database. This database includes 19108 simulated TAG molecules from a random combination of fatty acids and the corresponding 500582 self-built multistage fragment ions (MS ≤ 3). Our software can identify TAGs using a "stage-by-stage elimination" strategy. By utilizing the MS(1) accurate mass and referenced RKMD, the TIT software can discriminate unique elemental composition candidates. The regiospecific isomers of fatty acyl chains are then distinguished using MS(2) and MS(3) fragment spectra. We applied the algorithm to a selection of 45 TAG standards and demonstrated that the molecular ions could be 100% correctly assigned. Therefore, the TIT software could be applied to TAG identification in complex biological samples such as mouse plasma extracts. Copyright © 2016 Elsevier B.V. All rights reserved.
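
    The "stage-by-stage elimination" idea described above can be sketched roughly as follows. This is a minimal illustration only, not the TIT implementation: the function names, the candidate record layout, the 5 ppm tolerance and all m/z values are assumptions made up for the example, and the RKMD step is omitted.

```python
def within_ppm(observed, theoretical, ppm):
    """True if the observed m/z lies within +/- ppm of the theoretical m/z."""
    return abs(observed - theoretical) <= theoretical * ppm * 1e-6


def eliminate(candidates, precursor_mz, fragment_mzs, ppm=5.0):
    """Two-stage filter over hypothetical database candidates.

    Stage 1 keeps candidates whose precursor m/z matches the MS(1) mass.
    Stage 2 keeps survivors whose predicted fragments explain every
    observed fragment ion.
    """
    # Stage 1: accurate-mass match on the precursor ion.
    survivors = [c for c in candidates
                 if within_ppm(precursor_mz, c["mz"], ppm)]
    # Stage 2: every observed fragment must match some predicted fragment.
    return [c["name"] for c in survivors
            if all(any(within_ppm(f, t, ppm) for t in c["fragments"])
                   for f in fragment_mzs)]


# Made-up candidates; the numbers are illustrative, not real TAG masses.
CANDIDATES = [
    {"name": "cand_A", "mz": 800.0000, "fragments": [500.0, 520.0]},
    {"name": "cand_B", "mz": 800.0020, "fragments": [500.0, 540.0]},
    {"name": "cand_C", "mz": 801.0000, "fragments": [500.0, 520.0]},
]

print(eliminate(CANDIDATES, 800.001, [500.001, 520.0005]))
```

    In this toy data, cand_C is dropped at stage 1 (precursor mass too far off) and cand_B at stage 2 (one observed fragment is unexplained), which mirrors how each additional MS stage narrows the candidate list.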

  3. Freshwater Biological Traits Database (Traits)

    EPA Pesticide Factsheets

    The traits database was compiled for a project on climate change effects on river and stream ecosystems. The traits data, gathered from multiple sources, focused on information published or otherwise well-documented by trustworthy sources.

  4. PubChem BioAssay: 2017 update

    PubMed Central

    Wang, Yanli; Bryant, Stephen H.; Cheng, Tiejun; Wang, Jiyao; Gindulyte, Asta; Shoemaker, Benjamin A.; Thiessen, Paul A.; He, Siqian; Zhang, Jian

    2017-01-01

    PubChem's BioAssay database (https://pubchem.ncbi.nlm.nih.gov) has served as a public repository for small-molecule and RNAi screening data since 2004, providing open access to its data content for the community. PubChem accepts data submissions from researchers worldwide in academia, industry and government agencies. PubChem also exchanges data with other chemical biology database stakeholders. After more than a decade of development effort, it has become an important information resource supporting drug discovery and chemical biology research. To facilitate data discovery, PubChem is integrated with all other databases at NCBI. In this work, we provide an update for the PubChem BioAssay database describing several recent developments, including added sources of research data, a redesigned BioAssay record page, a new BioAssay classification browser and new features in the Upload system facilitating data sharing. PMID:27899599

  5. Genomic Comparisons Reveal Microevolutionary Differences in Mycobacterium abscessus Subspecies

    PubMed Central

    Tan, Joon L.; Ng, Kee P.; Ong, Chia S.; Ngeow, Yun F.

    2017-01-01

    Mycobacterium abscessus, a rapid-growing non-tuberculous mycobacterium, has been the cause of sporadic and outbreak infections world-wide. The subspecies in M. abscessus complex (M. abscessus, M. massiliense, and M. bolletii) are associated with different biologic and pathogenic characteristics and are known to be among the most frequently isolated opportunistic pathogens from clinical material. To date, the evolutionary forces that could have contributed to these biological and clinical differences are still unclear. We compared genome data from 243 M. abscessus strains downloaded from the NCBI ftp Refseq database to understand how the microevolutionary processes of homologous recombination and positive selection influenced the diversification of the M. abscessus complex at the subspecies level. The three subspecies are clearly separated in the Minimum Spanning Tree. Their MUMi-based genomic distances support the separation of M. massiliense and M. bolletii into two subspecies. Maximum Likelihood analysis through dN/dS (the ratio of number of non-synonymous substitutions per non-synonymous site, to the number of synonymous substitutions per synonymous site) identified distinct genes in each subspecies that could have been affected by positive selection during evolution. The results of genome-wide alignment based on concatenated locally-collinear blocks suggest that (a) recombination has affected the M. abscessus complex more than mutation and positive selection; (b) recombination occurred more frequently in M. massiliense than in the other two subspecies; and (c) the recombined segments in the three subspecies have come from different intra-species and inter-species origins. 
    The results lead to the identification of possible gene sets that could have been responsible for the subspecies-specific features and suggest independent evolution among the three subspecies, with recombination playing a more significant role than positive selection in the diversification among members in this complex. PMID:29109707
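
    The dN/dS ratio defined in the abstract above can be written out as a small calculation. This is only a sketch of the definition from raw counts; it is not the Maximum Likelihood analysis used in the study, it applies no correction for multiple substitutions, and the function names and example counts are assumptions chosen for illustration.

```python
import math


def dn_ds(nonsyn_subs, nonsyn_sites, syn_subs, syn_sites):
    """Ratio of non-synonymous substitutions per non-synonymous site (dN)
    to synonymous substitutions per synonymous site (dS)."""
    if nonsyn_sites <= 0 or syn_sites <= 0:
        raise ValueError("site counts must be positive")
    dn = nonsyn_subs / nonsyn_sites
    ds = syn_subs / syn_sites
    if ds == 0:
        # The ratio is undefined without any synonymous change.
        return math.inf if dn > 0 else math.nan
    return dn / ds


def interpret(omega):
    """Conventional reading of the ratio: >1 positive selection,
    <1 purifying selection, 1 neutral."""
    if math.isnan(omega):
        return "undefined"
    if omega > 1:
        return "positive selection"
    if omega < 1:
        return "purifying selection"
    return "neutral"


# Hypothetical counts for a single gene.
omega = dn_ds(nonsyn_subs=3, nonsyn_sites=12, syn_subs=2, syn_sites=4)
print(omega, interpret(omega))
```

    A gene with a ratio above 1 is the kind of candidate the abstract describes as possibly affected by positive selection; in practice codon-model software estimates the ratio rather than counting directly.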

  6. Genomic Comparisons Reveal Microevolutionary Differences in Mycobacterium abscessus Subspecies.

    PubMed

    Tan, Joon L; Ng, Kee P; Ong, Chia S; Ngeow, Yun F

    2017-01-01

    Mycobacterium abscessus, a rapid-growing non-tuberculous mycobacterium, has been the cause of sporadic and outbreak infections world-wide. The subspecies in M. abscessus complex (M. abscessus, M. massiliense, and M. bolletii) are associated with different biologic and pathogenic characteristics and are known to be among the most frequently isolated opportunistic pathogens from clinical material. To date, the evolutionary forces that could have contributed to these biological and clinical differences are still unclear. We compared genome data from 243 M. abscessus strains downloaded from the NCBI ftp Refseq database to understand how the microevolutionary processes of homologous recombination and positive selection influenced the diversification of the M. abscessus complex at the subspecies level. The three subspecies are clearly separated in the Minimum Spanning Tree. Their MUMi-based genomic distances support the separation of M. massiliense and M. bolletii into two subspecies. Maximum Likelihood analysis through dN/dS (the ratio of number of non-synonymous substitutions per non-synonymous site, to the number of synonymous substitutions per synonymous site) identified distinct genes in each subspecies that could have been affected by positive selection during evolution. The results of genome-wide alignment based on concatenated locally-collinear blocks suggest that (a) recombination has affected the M. abscessus complex more than mutation and positive selection; (b) recombination occurred more frequently in M. massiliense than in the other two subspecies; and (c) the recombined segments in the three subspecies have come from different intra-species and inter-species origins. The results lead to the identification of possible gene sets that could have been responsible for the subspecies-specific features and suggest independent evolution among the three subspecies, with recombination playing a more significant role than positive selection in the diversification among members in this complex.

  7. Searching and Extracting Data from the EMBL-EBI Complex Portal.

    PubMed

    Meldal, Birgit H M; Orchard, Sandra

    2018-01-01

    The Complex Portal (www.ebi.ac.uk/complexportal) is an encyclopedia of macromolecular complexes. Complexes are assigned unique, stable IDs, are species specific, and list all participating members with links to an appropriate reference database (UniProtKB, ChEBI, RNAcentral). Each complex is annotated extensively with its functions, properties, structure, stoichiometry, tissue expression profile, and subcellular location. Links to domain-specific databases allow the user to access additional information and enable data searching and filtering. Complexes can be saved and downloaded in PSI-MI XML, MI-JSON, and tab-delimited formats.

  8. Compiling Holocene RSL databases from near- to far-field regions: proxies, difficulties and possible solutions

    NASA Astrophysics Data System (ADS)

    Vacchi, M.; Horton, B.; Mann, T.; Engelhart, S. E.; Rovere, A.; Nikitina, D.; Bender, M.; Roy, K.; Peltier, W. R.

    2017-12-01

    Reconstructions of relative sea level (RSL) have implications for investigation of crustal movements, calibration of earth rheology models and the reconstruction of ice sheets. In recent years, efforts were made to create RSL databases following a standardized methodology. These regional databases provide a framework for developing our understanding of the primary mechanisms of RSL change since the Last Glacial Maximum and a long-term baseline against which to gauge changes in sea level during the 20th century and forecasts for the 21st. We report here the results of recently compiled databases in very different climatic and geographic contexts that are the northeastern Canadian coast, the Mediterranean Sea as well as the southeastern Asiatic region. Our re-evaluation of sea-level indicators from geological and archaeological investigations have yielded more than 3000 RSL data-points mainly from salt and freshwater wetlands or adjacent estuarine sediment, isolation basins, beach ridges, fixed biological indicators, beachrocks as well as coastal archaeological structures. We outline some of the inherent difficulties, and potential solutions to analyse sea-level data in such different depositional environments. In particular, we discuss problems related with the definition of standardized indicative meaning, and with the re-evaluation of old radiocarbon samples. We further address complex tectonics influences and the framework to compare such large variability of RSL data-points. Finally we discuss the implications of our results for the patterns of glacio-isostatic adjustment in these regions.

  9. A structured interface to the object-oriented genomics unified schema for XML-formatted data.

    PubMed

    Clark, Terry; Jurek, Josef; Kettler, Gregory; Preuss, Daphe

    2005-01-01

    Data management systems are fast becoming required components in many biology laboratories as the role of computer-based information grows. Although the need for data management systems is on the rise, their inherent complexities can deter the full and routine use of their computational capabilities. The significant undertaking to implement a capable production system can be reduced in part by adapting an established data management system. In such a way, we are leveraging the Genomics Unified Schema (GUS) developed at the Computational Biology and Informatics Laboratory at the University of Pennsylvania as a foundation for managing and analysing DNA sequence data in centromere research projects around Arabidopsis thaliana and related species. Because GUS provides a core schema that includes support for genome sequences, mRNA and its expression, and annotated chromosomes, it is ideal for synthesising a variety of parameters to analyse these repetitive and highly dynamic portions of the genome. Despite this, production-strength data management frameworks are complex, requiring dedicated efforts to adapt and maintain. The work reported in this article addresses one component of such an effort, namely the pivotal task of marshalling data from various sources into GUS. In order to harness GUS for our project, and motivated by efficiency needs, we developed a structured framework for transferring data into GUS from outside sources. This technology is embodied in a GUS object-layer processor, XMLGUS. XMLGUS facilitates incorporating data into GUS by (i) formulating an XML interface that includes relational database key constraint definitions, (ii) regularising traversal through that XML, (iii) realising automatic processing of the XML with database key constraints and (iv) allowing for special processing of input data within the framework for automated processing. 
    The application of XMLGUS to production pipeline processing for a sequencing project and inputting the Arabidopsis genome into GUS is discussed. XMLGUS is available from the Flora website (http://flora.ittc.ku.edu/).

  10. Online Analytical Processing (OLAP): A Fast and Effective Data Mining Tool for Gene Expression Databases

    PubMed Central

    2005-01-01

    Gene expression databases contain a wealth of information, but current data mining tools are limited in their speed and effectiveness in extracting meaningful biological knowledge from them. Online analytical processing (OLAP) can be used as a supplement to cluster analysis for fast and effective data mining of gene expression databases. We used Analysis Services 2000, a product that ships with SQLServer2000, to construct an OLAP cube that was used to mine a time series experiment designed to identify genes associated with resistance of soybean to the soybean cyst nematode, a devastating pest of soybean. The data for these experiments is stored in the soybean genomics and microarray database (SGMD). A number of candidate resistance genes and pathways were found. Compared to traditional cluster analysis of gene expression data, OLAP was more effective and faster in finding biologically meaningful information. OLAP is available from a number of vendors and can work with any relational database management system through OLE DB. PMID:16046824
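
    The roll-up operation at the heart of an OLAP cube can be illustrated in a few lines. The sketch below is a plain in-memory stand-in, not Analysis Services or the SGMD schema: the dimension names (gene, line, hour), the measure name and the expression values are all hypothetical.

```python
from collections import defaultdict


def roll_up(facts, dims, measure="expr"):
    """Average a measure over every combination of the chosen dimensions.

    Dropping a dimension from `dims` aggregates over it, which is the
    OLAP roll-up; adding one back corresponds to a drill-down.
    """
    totals = defaultdict(lambda: [0.0, 0])  # key -> [sum, count]
    for row in facts:
        key = tuple(row[d] for d in dims)
        totals[key][0] += row[measure]
        totals[key][1] += 1
    return {key: s / n for key, (s, n) in totals.items()}


# Made-up fact table: one row per gene, plant line and time point.
FACTS = [
    {"gene": "g1", "line": "resistant", "hour": 6, "expr": 2.0},
    {"gene": "g1", "line": "resistant", "hour": 12, "expr": 4.0},
    {"gene": "g1", "line": "susceptible", "hour": 6, "expr": 1.0},
    {"gene": "g2", "line": "resistant", "hour": 6, "expr": 0.5},
]

# Roll up over time: mean expression per gene and plant line.
print(roll_up(FACTS, ["gene", "line"]))
# Roll up further: mean expression per plant line only.
print(roll_up(FACTS, ["line"]))
```

    A real cube precomputes such aggregates so that slicing by gene, line or time point is fast, which is the speed advantage over repeated cluster analysis that the abstract reports.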

  11. Identifying novel glioma associated pathways based on systems biology level meta-analysis.

    PubMed

    Hu, Yangfan; Li, Jinquan; Yan, Wenying; Chen, Jiajia; Li, Yin; Hu, Guang; Shen, Bairong

    2013-01-01

    Recent advances in microarray technology, including genomics, proteomics, and metabolomics, pose a great challenge for integrating these "-omics" data to analyze complex diseases. Glioma is an extremely aggressive and lethal form of brain tumor, and thus the study of the molecular mechanisms underlying glioma remains very important. To date, most studies have focused on detecting the differentially expressed genes in glioma. However, meta-analysis for pathway analysis based on multiple microarray datasets has not been systematically pursued. In this study, we therefore developed a systems biology based approach that integrates three types of omics data to identify common pathways in glioma. Firstly, a meta-analysis was performed to study the overlap of signatures at different levels based on the microarray gene expression data of glioma. Among these gene expression datasets, 12 pathways shared by four stages were found in the GeneGO database. Then, microRNA expression profiles and ChIP-seq data were integrated for further pathway enrichment analysis. As a result, we suggest that 5 of these pathways could serve as putative pathways in glioma. Among them, the pathway of TGF-beta-dependent induction of EMT via SMAD is of particular importance. Our results demonstrate that meta-analysis at the systems biology level provides a more useful approach to studying the molecular mechanisms of complex diseases. The integration of different types of omics data, including gene expression microarrays, microRNA and ChIP-seq data, suggests some common pathways correlated with glioma. These findings will offer useful potential candidates for targeted therapeutic intervention in glioma.

  12. Evolving Strategies for the Incorporation of Bioinformatics Within the Undergraduate Cell Biology Curriculum

    PubMed Central

    Honts, Jerry E.

    2003-01-01

    Recent advances in genomics and structural biology have resulted in an unprecedented increase in biological data available from Internet-accessible databases. In order to help students effectively use this vast repository of information, undergraduate biology students at Drake University were introduced to bioinformatics software and databases in three courses, beginning with an introductory course in cell biology. The exercises and projects that were used to help students develop literacy in bioinformatics are described. In a recently offered course in bioinformatics, students developed their own simple sequence analysis tool using the Perl programming language. These experiences are described from the point of view of the instructor as well as the students. A preliminary assessment has been made of the degree to which students had developed a working knowledge of bioinformatics concepts and methods. Finally, some conclusions have been drawn from these courses that may be helpful to instructors wishing to introduce bioinformatics within the undergraduate biology curriculum. PMID:14673489

  13. DNA algorithms of implementing biomolecular databases on a biological computer.

    PubMed

    Chang, Weng-Long; Vasilakos, Athanasios V

    2015-01-01

    In this paper, DNA algorithms are proposed to perform eight operations of relational algebra (calculus), which include Cartesian product, union, set difference, selection, projection, intersection, join, and division, on biomolecular relational databases.
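
    For readers unfamiliar with the eight operations listed above, the following sketch shows their conventional set-based semantics in ordinary software. It says nothing about how the paper encodes them in DNA; the Relation class, its method names and the example data are assumptions made for illustration.

```python
from itertools import product


class Relation:
    """A relation as a tuple of attribute names plus a set of rows."""

    def __init__(self, attrs, rows):
        self.attrs = tuple(attrs)
        self.rows = {tuple(r) for r in rows}

    def _check(self, other):
        if self.attrs != other.attrs:
            raise ValueError("schemas differ")

    def union(self, other):
        self._check(other)
        return Relation(self.attrs, self.rows | other.rows)

    def difference(self, other):
        self._check(other)
        return Relation(self.attrs, self.rows - other.rows)

    def intersection(self, other):
        self._check(other)
        return Relation(self.attrs, self.rows & other.rows)

    def select(self, pred):
        # pred receives each row as an attribute -> value dict.
        return Relation(self.attrs, {r for r in self.rows
                                     if pred(dict(zip(self.attrs, r)))})

    def project(self, attrs):
        idx = [self.attrs.index(a) for a in attrs]
        return Relation(attrs, {tuple(r[i] for i in idx) for r in self.rows})

    def cartesian(self, other):
        if set(self.attrs) & set(other.attrs):
            raise ValueError("attribute names must be disjoint")
        return Relation(self.attrs + other.attrs,
                        {a + b for a, b in product(self.rows, other.rows)})

    def join(self, other):
        # Natural join on the attributes the two relations share.
        common = [a for a in self.attrs if a in other.attrs]
        extra = [a for a in other.attrs if a not in common]
        li = [self.attrs.index(a) for a in common]
        ri = [other.attrs.index(a) for a in common]
        ei = [other.attrs.index(a) for a in extra]
        rows = {a + tuple(b[i] for i in ei)
                for a in self.rows for b in other.rows
                if all(a[x] == b[y] for x, y in zip(li, ri))}
        return Relation(self.attrs + tuple(extra), rows)

    def divide(self, other):
        # Rows of the remaining attributes paired with EVERY row of other.
        if not set(other.attrs) <= set(self.attrs):
            raise ValueError("divisor attributes must be a subset")
        keep = [a for a in self.attrs if a not in other.attrs]
        candidates = self.project(keep)
        reordered = self.project(tuple(keep) + other.attrs)
        missing = candidates.cartesian(other).difference(reordered)
        return candidates.difference(missing.project(keep))


# Hypothetical example: which proteins interact with all required partners?
interacts = Relation(("protein", "partner"),
                     [("A", "X"), ("A", "Y"), ("B", "X")])
required = Relation(("partner",), [("X",), ("Y",)])
print(interacts.divide(required).rows)
```

    Division is the least familiar of the eight; here it is built from projection, Cartesian product and set difference, which is the standard textbook derivation.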

  14. PrionScan: an online database of predicted prion domains in complete proteomes.

    PubMed

    Espinosa Angarica, Vladimir; Angulo, Alfonso; Giner, Arturo; Losilla, Guillermo; Ventura, Salvador; Sancho, Javier

    2014-02-05

    Prions are a particular type of amyloids related to a large variety of important processes in cells, but also responsible for serious diseases in mammals and humans. The number of experimentally characterized prions is still low and corresponds to a handful of examples in microorganisms and mammals. Prion aggregation is mediated by specific protein domains with a remarkable compositional bias towards glutamine/asparagine and against charged residues and prolines. These compositional features have been used to predict new prion proteins in the genomes of different organisms. Despite these efforts, there are only a few available data sources containing prion predictions at a genomic scale. Here we present PrionScan, a new database of predicted prion-like domains in complete proteomes. We have previously developed a predictive methodology to identify and score prionogenic stretches in protein sequences. In the present work, we exploit this approach to scan all the protein sequences in public databases and compile a repository containing relevant information on proteins bearing prion-like domains. The database is updated regularly alongside UniprotKB and in its present version contains approximately 28000 predictions in proteins from different functional categories in more than 3200 organisms from all the taxonomic subdivisions. PrionScan can be used in two different ways: database query and analysis of protein sequences submitted by the users. In the first mode, simple queries allow retrieval of a detailed description of the properties of a defined protein. Queries can also be combined to generate more complex and specific search patterns. In the second mode, users can submit and analyze their own sequences. It is expected that this database will provide relevant insights into prion functions and regulation from a genome-wide perspective, allowing researchers to perform cross-species prion biology studies. Our database might also be useful for guiding experimentalists in the identification of new candidates for further experimental characterization.
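
    The compositional bias mentioned above (enrichment in glutamine/asparagine, depletion of charged residues and prolines) can be illustrated with a simple sliding-window scan. This is not PrionScan's scoring method, which the abstract does not detail; the score, the window size, the function names and the toy sequence are assumptions made for the example.

```python
FAVORED = set("QN")        # glutamine, asparagine
DISFAVORED = set("DEKRP")  # charged residues and proline


def window_scores(seq, window=10):
    """Score each window as (favored - disfavored residues) / window."""
    seq = seq.upper()
    if window <= 0 or window > len(seq):
        raise ValueError("window must be between 1 and len(seq)")
    scores = []
    for start in range(len(seq) - window + 1):
        chunk = seq[start:start + window]
        fav = sum(c in FAVORED for c in chunk)
        dis = sum(c in DISFAVORED for c in chunk)
        scores.append((start, (fav - dis) / window))
    return scores


def best_window(seq, window=10):
    """Return (start, score) of the highest-scoring window."""
    return max(window_scores(seq, window), key=lambda t: t[1])


# Toy sequence: a charged stretch, a Q/N-rich stretch, then glycines.
TOY = "DEKRP" + "QNQNQ" + "GGGGG"
print(best_window(TOY, window=5))
```

    A production predictor would use a trained, position-aware score and much longer windows; the point here is only that a composition-based scan localizes the Q/N-rich segment.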

  15. BIOFRAG – a new database for analyzing BIOdiversity responses to forest FRAGmentation

    PubMed Central

    Pfeifer, Marion; Lefebvre, Veronique; Gardner, Toby A; Arroyo-Rodriguez, Victor; Baeten, Lander; Banks-Leite, Cristina; Barlow, Jos; Betts, Matthew G; Brunet, Joerg; Cerezo, Alexis; Cisneros, Laura M; Collard, Stuart; D'Cruze, Neil; da Silva Motta, Catarina; Duguay, Stephanie; Eggermont, Hilde; Eigenbrod, Felix; Hadley, Adam S; Hanson, Thor R; Hawes, Joseph E; Heartsill Scalley, Tamara; Klingbeil, Brian T; Kolb, Annette; Kormann, Urs; Kumar, Sunil; Lachat, Thibault; Lakeman Fraser, Poppy; Lantschner, Victoria; Laurance, William F; Leal, Inara R; Lens, Luc; Marsh, Charles J; Medina-Rangel, Guido F; Melles, Stephanie; Mezger, Dirk; Oldekop, Johan A; Overal, William L; Owen, Charlotte; Peres, Carlos A; Phalan, Ben; Pidgeon, Anna M; Pilia, Oriana; Possingham, Hugh P; Possingham, Max L; Raheem, Dinarzarde C; Ribeiro, Danilo B; Ribeiro Neto, Jose D; Douglas Robinson, W; Robinson, Richard; Rytwinski, Trina; Scherber, Christoph; Slade, Eleanor M; Somarriba, Eduardo; Stouffer, Philip C; Struebig, Matthew J; Tylianakis, Jason M; Tscharntke, Teja; Tyre, Andrew J; Urbina Cardona, Jose N; Vasconcelos, Heraldo L; Wearn, Oliver; Wells, Konstans; Willig, Michael R; Wood, Eric; Young, Richard P; Bradley, Andrew V; Ewers, Robert M

    2014-01-01

    Habitat fragmentation studies have produced complex results that are challenging to synthesize. Inconsistencies among studies may result from variation in the choice of landscape metrics and response variables, which is often compounded by a lack of key statistical or methodological information. Collating primary datasets on biodiversity responses to fragmentation in a consistent and flexible database permits simple data retrieval for subsequent analyses. We present a relational database that links such field data to taxonomic nomenclature, spatial and temporal plot attributes, and environmental characteristics. Field assessments include measurements of the response(s) (e.g., presence, abundance, ground cover) of one or more species linked to plots in fragments within a partially forested landscape. The database currently holds 9830 unique species recorded in plots of 58 unique landscapes in six of eight realms: mammals 315, birds 1286, herptiles 460, insects 4521, spiders 204, other arthropods 85, gastropods 70, annelids 8, platyhelminthes 4, Onychophora 2, vascular plants 2112, nonvascular plants and lichens 320, and fungi 449. Three landscapes were sampled as long-term time series (>10 years). Seven hundred and eleven species are found in two or more landscapes. Consolidating the substantial amount of primary data available on biodiversity responses to fragmentation in the context of land-use change and natural disturbances is an essential part of understanding the effects of increasing anthropogenic pressures on land. The consistent format of this database facilitates testing of generalizations concerning biologic responses to fragmentation across diverse systems and taxa. It also allows the re-examination of existing datasets with alternative landscape metrics and robust statistical methods, for example, helping to address pseudo-replication problems. The database can thus help researchers in producing broad syntheses of the effects of land use. 
    The database is dynamic and inclusive, and contributions from individual and large-scale data-collection efforts are welcome. PMID:24967073

  16. Development of a website and biobank database for the Nanosized Cancer Polymarker Biochip Project: a Multicenter Italian Experience.

    PubMed

    Leon, Antonette E; Fabricio, Aline S C; Benvegnù, Fabio; Michilin, Silvia; Secco, Annamaria; Spangaro, Omar; Meo, Sabrina; Gion, Massimo

    2011-01-01

    The Nanosized Cancer Polymarker Biochip Project (RBLA03S4SP) funded by an Italian MIUR-FIRB grant (Italian Ministry of University and Research - Investment Funds for Basic Research) has led to the creation of a free-access dynamic website, available at the web address https://serviziweb.ulss12.ve.it/firbabo, and of a centralized database with password-restricted access. The project network is composed of 9 research units (RUs) and has been active since 2005. The aim of the FIRB project was the design, production and validation of optoelectronic and chemoelectronic biosensors for the simultaneous detection of a novel class of cancer biomarkers associated with immunoglobulins of the M class (IgM) for early diagnosis of cancer. Biomarker immune complexes (BM-ICs) were assessed on samples of clinical cases and matched controls for breast, colorectal, liver, ovarian and prostate malignancies. This article describes in detail the architecture of the project website, the central database application, and the biobank developed for the FIRB Nanosized Cancer Polymarker Biochip Project. The article also illustrates many unique aspects that should be considered when developing a database within a multidisciplinary scenario. The main deliverables of the project were numerous, including the development of an online database which archived 1400 case report forms (700 cases and 700 matched controls) and more than 2700 experimental results relative to the BM-ICs assayed. The database also allowed for the traceability and retrieval of 21,000 aliquots archived in the centralized bank and stored as backup in the RUs, and for the development of a centralized biological bank in the coordinating unit with 6300 aliquots of serum. The constitution of the website and biobank database enabled optimal coordination of the RUs involved, highlighting the importance of sharing samples and scientific data in a multicenter setting for the achievement of the project goals.

  17. Atlas - a data warehouse for integrative bioinformatics.

    PubMed

    Shah, Sohrab P; Huang, Yong; Xu, Tao; Yuen, Macaire M S; Ling, John; Ouellette, B F Francis

    2005-02-21

    We present a biological data warehouse called Atlas that locally stores and integrates biological sequences, molecular interactions, homology information, functional annotations of genes, and biological ontologies. The goal of the system is to provide data, as well as a software infrastructure for bioinformatics research and development. The Atlas system is based on relational data models that we developed for each of the source data types. Data stored within these relational models are managed through Structured Query Language (SQL) calls that are implemented in a set of Application Programming Interfaces (APIs). The APIs include three languages: C++, Java, and Perl. The methods in these API libraries are used to construct a set of loader applications, which parse and load the source datasets into the Atlas database, and a set of toolbox applications which facilitate data retrieval. Atlas stores and integrates local instances of GenBank, RefSeq, UniProt, Human Protein Reference Database (HPRD), Biomolecular Interaction Network Database (BIND), Database of Interacting Proteins (DIP), Molecular Interactions Database (MINT), IntAct, NCBI Taxonomy, Gene Ontology (GO), Online Mendelian Inheritance in Man (OMIM), LocusLink, Entrez Gene and HomoloGene. The retrieval APIs and toolbox applications are critical components that offer end-users flexible, easy, integrated access to this data. We present use cases that use Atlas to integrate these sources for genome annotation, inference of molecular interactions across species, and gene-disease associations. The Atlas biological data warehouse serves as data infrastructure for bioinformatics research and development. It forms the backbone of the research activities in our laboratory and facilitates the integration of disparate, heterogeneous biological sources of data enabling new scientific inferences. Atlas achieves integration of diverse data sets at two levels. 
First, Atlas stores data of similar types using common data models, enforcing the relationships between data types. Second, integration is achieved through a combination of APIs, ontology, and tools. The Atlas software is freely available under the GNU General Public License at: http://bioinformatics.ubc.ca/atlas/

  18. Atlas – a data warehouse for integrative bioinformatics

    PubMed Central

    Shah, Sohrab P; Huang, Yong; Xu, Tao; Yuen, Macaire MS; Ling, John; Ouellette, BF Francis

    2005-01-01

    Background: We present a biological data warehouse called Atlas that locally stores and integrates biological sequences, molecular interactions, homology information, functional annotations of genes, and biological ontologies. The goal of the system is to provide data, as well as a software infrastructure for bioinformatics research and development. Description: The Atlas system is based on relational data models that we developed for each of the source data types. Data stored within these relational models are managed through Structured Query Language (SQL) calls that are implemented in a set of Application Programming Interfaces (APIs). The APIs include three languages: C++, Java, and Perl. The methods in these API libraries are used to construct a set of loader applications, which parse and load the source datasets into the Atlas database, and a set of toolbox applications which facilitate data retrieval. Atlas stores and integrates local instances of GenBank, RefSeq, UniProt, Human Protein Reference Database (HPRD), Biomolecular Interaction Network Database (BIND), Database of Interacting Proteins (DIP), Molecular Interactions Database (MINT), IntAct, NCBI Taxonomy, Gene Ontology (GO), Online Mendelian Inheritance in Man (OMIM), LocusLink, Entrez Gene and HomoloGene. The retrieval APIs and toolbox applications are critical components that offer end-users flexible, easy, integrated access to this data. We present use cases that use Atlas to integrate these sources for genome annotation, inference of molecular interactions across species, and gene-disease associations. Conclusion: The Atlas biological data warehouse serves as data infrastructure for bioinformatics research and development. It forms the backbone of the research activities in our laboratory and facilitates the integration of disparate, heterogeneous biological sources of data enabling new scientific inferences. Atlas achieves integration of diverse data sets at two levels.
First, Atlas stores data of similar types using common data models, enforcing the relationships between data types. Second, integration is achieved through a combination of APIs, ontology, and tools. The Atlas software is freely available under the GNU General Public License at: PMID:15723693
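
    The loader/retrieval split described in the Atlas abstract can be illustrated with a minimal sketch. It assumes a toy two-table schema and uses Python's built-in sqlite3 module; the table names, column names, function names, and sample identifiers are all hypothetical and are not the actual Atlas data model or API (which the abstract says is provided in C++, Java, and Perl).

```python
import sqlite3

# Hypothetical two-table schema for illustration; NOT the actual Atlas data model.
SCHEMA = """
CREATE TABLE protein (
    accession TEXT PRIMARY KEY,
    taxon_id  INTEGER NOT NULL
);
CREATE TABLE interaction (
    protein_a TEXT NOT NULL REFERENCES protein(accession),
    protein_b TEXT NOT NULL REFERENCES protein(accession),
    source_db TEXT NOT NULL
);
"""


def open_warehouse():
    """Create an in-memory database with the toy schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # enforce the REFERENCES clauses
    conn.executescript(SCHEMA)
    return conn


def load_proteins(conn, lines):
    """Loader: parse 'accession<TAB>taxon_id' lines and insert them."""
    rows = []
    for line in lines:
        accession, taxon_id = line.rstrip("\n").split("\t")
        rows.append((accession, int(taxon_id)))
    conn.executemany("INSERT INTO protein VALUES (?, ?)", rows)


def load_interactions(conn, lines):
    """Loader: parse 'protein_a<TAB>protein_b<TAB>source_db' lines and insert them."""
    rows = [tuple(line.rstrip("\n").split("\t")) for line in lines]
    conn.executemany("INSERT INTO interaction VALUES (?, ?, ?)", rows)


def partners(conn, accession):
    """Retrieval: sorted (partner, source_db) pairs, whichever side the protein is on."""
    cur = conn.execute(
        """
        SELECT protein_b, source_db FROM interaction WHERE protein_a = ?
        UNION
        SELECT protein_a, source_db FROM interaction WHERE protein_b = ?
        ORDER BY 1, 2
        """,
        (accession, accession),
    )
    return cur.fetchall()


# Made-up identifiers and source labels, for demonstration only.
conn = open_warehouse()
load_proteins(conn, ["P1\t9606", "P2\t9606", "P3\t10090"])
load_interactions(conn, ["P1\tP2\tdemoA", "P3\tP1\tdemoB"])
print(partners(conn, "P1"))
```

    Keeping the SQL inside small loader and retrieval functions mirrors the design the abstract describes, in which callers work through an API rather than writing queries themselves.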

  19. ComplexQuant: high-throughput computational pipeline for the global quantitative analysis of endogenous soluble protein complexes using high resolution protein HPLC and precision label-free LC/MS/MS.

    PubMed

    Wan, Cuihong; Liu, Jian; Fong, Vincent; Lugowski, Andrew; Stoilova, Snejana; Bethune-Waddell, Dylan; Borgeson, Blake; Havugimana, Pierre C; Marcotte, Edward M; Emili, Andrew

    2013-04-09

    The experimental isolation and characterization of stable multi-protein complexes are essential to understanding the molecular systems biology of a cell. To this end, we have developed a high-throughput proteomic platform for the systematic identification of native protein complexes based on extensive fractionation of soluble protein extracts by multi-bed ion exchange high performance liquid chromatography (IEX-HPLC) combined with exhaustive label-free LC/MS/MS shotgun profiling. To support these studies, we have built a companion data analysis software pipeline, termed ComplexQuant. Proteins present in the hundreds of fractions typically collected per experiment are first identified by exhaustively interrogating MS/MS spectra using multiple database search engines within an integrative probabilistic framework, while accounting for possible post-translational modifications. Protein abundance is then measured across the fractions based on normalized total spectral counts and precursor ion intensities using a dedicated tool, PepQuant. This analysis allows co-complex membership to be inferred based on the similarity of extracted protein co-elution profiles. Each computational step has been optimized for processing large-scale biochemical fractionation datasets, and the reliability of the integrated pipeline has been benchmarked extensively. This article is part of a Special Issue entitled: From protein structures to clinical applications.
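
    The idea of inferring co-complex membership from co-elution profile similarity can be sketched as follows. This is an illustration under stated assumptions, not ComplexQuant's actual scoring: Pearson correlation as the similarity measure, the 0.9 threshold, the function names, and the toy spectral counts are all assumptions introduced here.

```python
from math import sqrt


def normalize(profile):
    """Scale per-fraction spectral counts so that the profile sums to 1."""
    total = sum(profile)
    if total == 0:
        raise ValueError("profile has no spectral counts")
    return [x / total for x in profile]


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / sqrt(var_x * var_y)


def coeluting_pairs(profiles, threshold=0.9):
    """Return protein pairs whose normalized profiles correlate at or above threshold."""
    names = sorted(profiles)
    normed = {name: normalize(profiles[name]) for name in names}
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if pearson(normed[a], normed[b]) >= threshold:
                pairs.append((a, b))
    return pairs


# Toy spectral counts across five fractions (made-up numbers).
profiles = {
    "protA": [0, 2, 10, 3, 0],
    "protB": [0, 4, 20, 6, 0],   # same elution shape as protA, twice the abundance
    "protC": [9, 1, 0, 0, 5],    # elutes in different fractions
}
print(coeluting_pairs(profiles))
```

    Normalizing each profile first means that two proteins with the same elution shape but different abundances are still paired, which is the behavior one would want when comparing subunits present at different copy numbers.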

  20. Challenges and Experiences of Building Multidisciplinary Datasets across Cultures

    NASA Astrophysics Data System (ADS)

    Jamiyansharav, K.; Laituri, M.; Fernandez-Gimenez, M.; Fassnacht, S. R.; Venable, N. B. H.; Allegretti, A. M.; Reid, R.; Baival, B.; Jamsranjav, C.; Ulambayar, T.; Linn, S.; Angerer, J.

    2017-12-01

    Efficient data sharing and management are key challenges to multidisciplinary scientific research. These challenges are further complicated by adding a multicultural component. We address the construction of a complex database for social-ecological analysis in Mongolia. Funded by the National Science Foundation (NSF) Dynamics of Coupled Natural and Human (CNH) Systems, the Mongolian Rangelands and Resilience (MOR2) project focuses on the vulnerability of Mongolian pastoral systems to climate change and adaptive capacity. The MOR2 study spans over three years of fieldwork in 36 paired districts (Soum) from 18 provinces (Aimag) of Mongolia that cover steppe, mountain forest steppe, desert steppe and eastern steppe ecological zones. Our project team is composed of hydrologists, social scientists, geographers, and ecologists. The MOR2 database includes multiple ecological, social, meteorological, geospatial and hydrological datasets, as well as archives of original data and surveys in multiple formats. Managing this complex database requires significant organizational skills, attention to detail and the ability to communicate among team members from diverse disciplines and across multiple institutions in the US and Mongolia. We describe the database's rich content, organization, structure and complexity. We discuss lessons learned, best practices and recommendations for complex database management, sharing, and archiving in creating a cross-cultural and multi-disciplinary database.

  1. Using a Semi-Realistic Database to Support a Database Course

    ERIC Educational Resources Information Center

    Yue, Kwok-Bun

    2013-01-01

    A common problem for university relational database courses is to construct effective databases for instructions and assignments. Highly simplified "toy" databases are easily available for teaching, learning, and practicing. However, they do not reflect the complexity and practical considerations that students encounter in real-world…

  2. Rhizoma Dioscoreae extract protects against alveolar bone loss by regulating the cell cycle: A predictive study based on the protein‑protein interaction network.

    PubMed

    Zhang, Zhi-Guo; Song, Chang-Heng; Zhang, Fang-Zhen; Chen, Yan-Jing; Xiang, Li-Hua; Xiao, Gary Guishan; Ju, Da-Hong

    2016-06-01

    Rhizoma Dioscoreae extract (RDE) exhibits a protective effect on alveolar bone loss in ovariectomized (OVX) rats. The aim of this study was to predict the pathways or targets that are regulated by RDE, by re‑assessing our previously reported data and conducting a protein‑protein interaction (PPI) network analysis. In total, 383 differentially expressed genes (≥3‑fold) between alveolar bone samples from the RDE and OVX group rats were identified, and a PPI network was constructed based on these genes. Furthermore, four molecular clusters (A‑D) in the PPI network with the smallest P‑values were detected by the molecular complex detection (MCODE) algorithm. Using the Database for Annotation, Visualization and Integrated Discovery (DAVID) and Ingenuity Pathway Analysis (IPA) tools, two molecular clusters (A and B) were enriched for biological process in Gene Ontology (GO). Only cluster A was associated with biological pathways in the IPA database. GO and pathway analysis results showed that cluster A, associated with cell cycle regulation, was the most important molecular cluster in the PPI network. In addition, cyclin‑dependent kinase 1 (CDK1) may be a key molecule achieving the cell‑cycle‑regulatory function of cluster A. From the PPI network analysis, it was predicted that delayed cell cycle progression in excessive alveolar bone remodeling via downregulation of CDK1 may be another mechanism underlying the anti‑osteopenic effect of RDE on alveolar bone.

  3. A Systematic Bioinformatics Approach to Identify High Quality Mass Spectrometry Data and Functionally Annotate Proteins and Proteomes.

    PubMed

    Islam, Mohammad Tawhidul; Mohamedali, Abidali; Ahn, Seong Beom; Nawar, Ishmam; Baker, Mark S; Ranganathan, Shoba

    2017-01-01

    In the past decade, proteomics and mass spectrometry have taken tremendous strides forward, particularly in the life sciences, spurred on by rapid advances in technology resulting in generation and conglomeration of vast amounts of data. Though this has led to tremendous advancements in biology, the interpretation of the data poses serious challenges for many practitioners due to the immense size and complexity of the data. Furthermore, the lack of annotation means that a potential gold mine of relevant biological information may be hiding within this data. We present here a simple and intuitive workflow for the research community to investigate and mine this data, not only to extract relevant data but also to segregate usable, quality data to develop hypotheses for investigation and validation. We apply an MS evidence workflow for verifying peptides of proteins from one's own data as well as publicly available databases. We then integrate a suite of freely available bioinformatics analysis and annotation software tools to identify homologues and map putative functional signatures, gene ontology and biochemical pathways. We also provide an example of the functional annotation of missing proteins in human chromosome 7 data from the NeXtProt database, where no evidence is available at the proteomic, antibody, or structural levels. We give examples of protocols, tools and detailed flowcharts that can be extended or tailored to interpret and annotate the proteome of any novel organism.

  4. An affinity-structure database of helix-turn-helix: DNA complexes with a universal coordinate system.

    PubMed

    AlQuraishi, Mohammed; Tang, Shengdong; Xia, Xide

    2015-11-19

    Molecular interactions between proteins and DNA molecules underlie many cellular processes, including transcriptional regulation, chromosome replication, and nucleosome positioning. Computational analyses of protein-DNA interactions rely on experimental data characterizing known protein-DNA interactions structurally and biochemically. While many databases exist that contain either structural or biochemical data, few integrate these two data sources in a unified fashion. Such integration is becoming increasingly critical with the rapid growth of structural and biochemical data, and the emergence of algorithms that rely on the synthesis of multiple data types to derive computational models of molecular interactions. We have developed an integrated affinity-structure database in which the experimental and quantitative DNA binding affinities of helix-turn-helix proteins are mapped onto the crystal structures of the corresponding protein-DNA complexes. This database provides access to: (i) protein-DNA structures, (ii) quantitative summaries of protein-DNA binding affinities using position weight matrices, and (iii) raw experimental data of protein-DNA binding instances. Critically, this database establishes a correspondence between experimental structural data and quantitative binding affinity data at the single basepair level. Furthermore, we present a novel alignment algorithm that structurally aligns the protein-DNA complexes in the database and creates a unified residue-level coordinate system for comparing the physico-chemical environments at the interface between complexes. Using this unified coordinate system, we compute the statistics of atomic interactions at the protein-DNA interface of helix-turn-helix proteins. We provide an interactive website for visualization, querying, and analyzing this database, and a downloadable version to facilitate programmatic analysis. 
This database will facilitate the analysis of protein-DNA interactions and the development of programmatic computational methods that capitalize on integration of structural and biochemical datasets. The database can be accessed at http://ProteinDNA.hms.harvard.edu.
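
    Since the database summarizes binding affinities as position weight matrices, a short sketch of how such a matrix can be built and used may help. It shows one generic log-odds construction from aligned binding sites; the pseudocount, the uniform background, the function names, and the example sites are assumptions made here and are not taken from the database or its authors' method.

```python
from math import log2

BASES = "ACGT"


def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds position weight matrix from equal-length aligned DNA sites."""
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    pwm = []
    for pos in range(length):
        # Start every base at the pseudocount so unseen bases get a finite score.
        counts = {base: pseudocount for base in BASES}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        pwm.append({base: log2((counts[base] / total) / background) for base in BASES})
    return pwm


def score(pwm, seq):
    """Sum the per-position log-odds of seq; higher means closer to the site model."""
    if len(seq) != len(pwm):
        raise ValueError("sequence length must match the matrix length")
    return sum(column[base] for column, base in zip(pwm, seq))


# Made-up aligned sites, for demonstration only.
sites = ["TTGACA", "TTGACA", "TTGATA", "TTGACT"]
pwm = build_pwm(sites)
print(score(pwm, "TTGACA"), score(pwm, "CCCCCC"))
```

    Because each matrix column corresponds to one base pair, a per-position score of this kind is what allows affinity data to be lined up with structural data at single-basepair resolution, as the abstract describes.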

  5. The MaizeGDB Genome Browser Tutorial: One example of database outreach to biologists via video

    USDA-ARS's Scientific Manuscript database

    Video tutorials are an effective way for researchers to quickly learn how to use online tools offered by biological databases. At the Maize Genetics and Genomics Database (MaizeGDB), we have developed a number of video tutorials that aim to demonstrate how to use various tools as well as to explici...

  6. A Chado case study: an ontology-based modular schema for representing genome-associated biological information.

    PubMed

    Mungall, Christopher J; Emmert, David B

    2007-07-01

    A few years ago, FlyBase undertook to design a new database schema to store Drosophila data. It would fully integrate genomic sequence and annotation data with bibliographic, genetic, phenotypic and molecular data from the literature representing a distillation of the first 100 years of research on this major animal model system. In developing this new integrated schema, FlyBase also made a commitment to ensure that its design was generic, extensible and available as open source, so that it could be employed as the core schema of any model organism data repository, thereby avoiding redundant software development and potentially increasing interoperability. Our question was whether we could create a relational database schema that would be successfully reused. Chado is a relational database schema now being used to manage biological knowledge for a wide variety of organisms, from human to pathogens, especially the classes of information that directly or indirectly can be associated with genome sequences or the primary RNA and protein products encoded by a genome. Biological databases that conform to this schema can interoperate with one another, and with application software from the Generic Model Organism Database (GMOD) toolkit. Chado is distinctive because its design is driven by ontologies. The use of ontologies (or controlled vocabularies) is ubiquitous across the schema, as they are used as a means of typing entities. The Chado schema is partitioned into integrated subschemas (modules), each encapsulating a different biological domain, and each described using representations in appropriate ontologies. To illustrate this methodology, we describe here the Chado modules used for describing genomic sequences. GMOD is a collaboration of several model organism database groups, including FlyBase, to develop a set of open-source software for managing model organism data. 
The Chado schema is freely distributed under the terms of the Artistic License (http://www.opensource.org/licenses/artistic-license.php) from GMOD (www.gmod.org).
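
    The statement that ontologies are used "as a means of typing entities" can be made concrete with a small sketch. It is loosely modeled on the Chado pattern in which a feature row points at a controlled-vocabulary term, but the two tables below are heavily simplified assumptions rather than the real Chado DDL, and the feature names are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Controlled-vocabulary terms: the ontology side of the schema.
CREATE TABLE cvterm (
    cvterm_id INTEGER PRIMARY KEY,
    name      TEXT NOT NULL UNIQUE
);
-- Every feature is typed by a term instead of living in a per-type table.
CREATE TABLE feature (
    feature_id INTEGER PRIMARY KEY,
    uniquename TEXT NOT NULL,
    type_id    INTEGER NOT NULL REFERENCES cvterm(cvterm_id)
);
""")

conn.executemany(
    "INSERT INTO cvterm (cvterm_id, name) VALUES (?, ?)",
    [(1, "gene"), (2, "mRNA"), (3, "exon")],
)
conn.executemany(
    "INSERT INTO feature (uniquename, type_id) VALUES (?, ?)",
    [("featA", 1), ("featA-RA", 2), ("featA:1", 3), ("featB", 1)],
)


def features_of_type(conn, type_name):
    """Return the names of all features typed by the given vocabulary term."""
    rows = conn.execute(
        """
        SELECT f.uniquename
        FROM feature AS f
        JOIN cvterm AS t ON t.cvterm_id = f.type_id
        WHERE t.name = ?
        ORDER BY f.uniquename
        """,
        (type_name,),
    )
    return [row[0] for row in rows]


print(features_of_type(conn, "gene"))

# Supporting a new kind of entity needs only a new term row, not a schema change.
conn.execute("INSERT INTO cvterm (cvterm_id, name) VALUES (4, 'tRNA')")
conn.execute("INSERT INTO feature (uniquename, type_id) VALUES ('featC', 4)")
print(features_of_type(conn, "tRNA"))
```

    The last three statements show why this design is described as generic and extensible: the set of entity types lives in data, so it can grow without altering table definitions.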

  7. GoldenBraid 2.0: A Comprehensive DNA Assembly Framework for Plant Synthetic Biology

    PubMed Central

    Sarrion-Perdigones, Alejandro; Vazquez-Vilar, Marta; Palací, Jorge; Castelijns, Bas; Forment, Javier; Ziarsolo, Peio; Blanca, José; Granell, Antonio; Orzaez, Diego

    2013-01-01

    Plant synthetic biology aims to apply engineering principles to plant genetic design. One strategic requirement of plant synthetic biology is the adoption of common standardized technologies that facilitate the construction of increasingly complex multigene structures at the DNA level while enabling the exchange of genetic building blocks among plant bioengineers. Here, we describe GoldenBraid 2.0 (GB2.0), a comprehensive technological framework that aims to foster the exchange of standard DNA parts for plant synthetic biology. GB2.0 relies on the use of type IIS restriction enzymes for DNA assembly and proposes a modular cloning schema with positional notation that resembles the grammar of natural languages. Apart from providing an optimized cloning strategy that generates fully exchangeable genetic elements for multigene engineering, the GB2.0 toolkit offers an ever-growing open collection of DNA parts, including a group of functionally tested, premade genetic modules to build frequently used modules like constitutive and inducible expression cassettes, endogenous gene silencing and protein-protein interaction tools, etc. Use of the GB2.0 framework is facilitated by a number of Web resources that include a publicly available database, tutorials, and a software package that provides in silico simulations and laboratory protocols for GB2.0 part domestication and multigene engineering. In short, GB2.0 provides a framework to exchange both information and physical DNA elements among bioengineers to help implement plant synthetic biology projects. PMID:23669743

  8. GoldenBraid 2.0: a comprehensive DNA assembly framework for plant synthetic biology.

    PubMed

    Sarrion-Perdigones, Alejandro; Vazquez-Vilar, Marta; Palací, Jorge; Castelijns, Bas; Forment, Javier; Ziarsolo, Peio; Blanca, José; Granell, Antonio; Orzaez, Diego

    2013-07-01

    Plant synthetic biology aims to apply engineering principles to plant genetic design. One strategic requirement of plant synthetic biology is the adoption of common standardized technologies that facilitate the construction of increasingly complex multigene structures at the DNA level while enabling the exchange of genetic building blocks among plant bioengineers. Here, we describe GoldenBraid 2.0 (GB2.0), a comprehensive technological framework that aims to foster the exchange of standard DNA parts for plant synthetic biology. GB2.0 relies on the use of type IIS restriction enzymes for DNA assembly and proposes a modular cloning schema with positional notation that resembles the grammar of natural languages. Apart from providing an optimized cloning strategy that generates fully exchangeable genetic elements for multigene engineering, the GB2.0 toolkit offers an evergrowing open collection of DNA parts, including a group of functionally tested, premade genetic modules to build frequently used modules like constitutive and inducible expression cassettes, endogenous gene silencing and protein-protein interaction tools, etc. Use of the GB2.0 framework is facilitated by a number of Web resources that include a publicly available database, tutorials, and a software package that provides in silico simulations and laboratory protocols for GB2.0 part domestication and multigene engineering. In short, GB2.0 provides a framework to exchange both information and physical DNA elements among bioengineers to help implement plant synthetic biology projects.
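
    Part domestication for a type IIS assembly system generally involves finding internal recognition sites that would otherwise be cut during assembly. The sketch below scans both strands of a sequence for one such site. The abstract does not name the enzymes, so BsaI (recognition sequence GGTCTC) is used here only as a familiar type IIS example; the function names and the test sequence are hypothetical, and this is not the GB2.0 software.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def find_sites(seq, site="GGTCTC"):
    """Return sorted (position, strand) hits for a recognition site on either strand.

    Positions are 0-based on the forward strand. The default site is BsaI's,
    chosen as an example. A palindromic site would be reported on both strands.
    """
    seq = seq.upper()
    hits = []
    for strand, motif in (("+", site), ("-", reverse_complement(site))):
        start = seq.find(motif)
        while start != -1:
            hits.append((start, strand))
            start = seq.find(motif, start + 1)  # allow overlapping matches
    return sorted(hits)


# Made-up sequence with one site on each strand.
part = "ATGGGTCTCAAAGAGACCTT"
print(find_sites(part))
```

    A part with an empty hit list is free of that site; any hits mark positions where a change to the sequence would be needed before the part can be used with that enzyme.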

  9. PROXiMATE: a database of mutant protein-protein complex thermodynamics and kinetics.

    PubMed

    Jemimah, Sherlyn; Yugandhar, K; Michael Gromiha, M

    2017-09-01

    We have developed PROXiMATE, a database of thermodynamic data for more than 6000 missense mutations in 174 heterodimeric protein-protein complexes, supplemented with interaction network data from STRING database, solvent accessibility, sequence, structural and functional information, experimental conditions and literature information. Additional features include complex structure visualization, search and display options, download options and a provision for users to upload their data. The database is freely available at http://www.iitm.ac.in/bioinfo/PROXiMATE/. The website is implemented in Python, and supports recent versions of major browsers such as IE10, Firefox, Chrome and Opera. Supplementary data are available at Bioinformatics online.

  10. Tutorial videos of bioinformatics resources: online distribution trial in Japan named TogoTV.

    PubMed

    Kawano, Shin; Ono, Hiromasa; Takagi, Toshihisa; Bono, Hidemasa

    2012-03-01

    In recent years, biological web resources such as databases and tools have become more complex because of the enormous amounts of data generated in the field of life sciences. Traditional methods of distributing tutorials include publishing textbooks and posting web documents, but such static content cannot adequately describe recent dynamic web services. Due to improvements in computer technology, it is now possible to create dynamic content such as video with minimal effort and low cost on most modern computers. The ease of creating and distributing video tutorials instead of static content improves accessibility for researchers, annotators and curators. This article focuses on online video repositories for educational and tutorial videos provided by resource developers and users. It also describes a project in Japan named TogoTV (http://togotv.dbcls.jp/en/) and discusses, with examples, the production and distribution of high-quality tutorial videos that would be useful to viewers. This article intends to stimulate and encourage researchers who develop and use databases and tools to distribute how-to videos as a tool to enhance product usability.

  11. PPDB - A tool for investigation of plants physiology based on gene ontology.

    PubMed

    Sharma, Ajay Shiv; Gupta, Hari Om; Prasad, Rajendra

    2014-09-02

    Representing the way forward, from functional genomics and its ontology to functional understanding and physiological model, in a computationally tractable fashion is one of the ongoing challenges faced by computational biology. To address this challenge, we herein feature the application of contemporary database management to the development of PPDB, a searching and browsing tool for the Plants Physiology Database that is based upon the mining of the large amount of gene ontology data currently available. The working principles and search options associated with the PPDB are publicly available and freely accessible on-line ( http://www.iitr.ernet.in/ajayshiv/ ) through a user friendly environment generated by means of Drupal-6.24. Because genes are expressed in temporally and spatially characteristic patterns, and their functionally distinct products often reside in specific cellular compartments and may be part of one or more multi-component complexes, this work is intended to be relevant for investigating the functional relationships of gene products at a system level and thus to help us approach the full physiology.

  12. PPDB: A Tool for Investigation of Plants Physiology Based on Gene Ontology.

    PubMed

    Sharma, Ajay Shiv; Gupta, Hari Om; Prasad, Rajendra

    2015-09-01

    Representing the way forward, from functional genomics and its ontology to functional understanding and physiological model, in a computationally tractable fashion is one of the ongoing challenges faced by computational biology. To address this challenge, we herein feature the application of contemporary database management to the development of PPDB, a searching and browsing tool for the Plants Physiology Database that is based upon the mining of the large amount of gene ontology data currently available. The working principles and search options associated with the PPDB are publicly available and freely accessible online ( http://www.iitr.ac.in/ajayshiv/ ) through a user-friendly environment generated by means of Drupal-6.24. Because genes are expressed in temporally and spatially characteristic patterns, and their functionally distinct products often reside in specific cellular compartments and may be part of one or more multicomponent complexes, this work is intended to be relevant for investigating the functional relationships of gene products at a system level and thus to help us approach the full physiology.

  13. REDIdb 3.0: A Comprehensive Collection of RNA Editing Events in Plant Organellar Genomes.

    PubMed

    Lo Giudice, Claudio; Pesole, Graziano; Picardi, Ernesto

    2018-01-01

    RNA editing is an important epigenetic mechanism by which genome-encoded transcripts are modified by substitutions, insertions and/or deletions. It was first discovered in kinetoplastid protozoa and has since been reported in a wide range of organisms. In plants, RNA editing occurs mostly by cytidine (C) to uridine (U) conversion in translated regions of organelle mRNAs and tends to modify affected codons, restoring evolutionarily conserved amino acid residues. RNA editing has also been described in non-protein coding regions such as group II introns and structural RNAs. Despite its impact on organellar transcriptome and proteome complexity, current primary databases still do not provide a specific field for RNA editing events. To overcome these limitations, we developed REDIdb, a specialized database for RNA editing modifications in plant organelles. Hereafter we describe its third release, containing more than 26,000 events in a completely novel web interface to accommodate RNA editing in its genomics, biological and evolutionary context through whole genome maps and multiple sequence alignments. REDIdb is freely available at http://srv00.recas.ba.infn.it/redidb/index.html.

  14. SAMSA2: a standalone metatranscriptome analysis pipeline.

    PubMed

    Westreich, Samuel T; Treiber, Michelle L; Mills, David A; Korf, Ian; Lemay, Danielle G

    2018-05-21

    Complex microbial communities are an area of growing interest in biology. Metatranscriptomics allows researchers to quantify microbial gene expression in an environmental sample via high-throughput sequencing. Metatranscriptomic experiments are computationally intensive because the experiments generate a large volume of sequence data and each sequence must be compared with reference sequences from thousands of organisms. SAMSA2 is an upgrade to the original Simple Annotation of Metatranscriptomes by Sequence Analysis (SAMSA) pipeline that has been redesigned for standalone use on a supercomputing cluster. SAMSA2 is faster due to the use of the DIAMOND aligner, and more flexible and reproducible because it uses local databases. SAMSA2 is available with detailed documentation, and example input and output files along with examples of master scripts for full pipeline execution. SAMSA2 is a rapid and efficient metatranscriptome pipeline for analyzing large RNA-seq datasets in a supercomputing cluster environment. SAMSA2 provides simplified output that can be examined directly or used for further analyses, and its reference databases may be upgraded, altered or customized to fit the needs of any experiment.

  15. Tutorial videos of bioinformatics resources: online distribution trial in Japan named TogoTV

    PubMed Central

    Kawano, Shin; Ono, Hiromasa; Takagi, Toshihisa

    2012-01-01

    In recent years, biological web resources such as databases and tools have become more complex because of the enormous amounts of data generated in the field of life sciences. Traditional methods of distributing tutorials include publishing textbooks and posting web documents, but such static content cannot adequately describe recent dynamic web services. Due to improvements in computer technology, it is now possible to create dynamic content such as video with minimal effort and low cost on most modern computers. The ease of creating and distributing video tutorials instead of static content improves accessibility for researchers, annotators and curators. This article focuses on online video repositories for educational and tutorial videos provided by resource developers and users. It also describes a project in Japan named TogoTV (http://togotv.dbcls.jp/en/) and discusses, with examples, the production and distribution of high-quality tutorial videos that would be useful to viewers. This article intends to stimulate and encourage researchers who develop and use databases and tools to distribute how-to videos as a tool to enhance product usability. PMID:21803786

  16. BioPepDB: an integrated data platform for food-derived bioactive peptides.

    PubMed

    Li, Qilin; Zhang, Chao; Chen, Hongjun; Xue, Jitong; Guo, Xiaolei; Liang, Ming; Chen, Ming

    2018-03-12

    Food-derived bioactive peptides play critical roles in regulating most biological processes and have considerable biological, medical and industrial importance. However, the large amount of data on active peptides, including sequence, function, source, commercial product information, references and other information, is poorly integrated. BioPepDB is a searchable database of food-derived bioactive peptides and their related articles, including more than four thousand bioactive peptide entries. Moreover, BioPepDB provides modules of prediction and hydrolysis-simulation for discovering novel peptides. It can serve as a reference database to investigate the function of different bioactive peptides. BioPepDB is available at http://bis.zju.edu.cn/biopepdbr/. The web page utilises Apache, PHP5 and MySQL to provide the user interface for accessing the database and predicting novel peptides. The database itself is operated on a specialised server.

  17. 78 FR 58311 - Complex Issues in Developing Drug and Biological Products for Rare Diseases; Public Workshop...

    Federal Register 2010, 2011, 2012, 2013, 2014

    2013-09-23

    ...] Complex Issues in Developing Drug and Biological Products for Rare Diseases; Public Workshop; Request for... Issues in Developing Drug and Biological Products for Rare Diseases.'' The purpose of the public workshop is twofold: To discuss complex issues in clinical trials for developing drug and biological products...

  18. Technical aspects of gel-based proteomics designed for elucidating an aryl hydrocarbon receptor complex.

    PubMed

    Wada, Yoshinao; Nakano, Norihiko

    2004-01-01

    The identification of proteins by mass spectrometry has revolutionized the basic method of identifying proteins constituting an intracellular unit or network for certain biological functions. The gel-based strategy following immunoprecipitation was applied to elucidating proteins associated with the aryl hydrocarbon receptor (AhR). Two hundred femtomoles of AhR was recovered from approximately 2 x 10(7) HepG2 cells by immunoprecipitation and was sufficient for identification by peptide mass fingerprinting. Possible candidates for the AhR-associated proteins were also identified. Improvements of the current strategy to increase the overall sensitivity tenfold are required to clarify the AhR complex in full detail. For example, a combination of trypsin and Achromobacter protease I for in-gel digestion allows the number of missed cleavage sites to be set at zero for database searching, thereby reducing random matches and facilitating identification. There is also room for improvement in each step of sample preparation prior to mass spectrometry.
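
    The remark about setting missed cleavage sites to zero can be illustrated with a small in silico digestion sketch. It assumes the commonly used trypsin rule (cleave after K or R, but not when the next residue is P) and shows how the number of candidate peptides a search must consider grows when missed cleavages are allowed. The function names and the example sequence are hypothetical, and this is not the authors' search procedure.

```python
def cleavage_points(protein, residues="KR", block_before="P"):
    """Return peptide boundaries: 0, each cut position, and len(protein).

    A cut is placed after any residue in `residues` unless the following
    residue is in `block_before` (a common convention for trypsin).
    """
    points = [0]
    for i, aa in enumerate(protein[:-1]):
        if aa in residues and protein[i + 1] not in block_before:
            points.append(i + 1)
    points.append(len(protein))
    return points


def digest(protein, missed_cleavages=0, residues="KR", block_before="P"):
    """List peptides, allowing up to `missed_cleavages` uncut sites per peptide."""
    if not protein:
        return []
    points = cleavage_points(protein, residues, block_before)
    peptides = []
    for i in range(len(points) - 1):
        last = min(i + 2 + missed_cleavages, len(points))
        for j in range(i + 1, last):
            peptides.append(protein[points[i]:points[j]])
    return peptides


# Made-up sequence: cuts after K4 and R8; the K before P is not cut.
protein = "MAGKTLLRSEEKPVDR"
print(digest(protein))                      # complete digestion
print(len(digest(protein, missed_cleavages=1)))
```

    With zero missed cleavages each protein contributes only its fully cleaved peptides, so fewer theoretical masses are compared against the spectrum, which is consistent with the abstract's point about reducing random matches.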

  19. Database of the Geologic Map of North America - Adapted from the Map by J.C. Reed, Jr. and others (2005)

    USGS Publications Warehouse

    Garrity, Christopher P.; Soller, David R.

    2009-01-01

    The Geological Society of America's (GSA) Geologic Map of North America (Reed and others, 2005; 1:5,000,000) shows the geology of a significantly large area of the Earth, centered on North and Central America and including the submarine geology of parts of the Atlantic and Pacific Oceans. This map is now converted to a Geographic Information System (GIS) database that contains all geologic and base-map information shown on the two printed map sheets and the accompanying explanation sheet. We anticipate this map database will be revised at some unspecified time in the future, likely through the actions of a steering committee managed by the Geological Society of America (GSA) and staffed by scientists from agencies including, but not limited to, those responsible for the original map compilation (U.S. Geological Survey, Geological Survey of Canada, and Woods Hole Oceanographic Institute). Regarding the use of this product, as noted by the map's compilers: 'The Geologic Map of North America is an essential educational tool for teaching the geology of North America to university students and for the continuing education of professional geologists in North America and elsewhere. In addition, simplified maps derived from the Geologic Map of North America are useful for enlightening younger students and the general public about the geology of the continent.' With publication of this database, the preparation of any type of simplified map is made significantly easier. More important perhaps, the database provides a more accessible means to explore the map information and to compare and analyze it in conjunction with other types of information (for example, land use, soils, biology) to better understand the complex interrelations among factors that affect Earth resources, hazards, ecosystems, and climate.

  20. Cornelia de Lange syndrome and molecular implications of the cohesin complex: Abstracts from the 7th biennial scientific and educational symposium 2016

    PubMed Central

    Kline, Antonie D.; Krantz, Ian D.; Deardorff, Matthew A.; Shirahige, Katsuhiko; Dorsett, Dale; Gerton, Jennifer L.; Wu, Meng; Mehta, Devanshi; Mills, Jason A.; Carrico, Cheri S.; Noon, Sarah; Herrera, Pamela S.; Horsfield, Julia A.; Bettale, Chiara; Morgan, Jeremy; Huisman, Sylvia A.; Moss, Jo; McCleery, Joseph; Grados, Marco; Hansen, Blake D.; Srivastava, Siddharth; Taylor-Snell, Emily; Kerr, Lynne M.; Katz, Olivia; Calof, Anne L.; Musio, Antonio; Egense, Alena; Haaland, Richard E.

    2017-01-01

    Cornelia de Lange Syndrome (CdLS) is due to mutations in the genes for the structural and regulatory proteins that make up the cohesin complex, and is considered a cohesinopathy disorder or, more recently, a transcriptomopathy. New phenotypes have been recognized in this expanding field. There are multiple clinical issues facing individuals with all forms of CdLS, particularly in the neurodevelopmental system, but also gastrointestinal, cardiac, and musculoskeletal. Aspects of developmental and cell biology have found common endpoints in the biology of the cohesin complex, with improved understanding of the mechanisms, easier diagnostic tests, and the possibility of potential therapeutics, all major clinical implications for the individual with CdLS. The following abstracts are the presentations from the 7th Cornelia de Lange Syndrome Scientific and Educational Symposium, June 22–23, 2016, in Orlando, FL, in conjunction with the Cornelia de Lange Syndrome Foundation National Meeting. In addition to the scientific and clinical discussions, there were talks related to practical aspects of behavior including autism, transitions, communication, access to medical care, and databases. At the end of the symposium, a panel was held, which included several parents, affected individuals and genetic counselors, and discussed the greatest challenges in life and how this information can assist in guiding future research. The Research Committee of the CdLS Foundation organizes this meeting, reviews, and accepts abstracts, and subsequently disseminates the information to the families through members of the Clinical Advisory Board and publications. AMA CME credits were provided by Greater Baltimore Medical Center, Baltimore, MD. PMID:28190301

  1. SABRE: a method for assessing the stability of gene modules in complex tissues and subject populations.

    PubMed

    Shannon, Casey P; Chen, Virginia; Takhar, Mandeep; Hollander, Zsuzsanna; Balshaw, Robert; McManus, Bruce M; Tebbutt, Scott J; Sin, Don D; Ng, Raymond T

    2016-11-14

    Gene network inference (GNI) algorithms can be used to identify sets of coordinately expressed genes, termed network modules, from whole transcriptome gene expression data. The identification of such modules has become a popular approach to systems biology, with important applications in translational research. Although diverse computational and statistical approaches have been devised to identify such modules, their performance behavior is still not fully understood, particularly in complex human tissues. Given human heterogeneity, one important question is how sensitive the outputs of these computational methods are to the input sample set, that is, their stability. A related question is how this sensitivity depends on the size of the sample set. We describe here the SABRE (Similarity Across Bootstrap RE-sampling) procedure for assessing the stability of gene network modules using a re-sampling strategy, introduce a novel criterion for identifying stable modules, and demonstrate the utility of this approach in a clinically-relevant cohort, using two different gene network module discovery algorithms. The stability of modules increased as sample size increased and stable modules were more likely to be replicated in larger sets of samples. Random modules derived from permuted gene expression data were consistently unstable, as assessed by SABRE, and provide a useful baseline value for our proposed stability criterion. Gene module sets identified by different algorithms varied with respect to their stability, as assessed by SABRE. Finally, stable modules were more readily annotated in various curated gene set databases. The SABRE procedure and proposed stability criterion may provide guidance when designing systems biology studies in complex human disease and tissues.
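
    The re-sampling idea in this abstract can be illustrated with a minimal sketch. It is not the published SABRE procedure: the abstract does not name the similarity measure or the stability criterion, so the Jaccard index, the best-match averaging, and every function name below (`jaccard`, `bootstrap_sample`, `module_stability`, `find_modules`) are assumptions chosen for illustration only.

```python
import random


def jaccard(a, b):
    """Jaccard similarity between two gene sets (assumed similarity measure)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def bootstrap_sample(samples, rng):
    """Draw len(samples) items from samples with replacement."""
    return [rng.choice(samples) for _ in samples]


def module_stability(reference_modules, find_modules, samples, n_boot=100, seed=0):
    """Average, over bootstrap resamples, the best Jaccard match of each
    reference module among the modules rediscovered in the resample.

    find_modules is a hypothetical callable: it takes a list of samples
    and returns a list of gene sets (one set per module).
    """
    rng = random.Random(seed)
    totals = [0.0] * len(reference_modules)
    for _ in range(n_boot):
        boot_modules = find_modules(bootstrap_sample(samples, rng))
        for i, ref in enumerate(reference_modules):
            totals[i] += max((jaccard(ref, m) for m in boot_modules), default=0.0)
    return [t / n_boot for t in totals]
```

    Under these assumptions, a module that is recovered intact in every resample scores 1.0, while one that is only partly recovered scores lower; modules built from permuted data would give a baseline for comparison, as the abstract describes.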

  2. A Chemoinformatics Approach to the Discovery of Lead-Like Molecules from Marine and Microbial Sources En Route to Antitumor and Antibiotic Drugs

    PubMed Central

    Pereira, Florbela; Latino, Diogo A. R. S.; Gaudêncio, Susana P.

    2014-01-01

    The comprehensive information on small molecules and their biological activities in the PubChem database allows chemoinformatic researchers to access and make use of large-scale biological activity data to improve the precision of drug profiling. A Quantitative Structure–Activity Relationship approach, for classification, was used for the prediction of active/inactive compounds relative to overall biological activity, antitumor and antibiotic activities using a data set of 1804 compounds from PubChem. Using the best classification models for antibiotic and antitumor activities, a data set of marine and microbial natural products from the AntiMarin database was screened; 57 and 16 new lead compounds for antibiotic and antitumor drug design were proposed, respectively. All compounds proposed by our approach are classified as non-antibiotic and non-antitumor compounds in the AntiMarin database. Recently several of the lead-like compounds proposed by us were reported as being active in the literature. PMID:24473174

  3. Saccharomyces genome database informs human biology.

    PubMed

    Skrzypek, Marek S; Nash, Robert S; Wong, Edith D; MacPherson, Kevin A; Hellerstedt, Sage T; Engel, Stacia R; Karra, Kalpana; Weng, Shuai; Sheppard, Travis K; Binkley, Gail; Simison, Matt; Miyasato, Stuart R; Cherry, J Michael

    2018-01-04

    The Saccharomyces Genome Database (SGD; http://www.yeastgenome.org) is an expertly curated database of literature-derived functional information for the model organism budding yeast, Saccharomyces cerevisiae. SGD constantly strives to synergize new types of experimental data and bioinformatics predictions with existing data, and to organize them into a comprehensive and up-to-date information resource. The primary mission of SGD is to facilitate research into the biology of yeast and to provide this wealth of information to advance, in many ways, research on other organisms, even those as evolutionarily distant as humans. To build such a bridge between biological kingdoms, SGD is curating data regarding yeast-human complementation, in which a human gene can successfully replace the function of a yeast gene, and/or vice versa. These data are manually curated from published literature, made available for download, and incorporated into a variety of analysis tools provided by SGD. © The Author(s) 2017. Published by Oxford University Press on behalf of Nucleic Acids Research.

  4. CHOmine: an integrated data warehouse for CHO systems biology and modeling.

    PubMed

    Gerstl, Matthias P; Hanscho, Michael; Ruckerbauer, David E; Zanghellini, Jürgen; Borth, Nicole

    2017-01-01

    The last decade has seen a surge in published genome-scale information for Chinese hamster ovary (CHO) cells, which are the main production vehicles for therapeutic proteins. While a single access point is available at www.CHOgenome.org, the primary data is distributed over several databases at different institutions. Currently, research is frequently hampered by a plethora of gene names and IDs that vary between published draft genomes and databases, making systems biology analyses cumbersome and elaborate. Here we present CHOmine, an integrative data warehouse connecting data from various databases and linking to other ones. Furthermore, we introduce CHOmodel, a web-based resource that provides access to recently published CHO cell line specific metabolic reconstructions. Both resources allow users to query CHO-relevant data and find interconnections between different types of data, and thus provide a simple, standardized entry point to the world of CHO systems biology. http://www.chogenome.org. © The Author(s) 2017. Published by Oxford University Press.

  5. The Longhorn Array Database (LAD): An Open-Source, MIAME compliant implementation of the Stanford Microarray Database (SMD)

    PubMed Central

    Killion, Patrick J; Sherlock, Gavin; Iyer, Vishwanath R

    2003-01-01

    Background The power of microarray analysis can be realized only if data is systematically archived and linked to biological annotations as well as analysis algorithms. Description The Longhorn Array Database (LAD) is a MIAME compliant microarray database that operates on PostgreSQL and Linux. It is a fully open source version of the Stanford Microarray Database (SMD), one of the largest microarray databases. LAD is available at Conclusions Our development of LAD provides a simple, free, open, reliable and proven solution for storage and analysis of two-color microarray data. PMID:12930545

  6. The art of curation at a biological database: principles and application

    USDA-ARS's Scientific Manuscript database

    The variety and quantity of data being produced by biological research has grown dramatically in recent years, resulting in an expansion of our understanding of biological systems. However, this abundance of data has brought new challenges, especially in curation. The role of biocurators is in part ...

  7. Bioinformatics: A History of Evolution "In Silico"

    ERIC Educational Resources Information Center

    Ondrej, Vladan; Dvorak, Petr

    2012-01-01

    Bioinformatics, biological databases, and the worldwide use of computers have accelerated biological research in many fields, such as evolutionary biology. Here, we describe a primer of nucleotide sequence management and the construction of a phylogenetic tree with two examples; the two selected are from completely different groups of organisms:…

  8. Human Chromosome Y and Haplogroups; introducing YDHS Database.

    PubMed

    Tiirikka, Timo; Moilanen, Jukka S

    2015-12-01

    As high-throughput sequencing efforts generate more biological information, scientists from different disciplines are interpreting the polymorphisms that make us unique. In addition, there is an increasing trend among the general public to research their own genealogy, find distant relatives, and learn more about their biological background. Commercial vendors are providing analyses of mitochondrial and Y-chromosomal markers for such purposes. Clearly, an easy-to-use free interface to the existing data on the identified variants would be in the interest of the general public and professionals less familiar with the field. Here we introduce a novel metadatabase, YDHS, that aims to provide such an interface for Y-chromosomal DNA (Y-DNA) haplogroups and sequence variants. The database uses the ISOGG Y-DNA tree as the source of mutations and haplogroups and, by using the genomic positions of the mutations, links them to genes and other biological entities. YDHS contains analysis tools for deeper Y-SNP analysis. YDHS addresses the shortage of Y-DNA related databases. We have tested our database using a set of different cases from the literature ranging from infertility to autism. The database is at http://www.semanticgen.net/ydhs. Y-chromosomal DNA (Y-DNA) haplogroups and sequence variants have not been in the scientific limelight, excluding certain specialized fields like forensics, mainly because there is not much freely available information or it is scattered in different sources. However, as we have demonstrated, Y-SNPs do play a role in various cases on the haplogroup level and it is possible to create a free Y-DNA dedicated bioinformatics resource.

  9. National Center for Biotechnology Information Celebrates 25th Anniversary | NIH MedlinePlus the Magazine

    MedlinePlus

    ... is a national and international resource for molecular biology information. It creates public databases, conducts research in computational biology, develops software tools for analyzing genome data, and ...

  10. Modeling complex metabolic reactions, ecological systems, and financial and legal networks with MIANN models based on Markov-Wiener node descriptors.

    PubMed

    Duardo-Sánchez, Aliuska; Munteanu, Cristian R; Riera-Fernández, Pablo; López-Díaz, Antonio; Pazos, Alejandro; González-Díaz, Humberto

    2014-01-27

    The use of numerical parameters in Complex Network analysis is expanding to new fields of application. At a molecular level, we can use them to describe the molecular structure of chemical entities, protein interactions, or metabolic networks. However, the applications are not restricted to the world of molecules and can be extended to the study of macroscopic nonliving systems, organisms, or even legal or social networks. On the other hand, the development of the field of Artificial Intelligence has led to the formulation of computational algorithms whose design is based on the structure and functioning of networks of biological neurons. These algorithms, called Artificial Neural Networks (ANNs), can be useful for the study of complex networks, since the numerical parameters that encode information of the network (for example centralities/node descriptors) can be used as inputs for the ANNs. The Wiener index (W) is a graph invariant widely used in chemoinformatics to quantify the molecular structure of drugs and to study complex networks. In this work, we explore for the first time the possibility of using Markov chains to calculate analogues of node distance numbers/W to describe complex networks from the point of view of their nodes. These parameters are called Markov-Wiener node descriptors of order k(th) (W(k)). Please note that these descriptors are not related to Markov-Wiener stochastic processes. Here, we calculated the W(k)(i) values for a very high number of nodes (>100,000) in more than 100 different complex networks using the software MI-NODES. These networks were grouped according to the field of application. Molecular networks include the Metabolic Reaction Networks (MRNs) of 40 different organisms. In addition, we analyzed other biological and legal and social networks. These include the Interaction Web Database Biological Networks (IWDBNs), with 75 food webs or ecological systems and the Spanish Financial Law Network (SFLN). The calculated W(k)(i) values were used as inputs for different ANNs in order to discriminate correct node connectivity patterns from incorrect random patterns. The MIANN models obtained present good values of Sensitivity/Specificity (%): MRNs (78/78), IWDBNs (90/88), and SFLN (86/84). These preliminary results are very promising from the point of view of a first exploratory study and suggest that the use of these models could be extended to the high-throughput re-evaluation of connectivity in known complex networks (collation).
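
    For readers unfamiliar with the classical Wiener index mentioned above, the following sketch computes it as the sum of shortest-path distances over all unordered pairs of nodes. It covers only the classical index, not the Markov-Wiener node descriptors W(k) introduced in the paper, and it assumes a connected, undirected, unweighted graph given as an adjacency dictionary (the function names and the example graph are illustrative).

```python
from collections import deque


def bfs_distances(adj, source):
    """Shortest-path distance (in edges) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def wiener_index(adj):
    """Sum of shortest-path distances over all unordered node pairs."""
    total = sum(sum(bfs_distances(adj, u).values()) for u in adj)
    return total // 2  # every pair was counted once from each endpoint


# Path graph a-b-c-d, i.e. the carbon skeleton of n-butane.
butane = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
print(wiener_index(butane))  # prints 10
```

    The per-node sums returned by `bfs_distances` are the classical node distance numbers; the paper's descriptors replace these with Markov-chain analogues.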

  11. cPath: open source software for collecting, storing, and querying biological pathways.

    PubMed

    Cerami, Ethan G; Bader, Gary D; Gross, Benjamin E; Sander, Chris

    2006-11-13

    Biological pathways, including metabolic pathways, protein interaction networks, signal transduction pathways, and gene regulatory networks, are currently represented in over 220 diverse databases. These data are crucial for the study of specific biological processes, including human diseases. Standard exchange formats for pathway information, such as BioPAX, CellML, SBML and PSI-MI, enable convenient collection of this data for biological research, but mechanisms for common storage and communication are required. We have developed cPath, an open source database and web application for collecting, storing, and querying biological pathway data. cPath makes it easy to aggregate custom pathway data sets available in standard exchange formats from multiple databases, present pathway data to biologists via a customizable web interface, and export pathway data via a web service to third-party software, such as Cytoscape, for visualization and analysis. cPath is software only, and does not include new pathway information. Key features include: a built-in identifier mapping service for linking identical interactors and linking to external resources; built-in support for PSI-MI and BioPAX standard pathway exchange formats; a web service interface for searching and retrieving pathway data sets; and thorough documentation. The cPath software is freely available under the LGPL open source license for academic and commercial use. cPath is a robust, scalable, modular, professional-grade software platform for collecting, storing, and querying biological pathways. It can serve as the core data handling component in information systems for pathway visualization, analysis and modeling.

  12. Clustering of 3D-Structure Similarity Based Network of Secondary Metabolites Reveals Their Relationships with Biological Activities.

    PubMed

    Ohtana, Yuki; Abdullah, Azian Azamimi; Altaf-Ul-Amin, Md; Huang, Ming; Ono, Naoaki; Sato, Tetsuo; Sugiura, Tadao; Horai, Hisayuki; Nakamura, Yukiko; Morita Hirai, Aki; Lange, Klaus W; Kibinge, Nelson K; Katsuragi, Tetsuo; Shirai, Tsuyoshi; Kanaya, Shigehiko

    2014-12-01

    Developing database systems connecting diverse species based on omics is the most important theme in big data biology. To attain this purpose, we have developed KNApSAcK Family Databases, which are used in a number of studies in metabolomics. In the present study, we have developed a network-based approach to analyze relationships between 3D structure and biological activity of metabolites consisting of four steps as follows: construction of a network of metabolites based on structural similarity (Step 1), classification of metabolites into structure groups (Step 2), assessment of statistically significant relations between structure groups and biological activities (Step 3), and 2-dimensional clustering of the constructed data matrix based on statistically significant relations between structure groups and biological activities (Step 4). Applying this method to a data set consisting of 2072 secondary metabolites and 140 biological activities reported in KNApSAcK Metabolite Activity DB, we obtained 983 statistically significant structure group-biological activity pairs. As a whole, we systematically analyzed the relationship between 3D-chemical structures of metabolites and biological activities. © 2014 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.

  13. SPECIES DATABASES AND THE BIOINFORMATICS REVOLUTION.

    EPA Science Inventory

    Biological databases are having a growth spurt. Much of this results from research in genetics and biodiversity, coupled with fast-paced developments in information technology. The revolution in bioinformatics, defined by Sugden and Pennisi (2000) as the "tools and techniques for...

  14. Current databases on biological variation: pros, cons and progress.

    PubMed

    Ricós, C; Alvarez, V; Cava, F; García-Lario, J V; Hernández, A; Jiménez, C V; Minchinela, J; Perich, C; Simón, M

    1999-11-01

    A database with reliable information to derive definitive analytical quality specifications for a large number of clinical laboratory tests was prepared in this work. This was achieved by comparing and correlating descriptive data and relevant observations with the biological variation information, an approach that had not been used in the previous efforts of this type. The material compiled in the database was obtained from published articles referenced in BIOS, CURRENT CONTENTS, EMBASE and MEDLINE using "biological variation & laboratory medicine" as key words, as well as books and doctoral theses provided by their authors. The database covers 316 quantities and reviews 191 articles, fewer than 10 of which had to be rejected. The within- and between-subject coefficients of variation and the subsequent desirable quality specifications for precision, bias and total error for all the quantities accepted are presented. Sex-related stratification of results was justified for only four quantities and, in these cases, quality specifications were derived from the group with lower within-subject variation. For certain quantities, biological variation in pathological states was higher than in the healthy state. In these cases, quality specifications were derived only from the healthy population (most stringent). Several quantities (particularly hormones) have been treated in very few articles and the results found are highly discrepant. Therefore, professionals in laboratory medicine should be strongly encouraged to study the quantities for which results are discrepant, the 90 quantities described in only one paper and the numerous quantities that have not been the subject of study.
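
    As an illustration of how quality specifications can be derived from within- and between-subject coefficients of variation, the sketch below applies the formulas commonly used in laboratory medicine for "desirable" performance (imprecision at most half the within-subject CV, bias at most a quarter of the combined CV, total error as 1.65 times imprecision plus bias). The abstract does not spell out its formulas, so treating these as the ones behind the database is an assumption, and the input values are hypothetical.

```python
import math


def desirable_specs(cv_within, cv_between):
    """Return (imprecision, bias, total_error) limits in percent.

    Assumed formulas (commonly cited desirable specifications):
      imprecision <= 0.5 * CVw
      bias        <= 0.25 * sqrt(CVw^2 + CVb^2)
      total error <= 1.65 * imprecision + bias
    """
    imprecision = 0.5 * cv_within
    bias = 0.25 * math.sqrt(cv_within ** 2 + cv_between ** 2)
    total_error = 1.65 * imprecision + bias
    return imprecision, bias, total_error


# Hypothetical analyte with CVw = 6.0 % and CVb = 8.0 %.
imprecision, bias, total_error = desirable_specs(6.0, 8.0)
print(f"imprecision <= {imprecision:.2f} %, bias <= {bias:.2f} %, "
      f"total error <= {total_error:.2f} %")
```

    Because all three limits scale with the within-subject CV, deriving them from the group with the lower within-subject variation, as the abstract describes for sex-stratified quantities, yields the more stringent specification.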

  15. AIM: a comprehensive Arabidopsis interactome module database and related interologs in plants.

    PubMed

    Wang, Yi; Thilmony, Roger; Zhao, Yunjun; Chen, Guoping; Gu, Yong Q

    2014-01-01

    Systems biology analysis of protein modules is important for understanding the functional relationships between proteins in the interactome. Here, we present a comprehensive database named AIM for Arabidopsis (Arabidopsis thaliana) interactome modules. The database contains almost 250,000 modules that were generated using multiple analysis methods and integration of microarray expression data. All the modules in AIM are well annotated using multiple gene function knowledge databases. AIM provides a user-friendly interface for different types of searches and offers a powerful graphical viewer for displaying module networks linked to the enrichment annotation terms. Both interactive Venn diagram and power graph viewer are integrated into the database for easy comparison of modules. In addition, predicted interologs from other plant species (homologous proteins from different species that share a conserved interaction module) are available for each Arabidopsis module. AIM is a powerful systems biology platform for obtaining valuable insights into the function of proteins in Arabidopsis and other plants using the modules of the Arabidopsis interactome. Database URL:http://probes.pw.usda.gov/AIM Published by Oxford University Press 2014. This work is written by US Government employees and is in the public domain in the US.

  16. The integrated web service and genome database for agricultural plants with biotechnology information.

    PubMed

    Kim, Changkug; Park, Dongsuk; Seol, Youngjoo; Hahn, Jangho

    2011-01-01

    The National Agricultural Biotechnology Information Center (NABIC) constructed an agricultural biology-based infrastructure and developed a Web based relational database for agricultural plants with biotechnology information. The NABIC has concentrated on functional genomics of major agricultural plants, building an integrated biotechnology database for agro-biotech information that focuses on genomics of major agricultural resources. This genome database provides annotated genome information from 1,039,823 records mapped to rice, Arabidopsis, and Chinese cabbage.

  17. Deep Question Answering for protein annotation

    PubMed Central

    Gobeill, Julien; Gaudinat, Arnaud; Pasche, Emilie; Vishnyakova, Dina; Gaudet, Pascale; Bairoch, Amos; Ruch, Patrick

    2015-01-01

    Biomedical professionals have access to a huge amount of literature, but when they use a search engine, they often have to deal with too many documents to efficiently find the appropriate information in a reasonable time. In this perspective, question-answering (QA) engines are designed to display answers, which were automatically extracted from the retrieved documents. Standard QA engines in literature process a user question, then retrieve relevant documents and finally extract some possible answers out of these documents using various named-entity recognition processes. In our study, we try to answer complex genomics questions, which can be adequately answered only using Gene Ontology (GO) concepts. Such complex answers cannot be found using state-of-the-art dictionary- and redundancy-based QA engines. We compare the effectiveness of two dictionary-based classifiers for extracting correct GO answers from a large set of 100 retrieved abstracts per question. In the same way, we also investigate the power of GOCat, a GO supervised classifier. GOCat exploits the GOA database to propose GO concepts that were annotated by curators for similar abstracts. This approach is called deep QA, as it adds an original classification step, and exploits curated biological data to infer answers, which are not explicitly mentioned in the retrieved documents. We show that for complex answers such as protein functional descriptions, the redundancy phenomenon has a limited effect. Similarly usual dictionary-based approaches are relatively ineffective. In contrast, we demonstrate how existing curated data, beyond information extraction, can be exploited by a supervised classifier, such as GOCat, to massively improve both the quantity and the quality of the answers with a +100% improvement for both recall and precision. Database URL: http://eagl.unige.ch/DeepQA4PA/ PMID:26384372

  18. Genome-wide copy number variant analysis for congenital ventricular septal defects in Chinese Han population.

    PubMed

    An, Yu; Duan, Wenyuan; Huang, Guoying; Chen, Xiaoli; Li, Li; Nie, Chenxia; Hou, Jia; Gui, Yonghao; Wu, Yiming; Zhang, Feng; Shen, Yiping; Wu, Bailin; Wang, Hongyan

    2016-01-08

    Ventricular septal defects (VSDs) constitute the most prevalent congenital heart disease (CHD) and occur either in isolation (isolated VSD) or in combination with other cardiac defects (complex VSD). Copy number variation (CNV) has been highlighted as a possible contributing factor to the etiology of many congenital diseases. However, little is known concerning the involvement of CNVs in either isolated or complex VSDs. We analyzed 154 unrelated Chinese individuals with VSD by chromosomal microarray analysis. The subjects were recruited from four hospitals across China. Each case underwent clinical assessment to define the type of VSD, either isolated or complex VSD. CNVs detected were categorized into syndrome-related CNVs, recurrent CNVs and rare CNVs. Genes encompassed by the CNVs were analyzed using enrichment and pathway analysis. Among 154 probands, we identified 29 rare CNVs in 26 VSD patients (16.9%, 26/154) and 8 syndrome-related CNVs in 8 VSD patients (5.2%, 8/154). Twelve of the 29 detected rare CNVs (41.3%) were recurrently reported in the DECIPHER or ISCA database as associated with either VSD or general heart disease. Fifteen genes (5%, 15/285) within CNVs were associated with a broad spectrum of complicated CHD. Among these 15 genes, 7 genes were in "abnormal interventricular septum morphology" derived from the MGI (mouse genome informatics) database, and nine genes were associated with cardiovascular system development (GO:0072538). We also found that these VSD-related candidate genes are enriched in chromatin binding and transcription regulation, which are the biological processes underlying heart development. Our study demonstrates the potential clinical diagnostic utility of genomic imbalance profiling in VSD patients. Additionally, gene enrichment and pathway analysis helped us to implicate VSD-related candidate genes.

  19. Deep Question Answering for protein annotation.

    PubMed

    Gobeill, Julien; Gaudinat, Arnaud; Pasche, Emilie; Vishnyakova, Dina; Gaudet, Pascale; Bairoch, Amos; Ruch, Patrick

    2015-01-01

    Biomedical professionals have access to a huge amount of literature, but when they use a search engine, they often have to deal with too many documents to efficiently find the appropriate information in a reasonable time. In this perspective, question-answering (QA) engines are designed to display answers, which were automatically extracted from the retrieved documents. Standard QA engines in literature process a user question, then retrieve relevant documents and finally extract some possible answers out of these documents using various named-entity recognition processes. In our study, we try to answer complex genomics questions, which can be adequately answered only using Gene Ontology (GO) concepts. Such complex answers cannot be found using state-of-the-art dictionary- and redundancy-based QA engines. We compare the effectiveness of two dictionary-based classifiers for extracting correct GO answers from a large set of 100 retrieved abstracts per question. In the same way, we also investigate the power of GOCat, a GO supervised classifier. GOCat exploits the GOA database to propose GO concepts that were annotated by curators for similar abstracts. This approach is called deep QA, as it adds an original classification step, and exploits curated biological data to infer answers, which are not explicitly mentioned in the retrieved documents. We show that for complex answers such as protein functional descriptions, the redundancy phenomenon has a limited effect. Similarly usual dictionary-based approaches are relatively ineffective. In contrast, we demonstrate how existing curated data, beyond information extraction, can be exploited by a supervised classifier, such as GOCat, to massively improve both the quantity and the quality of the answers with a +100% improvement for both recall and precision. Database URL: http://eagl.unige.ch/DeepQA4PA/. © The Author(s) 2015. Published by Oxford University Press.

  20. Deconstructing breast cancer heterogeneity: clinical implications for women with Basal-like tumors.

    PubMed

    Rattani, Nabila S; Swift-Scanlan, Theresa

    2014-11-01

    To compare and contrast the molecular and environmental factors contributing to basal-like breast cancer and highlight the clinical implications for women with this phenotype. CINAHL® and PubMed databases, journals, and citation indices were searched using the key word basal-like in combination with breast cancer, epigenetic, treatment, subtype, risk factor, and BRCA1 to synthesize the literature on the multiple underpinnings of basal-like breast cancer. Research findings related to the molecular foundation of basal-like breast cancer were integrated with knowledge of nongenetic contributing risk factors. Approved therapies and those under development were summarized with the goal of improving understanding for research and practice. Of the five subtypes of breast cancer, the basal-like subtype has the shortest survival and poorest prognosis. The development of gene expression assays with epigenetic studies has enabled reliable identification of the basal-like subtype and has shed light on novel therapeutic possibilities. Clinical trials for basal-like breast cancer are underway, and the potential for individualized treatments for women with this subtype shows promise. The main difficulties with basal-like breast cancer are its aggressive course, treatment-refractory nature, and complex biology, all of which pose real challenges for clinical management and patient education. Oncology nurses play a pivotal role in providing holistic care and patient support. Therefore, nurses must understand the complexity of the clinical presentation and the underlying biology of this cancer subtype.

  1. Senior Laboratory Animal Technician | Center for Cancer Research

    Cancer.gov

    PROGRAM DESCRIPTION
    The Laboratory Animal Sciences Program (LASP) provides exceptional quality animal care and technical support services for animal research performed at the National Cancer Institute at the Frederick National Laboratory for Cancer Research. LASP executes this mission by providing a broad spectrum of state-of-the-art technologies and services that are focused on the design, generation, characterization and application of genetically engineered and biological animal models of human disease, which are aimed at the development of targeted diagnostics and therapies. LASP contributes to advancing human health, developing new treatments, and improving existing treatments for cancer and other diseases while ensuring safe and humane treatment of animals.
    KEY ROLES/RESPONSIBILITIES
    The Senior Laboratory Animal Technician will be responsible for:
    Daily tasks associated with the care, breeding and treatment of research animals for experimental purposes
    Management of rodent breeding colonies consisting of multiple, genetically complex strains and associated record keeping and database management
    Colony management procedures including: tail clipping, animal identification, weaning
    Data entry consistent with complex colony management
    Collection of routine diagnostic samples
    Coordinating shipment of live animals and specimens
    Performing rodent experimental procedures including basic necropsy and blood collection
    Observation and recording of physical signs of animal health
    Knowledge of safe working practices using chemical carcinogen and biological hazards
    Work schedule may include weekend and holiday hours
    This position is in support of the Center for Cancer Research (CCR).

  2. Review of family relational stress and pediatric asthma: the value of biopsychosocial systemic models.

    PubMed

    Wood, Beatrice L; Miller, Bruce D; Lehman, Heather K

    2015-06-01

    Asthma is the most common chronic disease in children. Despite dramatic advances in pharmacological treatments, asthma remains a leading public health problem, especially in socially disadvantaged minority populations. Some experts believe that this health gap is due to the failure to address the impact of stress on the disease. Asthma is a complex disease that is influenced by multilevel factors, but the nature of these factors and their interrelations are not well understood. This paper aims to integrate social, psychological, and biological literatures on relations between family/parental stress and pediatric asthma, and to illustrate the utility of multilevel systemic models for guiding treatment and stimulating future research. We used electronic database searches and conducted an integrated analysis of selected epidemiological, longitudinal, and empirical studies. Evidence is substantial for the effects of family/parental stress on asthma mediated by both disease management and psychobiological stress pathways. However, integrative models containing specific pathways are scarce. We present two multilevel models, with supporting data, as potential prototypes for other such models. We conclude that these multilevel systems models may be of substantial heuristic value in organizing investigations of, and clinical approaches to, the complex social-biological aspects of family stress in pediatric asthma. However, additional systemic models are needed, and the models presented herein could serve as prototypes for model development. © 2015 Family Process Institute.

  3. Colloquium and Report on Systems Microbiology: Beyond Microbial Genomics

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Merry R. Buckley

    The American Academy of Microbiology convened a colloquium June 4-6, 2004 to confer about the scientific promise of systems microbiology. Participants discussed the power of applying a systems approach to the study of biology and to microbiology in particular, specifics about current research efforts, technical bottlenecks, requirements for data acquisition and maintenance, educational needs, and communication issues surrounding the field. A number of recommendations were made for removing barriers to progress in systems microbiology and for improving opportunities in education and collaboration. Systems biology, as a concept, is not new, but the recent explosion of genomic sequences and related data has revived interest in the field. Systems microbiology, a subset of systems biology, represents a different approach to investigating biological systems. It attempts to examine the emergent properties of microorganisms that arise from the interplay of genes, proteins, other macromolecules, small molecules, organelles, and the environment. It is these interactions, often nonlinear, that lead to the emergent properties of biological systems that are generally not tractable by traditional approaches. As a complement to the long-standing trend toward reductionism, systems microbiology seeks to treat the organism or community as a whole, integrating fundamental biological knowledge with genomics, metabolomics, and other data to create an integrated picture of how a microbial cell or community operates. Systems microbiology promises not only to shed light on the activities of microbes, but will also provide biology the tools and approaches necessary for achieving a better understanding of life and ecosystems. Microorganisms are ideal candidates for systems biology research because they are relatively easy to manipulate and because they play critical roles in health, environment, agriculture, and energy production.
Potential applications of systems microbiology research range from improvements in the management of bacterial infections to the development of commercial-scale microbial hydrogen generation. A number of technical challenges must be met to realize the potential of systems microbiology. Development of a new, comprehensive systems microbiology database that would be available to the entire research community was identified as the single most critical need. Other challenges include difficulties in measuring single-cell parameters, limitations in identifying and measuring metabolites and other products, the inability to cultivate diverse microbes, limits on data accessibility, computational limitations associated with data integration, the lack of sufficient functional gene annotations, needs for quantitative proteomics, and the inapplicability of current high throughput methods to all areas of systems microbiology. Difficulties have also been encountered in acquiring the necessary data, assuring the quality of that data, and in making data available to the community in a useful format. Problems with data quality assurance and data availability could be partially offset by launching a dedicated systems microbiology database. To be of greatest value to the field, a database should include systems data from all levels of analysis, including sequences, microarray data, proteomics data, metabolite measurements, data on protein-protein or protein-nucleic interactions, carbohydrate and small RNA profiles, information on cell surface markers, and appropriate supporting data. Regular updates of these databases and adherence to agreed upon data format standards are critical to the success of these resources. It was recommended that educational requirements for undergraduate and graduate students in microbiology be amended to better prepare the next generation of researchers for the quantitative requirements of applying systems microbiology methods in their work. 
Systems microbiology research is too complex to be the sole property of any single academic discipline. The contributions of microbiologists, computer scientists, control theorists, biostatisticians, and others are all required to move the field forward. Since research in systems microbiology demands the contributions of a diverse array of professionals, collaboration across disciplines and national borders should be strongly encouraged by research bodies and funding agencies. Although the details of systems microbiology research are probably not of interest to the average individual, the potential applications and benefits of these types of investigations should be conveyed to the lay public.

  4. VTCdb: a gene co-expression database for the crop species Vitis vinifera (grapevine).

    PubMed

    Wong, Darren C J; Sweetman, Crystal; Drew, Damian P; Ford, Christopher M

    2013-12-16

    Gene expression datasets in model plants such as Arabidopsis have contributed to our understanding of gene function and how a single underlying biological process can be governed by a diverse network of genes. The accumulation of publicly available microarray data encompassing a wide range of biological and environmental conditions has enabled the development of additional capabilities including gene co-expression analysis (GCA). GCA is based on the understanding that genes encoding proteins involved in similar and/or related biological processes may exhibit comparable expression patterns over a range of experimental conditions, developmental stages and tissues. We present an open access database for the investigation of gene co-expression networks within the cultivated grapevine, Vitis vinifera. The new gene co-expression database, VTCdb (http://vtcdb.adelaide.edu.au/Home.aspx), offers an online platform for transcriptional regulatory inference in the cultivated grapevine. Using condition-independent and condition-dependent approaches, grapevine co-expression networks were constructed using the latest publicly available microarray datasets from diverse experimental series, utilising the Affymetrix Vitis vinifera GeneChip (16 K) and the NimbleGen Grape Whole-genome microarray chip (29 K), thus making it possible to profile approximately 29,000 genes (95% of the predicted grapevine transcriptome). Applications available with the online platform include the use of gene names, probesets, modules or biological processes to query the co-expression networks, with the option to choose between Affymetrix or Nimblegen datasets and between multiple co-expression measures. Alternatively, the user can browse existing network modules using interactive network visualisation and analysis via CytoscapeWeb. 
To demonstrate the utility of the database, we present examples from three fundamental biological processes (berry development, photosynthesis and flavonoid biosynthesis) whereby the recovered sub-networks reconfirm established plant gene functions and also identify novel associations. Together, we present valuable insights into grapevine transcriptional regulation by developing network models applicable to researchers in their prioritisation of gene candidates, for on-going study of biological processes related to grapevine development, metabolism and stress responses.
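
    The idea behind gene co-expression analysis, that genes in related processes show comparable expression patterns across conditions, can be illustrated with a minimal sketch. It assumes the Pearson correlation coefficient as the co-expression measure (the abstract only says that multiple measures are offered); the gene names, expression values, and the 0.9 threshold are hypothetical and are not taken from VTCdb.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def coexpressed_pairs(expression, threshold=0.9):
    """Return gene pairs whose profiles correlate at or above the threshold."""
    genes = sorted(expression)
    pairs = []
    for i, g1 in enumerate(genes):
        for g2 in genes[i + 1:]:
            r = pearson(expression[g1], expression[g2])
            if r >= threshold:
                pairs.append((g1, g2, round(r, 3)))
    return pairs

# Hypothetical expression values over four conditions (illustration only).
expression = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.0],
    "geneC": [4.0, 3.0, 2.0, 1.0],
}
pairs = coexpressed_pairs(expression)
print(pairs)
```

    In this toy input only geneA and geneB rise together, so they form the single edge of the resulting network; geneC is anti-correlated with both and is excluded by the positive threshold.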

  5. The Multinational Arabidopsis Steering Subcommittee for Proteomics Assembles the Largest Proteome Database Resource for Plant Systems Biology

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Weckwerth, Wolfram; Baginsky, Sacha; Van Wijk, Klass

    2009-12-01

    In the past 10 years, we have witnessed remarkable advances in the field of plant molecular biology. The rapid development of proteomic technologies and the speed with which these techniques have been applied to the field have altered our perception of how we can analyze proteins in complex systems. At nearly the same time, the complete genome for the model plant Arabidopsis thaliana was released; this effort provides an unsurpassed resource for the identification of proteins when researchers use MS to analyze plant samples. Recognizing the growth in this area, the Multinational Arabidopsis Steering Committee (MASC) established a subcommittee for A. thaliana proteomics in 2006 with the objective of consolidating databases, technique standards, and experimentally validated candidate genes and functions. Since the establishment of the Multinational Arabidopsis Steering Subcommittee for Proteomics (MASCP), many new approaches and resources have become available. Recently, the subcommittee established a webpage to consolidate this information (www.masc-proteomics.org). It includes links to plant proteomic databases, general information about proteomic techniques, meeting information, a summary of proteomic standards, and other relevant resources. Altogether, this website provides a useful resource for the Arabidopsis proteomics community. In the future, the website will host discussions and investigate the cross-linking of databases. The subcommittee members have extensive experience in Arabidopsis proteomics and collectively have produced some of the most extensive proteomics data sets for this model plant (Table S1 in the Supporting Information has a list of resources). The largest collection of proteomics data from a single study in A. thaliana was assembled into an accessible database (AtProteome; http://fgcz-atproteome.unizh.ch/index.php) and was recently published by the Baginsky lab [1]. The database provides links to major Arabidopsis online resources, and raw data have been deposited in PRIDE and PRIDE BioMart. Included in this database is an Arabidopsis proteome map that provides evidence for the expression of approximately 50% of all predicted gene models, including several alternative gene models that are not represented in The Arabidopsis Information Resource (TAIR) protein database. A set of organ-specific biomarkers is provided, as well as organ-specific proteotypic peptides for 4105 proteins that can be used to facilitate targeted quantitative proteomic surveys. In the future, the AtProteome database will be linked to additional existing resources developed by MASCP members, such as PPDB, ProMEX, and SUBA. The most comprehensive study on the Arabidopsis chloroplast proteome, which includes information on chloroplast sorting signals, posttranslational modifications (PTMs), and protein abundances (analyzed by high-accuracy MS [Orbitrap]), was recently published by the van Wijk lab [2]. These and previous data are available via the plant proteome database (PPDB; http://ppdb.tc.cornell.edu) for A. thaliana and maize. PPDB provides genome-wide experimental and functional characterization of the A. thaliana and maize proteomes, including PTMs and subcellular localization information, with an emphasis on leaf and plastid proteins. Maize and Arabidopsis proteome entries are directly linked via internal BLAST alignments within PPDB. Direct links for each protein to TAIR, SUBA, ProMEX, and other resources are also provided.

  6. MAGIC database and interfaces: an integrated package for gene discovery and expression.

    PubMed

    Cordonnier-Pratt, Marie-Michèle; Liang, Chun; Wang, Haiming; Kolychev, Dmitri S; Sun, Feng; Freeman, Robert; Sullivan, Robert; Pratt, Lee H

    2004-01-01

    The rapidly increasing rate at which biological data is being produced requires a corresponding growth in relational databases and associated tools that can help laboratories contend with that data. With this need in mind, we describe here a Modular Approach to a Genomic, Integrated and Comprehensive (MAGIC) Database. This Oracle 9i database derives from an initial focus in our laboratory on gene discovery via production and analysis of expressed sequence tags (ESTs), and subsequently on gene expression as assessed by both EST clustering and microarrays. The MAGIC Gene Discovery portion of the database focuses on information derived from DNA sequences and on its biological relevance. In addition to MAGIC SEQ-LIMS, which is designed to support activities in the laboratory, it contains several additional subschemas. The latter include MAGIC Admin for database administration, MAGIC Sequence for sequence processing as well as sequence and clone attributes, MAGIC Cluster for the results of EST clustering, MAGIC Polymorphism in support of microsatellite and single-nucleotide-polymorphism discovery, and MAGIC Annotation for electronic annotation by BLAST and BLAT. The MAGIC Microarray portion is a MIAME-compliant database with two components at present. These are MAGIC Array-LIMS, which makes possible remote entry of all information into the database, and MAGIC Array Analysis, which provides data mining and visualization. Because all aspects of interaction with the MAGIC Database are via a web browser, it is ideally suited not only for individual research laboratories but also for core facilities that serve clients at any distance.

  7. BioModels Database: a repository of mathematical models of biological processes.

    PubMed

    Chelliah, Vijayalakshmi; Laibe, Camille; Le Novère, Nicolas

    2013-01-01

    BioModels Database is a public online resource that allows storing and sharing of published, peer-reviewed quantitative, dynamic models of biological processes. The model components and behaviour are thoroughly checked to correspond to the original publication and manually curated to ensure reliability. Furthermore, the model elements are annotated with terms from controlled vocabularies as well as linked to relevant external data resources. This greatly helps in model interpretation and reuse. Models are stored in SBML format, accepted in SBML and CellML formats, and are available for download in various other common formats such as BioPAX, Octave, SciLab, VCML, XPP and PDF, in addition to SBML. The reaction network diagram of the models is also available in several formats. BioModels Database features a search engine, which provides simple and more advanced searches. Features such as online simulation and creation of smaller models (submodels) from the selected model elements of a larger one are provided. BioModels Database can be accessed both via a web interface and programmatically via web services. New models are available in BioModels Database at regular releases, about every 4 months.

  8. BIOPEP database and other programs for processing bioactive peptide sequences.

    PubMed

    Minkiewicz, Piotr; Dziuba, Jerzy; Iwaniak, Anna; Dziuba, Marta; Darewicz, Małgorzata

    2008-01-01

    This review presents the potential for application of computational tools in peptide science based on a sample BIOPEP database and program as well as other programs and databases available via the World Wide Web. The BIOPEP application contains a database of biologically active peptide sequences and a program enabling construction of profiles of the potential biological activity of protein fragments, calculation of quantitative descriptors as measures of the value of proteins as potential precursors of bioactive peptides, and prediction of bonds susceptible to hydrolysis by endopeptidases in a protein chain. Other bioactive and allergenic peptide sequence databases are also presented. Programs enabling the construction of binary and multiple alignments between peptide sequences, the construction of sequence motifs attributed to a given type of bioactivity, searching for potential precursors of bioactive peptides, and the prediction of sites susceptible to proteolytic cleavage in protein chains are available via the Internet as are other approaches concerning secondary structure prediction and calculation of physicochemical features based on amino acid sequence. Programs for prediction of allergenic and toxic properties have also been developed. This review explores the possibilities of cooperation between various programs.

  9. LMSD: LIPID MAPS structure database

    PubMed Central

    Sud, Manish; Fahy, Eoin; Cotter, Dawn; Brown, Alex; Dennis, Edward A.; Glass, Christopher K.; Merrill, Alfred H.; Murphy, Robert C.; Raetz, Christian R. H.; Russell, David W.; Subramaniam, Shankar

    2007-01-01

    The LIPID MAPS Structure Database (LMSD) is a relational database encompassing structures and annotations of biologically relevant lipids. Structures of lipids in the database come from four sources: (i) LIPID MAPS Consortium's core laboratories and partners; (ii) lipids identified by LIPID MAPS experiments; (iii) computationally generated structures for appropriate lipid classes; (iv) biologically relevant lipids manually curated from LIPID BANK, LIPIDAT and other public sources. All the lipid structures in LMSD are drawn in a consistent fashion. In addition to a classification-based retrieval of lipids, users can search LMSD using either text-based or structure-based search options. The text-based search implementation supports data retrieval by any combination of these data fields: LIPID MAPS ID, systematic or common name, mass, formula, category, main class, and subclass data fields. The structure-based search, in conjunction with optional data fields, provides the capability to perform a substructure search or exact match for the structure drawn by the user. Search results, in addition to structure and annotations, also include relevant links to external databases. The LMSD is publicly available at PMID:17098933
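
    The text-based search described above, retrieval by any combination of data fields, can be sketched as a simple conjunctive filter. This is only an illustration under stated assumptions: the field names loosely follow the abstract, the records and their values are hypothetical placeholders rather than LMSD entries, and the sketch says nothing about how LMSD itself implements its search.

```python
def search(records, **criteria):
    """Return records matching every supplied field=value criterion."""
    return [
        record for record in records
        if all(record.get(field) == value for field, value in criteria.items())
    ]

# Hypothetical placeholder records (not real LMSD data).
records = [
    {"lm_id": "EXAMPLE0001", "common_name": "lipid one",
     "category": "category 1", "main_class": "class A"},
    {"lm_id": "EXAMPLE0002", "common_name": "lipid two",
     "category": "category 1", "main_class": "class B"},
    {"lm_id": "EXAMPLE0003", "common_name": "lipid three",
     "category": "category 2", "main_class": "class A"},
]

by_category = search(records, category="category 1")
by_both = search(records, category="category 1", main_class="class B")
print([r["lm_id"] for r in by_category])
print([r["lm_id"] for r in by_both])
```

    Each additional keyword argument narrows the result set, which mirrors the "any combination of fields" behaviour in spirit.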

  10. SSBD: a database of quantitative data of spatiotemporal dynamics of biological phenomena

    PubMed Central

    Tohsato, Yukako; Ho, Kenneth H. L.; Kyoda, Koji; Onami, Shuichi

    2016-01-01

    Motivation: Rapid advances in live-cell imaging analysis and mathematical modeling have produced a large amount of quantitative data on spatiotemporal dynamics of biological objects ranging from molecules to organisms. There is now a crucial need to bring these large amounts of quantitative biological dynamics data together centrally in a coherent and systematic manner. This will facilitate the reuse of this data for further analysis. Results: We have developed the Systems Science of Biological Dynamics database (SSBD) to store and share quantitative biological dynamics data. SSBD currently provides 311 sets of quantitative data for single molecules, nuclei and whole organisms in a wide variety of model organisms from Escherichia coli to Mus musculus. The data are provided in Biological Dynamics Markup Language format and also through a REST API. In addition, SSBD provides 188 sets of time-lapse microscopy images from which the quantitative data were obtained and software tools for data visualization and analysis. Availability and Implementation: SSBD is accessible at http://ssbd.qbic.riken.jp. Contact: sonami@riken.jp PMID:27412095

  11. SSBD: a database of quantitative data of spatiotemporal dynamics of biological phenomena.

    PubMed

    Tohsato, Yukako; Ho, Kenneth H L; Kyoda, Koji; Onami, Shuichi

    2016-11-15

    Rapid advances in live-cell imaging analysis and mathematical modeling have produced a large amount of quantitative data on spatiotemporal dynamics of biological objects ranging from molecules to organisms. There is now a crucial need to bring these large amounts of quantitative biological dynamics data together centrally in a coherent and systematic manner. This will facilitate the reuse of this data for further analysis. We have developed the Systems Science of Biological Dynamics database (SSBD) to store and share quantitative biological dynamics data. SSBD currently provides 311 sets of quantitative data for single molecules, nuclei and whole organisms in a wide variety of model organisms from Escherichia coli to Mus musculus. The data are provided in Biological Dynamics Markup Language format and also through a REST API. In addition, SSBD provides 188 sets of time-lapse microscopy images from which the quantitative data were obtained and software tools for data visualization and analysis. SSBD is accessible at http://ssbd.qbic.riken.jp. Contact: sonami@riken.jp. © The Author 2016. Published by Oxford University Press.

  12. Unbiased Protein Association Study on the Public Human Proteome Reveals Biological Connections between Co-Occurring Protein Pairs

    PubMed Central

    2017-01-01

    Mass-spectrometry-based, high-throughput proteomics experiments produce large amounts of data. While typically acquired to answer specific biological questions, these data can also be reused in orthogonal ways to reveal new biological knowledge. We here present a novel method for such orthogonal data reuse of public proteomics data. Our method elucidates biological relationships between proteins based on the co-occurrence of these proteins across human experiments in the PRIDE database. The majority of the significantly co-occurring protein pairs that were detected by our method have been successfully mapped to existing biological knowledge. The validity of our novel method is substantiated by the extremely few pairs that can be mapped to existing knowledge based on random associations between the same set of proteins. Moreover, using literature searches and the STRING database, we were able to derive meaningful biological associations for unannotated protein pairs that were detected using our method, further illustrating that as-yet unknown associations present highly interesting targets for follow-up analysis. PMID:28480704
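
    Counting how often two proteins are identified in the same experiments is the starting point for this kind of co-occurrence analysis. The sketch below is a simplified stand-in, not the authors' method: it assumes each experiment is just a list of protein identifiers and scores pairs with the Jaccard index, whereas the abstract does not specify the statistic or significance test used. The identifiers P1 to P4 are hypothetical.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_scores(experiments):
    """Jaccard index for every protein pair seen together in at least one experiment."""
    protein_counts = Counter()
    pair_counts = Counter()
    for proteins in experiments:
        unique = sorted(set(proteins))
        protein_counts.update(unique)
        pair_counts.update(combinations(unique, 2))
    scores = {}
    for (a, b), both in pair_counts.items():
        # Experiments containing a or b = count(a) + count(b) - count(both).
        either = protein_counts[a] + protein_counts[b] - both
        scores[(a, b)] = both / either
    return scores

# Hypothetical protein identifications per experiment (illustration only).
experiments = [
    ["P1", "P2", "P3"],
    ["P1", "P2"],
    ["P2", "P3"],
    ["P1", "P2", "P4"],
]
scores = cooccurrence_scores(experiments)
for pair, score in sorted(scores.items()):
    print(pair, round(score, 3))
```

    A real analysis would additionally need a null model, for example random associations between the same set of proteins as mentioned in the abstract, to decide which scores are significant.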

  13. IPD-MHC 2.0: an improved inter-species database for the study of the major histocompatibility complex

    PubMed Central

    Maccari, Giuseppe; Robinson, James; Ballingall, Keith; Guethlein, Lisbeth A.; Grimholt, Unni; Kaufman, Jim; Ho, Chak-Sum; de Groot, Natasja G.; Flicek, Paul; Bontrop, Ronald E.; Hammond, John A.; Marsh, Steven G. E.

    2017-01-01

    The IPD-MHC Database project (http://www.ebi.ac.uk/ipd/mhc/) collects and expertly curates sequences of the major histocompatibility complex from non-human species and provides the infrastructure and tools to enable accurate analysis. Since the first release of the database in 2003, IPD-MHC has grown and currently hosts a number of specific sections, with more than 7000 alleles from 70 species, including non-human primates, canines, felines, equids, ovids, suids, bovins, salmonids and murids. These sequences are expertly curated and made publicly available through an open access website. The IPD-MHC Database is a key resource in its field, and this has led to an average of 1500 unique visitors and more than 5000 viewed pages per month. As the database has grown in size and complexity, it has created a number of challenges in maintaining and organizing information, particularly the need to standardize nomenclature and taxonomic classification, while incorporating new allele submissions. Here, we describe the latest database release, the IPD-MHC 2.0 and discuss planned developments. This release incorporates sequence updates and new tools that enhance database queries and improve the submission procedure by utilizing common tools that are able to handle the varied requirements of each MHC-group. PMID:27899604

  14. Discovering Sequence Motifs with Arbitrary Insertions and Deletions

    PubMed Central

    Frith, Martin C.; Saunders, Neil F. W.; Kobe, Bostjan; Bailey, Timothy L.

    2008-01-01

    Biology is encoded in molecular sequences: deciphering this encoding remains a grand scientific challenge. Functional regions of DNA, RNA, and protein sequences often exhibit characteristic but subtle motifs; thus, computational discovery of motifs in sequences is a fundamental and much-studied problem. However, most current algorithms do not allow for insertions or deletions (indels) within motifs, and the few that do have other limitations. We present a method, GLAM2 (Gapped Local Alignment of Motifs), for discovering motifs allowing indels in a fully general manner, and a companion method GLAM2SCAN for searching sequence databases using such motifs. GLAM2 is a generalization of the gapless Gibbs sampling algorithm. It re-discovers variable-width protein motifs from the PROSITE database significantly more accurately than the alternative methods PRATT and SAM-T2K. Furthermore, it usefully refines protein motifs from the ELM database: in some cases, the refined motifs make orders of magnitude fewer overpredictions than the original ELM regular expressions. GLAM2 performs respectably on the BAliBASE multiple alignment benchmark, and may be superior to leading multiple alignment methods for “motif-like” alignments with N- and C-terminal extensions. Finally, we demonstrate the use of GLAM2 to discover protein kinase substrate motifs and a gapped DNA motif for the LIM-only transcriptional regulatory complex: using GLAM2SCAN, we identify promising targets for the latter. GLAM2 is especially promising for short protein motifs, and it should improve our ability to identify the protein cleavage sites, interaction sites, post-translational modification attachment sites, etc., that underlie much of biology. It may be equally useful for arbitrarily gapped motifs in DNA and RNA, although fewer examples of such motifs are known at present. GLAM2 is public domain software, available for download at http://bioinformatics.org.au/glam2. PMID:18437229

  15. DOE Office of Scientific and Technical Information (OSTI.GOV)

    Rayl, K.D.; Gaasterland, T.

    This paper presents an overview of the purpose, content, and design of a subset of the currently available biological databases, with an emphasis on protein databases. Databases included in this summary are 3D-ALI, Berlin RNA databank, Blocks, DSSP, EMBL Nucleotide Database, EMP, ENZYME, FSSP, GDB, GenBank, HSSP, LiMB, PDB, PIR, PKCDD, ProSite, and SWISS-PROT. The goal is to provide a starting point for researchers who wish to take advantage of the myriad available databases. Rather than providing a complete explanation of each database, we present its content and form by explaining the details of typical entries. Pointers to more complete "user guides" are included, along with general information on where to search for a new database.

  16. Consensus and conflict cards for metabolic pathway databases

    PubMed Central

    2013-01-01

    Background: The metabolic network of H. sapiens and many other organisms is described in multiple pathway databases. The level of agreement between these descriptions, however, has proven to be low. We can use these different descriptions to our advantage by identifying conflicting information and combining their knowledge into a single, more accurate, and more complete description. This task is, however, far from trivial. Results: We introduce the concept of Consensus and Conflict Cards (C2Cards) to provide concise overviews of what the databases do or do not agree on. Each card is centered at a single gene, EC number or reaction. These three complementary perspectives make it possible to distinguish disagreements on the underlying biology of a metabolic process from differences that can be explained by different decisions on how and in what detail to represent knowledge. As a proof-of-concept, we implemented C2CardsHuman, as a web application http://www.molgenis.org/c2cards, covering five human pathway databases. Conclusions: C2Cards can contribute to ongoing reconciliation efforts by simplifying the identification of consensus and conflicts between pathway databases and lowering the threshold for experts to contribute. Several case studies illustrate the potential of the C2Cards in identifying disagreements on the underlying biology of a metabolic process. The overviews may also point out controversial biological knowledge that should be subject of further research. Finally, the examples provided emphasize the importance of manual curation and the need for a broad community involvement. PMID:23803311

  17. Consensus and conflict cards for metabolic pathway databases.

    PubMed

    Stobbe, Miranda D; Swertz, Morris A; Thiele, Ines; Rengaw, Trebor; van Kampen, Antoine H C; Moerland, Perry D

    2013-06-26

    The metabolic network of H. sapiens and many other organisms is described in multiple pathway databases. The level of agreement between these descriptions, however, has proven to be low. We can use these different descriptions to our advantage by identifying conflicting information and combining their knowledge into a single, more accurate, and more complete description. This task is, however, far from trivial. We introduce the concept of Consensus and Conflict Cards (C₂Cards) to provide concise overviews of what the databases do or do not agree on. Each card is centered at a single gene, EC number or reaction. These three complementary perspectives make it possible to distinguish disagreements on the underlying biology of a metabolic process from differences that can be explained by different decisions on how and in what detail to represent knowledge. As a proof-of-concept, we implemented C₂Cards(Human), as a web application http://www.molgenis.org/c2cards, covering five human pathway databases. C₂Cards can contribute to ongoing reconciliation efforts by simplifying the identification of consensus and conflicts between pathway databases and lowering the threshold for experts to contribute. Several case studies illustrate the potential of the C₂Cards in identifying disagreements on the underlying biology of a metabolic process. The overviews may also point out controversial biological knowledge that should be subject of further research. Finally, the examples provided emphasize the importance of manual curation and the need for a broad community involvement.
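
    The core comparison behind a consensus-and-conflict overview can be sketched with set operations. The example below is a hedged illustration and not the C₂Cards implementation: it assumes that, for a single gene, each database reports a set of reaction identifiers, and it treats items reported by every database as consensus and everything else as conflict. The database names (db_a, db_b, db_c) and reaction identifiers are hypothetical.

```python
def consensus_and_conflict(annotations):
    """Split items into those all databases agree on and those only some report.

    annotations maps a database name to the set of items it reports.
    Returns (sorted consensus list, {conflicting item: databases reporting it}).
    """
    sets = [set(items) for items in annotations.values()]
    consensus = set.intersection(*sets)
    union = set.union(*sets)
    conflict = {
        item: sorted(db for db, items in annotations.items() if item in items)
        for item in sorted(union - consensus)
    }
    return sorted(consensus), conflict

# Hypothetical reactions linked to one gene in three databases (illustration only).
annotations = {
    "db_a": {"R1", "R2", "R3"},
    "db_b": {"R1", "R2"},
    "db_c": {"R1", "R4"},
}
consensus, conflict = consensus_and_conflict(annotations)
print("consensus:", consensus)
print("conflict:", conflict)
```

    Listing which databases support each conflicting item is what lets a curator judge whether a disagreement reflects different biology or merely a different level of detail in the representation.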

  18. A Brief Review of RNA–Protein Interaction Database Resources

    PubMed Central

    Yi, Ying; Zhao, Yue; Huang, Yan; Wang, Dong

    2017-01-01

    RNA–Protein interactions play critical roles in various biological processes. By collecting and analyzing the RNA–Protein interactions and binding sites from experiments and predictions, RNA–Protein interaction databases have become an essential resource for the exploration of the transcriptional and post-transcriptional regulatory network. Here, we briefly review several widely used RNA–Protein interaction database resources developed in recent years to provide a guide to these databases. The content and major functions of the databases are presented. The brief description of each database helps users to quickly choose the database containing the information they are interested in. In short, these RNA–Protein interaction database resources are continually updated, but the current state shows the efforts to identify and analyze the large amount of RNA–Protein interactions. PMID:29657278

  19. Systematization of the protein sequence diversity in enzymes related to secondary metabolic pathways in plants, in the context of big data biology inspired by the KNApSAcK motorcycle database.

    PubMed

    Ikeda, Shun; Abe, Takashi; Nakamura, Yukiko; Kibinge, Nelson; Hirai Morita, Aki; Nakatani, Atsushi; Ono, Naoaki; Ikemura, Toshimichi; Nakamura, Kensuke; Altaf-Ul-Amin, Md; Kanaya, Shigehiko

    2013-05-01

    Biology is increasingly becoming a data-intensive science with the recent progress of the omics fields, e.g. genomics, transcriptomics, proteomics and metabolomics. The species-metabolite relationship database, KNApSAcK Core, has been widely utilized and cited in metabolomics research, and chronological analysis of that research work has helped to reveal recent trends in metabolomics research. To meet the needs of these trends, the KNApSAcK database has been extended by incorporating a secondary metabolic pathway database called Motorcycle DB. We examined the enzyme sequence diversity related to secondary metabolism by means of batch-learning self-organizing maps (BL-SOMs). Initially, we constructed a map by using a big data matrix consisting of the frequencies of all possible dipeptides in the protein sequence segments of plants and bacteria. The enzyme sequence diversity of the secondary metabolic pathways was examined by identifying clusters of segments associated with certain enzyme groups in the resulting map. The extent of diversity of 15 secondary metabolic enzyme groups is discussed. Data-intensive approaches such as BL-SOM applied to big data matrices are needed for systematizing protein sequences. Handling big data has become an inevitable part of biology.
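
    The dipeptide-frequency matrix mentioned in the abstract above can be illustrated with a minimal sketch. This is not the authors' code: the 20-letter amino acid alphabet, the function names, and the choice to skip dipeptides containing non-standard residues are all assumptions made for illustration. Each sequence segment becomes one row of 400 relative frequencies, which is the kind of input a self-organizing map would be trained on.

```python
from collections import Counter
from itertools import product

# Assumption: the 20 standard amino acids, giving 20 * 20 = 400 dipeptides.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def dipeptide_frequencies(segment: str) -> list[float]:
    """Return the relative frequency of each of the 400 dipeptides in a segment.

    Overlapping windows of length 2 are counted; windows containing a
    non-standard residue are ignored (an illustrative choice).
    """
    counts = Counter(segment[i:i + 2] for i in range(len(segment) - 1))
    total = sum(counts[d] for d in DIPEPTIDES)
    if total == 0:
        return [0.0] * len(DIPEPTIDES)
    return [counts[d] / total for d in DIPEPTIDES]


def frequency_matrix(segments: list[str]) -> list[list[float]]:
    """Build one 400-column row per sequence segment."""
    return [dipeptide_frequencies(s) for s in segments]
```

    A call such as frequency_matrix(["MKTAYIAK", "ACACAC"]) would return two rows of 400 values each; the segment strings here are made up and carry no biological meaning.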

  20. A novel database of bio-effects from non-ionizing radiation.

    PubMed

    Leach, Victor; Weller, Steven; Redmayne, Mary

    2018-06-06

    A significant amount of electromagnetic field/electromagnetic radiation (EMF/EMR) research is available that examines biological and disease associated endpoints. The quantity, variety and changing parameters in the available research can be challenging when undertaking a literature review, meta-analysis, preparing a study design, building reference lists or comparing findings between relevant scientific papers. The Oceania Radiofrequency Scientific Advisory Association (ORSAA) has created a comprehensive, non-biased, multi-categorized, searchable database of papers on non-ionizing EMF/EMR to help address these challenges. It is regularly added to, freely accessible online and designed to allow data to be easily retrieved, sorted and analyzed. This paper demonstrates the content and search flexibility of the ORSAA database. Demonstration searches are presented by Effect/No Effect; frequency-band/s; in vitro; in vivo; biological effects; study type; and funding source. As of 15 September 2017, the clear majority of 2653 papers captured in the database examine outcomes in the 300 MHz-3 GHz range. There are 3 times more biological "Effect" than "No Effect" papers; nearly a third of papers provide no funding statement; industry-funded studies more often than not find "No Effect", while institutionally funded studies commonly reveal "Effects". The country where the study is conducted/funded also appears to have a dramatic influence on the likely result outcome.

  1. The integrated web service and genome database for agricultural plants with biotechnology information

    PubMed Central

    Kim, ChangKug; Park, DongSuk; Seol, YoungJoo; Hahn, JangHo

    2011-01-01

    The National Agricultural Biotechnology Information Center (NABIC) constructed an agricultural biology-based infrastructure and developed a Web based relational database for agricultural plants with biotechnology information. The NABIC has concentrated on functional genomics of major agricultural plants, building an integrated biotechnology database for agro-biotech information that focuses on genomics of major agricultural resources. This genome database provides annotated genome information from 1,039,823 records mapped to rice, Arabidopsis, and Chinese cabbage. PMID:21887015

  2. Data Sources for Trait Databases: Comparing the Phenomic Content of Monographs and Evolutionary Matrices.

    PubMed

    Dececchi, T Alex; Mabee, Paula M; Blackburn, David C

    2016-01-01

    Databases of organismal traits that aggregate information from one or multiple sources can be leveraged for large-scale analyses in biology. Yet the differences among these data streams and how well they capture trait diversity have never been explored. We present the first analysis of the differences between phenotypes captured in free text of descriptive publications ('monographs') and those used in phylogenetic analyses ('matrices'). We focus our analysis on osteological phenotypes of the limbs of four extinct vertebrate taxa critical to our understanding of the fin-to-limb transition. We find that there is low overlap between the anatomical entities used in these two sources of phenotype data, indicating that phenotypes represented in matrices are not simply a subset of those found in monographic descriptions. Perhaps as expected, compared to characters found in matrices, phenotypes in monographs tend to emphasize descriptive and positional morphology, be somewhat more complex, and relate to fewer additional taxa. While based on a small set of focal taxa, these qualitative and quantitative data suggest that either source of phenotypes alone will result in incomplete knowledge of variation for a given taxon. As a broader community develops to use and expand databases characterizing organismal trait diversity, it is important to recognize the limitations of the data sources and develop strategies to more fully characterize variation both within species and across the tree of life.

  3. Data Sources for Trait Databases: Comparing the Phenomic Content of Monographs and Evolutionary Matrices

    PubMed Central

    Dececchi, T. Alex; Mabee, Paula M.; Blackburn, David C.

    2016-01-01

    Databases of organismal traits that aggregate information from one or multiple sources can be leveraged for large-scale analyses in biology. Yet the differences among these data streams and how well they capture trait diversity have never been explored. We present the first analysis of the differences between phenotypes captured in free text of descriptive publications (‘monographs’) and those used in phylogenetic analyses (‘matrices’). We focus our analysis on osteological phenotypes of the limbs of four extinct vertebrate taxa critical to our understanding of the fin-to-limb transition. We find that there is low overlap between the anatomical entities used in these two sources of phenotype data, indicating that phenotypes represented in matrices are not simply a subset of those found in monographic descriptions. Perhaps as expected, compared to characters found in matrices, phenotypes in monographs tend to emphasize descriptive and positional morphology, be somewhat more complex, and relate to fewer additional taxa. While based on a small set of focal taxa, these qualitative and quantitative data suggest that either source of phenotypes alone will result in incomplete knowledge of variation for a given taxon. As a broader community develops to use and expand databases characterizing organismal trait diversity, it is important to recognize the limitations of the data sources and develop strategies to more fully characterize variation both within species and across the tree of life. PMID:27191170

  4. VCGDB: a dynamic genome database of the Chinese population

    PubMed Central

    2014-01-01

    Background: The data released by the 1000 Genomes Project contain an increasing number of genome sequences from different nations and populations with a large number of genetic variations. As a result, the focus of human genome studies is changing from single and static to complex and dynamic. The currently available human reference genome (GRCh37) is based on sequencing data from 13 anonymous Caucasian volunteers, which might limit the scope of genomics, transcriptomics, epigenetics, and genome wide association studies. Description: We used the massive amount of sequencing data published by the 1000 Genomes Project Consortium to construct the Virtual Chinese Genome Database (VCGDB), a dynamic genome database of the Chinese population based on the whole genome sequencing data of 194 individuals. VCGDB provides dynamic genomic information, which contains 35 million single nucleotide variations (SNVs), 0.5 million insertions/deletions (indels), and 29 million rare variations, together with genomic annotation information. VCGDB also provides a highly interactive user-friendly virtual Chinese genome browser (VCGBrowser) with functions like seamless zooming and real-time searching. In addition, we have established three population-specific consensus Chinese reference genomes that are compatible with mainstream alignment software. Conclusions: VCGDB offers a feasible strategy for processing big data to keep pace with the biological data explosion by providing a robust resource for genomics studies; in particular, studies aimed at finding regions of the genome associated with diseases. PMID:24708222

  5. Exploiting PubChem for Virtual Screening

    PubMed Central

    Xie, Xiang-Qun

    2011-01-01

    Importance of the field: PubChem is a public molecular information repository, a scientific showcase of the NIH Roadmap Initiative. The PubChem database holds over 27 million records of unique chemical structures of compounds (CID) derived from nearly 70 million substance depositions (SID), and contains more than 449,000 bioassay records covering thousands of in vitro biochemical and cell-based screening bioassays, targeting more than 7000 proteins and genes and linking to over 1.8 million substances. Areas covered in this review: This review builds on recent PubChem-related computational chemistry research reported by other authors while providing readers with an overview of the PubChem database, focusing on its increasing role in cheminformatics, virtual screening and toxicity prediction modeling. What the reader will gain: These publicly available datasets in PubChem provide great opportunities for scientists to perform cheminformatics and virtual screening research for computer-aided drug design. However, the high volume and complexity of the datasets, in particular the bioassay-associated false positives/negatives and highly imbalanced datasets in PubChem, also create major challenges. Several approaches regarding the modeling of PubChem datasets and development of virtual screening models for bioactivity and toxicity predictions are also reviewed. Take home message: Novel data-mining cheminformatics tools and virtual screening algorithms are being developed and used to retrieve, annotate and analyze the large-scale and highly complex PubChem biological screening data for drug design. PMID:21691435

  6. Protein structure determination by exhaustive search of Protein Data Bank derived databases.

    PubMed

    Stokes-Rees, Ian; Sliz, Piotr

    2010-12-14

    Parallel sequence and structure alignment tools have become ubiquitous and invaluable at all levels in the study of biological systems. We demonstrate the application and utility of this same parallel search paradigm to the process of protein structure determination, benefitting from the large and growing corpus of known structures. Such searches were previously computationally intractable. Through the method of Wide Search Molecular Replacement, developed here, they can be completed in a few hours with the aid of national-scale federated cyberinfrastructure. By dramatically expanding the range of models considered for structure determination, we show that small (less than 12% structural coverage) and low sequence identity (less than 20% identity) template structures can be identified through multidimensional template scoring metrics and used for structure determination. Many new macromolecular complexes can benefit significantly from such a technique due to the lack of known homologous protein folds or sequences. We demonstrate the effectiveness of the method by determining the structure of a full-length p97 homologue from Trichoplusia ni. Example cases with the MHC/T-cell receptor complex and the EmoB protein provide systematic estimates of minimum sequence identity, structure coverage, and structural similarity required for this method to succeed. We describe how this structure-search approach and other novel computationally intensive workflows are made tractable through integration with the US national computational cyberinfrastructure, allowing, for example, rapid processing of the entire Structural Classification of Proteins protein fragment database.

  7. GEM System: automatic prototyping of cell-wide metabolic pathway models from genomes.

    PubMed

    Arakawa, Kazuharu; Yamada, Yohei; Shinoda, Kosaku; Nakayama, Yoichi; Tomita, Masaru

    2006-03-23

    Successful realization of a "systems biology" approach to analyzing cells is a grand challenge for our understanding of life. However, current modeling approaches to cell simulation are labor-intensive, manual affairs, and therefore constitute a major bottleneck in the evolution of computational cell biology. We developed the Genome-based Modeling (GEM) System for the purpose of automatically prototyping simulation models of cell-wide metabolic pathways from genome sequences and other public biological information. Models generated by the GEM System include an entire Escherichia coli metabolism model comprising 968 reactions of 1195 metabolites, achieving 100% coverage when compared with the KEGG database, 92.38% with the EcoCyc database, and 95.06% with iJR904 genome-scale model. The GEM System prototypes qualitative models to reduce the labor-intensive tasks required for systems biology research. Models of over 90 bacterial genomes are available at our web site.

  8. The need for high-quality whole-genome sequence databases in microbial forensics.

    PubMed

    Sjödin, Andreas; Broman, Tina; Melefors, Öjar; Andersson, Gunnar; Rasmusson, Birgitta; Knutsson, Rickard; Forsman, Mats

    2013-09-01

    Microbial forensics is an important part of a strengthened capability to respond to biocrime and bioterrorism incidents to aid in the complex task of distinguishing between natural outbreaks and deliberate acts. The goal of a microbial forensic investigation is to identify and criminally prosecute those responsible for a biological attack, and it involves a detailed analysis of the weapon--that is, the pathogen. The recent development of next-generation sequencing (NGS) technologies has greatly increased the resolution that can be achieved in microbial forensic analyses. It is now possible to identify, quickly and in an unbiased manner, previously undetectable genome differences between closely related isolates. This development is particularly relevant for the most deadly bacterial diseases that are caused by bacterial lineages with extremely low levels of genetic diversity. Whole-genome analysis of pathogens is envisaged to be increasingly essential for this purpose. In a microbial forensic context, whole-genome sequence analysis is the ultimate method for strain comparisons as it is informative during identification, characterization, and attribution--all 3 major stages of the investigation--and at all levels of microbial strain identity resolution (ie, it resolves the full spectrum from family to isolate). Given these capabilities, one bottleneck in microbial forensics investigations is the availability of high-quality reference databases of bacterial whole-genome sequences. To be of high quality, databases need to be curated and accurate in terms of sequences, metadata, and genetic diversity coverage. The development of whole-genome sequence databases will be instrumental in successfully tracing pathogens in the future.

  9. MaizeGDB: The Maize Genetics and Genomics Database.

    USDA-ARS's Scientific Manuscript database

    MaizeGDB is the community database for biological information about the crop plant Zea mays. Genomic, genetic, sequence, gene product, functional characterization, literature reference, and person/organization contact information are among the datatypes stored at MaizeGDB. At the project’s website...

  10. AN EPA SPONSORED LITERATURE REVIEW DATABASE TO SUPPORT STRESSOR IDENTIFICATION

    EPA Science Inventory

    The Causal Analysis/Diagnosis Decision Information System (CADDIS) is an EPA decision-support system currently under development for evaluating the biological impact of stressors on water bodies. In support of CADDIS, EPA is developing CADLIT, a searchable database of the scient...

  11. Manual Gene Ontology annotation workflow at the Mouse Genome Informatics Database

    PubMed Central

    Drabkin, Harold J.; Blake, Judith A.

    2012-01-01

    The Mouse Genome Database, the Gene Expression Database and the Mouse Tumor Biology database are integrated components of the Mouse Genome Informatics (MGI) resource (http://www.informatics.jax.org). The MGI system presents both a consensus view and an experimental view of the knowledge concerning the genetics and genomics of the laboratory mouse. From genotype to phenotype, this information resource integrates information about genes, sequences, maps, expression analyses, alleles, strains and mutant phenotypes. Comparative mammalian data are also presented particularly in regards to the use of the mouse as a model for the investigation of molecular and genetic components of human diseases. These data are collected from literature curation as well as downloads of large datasets (SwissProt, LocusLink, etc.). MGI is one of the founding members of the Gene Ontology (GO) and uses the GO for functional annotation of genes. Here, we discuss the workflow associated with manual GO annotation at MGI, from literature collection to display of the annotations. Peer-reviewed literature is collected mostly from a set of journals available electronically. Selected articles are entered into a master bibliography and indexed to one of eight areas of interest such as ‘GO’ or ‘homology’ or ‘phenotype’. Each article is then either indexed to a gene already contained in the database or funneled through a separate nomenclature database to add genes. The master bibliography and associated indexing provide information for various curator-reports such as ‘papers selected for GO that refer to genes with NO GO annotation’. Once indexed, curators who have expertise in appropriate disciplines enter pertinent information. MGI makes use of several controlled vocabularies that ensure uniform data encoding, enable robust analysis and support the construction of complex queries. These vocabularies range from pick-lists to structured vocabularies such as the GO. All data associations are supported with statements of evidence as well as access to source publications. PMID:23110975

  12. Manual Gene Ontology annotation workflow at the Mouse Genome Informatics Database.

    PubMed

    Drabkin, Harold J; Blake, Judith A

    2012-01-01

    The Mouse Genome Database, the Gene Expression Database and the Mouse Tumor Biology database are integrated components of the Mouse Genome Informatics (MGI) resource (http://www.informatics.jax.org). The MGI system presents both a consensus view and an experimental view of the knowledge concerning the genetics and genomics of the laboratory mouse. From genotype to phenotype, this information resource integrates information about genes, sequences, maps, expression analyses, alleles, strains and mutant phenotypes. Comparative mammalian data are also presented particularly in regards to the use of the mouse as a model for the investigation of molecular and genetic components of human diseases. These data are collected from literature curation as well as downloads of large datasets (SwissProt, LocusLink, etc.). MGI is one of the founding members of the Gene Ontology (GO) and uses the GO for functional annotation of genes. Here, we discuss the workflow associated with manual GO annotation at MGI, from literature collection to display of the annotations. Peer-reviewed literature is collected mostly from a set of journals available electronically. Selected articles are entered into a master bibliography and indexed to one of eight areas of interest such as 'GO' or 'homology' or 'phenotype'. Each article is then either indexed to a gene already contained in the database or funneled through a separate nomenclature database to add genes. The master bibliography and associated indexing provide information for various curator-reports such as 'papers selected for GO that refer to genes with NO GO annotation'. Once indexed, curators who have expertise in appropriate disciplines enter pertinent information. MGI makes use of several controlled vocabularies that ensure uniform data encoding, enable robust analysis and support the construction of complex queries. These vocabularies range from pick-lists to structured vocabularies such as the GO. All data associations are supported with statements of evidence as well as access to source publications.

  13. Teaching information literacy skills to sophomore-level biology majors.

    PubMed

    Thompson, Leigh; Blankinship, Lisa Ann

    2015-05-01

    Many undergraduate students lack a sound understanding of information literacy. The skills that comprise information literacy are particularly important when combined with scientific writing for biology majors as they are the foundation skills necessary to complete upper-division biology course assignments, better train students for research projects, and prepare students for graduate and professional education. To help undergraduate biology students develop and practice information literacy and scientific writing skills, a series of three one-hour hands-on library sessions, discussions, and homework assignments were developed for Biological Literature, a one-credit, one-hour-per-week, required sophomore-level course. The embedded course librarian developed a learning exercise that reviewed how to conduct database and web searches, the difference between primary and secondary sources, source credibility, and how to access articles through the university's databases. Students used the skills gained in the library training sessions for later writing assignments including a formal lab report and annotated bibliography. By focusing on improving information literacy skills as well as providing practice in scientific writing, Biological Literature students are better able to meet the rigors of upper-division biology courses and communicate research findings in a more professional manner.

  14. Teaching Information Literacy Skills to Sophomore-Level Biology Majors

    PubMed Central

    Thompson, Leigh; Blankinship, Lisa Ann

    2015-01-01

    Many undergraduate students lack a sound understanding of information literacy. The skills that comprise information literacy are particularly important when combined with scientific writing for biology majors as they are the foundation skills necessary to complete upper-division biology course assignments, better train students for research projects, and prepare students for graduate and professional education. To help undergraduate biology students develop and practice information literacy and scientific writing skills, a series of three one-hour hands-on library sessions, discussions, and homework assignments were developed for Biological Literature, a one-credit, one-hour-per-week, required sophomore-level course. The embedded course librarian developed a learning exercise that reviewed how to conduct database and web searches, the difference between primary and secondary sources, source credibility, and how to access articles through the university’s databases. Students used the skills gained in the library training sessions for later writing assignments including a formal lab report and annotated bibliography. By focusing on improving information literacy skills as well as providing practice in scientific writing, Biological Literature students are better able to meet the rigors of upper-division biology courses and communicate research findings in a more professional manner. PMID:25949754

  15. The use of animal models in homeopathic research--a review of 2010-2014 PubMed indexed papers.

    PubMed

    Bonamin, Leoni Villano; Cardoso, Thayná Neves; de Carvalho, Aloísio Cunha; Amaral, Juliana Gimenez

    2015-10-01

    In the 1990s, a study was performed on the effects of highly diluted thyroxine on frog metamorphosis. This model represented one of the most discussed examples of the biological effects of high dilutions over the next two decades. In 2010, another critical conceptual review of the use of animal models in homeopathy and high-dilution research was published. The main contribution of these studies was the elucidation of the biological features and phenomenology of the effects of high dilutions on living systems, representing an important step forward in our understanding of the mechanisms of action of homeopathic medicines. We performed a further review of this line of investigation using the same methods. Fifty-three articles that were indexed in the PubMed database and used 12 different animal species were systematically evaluated. Only a fraction of the studies (29/53) reported herein were performed with "ultra high" dilutions. The other studies were performed with dilutions in ranges below 10^-23 (14/53 articles) or commercial complexes (10/53 articles). Only two articles reported negative results; both used in vivo protocols to test commercial complexes, one in fish and one in bees. The quality of the employed techniques improved in 2010-2014 compared with the studies that were reviewed previously in 2010, with the inclusion of more ethically refined protocols, including in vitro primary cell cultures and ex vivo studies (10/53 articles), often with three or more replicates and analyses of epigenetic mechanisms that were previously unknown in 2010. In our updated review of the past 5 years, we found further demonstrations of the biological effects of homeopathy using more refined animal models and in vitro techniques. Copyright © 2015 The Faculty of Homeopathy. Published by Elsevier Ltd. All rights reserved.

  16. Low Frequency Variants, Collapsed Based on Biological Knowledge, Uncover Complexity of Population Stratification in 1000 Genomes Project Data

    PubMed Central

    Moore, Carrie B.; Wallace, John R.; Wolfe, Daniel J.; Frase, Alex T.; Pendergrass, Sarah A.; Weiss, Kenneth M.; Ritchie, Marylyn D.

    2013-01-01

    Analyses investigating low frequency variants have the potential for explaining additional genetic heritability of many complex human traits. However, the natural frequencies of rare variation between human populations strongly confound genetic analyses. We have applied a novel collapsing method to identify biological features with low frequency variant burden differences in thirteen populations sequenced by the 1000 Genomes Project. Our flexible collapsing tool utilizes expert biological knowledge from multiple publicly available database sources to direct feature selection. Variants were collapsed according to genetically driven features, such as evolutionary conserved regions, regulatory regions, genes, and pathways. We have conducted an extensive comparison of low frequency variant burden differences (MAF<0.03) between populations from 1000 Genomes Project Phase I data. We found that on average 26.87% of gene bins, 35.47% of intergenic bins, 42.85% of pathway bins, 14.86% of ORegAnno regulatory bins, and 5.97% of evolutionary conserved regions show statistically significant differences in low frequency variant burden across populations from the 1000 Genomes Project. The proportion of bins with significant differences in low frequency burden depends on the ancestral similarity of the two populations compared and types of features tested. Even closely related populations had notable differences in low frequency burden, but fewer differences than populations from different continents. Furthermore, conserved or functionally relevant regions had fewer significant differences in low frequency burden than regions under less evolutionary constraint. This degree of low frequency variant differentiation across diverse populations and feature elements highlights the critical importance of considering population stratification in the new era of DNA sequencing and low frequency variant genomic analyses. PMID:24385916

  17. Complex phenotype of dyskeratosis congenita and mood dysregulation with novel homozygous RTEL1 and TPH1 variants.

    PubMed

    Ungar, Rachel A; Giri, Neelam; Pao, Maryland; Khincha, Payal P; Zhou, Weiyin; Alter, Blanche P; Savage, Sharon A

    2018-06-01

    Dyskeratosis congenita (DC) is an inherited bone marrow failure syndrome caused by germline mutations in telomere biology genes. Patients have extremely short telomeres for their age and a complex phenotype including oral leukoplakia, abnormal skin pigmentation, and dysplastic nails in addition to bone marrow failure, pulmonary fibrosis, stenosis of the esophagus, lacrimal ducts and urethra, developmental anomalies, and high risk of cancer. We evaluated a patient with features of DC, mood dysregulation, diabetes, and lack of pubertal development. Family history was not available but genome-wide genotyping was consistent with consanguinity. Whole exome sequencing identified 82 variants of interest in 80 genes based on the following criteria: homozygous, <0.1% minor allele frequency in public and in-house databases, nonsynonymous, and predicted deleterious by multiple in silico prediction programs. Six genes were identified likely contributory to the clinical presentation. The cause of DC is likely due to homozygous splice site variants in regulator of telomere elongation helicase 1, a known DC and telomere biology gene. A homozygous, missense variant in tryptophan hydroxylase 1 may be clinically important as this gene encodes the rate limiting step in serotonin biosynthesis, a biologic pathway connected with mood disorders. Four additional genes (SCN4A, LRP4, GDAP1L1, and SPTBN5) had rare, missense homozygous variants that we speculate may contribute to portions of the clinical phenotype. This case illustrates the value of conducting detailed clinical and genomic evaluations on rare patients in order to identify new areas of research into the functional consequences of rare variants and their contribution to human disease. © 2018 Wiley Periodicals, Inc.
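
    The four filtering criteria listed in the abstract above (homozygous, <0.1% minor allele frequency, nonsynonymous, predicted deleterious by multiple programs) can be sketched as a simple predicate. This is an illustrative sketch, not the authors' pipeline: the Variant record, its field names, the reading of "multiple" as at least two tools, and the treatment of every non-synonymous consequence label as passing are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    """Hypothetical, simplified variant record used only for this sketch."""
    gene: str
    zygosity: str           # "hom" or "het"
    max_maf: float          # highest minor allele frequency across databases (fraction)
    consequence: str        # e.g. "missense", "splice_site", "synonymous"
    deleterious_calls: int  # number of in silico tools predicting "deleterious"


def passes_filters(v: Variant,
                   maf_cutoff: float = 0.001,
                   min_deleterious_calls: int = 2) -> bool:
    """Apply the four criteria; 0.001 corresponds to the 0.1% MAF threshold."""
    return (
        v.zygosity == "hom"
        and v.max_maf < maf_cutoff
        and v.consequence != "synonymous"   # assumption: anything else counts
        and v.deleterious_calls >= min_deleterious_calls
    )


def variants_of_interest(variants: list[Variant]) -> list[Variant]:
    """Keep only the variants that satisfy every criterion."""
    return [v for v in variants if passes_filters(v)]
```

    In practice the inputs would come from an annotated variant file; the in-memory records here simply keep the sketch self-contained.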

  18. MOPED enables discoveries through consistently processed proteomics data

    PubMed Central

    Higdon, Roger; Stewart, Elizabeth; Stanberry, Larissa; Haynes, Winston; Choiniere, John; Montague, Elizabeth; Anderson, Nathaniel; Yandl, Gregory; Janko, Imre; Broomall, William; Fishilevich, Simon; Lancet, Doron; Kolker, Natali; Kolker, Eugene

    2014-01-01

    The Model Organism Protein Expression Database (MOPED, http://moped.proteinspire.org) is an expanding proteomics resource to enable biological and biomedical discoveries. MOPED aggregates simple, standardized and consistently processed summaries of protein expression and metadata from proteomics (mass spectrometry) experiments from human and model organisms (mouse, worm and yeast). The latest version of MOPED adds new estimates of protein abundance and concentration, as well as relative (differential) expression data. MOPED provides a new updated query interface that allows users to explore information by organism, tissue, localization, condition, experiment, or keyword. MOPED supports the Human Proteome Project’s efforts to generate chromosome- and disease-specific proteomes by providing links from proteins to chromosome and disease information, as well as many complementary resources. MOPED supports a new omics metadata checklist in order to harmonize data integration, analysis and use. MOPED’s development is driven by the user community, which spans 90 countries, guiding future development that will transform MOPED into a multi-omics resource. MOPED encourages users to submit data in a simple format. They can use the metadata checklist to generate a data publication for this submission. As a result, MOPED will provide even greater insights into complex biological processes and systems and enable deeper and more comprehensive biological and biomedical discoveries. PMID:24350770

  19. Human Permanent Ectoparasites; Recent Advances on Biology and Clinical Significance of Demodex Mites: Narrative Review Article.

    PubMed

    Litwin, Dorota; Chen, WenChieh; Dzika, Ewa; Korycińska, Joanna

    2017-01-01

    Demodex is a genus of mites living predominantly in mammalian pilosebaceous units. They are commonly detected in the skin of the face, with increasing numbers in inflammatory lesions. Causation between Demodex mites and inflammatory diseases, such as rosacea, blepharitis, perioral and seborrhoeic dermatitis or chalazion, remains controversial. Clinical observations indicate a primary form of human Demodex infection. The aim of this review was to highlight the biological aspects of Demodex infestation and point out directions for future research. We conducted a broad review based on electronic database sources such as MEDLINE, PubMed and Scopus with regard to the characteristics of the Demodex species, methods of examination and worldwide epidemiology, molecular studies and its role in the complex human ecosystem. Demodex mites are of worldwide importance because, under certain conditions, they can indicate several dermatoses. However, correlations between Demodex and other parasites or microorganisms occupying one host, as well as interactions between these arachnids and their symbiotic bacteria, should be considered. There are a few methods for examining human mites, depending on the purpose of the study. Nevertheless, care is needed, as polymorphism of Demodex species has been reported. Overall, the present review will focus on different aspects of Demodex mites' biology and the significance of these arachnids in human health.

  20. Building an Ontology-driven Database for Clinical Immune Research

    PubMed Central

    Ma, Jingming

    2006-01-01

    Clinical research on immune response usually generates a huge amount of biomedical testing data over a certain period of time. User-friendly data management systems based on relational databases will help immunologists/clinicians to fully manage the data. On the other hand, the same biological assays, such as ELISPOT and flow cytometric assays, are involved in immunological experiments regardless of study purpose. The reuse of biological knowledge is one of the driving forces behind this ontology-driven data management. Therefore, an ontology-driven database will help to handle different clinical immune studies and help immunologists/clinicians easily understand each other's immunological data. We will discuss some outlines for building an ontology-driven data management system for clinical immune research (ODMim). PMID:17238637

  1. Simple Logic for Big Problems: An Inside Look at Relational Databases.

    ERIC Educational Resources Information Center

    Seba, Douglas B.; Smith, Pat

    1982-01-01

    Discusses database design concept termed "normalization" (process replacing associations between data with associations in two-dimensional tabular form) which results in formation of relational databases (they are to computers what dictionaries are to spoken languages). Applications of the database in serials control and complex systems…

  2. cPath: open source software for collecting, storing, and querying biological pathways

    PubMed Central

    Cerami, Ethan G; Bader, Gary D; Gross, Benjamin E; Sander, Chris

    2006-01-01

    Background Biological pathways, including metabolic pathways, protein interaction networks, signal transduction pathways, and gene regulatory networks, are currently represented in over 220 diverse databases. These data are crucial for the study of specific biological processes, including human diseases. Standard exchange formats for pathway information, such as BioPAX, CellML, SBML and PSI-MI, enable convenient collection of this data for biological research, but mechanisms for common storage and communication are required. Results We have developed cPath, an open source database and web application for collecting, storing, and querying biological pathway data. cPath makes it easy to aggregate custom pathway data sets available in standard exchange formats from multiple databases, present pathway data to biologists via a customizable web interface, and export pathway data via a web service to third-party software, such as Cytoscape, for visualization and analysis. cPath is software only, and does not include new pathway information. Key features include: a built-in identifier mapping service for linking identical interactors and linking to external resources; built-in support for PSI-MI and BioPAX standard pathway exchange formats; a web service interface for searching and retrieving pathway data sets; and thorough documentation. The cPath software is freely available under the LGPL open source license for academic and commercial use. Conclusion cPath is a robust, scalable, modular, professional-grade software platform for collecting, storing, and querying biological pathways. It can serve as the core data handling component in information systems for pathway visualization, analysis and modeling. PMID:17101041

  3. Sources of heterogeneity in developmental outcomes of children with past and current experiences of institutionalization in Russia: A four-group comparison.

    PubMed

    Rakhlin, Natalia; Hein, Sascha; Doyle, Niamh; Hart, Lesley; Koposov, Roman; Macomber, Donna; Ruchkin, Vladislav; Strelina, Anastasia; Tan, Mei; Grigorenko, Elena L

    2017-01-01

    The present study sought to compare 4 groups of age- and gender-matched children-(a) those reared in institutions for children without parental care in Russia; (b) those raised by their biological parents in Russia; (c) those adopted to the United States from Russian institutions; and (d) those born in the United States and raised by their biological parents-on indicators of cognition, language, and early learning. In addition, we aimed to compare the effects of the length of time spent in an institution, the age of initial placement in an institution, the age at adoption, and pre-institutional risk factors (i.e., prenatal substance exposure and prematurity and low birth weight) on the above-mentioned outcomes in the 2 groups of children with institutionalization experiences. Our results confirm previous reports demonstrating negative consequences of institutionalization and substantial ameliorating effects of adoption. They also underscore the complexity of the effects of institutionalization and adoption, showing that they are intertwined with the effects of pre-institutional risk factors. (PsycINFO Database Record (c) 2017 APA, all rights reserved).

  4. Diversification of transcription factor-DNA interactions and the evolution of gene regulatory networks.

    PubMed

    Rogers, Julia M; Bulyk, Martha L

    2018-04-25

    Sequence-specific transcription factors (TFs) bind short DNA sequences in the genome to regulate the expression of target genes. In the last decade, numerous technical advances have enabled the determination of the DNA-binding specificities of many of these factors. Large-scale screens of many TFs enabled the creation of databases of TF DNA-binding specificities, typically represented as position weight matrices (PWMs). Although great progress has been made in determining and predicting binding specificities systematically, there are still many surprises to be found when studying a particular TF's interactions with DNA in detail. Paralogous TFs' binding specificities can differ in subtle ways, in a manner that is not immediately apparent from looking at their PWMs. These differences affect gene regulatory outputs and enable TFs to rewire transcriptional networks over evolutionary time. This review discusses recent observations made in the study of TF-DNA interactions that highlight the importance of continued in-depth analysis of TF-DNA interactions and their inherent complexity. This article is categorized under: Biological Mechanisms > Regulatory Biology. © 2018 Wiley Periodicals, Inc.
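
    As a rough illustration of the PWM representation mentioned in this record, the sketch below builds a log-odds matrix from a small count matrix and scans a sequence for the best-scoring site. The counts, the uniform background of 0.25, the pseudocount, and the function names are all hypothetical choices made for this example; they are not taken from the review or from any specific database.

```python
import math

# Hypothetical position frequency (count) matrix for a 4-bp motif.
# The numbers are illustrative only; each column sums to 10 observed sites.
COUNTS = {
    "A": [8, 0, 1, 9],
    "C": [1, 0, 8, 0],
    "G": [0, 10, 0, 1],
    "T": [1, 0, 1, 0],
}


def build_pwm(counts, pseudocount=1.0, background=0.25):
    """Convert per-position base counts into log2-odds scores.

    Assumes a uniform background frequency and a fixed pseudocount,
    both of which are simplifying assumptions for this sketch.
    """
    width = len(counts["A"])
    pwm = []
    for i in range(width):
        total = sum(counts[base][i] for base in "ACGT") + 4 * pseudocount
        column = {
            base: math.log2(((counts[base][i] + pseudocount) / total) / background)
            for base in "ACGT"
        }
        pwm.append(column)
    return pwm


def score(pwm, site):
    """Sum the per-position log-odds scores for a site of matching width."""
    if len(site) != len(pwm):
        raise ValueError("site length must equal PWM width")
    return sum(column[base] for column, base in zip(pwm, site))


def best_hit(pwm, sequence):
    """Return (score, start) of the highest-scoring window in the sequence."""
    width = len(pwm)
    windows = range(len(sequence) - width + 1)
    return max((score(pwm, sequence[i:i + width]), i) for i in windows)


pwm = build_pwm(COUNTS)
top_score, start = best_hit(pwm, "TTAGCATT")
print(f"best window starts at {start} with score {top_score:.2f}")
```

    A single additive score like this treats positions as independent, which is one reason subtle differences between paralogous TFs may not be visible from PWMs alone.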

  5. An affinity-structure database of helix-turn-helix: DNA complexes with a universal coordinate system

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    AlQuraishi, Mohammed; Tang, Shengdong; Xia, Xide

    Molecular interactions between proteins and DNA molecules underlie many cellular processes, including transcriptional regulation, chromosome replication, and nucleosome positioning. Computational analyses of protein-DNA interactions rely on experimental data characterizing known protein-DNA interactions structurally and biochemically. While many databases exist that contain either structural or biochemical data, few integrate these two data sources in a unified fashion. Such integration is becoming increasingly critical with the rapid growth of structural and biochemical data, and the emergence of algorithms that rely on the synthesis of multiple data types to derive computational models of molecular interactions. We have developed an integrated affinity-structure database in which the experimental and quantitative DNA binding affinities of helix-turn-helix proteins are mapped onto the crystal structures of the corresponding protein-DNA complexes. This database provides access to: (i) protein-DNA structures, (ii) quantitative summaries of protein-DNA binding affinities using position weight matrices, and (iii) raw experimental data of protein-DNA binding instances. Critically, this database establishes a correspondence between experimental structural data and quantitative binding affinity data at the single basepair level. Furthermore, we present a novel alignment algorithm that structurally aligns the protein-DNA complexes in the database and creates a unified residue-level coordinate system for comparing the physico-chemical environments at the interface between complexes. Using this unified coordinate system, we compute the statistics of atomic interactions at the protein-DNA interface of helix-turn-helix proteins. We provide an interactive website for visualization, querying, and analyzing this database, and a downloadable version to facilitate programmatic analysis. Lastly, this database will facilitate the analysis of protein-DNA interactions and the development of programmatic computational methods that capitalize on integration of structural and biochemical datasets. The database can be accessed at http://ProteinDNA.hms.harvard.edu.

  6. An affinity-structure database of helix-turn-helix: DNA complexes with a universal coordinate system

    DOE PAGES

    AlQuraishi, Mohammed; Tang, Shengdong; Xia, Xide

    2015-11-19

    Molecular interactions between proteins and DNA molecules underlie many cellular processes, including transcriptional regulation, chromosome replication, and nucleosome positioning. Computational analyses of protein-DNA interactions rely on experimental data characterizing known protein-DNA interactions structurally and biochemically. While many databases exist that contain either structural or biochemical data, few integrate these two data sources in a unified fashion. Such integration is becoming increasingly critical with the rapid growth of structural and biochemical data, and the emergence of algorithms that rely on the synthesis of multiple data types to derive computational models of molecular interactions. We have developed an integrated affinity-structure database in which the experimental and quantitative DNA binding affinities of helix-turn-helix proteins are mapped onto the crystal structures of the corresponding protein-DNA complexes. This database provides access to: (i) protein-DNA structures, (ii) quantitative summaries of protein-DNA binding affinities using position weight matrices, and (iii) raw experimental data of protein-DNA binding instances. Critically, this database establishes a correspondence between experimental structural data and quantitative binding affinity data at the single basepair level. Furthermore, we present a novel alignment algorithm that structurally aligns the protein-DNA complexes in the database and creates a unified residue-level coordinate system for comparing the physico-chemical environments at the interface between complexes. Using this unified coordinate system, we compute the statistics of atomic interactions at the protein-DNA interface of helix-turn-helix proteins. We provide an interactive website for visualization, querying, and analyzing this database, and a downloadable version to facilitate programmatic analysis. Lastly, this database will facilitate the analysis of protein-DNA interactions and the development of programmatic computational methods that capitalize on integration of structural and biochemical datasets. The database can be accessed at http://ProteinDNA.hms.harvard.edu.

  7. CREDO: a structural interactomics database for drug discovery

    PubMed Central

    Schreyer, Adrian M.; Blundell, Tom L.

    2013-01-01

    CREDO is a unique relational database storing all pairwise atomic interactions of inter- as well as intra-molecular contacts between small molecules and macromolecules found in experimentally determined structures from the Protein Data Bank. These interactions are integrated with further chemical and biological data. The database implements useful data structures and algorithms such as cheminformatics routines to create a comprehensive analysis platform for drug discovery. The database can be accessed through a web-based interface, downloads of data sets and web services at http://www-cryst.bioc.cam.ac.uk/credo. Database URL: http://www-cryst.bioc.cam.ac.uk/credo PMID:23868908

  8. Biological Databases for Behavioral Neurobiology

    PubMed Central

    Baker, Erich J.

    2014-01-01

    Databases are, at their core, abstractions of data and their intentionally derived relationships. They serve as a central organizing metaphor and repository, supporting or augmenting nearly all bioinformatics. Behavioral domains provide a unique stage for contemporary databases, as research in this area spans diverse data types, locations, and data relationships. This chapter provides foundational information on the diversity and prevalence of databases, how data structures support the various needs of behavioral neuroscience analysis and interpretation. The focus is on the classes of databases, data curation, and advanced applications in bioinformatics using examples largely drawn from research efforts in behavioral neuroscience. PMID:23195119

  9. Israel Marine Bio-geographic Database (ISRAMAR-BIO)

    NASA Astrophysics Data System (ADS)

    Greengrass, Eyal; Krivenko, Yevgeniya; Ozer, Tal; Ben Yosef, Dafna; Tom, Moshe; Gertman, Isaac

    2015-04-01

    Knowledge of the space/time variations of species is the basis for any ecological investigation. While historical observations containing integral concentrations of biological parameters (chlorophyll, abundance, biomass…) are organized partly in the ISRAMAR Cast Database, the taxon-specific data collected in Israel have not been sufficiently organized. This has been hindered by the lack of standards, variability of methods and complexity of biological data formalization. The ISRAMAR-BIO DB was developed to store various types of historical and future available information related to marine species observations and related metadata. Currently the DB can store biological data acquired by sampling devices such as: van Veen grab, box corer, sampling bottles, nets (plankton, trawls and fish), quadrats, and cameras. The DB's logical unit is information regarding a specimen (taxa name, barcode, image), related attributes (abundance, size, age, contaminants…), habitat description, sampling device and method, time and space of sampling, responsible organization and scientist, source of information (cruise, project and publication). The following standards for specimen and attribute naming were implemented: Taxonomy according to World Register of Marine Species (WoRMS: http://www.marinespecies.org). Habitat description according to Coastal and Marine Ecological Classification Standards (CMECS: http://www.cmecscatalog.org). Parameter name; Unit; Device name; Developmental stage; Institution name; Country name; Marine region according to SeaDataNet Vocabularies (http://www.seadatanet.org/Standards-Software/Common-Vocabularies). This system supports two types of data submission procedures, which support the above stated data structure. The first is a downloadable excel file with drop-down fields based on the ISRAMAR-BIO vocabularies. The file is filled and uploaded online by the data contributor. Alternatively, the same dataset can be assembled by filling online forms and then submitted to the DB. Online access to the ISRAMAR-BIO is available through a taxon search page, where one can get both biological and geographical data regarding a certain taxon. Further development of the online data access is ongoing. It will include an interactive geographical map interface where data may be queried, analyzed and downloaded.

  10. Towards physical principles of biological evolution

    NASA Astrophysics Data System (ADS)

    Katsnelson, Mikhail I.; Wolf, Yuri I.; Koonin, Eugene V.

    2018-03-01

    Biological systems reach organizational complexity that far exceeds the complexity of any known inanimate objects. Biological entities undoubtedly obey the laws of quantum physics and statistical mechanics. However, is modern physics sufficient to adequately describe, model and explain the evolution of biological complexity? Detailed parallels have been drawn between statistical thermodynamics and the population-genetic theory of biological evolution. Based on these parallels, we outline new perspectives on biological innovation and major transitions in evolution, and introduce a biological equivalent of thermodynamic potential that reflects the innovation propensity of an evolving population. Deep analogies have been suggested to also exist between the properties of biological entities and processes, and those of frustrated states in physics, such as glasses. Such systems are characterized by frustration whereby local states with minimal free energy conflict with the global minimum, resulting in ‘emergent phenomena’. We extend such analogies by examining frustration-type phenomena, such as conflicts between different levels of selection, in biological evolution. These frustration effects appear to drive the evolution of biological complexity. We further address evolution in multidimensional fitness landscapes from the point of view of percolation theory and suggest that percolation at a level above the critical threshold dictates the tree-like evolution of complex organisms. Taken together, these multiple connections between fundamental processes in physics and biology imply that construction of a meaningful physical theory of biological evolution might not be a futile effort. However, it is unrealistic to expect that such a theory can be created in one scoop; if it ever comes to being, this can only happen through integration of multiple physical models of evolutionary processes. Furthermore, the existing framework of theoretical physics is unlikely to suffice for adequate modeling of the biological level of complexity, and new developments within physics itself are likely to be required.

  11. Gramene database: navigating plant comparative genomics resources

    USDA-ARS's Scientific Manuscript database

    Gramene (http://www.gramene.org) is an online, open source, curated resource for plant comparative genomics and pathway analysis designed to support researchers working in plant genomics, breeding, evolutionary biology, system biology, and metabolic engineering. It exploits phylogenetic relationship...

  12. Reach for Reference. Science Online

    ERIC Educational Resources Information Center

    Safford, Barbara Ripp

    2004-01-01

    This brief article describes the database, Science Online, from Facts on File. Science is defined broadly in this database to include archeology, computer technology, medicine, inventions, and mathematics, as well as biology, chemistry, earth sciences, and astronomy. Content also is divided into format categories for browsing purposes:…

  13. Genome-wide association as a means to understanding the mammary gland

    USDA-ARS's Scientific Manuscript database

    Next-generation sequencing and related technologies have facilitated the creation of enormous public databases that catalogue genomic variation. These databases have facilitated a variety of approaches to discover new genes that regulate normal biology as well as disease. Genome wide association (...

  14. Relax with CouchDB - Into the non-relational DBMS era of Bioinformatics

    PubMed Central

    Manyam, Ganiraju; Payton, Michelle A.; Roth, Jack A.; Abruzzo, Lynne V.; Coombes, Kevin R.

    2012-01-01

    With the proliferation of high-throughput technologies, genome-level data analysis has become common in molecular biology. Bioinformaticians are developing extensive resources to annotate and mine biological features from high-throughput data. The underlying database management systems for most bioinformatics software are based on a relational model. Modern non-relational databases offer an alternative that has flexibility, scalability, and a non-rigid design schema. Moreover, with an accelerated development pace, non-relational databases like CouchDB can be ideal tools to construct bioinformatics utilities. We describe CouchDB by presenting three new bioinformatics resources: (a) geneSmash, which collates data from bioinformatics resources and provides automated gene-centric annotations, (b) drugBase, a database of drug-target interactions with a web interface powered by geneSmash, and (c) HapMap-CN, which provides a web interface to query copy number variations from three SNP-chip HapMap datasets. In addition to the web sites, all three systems can be accessed programmatically via web services. PMID:22609849

  15. Mining databases for protein aggregation: a review.

    PubMed

    Tsiolaki, Paraskevi L; Nastou, Katerina C; Hamodrakas, Stavros J; Iconomidou, Vassiliki A

    2017-09-01

    Protein aggregation is an active area of research in recent decades, since it is the most common and troubling indication of protein instability. Understanding the mechanisms governing protein aggregation and amyloidogenesis is a key component to the aetiology and pathogenesis of many devastating disorders, including Alzheimer's disease or type 2 diabetes. Protein aggregation data are currently found "scattered" in an increasing number of repositories, since advances in computational biology greatly influence this field of research. This review exploits the various resources of aggregation data and attempts to distinguish and analyze the biological knowledge they contain, by introducing protein-based, fragment-based and disease-based repositories, related to aggregation. In order to gain a broad overview of the available repositories, a novel comprehensive network maps and visualizes the current association between aggregation databases and other important databases and/or tools and discusses the beneficial role of community annotation. The need for unification of aggregation databases in a common platform is also addressed.

  16. The EBI SRS server-new features.

    PubMed

    Zdobnov, Evgeny M; Lopez, Rodrigo; Apweiler, Rolf; Etzold, Thure

    2002-08-01

    Here we report on recent developments at the EBI SRS server (http://srs.ebi.ac.uk). SRS has become an integration system for both data retrieval and sequence analysis applications. The EBI SRS server is a primary gateway to major databases in the field of molecular biology produced and supported at EBI as well as European public access point to the MEDLINE database provided by US National Library of Medicine (NLM). It is a reference server for latest developments in data and application integration. The new additions include: concept of virtual databases, integration of XML databases like the Integrated Resource of Protein Domains and Functional Sites (InterPro), Gene Ontology (GO), MEDLINE, Metabolic pathways, etc., user friendly data representation in 'Nice views', SRSQuickSearch bookmarklets. SRS6 is a licensed product of LION Bioscience AG freely available for academics. The EBI SRS server (http://srs.ebi.ac.uk) is a free central resource for molecular biology data as well as a reference server for the latest developments in data integration.

  17. The Molecular Signatures Database (MSigDB) hallmark gene set collection.

    PubMed

    Liberzon, Arthur; Birger, Chet; Thorvaldsdóttir, Helga; Ghandi, Mahmoud; Mesirov, Jill P; Tamayo, Pablo

    2015-12-23

    The Molecular Signatures Database (MSigDB) is one of the most widely used and comprehensive databases of gene sets for performing gene set enrichment analysis. Since its creation, MSigDB has grown beyond its roots in metabolic disease and cancer to include >10,000 gene sets. These better represent a wider range of biological processes and diseases, but the utility of the database is reduced by increased redundancy across, and heterogeneity within, gene sets. To address this challenge, here we use a combination of automated approaches and expert curation to develop a collection of "hallmark" gene sets as part of MSigDB. Each hallmark in this collection consists of a "refined" gene set, derived from multiple "founder" sets, that conveys a specific biological state or process and displays coherent expression. The hallmarks effectively summarize most of the relevant information of the original founder sets and, by reducing both variation and redundancy, provide more refined and concise inputs for gene set enrichment analysis.
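
    To make the notion of redundancy across gene sets more concrete, the sketch below flags pairs of sets whose membership overlaps heavily, using the Jaccard index. This is only one simple way to quantify overlap and is not the procedure used to derive the hallmark collection; the set names, the toy memberships, and the 0.5 threshold are hypothetical.

```python
from itertools import combinations

# Hypothetical toy gene sets; names and memberships are illustrative only
# and are not taken from MSigDB.
GENE_SETS = {
    "SET_A": {"TP53", "MDM2", "CDKN1A", "BAX"},
    "SET_B": {"TP53", "MDM2", "CDKN1A", "GADD45A"},
    "SET_C": {"HIF1A", "VEGFA", "SLC2A1"},
}


def jaccard(a, b):
    """Jaccard index: size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def redundant_pairs(gene_sets, threshold=0.5):
    """List (name1, name2, overlap) for set pairs at or above the threshold."""
    pairs = []
    for x, y in combinations(sorted(gene_sets), 2):
        overlap = jaccard(gene_sets[x], gene_sets[y])
        if overlap >= threshold:
            pairs.append((x, y, overlap))
    return pairs


for name1, name2, overlap in redundant_pairs(GENE_SETS):
    print(f"{name1} and {name2} overlap with Jaccard index {overlap:.2f}")
```

    Pairs flagged this way would be candidates for merging or refinement; the threshold is a judgment call and would need tuning on real data.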

  18. P43-S Computational Biology Applications Suite for High-Performance Computing (BioHPC.net)

    PubMed Central

    Pillardy, J.

    2007-01-01

    One of the challenges of high-performance computing (HPC) is user accessibility. At the Cornell University Computational Biology Service Unit, which is also a Microsoft HPC institute, we have developed a computational biology application suite that allows researchers from biological laboratories to submit their jobs to the parallel cluster through an easy-to-use Web interface. Through this system, we are providing users with popular bioinformatics tools including BLAST, HMMER, InterproScan, and MrBayes. The system is flexible and can be easily customized to include other software. It is also scalable; the installation on our servers currently processes approximately 8500 job submissions per year, many of them requiring massively parallel computations. It also has a built-in user management system, which can limit software and/or database access to specified users. TAIR, the major database of the plant model organism Arabidopsis, and SGN, the international tomato genome database, are both using our system for storage and data analysis. The system consists of a Web server running the interface (ASP.NET C#), Microsoft SQL server (ADO.NET), compute cluster running Microsoft Windows, ftp server, and file server. Users can interact with their jobs and data via a Web browser, ftp, or e-mail. The interface is accessible at http://cbsuapps.tc.cornell.edu/.

  19. NPCARE: database of natural products and fractional extracts for cancer regulation.

    PubMed

    Choi, Hwanho; Cho, Sun Young; Pak, Ho Jeong; Kim, Youngsoo; Choi, Jung-Yun; Lee, Yoon Jae; Gong, Byung Hee; Kang, Yeon Seok; Han, Taehoon; Choi, Geunbae; Cho, Yeeun; Lee, Soomin; Ryoo, Dekwoo; Park, Hwangseo

    2017-01-01

    Natural products have increasingly attracted much attention as a valuable resource for the development of anticancer medicines due to their structural novelty and good bioavailability. This necessitates a comprehensive database for the natural products and the fractional extracts whose anticancer activities have been verified. NPCARE (http://silver.sejong.ac.kr/npcare) is a publicly accessible online database of natural products and fractional extracts for cancer regulation. At NPCARE, one can explore 6578 natural compounds and 2566 fractional extracts isolated from 1952 distinct biological species including plants, marine organisms, fungi, and bacteria whose anticancer activities were validated with 1107 cell lines for 34 cancer types. Each entry in NPCARE is annotated with the cancer type, genus and species names of the biological resource, the cell line used for demonstrating the anticancer activity, PubChem ID, and a wealth of information about the target gene or protein. Besides the augmentation of plant entries up to 743 genera and 197 families, NPCARE is further enriched with the natural products and the fractional extracts of diverse non-traditional biological resources. NPCARE is anticipated to serve as a dominant gateway for the discovery of new anticancer medicines due to the inclusion of a large number of the fractional extracts as well as the natural compounds isolated from a variety of biological resources.

  20. Sharing and executing linked data queries in a collaborative environment.

    PubMed

    García Godoy, María Jesús; López-Camacho, Esteban; Navas-Delgado, Ismael; Aldana-Montes, José F

    2013-07-01

    Life Sciences have emerged as a key domain in the Linked Data community because of the diversity of data semantics and formats available through a great variety of databases and web technologies. Thus, it has been used as the perfect domain for applications in the web of data. Unfortunately, bioinformaticians are not exploiting the full potential of this already available technology, and experts in Life Sciences have real problems to discover, understand and devise how to take advantage of these interlinked (integrated) data. In this article, we present Bioqueries, a wiki-based portal that is aimed at community building around biological Linked Data. This tool has been designed to aid bioinformaticians in developing SPARQL queries to access biological databases exposed as Linked Data, and also to help biologists gain a deeper insight into the potential use of this technology. This public space offers several services and a collaborative infrastructure to stimulate the consumption of biological Linked Data and, therefore, contribute to implementing the benefits of the web of data in this domain. Bioqueries currently contains 215 query entries grouped by database and theme, 230 registered users and 44 end points that contain biological Resource Description Framework information. The Bioqueries portal is freely accessible at http://bioqueries.uma.es. Supplementary data are available at Bioinformatics online.

  1. Importance of Data Management in a Long-term Biological Monitoring Program

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Christensen, Sigurd W; Brandt, Craig C; McCracken, Kitty

    2011-01-01

    The long-term Biological Monitoring and Abatement Program (BMAP) has always needed to collect and retain high-quality data on which to base its assessments of ecological status of streams and their recovery after remediation. Its formal quality assurance, data processing, and data management components all contribute to this need. The Quality Assurance Program comprehensively addresses requirements from various institutions, funders, and regulators, and includes a data management component. Centralized data management began a few years into the program. An existing relational database was adapted and extended to handle biological data. Data modeling enabled the program's database to process, store, and retrieve its data. The database's main data tables and several key reference tables are described. One of the most important related activities supporting long-term analyses was the establishing of standards for sampling site names, taxonomic identification, flagging, and other components. There are limitations. Some types of program data were not easily accommodated in the central systems, and many possible data-sharing and integration options are not easily accessible to investigators. The implemented relational database supports the transmittal of data to the Oak Ridge Environmental Information System (OREIS) as the permanent repository. From our experience we offer data management advice to other biologically oriented long-term environmental sampling and analysis programs.

  2. Importance of Data Management in a Long-Term Biological Monitoring Program

    NASA Astrophysics Data System (ADS)

    Christensen, Sigurd W.; Brandt, Craig C.; McCracken, Mary K.

    2011-06-01

    The long-term Biological Monitoring and Abatement Program (BMAP) has always needed to collect and retain high-quality data on which to base its assessments of ecological status of streams and their recovery after remediation. Its formal quality assurance, data processing, and data management components all contribute to meeting this need. The Quality Assurance Program comprehensively addresses requirements from various institutions, funders, and regulators, and includes a data management component. Centralized data management began a few years into the program when an existing relational database was adapted and extended to handle biological data. The database's main data tables and several key reference tables are described. One of the most important related activities supporting long-term analyses was the establishing of standards for sampling site names, taxonomic identification, flagging, and other components. The implemented relational database supports the transmittal of data to the Oak Ridge Environmental Information System (OREIS) as the permanent repository. We also discuss some limitations to our implementation. Some types of program data were not easily accommodated in the central systems, and many possible data-sharing and integration options are not easily accessible to investigators. From our experience we offer data management advice to other biologically oriented long-term environmental sampling and analysis programs.

  3. Directed evolution and synthetic biology applications to microbial systems.

    PubMed

    Bassalo, Marcelo C; Liu, Rongming; Gill, Ryan T

    2016-06-01

    Biotechnology applications require engineering complex multi-genic traits. The lack of knowledge on the genetic basis of complex phenotypes restricts our ability to rationally engineer them. However, complex phenotypes can be engineered at the systems level, utilizing directed evolution strategies that drive whole biological systems toward desired phenotypes without requiring prior knowledge of the genetic basis of the targeted trait. Recent developments in the synthetic biology field accelerate the directed evolution cycle, facilitating engineering of increasingly complex traits in biological systems. In this review, we summarize some of the most recent advances in directed evolution and synthetic biology that allow engineering of complex traits in microbial systems. Then, we discuss applications that can be achieved through engineering at the systems level. Copyright © 2016 Elsevier Ltd. All rights reserved.

  4. PASCAL Data Base File Description and Indexing Rules in Chemistry, Biology and Medicine.

    ERIC Educational Resources Information Center

    Gaillardin, R.; And Others

    This report on the multidisciplinary PASCAL database describes the files and the indexing rules for chemistry, biology, and medicine. PASCAL deals with all aspects of chemistry within two subfiles whose combined yearly growth is about 100,000 references. The Biopascal file, organized in the two subfiles of Plant Science and Biology and Medicine,…

  5. Integration of Biological Applications into the Core Undergraduate Curriculum: A Practical Strategy

    ERIC Educational Resources Information Center

    Komives, Claire; Prince, Michael; Fernandez, Erik; Balcarcel, Robert

    2011-01-01

    A web database of solved problems has been created to enable faculty to incorporate biological applications into core courses. Over 20% of US ChE departments utilized problems from the website, and 19 faculty attended a workshop to facilitate teaching the modules. Assessment of student learning showed some gains related to biological outcomes, as…

  6. A high performance, ad-hoc, fuzzy query processing system for relational databases

    NASA Technical Reports Server (NTRS)

    Mansfield, William H., Jr.; Fleischman, Robert M.

    1992-01-01

    Database queries involving imprecise or fuzzy predicates are currently an evolving area of academic and industrial research. Such queries place severe stress on the indexing and I/O subsystems of conventional database environments since they involve the search of large numbers of records. The Datacycle architecture and research prototype is a database environment that uses filtering technology to perform an efficient, exhaustive search of an entire database. It has recently been modified to include fuzzy predicates in its query processing. The approach obviates the need for complex index structures, provides unlimited query throughput, permits the use of ad-hoc fuzzy membership functions, and provides a deterministic response time largely independent of query complexity and load. This paper describes the Datacycle prototype implementation of fuzzy queries and some recent performance results.
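
    The idea of answering a fuzzy query by an exhaustive, index-free scan can be illustrated with a minimal sketch. This is not the Datacycle implementation, which the abstract describes only at a high level. The record layout, the triangular membership function, and the names Record, about and fuzzy_scan are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass(frozen=True)
class Record:
    # Hypothetical record layout, used only for this illustration.
    name: str
    salary: float


def about(target: float, spread: float) -> Callable[[float], float]:
    """Build a triangular membership function: degree 1.0 at `target`,
    falling linearly to 0.0 at `target - spread` and `target + spread`."""
    def membership(x: float) -> float:
        return max(0.0, 1.0 - abs(x - target) / spread)
    return membership


def fuzzy_scan(records: Iterable[Record],
               membership: Callable[[float], float],
               threshold: float) -> List[Tuple[float, Record]]:
    """Visit every record once (no index is consulted) and keep those whose
    membership degree is at least `threshold`, best matches first."""
    hits = []
    for record in records:
        degree = membership(record.salary)
        if degree >= threshold:
            hits.append((degree, record))
    hits.sort(key=lambda pair: pair[0], reverse=True)
    return hits


if __name__ == "__main__":
    table = [Record("a", 50000.0), Record("b", 55000.0), Record("c", 70000.0)]
    for degree, record in fuzzy_scan(table, about(50000.0, 10000.0), 0.5):
        print(record.name, degree)
```

    Because every record is visited regardless of the predicate, the cost of one pass in this sketch depends on the table size rather than on how complicated the membership function is, which is the property the abstract attributes to the filtering approach.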

  7. Challenges of molecular nutrition research 6: the nutritional phenotype database to store, share and evaluate nutritional systems biology studies

    PubMed Central

    Bouwman, Jildau; Dragsted, Lars O.; Drevon, Christian A.; Elliott, Ruan; de Groot, Philip; Kaput, Jim; Mathers, John C.; Müller, Michael; Pepping, Fre; Saito, Jahn; Scalbert, Augustin; Radonjic, Marijana; Rocca-Serra, Philippe; Travis, Anthony; Wopereis, Suzan; Evelo, Chris T.

    2010-01-01

    The challenge of modern nutrition and health research is to identify food-based strategies promoting life-long optimal health and well-being. This research is complex because it exploits a multitude of bioactive compounds acting on an extensive network of interacting processes. Whereas nutrition research can profit enormously from the revolution in ‘omics’ technologies, it has discipline-specific requirements for analytical and bioinformatic procedures. In addition to measurements of the parameters of interest (measures of health), extensive description of the subjects of study and foods or diets consumed is central for describing the nutritional phenotype. We propose and pursue an infrastructural activity of constructing the “Nutritional Phenotype database” (dbNP). When fully developed, dbNP will be a research and collaboration tool and a publicly available data and knowledge repository. Creation and implementation of the dbNP will maximize benefits to the research community by enabling integration and interrogation of data from multiple studies, from different research groups, different countries and different -omics levels. The dbNP is designed to facilitate storage of biologically relevant, pre-processed -omics data, as well as study descriptive and study participant phenotype data. It is also important to enable the combination of this information at different levels (e.g. to facilitate linkage of data describing participant phenotype, genotype and food intake with information on study design and -omics measurements, and to combine all of this with existing knowledge). The biological information stored in the database (i.e. genetics, transcriptomics, proteomics, biomarkers, metabolomics, functional assays, food intake and food composition) is tailored to nutrition research and embedded in an environment of standard procedures and protocols, annotations, modular data-basing, networking and integrated bioinformatics.
The dbNP is an evolving enterprise, which is only sustainable if it is accepted and adopted by the wider nutrition and health research community as an open source, pre-competitive and publicly available resource where many partners both can contribute and profit from its developments. We introduce the Nutrigenomics Organisation (NuGO, http://www.nugo.org) as a membership association responsible for establishing and curating the dbNP. Within NuGO, all efforts related to dbNP (i.e. usage, coordination, integration, facilitation and maintenance) will be directed towards a sustainable and federated infrastructure. PMID:21052526

  8. Integrated Bio-Entity Network: A System for Biological Knowledge Discovery

    PubMed Central

    Bell, Lindsey; Chowdhary, Rajesh; Liu, Jun S.; Niu, Xufeng; Zhang, Jinfeng

    2011-01-01

    A significant part of our biological knowledge is centered on relationships between biological entities (bio-entities) such as proteins, genes, small molecules, pathways, gene ontology (GO) terms and diseases. Accumulated at an increasing speed, the information on bio-entity relationships is archived in different forms at scattered places. Most of such information is buried in scientific literature as unstructured text. Organizing heterogeneous information in a structured form not only facilitates study of biological systems using integrative approaches, but also allows discovery of new knowledge in an automatic and systematic way. In this study, we performed a large scale integration of bio-entity relationship information from both databases containing manually annotated, structured information and automatic information extraction of unstructured text in scientific literature. The relationship information we integrated in this study includes protein–protein interactions, protein/gene regulations, protein–small molecule interactions, protein–GO relationships, protein–pathway relationships, and pathway–disease relationships. The relationship information is organized in a graph data structure, named integrated bio-entity network (IBN), where the vertices are the bio-entities and edges represent their relationships. Under this framework, graph theoretic algorithms can be designed to perform various knowledge discovery tasks. We designed breadth-first search with pruning (BFSP) and most probable path (MPP) algorithms to automatically generate hypotheses—the indirect relationships with high probabilities in the network. We show that IBN can be used to generate plausible hypotheses, which not only help to better understand the complex interactions in biological systems, but also provide guidance for experimental designs. PMID:21738677
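
    A most-probable-path search over a graph of bio-entities can be sketched as follows. This is not the authors' MPP algorithm, whose details the abstract does not give. It is a standard Dijkstra search on negative log probabilities, under the assumption that each edge carries an independent probability so that a path's probability is the product of its edge probabilities. The function name, the graph encoding and the entity names in the usage example are hypothetical.

```python
import heapq
import math
from typing import Dict, List, Optional, Tuple

# Assumed encoding: node -> {neighbor: edge probability in (0, 1]}.
Graph = Dict[str, Dict[str, float]]


def most_probable_path(graph: Graph, source: str,
                       target: str) -> Tuple[Optional[List[str]], float]:
    """Return (path, probability) for the path from source to target whose
    product of edge probabilities is largest, or (None, 0.0) if unreachable.

    Maximizing a product of probabilities equals minimizing the sum of
    -log(p), so Dijkstra's algorithm applies (all costs are non-negative)."""
    best = {source: 0.0}          # lowest -log probability found per node
    previous: Dict[str, str] = {}
    finished = set()
    heap = [(0.0, source)]
    while heap:
        cost, node = heapq.heappop(heap)
        if node in finished:
            continue
        finished.add(node)
        if node == target:
            break
        for neighbor, p in graph.get(node, {}).items():
            if p <= 0.0:
                continue          # a zero-probability edge can never be used
            new_cost = cost - math.log(p)
            if new_cost < best.get(neighbor, math.inf):
                best[neighbor] = new_cost
                previous[neighbor] = node
                heapq.heappush(heap, (new_cost, neighbor))
    if target not in best:
        return None, 0.0
    path = [target]
    while path[-1] != source:
        path.append(previous[path[-1]])
    path.reverse()
    return path, math.exp(-best[target])


if __name__ == "__main__":
    # Hypothetical entities and probabilities, not taken from the IBN data.
    network = {
        "drugX": {"proteinA": 0.9, "proteinB": 0.5},
        "proteinA": {"diseaseY": 0.5},
        "proteinB": {"diseaseY": 0.95},
    }
    print(most_probable_path(network, "drugX", "diseaseY"))
```

    In this toy network the route through proteinB wins (0.5 × 0.95 = 0.475) over the route through proteinA (0.9 × 0.5 = 0.45), even though its first edge is weaker. An indirect relationship found this way would correspond to the kind of hypothesis the abstract describes.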

  9. Detecting uber-operons in prokaryotic genomes.

    PubMed

    Che, Dongsheng; Li, Guojun; Mao, Fenglou; Wu, Hongwei; Xu, Ying

    2006-01-01

    We present a study on computational identification of uber-operons in a prokaryotic genome, each of which represents a group of operons that are evolutionarily or functionally associated through operons in other (reference) genomes. Uber-operons represent a rich set of footprints of operon evolution, whose full utilization could lead to new and more powerful tools for elucidation of biological pathways and networks than what operons have provided, and a better understanding of prokaryotic genome structures and evolution. Our prediction algorithm predicts uber-operons through identifying groups of functionally or transcriptionally related operons, whose gene sets are conserved across the target and multiple reference genomes. Using this algorithm, we have predicted uber-operons for each of a group of 91 genomes, using the other 90 genomes as references. In particular, we predicted 158 uber-operons in Escherichia coli K12 covering 1830 genes, and found that many of the uber-operons correspond to parts of known regulons or biological pathways or are involved in highly related biological processes based on their Gene Ontology (GO) assignments. For some of the predicted uber-operons that are not parts of known regulons or pathways, our analyses indicate that their genes are highly likely to work together in the same biological processes, suggesting the possibility of new regulons and pathways. We believe that our uber-operon prediction provides a highly useful capability and a rich information source for elucidation of complex biological processes, such as pathways in microbes. All the prediction results are available at our Uber-Operon Database: http://csbl.bmb.uga.edu/uber, the first of its kind.

  10. Detecting uber-operons in prokaryotic genomes

    PubMed Central

    Che, Dongsheng; Li, Guojun; Mao, Fenglou; Wu, Hongwei; Xu, Ying

    2006-01-01

    We present a study on computational identification of uber-operons in a prokaryotic genome, each of which represents a group of operons that are evolutionarily or functionally associated through operons in other (reference) genomes. Uber-operons represent a rich set of footprints of operon evolution, whose full utilization could lead to new and more powerful tools for elucidation of biological pathways and networks than what operons have provided, and a better understanding of prokaryotic genome structures and evolution. Our prediction algorithm predicts uber-operons through identifying groups of functionally or transcriptionally related operons, whose gene sets are conserved across the target and multiple reference genomes. Using this algorithm, we have predicted uber-operons for each of a group of 91 genomes, using the other 90 genomes as references. In particular, we predicted 158 uber-operons in Escherichia coli K12 covering 1830 genes, and found that many of the uber-operons correspond to parts of known regulons or biological pathways or are involved in highly related biological processes based on their Gene Ontology (GO) assignments. For some of the predicted uber-operons that are not parts of known regulons or pathways, our analyses indicate that their genes are highly likely to work together in the same biological processes, suggesting the possibility of new regulons and pathways. We believe that our uber-operon prediction provides a highly useful capability and a rich information source for elucidation of complex biological processes, such as pathways in microbes. All the prediction results are available at our Uber-Operon Database: , the first of its kind. PMID:16682449

  11. Particle Disease: A Current Review of the Biological Mechanisms in Periprosthetic Osteolysis After Hip Arthroplasty

    PubMed Central

    Sukur, Erhan; Akman, Yunus Emre; Ozturkmen, Yusuf; Kucukdurmaz, Fatih

    2016-01-01

    Background: Inflammatory responses to wear debris cause osteolysis that leads to aseptic prosthesis loosening and hip arthroplasty failure. Although osteolysis is usually associated with aseptic loosening, it is rarely seen around stable implants. Aseptic implant loosening is a simple radiologic phenomenon, but a complex immunological process. Particulate debris produced by implants most commonly causes osteolysis, and this is called particle-associated periprosthetic osteolysis (PPO). Objective: The objective of this review is to outline the features of particle-associated periprosthetic osteolysis to allow the physician to recognise this condition and commence early treatment, thereby optimizing patient outcome. Methods: A thorough literature search was performed using available databases, including Pubmed, to cover important research published covering particle-associated PPO. Results: Although osteolysis causes bone resorption, clinical, animal, and in vitro studies of particle bioreactivity suggest that particle-associated PPO represents the culmination of several biological reactions of many cell types, rather than being caused solely by the osteoclasts. The biological activity is highly dependent on the characteristics and quantity of the wear particles. Conclusion: Despite advances in total hip arthroplasty (THA), particle-associated PPO and aseptic loosening continue to be major factors that affect prosthetic joint longevity. Biomarkers could be exploited as easy and objective diagnostic and prognostic targets that would enable testing for osteolysis after THA. Further research is needed to identify new biomarkers in PPO. A comprehensive understanding of the underlying biological mechanisms is crucial for developing new therapeutic interventions to reverse or suppress biological responses to wear particles. PMID:27499822

  12. Cancer Risk Assessment for Space Radiation

    NASA Technical Reports Server (NTRS)

    Richmond, Robert C.; Curreri, Peter A. (Technical Monitor)

    2002-01-01

    Predicting the occurrence of human cancer following exposure to any agent causing genetic damage is a difficult task. This is because the uncertainty of uniform exposure to the damaging agent, and the uncertainty of uniform processing of that damage within a complex set of biological variables, degrade the confidence of predicting the delayed expression of cancer as a relatively rare event within any given clinically normal individual. The radiation health research priorities for enabling long-duration human exploration of space were established in the 1996 NRC Report entitled "Radiation Hazards to Crews of Interplanetary Missions: Biological Issues and Research Strategies". This report emphasized that a 15-fold uncertainty in predicting radiation-induced cancer incidence must be reduced before NASA can commit humans to extended interplanetary missions. That report concluded that the great majority of this uncertainty is biologically based, while a minority is physically based due to uncertainties in radiation dosimetry and radiation transport codes. Since that report, the biologically based uncertainty has remained large, and the relatively small uncertainty associated with radiation dosimetry has increased due to the considerations raised by concepts of microdosimetry. In a practical sense, however, the additional uncertainties introduced by microdosimetry are encouraging since they are in a direction of lowered effective dose absorbed through infrequent interactions of any given cell with the high energy particle component of space radiation. The biological uncertainty in predicting cancer risk for space radiation derives from two primary facts. 1) One animal tumor study has been reported that includes a relevant spectrum of particle radiation energies, and that is the Harderian gland model in mice. Fact #1: Extension of cancer risk from animal models, and especially from a single study in an animal model, to humans is inherently uncertain. 
2) One human database is predominantly used for assessing cancer risk caused by space radiation, and that is the Japanese atomic bomb survivors. Fact #2: The atomic-bomb-survivor database, itself a remarkable achievement, contains uncertainties. These include the actual exposure to each individual, the radiation quality of that exposure, and the fact that the exposure was to acute doses of predominantly low-LET radiation, not to chronic exposures of high-LET radiation expected on long-duration interplanetary manned missions.

  13. The European Bioinformatics Institute's data resources: towards systems biology.

    PubMed

    Brooksbank, Catherine; Cameron, Graham; Thornton, Janet

    2005-01-01

    Genomic and post-genomic biological research has provided fine-grain insights into the molecular processes of life, but also threatens to drown biomedical researchers in data. Moreover, as new high-throughput technologies are developed, the types of data that are gathered en masse are diversifying. The need to collect, store and curate all this information in ways that allow its efficient retrieval and exploitation is greater than ever. The European Bioinformatics Institute's (EBI's) databases and tools have evolved to meet the changing needs of molecular biologists: since we last wrote about our services in the 2003 issue of Nucleic Acids Research, we have launched new databases covering protein-protein interactions (IntAct), pathways (Reactome) and small molecules (ChEBI). Our existing core databases have continued to evolve to meet the changing needs of biomedical researchers, and we have developed new data-access tools that help biologists to move intuitively through the different data types, thereby helping them to put the parts together to understand biology at the systems level. The EBI's data resources are all available on our website at http://www.ebi.ac.uk.

  14. The European Bioinformatics Institute's data resources: towards systems biology

    PubMed Central

    Brooksbank, Catherine; Cameron, Graham; Thornton, Janet

    2005-01-01

    Genomic and post-genomic biological research has provided fine-grain insights into the molecular processes of life, but also threatens to drown biomedical researchers in data. Moreover, as new high-throughput technologies are developed, the types of data that are gathered en masse are diversifying. The need to collect, store and curate all this information in ways that allow its efficient retrieval and exploitation is greater than ever. The European Bioinformatics Institute's (EBI's) databases and tools have evolved to meet the changing needs of molecular biologists: since we last wrote about our services in the 2003 issue of Nucleic Acids Research, we have launched new databases covering protein–protein interactions (IntAct), pathways (Reactome) and small molecules (ChEBI). Our existing core databases have continued to evolve to meet the changing needs of biomedical researchers, and we have developed new data-access tools that help biologists to move intuitively through the different data types, thereby helping them to put the parts together to understand biology at the systems level. The EBI's data resources are all available on our website at http://www.ebi.ac.uk. PMID:15608238

  15. Incorporating Information of microRNAs into Pathway Analysis in a Genome-Wide Association Study of Bipolar Disorder

    PubMed Central

    Shih, Wei-Liang; Kao, Chung-Feng; Chuang, Li-Chung; Kuo, Po-Hsiu

    2012-01-01

    MicroRNAs (miRNAs) are known to be important post-transcriptional regulators that are involved in the etiology of complex psychiatric traits. The present study aimed to incorporate miRNAs information into pathway analysis using a genome-wide association dataset to identify relevant biological pathways for bipolar disorder (BPD). We selected psychiatric- and neurological-associated miRNAs (N = 157) from PhenomiR database. The miRNA target genes (miTG) predictions were obtained from microRNA.org. Canonical pathways (N = 4,051) were downloaded from the Molecule Signature Database. We employed a novel weighting scheme for miTGs in pathway analysis using methods of gene set enrichment analysis and sum-statistic. Under four statistical scenarios, 38 significantly enriched pathways (P-value < 0.01 after multiple testing correction) were identified for the risk of developing BPD, including pathways of ion channels associated (e.g., gated channel activity, ion transmembrane transporter activity, and ion channel activity) and nervous related biological processes (e.g., nervous system development, cytoskeleton, and neuroactive ligand receptor interaction). Among them, 19 were identified only when the weighting scheme was applied. Many miRNA-targeted genes were functionally related to ion channels, collagen, and axonal growth and guidance that have been suggested to be associated with BPD previously. Some of these genes are linked to the regulation of miRNA machinery in the literature. Our findings provide support for the potential involvement of miRNAs in the psychopathology of BPD. Further investigations to elucidate the functions and mechanisms of identified candidate pathways are needed. PMID:23264780

  16. Immuno-Navigator, a batch-corrected coexpression database, reveals cell type-specific gene networks in the immune system

    PubMed Central

    Vandenbon, Alexis; Dinh, Viet H.; Mikami, Norihisa; Kitagawa, Yohko; Teraguchi, Shunsuke; Ohkura, Naganari; Sakaguchi, Shimon

    2016-01-01

    High-throughput gene expression data are one of the primary resources for exploring complex intracellular dynamics in modern biology. The integration of large amounts of public data may allow us to examine general dynamical relationships between regulators and target genes. However, obstacles for such analyses are study-specific biases or batch effects in the original data. Here we present Immuno-Navigator, a batch-corrected gene expression and coexpression database for 24 cell types of the mouse immune system. We systematically removed batch effects from the underlying gene expression data and showed that this removal considerably improved the consistency between inferred correlations and prior knowledge. The data revealed widespread cell type-specific correlation of expression. Integrated analysis tools allow users to use this correlation of expression for the generation of hypotheses about biological networks and candidate regulators in specific cell types. We show several applications of Immuno-Navigator as examples. In one application we successfully predicted known regulators of importance in naturally occurring Treg cells from their expression correlation with a set of Treg-specific genes. For one high-scoring gene, integrin β8 (Itgb8), we confirmed an association between Itgb8 expression in forkhead box P3 (Foxp3)-positive T cells and Treg-specific epigenetic remodeling. Our results also suggest that the regulation of Treg-specific genes within Treg cells is relatively independent of Foxp3 expression, supporting recent results pointing to a Foxp3-independent component in the development of Treg cells. PMID:27078110

  17. Amalgamation of East Eurasia Since Late Paleozoic: Constraints from the Apparent Polar Wander Paths of the Major China Blocks

    NASA Astrophysics Data System (ADS)

    Wu, L.; Kravchinsky, V. A.; Potter, D. K.

    2014-12-01

    It has been a longstanding challenge in the last few decades to quantitatively reconstruct the paleogeographic evolution of East Eurasia because of its great tectonic complexities. As the core region, the major China cratons including North China Block, South China Block and Tarim Block hold the key clues for the understanding of the amalgamation history, tectonic activities and biological affinity among the component blocks and terranes in East Eurasia. Compared with the major Gondwana and Laurentia plates, however, the apparent polar wander paths of China are not well constrained due to the outdated paleomagnetic database and relatively loose pole selection process. With the recruitment of the new high-fidelity poles published in the last decade, the rejection of the low quality data and the strict implementation of Voo's grading scheme, we build an updated paleomagnetic database for the three blocks from which three types of apparent polar wander paths (APWP) are computed. Version 1 running mean paths are constructed during the pole selection and compared with those from the previous publications. Version 2 running mean and spline paths with different sliding time windows are computed from the thoroughly examined poles to find the optimal paths with the steady trend, reasonable speed for the polar drift and plate rotation. The spline paths are recommended for the plate reconstructions, however, considering the poor data coverage during certain periods. Our new China APWPs, together with the latest European reference path, the geological, geochronological and biological evidence from the studied Asian plates allow us to reevaluate the paleogeographic and tectonic history of East Eurasia.

  18. Population genetics of chronic kidney disease: the evolving story of APOL1.

    PubMed

    Wasser, Walter G; Tzur, Shay; Wolday, Dawit; Adu, Dwomoa; Baumstein, Donald; Rosset, Saharon; Skorecki, Karl

    2012-01-01

    Advances in human genome sequencing and generation of public databases of genomic diversity enable nephrologists to re-examine the genetics of common, complex kidney diseases. Non-diabetic kidney diseases prevalent in African ancestry populations and the allelic variation described in chromosome 22q12.3 is one such illustrative example. Newly available genomic database information enabled research groups to discover common functional DNA sequence risk variants in the APOL1 gene. These variants (termed G1 and G2) evolved to confer protection from a species of trypanosomal infection and thus achieved high prominence in many geographic regions of Africa and have been carried over to African diaspora communities worldwide. Since these discoveries two years ago, new insights have been gained: localization of APOL1 in normal and disease kidney tissues; influence of the APOL1 variants on the histopathology of HIV kidney disease; possible association with kidney transplant durability; onset of kidney failure at a younger age; association with blood lipid concentrations; more precise geographic localization of individuals with these variants to western and southern African ancestry; and the absence of the variants and kidney disease predisposition in Ethiopians. The definition of APOL1 nephropathy also confirms the long-held assumption by many clinicians that kidney disease attributed to hypertension in African populations represents an underlying glomerulopathy. Still awaited is the delineation of the biologic mechanisms of cellular injury related to these variants, to provide biologic proof of the APOL1 association and to provide potential targets for preventive and therapeutic intervention.

  19. Biomine: predicting links between biological entities using network models of heterogeneous databases.

    PubMed

    Eronen, Lauri; Toivonen, Hannu

    2012-06-06

    Biological databases contain large amounts of data concerning the functions and associations of genes and proteins. Integration of data from several such databases into a single repository can aid the discovery of previously unknown connections spanning multiple types of relationships and databases. Biomine is a system that integrates cross-references from several biological databases into a graph model with multiple types of edges, such as protein interactions, gene-disease associations and gene ontology annotations. Edges are weighted based on their type, reliability, and informativeness. We present Biomine and evaluate its performance in link prediction, where the goal is to predict pairs of nodes that will be connected in the future, based on current data. In particular, we formulate protein interaction prediction and disease gene prioritization tasks as instances of link prediction. The predictions are based on a proximity measure computed on the integrated graph. We consider and experiment with several such measures, and perform a parameter optimization procedure where different edge types are weighted to optimize link prediction accuracy. We also propose a novel method for disease-gene prioritization, defined as finding a subset of candidate genes that cluster together in the graph. We experimentally evaluate Biomine by predicting future annotations in the source databases and prioritizing lists of putative disease genes. The experimental results show that Biomine has strong potential for predicting links when a set of selected candidate links is available. The predictions obtained using the entire Biomine dataset are shown to clearly outperform ones obtained using any single source of data alone, when different types of links are suitably weighted. 
In the gene prioritization task, an established reference set of disease-associated genes is useful, but the results show that under favorable conditions, Biomine can also perform well when no such information is available. The Biomine system is a proof of concept. Its current version contains 1.1 million entities and 8.1 million relations between them, with focus on human genetics. Some of its functionalities are available in a public query interface at http://biomine.cs.helsinki.fi, allowing searching for and visualizing connections between given biological entities.

  20. SALAD database: a motif-based database of protein annotations for plant comparative genomics

    PubMed Central

    Mihara, Motohiro; Itoh, Takeshi; Izawa, Takeshi

    2010-01-01

    Proteins often have several motifs with distinct evolutionary histories. Proteins with similar motifs have similar biochemical properties and thus related biological functions. We constructed a unique comparative genomics database termed the SALAD database (http://salad.dna.affrc.go.jp/salad/) from plant-genome-based proteome data sets. We extracted evolutionarily conserved motifs by MEME software from 209 529 protein-sequence annotation groups selected by BLASTP from the proteome data sets of 10 species: rice, sorghum, Arabidopsis thaliana, grape, a lycophyte, a moss, 3 algae, and yeast. Similarity clustering of each protein group was performed by pairwise scoring of the motif patterns of the sequences. The SALAD database provides a user-friendly graphical viewer that displays a motif pattern diagram linked to the resulting bootstrapped dendrogram for each protein group. Amino-acid-sequence-based and nucleotide-sequence-based phylogenetic trees for motif combination alignment, a logo comparison diagram for each clade in the tree, and a Pfam-domain pattern diagram are also available. We also developed a viewer named ‘SALAD on ARRAYs’ to view arbitrary microarray data sets of paralogous genes linked to the same dendrogram in a window. The SALAD database is a powerful tool for comparing protein sequences and can provide valuable hints for biological analysis. PMID:19854933
