In silico analysis of fragile histidine triad involved in regression of carcinoma.
Rasheed, Muhammad Asif; Tariq, Fatima; Afzal, Sara; Mannanv, Shazia
2017-04-01
Hepatocellular carcinoma (HCCa) is a primary malignancy of the liver. Many different proteins are involved in HCCa, including insulin growth factor (IGF) II, signal transducers and activators of transcription (STAT) 3, STAT4, mothers against decapentaplegic homolog 4 (SMAD 4), fragile histidine triad (FHIT) and selective internal radiation therapy (SIRT). The present study is based on a bioinformatics analysis of the FHIT protein, aimed at understanding its proteomic aspects and improving protein-based diagnosis of the disease. Information related to the protein was gathered from several databases, including the National Centre for Biotechnology Information (NCBI) Gene, Protein and Online Mendelian Inheritance in Man (OMIM) databases, the Uniprot database, the String database and the Kyoto Encyclopedia of Genes and Genomes (KEGG) database. Moreover, the structure of the protein and an evaluation of the quality of that structure were obtained from the Easy modeler programme. Hence, this analysis not only gathered information related to the protein in one place, but also analysed the structure and quality of the protein to conclude that the protein has a role in carcinoma.
Navigating through the Jungle of Allergens: Features and Applications of Allergen Databases.
Radauer, Christian
2017-01-01
The increasing number of available data on allergenic proteins demanded the establishment of structured, freely accessible allergen databases. In this review article, features and applications of 6 of the most widely used allergen databases are discussed. The WHO/IUIS Allergen Nomenclature Database is the official resource of allergen designations. Allergome is the most comprehensive collection of data on allergens and allergen sources. AllergenOnline is aimed at providing a peer-reviewed database of allergen sequences for prediction of allergenicity of proteins, such as those planned to be inserted into genetically modified crops. The Structural Database of Allergenic Proteins (SDAP) provides a database of allergen sequences, structures, and epitopes linked to bioinformatics tools for sequence analysis and comparison. The Immune Epitope Database (IEDB) is the largest repository of T-cell, B-cell, and major histocompatibility complex protein epitopes including epitopes of allergens. AllFam classifies allergens into families of evolutionarily related proteins using definitions from the Pfam protein family database. These databases contain mostly overlapping data, but also show differences in terms of their targeted users, the criteria for including allergens, data shown for each allergen, and the availability of bioinformatics tools. © 2017 S. Karger AG, Basel.
Pruitt, Kim D.; Tatusova, Tatiana; Maglott, Donna R.
2005-01-01
The National Center for Biotechnology Information (NCBI) Reference Sequence (RefSeq) database (http://www.ncbi.nlm.nih.gov/RefSeq/) provides a non-redundant collection of sequences representing genomic data, transcripts and proteins. Although the goal is to provide a comprehensive dataset representing the complete sequence information for any given species, the database pragmatically includes sequence data that are currently publicly available in the archival databases. The database incorporates data from over 2400 organisms and includes over one million proteins representing significant taxonomic diversity spanning prokaryotes, eukaryotes and viruses. Nucleotide and protein sequences are explicitly linked, and the sequences are linked to other resources including the NCBI Map Viewer and Gene. Sequences are annotated to include coding regions, conserved domains, variation, references, names, database cross-references, and other features using a combined approach of collaboration and other input from the scientific community, automated annotation, propagation from GenBank and curation by NCBI staff. PMID:15608248
MIPS: a database for protein sequences and complete genomes.
Mewes, H W; Hani, J; Pfeiffer, F; Frishman, D
1998-01-01
The MIPS group [Munich Information Center for Protein Sequences of the German National Center for Environment and Health (GSF)] at the Max-Planck-Institute for Biochemistry, Martinsried near Munich, Germany, is involved in a number of data collection activities, including a comprehensive database of the yeast genome, a database reflecting the progress in sequencing the Arabidopsis thaliana genome, the systematic analysis of other small genomes and the collection of protein sequence data within the framework of the PIR-International Protein Sequence Database (described elsewhere in this volume). Through its WWW server (http://www.mips.biochem.mpg.de) MIPS provides access to a variety of generic databases, including a database of protein families as well as data generated automatically by the systematic application of sequence analysis algorithms. The yeast genome sequence and its related information were also compiled on CD-ROM to provide dynamic interactive access to the 16 chromosomes of the first eukaryotic genome unraveled. PMID:9399795
GRBase, a new gene regulation data base available by anonymous ftp.
Collier, B; Danielsen, M
1994-01-01
The Gene Regulation Database (GRBase) is a compendium of information on the structure and function of proteins involved in the control of gene expression in eukaryotes. These proteins include transcription factors, proteins involved in signal transduction, and receptors. The database can be obtained by FTP in Filemaker Pro, text, and postscript formats. The database will be expanded in the coming year to include reviews on families of proteins involved in gene regulation and to allow online searching. PMID:7937071
The 2015 Nucleic Acids Research Database Issue and molecular biology database collection.
Galperin, Michael Y; Rigden, Daniel J; Fernández-Suárez, Xosé M
2015-01-01
The 2015 Nucleic Acids Research Database Issue contains 172 papers that include descriptions of 56 new molecular biology databases, and updates on 115 databases whose descriptions have been previously published in NAR or other journals. Following the classification that has been introduced last year in order to simplify navigation of the entire issue, these articles are divided into eight subject categories. This year's highlights include RNAcentral, an international community portal to various databases on noncoding RNA; ValidatorDB, a validation database for protein structures and their ligands; SASBDB, a primary repository for small-angle scattering data of various macromolecular complexes; MoonProt, a database of 'moonlighting' proteins, and two new databases of protein-protein and other macromolecular complexes, ComPPI and the Complex Portal. This issue also includes an unusually high number of cancer-related databases and other databases dedicated to genomic basics of disease and potential drugs and drug targets. The size of NAR online Molecular Biology Database Collection, http://www.oxfordjournals.org/nar/database/a/, remained approximately the same, following the addition of 74 new resources and removal of 77 obsolete web sites. The entire Database Issue is freely available online on the Nucleic Acids Research web site (http://nar.oxfordjournals.org/). Published by Oxford University Press on behalf of Nucleic Acids Research 2014. This work is written by (a) US Government employee(s) and is in the public domain in the US.
Protein Information Resource: a community resource for expert annotation of protein data
Barker, Winona C.; Garavelli, John S.; Hou, Zhenglin; Huang, Hongzhan; Ledley, Robert S.; McGarvey, Peter B.; Mewes, Hans-Werner; Orcutt, Bruce C.; Pfeiffer, Friedhelm; Tsugita, Akira; Vinayaka, C. R.; Xiao, Chunlin; Yeh, Lai-Su L.; Wu, Cathy
2001-01-01
The Protein Information Resource, in collaboration with the Munich Information Center for Protein Sequences (MIPS) and the Japan International Protein Information Database (JIPID), produces the most comprehensive and expertly annotated protein sequence database in the public domain, the PIR-International Protein Sequence Database. To provide timely and high quality annotation and promote database interoperability, the PIR-International employs rule-based and classification-driven procedures based on controlled vocabulary and standard nomenclature and includes status tags to distinguish experimentally determined from predicted protein features. The database contains about 200 000 non-redundant protein sequences, which are classified into families and superfamilies and their domains and motifs identified. Entries are extensively cross-referenced to other sequence, classification, genome, structure and activity databases. The PIR web site features search engines that use sequence similarity and database annotation to facilitate the analysis and functional identification of proteins. The PIR-International databases and search tools are accessible on the PIR web site at http://pir.georgetown.edu/ and at the MIPS web site at http://www.mips.biochem.mpg.de. The PIR-International Protein Sequence Database and other files are also available by FTP. PMID:11125041
Databases and Associated Tools for Glycomics and Glycoproteomics.
Lisacek, Frederique; Mariethoz, Julien; Alocci, Davide; Rudd, Pauline M; Abrahams, Jodie L; Campbell, Matthew P; Packer, Nicolle H; Ståhle, Jonas; Widmalm, Göran; Mullen, Elaine; Adamczyk, Barbara; Rojas-Macias, Miguel A; Jin, Chunsheng; Karlsson, Niclas G
2017-01-01
The access to biodatabases for glycomics and glycoproteomics has proven to be essential for current glycobiological research. This chapter presents available databases that are devoted to different aspects of glycobioinformatics. This includes oligosaccharide sequence databases, experimental databases, 3D structure databases (of both glycans and glycorelated proteins) and association of glycans with tissue, disease, and proteins. Specific search protocols are also provided using tools associated with experimental databases for converting primary glycoanalytical data to glycan structural information. In particular, researchers using glycoanalysis methods by U/HPLC (GlycoBase), MS (GlycoWorkbench, UniCarb-DB, GlycoDigest), and NMR (CASPER) will benefit from this chapter. In addition we also include information on how to utilize glycan structural information to query databases that associate glycans with proteins (UniCarbKB) and with interactions with pathogens (SugarBind).
Sys-BodyFluid: a systematical database for human body fluid proteome research
Li, Su-Jun; Peng, Mao; Li, Hong; Liu, Bo-Shu; Wang, Chuan; Wu, Jia-Rui; Li, Yi-Xue; Zeng, Rong
2009-01-01
Recently, body fluids have widely become an important target for proteomic research and proteomic study has produced more and more body fluid related protein data. A database is needed to collect and analyze these proteome data. Thus, we developed this web-based body fluid proteome database Sys-BodyFluid. It contains eleven kinds of body fluid proteomes, including plasma/serum, urine, cerebrospinal fluid, saliva, bronchoalveolar lavage fluid, synovial fluid, nipple aspirate fluid, tear fluid, seminal fluid, human milk and amniotic fluid. Over 10 000 proteins are presented in the Sys-BodyFluid. Sys-BodyFluid provides the detailed protein annotations, including protein description, Gene Ontology, domain information, protein sequence and involved pathways. These proteome data can be retrieved by using protein name, protein accession number and sequence similarity. In addition, users can query between these different body fluids to get the different proteins identification information. Sys-BodyFluid database can facilitate the body fluid proteomics and disease proteomics research as a reference database. It is available at http://www.biosino.org/bodyfluid/. PMID:18978022
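The cross-fluid query mentioned in this abstract amounts to set operations over protein identifiers. A minimal sketch, assuming each body fluid proteome is available as a set of accession numbers; the `compare_fluids` helper and the placeholder accessions are illustrative only and are not data or an interface taken from Sys-BodyFluid:

```python
def compare_fluids(proteomes, fluid_a, fluid_b):
    """Return the proteins shared by two fluids and those unique to each."""
    a, b = proteomes[fluid_a], proteomes[fluid_b]
    return {
        "shared": a & b,            # found in both fluids
        f"only_{fluid_a}": a - b,   # found in fluid_a but not fluid_b
        f"only_{fluid_b}": b - a,   # found in fluid_b but not fluid_a
    }


# Placeholder accession sets (hypothetical, not real identifications).
proteomes = {
    "plasma": {"ACC001", "ACC002", "ACC003"},
    "urine": {"ACC001", "ACC004"},
}

result = compare_fluids(proteomes, "plasma", "urine")
```

Keeping each proteome as a set makes the comparison independent of the order in which proteins were reported.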
Crosara, Karla Tonelli Bicalho; Moffa, Eduardo Buozi; Xiao, Yizhi; Siqueira, Walter Luiz
2018-01-16
Protein-protein interaction is a common physiological mechanism for protection and actions of proteins in an organism. The identification and characterization of protein-protein interactions in different organisms is necessary to better understand their physiology and to determine their efficacy. In a previous in vitro study using mass spectrometry, we identified 43 proteins that interact with histatin 1. Six previously documented interactors were confirmed and 37 novel partners were identified. In this tutorial, we aimed to demonstrate the usefulness of the STRING database for studying protein-protein interactions. We used an in-silico approach along with the STRING database (http://string-db.org/) and successfully performed a fast simulation of a novel constructed histatin 1 protein-protein network, including both the previously known and the predicted interactors, along with our newly identified interactors. Our study highlights the advantages and importance of applying bioinformatics tools to merge in-silico tactics with experimental in vitro findings for rapid advancement of our knowledge about protein-protein interactions. Our findings also indicate that bioinformatics tools such as the STRING protein network database can help predict potential interactions between proteins and thus serve as a guide for future steps in our exploration of the Human Interactome. Our study highlights the usefulness of the STRING protein database for studying protein-protein interactions. The STRING database can collect and integrate data about known and predicted protein-protein associations from many organisms, including both direct (physical) and indirect (functional) interactions, in an easy-to-use interface. Copyright © 2017 Elsevier B.V. All rights reserved.
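For readers who prefer scripted access over the web interface, a query of this kind might be sketched as follows. This is a hedged illustration, not the authors' workflow: it assumes STRING's REST interface at `https://string-db.org/api` with its `network` method, the gene symbol `HTN1` for histatin 1, NCBI taxon 9606 for human, and the column names `preferredName_A`, `preferredName_B` and `score` in the TSV output. All of these should be checked against the current STRING API documentation. The sample response text is a placeholder, not real STRING output.

```python
import csv
import io
from urllib.parse import urlencode

STRING_API = "https://string-db.org/api"  # assumed base URL of the REST interface


def build_network_url(identifiers, species, output_format="tsv"):
    """Build a STRING 'network' request URL for the given protein names."""
    # Multiple identifiers are joined with a carriage return, which
    # urlencode percent-encodes as %0D.
    query = urlencode({"identifiers": "\r".join(identifiers), "species": species})
    return f"{STRING_API}/{output_format}/network?{query}"


def parse_network_tsv(text):
    """Turn a TSV network response into (protein_a, protein_b, score) tuples."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return [
        (row["preferredName_A"], row["preferredName_B"], float(row["score"]))
        for row in reader
    ]


# HTN1 and taxon 9606 are assumptions made for this example.
url = build_network_url(["HTN1"], species=9606)

# Placeholder response text (hypothetical partner name and score).
sample = "preferredName_A\tpreferredName_B\tscore\nHTN1\tPARTNER_X\t0.9\n"
edges = parse_network_tsv(sample)
```

The URL builder and the parser are kept separate from any network call so that they can be checked offline; fetching `url` with an HTTP client would supply the text passed to `parse_network_tsv`.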
The Protein Information Resource: an integrated public resource of functional annotation of proteins
Wu, Cathy H.; Huang, Hongzhan; Arminski, Leslie; Castro-Alvear, Jorge; Chen, Yongxing; Hu, Zhang-Zhi; Ledley, Robert S.; Lewis, Kali C.; Mewes, Hans-Werner; Orcutt, Bruce C.; Suzek, Baris E.; Tsugita, Akira; Vinayaka, C. R.; Yeh, Lai-Su L.; Zhang, Jian; Barker, Winona C.
2002-01-01
The Protein Information Resource (PIR) serves as an integrated public resource of functional annotation of protein data to support genomic/proteomic research and scientific discovery. The PIR, in collaboration with the Munich Information Center for Protein Sequences (MIPS) and the Japan International Protein Information Database (JIPID), produces the PIR-International Protein Sequence Database (PSD), the major annotated protein sequence database in the public domain, containing about 250 000 proteins. To improve protein annotation and the coverage of experimentally validated data, a bibliography submission system is developed for scientists to submit, categorize and retrieve literature information. Comprehensive protein information is available from iProClass, which includes family classification at the superfamily, domain and motif levels, structural and functional features of proteins, as well as cross-references to over 40 biological databases. To provide timely and comprehensive protein data with source attribution, we have introduced a non-redundant reference protein database, PIR-NREF. The database consists of about 800 000 proteins collected from PIR-PSD, SWISS-PROT, TrEMBL, GenPept, RefSeq and PDB, with composite protein names and literature data. To promote database interoperability, we provide XML data distribution and open database schema, and adopt common ontologies. The PIR web site (http://pir.georgetown.edu/) features data mining and sequence analysis tools for information retrieval and functional identification of proteins based on both sequence and annotation information. The PIR databases and other files are also available by FTP (ftp://nbrfa.georgetown.edu/pir_databases). PMID:11752247
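The idea of a non-redundant reference set with source attribution can be illustrated with a small sketch. It assumes that redundancy means identical sequence strings, which may differ from the actual PIR-NREF criteria; the `merge_nonredundant` function and the toy records are hypothetical.

```python
from collections import defaultdict


def merge_nonredundant(records):
    """Collapse records with identical sequences into one composite entry.

    records: iterable of (source_db, accession, name, sequence) tuples.
    Returns a dict keyed by sequence; each value keeps every distinct name
    and every (source_db, accession) pair that contributed to the entry.
    """
    merged = defaultdict(lambda: {"names": [], "sources": []})
    for source_db, accession, name, sequence in records:
        entry = merged[sequence.upper()]
        if name not in entry["names"]:
            entry["names"].append(name)  # composite protein names
        entry["sources"].append((source_db, accession))  # source attribution
    return dict(merged)


# Toy records with made-up accessions and sequences.
records = [
    ("PIR-PSD", "X0001", "kinase A", "MKTAYIAK"),
    ("SWISS-PROT", "Y0001", "protein kinase A", "MKTAYIAK"),
    ("TrEMBL", "Z0001", "hypothetical protein", "MSTNPKPQ"),
]

nref = merge_nonredundant(records)
```

Here the two records that share a sequence collapse into one entry carrying both names and both source accessions.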
NPIDB: Nucleic acid-Protein Interaction DataBase.
Kirsanov, Dmitry D; Zanegina, Olga N; Aksianov, Evgeniy A; Spirin, Sergei A; Karyagina, Anna S; Alexeevski, Andrei V
2013-01-01
The Nucleic acid-Protein Interaction DataBase (http://npidb.belozersky.msu.ru/) contains information derived from structures of DNA-protein and RNA-protein complexes extracted from the Protein Data Bank (3846 complexes in October 2012). It provides a web interface and a set of tools for extracting biologically meaningful characteristics of nucleoprotein complexes. The content of the database is updated weekly. The current version of the Nucleic acid-Protein Interaction DataBase is an upgrade of the version published in 2007. The improvements include a new web interface, new tools for calculation of intermolecular interactions, a classification of SCOP families that contains DNA-binding protein domains and data on conserved water molecules on the DNA-protein interface.
MIPS: a database for genomes and protein sequences.
Mewes, H W; Heumann, K; Kaps, A; Mayer, K; Pfeiffer, F; Stocker, S; Frishman, D
1999-01-01
The Munich Information Center for Protein Sequences (MIPS-GSF), Martinsried near Munich, Germany, develops and maintains genome oriented databases. It is commonplace that the amount of sequence data available increases rapidly, but not the capacity of qualified manual annotation at the sequence databases. Therefore, our strategy aims to cope with the data stream by the comprehensive application of analysis tools to sequences of complete genomes, the systematic classification of protein sequences and the active support of sequence analysis and functional genomics projects. This report describes the systematic and up-to-date analysis of genomes (PEDANT), a comprehensive database of the yeast genome (MYGD), a database reflecting the progress in sequencing the Arabidopsis thaliana genome (MATD), the database of assembled, annotated human EST clusters (MEST), and the collection of protein sequence data within the framework of the PIR-International Protein Sequence Database (described elsewhere in this volume). MIPS provides access through its WWW server (http://www.mips.biochem.mpg.de) to a spectrum of generic databases, including the above mentioned as well as a database of protein families (PROTFAM), the MITOP database, and the all-against-all FASTA database. PMID:9847138
The Protein-DNA Interface database
2010-01-01
The Protein-DNA Interface database (PDIdb) is a repository containing relevant structural information of Protein-DNA complexes solved by X-ray crystallography and available at the Protein Data Bank. The database includes a simple functional classification of the protein-DNA complexes that consists of three hierarchical levels: Class, Type and Subtype. This classification has been defined and manually curated by humans based on the information gathered from several sources that include PDB, PubMed, CATH, SCOP and COPS. The current version of the database contains only structures with resolution of 2.5 Å or higher, accounting for a total of 922 entries. The major aim of this database is to contribute to the understanding of the main rules that underlie the molecular recognition process between DNA and proteins. To this end, the database is focused on each specific atomic interface rather than on the separated binding partners. Therefore, each entry in this database consists of a single and independent protein-DNA interface. We hope that PDIdb will be useful to many researchers working in fields such as the prediction of transcription factor binding sites in DNA, the study of specificity determinants that mediate enzyme recognition events, engineering and design of new DNA binding proteins with distinct binding specificity and affinity, among others. Finally, due to its friendly and easy-to-use web interface, we hope that PDIdb will also serve educational and teaching purposes. PMID:20482798
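The inclusion criterion above can be made concrete with a short sketch. It assumes that "2.5 Å or higher" follows the crystallographic convention in which higher resolution means a numerically smaller value, so the filter keeps structures at 2.5 Å or below. The `Structure` class, the method label and the placeholder identifiers are hypothetical and are not taken from PDIdb.

```python
from dataclasses import dataclass

MAX_RESOLUTION = 2.5  # in Å; smaller values mean higher resolution


@dataclass(frozen=True)
class Structure:
    pdb_id: str
    method: str
    resolution: float  # in Å


def passes_filter(structure):
    """Keep X-ray structures solved at 2.5 Å resolution or better."""
    return (
        structure.method == "X-RAY DIFFRACTION"
        and structure.resolution <= MAX_RESOLUTION
    )


# Placeholder entries, not real PDB records.
candidates = [
    Structure("XXX1", "X-RAY DIFFRACTION", 1.9),
    Structure("XXX2", "X-RAY DIFFRACTION", 3.1),
    Structure("XXX3", "SOLUTION NMR", 0.0),
]

selected = [s.pdb_id for s in candidates if passes_filter(s)]
```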
The Protein-DNA Interface database.
Norambuena, Tomás; Melo, Francisco
2010-05-18
The Protein-DNA Interface database (PDIdb) is a repository containing relevant structural information of Protein-DNA complexes solved by X-ray crystallography and available at the Protein Data Bank. The database includes a simple functional classification of the protein-DNA complexes that consists of three hierarchical levels: Class, Type and Subtype. This classification has been defined and manually curated by humans based on the information gathered from several sources that include PDB, PubMed, CATH, SCOP and COPS. The current version of the database contains only structures with resolution of 2.5 Å or higher, accounting for a total of 922 entries. The major aim of this database is to contribute to the understanding of the main rules that underlie the molecular recognition process between DNA and proteins. To this end, the database is focused on each specific atomic interface rather than on the separated binding partners. Therefore, each entry in this database consists of a single and independent protein-DNA interface. We hope that PDIdb will be useful to many researchers working in fields such as the prediction of transcription factor binding sites in DNA, the study of specificity determinants that mediate enzyme recognition events, engineering and design of new DNA binding proteins with distinct binding specificity and affinity, among others. Finally, due to its friendly and easy-to-use web interface, we hope that PDIdb will also serve educational and teaching purposes.
ATtRACT-a database of RNA-binding proteins and associated motifs.
Giudice, Girolamo; Sánchez-Cabo, Fátima; Torroja, Carlos; Lara-Pezzi, Enrique
2016-01-01
RNA-binding proteins (RBPs) play a crucial role in key cellular processes, including RNA transport, splicing, polyadenylation and stability. Understanding the interaction between RBPs and RNA is key to improving our knowledge of RNA processing, localization and regulation in a global manner. Despite advances in recent years, a unified non-redundant resource that includes information on experimentally validated motifs, RBPs and integrated tools to exploit this information is lacking. Here, we developed a database named ATtRACT (available at http://attract.cnic.es) that compiles information on 370 RBPs and 1583 RBP consensus binding motifs, 192 of which are not present in any other database. To populate ATtRACT we (i) extracted and hand-curated experimentally validated data from the CISBP-RNA, SpliceAid-F and RBPDB databases, (ii) integrated and updated the unavailable ASD database and (iii) extracted information from Protein-RNA complexes present in the Protein Data Bank database through computational analyses. ATtRACT also provides efficient algorithms to search for a specific motif and to scan one or more RNA sequences at a time. It also allows discovering de novo motifs enriched in a set of related sequences and comparing them with the motifs included in the database. Database URL: http://attract.cnic.es. © The Author(s) 2016. Published by Oxford University Press.
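Scanning an RNA sequence for a consensus motif can be sketched with a regular expression. This is a generic illustration and not ATtRACT's own algorithm, which the abstract does not describe. It assumes motifs are written with IUPAC nucleotide codes over the RNA alphabet; the example motif and sequence are arbitrary.

```python
import re

# IUPAC nucleotide codes expanded over the RNA alphabet (A, C, G, U).
IUPAC_RNA = {
    "A": "A", "C": "C", "G": "G", "U": "U",
    "R": "[AG]", "Y": "[CU]", "S": "[CG]", "W": "[AU]",
    "K": "[GU]", "M": "[AC]", "B": "[CGU]", "D": "[AGU]",
    "H": "[ACU]", "V": "[ACG]", "N": "[ACGU]",
}


def motif_to_regex(motif):
    """Compile a consensus motif into a regex that also finds overlapping hits."""
    body = "".join(IUPAC_RNA[base] for base in motif.upper())
    # A lookahead with a capture group matches at every start position,
    # so overlapping occurrences are not skipped.
    return re.compile(f"(?=({body}))")


def scan(sequence, motif):
    """Return (start, matched_text) for every occurrence, 0-based positions."""
    pattern = motif_to_regex(motif)
    return [(m.start(), m.group(1)) for m in pattern.finditer(sequence.upper())]


hits = scan("GGUGCAUGUGCAUGA", "UGCAUG")
degenerate_hits = scan("AUUUA", "WUUUA")
```

Scanning several sequences is then a matter of calling `scan` once per sequence.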
The Histone Database: an integrated resource for histones and histone fold-containing proteins
Mariño-Ramírez, Leonardo; Levine, Kevin M.; Morales, Mario; Zhang, Suiyuan; Moreland, R. Travis; Baxevanis, Andreas D.; Landsman, David
2011-01-01
Eukaryotic chromatin is composed of DNA and protein components—core histones—that act to compactly pack the DNA into nucleosomes, the fundamental building blocks of chromatin. These nucleosomes are connected to adjacent nucleosomes by linker histones. Nucleosomes are highly dynamic and, through various core histone post-translational modifications and incorporation of diverse histone variants, can serve as epigenetic marks to control processes such as gene expression and recombination. The Histone Sequence Database is a curated collection of sequences and structures of histones and non-histone proteins containing histone folds, assembled from major public databases. Here, we report a substantial increase in the number of sequences and taxonomic coverage for histone and histone fold-containing proteins available in the database. Additionally, the database now contains an expanded dataset that includes archaeal histone sequences. The database also provides comprehensive multiple sequence alignments for each of the four core histones (H2A, H2B, H3 and H4), the linker histones (H1/H5) and the archaeal histones. The database also includes current information on solved histone fold-containing structures. The Histone Sequence Database is an inclusive resource for the analysis of chromatin structure and function focused on histones and histone fold-containing proteins. Database URL: The Histone Sequence Database is freely available and can be accessed at http://research.nhgri.nih.gov/histones/. PMID:22025671
The COG database: new developments in phylogenetic classification of proteins from complete genomes
Tatusov, Roman L.; Natale, Darren A.; Garkavtsev, Igor V.; Tatusova, Tatiana A.; Shankavaram, Uma T.; Rao, Bachoti S.; Kiryutin, Boris; Galperin, Michael Y.; Fedorova, Natalie D.; Koonin, Eugene V.
2001-01-01
The database of Clusters of Orthologous Groups of proteins (COGs), which represents an attempt at a phylogenetic classification of the proteins encoded in complete genomes, currently consists of 2791 COGs including 45 350 proteins from 30 genomes of bacteria, archaea and the yeast Saccharomyces cerevisiae (http://www.ncbi.nlm.nih.gov/COG). In addition, a supplement to the COGs is available, in which proteins encoded in the genomes of two multicellular eukaryotes, the nematode Caenorhabditis elegans and the fruit fly Drosophila melanogaster, and shared with bacteria and/or archaea were included. The new features added to the COG database include information pages with structural and functional details on each COG and literature references, improvements of the COGNITOR program that is used to fit new proteins into the COGs, and classification of genomes and COGs constructed by using principal component analysis. PMID:11125040
Fernández-Suárez, Xosé M; Rigden, Daniel J; Galperin, Michael Y
2014-01-01
The 2014 Nucleic Acids Research Database Issue includes descriptions of 58 new molecular biology databases and recent updates to 123 databases previously featured in NAR or other journals. For convenience, the issue is now divided into eight sections that reflect major subject categories. Among the highlights of this issue are six databases of the transcription factor binding sites in various organisms and updates on such popular databases as CAZy, Database of Genomic Variants (DGV), dbGaP, DrugBank, KEGG, miRBase, Pfam, Reactome, SEED, TCDB and UniProt. There is a strong block of structural databases, which includes, among others, the new RNA Bricks database, updates on PDBe, PDBsum, ArchDB, Gene3D, ModBase, Nucleic Acid Database and the recently revived iPfam database. An update on the NCBI's MMDB describes VAST+, an improved tool for protein structure comparison. Two articles highlight the development of the Structural Classification of Proteins (SCOP) database: one describes SCOPe, which automates assignment of new structures to the existing SCOP hierarchy; the other one describes the first version of SCOP2, with its more flexible approach to classifying protein structures. This issue also includes a collection of articles on bacterial taxonomy and metagenomics, which includes updates on the List of Prokaryotic Names with Standing in Nomenclature (LPSN), Ribosomal Database Project (RDP), the Silva/LTP project and several new metagenomics resources. The NAR online Molecular Biology Database Collection, http://www.oxfordjournals.org/nar/database/c/, has been expanded to 1552 databases. The entire Database Issue is freely available online on the Nucleic Acids Research website (http://nar.oxfordjournals.org/).
FARE-CAFE: a database of functional and regulatory elements of cancer-associated fusion events.
Korla, Praveen Kumar; Cheng, Jack; Huang, Chien-Hung; Tsai, Jeffrey J P; Liu, Yu-Hsuan; Kurubanjerdjit, Nilubon; Hsieh, Wen-Tsong; Chen, Huey-Yi; Ng, Ka-Lok
2015-01-01
Chromosomal translocation (CT) is of enormous clinical interest because this disorder is associated with various major solid tumors and leukemia. A tumor-specific fusion gene event may occur when a translocation joins two separate genes. Currently, various CT databases provide information about fusion genes and their genomic elements. However, no database of the roles of fusion genes, in terms of essential functional and regulatory elements in oncogenesis, is available. FARE-CAFE is a unique combination of CTs, fusion proteins, protein domains, domain-domain interactions, protein-protein interactions, transcription factors and microRNAs, with subsequent experimental information, which cannot be found in any other CT database. Genomic DNA information including, for example, manually collected exact locations of the first and second break points, sequences and karyotypes of fusion genes are included. FARE-CAFE will substantially facilitate the cancer biologist's mission of elucidating the pathogenesis of various types of cancer. This database will ultimately help to develop 'novel' therapeutic approaches. Database URL: http://ppi.bioinfo.asia.edu.tw/FARE-CAFE. © The Author(s) 2015. Published by Oxford University Press.
Thomas, Paul D; Kejariwal, Anish; Campbell, Michael J; Mi, Huaiyu; Diemer, Karen; Guo, Nan; Ladunga, Istvan; Ulitsky-Lazareva, Betty; Muruganujan, Anushya; Rabkin, Steven; Vandergriff, Jody A; Doremieux, Olivier
2003-01-01
The PANTHER database was designed for high-throughput analysis of protein sequences. One of the key features is a simplified ontology of protein function, which allows browsing of the database by biological functions. Biologist curators have associated the ontology terms with groups of protein sequences rather than individual sequences. Statistical models (Hidden Markov Models, or HMMs) are built from each of these groups. The advantage of this approach is that new sequences can be automatically classified as they become available. To ensure accurate functional classification, HMMs are constructed not only for families, but also for functionally distinct subfamilies. Multiple sequence alignments and phylogenetic trees, including curator-assigned information, are available for each family. The current version of the PANTHER database includes training sequences from all organisms in the GenBank non-redundant protein database, and the HMMs have been used to classify gene products across the entire genomes of human and Drosophila melanogaster. The ontology terms and protein families and subfamilies, as well as Drosophila gene classifications, can be browsed and searched for free. Due to outstanding contractual obligations, access to human gene classifications and to protein family trees and multiple sequence alignments will temporarily require a nominal registration fee. PANTHER is publicly available on the web at http://panther.celera.com.
Chang, Yi-Chien; Hu, Zhenjun; Rachlin, John; Anton, Brian P; Kasif, Simon; Roberts, Richard J; Steffen, Martin
2016-01-04
The COMBREX database (COMBREX-DB; combrex.bu.edu) is an online repository of information related to (i) experimentally determined protein function, (ii) predicted protein function, (iii) relationships among proteins of unknown function and various types of experimental data, including molecular function, protein structure, and associated phenotypes. The database was created as part of the novel COMBREX (COMputational BRidges to EXperiments) effort aimed at accelerating the rate of gene function validation. It currently holds information on ∼ 3.3 million known and predicted proteins from over 1000 completely sequenced bacterial and archaeal genomes. The database also contains a prototype recommendation system for helping users identify those proteins whose experimental determination of function would be most informative for predicting function for other proteins within protein families. The emphasis on documenting experimental evidence for function predictions, and the prioritization of uncharacterized proteins for experimental testing distinguish COMBREX from other publicly available microbial genomics resources. This article describes updates to COMBREX-DB since an initial description in the 2011 NAR Database Issue. © The Author(s) 2015. Published by Oxford University Press on behalf of Nucleic Acids Research.
Droit, Arnaud; Hunter, Joanna M; Rouleau, Michèle; Ethier, Chantal; Picard-Cloutier, Aude; Bourgais, David; Poirier, Guy G
2007-01-01
Background In the "post-genome" era, mass spectrometry (MS) has become an important method for the analysis of proteins and the rapid advancement of this technique, in combination with other proteomics methods, results in an increasing amount of proteome data. This data must be archived and analysed using specialized bioinformatics tools. Description We herein describe "PARPs database," a data analysis and management pipeline for liquid chromatography tandem mass spectrometry (LC-MS/MS) proteomics. PARPs database is a web-based tool whose features include experiment annotation, protein database searching, protein sequence management, as well as data-mining of the peptides and proteins identified. Conclusion Using this pipeline, we have successfully identified several interactions of biological significance between PARP-1 and other proteins, namely RFC-1, 2, 3, 4 and 5. PMID:18093328
AlQuraishi, Mohammed; Tang, Shengdong; Xia, Xide
2015-11-19
Molecular interactions between proteins and DNA molecules underlie many cellular processes, including transcriptional regulation, chromosome replication, and nucleosome positioning. Computational analyses of protein-DNA interactions rely on experimental data characterizing known protein-DNA interactions structurally and biochemically. While many databases exist that contain either structural or biochemical data, few integrate these two data sources in a unified fashion. Such integration is becoming increasingly critical with the rapid growth of structural and biochemical data, and the emergence of algorithms that rely on the synthesis of multiple data types to derive computational models of molecular interactions. We have developed an integrated affinity-structure database in which the experimental and quantitative DNA binding affinities of helix-turn-helix proteins are mapped onto the crystal structures of the corresponding protein-DNA complexes. This database provides access to: (i) protein-DNA structures, (ii) quantitative summaries of protein-DNA binding affinities using position weight matrices, and (iii) raw experimental data of protein-DNA binding instances. Critically, this database establishes a correspondence between experimental structural data and quantitative binding affinity data at the single basepair level. Furthermore, we present a novel alignment algorithm that structurally aligns the protein-DNA complexes in the database and creates a unified residue-level coordinate system for comparing the physico-chemical environments at the interface between complexes. Using this unified coordinate system, we compute the statistics of atomic interactions at the protein-DNA interface of helix-turn-helix proteins. We provide an interactive website for visualization, querying, and analyzing this database, and a downloadable version to facilitate programmatic analysis. 
This database will facilitate the analysis of protein-DNA interactions and the development of programmatic computational methods that capitalize on integration of structural and biochemical datasets. The database can be accessed at http://ProteinDNA.hms.harvard.edu.
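The abstract summarizes binding affinities as position weight matrices. As a minimal sketch of what such a matrix is, the Python below builds a log-odds PWM from a set of aligned binding sites and scores a candidate site against it. The example sites, the pseudocount, and the uniform background frequency are assumptions chosen for illustration; they are not taken from the database, and this is not the authors' pipeline.

```python
import math
from collections import Counter


def position_weight_matrix(sites, pseudocount=1.0, background=0.25):
    """Return one {base: log2-odds score} dict per alignment column."""
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all binding sites must have the same length")
    total = len(sites) + 4 * pseudocount
    pwm = []
    for col in range(length):
        counts = Counter(site[col] for site in sites)
        column = {}
        for base in "ACGT":
            # Pseudocounts keep unseen bases from producing log(0).
            freq = (counts[base] + pseudocount) / total
            column[base] = math.log2(freq / background)
        pwm.append(column)
    return pwm


def score(pwm, sequence):
    """Sum the per-column log-odds scores for a sequence of matching length."""
    return sum(column[base] for column, base in zip(pwm, sequence))


# Hypothetical aligned binding sites, for illustration only.
example_sites = ["TGTGA", "TGTGA", "TGAGA", "TGTGC"]
example_pwm = position_weight_matrix(example_sites)
```

A site resembling the consensus (here TGTGA) receives a higher score than an unrelated sequence of the same length, which is the property that lets a PWM stand in as a quantitative summary of binding preference.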
Detection of alternative splice variants at the proteome level in Aspergillus flavus.
Chang, Kung-Yen; Georgianna, D Ryan; Heber, Steffen; Payne, Gary A; Muddiman, David C
2010-03-05
Identification of proteins from proteolytic peptides or intact proteins plays an essential role in proteomics. Researchers use search engines to match the acquired peptide sequences to the target proteins. However, search engines depend on protein databases to provide candidates for consideration. Alternative splicing (AS), the mechanism where the exons of pre-mRNAs can be spliced and rearranged to generate distinct mRNA and therefore protein variants, enables higher eukaryotic organisms, with only a limited number of genes, to have the requisite complexity and diversity at the proteome level. Multiple alternative isoforms from one gene often share common segments of sequences. However, many protein databases only include a limited number of isoforms to keep minimal redundancy. As a result, the database search might not identify a target protein even with high quality tandem MS data and accurate intact precursor ion mass. We computationally predicted an exhaustive list of putative isoforms of Aspergillus flavus proteins from 20 371 expressed sequence tags to investigate whether an alternative splicing protein database can assign a greater proportion of mass spectrometry data. The newly constructed AS database provided 9807 new alternatively spliced variants in addition to 12 832 previously annotated proteins. The searches of the existing tandem MS spectra data set using the AS database identified 29 new proteins encoded by 26 genes. Nine fungal genes appeared to have multiple protein isoforms. In addition to the discovery of splice variants, the AS database also showed potential to improve genome annotation. In summary, the introduction of an alternative splicing database helps identify more proteins and unveils more information about a proteome.
Kuang, Xingyan; Dhroso, Andi; Han, Jing Ginger; Shyu, Chi-Ren; Korkin, Dmitry
2016-01-01
Macromolecular interactions are formed between proteins, DNA and RNA molecules. Being a principal building block in macromolecular assemblies and pathways, the interactions underlie most cellular functions. Malfunctioning of macromolecular interactions is also linked to a number of diseases. Structural knowledge of the macromolecular interaction allows one to understand the interaction’s mechanism, determine its functional implications and characterize the effects of genetic variations, such as single nucleotide polymorphisms, on the interaction. Unfortunately, until now the interactions mediated by different types of macromolecules, e.g. protein–protein interactions or protein–DNA interactions, have been collected into individual and unrelated structural databases. This presents a significant obstacle in the analysis of macromolecular interactions. For instance, the homogeneous structural interaction databases prevent scientists from studying structural interactions of different types but occurring in the same macromolecular complex. Here, we introduce DOMMINO 2.0, a structural Database Of Macro-Molecular INteractiOns. Compared to DOMMINO 1.0, a comprehensive database on protein-protein interactions, DOMMINO 2.0 includes the interactions between all three basic types of macromolecules extracted from PDB files. DOMMINO 2.0 is automatically updated on a weekly basis. It currently includes ∼1 040 000 interactions between two polypeptide subunits (e.g. domains, peptides, termini and interdomain linkers), ∼43 000 RNA-mediated interactions, and ∼12 000 DNA-mediated interactions. All protein structures in the database are annotated using SCOP and SUPERFAMILY family annotation. As a result, protein-mediated interactions involving protein domains, interdomain linkers, C- and N- termini, and peptides are identified. 
Our database provides an intuitive web interface, allowing one to investigate interactions at three different resolution levels: whole subunit network, binary interaction and interaction interface. Database URL: http://dommino.org PMID:26827237
The Protein Disease Database of human body fluids: II. Computer methods and data issues.
Lemkin, P F; Orr, G A; Goldstein, M P; Creed, G J; Myrick, J E; Merril, C R
1995-01-01
The Protein Disease Database (PDD) is a relational database of proteins and diseases. With this database it is possible to screen for quantitative protein abnormalities associated with disease states. These quantitative relationships use data drawn from the peer-reviewed biomedical literature. Assays may also include those observed in high-resolution electrophoretic gels that offer the potential to quantitate many proteins in a single test as well as data gathered by enzymatic or immunologic assays. We are using the Internet World Wide Web (WWW) and the Web browser paradigm as an access method for wide distribution and querying of the Protein Disease Database. The WWW hypertext transfer protocol and its Common Gateway Interface make it possible to build powerful graphical user interfaces that can support easy-to-use data retrieval using query specification forms or images. The details of these interactions are totally transparent to the users of these forms. Using a client-server SQL relational database, user query access, initial data entry and database maintenance are all performed over the Internet with a Web browser. We discuss the underlying design issues, mapping mechanisms and assumptions that we used in constructing the system, data entry, access to the database server, security, and synthesis of derived two-dimensional gel image maps and hypertext documents resulting from SQL database searches.
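To make the idea of screening a relational protein-disease database for quantitative abnormalities concrete, here is a small sketch using Python's built-in sqlite3 module. The table names, column names, the fold-change measure, and the rows are all hypothetical placeholders; the abstract does not describe the PDD schema, so none of this should be read as its actual design.

```python
import sqlite3

# In-memory database with a hypothetical three-table layout.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE protein (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE disease (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE observation (
        protein_id  INTEGER REFERENCES protein(id),
        disease_id  INTEGER REFERENCES disease(id),
        body_fluid  TEXT,
        fold_change REAL  -- level in disease divided by normal level
    );
    """
)

# Placeholder rows, invented purely to exercise the query.
conn.executemany("INSERT INTO protein VALUES (?, ?)",
                 [(1, "protein A"), (2, "protein B")])
conn.execute("INSERT INTO disease VALUES (1, 'disease X')")
conn.executemany("INSERT INTO observation VALUES (?, ?, ?, ?)",
                 [(1, 1, "serum", 3.0), (2, 1, "serum", 1.1)])


def abnormal_proteins(connection, disease_name, min_fold_change):
    """List proteins whose reported level in a disease meets a threshold."""
    cursor = connection.execute(
        """
        SELECT p.name, o.body_fluid, o.fold_change
        FROM observation AS o
        JOIN protein AS p ON p.id = o.protein_id
        JOIN disease AS d ON d.id = o.disease_id
        WHERE d.name = ? AND o.fold_change >= ?
        ORDER BY o.fold_change DESC
        """,
        (disease_name, min_fold_change),
    )
    return cursor.fetchall()
```

Calling abnormal_proteins(conn, "disease X", 2.0) returns only the placeholder protein whose fold change is at least 2.0. A Web form of the kind the abstract describes would collect the disease name and threshold and pass them to such a parameterized query.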
Savidor, Alon; Barzilay, Rotem; Elinger, Dalia; Yarden, Yosef; Lindzen, Moshit; Gabashvili, Alexandra; Adiv Tal, Ophir; Levin, Yishai
2017-06-01
Traditional "bottom-up" proteomic approaches use proteolytic digestion, LC-MS/MS, and database searching to elucidate peptide identities and their parent proteins. Protein sequences absent from the database cannot be identified, and even if present in the database, complete sequence coverage is rarely achieved even for the most abundant proteins in the sample. Thus, sequencing of unknown proteins such as antibodies or constituents of metaproteomes remains a challenging problem. To date, there is no available method for full-length protein sequencing, independent of a reference database, in high throughput. Here, we present Database-independent Protein Sequencing, a method for unambiguous, rapid, database-independent, full-length protein sequencing. The method is a novel combination of non-enzymatic, semi-random cleavage of the protein, LC-MS/MS analysis, peptide de novo sequencing, extraction of peptide tags, and their assembly into a consensus sequence using an algorithm named "Peptide Tag Assembler." As proof-of-concept, the method was applied to samples of three known proteins representing three size classes and to a previously un-sequenced, clinically relevant monoclonal antibody. Excluding leucine/isoleucine and glutamic acid/deamidated glutamine ambiguities, end-to-end full-length de novo sequencing was achieved with 99-100% accuracy for all benchmarking proteins and the antibody light chain. Accuracy of the sequenced antibody heavy chain, including the entire variable region, was also 100%, but there was a 23-residue gap in the constant region sequence. © 2017 by The American Society for Biochemistry and Molecular Biology, Inc.
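The assembly step, in which overlapping peptide tags are merged into a consensus sequence, can be illustrated with a toy greedy overlap merger. This is not the "Peptide Tag Assembler" algorithm, whose details the abstract does not give; it is a simplified stand-in that assumes error-free tags and exact suffix-prefix overlaps, and the example tags are invented.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def assemble(tags, min_len=3):
    """Greedily merge the pair of tags with the largest overlap until none remain."""
    contigs = list(dict.fromkeys(tags))  # drop exact duplicates, keep order
    while True:
        # Discard tags that are fully contained in another contig.
        contigs = [t for t in contigs
                   if not any(t != u and t in u for u in contigs)]
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            return contigs
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for idx, c in enumerate(contigs)
                   if idx not in (best_i, best_j)]
        contigs.append(merged)


# Hypothetical tags cut from an invented 16-residue sequence.
example_tags = ["MKTAYIAK", "YIAKQRQI", "QRQISFVK"]
```

With these three tags, assemble(example_tags) returns the single contig MKTAYIAKQRQISFVK. Real de novo tags contain sequencing errors and ambiguities such as leucine/isoleucine, so a practical assembler would need tolerant matching and consensus voting rather than exact string overlap.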
Rice proteome database: a step toward functional analysis of the rice genome.
Komatsu, Setsuko
2005-09-01
The technique of proteome analysis using two-dimensional polyacrylamide gel electrophoresis (2D-PAGE) has the power to monitor global changes that occur in the protein complement of tissues and subcellular compartments. In this study, the proteins of rice were cataloged, a rice proteome database was constructed, and a functional characterization of some of the identified proteins was undertaken. Proteins extracted from various tissues and subcellular compartments in rice were separated by 2D-PAGE and an image analyzer was used to construct a display of the proteins. The Rice Proteome Database contains 23 reference maps based on 2D-PAGE of proteins from various rice tissues and subcellular compartments. These reference maps comprise 13129 identified proteins, and the amino acid sequences of 5092 proteins are entered in the database. Major proteins involved in growth or stress responses were identified using the proteome approach. Some of these proteins, including a beta-tubulin, calreticulin, and ribulose-1,5-bisphosphate carboxylase/oxygenase activase in rice, have unexpected functions. The information obtained from the Rice Proteome Database will aid in cloning the genes for and predicting the function of unknown proteins.
A Bioinformatics Workflow for Variant Peptide Detection in Shotgun Proteomics
Li, Jing; Su, Zengliu; Ma, Ze-Qiang; Slebos, Robbert J. C.; Halvey, Patrick; Tabb, David L.; Liebler, Daniel C.; Pao, William; Zhang, Bing
2011-01-01
Shotgun proteomics data analysis usually relies on database search. However, commonly used protein sequence databases do not contain information on protein variants and thus prevent variant peptides and proteins from being identified. Including known coding variations into protein sequence databases could help alleviate this problem. Based on our recently published human Cancer Proteome Variation Database, we have created a protein sequence database that comprehensively annotates thousands of cancer-related coding variants collected in the Cancer Proteome Variation Database as well as noncancer-specific ones from the Single Nucleotide Polymorphism Database (dbSNP). Using this database, we then developed a data analysis workflow for variant peptide identification in shotgun proteomics. The high risk of false positive variant identifications was addressed by a modified false discovery rate estimation method. Analysis of colorectal cancer cell lines SW480, RKO, and HCT-116 revealed a total of 81 peptides that contain either noncancer-specific or cancer-related variations. Twenty-three out of 26 variants randomly selected from the 81 were confirmed by genomic sequencing. We further applied the workflow on data sets from three individual colorectal tumor specimens. A total of 204 distinct variant peptides were detected, and five carried known cancer-related mutations. Each individual showed a specific pattern of cancer-related mutations, suggesting potential use of this type of information for personalized medicine. Compatibility of the workflow has been tested with four popular database search engines including Sequest, Mascot, X!Tandem, and MyriMatch. In summary, we have developed a workflow that effectively uses existing genomic data to enable variant peptide detection in proteomics. PMID:21389108
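For readers unfamiliar with false discovery rate estimation in database searching, the sketch below shows the generic target-decoy estimate, in which the number of decoy matches above a score threshold approximates the number of false target matches. The abstract does not describe how the authors modified the estimate for variant peptides, so this is the standard baseline only, and the scores are invented.

```python
def fdr_at_threshold(psms, threshold):
    """Estimate FDR as decoy hits / target hits at or above a score threshold.

    psms is a list of (score, is_decoy) pairs; higher scores are better.
    """
    targets = sum(1 for s, is_decoy in psms if s >= threshold and not is_decoy)
    decoys = sum(1 for s, is_decoy in psms if s >= threshold and is_decoy)
    return decoys / targets if targets else 0.0


def lowest_threshold_for_fdr(psms, max_fdr=0.01):
    """Return the most permissive score threshold whose estimated FDR is acceptable."""
    for threshold in sorted({s for s, _ in psms}):
        if fdr_at_threshold(psms, threshold) <= max_fdr:
            return threshold
    return None


# Invented peptide-spectrum match scores, for illustration only.
example_psms = [(10, False), (9, False), (8, True), (7, False), (6, True)]
```

One plausible reason a modification is needed for variant peptides is that they form a small subset of the search space, so an FDR computed over all matches can understate the error rate within that subset; applying the same estimate to variant matches alone is one common way to address this, though whether the authors did so is not stated here.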
MODBASE, a database of annotated comparative protein structure models
Pieper, Ursula; Eswar, Narayanan; Stuart, Ashley C.; Ilyin, Valentin A.; Sali, Andrej
2002-01-01
MODBASE (http://guitar.rockefeller.edu/modbase) is a relational database of annotated comparative protein structure models for all available protein sequences matched to at least one known protein structure. The models are calculated by MODPIPE, an automated modeling pipeline that relies on PSI-BLAST, IMPALA and MODELLER. MODBASE uses the MySQL relational database management system for flexible and efficient querying, and the MODVIEW Netscape plugin for viewing and manipulating multiple sequences and structures. It is updated regularly to reflect the growth of the protein sequence and structure databases, as well as improvements in the software for calculating the models. For ease of access, MODBASE is organized into different datasets. The largest dataset contains models for domains in 304 517 out of 539 171 unique protein sequences in the complete TrEMBL database (23 March 2001); only models based on significant alignments (PSI-BLAST E-value < 10⁻⁴) and models assessed to have the correct fold are included. Other datasets include models for target selection and structure-based annotation by the New York Structural Genomics Research Consortium, models for prediction of genes in the Drosophila melanogaster genome, models for structure determination of several ribosomal particles and models calculated by the MODWEB comparative modeling web server. PMID:11752309
Kim, Woo-Yeon; Kang, Sungsoo; Kim, Byoung-Chul; Oh, Jeehyun; Cho, Seongwoong; Bhak, Jong; Choi, Jong-Soon
2008-01-01
Cyanobacteria are model organisms for studying photosynthesis, carbon and nitrogen assimilation, evolution of plant plastids, and adaptability to environmental stresses. Despite many studies on cyanobacteria, there is no web-based database of their regulatory and signaling protein-protein interaction networks to date. We report a database and website SynechoNET that provides predicted protein-protein interactions. SynechoNET shows cyanobacterial domain-domain interactions as well as their protein-level interactions using the model cyanobacterium, Synechocystis sp. PCC 6803. It predicts the protein-protein interactions using public interaction databases that contain mutually complementary and redundant data. Furthermore, SynechoNET provides information on transmembrane topology, signal peptide, and domain structure in order to support the analysis of regulatory membrane proteins. Such biological information can be queried and visualized in user-friendly web interfaces that include the interactive network viewer and search pages by keyword and functional category. SynechoNET is an integrated protein-protein interaction database designed to analyze regulatory membrane proteins in cyanobacteria. It provides a platform for biologists to extend the genomic data of cyanobacteria by predicting interaction partners, membrane association, and membrane topology of Synechocystis proteins. SynechoNET is freely available at http://synechocystis.org/ or directly at http://bioportal.kobic.kr/SynechoNET/.
Gorohovski, Alessandro; Tagore, Somnath; Palande, Vikrant; Malka, Assaf; Raviv-Shay, Dorith; Frenkel-Morgenstern, Milana
2017-01-04
Discovery of chimeric RNAs, which are produced by chromosomal translocations as well as the joining of exons from different genes by trans-splicing, has added a new level of complexity to our study and understanding of the transcriptome. The enhanced ChiTaRS-3.1 database (http://chitars.md.biu.ac.il) is designed to make widely accessible a wealth of mined data on chimeric RNAs, with easy-to-use analytical tools built-in. The database comprises 34 922 chimeric transcripts along with 11 714 cancer breakpoints. In this latest version, we have included multiple cross-references to GeneCards, iHop, PubMed, NCBI, Ensembl, OMIM, RefSeq and the Mitelman collection for every entry in the 'Full Collection'. In addition, for every chimera, we have added a predicted Chimeric Protein-Protein Interaction (ChiPPI) network, which allows for easy visualization of protein partners of both parental and fusion proteins for all human chimeras. The database contains a comprehensive annotation for 34 922 chimeric transcripts from eight organisms, and includes the manual annotation of 200 sense-antiSense (SaS) chimeras. The current improvements in the content and functionality to the ChiTaRS database make it a central resource for the study of chimeric transcripts and fusion proteins. © The Author(s) 2016. Published by Oxford University Press on behalf of Nucleic Acids Research.
MoonProt: a database for proteins that are known to moonlight
Mani, Mathew; Chen, Chang; Amblee, Vaishak; Liu, Haipeng; Mathur, Tanu; Zwicke, Grant; Zabad, Shadi; Patel, Bansi; Thakkar, Jagravi; Jeffery, Constance J.
2015-01-01
Moonlighting proteins comprise a class of multifunctional proteins in which a single polypeptide chain performs multiple biochemical functions that are not due to gene fusions, multiple RNA splice variants or pleiotropic effects. The known moonlighting proteins perform a variety of diverse functions in many different cell types and species, and information about their structures and functions is scattered in many publications. We have constructed the manually curated, searchable, internet-based MoonProt Database (http://www.moonlightingproteins.org) with information about the over 200 proteins that have been experimentally verified to be moonlighting proteins. The availability of this organized information provides a more complete picture of what is currently known about moonlighting proteins. The database will also aid researchers in other fields, including determining the functions of genes identified in genome sequencing projects, interpreting data from proteomics projects and annotating protein sequence and structural databases. In addition, information about the structures and functions of moonlighting proteins can be helpful in understanding how novel protein functional sites evolved on an ancient protein scaffold, which can also help in the design of proteins with novel functions. PMID:25324305
ARCPHdb: A comprehensive protein database for SF1 and SF2 helicase from archaea.
Moukhtar, Mirna; Chaar, Wafi; Abdel-Razzak, Ziad; Khalil, Mohamad; Taha, Samir; Chamieh, Hala
2017-01-01
Superfamily 1 and Superfamily 2 helicases, two of the largest helicase protein families, play vital roles in many biological processes including replication, transcription and translation. Study of helicase proteins in the model microorganisms of archaea has largely contributed to the understanding of their function, architecture and assembly. Based on a large phylogenomics approach, we have identified and classified all SF1 and SF2 protein families in ninety-five sequenced archaea genomes. Here we developed an online webserver linked to a specialized protein database named ARCPHdb to provide access for SF1 and SF2 helicase families from archaea. ARCPHdb was implemented using a MySQL relational database. Web interfaces were developed using Netbeans. Data were stored according to UniProt accession numbers, NCBI RefSeq IDs, PDB IDs and Entrez Databases. A user-friendly interactive web interface has been developed to browse, search and download archaeal helicase protein sequences, their available 3D structure models, and related documentation available in the literature provided by ARCPHdb. The database provides direct links to matching external databases. The ARCPHdb is the first online database to compile all protein information on SF1 and SF2 helicase from archaea in one platform. This database provides essential resource information for all researchers interested in the field. Copyright © 2016 Elsevier Ltd. All rights reserved.
PCoM-DB Update: A Protein Co-Migration Database for Photosynthetic Organisms.
Takabayashi, Atsushi; Takabayashi, Saeka; Takahashi, Kaori; Watanabe, Mai; Uchida, Hiroko; Murakami, Akio; Fujita, Tomomichi; Ikeuchi, Masahiko; Tanaka, Ayumi
2017-01-01
The identification of protein complexes is important for the understanding of protein structure and function and the regulation of cellular processes. We used blue-native PAGE and tandem mass spectrometry to identify protein complexes systematically, and built a web database, the protein co-migration database (PCoM-DB, http://pcomdb.lowtem.hokudai.ac.jp/proteins/top), to provide prediction tools for protein complexes. PCoM-DB provides migration profiles for any given protein of interest, and allows users to compare them with migration profiles of other proteins, showing the oligomeric states of proteins and thus identifying potential interaction partners. The initial version of PCoM-DB (launched in January 2013) included protein complex data for Synechocystis whole cells and Arabidopsis thaliana thylakoid membranes. Here we report PCoM-DB version 2.0, which includes new data sets and analytical tools. Additional data are included from whole cells of the pelagic marine picocyanobacterium Prochlorococcus marinus, the thermophilic cyanobacterium Thermosynechococcus elongatus, the unicellular green alga Chlamydomonas reinhardtii and the bryophyte Physcomitrella patens. The Arabidopsis protein data now include data for intact mitochondria, intact chloroplasts, chloroplast stroma and chloroplast envelopes. The new tools comprise a multiple-protein search form and a heat map viewer for protein migration profiles. Users can compare migration profiles of a protein of interest among different organelles or compare migration profiles among different proteins within the same sample. For Arabidopsis proteins, users can compare migration profiles of a protein of interest with putative homologous proteins from non-Arabidopsis organisms. The updated PCoM-DB will help researchers find novel protein complexes and estimate their evolutionary changes in the green lineage. © The Author 2017. Published by Oxford University Press on behalf of Japanese Society of Plant Physiologists. 
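Comparing the migration profile of one protein with those of others can be sketched as a similarity ranking. The code below uses the Pearson correlation between intensity profiles across gel slices; the choice of Pearson correlation is an assumption made for illustration, since the abstract does not state which comparison measure PCoM-DB uses, and the profiles are invented.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length intensity profiles."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # a flat profile carries no co-migration signal
    return cov / (sd_x * sd_y)


def rank_partners(query_profile, profiles):
    """Sort candidate proteins by how closely they co-migrate with the query."""
    scored = [(name, pearson(query_profile, profile))
              for name, profile in profiles.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Invented intensities across five gel slices, for illustration only.
query = [0.0, 1.0, 5.0, 1.0, 0.0]
candidates = {
    "protein A": [0.0, 2.0, 9.0, 2.0, 0.0],  # peaks in the same slice
    "protein B": [6.0, 1.0, 0.0, 0.0, 0.0],  # peaks elsewhere
}
```

Proteins that peak in the same slices as the query rank first and are candidate members of the same complex, which matches the use the abstract describes for identifying potential interaction partners.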
Mobilio, Dominick; Walker, Gary; Brooijmans, Natasja; Nilakantan, Ramaswamy; Denny, R Aldrin; Dejoannis, Jason; Feyfant, Eric; Kowticwar, Rupesh K; Mankala, Jyoti; Palli, Satish; Punyamantula, Sairam; Tatipally, Maneesh; John, Reji K; Humblet, Christine
2010-08-01
The Protein Data Bank is the most comprehensive source of experimental macromolecular structures. It can, however, be difficult at times to locate relevant structures with the Protein Data Bank search interface. This is particularly true when searching for complexes containing specific interactions between protein and ligand atoms. Moreover, searching within a family of proteins can be tedious. For example, one cannot search for some conserved residue as residue numbers vary across structures. We describe herein three databases, Protein Relational Database, Kinase Knowledge Base, and Matrix Metalloproteinase Knowledge Base, containing protein structures from the Protein Data Bank. In Protein Relational Database, atom-atom distances between protein and ligand have been precalculated allowing for millisecond retrieval based on atom identity and distance constraints. Ring centroids, centroid-centroid and centroid-atom distances and angles have also been included permitting queries for pi-stacking interactions and other structural motifs involving rings. Other geometric features can be searched through the inclusion of residue pair and triplet distances. In Kinase Knowledge Base and Matrix Metalloproteinase Knowledge Base, the catalytic domains have been aligned into common residue numbering schemes. Thus, by searching across Protein Relational Database and Kinase Knowledge Base, one can easily retrieve structures wherein, for example, a ligand of interest is making contact with the gatekeeper residue.
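The precalculation idea can be shown in a few lines: compute every protein-ligand atom-atom distance once, store the results, and answer later queries by filtering on atom identity and a distance cutoff. The record layout, atom labels, and coordinates below are hypothetical; the actual Protein Relational Database schema is not given in the abstract.

```python
import math


def precompute_contacts(protein_atoms, ligand_atoms):
    """Return one record per protein-ligand atom pair with its distance.

    Each atom is a (label, element, (x, y, z)) tuple; this layout is hypothetical.
    """
    contacts = []
    for p_label, p_element, p_xyz in protein_atoms:
        for l_label, l_element, l_xyz in ligand_atoms:
            contacts.append({
                "protein_atom": p_label,
                "protein_element": p_element,
                "ligand_atom": l_label,
                "ligand_element": l_element,
                "distance": math.dist(p_xyz, l_xyz),
            })
    return contacts


def find_contacts(contacts, protein_element, ligand_element, max_distance):
    """Filter stored pairs by element identity and a distance constraint."""
    return [c for c in contacts
            if c["protein_element"] == protein_element
            and c["ligand_element"] == ligand_element
            and c["distance"] <= max_distance]


# Invented coordinates (in angstroms), for illustration only.
protein_atoms = [("res1:N", "N", (0.0, 0.0, 0.0)),
                 ("res1:O", "O", (10.0, 0.0, 0.0))]
ligand_atoms = [("lig:O1", "O", (3.0, 0.0, 0.0)),
                ("lig:N1", "N", (0.0, 4.0, 3.0))]
contacts = precompute_contacts(protein_atoms, ligand_atoms)
```

In a relational database the stored pairs would sit in an indexed table, so a query such as "protein nitrogen within 3.5 Å of a ligand oxygen" becomes an index lookup rather than a fresh geometric calculation, which is consistent with the fast retrieval the abstract reports.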
The Universal Protein Resource (UniProt): an expanding universe of protein information.
Wu, Cathy H; Apweiler, Rolf; Bairoch, Amos; Natale, Darren A; Barker, Winona C; Boeckmann, Brigitte; Ferro, Serenella; Gasteiger, Elisabeth; Huang, Hongzhan; Lopez, Rodrigo; Magrane, Michele; Martin, Maria J; Mazumder, Raja; O'Donovan, Claire; Redaschi, Nicole; Suzek, Baris
2006-01-01
The Universal Protein Resource (UniProt) provides a central resource on protein sequences and functional annotation with three database components, each addressing a key need in protein bioinformatics. The UniProt Knowledgebase (UniProtKB), comprising the manually annotated UniProtKB/Swiss-Prot section and the automatically annotated UniProtKB/TrEMBL section, is the preeminent storehouse of protein annotation. The extensive cross-references, functional and feature annotations and literature-based evidence attribution enable scientists to analyse proteins and query across databases. The UniProt Reference Clusters (UniRef) speed similarity searches via sequence space compression by merging sequences that are 100% (UniRef100), 90% (UniRef90) or 50% (UniRef50) identical. Finally, the UniProt Archive (UniParc) stores all publicly available protein sequences, containing the history of sequence data with links to the source databases. UniProt databases continue to grow in size and in availability of information. Recent and upcoming changes to database contents, formats, controlled vocabularies and services are described. New download availability includes all major releases of UniProtKB, sequence collections by taxonomic division and complete proteomes. A bibliography mapping service has been added, and an ID mapping service will be available soon. UniProt databases can be accessed online at http://www.uniprot.org or downloaded at ftp://ftp.uniprot.org/pub/databases/.
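The compression of sequence space by merging sequences above an identity threshold can be illustrated with a naive greedy clustering. This is a sketch only: the similarity measure is Python's difflib.SequenceMatcher ratio, used as a rough stand-in for alignment-based percent identity, and the greedy representative scheme is an assumption rather than the procedure UniProt uses to build UniRef.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Rough similarity in [0, 1]; a stand-in for alignment-based identity."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def greedy_cluster(sequences, threshold):
    """Assign each sequence to the first cluster whose representative is similar enough.

    The first member of each cluster serves as its representative.
    """
    clusters = []
    for seq in sequences:
        for cluster in clusters:
            if similarity(seq, cluster[0]) >= threshold:
                cluster.append(seq)
                break
        else:
            clusters.append([seq])  # no match: seq founds a new cluster
    return clusters


# Invented sequences, for illustration only.
example_sequences = ["MKTAYIAK", "MKTAYIAK", "MKTAYIAR", "GGGGGGGG"]
```

At a threshold of 1.0 only the two identical sequences merge, analogous to UniRef100, while a lower threshold also absorbs the near-identical third sequence. A similarity search against one representative per cluster then touches fewer sequences than a search against every member.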
An affinity-structure database of helix-turn-helix: DNA complexes with a universal coordinate system
DOE Office of Scientific and Technical Information (OSTI.GOV)
AlQuraishi, Mohammed; Tang, Shengdong; Xia, Xide
Molecular interactions between proteins and DNA molecules underlie many cellular processes, including transcriptional regulation, chromosome replication, and nucleosome positioning. Computational analyses of protein-DNA interactions rely on experimental data characterizing known protein-DNA interactions structurally and biochemically. While many databases exist that contain either structural or biochemical data, few integrate these two data sources in a unified fashion. Such integration is becoming increasingly critical with the rapid growth of structural and biochemical data, and the emergence of algorithms that rely on the synthesis of multiple data types to derive computational models of molecular interactions. We have developed an integrated affinity-structure database in which the experimental and quantitative DNA binding affinities of helix-turn-helix proteins are mapped onto the crystal structures of the corresponding protein-DNA complexes. This database provides access to: (i) protein-DNA structures, (ii) quantitative summaries of protein-DNA binding affinities using position weight matrices, and (iii) raw experimental data of protein-DNA binding instances. Critically, this database establishes a correspondence between experimental structural data and quantitative binding affinity data at the single basepair level. Furthermore, we present a novel alignment algorithm that structurally aligns the protein-DNA complexes in the database and creates a unified residue-level coordinate system for comparing the physico-chemical environments at the interface between complexes. Using this unified coordinate system, we compute the statistics of atomic interactions at the protein-DNA interface of helix-turn-helix proteins. We provide an interactive website for visualization, querying, and analyzing this database, and a downloadable version to facilitate programmatic analysis. 
Lastly, this database will facilitate the analysis of protein-DNA interactions and the development of programmatic computational methods that capitalize on integration of structural and biochemical datasets. The database can be accessed at http://ProteinDNA.hms.harvard.edu.
An affinity-structure database of helix-turn-helix: DNA complexes with a universal coordinate system
AlQuraishi, Mohammed; Tang, Shengdong; Xia, Xide
2015-11-19
Molecular interactions between proteins and DNA molecules underlie many cellular processes, including transcriptional regulation, chromosome replication, and nucleosome positioning. Computational analyses of protein-DNA interactions rely on experimental data characterizing known protein-DNA interactions structurally and biochemically. While many databases exist that contain either structural or biochemical data, few integrate these two data sources in a unified fashion. Such integration is becoming increasingly critical with the rapid growth of structural and biochemical data, and the emergence of algorithms that rely on the synthesis of multiple data types to derive computational models of molecular interactions. We have developed an integrated affinity-structure database in which the experimental and quantitative DNA binding affinities of helix-turn-helix proteins are mapped onto the crystal structures of the corresponding protein-DNA complexes. This database provides access to: (i) protein-DNA structures, (ii) quantitative summaries of protein-DNA binding affinities using position weight matrices, and (iii) raw experimental data of protein-DNA binding instances. Critically, this database establishes a correspondence between experimental structural data and quantitative binding affinity data at the single basepair level. Furthermore, we present a novel alignment algorithm that structurally aligns the protein-DNA complexes in the database and creates a unified residue-level coordinate system for comparing the physico-chemical environments at the interface between complexes. Using this unified coordinate system, we compute the statistics of atomic interactions at the protein-DNA interface of helix-turn-helix proteins. We provide an interactive website for visualization, querying, and analyzing this database, and a downloadable version to facilitate programmatic analysis. Lastly, this database will facilitate the analysis of protein-DNA interactions and the development of programmatic computational methods that capitalize on integration of structural and biochemical datasets. The database can be accessed at http://ProteinDNA.hms.harvard.edu.
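The abstract above mentions position weight matrices as quantitative summaries of binding affinities. As a rough illustration only, the following Python sketch builds a log-odds position weight matrix from a handful of aligned binding sites and scores a candidate sequence against it. This is one common textbook construction and is not necessarily how the database derives its matrices; the function names, the pseudocount, the uniform background and the example sites are all assumptions made for the sketch.

```python
import math
from collections import Counter

BASES = "ACGT"

def position_weight_matrix(sites, pseudocount=1.0, background=0.25):
    """Build a log2-odds matrix (one dict per position) from aligned sites.

    The pseudocount and the uniform background are illustrative assumptions.
    """
    if not sites:
        raise ValueError("need at least one binding site")
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    total = len(sites) + pseudocount * len(BASES)
    pwm = []
    for pos in range(length):
        counts = Counter(site[pos] for site in sites)
        column = {}
        for base in BASES:
            freq = (counts[base] + pseudocount) / total
            column[base] = math.log2(freq / background)
        pwm.append(column)
    return pwm

def score(pwm, sequence):
    """Sum the per-position log-odds for a sequence of matching length."""
    if len(sequence) != len(pwm):
        raise ValueError("sequence length must match the matrix width")
    return sum(column[base] for column, base in zip(pwm, sequence))

# Made-up aligned sites, used only to exercise the functions.
example_sites = ["TTGACA", "TTGACA", "TTTACA", "TTGATA"]
example_pwm = position_weight_matrix(example_sites)
consensus_score = score(example_pwm, "TTGACA")
unrelated_score = score(example_pwm, "CCCGGG")
print(consensus_score > unrelated_score)
```

A sequence that resembles the aligned sites receives a higher score than an unrelated one, which is the sense in which such a matrix summarizes binding preferences position by position.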
Komatsu, Setsuko; Wang, Xin; Yin, Xiaojian; Nanjo, Yohei; Ohyanagi, Hajime; Sakata, Katsumi
2017-06-23
The Soybean Proteome Database (SPD) stores data on soybean proteins obtained with gel-based and gel-free proteomic techniques. The database was constructed to provide information on proteins for functional analyses. The majority of the data is focused on soybean (Glycine max 'Enrei'). The growth and yield of soybean are strongly affected by environmental stresses such as flooding. The database was originally constructed using data on soybean proteins separated by two-dimensional polyacrylamide gel electrophoresis, which is a gel-based proteomic technique. Since 2015, the database has been expanded to incorporate data obtained by label-free mass spectrometry-based quantitative proteomics, which is a gel-free proteomic technique. Here, the portions of the database consisting of gel-free proteomic data are described. The gel-free proteomic database contains 39,212 proteins identified in 63 sample sets, such as temporal and organ-specific samples of soybean plants grown under flooding stress or non-stressed conditions. In addition, data on organellar proteins identified in mitochondria, nuclei, and endoplasmic reticulum are stored. Furthermore, the database integrates multiple omics data such as genomics, transcriptomics, metabolomics, and proteomics. The SPD database is accessible at http://proteome.dc.affrc.go.jp/Soybean/. The Soybean Proteome Database stores data obtained from both gel-based and gel-free proteomic techniques. The gel-free proteomic database comprises 39,212 proteins identified in 63 sample sets, such as different organs of soybean plants grown under flooding stress or non-stressed conditions in a time-dependent manner. In addition, organellar proteins identified in mitochondria, nuclei, and endoplasmic reticulum are stored in the gel-free proteomics database. A total of 44,704 proteins, including 5490 proteins identified using a gel-based proteomic technique, are stored in the SPD. 
It accounts for approximately 80% of all predicted proteins from genome sequences, although some proteins overlap between the two sets. Based on the demonstrated application of data stored in the database for functional analyses, it is suggested that these data will be useful for analyses of biological mechanisms in soybean. Furthermore, coupled with recent advances in information and communication technology, the usefulness of this database in the analysis of biological mechanisms is expected to increase. Copyright © 2017 Elsevier B.V. All rights reserved.
FARE-CAFE: a database of functional and regulatory elements of cancer-associated fusion events
Korla, Praveen Kumar; Cheng, Jack; Huang, Chien-Hung; Tsai, Jeffrey J. P.; Liu, Yu-Hsuan; Kurubanjerdjit, Nilubon; Hsieh, Wen-Tsong; Chen, Huey-Yi; Ng, Ka-Lok
2015-01-01
Chromosomal translocation (CT) is of enormous clinical interest because this disorder is associated with various major solid tumors and leukemia. A tumor-specific fusion gene event may occur when a translocation joins two separate genes. Currently, various CT databases provide information about fusion genes and their genomic elements. However, no database of the roles of fusion genes, in terms of essential functional and regulatory elements in oncogenesis, is available. FARE-CAFE is a unique combination of CTs, fusion proteins, protein domains, domain–domain interactions, protein–protein interactions, transcription factors and microRNAs, with subsequent experimental information, which cannot be found in any other CT database. Genomic DNA information is also included, for example manually collected exact locations of the first and second break points, sequences and karyotypes of fusion genes. FARE-CAFE will substantially facilitate the cancer biologist’s mission of elucidating the pathogenesis of various types of cancer. This database will ultimately help to develop ‘novel’ therapeutic approaches. Database URL: http://ppi.bioinfo.asia.edu.tw/FARE-CAFE PMID:26384373
The 24th annual Nucleic Acids Research database issue: a look back and upcoming changes
Rigden, Daniel J
2017-01-01
This year's Database Issue of Nucleic Acids Research contains 152 papers that include descriptions of 54 new databases and update papers on 98 databases, of which 16 have not been previously featured in NAR. As always, these databases cover a broad range of molecular biology subjects, including genome structure, gene expression and its regulation, proteins, protein domains, and protein–protein interactions. Following the recent trend, an increasing number of new and established databases deal with the issues of human health, from cancer-causing mutations to drugs and drug targets. In accordance with this trend, three recently compiled databases that have been selected by NAR reviewers and editors as ‘breakthrough’ contributions, denovo-db, the Monarch Initiative, and Open Targets, cover human de novo gene variants, disease-related phenotypes in model organisms, and a bioinformatics platform for therapeutic target identification and validation, respectively. We expect these databases to attract the attention of numerous researchers working in various areas of genetics and genomics. Looking back at the past 12 years, we present here the ‘golden set’ of databases that have consistently served as authoritative, comprehensive, and convenient data resources widely used by the entire community and offer some lessons on what makes a successful database. The Database Issue is freely available online at the https://academic.oup.com/nar web site. An updated version of the NAR Molecular Biology Database Collection is available at http://www.oxfordjournals.org/nar/database/a/. PMID:28053160
DB-PABP: a database of polyanion-binding proteins
Fang, Jianwen; Dong, Yinghua; Salamat-Miller, Nazila; Russell Middaugh, C.
2008-01-01
The interactions between polyanions (PAs) and polyanion-binding proteins (PABPs) have been found to play significant roles in many essential biological processes including intracellular organization, transport and protein folding. Furthermore, many neurodegenerative disease-related proteins are PABPs. Thus, a better understanding of PA/PABP interactions may not only enhance our understanding of biological systems but also provide new clues to these deadly diseases. The literature in this field is widely scattered, suggesting the need for a comprehensive and searchable database of PABPs. The DB-PABP is a comprehensive, manually curated and searchable database of experimentally characterized PABPs. It is freely available and can be accessed online at http://pabp.bcf.ku.edu/DB_PABP/. The DB-PABP was implemented as a MySQL relational database. An interactive web interface was created using Java Server Pages (JSP). The search page of the database is organized into a main search form and a section for utilities. The main search form enables custom searches via four menus: protein names, polyanion names, the source species of the proteins and the methods used to discover the interactions. Available utilities include a commonality matrix, a function for listing PABPs by the number of interacting polyanions and a string search for author surnames. The DB-PABP is maintained at the University of Kansas. We encourage users to provide feedback and submit new data and references. PMID:17916573
DB-PABP: a database of polyanion-binding proteins.
Fang, Jianwen; Dong, Yinghua; Salamat-Miller, Nazila; Middaugh, C Russell
2008-01-01
The interactions between polyanions (PAs) and polyanion-binding proteins (PABPs) have been found to play significant roles in many essential biological processes including intracellular organization, transport and protein folding. Furthermore, many neurodegenerative disease-related proteins are PABPs. Thus, a better understanding of PA/PABP interactions may not only enhance our understanding of biological systems but also provide new clues to these deadly diseases. The literature in this field is widely scattered, suggesting the need for a comprehensive and searchable database of PABPs. The DB-PABP is a comprehensive, manually curated and searchable database of experimentally characterized PABPs. It is freely available and can be accessed online at http://pabp.bcf.ku.edu/DB_PABP/. The DB-PABP was implemented as a MySQL relational database. An interactive web interface was created using Java Server Pages (JSP). The search page of the database is organized into a main search form and a section for utilities. The main search form enables custom searches via four menus: protein names, polyanion names, the source species of the proteins and the methods used to discover the interactions. Available utilities include a commonality matrix, a function for listing PABPs by the number of interacting polyanions and a string search for author surnames. The DB-PABP is maintained at the University of Kansas. We encourage users to provide feedback and submit new data and references.
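The DB-PABP abstract describes a relational database with a utility that lists PABPs by the number of interacting polyanions. A minimal sketch of what such a query might look like is given below in Python. It uses the standard-library sqlite3 module as a stand-in for MySQL, and the single-table schema, the column names and the placeholder rows are hypothetical; the real DB-PABP schema is not described in the abstract.

```python
import sqlite3

def pabps_by_polyanion_count(conn):
    """List proteins ordered by how many distinct polyanions they bind."""
    rows = conn.execute(
        """
        SELECT protein, COUNT(DISTINCT polyanion) AS n_polyanions
        FROM interaction
        GROUP BY protein
        ORDER BY n_polyanions DESC, protein ASC
        """
    )
    return rows.fetchall()

# Hypothetical schema: one row per reported protein/polyanion interaction,
# mirroring the four search menus named in the abstract.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE interaction ("
    "protein TEXT, polyanion TEXT, species TEXT, method TEXT)"
)
conn.executemany(
    "INSERT INTO interaction VALUES (?, ?, ?, ?)",
    [
        # Placeholder rows, not real DB-PABP records.
        ("protein_A", "polyanion_1", "species_X", "method_1"),
        ("protein_A", "polyanion_2", "species_X", "method_1"),
        ("protein_A", "polyanion_1", "species_X", "method_2"),
        ("protein_B", "polyanion_1", "species_Y", "method_1"),
    ],
)

ranking = pabps_by_polyanion_count(conn)
print(ranking)
```

Counting distinct polyanions rather than rows keeps a protein from being ranked higher merely because the same interaction was reported by more than one method.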
GALT protein database: querying structural and functional features of GALT enzyme.
d'Acierno, Antonio; Facchiano, Angelo; Marabotti, Anna
2014-09-01
Knowledge of the impact of variations on protein structure can enhance the comprehension of the mechanisms of genetic diseases related to that protein. Here, we present a new version of GALT Protein Database, a Web-accessible data repository for the storage and interrogation of structural effects of variations of the enzyme galactose-1-phosphate uridylyltransferase (GALT), the impairment of which leads to classic Galactosemia, a rare genetic disease. This new version of this database now contains the models of 201 missense variants of GALT enzyme, including heterozygous variants, and it allows users not only to retrieve information about the missense variations affecting this protein, but also to investigate their impact on substrate binding, intersubunit interactions, stability, and other structural features. In addition, it allows the interactive visualization of the models of variants collected into the database. We have developed additional tools to improve the use of the database by nonspecialized users. This Web-accessible database (http://bioinformatica.isa.cnr.it/GALT/GALT2.0) represents a model of tools potentially suitable for application to other proteins that are involved in human pathologies and that are subjected to genetic variations. © 2014 WILEY PERIODICALS, INC.
PROXiMATE: a database of mutant protein-protein complex thermodynamics and kinetics.
Jemimah, Sherlyn; Yugandhar, K; Michael Gromiha, M
2017-09-01
We have developed PROXiMATE, a database of thermodynamic data for more than 6000 missense mutations in 174 heterodimeric protein-protein complexes, supplemented with interaction network data from STRING database, solvent accessibility, sequence, structural and functional information, experimental conditions and literature information. Additional features include complex structure visualization, search and display options, download options and a provision for users to upload their data. The database is freely available at http://www.iitm.ac.in/bioinfo/PROXiMATE/. The website is implemented in Python, and supports recent versions of major browsers such as IE10, Firefox, Chrome and Opera. Contact: gromiha@iitm.ac.in. Supplementary data are available at Bioinformatics online. © The Author (2017). Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com
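Thermodynamic data for mutant complexes of the kind PROXiMATE collects are often summarized as a change in binding free energy. As a hedged illustration, the Python sketch below converts dissociation constants to binding free energies with the standard relation ΔG = RT ln(Kd) and takes the mutant-minus-wild-type difference. The function names, the default temperature and the example Kd values are assumptions for the sketch, and sign conventions for ΔΔG differ between sources, so any real dataset should be checked for the convention it uses.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol*K)

def binding_free_energy(kd_molar, temperature_k=298.15):
    """Binding free energy in kcal/mol from a dissociation constant in M."""
    if kd_molar <= 0:
        raise ValueError("dissociation constant must be positive")
    return R_KCAL * temperature_k * math.log(kd_molar)

def ddg_binding(kd_wild_type, kd_mutant, temperature_k=298.15):
    """Mutant minus wild-type binding free energy (positive = weaker binding)."""
    return (binding_free_energy(kd_mutant, temperature_k)
            - binding_free_energy(kd_wild_type, temperature_k))

# Made-up values: a mutation that weakens binding ten-fold.
ddg = ddg_binding(kd_wild_type=1e-9, kd_mutant=1e-8)
print(round(ddg, 2), "kcal/mol")
```

With this convention a ten-fold loss of affinity at room temperature corresponds to RT ln(10), a little over 1 kcal/mol.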
Jefferson, Emily R.; Walsh, Thomas P.; Roberts, Timothy J.; Barton, Geoffrey J.
2007-01-01
SNAPPI-DB, a high performance database of Structures, iNterfaces and Alignments of Protein–Protein Interactions, and its associated Java Application Programming Interface (API) is described. SNAPPI-DB contains structural data, down to the level of atom co-ordinates, for each structure in the Protein Data Bank (PDB) together with associated data including SCOP, CATH, Pfam, SWISSPROT, InterPro, GO terms, Protein Quaternary Structures (PQS) and secondary structure information. Domain–domain interactions are stored for multiple domain definitions and are classified by their Superfamily/Family pair and interaction interface. Each set of classified domain–domain interactions has an associated multiple structure alignment for each partner. The API facilitates data access via PDB entries, domains and domain–domain interactions. Rapid development, fast database access and the ability to perform advanced queries without the requirement for complex SQL statements are provided via an object oriented database and the Java Data Objects (JDO) API. SNAPPI-DB contains many features which are not available in other databases of structural protein–protein interactions. It has been applied in three studies on the properties of protein–protein interactions and is currently being employed to train a protein–protein interaction predictor and a functional residue predictor. The database, API and manual are available for download at: . PMID:17202171
DBSecSys 2.0: a database of Burkholderia mallei and Burkholderia pseudomallei secretion systems.
Memišević, Vesna; Kumar, Kamal; Zavaljevski, Nela; DeShazer, David; Wallqvist, Anders; Reifman, Jaques
2016-09-20
Burkholderia mallei and B. pseudomallei are the causative agents of glanders and melioidosis, respectively, diseases with high morbidity and mortality rates. B. mallei and B. pseudomallei are closely related genetically; B. mallei evolved from an ancestral strain of B. pseudomallei by genome reduction and adaptation to an obligate intracellular lifestyle. Although these two bacteria cause different diseases, they share multiple virulence factors, including bacterial secretion systems, which represent key components of bacterial pathogenicity. Despite recent progress, the secretion system proteins for B. mallei and B. pseudomallei, their pathogenic mechanisms of action, and host factors are not well characterized. We previously developed a manually curated database, DBSecSys, of bacterial secretion system proteins for B. mallei. Here, we report an expansion of the database with corresponding information about B. pseudomallei. DBSecSys 2.0 contains comprehensive literature-based and computationally derived information about B. mallei ATCC 23344 and literature-based and computationally derived information about B. pseudomallei K96243. The database contains updated information for 163 B. mallei proteins from the previous database and 61 additional B. mallei proteins, and new information for 281 B. pseudomallei proteins associated with 5 secretion systems, their 1,633 human- and murine-interacting targets, and 2,400 host-B. mallei interactions and 2,286 host-B. pseudomallei interactions. The database also includes information about 13 pathogenic mechanisms of action for B. mallei and B. pseudomallei secretion system proteins inferred from the available literature or computationally. Additionally, DBSecSys 2.0 provides details about 82 virulence attenuation experiments for 52 B. mallei secretion system proteins and 98 virulence attenuation experiments for 61 B. pseudomallei secretion system proteins. 
We updated the Web interface and data access layer to speed up users' searches for detailed information on orthologous proteins related to secretion systems of the two pathogens. The updates of DBSecSys 2.0 provide unique capabilities to access comprehensive information about secretion systems of B. mallei and B. pseudomallei. They enable studies and comparisons of corresponding proteins of these two closely related pathogens and their host-interacting partners. The database is available at http://dbsecsys.bhsai.org.
Consolidation of proteomics data in the Cancer Proteomics database.
Arntzen, Magnus Ø; Boddie, Paul; Frick, Rahel; Koehler, Christian J; Thiede, Bernd
2015-11-01
Cancer is a class of diseases characterized by abnormal cell growth and one of the major reasons for human deaths. Proteins are involved in the molecular mechanisms leading to cancer; furthermore, they are affected by anti-cancer drugs, and protein biomarkers can be used to diagnose certain cancer types. Therefore, it is important to explore the proteomics background of cancer. In this report, we developed the Cancer Proteomics database to re-interrogate published proteome studies investigating cancer. The database is divided into three sections related to cancer processes, cancer types, and anti-cancer drugs. Currently, the Cancer Proteomics database contains 9778 entries of 4118 proteins extracted from 143 scientific articles covering all three sections: cell death (cancer process), prostate cancer (cancer type) and platinum-based anti-cancer drugs including carboplatin, cisplatin, and oxaliplatin (anti-cancer drugs). The detailed information extracted from the literature includes basic information about the articles (e.g., PubMed ID, authors, journal name, publication year), information about the samples (type, study/reference, prognosis factor), and the proteomics workflow (subcellular fractionation, protein and peptide separation, mass spectrometry, quantification). Useful annotations such as hyperlinks to UniProt and PubMed were included. In addition, many filtering options were established as well as export functions. The database is freely available at http://cancerproteomics.uio.no. © 2015 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.
MiCroKit 3.0: an integrated database of midbody, centrosome and kinetochore.
Ren, Jian; Liu, Zexian; Gao, Xinjiao; Jin, Changjiang; Ye, Mingliang; Zou, Hanfa; Wen, Longping; Zhang, Zhaolei; Xue, Yu; Yao, Xuebiao
2010-01-01
During cell division/mitosis, a specific subset of proteins is spatially and temporally assembled into protein super complexes in three distinct regions, i.e. centrosome/spindle pole, kinetochore/centromere and midbody/cleavage furrow/phragmoplast/bud neck, and modulates the cell division process faithfully. Although many experimental efforts have been carried out to investigate the characteristics of these proteins, no integrated database was available. Here, we present the MiCroKit database (http://microkit.biocuckoo.org) of proteins that localize in midbody, centrosome and/or kinetochore. We collected into the MiCroKit database experimentally verified microkit proteins from the scientific literature that have unambiguous supportive evidence for subcellular localization under fluorescence microscopy. The current version of MiCroKit 3.0 provides detailed information for 1489 microkit proteins from seven model organisms, including Saccharomyces cerevisiae, Schizosaccharomyces pombe, Caenorhabditis elegans, Drosophila melanogaster, Xenopus laevis, Mus musculus and Homo sapiens. Moreover, orthology information is provided for these microkit proteins, and could be a useful resource for further experimental identification. The online service of MiCroKit database was implemented in PHP + MySQL + JavaScript, while the local packages were developed in JAVA 1.5 (J2SE 5.0).
MiCroKit 3.0: an integrated database of midbody, centrosome and kinetochore
Ren, Jian; Liu, Zexian; Gao, Xinjiao; Jin, Changjiang; Ye, Mingliang; Zou, Hanfa; Wen, Longping; Zhang, Zhaolei; Xue, Yu; Yao, Xuebiao
2010-01-01
During cell division/mitosis, a specific subset of proteins is spatially and temporally assembled into protein super complexes in three distinct regions, i.e. centrosome/spindle pole, kinetochore/centromere and midbody/cleavage furrow/phragmoplast/bud neck, and modulates the cell division process faithfully. Although many experimental efforts have been carried out to investigate the characteristics of these proteins, no integrated database was available. Here, we present the MiCroKit database (http://microkit.biocuckoo.org) of proteins that localize in midbody, centrosome and/or kinetochore. We collected into the MiCroKit database experimentally verified microkit proteins from the scientific literature that have unambiguous supportive evidence for subcellular localization under fluorescence microscopy. The current version of MiCroKit 3.0 provides detailed information for 1489 microkit proteins from seven model organisms, including Saccharomyces cerevisiae, Schizosaccharomyces pombe, Caenorhabditis elegans, Drosophila melanogaster, Xenopus laevis, Mus musculus and Homo sapiens. Moreover, orthology information is provided for these microkit proteins, and could be a useful resource for further experimental identification. The online service of MiCroKit database was implemented in PHP + MySQL + JavaScript, while the local packages were developed in JAVA 1.5 (J2SE 5.0). PMID:19783819
Sousa, Filipa L; Parente, Daniel J; Hessman, Jacob A; Chazelle, Allen; Teichmann, Sarah A; Swint-Kruse, Liskin
2016-09-01
The AlloRep database (www.AlloRep.org) (Sousa et al., 2016) [1] compiles extensive sequence, mutagenesis, and structural information for the LacI/GalR family of transcription regulators. Sequence alignments are presented for >3000 proteins in 45 paralog subfamilies and as a subsampled alignment of the whole family. Phenotypic and biochemical data on almost 6000 mutants have been compiled from an exhaustive search of the literature; citations for these data are included herein. These data include information about oligomerization state, stability, DNA binding and allosteric regulation. Protein structural data for 65 proteins are presented as easily-accessible, residue-contact networks. Finally, this article includes example queries to enable the use of the AlloRep database. See the related article, "AlloRep: a repository of sequence, structural and mutagenesis data for the LacI/GalR transcription regulators" (Sousa et al., 2016) [1].
Database resources of the National Center for Biotechnology Information
2015-01-01
The National Center for Biotechnology Information (NCBI) provides a large suite of online resources for biological information and data, including the GenBank® nucleic acid sequence database and the PubMed database of citations and abstracts for published life science journals. Additional NCBI resources focus on literature (Bookshelf, PubMed Central (PMC) and PubReader); medical genetics (ClinVar, dbMHC, the Genetic Testing Registry, HIV-1/Human Protein Interaction Database and MedGen); genes and genomics (BioProject, BioSample, dbSNP, dbVar, Epigenomics, Gene, Gene Expression Omnibus (GEO), Genome, HomoloGene, the Map Viewer, Nucleotide, PopSet, Probe, RefSeq, Sequence Read Archive, the Taxonomy Browser, Trace Archive and UniGene); and proteins and chemicals (Biosystems, COBALT, the Conserved Domain Database (CDD), the Conserved Domain Architecture Retrieval Tool (CDART), the Molecular Modeling Database (MMDB), Protein Clusters, Protein and the PubChem suite of small molecule databases). The Entrez system provides search and retrieval operations for many of these databases. Augmenting many of the Web applications are custom implementations of the BLAST program optimized to search specialized data sets. All of these resources can be accessed through the NCBI home page at http://www.ncbi.nlm.nih.gov. PMID:25398906
Database resources of the National Center for Biotechnology Information
2016-01-01
The National Center for Biotechnology Information (NCBI) provides a large suite of online resources for biological information and data, including the GenBank® nucleic acid sequence database and the PubMed database of citations and abstracts for published life science journals. Additional NCBI resources focus on literature (PubMed Central (PMC), Bookshelf and PubReader), health (ClinVar, dbGaP, dbMHC, the Genetic Testing Registry, HIV-1/Human Protein Interaction Database and MedGen), genomes (BioProject, Assembly, Genome, BioSample, dbSNP, dbVar, Epigenomics, the Map Viewer, Nucleotide, Probe, RefSeq, Sequence Read Archive, the Taxonomy Browser and the Trace Archive), genes (Gene, Gene Expression Omnibus (GEO), HomoloGene, PopSet and UniGene), proteins (Protein, the Conserved Domain Database (CDD), COBALT, Conserved Domain Architecture Retrieval Tool (CDART), the Molecular Modeling Database (MMDB) and Protein Clusters) and chemicals (Biosystems and the PubChem suite of small molecule databases). The Entrez system provides search and retrieval operations for most of these databases. Augmenting many of the web applications are custom implementations of the BLAST program optimized to search specialized datasets. All of these resources can be accessed through the NCBI home page at www.ncbi.nlm.nih.gov. PMID:26615191
The COG database: a tool for genome-scale analysis of protein functions and evolution
Tatusov, Roman L.; Galperin, Michael Y.; Natale, Darren A.; Koonin, Eugene V.
2000-01-01
Rational classification of proteins encoded in sequenced genomes is critical for making the genome sequences maximally useful for functional and evolutionary studies. The database of Clusters of Orthologous Groups of proteins (COGs) is an attempt at a phylogenetic classification of the proteins encoded in 21 complete genomes of bacteria, archaea and eukaryotes (http://www.ncbi.nlm.nih.gov/COG). The COGs were constructed by applying the criterion of consistency of genome-specific best hits to the results of an exhaustive comparison of all protein sequences from these genomes. The database comprises 2091 COGs that include 56–83% of the gene products from each of the complete bacterial and archaeal genomes and ~35% of those from the yeast Saccharomyces cerevisiae genome. The COG database is accompanied by the COGNITOR program that is used to fit new proteins into the COGs and can be applied to functional and phylogenetic annotation of newly sequenced genomes. PMID:10592175
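The COG abstract refers to a criterion of consistency of genome-specific best hits. One simplified reading of that idea is sketched below in Python: given each gene's best hit in every other genome, find triples of genes from three different genomes whose best hits all point at one another. This is an illustrative simplification, not the published COG procedure; in particular, requiring every pair to be reciprocal may be stricter than the actual criterion, and the gene naming scheme, the function names and the toy data are assumptions.

```python
from itertools import combinations

def genome_of(gene):
    """Genes are named 'genome:gene' in this sketch (an assumed convention)."""
    return gene.split(":", 1)[0]

def consistent_triangles(best_hit):
    """Find gene triples from three genomes linked by reciprocal best hits.

    best_hit maps (gene, other_genome) to the best-matching gene in
    other_genome.
    """
    def reciprocal(a, b):
        return (best_hit.get((a, genome_of(b))) == b
                and best_hit.get((b, genome_of(a))) == a)

    genes = sorted({gene for gene, _ in best_hit})
    found = []
    for a, b, c in combinations(genes, 3):
        if len({genome_of(a), genome_of(b), genome_of(c)}) < 3:
            continue  # a triangle must span three different genomes
        if reciprocal(a, b) and reciprocal(b, c) and reciprocal(a, c):
            found.append((a, b, c))
    return found

# Toy best-hit table: the 'x' genes agree with one another, the 'y' genes do not.
toy_hits = {
    ("g1:x", "g2"): "g2:x", ("g2:x", "g1"): "g1:x",
    ("g1:x", "g3"): "g3:x", ("g3:x", "g1"): "g1:x",
    ("g2:x", "g3"): "g3:x", ("g3:x", "g2"): "g2:x",
    ("g1:y", "g2"): "g2:y", ("g2:y", "g1"): "g1:y",
    ("g1:y", "g3"): "g3:x",  # not reciprocated by g3:x
}

triangles = consistent_triangles(toy_hits)
print(triangles)
```

In this toy table only the three 'x' genes form a triangle; the 'y' genes are left out because their best hits are not mutually consistent across all three genomes.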
Mining databases for protein aggregation: a review.
Tsiolaki, Paraskevi L; Nastou, Katerina C; Hamodrakas, Stavros J; Iconomidou, Vassiliki A
2017-09-01
Protein aggregation has been an active area of research in recent decades, since it is the most common and troubling indication of protein instability. Understanding the mechanisms governing protein aggregation and amyloidogenesis is a key component to the aetiology and pathogenesis of many devastating disorders, including Alzheimer's disease and type 2 diabetes. Protein aggregation data are currently found "scattered" in an increasing number of repositories, since advances in computational biology greatly influence this field of research. This review exploits the various resources of aggregation data and attempts to distinguish and analyze the biological knowledge they contain, by introducing protein-based, fragment-based and disease-based repositories related to aggregation. In order to gain a broad overview of the available repositories, a novel comprehensive network maps and visualizes the current associations between aggregation databases and other important databases and/or tools, and the beneficial role of community annotation is discussed. The need for unification of aggregation databases in a common platform is also addressed.
PlantCAZyme: a database for plant carbohydrate-active enzymes
Ekstrom, Alexander; Taujale, Rahil; McGinn, Nathan; Yin, Yanbin
2014-01-01
PlantCAZyme is a database built upon dbCAN (database for automated carbohydrate active enzyme annotation), aiming to provide pre-computed sequence and annotation data of carbohydrate active enzymes (CAZymes) to plant carbohydrate and bioenergy research communities. The current version contains data of 43 790 CAZymes of 159 protein families from 35 plants (including angiosperms, gymnosperms, lycophyte and bryophyte mosses) and chlorophyte algae with fully sequenced genomes. Useful features of the database include: (i) a BLAST server and a HMMER server that allow users to search against our pre-computed sequence data for annotation purposes, (ii) a download page to allow batch downloading of data for a specific CAZyme family or species and (iii) protein browse pages to provide easy access to the most comprehensive sequence and annotation data. Database URL: http://cys.bios.niu.edu/plantcazyme/ PMID:25125445
Senachak, Jittisak; Cheevadhanarak, Supapon; Hongsthong, Apiradee
2015-07-29
Spirulina (Arthrospira) platensis is the only cyanobacterium that, in addition to being studied at the molecular level and subjected to gene manipulation, can also be mass cultivated in outdoor ponds for commercial use as a food supplement. Thus, encountering environmental changes, including temperature stresses, is common during the mass production of Spirulina. The use of cyanobacteria as an experimental platform, especially for photosynthetic gene manipulation in plants and bacteria, is becoming increasingly important. Understanding the mechanisms and protein-protein interaction networks that underlie low- and high-temperature responses is relevant to Spirulina mass production. To accomplish this goal, high-throughput techniques such as OMICs analyses are used. Thus, large datasets must be collected, managed and subjected to information extraction. Therefore, databases including (i) proteomic analysis and protein-protein interaction (PPI) data and (ii) domain/motif visualization tools are required for potential use in temperature response models for plant chloroplasts and photosynthetic bacteria. A web-based repository was developed including an embedded database, SpirPro, and tools for network visualization. Proteome data were analyzed and integrated with protein-protein interactions and/or metabolic pathways from KEGG. The repository provides various information, ranging from raw data (2D-gel images) to associated results, such as data from interaction and/or pathway analyses. This integration allows in silico analyses of protein-protein interactions affected at the metabolic level and, particularly, analyses of interactions between and within the affected metabolic pathways under temperature stresses for comparative proteomic analysis. The developed tool, which is coded in HTML with CSS/JavaScript and depicted in Scalable Vector Graphics (SVG), is designed for interactive analysis and exploration of the constructed network.
SpirPro is publicly available on the web at http://spirpro.sbi.kmutt.ac.th . SpirPro is an analysis platform containing an integrated proteome and PPI database that provides the most comprehensive data on this cyanobacterium at the systematic level. As an integrated database, SpirPro can be applied in various analyses, such as temperature stress response networking analysis in cyanobacterial models and interacting domain-domain analysis between proteins of interest.
Major, Sylvia M; Nishizuka, Satoshi; Morita, Daisaku; Rowland, Rick; Sunshine, Margot; Shankavaram, Uma; Washburn, Frank; Asin, Daniel; Kouros-Mehr, Hosein; Kane, David; Weinstein, John N
2006-04-06
Monoclonal antibodies are used extensively throughout the biomedical sciences for detection of antigens, either in vitro or in vivo. We, for example, have used them for quantitation of proteins on "reverse-phase" protein lysate arrays. For those studies, we quality-controlled > 600 available monoclonal antibodies and also needed to develop precise information on the genes that encode their antigens. Translation among the various protein and gene identifier types proved non-trivial because of one-to-many and many-to-one relationships. To organize the antibody, protein, and gene information, we initially developed a relational database in Filemaker for our own use. When it became apparent that the information would be useful to many other researchers faced with the need to choose or characterize antibodies, we developed it further as AbMiner, a fully relational web-based database under MySQL, programmed in Java. AbMiner is a user-friendly, web-based relational database of information on > 600 commercially available antibodies that we validated by Western blot for protein microarray studies. It includes many types of information on the antibody, the immunogen, the vendor, the antigen, and the antigen's gene. Multiple gene and protein identifier types provide links to corresponding entries in a variety of other public databases, including resources for phosphorylation-specific antibodies. AbMiner also includes our quality-control data against a pool of 60 diverse cancer cell types (the NCI-60) and also protein expression levels for the NCI-60 cells measured using our high-density "reverse-phase" protein lysate microarrays for a selection of the listed antibodies. Some other available database resources give information on antibody specificity for one or a couple of cell types. In contrast, the data in AbMiner indicate specificity with respect to the antigens in a pool of 60 diverse cell types from nine different tissues of origin. 
AbMiner is a relational database that provides extensive information from our own laboratory and other sources on more than 600 available antibodies and the genes that encode the antibodies' antigens. The data will be made freely available at http://discover.nci.nih.gov/abminer.
HypoxiaDB: a database of hypoxia-regulated proteins
Khurana, Pankaj; Sugadev, Ragumani; Jain, Jaspreet; Singh, Shashi Bala
2013-01-01
There has been intense interest in the cellular response to hypoxia, and a large number of differentially expressed proteins have been identified through various high-throughput experiments. These valuable data are scattered, and there have been no systematic attempts to document the various proteins regulated by hypoxia. Compilation, curation and annotation of these data are important in deciphering their role in hypoxia and hypoxia-related disorders. Therefore, we have compiled HypoxiaDB, a database of hypoxia-regulated proteins. It is a comprehensive, manually-curated, non-redundant catalog of proteins whose expressions are shown experimentally to be altered at different levels and durations of hypoxia. The database currently contains 72 000 manually curated entries taken on 3500 proteins extracted from 73 peer-reviewed publications selected from PubMed. HypoxiaDB is distinctive from other generalized databases: (i) it compiles tissue-specific protein expression changes under different levels and duration of hypoxia. Also, it provides manually curated literature references to support the inclusion of the protein in the database and establish its association with hypoxia. (ii) For each protein, HypoxiaDB integrates data on gene ontology, KEGG (Kyoto Encyclopedia of Genes and Genomes) pathway, protein–protein interactions, protein family (Pfam), OMIM (Online Mendelian Inheritance in Man), PDB (Protein Data Bank) structures and homology to other sequenced genomes. (iii) It also provides pre-compiled information on hypoxia-proteins, which otherwise requires tedious computational analysis. This includes information like chromosomal location, identifiers like Entrez, HGNC, Unigene, Uniprot, Ensembl, Vega, GI numbers and Genbank accession numbers associated with the protein. These are further cross-linked to respective public databases augmenting HypoxiaDB to the external repositories. 
(iv) In addition, HypoxiaDB provides an online sequence-similarity search tool for users to compare their protein sequences with the HypoxiaDB protein database. We hope that HypoxiaDB will enrich our knowledge about hypoxia-related biology and eventually lead to the development of novel hypotheses and advancements in diagnostic and therapeutic activities. HypoxiaDB is freely accessible for academic and non-profit users via http://www.hypoxiadb.com. Database URL: http://www.hypoxiadb.com PMID:24178989
Goodman, Richard E; Ebisawa, Motohiro; Ferreira, Fatima; Sampson, Hugh A; van Ree, Ronald; Vieths, Stefan; Baumert, Joseph L; Bohle, Barbara; Lalithambika, Sreedevi; Wise, John; Taylor, Steve L
2016-05-01
Increasingly regulators are demanding evaluation of potential allergenicity of foods prior to marketing. Primary risks are the transfer of allergens or potentially cross-reactive proteins into new foods. AllergenOnline was developed in 2005 as a peer-reviewed bioinformatics platform to evaluate risks of new dietary proteins in genetically modified organisms (GMO) and novel foods. The process used to identify suspected allergens and evaluate the evidence of allergenicity was refined between 2010 and 2015. Candidate proteins are identified from the NCBI database using keyword searches, the WHO/IUIS nomenclature database and peer reviewed publications. Criteria to classify proteins as allergens are described. Characteristics of the protein, the source and human subjects, test methods and results are evaluated by our expert panel and archived. Food, inhalant, salivary, venom, and contact allergens are included. Users access allergen sequences through links to the NCBI database and relevant references are listed online. Version 16 includes 1956 sequences from 778 taxonomic-protein groups that are accepted with evidence of allergic serum IgE-binding and/or biological activity. AllergenOnline provides a useful peer-reviewed tool for identifying the primary potential risks of allergy for GMOs and novel foods based on criteria described by the Codex Alimentarius Commission (2003). © 2016 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.
DBSecSys: a database of Burkholderia mallei secretion systems.
Memišević, Vesna; Kumar, Kamal; Cheng, Li; Zavaljevski, Nela; DeShazer, David; Wallqvist, Anders; Reifman, Jaques
2014-07-16
Bacterial pathogenicity represents a major public health concern worldwide. Secretion systems are a key component of bacterial pathogenicity, as they provide the means for bacterial proteins to penetrate host-cell membranes and insert themselves directly into the host cells' cytosol. Burkholderia mallei is a Gram-negative bacterium that uses multiple secretion systems during its host infection life cycle. To date, the identities of secretion system proteins for B. mallei are not well known, and their pathogenic mechanisms of action and host factors are largely uncharacterized. We present the Database of Burkholderia mallei Secretion Systems (DBSecSys), a compilation of manually curated and computationally predicted bacterial secretion system proteins and their host factors. Currently, DBSecSys contains comprehensive experimentally and computationally derived information about B. mallei strain ATCC 23344. The database includes 143 B. mallei proteins associated with five secretion systems, their 1,635 human and murine interacting targets, and the corresponding 2,400 host-B. mallei interactions. The database also includes information about 10 pathogenic mechanisms of action for B. mallei secretion system proteins inferred from the available literature. Additionally, DBSecSys provides details about 42 virulence attenuation experiments for 27 B. mallei secretion system proteins. Users interact with DBSecSys through a Web interface that allows for data browsing, querying, visualizing, and downloading. DBSecSys provides a comprehensive, systematically organized resource of experimental and computational data associated with B. mallei secretion systems. It provides the unique ability to study secretion systems not only through characterization of their corresponding pathogen proteins, but also through characterization of their host-interacting partners. The database is available at https://applications.bhsai.org/dbsecsys.
Columba: an integrated database of proteins, structures, and annotations.
Trissl, Silke; Rother, Kristian; Müller, Heiko; Steinke, Thomas; Koch, Ina; Preissner, Robert; Frömmel, Cornelius; Leser, Ulf
2005-03-31
Structural and functional research often requires the computation of sets of protein structures based on certain properties of the proteins, such as sequence features, fold classification, or functional annotation. Compiling such sets using current web resources is tedious because the necessary data are spread over many different databases. To facilitate this task, we have created COLUMBA, an integrated database of annotations of protein structures. COLUMBA currently integrates twelve different databases, including PDB, KEGG, Swiss-Prot, CATH, SCOP, the Gene Ontology, and ENZYME. The database can be searched using either keyword search or data source-specific web forms. Users can thus quickly select and download PDB entries that, for instance, participate in a particular pathway, are classified as containing a certain CATH architecture, are annotated as having a certain molecular function in the Gene Ontology, and whose structures have a resolution under a defined threshold. The results of queries are provided in both machine-readable extensible markup language (XML) and human-readable format. The structures themselves can be viewed interactively on the web. The COLUMBA database facilitates the creation of protein structure data sets for many structure-based studies. It allows queries on a number of structure-related databases to be combined in a way not covered by other projects at present. Thus, information on both many and few protein structures can be used efficiently. The web interface for COLUMBA is available at http://www.columba-db.de.
RPG: the Ribosomal Protein Gene database
Nakao, Akihiro; Yoshihama, Maki; Kenmochi, Naoya
2004-01-01
RPG (http://ribosome.miyazaki-med.ac.jp/) is a new database that provides detailed information about ribosomal protein (RP) genes. It contains data from humans and other organisms, including Drosophila melanogaster, Caenorhabditis elegans, Saccharomyces cerevisiae, Methanococcus jannaschii and Escherichia coli. Users can search the database by gene name and organism. Each record includes sequences (genomic, cDNA and amino acid sequences), intron/exon structures, genomic locations and information about orthologs. In addition, users can view and compare the gene structures of the above organisms and make multiple amino acid sequence alignments. RPG also provides information on small nucleolar RNAs (snoRNAs) that are encoded in the introns of RP genes. PMID:14681386
DNAtraffic—a new database for systems biology of DNA dynamics during the cell life
Kuchta, Krzysztof; Barszcz, Daniela; Grzesiuk, Elzbieta; Pomorski, Pawel; Krwawicz, Joanna
2012-01-01
DNAtraffic (http://dnatraffic.ibb.waw.pl/) is intended to be a unique, comprehensive and richly annotated database of genome dynamics during the cell life. It contains extensive data on the nomenclature, ontology, structure and function of proteins related to DNA integrity mechanisms such as chromatin remodeling, histone modifications, DNA repair and damage response from eight organisms: Homo sapiens, Mus musculus, Drosophila melanogaster, Caenorhabditis elegans, Saccharomyces cerevisiae, Schizosaccharomyces pombe, Escherichia coli and Arabidopsis thaliana. DNAtraffic contains comprehensive information on the diseases related to the assembled human proteins. DNAtraffic is richly annotated with systemic information on the nomenclature, chemistry and structure of DNA damage and its sources, including environmental agents or commonly used drugs targeting nucleic acids and/or proteins involved in the maintenance of genome stability. One aim of the DNAtraffic database is to create the first platform for analyzing the combinatorial complexity of the DNA network. The database includes illustrations of pathways, damage, proteins and drugs. Since DNAtraffic is designed to cover a broad spectrum of scientific disciplines, it has to be extensively linked to numerous external data sources. Our database represents the result of manual annotation work aimed at making DNAtraffic much more useful for a wide range of systems biology applications. PMID:22110027
Extension of the COG and arCOG databases by amino acid and nucleotide sequences
Meereis, Florian; Kaufmann, Michael
2008-01-01
Background: The current versions of the COG and arCOG databases, both excellent frameworks for studies in comparative and functional genomics, do not contain the nucleotide sequences corresponding to their protein or protein domain entries. Results: Using sequence information obtained from GenBank flat files covering the completely sequenced genomes of the COG and arCOG databases, we constructed NUCOCOG (nucleotide sequences containing COG databases) as an extended version including all nucleotide sequences and, in addition, the amino acid sequences originally utilized to construct the current COG and arCOG databases. We make available three comprehensive single XML files containing the complete databases including all sequence information. In addition, we provide a web interface as a utility suitable to browse the NUCOCOG database for sequence retrieval. The database is accessible at . Conclusion: NUCOCOG offers the possibility to analyze any sequence-related property in the context of the COG and arCOG framework simply by using script languages such as PERL applied to a large but single XML document. PMID:19014535
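The single-XML-document workflow described in the conclusion can be sketched as follows. This is only an illustration in Python rather than PERL, and the element and attribute names (`database`, `cog`, `entry`, `nucleotide`, `id`) are hypothetical: the abstract does not describe the actual NUCOCOG schema. The sketch streams the document and computes one sequence-related property, GC content, per COG.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical miniature document; the real NUCOCOG schema is not given
# in the abstract, so every tag and attribute name here is an assumption.
SAMPLE = b"""<database>
  <cog id="COG0001">
    <entry protein="p1"><nucleotide>ATGGCC</nucleotide></entry>
    <entry protein="p2"><nucleotide>ATGAAA</nucleotide></entry>
  </cog>
</database>"""


def gc_by_cog(stream):
    """Stream a large XML file and return the GC fraction per <cog> element."""
    result = {}
    for _, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag != "cog":
            continue
        # Concatenate all nucleotide sequences filed under this COG.
        seq = "".join(
            (n.text or "").strip().upper() for n in elem.iter("nucleotide")
        )
        if seq:
            result[elem.get("id")] = (seq.count("G") + seq.count("C")) / len(seq)
        elem.clear()  # release memory for the subtree already processed
    return result


gc = gc_by_cog(io.BytesIO(SAMPLE))
```

Streaming with `iterparse` and clearing each finished subtree keeps memory use bounded, which matters when the whole database sits in one large file.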
Lee, Ji-Hyun; You, Sungyong; Hyeon, Do Young; Kang, Byeongsoo; Kim, Hyerim; Park, Kyoung Mii; Han, Byungwoo; Hwang, Daehee; Kim, Sunghoon
2015-01-01
Mammalian cells have cytoplasmic and mitochondrial aminoacyl-tRNA synthetases (ARSs) that catalyze aminoacylation of tRNAs during protein synthesis. Despite their housekeeping functions in protein synthesis, ARSs and ARS-interacting multifunctional proteins (AIMPs) have recently been shown to play important roles in disease pathogenesis through their interactions with disease-related molecules. However, there is a lack of data resources and analytical tools that can be used to examine disease associations of ARS/AIMPs. Here, we developed an Integrated Database for ARSs (IDA), a resource database including cancer genomic/proteomic and interaction data of ARS/AIMPs. IDA includes mRNA expression, somatic mutation, copy number variation and phosphorylation data of ARS/AIMPs and their interacting proteins in various cancers. IDA further includes an array of analytical tools for exploration of disease association of ARS/AIMPs, identification of disease-associated ARS/AIMP interactors and reconstruction of ARS-dependent disease-perturbed network models. Therefore, IDA provides both comprehensive data resources and analytical tools for understanding potential roles of ARS/AIMPs in cancers. Database URL: http://ida.biocon.re.kr/, http://ars.biocon.re.kr/ PMID:25824651
DOE Office of Scientific and Technical Information (OSTI.GOV)
Weckwerth, Wolfram; Baginsky, Sacha; Van Wijk, Klass
2009-12-01
In the past 10 years, we have witnessed remarkable advances in the field of plant molecular biology. The rapid development of proteomic technologies and the speed with which these techniques have been applied to the field have altered our perception of how we can analyze proteins in complex systems. At nearly the same time, the complete genome of the model plant Arabidopsis thaliana was released; this effort provides an unsurpassed resource for the identification of proteins when researchers use MS to analyze plant samples. Recognizing the growth in this area, the Multinational Arabidopsis Steering Committee (MASC) established a subcommittee for A. thaliana proteomics in 2006 with the objective of consolidating databases, technique standards, and experimentally validated candidate genes and functions. Since the establishment of the Multinational Arabidopsis Steering Subcommittee for Proteomics (MASCP), many new approaches and resources have become available. Recently, the subcommittee established a webpage to consolidate this information (www.masc-proteomics.org). It includes links to plant proteomic databases, general information about proteomic techniques, meeting information, a summary of proteomic standards, and other relevant resources. Altogether, this website provides a useful resource for the Arabidopsis proteomics community. In the future, the website will host discussions and investigate the cross-linking of databases. The subcommittee members have extensive experience in Arabidopsis proteomics and collectively have produced some of the most extensive proteomics data sets for this model plant (Table S1 in the Supporting Information has a list of resources). The largest collection of proteomics data from a single study in A. thaliana was assembled into an accessible database (AtProteome; http://fgcz-atproteome.unizh.ch/index.php) and was recently published by the Baginsky lab.1 The database provides links to major Arabidopsis online resources, and raw data have been deposited in PRIDE and PRIDE BioMart. Included in this database is an Arabidopsis proteome map that provides evidence for the expression of approximately 50% of all predicted gene models, including several alternative gene models that are not represented in The Arabidopsis Information Resource (TAIR) protein database. A set of organ-specific biomarkers is provided, as well as organ-specific proteotypic peptides for 4105 proteins that can be used to facilitate targeted quantitative proteomic surveys. In the future, the AtProteome database will be linked to additional existing resources developed by MASCP members, such as PPDB, ProMEX, and SUBA. The most comprehensive study on the Arabidopsis chloroplast proteome, which includes information on chloroplast sorting signals, posttranslational modifications (PTMs), and protein abundances (analyzed by high-accuracy MS [Orbitrap]), was recently published by the van Wijk lab.2 These and previous data are available via the plant proteome database (PPDB; http://ppdb.tc.cornell.edu) for A. thaliana and maize. PPDB provides genome-wide experimental and functional characterization of the A. thaliana and maize proteomes, including PTMs and subcellular localization information, with an emphasis on leaf and plastid proteins. Maize and Arabidopsis proteome entries are directly linked via internal BLAST alignments within PPDB. Direct links for each protein to TAIR, SUBA, ProMEX, and other resources are also provided.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Courteau, J.
1991-10-11
Since the Genome Project began several years ago, a plethora of databases have been developed or are in the works. They range from the massive Genome Data Base at Johns Hopkins University, the central repository of all gene mapping information, to small databases focusing on single chromosomes or organisms. Some are publicly available, others are essentially private electronic lab notebooks. Still others limit access to a consortium of researchers working on, say, a single human chromosome. An increasing number incorporate sophisticated search and analytical software, while others operate as little more than data lists. In consultation with numerous experts in the field, a list has been compiled of some key genome-related databases. The list was not limited to map and sequence databases but also included the tools investigators use to interpret and elucidate genetic data, such as protein sequence and protein structure databases. Because a major goal of the Genome Project is to map and sequence the genomes of several experimental organisms, including E. coli, yeast, fruit fly, nematode, and mouse, the available databases for those organisms are listed as well. The author also includes several databases that are still under development, including some ambitious efforts that go beyond data compilation to create what are being called electronic research communities, enabling many users, rather than just one or a few curators, to add or edit the data and tag it as raw or confirmed.
LymPHOS 2.0: an update of a phosphosite database of primary human T cells
Nguyen, Tien Dung; Vidal-Cortes, Oriol; Gallardo, Oscar; Abian, Joaquin; Carrascal, Montserrat
2015-01-01
LymPHOS is a web-oriented database containing peptide and protein sequences and spectrometric information on the phosphoproteome of primary human T-Lymphocytes. Current release 2.0 contains 15 566 phosphorylation sites from 8273 unique phosphopeptides and 4937 proteins, which correspond to a 45-fold increase over the original database description. It now includes quantitative data on phosphorylation changes after time-dependent treatment with activators of the TCR-mediated signal transduction pathway. Sequence data quality has also been improved with the use of multiple search engines for database searching. LymPHOS can be publicly accessed at http://www.lymphos.org. Database URL: http://www.lymphos.org. PMID:26708986
The Biological Macromolecule Crystallization Database and NASA Protein Crystal Growth Archive
Gilliland, Gary L.; Tung, Michael; Ladner, Jane
1996-01-01
The NIST/NASA/CARB Biological Macromolecule Crystallization Database (BMCD), NIST Standard Reference Database 21, contains crystal data and crystallization conditions for biological macromolecules. The database entries include data abstracted from published crystallographic reports. Each entry consists of information describing the biological macromolecule crystallized and crystal data and the crystallization conditions for each crystal form. The BMCD serves as the NASA Protein Crystal Growth Archive in that it contains protocols and results of crystallization experiments undertaken in microgravity (space). These database entries report the results, whether successful or not, from NASA-sponsored protein crystal growth experiments in microgravity and from microgravity crystallization studies sponsored by other international organizations. The BMCD was designed as a tool to assist x-ray crystallographers in the development of protocols to crystallize biological macromolecules, those that have previously been crystallized, and those that have not been crystallized. PMID:11542472
Büssow, Konrad; Hoffmann, Steve; Sievert, Volker
2002-12-19
Functional genomics involves parallel experimentation with large sets of proteins. This requires management of large sets of open reading frames as a prerequisite for the cloning and recombinant expression of these proteins. A Java program, ORFer, was developed for retrieval of protein and nucleic acid sequences and annotations from NCBI GenBank, using the XML sequence format. Annotations retrieved by ORFer include sequence name, organism and also the completeness of the sequence. The program has a graphical user interface, although it can be used in a non-interactive mode. For protein sequences, the program also extracts the open reading frame sequence, if available, and checks its correct translation. ORFer accepts user input in the form of single GenBank GI identifiers or accession numbers, or lists of them. It can be used to extract complete sets of open reading frames and protein sequences from any kind of GenBank sequence entry, including complete genomes or chromosomes. Sequences are either stored with their features in a relational database or can be exported as text files in FASTA or tab-delimited format. The ORFer program is freely available at http://www.proteinstrukturfabrik.de/orfer. The ORFer program allows for fast retrieval of DNA sequences, protein sequences and their open reading frames and sequence annotations from GenBank. Furthermore, storage of sequences and features in a relational database is supported. Such a database can supplement a laboratory information system (LIMS) with appropriate sequence information.
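The translation check mentioned in the abstract can be illustrated with a minimal sketch. It is written in Python, not Java, and it is not ORFer's own code: the function names are hypothetical, and the sketch assumes the standard genetic code (NCBI translation table 1) and an ORF that may end in a single stop codon.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered with
# bases T, C, A, G at each position; '*' marks a stop codon.
_AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product("TCAG", repeat=3), _AMINO_ACIDS)
}


def translate(orf):
    """Translate a DNA open reading frame codon by codon ('X' if unknown)."""
    orf = orf.upper()
    usable = len(orf) - len(orf) % 3  # ignore a trailing partial codon
    return "".join(
        CODON_TABLE.get(orf[i:i + 3], "X") for i in range(0, usable, 3)
    )


def orf_matches_protein(orf, protein):
    """Return True if the ORF translates to the annotated protein sequence."""
    translated = translate(orf)
    if translated.endswith("*"):
        translated = translated[:-1]  # drop the terminal stop codon
    return translated == protein.upper()


ok = orf_matches_protein("ATGGCCTAA", "MA")
```

A real check would also have to handle alternative genetic codes and ambiguous bases, which this sketch reduces to the placeholder `X`.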
Baran, Michael C; Moseley, Hunter N B; Sahota, Gurmukh; Montelione, Gaetano T
2002-10-01
Modern protein NMR spectroscopy laboratories have a rapidly growing need for an easily queried local archival system of raw experimental NMR datasets. SPINS (Standardized ProteIn Nmr Storage) is an object-oriented relational database that provides facilities for high-volume NMR data archival, organization of analyses, and dissemination of results to the public domain by automatic preparation of the header files required for submission of data to the BioMagResBank (BMRB). The current version of SPINS coordinates the process from data collection to BMRB deposition of raw NMR data by standardizing and integrating the storage and retrieval of these data in a local laboratory file system. Additional facilities include a data mining query tool, graphical database administration tools, and an NMRStar v2.1.1 file generator. SPINS also includes a user-friendly internet-based graphical user interface, which is optionally integrated with Varian VNMR NMR data collection software. This paper provides an overview of the data model underlying the SPINS database system, a description of its implementation in Oracle, and an outline of future plans for the SPINS project.
SInCRe—structural interactome computational resource for Mycobacterium tuberculosis
Metri, Rahul; Hariharaputran, Sridhar; Ramakrishnan, Gayatri; Anand, Praveen; Raghavender, Upadhyayula S.; Ochoa-Montaño, Bernardo; Higueruelo, Alicia P.; Sowdhamini, Ramanathan; Chandra, Nagasuma R.; Blundell, Tom L.; Srinivasan, Narayanaswamy
2015-01-01
We have developed an integrated database for Mycobacterium tuberculosis H37Rv (Mtb) that collates information on protein sequences, domain assignments, functional annotation and 3D structural information along with protein–protein and protein–small molecule interactions. SInCRe (Structural Interactome Computational Resource) was developed out of the CamBan (Cambridge and Bangalore) collaboration. The motivation for developing this database is to provide an integrated platform that allows easy access to and interpretation of data and results obtained by all the groups in CamBan in the field of Mtb informatics. In-house algorithms and databases developed independently by various academic groups in CamBan are used to generate Mtb-specific datasets and are integrated in this database to provide a structural dimension to studies on tuberculosis. The SInCRe database readily provides information on identification of functional domains, genome-scale modelling of structures of Mtb proteins and characterization of the small-molecule binding sites within Mtb. The resource also provides structure-based function annotation, information on small-molecule binders including FDA (Food and Drug Administration)-approved drugs, protein–protein interactions (PPIs) and natural compounds that potentially bind to pathogen proteins and result in weakening or elimination of host–pathogen protein–protein interactions. Together they provide prerequisites for identification of off-target binding. Database URL: http://proline.biochem.iisc.ernet.in/sincre PMID:26130660
Plant Genome Resources at the National Center for Biotechnology Information
Wheeler, David L.; Smith-White, Brian; Chetvernin, Vyacheslav; Resenchuk, Sergei; Dombrowski, Susan M.; Pechous, Steven W.; Tatusova, Tatiana; Ostell, James
2005-01-01
The National Center for Biotechnology Information (NCBI) integrates data from more than 20 biological databases through a flexible search and retrieval system called Entrez. A core Entrez database, Entrez Nucleotide, includes GenBank and is tightly linked to the NCBI Taxonomy database, the Entrez Protein database, and the scientific literature in PubMed. A suite of more specialized databases for genomes, genes, gene families, gene expression, gene variation, and protein domains dovetails with the core databases to make Entrez a powerful system for genomic research. Linked to the full range of Entrez databases is the NCBI Map Viewer, which displays aligned genetic, physical, and sequence maps for eukaryotic genomes including those of many plants. A specialized plant query page allows maps from all plant genomes covered by the Map Viewer to be searched in tandem to produce a display of aligned maps from several species. PlantBLAST searches against the sequences shown in the Map Viewer allow BLAST alignments to be viewed within a genomic context. In addition, precomputed sequence similarities, such as those for proteins offered by BLAST Link, enable fluid navigation from unannotated to annotated sequences, quickening the pace of discovery. NCBI Web pages for plants, such as Plant Genome Central, complete the system by providing centralized access to NCBI's genomic resources as well as links to organism-specific Web pages beyond NCBI. PMID:16010002
Pareja, Eduardo; Pareja-Tobes, Pablo; Manrique, Marina; Pareja-Tobes, Eduardo; Bonal, Javier; Tobes, Raquel
2006-01-01
Background Transcriptional regulation processes are the principal mechanisms of adaptation in prokaryotes. In these processes, the regulatory proteins and the regulatory DNA signals located in extragenic regions are the key elements involved. As all extragenic spaces are putative regulatory regions, ExtraTrain covers all extragenic regions of available genomes and regulatory proteins from bacteria and archaea included in the UniProt database. Description ExtraTrain provides integrated and easily manageable information for 679816 extragenic regions and for the genes delimiting each of them. In addition ExtraTrain supplies a tool to explore extragenic regions, named Palinsight, oriented to detect and search palindromic patterns. This interactive visual tool is totally integrated in the database, allowing the search for regulatory signals in user defined sets of extragenic regions. The 26046 regulatory proteins included in ExtraTrain belong to the families AraC/XylS, ArsR, AsnC, Cold shock domain, CRP-FNR, DeoR, GntR, IclR, LacI, LuxR, LysR, MarR, MerR, NtrC/Fis, OmpR and TetR. The database follows the InterPro criteria to define these families. The information about regulators includes manually curated sets of references specifically associated to regulator entries. In order to achieve a sustainable and maintainable knowledge database ExtraTrain is a platform open to the contribution of knowledge by the scientific community providing a system for the incorporation of textual knowledge. Conclusion ExtraTrain is a new database for exploring Extragenic regions and Transcriptional information in bacteria and archaea. ExtraTrain database is available at . PMID:16539733
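As an illustration of the kind of palindromic-pattern search that Palinsight is described as supporting, the following minimal Python sketch scans a DNA string for reverse-complement palindromes. It is not the Palinsight implementation; the function names, the default length range and the restriction to perfect, ungapped palindromes are all assumptions made for this example.

```python
# Illustrative sketch only: NOT the Palinsight algorithm.
# All names and defaults here are hypothetical.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_palindromes(seq: str, min_len: int = 6, max_len: int = 12):
    """Return (start, site) pairs for perfect reverse-complement palindromes.

    Only even lengths are scanned, because an odd-length site would need a
    central base that equals its own complement.
    """
    seq = seq.upper()
    first_even = min_len + (min_len % 2)
    hits = []
    for length in range(first_even, max_len + 1, 2):
        for start in range(len(seq) - length + 1):
            site = seq[start:start + length]
            if site == reverse_complement(site):
                hits.append((start, site))
    return hits
```

For example, `find_palindromes("TTGAATTCAA", 6, 6)` reports the EcoRI-like site `GAATTC` at offset 2. A real regulatory-signal search would normally also allow mismatches and a spacer between the two half-sites, which this sketch omits.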
Park, Gun Wook; Hwang, Heeyoun; Kim, Kwang Hoe; Lee, Ju Yeon; Lee, Hyun Kyoung; Park, Ji Yeong; Ji, Eun Sun; Park, Sung-Kyu Robin; Yates, John R; Kwon, Kyung-Hoon; Park, Young Mok; Lee, Hyoung-Joo; Paik, Young-Ki; Kim, Jin Young; Yoo, Jong Shin
2016-11-04
In the Chromosome-Centric Human Proteome Project (C-HPP), false-positive identification by peptide spectrum matches (PSMs) after database searches is a major issue for proteogenomic studies using liquid-chromatography and mass-spectrometry-based large proteomic profiling. Here we developed a simple strategy for protein identification, with a controlled false discovery rate (FDR) at the protein level, using an integrated proteomic pipeline (IPP) that consists of four engrailed steps as follows. First, using three different search engines, SEQUEST, MASCOT, and MS-GF+, individual proteomic searches were performed against the neXtProt database. Second, the search results from the PSMs were combined using statistical evaluation tools including DTASelect and Percolator. Third, the peptide search scores were converted into E-scores normalized using an in-house program. Last, ProteinInferencer was used to filter the proteins containing two or more peptides with a controlled FDR of 1.0% at the protein level. Finally, we compared the performance of the IPP to a conventional proteomic pipeline (CPP) for protein identification using a controlled FDR of <1% at the protein level. Using the IPP, a total of 5756 proteins (vs 4453 using the CPP) including 477 alternative splicing variants (vs 182 using the CPP) were identified from human hippocampal tissue. In addition, a total of 10 missing proteins (vs 7 using the CPP) were identified with two or more unique peptides, and their tryptic peptides were validated using MS/MS spectral pattern from a repository database or their corresponding synthetic peptides. This study shows that the IPP effectively improved the identification of proteins, including alternative splicing variants and missing proteins, in human hippocampal tissues for the C-HPP. All RAW files used in this study were deposited in ProteomeXchange (PXD000395).
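The last filtering step described above (proteins with two or more peptides, protein-level FDR of 1%) can be sketched as follows. This is a minimal illustration that assumes a target-decoy estimate of the FDR; the abstract does not say how ProteinInferencer computes it, and the `ProteinHit` record, the score field and the function name are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ProteinHit:
    accession: str
    score: float       # hypothetical combined score, higher is better
    n_peptides: int    # number of distinct peptides supporting the protein
    is_decoy: bool     # True if the hit comes from a decoy sequence


def filter_proteins(hits, max_fdr=0.01, min_peptides=2):
    """Return target accessions passing a peptide-count and FDR filter.

    Assumption: FDR at a given rank is estimated as decoys / targets among
    the hits ranked at or above it (a simple target-decoy estimate).
    """
    candidates = sorted(
        (h for h in hits if h.n_peptides >= min_peptides),
        key=lambda h: h.score,
        reverse=True,
    )
    targets = decoys = 0
    cutoff = 0  # number of top-ranked candidates to accept
    for rank, hit in enumerate(candidates, start=1):
        if hit.is_decoy:
            decoys += 1
        else:
            targets += 1
        # Keep the deepest rank at which the estimated FDR is still acceptable.
        if targets and decoys / targets <= max_fdr:
            cutoff = rank
    return [h.accession for h in candidates[:cutoff] if not h.is_decoy]
```

The peptide-count filter is applied before the FDR is estimated, so single-peptide proteins cannot influence the cutoff. Whether the published pipeline orders the two filters this way is not stated in the abstract.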
MIPS: analysis and annotation of proteins from whole genomes in 2005.
Mewes, H W; Frishman, D; Mayer, K F X; Münsterkötter, M; Noubibou, O; Pagel, P; Rattei, T; Oesterheld, M; Ruepp, A; Stümpflen, V
2006-01-01
The Munich Information Center for Protein Sequences (MIPS at the GSF), Neuherberg, Germany, provides resources related to genome information. Manually curated databases for several reference organisms are maintained. Several of these databases are described elsewhere in this and other recent NAR database issues. In a complementary effort, a comprehensive set of >400 genomes automatically annotated with the PEDANT system is maintained. The main goal of our current work on creating and maintaining genome databases is to extend gene-centered information to information on interactions within a generic comprehensive framework. We have concentrated our efforts along three lines: (i) the development of suitable comprehensive data structures and database technology, communication and query tools to include a wide range of different types of information, enabling the representation of complex information such as functional modules or networks (Genome Research Environment System); (ii) the development of databases covering computable information such as the basic evolutionary relations among all genes, namely SIMAP, the sequence similarity matrix, and the CABiNet network analysis framework; and (iii) the compilation and manual annotation of information related to interactions such as protein-protein interactions or other types of relations (e.g. MPCDB, MPPI, CYGD). All databases described and the detailed descriptions of our projects can be accessed through the MIPS WWW server (http://mips.gsf.de).
Viral Genome DataBase: storing and analyzing genes and proteins from complete viral genomes.
Hiscock, D; Upton, C
2000-05-01
The Viral Genome DataBase (VGDB) contains detailed information on the genes and predicted protein sequences from 15 completely sequenced genomes of large (>100 kb) viruses (2847 genes). The stored data include DNA sequence, protein sequence, GenBank and user-entered notes, molecular weight (MW), isoelectric point (pI), amino acid content, A + T%, nucleotide frequency, dinucleotide frequency and codon use. The VGDB is a MySQL database with a user-friendly Java GUI. Results of queries can be easily sorted by any of the individual parameters. The software and additional figures and information are available at http://athena.bioc.uvic.ca/genomes/index.html.
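Several of the per-gene statistics listed above (A + T%, nucleotide frequency, dinucleotide frequency and codon use) are simple to derive from a sequence. The Python sketch below shows one way to compute them; it is not taken from the VGDB software, and the function names are hypothetical.

```python
from collections import Counter


def at_percent(dna: str) -> float:
    """Percentage of A and T bases in a DNA string."""
    dna = dna.upper()
    if not dna:
        raise ValueError("empty sequence")
    return 100.0 * (dna.count("A") + dna.count("T")) / len(dna)


def kmer_frequencies(dna: str, k: int) -> dict:
    """Relative frequency of each overlapping k-mer.

    k=1 gives nucleotide frequencies, k=2 gives dinucleotide frequencies.
    """
    dna = dna.upper()
    counts = Counter(dna[i:i + k] for i in range(len(dna) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}


def codon_usage(cds: str) -> Counter:
    """Count in-frame codons of a coding sequence, ignoring a trailing partial codon."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    return Counter(cds[i:i + 3] for i in range(0, usable, 3))
```

Here `kmer_frequencies(seq, 2)` counts overlapping dinucleotides, whereas `codon_usage` steps through the sequence in non-overlapping triplets starting at the first base. Molecular weight and isoelectric point need residue-level tables and are left out of this sketch.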
DOE Office of Scientific and Technical Information (OSTI.GOV)
Rayl, K.D.; Gaasterland, T.
This paper presents an overview of the purpose, content, and design of a subset of the currently available biological databases, with an emphasis on protein databases. Databases included in this summary are 3D-ALI, Berlin RNA databank, Blocks, DSSP, EMBL Nucleotide Database, EMP, ENZYME, FSSP, GDB, GenBank, HSSP, LiMB, PDB, PIR, PKCDD, ProSite, and SWISS-PROT. The goal is to provide a starting point for researchers who wish to take advantage of the myriad available databases. Rather than providing a complete explanation of each database, we present its content and form by explaining the details of typical entries. Pointers to more complete ``user guides`` are included, along with general information on where to search for a new database.
The MAR databases: development and implementation of databases specific for marine metagenomics
Klemetsen, Terje; Raknes, Inge A; Fu, Juan; Agafonov, Alexander; Balasundaram, Sudhagar V; Tartari, Giacomo; Robertsen, Espen
2018-01-01
We introduce the marine databases MarRef, MarDB and MarCat (https://mmp.sfb.uit.no/databases/), which are publicly available resources that promote marine research and innovation. These data resources, which have been implemented in the Marine Metagenomics Portal (MMP) (https://mmp.sfb.uit.no/), are collections of richly annotated and manually curated contextual (metadata) and sequence databases representing three tiers of accuracy. While MarRef is a database of completely sequenced marine prokaryotic genomes, which represents a marine prokaryote reference genome database, MarDB includes all incompletely sequenced prokaryotic genomes regardless of level of completeness. The last database, MarCat, represents a gene (protein) catalog of uncultivable (and cultivable) marine genes and proteins derived from marine metagenomics samples. The first versions of MarRef and MarDB contain 612 and 3726 records, respectively. Each record is built up of 106 metadata fields including attributes for sampling, sequencing, assembly and annotation in addition to the organism and taxonomic information. Currently, MarCat contains 1227 records with 55 metadata fields. Ontologies and controlled vocabularies are used in the contextual databases to enhance consistency. The user-friendly web interface lets visitors browse, filter and search in the contextual databases and perform BLAST searches against the corresponding sequence databases. All contextual and sequence databases are freely accessible and downloadable from https://s1.sfb.uit.no/public/mar/. PMID:29106641
2016-01-01
ProXL is a Web application and accompanying database designed for sharing, visualizing, and analyzing bottom-up protein cross-linking mass spectrometry data with an emphasis on structural analysis and quality control. ProXL is designed to be independent of any particular software pipeline. The import process is simplified by the use of the ProXL XML data format, which shields developers of data importers from the relative complexity of the relational database schema. The database and Web interfaces function equally well for any software pipeline and allow data from disparate pipelines to be merged and contrasted. ProXL includes robust public and private data sharing capabilities, including a project-based interface designed to ensure security and facilitate collaboration among multiple researchers. ProXL provides multiple interactive and highly dynamic data visualizations that facilitate structural-based analysis of the observed cross-links as well as quality control. ProXL is open-source, well-documented, and freely available at https://github.com/yeastrc/proxl-web-app. PMID:27302480
Riffle, Michael; Jaschob, Daniel; Zelter, Alex; Davis, Trisha N
2016-08-05
ProXL is a Web application and accompanying database designed for sharing, visualizing, and analyzing bottom-up protein cross-linking mass spectrometry data with an emphasis on structural analysis and quality control. ProXL is designed to be independent of any particular software pipeline. The import process is simplified by the use of the ProXL XML data format, which shields developers of data importers from the relative complexity of the relational database schema. The database and Web interfaces function equally well for any software pipeline and allow data from disparate pipelines to be merged and contrasted. ProXL includes robust public and private data sharing capabilities, including a project-based interface designed to ensure security and facilitate collaboration among multiple researchers. ProXL provides multiple interactive and highly dynamic data visualizations that facilitate structural-based analysis of the observed cross-links as well as quality control. ProXL is open-source, well-documented, and freely available at https://github.com/yeastrc/proxl-web-app .
Database resources of the National Center for Biotechnology Information.
2016-01-04
The National Center for Biotechnology Information (NCBI) provides a large suite of online resources for biological information and data, including the GenBank® nucleic acid sequence database and the PubMed database of citations and abstracts for published life science journals. Additional NCBI resources focus on literature (PubMed Central (PMC), Bookshelf and PubReader), health (ClinVar, dbGaP, dbMHC, the Genetic Testing Registry, HIV-1/Human Protein Interaction Database and MedGen), genomes (BioProject, Assembly, Genome, BioSample, dbSNP, dbVar, Epigenomics, the Map Viewer, Nucleotide, Probe, RefSeq, Sequence Read Archive, the Taxonomy Browser and the Trace Archive), genes (Gene, Gene Expression Omnibus (GEO), HomoloGene, PopSet and UniGene), proteins (Protein, the Conserved Domain Database (CDD), COBALT, Conserved Domain Architecture Retrieval Tool (CDART), the Molecular Modeling Database (MMDB) and Protein Clusters) and chemicals (Biosystems and the PubChem suite of small molecule databases). The Entrez system provides search and retrieval operations for most of these databases. Augmenting many of the web applications are custom implementations of the BLAST program optimized to search specialized datasets. All of these resources can be accessed through the NCBI home page at www.ncbi.nlm.nih.gov. Published by Oxford University Press on behalf of Nucleic Acids Research 2015. This work is written by (a) US Government employee(s) and is in the public domain in the US.
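Entrez search and retrieval can also be driven programmatically. The abstract does not describe this route; the sketch below assumes NCBI's public E-utilities web API and only builds an `esearch` request URL with the Python standard library, without sending it. The helper name `esearch_url` and the example query are hypothetical.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities (an assumption of this sketch;
# the abstract itself only mentions the Entrez system in general terms).
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db: str, term: str, retmax: int = 20) -> str:
    """Build an E-utilities esearch URL for one Entrez database."""
    params = {
        "db": db,          # Entrez database name, e.g. "protein" or "pubmed"
        "term": term,      # Entrez query string
        "retmax": retmax,  # maximum number of identifiers to return
        "retmode": "json", # ask for a JSON response
    }
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


if __name__ == "__main__":
    # Prints the request URL; fetching it is left to the caller.
    print(esearch_url("protein", "FHIT", retmax=5))
```

Building the URL separately from fetching it keeps the example runnable offline. Anyone sending such requests should check NCBI's current usage guidelines first.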
Database resources of the National Center for Biotechnology Information.
2015-01-01
The National Center for Biotechnology Information (NCBI) provides a large suite of online resources for biological information and data, including the GenBank® nucleic acid sequence database and the PubMed database of citations and abstracts for published life science journals. Additional NCBI resources focus on literature (Bookshelf, PubMed Central (PMC) and PubReader); medical genetics (ClinVar, dbMHC, the Genetic Testing Registry, HIV-1/Human Protein Interaction Database and MedGen); genes and genomics (BioProject, BioSample, dbSNP, dbVar, Epigenomics, Gene, Gene Expression Omnibus (GEO), Genome, HomoloGene, the Map Viewer, Nucleotide, PopSet, Probe, RefSeq, Sequence Read Archive, the Taxonomy Browser, Trace Archive and UniGene); and proteins and chemicals (Biosystems, COBALT, the Conserved Domain Database (CDD), the Conserved Domain Architecture Retrieval Tool (CDART), the Molecular Modeling Database (MMDB), Protein Clusters, Protein and the PubChem suite of small molecule databases). The Entrez system provides search and retrieval operations for many of these databases. Augmenting many of the Web applications are custom implementations of the BLAST program optimized to search specialized data sets. All of these resources can be accessed through the NCBI home page at http://www.ncbi.nlm.nih.gov. Published by Oxford University Press on behalf of Nucleic Acids Research 2014. This work is written by (a) US Government employee(s) and is in the public domain in the US.
RAIN: RNA–protein Association and Interaction Networks
Junge, Alexander; Refsgaard, Jan C.; Garde, Christian; Pan, Xiaoyong; Santos, Alberto; Alkan, Ferhat; Anthon, Christian; von Mering, Christian; Workman, Christopher T.; Jensen, Lars Juhl; Gorodkin, Jan
2017-01-01
Protein association networks can be inferred from a range of resources including experimental data, literature mining and computational predictions. These types of evidence are emerging for non-coding RNAs (ncRNAs) as well. However, integration of ncRNAs into protein association networks is challenging due to data heterogeneity. Here, we present a database of ncRNA–RNA and ncRNA–protein interactions and its integration with the STRING database of protein–protein interactions. These ncRNA associations cover four organisms and have been established from curated examples, experimental data, interaction predictions and automatic literature mining. RAIN uses an integrative scoring scheme to assign a confidence score to each interaction. We demonstrate that RAIN outperforms the underlying microRNA-target predictions in inferring ncRNA interactions. RAIN can be operated through an easily accessible web interface and all interaction data can be downloaded. Database URL: http://rth.dk/resources/rain PMID:28077569
Follicle Online: an integrated database of follicle assembly, development and ovulation.
Hua, Juan; Xu, Bo; Yang, Yifan; Ban, Rongjun; Iqbal, Furhan; Cooke, Howard J; Zhang, Yuanwei; Shi, Qinghua
2015-01-01
Folliculogenesis is an important part of ovarian function as it provides the oocytes for female reproductive life. Characterizing genes/proteins involved in folliculogenesis is fundamental for understanding the mechanisms associated with this biological function and for curing the diseases associated with folliculogenesis. A large number of genes/proteins associated with folliculogenesis have been identified from different species. However, no dedicated public resource is currently available for folliculogenesis-related genes/proteins that are validated by experiments. Here, we report a database, 'Follicle Online', that provides the experimentally validated gene/protein map of folliculogenesis in a number of species. Follicle Online is a web-based database system for storing and retrieving folliculogenesis-related experimental data. It provides detailed information for 580 genes/proteins (from 23 model organisms, including Homo sapiens, Mus musculus, Rattus norvegicus, Mesocricetus auratus, Bos taurus, Drosophila and Xenopus laevis) that have been reported to be involved in folliculogenesis, POF (premature ovarian failure) and PCOS (polycystic ovary syndrome). The literature was manually curated from more than 43,000 published articles (till 1 March 2014). The Follicle Online database is implemented in PHP + MySQL + JavaScript and this user-friendly web application provides access to the stored data. In summary, we have developed a centralized database that provides users with comprehensive information about genes/proteins involved in folliculogenesis. This database can be accessed freely and all the stored data can be viewed without any registration. Database URL: http://mcg.ustc.edu.cn/sdap1/follicle/index.php © The Author(s) 2015. Published by Oxford University Press.
Follicle Online: an integrated database of follicle assembly, development and ovulation
Hua, Juan; Xu, Bo; Yang, Yifan; Ban, Rongjun; Iqbal, Furhan; Zhang, Yuanwei; Shi, Qinghua
2015-01-01
Folliculogenesis is an important part of ovarian function as it provides the oocytes for female reproductive life. Characterizing genes/proteins involved in folliculogenesis is fundamental for understanding the mechanisms associated with this biological function and for curing the diseases associated with folliculogenesis. A large number of genes/proteins associated with folliculogenesis have been identified from different species. However, no dedicated public resource is currently available for folliculogenesis-related genes/proteins that are validated by experiments. Here, we report a database, ‘Follicle Online’, that provides the experimentally validated gene/protein map of folliculogenesis in a number of species. Follicle Online is a web-based database system for storing and retrieving folliculogenesis-related experimental data. It provides detailed information for 580 genes/proteins (from 23 model organisms, including Homo sapiens, Mus musculus, Rattus norvegicus, Mesocricetus auratus, Bos taurus, Drosophila and Xenopus laevis) that have been reported to be involved in folliculogenesis, POF (premature ovarian failure) and PCOS (polycystic ovary syndrome). The literature was manually curated from more than 43 000 published articles (till 1 March 2014). The Follicle Online database is implemented in PHP + MySQL + JavaScript and this user-friendly web application provides access to the stored data. In summary, we have developed a centralized database that provides users with comprehensive information about genes/proteins involved in folliculogenesis. This database can be accessed freely and all the stored data can be viewed without any registration. Database URL: http://mcg.ustc.edu.cn/sdap1/follicle/index.php PMID:25931457
Text mining for metabolic pathways, signaling cascades, and protein networks.
Hoffmann, Robert; Krallinger, Martin; Andres, Eduardo; Tamames, Javier; Blaschke, Christian; Valencia, Alfonso
2005-05-10
The complexity of the information stored in databases and publications on metabolic and signaling pathways, the high throughput of experimental data, and the growing number of publications make it imperative to provide systems to help the researcher navigate through these interrelated information resources. Text-mining methods have started to play a key role in the creation and maintenance of links between the information stored in biological databases and its original sources in the literature. These links will be extremely useful for database updating and curation, especially if a number of technical problems can be solved satisfactorily, including the identification of protein and gene names (entities in general) and the characterization of their types of interactions. The first generation of openly accessible text-mining systems, such as iHOP (Information Hyperlinked over Proteins), provides additional functions to facilitate the reconstruction of protein interaction networks, combine database and text information, and support the scientist in the formulation of novel hypotheses. The next challenge is the generation of comprehensive information regarding the general function of signaling pathways and protein interaction networks.
DITOP: drug-induced toxicity related protein database.
Zhang, Jing-Xian; Huang, Wei-Juan; Zeng, Jing-Hua; Huang, Wen-Hui; Wang, Yi; Zhao, Rui; Han, Bu-Cong; Liu, Qing-Feng; Chen, Yu-Zong; Ji, Zhi-Liang
2007-07-01
Drug-induced toxicity related proteins (DITRPs) are proteins that mediate adverse drug reactions (ADRs) or toxicities through their binding to drugs or reactive metabolites. Collecting these proteins facilitates a better understanding of the molecular mechanisms of drug-induced toxicity and supports rational drug discovery. The drug-induced toxicity related protein database (DITOP) is intended to provide comprehensive information on DITRPs. Currently, DITOP contains 1501 records, covering 618 distinct literature-reported DITRPs, 529 drugs/ligands and 418 distinct toxicity terms. These proteins were confirmed experimentally to interact with drugs or their reactive metabolites and thus to cause adverse effects or toxicities, directly or indirectly. Five major types of drug-induced toxicities or ADRs are included in DITOP: idiosyncratic adverse drug reactions, dose-dependent toxicities, drug-drug interactions, immune-mediated adverse drug effects (IMADEs) and toxicities caused by genetic susceptibility. Molecular mechanisms underlying the toxicity and cross-links to related resources are also provided where available. Moreover, a series of user-friendly interfaces were designed for flexible retrieval of DITRP-related information. DITOP can be accessed freely at http://bioinf.xmu.edu.cn/databases/ADR/index.html. Supplementary data are available at Bioinformatics online.
ChemProt-2.0: visual navigation in a disease chemical biology database
Kim Kjærulff, Sonny; Wich, Louis; Kringelum, Jens; Jacobsen, Ulrik P.; Kouskoumvekaki, Irene; Audouze, Karine; Lund, Ole; Brunak, Søren; Oprea, Tudor I.; Taboureau, Olivier
2013-01-01
ChemProt-2.0 (http://www.cbs.dtu.dk/services/ChemProt-2.0) is a publicly available compilation of multiple chemical–protein annotation resources integrated with disease and clinical outcome information. The database has been updated to >1.15 million compounds with 5.32 million bioactivity measurements for 15 290 proteins. Each protein is linked to quality-scored human protein–protein interaction data based on more than half a million interactions, for studying diseases and biological outcomes (diseases, pathways and GO terms) through protein complexes. In ChemProt-2.0, therapeutic effects as well as adverse drug reactions have been integrated, making it possible to suggest proteins associated with clinical outcomes. New chemical structure fingerprints were computed based on the similarity ensemble approach. Protein sequence similarity search was also integrated to evaluate the promiscuity of proteins, which can help in the prediction of off-target effects. Finally, the database was integrated into a visual interface that enables navigation of the pharmacological space for small molecules. Filtering options were included to facilitate and guide dynamic searches for specific queries. PMID:23185041
Transterm: a database to aid the analysis of regulatory sequences in mRNAs
Jacobs, Grant H.; Chen, Augustine; Stevens, Stewart G.; Stockwell, Peter A.; Black, Michael A.; Tate, Warren P.; Brown, Chris M.
2009-01-01
Messenger RNAs, in addition to coding for proteins, may contain regulatory elements that affect how the protein is translated. These include protein and microRNA-binding sites. Transterm (http://mRNA.otago.ac.nz/Transterm.html) is a database of regions and elements that affect translation with two major unique components. The first is integrated results of analysis of general features that affect translation (initiation, elongation, termination) for species or strains in GenBank, processed through a standard pipeline. The second is curated descriptions of experimentally determined regulatory elements that function as translational control elements in mRNAs. Transterm focuses on protein binding sites, particularly those in 3′-untranslated regions (3′-UTR). For this release the interface has been extensively updated based on user feedback. The data are now accessible by strain rather than species; for example, there are 10 Escherichia coli strains (genomes) analysed separately. In addition to providing a repository of data, the database also provides tools for users to query their own mRNA sequences. Users can search sequences for Transterm or user-defined regulatory elements, including protein or miRNA targets. Transterm also provides a central core of links to related resources for complementary analyses. PMID:18984623
JAIL: a structure-based interface library for macromolecules.
Günther, Stefan; von Eichborn, Joachim; May, Patrick; Preissner, Robert
2009-01-01
The increasing number of solved macromolecules provides a substantial number of 3D interfaces, if all types of molecular contacts are considered. JAIL annotates three different kinds of macromolecular interfaces: those between interacting protein domains, those between different protein chains and those between proteins and nucleic acids. This results in a total of about 184,000 database entries. All the interfaces can easily be identified by a detailed search form or by a hierarchical tree that describes the protein domain architectures classified by the SCOP database. Visual inspection of the interfaces is possible via an interactive protein viewer. Furthermore, large-scale analyses are supported by both sequence-based and structure-based clustering. Similar interfaces as well as non-redundant interfaces can be easily picked out. Additionally, the sequence conservation of binding sites is included in the database and is retrievable via Jmol. A comprehensive download section allows the composition of representative data sets with user-defined parameters. The huge data set in combination with various search options allows a comprehensive view of all interfaces between macromolecules included in the Protein Data Bank (PDB). The download of the data sets supports numerous further investigations in macromolecular recognition. JAIL is publicly available at http://bioinformatics.charite.de/jail.
Curated protein information in the Saccharomyces genome database.
Hellerstedt, Sage T; Nash, Robert S; Weng, Shuai; Paskov, Kelley M; Wong, Edith D; Karra, Kalpana; Engel, Stacia R; Cherry, J Michael
2017-01-01
Due to recent advancements in the production of experimental proteomic data, the Saccharomyces genome database (SGD; www.yeastgenome.org) has been expanding our protein curation activities to make new data types available to our users. Because of broad interest in post-translational modifications (PTM) and their importance to protein function and regulation, we have recently started incorporating expertly curated PTM information on individual protein pages. Here we also present the inclusion of new abundance and protein half-life data obtained from high-throughput proteome studies. These new data types have been included with the aim of facilitating cellular biology research. Database URL: www.yeastgenome.org. © The Author(s) 2017. Published by Oxford University Press.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Yusim, Karina; Korber, Bette Tina Marie; Barouch, Dan
HIV Molecular Immunology is a companion volume to HIV Sequence Compendium. This publication, the 2014 edition, is the PDF version of the web-based HIV Immunology Database (http://www.hiv.lanl.gov/content/immunology/). The web interface for this relational database has many search options, as well as interactive tools to help immunologists design reagents and interpret their results. In the HIV Immunology Database, HIV-specific B-cell and T-cell responses are summarized and annotated. Immunological responses are divided into three parts: CTL, T helper, and antibody. Within these parts, defined epitopes are organized by protein and binding sites within each protein, moving from left to right through the coding regions spanning the HIV genome. We include human responses to natural HIV infections, as well as vaccine studies in a range of animal models and human trials. Responses that are not specifically defined, such as responses to whole proteins or monoclonal antibody responses to discontinuous epitopes, are summarized at the end of each protein section. Studies describing general HIV responses to the virus, but not to any specific protein, are included at the end of each part. The annotation includes information such as crossreactivity, escape mutations, antibody sequence, TCR usage, functional domains that overlap with an epitope, immune response associations with rates of progression and therapy, and how specific epitopes were experimentally defined. Basic information such as HLA specificities for T-cell epitopes, isotypes of monoclonal antibodies, and epitope sequences are included whenever possible. All studies that we can find that incorporate the use of a specific monoclonal antibody are included in the entry for that antibody. A single T-cell epitope can have multiple entries, generally one entry per study. Finally, maps of all defined linear epitopes relative to the HXB2 reference proteins are provided.
The MAR databases: development and implementation of databases specific for marine metagenomics.
Klemetsen, Terje; Raknes, Inge A; Fu, Juan; Agafonov, Alexander; Balasundaram, Sudhagar V; Tartari, Giacomo; Robertsen, Espen; Willassen, Nils P
2018-01-04
We introduce the marine databases MarRef, MarDB and MarCat (https://mmp.sfb.uit.no/databases/), which are publicly available resources that promote marine research and innovation. These data resources, which have been implemented in the Marine Metagenomics Portal (MMP) (https://mmp.sfb.uit.no/), are collections of richly annotated and manually curated contextual (metadata) and sequence databases representing three tiers of accuracy. While MarRef is a database of completely sequenced marine prokaryotic genomes, which represents a marine prokaryote reference genome database, MarDB includes all incompletely sequenced prokaryotic genomes regardless of level of completeness. The last database, MarCat, represents a gene (protein) catalog of uncultivable (and cultivable) marine genes and proteins derived from marine metagenomics samples. The first versions of MarRef and MarDB contain 612 and 3726 records, respectively. Each record is built up of 106 metadata fields including attributes for sampling, sequencing, assembly and annotation in addition to the organism and taxonomic information. Currently, MarCat contains 1227 records with 55 metadata fields. Ontologies and controlled vocabularies are used in the contextual databases to enhance consistency. The user-friendly web interface lets visitors browse, filter and search in the contextual databases and perform BLAST searches against the corresponding sequence databases. All contextual and sequence databases are freely accessible and downloadable from https://s1.sfb.uit.no/public/mar/. © The Author(s) 2017. Published by Oxford University Press on behalf of Nucleic Acids Research.
Rattei, Thomas; Tischler, Patrick; Götz, Stefan; Jehl, Marc-André; Hoser, Jonathan; Arnold, Roland; Conesa, Ana; Mewes, Hans-Werner
2010-01-01
The prediction of protein function and the reconstruction of evolutionary genesis by large-scale sequence comparison is still the most powerful tool in sequence analysis. Due to the exponential growth of the number of known protein sequences and the subsequent quadratic growth of the similarity matrix, the computation of the Similarity Matrix of Proteins (SIMAP) becomes a computationally intensive task. The SIMAP database provides a comprehensive and up-to-date pre-calculation of the protein sequence similarity matrix, sequence-based features and sequence clusters. As of September 2009, SIMAP covers 48 million proteins and more than 23 million non-redundant sequences. Novel features of SIMAP include the expansion of the sequence space by including databases such as ENSEMBL as well as the integration of metagenomes based on their consistent processing and annotation. Furthermore, protein function predictions by Blast2GO are pre-calculated for all sequences in SIMAP and the data access and query functions have been improved. SIMAP helps biologists query the up-to-date sequence space systematically and facilitates large-scale downstream projects in computational biology. Access to SIMAP is freely provided through the web portal for individuals (http://mips.gsf.de/simap/) and for programmatic access through DAS (http://webclu.bio.wzw.tum.de/das/) and Web-Service (http://mips.gsf.de/webservices/services/SimapService2.0?wsdl).
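The quadratic growth mentioned in this abstract can be illustrated with a short calculation. The sketch below is not part of SIMAP; it only counts the unordered pairs in a naive all-against-all comparison, and the function name is chosen here for illustration. For the roughly 23 million non-redundant sequences reported, that naive count is about 2.6 × 10^14 pairs.

```python
def pairwise_comparisons(n_sequences: int) -> int:
    """Number of unordered sequence pairs in a naive all-against-all comparison."""
    if n_sequences < 0:
        raise ValueError("n_sequences must be non-negative")
    return n_sequences * (n_sequences - 1) // 2


if __name__ == "__main__":
    # Doubling the number of sequences roughly quadruples the number of pairs.
    for n in (1_000, 2_000, 4_000):
        print(n, pairwise_comparisons(n))
```

This is why a pre-calculated matrix that is updated incrementally is attractive: each newly added sequence has to be compared against every sequence already stored.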
Toseland, Christopher P; Clayton, Debra J; McSparron, Helen; Hemsley, Shelley L; Blythe, Martin J; Paine, Kelly; Doytchinova, Irini A; Guan, Pingping; Hattotuwagama, Channa K; Flower, Darren R
2005-01-01
AntiJen is a database system focused on the integration of kinetic, thermodynamic, functional, and cellular data within the context of immunology and vaccinology. Compared to its progenitor JenPep, the interface has been completely rewritten and redesigned and now offers a wider variety of search methods, including a nucleotide and a peptide BLAST search. In terms of data archived, AntiJen has a richer and more complete breadth, depth, and scope, and this has seen the database increase to over 31,000 entries. AntiJen provides the most complete and up-to-date dataset of its kind. While AntiJen v2.0 retains a focus on both T cell and B cell epitopes, its greatest novelty is the archiving of continuous quantitative data on a variety of immunological molecular interactions. This includes thermodynamic and kinetic measures of peptide binding to TAP and the Major Histocompatibility Complex (MHC), peptide-MHC complexes binding to T cell receptors, antibodies binding to protein antigens and general immunological protein-protein interactions. The database also contains quantitative specificity data from position-specific peptide libraries and biophysical data, in the form of diffusion co-efficients and cell surface copy numbers, on MHCs and other immunological molecules. The uses of AntiJen include the design of vaccines and diagnostics, such as tetramers, and other laboratory reagents, as well as helping parameterize the bioinformatic or mathematical in silico modeling of the immune system. The database is accessible from the URL: . PMID:16305757
The BioGRID interaction database: 2017 update
Chatr-aryamontri, Andrew; Oughtred, Rose; Boucher, Lorrie; Rust, Jennifer; Chang, Christie; Kolas, Nadine K.; O'Donnell, Lara; Oster, Sara; Theesfeld, Chandra; Sellam, Adnane; Stark, Chris; Breitkreutz, Bobby-Joe; Dolinski, Kara; Tyers, Mike
2017-01-01
The Biological General Repository for Interaction Datasets (BioGRID: https://thebiogrid.org) is an open access database dedicated to the annotation and archival of protein, genetic and chemical interactions for all major model organism species and humans. As of September 2016 (build 3.4.140), the BioGRID contains 1 072 173 genetic and protein interactions, and 38 559 post-translational modifications, as manually annotated from 48 114 publications. This dataset represents interaction records for 66 model organisms and represents a 30% increase compared to the previous 2015 BioGRID update. BioGRID curates the biomedical literature for major model organism species, including humans, with a recent emphasis on central biological processes and specific human diseases. To facilitate network-based approaches to drug discovery, BioGRID now incorporates 27 501 chemical–protein interactions for human drug targets, as drawn from the DrugBank database. A new dynamic interaction network viewer allows the easy navigation and filtering of all genetic and protein interaction data, as well as for bioactive compounds and their established targets. BioGRID data are directly downloadable without restriction in a variety of standardized formats and are freely distributed through partner model organism databases and meta-databases. PMID:27980099
d'Acierno, Antonio; Facchiano, Angelo; Marabotti, Anna
2009-06-01
We describe the GALT-Prot database and its related web-based application that have been developed to collect information about the structural and functional effects of mutations on the human enzyme galactose-1-phosphate uridyltransferase (GALT) involved in the genetic disease named galactosemia type I. Besides a list of missense mutations at gene and protein sequence levels, GALT-Prot reports the analysis results of mutant GALT structures. In addition to the structural information about the wild-type enzyme, the database also includes structures of over 100 single point mutants simulated by means of a computational procedure, and the analysis to each mutant was made with several bioinformatics programs in order to investigate the effect of the mutations. The web-based interface allows querying of the database, and several links are also provided in order to guarantee a high integration with other resources already present on the web. Moreover, the architecture of the database and the web application is flexible and can be easily adapted to store data related to other proteins with point mutations. GALT-Prot is freely available at http://bioinformatica.isa.cnr.it/GALT/.
Human Mitochondrial Protein Database
National Institute of Standards and Technology Data Gateway
SRD 131 Human Mitochondrial Protein Database (Web, free access) The Human Mitochondrial Protein Database (HMPDb) provides comprehensive data on mitochondrial and human nuclear encoded proteins involved in mitochondrial biogenesis and function. This database consolidates information from SwissProt, LocusLink, Protein Data Bank (PDB), GenBank, Genome Database (GDB), Online Mendelian Inheritance in Man (OMIM), Human Mitochondrial Genome Database (mtDB), MITOMAP, Neuromuscular Disease Center and Human 2-D PAGE Databases. This database is intended as a tool not only to aid in studying the mitochondrion but in studying the associated diseases.
Saunders, Brian; Lyon, Stephen; Day, Matthew; Riley, Brenda; Chenette, Emily; Subramaniam, Shankar
2008-01-01
The UCSD-Nature Signaling Gateway Molecule Pages (http://www.signaling-gateway.org/molecule) provides essential information on more than 3800 mammalian proteins involved in cellular signaling. The Molecule Pages contain expert-authored and peer-reviewed information based on the published literature, complemented by regularly updated information derived from public data source references and sequence analysis. The expert-authored data includes both a full-text review about the molecule, with citations, and highly structured data for bioinformatics interrogation, including information on protein interactions and states, transitions between states and protein function. The expert-authored pages are anonymously peer reviewed by the Nature Publishing Group. The Molecule Pages data is present in an object-relational database format and is freely accessible to the authors, the reviewers and the public from a web browser that serves as a presentation layer. The Molecule Pages are supported by several applications that along with the database and the interfaces form a multi-tier architecture. The Molecule Pages and the Signaling Gateway are routinely accessed by a very large research community. PMID:17965093
Saunders, Brian; Lyon, Stephen; Day, Matthew; Riley, Brenda; Chenette, Emily; Subramaniam, Shankar; Vadivelu, Ilango
2008-01-01
The UCSD-Nature Signaling Gateway Molecule Pages (http://www.signaling-gateway.org/molecule) provides essential information on more than 3800 mammalian proteins involved in cellular signaling. The Molecule Pages contain expert-authored and peer-reviewed information based on the published literature, complemented by regularly updated information derived from public data source references and sequence analysis. The expert-authored data includes both a full-text review about the molecule, with citations, and highly structured data for bioinformatics interrogation, including information on protein interactions and states, transitions between states and protein function. The expert-authored pages are anonymously peer reviewed by the Nature Publishing Group. The Molecule Pages data is present in an object-relational database format and is freely accessible to the authors, the reviewers and the public from a web browser that serves as a presentation layer. The Molecule Pages are supported by several applications that along with the database and the interfaces form a multi-tier architecture. The Molecule Pages and the Signaling Gateway are routinely accessed by a very large research community.
PaperBLAST: Text Mining Papers for Information about Homologs.
Price, Morgan N; Arkin, Adam P
2017-01-01
Large-scale genome sequencing has identified millions of protein-coding genes whose function is unknown. Many of these proteins are similar to characterized proteins from other organisms, but much of this information is missing from annotation databases and is hidden in the scientific literature. To make this information accessible, PaperBLAST uses EuropePMC to search the full text of scientific articles for references to genes. PaperBLAST also takes advantage of curated resources (Swiss-Prot, GeneRIF, and EcoCyc) that link protein sequences to scientific articles. PaperBLAST's database includes over 700,000 scientific articles that mention over 400,000 different proteins. Given a protein of interest, PaperBLAST quickly finds similar proteins that are discussed in the literature and presents snippets of text from relevant articles or from the curators. PaperBLAST is available at http://papers.genomics.lbl.gov/. IMPORTANCE With the recent explosion of genome sequencing data, there are now millions of uncharacterized proteins. If a scientist becomes interested in one of these proteins, it can be very difficult to find information as to its likely function. Often a protein whose sequence is similar, and which is likely to have a similar function, has been studied already, but this information is not available in any database. To help find articles about similar proteins, PaperBLAST searches the full text of scientific articles for protein identifiers or gene identifiers, and it links these articles to protein sequences. Then, given a protein of interest, it can quickly find similar proteins in its database by using standard software (BLAST), and it can show snippets of text from relevant papers. We hope that PaperBLAST will make it easier for biologists to predict proteins' functions.
A Brief Review of RNA–Protein Interaction Database Resources
Yi, Ying; Zhao, Yue; Huang, Yan; Wang, Dong
2017-01-01
RNA–Protein interactions play critical roles in various biological processes. By collecting and analyzing RNA–Protein interactions and binding sites from experiments and predictions, RNA–Protein interaction databases have become an essential resource for exploring the transcriptional and post-transcriptional regulatory network. Here, we briefly review several widely used RNA–Protein interaction database resources developed in recent years to provide a guide to these databases. The content and major functions of the databases are presented. The brief description of each database helps users quickly choose the database containing the information they are interested in. In short, these RNA–Protein interaction database resources are continually updated, and their current state reflects the effort made to identify and analyze the large number of RNA–Protein interactions. PMID:29657278
Benchmarking protein classification algorithms via supervised cross-validation.
Kertész-Farkas, Attila; Dhir, Somdutta; Sonego, Paolo; Pacurar, Mircea; Netoteia, Sergiu; Nijveen, Harm; Kuzniar, Arnold; Leunissen, Jack A M; Kocsor, András; Pongor, Sándor
2008-04-24
Development and testing of protein classification algorithms are hampered by the fact that the protein universe is characterized by groups vastly different in the number of members, in average protein size, similarity within group, etc. Datasets based on traditional cross-validation (k-fold, leave-one-out, etc.) may not give reliable estimates on how an algorithm will generalize to novel, distantly related subtypes of the known protein classes. Supervised cross-validation, i.e., selection of test and train sets according to the known subtypes within a database has been successfully used earlier in conjunction with the SCOP database. Our goal was to extend this principle to other databases and to design standardized benchmark datasets for protein classification. Hierarchical classification trees of protein categories provide a simple and general framework for designing supervised cross-validation strategies for protein classification. Benchmark datasets can be designed at various levels of the concept hierarchy using a simple graph-theoretic distance. A combination of supervised and random sampling was selected to construct reduced size model datasets, suitable for algorithm comparison. Over 3000 new classification tasks were added to our recently established protein classification benchmark collection that currently includes protein sequence (including protein domains and entire proteins), protein structure and reading frame DNA sequence data. We carried out an extensive evaluation based on various machine-learning algorithms such as nearest neighbor, support vector machines, artificial neural networks, random forests and logistic regression, used in conjunction with comparison algorithms, BLAST, Smith-Waterman, Needleman-Wunsch, as well as 3D comparison methods DALI and PRIDE. The resulting datasets provide lower, and in our opinion more realistic estimates of the classifier performance than do random cross-validation schemes. 
A combination of supervised and random sampling was used to construct model datasets, suitable for algorithm comparison.
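The idea of supervised cross-validation described above can be sketched as follows. This is a minimal illustration, not the authors' benchmark code: the record layout `(protein_id, class_label, subtype_label)` and the function name are assumptions made here, and only the positive members of one class are split (negative examples from other classes are left out for brevity).

```python
from collections import defaultdict


def supervised_splits(records, target_class):
    """Yield (held_out_subtype, train_ids, test_ids) for one protein class.

    records: iterable of (protein_id, class_label, subtype_label) tuples
             (a hypothetical layout used only for this sketch).
    Each known subtype of target_class is held out in turn, so the test
    members always come from a subtype that was not seen during training.
    """
    by_subtype = defaultdict(list)
    for protein_id, class_label, subtype_label in records:
        if class_label == target_class:
            by_subtype[subtype_label].append(protein_id)

    for held_out in sorted(by_subtype):
        test_ids = list(by_subtype[held_out])
        train_ids = [
            protein_id
            for subtype in sorted(by_subtype)
            if subtype != held_out
            for protein_id in by_subtype[subtype]
        ]
        if train_ids:  # skip classes that have only a single subtype
            yield held_out, train_ids, test_ids
```

In contrast to random k-fold splitting, no subtype contributes members to both the training and the test set, which is what makes the resulting performance estimates more conservative.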
DOE Office of Scientific and Technical Information (OSTI.GOV)
Yusim, Karina; Korber, Bette Tina; Brander, Christian
The scope and purpose of the HIV molecular immunology database: HIV Molecular Immunology is a companion volume to HIV Sequence Compendium. This publication, the 2015 edition, is the PDF version of the web-based HIV Immunology Database (http://www.hiv.lanl.gov/content/immunology/). The web interface for this relational database has many search options, as well as interactive tools to help immunologists design reagents and interpret their results. In the HIV Immunology Database, HIV-specific B-cell and T-cell responses are summarized and annotated. Immunological responses are divided into three parts, CTL, T helper, and antibody. Within these parts, defined epitopes are organized by protein and binding sites within each protein, moving from left to right through the coding regions spanning the HIV genome. We include human responses to natural HIV infections, as well as vaccine studies in a range of animal models and human trials. Responses that are not specifically defined, such as responses to whole proteins or monoclonal antibody responses to discontinuous epitopes, are summarized at the end of each protein section. Studies describing general HIV responses to the virus, but not to any specific protein, are included at the end of each part. The annotation includes information such as cross-reactivity, escape mutations, antibody sequence, TCR usage, functional domains that overlap with an epitope, immune response associations with rates of progression and therapy, and how specific epitopes were experimentally defined. Basic information such as HLA specificities for T-cell epitopes, isotypes of monoclonal antibodies, and epitope sequences are included whenever possible. All studies that we can find that incorporate the use of a specific monoclonal antibody are included in the entry for that antibody. A single T-cell epitope can have multiple entries, generally one entry per study. Finally, maps of all defined linear epitopes relative to the HXB2 reference proteins are provided. Alignments of CTL, helper T-cell, and antibody epitopes are available through the search interface on our web site at http://www.hiv.lanl.gov/content/immunology.
Hume, Maxwell A; Barrera, Luis A; Gisselbrecht, Stephen S; Bulyk, Martha L
2015-01-01
The Universal PBM Resource for Oligonucleotide Binding Evaluation (UniPROBE) serves as a convenient source of information on published data generated using universal protein-binding microarray (PBM) technology, which provides in vitro data about the relative DNA-binding preferences of transcription factors for all possible sequence variants of a length k ('k-mers'). The database displays important information about the proteins and displays their DNA-binding specificity data in terms of k-mers, position weight matrices and graphical sequence logos. This update to the database documents the growth of UniPROBE since the last update 4 years ago, and introduces a variety of new features and tools, including a new streamlined pipeline that facilitates data deposition by universal PBM data generators in the research community, a tool that generates putative nonbinding (i.e. negative control) DNA sequences for one or more proteins and novel motifs obtained by analyzing the PBM data using the BEEML-PBM algorithm for motif inference. The UniPROBE database is available at http://uniprobe.org. © The Author(s) 2014. Published by Oxford University Press on behalf of Nucleic Acids Research.
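Two of the representations named in this abstract, the set of all possible k-mers and a position matrix built from aligned binding sites, can be sketched in a few lines. The code below is an illustration only: it does not use UniPROBE data or the BEEML-PBM algorithm, the example sites in the usage note are invented, and the matrix holds plain per-position base frequencies rather than the log-odds weights often used in position weight matrices.

```python
from itertools import product

BASES = "ACGT"


def all_kmers(k):
    """Return every DNA sequence of length k (4**k strings)."""
    return ["".join(letters) for letters in product(BASES, repeat=k)]


def position_frequency_matrix(sites):
    """Per-position base frequencies for equal-length aligned sites.

    Returns a list with one dict per alignment column, mapping each base
    to its relative frequency in that column.
    """
    if not sites:
        raise ValueError("at least one site is required")
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")

    matrix = []
    for position in range(length):
        column = [site[position] for site in sites]
        matrix.append({base: column.count(base) / len(sites) for base in BASES})
    return matrix
```

For example, `position_frequency_matrix(["ACG", "ACT", "AGG", "ACG"])` gives a first column in which `A` has frequency 1.0, which is the kind of information a sequence logo displays graphically.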
Interactive and Versatile Navigation of Structural Databases.
Korb, Oliver; Kuhn, Bernd; Hert, Jérôme; Taylor, Neil; Cole, Jason; Groom, Colin; Stahl, Martin
2016-05-12
We present CSD-CrossMiner, a novel tool for pharmacophore-based searches in crystal structure databases. Intuitive pharmacophore queries describing, among others, protein-ligand interaction patterns, ligand scaffolds, or protein environments can be built and modified interactively. Matching crystal structures are overlaid onto the query and visualized as soon as they are available, enabling the researcher to quickly modify a hypothesis on the fly. We exemplify the utility of the approach by showing applications relevant to real-world drug discovery projects, including the identification of novel fragments for a specific protein environment or scaffold hopping. The ability to concurrently search protein-ligand binding sites extracted from the Protein Data Bank (PDB) and small organic molecules from the Cambridge Structural Database (CSD) using the same pharmacophore query further emphasizes the flexibility of CSD-CrossMiner. We believe that CSD-CrossMiner closes an important gap in mining structural data and will allow users to extract more value from the growing number of available crystal structures.
PaperBLAST: Text Mining Papers for Information about Homologs
Price, Morgan N.; Arkin, Adam P.
2017-08-15
Large-scale genome sequencing has identified millions of protein-coding genes whose function is unknown. Many of these proteins are similar to characterized proteins from other organisms, but much of this information is missing from annotation databases and is hidden in the scientific literature. To make this information accessible, PaperBLAST uses EuropePMC to search the full text of scientific articles for references to genes. PaperBLAST also takes advantage of curated resources (Swiss-Prot, GeneRIF, and EcoCyc) that link protein sequences to scientific articles. PaperBLAST’s database includes over 700,000 scientific articles that mention over 400,000 different proteins. Given a protein of interest, PaperBLAST quickly finds similar proteins that are discussed in the literature and presents snippets of text from relevant articles or from the curators. With the recent explosion of genome sequencing data, there are now millions of uncharacterized proteins. If a scientist becomes interested in one of these proteins, it can be very difficult to find information as to its likely function. Often a protein whose sequence is similar, and which is likely to have a similar function, has been studied already, but this information is not available in any database. To help find articles about similar proteins, PaperBLAST searches the full text of scientific articles for protein identifiers or gene identifiers, and it links these articles to protein sequences. Then, given a protein of interest, it can quickly find similar proteins in its database by using standard software (BLAST), and it can show snippets of text from relevant papers. We hope that PaperBLAST will make it easier for biologists to predict proteins’ functions.
PaperBLAST: Text Mining Papers for Information about Homologs
Arkin, Adam P.
2017-01-01
ABSTRACT Large-scale genome sequencing has identified millions of protein-coding genes whose function is unknown. Many of these proteins are similar to characterized proteins from other organisms, but much of this information is missing from annotation databases and is hidden in the scientific literature. To make this information accessible, PaperBLAST uses EuropePMC to search the full text of scientific articles for references to genes. PaperBLAST also takes advantage of curated resources (Swiss-Prot, GeneRIF, and EcoCyc) that link protein sequences to scientific articles. PaperBLAST’s database includes over 700,000 scientific articles that mention over 400,000 different proteins. Given a protein of interest, PaperBLAST quickly finds similar proteins that are discussed in the literature and presents snippets of text from relevant articles or from the curators. PaperBLAST is available at http://papers.genomics.lbl.gov/. IMPORTANCE With the recent explosion of genome sequencing data, there are now millions of uncharacterized proteins. If a scientist becomes interested in one of these proteins, it can be very difficult to find information as to its likely function. Often a protein whose sequence is similar, and which is likely to have a similar function, has been studied already, but this information is not available in any database. To help find articles about similar proteins, PaperBLAST searches the full text of scientific articles for protein identifiers or gene identifiers, and it links these articles to protein sequences. Then, given a protein of interest, it can quickly find similar proteins in its database by using standard software (BLAST), and it can show snippets of text from relevant papers. We hope that PaperBLAST will make it easier for biologists to predict proteins’ functions. PMID:28845458
Morphinome Database - The database of proteins altered by morphine administration - An update.
Bodzon-Kulakowska, Anna; Padrtova, Tereza; Drabik, Anna; Ner-Kluza, Joanna; Antolak, Anna; Kulakowski, Konrad; Suder, Piotr
2018-04-13
Morphine is considered a gold standard in pain treatment. Nevertheless, its use can be associated with severe side effects, including drug addiction. Thus, it is very important to understand the molecular mechanism of morphine action in order to develop new methods of pain therapy, or at least to attenuate the side effects of opioid use. Proteomics makes it possible to identify proteins involved in particular biological processes, but the number of items identified in a single study is usually overwhelming. Thus, researchers face the difficult problem of choosing the proteins that are really important for the investigated processes and worth further study. Therefore, based on 29 published articles, we created a database of proteins regulated by morphine administration - The Morphinome Database (addiction-proteomics.org). This web tool lists the proteins that were identified in different proteomics studies. Moreover, the collection and organization of such a vast amount of data allows us to find the proteins that were identified in several studies and to rank them by the frequency of their identification. STRING and KEGG databases indicated the metabolic pathways in which those molecules are involved. This means that those molecular pathways seem to be strongly affected by morphine administration and could be important targets for further investigations. The data about proteins identified by different proteomics studies of molecular changes caused by morphine administration (29 published articles) were gathered in the Morphinome Database. Unification of those data allowed for the identification of proteins that were reported several times by distinct proteomics studies, which means that they seem to be very well verified and important for the entire process. Those proteins might now be considered promising targets for more detailed studies of their role in the molecular mechanism of morphine action. Copyright © 2018. Published by Elsevier B.V.
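The frequency-based ranking described in this abstract can be sketched as below. This is an illustration under stated assumptions, not the Morphinome Database implementation: the input layout (a mapping from a study label to the protein identifiers it reported) and the function name are hypothetical, and ties are broken alphabetically purely to make the output deterministic.

```python
from collections import Counter


def rank_by_study_count(study_hits):
    """Rank proteins by the number of distinct studies that reported them.

    study_hits: dict mapping a study label to an iterable of protein
                identifiers (hypothetical layout for this sketch).
    Returns a list of (protein_id, n_studies), most frequent first.
    """
    counts = Counter()
    for proteins in study_hits.values():
        # Use a set so a protein listed twice in one study counts once.
        counts.update(set(proteins))
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))
```

Counting each protein once per study, rather than once per mention, keeps a single study with many replicate identifications from dominating the ranking.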
PrionHome: A Database of Prions and Other Sequences Relevant to Prion Phenomena
Harbi, Djamel; Parthiban, Marimuthu; Gendoo, Deena M. A.; Ehsani, Sepehr; Kumar, Manish; Schmitt-Ulms, Gerold; Sowdhamini, Ramanathan; Harrison, Paul M.
2012-01-01
Prions are units of propagation of an altered state of a protein or proteins; prions can propagate from organism to organism, through cooption of other protein copies. Prions contain no necessary nucleic acids, and are important both as pathogenic agents and as a potential force in epigenetic phenomena. The original prions were derived from a misfolded form of the mammalian Prion Protein PrP. Infection by these prions causes neurodegenerative diseases. Other prions cause non-Mendelian inheritance in budding yeast, and sometimes act as diseases of yeast. We report the bioinformatic construction of the PrionHome, a database of >2000 prion-related sequences. The data was collated from various public and private resources and filtered for redundancy. The data was then processed according to a transparent classification system of prionogenic sequences (i.e., sequences that can make prions), prionoids (i.e., proteins that propagate like prions between individual cells), and other prion-related phenomena. There are eight PrionHome classifications for sequences. The first four classifications are derived from experimental observations: prionogenic sequences, prionoids, other prion-related phenomena, and prion interactors. The second four classifications are derived from sequence analysis: orthologs, paralogs, pseudogenes, and candidate-prionogenic sequences. Database entries list: supporting information for PrionHome classifications, prion-determinant areas (where relevant), and disordered and compositionally-biased regions. Also included are literature references for the PrionHome classifications, transcripts and genomic coordinates, and structural data (including comparative models made for the PrionHome from manually curated alignments). We provide database usage examples for both vertebrate and fungal prion contexts. 
Using the database data, we have performed a detailed analysis of the compositional biases in known budding-yeast prionogenic sequences, showing that the only abundant bias pattern is for asparagine bias with subsidiary serine bias. We anticipate that this database will be a useful experimental aid and reference resource. It is freely available at: http://libaio.biol.mcgill.ca/prion. PMID:22363733
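The compositional-bias analysis mentioned above can be illustrated with a minimal sketch. It only counts residue fractions and flags residues above a cutoff; the abstract does not describe the statistical procedure the PrionHome authors used, so the function names, the example sequence, and the threshold below are all hypothetical.

```python
from collections import Counter


def residue_fractions(sequence: str) -> dict:
    """Return the fraction of each amino acid in a protein sequence."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    counts = Counter(seq)
    return {aa: n / len(seq) for aa, n in counts.items()}


def biased_residues(sequence: str, threshold: float = 0.2) -> list:
    """List residues whose fraction is at least `threshold` (an arbitrary,
    illustrative cutoff), most abundant first; ties are broken alphabetically."""
    fractions = residue_fractions(sequence)
    flagged = [aa for aa, f in fractions.items() if f >= threshold]
    return sorted(flagged, key=lambda aa: (-fractions[aa], aa))


if __name__ == "__main__":
    # Made-up asparagine-rich stretch, not a real prion domain.
    toy_domain = "NNQNSNNYSNQN"
    print(biased_residues(toy_domain, threshold=0.15))
```

A real analysis would compare these fractions against a background proteome rather than a fixed cutoff.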
Histoplasma capsulatum proteome response to decreased iron availability
Winters, Michael S; Spellman, Daniel S; Chan, Qilin; Gomez, Francisco J; Hernandez, Margarita; Catron, Brittany; Smulian, Alan G; Neubert, Thomas A; Deepe, George S
2008-01-01
Background A fundamental pathogenic feature of the fungus Histoplasma capsulatum is its ability to evade innate and adaptive immune defenses. Once ingested by macrophages, the organism is faced with several hostile environmental conditions including iron limitation. H. capsulatum can establish a persistent state within the macrophage. A gap in knowledge exists because the identities and number of proteins regulated by the organism under host conditions have yet to be defined. Lack of such knowledge is an important problem because until these proteins are identified it is unlikely that they can be targeted as new and innovative treatments for histoplasmosis. Results To investigate the proteomic response by H. capsulatum to decreasing iron availability we have created H. capsulatum protein/genomic databases compatible with current mass spectrometric (MS) search engines. Databases were assembled from the H. capsulatum G217B strain genome using gene prediction programs and expressed sequence tag (EST) libraries. Searching these databases with MS data generated from two dimensional (2D) in-gel digestions of proteins resulted in over 50% more proteins identified compared to searching the publicly available fungal databases alone. Using 2D gel electrophoresis combined with statistical analysis we discovered 42 H. capsulatum proteins whose abundance was significantly modulated when iron concentrations were lowered. Altered proteins were identified by mass spectrometry and database searching to be involved in glycolysis, the tricarboxylic acid cycle, lysine metabolism, protein synthesis, and one protein sequence whose function was unknown. Conclusion We have created a bioinformatics platform for H. capsulatum and demonstrated the utility of a proteomic approach by identifying a shift in metabolism the organism utilizes to cope with the hostile conditions provided by the host.
We have shown that enzyme transcripts regulated by other fungal pathogens in response to lowering iron availability are also regulated in H. capsulatum at the protein level. We also identified H. capsulatum proteins sensitive to iron level reductions which have yet to be connected to iron availability in other pathogens. These data also indicate the complexity of the response by H. capsulatum to nutritional deprivation. Finally, we demonstrate the importance of a strain specific gene/protein database for H. capsulatum proteomic analysis. PMID:19108728
Motivated Proteins: A web application for studying small three-dimensional protein motifs
Leader, David P; Milner-White, E James
2009-01-01
Background Small loop-shaped motifs are common constituents of the three-dimensional structure of proteins. Typically they comprise between three and seven amino acid residues, and are defined by a combination of dihedral angles and hydrogen bonding partners. The most abundant of these are αβ-motifs, asx-motifs, asx-turns, β-bulges, β-bulge loops, β-turns, nests, niches, Schellmann loops, ST-motifs, ST-staples and ST-turns. We have constructed a database of such motifs from a range of high-quality protein structures and built a web application as a visual interface to this. Description The web application, Motivated Proteins, provides access to these 12 motifs (with 48 sub-categories) in a database of over 400 representative proteins. Queries can be made for specific categories or sub-categories of motif, motifs in the vicinity of ligands, motifs which include part of an enzyme active site, overlapping motifs, or motifs which include a particular amino acid sequence. Individual proteins can be specified, or, where appropriate, motifs for all proteins listed. The results of queries are presented in textual form as an (X)HTML table, and may be saved as parsable plain text or XML. Motifs can be viewed and manipulated either individually or in the context of the protein in the Jmol applet structural viewer. Cartoons of the motifs imposed on a linear representation of protein secondary structure are also provided. Summary information for the motifs is available, as are histograms of amino acid distribution, and graphs of dihedral angles at individual positions in the motifs. Conclusion Motivated Proteins is a publicly and freely accessible web application that enables protein scientists to study small three-dimensional motifs without requiring knowledge of either Structured Query Language or the underlying database schema. PMID:19210785
Wong, Diane K.; Lee, Bai-Yu; Horwitz, Marcus A.; Gibson, Bradford W.
1999-01-01
Iron plays a critical role in the pathophysiology of Mycobacterium tuberculosis. To gain a better understanding of iron regulation by this organism, we have used two-dimensional (2-D) gel electrophoresis, mass spectrometry, and database searching to study protein expression in M. tuberculosis under conditions of high and low iron concentration. Proteins in cellular extracts from M. tuberculosis Erdman strain grown under low-iron (1 μM) and high-iron (70 μM) conditions were separated by 2-D polyacrylamide gel electrophoresis, which allowed high-resolution separation of several hundred proteins, as visualized by Coomassie staining. The expression of at least 15 proteins was induced, and the expression of at least 12 proteins was decreased under low-iron conditions. In-gel trypsin digestion was performed on these differentially expressed proteins, and the digestion mixtures were analyzed by matrix-assisted laser desorption ionization time-of-flight mass spectrometry to determine the molecular masses of the resulting tryptic peptides. Partial sequence data on some of the peptides were obtained by using post-source decay and/or collision-induced dissociation. The fragmentation data were used to search computerized peptide mass and protein sequence databases for known proteins. Ten iron-regulated proteins were identified, including Fur and aconitase proteins, both of which are known to be regulated by iron in other bacterial systems. Our study shows that, where large protein sequence databases are available from genomic studies, the combined use of 2-D gel electrophoresis, mass spectrometry, and database searching to analyze proteins expressed under defined environmental conditions is a powerful tool for identifying expressed proteins and their physiologic relevance. PMID:9864233
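The peptide-mass step described above (tryptic digestion, then comparing peptide masses with a sequence database) can be sketched in a few lines. This is a simplified illustration, not the workflow or software used in the study: it assumes the common rule that trypsin cleaves after K or R unless the next residue is P, uses standard monoisotopic residue masses, and ignores missed cleavages and modifications. The protein string is made up.

```python
# Standard monoisotopic residue masses in daltons.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # one water is added per free peptide


def trypsin_digest(protein: str) -> list:
    """Cleave C-terminal to K or R, except when the next residue is P."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        next_aa = protein[i + 1] if i + 1 < len(protein) else ""
        if aa in "KR" and next_aa != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])  # C-terminal remainder
    return peptides


def peptide_mass(peptide: str) -> float:
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def count_mass_matches(observed, protein, tolerance=0.05):
    """Count observed masses lying within `tolerance` Da of any theoretical
    tryptic peptide of `protein` (the tolerance is an illustrative value)."""
    theoretical = [peptide_mass(p) for p in trypsin_digest(protein)]
    return sum(any(abs(m - t) <= tolerance for t in theoretical) for m in observed)


if __name__ == "__main__":
    toy_protein = "MKRPAGKLLR"  # hypothetical sequence
    for pep in trypsin_digest(toy_protein):
        print(pep, round(peptide_mass(pep), 4))
```

A database search would repeat `count_mass_matches` for every candidate protein and rank the candidates by the number of matched peptides.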
The Importance of Biological Databases in Biological Discovery.
Baxevanis, Andreas D; Bateman, Alex
2015-06-19
Biological databases play a central role in bioinformatics. They offer scientists the opportunity to access a wide variety of biologically relevant data, including the genomic sequences of an increasingly broad range of organisms. This unit provides a brief overview of major sequence databases and portals, such as GenBank, the UCSC Genome Browser, and Ensembl. Model organism databases, including WormBase, The Arabidopsis Information Resource (TAIR), and those made available through the Mouse Genome Informatics (MGI) resource, are also covered. Non-sequence-centric databases, such as Online Mendelian Inheritance in Man (OMIM), the Protein Data Bank (PDB), MetaCyc, and the Kyoto Encyclopedia of Genes and Genomes (KEGG), are also discussed. Copyright © 2015 John Wiley & Sons, Inc.
ESTuber db: an online database for Tuber borchii EST sequences.
Lazzari, Barbara; Caprera, Andrea; Cosentino, Cristian; Stella, Alessandra; Milanesi, Luciano; Viotti, Angelo
2007-03-08
The ESTuber database (http://www.itb.cnr.it/estuber) includes 3,271 Tuber borchii expressed sequence tags (EST). The dataset consists of 2,389 sequences from an in-house prepared cDNA library from truffle vegetative hyphae, and 882 sequences downloaded from GenBank and representing four libraries from white truffle mycelia and ascocarps at different developmental stages. An automated pipeline was prepared to process EST sequences using public software integrated by in-house developed Perl scripts. Data were collected in a MySQL database, which can be queried via a php-based web interface. Sequences included in the ESTuber db were clustered and annotated against three databases: the GenBank nr database, the UniProtKB database and a third in-house prepared database of fungi genomic sequences. An algorithm was implemented to infer statistical classification among Gene Ontology categories from the ontology occurrences deduced from the annotation procedure against the UniProtKB database. Ontologies were also deduced from the annotation of more than 130,000 EST sequences from five filamentous fungi, for intra-species comparison purposes. Further analyses were performed on the ESTuber db dataset, including tandem repeats search and comparison of the putative protein dataset inferred from the EST sequences to the PROSITE database for protein patterns identification. All the analyses were performed both on the complete sequence dataset and on the contig consensus sequences generated by the EST assembly procedure. The resulting web site is a resource of data and links related to truffle expressed genes. The Sequence Report and Contig Report pages are the web interface core structures which, together with the Text search utility and the Blast utility, allow easy access to the data stored in the database.
CORUM: the comprehensive resource of mammalian protein complexes
Ruepp, Andreas; Brauner, Barbara; Dunger-Kaltenbach, Irmtraud; Frishman, Goar; Montrone, Corinna; Stransky, Michael; Waegele, Brigitte; Schmidt, Thorsten; Doudieu, Octave Noubibou; Stümpflen, Volker; Mewes, H. Werner
2008-01-01
Protein complexes are key molecular entities that integrate multiple gene products to perform cellular functions. The CORUM (http://mips.gsf.de/genre/proj/corum/index.html) database is a collection of experimentally verified mammalian protein complexes. Information is manually derived by critical reading of the scientific literature by expert annotators. Information about protein complexes includes protein complex names, subunits, literature references as well as the function of the complexes. For functional annotation, we use the FunCat catalogue, which makes it possible to organize the protein complex space into biologically meaningful subsets. The database contains more than 1750 protein complexes that are built from 2400 different genes, thus representing 12% of the protein-coding genes in human. A web-based system is available to query, view and download the data. CORUM provides a comprehensive dataset of protein complexes for discoveries in systems biology, analyses of protein networks and protein complex-associated diseases. Comparable to the MIPS reference dataset of protein complexes from yeast, CORUM intends to serve as a reference for mammalian protein complexes. PMID:17965090
RaftProt: mammalian lipid raft proteome database.
Shah, Anup; Chen, David; Boda, Akash R; Foster, Leonard J; Davis, Melissa J; Hill, Michelle M
2015-01-01
RaftProt (http://lipid-raft-database.di.uq.edu.au/) is a database of mammalian lipid raft-associated proteins as reported in high-throughput mass spectrometry studies. Lipid rafts are specialized membrane microdomains enriched in cholesterol and sphingolipids thought to act as dynamic signalling and sorting platforms. Given their fundamental roles in cellular regulation, there is a plethora of information on the size, composition and regulation of these membrane microdomains, including a large number of proteomics studies. To facilitate the mining and analysis of published lipid raft proteomics studies, we have developed a searchable database, RaftProt. In addition to browsing the studies, performing basic queries by protein and gene names, and searching experiments by cell, tissue and organism, we have implemented several advanced features to facilitate data mining. To address the issue of potential bias due to the biochemical preparation procedures used, we have captured the lipid raft preparation methods and implemented an advanced search option for methodology and sample treatment conditions, such as cholesterol depletion. Furthermore, we have identified a list of high confidence proteins, and enabled searching only from this list of likely bona fide lipid raft proteins. Given the apparent biological importance of lipid rafts and their associated proteins, this database would constitute a key resource for the scientific community. © The Author(s) 2014. Published by Oxford University Press on behalf of Nucleic Acids Research.
Dölz, R; Mossé, M O; Slonimski, P P; Bairoch, A; Linder, P
1996-01-01
We continued our effort to make a comprehensive database (LISTA) for the yeast Saccharomyces cerevisiae. As in previous editions, the genetic names are consistently associated with each sequence with a known and confirmed ORF. If necessary, synonyms are given in the case of allelic duplicated sequences. Although the first publication of a sequence gives, according to our rules, the genetic name of a gene, in some instances more commonly used names are given to avoid nomenclature problems and the use of obsolete designations. In these cases the old designation is given as a synonym. Thus sequences can be found either by the name or by synonyms given in LISTA. Each entry contains the genetic name, the mnemonic from the EMBL data bank, the codon bias, the reference of the publication of the sequence, the chromosomal location as far as known, and the SWISSPROT and EMBL accession numbers. New entries will also contain the name from the systematic sequencing efforts. Since the release of LISTA4.1 we have updated the database continuously. To obtain more information on the included sequences, each entry has been screened against non-redundant nucleotide and protein data bank collections, resulting in LISTA-HON and LISTA-HOP. This release includes reports from full Smith and Waterman peptide-level searches against a non-redundant protein sequence database. The LISTA database can be linked to the associated data sets or to nucleotide and protein banks by the Sequence Retrieval System (SRS). The database is available by FTP and on the World Wide Web. PMID:8594599
Wang, Penghao; Wilson, Susan R
2013-01-01
Mass spectrometry-based protein identification is a very challenging task. The main identification approaches include de novo sequencing and database searching. Both approaches have shortcomings, so an integrative approach has been developed. The integrative approach firstly infers partial peptide sequences, known as tags, directly from tandem spectra through de novo sequencing, and then puts these sequences into a database search to see if a close peptide match can be found. However the current implementation of this integrative approach has several limitations. Firstly, simplistic de novo sequencing is applied and only very short sequence tags are used. Secondly, most integrative methods apply an algorithm similar to BLAST to search for exact sequence matches and do not accommodate sequence errors well. Thirdly, by applying these methods the integrated de novo sequencing makes a limited contribution to the scoring model which is still largely based on database searching. We have developed a new integrative protein identification method which can integrate de novo sequencing more efficiently into database searching. Evaluated on large real datasets, our method outperforms popular identification methods.
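The idea of searching a database with a de novo sequence tag while tolerating sequencing errors can be sketched as follows. This is a deliberately simple stand-in, not the authors' method: it slides the tag along each database sequence and accepts windows within a fixed number of mismatches (Hamming distance). The function names, toy database, and mismatch limit are hypothetical.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def find_tag(tag: str, database: dict, max_mismatches: int = 1) -> list:
    """Return (protein name, offset, mismatches) for every window of each
    database sequence that differs from `tag` in at most `max_mismatches`
    positions."""
    hits = []
    for name, seq in database.items():
        for i in range(len(seq) - len(tag) + 1):
            d = hamming(tag, seq[i:i + len(tag)])
            if d <= max_mismatches:
                hits.append((name, i, d))
    return hits


if __name__ == "__main__":
    # Toy database; the tag carries one error (L for I, which have equal mass).
    toy_db = {"P1": "MKTAYIAKQR", "P2": "GGSLLVAKQW"}
    print(find_tag("AYLAK", toy_db))
```

Exact matching would miss the hit in `P1`, which is the limitation of BLAST-like exact searches that the abstract points out.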
GPCR & company: databases and servers for GPCRs and interacting partners.
Kowalsman, Noga; Niv, Masha Y
2014-01-01
G-protein-coupled receptors (GPCRs) are a large superfamily of membrane receptors that are involved in a wide range of signaling pathways. To fulfill their tasks, GPCRs interact with a variety of partners, including small molecules, lipids and proteins. They are accompanied by different proteins during all phases of their life cycle. Therefore, GPCR interactions with their partners are of great interest in basic cell-signaling research and in drug discovery. Due to the rapid development of computers and internet communication, knowledge and data can be easily shared within the worldwide research community via freely available databases and servers. These provide an abundance of biological, chemical and pharmacological information. This chapter describes the available web resources for investigating GPCR interactions. We review about 40 freely available databases and servers, and provide a few sentences about the essence and the data they supply. For simplification, the databases and servers were grouped under the following topics: general GPCR-ligand interactions; particular families of GPCRs and their ligands; GPCR oligomerization; GPCR interactions with intracellular partners; and structural information on GPCRs. In conclusion, a multitude of useful tools are currently available. Summary tables are provided to ease navigation between the numerous and partially overlapping resources. Suggestions for future enhancements of the online tools include the addition of links from general to specialized databases and enabling the use of user-supplied templates for GPCR structural modeling.
FPD: A comprehensive phosphorylation database in fungi.
Bai, Youhuang; Chen, Bin; Li, Mingzhu; Zhou, Yincong; Ren, Silin; Xu, Qin; Chen, Ming; Wang, Shihua
2017-10-01
Protein phosphorylation, one of the most classic post-translational modifications, plays a critical role in diverse cellular processes including cell cycle, growth, and signal transduction pathways. However, the available information about phosphorylation in fungi is limited. Here, we provided a Fungi Phosphorylation Database (FPD) that comprises high-confidence in vivo phosphosites identified by MS-based proteomics in various fungal species. This comprehensive phosphorylation database contains 62 272 non-redundant phosphorylation sites in 11 222 proteins across eight organisms, including Aspergillus flavus, Aspergillus nidulans, Fusarium graminearum, Magnaporthe oryzae, Neurospora crassa, Saccharomyces cerevisiae, Schizosaccharomyces pombe, and Cryptococcus neoformans. A fungi-specific phosphothreonine motif and several conserved phosphorylation motifs were discovered by comparatively analysing the pattern of phosphorylation sites in plants, animals, and fungi. Copyright © 2017 British Mycological Society. Published by Elsevier Ltd. All rights reserved.
FARME DB: a functional antibiotic resistance element database
Wallace, James C.; Port, Jesse A.; Smith, Marissa N.; Faustman, Elaine M.
2017-01-01
Antibiotic resistance (AR) is a major global public health threat but few resources exist that catalog AR genes outside of a clinical context. Current AR sequence databases are assembled almost exclusively from genomic sequences derived from clinical bacterial isolates and thus do not include many microbial sequences derived from environmental samples that confer resistance in functional metagenomic studies. These environmental metagenomic sequences often show little or no similarity to AR sequences from clinical isolates using standard classification criteria. In addition, existing AR databases provide no information about flanking sequences containing regulatory or mobile genetic elements. To help address this issue, we created an annotated database of DNA and protein sequences derived exclusively from environmental metagenomic sequences showing AR in laboratory experiments. Our Functional Antibiotic Resistant Metagenomic Element (FARME) database is a compilation of publicly available DNA sequences and predicted protein sequences conferring AR as well as regulatory elements, mobile genetic elements and predicted proteins flanking antibiotic resistant genes. FARME is the first database to focus on functional metagenomic AR gene elements and provides a resource to better understand AR in the 99% of bacteria which cannot be cultured and the relationship between environmental AR sequences and antibiotic resistant genes derived from cultured isolates. Database URL: http://staff.washington.edu/jwallace/farme PMID:28077567
MIPS: analysis and annotation of proteins from whole genomes in 2005
Mewes, H. W.; Frishman, D.; Mayer, K. F. X.; Münsterkötter, M.; Noubibou, O.; Pagel, P.; Rattei, T.; Oesterheld, M.; Ruepp, A.; Stümpflen, V.
2006-01-01
The Munich Information Center for Protein Sequences (MIPS at the GSF), Neuherberg, Germany, provides resources related to genome information. Manually curated databases for several reference organisms are maintained. Several of these databases are described elsewhere in this and other recent NAR database issues. In a complementary effort, a comprehensive set of >400 genomes automatically annotated with the PEDANT system is maintained. The main goal of our current work on creating and maintaining genome databases is to extend gene-centered information to information on interactions within a generic comprehensive framework. We have concentrated our efforts along three lines: (i) the development of suitable comprehensive data structures and database technology, communication and query tools to include a wide range of different types of information, enabling the representation of complex information such as functional modules or networks (Genome Research Environment System); (ii) the development of databases covering computable information such as the basic evolutionary relations among all genes, namely SIMAP, the sequence similarity matrix, and the CABiNet network analysis framework; and (iii) the compilation and manual annotation of information related to interactions such as protein–protein interactions or other types of relations (e.g. MPCDB, MPPI, CYGD). All databases described and the detailed descriptions of our projects can be accessed through the MIPS WWW server (). PMID:16381839
Konc, Janez; Cesnik, Tomo; Konc, Joanna Trykowska; Penca, Matej; Janežič, Dušanka
2012-02-27
ProBiS-Database is a searchable repository of precalculated local structural alignments in proteins detected by the ProBiS algorithm in the Protein Data Bank. Identification of functionally important binding regions of the protein is facilitated by structural similarity scores mapped to the query protein structure. PDB structures that have been aligned with a query protein may be rapidly retrieved from the ProBiS-Database, which is thus able to generate hypotheses concerning the roles of uncharacterized proteins. Presented with an uncharacterized protein structure, ProBiS-Database can discern relationships between such a query protein and other better known proteins in the PDB. Fast access and a user-friendly graphical interface promote easy exploration of this database of over 420 million local structural alignments. The ProBiS-Database is updated weekly and is freely available online at http://probis.cmm.ki.si/database.
dbPTM 2016: 10-year anniversary of a resource for post-translational modification of proteins.
Huang, Kai-Yao; Su, Min-Gang; Kao, Hui-Ju; Hsieh, Yun-Chung; Jhong, Jhih-Hua; Cheng, Kuang-Hao; Huang, Hsien-Da; Lee, Tzong-Yi
2016-01-04
Owing to the importance of the post-translational modifications (PTMs) of proteins in regulating biological processes, the dbPTM (http://dbPTM.mbc.nctu.edu.tw/) was developed as a comprehensive database of experimentally verified PTMs from several databases with annotations of potential PTMs for all UniProtKB protein entries. For this 10th anniversary of dbPTM, the updated resource provides not only a comprehensive dataset of experimentally verified PTMs, supported by the literature, but also an integrative interface for accessing all available databases and tools that are associated with PTM analysis. As well as collecting experimental PTM data from 14 public databases, this update manually curates over 12 000 modified peptides, including the emerging S-nitrosylation, S-glutathionylation and succinylation, from approximately 500 research articles, which were retrieved by text mining. As the number of available PTM prediction methods increases, this work compiles a non-homologous benchmark dataset to evaluate the predictive power of online PTM prediction tools. An increasing interest in the structural investigation of PTM substrate sites motivated the mapping of all experimental PTM peptides to protein entries of Protein Data Bank (PDB) based on database identifier and sequence identity, which enables users to examine spatially neighboring amino acids, solvent-accessible surface area and side-chain orientations for PTM substrate sites on tertiary structures. Since drug binding in PDB is annotated, this update identified over 1100 PTM sites that are associated with drug binding. The update also integrates metabolic pathways and protein-protein interactions to support the PTM network analysis for a group of proteins. Finally, the web interface is redesigned and enhanced to facilitate access to this resource. © The Author(s) 2015. Published by Oxford University Press on behalf of Nucleic Acids Research.
Cell death proteomics database: consolidating proteomics data on cell death.
Arntzen, Magnus Ø; Bull, Vibeke H; Thiede, Bernd
2013-05-03
Programmed cell death is a ubiquitous process of utmost importance for the development and maintenance of multicellular organisms. More than 10 different forms of programmed cell death have been discovered. Several proteomics analyses have been performed to gain insight into proteins involved in the different forms of programmed cell death. To consolidate these studies, we have developed the cell death proteomics (CDP) database, which comprises data from apoptosis, autophagy, cytotoxic granule-mediated cell death, excitotoxicity, mitotic catastrophe, paraptosis, pyroptosis, and Wallerian degeneration. The CDP database is available as a web-based database to compare protein identifications and quantitative information across different experimental setups. The proteomics data of 73 publications were integrated and unified with protein annotations from UniProt-KB and gene ontology (GO). Currently, more than 6,500 records of more than 3,700 proteins are included in the CDP. Comparing apoptosis and autophagy using overrepresentation analysis of GO terms, the majority of enriched processes were found in both, but also some clear differences were perceived. Furthermore, the analysis revealed differences and similarities of the proteome between autophagosomal and overall autophagy. The CDP database represents a useful tool to consolidate data from proteome analyses of programmed cell death and is available at http://celldeathproteomics.uio.no.
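Overrepresentation analysis of GO terms, as mentioned above, is commonly done with a one-sided hypergeometric test. The abstract does not say which test the CDP authors applied, so the following is a sketch under that assumption; the function name and the numbers in the example are hypothetical.

```python
from math import comb


def enrichment_p_value(population: int, annotated: int, sample: int, hits: int) -> float:
    """One-sided hypergeometric p-value P(X >= hits).

    population: proteins in the background set
    annotated:  background proteins carrying the GO term
    sample:     proteins in the list being tested
    hits:       proteins in the list carrying the GO term
    """
    upper = min(annotated, sample)
    total = comb(population, sample)
    favourable = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, upper + 1)
    )
    return favourable / total


if __name__ == "__main__":
    # Made-up counts: 40 of 3700 background proteins carry a term,
    # and 8 of the 100 proteins in the tested list carry it.
    print(enrichment_p_value(3700, 40, 100, 8))
```

When many GO terms are tested at once, the resulting p-values would normally be corrected for multiple testing before terms are called enriched.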
MIPS: curated databases and comprehensive secondary data resources in 2010
Mewes, H. Werner; Ruepp, Andreas; Theis, Fabian; Rattei, Thomas; Walter, Mathias; Frishman, Dmitrij; Suhre, Karsten; Spannagl, Manuel; Mayer, Klaus F.X.; Stümpflen, Volker; Antonov, Alexey
2011-01-01
The Munich Information Center for Protein Sequences (MIPS at the Helmholtz Center for Environmental Health, Neuherberg, Germany) has many years of experience in providing annotated collections of biological data. Selected data sets of high relevance, such as model genomes, are subjected to careful manual curation, while the bulk of high-throughput data is annotated by automatic means. High-quality reference resources developed in the past and still actively maintained include Saccharomyces cerevisiae, Neurospora crassa and Arabidopsis thaliana genome databases as well as several protein interaction data sets (MPACT, MPPI and CORUM). More recent projects are PhenomiR, the database on microRNA-related phenotypes, and MIPS PlantsDB for integrative and comparative plant genome research. The interlinked resources SIMAP and PEDANT provide homology relationships as well as up-to-date and consistent annotation for 38 000 000 protein sequences. PPLIPS and CCancer are versatile tools for proteomics and functional genomics interfacing to a database of compilations from gene lists extracted from literature. A novel literature-mining tool, EXCERBT, gives access to structured information on classified relations between genes, proteins, phenotypes and diseases extracted from Medline abstracts by semantic analysis. All databases described here, as well as the detailed descriptions of our projects can be accessed through the MIPS WWW server (http://mips.helmholtz-muenchen.de). PMID:21109531
DDRprot: a database of DNA damage response-related proteins.
Andrés-León, Eduardo; Cases, Ildefonso; Arcas, Aida; Rojas, Ana M
2016-01-01
The DNA Damage Response (DDR) signalling network is an essential system that protects the genome's integrity. The DDRprot database presented here is a resource that integrates manually curated information on the human DDR network and its sub-pathways. For each particular DDR protein, we present detailed information about its function. Where DDR proteins are involved in post-translational modifications (PTMs) of each other, we depict the position of the modified residue(s) in the three-dimensional structures, when resolved structures are available for the proteins. All this information is linked to the original publication from which it was obtained. Phylogenetic information is also shown, including time of emergence and conservation across 47 selected species, family trees and sequence alignments of homologues. The DDRprot database can be queried by different criteria: pathways, species, evolutionary age or involvement in PTMs. Sequence searches using hidden Markov models can also be used. Database URL: http://ddr.cbbio.es. © The Author(s) 2016. Published by Oxford University Press.
AllerML: markup language for allergens.
Ivanciuc, Ovidiu; Gendel, Steven M; Power, Trevor D; Schein, Catherine H; Braun, Werner
2011-06-01
Many concerns have been raised about the potential allergenicity of novel, recombinant proteins introduced into food crops. Guidelines, proposed by WHO/FAO and EFSA, include the use of bioinformatics screening to assess the risk of potential allergenicity or cross-reactivities of all proteins introduced, for example, to improve nutritional value or promote crop resistance. However, there are no universally accepted standards that can be used to encode data on the biology of allergens to facilitate using data from multiple databases in this screening. Therefore, we developed AllerML, a markup language for allergens, to assist in the automated exchange of information between databases and in the integration of the bioinformatics tools that are used to investigate allergenicity and cross-reactivity. As proof of concept, AllerML was implemented using the Structural Database of Allergenic Proteins (SDAP; http://fermi.utmb.edu/SDAP/) database. General implementation of AllerML will promote automatic flow of validated data that will aid in allergy research and regulatory analysis. Copyright © 2011 Elsevier Inc. All rights reserved.
AllerML: Markup Language for Allergens
Ivanciuc, Ovidiu; Gendel, Steven M.; Power, Trevor D.; Schein, Catherine H.; Braun, Werner
2011-01-01
Many concerns have been raised about the potential allergenicity of novel, recombinant proteins introduced into food crops. Guidelines, proposed by WHO/FAO and EFSA, include the use of bioinformatics screening to assess the risk of potential allergenicity or cross-reactivities of all proteins introduced, for example, to improve nutritional value or promote crop resistance. However, there are no universally accepted standards that can be used to encode data on the biology of allergens to facilitate using data from multiple databases in this screening. Therefore, we developed AllerML, a markup language for allergens, to assist in the automated exchange of information between databases and in the integration of the bioinformatics tools that are used to investigate allergenicity and cross-reactivity. As proof of concept, AllerML was implemented using the Structural Database of Allergenic Proteins (SDAP; http://fermi.utmb.edu/SDAP/) database. General implementation of AllerML will promote automatic flow of validated data that will aid in allergy research and regulatory analysis. PMID:21420460
Choosing an Optimal Database for Protein Identification from Tandem Mass Spectrometry Data.
Kumar, Dhirendra; Yadav, Amit Kumar; Dash, Debasis
2017-01-01
Database searching is the preferred method for protein identification from digital spectra of mass to charge ratios (m/z) detected for protein samples through mass spectrometers. The search database is one of the major influencing factors in discovering proteins present in the sample and thus in deriving biological conclusions. In most cases the choice of search database is arbitrary. Here we describe common search databases used in proteomic studies and their impact on the final list of identified proteins. We also elaborate upon factors like composition and size of the search database that can influence the protein identification process. In conclusion, we suggest that the choice of database depends on the type of inferences to be derived from proteomics data. However, making additional efforts to build a compact and concise database for a targeted question should generally be rewarding in achieving confident protein identifications.
Sabbatini, A; Bédouet, L; Marie, A; Bartolini, A; Landemarre, L; Weber, M X; Gusti Ngurah Kade Mahardika, I; Berland, S; Zito, F; Vénec-Peyré, M-T
2014-07-01
Most foraminifera that produce a shell are efficient biomineralizers. We analyzed the calcitic shell of the large tropical benthic foraminifer Schlumbergerella floresiana. We found a suite of macromolecules containing many charged and polar amino acids and glycine that are also abundant in biomineralization proteins of other phyla. As neither genomic nor transcriptomic data are available for foraminiferal biomineralization yet, de novo-generated sequences, obtained from organic matrices and submitted to an MS BLAST database search, led to the characterization of 156 peptides. Very few homologous proteins were matched in the proteomic database, implying that the peptides are derived from unknown proteins present in the foraminiferal organic matrices. The amino acid distribution of these peptides was queried against the UniProt database and the mollusk UniProt database for comparison. The mollusks compose a well-studied phylum that yields a large variety of biomineralization proteins. These results showed that proteins extracted from S. floresiana shells contained sequences enriched with glycine, alanine, and proline, making a set of residues that provided a signature unique to foraminifera. Three of the de novo peptides exhibited sequence similarities to peptides found in proteins such as pre-collagen-P and a group of P-type ATPases including a calcium-transporting ATPase. Surprisingly, the peptide that was most similar to the collagen-like protein was a glycine-rich peptide reported from the test and spine proteome of sea urchin. The molecules, identified by matrix-assisted laser desorption ionization-time of flight mass spectrometry analyses, included acid-soluble N-glycoproteins with their sugar moieties represented by high-mannose-type glycans and carbohydrates.
Describing the nature of the proteins, and associated molecules in the skeletal structure of living foraminifera, can elucidate the biomineralization mechanisms of these major carbonate producers in marine ecosystems. As fossil foraminifera provide important paleoenvironmental and paleoclimatic information, a better understanding of biomineralization in these organisms will have far-reaching impacts. © 2014 John Wiley & Sons Ltd.
An updated version of NPIDB includes new classifications of DNA–protein complexes and their families
Zanegina, Olga; Kirsanov, Dmitriy; Baulin, Eugene; Karyagina, Anna; Alexeevski, Andrei; Spirin, Sergey
2016-01-01
The recent upgrade of nucleic acid–protein interaction database (NPIDB, http://npidb.belozersky.msu.ru/) includes a newly elaborated classification of complexes of protein domains with double-stranded DNA and a classification of families of related complexes. Our classifications are based on contacting structural elements of both DNA: the major groove, the minor groove and the backbone; and protein: helices, beta-strands and unstructured segments. We took into account both hydrogen bonds and hydrophobic interaction. The analyzed material contains 1942 structures of protein domains from 748 PDB entries. We have identified 97 interaction modes of individual protein domain–DNA complexes and 17 DNA–protein interaction classes of protein domain families. We analyzed the sources of diversity of DNA–protein interaction modes in different complexes of one protein domain family. The observed interaction mode is sometimes influenced by artifacts of crystallization or diversity in secondary structure assignment. The interaction classes of domain families are more stable and thus possess more biological sense than a classification of single complexes. Integration of the classification into NPIDB allows the user to browse the database according to the interacting structural elements of DNA and protein molecules. For each family, we present average DNA shape parameters in contact zones with domains of the family. PMID:26656949
An emerging cyberinfrastructure for biodefense pathogen and pathogen-host data.
Zhang, C; Crasta, O; Cammer, S; Will, R; Kenyon, R; Sullivan, D; Yu, Q; Sun, W; Jha, R; Liu, D; Xue, T; Zhang, Y; Moore, M; McGarvey, P; Huang, H; Chen, Y; Zhang, J; Mazumder, R; Wu, C; Sobral, B
2008-01-01
The NIAID-funded Biodefense Proteomics Resource Center (RC) provides storage, dissemination, visualization and analysis capabilities for the experimental data deposited by seven Proteomics Research Centers (PRCs). The data and their publication are intended to support researchers working to discover candidates for the next generation of vaccines, therapeutics and diagnostics against NIAID's Category A, B and C priority pathogens. The data include transcriptional profiles, protein profiles, protein structural data and host-pathogen protein interactions, in the context of the pathogen life cycle in vivo and in vitro. The database has stored and supported host or pathogen data derived from Bacillus, Brucella, Cryptosporidium, Salmonella, SARS, Toxoplasma, Vibrio and Yersinia, human tissue libraries, and mouse macrophages. These publicly available data cover diverse data types such as mass spectrometry, yeast two-hybrid (Y2H), gene expression profiles, X-ray and NMR determined protein structures and protein expression clones. The growing database covers over 23 000 unique genes/proteins from different experiments and organisms. All of the genes/proteins are annotated and integrated across experiments using UniProt Knowledgebase (UniProtKB) accession numbers. The web-interface for the database enables searching, querying and downloading at the level of experiment, group and individual gene(s)/protein(s) via UniProtKB accession numbers or protein function keywords. The system is accessible at http://www.proteomicsresource.org/.
MannDB: A microbial annotation database for protein characterization
DOE Office of Scientific and Technical Information (OSTI.GOV)
Zhou, C; Lam, M; Smith, J
2006-05-19
MannDB was created to meet a need for rapid, comprehensive automated protein sequence analyses to support selection of proteins suitable as targets for driving the development of reagents for pathogen or protein toxin detection. Because a large number of open-source tools were needed, it was necessary to produce a software system to scale the computations for whole-proteome analysis. Thus, we built a fully automated system for executing software tools and for storage, integration, and display of automated protein sequence analysis and annotation data. MannDB is a relational database that organizes data resulting from fully automated, high-throughput protein-sequence analyses using open-source tools. Types of analyses provided include predictions of cleavage, chemical properties, classification, features, functional assignment, post-translational modifications, motifs, antigenicity, and secondary structure. Proteomes (lists of hypothetical and known proteins) are downloaded and parsed from Genbank and then inserted into MannDB, and annotations from SwissProt are downloaded when identifiers are found in the Genbank entry or when identical sequences are identified. Currently 36 open-source tools are run against MannDB protein sequences either on local systems or by means of batch submission to external servers. In addition, BLAST against protein entries in MvirDB, our database of microbial virulence factors, is performed. A web client browser enables viewing of computational results and downloaded annotations, and a query tool enables structured and free-text search capabilities. When available, links to external databases, including MvirDB, are provided. MannDB contains whole-proteome analyses for at least one representative organism from each category of biological threat organism listed by APHIS, CDC, HHS, NIAID, USDA, USFDA, and WHO.
MannDB comprises a large number of genomes and comprehensive protein sequence analyses representing organisms listed as high-priority agents on the websites of several governmental organizations concerned with bio-terrorism. MannDB provides the user with a BLAST interface for comparison of native and non-native sequences and a query tool for conveniently selecting proteins of interest. In addition, the user has access to a web-based browser that compiles comprehensive and extensive reports.
MIPS: analysis and annotation of proteins from whole genomes.
Mewes, H W; Amid, C; Arnold, R; Frishman, D; Güldener, U; Mannhaupt, G; Münsterkötter, M; Pagel, P; Strack, N; Stümpflen, V; Warfsmann, J; Ruepp, A
2004-01-01
The Munich Information Center for Protein Sequences (MIPS-GSF), Neuherberg, Germany, provides protein sequence-related information based on whole-genome analysis. The main focus of the work is directed toward the systematic organization of sequence-related attributes as gathered by a variety of algorithms, primary information from experimental data together with information compiled from the scientific literature. MIPS maintains automatically generated and manually annotated genome-specific databases, develops systematic classification schemes for the functional annotation of protein sequences and provides tools for the comprehensive analysis of protein sequences. This report updates the information on the yeast genome (CYGD), the Neurospora crassa genome (MNCDB), the database of complete cDNAs (German Human Genome Project, NGFN), the database of mammalian protein-protein interactions (MPPI), the database of FASTA homologies (SIMAP), and the interface for the fast retrieval of protein-associated information (QUIPOS). The Arabidopsis thaliana database, the rice database, the plant EST databases (MATDB, MOsDB, SPUTNIK), as well as the databases for the comprehensive set of genomes (PEDANT genomes) are described elsewhere in the 2003 and 2004 NAR database issues, respectively. All databases described, and the detailed descriptions of our projects can be accessed through the MIPS web server (http://mips.gsf.de).
Identifying functionally informative evolutionary sequence profiles.
Gil, Nelson; Fiser, Andras
2018-04-15
Multiple sequence alignments (MSAs) can provide essential input to many bioinformatics applications, including protein structure prediction and functional annotation. However, the optimal selection of sequences to obtain biologically informative MSAs for such purposes is poorly explored, and has traditionally been performed manually. We present Selection of Alignment by Maximal Mutual Information (SAMMI), an automated, sequence-based approach to objectively select an optimal MSA from a large set of alternatives sampled from a general sequence database search. The hypothesis of this approach is that the mutual information among MSA columns will be maximal for those MSAs that contain the most diverse set possible of the most structurally and functionally homogeneous protein sequences. SAMMI was tested to select MSAs for functional site residue prediction by analysis of conservation patterns on a set of 435 proteins obtained from protein-ligand (peptides, nucleic acids and small substrates) and protein-protein interaction databases. Availability and implementation: A freely accessible program, including source code, implementing SAMMI is available at https://github.com/nelsongil92/SAMMI.git. Contact: andras.fiser@einstein.yu.edu. Supplementary information: Supplementary data are available at Bioinformatics online.
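The quantity at the centre of this abstract, mutual information between alignment columns, can be illustrated with a short Python sketch. This is not the SAMMI implementation: the abstract does not say how column scores are combined, so the function names and the simple averaging over column pairs below are assumptions made for illustration only.

```python
from collections import Counter
from math import log2


def column_mutual_information(col_a, col_b):
    """Mutual information (in bits) between two alignment columns.

    Each column is a sequence of residues, one per aligned sequence.
    """
    if len(col_a) != len(col_b):
        raise ValueError("columns must cover the same sequences")
    n = len(col_a)
    count_a = Counter(col_a)
    count_b = Counter(col_b)
    count_ab = Counter(zip(col_a, col_b))
    mi = 0.0
    for (a, b), c in count_ab.items():
        p_ab = c / n
        # p_ab / (p_a * p_b), written with raw counts
        mi += p_ab * log2(p_ab * n * n / (count_a[a] * count_b[b]))
    return mi


def mean_pairwise_mi(msa):
    """Average mutual information over all column pairs of an MSA.

    `msa` is a list of equal-length aligned sequences. Averaging is a
    hypothetical way to score a whole alignment, not the published one.
    """
    columns = list(zip(*msa))
    pairs = [(i, j) for i in range(len(columns))
             for j in range(i + 1, len(columns))]
    if not pairs:
        return 0.0
    total = sum(column_mutual_information(columns[i], columns[j])
                for i, j in pairs)
    return total / len(pairs)
```

Two perfectly co-varying columns with two equally frequent residues each give 1 bit, and two statistically independent columns give 0 bits, which is the behaviour the hypothesis in the abstract relies on.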
Hall, Aaron Smalter; Shan, Yunfeng; Lushington, Gerald; Visvanathan, Mahesh
2016-01-01
Databases and exchange formats describing biological entities such as chemicals and proteins, along with their relationships, are a critical component of research in life sciences disciplines, including chemical biology wherein information about small molecule properties converges with cellular and molecular biology. Databases for storing biological entities are growing not only in size, but also in type, with many similarities between them and often subtle differences. The data formats available to describe and exchange these entities are numerous as well. In general, each format is optimized for a particular purpose or database, and hence some understanding of these formats is required when choosing one for research purposes. This paper reviews a selection of different databases and data formats with the goal of summarizing their purposes, features, and limitations. Databases are reviewed under the categories of 1) protein interactions, 2) metabolic pathways, 3) chemical interactions, and 4) drug discovery. Representation formats will be discussed according to those describing chemical structures, and those describing genomic/proteomic entities. PMID:22934944
Smalter Hall, Aaron; Shan, Yunfeng; Lushington, Gerald; Visvanathan, Mahesh
2013-03-01
Databases and exchange formats describing biological entities such as chemicals and proteins, along with their relationships, are a critical component of research in life sciences disciplines, including chemical biology wherein information about small molecule properties converges with cellular and molecular biology. Databases for storing biological entities are growing not only in size, but also in type, with many similarities between them and often subtle differences. The data formats available to describe and exchange these entities are numerous as well. In general, each format is optimized for a particular purpose or database, and hence some understanding of these formats is required when choosing one for research purposes. This paper reviews a selection of different databases and data formats with the goal of summarizing their purposes, features, and limitations. Databases are reviewed under the categories of 1) protein interactions, 2) metabolic pathways, 3) chemical interactions, and 4) drug discovery. Representation formats will be discussed according to those describing chemical structures, and those describing genomic/proteomic entities.
Improvements in the Protein Identifier Cross-Reference service.
Wein, Samuel P; Côté, Richard G; Dumousseau, Marine; Reisinger, Florian; Hermjakob, Henning; Vizcaíno, Juan A
2012-07-01
The Protein Identifier Cross-Reference (PICR) service is a tool that allows users to map protein identifiers, protein sequences and gene identifiers across over 100 different source databases. PICR takes input through an interactive website as well as Representational State Transfer (REST) and Simple Object Access Protocol (SOAP) services. It returns the results as HTML pages, XLS and CSV files. It has been in production since 2007 and has been recently enhanced to add new functionality and increase the number of databases it covers. Protein subsequences can be searched with the Basic Local Alignment Search Tool (BLAST) against the UniProt Knowledgebase (UniProtKB) to provide an entry point to the standard PICR mapping algorithm. In addition, gene identifiers from UniProtKB and Ensembl can now be submitted as input or mapped to as output from PICR. We have also implemented a 'best-guess' mapping algorithm for UniProt. In this article, we describe the usefulness of PICR, how these changes have been implemented, and the corresponding additions to the web services. Finally, we explain that the number of source databases covered by PICR has increased from the initial 73 to the current 102. New resources include several new species-specific Ensembl databases as well as the Ensembl Genome ones. PICR can be accessed at http://www.ebi.ac.uk/Tools/picr/.
Maurer-Stroh, Sebastian; Gao, He; Han, Hao; Baeten, Lies; Schymkowitz, Joost; Rousseau, Frederic; Zhang, Louxin; Eisenhaber, Frank
2013-02-01
Data mining in protein databases, derivatives from more fundamental protein 3D structure and sequence databases, has considerable untapped potential for the discovery of sequence motif-structural motif-function relationships, as the finding of the U-shape (Huf-Zinc) motif, originally a small student's project, exemplifies. The metal ion zinc is critically involved in universal biological processes, ranging from protein-DNA complexes and transcription regulation to enzymatic catalysis and metabolic pathways. Proteins have evolved a series of motifs to specifically recognize and bind zinc ions. Many of these, so-called zinc fingers, are structurally independent globular domains with discontinuous binding motifs made up of residues mostly far apart in sequence. Through a systematic approach starting from the BRIX structure fragment database, we discovered that there exists another predictable subset of zinc-binding motifs that not only have a conserved continuous sequence pattern but also share a characteristic local conformation, despite being included in totally different overall folds. While this does not allow general prediction of all Zn binding motifs, an HMM-based web server, Huf-Zinc, is available for prediction of these novel, as well as conventional, zinc finger motifs in protein sequences. The Huf-Zinc webserver can be freely accessed through this URL (http://mendel.bii.a-star.edu.sg/METHODS/hufzinc/).
Meslamani, Jamel; Rognan, Didier; Kellenberger, Esther
2011-05-01
The sc-PDB database is an annotated archive of druggable binding sites extracted from the Protein Data Bank. It contains all-atom coordinates for 8166 protein-ligand complexes, chosen for their geometrical and physico-chemical properties. The sc-PDB provides a functional annotation for proteins, a chemical description for ligands and the detailed intermolecular interactions for complexes. The sc-PDB now includes a hierarchical classification of all the binding sites within a functional class. The sc-PDB entries were first clustered according to the protein name, irrespective of the species. For each cluster, we identified dissimilar sites (e.g. catalytic and allosteric sites of an enzyme). SCOPE AND APPLICATIONS: The classification of sc-PDB targets by binding site diversity was intended to facilitate chemogenomics approaches to drug design. In ligand-based approaches, it avoids comparing ligands that do not share the same binding site. In structure-based approaches, it permits quantitative evaluation of the diversity of the binding site definition (variations in size, sequence and/or structure). The sc-PDB database is freely available at: http://bioinfo-pharma.u-strasbg.fr/scPDB.
Remily-Wood, Elizabeth R.; Liu, Richard Z.; Xiang, Yun; Chen, Yi; Thomas, C. Eric; Rajyaguru, Neal; Kaufman, Laura M.; Ochoa, Joana E.; Hazlehurst, Lori; Pinilla-Ibarz, Javier; Lancet, Jeffrey; Zhang, Guolin; Haura, Eric; Shibata, David; Yeatman, Timothy; Smalley, Keiran S.M.; Dalton, William S.; Huang, Emina; Scott, Ed; Bloom, Gregory C.; Eschrich, Steven A.; Koomen, John M.
2012-01-01
Purpose: The Quantitative Assay Database (QuAD), http://proteome.moffitt.org/QUAD/, facilitates widespread implementation of quantitative mass spectrometry in cancer biology and clinical research through sharing of methods and reagents for monitoring protein expression and modification. Experimental Design: Liquid chromatography coupled to multiple reaction monitoring mass spectrometry (LC-MRM) assays are developed using SDS-PAGE fractionated lysates from cancer cell lines. Pathway maps created using GeneGO Metacore provide the biological relationships between proteins and illustrate concepts for multiplexed analysis; each protein can be selected to examine assay development at the protein and peptide level. Results: The coupling of SDS-PAGE and LC-MRM screening has been used to detect 876 peptides from 218 cancer-related proteins in model systems including colon, lung, melanoma, leukemias, and myeloma, which has led to the development of 95 quantitative assays including stable-isotope labeled peptide standards. Methods are published online and peptide standards are made available to the research community. Protein expression measurements for heat shock proteins, including a comparison with ELISA and monitoring response to the HSP90 inhibitor, 17-DMAG, are used to illustrate the components of the QuAD and its potential utility. Conclusions and Clinical Relevance: This resource enables quantitative assessment of protein components of signaling pathways and biological processes and holds promise for systematic investigation of treatment responses in cancer. PMID:21656910
Protein sequence annotation in the genome era: the annotation concept of SWISS-PROT+TREMBL.
Apweiler, R; Gateau, A; Contrino, S; Martin, M J; Junker, V; O'Donovan, C; Lang, F; Mitaritonna, N; Kappus, S; Bairoch, A
1997-01-01
SWISS-PROT is a curated protein sequence database which strives to provide a high level of annotation, a minimal level of redundancy and a high level of integration with other databases. Ongoing genome sequencing projects have dramatically increased the number of protein sequences to be incorporated into SWISS-PROT. Since we do not want to dilute the quality standards of SWISS-PROT by incorporating sequences without proper sequence analysis and annotation, we cannot speed up the incorporation of new incoming data indefinitely. However, as we also want to make the sequences available as fast as possible, we introduced TREMBL (TRanslation of EMBL nucleotide sequence database), a supplement to SWISS-PROT. TREMBL consists of computer-annotated entries in SWISS-PROT format derived from the translation of all coding sequences (CDS) in the EMBL nucleotide sequence database, except for CDS already included in SWISS-PROT. While TREMBL is already of immense value, its computer-generated annotation does not match the quality of SWISS-PROT's. The main difference is in the protein functional information attached to sequences. With this in mind, we are dedicating substantial effort to develop and apply computer methods to enhance the functional information attached to TREMBL entries.
The Halophile protein database.
Sharma, Naveen; Farooqi, Mohammad Samir; Chaturvedi, Krishna Kumar; Lal, Shashi Bhushan; Grover, Monendra; Rai, Anil; Pandey, Pankaj
2014-01-01
Halophilic archaea/bacteria adapt to different salt concentrations, namely extreme, moderate and low. These types of adaptation may occur as a result of modification of protein structure and other changes in different cell organelles. Thus proteins may play an important role in the adaptation of halophilic archaea/bacteria to saline conditions. The Halophile protein database (HProtDB) is a systematic attempt to document the biochemical and biophysical properties of proteins from halophilic archaea/bacteria which may be involved in adaptation of these organisms to saline conditions. In this database, various physicochemical properties such as molecular weight, theoretical pI, amino acid composition, atomic composition, estimated half-life, instability index, aliphatic index and grand average of hydropathicity (Gravy) have been listed. These physicochemical properties play an important role in identifying the protein structure, bonding pattern and function of the specific proteins. This database is a comprehensive, manually curated, non-redundant catalogue of proteins. The database currently contains properties of 59 897 proteins extracted from 21 different strains of halophilic archaea/bacteria. The database can be accessed at the URL below. Database URL: http://webapp.cabgrid.res.in/protein/ © The Author(s) 2014. Published by Oxford University Press.
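One of the listed properties, the grand average of hydropathicity (GRAVY), is simple enough to sketch in Python. The example below assumes the conventional Kyte-Doolittle hydropathy scale; the abstract does not state which software HProtDB used to compute its values, and the function name `gravy` is only illustrative.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(sequence):
    """Grand average of hydropathicity: mean hydropathy per residue."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    unknown = set(seq) - set(KYTE_DOOLITTLE)
    if unknown:
        raise ValueError(f"non-standard residues: {sorted(unknown)}")
    return sum(KYTE_DOOLITTLE[aa] for aa in seq) / len(seq)
```

A positive value indicates a predominantly hydrophobic sequence and a negative value a predominantly hydrophilic one, so sequences rich in acidic residues such as aspartate and glutamate score low on this scale.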
Genic insights from integrated human proteomics in GeneCards.
Fishilevich, Simon; Zimmerman, Shahar; Kohn, Asher; Iny Stein, Tsippi; Olender, Tsviya; Kolker, Eugene; Safran, Marilyn; Lancet, Doron
2016-01-01
GeneCards is a one-stop shop for searchable human gene annotations (http://www.genecards.org/). Data are automatically mined from ∼120 sources and presented in an integrated web card for every human gene. We report the application of recent advances in proteomics to enhance gene annotation and classification in GeneCards. First, we constructed the Human Integrated Protein Expression Database (HIPED), a unified database of protein abundance in human tissues, based on the publicly available mass spectrometry (MS)-based proteomics sources ProteomicsDB, Multi-Omics Profiling Expression Database, Protein Abundance Across Organisms and The MaxQuant DataBase. The integrated database, residing within GeneCards, compares favourably with its individual sources, covering nearly 90% of human protein-coding genes. For gene annotation and comparisons, we first defined a protein expression vector for each gene, based on normalized abundances in 69 normal human tissues. This vector is portrayed in the GeneCards expression section as a bar graph, allowing visual inspection and comparison. These data are juxtaposed with transcriptome bar graphs. Using the protein expression vectors, we further defined a pairwise metric that helps assess expression-based pairwise proximity. This new metric for finding functional partners complements eight others, including sharing of pathways, gene ontology (GO) terms and domains, implemented in the GeneCards Suite. In parallel, we calculated proteome-based differential expression, highlighting a subset of tissues that overexpress a gene and subserving gene classification. This textual annotation allows users of VarElect, the suite's next-generation phenotyper, to more effectively discover causative disease variants. Finally, we define the protein-RNA expression ratio and correlation as yet another attribute of every gene in each tissue, adding further annotative information.
The results constitute a significant enhancement of several GeneCards sections and help promote and organize the genome-wide structural and functional knowledge of the human proteome. Database URL: http://www.genecards.org/. © The Author(s) 2016. Published by Oxford University Press.
Analysis of high accuracy, quantitative proteomics data in the MaxQB database.
Schaab, Christoph; Geiger, Tamar; Stoehr, Gabriele; Cox, Juergen; Mann, Matthias
2012-03-01
MS-based proteomics generates rapidly increasing amounts of precise and quantitative information. Analysis of individual proteomic experiments has made great strides, but the crucial ability to compare and store information across different proteome measurements still presents many challenges. For example, it has been difficult to avoid contamination of databases with low quality peptide identifications, to control for the inflation in false positive identifications when combining data sets, and to integrate quantitative data. Although, for example, the contamination with low quality identifications has been addressed by joint analysis of deposited raw data in some public repositories, we reasoned that there should be a role for a database specifically designed for high resolution and quantitative data. Here we describe a novel database termed MaxQB that stores and displays collections of large proteomics projects and allows joint analysis and comparison. We demonstrate the analysis tools of MaxQB using proteome data of 11 different human cell lines and 28 mouse tissues. The database-wide false discovery rate is controlled by adjusting the project specific cutoff scores for the combined data sets. The 11 cell line proteomes together identify proteins expressed from more than half of all human genes. For each protein of interest, expression levels estimated by label-free quantification can be visualized across the cell lines. Similarly, the expression rank order and estimated amount of each protein within each proteome are plotted. We used MaxQB to calculate the signal reproducibility of the detected peptides for the same proteins across different proteomes. Spearman rank correlation between peptide intensity and detection probability of identified proteins was greater than 0.8 for 64% of the proteome, whereas a minority of proteins have negative correlation. 
This information can be used to pinpoint false protein identifications, independently of peptide database scores. The information contained in MaxQB, including high resolution fragment spectra, is accessible to the community via a user-friendly web interface at http://www.biochem.mpg.de/maxqb.
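The Spearman rank correlation used above to relate peptide intensity and detection probability can be sketched in plain Python. This is a generic textbook formulation (Pearson correlation of average ranks), not code from MaxQB, and the function names are illustrative.

```python
from math import sqrt


def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j over the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equally long inputs with at least 2 values")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sx * sy)
```

Because only ranks enter the calculation, any monotonic relationship between the two quantities, linear or not, yields a coefficient close to 1, which makes the measure suitable for intensities that span several orders of magnitude.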
Gilany, Kambiz; Minai-Tehrani, Arash; Savadi-Shiraz, Elham; Rezadoost, Hassan; Lakpour, Niknam
2015-01-01
The human seminal fluid is a complex body fluid. It is not known how many proteins are expressed in the seminal plasma; however, by analogy with blood, it is possible that up to 10,000 proteins are expressed in the seminal plasma. The human seminal fluid is a rich source of potential biomarkers for male infertility and reproductive disorders. In this review, the ongoing list of proteins identified from the human seminal fluid was collected. To date, 4188 redundant proteins of the seminal fluid have been identified using different proteomics technologies, including 2-DE, SDS-PAGE-LC-MS/MS and MudPIT. However, this was reduced to a database of 2168 non-redundant proteins using the UniProtKB/Swiss-Prot reviewed database. Core properties of the proteome were analyzed, including pI, MW, amino acid, chromosome and PTM distributions in the human seminal plasma proteome. Additionally, the biological processes, molecular functions and KEGG pathways were investigated using DAVID software. Finally, the biomarkers identified so far in different male reproductive system disorders using proteomics platforms were investigated. In this study, an attempt was made to update the human seminal plasma proteome database. Our findings showed that human seminal plasma studies to date seem to have converged on a set of proteins that are repeatedly identified in many studies and that represent only a small fraction of the entire human seminal plasma proteome.
Swetha, Rayapadi G; Kala Sekar, Dinesh Kumar; Ramaiah, Sudha; Anbarasu, Anand; Sekar, Kanagaraj
2014-12-01
Haemophilus influenzae (H. influenzae) is the causative agent of pneumonia, bacteraemia and meningitis. The organism is responsible for a large number of deaths in both developed and developing countries. Even though the first bacterial genome to be sequenced was that of H. influenzae, there is no database dedicated exclusively to H. influenzae. This prompted us to develop the Haemophilus influenzae Genome Database (HIGDB). All data of HIGDB are stored and managed in a MySQL database. The HIGDB is hosted on a Solaris server and developed using PERL modules. Ajax and JavaScript are used for the interface development. The HIGDB contains detailed information on 42,741 proteins, 18,077 genes including 10 whole genome sequences and also 284 three dimensional structures of proteins of H. influenzae. In addition, the database provides "Motif search" and "GBrowse". The HIGDB is freely accessible through the URL: http://bioserver1.physics.iisc.ernet.in/HIGDB/. The HIGDB will be a single point of access for bacteriological, clinical, genomic and proteomic information of H. influenzae. The database can also be used to identify DNA motifs within H. influenzae genomes and to compare gene or protein sequences of a particular strain with other strains of H. influenzae. Copyright © 2014 Elsevier Ltd. All rights reserved.
MIPS: analysis and annotation of proteins from whole genomes
Mewes, H. W.; Amid, C.; Arnold, R.; Frishman, D.; Güldener, U.; Mannhaupt, G.; Münsterkötter, M.; Pagel, P.; Strack, N.; Stümpflen, V.; Warfsmann, J.; Ruepp, A.
2004-01-01
The Munich Information Center for Protein Sequences (MIPS-GSF), Neuherberg, Germany, provides protein sequence-related information based on whole-genome analysis. The main focus of the work is directed toward the systematic organization of sequence-related attributes as gathered by a variety of algorithms, primary information from experimental data together with information compiled from the scientific literature. MIPS maintains automatically generated and manually annotated genome-specific databases, develops systematic classification schemes for the functional annotation of protein sequences and provides tools for the comprehensive analysis of protein sequences. This report updates the information on the yeast genome (CYGD), the Neurospora crassa genome (MNCDB), the database of complete cDNAs (German Human Genome Project, NGFN), the database of mammalian protein–protein interactions (MPPI), the database of FASTA homologies (SIMAP), and the interface for the fast retrieval of protein-associated information (QUIPOS). The Arabidopsis thaliana database, the rice database, the plant EST databases (MATDB, MOsDB, SPUTNIK), as well as the databases for the comprehensive set of genomes (PEDANT genomes) are described elsewhere in the 2003 and 2004 NAR database issues, respectively. All databases described, and the detailed descriptions of our projects can be accessed through the MIPS web server (http://mips.gsf.de). PMID:14681354
MAPU: Max-Planck Unified database of organellar, cellular, tissue and body fluid proteomes
Zhang, Yanling; Zhang, Yong; Adachi, Jun; Olsen, Jesper V.; Shi, Rong; de Souza, Gustavo; Pasini, Erica; Foster, Leonard J.; Macek, Boris; Zougman, Alexandre; Kumar, Chanchal; Wiśniewski, Jacek R.; Jun, Wang; Mann, Matthias
2007-01-01
Mass spectrometry (MS)-based proteomics has become a powerful technology to map the protein composition of organelles, cell types and tissues. In our department, a large-scale effort to map these proteomes is complemented by the Max-Planck Unified (MAPU) proteome database. MAPU contains several body fluid proteomes, including plasma, urine, and cerebrospinal fluid. Cell lines have been mapped to a depth of several thousand proteins and the red blood cell proteome has also been analyzed in depth. The liver proteome is represented with 3200 proteins. By employing high resolution MS and stringent validation criteria, false positive identification rates in MAPU are lower than 1:1000. Thus MAPU datasets can serve as reference proteomes in biomarker discovery. MAPU contains the peptides identifying each protein, measured masses, scores and intensities and is freely available at using a clickable interface of cell or body parts. Proteome data can be queried across proteomes by protein name, accession number, sequence similarity, peptide sequence and annotation information. More than 4500 mouse and 2500 human proteins have already been identified in at least one proteome. Basic annotation information and links to other public databases are provided in MAPU and we plan to add further analysis tools. PMID:17090601
SinEx DB: a database for single exon coding sequences in mammalian genomes.
Jorquera, Roddy; Ortiz, Rodrigo; Ossandon, F; Cárdenas, Juan Pablo; Sepúlveda, Rene; González, Carolina; Holmes, David S
2016-01-01
Eukaryotic genes are typically interrupted by intragenic, noncoding sequences termed introns. However, some genes lack introns in their coding sequence (CDS) and are generally known as 'single exon genes' (SEGs). In this work, a SEG is defined as a nuclear, protein-coding gene that lacks introns in its CDS. Whereas many public databases of eukaryotic multi-exon genes are available, there are only two specialized databases for SEGs. The present work addresses the need for a more extensive and diverse database by creating SinEx DB, a publicly available, searchable database of predicted SEGs from 10 completely sequenced mammalian genomes including human. SinEx DB houses the DNA and protein sequence information of these SEGs and includes their functional predictions (KOG) and the relative distribution of these functions within species. The information is stored in a relational database built with MySQL Server 5.1.33, and the complete dataset of SEG sequences and their functional predictions is available for downloading. SinEx DB can be interrogated by: (i) a browsable phylogenetic schema, (ii) carrying out BLAST searches to the in-house SinEx DB of SEGs and (iii) via an advanced search mode in which the database can be searched by key words and any combination of searches by species and predicted functions. SinEx DB provides a rich source of information for advancing our understanding of the evolution and function of SEGs. Database URL: www.sinex.cl. © The Author(s) 2016. Published by Oxford University Press.
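The SEG definition in this abstract (a protein-coding gene with no intron inside its CDS) can be expressed as a small check over CDS coordinates. The following is only a sketch under stated assumptions, not the SinEx DB pipeline: the gene names, the coordinate convention (1-based, inclusive) and both function names are hypothetical.

```python
def merge_contiguous(segments):
    """Merge CDS segments that overlap or directly abut (1-based, inclusive)."""
    merged = []
    for start, end in sorted(segments):
        if merged and start <= merged[-1][1] + 1:
            # No gap between this segment and the previous one: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def is_single_exon_gene(cds_segments):
    """True when the CDS forms one uninterrupted interval, i.e. contains no intron."""
    return len(merge_contiguous(cds_segments)) == 1


# Hypothetical annotation: gene name -> list of CDS segments on the genome.
annotation = {
    "geneA": [(100, 400)],              # one CDS segment, no intron
    "geneB": [(100, 200), (301, 400)],  # 100-base gap, so an intron in the CDS
}
segs = sorted(name for name, cds in annotation.items() if is_single_exon_gene(cds))
```

Introns located only in untranslated regions do not affect this test, because only CDS segments are inspected, which matches the definition quoted above.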
Tiered Human Integrated Sequence Search Databases for Shotgun Proteomics.
Deutsch, Eric W; Sun, Zhi; Campbell, David S; Binz, Pierre-Alain; Farrah, Terry; Shteynberg, David; Mendoza, Luis; Omenn, Gilbert S; Moritz, Robert L
2016-11-04
The results of analysis of shotgun proteomics mass spectrometry data can be greatly affected by the selection of the reference protein sequence database against which the spectra are matched. For many species there are multiple sources from which somewhat different sequence sets can be obtained. This can lead to confusion about which database is best in which circumstances, a problem especially acute in human sample analysis. All sequence databases are genome-based, with sequences for the predicted gene and their protein translation products compiled. Our goal is to create a set of primary sequence databases that comprise the union of sequences from many of the different available sources and make the result easily available to the community. We have compiled a set of four sequence databases of varying sizes, from a small database consisting of only the ∼20,000 primary isoforms plus contaminants to a very large database that includes almost all nonredundant protein sequences from several sources. This set of tiered, increasingly complete human protein sequence databases suitable for mass spectrometry proteomics sequence database searching is called the Tiered Human Integrated Search Proteome set. In order to evaluate the utility of these databases, we have analyzed two different data sets, one from the HeLa cell line and the other from normal human liver tissue, with each of the four tiers of database complexity. The result is that approximately 0.8%, 1.1%, and 1.5% additional peptides can be identified for Tiers 2, 3, and 4, respectively, as compared with the Tier 1 database, at substantially increasing computational cost. This increase in computational cost may be worth bearing if the identification of sequence variants or the discovery of sequences that are not present in the reviewed knowledge base entries is an important goal of the study. 
We find that it is useful to search a data set against a simpler database, and then check the uniqueness of the discovered peptides against a more complex database. We have set up an automated system that downloads all the source databases on the first of each month and automatically generates a new set of search databases and makes them available for download at http://www.peptideatlas.org/thisp/.
Tiered Human Integrated Sequence Search Databases for Shotgun Proteomics
Deutsch, Eric W.; Sun, Zhi; Campbell, David S.; Binz, Pierre-Alain; Farrah, Terry; Shteynberg, David; Mendoza, Luis; Omenn, Gilbert S.; Moritz, Robert L.
2016-01-01
The results of analysis of shotgun proteomics mass spectrometry data can be greatly affected by the selection of the reference protein sequence database against which the spectra are matched. For many species there are multiple sources from which somewhat different sequence sets can be obtained. This can lead to confusion about which database is best in which circumstances – a problem especially acute in human sample analysis. All sequence databases are genome-based, with sequences for the predicted gene and their protein translation products compiled. Our goal is to create a set of primary sequence databases that comprise the union of sequences from many of the different available sources and make the result easily available to the community. We have compiled a set of four sequence databases of varying sizes, from a small database consisting of only the ~20,000 primary isoforms plus contaminants to a very large database that includes almost all non-redundant protein sequences from several sources. This set of tiered, increasingly complete human protein sequence databases suitable for mass spectrometry proteomics sequence database searching is called the Tiered Human Integrated Search Proteome set. In order to evaluate the utility of these databases, we have analyzed two different data sets, one from the HeLa cell line and the other from normal human liver tissue, with each of the four tiers of database complexity. The result is that approximately 0.8%, 1.1%, and 1.5% additional peptides can be identified for Tiers 2, 3, and 4, respectively, as compared with the Tier 1 database, at substantially increasing computational cost. This increase in computational cost may be worth bearing if the identification of sequence variants or the discovery of sequences that are not present in the reviewed knowledge base entries is an important goal of the study. 
We find that it is useful to search a data set against a simpler database, and then check the uniqueness of the discovered peptides against a more complex database. We have set up an automated system that downloads all the source databases on the first of each month and automatically generates a new set of search databases and makes them available for download at http://www.peptideatlas.org/thisp/. PMID:27577934
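The two-step strategy recommended here (search against a simpler database, then check peptide uniqueness against a more complex one) can be sketched as follows. This is an illustration under assumptions rather than the authors' tooling: the sequences, protein names and function names are hypothetical, and matching is by exact substring, without the isoleucine/leucine equivalence that real workflows often apply.

```python
def proteins_containing(peptide, database):
    """Names of all database entries whose sequence contains the peptide."""
    return sorted(name for name, seq in database.items() if peptide in seq)


def uniqueness_report(peptides, small_db, large_db):
    """For each peptide, list its matches in both tiers and flag uniqueness."""
    report = {}
    for peptide in peptides:
        large_hits = proteins_containing(peptide, large_db)
        report[peptide] = {
            "small_hits": proteins_containing(peptide, small_db),
            "large_hits": large_hits,
            "unique_in_large": len(large_hits) == 1,
        }
    return report


# Hypothetical toy tiers: the larger tier adds a variant entry to the smaller one.
tier1 = {"P1": "MKTAYIAKQR", "P2": "MSDELLKR"}
tier4 = {**tier1, "P1_variant": "MKTAYIAKQRGG"}
report = uniqueness_report(["TAYIAK", "SDELLK"], tier1, tier4)
```

In this toy case "TAYIAK" maps to one protein in the small tier but to two entries in the large tier, which is the situation the uniqueness check is meant to expose.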
Dellaire, G.; Farrall, R.; Bickmore, W.A.
2003-01-01
The Nuclear Protein Database (NPD) is a curated database that contains information on more than 1300 vertebrate proteins that are thought, or are known, to localise to the cell nucleus. Each entry is annotated with information on predicted protein size and isoelectric point, as well as any repeats, motifs or domains within the protein sequence. In addition, information on the sub-nuclear localisation of each protein is provided and the biological and molecular functions are described using Gene Ontology (GO) terms. The database is searchable by keyword, protein name, sub-nuclear compartment and protein domain/motif. Links to other databases are provided (e.g. Entrez, SWISS-PROT, OMIM, PubMed, PubMed Central). Thus, NPD provides a gateway through which the nuclear proteome may be explored. The database can be accessed at http://npd.hgu.mrc.ac.uk and is updated monthly. PMID:12520015
O-GLYCBASE Version 3.0: a revised database of O-glycosylated proteins.
Hansen, J E; Lund, O; Nilsson, J; Rapacki, K; Brunak, S
1998-01-01
O-GLYCBASE is a revised database of information on glycoproteins and their O-linked glycosylation sites. Entries are compiled and revised from the literature, and from the sequence databases. Entries include information about species, sequence, glycosylation sites and glycan type and are fully cross-referenced. Compared to version 2.0 the number of entries has increased by 20%. Sequence logos displaying the acceptor specificity patterns for the GalNAc, mannose and GlcNAc transferases are shown. The O-GLYCBASE database is available through the WWW at http://www.cbs.dtu.dk/databases/OGLYCBASE/ PMID:9399880
Schokraie, Elham; Hotz-Wagenblatt, Agnes; Warnken, Uwe; Mali, Brahim; Frohme, Marcus; Förster, Frank; Dandekar, Thomas; Hengherr, Steffen; Schill, Ralph O; Schnölzer, Martina
2010-03-03
Tardigrades are small, multicellular invertebrates which are able to survive times of unfavourable environmental conditions using their well-known capability to undergo cryptobiosis at any stage of their life cycle. Milnesium tardigradum has become a powerful model system for the analysis of cryptobiosis. While some genetic information is already available for Milnesium tardigradum the proteome is still to be discovered. Here we present to the best of our knowledge the first comprehensive study of Milnesium tardigradum on the protein level. To establish a proteome reference map we developed optimized protocols for protein extraction from tardigrades in the active state and for separation of proteins by high resolution two-dimensional gel electrophoresis. Since only limited sequence information of M. tardigradum on the genome and gene expression level is available to date in public databases we initiated in parallel a tardigrade EST sequencing project to allow for protein identification by electrospray ionization tandem mass spectrometry. 271 out of 606 analyzed protein spots could be identified by searching against the publicly available NCBInr database as well as our newly established tardigrade protein database corresponding to 144 unique proteins. Another 150 spots could be identified in the tardigrade clustered EST database corresponding to 36 unique contigs and ESTs. Proteins with annotated function were further categorized in more detail by their molecular function, biological process and cellular component. For the proteins of unknown function more information could be obtained by performing a protein domain annotation analysis. Our results include proteins such as members of different heat shock protein families and LEA group 3, which might play important roles in surviving extreme conditions. 
The proteome reference map of Milnesium tardigradum provides the basis for further studies in order to identify and characterize the biochemical mechanisms of tolerance to extreme desiccation. The optimized proteomics workflow will enable application of sensitive quantification techniques to detect differences in protein expression, which are characteristic of the active and anhydrobiotic states of tardigrades.
Schokraie, Elham; Hotz-Wagenblatt, Agnes; Warnken, Uwe; Mali, Brahim; Frohme, Marcus; Förster, Frank; Dandekar, Thomas; Hengherr, Steffen; Schill, Ralph O.; Schnölzer, Martina
2010-01-01
Background: Tardigrades are small, multicellular invertebrates which are able to survive times of unfavourable environmental conditions using their well-known capability to undergo cryptobiosis at any stage of their life cycle. Milnesium tardigradum has become a powerful model system for the analysis of cryptobiosis. While some genetic information is already available for Milnesium tardigradum the proteome is still to be discovered. Principal Findings: Here we present to the best of our knowledge the first comprehensive study of Milnesium tardigradum on the protein level. To establish a proteome reference map we developed optimized protocols for protein extraction from tardigrades in the active state and for separation of proteins by high resolution two-dimensional gel electrophoresis. Since only limited sequence information of M. tardigradum on the genome and gene expression level is available to date in public databases we initiated in parallel a tardigrade EST sequencing project to allow for protein identification by electrospray ionization tandem mass spectrometry. 271 out of 606 analyzed protein spots could be identified by searching against the publicly available NCBInr database as well as our newly established tardigrade protein database corresponding to 144 unique proteins. Another 150 spots could be identified in the tardigrade clustered EST database corresponding to 36 unique contigs and ESTs. Proteins with annotated function were further categorized in more detail by their molecular function, biological process and cellular component. For the proteins of unknown function more information could be obtained by performing a protein domain annotation analysis. Our results include proteins such as members of different heat shock protein families and LEA group 3, which might play important roles in surviving extreme conditions. 
Conclusions: The proteome reference map of Milnesium tardigradum provides the basis for further studies in order to identify and characterize the biochemical mechanisms of tolerance to extreme desiccation. The optimized proteomics workflow will enable application of sensitive quantification techniques to detect differences in protein expression, which are characteristic of the active and anhydrobiotic states of tardigrades. PMID:20224743
Ndhlovu, Andrew; Durand, Pierre M.; Hazelhurst, Scott
2015-01-01
The evolutionary rate at codon sites across protein-coding nucleotide sequences represents a valuable tier of information for aligning sequences, inferring homology and constructing phylogenetic profiles. However, a comprehensive resource for cataloguing the evolutionary rate at codon sites and their corresponding nucleotide and protein domain sequence alignments has not been developed. To address this gap in knowledge, EvoDB (an Evolutionary rates DataBase) was compiled. Nucleotide sequences and their corresponding protein domain data including the associated seed alignments from the PFAM-A (protein family) database were used to estimate evolutionary rate (ω = dN/dS) profiles at codon sites for each entry. EvoDB contains 98.83% of the gapped nucleotide sequence alignments and 97.1% of the evolutionary rate profiles for the corresponding information in PFAM-A. As the identification of codon sites under positive selection and their position in a sequence profile is usually the most sought after information for molecular evolutionary biologists, evolutionary rate profiles were determined under the M2a model using the CODEML algorithm in the PAML (Phylogenetic Analysis by Maximum Likelihood) suite of software. Validation of nucleotide sequences against amino acid data was implemented to ensure high data quality. EvoDB is a catalogue of the evolutionary rate profiles and provides the corresponding phylogenetic trees, PFAM-A alignments and annotated accession identifier data. In addition, the database can be explored and queried using known evolutionary rate profiles to identify domains under similar evolutionary constraints and pressures. EvoDB is a resource for evolutionary, phylogenetic studies and presents a tier of information untapped by current databases. Database URL: http://www.bioinf.wits.ac.za/software/fire/evodb PMID:26140928
Thermodynamic database for proteins: features and applications.
Gromiha, M Michael; Sarai, Akinori
2010-01-01
We have developed a thermodynamic database for proteins and mutants, ProTherm, which is a collection of a large number of thermodynamic data on protein stability along with the sequence and structure information, experimental methods and conditions, and literature information. This is a valuable resource for understanding/predicting the stability of proteins, and it is accessible at http://www.gibk26.bse.kyutech.ac.jp/jouhou/Protherm/protherm.html. ProTherm has several features including various search, display, and sorting options and visualization tools. We have analyzed the data in ProTherm to examine the relationship among thermodynamics, structure, and function of proteins. We describe the progress on the development of methods for understanding/predicting protein stability, such as (i) relationship between the stability of protein mutants and amino acid properties, (ii) average assignment method, (iii) empirical energy functions, (iv) torsion, distance, and contact potentials, and (v) machine learning techniques. The list of online resources for predicting protein stability has also been provided.
A HUPO test sample study reveals common problems in mass spectrometry-based proteomics
Bell, Alexander W.; Deutsch, Eric W.; Au, Catherine E.; Kearney, Robert E.; Beavis, Ron; Sechi, Salvatore; Nilsson, Tommy; Bergeron, John J.M.
2009-01-01
We carried out a test sample study to try to identify errors leading to irreproducibility, including incompleteness of peptide sampling, in LC-MS-based proteomics. We distributed a test sample consisting of an equimolar mix of 20 highly purified recombinant human proteins, to 27 laboratories for identification. Each protein contained one or more unique tryptic peptides of 1250 Da to also test for ion selection and sampling in the mass spectrometer. Of the 27 labs, initially only 7 labs reported all 20 proteins correctly, and only 1 lab reported all the tryptic peptides of 1250 Da. Nevertheless, a subsequent centralized analysis of the raw data revealed that all 20 proteins and most of the 1250 Da peptides had in fact been detected by all 27 labs. The centralized analysis allowed us to determine sources of problems encountered in the study, which include missed identifications (false negatives), environmental contamination, database matching, and curation of protein identifications. Improved search engines and databases are likely to increase the fidelity of mass spectrometry-based proteomics. PMID:19448641
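The tryptic peptides of a given mass referred to in this abstract can be enumerated in silico. The sketch below rests on assumptions: it uses the common simplified trypsin rule (cleave after K or R unless the next residue is P), allows no missed cleavages, and takes standard monoisotopic residue masses; the function names and the test sequence are hypothetical and unrelated to the 20 proteins of the study.

```python
# Monoisotopic residue masses in Da (standard values, rounded to 5 decimals).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # added once per peptide for the free N- and C-termini


def tryptic_digest(sequence):
    """Split after K or R, except when the following residue is P."""
    peptides, start = [], 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])
    return peptides


def monoisotopic_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[residue] for residue in peptide) + WATER


def peptides_in_mass_window(sequence, low, high):
    """Tryptic peptides whose mass lies in the closed interval [low, high]."""
    return [p for p in tryptic_digest(sequence) if low <= monoisotopic_mass(p) <= high]


example_peptides = tryptic_digest("MKTAYRPLKGG")
```

Calling `peptides_in_mass_window(protein, 1245.0, 1255.0)` on a real sequence would list candidates near 1250 Da; the width of that window is an arbitrary choice made for illustration.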
An emerging cyberinfrastructure for biodefense pathogen and pathogen–host data
Zhang, C.; Crasta, O.; Cammer, S.; Will, R.; Kenyon, R.; Sullivan, D.; Yu, Q.; Sun, W.; Jha, R.; Liu, D.; Xue, T.; Zhang, Y.; Moore, M.; McGarvey, P.; Huang, H.; Chen, Y.; Zhang, J.; Mazumder, R.; Wu, C.; Sobral, B.
2008-01-01
The NIAID-funded Biodefense Proteomics Resource Center (RC) provides storage, dissemination, visualization and analysis capabilities for the experimental data deposited by seven Proteomics Research Centers (PRCs). The data and their publication are intended to support researchers working to discover candidates for the next generation of vaccines, therapeutics and diagnostics against NIAID's Category A, B and C priority pathogens. The data include transcriptional profiles, protein profiles, protein structural data and host–pathogen protein interactions, in the context of the pathogen life cycle in vivo and in vitro. The database has stored and supported host or pathogen data derived from Bacillus, Brucella, Cryptosporidium, Salmonella, SARS, Toxoplasma, Vibrio and Yersinia, human tissue libraries, and mouse macrophages. These publicly available data cover diverse data types such as mass spectrometry, yeast two-hybrid (Y2H), gene expression profiles, X-ray and NMR determined protein structures and protein expression clones. The growing database covers over 23 000 unique genes/proteins from different experiments and organisms. All of the genes/proteins are annotated and integrated across experiments using UniProt Knowledgebase (UniProtKB) accession numbers. The web-interface for the database enables searching, querying and downloading at the level of experiment, group and individual gene(s)/protein(s) via UniProtKB accession numbers or protein function keywords. The system is accessible at http://www.proteomicsresource.org/. PMID:17984082
Medvedeva, Irina V; Demenkov, Pavel S; Ivanisenko, Vladimir A
2017-04-01
Functional sites define the diversity of protein functions and are the central object of research on the structural and functional organization of proteins. The mechanisms underlying the emergence of protein functional sites and their variability during evolution include duplication, shuffling, insertion and deletion of exons in genes. The study of the correlation between site structure and exon structure serves as the basis for an in-depth understanding of site organization. In this regard, the development of software resources that allow the mutual projection of the exon structure of genes and the primary and tertiary structures of the encoded proteins remains a pressing problem. Previously, we developed the SitEx system, which provides information about protein and gene sequences with mapped exon borders and the amino acid positions of protein functional sites. The database included information on proteins with known 3D structure. However, data on orthologs were not available. Therefore, we added the projection of site positions onto the exon structures of orthologs in SitEx 2.0. We implemented a database search using site conservation variability and site discontinuity across the exon structure. Inclusion of the information on orthologs expanded the possibilities of using SitEx to solve problems in the analysis of the structural and functional organization of proteins. Database URL: http://www-bionet.sscc.ru/sitex/.
The 2018 Nucleic Acids Research database issue and the online molecular biology database collection.
Rigden, Daniel J; Fernández, Xosé M
2018-01-04
The 2018 Nucleic Acids Research Database Issue contains 181 papers spanning molecular biology. Among them, 82 are new and 84 are updates describing resources that appeared in the Issue previously. The remaining 15 cover databases most recently published elsewhere. Databases in the area of nucleic acids include 3DIV for visualisation of data on genome 3D structure and RNArchitecture, a hierarchical classification of RNA families. Protein databases include the established SMART, ELM and MEROPS while GPCRdb and the newcomer STCRDab cover families of biomedical interest. In the area of metabolism, HMDB and Reactome both report new features while PULDB appears in NAR for the first time. This issue also contains reports on genomics resources including Ensembl, the UCSC Genome Browser and ENCODE. Update papers from the IUPHAR/BPS Guide to Pharmacology and DrugBank are highlights of the drug and drug target section while a number of proteomics databases including proteomicsDB are also covered. The entire Database Issue is freely available online on the Nucleic Acids Research website (https://academic.oup.com/nar). The NAR online Molecular Biology Database Collection has been updated, reviewing 138 entries, adding 88 new resources and eliminating 47 discontinued URLs, bringing the current total to 1737 databases. It is available at http://www.oxfordjournals.org/nar/database/c/. © The Author(s) 2018. Published by Oxford University Press on behalf of Nucleic Acids Research.
MIPS: a database for genomes and protein sequences
Mewes, H. W.; Frishman, D.; Güldener, U.; Mannhaupt, G.; Mayer, K.; Mokrejs, M.; Morgenstern, B.; Münsterkötter, M.; Rudd, S.; Weil, B.
2002-01-01
The Munich Information Center for Protein Sequences (MIPS-GSF, Neuherberg, Germany) continues to provide genome-related information in a systematic way. MIPS supports both national and European sequencing and functional analysis projects, develops and maintains automatically generated and manually annotated genome-specific databases, develops systematic classification schemes for the functional annotation of protein sequences, and provides tools for the comprehensive analysis of protein sequences. This report updates the information on the yeast genome (CYGD), the Neurospora crassa genome (MNCDB), the databases for the comprehensive set of genomes (PEDANT genomes), the database of annotated human EST clusters (HIB), the database of complete cDNAs from the DHGP (German Human Genome Project), as well as the project specific databases for the GABI (Genome Analysis in Plants) and HNB (Helmholtz–Netzwerk Bioinformatik) networks. The Arabidopsis thaliana database (MATDB), the database of mitochondrial proteins (MITOP) and our contribution to the PIR International Protein Sequence Database have been described elsewhere [Schoof et al. (2002) Nucleic Acids Res., 30, 91–93; Scharfe et al. (2000) Nucleic Acids Res., 28, 155–158; Barker et al. (2001) Nucleic Acids Res., 29, 29–32]. All databases described, the protein analysis tools provided and the detailed descriptions of our projects can be accessed through the MIPS World Wide Web server (http://mips.gsf.de). PMID:11752246
MIPS: a database for genomes and protein sequences.
Mewes, H W; Frishman, D; Güldener, U; Mannhaupt, G; Mayer, K; Mokrejs, M; Morgenstern, B; Münsterkötter, M; Rudd, S; Weil, B
2002-01-01
The Munich Information Center for Protein Sequences (MIPS-GSF, Neuherberg, Germany) continues to provide genome-related information in a systematic way. MIPS supports both national and European sequencing and functional analysis projects, develops and maintains automatically generated and manually annotated genome-specific databases, develops systematic classification schemes for the functional annotation of protein sequences, and provides tools for the comprehensive analysis of protein sequences. This report updates the information on the yeast genome (CYGD), the Neurospora crassa genome (MNCDB), the databases for the comprehensive set of genomes (PEDANT genomes), the database of annotated human EST clusters (HIB), the database of complete cDNAs from the DHGP (German Human Genome Project), as well as the project specific databases for the GABI (Genome Analysis in Plants) and HNB (Helmholtz-Netzwerk Bioinformatik) networks. The Arabidopsis thaliana database (MATDB), the database of mitochondrial proteins (MITOP) and our contribution to the PIR International Protein Sequence Database have been described elsewhere [Schoof et al. (2002) Nucleic Acids Res., 30, 91-93; Scharfe et al. (2000) Nucleic Acids Res., 28, 155-158; Barker et al. (2001) Nucleic Acids Res., 29, 29-32]. All databases described, the protein analysis tools provided and the detailed descriptions of our projects can be accessed through the MIPS World Wide Web server (http://mips.gsf.de).
Phenome-genome association studies of pancreatic cancer: new targets for therapy and diagnosis.
Narayanan, Ramaswamy
2015-01-01
Pancreatic cancer has a very high mortality rate and requires novel molecular targets for diagnosis and therapy. Genetic association studies over databases offer an attractive starting point for gene discovery. The National Center for Biotechnology Information (NCBI) Phenome Genome Integrator (PheGenI) tool was enriched for pancreatic cancer-associated traits. The genes associated with the trait were characterized using diverse bioinformatics tools for Genome-Wide Association (GWA), transcriptome and proteome profile and protein classes for motif and domain. Two hundred twenty-six genes were identified that had a genetic association with pancreatic cancer in the human genome. This included 25 uncharacterized open reading frames (ORFs). Bioinformatics analysis of these ORFs identified putative druggable proteins and biomarkers including enzymes, transporters and G-protein-coupled receptor signaling proteins. Secreted proteins including a neuroendocrine factor and a chemokine were identified. Five of these ORFs encompassed noncoding RNAs. The ORF protein expression was detected in numerous body fluids, such as ascites, bile, pancreatic juice, milk, plasma, serum and saliva. Transcriptome and proteome analyses showed a correlation of mRNA and protein expression for nine ORFs. Analysis of the Catalogue of Somatic Mutations in Cancer (COSMIC) database revealed a strong correlation across copy number variations and mRNA over-expression for four ORFs. Mining of the International Cancer Gene Consortium (ICGC) database identified somatic mutations in a significant number of pancreatic cancer patients' tumors for most of these ORFs. The pancreatic cancer-associated ORFs were also found to be genetically associated with other neoplasms, including leukemia, malignant melanoma, neuroblastoma and prostate carcinomas, as well as other unrelated diseases and disorders, such as Alzheimer's disease, Crohn's disease, coronary diseases, attention deficit disorder and addiction. 
Based on Genome-Wide Association Studies (GWAS), copy number variations, somatic mutational status and correlation of gene expression in pancreatic tumors at the mRNA and protein level, expression specificity in normal tissues and detection in body fluids, six ORFs emerged as putative leads for pancreatic cancer. These six targets provide a basis for accelerated drug discovery and diagnostic marker development for pancreatic cancer. Copyright © 2015, International Institute of Anticancer Research (Dr. John G. Delinasios), All rights reserved.
Sivapalarajah, Shayeeshan; Krishnakumar, Mathangi; Bickerstaffe, Harry; Chan, YikYing; Clarkson, Joseph; Hampden-Martin, Alistair; Mirza, Ahmad; Tanti, Matthew; Marson, Anthony; Pirmohamed, Munir; Mirza, Nasir
2018-02-01
Current antiepileptic drugs (AEDs) have several shortcomings. For example, they fail to control seizures in 30% of patients. Hence, there is a need to identify new AEDs. Drug repurposing is the discovery of new indications for approved drugs. This drug "recycling" offers the potential of significant savings in the time and cost of drug development. Many drugs licensed for other indications exhibit antiepileptic efficacy in animal models. Our aim was to create a database of "prescribable" drugs, approved for other conditions, with published evidence of efficacy in animal models of epilepsy, and to collate data that would assist in choosing the most promising candidates for drug repurposing. The database was created by the following: (1) computational literature-mining using novel software that identifies Medline abstracts containing the name of a prescribable drug, a rodent model of epilepsy, and a phrase indicating seizure reduction; then (2) crowdsourced manual curation of the identified abstracts. The final database includes 173 drugs and 500 abstracts. It is made freely available at www.liverpool.ac.uk/D3RE/PDE3. The database is reliable: 94% of the included drugs have corroborative evidence of efficacy in animal models (for example, evidence from multiple independent studies). The database includes many drugs that are appealing candidates for repurposing, as they are widely accepted by prescribers and patients (the database includes half of the 20 most commonly prescribed drugs in England) and they target many proteins involved in epilepsy but not targeted by current AEDs. It is important to note that the drugs are of potential relevance to human epilepsy: the database is highly enriched with drugs that target proteins of known causal human epilepsy genes (Fisher's exact test P-value < 3 × 10^-5). We present data to help prioritize the most promising candidates for repurposing from the database.
The PDE3 database is an important new resource for drug repurposing research in epilepsy. Wiley Periodicals, Inc. © 2018 International League Against Epilepsy.
Web server to identify similarity of amino acid motifs to compounds (SAAMCO).
Casey, Fergal P; Davey, Norman E; Baran, Ivan; Varekova, Radka Svobodova; Shields, Denis C
2008-07-01
Protein-protein interactions are fundamental in mediating biological processes including metabolism, cell growth, and signaling. To be able to selectively inhibit or induce protein activity or complex formation is a key feature in controlling disease. For those situations in which protein-protein interactions derive substantial affinity from short linear peptide sequences, or motifs, we can develop search algorithms for peptidomimetic compounds that resemble the short peptide's structure but are not compromised by poor pharmacological properties. SAAMCO is a Web service (http://bioware.ucd.ie/~saamco) that facilitates the screening of motifs with known structures against bioactive compound databases. It is built on an algorithm that defines compound similarity based on the presence of appropriate amino acid side chain fragments and a favorable Root Mean Squared Deviation (RMSD) between compound and motif structure. The methodology is efficient as the available compound databases are preprocessed and fast regular expression searches filter potential matches before time-intensive 3D superposition is performed. The required input information is minimal, and the compound databases have been selected to maximize the availability of information on biological activity. "Hits" are accompanied by a visualization window and links to source database entries. Motif matching can be defined on partial or full similarity, which will increase or reduce, respectively, the number of potential mimetic compounds. The Web server provides the functionality for rapid screening of known or putative interaction motifs against prepared compound libraries using a novel search algorithm. The tabulated results can be analyzed by linking to appropriate databases and by visualization.
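The two-stage search described in this abstract (a cheap regular-expression filter followed by a costlier structural comparison) can be sketched roughly as follows. This is only an illustration under stated assumptions: the one-letter fragment strings, the `candidates` records, the compound names and the 1.0 cutoff are all hypothetical, and the RMSD is computed on coordinates assumed to be already superposed, not with SAAMCO's own superposition procedure.

```python
import math
import re


def rmsd(coords_a, coords_b):
    """Root mean squared deviation between two paired lists of (x, y, z) points.

    Assumption: both point sets are already superposed and listed in the same order.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    total = sum(
        (ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2
        for (ax, ay, az), (bx, by, bz) in zip(coords_a, coords_b)
    )
    return math.sqrt(total / len(coords_a))


def screen(motif_pattern, motif_coords, candidates, cutoff=1.0):
    """Return names of candidates passing the regex filter and then the RMSD cutoff."""
    pattern = re.compile(motif_pattern)
    hits = []
    for name, fragments, coords in candidates:
        if not pattern.search(fragments):
            continue  # cheap text filter: skip the structural comparison entirely
        if rmsd(motif_coords, coords) <= cutoff:
            hits.append(name)
    return hits


# Hypothetical data: (name, side-chain fragment string, fragment coordinates).
motif_coords = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)]
candidates = [
    ("cmpd_A", "FxL", [(0.0, 0.0, 0.1), (1.5, 0.0, -0.1)]),
    ("cmpd_B", "GGG", [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)]),
    ("cmpd_C", "FaL", [(0.0, 3.0, 0.0), (1.5, 3.0, 0.0)]),
]
print(screen(r"F.L", motif_coords, candidates))  # ['cmpd_A']
```

Ordering the filters this way means the structural comparison runs only on compounds whose fragment string already matches, which is the efficiency argument the abstract makes.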
Database resources of the National Center for Biotechnology Information.
Sayers, Eric W; Barrett, Tanya; Benson, Dennis A; Bolton, Evan; Bryant, Stephen H; Canese, Kathi; Chetvernin, Vyacheslav; Church, Deanna M; Dicuccio, Michael; Federhen, Scott; Feolo, Michael; Fingerman, Ian M; Geer, Lewis Y; Helmberg, Wolfgang; Kapustin, Yuri; Krasnov, Sergey; Landsman, David; Lipman, David J; Lu, Zhiyong; Madden, Thomas L; Madej, Tom; Maglott, Donna R; Marchler-Bauer, Aron; Miller, Vadim; Karsch-Mizrachi, Ilene; Ostell, James; Panchenko, Anna; Phan, Lon; Pruitt, Kim D; Schuler, Gregory D; Sequeira, Edwin; Sherry, Stephen T; Shumway, Martin; Sirotkin, Karl; Slotta, Douglas; Souvorov, Alexandre; Starchenko, Grigory; Tatusova, Tatiana A; Wagner, Lukas; Wang, Yanli; Wilbur, W John; Yaschenko, Eugene; Ye, Jian
2012-01-01
In addition to maintaining the GenBank® nucleic acid sequence database, the National Center for Biotechnology Information (NCBI) provides analysis and retrieval resources for the data in GenBank and other biological data made available through the NCBI Website. NCBI resources include Entrez, the Entrez Programming Utilities, MyNCBI, PubMed, PubMed Central (PMC), Gene, the NCBI Taxonomy Browser, BLAST, BLAST Link (BLink), Primer-BLAST, COBALT, Splign, RefSeq, UniGene, HomoloGene, ProtEST, dbMHC, dbSNP, dbVar, Epigenomics, Genome and related tools, the Map Viewer, Model Maker, Evidence Viewer, Trace Archive, Sequence Read Archive, BioProject, BioSample, Retroviral Genotyping Tools, HIV-1/Human Protein Interaction Database, Gene Expression Omnibus (GEO), Probe, Online Mendelian Inheritance in Animals (OMIA), the Molecular Modeling Database (MMDB), the Conserved Domain Database (CDD), the Conserved Domain Architecture Retrieval Tool (CDART), Biosystems, Protein Clusters and the PubChem suite of small molecule databases. Augmenting many of the Web applications are custom implementations of the BLAST program optimized to search specialized data sets. All of these resources can be accessed through the NCBI home page at www.ncbi.nlm.nih.gov.
Database resources of the National Center for Biotechnology Information
Acland, Abigail; Agarwala, Richa; Barrett, Tanya; Beck, Jeff; Benson, Dennis A.; Bollin, Colleen; Bolton, Evan; Bryant, Stephen H.; Canese, Kathi; Church, Deanna M.; Clark, Karen; DiCuccio, Michael; Dondoshansky, Ilya; Federhen, Scott; Feolo, Michael; Geer, Lewis Y.; Gorelenkov, Viatcheslav; Hoeppner, Marilu; Johnson, Mark; Kelly, Christopher; Khotomlianski, Viatcheslav; Kimchi, Avi; Kimelman, Michael; Kitts, Paul; Krasnov, Sergey; Kuznetsov, Anatoliy; Landsman, David; Lipman, David J.; Lu, Zhiyong; Madden, Thomas L.; Madej, Tom; Maglott, Donna R.; Marchler-Bauer, Aron; Karsch-Mizrachi, Ilene; Murphy, Terence; Ostell, James; O'Sullivan, Christopher; Panchenko, Anna; Phan, Lon; Preuss, Don; Pruitt, Kim D.; Rubinstein, Wendy; Sayers, Eric W.; Schneider, Valerie; Schuler, Gregory D.; Sequeira, Edwin; Sherry, Stephen T.; Shumway, Martin; Sirotkin, Karl; Siyan, Karanjit; Slotta, Douglas; Soboleva, Alexandra; Soussov, Vladimir; Starchenko, Grigory; Tatusova, Tatiana A.; Trawick, Bart W.; Vakatov, Denis; Wang, Yanli; Ward, Minghong; Wilbur, W. John; Yaschenko, Eugene; Zbicz, Kerry
2014-01-01
In addition to maintaining the GenBank® nucleic acid sequence database, the National Center for Biotechnology Information (NCBI, http://www.ncbi.nlm.nih.gov) provides analysis and retrieval resources for the data in GenBank and other biological data made available through the NCBI Web site. NCBI resources include Entrez, the Entrez Programming Utilities, MyNCBI, PubMed, PubMed Central, PubReader, Gene, the NCBI Taxonomy Browser, BLAST, BLAST Link, Primer-BLAST, COBALT, RefSeq, UniGene, HomoloGene, ProtEST, dbMHC, dbSNP, dbVar, Epigenomics, the Genetic Testing Registry, Genome and related tools, the Map Viewer, Trace Archive, Sequence Read Archive, BioProject, BioSample, ClinVar, MedGen, HIV-1/Human Protein Interaction Database, Gene Expression Omnibus, Probe, Online Mendelian Inheritance in Animals, the Molecular Modeling Database, the Conserved Domain Database, the Conserved Domain Architecture Retrieval Tool, Biosystems, Protein Clusters and the PubChem suite of small molecule databases. Augmenting many of the Web applications are custom implementations of the BLAST program optimized to search specialized data sets. All these resources can be accessed through the NCBI home page. PMID:24259429
Proteome analysis of bell pepper (Capsicum annuum L.) chromoplasts.
Siddique, Muhammad Asim; Grossmann, Jonas; Gruissem, Wilhelm; Baginsky, Sacha
2006-12-01
We report a comprehensive proteome analysis of chromoplasts from bell pepper (Capsicum annuum L.). The combination of a novel strategy for database-independent detection of proteins from tandem mass spectrometry (MS/MS) data with standard database searches allowed us to identify 151 proteins with a high level of confidence. These include several well-known plastid proteins but also novel proteins that were not previously reported from other plastid proteome studies. The majority of the identified proteins are active in plastid carbohydrate and amino acid metabolism. Among the most abundant individual proteins are capsanthin/capsorubin synthase and fibrillin, which are involved in the synthesis and storage of carotenoids that accumulate to high levels in chromoplasts. The relative abundances of the identified chromoplast proteins differ remarkably compared with their abundances in other plastid types, suggesting a chromoplast-specific metabolic network. Our results provide an overview of the major metabolic pathways active in chromoplasts and extend existing knowledge about prevalent metabolic activities of different plastid types.
Côté, Richard G; Jones, Philip; Martens, Lennart; Kerrien, Samuel; Reisinger, Florian; Lin, Quan; Leinonen, Rasko; Apweiler, Rolf; Hermjakob, Henning
2007-10-18
Each major protein database uses its own conventions when assigning protein identifiers. Resolving the various, potentially unstable, identifiers that refer to identical proteins is a major challenge. This is a common problem when attempting to unify datasets that have been annotated with proteins from multiple data sources or querying data providers with one flavour of protein identifiers when the source database uses another. Partial solutions for protein identifier mapping exist but they are limited to specific species or techniques and to a very small number of databases. As a result, we have not found a solution that is generic enough and broad enough in mapping scope to suit our needs. We have created the Protein Identifier Cross-Reference (PICR) service, a web application that provides interactive and programmatic (SOAP and REST) access to a mapping algorithm that uses the UniProt Archive (UniParc) as a data warehouse to offer protein cross-references based on 100% sequence identity to proteins from over 70 distinct source databases loaded into UniParc. Mappings can be limited by source database, taxonomic ID and activity status in the source database. Users can copy/paste or upload files containing protein identifiers or sequences in FASTA format to obtain mappings using the interactive interface. Search results can be viewed in simple or detailed HTML tables or downloaded as comma-separated values (CSV) or Microsoft Excel (XLS) files suitable for use in a local database or a spreadsheet. Alternatively, a SOAP interface is available to integrate PICR functionality in other applications, as is a lightweight REST interface. We offer a publicly available service that can interactively map protein identifiers and protein sequences to the majority of commonly used protein databases. Programmatic access is available through a standards-compliant SOAP interface or a lightweight REST interface. 
The PICR interface, documentation and code examples are available at http://www.ebi.ac.uk/Tools/picr.
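The idea of cross-referencing identifiers by 100% sequence identity can be illustrated with a small local sketch. It is not the PICR service or its SOAP/REST API: the database names, accessions and sequences below are hypothetical, and using a SHA-256 digest as the grouping key is an assumption made for this example.

```python
import hashlib
from collections import defaultdict


def sequence_key(sequence):
    """Digest of a sequence after removing whitespace and normalising case."""
    normalized = "".join(sequence.split()).upper()
    return hashlib.sha256(normalized.encode("ascii")).hexdigest()


def build_cross_references(records):
    """Group (database, identifier) pairs that share an identical sequence."""
    groups = defaultdict(set)
    for database, identifier, sequence in records:
        groups[sequence_key(sequence)].add((database, identifier))
    return groups


def map_identifier(records, database, identifier, target_databases=None):
    """Return other identifiers whose sequence is identical to the query's."""
    for members in build_cross_references(records).values():
        if (database, identifier) in members:
            return sorted(
                (db, acc)
                for db, acc in members
                if (db, acc) != (database, identifier)
                and (target_databases is None or db in target_databases)
            )
    return []


# Hypothetical records: (source database, identifier, protein sequence).
records = [
    ("DB_ONE", "A0001", "MKTAYIAKQR"),
    ("DB_TWO", "X_123", "mktayiakqr"),
    ("DB_THREE", "P9", "MKTAYIAKQL"),
]
print(map_identifier(records, "DB_ONE", "A0001"))  # [('DB_TWO', 'X_123')]
```

The optional `target_databases` argument mirrors, in spirit only, the abstract's statement that mappings can be limited by source database.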
Côté, Richard G; Jones, Philip; Martens, Lennart; Kerrien, Samuel; Reisinger, Florian; Lin, Quan; Leinonen, Rasko; Apweiler, Rolf; Hermjakob, Henning
2007-01-01
Background: Each major protein database uses its own conventions when assigning protein identifiers. Resolving the various, potentially unstable, identifiers that refer to identical proteins is a major challenge. This is a common problem when attempting to unify datasets that have been annotated with proteins from multiple data sources or querying data providers with one flavour of protein identifiers when the source database uses another. Partial solutions for protein identifier mapping exist but they are limited to specific species or techniques and to a very small number of databases. As a result, we have not found a solution that is generic enough and broad enough in mapping scope to suit our needs. Results: We have created the Protein Identifier Cross-Reference (PICR) service, a web application that provides interactive and programmatic (SOAP and REST) access to a mapping algorithm that uses the UniProt Archive (UniParc) as a data warehouse to offer protein cross-references based on 100% sequence identity to proteins from over 70 distinct source databases loaded into UniParc. Mappings can be limited by source database, taxonomic ID and activity status in the source database. Users can copy/paste or upload files containing protein identifiers or sequences in FASTA format to obtain mappings using the interactive interface. Search results can be viewed in simple or detailed HTML tables or downloaded as comma-separated values (CSV) or Microsoft Excel (XLS) files suitable for use in a local database or a spreadsheet. Alternatively, a SOAP interface is available to integrate PICR functionality in other applications, as is a lightweight REST interface. Conclusion: We offer a publicly available service that can interactively map protein identifiers and protein sequences to the majority of commonly used protein databases. Programmatic access is available through a standards-compliant SOAP interface or a lightweight REST interface.
The PICR interface, documentation and code examples are available at . PMID:17945017
DOE Office of Scientific and Technical Information (OSTI.GOV)
Williams, K.; Chubb, C.; Huberman, E.
High-resolution two-dimensional gel electrophoresis (2DE) and database analysis were used to establish protein expression patterns for cultured normal human mammary epithelial cells and thirteen breast cancer cell lines. The Human Breast Epithelial Cell database contains the 2DE protein patterns, including relative protein abundances, for each cell line, plus a composite pattern that contains all the common and specifically expressed proteins from all the cell lines. Significant differences in protein expression, both qualitative and quantitative, were observed not only between normal cells and tumor cells, but also among the tumor cell lines. Eight percent of the consistently detected proteins were found in significantly (P < 0.001) variable levels among the cell lines. Using a combination of immunostaining, comigration with purified protein, subcellular fractionation, and amino-terminal protein sequencing, we identified a subset of the differentially expressed proteins. These identified proteins include the cytoskeletal proteins actin, tubulin, vimentin, and cytokeratins. The cell lines can be classified into four distinct groups based on their intermediate filament protein profile. We also identified heat shock proteins; hsp27, hsp60, and hsp70 varied in abundance and in some cases in the relative phosphorylation levels among the cell lines. Finally, we identified IMP dehydrogenase in each of the cell lines, and found the levels of this enzyme in the tumor cell lines elevated 2- to 20-fold relative to the levels in normal cells.
Biomedical Requirements for High Productivity Computing Systems
2005-04-01
server at http://www.ncbi.nlm.nih.gov/BLAST/. There are many variants of BLAST, including: 1. BLASTN - Compares a DNA query to a DNA database. Searches ...database (3 reading frames from each strand of the DNA) searching. 13 4. TBLASTN - Compares a protein query to a DNA database, in the 6 possible...the molecular during this phase. After eliminating molecules that could not match the query, an atom-by-atom search for the molecules is conducted
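The "3 reading frames from each strand" mentioned in this excerpt can be made concrete with a short six-frame translation sketch. It is an illustration only, not BLAST code: it uses the standard genetic code, translates unknown codons to "X", and the function names and the `+1`..`-3` frame labels are choices made for this example.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons from the start of dna; unknown codons become 'X'."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translations(dna):
    """Translate three frames on the given strand and three on its reverse complement."""
    dna = dna.upper()
    frames = {}
    for strand, sequence in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(sequence[offset:])
    return frames


print(six_frame_translations("ATGGCC"))
# {'+1': 'MA', '+2': 'W', '+3': 'G', '-1': 'GH', '-2': 'A', '-3': 'P'}
```

A protein-versus-DNA comparison needs all six translations because the coding strand and the reading frame of a database sequence are not known in advance.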
Updated regulation curation model at the Saccharomyces Genome Database
Engel, Stacia R; Skrzypek, Marek S; Hellerstedt, Sage T; Wong, Edith D; Nash, Robert S; Weng, Shuai; Binkley, Gail; Sheppard, Travis K; Karra, Kalpana; Cherry, J Michael
2018-01-01
The Saccharomyces Genome Database (SGD) provides comprehensive, integrated biological information for the budding yeast Saccharomyces cerevisiae, along with search and analysis tools to explore these data, enabling the discovery of functional relationships between sequence and gene products in fungi and higher organisms. We have recently expanded our data model for regulation curation to address regulation at the protein level in addition to transcription, and are presenting the expanded data on the ‘Regulation’ pages at SGD. These pages include a summary describing the context under which the regulator acts, manually curated and high-throughput annotations showing the regulatory relationships for that gene and a graphical visualization of its regulatory network and connected networks. For genes whose products regulate other genes or proteins, the Regulation page includes Gene Ontology enrichment analysis of the biological processes in which those targets participate. For DNA-binding transcription factors, we also provide other information relevant to their regulatory function, such as DNA binding site motifs and protein domains. As with other data types at SGD, all regulatory relationships and accompanying data are available through YeastMine, SGD’s data warehouse based on InterMine. Database URL: http://www.yeastgenome.org PMID:29688362
Robasky, Kimberly; Bulyk, Martha L
2011-01-01
The Universal PBM Resource for Oligonucleotide-Binding Evaluation (UniPROBE) database is a centralized repository of information on the DNA-binding preferences of proteins as determined by universal protein-binding microarray (PBM) technology. Each entry for a protein (or protein complex) in UniPROBE provides the quantitative preferences for all possible nucleotide sequence variants ('words') of length k ('k-mers'), as well as position weight matrix (PWM) and graphical sequence logo representations of the k-mer data. In this update, we describe >130% expansion of the database content, incorporation of a protein BLAST (blastp) tool for finding protein sequence matches in UniPROBE, the introduction of UniPROBE accession numbers and additional database enhancements. The UniPROBE database is available at http://uniprobe.org.
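The k-mer and position weight matrix (PWM) representations mentioned in this abstract can be illustrated with a minimal sketch. The binding sites below are invented, and the pseudocount, the uniform background of 0.25 and the log2-odds scoring are assumptions made for the example; this is not how UniPROBE derives its matrices from PBM data.

```python
import math
from itertools import product

ALPHABET = "ACGT"


def all_kmers(k):
    """Enumerate every possible DNA word of length k (4**k words)."""
    return ["".join(letters) for letters in product(ALPHABET, repeat=k)]


def position_weight_matrix(sites, pseudocount=1.0, background=0.25):
    """Build a log2-odds PWM from equal-length aligned sites."""
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    denominator = len(sites) + pseudocount * len(ALPHABET)
    pwm = []
    for position in range(length):
        column = [site[position] for site in sites]
        pwm.append({
            base: math.log2((column.count(base) + pseudocount) / denominator / background)
            for base in ALPHABET
        })
    return pwm


def score(pwm, kmer):
    """Sum the per-position log-odds for one k-mer."""
    return sum(column[base] for column, base in zip(pwm, kmer))


# Hypothetical aligned binding sites, for illustration only.
sites = ["TGACGT", "TGACGC", "TGATGT", "TGACGT"]
pwm = position_weight_matrix(sites)
best = max(all_kmers(6), key=lambda kmer: score(pwm, kmer))
print(len(all_kmers(6)), best)  # 4096 TGACGT
```

Scoring every k-mer against the matrix shows how a compact PWM summarises a full table of per-word preferences, at the cost of treating positions as independent.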
THPdb: Database of FDA-approved peptide and protein therapeutics.
Usmani, Salman Sadullah; Bedi, Gursimran; Samuel, Jesse S; Singh, Sandeep; Kalra, Sourav; Kumar, Pawan; Ahuja, Anjuman Arora; Sharma, Meenu; Gautam, Ankur; Raghava, Gajendra P S
2017-01-01
THPdb (http://crdd.osdd.net/raghava/thpdb/) is a manually curated repository of Food and Drug Administration (FDA) approved therapeutic peptides and proteins. The information in THPdb has been compiled from 985 research publications, 70 patents and other resources like DrugBank. The current version of the database holds a total of 852 entries, providing comprehensive information on 239 US-FDA approved therapeutic peptides and proteins and their 380 drug variants. The information on each peptide and protein includes its sequence, chemical properties, composition, disease area, mode of activity, physical appearance, category or pharmacological class, pharmacodynamics, route of administration, toxicity, target of activity, etc. In addition, we have annotated the structure of most of the proteins and peptides. A number of user-friendly tools have been integrated to facilitate easy browsing and data analysis. To assist the scientific community, a web interface and mobile app have also been developed.
Arntzen, Magnus Ø; Thiede, Bernd
2012-02-01
Apoptosis is the most commonly described form of programmed cell death, and its dysfunction is implicated in a large number of human diseases. Many quantitative proteome analyses of apoptosis have been performed to gain insight into the proteins involved in the process. This resulted in large and complex data sets that are difficult to evaluate. Therefore, we developed the ApoptoProteomics database for storage, browsing, and analysis of the outcome of large scale proteome analyses of apoptosis derived from human, mouse, and rat. The proteomics data of 52 publications were integrated and unified with protein annotations from UniProt-KB, the caspase substrate database homepage (CASBAH), and gene ontology. Currently, more than 2300 records of more than 1500 unique proteins were included, covering a large proportion of the core signaling pathways of apoptosis. Analysis of the data set revealed a high level of agreement between the changes in directionality reported in proteomics studies and expected apoptosis-related function and may disclose proteins without a currently recognized involvement in apoptosis based on gene ontology. Comparison between induction of apoptosis by the intrinsic and the extrinsic apoptotic signaling pathway revealed slight differences. Furthermore, proteomics has significantly contributed to the field of apoptosis in identifying hundreds of caspase substrates. The database is available at http://apoptoproteomics.uio.no.
Arntzen, Magnus Ø.; Thiede, Bernd
2012-01-01
Apoptosis is the most commonly described form of programmed cell death, and its dysfunction is implicated in a large number of human diseases. Many quantitative proteome analyses of apoptosis have been performed to gain insight into the proteins involved in the process. This resulted in large and complex data sets that are difficult to evaluate. Therefore, we developed the ApoptoProteomics database for storage, browsing, and analysis of the outcome of large scale proteome analyses of apoptosis derived from human, mouse, and rat. The proteomics data of 52 publications were integrated and unified with protein annotations from UniProt-KB, the caspase substrate database homepage (CASBAH), and gene ontology. Currently, more than 2300 records of more than 1500 unique proteins were included, covering a large proportion of the core signaling pathways of apoptosis. Analysis of the data set revealed a high level of agreement between the changes in directionality reported in proteomics studies and expected apoptosis-related function and may disclose proteins without a currently recognized involvement in apoptosis based on gene ontology. Comparison between induction of apoptosis by the intrinsic and the extrinsic apoptotic signaling pathway revealed slight differences. Furthermore, proteomics has significantly contributed to the field of apoptosis in identifying hundreds of caspase substrates. The database is available at http://apoptoproteomics.uio.no. PMID:22067098
Hernandez-Valladares, Maria; Vaudel, Marc; Selheim, Frode; Berven, Frode; Bruserud, Øystein
2017-08-01
Mass spectrometry (MS)-based proteomics has become an indispensable tool for the characterization of the proteome and its post-translational modifications (PTM). In addition to standard protein sequence databases, proteogenomics strategies search the spectral data against the theoretical spectra obtained from customized protein sequence databases. To date, there are no published proteogenomics studies on acute myeloid leukemia (AML) samples. Areas covered: Proteogenomics involves the understanding of genomic and proteomic data. The intersection of both data types requires advanced bioinformatics skills. A standard proteogenomics workflow that could be used for the study of AML samples is described. The generation of customized protein sequence databases as well as bioinformatics tools and pipelines commonly used in proteogenomics are discussed in detail. Expert commentary: Drawing on evidence from recent cancer proteogenomics studies and taking into account the public availability of AML genomic data, the interpretation of present and future MS-based AML proteomic data using AML-specific protein sequence databases could discover new biological mechanisms and targets in AML. However, proteogenomics workflows including bioinformatics guidelines can be challenging for the wide AML research community. It is expected that further automation and simplification of the bioinformatics procedures might attract AML investigators to adopt the proteogenomics strategy.
Finding Protein and Nucleotide Similarities with FASTA
Pearson, William R.
2016-01-01
The FASTA programs provide a comprehensive set of rapid similarity searching tools (fasta36, fastx36, tfastx36, fasty36, tfasty36), similar to those provided by the BLAST package, as well as programs for slower, optimal, local and global similarity searches (ssearch36, ggsearch36) and for searching with short peptides and oligonucleotides (fasts36, fastm36). The FASTA programs use an empirical strategy for estimating statistical significance that accommodates a range of similarity scoring matrices and gap penalties, improving alignment boundary accuracy and search sensitivity (Unit 3.5). The FASTA programs can produce “BLAST-like” alignment and tabular output, for ease of integration into existing analysis pipelines, and can search small, representative databases, and then report results for a larger set of sequences, using links from the smaller dataset. The FASTA programs work with a wide variety of database formats, including mySQL and postgreSQL databases (Unit 9.4). The programs also provide a strategy for integrating domain and active site annotations into alignments and highlighting the mutational state of functionally critical residues. These protocols describe how to use the FASTA programs to characterize protein and DNA sequences, using protein:protein, protein:DNA, and DNA:DNA comparisons. PMID:27010337
YTPdb: a wiki database of yeast membrane transporters.
Brohée, Sylvain; Barriot, Roland; Moreau, Yves; André, Bruno
2010-10-01
Membrane transporters constitute one of the largest functional categories of proteins in all organisms. In the yeast Saccharomyces cerevisiae, this represents about 300 proteins (approximately 5% of the proteome). We here present the Yeast Transport Protein database (YTPdb), a user-friendly collaborative resource dedicated to the precise classification and annotation of yeast transporters. YTPdb exploits an evolution of the MediaWiki web engine used for popular collaborative databases like Wikipedia, allowing every registered user to edit the data in a user-friendly manner. Proteins in YTPdb are classified on the basis of functional criteria such as subcellular location or their substrate compounds. These classifications are hierarchical, allowing queries to be performed at various levels, from highly specific (e.g. ammonium as a substrate or the vacuole as a location) to broader (e.g. cation as a substrate or inner membranes as location). Other resources accessible for each transporter via YTPdb include post-translational modifications, K(m) values, a permanently updated bibliography, and a hierarchical classification into families. The YTPdb concept can be extrapolated to other organisms and could even be applied for other functional categories of proteins. YTPdb is accessible at http://homes.esat.kuleuven.be/ytpdb/. Copyright © 2010 Elsevier B.V. All rights reserved.
Finding Protein and Nucleotide Similarities with FASTA.
Pearson, William R
2016-03-24
The FASTA programs provide a comprehensive set of rapid similarity searching tools (fasta36, fastx36, tfastx36, fasty36, tfasty36), similar to those provided by the BLAST package, as well as programs for slower, optimal, local, and global similarity searches (ssearch36, ggsearch36), and for searching with short peptides and oligonucleotides (fasts36, fastm36). The FASTA programs use an empirical strategy for estimating statistical significance that accommodates a range of similarity scoring matrices and gap penalties, improving alignment boundary accuracy and search sensitivity. The FASTA programs can produce "BLAST-like" alignment and tabular output, for ease of integration into existing analysis pipelines, and can search small, representative databases, and then report results for a larger set of sequences, using links from the smaller dataset. The FASTA programs work with a wide variety of database formats, including mySQL and postgreSQL databases. The programs also provide a strategy for integrating domain and active site annotations into alignments and highlighting the mutational state of functionally critical residues. These protocols describe how to use the FASTA programs to characterize protein and DNA sequences, using protein:protein, protein:DNA, and DNA:DNA comparisons. Copyright © 2016 John Wiley & Sons, Inc.
Kristensen, David M.; Wolf, Yuri I.; Koonin, Eugene V.
2017-01-01
The Alignable Tight Genomic Clusters (ATGCs) database is a collection of closely related bacterial and archaeal genomes that provides several tools to aid research into evolutionary processes in the microbial world. Each ATGC is a taxonomy-independent cluster of 2 or more completely sequenced genomes that meet the objective criteria of a high degree of local gene order (synteny) and a small number of synonymous substitutions in the protein-coding genes. As such, each ATGC is suited for analysis of microevolutionary variations within a cohesive group of organisms (e.g. species), whereas the entire collection of ATGCs is useful for macroevolutionary studies. The ATGC database includes many forms of pre-computed data, in particular ATGC-COGs (Clusters of Orthologous Genes), multiple sequence alignments, a set of ‘index’ orthologs representing the most well-conserved members of each ATGC-COG, the phylogenetic tree of the organisms within each ATGC, etc. Although the ATGC database contains several million proteins from thousands of genomes organized into hundreds of clusters (roughly a 4-fold increase since the last version of the ATGC database), it is now built with completely automated methods and will be regularly updated following new releases of the NCBI RefSeq database. The ATGC database is hosted jointly at the University of Iowa at dmk-brain.ecn.uiowa.edu/ATGC/ and the NCBI at ftp.ncbi.nlm.nih.gov/pub/kristensen/ATGC/atgc_home.html. PMID:28053163
Vathipadiekal, Vinod; Wang, Victoria; Wei, Wei; Waldron, Levi; Drapkin, Ronny; Gillette, Michael; Skates, Steven; Birrer, Michael
2015-11-01
To generate a comprehensive "Secretome" of proteins potentially found in the blood and derive a virtual Affymetrix array. To validate the utility of this database for the discovery of novel serum-based biomarkers using ovarian cancer transcriptomic data. The secretome was constructed by aggregating the data from databases of known secreted proteins, transmembrane or membrane proteins, signal peptides, G-protein coupled receptors, or proteins existing in the extracellular region, and the virtual array was generated by mapping them to Affymetrix probeset identifiers. Whole-genome microarray data from ovarian cancer, normal ovarian surface epithelium, and fallopian tube epithelium were used to identify transcripts upregulated in ovarian cancer. We established the secretome from eight public databases and a virtual array consisting of 16,521 Affymetrix U133 Plus 2.0 probesets. Using ovarian cancer transcriptomic data, we identified candidate blood-based biomarkers for ovarian cancer and performed bioinformatic validation by demonstrating rediscovery of known biomarkers including CA125 and HE4. Two novel top biomarkers (FGF18 and GPR172A) were validated in serum samples from an independent patient cohort. We present the secretome, comprising the most comprehensive resource available for protein products that are potentially found in the blood. The associated virtual array can be used to translate gene-expression data into cancer biomarker discovery. A list of blood-based biomarkers for ovarian cancer detection is reported and includes CA125 and HE4. FGF18 and GPR172A were identified and validated by ELISA as being differentially expressed in the serum of ovarian cancer patients compared with controls. ©2015 American Association for Cancer Research.
MIPS: a database for protein sequences, homology data and yeast genome information.
Mewes, H W; Albermann, K; Heumann, K; Liebl, S; Pfeiffer, F
1997-01-01
The MIPS group (Martinsried Institute for Protein Sequences) at the Max-Planck-Institute for Biochemistry, Martinsried near Munich, Germany, collects, processes and distributes protein sequence data within the framework of the tripartite association of the PIR-International Protein Sequence Database. MIPS contributes nearly 50% of the data input to the PIR-International Protein Sequence Database. The database is distributed on CD-ROM together with PATCHX, an exhaustive supplement of unique, unverified protein sequences from external sources compiled by MIPS. Through its WWW server (http://www.mips.biochem.mpg.de/) MIPS permits internet access to sequence databases, homology data and yeast genome information. (i) Sequence similarity results from the FASTA program are stored in the FASTA database for all proteins from PIR-International and PATCHX. The database is dynamically maintained and permits instant access to FASTA results. (ii) Starting with FASTA database queries, proteins have been classified into families and superfamilies (PROT-FAM). (iii) The HPT (hashed position tree) data structure developed at MIPS is a new approach for rapid sequence and pattern searching. (iv) MIPS provides access to the sequence and annotation of the complete yeast genome, the functional classification of yeast genes (FunCat) and its graphical display, the 'Genome Browser'. A CD-ROM based on the JAVA programming language providing dynamic interactive access to the yeast genome and the related protein sequences has been compiled and is available on request. PMID:9016498
Genes2Networks: connecting lists of gene symbols using mammalian protein interactions databases.
Berger, Seth I; Posner, Jeremy M; Ma'ayan, Avi
2007-10-04
In recent years, mammalian protein-protein interaction network databases have been developed. The interactions in these databases are either extracted manually from low-throughput experimental biomedical research literature, extracted automatically from literature using techniques such as natural language processing (NLP), generated experimentally using high-throughput methods such as yeast-2-hybrid screens, or predicted using an assortment of computational approaches. Genes or proteins identified as significantly changing in proteomic experiments, or identified as susceptibility disease genes in genomic studies, can be placed in the context of protein interaction networks in order to assign these genes and proteins to pathways and protein complexes. Genes2Networks is a software system that integrates the content of ten mammalian interaction network datasets. Filtering techniques to prune low-confidence interactions were implemented. Genes2Networks is delivered as a web-based service using AJAX. The system can be used to extract relevant subnetworks created from "seed" lists of human Entrez gene symbols. The output includes a dynamic, linkable, three-color web-based network map, with a statistical analysis report that identifies significant intermediate nodes used to connect the seed list. Genes2Networks is powerful web-based software that can help experimental biologists to interpret lists of genes and proteins such as those commonly produced through genomic and proteomic experiments, as well as lists of genes and proteins associated with disease processes. This system can be used to find relationships between genes and proteins from seed lists, and to predict additional genes or proteins that may play key roles in common pathways or protein complexes.
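The idea of connecting a seed list through intermediate nodes can be sketched in a few lines. This is only an illustration under simplifying assumptions: the node labels, the toy edge list, and the function name `connect_seeds` are hypothetical, and the statistical scoring of intermediates described in the abstract is not reproduced here.

```python
from collections import defaultdict


def connect_seeds(edges, seeds, min_links=2):
    """Return non-seed nodes that interact directly with at least `min_links` seeds.

    edges: iterable of (node_a, node_b) undirected interactions
    seeds: iterable of seed node labels
    """
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)

    seed_set = set(seeds)
    intermediates = {}
    for node, linked_nodes in neighbors.items():
        if node in seed_set:
            continue  # only non-seed nodes can act as intermediates
        linked_seeds = linked_nodes & seed_set
        if len(linked_seeds) >= min_links:
            intermediates[node] = sorted(linked_seeds)
    return intermediates


# Toy interaction network (hypothetical labels, not real gene symbols)
toy_edges = [("S1", "X"), ("S2", "X"), ("S3", "Y"), ("S1", "Z"), ("Z", "W")]
toy_seeds = ["S1", "S2", "S3"]

result = connect_seeds(toy_edges, toy_seeds)
print(result)
```

In this toy network only node X touches two seeds, so it is the only intermediate reported; a fuller treatment would also consider longer paths and would test whether each intermediate is statistically significant.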
MAPU: Max-Planck Unified database of organellar, cellular, tissue and body fluid proteomes.
Zhang, Yanling; Zhang, Yong; Adachi, Jun; Olsen, Jesper V; Shi, Rong; de Souza, Gustavo; Pasini, Erica; Foster, Leonard J; Macek, Boris; Zougman, Alexandre; Kumar, Chanchal; Wisniewski, Jacek R; Jun, Wang; Mann, Matthias
2007-01-01
Mass spectrometry (MS)-based proteomics has become a powerful technology to map the protein composition of organelles, cell types and tissues. In our department, a large-scale effort to map these proteomes is complemented by the Max-Planck Unified (MAPU) proteome database. MAPU contains several body fluid proteomes, including plasma, urine, and cerebrospinal fluid. Cell lines have been mapped to a depth of several thousand proteins and the red blood cell proteome has also been analyzed in depth. The liver proteome is represented with 3200 proteins. By employing high resolution MS and stringent validation criteria, false positive identification rates in MAPU are lower than 1:1000. Thus MAPU datasets can serve as reference proteomes in biomarker discovery. MAPU contains the peptides identifying each protein, measured masses, scores and intensities and is freely available at http://www.mapuproteome.com using a clickable interface of cell or body parts. Proteome data can be queried across proteomes by protein name, accession number, sequence similarity, peptide sequence and annotation information. More than 4500 mouse and 2500 human proteins have already been identified in at least one proteome. Basic annotation information and links to other public databases are provided in MAPU and we plan to add further analysis tools.
The PMDB Protein Model Database
Castrignanò, Tiziana; De Meo, Paolo D'Onorio; Cozzetto, Domenico; Talamo, Ivano Giuseppe; Tramontano, Anna
2006-01-01
The Protein Model Database (PMDB) is a public resource aimed at storing manually built 3D models of proteins. The database is designed to provide access to models published in the scientific literature, together with validating experimental data. It is a relational database and it currently contains >74 000 models for ∼240 proteins. The system is accessible at and allows predictors to submit models along with related supporting evidence and users to download them through a simple and intuitive interface. Users can navigate in the database and retrieve models referring to the same target protein or to different regions of the same protein. Each model is assigned a unique identifier that allows interested users to directly access the data. PMID:16381873
PDB_TM: selection and membrane localization of transmembrane proteins in the protein data bank.
Tusnády, Gábor E; Dosztányi, Zsuzsanna; Simon, István
2005-01-01
PDB_TM is a database for transmembrane proteins with known structures. It aims to collect all transmembrane proteins that are deposited in the protein structure database (PDB) and to determine their membrane-spanning regions. These assignments are based on the TMDET algorithm, which uses only structural information to locate the most likely position of the lipid bilayer and to distinguish between transmembrane and globular proteins. This algorithm was applied to all PDB entries and the results were collected in the PDB_TM database. By using the TMDET algorithm, the PDB_TM database can be automatically updated every week, keeping it synchronized with the latest PDB updates. The PDB_TM database is available at http://www.enzim.hu/PDB_TM.
Rallapalli, P M; Kemball-Cook, G; Tuddenham, E G; Gomez, K; Perkins, S J
2013-07-01
Factor IX (FIX) is important in the coagulation cascade, being activated to FIXa on cleavage. Defects in the human F9 gene frequently lead to hemophilia B. To assess 1113 unique F9 mutations corresponding to 3721 patient entries in a new and up-to-date interactive web database alongside the FIXa protein structure. The mutations database was built using MySQL and structural analyses were based on a homology model for the human FIXa structure based on closely-related crystal structures. Mutations have been found in 336 (73%) out of 461 residues in FIX. There were 812 unique point mutations, 182 deletions, 54 polymorphisms, 39 insertions and 26 others that together comprise a total of 1113 unique variants. The 64 unique mild severity mutations in the mature protein with known circulating protein phenotypes include 15 (23%) quantitative type I mutations and 41 (64%) predominantly qualitative type II mutations. Inhibitors were described in 59 reports (1.6%) corresponding to 25 unique mutations. The interactive database provides insights into mechanisms of hemophilia B. Type II mutations are deduced to disrupt predominantly those structural regions involved with functional interactions. The interactive features of the database will assist in making judgments about patient management. © 2013 International Society on Thrombosis and Haemostasis.
Kinase Pathway Database: An Integrated Protein-Kinase and NLP-Based Protein-Interaction Resource
Koike, Asako; Kobayashi, Yoshiyuki; Takagi, Toshihisa
2003-01-01
Protein kinases play a crucial role in the regulation of cellular functions. Various kinds of information about these molecules are important for understanding signaling pathways and organism characteristics. We have developed the Kinase Pathway Database, an integrated database involving major completely sequenced eukaryotes. It contains the classification of protein kinases and their functional conservation, ortholog tables among species, protein–protein, protein–gene, and protein–compound interaction data, domain information, and structural information. It also provides an automatic pathway graphic image interface. The protein, gene, and compound interactions are automatically extracted from abstracts for all genes and proteins by natural-language processing (NLP). The method of automatic extraction uses phrase patterns and the GENA protein, gene, and compound name dictionary, which was developed by our group. With this database, pathways are easily compared among species using data with more than 47,000 protein interactions and protein kinase ortholog tables. The database is available for querying and browsing at http://kinasedb.ontology.ims.u-tokyo.ac.jp/. PMID:12799355
Syed, Mustafa H; Karpinets, Tatiana V; Leuze, Michael R; Kora, Guruprasad H; Romine, Margaret R; Uberbacher, Edward C
2009-01-01
Shewanella oneidensis MR-1 is an important model organism for environmental research as it has an exceptional metabolic and respiratory versatility regulated by a complex regulatory network. We have developed a database to collect experimental and computational data relating to regulation of gene and protein expression, and a visualization environment that enables integration of these data types. The regulatory information in the database includes predictions of DNA regulator binding sites, sigma factor binding sites, transcription units, operons, promoters, and RNA regulators including non-coding RNAs, riboswitches, and different types of terminators. Availability: http://shewanella-knowledgebase.org:8080/Shewanella/gbrowserLanding.jsp PMID:20198195
ASGARD: an open-access database of annotated transcriptomes for emerging model arthropod species.
Zeng, Victor; Extavour, Cassandra G
2012-01-01
The increased throughput and decreased cost of next-generation sequencing (NGS) have shifted the bottleneck in genomic research from sequencing to annotation, analysis and accessibility. This is particularly challenging for research communities working on organisms that lack the basic infrastructure of a sequenced genome, or an efficient way to utilize whatever sequence data may be available. Here we present a new database, the Assembled Searchable Giant Arthropod Read Database (ASGARD). This database is a repository and search engine for transcriptomic data from arthropods that are of high interest to multiple research communities but currently lack sequenced genomes. We demonstrate the functionality and utility of ASGARD using de novo assembled transcriptomes from the milkweed bug Oncopeltus fasciatus, the cricket Gryllus bimaculatus and the amphipod crustacean Parhyale hawaiensis. We have annotated these transcriptomes to assign putative orthology, coding region determination, protein domain identification and Gene Ontology (GO) term annotation to all possible assembly products. ASGARD allows users to search all assemblies by orthology annotation, GO term annotation or Basic Local Alignment Search Tool. User-friendly features of ASGARD include search term auto-completion suggestions based on database content, the ability to download assembly product sequences in FASTA format, direct links to NCBI data for predicted orthologs and graphical representation of the location of protein domains and matches to similar sequences from the NCBI non-redundant database. ASGARD will be a useful repository for transcriptome data from future NGS studies on these and other emerging model arthropods, regardless of sequencing platform, assembly or annotation status. This database thus provides easy, one-stop access to multi-species annotated transcriptome information. We anticipate that this database will be useful for members of multiple research communities, including developmental biology, physiology, evolutionary biology, ecology, comparative genomics and phylogenomics. Database URL: asgard.rc.fas.harvard.edu.
Hsing, Michael; Cherkasov, Artem
2008-06-25
Insertions and deletions (indels) represent a common type of sequence variation, yet they are less studied and pose many important biological questions. Recent research has shown that the presence of sizable indels in protein sequences may be indicative of protein essentiality and their role in protein interaction networks. Examples of utilization of indels for structure-based drug design have also been recently demonstrated. Nonetheless, many structural and functional characteristics of indels remain less researched or unknown. We have created a web-based resource, Indel PDB, representing a structural database of insertions/deletions identified from the sequence alignments of highly similar proteins found in the Protein Data Bank (PDB). Indel PDB utilized large amounts of available structural information to characterize 1-, 2- and 3-dimensional features of indel sites. Indel PDB contains 117,266 non-redundant indel sites extracted from 11,294 indel-containing proteins. Unlike loop databases, Indel PDB features more indel sequences with secondary structures including alpha-helices and beta-sheets in addition to loops. The insertion fragments have been characterized by their sequences, lengths, locations, secondary structure composition, solvent accessibility, protein domain association and three dimensional structures. By utilizing the data available in Indel PDB, we have studied and presented here several sequence and structural features of indels. We anticipate that Indel PDB will not only enable future functional studies of indels, but will also assist protein modeling efforts and identification of indel-directed drug binding sites.
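As a rough illustration of how indel sites can be read off a pairwise alignment of two highly similar sequences, the sketch below reports each run of columns in which exactly one sequence carries a gap. The two aligned sequences and the function name `find_indels` are hypothetical, and the abstract does not describe the actual extraction procedure used for Indel PDB.

```python
from itertools import groupby


def find_indels(aligned_a, aligned_b, gap="-"):
    """Return (start_column, length, sequence_with_residues, residues) for each gap run.

    The third field is "A" when the residues are present in aligned_a (gap in
    aligned_b) and "B" when they are present in aligned_b (gap in aligned_a).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")

    def column_state(pair):
        a, b = pair
        if a != gap and b == gap:
            return "A"
        if a == gap and b != gap:
            return "B"
        return None  # both residues, or both gaps

    indels = []
    column = 0
    for state, group in groupby(zip(aligned_a, aligned_b), key=column_state):
        columns = list(group)
        if state is not None:
            residues = "".join(a if state == "A" else b for a, b in columns)
            indels.append((column, len(columns), state, residues))
        column += len(columns)
    return indels


# Toy alignment (hypothetical sequences, for illustration only)
seq_a = "MKTAYIAKQR--QISFV"
seq_b = "MKT---AKQRGGQISFV"

indel_sites = find_indels(seq_a, seq_b)
print(indel_sites)
```

Each reported site could then be mapped onto structure coordinates to derive features such as secondary structure or solvent accessibility, which is the kind of annotation the abstract lists.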
The value of protein structure classification information—Surveying the scientific literature
Fox, Naomi K.; Brenner, Steven E.
2015-01-01
The Structural Classification of Proteins (SCOP) and Class, Architecture, Topology, Homology (CATH) databases have been valuable resources for protein structure classification for over 20 years. Development of SCOP (version 1) concluded in June 2009 with SCOP 1.75. The SCOPe (SCOP–extended) database offers continued development of the classic SCOP hierarchy, adding over 33,000 structures. We have attempted to assess the impact of these two decade old resources and guide future development. To this end, we surveyed recent articles to learn how structure classification data are used. Of 571 articles published in 2012–2013 that cite SCOP, 439 actually use data from the resource. We found that the type of use was fairly evenly distributed among four top categories: A) study protein structure or evolution (27% of articles), B) train and/or benchmark algorithms (28% of articles), C) augment non‐SCOP datasets with SCOP classification (21% of articles), and D) examine the classification of one protein/a small set of proteins (22% of articles). Most articles described computational research, although 11% described purely experimental research, and a further 9% included both. We examined how CATH and SCOP were used in 158 articles that cited both databases: while some studies used only one dataset, the majority used data from both resources. Protein structure classification remains highly relevant for a diverse range of problems and settings. Proteins 2015; 83:2025–2038. © 2015 The Authors. Proteins: Structure, Function, and Bioinformatics Published by Wiley Periodicals, Inc. PMID:26313554
Metagenomic Taxonomy-Guided Database-Searching Strategy for Improving Metaproteomic Analysis.
Xiao, Jinqiu; Tanca, Alessandro; Jia, Ben; Yang, Runqing; Wang, Bo; Zhang, Yu; Li, Jing
2018-04-06
Metaproteomics provides a direct measure of the functional information by investigating all proteins expressed by a microbiota. However, due to the complexity and heterogeneity of microbial communities, it is very hard to construct a sequence database suitable for a metaproteomic study. Using a public database, researchers might not be able to identify proteins from poorly characterized microbial species, while a sequencing-based metagenomic database may not provide adequate coverage for all potentially expressed protein sequences. To address this challenge, we propose a metagenomic taxonomy-guided database-search strategy (MT), in which a merged database is employed, consisting of both taxonomy-guided reference protein sequences from public databases and proteins from metagenome assembly. By applying our MT strategy to a mock microbial mixture, about two times as many peptides were detected as with the metagenomic database only. According to the evaluation of the reliability of taxonomic attribution, the rate of misassignments was comparable to that obtained using an a priori matched database. We also evaluated the MT strategy with a human gut microbial sample, and we found 1.7 times as many peptides as using a standard metagenomic database. In conclusion, our MT strategy allows the construction of databases able to provide high sensitivity and precision in peptide identification in metaproteomic studies, enabling the detection of proteins from poorly characterized species within the microbiota.
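The merged-database idea can be sketched as follows. This is a minimal illustration, assuming both inputs are already available as identifier-to-sequence mappings and assuming that identical sequences should be kept only once; the identifiers, the sequences, and the function name `merge_databases` are hypothetical, and the abstract does not specify how redundancy is handled.

```python
def merge_databases(reference_proteins, metagenome_proteins):
    """Merge two {identifier: sequence} mappings, keeping one entry per distinct sequence.

    Reference entries are added first, so when a sequence occurs in both
    sources the reference identifier is the one that is kept (an assumption
    made for this sketch).
    """
    merged = {}
    seen_sequences = set()
    for source in (reference_proteins, metagenome_proteins):
        for identifier, sequence in source.items():
            if sequence not in seen_sequences:
                seen_sequences.add(sequence)
                merged[identifier] = sequence
    return merged


# Toy inputs (hypothetical identifiers and sequences)
taxonomy_guided_reference = {"ref|P1": "MKVLAA", "ref|P2": "MSTNPK"}
metagenome_assembly = {"contig1_orf1": "MSTNPK", "contig2_orf3": "MQQRLT"}

search_db = merge_databases(taxonomy_guided_reference, metagenome_assembly)
print(search_db)
```

The resulting mapping would then be written out in whatever format the search engine expects; the duplicate `MSTNPK` entry from the metagenome is dropped here because the reference copy was seen first.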
Analysis of gene expression profile microarray data in complex regional pain syndrome.
Tan, Wulin; Song, Yiyan; Mo, Chengqiang; Jiang, Shuangjian; Wang, Zhongxing
2017-09-01
The aim of the present study was to predict key genes and proteins associated with complex regional pain syndrome (CRPS) using bioinformatics analysis. The gene expression profiling microarray data, GSE47603, which included peripheral blood samples from 4 patients with CRPS and 5 healthy controls, was obtained from the Gene Expression Omnibus (GEO) database. The differentially expressed genes (DEGs) in CRPS patients compared with healthy controls were identified using the GEO2R online tool. Functional enrichment analysis was then performed using the Database for Annotation, Visualization and Integrated Discovery (DAVID) online tool. Protein‑protein interaction (PPI) network analysis was subsequently performed using the Search Tool for the Retrieval of Interacting Genes (STRING) database and analyzed with Cytoscape software. A total of 257 DEGs were identified, including 243 upregulated genes and 14 downregulated ones. Genes in the human leukocyte antigen (HLA) family were most significantly differentially expressed. Enrichment analysis demonstrated that signaling pathways, including immune response, cell motion, adhesion and angiogenesis were associated with CRPS. PPI network analysis revealed that key genes, including early region 1A binding protein p300 (EP300), CREB‑binding protein (CREBBP), signal transducer and activator of transcription (STAT)3, STAT5A and integrin α M were associated with CRPS. The results suggest that the immune response may serve an important role in CRPS development. In addition, genes in the HLA family, such as HLA‑DQB1 and HLA‑DRB1, may represent potential biomarkers for the diagnosis of CRPS. Furthermore, EP300, its paralog CREBBP, and the STAT family genes, STAT3 and STAT5 may be important in the development of CRPS.
Ravikumar, Komandur Elayavilli; Wagholikar, Kavishwar B; Li, Dingcheng; Kocher, Jean-Pierre; Liu, Hongfang
2015-06-06
Advances in next generation sequencing technology have accelerated the pace of individualized medicine (IM), which aims to incorporate genetic/genomic information into medicine. One immediate need in interpreting sequencing data is the assembly of information about genetic variants and their corresponding associations with other entities (e.g., diseases or medications). Even with dedicated effort to capture such information in biological databases, much of this information remains 'locked' in the unstructured text of biomedical publications. There is a substantial lag between the publication and the subsequent abstraction of such information into databases. Multiple text mining systems have been developed, but most of them focus on the sentence level association extraction with performance evaluation based on gold standard text annotations specifically prepared for text mining systems. We developed and evaluated a text mining system, MutD, which extracts protein mutation-disease associations from MEDLINE abstracts by incorporating discourse level analysis, using a benchmark data set extracted from curated database records. MutD achieves an F-measure of 64.3% for reconstructing protein mutation disease associations in curated database records. The discourse level analysis component of MutD contributed to a gain of more than 10% in F-measure when compared against the sentence level association extraction. Our error analysis indicates that 23 of the 64 precision errors are true associations that were not captured by database curators and 68 of the 113 recall errors are caused by the absence of associated disease entities in the abstract. After adjusting for the defects in the curated database, the revised F-measure of MutD in association detection reaches 81.5%. Our quantitative analysis reveals that MutD can effectively extract protein mutation disease associations when benchmarking based on curated database records. The analysis also demonstrates that incorporating discourse level analysis significantly improved the performance of extracting the protein-mutation-disease association. Future work includes the extension of MutD for full text articles.
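For readers unfamiliar with the metric, the F-measure reported above is the harmonic mean of precision and recall. The sketch below shows the standard calculation and how the score moves when some apparent false positives are reclassified as true positives, as happens when a benchmark is corrected. All counts are hypothetical and are not the counts behind the 64.3% or 81.5% figures.

```python
def f_measure(true_positives, false_positives, false_negatives):
    """Return the balanced F-measure (F1), the harmonic mean of precision and recall."""
    denominator = 2 * true_positives + false_positives + false_negatives
    if denominator == 0:
        return 0.0
    # Algebraically equal to 2 * P * R / (P + R)
    return 2 * true_positives / denominator


# Hypothetical counts, for illustration only
baseline = f_measure(true_positives=80, false_positives=20, false_negatives=20)

# Suppose 5 of the 20 false positives turn out to be correct associations
# that the benchmark had missed: they move from FP to TP.
adjusted = f_measure(true_positives=85, false_positives=15, false_negatives=20)

print(round(baseline, 3), round(adjusted, 3))
```

The adjustment step mirrors the general logic of correcting a benchmark for curation gaps, but the exact adjustment procedure used in the study is not given in the abstract.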
The value of protein structure classification information-Surveying the scientific literature
Fox, Naomi K.; Brenner, Steven E.; Chandonia, John -Marc
2015-08-27
The Structural Classification of Proteins (SCOP) and Class, Architecture, Topology, Homology (CATH) databases have been valuable resources for protein structure classification for over 20 years. Development of SCOP (version 1) concluded in June 2009 with SCOP 1.75. The SCOPe (SCOP-extended) database offers continued development of the classic SCOP hierarchy, adding over 33,000 structures. We have attempted to assess the impact of these two decade old resources and guide future development. To this end, we surveyed recent articles to learn how structure classification data are used. Of 571 articles published in 2012-2013 that cite SCOP, 439 actually use data from the resource. We found that the type of use was fairly evenly distributed among four top categories: A) study protein structure or evolution (27% of articles), B) train and/or benchmark algorithms (28% of articles), C) augment non-SCOP datasets with SCOP classification (21% of articles), and D) examine the classification of one protein/a small set of proteins (22% of articles). Most articles described computational research, although 11% described purely experimental research, and a further 9% included both. We examined how CATH and SCOP were used in 158 articles that cited both databases: while some studies used only one dataset, the majority used data from both resources. Protein structure classification remains highly relevant for a diverse range of problems and settings.
Gramene database in 2010: updates and extensions.
Youens-Clark, Ken; Buckler, Ed; Casstevens, Terry; Chen, Charles; Declerck, Genevieve; Derwent, Paul; Dharmawardhana, Palitha; Jaiswal, Pankaj; Kersey, Paul; Karthikeyan, A S; Lu, Jerry; McCouch, Susan R; Ren, Liya; Spooner, William; Stein, Joshua C; Thomason, Jim; Wei, Sharon; Ware, Doreen
2011-01-01
Now in its 10th year, the Gramene database (http://www.gramene.org) has grown from its primary focus on rice, the first fully-sequenced grass genome, to become a resource for major model and crop plants including Arabidopsis, Brachypodium, maize, sorghum, poplar and grape in addition to several species of rice. Gramene began with the addition of an Ensembl genome browser and has expanded in the last decade to become a robust resource for plant genomics hosting a wide array of data sets including quantitative trait loci (QTL), metabolic pathways, genetic diversity, genes, proteins, germplasm, literature, ontologies and a fully-structured markers and sequences database integrated with genome browsers and maps from various published studies (genetic, physical, bin, etc.). In addition, Gramene now hosts a variety of web services including a Distributed Annotation Server (DAS), BLAST and a public MySQL database. Twice a year, Gramene releases a major build of the database and makes interim releases to correct errors or to make important updates to software and/or data.
DOE R&D Accomplishments Database
Chandonia, John-Marc; Hon, Gary; Walker, Nigel S.; Lo Conte, Loredana; Koehl, Patrice; Levitt, Michael; Brenner, Steven E.
2003-09-15
The ASTRAL compendium provides several databases and tools to aid in the analysis of protein structures, particularly through the use of their sequences. Partially derived from the SCOP database of protein structure domains, it includes sequences for each domain and other resources useful for studying these sequences and domain structures. The current release of ASTRAL contains 54,745 domains, more than three times as many as the initial release four years ago. ASTRAL has undergone major transformations in the past two years. In addition to several complete updates each year, ASTRAL is now updated on a weekly basis with preliminary classifications of domains from newly released PDB structures. These classifications are available as a stand-alone database and are also integrated into other ASTRAL databases such as representative subsets. To enhance the utility of ASTRAL to structural biologists, all SCOP domains are now made available as PDB-style coordinate files as well as sequences. In addition to sequences and representative subsets based on SCOP domains, sequences and subsets based on PDB chains are newly included in ASTRAL. Several search tools have been added to ASTRAL to facilitate retrieval of data by individual users and automated methods.
Overcoming barriers to membrane protein structure determination.
Bill, Roslyn M; Henderson, Peter J F; Iwata, So; Kunji, Edmund R S; Michel, Hartmut; Neutze, Richard; Newstead, Simon; Poolman, Bert; Tate, Christopher G; Vogel, Horst
2011-04-01
After decades of slow progress, the pace of research on membrane protein structures is beginning to quicken thanks to various improvements in technology, including protein engineering and microfocus X-ray diffraction. Here we review these developments and, where possible, highlight generic new approaches to solving membrane protein structures based on recent technological advances. Rational approaches to overcoming the bottlenecks in the field are urgently required as membrane proteins, which typically comprise ~30% of the proteomes of organisms, are dramatically under-represented in the structural database of the Protein Data Bank.
sc-PDB: an annotated database of druggable binding sites from the Protein Data Bank.
Kellenberger, Esther; Muller, Pascal; Schalon, Claire; Bret, Guillaume; Foata, Nicolas; Rognan, Didier
2006-01-01
The sc-PDB is a collection of 6 415 three-dimensional structures of binding sites found in the Protein Data Bank (PDB). Binding sites were extracted from all high-resolution crystal structures in which a complex between a protein cavity and a small-molecular-weight ligand could be identified. Importantly, ligands are considered from a pharmacological and not a structural point of view. Therefore, solvents, detergents, and most metal ions are not stored in the sc-PDB. Ligands are classified into four main categories: nucleotides (< 4-mer), peptides (< 9-mer), cofactors, and organic compounds. The corresponding binding site is formed by all protein residues (including amino acids, cofactors, and important metal ions) with at least one atom within 6.5 angstroms of any ligand atom. The database was carefully annotated by browsing several protein databases (PDB, UniProt, and GO) and storing, for every sc-PDB entry, the following features: protein name, function, source, domain and mutations, ligand name, and structure. The repository of ligands has also been archived by diversity analysis of molecular scaffolds, and several chemoinformatics descriptors were computed to better understand the chemical space covered by stored ligands. The sc-PDB may be used for several purposes: (i) screening a collection of binding sites for predicting the most likely target(s) of any ligand, (ii) analyzing the molecular similarity between different cavities, and (iii) deriving rules that describe the relationship between ligand pharmacophoric points and active-site properties. The database is periodically updated and accessible on the web at http://bioinfo-pharma.u-strasbg.fr/scPDB/.
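The distance criterion used to define a binding site (any residue with at least one atom within 6.5 angstroms of any ligand atom) can be illustrated with a short sketch. The coordinates, the residue labels, and the function name `binding_site_residues` are hypothetical; this is not the sc-PDB extraction code, only a direct reading of the stated rule.

```python
import math

CUTOFF_ANGSTROM = 6.5  # distance threshold quoted in the abstract


def binding_site_residues(protein_atoms, ligand_atoms, cutoff=CUTOFF_ANGSTROM):
    """Return sorted residue labels having at least one atom within `cutoff` of any ligand atom.

    protein_atoms: iterable of (residue_label, (x, y, z)) tuples
    ligand_atoms: iterable of (x, y, z) tuples
    """
    ligand_coords = list(ligand_atoms)
    selected = set()
    for residue, coord in protein_atoms:
        if residue in selected:
            continue  # residue already qualifies through another atom
        if any(math.dist(coord, lig) <= cutoff for lig in ligand_coords):
            selected.add(residue)
    return sorted(selected)


# Toy coordinates in angstroms (hypothetical, for illustration only)
protein = [
    ("ALA12", (0.0, 0.0, 0.0)),
    ("ALA12", (1.5, 0.0, 0.0)),
    ("GLY45", (10.0, 0.0, 0.0)),
    ("SER78", (20.0, 0.0, 0.0)),
]
ligand = [(4.0, 0.0, 0.0), (5.0, 0.0, 0.0)]

site = binding_site_residues(protein, ligand)
print(site)
```

The brute-force pairwise comparison is adequate for a single small ligand; for whole-database processing a spatial index would be the more usual choice.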
Ndhlovu, Andrew; Durand, Pierre M; Hazelhurst, Scott
2015-01-01
The evolutionary rate at codon sites across protein-coding nucleotide sequences represents a valuable tier of information for aligning sequences, inferring homology and constructing phylogenetic profiles. However, a comprehensive resource for cataloguing the evolutionary rate at codon sites and their corresponding nucleotide and protein domain sequence alignments has not been developed. To address this gap in knowledge, EvoDB (an Evolutionary rates DataBase) was compiled. Nucleotide sequences and their corresponding protein domain data including the associated seed alignments from the PFAM-A (protein family) database were used to estimate evolutionary rate (ω = dN/dS) profiles at codon sites for each entry. EvoDB contains 98.83% of the gapped nucleotide sequence alignments and 97.1% of the evolutionary rate profiles for the corresponding information in PFAM-A. As the identification of codon sites under positive selection and their position in a sequence profile is usually the most sought after information for molecular evolutionary biologists, evolutionary rate profiles were determined under the M2a model using the CODEML algorithm in the PAML (Phylogenetic Analysis by Maximum Likelihood) suite of software. Validation of nucleotide sequences against amino acid data was implemented to ensure high data quality. EvoDB is a catalogue of the evolutionary rate profiles and provides the corresponding phylogenetic trees, PFAM-A alignments and annotated accession identifier data. In addition, the database can be explored and queried using known evolutionary rate profiles to identify domains under similar evolutionary constraints and pressures. EvoDB is a resource for evolutionary, phylogenetic studies and presents a tier of information untapped by current databases. © The Author(s) 2015. Published by Oxford University Press.
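The ratio ω = dN/dS is conventionally read as follows: values below 1 suggest purifying selection, values near 1 suggest neutral evolution, and values above 1 suggest positive selection. The sketch below applies that reading to per-site estimates. The input values and function names are hypothetical, and this simple ratio is not a substitute for the maximum-likelihood estimation under the M2a model that the abstract describes.

```python
def omega(dn, ds):
    """Return dN/dS, or None when dS is zero and the ratio is undefined."""
    if ds == 0:
        return None
    return dn / ds


def classify_site(w, tolerance=1e-9):
    """Label a codon site using the conventional interpretation of omega."""
    if w is None:
        return "undefined"
    if w > 1 + tolerance:
        return "positive selection"
    if w < 1 - tolerance:
        return "purifying selection"
    return "neutral"


# Hypothetical per-site (dN, dS) estimates, for illustration only
site_estimates = [(0.02, 0.40), (0.30, 0.30), (0.90, 0.30), (0.10, 0.0)]

labels = [classify_site(omega(dn, ds)) for dn, ds in site_estimates]
print(labels)
```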
NemaPath: online exploration of KEGG-based metabolic pathways for nematodes
Wylie, Todd; Martin, John; Abubucker, Sahar; Yin, Yong; Messina, David; Wang, Zhengyuan; McCarter, James P; Mitreva, Makedonka
2008-01-01
Background: Nematode.net is a web-accessible resource for investigating gene sequences from parasitic and free-living nematode genomes. Beyond the well-characterized model nematode C. elegans, over 500,000 expressed sequence tags (ESTs) and nearly 600,000 genome survey sequences (GSSs) have been generated from 36 nematode species as part of the Parasitic Nematode Genomics Program undertaken by the Genome Center at Washington University School of Medicine. However, these sequencing data are not present in most publicly available protein databases, which only include sequences in Swiss-Prot. Swiss-Prot, in turn, relies on GenBank/EMBL/DDBJ for predicted proteins from complete genomes or full-length proteins. Description: Here we present the NemaPath pathway server, a web-based pathway-level visualization tool for navigating putative metabolic pathways for over 30 nematode species, including 27 parasites. The NemaPath approach consists of two parts: 1) a backend tool to align and evaluate nematode genomic sequences (curated EST contigs) against the annotated Kyoto Encyclopedia of Genes and Genomes (KEGG) protein database; 2) a web viewing application that displays annotated KEGG pathway maps based on desired confidence levels of primary sequence similarity as defined by a user. NemaPath also provides cross-referenced access to nematode genome information provided by other tools available on Nematode.net, including: detailed NemaGene EST cluster information; putative translations; GBrowse EST cluster views; links from nematode data to external databases for corresponding synonymous C. elegans counterparts, subject matches in KEGG's gene database, and also KEGG Orthology (KO) identification. Conclusion: The NemaPath server hosts metabolic pathway mappings for 30 nematode species and is available on the World Wide Web at .
The nematode source sequences used for the metabolic pathway mappings are available via FTP , as provided by the Genome Center at Washington University School of Medicine. PMID:18983679
Domain fusion analysis by applying relational algebra to protein sequence and domain databases
Truong, Kevin; Ikura, Mitsuhiko
2003-01-01
Background: Domain fusion analysis is a useful method to predict functionally linked proteins that may be involved in direct protein-protein interactions or in the same metabolic or signaling pathway. As separate domain databases like BLOCKS, PROSITE, Pfam, SMART, PRINTS-S, ProDom, TIGRFAMs, and amalgamated domain databases like InterPro continue to grow in size and quality, a computational method to perform domain fusion analysis that leverages these efforts will become increasingly powerful. Results: This paper proposes a computational method employing relational algebra to find domain fusions in protein sequence databases. The feasibility of this method was illustrated on the SWISS-PROT+TrEMBL sequence database using domain predictions from the Pfam HMM (hidden Markov model) database. We identified 235 and 189 putative functionally linked protein partners in H. sapiens and S. cerevisiae, respectively. From scientific literature, we were able to confirm many of these functional linkages, while the remainder offer testable experimental hypotheses. Results can be viewed at . Conclusion: As the analysis can be computed quickly on any relational database that supports standard SQL (structured query language), it can be dynamically updated along with the sequence and domain databases, thereby improving the quality of predictions over time. PMID:12734020
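To make the idea of domain fusion analysis in standard SQL more concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table `protein_domain`, its columns, and the toy data are hypothetical; the query is one plausible way to express the search (two proteins carrying different domains that co-occur in a third, "fused" protein) and is not claimed to be the authors' relational-algebra formulation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: one row per (protein, predicted domain) assignment.
conn.execute("CREATE TABLE protein_domain (protein TEXT, domain TEXT)")
conn.executemany(
    "INSERT INTO protein_domain VALUES (?, ?)",
    [
        ("P1", "D1"),                # P1 carries only domain D1
        ("P2", "D2"),                # P2 carries only domain D2
        ("P3", "D1"), ("P3", "D2"),  # P3 is a fusion protein with both domains
        ("P4", "D3"),                # unrelated protein
    ],
)

# Self-joins: a and b are candidate partners, c1/c2 are the two domain rows
# of a single third protein that contains both partners' domains.
QUERY = """
SELECT DISTINCT a.protein, b.protein, c1.protein AS fused
FROM protein_domain a
JOIN protein_domain b
  ON a.protein < b.protein AND a.domain <> b.domain
JOIN protein_domain c1 ON c1.domain = a.domain
JOIN protein_domain c2 ON c2.domain = b.domain AND c2.protein = c1.protein
WHERE c1.protein NOT IN (a.protein, b.protein)
  AND NOT EXISTS (SELECT 1 FROM protein_domain x
                  WHERE x.protein = a.protein AND x.domain = b.domain)
  AND NOT EXISTS (SELECT 1 FROM protein_domain y
                  WHERE y.protein = b.protein AND y.domain = a.domain)
ORDER BY 1, 2, 3
"""

links = conn.execute(QUERY).fetchall()
print(links)
conn.close()
```

Because the whole analysis is a single declarative query, it can simply be re-run whenever the underlying domain assignments are refreshed, which is the property the abstract's conclusion emphasizes.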
Zhang, Yaoyang; Xu, Tao; Shan, Bing; Hart, Jonathan; Aslanian, Aaron; Han, Xuemei; Zong, Nobel; Li, Haomin; Choi, Howard; Wang, Dong; Acharya, Lipi; Du, Lisa; Vogt, Peter K; Ping, Peipei; Yates, John R
2015-11-03
Shotgun proteomics generates valuable information from large-scale and target protein characterizations, including protein expression, protein quantification, protein post-translational modifications (PTMs), protein localization, and protein-protein interactions. Typically, peptides derived from proteolytic digestion, rather than intact proteins, are analyzed by mass spectrometers because peptides are more readily separated, ionized and fragmented. The amino acid sequences of peptides can be interpreted by matching the observed tandem mass spectra to theoretical spectra derived from a protein sequence database. Identified peptides serve as surrogates for their proteins and are often used to establish what proteins were present in the original mixture and to quantify protein abundance. Two major issues exist for assigning peptides to their originating protein. The first issue is maintaining a desired false discovery rate (FDR) when comparing or combining multiple large datasets generated by shotgun analysis and the second issue is properly assigning peptides to proteins when homologous proteins are present in the database. Herein we demonstrate a new computational tool, ProteinInferencer, which can be used for protein inference with both small- or large-scale data sets to produce a well-controlled protein FDR. In addition, ProteinInferencer introduces confidence scoring for individual proteins, which makes protein identifications evaluable. This article is part of a Special Issue entitled: Computational Proteomics. Copyright © 2015. Published by Elsevier B.V.
TOPDOM: database of conservatively located domains and motifs in proteins.
Varga, Julia; Dobson, László; Tusnády, Gábor E
2016-09-01
The TOPDOM database, originally created as a collection of domains and motifs located consistently on the same side of the membranes in α-helical transmembrane proteins, has been updated and extended by taking into consideration consistently localized domains and motifs in globular proteins, too. By taking advantage of the recently developed CCTOP algorithm to determine the type of a protein and predict topology in case of transmembrane proteins, and by applying a thorough search for domains and motifs as well as utilizing the most up-to-date version of all source databases, we managed to reach a 6-fold increase in the size of the whole database and a 2-fold increase in the number of transmembrane proteins. The TOPDOM database is available at http://topdom.enzim.hu. The webpage utilizes the common Apache, PHP5 and MySQL software to provide the user interface for accessing and searching the database. The database itself is generated on a high performance computer. Contact: tusnady.gabor@ttk.mta.hu. Supplementary data are available at Bioinformatics online. © The Author 2016. Published by Oxford University Press.
Specialized microbial databases for inductive exploration of microbial genome sequences
Fang, Gang; Ho, Christine; Qiu, Yaowu; Cubas, Virginie; Yu, Zhou; Cabau, Cédric; Cheung, Frankie; Moszer, Ivan; Danchin, Antoine
2005-01-01
Background: The enormous amount of genome sequence data calls for user-oriented databases to manage sequences and annotations. Queries must include search tools permitting function identification through exploration of related objects. Methods: The GenoList package for collecting and mining microbial genome databases has been rewritten using MySQL as the database management system. Functions that were not available in MySQL, such as nested subqueries, have been implemented. Results: Inductive reasoning in the study of genomes starts from "islands of knowledge", centered around genes with some known background. With this concept of "neighborhood" in mind, a modified version of the GenoList structure has been used for organizing sequence data from prokaryotic genomes of particular interest in China. GenoChore, a set of 17 specialized end-user-oriented microbial databases (including one instance of Microsporidia, Encephalitozoon cuniculi, a member of Eukarya) has been made publicly available. These databases allow the user to browse genome sequence and annotation data using standard queries. In addition they provide a weekly update of searches against the worldwide protein sequence data libraries, allowing one to monitor annotation updates on genes of interest. Finally, they allow users to search for patterns in DNA or protein sequences, taking into account a clustering of genes into formal operons, as well as providing extra facilities to query sequences using predefined sequence patterns. Conclusion: This growing set of specialized microbial databases organizes data created by the first Chinese bacterial genome programs (ThermaList, Thermoanaerobacter tengcongensis, LeptoList, with two different genomes of Leptospira interrogans and SepiList, Staphylococcus epidermidis) associated with related organisms for comparison. PMID:15698474
Hermjakob, Henning; Montecchi-Palazzi, Luisa; Bader, Gary; Wojcik, Jérôme; Salwinski, Lukasz; Ceol, Arnaud; Moore, Susan; Orchard, Sandra; Sarkans, Ugis; von Mering, Christian; Roechert, Bernd; Poux, Sylvain; Jung, Eva; Mersch, Henning; Kersey, Paul; Lappe, Michael; Li, Yixue; Zeng, Rong; Rana, Debashis; Nikolski, Macha; Husi, Holger; Brun, Christine; Shanker, K; Grant, Seth G N; Sander, Chris; Bork, Peer; Zhu, Weimin; Pandey, Akhilesh; Brazma, Alvis; Jacq, Bernard; Vidal, Marc; Sherman, David; Legrain, Pierre; Cesareni, Gianni; Xenarios, Ioannis; Eisenberg, David; Steipe, Boris; Hogue, Chris; Apweiler, Rolf
2004-02-01
A major goal of proteomics is the complete description of the protein interaction network underlying cell physiology. A large number of small scale and, more recently, large-scale experiments have contributed to expanding our understanding of the nature of the interaction network. However, the necessary data integration across experiments is currently hampered by the fragmentation of publicly available protein interaction data, which exists in different formats in databases, on authors' websites or sometimes only in print publications. Here, we propose a community standard data model for the representation and exchange of protein interaction data. This data model has been jointly developed by members of the Proteomics Standards Initiative (PSI), a work group of the Human Proteome Organization (HUPO), and is supported by major protein interaction data providers, in particular the Biomolecular Interaction Network Database (BIND), Cellzome (Heidelberg, Germany), the Database of Interacting Proteins (DIP), Dana Farber Cancer Institute (Boston, MA, USA), the Human Protein Reference Database (HPRD), Hybrigenics (Paris, France), the European Bioinformatics Institute's (EMBL-EBI, Hinxton, UK) IntAct, the Molecular Interactions (MINT, Rome, Italy) database, the Protein-Protein Interaction Database (PPID, Edinburgh, UK) and the Search Tool for the Retrieval of Interacting Genes/Proteins (STRING, EMBL, Heidelberg, Germany).
Fourment, Mathieu; Gibbs, Mark J
2008-02-05
Viruses of the Bunyaviridae have segmented negative-stranded RNA genomes and several of them cause significant disease. Many partial sequences have been obtained from the segments so that GenBank searches give complex results. Sequence databases usually use HTML pages to mediate remote sorting, but this approach can be limiting and may discourage a user from exploring a database. The VirusBanker database contains Bunyaviridae sequences and alignments and is presented as two spreadsheets generated by a Java program that interacts with a MySQL database on a server. Sequences are displayed in rows and may be sorted using information that is displayed in columns and includes data relating to the segment, gene, protein, species, strain, sequence length, terminal sequence and date and country of isolation. Bunyaviridae sequences and alignments may be downloaded from the second spreadsheet with titles defined by the user from the columns, or viewed when passed directly to the sequence editor, Jalview. VirusBanker allows large datasets of aligned nucleotide and protein sequences from the Bunyaviridae to be compiled and winnowed rapidly using criteria that are formulated heuristically.
Benson, Dennis A; Karsch-Mizrachi, Ilene; Lipman, David J; Ostell, James; Sayers, Eric W
2010-01-01
GenBank is a comprehensive database that contains publicly available nucleotide sequences for more than 300,000 organisms named at the genus level or lower, obtained primarily through submissions from individual laboratories and batch submissions from large-scale sequencing projects, including whole genome shotgun (WGS) and environmental sampling projects. Most submissions are made using the web-based BankIt or standalone Sequin programs, and accession numbers are assigned by GenBank staff upon receipt. Daily data exchange with the European Molecular Biology Laboratory Nucleotide Sequence Database in Europe and the DNA Data Bank of Japan ensures worldwide coverage. GenBank is accessible through the NCBI Entrez retrieval system, which integrates data from the major DNA and protein sequence databases along with taxonomy, genome, mapping, protein structure and domain information, and the biomedical journal literature via PubMed. BLAST provides sequence similarity searches of GenBank and other sequence databases. Complete bi-monthly releases and daily updates of the GenBank database are available by FTP. To access GenBank and its related retrieval and analysis services, begin at the NCBI homepage: www.ncbi.nlm.nih.gov.
Thakar, Sambhaji B; Ghorpade, Pradnya N; Kale, Manisha V; Sonawane, Kailas D
2015-01-01
Fern plants are known for their ethnomedicinal applications. A large amount of information on medicinal ferns is scattered across the literature in text form, so a database is an appropriate way to organize it. Given the importance of medicinally useful ferns, we developed a web-based database that contains information about several groups of ferns, their medicinal uses, chemical constituents, and protein/enzyme sequences isolated from different fern plants. The Fern ethnomedicinal plant database is a comprehensive, web-based content management system used to retrieve factual knowledge related to ethnomedicinal fern species. Most of the protein/enzyme sequences were extracted from the NCBI Protein sequence database. For each fern species, the family name, identification, NCBI taxonomy ID, geographical occurrence, trial for, plant parts used, ethnomedicinal importance, and morphological characteristics were collected from scientific literature and journals available in text form. Links to NCBI's BLAST, InterPro, phylogeny, and Clustal W web resources are also provided for future comparative studies, so users can get information related to fern plants and their medicinal applications in one place. The database includes information on 100 medicinal fern species. It should be useful for computational drug discovery and to botanists and others with botanical interests, pharmacologists, researchers, biochemists, plant biotechnologists, ayurvedic practitioners, doctors and pharmacists, users of traditional medicine, farmers, agricultural students and teachers from universities and colleges, and fern plant lovers.
This effort should provide users with essential knowledge about the adventitious applications for drug discovery and the conservation of fern species around the world, and should help create social awareness.
Database resources of the National Center for Biotechnology Information
Sayers, Eric W.; Barrett, Tanya; Benson, Dennis A.; Bolton, Evan; Bryant, Stephen H.; Canese, Kathi; Chetvernin, Vyacheslav; Church, Deanna M.; DiCuccio, Michael; Federhen, Scott; Feolo, Michael; Fingerman, Ian M.; Geer, Lewis Y.; Helmberg, Wolfgang; Kapustin, Yuri; Krasnov, Sergey; Landsman, David; Lipman, David J.; Lu, Zhiyong; Madden, Thomas L.; Madej, Tom; Maglott, Donna R.; Marchler-Bauer, Aron; Miller, Vadim; Karsch-Mizrachi, Ilene; Ostell, James; Panchenko, Anna; Phan, Lon; Pruitt, Kim D.; Schuler, Gregory D.; Sequeira, Edwin; Sherry, Stephen T.; Shumway, Martin; Sirotkin, Karl; Slotta, Douglas; Souvorov, Alexandre; Starchenko, Grigory; Tatusova, Tatiana A.; Wagner, Lukas; Wang, Yanli; Wilbur, W. John; Yaschenko, Eugene; Ye, Jian
2012-01-01
In addition to maintaining the GenBank® nucleic acid sequence database, the National Center for Biotechnology Information (NCBI) provides analysis and retrieval resources for the data in GenBank and other biological data made available through the NCBI Website. NCBI resources include Entrez, the Entrez Programming Utilities, MyNCBI, PubMed, PubMed Central (PMC), Gene, the NCBI Taxonomy Browser, BLAST, BLAST Link (BLink), Primer-BLAST, COBALT, Splign, RefSeq, UniGene, HomoloGene, ProtEST, dbMHC, dbSNP, dbVar, Epigenomics, Genome and related tools, the Map Viewer, Model Maker, Evidence Viewer, Trace Archive, Sequence Read Archive, BioProject, BioSample, Retroviral Genotyping Tools, HIV-1/Human Protein Interaction Database, Gene Expression Omnibus (GEO), Probe, Online Mendelian Inheritance in Animals (OMIA), the Molecular Modeling Database (MMDB), the Conserved Domain Database (CDD), the Conserved Domain Architecture Retrieval Tool (CDART), Biosystems, Protein Clusters and the PubChem suite of small molecule databases. Augmenting many of the Web applications are custom implementations of the BLAST program optimized to search specialized data sets. All of these resources can be accessed through the NCBI home page at www.ncbi.nlm.nih.gov. PMID:22140104
Database resources of the National Center for Biotechnology Information
2013-01-01
In addition to maintaining the GenBank® nucleic acid sequence database, the National Center for Biotechnology Information (NCBI, http://www.ncbi.nlm.nih.gov) provides analysis and retrieval resources for the data in GenBank and other biological data made available through the NCBI web site. NCBI resources include Entrez, the Entrez Programming Utilities, MyNCBI, PubMed, PubMed Central, Gene, the NCBI Taxonomy Browser, BLAST, BLAST Link (BLink), Primer-BLAST, COBALT, Splign, RefSeq, UniGene, HomoloGene, ProtEST, dbMHC, dbSNP, dbVar, Epigenomics, the Genetic Testing Registry, Genome and related tools, the Map Viewer, Model Maker, Evidence Viewer, Trace Archive, Sequence Read Archive, BioProject, BioSample, Retroviral Genotyping Tools, HIV-1/Human Protein Interaction Database, Gene Expression Omnibus, Probe, Online Mendelian Inheritance in Animals, the Molecular Modeling Database, the Conserved Domain Database, the Conserved Domain Architecture Retrieval Tool, Biosystems, Protein Clusters and the PubChem suite of small molecule databases. Augmenting many of the web applications are custom implementations of the BLAST program optimized to search specialized data sets. All of these resources can be accessed through the NCBI home page. PMID:23193264
GMDD: a database of GMO detection methods.
Dong, Wei; Yang, Litao; Shen, Kailin; Kim, Banghyun; Kleter, Gijs A; Marvin, Hans J P; Guo, Rong; Liang, Wanqi; Zhang, Dabing
2008-06-04
Since more than one hundred genetically modified organism (GMO) events have been developed and approved for commercialization worldwide, GMO analysis methods are essential for the enforcement of GMO labelling regulations. Protein- and nucleic acid-based detection techniques have been developed and used for GMO identification and quantification. However, information for the harmonization and standardization of GMO analysis methods at the global level is needed. The GMO Detection method Database (GMDD) has collected almost all previously developed and reported GMO detection methods, grouped by strategy (screen-, gene-, construct-, and event-specific), and provides a user-friendly search service for detection methods by GMO event name, exogenous gene, or protein information, etc. In this database, users can obtain the sequences of exogenous integration, which will facilitate the design of PCR primers and probes. Information on endogenous genes, certified reference materials, reference molecules, and the validation status of developed methods is also included. Furthermore, registered users can submit new detection methods and sequences to this database, and the newly submitted information will be released soon after being checked. GMDD contains comprehensive information on GMO detection methods. The database will make GMO analysis much easier.
Montague, Elizabeth; Stanberry, Larissa; Higdon, Roger; Janko, Imre; Lee, Elaine; Anderson, Nathaniel; Choiniere, John; Stewart, Elizabeth; Yandl, Gregory; Broomall, William; Kolker, Natali
2014-01-01
Multi-omics data-driven scientific discovery crucially rests on high-throughput technologies and data sharing. Currently, data are scattered across single omics repositories, stored in varying raw and processed formats, and are often accompanied by limited or no metadata. The Multi-Omics Profiling Expression Database (MOPED, http://moped.proteinspire.org) version 2.5 is a freely accessible multi-omics expression database. Continual improvement and expansion of MOPED is driven by feedback from the Life Sciences Community. In order to meet the emergent need for an integrated multi-omics data resource, MOPED 2.5 now includes gene relative expression data in addition to protein absolute and relative expression data from over 250 large-scale experiments. To facilitate accurate integration of experiments and increase reproducibility, MOPED provides extensive metadata through the Data-Enabled Life Sciences Alliance (DELSA Global, http://delsaglobal.org) metadata checklist. MOPED 2.5 has greatly increased the number of proteomics absolute and relative expression records to over 500,000, in addition to adding more than four million transcriptomics relative expression records. MOPED has an intuitive user interface with tabs for querying different types of omics expression data and new tools for data visualization. Summary information including expression data, pathway mappings, and direct connection between proteins and genes can be viewed on Protein and Gene Details pages. These connections in MOPED provide a context for multi-omics expression data exploration. Researchers are encouraged to submit omics data which will be consistently processed into expression summaries. MOPED as a multi-omics data resource is a pivotal public database, interdisciplinary knowledge resource, and platform for multi-omics understanding. PMID:24910945
A Circular Dichroism Reference Database for Membrane Proteins
DOE Office of Scientific and Technical Information (OSTI.GOV)
Wallace, B.; Wien, F.; Stone, T.
2006-01-01
Membrane proteins are a major product of most genomes and the target of a large number of current pharmaceuticals, yet little information exists on their structures because of the difficulty of crystallising them; hence for the most part they have been excluded from structural genomics programme targets. Furthermore, even methods such as circular dichroism (CD) spectroscopy which seek to define secondary structure have not been fully exploited because of technical limitations to their interpretation for membrane embedded proteins. Empirical analyses of circular dichroism (CD) spectra are valuable for providing information on secondary structures of proteins. However, the accuracy of the results depends on the appropriateness of the reference databases used in the analyses. Membrane proteins have different spectral characteristics than do soluble proteins as a result of the low dielectric constants of membrane bilayers relative to those of aqueous solutions (Chen & Wallace (1997) Biophys. Chem. 65:65-74). To date, no CD reference database exists exclusively for the analysis of membrane proteins, and hence empirical analyses based on current reference databases derived from soluble proteins are not adequate for accurate analyses of membrane protein secondary structures (Wallace et al (2003) Prot. Sci. 12:875-884). We have therefore created a new reference database of CD spectra of integral membrane proteins whose crystal structures have been determined. To date it contains more than 20 proteins, and spans the range of secondary structures from mostly helical to mostly sheet proteins. This reference database should enable more accurate secondary structure determinations of membrane embedded proteins and will become one of the reference database options in the CD calculation server DICHROWEB (Whitmore & Wallace (2004) NAR 32:W668-673).
Padliya, Neerav D; Garrett, Wesley M; Campbell, Kimberly B; Tabb, David L; Cooper, Bret
2007-11-01
LC-MS/MS has demonstrated potential for detecting plant pathogens. Unlike PCR or ELISA, LC-MS/MS does not require pathogen-specific reagents for the detection of pathogen-specific proteins and peptides. However, the MS/MS approach we and others have explored does require a protein sequence reference database and database-search software to interpret tandem mass spectra. To evaluate the limitations of database composition on pathogen identification, we analyzed proteins from cultured Ustilago maydis, Phytophthora sojae, Fusarium graminearum, and Rhizoctonia solani by LC-MS/MS. When the search database did not contain sequences for a target pathogen, or contained sequences to related pathogens, target pathogen spectra were reliably matched to protein sequences from nontarget organisms, giving an illusion that proteins from nontarget organisms were identified. Our analysis demonstrates that when database-search software is used as part of the identification process, a paradox exists whereby additional sequences needed to detect a wide variety of possible organisms may lead to more cross-species protein matches and misidentification of pathogens.
Krassowski, Michal; Paczkowska, Marta; Cullion, Kim; Huang, Tina; Dzneladze, Irakli; Ouellette, B F Francis; Yamada, Joseph T; Fradet-Turcotte, Amelie
2018-01-01
Interpretation of genetic variation is needed for deciphering genotype-phenotype associations, mechanisms of inherited disease, and cancer driver mutations. Millions of single nucleotide variants (SNVs) in human genomes are known and thousands are associated with disease. An estimated 21% of disease-associated amino acid substitutions corresponding to missense SNVs are located in protein sites of post-translational modifications (PTMs), chemical modifications of amino acids that extend protein function. ActiveDriverDB is a comprehensive human proteo-genomics database that annotates disease mutations and population variants through the lens of PTMs. We integrated >385,000 published PTM sites with ∼3.6 million substitutions from The Cancer Genome Atlas (TCGA), the ClinVar database of disease genes, and human genome sequencing projects. The database includes site-specific interaction networks of proteins, upstream enzymes such as kinases, and drugs targeting these enzymes. We also predicted network-rewiring impact of mutations by analyzing gains and losses of kinase-bound sequence motifs. ActiveDriverDB provides detailed visualization, filtering, browsing and searching options for studying PTM-associated mutations. Users can upload mutation datasets interactively and use our application programming interface in pipelines. Integrative analysis of mutations and PTMs may help decipher molecular mechanisms of phenotypes and disease, as exemplified by case studies of TP53, BRCA2 and VHL. The open-source database is available at https://www.ActiveDriverDB.org. PMID:29126202
Ojima-Kato, Teruyo; Yamamoto, Naomi; Nagai, Satomi; Shima, Keisuke; Akiyama, Yumi; Ota, Junji; Tamura, Hiroto
2017-12-01
Matrix-assisted laser desorption/ionization time-of-flight mass spectrometry (MALDI-TOF MS)-based microbial identification is a popular analytical method. Strain Solution proteotyping software available for MALDI-TOF MS has great potential for the precise and detailed discrimination of microorganisms at serotype- or strain-level, beyond the conventional mass fingerprinting approaches. Here, we constructed a theoretically calculated mass database of Salmonella enterica subspecies enterica consisting of 12 biomarker proteins: ribosomal proteins S8, L15, L17, L21, L25, and S7, Mn-cofactor-containing superoxide dismutase (SodA), peptidyl-prolyl cis-trans isomerase C (PPIase C), and protein Gns, and uncharacterized proteins YibT, YaiA, and YciF, that can allow serotyping of Salmonella. Strain Solution ver. 2 software with the novel database constructed in this study demonstrated that 109 strains (94%), including the major outbreak-associated serotypes, Enteritidis, Typhimurium, and Infantis, could be correctly identified from others by colony-directed MALDI-TOF MS using 116 strains belonging to 23 kinds of typed and untyped serotypes of S. enterica from culture collections, patients, and foods. We conclude that Strain Solution ver. 2 software integrated with the accurate mass database will be useful for the bacterial proteotyping by MALDI-TOF MS-based microbial classification in the clinical and food safety fields.
The Biomolecular Crystallization Database Version 4: expanded content and new features.
Tung, Michael; Gallagher, D Travis
2009-01-01
The Biological Macromolecular Crystallization Database (BMCD) has been a publicly available resource since 1988, providing a curated archive of information on crystal growth for proteins and other biological macromolecules. The BMCD content has recently been expanded to include 14 372 crystal entries. The resource continues to be freely available at http://xpdb.nist.gov:8060/BMCD4. In addition, the software has been adapted to support the Java-based Lucene query language, enabling detailed searching over specific parameters, and explicit search of parameter ranges is offered for five numeric variables. Extensive tools have been developed for import and handling of data from the RCSB Protein Data Bank. The updated BMCD is called version 4.02 or BMCD4. BMCD4 entries have been expanded to include macromolecule sequence, enabling more elaborate analysis of relations among protein properties, crystal-growth conditions and the geometric and diffraction properties of the crystals. The BMCD version 4.02 contains greatly expanded content and enhanced search capabilities to facilitate scientific analysis and design of crystal-growth strategies.
Li, Guo-Zhong; Vissers, Johannes P C; Silva, Jeffrey C; Golick, Dan; Gorenstein, Marc V; Geromanos, Scott J
2009-03-01
A novel database search algorithm is presented for the qualitative identification of proteins over a wide dynamic range, both in simple and complex biological samples. The algorithm has been designed for the analysis of data originating from data independent acquisitions, whereby multiple precursor ions are fragmented simultaneously. Measurements used by the algorithm include retention time, ion intensities, charge state, and accurate masses on both precursor and product ions from LC-MS data. The search algorithm uses an iterative process whereby each iteration incrementally increases the selectivity, specificity, and sensitivity of the overall strategy. Increased specificity is obtained by utilizing a subset database search approach, whereby for each subsequent stage of the search, only those peptides from securely identified proteins are queried. Tentative peptide and protein identifications are ranked and scored by their relative correlation to a number of models of known and empirically derived physicochemical attributes of proteins and peptides. In addition, the algorithm utilizes decoy database techniques for automatically determining the false positive identification rates. The search algorithm has been tested by comparing the search results from a four-protein mixture, the same four-protein mixture spiked into a complex biological background, and a variety of other "system" type protein digest mixtures. The method was validated independently by data dependent methods, while concurrently relying on replication and selectivity. Comparisons were also performed with other commercially and publicly available peptide fragmentation search algorithms. The presented results demonstrate the ability to correctly identify peptides and proteins from data independent acquisition strategies with high sensitivity and specificity. 
They also illustrate a more comprehensive analysis of the samples studied, providing approximately 20% more protein identifications than a more conventional data directed approach using the same identification criteria, with a concurrent increase in both sequence coverage and the number of modified peptides.
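The decoy-database step mentioned above can be illustrated with a minimal sketch. It assumes a common target-decoy convention, in which the false positive rate at a score threshold is estimated as the number of decoy hits divided by the number of target hits at or above that threshold. The abstract does not state which estimator the authors use, and the function name and example scores below are hypothetical.

```python
def estimate_fdr(hits, threshold):
    """Estimate the false positive rate among hits scoring >= threshold.

    hits: list of (score, is_decoy) pairs.
    Assumption: rate ~ decoy hits / target hits, one common
    target-decoy estimator (not necessarily the authors' own).
    """
    targets = sum(1 for score, is_decoy in hits
                  if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in hits
                 if score >= threshold and is_decoy)
    if targets == 0:
        return 0.0  # nothing accepted, so nothing to estimate
    return decoys / targets


# Hypothetical scored identifications: (score, matched a decoy sequence?)
example_hits = [(9.1, False), (8.7, False), (8.2, True), (7.9, False),
                (7.5, False), (6.8, True), (6.1, False), (5.0, True)]
```

Raising the threshold trades identifications for a lower estimated error rate, which is the trade-off an iterative search of this kind has to manage at each stage.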
PROFESS: a PROtein Function, Evolution, Structure and Sequence database
Triplet, Thomas; Shortridge, Matthew D.; Griep, Mark A.; Stark, Jaime L.; Powers, Robert; Revesz, Peter
2010-01-01
The proliferation of biological databases and the easy access enabled by the Internet are having a beneficial impact on biological sciences and transforming the way research is conducted. There are ∼1100 molecular biology databases dispersed throughout the Internet. To assist in the functional, structural and evolutionary analysis of the large number of novel proteins continually identified from whole-genome sequencing, we introduce the PROFESS (PROtein Function, Evolution, Structure and Sequence) database. Our database is designed to be versatile and expandable and will not confine analysis to a pre-existing set of data relationships. A fundamental component of this approach is the development of an intuitive query system that incorporates a variety of similarity functions capable of generating data relationships not conceived during the creation of the database. The utility of PROFESS is demonstrated by the analysis of the structural drift of homologous proteins and the identification of potential pancreatic cancer therapeutic targets based on the observation of protein–protein interaction networks. Database URL: http://cse.unl.edu/∼profess/ PMID:20624718
Chen, Long; Jiang, Yifeng; Du, Zhen
2018-04-01
Although previous studies have demonstrated that dental pulp stem cells (DPSCs) from mature and immature teeth exhibit potential for multi-directional differentiation, the molecular and biological difference between the DPSCs from mature and immature permanent teeth has not been fully investigated. In the present study, 500 differentially expressed genes from dental pulp cells (DPCs) in mature and immature permanent teeth were obtained from the Gene Expression Omnibus online database. Based on bioinformatics analysis using the Database for Annotation, Visualization and Integrated Discovery, these genes were divided into a number of subgroups associated with immunity, inflammation and cell signaling. The results of the present study suggest that immune features, response to infection and cell signaling may be different in DPCs from mature and immature permanent teeth; furthermore, DPCs from immature permanent teeth may be more suitable for use in tissue engineering or stem cell therapy. The Online Mendelian Inheritance in Man database stated that Sonic Hedgehog (SHH), a differentially expressed gene in DPCs from mature and immature permanent teeth, serves a crucial role in the development of craniofacial tissues, including teeth, which further confirmed that SHH may cause DPCs from mature and immature permanent teeth to exhibit different biological characteristics. The Search Tool for the Retrieval of Interacting Genes/Proteins database revealed that SHH has functional protein associations with a number of other proteins, including Glioma-associated oncogene (GLI)1, GLI2, growth arrest-specific protein 1, bone morphogenetic protein (BMP)2 and BMP4, in mice and humans. It was also demonstrated that SHH may interact with other genes to regulate the biological characteristics of DPCs. 
The results of the present study may provide a useful reference basis for selecting suitable DPSCs and molecules for the treatment of these cells to optimize features for tissue engineering or stem cell therapy. Quantitative polymerase chain reaction should be performed to confirm the differential expression of these genes prior to the beginning of a functional study.
PACSY, a relational database management system for protein structure and chemical shift analysis.
Lee, Woonghee; Yu, Wookyung; Kim, Suhkmann; Chang, Iksoo; Lee, Weontae; Markley, John L
2012-10-01
PACSY (Protein structure And Chemical Shift NMR spectroscopY) is a relational database management system that integrates information from the Protein Data Bank, the Biological Magnetic Resonance Data Bank, and the Structural Classification of Proteins database. PACSY provides three-dimensional coordinates and chemical shifts of atoms along with derived information such as torsion angles, solvent accessible surface areas, and hydrophobicity scales. PACSY consists of six relational table types linked to one another for coherence by key identification numbers. Database queries are enabled by advanced search functions supported by an RDBMS server such as MySQL or PostgreSQL. PACSY enables users to search for combinations of information from different database sources in support of their research. Two software packages, PACSY Maker for database creation and PACSY Analyzer for database analysis, are available from http://pacsy.nmrfam.wisc.edu.
RICD: a rice indica cDNA database resource for rice functional genomics.
Lu, Tingting; Huang, Xuehui; Zhu, Chuanrang; Huang, Tao; Zhao, Qiang; Xie, Kabing; Xiong, Lizhong; Zhang, Qifa; Han, Bin
2008-11-26
The Oryza sativa L. indica subspecies is the most widely cultivated rice. During the last few years, we have collected over 20,000 putative full-length cDNAs and over 40,000 ESTs isolated from various cDNA libraries of two indica varieties Guangluai 4 and Minghui 63. A database of the rice indica cDNAs was therefore built to provide a comprehensive web data source for searching and retrieving the indica cDNA clones. Rice Indica cDNA Database (RICD) is an online MySQL-PHP driven database with a user-friendly web interface. It allows investigators to query the cDNA clones by keyword, genome position, nucleotide or protein sequence, and putative function. It also provides a series of information, including sequences, protein domain annotations, similarity search results, SNPs and InDels information, and hyperlinks to gene annotation in both The Rice Annotation Project Database (RAP-DB) and The TIGR Rice Genome Annotation Resource, expression atlas in RiceGE and variation report in Gramene of each cDNA. The online rice indica cDNA database provides cDNA resource with comprehensive information to researchers for functional analysis of indica subspecies and for comparative genomics. The RICD database is available through our website http://www.ncgr.ac.cn/ricd.
Yates, John R
2015-11-01
Advances in computer technology and software have driven developments in mass spectrometry over the last 50 years. Computers and software have been impactful in three areas: the automation of difficult calculations to aid interpretation, the collection of data and control of instruments, and data interpretation. As the power of computers has grown, so too has the utility and impact on mass spectrometers and their capabilities. This has been particularly evident in the use of tandem mass spectrometry data to search protein and nucleotide sequence databases to identify peptide and protein sequences. This capability has driven the development of many new approaches to study biological systems, including the use of "bottom-up shotgun proteomics" to directly analyze protein mixtures.
Friso, Giulia; Giacomelli, Lisa; Ytterberg, A Jimmy; Peltier, Jean-Benoit; Rudella, Andrea; Sun, Qi; Wijk, Klaas J van
2004-02-01
An extensive analysis of the Arabidopsis thaliana peripheral and integral thylakoid membrane proteome was performed by sequential extractions with salt, detergent, and organic solvents, followed by multidimensional protein separation steps (reverse-phase HPLC and one- and two-dimensional electrophoresis gels), different enzymatic and nonenzymatic protein cleavage techniques, mass spectrometry, and bioinformatics. Altogether, 154 proteins were identified, of which 76 (49%) were alpha-helical integral membrane proteins. Twenty-seven new proteins without known function but with predicted chloroplast transit peptides were identified, of which 17 (63%) are integral membrane proteins. These new proteins, likely important in thylakoid biogenesis, include two rubredoxins, a potential metallochaperone, and a new DnaJ-like protein. The data were integrated with our analysis of the lumenal-enriched proteome. We identified 83 out of 100 known proteins of the thylakoid localized photosynthetic apparatus, including several new paralogues and some 20 proteins involved in protein insertion, assembly, folding, or proteolysis. An additional 16 proteins are involved in translation, demonstrating that the thylakoid membrane surface is an important site for protein synthesis. The high coverage of the photosynthetic apparatus and the identification of known hydrophobic proteins with low expression levels, such as cpSecE, Ohp1, and Ohp2, indicate an excellent dynamic resolution of the analysis. The sequential extraction process proved very helpful to validate transmembrane prediction. Our data also were cross-correlated to chloroplast subproteome analyses by other laboratories. All data are deposited in a new curated plastid proteome database (PPDB) with multiple search functions (http://cbsusrv01.tc.cornell.edu/users/ppdb/). This PPDB will serve as an expandable resource for the plant community.
May, Brian H; Deng, Shiqiang; Zhang, Anthony L; Lu, Chuanjian; Xue, Charlie C L
2015-09-01
Reviews and meta-analyses of clinical trials identified plants used as traditional medicines (TMs) that show promise for psoriasis. These include Rehmannia glutinosa, Camptotheca acuminata, Indigo naturalis and Salvia miltiorrhiza. Compounds contained in these TMs have shown activities of relevance to psoriasis in experimental models. To further investigate the likely mechanisms of action of the multiple compounds in these TMs, we undertook a computer-based in silico investigation of the proteins known to be regulated by these compounds and their associated biological pathways. The proteins reportedly regulated by compounds in these four TMs were identified using the HIT (Herbal Ingredients' Targets) database. The resultant data were entered into the PANTHER (Protein ANnotation THrough Evolutionary Relationship) database to identify the pathways in which the proteins could be involved. The study identified 237 compounds in the TMs and these retrieved 287 proteins from HIT. These proteins identified 59 pathways in PANTHER with most proteins being located in the Apoptosis, Angiogenesis, Inflammation mediated by chemokine and cytokine, Gonadotropin releasing hormone receptor, and/or Interleukin signaling pathways. All four TMs contained compounds that had regulating effects on Apoptosis regulator BAX, Apoptosis regulator Bcl-2, Caspase-3, Tumor necrosis factor (TNF) or Prostaglandin G/H synthase 2 (COX2). The main proteins and pathways are primarily related to inflammation, proliferation and angiogenesis which are all processes involved in psoriasis. Experimental studies have reported that certain compounds from these TMs can regulate the expression of proteins involved in each of these pathways.
Ghorab, Hamida; Lammi, Carmen; Arnoldi, Anna; Kabouche, Zahia; Aiello, Gilda
2018-01-15
An investigation of the proteome of the sweet kernel of apricot, based on equalisation with combinatorial peptide ligand libraries (CPLLs), SDS-PAGE, nLC-ESI-MS/MS, and database search, identified 175 proteins. Gene ontology analysis indicated that their main molecular functions are nucleotide binding (20.9%), hydrolase activities (10.6%), kinase activities (7%), and catalytic activity (5.6%). A protein-protein association network analysis using STRING software made it possible to build an interactomic map of all detected proteins, characterised by 34 interactions. To forecast the potential health benefits deriving from the consumption of these proteins, the two most abundant, i.e. Prunin 1 and 2, were enzymatically digested in silico, yielding 10 and 14 predicted peptides, respectively. Searching their sequences in the BIOPEP database suggested a variety of bioactivities, including dipeptidyl peptidase-IV (DPP-IV) and angiotensin converting enzyme I (ACE) inhibition, glucose uptake stimulation and antioxidant properties. Copyright © 2017 Elsevier Ltd. All rights reserved.
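As an illustration of what in silico enzymatic digestion involves, here is a minimal sketch. The abstract does not name the enzyme or tool used, so the sketch assumes the widely used trypsin rule (cleave after K or R unless the next residue is P). The function name and the example sequence are hypothetical and are not taken from Prunin 1 or 2.

```python
def digest_trypsin(sequence):
    """Cleave a protein sequence after K or R, except before P.

    Assumption: the simple trypsin rule with no missed cleavages;
    the study's actual enzyme and settings are not given.
    """
    peptides = []
    start = 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        # Keep the C-terminal remainder that has no cleavage site.
        peptides.append(sequence[start:])
    return peptides


# Hypothetical sequence, for illustration only.
example_peptides = digest_trypsin("MKWVTFRPLLKAG")
```

The resulting peptide strings are what would then be searched against a bioactive-peptide database such as BIOPEP.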
Establishment of an international database for genetic variants in esophageal cancer.
Vihinen, Mauno
2016-10-01
The establishment of a database has been suggested in order to collect, organize, and distribute genetic information about esophageal cancer. The World Organization for Specialized Studies on Diseases of the Esophagus and the Human Variome Project will be in charge of a central database of information about esophageal cancer-related variations from publications, databases, and laboratories; in addition to genetic details, clinical parameters will also be included. The aim will be to get all the central players in research, clinical, and commercial laboratories to contribute. The database will follow established recommendations and guidelines. The database will require a team of dedicated curators with different backgrounds. Numerous layers of systematics will be applied to facilitate computational analyses. The data items will be extensively integrated with other information sources. The database will be distributed as open access to ensure exchange of the data with other databases. Variations will be reported in relation to reference sequences on three levels--DNA, RNA, and protein--whenever applicable. In the first phase, the database will concentrate on genetic variations including both somatic and germline variations for susceptibility genes. Additional types of information can be integrated at a later stage. © 2016 New York Academy of Sciences.
Domain fusion analysis by applying relational algebra to protein sequence and domain databases.
Truong, Kevin; Ikura, Mitsuhiko
2003-05-06
Domain fusion analysis is a useful method to predict functionally linked proteins that may be involved in direct protein-protein interactions or in the same metabolic or signaling pathway. As separate domain databases like BLOCKS, PROSITE, Pfam, SMART, PRINTS-S, ProDom, TIGRFAMs, and amalgamated domain databases like InterPro continue to grow in size and quality, a computational method to perform domain fusion analysis that leverages these efforts will become increasingly powerful. This paper proposes a computational method employing relational algebra to find domain fusions in protein sequence databases. The feasibility of this method was illustrated on the SWISS-PROT+TrEMBL sequence database using domain predictions from the Pfam HMM (hidden Markov model) database. We identified 235 and 189 putative functionally linked protein partners in H. sapiens and S. cerevisiae, respectively. From scientific literature, we were able to confirm many of these functional linkages, while the remainder offer testable experimental hypotheses. Results can be viewed at http://calcium.uhnres.utoronto.ca/pi. As the analysis can be computed quickly on any relational database that supports standard SQL (structured query language), it can be dynamically updated along with the sequence and domain databases, thereby improving the quality of predictions over time.
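The idea of expressing domain fusion analysis in standard SQL can be sketched as follows. This is not the authors' query. It assumes a single hypothetical table `protein_domain(protein, domain)` and treats two proteins as putatively linked when each carries one of two domains that co-occur in some third, composite protein. The table, column and domain names are all illustrative.

```python
import sqlite3


def find_domain_fusions(rows):
    """Return (protein_a, protein_b, composite) triples.

    rows: list of (protein, domain) pairs loaded into a hypothetical
    protein_domain table. protein_a has one domain of the composite but
    not the other, and protein_b the reverse.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE protein_domain (protein TEXT, domain TEXT)")
    conn.executemany("INSERT INTO protein_domain VALUES (?, ?)", rows)
    query = """
        SELECT DISTINCT a.protein, b.protein, ca.protein
        FROM protein_domain AS ca
        JOIN protein_domain AS cb
          ON ca.protein = cb.protein AND ca.domain < cb.domain
        JOIN protein_domain AS a ON a.domain = ca.domain
        JOIN protein_domain AS b ON b.domain = cb.domain
        WHERE a.protein <> ca.protein
          AND b.protein <> ca.protein
          AND a.protein <> b.protein
          AND NOT EXISTS (SELECT 1 FROM protein_domain AS x
                          WHERE x.protein = a.protein
                            AND x.domain = cb.domain)
          AND NOT EXISTS (SELECT 1 FROM protein_domain AS y
                          WHERE y.protein = b.protein
                            AND y.domain = ca.domain)
        ORDER BY a.protein, b.protein, ca.protein
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result


# Hypothetical domain assignments, for illustration only.
example_rows = [
    ("fusedAB", "DomA"), ("fusedAB", "DomB"),
    ("p1", "DomA"),
    ("p2", "DomB"),
    ("p3", "DomA"), ("p3", "DomB"),
]
example_links = find_domain_fusions(example_rows)
```

Because the whole analysis is one declarative query, it can simply be re-run whenever the underlying sequence and domain tables are refreshed, which is the property the abstract highlights.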
PeroxisomeDB: a database for the peroxisomal proteome, functional genomics and disease
Schlüter, Agatha; Fourcade, Stéphane; Domènech-Estévez, Enric; Gabaldón, Toni; Huerta-Cepas, Jaime; Berthommier, Guillaume; Ripp, Raymond; Wanders, Ronald J. A.; Poch, Olivier; Pujol, Aurora
2007-01-01
Peroxisomes are essential organelles of eukaryotic origin, ubiquitously distributed in cells and organisms, playing key roles in lipid and antioxidant metabolism. Loss or malfunction of peroxisomes causes more than 20 fatal inherited conditions. We have created a peroxisomal database that includes the complete peroxisomal proteome of Homo sapiens and Saccharomyces cerevisiae, by gathering, updating and integrating the available genetic and functional information on peroxisomal genes. PeroxisomeDB is structured in interrelated sections ‘Genes’, ‘Functions’, ‘Metabolic pathways’ and ‘Diseases’, that include hyperlinks to selected features of NCBI, ENSEMBL and UCSC databases. We have designed graphical depictions of the main peroxisomal metabolic routes and have included updated flow charts for diagnosis. Precomputed BLAST, PSI-BLAST, multiple sequence alignment (MUSCLE) and phylogenetic trees are provided to assist in direct multispecies comparison to study evolutionary conserved functions and pathways. Highlights of the PeroxisomeDB include new tools developed for facilitating (i) identification of novel peroxisomal proteins, by means of identifying proteins carrying peroxisome targeting signal (PTS) motifs, and (ii) detection of peroxisomes in silico, particularly useful for screening the deluge of newly sequenced genomes. PeroxisomeDB should contribute to the systematic characterization of the peroxisomal proteome and facilitate system biology approaches on the organelle. PMID:17135190
Using the TIGR gene index databases for biological discovery.
Lee, Yuandan; Quackenbush, John
2003-11-01
The TIGR Gene Index web pages provide access to analyses of ESTs and gene sequences for nearly 60 species, as well as a number of resources derived from these. Each species-specific database is presented using a common format with a homepage. A variety of methods exist that allow users to search each species-specific database. Methods implemented currently include nucleotide or protein sequence queries using WU-BLAST, text-based searches using various sequence identifiers, searches by gene, tissue and library name, and searches using functional classes through Gene Ontology assignments. This protocol provides guidance for using the Gene Index Databases to extract information.
The Human Ageing Genomic Resources: online databases and tools for biogerontologists
de Magalhães, João Pedro; Budovsky, Arie; Lehmann, Gilad; Costa, Joana; Li, Yang; Fraifeld, Vadim; Church, George M.
2009-01-01
Summary: Ageing is a complex, challenging phenomenon that will require multiple, interdisciplinary approaches to unravel its puzzles. To assist basic research on ageing, we developed the Human Ageing Genomic Resources (HAGR). This work provides an overview of the databases and tools in HAGR and describes how the gerontology research community can employ them. Several recent changes and improvements to HAGR are also presented. The two centrepieces in HAGR are GenAge and AnAge. GenAge is a gene database featuring genes associated with ageing and longevity in model organisms, a curated database of genes potentially associated with human ageing, and a list of genes tested for their association with human longevity. A myriad of biological data and information is included for hundreds of genes, making GenAge a reference for research that reflects our current understanding of the genetic basis of ageing. GenAge can also serve as a platform for the systems biology of ageing, and tools for the visualization of protein-protein interactions are also included. AnAge is a database of ageing in animals, featuring over 4,000 species, primarily assembled as a resource for comparative and evolutionary studies of ageing. Longevity records, developmental and reproductive traits, taxonomic information, basic metabolic characteristics, and key observations related to ageing are included in AnAge. Software is also available to aid researchers in the form of Perl modules to automate numerous tasks and as an SPSS script to analyse demographic mortality data. The Human Ageing Genomic Resources are available online at http://genomics.senescence.info. PMID:18986374
Protein Bioinformatics Databases and Resources
Chen, Chuming; Huang, Hongzhan; Wu, Cathy H.
2017-01-01
Many publicly available data repositories and resources have been developed to support protein related information management, data-driven hypothesis generation and biological knowledge discovery. To help researchers quickly find the appropriate protein related informatics resources, we present a comprehensive review (with categorization and description) of major protein bioinformatics databases in this chapter. We also discuss the challenges and opportunities for developing next-generation protein bioinformatics databases and resources to support data integration and data analytics in the Big Data era. PMID:28150231
BioWarehouse: a bioinformatics database warehouse toolkit
Lee, Thomas J; Pouliot, Yannick; Wagner, Valerie; Gupta, Priyanka; Stringer-Calvert, David WJ; Tenenbaum, Jessica D; Karp, Peter D
2006-01-01
Background: This article addresses the problem of interoperation of heterogeneous bioinformatics databases. Results: We introduce BioWarehouse, an open source toolkit for constructing bioinformatics database warehouses using the MySQL and Oracle relational database managers. BioWarehouse integrates its component databases into a common representational framework within a single database management system, thus enabling multi-database queries using the Structured Query Language (SQL) but also facilitating a variety of database integration tasks such as comparative analysis and data mining. BioWarehouse currently supports the integration of a pathway-centric set of databases including ENZYME, KEGG, and BioCyc, and in addition the UniProt, GenBank, NCBI Taxonomy, and CMR databases, and the Gene Ontology. Loader tools, written in the C and JAVA languages, parse and load these databases into a relational database schema. The loaders also apply a degree of semantic normalization to their respective source data, decreasing semantic heterogeneity. The schema supports the following bioinformatics datatypes: chemical compounds, biochemical reactions, metabolic pathways, proteins, genes, nucleic acid sequences, features on protein and nucleic-acid sequences, organisms, organism taxonomies, and controlled vocabularies. As an application example, we applied BioWarehouse to determine the fraction of biochemically characterized enzyme activities for which no sequences exist in the public sequence databases. The answer is that no sequence exists for 36% of enzyme activities for which EC numbers have been assigned. These gaps in sequence data significantly limit the accuracy of genome annotation and metabolic pathway prediction, and are a barrier for metabolic engineering. Complex queries of this type provide examples of the value of the data warehousing approach to bioinformatics research.
Conclusion: BioWarehouse embodies significant progress on the database integration problem for bioinformatics. PMID:16556315
BioWarehouse: a bioinformatics database warehouse toolkit.
Lee, Thomas J; Pouliot, Yannick; Wagner, Valerie; Gupta, Priyanka; Stringer-Calvert, David W J; Tenenbaum, Jessica D; Karp, Peter D
2006-03-23
This article addresses the problem of interoperation of heterogeneous bioinformatics databases. We introduce BioWarehouse, an open source toolkit for constructing bioinformatics database warehouses using the MySQL and Oracle relational database managers. BioWarehouse integrates its component databases into a common representational framework within a single database management system, thus enabling multi-database queries using the Structured Query Language (SQL) but also facilitating a variety of database integration tasks such as comparative analysis and data mining. BioWarehouse currently supports the integration of a pathway-centric set of databases including ENZYME, KEGG, and BioCyc, and in addition the UniProt, GenBank, NCBI Taxonomy, and CMR databases, and the Gene Ontology. Loader tools, written in the C and JAVA languages, parse and load these databases into a relational database schema. The loaders also apply a degree of semantic normalization to their respective source data, decreasing semantic heterogeneity. The schema supports the following bioinformatics datatypes: chemical compounds, biochemical reactions, metabolic pathways, proteins, genes, nucleic acid sequences, features on protein and nucleic-acid sequences, organisms, organism taxonomies, and controlled vocabularies. As an application example, we applied BioWarehouse to determine the fraction of biochemically characterized enzyme activities for which no sequences exist in the public sequence databases. The answer is that no sequence exists for 36% of enzyme activities for which EC numbers have been assigned. These gaps in sequence data significantly limit the accuracy of genome annotation and metabolic pathway prediction, and are a barrier for metabolic engineering. Complex queries of this type provide examples of the value of the data warehousing approach to bioinformatics research. BioWarehouse embodies significant progress on the database integration problem for bioinformatics.
FastRNABindR: Fast and Accurate Prediction of Protein-RNA Interface Residues.
El-Manzalawy, Yasser; Abbas, Mostafa; Malluhi, Qutaibah; Honavar, Vasant
2016-01-01
A wide range of biological processes, including regulation of gene expression, protein synthesis, and replication and assembly of many viruses are mediated by RNA-protein interactions. However, experimental determination of the structures of protein-RNA complexes is expensive and technically challenging. Hence, a number of computational tools have been developed for predicting protein-RNA interfaces. Some of the state-of-the-art protein-RNA interface predictors rely on position-specific scoring matrix (PSSM)-based encoding of the protein sequences. The computational effort needed to generate PSSMs severely limits the practical utility of protein-RNA interface prediction servers. In this work, we experiment with two approaches, random sampling and sequence similarity reduction, for extracting a representative reference database of protein sequences from more than 50 million protein sequences in UniRef100. Our results suggest that randomly sampled databases produce better PSSM profiles (in terms of the number of hits used to generate the profile and the distance of the generated profile to the corresponding profile generated using the entire UniRef100 data as well as the accuracy of the machine learning classifier trained using these profiles). Based on our results, we developed FastRNABindR, an improved version of RNABindR for predicting protein-RNA interface residues using PSSM profiles generated using 1% of the UniRef100 sequences sampled uniformly at random. To the best of our knowledge, FastRNABindR is the only protein-RNA interface residue prediction online server that requires generation of PSSM profiles for query sequences and accepts hundreds of protein sequences per submission.
Our approach for determining the optimal BLAST database for a protein-RNA interface residue classification task has the potential of substantially speeding up, and hence increasing the practical utility of, other amino acid sequence based predictors of protein-protein and protein-DNA interfaces.
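The random-sampling approach described above can be sketched in a few lines. This is an illustrative, in-memory version only: the authors' actual pipeline for UniRef100 is not described in the abstract, and the function names, the fixed seed and the tiny synthetic FASTA input are assumptions. A database of more than 50 million sequences would in practice need a streaming variant.

```python
import random


def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence)."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def sample_records(records, fraction, seed=0):
    """Draw a uniform random sample, without replacement, of the records."""
    if not records:
        return []
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    k = max(1, round(len(records) * fraction))
    return rng.sample(records, k)


# Synthetic input: 200 hypothetical records, of which 1% are sampled.
fasta_text = "".join(f">seq{i}\nMKT{'A' * (i % 5)}\n" for i in range(200))
all_records = parse_fasta(fasta_text)
reference_subset = sample_records(all_records, 0.01)
```

The sampled subset would then be formatted as the reference database against which PSSM profiles are generated.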
Resource for structure related information on transmembrane proteins
NASA Astrophysics Data System (ADS)
Tusnády, Gábor E.; Simon, István
Transmembrane proteins are involved in a wide variety of vital biological processes including transport of water-soluble molecules, flow of information and energy production. Despite significant efforts to determine the structures of these proteins, only a few thousand solved structures are known so far. Here, we review the various resources for structure-related information on these types of proteins ranging from the 3D structure to the topology and from the up-to-date databases to the various Internet sites and servers dealing with structure prediction and structure analysis. Abbreviations: 3D, three dimensional; PDB, Protein Data Bank; TMP, transmembrane protein.
PACSY, a relational database management system for protein structure and chemical shift analysis
Lee, Woonghee; Yu, Wookyung; Kim, Suhkmann; Chang, Iksoo
2012-01-01
PACSY (Protein structure And Chemical Shift NMR spectroscopY) is a relational database management system that integrates information from the Protein Data Bank, the Biological Magnetic Resonance Data Bank, and the Structural Classification of Proteins database. PACSY provides three-dimensional coordinates and chemical shifts of atoms along with derived information such as torsion angles, solvent accessible surface areas, and hydrophobicity scales. PACSY consists of six relational table types linked to one another for coherence by key identification numbers. Database queries are enabled by advanced search functions supported by an RDBMS server such as MySQL or PostgreSQL. PACSY enables users to search for combinations of information from different database sources in support of their research. Two software packages, PACSY Maker for database creation and PACSY Analyzer for database analysis, are available from http://pacsy.nmrfam.wisc.edu. PMID:22903636
THGS: a web-based database of Transmembrane Helices in Genome Sequences
Fernando, S. A.; Selvarani, P.; Das, Soma; Kumar, Ch. Kiran; Mondal, Sukanta; Ramakumar, S.; Sekar, K.
2004-01-01
Transmembrane Helices in Genome Sequences (THGS) is an interactive web-based database, developed to search for transmembrane helices in gene sequences of interest to the user that are available in the Genome Database (GDB). The database also allows sequence motifs to be searched in transmembrane and globular proteins. In addition, the motif can be searched in the other sequence databases (Swiss-Prot and PIR) or in the macromolecular structure database, Protein Data Bank (PDB). Further, the 3D structure of the corresponding queried motif, if it is available in the solved protein structures deposited in the Protein Data Bank, can also be visualized using the widely used graphics package RASMOL. All the sequence databases used in the present work are updated frequently and hence the results produced are up to date. The database THGS is freely available via the world wide web and can be accessed at http://pranag.physics.iisc.ernet.in/thgs/ or http://144.16.71.10/thgs/. PMID:14681375
Atomic analysis of protein-protein interfaces with known inhibitors: the 2P2I database.
Bourgeas, Raphaël; Basse, Marie-Jeanne; Morelli, Xavier; Roche, Philippe
2010-03-09
In the last decade, the inhibition of protein-protein interactions (PPIs) has emerged from both academic and private research as a new way to modulate the activity of proteins. Inhibitors of these original interactions are certainly the next generation of highly innovative drugs that will reach the market in the next decade. However, in silico design of such compounds still remains challenging. Here we describe this particular PPI chemical space through the presentation of 2P2I(DB), a hand-curated database dedicated to the structure of PPIs with known inhibitors. We have analyzed protein/protein and protein/inhibitor interfaces in terms of geometrical parameters, atom and residue properties, buried accessible surface area and other biophysical parameters. The interfaces found in 2P2I(DB) were then compared to those of representative datasets of heterodimeric complexes. We propose a new classification of PPIs with known inhibitors into two classes depending on the number of segments present at the interface and corresponding to either a single secondary structure element or to a more globular interacting domain. 2P2I(DB) complexes share global shape properties with standard transient heterodimer complexes, but their accessible surface areas are significantly smaller. No major conformational changes are seen between the different states of the proteins. The interfaces are more hydrophobic than general PPI interfaces, with fewer charged residues and more non-polar atoms. Finally, fifty percent of the complexes in the 2P2I(DB) dataset possess more hydrogen bonds than typical protein-protein complexes. Potential areas of study for the future are proposed, which include a new classification system consisting of specific families and the identification of PPI targets with high druggability potential based on key descriptors of the interaction.
2P2I database stores structural information about PPIs with known inhibitors and provides a useful tool for biologists to assess the potential druggability of their interfaces. The database can be accessed at http://2p2idb.cnrs-mrs.fr.
Ripoche, Hugues; Laine, Elodie; Ceres, Nicoletta; Carbone, Alessandra
2017-01-04
The database JET2 Viewer, openly accessible at http://www.jet2viewer.upmc.fr/, reports putative protein binding sites for all three-dimensional (3D) structures available in the Protein Data Bank (PDB). This knowledge base was generated by applying the computational method JET 2 at large scale on more than 20 000 chains. The JET 2 strategy yields very precise predictions of interacting surfaces and unravels their evolutionary process and complexity. JET2 Viewer provides an online intelligent display, including interactive 3D visualization of the binding sites mapped onto PDB structures and suitable files recording JET 2 analyses. Predictions were evaluated on more than 15 000 experimentally characterized protein interfaces. This is, to our knowledge, the largest evaluation of a protein binding site prediction method. The overall performance of JET 2 on all interfaces is: Sen = 52.52, PPV = 51.24, Spe = 80.05, Acc = 75.89. The data can be used to foster new strategies for protein-protein interactions modulation and interaction surface redesign. © The Author(s) 2016. Published by Oxford University Press on behalf of Nucleic Acids Research.
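The four figures quoted above (Sen, PPV, Spe, Acc) are standard confusion-matrix measures. The following is a minimal sketch of how such percentages are commonly computed from per-residue counts; the function name and the example counts are hypothetical, and the exact way JET 2 counts interface residues is not stated in the abstract.

```python
def binding_site_metrics(tp, fp, tn, fn):
    """Return sensitivity, PPV, specificity and accuracy as percentages.

    tp/fp/tn/fn are counts of residues predicted vs. observed at an interface.
    """
    sen = 100.0 * tp / (tp + fn)                   # observed interface residues recovered
    ppv = 100.0 * tp / (tp + fp)                   # predicted residues that are correct
    spe = 100.0 * tn / (tn + fp)                   # non-interface residues correctly rejected
    acc = 100.0 * (tp + tn) / (tp + fp + tn + fn)  # all residues correctly classified
    return {"Sen": sen, "PPV": ppv, "Spe": spe, "Acc": acc}


# Hypothetical residue counts for a single chain (illustrative only).
metrics = binding_site_metrics(tp=20, fp=20, tn=140, fn=20)
print(metrics)
```

Because non-interface residues usually outnumber interface residues, specificity and accuracy tend to be higher than sensitivity and PPV, which matches the pattern of the reported values.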
Dziuba, Bartłomiej; Dziuba, Marta
2014-01-01
New peptides with potential antimicrobial activity, encrypted in milk protein sequences, were searched for with the use of bioinformatic tools. The major milk proteins were hydrolyzed in silico by 28 enzymes. The obtained peptides were characterized by the following parameters: molecular weight, isoelectric point, composition and number of amino acid residues, net charge at pH 7.0, aliphatic index, instability index, Boman index, and GRAVY index, and compared with those calculated for 416 known antimicrobial peptides, including 59 antimicrobial peptides (AMPs) from milk proteins listed in the BIOPEP database. A simple analysis of physico-chemical properties and the values of biological activity indicators was insufficient to select potentially antimicrobial peptides released in silico from milk proteins by proteolytic enzymes. The final selection was made based on the results of multidimensional statistical analyses such as support vector machines (SVM), random forest (RF), artificial neural networks (ANN) and discriminant analysis (DA) available in the Collection of Anti-Microbial Peptides (CAMP database). Eleven new peptides with potential antimicrobial activity were selected from all peptides released during in silico proteolysis of milk proteins. PMID:25141106
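As an illustration of the kind of workflow described above, here is a minimal sketch of in silico hydrolysis followed by calculation of one of the listed descriptors, the GRAVY index (mean Kyte-Doolittle hydropathy). It assumes a single enzyme, trypsin, with the simplified rule "cleave after K or R unless followed by P"; the study used 28 enzymes whose rules are not given here, and the input sequence is hypothetical.

```python
import re

# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def trypsin_digest(sequence):
    """Split after K or R, except when the next residue is P (simplified rule)."""
    fragments = re.split(r"(?<=[KR])(?!P)", sequence)
    return [f for f in fragments if f]  # drop the empty tail after a C-terminal K/R


def gravy(peptide):
    """Grand average of hydropathy: mean Kyte-Doolittle value over all residues."""
    return sum(KYTE_DOOLITTLE[aa] for aa in peptide) / len(peptide)


# Hypothetical input sequence, not taken from any milk protein.
peptides = trypsin_digest("AKRPGKLM")
for pep in peptides:
    print(pep, round(gravy(pep), 2))
```

Positive GRAVY values indicate a hydrophobic peptide and negative values a hydrophilic one; as the abstract notes, such descriptors alone were not enough to select candidates.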
Comprehensive analysis of orthologous protein domains using the HOPS database.
Storm, Christian E V; Sonnhammer, Erik L L
2003-10-01
One of the most reliable methods for protein function annotation is to transfer experimentally known functions from orthologous proteins in other organisms. Most methods for identifying orthologs operate on a subset of organisms with a completely sequenced genome, and treat proteins as single-domain units. However, it is well known that proteins are often made up of several independent domains, and there is a wealth of protein sequences from genomes that are not completely sequenced. A comprehensive set of protein domain families is found in the Pfam database. We wanted to apply orthology detection to Pfam families, but first some issues needed to be addressed. First, orthology detection becomes impractical and unreliable when too many species are included. Second, shorter domains contain less information. It is therefore important to assess the quality of the orthology assignment and avoid very short domains altogether. We present a database of orthologous protein domains in Pfam called HOPS: Hierarchical grouping of Orthologous and Paralogous Sequences. Orthology is inferred in a hierarchic system of phylogenetic subgroups using ortholog bootstrapping. To avoid the frequent errors stemming from horizontally transferred genes in bacteria, the analysis is presently limited to eukaryotic genes. The results are accessible in the graphical browser NIFAS, a Java tool originally developed for analyzing phylogenetic relations within Pfam families. The method was tested on a set of curated orthologs with experimentally verified function. In comparison to tree reconciliation with a complete species tree, our approach finds significantly more orthologs in the test set. Examples for investigating gene fusions and domain recombination using HOPS are given.
MPIC: a mitochondrial protein import components database for plant and non-plant species.
Murcha, Monika W; Narsai, Reena; Devenish, James; Kubiszewski-Jakubiak, Szymon; Whelan, James
2015-01-01
In the 2 billion years since the endosymbiotic event that gave rise to mitochondria, variations in mitochondrial protein import have evolved across different species. With the genomes of an increasing number of plant species sequenced, it is possible to gain novel insights into mitochondrial protein import pathways. We have generated the Mitochondrial Protein Import Components (MPIC) Database (DB; http://www.plantenergy.uwa.edu.au/applications/mpic) providing searchable information on the protein import apparatus of plant and non-plant mitochondria. An in silico analysis was carried out, comparing the mitochondrial protein import apparatus from 24 species representing various lineages from Saccharomyces cerevisiae (yeast) and algae to Homo sapiens (human) and higher plants, including Arabidopsis thaliana (Arabidopsis), Oryza sativa (rice) and other more recently sequenced plant species. Each of these species was extensively searched and manually assembled for analysis in the MPIC DB. The database presents an interactive diagram in a user-friendly manner, allowing users to select their import component of interest. The MPIC DB presents an extensive resource facilitating detailed investigation of the mitochondrial protein import machinery and allowing patterns of conservation and divergence to be recognized that would otherwise have been missed. To demonstrate the usefulness of the MPIC DB, we present a comparative analysis of the mitochondrial protein import machinery in plants and non-plant species, revealing plant-specific features that have evolved. © The Author 2014. Published by Oxford University Press on behalf of Japanese Society of Plant Physiologists.
Kristensen, David M; Wolf, Yuri I; Koonin, Eugene V
2017-01-04
The Alignable Tight Genomic Clusters (ATGCs) database is a collection of closely related bacterial and archaeal genomes that provides several tools to aid research into evolutionary processes in the microbial world. Each ATGC is a taxonomy-independent cluster of 2 or more completely sequenced genomes that meet the objective criteria of a high degree of local gene order (synteny) and a small number of synonymous substitutions in the protein-coding genes. As such, each ATGC is suited for analysis of microevolutionary variations within a cohesive group of organisms (e.g. species), whereas the entire collection of ATGCs is useful for macroevolutionary studies. The ATGC database includes many forms of pre-computed data, in particular ATGC-COGs (Clusters of Orthologous Genes), multiple sequence alignments, a set of 'index' orthologs representing the most well-conserved members of each ATGC-COG, the phylogenetic tree of the organisms within each ATGC, etc. Although the ATGC database contains several million proteins from thousands of genomes organized into hundreds of clusters (roughly a 4-fold increase since the last version of the ATGC database), it is now built with completely automated methods and will be regularly updated following new releases of the NCBI RefSeq database. The ATGC database is hosted jointly at the University of Iowa at dmk-brain.ecn.uiowa.edu/ATGC/ and the NCBI at ftp.ncbi.nlm.nih.gov/pub/kristensen/ATGC/atgc_home.html. Published by Oxford University Press on behalf of Nucleic Acids Research 2016. This work is written by (a) US Government employee(s) and is in the public domain in the US.
Andreeva, Antonina
2016-06-15
The Structural Classification of Proteins (SCOP) database has facilitated the development of many tools and algorithms and it has been successfully used in protein structure prediction and large-scale genome annotations. During the development of SCOP, numerous exceptions were found to topological rules, along with complex evolutionary scenarios and peculiarities in proteins including the ability to fold into alternative structures. This article reviews cases of structural variations observed for individual proteins and among groups of homologues, knowledge of which is essential for protein structure modelling. © 2016 The Author(s). Published by Portland Press Limited on behalf of the Biochemical Society.
2013-01-01
Background Contemporary coral reef research has firmly established that a genomic approach is urgently needed to better understand the effects of anthropogenic environmental stress and global climate change on coral holobiont interactions. Here we present KEGG orthology-based annotation of the complete genome sequence of the scleractinian coral Acropora digitifera and provide the first comprehensive view of the genome of a reef-building coral by applying advanced bioinformatics. Description Sequences from the KEGG database of protein function were used to construct hidden Markov models. These models were used to search the predicted proteome of A. digitifera to establish complete genomic annotation. The annotated dataset is published in ZoophyteBase, an open access format with different options for searching the data. A particularly useful feature is the ability to use a Google-like search engine that links query words to protein attributes. We present features of the annotation that underpin the molecular structure of key processes of coral physiology that include (1) regulatory proteins of symbiosis, (2) planula and early developmental proteins, (3) neural messengers, receptors and sensory proteins, (4) calcification and Ca2+-signalling proteins, (5) plant-derived proteins, (6) proteins of nitrogen metabolism, (7) DNA repair proteins, (8) stress response proteins, (9) antioxidant and redox-protective proteins, (10) proteins of cellular apoptosis, (11) microbial symbioses and pathogenicity proteins, (12) proteins of viral pathogenicity, (13) toxins and venom, (14) proteins of the chemical defensome and (15) coral epigenetics. 
Conclusions We advocate that providing annotation in an open-access searchable database available to the public domain will give an unprecedented foundation to interrogate the fundamental molecular structure and interactions of coral symbiosis and allow critical questions to be addressed at the genomic level based on combined aspects of evolutionary, developmental, metabolic, and environmental perspectives. PMID:23889801
The electric dipole moment of DNA-binding HU protein calculated by the use of an NMR database.
Takashima, S; Yamaoka, K
1999-08-30
Electric birefringence measurements indicated the presence of a large permanent dipole moment in the HU protein-DNA complex. In order to substantiate this observation, numerical computation of the dipole moment of the HU protein homodimer was carried out by using NMR protein databases. The dipole moments of globular proteins have hitherto been calculated with X-ray databases and NMR data have never been used before. The advantages of NMR databases are: (a) NMR data are obtained, unlike X-ray data, using protein solutions. Accordingly, this method eliminates the bothersome question as to the possible alteration of the protein structure due to the transition from the crystalline state to the solution state. This question is particularly important for proteins such as HU protein, which has some degree of internal flexibility; (b) the three-dimensional coordinates of hydrogen atoms in protein molecules can be determined with a sufficient resolution and this enables the N-H as well as C=O bond moments to be calculated. Since the NMR database of HU protein from Bacillus stearothermophilus consists of 25 models, the surface charge as well as the core dipole moments were computed for each of these structures. The results of these calculations show that the net permanent dipole moment of the HU protein homodimer is approximately 500-530 D (1 D = 3.33 × 10^-30 C·m) at pH 7.5 and 600-630 D at the isoelectric point (pH 10.5). These permanent dipole moments are unusually large for a small protein of the size of 19.5 kDa. Nevertheless, the result of numerical calculations is compatible with the electro-optical observation, confirming a very large dipole moment in this protein.
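To make the unit conversion and the basic calculation concrete, the sketch below computes the magnitude of a dipole moment from point charges and atomic coordinates and expresses it in debye. It is an illustration only: it treats every atom as a point charge and takes the geometric center as the reference point, both of which are assumptions, whereas the study combined surface charges with core bond moments. It also uses the more precise conversion 1 D = 3.33564 × 10^-30 C·m.

```python
import math

# One elementary charge separated by one angstrom, expressed in debye:
# (1.602176634e-19 C * 1e-10 m) / 3.33564e-30 C*m per D, about 4.803 D.
E_ANGSTROM_TO_DEBYE = 1.602176634e-29 / 3.33564e-30


def dipole_moment_debye(charges, coords):
    """Magnitude of sum(q_i * (r_i - r_center)) in debye.

    charges: partial or formal charges in units of e.
    coords:  (x, y, z) positions in angstrom, one per charge.
    The geometric center is used as the reference point (an assumption);
    for a molecule with a net charge the result depends on this choice.
    """
    n = len(charges)
    center = [sum(c[k] for c in coords) / n for k in range(3)]
    mu = [
        sum(q * (c[k] - center[k]) for q, c in zip(charges, coords))
        for k in range(3)
    ]
    return math.sqrt(sum(m * m for m in mu)) * E_ANGSTROM_TO_DEBYE


# Hypothetical two-charge system: +1 e and -1 e placed 1 angstrom apart.
mu_pair = dipole_moment_debye([1.0, -1.0], [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)])
print(round(mu_pair, 3))
```

On this scale, the reported 500-530 D corresponds to roughly one elementary charge displaced by about 100 angstrom, which gives a sense of why the value is considered unusually large for a 19.5 kDa protein.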
Dölz, R; Mossé, M O; Slonimski, P P; Bairoch, A; Linder, P
1994-01-01
We continued our effort to make a comprehensive database (LISTA) for the yeast Saccharomyces cerevisiae. In this database each sequence has been attributed a single genetic name. In the case of duplicated sequences a simple method has been applied to distinguish sequences of one and the same gene from non-allelic sequences of duplicated genes. If necessary, synonyms are given in the case of allelic duplicated sequences. Thus sequences can be found either by the name or by synonyms given in LISTA. Each entry contains the genetic name, the mnemonic from the EMBL data bank, the codon bias, the reference to the publication of the sequence, the chromosomal location as far as known, and the Swissprot and EMBL accession numbers. To obtain more information on the included sequences, each entry has been screened against non-redundant nucleotide and protein data bank collections, resulting in LISTA-HON and LISTA-HOP. The LISTA database can be linked to the associated data sets or to nucleotide and protein banks by the Sequence Retrieval System (SRS). PMID:7937046
A comprehensive and scalable database search system for metaproteomics.
Chatterjee, Sandip; Stupp, Gregory S; Park, Sung Kyu Robin; Ducom, Jean-Christophe; Yates, John R; Su, Andrew I; Wolan, Dennis W
2016-08-16
Mass spectrometry-based shotgun proteomics experiments rely on accurate matching of experimental spectra against a database of protein sequences. Existing computational analysis methods are limited in the size of their sequence databases, which severely restricts the proteomic sequencing depth and functional analysis of highly complex samples. The growing amount of public high-throughput sequencing data will only exacerbate this problem. We designed a broadly applicable metaproteomic analysis method (ComPIL) that addresses protein database size limitations. Our approach to overcome this significant limitation in metaproteomics was to design a scalable set of sequence databases assembled for optimal library querying speeds. ComPIL was integrated with a modified version of the search engine ProLuCID (termed "Blazmass") to permit rapid matching of experimental spectra. Proof-of-principle analysis of human HEK293 lysate with a ComPIL database derived from high-quality genomic libraries was able to detect nearly all of the same peptides as a search with a human database (~500x fewer peptides in the database), with a small reduction in sensitivity. We were also able to detect proteins from the adenovirus used to immortalize these cells. We applied our method to a set of healthy human gut microbiome proteomic samples and showed a substantial increase in the number of identified peptides and proteins compared to previous metaproteomic analyses, while retaining a high degree of protein identification accuracy and allowing for a more in-depth characterization of the functional landscape of the samples. The combination of ComPIL with Blazmass allows proteomic searches to be performed with database sizes much larger than previously possible. These large database searches can be applied to complex meta-samples with unknown composition or proteomic samples where unexpected proteins may be identified. 
The protein database, proteomic search engine, and the proteomic data files for the 5 microbiome samples characterized and discussed herein are open source and available for use and additional analysis.
GWFASTA: server for FASTA search in eukaryotic and microbial genomes.
Issac, Biju; Raghava, G P S
2002-09-01
Similarity searches are a powerful method for solving important biological problems such as database scanning, evolutionary studies, gene prediction, and protein structure prediction. FASTA is a widely used sequence comparison tool for rapid database scanning. Here we describe the GWFASTA server that was developed to assist the FASTA user in similarity searches against partially and/or completely sequenced genomes. GWFASTA consists of more than 60 microbial genomes, eight eukaryote genomes, and proteomes of annotated genomes. In fact, it provides the maximum number of databases for similarity searching from a single platform. GWFASTA allows the submission of more than one sequence as a single query for a FASTA search. It also provides integrated post-processing of FASTA output, including compositional analysis of proteins, multiple sequence alignment, and phylogenetic analysis. Furthermore, it summarizes the search results organism-wise for prokaryotes and chromosome-wise for eukaryotes. Thus, the integration of different tools for sequence analyses makes GWFASTA a powerful tool for biologists.
Database resources of the National Center for Biotechnology Information.
Sayers, Eric W; Barrett, Tanya; Benson, Dennis A; Bolton, Evan; Bryant, Stephen H; Canese, Kathi; Chetvernin, Vyacheslav; Church, Deanna M; DiCuccio, Michael; Federhen, Scott; Feolo, Michael; Fingerman, Ian M; Geer, Lewis Y; Helmberg, Wolfgang; Kapustin, Yuri; Landsman, David; Lipman, David J; Lu, Zhiyong; Madden, Thomas L; Madej, Tom; Maglott, Donna R; Marchler-Bauer, Aron; Miller, Vadim; Mizrachi, Ilene; Ostell, James; Panchenko, Anna; Phan, Lon; Pruitt, Kim D; Schuler, Gregory D; Sequeira, Edwin; Sherry, Stephen T; Shumway, Martin; Sirotkin, Karl; Slotta, Douglas; Souvorov, Alexandre; Starchenko, Grigory; Tatusova, Tatiana A; Wagner, Lukas; Wang, Yanli; Wilbur, W John; Yaschenko, Eugene; Ye, Jian
2011-01-01
In addition to maintaining the GenBank® nucleic acid sequence database, the National Center for Biotechnology Information (NCBI) provides analysis and retrieval resources for the data in GenBank and other biological data made available through the NCBI Web site. NCBI resources include Entrez, the Entrez Programming Utilities, MyNCBI, PubMed, PubMed Central (PMC), Entrez Gene, the NCBI Taxonomy Browser, BLAST, BLAST Link (BLink), Primer-BLAST, COBALT, Electronic PCR, OrfFinder, Splign, ProSplign, RefSeq, UniGene, HomoloGene, ProtEST, dbMHC, dbSNP, dbVar, Epigenomics, Cancer Chromosomes, Entrez Genomes and related tools, the Map Viewer, Model Maker, Evidence Viewer, Trace Archive, Sequence Read Archive, Retroviral Genotyping Tools, HIV-1/Human Protein Interaction Database, Gene Expression Omnibus (GEO), Entrez Probe, GENSAT, Online Mendelian Inheritance in Man (OMIM), Online Mendelian Inheritance in Animals (OMIA), the Molecular Modeling Database (MMDB), the Conserved Domain Database (CDD), the Conserved Domain Architecture Retrieval Tool (CDART), IBIS, Biosystems, Peptidome, OMSSA, Protein Clusters and the PubChem suite of small molecule databases. Augmenting many of the Web applications are custom implementations of the BLAST program optimized to search specialized data sets. All of these resources can be accessed through the NCBI home page at www.ncbi.nlm.nih.gov.
GMDD: a database of GMO detection methods
Dong, Wei; Yang, Litao; Shen, Kailin; Kim, Banghyun; Kleter, Gijs A; Marvin, Hans JP; Guo, Rong; Liang, Wanqi; Zhang, Dabing
2008-01-01
Background Since more than one hundred genetically modified organism (GMO) events have been developed and approved for commercialization worldwide, GMO analysis methods are essential for the enforcement of GMO labelling regulations. Protein- and nucleic acid-based detection techniques have been developed and used for GMO identification and quantification. However, information supporting the harmonization and standardization of GMO analysis methods at the global level is still needed. Results The GMO Detection method Database (GMDD) has collected almost all previously developed and reported GMO detection methods, which are grouped by strategy (screen-, gene-, construct-, and event-specific), and provides a user-friendly search service for detection methods by GMO event name, exogenous gene, protein information, etc. In this database, users can obtain the sequences of exogenous integrations, which facilitates the design of PCR primers and probes. Information on endogenous genes, certified reference materials, reference molecules, and the validation status of developed methods is also included. Furthermore, registered users can submit new detection methods and sequences, and newly submitted information will be released soon after being checked. Conclusion GMDD contains comprehensive information on GMO detection methods. The database will make GMO analysis much easier. PMID:18522755
Fourment, Mathieu; Gibbs, Mark J
2008-01-01
Background Viruses of the Bunyaviridae have segmented negative-stranded RNA genomes and several of them cause significant disease. Many partial sequences have been obtained from the segments so that GenBank searches give complex results. Sequence databases usually use HTML pages to mediate remote sorting, but this approach can be limiting and may discourage a user from exploring a database. Results The VirusBanker database contains Bunyaviridae sequences and alignments and is presented as two spreadsheets generated by a Java program that interacts with a MySQL database on a server. Sequences are displayed in rows and may be sorted using information that is displayed in columns and includes data relating to the segment, gene, protein, species, strain, sequence length, terminal sequence and date and country of isolation. Bunyaviridae sequences and alignments may be downloaded from the second spreadsheet with titles defined by the user from the columns, or viewed when passed directly to the sequence editor, Jalview. Conclusion VirusBanker allows large datasets of aligned nucleotide and protein sequences from the Bunyaviridae to be compiled and winnowed rapidly using criteria that are formulated heuristically. PMID:18251994
Annotation, submission and screening of repetitive elements in Repbase: RepbaseSubmitter and Censor.
Kohany, Oleksiy; Gentles, Andrew J; Hankus, Lukasz; Jurka, Jerzy
2006-10-25
Repbase is a reference database of eukaryotic repetitive DNA, which includes prototypic sequences of repeats and basic information described in annotations. Updating and maintenance of the database requires specialized tools, which we have created and made available for use with Repbase, and which may be useful as a template for other curated databases. We describe the software tools RepbaseSubmitter and Censor, which are designed to facilitate updating and screening the content of Repbase. RepbaseSubmitter is a Java-based interface for formatting and annotating Repbase entries. It eliminates many common formatting errors, and automates actions such as calculation of sequence lengths and composition, thus facilitating curation of Repbase sequences. In addition, it has several features for predicting protein coding regions in sequences; searching and including PubMed references in Repbase entries; and searching the NCBI taxonomy database for correct inclusion of species information and taxonomic position. Censor is a tool to rapidly identify repetitive elements by comparison to known repeats. It uses WU-BLAST for speed and sensitivity, and can conduct DNA-DNA, DNA-protein, or translated DNA-translated DNA searches of genomic sequence. Defragmented output includes a map of repeats present in the query sequence, with the options to report masked query sequence(s), repeat sequences found in the query, and alignments. Censor and RepbaseSubmitter are available as both web-based services and downloadable versions. They can be found at http://www.girinst.org/repbase/submission.html (RepbaseSubmitter) and http://www.girinst.org/censor/index.php (Censor).
hPDI: a database of experimental human protein-DNA interactions.
Xie, Zhi; Hu, Shaohui; Blackshaw, Seth; Zhu, Heng; Qian, Jiang
2010-01-15
The human protein DNA Interactome (hPDI) database holds experimental protein-DNA interaction data for humans identified by protein microarray assays. The unique characteristic of hPDI is that it contains consensus DNA-binding sequences not only for nearly 500 human transcription factors but also for >500 unconventional DNA-binding proteins, which were previously completely uncharacterized. Users can browse, search and download a subset or the entire data via a web interface. This database is freely accessible for any academic purposes. http://bioinfo.wilmer.jhu.edu/PDI/.
MitoNuc: a database of nuclear genes coding for mitochondrial proteins. Update 2002.
Attimonelli, Marcella; Catalano, Domenico; Gissi, Carmela; Grillo, Giorgio; Licciulli, Flavio; Liuni, Sabino; Santamaria, Monica; Pesole, Graziano; Saccone, Cecilia
2002-01-01
Mitochondria, besides their central role in energy metabolism, have recently been found to be involved in a number of basic processes of cell life and to contribute to the pathogenesis of many degenerative diseases. All functions of mitochondria depend on the interaction of nuclear and organelle genomes. Mitochondrial genomes have been extensively sequenced and analysed and data have been collected in several specialised databases. In order to collect information on nuclear-coded mitochondrial proteins we developed MitoNuc, a database containing detailed information on sequenced nuclear genes coding for mitochondrial proteins in Metazoa. The MitoNuc database can be retrieved through SRS and is available via the web site http://bighost.area.ba.cnr.it/mitochondriome where other mitochondrial databases developed by our group, the complete list of the sequenced mitochondrial genomes, links to other mitochondrial sites and related information, are available. The MitoAln database, related to MitoNuc in the previous release, reporting the multiple alignments of the relevant homologous protein coding regions, is no longer supported in the present release. In order to keep the links among entries in MitoNuc from homologous proteins, a new field in the database has been defined: the cluster identifier, an alphanumeric code used to identify each cluster of homologous proteins. A comment field derived from the corresponding SWISS-PROT entry has been introduced; this reports clinical data related to dysfunction of the protein. The logical scheme of the MitoNuc database has been implemented in the ORACLE DBMS. This will allow end-users to retrieve data through a friendly interface that will soon be implemented.
Drug search for leishmaniasis: a virtual screening approach by grid computing.
Ochoa, Rodrigo; Watowich, Stanley J; Flórez, Andrés; Mesa, Carol V; Robledo, Sara M; Muskus, Carlos
2016-07-01
The trypanosomatid protozoa Leishmania is endemic in ~100 countries, with infections causing ~2 million new cases of leishmaniasis annually. Disease symptoms can include severe skin and mucosal ulcers, fever, anemia, splenomegaly, and death. Unfortunately, therapeutics approved to treat leishmaniasis are associated with potentially severe side effects, including death. Furthermore, drug-resistant Leishmania parasites have developed in most endemic countries. To address an urgent need for new, safe and inexpensive anti-leishmanial drugs, we utilized the IBM World Community Grid to complete computer-based drug discovery screens (Drug Search for Leishmaniasis) using unique leishmanial proteins and a database of 600,000 drug-like small molecules. Protein structures from different Leishmania species were selected for molecular dynamics (MD) simulations, and a series of conformational "snapshots" were chosen from each MD trajectory to simulate the protein's flexibility. A Relaxed Complex Scheme methodology was used to screen ~2000 MD conformations against the small molecule database, producing >1 billion protein-ligand structures. For each protein target, a binding spectrum was calculated to identify compounds predicted to bind with highest average affinity to all protein conformations. Significantly, four different Leishmania protein targets were predicted to strongly bind small molecules, with the strongest binding interactions predicted to occur for dihydroorotate dehydrogenase (LmDHODH; PDB:3MJY). A number of predicted tight-binding LmDHODH inhibitors were tested in vitro and potent selective inhibitors of Leishmania panamensis were identified. These promising small molecules are suitable for further development using iterative structure-based optimization and in vitro/in vivo validation assays.
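The ranking step described above, in which compounds are ordered by their average predicted affinity across all protein conformations, can be sketched as follows. The compound names and energies are hypothetical, and the scoring convention (docking energies in kcal/mol, more negative meaning tighter binding) is an assumption rather than a detail given in the abstract.

```python
from statistics import mean


def rank_by_average_affinity(scores):
    """Rank compounds by mean predicted binding energy over all conformations.

    scores maps a compound id to a list of predicted energies, one per
    MD snapshot. More negative values are assumed to mean stronger binding.
    """
    averaged = {compound: mean(values) for compound, values in scores.items()}
    return sorted(averaged.items(), key=lambda item: item[1])


# Hypothetical docking energies (kcal/mol) for three compounds, three snapshots.
scores = {
    "cmpd_A": [-9.0, -8.0, -10.0],
    "cmpd_B": [-11.0, -4.0, -6.0],
    "cmpd_C": [-6.0, -6.0, -6.0],
}
ranking = rank_by_average_affinity(scores)
print(ranking)
```

In this toy data, cmpd_B has the single best score but ranks below cmpd_A on average, which illustrates why averaging over conformations can favor compounds that bind consistently rather than to one snapshot only.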
Using SQL Databases for Sequence Similarity Searching and Analysis.
Pearson, William R; Mackey, Aaron J
2017-09-13
Relational databases can integrate diverse types of information and manage large sets of similarity search results, greatly simplifying genome-scale analyses. By focusing on taxonomic subsets of sequences, relational databases can reduce the size and redundancy of sequence libraries and improve the statistical significance of homologs. In addition, by loading similarity search results into a relational database, it becomes possible to explore and summarize the relationships between all of the proteins in an organism and those in other biological kingdoms. This unit describes how to use relational databases to improve the efficiency of sequence similarity searching and demonstrates various large-scale genomic analyses of homology-related data. It also describes the installation and use of a simple protein sequence database, seqdb_demo, which is used as a basis for the other protocols. The unit also introduces search_demo, a database that stores sequence similarity search results. The search_demo database is then used to explore the evolutionary relationships between E. coli proteins and proteins in other organisms in a large-scale comparative genomic analysis. © 2017 by John Wiley & Sons, Inc.
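As a rough illustration of loading similarity search results into a relational database and summarizing them by taxon, the sketch below uses Python's built-in sqlite3 module. The two-table schema, the accession names and the E-values are hypothetical and are not the seqdb_demo or search_demo schemas described in the unit.

```python
import sqlite3

# In-memory database with a hypothetical, simplified schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE protein (acc TEXT PRIMARY KEY, taxon TEXT);
CREATE TABLE hit (query_acc TEXT, subject_acc TEXT, evalue REAL);
""")
conn.executemany("INSERT INTO protein VALUES (?, ?)", [
    ("ecoli_1", "E. coli"), ("ecoli_2", "E. coli"),
    ("human_1", "H. sapiens"), ("yeast_1", "S. cerevisiae"),
])
# Made-up search results: query protein, library protein, E-value.
conn.executemany("INSERT INTO hit VALUES (?, ?, ?)", [
    ("ecoli_1", "human_1", 1e-30),
    ("ecoli_1", "yeast_1", 1e-12),
    ("ecoli_2", "yeast_1", 0.5),
])


def count_queries_with_homologs(connection, max_evalue=1e-6):
    """Count, per subject taxon, the distinct query proteins that have at
    least one hit at or below the E-value threshold."""
    rows = connection.execute("""
        SELECT p.taxon, COUNT(DISTINCT h.query_acc)
        FROM hit AS h
        JOIN protein AS p ON p.acc = h.subject_acc
        WHERE h.evalue <= ?
        GROUP BY p.taxon
        ORDER BY p.taxon
    """, (max_evalue,)).fetchall()
    return dict(rows)


summary = count_queries_with_homologs(conn)
```

Keeping the hits in a table means the significance threshold becomes a query parameter, so the same search results can be re-summarized without rerunning the search.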
SALAD database: a motif-based database of protein annotations for plant comparative genomics
Mihara, Motohiro; Itoh, Takeshi; Izawa, Takeshi
2010-01-01
Proteins often have several motifs with distinct evolutionary histories. Proteins with similar motifs have similar biochemical properties and thus related biological functions. We constructed a unique comparative genomics database termed the SALAD database (http://salad.dna.affrc.go.jp/salad/) from plant-genome-based proteome data sets. We extracted evolutionarily conserved motifs by MEME software from 209 529 protein-sequence annotation groups selected by BLASTP from the proteome data sets of 10 species: rice, sorghum, Arabidopsis thaliana, grape, a lycophyte, a moss, 3 algae, and yeast. Similarity clustering of each protein group was performed by pairwise scoring of the motif patterns of the sequences. The SALAD database provides a user-friendly graphical viewer that displays a motif pattern diagram linked to the resulting bootstrapped dendrogram for each protein group. Amino-acid-sequence-based and nucleotide-sequence-based phylogenetic trees for motif combination alignment, a logo comparison diagram for each clade in the tree, and a Pfam-domain pattern diagram are also available. We also developed a viewer named ‘SALAD on ARRAYs’ to view arbitrary microarray data sets of paralogous genes linked to the same dendrogram in a window. The SALAD database is a powerful tool for comparing protein sequences and can provide valuable hints for biological analysis. PMID:19854933
Hayashi, Takanori; Matsuzaki, Yuri; Yanagisawa, Keisuke; Ohue, Masahito; Akiyama, Yutaka
2018-05-08
Protein-protein interactions (PPIs) play several roles in living cells, and computational PPI prediction is a major focus of many researchers. The three-dimensional (3D) structure and binding surface are important for the design of PPI inhibitors. Therefore, rigid body protein-protein docking calculations for two protein structures are expected to allow elucidation of PPIs different from known complexes in terms of 3D structures because known PPI information is not explicitly required. We have developed rapid PPI prediction software based on protein-protein docking, called MEGADOCK. In order to fully utilize the benefits of computational PPI predictions, it is necessary to construct a comprehensive database to gather prediction results and their predicted 3D complex structures and to make them easily accessible. Although several databases exist that provide predicted PPIs, the previous databases do not contain a sufficient number of entries for the purpose of discovering novel PPIs. In this study, we constructed an integrated database of MEGADOCK PPI predictions, named MEGADOCK-Web. MEGADOCK-Web provides more than 10 times as many PPI predictions as previous databases and enables users to conduct PPI predictions that cannot be found in conventional PPI prediction databases. In MEGADOCK-Web, there are 7528 protein chains and 28,331,628 predicted PPIs from all possible combinations of those proteins. Each protein structure is annotated with PDB ID, chain ID, UniProt AC, related KEGG pathway IDs, and known PPI pairs. Additionally, MEGADOCK-Web provides four powerful functions: 1) searching precalculated PPI predictions, 2) providing annotations for each predicted protein pair with an experimentally known PPI, 3) visualizing candidates that may interact with the query protein on biochemical pathways, and 4) visualizing predicted complex structures through a 3D molecular viewer.
MEGADOCK-Web provides a huge amount of comprehensive PPI predictions based on docking calculations with biochemical pathways and enables users to easily and quickly assess PPI feasibilities by archiving PPI predictions. MEGADOCK-Web also promotes the discovery of new PPIs and protein functions and is freely available for use at http://www.bi.cs.titech.ac.jp/megadock-web/ .
The Annotation-enriched non-redundant patent sequence databases
Li, Weizhong; Kondratowicz, Bartosz; McWilliam, Hamish; Nauche, Stephane; Lopez, Rodrigo
2013-01-01
The EMBL-European Bioinformatics Institute (EMBL-EBI) offers public access to patent sequence data, providing a valuable service to the intellectual property and scientific communities. The non-redundant (NR) patent sequence databases comprise two-level nucleotide and protein sequence clusters (NRNL1, NRNL2, NRPL1 and NRPL2) based on sequence identity (level-1) and patent family (level-2). Annotation from the source entries in these databases is merged and enhanced with additional information from the patent literature and biological context. Corrections in patent publication numbers, kind-codes and patent equivalents significantly improve the data quality. Data are available through various user interfaces including web browser, downloads via FTP, SRS, Dbfetch and EBI-Search. Sequence similarity/homology searches against the databases are available using BLAST, FASTA and PSI-Search. In this article, we describe the data collection and annotation and also outline major changes and improvements introduced since 2009. Apart from data growth, these changes include additional annotation for singleton clusters, the identifier versioning for tracking entry change and the entry mappings between the two-level databases. Database URL: http://www.ebi.ac.uk/patentdata/nr/ PMID:23396323
Database on pharmacophore analysis of active principles, from medicinal plants
Pitchai, Daisy; Manikkam, Rajalakshmi; Rajendran, Sasikala R; Pitchai, Gnanamani
2010-01-01
Plants continue to be a major source of medicines, as they have been throughout human history. At present, drug discovery from plants involves a multidisciplinary approach combining ethnobotanical, phytochemical and biological techniques to provide new chemical compounds (lead molecules) for the development of drugs against various pharmacological targets, including cancer, diabetes and its secondary complications. In view of this need in current drug discovery from medicinal plants, here we describe another web database containing pharmacophore analysis information for active principles from medicinal plants that possess antidiabetic, antimicrobial, anticancer and antioxidant properties. The database provides the botanical and taxonomic classification as well as the biochemical and pharmacological properties of medicinal plants. Data on antidiabetic, antimicrobial, antioxidative, antitumor and anti-inflammatory compounds, together with their physicochemical properties, SMILES notation and Lipinski's properties, are included in our database. One of the proposed features of the database is the predicted ADMET values and the interaction of bioactive compounds with the target protein. The database lists compounds alphabetically by name and provides separate tabs for antimicrobial, antitumor, antidiabetic, and antioxidative compounds. Availability: http://www.hccbif.info/ PMID:21346859
ProteinWorldDB: querying radical pairwise alignments among protein sets from complete genomes.
Otto, Thomas Dan; Catanho, Marcos; Tristão, Cristian; Bezerra, Márcia; Fernandes, Renan Mathias; Elias, Guilherme Steinberger; Scaglia, Alexandre Capeletto; Bovermann, Bill; Berstis, Viktors; Lifschitz, Sergio; de Miranda, Antonio Basílio; Degrave, Wim
2010-03-01
Many analyses in modern biological research are based on comparisons between biological sequences, resulting in functional, evolutionary and structural inferences. When large numbers of sequences are compared, heuristics are often used resulting in a certain lack of accuracy. In order to improve and validate results of such comparisons, we have performed radical all-against-all comparisons of 4 million protein sequences belonging to the RefSeq database, using an implementation of the Smith-Waterman algorithm. This extremely intensive computational approach was made possible with the help of World Community Grid, through the Genome Comparison Project. The resulting database, ProteinWorldDB, which contains coordinates of pairwise protein alignments and their respective scores, is now made available. Users can download, compare and analyze the results, filtered by genomes, protein functions or clusters. ProteinWorldDB is integrated with annotations derived from Swiss-Prot, Pfam, KEGG, NCBI Taxonomy database and gene ontology. The database is a unique and valuable asset, representing a major effort to create a reliable and consistent dataset of cross-comparisons of the whole protein content encoded in hundreds of completely sequenced genomes using a rigorous dynamic programming approach. The database can be accessed through http://proteinworlddb.org
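For readers unfamiliar with the Smith-Waterman algorithm named above, a minimal score-only sketch of the dynamic programming recurrence follows. It is not the project's implementation: the simple match/mismatch scores and the linear gap penalty are assumptions chosen for brevity, whereas protein comparisons normally use a substitution matrix and affine gap penalties.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b.

    Uses a simplified scoring scheme (identity match/mismatch, linear gaps)
    and keeps only two rows of the dynamic programming matrix in memory.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero, so an
            # alignment can start anywhere in either sequence.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


score = smith_waterman_score("XXACDEXX", "YYACDEYY")
```

Every cell of the matrix is evaluated, with no heuristic pruning, so the cost grows with the product of the two sequence lengths. That is why an all-against-all comparison of millions of sequences called for grid computing.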
LocSigDB: a database of protein localization signals
Negi, Simarjeet; Pandey, Sanjit; Srinivasan, Satish M.; Mohammed, Akram; Guda, Chittibabu
2015-01-01
LocSigDB (http://genome.unmc.edu/LocSigDB/) is a manually curated database of experimental protein localization signals for eight distinct subcellular locations; primarily in a eukaryotic cell with brief coverage of bacterial proteins. Proteins must be localized at their appropriate subcellular compartment to perform their desired function. Mislocalization of proteins to unintended locations is a causative factor for many human diseases; therefore, collection of known sorting signals will help support many important areas of biomedical research. By performing an extensive literature study, we compiled a collection of 533 experimentally determined localization signals, along with the proteins that harbor such signals. Each signal in the LocSigDB is annotated with its localization, source, PubMed references and is linked to the proteins in UniProt database along with the organism information that contain the same amino acid pattern as the given signal. From LocSigDB webserver, users can download the whole database or browse/search for data using an intuitive query interface. To date, LocSigDB is the most comprehensive compendium of protein localization signals for eight distinct subcellular locations. Database URL: http://genome.unmc.edu/LocSigDB/ PMID:25725059
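Because each signal is stored as an amino acid pattern, scanning a protein sequence for known signals reduces to pattern matching. The sketch below assumes the patterns are expressed as regular expressions; the two entries are textbook examples (the C-terminal KDEL retention motif and the SV40 large T antigen nuclear localization signal) used here for illustration, not records taken from LocSigDB, and the test sequences are made up.

```python
import re

# Illustrative patterns only; labels and regex encoding are assumptions.
SIGNALS = {
    "ER retention (C-terminal KDEL)": re.compile(r"KDEL$"),
    "nuclear localization (SV40-like)": re.compile(r"PKKKRKV"),
}


def find_signals(sequence):
    """Return (signal name, 0-based start, matched text) for every
    pattern occurrence found in the one-letter amino acid sequence."""
    found = []
    for name, pattern in SIGNALS.items():
        for m in pattern.finditer(sequence):
            found.append((name, m.start(), m.group()))
    return found


nuclear_hits = find_signals("MAPKKKRKVGG")
er_hits = find_signals("MSTAKDEL")
```

Anchoring the KDEL pattern with `$` encodes the positional constraint that the motif must sit at the C-terminus, so an internal KDEL is not reported.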
Xia, Kai; Dong, Dong; Han, Jing-Dong J
2006-01-01
Background: Although protein-protein interaction (PPI) networks have been explored by various experimental methods, the maps so built are still limited in coverage and accuracy. To further expand the PPI network and to extract more accurate information from existing maps, studies have been carried out to integrate various types of functional relationship data. A frequently updated database of computationally analyzed potential PPIs to provide biological researchers with rapid and easy access to analyze original data as a biological network is still lacking. Results: By applying a probabilistic model, we integrated 27 heterogeneous genomic, proteomic and functional annotation datasets to predict PPI networks in human. In addition to previously studied data types, we show that phenotypic distances and genetic interactions can also be integrated to predict PPIs. We further built an easy-to-use, updatable integrated PPI database, the Integrated Network Database (IntNetDB) online, to provide automatic prediction and visualization of PPI network among genes of interest. The networks can be visualized in SVG (Scalable Vector Graphics) format for zooming in or out. IntNetDB also provides a tool to extract topologically highly connected network neighborhoods from a specific network for further exploration and research. Using the MCODE (Molecular Complex Detections) algorithm, 190 such neighborhoods were detected among all the predicted interactions. The predicted PPIs can also be mapped to worm, fly and mouse interologs. Conclusion: IntNetDB includes 180,010 predicted protein-protein interactions among 9,901 human proteins and represents a useful resource for the research community. Our study has increased prediction coverage by five-fold. IntNetDB also provides easy-to-use network visualization and analysis tools that allow biological researchers unfamiliar with computational biology to access and analyze data over the internet.
The web interface of IntNetDB is freely accessible at . Visualization requires Mozilla version 1.8 (or higher) or Internet Explorer with installation of SVGviewer. PMID:17112386
Rice proteome analysis: a step toward functional analysis of the rice genome.
Komatsu, Setsuko; Tanaka, Naoki
2005-03-01
The technique of proteome analysis using 2-DE has the power to monitor global changes that occur in the protein complement of tissues and subcellular compartments. In this review, we describe construction of the rice proteome database, the cataloging of rice proteins, and the functional characterization of some of the proteins identified. Initially, proteins extracted from various tissues and organelles were separated by 2-DE and an image analyzer was used to construct a display or reference map of the proteins. The rice proteome database currently contains 23 reference maps based on 2-DE of proteins from different rice tissues and subcellular compartments. These reference maps comprise 13 129 rice proteins, and the amino acid sequences of 5092 of these proteins are entered in the database. Major proteins involved in growth or stress responses have been identified by using a proteomics approach and some of these proteins have unique functions. Furthermore, initial work has also begun on analyzing the phosphoproteome and protein-protein interactions in rice. The information obtained from the rice proteome database will aid in the molecular cloning of rice genes and in predicting the function of unknown proteins.
Proteome of Caulobacter crescentus cell cycle publicly accessible on SWICZ server.
Vohradsky, Jiri; Janda, Ivan; Grünenfelder, Björn; Berndt, Peter; Röder, Daniel; Langen, Hanno; Weiser, Jaroslav; Jenal, Urs
2003-10-01
Here we present the Swiss-Czech Proteomics Server (SWICZ), which hosts the proteomic database summarizing information about the cell cycle of the aquatic bacterium Caulobacter crescentus. The database provides a searchable tool for easy access of global protein synthesis and protein stability data as examined during the C. crescentus cell cycle. Protein synthesis data collected from five different cell cycle stages were determined for each protein spot as a relative value of the total amount of [(35)S]methionine incorporation. Protein stability of pulse-labeled extracts were measured during a chase period equivalent to one cell cycle unit. Quantitative information for individual proteins together with descriptive data such as protein identities, apparent molecular masses and isoelectric points, were combined with information on protein function, genomic context, and the cell cycle stage, and were then assembled in a relational database with a world wide web interface (http://proteom.biomed.cas.cz), which allows the database records to be searched and displays the recovered information. A total of 1250 protein spots were reproducibly detected on two-dimensional gel electropherograms, 295 of which were identified by mass spectroscopy. The database is accessible either through clickable two-dimensional gel electrophoretic maps or by means of a set of dedicated search engines. Basic characterization of the experimental procedures, data processing, and a comprehensive description of the web site are presented. In its current state, the SWICZ proteome database provides a platform for the incorporation of new data emerging from extended functional studies on the C. crescentus proteome.
Chen, Mingchen; Lin, Xingcheng; Zheng, Weihua; Onuchic, José N; Wolynes, Peter G
2016-08-25
The associative memory, water mediated, structure and energy model (AWSEM) is a coarse-grained force field with transferable tertiary interactions that incorporates local in sequence energetic biases using bioinformatically derived structural information about peptide fragments with locally similar sequences that we call memories. The memory information from the protein data bank (PDB) database guides proper protein folding. The structural information about available sequences in the database varies in quality and can sometimes lead to frustrated free energy landscapes locally. One way out of this difficulty is to construct the input fragment memory information from all-atom simulations of portions of the complete polypeptide chain. In this paper, we investigate this approach first put forward by Kwac and Wolynes in a more complete way by studying the structure prediction capabilities of this approach for six α-helical proteins. This scheme which we call the atomistic associative memory, water mediated, structure and energy model (AAWSEM) amounts to an ab initio protein structure prediction method that starts from the ground up without using bioinformatic input. The free energy profiles from AAWSEM show that atomistic fragment memories are sufficient to guide the correct folding when tertiary forces are included. AAWSEM combines the efficiency of coarse-grained simulations on the full protein level with the local structural accuracy achievable from all-atom simulations of only parts of a large protein. The results suggest that a hybrid use of atomistic fragment memory and database memory in structural predictions may well be optimal for many practical applications.
Geisler, Christoph
2018-02-07
Adventitious viral contamination in cell substrates used for biologicals production is a major safety concern. A powerful new approach that can be used to identify adventitious viruses is a combination of bioinformatics tools with massively parallel sequencing technology. Typically, this involves mapping or BLASTN searching individual reads against viral nucleotide databases. Although extremely sensitive for known viruses, this approach can easily miss viruses that are too dissimilar to viruses in the database. Moreover, it is computationally intensive and requires reference cell genome databases. To avoid these drawbacks, we set out to develop an alternative approach. We reasoned that searching genome and transcriptome assemblies for adventitious viral contaminants using TBLASTN with a compact viral protein database covering extant viral diversity as the query could be fast and sensitive without a requirement for high performance computing hardware. We tested our approach on Spodoptera frugiperda Sf-RVN, a recently isolated insect cell line, to determine if it was contaminated with one or more adventitious viruses. We used Illumina reads to assemble the Sf-RVN genome and transcriptome and searched them for adventitious viral contaminants using TBLASTN with our viral protein database. We found no evidence of viral contamination, which was substantiated by the fact that our searches otherwise identified diverse sequences encoding virus-like proteins. These sequences included Maverick, R1 LINE, and errantivirus transposons, all of which are common in insect genomes. We also identified previously described as well as novel endogenous viral elements similar to ORFs encoded by diverse insect viruses. 
Our results demonstrate TBLASTN searching massively parallel sequencing (MPS) assemblies with a compact, manually curated viral protein database is more sensitive for adventitious virus detection than BLASTN, as we identified various sequences that encoded virus-like proteins, but had no similarity to viral sequences at the nucleotide level. Moreover, searches were fast without requiring high performance computing hardware. Our study also documents the enhanced biosafety profile of Sf-RVN as compared to other Sf cell lines, and supports the notion that Sf-RVN is highly suitable for the production of safe biologicals.
The BioExtract Server: a web-based bioinformatic workflow platform
Lushbough, Carol M.; Jennewein, Douglas M.; Brendel, Volker P.
2011-01-01
The BioExtract Server (bioextract.org) is an open, web-based system designed to aid researchers in the analysis of genomic data by providing a platform for the creation of bioinformatic workflows. Scientific workflows are created within the system by recording tasks performed by the user. These tasks may include querying multiple, distributed data sources, saving query results as searchable data extracts, and executing local and web-accessible analytic tools. The series of recorded tasks can then be saved as a reproducible, sharable workflow available for subsequent execution with the original or modified inputs and parameter settings. Integrated data resources include interfaces to the National Center for Biotechnology Information (NCBI) nucleotide and protein databases, the European Molecular Biology Laboratory (EMBL-Bank) non-redundant nucleotide database, the Universal Protein Resource (UniProt), and the UniProt Reference Clusters (UniRef) database. The system offers access to numerous preinstalled, curated analytic tools and also provides researchers with the option of selecting computational tools from a large list of web services including the European Molecular Biology Open Software Suite (EMBOSS), BioMoby, and the Kyoto Encyclopedia of Genes and Genomes (KEGG). The system further allows users to integrate local command line tools residing on their own computers through a client-side Java applet. PMID:21546552
KnotProt: a database of proteins with knots and slipknots
Jamroz, Michal; Niemyska, Wanda; Rawdon, Eric J.; Stasiak, Andrzej; Millett, Kenneth C.; Sułkowski, Piotr; Sulkowska, Joanna I.
2015-01-01
The protein topology database KnotProt, http://knotprot.cent.uw.edu.pl/, collects information about protein structures with open polypeptide chains forming knots or slipknots. The knotting complexity of the cataloged proteins is presented in the form of a matrix diagram that shows users the knot type of the entire polypeptide chain and of each of its subchains. The pattern visible in the matrix gives the knotting fingerprint of a given protein and permits users to determine, for example, the minimal length of the knotted regions (knot's core size) or the depth of a knot, i.e. how many amino acids can be removed from either end of the cataloged protein structure before converting it from a knot to a different type of knot. In addition, the database presents extensive information about the biological functions, families and fold types of proteins with non-trivial knotting. As an additional feature, the KnotProt database enables users to submit protein or polymer chains and generate their knotting fingerprints. PMID:25361973
Cytoscape: a software environment for integrated models of biomolecular interaction networks.
Shannon, Paul; Markiel, Andrew; Ozier, Owen; Baliga, Nitin S; Wang, Jonathan T; Ramage, Daniel; Amin, Nada; Schwikowski, Benno; Ideker, Trey
2003-11-01
Cytoscape is an open source software project for integrating biomolecular interaction networks with high-throughput expression data and other molecular states into a unified conceptual framework. Although applicable to any system of molecular components and interactions, Cytoscape is most powerful when used in conjunction with large databases of protein-protein, protein-DNA, and genetic interactions that are increasingly available for humans and model organisms. Cytoscape's software Core provides basic functionality to layout and query the network; to visually integrate the network with expression profiles, phenotypes, and other molecular states; and to link the network to databases of functional annotations. The Core is extensible through a straightforward plug-in architecture, allowing rapid development of additional computational analyses and features. Several case studies of Cytoscape plug-ins are surveyed, including a search for interaction pathways correlating with changes in gene expression, a study of protein complexes involved in cellular recovery to DNA damage, inference of a combined physical/functional interaction network for Halobacterium, and an interface to detailed stochastic/kinetic gene regulatory models.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Gaponov, Yu.A.; Igarashi, N.; Hiraki, M.
2004-05-12
An integrated controlling system and a unified database for high throughput protein crystallography experiments have been developed. Main features of protein crystallography experiments (purification, crystallization, crystal harvesting, data collection, data processing) were integrated into the software under development. All information necessary to perform protein crystallography experiments is stored (except raw X-ray data that are stored in a central data server) in a MySQL relational database. The database contains four mutually linked hierarchical trees describing protein crystals, data collection of protein crystal and experimental data processing. A database editor was designed and developed. The editor supports basic database functions to view, create, modify and delete user records in the database. Two search engines were realized: direct search of necessary information in the database and object oriented search. The system is based on TCP/IP secure UNIX sockets with four predefined sending and receiving behaviors, which support communications between all connected servers and clients with remote control functions (creating and modifying data for experimental conditions, data acquisition, viewing experimental data, and performing data processing). Two secure login schemes were designed and developed: a direct method (using the developed Linux clients with secure connection) and an indirect method (using the secure SSL connection using secure X11 support from any operating system with X-terminal and SSH support). A part of the system has been implemented on a new MAD beam line, NW12, at the Photon Factory Advanced Ring for general user experiments.
Wimmer, Helge; Gundacker, Nina C; Griss, Johannes; Haudek, Verena J; Stättner, Stefan; Mohr, Thomas; Zwickl, Hannes; Paulitschke, Verena; Baron, David M; Trittner, Wolfgang; Kubicek, Markus; Bayer, Editha; Slany, Astrid; Gerner, Christopher
2009-06-01
Interpretation of proteome data with a focus on biomarker discovery largely relies on comparative proteome analyses. Here, we introduce a database-assisted interpretation strategy based on proteome profiles of primary cells. Both 2-D-PAGE and shotgun proteomics are applied. We obtain high data concordance with these two different techniques. When applying mass analysis of tryptic spot digests from 2-D gels of cytoplasmic fractions, we typically identify several hundred proteins. Using the same protein fractions, we usually identify more than a thousand proteins by shotgun proteomics. The data consistency obtained when comparing these independent data sets exceeds 99% of the proteins identified in the 2-D gels. Many characteristic differences in protein expression of different cells can thus be independently confirmed. Our self-designed SQL database (CPL/MUW - database of the Clinical Proteomics Laboratories at the Medical University of Vienna accessible via www.meduniwien.ac.at/proteomics/database) facilitates (i) quality management of protein identification data, which are based on MS, (ii) the detection of cell type-specific proteins and (iii) of molecular signatures of specific functional cell states. Here, we demonstrate how the interpretation of proteome profiles obtained from human liver tissue and hepatocellular carcinoma tissue is assisted by the CPL/MUW database. Therefore, we suggest that the use of reference experiments supported by a tailored database may substantially facilitate data interpretation of proteome profiling experiments.
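The concordance figure quoted above is expressed relative to the proteins identified in the 2-D gels. A minimal sketch of that calculation is given below, assuming each data set is reduced to a collection of protein identifiers; the function name and the identifiers are hypothetical, and the authors' actual matching criteria are not described in the abstract.

```python
def concordance(gel_ids, shotgun_ids):
    """Fraction of proteins identified in the 2-D gels that were also
    identified by shotgun proteomics (duplicates are ignored)."""
    gel = set(gel_ids)
    if not gel:
        raise ValueError("no 2-D gel identifications supplied")
    return len(gel & set(shotgun_ids)) / len(gel)


# Hypothetical identifiers: three of the four gel proteins are confirmed.
fraction = concordance(["P1", "P2", "P3", "P4"], ["P1", "P2", "P3", "P9", "P10"])
```

Using the gel set as the denominator makes the measure asymmetric, which suits the setting described, where shotgun proteomics identifies many more proteins than the gels do.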
PpTFDB: A pigeonpea transcription factor database for exploring functional genomics in legumes
Singh, Akshay; Sharma, Ajay Kumar; Singh, Nagendra Kumar
2017-01-01
Pigeonpea (Cajanus cajan L.), a diploid legume crop, is a member of the tribe Phaseoleae. This tribe is descended from the millettioid (tropical) clade of the subfamily Papilionoideae, which includes many important legume crop species such as soybean (Glycine max), mung bean (Vigna radiata), cowpea (Vigna unguiculata), and common bean (Phaseolus vulgaris). Pigeonpea plays a major role in food and nutritional security, being a rich source of proteins, minerals and vitamins. We have developed a comprehensive Pigeonpea Transcription Factors Database (PpTFDB) that encompasses information about 1829 putative transcription factors (TFs) and their 55 TF families. PpTFDB provides comprehensive information about each of the identified TFs, including chromosomal location, protein physicochemical properties, sequence data, protein functional annotation, simple sequence repeats (SSRs) with primers derived from their motifs, orthology with related legume crops, and gene ontology (GO) assignment to respective TFs. PpTFDB (http://14.139.229.199/PpTFDB/Home.aspx) is a freely available and user-friendly web resource that allows users to retrieve the information of individual members of a TF family through a set of query interfaces including TF ID or protein functional annotation. In addition, users can also get the information through browsing interfaces, which include browsing by TF Categories and by GO Categories. PpTFDB will serve as a promising central resource for researchers as well as breeders who are working towards the improvement of legume crops. PMID:28651001
D'Antonio, Matteo; Masseroli, Marco
2009-01-01
Background Alternative splicing has been demonstrated to affect most human genes; different isoforms from the same gene encode proteins which differ in a limited number of residues, thus yielding similar structures. This suggests possible correlations between alternative splicing and protein structure. In order to support the investigation of such relationships, we have developed the Alternative Splicing and Protein Structure Scrutinizer (PASS), a Web application to automatically extract, integrate and analyze human alternative splicing and protein structure data sparsely available in the Alternative Splicing Database, Ensembl databank and Protein Data Bank. Primary data from these databases have been integrated and analyzed using the Protein Identifier Cross-Reference, BLAST, CLUSTALW and FeatureMap3D software tools. Results A database has been developed to store the considered primary data and the results from their analysis; a system of Perl scripts has been implemented to automatically create and update the database and analyze the integrated data; a Web interface has been implemented to make the analyses easily accessible; a database has been created to manage user accesses to the PASS Web application and store users' data and searches. Conclusion PASS automatically integrates data from the Alternative Splicing Database with protein structure data from the Protein Data Bank. Additionally, it comprehensively analyzes the integrated data with publicly available well-known bioinformatics tools in order to generate structural information of isoform pairs. Further analysis of such valuable information might reveal interesting relationships between alternative splicing and protein structure differences, which may be significantly associated with different functions. PMID:19828075
PharmDB-K: Integrated Bio-Pharmacological Network Database for Traditional Korean Medicine
Lee, Ji-Hyun; Park, Kyoung Mii; Han, Dong-Jin; Bang, Nam Young; Kim, Do-Hee; Na, Hyeongjin; Lim, Semi; Kim, Tae Bum; Kim, Dae Gyu; Kim, Hyun-Jung; Chung, Yeonseok; Sung, Sang Hyun; Surh, Young-Joon; Kim, Sunghoon; Han, Byung Woo
2015-01-01
Despite the growing attention given to Traditional Medicine (TM) worldwide, there is no well-known, publicly available, integrated bio-pharmacological Traditional Korean Medicine (TKM) database for researchers in drug discovery. In this study, we have constructed PharmDB-K, which offers comprehensive information relating to TKM-associated drugs (compound), disease indication, and protein relationships. To explore the underlying molecular interaction of TKM, we integrated fourteen different databases, six Pharmacopoeias, and literature, and established a massive bio-pharmacological network for TKM and experimentally validated some cases predicted from the PharmDB-K analyses. Currently, PharmDB-K contains information about 262 TKMs, 7,815 drugs, 3,721 diseases, 32,373 proteins, and 1,887 side effects. One of the unique sets of information in PharmDB-K includes 400 indicator compounds used for standardization of herbal medicine. Furthermore, we are operating PharmDB-K via phExplorer (a network visualization software) and BioMart (a data federation framework) for convenient search and analysis of the TKM network. Database URL: http://pharmdb-k.org, http://biomart.i-pharm.org. PMID:26555441
Benson, Dennis A; Karsch-Mizrachi, Ilene; Lipman, David J; Ostell, James; Sayers, Eric W
2011-01-01
GenBank® is a comprehensive database that contains publicly available nucleotide sequences for more than 380,000 organisms named at the genus level or lower, obtained primarily through submissions from individual laboratories and batch submissions from large-scale sequencing projects, including whole genome shotgun (WGS) and environmental sampling projects. Most submissions are made using the web-based BankIt or standalone Sequin programs, and accession numbers are assigned by GenBank staff upon receipt. Daily data exchange with the European Nucleotide Archive (ENA) and the DNA Data Bank of Japan (DDBJ) ensures worldwide coverage. GenBank is accessible through the NCBI Entrez retrieval system that integrates data from the major DNA and protein sequence databases along with taxonomy, genome, mapping, protein structure and domain information, and the biomedical journal literature via PubMed. BLAST provides sequence similarity searches of GenBank and other sequence databases. Complete bimonthly releases and daily updates of the GenBank database are available by FTP. To access GenBank and its related retrieval and analysis services, begin at the NCBI Homepage: www.ncbi.nlm.nih.gov.
ZifBASE: a database of zinc finger proteins and associated resources.
Jayakanthan, Mannu; Muthukumaran, Jayaraman; Chandrasekar, Sanniyasi; Chawla, Konika; Punetha, Ankita; Sundar, Durai
2009-09-09
Information on the occurrence of zinc finger protein motifs in genomes is crucial to the developing field of molecular genome engineering. The knowledge of their target DNA-binding sequences is vital to develop chimeric proteins for targeted genome engineering and site-specific gene correction. There is a need to develop a computational resource of zinc finger proteins (ZFP) to identify the potential binding sites and their locations, which would reduce the time required for in vivo work and overcome the difficulties in selecting the specific type of zinc finger protein and the target site in the DNA sequence. ZifBASE provides an extensive collection of various natural and engineered ZFP. It uses standard names and a genetic and structural classification scheme to present data retrieved from UniProtKB, GenBank, Protein Data Bank, ModBase, Protein Model Portal and the literature. It also incorporates specialized features of ZFP including finger sequences and positions, number of fingers, physicochemical properties, classes, framework, PubMed citations with links to experimental structures (PDB, if available) and modeled structures of natural zinc finger proteins. ZifBASE provides information on zinc finger proteins (both natural and engineered ones), the number of finger units in each of the zinc finger proteins (with multiple fingers), the synergy between the adjacent fingers and their positions. Additionally, it gives the individual finger sequences and the target DNA sites to which they bind, for a better and clearer understanding of the interactions of adjacent fingers. The current version of ZifBASE contains 139 entries of which 89 are engineered ZFPs, containing 3-7F totaling to 296 fingers. There are 50 natural zinc finger protein entries ranging from 2-13F, totaling to 307 fingers. It has sequences and structures from literature, Protein Data Bank, ModBase and Protein Model Portal. 
The interface is cross-linked to other public databases such as UniProtKB, PDB, ModBase, Protein Model Portal and PubMed to make it more informative. A database is established to maintain the information of the sequence features, including the class, framework, number of fingers, residues, position, recognition site and physicochemical properties (molecular weight, isoelectric point) of both natural and engineered zinc finger proteins, and the dissociation constants of a few. ZifBASE can provide a more effective and efficient way of accessing the zinc finger protein sequences and their target binding sites with links to their three-dimensional structures. All the data and functions are available at the advanced web-based search interface http://web.iitd.ac.in/~sundar/zifbase.
APPRIS 2017: principal isoforms for multiple gene sets
Rodriguez-Rivas, Juan; Di Domenico, Tomás; Vázquez, Jesús; Valencia, Alfonso
2018-01-01
Abstract The APPRIS database (http://appris-tools.org) uses protein structural and functional features and information from cross-species conservation to annotate splice isoforms in protein-coding genes. APPRIS selects a single protein isoform, the ‘principal’ isoform, as the reference for each gene based on these annotations. A single main splice isoform reflects the biological reality for most protein coding genes and APPRIS principal isoforms are the best predictors of these main protein isoforms. Here, we present the updates to the database, new developments that include the addition of three new species (chimpanzee, Drosophila melanogaster and Caenorhabditis elegans), the expansion of APPRIS to cover the RefSeq gene set and the UniProtKB proteome for six species and refinements in the core methods that make up the annotation pipeline. In addition, APPRIS now provides a measure of reliability for individual principal isoforms and updates with each release of the GENCODE/Ensembl and RefSeq reference sets. The individual GENCODE/Ensembl, RefSeq and UniProtKB reference gene sets for six organisms have been merged to produce common sets of splice variants. PMID:29069475
Welker, F
2018-02-20
The study of ancient protein sequences is increasingly focused on the analysis of older samples, including those of ancient hominins. The analysis of such ancient proteomes thereby potentially suffers from "cross-species proteomic effects": the loss of peptide and protein identifications at increased evolutionary distances due to a larger number of protein sequence differences between the database sequence and the analyzed organism. Error-tolerant proteomic search algorithms should theoretically overcome this problem at both the peptide and protein level; however, this has not been demonstrated. If error-tolerant searches do not overcome the cross-species proteomic issue then there might be inherent biases in the identified proteomes. Here, a bioinformatics experiment is performed to test this using a set of modern human bone proteomes and three independent searches against sequence databases at increasing evolutionary distances: the human (0 Ma), chimpanzee (6-8 Ma) and orangutan (16-17 Ma) reference proteomes, respectively. Incorrectly suggested amino acid substitutions are absent when employing adequate filtering criteria for mutable Peptide Spectrum Matches (PSMs), but roughly half of the mutable PSMs were not recovered. As a result, peptide and protein identification rates are higher in error-tolerant mode compared to non-error-tolerant searches but did not recover protein identifications completely. Data indicates that peptide length and the number of mutations between the target and database sequences are the main factors influencing mutable PSM identification. The error-tolerant results suggest that the cross-species proteomics problem is not overcome at increasing evolutionary distances, even at the protein level. Peptide and protein loss has the potential to significantly impact divergence dating and proteome comparisons when using ancient samples as there is a bias towards the identification of conserved sequences and proteins. 
Effects are minimized between moderately divergent proteomes, as indicated by almost complete recovery of informative positions in the search against the chimpanzee proteome (≈90%, 6-8 Ma). This provides a bioinformatic background to future phylogenetic and proteomic analysis of ancient hominin proteomes, including the future description of novel hominin amino acid sequences, but also has negative implications for the study of fast-evolving proteins in hominins, non-hominin animals, and ancient bacterial proteins in evolutionary contexts.
sc-PDB: a 3D-database of ligandable binding sites—10 years on
Desaphy, Jérémy; Bret, Guillaume; Rognan, Didier; Kellenberger, Esther
2015-01-01
The sc-PDB database (available at http://bioinfo-pharma.u-strasbg.fr/scPDB/) is a comprehensive and up-to-date selection of ligandable binding sites of the Protein Data Bank. Sites are defined from complexes between a protein and a pharmacological ligand. The database provides the all-atom description of the protein, its ligand, their binding site and their binding mode. Currently, the sc-PDB archive registers 9283 binding sites from 3678 unique proteins and 5608 unique ligands. The sc-PDB database was publicly launched in 2004 with the aim of providing structure files suitable for computational approaches to drug design, such as docking. During the last 10 years we have improved and standardized the processes for (i) identifying binding sites, (ii) correcting structures, (iii) annotating protein function and ligand properties and (iv) characterizing their binding mode. This paper presents the latest enhancements in the database, specifically pertaining to the representation of molecular interaction and to the similarity between ligand/protein binding patterns. The new website puts emphasis in pictorial analysis of data. PMID:25300483
NASA Astrophysics Data System (ADS)
Sabbatini, Anna; Bédouet, Laurent; Marie, Arul; Bartolini, Annachiara; Landemarre, Ludovic; Weber, Michele; Ngurah Kade Mahardika, Gusti; Berland, Sophie; Zito, Francesca; Vénec-Peyré, Marie-Thérèse
2016-04-01
Most foraminifera that produce a shell are efficient biomineralizers. They contribute to the global carbon cycle, and thus influence ocean-climate regulation. Calcification in foraminifera is likely biologically controlled and is potentially similar to shell formation in metazoan taxa (e.g. mollusks, corals, sea urchins). However, foraminiferal biomineralization processes and the molecules involved are still poorly understood. We analyzed the calcitic shell of the large tropical benthic foraminifer Schlumbergerella floresiana. We found a suite of macromolecules containing many charged and polar amino acids and glycine that are also abundant in biomineralization proteins of other phyla. As neither genomic nor transcriptomic data are available for foraminiferal biomineralization yet, de novo-generated sequences, obtained from organic matrices submitted to MS BLAST database search, led to the characterization of 156 peptides. Very few homologous proteins were matched in the proteomic database, implying that the peptides are derived from unknown proteins present in the foraminiferal organic matrices. The amino acid distribution of these peptides was queried against the UNIPROT database and the mollusk UNIPROT database for comparison. The mollusks compose a well-studied phylum that yield a large variety of biomineralization proteins. These results showed that proteins extracted from S. floresiana shells contained sequences enriched with glycine, alanine, and proline, making a set of residues that provided a signature unique to foraminifera. Three of the de novo peptides exhibited sequence similarities to peptides found in proteins such as pre-collagen-P and a group of P-type ATPases including a calcium-transporting ATPase. Surprisingly, the peptide that was most similar to the collagen-like protein was a glycine-rich peptide reported from the test and spine proteome of sea urchin. 
The molecules, identified by matrix-assisted laser desorption ionization-time of flight mass spectrometry analyses, included acid-soluble N-glycoproteins with its sugar moieties represented by high-mannose-type glycans and carbohydrates. Describing the nature of the proteins, and associated molecules in the skeletal structure of living foraminifera, can elucidate the biomineralization mechanisms of these major carbonate producers in marine ecosystems. Foraminifera constitute an important tool used for paleo-environmental reconstructions because of their nearly continuous fossil record and abundance. Many studies focus on their biomineralization process using a geochemical perspective to record environmental and climate changes from shell isotopic and trace element compositions. Our results are a first step toward understanding the functioning mechanism behind biomineralization and the molecules involved. Coupling geochemical and biological perspectives will enhance interpretation of the proxies used for climatic reconstructions and improve future modeling efforts.
2010-01-01
Background Papaver somniferum (opium poppy) is the source for several pharmaceutical benzylisoquinoline alkaloids including morphine, codeine and sanguinarine. In response to treatment with a fungal elicitor, the biosynthesis and accumulation of sanguinarine is induced along with other plant defense responses in opium poppy cell cultures. The transcriptional induction of alkaloid metabolism in cultured cells provides an opportunity to identify components of this process via the integration of deep transcriptome and proteome databases generated using next-generation technologies. Results A cDNA library was prepared for opium poppy cell cultures treated with a fungal elicitor for 10 h. Using 454 GS-FLX Titanium pyrosequencing, 427,369 expressed sequence tags (ESTs) with an average length of 462 bp were generated. Assembly of these sequences yielded 93,723 unigenes, of which 23,753 were assigned Gene Ontology annotations. Transcripts encoding all known sanguinarine biosynthetic enzymes were identified in the EST database, 5 of which were represented among the 50 most abundant transcripts. Liquid chromatography-tandem mass spectrometry (LC-MS/MS) of total protein extracts from cell cultures treated with a fungal elicitor for 50 h facilitated the identification of 1,004 proteins. Proteins were fractionated by one-dimensional SDS-PAGE and digested with trypsin prior to LC-MS/MS analysis. Query of an opium poppy-specific EST database substantially enhanced peptide identification. Eight out of 10 known sanguinarine biosynthetic enzymes and many relevant primary metabolic enzymes were represented in the peptide database. Conclusions The integration of deep transcriptome and proteome analyses provides an effective platform to catalogue the components of secondary metabolism, and to identify genes encoding uncharacterized enzymes. 
The establishment of corresponding transcript and protein databases generated by next-generation technologies in a system with a well-defined metabolite profile facilitates an improved linkage between genes, enzymes, and pathway components. The proteome database represents the most relevant alkaloid-producing enzymes, compared with the much deeper and more complete transcriptome library. The transcript database contained full-length mRNAs encoding most alkaloid biosynthetic enzymes, which is a key requirement for the functional characterization of novel gene candidates. PMID:21083930
SASD: the Synthetic Alternative Splicing Database for identifying novel isoform from proteomics
2013-01-01
Background Alternative splicing is an important and widespread mechanism for generating protein diversity and regulating protein expression. High-throughput identification and analysis of alternative splicing at the protein level has more advantages than at the mRNA level. The combination of an alternative splicing database and tandem mass spectrometry provides a powerful technique for identification, analysis and characterization of potential novel alternative splicing protein isoforms from proteomics. Therefore, based on the peptidomic database of human protein isoforms for proteomics experiments, our objective is to design a new alternative splicing database to 1) provide more coverage of genes, transcripts and alternative splicing, 2) exclusively focus on alternative splicing, and 3) perform context-specific alternative splicing analysis. Results We used a three-step pipeline to create a synthetic alternative splicing database (SASD) to identify novel alternative splicing isoforms and interpret them in the context of pathway, disease, drug and organ specificity or custom gene set with maximum coverage and exclusive focus on alternative splicing. First, we extracted information on gene structures of all genes in the Ensembl Genes 71 database and incorporated the Integrated Pathway Analysis Database. Then, we compiled artificial splicing transcripts. Lastly, we translated the artificial transcripts into alternative splicing peptides. The SASD is a comprehensive database containing 56,630 genes (Ensembl gene IDs), 95,260 transcripts (Ensembl transcript IDs), and 11,919,779 Alternative Splicing peptides, and also covering about 1,956 pathways, 6,704 diseases, 5,615 drugs, and 52 organs. The database has a web-based user interface that allows users to search, display and download alternative splicing related to a single gene/transcript/protein, custom gene set, pathway, disease, drug or organ. 
Moreover, the quality of the database was validated by comparison with other known databases and two case studies: 1) in liver cancer and 2) in breast cancer. Conclusions The SASD provides the scientific community with an efficient means to identify, analyze, and characterize novel Exon Skipping and Intron Retention protein isoforms from mass spectrometry and interpret them in the context of pathway, disease, drug and organ specificity or custom gene set with maximum coverage and exclusive focus on alternative splicing. PMID:24267658
Why are they missing? : Bioinformatics characterization of missing human proteins.
Elguoshy, Amr; Magdeldin, Sameh; Xu, Bo; Hirao, Yoshitoshi; Zhang, Ying; Kinoshita, Naohiko; Takisawa, Yusuke; Nameta, Masaaki; Yamamoto, Keiko; El-Refy, Ali; El-Fiky, Fawzy; Yamamoto, Tadashi
2016-10-21
NeXtProt is a web-based protein knowledge platform that supports research on human proteins. NeXtProt (release 2015-04-28) lists 20,060 proteins; among them, 3373 canonical proteins (16.8%) lack credible experimental evidence at the protein level (PE2:PE5) and are therefore considered "missing proteins". A comprehensive bioinformatic workflow has been proposed to analyze these "missing" proteins. The aims of the current study were to analyze the physicochemical properties and the existence and distribution of tryptic cleavage sites, and to pinpoint the signature peptides of the missing proteins. Our findings showed that 23.7% of missing proteins were hydrophobic proteins possessing transmembrane domains (TMD). Also, forty missing entries generated tryptic peptides that were either out of the mass detection range (>30aa) or mapped to different proteins (<9aa). Additionally, 21% of missing entries did not generate any unique tryptic peptides. An in silico endopeptidase combination strategy increased the possibility of identifying missing proteins. Accordingly, using both a mature protein database and a signal peptidome database could be a promising option to identify some missing proteins by targeting their unique N-terminal tryptic peptide from the mature protein database and/or the C-terminal tryptic peptide from the signal peptidome database. In conclusion, identification of missing proteins requires additional consideration during sample preparation, extraction, digestion and data analysis to increase the rate of identification. Copyright © 2016. Published by Elsevier B.V.
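The in silico tryptic digestion described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' workflow: it assumes the common simplified trypsin rule (cleave after K or R unless the next residue is P, no missed cleavages), and it reuses only the length window quoted in the abstract (peptides shorter than 9 aa or longer than 30 aa are treated as unsuitable). The function names and the example sequence are hypothetical.

```python
def tryptic_peptides(sequence):
    """Split a protein sequence with a simplified trypsin rule.

    Assumption: cleavage occurs after K or R unless the next residue is P,
    and no missed cleavages are modelled.
    """
    peptides, start = [], 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        # Trailing fragment that does not end in K or R
        peptides.append(sequence[start:])
    return peptides


def detectable_peptides(sequence, min_len=9, max_len=30):
    """Keep peptides inside the length window quoted in the abstract."""
    return [p for p in tryptic_peptides(sequence) if min_len <= len(p) <= max_len]


if __name__ == "__main__":
    # Invented toy sequence: one 10-residue peptide, one 2-residue peptide,
    # and one 32-residue peptide.
    toy = "M" + "A" * 8 + "K" + "GK" + "C" * 31 + "R"
    print(tryptic_peptides(toy))
    print(detectable_peptides(toy))
```

A protein for which `detectable_peptides` returns an empty list would correspond, under these assumptions, to an entry whose tryptic peptides all fall outside the usable range; checking uniqueness against other proteins would require a further comparison across the whole proteome.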
Martone, Maryann E.; Tran, Joshua; Wong, Willy W.; Sargis, Joy; Fong, Lisa; Larson, Stephen; Lamont, Stephan P.; Gupta, Amarnath; Ellisman, Mark H.
2008-01-01
Databases have become integral parts of data management, dissemination and mining in biology. At the Second Annual Conference on Electron Tomography, held in Amsterdam in 2001, we proposed that electron tomography data should be shared in a manner analogous to structural data at the protein and sequence scales. At that time, we outlined our progress in creating a database to bring together cell level imaging data across scales, The Cell Centered Database (CCDB). The CCDB was formally launched in 2002 as an on-line repository of high-resolution 3D light and electron microscopic reconstructions of cells and subcellular structures. It contains 2D, 3D and 4D structural and protein distribution information from confocal, multiphoton and electron microscopy, including correlated light and electron microscopy. Many of the data sets are derived from electron tomography of cells and tissues. In the five years since its debut, we have moved the CCDB from a prototype to a stable resource and expanded the scope of the project to include data management and knowledge engineering. Here we provide an update on the CCDB and how it is used by the scientific community. We also describe our work in developing additional knowledge tools, e.g., ontologies, for annotation and query of electron microscopic data. PMID:18054501
SORTEZ: a relational translator for NCBI's ASN.1 database.
Hart, K W; Searls, D B; Overton, G C
1994-07-01
The National Center for Biotechnology Information (NCBI) has created a database collection that includes several protein and nucleic acid sequence databases, a biosequence-specific subset of MEDLINE, as well as value-added information such as links between similar sequences. Information in the NCBI database is modeled in Abstract Syntax Notation 1 (ASN.1) an Open Systems Interconnection protocol designed for the purpose of exchanging structured data between software applications rather than as a data model for database systems. While the NCBI database is distributed with an easy-to-use information retrieval system, ENTREZ, the ASN.1 data model currently lacks an ad hoc query language for general-purpose data access. For that reason, we have developed a software package, SORTEZ, that transforms the ASN.1 database (or other databases with nested data structures) to a relational data model and subsequently to a relational database management system (Sybase) where information can be accessed through the relational query language, SQL. Because the need to transform data from one data model and schema to another arises naturally in several important contexts, including efficient execution of specific applications, access to multiple databases and adaptation to database evolution this work also serves as a practical study of the issues involved in the various stages of database transformation. We show that transformation from the ASN.1 data model to a relational data model can be largely automated, but that schema transformation and data conversion require considerable domain expertise and would greatly benefit from additional support tools.
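The idea of mapping a nested data model onto relational tables can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not SORTEZ itself: the nested record is an invented stand-in for an ASN.1 value, each nested field becomes a child table linked by a `parent_id` column, and SQLite (Python's built-in `sqlite3` module) is swapped in for the Sybase system named in the abstract. All table, column and function names are hypothetical.

```python
import sqlite3


def flatten(record, table, parent_id, rows, counter):
    """Turn one nested dict into flat rows grouped by table name."""
    row_id = counter[0]
    counter[0] += 1
    row = {"id": row_id, "parent_id": parent_id}
    for key, value in record.items():
        if isinstance(value, dict):
            # Nested structure: becomes a row in a child table
            flatten(value, f"{table}_{key}", row_id, rows, counter)
        elif isinstance(value, list):
            # Repeated field: one child row per element
            for item in value:
                child = item if isinstance(item, dict) else {"value": item}
                flatten(child, f"{table}_{key}", row_id, rows, counter)
        else:
            row[key] = value
    rows.setdefault(table, []).append(row)
    return row_id


def load(rows, conn):
    """Create one table per collected row group and insert the rows."""
    for table, table_rows in rows.items():
        columns = sorted({c for r in table_rows for c in r})
        col_sql = ", ".join(f'"{c}"' for c in columns)
        conn.execute(f'CREATE TABLE "{table}" ({col_sql})')
        placeholders = ", ".join("?" for _ in columns)
        conn.executemany(
            f'INSERT INTO "{table}" VALUES ({placeholders})',
            [tuple(r.get(c) for c in columns) for r in table_rows],
        )


if __name__ == "__main__":
    # Invented example record, not real NCBI data
    entry = {
        "accession": "X00001",
        "seq": {"length": 120, "mol": "dna"},
        "keywords": ["kinase", "human"],
    }
    rows = {}
    flatten(entry, "entry", None, rows, [1])
    conn = sqlite3.connect(":memory:")
    load(rows, conn)
    query = (
        "SELECT e.accession, k.value FROM entry e "
        "JOIN entry_keywords k ON k.parent_id = e.id ORDER BY k.value"
    )
    print(conn.execute(query).fetchall())
```

The mechanical part (walking the nested value and emitting rows) is easy to automate, which is consistent with the abstract's observation; choosing good table boundaries, keys and column types is the part that needs domain knowledge and is deliberately left naive here.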
Solving the Problem: Genome Annotation Standards before the Data Deluge.
Klimke, William; O'Donovan, Claire; White, Owen; Brister, J Rodney; Clark, Karen; Fedorov, Boris; Mizrachi, Ilene; Pruitt, Kim D; Tatusova, Tatiana
2011-10-15
The promise of genome sequencing was that the vast undiscovered country would be mapped out by comparison of the multitude of sequences available and would aid researchers in deciphering the role of each gene in every organism. Researchers recognize that there is a need for high quality data. However, different annotation procedures, numerous databases, and a diminishing percentage of experimentally determined gene functions have resulted in a spectrum of annotation quality. NCBI in collaboration with sequencing centers, archival databases, and researchers, has developed the first international annotation standards, a fundamental step in ensuring that high quality complete prokaryotic genomes are available as gold standard references. Highlights include the development of annotation assessment tools, community acceptance of protein naming standards, comparison of annotation resources to provide consistent annotation, and improved tracking of the evidence used to generate a particular annotation. The development of a set of minimal standards, including the requirement for annotated complete prokaryotic genomes to contain a full set of ribosomal RNAs, transfer RNAs, and proteins encoding core conserved functions, is an historic milestone. The use of these standards in existing genomes and future submissions will increase the quality of databases, enabling researchers to make accurate biological discoveries. PMID:22180819
ProteinWorldDB: querying radical pairwise alignments among protein sets from complete genomes
Otto, Thomas Dan; Catanho, Marcos; Tristão, Cristian; Bezerra, Márcia; Fernandes, Renan Mathias; Elias, Guilherme Steinberger; Scaglia, Alexandre Capeletto; Bovermann, Bill; Berstis, Viktors; Lifschitz, Sergio; de Miranda, Antonio Basílio; Degrave, Wim
2010-01-01
Motivation: Many analyses in modern biological research are based on comparisons between biological sequences, resulting in functional, evolutionary and structural inferences. When large numbers of sequences are compared, heuristics are often used resulting in a certain lack of accuracy. In order to improve and validate results of such comparisons, we have performed radical all-against-all comparisons of 4 million protein sequences belonging to the RefSeq database, using an implementation of the Smith–Waterman algorithm. This extremely intensive computational approach was made possible with the help of World Community Grid™, through the Genome Comparison Project. The resulting database, ProteinWorldDB, which contains coordinates of pairwise protein alignments and their respective scores, is now made available. Users can download, compare and analyze the results, filtered by genomes, protein functions or clusters. ProteinWorldDB is integrated with annotations derived from Swiss-Prot, Pfam, KEGG, NCBI Taxonomy database and gene ontology. The database is a unique and valuable asset, representing a major effort to create a reliable and consistent dataset of cross-comparisons of the whole protein content encoded in hundreds of completely sequenced genomes using a rigorous dynamic programming approach. Availability: The database can be accessed through http://proteinworlddb.org Contact: otto@fiocruz.br PMID:20089515
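For readers unfamiliar with the Smith–Waterman algorithm mentioned in this abstract, the following Python sketch computes the best local alignment score by dynamic programming. It is a simplified illustration, not the ProteinWorldDB implementation: it assumes a flat match/mismatch score instead of an amino acid substitution matrix and a linear gap penalty instead of an affine one, and the parameter values are arbitrary.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b.

    Assumptions: flat match/mismatch scoring and a linear gap penalty.
    Only two rows of the dynamic programming matrix are kept in memory.
    """
    cols = len(b) + 1
    prev = [0] * cols
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * cols
        for j in range(1, cols):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor of 0
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    print(smith_waterman_score("XXACGTXX", "YYACGTYY"))
```

The cost is proportional to the product of the two sequence lengths for every pair, which is why an exhaustive all-against-all comparison of millions of sequences needed a distributed computing grid rather than a heuristic shortcut.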
Proteome-wide Subcellular Topologies of E. coli Polypeptides Database (STEPdb)*
Orfanoudaki, Georgia; Economou, Anastassios
2014-01-01
Cell compartmentalization serves both the isolation and the specialization of cell functions. After synthesis in the cytoplasm, over a third of all proteins are targeted to other subcellular compartments. Knowing how proteins are distributed within the cell and how they interact is a prerequisite for understanding it as a whole. Surface and secreted proteins are important pathogenicity determinants. Here we present the STEP database (STEPdb) that contains a comprehensive characterization of subcellular localization and topology of the complete proteome of Escherichia coli. Two widely used E. coli proteomes (K-12 and BL21) are presented organized into thirteen subcellular classes. STEPdb exploits the wealth of genetic, proteomic, biochemical, and functional information on protein localization, secretion, and targeting in E. coli, one of the best understood model organisms. Subcellular annotations were derived from a combination of bioinformatics prediction, proteomic, biochemical, functional, topological data and extensive literature re-examination that were refined through manual curation. Strong experimental support for the location of 1553 out of 4303 proteins was based on 426 articles and some experimental indications for another 526. Annotations were provided for another 320 proteins based on firm bioinformatic predictions. STEPdb is the first database that contains an extensive set of peripheral IM proteins (PIM proteins) and includes their graphical visualization into complexes, cellular functions, and interactions. It also summarizes all currently known protein export machineries of E. coli K-12 and pairs them, where available, with the secretory proteins that use them. It catalogs the Sec- and TAT-utilizing secretomes and summarizes their topological features such as signal peptides and transmembrane regions, transmembrane topologies and orientations. 
It also catalogs physicochemical and structural features that influence topology such as abundance, solubility, disorder, heat resistance, and structural domain families. Finally, STEPdb incorporates prediction tools for topology (TMHMM, SignalP, and Phobius) and disorder (IUPred) and implements the BLAST2STEP that performs protein homology searches against the STEPdb. PMID:25210196
Algorithms for database-dependent search of MS/MS data.
Matthiesen, Rune
2013-01-01
The frequently used bottom-up strategy for identifying proteins and their associated modifications now typically generates thousands of MS/MS spectra, which are normally matched automatically against a protein sequence database. Search engines that take MS/MS spectra and a protein sequence database as input are referred to as database-dependent search engines. Many programs, both commercial and freely available, exist for database-dependent search of MS/MS spectra, and most of them have excellent user documentation. The aim here is therefore to outline the algorithmic strategy behind different search engines rather than to provide software user manuals. The process of database-dependent search can be divided into search strategy, peptide scoring, protein scoring, and finally protein inference. Most efforts in the literature have been put into comparing results from different software rather than discussing the underlying algorithms. Such practical comparisons can be cluttered by suboptimal implementations, and the observed differences are frequently caused by software parameter settings that have not been set properly to allow a fair comparison. In other words, an algorithmic idea can still be worth considering even if the software implementation has been demonstrated to be suboptimal. The aim of this chapter is therefore to split the algorithms for database-dependent searching of MS/MS data into the above steps so that the different algorithmic ideas become more transparent and comparable. Most search engines provide good implementations of the first three data analysis steps mentioned above, whereas the final step of protein inference is much less developed in most search engines and is in many cases performed by external software. The final part of this chapter illustrates how protein inference is built into the VEMS search engine and discusses a stand-alone program, SIR, for protein inference that can import a Mascot search result.
Takashima, S
2001-04-05
The large dipole moment of globular proteins is well known from detailed studies using dielectric relaxation and electro-optical methods. The search for the origin of these dipole moments, however, must be based on detailed knowledge of protein structure at atomic resolution. At present, we have two sources of information on the structure of protein molecules: (1) x-ray databases obtained in the crystalline state; (2) NMR databases obtained in the solution state. While x-ray databases provide only one model per structure, NMR databases, because of the fluctuation of protein folding in solution, provide a number of models, so the dipole moment computation can be repeated for all of them. The aim of this work, using these databases, is a detailed investigation of the interdependence between the structure and dipole moment of protein molecules. The dipole moment of protein molecules has roughly two components: one is due to surface charges and the other, the core dipole moment, is due to polar groups such as N-H and C=O bonds. The computation of the surface charge dipole moment consists of two steps: (A) calculation of the pK shifts of charged groups due to electrostatic interactions and (B) calculation of the dipole moment using the pK values corrected for electrostatic shifts. The dipole moments of several proteins were computed using both NMR and x-ray databases. The dipole moments from these two sets of calculations are, with a few exceptions, in good agreement with one another and also with measured dipole moments.
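The second step described above, computing a dipole moment from charges placed on a structure, can be sketched as a sum of charge times displacement. This is a minimal illustration, not the author's procedure: the pK-shift calculation is not modeled, the charges are assumed to be given already, the reference origin is assumed to be the geometric center of the supplied coordinates (the result depends on the origin when the net charge is nonzero), and both function names are hypothetical.

```python
import math

# 1 elementary charge x 1 angstrom expressed in debye (approximate value).
E_ANGSTROM_TO_DEBYE = 4.803

def dipole_moment(charges, coords):
    """Return (vector, magnitude) in debye for point charges.

    charges: charges in units of e; coords: (x, y, z) tuples in angstrom.
    Assumption: the origin is the geometric center of the coordinates.
    """
    n = len(charges)
    center = [sum(c[k] for c in coords) / n for k in range(3)]
    mu = [0.0, 0.0, 0.0]
    for q, r in zip(charges, coords):
        for k in range(3):
            mu[k] += q * (r[k] - center[k])
    mu = [E_ANGSTROM_TO_DEBYE * m for m in mu]
    return mu, math.sqrt(sum(m * m for m in mu))

def mean_dipole_over_models(models):
    """Average the dipole magnitude over several models of one protein,
    as could be done for the multiple models of an NMR entry.

    models: list of (charges, coords) pairs.
    """
    magnitudes = [dipole_moment(q, xyz)[1] for q, xyz in models]
    return sum(magnitudes) / len(magnitudes)
```

For a single x-ray model the first function is called once; for an NMR entry the second function repeats the calculation for every model and averages the magnitudes.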
LS-SNP: large-scale annotation of coding non-synonymous SNPs based on multiple information sources.
Karchin, Rachel; Diekhans, Mark; Kelly, Libusha; Thomas, Daryl J; Pieper, Ursula; Eswar, Narayanan; Haussler, David; Sali, Andrej
2005-06-15
The NCBI dbSNP database lists over 9 million single nucleotide polymorphisms (SNPs) in the human genome, but currently contains limited annotation information. SNPs that result in amino acid residue changes (nsSNPs) are of critical importance in variation between individuals, including disease and drug sensitivity. We have developed LS-SNP, a genomic-scale software pipeline to annotate nsSNPs. LS-SNP comprehensively maps nsSNPs onto protein sequences, functional pathways and comparative protein structure models, and predicts positions where nsSNPs destabilize proteins, interfere with the formation of domain-domain interfaces, have an effect on protein-ligand binding or severely impact human health. It currently annotates 28,043 validated SNPs that produce amino acid residue substitutions in human proteins from the SwissProt/TrEMBL database. Annotations can be viewed via a web interface either in the context of a genomic region or by selecting sets of SNPs, genes, proteins or pathways. These results are useful for identifying candidate functional SNPs within a gene, haplotype or pathway and in probing molecular mechanisms responsible for functional impacts of nsSNPs. Availability: http://www.salilab.org/LS-SNP Contact: rachelk@salilab.org Supplementary information: http://salilab.org/LS-SNP/supp-info.pdf
Hegedűs, Tamás; Chaubey, Pururawa Mayank; Várady, György; Szabó, Edit; Sarankó, Hajnalka; Hofstetter, Lia; Roschitzki, Bernd; Sarkadi, Balázs
2015-01-01
Based on recent results, the determination of the easily accessible red blood cell (RBC) membrane proteins may provide new diagnostic possibilities for assessing mutations, polymorphisms or regulatory alterations in diseases. However, the analysis of the current mass spectrometry-based proteomics datasets and other major databases indicates inconsistencies: the results show large scatter and only a limited overlap for the identified RBC membrane proteins. Here, we applied membrane-specific proteomics studies in human RBC, compared these results with the data in the literature, and generated a comprehensive and expandable database using all available data sources. The integrated web database now refers to proteomic, genetic and medical databases as well, and contains an unexpectedly large number of validated membrane proteins previously thought to be specific for other tissues and/or related to major human diseases. Since the determination of protein expression in RBC provides a method to indicate pathological alterations, our database should facilitate the development of RBC membrane biomarker platforms and provide a unique resource to aid related further research and diagnostics. Database URL: http://rbcc.hegelab.org PMID:26078478
Role for protein–protein interaction databases in human genetics
Pattin, Kristine A; Moore, Jason H
2010-01-01
Proteomics and the study of protein–protein interactions are becoming increasingly important in our effort to understand human diseases on a system-wide level. Thanks to the development and curation of protein-interaction databases, up-to-date information on these interaction networks is accessible and publicly available to the scientific community. As our knowledge of protein–protein interactions increases, it is important to give thought to the different ways that these resources can impact biomedical research. In this article, we highlight the importance of protein–protein interactions in human genetics and genetic epidemiology. Since protein–protein interactions demonstrate one of the strongest functional relationships between genes, combining genomic data with available proteomic data may provide us with a more in-depth understanding of common human diseases. In this review, we will discuss some of the fundamentals of protein interactions, the databases that are publicly available and how information from these databases can be used to facilitate genome-wide genetic studies. PMID:19929610
The Movable Type Method Applied to Protein-Ligand Binding.
Zheng, Zheng; Ucisik, Melek N; Merz, Kenneth M
2013-12-10
Accurately computing the free energy for biological processes like protein folding or protein-ligand association remains a challenging problem. Both describing the complex intermolecular forces involved and sampling the requisite configuration space make understanding these processes innately difficult. Herein, we address the sampling problem using a novel methodology we term "movable type". Conceptually it can be understood by analogy with the evolution of printing, hence the name movable type. For example, a common approach to the study of protein-ligand complexation involves taking a database of intact drug-like molecules and exhaustively docking them into a binding pocket. This is reminiscent of early woodblock printing, where each page had to be laboriously created prior to printing a book. However, printing evolved to an approach where a database of symbols (letters, numerals, etc.) was created and then assembled using a movable type system, which allowed for the creation of all possible combinations of symbols on a given page, thereby revolutionizing the dissemination of knowledge. Our movable type (MT) method involves the identification of all atom pairs seen in protein-ligand complexes and then creating two databases: one with their associated pairwise distance-dependent energies and another associated with the probability of how these pairs can combine in terms of bonds, angles, dihedrals and non-bonded interactions. Combining these two databases with the principles of statistical mechanics allows us to accurately estimate binding free energies as well as the pose of a ligand in a receptor. This method, by its mathematical construction, samples all of the configuration space of a selected region (the protein active site here) in one shot without resorting to brute force sampling schemes involving Monte Carlo, genetic algorithms or molecular dynamics simulations, making the methodology extremely efficient. 
Importantly, this method explores the free energy surface eliminating the need to estimate the enthalpy and entropy components individually. Finally, low free energy structures can be obtained via a free energy minimization procedure yielding all low free energy poses on a given free energy surface. Besides revolutionizing the protein-ligand docking and scoring problem this approach can be utilized in a wide range of applications in computational biology which involve the computation of free energies for systems with extensive phase spaces including protein folding, protein-protein docking and protein design.
TryTransDB: A web-based resource for transport proteins in Trypanosomatidae.
Sonar, Krushna; Kabra, Ritika; Singh, Shailza
2018-03-12
TryTransDB is a web-based resource that stores transport protein data which can be retrieved using a standalone BLAST tool. We have attempted to create an integrated database that can be a one-stop shop for researchers working with transport proteins of the Trypanosomatidae family. TryTransDB (Trypanosomatidae Transport Protein Database) is a comprehensive web-based resource that can run a BLAST search against most of the transport protein sequences (protein and nucleotide) from organisms of the Trypanosomatidae family. The resource also allows users to compute a phylogenetic tree by performing multiple sequence alignment (MSA) with the embedded CLUSTALW suite. In addition, cross-links to other databases help gather more information about a given transport protein from a single website.
The BioGRID Interaction Database: 2011 update
Stark, Chris; Breitkreutz, Bobby-Joe; Chatr-aryamontri, Andrew; Boucher, Lorrie; Oughtred, Rose; Livstone, Michael S.; Nixon, Julie; Van Auken, Kimberly; Wang, Xiaodong; Shi, Xiaoqi; Reguly, Teresa; Rust, Jennifer M.; Winter, Andrew; Dolinski, Kara; Tyers, Mike
2011-01-01
The Biological General Repository for Interaction Datasets (BioGRID) is a public database that archives and disseminates genetic and protein interaction data from model organisms and humans (http://www.thebiogrid.org). BioGRID currently holds 347 966 interactions (170 162 genetic, 177 804 protein) curated from both high-throughput data sets and individual focused studies, as derived from over 23 000 publications in the primary literature. Complete coverage of the entire literature is maintained for budding yeast (Saccharomyces cerevisiae), fission yeast (Schizosaccharomyces pombe) and thale cress (Arabidopsis thaliana), and efforts to expand curation across multiple metazoan species are underway. The BioGRID houses 48 831 human protein interactions that have been curated from 10 247 publications. Current curation drives are focused on particular areas of biology to enable insights into conserved networks and pathways that are relevant to human health. The BioGRID 3.0 web interface contains new search and display features that enable rapid queries across multiple data types and sources. An automated Interaction Management System (IMS) is used to prioritize, coordinate and track curation across international sites and projects. BioGRID provides interaction data to several model organism databases, resources such as Entrez-Gene and other interaction meta-databases. The entire BioGRID 3.0 data collection may be downloaded in multiple file formats, including PSI MI XML. Source code for BioGRID 3.0 is freely available without any restrictions. PMID:21071413
Metagenomics and the protein universe
Godzik, Adam
2011-01-01
Metagenomics sequencing projects have dramatically increased our knowledge of the protein universe and provided over one-half of currently known protein sequences; they have also introduced a much broader phylogenetic diversity into the protein databases. The full analysis of metagenomic datasets is only beginning, but it has already led to the discovery of thousands of new protein families, likely representing novel functions specific to given environments. At the same time, a deeper analysis of such novel families, including experimental structure determination of some representatives, suggests that most of them represent distant homologs of already characterized protein families, and thus most of the protein diversity present in the new environments are due to functional divergence of the known protein families rather than the emergence of new ones. PMID:21497084
AIM: A comprehensive Arabidopsis Interactome Module database and related interologs in plants
USDA-ARS's Scientific Manuscript database
Systems biology analysis of protein modules is important for understanding the functional relationships between proteins in the interactome. Here, we present a comprehensive database named AIM for Arabidopsis (Arabidopsis thaliana) interactome modules. The database contains almost 250,000 modules th...
VaProS: a database-integration approach for protein/genome information retrieval.
Gojobori, Takashi; Ikeo, Kazuho; Katayama, Yukie; Kawabata, Takeshi; Kinjo, Akira R; Kinoshita, Kengo; Kwon, Yeondae; Migita, Ohsuke; Mizutani, Hisashi; Muraoka, Masafumi; Nagata, Koji; Omori, Satoshi; Sugawara, Hideaki; Yamada, Daichi; Yura, Kei
2016-12-01
Life science research now relies heavily on all sorts of databases for genome sequences, transcription, protein three-dimensional (3D) structures, protein-protein interactions, phenotypes and so forth. The knowledge accumulated by all the omics research is so vast that a computer-aided search of data is now a prerequisite for starting a new study. In addition, a combinatory search throughout these databases has a chance to extract new ideas and new hypotheses that can be examined by wet-lab experiments. By virtually integrating the related databases on the Internet, we have built a new web application that helps life science researchers retrieve expert knowledge stored in the databases and build new hypotheses about their research targets. This web application, named VaProS, emphasizes the interconnection between the functional information of genome sequences and protein 3D structures, such as the structural effect of a gene mutation. In this manuscript, we present the notion of VaProS, the databases and tools that can be accessed without any knowledge of database locations and data formats, and the power of search exemplified in a quest for the molecular mechanisms of lysosomal storage disease. VaProS can be freely accessed at http://p4d-info.nig.ac.jp/vapros/ .
Dong, Runze; Pan, Shuo; Peng, Zhenling; Zhang, Yang; Yang, Jianyi
2018-05-21
With the rapid increase in the number of protein structures in the Protein Data Bank, it has become urgent to develop algorithms for efficient protein structure comparison. In this article, we present the mTM-align server, which consists of two closely related modules: one for structure database search and the other for multiple structure alignment. The database search is accelerated by a heuristic algorithm and a hierarchical organization of the structures in the database. The multiple structure alignment is performed using the recently developed algorithm mTM-align. Benchmark tests demonstrate that our algorithms outperform other peer methods for both modules, in terms of speed and accuracy. One of the unique features of the server is the interplay between database search and multiple structure alignment. The server provides service not only for performing fast database searches, but also for making accurate multiple structure alignments with the structures found by the search. For the database search, it takes about 2-5 min for a structure of medium size (∼300 residues). For the multiple structure alignment, it takes a few seconds for ∼10 structures of medium size. The server is freely available at: http://yanglab.nankai.edu.cn/mTM-align/.
Identification of species- and tissue-specific proteins using proteomic strategy
NASA Astrophysics Data System (ADS)
Chernukha, I. M.; Vostrikova, N. L.; Kovalev, L. I.; Shishkin, S. S.; Kovaleva, M. A.; Manukhin, Y. S.
2017-09-01
Proteomic technologies have proven to be very effective for detecting biochemical changes in meat products, such as changes in tissue- and species-specific proteins. In M. longissimus dorsi tissue of cattle, pig, horse and camel, both tissue- and species-specific proteins were detected using two-dimensional electrophoresis. Species-specific isoforms of several muscle proteins were also identified. The identified and described proteins of cattle, pig, horse and camel skeletal muscles (including mass spectra of the tryptic peptides) were added to the national free access database “Muscle organ proteomics”. This research has enabled the development of new highly sensitive technologies for meat product quality control against food fraud.
Integration of multiple biological features yields high confidence human protein interactome.
Karagoz, Kubra; Sevimoglu, Tuba; Arga, Kazim Yalcin
2016-08-21
The biological function of a protein is usually determined by its physical interactions with other proteins. Protein-protein interactions (PPIs) are identified through various experimental methods and are stored in curated databases. The noisiness of the existing PPI data is evident, and it is essential that more reliable data be generated. Furthermore, selecting sets of PPIs at different confidence levels might be necessary for many studies. Although different methodologies have been introduced to evaluate confidence scores for binary interactions, a highly reliable, almost complete PPI network of Homo sapiens has not yet been proposed. The quality and coverage of the human protein interactome need to be improved before it can be used in various disciplines, especially in biomedicine. In the present work, we propose an unsupervised statistical approach to assign confidence scores to PPIs of H. sapiens. To achieve this goal, PPI data from six different databases were collected and a total of 295,288 non-redundant interactions between 15,950 proteins were acquired. The present scoring system included the context information that was assigned to PPIs derived from eight biological attributes. A high confidence network, which included 147,923 binary interactions between 13,213 proteins, had scores greater than the cutoff value of 0.80, for which sensitivity, specificity, and coverage were 94.5%, 80.9%, and 82.8%, respectively. We compared the present scoring method with others for evaluation. Reducing the noise inherent in experimental PPIs via our scoring scheme increased the accuracy significantly. As demonstrated through the assessment of process and cancer subnetworks, this study allows researchers to construct and analyze context-specific networks via valid PPI sets, and one can easily obtain subnetworks around proteins of interest at a specified confidence level. Copyright © 2016 Elsevier Ltd. All rights reserved.
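The abstract above reports sensitivity and specificity at a score cutoff of 0.80. As a hedged illustration of how such figures are computed for any scored interaction set, the sketch below applies a cutoff to (score, is_true_interaction) pairs. The function name and the toy reference labels are hypothetical, and the abstract does not describe which reference set the authors used, so this is not their evaluation procedure.

```python
def evaluate_cutoff(scored_pairs, cutoff=0.80):
    """Return (sensitivity, specificity) for interactions scored above cutoff.

    scored_pairs: iterable of (score, is_true) where is_true comes from some
    reference set of known interactions (an assumption of this sketch).
    An interaction is called high confidence when score > cutoff.
    """
    tp = fp = tn = fn = 0
    for score, is_true in scored_pairs:
        predicted = score > cutoff
        if predicted and is_true:
            tp += 1
        elif predicted and not is_true:
            fp += 1
        elif not predicted and is_true:
            fn += 1
        else:
            tn += 1
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity
```

Sweeping the cutoff over a range of values with this function shows the usual trade-off: a stricter cutoff keeps fewer false interactions at the price of discarding more true ones.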
2014-01-01
Protein biomarkers offer major benefits for diagnosis and monitoring of disease processes. Recent advances in protein mass spectrometry make it feasible to use this very sensitive technology to detect and quantify proteins in blood. To explore the potential of blood biomarkers, we conducted a thorough review to evaluate the reliability of data in the literature and to determine the spectrum of proteins reported to exist in blood with a goal of creating a Federated Database of Blood Proteins (FDBP). A unique feature of our approach is the use of a SQL database for all of the peptide data; the power of the SQL database combined with standard informatic algorithms such as BLAST and the statistical analysis system (SAS) allowed the rapid annotation and analysis of the database without the need to create special programs to manage the data. Our mathematical analysis and review shows that in addition to the usual secreted proteins found in blood, there are many reports of intracellular proteins and good agreement on transcription factors, DNA remodelling factors in addition to cellular receptors and their signal transduction enzymes. Overall, we have catalogued about 12,130 proteins identified by at least one unique peptide, and of these 3858 have 3 or more peptide correlations. The FDBP with annotations should facilitate testing blood for specific disease biomarkers. PMID:24476026
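The abstract above stores peptide data in a SQL database and counts proteins by peptide support. A minimal sketch of that kind of query, using Python's built-in sqlite3 module, is shown below. The table name, column names, and the reading of "3 or more peptide correlations" as three or more distinct peptides are all assumptions made for illustration; the actual FDBP schema is not given in the abstract.

```python
import sqlite3

def count_proteins_by_peptide_support(rows, min_peptides=3):
    """Return (total_proteins, proteins_with_at_least_min_peptides).

    rows: iterable of (protein_id, peptide_sequence) tuples.
    The schema below is hypothetical, not the FDBP schema.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE peptide_hits (protein_id TEXT, peptide TEXT)")
    conn.executemany("INSERT INTO peptide_hits VALUES (?, ?)", rows)
    total = conn.execute(
        "SELECT COUNT(DISTINCT protein_id) FROM peptide_hits"
    ).fetchone()[0]
    supported = conn.execute(
        "SELECT COUNT(*) FROM ("
        "  SELECT protein_id FROM peptide_hits"
        "  GROUP BY protein_id"
        "  HAVING COUNT(DISTINCT peptide) >= ?"
        ")",
        (min_peptides,),
    ).fetchone()[0]
    conn.close()
    return total, supported
```

Counting distinct peptides rather than rows keeps repeated observations of the same peptide from inflating a protein's support.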
CyanoClust: comparative genome resources of cyanobacteria and plastids.
Sasaki, Naobumi V; Sato, Naoki
2010-01-01
Cyanobacteria, which perform oxygen-evolving photosynthesis as do chloroplasts of plants and algae, are one of the best-studied prokaryotic phyla and one from which many representative genomes have been sequenced. Lack of a suitable comparative genomic database has been a problem in cyanobacterial genomics because many proteins involved in physiological functions such as photosynthesis and nitrogen fixation are not catalogued in commonly used databases, such as Clusters of Orthologous Proteins (COG). CyanoClust is a database of homolog groups in cyanobacteria and plastids that are produced by the program Gclust. We have developed a web-server system for the protein homology database featuring cyanobacteria and plastids. Database URL: http://cyanoclust.c.u-tokyo.ac.jp/.
Complex network theory for the identification and assessment of candidate protein targets.
McGarry, Ken; McDonald, Sharon
2018-06-01
In this work we use complex network theory to provide a statistical model of the connectivity patterns of human proteins and their interaction partners. Our intention is to identify important proteins that may be predisposed to be potential candidates as drug targets for therapeutic interventions. Target proteins usually have more interaction partners than non-target proteins, but there are no hard-and-fast rules for defining the actual number of interactions. We devise a statistical measure for identifying hub proteins, and we score our target proteins with gene ontology annotations. The important druggable protein targets are likely to have similar biological functions that can be assessed for their potential therapeutic value. Our system provides a statistical analysis of the local and distant neighborhood protein interactions of the potential targets using complex network measures. This approach builds a more accurate model of drug-to-target activity and therefore the likely impact on treating diseases. We integrate high quality protein interaction data from the HINT database and disease associated proteins from the DrugTarget database. Other sources include biological knowledge from Gene Ontology and drug information from DrugBank. The problem is a very challenging one since the data is highly imbalanced between target proteins and the more numerous nontargets. We use undersampling on the training data and build Random Forest classifier models which are used to identify previously unclassified target proteins. We validate and corroborate these findings from the available literature. Copyright © 2018 Elsevier Ltd. All rights reserved.
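The abstract above notes that target proteins tend to have more interaction partners and that hubs are identified with a statistical measure, which it does not specify. Purely as a stand-in for illustration, the sketch below counts interaction partners (degree) from an edge list and flags proteins whose degree z-score reaches a threshold. The z-score rule, the threshold of 2.0, and the function name are assumptions, not the authors' measure.

```python
from collections import defaultdict
from statistics import mean, pstdev

def hub_proteins(edges, z_cutoff=2.0):
    """Return (hubs, degree) for an undirected interaction edge list.

    A protein is flagged as a hub when its degree z-score >= z_cutoff.
    This rule is an illustrative assumption, not the published measure.
    """
    neighbors = defaultdict(set)
    for a, b in edges:
        if a == b:
            continue                  # ignore self-interactions
        neighbors[a].add(b)
        neighbors[b].add(a)
    degree = {p: len(n) for p, n in neighbors.items()}
    if not degree:
        return set(), degree
    mu = mean(degree.values())
    sigma = pstdev(degree.values())
    if sigma == 0:
        return set(), degree          # all degrees equal: no hubs stand out
    hubs = {p for p, d in degree.items() if (d - mu) / sigma >= z_cutoff}
    return hubs, degree
```

Storing neighbors in sets means duplicate edges from merged data sources do not inflate a protein's degree.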
Data on the association of the nuclear envelope protein Sun1 with nucleoli.
Moujaber, Ossama; Omran, Nawal; Kodiha, Mohamed; Pié, Brigitte; Cooper, Ellis; Presley, John F; Stochaj, Ursula
2017-08-01
SUN proteins participate in diverse cellular activities, many of which are connected to the nuclear envelope. Recently, the family member SUN1 has been linked to novel biological activities. These include the regulation of nucleoli, intranuclear compartments that assemble ribosomal subunits. We show that SUN1 associates with nucleoli in several mammalian epithelial cell lines. This nucleolar localization is not shared by all cell types, as SUN1 concentrates at the nuclear envelope in ganglionic neurons and non-neuronal satellite cells. Database analyses and Western blotting emphasize the complexity of SUN1 protein profiles in different mammalian cells. We constructed a STRING network which identifies SUN1-related proteins as part of a larger network that includes several nucleolar proteins. Taken together, the current data highlight the diversity of SUN1 proteins and emphasize the possible links between SUN1 and nucleoli.
Chamrad, Daniel C; Körting, Gerhard; Schäfer, Heike; Stephan, Christian; Thiele, Herbert; Apweiler, Rolf; Meyer, Helmut E; Marcus, Katrin; Blüggel, Martin
2006-09-01
A novel software tool named PTM-Explorer has been applied to LC-MS/MS datasets acquired within the Human Proteome Organisation (HUPO) Brain Proteome Project (BPP). PTM-Explorer enables automatic identification of peptide MS/MS spectra that were not explained in typical sequence database searches. The main focus was detection of PTMs, but PTM-Explorer also detects unspecific peptide cleavage, mass measurement errors, experimental modifications, amino acid substitutions, transpeptidation products and unknown mass shifts. To avoid a combinatorial problem, the search is restricted to a set of selected protein sequences, which stem from previous protein identifications using a common sequence database search. Prior to application to the HUPO BPP data, PTM-Explorer was evaluated on carefully characterized and manually evaluated LC-MS/MS data sets from Alpha-A-Crystallin gel spots obtained from mouse eye lens. Besides various PTMs including phosphorylation, a wealth of experimental modifications and unspecific cleavage products were successfully detected, completing the primary structure information of the measured proteins. Our results indicate that a large number of MS/MS spectra that currently remain unidentified in standard database searches contain valuable information that can only be elucidated using suitable software tools.
VisANT 3.0: new modules for pathway visualization, editing, prediction and construction.
Hu, Zhenjun; Ng, David M; Yamada, Takuji; Chen, Chunnuan; Kawashima, Shuichi; Mellor, Joe; Linghu, Bolan; Kanehisa, Minoru; Stuart, Joshua M; DeLisi, Charles
2007-07-01
With the integration of the KEGG and Predictome databases as well as two search engines for coexpressed genes/proteins using data sets obtained from the Stanford Microarray Database (SMD) and Gene Expression Omnibus (GEO) database, VisANT 3.0 supports exploratory pathway analysis, which includes multi-scale visualization of multiple pathways, editing and annotating pathways using a KEGG compatible visual notation and visualization of expression data in the context of pathways. Expression levels are represented either by color intensity or by nodes with an embedded expression profile. Multiple experiments can be navigated or animated. Known KEGG pathways can be enriched by querying either coexpressed components of known pathway members or proteins with known physical interactions. Predicted pathways for genes/proteins with unknown functions can be inferred from coexpression or physical interaction data. Pathways produced in VisANT can be saved as computer-readable XML format (VisML), graphic images or high-resolution Scalable Vector Graphics (SVG). Pathways in the format of VisML can be securely shared within an interested group or published online using a simple Web link. VisANT is freely available at http://visant.bu.edu.
Reverse screening methods to search for the protein targets of chemopreventive compounds
NASA Astrophysics Data System (ADS)
Huang, Hongbin; Zhang, Guigui; Zhou, Yuquan; Lin, Chenru; Chen, Suling; Lin, Yutong; Mai, Shangkang; Huang, Zunnan
2018-05-01
This article is a systematic review of reverse screening methods used to search for the protein targets of chemopreventive compounds or drugs. Typical chemopreventive compounds include components of traditional Chinese medicine, natural compounds and Food and Drug Administration (FDA)-approved drugs. Such compounds are somewhat selective but are predisposed to bind multiple protein targets distributed throughout diverse signaling pathways in human cells. In contrast to conventional virtual screening, which identifies the ligands of a targeted protein from a compound database, reverse screening is used to identify the potential targets or unintended targets of a given compound from a large number of receptors by examining their known ligands or crystal structures. This method, also known as in silico or computational target fishing, is highly valuable for discovering the target receptors of query molecules from terrestrial or marine natural products, exploring the molecular mechanisms of chemopreventive compounds, finding alternative indications of existing drugs by drug repositioning, and detecting adverse drug reactions and drug toxicity. Reverse screening can be divided into three major groups: shape screening, pharmacophore screening and reverse docking. Several large software packages, such as Schrödinger and Discovery Studio; typical software/network services such as ChemMapper, PharmMapper, idTarget and INVDOCK; and practical databases of known target ligands and receptor crystal structures, such as ChEMBL, BindingDB and the Protein Data Bank (PDB), are available for use in these computational methods. Different programs, online services and databases have different applications and constraints. 
Here, we conducted a systematic analysis and multilevel classification of the computational programs, online services and compound libraries available for shape screening, pharmacophore screening and reverse docking to enable non-specialist users to quickly learn and grasp the types of calculations used in protein target fishing. In addition, we review the main features of these methods, programs and databases and provide a variety of examples illustrating the application of one or a combination of reverse screening methods for accurate target prediction.
Dictionary-driven protein annotation.
Rigoutsos, Isidore; Huynh, Tien; Floratos, Aris; Parida, Laxmi; Platt, Daniel
2002-09-01
Computational methods seeking to automatically determine the properties (functional, structural, physicochemical, etc.) of a protein directly from the sequence have long been the focus of numerous research groups. With the advent of advanced sequencing methods and systems, the number of amino acid sequences that are being deposited in the public databases has been increasing steadily. This has in turn generated a renewed demand for automated approaches that can annotate individual sequences and complete genomes quickly, exhaustively and objectively. In this paper, we present one such approach that is centered around and exploits the Bio-Dictionary, a collection of amino acid patterns that completely covers the natural sequence space and can capture functional and structural signals that have been reused during evolution, within and across protein families. Our annotation approach also makes use of a weighted, position-specific scoring scheme that is unaffected by the over-representation of well-conserved proteins and protein fragments in the databases used. For a given query sequence, the method permits one to determine, in a single pass, the following: local and global similarities between the query and any protein already present in a public database; the likeness of the query to all available archaeal/bacterial/eukaryotic/viral sequences in the database as a function of amino acid position within the query; the character of secondary structure of the query as a function of amino acid position within the query; the cytoplasmic, transmembrane or extracellular behavior of the query; the nature and position of binding domains, active sites, post-translationally modified sites, signal peptides, etc. In terms of performance, the proposed method is exhaustive, objective and allows for the rapid annotation of individual sequences and full genomes. 
Annotation examples are presented and discussed in Results, including individual queries and complete genomes that were released publicly after we built the Bio-Dictionary that is used in our experiments. Finally, we have computed the annotations of more than 70 complete genomes and made them available on the World Wide Web at http://cbcsrv.watson.ibm.com/Annotations/.
Reverse Screening Methods to Search for the Protein Targets of Chemopreventive Compounds
Huang, Hongbin; Zhang, Guigui; Zhou, Yuquan; Lin, Chenru; Chen, Suling; Lin, Yutong; Mai, Shangkang; Huang, Zunnan
2018-01-01
This article is a systematic review of reverse screening methods used to search for the protein targets of chemopreventive compounds or drugs. Typical chemopreventive compounds include components of traditional Chinese medicine, natural compounds and Food and Drug Administration (FDA)-approved drugs. Such compounds are somewhat selective but are predisposed to bind multiple protein targets distributed throughout diverse signaling pathways in human cells. In contrast to conventional virtual screening, which identifies the ligands of a targeted protein from a compound database, reverse screening is used to identify the potential targets or unintended targets of a given compound from a large number of receptors by examining their known ligands or crystal structures. This method, also known as in silico or computational target fishing, is highly valuable for discovering the target receptors of query molecules from terrestrial or marine natural products, exploring the molecular mechanisms of chemopreventive compounds, finding alternative indications of existing drugs by drug repositioning, and detecting adverse drug reactions and drug toxicity. Reverse screening can be divided into three major groups: shape screening, pharmacophore screening and reverse docking. Several large software packages, such as Schrödinger and Discovery Studio; typical software/network services such as ChemMapper, PharmMapper, idTarget, and INVDOCK; and practical databases of known target ligands and receptor crystal structures, such as ChEMBL, BindingDB, and the Protein Data Bank (PDB), are available for use in these computational methods. Different programs, online services and databases have different applications and constraints. 
Here, we conducted a systematic analysis and multilevel classification of the computational programs, online services and compound libraries available for shape screening, pharmacophore screening and reverse docking to enable non-specialist users to quickly learn and grasp the types of calculations used in protein target fishing. In addition, we review the main features of these methods, programs and databases and provide a variety of examples illustrating the application of one or a combination of reverse screening methods for accurate target prediction. PMID:29868550
MultitaskProtDB: a database of multitasking proteins.
Hernández, Sergio; Ferragut, Gabriela; Amela, Isaac; Perez-Pons, JosepAntoni; Piñol, Jaume; Mozo-Villarias, Angel; Cedano, Juan; Querol, Enrique
2014-01-01
We have compiled MultitaskProtDB, available online at http://wallace.uab.es/multitask, to provide a repository where the many multitasking proteins found in the literature can be stored. Multitasking or moonlighting is the capability of some proteins to execute two or more biological functions. Usually, multitasking proteins are experimentally revealed by serendipity. This ability of proteins to perform multitasking functions helps us to understand one of the ways used by cells to perform many complex functions with a limited number of genes. Even so, the study of this phenomenon is complex because, among other things, there is no database of moonlighting proteins. The existence of such a tool facilitates the collection and dissemination of these important data. This work reports the database, MultitaskProtDB, which is designed as a user-friendly web page containing >288 multitasking proteins with their NCBI and UniProt accession numbers, canonical and additional biological functions, monomeric/oligomeric states, PDB codes when available and bibliographic references. This database also serves to gain insight into some characteristics of multitasking proteins such as frequencies of the different pairs of functions, phylogenetic conservation and so forth.
KnotProt: a database of proteins with knots and slipknots.
Jamroz, Michal; Niemyska, Wanda; Rawdon, Eric J; Stasiak, Andrzej; Millett, Kenneth C; Sułkowski, Piotr; Sulkowska, Joanna I
2015-01-01
The protein topology database KnotProt, http://knotprot.cent.uw.edu.pl/, collects information about protein structures with open polypeptide chains forming knots or slipknots. The knotting complexity of the cataloged proteins is presented in the form of a matrix diagram that shows users the knot type of the entire polypeptide chain and of each of its subchains. The pattern visible in the matrix gives the knotting fingerprint of a given protein and permits users to determine, for example, the minimal length of the knotted regions (knot's core size) or the depth of a knot, i.e. how many amino acids can be removed from either end of the cataloged protein structure before converting it from a knot to a different type of knot. In addition, the database presents extensive information about the biological functions, families and fold types of proteins with non-trivial knotting. As an additional feature, the KnotProt database enables users to submit protein or polymer chains and generate their knotting fingerprints. © The Author(s) 2014. Published by Oxford University Press on behalf of Nucleic Acids Research.
BLAST and FASTA similarity searching for multiple sequence alignment.
Pearson, William R
2014-01-01
BLAST, FASTA, and other similarity searching programs seek to identify homologous proteins and DNA sequences based on excess sequence similarity. If two sequences share much more similarity than expected by chance, the simplest explanation for the excess similarity is common ancestry, that is, homology. The most effective similarity searches compare protein sequences, rather than DNA sequences, for sequences that encode proteins, and use expectation values, rather than percent identity, to infer homology. The BLAST and FASTA packages of sequence comparison programs provide programs for comparing protein and DNA sequences to protein databases (the most sensitive searches). Protein and translated-DNA comparisons to protein databases routinely allow evolutionary look-back times from 1 to 2 billion years; DNA:DNA searches are 5-10-fold less sensitive. BLAST and FASTA can be run on popular web sites, but can also be downloaded and installed on local computers. With local installation, target databases can be customized for the sequence data being characterized. With today's very large protein databases, search sensitivity can also be improved by searching smaller comprehensive databases, for example, a complete protein set from an evolutionarily neighboring model organism. By default, BLAST and FASTA use scoring strategies targeted for distant evolutionary relationships; for comparisons involving short domains or queries, or searches that seek relatively close homologs (e.g. mouse-human), shallower scoring matrices will be more effective. Both BLAST and FASTA provide very accurate statistical estimates, which can be used to reliably identify protein sequences that diverged more than 2 billion years ago.
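The link between expectation values and database size mentioned above can be illustrated with a minimal sketch. It assumes the standard bit-score form of the Karlin-Altschul relation, E = m * n * 2^(-S'), and ignores the effective-length corrections that real BLAST and FASTA implementations apply. The function name `expect_value`, the query length and the database sizes are hypothetical and chosen only for illustration.

```python
def expect_value(bit_score: float, query_len: float, db_len: float) -> float:
    """Expected number of chance alignments with a bit score >= bit_score.

    Uses E = m * n * 2**(-S'), with m the query length and n the total
    database length in residues (no effective-length correction).
    """
    return query_len * db_len * 2.0 ** (-bit_score)


if __name__ == "__main__":
    # Hypothetical sizes, chosen only to show the effect of database size.
    query_len = 300
    databases = {
        "large comprehensive database": 5e10,
        "single-proteome database": 1e7,
    }
    for name, db_len in databases.items():
        e = expect_value(50.0, query_len, db_len)
        print(f"{name}: E = {e:.2e}")
```

Under these assumptions the same 50-bit alignment receives a smaller expectation value against the smaller database, which is one way to read the abstract's remark that searching a smaller comprehensive database can improve sensitivity.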
A PDB-wide, evolution-based assessment of protein-protein interfaces.
Baskaran, Kumaran; Duarte, Jose M; Biyani, Nikhil; Bliven, Spencer; Capitani, Guido
2014-10-18
Thanks to the growth in sequence and structure databases, more than 50 million sequences are now available in UniProt and 100,000 structures in the PDB. Rich information about protein-protein interfaces can be obtained by a comprehensive study of protein contacts in the PDB, their sequence conservation and geometric features. An automated computational pipeline was developed to run our Evolutionary Protein-Protein Interface Classifier (EPPIC) software on the entire PDB and store the results in a relational database, currently containing > 800,000 interfaces. This allows the analysis of interface data on a PDB-wide scale. Two large benchmark datasets of biological interfaces and crystal contacts, each containing about 3000 entries, were automatically generated based on criteria thought to be strong indicators of interface type. The BioMany set of biological interfaces includes NMR dimers solved as crystal structures and interfaces that are preserved across diverse crystal forms, as catalogued by the Protein Common Interface Database (ProtCID) from Xu and Dunbrack. The second dataset, XtalMany, is derived from interfaces that would lead to infinite assemblies and are therefore crystal contacts. BioMany and XtalMany were used to benchmark the EPPIC approach. The performance of EPPIC was also compared to classifications from the Protein Interfaces, Surfaces, and Assemblies (PISA) program on a PDB-wide scale, finding that the two approaches give the same call in about 88% of PDB interfaces. By comparing our safest predictions to the PDB author annotations, we provide a lower-bound estimate of the error rate of biological unit annotations in the PDB. Additionally, we developed a PyMOL plugin for direct download and easy visualization of EPPIC interfaces for any PDB entry. Both the datasets and the PyMOL plugin are available at http://www.eppic-web.org/ewui/#downloads. 
Our computational pipeline allows us to analyze protein-protein contacts and their sequence conservation across the entire PDB. Two new benchmark datasets are provided, which are over an order of magnitude larger than existing manually curated ones. These tools enable the comprehensive study of several aspects of protein-protein contacts in the PDB and represent a basis for future, even larger scale studies of protein-protein interactions.
PDB-wide collection of binding data: current status of the PDBbind database.
Liu, Zhihai; Li, Yan; Han, Li; Li, Jie; Liu, Jie; Zhao, Zhixiong; Nie, Wei; Liu, Yuchen; Wang, Renxiao
2015-02-01
Molecular recognition between biological macromolecules and organic small molecules plays an important role in various life processes. Both structural information and binding data of biomolecular complexes are indispensable for depicting the underlying mechanism in such an event. The PDBbind database was created to collect experimentally measured binding data for the biomolecular complexes throughout the Protein Data Bank (PDB). It thus provides the linkage between structural information and energetic properties of biomolecular complexes, which is especially desirable for computational studies or statistical analyses. Since its first public release in 2004, the PDBbind database has been updated on an annual basis. The latest release (version 2013) provides experimental binding affinity data for 10,776 biomolecular complexes in PDB, including 8302 protein-ligand complexes and 2474 other types of complexes. In this article, we will describe the current methods used for compiling PDBbind and the updated status of this database. We will also review some typical applications of PDBbind published in the scientific literature. All contents of this database are freely accessible at the PDBbind-CN Web server at http://www.pdbbind-cn.org/. Contact: wangrx@mail.sioc.ac.cn. Supplementary data are available at Bioinformatics online. © The Author 2014. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.
ASDB: a resource for probing protein functions with small molecules.
Liu, Zhihong; Ding, Peng; Yan, Xin; Zheng, Minghao; Zhou, Huihao; Xu, Yuehua; Du, Yunfei; Gu, Qiong; Xu, Jun
2016-06-01
Identifying chemical probes or seeking scaffolds for a specific biological target is important for protein function studies. Therefore, we created the Annotated Scaffold Database (ASDB), a computer-readable and systematic target-annotated scaffold database, to serve such needs. The scaffolds in ASDB were derived from public databases including ChEMBL, DrugBank and TCMSP, with a scaffold-based classification approach. Each scaffold was assigned an InChIKey as its unique identifier, energy-minimized 3D conformations, and other calculated properties. A scaffold is also associated with drugs, natural products, drug targets and medical indications. The database can be queried through text or structure query tools. ASDB collects 333 601 scaffolds, which are associated with 4368 targets. The scaffolds include 3032 scaffolds derived from drugs and 5163 scaffolds derived from natural products. For given scaffolds, scaffold-target networks can be generated from the database to demonstrate the relations of scaffolds and targets. ASDB is freely available at http://www.rcdd.org.cn/asdb/ with the major web browsers. Contact: junxu@biochemomes.com or xujun9@mail.sysu.edu.cn. Supplementary data are available at Bioinformatics online. © The Author 2016. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.
Reference System of DNA and Protein Sequences on CD-ROM
NASA Astrophysics Data System (ADS)
Nasu, Hisanori; Ito, Toshiaki
DNASIS-DBREF31 is a database of DNA and protein sequences on optical Compact Disk (CD) ROM, developed and commercialized by Hitachi Software Engineering Co., Ltd. Both nucleic acid base sequences and protein amino acid sequences can be retrieved from a single CD-ROM. Existing databases are offered as on-line services, on floppy disks, or on magnetic tape, all of which have drawbacks in usability or storage capacity. DNASIS-DBREF31 adopts the CD-ROM as its storage medium to provide mass storage and personal use of the database.
Bhardwaj, Jyoti; Gangwar, Indu; Panzade, Ganesh; Shankar, Ravi; Yadav, Sudesh Kumar
2016-06-03
Inspired by the availability of the de novo transcriptome of horse gram (Macrotyloma uniflorum) and recent developments in systems biology studies, the first ever global protein-protein interactome (PPI) map was constructed for this highly drought-tolerant legume. Large-scale studies of PPIs and the constructed database would provide a rationale for the interplay at cascading translational levels in drought stress-adaptive mechanisms in horse gram. Using a bidirectional approach (interolog and domain-based), a high-confidence interactome map and database for horse gram was constructed. Available transcriptomic information for shoot and root tissues of a sensitive (M-191; genotype 1) and a drought-tolerant (M-249; genotype 2) genotype of horse gram was utilized to draw comparative PPI subnetworks under drought stress. In total, 6804 high-confidence interactions were predicted among 1812 proteins covering about one-fourth of the horse gram proteome. The highest number of interactions (33.86%) in the horse gram interactome matched with Arabidopsis PPI data. The top five hub nodes mostly included ubiquitin and heat-shock-related proteins. Higher numbers of PPIs were found to be responsive in shoot tissue (416) and root tissue (2228) of genotype 2 compared with shoot tissue (136) and root tissue (579) of genotype 1. Characterization of PPIs using gene ontology analysis revealed that kinase and transferase activities involved in signal transduction, cellular processes, nucleocytoplasmic transport, protein ubiquitination, and localization of molecules were most responsive to drought stress. Hence, these could be framed in stress adaptive mechanisms of horse gram. Being the first legume global PPI map, it would provide new insights into gene and protein regulatory networks for drought stress tolerance mechanisms in horse gram. 
Information compiled in the form of a database (MauPIR) will provide the much needed high-confidence systems biology information for horse gram genes, proteins, and involved processes. This information would ease the effort and increase the efficacy of similar studies on other legumes. Public access is available at http://14.139.59.221/MauPIR/.
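The interolog half of the bidirectional approach described above can be sketched as follows. This is a minimal illustration of the general interolog idea (an interaction between two proteins in a reference species is transferred to their orthologs in the target species), not the authors' pipeline. The function name `predict_interologs`, the protein identifiers and the ortholog table are hypothetical, and the domain-based half of the approach is not shown.

```python
from itertools import product


def predict_interologs(reference_ppis, orthologs):
    """Transfer reference-species interactions to a target species.

    reference_ppis: iterable of (protein_a, protein_b) pairs in the reference.
    orthologs: dict mapping a reference protein to a set of target proteins.
    Returns a set of unordered target-species pairs (stored as sorted tuples).
    """
    predicted = set()
    for ref_a, ref_b in reference_ppis:
        # Every ortholog of one partner is paired with every ortholog of the other.
        for a, b in product(orthologs.get(ref_a, ()), orthologs.get(ref_b, ())):
            predicted.add(tuple(sorted((a, b))))
    return predicted


if __name__ == "__main__":
    # Hypothetical identifiers, for illustration only.
    reference_ppis = [("REF1", "REF2"), ("REF2", "REF3")]
    orthologs = {"REF1": {"TGT_a"}, "REF2": {"TGT_b", "TGT_c"}}
    for pair in sorted(predict_interologs(reference_ppis, orthologs)):
        print(pair)
```

In this toy input REF3 has no ortholog, so the second reference interaction yields no prediction. A real pipeline would also attach a confidence score to each transferred pair.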
Alonso-López, Diego; Gutiérrez, Miguel A.; Lopes, Katia P.; Prieto, Carlos; Santamaría, Rodrigo; De Las Rivas, Javier
2016-01-01
APID (Agile Protein Interactomes DataServer) is an interactive web server that provides unified generation and delivery of protein interactomes mapped to their respective proteomes. This resource is a new, fully redesigned server that includes a comprehensive collection of protein interactomes for more than 400 organisms (25 of which include more than 500 interactions) produced by the integration of only experimentally validated protein–protein physical interactions. For each protein–protein interaction (PPI) the server includes currently reported information about its experimental validation to allow selection and filtering at different quality levels. As a whole, it provides easy access to the interactomes from specific species and includes a global uniform compendium of 90,379 distinct proteins and 678,441 singular interactions. APID integrates and unifies PPIs from major primary databases of molecular interactions, from other specific repositories and also from experimentally resolved 3D structures of protein complexes where more than two proteins were identified. For this purpose, a collection of 8,388 structures were analyzed to identify specific PPIs. APID also includes a new graph tool (based on Cytoscape.js) for visualization and interactive analyses of PPI networks. The server does not require registration and it is freely available for use at http://apid.dep.usal.es. PMID:27131791
Boutet, Emmanuel; Lieberherr, Damien; Tognolli, Michael; Schneider, Michel; Bansal, Parit; Bridge, Alan J; Poux, Sylvain; Bougueleret, Lydie; Xenarios, Ioannis
2016-01-01
The Universal Protein Resource (UniProt, http://www.uniprot.org ) consortium is an initiative of the SIB Swiss Institute of Bioinformatics (SIB), the European Bioinformatics Institute (EBI) and the Protein Information Resource (PIR) to provide the scientific community with a central resource for protein sequences and functional information. The UniProt consortium maintains the UniProt KnowledgeBase (UniProtKB), updated every 4 weeks, and several supplementary databases including the UniProt Reference Clusters (UniRef) and the UniProt Archive (UniParc). The Swiss-Prot section of the UniProt KnowledgeBase (UniProtKB/Swiss-Prot) contains publicly available expertly manually annotated protein sequences obtained from a broad spectrum of organisms. Plant protein entries are produced in the frame of the Plant Proteome Annotation Program (PPAP), with an emphasis on characterized proteins of Arabidopsis thaliana and Oryza sativa. High level annotations provided by UniProtKB/Swiss-Prot are widely used to predict annotation of newly available proteins through automatic pipelines. The purpose of this chapter is to present a guided tour of a UniProtKB/Swiss-Prot entry. We will also present some of the tools and databases that are linked to each entry.
PCPPI: a comprehensive database for the prediction of Penicillium-crop protein-protein interactions.
Yue, Junyang; Zhang, Danfeng; Ban, Rongjun; Ma, Xiaojing; Chen, Danyang; Li, Guangwei; Liu, Jia; Wisniewski, Michael; Droby, Samir; Liu, Yongsheng
2017-01-01
Penicillium expansum, the causal agent of blue mold, is one of the most prevalent post-harvest pathogens, infecting a wide range of crops after harvest. In response, crops have evolved various defense systems to protect themselves against this and other pathogens. Penicillium-crop interaction is a multifaceted process mediated by pathogen- and host-derived proteins. Identification and characterization of the inter-species protein-protein interactions (PPIs) are fundamental to elucidating the molecular mechanisms underlying infection processes between P. expansum and plant crops. Here, we have developed PCPPI, the Penicillium-Crop Protein-Protein Interactions database, which is constructed based on the experimentally determined orthologous interactions in pathogen-plant systems and available domain-domain interactions (DDIs) in each PPI. Thus far, it stores information on 9911 proteins, 439 904 interactions and seven host species, including apple, kiwifruit, maize, pear, rice, strawberry and tomato. Further analysis through gene ontology (GO) annotation indicated that proteins with more interacting partners tend to perform essential functions. Significantly, semantic statistics of the GO terms also provided strong support for the accuracy of our predicted interactions in PCPPI. We believe that the PCPPI datasets will facilitate the study of pathogen-crop interactions; they are freely available to the research community. Database URL: http://bdg.hfut.edu.cn/pcppi/index.html. © The Author(s) 2017. Published by Oxford University Press.
SATPdb: a database of structurally annotated therapeutic peptides
Singh, Sandeep; Chaudhary, Kumardeep; Dhanda, Sandeep Kumar; Bhalla, Sherry; Usmani, Salman Sadullah; Gautam, Ankur; Tuknait, Abhishek; Agrawal, Piyush; Mathur, Deepika; Raghava, Gajendra P.S.
2016-01-01
SATPdb (http://crdd.osdd.net/raghava/satpdb/) is a database of structurally annotated therapeutic peptides, curated from 22 public domain peptide databases/datasets including 9 of our own. The current version holds 19192 unique experimentally validated therapeutic peptide sequences having length between 2 and 50 amino acids. It covers peptides having natural, non-natural and modified residues. These peptides were systematically grouped into 10 categories based on their major function or therapeutic property like 1099 anticancer, 10585 antimicrobial, 1642 drug delivery and 1698 antihypertensive peptides. We assigned or annotated structure of these therapeutic peptides using structural databases (Protein Data Bank) and state-of-the-art structure prediction methods like I-TASSER, HHsearch and PEPstrMOD. In addition, SATPdb facilitates users in performing various tasks that include: (i) structure and sequence similarity search, (ii) peptide browsing based on their function and properties, (iii) identification of moonlighting peptides and (iv) searching of peptides having desired structure and therapeutic activities. We hope this database will be useful for researchers working in the field of peptide-based therapeutics. PMID:26527728
Veiga, Ana B. G.; Ribeiro, José M. C.; Guimarães, Jorge A.; Francischetti, Ivo M.B.
2010-01-01
Accidents with the caterpillar Lonomia obliqua are often associated with a coagulation disorder and hemorrhagic syndrome in humans. In the present study, we have constructed cDNA libraries from two venomous structures of the caterpillar, namely the tegument and the bristle. High-throughput sequencing and bioinformatics analyses were performed in parallel. Over one thousand cDNAs were obtained and clustered to produce a database of 538 contigs and singletons (clusters) for the tegument library and 368 for the bristle library. We have thus identified dozens of full-length cDNAs coding for proteins with sequence homology to snake venom prothrombin activator, trypsin-like enzymes, blood coagulation factors and prophenoloxidase cascade activators. We also report cDNA coding for cysteine proteases, Group III phospholipase A2, C-type lectins, lipocalins, in addition to protease inhibitors including serpins, Kazal-type inhibitors, cystatins and trypsin inhibitor-like molecules. Antibacterial proteins and housekeeping genes are also described. A significant number of sequences were devoid of database matches, suggesting that their biologic function remains to be defined. We also report the N-terminus of the most abundant proteins present in the bristle, tegument, hemolymph, and "cryosecretion". Thus, we have created a catalog that contains the predicted molecular weight, isoelectric point, accession number, and putative function for each selected molecule from the venomous structures of L. obliqua. The role of these molecules in the coagulation disorder and hemorrhagic syndrome caused by envenomation with this caterpillar is discussed. All sequence information and the Supplemental Data, including Figures and Tables with hyperlinks to FASTA-formatted files for each contig and the best match to the Databases, are available at http://www.ncbi.nih.gov/projects/omes. PMID:16023793
Karbalaei, Reza; Allahyari, Marzieh; Rezaei-Tavirani, Mostafa; Asadzadeh-Aghdaei, Hamid; Zali, Mohammad Reza
2018-01-01
The aim was to analyze reconstructed networks for two diseases, NAFLD and Alzheimer's disease (AD), and their relationship using systems biology methods. NAFLD and Alzheimer's disease are two complex diseases with increasing prevalence and high costs for countries. Some reports describe a relationship and common pathways between these two diseases. In addition, they share some risk factors, mainly lifestyle-related ones such as diet and exercise. Therefore, a systems biology approach can help to uncover their relationship. The DisGeNET and STRING databases were used as sources of disease genes and for network construction, respectively. Three plugins of the Cytoscape software, ClusterONE, ClueGO and CluePedia, were used to analyze and cluster the networks and to perform pathway enrichment. An R package was used to determine the best centrality method. Finally, hub and bottleneck nodes were defined based on degree and betweenness. NAFLD and Alzheimer's disease shared 190 genes, which were used to construct a network with the STRING database. The resulting network contained 182 nodes and 2591 edges and comprised four clusters. Enrichment analysis of the individual clusters pointed to carbohydrate metabolism, long-chain fatty acid, and regulation of the JAK-STAT and IL-17 signaling pathways, respectively. Seven genes were selected as hub-bottlenecks: IL6, AKT1, TP53, TNF, JUN, VEGFA and PPARG. Enrichment of these proteins and their first neighbors in the network using the OMIM database pointed to diabetes and obesity as precursors of NAFLD and AD. Systems biology methods, specifically PPI networks, can be useful for analyzing complex related diseases. Identifying hub and bottleneck proteins should be a goal in drug design and in the search for disease markers.
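The hub and bottleneck selection by degree and betweenness can be sketched on a toy network. This is a minimal standard-library illustration, not the Cytoscape and R workflow used in the study. Betweenness is computed with Brandes' algorithm for unweighted graphs. The node names, the edge list and the top-k cutoff are hypothetical assumptions.

```python
from collections import deque


def build_graph(edges):
    """Build an undirected adjacency map from an edge list."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def betweenness(adj):
    """Unnormalized betweenness centrality (Brandes' algorithm, unweighted)."""
    bc = {v: 0.0 for v in adj}
    for s in adj:
        stack, queue = [], deque([s])
        preds = {v: [] for v in adj}
        sigma = {v: 0 for v in adj}   # number of shortest paths from s
        dist = {v: -1 for v in adj}   # BFS distance from s
        sigma[s], dist[s] = 1, 0
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}
        while stack:  # accumulate dependencies in order of decreasing distance
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # Each unordered pair was counted from both endpoints.
    return {v: b / 2 for v, b in bc.items()}


def top_k(scores, k):
    """Return the k highest-scoring nodes as a set."""
    return set(sorted(scores, key=scores.get, reverse=True)[:k])


def hub_bottlenecks(adj, k):
    """Nodes that are both among the top-k by degree and top-k by betweenness."""
    degree = {v: len(nbrs) for v, nbrs in adj.items()}
    return top_k(degree, k) & top_k(betweenness(adj), k)


# Hypothetical toy network: H is a hub, X is a bridge, Y is a smaller hub.
TOY_EDGES = [("H", "a"), ("H", "b"), ("H", "c"), ("H", "X"),
             ("X", "Y"), ("Y", "p"), ("Y", "q")]

if __name__ == "__main__":
    print(hub_bottlenecks(build_graph(TOY_EDGES), k=2))
```

In the toy network the top two nodes by degree are H and Y, while the top two by betweenness are H and X, so only H qualifies as a hub-bottleneck. The fixed top-k cutoff is one possible choice; a percentile threshold would work the same way.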
Li, Haiquan; Dai, Xinbin; Zhao, Xuechun
2008-05-01
Membrane transport proteins play a crucial role in the import and export of ions, small molecules or macromolecules across biological membranes. Currently, there are a limited number of published computational tools which enable the systematic discovery and categorization of transporters prior to costly experimental validation. To approach this problem, we utilized a nearest neighbor method which seamlessly integrates homologous search and topological analysis into a machine-learning framework. Our approach satisfactorily distinguished 484 transporter families in the Transporter Classification Database, a curated and representative database for transporters. A five-fold cross-validation on the database achieved a positive classification rate of 72.3% on average. Furthermore, this method successfully detected transporters in seven model and four non-model organisms, ranging from archaean to mammalian species. A preliminary literature-based validation has cross-validated 65.8% of our predictions on the 11 organisms, including 55.9% of our predictions overlapping with 83.6% of the predicted transporters in TransportDB.
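The idea of a nearest neighbor classifier that combines sequence similarity with topological information can be sketched as below. This is only an illustration under strong assumptions and is not the authors' method: k-mer Jaccard similarity stands in for the homology search, the number of predicted transmembrane helices stands in for the topological analysis, and the names `combined_distance` and `nearest_family`, the weight, the family labels and the sequences are all hypothetical.

```python
def kmers(seq, k=3):
    """Set of overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def combined_distance(query, ref, weight=0.5):
    """Weighted sum of a sequence distance and a topology distance.

    query/ref: dicts with 'seq' (str) and 'tm' (predicted transmembrane
    helix count). Both terms lie in [0, 1].
    """
    q, r = kmers(query["seq"]), kmers(ref["seq"])
    union = q | r
    jaccard = len(q & r) / len(union) if union else 0.0
    seq_dist = 1.0 - jaccard
    topo_dist = abs(query["tm"] - ref["tm"]) / max(query["tm"], ref["tm"], 1)
    return weight * seq_dist + (1.0 - weight) * topo_dist


def nearest_family(query, references):
    """Assign the family label of the closest reference protein (1-NN)."""
    best = min(references, key=lambda ref: combined_distance(query, ref))
    return best["family"]


# Hypothetical reference set and query, for illustration only.
REFERENCES = [
    {"family": "F1", "seq": "MKTAYIAKQR", "tm": 12},
    {"family": "F2", "seq": "GSHMLEDPVD", "tm": 6},
]
QUERY = {"seq": "MKTAYIAKQL", "tm": 11}

if __name__ == "__main__":
    print(nearest_family(QUERY, REFERENCES))
```

A cross-validation like the five-fold scheme mentioned in the abstract would hold out one fifth of the labeled references at a time and classify them against the remainder.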
Projections for fast protein structure retrieval
Bhattacharya, Sourangshu; Bhattacharyya, Chiranjib; Chandra, Nagasuma R
2006-01-01
Background: In recent times, there has been an exponential rise in the number of protein structures in databases, e.g. PDB. The design of fast algorithms capable of querying such databases is therefore becoming an increasingly important research issue. This paper reports an algorithm, motivated by spectral graph matching techniques, for retrieving protein structures similar to a query structure from a large protein structure database. Each protein structure is specified by the 3D coordinates of the residues of the protein. The algorithm is based on a novel characterization of the residues, called projections, leading to a similarity measure between the residues of the two proteins. This measure is exploited to efficiently compute the optimal equivalences. Results: Experimental results show that the current algorithm outperforms the state of the art on benchmark datasets in terms of speed without losing accuracy. Search results on the SCOP 95% nonredundant database, for fold similarity with 5 proteins from different SCOP classes, show that the current method performs competitively with the standard algorithm CE. The algorithm is also capable of detecting non-topological similarities between two proteins, which is not possible with most state-of-the-art tools such as Dali. PMID:17254310
Investigation of candidate genes for osteoarthritis based on gene expression profiles.
Dong, Shuanghai; Xia, Tian; Wang, Lei; Zhao, Qinghua; Tian, Jiwei
2016-12-01
To explore the mechanism of osteoarthritis (OA) and provide valid biological information for further investigation. Gene expression profile of GSE46750 was downloaded from Gene Expression Omnibus database. The Linear Models for Microarray Data (limma) package (Bioconductor project, http://www.bioconductor.org/packages/release/bioc/html/limma.html) was used to identify differentially expressed genes (DEGs) in inflamed OA samples. Gene Ontology function enrichment analysis and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways enrichment analysis of DEGs were performed based on Database for Annotation, Visualization and Integrated Discovery data, and protein-protein interaction (PPI) network was constructed based on the Search Tool for the Retrieval of Interacting Genes/Proteins database. Regulatory network was screened based on Encyclopedia of DNA Elements. Molecular Complex Detection was used for sub-network screening. Two sub-networks with highest node degree were integrated with transcriptional regulatory network and KEGG functional enrichment analysis was processed for 2 modules. In total, 401 up- and 196 down-regulated DEGs were obtained. Up-regulated DEGs were involved in inflammatory response, while down-regulated DEGs were involved in cell cycle. PPI network with 2392 protein interactions was constructed. Moreover, 10 genes including Interleukin 6 (IL6) and Aurora B kinase (AURKB) were found to be outstanding in PPI network. There are 214 up- and 8 down-regulated transcription factor (TF)-target pairs in the TF regulatory network. Module 1 had TFs including SPI1, PRDM1, and FOS, while module 2 contained FOSL1. The nodes in module 1 were enriched in chemokine signaling pathway, while the nodes in module 2 were mainly enriched in cell cycle. 
The screened DEGs including IL6, AGT, and AURKB might be potential biomarkers for gene therapy for OA by being regulated by TFs such as FOS and SPI1, and participating in the cell cycle and cytokine-cytokine receptor interaction pathway. Copyright © 2016 Turkish Association of Orthopaedics and Traumatology. Production and hosting by Elsevier B.V. All rights reserved.
Integrating In Silico Resources to Map a Signaling Network
Liu, Hanqing; Beck, Tim N.; Golemis, Erica A.; Serebriiskii, Ilya G.
2013-01-01
The abundance of publicly available life science databases offers a wealth of information that can support interpretation of experimentally derived data and greatly enhance hypothesis generation. Protein interaction and functional networks are not simply new renditions of existing data: they provide the opportunity to gain insights into the specific physical and functional role a protein plays as part of the biological system. In this chapter, we describe different in silico tools that can quickly and conveniently retrieve data from existing data repositories and discuss how the available tools are best utilized for different purposes. While emphasizing protein-protein interaction databases (e.g., BioGrid and IntAct), we also introduce metasearch platforms such as STRING and GeneMANIA, pathway databases (e.g., BioCarta and Pathway Commons), text mining approaches (e.g., PubMed and Chilibot), and resources for drug-protein interactions, genetic information for model organisms and gene expression information based on microarray data mining. Furthermore, we provide a simple step-by-step protocol for building customized protein-protein interaction networks in Cytoscape, a powerful network assembly and visualization program, integrating data retrieved from these various databases. As we illustrate, generation of composite interaction networks enables investigators to extract significantly more information about a given biological system than utilization of a single database or sole reliance on primary literature. PMID:24233784
sc-PDB: a 3D-database of ligandable binding sites--10 years on.
Desaphy, Jérémy; Bret, Guillaume; Rognan, Didier; Kellenberger, Esther
2015-01-01
The sc-PDB database (available at http://bioinfo-pharma.u-strasbg.fr/scPDB/) is a comprehensive and up-to-date selection of ligandable binding sites of the Protein Data Bank. Sites are defined from complexes between a protein and a pharmacological ligand. The database provides the all-atom description of the protein, its ligand, their binding site and their binding mode. Currently, the sc-PDB archive registers 9283 binding sites from 3678 unique proteins and 5608 unique ligands. The sc-PDB database was publicly launched in 2004 with the aim of providing structure files suitable for computational approaches to drug design, such as docking. During the last 10 years we have improved and standardized the processes for (i) identifying binding sites, (ii) correcting structures, (iii) annotating protein function and ligand properties and (iv) characterizing their binding mode. This paper presents the latest enhancements in the database, specifically pertaining to the representation of molecular interaction and to the similarity between ligand/protein binding patterns. The new website puts emphasis on pictorial analysis of data. © The Author(s) 2014. Published by Oxford University Press on behalf of Nucleic Acids Research.
Distribution and cluster analysis of predicted intrinsically disordered protein Pfam domains
Williams, Robert W; Xue, Bin; Uversky, Vladimir N; Dunker, A Keith
2013-01-01
The Pfam database groups regions of proteins by how well hidden Markov models (HMMs) can be trained to recognize similarities among them. Conservation pressure is probably in play here. The Pfam seed training set includes sequence and structure information, being drawn largely from the PDB. A long standing hypothesis among intrinsically disordered protein (IDP) investigators has held that conservation pressures are also at play in the evolution of different kinds of intrinsic disorder, but we find that predicted intrinsic disorder (PID) is not always conserved across Pfam domains. Here we analyze distributions and clusters of PID regions in 193024 members of the version 23.0 Pfam seed database. To include the maximum information available for proteins that remain unfolded in solution, we employ the 10 linearly independent Kidera factors [1–3] for the amino acids, combined with PONDR [4] predictions of disorder tendency, to transform the sequences of these Pfam members into an 11 column matrix where the number of rows is the length of each Pfam region. Cluster analyses of the set of all regions, including those that are folded, show 6 groupings of domains. Cluster analyses of domains with mean VSL2b scores greater than 0.5 (half predicted disorder or more) show at least 3 separated groups. It is hypothesized that grouping sets into shorter sequences with more uniform length will reveal more information about intrinsic disorder and lead to more finely structured and perhaps more accurate predictions. HMMs could be trained to include this information. PMID:28516017
Arnold, Roland; Goldenberg, Florian; Mewes, Hans-Werner; Rattei, Thomas
2014-01-01
The Similarity Matrix of Proteins (SIMAP, http://mips.gsf.de/simap/) database has been designed to massively accelerate computationally expensive protein sequence analysis tasks in bioinformatics. It provides pre-calculated sequence similarities interconnecting the entire known protein sequence universe, complemented by pre-calculated protein features and domains, similarity clusters and functional annotations. SIMAP covers all major public protein databases as well as many consistently re-annotated metagenomes from different repositories. As of September 2013, SIMAP contains >163 million proteins corresponding to ∼70 million non-redundant sequences. SIMAP uses the sensitive FASTA search heuristics, the Smith–Waterman alignment algorithm, the InterPro database of protein domain models and the BLAST2GO functional annotation algorithm. SIMAP assists biologists by facilitating the interactive exploration of the protein sequence universe. Web-Service and DAS interfaces allow connecting SIMAP with any other bioinformatic tool and resource. All-against-all protein sequence similarity matrices of project-specific protein collections are generated on request. Recent improvements allow SIMAP to cover the rapidly growing sequenced protein sequence universe. New Web-Service interfaces enhance the connectivity of SIMAP. Novel tools for interactive extraction of protein similarity networks have been added. Open access to SIMAP is provided through the web portal; the portal also contains instructions and links for software access and flat file downloads. PMID:24165881
Gromiha, M Michael; Anoosha, P; Huang, Liang-Tsung
2016-01-01
Protein stability is the free energy difference between unfolded and folded states of a protein, which lies in the range of 5-25 kcal/mol. Experimentally, protein stability is measured with circular dichroism, differential scanning calorimetry, and fluorescence spectroscopy using thermal and denaturant denaturation methods. These experimental data have been accumulated in the form of a database, ProTherm, thermodynamic database for proteins and mutants. It also contains sequence and structure information of a protein, experimental methods and conditions, and literature information. Different features such as search, display, and sorting options and visualization tools have been incorporated in the database. ProTherm is a valuable resource for understanding/predicting the stability of proteins and it can be accessed at http://www.abren.net/protherm/ . ProTherm has been effectively used to examine the relationship among thermodynamics, structure, and function of proteins. We describe the recent progress on the development of methods for understanding/predicting protein stability, such as (1) general trends on mutational effects on stability, (2) relationship between the stability of protein mutants and amino acid properties, (3) applications of protein three-dimensional structures for predicting their stability upon point mutations, (4) prediction of protein stability upon single mutations from amino acid sequence, and (5) prediction methods for addressing double mutants. A list of online resources for predicting has also been provided.
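The definition of stability given above, and the mutational effect derived from it, can be written out explicitly. The sign convention below (positive values mean a stable fold, negative ΔΔG means a destabilizing mutation) is one common choice and is an assumption here; some predictors use the opposite sign. The numbers in the last line are purely illustrative and are not taken from ProTherm.

```latex
% Stability as the free energy of unfolding (U = unfolded, F = folded)
\Delta G = G_{U} - G_{F}

% Relation to the unfolding equilibrium constant K = [U]/[F]
\Delta G = -RT \ln K

% Change in stability upon mutation
\Delta\Delta G = \Delta G_{\mathrm{mutant}} - \Delta G_{\mathrm{wild\ type}}

% Illustrative example (made-up values within the 5-25 kcal/mol range):
% \Delta G_{\mathrm{wild\ type}} = 10, \ \Delta G_{\mathrm{mutant}} = 8
% \Rightarrow \Delta\Delta G = -2 \ \mathrm{kcal/mol} \ \text{(destabilizing)}
```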
Towards an understanding of wheat chloroplasts: a methodical investigation of thylakoid proteome.
Kamal, Abu Hena Mostafa; Cho, Kun; Komatsu, Setsuko; Uozumi, Nobuyuki; Choi, Jong-Soon; Woo, Sun Hee
2012-05-01
We utilized Percoll density gradient centrifugation to isolate and fractionate chloroplasts of Korean winter wheat cultivar cv. Kumgang (Triticum aestivum L.). The resulting protein fractions were separated by one dimensional polyacrylamide gel electrophoresis (1D-PAGE) coupled with LTQ-FTICR mass spectrometry. This enabled us to detect and identify 767 unique proteins. Our findings represent the most comprehensive exploration of a proteome to date. Based on annotation information from the UniProtKB/Swiss-Prot database and our analyses via WoLF PSORT and PSORT, these proteins are localized in the chloroplast (607 proteins), chloroplast stroma (145), thylakoid membrane (342), lumens (163), and integral membranes (166). In all, 67% were confirmed as chloroplast thylakoid proteins. Although nearly complete protein coverage (89% proteins) has been accomplished for the key chloroplast pathways in wheat, such as for photosynthesis, many other proteins are involved in regulating carbon metabolism. The identified proteins were assigned to 103 functional categories according to a classification system developed by the iProClass database and provided through Protein Information Resources. Those functions include electron transport, energy, cellular organization and biogenesis, transport, stress responses, and other metabolic processes. Whereas most of these proteins are associated with known complexes and metabolic pathways, about 13% of the proteins have unknown functions. The chloroplast proteome contains many proteins that are localized to the thylakoids but as yet have no known function. We propose that some of these familiar proteins participate in the photosynthetic pathway. Thus, our new and comprehensive protein profile may provide clues for better understanding that photosynthetic process in wheat.
Schuemie, Martijn J; Mons, Barend; Weeber, Marc; Kors, Jan A
2007-06-01
Gene and protein name identification in text requires a dictionary approach to relate synonyms to the same gene or protein, and to link names to external databases. However, existing dictionaries are incomplete. We investigate two complementary methods for automatic generation of a comprehensive dictionary: combination of information from existing gene and protein databases and rule-based generation of spelling variations. Both methods have been reported in literature before, but have hitherto not been combined and evaluated systematically. We combined gene and protein names from several existing databases of four different organisms. The combined dictionaries showed a substantial increase in recall on three different test sets, as compared to any single database. Application of 23 spelling variation rules to the combined dictionaries further increased recall. However, many rules appeared to have no effect and some appear to have a detrimental effect on precision.
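The two methods described above, merging names from several databases and expanding them with rule-based spelling variations, can be sketched as follows. The three rules shown are illustrative assumptions only; the abstract does not list the 23 rules that were actually evaluated, and the identifiers and function names are hypothetical.

```python
import re


def spelling_variants(name):
    """Generate simple spelling variants of a gene or protein name."""
    variants = {name}
    # Illustrative rule 1: a hyphen may be written as a space or dropped.
    for v in list(variants):
        if "-" in v:
            variants.add(v.replace("-", " "))
            variants.add(v.replace("-", ""))
    # Illustrative rule 2: letters followed by digits may be separated.
    for v in list(variants):
        m = re.fullmatch(r"([A-Za-z]+)(\d+)", v)
        if m:
            variants.add(f"{m.group(1)} {m.group(2)}")
            variants.add(f"{m.group(1)}-{m.group(2)}")
    # Illustrative rule 3: add lowercase forms for case-insensitive matching.
    variants |= {v.lower() for v in variants}
    return variants


def build_dictionary(sources):
    """Merge {identifier: [names]} mappings and expand every name with variants."""
    merged = {}
    for source in sources:
        for gene_id, names in source.items():
            merged.setdefault(gene_id, set()).update(names)
    return {
        gene_id: set().union(*(spelling_variants(n) for n in names))
        for gene_id, names in merged.items()
    }
```

As the abstract notes, each added rule raises recall at a possible cost in precision (the lowercase rule is an obvious candidate), so rules are best evaluated one at a time against a test set.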
Hooper, Cornelia M; Castleden, Ian R; Aryamanesh, Nader; Jacoby, Richard P; Millar, A Harvey
2016-01-01
Barley, wheat, rice and maize provide the bulk of human nutrition and have extensive industrial use as agricultural products. The genomes of these crops each contain >40,000 genes encoding proteins; however, the major genome databases for these species lack annotation information of protein subcellular location for >80% of these gene products. We address this gap by constructing the compendium of crop protein subcellular locations called crop Proteins with Annotated Locations (cropPAL). Subcellular location is most commonly determined by fluorescent protein tagging of live cells or mass spectrometry detection in subcellular purifications, but can also be predicted from amino acid sequence or protein expression patterns. The cropPAL database collates 556 previously published studies from >300 research institutes in >30 countries, and compiles eight pre-computed subcellular predictions for all Hordeum vulgare, Triticum aestivum, Oryza sativa and Zea mays protein sequences. The data collection, including metadata for proteins and published studies, can be accessed through a search portal at http://crop-PAL.org. The subcellular localization information housed in cropPAL helps to depict plant cells as compartmentalized protein networks that can be investigated for improving crop yield and quality, and developing new biotechnological solutions to agricultural challenges. © The Author 2015. Published by Oxford University Press on behalf of Japanese Society of Plant Physiologists. All rights reserved. For permissions, please email: journals.permissions@oup.com.
iRefWeb: interactive analysis of consolidated protein interaction data and their supporting evidence
Turner, Brian; Razick, Sabry; Turinsky, Andrei L.; Vlasblom, James; Crowdy, Edgard K.; Cho, Emerson; Morrison, Kyle; Wodak, Shoshana J.
2010-01-01
We present iRefWeb, a web interface to protein interaction data consolidated from 10 public databases: BIND, BioGRID, CORUM, DIP, IntAct, HPRD, MINT, MPact, MPPI and OPHID. iRefWeb enables users to examine aggregated interactions for a protein of interest, and presents various statistical summaries of the data across databases, such as the number of organism-specific interactions, proteins and cited publications. Through links to source databases and supporting evidence, researchers may gauge the reliability of an interaction using simple criteria, such as the detection methods, the scale of the study (high- or low-throughput) or the number of cited publications. Furthermore, iRefWeb compares the information extracted from the same publication by different databases, and offers means to follow-up possible inconsistencies. We provide an overview of the consolidated protein–protein interaction landscape and show how it can be automatically cropped to aid the generation of meaningful organism-specific interactomes. iRefWeb can be accessed at: http://wodaklab.org/iRefWeb. Database URL: http://wodaklab.org/iRefWeb/ PMID:20940177
López, Yosvany; Nakai, Kenta; Patil, Ashwini
2015-01-01
HitPredict is a consolidated resource of experimentally identified, physical protein-protein interactions with confidence scores to indicate their reliability. The study of genes and their inter-relationships using methods such as network and pathway analysis requires high quality protein-protein interaction information. Extracting reliable interactions from most of the existing databases is challenging because they either contain only a subset of the available interactions, or a mixture of physical, genetic and predicted interactions. Automated integration of interactions is further complicated by varying levels of accuracy of database content and lack of adherence to standard formats. To address these issues, the latest version of HitPredict provides a manually curated dataset of 398 696 physical associations between 70 808 proteins from 105 species. Manual confirmation was used to resolve all issues encountered during data integration. For improved reliability assessment, this version combines a new score derived from the experimental information of the interactions with the original score based on the features of the interacting proteins. The combined interaction score performs better than either of the individual scores in HitPredict as well as the reliability score of another similar database. HitPredict provides a web interface to search proteins and visualize their interactions, and the data can be downloaded for offline analysis. Data usability has been enhanced by mapping protein identifiers across multiple reference databases. Thus, the latest version of HitPredict provides a significantly larger, more reliable and usable dataset of protein-protein interactions from several species for the study of gene groups. Database URL: http://hintdb.hgc.jp/htp. © The Author(s) 2015. Published by Oxford University Press.
Exploiting genomic data to identify proteins involved in abalone reproduction.
Mendoza-Porras, Omar; Botwright, Natasha A; McWilliam, Sean M; Cook, Mathew T; Harris, James O; Wijffels, Gene; Colgrave, Michelle L
2014-08-28
Aside from their critical role in reproduction, abalone gonads serve as an indicator of sexual maturity and energy balance, two key considerations for effective abalone culture. Temperate abalone farmers face issues with tank restocking with highly marketable abalone owing to inefficient spawning induction methods. The identification of key proteins in sexually mature abalone will serve as the foundation for a greater understanding of reproductive biology. Addressing this knowledge gap is the first step towards improving abalone aquaculture methods. Proteomic profiling of female and male gonads of greenlip abalone, Haliotis laevigata, was undertaken using liquid chromatography-mass spectrometry. Owing to the incomplete nature of abalone protein databases, in addition to searching against two publicly available databases, a custom database comprising genomic data was used. Overall, 162 and 110 proteins were identified in females and males respectively with 40 proteins common to both sexes. For proteins involved in sexual maturation, sperm and egg structure, motility, acrosomal reaction and fertilization, 23 were identified only in females, 18 only in males and 6 were common. Gene ontology analysis revealed clear differences between the female and male protein profiles reflecting a higher rate of protein synthesis in the ovary and higher metabolic activity in the testis. A comprehensive mass spectrometry-based analysis was performed to profile the abalone gonad proteome providing the foundation for future studies of reproduction in abalone. Key proteins involved in both reproduction and energy balance were identified. Genomic resources were utilised to build a database of molluscan proteins yielding >60% more protein identifications than in a standard workflow employing public protein databases. Copyright © 2014 Elsevier B.V. All rights reserved.
Chatonnet, A; Hotelier, T; Cousin, X
1999-05-14
Cholinesterases are targets for organophosphorus compounds, which are used as insecticides, chemical warfare agents and drugs for the treatment of diseases such as glaucoma or parasitic infections. The widespread use of these chemicals explains the growth of this area of research and the ever-increasing number of sequences, structures and biochemical data available. Future advances will depend upon effective management of existing information as well as upon creation of new knowledge. The goal of the ESTHER database is to facilitate retrieval and comparison of data about the structure and function of proteins presenting the alpha/beta hydrolase fold. Protein engineering and in vitro production of enzymes allow direct comparison of biochemical parameters. Kinetic parameters of enzymatic reactions are now included in the database. These parameters can be searched and compared with a table construction tool. ESTHER can be reached through the internet (http://www.ensam.inra.fr/cholinesterase). The full database or the specialised X-window Client-server system can be downloaded from our ftp server (ftp://ftp.toulouse.inra.fr./pub/esther). Forms can be used to send updates or corrections directly from the web.
Mapping PDB chains to UniProtKB entries.
Martin, Andrew C R
2005-12-01
UniProtKB/SwissProt is the main resource for detailed annotations of protein sequences. This database provides a jumping-off point to many other resources through the links it provides. Among others, these include other primary databases, secondary databases, the Gene Ontology and OMIM. While a large number of links are provided to Protein Data Bank (PDB) files, obtaining a regularly updated mapping between UniProtKB entries and PDB entries at the chain or residue level is not straightforward. In particular, there is no regularly updated resource which allows a UniProtKB/SwissProt entry to be identified for a given residue of a PDB file. We have created a completely automatically maintained database which maps PDB residues to residues in UniProtKB/SwissProt and UniProtKB/trEMBL entries. The protocol uses links from PDB to UniProtKB, from UniProtKB to PDB and a brute-force sequence scan to resolve PDB chains for which no annotated link is available. Finally the sequences from PDB and UniProtKB are aligned to obtain a residue-level mapping. The resource may be queried interactively or downloaded from http://www.bioinf.org.uk/pdbsws/.
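The final step described above, aligning a PDB chain sequence against a UniProtKB sequence to obtain a residue-level mapping, can be illustrated with a small global alignment. The abstract does not say which alignment algorithm or scoring scheme the resource uses; the Needleman-Wunsch recurrence and the match, mismatch and gap scores below are assumptions chosen for the sketch, as is the function name.

```python
def align_residues(pdb_seq, uniprot_seq, match=1, mismatch=-1, gap=-2):
    """Globally align two sequences; return {pdb_index: uniprot_index} (0-based)."""
    n, m = len(pdb_seq), len(uniprot_seq)

    def pair_score(i, j):
        return match if pdb_seq[i - 1] == uniprot_seq[j - 1] else mismatch

    # Fill the dynamic-programming score matrix.
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + pair_score(i, j),  # align the two residues
                score[i - 1][j] + gap,                   # gap in the UniProt sequence
                score[i][j - 1] + gap,                   # gap in the PDB sequence
            )

    # Trace back from the bottom-right corner, recording aligned pairs.
    mapping = {}
    i, j = n, m
    while i > 0 and j > 0:
        if score[i][j] == score[i - 1][j - 1] + pair_score(i, j):
            mapping[i - 1] = j - 1
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    return mapping
```

Note that the indices are positions in the two sequence strings. Mapping them to PDB residue numbers (which may include insertion codes or missing residues) is a separate bookkeeping step that this sketch does not attempt.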
Glycan fragment database: a database of PDB-based glycan 3D structures.
Jo, Sunhwan; Im, Wonpil
2013-01-01
The glycan fragment database (GFDB), freely available at http://www.glycanstructure.org, is a database of the glycosidic torsion angles derived from the glycan structures in the Protein Data Bank (PDB). Analogous to protein structure, the structure of an oligosaccharide chain in a glycoprotein, referred to as a glycan, can be characterized by the torsion angles of glycosidic linkages between relatively rigid carbohydrate monomeric units. Knowledge of accessible conformations of biologically relevant glycans is essential in understanding their biological roles. The GFDB provides an intuitive glycan sequence search tool that allows the user to search complex glycan structures. After a glycan search is complete, each glycosidic torsion angle distribution is displayed in terms of the exact match and the fragment match. The exact match results are from the PDB entries that contain the glycan sequence identical to the query sequence. The fragment match results are from the entries with the glycan sequence whose substructure (fragment) or entire sequence is matched to the query sequence, such that the fragment results implicitly include the influences from the nearby carbohydrate residues. In addition, clustering analysis based on the torsion angle distribution can be performed to obtain the representative structures among the searched glycan structures.
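A glycosidic torsion angle of the kind stored in such a database is a dihedral angle defined by four atom positions. The sketch below computes it from Cartesian coordinates with the standard atan2 formulation. Which four atoms define each angle depends on the torsion definition in use, which the abstract does not state, so the function simply takes four points; the function name is hypothetical.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def torsion_angle(p1, p2, p3, p4):
    """Dihedral angle in degrees, in [-180, 180], for the bond p2-p3."""
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    b3 = _sub(p4, p3)
    n1 = _cross(b1, b2)  # normal of the plane through p1, p2, p3
    n2 = _cross(b2, b3)  # normal of the plane through p2, p3, p4
    b2_len = math.sqrt(_dot(b2, b2))
    x = _dot(n1, n2)
    y = _dot(_cross(n1, n2), b2) / b2_len
    return math.degrees(math.atan2(y, x))
```

With this sign convention, looking along the central bond from `p2` to `p3`, a clockwise rotation from the front bond to the rear bond gives a positive angle.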
De novo transcriptome assembly databases for the butterfly orchid Phalaenopsis equestris
Niu, Shan-Ce; Xu, Qing; Zhang, Guo-Qiang; Zhang, Yong-Qiang; Tsai, Wen-Chieh; Hsu, Jui-Ling; Liang, Chieh-Kai; Luo, Yi-Bo; Liu, Zhong-Jian
2016-01-01
Orchids are renowned for their spectacular flowers and ecological adaptations. After the sequencing of the genome of the tropical epiphytic orchid Phalaenopsis equestris, we combined Illumina HiSeq2000 for RNA-Seq and Trinity for de novo assembly to characterize the transcriptomes for 11 diverse P. equestris tissues representing the root, stem, leaf, flower buds, column, lip, petal, sepal and three developmental stages of seeds. Our aims were to contribute to a better understanding of the molecular mechanisms driving the analysed tissue characteristics and to enrich the available data for P. equestris. Here, we present three databases. The first dataset is the RNA-Seq raw reads, which can be used to execute new experiments with different analysis approaches. The other two datasets allow different types of searches for candidate homologues. The second dataset includes the sets of assembled unigenes and predicted coding sequences and proteins, enabling a sequence-based search. The third dataset consists of the annotation results of the aligned unigenes versus the Nonredundant (Nr) protein database, Kyoto Encyclopaedia of Genes and Genomes (KEGG) and Clusters of Orthologous Groups (COG) databases with low e-values, enabling a name-based search. PMID:27673730
Meta sequence analysis of human blood peptides and their parent proteins.
Bowden, Peter; Pendrak, Voitek; Zhu, Peihong; Marshall, John G
2010-04-18
Sequence analysis of the blood peptides and their qualities will be key to understanding the mechanisms that contribute to error in LC-ESI-MS/MS. Analysis of peptides and their proteins at the level of sequences is much more direct and informative than the comparison of disparate accession numbers. A portable database of all blood peptide and protein sequences with descriptor fields and gene ontology terms might be useful for designing immunological or MRM assays from human blood. The results of twelve studies of human blood peptides and/or proteins identified by LC-MS/MS and correlated against a disparate array of genetic libraries were parsed and matched to proteins from the human ENSEMBL, SwissProt and RefSeq databases by SQL. The reported peptide and protein sequences were organized into an SQL database with full protein sequences and up to five unique peptides in order of prevalence along with the peptide count for each protein. Structured query language or BLAST was used to acquire descriptive information in current databases. Sampling error at the level of peptides is the largest source of disparity between groups. Chi Square analysis of peptide to protein distributions confirmed the significant agreement between groups on identified proteins. Copyright 2010. Published by Elsevier B.V.
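The organization described above, a SQL database holding each protein with its peptide count and up to five peptides in order of prevalence, might look like the following sketch using SQLite. The table and column names are hypothetical, and the original schema is not given in the abstract.

```python
import sqlite3


def build_db(observations):
    """Load (protein_accession, peptide_sequence) observations into an in-memory database."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE peptide_obs (protein TEXT, peptide TEXT)")
    con.executemany("INSERT INTO peptide_obs VALUES (?, ?)", observations)
    return con


def top_peptides(con, limit=5):
    """Per protein: total peptide observations and the most prevalent peptides."""
    rows = con.execute(
        "SELECT protein, peptide, COUNT(*) AS n FROM peptide_obs "
        "GROUP BY protein, peptide ORDER BY protein, n DESC, peptide"
    ).fetchall()
    summary = {}
    for protein, peptide, n in rows:
        entry = summary.setdefault(protein, {"peptide_count": 0, "top": []})
        entry["peptide_count"] += n
        # Rows arrive sorted by prevalence, so the first `limit` are the top ones.
        if len(entry["top"]) < limit:
            entry["top"].append((peptide, n))
    return summary
```

Keying the tables on sequences rather than accession numbers, as the abstract argues, is what lets results from studies searched against different libraries be compared directly.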
Gene and protein nomenclature in public databases
Fundel, Katrin; Zimmer, Ralf
2006-01-01
Background: Frequently, several alternative names are in use for biological objects such as genes and proteins. Applications like manual literature search, automated text-mining, named entity identification, gene/protein annotation, and linking of knowledge from different information sources require the knowledge of all used names referring to a given gene or protein. Various organism-specific or general public databases aim at organizing knowledge about genes and proteins. These databases can be used for deriving gene and protein name dictionaries. So far, little is known about the differences between databases in terms of size, ambiguities and overlap. Results: We compiled five gene and protein name dictionaries for each of the five model organisms (yeast, fly, mouse, rat, and human) from different organism-specific and general public databases. We analyzed the degree of ambiguity of gene and protein names within and between dictionaries, to a lexicon of common English words and domain-related non-gene terms, and we compared different data sources in terms of size of extracted dictionaries and overlap of synonyms between those. The study shows that the number of genes/proteins and synonyms covered in individual databases varies significantly for a given organism, and that the degree of ambiguity of synonyms varies significantly between different organisms. Furthermore, it shows that, despite considerable efforts of co-curation, the overlap of synonyms in different data sources is rather moderate and that the degree of ambiguity of gene names with common English words and domain-related non-gene terms varies depending on the considered organism. Conclusion: In conclusion, these results indicate that the combination of data contained in different databases allows the generation of gene and protein name dictionaries that contain significantly more used names than dictionaries obtained from individual data sources. 
Furthermore, curation of combined dictionaries considerably increases size and decreases ambiguity. The entries of the curated synonym dictionary are available for manual querying, editing, and PubMed- or Google-search via the ProThesaurus-wiki. For automated querying via custom software, we offer a web service and an exemplary client application. PMID:16899134
EuPathDB: the eukaryotic pathogen genomics database resource
Aurrecoechea, Cristina; Barreto, Ana; Basenko, Evelina Y.; Brestelli, John; Brunk, Brian P.; Cade, Shon; Crouch, Kathryn; Doherty, Ryan; Falke, Dave; Fischer, Steve; Gajria, Bindu; Harb, Omar S.; Heiges, Mark; Hertz-Fowler, Christiane; Hu, Sufen; Iodice, John; Kissinger, Jessica C.; Lawrence, Cris; Li, Wei; Pinney, Deborah F.; Pulman, Jane A.; Roos, David S.; Shanmugasundram, Achchuthan; Silva-Franco, Fatima; Steinbiss, Sascha; Stoeckert, Christian J.; Spruill, Drew; Wang, Haiming; Warrenfeltz, Susanne; Zheng, Jie
2017-01-01
The Eukaryotic Pathogen Genomics Database Resource (EuPathDB, http://eupathdb.org) is a collection of databases covering 170+ eukaryotic pathogens (protists & fungi), along with relevant free-living and non-pathogenic species, and select pathogen hosts. To facilitate the discovery of meaningful biological relationships, the databases couple preconfigured searches with visualization and analysis tools for comprehensive data mining via intuitive graphical interfaces and APIs. All data are analyzed with the same workflows, including creation of gene orthology profiles, so data are easily compared across data sets, data types and organisms. EuPathDB is updated with numerous new analysis tools, features, data sets and data types. New tools include GO, metabolic pathway and word enrichment analyses plus an online workspace for analysis of personal, non-public, large-scale data. Expanded data content is mostly genomic and functional genomic data while new data types include protein microarray, metabolic pathways, compounds, quantitative proteomics, copy number variation, and polysomal transcriptomics. New features include consistent categorization of searches, data sets and genome browser tracks; redesigned gene pages; effective integration of alternative transcripts; and a EuPathDB Galaxy instance for private analyses of a user's data. Forthcoming upgrades include user workspaces for private integration of data with existing EuPathDB data and improved integration and presentation of host–pathogen interactions. PMID:27903906
MultitaskProtDB: a database of multitasking proteins
Hernández, Sergio; Ferragut, Gabriela; Amela, Isaac; Perez-Pons, JosepAntoni; Piñol, Jaume; Mozo-Villarias, Angel; Cedano, Juan; Querol, Enrique
2014-01-01
We have compiled MultitaskProtDB, available online at http://wallace.uab.es/multitask, to provide a repository where the many multitasking proteins found in the literature can be stored. Multitasking or moonlighting is the capability of some proteins to execute two or more biological functions. Usually, multitasking proteins are experimentally revealed by serendipity. This ability of proteins to perform multitasking functions helps us to understand one of the ways used by cells to perform many complex functions with a limited number of genes. Even so, the study of this phenomenon is complex because, among other things, there has been no database of moonlighting proteins. The existence of such a tool facilitates the collection and dissemination of these important data. This work reports the database, MultitaskProtDB, which is designed as a user-friendly web page containing >288 multitasking proteins with their NCBI and UniProt accession numbers, canonical and additional biological functions, monomeric/oligomeric states, PDB codes when available and bibliographic references. This database also serves to gain insight into some characteristics of multitasking proteins such as frequencies of the different pairs of functions, phylogenetic conservation and so forth. PMID:24253302
Chandonia, John-Marc; Fox, Naomi K; Brenner, Steven E
2017-02-03
SCOPe (Structural Classification of Proteins-extended, http://scop.berkeley.edu) is a database of relationships between protein structures that extends the Structural Classification of Proteins (SCOP) database. SCOP is an expert-curated ordering of domains from the majority of proteins of known structure in a hierarchy according to structural and evolutionary relationships. SCOPe classifies the majority of protein structures released since SCOP development concluded in 2009, using a combination of manual curation and highly precise automated tools, aiming to have the same accuracy as fully hand-curated SCOP releases. SCOPe also incorporates and updates the ASTRAL compendium, which provides several databases and tools to aid in the analysis of the sequences and structures of proteins classified in SCOPe. SCOPe continues high-quality manual classification of new superfamilies, a key feature of SCOP. Artifacts such as expression tags are now separated into their own class, in order to distinguish them from the homology-based annotations in the remainder of the SCOPe hierarchy. SCOPe 2.06 contains 77,439 Protein Data Bank entries, double the 38,221 structures classified in SCOP. Copyright © 2016 The Author(s). Published by Elsevier Ltd. All rights reserved.
PrionScan: an online database of predicted prion domains in complete proteomes.
Espinosa Angarica, Vladimir; Angulo, Alfonso; Giner, Arturo; Losilla, Guillermo; Ventura, Salvador; Sancho, Javier
2014-02-05
Prions are a particular type of amyloid related to a large variety of important processes in cells, but also responsible for serious diseases in mammals and humans. The number of experimentally characterized prions is still low and corresponds to a handful of examples in microorganisms and mammals. Prion aggregation is mediated by specific protein domains with a remarkable compositional bias towards glutamine/asparagine and against charged residues and prolines. These compositional features have been used to predict new prion proteins in the genomes of different organisms. Despite these efforts, there are only a few available data sources containing prion predictions at a genomic scale. Here we present PrionScan, a new database of predicted prion-like domains in complete proteomes. We have previously developed a predictive methodology to identify and score prionogenic stretches in protein sequences. In the present work, we exploit this approach to scan all the protein sequences in public databases and compile a repository containing relevant information on proteins bearing prion-like domains. The database is updated regularly alongside UniProtKB and in its present version contains approximately 28000 predictions in proteins from different functional categories in more than 3200 organisms from all the taxonomic subdivisions. PrionScan can be used in two different ways: database query and analysis of protein sequences submitted by the users. In the first mode, simple queries retrieve a detailed description of the properties of a defined protein. Queries can also be combined to generate more complex and specific search patterns. In the second mode, users can submit and analyze their own sequences. This database is expected to provide relevant insights into prion functions and regulation from a genome-wide perspective, allowing researchers to perform cross-species prion biology studies.
Our database might also be useful for guiding experimentalists in the identification of new candidates for further experimental characterization.
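The compositional bias described in the PrionScan abstract (enrichment in glutamine/asparagine, depletion of charged residues and prolines) can be illustrated with a simple sliding-window scan. The following is a minimal sketch only and is not the PrionScan scoring method: the function names, the scoring rule (Q/N count minus charged/proline count, divided by window length), the window size, the threshold and the toy sequence are all assumptions made for illustration.

```python
# Illustrative sliding-window composition scan (NOT the PrionScan algorithm).
# Scoring rule, window size and threshold are assumptions chosen for the sketch.

FAVOURED = set("QN")        # residues enriched in prion-like domains
DISFAVOURED = set("DEKRP")  # charged residues and proline


def composition_score(segment):
    """Return (Q/N count - charged/proline count) / segment length."""
    if not segment:
        return 0.0
    plus = sum(1 for aa in segment if aa in FAVOURED)
    minus = sum(1 for aa in segment if aa in DISFAVOURED)
    return (plus - minus) / len(segment)


def scan(sequence, window=20, threshold=0.4):
    """Return (start, score) for every window scoring at or above threshold."""
    sequence = sequence.upper()
    hits = []
    for start in range(len(sequence) - window + 1):
        score = composition_score(sequence[start:start + window])
        if score >= threshold:
            hits.append((start, score))
    return hits


if __name__ == "__main__":
    # Hypothetical toy sequence: a Q/N-rich stretch flanked by charged residues.
    toy = "MSDEKRPE" + "QNQNQQNNYG" * 3 + "KLPEDRKLPE"
    for start, score in scan(toy)[:3]:
        print(start, round(score, 2))
```

Windows that overlap the charged flanks score lower than windows lying entirely inside the Q/N-rich stretch, which is the qualitative behaviour the abstract describes.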
Bhawna; Bonthala, V.S.; Gajula, MNV Prasad
2016-01-01
The common bean [Phaseolus vulgaris (L.)] is one of the essential proteinaceous vegetables grown in developing countries. However, its production is challenged by low yields caused by numerous biotic and abiotic stress conditions. Regulatory transcription factors (TFs) represent a key component of the genome and are the most significant targets for producing stress-tolerant crops; hence, functional genomic studies of these TFs are important. Therefore, we have constructed a web-accessible TF database for P. vulgaris, called PvTFDB, which contains 2370 putative TF gene models in 49 TF families. This database provides comprehensive information for each identified TF, including sequence data, functional annotation, SSRs with their primer sets, protein physical properties, chromosomal location, phylogeny, tissue-specific gene expression data, orthologues, cis-regulatory elements and gene ontology (GO) assignment. Altogether, this information can be used to expedite functional genomic studies of specific TFs of interest. The objectives of this database are to support functional genomics studies of common bean TFs and to help recognize the regulatory mechanisms underlying various stress responses, thereby easing breeding strategies for variety production. It offers search interfaces (by gene ID and by functional annotation) and browsing interfaces (by family and by chromosome). This database will also serve as a promising central repository for researchers as well as breeders who are working towards the improvement of legume crops. In addition, this database gives users unrestricted public access, and all of the data in the database can be downloaded freely. Database URL: http://www.multiomics.in/PvTFDB/ PMID:27465131
Proteomics: Protein Identification Using Online Databases
ERIC Educational Resources Information Center
Eurich, Chris; Fields, Peter A.; Rice, Elizabeth
2012-01-01
Proteomics is an emerging area of systems biology that allows simultaneous study of thousands of proteins expressed in cells, tissues, or whole organisms. We have developed this activity to enable high school or college students to explore proteomic databases using mass spectrometry data files generated from yeast proteins in a college laboratory…
O'Leary, Nuala A; Wright, Mathew W; Brister, J Rodney; Ciufo, Stacy; Haddad, Diana; McVeigh, Rich; Rajput, Bhanu; Robbertse, Barbara; Smith-White, Brian; Ako-Adjei, Danso; Astashyn, Alexander; Badretdin, Azat; Bao, Yiming; Blinkova, Olga; Brover, Vyacheslav; Chetvernin, Vyacheslav; Choi, Jinna; Cox, Eric; Ermolaeva, Olga; Farrell, Catherine M; Goldfarb, Tamara; Gupta, Tripti; Haft, Daniel; Hatcher, Eneida; Hlavina, Wratko; Joardar, Vinita S; Kodali, Vamsi K; Li, Wenjun; Maglott, Donna; Masterson, Patrick; McGarvey, Kelly M; Murphy, Michael R; O'Neill, Kathleen; Pujar, Shashikant; Rangwala, Sanjida H; Rausch, Daniel; Riddick, Lillian D; Schoch, Conrad; Shkeda, Andrei; Storz, Susan S; Sun, Hanzhen; Thibaud-Nissen, Francoise; Tolstoy, Igor; Tully, Raymond E; Vatsan, Anjana R; Wallin, Craig; Webb, David; Wu, Wendy; Landrum, Melissa J; Kimchi, Avi; Tatusova, Tatiana; DiCuccio, Michael; Kitts, Paul; Murphy, Terence D; Pruitt, Kim D
2016-01-04
The RefSeq project at the National Center for Biotechnology Information (NCBI) maintains and curates a publicly available database of annotated genomic, transcript, and protein sequence records (http://www.ncbi.nlm.nih.gov/refseq/). The RefSeq project leverages the data submitted to the International Nucleotide Sequence Database Collaboration (INSDC) against a combination of computation, manual curation, and collaboration to produce a standard set of stable, non-redundant reference sequences. The RefSeq project augments these reference sequences with current knowledge including publications, functional features and informative nomenclature. The database currently represents sequences from more than 55,000 organisms (>4800 viruses, >40,000 prokaryotes and >10,000 eukaryotes; RefSeq release 71), ranging from a single record to complete genomes. This paper summarizes the current status of the viral, prokaryotic, and eukaryotic branches of the RefSeq project, reports on improvements to data access and details efforts to further expand the taxonomic representation of the collection. We also highlight diverse functional curation initiatives that support multiple uses of RefSeq data including taxonomic validation, genome annotation, comparative genomics, and clinical testing. We summarize our approach to utilizing available RNA-Seq and other data types in our manual curation process for vertebrate, plant, and other species, and describe a new direction for prokaryotic genomes and protein name management. Published by Oxford University Press on behalf of Nucleic Acids Research 2015. This work is written by (a) US Government employee(s) and is in the public domain in the US.
SwePep, a database designed for endogenous peptides and mass spectrometry.
Fälth, Maria; Sköld, Karl; Norrman, Mathias; Svensson, Marcus; Fenyö, David; Andren, Per E
2006-06-01
A new database, SwePep, specifically designed for endogenous peptides, has been constructed to significantly speed up the identification process from complex tissue samples utilizing mass spectrometry. In the identification process the experimental peptide masses are compared with the peptide masses stored in the database both with and without possible post-translational modifications. This intermediate identification step is fast and singles out peptides that are potential endogenous peptides and can later be confirmed with tandem mass spectrometry data. Successful applications of this methodology are presented. The SwePep database is a relational database developed using MySql and Java. The database contains 4180 annotated endogenous peptides from different tissues originating from 394 different species as well as 50 novel peptides from brain tissue identified in our laboratory. Information about the peptides, including mass, isoelectric point, sequence, and precursor protein, is also stored in the database. This new approach holds great potential for removing the bottleneck that occurs during the identification process in the field of peptidomics. The SwePep database is available to the public.
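The intermediate identification step described for SwePep, comparing experimental peptide masses with stored masses both with and without post-translational modifications, can be sketched as follows. This is an illustration under stated assumptions, not SwePep's implementation: the function names, the 10 ppm tolerance, the two modifications considered and the toy peptide list are hypothetical. The residue masses are standard monoisotopic values.

```python
# Sketch of mass-based peptide matching (NOT the SwePep implementation).
# Tolerance, modification list and peptide list are illustrative assumptions.

# Standard monoisotopic residue masses in daltons.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # added once per peptide for the free termini

# Mass shifts of a few common modifications (daltons); "none" = unmodified.
MODIFICATIONS = {"none": 0.0, "acetyl": 42.01057, "phospho": 79.96633}


def peptide_mass(sequence, delta=0.0):
    """Monoisotopic mass of a peptide plus an optional modification shift."""
    return sum(RESIDUE_MASS[aa] for aa in sequence) + WATER + delta


def match_masses(observed_masses, peptides, tolerance_ppm=10.0):
    """Return (observed, sequence, modification) for every mass within tolerance."""
    matches = []
    for observed in observed_masses:
        for sequence in peptides:
            for name, delta in MODIFICATIONS.items():
                theoretical = peptide_mass(sequence, delta)
                error_ppm = abs(observed - theoretical) / theoretical * 1e6
                if error_ppm <= tolerance_ppm:
                    matches.append((observed, sequence, name))
    return matches


if __name__ == "__main__":
    for match in match_masses([132.0535, 174.0641], ["GG", "AG"]):
        print(match)
```

Candidates found this way are only potential identifications; as the abstract notes, they would still need confirmation with tandem mass spectrometry data.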
From sequence to enzyme mechanism using multi-label machine learning.
De Ferrari, Luna; Mitchell, John B O
2014-05-19
In this work we predict enzyme function at the level of chemical mechanism, providing a finer granularity of annotation than traditional Enzyme Commission (EC) classes. Hence we can predict not only whether a putative enzyme in a newly sequenced organism has the potential to perform a certain reaction, but how the reaction is performed, using which cofactors and with susceptibility to which drugs or inhibitors, details with important consequences for drug and enzyme design. Work that predicts enzyme catalytic activity based on 3D protein structure features limits the prediction of mechanism to proteins already having either a solved structure or a close relative suitable for homology modelling. In this study, we evaluate whether sequence identity, InterPro or Catalytic Site Atlas sequence signatures provide enough information for bulk prediction of enzyme mechanism. By splitting MACiE (Mechanism, Annotation and Classification in Enzymes database) mechanism labels to a finer granularity, which includes the role of the protein chain in the overall enzyme complex, the method can predict at 96% accuracy (and 96% micro-averaged precision, 99.9% macro-averaged recall) the MACiE mechanism definitions of 248 proteins available in the MACiE, EzCatDb (Database of Enzyme Catalytic Mechanisms) and SFLD (Structure Function Linkage Database) databases using an off-the-shelf K-Nearest Neighbours multi-label algorithm. We find that InterPro signatures are critical for accurate prediction of enzyme mechanism. We also find that incorporating Catalytic Site Atlas attributes does not seem to provide additional accuracy. The software code (ml2db), data and results are available online at http://sourceforge.net/projects/ml2db/ and as supplementary files.
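A multi-label k-nearest-neighbours prediction over sequence-signature features, as mentioned in the abstract above, can be sketched in a few lines. This is not the ml2db code: the Jaccard similarity, the majority-vote rule, the signature identifiers (`sigA` ...) and the mechanism labels (`M1` ...) are hypothetical placeholders chosen for illustration.

```python
# Minimal multi-label k-nearest-neighbours sketch (NOT the ml2db implementation).
# Signature IDs and mechanism labels below are made-up placeholders.
from collections import Counter

# Each training example: (set of signature IDs, set of mechanism labels).
TRAINING = [
    ({"sigA", "sigB"}, {"M1"}),
    ({"sigA", "sigB", "sigC"}, {"M1", "M2"}),
    ({"sigA", "sigC"}, {"M1", "M2"}),
    ({"sigD"}, {"M3"}),
    ({"sigD", "sigE"}, {"M3"}),
]


def jaccard(a, b):
    """Jaccard similarity between two sets of signatures."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def predict_labels(query, training=TRAINING, k=3):
    """Predict every label carried by more than half of the k nearest examples."""
    ranked = sorted(training, key=lambda ex: jaccard(query, ex[0]), reverse=True)
    neighbours = ranked[:k]
    votes = Counter(label for _, labels in neighbours for label in labels)
    return {label for label, count in votes.items() if 2 * count > len(neighbours)}


if __name__ == "__main__":
    print(sorted(predict_labels({"sigA", "sigB", "sigC"})))
    print(sorted(predict_labels({"sigD"}, k=2)))
```

Because each protein can receive several labels at once, the vote is taken per label rather than picking a single winning class.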
Senatore, Adriano; Edirisinghe, Neranjan; Katz, Paul S.
2015-01-01
Background The sea slug Tritonia diomedea (Mollusca, Gastropoda, Nudibranchia), has a simple and highly accessible nervous system, making it useful for studying neuronal and synaptic mechanisms underlying behavior. Although many important contributions have been made using Tritonia, until now, a lack of genetic information has impeded exploration at the molecular level. Results We performed Illumina sequencing of central nervous system mRNAs from Tritonia, generating 133.1 million 100 base pair, paired-end reads. De novo reconstruction of the RNA-Seq data yielded a total of 185,546 contigs, which partitioned into 123,154 non-redundant gene clusters (unigenes). BLAST comparison with RefSeq and Swiss-Prot protein databases, as well as mRNA data from other invertebrates (gastropod molluscs: Aplysia californica, Lymnaea stagnalis and Biomphalaria glabrata; cnidarian: Nematostella vectensis) revealed that up to 76,292 unigenes in the Tritonia transcriptome have putative homologues in other databases, 18,246 of which are below a more stringent E-value cut-off of 1 × 10⁻⁶. In silico prediction of secreted proteins from the Tritonia transcriptome shotgun assembly (TSA) produced a database of 579 unique sequences of secreted proteins, which also exhibited markedly higher expression levels compared to other genes in the TSA. Conclusions Our efforts greatly expand the availability of gene sequences available for Tritonia diomedea. We were able to extract full length protein sequences for most queried genes, including those involved in electrical excitability, synaptic vesicle release and neurotransmission, thus confirming that the transcriptome will serve as a useful tool for probing the molecular correlates of behavior in this species. We also generated a neurosecretome database that will serve as a useful tool for probing peptidergic signalling systems in the Tritonia brain. PMID:25719197
Characterization of the Saccharomyces cerevisiae ATP-Interactome using the iTRAQ-SPROX Technique
NASA Astrophysics Data System (ADS)
Geer, M. Ariel; Fitzgerald, Michael C.
2016-02-01
The stability of proteins from rates of oxidation (SPROX) technique was used in combination with an isobaric mass tagging strategy to identify adenosine triphosphate (ATP) interacting proteins in the Saccharomyces cerevisiae proteome. The SPROX methodology utilized in this work enabled 373 proteins in a yeast cell lysate to be assayed for ATP interactions (both direct and indirect) using the non-hydrolyzable ATP analog, adenylyl imidodiphosphate (AMP-PNP). A total of 28 proteins were identified with AMP-PNP-induced thermodynamic stability changes. These protein hits included 14 proteins that were previously annotated as ATP-binding proteins in the Saccharomyces Genome Database (SGD). The 14 non-annotated ATP-binding proteins included nine proteins that were previously found to be ATP-sensitive in an earlier SPROX study using a stable isotope labeling with amino acids in cell culture (SILAC)-based approach. A bioinformatics analysis of the protein hits identified here and in the earlier SILAC-SPROX experiments revealed that many of the previously annotated ATP-binding protein hits were kinases, ligases, and chaperones. In contrast, many of the newly discovered ATP-sensitive proteins were not from these protein classes, but rather were hydrolases, oxidoreductases, and nucleic acid-binding proteins.
Bromilow, Sophie; Gethings, Lee A; Buckley, Mike; Bromley, Mike; Shewry, Peter R; Langridge, James I; Clare Mills, E N
2017-06-23
The unique physicochemical properties of wheat gluten enable a diverse range of food products to be manufactured. However, gluten triggers coeliac disease, a condition which is treated using a gluten-free diet. Analytical methods are required to confirm whether foods are gluten-free, but current immunoassay-based methods can be unreliable; proteomic methods offer an alternative but require comprehensive and well-annotated sequence databases, which are lacking for gluten. A manually curated database (GluPro V1.0) of gluten proteins, comprising 630 discrete unique full length protein sequences, has been compiled. It is representative of the different types of gliadin and glutenin components found in gluten. An in silico comparison of their coeliac toxicity was undertaken by analysing the distribution of coeliac toxic motifs. This demonstrated that whilst the α-gliadin proteins contained more toxic motifs, these were distributed across all gluten protein sub-types. Comparison of annotations observed using a discovery proteomics dataset acquired using ion mobility MS/MS showed that more reliable identifications were obtained using the GluPro V1.0 database compared to the complete reviewed Viridiplantae database. This highlights the value of a curated sequence database specifically designed to support the proteomic workflows and the development of methods to detect and quantify gluten. We have constructed the first manually curated open-source wheat gluten protein sequence database (GluPro V1.0) in a FASTA format to support the application of proteomic methods for gluten protein detection and quantification. We have also analysed the manually verified sequences to give the first comprehensive overview of the distribution of sequences able to elicit a reaction in coeliac disease, the prevalent form of gluten intolerance.
Provision of this database will improve the reliability of gluten protein identification by proteomic analysis, and aid the development of targeted mass spectrometry methods in line with Codex Alimentarius Commission requirements for foods designed to meet the needs of gluten intolerant individuals. Copyright © 2017. Published by Elsevier B.V.
Recent progress and future directions in protein-protein docking.
Ritchie, David W
2008-02-01
This article gives an overview of recent progress in protein-protein docking and identifies several directions for future research. Recent results from the CAPRI blind docking experiments show that docking algorithms are steadily improving in both reliability and accuracy. Current docking algorithms employ a range of efficient search and scoring strategies, including fast Fourier transform correlations, geometric hashing, and Monte Carlo techniques. These approaches can often produce a relatively small list of up to a few thousand orientations, amongst which a near-native binding mode is often observed. However, despite the use of improved scoring functions which typically include models of desolvation, hydrophobicity, and electrostatics, current algorithms still have difficulty in identifying the correct solution from the list of false positives, or decoys. Nonetheless, significant progress is being made through better use of bioinformatics, biochemical, and biophysical information such as sequence conservation analysis, protein interaction databases, alanine scanning, and NMR residual dipolar coupling restraints to help identify key binding residues. Promising new approaches to incorporate models of protein flexibility during docking are being developed, including the use of molecular dynamics snapshots, rotameric and off-rotamer searches, internal coordinate mechanics, and principal component analysis based techniques. Some investigators now use explicit solvent models in their docking protocols. Many of these approaches can be computationally intensive, although new silicon chip technologies such as programmable graphics processor units are beginning to offer competitive alternatives to conventional high performance computer systems. As cryo-EM techniques improve apace, docking NMR and X-ray protein structures into low resolution EM density maps is helping to bridge the resolution gap between these complementary techniques.
The use of symmetry and fragment assembly constraints is also helping to make possible docking-based predictions of large multimeric protein complexes. In the near future, the closer integration of docking algorithms with protein interface prediction software, structural databases, and sequence analysis techniques should help produce better predictions of protein interaction networks and more accurate structural models of the fundamental molecular interactions within the cell.
RAID: a comprehensive resource for human RNA-associated (RNA–RNA/RNA–protein) interaction
Zhang, Xiaomeng; Wu, Deng; Chen, Liqun; Li, Xiang; Yang, Jinxurong; Fan, Dandan; Dong, Tingting; Liu, Mingyue; Tan, Puwen; Xu, Jintian; Yi, Ying; Wang, Yuting; Zou, Hua; Hu, Yongfei; Fan, Kaili; Kang, Juanjuan; Huang, Yan; Miao, Zhengqiang; Bi, Miaoman; Jin, Nana; Li, Kongning; Li, Xia; Xu, Jianzhen; Wang, Dong
2014-01-01
Transcriptomic analyses have revealed an unexpected complexity in the eukaryote transcriptome, which includes not only protein-coding transcripts but also an expanding catalog of noncoding RNAs (ncRNAs). Diverse coding and noncoding RNAs (ncRNAs) perform functions through interaction with each other in various cellular processes. In this project, we have developed RAID (http://www.rna-society.org/raid), an RNA-associated (RNA–RNA/RNA–protein) interaction database. RAID intends to provide the scientific community with all-in-one resources for efficient browsing and extraction of the RNA-associated interactions in human. This version of RAID contains more than 6100 RNA-associated interactions obtained by manually reviewing more than 2100 published papers, including 4493 RNA–RNA interactions and 1619 RNA–protein interactions. Each entry contains detailed information on an RNA-associated interaction, including RAID ID, RNA/protein symbol, RNA/protein categories, validated method, expressing tissue, literature references (Pubmed IDs), and detailed functional description. Users can query, browse, analyze, and manipulate RNA-associated (RNA–RNA/RNA–protein) interaction. RAID provides a comprehensive resource of human RNA-associated (RNA–RNA/RNA–protein) interaction network. Furthermore, this resource will help in uncovering the generic organizing principles of cellular function network. PMID:24803509
MultitaskProtDB-II: an update of a database of multitasking/moonlighting proteins
Franco-Serrano, Luís; Hernández, Sergio; Calvo, Alejandra; Severi, María A; Ferragut, Gabriela; Pérez-Pons, JosepAntoni; Piñol, Jaume; Pich, Òscar; Mozo-Villarias, Ángel; Amela, Isaac
2018-01-01
Multitasking, or moonlighting, is the capability of some proteins to execute two or more biological functions. MultitaskProtDB-II is a database of multifunctional proteins that has been updated. In the previous version, the information contained was: NCBI and UniProt accession numbers, canonical and additional biological functions, organism, monomeric/oligomeric states, PDB codes and bibliographic references. In the present update, the number of entries has been increased from 288 to 694 moonlighting proteins. MultitaskProtDB-II is continually being curated and updated. The new database also contains the following information: GO descriptors for the canonical and moonlighting functions, three-dimensional structure (for those proteins lacking a PDB structure, a model was made using I-TASSER and Phyre), the involvement of the proteins in human diseases (78% of human moonlighting proteins) and whether the protein is a target of a current drug (48% of human moonlighting proteins). These numbers highlight the importance of these proteins for the analysis and explanation of human diseases and target-directed drug design. Moreover, 25% of the proteins of the database are involved in virulence of pathogenic microorganisms, largely in the mechanism of adhesion to the host. This highlights their importance for the mechanism of microorganism infection and vaccine design. MultitaskProtDB-II is available at http://wallace.uab.es/multitaskII. PMID:29136215
Intrinsic and extrinsic approaches for detecting genes in a bacterial genome.
Borodovsky, M; Rudd, K E; Koonin, E V
1994-01-01
The unannotated regions of the Escherichia coli genome DNA sequence from the EcoSeq6 database, totaling 1,278 'intergenic' sequences of the combined length of 359,279 basepairs, were analyzed using computer-assisted methods with the aim of identifying putative unknown genes. The proposed strategy for finding new genes includes two key elements: i) prediction of expressed open reading frames (ORFs) using the GeneMark method based on Markov chain models for coding and non-coding regions of Escherichia coli DNA, and ii) search for protein sequence similarities using programs based on the BLAST algorithm and programs for motif identification. A total of 354 putative expressed ORFs were predicted by GeneMark. Using the BLASTX and TBLASTN programs, it was shown that 208 ORFs located in the unannotated regions of the E. coli chromosome are significantly similar to other protein sequences. Identification of 182 ORFs as probable genes was supported by GeneMark and BLAST, comprising 51.4% of the GeneMark 'hits' and 87.5% of the BLAST 'hits'. 73 putative new genes, comprising 20.6% of the GeneMark predictions, belong to ancient conserved protein families that include both eubacterial and eukaryotic members. This value is close to the overall proportion of highly conserved sequences among eubacterial proteins, indicating that the majority of the putative expressed ORFs that are predicted by GeneMark, but have no significant BLAST hits, nevertheless are likely to be real genes. The majority of the putative genes identified by BLAST search have been described since the release of the EcoSeq6 database, but about 70 genes have not been detected so far. Among these new identifications are genes encoding proteins with a variety of predicted functions including dehydrogenases, kinases, several other metabolic enzymes, ATPases, rRNA methyltransferases, membrane proteins, and different types of regulatory proteins. PMID:7984428
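The overlap percentages quoted in the abstract above follow directly from the reported ORF counts. The short check below reproduces them; it assumes, as the wording suggests, that the 182 ORFs are those supported by both GeneMark and BLAST. The variable and function names are illustrative.

```python
# Arithmetic check of the overlap figures reported in the abstract.
# Assumes the 182 ORFs are the intersection of the GeneMark and BLAST sets.

GENEMARK_ORFS = 354   # putative expressed ORFs predicted by GeneMark
BLAST_ORFS = 208      # ORFs with significant similarity to known proteins
BOTH = 182            # ORFs supported by both methods
CONSERVED = 73        # putative genes in ancient conserved families


def percent(part, whole):
    """Percentage of `whole` represented by `part`, rounded to one decimal."""
    return round(100 * part / whole, 1)


if __name__ == "__main__":
    print(percent(BOTH, GENEMARK_ORFS))       # share of GeneMark hits confirmed by BLAST
    print(percent(BOTH, BLAST_ORFS))          # share of BLAST hits confirmed by GeneMark
    print(percent(CONSERVED, GENEMARK_ORFS))  # conserved families among GeneMark predictions
    print(GENEMARK_ORFS - BOTH)               # ORFs predicted by GeneMark only
```

Under the same assumption, 172 ORFs are predicted by GeneMark without a significant BLAST hit, which is the group the abstract argues is still likely to contain real genes.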
dbDSM: a manually curated database for deleterious synonymous mutations.
Wen, Pengbo; Xiao, Peng; Xia, Junfeng
2016-06-15
Synonymous mutations (SMs), which change the sequence of a gene without directly altering the amino acid sequence of the encoded protein, were long thought to have no functional consequences. They are often assumed to be neutral in models of mutation and selection and were completely ignored in many studies. However, accumulating experimental evidence has demonstrated that these mutations exert their impact on gene functions via splicing accuracy, mRNA stability, translation fidelity, protein folding and expression, and some of these mutations are implicated in human diseases. To the best of our knowledge, there is still no database specifically focusing on disease-related SMs. We have developed a new database called dbDSM (database of Deleterious Synonymous Mutation), a continually updated database that collects, curates and manages available human disease-related SM data obtained from published literature. In the current release, dbDSM collects 1936 SM-disease association entries, including 1289 SMs and 443 human diseases from ClinVar, GRASP, GWAS Catalog, GWASdb, PolymiRTS database, PubMed database and Web of Knowledge. Additionally, we provide users with a link to download all the data in dbDSM and a link to submit novel data to the database. We hope dbDSM will be a useful resource for investigating the roles of SMs in human disease. dbDSM is freely available online at http://bioinfo.ahu.edu.cn:8080/dbDSM/index.jsp with all major browsers supported. Contact: jfxia@ahu.edu.cn. Supplementary data are available at Bioinformatics online. © The Author 2016. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.
Liang, Shih-Shin; Wang, Tsu-Nai; Tsai, Eing-Mei
2014-01-01
Phthalates are a class of plasticizers that have been characterized as endocrine disrupters, and are associated with genital diseases, cardiotoxicity, hepatotoxicity, and nephrotoxicity in the GeneOntology gene/protein database. In this study, we synthesized phthalic acid chemical probes and demonstrated differing protein–protein interactions between MCF-7 cells and MDA-MB-231 breast cancer cell lines. Phthalic acid chemical probes were synthesized using silicon dioxide particle carriers, which were modified using the silanized linker 3-aminopropyl triethoxysilane (APTES). Incubation with cell lysates from breast cancer cell lines revealed interactions between phthalic acid and cellular proteins in MCF-7 and MDA-MB-231 cells. Subsequent proteomics analyses indicated 22 phthalic acid-binding proteins in both cell types, including heat shock cognate 71-kDa protein, ATP synthase subunit beta, and heat shock protein HSP 90-beta. In addition, 21 MCF-7-specific and 32 MDA-MB-231 specific phthalic acid-binding proteins were identified, including related proteasome proteins, heat shock 70-kDa protein, and NADPH dehydrogenase and ribosomal correlated proteins, ras-related proteins, and members of the heat shock protein family, respectively. PMID:25402641
3D-SURFER 2.0: web platform for real-time search and characterization of protein surfaces.
Xiong, Yi; Esquivel-Rodriguez, Juan; Sael, Lee; Kihara, Daisuke
2014-01-01
The increasing number of uncharacterized protein structures necessitates the development of computational approaches for function annotation using the protein tertiary structures. Protein structure database search is the basis of any structure-based functional elucidation of proteins. 3D-SURFER is a web platform for real-time protein surface comparison of a given protein structure against the entire PDB using 3D Zernike descriptors. It can smoothly navigate the protein structure space in real-time from one query structure to another. A major new feature of Release 2.0 is the ability to compare the protein surface of a single chain, a single domain, or a single complex against databases of protein chains, domains, complexes, or a combination of all three in the latest PDB. Additionally, two types of protein structures can now be compared: all-atom-surface and backbone-atom-surface. The server can also accept a batch job for a large number of database searches. Pockets in protein surfaces can be identified by VisGrid and LIGSITE(csc). The server is available at http://kiharalab.org/3d-surfer/.
The Gene Expression Omnibus database
Clough, Emily; Barrett, Tanya
2016-01-01
The Gene Expression Omnibus (GEO) database is an international public repository that archives and freely distributes high-throughput gene expression and other functional genomics data sets. Created in 2000 as a worldwide resource for gene expression studies, GEO has evolved with rapidly changing technologies and now accepts high-throughput data for many other data applications, including those that examine genome methylation, chromatin structure, and genome–protein interactions. GEO supports community-derived reporting standards that specify provision of several critical study elements including raw data, processed data, and descriptive metadata. The database not only provides access to data for tens of thousands of studies, but also offers various Web-based tools and strategies that enable users to locate data relevant to their specific interests, as well as to visualize and analyze the data. This chapter includes detailed descriptions of methods to query and download GEO data and use the analysis and visualization tools. The GEO homepage is at http://www.ncbi.nlm.nih.gov/geo/. PMID:27008011
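One programmatic way to query GEO is through the NCBI E-utilities `esearch` endpoint with the GEO DataSets database (`db=gds`). The sketch below builds such a query URL and, optionally, fetches matching record IDs. The search term and the helper names `build_geo_search_url` and `search_geo` are illustrative assumptions, and this is not necessarily the workflow described in the chapter.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def build_geo_search_url(term, retmax=20):
    """Build an E-utilities esearch URL against the GEO DataSets database."""
    params = {"db": "gds", "term": term, "retmax": retmax, "retmode": "json"}
    return f"{ESEARCH}?{urlencode(params)}"

def search_geo(term, retmax=20):
    """Run the search (needs network access) and return matching record IDs."""
    with urlopen(build_geo_search_url(term, retmax)) as response:
        payload = json.load(response)
    return payload["esearchresult"]["idlist"]

# Example query term chosen for illustration only.
url = build_geo_search_url("genome methylation", retmax=5)
```

Only the URL is built here; calling `search_geo` performs the actual request and should be rate-limited according to NCBI usage guidelines.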
Mackey, Aaron J; Pearson, William R
2004-10-01
Relational databases are designed to integrate diverse types of information and manage large sets of search results, greatly simplifying genome-scale analyses. They are essential for managing large-scale sequence analyses, and can also be used to improve the statistical significance of similarity searches by focusing on subsets of sequence libraries most likely to contain homologs. This unit describes using relational databases to improve the efficiency of sequence similarity searching and to demonstrate various large-scale genomic analyses of homology-related data. It also describes the installation and use of a simple protein sequence database, seqdb_demo, which is used as a basis for the other protocols. These include basic use of the database to generate a novel sequence library subset, how to extend and use seqdb_demo for the storage of sequence similarity search results, and how to use various kinds of stored search results to address aspects of comparative genomic analysis.
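Generating a sequence library subset from a relational database can be sketched with an in-memory SQLite database. The table layout below (`protein` with `acc`, `taxon`, `seq` columns) and the toy records are hypothetical; they are not the actual seqdb_demo schema.

```python
import sqlite3

def build_demo_db():
    """Create a tiny in-memory protein table (hypothetical schema and data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE protein (acc TEXT PRIMARY KEY, taxon TEXT NOT NULL, seq TEXT NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO protein (acc, taxon, seq) VALUES (?, ?, ?)",
        [
            ("P1", "Escherichia coli", "MKT"),
            ("P2", "Homo sapiens", "MGS"),
            ("P3", "Escherichia coli", "MAA"),
        ],
    )
    return conn

def fasta_subset(conn, taxon):
    """Return a FASTA-formatted library restricted to one taxon."""
    rows = conn.execute(
        "SELECT acc, seq FROM protein WHERE taxon = ? ORDER BY acc", (taxon,)
    )
    return "".join(f">{acc}\n{seq}\n" for acc, seq in rows)

conn = build_demo_db()
library = fasta_subset(conn, "Escherichia coli")
```

Searching against a smaller, targeted library reduces the number of comparisons, which is how subset selection can improve the statistical significance of a given alignment score.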
Update of the androgen receptor gene mutations database.
Gottlieb, B; Beitel, L K; Lumbroso, R; Pinsky, L; Trifiro, M
1999-01-01
The current version of the androgen receptor (AR) gene mutations database is described. The total number of reported mutations has risen from 309 to 374 during the past year. We have expanded the database by adding information on AR-interacting proteins; and we have improved the database by identifying those mutation entries that have been updated. Mutations of unknown significance have now been reported in both the 5' and 3' untranslated regions of the AR gene, and in individuals who are somatic mosaics constitutionally. In addition, single nucleotide polymorphisms, including silent mutations, have been discovered in normal individuals and in individuals with male infertility. A mutation hotspot associated with prostatic cancer has been identified in exon 5. The database is available on the internet (http://www.mcgill.ca/androgendb/), from EMBL-European Bioinformatics Institute (ftp.ebi.ac.uk/pub/databases/androgen), or as a Macintosh FilemakerPro or Word file (MC33@musica.mcgill.ca). Copyright 1999 Wiley-Liss, Inc.
Identifying relevant data for a biological database: handcrafted rules versus machine learning.
Sehgal, Aditya Kumar; Das, Sanmay; Noto, Keith; Saier, Milton H; Elkan, Charles
2011-01-01
With well over 1,000 specialized biological databases in use today, the task of automatically identifying novel, relevant data for such databases is increasingly important. In this paper, we describe practical machine learning approaches for identifying MEDLINE documents and Swiss-Prot/TrEMBL protein records, for incorporation into a specialized biological database of transport proteins named TCDB. We show that both learning approaches outperform rules created by hand by a human expert. As one of the first case studies involving two different approaches to updating a deployed database, both the methods compared and the results will be of interest to curators of many specialized databases.
Proteomic analysis of pollination-induced corolla senescence in petunia.
Bai, Shuangyi; Willard, Belinda; Chapin, Laura J; Kinter, Michael T; Francis, David M; Stead, Anthony D; Jones, Michelle L
2010-02-01
Senescence represents the last phase of petal development during which macromolecules and organelles are degraded and nutrients are recycled to developing tissues. To understand better the post-transcriptional changes regulating petal senescence, a proteomic approach was used to profile protein changes during the senescence of Petunia × hybrida 'Mitchell Diploid' corollas. Total soluble proteins were extracted from unpollinated petunia corollas at 0, 24, 48, and 72 h after flower opening and at 24, 48, and 72 h after pollination. Two-dimensional gel electrophoresis (2-DE) was used to identify proteins that were differentially expressed in non-senescing (unpollinated) and senescing (pollinated) corollas, and image analysis was used to determine which proteins were up- or down-regulated by the experimentally determined cut-off of 2.1-fold for P < 0.05. One hundred and thirty-three differentially expressed protein spots were selected for sequencing. Liquid chromatography-tandem mass spectrometry (LC-MS/MS) was used to determine the identity of these proteins. Searching translated EST databases and the NCBI non-redundant protein database, it was possible to assign a putative identification to greater than 90% of these proteins. Many of the senescence up-regulated proteins were putatively involved in defence and stress responses or macromolecule catabolism. Some proteins, not previously characterized during flower senescence, were identified, including an orthologue of the tomato abscisic acid stress ripening protein 4 (ASR4). Gene expression patterns did not always correlate with protein expression, confirming that both proteomic and genomic approaches will be required to obtain a detailed understanding of the regulation of petal senescence.
Mao, Song; Chai, Xiaoqiang; Hu, Yuling; Hou, Xugang; Tang, Yiheng; Bi, Cheng; Li, Xiao
2014-01-01
Mitochondrion plays a central role in diverse biological processes in most eukaryotes, and its dysfunctions are critically involved in a large number of diseases and the aging process. A systematic identification of mitochondrial proteomes and characterization of functional linkages among mitochondrial proteins are fundamental in understanding the mechanisms underlying biological functions and human diseases associated with mitochondria. Here we present a database MitProNet which provides a comprehensive knowledgebase for mitochondrial proteome, interactome and human diseases. First an inventory of mammalian mitochondrial proteins was compiled by widely collecting proteomic datasets, and the proteins were classified by machine learning to achieve a high-confidence list of mitochondrial proteins. The current version of MitProNet covers 1124 high-confidence proteins, and the remainders were further classified as middle- or low-confidence. An organelle-specific network of functional linkages among mitochondrial proteins was then generated by integrating genomic features encoded by a wide range of datasets including genomic context, gene expression profiles, protein-protein interactions, functional similarity and metabolic pathways. The functional-linkage network should be a valuable resource for the study of biological functions of mitochondrial proteins and human mitochondrial diseases. Furthermore, we utilized the network to predict candidate genes for mitochondrial diseases using prioritization algorithms. All proteins, functional linkages and disease candidate genes in MitProNet were annotated according to the information collected from their original sources including GO, GEO, OMIM, KEGG, MIPS, HPRD and so on. MitProNet features a user-friendly graphic visualization interface to present functional analysis of linkage networks. 
As an up-to-date database and analysis platform, MitProNet should be particularly helpful in comprehensive studies of complicated biological mechanisms underlying mitochondrial functions and human mitochondrial diseases. MitProNet is freely accessible at http://bio.scu.edu.cn:8085/MitProNet. PMID:25347823
Shen, Hong-Bin; Chou, Kuo-Chen
2007-04-20
Proteins may simultaneously exist at, or move between, two or more different subcellular locations. Proteins with multiple locations or dynamic features of this kind are particularly interesting because they may have some very special biological functions intriguing to investigators in both basic research and drug discovery. For instance, among the 6408 human protein entries that have experimentally observed subcellular location annotations in the Swiss-Prot database (version 50.7, released 19-Sept-2006), 973 (approximately 15%) have multiple location sites. The number of total human protein entries (except those annotated with "fragment" or those with less than 50 amino acids) in the same database is 14,370, meaning a gap of (14,370 - 6408) = 7962 entries for which no knowledge is available about their subcellular locations. Although one can use a computational approach to predict the desired information for the gap, all existing methods for predicting human protein subcellular localization are so far limited to the case of a single location site. To overcome this barrier, a new ensemble classifier, named Hum-mPLoc, was developed that can also deal with the case of multiple location sites. Hum-mPLoc is freely accessible to the public as a web server at http://202.120.37.186/bioinf/hum-multi. Meanwhile, for the convenience of people working in the relevant areas, Hum-mPLoc has been used to identify all human protein entries in the Swiss-Prot database that do not have subcellular location annotations or are annotated as being uncertain. The large-scale results thus obtained have been deposited in a downloadable file prepared with Microsoft Excel and named "Tab_Hum-mPLoc.xls". This file is available at the same website and will be updated twice a year to include new entries of human proteins and reflect the continuous development of Hum-mPLoc.
Shen, Hong-Bin; Chou, Kuo-Chen
2007-02-15
Viruses can reproduce only within a host cell, and their actions depend both on their destructive tendencies toward a specific host cell and on environmental conditions. Therefore, knowledge of the subcellular localization of viral proteins in a host cell or virus-infected cell is very useful for in-depth study of their functions and mechanisms as well as for designing antiviral drugs. An analysis of the Swiss-Prot database (version 50.0, released on May 30, 2006) indicates that only 23.5% of viral protein entries are annotated for their subcellular locations in this regard. As for the gene ontology database, the corresponding percentage is 23.8%. Such a gap calls for the development of high-throughput tools for timely annotation of the localization of viral proteins within host and virus-infected cells. In this article, a predictor called "Virus-PLoc" has been developed that fuses many basic classifiers, each engineered according to the K-nearest neighbor rule. The overall jackknife success rate obtained by Virus-PLoc in identifying the subcellular compartments of viral proteins was 80% for a benchmark dataset in which none of the proteins has more than 25% sequence identity to any other in the same location site. Virus-PLoc will be freely available as a web server at http://202.120.37.186/bioinf/virus for public use. Furthermore, Virus-PLoc has been used to provide large-scale predictions of all viral protein entries in the Swiss-Prot database that do not have subcellular location annotations or are annotated as being uncertain. The results thus obtained have been deposited in a downloadable file prepared with Microsoft Excel and named "Tab_Virus-PLoc.xls." This file is available at the same website and will be updated twice a year to include the new entries of viral proteins and reflect the continuous development of Virus-PLoc. © 2006 Wiley Periodicals, Inc.
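The idea of fusing several K-nearest-neighbor classifiers and scoring them with a jackknife (leave-one-out) test can be illustrated with a minimal sketch. Everything specific here is an assumption for illustration: the two-dimensional feature vectors, the location labels, the choice of K values, and the simple majority-vote fusion are hypothetical and do not reproduce the Virus-PLoc feature encoding or fusion scheme.

```python
from collections import Counter

def knn_predict(train, query, k):
    """Majority label among the k training points nearest to the query."""
    nearest = sorted(
        train,
        key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], query)),
    )[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def ensemble_predict(train, query, ks=(1, 3, 5)):
    """Fuse several KNN classifiers (one per K) by majority vote."""
    votes = Counter(knn_predict(train, query, k) for k in ks)
    return votes.most_common(1)[0][0]

def jackknife_accuracy(data, ks=(1, 3, 5)):
    """Leave-one-out test: predict each sample from all the others."""
    correct = 0
    for i, (vec, label) in enumerate(data):
        train = data[:i] + data[i + 1:]
        if ensemble_predict(train, vec, ks) == label:
            correct += 1
    return correct / len(data)

# Hypothetical toy data: (feature vector, location label).
toy = [
    ((0, 0), "host nucleus"), ((0, 1), "host nucleus"),
    ((1, 0), "host nucleus"), ((1, 1), "host nucleus"),
    ((10, 10), "host cytoplasm"), ((10, 11), "host cytoplasm"),
    ((11, 10), "host cytoplasm"), ((11, 11), "host cytoplasm"),
]
accuracy = jackknife_accuracy(toy)
```

The jackknife test leaves no freedom in how the data are split, so the reported success rate does not depend on a particular random partition.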
Dave, Lakshmi A; Hayes, Maria; Montoya, Carlos A; Rutherfurd, Shane M; Moughan, Paul J
2016-02-01
It is well known that endogenous bioactive proteins and peptides play a substantial role in the body's first line of immunological defence, immune-regulation and normal body functioning. Further, the peptides derived from the luminal digestion of proteins are also important for body function. For example, within the peptide database BIOPEP (http://www.uwm.edu.pl/biochemia/index.php/en/biopep) 12 endogenous antimicrobial and 64 angiotensin-I-converting enzyme (ACE-I) inhibitory peptides derived from human milk and plasma proteins are listed. The antimicrobial peptide database (http://aps.unmc.edu/AP/main.php) lists over 111 human host-defence peptides. Several endogenous proteins are secreted in the gut and are subject to the same gastrointestinal digestion processes as food proteins derived from the diet. The human gut endogenous proteins (GEP) include mucins, serum albumin, digestive enzymes, hormones, and proteins from sloughed off epithelial cells and gut microbiota, and numerous other secreted proteins. To date, much work has been carried out regarding the health altering effects of food-derived bioactive peptides but little attention has been paid to the possibility that GEP may also be a source of bioactive peptides. In this review, we discuss the potential of GEP to constitute a gut cryptome from which bioactive peptides such as ACE-I inhibitory, renin inhibitory and antioxidant peptides may be derived. Copyright © 2015 Elsevier Inc. All rights reserved.
UniGene Tabulator: a full parser for the UniGene format.
Lenzi, Luca; Frabetti, Flavia; Facchin, Federica; Casadei, Raffaella; Vitale, Lorenza; Canaider, Silvia; Carinci, Paolo; Zannotti, Maria; Strippoli, Pierluigi
2006-10-15
UniGene Tabulator 1.0 provides a solution for full parsing of UniGene flat file format; it implements a structured graphical representation of each data field present in UniGene following import into a common database managing system usable in a personal computer. This database includes related tables for sequence, protein similarity, sequence-tagged site (STS) and transcript map interval (TXMAP) data, plus a summary table where each record represents a UniGene cluster. UniGene Tabulator enables full local management of UniGene data, allowing parsing, querying, indexing, retrieving, exporting and analysis of UniGene data in a relational database form, usable on Macintosh (OS X 10.3.9 or later) and Windows (2000, with service pack 4, XP, with service pack 2 or later) operating systems-based computers. The current release, including both the FileMaker runtime applications, is freely available at http://apollo11.isto.unibo.it/software/
Gene Unprediction with Spurio: A tool to identify spurious protein sequences.
Höps, Wolfram; Jeffryes, Matt; Bateman, Alex
2018-01-01
We now have access to the sequences of tens of millions of proteins. These protein sequences are essential for modern molecular biology and computational biology. The vast majority of protein sequences are derived from gene prediction tools and have no experimental supporting evidence for their translation. Despite the increasing accuracy of gene prediction tools there likely exists a large number of spurious protein predictions in the sequence databases. We have developed the Spurio tool to help identify spurious protein predictions in prokaryotes. Spurio searches the query protein sequence against a prokaryotic nucleotide database using tblastn and identifies homologous sequences. The tblastn matches are used to score the query sequence's likelihood of being a spurious protein prediction using a Gaussian process model. The most informative feature is the appearance of stop codons within the presumed translation of homologous DNA sequences. Benchmarking shows that the Spurio tool is able to distinguish spurious from true proteins. However, transposon proteins are prone to be predicted as spurious because of the frequency of degraded homologs found in the DNA sequence databases. Our initial experiments suggest that less than 1% of the proteins in the UniProtKB sequence database are likely to be spurious and that Spurio is able to identify over 60 times more spurious proteins than the AntiFam resource. The Spurio software and source code is available under an MIT license at the following URL: https://bitbucket.org/bateman-group/spurio.
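The most informative feature mentioned above, stop codons inside the presumed translation of homologous DNA, is straightforward to compute. The sketch below translates a DNA string with the standard genetic code and counts internal stops. It is not Spurio's code, and the function names and toy sequences are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def translate(dna, frame=0):
    """Translate DNA in the given frame; unknown codons become 'X'."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(frame, len(dna) - 2, 3)
    )

def internal_stops(dna, frame=0):
    """Count stop codons, ignoring the final codon of the translation."""
    return translate(dna, frame)[:-1].count("*")

intact = internal_stops("ATGGCCTAA")       # M A * : terminal stop only
degraded = internal_stops("ATGTAGGCCTAA")  # M * A * : one internal stop
```

A homolog whose translated region contains internal stops is unlikely to encode a functional protein, which is the intuition behind using this count as evidence against the query prediction.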
How many human proteoforms are there?
Aebersold, Ruedi; Agar, Jeffrey N; Amster, I Jonathan; Baker, Mark S; Bertozzi, Carolyn R; Boja, Emily S; Costello, Catherine E; Cravatt, Benjamin F; Fenselau, Catherine; Garcia, Benjamin A; Ge, Ying; Gunawardena, Jeremy; Hendrickson, Ronald C; Hergenrother, Paul J; Huber, Christian G; Ivanov, Alexander R; Jensen, Ole N; Jewett, Michael C; Kelleher, Neil L; Kiessling, Laura L; Krogan, Nevan J; Larsen, Martin R; Loo, Joseph A; Ogorzalek Loo, Rachel R; Lundberg, Emma; MacCoss, Michael J; Mallick, Parag; Mootha, Vamsi K; Mrksich, Milan; Muir, Tom W; Patrie, Steven M; Pesavento, James J; Pitteri, Sharon J; Rodriguez, Henry; Saghatelian, Alan; Sandoval, Wendy; Schlüter, Hartmut; Sechi, Salvatore; Slavoff, Sarah A; Smith, Lloyd M; Snyder, Michael P; Thomas, Paul M; Uhlén, Mathias; Van Eyk, Jennifer E; Vidal, Marc; Walt, David R; White, Forest M; Williams, Evan R; Wohlschlager, Therese; Wysocki, Vicki H; Yates, Nathan A; Young, Nicolas L; Zhang, Bing
2018-02-14
Despite decades of accumulated knowledge about proteins and their post-translational modifications (PTMs), numerous questions remain regarding their molecular composition and biological function. One of the most fundamental queries is the extent to which the combinations of DNA-, RNA- and PTM-level variations explode the complexity of the human proteome. Here, we outline what we know from current databases and measurement strategies including mass spectrometry-based proteomics. In doing so, we examine prevailing notions about the number of modifications displayed on human proteins and how they combine to generate the protein diversity underlying health and disease. We frame central issues regarding determination of protein-level variation and PTMs, including some paradoxes present in the field today. We use this framework to assess existing data and to ask the question, "How many distinct primary structures of proteins (proteoforms) are created from the 20,300 human genes?" We also explore prospects for improving measurements to better regularize protein-level biology and efficiently associate PTMs to function and phenotype.
Global analysis of host-pathogen interactions that regulate early stage HIV-1 replication
König, Renate; Zhou, Yingyao; Elleder, Daniel; Diamond, Tracy L.; Bonamy, Ghislain M.C.; Irelan, Jeffrey T.; Chiang, Chih-yuan; Tu, Buu P.; De Jesus, Paul D.; Lilley, Caroline E.; Seidel, Shannon; Opaluch, Amanda M.; Caldwell, Jeremy S.; Weitzman, Matthew D.; Kuhen, Kelli L.; Bandyopadhyay, Sourav; Ideker, Trey; Orth, Anthony P.; Miraglia, Loren J.; Bushman, Frederic D.; Young, John A.; Chanda, Sumit K.
2008-01-01
Human Immunodeficiency Viruses (HIV-1 and HIV-2) rely upon host-encoded proteins to facilitate their replication. Here we combined genome-wide siRNA analyses with interrogation of human interactome databases to assemble a host-pathogen biochemical network containing 213 confirmed host cellular factors and 11 HIV-1-encoded proteins. Protein complexes that regulate ubiquitin conjugation, proteolysis, DNA damage response and RNA splicing were identified as important modulators of early stage HIV-1 infection. Additionally, over 40 new factors were shown to specifically influence initiation and/or kinetics of HIV-1 DNA synthesis, including cytoskeletal regulatory proteins, modulators of post-translational modification, and nucleic acid binding proteins. Finally, fifteen proteins with diverse functional roles, including nuclear transport, prostaglandin synthesis, ubiquitination, and transcription, were found to influence nuclear import or viral DNA integration. Taken together, the multi-scale approach described here has uncovered multiprotein virus-host interactions that likely act in concert to facilitate early steps of HIV-1 infection. PMID:18854154
Gioutlakis, Aris; Klapa, Maria I.
2017-01-01
It has been acknowledged that source databases recording experimentally supported human protein-protein interactions (PPIs) exhibit limited overlap. Thus, the reconstruction of a comprehensive PPI network requires appropriate integration of multiple heterogeneous primary datasets, presenting the PPIs at various genetic reference levels. Existing PPI meta-databases perform integration via normalization; namely, PPIs are merged after converted to a certain target level. Hence, the node set of the integrated network depends each time on the number and type of the combined datasets. Moreover, the irreversible a priori normalization process hinders the identification of normalization artifacts in the integrated network, which originate from the nonlinearity characterizing the genetic information flow. PICKLE (Protein InteraCtion KnowLedgebasE) 2.0 implements a new architecture for this recently introduced human PPI meta-database. Its main novel feature over the existing meta-databases is its approach to primary PPI dataset integration via genetic information ontology. Building upon the PICKLE principles of using the reviewed human complete proteome (RHCP) of UniProtKB/Swiss-Prot as the reference protein interactor set, and filtering out protein interactions with low probability of being direct based on the available evidence, PICKLE 2.0 first assembles the RHCP genetic information ontology network by connecting the corresponding genes, nucleotide sequences (mRNAs) and proteins (UniProt entries) and then integrates PPI datasets by superimposing them on the ontology network without any a priori transformations. Importantly, this process allows the resulting heterogeneous integrated network to be reversibly normalized to any level of genetic reference without loss of the original information, the latter being used for identification of normalization biases, and enables the appraisal of potential false positive interactions through PPI source database cross-checking. 
The PICKLE web-based interface (www.pickle.gr) allows for the simultaneous query of multiple entities and provides integrated human PPI networks at either the protein (UniProt) or the gene level, at three PPI filtering modes. PMID:29023571
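The idea of keeping interactions at their original reference level and projecting them to another level on demand can be sketched as follows. The accessions, gene names, source labels, and the flat protein-to-gene dictionary are hypothetical simplifications; the actual PICKLE ontology network also includes mRNA nodes.

```python
# Hypothetical ontology fragment: UniProt-style accession -> gene.
PROTEIN_TO_GENE = {
    "P00001": "GENE_A",
    "P00002": "GENE_A",
    "P00003": "GENE_B",
    "P00004": "GENE_C",
}

# Interactions stored exactly as reported by their (hypothetical) sources.
RAW_PPIS = [
    ("P00001", "P00003", "dbX"),
    ("P00002", "P00003", "dbY"),
    ("P00003", "P00004", "dbX"),
]

def normalize_to_gene(ppis, protein_to_gene):
    """Project protein-level PPIs to gene level, keeping the raw records."""
    projected = {}
    for a, b, source in ppis:
        pair = tuple(sorted((protein_to_gene[a], protein_to_gene[b])))
        projected.setdefault(pair, []).append((a, b, source))
    return projected

gene_net = normalize_to_gene(RAW_PPIS, PROTEIN_TO_GENE)
supporting_sources = {src for _, _, src in gene_net[("GENE_A", "GENE_B")]}
```

Because each gene-level edge retains the protein-level records behind it, two distinct protein interactions that collapse into one gene-level edge remain distinguishable, and the number of supporting source databases can be checked per edge.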
Meta-All: a system for managing metabolic pathway information.
Weise, Stephan; Grosse, Ivo; Klukas, Christian; Koschützki, Dirk; Scholz, Uwe; Schreiber, Falk; Junker, Björn H
2006-10-23
Many attempts are being made to understand biological subjects at a systems level. A major resource for these approaches are biological databases, storing manifold information about DNA, RNA and protein sequences including their functional and structural motifs, molecular markers, mRNA expression levels, metabolite concentrations, protein-protein interactions, phenotypic traits or taxonomic relationships. The use of these databases is often hampered by the fact that they are designed for special application areas and thus lack universality. Databases on metabolic pathways, which provide an increasingly important foundation for many analyses of biochemical processes at a systems level, are no exception from the rule. Data stored in central databases such as KEGG, BRENDA or SABIO-RK is often limited to read-only access. If experimentalists want to store their own data, possibly still under investigation, there are two possibilities. They can either develop their own information system for managing that own data, which is very time-consuming and costly, or they can try to store their data in existing systems, which is often restricted. Hence, an out-of-the-box information system for managing metabolic pathway data is needed. We have designed META-ALL, an information system that allows the management of metabolic pathways, including reaction kinetics, detailed locations, environmental factors and taxonomic information. Data can be stored together with quality tags and in different parallel versions. META-ALL uses Oracle DBMS and Oracle Application Express. We provide the META-ALL information system for download and use. In this paper, we describe the database structure and give information about the tools for submitting and accessing the data. As a first application of META-ALL, we show how the information contained in a detailed kinetic model can be stored and accessed. META-ALL is a system for managing information about metabolic pathways. 
It facilitates the handling of pathway-related data and is designed to help biochemists and molecular biologists in their daily research. It is available on the Web at http://bic-gh.de/meta-all and can be downloaded free of charge and installed locally. PMID:17059592
AIM: a comprehensive Arabidopsis interactome module database and related interologs in plants.
Wang, Yi; Thilmony, Roger; Zhao, Yunjun; Chen, Guoping; Gu, Yong Q
2014-01-01
Systems biology analysis of protein modules is important for understanding the functional relationships between proteins in the interactome. Here, we present a comprehensive database named AIM for Arabidopsis (Arabidopsis thaliana) interactome modules. The database contains almost 250,000 modules that were generated using multiple analysis methods and integration of microarray expression data. All the modules in AIM are well annotated using multiple gene function knowledge databases. AIM provides a user-friendly interface for different types of searches and offers a powerful graphical viewer for displaying module networks linked to the enrichment annotation terms. Both interactive Venn diagram and power graph viewer are integrated into the database for easy comparison of modules. In addition, predicted interologs from other plant species (homologous proteins from different species that share a conserved interaction module) are available for each Arabidopsis module. AIM is a powerful systems biology platform for obtaining valuable insights into the function of proteins in Arabidopsis and other plants using the modules of the Arabidopsis interactome. Database URL:http://probes.pw.usda.gov/AIM Published by Oxford University Press 2014. This work is written by US Government employees and is in the public domain in the US.
Establishing homologies in protein sequences
NASA Technical Reports Server (NTRS)
Dayhoff, M. O.; Barker, W. C.; Hunt, L. T.
1983-01-01
Computer-based statistical techniques used to determine homologies between proteins occurring in different species are reviewed. The technique is based on comparison of two protein sequences, either by relating all segments of a given length in one sequence to all segments of the second or by finding the best alignment of the two sequences. Approaches discussed include selection using printed tabulations, identification of very similar sequences, and computer searches of a database. The use of the SEARCH, RELATE, and ALIGN programs (Dayhoff, 1979) is explained; sample data are presented in graphs, diagrams, and tables and the construction of scoring matrices is considered.
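The first comparison strategy, relating all segments of a given length in one sequence to all segments of the second, can be sketched as below. Scoring by simple identity counts is an assumption made to keep the example short; the programs discussed in the report use scoring matrices, and the function names and toy sequences here are hypothetical.

```python
def segment_scores(seq1, seq2, length):
    """Score every segment of seq1 against every segment of seq2.

    The score is the number of identical residues at matching positions.
    """
    scores = {}
    for i in range(len(seq1) - length + 1):
        for j in range(len(seq2) - length + 1):
            seg1 = seq1[i:i + length]
            seg2 = seq2[j:j + length]
            scores[(i, j)] = sum(a == b for a, b in zip(seg1, seg2))
    return scores

def best_segment_pair(seq1, seq2, length):
    """Return ((start1, start2), score) for the highest-scoring segment pair."""
    scores = segment_scores(seq1, seq2, length)
    if not scores:
        raise ValueError("segment length exceeds a sequence length")
    return max(scores.items(), key=lambda kv: kv[1])

best = best_segment_pair("MKVLAAG", "TTKVLAT", 4)
```

Comparing the distribution of segment scores with that obtained from shuffled sequences is one way to judge whether the observed similarity exceeds chance.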
ELISA-BASE: An Integrated Bioinformatics Tool for Analyzing and Tracking ELISA Microarray Data
DOE Office of Scientific and Technical Information (OSTI.GOV)
White, Amanda M.; Collett, James L.; Seurynck-Servoss, Shannon L.
ELISA-BASE is an open-source database for capturing, organizing and analyzing protein enzyme-linked immunosorbent assay (ELISA) microarray data. ELISA-BASE is an extension of the BioArray Software Environment (BASE) database system, which was developed for DNA microarrays. In order to make BASE suitable for protein microarray experiments, we developed several plugins for importing and analyzing quantitative ELISA microarray data. Most notably, our Protein Microarray Analysis Tool (ProMAT) for processing quantitative ELISA data is now available as a plugin to the database.
Natale, D A; Shankavaram, U T; Galperin, M Y; Wolf, Y I; Aravind, L; Koonin, E V
2000-01-01
Standard archival sequence databases have not been designed as tools for genome annotation and are far from being optimal for this purpose. We used the database of Clusters of Orthologous Groups of proteins (COGs) to reannotate the genomes of two archaea, Aeropyrum pernix, the first member of the Crenarchaea to be sequenced, and Pyrococcus abyssi. A. pernix and P. abyssi proteins were assigned to COGs using the COGNITOR program; the results were verified on a case-by-case basis and augmented by additional database searches using the PSI-BLAST and TBLASTN programs. Functions were predicted for over 300 proteins from A. pernix, which could not be assigned a function using conventional methods with a conservative sequence similarity threshold, an approximately 50% increase compared to the original annotation. A. pernix shares most of the conserved core of proteins that were previously identified in the Euryarchaeota. Cluster analysis or distance matrix tree construction based on the co-occurrence of genomes in COGs showed that A. pernix forms a distinct group within the archaea, although grouping with the two species of Pyrococci, indicative of similar repertoires of conserved genes, was observed. No indication of a specific relationship between Crenarchaeota and eukaryotes was obtained in these analyses. Several proteins that are conserved in Euryarchaeota and most bacteria are unexpectedly missing in A. pernix, including the entire set of de novo purine biosynthesis enzymes, the GTPase FtsZ (a key component of the bacterial and euryarchaeal cell-division machinery), and the tRNA-specific pseudouridine synthase, previously considered universal. A. pernix is represented in 48 COGs that do not contain any euryarchaeal members. Many of these proteins are TCA cycle and electron transport chain enzymes, reflecting the aerobic lifestyle of A. pernix. 
Special-purpose databases organized on the basis of phylogenetic analysis and carefully curated with respect to known and predicted protein functions provide for a significant improvement in genome annotation. A differential genome display approach helps in a systematic investigation of common and distinct features of gene repertoires and in some cases reveals unexpected connections that may be indicative of functional similarities between phylogenetically distant organisms and of lateral gene exchange.
Natale, Darren A; Shankavaram, Uma T; Galperin, Michael Y; Wolf, Yuri I; Aravind, L; Koonin, Eugene V
2000-01-01
Background: Standard archival sequence databases have not been designed as tools for genome annotation and are far from being optimal for this purpose. We used the database of Clusters of Orthologous Groups of proteins (COGs) to reannotate the genomes of two archaea, Aeropyrum pernix, the first member of the Crenarchaea to be sequenced, and Pyrococcus abyssi. Results: A. pernix and P. abyssi proteins were assigned to COGs using the COGNITOR program; the results were verified on a case-by-case basis and augmented by additional database searches using the PSI-BLAST and TBLASTN programs. Functions were predicted for over 300 proteins from A. pernix, which could not be assigned a function using conventional methods with a conservative sequence similarity threshold, an approximately 50% increase compared to the original annotation. A. pernix shares most of the conserved core of proteins that were previously identified in the Euryarchaeota. Cluster analysis or distance matrix tree construction based on the co-occurrence of genomes in COGs showed that A. pernix forms a distinct group within the archaea, although grouping with the two species of Pyrococci, indicative of similar repertoires of conserved genes, was observed. No indication of a specific relationship between Crenarchaeota and eukaryotes was obtained in these analyses. Several proteins that are conserved in Euryarchaeota and most bacteria are unexpectedly missing in A. pernix, including the entire set of de novo purine biosynthesis enzymes, the GTPase FtsZ (a key component of the bacterial and euryarchaeal cell-division machinery), and the tRNA-specific pseudouridine synthase, previously considered universal. A. pernix is represented in 48 COGs that do not contain any euryarchaeal members. Many of these proteins are TCA cycle and electron transport chain enzymes, reflecting the aerobic lifestyle of A. pernix. 
Conclusions: Special-purpose databases organized on the basis of phylogenetic analysis and carefully curated with respect to known and predicted protein functions provide for a significant improvement in genome annotation. A differential genome display approach helps in a systematic investigation of common and distinct features of gene repertoires and in some cases reveals unexpected connections that may be indicative of functional similarities between phylogenetically distant organisms and of lateral gene exchange. PMID:11178258
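The genome comparison described in this abstract (a distance matrix built from the co-occurrence of genomes in COGs) can be illustrated with a minimal sketch. It assumes a Jaccard distance over sets of COG identifiers; the abstract does not say which distance measure the authors used, and the genome names and COG identifiers below are invented placeholders.

```python
from itertools import combinations


def jaccard_distance(a, b):
    """Return 1 - |A ∩ B| / |A ∪ B| for two sets of COG identifiers."""
    union = a | b
    if not union:
        # Two empty profiles are treated as identical.
        return 0.0
    return 1.0 - len(a & b) / len(union)


def distance_matrix(profiles):
    """Pairwise distances between genomes, keyed by sorted genome-name pairs."""
    return {
        (g1, g2): jaccard_distance(profiles[g1], profiles[g2])
        for g1, g2 in combinations(sorted(profiles), 2)
    }


# Hypothetical toy profiles: which COGs each genome is represented in.
profiles = {
    "genome_X": {"COG_1", "COG_2", "COG_3"},
    "genome_Y": {"COG_1", "COG_2", "COG_4"},
    "genome_Z": {"COG_4", "COG_5"},
}
distances = distance_matrix(profiles)
for pair, d in sorted(distances.items()):
    print(pair, round(d, 2))
```

A matrix of this kind could then be passed to any standard clustering or tree-building routine; genomes with similar repertoires of conserved genes end up at small distances.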
HIPdb: a database of experimentally validated HIV inhibiting peptides.
Qureshi, Abid; Thakur, Nishant; Kumar, Manoj
2013-01-01
Besides antiretroviral drugs, peptides have also demonstrated potential to inhibit the Human immunodeficiency virus (HIV). For example, T20 has been discovered to effectively block the HIV entry and was approved by the FDA as a novel anti-HIV peptide (AHP). We have collated all experimental information on AHPs at a single platform. HIPdb is a manually curated database of experimentally verified HIV inhibiting peptides targeting various steps or proteins involved in the life cycle of HIV e.g. fusion, integration, reverse transcription etc. This database provides experimental information of 981 peptides. These are of varying length obtained from natural as well as synthetic sources and tested on different cell lines. Important fields included are peptide sequence, length, source, target, cell line, inhibition/IC(50), assay and reference. The database provides user friendly browse, search, sort and filter options. It also contains useful services like BLAST and 'Map' for alignment with user provided sequences. In addition, predicted structure and physicochemical properties of the peptides are also included. HIPdb database is freely available at http://crdd.osdd.net/servers/hipdb. Comprehensive information of this database will be helpful in selecting/designing effective anti-HIV peptides. Thus it may prove a useful resource to researchers for peptide based therapeutics development.
Kobayashi, Norio; Ishii, Manabu; Takahashi, Satoshi; Mochizuki, Yoshiki; Matsushima, Akihiro; Toyoda, Tetsuro
2011-07-01
Global cloud frameworks for bioinformatics research databases become huge and heterogeneous; solutions face various diametric challenges comprising cross-integration, retrieval, security and openness. To address this, as of March 2011 organizations including RIKEN published 192 mammalian, plant and protein life sciences databases having 8.2 million data records, integrated as Linked Open or Private Data (LOD/LPD) using SciNetS.org, the Scientists' Networking System. The huge quantity of linked data this database integration framework covers is based on the Semantic Web, where researchers collaborate by managing metadata across public and private databases in a secured data space. This outstripped the data query capacity of existing interface tools like SPARQL. Actual research also requires specialized tools for data analysis using raw original data. To solve these challenges, in December 2009 we developed the lightweight Semantic-JSON interface to access each fragment of linked and raw life sciences data securely under the control of programming languages popularly used by bioinformaticians such as Perl and Ruby. Researchers successfully used the interface across 28 million semantic relationships for biological applications including genome design, sequence processing, inference over phenotype databases, full-text search indexing and human-readable contents like ontology and LOD tree viewers. Semantic-JSON services of SciNetS.org are provided at http://semanticjson.org.
Games, Patrícia Dias; daSilva, Elói Quintas Gonçalves; Barbosa, Meire de Oliveira; Almeida-Souza, Hebréia Oliveira; Fontes, Patrícia Pereira; deMagalhães, Marcos Jorge; Pereira, Paulo Roberto Gomes; Prates, Maura Vianna; Franco, Gloria Regina; Faria-Campos, Alessandra; Campos, Sérgio Vale Aguiar; Baracat-Pereira, Maria Cristina
2016-12-15
Antimicrobial peptides from plants present mechanisms of action that are different from those of conventional defense agents. They are under-explored but have potential as commercial antimicrobials. Bell pepper leaves ('Magali R') are discarded after harvesting the fruit and are sources of bioactive peptides. This work reports the isolation by peptidomics tools, and the identification and partial characterization by computational tools, of an antimicrobial peptide from bell pepper leaves, and demonstrates the usefulness of records and in silico analysis for the study of plant peptides aimed at biotechnological uses. Aqueous extracts from leaves were enriched in peptides by salt fractionation and ultrafiltration. An antimicrobial peptide was isolated by tandem chromatographic procedures. Mass spectrometry, automated peptide sequencing and bioinformatics tools were used alternately for identification and partial characterization of the Hevein-like peptide, named HEV-CANN. The computational tools that assisted in the identification of the peptide included BlastP, PSI-Blast, ClustalOmega, PeptideCutter, and ProtParam; conventional protein databases (DB) such as Mascot, Protein-DB, GenBank-DB, RefSeq, Swiss-Prot, and UniProtKB; peptide-specific DBs such as Amper, APD2, CAMP, LAMPs, and PhytAMP; other tools included in ExPASy for Proteomics; The Bioactive Peptide Databases; and The Pepper Genome Database. The HEV-CANN sequence presented 40 amino acid residues, 4258.8 Da, a theoretical pI of 8.78, and four disulfide bonds. It was stable, and it inhibited the growth of phytopathogenic bacteria and a fungus. HEV-CANN presented a chitin-binding domain in its sequence. There was high identity and a positive alignment of the HEV-CANN sequence in various databases, but not complete identity, suggesting that HEV-CANN may be produced by ribosomal synthesis, which is in accordance with its constitutive nature.
Computational tools for proteomics and databases are not adjusted for short sequences, which hampered HEV-CANN identification. The adjustment of statistical tests in large protein databases is an alternative to promote the significant identification of peptides. The development of specific DBs for plant antimicrobial peptides, with information about peptide sequences, functional genomic data, structural motifs and domains of molecules, functional domains, and peptide-biomolecule interactions, is valuable and necessary.
Ortseifen, Vera; Stolze, Yvonne; Maus, Irena; Sczyrba, Alexander; Bremges, Andreas; Albaum, Stefan P; Jaenicke, Sebastian; Fracowiak, Jochen; Pühler, Alfred; Schlüter, Andreas
2016-08-10
To study the metaproteome of a biogas-producing microbial community, fermentation samples were taken from an agricultural biogas plant for microbial cell and protein extraction and corresponding metagenome analyses. Based on metagenome sequence data, taxonomic community profiling was performed to elucidate the composition of bacterial and archaeal sub-communities. The community's cytosolic metaproteome was represented in a 2D-PAGE approach. Metaproteome databases for protein identification were compiled based on the assembled metagenome sequence dataset for the biogas plant analyzed and non-corresponding biogas metagenomes. Protein identification results revealed that the corresponding biogas protein database facilitated the highest identification rate followed by other biogas-specific databases, whereas common public databases yielded insufficient identification rates. Proteins of the biogas microbiome identified as highly abundant were assigned to the pathways involved in methanogenesis, transport and carbon metabolism. Moreover, the integrated metagenome/-proteome approach enabled the examination of genetic-context information for genes encoding identified proteins by studying neighboring genes on the corresponding contig. Exemplarily, this approach led to the identification of a Methanoculleus sp. contig encoding 16 methanogenesis-related gene products, three of which were also detected as abundant proteins within the community's metaproteome. Thus, metagenome contigs provide additional information on the genetic environment of identified abundant proteins. Copyright © 2016 Elsevier B.V. All rights reserved.
Czaplewski, Cezary; Karczynska, Agnieszka; Sieradzan, Adam K; Liwo, Adam
2018-04-30
A server implementation of the UNRES package (http://www.unres.pl) for coarse-grained simulations of protein structures with the physics-based UNRES model, named the UNRES server, is presented. In contrast to most protein coarse-grained models, owing to its physics-based origin, the UNRES force field can be used in simulations, including those aimed at protein-structure prediction, without ancillary information from structural databases; however, the implementation includes the possibility of using restraints. Local energy minimization, canonical molecular dynamics simulations, replica exchange and multiplexed replica exchange molecular dynamics simulations can be run with the current UNRES server; the latter are suitable for protein-structure prediction. The user-supplied input includes the protein sequence and, optionally, restraints from secondary-structure prediction or small-angle X-ray scattering data, as well as the simulation type and parameters, which are selected or typed in. Oligomeric proteins, as well as those containing D-amino-acid residues and disulfide links, can be treated. The output is displayed graphically (minimized structures, trajectories, final models, analysis of trajectory/ensembles); however, all output files can be downloaded by the user. The UNRES server can be freely accessed at http://unres-server.chem.ug.edu.pl.
PASS2: an automated database of protein alignments organised as structural superfamilies.
Bhaduri, Anirban; Pugalenthi, Ganesan; Sowdhamini, Ramanathan
2004-04-02
The functional selection and three-dimensional structural constraints of proteins in nature often relate to the retention of significant sequence similarity between proteins of similar fold and function despite poor sequence identity. Organization of structure-based sequence alignments for distantly related proteins provides a map of the conserved and critical regions of the protein universe that is useful for the analysis of folding principles, for the evolutionary unification of protein families and for maximizing the information return from experimental structure determination. The Protein Alignment organised as Structural Superfamily (PASS2) database represents continuously updated structural alignments for evolutionarily related, sequentially distant proteins. An automated and updated version of PASS2 is in direct correspondence with SCOP 1.63 and consists of sequences having identity below 40% among themselves. Protein domains have been grouped into 628 multi-member superfamilies and 566 single-member superfamilies. Structure-based sequence alignments for the superfamilies have been obtained using COMPARER, while initial equivalencies have been derived from a preliminary superposition using LSQMAN or STAMP 4.0. The final sequence alignments have been annotated for structural features using JOY4.0. The database is supplemented with sequence relatives belonging to different genomes, conserved spatially interacting and structural motifs, probabilistic hidden Markov models of superfamilies based on the alignments and useful links to other databases. Probabilistic models and sensitive position-specific profiles obtained from reliable superfamily alignments aid annotation of remote homologues and are useful tools in structural and functional genomics. PASS2 presents the phylogeny of its members based on both sequence and structural dissimilarities. Clustering of members allows us to understand diversification of the family members.
The search engine has been improved for simpler browsing of the database. The database resolves alignments among the structural domains consisting of evolutionarily diverged set of sequences. Availability of reliable sequence alignments of distantly related proteins despite poor sequence identity and single-member superfamilies permit better sampling of structures in libraries for fold recognition of new sequences and for the understanding of protein structure-function relationships of individual superfamilies. PASS2 is accessible at http://www.ncbs.res.in/~faculty/mini/campass/pass2.html
Colangelo, Christopher M.; Shifman, Mark; Cheung, Kei-Hoi; Stone, Kathryn L.; Carriero, Nicholas J.; Gulcicek, Erol E.; Lam, TuKiet T.; Wu, Terence; Bjornson, Robert D.; Bruce, Can; Nairn, Angus C.; Rinehart, Jesse; Miller, Perry L.; Williams, Kenneth R.
2015-01-01
We report a significantly-enhanced bioinformatics suite and database for proteomics research called Yale Protein Expression Database (YPED) that is used by investigators at more than 300 institutions worldwide. YPED meets the data management, archival, and analysis needs of a high-throughput mass spectrometry-based proteomics research ranging from a single laboratory, group of laboratories within and beyond an institution, to the entire proteomics community. The current version is a significant improvement over the first version in that it contains new modules for liquid chromatography–tandem mass spectrometry (LC–MS/MS) database search results, label and label-free quantitative proteomic analysis, and several scoring outputs for phosphopeptide site localization. In addition, we have added both peptide and protein comparative analysis tools to enable pairwise analysis of distinct peptides/proteins in each sample and of overlapping peptides/proteins between all samples in multiple datasets. We have also implemented a targeted proteomics module for automated multiple reaction monitoring (MRM)/selective reaction monitoring (SRM) assay development. We have linked YPED’s database search results and both label-based and label-free fold-change analysis to the Skyline Panorama repository for online spectra visualization. In addition, we have built enhanced functionality to curate peptide identifications into an MS/MS peptide spectral library for all of our protein database search identification results. PMID:25712262
DOE Office of Scientific and Technical Information (OSTI.GOV)
Karpinets, Tatiana V; Park, Byung; Syed, Mustafa H
2010-01-01
The Carbohydrate-Active Enzyme (CAZy) database provides a rich set of manually annotated enzymes that degrade, modify, or create glycosidic bonds. Despite rich and invaluable information stored in the database, software tools utilizing this information for annotation of newly sequenced genomes by CAZy families are limited. We have employed two annotation approaches to fill the gap between manually curated high-quality protein sequences collected in the CAZy database and the growing number of other protein sequences produced by genome or metagenome sequencing projects. The first approach is based on a similarity search against the entire non-redundant sequences of the CAZy database. The second approach performs annotation using links or correspondences between the CAZy families and protein family domains. The links were discovered using the association rule learning algorithm applied to sequences from the CAZy database. The approaches complement each other and in combination achieved high specificity and sensitivity when cross-evaluated with the manually curated genomes of Clostridium thermocellum ATCC 27405 and Saccharophagus degradans 2-40. The capability of the proposed framework to predict the function of unknown protein domains (DUF) and of hypothetical proteins in the genome of Neurospora crassa is demonstrated. The framework is implemented as a Web service, the CAZymes Analysis Toolkit (CAT), and is available at http://cricket.ornl.gov/cgi-bin/cat.cgi.
Colangelo, Christopher M; Shifman, Mark; Cheung, Kei-Hoi; Stone, Kathryn L; Carriero, Nicholas J; Gulcicek, Erol E; Lam, TuKiet T; Wu, Terence; Bjornson, Robert D; Bruce, Can; Nairn, Angus C; Rinehart, Jesse; Miller, Perry L; Williams, Kenneth R
2015-02-01
We report a significantly-enhanced bioinformatics suite and database for proteomics research called Yale Protein Expression Database (YPED) that is used by investigators at more than 300 institutions worldwide. YPED meets the data management, archival, and analysis needs of a high-throughput mass spectrometry-based proteomics research ranging from a single laboratory, group of laboratories within and beyond an institution, to the entire proteomics community. The current version is a significant improvement over the first version in that it contains new modules for liquid chromatography-tandem mass spectrometry (LC-MS/MS) database search results, label and label-free quantitative proteomic analysis, and several scoring outputs for phosphopeptide site localization. In addition, we have added both peptide and protein comparative analysis tools to enable pairwise analysis of distinct peptides/proteins in each sample and of overlapping peptides/proteins between all samples in multiple datasets. We have also implemented a targeted proteomics module for automated multiple reaction monitoring (MRM)/selective reaction monitoring (SRM) assay development. We have linked YPED's database search results and both label-based and label-free fold-change analysis to the Skyline Panorama repository for online spectra visualization. In addition, we have built enhanced functionality to curate peptide identifications into an MS/MS peptide spectral library for all of our protein database search identification results. Copyright © 2015 The Authors. Production and hosting by Elsevier Ltd.. All rights reserved.
Park, Byung H; Karpinets, Tatiana V; Syed, Mustafa H; Leuze, Michael R; Uberbacher, Edward C
2010-12-01
The Carbohydrate-Active Enzyme (CAZy) database provides a rich set of manually annotated enzymes that degrade, modify, or create glycosidic bonds. Despite rich and invaluable information stored in the database, software tools utilizing this information for annotation of newly sequenced genomes by CAZy families are limited. We have employed two annotation approaches to fill the gap between manually curated high-quality protein sequences collected in the CAZy database and the growing number of other protein sequences produced by genome or metagenome sequencing projects. The first approach is based on a similarity search against the entire nonredundant sequences of the CAZy database. The second approach performs annotation using links or correspondences between the CAZy families and protein family domains. The links were discovered using the association rule learning algorithm applied to sequences from the CAZy database. The approaches complement each other and in combination achieved high specificity and sensitivity when cross-evaluated with the manually curated genomes of Clostridium thermocellum ATCC 27405 and Saccharophagus degradans 2-40. The capability of the proposed framework to predict the function of unknown protein domains and of hypothetical proteins in the genome of Neurospora crassa is demonstrated. The framework is implemented as a Web service, the CAZymes Analysis Toolkit, and is available at http://cricket.ornl.gov/cgi-bin/cat.cgi.
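The second annotation approach in this abstract relies on association rules linking protein family domains to CAZy families. A minimal sketch of how such rules might be scored is given below; it assumes the usual support and confidence definitions, single-domain rules of the form "domain implies family", and invented toy annotations (the domain names are placeholders). It is not the CAZymes Analysis Toolkit implementation.

```python
from collections import Counter


def mine_rules(annotated, min_support=0.5, min_confidence=0.6):
    """Score rules 'domain -> family' over (family, domains) records.

    support    = fraction of all sequences carrying both domain and family
    confidence = fraction of sequences with the domain that are in the family
    """
    n = len(annotated)
    domain_counts = Counter()
    pair_counts = Counter()
    for family, domains in annotated:
        for domain in domains:
            domain_counts[domain] += 1
            pair_counts[(domain, family)] += 1

    rules = []
    for (domain, family), count in pair_counts.items():
        support = count / n
        confidence = count / domain_counts[domain]
        if support >= min_support and confidence >= min_confidence:
            rules.append((domain, family, support, confidence))
    return sorted(rules)


# Hypothetical training records: (CAZy family label, set of domain labels).
annotated = [
    ("GH5", {"domA"}),
    ("GH5", {"domA", "domB"}),
    ("GH9", {"domB"}),
    ("GH9", {"domB", "domC"}),
]
rules = mine_rules(annotated)
for domain, family, support, confidence in rules:
    print(f"{domain} -> {family}: support={support:.2f}, confidence={confidence:.2f}")
```

Under these assumptions, a new sequence carrying a domain on the left-hand side of a retained rule would be proposed as a member of the corresponding family; the thresholds are arbitrary and would need tuning on real data.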
PlaMoM: a comprehensive database compiles plant mobile macromolecules.
Guan, Daogang; Yan, Bin; Thieme, Christoph; Hua, Jingmin; Zhu, Hailong; Boheler, Kenneth R; Zhao, Zhongying; Kragler, Friedrich; Xia, Yiji; Zhang, Shoudong
2017-01-04
In plants, various phloem-mobile macromolecules including noncoding RNAs, mRNAs and proteins are suggested to act as important long-distance signals in regulating crucial physiological and morphological transition processes such as flowering, plant growth and stress responses. Given recent advances in high-throughput sequencing technologies, numerous mobile macromolecules have been identified in diverse plant species from different plant families. However, most of the identified mobile macromolecules are not annotated in current versions of species-specific databases and are only available as non-searchable datasheets. To facilitate study of the mobile signaling macromolecules, we compiled the PlaMoM (Plant Mobile Macromolecules) database, a resource that provides convenient and interactive search tools allowing users to retrieve, to analyze and also to predict mobile RNAs/proteins. Each entry in the PlaMoM contains detailed information such as nucleotide/amino acid sequences, ortholog partners, related experiments, gene functions and literature. For the model plant Arabidopsis thaliana, protein-protein interactions of mobile transcripts are presented as interactive molecular networks. Furthermore, PlaMoM provides a built-in tool to identify potential RNA mobility signals such as tRNA-like structures. The current version of PlaMoM compiles a total of 17 991 mobile macromolecules from 14 plant species/ecotypes from published data and literature. PlaMoM is available at http://www.systembioinfo.org/plamom/. © The Author(s) 2016. Published by Oxford University Press on behalf of Nucleic Acids Research.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Denef, Vincent; Shah, Manesh B; Verberkmoes, Nathan C
The recent surge in microbial genomic sequencing, combined with the development of high-throughput liquid chromatography-mass-spectrometry-based (LC/LC-MS/MS) proteomics, has raised the question of the extent to which genomic information of one strain or environmental sample can be used to profile proteomes of related strains or samples. Even with decreasing sequencing costs, it remains impractical to obtain genomic sequence for every strain or sample analyzed. Here, we evaluate how shotgun proteomics is affected by amino acid divergence between the sample and the genomic database using a probability-based model and a random mutation simulation model constrained by experimental data. To assess the effects of nonrandom distribution of mutations, we also evaluated identification levels using in silico peptide data from sequenced isolates with average amino acid identities (AAI) varying between 76 and 98%. We compared the predictions to experimental protein identification levels for a sample that was evaluated using a database that included genomic information for the dominant organism and for a closely related variant (95% AAI). The range of models set the boundaries at which half of the proteins in a proteomic experiment can be identified to be 77-92% AAI between orthologs in the sample and database. Consistent with this prediction, experimental data indicated loss of half the identifiable proteins at 90% AAI. Additional analysis indicated a 6.4% reduction of the initial protein coverage per 1% amino acid divergence and total identification loss at 86% AAI. Consequently, shotgun proteomics is capable of cross-strain identifications but avoids most cross-species false positives.
The Sorcerer II Global Ocean Sampling expedition: expanding the universe of protein families.
Yooseph, Shibu; Sutton, Granger; Rusch, Douglas B; Halpern, Aaron L; Williamson, Shannon J; Remington, Karin; Eisen, Jonathan A; Heidelberg, Karla B; Manning, Gerard; Li, Weizhong; Jaroszewski, Lukasz; Cieplak, Piotr; Miller, Christopher S; Li, Huiying; Mashiyama, Susan T; Joachimiak, Marcin P; van Belle, Christopher; Chandonia, John-Marc; Soergel, David A; Zhai, Yufeng; Natarajan, Kannan; Lee, Shaun; Raphael, Benjamin J; Bafna, Vineet; Friedman, Robert; Brenner, Steven E; Godzik, Adam; Eisenberg, David; Dixon, Jack E; Taylor, Susan S; Strausberg, Robert L; Frazier, Marvin; Venter, J Craig
2007-03-01
Metagenomics projects based on shotgun sequencing of populations of micro-organisms yield insight into protein families. We used sequence similarity clustering to explore proteins with a comprehensive dataset consisting of sequences from available databases together with 6.12 million proteins predicted from an assembly of 7.7 million Global Ocean Sampling (GOS) sequences. The GOS dataset covers nearly all known prokaryotic protein families. A total of 3,995 medium- and large-sized clusters consisting of only GOS sequences are identified, out of which 1,700 have no detectable homology to known families. The GOS-only clusters contain a higher than expected proportion of sequences of viral origin, thus reflecting a poor sampling of viral diversity until now. Protein domain distributions in the GOS dataset and current protein databases show distinct biases. Several protein domains that were previously categorized as kingdom specific are shown to have GOS examples in other kingdoms. About 6,000 sequences (ORFans) from the literature that heretofore lacked similarity to known proteins have matches in the GOS data. The GOS dataset is also used to improve remote homology detection. Overall, besides nearly doubling the number of current proteins, the predicted GOS proteins also add a great deal of diversity to known protein families and shed light on their evolution. These observations are illustrated using several protein families, including phosphatases, proteases, ultraviolet-irradiation DNA damage repair enzymes, glutamine synthetase, and RuBisCO. The diversity added by GOS data has implications for choosing targets for experimental structure characterization as part of structural genomics efforts. Our analysis indicates that new families are being discovered at a rate that is linear or almost linear with the addition of new sequences, implying that we are still far from discovering all protein families in nature.
eMelanoBase: an online locus-specific variant database for familial melanoma.
Fung, David C Y; Holland, Elizabeth A; Becker, Therese M; Hayward, Nicholas K; Bressac-de Paillerets, Brigitte; Mann, Graham J
2003-01-01
A proportion of melanoma-prone individuals in both familial and non-familial contexts has been shown to carry inactivating mutations in either CDKN2A or, rarely, CDK4. CDKN2A is a complex locus that encodes two unrelated proteins from alternately spliced transcripts that are read in different frames. The alpha transcript (exons 1alpha, 2, and 3) produces the p16INK4A cyclin-dependent kinase inhibitor, while the beta transcript (exons 1beta and 2) is translated as p14ARF, a stabilizing factor of p53 levels through binding to MDM2. Mutations in exon 2 can impair both polypeptides, and insertions and deletions in exons 1alpha, 1beta, and 2 can theoretically generate p16INK4A-p14ARF fusion proteins. No online database currently takes into account all the consequences of these genotypes, a situation compounded by some problematic previous annotations of CDKN2A-related sequences and descriptions of their mutations. As an initiative of the international Melanoma Genetics Consortium, we have therefore established a database of germline variants observed in all loci implicated in familial melanoma susceptibility. Such a comprehensive, publicly accessible database is an essential foundation for research on melanoma susceptibility and its clinical application. Our database serves two types of data as defined by HUGO. The core dataset includes the nucleotide variants on the genomic and transcript levels, amino acid variants, and citation. The ancillary dataset includes keyword description of events at the transcription and translation levels and epidemiological data. The application that handles users' queries was designed in the model-view-controller architecture and was implemented in Java. The object-relational database schema was deduced using functional dependency analysis. We hereby present our first functional prototype of eMelanoBase. The service is accessible via the URL www.wmi.usyd.edu.au:8080/melanoma.html. Copyright 2002 Wiley-Liss, Inc.
Hao, J H; Dong, C J; Zhang, Z G; Wang, X L; Shang, Q M
2012-05-01
To investigate the response of cucumber seedlings to exogenous salicylic acid (SA) and gain a better understanding of SA action mechanism, we generated a proteomic profile of cucumber (Cucumis sativus L.) cotyledons treated with exogenous SA. Analysis of 1500 protein spots from each gel revealed 63 differentially expressed proteins, 59 of which were identified successfully. Of the identified proteins, 97% matched cucumber proteins using a whole cucumber protein database based on the newly completed genome established by our laboratory. The identified proteins were involved in various cellular responses and metabolic processes, including antioxidative reactions, cell defense, photosynthesis, carbohydrate metabolism, respiration and energy homeostasis, protein folding and biosynthesis. The two largest functional categories included proteins involved in antioxidative reactions (23.7%) and photosynthesis (18.6%). Furthermore, the SA-responsive protein interaction network revealed 13 key proteins, suggesting that the expression changes of these proteins could be critical for SA-induced resistance. An analysis of these changes suggested that SA-induced resistance and seedling growth might be regulated in part through pathways involving antioxidative reactions and photosynthesis. © 2012 Elsevier Ireland Ltd. All rights reserved.
When a domain isn’t a domain, and why it’s important to properly filter proteins in databases
Towse, Clare-Louise; Daggett, Valerie
2013-01-01
Summary: Membership in a protein domain database does not a domain make; a feature we realized when generating a consensus view of protein fold space with our Consensus Domain Dictionary (CDD). This dictionary was used to select representative structures for characterization of the protein dynameome: the Dynameomics initiative. Through this endeavor we rejected a surprising 40% of the 1695 folds in the CDD as being non-autonomous folding units. Although some of this was due to the challenges of grouping similar fold topologies, the dissonance between the cataloguing and structural qualification of protein domains remains surprising. Another potential factor is previously overlooked intrinsic disorder; predicted estimates suggest 40% of proteins to have either local or global disorder. One thing is clear: filtering a structural database and ensuring a consistent definition for protein domains is crucial, and caution is prescribed when generalizations of globular domains are drawn from unfiltered protein domain datasets. PMID:23108912
SAFE Software and FED Database to Uncover Protein-Protein Interactions using Gene Fusion Analysis
Tsagrasoulis, Dimosthenis; Danos, Vasilis; Kissa, Maria; Trimpalis, Philip; Koumandou, V. Lila; Karagouni, Amalia D.; Tsakalidis, Athanasios; Kossida, Sophia
2012-01-01
Domain Fusion Analysis takes advantage of the fact that certain proteins in a given proteome A, are found to have statistically significant similarity with two separate proteins in another proteome B. In other words, the result of a fusion event between two separate proteins in proteome B is a specific full-length protein in proteome A. In such a case, it can be safely concluded that the protein pair has a common biological function or even interacts physically. In this paper, we present the Fusion Events Database (FED), a database for the maintenance and retrieval of fusion data both in prokaryotic and eukaryotic organisms and the Software for the Analysis of Fusion Events (SAFE), a computational platform implemented for the automated detection, filtering and visualization of fusion events (both available at: http://www.bioacademy.gr/bioinformatics/projects/ProteinFusion/index.htm). Finally, we analyze the proteomes of three microorganisms using these tools in order to demonstrate their functionality. PMID:22267904
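The fusion-detection idea described in this abstract can be sketched in a few lines of Python. This is only a minimal illustration and not the SAFE implementation: the function name find_fusion_candidates, the tuple layout of the similarity hits, and the max_overlap threshold are all assumptions introduced here. It presumes that significant similarity hits between proteome A and proteome B have already been computed by some search tool.

```python
from collections import defaultdict


def find_fusion_candidates(hits, max_overlap=10):
    """Return (composite, component_a, component_b) triples.

    hits: iterable of (composite_id, component_id, start, end), where
    start/end are the 1-based coordinates on the proteome-A (composite)
    protein covered by a significant hit to a proteome-B protein.
    max_overlap: hypothetical tolerance, in residues, for how much the
    two matched regions may overlap on the composite protein.
    """
    by_composite = defaultdict(list)
    for composite, component, start, end in hits:
        by_composite[composite].append((component, start, end))

    candidates = []
    for composite, regions in by_composite.items():
        regions.sort(key=lambda r: r[1])  # order by start coordinate
        for i in range(len(regions)):
            for j in range(i + 1, len(regions)):
                a, a_start, a_end = regions[i]
                b, b_start, b_end = regions[j]
                if a == b:
                    continue  # same component matched twice
                # Number of shared residues (negative if there is a gap).
                overlap = min(a_end, b_end) - max(a_start, b_start) + 1
                if overlap <= max_overlap:
                    candidates.append((composite, a, b))
    return candidates


example_hits = [
    ("A1", "B1", 1, 150),    # B1 matches the N-terminal part of A1
    ("A1", "B2", 160, 300),  # B2 matches the C-terminal part of A1
    ("A2", "B3", 1, 200),    # B3 and B4 match the same region of A2,
    ("A2", "B4", 50, 210),   # so they are not a fusion candidate
]
print(find_fusion_candidates(example_hits))
```

A real pipeline would also filter out promiscuous domains and paralogous hits, which this sketch ignores.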
CADB: Conformation Angles DataBase of proteins
Sheik, S. S.; Ananthalakshmi, P.; Bhargavi, G. Ramya; Sekar, K.
2003-01-01
Conformation Angles DataBase (CADB) provides an online resource to access data on conformation angles (both main-chain and side-chain) of protein structures in two data sets corresponding to 25% and 90% sequence identity between any two proteins, available in the Protein Data Bank. In addition, the database contains the necessary crystallographic parameters. The package has several flexible options and display facilities to visualize the main-chain and side-chain conformation angles for a particular amino acid residue. The package can also be used to study the interrelationship between the main-chain and side-chain conformation angles. A web based JAVA graphics interface has been deployed to display the user interested information on the client machine. The database is being updated at regular intervals and can be accessed over the World Wide Web interface at the following URL: http://144.16.71.148/cadb/. PMID:12520049
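As an illustration of what a conformation angle is, the following Python sketch computes the torsion (dihedral) angle defined by four atom positions, which is the quantity behind main-chain angles such as phi and psi. It is a generic textbook calculation and not code from CADB; the function and helper names are hypothetical, and the coordinates in the example are made up.

```python
import math


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral(p0, p1, p2, p3):
    """Torsion angle in degrees for four points, in the range (-180, 180]."""
    b1 = _sub(p1, p0)
    b2 = _sub(p2, p1)
    b3 = _sub(p3, p2)
    n1 = _cross(b1, b2)  # normal of the plane through p0, p1, p2
    n2 = _cross(b2, b3)  # normal of the plane through p1, p2, p3
    y = _dot(b2, _cross(n1, n2))
    x = math.sqrt(_dot(b2, b2)) * _dot(n1, n2)
    return math.degrees(math.atan2(y, x))


# Made-up coordinates: the fourth point is rotated a quarter turn
# about the central bond relative to the first point.
print(dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1)))
```

For phi the four atoms would be C(i-1), N(i), CA(i), C(i); for psi they would be N(i), CA(i), C(i), N(i+1).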
Classification of proteins with shared motifs and internal repeats in the ECOD database
Kinch, Lisa N.; Liao, Yuxing
2016-01-01
Proteins and their domains evolve by a set of events commonly including the duplication and divergence of small motifs. The presence of short repetitive regions in domains has generally constituted a difficult case for structural domain classifications and their hierarchies. We developed the Evolutionary Classification Of protein Domains (ECOD) in part to implement a new schema for the classification of these types of proteins. Here we document the ways in which ECOD classifies proteins with small internal repeats, widespread functional motifs, and assemblies of small domain‐like fragments in its evolutionary schema. We illustrate the ways in which the structural genomics project impacted the classification and characterization of new structural domains and sequence families over the decade. PMID:26833690
Dictionary-driven protein annotation
Rigoutsos, Isidore; Huynh, Tien; Floratos, Aris; Parida, Laxmi; Platt, Daniel
2002-01-01
Computational methods seeking to automatically determine the properties (functional, structural, physicochemical, etc.) of a protein directly from the sequence have long been the focus of numerous research groups. With the advent of advanced sequencing methods and systems, the number of amino acid sequences that are being deposited in the public databases has been increasing steadily. This has in turn generated a renewed demand for automated approaches that can annotate individual sequences and complete genomes quickly, exhaustively and objectively. In this paper, we present one such approach that is centered around and exploits the Bio-Dictionary, a collection of amino acid patterns that completely covers the natural sequence space and can capture functional and structural signals that have been reused during evolution, within and across protein families. Our annotation approach also makes use of a weighted, position-specific scoring scheme that is unaffected by the over-representation of well-conserved proteins and protein fragments in the databases used. For a given query sequence, the method permits one to determine, in a single pass, the following: local and global similarities between the query and any protein already present in a public database; the likeness of the query to all available archaeal/bacterial/eukaryotic/viral sequences in the database as a function of amino acid position within the query; the character of secondary structure of the query as a function of amino acid position within the query; the cytoplasmic, transmembrane or extracellular behavior of the query; the nature and position of binding domains, active sites, post-translationally modified sites, signal peptides, etc. In terms of performance, the proposed method is exhaustive, objective and allows for the rapid annotation of individual sequences and full genomes. 
Annotation examples are presented and discussed in Results, including individual queries and complete genomes that were released publicly after we built the Bio-Dictionary that is used in our experiments. Finally, we have computed the annotations of more than 70 complete genomes and made them available on the World Wide Web at http://cbcsrv.watson.ibm.com/Annotations/. PMID:12202776
Peptide reranking with protein-peptide correspondence and precursor peak intensity information.
Yang, Chao; He, Zengyou; Yang, Can; Yu, Weichuan
2012-01-01
Searching tandem mass spectra against a protein database has been a mainstream method for peptide identification. Improving peptide identification results by ranking true Peptide-Spectrum Matches (PSMs) over their false counterparts has led to the development of various reranking algorithms. In peptide reranking, discriminative information is essential to distinguish true PSMs from false PSMs. Generally, most peptide reranking methods obtain discriminative information directly from database search scores or by training machine learning models. Information in the protein database and MS1 spectra (i.e., single stage MS spectra) is ignored. In this paper, we propose to use information in the protein database and MS1 spectra to rerank peptide identification results. To quantitatively analyze their effects on peptide reranking results, three peptide reranking methods are proposed: PPMRanker, PPIRanker, and MIRanker. PPMRanker only uses Protein-Peptide Map (PPM) information from the protein database, PPIRanker only uses Precursor Peak Intensity (PPI) information, and MIRanker employs both PPM information and PPI information. According to our experiments on a standard protein mixture data set, a human data set and a mouse data set, PPMRanker and MIRanker achieve better peptide reranking results than PeptideProphet, PeptideProphet+NSP (number of sibling peptides) and a score regularization method SRPI. The source code of PPMRanker, PPIRanker, and MIRanker, and all supplementary documents are available at our website: http://bioinformatics.ust.hk/pepreranking/. Alternatively, these documents can also be downloaded from: http://sourceforge.net/projects/pepreranking/.
Lepoivre, Cyrille; Bergon, Aurélie; Lopez, Fabrice; Perumal, Narayanan B; Nguyen, Catherine; Imbert, Jean; Puthier, Denis
2012-01-31
Deciphering gene regulatory networks by in silico approaches is a crucial step in the study of the molecular perturbations that occur in diseases. The development of regulatory maps is a tedious process requiring the comprehensive integration of various lines of evidence scattered over biological databases. Thus, the research community would greatly benefit from having a unified database storing known and predicted molecular interactions. Furthermore, given the intrinsic complexity of the data, the development of new tools offering integrated and meaningful visualizations of molecular interactions is necessary to help users draw new hypotheses without being overwhelmed by the density of the resulting graph. We extend the previously developed TranscriptomeBrowser database with a set of tables containing 1,594,978 human and mouse molecular interactions. The database includes: (i) predicted regulatory interactions (computed by scanning vertebrate alignments with a set of 1,213 position weight matrices), (ii) potential regulatory interactions inferred from systematic analysis of ChIP-seq experiments, (iii) regulatory interactions curated from the literature, (iv) predicted post-transcriptional regulation by micro-RNA, (v) protein kinase-substrate interactions and (vi) physical protein-protein interactions. In order to easily retrieve and efficiently analyze these interactions, we developed InteractomeBrowser, a graph-based knowledge browser that comes as a plug-in for TranscriptomeBrowser. The first objective of InteractomeBrowser is to provide a user-friendly tool to get new insight into any gene list by providing a context-specific display of putative regulatory and physical interactions. To achieve this, InteractomeBrowser relies on a "cell compartments-based layout" that makes use of a subset of the Gene Ontology to map gene products onto relevant cell compartments. 
This layout is particularly powerful for visual integration of heterogeneous biological information and is a productive avenue in generating new hypotheses. The second objective of InteractomeBrowser is to fill the gap between interaction databases and dynamic modeling. It is thus compatible with the network analysis software Cytoscape and with the Gene Interaction Network simulation software (GINsim). We provide examples underlying the benefits of this visualization tool for large gene set analysis related to thymocyte differentiation. The InteractomeBrowser plugin is a powerful tool to get quick access to a knowledge database that includes both predicted and validated molecular interactions. InteractomeBrowser is available through the TranscriptomeBrowser framework and can be found at: http://tagc.univ-mrs.fr/tbrowser/. Our database is updated on a regular basis.
Ran, Xia; Cai, Wei-Jun; Huang, Xiu-Feng; Liu, Qi; Lu, Fan; Qu, Jia; Wu, Jinyu; Jin, Zi-Bing
2014-01-01
Inherited retinal degeneration (IRD), a leading cause of human blindness worldwide, is exceptionally heterogeneous, both clinically and genetically. During the past decades, tremendous efforts have been made to explore this complex heterogeneity, and with the significant advancement of sequencing technology a large number of mutations have been identified in different genes underlying IRD. In this study, we developed a comprehensive database, 'RetinoGenetics', which contains informative knowledge about all known IRD-related genes and mutations. 'RetinoGenetics' currently contains 4270 mutations in 186 genes, with detailed information associated with 164 phenotypes from 934 publications and various types of functional annotations. Extensive annotation was then performed for each gene using various resources, including Gene Ontology, KEGG pathways, protein-protein interactions, mutational annotations and gene-disease networks. Furthermore, with its search functions, convenient browsing and intuitive graphical displays, 'RetinoGenetics' could serve as a valuable resource for unveiling the genetic basis of IRD. Taken together, 'RetinoGenetics' is an integrative, informative and updatable resource for IRD-related genetic predispositions. Database URL: http://www.retinogenetics.org/. © The Author(s) 2014. Published by Oxford University Press.
Bhawna; Bonthala, V S; Gajula, Mnv Prasad
2016-01-01
The common bean [Phaseolus vulgaris (L.)] is one of the essential proteinaceous vegetables grown in developing countries. However, its production is challenged by low yields caused by numerous biotic and abiotic stress conditions. Regulatory transcription factors (TFs) are a key component of the genome and the most significant targets for producing stress-tolerant crops; hence, functional genomic studies of these TFs are important. Therefore, we have constructed a web-accessible TF database for P. vulgaris, called PvTFDB, which contains 2370 putative TF gene models in 49 TF families. This database provides comprehensive information for each identified TF, including sequence data, functional annotation, SSRs with their primer sets, protein physical properties, chromosomal location, phylogeny, tissue-specific gene expression data, orthologues, cis-regulatory elements and gene ontology (GO) assignment. Altogether, this information can be used to expedite functional genomic studies of specific TFs of interest. The objectives of this database are to support functional genomics studies of common bean TFs and to help identify the regulatory mechanisms underlying various stress responses, easing breeding strategies for variety production, through search interfaces (by gene ID and functional annotation) and browsing interfaces (by family and by chromosome). This database will also serve as a promising central repository for researchers and breeders who are working towards the improvement of legume crops. In addition, this database provides unrestricted public access, and users can freely download all data present in the database. Database URL: http://www.multiomics.in/PvTFDB/. © The Author(s) 2016. Published by Oxford University Press.
Bhatia, Vivek N.; Perlman, David H.; Costello, Catherine E.; McComb, Mark E.
2009-01-01
In order that biological meaning may be derived and testable hypotheses may be built from proteomics experiments, assignments of proteins identified by mass spectrometry or other techniques must be supplemented with additional notation, such as information on known protein functions, protein-protein interactions, or biological pathway associations. Collecting, organizing, and interpreting these data often requires the input of experts in the biological field of study, in addition to the time-consuming search for and compilation of information from online protein databases. Furthermore, visualizing this bulk of information can be challenging due to the limited availability of easy-to-use and freely available tools for this process. In response to these constraints, we have undertaken the design of software to automate annotation and visualization of proteomics data in order to accelerate the pace of research. Here we present the Software Tool for Researching Annotations of Proteins (STRAP) – a user-friendly, open-source C# application. STRAP automatically obtains gene ontology (GO) terms associated with proteins in a proteomics results ID list using the freely accessible UniProtKB and EBI GOA databases. Summarized in an easy-to-navigate tabular format, STRAP includes meta-information on the protein in addition to complementary GO terminology. Additionally, this information can be edited by the user so that in-house expertise on particular proteins may be integrated into the larger dataset. STRAP provides a sortable tabular view for all terms, as well as graphical representations of GO-term association data in pie (biological process, cellular component and molecular function) and bar charts (cross comparison of sample sets) to aid in the interpretation of large datasets and differential analysis experiments. 
Furthermore, proteins of interest may be exported as a unique FASTA-formatted file to allow for customizable re-searching of mass spectrometry data, and gene names corresponding to the proteins in the lists may be encoded in the Gaggle microformat for further characterization, including pathway analysis. STRAP, a tutorial, and the C# source code are freely available from http://cpctools.sourceforge.net. PMID:19839595
Mi, Tian; Merlin, Jerlin Camilus; Deverasetty, Sandeep; Gryk, Michael R; Bill, Travis J; Brooks, Andrew W; Lee, Logan Y; Rathnayake, Viraj; Ross, Christian A; Sargeant, David P; Strong, Christy L; Watts, Paula; Rajasekaran, Sanguthevar; Schiller, Martin R
2012-01-01
Minimotif Miner (MnM available at http://minimotifminer.org or http://mnm.engr.uconn.edu) is an online database for identifying new minimotifs in protein queries. Minimotifs are short contiguous peptide sequences that have a known function in at least one protein. Here we report the third release of the MnM database which has now grown 60-fold to approximately 300,000 minimotifs. Since short minimotifs are by their nature not very complex we also summarize a new set of false-positive filters and linear regression scoring that vastly enhance minimotif prediction accuracy on a test data set. This online database can be used to predict new functions in proteins and causes of disease.
NASA Astrophysics Data System (ADS)
Nur Afifah, Diana; Rustanti, Ninik; Anjani, Gemala; Syah, Dahrul; Yanti; Suhartono, Maggy T.
2017-02-01
This paper presents a proteomics study comprising the separation, identification and characterization of proteins. Experiments were conducted on extracellular fibrinolytic proteases from Bacillus licheniformis RO3 and Bacillus pumilus 2.g, isolated from the Indonesian fermented foods red oncom and tempeh gembus. The experimental work comprised the following steps: (1) a combination of one- and two-dimensional electrophoresis analysis, (2) mass spectrometry analysis using MALDI-TOF-MS and (3) investigation using a protein database. The results suggested that there were two new protein fractions of B. licheniformis RO3 and three protein fractions of B. pumilus 2.g. These results have not been previously reported.
Proteome Analysis of the Plasma Membrane of Mycobacterium Tuberculosis
Arora, Shalini; Kosalai, K.; Namane, Abdelkader; Pym, Alex S.; Cole, Stewart T.
2002-01-01
The plasma membrane of Mycobacterium tuberculosis is likely to contain proteins that could serve as novel drug targets, diagnostic probes or even components of a vaccine against tuberculosis. With this in mind, we have undertaken proteome analysis of the membrane of M. tuberculosis H37Rv. Isolated membrane vesicles were extracted with either a detergent (Triton X114) or an alkaline buffer (carbonate) following two of the protocols recommended for membrane protein enrichment. Proteins were resolved by 2D-GE using immobilized pH gradient (IPG) strips, and identified by peptide mass mapping utilizing the M. tuberculosis genome database. The two extraction procedures yielded patterns with minimal overlap. Only two proteins, both HSPs, showed a common presence. MALDI–MS analysis of 61 spots led to the identification of 32 proteins, 17 of which were new to the M. tuberculosis proteome database. We classified 19 of the identified proteins as ‘membrane-associated’; 14 of these were further classified as ‘membrane-bound’, three of which were lipoproteins. The remaining proteins included four heat-shock proteins and several enzymes involved in energy or lipid metabolism. Extraction with Triton X114 was found to be more effective than carbonate for detecting ‘putative’ M. tuberculosis membrane proteins. The protocol was also found to be suitable for comparing BCG and M. tuberculosis membranes, identifying ESAT-6 as being expressed selectively in M. tuberculosis. While this study demonstrates for the first time some of the membrane proteins of M. tuberculosis, it also underscores the problems associated with proteomic analysis of a complex membrane such as that of a mycobacterium. PMID:18629250
BIOPEP database and other programs for processing bioactive peptide sequences.
Minkiewicz, Piotr; Dziuba, Jerzy; Iwaniak, Anna; Dziuba, Marta; Darewicz, Małgorzata
2008-01-01
This review presents the potential for application of computational tools in peptide science based on a sample BIOPEP database and program as well as other programs and databases available via the World Wide Web. The BIOPEP application contains a database of biologically active peptide sequences and a program enabling construction of profiles of the potential biological activity of protein fragments, calculation of quantitative descriptors as measures of the value of proteins as potential precursors of bioactive peptides, and prediction of bonds susceptible to hydrolysis by endopeptidases in a protein chain. Other bioactive and allergenic peptide sequence databases are also presented. Programs enabling the construction of binary and multiple alignments between peptide sequences, the construction of sequence motifs attributed to a given type of bioactivity, searching for potential precursors of bioactive peptides, and the prediction of sites susceptible to proteolytic cleavage in protein chains are available via the Internet as are other approaches concerning secondary structure prediction and calculation of physicochemical features based on amino acid sequence. Programs for prediction of allergenic and toxic properties have also been developed. This review explores the possibilities of cooperation between various programs.
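The prediction of bonds susceptible to endopeptidase hydrolysis mentioned in this abstract can be illustrated with a small Python sketch. It is not the BIOPEP program: it applies only the commonly cited simplified trypsin rule (cleavage on the C-terminal side of lysine or arginine, unless the next residue is proline), and the function names and the example sequence are invented for illustration.

```python
def trypsin_cleavage_sites(sequence):
    """Return 1-based positions after which the peptide bond is cleaved.

    Simplified rule: cleave after K or R unless the following residue is P.
    """
    sites = []
    for i, residue in enumerate(sequence[:-1]):
        if residue in "KR" and sequence[i + 1] != "P":
            sites.append(i + 1)
    return sites


def digest(sequence):
    """Split a sequence into the fragments implied by the cleavage sites."""
    fragments = []
    previous = 0
    for site in trypsin_cleavage_sites(sequence):
        fragments.append(sequence[previous:site])
        previous = site
    fragments.append(sequence[previous:])
    return fragments


# Invented example sequence, used only to exercise the rule.
print(trypsin_cleavage_sites("MKRPAKG"))
print(digest("MKRPAKG"))
```

Other endopeptidases could be handled in the same way by swapping in a different residue rule; real tools usually support many enzymes and more detailed specificity models.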
RiboDB Database: A Comprehensive Resource for Prokaryotic Systematics.
Jauffrit, Frédéric; Penel, Simon; Delmotte, Stéphane; Rey, Carine; de Vienne, Damien M; Gouy, Manolo; Charrier, Jean-Philippe; Flandrois, Jean-Pierre; Brochier-Armanet, Céline
2016-08-01
Ribosomal proteins (r-proteins) are increasingly used as an alternative to ribosomal RNA (rRNA) for prokaryotic systematics. However, their routine use is difficult because r-proteins are often missing or wrongly annotated in complete genome sequences, and there is currently no dedicated exhaustive database of r-proteins. RiboDB aims at filling this gap. This weekly updated comprehensive database allows the fast and easy retrieval of r-protein sequences from publicly available complete prokaryotic genome sequences. The current version of RiboDB contains 90 r-proteins from 3,750 prokaryotic complete genomes encompassing 38 phyla/major classes and 1,759 different species. RiboDB is accessible at http://ribodb.univ-lyon1.fr and through ACNUC interfaces. © The Author 2016. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com.
NCAD, a database integrating the intrinsic conformational preferences of non-coded amino acids
Revilla-López, Guillem; Torras, Juan; Curcó, David; Casanovas, Jordi; Calaza, M. Isabel; Zanuy, David; Jiménez, Ana I.; Cativiela, Carlos; Nussinov, Ruth; Grodzinski, Piotr; Alemán, Carlos
2010-01-01
Peptides and proteins find an ever-increasing number of applications in the biomedical and materials engineering fields. The use of non-proteinogenic amino acids endowed with diverse physicochemical and structural features opens the possibility to design proteins and peptides with novel properties and functions. Moreover, non-proteinogenic residues are particularly useful to control the three-dimensional arrangement of peptidic chains, which is a crucial issue for most applications. However, information regarding such amino acids –also called non-coded, non-canonical or non-standard– is usually scattered among publications specialized in quite diverse fields as well as in patents. Making all these data useful to the scientific community requires new tools and a framework for their assembly and coherent organization. We have successfully compiled, organized and built a database (NCAD, Non-Coded Amino acids Database) containing information about the intrinsic conformational preferences of non-proteinogenic residues determined by quantum mechanical calculations, as well as bibliographic information about their synthesis, physical and spectroscopic characterization, conformational propensities established experimentally, and applications. The architecture of the database is presented in this work together with the first family of non-coded residues included, namely, α-tetrasubstituted α-amino acids. Furthermore, the NCAD usefulness is demonstrated through a test-case application example. PMID:20455555
PATRIC, the bacterial bioinformatics database and analysis resource.
Wattam, Alice R; Abraham, David; Dalay, Oral; Disz, Terry L; Driscoll, Timothy; Gabbard, Joseph L; Gillespie, Joseph J; Gough, Roger; Hix, Deborah; Kenyon, Ronald; Machi, Dustin; Mao, Chunhong; Nordberg, Eric K; Olson, Robert; Overbeek, Ross; Pusch, Gordon D; Shukla, Maulik; Schulman, Julie; Stevens, Rick L; Sullivan, Daniel E; Vonstein, Veronika; Warren, Andrew; Will, Rebecca; Wilson, Meredith J C; Yoo, Hyun Seung; Zhang, Chengdong; Zhang, Yan; Sobral, Bruno W
2014-01-01
The Pathosystems Resource Integration Center (PATRIC) is the all-bacterial Bioinformatics Resource Center (BRC) (http://www.patricbrc.org). A joint effort by two of the original National Institute of Allergy and Infectious Diseases-funded BRCs, PATRIC provides researchers with an online resource that stores and integrates a variety of data types [e.g. genomics, transcriptomics, protein-protein interactions (PPIs), three-dimensional protein structures and sequence typing data] and associated metadata. Datatypes are summarized for individual genomes and across taxonomic levels. All genomes in PATRIC, currently more than 10,000, are consistently annotated using RAST, the Rapid Annotations using Subsystems Technology. Summaries of different data types are also provided for individual genes, where comparisons of different annotations are available, and also include available transcriptomic data. PATRIC provides a variety of ways for researchers to find data of interest and a private workspace where they can store both genomic and gene associations, and their own private data. Both private and public data can be analyzed together using a suite of tools to perform comparative genomic or transcriptomic analysis. PATRIC also includes integrated information related to disease and PPIs. All the data and integrated analysis and visualization tools are freely available. This manuscript describes updates to the PATRIC since its initial report in the 2007 NAR Database Issue.
Mu, Lin
2018-01-01
This work introduces a number of algebraic topology approaches, including multi-component persistent homology, multi-level persistent homology, and electrostatic persistence for the representation, characterization, and description of small molecules and biomolecular complexes. In contrast to the conventional persistent homology, multi-component persistent homology retains critical chemical and biological information during the topological simplification of biomolecular geometric complexity. Multi-level persistent homology enables a tailored topological description of inter- and/or intra-molecular interactions of interest. Electrostatic persistence incorporates partial charge information into topological invariants. These topological methods are paired with Wasserstein distance to characterize similarities between molecules and are further integrated with a variety of machine learning algorithms, including k-nearest neighbors, ensemble of trees, and deep convolutional neural networks, to manifest their descriptive and predictive powers for protein-ligand binding analysis and virtual screening of small molecules. Extensive numerical experiments involving 4,414 protein-ligand complexes from the PDBBind database and 128,374 ligand-target and decoy-target pairs in the DUD database are performed to test respectively the scoring power and the discriminatory power of the proposed topological learning strategies. It is demonstrated that the present topological learning outperforms other existing methods in protein-ligand binding affinity prediction and ligand-decoy discrimination. PMID:29309403
Approaches for Defining the Hsp90-dependent Proteome
Hartson, Steven D.; Matts, Robert L.
2011-01-01
Hsp90 is the target of ongoing drug discovery studies seeking new compounds to treat cancer, neurodegenerative diseases, and protein folding disorders. To better understand Hsp90’s roles in cellular pathologies and in normal cells, numerous studies have utilized proteomics assays and related high-throughput tools to characterize its physical and functional protein partnerships. This review surveys these studies, and summarizes the strengths and limitations of the individual attacks. We also include downloadable spreadsheets compiling all of the Hsp90-interacting proteins identified in more than 23 studies. These tools include cross-references among gene aliases, human homologues of yeast Hsp90-interacting proteins, hyperlinks to database entries, summaries of canonical pathways that are enriched in the Hsp90 interactome, and additional bioinformatic annotations. In addition to summarizing Hsp90 proteomics studies performed to date and the insights they have provided, we identify gaps in our current understanding of Hsp90-mediated proteostasis. PMID:21906632
Identification of helix capping and β-turn motifs from NMR chemical shifts
Shen, Yang; Bax, Ad
2012-01-01
We present an empirical method for identification of distinct structural motifs in proteins on the basis of experimentally determined backbone and 13Cβ chemical shifts. Elements identified include the N-terminal and C-terminal helix capping motifs and five types of β-turns: I, II, I′, II′ and VIII. Using a database of proteins of known structure, the NMR chemical shifts, together with the PDB-extracted amino acid preference of the helix capping and β-turn motifs are used as input data for training an artificial neural network algorithm, which outputs the statistical probability of finding each motif at any given position in the protein. The trained neural networks, contained in the MICS (motif identification from chemical shifts) program, also provide a confidence level for each of their predictions, and values ranging from ca 0.7–0.9 for the Matthews correlation coefficient of its predictions far exceed that attainable by sequence analysis. MICS is anticipated to be useful both in the conventional NMR structure determination process and for enhancing on-going efforts to determine protein structures solely on the basis of chemical shift information, where it can aid in identifying protein database fragments suitable for use in building such structures. PMID:22314702
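For readers unfamiliar with the Matthews correlation coefficient quoted in this abstract, the following Python sketch shows how it is computed from the four cells of a binary confusion matrix. It is the standard textbook formula, not code from MICS; the function name and the convention of returning 0.0 when the denominator vanishes are choices made here.

```python
import math


def matthews_corrcoef(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts.

    tp, tn, fp, fn: true positives, true negatives, false positives and
    false negatives. Returns a value between -1 and 1; 0.0 is returned
    when any marginal total is zero (a convention assumed here).
    """
    denominator = math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator


# A perfect predictor scores 1.0 and a perfectly wrong one scores -1.0.
print(matthews_corrcoef(5, 5, 0, 0))
print(matthews_corrcoef(0, 0, 5, 5))
```

Unlike plain accuracy, this coefficient stays informative when one class (for example, residues not in a capping motif) is much more common than the other.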
Importance of databases of nucleic acids for bioinformatic analysis focused to genomics
NASA Astrophysics Data System (ADS)
Jimenez-Gutierrez, L. R.; Barrios-Hernández, C. J.; Pedraza-Ferreira, G. R.; Vera-Cala, L.; Martinez-Perez, F.
2016-08-01
Recently, bioinformatics has become a new field of science, indispensable in the analysis of the millions of nucleic acid sequences currently deposited in international databases (public or private); these databases contain information on genes, RNA, ORFs, proteins, and intergenic regions, including entire genomes from some species. The analysis of this information requires computer programs, which have been renewed through the use of new mathematical methods and the introduction of artificial intelligence, in addition to the constant creation of supercomputing units built to withstand the heavy workload of sequence analysis. However, innovation is still needed on platforms that allow faster and more effective genomic analyses, with a technological understanding of all biological processes.
The standard operating procedure of the DOE-JGI Metagenome Annotation Pipeline (MAP v.4)
Huntemann, Marcel; Ivanova, Natalia N.; Mavromatis, Konstantinos; ...
2016-02-24
The DOE-JGI Metagenome Annotation Pipeline (MAP v.4) performs structural and functional annotation for metagenomic sequences that are submitted to the Integrated Microbial Genomes with Microbiomes (IMG/M) system for comparative analysis. The pipeline runs on nucleotide sequences provided via the IMG submission site. Users must first define their analysis projects in GOLD and then submit the associated sequence datasets consisting of scaffolds/contigs with optional coverage information and/or unassembled reads in fasta and fastq file formats. The MAP processing consists of feature prediction including identification of protein-coding genes, non-coding RNAs and regulatory RNAs, as well as CRISPR elements. Structural annotation is followed by functional annotation including assignment of protein product names and connection to various protein family databases.
Dynamic changes in proteins during apple (Malus x domestica) fruit ripening and storage
Shi, Yun; Jiang, Li; Zhang, Li; Kang, Ruoyi; Yu, Zhifang
2014-01-01
A proteomic study, using two-dimensional polyacrylamide gel electrophoresis and matrix-assisted laser desorption/ionization time-of-flight/time-of-flight, was conducted in apple fruit (cv. ‘Golden Delicious’) starting at 10 days prior to harvest through 50 days in storage. Total protein was extracted using a phenol/sodium dodecyl sulfate protocol. More than 400 protein spots were detected in each gel and 55 differentially expressed proteins (p<0.05) were subjected to matrix-assisted laser desorption/ionization time-of-flight/time-of-flight analysis. Fifty-three of these proteins were finally identified using an apple expressed sequence tag database downloaded from Genome Database for Rosaceae and placed into six categories. The categories and the percentage of proteins placed in each category were stress response and defense (49.0%), energy and metabolism (34.0%), fruit ripening and senescence (5.6%), signal transduction (3.8%), cell structure (3.8%) and protein synthesis (3.8%). Proteins involved in multiple metabolic pathways, including glycolysis, pentose–phosphate pathway, anti-oxidative systems, photosynthesis and cell wall synthesis, were downregulated, especially during the climacteric burst in respiration and during the senescent stages of fruit development. Proteins classified as allergens or involved in cell wall degradation were upregulated during the ripening process. Some protein spots exhibited a mixed pattern (increasing to maximal abundance followed by a decrease), such as 1-aminocyclopropane-1-carboxylate oxidase, L-ascorbate peroxidase and abscisic acid response proteins. The identification of differentially expressed proteins associated with physiological processes identified in the current study provides a baseline of information for understanding the metabolic processes and regulatory mechanisms that occur in climacteric apple fruit during ripening and senescence. PMID:26504530
Proteome reference map and regulation network of neonatal rat cardiomyocyte
Li, Zi-jian; Liu, Ning; Han, Qi-de; Zhang, You-yi
2011-01-01
Aim: To study and establish a proteome reference map and regulation network of neonatal rat cardiomyocyte. Methods: Cultured cardiomyocytes of neonatal rats were used. All proteins expressed in the cardiomyocytes were separated and identified by two-dimensional polyacrylamide gel electrophoresis (2-DE) and matrix-assisted laser desorption/ionization-time of flight mass spectrometry (MALDI-TOF MS). Biological networks and pathways of the neonatal rat cardiomyocytes were analyzed using the Ingenuity Pathway Analysis (IPA) program (www.ingenuity.com). A 2-DE database was made accessible on-line by Make2ddb package on a web server. Results: More than 1000 proteins were separated on 2D gels, and 148 proteins were identified. The identified proteins were used for the construction of an extensible markup language-based database. Biological networks and pathways were constructed to analyze the functions associated with cardiomyocyte proteins in the database. The 2-DE database of rat cardiomyocyte proteins can be accessed at http://2d.bjmu.edu.cn. Conclusion: A proteome reference map and regulation network of the neonatal rat cardiomyocytes have been established, which may serve as an international platform for storage, analysis and visualization of cardiomyocyte proteomic data. PMID:21841810
RAID: a comprehensive resource for human RNA-associated (RNA-RNA/RNA-protein) interaction.
Zhang, Xiaomeng; Wu, Deng; Chen, Liqun; Li, Xiang; Yang, Jinxurong; Fan, Dandan; Dong, Tingting; Liu, Mingyue; Tan, Puwen; Xu, Jintian; Yi, Ying; Wang, Yuting; Zou, Hua; Hu, Yongfei; Fan, Kaili; Kang, Juanjuan; Huang, Yan; Miao, Zhengqiang; Bi, Miaoman; Jin, Nana; Li, Kongning; Li, Xia; Xu, Jianzhen; Wang, Dong
2014-07-01
Transcriptomic analyses have revealed an unexpected complexity in the eukaryote transcriptome, which includes not only protein-coding transcripts but also an expanding catalog of noncoding RNAs (ncRNAs). Diverse coding and noncoding RNAs (ncRNAs) perform functions through interaction with each other in various cellular processes. In this project, we have developed RAID (http://www.rna-society.org/raid), an RNA-associated (RNA-RNA/RNA-protein) interaction database. RAID intends to provide the scientific community with all-in-one resources for efficient browsing and extraction of the RNA-associated interactions in human. This version of RAID contains more than 6100 RNA-associated interactions obtained by manually reviewing more than 2100 published papers, including 4493 RNA-RNA interactions and 1619 RNA-protein interactions. Each entry contains detailed information on an RNA-associated interaction, including RAID ID, RNA/protein symbol, RNA/protein categories, validated method, expressing tissue, literature references (Pubmed IDs), and detailed functional description. Users can query, browse, analyze, and manipulate RNA-associated (RNA-RNA/RNA-protein) interaction. RAID provides a comprehensive resource of human RNA-associated (RNA-RNA/RNA-protein) interaction network. Furthermore, this resource will help in uncovering the generic organizing principles of cellular function network. © 2014 Zhang et al.; Published by Cold Spring Harbor Laboratory Press for the RNA Society.
ISAAC - InterSpecies Analysing Application using Containers.
Baier, Herbert; Schultz, Jörg
2014-01-15
Information about genes, transcripts and proteins is spread over a wide variety of databases. Different tools have been developed using these databases to identify biological signals in gene lists from large scale analysis. Mostly, they search for enrichments of specific features. However, these tools do not allow an explorative walk through different views, nor do they allow the gene lists to be changed as new findings emerge. To fill this niche, we have developed ISAAC, the InterSpecies Analysing Application using Containers. The central idea of this web based tool is to enable the analysis of sets of genes, transcripts and proteins under different biological viewpoints and to interactively modify these sets at any point of the analysis. Detailed history and snapshot information allows tracing each action. Furthermore, one can easily switch back to previous states and perform new analyses. Currently, sets can be viewed in the context of genomes, protein functions, protein interactions, pathways, regulation, diseases and drugs. Additionally, users can switch between species with an automatic, orthology based translation of existing gene sets. As today's research is usually performed in larger teams and consortia, ISAAC provides group based functionalities. Here, sets as well as results of analyses can be exchanged between members of groups. ISAAC fills the gap between primary databases and tools for the analysis of large gene lists. With its highly modular, JavaEE based design, the implementation of new modules is straightforward. Furthermore, ISAAC comes with an extensive web-based administration interface including tools for the integration of third party data. Thus, a local installation is easily feasible. In summary, ISAAC is tailor made for highly explorative interactive analyses of gene, transcript and protein sets in a collaborative environment.
Pan, Weiran; Li, Gang; Yang, Xiaoxiao; Miao, Jinming
2015-04-01
This study aims to explore the potential mechanism of glioma through bioinformatic approaches. The gene expression profile (GSE4290) of glioma tumor and non-tumor samples was downloaded from Gene Expression Omnibus database. A total of 180 samples were available, including 23 non-tumor and 157 tumor samples. Then the raw data were preprocessed using robust multiarray analysis, and 8,890 differentially expressed genes (DEGs) were identified by using t-test (false discovery rate < 0.0005). Furthermore, 16 known glioma related genes were abstracted from Genetic Association Database. After mapping 8,890 DEGs and 16 known glioma related genes to Human Protein Reference Database, a glioma associated protein-protein interaction network (GAPN) was constructed. In addition, 51 sub-networks in GAPN were screened out through Molecular Complex Detection (score ≥ 1), and sub-network 1 was found to have the closest interaction (score = 3). What is more, for the top 10 sub-networks, Gene Ontology (GO) enrichment analysis (p value < 0.05) was performed, and DEGs involved in sub-network 1 and 2, such as BRMS1L and CCNA1, were predicted to regulate cell growth, cell cycle, and DNA replication via interacting with known glioma related genes. Finally, the overlaps of DEGs and human essential, housekeeping, tissue-specific genes were calculated (p value = 1.0, 1.0, and 0.00014, respectively) and visualized by Venn Diagram package in R. About 61% of human tissue-specific genes were DEGs as well. This research shed new light on the pathogenesis of glioma based on DEGs and GAPN, and our findings might provide potential targets for clinical glioma treatment.
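The false discovery rate threshold applied to the t-test results above can be sketched as follows. The abstract does not say which FDR procedure was used, so the Benjamini-Hochberg adjustment shown here is an assumption, and the p-values are hypothetical placeholders rather than values from GSE4290.

```python
def benjamini_hochberg(pvalues: list[float]) -> list[float]:
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical per-gene p-values from a two-sample t-test.
raw = [0.01, 0.04, 0.03, 0.005]
fdr = benjamini_hochberg(raw)
# Genes passing an (illustrative) FDR cutoff.
selected = [i for i, q in enumerate(fdr) if q < 0.03]
print(fdr, selected)
```

With thousands of genes tested at once, filtering on the adjusted value rather than the raw p-value is what keeps the expected share of false positives among the reported DEGs below the chosen cutoff.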
Proteome analysis of pitcher fluid of the carnivorous plant Nepenthes alata.
Hatano, Naoya; Hamada, Tatsuro
2008-02-01
The genus Nepenthes comprises carnivorous plants that digest insects in pitcher fluid to supplement their nitrogen uptake. In a recent study, two acid proteinases (nepenthesins I and II) were purified from the pitcher fluid. However, no other enzymes involved in prey digestion have been identified, although several enzyme activities have been reported. To identify all the proteins involved, we performed a proteomic analysis of Nepenthes pitcher fluid. The secreted proteins in pitcher fluid were separated by sodium dodecyl sulfate-polyacrylamide gel electrophoresis, and several protein bands were detected by silver staining. The proteins were identified by in-gel tryptic digestion, de novo peptide sequencing, and homology searches against public databases. The proteins included homologues of beta-D-xylosidase, beta-1,3-glucanase, chitinase, and thaumatin-like protein, most of which are designated "pathogenesis-related proteins". These proteins presumably inhibit bacterial growth in the pitcher fluid to ensure sufficient nutrients for Nepenthes growth.
Gutiérrez-Sánchez, Gerardo; Atwood, James; Kolli, V S Kumar; Roussos, Sévastianos; Augur, Christopher
2012-04-01
Caffeine is toxic to most microorganisms. However, some filamentous fungi, such as Aspergillus tamarii, are able to metabolize this alkaloid when fed caffeine as the sole nitrogen source. The aim of the present work was to identify intracellular A. tamarii proteins, regulated by caffeine, using fluorescence difference two-dimensional gel electrophoresis. Specific proteins from two culture media of A. tamarii grown either on ammonium sulfate or caffeine as the sole nitrogen source were analysed by mass spectrometry. Thirteen out of a total of 85 differentially expressed spots were identified after database search. Identified up-regulated proteins include phosphoglycerate kinase, malate dehydrogenase, dyp-type peroxidase family protein, heat shock protein, Cu,Zn superoxide dismutase and xanthine dehydrogenase. Some of the proteins identified in this study are involved in the caffeine degradation pathway as well as in stress response, suggesting that stress proteins could be involved in caffeine metabolism in filamentous fungi.
Venselaar, Hanka; Te Beek, Tim A H; Kuipers, Remko K P; Hekkelman, Maarten L; Vriend, Gert
2010-11-08
Many newly detected point mutations are located in protein-coding regions of the human genome. Knowledge of their effects on the protein's 3D structure provides insight into the protein's mechanism, can aid the design of further experiments, and eventually can lead to the development of new medicines and diagnostic tools. In this article we describe HOPE, a fully automatic program that analyzes the structural and functional effects of point mutations. HOPE collects information from a wide range of information sources including calculations on the 3D coordinates of the protein by using WHAT IF Web services, sequence annotations from the UniProt database, and predictions by DAS services. Homology models are built with YASARA. Data is stored in a database and used in a decision scheme to identify the effects of a mutation on the protein's 3D structure and function. HOPE builds a report with text, figures, and animations that is easy to use and understandable for (bio)medical researchers. We tested HOPE by comparing its output to the results of manually performed projects. In all straightforward cases HOPE performed similarly to a trained bioinformatician. The use of 3D structures helps optimize the results in terms of reliability and details. HOPE's results are easy to understand and are presented in a way that is attractive for researchers without an extensive bioinformatics background.
Suspended marine particulate proteins in coastal and oligotrophic waters
NASA Astrophysics Data System (ADS)
Bridoux, Maxime C.; Neibauer, Jaqui; Ingalls, Anitra E.; Nunn, Brook L.; Keil, Richard G.
2015-03-01
Metaproteomic analyses were performed on suspended sediments collected in one coastal environment (Washington margin, Pacific Ocean, n = 5) and two oligotrophic environments (Atlantic Ocean near BATS, n = 5, and Pacific Ocean near HOTS, n = 5). Using a database of 2.3 million marine proteins developed using the NCBI database, 443 unique peptides were detected from which 363 unique proteins were identified. Samples from the euphotic zone contained on average 2-3x more identifiable proteins than deeper waters (150-1500 m) and these proteins were predominately from photosynthetic organisms. Diatom peptides dominate the spectra of the Washington margin while peptides from cyanobacteria, such as Synechococcus sp. dominated the spectra of both oligotrophic sites. Despite differences in the exact proteins identified at each location, there is good agreement for protein function and cellular location. Proteins in surface waters code for a variety of cellular functions including photosynthesis (24% of detected proteins), energy production (10%), membrane production (9%) and genetic coding and reading (9%), and are split 60-40 between membrane proteins and intracellular cytoplasmic proteins. Sargasso Sea surface waters contain a suite of peptides consistent with proteins involved in circadian rhythms that promote both C and N fixation at night. At depth in the Sargasso Sea, both muscle-derived myosin protein and the muscle-hydrolyzing proteases deseasin MCP-01 and metalloprotease Mcp02 from γ-proteobacteria were observed. Deeper waters contain peptides predominately sourced from γ-proteobacteria (37% of detected proteins) and α-proteobacteria (26%), although peptides from membrane and photosynthetic proteins attributable to phytoplankton were still observed (13%). 
Relative to surface values, detection frequencies for bacterial membrane proteins and extracellular enzymes rose from 9 to 16 and 2 to 4% respectively below the thermocline and the overall balance between membrane proteins and intracellular proteins grows to an approximate 75-25 split. Unlike the phytoplankton membrane proteins, which are detrital in nature, the bacterial protein suite at depth is consistent with living biomass.
sc-PDB-Frag: a database of protein-ligand interaction patterns for Bioisosteric replacements.
Desaphy, Jérémy; Rognan, Didier
2014-07-28
Bioisosteric replacement plays an important role in medicinal chemistry by keeping the biological activity of a molecule while changing either its core scaffold or substituents, thereby facilitating lead optimization and patenting. Bioisosteres are classically chosen in order to keep the main pharmacophoric moieties of the substructure to replace. However, notably when changing a scaffold, no attention is usually paid as to whether all atoms of the reference scaffold are equally important for binding to the desired target. We herewith propose a novel database for bioisosteric replacement (scPDBFrag), capitalizing on our recently published structure-based approach to scaffold hopping, focusing on interaction pattern graphs. Protein-bound ligands are first fragmented and the interaction of the corresponding fragments with their protein environment computed on-the-fly. Using an in-house developed graph alignment tool, interaction pattern graphs can be compared, aligned, and sorted by decreasing similarity to any reference. In the herein presented sc-PDB-Frag database ( http://bioinfo-pharma.u-strasbg.fr/scPDBFrag ), fragments, interaction patterns, alignments, and pairwise similarity scores have been extracted from the sc-PDB database of 8077 druggable protein-ligand complexes and further stored in a relational database. We herewith present the database, its Web implementation, and procedures for identifying true bioisosteric replacements based on conserved interaction patterns.
A series of PDB related databases for everyday needs.
Joosten, Robbie P; te Beek, Tim A H; Krieger, Elmar; Hekkelman, Maarten L; Hooft, Rob W W; Schneider, Reinhard; Sander, Chris; Vriend, Gert
2011-01-01
The Protein Data Bank (PDB) is the world-wide repository of macromolecular structure information. We present a series of databases that run parallel to the PDB. Each database holds one entry, if possible, for each PDB entry. DSSP holds the secondary structure of the proteins. PDBREPORT holds reports on the structure quality and lists errors. HSSP holds a multiple sequence alignment for all proteins. The PDBFINDER holds easy to parse summaries of the PDB file content, augmented with essentials from the other systems. PDB_REDO holds re-refined, and often improved, copies of all structures solved by X-ray. WHY_NOT summarizes why certain files could not be produced. All these systems are updated weekly. The data sets can be used for the analysis of properties of protein structures in areas ranging from structural genomics, to cancer biology and protein design.
Jeong, Seul-Ki; Hancock, William S; Paik, Young-Ki
2015-09-04
Since the launch of the Chromosome-centric Human Proteome Project (C-HPP) in 2012, the number of "missing" proteins has fallen to 2932, down from ∼5932 when the number was first counted in 2011. We compared the characteristics of missing proteins with those of already annotated proteins with respect to transcriptional expression pattern and the time periods in which newly identified proteins were annotated. We learned that missing proteins commonly exhibit lower levels of transcriptional expression and less tissue-specific expression compared with already annotated proteins. This makes it more difficult to identify missing proteins as time goes on. One of the C-HPP goals is to identify alternatively spliced products of proteins (ASPs), which are usually difficult to find by shot-gun proteomic methods due to their sequence similarities with the representative proteins. To resolve this problem, it may be necessary to use a targeted proteomics approach (e.g., selected and multiple reaction monitoring [S/MRM] assays) and an innovative bioinformatics platform that enables the selection of target peptides for rarely expressed missing proteins or ASPs. Given that the success of efforts to identify missing proteins may rely on more informative public databases, it was necessary to upgrade the available integrative databases. To this end, we attempted to improve the features and utility of GenomewidePDB by integrating transcriptomic information (e.g., alternatively spliced transcripts), annotated peptide information, and an advanced search interface that can find proteins of interest when applying a targeted proteomics strategy. This upgraded version of the database, GenomewidePDB 2.0, may not only expedite identification of the remaining missing proteins but also enhance the exchange of information among the proteome community. GenomewidePDB 2.0 is available publicly at http://genomewidepdb.proteomix.org/.
Anekthanakul, Krittima; Hongsthong, Apiradee; Senachak, Jittisak; Ruengjitchatchawalya, Marasri
2018-04-20
Bioactive peptides, including biological sources-derived peptides with different biological activities, are protein fragments that influence the functions or conditions of organisms, in particular humans and animals. Conventional methods of identifying bioactive peptides are time-consuming and costly. To quicken the processes, several bioinformatics tools have recently been used to facilitate screening of the potential peptides prior to their activity assessment in vitro and/or in vivo. In this study, we developed an efficient computational method, SpirPep, which offers many advantages over the currently available tools. The SpirPep web application tool is a one-stop analysis and visualization facility to assist bioactive peptide discovery. The tool is equipped with 15 customized enzymes and 1-3 miscleavage options, which allows in silico digestion of protein sequences encoded by protein-coding genes from single, multiple, or genome-wide scaling, and then directly classifies the peptides by bioactivity using an in-house database that contains bioactive peptides collected from 13 public databases. With this tool, the resulting peptides are categorized by each selected enzyme, and shown in a tabular format where the peptide sequences can be tracked back to their original proteins. The developed tool and webpages are coded in PHP and HTML with CSS/JavaScript. Moreover, the tool allows protein-peptide alignment visualization by Generic Genome Browser (GBrowse) to display the region and details of the proteins and peptides within each parameter, while considering digestion design for the desirable bioactivity. SpirPep is efficient; it takes less than 20 min to digest 3000 proteins (751,860 amino acids) with 15 enzymes and three miscleavages for each enzyme, and only a few seconds for single enzyme digestion. The tool also identified more bioactive peptides than the benchmarked tool; an example of a validated pentapeptide (FLPIL) from LC-MS/MS was demonstrated.
The web and database server are available at http://spirpepapp.sbi.kmutt.ac.th . SpirPep, a web-based bioactive peptide discovery application, is an in silico-based tool with an overview of the results. The platform is a one-stop analysis and visualization facility; and offers advantages over the currently available tools. This tool may be useful for further bioactivity analysis and the quantitative discovery of desirable peptides.
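The in silico digestion with miscleavage options described above can be sketched in a few lines. This is not SpirPep's code (the abstract states the tool is written in PHP): the `digest` function, the trypsin-like rule (cleave after K or R unless followed by P), and the example sequence are illustrative assumptions.

```python
import re


def digest(sequence: str, max_missed: int = 0, min_len: int = 1) -> list[str]:
    """Cleave after K or R (not before P), allowing up to max_missed miscleavages."""
    # End positions of every residue that satisfies the cleavage rule.
    sites = [m.end() for m in re.finditer(r"[KR](?!P)", sequence)]
    # Fragment boundaries: sequence start, internal cleavage sites, sequence end.
    bounds = [0] + [s for s in sites if s < len(sequence)] + [len(sequence)]
    peptides = []
    for i in range(len(bounds) - 1):
        # Joining j - i consecutive fragments corresponds to j - i - 1 miscleavages.
        last = min(i + 1 + max_missed, len(bounds) - 1)
        for j in range(i + 1, last + 1):
            peptide = sequence[bounds[i]:bounds[j]]
            if len(peptide) >= min_len:
                peptides.append(peptide)
    return peptides


# Hypothetical protein fragment, used only to show the output shape.
print(digest("AKRPGKFLPILR", max_missed=0))
print(digest("AKRPGKFLPILR", max_missed=1))
```

Each resulting peptide could then be looked up in a set of known bioactive sequences; supporting a different enzyme would only require changing the regular expression that defines the cleavage rule.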
CCProf: exploring conformational change profile of proteins
Chang, Che-Wei; Chou, Chai-Wei; Chang, Darby Tien-Hao
2016-01-01
In many biological processes, proteins have important interactions with various molecules such as proteins, ions or ligands. Many proteins undergo conformational changes upon these interactions, where regions with large conformational changes are critical to the interactions. This work presents the CCProf platform, which provides conformational changes of entire proteins, here termed the conformational change profile (CCP). CCProf aims to be a platform where users can study potential causes of novel conformational changes. It provides 10 biological features, including conformational change, potential binding target site, secondary structure, conservation, disorder propensity, hydropathy propensity, sequence domain, structural domain, phosphorylation site and catalytic site. All this information is integrated into a well-aligned view, so that researchers can visually capture important relationships between different biological features. CCProf contains 986 187 protein structure pairs for 3123 proteins. In addition, CCProf provides a 3D view in which users can see the protein structures before and after conformational changes as well as binding targets that induce conformational changes. All information (e.g. CCP, binding targets and protein structures) shown in CCProf, including intermediate data, is available for download to expedite further analyses. Database URL: http://zoro.ee.ncku.edu.tw/ccprof/ PMID:27016699
GenoMycDB: a database for comparative analysis of mycobacterial genes and genomes.
Catanho, Marcos; Mascarenhas, Daniel; Degrave, Wim; Miranda, Antonio Basílio de
2006-03-31
Several databases and computational tools have been created with the aim of organizing, integrating and analyzing the wealth of information generated by large-scale sequencing projects of mycobacterial genomes and those of other organisms. However, with very few exceptions, these databases and tools do not allow for massive and/or dynamic comparison of these data. GenoMycDB (http://www.dbbm.fiocruz.br/GenoMycDB) is a relational database built for large-scale comparative analyses of completely sequenced mycobacterial genomes, based on their predicted protein content. Its central structure is composed of the results obtained after pair-wise sequence alignments among all the predicted proteins coded by the genomes of six mycobacteria: Mycobacterium tuberculosis (strains H37Rv and CDC1551), M. bovis AF2122/97, M. avium subsp. paratuberculosis K10, M. leprae TN, and M. smegmatis MC2 155. The database stores the computed similarity parameters of every aligned pair, providing for each protein sequence the predicted subcellular localization, the assigned cluster of orthologous groups, the features of the corresponding gene, and links to several important databases. Tables containing pairs or groups of potential homologs between selected species/strains can be produced dynamically by user-defined criteria, based on one or multiple sequence similarity parameters. In addition, searches can be restricted according to the predicted subcellular localization of the protein, the DNA strand of the corresponding gene and/or the description of the protein. Massive data search and/or retrieval are available, and different ways of exporting the result are offered. GenoMycDB provides an on-line resource for the functional classification of mycobacterial proteins as well as for the analysis of genome structure, organization, and evolution.
Ferro, Myriam; Brugière, Sabine; Salvi, Daniel; Seigneurin-Berny, Daphné; Court, Magali; Moyet, Lucas; Ramus, Claire; Miras, Stéphane; Mellal, Mourad; Le Gall, Sophie; Kieffer-Jaquinod, Sylvie; Bruley, Christophe; Garin, Jérôme; Joyard, Jacques; Masselon, Christophe; Rolland, Norbert
2010-06-01
Recent advances in the proteomics field have allowed a series of high throughput experiments to be conducted on chloroplast samples, and the data are available in several public databases. However, the accurate localization of many chloroplast proteins often remains hypothetical. This is especially true for envelope proteins. We went a step further into the knowledge of the chloroplast proteome by focusing, in the same set of experiments, on the localization of proteins in the stroma, the thylakoids, and envelope membranes. LC-MS/MS-based analyses first allowed building the AT_CHLORO database (http://www.grenoble.prabi.fr/protehome/grenoble-plant-proteomics/), a comprehensive repertoire of the 1323 proteins, identified by 10,654 unique peptide sequences, present in highly purified chloroplasts and their subfractions prepared from Arabidopsis thaliana leaves. This database also provides extensive proteomics information (peptide sequences and molecular weight, chromatographic retention times, MS/MS spectra, and spectral count) for a unique chloroplast protein accurate mass and time tag database gathering identified peptides with their respective and precise analytical coordinates, molecular weight, and retention time. We assessed the partitioning of each protein in the three chloroplast compartments by using a semiquantitative proteomics approach (spectral count). These data together with an in-depth investigation of the literature were compiled to provide accurate subplastidial localization of previously known and newly identified proteins. A unique knowledge base containing extensive information on the proteins identified in envelope fractions was thus obtained, allowing new insights into this membrane system to be revealed. Altogether, the data we obtained provide unexpected information about plastidial or subplastidial localization of some proteins that were not suspected to be associated with this membrane system.
The spectral counting-based strategy was further validated as the compartmentation of well known pathways (for instance, photosynthesis and amino acid, fatty acid, or glycerolipid biosynthesis) within chloroplasts could be dissected. It also allowed revisiting the compartmentation of the chloroplast metabolism and functions.
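The spectral-count partitioning described above can be sketched as a simple normalization. This is a minimal illustration, not the AT_CHLORO authors' procedure: the `partition` helper and the counts are hypothetical, and a real analysis would also need to account for differences in sampling depth between fractions.

```python
def partition(counts: dict[str, int]) -> dict[str, float]:
    """Convert per-compartment spectral counts into fractions of the total."""
    total = sum(counts.values())
    if total == 0:
        # A protein with no assigned spectra has no defined partitioning.
        return {compartment: 0.0 for compartment in counts}
    return {compartment: n / total for compartment, n in counts.items()}


# Hypothetical spectral counts for one protein across the three subfractions.
protein_counts = {"envelope": 30, "stroma": 15, "thylakoid": 5}
fractions = partition(protein_counts)
main_compartment = max(fractions, key=fractions.get)
print(fractions, main_compartment)
```

Expressing counts as fractions makes proteins of very different abundance comparable, which is what allows a semiquantitative assignment to a dominant compartment.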
PhosphoregDB: The tissue and sub-cellular distribution of mammalian protein kinases and phosphatases
Forrest, Alistair RR; Taylor, Darrin F; Fink, J Lynn; Gongora, M Milena; Flegg, Cameron; Teasdale, Rohan D; Suzuki, Harukazu; Kanamori, Mutsumi; Kai, Chikatoshi; Hayashizaki, Yoshihide; Grimmond, Sean M
2006-01-01
Background: Protein kinases and protein phosphatases are the fundamental components of phosphorylation dependent protein regulatory systems. We have created a database for the protein kinase-like and phosphatase-like loci of mouse that integrates protein sequence, interaction, classification and pathway information with the results of a systematic screen of their sub-cellular localization and tissue specific expression data mined from the GNF tissue atlas of mouse. Results: The database lets users query where a specific kinase or phosphatase is expressed at both the tissue and sub-cellular levels. Similarly the interface allows the user to query by tissue, pathway or sub-cellular localization, to reveal which components are co-expressed or co-localized. A review of their expression reveals 30% of these components are detected in all tissues tested while 70% show some level of tissue restriction. Hierarchical clustering of the expression data reveals that expression of these genes can be used to separate the samples into tissues of related lineage, including 3 larger clusters of nervous tissue, developing embryo and cells of the immune system. By overlaying the expression, sub-cellular localization and classification data we examine correlations between class, specificity and tissue restriction and show that tyrosine kinases are generally expressed in fewer tissues than serine/threonine kinases. Conclusion: Together these data demonstrate that cell type specific systems exist to regulate protein phosphorylation and that for accurate modelling and for determination of enzyme substrate relationships the co-location of components needs to be considered. PMID:16504016
Database resources of the National Center for Biotechnology Information
Wheeler, David L.; Barrett, Tanya; Benson, Dennis A.; Bryant, Stephen H.; Canese, Kathi; Chetvernin, Vyacheslav; Church, Deanna M.; DiCuccio, Michael; Edgar, Ron; Federhen, Scott; Geer, Lewis Y.; Helmberg, Wolfgang; Kapustin, Yuri; Kenton, David L.; Khovayko, Oleg; Lipman, David J.; Madden, Thomas L.; Maglott, Donna R.; Ostell, James; Pruitt, Kim D.; Schuler, Gregory D.; Schriml, Lynn M.; Sequeira, Edwin; Sherry, Stephen T.; Sirotkin, Karl; Souvorov, Alexandre; Starchenko, Grigory; Suzek, Tugba O.; Tatusov, Roman; Tatusova, Tatiana A.; Wagner, Lukas; Yaschenko, Eugene
2006-01-01
In addition to maintaining the GenBank(R) nucleic acid sequence database, the National Center for Biotechnology Information (NCBI) provides analysis and retrieval resources for the data in GenBank and other biological data made available through NCBI's Web site. NCBI resources include Entrez, the Entrez Programming Utilities, MyNCBI, PubMed, PubMed Central, Entrez Gene, the NCBI Taxonomy Browser, BLAST, BLAST Link (BLink), Electronic PCR, OrfFinder, Spidey, Splign, RefSeq, UniGene, HomoloGene, ProtEST, dbMHC, dbSNP, Cancer Chromosomes, Entrez Genomes and related tools, the Map Viewer, Model Maker, Evidence Viewer, Clusters of Orthologous Groups, Retroviral Genotyping Tools, HIV-1, Human Protein Interaction Database, SAGEmap, Gene Expression Omnibus, Entrez Probe, GENSAT, Online Mendelian Inheritance in Man, Online Mendelian Inheritance in Animals, the Molecular Modeling Database, the Conserved Domain Database, the Conserved Domain Architecture Retrieval Tool and the PubChem suite of small molecule databases. Augmenting many of the Web applications are custom implementations of the BLAST program optimized to search specialized datasets. All of the resources can be accessed through the NCBI home page at: . PMID:16381840
PlaMoM: a comprehensive database compiles plant mobile macromolecules
Guan, Daogang; Yan, Bin; Thieme, Christoph; Hua, Jingmin; Zhu, Hailong; Boheler, Kenneth R.; Zhao, Zhongying; Kragler, Friedrich; Xia, Yiji; Zhang, Shoudong
2017-01-01
In plants, various phloem-mobile macromolecules including noncoding RNAs, mRNAs and proteins are suggested to act as important long-distance signals in regulating crucial physiological and morphological transition processes such as flowering, plant growth and stress responses. Given recent advances in high-throughput sequencing technologies, numerous mobile macromolecules have been identified in diverse plant species from different plant families. However, most of the identified mobile macromolecules are not annotated in current versions of species-specific databases and are only available as non-searchable datasheets. To facilitate study of the mobile signaling macromolecules, we compiled the PlaMoM (Plant Mobile Macromolecules) database, a resource that provides convenient and interactive search tools allowing users to retrieve, to analyze and also to predict mobile RNAs/proteins. Each entry in the PlaMoM contains detailed information such as nucleotide/amino acid sequences, ortholog partners, related experiments, gene functions and literature. For the model plant Arabidopsis thaliana, protein–protein interactions of mobile transcripts are presented as interactive molecular networks. Furthermore, PlaMoM provides a built-in tool to identify potential RNA mobility signals such as tRNA-like structures. The current version of PlaMoM compiles a total of 17 991 mobile macromolecules from 14 plant species/ecotypes from published data and literature. PlaMoM is available at http://www.systembioinfo.org/plamom/. PMID:27924044
Protein Structure and Function Prediction Using I-TASSER
Yang, Jianyi; Zhang, Yang
2016-01-01
I-TASSER is a hierarchical protocol for automated protein structure prediction and structure-based function annotation. Starting from the amino acid sequence of target proteins, I-TASSER first generates full-length atomic structural models from multiple threading alignments and iterative structural assembly simulations followed by atomic-level structure refinement. The biological functions of the protein, including ligand-binding sites, enzyme commission number, and gene ontology terms, are then inferred from known protein function databases based on sequence and structure profile comparisons. I-TASSER is freely available as both an on-line server and a stand-alone package. This unit describes how to use the I-TASSER protocol to generate structure and function prediction and how to interpret the prediction results, as well as alternative approaches for further improving the I-TASSER modeling quality for distant-homologous and multi-domain protein targets. PMID:26678386
Transporter taxonomy - a comparison of different transport protein classification schemes.
Viereck, Michael; Gaulton, Anna; Digles, Daniela; Ecker, Gerhard F
2014-06-01
Currently, there are more than 800 well characterized human membrane transport proteins (including channels and transporters) and there are estimates that about 10% (approx. 2000) of all human genes are related to transport. Membrane transport proteins are of interest as potential drug targets, for drug delivery, and as a cause of side effects and drug–drug interactions. In light of the development of Open PHACTS, which provides an open pharmacological space, we analyzed selected membrane transport protein classification schemes (Transporter Classification Database, ChEMBL, IUPHAR/BPS Guide to Pharmacology, and Gene Ontology) for their ability to serve as a basis for pharmacology driven protein classification. A comparison of these membrane transport protein classification schemes by using a set of clinically relevant transporters as use-case reveals the strengths and weaknesses of the different taxonomy approaches.
Quality assurance for the query and distribution systems of the RCSB Protein Data Bank
Bluhm, Wolfgang F.; Beran, Bojan; Bi, Chunxiao; Dimitropoulos, Dimitris; Prlić, Andreas; Quinn, Gregory B.; Rose, Peter W.; Shah, Chaitali; Young, Jasmine; Yukich, Benjamin; Berman, Helen M.; Bourne, Philip E.
2011-01-01
The RCSB Protein Data Bank (RCSB PDB, www.pdb.org) is a key online resource for structural biology and related scientific disciplines. The website is used on average by 165 000 unique visitors per month, and more than 2000 other websites link to it. The amount and complexity of PDB data as well as the expectations on its usage are growing rapidly. Therefore, ensuring the reliability and robustness of the RCSB PDB query and distribution systems are crucially important and increasingly challenging. This article describes quality assurance for the RCSB PDB website at several distinct levels, including: (i) hardware redundancy and failover, (ii) testing protocols for weekly database updates, (iii) testing and release procedures for major software updates and (iv) miscellaneous monitoring and troubleshooting tools and practices. As such it provides suggestions for how other websites might be operated. Database URL: www.pdb.org PMID:21382834
BIND: the Biomolecular Interaction Network Database
Bader, Gary D.; Betel, Doron; Hogue, Christopher W. V.
2003-01-01
The Biomolecular Interaction Network Database (BIND: http://bind.ca) archives biomolecular interaction, complex and pathway information. A web-based system is available to query, view and submit records. BIND continues to grow with the addition of individual submissions as well as interaction data from the PDB and a number of large-scale interaction and complex mapping experiments using yeast two hybrid, mass spectrometry, genetic interactions and phage display. We have developed a new graphical analysis tool that provides users with a view of the domain composition of proteins in interaction and complex records to help relate functional domains to protein interactions. An interaction network clustering tool has also been developed to help focus on regions of interest. Continued input from users has helped further mature the BIND data specification, which now includes the ability to store detailed information about genetic interactions. The BIND data specification is available as ASN.1 and XML DTD. PMID:12519993
Moore, Jeffrey C; Spink, John; Lipp, Markus
2012-04-01
Food ingredient fraud and economically motivated adulteration are emerging risks, but a comprehensive compilation of information about known problematic ingredients and detection methods does not currently exist. The objectives of this research were to collect such information from publicly available articles in scholarly journals and general media, organize it into a database, and review and analyze the data to identify trends. The results summarized are a database that will be published in the US Pharmacopeial Convention's Food Chemicals Codex, 8th edition, and includes 1305 records, including 1000 records with analytical methods collected from 677 references. Olive oil, milk, honey, and saffron were the most common targets for adulteration reported in scholarly journals, and potentially harmful issues identified include spices diluted with lead chromate and lead tetraoxide, substitution of Chinese star anise with toxic Japanese star anise, and melamine adulteration of high protein content foods. High-performance liquid chromatography and infrared spectroscopy were the most common analytical detection procedures, and chemometrics data analysis was used in a large number of reports. Future expansion of this database will include additional publicly available articles published before 1980 and in other languages, as well as data outside the public domain. The authors recommend in-depth analyses of individual incidents. This report describes the development and application of a database of food ingredient fraud issues from publicly available references. The database provides baseline information and data useful to governments, agencies, and individual companies assessing the risks of specific products produced in specific regions as well as products distributed and sold in other regions. In addition, the report describes current analytical technologies for detecting food fraud and identifies trends and developments.
© 2012 US Pharmacopeia. Journal of Food Science © 2012 Institute of Food Technologists®
Mo, Fan; Hong, Xu; Gao, Feng; Du, Lin; Wang, Jun; Omenn, Gilbert S; Lin, Biaoyang
2008-12-16
Alternative splicing is an important gene regulation mechanism. It is estimated that about 74% of multi-exon human genes have alternative splicing. High throughput tandem (MS/MS) mass spectrometry provides valuable information for rapidly identifying potentially novel alternatively-spliced protein products from experimental datasets. However, the ability to identify alternative splicing events through tandem mass spectrometry depends on the database against which the spectra are searched. We wrote scripts in perl, Bioperl, mysql and Ensembl API and built a theoretical exon-exon junction protein database to account for all possible combinations of exons for a gene while keeping the frame of translation (i.e., keeping only in-phase exon-exon combinations) from the Ensembl Core Database. Using our liver cancer MS/MS dataset, we identified a total of 488 non-redundant peptides that represent putative exon skipping events. Our exon-exon junction database provides the scientific community with an efficient means to identify novel alternatively spliced (exon skipping) protein isoforms using mass spectrometry data. This database will be useful in annotating genome structures using rapidly accumulating proteomics data.
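The in-phase filtering of exon-exon combinations can be sketched as follows. This is an illustration only, written in Python rather than the Perl and Ensembl API scripts the authors describe. It assumes each exon carries a start phase and an end phase in the Ensembl style (0, 1 or 2, with -1 marking a non-coding boundary), and that a junction keeps the reading frame when the upstream end phase equals the downstream start phase. The Exon class and function name are hypothetical.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import List, Tuple


@dataclass(frozen=True)
class Exon:
    """Hypothetical exon record; phases follow the Ensembl-style convention."""
    name: str
    phase: int      # codon position at which the exon starts (0, 1, 2; -1 = non-coding)
    end_phase: int  # codon position at which the exon ends (0, 1, 2; -1 = non-coding)


def in_phase_junctions(exons: List[Exon]) -> List[Tuple[str, str]]:
    """List every upstream/downstream exon pair whose junction preserves the frame.

    Exons must be given in transcript order. Pairs of non-adjacent exons
    correspond to candidate exon-skipping events.
    """
    junctions = []
    for upstream, downstream in combinations(exons, 2):
        if upstream.end_phase == -1 or downstream.phase == -1:
            continue  # skip boundaries that lie outside the coding sequence
        if upstream.end_phase == downstream.phase:
            junctions.append((upstream.name, downstream.name))
    return junctions
```

Translating the sequence around each retained junction would then give the peptide entries against which MS/MS spectra are searched; that step is omitted here.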
HIV Structural Database using Chem BLAST for all classes of AIDS inhibitors
National Institute of Standards and Technology Data Gateway
SRD 155 HIV Structural Database using Chem BLAST for all classes of AIDS inhibitors (Web, free access) The HIV structural database (HIVSDB) is a comprehensive collection of the structures of HIV protease, both of unliganded enzyme and of its inhibitor complexes. It contains abstracts and crystallographic data such as inhibitor and protein coordinates for 248 data sets, of which only 141 are from the Protein Data Bank (PDB).
Floden, Evan W; Tommaso, Paolo D; Chatzou, Maria; Magis, Cedrik; Notredame, Cedric; Chang, Jia-Ming
2016-07-08
The PSI/TM-Coffee web server performs multiple sequence alignment (MSA) of proteins by combining homology extension with a consistency based alignment approach. Homology extension is performed with Position Specific Iterative (PSI) BLAST searches against a choice of redundant and non-redundant databases. The main novelty of this server is to allow databases of reduced complexity to rapidly perform homology extension. This server also gives the possibility to use transmembrane proteins (TMPs) reference databases to allow even faster homology extension on this important category of proteins. Aside from an MSA, the server also outputs topological prediction of TMPs using the HMMTOP algorithm. Previous benchmarking of the method has shown this approach outperforms the most accurate alignment methods such as MSAProbs, Kalign, PROMALS, MAFFT, ProbCons and PRALINE™. The web server is available at http://tcoffee.crg.cat/tmcoffee. © The Author(s) 2016. Published by Oxford University Press on behalf of Nucleic Acids Research.
Kobayashi, Norio; Ishii, Manabu; Takahashi, Satoshi; Mochizuki, Yoshiki; Matsushima, Akihiro; Toyoda, Tetsuro
2011-01-01
Global cloud frameworks for bioinformatics research databases become huge and heterogeneous; solutions face various diametric challenges comprising cross-integration, retrieval, security and openness. To address this, as of March 2011 organizations including RIKEN published 192 mammalian, plant and protein life sciences databases having 8.2 million data records, integrated as Linked Open or Private Data (LOD/LPD) using SciNetS.org, the Scientists' Networking System. The huge quantity of linked data this database integration framework covers is based on the Semantic Web, where researchers collaborate by managing metadata across public and private databases in a secured data space. This outstripped the data query capacity of existing interface tools like SPARQL. Actual research also requires specialized tools for data analysis using raw original data. To solve these challenges, in December 2009 we developed the lightweight Semantic-JSON interface to access each fragment of linked and raw life sciences data securely under the control of programming languages popularly used by bioinformaticians such as Perl and Ruby. Researchers successfully used the interface across 28 million semantic relationships for biological applications including genome design, sequence processing, inference over phenotype databases, full-text search indexing and human-readable contents like ontology and LOD tree viewers. Semantic-JSON services of SciNetS.org are provided at http://semanticjson.org. PMID:21632604
RNAcentral: A vision for an international database of RNA sequences
Bateman, Alex; Agrawal, Shipra; Birney, Ewan; Bruford, Elspeth A.; Bujnicki, Janusz M.; Cochrane, Guy; Cole, James R.; Dinger, Marcel E.; Enright, Anton J.; Gardner, Paul P.; Gautheret, Daniel; Griffiths-Jones, Sam; Harrow, Jen; Herrero, Javier; Holmes, Ian H.; Huang, Hsien-Da; Kelly, Krystyna A.; Kersey, Paul; Kozomara, Ana; Lowe, Todd M.; Marz, Manja; Moxon, Simon; Pruitt, Kim D.; Samuelsson, Tore; Stadler, Peter F.; Vilella, Albert J.; Vogel, Jan-Hinnerk; Williams, Kelly P.; Wright, Mathew W.; Zwieb, Christian
2011-01-01
During the last decade there has been a great increase in the number of noncoding RNA genes identified, including new classes such as microRNAs and piRNAs. There is also a large growth in the amount of experimental characterization of these RNA components. Despite this growth in information, it is still difficult for researchers to access RNA data, because key data resources for noncoding RNAs have not yet been created. The most pressing omission is the lack of a comprehensive RNA sequence database, much like UniProt, which provides a comprehensive set of protein knowledge. In this article we propose the creation of a new open public resource that we term RNAcentral, which will contain a comprehensive collection of RNA sequences and fill an important gap in the provision of biomedical databases. We envision RNA researchers from all over the world joining a federated RNAcentral network, contributing specialized knowledge and databases. RNAcentral would centralize key data that are currently held across a variety of databases, allowing researchers instant access to a single, unified resource. This resource would facilitate the next generation of RNA research and help drive further discoveries, including those that improve food production and human and animal health. We encourage additional RNA database resources and research groups to join this effort. We aim to obtain international network funding to further this endeavor. PMID:21940779
Rose, Annkatrin; Manikantan, Sankaraganesh; Schraegle, Shannon J.; Maloy, Michael A.; Stahlberg, Eric A.; Meier, Iris
2004-01-01
Increasing evidence demonstrates the importance of long coiled-coil proteins for the spatial organization of cellular processes. Although several protein classes with long coiled-coil domains have been studied in animals and yeast, our knowledge about plant long coiled-coil proteins is very limited. The repeat nature of the coiled-coil sequence motif often prevents the simple identification of homologs of animal coiled-coil proteins by generic sequence similarity searches. As a consequence, counterparts of many animal proteins with long coiled-coil domains, like lamins, golgins, or microtubule organization center components, have not been identified yet in plants. Here, all Arabidopsis proteins predicted to contain long stretches of coiled-coil domains were identified by applying the algorithm MultiCoil to a genome-wide screen. A searchable protein database, ARABI-COIL (http://www.coiled-coil.org/arabidopsis), was established that integrates information on number, size, and position of predicted coiled-coil domains with subcellular localization signals, transmembrane domains, and available functional annotations. ARABI-COIL serves as a tool to sort and browse Arabidopsis long coiled-coil proteins to facilitate the identification and selection of candidate proteins of potential interest for specific research areas. Using the database, candidate proteins were identified for Arabidopsis membrane-bound, nuclear, and organellar long coiled-coil proteins. PMID:15020757
Dubovenko, Alexey; Nikolsky, Yuri; Rakhmatulin, Eugene; Nikolskaya, Tatiana
2017-01-01
Analysis of NGS and other sequencing data, gene variants, gene expression, proteomics, and other high-throughput (OMICs) data is challenging because of its biological complexity and high level of technical and biological noise. One way to deal with both problems is to perform analysis with a high fidelity annotated knowledgebase of protein interactions, pathways, and functional ontologies. This knowledgebase has to be structured in a computer-readable format and must include software tools for managing experimental data, analysis, and reporting. Here, we present MetaCore™ and Key Pathway Advisor (KPA), an integrated platform for functional data analysis. On the content side, MetaCore and KPA encompass a comprehensive database of molecular interactions of different types, pathways, network models, and ten functional ontologies covering human, mouse, and rat genes. The analytical toolkit includes tools for gene/protein list enrichment analysis, statistical "interactome" tool for the identification of over- and under-connected proteins in the dataset, and a biological network analysis module made up of network generation algorithms and filters. The suite also features Advanced Search, an application for combinatorial search of the database content, as well as a Java-based tool called Pathway Map Creator for drawing and editing custom pathway maps. Applications of MetaCore and KPA include molecular mode of action of disease research, identification of potential biomarkers and drug targets, pathway hypothesis generation, analysis of biological effects for novel small molecule compounds and clinical applications (analysis of large cohorts of patients, and translational and personalized medicine).
Update of KDBI: Kinetic Data of Bio-molecular Interaction database
Kumar, Pankaj; Han, B. C.; Shi, Z.; Jia, J.; Wang, Y. P.; Zhang, Y. T.; Liang, L.; Liu, Q. F.; Ji, Z. L.; Chen, Y. Z.
2009-01-01
Knowledge of the kinetics of biomolecular interactions is important for facilitating the study of cellular processes and underlying molecular events, and is essential for quantitative study and simulation of biological systems. Kinetic Data of Bio-molecular Interaction database (KDBI) has been developed to provide information about experimentally determined kinetic data of protein–protein, protein–nucleic acid, protein–ligand, nucleic acid–ligand binding or reaction events described in the literature. To accommodate increasing demand for studying and simulating biological systems, numerous improvements and updates have been made to KDBI, including new ways to access data by pathway and molecule names, data files in Systems Biology Markup Language format, a more efficient search engine, access to published parameter sets of simulation models of 63 pathways, and a 2.3-fold increase of data (19 263 entries of 10 532 distinctive biomolecular binding and 11 954 interaction events, involving 2635 proteins/protein complexes, 847 nucleic acids, 1603 small molecules and 45 multi-step processes). KDBI is publicly available at http://bidd.nus.edu.sg/group/kdbi/kdbi.asp. PMID:18971255
Pané-Farré, Jan; Kusch, Harald; Wolf, Carmen; Reiß, Swantje; Binh, Le Thi Nguyen; Albrecht, Dirk; Riedel, Katharina; Hecker, Michael; Engelmann, Susanne
2013-01-01
Gel-based proteomics is a powerful approach to study the physiology of Staphylococcus aureus under various growth restricting conditions. We analyzed 679 protein spots from a reference 2-dimensional gel of cytosolic proteins of S. aureus COL by mass spectrometry resulting in 521 different proteins. 4,692 time dependent protein synthesis profiles were generated by exposing S. aureus to nine infection-related stress and starvation stimuli (H2O2, diamide, paraquat, NO, fermentation, nitrate respiration, heat shock, puromycin, mupirocin). These expression profiles are stored in an online resource called Aureolib (http://www.aureolib.de). Moreover, information on target genes of 75 regulators and regulatory elements was included in the database. Cross-comparisons of this extensive data collection of protein synthesis profiles using the tools implemented in Aureolib led to the identification of stress and starvation specific marker proteins. Altogether, 226 protein synthesis profiles showed induction ratios of 2.5-fold or higher under at least one of the tested conditions with 157 protein synthesis profiles specifically induced in response to a single stimulus. The respective proteins might serve as marker proteins for the corresponding stimulus. By contrast, proteins whose synthesis was increased or repressed in response to more than four stimuli are rather exceptional. The only protein that was induced by six stimuli is the universal stress protein SACOL1759. Most strikingly, cluster analyses of synthesis profiles of proteins differentially synthesized under at least one condition revealed only in rare cases a grouping that correlated with known regulon structures. The most prominent examples are the GapR, Rex, and CtsR regulons. In contrast, protein synthesis profiles of proteins belonging to the CodY and σB regulons are widely distributed. In summary, Aureolib is by far the most comprehensive protein expression database for S. aureus and provides an essential tool to decipher more complex adaptation processes in S. aureus during host pathogen interaction. PMID:23967085
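The marker-protein criterion in this abstract (an induction ratio of 2.5-fold or higher, restricted to a single stimulus) can be written down as a short sketch. It assumes one induction ratio per stimulus for each synthesis profile. The function names and data layout are hypothetical and do not reflect the actual Aureolib tools.

```python
from typing import Dict, List

# Cutoff taken from the abstract; how ratios are computed is not specified there.
INDUCTION_THRESHOLD = 2.5


def induced_stimuli(ratios: Dict[str, float],
                    threshold: float = INDUCTION_THRESHOLD) -> List[str]:
    """Return the stimuli under which a profile reaches the induction cutoff."""
    return sorted(s for s, r in ratios.items() if r >= threshold)


def is_specific_marker(ratios: Dict[str, float],
                       threshold: float = INDUCTION_THRESHOLD) -> bool:
    """A profile is a candidate marker if exactly one stimulus induces it."""
    return len(induced_stimuli(ratios, threshold)) == 1
```

Under this reading, a profile induced by both heat shock and puromycin would count toward the 226 induced profiles but not toward the 157 stimulus-specific ones.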
Implementation of a Parallel Protein Structure Alignment Service on Cloud
Hung, Che-Lun; Lin, Yaw-Ling
2013-01-01
Protein structure alignment has become an important strategy by which to identify evolutionary relationships between protein sequences. Several alignment tools are currently available for online comparison of protein structures. In this paper, we propose a parallel protein structure alignment service based on the Hadoop distribution framework. This service includes a protein structure alignment algorithm, a refinement algorithm, and a MapReduce programming model. The refinement algorithm refines the result of alignment. To process vast numbers of protein structures in parallel, the alignment and refinement algorithms are implemented using MapReduce. We analyzed and compared the structure alignments produced by different methods using a dataset randomly selected from the PDB database. The experimental results verify that the proposed algorithm refines the resulting alignments more accurately than existing algorithms. Meanwhile, the computational performance of the proposed service is proportional to the number of processors used in our cloud platform. PMID:23671842
Lafuente, M J; Gamo, F J; Gancedo, C
1996-09-01
We have determined the sequence of a 10624 bp DNA segment located in the left arm of chromosome XV of Saccharomyces cerevisiae. The sequence contains eight open reading frames (ORFs) longer than 100 amino acids. Two of them do not present significant homology with sequences found in the databases. The product of ORF o0553 is identical to the protein encoded by the gene SMF1. Internal to it there is another ORF, o0555 that is apparently expressed. The proteins encoded by ORFs o0559 and o0565 are identical to ribosomal proteins S19.e and L18 respectively. ORF o0550 encodes a protein with an RNA binding signature including RNP motifs and stretches rich in asparagine, glutamine and arginine.
ImmunemiR - A Database of Prioritized Immune miRNA Disease Associations and its Interactome.
Prabahar, Archana; Natarajan, Jeyakumar
2017-01-01
MicroRNAs are the key regulators of gene expression and their abnormal expression in the immune system may be associated with several human diseases such as inflammation, cancer and autoimmune diseases. Elucidation of miRNA disease association through the interactome will deepen the understanding of its disease mechanisms. A specialized database for immune miRNAs is highly desirable to demonstrate the immune miRNA disease associations in the interactome. miRNAs specific to immune related diseases were retrieved from curated databases such as HMDD, miR2disease and PubMed literature based on MeSH classification of immune system diseases. The additional data such as miRNA target genes, genes coding protein-protein interaction information were compiled from related resources. Further, miRNAs were prioritized to specific immune diseases using random walk ranking algorithm. In total 245 immune miRNAs associated with 92 OMIM disease categories were identified from external databases. The resultant data were compiled as ImmunemiR, a database of prioritized immune miRNA disease associations. This database provides both text based annotation information and network visualization of its interactome. To our knowledge, ImmunemiR is the first available database to provide a comprehensive repository of human immune disease associated miRNAs with network visualization options of its target genes, protein-protein interactions (PPI) and its disease associations. It is freely available at http://www.biominingbu.org/immunemir/. Copyright© Bentham Science Publishers; For any queries, please email at epub@benthamscience.org.
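The abstract names a random walk ranking algorithm without giving its details. One common variant, random walk with restart on an undirected interaction network, is sketched below as an assumption about what such a prioritization might look like. The restart probability, the convergence tolerance, and the handling of isolated nodes are illustrative choices, not parameters reported for ImmunemiR.

```python
from typing import Dict, Iterable, List


def random_walk_with_restart(adjacency: Dict[str, List[str]],
                             seeds: Iterable[str],
                             restart: float = 0.3,
                             tol: float = 1e-10,
                             max_iter: int = 1000) -> Dict[str, float]:
    """Score every node by its steady-state visiting probability.

    adjacency maps each node to its neighbours (every node must be a key);
    seeds are the nodes the walker jumps back to with probability `restart`.
    """
    nodes = sorted(adjacency)
    seed_list = sorted(set(seeds))
    start = {n: (1.0 / len(seed_list) if n in seed_list else 0.0) for n in nodes}
    scores = dict(start)
    for _ in range(max_iter):
        updated = {n: restart * start[n] for n in nodes}
        for n in nodes:
            neighbours = adjacency[n]
            if not neighbours:
                # Assumed handling: an isolated node returns its mass to the seeds.
                for s in seed_list:
                    updated[s] += (1.0 - restart) * scores[n] / len(seed_list)
                continue
            share = (1.0 - restart) * scores[n] / len(neighbours)
            for m in neighbours:
                updated[m] += share
        change = sum(abs(updated[n] - scores[n]) for n in nodes)
        scores = updated
        if change < tol:
            break
    return scores
```

Seeding the walk with genes known for a disease and sorting miRNA nodes by score gives a ranking in which nodes closer to the seeds in the network score higher.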
An ontology-based search engine for protein-protein interactions
2010-01-01
Background: Keyword matching or ID matching is the most common searching method in a large database of protein-protein interactions. They are purely syntactic methods, and retrieve the records in the database that contain a keyword or ID specified in a query. Such syntactic search methods often retrieve too few search results or no results despite many potential matches present in the database. Results: We have developed a new method for representing protein-protein interactions and the Gene Ontology (GO) using modified Gödel numbers. This representation is hidden from users but enables a search engine using the representation to efficiently search protein-protein interactions in a biologically meaningful way. Given a query protein with optional search conditions expressed in one or more GO terms, the search engine finds all the interaction partners of the query protein by unique prime factorization of the modified Gödel numbers representing the query protein and the search conditions. Conclusion: Representing the biological relations of proteins and their GO annotations by modified Gödel numbers makes a search engine efficiently find all protein-protein interactions by prime factorization of the numbers. Keyword matching or ID matching search methods often miss the interactions involving a protein that has no explicit annotations matching the search condition, but our search engine retrieves such interactions as well if they satisfy the search condition with a more specific term in the ontology. PMID:20122195
An ontology-based search engine for protein-protein interactions.
Park, Byungkyu; Han, Kyungsook
2010-01-18
Keyword matching or ID matching is the most common searching method in a large database of protein-protein interactions. They are purely syntactic methods, and retrieve the records in the database that contain a keyword or ID specified in a query. Such syntactic search methods often retrieve too few search results or no results despite many potential matches present in the database. We have developed a new method for representing protein-protein interactions and the Gene Ontology (GO) using modified Gödel numbers. This representation is hidden from users but enables a search engine using the representation to efficiently search protein-protein interactions in a biologically meaningful way. Given a query protein with optional search conditions expressed in one or more GO terms, the search engine finds all the interaction partners of the query protein by unique prime factorization of the modified Gödel numbers representing the query protein and the search conditions. Representing the biological relations of proteins and their GO annotations by modified Gödel numbers makes a search engine efficiently find all protein-protein interactions by prime factorization of the numbers. Keyword matching or ID matching search methods often miss the interactions involving a protein that has no explicit annotations matching the search condition, but our search engine retrieves such interactions as well if they satisfy the search condition with a more specific term in the ontology.
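The abstract does not spell out how the "modified" Gödel numbers are constructed. The sketch below therefore uses a deliberately simplified scheme of its own: one prime per ontology term, with each protein encoded as the product of the primes of its annotated terms and all their ancestors. A query for a general term then matches proteins annotated only with a more specific term, and the check reduces to divisibility. The ontology, term names, protein names and function names are all hypothetical.

```python
from math import prod

# Hypothetical toy ontology: term -> parent terms (not real GO identifiers).
PARENTS = {
    "binding": [],
    "protein binding": ["binding"],
    "kinase binding": ["protein binding"],
    "catalytic activity": [],
}


def first_primes(count):
    """Return the first `count` primes by trial division."""
    primes, candidate = [], 2
    while len(primes) < count:
        if all(candidate % p for p in primes):
            primes.append(candidate)
        candidate += 1
    return primes


# One distinct prime per term: binding=2, protein binding=3, ...
PRIME_OF = dict(zip(PARENTS, first_primes(len(PARENTS))))


def with_ancestors(term):
    """Return the term together with every ancestor in the ontology."""
    seen, stack = set(), [term]
    while stack:
        t = stack.pop()
        if t not in seen:
            seen.add(t)
            stack.extend(PARENTS[t])
    return seen


def encode(terms):
    """Encode annotations as a product of primes over terms and ancestors."""
    closure = set()
    for t in terms:
        closure |= with_ancestors(t)
    return prod(PRIME_OF[t] for t in closure)


def matches(code, query_terms):
    """A protein satisfies a term when that term's prime divides its code."""
    return all(code % PRIME_OF[t] == 0 for t in query_terms)


def decode(code):
    """Recover the term set from the (unique) prime factors of the code."""
    return {t for t, p in PRIME_OF.items() if code % p == 0}


PROTEIN_CODE = {
    "P1": encode(["kinase binding"]),      # 2 * 3 * 5 = 30
    "P2": encode(["catalytic activity"]),  # 7
    "P3": encode(["protein binding"]),     # 2 * 3 = 6
}
INTERACTIONS = {"Q": ["P1", "P2", "P3"]}  # hypothetical interaction partners of Q


def partners(query_protein, go_terms):
    """Interaction partners of the query protein that satisfy every GO term."""
    return [p for p in INTERACTIONS.get(query_protein, [])
            if matches(PROTEIN_CODE[p], go_terms)]


print(partners("Q", ["binding"]))  # ['P1', 'P3']
```

P1 is annotated only with "kinase binding", so a keyword search for "binding" would miss it. Under this encoding it is still found, because the prime for "binding" divides its code.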
A glimpse into the proteome of phototrophic bacterium Rhodobacter capsulatus.
Onder, Ozlem; Aygun-Sunar, Semra; Selamoglu, Nur; Daldal, Fevzi
2010-01-01
A first glimpse into the proteome of Rhodobacter capsulatus revealed more than 450 proteins (over 210 known cytoplasmic, 185 known extracytoplasmic and 55 unknown) that were identified with a high degree of confidence using nLC-MS/MS analyses. The accumulated data provide a solid platform for ongoing efforts to establish the proteome of this species and the cellular locations of its constituents. They also indicate that at least 40 of the identified proteins, which were annotated in genome databases as unknown hypothetical proteins, correspond to predicted translation products that are indeed present in cells under the growth conditions used in this work. In addition, matching the identification labels of the proteins reported between the two available R. capsulatus genome databases (ERGO-light with RRCxxxxx and NT05 with NT05RCxxxx numbers) indicated that 11 such proteins are listed only in the latter database.
IMGT, the International ImMunoGeneTics database.
Lefranc, M P; Giudicelli, V; Busin, C; Bodmer, J; Müller, W; Bontrop, R; Lemaitre, M; Malik, A; Chaume, D
1998-01-01
IMGT, the international ImMunoGeneTics database, is an integrated database specialising in Immunoglobulins (Ig), T cell Receptors (TcR) and Major Histocompatibility Complex (MHC) of all vertebrate species, created by Marie-Paule Lefranc, CNRS, Montpellier II University, Montpellier, France (lefranc@ligm.crbm.cnrs-mop.fr). IMGT includes three databases: LIGM-DB (for Ig and TcR), MHC/HLA-DB and PRIMER-DB (the last two in development). IMGT comprises expertly annotated sequences and alignment tables. LIGM-DB contains more than 23 000 Immunoglobulin and T cell Receptor sequences from 78 species. MHC/HLA-DB contains Class I and Class II Human Leucocyte Antigen alignment tables. An IMGT tool, DNAPLOT, developed for Ig, TcR and MHC sequence alignments, is also available. IMGT works in close collaboration with the EMBL database. The goals of IMGT are to establish common access to all immunogenetics data, including nucleotide and protein sequences, oligonucleotide primers, gene maps and other genetic data of Ig, TcR and MHC molecules, and to provide graphical, user-friendly data access. IMGT has important implications in medical research (repertoire in autoimmune diseases, AIDS, leukemias, lymphomas), therapeutic approaches (antibody engineering), genome diversity and genome evolution studies. IMGT is freely available at http://imgt.cnusc.fr:8104 PMID:9399859
Winsor, Geoffrey L; Van Rossum, Thea; Lo, Raymond; Khaira, Bhavjinder; Whiteside, Matthew D; Hancock, Robert E W; Brinkman, Fiona S L
2009-01-01
Pseudomonas aeruginosa is a well-studied opportunistic pathogen that is particularly known for its intrinsic antimicrobial resistance, diverse metabolic capacity, and its ability to cause life-threatening infections in cystic fibrosis patients. The Pseudomonas Genome Database (http://www.pseudomonas.com) was originally developed as a resource for peer-reviewed, continually updated annotation for the Pseudomonas aeruginosa PAO1 reference strain genome. In order to facilitate cross-strain and cross-species genome comparisons with other Pseudomonas species of importance, we have now expanded the database capabilities to include all Pseudomonas species, and have developed or incorporated methods to facilitate high-quality comparative genomics. The database contains robust assessment of orthologs, a novel ortholog clustering method, and incorporates five views of the data at the sequence and annotation levels (Gbrowse, Mauve and custom views) to facilitate genome comparisons. A choice of simple and more flexible user-friendly Boolean search features allows researchers to search and compare annotations or sequences within or between genomes. Other features include more accurate protein subcellular localization predictions and a user-friendly, Boolean searchable log file of updates for the reference strain PAO1. This database aims to continue to provide a high-quality, annotated genome resource for the research community and is available under an open source license.
Database resources of the National Center for Biotechnology Information
Wheeler, David L.; Barrett, Tanya; Benson, Dennis A.; Bryant, Stephen H.; Canese, Kathi; Chetvernin, Vyacheslav; Church, Deanna M.; DiCuccio, Michael; Edgar, Ron; Federhen, Scott; Feolo, Michael; Geer, Lewis Y.; Helmberg, Wolfgang; Kapustin, Yuri; Khovayko, Oleg; Landsman, David; Lipman, David J.; Madden, Thomas L.; Maglott, Donna R.; Miller, Vadim; Ostell, James; Pruitt, Kim D.; Schuler, Gregory D.; Shumway, Martin; Sequeira, Edwin; Sherry, Steven T.; Sirotkin, Karl; Souvorov, Alexandre; Starchenko, Grigory; Tatusov, Roman L.; Tatusova, Tatiana A.; Wagner, Lukas; Yaschenko, Eugene
2008-01-01
In addition to maintaining the GenBank® nucleic acid sequence database, the National Center for Biotechnology Information (NCBI) provides analysis and retrieval resources for the data in GenBank and other biological data available through NCBI's web site. NCBI resources include Entrez, the Entrez Programming Utilities, My NCBI, PubMed, PubMed Central, Entrez Gene, the NCBI Taxonomy Browser, BLAST, BLAST Link, Electronic PCR, OrfFinder, Spidey, Splign, RefSeq, UniGene, HomoloGene, ProtEST, dbMHC, dbSNP, Cancer Chromosomes, Entrez Genome, Genome Project and related tools, the Trace, Assembly, and Short Read Archives, the Map Viewer, Model Maker, Evidence Viewer, Clusters of Orthologous Groups, Influenza Viral Resources, HIV-1/Human Protein Interaction Database, Gene Expression Omnibus, Entrez Probe, GENSAT, Database of Genotype and Phenotype, Online Mendelian Inheritance in Man, Online Mendelian Inheritance in Animals, the Molecular Modeling Database, the Conserved Domain Database, the Conserved Domain Architecture Retrieval Tool and the PubChem suite of small molecule databases. Augmenting the web applications are custom implementations of the BLAST program optimized to search specialized data sets. These resources can be accessed through the NCBI home page at www.ncbi.nlm.nih.gov. PMID:18045790
Grossmann, Jonas; Fernández, Helena; Chaubey, Pururawa M; Valdés, Ana E; Gagliardini, Valeria; Cañal, María J; Russo, Giancarlo; Grossniklaus, Ueli
2017-01-01
Performing proteomic studies on non-model organisms with little or no genomic information is still difficult. However, many specific processes and biochemical pathways occur only in species that are poorly characterized at the genomic level. For example, many plants can reproduce both sexually and asexually, the former allowing the generation of new genotypes and the latter their fixation. Thus, both modes of reproduction are of great agronomic value. However, the molecular basis of asexual reproduction is not well understood in any plant. In ferns, it combines the production of unreduced spores (diplospory) and the formation of sporophytes from somatic cells (apogamy). To lay the groundwork for studying these processes, we performed transcriptomics by next-generation sequencing (NGS) and shotgun proteomics by tandem mass spectrometry in the apogamous fern D. affinis ssp. affinis. For protein identification we used the public viridiplantae database (VPDB) to identify orthologous proteins from other plant species and new transcriptomics data to generate a "species-specific transcriptome database" (SSTDB). In total, 1,397 protein clusters with 5,865 unique peptide sequences were identified (13 decoy proteins out of 1,410, protFDR 0.93% on protein cluster level). We show that using the SSTDB for protein identification increases the number of identified peptides almost fourfold compared to using only the publicly available VPDB. We identified homologs of proteins involved in reproduction of higher plants, including proteins with a potential role in apogamy. With the increasing availability of genomic data from non-model species, similar proteogenomics approaches will improve the sensitivity in protein identification for species only distantly related to models.
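The false discovery rate quoted above can be reproduced with a short calculation. The abstract does not state the formula. The sketch below assumes that protFDR is the number of decoy clusters divided by the number of target (non-decoy) clusters, which is the assumption that matches the reported 0.93%. The variable names are illustrative.

```python
# Figures taken from the abstract.
decoy_clusters = 13
total_clusters = 1410

# Target clusters are the hits that remain once decoys are excluded.
target_clusters = total_clusters - decoy_clusters

# Assumed definition: decoys / targets (not stated in the abstract).
prot_fdr = decoy_clusters / target_clusters

print(f"{target_clusters} target clusters, protFDR = {prot_fdr:.2%}")
# 1397 target clusters, protFDR = 0.93%
```

Dividing by all 1,410 clusters instead would give about 0.92%, so the decoys-over-targets reading is the one consistent with the reported value.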
Delcourt, Vivian; Lucier, Jean-François; Gagnon, Jules; Beaudoin, Maxime C; Vanderperre, Benoît; Breton, Marc-André; Motard, Julie; Jacques, Jean-François; Brunelle, Mylène; Gagnon-Arsenault, Isabelle; Fournier, Isabelle; Ouangraoua, Aida; Hunting, Darel J; Cohen, Alan A; Landry, Christian R; Scott, Michelle S
2017-01-01
Recent functional, proteomic and ribosome profiling studies in eukaryotes have concurrently demonstrated the translation of alternative open-reading frames (altORFs) in addition to annotated protein coding sequences (CDSs). We show that a large number of small proteins could in fact be coded by these altORFs. The putative alternative proteins translated from altORFs have orthologs in many species and contain functional domains. Evolutionary analyses indicate that altORFs often show more extreme conservation patterns than their CDSs. Thousands of alternative proteins are detected in proteomic datasets by reanalysis using a database containing predicted alternative proteins. This is illustrated with specific examples, including altMiD51, a 70 amino acid mitochondrial fission-promoting protein encoded in MiD51/Mief1/SMCR7L, a gene encoding an annotated protein promoting mitochondrial fission. Our results suggest that many genes are multicoding genes and code for a large protein and one or several small proteins. PMID:29083303
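To make the notion of an altORF concrete, the following Python sketch scans the three forward reading frames of a transcript for ATG-to-stop open reading frames and reports those that differ from the annotated CDS. It is a toy illustration, not the authors' prediction pipeline. The sequence, the coordinates, the `min_codons` threshold and the function name are all hypothetical.

```python
STOPS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=2):
    """Return (start, end, frame) for ATG..stop ORFs in the three forward frames.

    `end` is exclusive and includes the stop codon; `min_codons` counts the
    codons from ATG up to, but not including, the stop codon.
    """
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOPS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs


# Hypothetical transcript whose annotated CDS spans the whole sequence (frame 0).
mrna = "ATGGCATGCCCTAAGGGTTGA"
annotated_cds = (0, 21)

all_orfs = find_orfs(mrna)
alt_orfs = [orf for orf in all_orfs if (orf[0], orf[1]) != annotated_cds]

print(all_orfs)  # [(0, 21, 0), (5, 14, 2)]
print(alt_orfs)  # [(5, 14, 2)]
```

Here the second ORF lies inside the annotated CDS but in a shifted reading frame, so the same gene would encode both a large protein and a small one. This is the multicoding situation the abstract describes.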
VerSeDa: vertebrate secretome database
Cortazar, Ana R.; Oguiza, José A.
2017-01-01
With current tools, de novo prediction of the secretome (the full set of proteins secreted by an organism) is a time-consuming bioinformatic task that requires a multifactorial analysis in order to obtain reliable in silico predictions. Hence, to accelerate this process and offer researchers a reliable repository where secretome information can be obtained for vertebrates and model organisms, we have developed VerSeDa (Vertebrate Secretome Database). This freely available database stores information about proteins that are predicted to be secreted through the classical and non-classical mechanisms, for the wide range of vertebrate species deposited at the NCBI, UCSC and ENSEMBL sites. To our knowledge, VerSeDa is the only state-of-the-art database designed to store secretome data from multiple vertebrate genomes, thus saving a considerable amount of time otherwise spent predicting protein features that can be retrieved from this repository directly. Database URL: VerSeDa is freely available at http://genomics.cicbiogune.es/VerSeDa/index.php PMID:28365718
Integrated web visualizations for protein-protein interaction databases.
Jeanquartier, Fleur; Jean-Quartier, Claire; Holzinger, Andreas
2015-06-16
Understanding living systems is crucial for curing diseases. To achieve this task we have to understand biological networks based on protein-protein interactions. Bioinformatics has produced a large number of databases and tools that support analysts in exploring protein-protein interactions on an integrated level for knowledge discovery. They provide predictions and correlations, indicate possibilities for future experimental research and fill the gaps to complete the picture of biochemical processes. There are numerous and huge databases of protein-protein interactions used to gain insights into answering some of the many questions of systems biology. Many computational resources integrate interaction data with additional information on molecular background. However, the vast number of diverse bioinformatics resources poses an obstacle to the goal of understanding. We present a survey of databases that enable the visual analysis of protein networks. We selected M=10 out of N=53 resources supporting visualization, and we tested them against the following set of criteria: interoperability, data integration, quantity of possible interactions, data visualization quality and data coverage. The study reveals differences in usability, visualization features and quality as well as the quantity of interactions. StringDB is the recommended first choice. CPDB presents a comprehensive dataset and IntAct lets the user change the network layout. A comprehensive comparison table is available as a supplementary table on the web at http://tinyurl.com/PPI-DB-Comparison-2015. Only some web resources featuring graph visualization can be successfully applied to interactive visual analysis of protein-protein interactions. Study results underline the necessity for further enhancements of visualization integration in biochemical analysis tools. The challenges identified are data comprehensiveness, confidence, interactive features and visualization maturity.