Science.gov

Sample records for predicting protein function

  1. An iterative approach of protein function prediction

    PubMed Central

    2011-01-01

    Background: Current approaches to predicting protein functions from a protein-protein interaction (PPI) dataset assume that the known functions of annotated proteins determine the functions of proteins whose functions are not yet known (un-annotated proteins). Function prediction is therefore a one-directional, one-off procedure, from annotated proteins to un-annotated proteins. However, interactions between proteins are mutual rather than static and one-directional, even though the functions of some proteins are currently unknown. This means that when a similarity-based approach is used to predict the functions of un-annotated proteins, those proteins, once their functions are predicted, affect the similarities between proteins, which in turn affects the prediction results. In other words, function prediction is a dynamic and mutual procedure. Existing prediction algorithms, however, do not consider this dynamic feature of protein interactions. Results: In this paper, we propose a new approach that predicts protein functions iteratively. This iterative approach incorporates the dynamic and mutual features of protein interactions, as well as the local and global semantic influence of protein functions, into the prediction. To make iterative prediction possible, we propose a new protein similarity derived from protein functions. We adapt new evaluation metrics to evaluate the prediction quality of our algorithm and other similar algorithms. Experiments on real PPI datasets were conducted to evaluate the effectiveness of the proposed approach in predicting unknown protein functions. Conclusions: The iterative approach is more likely to reflect the real biological nature of the relationships between proteins when predicting functions. A proper definition of protein similarity from protein functions is the key to predicting functions iteratively. The
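The idea in the record above, that predicted functions feed back into protein similarity and so into later predictions, can be illustrated with a minimal sketch. This is not the authors' algorithm: the Jaccard-based weight, the synchronous update, and the names `iterative_predict`, `rounds`, and `top_k` are all assumptions made for illustration.

```python
def jaccard(a, b):
    """Jaccard similarity of two function sets (0.0 when both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def iterative_predict(neighbors, known, rounds=5, top_k=1):
    """Repeatedly re-predict functions for un-annotated proteins.

    neighbors: dict mapping each protein to a list of interacting proteins.
    known:     dict mapping annotated proteins to their set of functions.
    The similarity weight below is a hypothetical stand-in, not the
    similarity measure defined in the paper.
    """
    labels = {p: set(f) for p, f in known.items()}
    unannotated = [p for p in neighbors if p not in known]
    for _ in range(rounds):
        updated = {}
        for p in unannotated:
            scores = {}
            for q in neighbors[p]:
                q_funcs = labels.get(q, set())
                # Neighbors whose current labels resemble p's current
                # labels count more in the next round.
                weight = 1.0 + jaccard(labels.get(p, set()), q_funcs)
                for f in q_funcs:
                    scores[f] = scores.get(f, 0.0) + weight
            ranked = sorted(scores, key=lambda f: (-scores[f], f))
            updated[p] = set(ranked[:top_k])
        # Stop once a full round changes nothing.
        if all(labels.get(p, set()) == updated[p] for p in unannotated):
            break
        labels.update(updated)
    return labels
```

In a chain A-B-C where only A is annotated, a single round can label B but not C; C receives a label only in a later round, through the predicted label of B. A one-off procedure would miss that feedback.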

  2. Protein Function Prediction: Problems and Pitfalls.

    PubMed

    Pearson, William R

    2015-09-03

    The characterization of new genomes based on their protein sets has been revolutionized by new sequencing technologies, but biologists seeking to exploit new sequence information are often frustrated by the challenges associated with accurately assigning biological functions to newly identified proteins. Here, we highlight some of the challenges in functional inference from sequence similarity. Investigators can improve the accuracy of function prediction by (1) being conservative about the evolutionary distance to a protein of known function; (2) considering the ambiguous meaning of "functional similarity," and (3) being aware of the limitations of annotations in functional databases. Protein function prediction does not offer "one-size-fits-all" solutions. Prediction strategies work better when the idiosyncrasies of function and functional annotation are better understood.

  3. Protein molecular function prediction by Bayesian phylogenomics.

    PubMed

    Engelhardt, Barbara E; Jordan, Michael I; Muratore, Kathryn E; Brenner, Steven E

    2005-10-01

    We present a statistical graphical model to infer specific molecular function for unannotated protein sequences using homology. Based on phylogenomic principles, SIFTER (Statistical Inference of Function Through Evolutionary Relationships) accurately predicts molecular function for members of a protein family given a reconciled phylogeny and available function annotations, even when the data are sparse or noisy. Our method produced specific and consistent molecular function predictions across 100 Pfam families in comparison to the Gene Ontology annotation database, BLAST, GOtcha, and Orthostrapper. We performed a more detailed exploration of functional predictions on the adenosine-5'-monophosphate/adenosine deaminase family and the lactate/malate dehydrogenase family, in the former case comparing the predictions against a gold standard set of published functional characterizations. Given function annotations for 3% of the proteins in the deaminase family, SIFTER achieves 96% accuracy in predicting molecular function for experimentally characterized proteins as reported in the literature. The accuracy of SIFTER on this dataset is a significant improvement over other currently available methods such as BLAST (75%), GeneQuiz (64%), GOtcha (89%), and Orthostrapper (11%). We also experimentally characterized the adenosine deaminase from Plasmodium falciparum, confirming SIFTER's prediction. The results illustrate the predictive power of exploiting a statistical model of function evolution in phylogenomic problems. A software implementation of SIFTER is available from the authors.

  4. Year 2 Report: Protein Function Prediction Platform

    SciTech Connect

    Zhou, C E

    2012-04-27

    Upon completion of our second year of development in a 3-year development cycle, we have completed a prototype protein structure-function annotation and function prediction system: Protein Function Prediction (PFP) platform (v.0.5). We have met our milestones for Years 1 and 2 and are positioned to continue development in completion of our original statement of work, or a reasonable modification thereof, in service to DTRA Programs involved in diagnostics and medical countermeasures research and development. The PFP platform is a multi-scale computational modeling system for protein structure-function annotation and function prediction. As of this writing, PFP is the only existing fully automated, high-throughput, multi-scale modeling, whole-proteome annotation platform, and represents a significant advance in the field of genome annotation (Fig. 1). PFP modules perform protein functional annotations at the sequence, systems biology, protein structure, and atomistic levels of biological complexity (Fig. 2). Because these approaches provide orthogonal means of characterizing proteins and suggesting protein function, PFP processing maximizes the protein functional information that can currently be gained by computational means. Comprehensive annotation of pathogen genomes is essential for bio-defense applications in pathogen characterization, threat assessment, and medical countermeasure design and development in that it can short-cut the time and effort required to select and characterize protein biomarkers.

  5. Hierarchical Ensemble Methods for Protein Function Prediction

    PubMed Central

    2014-01-01

    Protein function prediction is a complex multiclass multilabel classification problem, characterized by multiple issues such as the incompleteness of the available annotations, the integration of multiple sources of high dimensional biomolecular data, the unbalance of several functional classes, and the difficulty of univocally determining negative examples. Moreover, the hierarchical relationships between functional classes that characterize both the Gene Ontology and FunCat taxonomies motivate the development of hierarchy-aware prediction methods that showed significantly better performances than hierarchical-unaware “flat” prediction methods. In this paper, we provide a comprehensive review of hierarchical methods for protein function prediction based on ensembles of learning machines. According to this general approach, a separate learning machine is trained to learn a specific functional term and then the resulting predictions are assembled in a “consensus” ensemble decision, taking into account the hierarchical relationships between classes. The main hierarchical ensemble methods proposed in the literature are discussed in the context of existing computational methods for protein function prediction, highlighting their characteristics, advantages, and limitations. Open problems of this exciting research area of computational biology are finally considered, outlining novel perspectives for future research. PMID:25937954
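One simple way to assemble per-term predictions while respecting the class hierarchy is to cap each term's score by the scores of its ancestors, so that a specific term is never predicted more confidently than the general term above it. The sketch below is an illustrative assumption, not a method taken from the review: it treats the hierarchy as a tree with one parent per term (the Gene Ontology is a DAG), and the name `enforce_hierarchy` and the example scores are hypothetical.

```python
def enforce_hierarchy(flat_scores, parent):
    """Cap each term's score by the scores of all its ancestors.

    flat_scores: dict term -> score from an independent ("flat") classifier.
    parent:      dict term -> parent term, or None for a root.
    Assumes every parent named in `parent` also appears in `flat_scores`.
    """
    corrected = {}

    def resolve(term):
        if term in corrected:
            return corrected[term]
        score = flat_scores[term]
        p = parent.get(term)
        if p is not None:
            # A child can never score higher than its (corrected) parent.
            score = min(score, resolve(p))
        corrected[term] = score
        return score

    for term in flat_scores:
        resolve(term)
    return corrected


# Hypothetical flat scores for three nested terms.
flat = {"binding": 0.9, "ion binding": 0.4, "zinc ion binding": 0.7}
tree = {"binding": None, "ion binding": "binding", "zinc ion binding": "ion binding"}
consistent = enforce_hierarchy(flat, tree)
```

Here the score for "zinc ion binding" is reduced to 0.4 because its parent "ion binding" scored only 0.4, while the other two scores are unchanged.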

  6. A new protein structure representation for efficient protein function prediction.

    PubMed

    Maghawry, Huda A; Mostafa, Mostafa G M; Gharib, Tarek F

    2014-12-01

    One of the challenging problems in bioinformatics is the prediction of protein function. Protein function is the main key that can be used to classify different proteins. Protein function can be inferred experimentally with very small throughput or computationally with very high throughput. Computational methods are sequence based or structure based. Structure-based methods produce more accurate protein function prediction. In this article, we propose a new protein structure representation for efficient protein function prediction. The representation is based on three-dimensional patterns of protein residues. In the analysis, we used protein function based on enzyme activity through six mechanistically diverse enzyme superfamilies: amidohydrolase, crotonase, haloacid dehalogenase, isoprenoid synthase type I, and vicinal oxygen chelate. We applied three different classification methods, naïve Bayes, k-nearest neighbors, and random forest, to predict the enzyme superfamily of a given protein. The prediction accuracy using the proposed representation outperforms a recently introduced representation method that is based only on the distance patterns. The results show that the proposed representation achieved prediction accuracy up to 98%, with improvement of about 10% on average.

  7. Protein function prediction using domain families

    PubMed Central

    2013-01-01

    Here we assessed the use of domain families for predicting the functions of whole proteins. These 'functional families' (FunFams) were derived using a protocol that combines sequence clustering with supervised cluster evaluation, relying on available high-quality Gene Ontology (GO) annotation data in the latter step. In essence, the protocol groups domain sequences belonging to the same superfamily into families based on the GO annotations of their parent proteins. An initial test based on enzyme sequences confirmed that the FunFams resemble enzyme (domain) families much better than do families produced by sequence clustering alone. For the CAFA 2011 experiment, we further associated the FunFams with GO terms probabilistically. All target proteins were first submitted to domain superfamily assignment, followed by FunFam assignment and, eventually, function assignment. The latter included an integration step for multi-domain target proteins. The CAFA results put our domain-based approach among the top ten of 31 competing groups and 56 prediction methods, confirming that it outperforms simple pairwise whole-protein sequence comparisons. PMID:23514456

  8. Protein function prediction based on data fusion and functional interrelationship.

    PubMed

    Meng, Jun; Wekesa, Jael-Sanyanda; Shi, Guan-Li; Luan, Yu-Shi

    2016-04-01

    One of the challenging tasks of bioinformatics is to predict protein functions more accurately and confidently from genomics and proteomics datasets. Computational approaches use a variety of high-throughput experimental data, such as protein-protein interaction (PPI), protein sequences and phylogenetic profiles, to predict protein functions. This paper presents a method that uses a transductive multi-label learning algorithm, integrating multiple data sources for classification. Multiple proteomics datasets are integrated to make inferences about the functions of unknown proteins, and a directed bi-relational graph is used to assign labels to unannotated proteins. Our method, bi-relational graph based transductive multi-label function annotation (Bi-TMF), uses functional correlation and topological PPI network properties on both the training and testing datasets to predict protein functions through data fusion of the individual kernel results. The main purpose of our proposed method is to enhance the performance of classifier integration for protein function prediction algorithms. Experimental results demonstrate the effectiveness and efficiency of Bi-TMF on multi-source datasets in yeast, human and mouse benchmarks. Bi-TMF outperforms other recently proposed methods.

  9. Prediction of protein function from protein sequence and structure.

    PubMed

    Whisstock, James C; Lesk, Arthur M

    2003-08-01

    The sequence of a genome contains the plans of the possible life of an organism, but implementation of genetic information depends on the functions of the proteins and nucleic acids that it encodes. Many individual proteins of known sequence and structure present challenges to the understanding of their function. In particular, a number of genes responsible for diseases have been identified but their specific functions are unknown. Whole-genome sequencing projects are a major source of proteins of unknown function. Annotation of a genome involves assignment of functions to gene products, in most cases on the basis of amino-acid sequence alone. 3D structure can aid the assignment of function, motivating the challenge of structural genomics projects to make structural information available for novel uncharacterized proteins. Structure-based identification of homologues often succeeds where sequence-alone-based methods fail, because in many cases evolution retains the folding pattern long after sequence similarity becomes undetectable. Nevertheless, prediction of protein function from sequence and structure is a difficult problem, because homologous proteins often have different functions. Many methods of function prediction rely on identifying similarity in sequence and/or structure between a protein of unknown function and one or more well-understood proteins. Alternative methods include inferring conservation patterns in members of a functionally uncharacterized family for which many sequences and structures are known. However, these inferences are tenuous. Such methods provide reasonable guesses at function, but are far from foolproof. It is therefore fortunate that the development of whole-organism approaches and comparative genomics permits other approaches to function prediction when the data are available. These include the use of protein-protein interaction patterns, and correlations between occurrences of related proteins in different organisms, as

  10. CombFunc: predicting protein function using heterogeneous data sources.

    PubMed

    Wass, Mark N; Barton, Geraint; Sternberg, Michael J E

    2012-07-01

    Only a small fraction of known proteins have been functionally characterized, making protein function prediction essential to propose annotations for uncharacterized proteins. In recent years many function prediction methods have been developed using various sources of biological data from protein sequence and structure to gene expression data. Here we present the CombFunc web server, which makes Gene Ontology (GO)-based protein function predictions. CombFunc incorporates ConFunc, our existing function prediction method, with other approaches for function prediction that use protein sequence, gene expression and protein-protein interaction data. In benchmarking on a set of 1686 proteins CombFunc obtains precision and recall of 0.71 and 0.64 respectively for gene ontology molecular function terms. For biological process GO terms precision of 0.74 and recall of 0.41 is obtained. CombFunc is available at http://www.sbg.bio.ic.ac.uk/combfunc.
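To compare the two precision/recall pairs quoted above on a single scale, one can combine each pair into an F1 score (the harmonic mean of precision and recall). The abstract does not report F1; the values below are derived here purely as arithmetic on the reported figures, and the helper name `f1_score` is an assumption.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Figures reported in the abstract.
molecular_function = f1_score(0.71, 0.64)   # roughly 0.67
biological_process = f1_score(0.74, 0.41)   # roughly 0.53
```

The lower value for biological process terms reflects the lower recall (0.41) despite the slightly higher precision.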

  11. CombFunc: predicting protein function using heterogeneous data sources

    PubMed Central

    Wass, Mark N.; Barton, Geraint; Sternberg, Michael J. E.

    2012-01-01

    Only a small fraction of known proteins have been functionally characterized, making protein function prediction essential to propose annotations for uncharacterized proteins. In recent years many function prediction methods have been developed using various sources of biological data from protein sequence and structure to gene expression data. Here we present the CombFunc web server, which makes Gene Ontology (GO)-based protein function predictions. CombFunc incorporates ConFunc, our existing function prediction method, with other approaches for function prediction that use protein sequence, gene expression and protein–protein interaction data. In benchmarking on a set of 1686 proteins CombFunc obtains precision and recall of 0.71 and 0.64 respectively for gene ontology molecular function terms. For biological process GO terms precision of 0.74 and recall of 0.41 is obtained. CombFunc is available at http://www.sbg.bio.ic.ac.uk/combfunc. PMID:22641853

  12. Predicting protein function by frequent functional association pattern mining in protein interaction networks.

    PubMed

    Cho, Young-Rae; Zhang, Aidong

    2010-01-01

    Predicting protein function from protein interaction networks has been challenging because of the complexity of functional relationships among proteins. Most previous function prediction methods depend on the neighborhood of or the connected paths to known proteins. However, their accuracy has been limited due to the functional inconsistency of interacting proteins. In this paper, we propose a novel approach for function prediction by identifying frequent patterns of functional associations in a protein interaction network. A set of functions that a protein performs is assigned into the corresponding node as a label. A functional association pattern is then represented as a labeled subgraph. Our frequent labeled subgraph mining algorithm efficiently searches the functional association patterns that occur frequently in the network. It iteratively increases the size of frequent patterns by one node at a time by selective joining, and simplifies the network by a priori pruning. Using the yeast protein interaction network, our algorithm found more than 1400 frequent functional association patterns. The function prediction is performed by matching the subgraph, including the unknown protein, with the frequent patterns analogous to it. By leave-one-out cross validation, we show that our approach has better performance than previous link-based methods in terms of prediction accuracy. The frequent functional association patterns generated in this study might become the foundations of advanced analysis for functional behaviors of proteins in a system level.

  13. Text Mining Improves Prediction of Protein Functional Sites

    PubMed Central

    Cohn, Judith D.; Ravikumar, Komandur E.

    2012-01-01

    We present an approach that integrates protein structure analysis and text mining for protein functional site prediction, called LEAP-FS (Literature Enhanced Automated Prediction of Functional Sites). The structure analysis was carried out using Dynamics Perturbation Analysis (DPA), which predicts functional sites at control points where interactions greatly perturb protein vibrations. The text mining extracts mentions of residues in the literature, and predicts that residues mentioned are functionally important. We assessed the significance of each of these methods by analyzing their performance in finding known functional sites (specifically, small-molecule binding sites and catalytic sites) in about 100,000 publicly available protein structures. The DPA predictions recapitulated many of the functional site annotations and preferentially recovered binding sites annotated as biologically relevant vs. those annotated as potentially spurious. The text-based predictions were also substantially supported by the functional site annotations: compared to other residues, residues mentioned in text were roughly six times more likely to be found in a functional site. The overlap of predictions with annotations improved when the text-based and structure-based methods agreed. Our analysis also yielded new high-quality predictions of many functional site residues that were not catalogued in the curated data sources we inspected. We conclude that both DPA and text mining independently provide valuable high-throughput protein functional site predictions, and that integrating the two methods using LEAP-FS further improves the quality of these predictions. PMID:22393388

  14. A survey of computational intelligence techniques in protein function prediction.

    PubMed

    Tiwari, Arvind Kumar; Srivastava, Rajeev

    2014-01-01

    Advances in high-throughput microarray technologies have produced massive growth in data on proteins of unknown function. Protein function prediction is one of the most challenging problems in bioinformatics. In the past, homology-based approaches were used to predict protein function, but they fail when a new protein is dissimilar to previously characterized ones. To alleviate the problems associated with these traditional homology-based approaches, numerous computational intelligence techniques have been proposed in the recent past. This paper presents a comprehensive, state-of-the-art review of computational intelligence techniques for protein function prediction using sequence, structure, protein-protein interaction network, and gene expression data. These techniques are used in a wide range of applications, such as prediction of DNA and RNA binding sites, subcellular localization, enzyme functions, signal peptides, catalytic residues, nuclear/G-protein coupled receptors, membrane proteins, and pathway analysis from gene expression datasets. This paper also summarizes the results obtained by many researchers who applied computational intelligence techniques with appropriate datasets to improve prediction performance on these problems. The summary shows that ensemble classifiers and the integration of multiple heterogeneous data sources are useful for protein function prediction.

  15. A large-scale evaluation of computational protein function prediction.

    PubMed

    Radivojac, Predrag; Clark, Wyatt T; Oron, Tal Ronnen; Schnoes, Alexandra M; Wittkop, Tobias; Sokolov, Artem; Graim, Kiley; Funk, Christopher; Verspoor, Karin; Ben-Hur, Asa; Pandey, Gaurav; Yunes, Jeffrey M; Talwalkar, Ameet S; Repo, Susanna; Souza, Michael L; Piovesan, Damiano; Casadio, Rita; Wang, Zheng; Cheng, Jianlin; Fang, Hai; Gough, Julian; Koskinen, Patrik; Törönen, Petri; Nokso-Koivisto, Jussi; Holm, Liisa; Cozzetto, Domenico; Buchan, Daniel W A; Bryson, Kevin; Jones, David T; Limaye, Bhakti; Inamdar, Harshal; Datta, Avik; Manjari, Sunitha K; Joshi, Rajendra; Chitale, Meghana; Kihara, Daisuke; Lisewski, Andreas M; Erdin, Serkan; Venner, Eric; Lichtarge, Olivier; Rentzsch, Robert; Yang, Haixuan; Romero, Alfonso E; Bhat, Prajwal; Paccanaro, Alberto; Hamp, Tobias; Kaßner, Rebecca; Seemayer, Stefan; Vicedo, Esmeralda; Schaefer, Christian; Achten, Dominik; Auer, Florian; Boehm, Ariane; Braun, Tatjana; Hecht, Maximilian; Heron, Mark; Hönigschmid, Peter; Hopf, Thomas A; Kaufmann, Stefanie; Kiening, Michael; Krompass, Denis; Landerer, Cedric; Mahlich, Yannick; Roos, Manfred; Björne, Jari; Salakoski, Tapio; Wong, Andrew; Shatkay, Hagit; Gatzmann, Fanny; Sommer, Ingolf; Wass, Mark N; Sternberg, Michael J E; Škunca, Nives; Supek, Fran; Bošnjak, Matko; Panov, Panče; Džeroski, Sašo; Šmuc, Tomislav; Kourmpetis, Yiannis A I; van Dijk, Aalt D J; ter Braak, Cajo J F; Zhou, Yuanpeng; Gong, Qingtian; Dong, Xinran; Tian, Weidong; Falda, Marco; Fontana, Paolo; Lavezzo, Enrico; Di Camillo, Barbara; Toppo, Stefano; Lan, Liang; Djuric, Nemanja; Guo, Yuhong; Vucetic, Slobodan; Bairoch, Amos; Linial, Michal; Babbitt, Patricia C; Brenner, Steven E; Orengo, Christine; Rost, Burkhard; Mooney, Sean D; Friedberg, Iddo

    2013-03-01

    Automated annotation of protein function is challenging. As the number of sequenced genomes rapidly grows, the overwhelming majority of protein products can only be annotated computationally. If computational predictions are to be relied upon, it is crucial that the accuracy of these methods be high. Here we report the results from the first large-scale community-based critical assessment of protein function annotation (CAFA) experiment. Fifty-four methods representing the state of the art for protein function prediction were evaluated on a target set of 866 proteins from 11 organisms. Two findings stand out: (i) today's best protein function prediction algorithms substantially outperform widely used first-generation methods, with large gains on all types of targets; and (ii) although the top methods perform well enough to guide experiments, there is considerable need for improvement of currently available tools.

  16. A large-scale evaluation of computational protein function prediction

    PubMed Central

    Radivojac, Predrag; Clark, Wyatt T; Ronnen Oron, Tal; Schnoes, Alexandra M; Wittkop, Tobias; Sokolov, Artem; Graim, Kiley; Funk, Christopher; Verspoor, Karin; Ben-Hur, Asa; Pandey, Gaurav; Yunes, Jeffrey M; Talwalkar, Ameet S; Repo, Susanna; Souza, Michael L; Piovesan, Damiano; Casadio, Rita; Wang, Zheng; Cheng, Jianlin; Fang, Hai; Gough, Julian; Koskinen, Patrik; Törönen, Petri; Nokso-Koivisto, Jussi; Holm, Liisa; Cozzetto, Domenico; Buchan, Daniel W A; Bryson, Kevin; Jones, David T; Limaye, Bhakti; Inamdar, Harshal; Datta, Avik; Manjari, Sunitha K; Joshi, Rajendra; Chitale, Meghana; Kihara, Daisuke; Lisewski, Andreas M; Erdin, Serkan; Venner, Eric; Lichtarge, Olivier; Rentzsch, Robert; Yang, Haixuan; Romero, Alfonso E; Bhat, Prajwal; Paccanaro, Alberto; Hamp, Tobias; Kassner, Rebecca; Seemayer, Stefan; Vicedo, Esmeralda; Schaefer, Christian; Achten, Dominik; Auer, Florian; Böhm, Ariane; Braun, Tatjana; Hecht, Maximilian; Heron, Mark; Hönigschmid, Peter; Hopf, Thomas; Kaufmann, Stefanie; Kiening, Michael; Krompass, Denis; Landerer, Cedric; Mahlich, Yannick; Roos, Manfred; Björne, Jari; Salakoski, Tapio; Wong, Andrew; Shatkay, Hagit; Gatzmann, Fanny; Sommer, Ingolf; Wass, Mark N; Sternberg, Michael J E; Škunca, Nives; Supek, Fran; Bošnjak, Matko; Panov, Panče; Džeroski, Sašo; Šmuc, Tomislav; Kourmpetis, Yiannis A I; van Dijk, Aalt D J; ter Braak, Cajo J F; Zhou, Yuanpeng; Gong, Qingtian; Dong, Xinran; Tian, Weidong; Falda, Marco; Fontana, Paolo; Lavezzo, Enrico; Di Camillo, Barbara; Toppo, Stefano; Lan, Liang; Djuric, Nemanja; Guo, Yuhong; Vucetic, Slobodan; Bairoch, Amos; Linial, Michal; Babbitt, Patricia C; Brenner, Steven E; Orengo, Christine; Rost, Burkhard; Mooney, Sean D; Friedberg, Iddo

    2013-01-01

    Automated annotation of protein function is challenging. As the number of sequenced genomes rapidly grows, the overwhelming majority of protein products can only be annotated computationally. If computational predictions are to be relied upon, it is crucial that the accuracy of these methods be high. Here we report the results from the first large-scale community-based Critical Assessment of protein Function Annotation (CAFA) experiment. Fifty-four methods representing the state-of-the-art for protein function prediction were evaluated on a target set of 866 proteins from eleven organisms. Two findings stand out: (i) today’s best protein function prediction algorithms significantly outperformed widely-used first-generation methods, with large gains on all types of targets; and (ii) although the top methods perform well enough to guide experiments, there is significant need for improvement of currently available tools. PMID:23353650

  17. Protein Structure and Function Prediction Using I-TASSER.

    PubMed

    Yang, Jianyi; Zhang, Yang

    2015-12-17

    I-TASSER is a hierarchical protocol for automated protein structure prediction and structure-based function annotation. Starting from the amino acid sequence of target proteins, I-TASSER first generates full-length atomic structural models from multiple threading alignments and iterative structural assembly simulations followed by atomic-level structure refinement. The biological functions of the protein, including ligand-binding sites, enzyme commission number, and gene ontology terms, are then inferred from known protein function databases based on sequence and structure profile comparisons. I-TASSER is freely available as both an on-line server and a stand-alone package. This unit describes how to use the I-TASSER protocol to generate structure and function prediction and how to interpret the prediction results, as well as alternative approaches for further improving the I-TASSER modeling quality for distant-homologous and multi-domain protein targets.

  18. Protein function prediction using neighbor relativity in protein-protein interaction network.

    PubMed

    Moosavi, Sobhan; Rahgozar, Masoud; Rahimi, Amir

    2013-04-01

    There is a large gap between the number of discovered proteins and the number of functionally annotated ones. Due to the high cost of determining protein function by wet-lab research, function prediction has become a major task for computational biology and bioinformatics. Some studies use protein interaction information to predict functions for un-annotated proteins. In this paper, we propose a novel approach called "Neighbor Relativity Coefficient" (NRC), based on interaction network topology, which estimates the functional similarity between two proteins. NRC is calculated for each pair of proteins from their graph-based features, including distance, common neighbors and the number of paths between them. To ascribe function to an un-annotated protein, NRC estimates a weight for each neighbor to transfer its annotation to the unknown protein. Finally, the unknown protein is annotated with the top-scoring transferred functions. We also investigate the effect of using different coefficients for various types of functions. The proposed method has been evaluated on Saccharomyces cerevisiae and Homo sapiens interaction networks. The performance analysis demonstrates that NRC yields better results than previous protein function prediction approaches that use interaction networks.
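The general scheme described above, weighting each neighbor and transferring its annotations to the unknown protein, can be sketched as follows. The weight used here (one plus the number of common neighbors) is only a stand-in chosen for illustration; it is not the NRC formula, which the abstract does not give. The function name `transfer_annotations` and the toy network are likewise hypothetical.

```python
def transfer_annotations(graph, annotations, target, top_n=2):
    """Score candidate functions for `target` from its weighted neighbors.

    graph:       dict protein -> set of interacting proteins.
    annotations: dict protein -> set of known functions.
    Returns the top_n (function, score) pairs, highest score first.
    """
    scores = {}
    for nb in graph[target]:
        # Stand-in weight: direct interaction plus shared neighbors.
        common = len(graph[target] & graph[nb])
        weight = 1 + common
        for func in annotations.get(nb, ()):
            scores[func] = scores.get(func, 0) + weight
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_n]


# Toy network: X is un-annotated; A and B also interact with each other.
graph = {"X": {"A", "B", "C"}, "A": {"X", "B"}, "B": {"X", "A"}, "C": {"X"}}
annotations = {"A": {"DNA repair"}, "B": {"DNA repair"}, "C": {"transport"}}
predicted = transfer_annotations(graph, annotations, "X")
```

In this toy network, "DNA repair" outranks "transport" for X because it comes from two neighbors that each share a further neighbor with X.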

  19. Structure-based Methods for Computational Protein Functional Site Prediction

    PubMed Central

    Dukka, B KC

    2013-01-01

    Due to the advent of high-throughput sequencing techniques and structural genomics projects, the number of gene and protein sequences has been ever increasing. Computational methods to annotate these genes and proteins are even more indispensable. Proteins are important macromolecules, and the study of protein function is an important problem in structural bioinformatics. This paper discusses a number of methods to predict protein functional sites, focusing especially on protein ligand-binding site prediction. Initially, a short overview is presented of recent advances in methods for the selection of homologous sequences. Furthermore, a few recent structure-based and sequence-and-structure-based approaches to protein functional site prediction are discussed in detail. PMID:24688745

  20. BLANNOTATOR: enhanced homology-based function prediction of bacterial proteins

    PubMed Central

    2012-01-01

    Background: Automated function prediction has played a central role in determining the biological functions of bacterial proteins. Typically, protein function annotation relies on homology, and function is inferred from other proteins with similar sequences. This approach has become popular in bacterial genomics because it is one of the few methods that is practical for large datasets and because it does not require additional functional genomics experiments. However, the existing solutions produce erroneous predictions in many cases, especially when query sequences have low levels of identity with the annotated source protein. This problem has created a pressing need for improvements in homology-based annotation. Results: We present an automated method for the functional annotation of bacterial protein sequences. Based on sequence similarity searches, BLANNOTATOR accurately annotates query sequences with one-line summary descriptions of protein function. It groups sequences identified by BLAST into subsets according to their annotation and bases its prediction on a set of sequences with consistent functional information. We show the results of BLANNOTATOR's performance in sets of bacterial proteins with known functions. We simulated the annotation process for 3090 SWISS-PROT proteins using a database in its state preceding the functional characterisation of the query protein. For this dataset, our method outperformed the five others that we tested, and the improved performance was maintained even in the absence of highly related sequence hits. We further demonstrate the value of our tool by analysing the putative proteome of Lactobacillus crispatus strain ST1. Conclusions: BLANNOTATOR is an accurate method for bacterial protein function prediction. It is practical for genome-scale data and does not require pre-existing sequence clustering; thus, this method suits the needs of bacterial genome and metagenome researchers. The method and a web-server are available at

  1. Roles for text mining in protein function prediction.

    PubMed

    Verspoor, Karin M

    2014-01-01

    The Human Genome Project has provided science with a hugely valuable resource: the blueprints for life; the specification of all of the genes that make up a human. While the genes have all been identified and deciphered, it is proteins that are the workhorses of the human body: they are essential to virtually all cell functions and are the primary mechanism through which biological function is carried out. Hence in order to fully understand what happens at a molecular level in biological organisms, and eventually to enable development of treatments for diseases where some aspect of a biological system goes awry, we must understand the functions of proteins. However, experimental characterization of protein function cannot scale to the vast amount of DNA sequence data now available. Computational protein function prediction has therefore emerged as a problem at the forefront of modern biology (Radivojac et al., Nat Methods 10(13):221-227, 2013). Within the varied approaches to computational protein function prediction that have been explored, there are several that make use of biomedical literature mining. These methods take advantage of information in the published literature to associate specific proteins with specific protein functions. In this chapter, we introduce two main strategies for doing this: association of function terms, represented as Gene Ontology terms (Ashburner et al., Nat Genet 25(1):25-29, 2000), to proteins based on information in published articles, and a paradigm called LEAP-FS (Literature-Enhanced Automated Prediction of Functional Sites) in which literature mining is used to validate the predictions of an orthogonal computational protein function prediction method.

  2. Protein function prediction using guilty by association from interaction networks.

    PubMed

    Piovesan, Damiano; Giollo, Manuel; Ferrari, Carlo; Tosatto, Silvio C E

    2015-12-01

    Protein function prediction from sequence using the Gene Ontology (GO) classification is useful in many biological problems. It has recently attracted increasing interest, thanks in part to the Critical Assessment of Function Annotation (CAFA) challenge. In this paper, we introduce Guilty by Association on STRING (GAS), a tool to predict protein function exploiting protein-protein interaction networks without sequence similarity. The assumption is that whenever a protein interacts with other proteins, it is part of the same biological process and located in the same cellular compartment. GAS retrieves interaction partners of a query protein from the STRING database and measures enrichment of the associated functional annotations to generate a sorted list of putative functions. A performance evaluation based on CAFA metrics and a fair comparison with optimized BLAST similarity searches is provided. The consensus of GAS and BLAST is shown to improve overall performance. The PPI approach is shown to outperform similarity searches for biological process and cellular compartment GO predictions. Moreover, an analysis of the best practices to exploit protein-protein interaction networks is also provided.
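
    The enrichment step described above can be sketched as follows. This is a minimal illustration, not the GAS implementation: the abstract does not name the enrichment statistic, so the one-sided hypergeometric test used here is an assumption, and the function names and the in-memory `annotations` mapping are hypothetical stand-ins for data that GAS retrieves from STRING.

```python
from math import comb


def hypergeom_tail(k, n, K, N):
    """P(X >= k) when n proteins are drawn from N, of which K carry the term."""
    upper = min(n, K)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)


def rank_terms(partners, annotations):
    """Rank GO terms by enrichment among the interaction partners of a query.

    partners: iterable of protein ids (hypothetical input format)
    annotations: dict mapping protein id -> set of GO terms (the background)
    """
    # Only partners that belong to the annotated background are counted.
    partners = [p for p in partners if p in annotations]
    N, n = len(annotations), len(partners)

    # How many background proteins carry each term.
    background = {}
    for terms in annotations.values():
        for term in terms:
            background[term] = background.get(term, 0) + 1

    scored = []
    for term, K in background.items():
        k = sum(1 for p in partners if term in annotations[p])
        if k:  # terms absent from the partners cannot be enriched
            scored.append((hypergeom_tail(k, n, K, N), term))
    # Smallest p-value first; ties are broken alphabetically by term.
    return [term for _, term in sorted(scored)]
```

    Sorting by p-value yields the kind of ordered list of putative functions that the abstract mentions; a production version would also need multiple-testing correction, which is omitted here.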

  3. ESG: extended similarity group method for automated protein function prediction

    PubMed Central

    Chitale, Meghana; Hawkins, Troy; Park, Changsoon; Kihara, Daisuke

    2009-01-01

    Motivation: The importance of accurate automatic protein function prediction is ever increasing in the face of a large number of newly sequenced genomes and proteomics data that are awaiting biological interpretation. Conventional methods have focused on high sequence similarity-based annotation transfer, which relies on the concept of homology. However, many cases have been reported in which simple transfer of function from the top hits of a homology search causes erroneous annotation. New methods are required that handle sequence similarity more robustly, combining signals from strongly and weakly similar proteins to predict function for unknown proteins with high reliability. Results: We present the extended similarity group (ESG) method, which performs iterative sequence database searches and annotates a query sequence with Gene Ontology terms. Each annotation is assigned a probability based on its relative similarity score with the multiple-level neighbors in the protein similarity graph. We show how the statistical framework of ESG improves the prediction accuracy by iteratively taking into account the neighborhood of the query protein in the sequence similarity space. ESG outperforms conventional PSI-BLAST and the protein function prediction (PFP) algorithm. The iterative search is effective in capturing multiple domains in a query protein, enabling accurate prediction of several functions that originate from different domains. Availability: The ESG web server is available for automated protein function prediction at http://dragon.bio.purdue.edu/ESG/ Contact: cspark@cau.ac.kr; dkihara@purdue.edu Supplementary information: Supplementary data are available at Bioinformatics online. PMID:19435743

  4. A review of protein function prediction under machine learning perspective.

    PubMed

    Bernardes, Juliana S; Pedreira, Carlos E

    2013-08-01

    Protein function prediction is one of the most challenging problems in the post-genomic era. The number of newly identified proteins has been increasing exponentially with the advances in high-throughput techniques. However, the functional characterization of these new proteins has not kept pace. To fill this gap, a large number of computational methods have been proposed in the literature. Early approaches explored homology relationships to associate known functions with the newly discovered proteins. Nevertheless, these approaches tend to fail when a new protein is considerably different (divergent) from previously known ones. Accordingly, more accurate approaches that use expressive data representations and exploit sophisticated computational techniques are required. Regarding these points, this review provides a comprehensible description of machine learning approaches that are currently applied to protein function prediction problems. We start by defining several problems involved in understanding aspects of protein function, and describing how machine learning can be applied to these problems. We aim to expose, in a systematic framework, the role of these techniques in protein function inference, which is sometimes difficult to follow because of the rapid evolution of the field. With this purpose in mind, we highlight the most representative contributions and the recent advancements, and provide an insightful categorization and classification of machine learning methods in functional proteomics.

  5. Predicting Protein Function via Semantic Integration of Multiple Networks.

    PubMed

    Yu, Guoxian; Fu, Guangyuan; Wang, Jun; Zhu, Hailong

    2016-01-01

    Determining the biological functions of proteins is one of the key challenges in the post-genomic era. The rapidly accumulating volumes of proteomic and genomic data drive the development of computational models for automatically predicting protein function at large scale. Recent approaches focus on integrating multiple heterogeneous data sources, and they often get better results than methods that use a single data source alone. In this paper, we investigate how to integrate multiple biological data sources with biological knowledge, i.e., the Gene Ontology (GO), for protein function prediction. We propose a method, called SimNet, to Semantically integrate multiple functional association Networks derived from heterogeneous data sources. SimNet first utilizes GO annotations of proteins to capture the semantic similarity between proteins and introduces a semantic kernel based on the similarity. Next, SimNet constructs a composite network, obtained as a weighted summation of individual networks, and aligns the network with the kernel to get the weights assigned to individual networks. Then, it applies a network-based classifier on the composite network to predict protein function. Experimental results on heterogeneous proteomic data sources of Yeast, Human, Mouse, and Fly show that SimNet not only achieves better (or comparable) results than other related competitive approaches, but also takes much less time. The Matlab code of SimNet is available at https://sites.google.com/site/guoxian85/simnet.
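
    The idea of weighting networks by how well they agree with a semantic kernel can be illustrated with a small sketch. It is not the SimNet algorithm: the abstract does not give the optimisation used to obtain the weights, so the heuristic below (each network is weighted in proportion to its own alignment with the kernel) is an assumption made purely for illustration, and all function names are hypothetical.

```python
from math import sqrt


def frobenius(a, b):
    """Frobenius inner product of two equally sized square matrices."""
    return sum(x * y for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))


def alignment(network, kernel):
    """Cosine-style alignment between a network and the kernel.

    Both matrices are assumed to contain at least one non-zero entry.
    """
    norm = sqrt(frobenius(network, network) * frobenius(kernel, kernel))
    return frobenius(network, kernel) / norm


def composite_network(networks, kernel):
    """Return (weights, weighted sum of the networks).

    Assumed heuristic: weight = alignment clipped at zero, normalised to sum
    to one. At least one network must align positively with the kernel.
    """
    scores = [max(alignment(net, kernel), 0.0) for net in networks]
    total = sum(scores)
    weights = [score / total for score in scores]
    size = len(kernel)
    combined = [
        [sum(w * net[i][j] for w, net in zip(weights, networks)) for j in range(size)]
        for i in range(size)
    ]
    return weights, combined
```

    A network whose edges coincide with the semantic kernel receives a large weight, while a network that links only semantically unrelated proteins receives a weight near zero and contributes little to the composite.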

  6. SitesIdentify: a protein functional site prediction tool

    PubMed Central

    2009-01-01

    Background The rate at which protein structures are deposited in the Protein Data Bank surpasses the capacity to experimentally characterise them, and therefore computational methods to analyse these structures have become increasingly important. Identifying the region of the protein most likely to be involved in function is useful in order to gain information about its potential role. There are many available approaches to predict functional sites, but many are not made available via a publicly-accessible application. Results Here we present a functional site prediction tool (SitesIdentify), based on combining sequence conservation information with geometry-based cleft identification, that is freely available via a web-server. We have shown that SitesIdentify compares favourably to other functional site prediction tools in a comparison of seven methods on a non-redundant set of 237 enzymes with annotated active sites. Conclusion SitesIdentify is able to produce comparable accuracy in predicting functional sites to its closest available counterpart, but in addition achieves improved accuracy for proteins with few characterised homologues. SitesIdentify is available via a webserver at http://www.manchester.ac.uk/bioinformatics/sitesidentify/ PMID:19922660

  7. A Statistical Model of Protein Sequence Similarity and Function Similarity Reveals Overly-Specific Function Predictions

    PubMed Central

    Kolker, Eugene

    2009-01-01

    Background Predicting protein function from primary sequence is an important open problem in modern biology. Not only are there many thousands of proteins of unknown function, current approaches for predicting function must be improved upon. One problem in particular is overly-specific function predictions which we address here with a new statistical model of the relationship between protein sequence similarity and protein function similarity. Methodology Our statistical model is based on sets of proteins with experimentally validated functions and numeric measures of function specificity and function similarity derived from the Gene Ontology. The model predicts the similarity of function between two proteins given their amino acid sequence similarity measured by statistics from the BLAST sequence alignment algorithm. A novel aspect of our model is that it predicts the degree of function similarity shared between two proteins over a continuous range of sequence similarity, facilitating prediction of function with an appropriate level of specificity. Significance Our model shows nearly exact function similarity for proteins with high sequence similarity (bit score >244.7, e-value >1e−62, non-redundant NCBI protein database (NRDB)) and only small likelihood of specific function match for proteins with low sequence similarity (bit score <54.6, e-value <1e−05, NRDB). For sequence similarity ranges in between our annotation model shows an increasing relationship between function similarity and sequence similarity, but with considerable variability. We applied the model to a large set of proteins of unknown function, and predicted functions for thousands of these proteins ranging from general to very specific. We also applied the model to a data set of proteins with previously assigned, specific functions that were electronically based. We show that, on average, these prior function predictions are more specific (quite possibly overly-specific) compared to

  8. ProSNet: Integrating Homology with Molecular Networks for Protein Function Prediction

    PubMed Central

    Wang, Sheng; Qu, Meng

    2016-01-01

    Automated annotation of protein function has become a critical task in the post-genomic era. Network-based approaches and homology-based approaches have been widely used and recently tested in large-scale community-wide assessment experiments. It is natural to integrate network data with homology information to further improve the predictive performance. However, integrating these two heterogeneous, high-dimensional and noisy datasets is non-trivial. In this work, we introduce a novel protein function prediction algorithm ProSNet. An integrated heterogeneous network is first built to include molecular networks of multiple species and link together homologous proteins across multiple species. Based on this integrated network, a dimensionality reduction algorithm is introduced to obtain compact low-dimensional vectors to encode proteins in the network. Finally, we develop machine learning classification algorithms that take the vectors as input and make predictions by transferring annotations both within each species and across different species. Extensive experiments on five major species demonstrate that our integration of homology with molecular networks substantially improves the predictive performance over existing approaches. PMID:27896959

  9. Protein function prediction using local 3D templates.

    PubMed

    Laskowski, Roman A; Watson, James D; Thornton, Janet M

    2005-08-19

    The prediction of a protein's function from its 3D structure is becoming more and more important as the worldwide structural genomics initiatives gather pace and continue to solve 3D structures, many of which are of proteins of unknown function. Here, we present a methodology for predicting function from structure that shows great promise. It is based on 3D templates that are defined as specific 3D conformations of small numbers of residues. We use four types of template, covering enzyme active sites, ligand-binding residues, DNA-binding residues and reverse templates. The latter are templates generated from the target structure itself and scanned against a representative subset of all known protein structures. Together, the templates provide a fairly thorough coverage of the known structures and ensure that if there is a match to a known structure it is unlikely to be missed. A new scoring scheme provides a highly sensitive means of discriminating between true positive and false positive template matches. In all, the methodology provides a powerful new tool for function prediction to complement those already in use.

  10. Quantitative assessment of protein function prediction from metagenomics shotgun sequences.

    PubMed

    Harrington, E D; Singh, A H; Doerks, T; Letunic, I; von Mering, C; Jensen, L J; Raes, J; Bork, P

    2007-08-28

    To assess the potential of protein function prediction in environmental genomics data, we analyzed shotgun sequences from four diverse and complex habitats. Using homology searches as well as customized gene neighborhood methods that incorporate intergenic and evolutionary distances, we inferred specific functions for 76% of the 1.4 million predicted ORFs in these samples (83% when nonspecific functions are considered). Surprisingly, these fractions are only slightly smaller than the corresponding ones in completely sequenced genomes (83% and 86%, respectively, by using the same methodology) and considerably higher than previously thought. For as many as 75,448 ORFs (5% of the total), only neighborhood methods can assign functions, illustrated here by a previously undescribed gene associated with the well characterized heme biosynthesis operon and a potential transcription factor that might regulate a coupling between fatty acid biosynthesis and degradation. Our results further suggest that, although functions can be inferred for most proteins on earth, many functions remain to be discovered in numerous small, rare protein families.

  11. STRING: a database of predicted functional associations between proteins.

    PubMed

    von Mering, Christian; Huynen, Martijn; Jaeggi, Daniel; Schmidt, Steffen; Bork, Peer; Snel, Berend

    2003-01-01

    Functional links between proteins can often be inferred from genomic associations between the genes that encode them: groups of genes that are required for the same function tend to show similar species coverage, are often located in close proximity on the genome (in prokaryotes), and tend to be involved in gene-fusion events. The database STRING is a precomputed global resource for the exploration and analysis of these associations. Since the three types of evidence differ conceptually, and the number of predicted interactions is very large, it is essential to be able to assess and compare the significance of individual predictions. Thus, STRING contains a unique scoring-framework based on benchmarks of the different types of associations against a common reference set, integrated in a single confidence score per prediction. The graphical representation of the network of inferred, weighted protein interactions provides a high-level view of functional linkage, facilitating the analysis of modularity in biological processes. STRING is updated continuously, and currently contains 261 033 orthologs in 89 fully sequenced genomes. The database predicts functional interactions at an expected level of accuracy of at least 80% for more than half of the genes; it is online at http://www.bork.embl-heidelberg.de/STRING/.

  12. Cloud prediction of protein structure and function with PredictProtein for Debian.

    PubMed

    Kaján, László; Yachdav, Guy; Vicedo, Esmeralda; Steinegger, Martin; Mirdita, Milot; Angermüller, Christof; Böhm, Ariane; Domke, Simon; Ertl, Julia; Mertes, Christian; Reisinger, Eva; Staniewski, Cedric; Rost, Burkhard

    2013-01-01

    We report the release of PredictProtein for the Debian operating system and derivatives, such as Ubuntu, Bio-Linux, and Cloud BioLinux. The PredictProtein suite is available as a standard set of open source Debian packages. The release covers the most popular prediction methods from the Rost Lab, including methods for the prediction of secondary structure and solvent accessibility (profphd), nuclear localization signals (predictnls), and intrinsically disordered regions (norsnet). We also present two case studies that successfully utilize PredictProtein packages for high performance computing in the cloud: the first analyzes protein disorder for whole organisms, and the second analyzes the effect of all possible single sequence variants in protein coding regions of the human genome.

  13. Cloud Prediction of Protein Structure and Function with PredictProtein for Debian

    PubMed Central

    Kaján, László; Yachdav, Guy; Vicedo, Esmeralda; Steinegger, Martin; Mirdita, Milot; Angermüller, Christof; Böhm, Ariane; Domke, Simon; Ertl, Julia; Mertes, Christian; Reisinger, Eva; Rost, Burkhard

    2013-01-01

    We report the release of PredictProtein for the Debian operating system and derivatives, such as Ubuntu, Bio-Linux, and Cloud BioLinux. The PredictProtein suite is available as a standard set of open source Debian packages. The release covers the most popular prediction methods from the Rost Lab, including methods for the prediction of secondary structure and solvent accessibility (profphd), nuclear localization signals (predictnls), and intrinsically disordered regions (norsnet). We also present two case studies that successfully utilize PredictProtein packages for high performance computing in the cloud: the first analyzes protein disorder for whole organisms, and the second analyzes the effect of all possible single sequence variants in protein coding regions of the human genome. PMID:23971032

  14. Prediction of functional residues in water channels and related proteins.

    PubMed Central

    Froger, A.; Tallur, B.; Thomas, D.; Delamarche, C.

    1998-01-01

    In this paper, we present an updated classification of the ubiquitous MIP (Major Intrinsic Protein) family proteins, including 153 fully or partially sequenced members available in public databases. Presently, about 30 of these proteins have been functionally characterized, exhibiting essentially two distinct types of channel properties: (1) specific water transport by the aquaporins, and (2) transport of small neutral solutes, such as glycerol, by the glycerol facilitators. Sequence alignments were used to predict amino acids and motifs that discriminate channel specificity. The protein sequences were also analyzed using statistical tools (comparisons of means and correspondence analysis). Five key positions were clearly identified where the residues are specific for each functional subgroup and exhibit highly dissimilar physico-chemical properties. Moreover, we have found that the putative channels for small neutral solutes clearly differ from the aquaporins in the amino acid content and the length of predicted loop regions, suggesting a substrate filter function for these loops. From these results, we propose a signature pattern for water transport. PMID:9655351

  15. Graphlet kernels for prediction of functional residues in protein structures.

    PubMed

    Vacic, Vladimir; Iakoucheva, Lilia M; Lonardi, Stefano; Radivojac, Predrag

    2010-01-01

    We introduce a novel graph-based kernel method for annotating functional residues in protein structures. A structure is first modeled as a protein contact graph, where nodes correspond to residues and edges connect spatially neighboring residues. Each vertex in the graph is then represented as a vector of counts of labeled non-isomorphic subgraphs (graphlets), centered on the vertex of interest. A similarity measure between two vertices is expressed as the inner product of their respective count vectors and is used in a supervised learning framework to classify protein residues. We evaluated our method on two function prediction problems: identification of catalytic residues in proteins, which is a well-studied problem suitable for benchmarking, and a much less explored problem of predicting phosphorylation sites in protein structures. The performance of the graphlet kernel approach was then compared against two alternative methods, a sequence-based predictor and our implementation of the FEATURE framework. On both tasks, the graphlet kernel performed favorably; however, the margin of difference was considerably higher on the problem of phosphorylation site prediction. While there is data that phosphorylation sites are preferentially positioned in intrinsically disordered regions, we provide evidence that for the sites that are located in structured regions, neither the surface accessibility alone nor the averaged measures calculated from the residue microenvironments utilized by FEATURE were sufficient to achieve high accuracy. The key benefit of the graphlet representation is its ability to capture neighborhood similarities in protein structures via enumerating the patterns of local connectivity in the corresponding labeled graphs.
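
    A reduced version of the count-vector kernel described above can be sketched as follows. It is a simplification under stated assumptions, not the authors' method: only graphlets of up to three nodes in which the root vertex touches every other node are counted, whereas the published kernel enumerates larger graphlets. The function names, the adjacency-set graph format and the residue labels are hypothetical.

```python
from collections import Counter


def graphlet_counts(graph, labels, vertex):
    """Count small labelled graphlets rooted at `vertex`.

    graph: dict mapping node -> set of neighbouring nodes (contact graph)
    labels: dict mapping node -> residue label
    """
    counts = Counter()
    root = labels[vertex]
    counts[(root,)] += 1                      # the single-node graphlet
    neighbours = sorted(graph[vertex])
    for i, u in enumerate(neighbours):
        counts[(root, labels[u])] += 1        # two-node graphlet (one contact)
        for v in neighbours[i + 1:]:
            # Three-node graphlet: a triangle if u and v are also in contact,
            # otherwise a path with the root in the middle.
            shape = "triangle" if v in graph[u] else "path"
            pair = tuple(sorted((labels[u], labels[v])))
            counts[(root, shape) + pair] += 1
    return counts


def graphlet_kernel(counts_a, counts_b):
    """Inner product of two graphlet count vectors."""
    shared = counts_a.keys() & counts_b.keys()
    return sum(counts_a[g] * counts_b[g] for g in shared)
```

    The resulting kernel value could be passed to any kernel-based classifier; two residues score highly when their labelled neighbourhoods contain many of the same local patterns.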

  16. High Precision Prediction of Functional Sites in Protein Structures

    PubMed Central

    Buturovic, Ljubomir; Wong, Mike; Tang, Grace W.; Altman, Russ B.; Petkovic, Dragutin

    2014-01-01

    We address the problem of assigning biological function to solved protein structures. Computational tools play a critical role in identifying potential active sites and informing screening decisions for further lab analysis. A critical parameter in the practical application of computational methods is the precision, or positive predictive value. Precision measures the level of confidence the user should have in a particular computed functional assignment. Low precision annotations lead to futile laboratory investigations and waste scarce research resources. In this paper we describe an advanced version of the protein function annotation system FEATURE, which achieved 99% precision and average recall of 95% across 20 representative functional sites. The system uses a Support Vector Machine classifier operating on the microenvironment of physicochemical features around an amino acid. We also compared performance of our method with state-of-the-art sequence-level annotator Pfam in terms of precision, recall and localization. To our knowledge, no other functional site annotator has been rigorously evaluated against these key criteria. The software and predictive models are incorporated into the WebFEATURE service at http://feature.stanford.edu/wf4.0-beta. PMID:24632601
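
    Precision (positive predictive value) and recall, the two criteria emphasised above, can be computed for a single functional site type as in the following sketch. The function name and the use of residue indices as identifiers are assumptions for illustration; this is not code from FEATURE.

```python
def precision_recall(predicted, actual):
    """Return (precision, recall) for two sets of site identifiers.

    predicted: set of residues the classifier labels as functional sites
    actual: set of residues annotated as true functional sites
    """
    true_positives = len(predicted & actual)
    # Precision: fraction of predictions that are correct.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: fraction of true sites that were recovered.
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall
```

    For example, four predicted sites of which three are among six annotated sites give a precision of 0.75 and a recall of 0.5; high precision means that few laboratory follow-ups would be spent on false positives.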

  17. Biochemical functional predictions for protein structures of unknown or uncertain function.

    PubMed

    Mills, Caitlyn L; Beuning, Penny J; Ondrechen, Mary Jo

    2015-01-01

    With the exponential growth in the determination of protein sequences and structures via genome sequencing and structural genomics efforts, there is a growing need for reliable computational methods to determine the biochemical function of these proteins. This paper reviews the efforts to address the challenge of annotating the function at the molecular level of uncharacterized proteins. While sequence- and three-dimensional-structure-based methods for protein function prediction have been reviewed previously, the recent trends in local structure-based methods have received less attention. These local structure-based methods are the primary focus of this review. Computational methods have been developed to predict the residues important for catalysis and the local spatial arrangements of these residues can be used to identify protein function. In addition, the combination of different types of methods can help obtain more information and better predictions of function for proteins of unknown function. Global initiatives, including the Enzyme Function Initiative (EFI), COMputational BRidges to EXperiments (COMBREX), and the Critical Assessment of Function Annotation (CAFA), are evaluating and testing the different approaches to predicting the function of proteins of unknown function. These initiatives and global collaborations will increase the capability and reliability of methods to predict biochemical function computationally and will add substantial value to the current volume of structural genomics data by reducing the number of absent or inaccurate functional annotations.

  18. Biochemical functional predictions for protein structures of unknown or uncertain function

    PubMed Central

    Mills, Caitlyn L.; Beuning, Penny J.; Ondrechen, Mary Jo

    2015-01-01

    With the exponential growth in the determination of protein sequences and structures via genome sequencing and structural genomics efforts, there is a growing need for reliable computational methods to determine the biochemical function of these proteins. This paper reviews the efforts to address the challenge of annotating the function at the molecular level of uncharacterized proteins. While sequence- and three-dimensional-structure-based methods for protein function prediction have been reviewed previously, the recent trends in local structure-based methods have received less attention. These local structure-based methods are the primary focus of this review. Computational methods have been developed to predict the residues important for catalysis and the local spatial arrangements of these residues can be used to identify protein function. In addition, the combination of different types of methods can help obtain more information and better predictions of function for proteins of unknown function. Global initiatives, including the Enzyme Function Initiative (EFI), COMputational BRidges to EXperiments (COMBREX), and the Critical Assessment of Function Annotation (CAFA), are evaluating and testing the different approaches to predicting the function of proteins of unknown function. These initiatives and global collaborations will increase the capability and reliability of methods to predict biochemical function computationally and will add substantial value to the current volume of structural genomics data by reducing the number of absent or inaccurate functional annotations. PMID:25848497

  19. Integrated protein function prediction by mining function associations, sequences, and protein–protein and gene–gene interaction networks

    PubMed Central

    Cao, Renzhi; Cheng, Jianlin

    2016-01-01

    Motivations Protein function prediction is an important and challenging problem in bioinformatics and computational biology. Functionally relevant biological information such as protein sequences, gene expression, and protein–protein interactions has been used mostly separately for protein function prediction. One of the major challenges is how to effectively integrate multiple sources of both traditional and new information such as spatial gene–gene interaction networks generated from chromosomal conformation data together to improve protein function prediction. Results In this work, we developed three different probabilistic scores (MIS, SEQ, and NET score) to combine protein sequence, function associations, and protein–protein interaction and spatial gene–gene interaction networks for protein function prediction. The MIS score is mainly generated from homologous proteins found by PSI-BLAST search, and also association rules between Gene Ontology terms, which are learned by mining the Swiss-Prot database. The SEQ score is generated from protein sequences. The NET score is generated from protein–protein interaction and spatial gene–gene interaction networks. These three scores were combined in a new Statistical Multiple Integrative Scoring System (SMISS) to predict protein function. We tested SMISS on the data set of 2011 Critical Assessment of Function Annotation (CAFA). The method performed substantially better than three base-line methods and an advanced method based on protein profile–sequence comparison, profile–profile comparison, and domain co-occurrence networks according to the maximum F-measure. PMID:26370280
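
    The maximum F-measure used for the evaluation can be sketched as below. The abstract does not spell out the formula, so this follows the protein-centric definition commonly used in CAFA as an assumption: precision is averaged over proteins with at least one prediction at the threshold, and recall over all benchmark proteins. The function name and input format are hypothetical.

```python
def f_max(predictions, truth, thresholds=None):
    """Maximum F-measure over score thresholds.

    predictions: dict protein -> {term: score in [0, 1]}
    truth: dict protein -> non-empty set of true terms
    """
    if thresholds is None:
        thresholds = [i / 100 for i in range(1, 101)]
    best = 0.0
    for t in thresholds:
        precisions = []
        recall_sum = 0.0
        for protein, true_terms in truth.items():
            scores = predictions.get(protein, {})
            called = {term for term, s in scores.items() if s >= t}
            hits = len(called & true_terms)
            if called:  # precision is defined only where something is predicted
                precisions.append(hits / len(called))
            recall_sum += hits / len(true_terms)
        if not precisions:
            continue
        pr = sum(precisions) / len(precisions)
        rc = recall_sum / len(truth)
        if pr + rc > 0:
            best = max(best, 2 * pr * rc / (pr + rc))
    return best
```

    Because the threshold is chosen per method, this measure compares methods at their individually best operating points rather than at a fixed cut-off.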

  20. Effect of the quality of the interaction data on predicting protein function from protein-protein interactions.

    PubMed

    Ni, Qing-Shan; Wang, Zheng-Zhi; Li, Gang-Guo; Wang, Guang-Yun; Zhao, Ying-Jie

    2009-03-01

    Protein function prediction is an important issue in the post-genomic era. When protein function is deduced from protein interaction data, traditional methods treat every interaction sample equally and seldom take the quality of the interaction samples into account. In this paper, we investigate the effect of the quality of protein-protein interaction data on predicting protein function. Moreover, two improved methods, the weight neighbour counting method (WNC) and the weight chi-square method (WCHI), are proposed by incorporating the quality of interaction samples into the neighbour counting method (NC) and the chi-square method (CHI). Experimental results show that the quality of the interaction samples substantially affects the performance of protein function prediction methods. It is also demonstrated that the WNC and WCHI methods outperform the NC and CHI methods in protein function prediction when example weights are chosen properly.
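
    The difference between plain and weighted neighbour counting can be illustrated with a short sketch. The abstract does not state how the weights enter the score, so summing a per-interaction reliability weight for each function carried by a neighbour is an assumption; the function name and data layout are hypothetical.

```python
from collections import defaultdict


def weighted_neighbour_counting(protein, interactions, annotations, top=3):
    """Rank candidate functions for `protein` by weighted neighbour votes.

    interactions: dict protein -> {neighbour: reliability weight}
    annotations: dict protein -> set of known functions
    Setting every weight to 1 reduces this to plain neighbour counting.
    """
    scores = defaultdict(float)
    for neighbour, weight in interactions.get(protein, {}).items():
        for function in annotations.get(neighbour, ()):
            scores[function] += weight
    # Highest score first; ties are broken alphabetically.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top]
```

    With one reliable interaction and two unreliable ones, the weighted ranking can favour the function of the single reliable neighbour, whereas plain counting would follow the majority.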

  1. A novel neural response algorithm for protein function prediction

    PubMed Central

    2012-01-01

    Background Large amounts of data are being generated by high-throughput genome sequencing methods, but the rate of experimental functional characterization falls far behind. To fill the gap between the number of sequences and their annotations, fast and accurate automated annotation methods are required. Many methods, such as GOblet, GOFigure, and Gotcha, are designed based on the BLAST search. Unfortunately, the sequence coverage of these methods is low as they cannot detect remote homologues. In addition, the lack of annotation specificity calls for improvements in automated protein function prediction. Results We designed a novel automated protein functional assignment method based on the neural response algorithm, which simulates the neuronal behavior of the visual cortex in the human brain. First, we predict the most similar target protein for a given query protein and thereby assign its GO term to the query sequence. When assessed on the test set, our method ranked the actual leaf GO term among the top 5 probable GO terms with an accuracy of 86.93%. Conclusions The proposed algorithm is the first instance of the neural response algorithm being used in the biological domain. The use of HMM profiles along with secondary structure information to define the neural response gives our method an edge over other available methods in annotation accuracy. Results of the 5-fold cross validation and the comparison with the PFP and FFPred servers indicate the strong performance of our method. The program, the dataset, and help files are available at http://www.jjwanglab.org/NRProF/. PMID:23046521

  2. Gene Ontology Function prediction in Mollicutes using Protein-Protein Association Networks

    PubMed Central

    2011-01-01

    Background Many complex systems can be represented and analysed as networks. The recent availability of large-scale datasets has made it possible to elucidate some of the organisational principles and rules that govern their function, robustness and evolution. However, one of the main limitations in using protein-protein interactions for function prediction is the availability of interaction data, especially for Mollicutes. If we could harness predicted interactions, such as those from a Protein-Protein Association Network (PPAN), combining several protein-protein network function-inference methods with semantic similarity calculations, the use of protein-protein interactions for functional inference in this species would potentially become more useful. Results In this work we show that using PPAN data combined with other approximations, such as functional module detection, orthology exploitation methods and Gene Ontology (GO)-based information measures, helps to predict protein function in Mycoplasma genitalium. Conclusions To our knowledge, the proposed method is the first that combines functional module detection among species, exploiting an orthology procedure and using information theory-based GO semantic similarity in PPAN of the Mycoplasma species. The results of an evaluation show a higher recall than previously reported methods that focused on only one organism network. PMID:21486441

  3. FunPred-1: protein function prediction from a protein interaction network using neighborhood analysis.

    PubMed

    Saha, Sovan; Chatterjee, Piyali; Basu, Subhadip; Kundu, Mahantapas; Nasipuri, Mita

    2014-12-01

    Proteins are responsible for all biological activities in living organisms. Thanks to genome sequencing projects, large amounts of DNA and protein sequence data are now available, but the biological functions of many proteins are still not annotated in most cases. The unknown function of such non-annotated proteins may be inferred or deduced from their neighbors in a protein interaction network. In this paper, we propose two new methods to predict protein functions based on network neighborhood properties. FunPred 1.1 uses a combination of three simple-yet-effective scoring techniques: the neighborhood ratio, the protein path connectivity and the relative functional similarity. FunPred 1.2 applies a heuristic approach using the edge clustering coefficient to reduce the search space by identifying densely connected neighborhood regions. The overall accuracy achieved in FunPred 1.2 over 8 functional groups involving hetero-interactions in 650 yeast proteins is around 87%, which is higher than the accuracy with FunPred 1.1. It is also higher than the accuracy of many of the state-of-the-art protein function prediction methods described in the literature. The test datasets and the complete source code of the developed software are now freely available at http://code.google.com/p/cmaterbioinfo/ .

  4. Timing Correlations in Proteins Predict Functional Modules and Dynamic Allostery.

    PubMed

    Lin, Milo M

    2016-04-20

    How protein structure encodes functionality is not fully understood. For example, long-range intraprotein communication can occur without measurable conformational change and is often not captured by existing structural correlation functions. It is shown here that important functional information is encoded in the timing of protein motions, rather than motion itself. I introduce the conditional activity function to quantify such timing correlations among the degrees of freedom within proteins. For three proteins, the conditional activities between side-chain dihedral angles were computed using the output of microseconds-long atomistic simulations. The new approach demonstrates that a sparse fraction of side-chain pairs are dynamically correlated over long distances (spanning protein lengths up to 7 nm), in sharp contrast to structural correlations, which are short-ranged (<1 nm). Regions of high self- and inter-side-chain dynamical correlations are found, corresponding to experimentally determined functional modules and allosteric connections, respectively.

  5. Evaluation of function predictions by PFP, ESG, and PSI-BLAST for moonlighting proteins

    PubMed Central

    2012-01-01

    Background Advancements in function prediction algorithms are enabling large scale computational annotation for newly sequenced genomes. With the increase in the number of functionally well characterized proteins it has been observed that there are many proteins involved in more than one function. These proteins characterized as moonlighting proteins show varied functional behavior depending on the cell type, localization in the cell, oligomerization, multiple binding sites, etc. The functional diversity shown by moonlighting proteins may have significant impact on the traditional sequence based function prediction methods. Here we investigate how well diverse functions of moonlighting proteins can be predicted by some existing function prediction methods. Results We have analyzed the performances of three major sequence based function prediction methods, PSI-BLAST, the Protein Function Prediction (PFP), and the Extended Similarity Group (ESG) on predicting diverse functions of moonlighting proteins. In predicting discrete functions of a set of 19 experimentally identified moonlighting proteins, PFP showed overall highest recall among the three methods. Although ESG showed the highest precision, its recall was lower than PSI-BLAST. Recall by PSI-BLAST greatly improved when BLOSUM45 was used instead of BLOSUM62. Conclusion We have analyzed the performances of PFP, ESG, and PSI-BLAST in predicting the functional diversity of moonlighting proteins. PFP shows overall better performance in predicting diverse moonlighting functions as compared with PSI-BLAST and ESG. Recall by PSI-BLAST greatly improved when BLOSUM45 was used. This analysis indicates that considering weakly similar sequences in prediction enhances the performance of sequence based AFP methods in predicting functional diversity of moonlighting proteins. The current study will also motivate development of novel computational frameworks for automatic identification of such proteins. PMID:23173871

  6. De-novo protein function prediction using DNA binding and RNA binding proteins as a test case

    PubMed Central

    Peled, Sapir; Leiderman, Olga; Charar, Rotem; Efroni, Gilat; Shav-Tal, Yaron; Ofran, Yanay

    2016-01-01

    Of the currently identified protein sequences, 99.6% have never been observed in the laboratory as proteins and their molecular function has not been established experimentally. Predicting the function of such proteins relies mostly on annotated homologs. However, this has resulted in some erroneous annotations, and many proteins have no annotated homologs. Here we propose a de-novo function prediction approach based on identifying biophysical features that underlie function. Using our approach, we discover DNA and RNA binding proteins that cannot be identified based on homology and validate these predictions experimentally. For example, FGF14, which belongs to a family of secreted growth factors was predicted to bind DNA. We verify this experimentally and also show that FGF14 is localized to the nucleus. Mutating the predicted binding site on FGF14 abrogated DNA binding. These results demonstrate the feasibility of automated de-novo function prediction based on identifying function-related biophysical features. PMID:27869118

  7. De-novo protein function prediction using DNA binding and RNA binding proteins as a test case.

    PubMed

    Peled, Sapir; Leiderman, Olga; Charar, Rotem; Efroni, Gilat; Shav-Tal, Yaron; Ofran, Yanay

    2016-11-21

    Of the currently identified protein sequences, 99.6% have never been observed in the laboratory as proteins and their molecular function has not been established experimentally. Predicting the function of such proteins relies mostly on annotated homologs. However, this has resulted in some erroneous annotations, and many proteins have no annotated homologs. Here we propose a de-novo function prediction approach based on identifying biophysical features that underlie function. Using our approach, we discover DNA and RNA binding proteins that cannot be identified based on homology and validate these predictions experimentally. For example, FGF14, which belongs to a family of secreted growth factors was predicted to bind DNA. We verify this experimentally and also show that FGF14 is localized to the nucleus. Mutating the predicted binding site on FGF14 abrogated DNA binding. These results demonstrate the feasibility of automated de-novo function prediction based on identifying function-related biophysical features.

  8. A new approach to assess and predict the functional roles of proteins across all known structures.

    PubMed

    Julfayev, Elchin S; McLaughlin, Ryan J; Tao, Yi-Ping; McLaughlin, William A

    2011-03-01

    The three dimensional atomic structures of proteins provide information regarding their function, and codified relationships between structure and function enable the assessment of function from structure. In the current study, a new data mining tool was implemented that checks current gene ontology (GO) annotations and predicts new ones across all the protein structures available in the Protein Data Bank (PDB). The tool overcomes some of the challenges of utilizing large amounts of protein annotation and measurement information to form correspondences between protein structure and function. Protein attributes were extracted from the Structural Biology Knowledgebase and open source biological databases. Based on the presence or absence of a given set of attributes, a given protein's functional annotations were inferred. The results show that attributes derived from the three dimensional structures of proteins enhanced predictions over those using attributes derived only from primary amino acid sequence. Some predictions reflected known but not completely documented GO annotations. For example, predictions for the GO term for copper ion binding reflected information that a copper ion was known to interact with the protein, based on information in a ligand interaction database. Other predictions were novel and require further experimental validation. These include predictions for proteins labeled as unknown function in the PDB. Two examples are a role in the regulation of transcription for the protein AF1396 from Archaeoglobus fulgidus and a role in RNA metabolism for the protein psuG from Thermotoga maritima.

  9. Correlated Protein Function Prediction via Maximization of Data-Knowledge Consistency.

    PubMed

    Wang, Hua; Huang, Heng; Ding, Chris

    2015-06-01

    Conventional computational approaches for protein function prediction usually predict one function at a time, fundamentally. As a result, the protein functions are treated as separate target classes. However, biological processes are highly correlated in reality, which makes multiple functions assigned to a protein not independent. Therefore, it would be beneficial to make use of function category correlations when predicting protein functions. In this article, we propose a novel Maximization of Data-Knowledge Consistency (MDKC) approach to exploit function category correlations for protein function prediction. Our approach banks on the assumption that two proteins are likely to have large overlap in their annotated functions if they are highly similar according to certain experimental data. We first establish a new pairwise protein similarity using protein annotations from knowledge perspective. Then by maximizing the consistency between the established knowledge similarity upon annotations and the data similarity upon biological experiments, putative functions are assigned to unannotated proteins. Most importantly, function category correlations are gracefully incorporated into our learning objective through the knowledge similarity. Comprehensive experimental evaluations on the Saccharomyces cerevisiae species have demonstrated promising results that validate the performance of our methods.

  10. COMBREX-DB: an experiment centered database of protein function: knowledge, predictions and knowledge gaps.

    PubMed

    Chang, Yi-Chien; Hu, Zhenjun; Rachlin, John; Anton, Brian P; Kasif, Simon; Roberts, Richard J; Steffen, Martin

    2016-01-04

    The COMBREX database (COMBREX-DB; combrex.bu.edu) is an online repository of information related to (i) experimentally determined protein function, (ii) predicted protein function, (iii) relationships among proteins of unknown function and various types of experimental data, including molecular function, protein structure, and associated phenotypes. The database was created as part of the novel COMBREX (COMputational BRidges to EXperiments) effort aimed at accelerating the rate of gene function validation. It currently holds information on ∼ 3.3 million known and predicted proteins from over 1000 completely sequenced bacterial and archaeal genomes. The database also contains a prototype recommendation system for helping users identify those proteins whose experimental determination of function would be most informative for predicting function for other proteins within protein families. The emphasis on documenting experimental evidence for function predictions, and the prioritization of uncharacterized proteins for experimental testing distinguish COMBREX from other publicly available microbial genomics resources. This article describes updates to COMBREX-DB since an initial description in the 2011 NAR Database Issue.

  11. I-TASSER: a unified platform for automated protein structure and function prediction.

    PubMed

    Roy, Ambrish; Kucukural, Alper; Zhang, Yang

    2010-04-01

    The iterative threading assembly refinement (I-TASSER) server is an integrated platform for automated protein structure and function prediction based on the sequence-to-structure-to-function paradigm. Starting from an amino acid sequence, I-TASSER first generates three-dimensional (3D) atomic models from multiple threading alignments and iterative structural assembly simulations. The function of the protein is then inferred by structurally matching the 3D models with other known proteins. The output from a typical server run contains full-length secondary and tertiary structure predictions, and functional annotations on ligand-binding sites, Enzyme Commission numbers and Gene Ontology terms. An estimate of accuracy of the predictions is provided based on the confidence score of the modeling. This protocol provides new insights and guidelines for designing of online server systems for the state-of-the-art protein structure and function predictions. The server is available at http://zhanglab.ccmb.med.umich.edu/I-TASSER.

  12. Protein Function Prediction using Text-based Features extracted from the Biomedical Literature: The CAFA Challenge

    PubMed Central

    2013-01-01

    Background Advances in sequencing technology over the past decade have resulted in an abundance of sequenced proteins whose function is yet unknown. As such, computational systems that can automatically predict and annotate protein function are in demand. Most computational systems use features derived from protein sequence or protein structure to predict function. In an earlier work, we demonstrated the utility of biomedical literature as a source of text features for predicting protein subcellular location. We have also shown that the combination of text-based and sequence-based prediction improves the performance of location predictors. Following up on this work, for the Critical Assessment of Function Annotations (CAFA) Challenge, we developed a text-based system that aims to predict molecular function and biological process (using Gene Ontology terms) for unannotated proteins. In this paper, we present the preliminary work and evaluation that we performed for our system, as part of the CAFA challenge. Results We have developed a preliminary system that represents proteins using text-based features and predicts protein function using a k-nearest neighbour classifier (Text-KNN). We selected text features for our classifier by extracting key terms from biomedical abstracts based on their statistical properties. The system was trained and tested using 5-fold cross-validation over a dataset of 36,536 proteins. System performance was measured using the standard measures of precision, recall, F-measure and overall accuracy. The performance of our system was compared to two baseline classifiers: one that assigns function based solely on the prior distribution of protein function (Base-Prior) and one that assigns function based on sequence similarity (Base-Seq). The overall prediction accuracy of Text-KNN, Base-Prior, and Base-Seq for molecular function classes are 62%, 43%, and 58% while the overall accuracy for biological process classes are 17%, 11%, and 28

  13. Local structure based method for prediction of the biochemical function of proteins: Applications to glycoside hydrolases.

    PubMed

    Parasuram, Ramya; Mills, Caitlyn L; Wang, Zhouxi; Somasundaram, Saroja; Beuning, Penny J; Ondrechen, Mary Jo

    2016-01-15

    Thousands of protein structures of unknown or uncertain function have been reported as a result of high-throughput structure determination techniques developed by Structural Genomics (SG) projects. However, many of the putative functional assignments of these SG proteins in the Protein Data Bank (PDB) are incorrect. While high-throughput biochemical screening techniques have provided valuable functional information for limited sets of SG proteins, the biochemical functions for most SG proteins are still unknown or uncertain. Therefore, computational methods for the reliable prediction of protein function from structure can add tremendous value to the existing SG data. In this article, we show how computational methods may be used to predict the function of SG proteins, using examples from the six-hairpin glycosidase (6-HG) and the concanavalin A-like lectin/glucanase (CAL/G) superfamilies. Using a set of predicted functional residues, obtained from computed electrostatic and chemical properties for each protein structure, it is shown that these superfamilies may be sorted into functional families according to biochemical function. Within these superfamilies, a total of 18 SG proteins were analyzed according to their predicted, local functional sites: 13 from the 6-HG superfamily, five from the CAL/G superfamily. Within the 6-HG superfamily, an uncharacterized protein BACOVA_03626 from Bacteroides ovatus (PDB 3ON6) and a hypothetical protein BT3781 from Bacteroides thetaiotaomicron (PDB 2P0V) are shown to have very strong active site matches with exo-α-1,6-mannosidases, thus likely possessing this function. Also in this superfamily, it is shown that protein BH0842, a putative glycoside hydrolase from Bacillus halodurans (PDB 2RDY), has a predicted active site that matches well with a known α-L-galactosidase. In the CAL/G superfamily, an uncharacterized glycosyl hydrolase family 16 protein from Mycobacterium smegmatis (PDB 3RQ0) is shown to have local structural

  14. Prediction of Functional Class of Proteins and Peptides Irrespective of Sequence Homology by Support Vector Machines

    PubMed Central

    Tang, Zhi Qun; Lin, Hong Huang; Zhang, Hai Lei; Han, Lian Yi; Chen, Xin; Chen, Yu Zong

    2007-01-01

    Various computational methods have been used for the prediction of protein and peptide function based on their sequences. A particular challenge is to derive functional properties from sequences that show low or no homology to proteins of known function. Recently, a machine learning method, support vector machines (SVM), has been explored for predicting the functional class of proteins and peptides from amino acid sequence derived properties independent of sequence similarity. It has shown promising potential for a wide spectrum of protein and peptide classes, including some of the low- and non-homologous proteins. This method can thus be explored as a potential tool to complement alignment-based, clustering-based, and structure-based methods for predicting protein function. This article reviews the strategies, current progress, and underlying difficulties in using SVM for predicting the functional class of proteins. The relevant software and web-servers are described. The reported prediction performances in the application of these methods are also presented. PMID:20066123

  15. FINDSITE: a combined evolution/structure-based approach to protein function prediction

    PubMed Central

    Brylinski, Michal

    2009-01-01

    A key challenge of the post-genomic era is the identification of the function(s) of all the molecules in a given organism. Here, we review the status of sequence and structure-based approaches to protein function inference and ligand screening that can provide functional insights for a significant fraction of the ∼50% of ORFs of unassigned function in an average proteome. We then describe FINDSITE, a recently developed algorithm for ligand binding site prediction, ligand screening and molecular function prediction, which is based on binding site conservation across evolutionary distant proteins identified by threading. Importantly, FINDSITE gives comparable results when high-resolution experimental structures as well as predicted protein models are used. PMID:19324930

  16. MS-kNN: protein function prediction by integrating multiple data sources

    PubMed Central

    2013-01-01

    Background Protein function determination is a key challenge in the post-genomic era. Experimental determination of protein functions is accurate, but time-consuming and resource-intensive. A cost-effective alternative is to use the known information about sequence, structure, and functional properties of genes and proteins to predict functions using statistical methods. In this paper, we describe the Multi-Source k-Nearest Neighbor (MS-kNN) algorithm for function prediction, which finds k-nearest neighbors of a query protein based on different types of similarity measures and predicts its function by weighted averaging of its neighbors' functions. Specifically, we used 3 data sources to calculate the similarity scores: sequence similarity, protein-protein interactions, and gene expressions. Results We report the results in the context of 2011 Critical Assessment of Function Annotation (CAFA). Prior to CAFA submission deadline, we evaluated our algorithm on 1,302 human test proteins that were represented in all 3 data sources. Using only the sequence similarity information, MS-kNN had term-based Area Under the Curve (AUC) accuracy of Gene Ontology (GO) molecular function predictions of 0.728 when 7,412 human training proteins were used, and 0.819 when 35,622 training proteins from multiple eukaryotic and prokaryotic organisms were used. By aggregating predictions from all three sources, the AUC was further improved to 0.848. Similar result was observed on prediction of GO biological processes. Testing on 595 proteins that were annotated after the CAFA submission deadline showed that overall MS-kNN accuracy was higher than that of baseline algorithms Gotcha and BLAST, which were based solely on sequence similarity information. Since only 10 of the 595 proteins were represented by all 3 data sources, and 66 by two data sources, the difference between 3-source and one-source MS-kNN was rather small. 
    Conclusions Based on our results, we have several useful insights: (1
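
The multi-source k-nearest-neighbour idea described in this abstract can be sketched as follows. For each data source, the k most similar annotated proteins vote for their functions in proportion to their similarity to the query, and the per-source scores are then averaged. The abstract does not give the exact weighting or aggregation formula, so the normalisation, the equal weighting of sources, and the function names `knn_scores` and `multi_source_knn` are assumptions for illustration, not the authors' implementation.

```python
def knn_scores(similarities, annotations, k):
    """Similarity-weighted vote of the k nearest neighbours in one data source.

    similarities -- dict mapping training protein -> similarity to the query
    annotations  -- dict mapping training protein -> set of function labels
    """
    nearest = sorted(similarities.items(), key=lambda kv: -kv[1])[:k]
    total = sum(sim for _, sim in nearest)
    scores = {}
    if total == 0:
        return scores  # no informative neighbours in this source
    for protein, sim in nearest:
        for function in annotations.get(protein, ()):
            scores[function] = scores.get(function, 0.0) + sim / total
    return scores


def multi_source_knn(sources, annotations, k):
    """Average the per-source scores (equal source weights are an assumption)."""
    per_source = [knn_scores(s, annotations, k) for s in sources]
    functions = set().union(*per_source)
    return {
        f: sum(scores.get(f, 0.0) for scores in per_source) / len(per_source)
        for f in functions
    }


# Toy query with two hypothetical sources: sequence similarity and PPI.
sequence_sim = {"A": 0.8, "B": 0.2, "C": 0.1}
ppi_sim = {"A": 0.5, "C": 0.5}
annotations = {"A": {"f1"}, "B": {"f2"}, "C": {"f2"}}

combined = multi_source_knn([sequence_sim, ppi_sim], annotations, k=2)
```

A protein that is missing from a source simply contributes nothing there, which mirrors the abstract's observation that few test proteins were covered by all three sources.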

  17. PFP: Automated prediction of gene ontology functional annotations with confidence scores using protein sequence data.

    PubMed

    Hawkins, Troy; Chitale, Meghana; Luban, Stanislav; Kihara, Daisuke

    2009-02-15

    Protein function prediction is a central problem in bioinformatics, increasing in importance recently due to the rapid accumulation of biological data awaiting interpretation. Sequence data represents the bulk of this new stock and is the obvious target for consideration as input, as newly sequenced organisms often lack any other type of biological characterization. We have previously introduced PFP (Protein Function Prediction) as our sequence-based predictor of Gene Ontology (GO) functional terms. PFP interprets the results of a PSI-BLAST search by extracting and scoring individual functional attributes, searching a wide range of E-value sequence matches, and utilizing conventional data mining techniques to fill in missing information. We have shown it to be effective in predicting both specific and low-resolution functional attributes when sufficient data is unavailable. Here we describe (1) significant improvements to the PFP infrastructure, including the addition of prediction significance and confidence scores, (2) a thorough benchmark of performance and comparisons to other related prediction methods, and (3) applications of PFP predictions to genome-scale data. We applied PFP predictions to uncharacterized protein sequences from 15 organisms. Among these sequences, 60-90% could be annotated with a GO molecular function term at high confidence (≥80%). We also applied our predictions to the protein-protein interaction network of the Malaria plasmodium (Plasmodium falciparum). High confidence GO biological process predictions (≥90%) from PFP increased the number of fully enriched interactions in this dataset from 23% of interactions to 94%. Our benchmark comparison shows significant performance improvement of PFP relative to GOtcha, InterProScan, and PSI-BLAST predictions. This is consistent with the performance of PFP as the overall best predictor in both the AFP-SIG '05 and CASP7 function (FN) assessments. PFP is available as a web service at http://dragon.bio.purdue.edu/pfp/.

  18. Functional classification of protein 3D structures from predicted local interaction sites.

    PubMed

    Parasuram, Ramya; Lee, Joslynn S; Yin, Pengcheng; Somarowthu, Srinivas; Ondrechen, Mary Jo

    2010-12-01

    A new approach to the functional classification of protein 3D structures is described with application to some examples from structural genomics. This approach is based on functional site prediction with THEMATICS and POOL. THEMATICS employs calculated electrostatic potentials of the query structure. POOL is a machine learning method that utilizes THEMATICS features and has been shown to predict accurate, precise, highly localized interaction sites. Extension to the functional classification of structural genomics proteins is now described. Predicted functionally important residues are structurally aligned with those of proteins with previously characterized biochemical functions. A 3D structure match at the predicted local functional site then serves as a more reliable predictor of biochemical function than an overall structure match. Annotation is confirmed for a structural genomics protein with the ribulose phosphate binding barrel (RPBB) fold. A putative glucoamylase from Bacteroides fragilis (PDB ID 3eu8) is shown to be in fact probably not a glucoamylase. Finally a structural genomics protein from Streptomyces coelicolor annotated as an enoyl-CoA hydratase (PDB ID 3g64) is shown to be misannotated. Its predicted active site does not match the well-characterized enoyl-CoA hydratases of similar structure but rather bears closer resemblance to those of a dehalogenase with similar fold.

  19. Predicting protein N-glycosylation by combining functional domain and secretion information.

    PubMed

    Li, Sujun; Liu, Boshu; Cai, Yudong; Li, Yixue

    2007-08-01

    Protein N-glycosylation plays an important role in protein function. Yet, at present, few computational methods are available for the prediction of this protein modification. This prompted our development of a support vector machine (SVM)-based method for this task, as well as a partial least squares (PLS) regression based prediction method for comparison. A functional domain feature space was used to create SVM and PLS models, which achieved accuracies of 83.91% and 79.89%, respectively, as evaluated by a leave-one-out cross-validation. Subsequently, SVM and PLS models were developed based on functional domain and protein secretion information, which yielded accuracies of 89.13% and 86%, respectively. This analysis demonstrates that the protein functional domain and secretion information are both efficient predictors of N-glycosylation.

  20. Inference of Functional Relations in Predicted Protein Networks with a Machine Learning Approach

    PubMed Central

    Ezkurdia, Iakes; Andrés-León, Eduardo; Valencia, Alfonso

    2010-01-01

    Background Molecular biology is currently facing the challenging task of functionally characterizing the proteome. The large number of possible protein-protein interactions and complexes, the variety of environmental conditions and cellular states in which these interactions can be reorganized, and the multiple ways in which a protein can influence the function of others, requires the development of experimental and computational approaches to analyze and predict functional associations between proteins as part of their activity in the interactome. Methodology/Principal Findings We have studied the possibility of constructing a classifier in order to combine the output of the several protein interaction prediction methods. The AODE (Averaged One-Dependence Estimators) machine learning algorithm is a suitable choice in this case and it provides better results than the individual prediction methods, and it has better performances than other tested alternative methods in this experimental set up. To illustrate the potential use of this new AODE-based Predictor of Protein InterActions (APPIA), when analyzing high-throughput experimental data, we show how it helps to filter the results of published High-Throughput proteomic studies, ranking in a significant way functionally related pairs. Availability: All the predictions of the individual methods and of the combined APPIA predictor, together with the used datasets of functional associations are available at http://ecid.bioinfo.cnio.es/. Conclusions We propose a strategy that integrates the main current computational techniques used to predict functional associations into a unified classifier system, specifically focusing on the evaluation of poorly characterized protein pairs. We selected the AODE classifier as the appropriate tool to perform this task. AODE is particularly useful to extract valuable information from large unbalanced and heterogeneous data sets. The combination of the information provided by five

  1. Predicting protein functions from redundancies in large-scale protein interaction networks

    NASA Technical Reports Server (NTRS)

    Samanta, Manoj Pratim; Liang, Shoudan

    2003-01-01

    Interpreting data from large-scale protein interaction experiments has been a challenging task because of the widespread presence of random false positives. Here, we present a network-based statistical algorithm that overcomes this difficulty and allows us to derive functions of unannotated proteins from large-scale interaction data. Our algorithm uses the insight that if two proteins share a significantly larger number of common interaction partners than expected at random, they have close functional associations. Analysis of publicly available data from Saccharomyces cerevisiae reveals >2,800 reliable functional associations, 29% of which involve at least one unannotated protein. By further analyzing these associations, we derive tentative functions for 81 unannotated proteins with high certainty. Our method is not overly sensitive to the false positives present in the data. Even after adding 50% randomly generated interactions to the measured data set, we are able to recover almost all (approximately 89%) of the original associations.
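
One way to make "more common partners than random" concrete is a hypergeometric tail probability, sketched below. The abstract does not state the exact statistic used, so this formula is an assumed stand-in rather than the authors' method, and the function name `shared_partner_pvalue` is hypothetical. It asks: if a protein with n2 partners chose them uniformly at random from N proteins, how likely is it to share at least m of them with a protein that has n1 partners?

```python
from math import comb


def shared_partner_pvalue(N, n1, n2, m):
    """P(at least m shared partners) under a hypergeometric null model.

    N  -- number of proteins in the network
    n1 -- number of partners of the first protein
    n2 -- number of partners of the second protein
    m  -- observed number of partners the two proteins have in common
    """
    total = comb(N, n2)  # ways to choose the second protein's partners
    upper = min(n1, n2)  # the overlap cannot exceed either partner set
    favourable = sum(
        comb(n1, k) * comb(N - n1, n2 - k) for k in range(m, upper + 1)
    )
    return favourable / total


# Two proteins with 3 partners each in a 10-protein network, sharing all 3.
p_all_shared = shared_partner_pvalue(10, 3, 3, 3)
# Sharing "at least zero" partners is certain.
p_trivial = shared_partner_pvalue(10, 3, 3, 0)
```

Small values flag protein pairs whose overlap is unlikely to arise from random false positives, which is the kind of pair the abstract treats as functionally associated.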

  2. Computational Prediction of Protein Function Based on Weighted Mapping of Domains and GO Terms

    PubMed Central

    Teng, Zhixia; Guo, Maozu; Dai, Qiguo; Wang, Chunyu; Li, Jin; Liu, Xiaoyan

    2014-01-01

    In this paper, we propose a novel method, SeekFun, to predict protein function based on weighted mapping of domains and GO terms. Firstly, a weighted mapping of domains and GO terms is constructed according to GO annotations and domain composition of the proteins. The association strength between domain and GO term is weighted by symmetrical conditional probability. Secondly, the mapping is extended along the true paths of the terms based on GO hierarchy. Finally, the terms associated with resident domains are transferred to host protein and real annotations of the host protein are determined by association strengths. Our careful comparisons demonstrate that SeekFun outperforms the concerned methods on most occasions. SeekFun provides a flexible and effective way for protein function prediction. It benefits from the well-constructed mapping of domains and GO terms, as well as the reasonable strategy for inferring annotations of protein from those of its domains. PMID:24868539
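
    As an illustration of the weighting step, the sketch below computes a symmetrical conditional probability for each domain-term pair, using the common definition SCP(d, t) = P(d, t)^2 / (P(d) P(t)), which equals P(d|t) P(t|d). Whether SeekFun uses exactly this estimator is an assumption, and the protein, domain and term identifiers are hypothetical.

```python
from collections import Counter
from itertools import product


def scp_weights(proteins):
    """Return {(domain, term): SCP weight}.

    proteins maps a protein name to (set of domains, set of GO terms).
    Probabilities are estimated as protein counts; the common factor
    1/N cancels, leaving n_dt^2 / (n_d * n_t).
    """
    n_d, n_t, n_dt = Counter(), Counter(), Counter()
    for domains, terms in proteins.values():
        n_d.update(domains)                   # proteins containing domain d
        n_t.update(terms)                     # proteins annotated with term t
        n_dt.update(product(domains, terms))  # proteins with both d and t
    return {(d, t): c * c / (n_d[d] * n_t[t]) for (d, t), c in n_dt.items()}


# Toy annotation table (hypothetical identifiers).
toy = {
    "P1": ({"DOM1"}, {"GO:A"}),
    "P2": ({"DOM1", "DOM2"}, {"GO:A", "GO:B"}),
    "P3": ({"DOM2"}, {"GO:B"}),
}
weights = scp_weights(toy)
```

    In this toy table DOM1 and GO:A always co-occur, so their weight is 1.0, while DOM1 and GO:B co-occur in only one of the two proteins carrying each, giving 0.25.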

  3. iPFPi: A System for Improving Protein Function Prediction through Cumulative Iterations.

    PubMed

    Taha, Kamal; Yoo, Paul D; Alzaabi, Mohammed

    2015-01-01

    We propose a classifier system called iPFPi that predicts the functions of un-annotated proteins. iPFPi assigns an un-annotated protein P the functions of GO annotation terms that are semantically similar to P. An un-annotated protein P and a GO annotation term T are represented by their characteristics. The characteristics of P are GO terms found within the abstracts of biomedical literature associated with P. The characteristics of T are GO terms found within the abstracts of biomedical literature associated with the proteins annotated with the function of T. Let F and F′ be the important (dominant) sets of characteristic terms representing T and P, respectively. iPFPi would annotate P with the function of T, if F and F′ are semantically similar. We constructed a novel semantic similarity measure that takes into consideration several factors, such as the dominance degree of each characteristic term t in set F based on its score, which is a value that reflects the dominance status of t relative to other characteristic terms, using pairwise beats and loses procedure. Every time a protein P is annotated with the function of T, iPFPi updates and optimizes the current scores of the characteristic terms for T based on the weights of the characteristic terms for P. Set F will be updated accordingly. Thus, the accuracy of predicting the function of T as the function of subsequent proteins improves. This prediction accuracy keeps improving over time iteratively through the cumulative weights of the characteristic terms representing proteins that are successively annotated with the function of T. We evaluated the quality of iPFPi by comparing it experimentally with two recent protein function prediction systems. Results showed marked improvement.

  4. Structure- and Sequence-Based Function Prediction for Non-Homologous Proteins

    PubMed Central

    Sael, Lee; Chitale, Meghana; Kihara, Daisuke

    2012-01-01

    The structural genomics projects have been accumulating an increasing number of protein structures, many of which remain functionally unknown. In a parallel effort to experimental methods, computational methods are expected to make a significant contribution for functional elucidation of such proteins. However, conventional computational methods that transfer functions from homologous proteins do not help much for these uncharacterized protein structures because they do not have apparent structural or sequence similarity with the known proteins. Here, we briefly review two avenues of computational function prediction methods, i.e. structure-based methods and sequence-based methods. The focus is on our recent developments of local structure-based methods and sequence-based methods, which can effectively extract function information from distantly related proteins. Two structure-based methods, Pocket-Surfer and Patch-Surfer, identify similar known ligand binding sites for pocket regions in a query protein without using global protein fold similarity information. Two sequence-based methods, PFP and ESG, make use of weakly similar sequences that are conventionally discarded in homology based function annotation. Combined together with experimental methods we hope that computational methods will make a leading contribution in functional elucidation of the protein structures. PMID:22270458

  5. Prediction of protein function improving sequence remote alignment search by a fuzzy logic algorithm.

    PubMed

    Gómez, Antonio; Cedano, Juan; Espadaler, Jordi; Hermoso, Antonio; Piñol, Jaume; Querol, Enrique

    2008-02-01

    The functional annotation of the new protein sequences represents a major drawback for genomic science. The best way to suggest the function of a protein from its sequence is by finding a related one for which biological information is available. Current alignment algorithms display a list of protein sequence stretches presenting significant similarity to different protein targets, ordered by their respective mathematical scores. However, statistical and biological significance do not always coincide, therefore, the rearrangement of the program output according to more biological characteristics than the mathematical scoring would help functional annotation. A new method that predicts the putative function for the protein integrating the results from the PSI-BLAST program and a fuzzy logic algorithm is described. Several protein sequence characteristics have been checked in their ability to rearrange a PSI-BLAST profile according more to their biological functions. Four of them: amino acid content, matched segment length and hydropathic and flexibility profiles positively contributed, upon being integrated by a fuzzy logic algorithm into a program, BYPASS, to the accurate prediction of the function of a protein from its sequence.

  6. Protein function prediction by massive integration of evolutionary analyses and multiple data sources

    PubMed Central

    2013-01-01

    Background Accurate protein function annotation is a severe bottleneck when utilizing the deluge of high-throughput, next generation sequencing data. Keeping database annotations up-to-date has become a major scientific challenge that requires the development of reliable automatic predictors of protein function. The CAFA experiment provided a unique opportunity to undertake comprehensive 'blind testing' of many diverse approaches for automated function prediction. We report on the methodology we used for this challenge and on the lessons we learnt. Methods Our method integrates into a single framework a wide variety of biological information sources, encompassing sequence, gene expression and protein-protein interaction data, as well as annotations in UniProt entries. The methodology transfers functional categories based on the results from complementary homology-based and feature-based analyses. We generated the final molecular function and biological process assignments by combining the initial predictions in a probabilistic manner, which takes into account the Gene Ontology hierarchical structure. Results We propose a novel scoring function called COmbined Graph-Information Content similarity (COGIC) score for the comparison of predicted functional categories and benchmark data. We demonstrate that our integrative approach provides increased scope and accuracy over both the component methods and the naïve predictors. In line with previous studies, we find that molecular function predictions are more accurate than biological process assignments. Conclusions Overall, the results indicate that there is considerable room for improvement in the field. It still remains for the community to invest a great deal of effort to make automated function prediction a useful and routine component in the toolbox of life scientists. As already witnessed in other areas, community-wide blind testing experiments will be pivotal in establishing standards for the evaluation of

  7. Analysis of protein function and its prediction from amino acid sequence.

    PubMed

    Clark, Wyatt T; Radivojac, Predrag

    2011-07-01

    Understanding protein function is one of the keys to understanding life at the molecular level. It is also important in the context of human disease because many conditions arise as a consequence of alterations of protein function. The recent availability of relatively inexpensive sequencing technology has resulted in thousands of complete or partially sequenced genomes with millions of functionally uncharacterized proteins. Such a large volume of data, combined with the lack of high-throughput experimental assays to functionally annotate proteins, contributes to the growing importance of automated function prediction. Here, we study proteins annotated by Gene Ontology (GO) terms and estimate the accuracy of functional transfer from protein sequence only. We find that the transfer of GO terms by pairwise sequence alignments is only moderately accurate, showing a surprisingly small influence of sequence identity (SID) in a broad range (30-100%). We developed and evaluated a new predictor of protein function, functional annotator (FANN), from amino acid sequence. The predictor exploits a multioutput neural network framework which is well suited to simultaneously modeling dependencies between functional terms. Experiments provide evidence that FANN-GO (predictor of GO terms; available from http://www.informatics.indiana.edu/predrag) outperforms standard methods such as transfer by global or local SID as well as GOtcha, a method that incorporates the structure of GO.

  8. Predicted functional RNAs within coding regions constrain evolutionary rates of yeast proteins.

    PubMed

    Warden, Charles D; Kim, Seong-Ho; Yi, Soojin V

    2008-02-13

    Functional RNAs (fRNAs) are being recognized as an important regulatory component in biological processes. Interestingly, recent computational studies suggest that the number and biological significance of functional RNAs within coding regions (coding fRNAs) may have been underestimated. We hypothesized that such coding fRNAs will impose additional constraint on sequence evolution because the DNA primary sequence has to simultaneously code for functional RNA secondary structures on the messenger RNA in addition to the amino acid codons for the protein sequence. To test this prediction, we first utilized computational methods to predict conserved fRNA secondary structures within multiple species alignments of Saccharomyces sensu stricto genomes. We predict that as much as 5% of the genes in the yeast genome contain at least one functional RNA secondary structure within their protein-coding region. We then analyzed the impact of coding fRNAs on the evolutionary rate of protein-coding genes because a decrease in evolutionary rate implies constraint due to biological functionality. We found that our predicted coding fRNAs have a significant influence on evolutionary rates (especially at synonymous sites), independent of other functional measures. Thus, coding fRNA may play a role in sequence evolution. Given that coding regions of humans and flies contain many more predicted coding fRNAs than yeast, the impact of coding fRNAs on sequence evolution may be substantial in genomes of higher eukaryotes.

  9. SIFTER search: a web server for accurate phylogeny-based protein function prediction

    PubMed Central

    Sahraeian, Sayed M.; Luo, Kevin R.; Brenner, Steven E.

    2015-01-01

    We are awash in proteins discovered through high-throughput sequencing projects. As only a minuscule fraction of these have been experimentally characterized, computational methods are widely used for automated annotation. Here, we introduce a user-friendly web interface for accurate protein function prediction using the SIFTER algorithm. SIFTER is a state-of-the-art sequence-based gene molecular function prediction algorithm that uses a statistical model of function evolution to incorporate annotations throughout the phylogenetic tree. Due to the resources needed by the SIFTER algorithm, running SIFTER locally is not trivial for most users, especially for large-scale problems. The SIFTER web server thus provides access to precomputed predictions on 16 863 537 proteins from 232 403 species. Users can explore SIFTER predictions with queries for proteins, species, functions, and homologs of sequences not in the precomputed prediction set. The SIFTER web server is accessible at http://sifter.berkeley.edu/ and the source code can be downloaded. PMID:25979264

  10. SIFTER search: a web server for accurate phylogeny-based protein function prediction

    SciTech Connect

    Sahraeian, Sayed M.; Luo, Kevin R.; Brenner, Steven E.

    2015-05-15

    We are awash in proteins discovered through high-throughput sequencing projects. As only a minuscule fraction of these have been experimentally characterized, computational methods are widely used for automated annotation. Here, we introduce a user-friendly web interface for accurate protein function prediction using the SIFTER algorithm. SIFTER is a state-of-the-art sequence-based gene molecular function prediction algorithm that uses a statistical model of function evolution to incorporate annotations throughout the phylogenetic tree. Due to the resources needed by the SIFTER algorithm, running SIFTER locally is not trivial for most users, especially for large-scale problems. The SIFTER web server thus provides access to precomputed predictions on 16 863 537 proteins from 232 403 species. Users can explore SIFTER predictions with queries for proteins, species, functions, and homologs of sequences not in the precomputed prediction set. Lastly, the SIFTER web server is accessible at http://sifter.berkeley.edu/ and the source code can be downloaded.

  11. SIFTER search: a web server for accurate phylogeny-based protein function prediction

    DOE PAGES

    Sahraeian, Sayed M.; Luo, Kevin R.; Brenner, Steven E.

    2015-05-15

    We are awash in proteins discovered through high-throughput sequencing projects. As only a minuscule fraction of these have been experimentally characterized, computational methods are widely used for automated annotation. Here, we introduce a user-friendly web interface for accurate protein function prediction using the SIFTER algorithm. SIFTER is a state-of-the-art sequence-based gene molecular function prediction algorithm that uses a statistical model of function evolution to incorporate annotations throughout the phylogenetic tree. Due to the resources needed by the SIFTER algorithm, running SIFTER locally is not trivial for most users, especially for large-scale problems. The SIFTER web server thus provides access to precomputed predictions on 16 863 537 proteins from 232 403 species. Users can explore SIFTER predictions with queries for proteins, species, functions, and homologs of sequences not in the precomputed prediction set. Lastly, the SIFTER web server is accessible at http://sifter.berkeley.edu/ and the source code can be downloaded.

  12. Function prediction from networks of local evolutionary similarity in protein structure

    PubMed Central

    2013-01-01

    Background Annotating protein function with both high accuracy and sensitivity remains a major challenge in structural genomics. One proven computational strategy has been to group a few key functional amino acids into templates and search for these templates in other protein structures, so as to transfer function when a match is found. To this end, we previously developed Evolutionary Trace Annotation (ETA) and showed that diffusing known annotations over a network of template matches on a structural genomic scale improved predictions of function. In order to further increase sensitivity, we now let each protein contribute multiple templates rather than just one, and also let the template size vary. Results Retrospective benchmarks in 605 Structural Genomics enzymes showed that multiple templates increased sensitivity by up to 14% when combined with single template predictions even as they maintained the accuracy over 91%. Diffusing function globally on networks of single and multiple template matches marginally increased the area under the ROC curve over 0.97, but in a subset of proteins that could not be annotated by ETA, the network approach recovered annotations for the most confident 20-23 of 91 cases with 100% accuracy. Conclusions We improve the accuracy and sensitivity of predictions by using multiple templates per protein structure when constructing networks of ETA matches and diffusing annotations. PMID:23514548

  13. Gene Ontology consistent protein function prediction: the FALCON algorithm applied to six eukaryotic genomes.

    PubMed

    Kourmpetis, Yiannis Ai; van Dijk, Aalt Dj; Ter Braak, Cajo Jf

    2013-03-27

    Gene Ontology (GO) is a hierarchical vocabulary for the description of biological functions and locations, often employed by computational methods for protein function prediction. Due to the structure of GO, function predictions can be self-contradictory. For example, a protein may be predicted to belong to a detailed functional class, but not in a broader class that, due to the vocabulary structure, includes the predicted one. We present a novel discrete optimization algorithm called Functional Annotation with Labeling CONsistency (FALCON) that resolves such contradictions. The GO is modeled as a discrete Bayesian Network. For any given input of GO term membership probabilities, the algorithm returns the most probable GO term assignments that are in accordance with the Gene Ontology structure. The optimization is done using the Differential Evolution algorithm. Performance is evaluated on simulated and also real data from Arabidopsis thaliana showing improvement compared to related approaches. We finally applied the FALCON algorithm to obtain genome-wide function predictions for six eukaryotic species based on data provided by the CAFA (Critical Assessment of Function Annotation) project.
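
    The kind of contradiction described here can be made concrete with a short sketch. It only detects violations of the hierarchy and repairs them by naive upward propagation; it is not the FALCON optimization, which works on membership probabilities. The three-term hierarchy and the function names are hypothetical.

```python
def ancestors(term, parents):
    """All terms reachable from `term` by following parent links."""
    seen, stack = set(), [term]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def violations(assigned, parents):
    """Pairs (term, missing ancestor) that make a prediction inconsistent."""
    return {
        (t, a)
        for t in assigned
        for a in ancestors(t, parents)
        if a not in assigned
    }


def close_upward(assigned, parents):
    """Naive repair: add every ancestor of every assigned term."""
    closed = set(assigned)
    for t in assigned:
        closed |= ancestors(t, parents)
    return closed


# Hypothetical hierarchy: detailed -> broad -> root.
parents = {"detailed": ["broad"], "broad": ["root"]}
predicted = {"detailed"}          # detailed class without its broader classes
missing = violations(predicted, parents)
repaired = close_upward(predicted, parents)
```

    Upward propagation always yields a consistent set, but it ignores how improbable the added ancestors may be, which is the gap a probabilistic method is meant to address.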

  14. Network analysis and protein function prediction with the PRODISTIN Web site.

    PubMed

    Baudot, Anaïs; Souiai, Ouissem; Brun, Christine

    2012-01-01

    Interactions between macromolecules are deciphered to gain information about biological processes and protein function. This information is hidden in large interaction networks, which are very complicated to dissect. In this context, the PRODISTIN Web site is dedicated to the clustering of network proteins according to the identity of their interaction partners, and to the subsequent functional annotation of these clusters. It allows networks to be analysed functionally and ultimately leads to the prediction of function for uncharacterized proteins based on their membership in protein clusters. PRODISTIN analyses also provide an overview of the different biological processes existing in a given interactome. Here, we present a step-by-step procedure to analyse interaction networks using the PRODISTIN Web site. The protocol is illustrated by an application to the Campylobacter jejuni interactome.

  15. Prediction of functionally important residues in globular proteins from unusual central distances of amino acids

    PubMed Central

    2011-01-01

    Background Well-performing automated protein function recognition approaches usually comprise several complementary techniques. Beside constructing better consensus, their predictive power can be improved by either adding or refining independent modules that explore orthogonal features of proteins. In this work, we demonstrated how the exploration of global atomic distributions can be used to indicate functionally important residues. Results Using a set of carefully selected globular proteins, we parametrized continuous probability density functions describing preferred central distances of individual protein atoms. Relative preferred burials were estimated using mixture models of radial density functions dependent on the amino acid composition of a protein under consideration. The unexpectedness of extraordinary locations of atoms was evaluated in the information-theoretic manner and used directly for the identification of key amino acids. In the validation study, we tested capabilities of a tool built upon our approach, called SurpResi, by searching for binding sites interacting with ligands. The tool indicated multiple candidate sites achieving success rates comparable to several geometric methods. We also showed that the unexpectedness is a property of regions involved in protein-protein interactions, and thus can be used for the ranking of protein docking predictions. The computational approach implemented in this work is freely available via a Web interface at http://www.bioinformatics.org/surpresi. Conclusions Probabilistic analysis of atomic central distances in globular proteins is capable of capturing distinct orientational preferences of amino acids as resulting from different sizes, charges and hydrophobic characters of their side chains. When idealized spatial preferences can be inferred from the sole amino acid composition of a protein, residues located in hydrophobically unfavorable environments can be easily detected. Such residues turn out to be

  16. Classification of Phylogenetic Profiles for Protein Function Prediction: An SVM Approach

    NASA Astrophysics Data System (ADS)

    Kotaru, Appala Raju; Joshi, Ramesh C.

    Predicting the function of an uncharacterized protein is a major challenge in the post-genomic era due to the problem's complexity and scale. Knowledge of protein function is a crucial link in the development of new drugs, better crops, and even biochemicals such as biofuels. Recently, numerous high-throughput experimental procedures have been invented to investigate the mechanisms leading to the accomplishment of a protein’s function, and the phylogenetic profile is one of them. A phylogenetic profile is a way of representing a protein that encodes its evolutionary history. In this paper we propose a method for the classification of phylogenetic profiles using a supervised machine learning method, support vector machine classification with a radial basis function as the kernel, for identifying functionally linked proteins. We experimentally evaluated the performance of the classifier with the linear kernel and the polynomial kernel and compared the results with the existing tree kernel. In our study we used proteins of the budding yeast Saccharomyces cerevisiae genome. We generated the phylogenetic profiles of 2465 yeast genes and used the functional annotations that are available in the MIPS database. Our experiments show that the performance of the radial basis kernel is similar to that of the polynomial kernel in some functional classes, that both are better than the linear and tree kernels, and that overall the radial basis kernel outperformed the polynomial, linear and tree kernels. In analyzing these results we show that it is feasible to use an SVM classifier with a radial basis function as the kernel to predict gene functionality from phylogenetic profiles.
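
    To show what the radial basis function kernel computes on phylogenetic profiles, here is a minimal sketch. It assumes profiles are binary presence/absence vectors over a fixed list of genomes; the gamma value, the gene names and the profiles are hypothetical, and the SVM training step itself is not shown.

```python
from math import exp


def rbf_kernel(x, y, gamma=0.5):
    """K(x, y) = exp(-gamma * ||x - y||^2).

    For binary profiles the squared distance equals the number of
    genomes in which exactly one of the two genes is present.
    """
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return exp(-gamma * sq_dist)


def gram_matrix(profiles, gamma=0.5):
    """Kernel matrix over a list of profiles, as an SVM solver would consume."""
    return [[rbf_kernel(p, q, gamma) for q in profiles] for p in profiles]


# Hypothetical profiles: 1 = homolog present in that genome, 0 = absent.
profiles = {
    "geneA": [1, 1, 0, 1, 0],
    "geneB": [1, 1, 0, 0, 1],   # differs from geneA in two genomes
    "geneC": [0, 0, 1, 0, 1],
}
K = gram_matrix(list(profiles.values()))
```

    Genes with identical profiles get a kernel value of 1, and the value decays toward 0 as the profiles disagree in more genomes, which is how co-occurrence across species turns into a similarity usable by the classifier.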

  17. The contrasting properties of conservation and correlated phylogeny in protein functional residue prediction

    PubMed Central

    Manning, Jonathan R; Jefferson, Emily R; Barton, Geoffrey J

    2008-01-01

    Background Amino acids responsible for structure, core function or specificity may be inferred from multiple protein sequence alignments where a limited set of residue types are tolerated. The rise in available protein sequences continues to increase the power of techniques based on this principle. Results A new algorithm, SMERFS, for predicting protein functional sites from multiple sequence alignments was compared to 14 conservation measures and to the MINER algorithm. Validation was performed on an automatically generated dataset of 1457 families derived from the protein interactions database SNAPPI-DB, and a smaller manually curated set of 148 families. The best performing measure overall was Williamson property entropy, with ROC0.1 scores of 0.0087 and 0.0114 for domain and small molecule contact prediction, respectively. The Lancet method performed worse than random on protein-protein interaction site prediction (ROC0.1 score of 0.0008). The SMERFS algorithm gave similar accuracy to the phylogenetic tree-based MINER algorithm but was superior to Williamson in prediction of non-catalytic transient complex interfaces. SMERFS predicts sites that are significantly more solvent accessible compared to Williamson. Conclusion Williamson property entropy is the best performing of 14 conservation measures examined. The difference in performance of SMERFS relative to Williamson in manually defined complexes was dependent on complex type. The best choice of analysis method is therefore dependent on the system of interest. Additional computation employed by Miner in calculation of phylogenetic trees did not produce improved results over SMERFS. SMERFS performance was improved by use of windows over alignment columns, illustrating the necessity of considering the local environment of positions when assessing their functional significance. PMID:18221517

  18. ProFunc: a server for predicting protein function from 3D structure.

    PubMed

    Laskowski, Roman A; Watson, James D; Thornton, Janet M

    2005-07-01

    ProFunc (http://www.ebi.ac.uk/thornton-srv/databases/ProFunc) is a web server for predicting the likely function of proteins whose 3D structure is known but whose function is not. Users submit the coordinates of their structure to the server in PDB format. ProFunc makes use of both existing and novel methods to analyse the protein's sequence and structure identifying functional motifs or close relationships to functionally characterized proteins. A summary of the analyses provides an at-a-glance view of what each of the different methods has found. More detailed results are available on separate pages. Often where one method has failed to find anything useful another may be more forthcoming. The server is likely to be of most use in structural genomics where a large proportion of the proteins whose structures are solved are hypothetical proteins of unknown function. However, it may also find use in a comparative analysis of members of large protein families. It provides a convenient compendium of sequence and structural information that often holds vital functional clues to be followed up experimentally.

  19. Homology-based inference sets the bar high for protein function prediction

    PubMed Central

    2013-01-01

    Background Any method that de novo predicts protein function should do better than random. More challenging, it also ought to outperform simple homology-based inference. Methods Here, we describe a few methods that predict protein function exclusively through homology. Together, they set the bar or lower limit for future improvements. Results and conclusions During the development of these methods, we faced two surprises. Firstly, our most successful implementation for the baseline ranked very high at CAFA1. In fact, our best combination of homology-based methods fared only slightly worse than the top-of-the-line prediction method from the Jones group. Secondly, although the concept of homology-based inference is simple, this work revealed that the precise details of the implementation are crucial: not only did the methods span from top to bottom performers at CAFA, but also the reasons for these differences were unexpected. In this work, we also propose a new rigorous measure to compare predicted and experimental annotations. It puts more emphasis on the details of protein function than the other measures employed by CAFA and may best reflect the expectations of users. Clearly, the definition of proper goals remains one major objective for CAFA. PMID:23514582

  20. Predicting Structure and Function for Novel Proteins of an Extremophilic Iron Oxidizing Bacterium

    NASA Astrophysics Data System (ADS)

    Wheeler, K.; Zemla, A.; Banfield, J.; Thelen, M.

    2007-12-01

    Proteins isolated from uncultivated microbial populations represent the functional components of microbial processes and contribute directly to community fitness under natural conditions. Investigations into proteins in the environment are hindered by the lack of genome data, or where available, the high proportion of proteins of unknown function. We have identified thousands of proteins from biofilms in the extremely acidic drainage outflow of an iron mine ecosystem (1). With an extensive genomic and proteomic foundation, we have focused directly on the problem of several hundred proteins of unknown function within this well-defined model system. Here we describe the geobiological insights gained by using a high throughput computational approach for predicting structure and function of 421 novel proteins from the biofilm community. We used a homology based modeling system to compare these proteins to those of known structure (AS2TS) (2). This approach has resulted in the assignment of structures to 360 proteins (85%) and provided functional information for up to 75% of the modeled proteins. Detailed examination of the modeling results enables confident, high-throughput prediction of the roles of many of the novel proteins within the microbial community. For instance, one prediction places a protein in the phosphoenolpyruvate/pyruvate domain superfamily as a carboxylase that fills in a gap in an otherwise complete carbon cycle. Particularly important for a community in such a metal rich environment is the evolution of over 25% of the novel proteins that contain a metal cofactor; of these, one third are likely Fe containing proteins. Two of the most abundant proteins in biofilm samples are unusual c-type cytochromes. Both of these proteins catalyze iron oxidation, a key metabolic reaction supporting the energy requirements of this community. Structural models of these cytochromes verify our experimental results on heme binding and electron transfer reactivity, and

  1. Multi-instance multi-label distance metric learning for genome-wide protein function prediction.

    PubMed

    Xu, Yonghui; Min, Huaqing; Song, Hengjie; Wu, Qingyao

    2016-08-01

    Multi-instance multi-label (MIML) learning has been proven to be effective for the genome-wide protein function prediction problems where each training example is associated with not only multiple instances but also multiple class labels. To find an appropriate MIML learning method for genome-wide protein function prediction, many studies in the literature attempted to optimize objective functions in which dissimilarity between instances is measured using the Euclidean distance. But in many real applications, Euclidean distance may be unable to capture the intrinsic similarity/dissimilarity in feature space and label space. Unlike other previous approaches, in this paper, we propose to learn a multi-instance multi-label distance metric learning framework (MIMLDML) for genome-wide protein function prediction. Specifically, we learn a Mahalanobis distance to preserve and utilize the intrinsic geometric information of both feature space and label space for MIML learning. In addition, we try to deal with the sparsely labeled data by giving weight to the labeled data. Extensive experiments on seven real-world organisms covering the biological three-domain system (i.e., archaea, bacteria, and eukaryote; Woese et al., 1990) show that the MIMLDML algorithm is superior to most state-of-the-art MIML learning algorithms.
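
    The difference between Euclidean and Mahalanobis distance mentioned above can be shown in a few lines. In this sketch the matrix M is fixed by hand purely for illustration, whereas the abstract describes learning it from data; M is assumed to be symmetric positive semidefinite, and the function name and feature vectors are hypothetical.

```python
from math import sqrt


def mahalanobis(x, y, M):
    """d_M(x, y) = sqrt((x - y)^T M (x - y)) for a symmetric PSD matrix M."""
    d = [a - b for a, b in zip(x, y)]
    Md = [sum(m_ij * d_j for m_ij, d_j in zip(row, d)) for row in M]
    return sqrt(sum(d_i * md_i for d_i, md_i in zip(d, Md)))


identity = [[1.0, 0.0],
            [0.0, 1.0]]   # reduces to ordinary Euclidean distance
stretched = [[4.0, 0.0],
             [0.0, 1.0]]  # hand-picked: first feature counts four times as much

euclid = mahalanobis([0.0, 0.0], [3.0, 4.0], identity)   # 5.0
scaled = mahalanobis([0.0, 0.0], [1.0, 0.0], stretched)  # 2.0
```

    With the identity matrix the measure is plain Euclidean distance; a learned M instead rescales and correlates features so that instances sharing labels end up closer together.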

  2. Negative example selection for protein function prediction: the NoGO database.

    PubMed

    Youngs, Noah; Penfold-Brown, Duncan; Bonneau, Richard; Shasha, Dennis

    2014-06-01

    Negative examples - genes that are known not to carry out a given protein function - are rarely recorded in genome and proteome annotation databases, such as the Gene Ontology database. Negative examples are required, however, for several of the most powerful machine learning methods for integrative protein function prediction. Most protein function prediction efforts have relied on a variety of heuristics for the choice of negative examples. Determining the accuracy of methods for negative example prediction is itself a non-trivial task, given that the Open World Assumption as applied to gene annotations rules out many traditional validation metrics. We present a rigorous comparison of these heuristics, utilizing a temporal holdout, and a novel evaluation strategy for negative examples. We add to this comparison several algorithms adapted from Positive-Unlabeled learning scenarios in text-classification, which are the current state of the art methods for generating negative examples in low-density annotation contexts. Lastly, we present two novel algorithms of our own construction, one based on empirical conditional probability, and the other using topic modeling applied to genes and annotations. We demonstrate that our algorithms achieve significantly fewer incorrect negative example predictions than the current state of the art, using multiple benchmarks covering multiple organisms. Our methods may be applied to generate negative examples for any type of method that deals with protein function, and to this end we provide a database of negative examples in several well-studied organisms, for general use (The NoGO database, available at: bonneaulab.bio.nyu.edu/nogo.html).

  3. Genome wide prediction of protein function via a generic knowledge discovery approach based on evidence integration

    PubMed Central

    Xiong, Jianghui; Rayner, Simon; Luo, Kunyi; Li, Yinghui; Chen, Shanguang

    2006-01-01

    Background The automation of many common molecular biology techniques has resulted in the accumulation of vast quantities of experimental data. One of the major challenges now facing researchers is how to process this data to yield useful information about a biological system (e.g. knowledge of genes and their products, and the biological roles of proteins, their molecular functions, localizations and interaction networks). We present a technique called Global Mapping of Unknown Proteins (GMUP) which uses the Gene Ontology Index to relate diverse sources of experimental data by creation of an abstraction layer of evidence data. This abstraction layer is used as input to a neural network which, once trained, can be used to predict function from the evidence data of unannotated proteins. The method allows us to include almost any experimental data set related to protein function, which incorporates the Gene Ontology, to our evidence data in order to seek relationships between the different sets. Results We have demonstrated the capabilities of this method in two ways. We first collected various experimental datasets associated with yeast (Saccharomyces cerevisiae) and applied the technique to a set of previously annotated open reading frames (ORFs). These ORFs were divided into training and test sets and were used to examine the accuracy of the predictions made by our method. Then we applied GMUP to previously un-annotated ORFs and made 1980, 836 and 1969 predictions corresponding to the GO Biological Process, Molecular Function and Cellular Component sub-categories respectively. We found that GMUP was particularly successful at predicting ORFs with functions associated with the ribonucleoprotein complex, protein metabolism and transportation. Conclusion This study presents a global and generic gene knowledge discovery approach based on evidence integration of various genome-scale data. It can be used to provide insight as to how certain biological processes are

  4. Multi-Instance Metric Transfer Learning for Genome-Wide Protein Function Prediction.

    PubMed

    Xu, Yonghui; Min, Huaqing; Wu, Qingyao; Song, Hengjie; Ye, Bicui

    2017-02-06

    Multi-Instance (MI) learning has been proven to be effective for the genome-wide protein function prediction problems where each training example is associated with multiple instances. Many studies in this literature attempted to find an appropriate Multi-Instance Learning (MIL) method for genome-wide protein function prediction under a usual assumption, the underlying distribution from testing data (target domain, i.e., TD) is the same as that from training data (source domain, i.e., SD). However, this assumption may be violated in real practice. To tackle this problem, in this paper, we propose a Multi-Instance Metric Transfer Learning (MIMTL) approach for genome-wide protein function prediction. In MIMTL, we first transfer the source domain distribution to the target domain distribution by utilizing the bag weights. Then, we construct a distance metric learning method with the reweighted bags. At last, we develop an alternative optimization scheme for MIMTL. Comprehensive experimental evidence on seven real-world organisms verifies the effectiveness and efficiency of the proposed MIMTL approach over several state-of-the-art methods.

  5. Multi-Instance Metric Transfer Learning for Genome-Wide Protein Function Prediction

    PubMed Central

    Xu, Yonghui; Min, Huaqing; Wu, Qingyao; Song, Hengjie; Ye, Bicui

    2017-01-01

    Multi-Instance (MI) learning has been proven to be effective for the genome-wide protein function prediction problems where each training example is associated with multiple instances. Many studies in this literature attempted to find an appropriate Multi-Instance Learning (MIL) method for genome-wide protein function prediction under a usual assumption, the underlying distribution from testing data (target domain, i.e., TD) is the same as that from training data (source domain, i.e., SD). However, this assumption may be violated in real practice. To tackle this problem, in this paper, we propose a Multi-Instance Metric Transfer Learning (MIMTL) approach for genome-wide protein function prediction. In MIMTL, we first transfer the source domain distribution to the target domain distribution by utilizing the bag weights. Then, we construct a distance metric learning method with the reweighted bags. At last, we develop an alternative optimization scheme for MIMTL. Comprehensive experimental evidence on seven real-world organisms verifies the effectiveness and efficiency of the proposed MIMTL approach over several state-of-the-art methods. PMID:28165495

  6. The impact of incomplete knowledge on evaluation: an experimental benchmark for protein function prediction

    PubMed Central

    Huttenhower, Curtis; Hibbs, Matthew A.; Myers, Chad L.; Caudy, Amy A.; Hess, David C.; Troyanskaya, Olga G.

    2009-01-01

    Motivation: Rapidly expanding repositories of highly informative genomic data have generated increasing interest in methods for protein function prediction and inference of biological networks. The successful application of supervised machine learning to these tasks requires a gold standard for protein function: a trusted set of correct examples, which can be used to assess performance through cross-validation or other statistical approaches. Since gene annotation is incomplete for even the best studied model organisms, the biological reliability of such evaluations may be called into question. Results: We address this concern by constructing and analyzing an experimentally based gold standard through comprehensive validation of protein function predictions for mitochondrion biogenesis in Saccharomyces cerevisiae. Specifically, we determine that (i) current machine learning approaches are able to generalize and predict novel biology from an incomplete gold standard and (ii) incomplete functional annotations adversely affect the evaluation of machine learning performance. While computational approaches performed better than predicted in the face of incomplete data, relative comparison of competing approaches—even those employing the same training data—is problematic with a sparse gold standard. Incomplete knowledge causes individual methods' performances to be differentially underestimated, resulting in misleading performance evaluations. We provide a benchmark gold standard for yeast mitochondria to complement current databases and an analysis of our experimental results in the hopes of mitigating these effects in future comparative evaluations. Availability: The mitochondrial benchmark gold standard, as well as experimental results and additional data, is available at http://function.princeton.edu/mitochondria Contact: ogt@cs.princeton.edu Supplementary information: Supplementary data are available at Bioinformatics online. PMID:19561015

  7. CATH: an expanded resource to predict protein function through structure and sequence

    PubMed Central

    Dawson, Natalie L.; Lewis, Tony E.; Das, Sayoni; Lees, Jonathan G.; Lee, David; Ashford, Paul; Orengo, Christine A.; Sillitoe, Ian

    2017-01-01

    The latest version of the CATH-Gene3D protein structure classification database has recently been released (version 4.1, http://www.cathdb.info). The resource comprises over 300 000 domain structures and over 53 million protein domains classified into 2737 homologous superfamilies, doubling the number of predicted protein domains in the previous version. The daily-updated CATH-B, which contains our very latest domain assignment data, provides putative classifications for over 100 000 additional protein domains. This article describes developments to the CATH-Gene3D resource over the last two years since the publication in 2015, including: significant increases to our structural and sequence coverage; expansion of the functional families in CATH; building a support vector machine (SVM) to automatically assign domains to superfamilies; improved search facilities to return alignments of query sequences against multiple sequence alignments; the redesign of the web pages and download site. PMID:27899584

  8. GeMMA: functional subfamily classification within superfamilies of predicted protein structural domains.

    PubMed

    Lee, David A; Rentzsch, Robert; Orengo, Christine

    2010-01-01

    GeMMA (Genome Modelling and Model Annotation) is a new approach to automatic functional subfamily classification within families and superfamilies of protein sequences. A major advantage of GeMMA is its ability to subclassify very large and diverse superfamilies with tens of thousands of members, without the need for an initial multiple sequence alignment. Its performance is shown to be comparable to the established high-performance method SCI-PHY. GeMMA follows an agglomerative clustering protocol that uses existing software for sensitive and accurate multiple sequence alignment and profile-profile comparison. The produced subfamilies are shown to be equivalent in quality whether whole protein sequences are used or just the sequences of component predicted structural domains. A faster, heuristic version of GeMMA that also uses distributed computing is shown to maintain the performance levels of the original implementation. The use of GeMMA to increase the functional annotation coverage of functionally diverse Pfam families is demonstrated. It is further shown how GeMMA clusters can help to predict the impact of experimentally determining a protein domain structure on comparative protein modelling coverage, in the context of structural genomics.

  9. Evolutionary Trace Annotation Server: automated enzyme function prediction in protein structures using 3D templates

    PubMed Central

    Matthew Ward, R.; Venner, Eric; Daines, Bryce; Murray, Stephen; Erdin, Serkan; Kristensen, David M.; Lichtarge, Olivier

    2009-01-01

    Summary: The Evolutionary Trace Annotation (ETA) Server predicts enzymatic activity. ETA starts with a structure of unknown function, such as those from structural genomics, and with no prior knowledge of its mechanism uses the phylogenetic Evolutionary Trace (ET) method to extract key functional residues and propose a function-associated 3D motif, called a 3D template. ETA then searches previously annotated structures for geometric template matches that suggest molecular and thus functional mimicry. In order to maximize the predictive value of these matches, ETA next applies distinctive specificity filters—evolutionary similarity, function plurality and match reciprocity. In large scale controls on enzymes, prediction coverage is 43% but the positive predictive value rises to 92%, thus minimizing false annotations. Users may modify any search parameter, including the template. ETA thus expands the ET suite for protein structure annotation, and can contribute to the annotation efforts of metaservers. Availability: The ETA Server is a web application available at http://mammoth.bcm.tmc.edu/eta/. Contact: lichtarge@bcm.edu PMID:19307237

  10. Adaptive diffusion kernel learning from biological networks for protein function prediction

    PubMed Central

    Sun, Liang; Ji, Shuiwang; Ye, Jieping

    2008-01-01

    Background Machine-learning tools have gained considerable attention during the last few years for analyzing biological networks for protein function prediction. Kernel methods are suitable for learning from graph-based data such as biological networks, as they only require the abstraction of the similarities between objects into the kernel matrix. One key issue in kernel methods is the selection of a good kernel function. Diffusion kernels, the discretization of the familiar Gaussian kernel of Euclidean space, are commonly used for graph-based data. Results In this paper, we address the issue of learning an optimal diffusion kernel, in the form of a convex combination of a set of pre-specified kernels constructed from biological networks, for protein function prediction. Most prior work on this kernel learning task focuses on variants of the loss function based on Support Vector Machines (SVM). Their extensions to other loss functions such as the one based on Kullback-Leibler (KL) divergence, which is more suitable for mining biological networks, lead to expensive optimization problems. By exploiting the special structure of the diffusion kernel, we show that this KL divergence based kernel learning problem can be formulated as a simple optimization problem, which can then be solved efficiently. It is further extended to the multi-task case where we predict multiple functions of a protein simultaneously. We evaluate the efficiency and effectiveness of the proposed algorithms using two benchmark data sets. Conclusion Results show that the performance of the linearly combined diffusion kernel is better than that of every single candidate diffusion kernel. When the number of tasks is large, the algorithms based on multiple tasks are favored due to their competitive recognition performance and small computational costs. PMID:18366736

  11. Incorporating significant amino acid pairs and protein domains to predict RNA splicing-related proteins with functional roles.

    PubMed

    Hsu, Justin Bo-Kai; Huang, Kai-Yao; Weng, Tzu-Ya; Huang, Chien-Hsun; Lee, Tzong-Yi

    2014-01-01

    Machinery of pre-mRNA splicing is carried out through the interaction of RNA sequence elements and a variety of RNA splicing-related proteins (SRPs) (e.g. spliceosome and splicing factors). Alternative splicing, which is an important post-transcriptional regulation in eukaryotes, gives rise to multiple mature mRNA isoforms, which encode proteins with functional diversities. However, the regulation of RNA splicing is not yet fully elucidated, partly because SRPs have not yet been exhaustively identified and the experimental identification is labor-intensive. Therefore, we are motivated to design a new method for identifying SRPs with their functional roles in the regulation of RNA splicing. The experimentally verified SRPs were manually curated from research articles. According to the functional annotation of Splicing Related Gene Database, the collected SRPs were further categorized into four functional groups including small nuclear Ribonucleoprotein, Splicing Factor, Splicing Regulation Factor and Novel Spliceosome Protein. The composition of amino acid pairs indicates that there are remarkable differences among four functional groups of SRPs. Then, support vector machines (SVMs) were utilized to learn the predictive models for identifying SRPs as well as their functional roles. The cross-validation evaluation shows that the SVM models trained with significant amino acid pairs and functional domains could provide a better predictive performance. In addition, the independent testing demonstrates that the proposed method could accurately identify SRPs in mammals/plants as well as effectively distinguish between SRPs and RNA-binding proteins. This investigation provides a practical means of identifying potential SRPs and a perspective for exploring the regulation of RNA splicing.
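
    The "composition of amino acid pairs" used as a feature above can be sketched as a 400-dimensional frequency vector over ordered pairs of the 20 standard residues. This is only an illustration under stated assumptions: it counts adjacent pairs, whereas the paper may use a different pair definition or a selected subset of significant pairs, and the toy sequence and the names `PAIRS` and `pair_composition` are hypothetical.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PAIRS = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 ordered pairs

def pair_composition(sequence):
    """Return a 400-dimensional vector of adjacent-pair frequencies."""
    counts = dict.fromkeys(PAIRS, 0)
    total = 0
    for a, b in zip(sequence, sequence[1:]):
        pair = a + b
        if pair in counts:  # skip pairs that contain non-standard residues
            counts[pair] += 1
            total += 1
    if total == 0:
        return [0.0] * len(PAIRS)
    return [counts[p] / total for p in PAIRS]

features = pair_composition("MKKLLA")  # toy sequence, not from the paper
print(len(features), features[PAIRS.index("KK")])
```

    A vector of this kind, one per protein, is the sort of fixed-length input that a classifier such as an SVM can be trained on.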

  12. Incorporating significant amino acid pairs and protein domains to predict RNA splicing-related proteins with functional roles

    NASA Astrophysics Data System (ADS)

    Hsu, Justin Bo-Kai; Huang, Kai-Yao; Weng, Tzu-Ya; Huang, Chien-Hsun; Lee, Tzong-Yi

    2014-01-01

    Machinery of pre-mRNA splicing is carried out through the interaction of RNA sequence elements and a variety of RNA splicing-related proteins (SRPs) (e.g. spliceosome and splicing factors). Alternative splicing, which is an important post-transcriptional regulation in eukaryotes, gives rise to multiple mature mRNA isoforms, which encode proteins with functional diversities. However, the regulation of RNA splicing is not yet fully elucidated, partly because SRPs have not yet been exhaustively identified and the experimental identification is labor-intensive. Therefore, we are motivated to design a new method for identifying SRPs with their functional roles in the regulation of RNA splicing. The experimentally verified SRPs were manually curated from research articles. According to the functional annotation of Splicing Related Gene Database, the collected SRPs were further categorized into four functional groups including small nuclear Ribonucleoprotein, Splicing Factor, Splicing Regulation Factor and Novel Spliceosome Protein. The composition of amino acid pairs indicates that there are remarkable differences among four functional groups of SRPs. Then, support vector machines (SVMs) were utilized to learn the predictive models for identifying SRPs as well as their functional roles. The cross-validation evaluation shows that the SVM models trained with significant amino acid pairs and functional domains could provide a better predictive performance. In addition, the independent testing demonstrates that the proposed method could accurately identify SRPs in mammals/plants as well as effectively distinguish between SRPs and RNA-binding proteins. This investigation provides a practical means of identifying potential SRPs and a perspective for exploring the regulation of RNA splicing.

  13. An iterative knowledge-based scoring function to predict protein-ligand interactions: II. Validation of the scoring function.

    PubMed

    Huang, Sheng-You; Zou, Xiaoqin

    2006-11-30

    We have developed an iterative knowledge-based scoring function (ITScore) to describe protein-ligand interactions. Here, we assess ITScore through extensive tests on native structure identification, binding affinity prediction, and virtual database screening. Specifically, ITScore was first applied to a test set of 100 protein-ligand complexes constructed by Wang et al. (J Med Chem 2003, 46, 2287), and compared with 14 other scoring functions. The results show that ITScore yielded a high success rate of 82% on identifying native-like binding modes under the criterion of rmsd ≤ 2 Å for each top-ranked ligand conformation. The success rate increased to 98% if the top five conformations were considered for each ligand. In the case of binding affinity prediction, ITScore also obtained a good correlation for this test set (R = 0.65). Next, ITScore was used to predict binding affinities of a second diverse test set of 77 protein-ligand complexes prepared by Muegge and Martin (J Med Chem 1999, 42, 791), and compared with four other widely used knowledge-based scoring functions. ITScore yielded a high correlation of R² = 0.65 (or R = 0.81) in the affinity prediction. Finally, enrichment tests were performed with ITScore against four target proteins using the compound databases constructed by Jacobsson et al. (J Med Chem 2003, 46, 5781). The results were compared with those of eight other scoring functions. ITScore yielded high enrichments in all four database screening tests. ITScore can be easily combined with the existing docking programs for the use of structure-based drug design.

  14. Analysis of proteins with the 'hot dog' fold: Prediction of function and identification of catalytic residues of hypothetical proteins

    PubMed Central

    Pidugu, Lakshmi S; Maity, Koustav; Ramaswamy, Karthikeyan; Surolia, Namita; Suguna, Kaza

    2009-01-01

    Background The hot dog fold has been found in more than sixty proteins since the first report of its existence about a decade ago. The fold appears to have a strong association with fatty acid biosynthesis, its regulation and metabolism, as the proteins with this fold are predominantly coenzyme A-binding enzymes with a variety of substrates located at their active sites. Results We have analyzed the structural features and sequences of proteins having the hot dog fold. This study reveals that though the basic architecture of the fold is well conserved in these proteins, significant differences exist in their sequence, nature of substrate and oligomerization. Segments with certain conserved sequence motifs seem to play crucial structural and functional roles in various classes of these proteins. Conclusion The analysis led to predictions regarding the functional classification and identification of possible catalytic residues of a number of hot dog fold-containing hypothetical proteins whose structures were determined in high throughput structural genomics projects. PMID:19473548

  15. Coevolutionary modeling of protein sequences: Predicting structure, function, and mutational landscapes

    NASA Astrophysics Data System (ADS)

    Weigt, Martin

    Over the last years, biological research has been revolutionized by experimental high-throughput techniques, in particular by next-generation sequencing technology. Unprecedented amounts of data are accumulating, and there is a growing request for computational methods unveiling the information hidden in raw data, thereby increasing our understanding of complex biological systems. Statistical-physics models based on the maximum-entropy principle have, in the last few years, played an important role in this context. To give a specific example, proteins and many non-coding RNA show a remarkable degree of structural and functional conservation in the course of evolution, despite a large variability in amino acid sequences. We have developed a statistical-mechanics inspired inference approach - called Direct-Coupling Analysis - to link this sequence variability (easy to observe in sequence alignments, which are available in public sequence databases) to bio-molecular structure and function. In my presentation I will show, how this methodology can be used (i) to infer contacts between residues and thus to guide tertiary and quaternary protein structure prediction and RNA structure prediction, (ii) to discriminate interacting from non-interacting protein families, and thus to infer conserved protein-protein interaction networks, and (iii) to reconstruct mutational landscapes and thus to predict the phenotypic effect of mutations. References [1] M. Figliuzzi, H. Jacquier, A. Schug, O. Tenaillon and M. Weigt ''Coevolutionary landscape inference and the context-dependence of mutations in beta-lactamase TEM-1'', Mol. Biol. Evol. (2015), doi: 10.1093/molbev/msv211 [2] E. De Leonardis, B. Lutz, S. Ratz, S. Cocco, R. Monasson, A. Schug, M. Weigt ''Direct-Coupling Analysis of nucleotide coevolution facilitates RNA secondary and tertiary structure prediction'', Nucleic Acids Research (2015), doi: 10.1093/nar/gkv932 [3] F. Morcos, A. Pagnani, B. Lunt, A. Bertolino, D. Marks, C

  16. Prediction of mitochondrial protein function by comparative physiology and phylogenetic profiling.

    PubMed

    Cheng, Yiming; Perocchi, Fabiana

    2015-01-01

    According to the endosymbiotic theory, mitochondria originate from a free-living alpha-proteobacterium that established an intracellular symbiosis with the ancestor of present-day eukaryotic cells. During the bacterium-to-organelle transformation, the proto-mitochondrial proteome has undergone a massive turnover, whereby less than 20% of modern mitochondrial proteomes can be traced back to the bacterial ancestor. Moreover, mitochondrial proteomes from several eukaryotic organisms, for example, yeast and human, show a rather modest overlap, reflecting differences in mitochondrial physiology. Those differences may result from the combination of differential gain and loss of genes and retargeting processes among lineages. Therefore, an evolutionary signature, also called "phylogenetic profile", could be generated for every mitochondrial protein. Here, we present two evolutionary biology approaches to study mitochondrial physiology: the first strategy, which we refer to as "comparative physiology," allows the de novo identification of mitochondrial proteins involved in a physiological function; the second, known as "phylogenetic profiling," allows the prediction of protein functions and functional interactions by comparing phylogenetic profiles of uncharacterized and known components.
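
    Comparing phylogenetic profiles, as described above, can be sketched in a few lines. Each profile is a binary presence/absence vector over a set of genomes, and an uncharacterized protein is matched to the known protein with the most similar profile. The protein names, the profile values, and the choice of Jaccard similarity are assumptions for illustration; the abstract does not say which similarity measure the authors use.

```python
def jaccard(p, q):
    """Jaccard similarity between two binary presence/absence profiles."""
    both = sum(1 for a, b in zip(p, q) if a and b)
    either = sum(1 for a, b in zip(p, q) if a or b)
    return both / either if either else 0.0

# Hypothetical profiles: 1 = ortholog present in that genome, 0 = absent.
known = {
    "complex_I_subunit": [1, 1, 0, 1, 0, 1],
    "ribosomal_protein": [1, 1, 1, 1, 1, 1],
}
query = [1, 1, 0, 1, 0, 0]  # profile of an uncharacterized protein

# Rank known proteins by profile similarity to the query, best match first.
ranked = sorted(known, key=lambda name: jaccard(query, known[name]), reverse=True)
print(ranked[0])
```

    Here the query shares its pattern of losses with the hypothetical complex I subunit, so that protein ranks first and would be the candidate functional partner.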

  17. Multivariate analysis of properties of amino acid residues in proteins from a viewpoint of functional site prediction

    NASA Astrophysics Data System (ADS)

    Du, Shiqiao; Sakurai, Minoru

    2010-03-01

    For the prediction of a protein's function from its 3D-structure alone, it is of importance to elucidate by which properties functional site residues in a protein are discriminated from other residues. Here, we calculated five kinds of geometrical or physical properties of each residue in a protein. Those properties were integrated with techniques of multivariate analysis such as principal component analysis (PCA) or kernel PCA. Consequently, functional residues were found to show some distinct distributions in the scatter plot of those integrated data, which led to the proposal of a method for functional site prediction with a good performance.

  18. An efficient perturbation method to predict the functionally key sites of glutamine binding protein.

    PubMed

    Lv, Dashuai; Wang, Cunxin; Li, Chunhua; Tan, Jianjun; Zhang, Xiaoyi

    2017-04-01

    Glutamine-Binding Protein (GlnBP) of Escherichia coli, an important member of the periplasmic binding protein family, is responsible for the first step in the active transport of glutamine across the cytoplasmic membrane. In this work, the functionally key regulation sites of GlnBP were identified by utilizing a perturbation method proposed by our group, in which the residues whose perturbations markedly change the binding free energy between GlnBP and glutamine are considered to be functionally key residues. The results show that besides the substrate binding sites, some other residues distant from the binding pocket, including the ones in the hinge regions between the two domains, the front- and back-door channels and the exposed region, are important for the function of glutamine binding and transport. The predicted results are consistent with the theoretical and experimental data, which indicates that our method is an effective approach to identify the key residues important for both ligand binding and long-range allosteric signal transmission. This work can provide some insights into the function of GlnBP and the physical mechanism of its allosteric regulation.

  19. Enhancing protein function prediction with taxonomic constraints--The Argot2.5 web server.

    PubMed

    Lavezzo, Enrico; Falda, Marco; Fontana, Paolo; Bianco, Luca; Toppo, Stefano

    2016-01-15

    Argot2.5 (Annotation Retrieval of Gene Ontology Terms) is a web server designed to predict protein function. It is an updated version of the previous Argot2 enriched with new features in order to enhance its usability and its overall performance. The algorithmic strategy exploits the grouping of Gene Ontology terms by means of semantic similarity to infer protein function. The tool has been challenged over two independent benchmarks and compared to Argot2, PANNZER, and a baseline method relying on BLAST, proving to obtain a better performance thanks to the contribution of some key interventions in critical steps of the working pipeline. The most effective changes regard: (a) the selection of the input data from sequence similarity searches performed against a clustered version of UniProt databank and a remodeling of the weights given to Pfam hits, (b) the application of taxonomic constraints to filter out annotations that cannot be applied to proteins belonging to the species under investigation. The taxonomic rules are derived from our in-house developed tool, FunTaxIS, that extends those provided by the Gene Ontology consortium. The web server is free for academic users and is available online at http://www.medcomp.medicina.unipd.it/Argot2-5/.

  20. Sampling multiple scoring functions can improve protein loop structure prediction accuracy.

    PubMed

    Li, Yaohang; Rata, Ionel; Jakobsson, Eric

    2011-07-25

    Accurately predicting loop structures is important for understanding functions of many proteins. In order to obtain loop models with high accuracy, efficiently sampling the loop conformation space to discover reasonable structures is a critical step. In loop conformation sampling, coarse-grain energy (scoring) functions coupled with reduced protein representations are often used to reduce the number of degrees of freedom as well as sampling computational time. However, due to implicitly considering many factors by reduced representations, the coarse-grain scoring functions may have potential insensitivity and inaccuracy, which can mislead the sampling process and consequently ignore important loop conformations. In this paper, we present a new computational sampling approach to obtain reasonable loop backbone models, the so-called Pareto optimal sampling (POS) method. The rationale of the POS method is to sample the function space of multiple, carefully selected scoring functions to discover an ensemble of diversified structures yielding Pareto optimality to all sampled conformations. The POS method can efficiently tolerate insensitivity and inaccuracy in individual scoring functions and thereby lead to significant accuracy improvement in loop structure prediction. We apply the POS method to a set of 4-12-residue loop targets using a function space composed of backbone-only Rosetta and distance-scale finite ideal-gas reference (DFIRE) and a triplet backbone dihedral potential developed in our lab. Our computational results show that in 501 out of 502 targets, the model sets generated by POS contain structure models that are within subangstrom resolution. Moreover, the top-ranked models have a root mean square deviation (rmsd) less than 1 Å in 96.8, 84.1, and 72.2% of the short (4-6 residues), medium (7-9 residues), and long (10-12 residues) targets, respectively, when the all-atom models are generated by local optimization from the backbone models and are ranked by our
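
    The notion of Pareto optimality that the POS abstract relies on can be shown with a small sketch: a conformation is kept if no other conformation scores at least as well under every scoring function and strictly better under at least one. The score values are invented, lower scores are assumed to be better, and the sketch covers only the selection of a non-dominated set, not the sampling procedure of the POS method itself.

```python
def dominates(a, b):
    """True if a is no worse than b on every score and strictly better on one (lower is better)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(scores):
    """Return the indices of score vectors that no other vector dominates."""
    return [i for i, s in enumerate(scores)
            if not any(dominates(t, s) for j, t in enumerate(scores) if j != i)]

# Hypothetical scores for five loop conformations under three scoring functions.
scores = [
    (1.0, 5.0, 3.0),
    (2.0, 2.0, 2.0),
    (3.0, 3.0, 3.0),
    (0.5, 6.0, 4.0),
    (2.0, 2.0, 2.5),
]
print(pareto_front(scores))
```

    Conformations that are best under only one scoring function still survive in the non-dominated set, which is how an ensemble of this kind can tolerate inaccuracy in any single function.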

  1. Statistical prediction of protein structural, localization and functional properties by the analysis of its fragment mass distributions after proteolytic cleavage

    NASA Astrophysics Data System (ADS)

    Bogachev, Mikhail I.; Kayumov, Airat R.; Markelov, Oleg A.; Bunde, Armin

    2016-02-01

    Structural, localization and functional properties of unknown proteins are often predicted from their primary polypeptide chains using sequence alignment with already characterized proteins and subsequent molecular modeling. Here we suggest an approach to predict various structural and structure-associated properties of proteins directly from the mass distributions of their proteolytic cleavage fragments. For amino-acid-specific cleavages, the distributions of fragment masses are determined by the distributions of inter-amino-acid intervals in the protein, which in turn apparently reflect its structural and structure-related features. Large-scale computer simulations revealed that for transmembrane proteins, either α-helical or β-barrel secondary structure could be predicted with about 90% accuracy after thermolysin cleavage. Moreover, 3/4 of intrinsically disordered proteins could be correctly distinguished from proteins with fixed three-dimensional structure belonging to all four SCOP structural classes by combining 3–4 different cleavages. Additionally, in some cases the protein cellular localization (cytosolic or membrane-associated) and its host organism (Firmicute or Proteobacteria) could be predicted with around 80% accuracy. In contrast to cytosolic proteins, for membrane-associated proteins exhibiting specific structural conformations, their monotopic or transmembrane localization and functional group (ATP-binding, transporters, sensors and so on) could also be predicted with high accuracy and particular robustness against missing cleavages.
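
    The link between amino-acid-specific cleavage and inter-amino-acid intervals can be sketched as follows. The cleavage rule (cut after every lysine), the toy sequence, and the use of fragment length as a stand-in for fragment mass are all assumptions made for illustration; they are not the thermolysin specificity or the simulation protocol of the record above.

```python
from collections import Counter
from typing import List


def cleave(sequence: str, residues: str) -> List[str]:
    """Split a sequence after every occurrence of any residue in `residues`."""
    fragments, start = [], 0
    for i, aa in enumerate(sequence):
        if aa in residues:
            fragments.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        # Keep the trailing fragment that follows the last cleavage site.
        fragments.append(sequence[start:])
    return fragments


def length_distribution(fragments: List[str]) -> Counter:
    """Histogram of fragment lengths, used here as a crude proxy for fragment mass."""
    return Counter(len(f) for f in fragments)


toy_sequence = "MKTAYIAKQRQISFVKSHFSRQ"  # hypothetical sequence
frags = cleave(toy_sequence, "K")        # hypothetical rule: cut after K
dist = length_distribution(frags)
print(frags, dist)
```

    Replacing lengths with summed residue masses would give a mass distribution; that step is omitted here because it requires a residue mass table.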

  2. Statistical prediction of protein structural, localization and functional properties by the analysis of its fragment mass distributions after proteolytic cleavage

    PubMed Central

    Bogachev, Mikhail I.; Kayumov, Airat R.; Markelov, Oleg A.; Bunde, Armin

    2016-01-01

    Structural, localization and functional properties of unknown proteins are often predicted from their primary polypeptide chains using sequence alignment with already characterized proteins and subsequent molecular modeling. Here we suggest an approach to predict various structural and structure-associated properties of proteins directly from the mass distributions of their proteolytic cleavage fragments. For amino-acid-specific cleavages, the distributions of fragment masses are determined by the distributions of inter-amino-acid intervals in the protein, which in turn apparently reflect its structural and structure-related features. Large-scale computer simulations revealed that for transmembrane proteins, either α-helical or β-barrel secondary structure could be predicted with about 90% accuracy after thermolysin cleavage. Moreover, 3/4 of intrinsically disordered proteins could be correctly distinguished from proteins with fixed three-dimensional structure belonging to all four SCOP structural classes by combining 3–4 different cleavages. Additionally, in some cases the protein cellular localization (cytosolic or membrane-associated) and its host organism (Firmicute or Proteobacteria) could be predicted with around 80% accuracy. In contrast to cytosolic proteins, for membrane-associated proteins exhibiting specific structural conformations, their monotopic or transmembrane localization and functional group (ATP-binding, transporters, sensors and so on) could also be predicted with high accuracy and particular robustness against missing cleavages. PMID:26924271

  3. Predicting conformational switches in proteins.

    PubMed Central

    Young, M.; Kirshenbaum, K.; Dill, K. A.; Highsmith, S.

    1999-01-01

    We describe a new computational technique to predict conformationally switching elements in proteins from their amino acid sequences. The method, called ASP (Ambivalent Structure Predictor), analyzes results from a secondary structure prediction algorithm to identify regions of conformational ambivalence. ASP identifies ambivalent regions in 16 test protein sequences for which function involves substantial backbone rearrangements. In the test set, all sites previously described as conformational switches are correctly predicted to be structurally ambivalent regions. No such regions are predicted in three negative control protein sequences. ASP may be useful as a guide for experimental studies on protein function and motion in the absence of detailed three-dimensional structural data. PMID:10493576

  4. The MASH pipeline for protein function prediction and an algorithm for the geometric refinement of 3D motifs.

    PubMed

    Chen, Brian Y; Fofanov, Viacheslav Y; Bryant, Drew H; Dodson, Bradley D; Kristensen, David M; Lisewski, Andreas M; Kimmel, Marek; Lichtarge, Olivier; Kavraki, Lydia E

    2007-01-01

    The development of new and effective drugs is strongly affected by the need to identify drug targets and to reduce side effects. Resolving these issues depends partially on a thorough understanding of the biological function of proteins. Unfortunately, the experimental determination of protein function is expensive and time consuming. To support and accelerate the determination of protein functions, algorithms for function prediction are designed to gather evidence indicating functional similarity with well-studied proteins. One such approach is the MASH pipeline, described in the first half of this paper. MASH identifies matches of geometric and chemical similarity between motifs, representing known functional sites, and substructures of functionally uncharacterized proteins (targets). Observations from several research groups concur that statistically significant matches can indicate functionally related active sites. One major subproblem is the design of effective motifs, which have many matches to functionally related targets (sensitive motifs), and few matches to functionally unrelated targets (specific motifs). Current techniques select and combine structural, physical, and evolutionary properties to generate motifs that mirror functional characteristics in active sites. This approach ignores incidental similarities that may occur with functionally unrelated proteins. To address this problem, we have developed Geometric Sieving (GS), a parallel distributed algorithm that efficiently refines motifs, designed by existing methods, into optimized motifs with maximal geometric and chemical dissimilarity from all known protein structures. In an exhaustive comparison of all possible motifs based on the active sites of 10 well-studied proteins, we observed that optimized motifs were among the most sensitive and specific.

  5. VR-BFDT: A variance reduction based binary fuzzy decision tree induction method for protein function prediction.

    PubMed

    Golzari, Fahimeh; Jalili, Saeed

    2015-07-21

    In the protein function prediction (PFP) problem, the goal is to predict the function of numerous well-sequenced, known proteins whose function is not yet precisely known. PFP is a special and complex problem in the machine learning domain, in which a protein (regarded as an instance) may have more than one function simultaneously. Furthermore, the functions (regarded as classes) are dependent and are organized in a hierarchical structure in the form of a tree or directed acyclic graph. One of the common learning methods proposed for solving this problem is the decision tree. Because a decision tree partitions the data into sets with sharp boundaries, small changes in the attribute values of a new instance may cause an incorrect change in the predicted label of the instance and, finally, misclassification. In this paper, a Variance Reduction based Binary Fuzzy Decision Tree (VR-BFDT) algorithm is proposed to predict the functions of proteins. This algorithm fuzzifies only the decision boundaries instead of converting the numeric attributes into fuzzy linguistic terms. It can assign multiple functions to each protein simultaneously and preserves the hierarchy consistency between functional classes. It uses label variance reduction as the splitting criterion to select the best "attribute-value" at each node of the decision tree. The experimental results show that the overall performance of the proposed algorithm is promising.
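
    A label-variance-reduction splitting criterion can be sketched in a few lines. This is a crisp (non-fuzzy) illustration on made-up data: the variance is taken as the mean squared Euclidean distance of the binary label vectors to their mean, which is one common choice and an assumption here. The fuzzification of decision boundaries and the hierarchy constraints described in the record are not reproduced.

```python
from typing import List, Optional, Sequence


def label_variance(labels: List[Sequence[float]]) -> float:
    """Mean squared Euclidean distance of label vectors to their mean vector."""
    if not labels:
        return 0.0
    n, dims = len(labels), len(labels[0])
    mean = [sum(v[d] for v in labels) / n for d in range(dims)]
    return sum(sum((v[d] - mean[d]) ** 2 for d in range(dims)) for v in labels) / n


def variance_reduction(values: Sequence[float], labels: List[Sequence[float]],
                       threshold: float) -> float:
    """Drop in label variance when splitting one attribute at `threshold`."""
    left = [l for x, l in zip(values, labels) if x <= threshold]
    right = [l for x, l in zip(values, labels) if x > threshold]
    n = len(labels)
    return (label_variance(labels)
            - len(left) / n * label_variance(left)
            - len(right) / n * label_variance(right))


def best_threshold(values: Sequence[float],
                   labels: List[Sequence[float]]) -> Optional[float]:
    """Pick the attribute value with the largest variance reduction (None if no split exists)."""
    candidates = sorted(set(values))[:-1]  # the largest value cannot split the data
    return max(candidates, key=lambda t: variance_reduction(values, labels, t), default=None)


# Hypothetical attribute values and two-function label vectors for four proteins.
attr = [0.1, 0.2, 0.8, 0.9]
funcs = [(1, 0), (1, 0), (0, 1), (0, 1)]
split = best_threshold(attr, funcs)
gain = variance_reduction(attr, funcs, split)
print(split, gain)
```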

  6. Sparse Markov chain-based semi-supervised multi-instance multi-label method for protein function prediction.

    PubMed

    Han, Chao; Chen, Jian; Wu, Qingyao; Mu, Shuai; Min, Huaqing

    2015-10-01

    Automated assignment of protein function has received considerable attention in recent years for genome-wide study. With the rapid accumulation of genome sequencing data produced by high-throughput experimental techniques, the process of manually predicting functional properties of proteins has become increasingly cumbersome. Such large genomics data sets can only be annotated computationally. However, automated assignment of functions to unknown proteins is challenging due to its inherent difficulty and complexity. Previous studies have revealed that solving problems involving complicated objects with multiple semantic meanings using the multi-instance multi-label (MIML) framework is effective. For protein function prediction problems, each protein object in nature may be associated with distinct structural units (instances) and multiple functional properties (class labels), where each unit is described by an instance and each functional property is considered as a class label. Thus, it is convenient and natural to tackle the protein function prediction problem by using the MIML framework. In this paper, we propose a sparse Markov chain-based semi-supervised MIML method, called Sparse-Markov. A sparse transductive probability graph is constructed to encode the affinity information of the data based on an ensemble of Hausdorff distance metrics. Our goal is to exploit the affinity between protein objects in the sparse transductive probability graph to seek a sparse steady state probability of the Markov chain model for protein function prediction, such that two proteins are given similar functional labels if they are close to each other in terms of an ensemble Hausdorff distance in the graph. Experimental results on seven real-world organism data sets covering three biological domains show that our proposed Sparse-Markov method is able to achieve better performance than four state-of-the-art MIML learning algorithms.
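
    In a multi-instance setting each protein is a bag of instance vectors, and a Hausdorff distance compares two bags. The sketch below computes only the classical (maximal) Hausdorff distance with Euclidean instance distances on made-up bags; the specific ensemble of Hausdorff variants used by Sparse-Markov is not stated in the record and is not reproduced.

```python
import math
from typing import List, Sequence

Bag = List[Sequence[float]]


def directed_hausdorff(a: Bag, b: Bag) -> float:
    """Largest distance from any instance in `a` to its nearest instance in `b`."""
    return max(min(math.dist(x, y) for y in b) for x in a)


def hausdorff(a: Bag, b: Bag) -> float:
    """Classical (maximal) Hausdorff distance between two bags of instances."""
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))


# Hypothetical two-dimensional instance features for two proteins.
bag_a = [(0.0, 0.0), (1.0, 0.0)]
bag_b = [(0.0, 0.0), (4.0, 0.0)]
d_ab = hausdorff(bag_a, bag_b)
print(d_ab)
```

    The two directed distances differ (1.0 from `bag_a` to `bag_b`, 3.0 in the other direction), which is why the symmetric version takes the maximum of both.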

  7. Locating overlapping dense subgraphs in gene (protein) association networks and predicting novel protein functional groups among these subgraphs

    NASA Astrophysics Data System (ADS)

    Palla, Gergely; Derenyi, Imre; Farkas, Illes J.; Vicsek, Tamas

    2006-03-01

    Most tasks in a cell are performed not by individual proteins, but by functional groups of proteins (either physically interacting with each other or associated in other ways). In gene (protein) association networks these groups show up as sets of densely connected nodes. In the yeast Saccharomyces cerevisiae, known physically interacting groups of proteins (called protein complexes) strongly overlap: the total number of proteins contained in these complexes is far smaller than the sum of their sizes (2750 vs. 8932). Thus, most functional groups of proteins, both physically interacting and other, are likely to share many of their members with other groups. However, current algorithms searching for dense groups of nodes in networks usually exclude overlaps. With the aim of discovering both novel functions of individual proteins and novel protein functional groups, we combine in protein association networks (i) a search for overlapping dense subgraphs based on the Clique Percolation Method (CPM) (Palla, G., et al. Nature 435, 814-818 (2005), http://angel.elte.hu/clustering), which explicitly allows for overlaps among the groups, and (ii) a verification and characterization of the identified groups of nodes (proteins) with the help of standard annotation databases listing known functions.
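
    The idea behind clique percolation can be shown with a small brute-force sketch: enumerate all k-cliques, join two k-cliques when they share k-1 nodes, and report the union of each connected chain as a group. The toy network and node names are made up, and the exhaustive clique enumeration is an assumption suitable only for tiny graphs; it is not the implementation referenced in the record.

```python
from itertools import combinations
from typing import Dict, Hashable, Iterable, List, Set, Tuple


def clique_percolation(edges: Iterable[Tuple[Hashable, Hashable]], k: int) -> List[List]:
    """Return overlapping groups as unions of k-cliques chained by shared (k-1)-node overlaps."""
    adj: Dict[Hashable, Set] = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    # Brute force: every k-subset of nodes whose members are all pairwise connected.
    cliques = [frozenset(c) for c in combinations(sorted(adj), k)
               if all(v in adj[u] for u, v in combinations(c, 2))]

    # Union-find over cliques; two cliques are adjacent if they share k-1 nodes.
    parent = list(range(len(cliques)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(cliques)), 2):
        if len(cliques[i] & cliques[j]) == k - 1:
            parent[find(i)] = find(j)

    groups: Dict[int, Set] = {}
    for i, c in enumerate(cliques):
        groups.setdefault(find(i), set()).update(c)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical association network: triangles ABC and BCD share an edge, DEF shares only node D.
toy_edges = [("A", "B"), ("A", "C"), ("B", "C"), ("B", "D"), ("C", "D"),
             ("D", "E"), ("D", "F"), ("E", "F")]
communities = clique_percolation(toy_edges, k=3)
print(communities)
```

    Node D appears in both groups, which is the kind of overlap that partition-based clustering would exclude.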

  8. In silico prediction of structure and functions for some proteins of male-specific region of the human Y chromosome.

    PubMed

    Saha, Chinmoy; Polash, Ahsan Habib; Islam, Md Tariqul; Shafrin, Farhana

    2013-12-01

    The male-specific region of the human Y chromosome (MSY) comprises 95% of its length that is functionally active. This portion is inherited as a block from father to male offspring. Most of the genes in the MSY region are involved in male-specific functions, such as sex determination and spermatogenesis; the region also contains genes probably involved in other cellular functions. However, a detailed characterization of numerous MSY-encoded proteins still remains to be done. In this study, 12 uncharacterized proteins of MSY were analyzed with bioinformatics tools for structural and functional characterization. Within these 12 proteins, a total of 55 domains were found, with the DnaJ domain signature being the most frequent (11%), followed by both the FAD-dependent pyridine nucleotide reductase signature and the fumarate lyase superfamily signature (9%). The 3D structures of the selected proteins were built using homology modeling and protein threading approaches. The stereochemistry of these predicted structures was checked in detail, indicating models of reasonably good quality. Furthermore, the predicted functions and the proteins with which they interact established their biological role and their mechanism of action at the molecular level. The results of these structure-functional annotations provide a comprehensive view of the proteins encoded by MSY, which sheds light on their biological functions and molecular mechanisms. The data presented in this study may assist in future prognosis of several human diseases such as Turner syndrome, gonadal sex reversal, spermatogenic failure, and gonadoblastoma.

  9. Structure prediction and functional characterization of secondary metabolite proteins of Ocimum

    PubMed Central

    Roy, Sudeep; Maheshwari, Nidhi; Chauhan, Rashi; Sen, Naresh Kumar; Sharma, Ashok

    2011-01-01

    Various species of Ocimum have attracted special attention due to their medicinal properties. Different parts of the plant (root, stem, flower, leaves) have been used for centuries in the treatment of a wide range of disorders. Experimental structures (X-ray and NMR) of proteins from different Ocimum species are not yet available in the Protein Databank (PDB). These proteins play a key role in various metabolic pathways in Ocimum. 3D structures of the proteins are essential to determine most of their functions. A homology modeling approach was employed to derive structures for these proteins. Modeller 9v7, a program for comparative modeling, was used for this purpose. The modeled proteins were further validated by the Procheck, Verify-3d and Errat servers. Amino acid composition and polarity of these proteins were determined with the CLC-Protein Workbench tool. Expasy's Prot-param server and the Cys_rec tool were used for physico-chemical and functional characterization of these proteins. Studies of the secondary structure of these proteins were carried out with the computational program Profunc. Swiss-pdb viewer was used to visualize and analyze these homology-derived structures. The structures were finally submitted to the Protein Model Database (PMDB) so that they are accessible to other users for further studies. PMID:21769194

  10. Computational approaches for protein function prediction: a combined strategy from multiple sequence alignment to molecular docking-based virtual screening.

    PubMed

    Pierri, Ciro Leonardo; Parisi, Giovanni; Porcelli, Vito

    2010-09-01

    The functional characterization of proteins represents a daily challenge for biochemical, medical and computational sciences. Although finally proved on the bench, the function of a protein can be successfully predicted by computational approaches that drive the further experimental assays. Current methods for comparative modeling allow the construction of accurate 3D models for proteins of unknown structure, provided that a crystal structure of a homologous protein is available. Binding regions can be proposed by using binding site predictors, data inferred from homologous crystal structures, and data provided from a careful interpretation of the multiple sequence alignment of the investigated protein and its homologs. Once the location of a binding site has been proposed, chemical ligands that have a high likelihood of binding can be identified by using ligand docking and structure-based virtual screening of chemical libraries. Most docking algorithms allow building a list sorted by energy of the lowest energy docking configuration for each ligand of the library. In this review the state-of-the-art of computational approaches in 3D protein comparative modeling and in the study of protein-ligand interactions is provided. Furthermore a possible combined/concerted multistep strategy for protein function prediction, based on multiple sequence alignment, comparative modeling, binding region prediction, and structure-based virtual screening of chemical libraries, is described by using suitable examples. As practical examples, Abl-kinase molecular modeling studies, HPV-E6 protein multiple sequence alignment analysis, and some other model docking-based characterization reports are briefly described to highlight the importance of computational approaches in protein function prediction.

  11. A comprehensive software suite for protein family construction and functional site prediction

    PubMed Central

    Haft, David Renfrew; Haft, Daniel H.

    2017-01-01

    In functionally diverse protein families, conservation in short signature regions may outperform full-length sequence comparisons for identifying proteins that belong to a subgroup within which one specific aspect of their function is conserved. The SIMBAL workflow (Sites Inferred by Metabolic Background Assertion Labeling) is a data-mining procedure for finding such signature regions. It begins by using clues from genomic context, such as co-occurrence or conserved gene neighborhoods, to build a useful training set from a large number of uncharacterized but mutually homologous proteins. When training set construction is successful, the YES partition is enriched in proteins that share function with the user’s query sequence, while the NO partition is depleted. A selected query sequence is then mined for short signature regions whose closest matches overwhelmingly favor proteins from the YES partition. High-scoring signature regions typically contain key residues critical to functional specificity, so proteins with the highest sequence similarity across these regions tend to share the same function. The SIMBAL algorithm was described previously, but significant manual effort, expertise, and a supporting software infrastructure were required to prepare the requisite training sets. Here, we describe a new, distributable software suite that speeds up and simplifies the process for using SIMBAL, most notably by providing tools that automate training set construction. These tools have broad utility for comparative genomics, allowing for flexible collection of proteins or protein domains based on genomic context as well as homology, a capability that can greatly assist in protein family construction. Armed with this new software suite, SIMBAL can serve as a fast and powerful in silico alternative to direct experimentation for characterizing proteins and their functional interactions. PMID:28182651

  12. Predicting functional divergence in protein evolution by site-specific rate shifts

    NASA Technical Reports Server (NTRS)

    Gaucher, Eric A.; Gu, Xun; Miyamoto, Michael M.; Benner, Steven A.

    2002-01-01

    Most modern tools that analyze protein evolution allow individual sites to mutate at constant rates over the history of the protein family. However, Walter Fitch observed in the 1970s that, if a protein changes its function, the mutability of individual sites might also change. This observation is captured in the "non-homogeneous gamma model", which extracts functional information from gene families by examining the different rates at which individual sites evolve. This model has recently been coupled with structural and molecular biology to identify sites that are likely to be involved in changing function within the gene family. Applying this to multiple gene families highlights the widespread divergence of functional behavior among proteins to generate paralogs and orthologs.

  13. A Random Forest Model for Predicting Allosteric and Functional Sites on Proteins.

    PubMed

    Chen, Ava S-Y; Westwood, Nicholas J; Brear, Paul; Rogers, Graeme W; Mavridis, Lazaros; Mitchell, John B O

    2016-04-01

    We created a computational method to identify allosteric sites using a machine learning method trained and tested on protein structures containing bound ligand molecules. The Random Forest machine learning approach was adopted to build our three-way predictive model. Based on descriptors collated for each ligand and binding site, the classification model allows us to assign protein cavities as allosteric, regular or orthosteric, and hence to identify allosteric sites. 43 structural descriptors per complex were derived and were used to characterize individual protein-ligand binding sites belonging to the three classes, allosteric, regular and orthosteric. We carried out a separate validation on a further unseen set of protein structures containing the ligand 2-(N-cyclohexylamino) ethane sulfonic acid (CHES).

  14. Accurate Prediction of Protein Functional Class From Sequence in the Mycobacterium Tuberculosis and Escherichia Coli Genomes Using Data Mining

    PubMed Central

    Karwath, Andreas; Clare, Amanda; Dehaspe, Luc

    2000-01-01

    The analysis of genomics data needs to become as automated as its generation. Here we present a novel data-mining approach to predicting protein functional class from sequence. This method is based on a combination of inductive logic programming clustering and rule learning. We demonstrate the effectiveness of this approach on the M. tuberculosis and E. coli genomes, and identify biologically interpretable rules which predict protein functional class from information only available from the sequence. These rules predict 65% of the ORFs with no assigned function in M. tuberculosis and 24% of those in E. coli, with an estimated accuracy of 60–80% (depending on the level of functional assignment). The rules are founded on a combination of detection of remote homology, convergent evolution and horizontal gene transfer. We identify rules that predict protein functional class even in the absence of detectable sequence or structural homology. These rules give insight into the evolutionary history of M. tuberculosis and E. coli. PMID:11119305

  15. A comparison of different functions for predicted protein model quality assessment

    NASA Astrophysics Data System (ADS)

    Li, Juan; Fang, Huisheng

    2016-07-01

    In protein structure prediction, a considerable number of models are usually produced by either the Template-Based Method (TBM) or ab initio prediction. The purpose of this study is to find the critical parameter in assessing the quality of the predicted models. A non-redundant template library was developed and 138 target sequences were modeled. The target sequences were all distant from the proteins in the template library and were aligned with template library proteins on the basis of the transformation matrix. The quality of each model was first assessed with QMEAN and its six parameters, which are C_β interaction energy (C_beta), all-atom pairwise energy (PE), solvation energy (SE), torsion angle energy (TAE), secondary structure agreement (SSA), and solvent accessibility agreement (SAE). Finally, the alignment score (score) was also used to assess the quality of the model. Hence, a total of eight parameters (i.e., QMEAN, C_beta, PE, SE, TAE, SSA, SAE, score) were independently used to assess the quality of each model. The results indicate that SSA is the best parameter to estimate the quality of the model.

  16. Systematic analysis of non-structural protein features for the prediction of PTM function potential by artificial neural networks.

    PubMed

    Dewhurst, Henry M; Torres, Matthew P

    2017-01-01

    Post-translational modifications (PTMs) provide an extensible framework for regulation of protein behavior beyond the diversity represented within the genome alone. While the rate of identification of PTMs has rapidly increased in recent years, our knowledge of PTM functionality encompasses less than 5% of this data. We previously developed SAPH-ire (Structural Analysis of PTM Hotspots) for the prioritization of eukaryotic PTMs based on function potential of discrete modified alignment positions (MAPs) in a set of 8 protein families. A proteome-wide expansion of the dataset to all families of PTM-bearing, eukaryotic proteins with a representational crystal structure and the application of artificial neural network (ANN) models demonstrated the broader applicability of this approach. Although structural features of proteins have been repeatedly demonstrated to be predictive of PTM functionality, the availability of adequately resolved 3D structures in the Protein Data Bank (PDB) limits the scope of these methods. In order to bridge this gap and capture the larger set of PTM-bearing proteins without an available, homologous structure, we explored all available MAP features as ANN inputs to identify predictive models that do not rely on 3D protein structural data. This systematic, algorithmic approach explores 8 available input features in exhaustive combinations (247 models; size 2-8). To control for potential bias in random sampling for holdback in training sets, we iterated each model across 100 randomized, sample training and testing sets, yielding 24,700 individual ANNs. The size of the analyzed dataset and iterative generation of ANNs represents the largest and most thorough investigation of predictive models for PTM functionality to date. Comparison of input layer combinations allows us to quantify ANN performance with a high degree of confidence and subsequently select a top-ranked, robust fit model which highlights 3,687 MAPs, including 10,933 PTMs with a high
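
    The model counts quoted above follow from simple combinatorics: choosing every subset of size 2 to 8 from 8 input features gives 2^8 - 1 - 8 = 247 combinations, and 100 randomized splits per combination gives 24,700 networks. The sketch below reproduces only this enumeration; the feature names are placeholders, and no network training is shown.

```python
from itertools import combinations

# Placeholder names; the record does not list the 8 MAP features.
features = [f"feature_{i}" for i in range(1, 9)]

# Every input-layer combination of size 2 through 8.
models = [combo for size in range(2, 9) for combo in combinations(features, size)]

n_models = len(models)
n_splits = 100                 # randomized training/testing sets per model
n_anns = n_models * n_splits   # total individual networks

print(n_models, n_anns)
```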

  17. Systematic analysis of non-structural protein features for the prediction of PTM function potential by artificial neural networks

    PubMed Central

    2017-01-01

    Post-translational modifications (PTMs) provide an extensible framework for regulation of protein behavior beyond the diversity represented within the genome alone. While the rate of identification of PTMs has rapidly increased in recent years, our knowledge of PTM functionality encompasses less than 5% of this data. We previously developed SAPH-ire (Structural Analysis of PTM Hotspots) for the prioritization of eukaryotic PTMs based on function potential of discrete modified alignment positions (MAPs) in a set of 8 protein families. A proteome-wide expansion of the dataset to all families of PTM-bearing, eukaryotic proteins with a representational crystal structure and the application of artificial neural network (ANN) models demonstrated the broader applicability of this approach. Although structural features of proteins have been repeatedly demonstrated to be predictive of PTM functionality, the availability of adequately resolved 3D structures in the Protein Data Bank (PDB) limits the scope of these methods. In order to bridge this gap and capture the larger set of PTM-bearing proteins without an available, homologous structure, we explored all available MAP features as ANN inputs to identify predictive models that do not rely on 3D protein structural data. This systematic, algorithmic approach explores 8 available input features in exhaustive combinations (247 models; size 2–8). To control for potential bias in random sampling for holdback in training sets, we iterated each model across 100 randomized, sample training and testing sets—yielding 24,700 individual ANNs. The size of the analyzed dataset and iterative generation of ANNs represents the largest and most thorough investigation of predictive models for PTM functionality to date. Comparison of input layer combinations allows us to quantify ANN performance with a high degree of confidence and subsequently select a top-ranked, robust fit model which highlights 3,687 MAPs, including 10,933 PTMs with a

  18. Towards New Drug Targets? Function Prediction of Putative Proteins of Neisseria meningitidis MC58 and Their Virulence Characterization

    PubMed Central

    Shahbaaz, Mohd.; Bisetty, Krishna; Ahmad, Faizan

    2015-01-01

    Neisseria meningitidis is a Gram-negative aerobic diplococcus, responsible for a variety of meningococcal diseases. The genome of N. meningitidis MC58 comprises 2114 genes that are translated into 1953 proteins. Of these, 698 genes (∼35%) encode hypothetical proteins (HPs), for which no experimental evidence of biological function is available. Analyses of these proteins are important to understand their functions in the metabolic networks and may lead to the discovery of novel drug targets against the infections caused by N. meningitidis. This study aimed at the identification and categorization of each HP present in the genome of N. meningitidis MC58 using computational tools. Functions of 363 proteins were predicted with high accuracy among the annotated set of HPs investigated. The reliably predicted 363 HPs were further grouped into 41 different classes of proteins, based on their possible roles in cellular processes such as metabolism, transport, and replication. Our studies revealed that 22 HPs may be involved in the pathogenesis caused by this microorganism. The top two HPs with the highest virulence scores were subjected to molecular dynamics (MD) simulations to better understand their conformational behavior in a water environment. We also compared the MD simulation results with other virulent proteins present in N. meningitidis. This study broadens our understanding of the mechanistic pathways of pathogenesis, drug resistance, tolerance, and adaptability for host immune responses to N. meningitidis. PMID:26076386

  19. SVM-Prot 2016: A Web-Server for Machine Learning Prediction of Protein Functional Families from Sequence Irrespective of Similarity.

    PubMed

    Li, Ying Hong; Xu, Jing Yu; Tao, Lin; Li, Xiao Feng; Li, Shuang; Zeng, Xian; Chen, Shang Ying; Zhang, Peng; Qin, Chu; Zhang, Cheng; Chen, Zhe; Zhu, Feng; Chen, Yu Zong

    2016-01-01

    Knowledge of protein function is important for biological, medical and therapeutic studies, but many proteins are still unknown in function. There is a need for improved functional prediction methods. Our SVM-Prot web-server employed a machine learning method for predicting protein functional families from protein sequences irrespective of similarity, which complemented similarity-based and other methods in predicting diverse classes of proteins, including distantly related proteins and homologous proteins of different functions. Since its publication in 2003, we made major improvements to SVM-Prot with (1) expanded coverage from 54 to 192 functional families, (2) more diverse protein descriptors for protein representation, (3) improved predictive performance due to the use of more enriched training datasets and a greater variety of protein descriptors, (4) a newly integrated BLAST analysis option for assessing proteins in the SVM-Prot predicted functional families that were similar in sequence to a query protein, and (5) a newly added batch submission option for supporting the classification of multiple proteins. Moreover, 2 more machine learning approaches, K nearest neighbor and probabilistic neural networks, were added for facilitating collective assessment of protein functions by multiple methods. SVM-Prot can be accessed at http://bidd2.nus.edu.sg/cgi-bin/svmprot/svmprot.cgi.

  20. SVM-Prot 2016: A Web-Server for Machine Learning Prediction of Protein Functional Families from Sequence Irrespective of Similarity

    PubMed Central

    Li, Xiao Feng; Li, Shuang; Zeng, Xian; Chen, Shang Ying; Zhang, Peng; Qin, Chu; Zhang, Cheng; Chen, Zhe; Zhu, Feng; Chen, Yu Zong

    2016-01-01

    Knowledge of protein function is important for biological, medical and therapeutic studies, but the functions of many proteins are still unknown. There is a need for improved functional prediction methods. Our SVM-Prot web-server employed a machine learning method for predicting protein functional families from protein sequences irrespective of similarity, which complemented similarity-based and other methods in predicting diverse classes of proteins, including distantly related proteins and homologous proteins of different functions. Since its publication in 2003, we made major improvements to SVM-Prot with (1) expanded coverage from 54 to 192 functional families, (2) more diverse protein descriptors for protein representation, (3) improved predictive performances due to the use of more enriched training datasets and a greater variety of protein descriptors, (4) newly integrated BLAST analysis option for assessing proteins in the SVM-Prot predicted functional families that were similar in sequence to a query protein, and (5) newly added batch submission option for supporting the classification of multiple proteins. Moreover, 2 more machine learning approaches, K nearest neighbor and probabilistic neural networks, were added for facilitating collective assessment of protein functions by multiple methods. SVM-Prot can be accessed at http://bidd2.nus.edu.sg/cgi-bin/svmprot/svmprot.cgi. PMID:27525735

  1. IMP: a multi-species functional genomics portal for integration, visualization and prediction of protein functions and networks.

    PubMed

    Wong, Aaron K; Park, Christopher Y; Greene, Casey S; Bongo, Lars A; Guan, Yuanfang; Troyanskaya, Olga G

    2012-07-01

    Integrative multi-species prediction (IMP) is an interactive web server that enables molecular biologists to interpret experimental results and to generate hypotheses in the context of a large cross-organism compendium of functional predictions and networks. The system provides a framework for biologists to analyze their candidate gene sets in the context of functional networks, as they expand or focus these sets by mining functional relationships predicted from integrated high-throughput data. IMP integrates prior knowledge and data collections from multiple organisms in its analyses. Through flexible and interactive visualizations, researchers can compare functional contexts and interpret the behavior of their gene sets across organisms. Additionally, IMP identifies homologs with conserved functional roles for knowledge transfer, allowing for accurate function predictions even for biological processes that have very few experimental annotations in a given organism. IMP currently supports seven organisms (Homo sapiens, Mus musculus, Rattus norvegicus, Drosophila melanogaster, Danio rerio, Caenorhabditis elegans and Saccharomyces cerevisiae), does not require any registration or installation and is freely available for use at http://imp.princeton.edu.

  2. PREFACE: Protein protein interactions: principles and predictions

    NASA Astrophysics Data System (ADS)

    Nussinov, Ruth; Tsai, Chung-Jung

    2005-06-01

    Proteins are the `workhorses' of the cell. Their roles span functions as diverse as being molecular machines and signalling. They carry out catalytic reactions, transport, form viral capsids, traverse membranes and form regulated channels, transmit information from DNA to RNA, making possible the synthesis of new proteins, and they are responsible for the degradation of unnecessary proteins and nucleic acids. They are the vehicles of the immune response and are responsible for viral entry into the cell. Given their importance, considerable effort has been centered on the prediction of protein function. A prime way to do this is through identification of binding partners. If the function of at least one of the components with which the protein interacts is known, that should let us assign its function(s) and the pathway(s) in which it plays a role. This holds since the vast majority of their chores in the living cell involve protein-protein interactions. Hence, through the intricate network of these interactions we can map cellular pathways, their interconnectivities and their dynamic regulation. Their identification is at the heart of functional genomics; their prediction is crucial for drug discovery. Knowledge of the pathway, its topology, length, and dynamics may provide useful information for forecasting side effects. The goal of predicting protein-protein interactions is daunting. Some associations are obligatory, others are continuously forming and dissociating. In principle, from the physical standpoint, any two proteins can interact, but under what conditions and at which strength? The principles of protein-protein interactions are general: the non-covalent interactions of two proteins are largely the outcome of the hydrophobic effect, which drives the interactions. In addition, hydrogen bonds and electrostatic interactions play important roles. Thus, many of the interactions observed in vitro are the outcome of experimental overexpression. 

  3. Protein function prediction and annotation in an integrated environment powered by web services (AFAWE).

    PubMed

    Jöcker, Anika; Hoffmann, Fabian; Groscurth, Andreas; Schoof, Heiko

    2008-10-15

    Many sequenced genes are mainly annotated through automatic transfer of annotation from similar sequences. Manual comparison of results or intermediate results from different tools can help avoid wrong annotations and give hints to the function of a gene even if none of the automated tools could return any result. AFAWE simplifies the task of manual functional annotation by running different tools and workflows for automatic function prediction and displaying the results in a way that facilitates comparison. Because all programs are executed as web services, AFAWE is easily extensible and can directly query primary databases, thereby always using the most up-to-date data sources. Visual filters help to distinguish trustworthy results from non-significant results. Furthermore, an interface to add detailed manual annotation to each gene is provided, which can be displayed to other users.

  4. Many amino acid substitution variants identified in DNA repair genes during human population screenings are predicted to impact protein function

    SciTech Connect

    Xi, T; Jones, I M; Mohrenweiser, H W

    2003-11-03

    Over 520 different amino acid substitution variants have been previously identified in the systematic screening of 91 human DNA repair genes for sequence variation. Two algorithms were employed to predict the impact of these amino acid substitutions on protein activity. Sorting Intolerant From Tolerant (SIFT) classified 226 of 508 variants (44%) as "Intolerant". Polymorphism Phenotyping (PolyPhen) classed 165 of 489 amino acid substitutions (34%) as "Probably or Possibly Damaging". Another 9-15% of the variants were classed as "Potentially Intolerant or Damaging". The results from the two algorithms are highly associated, with concordance in predicted impact observed for approximately 62% of the variants. Twenty-one to thirty-one percent of the variant proteins are predicted to exhibit reduced activity by both algorithms. These variants occur at slightly lower individual allele frequency than do the variants classified as "Tolerant" or "Benign". Both algorithms correctly predicted the impact of 26 functionally characterized amino acid substitutions in the APE1 protein on biochemical activity, with one exception. It is concluded that a substantial fraction of the missense variants observed in the general human population are functionally relevant. These variants are expected to be the molecular genetic and biochemical basis for the associations of reduced DNA repair capacity phenotypes with elevated cancer risk.
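
    The two classification rates quoted in this abstract follow directly from the reported counts. The short Python check below reproduces them; the helper name `percent` is illustrative only and does not come from the report.

```python
def percent(part: int, whole: int) -> int:
    """Return part/whole as a percentage rounded to the nearest integer."""
    return round(100 * part / whole)

# Counts exactly as reported in the abstract.
sift_intolerant = percent(226, 508)     # SIFT "Intolerant"
polyphen_damaging = percent(165, 489)   # PolyPhen "Probably or Possibly Damaging"

print(sift_intolerant, polyphen_damaging)  # prints: 44 34
```

    Note that the two denominators differ (508 and 489), so the two rates are computed over different subsets of the variants.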

  5. Three-level prediction of protein function by combining profile-sequence search, profile-profile search, and domain co-occurrence networks.

    PubMed

    Wang, Zheng; Cao, Renzhi; Cheng, Jianlin

    2013-01-01

    Predicting protein function from sequence is useful for biochemical experiment design, mutagenesis analysis, protein engineering, protein design, biological pathway analysis, drug design, disease diagnosis, and genome annotation as a vast number of protein sequences with unknown function are routinely being generated by DNA, RNA and protein sequencing in the genomic era. However, despite significant progress in the last several years, the accuracy of protein function prediction still needs to be improved in order to be used effectively in practice, particularly when little or no homology exists between a target protein and proteins with annotated function. Here, we developed a method that integrated profile-sequence alignment, profile-profile alignment, and Domain Co-Occurrence Networks (DCN) to predict protein function at different levels of complexity, ranging from obvious homology, to remote homology, to no homology. We tested the method blindly in the 2011 Critical Assessment of Function Annotation (CAFA). Our experiments demonstrated that our three-level prediction method effectively increased the recall of function prediction while maintaining a reasonable precision. In particular, our method can predict function terms defined by the Gene Ontology more accurately than three standard baseline methods in most situations, handle multi-domain proteins naturally, and make ab initio function prediction when no homology exists. These results show that our approach can combine complementary strengths of most widely used BLAST-based function prediction methods, rarely used in function prediction but more sensitive profile-profile comparison-based homology detection methods, and non-homology-based domain co-occurrence networks, to effectively extend the power of function prediction from high homology, to low homology, to no homology (ab initio cases).

  6. Three-Level Prediction of Protein Function by Combining Profile-Sequence Search, Profile-Profile Search, and Domain Co-Occurrence Networks

    PubMed Central

    2013-01-01

    Predicting protein function from sequence is useful for biochemical experiment design, mutagenesis analysis, protein engineering, protein design, biological pathway analysis, drug design, disease diagnosis, and genome annotation as a vast number of protein sequences with unknown function are routinely being generated by DNA, RNA and protein sequencing in the genomic era. However, despite significant progress in the last several years, the accuracy of protein function prediction still needs to be improved in order to be used effectively in practice, particularly when little or no homology exists between a target protein and proteins with annotated function. Here, we developed a method that integrated profile-sequence alignment, profile-profile alignment, and Domain Co-Occurrence Networks (DCN) to predict protein function at different levels of complexity, ranging from obvious homology, to remote homology, to no homology. We tested the method blindly in the 2011 Critical Assessment of Function Annotation (CAFA). Our experiments demonstrated that our three-level prediction method effectively increased the recall of function prediction while maintaining a reasonable precision. In particular, our method can predict function terms defined by the Gene Ontology more accurately than three standard baseline methods in most situations, handle multi-domain proteins naturally, and make ab initio function prediction when no homology exists. These results show that our approach can combine complementary strengths of most widely used BLAST-based function prediction methods, rarely used in function prediction but more sensitive profile-profile comparison-based homology detection methods, and non-homology-based domain co-occurrence networks, to effectively extend the power of function prediction from high homology, to low homology, to no homology (ab initio cases). PMID:23514381

  7. WS-SNPs&GO: a web server for predicting the deleterious effect of human protein variants using functional annotation

    PubMed Central

    2013-01-01

    Background SNPs&GO is a method for the prediction of deleterious Single Amino acid Polymorphisms (SAPs) using protein functional annotation. In this work, we present the web server implementation of SNPs&GO (WS-SNPs&GO). The server is based on Support Vector Machines (SVM) and for a given protein, its input comprises: the sequence and/or its three-dimensional structure (when available), a set of target variations and its functional Gene Ontology (GO) terms. The output of the server provides, for each protein variation, the probabilities to be associated to human diseases. Results The server consists of two main components, including updated versions of the sequence-based SNPs&GO (recently scored as one of the best algorithms for predicting deleterious SAPs) and of the structure-based SNPs&GO3d programs. Sequence and structure based algorithms are extensively tested on a large set of annotated variations extracted from the SwissVar database. Selecting a balanced dataset with more than 38,000 SAPs, the sequence-based approach achieves 81% overall accuracy, 0.61 correlation coefficient and an Area Under the Curve (AUC) of the Receiver Operating Characteristic (ROC) curve of 0.88. For the subset of ~6,600 variations mapped on protein structures available at the Protein Data Bank (PDB), the structure-based method scores with 84% overall accuracy, 0.68 correlation coefficient, and 0.91 AUC. When tested on a new blind set of variations, the results of the server are 79% and 83% overall accuracy for the sequence-based and structure-based inputs, respectively. Conclusions WS-SNPs&GO is a valuable tool that includes in a unique framework information derived from protein sequence, structure, evolutionary profile, and protein function. WS-SNPs&GO is freely available at http://snps.biofold.org/snps-and-go. PMID:23819482
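
    The accuracy and correlation-coefficient figures reported in this abstract are summary statistics of a binary confusion matrix. The Python sketch below shows how such values are typically computed, assuming that the "correlation coefficient" is the Matthews correlation coefficient (MCC), which the abstract does not state explicitly. The counts are made up for illustration and are not taken from the SwissVar benchmark.

```python
from math import sqrt

def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of all variants that were classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)

def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient; defined as 0.0 if any margin is empty."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Hypothetical confusion matrix (disease-related = positive class).
tp, tn, fp, fn = 45, 40, 10, 5

print(f"accuracy = {accuracy(tp, tn, fp, fn):.2f}")  # accuracy = 0.85
print(f"MCC      = {mcc(tp, tn, fp, fn):.2f}")       # MCC      = 0.70
```

    Unlike accuracy, MCC stays near zero for a classifier that simply predicts the majority class, which is one reason it is often reported alongside accuracy.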

  8. AUTO-MUTE 2.0: A Portable Framework with Enhanced Capabilities for Predicting Protein Functional Consequences upon Mutation

    PubMed Central

    Masso, Majid; Vaisman, Iosif I.

    2014-01-01

    The AUTO-MUTE 2.0 stand-alone software package includes a collection of programs for predicting functional changes to proteins upon single residue substitutions, developed by combining structure-based features with trained statistical learning models. Three of the predictors evaluate changes to protein stability upon mutation, each complementing a distinct experimental approach. Two additional classifiers are available, one for predicting activity changes due to residue replacements and the other for determining the disease potential of mutations associated with nonsynonymous single nucleotide polymorphisms (nsSNPs) in human proteins. These five command-line driven tools, as well as all the supporting programs, complement those that run our AUTO-MUTE web-based server. Nevertheless, all the codes have been rewritten and substantially altered for the new portable software, and they incorporate several new features based on user feedback. Included among these upgrades is the ability to perform three highly requested tasks: to run “big data” batch jobs; to generate predictions using modified protein data bank (PDB) structures, and unpublished personal models prepared using standard PDB file formatting; and to utilize NMR structure files that contain multiple models. PMID:25197272

  9. Physical scoring function based on AMBER force field and Poisson-Boltzmann implicit solvent for protein structure prediction.

    PubMed

    Hsieh, Meng-Juei; Luo, Ray

    2004-08-15

    A well-behaved physics-based all-atom scoring function for protein structure prediction is analyzed with several widely used all-atom decoy sets. The scoring function, termed AMBER/Poisson-Boltzmann (PB), is based on a refined AMBER force field for intramolecular interactions and an efficient PB model for solvation interactions. Testing on the chosen decoy sets shows that the scoring function, which is designed to consider detailed chemical environments, is able to consistently discriminate all 62 native crystal structures after considering the heteroatom groups, disulfide bonds, and crystal packing effects that are not included in the decoy structures. When NMR structures are considered in the testing, the scoring function is able to discriminate 8 out of 10 targets. In the more challenging test of selecting near-native structures, the scoring function also performs very well: for the majority of the targets studied, the scoring function is able to select decoys that are close to the corresponding native structures as evaluated by ranking numbers and backbone Calpha root mean square deviations. Various important components of the scoring function are also studied to understand their discriminative contributions toward the rankings of native and near-native structures. It is found that neither the nonpolar solvation energy as modeled by the surface area model nor a higher protein dielectric constant improves its discriminative power. The terms remaining to be improved are related to 1-4 interactions. The most troublesome term is found to be the large and highly fluctuating 1-4 electrostatics term, not the dihedral-angle term. These data support ongoing efforts in the community to develop protein structure prediction methods with physics-based potentials that are competitive with knowledge-based potentials.

  10. Final report for LDRD project "A new approach to protein function and structure prediction"

    SciTech Connect

    Phillips, C.A.

    1997-03-01

    This report describes the research performed under the Laboratory-Directed Research and Development (LDRD) grant "A new approach to protein function and structure prediction", funded FY94-6. We describe the goals of the research, motivate and list our improvements to the state of the art in multiple sequence alignment and phylogeny (evolutionary tree) construction, but leave technical details to the six publications resulting from this work. At least three algorithms for phylogeny construction or tree consensus have been implemented and used by researchers outside of Sandia.

  11. Methods for protein complex prediction and their contributions towards understanding the organisation, function and dynamics of complexes.

    PubMed

    Srihari, Sriganesh; Yong, Chern Han; Patil, Ashwini; Wong, Limsoon

    2015-09-14

    Complexes of physically interacting proteins constitute fundamental functional units responsible for driving biological processes within cells. A faithful reconstruction of the entire set of complexes is therefore essential to understand the functional organisation of cells. In this review, we discuss the key contributions of computational methods developed to date (approximately between 2003 and 2015) for identifying complexes from the network of interacting proteins (PPI network). We evaluate in depth the performance of these methods on PPI datasets from yeast, and highlight their limitations and challenges, in particular at detecting sparse and small or sub-complexes and discerning overlapping complexes. We describe methods for integrating diverse information including expression profiles and 3D structures of proteins with PPI networks to understand the dynamics of complex formation, for instance, of time-based assembly of complex subunits and formation of fuzzy complexes from intrinsically disordered proteins. Finally, we discuss methods for identifying dysfunctional complexes in human diseases, an application that is proving invaluable to understand disease mechanisms and to discover novel therapeutic targets. We hope this review aptly commemorates a decade of research on computational prediction of complexes and constitutes a valuable reference for further advancements in this exciting area.

  12. The impact of incomplete knowledge on the evaluation of protein function prediction: a structured-output learning perspective

    PubMed Central

    Jiang, Yuxiang; Clark, Wyatt T.; Friedberg, Iddo; Radivojac, Predrag

    2014-01-01

    Motivation: The automated functional annotation of biological macromolecules is a problem of computational assignment of biological concepts or ontological terms to genes and gene products. A number of methods have been developed to computationally annotate genes using standardized nomenclature such as Gene Ontology (GO). However, questions remain about the possibility for development of accurate methods that can integrate disparate molecular data as well as about an unbiased evaluation of these methods. One important concern is that experimental annotations of proteins are incomplete. This raises questions as to whether and to what degree currently available data can be reliably used to train computational models and estimate their performance accuracy. Results: We study the effect of incomplete experimental annotations on the reliability of performance evaluation in protein function prediction. Using the structured-output learning framework, we provide theoretical analyses and carry out simulations to characterize the effect of growing experimental annotations on the correctness and stability of performance estimates corresponding to different types of methods. We then analyze real biological data by simulating the prediction, evaluation and subsequent re-evaluation (after additional experimental annotations become available) of GO term predictions. Our results agree with previous observations that incomplete and accumulating experimental annotations have the potential to significantly impact accuracy assessments. We find that their influence reflects a complex interplay between the prediction algorithm, performance metric and underlying ontology. However, using the available experimental data and under realistic assumptions, our results also suggest that current large-scale evaluations are meaningful and almost surprisingly reliable. Contact: predrag@indiana.edu Supplementary information: Supplementary data are available at Bioinformatics online. PMID:25161254
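
    The effect described in this abstract, where the same predictions score differently once more experimental annotations accumulate, can be illustrated with a toy Python example. This is a minimal sketch and not the authors' structured-output framework: the term names are placeholders rather than real GO identifiers, and plain set-based precision and recall stand in for the metrics used in the paper.

```python
def precision_recall(predicted: set, annotated: set) -> tuple:
    """Set-based precision and recall of predicted terms for one protein."""
    true_positives = len(predicted & annotated)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(annotated) if annotated else 0.0
    return precision, recall

# Placeholder terms for a single hypothetical protein.
predicted = {"term_A", "term_B", "term_C", "term_D"}

annotated_early = {"term_A", "term_B"}                    # annotations known at evaluation time
annotated_later = annotated_early | {"term_C", "term_E"}  # after new experiments are published

print(precision_recall(predicted, annotated_early))  # (0.5, 1.0)
print(precision_recall(predicted, annotated_later))  # (0.75, 0.75)
```

    In this toy case the prediction of `term_C` is counted as a false positive in the first evaluation and as a true positive in the second, so precision rises while recall falls, even though the predictions themselves never changed.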

  13. Prediction of Certain Well-Characterized Domains of Known Functions within the PE and PPE Proteins of Mycobacteria

    PubMed Central

    Sultana, Rafiya; Tanneeru, Karunakar; Kumar, Ashwin B. R.; Guruprasad, Lalitha

    2016-01-01

    The PE and PPE protein family are unique to mycobacteria. Though the complete genome sequences for over 500 M. tuberculosis strains and mycobacterial species are available, few PE and PPE proteins have been structurally and functionally characterized. We have therefore used bioinformatics tools to characterize the structure and function of these proteins. We selected representative members of the PE and PPE protein family by phylogeny analysis and using structure-based sequence annotation identified ten well-characterized protein domains of known function. Some of these domains were observed to be common to all mycobacterial species and some were species specific. PMID:26891364

  14. A partial loss of function allele of methyl-CpG-binding protein 2 predicts a human neurodevelopmental syndrome.

    PubMed

    Samaco, Rodney C; Fryer, John D; Ren, Jun; Fyffe, Sharyl; Chao, Hsiao-Tuan; Sun, Yaling; Greer, John J; Zoghbi, Huda Y; Neul, Jeffrey L

    2008-06-15

    Rett Syndrome, an X-linked dominant neurodevelopmental disorder characterized by regression of language and hand use, is primarily caused by mutations in methyl-CpG-binding protein 2 (MECP2). Loss of function mutations in MECP2 are also found in other neurodevelopmental disorders such as autism, Angelman-like syndrome and non-specific mental retardation. Furthermore, duplication of the MECP2 genomic region results in mental retardation with speech and social problems. The common features of human neurodevelopmental disorders caused by the loss or increase of MeCP2 function suggest that even modest alterations of MeCP2 protein levels result in neurodevelopmental problems. To determine whether a small reduction in MeCP2 level has phenotypic consequences, we characterized a conditional mouse allele of Mecp2 that expresses 50% of the wild-type level of MeCP2. Upon careful behavioral analysis, mice that harbor this allele display a spectrum of abnormalities such as learning and motor deficits, decreased anxiety, altered social behavior and nest building, decreased pain recognition and disrupted breathing patterns. These results indicate that precise control of MeCP2 is critical for normal behavior and predict that human neurodevelopmental disorders will result from a subtle reduction in MeCP2 expression.

  15. Predicting residual kidney function in hemodialysis patients using serum β-trace protein and β2-microglobulin.

    PubMed

    Wong, Jonathan; Sridharan, Sivakumar; Berdeprado, Jocelyn; Vilar, Enric; Viljoen, Adie; Wellsted, David; Farrington, Ken

    2016-05-01

    Residual kidney function (RKF) contributes significant solute clearance in hemodialysis patients. Kidney Diseases Outcomes Quality Initiative (KDOQI) guidelines suggest that hemodialysis dose can be safely reduced in those with residual urea clearance (KRU) of 2 ml/min/1.73 m² or more. However, serial measurement of RKF is cumbersome and requires regular interdialytic urine collections. Simpler methods for assessing RKF are needed. β-trace protein (βTP) and β2-microglobulin (β2M) have been proposed as alternative markers of RKF. We derived predictive equations to estimate glomerular filtration rate (GFR) and KRU based on serum βTP and β2M from 191 hemodialysis patients based on standard measurements of KRU and GFR (mean of urea and creatinine clearances) using interdialytic urine collections. These modeled equations were tested in a separate validation cohort of 40 patients. A prediction equation for GFR that includes both βTP and β2M provided a better estimate than either alone and contained the terms 1/βTP, 1/β2M, 1/serum creatinine, and a factor for gender. The equation for KRU contained the terms 1/βTP, 1/β2M, and a factor for ethnicity. Mean bias between predicted and measured values was 0.63 ml/min for GFR and 0.50 ml/min for KRU. There was substantial agreement between predicted and measured KRU at a cut-off level of 2 ml/min/1.73 m². Thus, equations involving βTP and β2M provide reasonable estimates of RKF and could potentially be used to identify those with KRU of 2 ml/min/1.73 m² or more to follow the KDOQI incremental hemodialysis algorithm.

  16. PSSP-RFE: accurate prediction of protein structural class by recursive feature extraction from PSI-BLAST profile, physical-chemical property and functional annotations.

    PubMed

    Li, Liqi; Cui, Xiang; Yu, Sanjiu; Zhang, Yuan; Luo, Zhong; Yang, Hua; Zhou, Yue; Zheng, Xiaoqi

    2014-01-01

    Protein structure prediction is critical to functional annotation of the massively accumulated biological sequences, which prompts an imperative need for the development of high-throughput technologies. As a first and key step in protein structure prediction, protein structural class prediction becomes an increasingly challenging task. For most homology-based approaches, the accuracies of protein structural class prediction are sufficiently high for high similarity datasets, but still far from being satisfactory for low similarity datasets, i.e., below 40% in pairwise sequence similarity. Therefore, we present a novel method for accurate and reliable protein structural class prediction for both high and low similarity datasets. This method is based on Support Vector Machine (SVM) in conjunction with integrated features from position-specific scoring matrix (PSSM), PROFEAT and Gene Ontology (GO). A feature selection approach, SVM-RFE, is also used to rank the integrated feature vectors through recursively removing the feature with the lowest ranking score. The definitive top features selected by SVM-RFE are input into the SVM engines to predict the structural class of a query protein. To validate our method, jackknife tests were applied to seven widely used benchmark datasets, reaching overall accuracies between 84.61% and 99.79%, which are significantly higher than those achieved by state-of-the-art tools. These results suggest that our method could serve as an accurate and cost-effective alternative to existing methods in protein structural classification, especially for low similarity datasets.

  17. SPOT-Seq-RNA: predicting protein-RNA complex structure and RNA-binding function by fold recognition and binding affinity prediction.

    PubMed

    Yang, Yuedong; Zhao, Huiying; Wang, Jihua; Zhou, Yaoqi

    2014-01-01

    RNA-binding proteins (RBPs) play key roles in RNA metabolism and post-transcriptional regulation. Computational methods have been developed separately for prediction of RBPs and RNA-binding residues by machine-learning techniques and prediction of protein-RNA complex structures by rigid or semiflexible structure-to-structure docking. Here, we describe a template-based technique called SPOT-Seq-RNA that integrates prediction of RBPs, RNA-binding residues, and protein-RNA complex structures into a single package. This integration is achieved by combining template-based structure-prediction software, SPARKS X, with binding affinity prediction software, DRNA. This tool yields reasonable sensitivity (46 %) and high precision (84 %) for an independent test set of 215 RBPs and 5,766 non-RBPs. SPOT-Seq-RNA is computationally efficient for genome-scale prediction of RBPs and protein-RNA complex structures. Its application to human genome study has revealed a similar sensitivity and ability to uncover hundreds of novel RBPs beyond simple homology. The online server and downloadable version of SPOT-Seq-RNA are available at http://sparks-lab.org/server/SPOT-Seq-RNA/.

  18. Protein-protein interaction site prediction in Homo sapiens and E. coli using an interaction-affinity based membership function in fuzzy SVM.

    PubMed

    Sriwastava, Brijesh Kumar; Basu, Subhadip; Maulik, Ujjwal

    2015-10-01

    Protein-protein interaction (PPI) site prediction helps ascertain the interface residues that participate in interaction processes. Fuzzy support vector machine (F-SVM) is proposed as an effective method to solve this problem, and we have shown that the performance of the classical SVM can be enhanced with the help of an interaction-affinity based fuzzy membership function. We evaluate the performances of both SVM and F-SVM on the PPI databases of the Homo sapiens and E. coli organisms and estimate the statistical significance of the developed method over classical SVM and other fuzzy membership-based SVM methods available in the literature. Our membership function uses the residue-level interaction affinity scores for each pair of positive and negative sequence fragments. The average AUC scores in the 10-fold cross-validation experiments are measured as 79.94% and 80.48% for the Homo sapiens and E. coli organisms respectively. On the independent test datasets, AUC scores are obtained as 76.59% and 80.17% respectively for the two organisms. In almost all cases, the developed F-SVM method improves on the performances obtained by the corresponding classical SVM and the other classifiers available in the literature.
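
    The AUC scores quoted in this abstract summarize how well a classifier ranks interface residues above non-interface residues. The Python sketch below computes AUC from its pairwise (Mann-Whitney) definition; it is a generic illustration, not the authors' evaluation code, and the scores are invented for the example.

```python
def auc(scores_pos: list, scores_neg: list) -> float:
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

# Hypothetical classifier outputs for interface (positive) and
# non-interface (negative) residues.
interface_scores = [0.9, 0.8, 0.4]
non_interface_scores = [0.7, 0.3, 0.2]

# 8 of the 9 positive/negative pairs are ranked correctly.
print(f"AUC = {auc(interface_scores, non_interface_scores):.4f}")  # AUC = 0.8889
```

    The pairwise form is quadratic in the number of examples, which is fine for a demonstration; rank-based implementations are preferable for large datasets.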

  19. Large scale interaction analysis predicts that the Gerbera hybrida floral E function is provided both by general and specialized proteins

    PubMed Central

    2010-01-01

    Background The ornamental plant Gerbera hybrida bears complex inflorescences with morphologically distinct floral morphs that are specific to the sunflower family Asteraceae. We have previously characterized several MADS box genes that regulate floral development in Gerbera. To study further their behavior in higher order complex formation according to the quartet model, we performed yeast two- and three-hybrid analysis with fourteen Gerbera MADS domain proteins to analyze their protein-protein interaction potential. Results The exhaustive pairwise interaction analysis showed significant differences in the interaction capacity of different Gerbera MADS domain proteins compared to other model plants. Of particular interest in these assays was the behavior of SEP-like proteins, known as GRCDs in Gerbera. The previously described GRCD1 and GRCD2 proteins, which are specific regulators involved in stamen and carpel development, respectively, showed very limited pairwise interactions, whereas the related GRCD4 and GRCD5 factors displayed hub-like positions in the interaction map. We propose GRCD4 and GRCD5 to provide a redundant and general E function in Gerbera, comparable to the SEP proteins in Arabidopsis. Based on the pairwise interaction data, combinations of MADS domain proteins were further subjected to yeast three-hybrid assays. Gerbera B function proteins showed active behavior in ternary complexes. All Gerbera SEP-like proteins with the exception of GRCD1 were excellent partners for B function proteins, further implicating the unique role of GRCD1 as a whorl- and flower-type specific C function partner. Conclusions Gerbera MADS domain proteins exhibit both conserved and derived behavior in higher order protein complex formation. This protein-protein interaction data can be used to classify and compare Gerbera MADS domain proteins to those of Arabidopsis and Petunia. Combined with our reverse genetic studies of Gerbera, these results reinforce the roles of

  20. Comparative genomic analysis of evolutionarily conserved but functionally uncharacterized membrane proteins in archaea: Prediction of novel components of secretion, membrane remodeling and glycosylation systems.

    PubMed

    Makarova, Kira S; Galperin, Michael Y; Koonin, Eugene V

    2015-11-01

    A systematic comparative genomic analysis of all archaeal membrane proteins that have been projected to the last archaeal common ancestor gene set led to the identification of several novel components of predicted secretion, membrane remodeling, and protein glycosylation systems. Among other findings, most crenarchaea have been shown to encode highly diverged orthologs of the membrane insertase YidC, which is nearly universal in bacteria, eukaryotes, and euryarchaea. We also identified a vast family of archaeal proteins, including the C-terminal domain of N-glycosylation protein AglD, as membrane flippases homologous to the flippase domain of bacterial multipeptide resistance factor MprF, a bifunctional lysylphosphatidylglycerol synthase and flippase. Additionally, several proteins were predicted to function as membrane transporters. The results of this work, combined with our previous analyses, reveal an unexpected diversity of putative archaeal membrane-associated functional systems that remain to be functionally characterized. A more general conclusion from this work is that the currently available collection of archaeal (and bacterial) genomes could be sufficient to identify (almost) all widespread functional modules and develop experimentally testable predictions of their functions.

  1. Modeling Protein Domain Function

    ERIC Educational Resources Information Center

    Baker, William P.; Jones, Carleton "Buck"; Hull, Elizabeth

    2007-01-01

    This simple but effective laboratory exercise helps students understand the concept of protein domain function. They use foam beads, Styrofoam craft balls, and pipe cleaners to explore how domains within protein active sites interact to form a functional protein. The activity allows students to gain content mastery and an understanding of the…

  2. Computational Prediction of Protein-Protein Interactions of Human Tyrosinase

    PubMed Central

    Wang, Su-Fang; Oh, Sangho; Si, Yue-Xiu; Wang, Zhi-Jiang; Han, Hong-Yan; Lee, Jinhyuk; Qian, Guo-Ying

    2012-01-01

    Tyrosinase has recently gained the attention of researchers because of its potential applications and biological functions. In this study, we predicted the 3D structure of human tyrosinase and simulated the protein-protein interactions between tyrosinase and three binding partners, four and a half LIM domains 2 (FHL2), cytochrome b-245 alpha polypeptide (CYBA), and RNA-binding motif protein 9 (RBM9). Our interaction simulations showed significant binding energy scores of −595.3 kcal/mol for FHL2, −859.1 kcal/mol for CYBA, and −821.3 kcal/mol for RBM9. We also investigated the residues of each protein facing toward the predicted site of interaction with tyrosinase. Our computational predictions will be useful for elucidating the protein-protein interactions of tyrosinase and studying its binding mechanisms. PMID:22577521

  3. IMP 2.0: a multi-species functional genomics portal for integration, visualization and prediction of protein functions and networks.

    PubMed

    Wong, Aaron K; Krishnan, Arjun; Yao, Victoria; Tadych, Alicja; Troyanskaya, Olga G

    2015-07-01

    IMP (Integrative Multi-species Prediction), originally released in 2012, is an interactive web server that enables molecular biologists to interpret experimental results and to generate hypotheses in the context of a large cross-organism compendium of functional predictions and networks. The system provides biologists with a framework to analyze their candidate gene sets in the context of functional networks, expanding or refining their sets using functional relationships predicted from integrated high-throughput data. IMP 2.0 integrates updated prior knowledge and data collections from the last three years in the seven supported organisms (Homo sapiens, Mus musculus, Rattus norvegicus, Drosophila melanogaster, Danio rerio, Caenorhabditis elegans, and Saccharomyces cerevisiae) and extends function prediction coverage to include human disease. IMP identifies homologs with conserved functional roles for disease knowledge transfer, allowing biologists to analyze disease contexts and predictions across all organisms. Additionally, IMP 2.0 implements a new flexible platform for experts to generate custom hypotheses about biological processes or diseases, making sophisticated data-driven methods easily accessible to researchers. IMP does not require any registration or installation and is freely available for use at http://imp.princeton.edu.

  4. Protein function annotation using protein domain family resources.

    PubMed

    Das, Sayoni; Orengo, Christine A

    2016-01-15

    As a result of the genome sequencing and structural genomics initiatives, we have a wealth of protein sequence and structural data. However, only about 1% of these proteins have experimental functional annotations. As a result, computational approaches that can predict protein functions are essential in bridging this widening annotation gap. This article reviews the current approaches of protein function prediction using structure and sequence based classification of protein domain family resources with a special focus on functional families in the CATH-Gene3D resource.

  5. Exploring Mouse Protein Function via Multiple Approaches

    PubMed Central

    Huang, Tao; Kong, Xiangyin; Zhang, Yunhua; Zhang, Ning

    2016-01-01

    Although the number of available protein sequences is growing exponentially, functional protein annotations lag far behind. Therefore, accurate identification of protein functions remains one of the major challenges in molecular biology. In this study, we presented a novel approach to predict mouse protein functions. The approach was a sequential combination of a similarity-based approach, an interaction-based approach and a pseudo amino acid composition-based approach. The method achieved an accuracy of about 0.8450 for the 1st-order predictions in the leave-one-out and ten-fold cross-validations. For the results yielded by the leave-one-out cross-validation, although the similarity-based approach alone achieved an accuracy of 0.8756, it was unable to predict the functions of proteins with no homologues. Comparatively, the pseudo amino acid composition-based approach alone reached an accuracy of 0.6786. Although the accuracy was lower than that of the previous approach, it could predict the functions of almost all proteins, even proteins with no homologues. Therefore, the combined method balanced the advantages and disadvantages of both approaches to achieve efficient performance. Furthermore, the results yielded by the ten-fold cross-validation indicate that the combined method is still effective and stable when no close homologs are available. However, the accuracy of the predicted functions can only be determined according to known protein functions based on current knowledge. Many protein functions remain unknown. By exploring the functions of proteins for which the 1st-order predicted functions are wrong but the 2nd-order predicted functions are correct, the 1st-order wrongly predicted functions were shown to be closely associated with the genes encoding the proteins. The so-called wrongly predicted functions could also potentially be correct upon future experimental verification. Therefore, the accuracy of the presented method may be much higher in
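
    One way to read the "sequential combination" described in this record is as a fallback cascade: try the most accurate predictor first and move to the next one only when it cannot answer. The sketch below illustrates that reading; the ordering, the function names, and the toy lookup tables are all assumptions for illustration and are not taken from the paper.

```python
from typing import Callable, Optional, Sequence

# A predictor returns a function label, or None when it cannot make a call.
Predictor = Callable[[str], Optional[str]]


def predict_sequentially(protein_id: str, predictors: Sequence[Predictor]) -> Optional[str]:
    """Return the answer of the first predictor that produces one."""
    for predictor in predictors:
        function = predictor(protein_id)
        if function is not None:
            return function
    return None


# Hypothetical stand-ins for the three approaches named in the abstract.
HOMOLOG_HITS = {"P1": "kinase activity"}      # proteins with known homologues
INTERACTION_HITS = {"P2": "DNA binding"}      # proteins with annotated partners


def by_similarity(protein_id: str) -> Optional[str]:
    return HOMOLOG_HITS.get(protein_id)


def by_interaction(protein_id: str) -> Optional[str]:
    return INTERACTION_HITS.get(protein_id)


def by_composition(protein_id: str) -> Optional[str]:
    # A composition-based model can score any sequence, so it always answers.
    return "transport"


cascade = [by_similarity, by_interaction, by_composition]
for pid in ("P1", "P2", "P3"):
    print(pid, predict_sequentially(pid, cascade))
```

    Placing the broad-coverage predictor last mirrors the trade-off the abstract describes: higher accuracy where homologues exist, and coverage for proteins that have none.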

  6. Predicting protein-peptide interactions from scratch

    NASA Astrophysics Data System (ADS)

    Yan, Chengfei; Xu, Xianjin; Zou, Xiaoqin; Zou lab Team

    Protein-peptide interactions play an important role in many cellular processes. The ability to predict protein-peptide complex structures is valuable for mechanistic investigation and therapeutic development. Due to the high flexibility of peptides and lack of templates for homologous modeling, predicting protein-peptide complex structures is extremely challenging. Recently, we have developed a novel docking framework for protein-peptide structure prediction. Specifically, given the sequence of a peptide and a 3D structure of the protein, initial conformations of the peptide are built through protein threading. Then, the peptide is globally and flexibly docked onto the protein using a novel iterative approach. Finally, the sampled modes are scored and ranked by a statistical potential-based energy scoring function that was derived for protein-peptide interactions from statistical mechanics principles. Our docking methodology has been tested on the PeptiDB database and compared with other protein-peptide docking methods. Systematic analysis shows significantly improved results compared to the performances of the existing methods. Our method is computationally efficient and suitable for large-scale applications. Supported by NSF CAREER Award 0953839 (XZ) and NIH R01GM109980 (XZ).

  7. Predicting disease-related proteins based on clique backbone in protein-protein interaction network.

    PubMed

    Yang, Lei; Zhao, Xudong; Tang, Xianglong

    2014-01-01

    Network biology integrates different kinds of data, including physical or functional networks and disease gene sets, to interpret human disease. A clique (maximal complete subgraph) in a protein-protein interaction network is a topological module and possesses inherent biological significance. A disease-related clique is possibly associated with complex diseases. Fully identifying disease components in a clique is conducive to uncovering disease mechanisms. This paper proposes an approach for predicting disease proteins based on cliques in a protein-protein interaction network. To tolerate false positive and negative interactions in protein networks, extending cliques and scoring predicted disease proteins with gene ontology terms are introduced to the clique-based method. The precision of the predicted disease proteins is verified by disease phenotypes and remains steadily above 95%. The predicted disease proteins associated with cliques can partly complement the mapping between genotype and phenotype, and provide clues for understanding the pathogenesis of serious diseases.
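
    Since the method rests on cliques (maximal complete subgraphs), a short sketch of how they can be enumerated may help. The code below uses the classic Bron-Kerbosch recursion on a toy graph; the abstract does not say which enumeration algorithm the authors used, and the protein names and edges here are hypothetical.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Turn an undirected edge list into a dict of neighbour sets."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    return dict(adj)


def maximal_cliques(adj):
    """Enumerate all maximal cliques with the basic Bron-Kerbosch recursion."""
    cliques = []

    def expand(current, candidates, excluded):
        if not candidates and not excluded:
            cliques.append(frozenset(current))   # nothing can extend it: maximal
            return
        for v in list(candidates):
            expand(current | {v}, candidates & adj[v], excluded & adj[v])
            candidates = candidates - {v}        # v is done; do not revisit it
            excluded = excluded | {v}

    expand(set(), set(adj), set())
    return cliques


# Hypothetical toy interaction network: a triangle A-B-C plus a pendant edge C-D.
toy_edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")]
for clique in maximal_cliques(build_adjacency(toy_edges)):
    print(sorted(clique))
```

    The plain recursion is enough for small examples; on real interaction networks a pivoting variant is usually preferred to limit the number of recursive calls.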

  8. Serum liver-type fatty acid-binding protein predicts recovery of graft function after kidney transplantation from donors after cardiac death.

    PubMed

    Kawai, Akihiro; Kusaka, Mamoru; Kitagawa, Fumihiko; Ishii, Junichi; Fukami, Naohiko; Maruyama, Takahiro; Sasaki, Hitomi; Shiroki, Ryoichi; Kurahashi, Hiroki; Hoshinaga, Kiyotaka

    2014-06-01

    Kidneys procured by donation after cardiac death (DCD) may increase the donor pool but are associated with high incidence of delayed graft function (DGF). Urinary liver-type fatty acid-binding protein (L-FABP) level is an early biomarker of renal injury after kidney transplantation (KTx); however, its utility is limited in DGF cases owing to urine sample unavailability. We examined whether serum L-FABP level predicts functional recovery of transplanted DCD kidneys. Consecutive patients undergoing KTx from living related donors (LD), brain-dead donors (BD), or DCD were retrospectively enrolled. Serum L-FABP levels were measured from samples collected before and after KTx. Serum L-FABP decreased rapidly in patients with immediate function, slowly in DGF patients, and somewhat increased in DGF patients requiring hemodialysis (HD) for >1 wk. Receiver-operating characteristic curve analysis demonstrated that DGF was predicted with 84% sensitivity (SE) and 86% specificity (SP) at cutoff of 9.0 ng/mL on post-operative day (POD) 1 and 68% SE and 90% SP at 6.0 on POD 2. DGF >7 d was predicted with 83% SE and 78% SP at 11.0 on POD 1 and 67% SE and 78% SP at 6.5 on POD 2. Serum L-FABP levels may predict graft recovery and need for HD after DCD KTx.
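
    The sensitivity and specificity values quoted for each cutoff come from thresholding a continuous marker. The sketch below shows that calculation on made-up numbers; the serum levels and outcomes are hypothetical toy values, not study data, and the rule "at or above the cutoff predicts DGF" is an assumption about the direction of the test.

```python
def sensitivity_specificity(levels, has_event, cutoff):
    """Classify level >= cutoff as event-positive and score against outcomes."""
    tp = fn = tn = fp = 0
    for level, event in zip(levels, has_event):
        predicted = level >= cutoff
        if event and predicted:
            tp += 1
        elif event and not predicted:
            fn += 1
        elif not event and predicted:
            fp += 1
        else:
            tn += 1
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical day-1 serum levels (ng/mL) and whether DGF occurred.
toy_levels = [12.0, 9.5, 7.0, 10.0, 5.0, 3.0]
toy_dgf = [True, True, True, False, False, False]
se, sp = sensitivity_specificity(toy_levels, toy_dgf, cutoff=9.0)
print(f"sensitivity={se:.2f}, specificity={sp:.2f}")
```

    Sweeping the cutoff over all observed levels and plotting sensitivity against 1 - specificity yields the receiver-operating characteristic curve mentioned in the abstract.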

  9. Three-dimensional (3D) structure prediction and function analysis of the chitin-binding domain 3 protein HD73_3189 from Bacillus thuringiensis HD73.

    PubMed

    Zhan, Yiling; Guo, Shuyuan

    2015-01-01

    Bacillus thuringiensis (Bt) is capable of producing a chitin-binding protein believed to be functionally important to the bacterium during the stationary phase of its growth cycle. In this paper, the chitin-binding domain 3 protein HD73_3189 from B. thuringiensis has been analyzed computationally. Primary and secondary structural analyses demonstrated that HD73_3189 is negatively charged and contains several α-helices, aperiodic coils and β-strands. Domain and motif analyses revealed that HD73_3189 contains a signal peptide, an N-terminal chitin-binding 3 domain, two copies of a fibronectin-like domain 3 and a C-terminal carbohydrate binding domain classified as CBM_5_12. Moreover, analysis predicted the protein's associated localization site to be the cell wall. Ligand site prediction determined that amino acid residues GLU-312, TRP-334, ILE-341 and VAL-382 exposed on the surface of the target protein exhibit polar interactions with the substrate.

  10. Predicting Thermodynamic Behaviors of Non-Protein Amino Acids as a Function of Temperature and pH.

    PubMed

    Kitadai, Norio

    2016-03-01

    Why does life use α-amino acids exclusively as building blocks of proteins? To address that fundamental question from an energetic perspective, this study estimated the standard molal thermodynamic data for three non-α-amino acids (β-alanine, γ-aminobutyric acid, and ε-aminocaproic acid) and α-amino-n-butyric acid in their zwitterionic, negative, and positive ionization states based on the corresponding experimental measurements reported in the literature. Temperature dependences of their heat capacities were described based on the revised Helgeson-Kirkham-Flowers (HKF) equations of state. The obtained dataset was then used to calculate the standard molal Gibbs energies (∆G°) of the non-α-amino acids as a function of temperature and pH. Comparison of their ∆G° values with those of α-amino acids having the same molecular formula showed that the non-α-amino acids have similar ∆G° values to the corresponding α-amino acids in physiologically relevant conditions (neutral pH, <100 °C). In acidic and alkaline pH, the non-α-amino acids are thermodynamically more stable than the corresponding α-ones over a broad temperature range. These results suggest that the energetic cost of synthesis is not an important selection pressure to incorporate α-amino acids into biological systems.

  11. Predicting Thermodynamic Behaviors of Non-Protein Amino Acids as a Function of Temperature and pH

    NASA Astrophysics Data System (ADS)

    Kitadai, Norio

    2016-03-01

    Why does life use α-amino acids exclusively as building blocks of proteins? To address that fundamental question from an energetic perspective, this study estimated the standard molal thermodynamic data for three non-α-amino acids (β-alanine, γ-aminobutyric acid, and ε-aminocaproic acid) and α-amino-n-butyric acid in their zwitterionic, negative, and positive ionization states based on the corresponding experimental measurements reported in the literature. Temperature dependences of their heat capacities were described based on the revised Helgeson-Kirkham-Flowers (HKF) equations of state. The obtained dataset was then used to calculate the standard molal Gibbs energies (∆G°) of the non-α-amino acids as a function of temperature and pH. Comparison of their ∆G° values with those of α-amino acids having the same molecular formula showed that the non-α-amino acids have similar ∆G° values to the corresponding α-amino acids in physiologically relevant conditions (neutral pH, <100 °C). In acidic and alkaline pH, the non-α-amino acids are thermodynamically more stable than the corresponding α-ones over a broad temperature range. These results suggest that the energetic cost of synthesis is not an important selection pressure to incorporate α-amino acids into biological systems.

  12. Sites Inferred by Metabolic Background Assertion Labeling (SIMBAL): adapting the Partial Phylogenetic Profiling algorithm to scan sequences for signatures that predict protein function

    PubMed Central

    2010-01-01

    characterization. Conclusions SIMBAL shows that, in functionally divergent protein families, selected short sequences often significantly outperform their full-length parent sequence for making functional predictions by sequence similarity, suggesting avenues for improved functional classifiers. When combined with structural data, SIMBAL affords the ability to localize and model functional sites. PMID:20102603

  13. Prediction driven functional annotation of hypothetical proteins in the major facilitator superfamily of S. aureus NCTC 8325

    PubMed Central

    Marklevitz, Jessica; Harris, Laura K.

    2016-01-01

    Antibiotic-resistant Staphylococcus aureus strains cause several life-threatening infections. New drug treatment options are needed, but are slow to develop because 50% of the S. aureus genome is hypothetical. The goal of this study is to aid in the annotation of the S. aureus NCTC 8325 genome by identifying hypothetical proteins related to the Major Facilitator Superfamily (MFS). The MFS is a broad protein group with members involved in drug efflux mechanisms causing resistance. To do this, sequences for three MFS proteins with X-ray crystal structures in E. coli were PSI-BLASTed against the S. aureus NCTC 8325 genome to identify homologs. Eleven identified hypothetical protein homologs underwent BLASTP against the non-redundant NCBI database to fit homologs specific to each hypothetical protein. ExPASy characterized the physicochemical features, CDD-BLAST and Pfam identified domains, and the SOSUI server defined transmembrane helices of each hypothetical protein. Based on size (300 – 700 amino acids), number of transmembrane helices (>7), CD06174 and MFS domains in CDD-BLAST and Pfam, respectively, and close relation to well-defined homologs, SAOUHSC_00058, SAOUHSC_00078, SAOUHSC_00952, SAOUHSC_02435, SAOUHSC_02752, and ABD31642.1 are members of the MFS. Further multiple-alignment and phylogeny analyses show SAOUHSC_00058 to be a quinolone resistance protein (NorB), SAOUHSC_00078 a siderophore biosynthesis protein (SbnD), SAOUHSC_00952 a glycolipid permease (LtaA), SAOUHSC_02435 a macrolide MFS transporter, SAOUHSC_02752 a chloramphenicol resistance protein (DHA1), and ABD31642.1 a Bcr/CflA family drug resistance efflux transporter. These findings provide better annotation for the existing genome, and identify proteins related to antibiotic resistance in S. aureus NCTC 8325. PMID:28197063
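
    The numeric and domain criteria listed above can be expressed as a simple filter. The sketch below is an illustration under stated assumptions: the `Candidate` record, its field names, and the two example entries are hypothetical, and the "close relation to well-defined homologs" criterion is not modeled because it requires alignment and phylogeny analyses.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    length_aa: int          # protein length in amino acids
    tm_helices: int         # predicted transmembrane helices
    has_cdd_cd06174: bool   # CD06174 domain reported by CDD-BLAST
    has_pfam_mfs: bool      # MFS domain reported by Pfam


def meets_mfs_criteria(c: Candidate) -> bool:
    """Apply the size, helix-count and domain criteria from the abstract."""
    return (
        300 <= c.length_aa <= 700
        and c.tm_helices > 7
        and c.has_cdd_cd06174
        and c.has_pfam_mfs
    )


# Hypothetical example records, not real annotations.
candidates = [
    Candidate("hypothetical_A", 410, 12, True, True),
    Candidate("hypothetical_B", 250, 5, False, True),
]
print([c.name for c in candidates if meets_mfs_criteria(c)])
```

    Keeping each criterion as a separate field makes it easy to report which test a rejected candidate failed, which is useful when the thresholds themselves are under review.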

  14. Prediction driven functional annotation of hypothetical proteins in the major facilitator superfamily of S. aureus NCTC 8325.

    PubMed

    Marklevitz, Jessica; Harris, Laura K

    2016-01-01

    Antibiotic-resistant Staphylococcus aureus strains cause several life-threatening infections. New drug treatment options are needed, but are slow to develop because 50% of the S. aureus genome is hypothetical. The goal of this study is to aid in the annotation of the S. aureus NCTC 8325 genome by identifying hypothetical proteins related to the Major Facilitator Superfamily (MFS). The MFS is a broad protein group with members involved in drug efflux mechanisms causing resistance. To do this, sequences for three MFS proteins with X-ray crystal structures in E. coli were PSI-BLASTed against the S. aureus NCTC 8325 genome to identify homologs. Eleven identified hypothetical protein homologs underwent BLASTP against the non-redundant NCBI database to fit homologs specific to each hypothetical protein. ExPASy characterized the physicochemical features, CDD-BLAST and Pfam identified domains, and the SOSUI server defined transmembrane helices of each hypothetical protein. Based on size (300 - 700 amino acids), number of transmembrane helices (>7), CD06174 and MFS domains in CDD-BLAST and Pfam, respectively, and close relation to well-defined homologs, SAOUHSC_00058, SAOUHSC_00078, SAOUHSC_00952, SAOUHSC_02435, SAOUHSC_02752, and ABD31642.1 are members of the MFS. Further multiple-alignment and phylogeny analyses show SAOUHSC_00058 to be a quinolone resistance protein (NorB), SAOUHSC_00078 a siderophore biosynthesis protein (SbnD), SAOUHSC_00952 a glycolipid permease (LtaA), SAOUHSC_02435 a macrolide MFS transporter, SAOUHSC_02752 a chloramphenicol resistance protein (DHA1), and ABD31642.1 a Bcr/CflA family drug resistance efflux transporter. These findings provide better annotation for the existing genome, and identify proteins related to antibiotic resistance in S. aureus NCTC 8325.

  15. A comprehensive overview of computational protein disorder prediction methods

    PubMed Central

    Deng, Xin; Eickholt, Jesse

    2013-01-01

    Over the past decade there has been a growing acknowledgement that a large proportion of proteins within most proteomes contain disordered regions. Disordered regions are segments of the protein chain which do not adopt a stable structure. Recognition of disordered regions in a protein is of great importance for protein structure prediction, protein structure determination and function annotation as these regions have a close relationship with protein expression and functionality. As a result, a great many protein disorder prediction methods have been developed so far. Here, we present an overview of current protein disorder prediction methods including an analysis of their advantages and shortcomings. In order to help users to select alternative tools under different circumstances, we also evaluate 23 disorder predictors on the benchmark data of the most recent round of the Critical Assessment of protein Structure Prediction (CASP) and assess their accuracy using several complementary measures. PMID:21874190

  16. Multitask learning for protein subcellular location prediction.

    PubMed

    Xu, Qian; Pan, Sinno Jialin; Xue, Hannah Hong; Yang, Qiang

    2011-01-01

    Protein subcellular localization is concerned with predicting the location of a protein within a cell using computational methods. The location information can indicate key functionalities of proteins. Thus, accurate prediction of subcellular localizations of proteins can help the prediction of protein functions and genome annotations, as well as the identification of drug targets. Machine learning methods such as Support Vector Machines (SVMs) have been used in the past for the problem of protein subcellular localization, but have been shown to suffer from a lack of annotated training data in each species under study. To overcome this data sparsity problem, we observe that because some of the organisms may be related to each other, there may be some commonalities across different organisms that can be discovered and used to help boost the data in each localization task. In this paper, we formulate the protein subcellular localization problem as one of multitask learning across different organisms. We adapt and compare two specializations of the multitask learning algorithms on 20 different organisms. Our experimental results show that multitask learning performs much better than the traditional single-task methods. Among the different multitask learning methods, we found that the multitask kernels and supertype kernels under multitask learning that share parameters perform slightly better than multitask learning by sharing latent features. The most significant improvement in terms of localization accuracy is about 25 percent. We find that if the organisms are very different or are only remotely related from a biological point of view, then jointly training the multiple models cannot lead to significant improvement. However, if they are closely related biologically, multitask learning can do much better than individual learning.

  17. Predictive and comparative analysis of Ebolavirus proteins

    PubMed Central

    Cong, Qian; Pei, Jimin; Grishin, Nick V

    2015-01-01

    Ebolavirus is the pathogen for Ebola Hemorrhagic Fever (EHF). This disease exhibits a high fatality rate and has recently reached a historically epidemic proportion in West Africa. Out of the 5 known Ebolavirus species, only Reston ebolavirus has lost human pathogenicity, while retaining the ability to cause EHF in long-tailed macaque. Significant efforts have been spent to determine the three-dimensional (3D) structures of Ebolavirus proteins, to study their interaction with host proteins, and to identify the functional motifs in these viral proteins. Here, in light of these experimental results, we apply computational analysis to predict the 3D structures and functional sites for Ebolavirus protein domains with unknown structure, including a zinc-finger domain of VP30, the RNA-dependent RNA polymerase catalytic domain and a methyltransferase domain of protein L. In addition, we compare sequences of proteins that interact with Ebolavirus proteins from RESTV-resistant primates with those from RESTV-susceptible monkeys. The host proteins that interact with GP and VP35 show an elevated level of sequence divergence between the RESTV-resistant and RESTV-susceptible species, suggesting that they may be responsible for host specificity. Meanwhile, we detect variable positions in protein sequences that are likely associated with the loss of human pathogenicity in RESTV, map them onto the 3D structures and compare their positions to known functional sites. VP35 and VP30 are significantly enriched in these potential pathogenicity determinants and the clustering of such positions on the surfaces of VP35 and GP suggests possible uncharacterized interaction sites with host proteins that contribute to the virulence of Ebolavirus. PMID:26158395

  18. Developing algorithms for predicting protein-protein interactions of homology modeled proteins.

    SciTech Connect

    Martin, Shawn Bryan; Sale, Kenneth L.; Faulon, Jean-Loup Michel; Roe, Diana C.

    2006-01-01

    The goal of this project was to examine the protein-protein docking problem, especially as it relates to homology-based structures, identify the key bottlenecks in current software tools, and evaluate and prototype new algorithms that may be developed to improve these bottlenecks. This report describes the current challenges in the protein-protein docking problem: correctly predicting the binding site for the protein-protein interaction and correctly placing the sidechains. Two different and complementary approaches are taken that can help with the protein-protein docking problem. The first approach is to predict interaction sites prior to docking, and uses bioinformatics studies of protein-protein interactions to predict these interaction sites. The second approach is to improve validation of predicted complexes after docking, and uses an improved scoring function for evaluating proposed docked poses, incorporating a solvation term. This scoring function demonstrates significant improvement over current state-of-the-art functions. Initial studies on both these approaches are promising, and argue for full development of these algorithms.

  19. Predictions of Protein-Protein Interfaces within Membrane Protein Complexes

    PubMed Central

    Asadabadi, Ebrahim Barzegari; Abdolmaleki, Parviz

    2013-01-01

    Background Prediction of interaction sites within membrane protein complexes from sequence data is of great importance, because it would find applications in modifying molecular transport through membranes, signaling pathways and drug targets of many diseases. Nevertheless, it has gained little attention from the protein structural bioinformatics community. Methods In this study, a wide variety of prediction and classification tools were applied to distinguish the residues at the interfaces of membrane proteins from those not in the interfaces. Results The tuned SVM model achieved a high accuracy of 86.95% and an AUC of 0.812, which outperforms the results of the only previous similar study. Nevertheless, the prediction performance obtained with most of the employed models is not yet sufficient for applied fields and needs further improvement. Conclusion Considering the variety of tools applied in this study, the present investigation could be a good starting point for developing more efficient tools to predict membrane protein interaction site residues. PMID:23919118

  20. Using radial basis function on the general form of Chou's pseudo amino acid composition and PSSM to predict subcellular locations of proteins with both single and multiple sites.

    PubMed

    Huang, Chao; Yuan, Jingqi

    2013-07-01

    Prediction of protein subcellular location is a meaningful task which has attracted much attention in recent years. Many protein subcellular location predictors that can only deal with single-location proteins have been developed. However, some proteins may belong to two or even more subcellular locations. It is important to develop predictors that can deal with multiplex proteins, because these proteins have extremely useful implications in both basic biological research and drug discovery. Considering that the number of methods dealing with multiplex proteins is limited, it is meaningful to explore new methods that can predict the subcellular location of proteins with both single and multiple sites. Applying different feature extraction methods and different prediction models to different benchmark datasets may yield some general results. In this paper, two different feature extraction methods and two different neural network models were applied to three benchmark datasets of different kinds of proteins, i.e. datasets constructed specially for Gram-positive bacterial proteins, plant proteins and virus proteins. These benchmark datasets have different numbers of location sites. The results show that the RBF neural network is clearly superior to the BP neural network on these datasets, no matter which type of feature extraction is chosen.

  1. Structure to function prediction of hypothetical protein KPN_00953 (Ycbk) from Klebsiella pneumoniae MGH 78578 highlights possible role in cell wall metabolism

    PubMed Central

    2014-01-01

    Background Klebsiella pneumoniae plays a major role in causing nosocomial infection in immunocompromised patients. Infections caused by the pathogen range from respiratory and urinary tract infections to septicemia and, primarily, pneumonia. As more K. pneumoniae strains are becoming highly resistant to various antibiotics, treatment of this bacterium has been rendered more difficult. This situation, as a consequence, poses a threat to public health. Hence, identification of possible novel drug targets against this opportunistic pathogen needs to be undertaken. In the complete genome sequence of K. pneumoniae MGH 78578, approximately one-fourth of the genome encodes hypothetical proteins (HPs). Due to their low homology and relatedness to other known proteins, HPs may serve as potential new drug targets. Results Sequence analysis of the HPs of K. pneumoniae MGH 78578 revealed that a particular HP termed KPN_00953 (YcbK) contains an M15_3 peptidase superfamily conserved domain. Some members of this superfamily are metalloproteases that are involved in cell wall metabolism. A BLASTP similarity search on KPN_00953 (YcbK) revealed that the majority of the hits were hypothetical proteins, although two of the hits suggested that it may be a lipoprotein or related to the twin-arginine translocation (Tat) pathway, which is important for transport of proteins to the cell membrane and periplasmic space. As lipoproteins and other components of the cell wall are important pathogenic factors, homology modeling of KPN_00953 was attempted to predict the structure and function of this protein. The three-dimensional model of the protein showed that its secondary structure topology and active site are similar to those found among metalloproteases, where two His residues, namely His169 and His209, and an Asp residue, Asp176, in KPN_00953 were found to be Zn-chelating residues. Interestingly, induced expression of the cloned KPN_00953 gene in lipoprotein-deficient E. coli JE5505 resulted in smoother

  2. Functional assignment to JEV proteins using SVM

    PubMed Central

    Sahoo, Ganesh Chandra; Dikhit, Manas Ranjan; Das, Pradeep

    2008-01-01

    Identification of different protein functions facilitates a mechanistic understanding of Japanese encephalitis virus (JEV) infection and opens novel means for drug development. Support vector machines (SVM), useful for predicting the functional class of distantly related proteins, are employed to ascribe a possible functional class to Japanese encephalitis virus proteins. Our study using SVMProt and available JE virus sequences suggests that the structural and nonstructural proteins of the JEV genome possibly belong to diverse protein functional classes that are expected to occur in the life cycle of JE virus. Protein functions common to both structural and non-structural proteins are iron-binding, metal-binding, lipid-binding, copper-binding, transmembrane, outer membrane, and the channels/pores - pore-forming toxins (proteins and peptides) group of proteins. Non-structural proteins perform functions such as actin binding, zinc-binding, calcium-binding, hydrolases, carbon-oxygen lyases, P-type ATPase, proteins belonging to the major facilitator family (MFS), the secreting main terminal branch (MTB) family, phosphotransfer-driven group translocators and the ATP-binding cassette (ABC) family group of proteins. Structural proteins, besides belonging to the same structural group of proteins (capsid, structural, envelope), also perform functions such as nuclear receptor, antibiotic resistance, RNA-binding, DNA-binding, magnesium-binding, isomerase (intra-molecular) and oxidoreductase, and participate in the type II (general) secretory pathway (IISP). PMID:19052658

  3. Predicting Resistance Mutations Using Protein Design Algorithms

    SciTech Connect

    Frey, K.; Georgiev, I; Donald, B; Anderson, A

    2010-01-01

    Drug resistance resulting from mutations to the target is an unfortunate common phenomenon that limits the lifetime of many of the most successful drugs. In contrast to the investigation of mutations after clinical exposure, it would be powerful to be able to incorporate strategies early in the development process to predict and overcome the effects of possible resistance mutations. Here we present a unique prospective application of an ensemble-based protein design algorithm, K*, to predict potential resistance mutations in dihydrofolate reductase from Staphylococcus aureus using positive design to maintain catalytic function and negative design to interfere with binding of a lead inhibitor. Enzyme inhibition assays show that three of the four highly-ranked predicted mutants are active yet display lower affinity (18-, 9-, and 13-fold) for the inhibitor. A crystal structure of the top-ranked mutant enzyme validates the predicted conformations of the mutated residues and the structural basis of the loss of potency. The use of protein design algorithms to predict resistance mutations could be incorporated in a lead design strategy against any target that is susceptible to mutational resistance.

  4. Statistical analysis and prediction of protein-protein interfaces.

    PubMed

    Bordner, Andrew J; Abagyan, Ruben

    2005-08-15

    Predicting protein-protein interfaces from a three-dimensional structure is a key task of computational structural proteomics. In contrast to geometrically distinct small molecule binding sites, protein-protein interfaces are notoriously difficult to predict. We generated a large nonredundant data set of 1494 true protein-protein interfaces using biological symmetry annotation where necessary. The data set was carefully analyzed and a Support Vector Machine was trained on a combination of a new robust evolutionary conservation signal with the local surface properties to predict protein-protein interfaces. Fivefold cross validation verifies the high sensitivity and selectivity of the model. As much as 97% of the predicted patches had an overlap with the true interface patch while only 22% of the surface residues were included in an average predicted patch. The model allowed the identification of potential new interfaces and the correction of mislabeled oligomeric states.

  5. Practical lessons from protein structure prediction

    PubMed Central

    Ginalski, Krzysztof; Grishin, Nick V.; Godzik, Adam; Rychlewski, Leszek

    2005-01-01

    Despite recent efforts to develop automated protein structure determination protocols, structural genomics projects are slow in generating fold assignments for complete proteomes, and spatial structures remain unknown for many protein families. Alternative cheap and fast methods to assign folds using prediction algorithms continue to provide valuable structural information for many proteins. The development of high-quality prediction methods has been boosted in the last years by objective community-wide assessment experiments. This paper gives an overview of the currently available practical approaches to protein structure prediction capable of generating accurate fold assignment. Recent advances in assessment of the prediction quality are also discussed. PMID:15805122

  6. Prediction of DNA-binding proteins from relational features

    PubMed Central

    2012-01-01

    Background The process of protein-DNA binding has an essential role in the biological processing of genetic information. We use relational machine learning to predict DNA-binding propensity of proteins from their structures. Automatically discovered structural features are able to capture some characteristic spatial configurations of amino acids in proteins. Results Prediction based only on structural relational features already achieves competitive results to existing methods based on physicochemical properties on several protein datasets. Predictive performance is further improved when structural features are combined with physicochemical features. Moreover, the structural features provide some insights not revealed by physicochemical features. Our method is able to detect common spatial substructures. We demonstrate this in experiments with zinc finger proteins. Conclusions We introduced a novel approach for DNA-binding propensity prediction using relational machine learning which could potentially be used also for protein function prediction in general. PMID:23146001

  7. Phylogenetic and genomewide analyses suggest a functional relationship between kayak, the Drosophila fos homolog, and fig, a predicted protein phosphatase 2c nested within a kayak intron.

    PubMed

    Hudson, Stephanie G; Garrett, Matthew J; Carlson, Joseph W; Micklem, Gos; Celniker, Susan E; Goldstein, Elliott S; Newfeld, Stuart J

    2007-11-01

    A gene located within the intron of a larger gene is an uncommon arrangement in any species. Few of these nested gene arrangements have been explored from an evolutionary perspective. Here we report a phylogenetic analysis of kayak (kay) and fos intron gene (fig), a divergently transcribed gene located in a kay intron, utilizing 12 Drosophila species. The evolutionary relationship between these genes is of interest because kay is the homolog of the proto-oncogene c-fos whose function is modulated by serine/threonine phosphorylation and fig is a predicted PP2C phosphatase specific for serine/threonine residues. We found that, despite an extraordinary level of diversification in the intron-exon structure of kay (11 inversions and six independent exon losses), the nested arrangement of kay and fig is conserved in all species. A genomewide analysis of protein-coding nested gene pairs revealed that approximately 20% of nested pairs in D. melanogaster are also nested in D. pseudoobscura and D. virilis. A phylogenetic examination of fig revealed that there are three subfamilies of PP2C phosphatases in all 12 species of Drosophila. Overall, our phylogenetic and genomewide analyses suggest that the nested arrangement of kay and fig may be due to a functional relationship between them.

  8. Phospholipid liposomes functionalized by protein

    NASA Astrophysics Data System (ADS)

    Glukhova, O. E.; Savostyanov, G. V.; Grishina, O. A.

    2015-03-01

    Finding new ways to deliver neurotrophic drugs to the brain in newborns is one of the contemporary problems of medicine and the pharmaceutical industry. Current research in this field indicates that supramolecular transport systems capable of overcoming the blood-brain barrier (BBB) are promising for targeted drug delivery to the brain. Solving this problem matters not only for medicine but also for society as a whole, because it determines the health of future generations. Phospholipid liposomes, owing to their combination of lipophilic and hydrophilic properties, are considered the main future vehicles in medicine for drug delivery through the BBB, as well as for increasing their bioavailability and toxicity. Liposomes functionalized by various proteins have been used as transport systems to make liposomes easier to use. Oligosaccharide modification of the liposome surface has been a promising design approach over the last decade, because selecting the ligand enables delivery of liposomes to a specific receptor on human cells, and it is widely used in pharmacology for the treatment of several diseases. The purpose of this work is to create a coarse-grained model of the bilayer of phospholipid liposomes functionalized by proteins specific to the structural elements of the BBB, and to predict, by molecular docking, the most favorable orientation and position of the molecules in the generated complex for the formation of the structure. The activity of the ligand molecule toward the protein receptor of human cells was investigated by molecular dynamics.

  9. Construction of ontology augmented networks for protein complex prediction.

    PubMed

    Zhang, Yijia; Lin, Hongfei; Yang, Zhihao; Wang, Jian

    2013-01-01

    Protein complexes are of great importance in understanding the principles of cellular organization and function. The increase in available protein-protein interaction data, gene ontology and other resources make it possible to develop computational methods for protein complex prediction. Most existing methods focus mainly on the topological structure of protein-protein interaction networks, and largely ignore the gene ontology annotation information. In this article, we constructed ontology augmented networks with protein-protein interaction data and gene ontology, which effectively unified the topological structure of protein-protein interaction networks and the similarity of gene ontology annotations into unified distance measures. After constructing ontology augmented networks, a novel method (clustering based on ontology augmented networks) was proposed to predict protein complexes, which was capable of taking into account the topological structure of the protein-protein interaction network, as well as the similarity of gene ontology annotations. Our method was applied to two different yeast protein-protein interaction datasets and predicted many well-known complexes. The experimental results showed that (i) ontology augmented networks and the unified distance measure can effectively combine the structure closeness and gene ontology annotation similarity; (ii) our method is valuable in predicting protein complexes and has higher F1 and accuracy compared to other competing methods.
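
    The abstract above does not give the formula for its unified distance measure. As a rough illustration of the general idea only (combining topological closeness with annotation similarity in one number), the sketch below blends two Jaccard similarities with a weight alpha. The function names, the choice of Jaccard, the alpha parameter and the placeholder annotation labels are all assumptions, not the authors' definitions.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def unified_distance(u, v, neighbors, annotations, alpha=0.5):
    """Hypothetical unified distance between proteins u and v.

    neighbors:   dict protein -> set of interaction partners
    annotations: dict protein -> set of ontology term labels
    alpha:       weight on the topological part (assumed, not from the paper)
    """
    # Topological similarity: overlap of closed neighborhoods in the PPI network.
    topo = jaccard(neighbors[u] | {u}, neighbors[v] | {v})
    # Annotation similarity: overlap of the ontology terms attached to each protein.
    onto = jaccard(annotations[u], annotations[v])
    return 1.0 - (alpha * topo + (1.0 - alpha) * onto)


# Toy network with placeholder term labels (not real GO identifiers).
neighbors = {
    "A": {"B", "C"}, "B": {"A", "C"}, "C": {"A", "B"},
    "D": {"E"}, "E": {"D"},
}
annotations = {
    "A": {"t1", "t2"}, "B": {"t1", "t2"}, "C": {"t1"},
    "D": {"t9"}, "E": {"t9"},
}
print(unified_distance("A", "B", neighbors, annotations))
print(unified_distance("A", "D", neighbors, annotations))
```

    Any clustering method that accepts a pairwise distance could then be run on such values; which clustering the authors used is not stated in the abstract.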

  10. PQuad: Visualization of Predicted Peptides and Proteins

    SciTech Connect

    Havre, Susan L.; Singhal, Mudita; Payne, Deborah A.; Webb-Robertson, Bobbie-Jo M.

    2004-10-10

    New high-throughput proteomic techniques generate data faster than biologists and bioinformaticists can analyze it. Yet, hidden within this massive and complex data are answers to basic questions about how cells function to support life or respond to disease. Now biologists can take a global or systems approach studying not one or two proteins at a time but whole proteomes comprising all the proteins in a cell. However, the tremendous size and complexity of the high-throughput experiment data make it difficult to process and interpret. Visualization provides powerful analysis capabilities for such enormous and complex data. In this paper, we introduce a novel interactive visualization, PQuad (Peptide Permutation and Protein Prediction), designed for the visual analysis of peptides (protein fragments) identified from high-throughput data. PQuad depicts the experiment peptides in the context of their parent protein and DNA, thereby integrating proteomic and genomic information. A wrapped line metaphor is applied across key resolutions of the data, from a compressed view of an entire chromosome to the actual nucleotide sequence. PQuad provides a difference visualization for comparing peptides from different experimental conditions. We describe the requirements for such a visual analysis tool, the design decisions, and the novel aspects of PQuad.

  11. Functional enrichment analyses and construction of functional similarity networks with high confidence function prediction by PFP

    PubMed Central

    2010-01-01

    Background A new paradigm of biological investigation takes advantage of technologies that produce large high throughput datasets, including genome sequences, interactions of proteins, and gene expression. The ability of biologists to analyze and interpret such data relies on functional annotation of the included proteins, but even in highly characterized organisms many proteins can lack the functional evidence necessary to infer their biological relevance. Results Here we have applied high confidence function predictions from our automated prediction system, PFP, to three genome sequences, Escherichia coli, Saccharomyces cerevisiae, and Plasmodium falciparum (malaria). The number of annotated genes is increased by PFP to over 90% for all of the genomes. Using the large coverage of the function annotation, we introduced the functional similarity networks which represent the functional space of the proteomes. Four different functional similarity networks are constructed for each proteome, one each by considering similarity in a single Gene Ontology (GO) category, i.e. Biological Process, Cellular Component, and Molecular Function, and another one by considering overall similarity with the funSim score. The functional similarity networks are shown to have higher modularity than the protein-protein interaction network. Moreover, the funSim score network is distinct from the single GO-score networks by showing a higher clustering degree exponent value and thus has a higher tendency to be hierarchical. In addition, examining function assignments to the protein-protein interaction network and local regions of genomes has identified numerous cases where subnetworks or local regions have functionally coherent proteins. These results will help interpreting interactions of proteins and gene orders in a genome. Several examples of both analyses are highlighted. Conclusion The analyses demonstrate that applying high confidence predictions from PFP can have a significant impact

  12. An Overview of Practical Applications of Protein Disorder Prediction and Drive for Faster, More Accurate Predictions

    PubMed Central

    Deng, Xin; Gumm, Jordan; Karki, Suman; Eickholt, Jesse; Cheng, Jianlin

    2015-01-01

    Protein disordered regions are segments of a protein chain that do not adopt a stable structure. Thus far, a variety of protein disorder prediction methods have been developed and have been widely used, not only in traditional bioinformatics domains, including protein structure prediction, protein structure determination and function annotation, but also in many other biomedical fields. The relationship between intrinsically-disordered proteins and some human diseases has played a significant role in disorder prediction in disease identification and epidemiological investigations. Disordered proteins can also serve as potential targets for drug discovery with an emphasis on the disordered-to-ordered transition in the disordered binding regions, and this has led to substantial research in drug discovery or design based on protein disordered region prediction. Furthermore, protein disorder prediction has also been applied to healthcare by predicting the disease risk of mutations in patients and studying the mechanistic basis of diseases. As the applications of disorder prediction increase, so too does the need to make quick and accurate predictions. To fill this need, we also present a new approach to predict protein residue disorder using wide sequence windows that is applicable on the genomic scale. PMID:26198229

  13. An Overview of Practical Applications of Protein Disorder Prediction and Drive for Faster, More Accurate Predictions.

    PubMed

    Deng, Xin; Gumm, Jordan; Karki, Suman; Eickholt, Jesse; Cheng, Jianlin

    2015-07-07

    Protein disordered regions are segments of a protein chain that do not adopt a stable structure. Thus far, a variety of protein disorder prediction methods have been developed and have been widely used, not only in traditional bioinformatics domains, including protein structure prediction, protein structure determination and function annotation, but also in many other biomedical fields. The relationship between intrinsically-disordered proteins and some human diseases has played a significant role in disorder prediction in disease identification and epidemiological investigations. Disordered proteins can also serve as potential targets for drug discovery with an emphasis on the disordered-to-ordered transition in the disordered binding regions, and this has led to substantial research in drug discovery or design based on protein disordered region prediction. Furthermore, protein disorder prediction has also been applied to healthcare by predicting the disease risk of mutations in patients and studying the mechanistic basis of diseases. As the applications of disorder prediction increase, so too does the need to make quick and accurate predictions. To fill this need, we also present a new approach to predict protein residue disorder using wide sequence windows that is applicable on the genomic scale.

  14. Predicting Protein Structure Using Parallel Genetic Algorithms.

    DTIC Science & Technology

    1994-12-01

    Thesis, AFIT/GCS/ENG/94D-03: Predicting Protein Structure Using Parallel Genetic Algorithms. By George H... Approved for public release; distribution unlimited. Contents excerpt: 1.2 Genetic Algorithms; 1.3 The Protein Folding Problem

  15. Protein Structure Prediction with Visuospatial Analogy

    NASA Astrophysics Data System (ADS)

    Davies, Jim; Glasgow, Janice; Kuo, Tony

    We show that visuospatial representations and reasoning techniques can be used as a similarity metric for analogical protein structure prediction. Our system retrieves pairs of α-helices based on contact map similarity, then transfers and adapts the structure information to an unknown helix pair, showing that similar protein contact maps predict similar 3D protein structure. The success of this method provides support for the notion that changing representations can enable similarity metrics in analogy.
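
    To make the notion of contact map similarity concrete, here is a minimal sketch that builds a binary contact map from 3D coordinates and compares two maps. The 8 Å cutoff and the Jaccard score are common illustrative choices and are assumptions here; the visuospatial similarity metric used by the authors is not described in the abstract and may differ.

```python
import math


def contact_map(coords, cutoff=8.0):
    """Binary contact map: entry [i][j] is 1 if residues i and j (i != j)
    lie within `cutoff` of each other, else 0."""
    n = len(coords)
    return [
        [1 if i != j and math.dist(coords[i], coords[j]) <= cutoff else 0
         for j in range(n)]
        for i in range(n)
    ]


def map_similarity(map_a, map_b):
    """Jaccard similarity of two equally sized contact maps (assumed metric)."""
    both = either = 0
    for row_a, row_b in zip(map_a, map_b):
        for a, b in zip(row_a, row_b):
            both += 1 if (a and b) else 0
            either += 1 if (a or b) else 0
    return both / either if either else 1.0


# Toy coordinates (one point per residue): an extended chain and a compact square.
extended = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (10.0, 0.0, 0.0), (15.0, 0.0, 0.0)]
compact = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (5.0, 5.0, 0.0), (0.0, 5.0, 0.0)]

map_ext = contact_map(extended)
map_cmp = contact_map(compact)
print(map_similarity(map_ext, map_ext))
print(map_similarity(map_ext, map_cmp))
```

    In a retrieval setting, the stored helix pair with the highest similarity to the query map would be the candidate whose structure information is transferred and adapted.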

  16. Predictive characterization of hypothetical proteins in Staphylococcus aureus NCTC 8325

    PubMed Central

    School, Kuana; Marklevitz, Jessica; K. Schram, William; K. Harris, Laura

    2016-01-01

    Staphylococcus aureus is one of the most common causes of hospital-acquired infections. It colonizes immunocompromised patients, and with the number of antibiotic-resistant strains increasing, medicine needs new treatment options. Understanding more about the proteins this organism uses would further this goal. Hypothetical proteins are sequences thought to encode a functional protein but for which little to no evidence of that function exists. About half of the genomic proteins in reference strain S. aureus NCTC 8325 are hypothetical. Since annotation of these proteins can lead to new therapeutic targets, there is a high demand to characterize hypothetical proteins. This work examines 35 hypothetical proteins from the chromosome of S. aureus NCTC 8325. Examination includes physicochemical characterization; sequence homology; structural homology; domain recognition; structure modeling; active site depiction; predicted protein-protein interactions; protein-chemical interactions; protein localization; protein stability; and protein solubility. The examination revealed some hypothetical proteins related to virulence domains and protein-protein interactions including superoxide dismutase, O-antigen, bacterial ferric iron reductase and siderophore synthesis. Yet other hypothetical proteins appear to be metabolic or transport proteins including ABC transporters, major facilitator superfamily, S-adenosylmethionine decarboxylase, and GTPases. Progress evaluating some hypothetical proteins, particularly the smaller ones, was incomplete due to limited homology and structural information in public repositories. These data characterizing hypothetical proteins will contribute to the scientific understanding of S. aureus by identifying potential drug targets and aiding in future drug discovery. PMID:28149057

  17. eF-seek: prediction of the functional sites of proteins by searching for similar electrostatic potential and molecular surface shape

    PubMed Central

    Kinoshita, Kengo; Murakami, Yoichi; Nakamura, Haruki

    2007-01-01

    We have developed a method to predict ligand-binding sites in a new protein structure by searching for similar binding sites in the Protein Data Bank (PDB). The similarities are measured according to the shapes of the molecular surfaces and their electrostatic potentials. A new web server, eF-seek, provides an interface to our search method. It simply requires a coordinate file in the PDB format, and generates a prediction result as a virtual complex structure, with the putative ligands in a PDB format file as the output. In addition, the predicted interacting interface is displayed to facilitate the examination of the virtual complex structure on our own applet viewer with the web browser (URL: http://eF-site.hgc.jp/eF-seek). PMID:17567616

  18. PI2PE: protein interface/interior prediction engine.

    PubMed

    Tjong, Harianto; Qin, Sanbo; Zhou, Huan-Xiang

    2007-07-01

    The side chains of the 20 types of amino acids, owing to a large extent to their different physical properties, have characteristic distributions in interior/surface regions of individual proteins and in interface/non-interface portions of protein surfaces that bind proteins or nucleic acids. These distributions have important structural and functional implications. We have developed accurate methods for predicting the solvent accessibility of amino acids from a protein sequence and for predicting interface residues from the structure of a protein-binding or DNA-binding protein. The methods are called WESA, cons-PPISP and DISPLAR, respectively. The web servers of these methods are now available at http://pipe.scs.fsu.edu. To illustrate the utility of these web servers, cons-PPISP and DISPLAR predictions are used to construct a structural model for a multicomponent protein-DNA complex.

  19. [Protein phosphatases: structure and function].

    PubMed

    Bulanova, E G; Budagian, V M

    1994-01-01

    Phosphorylation of proteins and enzyme systems is necessary for cell growth, differentiation and preparation for division and mitosis. The conformational changes of a protein that result from phosphorylation lead to increased enzyme activity and enhanced affinity for substrates. A large group of enzymes, the protein kinases, is responsible for the phosphorylation process in the cell; they are divided into tyrosine kinases and serine/threonine kinases depending on their ability to phosphorylate the appropriate amino acid residues. This review considers the functional importance and structure of protein phosphatases, enzymes that are functional antagonists of protein kinases.

  20. HOPE: a homotopy optimization method for protein structure prediction.

    PubMed

    Dunlavy, Daniel M; O'Leary, Dianne P; Klimov, Dmitri; Thirumalai, D

    2005-12-01

    We use a homotopy optimization method, HOPE, to minimize the potential energy associated with a protein model. The method uses the minimum energy conformation of one protein as a template to predict the lowest energy structure of a query sequence. This objective is achieved by following a path of conformations determined by a homotopy between the potential energy functions for the two proteins. Ensembles of solutions are produced by perturbing conformations along the path, increasing the likelihood of predicting correct structures. Successful results are presented for pairs of homologous proteins, where HOPE is compared to a variant of Newton's method and to simulated annealing.
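
    The core idea in the abstract, following a path of minimizers as one energy function is deformed into another, can be sketched on a one-dimensional toy problem. The convex combination H(x, lam) = (1 - lam) * f0(x) + lam * f1(x), the two toy energies, the plain gradient descent inner solver and all parameter values below are illustrative assumptions; HOPE itself works on protein conformations and also perturbs points along the path to build ensembles, which this sketch omits.

```python
def f0_grad(x):
    """Gradient of the toy template energy f0(x) = (x - 1)^2 (minimum at x = 1)."""
    return 2.0 * (x - 1.0)


def f1_grad(x):
    """Gradient of the toy query energy f1(x) = (x^2 - 4)^2 + x (a double well)."""
    return 4.0 * x * (x * x - 4.0) + 1.0


def homotopy_minimize(x0, n_lambda=20, inner_steps=200, lr=0.01):
    """Track a minimizer of H(x, lam) = (1 - lam) * f0 + lam * f1 as lam goes 0 -> 1.

    x0 should be a minimizer of f0. Each lam step starts gradient descent
    from the solution found at the previous lam value.
    """
    x = x0
    path = [x]
    for k in range(1, n_lambda + 1):
        lam = k / n_lambda
        for _ in range(inner_steps):
            grad = (1.0 - lam) * f0_grad(x) + lam * f1_grad(x)
            x -= lr * grad
        path.append(x)
    return x, path


x_final, path = homotopy_minimize(x0=1.0)
print(x_final)
```

    Starting from the template minimum at x = 1, the path ends in the nearby well of f1 rather than necessarily its global minimum, which is one reason a method of this kind benefits from producing ensembles of solutions.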

  1. Function and structure of inherently disordered proteins.

    PubMed

    Dunker, A Keith; Silman, Israel; Uversky, Vladimir N; Sussman, Joel L

    2008-12-01

    The application of bioinformatics methodologies to proteins inherently lacking 3D structure has brought increased attention to these macromolecules. Here topics concerning these proteins are discussed, including their prediction from amino acid sequence, their enrichment in eukaryotes compared to prokaryotes, their more rapid evolution compared to structured proteins, their organization into specific groups, their structural preferences, their half-lives in cells, their contributions to signaling diversity (via high contents of multiple-partner binding sites, post-translational modifications, and alternative splicing), their distinct functional repertoire compared to that of structured proteins, and their involvement in diseases.

  2. Structure prediction of magnetosome-associated proteins

    PubMed Central

    Nudelman, Hila; Zarivach, Raz

    2014-01-01

    Magnetotactic bacteria (MTB) are Gram-negative bacteria that can navigate along geomagnetic fields. This ability is a result of a unique intracellular organelle, the magnetosome. These organelles are composed of membrane-enclosed magnetite (Fe3O4) or greigite (Fe3S4) crystals ordered into chains along the cell. Magnetosome formation, assembly, and magnetic nano-crystal biomineralization are controlled by magnetosome-associated proteins (MAPs). Most MAP-encoding genes are located in a conserved genomic region – the magnetosome island (MAI). The MAI appears to be conserved in all MTB that were analyzed so far, although the MAI size and organization differ between species. It was shown that MAI deletion leads to a non-magnetic phenotype, further highlighting its important role in magnetosome formation. Today, about 28 proteins are known to be involved in magnetosome formation, but the structures and functions of most MAPs are unknown. To reveal the structure–function relationship of MAPs, we used bioinformatics tools to build homology models as a way to understand their possible role in magnetosome formation. Here we present an overview of predicted 3D structural models for all known Magnetospirillum gryphiswaldense strain MSR-1 MAPs. PMID:24523717

  3. Predicting the fission yeast protein interaction network.

    PubMed

    Pancaldi, Vera; Saraç, Omer S; Rallis, Charalampos; McLean, Janel R; Převorovský, Martin; Gould, Kathleen; Beyer, Andreas; Bähler, Jürg

    2012-04-01

    A systems-level understanding of biological processes and information flow requires the mapping of cellular component interactions, among which protein-protein interactions are particularly important. Fission yeast (Schizosaccharomyces pombe) is a valuable model organism for which no systematic protein-interaction data are available. We exploited gene and protein properties, global genome regulation datasets, and conservation of interactions between budding and fission yeast to predict fission yeast protein interactions in silico. We have extensively tested our method in three ways: first, by predicting with 70-80% accuracy a selected high-confidence test set; second, by recapitulating interactions between members of the well-characterized SAGA co-activator complex; and third, by verifying predicted interactions of the Cbf11 transcription factor using mass spectrometry of TAP-purified protein complexes. Given the importance of the pathway in cell physiology and human disease, we explore the predicted sub-networks centered on the Tor1/2 kinases. Moreover, we predict the histidine kinases Mak1/2/3 to be vital hubs in the fission yeast stress response network, and we suggest interactors of argonaute 1, the principal component of the siRNA-mediated gene silencing pathway, lost in budding yeast but preserved in S. pombe. Of the new high-quality interactions that were discovered after we started this work, 73% were found in our predictions. Even though any predicted interactome is imperfect, the protein network presented here can provide a valuable basis to explore biological processes and to guide wet-lab experiments in fission yeast and beyond. Our predicted protein interactions are freely available through PInt, an online resource on our website (www.bahlerlab.info/PInt).

  4. Protein localization prediction using random walks on graphs

    PubMed Central

    2013-01-01

    Background Understanding the localization of proteins in cells is vital to characterizing their functions and possible interactions. As a result, identifying the (sub)cellular compartment within which a protein is located becomes an important problem in protein classification. This classification issue thus involves predicting labels in a dataset with a limited number of labeled data points available. By utilizing a graph representation of protein data, random walk techniques have performed well in sequence classification and functional prediction; however, this method has not yet been applied to protein localization. Accordingly, we propose a novel classifier in the site prediction of proteins based on random walks on a graph. Results We propose a graph theory model for predicting protein localization using data generated in yeast and gram-negative (Gneg) bacteria. We tested the performance of our classifier on the two datasets, optimizing the model training parameters by varying the laziness values and the number of steps taken during the random walk. Using 10-fold cross-validation, we achieved an accuracy of above 61% for yeast data and about 93% for gram-negative bacteria. Conclusions This study presents a new classifier derived from the random walk technique and applies this classifier to investigate the cellular localization of proteins. The prediction accuracy and additional validation demonstrate an improvement over previous methods, such as support vector machine (SVM)-based classifiers. PMID:23815126
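
    A lazy random walk classifier of the kind described above can be sketched as follows: at each step the walker stays put with probability equal to the laziness value and otherwise moves to a uniformly chosen neighbor, and after a fixed number of steps the probability mass landing on labeled nodes is summed per class. The graph, the labels, the scoring rule and the default parameters are illustrative assumptions rather than the authors' exact model.

```python
def lazy_walk_distribution(adj, start, steps, laziness):
    """Node distribution after `steps` of a lazy random walk from `start`.

    adj:      dict node -> list of neighbors (undirected, unweighted)
    laziness: probability of staying at the current node on each step
    """
    dist = {node: 0.0 for node in adj}
    dist[start] = 1.0
    for _ in range(steps):
        # Every node keeps the "lazy" share of its current mass.
        nxt = {node: laziness * p for node, p in dist.items()}
        for node, p in dist.items():
            if p == 0.0:
                continue
            if not adj[node]:
                nxt[node] += (1.0 - laziness) * p  # isolated node: mass stays
                continue
            share = (1.0 - laziness) * p / len(adj[node])
            for nb in adj[node]:
                nxt[nb] += share
        dist = nxt
    return dist


def predict_label(adj, labels, start, steps=3, laziness=0.5):
    """Predict the class whose labeled nodes collect the most walk probability."""
    dist = lazy_walk_distribution(adj, start, steps, laziness)
    scores = {}
    for node, p in dist.items():
        if node in labels and node != start:
            scores[labels[node]] = scores.get(labels[node], 0.0) + p
    return max(scores, key=scores.get) if scores else None


# Toy similarity graph: query protein "q" touches two nuclear and one cytoplasmic protein.
adj = {
    "q": ["n1", "n2", "c1"],
    "n1": ["q", "n2"],
    "n2": ["q", "n1"],
    "c1": ["q"],
}
labels = {"n1": "nucleus", "n2": "nucleus", "c1": "cytoplasm"}
print(predict_label(adj, labels, "q"))
```

    The two tunable quantities mentioned in the abstract correspond to the `laziness` and `steps` arguments here, and would normally be chosen by cross-validation.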

  5. Assigning protein functions by comparative genome analysis protein phylogenetic profiles

    DOEpatents

    Pellegrini, Matteo; Marcotte, Edward M.; Thompson, Michael J.; Eisenberg, David; Grothe, Robert; Yeates, Todd O.

    2003-05-13

    A computational method, system, and computer program are provided for inferring functional links from genome sequences. One method is based on the observation that some pairs of proteins A' and B' have homologs in another organism fused into a single protein chain AB. A trans-genome comparison of sequences can reveal these AB sequences, which are Rosetta Stone sequences because they decipher an interaction between A' and B'. Another method compares the genomic sequences of two or more organisms to create a phylogenetic profile for each protein indicating its presence or absence across all the genomes. The profile provides information regarding functional links between different families of proteins. In yet another method, a combination of the above two methods is used to predict functional links.
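
    The phylogenetic-profile method in this record can be sketched roughly as follows. This is an illustrative reading of the abstract, not the patented procedure: the helper names, the Hamming-distance comparison, and the `max_distance` threshold are assumptions. Each protein gets a presence/absence vector across genomes, and proteins with matching vectors are proposed as functionally linked.

```python
def phylogenetic_profile(genomes_with_homolog, genomes):
    """Presence (1) / absence (0) of a protein's homolog in each genome."""
    return tuple(1 if g in genomes_with_homolog else 0 for g in genomes)


def hamming(p, q):
    """Number of genomes in which two profiles disagree."""
    return sum(a != b for a, b in zip(p, q))


def linked_pairs(profiles, max_distance=0):
    """Protein pairs whose profiles differ in at most `max_distance` genomes.

    Profiles that are all 0 or all 1 carry no signal and are skipped
    (an assumption of this sketch).
    """
    informative = {
        name: prof for name, prof in profiles.items()
        if 0 < sum(prof) < len(prof)
    }
    names = sorted(informative)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if hamming(informative[a], informative[b]) <= max_distance:
                pairs.append((a, b))
    return pairs


# Toy data (hypothetical genome and protein names).
genomes = ["g1", "g2", "g3", "g4"]
hits = {
    "P1": {"g1", "g3"},
    "P2": {"g1", "g3"},
    "P3": {"g2"},
    "P4": {"g1", "g2", "g3", "g4"},  # present everywhere: uninformative
}
profiles = {name: phylogenetic_profile(h, genomes) for name, h in hits.items()}
print(linked_pairs(profiles))
```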

  6. Enhancing interacting residue prediction with integrated contact matrix prediction in protein-protein interaction.

    PubMed

    Du, Tianchuan; Liao, Li; Wu, Cathy H

    2016-12-01

    Identifying the residues in a protein that are involved in protein-protein interaction and identifying the contact matrix for a pair of interacting proteins are two computational tasks at different levels of an in-depth analysis of protein-protein interaction. Various methods for solving these two problems have been reported in the literature. However, the interacting residue prediction and contact matrix prediction were handled by and large independently in those existing methods, though intuitively good prediction of interacting residues will help with predicting the contact matrix. In this work, we developed a novel protein interacting residue prediction system, contact matrix-interaction profile hidden Markov model (CM-ipHMM), with the integration of contact matrix prediction and the ipHMM interaction residue prediction. We propose to leverage what is learned from the contact matrix prediction and utilize the predicted contact matrix as "feedback" to enhance the interaction residue prediction. The CM-ipHMM model showed significant improvement over the previous method that uses the ipHMM for predicting interaction residues only. It indicates that the downstream contact matrix prediction could help the interaction site prediction.

  7. Predicting membrane protein types with bagging learner.

    PubMed

    Niu, Bing; Jin, Yu-Huan; Feng, Kai-Yan; Liu, Liang; Lu, Wen-Cong; Cai, Yu-Dong; Li, Guo-Zheng

    2008-01-01

    The membrane protein type is an important feature in characterizing the overall topological folding type of a protein or its domains therein. Many investigators have devoted effort to the prediction of membrane protein type. Here, we propose a new approach, the bootstrap aggregating method or bagging learner, to address this problem based on the protein amino acid composition. As a demonstration, the benchmark dataset constructed by K.C. Chou and D.W. Elrod was used to test the new method. The overall success rate thus obtained by jackknife cross-validation was over 84%, indicating that the bagging learner as presented in this paper holds a quite high potential in predicting the attributes of proteins, or at least can play a complementary role to many existing algorithms in this area. It is anticipated that the prediction quality can be further enhanced if the pseudo amino acid composition can be effectively incorporated into the current predictor. An online membrane protein type prediction web server developed in our lab is available at http://chemdata.shu.edu.cn/protein/protein.jsp.

  8. Predicting the Fission Yeast Protein Interaction Network

    PubMed Central

    Pancaldi, Vera; Saraç, Ömer S.; Rallis, Charalampos; McLean, Janel R.; Převorovský, Martin; Gould, Kathleen; Beyer, Andreas; Bähler, Jürg

    2012-01-01

    A systems-level understanding of biological processes and information flow requires the mapping of cellular component interactions, among which protein–protein interactions are particularly important. Fission yeast (Schizosaccharomyces pombe) is a valuable model organism for which no systematic protein-interaction data are available. We exploited gene and protein properties, global genome regulation datasets, and conservation of interactions between budding and fission yeast to predict fission yeast protein interactions in silico. We have extensively tested our method in three ways: first, by predicting with 70–80% accuracy a selected high-confidence test set; second, by recapitulating interactions between members of the well-characterized SAGA co-activator complex; and third, by verifying predicted interactions of the Cbf11 transcription factor using mass spectrometry of TAP-purified protein complexes. Given the importance of the pathway in cell physiology and human disease, we explore the predicted sub-networks centered on the Tor1/2 kinases. Moreover, we predict the histidine kinases Mak1/2/3 to be vital hubs in the fission yeast stress response network, and we suggest interactors of argonaute 1, the principal component of the siRNA-mediated gene silencing pathway, lost in budding yeast but preserved in S. pombe. Of the new high-quality interactions that were discovered after we started this work, 73% were found in our predictions. Even though any predicted interactome is imperfect, the protein network presented here can provide a valuable basis to explore biological processes and to guide wet-lab experiments in fission yeast and beyond. Our predicted protein interactions are freely available through PInt, an online resource on our website (www.bahlerlab.info/PInt). PMID:22540037

  9. Characterization and Prediction of Protein Flexibility Based on Structural Alphabets

    PubMed Central

    Liu, Bin

    2016-01-01

    Motivation. To assist efforts in determining and exploring the functional properties of proteins, it is desirable to characterize and predict protein flexibilities. Results. In this study, the conformational entropy is used as an indicator of the protein flexibility. We first explore whether the conformational change can capture the protein flexibility. The well-defined decoy structures are converted into one-dimensional series of letters from a structural alphabet. Four different structure alphabets, including the secondary structure in 3-class and 8-class, the PB structure alphabet (16-letter), and the DW structure alphabet (28-letter), are investigated. The conformational entropy is then calculated from the structure alphabet letters. Some of the proteins show high correlation between the conformation entropy and the protein flexibility. We then predict the protein flexibility from basic amino acid sequence. The local structures are predicted by the dual-layer model and the conformational entropy of the predicted class distribution is then calculated. The results show that the conformational entropy is a good indicator of the protein flexibility, but false positives remain a problem. The DW structure alphabet performs the best, which means that more subtle local structures can be captured by large number of structure alphabet letters. Overall this study provides a simple and efficient method for the characterization and prediction of the protein flexibility. PMID:27660756
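
    The abstract above does not give the exact entropy formula, so the following is only a sketch under an assumption: conformational entropy at each residue position is taken to be the Shannon entropy of the structural-alphabet letters observed at that position across a set of encoded decoy structures. The function name and the toy 3-class strings are hypothetical.

```python
import math
from collections import Counter


def positional_entropy(encoded_structures):
    """Shannon entropy (in bits) of the letters seen at each position.

    encoded_structures -- list of equal-length strings, one per decoy,
                          each character a structural-alphabet letter
    """
    if not encoded_structures:
        raise ValueError("need at least one encoded structure")
    length = len(encoded_structures[0])
    if any(len(s) != length for s in encoded_structures):
        raise ValueError("all encoded structures must have the same length")

    n = len(encoded_structures)
    entropies = []
    for i in range(length):
        counts = Counter(s[i] for s in encoded_structures)
        h = -sum((c / n) * math.log2(c / n) for c in counts.values())
        entropies.append(h)
    return entropies


# Toy decoys in a 3-class secondary-structure alphabet (H, E, C).
decoys = ["HHC", "HEC", "HHC", "HEC"]
print(positional_entropy(decoys))
```

    Under this assumption, a position where all decoys agree has zero entropy (rigid), and a position split evenly between two letters has one bit (more flexible). A larger alphabet, such as the 28-letter DW alphabet named in the abstract, allows a wider entropy range per position.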

  10. Proteins and Their Interacting Partners: An Introduction to Protein-Ligand Binding Site Prediction Methods.

    PubMed

    Roche, Daniel Barry; Brackenridge, Danielle Allison; McGuffin, Liam James

    2015-12-15

    Elucidating the biological and biochemical roles of proteins, and subsequently determining their interacting partners, can be difficult and time consuming using in vitro and/or in vivo methods, and consequently the majority of newly sequenced proteins will have unknown structures and functions. However, in silico methods for predicting protein-ligand binding sites and protein biochemical functions offer an alternative practical solution. The characterisation of protein-ligand binding sites is essential for investigating new functional roles, which can impact the major biological research spheres of health, food, and energy security. In this review we discuss the role in silico methods play in 3D modelling of protein-ligand binding sites, along with their role in predicting biochemical functionality. In addition, we describe in detail some of the key alternative in silico prediction approaches that are available, as well as discussing the Critical Assessment of Techniques for Protein Structure Prediction (CASP) and the Continuous Automated Model EvaluatiOn (CAMEO) projects, and their impact on developments in the field. Furthermore, we discuss the importance of protein function prediction methods for tackling 21st century problems.

  11. Computational Modeling of complete HOXB13 protein for predicting the functional effect of SNPs and the associated role in hereditary prostate cancer

    PubMed Central

    Chandrasekaran, Gopalakrishnan; Hwang, Eu Chang; Kang, Taek Won; Kwon, Dong Deuk; Park, Kwangsung; Lee, Je-Jung; Lakshmanan, Vinoth-Kumar

    2017-01-01

    The human HOXB13 gene encodes a 284 amino acid transcription factor belonging to the homeobox gene family containing a homeobox and a HoxA13 N-terminal domain. It is highly linked to hereditary prostate cancer, the majority of which is manifested as a result of a Single Nucleotide Polymorphism (SNP). In silico analysis of 95 missense SNPs corresponding to the non-homeobox region of HOXB13 predicted 21 nsSNPs to be potentially deleterious. Among 123 UTR SNPs analysed by UTRScan, rs543028086, rs550968159, rs563065128 were found to affect the UNR_BS, GY-BOX and MBE UTR signals, respectively. Subsequent analysis by PolymiRTS revealed 23 UTR SNPs altering the miRNA binding site. The complete HOXB13_M26 protein structure was modelled using MODELLER v9.17. Computational analysis of the 21 nsSNPs mapped into the HOXB13_M26 protein revealed seven nsSNPs (rs761914407, rs8556, rs138213197, rs772962401, rs778843798, rs770620686 and rs587780165) that result in a seriously damaging and deleterious effect on the protein. G84E, G135E, and A128V resulted in increased protein stability, while R215C, C66R, Y80C and S122R resulted in decreased protein stability, ultimately predicted to result in altered binding patterns of HOXB13. While the genotype-phenotype based effects of nsSNPs were assessed, the exact biological and biochemical mechanism driven by the above predicted SNPs still needs to be extensively evaluated by in vivo and GWAS studies. PMID:28272408

  12. Chemical shift prediction for denatured proteins.

    PubMed

    Prestegard, James H; Sahu, Sarata C; Nkari, Wendy K; Morris, Laura C; Live, David; Gruta, Christian

    2013-02-01

    While chemical shift prediction has played an important role in aspects of protein NMR that include identification of secondary structure, generation of torsion angle constraints for structure determination, and assignment of resonances in spectra of intrinsically disordered proteins, interest has arisen more recently in using it in alternate assignment strategies for crosspeaks in (1)H-(15)N HSQC spectra of sparsely labeled proteins. One such approach involves correlation of crosspeaks in the spectrum of the native protein with those observed in the spectrum of the denatured protein, followed by assignment of the peaks in the latter spectrum. As in the case of disordered proteins, predicted chemical shifts can aid in these assignments. Some previously developed empirical formulas for chemical shift prediction have depended on basis data sets of 20 pentapeptides. In each case the central residue was varied among the 20 common amino acids, with the flanking residues held constant throughout the given series. However, previous choices of solvent conditions and flanking residues make the parameters in these formulas less than ideal for general application to denatured proteins. Here, we report (1)H and (15)N shifts for a set of alanine-based pentapeptides under the low pH urea denaturing conditions that are more appropriate for sparse label assignments. New parameters have been derived and a Perl script was created to facilitate comparison with other parameter sets. A small, but significant, improvement in shift predictions for denatured ubiquitin is demonstrated.

  13. Prediction of Protein–Protein Interactions by Evidence Combining Methods

    PubMed Central

    Chang, Ji-Wei; Zhou, Yan-Qing; Ul Qamar, Muhammad Tahir; Chen, Ling-Ling; Ding, Yu-Duan

    2016-01-01

    Most cellular functions involve proteins’ features based on their physical interactions with other partner proteins. Sketching a map of protein–protein interactions (PPIs) is therefore an important inception step towards understanding the basics of cell functions. Several experimental techniques operating in vivo or in vitro have made significant contributions to screening a large number of protein interaction partners, especially high-throughput experimental methods. However, computational approaches for PPI prediction supported by rapid accumulation of data generated from experimental techniques, 3D structure definitions, and genome sequencing have boosted the map sketching of PPIs. In this review, we shed light on in silico PPI prediction methods that integrate evidence from multiple sources, including evolutionary relationship, function annotation, sequence/structure features, network topology and text mining. These methods are developed for integration of multi-dimensional evidence, for designing the strategies to predict novel interactions, and for making the results consistent with the increase of prediction coverage and accuracy. PMID:27879651

  14. Genome-wide protein-protein interactions and protein function exploration in cyanobacteria.

    PubMed

    Lv, Qi; Ma, Weimin; Liu, Hui; Li, Jiang; Wang, Huan; Lu, Fang; Zhao, Chen; Shi, Tieliu

    2015-10-22

    Genome-wide network analysis is well implemented to study proteins of unknown function. Here, we effectively explored protein functions and the biological mechanism based on an inferred high-confidence protein-protein interaction (PPI) network in cyanobacteria. We integrated data from seven different sources and predicted 1,997 PPIs, which were evaluated by experiments in molecular mechanism, text mining of the literature for proven direct/indirect evidence, and "interologs" in conservation. Combining the predicted PPIs with known PPIs, we obtained 4,715 non-redundant PPIs (involving 3,231 proteins covering over 90% of the genome) to generate the PPI network. Based on the PPI network, terms in Gene Ontology (GO) were assigned to function-unknown proteins. Functional modules were identified by dissecting the PPI network into sub-networks and analyzing pathway enrichment, with which we investigated novel functions of underlying proteins in protein complexes and pathways. Examples of photosynthesis and DNA repair indicate that the network approach is a powerful tool in protein function analysis. Overall, this systems biology approach provides a new insight into posterior functional analysis of PPIs in cyanobacteria.

  15. Protein-protein interactions prediction based on iterative clique extension with gene ontology filtering.

    PubMed

    Yang, Lei; Tang, Xianglong

    2014-01-01

    Cliques (maximal complete subnets) in a protein-protein interaction (PPI) network are an important resource used to analyze protein complexes and functional modules. Clique-based methods of predicting PPIs compensate for the data deficiencies of biological experiments. However, clique-based prediction methods depend only on the topology of the network. The false-positive and false-negative interactions in a network usually interfere with prediction. Therefore, we propose a method combining clique-based prediction with gene ontology (GO) annotations to overcome this shortcoming and improve the accuracy of predictions. According to different GO correcting rules, we generate two predicted interaction sets which guarantee the quality and quantity of predicted protein interactions. The proposed method is applied to the PPI network from the Database of Interacting Proteins (DIP) and most of the predicted interactions are verified by another biological database, BioGRID. The predicted protein interactions are appended to the original protein network, which leads to clique extension and shows the significance of biological meaning.

  16. Using protein binding site prediction to improve protein docking.

    PubMed

    Huang, Bingding; Schroeder, Michael

    2008-10-01

    Predicting protein interaction interfaces and protein complexes are two important related problems. For interface prediction, there are a number of tools, such as PPI-Pred, PPISP, PINUP, Promate, and SPPIDER, which predict enzyme-inhibitor interfaces with success rates of 23% to 55% and other interfaces with 10% to 28% on a benchmark dataset of 62 complexes. Here, we develop metaPPI, a meta server for interface prediction. It significantly improves prediction success rates to 70% for enzyme-inhibitor and 44% for other interfaces. As shown with Promate, predicted interfaces can be used to improve protein docking. Here, we follow this idea using the meta server instead of individual predictions. We confirm that filtering with predicted interfaces significantly improves candidate generation in rigid-body docking based on shape complementarity. Finally, we show that the initial ranking of candidate solutions in rigid-body docking can be further improved for the class of enzyme-inhibitor complexes by a geometrical scoring which rewards deep pockets. A web server of metaPPI is available at scoppi.tu-dresden.de/metappi. The source code of our docking algorithm BDOCK is also available at www.biotec.tu-dresden.de/~bhuang/bdock.

  17. Functional domains in tetraspanin proteins.

    PubMed

    Stipp, Christopher S; Kolesnikova, Tatiana V; Hemler, Martin E

    2003-02-01

    Exciting new findings have emerged about the structure, function and biochemistry of tetraspanin proteins. Five distinct tetraspanin regions have now been delineated linking structural features to specific functions. Within the large extracellular loop of tetraspanins, there is a variable region that mediates specific interactions with other proteins, as well as a more highly conserved region that has been suggested to mediate homodimerization. Within the transmembrane region, the four tetraspanin transmembrane domains are probable sites of both intra- and inter-molecular interactions that are crucial during biosynthesis and assembly of the network of tetraspanin-linked membrane proteins known as the 'tetraspanin web'. In the intracellular juxtamembrane region, palmitoylation of cysteine residues also contributes to tetraspanin web assembly, and the C-terminal cytoplasmic tail region could provide specific functional links to cytoskeletal or signaling proteins.

  18. Protein structure prediction from sequence variation

    PubMed Central

    Marks, Debora S; Hopf, Thomas A; Sander, Chris

    2015-01-01

    Genomic sequences contain rich evolutionary information about functional constraints on macromolecules such as proteins. This information can be efficiently mined to detect evolutionary couplings between residues in proteins and address the long-standing challenge to compute protein three-dimensional structures from amino acid sequences. Substantial progress has recently been made on this problem owing to the explosive growth in available sequences and the application of global statistical methods. In addition to three-dimensional structure, the improved understanding of covariation may help identify functional residues involved in ligand binding, protein-complex formation and conformational changes. We expect computation of covariation patterns to complement experimental structural biology in elucidating the full spectrum of protein structures, their functional interactions and evolutionary dynamics. PMID:23138306

  19. Computational Prediction of Effector Proteins in Fungi: Opportunities and Challenges

    PubMed Central

    Sonah, Humira; Deshmukh, Rupesh K.; Bélanger, Richard R.

    2016-01-01

    Effector proteins are mostly secretory proteins that stimulate plant infection by manipulating the host response. Identifying fungal effector proteins and understanding their function is of great importance in efforts to curb losses to plant diseases. Recent advances in high-throughput sequencing technologies have facilitated the availability of several fungal genomes and 1000s of transcriptomes. As a result, the growing amount of genomic information has provided great opportunities to identify putative effector proteins in different fungal species. There is little consensus over the annotation and functionality of effector proteins, and mostly small secretory proteins are considered as effector proteins, a concept that tends to overestimate the number of proteins involved in a plant–pathogen interaction. With the characterization of Avr genes, criteria for computational prediction of effector proteins are becoming more efficient. There are 100s of tools available for the identification of conserved motifs, signature sequences and structural features in the proteins. Many pipelines and online servers, which combine several tools, are made available to perform genome-wide identification of effector proteins. In this review, available tools and pipelines, their strength and limitations for effective identification of fungal effector proteins are discussed. We also present an exhaustive list of classically secreted proteins along with their key conserved motifs found in 12 common plant pathogens (11 fungi and one oomycete) through an analytical pipeline. PMID:26904083

  20. 3D-Fun: predicting enzyme function from structure.

    PubMed

    von Grotthuss, Marcin; Plewczynski, Dariusz; Vriend, Gert; Rychlewski, Leszek

    2008-07-01

    The 'omics' revolution is causing a flurry of data that all needs to be annotated for it to become useful. Sequences of proteins of unknown function can be annotated with a putative function by comparing them with proteins of known function. This form of annotation is typically performed with BLAST or similar software. Structural genomics is nowadays also bringing us three dimensional structures of proteins with unknown function. We present here software that can be used when sequence comparisons fail to determine the function of a protein with known structure but unknown function. The software, called 3D-Fun, is implemented as a server that runs at several European institutes and is freely available for everybody at all these sites. The 3D-Fun servers accept protein coordinates in the standard PDB format and compare them with all known protein structures by 3D structural superposition using the 3D-Hit software. If structural hits are found with proteins with known function, these are listed together with their function and some vital comparison statistics. This is conceptually very similar in 3D to what BLAST does in 1D. Additionally, the superposition results are displayed using interactive graphics facilities. Currently, the 3D-Fun system only predicts enzyme function but an expanded version with Gene Ontology predictions will be available soon. The server can be accessed at http://3dfun.bioinfo.pl/ or at http://3dfun.cmbi.ru.nl/.

  1. Experimental and bioinformatic approaches for interrogating protein-protein interactions to determine protein function.

    PubMed

    Droit, Arnaud; Poirier, Guy G; Hunter, Joanna M

    2005-04-01

    An ambitious goal of proteomics is to elucidate the structure, interactions and functions of all proteins within cells and organisms. One strategy to determine protein function is to identify the protein-protein interactions. The increasing use of high-throughput and large-scale bioinformatics-based studies has generated a massive amount of data stored in a number of different databases. A challenge for bioinformatics is to explore this disparate data and to uncover biologically relevant interactions and pathways. In parallel, there is clearly a need for the development of approaches that can predict novel protein-protein interaction networks in silico. Here, we present an overview of different experimental and bioinformatic methods to elucidate protein-protein interactions.

  2. Blind Test of Physics-Based Prediction of Protein Structures

    PubMed Central

    Shell, M. Scott; Ozkan, S. Banu; Voelz, Vincent; Wu, Guohong Albert; Dill, Ken A.

    2009-01-01

    We report here a multiprotein blind test of a computer method to predict native protein structures based solely on an all-atom physics-based force field. We use the AMBER 96 potential function with an implicit (GB/SA) model of solvation, combined with replica-exchange molecular-dynamics simulations. Coarse conformational sampling is performed using the zipping and assembly method (ZAM), an approach that is designed to mimic the putative physical routes of protein folding. ZAM was applied to the folding of six proteins, from 76 to 112 monomers in length, in CASP7, a community-wide blind test of protein structure prediction. Because these predictions have about the same level of accuracy as typical bioinformatics methods, and do not utilize information from databases of known native structures, this work opens up the possibility of predicting the structures of membrane proteins, synthetic peptides, or other foldable polymers, for which there is little prior knowledge of native structures. This approach may also be useful for predicting physical protein folding routes, non-native conformations, and other physical properties from amino acid sequences. PMID:19186130

  3. Delineation of modular proteins: domain boundary prediction from sequence information.

    PubMed

    Kong, Lesheng; Ranganathan, Shoba

    2004-06-01

    The delineation of domain boundaries of a given sequence in the absence of known 3D structures or detectable sequence homology to known domains benefits many areas in protein science, such as protein engineering, protein 3D structure determination and protein structure prediction. With the exponential growth of newly determined sequences, our ability to predict domain boundaries rapidly and accurately from sequence information alone is both essential and critical from the viewpoint of gene function annotation. Anyone attempting to predict domain boundaries for a single protein sequence is invariably confronted with a plethora of databases that contain boundary information available from the internet and a variety of methods for domain boundary prediction. How are these derived and how well do they work? What definition of 'domain' do they use? We will first clarify the different definitions of protein domains, and then describe the available public databases with domain boundary information. Finally, we will review existing domain boundary prediction methods and discuss their strengths and weaknesses.

  4. Protease-inhibitor interaction predictions: Lessons on the complexity of protein-protein interactions.

    PubMed

    Fortelny, Nikolaus; Butler, Georgina S; Overall, Christopher Mark; Pavlidis, Paul

    2017-04-06

    Protein interactions shape proteome function and thus biology. Identification of protein interactions is a major goal in molecular biology, but biochemical methods, although improving, remain limited in coverage and accuracy. Whereas computational predictions can guide biochemical experiments, low validation rates of predictions remain a major limitation. Here, we investigated computational methods in the prediction of a specific type of interaction, the inhibitory interactions between proteases and their inhibitors. Proteases generate thousands of proteoforms that dynamically shape the functional state of proteomes. Despite the important regulatory role of proteases, knowledge of their inhibitors remains largely incomplete with the vast majority of proteases lacking an annotated inhibitor. To link inhibitors to their target proteases on a large scale, we applied computational methods to predict inhibitory interactions between proteases and their inhibitors based on complementary data including coexpression, phylogenetic similarity, structural information, co-annotation, and colocalization, and also surveyed general protein interaction networks for potential inhibitory interactions. In testing nine predicted interactions biochemically, we validated the inhibition of kallikrein 5 by serpin B12. Despite the use of a wide array of complementary data, we found a high false positive rate of computational predictions in biochemical follow-up. Based on a protease-specific definition of true negatives derived from the biochemical classification of proteases and inhibitors, we analyzed prediction accuracy of individual features. Thereby we identified feature-specific limitations, which also affected general protein interaction prediction methods. Interestingly, proteases were often not coexpressed with most of their functional inhibitors, contrary to what is commonly assumed and extrapolated predominantly from cell culture experiments. 
Predictions of inhibitory interactions

  5. Prediction of N-terminal protein sorting signals.

    PubMed

    Claros, M G; Brunak, S; von Heijne, G

    1997-06-01

    Recently, neural networks have been applied to a widening range of problems in molecular biology. An area particularly suited to neural-network methods is the identification of protein sorting signals and the prediction of their cleavage sites, as these functional units are encoded by local, linear sequences of amino acids rather than global 3D structures.

  6. Evaluation of Several Two-Step Scoring Functions Based on Linear Interaction Energy, Effective Ligand Size, and Empirical Pair Potentials for Prediction of Protein-Ligand Binding Geometry and Free Energy

    PubMed Central

    Rahaman, Obaidur; Estrada, Trilce P.; Doren, Douglas J.; Taufer, Michela; Brooks, Charles L.; Armen, Roger S.

    2011-01-01

    The performance of several two-step scoring approaches for molecular docking was assessed for their ability to predict binding geometries and free energies. Two new scoring functions designed for “step 2 discrimination” were proposed and compared to our CHARMM implementation of the linear interaction energy (LIE) approach using the Generalized-Born with Molecular Volume (GBMV) implicit solvation model. A scoring function S1 was proposed by considering only “interacting” ligand atoms as the “effective size” of the ligand, and extended to an empirical regression-based pair potential S2. The S1 and S2 scoring schemes were trained and five-fold cross validated on a diverse set of 259 protein-ligand complexes from the Ligand Protein Database (LPDB). The regression-based parameters for S1 and S2 also demonstrated reasonable transferability in the CSARdock 2010 benchmark using a new dataset (NRC HiQ) of diverse protein-ligand complexes. The ability of the scoring functions to accurately predict ligand geometry was evaluated by calculating the discriminative power (DP) of the scoring functions to identify native poses. The parameters for the LIE scoring function with the optimal discriminative power (DP) for geometry (step 1 discrimination) were found to be very similar to the best-fit parameters for binding free energy over a large number of protein-ligand complexes (step 2 discrimination). Reasonable performance of the scoring functions in enrichment of active compounds in four different protein target classes established that the parameters for S1 and S2 provided reasonable accuracy and transferability. Additional analysis was performed to definitively separate scoring function performance from molecular weight effects. This analysis included the prediction of ligand binding efficiencies for a subset of the CSARdock NRC HiQ dataset where the number of ligand heavy atoms ranged from 17 to 35. This range of ligand heavy atoms is where improved accuracy of

  7. Evaluation of several two-step scoring functions based on linear interaction energy, effective ligand size, and empirical pair potentials for prediction of protein-ligand binding geometry and free energy.

    PubMed

    Rahaman, Obaidur; Estrada, Trilce P; Doren, Douglas J; Taufer, Michela; Brooks, Charles L; Armen, Roger S

    2011-09-26

    The performances of several two-step scoring approaches for molecular docking were assessed for their ability to predict binding geometries and free energies. Two new scoring functions designed for "step 2 discrimination" were proposed and compared to our CHARMM implementation of the linear interaction energy (LIE) approach using the Generalized-Born with Molecular Volume (GBMV) implicit solvation model. A scoring function S1 was proposed by considering only "interacting" ligand atoms as the "effective size" of the ligand and extended to an empirical regression-based pair potential S2. The S1 and S2 scoring schemes were trained and 5-fold cross-validated on a diverse set of 259 protein-ligand complexes from the Ligand Protein Database (LPDB). The regression-based parameters for S1 and S2 also demonstrated reasonable transferability in the CSARdock 2010 benchmark using a new data set (NRC HiQ) of diverse protein-ligand complexes. The ability of the scoring functions to accurately predict ligand geometry was evaluated by calculating the discriminative power (DP) of the scoring functions to identify native poses. The parameters for the LIE scoring function with the optimal discriminative power (DP) for geometry (step 1 discrimination) were found to be very similar to the best-fit parameters for binding free energy over a large number of protein-ligand complexes (step 2 discrimination). Reasonable performance of the scoring functions in enrichment of active compounds in four different protein target classes established that the parameters for S1 and S2 provided reasonable accuracy and transferability. Additional analysis was performed to definitively separate scoring function performance from molecular weight effects. This analysis included the prediction of ligand binding efficiencies for a subset of the CSARdock NRC HiQ data set where the number of ligand heavy atoms ranged from 17 to 35. This range of ligand heavy atoms is where improved accuracy of predicted ligand
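
    The record above mentions ligand binding efficiencies as a way to separate scoring performance from molecular weight effects. A minimal sketch follows, assuming the common definition of ligand efficiency as binding free energy divided by the number of heavy atoms; the abstract does not state its exact definition, and the function name and the two ligands are hypothetical.

```python
def ligand_efficiency(delta_g_kcal_per_mol: float, n_heavy_atoms: int) -> float:
    """Binding free energy per heavy atom (kcal/mol per atom).

    Assumed definition: the abstract does not spell out its own formula.
    """
    if n_heavy_atoms <= 0:
        raise ValueError("n_heavy_atoms must be positive")
    return delta_g_kcal_per_mol / n_heavy_atoms


# Hypothetical ligands: (name, predicted binding free energy, heavy-atom count).
# The heavy-atom counts sit at the ends of the 17-35 range quoted in the abstract.
ligands = [("lig_a", -8.5, 17), ("lig_b", -10.5, 35)]
for name, delta_g, n_heavy in ligands:
    print(name, round(ligand_efficiency(delta_g, n_heavy), 3))
```

    Dividing by size means a large ligand is not favoured merely for having more atoms, which is the effect the additional analysis in the abstract tries to isolate.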

  8. Toxicology of protein allergenicity: prediction and characterization.

    PubMed

    Kimber, I; Kerkvliet, N I; Taylor, S L; Astwood, J D; Sarlo, K; Dearman, R J

    1999-04-01

    The ability of exogenous proteins to cause respiratory and gastrointestinal allergy, and sometimes systemic anaphylactic reactions, is well known. What is not clear, however, are the properties that confer on proteins the ability to induce allergic sensitization. With an expansion in the use of enzymes for industrial applications and consumer products, and a substantial and growing investment in the development of transgenic crop plants that express novel proteins introduced from other sources, the issue of protein allergenicity has assumed considerable toxicological significance. There is a need now for methods that will allow the accurate identification and characterization of potential protein allergens and for estimation of relative potency as a first step towards risk assessment. To address some of these issues, and to review progress that has been made in the toxicological investigation of respiratory and gastrointestinal allergy induced by proteins, a workshop, entitled the Toxicology of Protein Allergenicity: Prediction and Characterization, was convened at the 37th Annual Conference of the Society of Toxicology in Seattle, Washington (1998). The subject of protein allergenicity is considered here in the context of presentations made at that workshop.

  9. Predicting protein subcellular location using digital signal processing.

    PubMed

    Pan, Yu-Xi; Li, Da-Wei; Duan, Yun; Zhang, Zhi-Zhou; Xu, Ming-Qing; Feng, Guo-Yin; He, Lin

    2005-02-01

    The biological functions of a protein are closely related to its attributes in a cell. With the rapid accumulation of newly found protein sequence data in databanks, it is highly desirable to develop an automated method for predicting the subcellular location of proteins. The establishment of such a predictor will expedite the functional determination of newly found proteins and the process of prioritizing genes and proteins identified by genomic efforts as potential molecular targets for drug design. The traditional algorithms for predicting these attributes were based solely on amino acid composition in which no sequence order effect was taken into account. To improve the prediction quality, it is necessary to incorporate such an effect. However, the number of possible patterns in protein sequences is extremely large, posing a formidable difficulty for realizing this goal. To deal with such difficulty, a well-developed tool in digital signal processing named digital Fourier transform (DFT) [1] was introduced. After being translated to a digital signal according to the hydrophobicity of each amino acid, a protein was analyzed by DFT within the frequency domain. A set of frequency spectrum parameters, thus obtained, were regarded as the factors to represent the sequence order effect. A significant improvement in prediction quality was observed by incorporating the frequency spectrum parameters with the conventional amino acid composition. One of the crucial merits of this approach is that many existing tools in mathematics and engineering can be easily applied in the predicting process. It is anticipated that digital signal processing may serve as a useful vehicle for many other protein science areas.
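
    The procedure described above (map each residue to a hydrophobicity value, then analyse the resulting signal in the frequency domain) can be sketched as follows. This is an illustration only: the abstract does not say which hydrophobicity scale or which spectrum parameters were used, so the Kyte-Doolittle scale, the function names, and the choice of the first few normalised magnitudes are all assumptions.

```python
import cmath

# Kyte-Doolittle hydropathy values. Assumption: the abstract does not
# name the hydrophobicity scale that was actually used.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydrophobicity_signal(sequence):
    """Translate a protein sequence into a numeric signal, one value per residue."""
    return [KYTE_DOOLITTLE[residue] for residue in sequence.upper()]


def dft_magnitudes(signal):
    """Magnitude spectrum |X[k]| computed with the direct O(n^2) DFT sum."""
    n = len(signal)
    return [
        abs(sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                for t, x in enumerate(signal)))
        for k in range(n)
    ]


def spectrum_features(sequence, n_coeffs=4):
    """First n_coeffs non-DC magnitudes, divided by sequence length.

    Hypothetical feature choice, meant to stand in for the paper's
    'frequency spectrum parameters'.
    """
    magnitudes = dft_magnitudes(hydrophobicity_signal(sequence))
    return [m / len(sequence) for m in magnitudes[1:n_coeffs + 1]]


print(spectrum_features("MKTAYIAKQR"))  # made-up 10-residue peptide
```

    Features like these could be concatenated with the 20 amino acid composition fractions, which is the combination the abstract reports as improving prediction quality.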

  10. PREDICTION OF NONLINEAR SPATIAL FUNCTIONALS. (R827257)

    EPA Science Inventory

    Spatial statistical methodology can be useful in the arena of environmental regulation. Some regulatory questions may be addressed by predicting linear functionals of the underlying signal, but other questions may require the prediction of nonlinear functionals of the signal. ...

  11. Functional Classification of Immune Regulatory Proteins

    SciTech Connect

    Rubinstein, Rotem; Ramagopal, Udupi A.; Nathenson, Stanley G.; Almo, Steven C.; Fiser, Andras

    2013-05-01

    Members of the immunoglobulin superfamily (IgSF) control innate and adaptive immunity and are prime targets for the treatment of autoimmune diseases, infectious diseases, and malignancies. We describe a computational method, termed the Brotherhood algorithm, which utilizes intermediate sequence information to classify proteins into functionally related families. This approach identifies functional relationships within the IgSF and predicts additional receptor-ligand interactions. As a specific example, we examine the nectin/nectin-like family of cell adhesion and signaling proteins and propose receptor-ligand interactions within this family. We were guided by the Brotherhood approach and present the high-resolution structural characterization of a homophilic interaction involving the class-I MHC-restricted T-cell-associated molecule, which we now classify as a nectin-like family member. The Brotherhood algorithm is likely to have a significant impact on structural immunology by identifying those proteins and complexes for which structural characterization will be particularly informative.

  12. Predicting protein-protein relationships from literature using latent topics.

    PubMed

    Aso, Tatsuya; Eguchi, Koji

    2009-10-01

    This paper investigates applying statistical topic models to extract and predict relationships between biological entities, especially protein mentions. A statistical topic model, Latent Dirichlet Allocation (LDA) is promising; however, it has not been investigated for such a task. In this paper, we apply the state-of-the-art Collapsed Variational Bayesian Inference and Gibbs Sampling inference to estimating the LDA model. We also apply probabilistic Latent Semantic Analysis (pLSA) as a baseline for comparison, and compare them from the viewpoints of log-likelihood, classification accuracy and retrieval effectiveness. We demonstrate through experiments that the Collapsed Variational LDA gives better results than the others, especially in terms of classification accuracy and retrieval effectiveness in the task of the protein-protein relationship prediction.

  13. Prediction of protein-protein interactions: unifying evolution and structure at protein interfaces.

    PubMed

    Tuncbag, Nurcan; Gursoy, Attila; Keskin, Ozlem

    2011-06-01

    The vast majority of the chores in the living cell involve protein-protein interactions. Providing details of protein interactions at the residue level and incorporating them into protein interaction networks are crucial toward the elucidation of a dynamic picture of cells. Despite the rapid increase in the number of structurally known protein complexes, we are still far away from a complete network. Given experimental limitations, computational modeling of protein interactions is a prerequisite to proceed on the way to complete structural networks. In this work, we focus on the question 'how do proteins interact?' rather than 'which proteins interact?' and we review structure-based protein-protein interaction prediction approaches. As a sample approach for modeling protein interactions, PRISM is detailed which combines structural similarity and evolutionary conservation in protein interfaces to infer structures of complexes in the protein interaction network. This will ultimately help us to understand the role of protein interfaces in predicting bound conformations.

  14. Critical Features of Fragment Libraries for Protein Structure Prediction.

    PubMed

    Trevizani, Raphael; Custódio, Fábio Lima; Dos Santos, Karina Baptista; Dardenne, Laurent Emmanuel

    2017-01-01

    The use of fragment libraries is a popular approach among protein structure prediction methods and has proven to substantially improve the quality of predicted structures. However, some vital aspects of a fragment library that influence the accuracy of modeling a native structure remain to be determined. This study investigates some of these features. Particularly, we analyze the effect of using secondary structure prediction guiding fragments selection, different fragments sizes and the effect of structural clustering of fragments within libraries. To have a clearer view of how these factors affect protein structure prediction, we isolated the process of model building by fragment assembly from some common limitations associated with prediction methods, e.g., imprecise energy functions and optimization algorithms, by employing an exact structure-based objective function under a greedy algorithm. Our results indicate that shorter fragments reproduce the native structure more accurately than the longer. Libraries composed of multiple fragment lengths generate even better structures, where longer fragments show to be more useful at the beginning of the simulations. The use of many different fragment sizes shows little improvement when compared to predictions carried out with libraries that comprise only three different fragment sizes. Models obtained from libraries built using only sequence similarity are, on average, better than those built with a secondary structure prediction bias. However, we found that the use of secondary structure prediction allows greater reduction of the search space, which is invaluable for prediction methods. The results of this study can be critical guidelines for the use of fragment libraries in protein structure prediction.

  15. Critical Features of Fragment Libraries for Protein Structure Prediction

    PubMed Central

    dos Santos, Karina Baptista

    2017-01-01

    The use of fragment libraries is a popular approach among protein structure prediction methods and has proven to substantially improve the quality of predicted structures. However, some vital aspects of a fragment library that influence the accuracy of modeling a native structure remain to be determined. This study investigates some of these features. Particularly, we analyze the effect of using secondary structure prediction guiding fragments selection, different fragments sizes and the effect of structural clustering of fragments within libraries. To have a clearer view of how these factors affect protein structure prediction, we isolated the process of model building by fragment assembly from some common limitations associated with prediction methods, e.g., imprecise energy functions and optimization algorithms, by employing an exact structure-based objective function under a greedy algorithm. Our results indicate that shorter fragments reproduce the native structure more accurately than the longer. Libraries composed of multiple fragment lengths generate even better structures, where longer fragments show to be more useful at the beginning of the simulations. The use of many different fragment sizes shows little improvement when compared to predictions carried out with libraries that comprise only three different fragment sizes. Models obtained from libraries built using only sequence similarity are, on average, better than those built with a secondary structure prediction bias. However, we found that the use of secondary structure prediction allows greater reduction of the search space, which is invaluable for prediction methods. The results of this study can be critical guidelines for the use of fragment libraries in protein structure prediction. PMID:28085928

  16. Predicting Protein Hinge Motions and Allostery Using Rigidity Theory

    NASA Astrophysics Data System (ADS)

    Sljoka, Adnan; Bezginov, Alexandr

    2011-11-01

    Understanding how a 3D structure of a protein functions depends on predicting which regions are rigid, and which are flexible. One recent approach models molecules as a structure of fixed units (atoms with their bond angles as rigid units, bonds as hinges) plus biochemical constraints coming from the local geometry. This generates a `molecular graph' in the theory of combinatorial rigidity. The 6|V|-6 counting condition for 3-dimensional body-hinge structures (modulo molecular theorem), and a fast `pebble game' algorithm which tracks this count in the multigraph, have led to the development of the program FIRST, for rapid predictions of the flexibility of proteins. In this study we develop a novel protein hinge prediction algorithm via our extension of the pebble game algorithm (relevant regions detection algorithm). We have tested our hinge prediction algorithm on several proteins chosen from the dataset of manually annotated hinges available on the MOLMOV server. Many of our predictions are in very good agreement with this data set. Our algorithms can also predict `allosteric' interactions in proteins, where binding on one site of a molecule changes the shape or binding at a distant `active site' of the molecule. We also give some promising results which support the sliding piston-like movement of helices with respect to one another as a plausible mechanism by which GPCR receptors propagate conformational changes across membranes.
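
    The 6|V|-6 count mentioned above can be illustrated with a simple global tally: each rigid body contributes 6 degrees of freedom, 6 are trivial rigid motions of the whole structure, and each hinge removes 5. The sketch below is not the pebble game and not the FIRST program; it applies the count once to the whole structure, whereas the pebble game tracks it over every subgraph. The function name is hypothetical.

```python
def internal_dof_lower_bound(n_bodies: int, n_hinges: int) -> int:
    """Maxwell-type lower bound on internal degrees of freedom.

    6 per body, minus 6 trivial rigid-body motions, minus 5 per hinge.
    Only a global bound: redundant hinges in an over-constrained region
    can leave more flexibility elsewhere than this number suggests.
    """
    if n_bodies < 1 or n_hinges < 0:
        raise ValueError("need at least one body and a non-negative hinge count")
    return max(0, 6 * n_bodies - 6 - 5 * n_hinges)


print(internal_dof_lower_bound(3, 2))  # open chain of three bodies
print(internal_dof_lower_bound(6, 6))  # closed ring of six bodies
print(internal_dof_lower_bound(7, 7))  # closed ring of seven bodies
```

    An open chain keeps one rotation per hinge, a six-body ring comes out with no spare freedom by this count, and a seven-body ring keeps one.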

  17. How special is the biochemical function of native proteins?

    PubMed

    Skolnick, Jeffrey; Gao, Mu; Zhou, Hongyi

    2016-01-01

    Native proteins perform an amazing variety of biochemical functions, including enzymatic catalysis, and can engage in protein-protein and protein-DNA interactions that are essential for life. A key question is how special are these functional properties of proteins. Are they extremely rare, or are they an intrinsic feature? Comparison to the properties of compact conformations of artificially generated compact protein structures selected for thermodynamic stability but not any type of function, the artificial (ART) protein library, demonstrates that a remarkable number of the properties of native-like proteins are recapitulated. These include the complete set of small molecule ligand-binding pockets and most protein-protein interfaces. ART structures are predicted to be capable of weakly binding metabolites and cover a significant fraction of metabolic pathways, with the most enriched pathways including ancient ones such as glycolysis. Native-like active sites are also found in ART proteins. A small fraction of ART proteins are predicted to have strong protein-protein and protein-DNA interactions. Overall, it appears that biochemical function is an intrinsic feature of proteins which nature has significantly optimized during evolution. These studies raise questions as to the relative roles of specificity and promiscuity in the biochemical function and control of cells that need investigation.

  18. MUFOLD: A new solution for protein 3D structure prediction.

    PubMed

    Zhang, Jingfen; Wang, Qingguo; Barz, Bogdan; He, Zhiquan; Kosztin, Ioan; Shang, Yi; Xu, Dong

    2010-04-01

    There have been steady improvements in protein structure prediction during the past 2 decades. However, current methods are still far from consistently predicting structural models accurately with computing power accessible to common users. Toward achieving more accurate and efficient structure prediction, we developed a number of novel methods and integrated them into a software package, MUFOLD. First, a systematic protocol was developed to identify useful templates and fragments from Protein Data Bank for a given target protein. Then, an efficient process was applied for iterative coarse-grain model generation and evaluation at the Cα or backbone level. In this process, we construct models using interresidue spatial restraints derived from alignments by multidimensional scaling, evaluate and select models through clustering and static scoring functions, and iteratively improve the selected models by integrating spatial restraints and previous models. Finally, the full-atom models were evaluated using molecular dynamics simulations based on structural changes under simulated heating. We have continuously improved the performance of MUFOLD by using a benchmark of 200 proteins from the Astral database, where no template with >25% sequence identity to any target protein is included. The average root-mean-square deviation of the best models from the native structures is 4.28 Å, which shows significant and systematic improvement over our previous methods. The computing time of MUFOLD is much shorter than many other tools, such as Rosetta. MUFOLD demonstrated some success in the 2008 community-wide experiment for protein structure prediction CASP8.
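
    The root-mean-square deviation quoted above is the usual measure of how far a model lies from the native structure. A minimal sketch of the calculation follows, assuming the two coordinate sets are already optimally superimposed and list the same atoms in the same order; the superposition step is omitted and the coordinates are made up.

```python
import math


def rmsd(coords_a, coords_b):
    """RMSD between two equal-length lists of (x, y, z) points.

    Assumes the structures are already superimposed; no fitting is done here.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum(
        (ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2
        for (ax, ay, az), (bx, by, bz) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))


# Made-up coordinates (in angstroms) for three matched atoms.
native = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
model = [(0.0, 1.0, 0.0), (3.8, -1.0, 0.0), (7.6, 1.0, 0.0)]
print(rmsd(native, model))
```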

  19. Improving structure-based function prediction using molecular dynamics

    PubMed Central

    Glazer, Dariya S.; Radmer, Randall J.; Altman, Russ B.

    2009-01-01

    Summary The number of molecules with solved three-dimensional structure but unknown function is increasing rapidly. Particularly problematic are novel folds with little detectable similarity to molecules of known function. Experimental assays can determine the functions of such molecules, but are time-consuming and expensive. Computational approaches can identify potential functional sites; however, these approaches generally rely on single static structures and do not use information about dynamics. In fact, structural dynamics can enhance function prediction: we coupled molecular dynamics simulations with structure-based function prediction algorithms that identify Ca2+ binding sites. When applied to 11 challenging proteins, both methods showed substantial improvement in performance, revealing 22 more sites in one case and 12 more in the other, with a modest increase in apparent false positives. Thus, we show that treating molecules as dynamic entities improves the performance of structure-based function prediction methods. PMID:19604472

  20. Design of membrane proteins: toward functional systems.

    PubMed

    Ghirlanda, Giovanna

    2009-12-01

    Over the years, membrane-soluble peptides have provided a convenient model system to investigate the folding and assembly of integral membrane proteins. Recent advances in experimental and computational methods are now being translated into the design of functional membrane proteins. Applications include artificial modulators of membrane protein function, inhibitors of protein-protein interactions, and redox membrane proteins.

  1. Predicting DNA-binding proteins and binding residues by complex structure prediction and application to human proteome.

    PubMed

    Zhao, Huiying; Wang, Jihua; Zhou, Yaoqi; Yang, Yuedong

    2014-01-01

    As more and more protein sequences are uncovered from increasingly inexpensive sequencing techniques, an urgent task is to find their functions. This work presents a highly reliable computational technique for predicting DNA-binding function at the level of protein-DNA complex structures, rather than low-resolution two-state prediction of DNA-binding as most existing techniques do. The method first predicts protein-DNA complex structure by utilizing the template-based structure prediction technique HHblits, followed by binding affinity prediction based on a knowledge-based energy function (Distance-scaled finite ideal-gas reference state for protein-DNA interactions). A leave-one-out cross validation of the method based on 179 DNA-binding and 3797 non-binding protein domains achieves a Matthews correlation coefficient (MCC) of 0.77 with high precision (94%) and high sensitivity (65%). We further found 51% sensitivity for 82 newly determined structures of DNA-binding proteins and 56% sensitivity for the human proteome. In addition, the method provides a reasonably accurate prediction of DNA-binding residues in proteins based on predicted DNA-binding complex structures. Its application to the human proteome leads to more than 300 novel DNA-binding proteins; some of these predicted structures were validated by known structures of homologous proteins in APO forms. The method [SPOT-Seq (DNA)] is available as an on-line server at http://sparks-lab.org.
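
    The Matthews correlation coefficient reported above can be computed from a confusion matrix. In the sketch below, the confusion-matrix counts are not from the paper: they are back-calculated, as an assumption, from the reported 179 binding and 3797 non-binding domains, 65% sensitivity and 94% precision, purely to show how the three figures relate.

```python
import math


def mcc(tp: int, fp: int, tn: int, fn: int) -> float:
    """Matthews correlation coefficient; returns 0.0 if any margin is empty."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator


# Illustrative counts reconstructed from the abstract's summary figures.
positives, negatives = 179, 3797
tp = round(0.65 * positives)         # sensitivity = tp / positives
fn = positives - tp
fp = round(tp * (1 - 0.94) / 0.94)   # precision = tp / (tp + fp)
tn = negatives - fp

print(tp, fp, tn, fn, round(mcc(tp, fp, tn, fn), 2))
```

    With these reconstructed counts the coefficient comes out close to the reported 0.77, which suggests the three summary figures in the abstract are mutually consistent.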

  2. Bioinformatics pipeline for functional identification and characterization of proteins

    NASA Astrophysics Data System (ADS)

    Skarzyńska, Agnieszka; Pawełkowicz, Magdalena; Krzywkowski, Tomasz; Świerkula, Katarzyna; Pląder, Wojciech; Przybecki, Zbigniew

    2015-09-01

    New sequencing methods, collectively called Next Generation Sequencing, make it possible to obtain a vast amount of data in a short time. These data require structural and functional annotation. Functional identification and characterization of predicted proteins can be done by in silico approaches, thanks to the numerous computational tools available nowadays. However, the results of protein function prediction need to be confirmed, either by running different programs and comparing their results or experimentally. Here we present a bioinformatics pipeline for structural and functional annotation of proteins.

  3. Prediction and Annotation of Plant Protein Interaction Networks

    SciTech Connect

    McDermott, Jason E.; Wang, Jun; Yu, Jun; Wong, Gane Ka-Shu; Samudrala, Ram

    2009-02-01

    Large-scale experimental studies of interactions between components of biological systems have been performed for a variety of eukaryotic organisms. However, there is a dearth of such data for plants. Computational methods for prediction of relationships between proteins, primarily based on comparative genomics, provide a useful systems-level view of cellular functioning and can be used to extend information about other eukaryotes to plants. We have predicted networks for Arabidopsis thaliana, Oryza sativa indica and japonica and several plant pathogens using the Bioverse (http://bioverse.compbio.washington.edu) and show that they are similar to experimentally-derived interaction networks. Predicted interaction networks for plants can be used to provide novel functional annotations and predictions about plant phenotypes and aid in rational engineering of biosynthesis pathways.

  4. SDR: a database of predicted specificity-determining residues in proteins.

    PubMed

    Donald, Jason E; Shakhnovich, Eugene I

    2009-01-01

    The specificity-determining residue database (SDR database) presents residue positions where mutations are predicted to have changed protein function in large protein families. Because the database pre-calculates predictions on existing protein sequence alignments, users can quickly find the predictions by selecting the appropriate protein family or searching by protein sequence. Predictions can be used to guide mutagenesis or to gain a better understanding of specificity changes in a protein family. The database is available on the web at http://paradox.harvard.edu/sdr.

  5. On the necessity of dissecting sequence similarity scores into segment-specific contributions for inferring protein homology, function prediction and annotation

    PubMed Central

    2014-01-01

    Background Protein sequence similarities to any types of non-globular segments (coiled coils, low complexity regions, transmembrane regions, long loops, etc. where either positional sequence conservation is the result of a very simple, physically induced pattern or rather integral sequence properties are critical) are pertinent sources for mistaken homologies. Regretfully, these considerations regularly escape attention in large-scale annotation studies since, often, there is no substitute for manual handling of these cases. Quantitative criteria are required to suppress events of function annotation transfer as a result of false homology assignments. Results The sequence homology concept is based on the similarity comparison between the structural elements, the basic building blocks for conferring the overall fold of a protein. We propose to dissect the total similarity score into fold-critical and other, remaining contributions and suggest that, for a valid homology statement, the fold-relevant score contribution should at least be significant on its own. As part of the article, we provide the DissectHMMER software program for dissecting HMMER2/3 scores into segment-specific contributions. We show that DissectHMMER reproduces HMMER2/3 scores with sufficient accuracy and that it is useful in automated decisions about homology for instructive sequence examples. To generalize the dissection concept for cases without 3D structural information, we find that a dissection based on alignment quality is an appropriate surrogate. The approach was applied to a large-scale study of SMART and PFAM domains in the space of seed sequences and in the space of UniProt/SwissProt. Conclusions Sequence similarity score dissection with regard to fold-critical and other contributions systematically suppresses false hits and, additionally, recovers previously obscured homology relationships such as the one between aquaporins and formate/nitrite transporters that, so far, was only

  6. Proteins with Novel Structure, Function and Dynamics

    NASA Technical Reports Server (NTRS)

    Pohorille, Andrew

    2014-01-01

    Recently, a small enzyme that ligates two RNA fragments at a rate 10^6 above background was evolved in vitro (Seelig and Szostak, Nature 448:828-831, 2007). This enzyme does not resemble any contemporary protein (Chao et al., Nature Chem. Biol. 9:81-83, 2013). It consists of a dynamic, catalytic loop, a small, rigid core containing two zinc ions coordinated by neighboring amino acids, and two highly flexible tails that might be unimportant for protein function. In contrast to other proteins, this enzyme does not contain ordered secondary structure elements, such as alpha-helix or beta-sheet. The loop is kept together by just two interactions of a charged residue and a histidine with a zinc ion, which they coordinate on the opposite side of the loop. Such a structure appears to be very fragile. Surprisingly, computer simulations indicate otherwise. As the coordinating, charged residue is mutated to alanine, another, nearby charged residue takes its place, thus keeping the structure nearly intact. If this residue is also substituted by alanine, a salt bridge involving two other charged residues on the opposite sides of the loop keeps the loop in place. These adjustments are facilitated by high flexibility of the protein. Computational predictions have been confirmed experimentally, as both mutants retain full activity and overall structure. These results challenge our notions about what is required for protein activity and about the relationship between protein dynamics, stability and robustness. We hypothesize that small, highly dynamic proteins could be both active and fault tolerant in ways that many other proteins are not, i.e. they can adjust to retain their structure and activity even if subjected to mutations in structurally critical regions. This opens the door to designing proteins with novel functions, structures and dynamics that have not yet been considered.

  7. TESTLoc: protein subcellular localization prediction from EST data

    PubMed Central

    2010-01-01

    Background The eukaryotic cell has an intricate architecture with compartments and substructures dedicated to particular biological processes. Knowing the subcellular location of proteins not only indicates how bio-processes are organized in different cellular compartments, but also contributes to unravelling the function of individual proteins. Computational localization prediction is possible based on sequence information alone, and has been successfully applied to proteins from virtually all subcellular compartments and all domains of life. However, we realized that current prediction tools do not perform well on partial protein sequences such as those inferred from Expressed Sequence Tag (EST) data, limiting the exploitation of the large and taxonomically most comprehensive body of sequence information from eukaryotes. Results We developed a new predictor, TESTLoc, suited for subcellular localization prediction of proteins based on their partial sequence conceptually translated from ESTs (EST-peptides). Support Vector Machine (SVM) is used as computational method and EST-peptides are represented by different features such as amino acid composition and physicochemical properties. When TESTLoc was applied to the most challenging test case (plant data), it yielded high accuracy (~85%). Conclusions TESTLoc is a localization prediction tool tailored for EST data. It provides a variety of models for the users to choose from, and is available for download at http://megasun.bch.umontreal.ca/~shenyq/TESTLoc/TESTLoc.html PMID:21078192

  8. Using support vector machine for improving protein-protein interaction prediction utilizing domain interactions

    SciTech Connect

    Singhal, Mudita; Shah, Anuj R.; Brown, Roslyn N.; Adkins, Joshua N.

    2010-10-02

    Understanding protein interactions is essential to gain insights into the biological processes at the whole cell level. The high-throughput experimental techniques for determining protein-protein interactions (PPI) are error prone and expensive, with low overlap amongst them. Although several computational methods have been proposed for predicting protein interactions, there is definite room for improvement. Here we present DomainSVM, a predictive method for PPI that uses computationally inferred domain-domain interaction values in a Support Vector Machine framework to predict protein interactions. The DomainSVM method utilizes evidence of multiple interacting domains to predict a protein interaction. It outperforms existing methods of PPI prediction by achieving very high explanation ratios, precision, specificity, sensitivity and F-measure values in a 10-fold cross-validation study conducted on the positive and negative PPIs in yeast. A functional comparison study using GO annotations on the positive and the negative test sets is presented, in addition to a discussion of novel PPI predictions in Salmonella Typhimurium.

  9. Neurodegenerative diseases: quantitative predictions of protein-RNA interactions.

    PubMed

    Cirillo, Davide; Agostini, Federico; Klus, Petr; Marchese, Domenica; Rodriguez, Silvia; Bolognesi, Benedetta; Tartaglia, Gian Gaetano

    2013-02-01

    Increasing evidence indicates that RNA plays an active role in a number of neurodegenerative diseases. We recently introduced a theoretical framework, catRAPID, to predict the binding ability of protein and RNA molecules. Here, we use catRAPID to investigate ribonucleoprotein interactions linked to inherited intellectual disability, amyotrophic lateral sclerosis, Creutzfeldt-Jakob, Alzheimer's, and Parkinson's diseases. We specifically focus on (1) RNA interactions with fragile X mental retardation protein FMRP; (2) protein sequestration caused by CGG repeats; (3) noncoding transcripts regulated by TAR DNA-binding protein 43 TDP-43; (4) autogenous regulation of TDP-43 and FMRP; (5) iron-mediated expression of amyloid precursor protein APP and α-synuclein; (6) interactions between prions and RNA aptamers. Our results are in striking agreement with experimental evidence and provide new insights in processes associated with neuronal function and misfunction.

  10. Prediction of protein-protein interaction network using a multi-objective optimization approach.

    PubMed

    Chowdhury, Archana; Rakshit, Pratyusha; Konar, Amit

    2016-06-01

    Protein-Protein Interactions (PPIs) are very important as they coordinate almost all cellular processes. This paper attempts to formulate PPI prediction problem in a multi-objective optimization framework. The scoring functions for the trial solution deal with simultaneous maximization of functional similarity, strength of the domain interaction profiles, and the number of common neighbors of the proteins predicted to be interacting. The above optimization problem is solved using the proposed Firefly Algorithm with Nondominated Sorting. Experiments undertaken reveal that the proposed PPI prediction technique outperforms existing methods, including gene ontology-based Relative Specific Similarity, multi-domain-based Domain Cohesion Coupling method, domain-based Random Decision Forest method, Bagging with REP Tree, and evolutionary/swarm algorithm-based approaches, with respect to sensitivity, specificity, and F1 score.
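
    The nondominated sorting idea named in this abstract rests on Pareto dominance between score vectors. The sketch below extracts only the first nondominated front for maximized objectives; the candidate tuples are hypothetical (functional similarity, domain interaction strength, common neighbors), and the Firefly Algorithm itself is not reproduced here.

```python
def dominates(a, b):
    """Return True if score vector a Pareto-dominates b (maximization).

    a dominates b when it is at least as good on every objective and
    strictly better on at least one.
    """
    at_least_as_good = all(x >= y for x, y in zip(a, b))
    strictly_better = any(x > y for x, y in zip(a, b))
    return at_least_as_good and strictly_better


def nondominated_front(solutions):
    """Return the solutions that no other solution dominates."""
    return [s for s in solutions
            if not any(dominates(t, s) for t in solutions)]


# Hypothetical scores: (functional similarity, domain strength, common neighbors)
candidates = [(0.9, 0.2, 3), (0.5, 0.5, 5), (0.4, 0.4, 4), (0.9, 0.1, 3)]
front = nondominated_front(candidates)
```

    Here the third candidate is dominated by the second and the fourth by the first, so only the first two remain on the front.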

  11. Prediction of membrane protein types using maximum variance projection

    NASA Astrophysics Data System (ADS)

    Wang, Tong; Yang, Jie

    2011-05-01

    Predicting membrane protein types has a positive influence on further biological function analysis. To quickly and efficiently annotate the type of an uncharacterized membrane protein is a challenge. In this work, a system based on maximum variance projection (MVP) is proposed to improve the prediction performance of membrane protein types. The feature extraction step is based on a hybridization representation approach by fusing Position-Specific Score Matrix composition. The protein sequences are quantized in a high-dimensional space using this representation strategy. Analysing these high-dimensional feature vectors raises problems such as high computing time and high classifier complexity. To solve this issue, MVP, a novel dimensionality reduction algorithm, is introduced to extract the essential features from the high-dimensional feature space. Then, a K-nearest neighbour classifier is employed to identify the types of membrane proteins based on their reduced low-dimensional features. As a result, the jackknife and independent dataset test success rates of this model reach 86.1 and 88.4%, respectively, and suggest that the proposed approach is very promising for predicting membrane protein types.

  12. A Bayesian Framework for Combining Protein and Network Topology Information for Predicting Protein-Protein Interactions.

    PubMed

    Birlutiu, Adriana; d'Alché-Buc, Florence; Heskes, Tom

    2015-01-01

    Computational methods for predicting protein-protein interactions are important tools that can complement high-throughput technologies and guide biologists in designing new laboratory experiments. The proteins and the interactions between them can be described by a network which is characterized by several topological properties. Information about proteins and interactions between them, in combination with knowledge about topological properties of the network, can be used for developing computational methods that can accurately predict unknown protein-protein interactions. This paper presents a supervised learning framework based on Bayesian inference for combining two types of information: i) network topology information, and ii) information related to proteins and the interactions between them. The motivation of our model is that by combining these two types of information one can achieve a better accuracy in predicting protein-protein interactions, than by using models constructed from these two types of information independently.

  13. Predicting Turns in Proteins with a Unified Model

    PubMed Central

    Song, Qi; Li, Tonghua; Cong, Peisheng; Sun, Jiangming; Li, Dapeng; Tang, Shengnan

    2012-01-01

    Motivation Turns are a critical element of the structure of a protein; turns play a crucial role in loops, folds, and interactions. Current prediction methods are well developed for the prediction of individual turn types, including α-turn, β-turn, and γ-turn, etc. However, for further protein structure and function prediction it is necessary to develop a uniform model that can accurately predict all types of turns simultaneously. Results In this study, we present a novel approach, TurnP, which offers the ability to investigate all the turns in a protein based on a unified model. The main characteristics of TurnP are: (i) using newly exploited features of structural evolution information (secondary structure and shape string of protein) based on structure homologies, (ii) considering all types of turns in a unified model, and (iii) practical capability of accurate prediction of all turns simultaneously for a query. TurnP utilizes predicted secondary structures and predicted shape strings, both of which have greater accuracy, based on innovative technologies which were both developed by our group. Then, sequence and structural evolution features, which are profile of sequence, profile of secondary structures and profile of shape strings are generated by sequence and structure alignment. When TurnP was validated on a non-redundant dataset (4,107 entries) by five-fold cross-validation, we achieved an accuracy of 88.8% and a sensitivity of 71.8%, which exceeded the most state-of-the-art predictors of certain type of turn. Newly determined sequences, the EVA and CASP9 datasets were used as independent tests and the results we achieved were outstanding for turn predictions and confirmed the good performance of TurnP for practical applications. PMID:23144872

  14. CATH FunFHMMer web server: protein functional annotations using functional family assignments.

    PubMed

    Das, Sayoni; Sillitoe, Ian; Lee, David; Lees, Jonathan G; Dawson, Natalie L; Ward, John; Orengo, Christine A

    2015-07-01

    The widening function annotation gap in protein databases and the increasing number and diversity of the proteins being sequenced presents new challenges to protein function prediction methods. Multidomain proteins complicate the protein sequence-structure-function relationship further as new combinations of domains can expand the functional repertoire, creating new proteins and functions. Here, we present the FunFHMMer web server, which provides Gene Ontology (GO) annotations for query protein sequences based on the functional classification of the domain-based CATH-Gene3D resource. Our server also provides valuable information for the prediction of functional sites. The predictive power of FunFHMMer has been validated on a set of 95 proteins where FunFHMMer performs better than BLAST, Pfam and CDD. Recent validation by an independent international competition ranks FunFHMMer as one of the top function prediction methods in predicting GO annotations for both the Biological Process and Molecular Function Ontology. The FunFHMMer web server is available at http://www.cathdb.info/search/by_funfhmmer.

  15. Systematic Prediction of Scaffold Proteins Reveals New Design Principles in Scaffold-Mediated Signal Transduction

    PubMed Central

    Hu, Jianfei; Neiswinger, Johnathan; Zhang, Jin; Zhu, Heng; Qian, Jiang

    2015-01-01

    Scaffold proteins play a crucial role in facilitating signal transduction in eukaryotes by bringing together multiple signaling components. In this study, we performed a systematic analysis of scaffold proteins in signal transduction by integrating protein-protein interaction and kinase-substrate relationship networks. We predicted 212 scaffold proteins that are involved in 605 distinct signaling pathways. The computational prediction was validated using a protein microarray-based approach. The predicted scaffold proteins showed several interesting characteristics, as we expected from the functionality of scaffold proteins. We found that the scaffold proteins are likely to interact with each other, which is consistent with previous finding that scaffold proteins tend to form homodimers and heterodimers. Interestingly, a single scaffold protein can be involved in multiple signaling pathways by interacting with other scaffold protein partners. Furthermore, we propose two possible regulatory mechanisms by which the activity of scaffold proteins is coordinated with their associated pathways through phosphorylation process. PMID:26393507

  16. Predicting network functions with nested patterns

    NASA Astrophysics Data System (ADS)

    Ganter, Mathias; Kaltenbach, Hans-Michael; Stelling, Jörg

    2014-01-01

    Identifying suitable patterns in complex biological interaction networks helps understanding network functions and allows for predictions at the pattern level: by recognizing a known pattern, one can assign its previously established function. However, current approaches fail for previously unseen patterns, when patterns overlap and when they are embedded into a new network context. Here we show how to conceptually extend pattern-based approaches. We define metabolite patterns in metabolic networks that formalize co-occurrences of metabolites. Our probabilistic framework decodes the implicit information in the networks’ metabolite patterns to predict metabolic functions. We demonstrate the predictive power by identifying ‘indicator patterns’, for instance, for enzyme classification, by predicting directions of novel reactions and of known reactions in new network contexts, and by ranking candidate network extensions for gap filling. Beyond their use in improving genome annotations and metabolic network models, we expect that the concepts transfer to other network types.

  17. Predicting gene ontology annotations of orphan GWAS genes using protein-protein interactions

    PubMed Central

    2014-01-01

    Background The number of genome-wide association studies (GWAS) has increased rapidly in the past couple of years, resulting in the identification of genes associated with different diseases. The next step in translating these findings into biomedically useful information is to find out the mechanism of the action of these genes. However, GWAS studies often implicate genes whose functions are currently unknown; for example, MYEOV, ANKLE1, TMEM45B and ORAOV1 are found to be associated with breast cancer, but their molecular function is unknown. Results We carried out Bayesian inference of Gene Ontology (GO) term annotations of genes by employing the directed acyclic graph structure of GO and the network of protein-protein interactions (PPIs). The approach is designed based on the fact that two proteins that interact biophysically would be in physical proximity of each other, would possess complementary molecular function, and play a role in related biological processes. Predicted GO terms were ranked according to their relative association scores and the approach was evaluated quantitatively by plotting the precision versus recall values and F-scores (the harmonic mean of precision and recall) versus varying thresholds. Precisions of ~58% and ~40% for localization and functions respectively of proteins were determined at a threshold of ~30 (top 30 GO terms in the ranked list). Comparison with function prediction based on semantic similarity among nodes in an ontology and incorporation of those similarities in a k-nearest neighbor classifier confirmed that our results compared favorably. Conclusions This approach was applied to predict the cellular component and molecular function GO terms of all human proteins that have interacting partners possessing at least one known GO annotation. The list of predictions is available at http://severus.dbmi.pitt.edu/engo/GOPRED.html. We present the algorithm, evaluations and the results of the computational predictions
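
    The evaluation described in this abstract (precision, recall, and their harmonic mean at a rank threshold) can be sketched as follows. The function name and the placeholder term labels are hypothetical; the sketch only illustrates the metric definitions and does not reproduce the study's scoring.

```python
def precision_recall_f(ranked_terms, true_terms, k):
    """Score the top-k ranked predictions against a set of known terms.

    Returns (precision, recall, F-score), where the F-score is the
    harmonic mean of precision and recall.
    """
    top_k = ranked_terms[:k]
    hits = sum(1 for term in top_k if term in true_terms)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(true_terms) if true_terms else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Placeholder labels stand in for ranked GO terms of one protein.
ranked = ["term_a", "term_b", "term_c", "term_d"]
known = {"term_a", "term_c", "term_e"}
p, r, f = precision_recall_f(ranked, known, k=2)
```

    Sweeping k over a range of thresholds yields the precision versus recall and F-score versus threshold curves the abstract refers to.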

  18. Rosetta stone method for detecting protein function and protein-protein interactions from genome sequences

    DOEpatents

    Eisenberg, David; Marcotte, Edward M.; Pellegrini, Matteo; Thompson, Michael J.; Yeates, Todd O.

    2002-10-15

    A computational method, system, and computer program are provided for inferring functional links from genome sequences. One method is based on the observation that some pairs of proteins A' and B' have homologs in another organism fused into a single protein chain AB. A trans-genome comparison of sequences can reveal these AB sequences, which are Rosetta Stone sequences because they decipher an interaction between A' and B'. Another method compares the genomic sequence of two or more organisms to create a phylogenetic profile for each protein indicating its presence or absence across all the genomes. The profile provides information regarding functional links between different families of proteins. In yet another method a combination of the above two methods is used to predict functional links.
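
    The phylogenetic profile described in this record is a presence/absence vector per protein across genomes. The sketch below builds such profiles from hypothetical genome contents and compares them by Hamming distance; the choice of Hamming distance is an assumption for illustration, since the record does not state how profiles are compared.

```python
def phylogenetic_profile(protein_family, genomes):
    """Return a tuple of 1/0 flags marking presence in each genome."""
    return tuple(1 if protein_family in genome else 0 for genome in genomes)


def hamming_distance(profile_a, profile_b):
    """Count the genomes in which two profiles disagree."""
    return sum(a != b for a, b in zip(profile_a, profile_b))


# Hypothetical genomes, each listed as the set of protein families it encodes.
genomes = [
    {"P1", "P2", "P3"},
    {"P1", "P2"},
    {"P3"},
    {"P1", "P2", "P3"},
]

profile_p1 = phylogenetic_profile("P1", genomes)
profile_p2 = phylogenetic_profile("P2", genomes)
profile_p3 = phylogenetic_profile("P3", genomes)

# Identical profiles (distance 0) suggest a possible functional link.
d12 = hamming_distance(profile_p1, profile_p2)
d13 = hamming_distance(profile_p1, profile_p3)
```

    In this toy data P1 and P2 are always present or absent together, while P1 and P3 disagree in two genomes.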

  19. BPROMPT: A consensus server for membrane protein prediction.

    PubMed

    Taylor, Paul D; Attwood, Teresa K; Flower, Darren R

    2003-07-01

    Protein structure prediction is a cornerstone of bioinformatics research. Membrane proteins require their own prediction methods due to their intrinsically different composition. A variety of tools exist for topology prediction of membrane proteins, many of them available on the Internet. The server described in this paper, BPROMPT (Bayesian PRediction Of Membrane Protein Topology), uses a Bayesian Belief Network to combine the results of other prediction methods, providing a more accurate consensus prediction. Topology predictions with accuracies of 70% for prokaryotes and 53% for eukaryotes were achieved. BPROMPT can be accessed at http://www.jenner.ac.uk/BPROMPT.

  20. Genetic Ancestry in Lung-Function Predictions

    PubMed Central

    Kumar, Rajesh; Seibold, Max A.; Aldrich, Melinda C.; Williams, L. Keoki; Reiner, Alex P.; Colangelo, Laura; Galanter, Joshua; Gignoux, Christopher; Hu, Donglei; Sen, Saunak; Choudhry, Shweta; Peterson, Edward L.; Rodriguez-Santana, Jose; Rodriguez-Cintron, William; Nalls, Michael A.; Leak, Tennille S.; O’Meara, Ellen; Meibohm, Bernd; Kritchevsky, Stephen B.; Li, Rongling; Harris, Tamara B.; Nickerson, Deborah A.; Fornage, Myriam; Enright, Paul; Ziv, Elad; Smith, Lewis J.; Liu, Kiang; Burchard, Esteban González

    2010-01-01

    BACKGROUND Self-identified race or ethnic group is used to determine normal reference standards in the prediction of pulmonary function. We conducted a study to determine whether the genetically determined percentage of African ancestry is associated with lung function and whether its use could improve predictions of lung function among persons who identified themselves as African American. METHODS We assessed the ancestry of 777 participants self-identified as African American in the Coronary Artery Risk Development in Young Adults (CARDIA) study and evaluated the relation between pulmonary function and ancestry by means of linear regression. We performed similar analyses of data for two independent cohorts of subjects identifying themselves as African American: 813 participants in the Health, Aging, and Body Composition (HABC) study and 579 participants in the Cardiovascular Health Study (CHS). We compared the fit of two types of models to lung-function measurements: models based on the covariates used in standard prediction equations and models incorporating ancestry. We also evaluated the effect of the ancestry-based models on the classification of disease severity in two asthma-study populations. RESULTS African ancestry was inversely related to forced expiratory volume in 1 second (FEV1) and forced vital capacity in the CARDIA cohort. These relations were also seen in the HABC and CHS cohorts. In predicting lung function, the ancestry-based model fit the data better than standard models. Ancestry-based models resulted in the reclassification of asthma severity (based on the percentage of the predicted FEV1) in 4 to 5% of participants. CONCLUSIONS Current predictive equations, which rely on self-identified race alone, may misestimate lung function among subjects who identify themselves as African American. Incorporating ancestry into normative equations may improve lung-function estimates and more accurately categorize disease severity. 
(Funded by the National

  1. Modelling proteins' hidden conformations to predict antibiotic resistance

    NASA Astrophysics Data System (ADS)

    Hart, Kathryn M.; Ho, Chris M. W.; Dutta, Supratik; Gross, Michael L.; Bowman, Gregory R.

    2016-10-01

    TEM β-lactamase confers bacteria with resistance to many antibiotics and rapidly evolves activity against new drugs. However, functional changes are not easily explained by differences in crystal structures. We employ Markov state models to identify hidden conformations and explore their role in determining TEM's specificity. We integrate these models with existing drug-design tools to create a new technique, called Boltzmann docking, which better predicts TEM specificity by accounting for conformational heterogeneity. Using our MSMs, we identify hidden states whose populations correlate with activity against cefotaxime. To experimentally detect our predicted hidden states, we use rapid mass spectrometric footprinting and confirm our models' prediction that increased cefotaxime activity correlates with reduced Ω-loop flexibility. Finally, we design novel variants to stabilize the hidden cefotaximase states, and find their populations predict activity against cefotaxime in vitro and in vivo. Therefore, we expect this framework to have numerous applications in drug and protein design.

  2. Structure-based prediction of host-pathogen protein interactions.

    PubMed

    Mariano, Rachelle; Wuchty, Stefan

    2017-03-16

    The discovery, validation, and characterization of protein-based interactions from different species are crucial for translational research regarding a variety of pathogens, ranging from the malaria parasite Plasmodium falciparum to HIV-1. Here, we review recent advances in the prediction of host-pathogen protein interfaces using structural information. In particular, we observe that current methods chiefly perform machine learning on sequence and domain information to produce large sets of candidate interactions that are further assessed and pruned to generate final, highly probable sets. Structure-based studies have also emphasized the electrostatic properties and evolutionary transformations of pathogenic interfaces, supplying crucial insight into antigenic determinants and the ways pathogens compete for host protein binding. Advancements in spectroscopic and crystallographic methods complement the aforementioned techniques, permitting the rigorous study of true positives at a molecular level. Together, these approaches illustrate how protein structure on a variety of levels functions coordinately and dynamically to achieve host takeover.

  3. Prediction of zinc finger DNA binding protein.

    PubMed

    Nakata, K

    1995-04-01

    Using the neural network algorithm with back-propagation training procedure, we analysed the zinc finger DNA binding protein sequences. We incorporated the characteristic patterns around the zinc finger motifs TFIIIA type (Cys-X2-5-Cys-X12-13-His-X2-5-His) and the steroid hormone receptor type (Cys-X2-5-Cys-X12-15-Cys-X2-5-Cys-X15-16-Cys-X4-5-Cys-X8-10- Cys-X2-3-Cys) in the neural network algorithm. The patterns used in the neural network were the amino acid pattern, the electric charge and polarity pattern, the side-chain chemical property and subproperty patterns, the hydrophobicity and hydrophilicity patterns and the secondary structure propensity pattern. Two consecutive patterns were also considered. Each pattern was incorporated in the single layer perceptron algorithm and the combinations of patterns were considered in the two-layer perceptron algorithm. As for the TFIIIA type zinc finger DNA binding motifs, the prediction results of the two-layer perceptron algorithm reached up to 96.9% discrimination, and the prediction results of the discriminant analysis using the combination of several characters reached up to 97.0%. As for the steroid hormone receptor type zinc finger, the prediction results of neural network algorithm and the discriminant analyses reached up to 96.0%.
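
    The TFIIIA-type spacing pattern quoted in this abstract (Cys-X2-5-Cys-X12-13-His-X2-5-His) can be written as a regular expression over one-letter amino acid codes. This is only a motif scan, shown as a sketch; it is not the neural network or discriminant analysis the abstract describes, and the test sequence is synthetic.

```python
import re

# C, 2-5 residues, C, 12-13 residues, H, 2-5 residues, H
TFIIIA_MOTIF = re.compile(r"C[A-Z]{2,5}C[A-Z]{12,13}H[A-Z]{2,5}H")


def find_tfiiia_motif(sequence):
    """Return (start, matched_text) for the first motif hit, or None."""
    match = TFIIIA_MOTIF.search(sequence.upper())
    if match is None:
        return None
    return match.start(), match.group()


# Synthetic sequence built to satisfy the spacing rule (not a real protein).
synthetic = "MK" + "C" + "AA" + "C" + "A" * 12 + "H" + "AAA" + "H" + "GG"
hit = find_tfiiia_motif(synthetic)
miss = find_tfiiia_motif("CAACAAAH")
```

    A spacing match like this only locates candidate motifs; the abstract's classifiers add residue property patterns around the motif to discriminate true zinc fingers.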

  4. 3D protein structure prediction using Imperialist Competitive algorithm and half sphere exposure prediction.

    PubMed

    Khaji, Erfan; Karami, Masoumeh; Garkani-Nejad, Zahra

    2016-02-21

    Predicting the native structure of proteins based on half-sphere exposure and contact numbers has been studied deeply within recent years. Online predictors of these vectors and secondary structures of amino acids sequences have made it possible to design a function for the folding process. By choosing variant structures and directs for each secondary structure, a random conformation can be generated, and a potential function can then be assigned. Minimizing the potential function utilizing meta-heuristic algorithms is the final step of finding the native structure of a given amino acid sequence. In this work, Imperialist Competitive algorithm was used in order to accelerate the process of minimization. Moreover, we applied an adaptive procedure to apply revolutionary changes. Finally, we considered a more accurate tool for prediction of secondary structure. The results of the computational experiments on standard benchmark show the superiority of the new algorithm over the previous methods with similar potential function.

  5. TOPPER: topology prediction of transmembrane protein based on evidential reasoning.

    PubMed

    Deng, Xinyang; Liu, Qi; Hu, Yong; Deng, Yong

    2013-01-01

    The topology prediction of transmembrane protein is a hot research field in bioinformatics and molecular biology. It is a typical pattern recognition problem. Various prediction algorithms are developed to predict the transmembrane protein topology since the experimental techniques have been restricted by many stringent conditions. Usually, these individual prediction algorithms depend on various principles such as the hydrophobicity or charges of residues. In this paper, an evidential topology prediction method for transmembrane protein is proposed based on evidential reasoning, which is called TOPPER (topology prediction of transmembrane protein based on evidential reasoning). In the proposed method, the prediction results of multiple individual prediction algorithms can be transformed into BPAs (basic probability assignments) according to the confusion matrix. Then, the final prediction result can be obtained by the combination of each individual prediction based on Dempster's rule of combination. The experimental results show that the proposed method is superior to the individual prediction algorithms, which illustrates the effectiveness of the proposed method.
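
    Dempster's rule of combination, which this abstract uses to fuse BPAs from several predictors, can be sketched in a few lines. The two BPAs below are hypothetical values over a toy frame {"in", "out"}; how TOPPER derives BPAs from a confusion matrix is not shown here.

```python
from itertools import product


def dempster_combine(m1, m2):
    """Combine two basic probability assignments with Dempster's rule.

    Each BPA maps a frozenset of hypotheses to its mass. Mass assigned
    to empty intersections is treated as conflict and normalized away.
    """
    combined = {}
    conflict = 0.0
    for (set_a, mass_a), (set_b, mass_b) in product(m1.items(), m2.items()):
        intersection = set_a & set_b
        if intersection:
            combined[intersection] = combined.get(intersection, 0.0) + mass_a * mass_b
        else:
            conflict += mass_a * mass_b
    if conflict >= 1.0:
        raise ValueError("sources are in total conflict; rule is undefined")
    return {focal: mass / (1.0 - conflict) for focal, mass in combined.items()}


IN = frozenset({"in"})
OUT = frozenset({"out"})
EITHER = frozenset({"in", "out"})

# Hypothetical BPAs from two individual predictors for one residue.
predictor_1 = {IN: 0.6, EITHER: 0.4}
predictor_2 = {IN: 0.5, OUT: 0.3, EITHER: 0.2}
fused = dempster_combine(predictor_1, predictor_2)
```

    In this example the conflicting mass is 0.18, so the fused mass on "in" is 0.62 / 0.82, higher than either predictor assigned alone.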

  6. Collective Dynamics Differentiates Functional Divergence in Protein Evolution

    PubMed Central

    Glembo, Tyler J.; Farrell, Daniel W.; Gerek, Z. Nevin; Thorpe, M. F.; Ozkan, S. Banu

    2012-01-01

    Protein evolution is most commonly studied by analyzing related protein sequences and generating ancestral sequences through Bayesian and Maximum Likelihood methods, and/or by resurrecting ancestral proteins in the lab and performing ligand binding studies to determine function. Structural and dynamic evolution have largely been left out of molecular evolution studies. Here we incorporate both structure and dynamics to elucidate the molecular principles behind the divergence in the evolutionary path of the steroid receptor proteins. We determine the likely structure of three evolutionarily diverged ancestral steroid receptor proteins using the Zipping and Assembly Method with FRODA (ZAMF). Our predictions are within ∼2.7 Å all-atom RMSD of the respective crystal structures of the ancestral steroid receptors. Beyond static structure prediction, a particular feature of ZAMF is that it generates protein dynamics information. We investigate the differences in conformational dynamics of diverged proteins by obtaining the most collective motion through essential dynamics. Strikingly, our analysis shows that evolutionarily diverged proteins of the same family do not share the same dynamic subspace, while those sharing the same function are simultaneously clustered together and distant from those that have functionally diverged. Dynamic analysis also enables those mutations that most affect dynamics to be identified. It correctly predicts all mutations (functional and permissive) necessary to evolve new function and ∼60% of permissive mutations necessary to recover ancestral function. PMID:22479170
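
    The all-atom RMSD figure quoted in this abstract is a root-mean-square deviation over matched atom positions. The sketch below computes it for two coordinate lists that are assumed to be already superimposed and in the same atom order; the optimal superposition step is omitted, and the coordinates are hypothetical.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two matched coordinate lists.

    Assumption: both structures are already optimally superimposed and
    list the same atoms in the same order, as (x, y, z) tuples.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and equal in length")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))


# Hypothetical two-atom structures displaced by 1 unit along z.
predicted = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
reference = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
deviation = rmsd(predicted, reference)
```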

  7. Are non-functional, unfolded proteins ('junk proteins') common in the genome?

    PubMed

    Lovell, Simon C

    2003-11-20

    It has recently been shown that many proteins are unfolded in their functional state. In addition, a large number of stretches of protein sequences are predicted to be unfolded. It has been argued that the high frequency of occurrence of these predicted unfolded sequences indicates that the majority of these sequences must also be functional. These sequences tend to be of low complexity. It is well established that certain types of low-complexity sequences are genetically unstable, and are prone to expand in the genome. It is possible, therefore, that in addition to these well-characterised functional unfolded proteins, there are a large number of unfolded proteins that are non-functional. Analogous to 'junk DNA' these protein sequences may arise due to physical characteristics of DNA. Their high frequency may reflect, therefore, the high probability of expansion in the genome. Such 'junk proteins' would not be advantageous, and may be mildly deleterious to the cell.

  8. Predicting β-Turns in Protein Using Kernel Logistic Regression

    PubMed Central

    Elbashir, Murtada Khalafallah; Sheng, Yu; Wang, Jianxin; Wu, FangXiang; Li, Min

    2013-01-01

    A β-turn is a secondary protein structure type that plays a significant role in protein configuration and function. On average 25% of amino acids in protein structures are located in β-turns. It is very important to develop an accurate and efficient method for β-turns prediction. Most of the current successful β-turns prediction methods use support vector machines (SVMs) or neural networks (NNs). The kernel logistic regression (KLR) is a powerful classification technique that has been applied successfully in many classification problems. However, it is often not found in β-turns classification, mainly because it is computationally expensive. In this paper, we used KLR to obtain sparse β-turns prediction in short evolution time. Secondary structure information and position-specific scoring matrices (PSSMs) are utilized as input features. We achieved Qtotal of 80.7% and MCC of 50% on BT426 dataset. These results show that KLR method with the right algorithm can yield performance equivalent to or even better than NNs and SVMs in β-turns prediction. In addition, KLR yields probabilistic outcome and has a well-defined extension to multiclass case. PMID:23509793
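
    The two metrics reported in this abstract, Qtotal and MCC, are both computed from a binary confusion matrix. The sketch below uses the standard definitions (overall percentage correct and the Matthews correlation coefficient); the counts are hypothetical and are not the BT426 results.

```python
import math


def q_total(tp, tn, fp, fn):
    """Percentage of residues classified correctly (turn or non-turn)."""
    return 100.0 * (tp + tn) / (tp + tn + fp + fn)


def matthews_corrcoef(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts.

    Returns 0.0 when any marginal total is zero, a common convention.
    """
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator


# Hypothetical counts for illustration only.
accuracy = q_total(tp=40, tn=40, fp=10, fn=10)
mcc = matthews_corrcoef(tp=40, tn=40, fp=10, fn=10)
```

    Because only about a quarter of residues lie in β-turns, MCC is usually reported alongside Qtotal: it is less inflated by the majority non-turn class.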

  9. Predicting the tolerated sequences for proteins and protein interfaces using RosettaBackrub flexible backbone design.

    PubMed

    Smith, Colin A; Kortemme, Tanja

    2011-01-01

    Predicting the set of sequences that are tolerated by a protein or protein interface, while maintaining a desired function, is useful for characterizing protein interaction specificity and for computationally designing sequence libraries to engineer proteins with new functions. Here we provide a general method, a detailed set of protocols, and several benchmarks and analyses for estimating tolerated sequences using flexible backbone protein design implemented in the Rosetta molecular modeling software suite. The input to the method is at least one experimentally determined three-dimensional protein structure or high-quality model. The starting structure(s) are expanded or refined into a conformational ensemble using Monte Carlo simulations consisting of backrub backbone and side chain moves in Rosetta. The method then uses a combination of simulated annealing and genetic algorithm optimization methods to enrich for low-energy sequences for the individual members of the ensemble. To emphasize certain functional requirements (e.g. forming a binding interface), interactions between and within parts of the structure (e.g. domains) can be reweighted in the scoring function. Results from each backbone structure are merged together to create a single estimate for the tolerated sequence space. We provide an extensive description of the protocol and its parameters, all source code, example analysis scripts and three tests applying this method to finding sequences predicted to stabilize proteins or protein interfaces. The generality of this method makes many other applications possible, for example stabilizing interactions with small molecules, DNA, or RNA. Through the use of within-domain reweighting and/or multistate design, it may also be possible to use this method to find sequences that stabilize particular protein conformations or binding interactions over others.

  10. The functional importance of co-evolving residues in proteins.

    PubMed

    Sandler, Inga; Zigdon, Nitzan; Levy, Efrat; Aharoni, Amir

    2014-02-01

    Computational approaches for detecting co-evolution in proteins allow for the identification of protein-protein interaction networks in different organisms and the assignment of function to under-explored proteins. The detection of co-variation of amino acids within or between proteins, moreover, allows for the discovery of residue-residue contacts and highlights functional residues that can affect the binding affinity, catalytic activity, or substrate specificity of a protein. To explore the functional impact of co-evolutionary changes in proteins, a combined experimental and computational approach must be recruited. Here, we review recent studies that apply computational and experimental tools to obtain novel insight into the structure, function, and evolution of proteins. Specifically, we describe the application of co-evolutionary analysis for predicting high-resolution three-dimensional structures of proteins. In addition, we describe computational approaches followed by experimental analysis for identifying specificity-determining residues in proteins. Finally, we discuss studies addressing the importance of such residues in terms of the functional divergence of proteins, allowing proteins to evolve new functions while avoiding crosstalk with existing cellular pathways or forming reproductive barriers and hence promoting speciation.

  11. PDBalert: automatic, recurrent remote homology tracking and protein structure prediction

    PubMed Central

    Agarwal, Vatsal; Remmert, Michael; Biegert, Andreas; Söding, Johannes

    2008-01-01

    Background In recent years, methods for remote homology detection have grown more and more sensitive and reliable. Automatic structure prediction servers relying on these methods can generate useful 3D models even below 20% sequence identity between the protein of interest and the known structure (template). When no homologs can be found in the protein structure database (PDB), the user would need to rerun the same search at regular intervals in order to make timely use of a template once it becomes available. Results PDBalert is a web-based automatic system that sends an email alert as soon as a structure with homology to a protein in the user's watch list is released to the PDB database or appears among the sequences on hold. The mail contains links to the search results and to an automatically generated 3D homology model. The sequence search is performed with the same software as used by the very sensitive and reliable remote homology detection server HHpred, which is based on pairwise comparison of Hidden Markov models. Conclusion PDBalert will accelerate the information flow from the PDB database to all those who can profit from the newly released protein structures for predicting the 3D structure or function of their proteins of interest. PMID:19025670

  12. Ontology-Based Prediction and Prioritization of Gene Functional Annotations.

    PubMed

    Chicco, Davide; Masseroli, Marco

    2016-01-01

    Genes and their protein products are essential molecular units of a living organism. The knowledge of their functions is key for the understanding of physiological and pathological biological processes, as well as in the development of new drugs and therapies. The association of a gene or protein with its functions, described by controlled terms of biomolecular terminologies or ontologies, is named gene functional annotation. Many valuable gene annotations expressed through terminologies and ontologies are available. Nevertheless, they might include some erroneous information, since only a subset of annotations are reviewed by curators. Furthermore, they are incomplete by definition, given the rapidly evolving pace of biomolecular knowledge. In this scenario, computational methods that are able to quicken the annotation curation process and reliably suggest new annotations are very important. Here, we first propose a computational pipeline that uses different semantic and machine learning methods to predict novel ontology-based gene functional annotations; then, we introduce a new semantic prioritization rule to categorize the predicted annotations by their likelihood of being correct. Our tests and validations proved the effectiveness of our pipeline and prioritization of predicted annotations, by selecting as most likely many predicted annotations that were later confirmed.

  13. Prediction of Chemical Function: Model Development and ...

    EPA Pesticide Factsheets

    The United States Environmental Protection Agency’s Exposure Forecaster (ExpoCast) project is developing both statistical and mechanism-based computational models for predicting exposures to thousands of chemicals, including those in consumer products. The high-throughput (HT) screening-level exposures developed under ExpoCast can be combined with HT screening (HTS) bioactivity data for the risk-based prioritization of chemicals for further evaluation. The functional role (e.g. solvent, plasticizer, fragrance) that a chemical performs can drive both the types of products in which it is found and the concentration in which it is present, and therefore affects exposure potential. However, critical chemical use information (including functional role) is lacking for the majority of commercial chemicals for which exposure estimates are needed. A suite of machine-learning based models for classifying chemicals in terms of their likely functional roles in products based on structure was developed. This effort required collection, curation, and harmonization of publicly available data sources of chemical functional use information from government and industry bodies. Physicochemical and structure descriptor data were generated for chemicals with function data. Machine-learning classifier models for function were then built in a cross-validated manner from the descriptor/function data using the method of random forests. The models were applied to: 1) predict chemi

  14. Blind protein structure prediction using accelerated free-energy simulations

    PubMed Central

    Perez, Alberto; Morrone, Joseph A.; Brini, Emiliano; MacCallum, Justin L.; Dill, Ken A.

    2016-01-01

    We report a key proof of principle of a new acceleration method [Modeling Employing Limited Data (MELD)] for predicting protein structures by molecular dynamics simulation. It shows that such Boltzmann-satisfying techniques are now sufficiently fast and accurate to predict native protein structures in a limited test within the Critical Assessment of Structure Prediction (CASP) community-wide blind competition. PMID:27847872

  15. Concomitant prediction of function and fold at the domain level with GO-based profiles

    PubMed Central

    2013-01-01

    Predicting the function of newly sequenced proteins is crucial due to the pace at which these raw sequences are being obtained. Almost all resources for predicting protein function assign functional terms to whole chains, and do not distinguish which particular domain is responsible for the allocated function. This is not a limitation of the methodologies themselves; rather, the databases of functional annotations that these methods use for transferring functional terms to new proteins are annotated on a whole-chain basis. Nevertheless, domains are the basic evolutionary and often functional units of proteins. In many cases, the domains of a protein chain have distinct molecular functions, independent from each other. For that reason, resources with functional annotations at the domain level, as well as methodologies for predicting function for individual domains adapted to these resources, are required. We present a methodology for predicting the molecular function of individual domains, based on a previously developed database of functional annotations at the domain level. The approach, which we show outperforms a standard method based on sequence searches in assigning function, concomitantly predicts the structural fold of the domains and can give hints on the functionally important residues associated with the predicted function. PMID:23514233

  16. Functional Foods Containing Whey Proteins

    Technology Transfer Automated Retrieval System (TEKTRAN)

    Whey proteins, modified whey proteins, and whey components are useful as nutrients or supplements for health maintenance. Extrusion modified whey proteins can easily fit into new products such as beverages, confectionery items (e.g., candies), convenience foods, desserts, baked goods, sauces, and in...

  17. Exploration of the Dynamic Properties of Protein Complexes Predicted from Spatially Constrained Protein-Protein Interaction Networks

    PubMed Central

    Yen, Eric A.; Tsay, Aaron; Waldispuhl, Jerome; Vogel, Jackie

    2014-01-01

    Protein complexes are not static, but rather highly dynamic with subunits that undergo 1-dimensional diffusion with respect to each other. Interactions within protein complexes are modulated through regulatory inputs that alter interactions and introduce new components and deplete existing components through exchange. While it is clear that the structure and function of any given protein complex is coupled to its dynamical properties, it remains a challenge to predict the possible conformations that complexes can adopt. Protein-fragment Complementation Assays detect physical interactions between protein pairs constrained to ≤8 nm from each other in living cells. This method has been used to build networks composed of thousands of pair-wise interactions. Significantly, these networks contain a wealth of dynamic information, as the assay is fully reversible and the proteins are expressed in their natural context. In this study, we describe a method that extracts this valuable information in the form of predicted conformations, allowing the user to explore the conformational landscape, to search for structures that correlate with an activity state, and estimate the abundance of conformations in the living cell. The generator is based on a Markov Chain Monte Carlo simulation that uses the interaction dataset as input and is constrained by the physical resolution of the assay. We applied this method to an 18-member protein complex composed of the seven core proteins of the budding yeast Arp2/3 complex and 11 associated regulators and effector proteins. We generated 20,480 output structures and identified conformational states using principal component analysis. We interrogated the conformation landscape and found evidence of symmetry breaking, a mixture of likely active and inactive conformational states and dynamic exchange of the core protein Arc15 between core and regulatory components. Our method provides a novel tool for prediction and visualization of the hidden

  18. Predictive energy landscapes for folding membrane protein assemblies

    NASA Astrophysics Data System (ADS)

    Truong, Ha H.; Kim, Bobby L.; Schafer, Nicholas P.; Wolynes, Peter G.

    2015-12-01

    We study the energy landscapes for membrane protein oligomerization using the Associative memory, Water mediated, Structure and Energy Model with an implicit membrane potential (AWSEM-membrane), a coarse-grained molecular dynamics model previously optimized under the assumption that the energy landscapes for folding α-helical membrane protein monomers are funneled once their native topology within the membrane is established. In this study we show that the AWSEM-membrane force field is able to sample near native binding interfaces of several oligomeric systems. By predicting candidate structures using simulated annealing, we further show that degeneracies in predicting structures of membrane protein monomers are generally resolved in the folding of the higher order assemblies as is the case in the assemblies of both nicotinic acetylcholine receptor and V-type Na+-ATPase dimers. The physics of the phenomenon resembles domain swapping, which is consistent with the landscape following the principle of minimal frustration. We revisit also the classic Khorana study of the reconstitution of bacteriorhodopsin from its fragments, which is the close analogue of the early Anfinsen experiment on globular proteins. Here, we show the retinal cofactor likely plays a major role in selecting the final functional assembly.

  19. Bio-basis function neural networks in protein data mining.

    PubMed

    Yang, Zheng Rong; Hamer, Rebecca

    2007-01-01

    Accurately identifying functional sites in proteins is one of the most important topics in bioinformatics and systems biology. In bioinformatics, identifying protease cleavage sites in protein sequences can aid drug/inhibitor design. In systems biology, post-translational protein-protein interaction activity is one of the major components for analyzing signaling pathway activities. Determining functional sites using laboratory experiments is normally time-consuming and expensive. Computer programs have therefore been widely used for this kind of task. Mining protein sequence data using computer programs covers two major issues: 1) discovering how amino acid specificity affects functional sites and 2) discovering what amino acid specificity is. Both need a proper coding mechanism prior to using a proper machine learning algorithm. The development of the bio-basis function neural network (BBFNN) has opened a new way for protein sequence data mining. The bio-basis function used in BBFNN is biologically sound in that it encodes the biological information in protein sequences well, i.e. it measures the similarity between protein sequences well. BBFNN has therefore outperformed conventional neural networks in many subjects of protein sequence data mining, from protease cleavage site prediction to disordered protein identification. This review focuses on the variants of BBFNN and their applications in mining protein sequence data.

  20. Optimizing nondecomposable loss functions in structured prediction.

    PubMed

    Ranjbar, Mani; Lan, Tian; Wang, Yang; Robinovitch, Steven N; Li, Ze-Nian; Mori, Greg

    2013-04-01

    We develop an algorithm for structured prediction with nondecomposable performance measures. The algorithm learns parameters of Markov Random Fields (MRFs) and can be applied to multivariate performance measures. Examples include performance measures such as Fβ score (natural language processing), intersection over union (object category segmentation), Precision/Recall at k (search engines), and ROC area (binary classifiers). We attack this optimization problem by approximating the loss function with a piecewise linear function. The loss augmented inference forms a Quadratic Program (QP), which we solve using LP relaxation. We apply this approach to two tasks: object class-specific segmentation and human action retrieval from videos. We show significant improvement over baseline approaches that either use simple loss functions or simple scoring functions on the PASCAL VOC and H3D Segmentation datasets, and a nursing home action recognition dataset.

  1. Predicting Protein-Protein Interaction Sites with a Novel Membership Based Fuzzy SVM Classifier.

    PubMed

    Sriwastava, Brijesh K; Basu, Subhadip; Maulik, Ujjwal

    2015-01-01

    Predicting residues that participate in protein-protein interactions (PPI) helps to identify which amino acids are located at the interface. In this paper, we show that the performance of the classical support vector machine (SVM) algorithm can further be improved with the use of a custom-designed fuzzy membership function, for the partner-specific PPI interface prediction problem. We evaluated the performances of both classical SVM and fuzzy SVM (F-SVM) on the PPI databases of three different model proteomes of Homo sapiens, Escherichia coli and Saccharomyces cerevisiae and calculated the statistical significance of the developed F-SVM over classical SVM algorithm. We also compared our performance with the available state-of-the-art fuzzy methods in this domain and observed significant performance improvements. To predict interaction sites in protein complexes, local composition of amino acids together with their physico-chemical characteristics is used, where the F-SVM based prediction method exploits the membership function for each pair of sequence fragments. The average F-SVM performance (area under ROC curve) on the test samples in a 10-fold cross-validation experiment is measured as 77.07, 78.39, and 74.91 percent for the aforementioned organisms respectively. Performances on independent test sets are obtained as 72.09, 73.24 and 82.74 percent respectively. The software is available for free download from http://code.google.com/p/cmater-bioinfo.

  2. Evolutionary conservation and predicted structure of the Drosophila extra sex combs repressor protein.

    PubMed Central

    Ng, J; Li, R; Morgan, K; Simon, J

    1997-01-01

    The Drosophila extra sex combs (esc) protein, a member of the Polycomb group (PcG), is a transcriptional repressor of homeotic genes. Genetic studies have shown that esc protein is required in early embryos at about the time that other PcG proteins become engaged in homeotic gene repression. The esc protein consists primarily of multiple copies of the WD repeat, a motif that has been implicated in protein-protein interaction. To further investigate the domain organization of esc protein, we have isolated and characterized esc homologs from divergent insect species. We report that esc protein is highly conserved in housefly (72% identical to Drosophila esc), butterfly (55% identical), and grasshopper (56% identical). We show that the butterfly homolog provides esc function in Drosophila, indicating that the sequence similarities reflect functional conservation. Homology modeling using the crystal structure of another WD repeat protein, the G-protein beta-subunit, predicts that esc protein adopts a beta-propeller structure. The sequence comparisons and modeling suggest that there are seven WD repeats in esc protein which together form a seven-bladed beta-propeller. We locate the conserved regions in esc protein with respect to this predicted structure. Site-directed mutagenesis of specific loops, predicted to extend from the propeller surface, identifies conserved parts of esc protein required for function in vivo. We suggest that these regions might mediate physical interaction with esc partner proteins. PMID:9343430

  3. Evolutionary Trace Annotation of Protein Function in the Structural Proteome

    PubMed Central

    Erdin, Serkan; Ward, R. Matthew; Venner, Eric

    2010-01-01

    By design, structural genomics (SG) solves many structures that cannot be assigned function based on homology to known proteins. Alternative function annotation methods are therefore needed and this study focuses on function prediction with three-dimensional (3D) templates: small structural motifs built of just a few functionally critical residues. Although experimentally proven functional residues are scarce, we show here that Evolutionary Trace (ET) rankings of residue importance are sufficient to build 3D templates, match them, and then assign Gene Ontology (GO) functions in enzymes and non-enzymes alike. In a high specificity mode, this Evolutionary Trace Annotation (ETA) method covered half (53%) of the 2384 annotated SG protein controls. Three-quarters (76%) of predictions were both correct and complete. The positive predictive value for all GO depths (all-depth PPV) was 84%, and it rose to 94% over GO depths 1–3 (depth 3 PPV). In a high sensitivity mode coverage rose significantly (84%) while accuracy fell moderately: 68% of predictions were both correct and complete, all-depth PPV was 75%, and depth 3 PPV was 86%. These data concur with prior mutational experiments showing that ET rank information identifies key functional determinants in proteins. In practice, ETA predicted functions in 42% of 3461 un-annotated SG proteins. In 529 cases (including 280 non-enzymes and 21 for metal ion ligands) the expected accuracy is 84% at any GO depth and 94% down to GO depth 3, while for the remaining 931 the expected accuracies are 60% and 71%, respectively. Thus local structural comparisons of evolutionarily important residues can help decipher protein functions to known reliability levels and without prior assumption on functional mechanisms. ETA is available at http://mammoth.bcm.tmc.edu/eta. PMID:20036248

  4. Parental education predicts corticostriatal functionality in adulthood.

    PubMed

    Gianaros, Peter J; Manuck, Stephen B; Sheu, Lei K; Kuan, Dora C H; Votruba-Drzal, Elizabeth; Craig, Anna E; Hariri, Ahmad R

    2011-04-01

    Socioeconomic disadvantage experienced in early development predicts ill health in adulthood. However, the neurobiological pathways linking early disadvantage to adult health remain unclear. Lower parental education (a presumptive indicator of early socioeconomic disadvantage) predicts health-impairing adult behaviors, including tobacco and alcohol dependencies. These behaviors depend, in part, on the functionality of corticostriatal brain systems that 1) show developmental plasticity and early vulnerability, 2) process reward-related information, and 3) regulate impulsive decisions and actions. Hence, corticostriatal functionality in adulthood may covary directly with indicators of early socioeconomic disadvantage, particularly lower parental education. Here, we tested the covariation between parental education and corticostriatal activation and connectivity in 76 adults without confounding clinical syndromes. Corticostriatal activation and connectivity were assessed during the processing of stimuli signaling monetary gains (positive feedback [PF]) and losses (negative feedback). After accounting for participants' own education and other explanatory factors, lower parental education predicted reduced activation in anterior cingulate and dorsomedial prefrontal cortices during PF, along with reduced connectivity between these cortices and orbitofrontal and striatal areas implicated in reward processing and impulse regulation. In speculation, adult alterations in corticostriatal functionality may represent facets of a neurobiological endophenotype linked to socioeconomic conditions of early development.

  5. Conserved Functional Motifs and Homology Modeling to Predict Hidden Moonlighting Functional Sites

    PubMed Central

    Wong, Aloysius; Gehring, Chris; Irving, Helen R.

    2015-01-01

    Moonlighting functional centers within proteins can provide them with hitherto unrecognized functions. Here, we review how hidden moonlighting functional centers, which we define as binding sites that have catalytic activity or regulate protein function in a novel manner, can be identified using targeted bioinformatic searches. Functional motifs used in such searches include amino acid residues that are conserved across species and many of which have been assigned functional roles based on experimental evidence. Molecules that were identified in this manner seeking cyclic mononucleotide cyclases in plants are used as examples. The strength of this computational approach is enhanced when good homology models can be developed to test the functionality of the predicted centers in silico, which, in turn, increases confidence in the ability of the identified candidates to perform the predicted functions. Computational characterization of moonlighting functional centers is not diagnostic for catalysis but serves as a rapid screening method, and highlights testable targets from a potentially large pool of candidates for subsequent in vitro and in vivo experiments required to confirm the functionality of the predicted moonlighting centers. PMID:26106597

  6. Functionalizing Microporous Membranes for Protein Purification and Protein Digestion.

    PubMed

    Dong, Jinlan; Bruening, Merlin L

    2015-01-01

    This review examines advances in the functionalization of microporous membranes for protein purification and the development of protease-containing membranes for controlled protein digestion prior to mass spectrometry analysis. Recent studies confirm that membranes are superior to bead-based columns for rapid protein capture, presumably because convective mass transport in membrane pores rapidly brings proteins to binding sites. Modification of porous membranes with functional polymeric films or TiO₂ nanoparticles yields materials that selectively capture species ranging from phosphopeptides to His-tagged proteins, and protein-binding capacities often exceed those of commercial beads. Thin membranes also provide a convenient framework for creating enzyme-containing reactors that afford control over residence times. With millisecond residence times, reactors with immobilized proteases limit protein digestion to increase sequence coverage in mass spectrometry analysis and facilitate elucidation of protein structures. This review emphasizes the advantages of membrane-based techniques and concludes with some challenges for their practical application.

  7. Functionalizing Microporous Membranes for Protein Purification and Protein Digestion

    NASA Astrophysics Data System (ADS)

    Dong, Jinlan; Bruening, Merlin L.

    2015-07-01

    This review examines advances in the functionalization of microporous membranes for protein purification and the development of protease-containing membranes for controlled protein digestion prior to mass spectrometry analysis. Recent studies confirm that membranes are superior to bead-based columns for rapid protein capture, presumably because convective mass transport in membrane pores rapidly brings proteins to binding sites. Modification of porous membranes with functional polymeric films or TiO2 nanoparticles yields materials that selectively capture species ranging from phosphopeptides to His-tagged proteins, and protein-binding capacities often exceed those of commercial beads. Thin membranes also provide a convenient framework for creating enzyme-containing reactors that afford control over residence times. With millisecond residence times, reactors with immobilized proteases limit protein digestion to increase sequence coverage in mass spectrometry analysis and facilitate elucidation of protein structures. This review emphasizes the advantages of membrane-based techniques and concludes with some challenges for their practical application.

  8. Predicting continuous local structure and the effect of its substitution for secondary structure in fragment-free protein structure prediction.

    PubMed

    Faraggi, Eshel; Yang, Yuedong; Zhang, Shesheng; Zhou, Yaoqi

    2009-11-11

    Local structures predicted from protein sequences are used extensively in every aspect of modeling and prediction of protein structure and function. For more than 50 years, they have been predicted at a low-resolution coarse-grained level (e.g., three-state secondary structure). Here, we combine a two-state classifier with real-value predictor to predict local structure in continuous representation by backbone torsion angles. The accuracy of the angles predicted by this approach is close to that derived from NMR chemical shifts. Their substitution for predicted secondary structure as restraints for ab initio structure prediction doubles the success rate. This result demonstrates the potential of predicted local structure for fragment-free tertiary-structure prediction. It further implies potentially significant benefits from using predicted real-valued torsion angles as a replacement for or supplement to the secondary-structure prediction tools used almost exclusively in many computational methods ranging from sequence alignment to function prediction.

  9. MEGADOCK: An All-to-All Protein-Protein Interaction Prediction System Using Tertiary Structure Data

    PubMed Central

    Ohue, Masahito; Matsuzaki, Yuri; Uchikoga, Nobuyuki; Ishida, Takashi; Akiyama, Yutaka

    2014-01-01

    The elucidation of protein-protein interaction (PPI) networks is important for understanding cellular structure and function and structure-based drug design. However, the development of an effective method to conduct exhaustive PPI screening represents a computational challenge. We have been investigating a protein docking approach based on shape complementarity and physicochemical properties. We describe here the development of the protein-protein docking software package “MEGADOCK” that samples an extremely large number of protein dockings at high speed. MEGADOCK reduces the calculation time required for docking by using several techniques such as a novel scoring function called the real Pairwise Shape Complementarity (rPSC) score. We showed that MEGADOCK is capable of exhaustive PPI screening by completing docking calculations 7.5 times faster than the conventional docking software, ZDOCK, while maintaining an acceptable level of accuracy. When MEGADOCK was applied to a subset of a general benchmark dataset to predict 120 relevant interacting pairs from 120 x 120 = 14,400 combinations of proteins, an F-measure value of 0.231 was obtained. Further, we showed that MEGADOCK can be applied to a large-scale protein-protein interaction-screening problem with accuracy better than random. When our approach is combined with parallel high-performance computing systems, it is now feasible to search and analyze protein-protein interactions while taking into account three-dimensional structures at the interactome scale. MEGADOCK is freely available at http://www.bi.cs.titech.ac.jp/megadock. PMID:23855673
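
    The F-measure quoted in the abstract combines precision and recall over predicted interacting pairs. The sketch below shows the standard F-beta calculation on a toy all-to-all screen; the function name f_measure and the protein labels are hypothetical, and the example data are made up rather than taken from the MEGADOCK benchmark.

```python
def f_measure(predicted, actual, beta=1.0):
    """Return (precision, recall, F-beta) for two collections of protein pairs."""
    predicted, actual = set(predicted), set(actual)
    true_pos = len(predicted & actual)
    if true_pos == 0:
        # No correct prediction: all three scores are zero by convention.
        return 0.0, 0.0, 0.0
    precision = true_pos / len(predicted)
    recall = true_pos / len(actual)
    b2 = beta * beta
    f = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f

# Toy screen: 4 x 4 = 16 candidate pairs, of which 4 truly interact.
actual = {("A1", "B1"), ("A2", "B2"), ("A3", "B3"), ("A4", "B4")}
predicted = {("A1", "B1"), ("A2", "B2"), ("A1", "B3"), ("A4", "B2"), ("A3", "B1")}
precision, recall, f1 = f_measure(predicted, actual)
```

    In an all-to-all setting the true pairs are a small fraction of the candidates (120 of 14,400 in the abstract), which is why an F-measure is more informative here than plain accuracy.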

  10. Community-Wide Evaluation of Computational Function Prediction.

    PubMed

    Friedberg, Iddo; Radivojac, Predrag

    2017-01-01

    A biological experiment is the most reliable way of assigning function to a protein. However, in the era of high-throughput sequencing, scientists are unable to carry out experiments to determine the function of every single gene product. Therefore, to gain insights into the activity of these molecules and guide experiments, we must rely on computational means to functionally annotate the majority of sequence data. To understand how well these algorithms perform, we have established a challenge involving a broad scientific community in which we evaluate different annotation methods according to their ability to predict the associations between previously unannotated protein sequences and Gene Ontology terms. Here we discuss the rationale, benefits, and issues associated with evaluating computational methods in an ongoing community-wide challenge.

  11. Improved method for predicting protein fold patterns with ensemble classifiers.

    PubMed

    Chen, W; Liu, X; Huang, Y; Jiang, Y; Zou, Q; Lin, C

    2012-01-27

    Protein folding is recognized as a critical problem in the field of biophysics in the 21st century. Predicting protein-folding patterns is challenging due to the complex structure of proteins. In an attempt to solve this problem, we employed ensemble classifiers to improve prediction accuracy. In our experiments, 188-dimensional features were extracted based on the composition and physical-chemical property of proteins and 20-dimensional features were selected using a coupled position-specific scoring matrix. Compared with traditional prediction methods, these methods were superior in terms of prediction accuracy. The 188-dimensional feature-based method achieved 71.2% accuracy in five cross-validations. The accuracy rose to 77% when we used a 20-dimensional feature vector. These methods were used on recent data, with 54.2% accuracy. Source codes and dataset, together with web server and software tools for prediction, are available at: http://datamining.xmu.edu.cn/main/~cwc/ProteinPredict.html.

  12. Essential protein identification based on essential protein-protein interaction prediction by Integrated Edge Weights.

    PubMed

    Jiang, Yuexu; Wang, Yan; Pang, Wei; Chen, Liang; Sun, Huiyan; Liang, Yanchun; Blanzieri, Enrico

    2015-07-15

    Essential proteins play a crucial role in cellular survival and development process. Experimentally, essential proteins are identified by gene knockouts or RNA interference, which are expensive and often fatal to the target organisms. Regarding this, an alternative yet important approach to essential protein identification is through computational prediction. Existing computational methods predict essential proteins based on their relative densities in a protein-protein interaction (PPI) network. Degree, betweenness, and other appropriate criteria are often used to measure the relative density. However, no matter what criterion is used, a protein is actually ordered by the attributes of this protein per se. In this research, we presented a novel computational method, Integrated Edge Weights (IEW), to first rank protein-protein interactions by integrating their edge weights, and then identified sub PPI networks consisting of those highly-ranked edges, and finally regarded the nodes in these sub networks as essential proteins. We evaluated IEW on three model organisms: Saccharomyces cerevisiae (S. cerevisiae), Escherichia coli (E. coli), and Caenorhabditis elegans (C. elegans). The experimental results showed that IEW achieved better performance than the state-of-the-art methods in terms of precision-recall and Jackknife measures. We also demonstrated that IEW is a robust and effective method, which can retrieve biologically significant modules by its highly-ranked protein-protein interactions for S. cerevisiae, E. coli, and C. elegans. We believe that, with sufficient data provided, IEW can be applied to essential protein identification in any other organism. A website about IEW can be accessed from http://digbio.missouri.edu/IEW/index.html.
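
    The edge-first idea in the abstract (rank interactions, keep the highly ranked edges, then read off their nodes) can be sketched as follows. The abstract does not describe how the integrated weights are computed or where the cutoff is placed, so the precomputed weights, the top_fraction parameter, and the function name essential_candidates are assumptions for illustration only.

```python
def essential_candidates(weighted_edges, top_fraction=0.25):
    """weighted_edges maps (protein_a, protein_b) to an already-integrated weight.
    Returns the highest-weighted edges and the set of proteins they touch."""
    ranked = sorted(weighted_edges.items(), key=lambda kv: kv[1], reverse=True)
    keep = max(1, int(len(ranked) * top_fraction))
    top_edges = [edge for edge, _ in ranked[:keep]]
    nodes = set()
    for a, b in top_edges:
        # Every endpoint of a retained edge becomes a candidate essential protein.
        nodes.update((a, b))
    return top_edges, nodes

# Toy network with made-up weights.
edges = {
    ("P1", "P2"): 0.9,
    ("P2", "P3"): 0.8,
    ("P3", "P4"): 0.2,
    ("P4", "P5"): 0.1,
}
top_edges, candidates = essential_candidates(edges, top_fraction=0.5)
```

    The contrast with degree- or betweenness-based ranking is that proteins are never scored directly; they are selected only through the edges they participate in.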

  13. Origins of Protein Functions in Cells

    NASA Technical Reports Server (NTRS)

    Seelig, Burchard; Pohorille, Andrzej

    2011-01-01

    In modern organisms proteins perform a majority of cellular functions, such as chemical catalysis, energy transduction and transport of material across cell walls. Although great strides have been made towards understanding protein evolution, a meaningful extrapolation from contemporary proteins to their earliest ancestors is virtually impossible. In an alternative approach, the origin of water-soluble proteins was probed through the synthesis and in vitro evolution of very large libraries of random amino acid sequences. In combination with computer modeling and simulations, these experiments allow us to address a number of fundamental questions about the origins of proteins. Can functionality emerge from random sequences of proteins? How did the initial repertoire of functional proteins diversify to facilitate new functions? Did this diversification proceed primarily through drawing novel functionalities from random sequences or through evolution of already existing proto-enzymes? Did protein evolution start from a pool of proteins defined by a frozen accident and other collections of proteins could start a different evolutionary pathway? Although we do not have definitive answers to these questions yet, important clues have been uncovered. In one example (Keefe and Szostak, 2001), novel ATP binding proteins were identified that appear to be unrelated in both sequence and structure to any known ATP binding proteins. One of these proteins was subsequently redesigned computationally to bind GTP through introducing several mutations that introduce targeted structural changes to the protein, improve its binding to guanine and prevent water from accessing the active center. This study facilitates further investigations of individual evolutionary steps that lead to a change of function in primordial proteins. In a second study (Seelig and Szostak, 2007), novel enzymes were generated that can join two pieces of RNA in a reaction for which no natural enzymes are known

  14. Detection of Functional Modes in Protein Dynamics

    PubMed Central

    Hub, Jochen S.; de Groot, Bert L.

    2009-01-01

    Proteins frequently accomplish their biological function by collective atomic motions. Yet the identification of collective motions related to a specific protein function from, e.g., a molecular dynamics trajectory is often non-trivial. Here, we propose a novel technique termed “functional mode analysis” that aims to detect the collective motion that is directly related to a particular protein function. Based on an ensemble of structures, together with an arbitrary “functional quantity” that quantifies the functional state of the protein, the technique detects the collective motion that is maximally correlated to the functional quantity. The functional quantity could, e.g., correspond to a geometric, electrostatic, or chemical observable, or any other variable that is relevant to the function of the protein. In addition, the motion that displays the largest likelihood to induce a substantial change in the functional quantity is estimated from the given protein ensemble. Two different correlation measures are applied: first, the Pearson correlation coefficient that measures linear correlation only; and second, the mutual information that can assess any kind of interdependence. Detecting the maximally correlated motion allows one to derive a model for the functional state in terms of a single collective coordinate. The new approach is illustrated using a number of biomolecules, including a polyalanine-helix, T4 lysozyme, Trp-cage, and leucine-binding protein. PMID:19714202
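
    The difference between the two correlation measures mentioned in this abstract can be seen on a toy example. The Python sketch below is not the authors' functional mode analysis; it only compares a Pearson coefficient with a discrete mutual information estimate for a hypothetical collective coordinate `x` and a hypothetical functional quantity `f = x**2`, which depends on `x` entirely but not linearly.

```python
import math
from collections import Counter


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def mutual_information(xs, ys):
    """Mutual information (in nats) of two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )


# Hypothetical data: projection onto one coordinate, and a functional
# quantity that is a symmetric (non-linear) function of it.
x = [-2, -1, 0, 1, 2]
f = [v * v for v in x]

r = pearson(x, f)
mi = mutual_information(x, f)
print(f"Pearson r = {r:.3f}, mutual information = {mi:.3f} nats")
```

    Here the Pearson coefficient vanishes although `f` is fully determined by `x`, while the mutual information is clearly positive. Real trajectories have continuous coordinates, so a practical estimate would need binning or a density estimator, which this sketch leaves out.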

  15. Prediction of protein-protein interaction sites from weakly homologous template structures using meta-threading and machine learning.

    PubMed

    Maheshwari, Surabhi; Brylinski, Michal

    2015-01-01

    The identification of protein-protein interactions is vital for understanding protein function, elucidating interaction mechanisms, and for practical applications in drug discovery. With the exponentially growing protein sequence data, fully automated computational methods that predict interactions between proteins are becoming essential components of system-level function inference. A thorough analysis of protein complex structures demonstrated that binding site locations as well as the interfacial geometry are highly conserved across evolutionarily related proteins. Because the conformational space of protein-protein interactions is highly covered by experimental structures, sensitive protein threading techniques can be used to identify suitable templates for the accurate prediction of interfacial residues. Toward this goal, we developed eFindSite(PPI), an algorithm that uses the three-dimensional structure of a target protein, evolutionarily remotely related templates and machine learning techniques to predict binding residues. Using crystal structures, the average sensitivity (specificity) of eFindSite(PPI) in interfacial residue prediction is 0.46 (0.92). For weakly homologous protein models, these values only slightly decrease to 0.40-0.43 (0.91-0.92) demonstrating that eFindSite(PPI) performs well not only using experimental data but also tolerates structural imperfections in computer-generated structures. In addition, eFindSite(PPI) detects specific molecular interactions at the interface; for instance, it correctly predicts approximately one half of hydrogen bonds and aromatic interactions, as well as one third of salt bridges and hydrophobic contacts. Comparative benchmarks against several dimer datasets show that eFindSite(PPI) outperforms other methods for protein-binding residue prediction. It also features a carefully tuned confidence estimation system, which is particularly useful in large-scale applications using raw genomic data. eFindSite(PPI) is
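
    For readers unfamiliar with the two figures quoted in this abstract, the Python sketch below shows how sensitivity and specificity are computed for per-residue interface predictions. The residue labels are made up for illustration and are not taken from the eFindSite(PPI) benchmarks.

```python
def sensitivity_specificity(predicted, actual):
    """Return (sensitivity, specificity) for two equal-length 0/1 lists.

    sensitivity = TP / (TP + FN): fraction of true interface residues found.
    specificity = TN / (TN + FP): fraction of non-interface residues
    correctly left unflagged.
    """
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    tn = sum(1 for p, a in zip(predicted, actual) if not p and not a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical 10-residue protein: 1 marks an interface residue.
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]

sens, spec = sensitivity_specificity(predicted, actual)
print(f"sensitivity = {sens:.2f}, specificity = {spec:.2f}")
```

    A pattern of moderate sensitivity with high specificity, as reported above, means the predictor misses a share of the true interface residues but rarely flags residues that are not at the interface.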

  16. Local functional descriptors for surface comparison based binding prediction

    PubMed Central

    2012-01-01

    Background Molecular recognition in proteins occurs due to appropriate arrangements of physical, chemical, and geometric properties of an atomic surface. Similar surface regions should create similar binding interfaces. Effective methods for comparing surface regions can be used in identifying similar regions, and to predict interactions without regard to the underlying structural scaffold that creates the surface. Results We present a new descriptor for protein functional surfaces and algorithms for using these descriptors to compare protein surface regions to identify ligand binding interfaces. Our approach uses descriptors of local regions of the surface, and assembles collections of matches to compare larger regions. Our approach uses a variety of physical, chemical, and geometric properties, adaptively weighting these properties as appropriate for different regions of the interface. Our approach builds a classifier based on a training corpus of examples of binding sites of the target ligand. The constructed classifiers can be applied to a query protein providing a probability for each position on the protein that the position is part of a binding interface. We demonstrate the effectiveness of the approach on a number of benchmarks, demonstrating performance that is comparable to the state-of-the-art, with an approach with more generality than these prior methods. Conclusions Local functional descriptors offer a new method for protein surface comparison that is sufficiently flexible to serve in a variety of applications. PMID:23176080

  17. GOPred: GO Molecular Function Prediction by Combined Classifiers

    PubMed Central

    Saraç, Ömer Sinan; Atalay, Volkan; Cetin-Atalay, Rengul

    2010-01-01

    Functional protein annotation is an important matter for in vivo and in silico biology. Several computational methods have been proposed that make use of a wide range of features such as motifs, domains, homology, structure and physicochemical properties. There is no single method that performs best in all functional classification problems because information obtained using any of these features depends on the function to be assigned to the protein. In this study, we portray a novel approach that combines different methods to better represent protein function. First, we formulated the function annotation problem as a classification problem defined on 300 different Gene Ontology (GO) terms from molecular function aspect. We presented a method to form positive and negative training examples while taking into account the directed acyclic graph (DAG) structure and evidence codes of GO. We applied three different methods and their combinations. Results show that combining different methods improves prediction accuracy in most cases. The proposed method, GOPred, is available as an online computational annotation tool (http://kinaz.fen.bilkent.edu.tr/gopred). PMID:20824206
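
    One consequence of the GO graph structure for building training sets is that a protein annotated with a specific term also counts as an example of every ancestor of that term. The Python sketch below illustrates only this propagation step on a toy hierarchy. The term labels, protein names, and function names are hypothetical, and the way GOPred selects negative examples or uses evidence codes is not described in the abstract and is not modeled here.

```python
def ancestors(term, parents):
    """All terms reachable from `term` by following parent links in a DAG."""
    seen, stack = set(), [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def positives_for(term, annotations, parents):
    """Proteins annotated with `term` itself or with any descendant of it."""
    return {
        protein
        for protein, terms in annotations.items()
        if any(t == term or term in ancestors(t, parents) for t in terms)
    }


# Toy hierarchy (child -> parents); labels are illustrative, not GO IDs.
parents = {
    "ATP binding": ["nucleotide binding"],
    "nucleotide binding": ["binding"],
    "kinase activity": ["catalytic activity"],
}
annotations = {
    "P1": {"ATP binding"},
    "P2": {"kinase activity"},
    "P3": {"binding"},
}

print(sorted(positives_for("nucleotide binding", annotations, parents)))
print(sorted(positives_for("binding", annotations, parents)))
```

    P1 becomes a positive example for "nucleotide binding" and for "binding" even though it is annotated only with the more specific term, whereas P3, annotated only with the general term, is not a positive for the specific one.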

  18. Can computationally designed protein sequences improve secondary structure prediction?

    PubMed

    Bondugula, Rajkumar; Wallqvist, Anders; Lee, Michael S

    2011-05-01

    Computational sequence design methods are used to engineer proteins with desired properties such as increased thermal stability and novel function. In addition, these algorithms can be used to identify an envelope of sequences that may be compatible with a particular protein fold topology. In this regard, we hypothesized that sequence-property prediction, specifically secondary structure, could be significantly enhanced by using a large database of computationally designed sequences. We performed a large-scale test of this hypothesis with 6511 diverse protein domains and 50 designed sequences per domain. After analysis of the inherent accuracy of the designed sequences database, we realized that it was necessary to put constraints on what fraction of the native sequence should be allowed to change. With mutational constraints, accuracy was improved vs. no constraints, but the diversity of designed sequences, and hence effective size of the database, was moderately reduced. Overall, the best three-state prediction accuracy (Q(3)) that we achieved was nearly a percentage point improved over using a natural sequence database alone, well below the theoretical possibility for improvement of 8-10 percentage points. Furthermore, our nascent method was used to augment the state-of-the-art PSIPRED program by a percentage point.
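
    The three-state accuracy Q(3) used in this abstract is the percentage of residues whose predicted state (helix H, strand E, or coil C) matches the observed state. A minimal Python sketch follows, with made-up ten-residue strings that are not from the study.

```python
def q3(predicted, observed):
    """Three-state accuracy: percent of positions with matching H/E/C state."""
    if len(predicted) != len(observed):
        raise ValueError("sequences must have the same length")
    matches = sum(1 for p, o in zip(predicted, observed) if p == o)
    return 100.0 * matches / len(observed)


# Hypothetical secondary-structure strings for a ten-residue segment.
observed = "HHHHEEEECC"
predicted = "HHHCEEECCC"

score = q3(predicted, observed)
print(f"Q3 = {score:.1f}%")
```

    Two of the ten positions disagree here, so the score is 80%. On this scale, the one-percentage-point gain reported in the abstract corresponds to roughly one additional correct residue per hundred.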

  19. HHomp—prediction and classification of outer membrane proteins

    PubMed Central

    Remmert, Michael; Linke, Dirk; Lupas, Andrei N.; Söding, Johannes

    2009-01-01

    Outer membrane proteins (OMPs) are the transmembrane proteins found in the outer membranes of Gram-negative bacteria, mitochondria and plastids. Most prediction methods have focused on analogous features, such as alternating hydrophobicity patterns. Here, we start from the observation that almost all β-barrel OMPs are related by common ancestry. We identify proteins as OMPs by detecting their homologous relationships to known OMPs using sequence similarity. Given an input sequence, HHomp builds a profile hidden Markov model (HMM) and compares it with an OMP database by pairwise HMM comparison, integrating OMP predictions by PROFtmb. A crucial ingredient is the OMP database, which contains profile HMMs for over 20 000 putative OMP sequences. These were collected with the exhaustive, transitive homology detection method HHsenser, starting from 23 representative OMPs in the PDB database. In a benchmark on TransportDB, HHomp detects 63.5% of the true positives before including the first false positive. This is 70% more than PROFtmb, four times more than BOMP and 10 times more than TMB-Hunt. In Escherichia coli, HHomp identifies 57 out of 59 known OMPs and correctly assigns them to their functional subgroups. HHomp can be accessed at http://toolkit.tuebingen.mpg.de/hhomp. PMID:19429691
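
    The benchmark figure quoted in this abstract, the share of true positives detected before the first false positive, can be computed from a ranked hit list as in the Python sketch below. The ranked labels are hypothetical, and the function name is chosen for this illustration only.

```python
def recall_before_first_fp(ranked_labels, total_positives):
    """Fraction of all true positives ranked above the first false positive.

    ranked_labels: booleans ordered from best to worst score, True for a
    true positive and False for a false positive.
    """
    hits = 0
    for is_true in ranked_labels:
        if not is_true:
            break
        hits += 1
    return hits / total_positives


# Hypothetical ranking: three correct hits, then the first false positive.
ranked = [True, True, True, False, True, False]

fraction = recall_before_first_fp(ranked, total_positives=4)
print(f"{fraction:.0%} of true positives found before the first false positive")
```

    This measure rewards methods that place correct hits at the very top of the list, since every true positive ranked below the first error is ignored.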

  20. HHomp--prediction and classification of outer membrane proteins.

    PubMed

    Remmert, Michael; Linke, Dirk; Lupas, Andrei N; Söding, Johannes

    2009-07-01

    Outer membrane proteins (OMPs) are the transmembrane proteins found in the outer membranes of Gram-negative bacteria, mitochondria and plastids. Most prediction methods have focused on analogous features, such as alternating hydrophobicity patterns. Here, we start from the observation that almost all beta-barrel OMPs are related by common ancestry. We identify proteins as OMPs by detecting their homologous relationships to known OMPs using sequence similarity. Given an input sequence, HHomp builds a profile hidden Markov model (HMM) and compares it with an OMP database by pairwise HMM comparison, integrating OMP predictions by PROFtmb. A crucial ingredient is the OMP database, which contains profile HMMs for over 20,000 putative OMP sequences. These were collected with the exhaustive, transitive homology detection method HHsenser, starting from 23 representative OMPs in the PDB database. In a benchmark on TransportDB, HHomp detects 63.5% of the true positives before including the first false positive. This is 70% more than PROFtmb, four times more than BOMP and 10 times more than TMB-Hunt. In Escherichia coli, HHomp identifies 57 out of 59 known OMPs and correctly assigns them to their functional subgroups. HHomp can be accessed at http://toolkit.tuebingen.mpg.de/hhomp.

  1. Characterization and Functionality of Corn Germ Proteins

    Technology Transfer Automated Retrieval System (TEKTRAN)

    This study was conducted to evaluate the functional properties of protein extracted from wet-milled corn germ and identify potential applications of the recovered protein. Corn germ comprises 12% of the total weight of normal dent corn and about 29% of the corn protein (moisture-free and oil-free ...

  2. Qualitative and Quantitative Protein Complex Prediction Through Proteome-Wide Simulations.

    PubMed

    Rizzetto, Simone; Priami, Corrado; Csikász-Nagy, Attila

    2015-10-01

    Despite recent progress in proteomics most protein complexes are still unknown. Identification of these complexes will help us understand cellular regulatory mechanisms and support development of new drugs. Therefore it is really important to establish detailed information about the composition and the abundance of protein complexes but existing algorithms can only give qualitative predictions. Herein, we propose a new approach based on stochastic simulations of protein complex formation that integrates multi-source data--such as protein abundances, domain-domain interactions and functional annotations--to predict alternative forms of protein complexes together with their abundances. This method, called SiComPre (Simulation based Complex Prediction), achieves better qualitative prediction of yeast and human protein complexes than existing methods and is the first to predict protein complex abundances. Furthermore, we show that SiComPre can be used to predict complexome changes upon drug treatment with the example of bortezomib. SiComPre is the first method to produce quantitative predictions on the abundance of molecular complexes while performing the best qualitative predictions. With new data on tissue specific protein complexes becoming available SiComPre will be able to predict qualitative and quantitative differences in the complexome in various tissue types and under various conditions.

  3. Prediction of protein-protein interactions in dengue virus coat proteins guided by low resolution cryoEM structures

    PubMed Central

    2010-01-01

    Background Dengue virus, along with the other members of the Flaviviridae family, has reemerged as a deadly human pathogen. Understanding the mechanistic details of these infections can be highly rewarding in developing effective antivirals. During maturation of the virus inside the host cell, the coat proteins E and M undergo conformational changes, altering the morphology of the viral coat. However, due to the low-resolution nature of the available 3-D structures of viral assemblies, the atomic details of these changes are still elusive. Results In the present analysis, starting from Cα positions of low-resolution cryo-electron microscopic structures, the residue-level details of protein-protein interaction interfaces of dengue virus coat proteins have been predicted. By comparing the preexisting structures of the virus in different phases of its life cycle, the changes taking place in these predicted protein-protein interaction interfaces were followed as a function of the maturation process of the virus. Besides changing the current notion about the presence of only homodimers in the mature viral coat, the present analysis indicated the presence of a proline-rich motif at the protein-protein interaction interface of the coat protein. Investigating the conservation status of these seemingly functionally crucial residues across other members of the Flaviviridae family enabled dissecting common mechanisms used for infection by these viruses. Conclusions Thus, using a computational approach, the present analysis has provided better insights into the preexisting low-resolution structures of virus assemblies, the findings of which can be used in designing effective antivirals against these deadly human pathogens. PMID:20550721

  4. Bayesian inference for genomic data integration reduces misclassification rate in predicting protein-protein interactions.

    PubMed

    Xing, Chuanhua; Dunson, David B

    2011-07-01

    Protein-protein interactions (PPIs) are essential to most fundamental cellular processes. There has been increasing interest in reconstructing PPI networks. However, several critical difficulties exist in obtaining reliable predictions. Notably, false positive rates can exceed 80%. Error correction from each generating source can be both time-consuming and inefficient due to the difficulty of covering the errors from multiple levels of data processing procedures within a single test. We propose a novel Bayesian integration method, termed nonparametric Bayes ensemble learning (NBEL), to lower the misclassification rate (both false positives and negatives) through automatically up-weighting data sources that are most informative, while down-weighting less informative and biased sources. Extensive studies indicate that NBEL is significantly more robust than the classic naïve Bayes to unreliable, error-prone and contaminated data. On a large human data set our NBEL approach predicts many more PPIs than naïve Bayes. This suggests that previous studies may have large numbers of not only false positives but also false negatives. The validation on two high-quality human PPI datasets supports our observations. Our experiments demonstrate that it is feasible to predict high-throughput PPIs computationally with substantially reduced false positives and false negatives. The ability to predict large numbers of PPIs both reliably and automatically may inspire people to use computational approaches to correct data errors in general, and may speed up high-quality PPI prediction. Such reliable prediction may provide a solid platform for other studies such as protein function prediction and the roles of PPIs in disease susceptibility.

  5. A yeast functional screen predicts new candidate ALS disease genes

    PubMed Central

    Couthouis, Julien; Hart, Michael P.; Shorter, James; DeJesus-Hernandez, Mariely; Erion, Renske; Oristano, Rachel; Liu, Annie X.; Ramos, Daniel; Jethava, Niti; Hosangadi, Divya; Epstein, James; Chiang, Ashley; Diaz, Zamia; Nakaya, Tadashi; Ibrahim, Fadia; Kim, Hyung-Jun; Solski, Jennifer A.; Williams, Kelly L.; Mojsilovic-Petrovic, Jelena; Ingre, Caroline; Boylan, Kevin; Graff-Radford, Neill R.; Dickson, Dennis W.; Clay-Falcone, Dana; Elman, Lauren; McCluskey, Leo; Greene, Robert; Kalb, Robert G.; Lee, Virginia M.-Y.; Trojanowski, John Q.; Ludolph, Albert; Robberecht, Wim; Andersen, Peter M.; Nicholson, Garth A.; Blair, Ian P.; King, Oliver D.; Bonini, Nancy M.; Van Deerlin, Vivianna; Rademakers, Rosa; Mourelatos, Zissimos; Gitler, Aaron D.

    2011-01-01

    Amyotrophic lateral sclerosis (ALS) is a devastating and universally fatal neurodegenerative disease. Mutations in two related RNA-binding proteins, TDP-43 and FUS, that harbor prion-like domains, cause some forms of ALS. There are at least 213 human proteins harboring RNA recognition motifs, including FUS and TDP-43, raising the possibility that additional RNA-binding proteins might contribute to ALS pathogenesis. We performed a systematic survey of these proteins to find additional candidates similar to TDP-43 and FUS, followed by bioinformatics to predict prion-like domains in a subset of them. We sequenced one of these genes, TAF15, in patients with ALS and identified missense variants, which were absent in a large number of healthy controls. These disease-associated variants of TAF15 caused formation of cytoplasmic foci when expressed in primary cultures of spinal cord neurons. Very similar to TDP-43 and FUS, TAF15 aggregated in vitro and conferred neurodegeneration in Drosophila, with the ALS-linked variants having a more severe effect than wild type. Immunohistochemistry of postmortem spinal cord tissue revealed mislocalization of TAF15 in motor neurons of patients with ALS. We propose that aggregation-prone RNA-binding proteins might contribute very broadly to ALS pathogenesis and the genes identified in our yeast functional screen, coupled with prion-like domain prediction analysis, now provide a powerful resource to facilitate ALS disease gene discovery. PMID:22065782

  6. Predictable convergence in hemoglobin function has unpredictable molecular underpinnings.

    PubMed

    Natarajan, Chandrasekhar; Hoffmann, Federico G; Weber, Roy E; Fago, Angela; Witt, Christopher C; Storz, Jay F

    2016-10-21

    To investigate the predictability of genetic adaptation, we examined the molecular basis of convergence in hemoglobin function in comparisons involving 56 avian taxa that have contrasting altitudinal range limits. Convergent increases in hemoglobin-oxygen affinity were pervasive among high-altitude taxa, but few such changes were attributable to parallel amino acid substitutions at key residues. Thus, predictable changes in biochemical phenotype do not have a predictable molecular basis. Experiments involving resurrected ancestral proteins revealed that historical substitutions have context-dependent effects, indicating that possible adaptive solutions are contingent on prior history. Mutations that produce an adaptive change in one species may represent precluded possibilities in other species because of differences in genetic background.

  7. LOCALIZER: subcellular localization prediction of both plant and effector proteins in the plant cell

    PubMed Central

    Sperschneider, Jana; Catanzariti, Ann-Maree; DeBoer, Kathleen; Petre, Benjamin; Gardiner, Donald M.; Singh, Karam B.; Dodds, Peter N.; Taylor, Jennifer M.

    2017-01-01

    Pathogens secrete effector proteins and many operate inside plant cells to enable infection. Some effectors have been found to enter subcellular compartments by mimicking host targeting sequences. Although many computational methods exist to predict plant protein subcellular localization, they perform poorly for effectors. We introduce LOCALIZER for predicting plant and effector protein localization to chloroplasts, mitochondria, and nuclei. LOCALIZER shows greater prediction accuracy for chloroplast and mitochondrial targeting compared to other methods for 652 plant proteins. For 107 eukaryotic effectors, LOCALIZER outperforms other methods and predicts a previously unrecognized chloroplast transit peptide for the ToxA effector, which we show translocates into tobacco chloroplasts. Secretome-wide predictions and confocal microscopy reveal that rust fungi might have evolved multiple effectors that target chloroplasts or nuclei. LOCALIZER is the first method for predicting effector localisation in plants and is a valuable tool for prioritizing effector candidates for functional investigations. LOCALIZER is available at http://localizer.csiro.au/. PMID:28300209

  8. J domain independent functions of J proteins.

    PubMed

    Ajit Tamadaddi, Chetana; Sahi, Chandan

    2016-07-01

    Heat shock proteins of 40 kDa (Hsp40s), also called J proteins, are obligate partners of Hsp70s. Via their highly conserved and functionally critical J domain, J proteins interact and modulate the activity of their Hsp70 partners. Mutations in the critical residues in the J domain often result in the null phenotype for the J protein in question. However, as more J proteins have been characterized, it is becoming increasingly clear that a significant number of J proteins do not "completely" rely on their J domains to carry out their cellular functions, as previously thought. In some cases, regions outside the highly conserved J domain have become more important making the J domain dispensable for some, if not for all functions of a J protein. This has profound effects on the evolution of such J proteins. Here we present selected examples of J proteins that perform J domain independent functions and discuss this in the context of evolution of J proteins with dispensable J domains and J-like proteins in eukaryotes.

  9. Predictability of gene ontology slim-terms from primary structure information in Embryophyta plant proteins

    PubMed Central

    2013-01-01

    Background Proteins are the key elements on the path from genetic information to the development of life. The roles played by the different proteins are difficult to uncover experimentally as this process involves complex procedures such as genetic modifications, injection of fluorescent proteins, gene knock-out methods and others. The knowledge learned from each protein is usually annotated in databases through different methods such as those proposed by the Gene Ontology (GO) consortium. Different methods have been proposed in order to predict GO terms from primary structure information, but very few are available for large-scale functional annotation of plants, and reported success rates are much lower than those reported by non-plant predictors. This paper explores the predictability of GO annotations on proteins belonging to the Embryophyta group from a set of features extracted solely from their primary amino acid sequence. Results High predictability of several GO terms was found for Molecular Function and Cellular Component. As expected, a lower degree of predictability was found on Biological Process ontology annotations, although a few biological processes were easily predicted. Proteins related to transport and transcription were particularly well predicted from primary structure information. The most discriminant features for prediction were those related to electric charges of the amino-acid sequence and hydropathicity-derived features. Conclusions An analysis of GO-slim term predictability in plants was carried out, in order to determine single categories or groups of functions that are most related to primary structure information. For each highly predictable GO term, the features responsible for that success were identified and discussed. In contrast to most published studies, which focus on a few categories or single ontologies, results in this paper comprise a complete landscape of GO predictability from primary structure encompassing 75 GO

  10. Functions of red cell surface proteins.

    PubMed

    Daniels, G

    2007-11-01

    The external membrane of the red cell contains numerous proteins that either cross the lipid bilayer one or more times or are anchored to it through a lipid tail. Many of these proteins express blood group activity. The functions of some of these proteins are known; in others their function can only be surmised from the protein structure or from limited experimental evidence. They are loosely divided into four categories based on their functions: membrane transporters; adhesion molecules and receptors; enzymes; and structural proteins that link the membrane with the membrane skeleton. Some of the proteins carry out more than one of these functions. Some proteins may complete their major functions during erythropoiesis or may only be important under adverse physiological conditions. Furthermore, some might be evolutionary relics and may no longer have significant functions. Polymorphisms or rare changes in red cell surface proteins are often responsible for blood groups. The biological significance of these polymorphisms or the selective pressures responsible for their stability within populations are mostly not known, although exploitation of the proteins by pathogenic micro-organisms has probably played a major role.

  11. Structure-templated predictions of novel protein interactions from sequence information.

    PubMed

    Betel, Doron; Breitkreuz, Kevin E; Isserlin, Ruth; Dewar-Darch, Danielle; Tyers, Mike; Hogue, Christopher W V

    2007-09-01

    The multitude of functions performed in the cell are largely controlled by a set of carefully orchestrated protein interactions often facilitated by specific binding of conserved domains in the interacting proteins. Interacting domains commonly exhibit distinct binding specificity to short and conserved recognition peptides called binding profiles. Although many conserved domains are known in nature, only a few have well-characterized binding profiles. Here, we describe a novel predictive method known as domain-motif interactions from structural topology (D-MIST) for elucidating the binding profiles of interacting domains. A set of domains and their corresponding binding profiles were derived from extant protein structures and protein interaction data and then used to predict novel protein interactions in yeast. A number of the predicted interactions were verified experimentally, including new interactions of the mitotic exit network, RNA polymerases, nucleotide metabolism enzymes, and the chaperone complex. These results demonstrate that new protein interactions can be predicted exclusively from sequence information.

  12. Executive functions predict conceptual learning of science.

    PubMed

    Rhodes, Sinéad M; Booth, Josephine N; Palmer, Lorna Elise; Blythe, Richard A; Delibegovic, Mirela; Wheate, Nial J

    2016-06-01

    We examined the relationship between executive functions and both factual and conceptual learning of science, specifically chemistry, in early adolescence. Sixty-three pupils in their second year of secondary school (aged 12-13 years) participated. Pupils completed tasks of working memory (Spatial Working Memory), inhibition (Stop-Signal), attention set-shifting (ID/ED), and planning (Stockings of Cambridge), from the CANTAB. They also participated in a chemistry teaching session, practical, and assessment on the topic of acids and alkalis designed specifically for this study. Executive function data were related to (1) the chemistry assessment which included aspects of factual and conceptual learning and (2) a recent school science exam. Correlational analyses between executive functions and both the chemistry assessment and science grades revealed that science achievements were significantly correlated with working memory. Linear regression analysis revealed that visuospatial working memory ability was predictive of chemistry performance. Interestingly, this relationship was observed solely in relation to the conceptual learning condition of the assessment highlighting the role of executive functions in understanding and applying knowledge about what is learned within science teaching.

  13. A Web-Accessible Protein Structure Prediction Pipeline

    DTIC Science & Technology

    2009-06-01

    almost all of these programs, the input is a protein position-specific substitution matrix (PSSM), in addition to the protein sequence itself. We...calculate a single PSSM for each of the input proteins using PSI-BLAST[8] and the NR database. Once the structural properties are predicted using the

  14. Food Protein Functionality--A New Model.

    PubMed

    Foegeding, E Allen

    2015-12-01

    Proteins in foods serve dual roles as nutrients and structural building blocks. The concept of protein functionality has historically been restricted to nonnutritive functions--such as creating emulsions, foams, and gels--but this places sole emphasis on food quality considerations and potentially overlooks modifications that may also alter nutritional quality or allergenicity. A new model is proposed that addresses the function of proteins in foods based on the length scale(s) responsible for the function. Properties such as flavor binding, color, allergenicity, and digestibility are explained based on the structure of individual molecules; placing this functionality at the nano/molecular scale. At the next higher scale, applications in foods involving gelation, emulsification, and foam formation are based on how proteins form secondary structures that are seen at the nano and microlength scales, collectively called the mesoscale. The macroscale structure represents the arrangements of molecules and mesoscale structures in a food. Macroscale properties determine overall product appearance, stability, and texture. The historical approach of comparing among proteins based on forming and stabilizing specific mesoscale structures remains valid but emphasis should be on a common means for structure formation to allow for comparisons across investigations. For applications in food products, protein functionality should start with identification of functional needs across scales. Those needs are then evaluated relative to how processing and other ingredients could alter desired molecular scale properties, or proper formation of mesoscale structures. This allows for a comprehensive approach to achieving the desired function of proteins in foods.

  15. Sucrose Synthase: Expanding Protein Function

    Technology Transfer Automated Retrieval System (TEKTRAN)

    Sucrose synthase (SUS: EC 2.4.1.13), a key enzyme in plant sucrose catabolism, is uniquely able to mobilize sucrose into multiple pathways involved in metabolic, structural, and storage functions. Our research indicates that the biological function of SUS may extend beyond its catalytic activity. Th...

  16. Interrogating noise in protein sequences from the perspective of protein-protein interactions prediction.

    PubMed

    Wang, Yongcui; Ren, Xianwen; Zhang, Chunhua; Deng, Naiyang; Zhang, Xiangsun

    2012-12-21

    The past decades have witnessed extensive efforts to study the relationships among proteins. In particular, sequence-based prediction of protein-protein interactions (PPIs) is fundamentally important in speeding up the process of mapping the interactomes of organisms. High-throughput experimental methodologies have made the PPIs of many model organisms known, which allows us to apply machine learning methods to learn understandable rules from the available PPIs. Under the machine learning framework, composition vectors are usually applied to encode proteins as real-valued vectors. However, the composition vector value might be highly correlated with the distribution of amino acids, i.e., amino acids that are frequently observed in nature tend to have large composition vector values. Thus, a formulation to estimate the noise induced by the background distribution of amino acids may be needed when building representations. Here, we introduce two kinds of denoising composition vectors, which were successfully used in the construction of phylogenetic trees, to eliminate the noise. Surprisingly, when these two denoising composition vectors are validated on Escherichia coli (E. coli), Saccharomyces cerevisiae (S. cerevisiae) and human PPI datasets, the predictive performance is not improved and is even worse than that of non-denoised prediction. These results suggest that what counts as noise in phylogenetic tree construction may be valuable information in PPI prediction.
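
    As an illustration of the encoding idea in the abstract above, the following minimal sketch builds a 20-dimensional amino acid composition vector and subtracts a background frequency vector from it. The function names and the plain subtraction are assumptions made for illustration; they are a simplified stand-in, not the two denoising formulations evaluated by the authors.

```python
from collections import Counter

# The 20 standard amino acids, in a fixed order that defines the vector layout.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition_vector(sequence):
    """Return the amino acid frequency vector of a protein sequence."""
    counts = Counter(ch for ch in sequence.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[aa] / total for aa in AMINO_ACIDS]


def denoised_vector(sequence, background):
    """Subtract a background frequency vector (hypothetical, simplified denoising)."""
    raw = composition_vector(sequence)
    return [r - b for r, b in zip(raw, background)]


def pair_features(seq_a, seq_b):
    """Concatenate the two composition vectors to describe a candidate protein pair."""
    return composition_vector(seq_a) + composition_vector(seq_b)
```

    A pair of sequences encoded this way yields a 40-dimensional real-valued vector that could be passed to any standard classifier.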

  17. Novel 3D bio-macromolecular bilinear descriptors for protein science: Predicting protein structural classes.

    PubMed

    Marrero-Ponce, Yovani; Contreras-Torres, Ernesto; García-Jacas, César R; Barigye, Stephen J; Cubillán, Néstor; Alvarado, Ysaías J

    2015-06-07

    In the present study, we introduce novel 3D protein descriptors based on the bilinear algebraic form in the ℝ^n space on the coulombic matrix. For the calculation of these descriptors, macromolecular vectors belonging to ℝ^n space, whose components represent certain amino acid side-chain properties, were used as weighting schemes. Generalization approaches for the calculation of inter-amino acidic residue spatial distances based on Minkowski metrics are proposed. The simple- and double-stochastic schemes were defined as approaches to normalize the coulombic matrix. The local-fragment indices for both amino acid-types and amino acid-groups are presented in order to permit characterizing fragments of interest in proteins. In addition, with the objective of taking into account specific interactions among amino acids in global or local indices, geometric and topological cut-offs are defined. To assess the utility of global and local indices, a classification model for the prediction of the four major protein structural classes was built with the Linear Discriminant Analysis (LDA) technique. The developed LDA-model correctly classifies 92.6% and 92.7% of the proteins in the training and test sets, respectively. The obtained model showed high values of the generalized square correlation coefficient (GC^2) on both the training and test series. The statistical parameters derived from the internal and external validation procedures demonstrate the robustness, stability and high predictive power of the proposed model. The performance of the LDA-model demonstrates the capability of the proposed indices not only to codify relevant biochemical information related to the structural classes of proteins, but also to yield suitable interpretability. It is anticipated that the current method will benefit the prediction of other protein attributes or functions.
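
    The Minkowski-based inter-residue distances mentioned in the abstract above can be sketched as follows. This is a minimal illustration, assuming each residue is represented by a single 3D coordinate (for example its alpha carbon); the function names are hypothetical and the descriptors themselves are not reproduced here.

```python
def minkowski_distance(a, b, p=2.0):
    """Minkowski distance of order p between two points (p=1 Manhattan, p=2 Euclidean)."""
    if p < 1:
        raise ValueError("p must be >= 1 for a valid Minkowski metric")
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


def distance_matrix(coords, p=2.0):
    """Pairwise residue-residue distance matrix for a list of 3D coordinates."""
    return [[minkowski_distance(a, b, p) for b in coords] for a in coords]
```

    Varying p changes how strongly the largest coordinate difference dominates the distance, which is one way a family of related distance definitions can be generated from the same structure.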

  18. Phylointeractomics reconstructs functional evolution of protein binding

    PubMed Central

    Kappei, Dennis; Scheibe, Marion; Paszkowski-Rogacz, Maciej; Bluhm, Alina; Gossmann, Toni Ingolf; Dietz, Sabrina; Dejung, Mario; Herlyn, Holger; Buchholz, Frank; Mann, Matthias; Butter, Falk

    2017-01-01

    Molecular phylogenomics investigates evolutionary relationships based on genomic data. However, despite genomic sequence conservation, changes in protein interactions can occur relatively rapidly and may cause strong functional diversification. To investigate such functional evolution, we here combine phylogenomics with interaction proteomics. We develop this concept by investigating the molecular evolution of the shelterin complex, which protects telomeres, across 16 vertebrate species from zebrafish to humans covering 450 million years of evolution. Our phylointeractomics screen discovers previously unknown telomere-associated proteins and reveals how homologous proteins undergo functional evolution. For instance, we show that TERF1 evolved as a telomere-binding protein in the common stem lineage of marsupial and placental mammals. Phylointeractomics is a versatile and scalable approach to investigate evolutionary changes in protein function and thus can provide experimental evidence for phylogenomic relationships. PMID:28176777

  19. ENZPRED-enzymatic protein class predicting by machine learning.

    PubMed

    Dave, Kirtan; Panchal, Hetalkumar

    2013-01-01

    Recent years have seen a flood of biological data reach the scientific community. As large amounts of data from genome and other sequencing projects become available, in silico approaches to data collection and prediction have become a priority, and advances in sequencing technologies have driven an exponential rise in the number of newly discovered enzymes. Commonly, the function of such enzymes is determined by experiments that can be time consuming and costly, so new approaches are needed to determine the functions of the proteins these genes encode. The protein parameters that can be used for enzyme/non-enzyme classification include sequence features such as amino acid composition, dipeptide composition, grand average of hydropathicity (GRAVY), and the probabilities of being in an alpha helix, in a beta sheet, or in a turn. We show how large-scale computational analysis can help to address this challenge with the help of Java and a support vector machine (SVM) library. In this paper, a recently developed machine learning algorithm, referred to as the svm library Learning Machine, is used to classify protein sequences into six main classes of enzymes using data downloaded from a public domain database. Comparative studies were performed on different kernel methods available in the SVM library, such as (1) radial basis function (RBF) and (2) polynomial. Results show that the RBF method takes less training time and gives more accurate results than the other kernel methods. The classification accuracy of RBF is also higher than that of various other methods with respect to the available sequence data.
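
    Two of the sequence features listed in the abstract above, GRAVY and dipeptide composition, can be computed as in the sketch below. It assumes the Kyte-Doolittle hydropathy scale, which is the scale conventionally used for GRAVY; the function names are hypothetical, and the original work used Java rather than Python.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}
AMINO_ACIDS = sorted(KYTE_DOOLITTLE)


def gravy(sequence):
    """Grand average of hydropathicity: mean hydropathy over all standard residues."""
    values = [KYTE_DOOLITTLE[aa] for aa in sequence.upper() if aa in KYTE_DOOLITTLE]
    return sum(values) / len(values) if values else 0.0


def dipeptide_composition(sequence):
    """Frequency of each of the 400 ordered residue pairs among adjacent positions."""
    seq = sequence.upper()
    freqs = {a + b: 0.0 for a in AMINO_ACIDS for b in AMINO_ACIDS}
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    valid = [p for p in pairs if p in freqs]  # skip pairs with non-standard residues
    for p in valid:
        freqs[p] += 1.0 / len(valid)
    return freqs
```

    Feature vectors built from such values are the kind of input that an SVM with an RBF or polynomial kernel would then be trained on.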

  20. Architecture and Function of Mechanosensitive Membrane Protein Lattices

    NASA Astrophysics Data System (ADS)

    Kahraman, Osman; Koch, Peter D.; Klug, William S.; Haselwandter, Christoph A.

    2016-01-01

    Experiments have revealed that membrane proteins can form two-dimensional clusters with regular translational and orientational protein arrangements, which may allow cells to modulate protein function. However, the physical mechanisms yielding supramolecular organization and collective function of membrane proteins remain largely unknown. Here we show that bilayer-mediated elastic interactions between membrane proteins can yield regular and distinctive lattice architectures of protein clusters, and may provide a link between lattice architecture and lattice function. Using the mechanosensitive channel of large conductance (MscL) as a model system, we obtain relations between the shape of MscL and the supramolecular architecture of MscL lattices. We predict that the tetrameric and pentameric MscL symmetries observed in previous structural studies yield distinct lattice architectures of MscL clusters and that, in turn, these distinct MscL lattice architectures yield distinct lattice activation barriers. Our results suggest general physical mechanisms linking protein symmetry, the lattice architecture of membrane protein clusters, and the collective function of membrane protein lattices.

  1. Accuracy of functional surfaces on comparatively modeled protein structures

    PubMed Central

    Zhao, Jieling; Dundas, Joe; Kachalo, Sema; Ouyang, Zheng; Liang, Jie

    2012-01-01

    Identification and characterization of protein functional surfaces are important for predicting protein function, understanding enzyme mechanism, and docking small compounds to proteins. As the accumulation of protein sequence information far outpaces that of structures, constructing accurate models of protein functional surfaces and identifying their key elements become increasingly important. A promising approach is to build comparative models from sequences using known structural templates such as those obtained from structural genome projects. Here we assess how well this approach works in modeling binding surfaces. By systematically building three-dimensional comparative models of proteins using Modeller, we determine how well functional surfaces can be accurately reproduced. We use an alpha shape based pocket algorithm to compute all pockets on the modeled structures, and conduct a large-scale computation of similarity measurements (pocket RMSD and fraction of functional atoms captured) for 26,590 modeled enzyme protein structures. Overall, we find that when the sequence fragment of the binding surfaces has more than 45% identity to that of the template protein, the modeled surfaces have on average an RMSD of 0.5 Å, and contain 48% or more of the binding surface atoms, with nearly all of the important atoms in the signatures of binding pockets captured. PMID:21541664
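
    The two similarity measurements named in the abstract above, pocket RMSD and the fraction of functional atoms captured, can be sketched as below. This is a simplified illustration that assumes the two pockets are already superimposed and that atom correspondences are known; the function names and the use of atom identifiers are hypothetical, not the authors' implementation.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two matched, pre-aligned coordinate lists."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))


def fraction_captured(reference_atoms, modeled_atoms):
    """Fraction of reference pocket atoms that also appear in the modeled pocket."""
    reference = set(reference_atoms)
    if not reference:
        return 0.0
    return len(reference & set(modeled_atoms)) / len(reference)
```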

  2. Disorder Prediction Methods, Their Applicability to Different Protein Targets and Their Usefulness for Guiding Experimental Studies

    PubMed Central

    Atkins, Jennifer D.; Boateng, Samuel Y.; Sorensen, Thomas; McGuffin, Liam J.

    2015-01-01

    The role and function of a given protein is dependent on its structure. In recent years, however, numerous studies have highlighted the importance of unstructured, or disordered regions in governing a protein’s function. Disordered proteins have been found to play important roles in pivotal cellular functions, such as DNA binding and signalling cascades. Studying proteins with extended disordered regions is often problematic as they can be challenging to express, purify and crystallise. This means that interpretable experimental data on protein disorder is hard to generate. As a result, predictive computational tools have been developed with the aim of predicting the level and location of disorder within a protein. Currently, over 60 prediction servers exist, utilizing different methods for classifying disorder and different training sets. Here we review several good performing, publicly available prediction methods, comparing their application and discussing how disorder prediction servers can be used to aid the experimental solution of protein structure. The use of disorder prediction methods allows us to adopt a more targeted approach to experimental studies by accurately identifying the boundaries of ordered protein domains so that they may be investigated separately, thereby increasing the likelihood of their successful experimental solution. PMID:26287166

  3. Finding the “Dark Matter” in Human and Yeast Protein Network Prediction and Modelling

    PubMed Central

    Lees, Jon G.; Reid, Adam J.; Yeats, Corin; Clegg, Andrew B.; Sanchez-Jimenez, Francisca; Orengo, Christine

    2010-01-01

    Accurate modelling of biological systems requires a deeper and more complete knowledge about the molecular components and their functional associations than we currently have. Traditionally, new knowledge on protein associations generated by experiments has played a central role in systems modelling, in contrast to generally less trusted bio-computational predictions. However, we will not achieve realistic modelling of complex molecular systems if the current experimental designs lead to biased screenings of real protein networks and leave large, functionally important areas poorly characterised. To assess the likelihood of this, we have built comprehensive network models of the yeast and human proteomes by using a meta-statistical integration of diverse computationally predicted protein association datasets. We have compared these predicted networks against combined experimental datasets from seven biological resources at different levels of statistical significance. These eukaryotic predicted networks resemble all the topological and noise features of the experimentally inferred networks in both species, and we also show that this observation is not due to random behaviour. In addition, the topology of the predicted networks contains information on true protein associations, beyond the constitutive first order binary predictions. We also observe that most of the reliable predicted protein associations are experimentally uncharacterised in our models, constituting the hidden or “dark matter” of networks by analogy to astronomical systems. Some of this dark matter shows enrichment of particular functions and contains key functional elements of protein networks, such as hubs associated with important functional areas like the regulation of Ras protein signal transduction in human cells. Thus, characterising this large and functionally important dark matter, elusive to established experimental designs, may be crucial for modelling biological systems. In any case

  4. Protein function from its emergence to diversity in contemporary proteins

    NASA Astrophysics Data System (ADS)

    Goncearenco, Alexander; Berezovsky, Igor N.

    2015-07-01

    The goal of this work is to learn from nature the rules that govern evolution and the design of protein function. The fundamental laws of physics lie in the foundation of the protein structure and all stages of the protein evolution, determining optimal sizes and shapes at different levels of structural hierarchy. We looked back into the very onset of the protein evolution with a goal to find elementary functions (EFs) that came from the prebiotic world and served as building blocks of the first enzymes. We defined the basic structural and functional units of biochemical reactions—elementary functional loops. The diversity of contemporary enzymes can be described via combinations of a limited number of elementary chemical reactions, many of which are performed by the descendants of primitive prebiotic peptides/proteins. By analyzing protein sequences we were able to identify EFs shared by seemingly unrelated protein superfamilies and folds and to unravel evolutionary relations between them. Binding and metabolic processing of the metal- and nucleotide-containing cofactors and ligands are among the most abundant ancient EFs that became indispensable in many natural enzymes. Highly designable folds provide structural scaffolds for many different biochemical reactions. We show that contemporary proteins are built from a limited number of EFs, making their analysis instrumental for establishing the rules for protein design. Evolutionary studies help us to accumulate the library of essential EFs and to establish intricate relations between different folds and functional superfamilies. Generalized sequence-structure descriptors of the EF will become useful in future design and engineering of desired enzymatic functions.

  5. Protein function from its emergence to diversity in contemporary proteins.

    PubMed

    Goncearenco, Alexander; Berezovsky, Igor N

    2015-06-09

    The goal of this work is to learn from nature the rules that govern evolution and the design of protein function. The fundamental laws of physics lie in the foundation of the protein structure and all stages of the protein evolution, determining optimal sizes and shapes at different levels of structural hierarchy. We looked back into the very onset of the protein evolution with a goal to find elementary functions (EFs) that came from the prebiotic world and served as building blocks of the first enzymes. We defined the basic structural and functional units of biochemical reactions-elementary functional loops. The diversity of contemporary enzymes can be described via combinations of a limited number of elementary chemical reactions, many of which are performed by the descendants of primitive prebiotic peptides/proteins. By analyzing protein sequences we were able to identify EFs shared by seemingly unrelated protein superfamilies and folds and to unravel evolutionary relations between them. Binding and metabolic processing of the metal- and nucleotide-containing cofactors and ligands are among the most abundant ancient EFs that became indispensable in many natural enzymes. Highly designable folds provide structural scaffolds for many different biochemical reactions. We show that contemporary proteins are built from a limited number of EFs, making their analysis instrumental for establishing the rules for protein design. Evolutionary studies help us to accumulate the library of essential EFs and to establish intricate relations between different folds and functional superfamilies. Generalized sequence-structure descriptors of the EF will become useful in future design and engineering of desired enzymatic functions.

  6. Predicting Human Protein Subcellular Locations by the Ensemble of Multiple Predictors via Protein-Protein Interaction Network with Edge Clustering Coefficients

    PubMed Central

    Du, Pufeng; Wang, Lusheng

    2014-01-01

    One of the fundamental tasks in biology is to identify the functions of all proteins to reveal the primary machinery of a cell. Knowledge of the subcellular locations of proteins will provide key hints to reveal their functions and to understand the intricate pathways that regulate biological processes at the cellular level. Protein subcellular location prediction has been extensively studied in the past two decades. Many methods have been developed based on protein primary sequences as well as on protein-protein interaction networks. In this paper, we propose to use the protein-protein interaction network as an infrastructure to integrate existing sequence-based predictors. When predicting the subcellular locations of a given protein, not only the protein itself but also all of its interacting partners are considered. Unlike existing methods, our method requires neither comprehensive knowledge of the protein-protein interaction network nor experimentally annotated subcellular locations for most proteins in the network. In addition, our method can be used as a framework to integrate multiple predictors. Our method achieved an absolute-true rate of 56% on the human proteome, which is higher than that of state-of-the-art methods. PMID:24466278

  7. Genome-scale phylogenetic function annotation of large and diverse protein families.

    PubMed

    Engelhardt, Barbara E; Jordan, Michael I; Srouji, John R; Brenner, Steven E

    2011-11-01

    The Statistical Inference of Function Through Evolutionary Relationships (SIFTER) framework uses a statistical graphical model that applies phylogenetic principles to automate precise protein function prediction. Here we present a revised approach (SIFTER version 2.0) that enables annotations on a genomic scale. SIFTER 2.0 produces equivalently precise predictions compared to the earlier version on a carefully studied family and on a collection of 100 protein families. We have added an approximation method to SIFTER 2.0 and show a 500-fold improvement in speed with minimal impact on prediction results in the functionally diverse sulfotransferase protein family. On the Nudix protein family, previously inaccessible to the SIFTER framework because of the 66 possible molecular functions, SIFTER achieved 47.4% accuracy on experimental data (where BLAST achieved 34.0%). Finally, we used SIFTER to annotate all of the Schizosaccharomyces pombe proteins with experimental functional characterizations, based on annotations from proteins in 46 fungal genomes. SIFTER precisely predicted molecular function for 45.5% of the characterized proteins in this genome, as compared with four current function prediction methods that precisely predicted function for 62.6%, 30.6%, 6.0%, and 5.7% of these proteins. We use both precision-recall curves and ROC analyses to compare these genome-scale predictions across the different methods and to assess performance on different types of applications. SIFTER 2.0 is capable of predicting protein molecular function for large and functionally diverse protein families using an approximate statistical model, enabling phylogenetics-based protein function prediction for genome-wide analyses. The code for SIFTER and protein family data are available at http://sifter.berkeley.edu.

  8. Comparison of Algorithms for Prediction of Protein Structural Features from Evolutionary Data.

    PubMed

    Bywater, Robert P

    2016-01-01

    Proteins have many functions and predicting these is still one of the major challenges in theoretical biophysics and bioinformatics. Foremost amongst these functions is the need to fold correctly thereby allowing the other genetically dictated tasks that the protein has to carry out to proceed efficiently. In this work, some earlier algorithms for predicting protein domain folds are revisited and they are compared with more recently developed methods. In dealing with intractable problems such as fold prediction, when different algorithms show convergence onto the same result there is every reason to take all algorithms into account such that a consensus result can be arrived at. In this work it is shown that the application of different algorithms in protein structure prediction leads to results that do not converge as such but rather they collude in a striking and useful way that has never been considered before.

  9. Prediction of Protein Structure Using Surface Accessibility Data

    PubMed Central

    Hartlmüller, Christoph; Göbl, Christoph

    2016-01-01

    An approach to the de novo structure prediction of proteins is described that relies on surface accessibility data from NMR paramagnetic relaxation enhancements by a soluble paramagnetic compound (sPRE). This method exploits the distance‐to‐surface information encoded in the sPRE data in the chemical shift‐based CS‐Rosetta de novo structure prediction framework to generate reliable structural models. For several proteins, it is demonstrated that surface accessibility data is an excellent measure of the correct protein fold in the early stages of the computational folding algorithm and significantly improves accuracy and convergence of the standard Rosetta structure prediction approach. PMID:27560616

  10. Proteins: sequence to structure and function--current status.

    PubMed

    Shenoy, Sandhya R; Jayaram, B

    2010-11-01

    In an era that has been dominated by Structural Biology for the last 30-40 years, a dramatic change of focus towards sequence analysis has spurred the advent of the genome projects and the resultant diverging sequence/structure deficit. The central challenge of Computational Structural Biology is therefore to rationalize the mass of sequence information into biochemical and biophysical knowledge and to decipher the structural, functional and evolutionary clues encoded in the language of biological sequences. In investigating the meaning of sequences, two distinct analytical themes have emerged: in the first approach, pattern recognition techniques are used to detect similarity between sequences and hence to infer related structures and functions; in the second ab initio prediction methods are used to deduce 3D structure, and ultimately to infer function, directly from the linear sequence. In this article, we attempt to provide a critical assessment of what one may and may not expect from the biological sequences and to identify major issues yet to be resolved. The presentation is organized under several subtitles like protein sequences, pattern recognition techniques, protein tertiary structure prediction, membrane protein bioinformatics, human proteome, protein-protein interactions, metabolic networks, potential drug targets based on simple sequence properties, disordered proteins, the sequence-structure relationship and chemical logic of protein sequences.

  11. The APSES family proteins in fungi: Characterizations, evolution and functions.

    PubMed

    Zhao, Yong; Su, Hao; Zhou, Jing; Feng, Huihua; Zhang, Ke-Qin; Yang, Jinkui

    2015-08-01

    The APSES protein family belongs to the basic helix-loop-helix (bHLH) class of transcription factors. The group is named after its originally described members (APSES: Asm1p, Phd1p, Sok2p, Efg1p and StuAp), which have been identified as key regulators of fungal development and other biological processes. APSES proteins share a highly conserved DNA-binding domain (APSES domain) of about 100 amino acids, whose central domain is predicted to form a typical bHLH structure. Besides the APSES domain, several APSES proteins also contain additional domains, such as KilA-N and ankyrin repeats. In recent years, an increasing number of APSES proteins have been identified from diverse fungi, and they are involved in numerous biological processes, such as sporulation, cellular differentiation, mycelial growth, secondary metabolism and virulence. Most fungi, including Aspergillus fumigatus, Aspergillus nidulans, Candida albicans, Fusarium graminearum, and Neurospora crassa, contain five APSES proteins. However, Cryptococcus neoformans contains only two APSES proteins, and Saccharomyces cerevisiae contains six. Phylogenetic analysis showed that the APSES domains from different fungi group into four clades (A, B, C and D), which is consistent with the result of homologous alignment of APSES domains using DNAman. The roles of APSES proteins in clade C have been studied in detail, while little is known about the roles of other APSES proteins in clades A, B and D. In this review, the biochemical properties and functional domains of APSES proteins are predicted and compared, and the phylogenetic relationships among APSES proteins from various fungi are analyzed based on the APSES domains. Moreover, the functions of APSES proteins in different fungi are summarized and discussed.

  12. Determining protein function and interaction from genome analysis

    DOEpatents

    Eisenberg, David; Marcotte, Edward M.; Thompson, Michael J.; Pellegrini, Matteo; Yeates, Todd O.

    2004-08-03

    A computational method, system, and computer program are provided for inferring functional links from genome sequences. One method is based on the observation that some pairs of proteins A' and B' have homologs in another organism fused into a single protein chain AB. A trans-genome comparison of sequences can reveal these AB sequences, which are Rosetta Stone sequences because they decipher an interaction between A' and B'. Another method compares the genomic sequences of two or more organisms to create a phylogenetic profile for each protein indicating its presence or absence across all the genomes. The profile provides information regarding functional links between different families of proteins. In yet another method, a combination of the above two methods is used to predict functional links.
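
    The phylogenetic profile method described above can be sketched in a few lines. The sketch assumes that each genome is already summarized as the set of protein families it contains, and it links families whose presence/absence profiles differ in at most a chosen number of genomes; the function names and the Hamming-distance criterion are illustrative assumptions, not the patented procedure.

```python
def phylogenetic_profile(family, genomes):
    """Presence (1) or absence (0) of a protein family across genomes, in sorted genome order."""
    return tuple(1 if family in genomes[name] else 0 for name in sorted(genomes))


def hamming_distance(profile_a, profile_b):
    """Number of genomes in which two profiles disagree."""
    return sum(a != b for a, b in zip(profile_a, profile_b))


def linked_pairs(families, genomes, max_distance=0):
    """Pairs of families with (near-)identical profiles, taken as candidate functional links."""
    profiles = {f: phylogenetic_profile(f, genomes) for f in families}
    names = sorted(families)
    links = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if hamming_distance(profiles[a], profiles[b]) <= max_distance:
                links.append((a, b))
    return links
```

    For example, with genomes `{"g1": {"A", "B"}, "g2": {"C"}, "g3": {"A", "B", "C"}}`, families A and B share the profile (1, 0, 1) and are reported as linked, while C is not.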

  13. QuaBingo: A Prediction System for Protein Quaternary Structure Attributes Using Block Composition

    PubMed Central

    Tung, Chi-Hua; Chen, Chi-Wei; Guo, Ren-Chao; Ng, Hui-Fuang

    2016-01-01

    Background. Quaternary structures of proteins are closely related to gene regulation, signal transduction, and many other biological functions of proteins. In the current study, a new method for feature extraction based on protein-conserved motif composition in block format, termed block composition, is proposed. Results. The protein quaternary assembly state prediction system, called QuaBingo, combines blocks with functional domain composition and is built from three layers of classifiers that can categorize the quaternary structural attributes of monomer, homooligomer, and heterooligomer. The first-layer classifier uses support vector machines (SVM) based on blocks and functional domains of proteins, and a second-layer SVM processes the outputs of the first layer. Finally, the result is determined by the Random Forest of the third layer. We compared the effectiveness of combining block composition, functional domain composition, and pseudoamino acid composition in the model. Across 11 kinds of functional protein families, QuaBingo's Matthews Correlation Coefficient (MCC) is 23% higher than that of the existing prediction system. The results also revealed the biological characterization of the top five block compositions. Conclusions. QuaBingo provides better predictive ability for predicting the quaternary structural attributes of proteins. PMID:27610389
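
    For reference, the Matthews Correlation Coefficient used as the evaluation metric above can be computed from a binary confusion matrix as in the sketch below. The abstract concerns a three-class problem, for which a multi-class generalization or per-class scoring would be needed; the binary form and the function name `mcc` are shown here only as an illustrative assumption.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient from binary confusion-matrix counts.

    Returns a value in [-1, 1]; 0.0 is returned by convention when any
    marginal total is zero and the coefficient is undefined.
    """
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator
```

    Unlike plain accuracy, this coefficient stays near zero for a classifier that ignores the input, even when the classes are strongly imbalanced.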

  14. Structural determinants of TRIM protein function.

    PubMed

    Esposito, Diego; Koliopoulos, Marios G; Rittinger, Katrin

    2017-02-08

    Tripartite motif (TRIM) proteins constitute one of the largest subfamilies of Really Interesting New Gene (RING) E3 ubiquitin ligases and contribute to the regulation of numerous cellular activities, including innate immune responses. The conserved TRIM harbours a RING domain that imparts E3 ligase activity to TRIM family proteins, whilst a variable C-terminal region can mediate recognition of substrate proteins. The knowledge of the structure of these multidomain proteins and the functional interplay between their constituent domains is paramount to understanding their cellular roles. To date, available structural information on TRIM proteins is still largely restricted to subdomains of many TRIMs in isolation. Nevertheless, applying a combination of structural, biophysical and biochemical approaches has recently allowed important progress to be made towards providing a better understanding of the molecular features that underlie the function of TRIM family proteins and has uncovered an unexpected diversity in the link between self-association and catalytic activity.

  15. A review on protein functionalized carbon nanotubes.

    PubMed

    Nagaraju, Kathyayini; Reddy, Roopa; Reddy, Narendra

    2015-12-18

    Carbon nanotubes (CNTs) have been widely recognized and used for controlled drug delivery and in various other fields due to their unique properties and distinct advantages. Both single-walled carbon nanotubes (SWCNTs) and multiwalled (MWCNTs) carbon nanotubes are used and/or studied for potential applications in medical, energy, textile, composite, and other areas. Since CNTs are chemically inert and are insoluble in water or other organic solvents, they are functionalized or modified to carry payloads or interact with biological molecules. CNTs have been preferably functionalized with proteins because CNTs are predominantly used for medical applications such as delivery of drugs, DNA and genes, and also for biosensing. Extensive studies have been conducted to understand the interactions, cytotoxicity, and potential applications of protein functionalized CNTs but contradicting results have been published on the cytotoxicity of the functionalized CNTs. This paper provides a brief review of CNTs functionalized with proteins, methods used to functionalize the CNTs, and their potential applications.

  16. Protein Structure Prediction Using String Kernels

    DTIC Science & Technology

    2006-03-03

    ...consists of 4352 sequences from SCOP version 1.53 extracted from the Astral database, grouped into families and superfamilies. The dataset is processed

  17. Identifying the singleplex and multiplex proteins based on transductive learning for protein subcellular localization prediction.

    PubMed

    Cao, Junzhe; Liu, Wenqi; He, Jianjun; Gu, Hong

    2013-07-01

    A new method is proposed to identify whether a query protein is singleplex or multiplex for improving the quality of protein subcellular localization prediction. Based on the transductive learning technique, this approach utilizes the information from both the query proteins and known proteins to estimate the subcellular location number of every query protein so that the singleplex and multiplex proteins can be recognized and distinguished. Each query protein is then dealt with by a targeted single-label or multi-label predictor to achieve a high-accuracy prediction result. We assess the performance of the proposed approach by applying it to three groups of protein sequence datasets. Simulation experiments show that the proposed approach can effectively identify the singleplex and multiplex proteins. Through a comparison, the reliability of this method for enhancing the power of predicting protein subcellular localization can also be verified.

  18. PPIevo: protein-protein interaction prediction from PSSM based evolutionary information.

    PubMed

    Zahiri, Javad; Yaghoubi, Omid; Mohammad-Noori, Morteza; Ebrahimpour, Reza; Masoudi-Nejad, Ali

    2013-10-01

    Protein-protein interactions regulate a variety of cellular processes. There is a great need for computational methods as a complement to experimental methods with which to predict protein interactions due to the existence of many limitations involved in experimental techniques. Here, we introduce a novel evolutionary based feature extraction algorithm for protein-protein interaction (PPI) prediction. The algorithm is called PPIevo and extracts the evolutionary feature from Position-Specific Scoring Matrix (PSSM) of protein with known sequence. The algorithm does not depend on the protein annotations, and the features are based on the evolutionary history of the proteins. This enables the algorithm to have more power for predicting protein-protein interaction than many sequence based algorithms. Results on the HPRD database show better performance and robustness of the proposed method. They also reveal that the negative dataset selection could lead to an acute performance overestimation which is the principal drawback of the available methods.

  19. PredPlantPTS1: A Web Server for the Prediction of Plant Peroxisomal Proteins

    PubMed Central

    Reumann, Sigrun; Buchwald, Daniela; Lingner, Thomas

    2012-01-01

    Prediction of subcellular protein localization is essential to correctly assign unknown proteins to cell organelle-specific protein networks and to ultimately determine protein function. For metazoa, several computational approaches have been developed in the past decade to predict peroxisomal proteins carrying the peroxisome targeting signal type 1 (PTS1). However, plant-specific PTS1 protein prediction methods have been lacking up to now, and pre-existing methods generally were incapable of correctly predicting low-abundance plant proteins possessing non-canonical PTS1 patterns. Recently, we presented a machine learning approach that is able to predict PTS1 proteins for higher plants (spermatophytes) with high accuracy and which can correctly identify unknown targeting patterns, i.e., novel PTS1 tripeptides and tripeptide residues. Here we describe the first plant-specific web server PredPlantPTS1 for the prediction of plant PTS1 proteins using the above-mentioned underlying models. The server allows the submission of protein sequences from diverse spermatophytes and also performs well for mosses and algae. The easy-to-use web interface provides detailed output in terms of (i) the peroxisomal targeting probability of the given sequence, (ii) information whether a particular non-canonical PTS1 tripeptide has already been experimentally verified, and (iii) the prediction scores for the single C-terminal 14 amino acid residues. The latter allows identification of predicted residues that inhibit peroxisome targeting and which can be optimized using site-directed mutagenesis to raise the peroxisome targeting efficiency. The prediction server will be instrumental in identifying low-abundance and stress-inducible peroxisomal proteins and defining the entire peroxisomal proteome of Arabidopsis and agronomically important crop plants. PredPlantPTS1 is freely accessible at ppp.gobics.de. PMID:22969783

  20. A Software Pipeline for Protein Structure Prediction

    DTIC Science & Technology

    2006-11-01

    programs, such as GenTHREADER (Jones 1999), FUGUE (Shi, Blundell et al. 2001), and 3D-PSSM (Kelley, MacCallum et al. 2000), use hybrid approaches...annotation using structural profiles in the program 3D-PSSM, J Mol Biol, 299, 499-520. Kim, D., D. Xu, et al., 2003: PROSPECT II: protein structure

  1. Using the folding landscapes of proteins to understand protein function.

    PubMed

    Giri Rao, V V Hemanth; Gosavi, Shachi

    2016-02-01

    Proteins fold on a biologically-relevant timescale because of a funnel-shaped energy landscape. This landscape is sculpted through evolution by selecting amino-acid sequences that stabilize native interactions while suppressing stable non-native interactions that occur during folding. However, there is strong evolutionary selection for functional residues and these cannot be chosen to optimize folding. Their presence impacts the folding energy landscape in a variety of ways. Here, we survey the effects of functional residues on folding by providing several examples. We then review how such effects can be detected computationally and be used as assays for protein function. Overall, an understanding of how functional residues modulate folding should provide insights into the design of natural proteins and their homeostasis.

  2. Prediction of structural features and application to outer membrane protein identification

    PubMed Central

    Yan, Renxiang; Wang, Xiaofeng; Huang, Lanqing; Yan, Feidi; Xue, Xiaoyu; Cai, Weiwen

    2015-01-01

    Protein three-dimensional (3D) structures provide insightful information in many fields of biology. One-dimensional properties derived from 3D structures such as secondary structure, residue solvent accessibility, residue depth and backbone torsion angles are helpful to protein function prediction, fold recognition and ab initio folding. Here, we predict various structural features with the assistance of neural network learning. Based on an independent test dataset, protein secondary structure prediction generates an overall Q3 accuracy of ~80%. Meanwhile, the prediction of relative solvent accessibility obtains the highest mean absolute error of 0.164, and prediction of residue depth achieves the lowest mean absolute error of 0.062. We further improve the outer membrane protein identification by including the predicted structural features in a scoring function using a simple profile-to-profile alignment. The results demonstrate that the accuracy of outer membrane protein identification can be improved by ~3% at a 1% false positive level when structural features are incorporated. Finally, our methods are available as two convenient and easy-to-use programs. One is PSSM-2-Features for predicting secondary structure, relative solvent accessibility, residue depth and backbone torsion angles, the other is PPA-OMP for identifying outer membrane proteins from proteomes. PMID:26104144
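
    The two evaluation measures quoted above, Q3 accuracy and mean absolute error, can be sketched as follows. The three-state alphabet (H, E, C) is the usual convention for Q3; the sequences and solvent-accessibility values are hypothetical and are not taken from the paper.

```python
def q3_accuracy(predicted, observed):
    """Fraction of residues whose three-state (H/E/C) secondary structure is predicted correctly."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return matches / len(observed)


def mean_absolute_error(predicted, observed):
    """Average absolute difference between predicted and observed real values."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("value lists must be non-empty and of equal length")
    return sum(abs(p - o) for p, o in zip(predicted, observed)) / len(observed)


# Hypothetical ten-residue example with one wrongly predicted state.
observed_ss = "HHHHEEEECC"
predicted_ss = "HHHCEEEECC"
q3 = q3_accuracy(predicted_ss, observed_ss)

# Hypothetical relative solvent accessibility values in [0, 1].
observed_rsa = [0.10, 0.50, 0.90, 0.30]
predicted_rsa = [0.20, 0.40, 0.90, 0.50]
mae = mean_absolute_error(predicted_rsa, observed_rsa)
print(q3, mae)
```

    Higher Q3 is better, whereas lower mean absolute error is better, which is worth keeping in mind when reading the figures reported for the different structural features.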

  3. Prediction of structural features and application to outer membrane protein identification

    NASA Astrophysics Data System (ADS)

    Yan, Renxiang; Wang, Xiaofeng; Huang, Lanqing; Yan, Feidi; Xue, Xiaoyu; Cai, Weiwen

    2015-06-01

    Protein three-dimensional (3D) structures provide insightful information in many fields of biology. One-dimensional properties derived from 3D structures such as secondary structure, residue solvent accessibility, residue depth and backbone torsion angles are helpful to protein function prediction, fold recognition and ab initio folding. Here, we predict various structural features with the assistance of neural network learning. Based on an independent test dataset, protein secondary structure prediction generates an overall Q3 accuracy of ~80%. Meanwhile, the prediction of relative solvent accessibility obtains the highest mean absolute error of 0.164, and prediction of residue depth achieves the lowest mean absolute error of 0.062. We further improve the outer membrane protein identification by including the predicted structural features in a scoring function using a simple profile-to-profile alignment. The results demonstrate that the accuracy of outer membrane protein identification can be improved by ~3% at a 1% false positive level when structural features are incorporated. Finally, our methods are available as two convenient and easy-to-use programs. One is PSSM-2-Features for predicting secondary structure, relative solvent accessibility, residue depth and backbone torsion angles, the other is PPA-OMP for identifying outer membrane proteins from proteomes.

  4. Using compound similarity and functional domain composition for prediction of drug-target interaction networks.

    PubMed

    Chen, Lei; He, Zhi-Song; Huang, Tao; Cai, Yu-Dong

    2010-11-01

    Study of interactions between drugs and target proteins is an essential step in genomic drug discovery. It is very hard to determine the compound-protein interactions or drug-target interactions by experiment alone. As a supplement, effective prediction models using machine learning or data mining methods can provide much help. In this study, a prediction method based on the Nearest Neighbor Algorithm and a novel metric, which was obtained by combining compound similarity and functional domain composition, was proposed. The target proteins were divided into the following groups: enzymes, ion channels, G protein-coupled receptors, and nuclear receptors. As a result, four predictors with the optimal parameters were established. The overall prediction accuracies, evaluated by jackknife cross-validation test, for the four groups of target proteins are 90.23%, 94.74%, 97.80%, and 97.51%, respectively, indicating that compound similarity and functional domain composition are very effective for predicting drug-target interaction networks.
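
    A nearest-neighbour predictor evaluated by a jackknife (leave-one-out) test can be sketched as below. The abstract does not give the actual metric, so the product of two Jaccard similarities used here, the feature sets, and the labels are all hypothetical stand-ins.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def combined_similarity(pair_a, pair_b):
    """Hypothetical metric: compound similarity times domain-composition similarity."""
    (drug_a, target_a), (drug_b, target_b) = pair_a, pair_b
    return jaccard(drug_a, drug_b) * jaccard(target_a, target_b)


def nearest_neighbor_label(query, training):
    """Label of the training sample most similar to the query pair."""
    return max(training, key=lambda item: combined_similarity(query, item[0]))[1]


def jackknife_accuracy(samples):
    """Leave each sample out in turn, predict it from the rest, and report the hit rate."""
    correct = 0
    for i, (pair, label) in enumerate(samples):
        rest = samples[:i] + samples[i + 1:]
        if nearest_neighbor_label(pair, rest) == label:
            correct += 1
    return correct / len(samples)


# Hypothetical samples: ((compound features, target domains), interacts? 1/0).
samples = [
    ((frozenset({"f1", "f2"}), frozenset({"d1"})), 1),
    ((frozenset({"f1", "f2", "f3"}), frozenset({"d1"})), 1),
    ((frozenset({"f4"}), frozenset({"d2"})), 0),
    ((frozenset({"f4", "f5"}), frozenset({"d2"})), 0),
]
accuracy = jackknife_accuracy(samples)
print(accuracy)
```

    The jackknife test uses every sample exactly once as the query, so its result does not depend on a random split, at the cost of one prediction per sample.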

  5. Identification of Extracellular Segments by Mass Spectrometry Improves Topology Prediction of Transmembrane Proteins

    PubMed Central

    Langó, Tamás; Róna, Gergely; Hunyadi-Gulyás, Éva; Turiák, Lilla; Varga, Julia; Dobson, László; Várady, György; Drahos, László; Vértessy, Beáta G.; Medzihradszky, Katalin F.; Szakács, Gergely; Tusnády, Gábor E.

    2017-01-01

    Transmembrane proteins play a crucial role in signaling, ion transport, nutrient uptake, as well as in maintaining the dynamic equilibrium between the internal and external environment of cells. Despite their important biological functions and abundance, less than 2% of all determined structures are transmembrane proteins. Given the persisting technical difficulties associated with high resolution structure determination of transmembrane proteins, additional methods, including computational and experimental techniques, remain vital in promoting our understanding of their topologies, 3D structures, functions and interactions. Here we report a method for the high-throughput determination of extracellular segments of transmembrane proteins based on the identification of surface labeled and biotin captured peptide fragments by LC/MS/MS. We show that reliable identification of extracellular protein segments increases the accuracy and reliability of existing topology prediction algorithms. Using the experimental topology data as constraints, our improved prediction tool provides accurate and reliable topology models for hundreds of human transmembrane proteins. PMID:28211907

  6. Identification of Extracellular Segments by Mass Spectrometry Improves Topology Prediction of Transmembrane Proteins.

    PubMed

    Langó, Tamás; Róna, Gergely; Hunyadi-Gulyás, Éva; Turiák, Lilla; Varga, Julia; Dobson, László; Várady, György; Drahos, László; Vértessy, Beáta G; Medzihradszky, Katalin F; Szakács, Gergely; Tusnády, Gábor E

    2017-02-13

    Transmembrane proteins play a crucial role in signaling, ion transport, nutrient uptake, as well as in maintaining the dynamic equilibrium between the internal and external environment of cells. Despite their important biological functions and abundance, less than 2% of all determined structures are transmembrane proteins. Given the persisting technical difficulties associated with high resolution structure determination of transmembrane proteins, additional methods, including computational and experimental techniques, remain vital in promoting our understanding of their topologies, 3D structures, functions and interactions. Here we report a method for the high-throughput determination of extracellular segments of transmembrane proteins based on the identification of surface labeled and biotin captured peptide fragments by LC/MS/MS. We show that reliable identification of extracellular protein segments increases the accuracy and reliability of existing topology prediction algorithms. Using the experimental topology data as constraints, our improved prediction tool provides accurate and reliable topology models for hundreds of human transmembrane proteins.

  7. HMMpTM: improving transmembrane protein topology prediction using phosphorylation and glycosylation site prediction.

    PubMed

    Tsaousis, Georgios N; Bagos, Pantelis G; Hamodrakas, Stavros J

    2014-02-01

    During the last two decades a large number of computational methods have been developed for predicting transmembrane protein topology. Current predictors rely on topogenic signals in the protein sequence, such as the distribution of positively charged residues in extra-membrane loops and the existence of N-terminal signals. However, phosphorylation and glycosylation are post-translational modifications (PTMs) that occur in a compartment-specific manner and therefore the presence of a phosphorylation or glycosylation site in a transmembrane protein provides topological information. We examine the combination of phosphorylation and glycosylation site prediction with transmembrane protein topology prediction. We report the development of a Hidden Markov Model based method, capable of predicting the topology of transmembrane proteins and the existence of kinase specific phosphorylation and N/O-linked glycosylation sites along the protein sequence. Our method integrates a novel feature in transmembrane protein topology prediction, which results in improved performance for topology prediction and reliable prediction of phosphorylation and glycosylation sites. The method is freely available at http://bioinformatics.biol.uoa.gr/HMMpTM.

  8. DBD-Hunter: a knowledge-based method for the prediction of DNA-protein interactions.

    PubMed

    Gao, Mu; Skolnick, Jeffrey

    2008-07-01

    The structures of DNA-protein complexes have illuminated the diversity of DNA-protein binding mechanisms shown by different protein families. This lack of generality could pose a great challenge for predicting DNA-protein interactions. To address this issue, we have developed a knowledge-based method, DNA-binding Domain Hunter (DBD-Hunter), for identifying DNA-binding proteins and associated binding sites. The method combines structural comparison and the evaluation of a statistical potential, which we derive to describe interactions between DNA base pairs and protein residues. We demonstrate that DBD-Hunter is an accurate method for predicting DNA-binding function of proteins, and that DNA-binding protein residues can be reliably inferred from the corresponding templates if identified. In benchmark tests on approximately 4000 proteins, our method achieved an accuracy of 98% and a precision of 84%, which significantly outperforms three previous methods. We further validate the method on DNA-binding protein structures determined in DNA-free (apo) state. We show that the accuracy of our method is only slightly affected on apo-structures compared to the performance on holo-structures cocrystallized with DNA. Finally, we apply the method to approximately 1700 structural genomics targets and predict that 37 targets with previously unknown function are likely to be DNA-binding proteins. DBD-Hunter is freely available at http://cssb.biology.gatech.edu/skolnick/webservice/DBD-Hunter/.

  9. Computational Prediction of RNA-Binding Proteins and Binding Sites

    PubMed Central

    Si, Jingna; Cui, Jing; Cheng, Jin; Wu, Rongling

    2015-01-01

    Protein–RNA interactions have vital roles in many cellular processes such as protein synthesis, sequence encoding, RNA transfer, and gene regulation at the transcriptional and post-transcriptional levels. Approximately 6%–8% of all proteins are RNA-binding proteins (RBPs). Distinguishing these RBPs or their binding residues is a major aim of structural biology. Previously, a number of experimental methods were developed for the determination of protein–RNA interactions. However, these experimental methods are expensive, time-consuming, and labor-intensive. Alternatively, researchers have developed many computational approaches to predict RBPs and protein–RNA binding sites, by combining various machine learning methods and abundant sequence and/or structural features. There are three kinds of computational approaches: prediction from protein sequence, prediction from protein structure, and protein–RNA docking. In this paper, we review all existing studies of predictions of RNA-binding sites and RBPs and complexes, including data sets used in different approaches, sequence and structural features used in several predictors, prediction method classifications, performance comparisons, evaluation methods, and future directions. PMID:26540053

  10. The N and C Termini of ZO-1 Are Surrounded by Distinct Proteins and Functional Protein Networks*

    PubMed Central

    Van Itallie, Christina M.; Aponte, Angel; Tietgens, Amber Jean; Gucek, Marjan; Fredriksson, Karin; Anderson, James Melvin

    2013-01-01

    The proteins and functional protein networks of the tight junction remain incompletely defined. Among the currently known proteins are barrier-forming proteins like occludin and the claudin family; scaffolding proteins like ZO-1; and some cytoskeletal, signaling, and cell polarity proteins. To define a more complete list of proteins and infer their functional implications, we identified the proteins that are within molecular dimensions of ZO-1 by fusing biotin ligase to either its N or C terminus, expressing these fusion proteins in Madin-Darby canine kidney epithelial cells, and purifying and identifying the resulting biotinylated proteins by mass spectrometry. Of a predicted proteome of ∼9000, we identified more than 400 proteins tagged by biotin ligase fused to ZO-1, with both identical and distinct proteins near the N- and C-terminal ends. Those proximal to the N terminus were enriched in transmembrane tight junction proteins, and those proximal to the C terminus were enriched in cytoskeletal proteins. We also identified many unexpected but easily rationalized proteins and verified partial colocalization of three of these proteins with ZO-1 as examples. In addition, functional networks of interacting proteins were tagged, such as the basolateral but not apical polarity network. These results provide a rich inventory of proteins and potential novel insights into functions and protein networks that should catalyze further understanding of tight junction biology. Unexpectedly, the technique demonstrates high spatial resolution, which could be generally applied to defining other subcellular protein compartmentalization. PMID:23553632

  11. Functional Importance of Mobile Ribosomal Proteins.

    PubMed

    Chang, Kai-Chun; Wen, Jin-Der; Yang, Lee-Wei

    2015-01-01

    Although the dynamic motions and peptidyl transferase activity seem to be embedded in the rRNAs, the ribosome contains more than 50 ribosomal proteins (r-proteins), whose functions remain largely elusive. Also, the precise forms of some of these r-proteins, as being part of the ribosome, are not structurally solved due to their high flexibility, which hinders the efforts in their functional elucidation. Owing to recent advances in cryo-electron microscopy, single-molecule techniques, and theoretical modeling, much has been learned about the dynamics of these r-proteins. Surprisingly, allosteric regulations have been found in between spatially separated components as distant as those in the opposite sides of the ribosome. Here, we focus on the functional roles and intricate regulations of the mobile L1 and L12 stalks and L9 and S1 proteins. Conformational flexibility also enables versatile functions for r-proteins beyond translation. The arrangement of r-proteins may be under evolutionary pressure that fine-tunes mass distributions for optimal structural dynamics and catalytic activity of the ribosome.

  12. Control of protein function through optochemical translocation.

    PubMed

    Engelke, Hanna; Chou, Chungjung; Uprety, Rajendra; Jess, Phillip; Deiters, Alexander

    2014-10-17

    Controlled manipulation of proteins and their function is important in almost all biological disciplines. Here, we demonstrate control of protein activity with light. We present two different applications, light-triggered transcription and light-triggered protease cleavage, both based on the same concept of protein mislocation, followed by optochemically triggered translocation to an active cellular compartment. In our approach, we genetically encode a photocaged lysine into the nuclear localization signal (NLS) of the transcription factor SATB1. This blocks nuclear import of the protein until illumination induces caging group removal and release of the protein into the nucleus. In the first application, prepending this NLS to the transcription factor FOXO3 allows us to optochemically switch on its transcription activity. The second application uses the developed light-activated NLS to control nuclear import of TEV protease and subsequent cleavage of nuclear proteins containing TEV cleavage sites. The small size of the light-controlled NLS (only 20 amino acids) minimizes impact of its insertion on protein function and promises a general approach to a wide range of optochemical applications. Since the light-activated NLS is genetically encoded and optically triggered, it will prove useful to address a variety of problems requiring spatial and temporal control of protein function, for example, in stem-cell, developmental, and cancer biology.

  13. Predicting protein-protein interactions from sequence using correlation coefficient and high-quality interaction dataset.

    PubMed

    Shi, Ming-Guang; Xia, Jun-Feng; Li, Xue-Ling; Huang, De-Shuang

    2010-03-01

    Identifying protein-protein interactions (PPIs) is critical for understanding the cellular function of the proteins and the machinery of a proteome. Data of PPIs derived from high-throughput technologies are often incomplete and noisy. Therefore, it is important to develop computational methods and high-quality interaction dataset for predicting PPIs. A sequence-based method is proposed by combining correlation coefficient (CC) transformation and support vector machine (SVM). CC transformation not only adequately considers the neighboring effect of protein sequence but describes the level of CC between two protein sequences. A gold standard positives (interacting) dataset MIPS Core and a gold standard negatives (non-interacting) dataset GO-NEG of yeast Saccharomyces cerevisiae were mined to objectively evaluate the above method and attenuate the bias. The SVM model combined with CC transformation yielded the best performance with a high accuracy of 87.94% using gold standard positives and gold standard negatives datasets. The source code of MATLAB and the datasets are available on request under smgsmg@mail.ustc.edu.cn.

  14. Engineering Genes for Predictable Protein Expression

    PubMed Central

    Gustafsson, Claes; Minshull, Jeremy; Govindarajan, Sridhar; Ness, Jon; Villalobos, Alan; Welch, Mark

    2013-01-01

    The DNA sequence used to encode a polypeptide can have dramatic effects on its expression. Lack of readily available tools has until recently inhibited meaningful experimental investigation of this phenomenon. Advances in synthetic biology and the application of modern engineering approaches now provide the tools for systematic analysis of the sequence variables affecting heterologous expression of recombinant proteins. We here discuss how these new tools are being applied and how they circumvent the constraints of previous approaches, highlighting some of the surprising and promising results emerging from the developing field of gene engineering. PMID:22425659

  15. Engineering genes for predictable protein expression.

    PubMed

    Gustafsson, Claes; Minshull, Jeremy; Govindarajan, Sridhar; Ness, Jon; Villalobos, Alan; Welch, Mark

    2012-05-01

    The DNA sequence used to encode a polypeptide can have dramatic effects on its expression. Lack of readily available tools has until recently inhibited meaningful experimental investigation of this phenomenon. Advances in synthetic biology and the application of modern engineering approaches now provide the tools for systematic analysis of the sequence variables affecting heterologous expression of recombinant proteins. We here discuss how these new tools are being applied and how they circumvent the constraints of previous approaches, highlighting some of the surprising and promising results emerging from the developing field of gene engineering.

  16. Structure and functional annotation of hypothetical proteins having putative Rubisco activase function from Vitis vinifera.

    PubMed

    Kumar, Suresh

    2015-01-01

    Rubisco is a very large, complex protein and one of the most abundant proteins in the world; it comprises up to 50% of all soluble protein in plants. The activity of Rubisco, the enzyme that catalyzes CO2 assimilation in photosynthesis, is regulated by Rubisco activase (Rca). In the present study, we searched for hypothetical proteins of Vitis vinifera that have putative Rubisco activase function. The Arabidopsis and tobacco Rubisco activase protein sequences were used as seed sequences to search against Vitis vinifera in the UniprotKB database. The selected hypothetical proteins of Vitis vinifera were subjected to sequence, structural and functional annotation. Subcellular localization predictions suggested that they are cytoplasmic proteins. Homology modelling was used to define the three-dimensional (3D) structure of the selected hypothetical proteins of Vitis vinifera. Template search revealed that all the hypothetical proteins share more than 80% sequence identity with the structure of green-type Rubisco activase from tobacco, indicating that the proteins are evolutionarily conserved. The homology models were generated using SWISS-MODEL. Several quality assessment and validation parameters computed indicated that the homology models are reliable. Further, functional annotation through PFAM, CATH, SUPERFAMILY, CDART suggested that the selected hypothetical proteins of Vitis vinifera contain the ATPase family associated with various cellular activities (AAA) and belong to the AAA+ super family of ring-shaped P-loop containing nucleoside triphosphate hydrolases. This study will lead to research in the optimization of the functionality of Rubisco, which has large implications for the improvement of plant productivity and resource use efficiency.

  17. Fast and accurate multivariate Gaussian modeling of protein families: predicting residue contacts and protein-interaction partners.

    PubMed

    Baldassi, Carlo; Zamparo, Marco; Feinauer, Christoph; Procaccini, Andrea; Zecchina, Riccardo; Weigt, Martin; Pagnani, Andrea

    2014-01-01

    In the course of evolution, proteins show a remarkable conservation of their three-dimensional structure and their biological function, leading to strong evolutionary constraints on the sequence variability between homologous proteins. Our method aims at extracting such constraints from rapidly accumulating sequence data, and thereby at inferring protein structure and function from sequence information alone. Recently, global statistical inference methods (e.g. direct-coupling analysis, sparse inverse covariance estimation) have achieved a breakthrough towards this aim, and their predictions have been successfully implemented into tertiary and quaternary protein structure prediction methods. However, due to the discrete nature of the underlying variable (amino-acids), exact inference requires exponential time in the protein length, and efficient approximations are needed for practical applicability. Here we propose a very efficient multivariate Gaussian modeling approach as a variant of direct-coupling analysis: the discrete amino-acid variables are replaced by continuous Gaussian random variables. The resulting statistical inference problem is efficiently and exactly solvable. We show that the quality of inference is comparable or superior to the one achieved by mean-field approximations to inference with discrete variables, as done by direct-coupling analysis. This is true for (i) the prediction of residue-residue contacts in proteins, and (ii) the identification of protein-protein interaction partner in bacterial signal transduction. An implementation of our multivariate Gaussian approach is available at the website http://areeweb.polito.it/ricerca/cmp/code.
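
    The reason a Gaussian model is exactly solvable can be made concrete with the standard multivariate Gaussian density and its maximum-likelihood estimates, written here for M aligned sequences encoded as real vectors of dimension N. This is textbook material and only a sketch: the abstract does not state the encoding, any regularisation or prior, or the scoring of contacts used by the authors.

```latex
\[
P(\mathbf{x}) = \frac{1}{\sqrt{(2\pi)^{N}\det C}}
\exp\!\left(-\tfrac{1}{2}\,(\mathbf{x}-\boldsymbol{\mu})^{\top} C^{-1} (\mathbf{x}-\boldsymbol{\mu})\right),
\]
\[
\hat{\boldsymbol{\mu}} = \frac{1}{M}\sum_{m=1}^{M}\mathbf{x}^{(m)},
\qquad
\hat{C} = \frac{1}{M}\sum_{m=1}^{M}
\left(\mathbf{x}^{(m)}-\hat{\boldsymbol{\mu}}\right)
\left(\mathbf{x}^{(m)}-\hat{\boldsymbol{\mu}}\right)^{\top}.
\]
```

    When the estimated covariance is invertible, its inverse follows from a single matrix inversion, and the off-diagonal entries of that inverse describe the direct dependence between pairs of variables. No sum over all discrete sequences is needed, which is the step that makes exact inference with discrete amino-acid variables exponentially expensive.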

  18. A Prediction Model for Membrane Proteins Using Moments Based Features

    PubMed Central

    Butt, Ahmad Hassan; Khan, Sher Afzal; Jamil, Hamza; Rasool, Nouman; Khan, Yaser Daanial

    2016-01-01

    The most expedient unit of the human body is its cell. Encapsulated within the cell are many infinitesimal entities and molecules which are protected by a cell membrane. The proteins that are associated with this lipid-based bilayer cell membrane are known as membrane proteins and are considered to play a significant role. These membrane proteins exert their effects on cellular activities inside and outside of the cell. According to scientists in pharmaceutical organizations, these membrane proteins perform key tasks in drug interactions. In this study, a technique is presented that is based on various computationally intelligent methods used for the prediction of membrane proteins without the experimental use of mass spectrometry. Statistical moments were used to extract features, and a Multilayer Neural Network was then trained using backpropagation for the prediction of membrane proteins. Results show that the proposed technique performs better than existing methodologies. PMID:26966690

  19. Resting state functional connectivity predicts neurofeedback response

    PubMed Central

    Scheinost, Dustin; Stoica, Teodora; Wasylink, Suzanne; Gruner, Patricia; Saksa, John; Pittenger, Christopher; Hampson, Michelle

    2014-01-01

    Tailoring treatments to the specific needs and biology of individual patients—personalized medicine—requires delineation of reliable predictors of response. Unfortunately, these have been slow to emerge, especially in neuropsychiatric disorders. We have recently described a real-time functional magnetic resonance imaging (rt-fMRI) neurofeedback protocol that can reduce contamination-related anxiety, a prominent symptom of many cases of obsessive-compulsive disorder (OCD). Individual response to this intervention is variable. Here we used patterns of brain functional connectivity, as measured by baseline resting-state fMRI (rs-fMRI), to predict improvements in contamination anxiety after neurofeedback training. Activity of a region of the orbitofrontal cortex (OFC) and anterior prefrontal cortex, Brodmann area (BA) 10, associated with contamination anxiety in each subject was measured in real time and presented as a neurofeedback signal, permitting subjects to learn to modulate this target brain region. We have previously reported both enhanced OFC/BA 10 control and improved anxiety in a group of subclinically anxious subjects after neurofeedback. Five individuals with contamination-related OCD who underwent the same protocol also showed improved clinical symptomatology. In both groups, these behavioral improvements were strongly correlated with baseline whole-brain connectivity in the OFC/BA 10, computed from rs-fMRI collected several days prior to neurofeedback training. These pilot data suggest that rs-fMRI can be used to identify individuals likely to benefit from rt-fMRI neurofeedback training to control contamination anxiety. PMID:25309375

  20. Protein conformational populations and functionally relevant substates.

    PubMed

    Ramanathan, Arvind; Savol, Andrej; Burger, Virginia; Chennubhotla, Chakra S; Agarwal, Pratul K

    2014-01-21

    Functioning proteins do not remain fixed in a unique structure, but instead they sample a range of conformations facilitated by motions within the protein. Even in the native state, a protein exists as a collection of interconverting conformations driven by thermodynamic fluctuations. Motions on the fast time scale allow a protein to sample conformations in the nearby area of its conformational landscape, while motions on slower time scales give it access to conformations in distal areas of the landscape. Emerging evidence indicates that protein landscapes contain conformational substates with dynamic and structural features that support the designated function of the protein. Nuclear magnetic resonance (NMR) experiments provide information about conformational ensembles of proteins. X-ray crystallography allows researchers to identify the most populated states along the landscape, and computational simulations give atom-level information about the conformational substates of different proteins. This ability to characterize and obtain quantitative information about the conformational substates and the populations of proteins within them is allowing researchers to better understand the relationship between protein structure and dynamics and the mechanisms of protein function. In this Account, we discuss recent developments and challenges in the characterization of functionally relevant conformational populations and substates of proteins. In some enzymes, the sampling of functionally relevant conformational substates is connected to promoting the overall mechanism of catalysis. For example, the conformational landscape of the enzyme dihydrofolate reductase has multiple substates, which facilitate the binding and the release of the cofactor and substrate and catalyze the hydride transfer. 
For the enzyme cyclophilin A, computational simulations reveal that the long time scale conformational fluctuations enable the enzyme to access conformational substates that allow

  1. Distinguishing between biochemical and cellular function: Are there peptide signatures for cellular function of proteins?

    PubMed

    Jain, Shruti; Bhattacharyya, Kausik; Bakshi, Rachit; Narang, Ankita; Brahmachari, Vani

    2017-04-01

    Genome annotation and the identification of gene function depend on conserved biochemical activity. However, in the cell, proteins with the same biochemical function can participate in different cellular pathways and cannot complement one another. Similarly, two proteins of very different biochemical functions can be placed in the same class of cellular function; for example, the classification of a gene as an oncogene or a tumour suppressor gene is not related to its biochemical function, but is related to its cellular function. We have taken an approach to identify peptide signatures for cellular function in proteins with known biochemical function. Using ATPases as a test case, we classified ATPases (2360 proteins) and kinases (517 proteins) from the human genome into different cellular function categories such as transcriptional, replicative, and chromatin remodelling proteins. Using the publicly available tool MEME, we identify peptide signatures shared among the members of a given category but not between cellular functional categories; for example, no motif sharing is seen between chromatin remodelling and transporter ATPases, or between receptor Serine/Threonine Kinases and Receptor Tyrosine Kinases. There are motifs shared within each category with significant E value and high occurrence. This concept of a signature for cellular function was applied to developmental regulators, the polycomb and trithorax proteins, which led to the prediction of the role of INO80, a chromatin remodelling protein, in development. This has been experimentally validated earlier for its role in homeotic gene regulation and its interaction with regulatory complexes like the Polycomb and Trithorax complex. Proteins 2017; 85:682-693. © 2016 Wiley Periodicals, Inc.

  2. Assessing Predicted Contacts for Building Protein Three-Dimensional Models.

    PubMed

    Adhikari, Badri; Bhattacharya, Debswapna; Cao, Renzhi; Cheng, Jianlin

    2017-01-01

    Recent successes of contact-guided protein structure prediction methods have revived interest in solving the long-standing problem of ab initio protein structure prediction. With homology modeling failing for many protein sequences that do not have templates, contact-guided structure prediction has shown promise, and consequently, contact prediction has gained a lot of interest recently. Although a few dozen contact prediction tools are already available as web servers and downloadables, not enough research has been done towards using existing measures like precision and recall to evaluate these contacts with the goal of building three-dimensional models. Moreover, when we do not have a native structure for a set of predicted contacts, the only analysis we can perform is a simple contact map visualization of the predicted contacts. A wider and more rigorous assessment of the predicted contacts is needed in order to build tertiary structure models. This chapter discusses instructions and protocols for using tools and applying techniques in order to assess predicted contacts for building three-dimensional models.
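
    The precision and recall measures mentioned above can be sketched as follows for the case where a native contact set is available. This is a minimal illustration, not the chapter's protocol: the function names, the `(i, j, score)` tuple layout, and the `top_k` cutoff are assumptions made here.

```python
def _top_ranked(predicted, top_k):
    """Return the top_k predicted contacts, highest score first."""
    return sorted(predicted, key=lambda contact: contact[2], reverse=True)[:top_k]


def _count_hits(ranked, native):
    """Count ranked contacts that appear in the native set (order-insensitive)."""
    return sum(1 for i, j, _ in ranked if (min(i, j), max(i, j)) in native)


def contact_precision(predicted, native, top_k):
    """Fraction of the top_k predicted contacts that are native contacts.

    predicted: list of (i, j, score) tuples.
    native: set of (i, j) residue pairs with i < j.
    """
    ranked = _top_ranked(predicted, top_k)
    if not ranked:
        return 0.0
    return _count_hits(ranked, native) / len(ranked)


def contact_recall(predicted, native, top_k):
    """Fraction of native contacts recovered among the top_k predictions."""
    if not native:
        return 0.0
    ranked = _top_ranked(predicted, top_k)
    return _count_hits(ranked, native) / len(native)
```

    Evaluations of this kind are often reported at cutoffs tied to the sequence length L (for example the top L/5 or top L contacts), so `top_k` would typically be derived from L.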

  3. Ligand Similarity Complements Sequence, Physical Interaction, and Co-Expression for Gene Function Prediction

    PubMed Central

    Shoichet, Brian K.; Gillis, Jesse

    2016-01-01

    The expansion of protein-ligand annotation databases has enabled large-scale networking of proteins by ligand similarity. These ligand-based protein networks, which implicitly predict the ability of neighboring proteins to bind related ligands, may complement biologically-oriented gene networks, which are used to predict functional or disease relevance. To quantify the degree to which such ligand-based protein associations might complement functional genomic associations, including sequence similarity, physical protein-protein interactions, co-expression, and disease gene annotations, we calculated a network based on the Similarity Ensemble Approach (SEA: sea.docking.org), where protein neighbors reflect the similarity of their ligands. We also measured the similarity with functional genomic networks over a common set of 1,131 genes, and found that the networks had only small overlaps, which were significant only due to the large scale of the data. Consistent with the view that the networks contain different information, combining them substantially improved Molecular Function prediction within GO (from AUROC~0.63–0.75 for the individual data modalities to AUROC~0.8 in the aggregate). We investigated the boost in guilt-by-association gene function prediction when the networks are combined and describe underlying properties that can be further exploited. PMID:27467773
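
    As a rough illustration of the guilt-by-association scoring and AUROC evaluation referred to above, the sketch below scores each gene by the weighted fraction of its network neighbors that carry a known annotation, then computes AUROC by pairwise comparison. The network layout (a dict of neighbor weights) and the function names are assumptions for illustration; the study's actual networks, cross-validation scheme, and aggregation method are not reproduced here.

```python
def neighbor_vote(network, known_positive):
    """Score each gene by the weighted fraction of its neighbors that are annotated.

    network: dict mapping gene -> dict of {neighbor: edge_weight}.
    known_positive: set of genes already annotated with the function.
    """
    scores = {}
    for gene, neighbors in network.items():
        total = sum(neighbors.values())
        positive = sum(w for g, w in neighbors.items() if g in known_positive)
        scores[gene] = positive / total if total else 0.0
    return scores


def auroc(scores, labels):
    """Area under the ROC curve as the probability that a positive outranks a negative.

    Ties between a positive and a negative count as half a win.
    """
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))
```

    An AUROC of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which gives context for the reported range of roughly 0.63 to 0.8.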

  4. Self-Assembled Materials Made from Functional Recombinant Proteins.

    PubMed

    Jang, Yeongseon; Champion, Julie A

    2016-10-18

    the surface and can also carry protein, small-molecule, or nanoparticle cargo in the vesicle lumen. To create a material with a more complex hierarchical structure, we combined calcium phosphate with zipper fusion proteins containing random coil polypeptides to produce hybrid protein-inorganic supraparticles with high surface area and porous structure. The use of a functional enzyme created supraparticles with the ability to degrade inflammatory cytokines. Our characterization of these protein materials revealed that the molecular interactions are complex because of the large size of the protein building blocks, their folded structures, and the number of potential interactions including hydrophobic interactions, electrostatic interactions, van der Waals forces, and specific affinity-based interactions. It is difficult or even impossible to predict the structures a priori. However, once the basic assembly principles are understood, there is opportunity to tune the material properties, such as size, through control of the self-assembly conditions. Our future efforts on the fundamental side will focus on identifying the phase space of self-assembly of these fusion proteins and additional experimental levers with which to control and tune the resulting materials. On the application side, we are investigating an array of different functional proteins to expand the use of these structures in both therapeutic protein delivery and biocatalysis.

  5. Nano-functionalization of protein microspheres

    NASA Astrophysics Data System (ADS)

    Yoon, Sungkwon; Nichols, William T.

    2014-08-01

    Protein microspheres are promising building blocks for the assembly of complex functional materials. Here we demonstrate a set of three techniques that add functionality to the surface of protein microspheres. In the first technique, a positive surface charge on the protein spheres is deposited by electrostatic adsorption. Negatively charged silica and gold nanoparticle colloids can then electrostatically bind reversibly to the microsphere surface. In the second technique, nanoparticles are covalently anchored to the protein shell using a simple one-pot process. The strong covalent bond between sulfur groups in cysteine in the protein shell irreversibly binds to the gold nanoparticles. In the third technique, surface morphology of the protein microsphere is tuned through hydrodynamic instability at the water-oil interface. This is accomplished through the degree of solubility of the oil phase in water. Taken together these three techniques form a platform to create nano-functionalized protein microspheres, which can then be used as building blocks for the assembly of more complex macroscopic materials.

  6. Geary autocorrelation and DCCA coefficient: Application to predict apoptosis protein subcellular localization via PSSM

    NASA Astrophysics Data System (ADS)

    Liang, Yunyun; Liu, Sanyang; Zhang, Shengli

    2017-02-01

    Apoptosis is a fundamental process controlling normal tissue homeostasis by regulating a balance between cell proliferation and death. Predicting the subcellular location of apoptosis proteins is very helpful for understanding the mechanism of programmed cell death. Prediction of apoptosis protein subcellular location is still a challenging and complicated task, and existing methods are mainly based on protein primary sequences. In this paper, we propose a new position-specific scoring matrix (PSSM)-based model by using the Geary autocorrelation function and the detrended cross-correlation coefficient (DCCA coefficient). Then a 270-dimensional (270D) feature vector is constructed on three widely used datasets: ZD98, ZW225 and CL317, and a support vector machine is adopted as the classifier. The overall prediction accuracies are significantly improved, as assessed by a rigorous jackknife test. The results show that our model offers a reliable and effective PSSM-based tool for prediction of apoptosis protein subcellular localization.
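
    For readers unfamiliar with the Geary autocorrelation descriptor, the sketch below applies its commonly used form to each column of a PSSM: C(d) = [sum over i of (P_i - P_{i+d})^2 / (2(N - d))] / [sum over i of (P_i - mean)^2 / (N - 1)]. This is an assumption about the general technique only; the paper's exact normalization, its choice of lags, the DCCA part, and how the 270 dimensions are assembled are not given in the abstract and are not reproduced here.

```python
def geary_autocorrelation(values, lag):
    """Geary autocorrelation of a numeric series at the given lag.

    Returns 0.0 for a constant series, where the variance term is zero
    (a convention chosen here to avoid division by zero).
    """
    n = len(values)
    if not 0 < lag < n:
        raise ValueError("lag must satisfy 0 < lag < len(values)")
    mean = sum(values) / n
    variance_term = sum((v - mean) ** 2 for v in values) / (n - 1)
    if variance_term == 0:
        return 0.0
    lag_term = sum(
        (values[i] - values[i + lag]) ** 2 for i in range(n - lag)
    ) / (2 * (n - lag))
    return lag_term / variance_term


def pssm_geary_features(pssm, max_lag):
    """Compute Geary features for every PSSM column and every lag 1..max_lag.

    pssm: list of rows, one per residue, each a list of per-amino-acid scores.
    """
    columns = [list(column) for column in zip(*pssm)]
    return [
        geary_autocorrelation(column, lag)
        for column in columns
        for lag in range(1, max_lag + 1)
    ]
```

    With a standard 20-column PSSM, this yields 20 values per lag, so the feature length is fixed regardless of protein length.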

  7. Evolution-Based Functional Decomposition of Proteins.

    PubMed

    Rivoire, Olivier; Reynolds, Kimberly A; Ranganathan, Rama

    2016-06-01

    The essential biological properties of proteins (folding, biochemical activities, and the capacity to adapt) arise from the global pattern of interactions between amino acid residues. The statistical coupling analysis (SCA) is an approach to defining this pattern that involves the study of amino acid coevolution in an ensemble of sequences comprising a protein family. This approach indicates a functional architecture within proteins in which the basic units are coupled networks of amino acids termed sectors. This evolution-based decomposition has potential for new understandings of the structural basis for protein function. To facilitate its usage, we present here the principles and practice of the SCA and introduce new methods for sector analysis in a python-based software package (pySCA). We show that the pattern of amino acid interactions within sectors is linked to the divergence of functional lineages in a multiple sequence alignment, a model for how sector properties might be differentially tuned in members of a protein family. This work provides new tools for studying proteins and for generally testing the concept of sectors as the principal units of function and adaptive variation.

  8. Physiological functions of MTA family of proteins.

    PubMed

    Sen, Nirmalya; Gui, Bin; Kumar, Rakesh

    2014-12-01

    Although the functional significance of the metastatic tumor antigen (MTA) family of chromatin remodeling proteins in the pathobiology of cancer is fairly well recognized, the physiological role of MTA proteins continues to be an understudied research area and is just beginning to be recognized. As in cancer cells, MTA1 also modulates the expression of target genes in normal cells by acting as either a corepressor or a coactivator. In addition, physiological functions of MTA proteins are likely to be influenced by their differential expression, subcellular localization, and regulation by upstream modulators and extracellular signals. This review summarizes our current understanding of the physiological functions of the MTA proteins in model systems. In particular, we highlight recent advances in the role MTA proteins play in the brain, eye, circadian rhythm, mammary gland biology, spermatogenesis, liver, immunomodulation and inflammation, cellular radio-sensitivity, and hematopoiesis and differentiation. Based on the growth of knowledge regarding the exciting new facets of the MTA family of proteins in biology and medicine, we speculate that the next burst of findings in this field may reveal further molecular regulatory insights into non-redundant functions of MTA coregulators in normal physiology as well as in pathological conditions outside cancer.

  9. WeFold: A Coopetition for Protein Structure Prediction

    PubMed Central

    Khoury, George A.; Liwo, Adam; Khatib, Firas; Zhou, Hongyi; Chopra, Gaurav; Bacardit, Jaume; Bortot, Leandro O.; Faccioli, Rodrigo A.; Deng, Xin; He, Yi; Krupa, Pawel; Li, Jilong; Mozolewska, Magdalena A.; Sieradzan, Adam K.; Smadbeck, James; Wirecki, Tomasz; Cooper, Seth; Flatten, Jeff; Xu, Kefan; Baker, David; Cheng, Jianlin; Delbem, Alexandre C. B.; Floudas, Christodoulos A.; Keasar, Chen; Levitt, Michael; Popović, Zoran; Scheraga, Harold A.; Skolnick, Jeffrey; Crivelli, Silvia N.; Players, Foldit

    2014-01-01

    The protein structure prediction problem continues to elude scientists. Despite the introduction of many methods, only modest gains were made over the last decade for certain classes of prediction targets. To address this challenge, a social-media based worldwide collaborative effort, named WeFold, was undertaken by thirteen labs. During the collaboration, the labs were simultaneously competing with each other. Here, we present the first attempt at “coopetition” in scientific research applied to the protein structure prediction and refinement problems. The coopetition was made possible by allowing the participating labs to contribute different components of their protein structure prediction pipelines and create new hybrid pipelines that they tested during CASP10. This manuscript describes both successes and areas needing improvement as identified throughout the first WeFold experiment and discusses the efforts that are underway to advance this initiative. A footprint of all contributions and structures is publicly accessible at http://www.wefold.org. PMID:24677212

  10. 'Unite and conquer': enhanced prediction of protein subcellular localization by integrating multiple specialized tools

    PubMed Central

    Shen, Yao Qing; Burger, Gertraud

    2007-01-01

    Background Knowing the subcellular location of proteins provides clues to their function as well as the interconnectivity of biological processes. Dozens of tools are available for predicting protein location in the eukaryotic cell. Each tool performs well on certain data sets, but their predictions often disagree for a given protein. Since the individual tools each have particular strengths, we set out to integrate them in a way that optimally exploits their potential. The method we present here is applicable to various subcellular locations, but tailored for predicting whether or not a protein is localized in mitochondria. Knowledge of the mitochondrial proteome is relevant to understanding the role of this organelle in global cellular processes. Results In order to develop a method for enhanced prediction of subcellular localization, we integrated the outputs of available localization prediction tools by several strategies, and tested the performance of each strategy with known mitochondrial proteins. The accuracy obtained (up to 92%) surpasses by far the individual tools. The method of integration proved crucial to the performance. For the prediction of mitochondrion-located proteins, integration via a two-layer decision tree clearly outperforms simpler methods, as it allows emphasis of biologically relevant features such as the mitochondrial targeting peptide and transmembrane domains. Conclusion We developed an approach that enhances the prediction accuracy of mitochondrial proteins by uniting the strength of specialized tools. The combination of machine-learning based integration with biological expert knowledge leads to improved performance. This approach also alleviates the conundrum of how to choose between conflicting predictions. Our approach is easy to implement, and applicable to predicting subcellular locations other than mitochondria, as well as other biological features. 
For a trial of our approach, we provide a webservice for mitochondrial protein

  11. Predicting multisite protein subcellular locations: progress and challenges.

    PubMed

    Du, Pufeng; Xu, Chao

    2013-06-01

    In the last two decades, predicting protein subcellular locations has become a hot topic in bioinformatics. A number of algorithms and online services have been developed to computationally assign a subcellular location to a given protein sequence. With the progress of many proteome projects, more and more proteins are annotated with more than one subcellular location. However, multisite prediction has only been considered in a handful of recent studies, in which there are several common challenges. In this special report, the authors discuss what these challenges are, why these challenges are important and how the existing studies gave their solutions. Finally, a vision of the future of predicting multisite protein subcellular locations is given.

  12. Computational prediction of the tolerance to amino-acid deletion in green-fluorescent protein

    PubMed Central

    Jackson, Eleisha L.; Spielman, Stephanie J.

    2017-01-01

    Proteins evolve through two primary mechanisms: substitution, where mutations alter a protein’s amino-acid sequence, and insertions and deletions (indels), where amino acids are either added to or removed from the sequence. Protein structure has been shown to influence the rate at which substitutions accumulate across sites in proteins, but whether structure similarly constrains the occurrence of indels has not been rigorously studied. Here, we investigate the extent to which structural properties known to covary with protein evolutionary rates might also predict protein tolerance to indels. Specifically, we analyze a publicly available dataset of single-amino-acid deletion mutations in enhanced green fluorescent protein (eGFP) to assess how well the functional effect of deletions can be predicted from protein structure. We find that weighted contact number (WCN), which measures how densely packed a residue is within the protein’s three-dimensional structure, provides the best single predictor for whether eGFP will tolerate a given deletion. We additionally find that using protein design to explicitly model deletions results in improved predictions of functional status when combined with other structural predictors. Our work suggests that structure plays a fundamental role in constraining deletions at sites in proteins, and further that similar biophysical constraints influence both substitutions and deletions. This study therefore provides a solid foundation for future work to examine how protein structure influences tolerance of more complex indel events, such as insertions or large deletions. PMID:28369116
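
    The weighted contact number referred to above is commonly defined for residue i as the sum over all other residues j of 1 / r_ij^2, where r_ij is the distance between the two residues. The sketch below implements that definition for one representative point per residue. Which point is used (for example the C-alpha atom or the side-chain center of mass) is an assumption left to the caller, since the abstract does not specify it.

```python
def weighted_contact_numbers(coords):
    """Compute WCN_i = sum over j != i of 1 / r_ij**2 for each residue.

    coords: list of (x, y, z) tuples, one representative point per residue.
    Two residues at identical coordinates would cause a ZeroDivisionError.
    """
    wcn = []
    for i, point_i in enumerate(coords):
        total = 0.0
        for j, point_j in enumerate(coords):
            if i == j:
                continue
            squared_distance = sum((a - b) ** 2 for a, b in zip(point_i, point_j))
            total += 1.0 / squared_distance
        wcn.append(total)
    return wcn
```

    For three points spaced one unit apart on a line, the middle point receives the highest value, matching the intuition that WCN is larger for more densely packed residues.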

  13. Prediction of Spontaneous Protein Deamidation from Sequence-Derived Secondary Structure and Intrinsic Disorder

    PubMed Central

    Lorenzo, J. Ramiro; Alonso, Leonardo G.; Sánchez, Ignacio E.

    2015-01-01

    Asparagine residues in proteins undergo spontaneous deamidation, a post-translational modification that may act as a molecular clock for the regulation of protein function and turnover. Asparagine deamidation is modulated by protein local sequence, secondary structure and hydrogen bonding. We present NGOME, an algorithm able to predict non-enzymatic deamidation of internal asparagine residues in proteins in the absence of structural data, using sequence-based predictions of secondary structure and intrinsic disorder. Compared to previous algorithms, NGOME does not require three-dimensional structures yet yields better predictions than available sequence-only methods. Four case studies of specific proteins show how NGOME may help the user identify deamidation-prone asparagine residues, often related to protein gain of function, protein degradation or protein misfolding in pathological processes. A fifth case study applies NGOME at a proteomic scale and unveils a correlation between asparagine deamidation and protein degradation in yeast. NGOME is freely available as a webserver at the National EMBnet node Argentina, URL: http://www.embnet.qb.fcen.uba.ar/ in the subpage “Protein and nucleic acid structure and sequence analysis”. PMID:26674530

  14. Prediction of Spontaneous Protein Deamidation from Sequence-Derived Secondary Structure and Intrinsic Disorder.

    PubMed

    Lorenzo, J Ramiro; Alonso, Leonardo G; Sánchez, Ignacio E

    2015-01-01

    Asparagine residues in proteins undergo spontaneous deamidation, a post-translational modification that may act as a molecular clock for the regulation of protein function and turnover. Asparagine deamidation is modulated by protein local sequence, secondary structure and hydrogen bonding. We present NGOME, an algorithm able to predict non-enzymatic deamidation of internal asparagine residues in proteins in the absence of structural data, using sequence-based predictions of secondary structure and intrinsic disorder. Compared to previous algorithms, NGOME does not require three-dimensional structures yet yields better predictions than available sequence-only methods. Four case studies of specific proteins show how NGOME may help the user identify deamidation-prone asparagine residues, often related to protein gain of function, protein degradation or protein misfolding in pathological processes. A fifth case study applies NGOME at a proteomic scale and unveils a correlation between asparagine deamidation and protein degradation in yeast. NGOME is freely available as a webserver at the National EMBnet node Argentina, URL: http://www.embnet.qb.fcen.uba.ar/ in the subpage "Protein and nucleic acid structure and sequence analysis".

  15. Truly Absorbed Microbial Protein Synthesis, Rumen Bypass Protein, Endogenous Protein, and Total Metabolizable Protein from Starchy and Protein-Rich Raw Materials: Model Comparison and Predictions.

    PubMed

    Parand, Ehsan; Vakili, Alireza; Mesgaran, Mohsen Danesh; van Duinkerken, Gert; Yu, Peiqiang

    2015-07-29

    This study was carried out to measure truly absorbed microbial protein synthesis, rumen bypass protein, and endogenous protein loss, as well as total metabolizable protein, from starchy and protein-rich raw feed materials with model comparisons. Predictions by the DVE2010 system as a more mechanistic model were compared with those of two other models, DVE1994 and NRC-2001, that are frequently used in common international feeding practice. DVE1994 predictions for intestinally digestible rumen undegradable protein (ARUP) for starchy concentrates were higher (27 vs 18 g/kg DM, p < 0.05, SEM = 1.2) than predictions by the NRC-2001, whereas there was no difference in predictions for ARUP from protein concentrates among the three models. DVE2010 and NRC-2001 had highest estimations of intestinally digestible microbial protein for starchy (92 g/kg DM in DVE2010 vs 46 g/kg DM in NRC-2001 and 67 g/kg DM in DVE1994, p < 0.05 SEM = 4) and protein concentrates (69 g/kg DM in NRC-2001 vs 31 g/kg DM in DVE1994 and 49 g/kg DM in DVE2010, p < 0.05 SEM = 4), respectively. Potential protein supplies predicted by tested models from starchy and protein concentrates are widely different, and comparable direct measurements are needed to evaluate the actual ability of different models to predict the potential protein supply to dairy cows from different feedstuffs.

  16. Machine Learning Approaches for Predicting Protein Complex Similarity.

    PubMed

    Farhoodi, Roshanak; Akbal-Delibas, Bahar; Haspel, Nurit

    2017-01-01

    Discriminating native-like structures from false positives with high accuracy is one of the biggest challenges in protein-protein docking. While there is an agreement on the existence of a relationship between various favorable intermolecular interactions (e.g., Van der Waals, electrostatic, and desolvation forces) and the similarity of a conformation to its native structure, the precise nature of this relationship is not known. Existing protein-protein docking methods typically formulate this relationship as a weighted sum of selected terms and calibrate their weights by using a training set to evaluate and rank candidate complexes. Despite improvements in the predictive power of recent docking methods, producing a large number of false positives by even state-of-the-art methods often leads to failure in predicting the correct binding of many complexes. With the aid of machine learning methods, we tested several approaches that not only rank candidate structures relative to each other but also predict how similar each candidate is to the native conformation. We trained a two-layer neural network, a multilayer neural network, and a network of Restricted Boltzmann Machines against extensive data sets of unbound complexes generated by RosettaDock and PyDock. We validated these methods with a set of refinement candidate structures. We were able to predict the root mean squared deviations (RMSDs) of protein complexes with a very small, often less than 1.5 Å, error margin when trained with structures that have RMSD values of up to 7 Å. In our most recent experiments with the protein samples having RMSD values up to 27 Å, the average prediction error was still relatively small, attesting to the potential of our approach in predicting the correct binding of protein-protein complexes.
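
    The quantity the networks are trained to predict, the RMSD between a candidate complex and the native structure, can be sketched as below. This minimal version assumes the two coordinate sets are already superimposed and list the same atoms in the same order; the abstract does not state which atoms or which alignment procedure the authors used.

```python
import math


def rmsd(coords_a, coords_b):
    """Root mean squared deviation between two equally sized coordinate lists.

    coords_a, coords_b: lists of (x, y, z) tuples in matching atom order.
    No superposition is performed; the inputs are assumed to be pre-aligned.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and equal in length")
    squared_sum = sum(
        (a - b) ** 2
        for point_a, point_b in zip(coords_a, coords_b)
        for a, b in zip(point_a, point_b)
    )
    return math.sqrt(squared_sum / len(coords_a))
```

    A predictor's error margin, such as the 1.5 Å figure quoted above, would then be the difference between its predicted value and the value computed this way.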

  17. Functional dynamics of cell surface membrane proteins

    NASA Astrophysics Data System (ADS)

    Nishida, Noritaka; Osawa, Masanori; Takeuchi, Koh; Imai, Shunsuke; Stampoulis, Pavlos; Kofuku, Yutaka; Ueda, Takumi; Shimada, Ichio

    2014-04-01

    Cell surface receptors are integral membrane proteins that receive external stimuli, and transmit signals across plasma membranes. In the conventional view of receptor activation, ligand binding to the extracellular side of the receptor induces conformational changes, which convert the structure of the receptor into an active conformation. However, recent NMR studies of cell surface membrane proteins have revealed that their structures are more dynamic than previously envisioned, and they fluctuate between multiple conformations in an equilibrium on various timescales. In addition, NMR analyses, along with biochemical and cell biological experiments indicated that such dynamical properties are critical for the proper functions of the receptors. In this review, we will describe several NMR studies that revealed direct linkage between the structural dynamics and the functions of the cell surface membrane proteins, such as G-protein coupled receptors (GPCRs), ion channels, membrane transporters, and cell adhesion molecules.

  18. Calreticulin: one protein, one gene, many functions.

    PubMed Central

    Michalak, M; Corbett, E F; Mesaeli, N; Nakamura, K; Opas, M

    1999-01-01

    The endoplasmic reticulum (ER) plays a critical role in the synthesis and chaperoning of membrane-associated and secreted proteins. The membrane is also an important site of Ca(2+) storage and release. Calreticulin is a unique ER luminal resident protein. The protein affects many cellular functions, both in the ER lumen and outside of the ER environment. In the ER lumen, calreticulin performs two major functions: chaperoning and regulation of Ca(2+) homoeostasis. Calreticulin is a highly versatile lectin-like chaperone, and it participates during the synthesis of a variety of molecules, including ion channels, surface receptors, integrins and transporters. The protein also affects intracellular Ca(2+) homoeostasis by modulation of ER Ca(2+) storage and transport. Studies on the cell biology of calreticulin revealed that the ER membrane is a very dynamic intracellular compartment affecting many aspects of cell physiology. PMID:10567207

  19. Predicting Ligand Binding Sites on Protein Surfaces by 3-Dimensional Probability Density Distributions of Interacting Atoms

    PubMed Central

    Jian, Jhih-Wei; Elumalai, Pavadai; Pitti, Thejkiran; Wu, Chih Yuan; Tsai, Keng-Chang; Chang, Jeng-Yih; Peng, Hung-Pin; Yang, An-Suei

    2016-01-01

    Predicting ligand binding sites (LBSs) on protein structures, which are obtained either from experimental or computational methods, is a useful first step in functional annotation or structure-based drug design for the protein structures. In this work, the structure-based machine learning algorithm ISMBLab-LIG was developed to predict LBSs on protein surfaces with input attributes derived from the three-dimensional probability density maps of interacting atoms, which were reconstructed on the query protein surfaces and were relatively insensitive to local conformational variations of the tentative ligand binding sites. The prediction accuracy of the ISMBLab-LIG predictors is comparable to that of the best LBS predictors benchmarked on several well-established testing datasets. More importantly, the ISMBLab-LIG algorithm has substantial tolerance to the prediction uncertainties of computationally derived protein structure models. As such, the method is particularly useful for predicting LBSs not only on experimental protein structures without known LBS templates in the database but also on computationally predicted model protein structures with structural uncertainties in the tentative ligand binding sites. PMID:27513851

  20. Feature Fusion Based SVM Classifier for Protein Subcellular Localization Prediction.

    PubMed

    Rahman, Julia; Mondal, Md Nazrul Islam; Islam, Md Khaled Ben; Hasan, Md Al Mehedi

    2016-12-18

    Because protein subcellular localization matters in many branches of life science and in drug discovery, researchers have focused their attention on protein subcellular localization prediction. Effective representation of features from protein sequences plays a vital role in protein subcellular localization prediction, especially for machine learning techniques. A single feature representation, such as pseudo amino acid composition (PseAAC), physiochemical property models (PPM), or amino acid index distribution (AAID), contains insufficient information from protein sequences. To deal with this problem, we have proposed two feature fusion representations, AAIDPAAC and PPMPAAC, to work with Support Vector Machine classifiers, which fuse PseAAC with AAID and PPM, respectively. We have evaluated the performance of both single and fused feature representations on a Gram-negative bacterial dataset. We obtained at least 3% higher actual accuracy with AAIDPAAC and 2% higher locative accuracy with PPMPAAC than with single feature representations.

  1. Prediction of Protein–Protein Interaction Sites in Sequences and 3D Structures by Random Forests

    PubMed Central

    Šikić, Mile; Tomić, Sanja; Vlahoviček, Kristian

    2009-01-01

    Identifying interaction sites in proteins provides important clues to the function of a protein and is becoming increasingly relevant in topics such as systems biology and drug discovery. Although there are numerous papers on the prediction of interaction sites using information derived from structure, there are only a few case reports on the prediction of interaction residues based solely on protein sequence. Here, a sliding window approach is combined with the Random Forests method to predict protein interaction sites using (i) a combination of sequence- and structure-derived parameters and (ii) sequence information alone. For sequence-based prediction we achieved a precision of 84% with a 26% recall and an F-measure of 40%. When combined with structural information, the prediction performance increases to a precision of 76% and a recall of 38% with an F-measure of 51%. We also present an attempt to rationalize the sliding window size and demonstrate that a nine-residue window is the most suitable for predictor construction. Finally, we demonstrate the applicability of our prediction methods by modeling the Ras–Raf complex using predicted interaction sites as target binding interfaces. Our results suggest that it is possible to predict protein interaction sites with quite a high accuracy using only sequence information. PMID:19180183
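
    The reported F-measures follow from the stated precision and recall values, and the nine-residue window can be illustrated on a short sequence. The sketch below is not the authors' code: it assumes the F-measure is the usual harmonic mean of precision and recall, and the padding character `X`, the function names, and the example sequence are illustrative choices.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (the F1 score)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def sliding_windows(sequence, size=9, pad="X"):
    """Yield (index, window) with each residue centred in a window of `size` residues.

    Sequence ends are padded with `pad`; the padding scheme is an assumption.
    """
    if size % 2 == 0:
        raise ValueError("window size must be odd")
    half = size // 2
    padded = pad * half + sequence + pad * half
    for i in range(len(sequence)):
        yield i, padded[i:i + size]

print(round(f_measure(0.84, 0.26), 2))  # 0.4, the sequence-only figure
print(round(f_measure(0.76, 0.38), 2))  # 0.51, the sequence-plus-structure figure
print(next(sliding_windows("MTEYKLVVVGA")))  # (0, 'XXXXMTEYK')
```

    Each window would then be turned into a feature vector for the classifier; that step depends on the chosen parameters and is left out here.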

  2. Gene3D: modelling protein structure, function and evolution.

    PubMed

    Yeats, Corin; Maibaum, Michael; Marsden, Russell; Dibley, Mark; Lee, David; Addou, Sarah; Orengo, Christine A

    2006-01-01

    The Gene3D release 4 database and web portal (http://cathwww.biochem.ucl.ac.uk:8080/Gene3D) provide a combined structural, functional and evolutionary view of the protein world. It is focussed on providing structural annotation for protein sequences without structural representatives--including the complete proteome sets of over 240 different species. The protein sequences have also been clustered into whole-chain families so as to aid functional prediction. The structural annotation is generated using HMM models based on the CATH domain families; CATH is a repository for manually deduced protein domains. Amongst the changes from the last publication are: the addition of over 100 genomes and the UniProt sequence database, domain data from Pfam, metabolic pathway and functional data from COGs, KEGG and GO, and protein-protein interaction data from MINT and BIND. The website has been rebuilt to allow more sophisticated querying and the data returned is presented in a clearer format with greater functionality. Furthermore, all data can be downloaded in a simple XML format, allowing users to carry out complex investigations at their own computers.

  3. Computational approaches for inferring the functions of intrinsically disordered proteins

    PubMed Central

    Varadi, Mihaly; Vranken, Wim; Guharoy, Mainak; Tompa, Peter

    2015-01-01

    Intrinsically disordered proteins (IDPs) are ubiquitously involved in cellular processes and often implicated in human pathological conditions. The critical biological roles of these proteins, despite not adopting a well-defined fold, encouraged structural biologists to revisit their views on the protein structure-function paradigm. Unfortunately, investigating the characteristics and describing the structural behavior of IDPs is far from trivial, and inferring the function(s) of a disordered protein region remains a major challenge. Computational methods have proven particularly relevant for studying IDPs: on the sequence level their dependence on distinct characteristics determined by the local amino acid context makes sequence-based prediction algorithms viable and reliable tools for large scale analyses, while on the structure level the in silico integration of fundamentally different experimental data types is essential to describe the behavior of a flexible protein chain. Here, we offer an overview of the latest developments and computational techniques that aim to uncover how protein function is connected to intrinsic disorder. PMID:26301226

  4. Investigating neuronal function with optically controllable proteins

    PubMed Central

    Zhou, Xin X.; Pan, Michael; Lin, Michael Z.

    2015-01-01

    In the nervous system, protein activities are highly regulated in space and time. This regulation allows for fine modulation of neuronal structure and function during development and adaptive responses. For example, neurite extension and synaptogenesis both involve localized and transient activation of cytoskeletal and signaling proteins, allowing changes in microarchitecture to occur rapidly and in a localized manner. To investigate the role of specific protein regulation events in these processes, methods to optically control the activity of specific proteins have been developed. In this review, we focus on how photosensory domains enable optical control over protein activity and have been used in neuroscience applications. These tools have demonstrated versatility in controlling various proteins and thereby cellular functions, and possess enormous potential for future applications in nervous systems. Just as optogenetic control of neuronal firing using opsins has changed how we investigate the function of cellular circuits in vivo, optical control may yet yield another revolution in how we study the circuitry of intracellular signaling in the brain. PMID:26257603

  5. A Prediction Model of the Capillary Pressure J-Function

    PubMed Central

    Xu, W. S.; Luo, P. Y.; Sun, L.; Lin, N.

    2016-01-01

    The capillary pressure J-function is a dimensionless measure of the capillary pressure of a fluid in a porous medium. The function was derived based on a capillary bundle model. However, the dependence of the J-function on the saturation Sw is not well understood. A prediction model for it is presented based on a capillary pressure model, and the J-function prediction model is a power function instead of an exponential or polynomial function. Relative permeability is calculated with the J-function prediction model, resulting in an easier calculation and results that are more representative. PMID:27603701
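
    For orientation, the sketch below assumes the J-function meant here is the standard Leverett form, J(Sw) = Pc / (sigma cos(theta)) * sqrt(k / phi). The abstract does not give the fitted power function, so `power_law_j` with parameters `a` and `b` is only a hypothetical placeholder for "a power function of Sw", and the input values are made up.

```python
import math

def leverett_j(pc, sigma, theta_deg, permeability, porosity):
    """Dimensionless Leverett J value from consistent SI inputs.

    pc: capillary pressure (Pa); sigma: interfacial tension (N/m);
    theta_deg: contact angle (degrees); permeability: m^2; porosity: fraction.
    """
    wetting = sigma * math.cos(math.radians(theta_deg))
    return pc / wetting * math.sqrt(permeability / porosity)

def power_law_j(sw, a, b):
    """Hypothetical power-function form J = a * Sw**(-b); a and b are placeholders."""
    return a * sw ** (-b)

# Made-up inputs: 100 kPa, 0.03 N/m, fully wetting, about 0.1 darcy, 25% porosity.
print(leverett_j(1e5, 0.03, 0.0, 1e-13, 0.25))
print(power_law_j(0.5, a=1.0, b=2.0))  # 4.0
```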

  6. Protein Secondary Structure Prediction Using Local Adaptive Techniques in Training Neural Networks

    NASA Astrophysics Data System (ADS)

    Aik, Lim Eng; Zainuddin, Zarita; Joseph, Annie

    2008-01-01

    One of the most significant problems in computational molecular biology today is how to predict a protein's three-dimensional structure from its one-dimensional amino acid sequence, generally called the protein folding problem, which also makes it difficult to determine the corresponding protein functions. Thus, this paper addresses protein secondary structure prediction using a neural network as a step toward solving the protein folding problem. The neural network used for protein secondary structure prediction is a multilayer perceptron (MLP) of the feed-forward variety. The training set consists of 120 proteins taken from the Protein Data Bank, while the testing set consists of 60 proteins chosen randomly from the Protein Data Bank. Multiple sequence alignment (MSA) is used to obtain similar protein sequences, and a position-specific scoring matrix (PSSM) is used as the network input. The training process of the neural network involves local adaptive techniques. The local adaptive techniques used in this paper comprise learning rate by sign changes, SuperSAB, Quickprop and Rprop. In the simulations, Rprop and Quickprop are superior to all other algorithms with respect to convergence time. However, the best result was obtained using the Rprop algorithm.
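
    To make "local adaptive" concrete, here is a minimal sketch of one Rprop update, in which every weight keeps its own step size that grows while the gradient sign is stable and shrinks when it flips. The paper's exact variant is not stated, so the snippet uses the iRprop- variant and the commonly quoted defaults (1.2 and 0.5) as assumptions, and it minimises a toy quadratic rather than training an MLP.

```python
def sign(x):
    """Return -1, 0 or 1 according to the sign of x."""
    return (x > 0) - (x < 0)

def rprop_step(weights, grads, prev_grads, steps,
               eta_plus=1.2, eta_minus=0.5, step_min=1e-6, step_max=50.0):
    """One in-place Rprop update (iRprop- variant, chosen here for brevity)."""
    for i, g in enumerate(grads):
        change = g * prev_grads[i]
        if change > 0:
            # Same gradient sign as last time: enlarge this weight's step.
            steps[i] = min(steps[i] * eta_plus, step_max)
        elif change < 0:
            # Sign flip means the minimum was overshot: shrink the step.
            steps[i] = max(steps[i] * eta_minus, step_min)
            g = 0.0  # skip the weight update on this iteration
        weights[i] -= sign(g) * steps[i]
        prev_grads[i] = g

# Toy usage: minimise f(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w, prev, step = [0.0], [0.0], [0.1]
for _ in range(100):
    rprop_step(w, [2 * (w[0] - 3.0)], prev, step)
print(w[0])  # close to 3
```

    Only the sign of the gradient is used, never its magnitude, which is what separates Rprop from plain gradient descent with a global learning rate.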

  7. Protein-protein interaction network-based detection of functionally similar proteins within species.

    PubMed

    Song, Baoxing; Wang, Fen; Guo, Yang; Sang, Qing; Liu, Min; Li, Dengyun; Fang, Wei; Zhang, Deli

    2012-07-01

    Although functionally similar proteins across species have been widely studied, functionally similar proteins within species showing low sequence similarity have not been examined in detail. Identification of these proteins is of significant importance for understanding biological functions, evolution of protein families, progression of co-evolution, convergent evolution, and other aspects that cannot be obtained by detecting functionally similar proteins across species. Here, we explored a method of detecting functionally similar proteins within species based on graph theory. After representing protein-protein interaction networks as graphs, we split the graphs into subgraphs using the 1-hop method. Proteins with functional similarities in a species were detected using a modified shortest-path method to compare these subgraphs and to find the eligible optimal results. Using seven protein-protein interaction networks and this method, some functionally similar proteins with low sequence similarity that cannot be detected by sequence alignment were identified. By analyzing the results, we found that it is sometimes difficult to separate homologous from convergent evolution. Evaluation of the performance of our method by gene ontology term overlap showed that the precision of our method was excellent.
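
    The subgraph-splitting step can be sketched as follows. The abstract does not define the 1-hop method, so the snippet assumes it means the subgraph induced by a protein and its direct interaction partners; the function names and the toy network are illustrative, and the modified shortest-path comparison is not reproduced.

```python
from collections import defaultdict

def build_graph(edges):
    """Undirected adjacency map from an iterable of (protein_a, protein_b) pairs."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    return graph

def one_hop_subgraph(graph, center):
    """Set of edges among `center` and its direct neighbours."""
    nodes = {center} | graph[center]
    return {frozenset((u, v)) for u in nodes for v in graph[u] if v in nodes}

# Toy network with made-up protein names.
ppi = build_graph([("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"), ("D", "E")])
print(sorted(sorted(edge) for edge in one_hop_subgraph(ppi, "A")))
# [['A', 'B'], ['A', 'C'], ['B', 'C']]
```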

  8. Exploiting protein flexibility to predict the location of allosteric sites

    PubMed Central

    2012-01-01

    Background Allostery is one of the most powerful and common ways of regulation of protein activity. However, for most allosteric proteins identified to date the mechanistic details of allosteric modulation are not yet well understood. Uncovering common mechanistic patterns underlying allostery would allow not only a better academic understanding of the phenomena, but it would also streamline the design of novel therapeutic solutions. This relatively unexplored therapeutic potential and the putative advantages of allosteric drugs over classical active-site inhibitors fuel the attention allosteric-drug research is receiving at present. A first step to harness the regulatory potential and versatility of allosteric sites, in the context of drug-discovery and design, would be to detect or predict their presence and location. In this article, we describe a simple computational approach, based on the effect allosteric ligands exert on protein flexibility upon binding, to predict the existence and position of allosteric sites on a given protein structure. Results By querying the literature and a recently available database of allosteric sites, we gathered 213 allosteric proteins with structural information that we further filtered into a non-redundant set of 91 proteins. We performed normal-mode analysis and observed significant changes in protein flexibility upon allosteric-ligand binding in 70% of the cases. These results agree with the current view that allosteric mechanisms are in many cases governed by changes in protein dynamics caused by ligand binding. Furthermore, we implemented an approach that achieves 65% positive predictive value in identifying allosteric sites within the set of predicted cavities of a protein (stricter parameters set, 0.22 sensitivity), by combining the current analysis on dynamics with previous results on structural conservation of allosteric sites. We also analyzed four biological examples in detail, revealing that this simple coarse

  9. Prediction of transmembrane helices from hydrophobic characteristics of proteins.

    PubMed

    Ponnuswamy, P K; Gromiha, M M

    1993-10-01

    Membrane proteins, which must be embedded in lipid bilayers, have evolved to have amino acid sequences that will fold with a hydrophobic surface in contact with the alkane chains of the lipids and a polar surface in contact with the aqueous phases on both sides of the membrane and the polar head groups of the lipids. It is generally assumed that the characteristics of the aqueous parts of the membrane proteins are similar to those of normal globular proteins, and the embedded parts are highly hydrophobic. In our earlier works, we introduced the concept of 'surrounding hydrophobicity' and developed a hydrophobicity scale for the 20 amino acid residues, and applied it successfully to the study of the family of globular proteins. In this work we use the concept of surrounding hydrophobicity to indicate quantitatively how the aqueous parts of membrane proteins compare with the normal globular proteins, and how rich the embedded parts are in their hydrophobic activity. We then develop a surrounding hydrophobicity scale applicable to membrane proteins, by judiciously mixing the surrounding hydrophobicities observed in the crystals of the membrane protein photosynthetic reaction center (PRC) from the bacterium Rhodopseudomonas viridis, porin from Rhodobacter capsulatus and a set of 64 globular proteins. A predictive scheme based on this scale predicts transmembrane segments from the amino acid sequence in PRC and 26 randomly selected membrane proteins to an 80% level of accuracy. This is a much higher predictive power than that of the existing popular methods. A new procedure to measure the amphipathicity of sequence segments is proposed, and it is used to characterize the transmembrane parts of the sample membrane proteins.
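
    The general idea of scale-based transmembrane prediction can be sketched with a window average over per-residue hydrophobicity values. The surrounding hydrophobicity scale is not given in the abstract, so the snippet substitutes the widely used Kyte-Doolittle hydropathy scale, together with the window of 19 residues and cutoff of 1.6 usually quoted for it. It therefore illustrates the technique and does not reproduce the authors' scheme.

```python
# Kyte-Doolittle hydropathy values, used as a stand-in scale.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def hydropathy_profile(sequence, window=19):
    """Mean hydropathy of every full window along the sequence."""
    scores = [KYTE_DOOLITTLE[aa] for aa in sequence]
    return [sum(scores[i:i + window]) / window
            for i in range(len(scores) - window + 1)]

def candidate_segments(sequence, window=19, threshold=1.6):
    """Start indices of windows whose mean hydropathy exceeds the threshold."""
    profile = hydropathy_profile(sequence, window)
    return [i for i, value in enumerate(profile) if value > threshold]

# Made-up sequence: a 19-residue leucine stretch flanked by charged residues.
toy = "DDDDD" + "L" * 19 + "KKKKK"
print(candidate_segments(toy))
print(candidate_segments("D" * 19))  # []
```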

  10. The Predikin webserver: improved prediction of protein kinase peptide specificity using structural information

    PubMed Central

    Saunders, Neil F. W.

    2008-01-01

    The Predikin webserver allows users to predict substrates of protein kinases. The Predikin system is built from three components: a database of protein kinase substrates that links phosphorylation sites with specific protein kinase sequences; a Perl module to analyse query protein kinases; and a web interface through which users can submit protein kinases for analysis. The Predikin Perl module provides methods to (i) locate protein kinase catalytic domains in a sequence, (ii) classify them by type or family, (iii) identify substrate-determining residues, (iv) generate weighted scoring matrices using three different methods, (v) extract putative phosphorylation sites in query substrate sequences and (vi) score phosphorylation sites for a given kinase, using optional filters. The web interface provides user-friendly access to each of these functions and allows users to rapidly obtain a set of predictions that they can export for further analysis. The server is available at http://predikin.biosci.uq.edu.au. PMID:18477637

  11. Highly Accurate Prediction of Protein-Protein Interactions via Incorporating Evolutionary Information and Physicochemical Characteristics

    PubMed Central

    Li, Zheng-Wei; You, Zhu-Hong; Chen, Xing; Gui, Jie; Nie, Ru

    2016-01-01

    Protein-protein interactions (PPIs) occur at almost all levels of cell functions and play crucial roles in various cellular processes. Thus, identification of PPIs is critical for deciphering the molecular mechanisms and further providing insight into biological processes. Although a variety of high-throughput experimental techniques have been developed to identify PPIs, the PPI pairs identified by experimental approaches cover only a small fraction of the whole PPI networks, and further, those approaches have inherent disadvantages, such as being time-consuming and expensive and having a high false-positive rate. Therefore, it is urgent and imperative to develop automatic in silico approaches to predict PPIs efficiently and accurately. In this article, we propose a novel feature extraction method that mixes physicochemical and evolutionary information for predicting PPIs using our newly developed discriminative vector machine (DVM) classifier. The improvements of the proposed method mainly consist in introducing an effective feature extraction method that can capture discriminative features from the evolutionary-based information and physicochemical characteristics, and then employing a powerful and robust DVM classifier. To the best of our knowledge, this is the first time that the DVM model has been applied to the field of bioinformatics. When applying the proposed method to the Yeast and Helicobacter pylori (H. pylori) datasets, we obtain excellent prediction accuracies of 94.35% and 90.61%, respectively. The computational results indicate that our method is effective and robust for predicting PPIs, and can be taken as a useful supplementary tool to the traditional experimental methods for future proteomics research. PMID:27571061

  12. Protein Nitration in Placenta – Functional Significance

    PubMed Central

    Webster, RP; Roberts, VHJ; Myatt, L

    2009-01-01

    Crucial roles of the placenta are disrupted in early and mid-trimester pregnancy loss, preeclampsia, eclampsia and intrauterine growth restriction. The pathophysiology of these disorders includes a relative hypoxia of the placenta, ischemia/reperfusion injury, an inflammatory response and oxidative stress. Reactive oxygen species including nitric oxide (NO), carbon monoxide and superoxide have been shown to participate in trophoblast invasion, regulation of placental vascular reactivity and other events. Superoxide, which regulates expression of redox sensitive genes, has been implicated in up-regulation of transcription factors, antioxidant production, angiogenesis, proliferation and matrix remodeling. When superoxide and nitric oxide are present in abundance, their interaction yields peroxynitrite, a potent pro-oxidant, but also alters levels of nitric oxide, which in turn affect physiological functions. The peroxynitrite anion is extremely unstable; thus, evidence of its formation in vivo has been indirect, via the occurrence of nitrated moieties including nitrated lipids and nitrotyrosine residues in proteins. Formation of 3-nitrotyrosine (protein nitration) is a “molecular fingerprint” of peroxynitrite formation. Protein nitration has been widely reported in a number of pathological states associated with inflammation but is reported to occur in normal physiology and is thought of as a prevalent, functionally relevant post-translational modification of proteins. Nitration of proteins can result in no effect, a gain of function or a loss of function. Nitration of a range of placental proteins is found in normal pregnancy but increased in pathologic pregnancies. Evidence is presented for nitration of placental signal transduction enzymes and transporters. The targets and extent of nitration of enzymes, receptors, transporters and structural proteins may markedly influence placental cellular function in both physiologic and pathologic settings. PMID:18851882

  13. PPCM: Combing Multiple Classifiers to Improve Protein-Protein Interaction Prediction

    PubMed Central

    Yao, Jianzhuang; Guo, Hong; Yang, Xiaohan

    2015-01-01

    Determining protein-protein interaction (PPI) in biological systems is of considerable importance, and prediction of PPI has become a popular research area. Although different classifiers have been developed for PPI prediction, no single classifier seems to be able to predict PPI with high confidence. We postulated that by combining individual classifiers the accuracy of PPI prediction could be improved. We developed a method called protein-protein interaction prediction classifiers merger (PPCM), and this method combines output from two PPI prediction tools, GO2PPI and Phyloprof, using Random Forests algorithm. The performance of PPCM was tested by area under the curve (AUC) using an assembled Gold Standard database that contains both positive and negative PPI pairs. Our AUC test showed that PPCM significantly improved the PPI prediction accuracy over the corresponding individual classifiers. We found that additional classifiers incorporated into PPCM could lead to further improvement in the PPI prediction accuracy. Furthermore, cross species PPCM could achieve competitive and even better prediction accuracy compared to the single species PPCM. This study established a robust pipeline for PPI prediction by integrating multiple classifiers using Random Forests algorithm. This pipeline will be useful for predicting PPI in nonmodel species. PMID:26539460
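
    The evaluation idea (score a combined predictor by AUC against a gold standard of positive and negative pairs) can be sketched in a few lines. This is not PPCM itself: simple score averaging stands in for the Random Forests merger, the scores and labels are made-up toy values rather than GO2PPI or Phyloprof output, and the function names are illustrative.

```python
def auc(labels, scores):
    """Area under the ROC curve by pairwise comparison; ties count one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_scores(*score_lists):
    """Combine classifiers by averaging their scores (a stand-in for Random Forests)."""
    return [sum(values) / len(values) for values in zip(*score_lists)]

# Toy gold standard: 1 = interacting pair, 0 = non-interacting pair.
labels = [1, 1, 0, 0]
tool_a = [0.9, 0.4, 0.5, 0.1]
tool_b = [0.6, 0.8, 0.2, 0.7]
print(auc(labels, tool_a), auc(labels, tool_b))  # 0.75 0.75
print(auc(labels, average_scores(tool_a, tool_b)))  # 1.0
```

    In this toy case each tool misranks a different pair, so the combination ranks all pairs correctly; real data would not be this clean.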

  14. PPCM: Combing Multiple Classifiers to Improve Protein-Protein Interaction Prediction

    DOE PAGES

    Yao, Jianzhuang; Guo, Hong; Yang, Xiaohan

    2015-01-01

    Determining protein-protein interaction (PPI) in biological systems is of considerable importance, and prediction of PPI has become a popular research area. Although different classifiers have been developed for PPI prediction, no single classifier seems to be able to predict PPI with high confidence. We postulated that by combining individual classifiers the accuracy of PPI prediction could be improved. We developed a method called protein-protein interaction prediction classifiers merger (PPCM), and this method combines output from two PPI prediction tools, GO2PPI and Phyloprof, using Random Forests algorithm. The performance of PPCM was tested by area under the curve (AUC) using an assembled Gold Standard database that contains both positive and negative PPI pairs. Our AUC test showed that PPCM significantly improved the PPI prediction accuracy over the corresponding individual classifiers. We found that additional classifiers incorporated into PPCM could lead to further improvement in the PPI prediction accuracy. Furthermore, cross species PPCM could achieve competitive and even better prediction accuracy compared to the single species PPCM. This study established a robust pipeline for PPI prediction by integrating multiple classifiers using Random Forests algorithm. This pipeline will be useful for predicting PPI in nonmodel species.

  15. Functional insights from structural predictions: analysis of the Escherichia coli genome.

    PubMed Central

    Rychlewski, L.; Zhang, B.; Godzik, A.

    1999-01-01

    Fold assignments for proteins from the Escherichia coli genome are carried out using BASIC, a profile-profile alignment algorithm recently tested on fold recognition benchmarks and on the Mycoplasma genitalium genome, and PSI BLAST, the newest generation of the de facto standard in homology search algorithms. The fold assignments are followed by automated modeling, and the resulting three-dimensional models are analyzed for possible function prediction. Close to 30% of the proteins encoded in the E. coli genome can be recognized as homologous to a protein family with known structure. Most of these homologies (23% of the entire genome) can be recognized by both the PSI BLAST and BASIC algorithms, but the latter recognizes an additional 260 homologies. Previous estimates suggested that only 10-15% of E. coli proteins can be characterized this way. This dramatic increase in the number of recognized homologies between E. coli proteins and structurally characterized protein families is partly due to the rapid increase of the database of known protein structures, but mostly it is due to the significant improvement in prediction algorithms. Knowing protein structure adds a new dimension to our understanding of its function, and the predictions presented here can be used to predict function for uncharacterized proteins. Several examples, analyzed in more detail in this paper, include the DPS protein protecting DNA from oxidative damage (predicted to be homologous to ferritin with iron ion acting as a reducing agent) and the ahpC/tsa family of proteins, which provides resistance to various oxidizing agents (predicted to be homologous to glutathione peroxidase). PMID:10091664

  16. A predicted protein interactome identifies conserved global networks and disease resistance subnetworks in maize

    PubMed Central

    Musungu, Bryan; Bhatnagar, Deepak; Brown, Robert L.; Fakhoury, Ahmad M.; Geisler, Matt

    2015-01-01

    Interactomes are genome-wide roadmaps of protein-protein interactions. They have been produced for humans, yeast, the fruit fly, and Arabidopsis thaliana and have become invaluable tools for generating and testing hypotheses. A predicted interactome for Zea mays (PiZeaM) is presented here as an aid to the research community for this valuable crop species. PiZeaM was built using a proven method of interologs (interacting orthologs) that were identified using both one-to-one and many-to-many orthology between genomes of maize and reference species. Where both maize orthologs occurred for an experimentally determined interaction in the reference species, we predicted a likely interaction in maize. A total of 49,026 unique interactions for 6004 maize proteins were predicted. These interactions are enriched for processes that are evolutionarily conserved, but include many otherwise poorly annotated proteins in maize. The predicted maize interactions were further analyzed by comparing annotation of interacting proteins, including different layers of ontology. A map of pairwise gene co-expression was also generated and compared to predicted interactions. Two global subnetworks were constructed for highly conserved interactions. These subnetworks showed clear clustering of proteins by function. Another subnetwork was created for disease response using a bait and prey strategy to capture interacting partners for proteins that respond to other organisms. Closer examination of this subnetwork revealed the connectivity between biotic and abiotic hormone stress pathways. We believe PiZeaM will provide a useful tool for the prediction of protein function and analysis of pathways for Z. mays researchers and is presented in this paper as a reference tool for the exploration of protein interactions in maize. PMID:26089837
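
    The interolog rule described above (predict an interaction when both partners of a reference-species interaction have orthologs in the target species) can be sketched as below. This is not the PiZeaM pipeline; the function name and all protein identifiers are hypothetical, and many-to-many orthology is handled by taking every combination of orthologs.

```python
from itertools import product

def predict_interologs(reference_interactions, orthologs):
    """Transfer reference-species interactions to the target species.

    orthologs maps a reference protein to a set of target-species proteins,
    which covers both one-to-one and many-to-many orthology.
    """
    predicted = set()
    for a, b in reference_interactions:
        # Nothing is predicted unless both partners have at least one ortholog.
        for ta, tb in product(orthologs.get(a, ()), orthologs.get(b, ())):
            predicted.add(frozenset((ta, tb)))
    return predicted

# Hypothetical identifiers: "At*" for the reference species, "Zm*" for maize.
reference = [("AtA", "AtB"), ("AtB", "AtC")]
orthologs = {"AtA": {"Zm1"}, "AtB": {"Zm2", "Zm3"}}
print(sorted(sorted(pair) for pair in predict_interologs(reference, orthologs)))
# [['Zm1', 'Zm2'], ['Zm1', 'Zm3']]
```

    Storing each pair as a `frozenset` keeps interactions unordered, so A-B and B-A count once when tallying unique interactions.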

  17. Functional Constraint Profiling of a Viral Protein Reveals Discordance of Evolutionary Conservation and Functionality

    PubMed Central

    Wu, Nicholas C.; Olson, C. Anders; Du, Yushen; Le, Shuai; Tran, Kevin; Remenyi, Roland; Gong, Danyang; Al-Mawsawi, Laith Q.; Qi, Hangfei; Wu, Ting-Ting; Sun, Ren

    2015-01-01

    Viruses often encode proteins with multiple functions due to their compact genomes. Existing approaches to identify functional residues largely rely on sequence conservation analysis. Inferring functional residues from sequence conservation can produce false positives, in which the conserved residues are functionally silent, or false negatives, where functional residues are not identified since they are species-specific and therefore non-conserved. Furthermore, the tedious process of constructing and analyzing individual mutations limits the number of residues that can be examined in a single study. Here, we developed a systematic approach to identify the functional residues of a viral protein by coupling experimental fitness profiling with protein stability prediction using the influenza virus polymerase PA subunit as the target protein. We identified a significant number of functional residues that were influenza type-specific and were evolutionarily non-conserved among different influenza types. Our results indicate that type-specific functional residues are prevalent and may not otherwise be identified by sequence conservation analysis alone. More importantly, this technique can be adapted to any viral (and potentially non-viral) protein where structural information is available. PMID:26132554

  18. Modular protein domains: an engineering approach toward functional biomaterials.

    PubMed

    Lin, Charng-Yu; Liu, Julie C

    2016-08-01

    Protein domains and peptide sequences are a powerful tool for conferring specific functions to engineered biomaterials. Protein sequences with a wide variety of functionalities, including structure, bioactivity, protein-protein interactions, and stimuli responsiveness, have been identified, and advances in molecular biology continue to pinpoint new sequences. Protein domains can be combined to make recombinant proteins with multiple functionalities. The high fidelity of the protein translation machinery results in exquisite control over the sequence of recombinant proteins and the resulting properties of protein-based materials. In this review, we discuss protein domains and peptide sequences in the context of functional protein-based materials, composite materials, and their biological applications.

  19. Knowledge of Native Protein-Protein Interfaces Is Sufficient To Construct Predictive Models for the Selection of Binding Candidates.

    PubMed

    Popov, Petr; Grudinin, Sergei

    2015-10-26

    Selection of putative binding poses is a challenging part of virtual screening for protein-protein interactions. Predictive models to filter out binding candidates with the highest binding affinities comprise scoring functions that assign a score to each binding pose. Existing scoring functions are typically deduced by collecting statistical information about interfaces of native conformations of protein complexes along with interfaces of a large generated set of non-native conformations. However, the obtained scoring functions become biased toward the method used to generate the non-native conformations, i.e., they may not recognize near-native interfaces generated with a different method. The present study demonstrates that knowledge of only native protein-protein interfaces is sufficient to construct well-discriminative predictive models for the selection of binding candidates. Here we introduce a new scoring method that comprises a knowledge-based potential called KSENIA deduced from structural information about the native interfaces of 844 crystallographic protein-protein complexes. We derive KSENIA using convex optimization with a training set composed of native protein complexes and their near-native conformations obtained using deformations along the low-frequency normal modes. As a result, our knowledge-based potential has only marginal bias toward a method used to generate putative binding poses. Furthermore, KSENIA is smooth by construction, which allows it to be used along with rigid-body optimization to refine the binding poses. Using several test benchmarks, we demonstrate that our method discriminates well native and near-native conformations of protein complexes from non-native ones. Our methodology can be easily adapted to the recognition of other types of molecular interactions, such as protein-ligand, protein-RNA, etc. KSENIA will be made publicly available as a part of the SAMSON software platform at https://team.inria.fr/nano-d/software .

  20. The lipocalin protein family: structure and function.

    PubMed Central

    Flower, D R

    1996-01-01

    The lipocalin protein family is a large group of small extracellular proteins. The family demonstrates great diversity at the sequence level; however, most lipocalins share three characteristic conserved sequence motifs, the kernel lipocalins, while a group of more divergent family members, the outlier lipocalins, share only one. Belying this sequence dissimilarity, lipocalin crystal structures are highly conserved and comprise a single eight-stranded continuously hydrogen-bonded antiparallel beta-barrel, which encloses an internal ligand-binding site. Together with two other families of ligand-binding proteins, the fatty-acid-binding proteins (FABPs) and the avidins, the lipocalins form part of an overall structural superfamily: the calycins. Members of the lipocalin family are characterized by several common molecular-recognition properties: the ability to bind a range of small hydrophobic molecules, binding to specific cell-surface receptors and the formation of complexes with soluble macromolecules. The varied biological functions of the lipocalins are mediated by one or more of these properties. In the past, the lipocalins have been classified as transport proteins; however, it is now clear that the lipocalins exhibit great functional diversity, with roles in retinol transport, invertebrate cryptic coloration, olfaction and pheromone transport, and prostaglandin synthesis. The lipocalins have also been implicated in the regulation of cell homoeostasis and the modulation of the immune response, and, as carrier proteins, to act in the general clearance of endogenous and exogenous compounds. PMID:8761444

  1. Plasma proteins predict conversion to dementia from prodromal disease

    PubMed Central

    Hye, Abdul; Riddoch-Contreras, Joanna; Baird, Alison L.; Ashton, Nicholas J.; Bazenet, Chantal; Leung, Rufina; Westman, Eric; Simmons, Andrew; Dobson, Richard; Sattlecker, Martina; Lupton, Michelle; Lunnon, Katie; Keohane, Aoife; Ward, Malcolm; Pike, Ian; Zucht, Hans Dieter; Pepin, Danielle; Zheng, Wei; Tunnicliffe, Alan; Richardson, Jill; Gauthier, Serge; Soininen, Hilkka; Kłoszewska, Iwona; Mecocci, Patrizia; Tsolaki, Magda; Vellas, Bruno; Lovestone, Simon

    2014-01-01

    Background The study aimed to validate previously discovered plasma biomarkers associated with AD, using a design based on imaging measures as surrogate for disease severity and assess their prognostic value in predicting conversion to dementia. Methods Three multicenter cohorts of cognitively healthy elderly, mild cognitive impairment (MCI), and AD participants with standardized clinical assessments and structural neuroimaging measures were used. Twenty-six candidate proteins were quantified in 1148 subjects using multiplex (xMAP) assays. Results Sixteen proteins correlated with disease severity and cognitive decline. Strongest associations were in the MCI group with a panel of 10 proteins predicting progression to AD (accuracy 87%, sensitivity 85%, and specificity 88%). Conclusions We have identified 10 plasma proteins strongly associated with disease severity and disease progression. Such markers may be useful for patient selection for clinical trials and assessment of patients with predisease subjective memory complaints. PMID:25012867

  2. DSP: a protein shape string and its profile prediction server

    PubMed Central

    Sun, Jiangming; Tang, Shengnan; Xiong, Wenwei; Cong, Peisheng; Li, Tonghua

    2012-01-01

    Many studies have demonstrated that shape string is an extremely important structure representation, since it is more complete than the classical secondary structure. The shape string provides detailed information also in the regions denoted random coil. But few services are provided for systematic analysis of protein shape string. To fill this gap, we have developed an accurate shape string predictor based on two innovative technologies: a knowledge-driven sequence alignment and a sequence shape string profile method. The performance on blind test data demonstrates that the proposed method can be used for accurate prediction of protein shape string. The DSP server provides both predicted shape string and sequence shape string profile for each query sequence. Using this information, the users can compare protein structure or display protein evolution in shape string space. The DSP server is available at both http://cheminfo.tongji.edu.cn/dsp/ and its main mirror http://chemcenter.tongji.edu.cn/dsp/. PMID:22553364

  3. DSP: a protein shape string and its profile prediction server.

    PubMed

    Sun, Jiangming; Tang, Shengnan; Xiong, Wenwei; Cong, Peisheng; Li, Tonghua

    2012-07-01

    Many studies have demonstrated that shape string is an extremely important structure representation, since it is more complete than the classical secondary structure. The shape string provides detailed information also in the regions denoted random coil. But few services are provided for systematic analysis of protein shape string. To fill this gap, we have developed an accurate shape string predictor based on two innovative technologies: a knowledge-driven sequence alignment and a sequence shape string profile method. The performance on blind test data demonstrates that the proposed method can be used for accurate prediction of protein shape string. The DSP server provides both predicted shape string and sequence shape string profile for each query sequence. Using this information, the users can compare protein structure or display protein evolution in shape string space. The DSP server is available at both http://cheminfo.tongji.edu.cn/dsp/ and its main mirror http://chemcenter.tongji.edu.cn/dsp/.

  4. Determining confidence of predicted interactions between HIV-1 and human proteins using conformal method.

    PubMed

    Nouretdinov, Ilia; Gammerman, Alex; Qi, Yanjun; Klein-Seetharaman, Judith

    2012-01-01

    Identifying protein-protein interactions (PPIs) is critical for understanding virtually all cellular molecular mechanisms. Previously, predicting PPIs was treated as a binary classification task and has commonly been solved in a supervised setting, which requires a positive labeled set of known PPIs and a negative labeled set of non-interacting protein pairs. In those methods, the learner provides the likelihood of the predicted interaction, but without a confidence level associated with each prediction. Here, we apply a conformal prediction (CP) framework to make predictions and estimate the confidence of the predictions. The conformal predictor uses a function measuring the relative 'strangeness' of interacting pairs to check whether the prediction for a new example added to the sequence of already known PPIs would conform to the 'exchangeability' assumption: the distribution of interacting pairs is invariant under any permutation of the pairs. In fact, this is the only assumption we make about the data. Another advantage is that the user can control the number of errors by specifying a desired confidence level. This feature of CP is very useful for ranking a list of possible interacting pairs. In this paper, the conformal method has been developed to deal with just one class, the class of interacting proteins, since there is no clearly defined class of 'non-interactive' pairs. The confidence level helps the biologist interpret the results and better guides the choice of pairs for experimental validation. We apply the proposed conformal framework to improve the identification of interacting pairs between HIV-1 and human proteins.
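
The one-class conformal idea described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy strangeness scores, and the convention that a higher score means a stranger pair are all assumptions made for illustration. The p-value of a candidate pair is the fraction of examples (the known interacting pairs plus the candidate itself) that are at least as strange as the candidate.

```python
def conformal_p_value(known_scores, new_score):
    """Fraction of examples, counting the new one, at least as strange as the new one.

    known_scores: strangeness scores of known interacting pairs (hypothetical values).
    new_score: strangeness score of the candidate pair.
    """
    at_least_as_strange = sum(1 for s in known_scores if s >= new_score)
    return (at_least_as_strange + 1) / (len(known_scores) + 1)


def accept_at_confidence(known_scores, new_score, confidence):
    """Keep the candidate as 'interacting' if its p-value exceeds 1 - confidence."""
    return conformal_p_value(known_scores, new_score) > 1.0 - confidence


# Toy strangeness scores for nine known interacting pairs (made-up numbers).
known_ppi_scores = [0.1, 0.2, 0.25, 0.3, 0.4, 0.5, 0.6, 0.7, 0.9]
p = conformal_p_value(known_ppi_scores, 0.65)
keep_80 = accept_at_confidence(known_ppi_scores, 0.65, confidence=0.8)
keep_60 = accept_at_confidence(known_ppi_scores, 0.65, confidence=0.6)
```

Under exchangeability, a true interacting pair is wrongly rejected with probability at most 1 - confidence, which is how the chosen confidence level bounds the error rate. Sorting candidates by p-value gives a ranked list.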

  5. SCRATCH: a protein structure and structural feature prediction server

    PubMed Central

    Cheng, J.; Randall, A. Z.; Sweredoski, M. J.; Baldi, P.

    2005-01-01

    SCRATCH is a server for predicting protein tertiary structure and structural features. The SCRATCH software suite includes predictors for secondary structure, relative solvent accessibility, disordered regions, domains, disulfide bridges, single mutation stability, residue contacts versus average, individual residue contacts and tertiary structure. The user simply provides an amino acid sequence and selects the desired predictions, then submits to the server. Results are emailed to the user. The server is available at . PMID:15980571

  6. Modification of sorghum proteins for enhanced functionality

    Technology Transfer Automated Retrieval System (TEKTRAN)

    Sorghum is the third most widely produced crop in the United States (U.S.) and fifth in the world during fiscal year 2006/07 (USDA-FAS, 2007). The use of sorghum in foods faces functional and nutritional constraints due mainly to the rigidity of the protein bodies. The disruption and modificatio...

  7. Posttranslational Modification Assays on Functional Protein Microarrays.

    PubMed

    Neiswinger, Johnathan; Uzoma, Ijeoma; Cox, Eric; Rho, HeeSool; Jeong, Jun Seop; Zhu, Heng

    2016-10-03

    Protein microarray technology provides a straightforward yet powerful strategy for identifying substrates of posttranslational modifications (PTMs) and studying the specificity of the enzymes that catalyze these reactions. Protein microarray assays can be designed for individual enzymes or a mixture to establish connections between enzymes and substrates. Assays for four well-known PTMs (phosphorylation, acetylation, ubiquitylation, and SUMOylation) have been developed and are described here for use on functional protein microarrays. Phosphorylation and acetylation require a single enzyme and are easily adapted for use on an array. The ubiquitylation and SUMOylation cascades are very similar, and the combination of the E1, E2, and E3 enzymes plus ubiquitin or SUMO protein and ATP is sufficient for in vitro modification of many substrates.

  8. Functional conservation of an ancestral Pellino protein in helminth species

    PubMed Central

    Cluxton, Christopher D.; Caffrey, Brian E.; Kinsella, Gemma K.; Moynagh, Paul N.; Fares, Mario A.; Fallon, Padraic G.

    2015-01-01

    The immune system of H. sapiens has innate signaling pathways that arose in ancestral species. This is exemplified by the discovery of the Toll-like receptor (TLR) pathway using free-living model organisms such as Drosophila melanogaster. The TLR pathway is ubiquitous and controls sensitivity to pathogen-associated molecular patterns (PAMPs) in eukaryotes. There is, however, a marked absence of this pathway from the Platyhelminthes, with the exception of the Pellino protein family, which is present in a number of species from this phylum. Helminth Pellino proteins are conserved, having high similarity, at both the sequence and predicted structural level, to human Pellino proteins. Pellino from a model helminth, Schistosoma mansoni Pellino (SmPellino), was shown to bind and poly-ubiquitinate human IRAK-1, displaying E3 ligase activity consistent with its human counterparts. When transfected into human cells, SmPellino is functional, interacting with signaling proteins and modulating mammalian signaling pathways. Strict conservation of a protein family in species lacking its niche signaling pathway is rare and provides a platform to examine the ancestral functions of Pellino proteins that may translate into novel mechanisms of immune regulation in humans. PMID:26120048

  9. Inferring modules of functionally interacting proteins using the Bond Energy Algorithm

    PubMed Central

    Watanabe, Ryosuke LA; Morett, Enrique; Vallejo, Edgar E

    2008-01-01

    Background Non-homology based methods such as phylogenetic profiles are effective for predicting functional relationships between proteins with no considerable sequence or structure similarity. Those methods rely heavily on traditional similarity metrics defined on pairs of phylogenetic patterns. Proteins do not exclusively interact in pairs, as the final biological function of a protein in the cellular context is often held by a group of proteins. In order to accurately infer modules of functionally interacting proteins, the consideration of not only direct but also indirect relationships is required. In this paper, we used the Bond Energy Algorithm (BEA) to predict functionally related groups of proteins. With BEA we create clusters of phylogenetic profiles based on the associations of the surrounding elements of the analyzed data using a metric that considers linked relationships among elements in the data set. Results Using phylogenetic profiles obtained from the Cluster of Orthologous Groups of Proteins (COG) database, we conducted a series of clustering experiments using BEA to predict (upper level) relationships between profiles. We evaluated our results by comparing them with COG's functional categories and, furthermore, with the experimentally determined functional relationships between proteins provided by the DIP and ECOCYC databases. Our results demonstrate that BEA is capable of predicting meaningful modules of functionally related proteins. BEA outperforms traditionally used clustering methods, such as k-means and hierarchical clustering, by predicting functional relationships between proteins with higher accuracy. Conclusion This study shows that the linked relationships of phylogenetic profiles obtained by BEA are useful for detecting functional associations between profiles and extending functional modules not found by traditional methods. BEA is capable of detecting relationships among phylogenetic patterns by linking them through a common element shared in
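
As a rough illustration of how a bond-energy ordering groups phylogenetic profiles, the sketch below implements the classic greedy placement step of the Bond Energy Algorithm on toy binary profiles. It assumes the textbook formulation (the bond between two profiles is their dot product, and each profile is inserted where it adds the most bond energy), which is not necessarily the exact metric used in the paper. The protein names and profile values are made up.

```python
def bond(p, q):
    """Bond between two profiles: number of genomes in which both proteins occur."""
    return sum(a * b for a, b in zip(p, q))


def bea_order(profiles):
    """Greedily order profiles so that strongly bonded profiles end up adjacent.

    profiles: dict mapping a protein name to a binary presence/absence vector.
    Each new profile is inserted at the position with the largest net gain:
    bond(left, new) + bond(new, right) - bond(left, right).
    """
    names = list(profiles)
    order = names[:1]
    for name in names[1:]:
        new = profiles[name]
        best_pos, best_gain = 0, None
        for pos in range(len(order) + 1):
            left = profiles[order[pos - 1]] if pos > 0 else None
            right = profiles[order[pos]] if pos < len(order) else None
            gain = 0
            if left is not None:
                gain += bond(left, new)
            if right is not None:
                gain += bond(new, right)
            if left is not None and right is not None:
                gain -= bond(left, right)  # the old neighbours are split apart
            if best_gain is None or gain > best_gain:
                best_pos, best_gain = pos, gain
        order.insert(best_pos, name)
    return order


# Toy profiles over five genomes (hypothetical data).
toy_profiles = {
    "A": [1, 1, 0, 0, 0],
    "B": [0, 0, 0, 1, 1],
    "C": [1, 1, 1, 0, 0],
    "D": [0, 0, 1, 1, 1],
}
ordering = bea_order(toy_profiles)
```

In the resulting ordering, A sits next to C and B next to D, so candidate modules can be read off as runs of adjacent, strongly bonded profiles. How the ordered sequence is cut into modules is a separate design choice that this sketch does not address.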

  10. MeMo: a web tool for prediction of protein methylation modifications.

    PubMed

    Chen, Hu; Xue, Yu; Huang, Ni; Yao, Xuebiao; Sun, Zhirong

    2006-07-01

    Protein methylation is an important and reversible post-translational modification of proteins (PTMs), which governs cellular dynamics and plasticity. Experimental identification of the methylation site is labor-intensive and often limited by the availability of reagents, such as methyl-specific antibodies and optimization of enzymatic reaction. Computational analysis may facilitate the identification of potential methylation sites with ease and provide insight for further experimentation. Here we present a novel protein methylation prediction web server named MeMo, protein methylation modification prediction, implemented in Support Vector Machines (SVMs). Our present analysis is primarily focused on methylation on lysine and arginine, two major protein methylation sites. However, our computational platform can be easily extended into the analyses of other amino acids. The accuracies for prediction of protein methylation on lysine and arginine have reached 67.1 and 86.7%, respectively. Thus, the MeMo system is a novel tool for predicting protein methylation and may prove useful in the study of protein methylation function and dynamics. The MeMo web server is available at: http://www.bioinfo.tsinghua.edu.cn/~tigerchen/memo.html.

  11. Addressing the Role of Conformational Diversity in Protein Structure Prediction

    PubMed Central

    Parisi, Gustavo; Fornasari, Maria Silvina

    2016-01-01

    Computational modeling of tertiary structures has become standard practice for studying proteins that lack experimental characterization. Unfortunately, 3D structure prediction methods and model quality assessment programs often overlook that an ensemble of conformers in equilibrium populates the native state of proteins. In this work we collected sets of publicly available protein models and the corresponding experimentally solved target structures and studied how they describe the conformational diversity of the protein. For each protein, we assessed the quality of the models against known conformers by several standard measures and identified those models ranked best. We found that model rankings are defined by both the selected target conformer and the similarity measure used. 70% of the proteins in our datasets show that different models are structurally closest to different conformers of the same protein target. We observed that model building protocols such as template-based or ab initio approaches describe in similar ways the conformational diversity of the protein, although for template-based methods this description may depend on the sequence similarity between target and template sequences. Taken together, our results support the idea that protein structure modeling could help to identify members of the native ensemble, highlight the importance of considering conformational diversity in protein 3D quality evaluations and endorse the study of the variability of the native structure for a meaningful biological analysis. PMID:27159429

  12. Which Working Memory Functions Predict Intelligence?

    ERIC Educational Resources Information Center

    Oberauer, Klaus; Süß, Heinz-Martin; Wilhelm, Oliver; Wittmann, Werner W.

    2008-01-01

    Investigates the relationship between three factors of working memory (storage and processing, relational integration, and supervision) and four factors of intelligence (reasoning, speed, memory, and creativity) using structural equation models. Relational integration predicted reasoning ability at least as well as the storage-and-processing…

  13. From patenting genes to proteins: the search for utility via function.

    PubMed

    Ilag, Lawrence L; Ilag, Leodegario M; Ilag, Leodevico L

    2002-05-01

    The debate regarding the patenting of genes has extended into the post-genome era. With only approximately 35000 genes deduced from the draft sequence of the human genome, there are fears that a few companies have already gained monopoly on the potential benefits from this knowledge. Nevertheless, it is accepted that proteins determine gene function and function is not readily predicted from gene sequence. Furthermore, genes can encode multiple proteins and a single protein can have multiple functions. Here, we argue that unraveling the intrinsic complexity of proteins and their functions is the key towards determining the utility requirement for patenting protein inventions and consider the possible socioeconomic impact.

  14. De novo prediction of RNA-protein interactions from sequence information.

    PubMed

    Wang, Ying; Chen, Xiaowei; Liu, Zhi-Ping; Huang, Qiang; Wang, Yong; Xu, Derong; Zhang, Xiang-Sun; Chen, Runsheng; Chen, Luonan

    2013-01-27

    Protein-RNA interactions are fundamentally important in understanding cellular processes. In particular, non-coding RNA-protein interactions play an important role to facilitate biological functions in signalling, transcriptional regulation, and even the progression of complex diseases. However, experimental determination of protein-RNA interactions remains time-consuming and labour-intensive. Here, we develop a novel extended naïve-Bayes-classifier for de novo prediction of protein-RNA interactions, only using protein and RNA sequence information. Specifically, we first collect a set of known protein-RNA interactions as gold-standard positives and extract sequence-based features to represent each protein-RNA pair. To fill the gap between high dimensional features and scarcity of gold-standard positives, we select effective features by cutting a likelihood ratio score, which not only reduces the computational complexity but also allows transparent feature integration during prediction. An extended naïve Bayes classifier is then constructed using these effective features to train a protein-RNA interaction prediction model. Numerical experiments show that our method can achieve the prediction accuracy of 0.77 even though only a small number of protein-RNA interaction data are available. In particular, we demonstrate that the extended naïve-Bayes-classifier is superior to the naïve-Bayes-classifier by fully considering the dependences among features. Importantly, we conduct ncRNA pull-down experiments to validate the predicted novel protein-RNA interactions and identify the interacting proteins of sbRNA CeN72 in C. elegans, which further demonstrates the effectiveness of our method.
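
The feature-selection step described above (keeping only features whose likelihood ratio passes a cutoff, then scoring with a naive Bayes model) can be sketched as follows. This is a plain naive Bayes illustration under stated assumptions, not the extended classifier of the paper: features are binary, a negative set is assumed to be available, add-one smoothing is used, and only features that are present contribute to the score. All names and data are hypothetical.

```python
import math


def log_likelihood_ratios(positives, negatives, n_features):
    """Per-feature log of P(feature present | positive) / P(feature present | negative)."""
    llr = []
    for j in range(n_features):
        # Add-one (Laplace) smoothing keeps the ratio finite for unseen features.
        p_pos = (sum(x[j] for x in positives) + 1) / (len(positives) + 2)
        p_neg = (sum(x[j] for x in negatives) + 1) / (len(negatives) + 2)
        llr.append(math.log(p_pos / p_neg))
    return llr


def select_features(llr, cutoff):
    """Keep features whose absolute log-likelihood ratio reaches the cutoff."""
    return [j for j, value in enumerate(llr) if abs(value) >= cutoff]


def interaction_score(x, llr, selected):
    """Sum the log-likelihood ratios of the selected features present in x."""
    return sum(llr[j] for j in selected if x[j])


# Toy binary feature vectors for protein-RNA pairs (made-up data).
positive_pairs = [[1, 1, 0], [1, 0, 0], [1, 1, 1]]
negative_pairs = [[0, 1, 0], [0, 0, 1], [0, 1, 1]]

llr = log_likelihood_ratios(positive_pairs, negative_pairs, n_features=3)
selected = select_features(llr, cutoff=1.0)
score = interaction_score([1, 0, 1], llr, selected)
```

Because the score is a sum of per-feature terms, the contribution of each retained feature can be inspected directly, which is one way to read the abstract's remark about transparent feature integration.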

  15. Protein (multi-)location prediction: utilizing interdependencies via a generative model

    PubMed Central

    Shatkay, Hagit

    2015-01-01

    Motivation: Proteins are responsible for a multitude of vital tasks in all living organisms. Given that a protein’s function and role are strongly related to its subcellular location, protein location prediction is an important research area. While proteins move from one location to another and can localize to multiple locations, most existing location prediction systems assign only a single location per protein. A few recent systems attempt to predict multiple locations for proteins; however, their performance leaves much room for improvement. Moreover, such systems do not capture dependencies among locations and usually consider locations as independent. We hypothesize that a multi-location predictor that captures location inter-dependencies can improve location predictions for proteins. Results: We introduce a probabilistic generative model for protein localization, and develop a system based on it, which we call MDLoc, that utilizes inter-dependencies among locations to predict multiple locations for proteins. The model captures location inter-dependencies using Bayesian networks and represents dependency between features and locations using a mixture model. We use iterative processes for learning model parameters and for estimating protein locations. We evaluate our classifier, MDLoc, on a dataset of single- and multi-localized proteins derived from the DBMLoc dataset, which is the most comprehensive protein multi-localization dataset currently available. Our results, obtained by using MDLoc, significantly improve upon results obtained by an initial simpler classifier, as well as on results reported by other top systems. Availability and implementation: MDLoc is available at: http://www.eecis.udel.edu/~compbio/mdloc. Contact: shatkay@udel.edu. PMID:26072505

  16. Using computational biophysics to understand protein evolution and function

    NASA Astrophysics Data System (ADS)

    Ytreberg, F. Marty

    2010-10-01

    Understanding how proteins evolve and function is vital for human health (e.g., developing better drugs, predicting the outbreak of disease, etc.). In spite of its importance, little is known about the underlying molecular mechanisms behind these biological processes. Computational biophysics has emerged as a useful tool in this area due to its unique ability to obtain a detailed, atomistic view of proteins and how they interact. I will give two examples from our studies where computational biophysics has provided valuable insight: (i) Protein evolution in viruses. Our results suggest that the amino acid changes that occur during high temperature evolution of a virus decrease the binding free energy of the capsid, i.e., these changes increase capsid stability. (ii) Determining realistic structural ensembles for intrinsically disordered proteins. Most methods for determining protein structure rely on the protein folding into a single conformation, and thus are not suitable for disordered proteins. I will describe a new approach that combines experiment and simulation to generate structures for disordered proteins.

  17. A Novel Method for Functional Annotation Prediction Based on Combination of Classification Methods

    PubMed Central

    Jung, Jaehee; Lee, Heung Ki

    2014-01-01

    Automated protein function prediction is the assignment of functions to uncharacterized proteins by computational methods. This technique is useful for automatically assigning gene functional annotations to undefined sequences in next generation genome analysis (NGS). NGS is a popular research method since high-throughput technologies such as DNA sequencing and microarrays have created large sets of genes. These huge sequence sets have greatly increased the need for analysis. Previous research has been based on sequence similarity, as this is strongly related to functional homology. However, this study aimed to designate protein functions automatically by utilizing InterPro (IPR), which can represent the properties of protein families and groups of protein functions. Moreover, we used gene ontology (GO), the controlled vocabulary used to comprehensively describe protein function. To define the relationship between IPR and GO terms, three pattern recognition techniques have been employed under different conditions, such as feature selection and weighted values instead of binary ones. PMID:25133242

  18. Prediction of allosteric sites on protein surfaces with an elastic-network-model-based thermodynamic method

    NASA Astrophysics Data System (ADS)

    Su, Ji Guo; Qi, Li Sheng; Li, Chun Hua; Zhu, Yan Ying; Du, Hui Jing; Hou, Yan Xue; Hao, Rui; Wang, Ji Hua

    2014-08-01

    Allostery is a rapid and efficient way in many biological processes to regulate protein functions, where binding of an effector at the allosteric site alters the activity and function at a distant active site. Allosteric regulation of protein biological functions provides a promising strategy for novel drug design. However, how to effectively identify the allosteric sites remains one of the major challenges for allosteric drug design. In the present work, a thermodynamic method based on the elastic network model was proposed to predict the allosteric sites on the protein surface. In our method, the thermodynamic coupling between the allosteric and active sites was considered, and then the allosteric sites were identified as those where the binding of an effector molecule induces a large change in the binding free energy of the protein with its ligand. Using the proposed method, two proteins, i.e., the 70 kD heat shock protein (Hsp70) and GluA2 alpha-amino-3-hydroxy-5-methyl-4-isoxazole propionic acid (AMPA) receptor, were studied and the allosteric sites on the protein surface were successfully identified. The predicted results are consistent with the available experimental data, which indicates that our method is a simple yet effective approach for the identification of allosteric sites on proteins.
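
For readers unfamiliar with elastic network models, the sketch below builds the connectivity (Kirchhoff) matrix that such models start from: residues are nodes, and any two residues closer than a cutoff distance are joined by a spring. It covers only this first step and does not reproduce the thermodynamic coupling calculation of the paper. The coordinates, the 7 Å cutoff, and the function name are assumptions chosen for illustration.

```python
def kirchhoff_matrix(coords, cutoff):
    """Connectivity matrix of a simple elastic network.

    coords: list of (x, y, z) positions, e.g. one C-alpha atom per residue.
    cutoff: distance below which two residues are connected by a spring.
    Off-diagonal entries are -1 for connected pairs; each diagonal entry
    is the number of springs attached to that residue.
    """
    n = len(coords)
    gamma = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dist_sq = sum((a - b) ** 2 for a, b in zip(coords[i], coords[j]))
            if dist_sq <= cutoff ** 2:
                gamma[i][j] = gamma[j][i] = -1
                gamma[i][i] += 1
                gamma[j][j] += 1
    return gamma


# Four residues on a straight line, 3.8 angstroms apart (toy geometry).
toy_coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
gamma = kirchhoff_matrix(toy_coords, cutoff=7.0)
```

With this toy geometry only sequential neighbours fall within the cutoff, so the matrix is tridiagonal. In a real protein, contacts between residues that are distant in sequence but close in space are what couple remote sites.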

  19. Environment-specific amino acid substitution tables: tertiary templates and prediction of protein folds.

    PubMed Central

    Overington, J.; Donnelly, D.; Johnson, M. S.; Sali, A.; Blundell, T. L.

    1992-01-01

    The local environment of an amino acid in a folded protein determines the acceptability of mutations at that position. In order to characterize and quantify these structural constraints, we have made a comparative analysis of families of homologous proteins. Residues in each structure are classified according to amino acid type, secondary structure, accessibility of the side chain, and existence of hydrogen bonds from the side chains. Analysis of the pattern of observed substitutions as a function of local environment shows that there are distinct patterns, especially for buried polar residues. The substitution data tables are available on diskette with Protein Science. Given the fold of a protein, one is able to predict sequences compatible with the fold (profiles or templates) and potentially to discriminate between a correctly folded and misfolded protein. Conversely, analysis of residue variation across a family of aligned sequences in terms of substitution profiles can allow prediction of secondary structure or tertiary environment. PMID:1304904

  20. Models to predict intestinal absorption of therapeutic peptides and proteins.

    PubMed

    Antunes, Filipa; Andrade, Fernanda; Ferreira, Domingos; Nielsen, Hanne Morck; Sarmento, Bruno

    2013-01-01

    Prediction of human intestinal absorption is a major goal in the design, optimization, and selection of drugs intended for oral delivery, in particular proteins, which are intrinsically poorly transported across intestinal epithelium. Various techniques are currently employed to evaluate the extent of protein absorption in the different phases of drug discovery and development. Screening protocols to evaluate protein absorption include a range of preclinical methodologies: in silico, in vitro, in situ, ex vivo and in vivo. It is the careful and critical use of these techniques that can help to identify drug candidates that will most probably be well absorbed from the human intestinal tract. It is well recognized that human intestinal permeability cannot be accurately predicted based on a single preclinical method. However, present social and scientific concerns about animal welfare, as well as the pharmaceutical industry's need for rapid, cheap and reliable models predicting bioavailability, give reasons for using methods that provide an appropriate correlation between results of in vivo and in vitro drug absorption. The aim of this review is to describe and compare in silico, in vitro, in situ, ex vivo and in vivo methods used to predict human intestinal absorption, giving special attention to the intestinal absorption of therapeutic peptides and proteins.

  1. Urinary intestinal fatty acid binding protein predicts necrotizing enterocolitis.

    PubMed

    Gregory, Katherine E; Winston, Abigail B; Yamamoto, Hidemi S; Dawood, Hassan Y; Fashemi, Titilayo; Fichorova, Raina N; Van Marter, Linda J

    2014-06-01

    Necrotizing enterocolitis, characterized by sudden onset and rapid progression, remains the most significant gastrointestinal disorder among premature infants. In seeking a predictive biomarker, we found intestinal fatty acid binding protein, an indicator of enterocyte damage, was substantially increased within three and seven days before the diagnosis of necrotizing enterocolitis.

  2. Intrinsic Disorder in Transmembrane Proteins: Roles in Signaling and Topology Prediction.

    PubMed

    Bürgi, Jérôme; Xue, Bin; Uversky, Vladimir N; van der Goot, F Gisou

    2016-01-01

    Intrinsically disordered regions (IDRs) are peculiar stretches of amino acids that lack stable conformations in solution. Intrinsic Disorder containing Proteins (IDP) are defined by the presence of at least one large IDR and have been linked to multiple cellular processes including cell signaling, DNA binding and cancer. Here we used computational analyses and publicly available databases to deepen insight into the prevalence and function of IDRs specifically in transmembrane proteins, which are somewhat neglected in most studies. We found that 50% of transmembrane proteins have at least one IDR of 30 amino acids or more. Interestingly, these domains preferentially localize to the cytoplasmic side especially of multi-pass transmembrane proteins, suggesting that disorder prediction could increase the confidence of topology prediction algorithms. This was supported by the successful prediction of the topology of the uncharacterized multi-pass transmembrane protein TMEM117, as confirmed experimentally. Pathway analysis indicated that IDPs are enriched in cell projection and axons and appear to play an important role in cell adhesion, signaling and ion binding. In addition, we found that IDP are enriched in phosphorylation sites, a crucial post translational modification in signal transduction, when compared to fully ordered proteins and to be implicated in more protein-protein interaction events. Accordingly, IDPs were highly enriched in short protein binding regions called Molecular Recognition Features (MoRFs). Altogether our analyses strongly support the notion that the transmembrane IDPs act as hubs in cellular signal events.

  3. Genome-wide protein localization prediction strategies for gram negative bacteria

    SciTech Connect

    Romine, Margaret F.

    2011-06-15

    Genome-wide prediction of protein subcellular localization is an important type of evidence used for inferring protein function. While a variety of computational tools have been developed for this purpose, errors in the gene models and use of protein sorting signals that are not recognized by the more commonly accepted tools can diminish the accuracy of their output. As part of an effort to manually curate the annotations of 19 strains of Shewanella, numerous insights were gained regarding the use of computational tools and proteomics data to predict protein localization. Identification of the suite of secretion systems present in each strain at the start of the process made it possible to tailor-fit the subsequent localization prediction strategies to each strain for improved accuracy. Comparisons of the computational predictions among orthologous proteins revealed inconsistencies in the computational outputs, which could often be resolved by adjusting the gene models or ortholog group memberships. While proteomic data was useful for verifying start site predictions and post-translational proteolytic cleavage, care was needed to distinguish cellular versus sample processing-mediated cleavage events. Searches for lipoprotein signal peptides revealed that neither TatP nor LipoP are designed for identification of lipoprotein substrates of the twin arginine translocation system and that the +2 rule for lipoprotein sorting does not apply to this Genus. Analysis of the relationships between domain occurrence and protein localization prediction enabled identification of numerous location-informative domains which could then be used to refine or increase confidence in location predictions. This collective knowledge was used to develop a general strategy for predicting protein localization that could be adapted to other organisms.

  4. Predicting protein-protein interactions from multimodal biological data sources via nonnegative matrix tri-factorization.

    PubMed

    Wang, Hua; Huang, Heng; Ding, Chris; Nie, Feiping

    2013-04-01

    Protein interactions are central to all the biological processes and structural scaffolds in living organisms, because they orchestrate a number of cellular processes such as metabolic pathways and immunological recognition. Several high-throughput methods, for example, yeast two-hybrid system and mass spectrometry method, can help determine protein interactions, which, however, suffer from high false-positive rates. Moreover, many protein interactions predicted by one method are not supported by another. Therefore, computational methods are necessary and crucial to complete the interactome expeditiously. In this work, we formulate the problem of predicting protein interactions from a new mathematical perspective--sparse matrix completion, and propose a novel nonnegative matrix factorization (NMF)-based matrix completion approach to predict new protein interactions from existing protein interaction networks. Through using manifold regularization, we further develop our method to integrate different biological data sources, such as protein sequences, gene expressions, protein structure information, etc. Extensive experimental results on four species, Saccharomyces cerevisiae, Drosophila melanogaster, Homo sapiens, and Caenorhabditis elegans, have shown that our new methods outperform related state-of-the-art protein interaction prediction methods.

  5. Carbohydrate-binding protein identification by coupling structural similarity searching with binding affinity prediction.

    PubMed

    Zhao, Huiying; Yang, Yuedong; von Itzstein, Mark; Zhou, Yaoqi

    2014-11-15

    Carbohydrate-binding proteins (CBPs) are potential biomarkers and drug targets. However, the interactions between carbohydrates and proteins are challenging to study experimentally and computationally because of their low binding affinity, high flexibility, and the lack of a linear sequence in carbohydrates as exists in RNA, DNA, and proteins. Here, we describe a structure-based function-prediction technique called SPOT-Struc that identifies carbohydrate-recognizing proteins and their binding amino acid residues by structural alignment program SPalign and binding affinity scoring according to a knowledge-based statistical potential based on the distance-scaled finite-ideal gas reference state (DFIRE). The leave-one-out cross-validation of the method on 113 carbohydrate-binding domains and 3442 noncarbohydrate binding proteins yields a Matthews correlation coefficient of 0.56 for SPalign alone and 0.63 for SPOT-Struc (SPalign + binding affinity scoring) for CBP prediction. SPOT-Struc is a technique with high positive predictive value (79% correct predictions in all positive CBP predictions) with a reasonable sensitivity (52% positive predictions in all CBPs). The sensitivity of the method was changed slightly when applied to 31 APO (unbound) structures found in the protein databank (14/31 for APO versus 15/31 for HOLO). The result of SPOT-Struc will not change significantly if highly homologous templates were used. SPOT-Struc predicted 19 out of 2076 structural genome targets as CBPs. In particular, one uncharacterized protein in Bacillus subtilis (1oq1A) was matched to galectin-9 from Mus musculus. Thus, SPOT-Struc is useful for uncovering novel carbohydrate-binding proteins. SPOT-Struc is available at http://sparks-lab.org.
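The three figures quoted in the abstract above (Matthews correlation coefficient, positive predictive value, sensitivity) all derive from one confusion matrix. The sketch below shows the arithmetic. The counts are not given in the abstract; they are assumptions back-calculated from the rounded percentages (113 binders, 3442 non-binders, about 52% sensitivity, about 79% positive predictive value), and the function name is illustrative only.

```python
import math

def classification_metrics(tp, fp, fn, tn):
    """Return (sensitivity, positive predictive value, Matthews correlation coefficient)."""
    sensitivity = tp / (tp + fn)          # fraction of true binders that were found
    ppv = tp / (tp + fp)                  # fraction of positive calls that were correct
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sensitivity, ppv, mcc

# Assumed counts, back-calculated from the rounded percentages in the abstract:
# 113 binders at ~52% sensitivity -> about 59 true positives and 54 false negatives;
# ~79% PPV -> about 16 false positives among the 3442 non-binders.
sens, ppv, mcc = classification_metrics(tp=59, fp=16, fn=54, tn=3426)
print(f"sensitivity={sens:.2f} ppv={ppv:.2f} mcc={mcc:.2f}")
```

Under these assumed counts the coefficient comes out at about 0.63, consistent with the reported value. Because the negative class is about thirty times larger than the positive class, the coefficient is a more informative summary than raw accuracy.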

  6. Functional brain network efficiency predicts intelligence.

    PubMed

    Langer, Nicolas; Pedroni, Andreas; Gianotti, Lorena R R; Hänggi, Jürgen; Knoch, Daria; Jäncke, Lutz

    2012-06-01

    The neuronal causes of individual differences in mental abilities such as intelligence are complex and profoundly important. Understanding these abilities has the potential to facilitate their enhancement. The purpose of this study was to identify the functional brain network characteristics and their relation to psychometric intelligence. In particular, we examined whether the functional network exhibits efficient small-world network attributes (high clustering and short path length) and whether these small-world network parameters are associated with intellectual performance. High-density resting state electroencephalography (EEG) was recorded in 74 healthy subjects to analyze graph-theoretical functional network characteristics at an intracortical level. Raven's Advanced Progressive Matrices were used to assess intelligence. We found that the clustering coefficient and path length of the functional network are strongly related to intelligence. Thus, the more intelligent the subjects are, the more the functional brain network resembles a small-world network. We further identified the parietal cortex as a main hub of this resting state network, as indicated by increased degree centrality that is associated with higher intelligence. Taken together, this is the first study that substantiates the neural efficiency hypothesis as well as the Parieto-Frontal Integration Theory (P-FIT) of intelligence in the context of functional brain network characteristics. These theories are currently the most established intelligence theories in neuroscience. Our findings revealed robust evidence of an efficiently organized resting state functional brain network for highly productive cognitions.
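The two small-world attributes named in the abstract above, clustering and path length, can be computed for any undirected graph. The following is a minimal sketch using only the standard library. The toy graph is hypothetical and is not derived from the EEG data, and the function names are illustrative.

```python
from collections import deque
from itertools import combinations

def average_clustering(adj):
    """Mean local clustering coefficient of an undirected graph given as {node: set(neighbours)}."""
    total = 0.0
    for nbrs in adj.values():
        k = len(nbrs)
        if k < 2:
            continue  # nodes with fewer than two neighbours contribute 0
        # count edges that exist between pairs of this node's neighbours
        links = sum(1 for u, v in combinations(nbrs, 2) if v in adj[u])
        total += 2 * links / (k * (k - 1))
    return total / len(adj)

def characteristic_path_length(adj):
    """Mean shortest-path length over all reachable ordered node pairs (breadth-first search)."""
    total, pairs = 0, 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for nbr in adj[node]:
                if nbr not in dist:
                    dist[nbr] = dist[node] + 1
                    queue.append(nbr)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs

# Hypothetical toy graph: a 6-node ring in which each node also links to its second neighbour.
n = 6
adj = {i: {(i + d) % n for d in (-2, -1, 1, 2)} for i in range(n)}
print(average_clustering(adj), characteristic_path_length(adj))
```

A network is usually called small-world when its clustering is much higher than that of a comparable random graph while its path length stays similar. That comparison against a random reference is not shown here.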

  7. Improved hybrid optimization algorithm for 3D protein structure prediction.

    PubMed

    Zhou, Changjun; Hou, Caixia; Wei, Xiaopeng; Zhang, Qiang

    2014-07-01

    A new improved hybrid optimization algorithm, the PGATS algorithm, which is based on the toy off-lattice model, is presented for dealing with three-dimensional protein structure prediction problems. The algorithm combines the particle swarm optimization (PSO), genetic algorithm (GA), and tabu search (TS) algorithms. In addition, we adopt several improvement strategies. A stochastic disturbance factor is added to the particle swarm optimization to improve its search ability; the crossover and mutation operations of the genetic algorithm are changed to a kind of random linear method; finally, the tabu search algorithm is improved by appending a mutation operator. Through the combination of a variety of strategies and algorithms, protein structure prediction (PSP) in a 3D off-lattice model is achieved. The PSP problem is an NP-hard problem, but it can be treated as a global optimization problem with multiple extrema and multiple parameters. This is the theoretical principle of the hybrid optimization algorithm that is proposed in this paper. The algorithm combines local search and global search, which overcomes the shortcomings of a single algorithm and gives full play to the advantages of each algorithm. The method is verified on the current universal standard sequences: Fibonacci sequences and real protein sequences. Experiments show that the proposed new method outperforms single algorithms in the accuracy of calculating the protein sequence energy value, and that it is an effective way to predict the structure of proteins.

  8. Boosting compound-protein interaction prediction by deep learning.

    PubMed

    Tian, Kai; Shao, Mingyu; Wang, Yang; Guan, Jihong; Zhou, Shuigeng

    2016-11-01

    The identification of interactions between compounds and proteins plays an important role in network pharmacology and drug discovery. However, experimentally identifying compound-protein interactions (CPIs) is generally expensive and time-consuming, so computational approaches have been introduced. Among these, machine-learning based methods have achieved considerable success. However, due to the nonlinear and imbalanced nature of biological data, many machine learning approaches have their own limitations. Recently, deep learning techniques have shown advantages over many state-of-the-art machine learning methods in some applications. In this study, we aim at improving the performance of CPI prediction based on deep learning, and propose a method called DL-CPI (the abbreviation of Deep Learning for Compound-Protein Interactions prediction), which employs a deep neural network (DNN) to effectively learn the representations of compound-protein pairs. Extensive experiments show that DL-CPI can learn useful features of compound-protein pairs by layerwise abstraction, and thus achieves better prediction performance than existing methods on both balanced and imbalanced datasets.

  9. iStable: off-the-shelf predictor integration for predicting protein stability changes

    PubMed Central

    2013-01-01

    Background Mutation of a single amino acid residue can cause changes in a protein, which could then lead to a loss of protein function. Predicting protein stability changes can provide several possible candidates for novel protein design. Although many prediction tools are available, the conflicting prediction results from different tools could cause confusion to users. Results We proposed an integrated predictor, iStable, with a grid computing architecture, constructed by using sequence information and prediction results from different element predictors. In the learning model, several machine learning methods were evaluated, and the support vector machine was adopted as an integrator rather than simply choosing the majority answer given by the element predictors. Furthermore, the role played by the sequence information was analyzed in our model, and an 11-residue window size was determined. iStable is available with two different input types: structural and sequential. After training and cross-validation, iStable has better performance than all of the element predictors on several datasets. Under different classifications and conditions for validation, this study has also shown better overall performance in different types of secondary structures, relative solvent accessibility circumstances, protein memberships in different superfamilies, and experimental conditions. Conclusions The trained and validated version of iStable provides an accurate approach for prediction of protein stability changes. iStable is freely available online at: http://predictor.nchu.edu.tw/iStable. PMID:23369171

  10. OMPcontact: An Outer Membrane Protein Inter-Barrel Residue Contact Prediction Method.

    PubMed

    Zhang, Li; Wang, Han; Yan, Lun; Su, Lingtao; Xu, Dong

    2017-03-01

    Of the two transmembrane protein types, outer membrane proteins (OMPs) perform diverse important biochemical functions, including substrate transport and passive nutrient uptake. Hence their 3D structures are expected to reveal these functions. Because experimental structures are scarce, predicted 3D structures are better suited to OMP research, and the inter-barrel residue contact is becoming one of the most remarkable features, improving prediction accuracy by describing the structural information of OMPs. To predict OMP structures accurately, we explored an OMP inter-barrel residue contact prediction method: OMPcontact. Multiple OMP-specific features were integrated in the method, including residue evolutionary covariation, topology-based transmembrane segment relative residue position, OMP lipid layer accessibility, and residue evolutionary conservation. These features describe the properties of a residue pair in different respects: sequential, structural, evolutionary, and biochemical. Within a 3-residue sliding window, a Support Vector Machine (SVM) could accurately determine the inter-barrel contact residue pair using the above features. A 5-fold cross-validation process was applied in testing the OMPcontact performance against a non-redundant OMP set of 75 samples. The tests compared four evolutionary covariation methods and screened for those suited to inter-barrel contact prediction. The results showed that our method not only performed the prediction efficiently, but also reliably scored the likelihood of contact for residue pairs. This is expected to improve OMP tertiary structure prediction. Therefore, OMPcontact will be helpful in compiling a structural census of outer membrane proteins.

  11. Three-dimensional protein structure prediction: Methods and computational strategies.

    PubMed

    Dorn, Márcio; E Silva, Mariel Barbachan; Buriol, Luciana S; Lamb, Luis C

    2014-10-12

    A long standing problem in structural bioinformatics is to determine the three-dimensional (3-D) structure of a protein when only a sequence of amino acid residues is given. Many computational methodologies and algorithms have been proposed as a solution to the 3-D Protein Structure Prediction (3-D-PSP) problem. These methods can be divided in four main classes: (a) first principle methods without database information; (b) first principle methods with database information; (c) fold recognition and threading methods; and (d) comparative modeling methods and sequence alignment strategies. Deterministic computational techniques, optimization techniques, data mining and machine learning approaches are typically used in the construction of computational solutions for the PSP problem. Our main goal with this work is to review the methods and computational strategies that are currently used in 3-D protein prediction.

  12. Amino acid composition analysis of human secondary transport proteins and implications for reliable membrane topology prediction.

    PubMed

    Saidijam, Massoud; Azizpour, Sonia; Patching, Simon G

    2016-07-08

    Secondary transporters in humans are a large group of proteins that transport a wide range of ions, metals, organic and inorganic solutes involved in energy transduction, control of membrane potential and osmotic balance, metabolic processes and in the absorption or efflux of drugs and xenobiotics. They are also emerging as important targets for development of new drugs and as target sites for drug delivery to specific organs or tissues. We have performed amino acid composition (AAC) and phylogenetic analyses and membrane topology predictions for 336 human secondary transport proteins and used the results to confirm protein classification and to look for trends and correlations with structural domains and specific substrates and/or function. Some proteins showed statistically high contents of individual amino acids or of groups of amino acids with similar physicochemical properties. One recurring trend was a correlation between high contents of charged and/or polar residues with misleading results in predictions of membrane topology, which was especially prevalent in Mitochondrial Carrier family proteins. We demonstrate how charged or polar residues located in the middle of transmembrane helices can interfere with their identification by membrane topology tools resulting in missed helices in the prediction. Comparison of AAC in the human proteins with that in 235 secondary transport proteins from Escherichia coli revealed similar overall trends along with differences in average contents for some individual amino acids and groups of similar amino acids that are presumed to result from a greater number of functions and complexity in the higher organism.
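Amino acid composition (AAC), as used in the abstract above, is the percentage of each residue type in a sequence. Contents of residue groups such as charged residues are sums over those percentages. The sketch below illustrates the calculation. The sequence fragment is hypothetical, and the choice of D, E, K and R as the charged group is an assumption for illustration rather than the authors' stated grouping.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"   # the 20 standard residues, one-letter codes
CHARGED = set("DEKR")                  # assumed grouping: Asp, Glu, Lys, Arg

def amino_acid_composition(sequence):
    """Percentage of each of the 20 standard amino acids in a sequence."""
    seq = sequence.upper()
    counts = Counter(residue for residue in seq if residue in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return {aa: 100.0 * counts[aa] / total for aa in AMINO_ACIDS}

def group_content(composition, group):
    """Summed percentage of a group of residues, e.g. charged residues."""
    return sum(composition[aa] for aa in group)

# Hypothetical 20-residue fragment, not taken from any of the 336 transporters.
fragment = "MKLLAVDERGLIVAKRWFLE"
composition = amino_acid_composition(fragment)
print(f"Leu: {composition['L']:.1f}%  charged: {group_content(composition, CHARGED):.1f}%")
```

Comparing such group contents across a protein family is one simple way to flag sequences whose charged or polar content is unusually high, which the abstract links to missed helices in topology predictions.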

  13. Nanoparticles-cell association predicted by protein corona fingerprints

    NASA Astrophysics Data System (ADS)

    Palchetti, S.; Digiacomo, L.; Pozzi, D.; Peruzzi, G.; Micarelli, E.; Mahmoudi, M.; Caracciolo, G.

    2016-06-01

    In a physiological environment (e.g., blood and interstitial fluids) nanoparticles (NPs) will bind proteins, forming a "protein corona" layer. The long-lived protein layer tightly bound to the NP surface is referred to as the hard corona (HC) and encodes information that controls NP bioactivity (e.g. cellular association, cellular signaling pathways, biodistribution, and toxicity). Decrypting this complex code has become a priority to predict the NP biological outcomes. Here, we use a library of 16 lipid NPs of varying size (Ø ~ 100-250 nm) and surface chemistry (unmodified and PEGylated) to investigate the relationships between NP physicochemical properties (nanoparticle size, aggregation state and surface charge), protein corona fingerprints (PCFs), and NP-cell association. We found that none of the NPs' physicochemical properties alone was able to account for association with a human cervical cancer cell line (HeLa). For the entire library of NPs, a total of 436 distinct serum proteins were detected. We developed a predictive-validation modeling approach that provides a means of assessing the relative significance of the identified corona proteins. Interestingly, a minor fraction of the HC, consisting of only 8 PCFs, was identified as the main promoter of NP association with HeLa cells. Remarkably, the identified PCFs have several receptors with a high level of expression on the plasma membrane of HeLa cells.

  14. Predicting the functions of long noncoding RNAs using RNA-seq based on Bayesian network.

    PubMed

    Xiao, Yun; Lv, Yanling; Zhao, Hongying; Gong, Yonghui; Hu, Jing; Li, Feng; Xu, Jinyuan; Bai, Jing; Yu, Fulong; Li, Xia

    2015-01-01

    Long noncoding RNAs (lncRNAs) have been shown to play key roles in various biological processes. However, functions of most lncRNAs are poorly characterized. Here, we present a framework to predict functions of lncRNAs through construction of a regulatory network between lncRNAs and protein-coding genes. Using RNA-seq data, the transcript profiles of lncRNAs and protein-coding genes are constructed. Using the Bayesian network method, a regulatory network, which implies dependency relations between lncRNAs and protein-coding genes, was built. In combination with a protein interaction network, highly connected coding genes linked by a given lncRNA were subsequently used to predict functions of the lncRNA through functional enrichment. Application of our method to prostate RNA-seq data showed that 762 lncRNAs in the constructed regulatory network were assigned functions. We found that lncRNAs are involved in diverse biological processes, such as tissue development or embryo development (e.g., nervous system development and mesoderm development). By comparison with functions inferred using the neighboring gene-based method and functions determined using lncRNA knockdown experiments, our method can provide comparable predicted functions of lncRNAs. Overall, our method can be applied to emerging RNA-seq data, which will help researchers identify complex relations between lncRNAs and coding genes and reveal important functions of lncRNAs.
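The abstract above assigns functions to an lncRNA through functional enrichment of the coding genes linked to it, but does not say which statistical test was used. A common choice for this step is the one-sided hypergeometric test, sketched below as an assumption and not as the authors' procedure. All gene counts in the example are hypothetical.

```python
from math import comb

def hypergeom_enrichment_pvalue(population, annotated, sample, hits):
    """P(X >= hits) when `sample` genes are drawn without replacement from
    `population` genes, of which `annotated` carry the functional term."""
    upper = min(annotated, sample)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, upper + 1)
    )
    return tail / comb(population, sample)

# Hypothetical numbers: 20000 coding genes in total, 300 annotated with a term,
# 50 coding genes linked to one lncRNA, 8 of which carry the term.
p_value = hypergeom_enrichment_pvalue(population=20000, annotated=300, sample=50, hits=8)
print(f"enrichment p-value: {p_value:.3g}")
```

In practice the test is repeated for every candidate term, so the resulting p-values need a multiple-testing correction before any term is assigned to the lncRNA.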

  15. Optimized distance-dependent atom-pair-based potential DOOP for protein structure prediction.

    PubMed

    Chae, Myong-Ho; Krull, Florian; Knapp, Ernst-Walter

    2015-05-01

    The DOcking decoy-based Optimized Potential (DOOP) energy function for protein structure prediction is based on empirical distance-dependent atom-pair interactions. To optimize the atom-pair interactions, native protein structures are decomposed into polypeptide chain segments that correspond to structural motifs involving complete secondary structure elements. They constitute near native ligand-receptor systems (or just pairs). Thus, a total of 8609 ligand-receptor systems were prepared from 954 selected proteins. For each of these hypothetical ligand-receptor systems, 1000 evenly sampled docking decoys with 0-10 Å interface root-mean-square-deviation (iRMSD) were generated with a method used before for protein-protein docking. A neural network-based optimization method was applied to derive the optimized energy parameters using these decoys so that the energy function mimics the funnel-like energy landscape for the interaction between these hypothetical ligand-receptor systems. Thus, our method hierarchically models the overall funnel-like energy landscape of native protein structures. The resulting energy function was tested on several commonly used decoy sets for native protein structure recognition and compared with other statistical potentials. In combination with a torsion potential term which describes the local conformational preference, the atom-pair-based potential outperforms other reported statistical energy functions in correct ranking of native protein structures for a variety of decoy sets. This is especially the case for the most challenging ROSETTA decoy set, although it does not take into account side chain orientation-dependence explicitly. The DOOP energy function for protein structure prediction, the underlying database of protein structures with hypothetical ligand-receptor systems and their decoys are freely available at http://agknapp.chemie.fu-berlin.de/doop/.

  16. DEPTH: a web server to compute depth and predict small-molecule binding cavities in proteins.

    PubMed

    Tan, Kuan Pern; Varadarajan, Raghavan; Madhusudhan, M S

    2011-07-01

    Depth measures the extent of atom/residue burial within a protein. It correlates with properties such as protein stability, hydrogen exchange rate, protein-protein interaction hot spots, post-translational modification sites and sequence variability. Our server, DEPTH, accurately computes depth and solvent-accessible surface area (SASA) values. We show that depth can be used to predict small molecule ligand binding cavities in proteins. Often, some of the residues lining a ligand binding cavity are both deep and solvent exposed. Using the depth-SASA pair values for a residue, its likelihood to form part of a small molecule binding cavity is estimated. The parameters of the method were calibrated over a training set of 900 high-resolution X-ray crystal structures of single-domain proteins bound to small molecules (molecular weight <1.5 kDa). The prediction accuracy of DEPTH is comparable to that of other geometry-based prediction methods including LIGSITE, SURFNET and Pocket-Finder (all with Matthews correlation coefficient of ∼0.4) over a testing set of 225 single and multi-chain protein structures. Users have the option of tuning several parameters to detect cavities of different sizes, for example, geometrically flat binding sites. The input to the server is a protein 3D structure in PDB format. The users have the option of tuning the values of four parameters associated with the computation of residue depth and the prediction of binding cavities. The computed depths, SASA and binding cavity predictions are displayed in 2D plots and mapped onto 3D representations of the protein structure using Jmol. Links are provided to download the outputs. Our server is useful for all structural analysis based on residue depth and SASA, such as guiding site-directed mutagenesis experiments and small molecule docking exercises, in the context of protein functional annotation and drug discovery.

  17. A system for predicting energy and protein requirements of wild ruminants.

    PubMed

    Hackmann, Timothy J

    2011-01-01

    Wild ruminants require energy and protein for normal function. I developed a system for predicting these energy and protein requirements across ruminant species and life stages. This system defines requirements on the basis of net energy (NE), net protein (NP), and ruminally degraded protein (RDP). Total NE and NP requirements are calculated as the sum of NE and NP required for several functions (maintenance, activity, thermoregulation, gain, lactation, and gestation). To estimate the requirements for each function, I collected data predominantly for wild species and then formulated allometric and other equations that predict requirements across species. I estimated RDP requirements using an equation for cattle. I then related NE, NP, and RDP to quantities more practical for diet formulation (e.g. dry matter intake). I tabulated requirements over a range of body mass and life stages (neonate, juvenile, nonproductive adult, lactating adult, and gestating adult). Tabulated requirements suggest that adults at peak lactation require the greatest quantities of energy and neonates generally require the greatest quantities of protein, agreeing with suggestions that lactation is energetically expensive and protein is most limiting during growth. Equations used in this system were precise (allometric equations had R² generally ≥0.89 and coefficient of variation <31.1%) and are expected to reliably predict requirements across species. Results showed that a system for beef cattle would overestimate NE and either over- or underestimate NP for gain when applied to wild ruminants, showing that systems for wild ruminants should not extrapolate from requirements for domestic ruminants. One prominent system for wild ruminants predicted at times vastly different protein requirements from those predicted by the proposed system. The proposed system should be further evaluated and expanded to include other nutrients.

  18. Folding funnels, binding funnels, and protein function.

    PubMed Central

    Tsai, C. J.; Kumar, S.; Ma, B.; Nussinov, R.

    1999-01-01

    Folding funnels have been the focus of considerable attention during the last few years. These have mostly been discussed in the general context of the theory of protein folding. Here we extend the utility of the concept of folding funnels, relating them to biological mechanisms and function. In particular, here we describe the shape of the funnels in light of protein synthesis and folding; flexibility, conformational diversity, and binding mechanisms; and the associated binding funnels, illustrating the multiple routes and the range of complexed conformers. Specifically, the walls of the folding funnels, their crevices, and bumps are related to the complexity of protein folding, and hence to sequential vs. nonsequential folding. Whereas the former is more frequently observed in eukaryotic proteins, where the rate of protein synthesis is slower, the latter is more frequent in prokaryotes, with faster translation rates. The bottoms of the funnels reflect the extent of the flexibility of the proteins. Rugged floors imply a range of conformational isomers, which may be close on the energy landscape. Rather than undergoing an induced fit binding mechanism, the conformational ensembles around the rugged bottoms argue that the conformers, which are most complementary to the ligand, will bind to it with the equilibrium shifting in their favor. Furthermore, depending on the extent of the ruggedness, or of the smoothness with only a few minima, we may infer nonspecific, broad range vs. specific binding. In particular, folding and binding are similar processes, with similar underlying principles. Hence, the shape of the folding funnel of the monomer enables making reasonable guesses regarding the shape of the corresponding binding funnel. Proteins having a broad range of binding, such as proteolytic enzymes or relatively nonspecific endonucleases, may be expected to have not only rugged floors in their folding funnels, but their binding funnels will also behave similarly

  19. Mining Proteins with Non-Experimental Annotations Based on an Active Sample Selection Strategy for Predicting Protein Subcellular Localization.

    PubMed

    Cao, Junzhe; Liu, Wenqi; He, Jianjun; Gu, Hong

    2013-01-01

    Subcellular localization of a protein is important for understanding its functions and interactions. There are many techniques based on computational methods to predict protein subcellular locations, but it has been shown that many prediction tasks have a training data shortage problem. This paper introduces a new method to mine proteins with non-experimental annotations, which are labeled by non-experimental evidence in protein databases, to overcome the training data shortage problem. A novel active sample selection strategy is designed, taking advantage of active learning technology, to actively find useful samples from the entire data pool of candidate proteins with non-experimental annotations. This approach can adequately estimate the "value" of each sample, automatically select the most valuable samples and add them to the original training set to help retrain the classifiers. Numerical experiments with four popular multi-label classifiers on three benchmark datasets show that the proposed method can effectively select the valuable samples to supplement the original training set and significantly improve the performance of the prediction classifiers.

  20. Functional protein microarrays by electrohydrodynamic jet printing.

    PubMed

    Shigeta, Kazuyo; He, Ying; Sutanto, Erick; Kang, Somi; Le, An-Phong; Nuzzo, Ralph G; Alleyne, Andrew G; Ferreira, Placid M; Lu, Yi; Rogers, John A

    2012-11-20

    This paper reports the use of advanced forms of electrohydrodynamic jet (e-jet) printing for creating micro- and nanoscale patterns of proteins on various surfaces ranging from flat silica substrates to structured plasmonic crystals, suitable for micro/nanoarray analysis and other applications in both fluorescent and plasmonic detection modes. The approaches function well with diverse classes of proteins, including streptavidin, IgG, fibrinogen, and γ-globulin. Detailed study reveals that the printing process does not adversely alter the protein structure or function, as demonstrated in the specific case of streptavidin through measurements of its binding specificity to biotin-modified DNA. Multinozzle printing systems enable several types of proteins (up to four currently) to be patterned on a single substrate, in rapid fashion and with excellent control over spatial dimensions and registration. High-speed, pulsed operational modes allow large-area printing, with narrow statistical distributions of drop size and spacing in patterns that include millions of droplets. The process is also compatible with the structured surfaces of plasmonic crystal substrates to enable detection without fluorescence. These collective characteristics suggest potential utility of e-jet techniques in wide-ranging areas of biotechnology, where its