Sample records for protein classification based

  1. Improving Classification of Protein Interaction Articles Using Context Similarity-Based Feature Selection.

    PubMed

    Chen, Yifei; Sun, Yuxing; Han, Bing-Qing

    2015-01-01

    Protein interaction article classification is a text classification task in the biological domain to determine which articles describe protein-protein interactions. Since the feature space in text classification is high-dimensional, feature selection is widely used for reducing the dimensionality of features to speed up computation without sacrificing classification performance. Many existing feature selection methods are based on the statistical measure of document frequency and term frequency. One potential drawback of these methods is that they treat features separately. Hence, first we design a similarity measure between the context information to take word cooccurrences and phrase chunks around the features into account. Then we introduce the similarity of context information to the importance measure of the features to substitute the document and term frequency. Hence we propose new context similarity-based feature selection methods. Their performance is evaluated on two protein interaction article collections and compared against the frequency-based methods. The experimental results reveal that the context similarity-based methods perform better in terms of the F1 measure and the dimension reduction rate. Benefiting from the context information surrounding the features, the proposed methods can select distinctive features effectively for protein interaction article classification.

  2. Nucleic and Amino Acid Sequences Support Structure-Based Viral Classification.

    PubMed

    Sinclair, Robert M; Ravantti, Janne J; Bamford, Dennis H

    2017-04-15

    Viral capsids ensure viral genome integrity by protecting the enclosed nucleic acids. Interactions between the genome and capsid and between individual capsid proteins (i.e., capsid architecture) are intimate and are expected to be characterized by strong evolutionary conservation. For this reason, a capsid structure-based viral classification has been proposed as a way to bring order to the viral universe. The seeming lack of sufficient sequence similarity to reproduce this classification has made it difficult to reject structural convergence as the basis for the classification. We reinvestigate whether the structure-based classification for viral coat proteins making icosahedral virus capsids is in fact supported by previously undetected sequence similarity. Since codon choices can influence nascent protein folding cotranslationally, we searched for both amino acid and nucleotide sequence similarity. To demonstrate the sensitivity of the approach, we identify a candidate gene for the pandoravirus capsid protein. We show that the structure-based classification is strongly supported by amino acid and also nucleotide sequence similarities, suggesting that the similarities are due to common descent. The correspondence between structure-based and sequence-based analyses of the same proteins shown here allows them to be used in future analyses of the relationship between linear sequence information and macromolecular function, as well as between linear sequence and protein folds. IMPORTANCE Viral capsids protect nucleic acid genomes, which in turn encode capsid proteins. This tight coupling of protein shell and nucleic acids, together with strong functional constraints on capsid protein folding and architecture, leads to the hypothesis that capsid protein-coding nucleotide sequences may retain signatures of ancient viral evolution. We have been able to show that this is indeed the case, using the major capsid proteins of viruses forming icosahedral capsids. Importantly, we detected similarity at the nucleotide level between capsid protein-coding regions from viruses infecting cells belonging to all three domains of life, reproducing a previously established structure-based classification of icosahedral viral capsids. Copyright © 2017 Sinclair et al.

  3. Nucleic and Amino Acid Sequences Support Structure-Based Viral Classification

    PubMed Central

    Sinclair, Robert M.; Ravantti, Janne J.

    2017-01-01

    ABSTRACT Viral capsids ensure viral genome integrity by protecting the enclosed nucleic acids. Interactions between the genome and capsid and between individual capsid proteins (i.e., capsid architecture) are intimate and are expected to be characterized by strong evolutionary conservation. For this reason, a capsid structure-based viral classification has been proposed as a way to bring order to the viral universe. The seeming lack of sufficient sequence similarity to reproduce this classification has made it difficult to reject structural convergence as the basis for the classification. We reinvestigate whether the structure-based classification for viral coat proteins making icosahedral virus capsids is in fact supported by previously undetected sequence similarity. Since codon choices can influence nascent protein folding cotranslationally, we searched for both amino acid and nucleotide sequence similarity. To demonstrate the sensitivity of the approach, we identify a candidate gene for the pandoravirus capsid protein. We show that the structure-based classification is strongly supported by amino acid and also nucleotide sequence similarities, suggesting that the similarities are due to common descent. The correspondence between structure-based and sequence-based analyses of the same proteins shown here allows them to be used in future analyses of the relationship between linear sequence information and macromolecular function, as well as between linear sequence and protein folds. IMPORTANCE Viral capsids protect nucleic acid genomes, which in turn encode capsid proteins. This tight coupling of protein shell and nucleic acids, together with strong functional constraints on capsid protein folding and architecture, leads to the hypothesis that capsid protein-coding nucleotide sequences may retain signatures of ancient viral evolution. We have been able to show that this is indeed the case, using the major capsid proteins of viruses forming icosahedral capsids. Importantly, we detected similarity at the nucleotide level between capsid protein-coding regions from viruses infecting cells belonging to all three domains of life, reproducing a previously established structure-based classification of icosahedral viral capsids. PMID:28122979

  4. An information-based network approach for protein classification

    PubMed Central

    Wan, Xiaogeng; Zhao, Xin; Yau, Stephen S. T.

    2017-01-01

    Protein classification is one of the critical problems in bioinformatics. Early studies used geometric distances and phylogenetic trees to classify proteins. These methods use binary trees to present protein classification. In this paper, we propose a new protein classification method, whereby theories of information and networks are used to classify the multivariate relationships of proteins. In this study, the protein universe is modeled as an undirected network, where proteins are classified according to their connections. Our method is unsupervised, multivariate, and alignment-free. It can be applied to the classification of both protein sequences and structures. Nine examples are used to demonstrate the efficiency of our new method. PMID:28350835

  5. Automatic classification of protein structures using physicochemical parameters.

    PubMed

    Mohan, Abhilash; Rao, M Divya; Sunderrajan, Shruthi; Pennathur, Gautam

    2014-09-01

    Protein classification is the first step to functional annotation; SCOP and Pfam databases are currently the most relevant protein classification schemes. However, the disproportion in the number of three dimensional (3D) protein structures generated versus their classification into relevant superfamilies/families emphasizes the need for automated classification schemes. Predicting function of novel proteins based on sequence information alone has proven to be a major challenge. The present study focuses on the use of physicochemical parameters in conjunction with machine learning algorithms (Naive Bayes, Decision Trees, Random Forest and Support Vector Machines) to classify proteins into their respective SCOP superfamily/Pfam family, using sequence derived information. Spectrophores™, a 1D descriptor of the 3D molecular field surrounding a structure, was used as a benchmark to compare the performance of the physicochemical parameters. The machine learning algorithms were modified to select features based on information gain for each SCOP superfamily/Pfam family. The effect of combining physicochemical parameters and spectrophores on classification accuracy (CA) was studied. Machine learning algorithms trained with the physicochemical parameters consistently classified SCOP superfamilies and Pfam families with a classification accuracy above 90%, while spectrophores performed with a CA of around 85%. Feature selection improved classification accuracy for both physicochemical parameters and spectrophores based machine learning algorithms. Combining both attributes resulted in a marginal loss of performance. Physicochemical parameters were able to classify proteins from both schemes with classification accuracy ranging from 90-96%. These results suggest the usefulness of this method in classifying proteins from amino acid sequences.

  6. Computational approaches for the classification of seed storage proteins.

    PubMed

    Radhika, V; Rao, V Sree Hari

    2015-07-01

    Seed storage proteins comprise a major part of the protein content of the seed and play an important role in the quality of the seed. These storage proteins are important because they determine the total protein content and have an effect on the nutritional quality and functional properties for food processing. Transgenic plants are being used to develop improved lines for incorporation into plant breeding programs and the nutrient composition of seeds is a major target of molecular breeding programs. Hence, classification of these proteins is crucial for the development of superior varieties with improved nutritional quality. In this study we have applied machine learning algorithms for classification of seed storage proteins. We have presented an algorithm based on a nearest neighbor approach for classification of seed storage proteins and compared its performance with the J48 decision tree, a multilayer perceptron (MLP) neural network and the libSVM support vector machine (SVM). The model based on our algorithm has been able to give higher classification accuracy in comparison to the other methods.

  7. Prediction of hot regions in protein-protein interaction by combining density-based incremental clustering with feature-based classification.

    PubMed

    Hu, Jing; Zhang, Xiaolong; Liu, Xiaoming; Tang, Jinshan

    2015-06-01

    Discovering hot regions in protein-protein interaction is important for drug and protein design, while experimental identification of hot regions is a time-consuming and labor-intensive effort; thus, the development of predictive models can be very helpful. In hot region prediction research, some models are based on structure information, and others are based on a protein interaction network. However, the prediction accuracy of these methods can still be improved. In this paper, a new method is proposed for hot region prediction, which combines density-based incremental clustering with feature-based classification. The method uses density-based incremental clustering to obtain rough hot regions, and uses feature-based classification to remove the non-hot spot residues from the rough hot regions. Experimental results show that the proposed method significantly improves the prediction performance of hot regions. Copyright © 2015 Elsevier Ltd. All rights reserved.

  8. Classification of proteins: available structural space for molecular modeling.

    PubMed

    Andreeva, Antonina

    2012-01-01

    The wealth of available protein structural data provides unprecedented opportunity to study and better understand the underlying principles of protein folding and protein structure evolution. A key to achieving this lies in the ability to analyse these data and to organize them in a coherent classification scheme. Over the past years several protein classifications have been developed that aim to group proteins based on their structural relationships. Some of these classification schemes explore the concept of structural neighbourhood (structural continuum), whereas others utilize the notion of protein evolution and thus provide a discrete rather than continuum view of protein structure space. This chapter presents a strategy for classification of proteins with known three-dimensional structure. Steps in the classification process along with basic definitions are introduced. Examples illustrating some fundamental concepts of protein folding and evolution with a special focus on the exceptions to them are presented.

  9. Elman RNN based classification of proteins sequences on account of their mutual information.

    PubMed

    Mishra, Pooja; Nath Pandey, Paras

    2012-10-21

    In the present work we have employed the method of estimating residue correlation within the protein sequences, by using the mutual information (MI) of adjacent residues, based on structural and solvent accessibility properties of amino acids. The long range correlation between nonadjacent residues is improved by constructing a mutual information vector (MIV) for a single protein sequence; in this way, each protein sequence is associated with its corresponding MIVs. These MIVs are given to an Elman RNN to obtain the classification of protein sequences. The modeling power of MIV was shown to be significantly better, giving a new approach towards alignment-free classification of protein sequences. We also conclude that MIVs based on sequence structural and solvent accessibility properties are better predictors. Copyright © 2012 Elsevier Ltd. All rights reserved.
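
The mutual information of adjacent residues mentioned in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it works on raw amino acid letters rather than on structural or solvent accessibility classes, and the gap-based vector `mutual_information_vector` and its `max_gap` parameter are hypothetical names chosen only for illustration.

```python
import math
from collections import Counter


def pair_mutual_information(seq, gap=1):
    """Mutual information (in bits) between residues that are `gap` positions apart."""
    pairs = [(seq[i], seq[i + gap]) for i in range(len(seq) - gap)]
    n = len(pairs)
    if n == 0:
        return 0.0
    joint = Counter(pairs)                 # counts of (left, right) residue pairs
    left = Counter(a for a, _ in pairs)    # marginal counts of the left residue
    right = Counter(b for _, b in pairs)   # marginal counts of the right residue
    mi = 0.0
    for (a, b), count in joint.items():
        p_ab = count / n
        # p_ab / (p_a * p_b) simplifies to count * n / (left[a] * right[b])
        mi += p_ab * math.log2(count * n / (left[a] * right[b]))
    return mi


def mutual_information_vector(seq, max_gap=3):
    """Hypothetical MIV: one MI value per gap from 1 to max_gap (an assumption)."""
    return [pair_mutual_information(seq, gap) for gap in range(1, max_gap + 1)]


mi_alternating = pair_mutual_information("ABABABAB")  # strongly correlated neighbours
mi_constant = pair_mutual_information("AAAA")         # no information to share
miv = mutual_information_vector("ABABABAB", max_gap=3)
```

A vector like `miv` could then serve as the fixed-length input to a classifier; the abstract uses an Elman RNN for that step, which is not reproduced here.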

  10. Benchmarking protein classification algorithms via supervised cross-validation.

    PubMed

    Kertész-Farkas, Attila; Dhir, Somdutta; Sonego, Paolo; Pacurar, Mircea; Netoteia, Sergiu; Nijveen, Harm; Kuzniar, Arnold; Leunissen, Jack A M; Kocsor, András; Pongor, Sándor

    2008-04-24

    Development and testing of protein classification algorithms are hampered by the fact that the protein universe is characterized by groups vastly different in the number of members, in average protein size, similarity within group, etc. Datasets based on traditional cross-validation (k-fold, leave-one-out, etc.) may not give reliable estimates on how an algorithm will generalize to novel, distantly related subtypes of the known protein classes. Supervised cross-validation, i.e., selection of test and train sets according to the known subtypes within a database has been successfully used earlier in conjunction with the SCOP database. Our goal was to extend this principle to other databases and to design standardized benchmark datasets for protein classification. Hierarchical classification trees of protein categories provide a simple and general framework for designing supervised cross-validation strategies for protein classification. Benchmark datasets can be designed at various levels of the concept hierarchy using a simple graph-theoretic distance. A combination of supervised and random sampling was selected to construct reduced size model datasets, suitable for algorithm comparison. Over 3000 new classification tasks were added to our recently established protein classification benchmark collection that currently includes protein sequence (including protein domains and entire proteins), protein structure and reading frame DNA sequence data. We carried out an extensive evaluation based on various machine-learning algorithms such as nearest neighbor, support vector machines, artificial neural networks, random forests and logistic regression, used in conjunction with comparison algorithms, BLAST, Smith-Waterman, Needleman-Wunsch, as well as 3D comparison methods DALI and PRIDE. The resulting datasets provide lower, and in our opinion more realistic estimates of the classifier performance than do random cross-validation schemes. A combination of supervised and random sampling was used to construct model datasets, suitable for algorithm comparison.
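
The idea of supervised cross-validation described in this abstract, holding out a whole known subtype so that the test set is only distantly related to the training set, can be sketched as follows. This is a simplified illustration under stated assumptions, not the benchmark's actual procedure: it builds only the positive train and test sets for each group, and the record layout `(item_id, group, subgroup)` and the function name `supervised_splits` are hypothetical.

```python
from collections import defaultdict


def supervised_splits(records):
    """Build one split per (group, subgroup): that subgroup is the test set,
    and the remaining subgroups of the same group form the training set.

    records: iterable of (item_id, group, subgroup) tuples (assumed layout).
    Returns a list of (group, held_out_subgroup, train_ids, test_ids).
    """
    by_subgroup = defaultdict(list)
    for item_id, group, subgroup in records:
        by_subgroup[(group, subgroup)].append(item_id)

    splits = []
    for (group, subgroup), test_ids in sorted(by_subgroup.items()):
        train_ids = [
            item_id
            for (g, s), ids in sorted(by_subgroup.items())
            if g == group and s != subgroup
            for item_id in ids
        ]
        if train_ids:  # skip groups that have only one subgroup
            splits.append((group, subgroup, train_ids, test_ids))
    return splits


# Toy data: a superfamily with two families, and one with a single family.
toy_records = [
    ("p1", "superfamilyA", "family1"),
    ("p2", "superfamilyA", "family1"),
    ("p3", "superfamilyA", "family2"),
    ("p4", "superfamilyB", "family3"),
]
toy_splits = supervised_splits(toy_records)
```

Unlike random k-fold splitting, no test item here shares a subgroup with any training item, which is why such splits tend to give lower accuracy estimates.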

  11. Protein classification based on text document classification techniques.

    PubMed

    Cheng, Betty Yee Man; Carbonell, Jaime G; Klein-Seetharaman, Judith

    2005-03-01

    The need for accurate, automated protein classification methods continues to increase as advances in biotechnology uncover new proteins. G-protein coupled receptors (GPCRs) are a particularly difficult superfamily of proteins to classify due to extreme diversity among its members. Previous comparisons of BLAST, k-nearest neighbor (k-NN), hidden Markov model (HMM) and support vector machine (SVM) using alignment-based features have suggested that classifiers at the complexity of SVM are needed to attain high accuracy. Here, analogous to document classification, we applied Decision Tree and Naive Bayes classifiers with chi-square feature selection on counts of n-grams (i.e. short peptide sequences of length n) to this classification task. Using the GPCR dataset and evaluation protocol from the previous study, the Naive Bayes classifier attained an accuracy of 93.0 and 92.4% in level I and level II subfamily classification respectively, while SVM has a reported accuracy of 88.4 and 86.3%. This is a 39.7 and 44.5% reduction in residual error for level I and level II subfamily classification, respectively. The Decision Tree, while inferior to SVM, outperforms HMM in both level I and level II subfamily classification. For those GPCR families whose profiles are stored in the Protein FAMilies database of alignments and HMMs (PFAM), our method performs comparably to a search against those profiles. Finally, our method can be generalized to other protein families by applying it to the superfamily of nuclear receptors with 94.5, 97.8 and 93.6% accuracy in family, level I and level II subfamily classification respectively. Copyright 2005 Wiley-Liss, Inc.
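
Two quantities in this abstract lend themselves to a short worked example: the n-gram counts used as features, and the reduction in residual error computed from the reported accuracies. The sketch below uses hypothetical function names and a toy sequence; only the four accuracy figures come from the abstract.

```python
from collections import Counter


def ngram_counts(sequence, n):
    """Count overlapping n-grams (short peptides of length n) in a sequence."""
    return Counter(sequence[i:i + n] for i in range(len(sequence) - n + 1))


def residual_error_reduction(baseline_acc, new_acc):
    """Percent reduction in error when accuracy rises from baseline_acc to new_acc.

    Both accuracies are percentages; the error is 100 minus the accuracy.
    """
    baseline_err = 100.0 - baseline_acc
    new_err = 100.0 - new_acc
    return 100.0 * (baseline_err - new_err) / baseline_err


bigrams = ngram_counts("MKMKV", 2)  # toy sequence, not real GPCR data

# SVM accuracy vs. Naive Bayes accuracy, as reported in the abstract.
level1 = residual_error_reduction(88.4, 93.0)
level2 = residual_error_reduction(86.3, 92.4)
print(round(level1, 1), round(level2, 1))  # 39.7 44.5
```

The printed values match the 39.7% and 44.5% figures quoted in the abstract, which confirms that "reduction in residual error" is measured relative to the SVM error rate.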

  12. A topological approach for protein classification

    DOE PAGES

    Cang, Zixuan; Mu, Lin; Wu, Kedi; ...

    2015-11-04

    Protein function and dynamics are closely related to its sequence and structure. However, prediction of protein function and dynamics from its sequence and structure is still a fundamental challenge in molecular biology. Protein classification, which is typically done through measuring the similarity between proteins based on protein sequence or physical information, serves as a crucial step toward the understanding of protein function and dynamics.

  13. Multi-label literature classification based on the Gene Ontology graph.

    PubMed

    Jin, Bo; Muller, Brian; Zhai, Chengxiang; Lu, Xinghua

    2008-12-08

    The Gene Ontology is a controlled vocabulary for representing knowledge related to genes and proteins in a computable form. The current effort of manually annotating proteins with the Gene Ontology is outpaced by the rate of accumulation of biomedical knowledge in literature, which urges the development of text mining approaches to facilitate the process by automatically extracting the Gene Ontology annotation from literature. The task is usually cast as a text classification problem, and contemporary methods are confronted with unbalanced training data and the difficulties associated with multi-label classification. In this research, we investigated the methods of enhancing automatic multi-label classification of biomedical literature by utilizing the structure of the Gene Ontology graph. We have studied three graph-based multi-label classification algorithms, including a novel stochastic algorithm and two top-down hierarchical classification methods for multi-label literature classification. We systematically evaluated and compared these graph-based classification algorithms to a conventional flat multi-label algorithm. The results indicate that, through utilizing the information from the structure of the Gene Ontology graph, the graph-based multi-label classification methods can significantly improve predictions of the Gene Ontology terms implied by the analyzed text. Furthermore, the graph-based multi-label classifiers are capable of suggesting Gene Ontology annotations (to curators) that are closely related to the true annotations even if they fail to predict the true ones directly. A software package implementing the studied algorithms is available for the research community. Through utilizing the information from the structure of the Gene Ontology graph, the graph-based multi-label classification methods have better potential than the conventional flat multi-label classification approach to facilitate protein annotation based on the literature.

  14. Wavelet images and Chou's pseudo amino acid composition for protein classification.

    PubMed

    Nanni, Loris; Brahnam, Sheryl; Lumini, Alessandra

    2012-08-01

    The last decade has seen an explosion in the collection of protein data. To actualize the potential offered by this wealth of data, it is important to develop machine systems capable of classifying and extracting features from proteins. Reliable machine systems for protein classification offer many benefits, including the promise of finding novel drugs and vaccines. In developing our system, we analyze and compare several feature extraction methods used in protein classification that are based on the calculation of texture descriptors starting from a wavelet representation of the protein. We then feed these texture-based representations of the protein into an Adaboost ensemble of neural network or a support vector machine classifier. In addition, we perform experiments that combine our feature extraction methods with a standard method that is based on the Chou's pseudo amino acid composition. Using several datasets, we show that our best approach outperforms standard methods. The Matlab code of the proposed protein descriptors is available at http://bias.csr.unibo.it/nanni/wave.rar .

  15. Automatic Classification of Protein Structure Using the Maximum Contact Map Overlap Metric

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Andonov, Rumen; Djidjev, Hristo Nikolov; Klau, Gunnar W.

    In this paper, we propose a new distance measure for comparing two protein structures based on their contact map representations. We show that our novel measure, which we refer to as the maximum contact map overlap (max-CMO) metric, satisfies all properties of a metric on the space of protein representations. Having a metric in that space allows one to avoid pairwise comparisons on the entire database and, thus, to significantly accelerate exploring the protein space compared to non-metric spaces. We show on a gold standard superfamily classification benchmark set of 6759 proteins that our exact k-nearest neighbor (k-NN) scheme classifies up to 224 out of 236 queries correctly and, on a larger, extended version of the benchmark with 60,850 additional structures, up to 1361 out of 1369 queries. Finally, our k-NN classification thus provides a promising approach for the automatic classification of protein structures based on flexible contact map overlap alignments.
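
As a rough illustration of the contact map representation this abstract builds on, the sketch below derives a contact map from residue coordinates and counts the contacts shared by two structures under a given residue alignment. It is not the max-CMO metric itself: finding the alignment that maximizes the overlap is the hard part and is not attempted here, and the 7.5 Å threshold, the function names, and the toy coordinates are all assumptions.

```python
import math
from itertools import combinations


def contact_map(coords, threshold=7.5):
    """Set of residue index pairs (i, j), i < j, closer than `threshold`.

    coords: list of (x, y, z) positions, one per residue (e.g. C-alpha atoms).
    """
    return {
        (i, j)
        for i, j in combinations(range(len(coords)), 2)
        if math.dist(coords[i], coords[j]) <= threshold
    }


def contact_overlap(map_a, map_b, alignment):
    """Number of contacts in A that map onto contacts in B.

    alignment: dict from residue index in A to residue index in B.
    """
    shared = 0
    for i, j in map_a:
        if i in alignment and j in alignment:
            mapped = tuple(sorted((alignment[i], alignment[j])))
            if mapped in map_b:
                shared += 1
    return shared


# Toy structures: an extended chain and a compact square, 3.8 units per step.
chain = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
square = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0), (0.0, 3.8, 0.0)]

map_chain = contact_map(chain)
map_square = contact_map(square)
identity = {i: i for i in range(4)}
shared_contacts = contact_overlap(map_chain, map_square, identity)
```

In a k-NN classifier of the kind the abstract describes, a distance derived from such overlaps would be used to find the labeled structures nearest to a query.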

  16. Automatic Classification of Protein Structure Using the Maximum Contact Map Overlap Metric

    DOE PAGES

    Andonov, Rumen; Djidjev, Hristo Nikolov; Klau, Gunnar W.; ...

    2015-10-09

    In this paper, we propose a new distance measure for comparing two protein structures based on their contact map representations. We show that our novel measure, which we refer to as the maximum contact map overlap (max-CMO) metric, satisfies all properties of a metric on the space of protein representations. Having a metric in that space allows one to avoid pairwise comparisons on the entire database and, thus, to significantly accelerate exploring the protein space compared to non-metric spaces. We show on a gold standard superfamily classification benchmark set of 6759 proteins that our exact k-nearest neighbor (k-NN) scheme classifies up to 224 out of 236 queries correctly and, on a larger, extended version of the benchmark with 60,850 additional structures, up to 1361 out of 1369 queries. Finally, our k-NN classification thus provides a promising approach for the automatic classification of protein structures based on flexible contact map overlap alignments.

  17. Classification of Complete Proteomes of Different Organisms and Protein Sets Based on Their Protein Distributions in Terms of Some Key Attributes of Proteins

    PubMed Central

    Ma, Yue; Tuskan, Gerald A.

    2018-01-01

    The existence of complete genome sequences makes it important to develop different approaches for classification of large-scale data sets and to make extraction of biological insights easier. Here, we propose an approach for classification of complete proteomes/protein sets based on protein distributions on some basic attributes. We demonstrate the usefulness of this approach by determining protein distributions in terms of two attributes: protein length (L) and protein intrinsic disorder content (ID). The protein distributions based on L and ID are surveyed for representative proteome organisms and protein sets from the three domains of life. The two-dimensional maps (designated as fingerprints here) from the protein distribution densities in the LD space defined by ln(L) and ID are then constructed. The fingerprints for different organisms and protein sets are found to be distinct from each other, and they can therefore be used for comparative studies. As a test case, phylogenetic trees have been constructed based on the protein distribution densities in the fingerprints of proteomes of organisms without performing any protein sequence comparison and alignments. The phylogenetic trees generated are biologically meaningful, demonstrating that the protein distributions in the LD space may serve as unique phylogenetic signals of the organisms at the proteome level. PMID:29686995

  18. Protein classification using sequential pattern mining.

    PubMed

    Exarchos, Themis P; Papaloukas, Costas; Lampros, Christos; Fotiadis, Dimitrios I

    2006-01-01

    Protein classification in terms of fold recognition can be employed to determine the structural and functional properties of a newly discovered protein. In this work sequential pattern mining (SPM) is utilized for sequence-based fold recognition. One of the most efficient SPM algorithms, cSPADE, is employed for protein primary structure analysis. Then a classifier uses the extracted sequential patterns for classifying proteins of unknown structure in the appropriate fold category. The proposed methodology exhibited an overall accuracy of 36% in a multi-class problem of 17 candidate categories. The classification performance reaches up to 65% when the three most probable protein folds are considered.

  19. Visualizing and Clustering Protein Similarity Networks: Sequences, Structures, and Functions.

    PubMed

    Mai, Te-Lun; Hu, Geng-Ming; Chen, Chi-Ming

    2016-07-01

    Research in the recent decade has demonstrated the usefulness of protein network knowledge in furthering the study of molecular evolution of proteins, understanding the robustness of cells to perturbation, and annotating new protein functions. In this study, we aimed to provide a general clustering approach to visualize the sequence-structure-function relationship of protein networks, and investigate possible causes for inconsistency in the protein classifications based on sequences, structures, and functions. Such visualization of protein networks could facilitate our understanding of the overall relationship among proteins and help researchers comprehend various protein databases. As a demonstration, we clustered 1437 enzymes by their sequences and structures using the minimum span clustering (MSC) method. The general structure of this protein network was delineated at two clustering resolutions, and the second level MSC clustering was found to be highly similar to existing enzyme classifications. The clusterings of these enzymes based on sequence, structure, and function information are consistent with each other. For proteases, the Jaccard's similarity coefficient is 0.86 between sequence and function classifications, 0.82 between sequence and structure classifications, and 0.78 between structure and function classifications. From our clustering results, we discussed possible examples of divergent evolution and convergent evolution of enzymes. Our clustering approach provides a panoramic view of the sequence-structure-function network of proteins, helps visualize the relation between related proteins intuitively, and is useful in predicting the structure and function of newly determined protein sequences.
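
One common way to compute a Jaccard similarity coefficient between two classifications of the same items is to compare the sets of item pairs that each classification places in the same cluster. The sketch below assumes that pair-counting definition, which the abstract does not spell out; the function names and the toy enzyme labels are hypothetical.

```python
from itertools import combinations


def co_clustered_pairs(labels):
    """Set of item pairs that share a cluster label.

    labels: dict mapping item name to cluster label.
    """
    return {
        (a, b)
        for a, b in combinations(sorted(labels), 2)
        if labels[a] == labels[b]
    }


def jaccard_between_clusterings(labels_1, labels_2):
    """Pair-counting Jaccard coefficient: shared co-clustered pairs divided by
    pairs co-clustered in at least one of the two clusterings."""
    pairs_1 = co_clustered_pairs(labels_1)
    pairs_2 = co_clustered_pairs(labels_2)
    union = pairs_1 | pairs_2
    if not union:
        return 1.0  # neither clustering groups anything; treat as identical
    return len(pairs_1 & pairs_2) / len(union)


# Toy example: four enzymes clustered by sequence and by function.
by_sequence = {"e1": 0, "e2": 0, "e3": 0, "e4": 1}
by_function = {"e1": 0, "e2": 0, "e3": 1, "e4": 1}
similarity = jaccard_between_clusterings(by_sequence, by_function)
print(similarity)  # 0.25
```

A value of 1.0 means both classifications group the items identically, so the reported coefficients of 0.78 to 0.86 indicate substantial but imperfect agreement.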

  20. The Classification of Protein Domains.

    PubMed

    Dawson, Natalie; Sillitoe, Ian; Marsden, Russell L; Orengo, Christine A

    2017-01-01

    The significant expansion in protein sequence and structure data that we are now witnessing brings with it a pressing need to bring order to the protein world. Such order enables us to gain insights into the evolution of proteins, their function and the extent to which the functional repertoire can vary across the three kingdoms of life. This has led to the creation of a wide range of protein family classifications that aim to group proteins based upon their evolutionary relationships. In this chapter we discuss the approaches and methods that are frequently used in the classification of proteins, with a specific emphasis on the classification of protein domains. The construction of both domain sequence and domain structure databases is considered and we show how the use of domain family annotations to assign structural and functional information is enhancing our understanding of genomes.

  1. Graph pyramids for protein function prediction

    PubMed Central

    2015-01-01

    Background Uncovering the hidden organizational characteristics and regularities among biological sequences is the key issue for detailed understanding of an underlying biological phenomenon. Thus pattern recognition from nucleic acid sequences is an important affair for protein function prediction. As proteins from the same family exhibit similar characteristics, homology based approaches predict protein functions via protein classification. But conventional classification approaches mostly rely on the global features by considering only strong protein similarity matches. This leads to significant loss of prediction accuracy. Methods Here we construct the Protein-Protein Similarity (PPS) network, which captures the subtle properties of protein families. The proposed method considers the local as well as the global features, by examining the interactions among 'weakly interacting proteins' in the PPS network and by using hierarchical graph analysis via the graph pyramid. Different underlying properties of the protein families are uncovered by operating the proposed graph based features at various pyramid levels. Results Experimental results on benchmark data sets show that the proposed hierarchical voting algorithm using graph pyramid helps to improve computational efficiency as well as the protein classification accuracy. Quantitatively, among 14,086 test sequences, on average the proposed method misclassified only 21.1 sequences whereas baseline BLAST score based global feature matching method misclassified 362.9 sequences. With each correctly classified test sequence, the fast incremental learning ability of the proposed method further enhances the training model. Thus it has achieved more than 96% protein classification accuracy using only 20% per class training data. PMID:26044522

  2. Graph pyramids for protein function prediction.

    PubMed

    Sandhan, Tushar; Yoo, Youngjun; Choi, Jin; Kim, Sun

    2015-01-01

    Uncovering the hidden organizational characteristics and regularities among biological sequences is the key issue for detailed understanding of an underlying biological phenomenon. Thus pattern recognition from nucleic acid sequences is an important affair for protein function prediction. As proteins from the same family exhibit similar characteristics, homology based approaches predict protein functions via protein classification. But conventional classification approaches mostly rely on the global features by considering only strong protein similarity matches. This leads to significant loss of prediction accuracy. Here we construct the Protein-Protein Similarity (PPS) network, which captures the subtle properties of protein families. The proposed method considers the local as well as the global features, by examining the interactions among 'weakly interacting proteins' in the PPS network and by using hierarchical graph analysis via the graph pyramid. Different underlying properties of the protein families are uncovered by operating the proposed graph based features at various pyramid levels. Experimental results on benchmark data sets show that the proposed hierarchical voting algorithm using graph pyramid helps to improve computational efficiency as well as the protein classification accuracy. Quantitatively, among 14,086 test sequences, on average the proposed method misclassified only 21.1 sequences whereas the baseline BLAST score based global feature matching method misclassified 362.9 sequences. With each correctly classified test sequence, the fast incremental learning ability of the proposed method further enhances the training model. Thus it has achieved more than 96% protein classification accuracy using only 20% per class training data.

  3. Support vector machine based classification of fast Fourier transform spectroscopy of proteins

    NASA Astrophysics Data System (ADS)

    Lazarevic, Aleksandar; Pokrajac, Dragoljub; Marcano, Aristides; Melikechi, Noureddine

    2009-02-01

    Fast Fourier transform spectroscopy has proved to be a powerful method for study of the secondary structure of proteins since peak positions and their relative amplitude are affected by the number of hydrogen bridges that sustain this secondary structure. However, to our best knowledge, the method has not been used yet for identification of proteins within a complex matrix like a blood sample. The principal reason is the apparent similarity of protein infrared spectra with actual differences usually masked by the solvent contribution and other interactions. In this paper, we propose a novel machine learning based method that uses protein spectra for classification and identification of such proteins within a given sample. The proposed method uses principal component analysis (PCA) to identify most important linear combinations of original spectral components and then employs support vector machine (SVM) classification model applied on such identified combinations to categorize proteins into one of given groups. Our experiments have been performed on the set of four different proteins, namely: Bovine Serum Albumin, Leptin, Insulin-like Growth Factor 2 and Osteopontin. Our proposed method of applying principal component analysis along with support vector machines exhibits excellent classification accuracy when identifying proteins using their infrared spectra.

  4. Protein Sequence Classification with Improved Extreme Learning Machine Algorithms

    PubMed Central

    2014-01-01

    Precisely classifying a protein sequence from a large biological protein sequences database plays an important role for developing competitive pharmacological products. Comparing the unseen sequence with all the identified protein sequences and returning the category index with the highest similarity scored protein, conventional methods are usually time-consuming. Therefore, it is urgent and necessary to build an efficient protein sequence classification system. In this paper, we study the performance of protein sequence classification using SLFNs. The recent efficient extreme learning machine (ELM) and its invariants are utilized as the training algorithms. The optimal pruned ELM is first employed for protein sequence classification in this paper. To further enhance the performance, the ensemble based SLFNs structure is constructed where multiple SLFNs with the same number of hidden nodes and the same activation function are used as ensembles. For each ensemble, the same training algorithm is adopted. The final category index is derived using the majority voting method. Two approaches, namely, the basic ELM and the OP-ELM, are adopted for the ensemble based SLFNs. The performance is analyzed and compared with several existing methods using datasets obtained from the Protein Information Resource center. The experimental results show the priority of the proposed algorithms. PMID:24795876
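
    The majority-voting step described in the abstract above can be sketched as follows. This is a minimal illustration only: the per-model predictions are hard-coded hypothetical labels standing in for the outputs of trained SLFNs, the function names are assumptions, and the tie-breaking rule (first label seen wins) is a choice made here, not one stated in the abstract.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the label chosen by most ensemble members.

    Ties are broken in favour of the label that appears first
    (an assumption; the abstract does not specify a tie rule).
    """
    counts = Counter(predictions)
    best = max(counts.values())
    for label in predictions:
        if counts[label] == best:
            return label


def ensemble_predict(per_model_predictions):
    """per_model_predictions[m][i] is model m's label for sequence i."""
    return [majority_vote(list(votes)) for votes in zip(*per_model_predictions)]


# Hypothetical category labels from three ensemble members for three sequences.
per_model = [
    ["kinase", "globin", "kinase"],
    ["kinase", "kinase", "protease"],
    ["globin", "globin", "protease"],
]
final = ensemble_predict(per_model)
print(final)
```

    Each column of `per_model` holds the votes for one sequence, so the final category index is simply the most frequent vote in that column.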

  5. Interplay of biopharmaceutics, biopharmaceutics drug disposition and salivary excretion classification systems

    PubMed Central

    Idkaidek, Nasir M.

    2013-01-01

    The aim of this commentary is to investigate the interplay of Biopharmaceutics Classification System (BCS), Biopharmaceutics Drug Disposition Classification System (BDDCS) and Salivary Excretion Classification System (SECS). BCS first classified drugs based on permeability and solubility for the purpose of predicting oral drug absorption. Then BDDCS linked permeability with hepatic metabolism and classified drugs based on metabolism and solubility for the purpose of predicting oral drug disposition. On the other hand, SECS classified drugs based on permeability and protein binding for the purpose of predicting the salivary excretion of drugs. The role of metabolism, rather than permeability, on salivary excretion is investigated and the results are not in agreement with BDDCS. Conclusion: The proposed Salivary Excretion Classification System (SECS) can be used as a guide for drug salivary excretion based on permeability (not metabolism) and protein binding. PMID:24493977

  6. ECOD: An Evolutionary Classification of Protein Domains

    PubMed Central

    Kinch, Lisa N.; Pei, Jimin; Shi, Shuoyong; Kim, Bong-Hyun; Grishin, Nick V.

    2014-01-01

    Understanding the evolution of a protein, including both close and distant relationships, often reveals insight into its structure and function. Fast and easy access to such up-to-date information facilitates research. We have developed a hierarchical evolutionary classification of all proteins with experimentally determined spatial structures, and presented it as an interactive and updatable online database. ECOD (Evolutionary Classification of protein Domains) is distinct from other structural classifications in that it groups domains primarily by evolutionary relationships (homology), rather than topology (or “fold”). This distinction highlights cases of homology between domains of differing topology to aid in understanding of protein structure evolution. ECOD uniquely emphasizes distantly related homologs that are difficult to detect, and thus catalogs the largest number of evolutionary links among structural domain classifications. Placing distant homologs together underscores the ancestral similarities of these proteins and draws attention to the most important regions of sequence and structure, as well as conserved functional sites. ECOD also recognizes closer sequence-based relationships between protein domains. Currently, approximately 100,000 protein structures are classified in ECOD into 9,000 sequence families clustered into close to 2,000 evolutionary groups. The classification is assisted by an automated pipeline that quickly and consistently classifies weekly releases of PDB structures and allows for continual updates. This synchronization with PDB uniquely distinguishes ECOD among all protein classifications. Finally, we present several case studies of homologous proteins not recorded in other classifications, illustrating the potential of how ECOD can be used to further biological and evolutionary studies. PMID:25474468

  7. ECOD: an evolutionary classification of protein domains.

    PubMed

    Cheng, Hua; Schaeffer, R Dustin; Liao, Yuxing; Kinch, Lisa N; Pei, Jimin; Shi, Shuoyong; Kim, Bong-Hyun; Grishin, Nick V

    2014-12-01

    Understanding the evolution of a protein, including both close and distant relationships, often reveals insight into its structure and function. Fast and easy access to such up-to-date information facilitates research. We have developed a hierarchical evolutionary classification of all proteins with experimentally determined spatial structures, and presented it as an interactive and updatable online database. ECOD (Evolutionary Classification of protein Domains) is distinct from other structural classifications in that it groups domains primarily by evolutionary relationships (homology), rather than topology (or "fold"). This distinction highlights cases of homology between domains of differing topology to aid in understanding of protein structure evolution. ECOD uniquely emphasizes distantly related homologs that are difficult to detect, and thus catalogs the largest number of evolutionary links among structural domain classifications. Placing distant homologs together underscores the ancestral similarities of these proteins and draws attention to the most important regions of sequence and structure, as well as conserved functional sites. ECOD also recognizes closer sequence-based relationships between protein domains. Currently, approximately 100,000 protein structures are classified in ECOD into 9,000 sequence families clustered into close to 2,000 evolutionary groups. The classification is assisted by an automated pipeline that quickly and consistently classifies weekly releases of PDB structures and allows for continual updates. This synchronization with PDB uniquely distinguishes ECOD among all protein classifications. Finally, we present several case studies of homologous proteins not recorded in other classifications, illustrating the potential of how ECOD can be used to further biological and evolutionary studies.

  8. An updated version of NPIDB includes new classifications of DNA–protein complexes and their families

    PubMed Central

    Zanegina, Olga; Kirsanov, Dmitriy; Baulin, Eugene; Karyagina, Anna; Alexeevski, Andrei; Spirin, Sergey

    2016-01-01

    The recent upgrade of nucleic acid–protein interaction database (NPIDB, http://npidb.belozersky.msu.ru/) includes a newly elaborated classification of complexes of protein domains with double-stranded DNA and a classification of families of related complexes. Our classifications are based on contacting structural elements of both DNA: the major groove, the minor groove and the backbone; and protein: helices, beta-strands and unstructured segments. We took into account both hydrogen bonds and hydrophobic interaction. The analyzed material contains 1942 structures of protein domains from 748 PDB entries. We have identified 97 interaction modes of individual protein domain–DNA complexes and 17 DNA–protein interaction classes of protein domain families. We analyzed the sources of diversity of DNA–protein interaction modes in different complexes of one protein domain family. The observed interaction mode is sometimes influenced by artifacts of crystallization or diversity in secondary structure assignment. The interaction classes of domain families are more stable and thus possess more biological sense than a classification of single complexes. Integration of the classification into NPIDB allows the user to browse the database according to the interacting structural elements of DNA and protein molecules. For each family, we present average DNA shape parameters in contact zones with domains of the family. PMID:26656949

  9. An ensemble approach to protein fold classification by integration of template-based assignment and support vector machine classifier.

    PubMed

    Xia, Jiaqi; Peng, Zhenling; Qi, Dawei; Mu, Hongbo; Yang, Jianyi

    2017-03-15

    Protein fold classification is a critical step in protein structure prediction. There are two possible ways to classify protein folds. One is through template-based fold assignment and the other is ab-initio prediction using machine learning algorithms. Combination of both solutions to improve the prediction accuracy has not been explored before. We developed two algorithms, HH-fold and SVM-fold, for protein fold classification. HH-fold is a template-based fold assignment algorithm using the HHsearch program. SVM-fold is a support vector machine-based ab-initio classification algorithm, in which a comprehensive set of features is extracted from three complementary sequence profiles. These two algorithms are then combined, resulting in the ensemble approach TA-fold. We performed a comprehensive assessment of the proposed methods by comparing with ab-initio methods and template-based threading methods on six benchmark datasets. An accuracy of 0.799 was achieved by TA-fold on the DD dataset that consists of proteins from 27 folds. This represents an improvement of 5.4-11.7% over ab-initio methods. After updating this dataset to include more proteins in the same folds, the accuracy increased to 0.971. In addition, TA-fold achieved >0.9 accuracy on a large dataset consisting of 6451 proteins from 184 folds. Experiments on the LE dataset show that TA-fold consistently outperforms other threading methods at the family, superfamily and fold levels. The success of TA-fold is attributed to the combination of template-based fold assignment and ab-initio classification using features from complementary sequence profiles that contain rich evolution information. http://yanglab.nankai.edu.cn/TA-fold/.

  10. Classification of Complete Proteomes of Different Organisms and Protein Sets Based on Their Protein Distributions in Terms of Some Key Attributes of Proteins

    DOE PAGES

    Guo, Hao-Bo; Ma, Yue; Tuskan, Gerald A.; ...

    2018-01-01

    The existence of complete genome sequences makes it important to develop different approaches for classification of large-scale data sets and to make extraction of biological insights easier. Here, we propose an approach for classification of complete proteomes/protein sets based on protein distributions on some basic attributes. We demonstrate the usefulness of this approach by determining protein distributions in terms of two attributes: protein lengths (L) and protein intrinsic disorder contents (ID). The protein distributions based on L and ID are surveyed for representative proteome organisms and protein sets from the three domains of life. The two-dimensional maps (designated as fingerprints here) from the protein distribution densities in the LD space defined by ln(L) and ID are then constructed. The fingerprints for different organisms and protein sets are found to be distinct from each other, and they can therefore be used for comparative studies. As a test case, phylogenetic trees have been constructed based on the protein distribution densities in the fingerprints of proteomes of organisms without performing any protein sequence comparison and alignments. The phylogenetic trees generated are biologically meaningful, demonstrating that the protein distributions in the LD space may serve as unique phylogenetic signals of the organisms at the proteome level.

  11. Classification of Complete Proteomes of Different Organisms and Protein Sets Based on Their Protein Distributions in Terms of Some Key Attributes of Proteins

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Guo, Hao-Bo; Ma, Yue; Tuskan, Gerald A.

    The existence of complete genome sequences makes it important to develop different approaches for classification of large-scale data sets and to make extraction of biological insights easier. Here, we propose an approach for classification of complete proteomes/protein sets based on protein distributions on some basic attributes. We demonstrate the usefulness of this approach by determining protein distributions in terms of two attributes: protein lengths (L) and protein intrinsic disorder contents (ID). The protein distributions based on L and ID are surveyed for representative proteome organisms and protein sets from the three domains of life. The two-dimensional maps (designated as fingerprints here) from the protein distribution densities in the LD space defined by ln(L) and ID are then constructed. The fingerprints for different organisms and protein sets are found to be distinct from each other, and they can therefore be used for comparative studies. As a test case, phylogenetic trees have been constructed based on the protein distribution densities in the fingerprints of proteomes of organisms without performing any protein sequence comparison and alignments. The phylogenetic trees generated are biologically meaningful, demonstrating that the protein distributions in the LD space may serve as unique phylogenetic signals of the organisms at the proteome level.

  12. Classification of Dynamical Diffusion States in Single Molecule Tracking Microscopy

    PubMed Central

    Bosch, Peter J.; Kanger, Johannes S.; Subramaniam, Vinod

    2014-01-01

    Single molecule tracking of membrane proteins by fluorescence microscopy is a promising method to investigate dynamic processes in live cells. Translating the trajectories of proteins to biological implications, such as protein interactions, requires the classification of protein motion within the trajectories. Spatial information of protein motion may reveal where the protein interacts with cellular structures, because binding of proteins to such structures often alters their diffusion speed. For dynamic diffusion systems, we provide an analytical framework to determine in which diffusion state a molecule is residing during the course of its trajectory. We compare different methods for the quantification of motion to utilize this framework for the classification of two diffusion states (two populations with different diffusion speed). We found that a gyration quantification method and a Bayesian statistics-based method are the most accurate in diffusion-state classification for realistic experimentally obtained datasets, of which the gyration method is much less computationally demanding. After classification of the diffusion, the lifetime of the states can be determined, and images of the diffusion states can be reconstructed at high resolution. Simulations validate these applications. We apply the classification and its applications to experimental data to demonstrate the potential of this approach to obtain further insights into the dynamics of cell membrane proteins. PMID:25099798
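
    As a rough illustration of gyration-based two-state classification, the sketch below labels each frame of a 2D trajectory by the radius of gyration of the positions in a short window around it. The window length, the threshold, the function names and the synthetic track are all assumptions chosen for the example; the abstract does not give the authors' actual quantification details or parameter values.

```python
import math


def radius_of_gyration(points):
    """Root-mean-square distance of 2D points from their centroid."""
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    return math.sqrt(sum((x - cx) ** 2 + (y - cy) ** 2 for x, y in points) / n)


def classify_states(track, window=5, threshold=0.5):
    """Label each frame 'slow' or 'fast' from the Rg of a centred window.

    `window` and `threshold` are hypothetical values; in practice they
    would be calibrated against simulated or reference data.
    """
    half = window // 2
    labels = []
    for i in range(len(track)):
        lo = max(0, i - half)
        hi = min(len(track), i + half + 1)
        rg = radius_of_gyration(track[lo:hi])
        labels.append("slow" if rg < threshold else "fast")
    return labels


# Synthetic track: 10 nearly stationary frames, then 10 frames with large steps.
track = [(0.01 * i, 0.0) for i in range(10)]
track += [(0.1 + 1.0 * i, 0.0) for i in range(1, 11)]
labels = classify_states(track)
print(labels)
```

    Frames close to the transition between the two segments see a window that mixes both states, so their labels depend on the chosen window and threshold.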

  13. SeqRate: sequence-based protein folding type classification and rates prediction

    PubMed Central

    2010-01-01

    Background: Protein folding rate is an important property of a protein. Predicting protein folding rate is useful for understanding the protein folding process and guiding protein design. Most previous methods of predicting protein folding rate require the tertiary structure of a protein as an input. And most methods do not distinguish the different kinetic nature (two-state folding or multi-state folding) of the proteins. Here we developed a method, SeqRate, to predict both protein folding kinetic type (two-state versus multi-state) and real-value folding rate using sequence length, amino acid composition, contact order, contact number, and secondary structure information predicted from only protein sequence with support vector machines. Results: We systematically studied the contributions of individual features to folding rate prediction. On a standard benchmark dataset, the accuracy of folding kinetic type classification is 80%. The Pearson correlation coefficient and the mean absolute difference between predicted and experimental folding rates (s^-1) in the base-10 logarithmic scale are 0.81 and 0.79 for two-state protein folders, and 0.80 and 0.68 for three-state protein folders. SeqRate is the first sequence-based method for protein folding type classification and its accuracy of folding rate prediction is improved over previous sequence-based methods. Its performance can be further enhanced with additional information, such as structure-based geometric contacts, as inputs. Conclusions: Both the web server and software for predicting folding rate are publicly available at http://casp.rnet.missouri.edu/fold_rate/index.html. PMID:20438647
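
    The two evaluation measures quoted in the abstract above, the Pearson correlation coefficient and the mean absolute difference on the base-10 logarithmic scale, can be computed as in this short sketch. The folding-rate values are made-up numbers for illustration and are not taken from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def mean_abs_diff(xs, ys):
    """Mean absolute difference between paired values."""
    return sum(abs(x - y) for x, y in zip(xs, ys)) / len(xs)


# Hypothetical log10 folding rates (predicted vs. experimental).
predicted = [2.0, 3.5, 1.0, 4.0]
experimental = [2.5, 3.0, 1.5, 4.5]

r = pearson(predicted, experimental)
mad = mean_abs_diff(predicted, experimental)
print(round(r, 3), mad)
```

    Because both lists are already log10 values, a mean absolute difference of 0.5 here would correspond to predictions that are off by roughly a factor of three in the raw rate.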

  14. Fourier-based classification of protein secondary structures.

    PubMed

    Shu, Jian-Jun; Yong, Kian Yan

    2017-04-15

    The correct prediction of protein secondary structures is one of the key issues in predicting the correct protein folded shape, which is used for determining gene function. Existing methods make use of amino acids properties as indices to classify protein secondary structures, but are faced with a significant number of misclassifications. The paper presents a technique for the classification of protein secondary structures based on protein "signal-plotting" and the use of the Fourier technique for digital signal processing. New indices are proposed to classify protein secondary structures by analyzing hydrophobicity profiles. The approach is simple and straightforward. Results show that more types of protein secondary structures can be classified by means of these newly-proposed indices.
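
    To make the idea of Fourier analysis of a hydrophobicity profile concrete, the sketch below converts a sequence into a numeric signal and measures its spectral power at a chosen period. It does not reproduce the indices proposed in the paper, which the abstract does not define. The Kyte-Doolittle hydropathy scale is swapped in as the per-residue signal, and the comparison of periods 2 and 3.6 residues relies on the commonly cited periodicities of amphipathic beta-strands and alpha-helices; the example sequence is artificial.

```python
import cmath

# Kyte-Doolittle hydropathy values (an assumed choice of scale).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydrophobicity_profile(seq):
    """Map each residue to its hydropathy value."""
    return [KYTE_DOOLITTLE[aa] for aa in seq]


def power_at_period(signal, period):
    """Spectral power of the mean-centred signal at `period` residues."""
    n = len(signal)
    mean = sum(signal) / n
    freq = 1.0 / period
    coeff = sum(
        (v - mean) * cmath.exp(-2j * cmath.pi * freq * k)
        for k, v in enumerate(signal)
    )
    return abs(coeff) ** 2 / n


# Artificial strand-like sequence alternating hydrophobic and charged residues.
profile = hydrophobicity_profile("VKVKVKVKVKVKVKVK")
p_strand = power_at_period(profile, 2.0)   # period typical of beta-strands
p_helix = power_at_period(profile, 3.6)    # period typical of alpha-helices
print(p_strand > p_helix)
```

    A classifier built on this idea would compare such powers across windows of a real sequence; any decision thresholds would have to be fitted to annotated structures.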

  15. Improving protein complex classification accuracy using amino acid composition profile.

    PubMed

    Huang, Chien-Hung; Chou, Szu-Yu; Ng, Ka-Lok

    2013-09-01

    Protein complex prediction approaches are based on the assumptions that complexes have dense protein-protein interactions and high functional similarity between their subunits. We investigated those assumptions by studying the subunits' interaction topology, sequence similarity and molecular function for human and yeast protein complexes. Inclusion of amino acids' physicochemical properties can provide better understanding of protein complex properties. Principal component analysis is carried out to determine the major features. Adopting amino acid composition profile information with the SVM classifier serves as an effective post-processing step for complexes classification. Improvement is based on primary sequence information only, which is easy to obtain.

  16. Efficient Feature Selection and Classification of Protein Sequence Data in Bioinformatics

    PubMed Central

    Faye, Ibrahima; Samir, Brahim Belhaouari; Md Said, Abas

    2014-01-01

    Bioinformatics has been an emerging area of research for the last three decades. The ultimate aims of bioinformatics were to store and manage the biological data, and develop and analyze computational tools to enhance their understanding. The size of data accumulated under various sequencing projects is increasing exponentially, which presents difficulties for the experimental methods. To reduce the gap between newly sequenced protein and proteins with known functions, many computational techniques involving classification and clustering algorithms were proposed in the past. The classification of protein sequences into existing superfamilies is helpful in predicting the structure and function of large amount of newly discovered proteins. The existing classification results are unsatisfactory due to a huge size of features obtained through various feature encoding methods. In this work, a statistical metric-based feature selection technique has been proposed in order to reduce the size of the extracted feature vector. The proposed method of protein classification shows significant improvement in terms of performance measure metrics: accuracy, sensitivity, specificity, recall, F-measure, and so forth. PMID:25045727

  17. Mining sequential patterns for protein fold recognition.

    PubMed

    Exarchos, Themis P; Papaloukas, Costas; Lampros, Christos; Fotiadis, Dimitrios I

    2008-02-01

    Protein data contain discriminative patterns that can be used in many beneficial applications if they are defined correctly. In this work sequential pattern mining (SPM) is utilized for sequence-based fold recognition. Protein classification in terms of fold recognition plays an important role in computational protein analysis, since it can contribute to the determination of the function of a protein whose structure is unknown. Specifically, one of the most efficient SPM algorithms, cSPADE, is employed for the analysis of protein sequence. A classifier uses the extracted sequential patterns to classify proteins in the appropriate fold category. For training and evaluating the proposed method we used the protein sequences from the Protein Data Bank and the annotation of the SCOP database. The method exhibited an overall accuracy of 25% in a classification problem with 36 candidate categories. The classification performance reaches up to 56% when the five most probable protein folds are considered.
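
    The abstract above reports accuracy both for the single best prediction and when the five most probable folds are considered. A top-k accuracy of this kind can be computed as in the sketch below; the ranked predictions and fold labels are hypothetical, and the function name is an assumption.

```python
def top_k_accuracy(ranked_predictions, true_labels, k):
    """Fraction of cases whose true label is among the k best-ranked predictions."""
    hits = sum(
        1
        for ranked, truth in zip(ranked_predictions, true_labels)
        if truth in ranked[:k]
    )
    return hits / len(true_labels)


# Hypothetical ranked fold predictions (most probable first) for four proteins.
ranked = [
    ["a.1", "b.2", "c.3"],
    ["b.2", "a.1", "c.3"],
    ["c.3", "b.2", "a.1"],
    ["a.1", "c.3", "b.2"],
]
truth = ["a.1", "a.1", "a.1", "b.2"]

top1 = top_k_accuracy(ranked, truth, 1)
top2 = top_k_accuracy(ranked, truth, 2)
print(top1, top2)
```

    With k = 1 this reduces to ordinary accuracy, and the value can only stay the same or rise as k grows.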

  18. Re-visiting protein-centric two-tier classification of existing DNA-protein complexes

    PubMed Central

    2012-01-01

    Background: Precise DNA-protein interactions play a most important and vital role in maintaining the normal physiological functioning of the cell, as they control many high fidelity cellular processes. Detailed study of the nature of these interactions has paved the way for understanding the mechanisms behind the biological processes in which they are involved. Earlier in 2000, a systematic classification of DNA-protein complexes based on the structural analysis of the proteins was proposed at two tiers, namely groups and families. With the advancement in the number and resolution of structures of DNA-protein complexes deposited in the Protein Data Bank, it is important to revisit the existing classification. Results: On the basis of the sequence analysis of DNA binding proteins, we have built upon the protein centric, two-tier classification of DNA-protein complexes by adding new members to existing families and making new families and groups. While classifying the new complexes, we also realised the emergence of new groups and families. The new group observed was where β-propeller was seen to interact with DNA. There were 34 SCOP folds which were observed to be present in the complexes of both old and new classifications, whereas 28 folds are present exclusively in the new complexes. Some new families noticed were NarL transcription factor, Z-α DNA binding proteins, Forkhead transcription factor, AP2 protein, Methyl CpG binding protein etc. Conclusions: Our results suggest that with the increasing availability of DNA-protein complexes in the Protein Data Bank, the number of families in the classification increased by approximately threefold. The folds present exclusively in newly classified complexes are suggestive of inclusion of proteins with new function in the new classification, the most populated of which are the folds responsible for DNA damage repair. The proposed re-visited classification can be used to perform genome-wide surveys in the genomes of interest for the presence of DNA-binding proteins. Further analysis of these complexes can aid in developing algorithms for identifying DNA-binding proteins and their family members from mere sequence information. PMID:22800292

  19. Re-visiting protein-centric two-tier classification of existing DNA-protein complexes.

    PubMed

    Malhotra, Sony; Sowdhamini, Ramanathan

    2012-07-16

    Precise DNA-protein interactions play a most important and vital role in maintaining the normal physiological functioning of the cell, as they control many high fidelity cellular processes. Detailed study of the nature of these interactions has paved the way for understanding the mechanisms behind the biological processes in which they are involved. Earlier in 2000, a systematic classification of DNA-protein complexes based on the structural analysis of the proteins was proposed at two tiers, namely groups and families. With the advancement in the number and resolution of structures of DNA-protein complexes deposited in the Protein Data Bank, it is important to revisit the existing classification. On the basis of the sequence analysis of DNA binding proteins, we have built upon the protein centric, two-tier classification of DNA-protein complexes by adding new members to existing families and making new families and groups. While classifying the new complexes, we also realised the emergence of new groups and families. The new group observed was where β-propeller was seen to interact with DNA. There were 34 SCOP folds which were observed to be present in the complexes of both old and new classifications, whereas 28 folds are present exclusively in the new complexes. Some new families noticed were NarL transcription factor, Z-α DNA binding proteins, Forkhead transcription factor, AP2 protein, Methyl CpG binding protein etc. Our results suggest that with the increasing availability of DNA-protein complexes in the Protein Data Bank, the number of families in the classification increased by approximately threefold. The folds present exclusively in newly classified complexes are suggestive of inclusion of proteins with new function in the new classification, the most populated of which are the folds responsible for DNA damage repair. The proposed re-visited classification can be used to perform genome-wide surveys in the genomes of interest for the presence of DNA-binding proteins. Further analysis of these complexes can aid in developing algorithms for identifying DNA-binding proteins and their family members from mere sequence information.

  20. SCOWLP classification: Structural comparison and analysis of protein binding regions

    PubMed Central

    Teyra, Joan; Paszkowski-Rogacz, Maciej; Anders, Gerd; Pisabarro, M Teresa

    2008-01-01

    Background: Detailed information about protein interactions is critical for our understanding of the principles governing protein recognition mechanisms. The structures of many proteins have been experimentally determined in complex with different ligands bound either in the same or different binding regions. Thus, the structural interactome requires the development of tools to classify protein binding regions. A proper classification may provide a general view of the regions that a protein uses to bind others and also facilitate a detailed comparative analysis of the interacting information for specific protein binding regions at atomic level. Such classification might be of potential use for deciphering protein interaction networks, understanding protein function, rational engineering and design. Description: Protein binding regions (PBRs) might be ideally described as well-defined separated regions that share no interacting residues with one another. However, PBRs are often irregular and discontinuous, and can share a wide range of interacting residues among them. The criteria used to define an individual binding region are often arbitrary and may differ from those used for other binding regions within a protein family. Therefore, the rationale behind protein interface classification should aim to fulfil the requirements of the analysis to be performed. We extract detailed interaction information of protein domains, peptides and interfacial solvent from the SCOWLP database and we classify the PBRs of each domain family. For this purpose, we define a similarity index based on the overlap of interacting residues mapped in pair-wise structural alignments. We perform our classification with agglomerative hierarchical clustering using the complete-linkage method. Our classification is calculated at different similarity cut-offs to allow flexibility in the analysis of PBRs, a feature especially interesting for those protein families with conflictive binding regions. The hierarchical classification of PBRs is implemented into the SCOWLP database and extends the SCOP classification with three additional family sub-levels: Binding Region, Interface and Contacting Domains. SCOWLP contains 9,334 binding regions distributed within 2,561 families. In 65% of the cases we observe families containing more than one binding region. Besides, 22% of the regions form complexes with more than one different protein family. Conclusion: The current SCOWLP classification and its web application represent a framework for the study of protein interfaces and comparative analysis of protein family binding regions. This comparison can be performed at atomic level and allows the user to study interactome conservation and variability. The new SCOWLP classification may be of great utility for reconstruction of protein complexes, understanding protein networks and ligand design. SCOWLP will be updated with every SCOP release. The web application is available at . PMID:18182098
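The clustering step described in this abstract (a residue-overlap similarity index followed by complete-linkage agglomerative clustering at a chosen cut-off) can be sketched as follows. This is a minimal illustration, not the SCOWLP implementation: the exact similarity index is not given in the abstract, so the sketch assumes shared residues divided by the size of the smaller interface, and the interface names and residue positions are hypothetical.

```python
from itertools import combinations

def overlap_similarity(a, b):
    """Assumed index: shared interacting residues / size of the smaller interface."""
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))

def complete_linkage(regions, cutoff):
    """Agglomerative clustering; two clusters merge only if their LEAST
    similar cross-cluster pair still reaches the similarity cut-off."""
    clusters = [[name] for name in regions]
    while len(clusters) > 1:
        best_pair, best_score = None, -1.0
        for i, j in combinations(range(len(clusters)), 2):
            # Complete linkage: cluster similarity is the minimum pairwise similarity.
            score = min(overlap_similarity(regions[x], regions[y])
                        for x in clusters[i] for y in clusters[j])
            if score > best_score:
                best_pair, best_score = (i, j), score
        if best_score < cutoff:
            break
        i, j = best_pair
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]

# Hypothetical interfaces: aligned residue positions that contact a partner.
interfaces = {
    "A": {10, 11, 12, 13},
    "B": {11, 12, 13, 14},
    "C": {40, 41, 42},
    "D": {41, 42, 43},
}

print(complete_linkage(interfaces, cutoff=0.5))
print(complete_linkage(interfaces, cutoff=0.7))
```

Running the clustering at several cut-offs, as in the two calls above, mirrors the abstract's point that different similarity thresholds give coarser or finer binding-region definitions.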

  1. Protein Structure Classification and Loop Modeling Using Multiple Ramachandran Distributions.

    PubMed

    Najibi, Seyed Morteza; Maadooliat, Mehdi; Zhou, Lan; Huang, Jianhua Z; Gao, Xin

    2017-01-01

    Recently, the study of protein structures using angular representations has attracted much attention among structural biologists. The main challenge is how to efficiently model the continuous conformational space of the protein structures based on the differences and similarities between different Ramachandran plots. Despite the presence of statistical methods for modeling angular data of proteins, there is still a substantial need for more sophisticated and faster statistical tools to model the large-scale circular datasets. To address this need, we have developed a nonparametric method for collective estimation of multiple bivariate density functions for a collection of populations of protein backbone angles. The proposed method takes into account the circular nature of the angular data using trigonometric spline which is more efficient compared to existing methods. This collective density estimation approach is widely applicable when there is a need to estimate multiple density functions from different populations with common features. Moreover, the coefficients of adaptive basis expansion for the fitted densities provide a low-dimensional representation that is useful for visualization, clustering, and classification of the densities. The proposed method provides a novel and unique perspective to two important and challenging problems in protein structure research: structure-based protein classification and angular-sampling-based protein loop structure prediction.

  2. Many local pattern texture features: which is better for image-based multilabel human protein subcellular localization classification?

    PubMed

    Yang, Fan; Xu, Ying-Ying; Shen, Hong-Bin

    2014-01-01

    Human protein subcellular location prediction can provide critical knowledge for understanding a protein's function. Since significant progress has been made on digital microscopy, automated image-based protein subcellular location classification is urgently needed. In this paper, we aim to investigate more representative image features that can be effectively used for dealing with the multilabel subcellular image samples. We prepared a large multilabel immunohistochemistry (IHC) image benchmark from the Human Protein Atlas database and tested the performance of different local texture features, including completed local binary pattern, local tetra pattern, and the standard local binary pattern feature. According to our experimental results from binary relevance multilabel machine learning models, the completed local binary pattern, and local tetra pattern are more discriminative for describing IHC images when compared to the traditional local binary pattern descriptor. The combination of these two novel local pattern features and the conventional global texture features is also studied. The enhanced performance of final binary relevance classification model trained on the combined feature space demonstrates that different features are complementary to each other and thus capable of improving the accuracy of classification.
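For readers unfamiliar with the baseline descriptor mentioned in this abstract, the standard local binary pattern can be sketched in a few lines. This is a generic textbook version on a grayscale image given as a list of rows; it is not the completed local binary pattern or local tetra pattern studied in the paper, and the neighbour ordering is an arbitrary choice made for this example.

```python
def lbp_code(img, r, c):
    """8-bit local binary pattern of the pixel at (r, c).
    A neighbour contributes a 1 bit when it is >= the centre pixel."""
    # Clockwise from the top-left neighbour (ordering is an assumption).
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    center = img[r][c]
    code = 0
    for bit, (dr, dc) in enumerate(offsets):
        if img[r + dr][c + dc] >= center:
            code |= 1 << bit
    return code

def lbp_histogram(img):
    """256-bin histogram of LBP codes over all interior pixels;
    this histogram is the texture feature vector."""
    hist = [0] * 256
    for r in range(1, len(img) - 1):
        for c in range(1, len(img[0]) - 1):
            hist[lbp_code(img, r, c)] += 1
    return hist

flat = [[5, 5, 5], [5, 5, 5], [5, 5, 5]]
peak = [[1, 1, 1], [1, 9, 1], [1, 1, 1]]
print(lbp_code(flat, 1, 1), lbp_code(peak, 1, 1))
```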

  3. Structural classification of proteins using texture descriptors extracted from the cellular automata image.

    PubMed

    Kavianpour, Hamidreza; Vasighi, Mahdi

    2017-02-01

    Nowadays, having knowledge about cellular attributes of proteins has an important role in pharmacy, medical science and molecular biology. These attributes are closely correlated with the function and three-dimensional structure of proteins. Knowledge of protein structural class is used by various methods for better understanding the protein functionality and folding patterns. Computational methods and intelligence systems can have an important role in performing structural classification of proteins. Most protein sequences are saved in databanks as characters and strings, and a numerical representation is essential for applying machine learning methods. In this work, a binary representation of protein sequences is introduced based on reduced amino acid alphabets according to surrounding hydrophobicity index. Many important features which are hidden in these long binary sequences can be clearly displayed through their cellular automata images. The extracted features from these images are used to build a classification model by support vector machine. Compared to previous studies on several benchmark datasets, the promising classification rates obtained by tenfold cross-validation imply that the current approach can help in revealing some inherent features deeply hidden in protein sequences and improve the quality of predicting protein structural class.

  4. Misclassification Errors in Unsupervised Classification Methods. Comparison Based on the Simulation of Targeted Proteomics Data

    PubMed Central

    Andreev, Victor P; Gillespie, Brenda W; Helfand, Brian T; Merion, Robert M

    2016-01-01

    Unsupervised classification methods are gaining acceptance in omics studies of complex common diseases, which are often vaguely defined and are likely collections of disease subtypes. Unsupervised classification based on the molecular signatures identified in omics studies has the potential to reflect molecular mechanisms of the subtypes of the disease and to lead to more targeted and successful interventions for the identified subtypes. Multiple classification algorithms exist but none is ideal for all types of data. Importantly, there are no established methods to estimate sample size in unsupervised classification (unlike power analysis in hypothesis testing). Therefore, we developed a simulation approach allowing comparison of misclassification errors and estimation of the required sample size for a given effect size, number, and correlation matrix of the differentially abundant proteins in targeted proteomics studies. All the experiments were performed in silico. The simulated data imitated the data expected from the study of the plasma of patients with lower urinary tract dysfunction with the aptamer proteomics assay Somascan (SomaLogic Inc, Boulder, CO), which targeted 1129 proteins, including 330 involved in inflammation, 180 in stress response, 80 in aging, etc. Three popular clustering methods (hierarchical, k-means, and k-medoids) were compared. K-means clustering performed much better for the simulated data than the other two methods and enabled classification with misclassification error below 5% in the simulated cohort of 100 patients based on the molecular signatures of 40 differentially abundant proteins (effect size 1.5) from among the 1129-protein panel. PMID:27524871
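Comparing clustering methods by misclassification error requires a convention for matching arbitrary cluster labels to the simulated subtypes. The abstract does not say how this was done; the sketch below assumes the common choice of scoring under the best one-to-one mapping of clusters to true classes, found by brute force, which is only practical for a handful of clusters.

```python
from itertools import permutations

def misclassification_error(true_labels, cluster_labels):
    """Fraction of samples misassigned under the best one-to-one
    mapping from cluster labels to true class labels (assumed convention)."""
    classes = sorted(set(true_labels))
    clusters = sorted(set(cluster_labels))
    if len(clusters) > len(classes):
        raise ValueError("sketch assumes no more clusters than true classes")
    best_correct = 0
    # Try every assignment of distinct true classes to the clusters.
    for perm in permutations(classes, len(clusters)):
        mapping = dict(zip(clusters, perm))
        correct = sum(mapping[c] == t
                      for t, c in zip(true_labels, cluster_labels))
        best_correct = max(best_correct, correct)
    return 1 - best_correct / len(true_labels)

# Cluster ids are swapped relative to the truth, but the partition is perfect.
print(misclassification_error([0, 0, 1, 1], [1, 1, 0, 0]))
# One sample lands in the wrong cluster.
print(misclassification_error([0, 0, 1, 1], [0, 1, 1, 1]))
```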

  5. A novel and efficient technique for identification and classification of GPCRs.

    PubMed

    Gupta, Ravi; Mittal, Ankush; Singh, Kuldip

    2008-07-01

    G-protein coupled receptors (GPCRs) play a vital role in different biological processes, such as regulation of growth, death, and metabolism of cells. GPCRs are the focus of a significant amount of current pharmaceutical research since they interact with more than 50% of prescription drugs. The dipeptide-based support vector machine (SVM) approach is the most accurate technique to identify and classify the GPCRs. However, this approach has two major disadvantages. First, the dimension of the dipeptide-based feature vector is equal to 400. The large dimension makes the classification task inefficient in both computation and memory. Second, it does not consider the biological properties of the protein sequence for identification and classification of GPCRs. In this paper, we present a novel-feature-based SVM classification technique. The novel features are derived by applying a wavelet-based time series analysis approach on protein sequences. The proposed feature space summarizes the variance information of seven important biological properties of amino acids in a protein sequence. In addition, the dimension of the feature vector for the proposed technique is equal to 35. Experiments were performed on GPCR protein sequences available at GPCRs Database. Our approach achieves an accuracy of 99.9%, 98.06%, 97.78%, and 94.08% for GPCR superfamily, families, subfamilies, and subsubfamilies (amine group), respectively, when evaluated using fivefold cross-validation. Further, an accuracy of 99.8%, 97.26%, and 97.84% was obtained when evaluated on unseen or recall datasets of GPCR superfamily, families, and subfamilies, respectively. Comparison with the dipeptide-based SVM technique shows the effectiveness of our approach.
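The 400-dimensional baseline that this abstract compares against follows from counting every ordered pair of the 20 standard amino acids (20 x 20 = 400). A minimal sketch of such a dipeptide composition vector is given below; the normalisation by the number of valid pairs and the skipping of non-standard residues are assumptions of this example, not details taken from the paper.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All ordered pairs of standard residues: 20 * 20 = 400 features.
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]

def dipeptide_composition(seq):
    """Relative frequency of each adjacent residue pair in seq.
    Pairs containing non-standard letters are skipped (assumption)."""
    counts = dict.fromkeys(DIPEPTIDES, 0)
    total = 0
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:
            counts[pair] += 1
            total += 1
    return [counts[d] / total if total else 0.0 for d in DIPEPTIDES]

vec = dipeptide_composition("ACAC")
print(len(vec), vec[DIPEPTIDES.index("AC")], vec[DIPEPTIDES.index("CA")])
```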

  6. HMMBinder: DNA-Binding Protein Prediction Using HMM Profile Based Features.

    PubMed

    Zaman, Rianon; Chowdhury, Shahana Yasmin; Rashid, Mahmood A; Sharma, Alok; Dehzangi, Abdollah; Shatabda, Swakkhar

    2017-01-01

    DNA-binding proteins often play an important role in various processes within the cell. Over the last decade, a wide range of classification algorithms and feature extraction techniques have been used to predict them. In this paper, we propose a novel DNA-binding protein prediction method called HMMBinder. HMMBinder uses monogram and bigram features extracted from the HMM profiles of the protein sequences. To the best of our knowledge, this is the first application of HMM profile based features for the DNA-binding protein prediction problem. We applied Support Vector Machines (SVM) as a classification technique in HMMBinder. Our method was tested on standard benchmark datasets. We experimentally show that our method outperforms the state-of-the-art methods found in the literature.
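The abstract does not define its monogram and bigram features. The sketch below uses one common formulation for profile matrices, stated here as an assumption: a monogram is the mean of each profile column, and a bigram is the average product of column a at position i and column b at position i+1. The toy profile has two columns for brevity, whereas a real HMM profile would have 20.

```python
def monogram(profile):
    """Mean of each column of an L x K profile matrix (K features)."""
    n, k = len(profile), len(profile[0])
    return [sum(row[a] for row in profile) / n for a in range(k)]

def bigram(profile):
    """Average of profile[i][a] * profile[i+1][b] over consecutive
    positions, for every ordered column pair (K * K features).
    Requires at least two rows."""
    n, k = len(profile), len(profile[0])
    feats = []
    for a in range(k):
        for b in range(k):
            total = sum(profile[i][a] * profile[i + 1][b]
                        for i in range(n - 1))
            feats.append(total / (n - 1))
    return feats

# Hypothetical 3-position profile over a 2-letter alphabet.
toy_profile = [[1.0, 0.0],
               [0.0, 1.0],
               [1.0, 0.0]]
print(monogram(toy_profile))
print(bigram(toy_profile))
```

With a 20-column profile this yields 20 monogram and 400 bigram features, which would then be concatenated and passed to the classifier.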

  7. SCOPe: Manual Curation and Artifact Removal in the Structural Classification of Proteins - extended Database.

    PubMed

    Chandonia, John-Marc; Fox, Naomi K; Brenner, Steven E

    2017-02-03

    SCOPe (Structural Classification of Proteins-extended, http://scop.berkeley.edu) is a database of relationships between protein structures that extends the Structural Classification of Proteins (SCOP) database. SCOP is an expert-curated ordering of domains from the majority of proteins of known structure in a hierarchy according to structural and evolutionary relationships. SCOPe classifies the majority of protein structures released since SCOP development concluded in 2009, using a combination of manual curation and highly precise automated tools, aiming to have the same accuracy as fully hand-curated SCOP releases. SCOPe also incorporates and updates the ASTRAL compendium, which provides several databases and tools to aid in the analysis of the sequences and structures of proteins classified in SCOPe. SCOPe continues high-quality manual classification of new superfamilies, a key feature of SCOP. Artifacts such as expression tags are now separated into their own class, in order to distinguish them from the homology-based annotations in the remainder of the SCOPe hierarchy. SCOPe 2.06 contains 77,439 Protein Data Bank entries, double the 38,221 structures classified in SCOP. Copyright © 2016 The Author(s). Published by Elsevier Ltd.. All rights reserved.

  8. Predicting protein complexes using a supervised learning method combined with local structural information.

    PubMed

    Dong, Yadong; Sun, Yongqi; Qin, Chao

    2018-01-01

    The existing protein complex detection methods can be broadly divided into two categories: unsupervised and supervised learning methods. Most of the unsupervised learning methods assume that protein complexes are in dense regions of protein-protein interaction (PPI) networks even though many true complexes are not dense subgraphs. Supervised learning methods utilize the informative properties of known complexes; they often extract features from existing complexes and then use the features to train a classification model. The trained model is used to guide the search process for new complexes. However, insufficient extracted features, noise in the PPI data and the incompleteness of complex data make the classification model imprecise. Consequently, the classification model is not sufficient for guiding the detection of complexes. Therefore, we propose a new robust score function that combines the classification model with local structural information. Based on the score function, we provide a search method that works both forwards and backwards. The results from experiments on six benchmark PPI datasets and three protein complex datasets show that our approach can achieve better performance compared with the state-of-the-art supervised, semi-supervised and unsupervised methods for protein complex detection, occasionally significantly outperforming such methods.

  9. Molecular Pathological Classification of Neurodegenerative Diseases: Turning towards Precision Medicine.

    PubMed

    Kovacs, Gabor G

    2016-02-02

    Neurodegenerative diseases (NDDs) are characterized by selective dysfunction and loss of neurons associated with pathologically altered proteins that deposit in the human brain but also in peripheral organs. These proteins and their biochemical modifications can be potentially targeted for therapy or used as biomarkers. Despite a plethora of modifications demonstrated for different neurodegeneration-related proteins, such as amyloid-β, prion protein, tau, α-synuclein, TAR DNA-binding protein 43 (TDP-43), or fused in sarcoma protein (FUS), molecular classification of NDDs relies on detailed morphological evaluation of protein deposits, their distribution in the brain, and their correlation to clinical symptoms together with specific genetic alterations. A further facet of the neuropathology-based classification is the fact that many protein deposits show a hierarchical involvement of brain regions. This has been shown for Alzheimer and Parkinson disease and some forms of tauopathies and TDP-43 proteinopathies. The present paper aims to summarize current molecular classification of NDDs, focusing on the most relevant biochemical and morphological aspects. Since the combination of proteinopathies is frequent, definition of novel clusters of patients with NDDs needs to be considered in the era of precision medicine. Optimally, neuropathological categorizing of NDDs should be translated into in vivo detectable biomarkers to support better prediction of prognosis and stratification of patients for therapy trials.

  10. Molecular Pathological Classification of Neurodegenerative Diseases: Turning towards Precision Medicine

    PubMed Central

    Kovacs, Gabor G.

    2016-01-01

    Neurodegenerative diseases (NDDs) are characterized by selective dysfunction and loss of neurons associated with pathologically altered proteins that deposit in the human brain but also in peripheral organs. These proteins and their biochemical modifications can be potentially targeted for therapy or used as biomarkers. Despite a plethora of modifications demonstrated for different neurodegeneration-related proteins, such as amyloid-β, prion protein, tau, α-synuclein, TAR DNA-binding protein 43 (TDP-43), or fused in sarcoma protein (FUS), molecular classification of NDDs relies on detailed morphological evaluation of protein deposits, their distribution in the brain, and their correlation to clinical symptoms together with specific genetic alterations. A further facet of the neuropathology-based classification is the fact that many protein deposits show a hierarchical involvement of brain regions. This has been shown for Alzheimer and Parkinson disease and some forms of tauopathies and TDP-43 proteinopathies. The present paper aims to summarize current molecular classification of NDDs, focusing on the most relevant biochemical and morphological aspects. Since the combination of proteinopathies is frequent, definition of novel clusters of patients with NDDs needs to be considered in the era of precision medicine. Optimally, neuropathological categorizing of NDDs should be translated into in vivo detectable biomarkers to support better prediction of prognosis and stratification of patients for therapy trials. PMID:26848654

  11. An alternative view of protein fold space.

    PubMed

    Shindyalov, I N; Bourne, P E

    2000-02-15

    Comparing and subsequently classifying protein structure information has received significant attention concurrent with the increase in the number of experimentally derived 3-dimensional structures. Classification schemes have focused on biological function found within protein domains and on structure classification based on topology. Here an alternative view is presented that groups substructures. Substructures are long (50-150 residue) highly repetitive near-contiguous pieces of polypeptide chain that occur frequently in a set of proteins from the PDB defined as structurally non-redundant over the complete polypeptide chain. The substructure classification is based on a previously reported Combinatorial Extension (CE) algorithm that provides a significantly different set of structure alignments than those previously described, having, for example, only a 40% overlap with FSSP. Qualitatively the algorithm provides longer contiguous aligned segments at the price of a slightly higher root-mean-square deviation (rmsd). Clustering these alignments gives a discrete and highly repetitive set of substructures not detectable by sequence similarity alone. In some cases different substructures represent all or different parts of well known folds, indicative of the Russian doll effect (the continuity of protein fold space). In other cases they fall into different structure and functional classifications. It is too early to determine whether these newly classified substructures represent new insights into the evolution of a structural framework important to many proteins. What is apparent from on-going work is that these substructures have the potential to be useful probes in finding remote sequence homology and in structure prediction studies. The characteristics of the complete all-by-all comparison of the polypeptide chains present in the PDB and details of the filtering procedure by pair-wise structure alignment that led to the emergent substructure gallery are discussed. Substructure classification, alignments, and tools to analyze them are available at http://cl.sdsc.edu/ce.html.

  12. A three-way approach for protein function classification

    PubMed Central

    2017-01-01

    The knowledge of protein functions plays an essential role in understanding biological cells and has a significant impact on human life in areas such as personalized medicine, better crops and improved therapeutic interventions. Due to expense and inherent difficulty of biological experiments, intelligent methods are generally relied upon for automatic assignment of functions to proteins. The technological advancements in the field of biology are improving our understanding of biological processes and are regularly resulting in new features and characteristics that better describe the role of proteins. It is inevitable to neglect and overlook these anticipated features in designing more effective classification techniques. A key issue in this context, that is not being sufficiently addressed, is how to build effective classification models and approaches for protein function prediction by incorporating and taking advantage from the ever evolving biological information. In this article, we propose a three-way decision making approach which provides provisions for seeking and incorporating future information. We considered probabilistic rough sets based models such as Game-Theoretic Rough Sets (GTRS) and Information-Theoretic Rough Sets (ITRS) for inducing three-way decisions. An architecture of protein functions classification with probabilistic rough sets based three-way decisions is proposed and explained. Experiments are carried out on Saccharomyces cerevisiae species dataset obtained from Uniprot database with the corresponding functional classes extracted from the Gene Ontology (GO) database. The results indicate that as the level of biological information increases, the number of deferred cases are reduced while maintaining similar level of accuracy. PMID:28234929

  13. A three-way approach for protein function classification.

    PubMed

    Ur Rehman, Hafeez; Azam, Nouman; Yao, JingTao; Benso, Alfredo

    2017-01-01

    The knowledge of protein functions plays an essential role in understanding biological cells and has a significant impact on human life in areas such as personalized medicine, better crops and improved therapeutic interventions. Due to expense and inherent difficulty of biological experiments, intelligent methods are generally relied upon for automatic assignment of functions to proteins. The technological advancements in the field of biology are improving our understanding of biological processes and are regularly resulting in new features and characteristics that better describe the role of proteins. It is inevitable to neglect and overlook these anticipated features in designing more effective classification techniques. A key issue in this context, that is not being sufficiently addressed, is how to build effective classification models and approaches for protein function prediction by incorporating and taking advantage from the ever evolving biological information. In this article, we propose a three-way decision making approach which provides provisions for seeking and incorporating future information. We considered probabilistic rough sets based models such as Game-Theoretic Rough Sets (GTRS) and Information-Theoretic Rough Sets (ITRS) for inducing three-way decisions. An architecture of protein functions classification with probabilistic rough sets based three-way decisions is proposed and explained. Experiments are carried out on Saccharomyces cerevisiae species dataset obtained from Uniprot database with the corresponding functional classes extracted from the Gene Ontology (GO) database. The results indicate that as the level of biological information increases, the number of deferred cases are reduced while maintaining similar level of accuracy.
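The three-way decision idea in this abstract is to accept, reject, or defer a functional assignment, with deferred cases revisited when more information arrives. A minimal sketch of the decision rule is shown below. The threshold pair is simply given as input here; in the paper the thresholds are induced by game-theoretic or information-theoretic rough set models, which this example does not implement, and the default values are arbitrary.

```python
def three_way_decision(p, alpha=0.7, beta=0.3):
    """Three-way rule on an estimated membership probability p:
    accept if p >= alpha, reject if p <= beta, otherwise defer.
    alpha and beta are illustrative defaults, not values from the paper."""
    if not 0.0 <= beta < alpha <= 1.0:
        raise ValueError("thresholds must satisfy 0 <= beta < alpha <= 1")
    if p >= alpha:
        return "accept"
    if p <= beta:
        return "reject"
    return "defer"

# Hypothetical probabilities that three proteins belong to a functional class.
for prob in (0.9, 0.5, 0.1):
    print(prob, three_way_decision(prob))
```

Narrowing the gap between the two thresholds shrinks the deferred region, which is the trade-off between committing early and waiting for further evidence.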

  14. A framework for classification of prokaryotic protein kinases.

    PubMed

    Tyagi, Nidhi; Anamika, Krishanpal; Srinivasan, Narayanaswamy

    2010-05-26

    An overwhelming majority of the Serine/Threonine protein kinases identified by gleaning archaeal and eubacterial genomes could not be classified into any of the well known Hanks and Hunter subfamilies of protein kinases. This is because the Hanks and Hunter classification scheme was developed from eukaryotic protein kinases, which are highly divergent from their prokaryotic homologues. A large dataset of prokaryotic Serine/Threonine protein kinases recognized from genomes of prokaryotes has been used to develop a classification framework for prokaryotic Ser/Thr protein kinases. We have used traditional sequence alignment and phylogenetic approaches and clustered the prokaryotic kinases, which represent 72 subfamilies with at least 4 members in each. Such a clustering enables classification of prokaryotic Ser/Thr kinases and it can be used as a framework to classify newly identified prokaryotic Ser/Thr kinases. After a series of searches in a comprehensive sequence database we recognized that 38 subfamilies of prokaryotic protein kinases are associated with a specific taxonomic level. For example, 4, 6 and 3 subfamilies have been identified that are currently specific to the phyla proteobacteria, cyanobacteria and actinobacteria, respectively. Similarly, subfamilies which are specific to an order, sub-order, class, family and genus have also been identified. In addition to these, we also identify organism-diverse subfamilies. Members of these clusters are from organisms of different taxonomic levels, such as archaea, bacteria, eukaryotes and viruses. Interestingly, the occurrence of several taxonomic level specific subfamilies of prokaryotic kinases contrasts with the classification of eukaryotic protein kinases, in which most of the popular subfamilies occur diversely in several eukaryotes. Many prokaryotic Ser/Thr kinases exhibit a wide variety of modular organizations, which indicates a degree of complexity and protein-protein interactions in the signaling pathways in these microbes.

  15. PCR and RFLP analyses based on the ribosomal protein operon

    USDA-ARS's Scientific Manuscript database

    Differentiation and classification of phytoplasmas have been primarily based on the highly conserved 16S rRNA gene. RFLP analysis of 16S rRNA gene sequences has identified 31 16S rRNA (16Sr) groups and more than 100 16Sr subgroups. Classification of phytoplasma strains can, however, become more refin...

  16. Support Vector Machine-based classification of protein folds using the structural properties of amino acid residues and amino acid residue pairs.

    PubMed

    Shamim, Mohammad Tabrez Anwar; Anwaruddin, Mohammad; Nagarajaram, H A

    2007-12-15

    Fold recognition is a key step in the protein structure discovery process, especially when traditional sequence comparison methods fail to yield convincing structural homologies. Although many methods have been developed for protein fold recognition, their accuracies remain low. This can be attributed to insufficient exploitation of fold discriminatory features. We have developed a new method for protein fold recognition using structural information of amino acid residues and amino acid residue pairs. Since protein fold recognition can be treated as a protein fold classification problem, we have developed a Support Vector Machine (SVM) based classifier approach that uses secondary structural state and solvent accessibility state frequencies of amino acids and amino acid pairs as feature vectors. Among the individual properties examined, secondary structural state frequencies of amino acids gave an overall accuracy of 65.2% for fold discrimination, which is better than the accuracy of any method reported so far in the literature. Combination of secondary structural state frequencies with solvent accessibility state frequencies of amino acids and amino acid pairs further improved the fold discrimination accuracy to more than 70%, which is approximately 8% higher than the best available method. In this study we have also tested, for the first time, an all-together multi-class method known as the Crammer and Singer method for protein fold classification. Our studies reveal that the three multi-class classification methods, namely one versus all, one versus one and the Crammer and Singer method, yield similar predictions. Dataset and stand-alone program are available upon request.
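One of the feature types named in this abstract, secondary structural state frequencies of amino acids, can be sketched as a count of (residue, state) combinations. The three-state alphabet H/E/C, the normalisation by sequence length, and the feature ordering are assumptions of this example; the abstract does not specify them, and the pair-based and solvent-accessibility features are not covered.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
SS_STATES = "HEC"  # helix, strand, coil (assumed three-state alphabet)

def ss_state_frequencies(seq, ss):
    """60-dimensional vector: frequency of each amino acid in each
    secondary-structure state, normalised by sequence length."""
    if len(seq) != len(ss):
        raise ValueError("sequence and structure strings must align")
    counts = {(a, s): 0 for a in AMINO_ACIDS for s in SS_STATES}
    for a, s in zip(seq, ss):
        if (a, s) in counts:  # skip non-standard residues or states
            counts[(a, s)] += 1
    n = len(seq)
    return [counts[(a, s)] / n if n else 0.0
            for a in AMINO_ACIDS for s in SS_STATES]

# Hypothetical three-residue fragment with its per-residue states.
features = ss_state_frequencies("AAC", "HHE")
print(len(features), features[0], features[4])
```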

  17. Classification of drug molecules considering their IC50 values using mixed-integer linear programming based hyper-boxes method.

    PubMed

    Armutlu, Pelin; Ozdemir, Muhittin E; Uney-Yuksektepe, Fadime; Kavakli, I Halil; Turkay, Metin

    2008-10-03

    A priori analysis of the activity of drugs on the target protein by computational approaches can be useful in narrowing down drug candidates for further experimental tests. Currently, there are a large number of computational methods that predict the activity of drugs on proteins. In this study, we approach the activity prediction problem as a classification problem and we aim to improve the classification accuracy by introducing an algorithm that combines partial least squares regression with a mixed-integer programming based hyper-boxes classification method, where drug molecules are classified as low active or high active regarding their binding activity (IC50 values) on target proteins. We also aim to determine the most significant molecular descriptors for the drug molecules. We first apply our approach by analyzing the activities of widely known inhibitor datasets including Acetylcholinesterase (ACHE), Benzodiazepine Receptor (BZR), Dihydrofolate Reductase (DHFR), Cyclooxygenase-2 (COX-2) with known IC50 values. The results at this stage proved that our approach consistently gives better classification accuracies compared to 63 other reported classification methods such as SVM and Naïve Bayes, where we were able to predict the experimentally determined IC50 values with a worst case accuracy of 96%. To further test the applicability of this approach, we created a dataset for Cytochrome P450 C17 inhibitors and then predicted their activities with 100% accuracy. Our results indicate that this approach can be utilized to predict the inhibitory effects of inhibitors based on their molecular descriptors. This approach will not only enhance the drug discovery process, but also save time and resources committed.

  18. GPU-Based Point Cloud Superpositioning for Structural Comparisons of Protein Binding Sites.

    PubMed

    Leinweber, Matthias; Fober, Thomas; Freisleben, Bernd

    2018-01-01

    In this paper, we present a novel approach to solve the labeled point cloud superpositioning problem for performing structural comparisons of protein binding sites. The solution is based on a parallel evolution strategy that operates on large populations and runs on GPU hardware. The proposed evolution strategy reduces the likelihood of getting stuck in a local optimum of the multimodal real-valued optimization problem represented by labeled point cloud superpositioning. The performance of the GPU-based parallel evolution strategy is compared to a previously proposed CPU-based sequential approach for labeled point cloud superpositioning, indicating that the GPU-based parallel evolution strategy leads to qualitatively better results and significantly shorter runtimes, with speed improvements of up to a factor of 1,500 for large populations. Binary classification tests based on the ATP, NADH, and FAD protein subsets of CavBase, a database containing putative binding sites, show average classification rate improvements from about 92 percent (CPU) to 96 percent (GPU). Further experiments indicate that the proposed GPU-based labeled point cloud superpositioning approach can be superior to traditional protein comparison approaches based on sequence alignments.

  19. SVM-Fold: a tool for discriminative multi-class protein fold and superfamily recognition

    PubMed Central

    Melvin, Iain; Ie, Eugene; Kuang, Rui; Weston, Jason; Stafford, William Noble; Leslie, Christina

    2007-01-01

    Background Predicting a protein's structural class from its amino acid sequence is a fundamental problem in computational biology. Much recent work has focused on developing new representations for protein sequences, called string kernels, for use with support vector machine (SVM) classifiers. However, while some of these approaches exhibit state-of-the-art performance at the binary protein classification problem, i.e. discriminating between a particular protein class and all other classes, few of these studies have addressed the real problem of multi-class superfamily or fold recognition. Moreover, there are only limited software tools and systems for SVM-based protein classification available to the bioinformatics community. Results We present a new multi-class SVM-based protein fold and superfamily recognition system and web server called SVM-Fold, which can be found at . Our system uses an efficient implementation of a state-of-the-art string kernel for sequence profiles, called the profile kernel, where the underlying feature representation is a histogram of inexact matching k-mer frequencies. We also employ a novel machine learning approach to solve the difficult multi-class problem of classifying a sequence of amino acids into one of many known protein structural classes. Binary one-vs-the-rest SVM classifiers that are trained to recognize individual structural classes yield prediction scores that are not comparable, so that standard "one-vs-all" classification fails to perform well. Moreover, SVMs for classes at different levels of the protein structural hierarchy may make useful predictions, but one-vs-all does not try to combine these multiple predictions. To deal with these problems, our method learns relative weights between one-vs-the-rest classifiers and encodes information about the protein structural hierarchy for multi-class prediction. In large-scale benchmark results based on the SCOP database, our code weighting approach significantly improves on the standard one-vs-all method for both the superfamily and fold prediction in the remote homology setting and on the fold recognition problem. Moreover, our code weight learning algorithm strongly outperforms nearest-neighbor methods based on PSI-BLAST in terms of prediction accuracy on every structure classification problem we consider. Conclusion By combining state-of-the-art SVM kernel methods with a novel multi-class algorithm, the SVM-Fold system delivers efficient and accurate protein fold and superfamily recognition. PMID:17570145
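
The two ideas in this abstract, a histogram of inexactly matching k-mers and a weighted combination of one-vs-the-rest scores, can be illustrated with a minimal Python sketch. It rests on stated assumptions: inexact matching is approximated here by a plain Hamming-mismatch neighbourhood rather than the profile-derived neighbourhoods of the profile kernel, the class weights are supplied by hand rather than learned, and the function names and the SCOP-style labels "a.1" and "b.1" are hypothetical.

```python
from collections import Counter
from itertools import combinations, product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def mismatch_neighbors(kmer, max_mismatches):
    """Yield every k-mer within Hamming distance max_mismatches of kmer."""
    seen = {kmer}
    yield kmer
    for n in range(1, max_mismatches + 1):
        for positions in combinations(range(len(kmer)), n):
            for letters in product(AMINO_ACIDS, repeat=n):
                candidate = list(kmer)
                for pos, letter in zip(positions, letters):
                    candidate[pos] = letter
                candidate = "".join(candidate)
                if candidate not in seen:
                    seen.add(candidate)
                    yield candidate


def inexact_kmer_histogram(sequence, k=3, max_mismatches=1):
    """Count, for each k-mer, how many sequence windows match it inexactly.

    Assumption: Hamming mismatches stand in for the profile-based
    neighbourhoods used by the profile kernel described in the abstract.
    """
    counts = Counter()
    for i in range(len(sequence) - k + 1):
        for neighbor in mismatch_neighbors(sequence[i:i + k], max_mismatches):
            counts[neighbor] += 1
    return counts


def predict_weighted_ovr(scores, weights):
    """Pick the class with the largest weighted one-vs-the-rest score.

    scores:  class label -> raw SVM margin for that class
    weights: class label -> relative weight (hand-set here, learned in the paper)
    """
    if scores.keys() != weights.keys():
        raise ValueError("scores and weights must cover the same classes")
    return max(scores, key=lambda label: weights[label] * scores[label])
```

With equal weights the call `predict_weighted_ovr({"a.1": 0.4, "b.1": 0.9}, {"a.1": 1.0, "b.1": 1.0})` picks the class with the largest raw margin; raising the weight of "a.1" to 3.0 changes the decision, which is the effect that rescaling non-comparable classifier outputs is meant to achieve.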

  20. 77 FR 44456 - Classification of Two Steroids, Prostanozol

    Federal Register 2010, 2011, 2012, 2013, 2014

    2012-07-30

    ... by positive nitrogen balance and protein metabolism, resulting in increases in protein synthesis and... activity by means of nitrogen balance and androgenic activity based on weight changes of the ventral...

  1. DOE Office of Scientific and Technical Information (OSTI.GOV)

    Cang, Zixuan; Mu, Lin; Wu, Kedi

    Protein function and dynamics are closely related to protein sequence and structure. However, predicting protein function and dynamics from sequence and structure remains a fundamental challenge in molecular biology. Protein classification, which is typically done by measuring the similarity between proteins based on sequence or physical information, is a crucial step toward understanding protein function and dynamics.

  2. Spider Neurotoxins, Short Linear Cationic Peptides and Venom Protein Classification Improved by an Automated Competition between Exhaustive Profile HMM Classifiers

    PubMed Central

    Koua, Dominique; Kuhn-Nentwig, Lucia

    2017-01-01

    Spider venoms are rich cocktails of bioactive peptides, proteins, and enzymes that have been intensively investigated over the years. To provide a better understanding of that richness, we propose a three-level family classification system for spider venom components. This classification is supported by an exhaustive set of 219 new profile hidden Markov models (HMMs) able to attribute a given peptide to its precise peptide type, family, and group. The proposed classification has the advantages of being totally independent of variable spider taxonomic names and of being able to evolve easily. In addition to the new classifiers, we introduce and demonstrate the efficiency of hmmcompete, a new standalone tool that monitors HMM-based family classification and, after post-processing the result, reports the best classifier when multiple models produce significant scores for a given peptide query. The combined use of hmmcompete and the new spider venom component-specific classifiers demonstrated 96% sensitivity in classifying all known spider toxins from the UniProtKB database. These tools are timely given the important classification needs caused by the increasing number of peptides and proteins generated by transcriptomic projects. PMID:28786958

  3. Signal peptide discrimination and cleavage site identification using SVM and NN.

    PubMed

    Kazemian, H B; Yusuf, S A; White, K

    2014-02-01

    About 15% of all proteins in a genome contain a signal peptide (SP) sequence, at the N-terminus, that targets the protein to intracellular secretory pathways. Once the protein is targeted correctly in the cell, the SP is cleaved, releasing the mature protein. Accurate prediction of the presence of these short amino-acid SP chains is crucial for modelling the topology of membrane proteins, since SP sequences can be confused with transmembrane domains due to similar composition of hydrophobic amino acids. This paper presents a cascaded Support Vector Machine (SVM)-Neural Network (NN) classification methodology for SP discrimination and cleavage site identification. The proposed method utilises a dual-phase classification approach using an SVM as the primary classifier to discriminate SP sequences from Non-SP. The methodology further employs NNs to predict the most suitable cleavage site candidates. In phase one, an SVM classifier uses hydrophobic propensities as the primary feature vector, extracted by symmetric sliding-window analysis of the amino-acid sequence, to discriminate SP from Non-SP. In phase two, an NN classifier uses asymmetric sliding-window sequence analysis to predict the cleavage site. The proposed SVM-NN method was tested using UniProt non-redundant datasets of eukaryotic and prokaryotic proteins with SP and Non-SP N-termini. Computer simulation results demonstrate an overall accuracy of 0.90 for SP and Non-SP discrimination based on Matthews Correlation Coefficient (MCC) tests using SVM. For SP cleavage site prediction, the overall accuracy is 91.5% based on cross-validation tests using the novel SVM-NN model. © 2013 Published by Elsevier Ltd.
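
As a rough illustration of the phase-one features and of the MCC measure mentioned in this abstract, the following Python sketch computes a mean hydropathy value in a symmetric window around each residue and evaluates MCC from confusion-matrix counts. The choice of the Kyte-Doolittle hydropathy scale, the window half-width, the truncation of windows at the sequence ends, and the function names are all assumptions made for the example; the abstract does not specify them.

```python
import math

# Kyte-Doolittle hydropathy scale (assumed here; the abstract only says
# "hydrophobic propensities").
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def window_hydrophobicity(sequence, half_width=3):
    """Mean hydropathy in a symmetric window centred on each residue.

    Windows are truncated at the sequence ends (a choice made for this sketch).
    """
    values = [KYTE_DOOLITTLE[aa] for aa in sequence]
    features = []
    for i in range(len(values)):
        lo = max(0, i - half_width)
        hi = min(len(values), i + half_width + 1)
        window = values[lo:hi]
        features.append(sum(window) / len(window))
    return features


def matthews_corrcoef(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # conventional value when any marginal total is zero
    return (tp * tn - fp * fn) / denom
```

The per-residue values returned by `window_hydrophobicity` would form the feature vector passed to a classifier; an asymmetric window, as used in phase two, would simply take different extents to the left and right of the centre residue.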

  4. Disease gene classification with metagraph representations.

    PubMed

    Kircali Ata, Sezin; Fang, Yuan; Wu, Min; Li, Xiao-Li; Xiao, Xiaokui

    2017-12-01

    Protein-protein interaction (PPI) networks play an important role in studying the functional roles of proteins, including their association with diseases. However, protein interaction networks are not sufficient without the support of additional biological knowledge for proteins such as their molecular functions and biological processes. To complement and enrich PPI networks, we propose to exploit biological properties of individual proteins. More specifically, we integrate keywords describing protein properties into the PPI network, and construct a novel PPI-Keywords (PPIK) network consisting of both proteins and keywords as two different types of nodes. As disease proteins tend to have similar topological characteristics on the PPIK network, we further propose to represent proteins with metagraphs. Unlike a traditional network motif or subgraph, a metagraph can capture a particular topological arrangement involving the interactions/associations between both proteins and keywords. Based on the novel metagraph representations for proteins, we further build classifiers for disease protein classification through supervised learning. Our experiments on three different PPI databases demonstrate that the proposed method consistently improves disease protein prediction across various classifiers, by 15.3% in AUC on average. It outperforms the baselines including the diffusion-based methods (e.g., RWR) and the module-based methods by 13.8-32.9% for overall disease protein prediction. For predicting breast cancer genes, it outperforms RWR, PRINCE and the module-based baselines by 6.6-14.2%. Finally, our predictions also turn out to have better correlations with literature findings from PubMed. Copyright © 2017 Elsevier Inc. All rights reserved.

  5. Towards quantitative classification of folded proteins in terms of elementary functions.

    PubMed

    Hu, Shuangwei; Krokhotin, Andrei; Niemi, Antti J; Peng, Xubiao

    2011-04-01

    A comparative classification scheme provides a good basis for several approaches to understand proteins, including prediction of relations between their structure and biological function. But it remains a challenge to combine a classification scheme that describes a protein starting from its well-organized secondary structures and often involves direct human involvement, with an atomary-level physics-based approach where a protein is fundamentally nothing more than an ensemble of mutually interacting carbon, hydrogen, oxygen, and nitrogen atoms. In order to bridge these two complementary approaches to proteins, conceptually novel tools need to be introduced. Here we explain how an approach toward geometric characterization of entire folded proteins can be based on a single explicit elementary function that is familiar from nonlinear physical systems where it is known as the kink soliton. Our approach enables the conversion of hierarchical structural information into a quantitative form that allows for a folded protein to be characterized in terms of a small number of global parameters that are in principle computable from atomary-level considerations. As an example we describe in detail how the native fold of the myoglobin 1M6C emerges from a combination of kink solitons with a very high atomary-level accuracy. We also verify that our approach describes longer loops and loops connecting α helices with β strands, with the same overall accuracy. ©2011 American Physical Society

  6. PANDORA: keyword-based analysis of protein sets by integration of annotation sources.

    PubMed

    Kaplan, Noam; Vaaknin, Avishay; Linial, Michal

    2003-10-01

    Recent advances in high-throughput methods and the application of computational tools for automatic classification of proteins have made it possible to carry out large-scale proteomic analyses. Biological analysis and interpretation of sets of proteins is a time-consuming undertaking carried out manually by experts. We have developed PANDORA (Protein ANnotation Diagram ORiented Analysis), a web-based tool that provides an automatic representation of the biological knowledge associated with any set of proteins. PANDORA uses a unique approach of keyword-based graphical analysis that focuses on detecting subsets of proteins that share unique biological properties and the intersections of such sets. PANDORA currently supports SwissProt keywords, NCBI Taxonomy, InterPro entries and the hierarchical classification terms from ENZYME, SCOP and GO databases. The integrated study of several annotation sources simultaneously allows a representation of biological relations of structure, function, cellular location, taxonomy, domains and motifs. PANDORA is also integrated into the ProtoNet system, thus allowing testing thousands of automatically generated clusters. We illustrate how PANDORA enhances the biological understanding of large, non-uniform sets of proteins originating from experimental and computational sources, without the need for prior biological knowledge on individual proteins.

  7. Support Vector Machines Trained with Evolutionary Algorithms Employing Kernel Adatron for Large Scale Classification of Protein Structures.

    PubMed

    Arana-Daniel, Nancy; Gallegos, Alberto A; López-Franco, Carlos; Alanís, Alma Y; Morales, Jacob; López-Franco, Adriana

    2016-01-01

    With the increasing power of computers, the amount of data that can be processed in small periods of time has grown exponentially, as has the importance of classifying large-scale data efficiently. Support vector machines have shown good results classifying large amounts of high-dimensional data, such as data generated by protein structure prediction, spam recognition, medical diagnosis, optical character recognition and text classification, etc. Most state of the art approaches for large-scale learning use traditional optimization methods, such as quadratic programming or gradient descent, which makes the use of evolutionary algorithms for training support vector machines an area to be explored. The present paper proposes an approach that is simple to implement based on evolutionary algorithms and Kernel-Adatron for solving large-scale classification problems, focusing on protein structure prediction. The functional properties of proteins depend upon their three-dimensional structures. Knowing the structures of proteins is crucial for biology and can lead to improvements in areas such as medicine, agriculture and biofuels.

  8. Investigating the Importance of the Pocket-estimation Method in Pocket-based Approaches: An Illustration Using Pocket-ligand Classification.

    PubMed

    Caumes, Géraldine; Borrel, Alexandre; Abi Hussein, Hiba; Camproux, Anne-Claude; Regad, Leslie

    2017-09-01

    Small molecules interact with their protein target on surface cavities known as binding pockets. Pocket-based approaches are very useful in all of the phases of drug design. Their first step is estimating the binding pocket based on protein structure. The available pocket-estimation methods produce different pockets for the same target. The aim of this work is to investigate the effects of different pocket-estimation methods on the results of pocket-based approaches. We focused on the effect of three pocket-estimation methods on a pocket-ligand (PL) classification. This pocket-based approach is useful for understanding the correspondence between the pocket and ligand spaces and to develop pharmacological profiling models. We found pocket-estimation methods yield different binding pockets in terms of boundaries and properties. These differences are responsible for the variation in the PL classification results that can have an impact on the detected correspondence between pocket and ligand profiles. Thus, we highlighted the importance of the pocket-estimation method choice in pocket-based approaches. © 2017 Wiley-VCH Verlag GmbH & Co. KGaA, Weinheim.

  9. Protein Information Resource: a community resource for expert annotation of protein data

    PubMed Central

    Barker, Winona C.; Garavelli, John S.; Hou, Zhenglin; Huang, Hongzhan; Ledley, Robert S.; McGarvey, Peter B.; Mewes, Hans-Werner; Orcutt, Bruce C.; Pfeiffer, Friedhelm; Tsugita, Akira; Vinayaka, C. R.; Xiao, Chunlin; Yeh, Lai-Su L.; Wu, Cathy

    2001-01-01

    The Protein Information Resource, in collaboration with the Munich Information Center for Protein Sequences (MIPS) and the Japan International Protein Information Database (JIPID), produces the most comprehensive and expertly annotated protein sequence database in the public domain, the PIR-International Protein Sequence Database. To provide timely and high quality annotation and promote database interoperability, the PIR-International employs rule-based and classification-driven procedures based on controlled vocabulary and standard nomenclature and includes status tags to distinguish experimentally determined from predicted protein features. The database contains about 200 000 non-redundant protein sequences, which are classified into families and superfamilies and their domains and motifs identified. Entries are extensively cross-referenced to other sequence, classification, genome, structure and activity databases. The PIR web site features search engines that use sequence similarity and database annotation to facilitate the analysis and functional identification of proteins. The PIR-International databases and search tools are accessible on the PIR web site at http://pir.georgetown.edu/ and at the MIPS web site at http://www.mips.biochem.mpg.de. The PIR-International Protein Sequence Database and other files are also available by FTP. PMID:11125041

  10. 3D Complex: A Structural Classification of Protein Complexes

    PubMed Central

    Levy, Emmanuel D; Pereira-Leal, Jose B; Chothia, Cyrus; Teichmann, Sarah A

    2006-01-01

    Most of the proteins in a cell assemble into complexes to carry out their function. It is therefore crucial to understand the physicochemical properties as well as the evolution of interactions between proteins. The Protein Data Bank represents an important source of information for such studies, because more than half of the structures are homo- or heteromeric protein complexes. Here we propose the first hierarchical classification of whole protein complexes of known 3-D structure, based on representing their fundamental structural features as a graph. This classification provides the first overview of all the complexes in the Protein Data Bank and allows nonredundant sets to be derived at different levels of detail. This reveals that between one-half and two-thirds of known structures are multimeric, depending on the level of redundancy accepted. We also analyse the structures in terms of the topological arrangement of their subunits and find that they form a small number of arrangements compared with all theoretically possible ones. This is because most complexes contain four subunits or less, and the large majority are homomeric. In addition, there is a strong tendency for symmetry in complexes, even for heteromeric complexes. Finally, through comparison of Biological Units in the Protein Data Bank with the Protein Quaternary Structure database, we identified many possible errors in quaternary structure assignments. Our classification, available as a database and Web server at http://www.3Dcomplex.org, will be a starting point for future work aimed at understanding the structure and evolution of protein complexes. PMID:17112313

  11. 76 FR 72355 - Classification of Two Steroids, Prostanozol and Methasterone, as Schedule III Anabolic Steroids...

    Federal Register 2010, 2011, 2012, 2013, 2014

    2011-11-23

    ... positive nitrogen balance and protein metabolism, resulting in increases in protein synthesis and lean body... nitrogen balance and androgenic activity based on weight changes of the ventral prostate of prostanozol...

  12. T-RMSD: a web server for automated fine-grained protein structural classification.

    PubMed

    Magis, Cedrik; Di Tommaso, Paolo; Notredame, Cedric

    2013-07-01

    This article introduces the T-RMSD web server (tree-based on root-mean-square deviation), a service allowing the online computation of structure-based protein classification. It has been developed to address the relation between structural and functional similarity in proteins, and it allows a fine-grained structural clustering of a given protein family or group of structurally related proteins using distance RMSD (dRMSD) variations. These distances are computed between all pairs of equivalent residues, as defined by the ungapped columns within a given multiple sequence alignment. Using these generated distance matrices (one per equivalent position), T-RMSD produces a structural tree with support values for each cluster node, reminiscent of bootstrap values. These values, associated with the tree topology, allow a quantitative estimate of structural distances between proteins or group of proteins defined by the tree topology. The clusters thus defined have been shown to be structurally and functionally informative. The T-RMSD web server is a free website open to all users and available at http://tcoffee.crg.cat/apps/tcoffee/do:trmsd.
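
To make the distance measure concrete, here is a minimal Python sketch of the standard distance RMSD between two structures over a set of equivalent residues. It is an illustration under assumptions, not the server's procedure: it computes one global dRMSD from all residue pairs, whereas the abstract describes one distance matrix per equivalent position, and the function name and the coordinate format (one 3-D point per residue, for example the C-alpha atom) are hypothetical.

```python
import math
from itertools import combinations


def drmsd(coords_a, coords_b):
    """Distance RMSD between two structures over equivalent residues.

    coords_a, coords_b: equal-length sequences of (x, y, z) tuples, where
    index i in both lists refers to the same aligned (ungapped) column.
    Intra-molecular distances are compared, so no superposition is needed.
    """
    if len(coords_a) != len(coords_b):
        raise ValueError("structures must have the same number of residues")
    if len(coords_a) < 2:
        raise ValueError("at least two equivalent residues are required")
    squared_diffs = []
    for i, j in combinations(range(len(coords_a)), 2):
        d_a = math.dist(coords_a[i], coords_a[j])
        d_b = math.dist(coords_b[i], coords_b[j])
        squared_diffs.append((d_a - d_b) ** 2)
    return math.sqrt(sum(squared_diffs) / len(squared_diffs))
```

Because only internal distances are compared, translating or rotating one structure leaves the result unchanged, which is why distance-based measures of this kind do not require a prior structural superposition.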

  13. T-RMSD: a web server for automated fine-grained protein structural classification

    PubMed Central

    Magis, Cedrik; Di Tommaso, Paolo; Notredame, Cedric

    2013-01-01

    This article introduces the T-RMSD web server (tree-based on root-mean-square deviation), a service allowing the online computation of structure-based protein classification. It has been developed to address the relation between structural and functional similarity in proteins, and it allows a fine-grained structural clustering of a given protein family or group of structurally related proteins using distance RMSD (dRMSD) variations. These distances are computed between all pairs of equivalent residues, as defined by the ungapped columns within a given multiple sequence alignment. Using these generated distance matrices (one per equivalent position), T-RMSD produces a structural tree with support values for each cluster node, reminiscent of bootstrap values. These values, associated with the tree topology, allow a quantitative estimate of structural distances between proteins or group of proteins defined by the tree topology. The clusters thus defined have been shown to be structurally and functionally informative. The T-RMSD web server is a free website open to all users and available at http://tcoffee.crg.cat/apps/tcoffee/do:trmsd. PMID:23716642

  14. Hierarchical learning architecture with automatic feature selection for multiclass protein fold classification.

    PubMed

    Huang, Chuen-Der; Lin, Chin-Teng; Pal, Nikhil Ranjan

    2003-12-01

    The structure classification of proteins plays a very important role in bioinformatics, since the relationships and characteristics among those known proteins can be exploited to predict the structure of new proteins. The success of a classification system depends heavily on two things: the tools being used and the features considered. In bioinformatics applications, the role of appropriate features has not received adequate attention. In this investigation we use three novel ideas for multiclass protein fold classification. First, we use the gating neural network, where each input node is associated with a gate. This network can select important features in an online manner as learning proceeds. At the beginning of the training, all gates are almost closed, i.e., no feature is allowed to enter the network. Through the training, gates corresponding to good features are completely opened while gates corresponding to bad features are closed more tightly, and some gates may be partially open. The second novel idea is to use a hierarchical learning architecture (HLA). The classifier in the first level of HLA classifies the protein features into four major classes: all alpha, all beta, alpha + beta, and alpha/beta. In the next level we have another set of classifiers, which further classify the protein features into 27 folds. The third novel idea is to induce the indirect coding features from the amino-acid composition sequence of proteins based on the N-gram concept. This provides us with more representative and discriminative new local features of protein sequences for multiclass protein fold classification. The proposed HLA with new indirect coding features increases the protein fold classification accuracy by about 12%. Moreover, the gating neural network is found to reduce the number of features drastically. Using only half of the original features, as selected by the gating neural network, yields test accuracy comparable to that obtained with all the original features. The gating mechanism also helps us to get a better insight into the folding process of proteins. For example, by tracking the evolution of different gates we can find which characteristics (features) of the data are more important for the folding process. And, of course, it also reduces the computation time.

  15. Segmentation and classification of cell cycle phases in fluorescence imaging.

    PubMed

    Ersoy, Ilker; Bunyak, Filiz; Chagin, Vadim; Cardoso, M Christina; Palaniappan, Kannappan

    2009-01-01

    Current chemical biology methods for studying spatiotemporal correlation between biochemical networks and cell cycle phase progression in live-cells typically use fluorescence-based imaging of fusion proteins. Stable cell lines expressing fluorescently tagged protein GFP-PCNA produce rich, dynamically varying sub-cellular foci patterns characterizing the cell cycle phases, including the progress during the S-phase. Variable fluorescence patterns, drastic changes in SNR, shape and position changes and abundance of touching cells require sophisticated algorithms for reliable automatic segmentation and cell cycle classification. We extend the recently proposed graph partitioning active contours (GPAC) for fluorescence-based nucleus segmentation using regional density functions and dramatically improve its efficiency, making it scalable for high content microscopy imaging. We utilize surface shape properties of GFP-PCNA intensity field to obtain descriptors of foci patterns and perform automated cell cycle phase classification, and give quantitative performance by comparing our results to manually labeled data.

  16. Mining for class-specific motifs in protein sequence classification

    PubMed Central

    2013-01-01

    Background In protein sequence classification, identification of the sequence motifs or n-grams that can precisely discriminate between classes is a more interesting scientific question than the classification itself. A number of classification methods aim at accurate classification but fail to explain which sequence features indeed contribute to the accuracy. We hypothesize that sequences in lower denominations (n-grams) can be used to explore the sequence landscape and to identify class-specific motifs that discriminate between classes during classification. Discriminative n-grams are short peptide sequences that are highly frequent in one class but are either minimally present or absent in other classes. In this study, we present a new substitution-based scoring function for identifying discriminative n-grams that are highly specific to a class. Results We present a scoring function based on discriminative n-grams that can effectively discriminate between classes. The scoring function initially harvests the entire set of 4- to 8-grams from the protein sequences of different classes in the dataset. Similar n-grams of the same size are combined to form new n-grams, where the similarity is defined by positive amino acid substitution scores in the BLOSUM62 matrix. Substitution has resulted in a large increase in the number of discriminatory n-grams harvested. Due to the unbalanced nature of the dataset, the frequencies of the n-grams are normalized using a dampening factor, which gives more weight to the n-grams that appear in fewer classes and vice versa. After the n-grams are normalized, the scoring function identifies discriminative 4- to 8-grams for each class that are frequent enough to be above a selection threshold. By mapping these discriminative n-grams back to the protein sequences, we obtained contiguous n-grams that represent short class-specific motifs in protein sequences. Our method fared well compared to an existing motif finding method known as Wordspy. We have validated our enriched set of class-specific motifs against the functionally important motifs obtained from the NLSdb, Prosite and ELM databases. We demonstrate that this method is very generic and can thus be widely applied to detect class-specific motifs in many protein sequence classification tasks. Conclusion The proposed scoring function and methodology are able to identify class-specific motifs using discriminative n-grams derived from the protein sequences. The implementation of amino acid substitution scores for similarity detection and the dampening factor used to normalize the unbalanced datasets have a significant effect on the performance of the scoring function. Our multipronged validation tests demonstrate that this method can detect class-specific motifs from a wide variety of protein sequence classes with a potential application to detecting proteome-specific motifs of different organisms. PMID:23496846
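
The harvesting and normalization steps described in this abstract can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the BLOSUM62-based merging of similar n-grams is omitted, the dampening factor is modelled as an inverse-class-frequency term of the form log(1 + classes / classes containing the n-gram), which is a stand-in rather than the paper's formula, and the function names are hypothetical.

```python
import math
from collections import Counter


def harvest_ngrams(sequences, n_min=4, n_max=8):
    """Count every contiguous n-gram of length n_min..n_max in the sequences."""
    counts = Counter()
    for seq in sequences:
        for n in range(n_min, n_max + 1):
            for i in range(len(seq) - n + 1):
                counts[seq[i:i + n]] += 1
    return counts


def discriminative_scores(class_sequences, n_min=4, n_max=8):
    """Score n-grams per class, favouring those seen in few classes.

    class_sequences: class label -> list of protein sequences.
    Returns: class label -> {n-gram: score}.
    Assumption: the dampening term below is an IDF-like stand-in for the
    dampening factor mentioned in the abstract.
    """
    per_class = {
        label: harvest_ngrams(seqs, n_min, n_max)
        for label, seqs in class_sequences.items()
    }
    num_classes = len(per_class)
    # Number of classes in which each n-gram occurs at least once.
    class_spread = Counter()
    for counts in per_class.values():
        class_spread.update(counts.keys())
    scores = {}
    for label, counts in per_class.items():
        total = sum(counts.values()) or 1  # relative frequency offsets class size
        scores[label] = {
            gram: (count / total) * math.log(1 + num_classes / class_spread[gram])
            for gram, count in counts.items()
        }
    return scores
```

N-grams whose score exceeds a chosen threshold in one class would then be treated as candidate class-specific motifs; the threshold itself is not given in the abstract and would have to be tuned on the dataset at hand.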

  17. NPIDB: Nucleic acid-Protein Interaction DataBase.

    PubMed

    Kirsanov, Dmitry D; Zanegina, Olga N; Aksianov, Evgeniy A; Spirin, Sergei A; Karyagina, Anna S; Alexeevski, Andrei V

    2013-01-01

    The Nucleic acid-Protein Interaction DataBase (http://npidb.belozersky.msu.ru/) contains information derived from structures of DNA-protein and RNA-protein complexes extracted from the Protein Data Bank (3846 complexes in October 2012). It provides a web interface and a set of tools for extracting biologically meaningful characteristics of nucleoprotein complexes. The content of the database is updated weekly. The current version of the Nucleic acid-Protein Interaction DataBase is an upgrade of the version published in 2007. The improvements include a new web interface, new tools for calculation of intermolecular interactions, a classification of SCOP families that contain DNA-binding protein domains, and data on conserved water molecules on the DNA-protein interface.

  18. Analysis of A Drug Target-based Classification System using Molecular Descriptors.

    PubMed

    Lu, Jing; Zhang, Pin; Bi, Yi; Luo, Xiaomin

    2016-01-01

    Drug-target interaction is an important topic in drug discovery and drug repositioning. The KEGG database offers drug annotation and classification using a target-based classification system. In this study, we investigated five target-based classes: (I) G protein-coupled receptors; (II) Nuclear receptors; (III) Ion channels; (IV) Enzymes; (V) Pathogens, using molecular descriptors to represent each drug compound. Two popular feature selection methods, maximum relevance minimum redundancy and incremental feature selection, were adopted to extract the important descriptors. An optimal prediction model based on the nearest neighbor algorithm was then constructed, which achieved the best result in identifying drug target-based classes. Finally, some key descriptors were discussed to uncover their important roles in the identification of drug-target classes.

  19. Computational intelligence techniques for biological data mining: An overview

    NASA Astrophysics Data System (ADS)

    Faye, Ibrahima; Iqbal, Muhammad Javed; Said, Abas Md; Samir, Brahim Belhaouari

    2014-10-01

    Computational techniques have been successfully utilized for highly accurate analysis and modeling of multifaceted, raw biological data gathered from various genome sequencing projects. These techniques are proving much more effective at overcoming the limitations of traditional in-vitro experiments on the constantly increasing sequence data. The most critical problems that have caught researchers' attention include, but are not limited to: accurate structure and function prediction of unknown proteins, protein subcellular localization prediction, finding protein-protein interactions, protein fold recognition, and analysis of microarray gene expression data. To solve these problems, various classification and clustering techniques using machine learning have been extensively used in the published literature. These techniques include neural network algorithms, genetic algorithms, fuzzy ARTMAP, K-Means, K-NN, SVM, rough set classifiers, decision trees and HMM-based algorithms. Major difficulties in applying the above algorithms include the limitations of previous feature encoding and selection methods in extracting the best features, increasing classification accuracy and decreasing the running-time overhead of the learning algorithms. The application of this research would be potentially useful in drug design and in the diagnosis of some diseases. This paper presents a concise overview of the well-known protein classification techniques.

  20. NOXclass: prediction of protein-protein interaction types.

    PubMed

    Zhu, Hongbo; Domingues, Francisco S; Sommer, Ingolf; Lengauer, Thomas

    2006-01-19

    Structural models determined by X-ray crystallography play a central role in understanding protein-protein interactions at the molecular level. Interpretation of these models requires the distinction between non-specific crystal packing contacts and biologically relevant interactions. This has been investigated previously and classification approaches have been proposed. However, less attention has been devoted to distinguishing different types of biological interactions. These interactions are classified as obligate and non-obligate according to the effect of the complex formation on the stability of the protomers. So far no automatic classification methods for distinguishing obligate, non-obligate and crystal packing interactions have been made available. Six interface properties have been investigated on a dataset of 243 protein interactions. The six properties have been combined using a support vector machine algorithm, resulting in NOXclass, a classifier for distinguishing obligate, non-obligate and crystal packing interactions. We achieve an accuracy of 91.8% for the classification of these three types of interactions using a leave-one-out cross-validation procedure. NOXclass allows the interpretation and analysis of protein quaternary structures. In particular, it generates testable hypotheses regarding the nature of protein-protein interactions, when experimental results are not available. We expect this server will benefit the users of protein structural models, as well as protein crystallographers and NMR spectroscopists. A web server based on the method and the datasets used in this study are available at http://noxclass.bioinf.mpi-inf.mpg.de/.

  1. Adenosine monophosphate-activated protein kinase-based classification of diabetes pharmacotherapy

    PubMed Central

    Dutta, D; Kalra, S; Sharma, M

    2017-01-01

    The current classification of both diabetes and antidiabetes medication is complex, preventing a treating physician from choosing the most appropriate treatment for an individual patient, sometimes resulting in patient-drug mismatch. We propose a novel, simple systematic classification of drugs, based on their effect on adenosine monophosphate-activated protein kinase (AMPK). AMPK is the master regulator of energy metabolism, an energy sensor, activated when cellular energy levels are low, resulting in activation of catabolic processes and inactivation of anabolic processes, having a beneficial effect on glycemia in diabetes. This listing of drugs makes it easier for students and practitioners to analyze drug profiles and match them with patient requirements. It also facilitates choice of rational combinations, with complementary modes of action. Drugs are classified as stimulators, inhibitors, mixed action, possible action, and no action on AMPK activity. Metformin and glitazones are pure stimulators of AMPK. Incretin-based therapies have a mixed action on AMPK. Sulfonylureas either inhibit AMPK or have no effect on AMPK. Glycemic efficacy of alpha-glucosidase inhibitors, sodium glucose co-transporter-2 inhibitors, colesevelam, and bromocriptine may also involve AMPK activation, which warrants further evaluation. Berberine, salicylates, and resveratrol are newer promising agents in the management of diabetes, having well-documented evidence of AMPK stimulation-mediated glycemic efficacy. Hence, AMPK-based classification of antidiabetes medications provides a holistic unifying understanding of pharmacotherapy in diabetes. This classification is flexible, with scope for inclusion of promising agents of the future. PMID:27652986

  2. Inhibition of Retinoblastoma Protein Inactivation

    DTIC Science & Technology

    2016-09-01

    Retinoblastoma protein, E2F transcription factor, high throughput screen, drug discovery, x-ray crystallography ... developed a method to perform fragment based screening by x-ray crystallography. Keywords: Retinoblastoma (Rb) pathway, E2F transcription factor ... cancer, cell-cycle inhibition, activation, modulation, inhibition, high throughput screening, fragment-based screening, x-ray crystallography

  3. Predicting Flavonoid UGT Regioselectivity

    PubMed Central

    Jackson, Rhydon; Knisley, Debra; McIntosh, Cecilia; Pfeiffer, Phillip

    2011-01-01

    Machine learning was applied to a challenging and biologically significant protein classification problem: the prediction of flavonoid UGT acceptor regioselectivity from primary sequence. Novel indices characterizing graphical models of residues were proposed and found to be widely distributed among existing amino acid indices and to cluster residues appropriately. UGT subsequences biochemically linked to regioselectivity were modeled as sets of index sequences. Several learning techniques incorporating these UGT models were compared with classifications based on standard sequence alignment scores. These techniques included an application of time series distance functions to protein classification. Time series distances defined on the index sequences were used in nearest neighbor and support vector machine classifiers. Additionally, Bayesian neural network classifiers were applied to the index sequences. The experiments identified improvements over the nearest neighbor and support vector machine classifications relying on standard alignment similarity scores, as well as strong correlations between specific subsequences and regioselectivities. PMID:21747849
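    The idea of using a time series distance on index sequences inside a nearest neighbor classifier can be illustrated as follows. This is a sketch under assumptions: dynamic time warping (DTW) is used as one common time series distance (the abstract does not say which distances were used), the per-residue index values are arbitrary toy numbers rather than the indices proposed in the paper, and the class labels are placeholders.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two numeric sequences."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = best cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(cost[i - 1][j],      # a advances alone
                                    cost[i][j - 1],      # b advances alone
                                    cost[i - 1][j - 1])  # both advance
    return cost[n][m]


def to_index_sequence(residues, index):
    """Map a residue string to a numeric sequence via a per-residue index."""
    return [index[r] for r in residues]


def nearest_neighbor_label(query, labeled_sequences):
    """Return the label of the training sequence with the smallest DTW distance."""
    best = min(labeled_sequences, key=lambda item: dtw_distance(query, item[0]))
    return best[1]


# Arbitrary toy index values (an assumption, not a published amino acid index).
toy_index = {"A": 1.0, "G": 0.0, "L": 3.0, "K": -3.0}

training = [
    (to_index_sequence("AAL", toy_index), "class_1"),
    (to_index_sequence("KKG", toy_index), "class_2"),
]
query = to_index_sequence("AALL", toy_index)
predicted = nearest_neighbor_label(query, training)
print(predicted)
```

    Because DTW allows one position to match several neighbors, subsequences of different lengths can be compared without first computing an explicit alignment.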

  4. Phylogeny-dominant classification of J-proteins in Arabidopsis thaliana and Brassica oleracea.

    PubMed

    Zhang, Bin; Qiu, Han-Lin; Qu, Dong-Hai; Ruan, Ying; Chen, Dong-Hong

    2018-04-05

    Hsp40s or DnaJ/J-proteins are evolutionarily conserved in all organisms as co-chaperones of molecular chaperone HSP70s that mainly participate in maintaining cellular protein homeostasis, such as protein folding, assembly, stabilization, and translocation under normal conditions as well as refolding and degradation under environmental stresses. It has been reported that Arabidopsis J-proteins are classified into four classes (types A-D) according to domain organization, but their phylogenetic relationships are unknown. Here, we identified 129 J-proteins in the globally popular vegetable Brassica oleracea, a close relative of the model plant Arabidopsis, and also revised the information on Arabidopsis J-proteins based on the latest online bioresources. According to phylogenetic analysis with domain organization and gene structure as references, the J-proteins from Arabidopsis and B. oleracea were classified into 15 main clades (I-XV) separated by a number of undefined small branches with remote relationships. Based on the number of members, the clades are designated multigene, oligo-gene, or mono-gene clades. The J-protein genes from different clades may function together or separately to constitute a complicated regulatory network. This study provides a constructive viewpoint for J-protein classification and an informative platform for further functional dissection and discovery of resistance genes related to genetic improvement of crop plants.

  5. Modified Cut-Off Value of the Urine Protein-To-Creatinine Ratio Is Helpful for Identifying Patients at High Risk for Chronic Kidney Disease: Validation of the Revised Japanese Guideline.

    PubMed

    Yamamoto, Hiroyuki; Yamamoto, Kyoko; Yoshida, Katsumi; Shindoh, Chiyohiko; Takeda, Kyoko; Monden, Masami; Izumo, Hiroko; Niinuma, Hiroyuki; Nishi, Yutaro; Niwa, Koichiro; Komatsu, Yasuhiro

    2015-11-01

    Chronic kidney disease (CKD) is a global public health issue, and strategies for its early detection and intervention are imperative. The latest Japanese CKD guideline recommends that patients without diabetes should be classified using the urine protein-to-creatinine ratio (PCR) instead of the urine albumin-to-creatinine ratio (ACR); however, no validation studies are available. This study aimed to validate the PCR-based CKD risk classification compared with the ACR-based classification and to explore more accurate classification methods. We analyzed two previously reported datasets that included diabetic and/or cardiovascular patients who were classified into early CKD stages. In total, 860 patients (131 diabetic patients and 729 cardiovascular patients, including 193 diabetic patients) were enrolled. We assessed the CKD risk classification of each patient according to the estimated glomerular filtration rate and the ACR-based or PCR-based classification. The use of the cut-off value recommended in the current guideline (PCR 0.15 g/g creatinine) resulted in risk misclassification rates of 26.0% and 16.6% for the two datasets. The misclassification was primarily caused by underestimation. Moderate to substantial agreement between each classification was achieved: Cohen's kappa, 0.56 (95% confidence interval, 0.45-0.69) and 0.72 (0.67-0.76) in each dataset, respectively. To improve the accuracy, we tested various candidate PCR cut-off values, showing that a PCR cut-off value of 0.08-0.10 g/g creatinine resulted in improvement in the misclassification rates and kappa values. Modification of the PCR cut-off value would improve its efficacy to identify high-risk populations who will benefit from early intervention.
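    The two agreement measures reported in this abstract, the misclassification rate and Cohen's kappa, are straightforward to compute. Kappa is (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the agreement expected by chance from the marginal category frequencies. The sketch below uses invented risk labels for eight hypothetical patients; it does not reproduce the study's data or its confidence intervals.

```python
from collections import Counter


def misclassification_rate(reference, candidate):
    """Fraction of items whose candidate category differs from the reference."""
    return sum(r != c for r, c in zip(reference, candidate)) / len(reference)


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two categorical ratings of the same items.

    Raises ZeroDivisionError when chance agreement is exactly 1
    (both ratings use a single identical category).
    """
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    categories = set(freq_a) | set(freq_b)
    expected = sum(freq_a[c] * freq_b[c] for c in categories) / (n * n)
    return (observed - expected) / (1 - expected)


# Invented example: risk category per patient under each classification.
acr_based = ["low", "low", "high", "high", "high", "low", "high", "low"]
pcr_based = ["low", "low", "low", "high", "high", "low", "low", "low"]

rate = misclassification_rate(acr_based, pcr_based)
kappa = cohens_kappa(acr_based, pcr_based)
print(f"misclassification rate = {rate:.2f}, kappa = {kappa:.2f}")
```

    In this toy case both disagreements are patients rated "high" by the reference and "low" by the candidate, which mirrors the kind of underestimation the abstract describes.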

  6. Classification of proteins with shared motifs and internal repeats in the ECOD database

    PubMed Central

    Kinch, Lisa N.; Liao, Yuxing

    2016-01-01

    Proteins and their domains evolve by a set of events commonly including the duplication and divergence of small motifs. The presence of short repetitive regions in domains has generally constituted a difficult case for structural domain classifications and their hierarchies. We developed the Evolutionary Classification Of protein Domains (ECOD) in part to implement a new schema for the classification of these types of proteins. Here we document the ways in which ECOD classifies proteins with small internal repeats, widespread functional motifs, and assemblies of small domain‐like fragments in its evolutionary schema. We illustrate the ways in which the structural genomics project impacted the classification and characterization of new structural domains and sequence families over the decade. PMID:26833690

  7. Transporter taxonomy - a comparison of different transport protein classification schemes.

    PubMed

    Viereck, Michael; Gaulton, Anna; Digles, Daniela; Ecker, Gerhard F

    2014-06-01

    Currently, there are more than 800 well characterized human membrane transport proteins (including channels and transporters) and there are estimates that about 10% (approx. 2000) of all human genes are related to transport. Membrane transport proteins are of interest as potential drug targets, for drug delivery, and as a cause of side effects and drug–drug interactions. In light of the development of Open PHACTS, which provides an open pharmacological space, we analyzed selected membrane transport protein classification schemes (Transporter Classification Database, ChEMBL, IUPHAR/BPS Guide to Pharmacology, and Gene Ontology) for their ability to serve as a basis for pharmacology driven protein classification. A comparison of these membrane transport protein classification schemes by using a set of clinically relevant transporters as use-case reveals the strengths and weaknesses of the different taxonomy approaches.

  8. sc-PDB: a database for identifying variations and multiplicity of 'druggable' binding sites in proteins.

    PubMed

    Meslamani, Jamel; Rognan, Didier; Kellenberger, Esther

    2011-05-01

    The sc-PDB database is an annotated archive of druggable binding sites extracted from the Protein Data Bank. It contains all-atom coordinates for 8166 protein-ligand complexes, chosen for their geometrical and physico-chemical properties. The sc-PDB provides a functional annotation for proteins, a chemical description for ligands and the detailed intermolecular interactions for complexes. The sc-PDB now includes a hierarchical classification of all the binding sites within a functional class. The sc-PDB entries were first clustered according to the protein name, irrespective of the species. For each cluster, we identified dissimilar sites (e.g. catalytic and allosteric sites of an enzyme). SCOPE AND APPLICATIONS: The classification of sc-PDB targets by binding site diversity was intended to facilitate chemogenomics approaches to drug design. In ligand-based approaches, it avoids comparing ligands that do not share the same binding site. In structure-based approaches, it permits quantitative evaluation of the diversity of the binding site definition (variations in size, sequence and/or structure). The sc-PDB database is freely available at: http://bioinfo-pharma.u-strasbg.fr/scPDB.

  9. Surveying alignment-free features for Ortholog detection in related yeast proteomes by using supervised big data classifiers.

    PubMed

    Galpert, Deborah; Fernández, Alberto; Herrera, Francisco; Antunes, Agostinho; Molina-Ruiz, Reinaldo; Agüero-Chapin, Guillermin

    2018-05-03

    The development of new ortholog detection algorithms and the improvement of existing ones are of major importance in functional genomics. We have previously introduced a successful supervised pairwise ortholog classification approach implemented in a big data platform that considered several pairwise protein features and the low ortholog pair ratios found between two annotated proteomes (Galpert, D et al., BioMed Research International, 2015). The supervised models were built and tested using a Saccharomycete yeast benchmark dataset proposed by Salichos and Rokas (2011). Although several pairwise protein features were combined in a supervised big data approach, they were all, to some extent, alignment-based features, and the proposed algorithms were evaluated on a unique test set. Here, we aim to evaluate the impact of alignment-free features on the performance of supervised models implemented in the Spark big data platform for pairwise ortholog detection in several related yeast proteomes. The Spark Random Forest and Decision Trees with oversampling and undersampling techniques, built with only alignment-based similarity measures or with these measures combined with several alignment-free pairwise protein features, showed the highest classification performance for ortholog detection in three yeast proteome pairs. Although such supervised approaches outperformed traditional methods, there were no significant differences between the exclusive use of alignment-based similarity measures and their combination with alignment-free features, even within the twilight zone of the studied proteomes. Only when alignment-based and alignment-free features were combined in Spark Decision Trees with imbalance management could a higher success rate (98.71%) within the twilight zone be achieved for a yeast proteome pair that underwent a whole genome duplication. The feature selection study showed that alignment-based features were top-ranked for the best classifiers while the runners-up were alignment-free features related to amino acid composition. The incorporation of alignment-free features in supervised big data models did not significantly improve ortholog detection in yeast proteomes regarding the classification qualities achieved with just alignment-based similarity measures. However, the similarity of their classification performance to that of traditional ortholog detection methods encourages the evaluation of other alignment-free protein pair descriptors in future research.
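    The imbalance management mentioned here (few ortholog pairs among many non-ortholog pairs) is often handled by resampling the training set. The sketch below shows random oversampling of the minority class in plain Python. It is an illustration only: the study ran its models on Spark, the exact resampling scheme is not given in the abstract, and the function name, the pair identifiers, and the feature values are all hypothetical.

```python
import random
from collections import Counter


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen minority-class samples until every class
    has as many samples as the largest class."""
    rng = random.Random(seed)  # fixed seed so the resampling is repeatable
    by_label = {}
    for sample, label in zip(samples, labels):
        by_label.setdefault(label, []).append(sample)
    target = max(len(items) for items in by_label.values())

    out_samples, out_labels = [], []
    for label, items in by_label.items():
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for sample in items + extra:
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels


# Hypothetical protein pairs, each with two invented similarity features.
pairs = [("p1", "q1", 0.91, 0.80), ("p2", "q7", 0.12, 0.30),
         ("p3", "q9", 0.08, 0.25), ("p4", "q2", 0.15, 0.10),
         ("p5", "q5", 0.05, 0.40), ("p6", "q3", 0.88, 0.75)]
labels = ["ortholog", "non-ortholog", "non-ortholog",
          "non-ortholog", "non-ortholog", "ortholog"]

balanced_pairs, balanced_labels = random_oversample(pairs, labels)
print(Counter(balanced_labels))
```

    Resampling should be applied to the training split only; oversampling before the train/test split would let duplicated pairs leak into the test set and inflate the measured performance.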

  10. Availability of MudPIT data for classification of biological samples.

    PubMed

    Silvestre, Dario Di; Zoppis, Italo; Brambilla, Francesca; Bellettato, Valeria; Mauri, Giancarlo; Mauri, Pierluigi

    2013-01-14

    Mass spectrometry is an important analytical tool for clinical proteomics. Primarily employed for biomarker discovery, it is increasingly used for developing methods which may help to provide unambiguous diagnosis of biological samples. In this context, we investigated the classification of phenotypes by applying support vector machine (SVM) on experimental data obtained by the MudPIT approach. In particular, we compared the performance capabilities of SVM by using two independent collections of complex samples and different data-types, such as mass spectra (m/z), peptides and proteins. Globally, protein and peptide data allowed a better discriminant informative content than experimental mass spectra (overall accuracy higher than 87% in both collection 1 and 2). These results indicate that sequencing of peptides and proteins reduces the experimental noise affecting the raw mass spectra, and allows the extraction of more informative features available for the effective classification of samples. In addition, protein and peptide features selected by SVM showed an 80% match with the differentially expressed proteins identified by the MAProMa software. These findings confirm the availability of the most label-free quantitative methods based on processing of spectral count and SEQUEST-based SCORE values. On the other hand, it stresses the usefulness of MudPIT data for a correct grouping of sample phenotypes, by applying both supervised and unsupervised learning algorithms. This capacity permits the evaluation of actual samples and it is a good starting point to translate proteomic methodology to clinical application.

  11. Probabilistic grammatical model for helix‐helix contact site classification

    PubMed Central

    2013-01-01

    Background: Hidden Markov Models power many state‐of‐the‐art tools in the field of protein bioinformatics. While excelling in their tasks, these methods of protein analysis do not convey directly information on medium‐ and long‐range residue‐residue interactions. This requires an expressive power of at least context‐free grammars. However, application of more powerful grammar formalisms to protein analysis has been surprisingly limited. Results: In this work, we present a probabilistic grammatical framework for problem‐specific protein languages and apply it to classification of transmembrane helix‐helix pairs configurations. The core of the model consists of a probabilistic context‐free grammar, automatically inferred by a genetic algorithm from only a generic set of expert‐based rules and positive training samples. The model was applied to produce sequence based descriptors of four classes of transmembrane helix‐helix contact site configurations. The highest performance of the classifiers reached AUCROC of 0.70. The analysis of grammar parse trees revealed the ability of representing structural features of helix‐helix contact sites. Conclusions: We demonstrated that our probabilistic context‐free framework for analysis of protein sequences outperforms the state of the art in the task of helix‐helix contact site classification. However, this is achieved without necessarily requiring modeling long range dependencies between interacting residues. A significant feature of our approach is that grammar rules and parse trees are human‐readable. Thus they could provide biologically meaningful information for molecular biologists. PMID:24350601

  12. The value of protein structure classification information—Surveying the scientific literature

    PubMed Central

    Fox, Naomi K.; Brenner, Steven E.

    2015-01-01

    The Structural Classification of Proteins (SCOP) and Class, Architecture, Topology, Homology (CATH) databases have been valuable resources for protein structure classification for over 20 years. Development of SCOP (version 1) concluded in June 2009 with SCOP 1.75. The SCOPe (SCOP–extended) database offers continued development of the classic SCOP hierarchy, adding over 33,000 structures. We have attempted to assess the impact of these two decade old resources and guide future development. To this end, we surveyed recent articles to learn how structure classification data are used. Of 571 articles published in 2012–2013 that cite SCOP, 439 actually use data from the resource. We found that the type of use was fairly evenly distributed among four top categories: A) study protein structure or evolution (27% of articles), B) train and/or benchmark algorithms (28% of articles), C) augment non‐SCOP datasets with SCOP classification (21% of articles), and D) examine the classification of one protein/a small set of proteins (22% of articles). Most articles described computational research, although 11% described purely experimental research, and a further 9% included both. We examined how CATH and SCOP were used in 158 articles that cited both databases: while some studies used only one dataset, the majority used data from both resources. Protein structure classification remains highly relevant for a diverse range of problems and settings. Proteins 2015; 83:2025–2038. © 2015 The Authors. Proteins: Structure, Function, and Bioinformatics Published by Wiley Periodicals, Inc. PMID:26313554

  13. Structure-Based Design of Inhibitors of Protein–Protein Interactions: Mimicking Peptide Binding Epitopes

    PubMed Central

    Pelay-Gimeno, Marta; Glas, Adrian; Koch, Oliver; Grossmann, Tom N

    2015-01-01

    Protein–protein interactions (PPIs) are involved at all levels of cellular organization, thus making the development of PPI inhibitors extremely valuable. The identification of selective inhibitors is challenging because of the shallow and extended nature of PPI interfaces. Inhibitors can be obtained by mimicking peptide binding epitopes in their bioactive conformation. For this purpose, several strategies have been evolved to enable a projection of side chain functionalities in analogy to peptide secondary structures, thereby yielding molecules that are generally referred to as peptidomimetics. Herein, we introduce a new classification of peptidomimetics (classes A–D) that enables a clear assignment of available approaches. Based on this classification, the Review summarizes strategies that have been applied for the structure-based design of PPI inhibitors through stabilizing or mimicking turns, β-sheets, and helices. PMID:26119925

  14. A simple atomic-level hydrophobicity scale reveals protein interfacial structure.

    PubMed

    Kapcha, Lauren H; Rossky, Peter J

    2014-01-23

    Many amino acid residue hydrophobicity scales have been created in an effort to better understand and rapidly characterize water-protein interactions based only on protein structure and sequence. There is surprisingly low consistency in the ranking of residue hydrophobicity between scales, and their ability to provide insightful characterization varies substantially across subject proteins. All current scales characterize hydrophobicity based on entire amino acid residue units. We introduce a simple binary but atomic-level hydrophobicity scale that allows for the classification of polar and non-polar moieties within single residues, including backbone atoms. This simple scale is first shown to capture the anticipated hydrophobic character for those whole residues that align in classification among most scales. Examination of a set of protein binding interfaces establishes good agreement between residue-based and atomic-level descriptions of hydrophobicity for five residues, while the remaining residues produce discrepancies. We then show that the atomistic scale properly classifies the hydrophobicity of functionally important regions where residue-based scales fail. To illustrate the utility of the new approach, we show that the atomic-level scale rationalizes the hydration of two hydrophobic pockets and the presence of a void in a third pocket within a single protein and that it appropriately classifies all of the functionally important hydrophilic sites within two otherwise hydrophobic pores. We suggest that an atomic level of detail is, in general, necessary for the reliable depiction of hydrophobicity for all protein surfaces. The present formulation can be implemented simply in a manner no more complex than current residue-based approaches. © 2013.

  15. Granular support vector machines with association rules mining for protein homology prediction.

    PubMed

    Tang, Yuchun; Jin, Bo; Zhang, Yan-Qing

    2005-01-01

    Protein homology prediction between protein sequences is one of the critical problems in computational biology. Such a complex classification problem is common in medical or biological information processing applications. How to build a model with superior generalization capability from training samples is an essential issue for mining knowledge to accurately predict/classify unseen new samples and to effectively support human experts in making correct decisions. A new learning model called granular support vector machines (GSVM) is proposed based on our previous work. GSVM systematically and formally combines the principles from statistical learning theory and granular computing theory and thus provides an interesting new mechanism to address complex classification problems. It works by building a sequence of information granules and then building support vector machines (SVM) in some of these information granules on demand. A good granulation method to find suitable granules is crucial for modeling a GSVM with good performance. In this paper, we also propose an association rules-based granulation method. For the granules induced by association rules with high enough confidence and significant support, we leave them as they are because of their high "purity" and significant effect on simplifying the classification task. For every other granule, an SVM is modeled to discriminate the corresponding data. In this way, a complex classification problem is divided into multiple smaller problems so that the learning task is simplified. The proposed algorithm, here named GSVM-AR, is compared with SVM on the KDDCUP04 protein homology prediction data. The experimental results show that finding the splitting hyperplane is not a trivial task (we should be careful to select the association rules to avoid overfitting) and GSVM-AR does show significant improvement compared to building one single SVM in the whole feature space. Another advantage is that GSVM-AR is highly practical because it is easy to implement. More importantly and more interestingly, GSVM provides a new mechanism to address complex classification problems.

  16. TIM Barrel Protein Structure Classification Using Alignment Approach and Best Hit Strategy

    NASA Astrophysics Data System (ADS)

    Chu, Jia-Han; Lin, Chun Yuan; Chang, Cheng-Wen; Lee, Chihan; Yang, Yuh-Shyong; Tang, Chuan Yi

    2007-11-01

    The classification of protein structures is essential for their function determination in bioinformatics. It has been estimated that around 10% of all known enzymes have TIM barrel domains from the Structural Classification of Proteins (SCOP) database. With its high sequence variation and diverse functionalities, the TIM barrel protein has become an attractive target for protein engineering and for evolution studies. Hence, in this paper, an alignment approach with the best hit strategy is proposed to classify the TIM barrel protein structure in terms of superfamily and family levels in the SCOP. This work is also used to do the classification for class level in the Enzyme nomenclature (ENZYME) database. Two testing data sets, TIM40D and TIM95D, are used to evaluate this approach. The resulting classification has an overall prediction accuracy rate of 90.3% for the superfamily level in the SCOP, 89.5% for the family level in the SCOP and 70.1% for the class level in the ENZYME. These results demonstrate that the alignment approach with the best hit strategy is a simple and viable method for TIM barrel protein structure classification, even when only amino acid sequence information is available.
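    A best hit strategy of the kind described here assigns a query the class of its highest-scoring alignment partner. The sketch below illustrates the idea with a basic global alignment score (Needleman-Wunsch with a flat match/mismatch/gap scheme). The scoring parameters, the sequences, and the superfamily labels are assumptions made for illustration; the abstract does not state which alignment tool or scoring matrix the authors used.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment score, keeping one DP row at a time."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(diag,             # align a[i-1] with b[j-1]
                           prev[j] + gap,    # gap in b
                           cur[j - 1] + gap))  # gap in a
        prev = cur
    return prev[-1]


def best_hit_label(query, references):
    """Return the label of the reference sequence with the highest score."""
    best = max(references, key=lambda ref: global_alignment_score(query, ref[0]))
    return best[1]


# Made-up toy sequences and labels (not taken from TIM40D or TIM95D).
references = [
    ("MKVLAAGIV", "superfamily_A"),
    ("GGSPEEDLK", "superfamily_B"),
]
predicted_label = best_hit_label("MKVLAGIV", references)
print(predicted_label)
```

    In practice a substitution matrix and affine gap penalties would replace the flat scoring used here, but the classification step (take the label of the top hit) stays the same.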

  17. A consensus view of fold space: Combining SCOP, CATH, and the Dali Domain Dictionary

    PubMed Central

    Day, Ryan; Beck, David A.C.; Armen, Roger S.; Daggett, Valerie

    2003-01-01

    We have determined consensus protein-fold classifications on the basis of three classification methods, SCOP, CATH, and Dali. These classifications make use of different methods of defining and categorizing protein folds that lead to different views of protein-fold space. Pairwise comparisons of domains on the basis of their fold classifications show that much of the disagreement between the classification systems is due to differing domain definitions rather than assigning the same domain to different folds. However, there are significant differences in the fold assignments between the three systems. These remaining differences can be explained primarily in terms of the breadth of the fold classifications. Many structures may be defined as having one fold in one system, whereas far fewer are defined as having the analogous fold in another system. By comparing these folds for a nonredundant set of proteins, the consensus method breaks up broad fold classifications and combines restrictive fold classifications into metafolds, creating, in effect, an averaged view of fold space. This averaged view requires that the structural similarities between proteins having the same metafold be recognized by multiple classification systems. Thus, the consensus map is useful for researchers looking for fold similarities that are relatively independent of the method used to compare proteins. The 30 most populated metafolds, representing the folds of about half of a nonredundant subset of the PDB, are presented here. The full list of metafolds is presented on the Web. PMID:14500873

  19. Feature-based classification of amino acid substitutions outside conserved functional protein domains.

    PubMed

    Gemovic, Branislava; Perovic, Vladimir; Glisic, Sanja; Veljkovic, Nevena

    2013-01-01

    There are more than 500 amino acid substitutions in each human genome, and bioinformatics tools contribute irreplaceably to determining their functional effects. We have developed a feature-based algorithm for the detection of mutations outside conserved functional domains (CFDs) and compared its classification efficacy with the most commonly used phylogeny-based tools, PolyPhen-2 and SIFT. The new algorithm is based on the informational spectrum method (ISM), a feature-based technique, and statistical analysis. Our dataset contained neutral polymorphisms and mutations associated with myeloid malignancies from epigenetic regulators ASXL1, DNMT3A, EZH2, and TET2. PolyPhen-2 and SIFT had significantly lower accuracies in predicting the effects of amino acid substitutions outside CFDs than expected, with especially low sensitivity. On the other hand, only the ISM algorithm showed statistically significant classification of these sequences. It outperformed PolyPhen-2 and SIFT by 15% and 13%, respectively. These results suggest that feature-based methods, like ISM, are more suitable for the classification of amino acid substitutions outside CFDs than phylogeny-based tools.

  20. A Functional-Phylogenetic Classification System for Transmembrane Solute Transporters

    PubMed Central

    Saier, Milton H.

    2000-01-01

    A comprehensive classification system for transmembrane molecular transporters has been developed and recently approved by the transport panel of the nomenclature committee of the International Union of Biochemistry and Molecular Biology. This system is based on (i) transporter class and subclass (mode of transport and energy coupling mechanism), (ii) protein phylogenetic family and subfamily, and (iii) substrate specificity. Almost all of the more than 250 identified families of transporters include members that function exclusively in transport. Channels (115 families), secondary active transporters (uniporters, symporters, and antiporters) (78 families), primary active transporters (23 families), group translocators (6 families), and transport proteins of ill-defined function or of unknown mechanism (51 families) constitute distinct categories. Transport mode and energy coupling prove to be relatively immutable characteristics and therefore provide primary bases for classification. Phylogenetic grouping reflects structure, function, mechanism, and often substrate specificity and therefore provides a reliable secondary basis for classification. Substrate specificity and polarity of transport prove to be more readily altered during evolutionary history and therefore provide a tertiary basis for classification. With very few exceptions, a phylogenetic family of transporters includes members that function by a single transport mode and energy coupling mechanism, although a variety of substrates may be transported, sometimes with either inwardly or outwardly directed polarity. In this review, I provide cross-referencing of well-characterized constituent transporters according to (i) transport mode, (ii) energy coupling mechanism, (iii) phylogenetic grouping, and (iv) substrates transported. The structural features and distribution of recognized family members throughout the living world are also evaluated. 
    The tabulations should facilitate familial and functional assignments of newly sequenced transport proteins that will result from future genome sequencing projects. PMID:10839820
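
    The classification levels listed in this abstract (class, subclass, family, subfamily, substrate) are conventionally written as a dotted five-part identifier. The following is a minimal sketch of parsing such an identifier, assuming the layout class.subclass.family.subfamily.substrate with a letter for the subclass. The names `TCNumber` and `parse_tc_number` and the sample identifier are illustrative assumptions, not taken from the review.

```python
import re
from typing import NamedTuple

# Assumed layout: class.subclass.family.subfamily.substrate, e.g. "2.A.1.1.1".
# The subclass is taken to be one capital letter; the other parts are integers.
_TC_PATTERN = re.compile(r"^(\d+)\.([A-Z])\.(\d+)\.(\d+)\.(\d+)$")


class TCNumber(NamedTuple):
    """Hypothetical container for the five classification levels."""
    transporter_class: int
    subclass: str
    family: int
    subfamily: int
    substrate: int

    def family_id(self) -> str:
        # The first three parts identify the phylogenetic family.
        return f"{self.transporter_class}.{self.subclass}.{self.family}"


def parse_tc_number(text: str) -> TCNumber:
    """Split a dotted identifier into its five levels, or raise ValueError."""
    match = _TC_PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"not a five-part identifier: {text!r}")
    cls, sub, fam, subfam, substrate = match.groups()
    return TCNumber(int(cls), sub, int(fam), int(subfam), int(substrate))
```

    Grouping parsed identifiers by `family_id()` would give the kind of family-level cross-referencing the abstract describes.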

  1. Fast and automated functional classification with MED-SuMo: an application on purine-binding proteins.

    PubMed

    Doppelt-Azeroual, Olivia; Delfaud, François; Moriaud, Fabrice; de Brevern, Alexandre G

    2010-04-01

    Ligand-protein interactions are essential for biological processes, and precise characterization of protein binding sites is crucial to understand protein functions. MED-SuMo is a powerful technology to localize similar local regions on protein surfaces. Its heuristic is based on a 3D representation of macromolecules using specific surface chemical features associating chemical characteristics with geometrical properties. MED-SMA is an automated and fast method to classify binding sites. It is based on MED-SuMo technology, which builds a similarity graph, and it uses the Markov Clustering algorithm. Purine binding sites are well studied as drug targets. Here, purine binding sites of the Protein DataBank (PDB) are classified. Proteins potentially inhibited or activated through the same mechanism are gathered. Results are analyzed according to PROSITE annotations and to carefully refined functional annotations extracted from the PDB. As expected, binding sites associated with related mechanisms are gathered, for example, the Small GTPases. Nevertheless, protein kinases from different Kinome families are also found together, for example, Aurora-A and CDK2 proteins which are inhibited by the same drugs. Representative examples of different clusters are presented. The effectiveness of the MED-SMA approach is demonstrated as it gathers binding sites of proteins with similar structure-activity relationships. Moreover, an efficient new protocol associates structures absent of cocrystallized ligands to the purine clusters enabling those structures to be associated with a specific binding mechanism. Applications of this classification by binding mode similarity include target-based drug design and prediction of cross-reactivity and therefore potential toxic side effects.

  2. Fast and automated functional classification with MED-SuMo: An application on purine-binding proteins

    PubMed Central

    Doppelt-Azeroual, Olivia; Delfaud, François; Moriaud, Fabrice; de Brevern, Alexandre G

    2010-01-01

    Ligand–protein interactions are essential for biological processes, and precise characterization of protein binding sites is crucial to understand protein functions. MED-SuMo is a powerful technology to localize similar local regions on protein surfaces. Its heuristic is based on a 3D representation of macromolecules using specific surface chemical features associating chemical characteristics with geometrical properties. MED-SMA is an automated and fast method to classify binding sites. It is based on MED-SuMo technology, which builds a similarity graph, and it uses the Markov Clustering algorithm. Purine binding sites are well studied as drug targets. Here, purine binding sites of the Protein DataBank (PDB) are classified. Proteins potentially inhibited or activated through the same mechanism are gathered. Results are analyzed according to PROSITE annotations and to carefully refined functional annotations extracted from the PDB. As expected, binding sites associated with related mechanisms are gathered, for example, the Small GTPases. Nevertheless, protein kinases from different Kinome families are also found together, for example, Aurora-A and CDK2 proteins which are inhibited by the same drugs. Representative examples of different clusters are presented. The effectiveness of the MED-SMA approach is demonstrated as it gathers binding sites of proteins with similar structure-activity relationships. Moreover, an efficient new protocol associates structures absent of cocrystallized ligands to the purine clusters enabling those structures to be associated with a specific binding mechanism. Applications of this classification by binding mode similarity include target-based drug design and prediction of cross-reactivity and therefore potential toxic side effects. PMID:20162627

  3. The value of protein structure classification information-Surveying the scientific literature

    DOE PAGES

    Fox, Naomi K.; Brenner, Steven E.; Chandonia, John-Marc

    2015-08-27

    The Structural Classification of Proteins (SCOP) and Class, Architecture, Topology, Homology (CATH) databases have been valuable resources for protein structure classification for over 20 years. Development of SCOP (version 1) concluded in June 2009 with SCOP 1.75. The SCOPe (SCOP-extended) database offers continued development of the classic SCOP hierarchy, adding over 33,000 structures. We have attempted to assess the impact of these two-decade-old resources and guide future development. To this end, we surveyed recent articles to learn how structure classification data are used. Of 571 articles published in 2012-2013 that cite SCOP, 439 actually use data from the resource. We found that the type of use was fairly evenly distributed among four top categories: A) study protein structure or evolution (27% of articles), B) train and/or benchmark algorithms (28% of articles), C) augment non-SCOP datasets with SCOP classification (21% of articles), and D) examine the classification of one protein/a small set of proteins (22% of articles). Most articles described computational research, although 11% described purely experimental research, and a further 9% included both. We examined how CATH and SCOP were used in 158 articles that cited both databases: while some studies used only one dataset, the majority used data from both resources. Protein structure classification remains highly relevant for a diverse range of problems and settings.

  4. The value of protein structure classification information-Surveying the scientific literature

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Fox, Naomi K.; Brenner, Steven E.; Chandonia, John-Marc

    The Structural Classification of Proteins (SCOP) and Class, Architecture, Topology, Homology (CATH) databases have been valuable resources for protein structure classification for over 20 years. Development of SCOP (version 1) concluded in June 2009 with SCOP 1.75. The SCOPe (SCOP-extended) database offers continued development of the classic SCOP hierarchy, adding over 33,000 structures. We have attempted to assess the impact of these two-decade-old resources and guide future development. To this end, we surveyed recent articles to learn how structure classification data are used. Of 571 articles published in 2012-2013 that cite SCOP, 439 actually use data from the resource. We found that the type of use was fairly evenly distributed among four top categories: A) study protein structure or evolution (27% of articles), B) train and/or benchmark algorithms (28% of articles), C) augment non-SCOP datasets with SCOP classification (21% of articles), and D) examine the classification of one protein/a small set of proteins (22% of articles). Most articles described computational research, although 11% described purely experimental research, and a further 9% included both. We examined how CATH and SCOP were used in 158 articles that cited both databases: while some studies used only one dataset, the majority used data from both resources. Protein structure classification remains highly relevant for a diverse range of problems and settings.

  5. Protein subcellular location pattern classification in cellular images using latent discriminative models.

    PubMed

    Li, Jieyue; Xiong, Liang; Schneider, Jeff; Murphy, Robert F

    2012-06-15

    Knowledge of the subcellular location of a protein is crucial for understanding its functions. The subcellular pattern of a protein is typically represented as the set of cellular components in which it is located, and an important task is to determine this set from microscope images. In this article, we address this classification problem using confocal immunofluorescence images from the Human Protein Atlas (HPA) project. The HPA contains images of cells stained for many proteins; each is also stained for three reference components, but there are many other components that are invisible. Given one such cell, the task is to classify the pattern type of the stained protein. We first randomly select local image regions within the cells, and then extract various carefully designed features from these regions. This region-based approach enables us to explicitly study the relationship between proteins and different cell components, as well as the interactions between these components. To achieve these two goals, we propose two discriminative models that extend logistic regression with structured latent variables. The first model allows the same protein pattern class to be expressed differently according to the underlying components in different regions. The second model further captures the spatial dependencies between the components within the same cell so that we can better infer these components. To learn these models, we propose a fast approximate algorithm for inference, and then use gradient-based methods to maximize the data likelihood. In the experiments, we show that the proposed models help improve the classification accuracies on synthetic data and real cellular images. The best overall accuracy we report in this article for classifying 942 proteins into 13 classes of patterns is about 84.6%, which to our knowledge is the best so far. In addition, the dependencies learned are consistent with prior knowledge of cell organization. 
    Availability: http://murphylab.web.cmu.edu/software/.

  6. Binary Classification using Decision Tree based Genetic Programming and Its Application to Analysis of Bio-mass Data

    NASA Astrophysics Data System (ADS)

    To, Cuong; Pham, Tuan D.

    2010-01-01

    In machine learning, pattern recognition may be the most popular task. Identification of "similar" patterns is also very important in biology: first, it is useful for predicting patterns associated with disease, for example cancer tissue (normal or tumor); second, similarity or dissimilarity of kinetic patterns is used to identify coordinately controlled genes or proteins involved in the same regulatory process; third, similar genes (proteins) share similar functions. In this paper, we present an algorithm that uses genetic programming to create decision trees for binary classification problems. The algorithm was applied to five real biological databases. Based on the results of comparisons with well-known methods, we see that the algorithm is outstanding in most cases.

  7. Applying graph theory to protein structures: an atlas of coiled coils.

    PubMed

    Heal, Jack W; Bartlett, Gail J; Wood, Christopher W; Thomson, Andrew R; Woolfson, Derek N

    2018-05-02

    To understand protein structure, folding and function fully and to design proteins de novo reliably, we must learn from natural protein structures that have been characterised experimentally. The number of protein structures available is large and growing exponentially, which makes this task challenging. Indeed, computational resources are becoming increasingly important for classifying and analysing this resource. Here, we use tools from graph theory to define an atlas classification scheme for automatically categorising certain protein substructures. Focusing on the α-helical coiled coils, which are ubiquitous protein-structure and protein-protein interaction motifs, we present a suite of computational resources designed for analysing these assemblies. iSOCKET enables interactive analysis of side-chain packing within proteins to identify coiled coils automatically and with considerable user control. Applying a graph theory-based atlas classification scheme to structures identified by iSOCKET gives the Atlas of Coiled Coils, a fully automated, updated overview of extant coiled coils. The utility of this approach is illustrated with the first formal classification of an emerging subclass of coiled coils called α-helical barrels. Furthermore, in the Atlas, the known coiled-coil universe is presented alongside a partial enumeration of the 'dark matter' of coiled-coil structures; i.e., those coiled-coil architectures that are theoretically possible but have not been observed to date, and thus present defined targets for protein design. iSOCKET is available as part of the open-source GitHub repository associated with this work (https://github.com/woolfson-group/isocket). This repository also contains all the data generated when classifying the protein graphs. The Atlas of Coiled Coils is available at: http://coiledcoils.chm.bris.ac.uk/atlas/app.

  8. Protein Subcellular Localization with Gaussian Kernel Discriminant Analysis and Its Kernel Parameter Selection.

    PubMed

    Wang, Shunfang; Nie, Bing; Yue, Kun; Fei, Yu; Li, Wenjia; Xu, Dongshu

    2017-12-15

    Kernel discriminant analysis (KDA) is a dimension reduction and classification algorithm based on the nonlinear kernel trick, which can be used in a novel way to treat high-dimensional and complex biological data before classification tasks such as protein subcellular localization. Kernel parameters have a great impact on the performance of the KDA model. Specifically, for KDA with the popular Gaussian kernel, selecting the scale parameter is still a challenging problem. Thus, this paper introduces the KDA method and proposes a new method for Gaussian kernel parameter selection, based on the principle that the differences between reconstruction errors of edge normal samples and those of interior normal samples should be maximized for suitable kernel parameters. Experiments with various standard data sets of protein subcellular localization show that the overall accuracy of protein classification prediction with KDA is much higher than that without KDA. Meanwhile, the kernel parameter of KDA has a great impact on the efficiency, and the proposed method can produce an optimum parameter, which makes the new algorithm not only perform as effectively as the traditional ones, but also reduce the computational time and thus improve efficiency.

  9. LZW-Kernel: fast kernel utilizing variable length code blocks from LZW compressors for protein sequence classification.

    PubMed

    Filatov, Gleb; Bauwens, Bruno; Kertész-Farkas, Attila

    2018-05-07

    Bioinformatics studies often rely on similarity measures between sequence pairs, which often pose a bottleneck in large-scale sequence analysis. Here, we present a new convolutional kernel function for protein sequences called the LZW-Kernel. It is based on code words identified with the Lempel-Ziv-Welch (LZW) universal text compressor. The LZW-Kernel is an alignment-free method; it is always symmetric and positive, always provides 1.0 for self-similarity, and can be used directly with Support Vector Machines (SVMs) in classification problems. This contrasts with the normalized compression distance (NCD), which often violates the distance metric properties in practice and requires further techniques to be used with SVMs. The LZW-Kernel is a one-pass algorithm, which makes it particularly suitable for big data applications. Our experimental studies on remote protein homology detection and protein classification tasks reveal that the LZW-Kernel closely approaches the performance of the Local Alignment Kernel (LAK) and the SVM-pairwise method combined with Smith-Waterman (SW) scoring at a fraction of the time. Moreover, the LZW-Kernel outperforms the SVM-pairwise method when combined with BLAST scores, which indicates that the LZW code words might be a better basis for similarity measures than local alignment approximations found with BLAST. In addition, the LZW-Kernel outperforms n-gram based mismatch kernels, hidden Markov model based SAM and Fisher kernel, and protein family based PSI-BLAST, among others. Further advantages include the LZW-Kernel's reliance on a simple idea, its ease of implementation, and its high speed: three times faster than BLAST and several orders of magnitude faster than SW or LAK in our tests. LZW-Kernel is implemented as standalone C code and is a free open-source program distributed under the GPLv3 license; it can be downloaded from https://github.com/kfattila/LZW-Kernel. Contact: akerteszfarkas@hse.ru. Supplementary data are available at Bioinformatics Online.
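
    To make the idea of LZW code words concrete, the sketch below collects the phrases that a textbook LZW parser adds to its dictionary in one pass over a sequence, then compares two sequences by the overlap of those phrase sets. The overlap score is a plain Jaccard ratio chosen for illustration; it is not the LZW-Kernel formula from the paper, and the function names and the 20-letter default alphabet are assumptions.

```python
def lzw_code_words(sequence, alphabet="ACDEFGHIKLMNPQRSTVWY"):
    """Return the phrases a standard LZW parser adds to its dictionary."""
    dictionary = set(alphabet)   # LZW starts from all single symbols
    new_words = set()
    current = ""
    for symbol in sequence:
        candidate = current + symbol
        if candidate in dictionary:
            # Keep extending the current phrase while it is already known.
            current = candidate
        else:
            # First unseen phrase: record it and restart from this symbol.
            dictionary.add(candidate)
            new_words.add(candidate)
            current = symbol
    return new_words


def code_word_similarity(seq_a, seq_b):
    """Illustrative Jaccard overlap of LZW phrase sets (not the LZW-Kernel)."""
    words_a = lzw_code_words(seq_a)
    words_b = lzw_code_words(seq_b)
    union = words_a | words_b
    if not union:
        return 1.0  # two sequences with no phrases are treated as identical
    return len(words_a & words_b) / len(union)
```

    Like the kernel described in the abstract, this stand-in is symmetric, returns 1.0 for self-similarity, and needs only one pass per sequence to build the phrase set.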

  10. Classification and quantitation of milk powder by near-infrared spectroscopy and mutual information-based variable selection and partial least squares

    NASA Astrophysics Data System (ADS)

    Chen, Hui; Tan, Chao; Lin, Zan; Wu, Tong

    2018-01-01

    Milk is among the most popular nutrient sources worldwide and is of great interest due to its beneficial medicinal properties. The feasibility of classifying milk powder samples with respect to their brands and of determining protein concentration is investigated by near-infrared (NIR) spectroscopy along with chemometrics. Two datasets were prepared for the experiments: one contains 179 samples of four brands for classification, and the other contains 30 samples for quantitative analysis. Principal component analysis (PCA) was used for exploratory analysis. Based on an effective model-independent variable selection method, i.e., minimal-redundancy maximal-relevance (MRMR), only 18 variables were selected to construct a partial least-squares discriminant analysis (PLS-DA) model. On the test set, the PLS-DA model based on the selected variable set was compared with the full-spectrum PLS-DA model, both of which achieved 100% accuracy. In quantitative analysis, the partial least-squares regression (PLSR) model constructed from the selected subset of 260 variables significantly outperforms the full-spectrum model. It seems that the combination of NIR spectroscopy, MRMR and PLS-DA or PLSR is a powerful tool for classifying different brands of milk and determining the protein content.

  11. iTRAQ-based Quantitative Proteomics Study in Patients with Refractory Mycoplasma pneumoniae Pneumonia.

    PubMed

    Yu, Jia-Lu; Song, Qi-Fang; Xie, Zhi-Wei; Jiang, Wen-Hui; Chen, Jia-Hui; Fan, Hui-Feng; Xie, Ya-Ping; Lu, Gen

    2017-09-25

    Mycoplasma pneumoniae (MP) is a leading cause of community-acquired pneumonia in children and young adults. Although MP pneumonia is usually benign and self-limited, in some cases it can develop into life-threatening refractory MP pneumonia (RMPP). However, the pathogenesis of RMPP is poorly understood. The identification and characterization of proteins related to RMPP could provide a proof of principle to facilitate appropriate diagnostic and therapeutic strategies for treating patients with MP. In this study, we used a quantitative proteomic technique (iTRAQ) to analyze MP-related proteins in serum samples from 5 patients with RMPP, 5 patients with non-refractory MP pneumonia (NRMPP), and 5 healthy children. Functional classification, sub-cellular localization, and protein interaction network analysis were carried out based on protein annotation through evolutionary relationship (PANTHER) and Cytoscape analysis. A total of 260 differentially expressed proteins were identified in the RMPP and NRMPP groups. Compared to the control group, the NRMPP and RMPP groups showed 134 (70 up-regulated and 64 down-regulated) and 126 (63 up-regulated and 63 down-regulated) differentially expressed proteins, respectively. The complex functional classification and protein interaction network of the identified proteins reflected the complex pathogenesis of RMPP. Our study provides the first comprehensive proteome map of RMPP-related proteins from MP pneumonia. These profiles may be useful as part of a diagnostic panel, and the identified proteins provide new insights into the pathological mechanisms underlying RMPP.

  12. Camps 2.0: exploring the sequence and structure space of prokaryotic, eukaryotic, and viral membrane proteins.

    PubMed

    Neumann, Sindy; Hartmann, Holger; Martin-Galiano, Antonio J; Fuchs, Angelika; Frishman, Dmitrij

    2012-03-01

    Structural bioinformatics of membrane proteins is still in its infancy, and the picture of their fold space is only beginning to emerge. Because only a handful of three-dimensional structures are available, sequence comparison and structure prediction remain the main tools for investigating sequence-structure relationships in membrane protein families. Here we present a comprehensive analysis of the structural families corresponding to α-helical membrane proteins with at least three transmembrane helices. The new version of our CAMPS database (CAMPS 2.0) covers nearly 1300 eukaryotic, prokaryotic, and viral genomes. Using an advanced classification procedure, which is based on high-order hidden Markov models and considers both sequence similarity as well as the number of transmembrane helices and loop lengths, we identified 1353 structurally homogeneous clusters roughly corresponding to membrane protein folds. Only 53 clusters are associated with experimentally determined three-dimensional structures, and for these clusters CAMPS is in reasonable agreement with structure-based classification approaches such as SCOP and CATH. We therefore estimate that ∼1300 structures would need to be determined to provide a sufficient structural coverage of polytopic membrane proteins. CAMPS 2.0 is available at http://webclu.bio.wzw.tum.de/CAMPS2.0/. Copyright © 2011 Wiley Periodicals, Inc.

  13. CAB-Align: A Flexible Protein Structure Alignment Method Based on the Residue-Residue Contact Area.

    PubMed

    Terashi, Genki; Takeda-Shitaka, Mayuko

    2015-01-01

    Proteins are flexible, and this flexibility has an essential functional role. Flexibility can be observed in loop regions, rearrangements between secondary structure elements, and conformational changes between entire domains. However, most protein structure alignment methods treat protein structures as rigid bodies. Thus, these methods fail to identify the equivalences of residue pairs in regions with flexibility. In this study, we considered that the evolutionary relationship between proteins corresponds directly to the residue-residue physical contacts rather than the three-dimensional (3D) coordinates of proteins. Thus, we developed a new protein structure alignment method, contact area-based alignment (CAB-align), which uses the residue-residue contact area to identify regions of similarity. The main purpose of CAB-align is to identify homologous relationships at the residue level between related protein structures. The CAB-align procedure comprises two main steps: First, a rigid-body alignment method based on local and global 3D structure superposition is employed to generate a sufficient number of initial alignments. Then, iterative dynamic programming is executed to find the optimal alignment. We evaluated the performance and advantages of CAB-align based on four main points: (1) agreement with the gold standard alignment, (2) alignment quality based on an evolutionary relationship without 3D coordinate superposition, (3) consistency of the multiple alignments, and (4) classification agreement with the gold standard classification. 
    Comparisons of CAB-align with other state-of-the-art protein structure alignment methods (TM-align, FATCAT, and DaliLite) using our benchmark dataset showed that CAB-align performed robustly in obtaining high-quality alignments and generating consistent multiple alignments with high coverage and accuracy rates, and it performed extremely well when discriminating between homologous and nonhomologous pairs of proteins in both single and multi-domain comparisons. The CAB-align software is freely available to academic users as stand-alone software at http://www.pharm.kitasato-u.ac.jp/bmd/bmd/Publications.html.

  14. Comprehensive comparative analysis and identification of RNA-binding protein domains: multi-class classification and feature selection.

    PubMed

    Jahandideh, Samad; Srinivasasainagendra, Vinodh; Zhi, Degui

    2012-11-07

    RNA-protein interaction plays an important role in various cellular processes, such as protein synthesis, gene regulation, post-transcriptional gene regulation, alternative splicing, and infections by RNA viruses. In this study, using the Gene Ontology Annotated (GOA) and Structural Classification of Proteins (SCOP) databases, an automatic procedure was designed to capture structurally solved RNA-binding protein domains in different subclasses. Subsequently, we applied tuned multi-class SVM (TMCSVM), Random Forest (RF), and multi-class ℓ1/ℓq-regularized logistic regression (MCRLR) to analyze and classify RNA-binding protein domains based on a comprehensive set of sequence and structural features, and compared the prediction accuracy of these three state-of-the-art predictors. In our results, TMCSVM outperforms the other methods, which suggests its potential as a useful tool for multi-class prediction of RNA-binding protein domains. MCRLR, on the other hand, elucidates the contribution of individual features to the predictive accuracy for RNA-binding protein domain subclasses, and thereby provides some biological insights into the roles of sequences and structures in protein-RNA interactions.

  15. Classification of ligand molecules in PDB with graph match-based structural superposition.

    PubMed

    Shionyu-Mitsuyama, Clara; Hijikata, Atsushi; Tsuji, Toshiyuki; Shirai, Tsuyoshi

    2016-12-01

    The fast heuristic graph match algorithm for small molecules, COMPLIG, was improved by adding a structural superposition process to verify the atom-atom matching. The modified method was used to classify the small molecule ligands in the Protein Data Bank (PDB) by their three-dimensional structures, and 16,660 types of ligands in the PDB were classified into 7561 clusters. In contrast, a classification by a previous method (without structure superposition) generated 3371 clusters from the same ligand set. The characteristic feature in the current classification system is the increased number of singleton clusters, which contained only one ligand molecule in a cluster. Inspections of the singletons in the current classification system but not in the previous one implied that the major factors for the isolation were differences in chirality, cyclic conformations, separation of substructures, and bond length. Comparisons between current and previous classification systems revealed that the superposition-based classification was effective in clustering functionally related ligands, such as drugs targeted to specific biological processes, owing to the strictness of the atom-atom matching.

  16. Evolution and classification of the CRISPR-Cas systems

    PubMed Central

    Makarova, Kira S.; Haft, Daniel H.; Barrangou, Rodolphe; Brouns, Stan J. J.; Charpentier, Emmanuelle; Horvath, Philippe; Moineau, Sylvain; Mojica, Francisco J. M.; Wolf, Yuri I.; Yakunin, Alexander F.; van der Oost, John; Koonin, Eugene V.

    2012-01-01

    The CRISPR–Cas (clustered regularly interspaced short palindromic repeats–CRISPR-associated proteins) modules are adaptive immunity systems that are present in many archaea and bacteria. These defence systems are encoded by operons that have an extraordinarily diverse architecture and a high rate of evolution for both the cas genes and the unique spacer content. Here, we provide an updated analysis of the evolutionary relationships between CRISPR–Cas systems and Cas proteins. Three major types of CRISPR–Cas system are delineated, with a further division into several subtypes and a few chimeric variants. Given the complexity of the genomic architectures and the extremely dynamic evolution of the CRISPR–Cas systems, a unified classification of these systems should be based on multiple criteria. Accordingly, we propose a 'polythetic' classification that integrates the phylogenies of the most common cas genes, the sequence and organization of the CRISPR repeats and the architecture of the CRISPR–cas loci. PMID:21552286

  17. Automatic classification of protein structures relying on similarities between alignments

    PubMed Central

    2012-01-01

    Background: Identification of protein structural cores requires isolation of sets of proteins all sharing the same subset of structural motifs. In the context of an ever-growing number of available 3D protein structures, standard and automatic clustering algorithms require adaptations so as to allow for efficient identification of such sets of proteins. Results: When considering a pair of 3D structures, they are stated as similar or not according to the local similarities of their matching substructures in a structural alignment. This binary relation can be represented in a graph of similarities where a node represents a 3D protein structure and an edge states that two 3D protein structures are similar. Therefore, classifying proteins into structural families can be viewed as a graph clustering task. Unfortunately, because such a graph encodes only pairwise similarity information, clustering algorithms may include in the same cluster a subset of 3D structures that do not share a common substructure. In order to overcome this drawback, we first define a ternary similarity on a triple of 3D structures as a constraint to be satisfied by the graph of similarities. Such a ternary constraint takes into account similarities between pairwise alignments, so as to ensure that the three involved protein structures do have some common substructure. We propose a modification algorithm that eliminates edges from the original graph of similarities and gives a reduced graph in which no ternary constraints are violated. Our approach is then first to build a graph of similarities, then to reduce the graph according to the modification algorithm, and finally to apply a standard graph clustering algorithm to the reduced graph. This method was used for classifying ASTRAL-40 non-redundant protein domains, identifying significant pairwise similarities with Yakusa, a program devised for rapid 3D structure alignments. Conclusions: We show that filtering similarities prior to a standard graph-based clustering process by applying ternary similarity constraints (i) improves the separation of proteins of different classes and consequently (ii) improves the classification quality of standard graph-based clustering algorithms according to the reference classification SCOP. PMID:22974051

  18. Protein classification using modified n-grams and skip-grams.

    PubMed

    Islam, S M Ashiqul; Heil, Benjamin J; Kearney, Christopher Michel; Baker, Erich J

    2018-05-01

    Classification by supervised machine learning greatly facilitates the annotation of protein characteristics from their primary sequence. However, the feature generation step in this process requires detailed knowledge of attributes used to classify the proteins. Lack of this knowledge risks the selection of irrelevant features, resulting in a faulty model. In this study, we introduce a supervised protein classification method with a novel means of automating the work-intensive feature generation step via a Natural Language Processing (NLP)-dependent model, using a modified combination of n-grams and skip-grams (m-NGSG). A meta-comparison of cross-validation accuracy with twelve training datasets from nine different published studies demonstrates a consistent increase in accuracy of m-NGSG when compared to contemporary classification and feature generation models. We expect this model to accelerate the classification of proteins from primary sequence data and increase the accessibility of protein characteristic prediction to a broader range of scientists. m-NGSG is freely available at Bitbucket: https://bitbucket.org/sm_islam/mngsg/src. A web server is available at watson.ecs.baylor.edu/ngsg. Contact: erich_baker@baylor.edu. Supplementary data are available at Bioinformatics online.
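
    As a rough illustration of the two feature types named in this abstract, the sketch below counts plain n-grams and skip-bigrams in a protein sequence. It does not reproduce the modifications that define m-NGSG. The skip-bigram definition used here (two residues separated by exactly `skip` positions), the function names, and the `ng:` / `sg` feature prefixes are all assumptions made for the example.

```python
from collections import Counter


def ngrams(sequence, n):
    """All contiguous substrings of length n, in order of occurrence."""
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]


def skip_bigrams(sequence, skip):
    """Residue pairs with exactly `skip` residues between them (assumed definition)."""
    gap = skip + 1
    return [sequence[i] + sequence[i + gap] for i in range(len(sequence) - gap)]


def sequence_features(sequence, n=2, skip=1):
    """Count n-gram and skip-bigram features under hypothetical key prefixes."""
    features = Counter()
    for gram in ngrams(sequence, n):
        features[f"ng:{gram}"] += 1
    for pair in skip_bigrams(sequence, skip):
        features[f"sg{skip}:{pair}"] += 1
    return features
```

    The resulting counts could serve as a sparse feature vector for any standard supervised classifier.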

  19. A Novel Biclustering Approach to Association Rule Mining for Predicting HIV-1–Human Protein Interactions

    PubMed Central

    Mukhopadhyay, Anirban; Maulik, Ujjwal; Bandyopadhyay, Sanghamitra

    2012-01-01

    Identification of potential viral-host protein interactions is a vital and useful approach towards development of new drugs targeting those interactions. In recent days, computational tools are being utilized for predicting viral-host interactions. Recently a database containing records of experimentally validated interactions between a set of HIV-1 proteins and a set of human proteins has been published. The problem of predicting new interactions based on this database is usually posed as a classification problem. However, posing the problem as a classification one suffers from the lack of biologically validated negative interactions. Therefore it will be beneficial to use the existing database for predicting new viral-host interactions without the need of negative samples. Motivated by this, in this article, the HIV-1–human protein interaction database has been analyzed using association rule mining. The main objective is to identify a set of association rules both among the HIV-1 proteins and among the human proteins, and use these rules for predicting new interactions. In this regard, a novel association rule mining technique based on biclustering has been proposed for discovering frequent closed itemsets followed by the association rules from the adjacency matrix of the HIV-1–human interaction network. Novel HIV-1–human interactions have been predicted based on the discovered association rules and tested for biological significance. For validation of the predicted new interactions, gene ontology-based and pathway-based studies have been performed. These studies show that the human proteins which are predicted to interact with a particular viral protein share many common biological activities. Moreover, literature survey has been used for validation purpose to identify some predicted interactions that are already validated experimentally but not present in the database. Comparison with other prediction methods is also discussed. PMID:22539940
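
    The rule-based prediction described in this record rests on the support and confidence of association rules. The sketch below computes both for a single rule on a made-up interaction table; it does not implement the biclustering-based mining of frequent closed itemsets, and the protein identifiers (`h1`, `v1`, ...) are placeholders, not real data.

```python
def support(transactions, items):
    # Fraction of human proteins whose viral partner set contains all items.
    items = set(items)
    return sum(items <= partners for partners in transactions.values()) / len(transactions)

def confidence(transactions, lhs, rhs):
    # Estimated P(rhs | lhs) over the transactions.
    return support(transactions, set(lhs) | set(rhs)) / support(transactions, lhs)

# Toy adjacency: human protein -> set of interacting viral proteins (invented).
transactions = {
    "h1": {"v1", "v2"},
    "h2": {"v1", "v2"},
    "h3": {"v1"},
    "h4": {"v3"},
}

conf = confidence(transactions, {"v1"}, {"v2"})

# A rule v1 -> v2 suggests v2 as a candidate partner wherever v1 is
# present and v2 is absent; no negative examples are needed.
candidates = [(h, "v2") for h, partners in sorted(transactions.items())
              if "v1" in partners and "v2" not in partners]
```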

  20. The use of multiplexed MRM for the discovery of biomarkers to differentiate iron-deficiency anemia from anemia of inflammation.

    PubMed

    Domanski, Dominik; Cohen Freue, Gabriela V; Sojo, Luis; Kuzyk, Michael A; Ratkay, Leslie; Parker, Carol E; Goldberg, Y Paul; Borchers, Christoph H

    2012-06-27

    In this study we demonstrate the use of a multiplexed MRM-based assay to distinguish among normal (NL) and iron-metabolism disorder mouse models, particularly, iron-deficiency anemia (IDA), inflammation (INFL), and inflammation and anemia (INFL+IDA). Our initial panel of potential biomarkers was based on the analysis of 14 proteins expressed by candidate genes involved in iron transport and metabolism. Based on this study, we were able to identify a panel of 8 biomarker proteins: apolipoprotein A4 (APO4), transferrin, transferrin receptor 1, ceruloplasmin, haptoglobin, lactoferrin, hemopexin, and matrix metalloproteinase-8 (MMP8) that clearly distinguish among the normal and disease models. Within this set of proteins, transferrin showed the best individual classification accuracy over all samples (72%) and within the NL group (94%). Compared to the best single-protein biomarker, transferrin, the use of the composite 8-protein biomarker panel improved the classification accuracy from 94% to 100% in the NL group, from 50% to 72% in the INFL group, from 66% to 96% in the IDA group, and from 79% to 83% in the INFL+IDA group. Based on these findings, validation of the utility of this potentially important biomarker panel in human samples in an effort to differentiate IDA, inflammation, and combinations thereof, is now warranted. This article is part of a Special Section entitled: Understanding genome regulation and genetic diversity by mass spectrometry. Copyright © 2011 Elsevier B.V. All rights reserved.

  1. GOLabeler: Improving Sequence-based Large-scale Protein Function Prediction by Learning to Rank.

    PubMed

    You, Ronghui; Zhang, Zihan; Xiong, Yi; Sun, Fengzhu; Mamitsuka, Hiroshi; Zhu, Shanfeng

    2018-03-07

    Gene Ontology (GO) has been widely used to annotate functions of proteins and understand their biological roles. Currently only <1% of more than 70 million proteins in UniProtKB have experimental GO annotations, implying the strong necessity of automated function prediction (AFP) of proteins, where AFP is a hard multilabel classification problem due to one protein with a diverse number of GO terms. Most of these proteins have only sequences as input information, indicating the importance of sequence-based AFP (SAFP: sequences are the only input). Furthermore homology-based SAFP tools are competitive in AFP competitions, while they do not necessarily work well for so-called difficult proteins, which have <60% sequence identity to proteins with annotations already. Thus the vital and challenging problem now is how to develop a method for SAFP, particularly for difficult proteins. The key to this method is to extract not only homology information but also diverse, deep-rooted information/evidence from sequence inputs and integrate them into a predictor in a manner that is both effective and efficient. We propose GOLabeler, which integrates five component classifiers, trained from different features, including GO term frequency, sequence alignment, amino acid trigram, domains and motifs, and biophysical properties, etc., in the framework of learning to rank (LTR), a paradigm of machine learning, especially powerful for multilabel classification. The empirical results obtained by examining GOLabeler extensively and thoroughly by using large-scale datasets revealed numerous favorable aspects of GOLabeler, including significant performance advantage over state-of-the-art AFP methods. Availability: http://datamining-iip.fudan.edu.cn/golabeler. Contact: zhusf@fudan.edu.cn. Supplementary data are available at Bioinformatics online.

  2. Integrating alignment-based and alignment-free sequence similarity measures for biological sequence classification.

    PubMed

    Borozan, Ivan; Watt, Stuart; Ferretti, Vincent

    2015-05-01

    Alignment-based sequence similarity searches, while accurate for some types of sequences, can produce incorrect results when used on more divergent but functionally related sequences that have undergone the sequence rearrangements observed in many bacterial and viral genomes. Here, we propose a classification model that exploits the complementary nature of alignment-based and alignment-free similarity measures with the aim to improve the accuracy with which DNA and protein sequences are characterized. Our model classifies sequences using a combined sequence similarity score calculated by adaptively weighting the contribution of different sequence similarity measures. Weights are determined independently for each sequence in the test set and reflect the discriminatory ability of individual similarity measures in the training set. Because the similarity between some sequences is determined more accurately with one type of measure rather than another, our classifier allows different sets of weights to be associated with different sequences. Using five different similarity measures, we show that our model significantly improves the classification accuracy over the current composition- and alignment-based models, when predicting the taxonomic lineage for both short viral sequence fragments and complete viral sequences. We also show that our model can be used effectively for the classification of reads from a real metagenome dataset as well as protein sequences. All the datasets and the code used in this study are freely available at https://collaborators.oicr.on.ca/vferretti/borozan_csss/csss.html. Contact: ivan.borozan@gmail.com. Supplementary data are available at Bioinformatics online. © The Author 2015. Published by Oxford University Press.

  3. Integrating alignment-based and alignment-free sequence similarity measures for biological sequence classification

    PubMed Central

    Borozan, Ivan; Watt, Stuart; Ferretti, Vincent

    2015-01-01

    Motivation: Alignment-based sequence similarity searches, while accurate for some types of sequences, can produce incorrect results when used on more divergent but functionally related sequences that have undergone the sequence rearrangements observed in many bacterial and viral genomes. Here, we propose a classification model that exploits the complementary nature of alignment-based and alignment-free similarity measures with the aim to improve the accuracy with which DNA and protein sequences are characterized. Results: Our model classifies sequences using a combined sequence similarity score calculated by adaptively weighting the contribution of different sequence similarity measures. Weights are determined independently for each sequence in the test set and reflect the discriminatory ability of individual similarity measures in the training set. Because the similarity between some sequences is determined more accurately with one type of measure rather than another, our classifier allows different sets of weights to be associated with different sequences. Using five different similarity measures, we show that our model significantly improves the classification accuracy over the current composition- and alignment-based models, when predicting the taxonomic lineage for both short viral sequence fragments and complete viral sequences. We also show that our model can be used effectively for the classification of reads from a real metagenome dataset as well as protein sequences. Availability and implementation: All the datasets and the code used in this study are freely available at https://collaborators.oicr.on.ca/vferretti/borozan_csss/csss.html. Contact: ivan.borozan@gmail.com. Supplementary information: Supplementary data are available at Bioinformatics online. PMID:25573913

  4. Classification of Ancient Mammal Individuals Using Dental Pulp MALDI-TOF MS Peptide Profiling

    PubMed Central

    Tran, Thi-Nguyen-Ny; Aboudharam, Gérard; Gardeisen, Armelle; Davoust, Bernard; Bocquet-Appel, Jean-Pierre; Flaudrops, Christophe; Belghazi, Maya; Raoult, Didier; Drancourt, Michel

    2011-01-01

    Background The classification of ancient animal corpses at the species level remains a challenging task for forensic scientists and anthropologists. Severe damage and mixed, tiny pieces originating from several skeletons may render morphological classification virtually impossible. Standard approaches are based on sequencing mitochondrial and nuclear targets. Methodology/Principal Findings We present a method that can accurately classify mammalian species using dental pulp and mass spectrometry peptide profiling. Our work was organized into three successive steps. First, after extracting proteins from the dental pulp collected from 37 modern individuals representing 13 mammalian species, trypsin-digested peptides were used for matrix-assisted laser desorption/ionization time-of-flight mass spectrometry analysis. The resulting peptide profiles accurately classified every individual at the species level in agreement with the parallel cytochrome b gene sequencing gold standard. Second, using a 279–modern spectrum database, we blindly classified 33 of 37 teeth collected in 37 modern individuals (89.1%). Third, we classified 10 of 18 teeth (56%) collected in 15 ancient individuals representing five mammal species including human, from five burial sites dating back 8,500 years. Further comparison with an upgraded database comprising ancient specimen profiles yielded 100% classification in ancient teeth. Peptide sequencing yielded 4 and 16 different non-keratin proteins including collagen (alpha-1 type I and alpha-2 type I) in human ancient and modern dental pulp, respectively. Conclusions/Significance Mass spectrometry peptide profiling of the dental pulp is a new approach that can be added to the arsenal of species classification tools for forensics and anthropology as a complementary method to DNA sequencing. The dental pulp is a new source for collagen and other proteins for the species classification of modern and ancient mammal individuals. PMID:21364886

  5. Specificity of molecular interactions in transient protein-protein interaction interfaces.

    PubMed

    Cho, Kyu-il; Lee, KiYoung; Lee, Kwang H; Kim, Dongsup; Lee, Doheon

    2006-11-15

    In this study, we investigate what types of interactions are specific to their biological function, and what types of interactions are persistent regardless of their functional category in transient protein-protein heterocomplexes. This is the first approach to analyze protein-protein interfaces systematically at the molecular interaction level in the context of protein functions. We perform systematic analysis at the molecular interaction level using classification and feature subset selection techniques prevalent in the field of pattern recognition. To represent the physicochemical properties of protein-protein interfaces, we design 18 molecular interaction types using canonical and noncanonical interactions. Then, we construct an input vector using the frequency of each interaction type in each protein-protein interface. We analyze the 131 interfaces of transient protein-protein heterocomplexes in PDB: 33 protease-inhibitors, 52 antibody-antigens, 46 signaling proteins including 4 cyclin dependent kinase and 26 G-protein. Using kNN classification and feature subset selection techniques, we show that there are specific interaction types based on their functional category, and such interaction types are conserved through the common binding mechanism, rather than through the sequence or structure conservation. The extracted interaction types are C(alpha)--H...O==C interaction, cation...anion interaction, amine...amine interaction, and amine...cation interaction. With these four interaction types, we achieve a classification success rate of up to 83.2% with leave-one-out cross-validation at k = 15. Of these four interaction types, C(alpha)--H...O==C shows binding specificity for protease-inhibitor complexes, while cation...anion interaction is predominant in signaling complexes. The amine...amine and amine...cation interactions give a minor contribution to the classification accuracy. When combined with these two interactions, they increase the accuracy by 3.8%. 
In the case of antibody-antigen complexes, the sign is somewhat ambiguous. From the evolutionary perspective, while protease-inhibitors and signaling proteins have optimized their interfaces to suit their biological functions, antibody-antigen interactions are the happenstance, implying that antibody-antigen complexes do not show distinctive interaction types. Persistent interaction types such as pi...pi, amide-carbonyl, and hydroxyl-carbonyl interaction, are also investigated. Analyzing the structural orientations of the pi...pi stacking interactions, we find that herringbone shape is a major configuration in transient protein-protein interfaces. This result is different from that of protein core, where parallel-displaced configurations are the major configuration. We also analyze overall trend of amide-carbonyl and hydroxyl-carbonyl interactions. It is noticeable that nearly 82% of the interfaces have at least one hydroxyl-carbonyl interaction. (c) 2006 Wiley-Liss, Inc.
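
    The evaluation scheme named in this record, kNN classification scored by leave-one-out cross-validation, can be sketched as follows. The two-dimensional vectors stand in for interaction-type frequency vectors and are invented; the Euclidean distance, the majority vote, and k = 3 are assumptions (the abstract reports k = 15 on 131 interfaces).

```python
import math
from collections import Counter

def knn_predict(train, query, k):
    # train: list of (vector, label); majority vote among the k nearest points.
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def loocv_accuracy(data, k):
    # Hold out each sample once and predict it from all the others.
    hits = 0
    for i, (vec, label) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        hits += knn_predict(rest, vec, k) == label
    return hits / len(data)

# Toy frequency vectors (made up), two functional categories.
data = [
    ((5.0, 1.0), "protease-inhibitor"),
    ((6.0, 1.0), "protease-inhibitor"),
    ((5.0, 2.0), "protease-inhibitor"),
    ((1.0, 5.0), "signaling"),
    ((1.0, 6.0), "signaling"),
    ((2.0, 5.0), "signaling"),
]
accuracy = loocv_accuracy(data, k=3)
```

    Feature subset selection would wrap this loop, rerunning it on different subsets of vector components and keeping the subset with the best accuracy.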

  6. The use of neural networks and texture analysis for rapid objective selection of regions of interest in cytoskeletal images.

    PubMed

    Derkacs, Amanda D Felder; Ward, Samuel R; Lieber, Richard L

    2012-02-01

    Understanding cytoskeletal dynamics in living tissue is prerequisite to understanding mechanisms of injury, mechanotransduction, and mechanical signaling. Real-time visualization is now possible using transfection with plasmids that encode fluorescent cytoskeletal proteins. Using this approach with the muscle-specific intermediate filament protein desmin, we found that a green fluorescent protein-desmin chimeric protein was unevenly distributed throughout the muscle fiber, resulting in some image areas that were saturated as well as others that lacked any signal. Our goal was to analyze the muscle fiber cytoskeletal network quantitatively in an unbiased fashion. To objectively select areas of the muscle fiber that are suitable for analysis, we devised a method that provides objective classification of regions of images of striated cytoskeletal structures into "usable" and "unusable" categories. This method consists of a combination of spatial analysis of the image using Fourier methods along with a boosted neural network that "decides" on the quality of the image based on previous training. We trained the neural network using the expert opinion of three scientists familiar with these types of images. We found that this method was over 300 times faster than manual classification and that it permitted objective and accurate classification of image regions.

  7. PlantTribes: a gene and gene family resource for comparative genomics in plants

    PubMed Central

    Wall, P. Kerr; Leebens-Mack, Jim; Müller, Kai F.; Field, Dawn; Altman, Naomi S.; dePamphilis, Claude W.

    2008-01-01

    The PlantTribes database (http://fgp.huck.psu.edu/tribe.html) is a plant gene family database based on the inferred proteomes of five sequenced plant species: Arabidopsis thaliana, Carica papaya, Medicago truncatula, Oryza sativa and Populus trichocarpa. We used the graph-based clustering algorithm MCL [Van Dongen (Technical Report INS-R0010 2000) and Enright et al. (Nucleic Acids Res. 2002; 30: 1575–1584)] to classify all of these species’ protein-coding genes into putative gene families, called tribes, using three clustering stringencies (low, medium and high). For all tribes, we have generated protein and DNA alignments and maximum-likelihood phylogenetic trees. A parallel database of microarray experimental results is linked to the genes, which lets researchers identify groups of related genes and their expression patterns. Unified nomenclatures were developed, and tribes can be related to traditional gene families and conserved domain identifiers. SuperTribes, constructed through a second iteration of MCL clustering, connect distant, but potentially related gene clusters. The global classification of nearly 200 000 plant proteins was used as a scaffold for sorting ∼4 million additional cDNA sequences from over 200 plant species. All data and analyses are accessible through a flexible interface allowing users to explore the classification, to place query sequences within the classification, and to download results for further study. PMID:18073194

  8. Improved protein surface comparison and application to low-resolution protein structure data.

    PubMed

    Sael, Lee; Kihara, Daisuke

    2010-12-14

    Recent advancements of experimental techniques for determining protein tertiary structures raise significant challenges for protein bioinformatics. With the number of known structures of unknown function expanding at a rapid pace, an urgent task is to provide reliable clues to their biological function on a large scale. Conventional approaches for structure comparison are not suitable for a real-time database search due to their slow speed. Moreover, a new challenge has arisen from recent techniques such as electron microscopy (EM), which provide low-resolution structure data. Previously, we have introduced a method for protein surface shape representation using the 3D Zernike descriptors (3DZDs). The 3DZD enables fast structure database searches, taking advantage of its rotation invariance and compact representation. The search results for protein surfaces represented with the 3DZD have shown good agreement with the existing structure classifications, but some discrepancies were also observed. Three surface representations are examined: backbone atoms, the originally devised all-atom surface, and the combination of the all-atom surface with the backbone representation. All representations are encoded with the 3DZD. Also, we have investigated the applicability of the 3DZD for searching protein EM density maps of varying resolutions. The surface representations are evaluated on structure retrieval using two existing classifications, SCOP and the CE-based classification. Overall, the 3DZDs representing backbone atoms show better retrieval performance than the original all-atom surface representation. The performance further improved when the two representations were combined. Moreover, we observed that the 3DZD is also powerful in comparing low-resolution structures obtained by electron microscopy.

  9. Automated hierarchical classification of protein domain subfamilies based on functionally-divergent residue signatures

    PubMed Central

    2012-01-01

    Background The NCBI Conserved Domain Database (CDD) consists of a collection of multiple sequence alignments of protein domains that are at various stages of being manually curated into evolutionary hierarchies based on conserved and divergent sequence and structural features. These domain models are annotated to provide insights into the relationships between sequence, structure and function via web-based BLAST searches. Results Here we automate the generation of conserved domain (CD) hierarchies using a combination of heuristic and Markov chain Monte Carlo (MCMC) sampling procedures and starting from a (typically very large) multiple sequence alignment. This procedure relies on statistical criteria to define each hierarchy based on the conserved and divergent sequence patterns associated with protein functional-specialization. At the same time this facilitates the sequence and structural annotation of residues that are functionally important. These statistical criteria also provide a means to objectively assess the quality of CD hierarchies, a non-trivial task considering that the protein subgroups are often very distantly related—a situation in which standard phylogenetic methods can be unreliable. Our aim here is to automatically generate (typically sub-optimal) hierarchies that, based on statistical criteria and visual comparisons, are comparable to manually curated hierarchies; this serves as the first step toward the ultimate goal of obtaining optimal hierarchical classifications. A plot of runtimes for the most time-intensive (non-parallelizable) part of the algorithm indicates a nearly linear time complexity so that, even for the extremely large Rossmann fold protein class, results were obtained in about a day. Conclusions This approach automates the rapid creation of protein domain hierarchies and thus will eliminate one of the most time consuming aspects of conserved domain database curation. 
At the same time, it also facilitates protein domain annotation by identifying those pattern residues that most distinguish each protein domain subgroup from other related subgroups. PMID:22726767

  10. Label noise in subtype discrimination of class C G protein-coupled receptors: A systematic approach to the analysis of classification errors.

    PubMed

    König, Caroline; Cárdenas, Martha I; Giraldo, Jesús; Alquézar, René; Vellido, Alfredo

    2015-09-29

    The characterization of proteins in families and subfamilies, at different levels, entails the definition and use of class labels. When the adscription of a protein to a family is uncertain, or even wrong, this becomes an instance of what has come to be known as a label noise problem. Label noise has a potentially negative effect on any quantitative analysis of proteins that depends on label information. This study investigates class C of G protein-coupled receptors, which are cell membrane proteins of relevance both to biology in general and pharmacology in particular. Their supervised classification into different known subtypes, based on primary sequence data, is hampered by label noise. The latter may stem from a combination of expert knowledge limitations and the lack of a clear correspondence between labels that mostly reflect GPCR functionality and the different representations of the protein primary sequences. In this study, we describe a systematic approach, using Support Vector Machine classifiers, to the analysis of G protein-coupled receptor misclassifications. As a proof of concept, this approach is used to assist the discovery of labeling quality problems in a curated, publicly accessible database of this type of proteins. We also investigate the extent to which physico-chemical transformations of the protein sequences reflect G protein-coupled receptor subtype labeling. The candidate mislabeled cases detected with this approach are externally validated with phylogenetic trees and against further trusted sources such as the National Center for Biotechnology Information, Universal Protein Resource, European Bioinformatics Institute and Ensembl Genome Browser information repositories. In quantitative classification problems, class labels are often by default assumed to be correct. 
Label noise, though, is bound to be a pervasive problem in bioinformatics, where labels may be obtained indirectly through complex, many-step similarity modelling processes. In the case of G protein-coupled receptors, methods capable of singling out and characterizing those sequences with consistent misclassification behaviour are required to minimize this problem. A systematic, Support Vector Machine-based method has been proposed in this study for such purpose. The proposed method enables a filtering approach to the label noise problem and might become a support tool for database curators in proteomics.

  11. Genic insights from integrated human proteomics in GeneCards.

    PubMed

    Fishilevich, Simon; Zimmerman, Shahar; Kohn, Asher; Iny Stein, Tsippi; Olender, Tsviya; Kolker, Eugene; Safran, Marilyn; Lancet, Doron

    2016-01-01

    GeneCards is a one-stop shop for searchable human gene annotations (http://www.genecards.org/). Data are automatically mined from ∼120 sources and presented in an integrated web card for every human gene. We report the application of recent advances in proteomics to enhance gene annotation and classification in GeneCards. First, we constructed the Human Integrated Protein Expression Database (HIPED), a unified database of protein abundance in human tissues, based on the publically available mass spectrometry (MS)-based proteomics sources ProteomicsDB, Multi-Omics Profiling Expression Database, Protein Abundance Across Organisms and The MaxQuant DataBase. The integrated database, residing within GeneCards, compares favourably with its individual sources, covering nearly 90% of human protein-coding genes. For gene annotation and comparisons, we first defined a protein expression vector for each gene, based on normalized abundances in 69 normal human tissues. This vector is portrayed in the GeneCards expression section as a bar graph, allowing visual inspection and comparison. These data are juxtaposed with transcriptome bar graphs. Using the protein expression vectors, we further defined a pairwise metric that helps assess expression-based pairwise proximity. This new metric for finding functional partners complements eight others, including sharing of pathways, gene ontology (GO) terms and domains, implemented in the GeneCards Suite. In parallel, we calculated proteome-based differential expression, highlighting a subset of tissues that overexpress a gene and subserving gene classification. This textual annotation allows users of VarElect, the suite's next-generation phenotyper, to more effectively discover causative disease variants. Finally, we define the protein-RNA expression ratio and correlation as yet another attribute of every gene in each tissue, adding further annotative information. 
The results constitute a significant enhancement of several GeneCards sections and help promote and organize the genome-wide structural and functional knowledge of the human proteome. Database URL: http://www.genecards.org/. © The Author(s) 2016. Published by Oxford University Press.

  12. Sequentially distant but structurally similar proteins exhibit fold specific patterns based on their biophysical properties.

    PubMed

    Rajendran, Senthilnathan; Jothi, Arunachalam

    2018-05-16

    The three-dimensional structure of a protein depends on the interactions between its amino acid residues. These interactions are in turn influenced by various biophysical properties of the amino acids. There are several examples of proteins that share the same fold but are very dissimilar at the sequence level. For proteins to share a common fold some crucial interactions should be maintained despite insignificant sequence similarity. Since the interactions arise from the biophysical properties of the amino acids, we should be able to detect descriptive patterns for folds at such a property level. In this line, the main focus of our research is to analyze such proteins and to characterize them in terms of their biophysical properties. Protein structures with sequence similarity less than 40% were selected for ten different subfolds from three different mainfolds (according to CATH classification) and were used for this analysis. We used the normalized values of the 49 physicochemical, energetic and conformational properties of amino acids. We characterize the folds based on the average biophysical property values. We also observed a fold specific correlational behavior of biophysical properties despite a very low sequence similarity in our data. We further trained three different binary classification models (Naive Bayes-NB, Support Vector Machines-SVM and Bayesian Generalized Linear Model-BGLM) which could discriminate mainfold based on the biophysical properties. We also show that among the three generated models, the BGLM classifier model was able to discriminate protein sequences coming under all beta category with 81.43% accuracy and all alpha, alpha-beta proteins with 83.37% accuracy. Copyright © 2018 Elsevier Ltd. All rights reserved.
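
    Characterizing a sequence by an average property value can be sketched with a single scale. The example uses the Kyte-Doolittle hydropathy index as a stand-in; the abstract does not list its 49 properties or its normalization, so the choice of scale and the toy sequences are assumptions.

```python
# Kyte-Doolittle hydropathy index per residue (used here only as an example scale).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def mean_property(seq, scale=KD):
    # Average property value over all residues of one sequence.
    return sum(scale[res] for res in seq) / len(seq)

hydrophobic = mean_property("AIL")   # toy peptide
charged = mean_property("KDE")       # toy peptide
```

    Repeating this over many scales gives one feature vector per protein, which is the kind of input a Naive Bayes, SVM, or generalized linear classifier could be trained on.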

  13. Different evolutionary patterns of SNPs between domains and unassigned regions in human protein-coding sequences.

    PubMed

    Pang, Erli; Wu, Xiaomei; Lin, Kui

    2016-06-01

    Protein evolution plays an important role in the evolution of each genome. Because of their functional nature, in general, most of their parts or sites are differently constrained selectively, particularly by purifying selection. Most previous studies on protein evolution considered individual proteins in their entirety or compared protein-coding sequences with non-coding sequences. Less attention has been paid to the evolution of different parts within each protein of a given genome. To this end, based on PfamA annotation of all human proteins, each protein sequence can be split into two parts: domains or unassigned regions. Using this rationale, single nucleotide polymorphisms (SNPs) in protein-coding sequences from the 1000 Genomes Project were mapped according to two classifications: SNPs occurring within protein domains and those within unassigned regions. With these classifications, we found: the density of synonymous SNPs within domains is significantly greater than that of synonymous SNPs within unassigned regions; however, the density of non-synonymous SNPs shows the opposite pattern. We also found there are signatures of purifying selection on both the domain and unassigned regions. Furthermore, the selective strength on domains is significantly greater than that on unassigned regions. In addition, among all of the human protein sequences, there are 117 PfamA domains in which no SNPs are found. Our results highlight an important aspect of protein domains and may contribute to our understanding of protein evolution.

  14. Anaplasma marginale major surface protein 1a: a marker of strain diversity with implications for control of bovine anaplasmosis.

    PubMed

    Cabezas-Cruz, Alejandro; de la Fuente, José

    2015-04-01

    Classification of bacteria is challenging due to the lack of a theory-based framework. In addition, the adaptation of bacteria to ecological niches often results in selection of strains with diverse virulence, pathogenicity and transmission characteristics. Bacterial strain diversity presents challenges for taxonomic classification, which in turn impacts the ability to develop accurate diagnostics and effective vaccines. Over the past decade, the worldwide diversity of Anaplasma marginale, an economically important tick-borne pathogen of cattle, has become apparent. The extent of A. marginale strain diversity, formerly underappreciated, has contributed to the challenges of classification which, in turn, likely impacts the design and development of improved vaccines. Notably, the A. marginale surface protein 1a (MSP1a) is a model molecule for these studies because it serves as a marker for strain identity, is both an adhesin necessary for infection of cells and an immuno-reactive protein, and is also an indicator of the evolution of strain diversity. Herein, we discuss a molecular taxonomic approach for classification of A. marginale strain diversity. Taxonomic analysis of this important molecule provides the opportunity to understand A. marginale strain diversity as it relates to geographic and ecological factors and to the development of effective vaccines for control of bovine anaplasmosis worldwide. Copyright © 2015 Elsevier GmbH. All rights reserved.

  15. Cross-over between discrete and continuous protein structure space: insights into automatic classification and networks of protein structures.

    PubMed

    Pascual-García, Alberto; Abia, David; Ortiz, Angel R; Bastolla, Ugo

    2009-03-01

    Structural classifications of proteins assume the existence of the fold, which is an intrinsic equivalence class of protein domains. Here, we test in which conditions such an equivalence class is compatible with objective similarity measures. We base our analysis on the transitive property of the equivalence relationship, requiring that similarity of A with B and B with C implies that A and C are also similar. Divergent gene evolution leads us to expect that the transitive property should approximately hold. However, if protein domains are a combination of recurrent short polypeptide fragments, as proposed by several authors, then similarity of partial fragments may violate the transitive property, favouring the continuous view of the protein structure space. We propose a measure to quantify the violations of the transitive property when a clustering algorithm joins elements into clusters, and we find out that such violations present a well defined and detectable cross-over point, from an approximately transitive regime at high structure similarity to a regime with large transitivity violations and large differences in length at low similarity. We argue that protein structure space is discrete and hierarchic classification is justified up to this cross-over point, whereas at lower similarities the structure space is continuous and it should be represented as a network. We have tested the qualitative behaviour of this measure, varying all the choices involved in the automatic classification procedure, i.e., domain decomposition, alignment algorithm, similarity score, and clustering algorithm, and we have found out that this behaviour is quite robust. The final classification depends on the chosen algorithms. We used the values of the clustering coefficient and the transitivity violations to select the optimal choices among those that we tested. Interestingly, this criterion also favours the agreement between automatic and expert classifications. 
    As a domain set, we have selected a consensus set of 2,890 domains decomposed very similarly in SCOP and CATH. As an alignment algorithm, we used a global version of MAMMOTH developed in our group, which is both rapid and accurate. As a similarity measure, we used the size-normalized contact overlap, and as a clustering algorithm, we used average linkage. The resulting automatic classification at the cross-over point was more consistent than expert ones with respect to the structure similarity measure, with 86% of the clusters corresponding to subsets of either SCOP or CATH superfamilies and fewer than 5% containing domains in distinct folds according to both SCOP and CATH. Almost 15% of SCOP superfamilies and 10% of CATH superfamilies were split, consistent with the notion of fold change in protein evolution. These results were qualitatively robust for all choices that we tested, although we did not try to use alignment algorithms developed by other groups. Folds defined in SCOP and CATH would be completely joined in the regime of large transitivity violations where clustering is more arbitrary. Consistently, the agreement between SCOP and CATH at fold level was lower than their agreement with the automatic classification obtained using as a clustering algorithm, respectively, average linkage (for SCOP) or single linkage (for CATH). The networks representing significant evolutionary and structural relationships between clusters beyond the cross-over point may allow us to perform evolutionary, structural, or functional analyses beyond the limits of classification schemes. These networks and the underlying clusters are available at http://ub.cbm.uam.es/research/ProtNet.php.
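
    The transitivity requirement described in this abstract (A similar to B and B similar to C should imply A similar to C) can be illustrated with a toy count of violating triples. This is only a minimal sketch: the function name, the fixed similarity threshold, and the score matrix are hypothetical, and the count below is not the violation measure defined by the authors, which is tied to the clustering procedure.

```python
from itertools import combinations

def transitivity_violations(sim, threshold):
    """Return (violating, closed) triple counts for a symmetric score matrix.

    A triple is 'violating' when exactly two of its three pairs reach the
    threshold, and 'closed' when all three pairs do.
    """
    violating = 0
    closed = 0
    for a, b, c in combinations(range(len(sim)), 3):
        linked = sum(
            score >= threshold
            for score in (sim[a][b], sim[b][c], sim[a][c])
        )
        if linked == 3:
            closed += 1
        elif linked == 2:
            violating += 1
    return violating, closed

# Hypothetical similarity scores for four domains (illustrative values only).
sim = [
    [1.0, 0.8, 0.3, 0.1],
    [0.8, 1.0, 0.7, 0.2],
    [0.3, 0.7, 1.0, 0.1],
    [0.1, 0.2, 0.1, 1.0],
]
strict = transitivity_violations(sim, 0.6)
loose = transitivity_violations(sim, 0.25)
```

    Scanning the threshold over a range of values and recording the two counts is one simple way to look for a change of regime in data of this kind.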

  16. HAMAP in 2013, new developments in the protein family classification and annotation system

    PubMed Central

    Pedruzzi, Ivo; Rivoire, Catherine; Auchincloss, Andrea H.; Coudert, Elisabeth; Keller, Guillaume; de Castro, Edouard; Baratin, Delphine; Cuche, Béatrice A.; Bougueleret, Lydie; Poux, Sylvain; Redaschi, Nicole; Xenarios, Ioannis; Bridge, Alan

    2013-01-01

    HAMAP (High-quality Automated and Manual Annotation of Proteins—available at http://hamap.expasy.org/) is a system for the classification and annotation of protein sequences. It consists of a collection of manually curated family profiles for protein classification, and associated annotation rules that specify annotations that apply to family members. HAMAP was originally developed to support the manual curation of UniProtKB/Swiss-Prot records describing microbial proteins. Here we describe new developments in HAMAP, including the extension of HAMAP to eukaryotic proteins, the use of HAMAP in the automated annotation of UniProtKB/TrEMBL, providing high-quality annotation for millions of protein sequences, and the future integration of HAMAP into a unified system for UniProtKB annotation, UniRule. HAMAP is continuously updated by expert curators with new family profiles and annotation rules as new protein families are characterized. The collection of HAMAP family classification profiles and annotation rules can be browsed and viewed on the HAMAP website, which also provides an interface to scan user sequences against HAMAP profiles. PMID:23193261

  17. Sequence-based protein superfamily classification using computational intelligence techniques: a review.

    PubMed

    Vipsita, Swati; Rath, Santanu Kumar

    2015-01-01

    Protein superfamily classification deals with the problem of predicting the family membership of a newly discovered amino acid sequence. Although many trivial alignment methods have already been developed by previous researchers, the present trend demands the application of computational intelligence techniques. As biological databases grow exponentially in size, retrieval and inference of essential knowledge in the biological domain become very cumbersome tasks. This problem can be handled using intelligent techniques owing to their tolerance for imprecision, uncertainty, approximate reasoning, and partial truth. This paper discusses the various global and local features extracted from full-length protein sequences that are used for the approximation and generalisation of the classifier. The various parameters used for evaluating the performance of the classifiers are also discussed. This review article can therefore point researchers toward improvements over the existing methods.

  18. HHsvm: fast and accurate classification of profile–profile matches identified by HHsearch

    PubMed Central

    Dlakić, Mensur

    2009-01-01

    Motivation: Recently developed profile–profile methods rival structural comparisons in their ability to detect homology between distantly related proteins. Despite this tremendous progress, many genuine relationships between protein families cannot be recognized as comparisons of their profiles result in scores that are statistically insignificant. Results: Using known evolutionary relationships among protein superfamilies in SCOP database, support vector machines were trained on four sets of discriminatory features derived from the output of HHsearch. Upon validation, it was shown that the automatic classification of all profile–profile matches was superior to fixed threshold-based annotation in terms of sensitivity and specificity. The effectiveness of this approach was demonstrated by annotating several domains of unknown function from the Pfam database. Availability: Programs and scripts implementing the methods described in this manuscript are freely available from http://hhsvm.dlakiclab.org/. Contact: mdlakic@montana.edu Supplementary information: Supplementary data are available at Bioinformatics online. PMID:19773335

  19. Protein Secondary Structure Prediction Using AutoEncoder Network and Bayes Classifier

    NASA Astrophysics Data System (ADS)

    Wang, Leilei; Cheng, Jinyong

    2018-03-01

    Protein secondary structure prediction belongs to bioinformatics and is an important research area. In this paper, we propose a new method for protein secondary structure prediction using a Bayes classifier and an autoencoder network. Our experiments cover several steps, including the construction of the model and the classification of parameters. The data set is the typical CB513 protein data set. Accuracy is assessed by 3-fold cross-validation, from which we obtain the Q3 accuracy. The results illustrate that the autoencoder network improved the prediction accuracy of protein secondary structure.

  20. GIRAF: a method for fast search and flexible alignment of ligand binding interfaces in proteins at atomic resolution

    PubMed Central

    Kinjo, Akira R.; Nakamura, Haruki

    2012-01-01

    Comparison and classification of protein structures are fundamental means to understand protein functions. Due to the computational difficulty and the ever-increasing amount of structural data, however, it is in general not feasible to perform exhaustive all-against-all structure comparisons necessary for comprehensive classifications. To efficiently handle such situations, we have previously proposed a method, now called GIRAF. We herein describe further improvements in the GIRAF protein structure search and alignment method. The GIRAF method achieves extremely efficient search of similar structures of ligand binding sites of proteins by exploiting database indexing of structural features of local coordinate frames. In addition, it produces refined atom-wise alignments by iterative applications of the Hungarian method to the bipartite graph defined for a pair of superimposed structures. By combining the refined alignments based on different local coordinate frames, it is made possible to align structures involving domain movements. We provide detailed accounts for the database design, the search and alignment algorithms as well as some benchmark results. PMID:27493524

  1. Spiking of serum specimens with exogenous reporter peptides for mass spectrometry based protease profiling as diagnostic tool.

    PubMed

    Findeisen, Peter; Peccerella, Teresa; Post, Stefan; Wenz, Frederik; Neumaier, Michael

    2008-04-01

    Serum is a difficult matrix for the identification of biomarkers by mass spectrometry (MS). This is due to high-abundance proteins and their complex processing by a multitude of endogenous proteases making rigorous standardisation difficult. Here, we have investigated the use of defined exogenous reporter peptides as substrates for disease-specific proteases with respect to improved standardisation and disease classification accuracy. A recombinant N-terminal fragment of the Adenomatous Polyposis Coli (APC) protein was digested with trypsin to yield a peptide mixture for subsequent Reporter Peptide Spiking (RPS) of serum. Different preanalytical handling of serum samples was simulated by storage of serum samples for up to 6 h at ambient temperature, followed by RPS, further incubation under standardised conditions and testing for stability of protease-generated MS profiles. To demonstrate the superior classification accuracy achieved by RPS, a pilot profiling experiment was performed using serum specimens from pancreatic cancer patients (n = 50) and healthy controls (n = 50). After RPS six different peak categories could be defined, two of which (categories C and D) are modulated by endogenous proteases. These latter are relevant for improved classification accuracy as shown by enhanced disease-specific classification from 78% to 87% in unspiked and spiked samples, respectively. Peaks of these categories presented with unchanged signal intensities regardless of preanalytical conditions. The use of RPS generally improved the signal intensities of protease-generated peptide peaks. RPS circumvents preanalytical variabilities and improves classification accuracies. Our approach will be helpful to introduce MS-based proteomic profiling into routine laboratory testing.

  2. Human cell structure-driven model construction for predicting protein subcellular location from biological images.

    PubMed

    Shao, Wei; Liu, Mingxia; Zhang, Daoqiang

    2016-01-01

    The systematic study of subcellular location patterns is very important for fully characterizing the human proteome. Nowadays, with the great advances in automated microscopic imaging, accurate bioimage-based classification methods to predict protein subcellular locations are highly desired. All existing models were constructed on the independent parallel hypothesis, where the cellular component classes are positioned independently in a multi-class classification engine. The important structural information of cellular compartments is missed. To deal with this problem and develop more accurate models, we proposed a novel cell structure-driven classifier construction approach (SC-PSorter) by employing the prior biological structural information in the learning model. Specifically, the structural relationship among the cellular components is reflected by a new codeword matrix under the error correcting output coding framework. Then, we construct multiple SC-PSorter-based classifiers corresponding to the columns of the error correcting output coding codeword matrix using a multi-kernel support vector machine classification approach. Finally, we perform the classifier ensemble by combining those multiple SC-PSorter-based classifiers via majority voting. We evaluate our method on a collection of 1636 immunohistochemistry images from the Human Protein Atlas database. The experimental results show that our method achieves an overall accuracy of 89.0%, which is 6.4% higher than the state-of-the-art method. The dataset and code can be downloaded from https://github.com/shaoweinuaa/. Contact: dqzhang@nuaa.edu.cn. Supplementary data are available at Bioinformatics online. © The Author 2015. Published by Oxford University Press. All rights reserved.

  3. Reticulate classification of mosaic microbial genomes using NeAT website.

    PubMed

    Lima-Mendez, Gipsi

    2012-01-01

    The tree of life is the classical representation of the evolutionary relationships between existent species. A tree is appropriate to display the divergence of species through mutation, i.e., by vertical descent. However, lateral gene transfer (LGT) is excluded from such representations. When LGT contribution to genome evolution cannot be neglected (e.g., for prokaryotes and mobile genetic elements), the tree becomes misleading. Networks appear as an intuitive way to represent both vertical and horizontal relationships, while overlapping groups within such graphs are more suitable for their classification. Here, we describe a method to represent both vertical and horizontal relationships. We start with a set of genomes whose coded proteins have been grouped into families based on sequence similarity. Next, all pairs of genomes are compared, counting the number of proteins classified into the same family. From this comparison, we derive a weighted graph where genomes with a significant number of similar proteins are linked. Finally, we apply a two-step clustering of this graph to produce a classification where nodes can be assigned to multiple clusters. The procedure can be performed using the Network Analysis Tools (NeAT) website.
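
    The graph-construction steps described above (group proteins into families, compare all genome pairs, link genomes that share enough families) can be sketched as follows. This is a minimal illustration, not the NeAT implementation: the function name, the genome and family labels, and the fixed `min_shared` cutoff standing in for a significance test are all assumptions, and each genome is reduced to the set of families it contains rather than to counts of individual proteins.

```python
from itertools import combinations

def shared_family_graph(genome_families, min_shared=2):
    """Build a weighted graph linking genomes that share protein families.

    genome_families maps a genome name to the set of family labels found
    in it. An edge (g1, g2) is kept when the number of shared families
    reaches min_shared; the edge weight is that number.
    """
    edges = {}
    for g1, g2 in combinations(sorted(genome_families), 2):
        shared = len(genome_families[g1] & genome_families[g2])
        if shared >= min_shared:
            edges[(g1, g2)] = shared
    return edges

# Hypothetical input: three genomes and their protein families.
genomes = {
    "phage_A": {"fam1", "fam2", "fam3"},
    "phage_B": {"fam2", "fam3", "fam4"},
    "phage_C": {"fam4", "fam5"},
}
graph = shared_family_graph(genomes)
```

    The resulting edge dictionary could then be passed to a clustering method that allows overlapping clusters, which is the step the abstract performs with a two-step procedure.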

  4. Classification-based quantitative analysis of stable isotope labeling by amino acids in cell culture (SILAC) data.

    PubMed

    Kim, Seongho; Carruthers, Nicholas; Lee, Joohyoung; Chinni, Sreenivasa; Stemmer, Paul

    2016-12-01

    Stable isotope labeling by amino acids in cell culture (SILAC) is a practical and powerful approach for quantitative proteomic analysis. A key advantage of SILAC is the ability to simultaneously detect the isotopically labeled peptides in a single instrument run and so guarantee relative quantitation for a large number of peptides without introducing any variation caused by separate experiments. However, there are few approaches available for assessing protein ratios, and none of the existing algorithms pays considerable attention to proteins having only one peptide hit. We introduce new quantitative approaches to dealing with SILAC protein-level summary using classification-based methodologies, such as Gaussian mixture models with EM algorithms and their Bayesian approach, as well as K-means clustering. In addition, a new approach is developed using a Gaussian mixture model and a stochastic, metaheuristic global optimization algorithm, particle swarm optimization (PSO), to avoid either premature convergence or being stuck in a local optimum. Our simulation studies show that the newly developed PSO-based method performs best in terms of F1 score, and the proposed methods further demonstrate the ability to detect potential markers in real SILAC experimental data. The developed approach is applicable no matter how many peptide hits a protein has, rescuing many proteins doomed to removal. Furthermore, no additional correction for multiple comparisons is necessary for the developed methods, enabling direct interpretation of the analysis outcomes. Copyright © 2016 Elsevier Ireland Ltd. All rights reserved.
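
    As a rough illustration of the Gaussian mixture model with EM mentioned above, the following sketch fits a one-dimensional mixture with plain expectation-maximization. It is not the authors' method: the function names, the use of log-ratios as input, the initial means, and the toy data are assumptions made for the example, and the PSO-based variant is not shown.

```python
import math

def normal_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

def fit_gmm_1d(data, init_means, n_iter=100):
    """Fit a 1-D Gaussian mixture by EM; returns (weights, means, variances)."""
    k = len(init_means)
    means = list(init_means)
    variances = [1.0] * k
    weights = [1.0 / k] * k
    for _ in range(n_iter):
        # E-step: responsibility of each component for each data point.
        resp = []
        for x in data:
            dens = [w * normal_pdf(x, m, v)
                    for w, m, v in zip(weights, means, variances)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate weights, means and variances.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(data)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, 1e-6)  # floor to avoid a collapsed component
    return weights, means, variances

# Hypothetical log-ratios: an unchanged group near 0 and a changed group near 2.
log_ratios = [-0.1, 0.0, 0.1, 0.05, -0.05, 1.9, 2.0, 2.1]
weights, means, variances = fit_gmm_1d(log_ratios, init_means=[0.0, 1.0])
```

    EM of this kind can stall in a local optimum when the starting means are poor, which is the weakness the abstract says the PSO-based approach is meant to avoid.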

  5. The Human Skeletal Muscle Proteome Project: a reappraisal of the current literature

    PubMed Central

    Gonzalez‐Freire, Marta; Semba, Richard D.; Ubaida‐Mohien, Ceereena; Fabbri, Elisa; Scalzo, Paul; Højlund, Kurt; Dufresne, Craig; Lyashkov, Alexey

    2016-01-01

    Skeletal muscle is a large organ that accounts for up to half the total mass of the human body. A progressive decline in muscle mass and strength occurs with ageing and in some individuals configures the syndrome of ‘sarcopenia’, a condition that impairs mobility, challenges autonomy, and is a risk factor for mortality. The mechanisms leading to sarcopenia as well as myopathies are still little understood. The Human Skeletal Muscle Proteome Project was initiated with the aim to characterize muscle proteins and how they change with ageing and disease. We conducted an extensive review of the literature and analysed publicly available protein databases. A systematic search of peer‐reviewed studies was performed using PubMed. Search terms included ‘human’, ‘skeletal muscle’, ‘proteome’, ‘proteomic(s)’, and ‘mass spectrometry’, ‘liquid chromatography‐mass spectrometry (LC‐MS/MS)’. A catalogue of 5431 non‐redundant muscle proteins identified by mass spectrometry‐based proteomics from 38 peer‐reviewed scientific publications from 2002 to November 2015 was created. We also developed a nosology system for the classification of muscle proteins based on localization and function. Such an inventory of proteins should serve as a useful background reference for future research on changes in the muscle proteome, assessed by quantitative mass spectrometry‐based proteomic approaches, that occur with ageing and diseases. This classification and compilation of the human skeletal muscle proteome can be used for the identification and quantification of proteins in skeletal muscle to discover new mechanisms for sarcopenia and specific muscle diseases that can be targeted for prevention and treatment. PMID:27897395

  6. Protein Kinase Classification with 2866 Hidden Markov Models and One Support Vector Machine

    NASA Technical Reports Server (NTRS)

    Weber, Ryan; New, Michael H.; Fonda, Mark (Technical Monitor)

    2002-01-01

    The main application considered in this paper is predicting true kinases from randomly permuted kinases that share the same length and amino acid distributions as the true kinases. Numerous methods already exist for this classification task, such as HMMs, motif-matchers, and sequence comparison algorithms. We build on some of these efforts by creating a vector from the output of thousands of structurally based HMMs, created offline with Pfam-A seed alignments using SAM-T99, which then must be combined into an overall classification for the protein. Then we use a Support Vector Machine for classifying this large ensemble Pfam-Vector, with polynomial and chi-squared kernels. In particular, the chi-squared kernel SVM performs better in some respects than the HMMs and the BLAST pairwise comparisons when predicting true from false kinases, but no one algorithm is best for all purposes or in all instances, so we consider the particular strengths and weaknesses of each.
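
    The abstract does not state which form of the chi-squared kernel was used. As a sketch under that caveat, the function below implements one common variant, the exponential chi-squared kernel, for non-negative feature vectors; the function name, the `gamma` parameter, and the example vectors are hypothetical.

```python
import math

def chi2_kernel(x, y, gamma=1.0):
    """Exponential chi-squared kernel: exp(-gamma * sum((x-y)^2 / (x+y))).

    Assumes non-negative features; coordinates where x + y == 0 are skipped
    because they contribute nothing to the distance.
    """
    dist = 0.0
    for a, b in zip(x, y):
        if a + b > 0:
            dist += (a - b) ** 2 / (a + b)
    return math.exp(-gamma * dist)

# Hypothetical non-negative score vectors for two proteins.
same = chi2_kernel([1.0, 2.0], [1.0, 2.0])
different = chi2_kernel([1.0, 0.0], [0.0, 1.0])
```

    Because the kernel is only defined for non-negative inputs, raw HMM scores that can be negative would need rescaling before use; how the original work handled this is not stated in the abstract.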

  7. Improved protein surface comparison and application to low-resolution protein structure data

    PubMed Central

    2010-01-01

    Background: Recent advancements in experimental techniques for determining protein tertiary structures raise significant challenges for protein bioinformatics. With the number of known structures of unknown function expanding at a rapid pace, an urgent task is to provide reliable clues to their biological function on a large scale. Conventional approaches for structure comparison are not suitable for a real-time database search due to their slow speed. Moreover, a new challenge has arisen from recent techniques such as electron microscopy (EM), which provide low-resolution structure data. Previously, we have introduced a method for protein surface shape representation using the 3D Zernike descriptors (3DZDs). The 3DZD enables fast structure database searches, taking advantage of its rotation invariance and compact representation. The search results for protein surfaces represented with the 3DZD have shown good agreement with the existing structure classifications, but some discrepancies were also observed. Results: The three new surface representations of backbone atoms, the originally devised all-atom surface representation, and the combination of the all-atom surface with the backbone representation are examined. All representations are encoded with the 3DZD. Also, we have investigated the applicability of the 3DZD for searching protein EM density maps of varying resolutions. The surface representations are evaluated on structure retrieval using two existing classifications, SCOP and the CE-based classification. Conclusions: Overall, the 3DZDs representing backbone atoms show better retrieval performance than the original all-atom surface representation. The performance further improved when the two representations are combined. Moreover, we observed that the 3DZD is also powerful in comparing low-resolution structures obtained by electron microscopy. PMID:21172052

  8. Serum C-reactive protein level in COPD patients stratified according to GOLD 2011 grading classification

    PubMed Central

    Lin, Yi-Hua; Wang, Wan-Yu; Hu, Su-Xian; Shi, Yong-Hong

    2016-01-01

    Background and Objective: The Global Initiative for Chronic Obstructive Lung Disease (GOLD) 2011 grading classification has been used to evaluate the severity of patients with chronic obstructive pulmonary disease (COPD). However, little is known about the relationship between systemic inflammation and this classification. We aimed to study the relationship between serum CRP and the components of the GOLD 2011 grading classification. Methods: C-reactive protein (CRP) levels were measured in 391 clinically stable COPD patients and in 50 controls from June 2, 2015 to October 31, 2015 in the First Affiliated Hospital of Xiamen University. The association between CRP levels and the components of the GOLD 2011 grading classification was assessed. Results: Correlation was found with the following variables: GOLD 2011 group (0.240), age (0.227), pack year (0.136), forced expiratory volume in one second % predicted (FEV1%; -0.267), forced vital capacity % predicted (-0.210), number of acute exacerbations in the past year (0.265), number of hospitalized exacerbations in the past year (0.165), British Medical Research Council dyspnoea scale (0.121), COPD assessment test score (CAT, 0.233). Using multivariate analysis, FEV1% and CAT score manifested the strongest negative association with CRP levels. Conclusions: CRP levels differ in COPD patients among groups A-D based on GOLD 2011 grading classification. CRP levels are associated with several important clinical variables, of which FEV1% and CAT score manifested the strongest negative correlation. PMID:28083044

  9. Detecting experimental techniques and selecting relevant documents for protein-protein interactions from biomedical literature.

    PubMed

    Wang, Xinglong; Rak, Rafal; Restificar, Angelo; Nobata, Chikashi; Rupp, C J; Batista-Navarro, Riza Theresa B; Nawaz, Raheel; Ananiadou, Sophia

    2011-10-03

    The selection of relevant articles for curation, and linking those articles to experimental techniques confirming the findings became one of the primary subjects of the recent BioCreative III contest. The contest's Protein-Protein Interaction (PPI) task consisted of two sub-tasks: Article Classification Task (ACT) and Interaction Method Task (IMT). ACT aimed to automatically select relevant documents for PPI curation, whereas the goal of IMT was to recognise the methods used in experiments for identifying the interactions in full-text articles. We proposed and compared several classification-based methods for both tasks, employing rich contextual features as well as features extracted from external knowledge sources. For IMT, a new method that classifies pair-wise relations between every text phrase and candidate interaction method obtained promising results with an F1 score of 64.49%, as tested on the task's development dataset. We also explored ways to combine this new approach and more conventional, multi-label document classification methods. For ACT, our classifiers exploited automatically detected named entities and other linguistic information. The evaluation results on the BioCreative III PPI test datasets showed that our systems were very competitive: one of our IMT methods yielded the best performance among all participants, as measured by F1 score, Matthews correlation coefficient and AUC iP/R; whereas for ACT, our best classifier was ranked second as measured by AUC iP/R, and also competitive according to other metrics. Our novel approach that converts the multi-class, multi-label classification problem to a binary classification problem showed much promise in IMT. Nevertheless, on the test dataset the best performance was achieved by taking the union of the output of this method and that of a multi-class, multi-label document classifier, which indicates that the two types of systems complement each other in terms of recall. 
    For ACT, our system exploited a rich set of features and also obtained encouraging results. We examined the features with respect to their contributions to the classification results, and concluded that contextual words surrounding named entities, as well as the MeSH headings associated with the documents were among the main contributors to the performance.

  10. Integrated Proteomic and Transcriptomic-Based Approaches to Identifying Signature Biomarkers and Pathways for Elucidation of Daoy and UW228 Subtypes.

    PubMed

    Higdon, Roger; Kala, Jessie; Wilkins, Devan; Yan, Julia Fangfei; Sethi, Manveen K; Lin, Liang; Liu, Siqi; Montague, Elizabeth; Janko, Imre; Choiniere, John; Kolker, Natali; Hancock, William S; Kolker, Eugene; Fanayan, Susan

    2017-02-03

    Medulloblastoma (MB) is the most common malignant pediatric brain tumor. Patient survival has remained largely the same for the past 20 years, with therapies causing significant health, cognitive, behavioral and developmental complications for those who survive the tumor. In this study, we profiled the total transcriptome and proteome of two established MB cell lines, Daoy and UW228, using high-throughput RNA sequencing (RNA-Seq) and label-free nano-LC-MS/MS-based quantitative proteomics, coupled with advanced pathway analysis. While Daoy has been suggested to belong to the sonic hedgehog (SHH) subtype, the exact UW228 subtype is not yet clearly established. Thus, a goal of this study was to identify protein markers and pathways that would help elucidate their subtype classification. A number of differentially expressed genes and proteins, including a number of adhesion, cytoskeletal and signaling molecules, were observed between the two cell lines. While several cancer-associated genes/proteins exhibited similar expression across the two cell lines, upregulation of a number of signature proteins and enrichment of key components of SHH and WNT signaling pathways were uniquely observed in Daoy and UW228, respectively. The novel information on differentially expressed genes/proteins and enriched pathways provide insights into the biology of MB, which could help elucidate their subtype classification.

  11. Protein classification using probabilistic chain graphs and the Gene Ontology structure.

    PubMed

    Carroll, Steven; Pavlovic, Vladimir

    2006-08-01

    Probabilistic graphical models have been developed in the past for the task of protein classification. In many cases, classifications obtained from the Gene Ontology have been used to validate these models. In this work we directly incorporate the structure of the Gene Ontology into the graphical representation for protein classification. We present a method in which each protein is represented by a replicate of the Gene Ontology structure, effectively modeling each protein in its own 'annotation space'. Proteins are also connected to one another according to different measures of functional similarity, after which belief propagation is run to make predictions at all ontology terms. The proposed method was evaluated on a set of 4879 proteins from the Saccharomyces Genome Database whose interactions were also recorded in the GRID project. Results indicate that direct utilization of the Gene Ontology improves predictive ability, outperforming traditional models that do not take advantage of dependencies among functional terms. Average increase in accuracy (precision) of positive and negative term predictions of 27.8% (2.0%) over three different similarity measures and three subontologies was observed. C/C++/Perl implementation is available from authors upon request.

  12. Feature generation and representations for protein-protein interaction classification.

    PubMed

    Lan, Man; Tan, Chew Lim; Su, Jian

    2009-10-01

    Automatically detecting articles relevant to protein-protein interaction (PPI) is a crucial step for large-scale biological database curation. Previous work adopted POS tagging, shallow parsing and sentence splitting techniques, but these achieved worse performance than the simple bag-of-words representation. In this paper, we generated and investigated multiple types of feature representations in order to further improve the performance of the PPI text classification task. Besides the traditional domain-independent bag-of-words approach and the term weighting methods, we also explored other domain-dependent features, i.e. protein-protein interaction trigger keywords, protein named entities and the advanced ways of incorporating Natural Language Processing (NLP) output. The integration of these multiple features has been evaluated on the BioCreAtIvE II corpus. The experimental results showed that both the advanced way of using NLP output and the integration of bag-of-words and NLP output improved the performance of text classification. Specifically, in comparison with the best performance achieved in the BioCreAtIvE II IAS, the feature-level and classifier-level integration of multiple features improved the performance of classification by 2.71% and 3.95%, respectively.

  13. Application of Wavelet Transform for PDZ Domain Classification

    PubMed Central

    Daqrouq, Khaled; Alhmouz, Rami; Balamesh, Ahmed; Memic, Adnan

    2015-01-01

    PDZ domains have been identified as part of an array of signaling proteins that are often unrelated, except for the well-conserved structural PDZ domain they contain. These domains have been linked to many disease processes including common Avian influenza, as well as very rare conditions such as Fraser and Usher syndromes. Historically, based on the interactions and the nature of bonds they form, PDZ domains have most often been classified into one of three classes (class I, class II and others - class III), a classification that is directly dependent on their binding partner. In this study, we report on three unique feature extraction approaches based on the bigram and trigram occurrence and existence rearrangements within the domain's primary amino acid sequences in assisting PDZ domain classification. Wavelet packet transform (WPT) and Shannon entropy denoted by wavelet entropy (WE) feature extraction methods were proposed. Using 115 unique human and mouse PDZ domains, the existence rearrangement approach yielded a high recognition rate (78.34%), which outperformed our occurrence rearrangement-based method. With a validation technique, the recognition rate was 81.41%. The method reported for PDZ domain classification from primary sequences proved to be an encouraging approach for obtaining consistent classification results. We anticipate that by increasing the database size, we can further improve feature extraction and correct classification. PMID:25860375
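
    The abstract does not define its "occurrence" and "existence" features in detail. As a minimal sketch, assuming "occurrence" means how often each amino acid bigram appears and "existence" means only whether it appears at all, the two representations and a Shannon entropy over the occurrence counts could look like this in Python. The function names are hypothetical, and the wavelet packet transform step is not reproduced here.

```python
import math
from collections import Counter


def bigram_occurrence(sequence):
    """Count every overlapping bigram in an amino acid sequence."""
    return Counter(sequence[i:i + 2] for i in range(len(sequence) - 1))


def bigram_existence(sequence):
    """Record only whether each bigram appears (1) rather than how often."""
    return {bigram: 1 for bigram in bigram_occurrence(sequence)}


def shannon_entropy(counts):
    """Shannon entropy (in bits) of the distribution implied by the counts."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


occurrence = bigram_occurrence("AAAB")   # bigrams: AA, AA, AB
existence = bigram_existence("AAAB")
entropy = shannon_entropy(occurrence)
```

    The existence form discards repeat counts, so two domains that differ only in how often a bigram recurs receive identical feature vectors.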

  14. Classification of G-protein coupled receptors based on a rich generation of convolutional neural network, N-gram transformation and multiple sequence alignments.

    PubMed

    Li, Man; Ling, Cheng; Xu, Qi; Gao, Jingyang

    2018-02-01

    Sequence classification is crucial in predicting the function of newly discovered sequences. In recent years, the prediction of the incremental large-scale and diversity of sequences has heavily relied on the involvement of machine-learning algorithms. To improve prediction accuracy, these algorithms must confront the key challenge of extracting valuable features. In this work, we propose a feature-enhanced protein classification approach, considering the rich generation of multiple sequence alignment algorithms, N-gram probabilistic language model and the deep learning technique. The essence behind the proposed method is that if each group of sequences can be represented by one feature sequence, composed of homologous sites, there should be less loss when the sequence is rebuilt, when a more relevant sequence is added to the group. On the basis of this consideration, the prediction becomes whether a query sequence belonging to a group of sequences can be transferred to calculate the probability that the new feature sequence evolves from the original one. The proposed work focuses on the hierarchical classification of G-protein Coupled Receptors (GPCRs), which begins by extracting the feature sequences from the multiple sequence alignment results of the GPCRs sub-subfamilies. The N-gram model is then applied to construct the input vectors. Finally, these vectors are imported into a convolutional neural network to make a prediction. The experimental results elucidate that the proposed method provides significant performance improvements. The classification error rate of the proposed method is reduced by at least 4.67% (family level I) and 5.75% (family Level II), in comparison with the current state-of-the-art methods. The implementation program of the proposed work is freely available at: https://github.com/alanFchina/CNN .

  15. A label distance maximum-based classifier for multi-label learning.

    PubMed

    Liu, Xiaoli; Bao, Hang; Zhao, Dazhe; Cao, Peng

    2015-01-01

    Multi-label classification is useful in many bioinformatics tasks such as gene function prediction and protein site localization. This paper presents an improved neural network algorithm, Max Label Distance Back Propagation Algorithm for Multi-Label Classification. The method was formulated by modifying the total error function of the standard BP by adding a penalty term, which was realized by maximizing the distance between the positive and negative labels. Extensive experiments were conducted to compare this method against state-of-the-art multi-label methods on three popular bioinformatic benchmark datasets. The results illustrated that this proposed method is more effective for bioinformatic multi-label classification compared to commonly used techniques.

  16. EUCLID: automatic classification of proteins in functional classes by their database annotations.

    PubMed

    Tamames, J; Ouzounis, C; Casari, G; Sander, C; Valencia, A

    1998-01-01

    A tool is described for the automatic classification of sequences in functional classes using their database annotations. The Euclid system is based on a simple learning procedure from examples provided by human experts. Euclid is freely available for academics at http://www.gredos.cnb.uam.es/EUCLID, with the corresponding dictionaries for the generation of three, eight and 14 functional classes. E-mail: valencia@cnb.uam.es. The results of the EUCLID classification of different genomes are available at http://www.sander.ebi.ac.uk/genequiz/. A detailed description of the different applications mentioned in the text is available at http://www.gredos.cnb.uam.es/EUCLID/Full_Paper

  17. T-RMSD: a fine-grained, structure-based classification method and its application to the functional characterization of TNF receptors.

    PubMed

    Magis, Cedrik; Stricher, François; van der Sloot, Almer M; Serrano, Luis; Notredame, Cedric

    2010-07-16

    This study addresses the relation between structural and functional similarity in proteins. We introduce a novel method named tree based on root mean square deviation (T-RMSD), which uses distance RMSD (dRMSD) variations to build fine-grained structure-based classifications of proteins. The main improvement of the T-RMSD over similar methods, such as Dali, is its capacity to produce the equivalent of a bootstrap value for each cluster node. We validated our approach on two domain families studied extensively for their role in many biological and pathological pathways: the small GTPase RAS superfamily and the cysteine-rich domains (CRDs) associated with the tumor necrosis factor receptors (TNFRs) family. Our analysis showed that T-RMSD is able to automatically recover and refine existing classifications. In the case of the small GTPase ARF subfamily, T-RMSD can distinguish GTP- from GDP-bound states, while in the case of CRDs it can identify two new subgroups associated with well defined functional features (ligand binding and formation of ligand pre-assembly complex). We show how hidden Markov models (HMMs) can be built on these new groups and propose a methodology to use these models simultaneously in order to do fine-grained functional genomic annotation without known 3D structures. T-RMSD, an open source freeware incorporated in the T-Coffee package, is available online.
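
    For readers unfamiliar with distance RMSD, the following Python sketch computes the standard dRMSD between two conformations with matched atoms. It compares intramolecular distances, so no superposition is needed. It illustrates only the underlying measure, not the T-RMSD tree-building procedure, and the function name and coordinate layout are assumptions made for illustration.

```python
import math
from itertools import combinations


def drmsd(coords_a, coords_b):
    """Distance RMSD between two equally sized lists of (x, y, z) points.

    For every pair of positions (i, j), the intramolecular distance in
    structure A is compared with the same distance in structure B.
    """
    if len(coords_a) != len(coords_b):
        raise ValueError("structures must have the same number of points")
    pairs = list(combinations(range(len(coords_a)), 2))
    if not pairs:
        raise ValueError("at least two points are required")
    squared = 0.0
    for i, j in pairs:
        d_a = math.dist(coords_a[i], coords_a[j])
        d_b = math.dist(coords_b[i], coords_b[j])
        squared += (d_a - d_b) ** 2
    return math.sqrt(squared / len(pairs))


reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
shifted = [(x + 5.0, y + 5.0, z + 5.0) for x, y, z in reference]
stretched = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
```

    Because only internal distances enter the sum, `drmsd(reference, shifted)` is zero for a rigid translation, whereas `drmsd(reference, stretched)` is positive.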

  18. Classification of cancer cell lines using matrix-assisted laser desorption/ionization time‑of‑flight mass spectrometry and statistical analysis.

    PubMed

    Serafim, Vlad; Shah, Ajit; Puiu, Maria; Andreescu, Nicoleta; Coricovac, Dorina; Nosyrev, Alexander; Spandidos, Demetrios A; Tsatsakis, Aristides M; Dehelean, Cristina; Pinzaru, Iulia

    2017-10-01

    Over the past decade, matrix-assisted laser desorption/ionization time‑of‑flight mass spectrometry (MALDI‑TOF MS) has been established as a valuable platform for microbial identification, and it is also frequently applied in biology and clinical studies to identify new markers expressed in pathological conditions. The aim of the present study was to assess the potential of using this approach for the classification of cancer cell lines as a quantifiable method for the proteomic profiling of cellular organelles. Intact protein extracts isolated from different tumor cell lines (human and murine) were analyzed using MALDI‑TOF MS and the obtained mass lists were processed using principal component analysis (PCA) within Bruker Biotyper® software. Furthermore, reference spectra were created for each cell line and were used for classification. Based on the intact protein profiles, we were able to differentiate and classify six cancer cell lines: two murine melanoma (B16‑F0 and B164A5), one human melanoma (A375), two human breast carcinoma (MCF7 and MDA‑MB‑231) and one human liver carcinoma (HepG2). The cell lines were classified according to cancer type and the species they originated from, as well as by their metastatic potential, offering the possibility to differentiate non‑invasive from invasive cells. The obtained results pave the way for developing a broad‑based strategy for the identification and classification of cancer cells.

  19. A proposal of criteria for the classification of systemic sclerosis.

    PubMed

    Nadashkevich, Oleg; Davis, Paul; Fritzler, Marvin J

    2004-11-01

    Sensitive and specific criteria for the classification of systemic sclerosis are required by clinicians and investigators to achieve higher quality clinical studies and approaches to therapy. A clinical study of systemic sclerosis patients in Europe and Canada led to a set of criteria that achieve high sensitivity and specificity. Both clinical and laboratory investigations of patients with systemic sclerosis, related conditions and diseases with clinical features that can be mistaken as part of the systemic sclerosis spectrum were undertaken. Laboratory investigations included the detection of autoantibodies to centromere proteins, Scl-70 (topoisomerase I), and fibrillarin (U3-RNP). Based on the investigation of 269 systemic sclerosis patients and 720 patients presenting with related and confounding conditions, the following set of criteria for the classification of systemic sclerosis was proposed: 1) autoantibodies to: centromere proteins, Scl-70 (topo I), fibrillarin; 2) bibasilar pulmonary fibrosis; 3) contractures of the digital joints or prayer sign; 4) dermal thickening proximal to the wrists; 5) calcinosis cutis; 6) Raynaud's phenomenon; 7) esophageal distal hypomotility or reflux-esophagitis; 8) sclerodactyly or non-pitting digital edema; 9) teleangiectasias. The classification of definite SSc requires at least three of the above criteria. Criteria for the classification of systemic sclerosis have been proposed. Preliminary testing has defined the sensitivity and specificity of these criteria as high as 99% and 100%, respectively. Testing and validation of the proposed criteria by other clinical centers is required.
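
    The proposed rule (at least three of the nine listed criteria) can be written down as a simple count. The Python sketch below is purely illustrative of that rule as stated in the abstract. The criterion keys are hypothetical labels chosen here, not identifiers from the paper.

```python
# Hypothetical keys, one per criterion enumerated in the abstract.
CRITERIA = frozenset({
    "autoantibodies",                        # centromere, Scl-70 (topo I), fibrillarin
    "bibasilar_pulmonary_fibrosis",
    "digital_contractures_or_prayer_sign",
    "dermal_thickening_proximal_to_wrists",
    "calcinosis_cutis",
    "raynaud_phenomenon",
    "esophageal_hypomotility_or_reflux",
    "sclerodactyly_or_nonpitting_edema",
    "teleangiectasias",
})

MIN_CRITERIA = 3  # "at least three of the above criteria"


def meets_definite_ssc(findings):
    """Return True when at least MIN_CRITERIA distinct criteria are present."""
    present = set(findings)
    unknown = present - CRITERIA
    if unknown:
        raise ValueError(f"unrecognised criteria: {sorted(unknown)}")
    return len(present) >= MIN_CRITERIA
```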

  20. Classifying proteins into functional groups based on all-versus-all BLAST of 10 million proteins.

    PubMed

    Kolker, Natali; Higdon, Roger; Broomall, William; Stanberry, Larissa; Welch, Dean; Lu, Wei; Haynes, Winston; Barga, Roger; Kolker, Eugene

    2011-01-01

    To address the monumental challenge of assigning function to millions of sequenced proteins, we completed the first-of-a-kind all-versus-all sequence alignments using BLAST for 9.9 million proteins in the UniRef100 database. Microsoft Windows Azure produced over 3 billion filtered records in 6 days using 475 eight-core virtual machines. Protein classification into functional groups was then performed using Hive and custom jars implemented on top of Apache Hadoop utilizing the MapReduce paradigm. First, using the Clusters of Orthologous Genes (COG) database, a length normalized bit score (LNBS) was determined to be the best similarity measure for classification of proteins. LNBS achieved sensitivity and specificity of 98% each. Second, out of 5.1 million bacterial proteins, about two-thirds were assigned to significantly extended COG groups, encompassing 30 times more assigned proteins. Third, the remaining proteins were classified into protein functional groups using an innovative implementation of a single-linkage algorithm on an in-house Hadoop compute cluster. This implementation significantly reduces the run time for nonindexed queries and optimizes efficient clustering on a large scale. The performance was also verified on Amazon Elastic MapReduce. This clustering assigned nearly 2 million proteins to approximately half a million different functional groups. A similar approach was applied to classify 2.8 million eukaryotic sequences resulting in over 1 million proteins being assigned to existing KOG groups and the remainder clustered into 100,000 functional groups.
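
    Single-linkage clustering over pairwise similarity scores amounts to finding connected components among the pairs that pass a threshold. The Python sketch below shows that idea on a single machine with a union-find structure. It is not the authors' Hadoop implementation, and the scores, threshold and protein identifiers are made-up placeholders (the abstract does not give the LNBS formula, so scores are taken as precomputed inputs).

```python
def single_linkage(items, scored_pairs, threshold):
    """Group items so that any pair scoring >= threshold shares a cluster.

    scored_pairs is an iterable of (item_a, item_b, similarity) tuples.
    """
    parent = {item: item for item in items}

    def find(item):
        # Walk to the root, halving the path along the way.
        while parent[item] != item:
            parent[item] = parent[parent[item]]
            item = parent[item]
        return item

    for a, b, score in scored_pairs:
        if score >= threshold:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a

    clusters = {}
    for item in items:
        clusters.setdefault(find(item), []).append(item)
    return sorted(sorted(members) for members in clusters.values())


proteins = ["P1", "P2", "P3", "P4", "P5"]
pairs = [("P1", "P2", 0.9), ("P2", "P3", 0.8), ("P4", "P5", 0.3)]
groups = single_linkage(proteins, pairs, threshold=0.5)
```

    Here P1 and P3 end up together even though they were never compared directly, which is the chaining behaviour characteristic of single linkage.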

  1. Mining protein function from text using term-based support vector machines

    PubMed Central

    Rice, Simon B; Nenadic, Goran; Stapley, Benjamin J

    2005-01-01

    Background: Text mining has spurred huge interest in the domain of biology. The goal of the BioCreAtIvE exercise was to evaluate the performance of current text mining systems. We participated in Task 2, which addressed assigning Gene Ontology terms to human proteins and selecting relevant evidence from full-text documents. We approached it as a modified form of the document classification task. We used a supervised machine-learning approach (based on support vector machines) to assign protein function and select passages that support the assignments. As classification features, we used a protein's co-occurring terms that were automatically extracted from documents. Results: The results evaluated by curators were modest, and quite variable for different problems: in many cases we have relatively good assignment of GO terms to proteins, but the selected supporting text was typically non-relevant (precision spanning from 3% to 50%). The method appears to work best when a substantial set of relevant documents is obtained, while it works poorly on single documents and/or short passages. The initial results suggest that our approach can also mine annotations from text even when an explicit statement relating a protein to a GO term is absent. Conclusion: A machine learning approach to mining protein function predictions from text can yield good performance only if sufficient training data is available, and a significant amount of supporting data is used for prediction. The most promising results are for combined document retrieval and GO term assignment, which calls for the integration of methods developed in BioCreAtIvE Task 1 and Task 2. PMID:15960835

  2. Anomalies in Network Bridges Involved in Bile Acid Metabolism Predict Outcomes of Colorectal Cancer Patients

    PubMed Central

    Yoon, Seyeol; Lee, Jae W.; Lee, Doheon

    2014-01-01

    Biomarkers prognostic for colorectal cancer (CRC) would be highly desirable in clinical practice. Proteins that regulate bile acid (BA) homeostasis, by linking metabolic sensors and metabolic enzymes, also called bridge proteins, may be reliable prognostic biomarkers for CRC. Based on a devised metric, “bridgeness,” we identified bridge proteins involved in the regulation of BA homeostasis and identified their prognostic potentials. The expression patterns of these bridge proteins could distinguish between normal and diseased tissues, suggesting that these proteins are associated with CRC pathogenesis. Using a supervised classification system, we found that these bridge proteins were reproducibly prognostic, with high prognostic ability compared to other known markers. PMID:25259881

  3. Optimizing high performance computing workflow for protein functional annotation.

    PubMed

    Stanberry, Larissa; Rekepalli, Bhanu; Liu, Yuan; Giblock, Paul; Higdon, Roger; Montague, Elizabeth; Broomall, William; Kolker, Natali; Kolker, Eugene

    2014-09-10

    Functional annotation of newly sequenced genomes is one of the major challenges in modern biology. With modern sequencing technologies, the protein sequence universe is rapidly expanding. Newly sequenced bacterial genomes alone contain over 7.5 million proteins. The rate of data generation has far surpassed that of protein annotation. The volume of protein data makes manual curation infeasible, whereas a high compute cost limits the utility of existing automated approaches. In this work, we present an improved and optimized automated workflow to enable large-scale protein annotation. The workflow uses high performance computing architectures and a low complexity classification algorithm to assign proteins into existing clusters of orthologous groups of proteins. On the basis of the Position-Specific Iterative Basic Local Alignment Search Tool, the algorithm ensures at least 80% specificity and sensitivity of the resulting classifications. The workflow utilizes highly scalable parallel applications for classification and sequence alignment. Using Extreme Science and Engineering Discovery Environment supercomputers, the workflow processed 1,200,000 newly sequenced bacterial proteins. With the rapid expansion of the protein sequence universe, the proposed workflow will enable scientists to annotate big genome data.

  4. Optimizing high performance computing workflow for protein functional annotation

    PubMed Central

    Stanberry, Larissa; Rekepalli, Bhanu; Liu, Yuan; Giblock, Paul; Higdon, Roger; Montague, Elizabeth; Broomall, William; Kolker, Natali; Kolker, Eugene

    2014-01-01

    Functional annotation of newly sequenced genomes is one of the major challenges in modern biology. With modern sequencing technologies, the protein sequence universe is rapidly expanding. Newly sequenced bacterial genomes alone contain over 7.5 million proteins. The rate of data generation has far surpassed that of protein annotation. The volume of protein data makes manual curation infeasible, whereas a high compute cost limits the utility of existing automated approaches. In this work, we present an improved and optimized automated workflow to enable large-scale protein annotation. The workflow uses high performance computing architectures and a low complexity classification algorithm to assign proteins into existing clusters of orthologous groups of proteins. On the basis of the Position-Specific Iterative Basic Local Alignment Search Tool, the algorithm ensures at least 80% specificity and sensitivity of the resulting classifications. The workflow utilizes highly scalable parallel applications for classification and sequence alignment. Using Extreme Science and Engineering Discovery Environment supercomputers, the workflow processed 1,200,000 newly sequenced bacterial proteins. With the rapid expansion of the protein sequence universe, the proposed workflow will enable scientists to annotate big genome data. PMID:25313296

  5. The COG database: new developments in phylogenetic classification of proteins from complete genomes

    PubMed Central

    Tatusov, Roman L.; Natale, Darren A.; Garkavtsev, Igor V.; Tatusova, Tatiana A.; Shankavaram, Uma T.; Rao, Bachoti S.; Kiryutin, Boris; Galperin, Michael Y.; Fedorova, Natalie D.; Koonin, Eugene V.

    2001-01-01

    The database of Clusters of Orthologous Groups of proteins (COGs), which represents an attempt on a phylogenetic classification of the proteins encoded in complete genomes, currently consists of 2791 COGs including 45 350 proteins from 30 genomes of bacteria, archaea and the yeast Saccharomyces cerevisiae (http://www.ncbi.nlm.nih.gov/COG). In addition, a supplement to the COGs is available, in which proteins encoded in the genomes of two multicellular eukaryotes, the nematode Caenorhabditis elegans and the fruit fly Drosophila melanogaster, and shared with bacteria and/or archaea were included. The new features added to the COG database include information pages with structural and functional details on each COG and literature references, improvements of the COGNITOR program that is used to fit new proteins into the COGs, and classification of genomes and COGs constructed by using principal component analysis. PMID:11125040

  6. Extraction of Protein-Protein Interaction from Scientific Articles by Predicting Dominant Keywords.

    PubMed

    Koyabu, Shun; Phan, Thi Thanh Thuy; Ohkawa, Takenao

    2015-01-01

    For the automatic extraction of protein-protein interaction information from scientific articles, a machine learning approach is useful. The classifier is generated from training data represented using several features to decide whether a protein pair in each sentence has an interaction. A specific keyword that is directly related to interaction, such as "bind" or "interact", plays an important role in training classifiers. We call it a dominant keyword that affects the capability of the classifier. Although it is important to identify the dominant keywords, whether a keyword is dominant depends on the context in which it occurs. Therefore, we propose a method for predicting whether a keyword is dominant for each instance. In this method, a keyword that derives imbalanced classification results is tentatively assumed to be a dominant keyword initially. Then the classifiers are separately trained from the instances with and without the assumed dominant keywords. The validity of the assumed dominant keyword is evaluated based on the classification results of the generated classifiers. The assumption is updated by the evaluation result. Repeating this process increases the prediction accuracy of the dominant keyword. Our experimental results using five corpora show the effectiveness of our proposed method with dominant keyword prediction.

  7. Cancer classification in the genomic era: five contemporary problems.

    PubMed

    Song, Qingxuan; Merajver, Sofia D; Li, Jun Z

    2015-10-19

    Classification is an everyday instinct as well as a full-fledged scientific discipline. Throughout the history of medicine, disease classification is central to how we develop knowledge, make diagnosis, and assign treatment. Here, we discuss the classification of cancer and the process of categorizing cancer subtypes based on their observed clinical and biological features. Traditionally, cancer nomenclature is primarily based on organ location, e.g., "lung cancer" designates a tumor originating in lung structures. Within each organ-specific major type, finer subgroups can be defined based on patient age, cell type, histological grades, and sometimes molecular markers, e.g., hormonal receptor status in breast cancer or microsatellite instability in colorectal cancer. In the past 15+ years, high-throughput technologies have generated rich new data regarding somatic variations in DNA, RNA, protein, or epigenomic features for many cancers. These data, collected for increasingly large tumor cohorts, have provided not only new insights into the biological diversity of human cancers but also exciting opportunities to discover previously unrecognized cancer subtypes. Meanwhile, the unprecedented volume and complexity of these data pose significant challenges for biostatisticians, cancer biologists, and clinicians alike. Here, we review five related issues that represent contemporary problems in cancer taxonomy and interpretation. (1) How many cancer subtypes are there? (2) How can we evaluate the robustness of a new classification system? (3) How are classification systems affected by intratumor heterogeneity and tumor evolution? (4) How should we interpret cancer subtypes? (5) Can multiple classification systems co-exist? While related issues have existed for a long time, we will focus on those aspects that have been magnified by the recent influx of complex multi-omics data. Exploration of these problems is essential for data-driven refinement of cancer classification and the successful application of these concepts in precision medicine.

  8. On Utilizing Optimal and Information Theoretic Syntactic Modeling for Peptide Classification

    NASA Astrophysics Data System (ADS)

    Aygün, Eser; Oommen, B. John; Cataltepe, Zehra

    Syntactic methods in pattern recognition have been used extensively in bioinformatics, and in particular, in the analysis of gene and protein expressions, and in the recognition and classification of bio-sequences. These methods are almost universally distance-based. This paper concerns the use of an Optimal and Information Theoretic (OIT) probabilistic model [11] to achieve peptide classification using the information residing in their syntactic representations. The latter has traditionally been achieved using the edit distances required in the respective peptide comparisons. We advocate that one can model the differences between compared strings as a mutation model consisting of random Substitutions, Insertions and Deletions (SID) obeying the OIT model. Thus, in this paper, we show that the probability measure obtained from the OIT model can be perceived as a sequence similarity metric, using which a Support Vector Machine (SVM)-based peptide classifier, referred to as OIT_SVM, can be devised.
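
    As a point of reference for the traditional approach the abstract contrasts itself with, the Python sketch below computes a unit-cost edit (Levenshtein) distance between two peptide strings by counting substitutions, insertions and deletions. It is the conventional baseline only. The OIT probability measure and the OIT_SVM classifier are not reproduced, and the example peptides are arbitrary.

```python
def edit_distance(source, target):
    """Minimum number of substitutions, insertions and deletions (unit cost)."""
    # previous[j] holds the distance between the processed prefix of source
    # and the first j characters of target.
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]
        for j, t_char in enumerate(target, start=1):
            substitution = previous[j - 1] + (s_char != t_char)
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]


# Arbitrary one-letter amino acid strings used purely for illustration.
distance_sub = edit_distance("GAVL", "GAIL")   # one substitution
distance_del = edit_distance("ACDE", "ACE")    # one deletion
```

    A distance of this kind can be turned into a kernel or similarity for an SVM, which is the role the abstract assigns to the OIT-based measure instead.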

  9. [Modern bacterial taxonomy: techniques review--application to bacteria that nodulate leguminous plants (BNL)].

    PubMed

    Zakhia, Frédéric; de Lajudie, Philippe

    2006-03-01

    Taxonomy is the science that studies the relationships between organisms. It comprises classification, nomenclature, and identification. Modern bacterial taxonomy is polyphasic. This means that it is based on several molecular techniques, each one retrieving the information at different cellular levels (proteins, fatty acids, DNA...). The obtained results are combined and analysed to reach a "consensus taxonomy" of a microorganism. Until 1970, a small number of classification techniques were available for microbiologists (mainly phenotypic characterization was performed: a legume species nodulation ability for a Rhizobium, for example). With the development of techniques based on polymerase chain reaction for characterization, the bacterial taxonomy has undergone great changes. In particular, the classification of the legume nodulating bacteria has been repeatedly modified over the last 20 years. We present here a review of the currently used molecular techniques in bacterial characterization, with examples of application of these techniques for the study of the legume nodulating bacteria.

  10. The Protein-DNA Interface database

    PubMed Central

    2010-01-01

    The Protein-DNA Interface database (PDIdb) is a repository containing relevant structural information of Protein-DNA complexes solved by X-ray crystallography and available at the Protein Data Bank. The database includes a simple functional classification of the protein-DNA complexes that consists of three hierarchical levels: Class, Type and Subtype. This classification has been defined and manually curated by humans based on the information gathered from several sources that include PDB, PubMed, CATH, SCOP and COPS. The current version of the database contains only structures with resolution of 2.5 Å or higher, accounting for a total of 922 entries. The major aim of this database is to contribute to the understanding of the main rules that underlie the molecular recognition process between DNA and proteins. To this end, the database is focused on each specific atomic interface rather than on the separated binding partners. Therefore, each entry in this database consists of a single and independent protein-DNA interface. We hope that PDIdb will be useful to many researchers working in fields such as the prediction of transcription factor binding sites in DNA, the study of specificity determinants that mediate enzyme recognition events, engineering and design of new DNA binding proteins with distinct binding specificity and affinity, among others. Finally, due to its friendly and easy-to-use web interface, we hope that PDIdb will also serve educational and teaching purposes. PMID:20482798

  11. The Protein-DNA Interface database.

    PubMed

    Norambuena, Tomás; Melo, Francisco

    2010-05-18

    The Protein-DNA Interface database (PDIdb) is a repository containing relevant structural information of Protein-DNA complexes solved by X-ray crystallography and available at the Protein Data Bank. The database includes a simple functional classification of the protein-DNA complexes that consists of three hierarchical levels: Class, Type and Subtype. This classification has been defined and manually curated by humans based on the information gathered from several sources that include PDB, PubMed, CATH, SCOP and COPS. The current version of the database contains only structures with resolution of 2.5 Å or higher, accounting for a total of 922 entries. The major aim of this database is to contribute to the understanding of the main rules that underlie the molecular recognition process between DNA and proteins. To this end, the database is focused on each specific atomic interface rather than on the separated binding partners. Therefore, each entry in this database consists of a single and independent protein-DNA interface. We hope that PDIdb will be useful to many researchers working in fields such as the prediction of transcription factor binding sites in DNA, the study of specificity determinants that mediate enzyme recognition events, engineering and design of new DNA binding proteins with distinct binding specificity and affinity, among others. Finally, due to its friendly and easy-to-use web interface, we hope that PDIdb will also serve educational and teaching purposes.

  12. BayesMotif: de novo protein sorting motif discovery from impure datasets.

    PubMed

    Hu, Jianjun; Zhang, Fan

    2010-01-18

    Protein sorting is the process by which newly synthesized proteins are transported to their target locations within or outside of the cell. This process is precisely regulated by protein sorting signals in different forms. A major category of sorting signals are amino acid sub-sequences usually located at the N-terminals or C-terminals of protein sequences. Genome-wide experimental identification of protein sorting signals is extremely time-consuming and costly. Effective computational algorithms for de novo discovery of protein sorting signals are needed to improve the understanding of protein sorting mechanisms. We formulated the protein sorting motif discovery problem as a classification problem and proposed a Bayesian classifier based algorithm (BayesMotif) for de novo identification of a common type of protein sorting motifs in which a highly conserved anchor is present along with a less conserved motif region. A false positive removal procedure is developed to iteratively remove sequences that are unlikely to contain true motifs so that the algorithm can identify motifs from impure input sequences. Experiments on both implanted motif datasets and real-world datasets showed that the enhanced BayesMotif algorithm can identify anchored sorting motifs from pure or impure protein sequence datasets. They also show that the false positive removal procedure can help to identify true motifs even when only 20% of the input sequences contain true motif instances. We proposed BayesMotif, a novel Bayesian classification based algorithm for de novo discovery of a special category of anchored protein sorting motifs from impure datasets. Compared to conventional motif discovery algorithms such as MEME, our algorithm can find less-conserved motifs with short highly conserved anchors. Our algorithm also has the advantage of easy incorporation of additional meta-sequence features such as hydrophobicity or charge of the motifs, which may help to overcome the limitations of the PWM (position weight matrix) motif model.

  13. Ebolavirus Classification Based on Natural Vectors

    PubMed Central

    Zheng, Hui; Yin, Changchuan; Hoang, Tung; He, Rong Lucy; Yang, Jie

    2015-01-01

    According to the WHO, ebolaviruses have resulted in 8818 human deaths in West Africa as of January 2015. To better understand the evolutionary relationship of the ebolaviruses and infer virulence from the relationship, we applied the alignment-free natural vector method to classify the newest ebolaviruses. The dataset includes three new Guinea viruses as well as 99 viruses from Sierra Leone. For the viruses of the family of Filoviridae, both genus label classification and species label classification achieve an accuracy rate of 100%. We represented the relationships among Filoviridae viruses by Unweighted Pair Group Method with Arithmetic Mean (UPGMA) phylogenetic trees and found that the filoviruses can be separated well by three genera. We performed the phylogenetic analysis on the relationship among different species of Ebolavirus by their coding-complete genomes and seven viral protein genes (glycoprotein [GP], nucleoprotein [NP], VP24, VP30, VP35, VP40, and RNA polymerase [L]). The topology of the phylogenetic tree by the viral protein VP24 shows consistency with the variations of virulence of ebolaviruses. The result suggests that VP24 be a pharmaceutical target for treating or preventing ebolaviruses. PMID:25803489
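
    The alignment-free natural vector representation mentioned in this abstract can be sketched as follows. This is a minimal illustration, assuming the commonly described 12-dimensional form (per-nucleotide count, mean position, and normalized second central moment); the abstract itself does not give the formula, and the function names `natural_vector` and `euclidean` are hypothetical, not the authors' code.

```python
def natural_vector(seq, alphabet="ACGT"):
    """Return a 12-dimensional natural vector for a nucleotide sequence.

    Assumed layout: 4 counts, then 4 mean positions (1-based), then
    4 second central moments normalized by (n_k * n).
    """
    seq = seq.upper()
    n = len(seq)
    counts, means, moments = [], [], []
    for letter in alphabet:
        # 1-based positions at which this nucleotide occurs
        positions = [i + 1 for i, c in enumerate(seq) if c == letter]
        n_k = len(positions)
        if n_k == 0:
            # Absent nucleotide: all three components are set to zero
            counts.append(0)
            means.append(0.0)
            moments.append(0.0)
            continue
        mu_k = sum(positions) / n_k
        d2_k = sum((p - mu_k) ** 2 for p in positions) / (n_k * n)
        counts.append(n_k)
        means.append(mu_k)
        moments.append(d2_k)
    return counts + means + moments


def euclidean(u, v):
    """Euclidean distance between two natural vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5
```

    Pairwise distances between such vectors could then feed a nearest-neighbour classifier or a UPGMA tree; which distance and which classifier the authors used is not stated in the abstract.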

  14. Application of a 5-tiered scheme for standardized classification of 2,360 unique mismatch repair gene variants in the InSiGHT locus-specific database.

    PubMed

    Thompson, Bryony A; Spurdle, Amanda B; Plazzer, John-Paul; Greenblatt, Marc S; Akagi, Kiwamu; Al-Mulla, Fahd; Bapat, Bharati; Bernstein, Inge; Capellá, Gabriel; den Dunnen, Johan T; du Sart, Desiree; Fabre, Aurelie; Farrell, Michael P; Farrington, Susan M; Frayling, Ian M; Frebourg, Thierry; Goldgar, David E; Heinen, Christopher D; Holinski-Feder, Elke; Kohonen-Corish, Maija; Robinson, Kristina Lagerstedt; Leung, Suet Yi; Martins, Alexandra; Moller, Pal; Morak, Monika; Nystrom, Minna; Peltomaki, Paivi; Pineda, Marta; Qi, Ming; Ramesar, Rajkumar; Rasmussen, Lene Juel; Royer-Pokora, Brigitte; Scott, Rodney J; Sijmons, Rolf; Tavtigian, Sean V; Tops, Carli M; Weber, Thomas; Wijnen, Juul; Woods, Michael O; Macrae, Finlay; Genuardi, Maurizio

    2014-02-01

    The clinical classification of hereditary sequence variants identified in disease-related genes directly affects clinical management of patients and their relatives. The International Society for Gastrointestinal Hereditary Tumours (InSiGHT) undertook a collaborative effort to develop, test and apply a standardized classification scheme to constitutional variants in the Lynch syndrome-associated genes MLH1, MSH2, MSH6 and PMS2. Unpublished data submission was encouraged to assist in variant classification and was recognized through microattribution. The scheme was refined by multidisciplinary expert committee review of the clinical and functional data available for variants, applied to 2,360 sequence alterations, and disseminated online. Assessment using validated criteria altered classifications for 66% of 12,006 database entries. Clinical recommendations based on transparent evaluation are now possible for 1,370 variants that were not obviously protein truncating from nomenclature. This large-scale endeavor will facilitate the consistent management of families suspected to have Lynch syndrome and demonstrates the value of multidisciplinary collaboration in the curation and classification of variants in public locus-specific databases.

  15. Application of a five-tiered scheme for standardized classification of 2,360 unique mismatch repair gene variants lodged on the InSiGHT locus-specific database

    PubMed Central

    Plazzer, John-Paul; Greenblatt, Marc S.; Akagi, Kiwamu; Al-Mulla, Fahd; Bapat, Bharati; Bernstein, Inge; Capellá, Gabriel; den Dunnen, Johan T.; du Sart, Desiree; Fabre, Aurelie; Farrell, Michael P.; Farrington, Susan M.; Frayling, Ian M.; Frebourg, Thierry; Goldgar, David E.; Heinen, Christopher D.; Holinski-Feder, Elke; Kohonen-Corish, Maija; Robinson, Kristina Lagerstedt; Leung, Suet Yi; Martins, Alexandra; Moller, Pal; Morak, Monika; Nystrom, Minna; Peltomaki, Paivi; Pineda, Marta; Qi, Ming; Ramesar, Rajkumar; Rasmussen, Lene Juel; Royer-Pokora, Brigitte; Scott, Rodney J.; Sijmons, Rolf; Tavtigian, Sean V.; Tops, Carli M.; Weber, Thomas; Wijnen, Juul; Woods, Michael O.; Macrae, Finlay; Genuardi, Maurizio

    2015-01-01

    Clinical classification of sequence variants identified in hereditary disease genes directly affects clinical management of patients and their relatives. The International Society for Gastrointestinal Hereditary Tumours (InSiGHT) undertook a collaborative effort to develop, test and apply a standardized classification scheme to constitutional variants in the Lynch Syndrome genes MLH1, MSH2, MSH6 and PMS2. Unpublished data submission was encouraged to assist variant classification, and recognized by microattribution. The scheme was refined by multidisciplinary expert committee review of clinical and functional data available for variants, applied to 2,360 sequence alterations, and disseminated online. Assessment using validated criteria altered classifications for 66% of 12,006 database entries. Clinical recommendations based on transparent evaluation are now possible for 1,370 variants not obviously protein-truncating from nomenclature. This large-scale endeavor will facilitate consistent management of suspected Lynch Syndrome families, and demonstrates the value of multidisciplinary collaboration for curation and classification of variants in public locus-specific databases. PMID:24362816

  16. Feature selection and classification of protein-protein complexes based on their binding affinities using machine learning approaches.

    PubMed

    Yugandhar, K; Gromiha, M Michael

    2014-09-01

    Protein-protein interactions are intrinsic to virtually every cellular process. Predicting the binding affinity of protein-protein complexes is one of the challenging problems in computational and molecular biology. In this work, we related sequence features of protein-protein complexes with their binding affinities using machine learning approaches. We set up a database of 185 protein-protein complexes for which the interacting pairs are heterodimers and their experimental binding affinities are available. On the other hand, we have developed a set of 610 features from the sequences of protein complexes and utilized Ranker search method, which is the combination of Attribute evaluator and Ranker method for selecting specific features. We have analyzed several machine learning algorithms to discriminate protein-protein complexes into high and low affinity groups based on their Kd values. Our results showed a 10-fold cross-validation accuracy of 76.1% with the combination of nine features using support vector machines. Further, we observed accuracy of 83.3% on an independent test set of 30 complexes. We suggest that our method would serve as an effective tool for identifying the interacting partners in protein-protein interaction networks and human-pathogen interactions based on the strength of interactions. © 2014 Wiley Periodicals, Inc.

  17. PSOFuzzySVM-TMH: identification of transmembrane helix segments using ensemble feature space by incorporated fuzzy support vector machine.

    PubMed

    Hayat, Maqsood; Tahir, Muhammad

    2015-08-01

    Membrane proteins are central components of the cell that manage intra- and extracellular processes. They execute a diversity of functions that are vital for the survival of organisms. The topology of a transmembrane protein describes the number of transmembrane (TM) helix segments and their orientation. However, owing to the lack of known structures, the identification of TM helices and their topology through experimental methods is laborious and low-throughput. In order to identify TM helix segments reliably, accurately, and effectively from topogenic sequences, we propose the PSOFuzzySVM-TMH model. In this model, evolutionary information (the position-specific scoring matrix) and discrete information (the 6-letter exchange group) are used to formulate transmembrane protein sequences. Noisy and extraneous attributes are removed from both feature spaces using an optimization-based selection technique, particle swarm optimization. Finally, the selected feature spaces are combined to form an ensemble feature space. A fuzzy support vector machine is utilized as the classification algorithm. Two benchmark datasets, comprising low- and high-resolution datasets, are used. The performance of the PSOFuzzySVM-TMH model is assessed at various levels through a 10-fold cross-validation test. The empirical results reveal that the proposed PSOFuzzySVM-TMH framework achieves superior classification performance on the examined datasets. The proposed model might be a useful, high-throughput tool for the academic and research community for further structural and functional studies on transmembrane proteins.

  18. Structure based classification for bile salt export pump (BSEP) inhibitors using comparative structural modeling of human BSEP

    NASA Astrophysics Data System (ADS)

    Jain, Sankalp; Grandits, Melanie; Richter, Lars; Ecker, Gerhard F.

    2017-06-01

    The bile salt export pump (BSEP) actively transports conjugated monovalent bile acids from the hepatocytes into the bile. This facilitates the formation of micelles and promotes digestion and absorption of dietary fat. Inhibition of BSEP leads to decreased bile flow and accumulation of cytotoxic bile salts in the liver. A number of compounds have been identified to interact with BSEP, which results in drug-induced cholestasis or liver injury. Therefore, in silico approaches for flagging compounds as potential BSEP inhibitors would be of high value in the early stage of the drug discovery pipeline. Up to now, due to the lack of a high-resolution X-ray structure of BSEP, in silico based identification of BSEP inhibitors focused on ligand-based approaches. In this study, we provide a homology model for BSEP, developed using the corrected mouse P-glycoprotein structure (PDB ID: 4M1M). Subsequently, the model was used for docking-based classification of a set of 1212 compounds (405 BSEP inhibitors, 807 non-inhibitors). Using the scoring function ChemScore, a prediction accuracy of 81% on the training set and 73% on two external test sets could be obtained. In addition, the applicability domain of the models was assessed based on Euclidean distance. Further, analysis of the protein-ligand interaction fingerprints revealed certain functional group-amino acid residue interactions that could play a key role for ligand binding. Though ligand-based models, due to their high speed and accuracy, remain the method of choice for classification of BSEP inhibitors, structure-assisted docking models demonstrate reasonably good prediction accuracies while additionally providing information about putative protein-ligand interactions.

  19. Lessons from making the Structural Classification of Proteins (SCOP) and their implications for protein structure modelling.

    PubMed

    Andreeva, Antonina

    2016-06-15

    The Structural Classification of Proteins (SCOP) database has facilitated the development of many tools and algorithms and it has been successfully used in protein structure prediction and large-scale genome annotations. During the development of SCOP, numerous exceptions were found to topological rules, along with complex evolutionary scenarios and peculiarities in proteins including the ability to fold into alternative structures. This article reviews cases of structural variations observed for individual proteins and among groups of homologues, knowledge of which is essential for protein structure modelling. © 2016 The Author(s). Published by Portland Press Limited on behalf of the Biochemical Society.

  20. Classification of hyperlipidaemias and hyperlipoproteinaemias*

    PubMed Central

    1970-01-01

    Many studies of atherosclerosis have indicated hyperlipidaemia as a predisposing factor to vascular disease. The relationship holds even for mild degrees of hyperlipidaemia, a fact that underlines the importance of this category of disorders. Both primary and secondary hyperlipidaemias represent such a variety of abnormalities that an internationally acceptable provisional classification is highly desirable in order to facilitate communication between scientists with different backgrounds. The present memorandum presents such a classification; it briefly describes the criteria for diagnosis of the main types of hyperlipidaemia as well as the methods of their determination. Because lipoproteins offer more information than analysis of plasma lipids (most of the plasma lipids being bound to various proteins), the classification is based on lipoprotein analyses by electrophoresis and ultracentrifugation. Simpler methods, however, such as the observation of plasma and measurements of cholesterol and triglycerides, are used to the fullest possible extent in determining the lipoprotein patterns. PMID:4930042

  1. Determining Effects of Non-synonymous SNPs on Protein-Protein Interactions using Supervised and Semi-supervised Learning

    PubMed Central

    Zhao, Nan; Han, Jing Ginger; Shyu, Chi-Ren; Korkin, Dmitry

    2014-01-01

    Single nucleotide polymorphisms (SNPs) are among the most common types of genetic variation in complex genetic disorders. A growing number of studies link the functional role of SNPs with the networks and pathways mediated by the disease-associated genes. For example, many non-synonymous missense SNPs (nsSNPs) have been found near or inside the protein-protein interaction (PPI) interfaces. Determining whether such nsSNP will disrupt or preserve a PPI is a challenging task to address, both experimentally and computationally. Here, we present this task as three related classification problems, and develop a new computational method, called the SNP-IN tool (non-synonymous SNP INteraction effect predictor). Our method predicts the effects of nsSNPs on PPIs, given the interaction's structure. It leverages supervised and semi-supervised feature-based classifiers, including our new Random Forest self-learning protocol. The classifiers are trained based on a dataset of comprehensive mutagenesis studies for 151 PPI complexes, with experimentally determined binding affinities of the mutant and wild-type interactions. Three classification problems were considered: (1) a 2-class problem (strengthening/weakening PPI mutations), (2) another 2-class problem (mutations that disrupt/preserve a PPI), and (3) a 3-class classification (detrimental/neutral/beneficial mutation effects). In total, 11 different supervised and semi-supervised classifiers were trained and assessed resulting in a promising performance, with the weighted f-measure ranging from 0.87 for Problem 1 to 0.70 for the most challenging Problem 3. By integrating prediction results of the 2-class classifiers into the 3-class classifier, we further improved its performance for Problem 3. To demonstrate the utility of SNP-IN tool, it was applied to study the nsSNP-induced rewiring of two disease-centered networks. 
The accurate and balanced performance of the SNP-IN tool makes it readily applicable to studying the rewiring of large-scale protein-protein interaction networks, and it can be useful for functional annotation of disease-associated SNPs. The SNP-IN tool is freely accessible as a web-server at http://korkinlab.org/snpintool/. PMID:24784581
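
    The weighted f-measure reported above can be illustrated with a short sketch. The abstract does not say how the weighting is done; this example assumes the common definition in which each class's F1 score is weighted by that class's share of the true labels. The function name `weighted_f_measure` and the class labels are hypothetical.

```python
from collections import Counter


def weighted_f_measure(y_true, y_pred):
    """Support-weighted average of per-class F1 scores (assumed definition)."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, n_label in support.items():
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = n_label - tp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            f1 = 2 * precision * recall / (precision + recall)
        else:
            f1 = 0.0
        # Weight each class by its share of the true labels
        score += f1 * n_label / total
    return score


# Toy 3-class example (detrimental / neutral / beneficial)
truth = ["detrimental", "detrimental", "neutral", "beneficial"]
guess = ["detrimental", "neutral", "neutral", "beneficial"]
example_score = weighted_f_measure(truth, guess)
```

    Weighting by support keeps a rare class from dominating the score, which matters when the three mutation-effect classes are unevenly represented.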

  2. Conserved thioredoxin fold is present in Pisum sativum L. sieve element occlusion-1 protein

    PubMed Central

    Umate, Pavan; Tuteja, Renu

    2010-01-01

    Homology-based three-dimensional model for Pisum sativum sieve element occlusion 1 (Ps.SEO1) (forisomes) protein was constructed. A stretch of amino acids (residues 320 to 456) which is well conserved in all known members of forisomes proteins was used to model the 3D structure of Ps.SEO1. The structural prediction was done using Protein Homology/analogY Recognition Engine (PHYRE) web server. Based on studies of local sequence alignment, the thioredoxin-fold containing protein [Structural Classification of Proteins (SCOP) code d1o73a_], a member of the glutathione peroxidase family was selected as a template for modeling the spatial structure of Ps.SEO1. Selection was based on comparison of primary sequence, higher match quality and alignment accuracy. Motif 1 (EVF) is conserved in Ps.SEO1, Vicia faba (Vf.For1) and Medicago truncatula (MT.SEO3); motif 2 (KKED) is well conserved across all forisomes proteins and motif 3 (IGYIGNP) is conserved in Ps.SEO1 and Vf.For1. PMID:20404566

  3. A novel Multi-Agent Ada-Boost algorithm for predicting protein structural class with the information of protein secondary structure.

    PubMed

    Fan, Ming; Zheng, Bin; Li, Lihua

    2015-10-01

    Knowledge of the structural class of a given protein is important for understanding its folding patterns. Despite considerable effort, predicting protein structural class solely from protein sequences remains a challenging problem. Feature extraction and classification are the main problems in prediction. In this research, we extended our earlier work on these two aspects. For protein feature extraction, we proposed a scheme that calculates word frequency and word position from sequences of amino acids, reduced amino acids, and secondary structure. For accurate classification of protein structural class, we developed a novel Multi-Agent Ada-Boost (MA-Ada) method by integrating the features of a Multi-Agent system into the Ada-Boost algorithm. Extensive experiments were conducted to test and compare the proposed method using four low-homology benchmark datasets. The results showed classification accuracies of 88.5%, 96.0%, 88.4%, and 85.5%, respectively, which are much better than those of the existing methods. The source code and dataset are available on request.

  4. Detecting experimental techniques and selecting relevant documents for protein-protein interactions from biomedical literature

    PubMed Central

    2011-01-01

    Background: The selection of relevant articles for curation, and linking those articles to experimental techniques confirming the findings became one of the primary subjects of the recent BioCreative III contest. The contest’s Protein-Protein Interaction (PPI) task consisted of two sub-tasks: Article Classification Task (ACT) and Interaction Method Task (IMT). ACT aimed to automatically select relevant documents for PPI curation, whereas the goal of IMT was to recognise the methods used in experiments for identifying the interactions in full-text articles. Results: We proposed and compared several classification-based methods for both tasks, employing rich contextual features as well as features extracted from external knowledge sources. For IMT, a new method that classifies pair-wise relations between every text phrase and candidate interaction method obtained promising results with an F1 score of 64.49%, as tested on the task’s development dataset. We also explored ways to combine this new approach and more conventional, multi-label document classification methods. For ACT, our classifiers exploited automatically detected named entities and other linguistic information. The evaluation results on the BioCreative III PPI test datasets showed that our systems were very competitive: one of our IMT methods yielded the best performance among all participants, as measured by F1 score, Matthews Correlation Coefficient and AUC iP/R; whereas for ACT, our best classifier was ranked second as measured by AUC iP/R, and also competitive according to other metrics. Conclusions: Our novel approach that converts the multi-class, multi-label classification problem to a binary classification problem showed much promise in IMT. 
Nevertheless, on the test dataset the best performance was achieved by taking the union of the output of this method and that of a multi-class, multi-label document classifier, which indicates that the two types of systems complement each other in terms of recall. For ACT, our system exploited a rich set of features and also obtained encouraging results. We examined the features with respect to their contributions to the classification results, and concluded that contextual words surrounding named entities, as well as the MeSH headings associated with the documents were among the main contributors to the performance. PMID:22151769

  5. Mouse Vk gene classification by nucleic acid sequence similarity.

    PubMed

    Strohal, R; Helmberg, A; Kroemer, G; Kofler, R

    1989-01-01

    Analyses of immunoglobulin (Ig) variable (V) region gene usage in the immune response, estimates of V gene germline complexity, and other nucleic acid hybridization-based studies depend on the extent to which such genes are related (i.e., sequence similarity) and their organization in gene families. While mouse Igh heavy chain V region (VH) gene families are relatively well-established, a corresponding systematic classification of Igk light chain V region (Vk) genes has not been reported. The present analysis, in the course of which we reviewed the known extent of the Vk germline gene repertoire and Vk gene usage in a variety of responses to foreign and self antigens, provides a classification of mouse Vk genes in gene families composed of members with greater than 80% overall nucleic acid sequence similarity. This classification differed in several aspects from that of VH genes: only some Vk gene families were as clearly separated (by greater than 25% sequence dissimilarity) as typical VH gene families; most Vk gene families were closely related and, in several instances, members from different families were very similar (greater than 80%) over large sequence portions; frequently, classification by nucleic acid sequence similarity diverged from existing classifications based on amino-terminal protein sequence similarity. Our data have implications for Vk gene analyses by nucleic acid hybridization and describe potentially important differences in sequence organization between VH and Vk genes.
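
    Grouping genes into families by a sequence-similarity threshold, as described above, can be sketched as below. This is an illustration only: it assumes pre-aligned sequences of equal length, uses simple percent identity, and merges genes by single linkage, none of which is specified in the abstract. The names `percent_identity` and `group_into_families` and the toy sequences are hypothetical.

```python
def percent_identity(a, b):
    """Percent of matching positions between two pre-aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return 100.0 * matches / len(a)


def group_into_families(seqs, threshold=80.0):
    """Single-linkage grouping: link genes whose identity exceeds threshold."""
    names = list(seqs)
    parent = {name: name for name in names}

    def find(x):
        # Union-find root lookup with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if percent_identity(seqs[a], seqs[b]) > threshold:
                parent[find(a)] = find(b)

    families = {}
    for name in names:
        families.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in families.values())


toy = {"v1": "ACGTACGTAC", "v2": "ACGTACGTAA", "v3": "TTTTTTTTTT"}
toy_families = group_into_families(toy)
```

    Single linkage can chain closely related families together, which is consistent with the abstract's remark that many Vk families are less clearly separated than typical VH families.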

  6. Problems of classification in the family Paramyxoviridae.

    PubMed

    Rima, Bert; Collins, Peter; Easton, Andrew; Fouchier, Ron; Kurath, Gael; Lamb, Robert A; Lee, Benhur; Maisner, Andrea; Rota, Paul; Wang, Lin-Fa

    2018-05-01

    A number of unassigned viruses in the family Paramyxoviridae need to be classified either as a new genus or placed into one of the seven genera currently recognized in this family. Furthermore, numerous new paramyxoviruses continue to be discovered. However, attempts at classification have highlighted the difficulties that arise by applying historic criteria or criteria based on sequence alone to the classification of the viruses in this family. While the recent taxonomic change that elevated the previous subfamily Pneumovirinae into a separate family Pneumoviridae is readily justified on the basis of RNA-dependent RNA polymerase (RdRp or L protein) sequence motifs, using RdRp sequence comparisons for assignment to lower level taxa raises problems that would require an overhaul of the current criteria for assignment into genera in the family Paramyxoviridae. Arbitrary cut-off points to delineate genera and species would have to be set if classification was based on the amino acid sequence of the RdRp alone or on pairwise analysis of sequence complementarity (PASC) of all open reading frames (ORFs). While these cut-offs cannot be made consistent with the current classification in this family, resorting to genus-level demarcation criteria with additional input from the biological context may afford a way forward. Such criteria would reflect the increasingly dynamic nature of virus taxonomy even if it would require a complete revision of the current classification.

  7. Toward genetics-based virus taxonomy: comparative analysis of a genetics-based classification and the taxonomy of picornaviruses.

    PubMed

    Lauber, Chris; Gorbalenya, Alexander E

    2012-04-01

    Virus taxonomy has received little attention from the research community despite its broad relevance. In an accompanying paper (C. Lauber and A. E. Gorbalenya, J. Virol. 86:3890-3904, 2012), we have introduced a quantitative approach to hierarchically classify viruses of a family using pairwise evolutionary distances (PEDs) as a measure of genetic divergence. When applied to the six most conserved proteins of the Picornaviridae, it clustered 1,234 genome sequences in groups at three hierarchical levels (to which we refer as the "GENETIC classification"). In this study, we compare the GENETIC classification with the expert-based picornavirus taxonomy and outline differences in the underlying frameworks regarding the relation of virus groups and genetic diversity that represent, respectively, the structure and content of a classification. To facilitate the analysis, we introduce two novel diagrams. The first connects the genetic diversity of taxa to both the PED distribution and the phylogeny of picornaviruses. The second depicts a classification and the accommodated genetic diversity in a standardized manner. Generally, we found striking agreement between the two classifications on species and genus taxa. A few disagreements concern the species Human rhinovirus A and Human rhinovirus C and the genus Aphthovirus, which were split in the GENETIC classification. Furthermore, we propose a new supergenus level and universal, level-specific PED thresholds, not reached yet by many taxa. Since the species threshold is approached mostly by taxa with large sampling sizes and those infecting multiple hosts, it may represent an upper limit on divergence, beyond which homologous recombination in the six most conserved genes between two picornaviruses might not give viable progeny.

  8. Prediction of protein mutant stability using classification and regression tool.

    PubMed

    Huang, Liang-Tsung; Saraboji, K; Ho, Shinn-Ying; Hwang, Shiow-Fen; Ponnuswamy, M N; Gromiha, M Michael

    2007-02-01

    Prediction of protein stability upon amino acid substitutions is an important problem in molecular biology, and solving it would help in designing stable mutants. In this work, we have analyzed the stability of protein mutants using two different datasets of 1396 and 2204 mutants obtained from the ProTherm database, for free energy change due to thermal (ΔΔG) and denaturant (ΔΔG(H2O)) denaturation, respectively. We have used a set of 48 physical, chemical, energetic and conformational properties of amino acid residues and computed the difference of amino acid properties for each mutant in both sets of data. These differences in amino acid properties have been related to protein stability (ΔΔG and ΔΔG(H2O)) and are used to train a classification and regression tool for predicting the stability of protein mutants. Further, we have tested the method with 4-fold, 5-fold and 10-fold cross-validation procedures. We found that the physical properties, shape and flexibility are important determinants of protein stability. The classification of mutants based on secondary structure (helix, strand, turn and coil) and solvent accessibility (buried, partially buried, partially exposed and exposed) distinguished the stabilizing/destabilizing mutants at an average accuracy of 81% and 80% for ΔΔG and ΔΔG(H2O), respectively. The correlation between the experimental and predicted stability change is 0.61 for ΔΔG and 0.44 for ΔΔG(H2O). Further, the free energy change due to the replacement of an amino acid residue has been predicted within an average error of 1.08 kcal/mol and 1.37 kcal/mol for thermal and chemical denaturation, respectively. The relative importance of secondary structure and solvent accessibility, and the influence of the dataset on prediction of protein mutant stability have been discussed.

  9. DockQ: A Quality Measure for Protein-Protein Docking Models

    PubMed Central

    Basu, Sankar

    2016-01-01

    The state of the art in assessing the structural quality of docking models is currently based on three related yet independent quality measures: Fnat, LRMS, and iRMS as proposed and standardized by CAPRI. These quality measures quantify different aspects of the quality of a particular docking model and need to be viewed together to reveal the true quality, e.g. a model with relatively poor LRMS (>10Å) might still qualify as 'acceptable' with a decent Fnat (>0.50) and iRMS (<3.0Å). This is also the reason why the so-called CAPRI criteria for assessing the quality of docking models are defined by applying various ad-hoc cutoffs on these measures to classify a docking model into the four classes: Incorrect, Acceptable, Medium, or High quality. This classification has been useful in CAPRI, but since models are grouped in only four bins it is also rather limiting, making it difficult to rank models, correlate with scoring functions or use it as target function in machine learning algorithms. Here, we present DockQ, a continuous protein-protein docking model quality measure derived by combining Fnat, LRMS, and iRMS to a single score in the range [0, 1] that can be used to assess the quality of protein docking models. By using DockQ on CAPRI models it is possible to almost completely reproduce the original CAPRI classification into Incorrect, Acceptable, Medium and High quality. An average PPV of 94% at 90% recall demonstrates that there is no need to apply predefined ad-hoc cutoffs to classify docking models. Since DockQ recapitulates the CAPRI classification almost perfectly, it can be viewed as a higher resolution version of the CAPRI classification, making it possible to estimate model quality in a more quantitative way using Z-scores or sum of top ranked models, which has been so valuable for the CASP community. 
The possibility to directly correlate a quality measure to a scoring function has been crucial for the development of scoring functions for protein structure prediction, and DockQ should be useful in a similar development in the protein docking field. DockQ is available at http://github.com/bjornwallner/DockQ/ PMID:27560519
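
    How three measures can be folded into one score in [0, 1] is sketched below. The abstract does not give the formula; the scaling constants (1.5 Å for iRMS, 8.5 Å for LRMS) and the class cutoffs (0.23, 0.49, 0.80) are recalled from the DockQ publication and should be treated as assumptions to verify against the DockQ repository. The function names `dockq` and `capri_class` are hypothetical.

```python
def dockq(fnat, irms, lrms, d_irms=1.5, d_lrms=8.5):
    """Combine Fnat, iRMS and LRMS into a single score in [0, 1].

    Each RMS value is mapped to (0, 1] with 1 / (1 + (rms / d)^2), so a
    smaller deviation gives a larger contribution. The constants d_irms
    and d_lrms are assumed values, not taken from the abstract.
    """
    def scaled(rms, d):
        return 1.0 / (1.0 + (rms / d) ** 2)

    return (fnat + scaled(irms, d_irms) + scaled(lrms, d_lrms)) / 3.0


def capri_class(score):
    """Map a DockQ-style score to a CAPRI-like class (assumed cutoffs)."""
    if score >= 0.80:
        return "High"
    if score >= 0.49:
        return "Medium"
    if score >= 0.23:
        return "Acceptable"
    return "Incorrect"


perfect = dockq(fnat=1.0, irms=0.0, lrms=0.0)
borderline = dockq(fnat=0.0, irms=1.5, lrms=8.5)
```

    Because the score is continuous, models can be ranked directly or used as a regression target, while `capri_class` recovers the coarse four-bin view when needed.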

  10. Extraction of Protein-Protein Interaction from Scientific Articles by Predicting Dominant Keywords

    PubMed Central

    Koyabu, Shun; Phan, Thi Thanh Thuy; Ohkawa, Takenao

    2015-01-01

    For the automatic extraction of protein-protein interaction information from scientific articles, a machine learning approach is useful. The classifier is generated from training data represented using several features to decide whether a protein pair in each sentence has an interaction. A specific keyword that is directly related to interaction, such as “bind” or “interact”, plays an important role in training classifiers. We call such a keyword, which affects the capability of the classifier, a dominant keyword. Although it is important to identify the dominant keywords, whether a keyword is dominant depends on the context in which it occurs. Therefore, we propose a method for predicting whether a keyword is dominant for each instance. In this method, a keyword that yields imbalanced classification results is tentatively assumed to be a dominant keyword. Then the classifiers are trained separately on the instances with and without the assumed dominant keywords. The validity of the assumed dominant keyword is evaluated based on the classification results of the generated classifiers. The assumption is updated according to the evaluation result. Repeating this process increases the prediction accuracy of the dominant keyword. Our experimental results using five corpora show the effectiveness of our proposed method with dominant keyword prediction. PMID:26783534

  11. fRMSDPred: Predicting Local RMSD Between Structural Fragments Using Sequence Information

    DTIC Science & Technology

    2007-04-04

    machine learning approaches for estimating the RMSD value of a pair of protein fragments. These estimated fragment-level RMSD values can be used to construct the alignment, assess the quality of an alignment, and identify high-quality alignment segments. We present algorithms to solve this fragment-level RMSD prediction problem using a supervised learning framework based on support vector regression and classification that incorporates protein profiles, predicted secondary structure, effective information encoding schemes, and novel second-order pairwise exponential kernel

  12. AllergenFP: allergenicity prediction by descriptor fingerprints.

    PubMed

    Dimitrov, Ivan; Naneva, Lyudmila; Doytchinova, Irini; Bangov, Ivan

    2014-03-15

    Allergenicity, like antigenicity and immunogenicity, is a property encoded linearly and non-linearly, and therefore the alignment-based approaches are not able to identify this property unambiguously. A novel alignment-free descriptor-based fingerprint approach is presented here and applied to identify allergens and non-allergens. The approach was implemented into a four-step algorithm. Initially, the protein sequences are described by amino acid principal properties such as hydrophobicity, size, relative abundance, helix and β-strand forming propensities. Then, the generated strings of different length are converted into vectors with equal length by auto- and cross-covariance (ACC). The vectors were transformed into binary fingerprints and compared in terms of Tanimoto coefficient. The approach was applied to a set of 2427 known allergens and 2427 non-allergens and identified correctly 88% of them with Matthews correlation coefficient of 0.759. The descriptor fingerprint approach presented here is universal. It could be applied for any classification problem in computational biology. The set of E-descriptors is able to capture the main structural and physicochemical properties of amino acids building the proteins. The ACC transformation overcomes the main problem in the alignment-based comparative studies arising from the different length of the aligned protein sequences. The conversion of protein ACC values into binary descriptor fingerprints allows similarity search and classification. The algorithm described in the present study was implemented in a specially designed Web site, named AllergenFP (FP stands for FingerPrint). AllergenFP is written in Python, with GUI in HTML. It is freely accessible at http://ddg-pharmfac.net/AllergenFP. Contact: idoytchinova@pharmfac.net or ivanbangov@shu-bg.net.
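
    The ACC and Tanimoto steps of the four-step algorithm can be sketched as below. This is an illustration under stated assumptions rather than the AllergenFP implementation: the ACC definition is the commonly used lagged average of descriptor products, the sign-based binarization in `to_fingerprint` is a hypothetical placeholder (the abstract does not say how vectors become fingerprints), and all function names are invented for this example.

```python
def acc_transform(descriptors, max_lag):
    """Auto- and cross-covariance of a per-residue descriptor table.

    descriptors: list of tuples, one per residue, each holding d values.
    Returns a vector of length d * d * max_lag regardless of sequence
    length (the sequence must be longer than max_lag).
    """
    n = len(descriptors)
    d = len(descriptors[0])
    out = []
    for lag in range(1, max_lag + 1):
        for j in range(d):
            for k in range(d):
                # Average product of descriptor j at position i and
                # descriptor k at position i + lag (j == k: auto-covariance).
                total = sum(descriptors[i][j] * descriptors[i + lag][k]
                            for i in range(n - lag))
                out.append(total / (n - lag))
    return out


def to_fingerprint(vector, threshold=0.0):
    """Hypothetical binarization: 1 where the ACC value exceeds threshold."""
    return [1 if v > threshold else 0 for v in vector]


def tanimoto(a, b):
    """Tanimoto coefficient of two equal-length binary fingerprints."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    # Convention chosen here: two all-zero fingerprints count as identical.
    return both / either if either else 1.0
```

    Two sequences of different length yield ACC vectors of the same length, which is what makes the bitwise comparison possible without an alignment.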

  13. 7TMRmine: a Web server for hierarchical mining of 7TMR proteins

    PubMed Central

    Lu, Guoqing; Wang, Zhifang; Jones, Alan M; Moriyama, Etsuko N

    2009-01-01

    Background Seven-transmembrane region-containing receptors (7TMRs) play central roles in eukaryotic signal transduction. Due to their biomedical importance, thorough mining of 7TMRs from diverse genomes has been an active target of bioinformatics and pharmacogenomics research. The need for new and accurate 7TMR/GPCR prediction tools is paramount with the accelerated rate of acquisition of diverse sequence information. Currently available and often used protein classification methods (e.g., profile hidden Markov Models) are highly accurate for identifying their membership information among already known 7TMR subfamilies. However, these alignment-based methods are less effective for identifying remote similarities, e.g., identifying proteins from highly divergent or possibly new 7TMR families. In this regard, more sensitive (e.g., alignment-free) methods are needed to complement the existing protein classification methods. A better strategy would be to combine different classifiers, from more specific to more sensitive methods, to identify a broader spectrum of 7TMR protein candidates. Description We developed a Web server, 7TMRmine, by integrating alignment-free and alignment-based classifiers specifically trained to identify candidate 7TMR proteins as well as transmembrane (TM) prediction methods. This new tool enables researchers to easily assess the distribution of GPCR functionality in diverse genomes or individual newly-discovered proteins. 7TMRmine is easily customized and facilitates exploratory analysis of diverse genomes. Users can integrate various alignment-based, alignment-free, and TM-prediction methods in any combination and in any hierarchical order. Sixteen classifiers (including two TM-prediction methods) are available on the 7TMRmine Web server. Not only can the 7TMRmine tool be used for 7TMR mining, but also for general TM-protein analysis. Users can submit protein sequences for analysis, or explore pre-analyzed results for multiple genomes. 
The server currently includes prediction results and the summary statistics for 68 genomes. Conclusion 7TMRmine facilitates the discovery of 7TMR proteins. By combining prediction results from different classifiers in a multi-level filtering process, prioritized sets of 7TMR candidates can be obtained for further investigation. 7TMRmine can be also used as a general TM-protein classifier. Comparisons of TM and 7TMR protein distributions among 68 genomes revealed interesting differences in evolution of these protein families among major eukaryotic phyla. PMID:19538753

  14. Ultra-sensitive high performance liquid chromatography-laser-induced fluorescence based proteomics for clinical applications.

    PubMed

    Patil, Ajeetkumar; Bhat, Sujatha; Pai, Keerthilatha M; Rai, Lavanya; Kartha, V B; Chidangil, Santhosh

    2015-09-08

    An ultra-sensitive high performance liquid chromatography-laser induced fluorescence (HPLC-LIF) based technique has been developed by our group at Manipal for screening, early detection, and staging of various cancers, using protein profiling of clinical samples such as body fluids, cellular specimens, and biopsy tissue. More than 300 protein profiles of different clinical samples (serum, saliva, cellular samples and tissue homogenates) from volunteers (normal, and different pre-malignant/malignant conditions) were recorded using this set-up. The protein profiles were analyzed using principal component analysis (PCA) to achieve objective detection and classification of malignant, premalignant and healthy conditions with high sensitivity and specificity. The HPLC-LIF protein profiling combined with PCA, as a routine method for screening, diagnosis, and staging of cervical cancer and oral cancer, is discussed in this paper. In recent years, proteomics techniques have advanced tremendously in life sciences and medical sciences for the detection and identification of proteins in body fluids, tissue homogenates and cellular samples to understand biochemical mechanisms leading to different diseases. Some of the methods include techniques like high performance liquid chromatography, 2D-gel electrophoresis, MALDI-TOF-MS, SELDI-TOF-MS, CE-MS and LC-MS techniques. We have developed an ultra-sensitive high performance liquid chromatography-laser induced fluorescence (HPLC-LIF) based technique for screening, early detection, and staging of various cancers, using protein profiling of clinical samples such as body fluids, cellular specimens, and biopsy tissue. More than 300 protein profiles of different clinical samples (serum, saliva, cellular samples and tissue homogenates) from healthy volunteers and volunteers with different malignant conditions were recorded by using this set-up. 
The protein profile data were analyzed using principal component analysis (PCA) for objective classification and detection of malignant, premalignant and healthy conditions. The method is extremely sensitive to detect proteins with limit of detection of the order of femto-moles. The HPLC-LIF combined with PCA as a potential proteomic method for the diagnosis of oral cancer and cervical cancer has been discussed in this paper. This article is part of a Special Issue entitled: Proteomics in India. Copyright © 2015 Elsevier B.V. All rights reserved.

  15. IDM-PhyChm-Ens: intelligent decision-making ensemble methodology for classification of human breast cancer using physicochemical properties of amino acids.

    PubMed

    Ali, Safdar; Majid, Abdul; Khan, Asifullah

    2014-04-01

    Development of an accurate and reliable intelligent decision-making method for the construction of cancer diagnosis system is one of the fast growing research areas of health sciences. Such decision-making system can provide adequate information for cancer diagnosis and drug discovery. Descriptors derived from physicochemical properties of protein sequences are very useful for classifying cancerous proteins. Recently, several interesting research studies have been reported on breast cancer classification. To this end, we propose the exploitation of the physicochemical properties of amino acids in protein primary sequences such as hydrophobicity (Hd) and hydrophilicity (Hb) for breast cancer classification. Hd and Hb properties of amino acids, in recent literature, are reported to be quite effective in characterizing the constituent amino acids and are used to study protein foldings, interactions, structures, and sequence-order effects. Especially, using these physicochemical properties, we observed that proline, serine, tyrosine, cysteine, arginine, and asparagine amino acids offer high discrimination between cancerous and healthy proteins. In addition, unlike traditional ensemble classification approaches, the proposed 'IDM-PhyChm-Ens' method was developed by combining the decision spaces of a specific classifier trained on different feature spaces. The different feature spaces used were amino acid composition, split amino acid composition, and pseudo amino acid composition. Consequently, we have exploited different feature spaces using Hd and Hb properties of amino acids to develop an accurate method for classification of cancerous protein sequences. We developed ensemble classifiers using diverse learning algorithms such as random forest (RF), support vector machines (SVM), and K-nearest neighbor (KNN) trained on different feature spaces. We observed that ensemble-RF, in case of cancer classification, performed better than ensemble-SVM and ensemble-KNN. 
Our analysis demonstrates that ensemble-RF, ensemble-SVM and ensemble-KNN are more effective than their individual counterparts. The proposed 'IDM-PhyChm-Ens' method has shown improved performance compared to existing techniques.

  16. MALDI Mass Spectrometry Imaging: A Novel Tool for the Identification and Classification of Amyloidosis.

    PubMed

    Winter, Martin; Tholey, Andreas; Kristen, Arnt; Röcken, Christoph

    2017-11-01

    Amyloidosis is a group of diseases caused by extracellular accumulation of fibrillar polypeptide aggregates. So far, diagnosis is performed by Congo red staining of tissue sections in combination with polarization microscopy. Subsequent identification of the causative protein by immunohistochemistry harbors some difficulties regarding sensitivity and specificity. Mass spectrometry based approaches have been demonstrated to constitute a reliable method to supplement typing of amyloidosis, but still depend on Congo red staining. In the present study, we used matrix-assisted laser desorption/ionization mass spectrometry imaging coupled with ion mobility separation (MALDI-IMS MSI) to investigate amyloid deposits in formalin-fixed and paraffin-embedded tissue samples. Utilizing a novel peptide filter method, we found a universal peptide signature for amyloidoses. Furthermore, differences in the peptide composition of ALλ and ATTR amyloid were revealed and used to build a reliable classification model. Integrating the peptide filter in MALDI-IMS MSI analysis, we developed a bioinformatics workflow facilitating the identification and classification of amyloidosis in a less time and sample-consuming experimental setup. Our findings demonstrate also the feasibility to investigate the amyloid's protein composition, thus paving the way to establish classification models for the diverse types of amyloidoses and to shed further light on the complex process of amyloidogenesis. © 2017 The Authors, Proteomics Published by WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.

  17. A reliable Raman-spectroscopy-based approach for diagnosis, classification and follow-up of B-cell acute lymphoblastic leukemia

    NASA Astrophysics Data System (ADS)

    Managò, Stefano; Valente, Carmen; Mirabelli, Peppino; Circolo, Diego; Basile, Filomena; Corda, Daniela; de Luca, Anna Chiara

    2016-04-01

    Acute lymphoblastic leukemia type B (B-ALL) is a neoplastic disorder that shows high mortality rates due to immature lymphocyte B-cell proliferation. B-ALL diagnosis requires identification and classification of the leukemia cells. Here, we demonstrate the use of Raman spectroscopy to discriminate normal lymphocytic B-cells from three different B-leukemia transformed cell lines (i.e., RS4;11, REH, MN60 cells) based on their biochemical features. In combination with immunofluorescence and Western blotting, we show that these Raman markers reflect the relative changes in the potential biological markers from cell surface antigens, cytoplasmic proteins, and DNA content and correlate with the lymphoblastic B-cell maturation/differentiation stages. Our study demonstrates the potential of this technique for classification of B-leukemia cells into the different differentiation/maturation stages, as well as for the identification of key biochemical changes under chemotherapeutic treatments. Finally, preliminary results from clinical samples indicate high consistency of, and potential applications for, this Raman spectroscopy approach.

  18. Linear regression models for solvent accessibility prediction in proteins.

    PubMed

    Wagner, Michael; Adamczak, Rafał; Porollo, Aleksey; Meller, Jarosław

    2005-04-01

    The relative solvent accessibility (RSA) of an amino acid residue in a protein structure is a real number that represents the solvent exposed surface area of this residue in relative terms. The problem of predicting the RSA from the primary amino acid sequence can therefore be cast as a regression problem. Nevertheless, RSA prediction has so far typically been cast as a classification problem. Consequently, various machine learning techniques have been used within the classification framework to predict whether a given amino acid exceeds some (arbitrary) RSA threshold and would thus be predicted to be "exposed," as opposed to "buried." We have recently developed novel methods for RSA prediction using nonlinear regression techniques which provide accurate estimates of the real-valued RSA and outperform classification-based approaches with respect to commonly used two-class projections. However, while their performance seems to provide a significant improvement over previously published approaches, these Neural Network (NN) based methods are computationally expensive to train and involve several thousand parameters. In this work, we develop alternative regression models for RSA prediction which are computationally much less expensive, involve orders-of-magnitude fewer parameters, and are still competitive in terms of prediction quality. In particular, we investigate several regression models for RSA prediction using linear L1-support vector regression (SVR) approaches as well as standard linear least squares (LS) regression. Using rigorously derived validation sets of protein structures and extensive cross-validation analysis, we compare the performance of the SVR with that of LS regression and NN-based methods. In particular, we show that the flexibility of the SVR (as encoded by metaparameters such as the error insensitivity and the error penalization terms) can be very beneficial to optimize the prediction accuracy for buried residues. 
We conclude that the simple and computationally much more efficient linear SVR performs comparably to nonlinear models and thus can be used in order to facilitate further attempts to design more accurate RSA prediction methods, with applications to fold recognition and de novo protein structure prediction methods.
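
    The difference between the regression and classification views of RSA, and the role of the SVR error-insensitivity metaparameter, can be illustrated with a short sketch. The epsilon-insensitive loss is the standard SVR loss; the default values (epsilon of 0.1, exposure threshold of 0.25) and the function names are assumptions chosen for illustration and are not taken from the paper.

```python
def epsilon_insensitive_loss(y_true: float, y_pred: float,
                             epsilon: float = 0.1) -> float:
    """Standard SVR loss: errors smaller than epsilon cost nothing."""
    return max(0.0, abs(y_true - y_pred) - epsilon)


def two_class_projection(rsa: float, threshold: float = 0.25) -> str:
    """Project a real-valued RSA onto the two-class labels.

    The threshold is arbitrary, as the abstract notes; 0.25 is only an
    illustrative choice.
    """
    return "exposed" if rsa > threshold else "buried"
```

    A regression model is trained on the real-valued RSA and can afterwards be projected onto any threshold, whereas a classifier has to be retrained whenever the threshold changes.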

  19. HMMER Cut-off Threshold Tool (HMMERCTTER): Supervised classification of superfamily protein sequences with a reliable cut-off threshold.

    PubMed

    Pagnuco, Inti Anabela; Revuelta, María Victoria; Bondino, Hernán Gabriel; Brun, Marcel; Ten Have, Arjen

    2018-01-01

    Protein superfamilies can be divided into subfamilies of proteins with different functional characteristics. Their sequences can be classified hierarchically, which is part of sequence function assignation. Typically, there are no clear subfamily hallmarks that would allow pattern-based function assignation by which this task is mostly achieved based on the similarity principle. This is hampered by the lack of a score cut-off that is both sensitive and specific. HMMER Cut-off Threshold Tool (HMMERCTTER) adds a reliable cut-off threshold to the popular HMMER. Using a high quality superfamily phylogeny, it clusters a set of training sequences such that the cluster-specific HMMER profiles show cluster or subfamily member detection with 100% precision and recall (P&R), thereby generating a specific threshold as inclusion cut-off. Profiles and thresholds are then used as classifiers to screen a target dataset. Iterative inclusion of novel sequences to groups and the corresponding HMMER profiles results in high sensitivity while specificity is maintained by imposing 100% P&R self detection. In three presented case studies of protein superfamilies, classification of large datasets with 100% precision was achieved with over 95% recall. Limits and caveats are presented and explained. HMMERCTTER is a promising protein superfamily sequence classifier provided high quality training datasets are used. It provides a decision support system that aids in the difficult task of sequence function assignation in the twilight zone of sequence similarity. All relevant data and source codes are available from the Github repository at the following URL: https://github.com/BBCMdP/HMMERCTTER.
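
    The idea of a cut-off that gives 100% precision and recall on the training set can be sketched as follows. This is not HMMERCTTER code: it only shows the separation condition on profile scores, the function name is hypothetical, and returning the lowest member score as the cut-off is an assumption (the abstract does not say which value inside the score gap the tool uses).

```python
def perfect_cutoff(member_scores, nonmember_scores):
    """Return an inclusion cut-off with 100% precision and recall, or None.

    member_scores: profile scores of sequences belonging to the cluster.
    nonmember_scores: profile scores of all other training sequences.
    """
    lowest_member = min(member_scores)
    highest_nonmember = max(nonmember_scores)
    if lowest_member > highest_nonmember:
        # Every member scores above every non-member, so classifying
        # "score >= cut-off" as member makes no training error.
        return lowest_member
    # Scores overlap: no single threshold separates the cluster perfectly.
    return None
```

    A cluster whose scores overlap with those of non-members returns `None`, which corresponds to a grouping that cannot satisfy the 100% P&R self-detection requirement.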

  20. HMMER Cut-off Threshold Tool (HMMERCTTER): Supervised classification of superfamily protein sequences with a reliable cut-off threshold

    PubMed Central

    Pagnuco, Inti Anabela; Revuelta, María Victoria; Bondino, Hernán Gabriel; Brun, Marcel

    2018-01-01

    Background Protein superfamilies can be divided into subfamilies of proteins with different functional characteristics. Their sequences can be classified hierarchically, which is part of sequence function assignation. Typically, there are no clear subfamily hallmarks that would allow pattern-based function assignation by which this task is mostly achieved based on the similarity principle. This is hampered by the lack of a score cut-off that is both sensitive and specific. Results HMMER Cut-off Threshold Tool (HMMERCTTER) adds a reliable cut-off threshold to the popular HMMER. Using a high quality superfamily phylogeny, it clusters a set of training sequences such that the cluster-specific HMMER profiles show cluster or subfamily member detection with 100% precision and recall (P&R), thereby generating a specific threshold as inclusion cut-off. Profiles and thresholds are then used as classifiers to screen a target dataset. Iterative inclusion of novel sequences to groups and the corresponding HMMER profiles results in high sensitivity while specificity is maintained by imposing 100% P&R self detection. In three presented case studies of protein superfamilies, classification of large datasets with 100% precision was achieved with over 95% recall. Limits and caveats are presented and explained. Conclusions HMMERCTTER is a promising protein superfamily sequence classifier provided high quality training datasets are used. It provides a decision support system that aids in the difficult task of sequence function assignation in the twilight zone of sequence similarity. All relevant data and source codes are available from the Github repository at the following URL: https://github.com/BBCMdP/HMMERCTTER. PMID:29579071

  1. Complex lasso: new entangled motifs in proteins

    NASA Astrophysics Data System (ADS)

    Niemyska, Wanda; Dabrowski-Tumanski, Pawel; Kadlof, Michal; Haglund, Ellinor; Sułkowski, Piotr; Sulkowska, Joanna I.

    2016-11-01

    We identify new entangled motifs in proteins that we call complex lassos. Lassos arise in proteins with disulfide bridges (or in proteins with amide linkages), when termini of a protein backbone pierce through an auxiliary surface of minimal area, spanned on a covalent loop. We find that as much as 18% of all proteins with disulfide bridges in a non-redundant subset of PDB form complex lassos, and classify them into six distinct geometric classes, one of which resembles supercoiling known from DNA. Based on biological classification of proteins we find that lassos are much more common in viruses, plants and fungi than in other kingdoms of life. We also discuss how changes in the oxidation/reduction potential may affect the function of proteins with lassos. Lassos and associated surfaces of minimal area provide new and interesting geometric characteristics, with many potential applications, not only of proteins but also of other biomolecules.

  2. Evaluation of Semi-supervised Learning for Classification of Protein Crystallization Imagery.

    PubMed

    Sigdel, Madhav; Dinç, İmren; Dinç, Semih; Sigdel, Madhu S; Pusey, Marc L; Aygün, Ramazan S

    2014-03-01

    In this paper, we investigate the performance of two wrapper methods for semi-supervised learning algorithms for classification of protein crystallization images with limited labeled images. Firstly, we evaluate the performance of a semi-supervised approach using self-training with naïve Bayesian (NB) and sequential minimal optimization (SMO) as the base classifiers. The confidence values returned by these classifiers are used to select high-confidence predictions to be used for self-training. Secondly, we analyze the performance of Yet Another Two Stage Idea (YATSI) semi-supervised learning using NB, SMO, multilayer perceptron (MLP), J48 and random forest (RF) classifiers. These results are compared with the basic supervised learning using the same training sets. We perform our experiments on a dataset consisting of 2250 protein crystallization images for different proportions of training and test data. Our results indicate that NB and SMO using both self-training and YATSI semi-supervised approaches improve accuracies with respect to supervised learning. On the other hand, MLP, J48 and RF perform better using basic supervised learning. Overall, the random forest classifier yields the best accuracy with supervised learning for our dataset.

  3. Evaluating the efficacy of a structure-derived amino acid substitution matrix in detecting protein homologs by BLAST and PSI-BLAST.

    PubMed

    Goonesekere, Nalin Cw

    2009-01-01

    The large numbers of protein sequences generated by whole genome sequencing projects require rapid and accurate methods of annotation. The detection of homology through computational sequence analysis is a powerful tool in determining the complex evolutionary and functional relationships that exist between proteins. Homology search algorithms employ amino acid substitution matrices to detect similarity between protein sequences. The substitution matrices in common use today are constructed using sequences aligned without reference to protein structure. Here we present amino acid substitution matrices constructed from the alignment of a large number of protein domain structures from the structural classification of proteins (SCOP) database. We show that when incorporated into the homology search algorithms BLAST and PSI-BLAST, the structure-based substitution matrices enhance the efficacy of detecting remote homologs.

  4. APOL1 Oligomerization as the Key Mediator of Kidney Disease in African Americans

    DTIC Science & Technology

    2015-10-01

    kidney disease that accounts for the high rate of kidney disease in African Americans. This work is based on the hypothesis that APOL1 kidney disease...microscopy-based approaches. Subject terms: Kidney, ESRD, APOL1, African American...is based on the hypothesis that APOL1 kidney disease in African Americans results from abnormal aggregation of the APOL1 risk variant protein in an

  5. Development, characterization, and lethal effect of monoclonal antibodies against hemocytes in an adult female tick, Ornithodoros moubata (Acari: Argasidae).

    PubMed

    Matsuo, T; Tsukamoto, D; Inoue, N; Fujisaki, K

    2003-12-01

    In the present study, 19 monoclonal antibodies (mAbs) against adult Ornithodoros moubata hemocytes were established, and the reactivity of the hemocytes to these mAbs was examined by an indirect fluorescent antibody test (IFAT), Western blot and immunoprecipitation analyses. It was shown that the reactivities of the hemocytes to the mAbs varied among morphologically similar hemocyte types, and most mAbs produced in the present study showed multiple-band reactivity. However, the presence of shared epitopes among peptide subunits of the same protein or entirely different proteins is not common, so their reactivity could not be explained in detail. These results suggest that there are morphologically similar but functionally differentiated hemocytes. Therefore, in addition to morphological classification, the molecular-based classification of the hemocytes is also required. In order to assess the lethal effect of a blood meal containing each mAb, artificial feeding was performed. The OmHC 31 showed the strongest lethal effect on adult female O. moubata. In conclusion, anti-hemocyte mAbs produced in this study are useful not only for the immunological classification of hemocytes but also for the immunological control of the tick.

  6. Sequence-structure relationship study in all-α transmembrane proteins using an unsupervised learning approach.

    PubMed

    Esque, Jérémy; Urbain, Aurélie; Etchebest, Catherine; de Brevern, Alexandre G

    2015-11-01

    Transmembrane proteins (TMPs) are major drug targets, but the knowledge of their precise topology remains highly limited compared with globular proteins. In spite of the difficulties in obtaining their structures, an important effort has been made in recent years to increase their number from an experimental and computational point of view. In view of this emerging challenge, the development of computational methods to extract knowledge from these data is crucial for the better understanding of their functions and for improving the quality of structural models. Here, we revisit an efficient unsupervised learning procedure, called Hybrid Protein Model (HPM), which is applied to the analysis of transmembrane proteins belonging to the all-α structural class. The HPM method is an original classification procedure that efficiently combines sequence and structure learning. The procedure was initially applied to the analysis of globular proteins. In the present case, HPM classifies a set of overlapping protein fragments, extracted from a non-redundant databank of TMP 3D structures. After fine-tuning of the learning parameters, the optimal classification results in 65 clusters. They best represent similar relationships between sequence and local structure properties of TMPs. Interestingly, HPM distinguishes among the resulting clusters two helical regions with distinct hydrophobic patterns. This underlines the complexity of the topology of these proteins. The HPM classification highlights unusual relationships between amino acids in TMP fragments, which can be useful to elaborate new amino acid substitution matrices. Finally, two challenging applications are described: the first one aims at annotating protein functions (channel or not), the second one intends to assess the quality of the structures (X-ray or models) via a new scoring function deduced from the HPM classification.

  7. Finding minimum gene subsets with heuristic breadth-first search algorithm for robust tumor classification

    PubMed Central

    2012-01-01

    Background Previous studies on tumor classification based on gene expression profiles suggest that gene selection plays a key role in improving the classification performance. Moreover, finding important tumor-related genes with the highest accuracy is a very important task because these genes might serve as tumor biomarkers, which is of great benefit to not only tumor molecular diagnosis but also drug development. Results This paper proposes a novel gene selection method with rich biomedical meaning based on Heuristic Breadth-first Search Algorithm (HBSA) to find as many optimal gene subsets as possible. Due to the curse of dimensionality, this type of method could suffer from over-fitting and selection bias problems. To address these potential problems, a HBSA-based ensemble classifier is constructed using majority voting strategy from individual classifiers constructed by the selected gene subsets, and a novel HBSA-based gene ranking method is designed to find important tumor-related genes by measuring the significance of genes using their occurrence frequencies in the selected gene subsets. The experimental results on nine tumor datasets including three pairs of cross-platform datasets indicate that the proposed method can not only obtain better generalization performance but also find many important tumor-related genes. Conclusions It is found that the frequencies of the selected genes follow a power-law distribution, indicating that only a few top-ranked genes can be used as potential diagnosis biomarkers. Moreover, the top-ranked genes leading to very high prediction accuracy are closely related to specific tumor subtype and even hub genes. Compared with other related methods, the proposed method can achieve higher prediction accuracy with fewer genes. Moreover, they are further justified by analyzing the top-ranked genes in the context of individual gene function, biological pathway, and protein-protein interaction network. PMID:22830977
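
    The two aggregation steps described in the abstract, ranking genes by their occurrence frequency across the selected subsets and combining individual classifiers by majority voting, can be sketched as below. This does not reproduce the HBSA search itself; the function names and the tie-breaking rules are assumptions made for the example.

```python
from collections import Counter


def rank_genes(gene_subsets):
    """Rank genes by how many selected subsets contain them.

    gene_subsets: iterable of gene-name collections.
    Returns (gene, count) pairs, most frequent first; ties are broken
    alphabetically (an arbitrary choice for reproducibility).
    """
    # set(subset) ensures a gene is counted at most once per subset.
    counts = Counter(g for subset in gene_subsets for g in set(subset))
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


def majority_vote(predictions):
    """Return the label predicted by most individual classifiers.

    On a tie, the label that appears first in the input wins.
    """
    label, _ = Counter(predictions).most_common(1)[0]
    return label
```

    With hypothetical gene names, `rank_genes([["A", "B"], ["A", "C"], ["A", "B"]])` places gene A first because it occurs in all three subsets.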

  8. Reductive alkylation of ribosomes as a probe to the topography of ribosomal proteins*

    PubMed Central

    Moore, Graham; Crichton, Robert R.

    1974-01-01

    Escherichia coli ribosomes were treated with a number of different aldehydes of various sizes in the presence of NaBH4. After incorporation of either 3H or 14C, the ribosomal proteins were separated by two-dimensional polyacrylamide-gel electrophoresis and the extent of alkylation of the lysine residues in each protein was measured. The same pattern of alkylation was observed with the four reagents used, namely formaldehyde, acetone, benzaldehyde and 3,4,5-trimethoxybenzaldehyde. Every protein in 30S and 50S subunits was modified, although there was considerable variation in the degree of alkylation of individual proteins. A topographical classification of ribosomal proteins is presented, based on the degree of exposure of lysine residues. The data indicate that every protein of the ribosome has at least one lysine residue exposed at or near the surface of the ribonucleo-protein complex. PMID:4462744

  9. A phylogenomic approach to bacterial subspecies classification: proof of concept in Mycobacterium abscessus.

    PubMed

    Tan, Joon Liang; Khang, Tsung Fei; Ngeow, Yun Fong; Choo, Siew Woh

    2013-12-13

    Mycobacterium abscessus is a rapidly growing mycobacterium that is often associated with human infections. The taxonomy of this species has undergone several revisions and is still being debated. In this study, we sequenced the genomes of 12 M. abscessus strains and used phylogenomic analysis to perform subspecies classification. A data mining approach was used to rank and select informative genes based on the relative entropy metric for the construction of a phylogenetic tree. The resulting tree topology was similar to that generated using the concatenation of five classical housekeeping genes: rpoB, hsp65, secA, recA and sodA. Additional support for the reliability of the subspecies classification came from the analysis of erm41 and ITS gene sequences, single nucleotide polymorphisms (SNPs)-based classification and strain clustering demonstrated by a variable number tandem repeat (VNTR) assay and a multilocus sequence analysis (MLSA). We subsequently found that the concatenation of a minimal set of three median-ranked genes: DNA polymerase III subunit alpha (polC), 4-hydroxy-2-ketovalerate aldolase (Hoa) and cell division protein FtsZ (ftsZ), is sufficient to recover the same tree topology. PCR assays designed specifically for these genes showed that all three genes could be amplified in the reference strain of M. abscessus ATCC 19977T. This study provides proof of concept that whole-genome sequence-based data mining approach can provide confirmatory evidence of the phylogenetic informativeness of existing markers, as well as lead to the discovery of a more economical and informative set of markers that produces similar subspecies classification in M. abscessus. The systematic procedure used in this study to choose the informative minimal set of gene markers can potentially be applied to species or subspecies classification of other bacteria.

  10. Toward Genetics-Based Virus Taxonomy: Comparative Analysis of a Genetics-Based Classification and the Taxonomy of Picornaviruses

    PubMed Central

    Lauber, Chris

    2012-01-01

    Virus taxonomy has received little attention from the research community despite its broad relevance. In an accompanying paper (C. Lauber and A. E. Gorbalenya, J. Virol. 86:3890–3904, 2012), we have introduced a quantitative approach to hierarchically classify viruses of a family using pairwise evolutionary distances (PEDs) as a measure of genetic divergence. When applied to the six most conserved proteins of the Picornaviridae, it clustered 1,234 genome sequences in groups at three hierarchical levels (to which we refer as the “GENETIC classification”). In this study, we compare the GENETIC classification with the expert-based picornavirus taxonomy and outline differences in the underlying frameworks regarding the relation of virus groups and genetic diversity that represent, respectively, the structure and content of a classification. To facilitate the analysis, we introduce two novel diagrams. The first connects the genetic diversity of taxa to both the PED distribution and the phylogeny of picornaviruses. The second depicts a classification and the accommodated genetic diversity in a standardized manner. Generally, we found striking agreement between the two classifications on species and genus taxa. A few disagreements concern the species Human rhinovirus A and Human rhinovirus C and the genus Aphthovirus, which were split in the GENETIC classification. Furthermore, we propose a new supergenus level and universal, level-specific PED thresholds, not reached yet by many taxa. Since the species threshold is approached mostly by taxa with large sampling sizes and those infecting multiple hosts, it may represent an upper limit on divergence, beyond which homologous recombination in the six most conserved genes between two picornaviruses might not give viable progeny. PMID:22278238

  11. Automated protein structure modeling in CASP9 by I-TASSER pipeline combined with QUARK-based ab initio folding and FG-MD-based structure refinement

    PubMed Central

    Xu, Dong; Zhang, Jian; Roy, Ambrish; Zhang, Yang

    2011-01-01

    I-TASSER is an automated pipeline for protein tertiary structure prediction using multiple threading alignments and iterative structure assembly simulations. In CASP9 experiments, two new algorithms, QUARK and FG-MD, were added to the I-TASSER pipeline for improving the structural modeling accuracy. QUARK is a de novo structure prediction algorithm used for structure modeling of proteins that lack detectable template structures. For distantly homologous targets, QUARK models are found useful as a reference structure for selecting good threading alignments and guiding the I-TASSER structure assembly simulations. FG-MD is an atomic-level structural refinement program that uses structural fragments collected from the PDB structures to guide molecular dynamics simulation and improve the local structure of predicted model, including hydrogen-bonding networks, torsion angles and steric clashes. Despite considerable progress in both the template-based and template-free structure modeling, significant improvements on protein target classification, domain parsing, model selection, and ab initio folding of beta-proteins are still needed to further improve the I-TASSER pipeline. PMID:22069036

  12. PPI-IRO: a two-stage method for protein-protein interaction extraction based on interaction relation ontology.

    PubMed

    Li, Chuan-Xi; Chen, Peng; Wang, Ru-Jing; Wang, Xiu-Jie; Su, Ya-Ru; Li, Jinyan

    2014-01-01

    Mining Protein-Protein Interactions (PPIs) from the fast-growing biomedical literature resources has been proven as an effective approach for the identification of biological regulatory networks. This paper presents a novel method based on the idea of Interaction Relation Ontology (IRO), which specifies and organises words of various proteins interaction relationships. Our method is a two-stage PPI extraction method. At first, IRO is applied in a binary classifier to determine whether sentences contain a relation or not. Then, IRO is taken to guide PPI extraction by building sentence dependency parse tree. Comprehensive and quantitative evaluations and detailed analyses are used to demonstrate the significant performance of IRO on relation sentences classification and PPI extraction. Our PPI extraction method yielded a recall of around 80% and 90% and an F1 of around 54% and 66% on corpora of AIMed and BioInfer, respectively, which are superior to most existing extraction methods.

  13. Representing and comparing protein structures as paths in three-dimensional space

    PubMed Central

    Zhi, Degui; Krishna, S Sri; Cao, Haibo; Pevzner, Pavel; Godzik, Adam

    2006-01-01

    Background: Most existing formulations of protein structure comparison are based on detailed atomic level descriptions of protein structures and bypass potential insights that arise from a higher-level abstraction. Results: We propose a structure comparison approach based on a simplified representation of proteins that describes each protein's three-dimensional path by local curvature along the generalized backbone of the polypeptide. We have implemented a dynamic programming procedure that aligns curvatures of proteins by optimizing a defined sum turning angle deviation measure. Conclusion: Although our procedure does not directly optimize global structural similarity as measured by RMSD, our benchmarking results indicate that it recovers surprisingly well the structural similarity defined by structure classification databases and traditional structure alignment programs. In addition, our program can recognize similarities between structures with extensive conformation changes that are beyond the ability of traditional structure alignment programs. We demonstrate the applications of the procedure to several contexts of structure comparison. An implementation of our procedure, CURVE, is available as a public webserver. PMID:17052359
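
    The dynamic programming idea in this abstract can be illustrated with a minimal sketch. It is not the CURVE implementation: the abstract does not specify the scoring function, so the absolute-difference cost, the linear gap penalty, and the names align_curvatures and gap are all assumptions chosen for illustration. The sketch globally aligns two sequences of turning angles (in degrees) so that the summed angle deviation plus gap costs is minimal.

```python
def align_curvatures(a, b, gap=30.0):
    """Global alignment cost of two turning-angle sequences.

    Assumed scoring (not taken from the paper): aligning angle x with
    angle y costs |x - y|; leaving an angle unaligned costs `gap`.
    Returns the minimal total cost.
    """
    n, m = len(a), len(b)
    # cost[i][j] = minimal cost of aligning a[:i] with b[:j]
    cost = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i * gap
    for j in range(1, m + 1):
        cost[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i][j] = min(
                cost[i - 1][j - 1] + abs(a[i - 1] - b[j - 1]),  # align the two angles
                cost[i - 1][j] + gap,                           # skip an angle in a
                cost[i][j - 1] + gap,                           # skip an angle in b
            )
    return cost[n][m]


if __name__ == "__main__":
    # Hypothetical angle profiles; the second lacks the middle angle.
    print(align_curvatures([10.0, 20.0, 30.0], [10.0, 30.0], gap=5.0))
```

    Because the cost depends only on local angles, two chains with the same local geometry can align well even when their global superposition is poor, which is in line with the abstract's remark about structures with extensive conformational changes.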

  14. The history of the CATH structural classification of protein domains.

    PubMed

    Sillitoe, Ian; Dawson, Natalie; Thornton, Janet; Orengo, Christine

    2015-12-01

    This article presents a historical review of the protein structure classification database CATH. Together with the SCOP database, CATH remains comprehensive and reasonably up-to-date with the now more than 100,000 protein structures in the PDB. We review the expansion of the CATH and SCOP resources to capture predicted domain structures in the genome sequence data and to provide information on the likely functions of proteins mediated by their constituent domains. The establishment of comprehensive function annotation resources has also meant that domain families can be functionally annotated allowing insights into functional divergence and evolution within protein families. Copyright © 2015 The Authors. Published by Elsevier B.V. All rights reserved.

  15. Rapid identification and classification of Mycobacterium spp. using whole-cell protein barcodes with matrix assisted laser desorption ionization time of flight mass spectrometry in comparison with multigene phylogenetic analysis.

    PubMed

    Wang, Jun; Chen, Wen Feng; Li, Qing X

    2012-02-24

    The need of quick diagnostics and increasing number of bacterial species isolated necessitate development of a rapid and effective phenotypic identification method. Mass spectrometry (MS) profiling of whole cell proteins has potential to satisfy the requirements. The genus Mycobacterium contains more than 154 species that are taxonomically very close and require use of multiple genes including 16S rDNA for phylogenetic identification and classification. Six strains of five Mycobacterium species were selected as model bacteria in the present study because of their 16S rDNA similarity (98.4-99.8%) and the high similarity of the concatenated 16S rDNA, rpoB and hsp65 gene sequences (95.9-99.9%), requiring high identification resolution. The classification of the six strains by MALDI TOF MS protein barcodes was consistent with, but at much higher resolution than, that of the multi-locus sequence analysis of using 16S rDNA, rpoB and hsp65. The species were well differentiated using MALDI TOF MS and MALDI BioTyper™ software after quick preparation of whole-cell proteins. Several proteins were selected as diagnostic markers for species confirmation. An integration of MALDI TOF MS, MALDI BioTyper™ software and diagnostic protein fragments provides a robust phenotypic approach for bacterial identification and classification. Copyright © 2011 Elsevier B.V. All rights reserved.

  16. Toward a unified nomenclature for mammalian ADP-ribosyltransferases.

    PubMed

    Hottiger, Michael O; Hassa, Paul O; Lüscher, Bernhard; Schüler, Herwig; Koch-Nolte, Friedrich

    2010-04-01

    ADP-ribosylation is a post-translational modification of proteins catalyzed by ADP-ribosyltransferases. It comprises the transfer of the ADP-ribose moiety from NAD+ to specific amino acid residues on substrate proteins or to ADP-ribose itself. Currently, 22 human genes encoding proteins that possess an ADP-ribosyltransferase catalytic domain are known. Recent structural and enzymological evidence of poly(ADP-ribose)polymerase (PARP) family members demonstrate that earlier proposed names and classifications of these proteins are no longer accurate. Here we summarize these new findings and propose a new consensus nomenclature for all ADP-ribosyltransferases (ARTs) based on the catalyzed reaction and on structural features. A unified nomenclature would facilitate communication between researchers both inside and outside the ADP-ribosylation field. 2009 Elsevier Ltd. All rights reserved.

  17. The structure of the bacteriophage PRD1 spike sheds light on the evolution of viral capsid architecture.

    PubMed

    Merckel, Michael C; Huiskonen, Juha T; Bamford, Dennis H; Goldman, Adrian; Tuma, Roman

    2005-04-15

    Comparisons of bacteriophage PRD1 and adenovirus protein structures and virion architectures have been instrumental in unraveling an evolutionary relationship and have led to a proposal of a phylogeny-based virus classification. The structure of the PRD1 spike protein P5 provides further insight into the evolution of viral proteins. The crystallized P5 fragment comprises two structural domains: a globular knob and a fibrous shaft. The head folds into a ten-stranded jelly roll beta barrel, which is structurally related to the tumor necrosis factor (TNF) and the PRD1 coat protein domains. The shaft domain is a structural counterpart to the adenovirus spike shaft. The structural relationships between PRD1, TNF, and adenovirus proteins suggest that the vertex proteins may have originated from an ancestral TNF-like jelly roll coat protein via a combination of gene duplication and deletion.

  18. Efficient use of unlabeled data for protein sequence classification: a comparative study.

    PubMed

    Kuksa, Pavel; Huang, Pai-Hsi; Pavlovic, Vladimir

    2009-04-29

    Recent studies in computational primary protein sequence analysis have leveraged the power of unlabeled data. For example, predictive models based on string kernels trained on sequences known to belong to particular folds or superfamilies, the so-called labeled data set, can attain significantly improved accuracy if this data is supplemented with protein sequences that lack any class tags-the unlabeled data. In this study, we present a principled and biologically motivated computational framework that more effectively exploits the unlabeled data by only using the sequence regions that are more likely to be biologically relevant for better prediction accuracy. As overly-represented sequences in large uncurated databases may bias the estimation of computational models that rely on unlabeled data, we also propose a method to remove this bias and improve performance of the resulting classifiers. Combined with state-of-the-art string kernels, our proposed computational framework achieves very accurate semi-supervised protein remote fold and homology detection on three large unlabeled databases. It outperforms current state-of-the-art methods and exhibits significant reduction in running time. The unlabeled sequences used under the semi-supervised setting resemble the unpolished gemstones; when used as-is, they may carry unnecessary features and hence compromise the classification accuracy but once cut and polished, they improve the accuracy of the classifiers considerably.

  19. Classification of Domain Movements in Proteins Using Dynamic Contact Graphs

    PubMed Central

    Taylor, Daniel; Cawley, Gavin; Hayward, Steven

    2013-01-01

    A new method for the classification of domain movements in proteins is described and applied to 1822 pairs of structures from the Protein Data Bank that represent a domain movement in two-domain proteins. The method is based on changes in contacts between residues from the two domains in moving from one conformation to the other. We argue that there are five types of elemental contact changes and that these relate to five model domain movements called: “free”, “open-closed”, “anchored”, “sliding-twist”, and “see-saw.” A directed graph is introduced called the “Dynamic Contact Graph” which represents the contact changes in a domain movement. In many cases a graph, or part of a graph, provides a clear visual metaphor for the movement it represents and is a motif that can be easily recognised. The Dynamic Contact Graphs are often comprised of disconnected subgraphs indicating independent regions which may play different roles in the domain movement. The Dynamic Contact Graph for each domain movement is decomposed into elemental Dynamic Contact Graphs, those that represent elemental contact changes, allowing us to count the number of instances of each type of elemental contact change in the domain movement. This naturally leads to sixteen classes into which the 1822 domain movements are classified. PMID:24260562

  20. Data Mining Algorithms for Classification of Complex Biomedical Data

    ERIC Educational Resources Information Center

    Lan, Liang

    2012-01-01

    In my dissertation, I will present my research which contributes to solve the following three open problems from biomedical informatics: (1) Multi-task approaches for microarray classification; (2) Multi-label classification of gene and protein prediction from multi-source biological data; (3) Spatial scan for movement data. In microarray…

  1. PrionHome: a database of prions and other sequences relevant to prion phenomena.

    PubMed

    Harbi, Djamel; Parthiban, Marimuthu; Gendoo, Deena M A; Ehsani, Sepehr; Kumar, Manish; Schmitt-Ulms, Gerold; Sowdhamini, Ramanathan; Harrison, Paul M

    2012-01-01

    Prions are units of propagation of an altered state of a protein or proteins; prions can propagate from organism to organism, through cooption of other protein copies. Prions contain no necessary nucleic acids, and are important both as pathogenic agents and as a potential force in epigenetic phenomena. The original prions were derived from a misfolded form of the mammalian Prion Protein PrP. Infection by these prions causes neurodegenerative diseases. Other prions cause non-Mendelian inheritance in budding yeast, and sometimes act as diseases of yeast. We report the bioinformatic construction of the PrionHome, a database of >2000 prion-related sequences. The data was collated from various public and private resources and filtered for redundancy. The data was then processed according to a transparent classification system of prionogenic sequences (i.e., sequences that can make prions), prionoids (i.e., proteins that propagate like prions between individual cells), and other prion-related phenomena. There are eight PrionHome classifications for sequences. The first four classifications are derived from experimental observations: prionogenic sequences, prionoids, other prion-related phenomena, and prion interactors. The second four classifications are derived from sequence analysis: orthologs, paralogs, pseudogenes, and candidate-prionogenic sequences. Database entries list: supporting information for PrionHome classifications, prion-determinant areas (where relevant), and disordered and compositionally-biased regions. Also included are literature references for the PrionHome classifications, transcripts and genomic coordinates, and structural data (including comparative models made for the PrionHome from manually curated alignments). We provide database usage examples for both vertebrate and fungal prion contexts. Using the database data, we have performed a detailed analysis of the compositional biases in known budding-yeast prionogenic sequences, showing that the only abundant bias pattern is for asparagine bias with subsidiary serine bias. We anticipate that this database will be a useful experimental aid and reference resource. It is freely available at: http://libaio.biol.mcgill.ca/prion.

  2. PrionHome: A Database of Prions and Other Sequences Relevant to Prion Phenomena

    PubMed Central

    Harbi, Djamel; Parthiban, Marimuthu; Gendoo, Deena M. A.; Ehsani, Sepehr; Kumar, Manish; Schmitt-Ulms, Gerold; Sowdhamini, Ramanathan; Harrison, Paul M.

    2012-01-01

    Prions are units of propagation of an altered state of a protein or proteins; prions can propagate from organism to organism, through cooption of other protein copies. Prions contain no necessary nucleic acids, and are important both as pathogenic agents and as a potential force in epigenetic phenomena. The original prions were derived from a misfolded form of the mammalian Prion Protein PrP. Infection by these prions causes neurodegenerative diseases. Other prions cause non-Mendelian inheritance in budding yeast, and sometimes act as diseases of yeast. We report the bioinformatic construction of the PrionHome, a database of >2000 prion-related sequences. The data was collated from various public and private resources and filtered for redundancy. The data was then processed according to a transparent classification system of prionogenic sequences (i.e., sequences that can make prions), prionoids (i.e., proteins that propagate like prions between individual cells), and other prion-related phenomena. There are eight PrionHome classifications for sequences. The first four classifications are derived from experimental observations: prionogenic sequences, prionoids, other prion-related phenomena, and prion interactors. The second four classifications are derived from sequence analysis: orthologs, paralogs, pseudogenes, and candidate-prionogenic sequences. Database entries list: supporting information for PrionHome classifications, prion-determinant areas (where relevant), and disordered and compositionally-biased regions. Also included are literature references for the PrionHome classifications, transcripts and genomic coordinates, and structural data (including comparative models made for the PrionHome from manually curated alignments). We provide database usage examples for both vertebrate and fungal prion contexts. Using the database data, we have performed a detailed analysis of the compositional biases in known budding-yeast prionogenic sequences, showing that the only abundant bias pattern is for asparagine bias with subsidiary serine bias. We anticipate that this database will be a useful experimental aid and reference resource. It is freely available at: http://libaio.biol.mcgill.ca/prion. PMID:22363733

  3. PANTHER version 11: expanded annotation data from Gene Ontology and Reactome pathways, and data analysis tool enhancements.

    PubMed

    Mi, Huaiyu; Huang, Xiaosong; Muruganujan, Anushya; Tang, Haiming; Mills, Caitlin; Kang, Diane; Thomas, Paul D

    2017-01-04

    The PANTHER database (Protein ANalysis THrough Evolutionary Relationships, http://pantherdb.org) contains comprehensive information on the evolution and function of protein-coding genes from 104 completely sequenced genomes. PANTHER software tools allow users to classify new protein sequences, and to analyze gene lists obtained from large-scale genomics experiments. In the past year, major improvements include a large expansion of classification information available in PANTHER, as well as significant enhancements to the analysis tools. Protein subfamily functional classifications have more than doubled due to progress of the Gene Ontology Phylogenetic Annotation Project. For human genes (as well as a few other organisms), PANTHER now also supports enrichment analysis using pathway classifications from the Reactome resource. The gene list enrichment tools include a new 'hierarchical view' of results, enabling users to leverage the structure of the classifications/ontologies; the tools also allow users to upload genetic variant data directly, rather than requiring prior conversion to a gene list. The updated coding single-nucleotide polymorphisms (SNP) scoring tool uses an improved algorithm. The hidden Markov model (HMM) search tools now use HMMER3, dramatically reducing search times and improving accuracy of E-value statistics. Finally, the PANTHER Tree-Attribute Viewer has been implemented in JavaScript, with new views for exploring protein sequence evolution. © The Author(s) 2016. Published by Oxford University Press on behalf of Nucleic Acids Research.

  4. Brain amyloidosis ascertainment from cognitive, imaging, and peripheral blood protein measures

    PubMed Central

    Hwang, Kristy S.; Avila, David; Elashoff, David; Kohannim, Omid; Teng, Edmond; Sokolow, Sophie; Jack, Clifford R.; Jagust, William J.; Shaw, Leslie; Trojanowski, John Q.; Weiner, Michael W.; Thompson, Paul M.

    2015-01-01

    Background: The goal of this study was to identify a clinical biomarker signature of brain amyloidosis in the Alzheimer's Disease Neuroimaging Initiative 1 (ADNI1) mild cognitive impairment (MCI) cohort. Methods: We developed a multimodal biomarker classifier for predicting brain amyloidosis using cognitive, imaging, and peripheral blood protein ADNI1 MCI data. We used CSF β-amyloid 1–42 (Aβ42) ≤192 pg/mL as proxy measure for Pittsburgh compound B (PiB)-PET standard uptake value ratio ≥1.5. We trained our classifier in the subcohort with CSF Aβ42 but no PiB-PET data and tested its performance in the subcohort with PiB-PET but no CSF Aβ42 data. We also examined the utility of our biomarker signature for predicting disease progression from MCI to Alzheimer dementia. Results: The CSF training classifier selected Mini-Mental State Examination, Trails B, Auditory Verbal Learning Test delayed recall, education, APOE genotype, interleukin 6 receptor, clusterin, and ApoE protein, and achieved leave-one-out accuracy of 85% (area under the curve [AUC] = 0.8). The PiB testing classifier achieved an AUC of 0.72, and when classifier self-tuning was allowed, AUC = 0.74. The 36-month disease-progression classifier achieved AUC = 0.75 and accuracy = 71%. Conclusions: Automated classifiers based on cognitive and peripheral blood protein variables can identify the presence of brain amyloidosis with a modest level of accuracy. Such methods could have implications for clinical trial design and enrollment in the near future. Classification of evidence: This study provides Class II evidence that a classification algorithm based on cognitive, imaging, and peripheral blood protein measures identifies patients with brain amyloid on PiB-PET with moderate accuracy (sensitivity 68%, specificity 78%). PMID:25609767
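
    For readers less familiar with the reported metrics, the following minimal sketch shows how sensitivity, specificity, and accuracy follow from confusion-matrix counts. The counts are hypothetical and chosen only so that the sensitivity and specificity match the percentages quoted in the abstract; they are not the study's data, and the function name confusion_metrics is an assumption made for illustration.

```python
def confusion_metrics(tp, fn, tn, fp):
    """Return (sensitivity, specificity, accuracy) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)                 # fraction of true positives detected
    specificity = tn / (tn + fp)                 # fraction of true negatives detected
    accuracy = (tp + tn) / (tp + fn + tn + fp)   # fraction of all cases classified correctly
    return sensitivity, specificity, accuracy


# Hypothetical counts (not from the study): 100 amyloid-positive and
# 100 amyloid-negative subjects.
sens, spec, acc = confusion_metrics(tp=68, fn=32, tn=78, fp=22)

if __name__ == "__main__":
    print(f"sensitivity={sens:.2f} specificity={spec:.2f} accuracy={acc:.2f}")
```

    Accuracy depends on the ratio of positive to negative cases, so the accuracy computed from these balanced hypothetical counts should not be read as a figure from the study.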

  5. Bacillus Classification Based on Matrix-Assisted Laser Desorption Ionization Time-of-Flight Mass Spectrometry-Effects of Culture Conditions.

    PubMed

    Shu, Lin-Jie; Yang, Yu-Liang

    2017-11-14

    Matrix-assisted laser desorption ionization time-of-flight mass spectrometry (MALDI-TOF MS) is a reliable and rapid technique applied widely in the identification and classification of microbes. MALDI-TOF MS has been used to identify many endospore-forming Bacillus species; however, endospores affect the identification accuracy when using MALDI-TOF MS because they change the protein composition of samples. Since culture conditions directly influence endospore formation and Bacillus growth, in this study we clarified how culture conditions influence the classification of Bacillus species by using MALDI-TOF MS. We analyzed members of the Bacillus subtilis group and Bacillus cereus group using different incubation periods, temperatures and media. Incubation period was found to affect mass spectra due to endospores which were observed mixing with vegetative cells after 24 hours. Culture temperature also resulted in different mass spectra profiles depending on the temperature best suited to growth and sporulation. Conversely, the four common media for Bacillus incubation, Luria-Bertani agar, nutrient agar, plate count agar and brain-heart infusion agar did not result in any significant differences in mass spectra profiles. Profiles in the range m/z 1000-3000 were found to provide additional data to the standard ribosomal peptide/protein region m/z 3000-15000 profiles to enable easier differentiation of some highly similar species and the identification of new strains under fresh culture conditions. In summary, control of culture conditions is vital for Bacillus identification and classification by MALDI-TOF MS.

  6. Identification and classification of conopeptides using profile Hidden Markov Models.

    PubMed

    Laht, Silja; Koua, Dominique; Kaplinski, Lauris; Lisacek, Frédérique; Stöcklin, Reto; Remm, Maido

    2012-03-01

    Conopeptides are small toxins produced by predatory marine snails of the genus Conus. They are studied with increasing intensity due to their potential in neurosciences and pharmacology. The number of existing conopeptides is estimated to be 1 million, but only about 1000 have been described to date. Thanks to new high-throughput sequencing technologies the number of known conopeptides is likely to increase exponentially in the near future. There is therefore a need for a fast and accurate computational method for identification and classification of the novel conopeptides in large data sets. 62 profile Hidden Markov Models (pHMMs) were built for prediction and classification of all described conopeptide superfamilies and families, based on the different parts of the corresponding protein sequences. These models showed very high specificity in detection of new peptides. 56 out of 62 models do not give a single false positive in a test with the entire UniProtKB/Swiss-Prot protein sequence database. Our study demonstrates the usefulness of mature peptide models for automatic classification with accuracy of 96% for the mature peptide models and 100% for the pro- and signal peptide models. Our conopeptide profile HMMs can be used for finding and annotation of new conopeptides from large datasets generated by transcriptome or genome sequencing. To our knowledge this is the first time this kind of computational method has been applied to predict all known conopeptide superfamilies and some conopeptide families. Copyright © 2012 Elsevier B.V. All rights reserved.

  7. Kinase Pathway Database: An Integrated Protein-Kinase and NLP-Based Protein-Interaction Resource

    PubMed Central

    Koike, Asako; Kobayashi, Yoshiyuki; Takagi, Toshihisa

    2003-01-01

    Protein kinases play a crucial role in the regulation of cellular functions. Various kinds of information about these molecules are important for understanding signaling pathways and organism characteristics. We have developed the Kinase Pathway Database, an integrated database involving major completely sequenced eukaryotes. It contains the classification of protein kinases and their functional conservation, ortholog tables among species, protein–protein, protein–gene, and protein–compound interaction data, domain information, and structural information. It also provides an automatic pathway graphic image interface. The protein, gene, and compound interactions are automatically extracted from abstracts for all genes and proteins by natural-language processing (NLP). The method of automatic extraction uses phrase patterns and the GENA protein, gene, and compound name dictionary, which was developed by our group. With this database, pathways are easily compared among species using data with more than 47,000 protein interactions and protein kinase ortholog tables. The database is available for querying and browsing at http://kinasedb.ontology.ims.u-tokyo.ac.jp/. PMID:12799355

  8. Protein profiling in potato (Solanum tuberosum L.) leaf tissues by differential centrifugation.

    PubMed

    Lim, Sanghyun; Chisholm, Kenneth; Coffin, Robert H; Peters, Rick D; Al-Mughrabi, Khalil I; Wang-Pruski, Gefu; Pinto, Devanand M

    2012-04-06

    Foliar diseases, such as late blight, result in serious threats to potato production. As such, potato leaf tissue becomes an important substrate to study biological processes, such as plant defense responses to infection. Nonetheless, the potato leaf proteome remains poorly characterized. Here, we report protein profiling of potato leaf tissues using a modified differential centrifugation approach to separate the leaf tissues into cell wall and cytoplasmic fractions. This method helps to increase the number of identified proteins, including targeted putative cell wall proteins. The method allowed for the identification of 1484 nonredundant potato leaf proteins, of which 364 and 447 were reproducibly identified proteins in the cell wall and cytoplasmic fractions, respectively. Reproducibly identified proteins corresponded to over 70% of proteins identified in each replicate. A diverse range of proteins was identified based on their theoretical pI values, molecular masses, functional classification, and biological processes. Such a protein extraction method is effective for the establishment of a highly qualified proteome profile.

  9. HPSLPred: An Ensemble Multi-Label Classifier for Human Protein Subcellular Location Prediction with Imbalanced Source.

    PubMed

    Wan, Shixiang; Duan, Yucong; Zou, Quan

    2017-09-01

    Predicting the subcellular localization of proteins is an important and challenging problem. Traditional experimental approaches are often expensive and time-consuming. Consequently, a growing number of research efforts employ a series of machine learning approaches to predict the subcellular location of proteins. There are two main challenges among the state-of-the-art prediction methods. First, most of the existing techniques are designed to deal with multi-class rather than multi-label classification, which ignores connections between multiple labels. In reality, multiple locations of particular proteins imply that there are vital and unique biological significances that deserve special focus and cannot be ignored. Second, techniques for handling imbalanced data in multi-label classification problems are necessary, but never employed. For solving these two issues, we have developed an ensemble multi-label classifier called HPSLPred, which can be applied for multi-label classification with an imbalanced protein source. For convenience, a user-friendly webserver has been established at http://server.malab.cn/HPSLPred. © 2017 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.

  10. Classification of pseudo pairs between nucleotide bases and amino acids by analysis of nucleotide-protein complexes.

    PubMed

    Kondo, Jiro; Westhof, Eric

    2011-10-01

    Nucleotide bases are recognized by amino acid residues in a variety of DNA/RNA binding and nucleotide binding proteins. In this study, a total of 446 crystal structures of nucleotide-protein complexes are analyzed manually and pseudo pairs together with single and bifurcated hydrogen bonds observed between bases and amino acids are classified and annotated. Only 5 of the 20 usual amino acid residues, Asn, Gln, Asp, Glu and Arg, are able to orient in a coplanar fashion in order to form pseudo pairs with nucleotide bases through two hydrogen bonds. The peptide backbone can also form pseudo pairs with nucleotide bases and presents a strong bias for binding to the adenine base. The Watson-Crick side of the nucleotide bases is the major interaction edge participating in such pseudo pairs. Pseudo pairs between the Watson-Crick edge of guanine and Asp are frequently observed. The Hoogsteen edge of the purine bases is a good discriminatory element in recognition of nucleotide bases by protein side chains through the pseudo pairing: the Hoogsteen edge of adenine is recognized by various amino acids while the Hoogsteen edge of guanine is only recognized by Arg. The sugar edge is rarely recognized by either the side-chain or peptide backbone of amino acid residues.

  11. Classification of pseudo pairs between nucleotide bases and amino acids by analysis of nucleotide–protein complexes

    PubMed Central

    Kondo, Jiro; Westhof, Eric

    2011-01-01

    Nucleotide bases are recognized by amino acid residues in a variety of DNA/RNA binding and nucleotide binding proteins. In this study, a total of 446 crystal structures of nucleotide–protein complexes are analyzed manually and pseudo pairs together with single and bifurcated hydrogen bonds observed between bases and amino acids are classified and annotated. Only 5 of the 20 usual amino acid residues, Asn, Gln, Asp, Glu and Arg, are able to orient in a coplanar fashion in order to form pseudo pairs with nucleotide bases through two hydrogen bonds. The peptide backbone can also form pseudo pairs with nucleotide bases and presents a strong bias for binding to the adenine base. The Watson–Crick side of the nucleotide bases is the major interaction edge participating in such pseudo pairs. Pseudo pairs between the Watson–Crick edge of guanine and Asp are frequently observed. The Hoogsteen edge of the purine bases is a good discriminatory element in recognition of nucleotide bases by protein side chains through the pseudo pairing: the Hoogsteen edge of adenine is recognized by various amino acids while the Hoogsteen edge of guanine is only recognized by Arg. The sugar edge is rarely recognized by either the side-chain or peptide backbone of amino acid residues. PMID:21737431

  12. PANTHER: a browsable database of gene products organized by biological function, using curated protein family and subfamily classification.

    PubMed

    Thomas, Paul D; Kejariwal, Anish; Campbell, Michael J; Mi, Huaiyu; Diemer, Karen; Guo, Nan; Ladunga, Istvan; Ulitsky-Lazareva, Betty; Muruganujan, Anushya; Rabkin, Steven; Vandergriff, Jody A; Doremieux, Olivier

    2003-01-01

    The PANTHER database was designed for high-throughput analysis of protein sequences. One of the key features is a simplified ontology of protein function, which allows browsing of the database by biological functions. Biologist curators have associated the ontology terms with groups of protein sequences rather than individual sequences. Statistical models (Hidden Markov Models, or HMMs) are built from each of these groups. The advantage of this approach is that new sequences can be automatically classified as they become available. To ensure accurate functional classification, HMMs are constructed not only for families, but also for functionally distinct subfamilies. Multiple sequence alignments and phylogenetic trees, including curator-assigned information, are available for each family. The current version of the PANTHER database includes training sequences from all organisms in the GenBank non-redundant protein database, and the HMMs have been used to classify gene products across the entire genomes of human and Drosophila melanogaster. The ontology terms and protein families and subfamilies, as well as Drosophila gene classifications, can be browsed and searched for free. Due to outstanding contractual obligations, access to human gene classifications and to protein family trees and multiple sequence alignments will temporarily require a nominal registration fee. PANTHER is publicly available on the web at http://panther.celera.com.

  13. Comprehensive inventory of protein complexes in the Protein Data Bank from consistent classification of interfaces

    DOE PAGES

    Bordner, Andrew J.; Gorin, Andrey A.

    2008-05-12

    Protein-protein interactions are ubiquitous and essential for cellular processes. High-resolution X-ray crystallographic structures of protein complexes can elucidate the details of their function and provide a basis for many computational and experimental approaches. Here we demonstrate that existing annotations of protein complexes, including those provided by the Protein Data Bank (PDB) itself, contain a significant fraction of incorrect annotations. Results: We have developed a method for identifying protein complexes in the PDB X-ray structures by a four-step procedure: (1) comprehensively collecting all protein-protein interfaces; (2) clustering similar protein-protein interfaces together; (3) estimating the probability that each cluster is relevant based on a diverse set of properties; and (4) finally combining these scores for each entry in order to predict the complex structure. Unlike previous annotation methods, consistent prediction of complexes with identical or almost identical protein content is ensured. The resulting clusters of biologically relevant interfaces provide a reliable catalog of evolutionarily conserved protein-protein interactions.

  14. A protein and mRNA expression-based classification of gastric cancer.

    PubMed

    Setia, Namrata; Agoston, Agoston T; Han, Hye S; Mullen, John T; Duda, Dan G; Clark, Jeffrey W; Deshpande, Vikram; Mino-Kenudson, Mari; Srivastava, Amitabh; Lennerz, Jochen K; Hong, Theodore S; Kwak, Eunice L; Lauwers, Gregory Y

    2016-07-01

    The overall survival of gastric carcinoma patients remains poor despite improved control over known risk factors and surveillance. This highlights the need for new classifications, driven towards identification of potential therapeutic targets. Using sophisticated molecular technologies and analysis, three groups recently provided genetic and epigenetic molecular classifications of gastric cancer (The Cancer Genome Atlas, 'Singapore-Duke' study, and Asian Cancer Research Group). Suggested by these classifications, here, we examined the expression of 14 biomarkers in a cohort of 146 gastric adenocarcinomas and performed unsupervised hierarchical clustering analysis using less expensive and widely available immunohistochemistry and in situ hybridization. Ultimately, we identified five groups of gastric cancers based on Epstein-Barr virus (EBV) positivity, microsatellite instability, aberrant E-cadherin, and p53 expression; the remaining cases constituted a group characterized by normal p53 expression. In addition, the five categories correspond to the reported molecular subgroups by virtue of clinicopathologic features. Furthermore, evaluation between these clusters and survival using the Cox proportional hazards model showed a trend for superior survival in the EBV and microsatellite-instable related adenocarcinomas. In conclusion, we offer as a proposal a simplified algorithm that is able to reproduce the recently proposed molecular subgroups of gastric adenocarcinoma, using immunohistochemical and in situ hybridization techniques.

  15. Rapid on-line detection and grading of wooden breast myopathy in chicken fillets by near-infrared spectroscopy.

    PubMed

    Wold, Jens Petter; Veiseth-Kent, Eva; Høst, Vibeke; Løvland, Atle

    2017-01-01

    The main objective of this work was to develop a method for rapid and non-destructive detection and grading of wooden breast (WB) syndrome in chicken breast fillets. Near-infrared (NIR) spectroscopy was chosen as detection method, and an industrial NIR scanner was applied and tested for large scale on-line detection of the syndrome. Two approaches were evaluated for discrimination of WB fillets: 1) Linear discriminant analysis based on NIR spectra only, and 2) a regression model for protein was made based on NIR spectra and the estimated concentrations of protein were used for discrimination. A sample set of 197 fillets was used for training and calibration. A test set was recorded under industrial conditions and contained spectra from 79 fillets. The classification methods obtained 99.5-100% correct classification of the calibration set and 100% correct classification of the test set. The NIR scanner was then installed in a commercial chicken processing plant and could detect incidence rates of WB in large batches of fillets. Examples of incidence are shown for three broiler flocks where a high number of fillets (9063, 6330 and 10483) were effectively measured. Prevalence of WB of 0.1%, 6.6% and 8.5% were estimated for these flocks based on the complete sample volumes. Such an on-line system can be used to alleviate the challenges WB represents to the poultry meat industry. It enables automatic quality sorting of chicken fillets to different product categories. Manual laborious grading can be avoided. Incidences of WB from different farms and flocks can be tracked and information can be used to understand and point out main causes for WB in the chicken production. This knowledge can be used to improve the production procedures and reduce today's extensive occurrence of WB.

  16. Robust prediction of protein subcellular localization combining PCA and WSVMs.

    PubMed

    Tian, Jiang; Gu, Hong; Liu, Wenqi; Gao, Chiyang

    2011-08-01

    Automated prediction of protein subcellular localization is an important tool for genome annotation and drug discovery, and Support Vector Machines (SVMs) can effectively solve this problem in a supervised manner. However, the datasets obtained from real experiments are likely to contain outliers or noise, which can lead to poor generalization ability and classification accuracy. To explore this problem, we adopt strategies to lower the effect of outliers. First, we design a method based on Weighted SVMs (WSVMs), in which different weights are assigned to different data points so that the training algorithm learns the decision boundary according to the relative importance of the data points. Second, we analyse the influence of Principal Component Analysis (PCA) on WSVM classification and propose a hybrid classifier combining the merits of both PCA and WSVM. After performing dimension reduction operations on the datasets, the kernel-based possibilistic c-means algorithm can generate more suitable weights for the training, as PCA transforms the data into a new coordinate system with largest variances affected greatly by the outliers. Experiments on benchmark datasets show promising results, which confirm the effectiveness of the proposed method in terms of prediction accuracy. Copyright © 2011 Elsevier Ltd. All rights reserved.

  17. [Relation between location of elements in periodic table and affinity for the malignant tumor (author's transl)].

    PubMed

    Ando, A; Hisada, K; Ando, I

    1977-10-01

    The affinity of many inorganic compounds for malignant tumors was examined using rats subcutaneously transplanted with Yoshida sarcoma. The relations between the uptake rate into the malignant tumor and the in vitro binding power to protein were investigated for these compounds. In these experiments, bipositive ions and anions had no affinity for the tumor tissue, with a few exceptions. On the other hand, Hg, Au and Bi, which have strong binding power to protein, showed high uptake rates into the malignant tumor. As Hg++, Au+ and Bi+++ are soft acids according to the classification of Lewis acids, it was thought that these elements would bind strongly to soft bases (R-SH, R-S-) present in the tumor tissue. For many hard acids (according to the classification of Lewis acids), the uptake rate into the tumor was shown to be a function of the ionic potential (valency/ionic radius) of the metal ion. It is presumed that the chemical bond of these hard acids in the tumor tissue is an ionic bond to hard bases (R-COO-, R-PO3(2-), R-SO3-, R-NH2).

  18. The ASTRAL Compendium in 2004

    DOE R&D Accomplishments Database

    Chandonia, John-Marc; Hon, Gary; Walker, Nigel S.; Lo Conte, Loredana; Koehl, Patrice; Levitt, Michael; Brenner, Steven E.

    2003-09-15

    The ASTRAL compendium provides several databases and tools to aid in the analysis of protein structures, particularly through the use of their sequences. Partially derived from the SCOP database of protein structure domains, it includes sequences for each domain and other resources useful for studying these sequences and domain structures. The current release of ASTRAL contains 54,745 domains, more than three times as many as the initial release four years ago. ASTRAL has undergone major transformations in the past two years. In addition to several complete updates each year, ASTRAL is now updated on a weekly basis with preliminary classifications of domains from newly released PDB structures. These classifications are available as a stand-alone database, as well as available integrated into other ASTRAL databases such as representative subsets. To enhance the utility of ASTRAL to structural biologists, all SCOP domains are now made available as PDB-style coordinate files as well as sequences. In addition to sequences and representative subsets based on SCOP domains, sequences and subsets based on PDB chains are newly included in ASTRAL. Several search tools have been added to ASTRAL to facilitate retrieval of data by individual users and automated methods.

  19. Mechanisms, biology and inhibitors of deubiquitinating enzymes.

    PubMed

    Love, Kerry Routenberg; Catic, André; Schlieker, Christian; Ploegh, Hidde L

    2007-11-01

    The addition of ubiquitin (Ub) and ubiquitin-like (Ubl) modifiers to proteins serves to modulate function and is a key step in protein degradation, epigenetic modification and intracellular localization. Deubiquitinating enzymes and Ubl-specific proteases, the proteins responsible for the removal of Ub and Ubls, act as an additional level of control over the ubiquitin-proteasome system. Their conservation and widespread occurrence in eukaryotes, prokaryotes and viruses shows that these proteases constitute an essential class of enzymes. Here, we discuss how chemical tools, including activity-based probes and suicide inhibitors, have enabled (i) discovery of deubiquitinating enzymes, (ii) their functional profiling, crystallographic characterization and mechanistic classification and (iii) development of molecules for therapeutic purposes.

  20. Actions of plant Argonautes: predictable or unpredictable?

    PubMed

    Ma, Zeyang; Zhang, Xiuren

    2018-05-29

    Argonaute (AGO) proteins are the key effector of RNA-induced silencing complex (RISC). Land plants typically encode numerous AGO proteins, and they can be typically divided into two major functional groups based on the species of their housed small RNAs (sRNAs). One group of AGOs, guided by 24-nucleotide (nt) sRNAs, canonically function in nuclei to implement transcriptional gene silencing (TGS), whereas the other group of AGOs, guided by 21-nt sRNAs, act in the cytoplasm to fulfill posttranscriptional gene silencing (PTGS). Many new discoveries have been recently made on functions and mechanisms of AGO proteins in plants, and some of the findings change our views on the conventional classification and roles of AGO proteins. In this review, we summarize our current knowledge of AGO proteins in plants. Copyright © 2018 Elsevier Ltd. All rights reserved.

  1. Deep Learning and Its Applications in Biomedicine.

    PubMed

    Cao, Chensi; Liu, Feng; Tan, Hai; Song, Deshou; Shu, Wenjie; Li, Weizhong; Zhou, Yiming; Bo, Xiaochen; Xie, Zhi

    2018-02-01

    Advances in biological and medical technologies have been providing us explosive volumes of biological and physiological data, such as medical images, electroencephalography, genomic and protein sequences. Learning from these data facilitates the understanding of human health and disease. Developed from artificial neural networks, deep learning-based algorithms show great promise in extracting features and learning patterns from complex data. The aim of this paper is to provide an overview of deep learning techniques and some of the state-of-the-art applications in the biomedical field. We first introduce the development of artificial neural network and deep learning. We then describe two main components of deep learning, i.e., deep learning architectures and model optimization. Subsequently, some examples are demonstrated for deep learning applications, including medical image classification, genomic sequence analysis, as well as protein structure classification and prediction. Finally, we offer our perspectives for the future directions in the field of deep learning. Copyright © 2018. Production and hosting by Elsevier B.V.

  2. A nearest neighbor approach for automated transporter prediction and categorization from protein sequences.

    PubMed

    Li, Haiquan; Dai, Xinbin; Zhao, Xuechun

    2008-05-01

    Membrane transport proteins play a crucial role in the import and export of ions, small molecules or macromolecules across biological membranes. Currently, there are a limited number of published computational tools which enable the systematic discovery and categorization of transporters prior to costly experimental validation. To approach this problem, we utilized a nearest neighbor method which seamlessly integrates homologous search and topological analysis into a machine-learning framework. Our approach satisfactorily distinguished 484 transporter families in the Transporter Classification Database, a curated and representative database for transporters. A five-fold cross-validation on the database achieved a positive classification rate of 72.3% on average. Furthermore, this method successfully detected transporters in seven model and four non-model organisms, ranging from archaean to mammalian species. A preliminary literature-based validation has cross-validated 65.8% of our predictions on the 11 organisms, including 55.9% of our predictions overlapping with 83.6% of the predicted transporters in TransportDB.
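
    The nearest neighbor idea in the abstract above can be illustrated with a small sketch. It is not the authors' method, which integrates homology search and topological analysis. As a simplifying assumption, each sequence is represented here only by its amino acid composition, and the query receives the label of the closest labeled sequence by Euclidean distance. The function names and the family labels are hypothetical.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(sequence):
    """Return the 20-dimensional amino acid composition of a sequence."""
    counts = [sequence.count(aa) for aa in AMINO_ACIDS]
    total = sum(counts) or 1  # avoid division by zero for empty input
    return [c / total for c in counts]


def nearest_neighbor(query, labeled_sequences):
    """Return (label, distance) of the labeled sequence closest to the query.

    labeled_sequences is an iterable of (sequence, label) pairs.
    Distance is Euclidean distance between composition vectors
    (an assumption made for this sketch only).
    """
    query_vec = composition(query)
    best_label, best_dist = None, math.inf
    for sequence, label in labeled_sequences:
        dist = math.dist(query_vec, composition(sequence))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label, best_dist


# Hypothetical toy data: two made-up families with distinct compositions.
training = [
    ("LLLLVVVVIIII", "family_A"),
    ("KKKKDDDDEEEE", "family_B"),
]
label, distance = nearest_neighbor("LLVVIIAL", training)
```

    A distance cutoff on the returned value could be used to leave a query unassigned when no labeled sequence is close enough. That cutoff is a design choice for this sketch and is not taken from the abstract.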

  3. Classification and evolutionary analysis of the basic helix-loop-helix gene family in the green anole lizard, Anolis carolinensis.

    PubMed

    Liu, Ake; Wang, Yong; Zhang, Debao; Wang, Xuhua; Song, Huifang; Dang, Chunwang; Yao, Qin; Chen, Keping

    2013-08-01

    Basic helix-loop-helix (bHLH) proteins play essential regulatory roles in a variety of biological processes. These highly conserved proteins form a large transcription factor superfamily, and are commonly identified in large numbers within animal, plant, and fungal genomes. The bHLH domain has been well studied in many animal species, but has not yet been characterized in non-avian reptiles. In this study, we identified 102 putative bHLH genes in the genome of the green anole lizard, Anolis carolinensis. Based on phylogenetic analysis, these genes were classified into 43 families, with 43, 24, 16, 3, 10, and 3 members assigned into groups A, B, C, D, E, and F, respectively, and 3 members categorized as "orphans". Within-group evolutionary relationships inferred from the phylogenetic analysis were consistent with highly conserved patterns observed for introns and additional domains. Results from phylogenetic analysis of the H/E(spl) family suggest that genome and tandem gene duplications have contributed to this family's expansion. Our classification and evolutionary analysis has provided insights into the evolutionary diversification of animal bHLH genes, and should aid future studies on bHLH protein regulation of key growth and developmental processes.

  4. Evaluation of Semi-supervised Learning for Classification of Protein Crystallization Imagery

    PubMed Central

    Sigdel, Madhav; Dinç, İmren; Dinç, Semih; Sigdel, Madhu S.; Pusey, Marc L.; Aygün, Ramazan S.

    2015-01-01

    In this paper, we investigate the performance of two wrapper methods for semi-supervised learning algorithms for classification of protein crystallization images with limited labeled images. Firstly, we evaluate the performance of a semi-supervised approach using self-training with naïve Bayesian (NB) and sequential minimal optimization (SMO) as the base classifiers. The confidence values returned by these classifiers are used to select high-confidence predictions to be used for self-training. Secondly, we analyze the performance of Yet Another Two Stage Idea (YATSI) semi-supervised learning using NB, SMO, multilayer perceptron (MLP), J48 and random forest (RF) classifiers. These results are compared with the basic supervised learning using the same training sets. We perform our experiments on a dataset consisting of 2250 protein crystallization images for different proportions of training and test data. Our results indicate that NB and SMO using both self-training and YATSI semi-supervised approaches improve accuracies with respect to supervised learning. On the other hand, MLP, J48 and RF perform better using basic supervised learning. Overall, the random forest classifier yields the best accuracy with supervised learning for our dataset. PMID:25914518

  5. The new WHO 2016 classification of brain tumors-what neurosurgeons need to know.

    PubMed

    Banan, Rouzbeh; Hartmann, Christian

    2017-03-01

    The understanding of molecular alterations of tumors has severely changed the concept of classification in all fields of pathology. The availability of high-throughput technologies such as next-generation sequencing allows for a much more precise definition of tumor entities. Also in the field of brain tumors a dramatic increase of knowledge has occurred over the last years partially calling into question the purely morphologically based concepts that were used as exclusive defining criteria in the WHO 2007 classification. Review of the WHO 2016 classification of brain tumors as well as a search and review of publications in the literature relevant for brain tumor classification from 2007 up to now. The idea of incorporating the molecular features in classifying tumors of the central nervous system led the authors of the new WHO 2016 classification to encounter inevitable conceptual problems, particularly with respect to linking morphology to molecular alterations. As a solution they introduced the concept of a "layered diagnosis" to the classification of brain tumors that still allows at a lower level a purely morphologically based diagnosis while partially forcing the incorporation of molecular characteristics for an "integrated diagnosis" at the highest diagnostic level. In this context the broad availability of molecular assays was debated. On the one hand molecular antibodies specifically targeting mutated proteins should be available in nearly all neuropathological laboratories. On the other hand, different high-throughput assays are accessible only in few first-world neuropathological institutions. 
    As examples oligodendrogliomas are now primarily defined by molecular characteristics since the required assays are generally established, whereas molecular grouping of ependymomas, found to clearly outperform morphologically based tumor interpretation, was rejected from inclusion in the WHO 2016 classification because the required assays are currently only established in a small number of institutions. In summary, while neuropathologists have now encountered various challenges in the transitional phase from the previous WHO 2007 version to the new WHO 2016 classification of brain tumors, clinical neurooncologists now face many new diagnoses allowing a clearly improved understanding that could offer them more effective therapeutic opportunities in neurooncological treatment. The new WHO 2016 classification presumably presents the highest number of modifications since the initial WHO classification of 1979 and thereby forces all professionals in the field of neurooncology to intensively understand the new concepts. This review article aims to present the basic concepts of the new WHO 2016 brain tumor classification for neurosurgeons with a focus on neurooncology.

  6. Recognition of functional sites in protein structures.

    PubMed

    Shulman-Peleg, Alexandra; Nussinov, Ruth; Wolfson, Haim J

    2004-06-04

    Recognition of regions on the surface of one protein, that are similar to a binding site of another is crucial for the prediction of molecular interactions and for functional classifications. We first describe a novel method, SiteEngine, that assumes no sequence or fold similarities and is able to recognize proteins that have similar binding sites and may perform similar functions. We achieve high efficiency and speed by introducing a low-resolution surface representation via chemically important surface points, by hashing triangles of physico-chemical properties and by application of hierarchical scoring schemes for a thorough exploration of global and local similarities. We proceed to rigorously apply this method to functional site recognition in three possible ways: first, we search a given functional site on a large set of complete protein structures. Second, a potential functional site on a protein of interest is compared with known binding sites, to recognize similar features. Third, a complete protein structure is searched for the presence of an a priori unknown functional site, similar to known sites. Our method is robust and efficient enough to allow computationally demanding applications such as the first and the third. From the biological standpoint, the first application may identify secondary binding sites of drugs that may lead to side-effects. The third application finds new potential sites on the protein that may provide targets for drug design. Each of the three applications may aid in assigning a function and in classification of binding patterns. We highlight the advantages and disadvantages of each type of search, provide examples of large-scale searches of the entire Protein Data Base and make functional predictions.

  7. iACP-GAEnsC: Evolutionary genetic algorithm based ensemble classification of anticancer peptides by utilizing hybrid feature space.

    PubMed

    Akbar, Shahid; Hayat, Maqsood; Iqbal, Muhammad; Jan, Mian Ahmad

    2017-06-01

    Cancer is a fatal disease, responsible for one-quarter of all deaths in developed countries. Traditional anticancer therapies, such as chemotherapy and radiation, are highly expensive, susceptible to errors, and ineffective. These conventional techniques induce severe side-effects on human cells. Due to the perilous impact of cancer, the development of an accurate and highly efficient intelligent computational model is desirable for identification of anticancer peptides. In this paper, an evolutionary intelligent genetic algorithm-based ensemble model, 'iACP-GAEnsC', is proposed for the identification of anticancer peptides. In this model, the protein sequences are formulated using three different discrete feature representation methods, i.e., amphiphilic pseudo amino acid composition, g-gap dipeptide composition, and reduced amino acid alphabet composition. The performance of the extracted feature spaces is investigated separately, and the feature spaces are then merged to exhibit the significance of hybridization. In addition, the predicted results of individual classifiers are combined using an optimized genetic algorithm and a simple majority technique in order to enhance the true classification rate. It is observed that genetic algorithm-based ensemble classification outperforms individual classifiers as well as the simple majority voting-based ensemble. The performance of genetic algorithm-based ensemble classification is highest on the hybrid feature space, with an accuracy of 96.45%. In comparison to the existing techniques, the 'iACP-GAEnsC' model has achieved remarkable improvement in terms of various performance metrics. Based on the simulation results, it is observed that the 'iACP-GAEnsC' model might be a leading tool in the field of drug design and proteomics for researchers. Copyright © 2017 Elsevier B.V. All rights reserved.
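
    Two of the building blocks named in the abstract above, g-gap dipeptide composition and simple majority voting, can be sketched as follows. This is a minimal illustration and not the authors' implementation. It assumes the common definition in which a g-gap dipeptide is a pair of residues with g intervening positions, and it does not cover the genetic algorithm weighting. The function names and class labels are hypothetical.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def g_gap_dipeptide_composition(sequence, gap=1):
    """Return a 400-entry dict of normalized g-gap dipeptide frequencies.

    A g-gap dipeptide pairs residue i with residue i + gap + 1, so gap=0
    gives ordinary adjacent dipeptides. Pairs containing non-standard
    residues are skipped.
    """
    counts = {"".join(p): 0 for p in product(AMINO_ACIDS, repeat=2)}
    total = 0
    for i in range(len(sequence) - gap - 1):
        pair = sequence[i] + sequence[i + gap + 1]
        if pair in counts:
            counts[pair] += 1
            total += 1
    if total == 0:
        return {pair: 0.0 for pair in counts}
    return {pair: n / total for pair, n in counts.items()}


def majority_vote(predictions):
    """Return the most frequent label among individual classifier outputs.

    Ties are resolved in favor of the label that appears first.
    """
    return Counter(predictions).most_common(1)[0][0]


features = g_gap_dipeptide_composition("ACACAC", gap=1)
decision = majority_vote(["ACP", "non-ACP", "ACP"])
```

    With gap=1 the toy sequence "ACACAC" yields only the pairs AA and CC, each with frequency 0.5. A weighted vote, as a genetic algorithm might tune, would replace the plain count in majority_vote with per-classifier weights.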

  8. RRCRank: a fusion method using rank strategy for residue-residue contact prediction.

    PubMed

    Jing, Xiaoyang; Dong, Qiwen; Lu, Ruqian

    2017-09-02

    In structural biology, protein residue-residue contacts play a crucial role in protein structure prediction. Some researchers have found that the predicted residue-residue contacts could effectively constrain the conformational search space, which is significant for de novo protein structure prediction. In the last few decades, researchers have developed various methods to predict residue-residue contacts; in particular, fusion methods have achieved significant performance in recent years. In this work, a novel fusion method based on rank strategy has been proposed to predict contacts. Unlike the traditional regression or classification strategies, the contact prediction task is regarded as a ranking task. First, two kinds of features are extracted from correlated mutations methods and ensemble machine-learning classifiers, and then the proposed method uses the learning-to-rank algorithm to predict contact probability of each residue pair. First, we perform two benchmark tests for the proposed fusion method (RRCRank) on the CASP11 dataset and the CASP12 dataset respectively. The test results show that the RRCRank method outperforms other well-developed methods, especially for medium and short range contacts. Second, in order to verify the superiority of ranking strategy, we predict contacts by using the traditional regression and classification strategies based on the same features as ranking strategy. Compared with these two traditional strategies, the proposed ranking strategy shows better performance for three contact types, in particular for long range contacts. Third, the proposed RRCRank has been compared with several state-of-the-art methods in CASP11 and CASP12. The results show that RRCRank could achieve comparable prediction precisions and is better than three methods in most assessment metrics. The learning-to-rank algorithm is introduced to develop a novel rank-based method for the residue-residue contact prediction of proteins, which achieves state-of-the-art performance based on the extensive assessment.

  9. CRISPRCasFinder, an update of CRISRFinder, includes a portable version, enhanced performance and integrates search for Cas proteins.

    PubMed

    Couvin, David; Bernheim, Aude; Toffano-Nioche, Claire; Touchon, Marie; Michalik, Juraj; Néron, Bertrand; C Rocha, Eduardo P; Vergnaud, Gilles; Gautheret, Daniel; Pourcel, Christine

    2018-05-22

    CRISPR (clustered regularly interspaced short palindromic repeats) arrays and their associated (Cas) proteins confer bacteria and archaea adaptive immunity against exogenous mobile genetic elements, such as phages or plasmids. CRISPRCasFinder allows the identification of both CRISPR arrays and Cas proteins. The program includes: (i) an improved CRISPR array detection tool facilitating expert validation based on a rating system, (ii) prediction of CRISPR orientation and (iii) a Cas protein detection and typing tool updated to match the latest classification scheme of these systems. CRISPRCasFinder can either be used online or as a standalone tool compatible with Linux operating system. All third-party software packages employed by the program are freely available. CRISPRCasFinder is available at https://crisprcas.i2bc.paris-saclay.fr.

  10. CNN-BLPred: a Convolutional neural network based predictor for β-Lactamases (BL) and their classes.

    PubMed

    White, Clarence; Ismail, Hamid D; Saigo, Hiroto; Kc, Dukka B

    2017-12-28

    The β-Lactamase (BL) enzyme family is an important class of enzymes that plays a key role in bacterial resistance to antibiotics. As the newly identified number of BL enzymes is increasing daily, it is imperative to develop a computational tool to classify the newly identified BL enzymes into one of its classes. There are two types of classification of BL enzymes: Molecular Classification and Functional Classification. Existing computational methods only address Molecular Classification and the performance of these existing methods is unsatisfactory. We addressed the unsatisfactory performance of the existing methods by implementing a Deep Learning approach called Convolutional Neural Network (CNN). We developed CNN-BLPred, an approach for the classification of BL proteins. The CNN-BLPred uses Gradient Boosted Feature Selection (GBFS) in order to select the ideal feature set for each BL classification. Based on the rigorous benchmarking of CNN-BLPred using both leave-one-out cross-validation and independent test sets, CNN-BLPred performed better than the other existing algorithms. Compared with other architectures of CNN, Recurrent Neural Network, and Random Forest, the simple CNN architecture with only one convolutional layer performs the best. After feature extraction, we were able to remove ~95% of the 10,912 features using Gradient Boosted Trees. During 10-fold cross validation, we increased the accuracy of the classic BL predictions by 7%. We also increased the accuracy of Class A, Class B, Class C, and Class D performance by an average of 25.64%. The independent test results followed a similar trend. We implemented a deep learning algorithm known as Convolutional Neural Network (CNN) to develop a classifier for BL classification. Combined with feature selection on an exhaustive feature set and using balancing methods such as Random Oversampling (ROS), Random Undersampling (RUS) and Synthetic Minority Oversampling Technique (SMOTE), CNN-BLPred performs significantly better than existing algorithms for BL classification.
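
    The abstract above mentions Random Oversampling (ROS) as one of the balancing methods. As a rough illustration only, and not the CNN-BLPred implementation, ROS can be sketched in plain Python as below. The function name `random_oversample` and the toy sequence identifiers are hypothetical.

```python
import random
from collections import Counter, defaultdict


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen minority-class samples until every class
    has as many samples as the largest class (a generic ROS sketch)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for x, y in zip(samples, labels):
        by_class[y].append(x)
    target = max(len(items) for items in by_class.values())
    out_samples, out_labels = [], []
    for y, items in by_class.items():
        # Draw with replacement from the class to fill the gap to `target`.
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for x in items + extra:
            out_samples.append(x)
            out_labels.append(y)
    return out_samples, out_labels


# Hypothetical toy data: five "class A" entries and two "class D" entries.
X = ["a1", "a2", "a3", "a4", "a5", "d1", "d2"]
y = ["A"] * 5 + ["D"] * 2
X_bal, y_bal = random_oversample(X, y)
print(Counter(y_bal))
```

    Oversampling is normally applied to the training folds only, so that duplicated samples do not leak into the held-out data.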

  11. Evaluating Functional Annotations of Enzymes Using the Gene Ontology.

    PubMed

    Holliday, Gemma L; Davidson, Rebecca; Akiva, Eyal; Babbitt, Patricia C

    2017-01-01

    The Gene Ontology (GO) (Ashburner et al., Nat Genet 25(1):25-29, 2000) is a powerful tool in the informatics arsenal of methods for evaluating annotations in a protein dataset. Identifying the nearest well-annotated homologue of a protein of interest, predicting where misannotation has occurred, and knowing how confident you can be in the annotations assigned to those proteins are all critical. In this chapter we explore what makes an enzyme unique and how we can use GO to infer aspects of protein function based on sequence similarity. These can range from identification of misannotation or other errors in a predicted function to accurate function prediction for an enzyme of entirely unknown function. Although GO annotation applies to any gene products, we focus here on describing our approach for hierarchical classification of enzymes in the Structure-Function Linkage Database (SFLD) (Akiva et al., Nucleic Acids Res 42(Database issue):D521-530, 2014) as a guide for informed utilisation of annotation transfer based on GO terms.

  12. TMDIM: an improved algorithm for the structure prediction of transmembrane domains of bitopic dimers.

    PubMed

    Cao, Han; Ng, Marcus C K; Jusoh, Siti Azma; Tai, Hio Kuan; Siu, Shirley W I

    2017-09-01

    α-Helical transmembrane proteins are the most important drug targets in rational drug development. However, solving the experimental structures of these proteins remains difficult, therefore computational methods to accurately and efficiently predict the structures are in great demand. We present an improved structure prediction method TMDIM based on Park et al. (Proteins 57:577-585, 2004) for predicting bitopic transmembrane protein dimers. Three major algorithmic improvements are introduction of the packing type classification, the multiple-condition decoy filtering, and the cluster-based candidate selection. In a test of predicting nine known bitopic dimers, approximately 78% of our predictions achieved a successful fit (RMSD <2.0 Å) and 78% of the cases are better predicted than the two other methods compared. Our method provides an alternative for modeling TM bitopic dimers of unknown structures for further computational studies. TMDIM is freely available on the web at https://cbbio.cis.umac.mo/TMDIM. Website is implemented in PHP, MySQL and Apache, with all major browsers supported.

  13. TMDIM: an improved algorithm for the structure prediction of transmembrane domains of bitopic dimers

    NASA Astrophysics Data System (ADS)

    Cao, Han; Ng, Marcus C. K.; Jusoh, Siti Azma; Tai, Hio Kuan; Siu, Shirley W. I.

    2017-09-01

    α-Helical transmembrane proteins are the most important drug targets in rational drug development. However, solving the experimental structures of these proteins remains difficult, therefore computational methods to accurately and efficiently predict the structures are in great demand. We present an improved structure prediction method TMDIM based on Park et al. (Proteins 57:577-585, 2004) for predicting bitopic transmembrane protein dimers. Three major algorithmic improvements are introduction of the packing type classification, the multiple-condition decoy filtering, and the cluster-based candidate selection. In a test of predicting nine known bitopic dimers, approximately 78% of our predictions achieved a successful fit (RMSD <2.0 Å) and 78% of the cases are better predicted than the two other methods compared. Our method provides an alternative for modeling TM bitopic dimers of unknown structures for further computational studies. TMDIM is freely available on the web at https://cbbio.cis.umac.mo/TMDIM. Website is implemented in PHP, MySQL and Apache, with all major browsers supported.

  14. Features analysis for identification of date and party hubs in protein interaction network of Saccharomyces Cerevisiae.

    PubMed

    Mirzarezaee, Mitra; Araabi, Babak N; Sadeghi, Mehdi

    2010-12-19

    It has been understood that biological networks have modular organizations which are the sources of their observed complexity. Analysis of networks and motifs has shown that two types of hubs, party hubs and date hubs, are responsible for this complexity. Party hubs are local coordinators because of their high co-expressions with their partners, whereas date hubs display low co-expressions and are assumed as global connectors. However there is no mutual agreement on these concepts in related literature with different studies reporting their results on different data sets. We investigated whether there is a relation between the biological features of Saccharomyces Cerevisiae's proteins and their roles as non-hubs, intermediately connected, party hubs, and date hubs. We propose a classifier that separates these four classes. We extracted different biological characteristics including amino acid sequences, domain contents, repeated domains, functional categories, biological processes, cellular compartments, disordered regions, and position specific scoring matrix from various sources. Several classifiers are examined and the best feature-sets based on average correct classification rate and correlation coefficients of the results are selected. We show that fusion of five feature-sets including domains, Position Specific Scoring Matrix-400, cellular compartments level one, and composition pairs with two and one gaps provide the best discrimination with an average correct classification rate of 77%. We study a variety of known biological feature-sets of the proteins and show that there is a relation between domains, Position Specific Scoring Matrix-400, cellular compartments level one, composition pairs with two and one gaps of Saccharomyces Cerevisiae's proteins, and their roles in the protein interaction network as non-hubs, intermediately connected, party hubs and date hubs. 
This study also confirms the possibility of predicting non-hubs, party hubs and date hubs based on their biological features with acceptable accuracy. If such a hypothesis is correct for other species as well, similar methods can be applied to predict the roles of proteins in those species.
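
    One of the feature sets named above is "composition pairs with two and one gaps". A minimal sketch of such a feature follows, assuming it means the frequency of each of the 400 ordered amino acid pairs whose residues are separated by a fixed number of intervening positions. This reading, and the function name `gapped_pair_composition`, are assumptions and may differ from the authors' definition.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def gapped_pair_composition(sequence, gap):
    """Return a dict with 400 entries: the relative frequency of each ordered
    residue pair (a, b) where b lies `gap` + 1 positions after a."""
    counts = {a + b: 0 for a, b in product(AMINO_ACIDS, repeat=2)}
    step = gap + 1
    total = 0
    for i in range(len(sequence) - step):
        pair = sequence[i] + sequence[i + step]
        if pair in counts:  # skip pairs containing non-standard residues
            counts[pair] += 1
            total += 1
    return {pair: (n / total if total else 0.0) for pair, n in counts.items()}


# Hypothetical toy sequence; real inputs would be full protein sequences.
features_gap1 = gapped_pair_composition("ACACA", gap=1)
features_gap0 = gapped_pair_composition("ACACA", gap=0)
print(features_gap1["AA"], features_gap1["CC"], features_gap0["AC"])
```

    Each call yields a fixed-length vector regardless of sequence length, which is what allows such features to be fused with the other feature sets before classification.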

  15. Can-CSC-GBE: Developing Cost-sensitive Classifier with Gentleboost Ensemble for breast cancer classification using protein amino acids and imbalanced data.

    PubMed

    Ali, Safdar; Majid, Abdul; Javed, Syed Gibran; Sattar, Mohsin

    2016-06-01

    Early prediction of breast cancer is important for effective treatment and survival. We developed an effective Cost-Sensitive Classifier with GentleBoost Ensemble (Can-CSC-GBE) for the classification of breast cancer using protein amino acid features. In this work, first, discriminant information of the protein sequences related to breast tissue is extracted. Then, the physicochemical properties hydrophobicity and hydrophilicity of amino acids are employed to generate molecule descriptors in different feature spaces. For comparison, we obtained results by combining Cost-Sensitive learning with conventional ensemble of AdaBoostM1 and Bagging. The proposed Can-CSC-GBE system has effectively reduced the misclassification costs and thereby improved the overall classification performance. Our novel approach has highlighted promising results as compared to the state-of-the-art ensemble approaches. Copyright © 2016 Elsevier Ltd. All rights reserved.
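
    The idea behind cost-sensitive classification can be illustrated with a generic minimum-expected-cost decision rule. This is not the Can-CSC-GBE ensemble itself. The cost values and the function name `min_cost_label` are hypothetical.

```python
def min_cost_label(p_positive, cost_fn=5.0, cost_fp=1.0):
    """Pick the label with the lower expected misclassification cost.

    p_positive: estimated probability that the sample is positive.
    cost_fn: cost of missing a positive (false negative), assumed value.
    cost_fp: cost of a false alarm (false positive), assumed value.
    """
    expected_cost_if_negative = p_positive * cost_fn
    expected_cost_if_positive = (1.0 - p_positive) * cost_fp
    if expected_cost_if_positive < expected_cost_if_negative:
        return "positive"
    return "negative"


# With these costs the decision threshold is cost_fp / (cost_fn + cost_fp) = 1/6,
# so a sample with only 30% estimated probability is still called positive.
print(min_cost_label(0.3), min_cost_label(0.1))
```

    Raising the false-negative cost lowers the probability threshold for a positive call, which is one common way to compensate for imbalanced data.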

  16. Annotation and Classification of CRISPR-Cas Systems

    PubMed Central

    Makarova, Kira S.; Koonin, Eugene V.

    2018-01-01

    The clustered regularly interspaced short palindromic repeats (CRISPR)-Cas (CRISPR-associated proteins) is a prokaryotic adaptive immune system that is represented in most archaea and many bacteria. Among the currently known prokaryotic defense systems, the CRISPR-Cas genomic loci show unprecedented complexity and diversity. Classification of CRISPR-Cas variants that would capture their evolutionary relationships to the maximum possible extent is essential for comparative genomic and functional characterization of this theoretically and practically important system of adaptive immunity. To this end, a multipronged approach has been developed that combines phylogenetic analysis of the conserved Cas proteins with comparison of gene repertoires and arrangements in CRISPR-Cas loci. This approach led to the current classification of CRISPR-Cas systems into three distinct types and ten subtypes for each of which signature genes have been identified. Comparative genomic analysis of the CRISPR-Cas systems in new archaeal and bacterial genomes performed over the 3 years elapsed since the development of this classification makes it clear that new types and subtypes of CRISPR-Cas need to be introduced. Moreover, this classification system captures only part of the complexity of CRISPR-Cas organization and evolution, due to the intrinsic modularity and evolutionary mobility of these immunity systems, resulting in numerous recombinant variants. Moreover, most of the cas genes evolve rapidly, complicating the family assignment for many Cas proteins and the use of family profiles for the recognition of CRISPR-Cas subtype signatures. Further progress in the comparative analysis of CRISPR-Cas systems requires integration of the most sensitive sequence comparison tools, protein structure comparison, and refined approaches for comparison of gene neighborhoods. PMID:25981466

  17. Annotation and Classification of CRISPR-Cas Systems.

    PubMed

    Makarova, Kira S; Koonin, Eugene V

    2015-01-01

    The clustered regularly interspaced short palindromic repeats (CRISPR)-Cas (CRISPR-associated proteins) is a prokaryotic adaptive immune system that is represented in most archaea and many bacteria. Among the currently known prokaryotic defense systems, the CRISPR-Cas genomic loci show unprecedented complexity and diversity. Classification of CRISPR-Cas variants that would capture their evolutionary relationships to the maximum possible extent is essential for comparative genomic and functional characterization of this theoretically and practically important system of adaptive immunity. To this end, a multipronged approach has been developed that combines phylogenetic analysis of the conserved Cas proteins with comparison of gene repertoires and arrangements in CRISPR-Cas loci. This approach led to the current classification of CRISPR-Cas systems into three distinct types and ten subtypes for each of which signature genes have been identified. Comparative genomic analysis of the CRISPR-Cas systems in new archaeal and bacterial genomes performed over the 3 years elapsed since the development of this classification makes it clear that new types and subtypes of CRISPR-Cas need to be introduced. Moreover, this classification system captures only part of the complexity of CRISPR-Cas organization and evolution, due to the intrinsic modularity and evolutionary mobility of these immunity systems, resulting in numerous recombinant variants. Moreover, most of the cas genes evolve rapidly, complicating the family assignment for many Cas proteins and the use of family profiles for the recognition of CRISPR-Cas subtype signatures. Further progress in the comparative analysis of CRISPR-Cas systems requires integration of the most sensitive sequence comparison tools, protein structure comparison, and refined approaches for comparison of gene neighborhoods.

  18. Stratification of co-evolving genomic groups using ranked phylogenetic profiles

    PubMed Central

    Freilich, Shiri; Goldovsky, Leon; Gottlieb, Assaf; Blanc, Eric; Tsoka, Sophia; Ouzounis, Christos A

    2009-01-01

    Background Previous methods of detecting the taxonomic origins of arbitrary sequence collections, with a significant impact to genome analysis and in particular metagenomics, have primarily focused on compositional features of genomes. The evolutionary patterns of phylogenetic distribution of genes or proteins, represented by phylogenetic profiles, provide an alternative approach for the detection of taxonomic origins, but typically suffer from low accuracy. Herein, we present rank-BLAST, a novel approach for the assignment of protein sequences into genomic groups of the same taxonomic origin, based on the ranking order of phylogenetic profiles of target genes or proteins across the reference database. Results The rank-BLAST approach is validated by computing the phylogenetic profiles of all sequences for five distinct microbial species of varying degrees of phylogenetic proximity, against a reference database of 243 fully sequenced genomes. The approach - a combination of sequence searches, statistical estimation and clustering - analyses the degree of sequence divergence between sets of protein sequences and allows the classification of protein sequences according to the species of origin with high accuracy, allowing taxonomic classification of 64% of the proteins studied. In most cases, a main cluster is detected, representing the corresponding species. Secondary, functionally distinct and species-specific clusters exhibit different patterns of phylogenetic distribution, thus flagging gene groups of interest. Detailed analyses of such cases are provided as examples. Conclusion Our results indicate that the rank-BLAST approach can capture the taxonomic origins of sequence collections in an accurate and efficient manner. The approach can be useful both for the analysis of genome evolution and the detection of species groups in metagenomics samples. PMID:19860884

  19. Significance and implications of FDA approval of pembrolizumab for biomarker-defined disease.

    PubMed

    Boyiadzis, Michael M; Kirkwood, John M; Marshall, John L; Pritchard, Colin C; Azad, Nilofer S; Gulley, James L

    2018-05-14

    The U.S. Food and Drug Administration (FDA) recently approved pembrolizumab, an anti-programmed cell death protein 1 cancer immunotherapeutic, for use in advanced solid tumors in patients with the microsatellite-high/DNA mismatch repair-deficient biomarker. This is the first example of a tissue-agnostic FDA approval of a treatment based on a patient's tumor biomarker status, rather than on tumor histology. Here we discuss key issues and implications arising from the biomarker-based disease classification implied by this historic approval.

  20. YTPdb: a wiki database of yeast membrane transporters.

    PubMed

    Brohée, Sylvain; Barriot, Roland; Moreau, Yves; André, Bruno

    2010-10-01

    Membrane transporters constitute one of the largest functional categories of proteins in all organisms. In the yeast Saccharomyces cerevisiae, this represents about 300 proteins (approximately 5% of the proteome). We here present the Yeast Transport Protein database (YTPdb), a user-friendly collaborative resource dedicated to the precise classification and annotation of yeast transporters. YTPdb exploits an evolution of the MediaWiki web engine used for popular collaborative databases like Wikipedia, allowing every registered user to edit the data in a user-friendly manner. Proteins in YTPdb are classified on the basis of functional criteria such as subcellular location or their substrate compounds. These classifications are hierarchical, allowing queries to be performed at various levels, from highly specific (e.g. ammonium as a substrate or the vacuole as a location) to broader (e.g. cation as a substrate or inner membranes as location). Other resources accessible for each transporter via YTPdb include post-translational modifications, K(m) values, a permanently updated bibliography, and a hierarchical classification into families. The YTPdb concept can be extrapolated to other organisms and could even be applied for other functional categories of proteins. YTPdb is accessible at http://homes.esat.kuleuven.be/ytpdb/. Copyright © 2010 Elsevier B.V. All rights reserved.

  1. [Identification of the Rubella virus genotype 1H in Western Siberia].

    PubMed

    Seregin, S V; Babkin, I V; Petrova, I D; Iashina, L N; Malkova, E M; Petrov, V S

    2011-01-01

    A molecular epidemiological study of a novel strain of Rubella virus isolated during the 2004 outbreak in Western Siberia is described. A detailed phylogenetic analysis was performed based on the entire SP region, which encodes all three Rubella structural proteins (C, E2, and E1). This analysis characterizes the strain and classifies it as genotype 1H, thereby correcting the previous classification of this strain, which was based on a shorter nucleotide sequence encoding only the E1 protein. This study therefore identified a Rubella virus genotype not previously detected in Western Siberia (or indeed in the entire Russian Federation), which highlights the importance of more extensive characterization of the genetic variability of the Rubella virus, especially with regard to the potential influence of vaccination on Rubella virus mutagenesis.

  2. SELDI-TOF-MS proteomic profiling of serum, urine, and amniotic fluid in neural tube defects.

    PubMed

    Liu, Zhenjiang; Yuan, Zhengwei; Zhao, Qun

    2014-01-01

    Neural tube defects (NTDs) are common birth defects for which specific biomarkers are needed. The purpose of this pilot study is to determine whether protein profiling in NTD-mothers differs from normal controls using SELDI-TOF-MS. ProteinChip Biomarker System was used to evaluate 82 maternal serum samples, 78 urine samples and 76 amniotic fluid samples. The validity of classification tree was then challenged with a blind test set including another 20 NTD-mothers and 18 controls in serum samples, and another 19 NTD-mothers and 17 controls in urine samples, and another 20 NTD-mothers and 17 controls in amniotic fluid samples. Eight proteins detected in serum samples were up-regulated and four proteins were down-regulated in the NTD group. Four proteins detected in urine samples were up-regulated and one protein was down-regulated in the NTD group. Six proteins detected in amniotic fluid samples were up-regulated and one protein was down-regulated in the NTD group. The classification tree for serum samples separated NTDs from healthy individuals, achieving a sensitivity of 91% and a specificity of 97% in the training set, and achieving a sensitivity of 90% and a specificity of 97% and a positive predictive value of 95% in the test set. The classification tree for urine samples separated NTDs from controls, achieving a sensitivity of 95% and a specificity of 94% in the training set, and achieving a sensitivity of 89% and a specificity of 82% and a positive predictive value of 85% in the test set. The classification tree for amniotic fluid samples separated NTDs from controls, achieving a sensitivity of 93% and a specificity of 89% in the training set, and achieving a sensitivity of 90% and a specificity of 88% and a positive predictive value of 90% in the test set. These results suggest that SELDI-TOF-MS is an additional method for detecting NTD pregnancies.
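
    For readers unfamiliar with the metrics reported above, the sketch below shows how sensitivity, specificity and positive predictive value are derived from confusion-matrix counts. The counts are hypothetical and are not taken from the study.

```python
def diagnostic_metrics(tp, fn, tn, fp):
    """Return (sensitivity, specificity, positive predictive value).

    tp/fn: cases correctly/incorrectly classified by the test.
    tn/fp: controls correctly/incorrectly classified by the test.
    """
    sensitivity = tp / (tp + fn)  # fraction of true cases detected
    specificity = tn / (tn + fp)  # fraction of controls correctly cleared
    ppv = tp / (tp + fp)          # fraction of positive calls that are real
    return sensitivity, specificity, ppv


# Hypothetical counts: 20 cases (18 detected) and 18 controls (17 cleared).
sens, spec, ppv = diagnostic_metrics(tp=18, fn=2, tn=17, fp=1)
print(f"sensitivity={sens:.2f} specificity={spec:.2f} ppv={ppv:.2f}")
```

    Unlike sensitivity and specificity, the positive predictive value depends on the ratio of cases to controls in the test set, so it does not transfer directly to populations with a different prevalence.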

  3. Analysis of mass spectrometry data from the secretome of an explant model of articular cartilage exposed to pro-inflammatory and anti-inflammatory stimuli using machine learning

    PubMed Central

    2013-01-01

    Background Osteoarthritis (OA) is an inflammatory disease of synovial joints involving the loss and degeneration of articular cartilage. The gold standard for evaluating cartilage loss in OA is the measurement of joint space width on standard radiographs. However, in most cases the diagnosis is made well after the onset of the disease, when the symptoms are well established. Identification of early biomarkers of OA can facilitate earlier diagnosis, improve disease monitoring and predict responses to therapeutic interventions. Methods This study describes the bioinformatic analysis of data generated from high throughput proteomics for identification of potential biomarkers of OA. The mass spectrometry data was generated using a canine explant model of articular cartilage treated with the pro-inflammatory cytokine interleukin 1 β (IL-1β). The bioinformatics analysis involved the application of machine learning and network analysis to the proteomic mass spectrometry data. A rule based machine learning technique, BioHEL, was used to create a model that classified the samples into their relevant treatment groups by identifying those proteins that separated samples into their respective groups. The proteins identified were considered to be potential biomarkers. Protein networks were also generated; from these networks, proteins pivotal to the classification were identified. Results BioHEL correctly classified eighteen out of twenty-three samples, giving a classification accuracy of 78.3% for the dataset. The dataset included the four classes of control, IL-1β, carprofen, and IL-1β and carprofen together. This exceeded the other machine learners that were used for a comparison, on the same dataset, with the exception of another rule-based method, JRip, which performed equally well. 
The proteins that were most frequently used in rules generated by BioHEL were found to include a number of relevant proteins including matrix metalloproteinase 3, interleukin 8 and matrix gla protein. Conclusions Using this protocol, combining an in vitro model of OA with bioinformatics analysis, a number of relevant extracellular matrix proteins were identified, thereby supporting the application of these bioinformatics tools for analysis of proteomic data from in vitro models of cartilage degradation. PMID:24330474

  4. Integrative topological analysis of mass spectrometry data reveals molecular features with clinical relevance in esophageal squamous cell carcinoma

    PubMed Central

    Gao, She-Gan; Liu, Rui-Min; Zhao, Yun-Gang; Wang, Pei; Ward, Douglas G.; Wang, Guang-Chao; Guo, Xiang-Qian; Gu, Juan; Niu, Wan-Bin; Zhang, Tian; Martin, Ashley; Guo, Zhi-Peng; Feng, Xiao-Shan; Qi, Yi-Jun; Ma, Yuan-Fang

    2016-01-01

    Combining MS-based proteomic data with network and topological features of such network would identify more clinically relevant molecules and meaningfully expand the repertoire of proteins derived from MS analysis. The integrative topological indexes representing 95.96% information of seven individual topological measures of node proteins were calculated within a protein-protein interaction (PPI) network, built using 244 differentially expressed proteins (DEPs) identified by iTRAQ 2D-LC-MS/MS. Compared with DEPs, differentially expressed genes (DEGs) and comprehensive features (CFs), structurally dominant nodes (SDNs) based on integrative topological index distribution produced comparable classification performance in three different clinical settings using five independent gene expression data sets. The signature molecules of SDN-based classifier for distinction of early from late clinical TNM stages were enriched in biological traits of protein synthesis, intracellular localization and ribosome biogenesis, which suggests that ribosome biogenesis represents a promising therapeutic target for treating ESCC. In addition, ITGB1 expression selected exclusively by integrative topological measures correlated with clinical stages and prognosis, which was further validated with two independent cohorts of ESCC samples. Thus the integrative topological analysis of PPI networks proposed in this study provides an alternative approach to identify potential biomarkers and therapeutic targets from MS/MS data with functional insights in ESCC. PMID:26898710

  5. Phylogenetic Characterization of Transport Protein Superfamilies: Superiority of SuperfamilyTree Programs over Those Based on Multiple Alignments

    PubMed Central

    Chen, Jonathan S.; Reddy, Vamsee; Chen, Joshua H.; Shlykov, Maksim A.; Zheng, Wei Hao; Cho, Jaehoon; Yen, Ming Ren; Saier, Milton H.

    2012-01-01

    Transport proteins function in the translocation of ions, solutes and macromolecules across cellular and organellar membranes. These integral membrane proteins fall into >600 families as tabulated in the Transporter Classification Database (www.tcdb.org). Recent studies, some of which are reported here, define distant phylogenetic relationships between families with the creation of superfamilies. Several of these are analyzed using a novel set of programs designed to allow reliable prediction of phylogenetic trees when sequence divergence is too great to allow the use of multiple alignments. These new programs, called SuperfamilyTree1 and 2 (SFT1 and 2), allow display of protein and family relationships, respectively, based on thousands of comparative BLAST scores rather than multiple alignments. Superfamilies analyzed include: (1) Aerolysins, (2) RTX Toxins, (3) Defensins, (4) Ion Transporters, (5) Bile/Arsenite/Riboflavin Transporters, (6) Cation: Proton Antiporters, and (7) the Glucose/Fructose/Lactose superfamily within the prokaryotic phosphoenol pyruvate-dependent Phosphotransferase System. In addition to defining the phylogenetic relationships of the proteins and families within these seven superfamilies, evidence is provided showing that the SFT programs outperform programs that are based on multiple alignments whenever sequence divergence of superfamily members is extensive. The SFT programs should be applicable to virtually any superfamily of proteins or nucleic acids. PMID:22286036

  6. A simple and fast heuristic for protein structure comparison.

    PubMed

    Pelta, David A; González, Juan R; Moreno Vega, Marcos

    2008-03-25

    Protein structure comparison is a key problem in bioinformatics. There exist several methods for doing protein comparison, one of the available alternatives being the solution of the Maximum Contact Map Overlap problem (MAX-CMO). Although this problem may be solved using exact algorithms, researchers require approximate algorithms that obtain good quality solutions using fewer computational resources than the former. We propose a variable neighborhood search metaheuristic for solving MAX-CMO. We analyze this strategy in two aspects: 1) from an optimization point of view the strategy is tested on two different datasets, obtaining an error of 3.5% (over 2702 pairs) and 1.7% (over 161 pairs) with respect to optimal values, thus leading to highly accurate solutions in a simpler and less expensive way than exact algorithms; 2) in terms of protein structure classification, we conduct experiments on three datasets and show that it is feasible to detect structural similarities at SCOP's family and CATH's architecture levels using normalized overlap values. Some limitations and the role of normalization are outlined for doing classification at SCOP's fold level. We designed, implemented and tested a new tool for solving MAX-CMO, based on a well-known metaheuristic technique. The good balance between solution quality and computational effort makes it a valuable tool. Moreover, to the best of our knowledge, this is the first time the MAX-CMO measure is tested at SCOP's fold and CATH's architecture levels with encouraging results.
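
    To make the objective concrete, the sketch below scores one fixed alignment between two contact maps; MAX-CMO seeks the order-preserving alignment that maximizes this count. It does not implement the variable neighborhood search described above. The normalization shown (twice the overlap divided by the total number of contacts) is one common choice and is an assumption here, as are the function names and the toy data.

```python
def contact_overlap(contacts_a, contacts_b, alignment):
    """Count contacts (i, j) of protein A whose aligned residues are also in
    contact in protein B. `alignment` maps residue indices of A to B."""
    contacts_b_set = {frozenset(c) for c in contacts_b}
    shared = 0
    for i, j in contacts_a:
        if i in alignment and j in alignment:
            if frozenset((alignment[i], alignment[j])) in contacts_b_set:
                shared += 1
    return shared


def normalized_overlap(overlap, contacts_a, contacts_b):
    """Scale the overlap to [0, 1] by the mean number of contacts (assumed)."""
    total = len(contacts_a) + len(contacts_b)
    return 2 * overlap / total if total else 0.0


# Hypothetical toy contact maps and an order-preserving alignment.
contacts_a = [(0, 2), (1, 3), (2, 4)]
contacts_b = [(0, 2), (1, 4), (3, 5)]
alignment = {0: 0, 1: 1, 2: 2, 3: 4, 4: 5}

overlap = contact_overlap(contacts_a, contacts_b, alignment)
print(overlap, normalized_overlap(overlap, contacts_a, contacts_b))
```

    Normalizing matters when the score is used for classification, since larger proteins have more contacts and would otherwise obtain higher raw overlaps.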

  7. Efficient use of unlabeled data for protein sequence classification: a comparative study

    PubMed Central

    Kuksa, Pavel; Huang, Pai-Hsi; Pavlovic, Vladimir

    2009-01-01

    Background Recent studies in computational primary protein sequence analysis have leveraged the power of unlabeled data. For example, predictive models based on string kernels trained on sequences known to belong to particular folds or superfamilies, the so-called labeled data set, can attain significantly improved accuracy if this data is supplemented with protein sequences that lack any class tags–the unlabeled data. In this study, we present a principled and biologically motivated computational framework that more effectively exploits the unlabeled data by only using the sequence regions that are more likely to be biologically relevant for better prediction accuracy. As overly-represented sequences in large uncurated databases may bias the estimation of computational models that rely on unlabeled data, we also propose a method to remove this bias and improve performance of the resulting classifiers. Results Combined with state-of-the-art string kernels, our proposed computational framework achieves very accurate semi-supervised protein remote fold and homology detection on three large unlabeled databases. It outperforms current state-of-the-art methods and exhibits significant reduction in running time. Conclusion The unlabeled sequences used under the semi-supervised setting resemble the unpolished gemstones; when used as-is, they may carry unnecessary features and hence compromise the classification accuracy but once cut and polished, they improve the accuracy of the classifiers considerably. PMID:19426450

  8. Practical application of self-organizing maps to interrelate biodiversity and functional data in NGS-based metagenomics.

    PubMed

    Weber, Marc; Teeling, Hanno; Huang, Sixing; Waldmann, Jost; Kassabgy, Mariette; Fuchs, Bernhard M; Klindworth, Anna; Klockow, Christine; Wichels, Antje; Gerdts, Gunnar; Amann, Rudolf; Glöckner, Frank Oliver

    2011-05-01

    Next-generation sequencing (NGS) technologies have enabled the application of broad-scale sequencing in microbial biodiversity and metagenome studies. Biodiversity is usually targeted by classifying 16S ribosomal RNA genes, while metagenomic approaches target metabolic genes. However, both approaches remain isolated, as long as the taxonomic and functional information cannot be interrelated. Techniques like self-organizing maps (SOMs) have been applied to cluster metagenomes into taxon-specific bins in order to link biodiversity with functions, but have not been applied to broad-scale NGS-based metagenomics yet. Here, we provide a novel implementation, demonstrate its potential and practicability, and provide a web-based service for public usage. Evaluation with published data sets mimicking varyingly complex habitats resulted in classification specificities and sensitivities of close to 100% to above 90% from phylum to genus level for assemblies exceeding 8 kb for low and medium complexity data. When applied to five real-world metagenomes of medium complexity from direct pyrosequencing of marine subsurface waters, classifications of assemblies above 2.5 kb were in good agreement with fluorescence in situ hybridizations, indicating that biodiversity was mostly retained within the metagenomes, and confirming high classification specificities. This was validated by two protein-based classification (PBC) methods. SOMs were able to retrieve the relevant taxa down to the genus level, while surpassing PBCs in resolution. In order to make the approach accessible to a broad audience, we implemented a feature-rich web-based SOM application named TaxSOM, which is freely available at http://www.megx.net/toolbox/taxsom. TaxSOM can classify reads or assemblies exceeding 2.5 kb with high accuracy and thus assists in linking biodiversity and functions in metagenome studies, which is a precondition to study microbial ecology in a holistic fashion.

  9. Effective Feature Selection for Classification of Promoter Sequences.

    PubMed

    K, Kouser; P G, Lavanya; Rangarajan, Lalitha; K, Acharya Kshitish

    2016-01-01

    Exploring novel computational methods in making sense of biological data has not only been a necessity, but also productive. A part of this trend is the search for more efficient in silico methods/tools for analysis of promoters, which are parts of DNA sequences that are involved in regulation of expression of genes into other functional molecules. Promoter regions vary greatly in their function based on the sequence of nucleotides and the arrangement of protein-binding short-regions called motifs. In fact, the regulatory nature of the promoters seems to be largely driven by the selective presence and/or the arrangement of these motifs. Here, we explore computational classification of promoter sequences based on the pattern of motif distributions, as such classification can pave a new way of functional analysis of promoters and to discover the functionally crucial motifs. We make use of Position Specific Motif Matrix (PSMM) features for exploring the possibility of accurately classifying promoter sequences using some of the popular classification techniques. The classification results on the complete feature set are low, perhaps due to the huge number of features. We propose two ways of reducing features. Our test results show improvement in the classification output after the reduction of features. The results also show that decision trees outperform SVM (Support Vector Machine), KNN (K Nearest Neighbor) and ensemble classifier LibD3C, particularly with reduced features. The proposed feature selection methods outperform some of the popular feature transformation methods such as PCA and SVD. Also, the methods proposed are as accurate as MRMR (feature selection method) but much faster than MRMR. Such methods could be useful to categorize new promoters and explore regulatory mechanisms of gene expressions in complex eukaryotic species.

  10. Large-scale optimization-based classification models in medicine and biology.

    PubMed

    Lee, Eva K

    2007-06-01

    We present novel optimization-based classification models that are general purpose and suitable for developing predictive rules for large heterogeneous biological and medical data sets. Our predictive model simultaneously incorporates (1) the ability to classify any number of distinct groups; (2) the ability to incorporate heterogeneous types of attributes as input; (3) a high-dimensional data transformation that eliminates noise and errors in biological data; (4) the ability to incorporate constraints to limit the rate of misclassification, and a reserved-judgment region that provides a safeguard against over-training (which tends to lead to high misclassification rates from the resulting predictive rule); and (5) successive multi-stage classification capability to handle data points placed in the reserved-judgment region. To illustrate the power and flexibility of the classification model and solution engine, and its multi-group prediction capability, application of the predictive model to a broad class of biological and medical problems is described. Applications include: the differential diagnosis of the type of erythemato-squamous diseases; predicting presence/absence of heart disease; genomic analysis and prediction of aberrant CpG island methylation in human cancer; discriminant analysis of motility and morphology data in human lung carcinoma; prediction of ultrasonic cell disruption for drug delivery; identification of tumor shape and volume in treatment of sarcoma; discriminant analysis of biomarkers for prediction of early atherosclerosis; fingerprinting of native and angiogenic microvascular networks for early diagnosis of diabetes, aging, macular degeneracy and tumor metastasis; prediction of protein localization sites; and pattern recognition of satellite images in classification of soil types. In all these applications, the predictive model yields correct classification rates ranging from 80 to 100%. This provides motivation for pursuing its use as a medical diagnostic, monitoring and decision-making tool.

  11. The C. elegans rab family: identification, classification and toolkit construction.

    PubMed

    Gallegos, Maria E; Balakrishnan, Sanjeev; Chandramouli, Priya; Arora, Shaily; Azameera, Aruna; Babushekar, Anitha; Bargoma, Emilee; Bokhari, Abdulmalik; Chava, Siva Kumari; Das, Pranti; Desai, Meetali; Decena, Darlene; Saramma, Sonia Dev Devadas; Dey, Bodhidipra; Doss, Anna-Louise; Gor, Nilang; Gudiputi, Lakshmi; Guo, Chunyuan; Hande, Sonali; Jensen, Megan; Jones, Samantha; Jones, Norman; Jorgens, Danielle; Karamchedu, Padma; Kamrani, Kambiz; Kolora, Lakshmi Divya; Kristensen, Line; Kwan, Kelly; Lau, Henry; Maharaj, Pranesh; Mander, Navneet; Mangipudi, Kalyani; Menakuru, Himabindu; Mody, Vaishali; Mohanty, Sandeepa; Mukkamala, Sridevi; Mundra, Sheena A; Nagaraju, Sudharani; Narayanaswamy, Rajhalutshimi; Ndungu-Case, Catherine; Noorbakhsh, Mersedeh; Patel, Jigna; Patel, Puja; Pendem, Swetha Vandana; Ponakala, Anusha; Rath, Madhusikta; Robles, Michael C; Rokkam, Deepti; Roth, Caroline; Sasidharan, Preeti; Shah, Sapana; Tandon, Shweta; Suprai, Jagdip; Truong, Tina Quynh Nhu; Uthayaruban, Rubatharshini; Varma, Ajitha; Ved, Urvi; Wang, Zeran; Yu, Zhe

    2012-01-01

    Rab monomeric GTPases regulate specific aspects of vesicle transport in eukaryotes including coat recruitment, uncoating, fission, motility, target selection and fusion. Moreover, individual Rab proteins function at specific sites within the cell, for example the ER, golgi and early endosome. Importantly, the localization and function of individual Rab subfamily members are often conserved underscoring the significant contributions that model organisms such as Caenorhabditis elegans can make towards a better understanding of human disease caused by Rab and vesicle trafficking malfunction. With this in mind, a bioinformatics approach was first taken to identify and classify the complete C. elegans Rab family placing individual Rabs into specific subfamilies based on molecular phylogenetics. For genes that were difficult to classify by sequence similarity alone, we did a comparative analysis of intron position among specific subfamilies from yeast to humans. This two-pronged approach allowed the classification of 30 out of 31 C. elegans Rab proteins identified here including Rab31/Rab50, a likely member of the last eukaryotic common ancestor (LECA). Second, a molecular toolset was created to facilitate research on biological processes that involve Rab proteins. Specifically, we used Gateway-compatible C. elegans ORFeome clones as starting material to create 44 full-length, sequence-verified, dominant-negative (DN) and constitutive active (CA) rab open reading frames (ORFs). Development of this toolset provided independent research projects for students enrolled in a research-based molecular techniques course at California State University, East Bay (CSUEB).

  12. The C. elegans Rab Family: Identification, Classification and Toolkit Construction

    PubMed Central

    Gallegos, Maria E.; Balakrishnan, Sanjeev; Chandramouli, Priya

    2012-01-01

    Rab monomeric GTPases regulate specific aspects of vesicle transport in eukaryotes including coat recruitment, uncoating, fission, motility, target selection and fusion. Moreover, individual Rab proteins function at specific sites within the cell, for example the ER, golgi and early endosome. Importantly, the localization and function of individual Rab subfamily members are often conserved underscoring the significant contributions that model organisms such as Caenorhabditis elegans can make towards a better understanding of human disease caused by Rab and vesicle trafficking malfunction. With this in mind, a bioinformatics approach was first taken to identify and classify the complete C. elegans Rab family placing individual Rabs into specific subfamilies based on molecular phylogenetics. For genes that were difficult to classify by sequence similarity alone, we did a comparative analysis of intron position among specific subfamilies from yeast to humans. This two-pronged approach allowed the classification of 30 out of 31 C. elegans Rab proteins identified here including Rab31/Rab50, a likely member of the last eukaryotic common ancestor (LECA). Second, a molecular toolset was created to facilitate research on biological processes that involve Rab proteins. Specifically, we used Gateway-compatible C. elegans ORFeome clones as starting material to create 44 full-length, sequence-verified, dominant-negative (DN) and constitutive active (CA) rab open reading frames (ORFs). Development of this toolset provided independent research projects for students enrolled in a research-based molecular techniques course at California State University, East Bay (CSUEB). PMID:23185324

  13. Cloning and characterization of Sdga gene encoding alpha-subunit of heterotrimeric guanosine 5'-triphosphate-binding protein complex in Scoparia dulcis.

    PubMed

    Shite, Masato; Yamamura, Yoshimi; Hayashi, Toshimitsu; Kurosaki, Fumiya

    2008-11-01

    A homology-based cloning strategy yielded Sdga, a cDNA clone presumably encoding alpha-subunit of heterotrimeric guanosine 5'-triphosphate-binding protein complex, from leaf tissues of Scoparia dulcis. Phylogenetic tree analysis of G-protein alpha-subunits from various biological sources suggested that, unlike in animal cells, classification of Galpha-proteins into specific subfamilies could not be applicable to the proteins from higher plants. Restriction digests of genomic DNA of S. dulcis showed a single hybridized signal in Southern blot analysis, suggesting that Sdga is a sole gene encoding Galpha-subunit in this plant. The expression level of Sdga appeared to be maintained at almost constant level after exposure of the leaves to methyl jasmonate as analyzed by reverse-transcription polymerase chain reaction. These results suggest that Sdga plays roles in methyl jasmonate-induced responses of S. dulcis without a notable change in the transcriptional level.

  14. RepeatsDB-lite: a web server for unit annotation of tandem repeat proteins.

    PubMed

    Hirsh, Layla; Paladin, Lisanna; Piovesan, Damiano; Tosatto, Silvio C E

    2018-05-09

    RepeatsDB-lite (http://protein.bio.unipd.it/repeatsdb-lite) is a web server for the prediction of repetitive structural elements and units in tandem repeat (TR) proteins. TRs are a widespread but poorly annotated class of non-globular proteins carrying heterogeneous functions. RepeatsDB-lite extends the prediction to all TR types and strongly improves the performance both in terms of computational time and accuracy over previous methods, with precision above 95% for solenoid structures. The algorithm exploits an improved TR unit library derived from the RepeatsDB database to perform an iterative structural search and assignment. The web interface provides tools for analyzing the evolutionary relationships between units and manually refine the prediction by changing unit positions and protein classification. An all-against-all structure-based sequence similarity matrix is calculated and visualized in real-time for every user edit. Reviewed predictions can be submitted to RepeatsDB for review and inclusion.

  15. Insecticidal activity of plant lectins and potential application in crop protection.

    PubMed

    Macedo, Maria Lígia R; Oliveira, Caio F R; Oliveira, Carolina T

    2015-01-27

    Lectins constitute a complex group of proteins found in different organisms. These proteins constitute an important field for research, as their structural diversity and affinity for several carbohydrates makes them suitable for numerous biological applications. This review addresses the classification and insecticidal activities of plant lectins, providing an overview of the applicability of these proteins in crop protection. The likely target sites in insect tissues, the mode of action of these proteins, as well as the use of lectins as biotechnological tools for pest control are also described. The use of initial bioassays employing artificial diets has led to the most recent advances in this field, such as plant breeding and the construction of fusion proteins, using lectins for targeting the delivery of toxins and to potentiate expected insecticide effects. Based on the data presented, we emphasize the contribution that plant lectins may make as tools for the development of integrated insect pest control strategies.

  16. Predicting β-Turns in Protein Using Kernel Logistic Regression

    PubMed Central

    Elbashir, Murtada Khalafallah; Sheng, Yu; Wang, Jianxin; Wu, FangXiang; Li, Min

    2013-01-01

    A β-turn is a secondary protein structure type that plays a significant role in protein configuration and function. On average 25% of amino acids in protein structures are located in β-turns. It is very important to develop an accurate and efficient method for β-turns prediction. Most of the current successful β-turns prediction methods use support vector machines (SVMs) or neural networks (NNs). The kernel logistic regression (KLR) is a powerful classification technique that has been applied successfully in many classification problems. However, it is often not found in β-turns classification, mainly because it is computationally expensive. In this paper, we used KLR to obtain sparse β-turns prediction in short evolution time. Secondary structure information and position-specific scoring matrices (PSSMs) are utilized as input features. We achieved Q total of 80.7% and MCC of 50% on BT426 dataset. These results show that KLR method with the right algorithm can yield performance equivalent to or even better than NNs and SVMs in β-turns prediction. In addition, KLR yields probabilistic outcome and has a well-defined extension to multiclass case. PMID:23509793

  17. Predicting β-turns in protein using kernel logistic regression.

    PubMed

    Elbashir, Murtada Khalafallah; Sheng, Yu; Wang, Jianxin; Wu, Fangxiang; Li, Min

    2013-01-01

    A β-turn is a secondary protein structure type that plays a significant role in protein configuration and function. On average 25% of amino acids in protein structures are located in β-turns. It is very important to develop an accurate and efficient method for β-turns prediction. Most of the current successful β-turns prediction methods use support vector machines (SVMs) or neural networks (NNs). The kernel logistic regression (KLR) is a powerful classification technique that has been applied successfully in many classification problems. However, it is often not found in β-turns classification, mainly because it is computationally expensive. In this paper, we used KLR to obtain sparse β-turns prediction in short evolution time. Secondary structure information and position-specific scoring matrices (PSSMs) are utilized as input features. We achieved Q total of 80.7% and MCC of 50% on BT426 dataset. These results show that KLR method with the right algorithm can yield performance equivalent to or even better than NNs and SVMs in β-turns prediction. In addition, KLR yields probabilistic outcome and has a well-defined extension to multiclass case.
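
    The two figures quoted in the abstract above, Q total and MCC, are standard binary-classification summaries. The sketch below shows how they are commonly computed from a confusion matrix, assuming the usual definitions (Q total as the percentage of correctly predicted residues, MCC as the Matthews correlation coefficient). The function names and the example counts are illustrative and are not taken from the paper.

```python
import math


def q_total(tp, tn, fp, fn):
    """Percentage of residues predicted correctly (turn and non-turn)."""
    return 100.0 * (tp + tn) / (tp + tn + fp + fn)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient, in the range [-1, 1]."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0 when any marginal count is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical counts, chosen only to make the arithmetic easy to follow.
print(q_total(tp=40, tn=40, fp=10, fn=10))
print(mcc(tp=40, tn=40, fp=10, fn=10))
```

    MCC is often reported alongside Q total because β-turn residues are a minority class, so a predictor can reach a high Q total while still performing poorly on the turns themselves.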

  18. Fast protein tertiary structure retrieval based on global surface shape similarity.

    PubMed

    Sael, Lee; Li, Bin; La, David; Fang, Yi; Ramani, Karthik; Rustamov, Raif; Kihara, Daisuke

    2008-09-01

    Characterization and identification of similar tertiary structure of proteins provides rich information for investigating function and evolution. The importance of structure similarity searches is increasing as structure databases continue to expand, partly due to the structural genomics projects. A crucial drawback of conventional protein structure comparison methods, which compare structures by their main-chain orientation or the spatial arrangement of secondary structure, is that a database search is too slow to be done in real-time. Here we introduce a global surface shape representation by three-dimensional (3D) Zernike descriptors, which represent a protein structure compactly as a series expansion of 3D functions. With this simplified representation, the search speed against a few thousand structures takes less than a minute. To investigate the agreement between surface representation defined by 3D Zernike descriptor and conventional main-chain based representation, a benchmark was performed against a protein classification generated by the combinatorial extension algorithm. Despite the different representation, 3D Zernike descriptor retrieved proteins of the same conformation defined by combinatorial extension in 89.6% of the cases within the top five closest structures. The real-time protein structure search by 3D Zernike descriptor will open up new possibility of large-scale global and local protein surface shape comparison. 2008 Wiley-Liss, Inc.
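
    Once each structure is reduced to a fixed-length descriptor vector, retrieval becomes a nearest-neighbour search over those vectors. The sketch below illustrates that idea only; it assumes Euclidean distance and uses made-up three-component vectors and structure names, and it does not compute 3D Zernike descriptors or reproduce the authors' distance measure.

```python
import math


def top_k(query, database, k=5):
    """Return the names of the k descriptors closest to the query vector."""
    # Pair each entry with its distance to the query, then sort ascending.
    scored = sorted((math.dist(query, vec), name) for name, vec in database.items())
    return [name for _, name in scored[:k]]


# Hypothetical descriptor vectors keyed by hypothetical structure names.
database = {
    "structA": [0.0, 0.0, 0.0],
    "structB": [1.0, 1.0, 1.0],
    "structC": [5.0, 5.0, 5.0],
}
print(top_k([0.9, 0.9, 0.9], database, k=2))
```

    Because each comparison is a single vector distance rather than a structural alignment, a linear scan over a few thousand entries is cheap, which is consistent with the search times the abstract reports.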

  19. Rebelling for a Reason: Protein Structural “Outliers”

    PubMed Central

    Arumugam, Gandhimathi; Nair, Anu G.; Hariharaputran, Sridhar; Ramanathan, Sowdhamini

    2013-01-01

    Analysis of structural variation in domain superfamilies can reveal constraints in protein evolution which aids protein structure prediction and classification. Structure-based sequence alignment of distantly related proteins, organized in PASS2 database, provides clues about structurally conserved regions among different functional families. Some superfamily members show large structural differences which are functionally relevant. This paper analyses the impact of structural divergence on function for multi-member superfamilies, selected from the PASS2 superfamily alignment database. Functional annotations within superfamilies, with structural outliers or ‘rebels’, are discussed in the context of structural variations. Overall, these data reinforce the idea that functional similarities cannot be extrapolated from mere structural conservation. The implication for fold-function prediction is that the functional annotations can only be inherited with very careful consideration, especially at low sequence identities. PMID:24073209

  20. Exceptions to the rule: case studies in the prediction of pathogenicity for genetic variants in hereditary cancer genes.

    PubMed

    Rosenthal, E T; Bowles, K R; Pruss, D; van Kan, A; Vail, P J; McElroy, H; Wenstrup, R J

    2015-12-01

    Based on current consensus guidelines and standard practice, many genetic variants detected in clinical testing are classified as disease causing based on their predicted impact on the normal expression or function of the gene in the absence of additional data. However, our laboratory has identified a subset of such variants in hereditary cancer genes for which compelling contradictory evidence emerged after the initial evaluation following the first observation of the variant. Three representative examples of variants in BRCA1, BRCA2 and MSH2 that are predicted to disrupt splicing, prematurely truncate the protein, or remove the start codon were evaluated for pathogenicity by analyzing clinical data with multiple classification algorithms. Available clinical data for all three variants contradicts the expected pathogenic classification. These variants illustrate potential pitfalls associated with standard approaches to variant classification as well as the challenges associated with monitoring data, updating classifications, and reporting potentially contradictory interpretations to the clinicians responsible for translating test outcomes to appropriate clinical action. It is important to address these challenges now as the model for clinical testing moves toward the use of large multi-gene panels and whole exome/genome analysis, which will dramatically increase the number of genetic variants identified. © 2015 The Authors. Clinical Genetics published by John Wiley & Sons A/S. Published by John Wiley & Sons Ltd.

  1. Molecular classification of gastric adenocarcinoma: translating new insights from the cancer genome atlas research network.

    PubMed

    Sunakawa, Yu; Lenz, Heinz-Josef

    2015-04-01

    Gastric cancer is a heterogenous cancer, which may be classified into several distinct subtypes based on pathology and epidemiology, each with different initiating pathological processes and each possibly having different tumor biology. A classification of gastric cancer should be important to select patients who can benefit from the targeted therapies or to precisely predict prognosis. The Cancer Genome Atlas (TCGA) study collaborated with previous reports regarding subtyping gastric cancer but also proposed a refined classification based on molecular characteristics. The addition of the new molecular classification strategy to a current classical subtyping may be a promising option, particularly stratification by Epstein-Barr virus (EBV) and microsatellite instability (MSI) statuses. According to TCGA study, EBV gastric cancer patients may benefit the programmed cell death protein 1 (PD-1)/programmed death-ligand 1 (PD-L1) antibodies or phosphoinositide 3-kinase (PI3K) inhibitors which are now being developed. The discoveries of predictive biomarkers should improve patient care and individualized medicine in the management since the targeted therapies may have the potential to change the landscape of gastric cancer treatment, moreover leading to both better understanding of the heterogeneity and better outcomes. Patient enrichment by predictive biomarkers for new treatment strategies will be critical to improve clinical outcomes. Additionally, liquid biopsies will be able to enable us to monitor in real-time molecular escape mechanism, resulting in better treatment strategies.

  2. Predicting Protein Relationships to Human Pathways through a Relational Learning Approach Based on Simple Sequence Features.

    PubMed

    García-Jiménez, Beatriz; Pons, Tirso; Sanchis, Araceli; Valencia, Alfonso

    2014-01-01

    Biological pathways are important elements of systems biology and in the past decade, an increasing number of pathway databases have been set up to document the growing understanding of complex cellular processes. Although more genome-sequence data are becoming available, a large fraction of it remains functionally uncharacterized. Thus, it is important to be able to predict the mapping of poorly annotated proteins to original pathway models. We have developed a Relational Learning-based Extension (RLE) system to investigate pathway membership through a function prediction approach that mainly relies on combinations of simple properties attributed to each protein. RLE searches for proteins with molecular similarities to specific pathway components. Using RLE, we associated 383 uncharacterized proteins to 28 pre-defined human Reactome pathways, demonstrating relative confidence after proper evaluation. Indeed, in specific cases manual inspection of the database annotations and the related literature supported the proposed classifications. Examples of possible additional components of the Electron transport system, Telomere maintenance and Integrin cell surface interactions pathways are discussed in detail. All the human predicted proteins in the 2009 and 2012 releases 30 and 40 of Reactome are available at http://rle.bioinfo.cnio.es.

  3. Application of Neural Networks for classification of Patau, Edwards, Down, Turner and Klinefelter Syndrome based on first trimester maternal serum screening data, ultrasonographic findings and patient demographics.

    PubMed

    Catic, Aida; Gurbeta, Lejla; Kurtovic-Kozaric, Amina; Mehmedbasic, Senad; Badnjevic, Almir

    2018-02-13

    The usage of Artificial Neural Networks (ANNs) for genome-enabled classifications and establishing genome-phenotype correlations has been investigated more extensively over the past few years. The reason for this is that ANNs are good approximators of complex functions, so classification can be performed without the need for an explicitly defined input-output model. This engineering tool can be applied for optimization of existing methods for disease/syndrome classification. Cytogenetic and molecular analyses are the most frequent tests used in prenatal diagnostics for the early detection of Turner, Klinefelter, Patau, Edwards and Down syndrome. These procedures can be lengthy and repetitive, and often employ invasive techniques, so a robust automated method for classifying and reporting prenatal diagnostics would greatly help clinicians with their routine work. The database consisted of data collected from 2500 pregnant women who came to the Institute of Gynecology, Infertility and Perinatology "Mehmedbasic" for routine antenatal care between January 2000 and December 2016. During the first trimester all women were subject to a screening test in which values of maternal serum pregnancy-associated plasma protein A (PAPP-A) and free beta human chorionic gonadotropin (β-hCG) were measured. Also, fetal nuchal translucency thickness and the presence or absence of the nasal bone were observed using ultrasound. The architectures of linear feedforward and feedback neural networks were investigated for various training data distributions and numbers of neurons in the hidden layer. The feedback neural network architecture outperformed the feedforward neural network architecture in predictive ability for all five aneuploidy prenatal syndrome classes. The feedforward neural network with 15 neurons in the hidden layer achieved classification sensitivity of 92.00%. Classification sensitivity of the feedback (Elman's) neural network was 99.00%. Average accuracy of the feedforward neural network was 89.6% and for the feedback network was 98.8%. The results presented in this paper prove that an expert diagnostic system based on neural networks can be efficiently used for classification of the five aneuploidy syndromes covered by this study, based on first trimester maternal serum screening data, ultrasonographic findings and patient demographics. The developed Expert System proved to be simple, robust, and powerful in properly classifying prenatal aneuploidy syndromes.

  4. DIRProt: a computational approach for discriminating insecticide resistant proteins from non-resistant proteins.

    PubMed

    Meher, Prabina Kumar; Sahu, Tanmaya Kumar; Banchariya, Anjali; Rao, Atmakuri Ramakrishna

    2017-03-24

    Insecticide resistance is a major challenge for the control program of insect pests in the fields of crop protection, human and animal health etc. Resistance to different insecticides is conferred by the proteins encoded from certain class of genes of the insects. To distinguish the insecticide resistant proteins from non-resistant proteins, no computational tool is available to date. Thus, development of such a computational tool will be helpful in predicting the insecticide resistant proteins, which can be targeted for developing appropriate insecticides. Five different feature sets, viz., amino acid composition (AAC), di-peptide composition (DPC), pseudo amino acid composition (PAAC), composition-transition-distribution (CTD) and auto-correlation function (ACF), were used to map the protein sequences into numeric feature vectors. The encoded numeric vectors were then used as input in support vector machine (SVM) for classification of insecticide resistant and non-resistant proteins. Higher accuracies were obtained under RBF kernel than that of other kernels. Further, accuracies were observed to be higher for DPC feature set as compared to others. The proposed approach achieved an overall accuracy of >90% in discriminating resistant from non-resistant proteins. Further, the two classes of resistant proteins i.e., detoxification-based and target-based were discriminated from non-resistant proteins with >95% accuracy. Besides, >95% accuracy was also observed for discrimination of proteins involved in detoxification- and target-based resistance mechanisms. The proposed approach not only outperformed Blastp, PSI-Blast and Delta-Blast algorithms, but also achieved >92% accuracy while assessed using an independent dataset of 75 insecticide resistant proteins. This paper presents the first computational approach for discriminating the insecticide resistant proteins from non-resistant proteins. Based on the proposed approach, an online prediction server DIRProt has also been developed for computational prediction of insecticide resistant proteins, which is accessible at http://cabgrid.res.in:8080/dirprot/. The proposed approach is believed to supplement the efforts needed to develop dynamic insecticides in wet-lab by targeting the insecticide resistant proteins.
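
    Two of the feature sets named in the abstract above, AAC and DPC, are simple frequency encodings that turn a sequence of any length into a fixed-length numeric vector. The sketch below shows the commonly used definitions (20 residue fractions and 400 ordered residue-pair fractions). The function names and the toy sequence are illustrative assumptions; this is not the DIRProt implementation, and PAAC, CTD and ACF are not covered.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one-letter codes


def aac(seq):
    """Amino acid composition: fraction of each standard residue (20 values)."""
    seq = seq.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    # Non-standard symbols (e.g. X) count toward the length but get no feature.
    return [seq.count(a) / len(seq) for a in AMINO_ACIDS]


def dpc(seq):
    """Di-peptide composition: fraction of each ordered residue pair (400 values)."""
    seq = seq.upper()
    if len(seq) < 2:
        raise ValueError("sequence needs at least two residues")
    counts = {}
    for i in range(len(seq) - 1):  # overlapping windows of width 2
        pair = seq[i:i + 2]
        counts[pair] = counts.get(pair, 0) + 1
    total = len(seq) - 1
    return [counts.get(a + b, 0) / total for a, b in product(AMINO_ACIDS, repeat=2)]


toy = "ACCA"  # hypothetical toy sequence, not a real protein
print(len(aac(toy)), len(dpc(toy)))
```

    Vectors of this kind can be passed to any standard classifier; the abstract reports using an SVM with an RBF kernel.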

  5. Ceramic nanocarriers: versatile nanosystem for protein and peptide delivery.

    PubMed

    Singh, Deependra; Dubey, Pooja; Pradhan, Madhulika; Singh, Manju Rawat

    2013-02-01

    Proteins and peptides have been established to be the potential drug candidate for various human diseases. But, delivery of these therapeutic protein and peptides is still a challenge due to their several unfavorable properties. Nanotechnology is expanding as a promising tool for the efficient delivery of proteins and peptides. Among numerous nano-based carriers, ceramic nanoparticles have proven themselves as a unique carrier for protein and peptide delivery as they provide a more stable, bioavailable, readily manufacturable, and acceptable proteins and polypeptide formulation. This article provides an overview of the various aspects of ceramic nanoparticles including their classification, methods of preparation, latest advances, and applications as protein and peptide delivery carriers. Ceramic nanocarriers seem to have potential for preserving structural integrity of proteins and peptides, thereby promoting a better therapeutic effect. This approach thus provides pharmaceutical scientists with a new hope for the delivery of proteins and peptides. Still, considerable study on ceramic nanocarrier is necessary with respect to pharmacokinetics, toxicology, and animal studies to confirm their efficiency as well as safety and to establish their clinical usefulness and scale-up to industrial level.

  6. A 16-Gene Signature Distinguishes Anaplastic Astrocytoma from Glioblastoma

    PubMed Central

    Rao, Soumya Alige Mahabala; Srinivasan, Sujaya; Patric, Irene Rosita Pia; Hegde, Alangar Sathyaranjandas; Chandramouli, Bangalore Ashwathnarayanara; Arimappamagan, Arivazhagan; Santosh, Vani; Kondaiah, Paturu; Rao, Manchanahalli R. Sathyanarayana; Somasundaram, Kumaravel

    2014-01-01

    Anaplastic astrocytoma (AA; Grade III) and glioblastoma (GBM; Grade IV) are diffusely infiltrating tumors and are called malignant astrocytomas. The treatment regimen and prognosis are distinctly different between anaplastic astrocytoma and glioblastoma patients. Although histopathology based current grading system is well accepted and largely reproducible, intratumoral histologic variations often lead to difficulties in classification of malignant astrocytoma samples. In order to obtain a more robust molecular classifier, we analysed RT-qPCR expression data of 175 differentially regulated genes across astrocytoma using Prediction Analysis of Microarrays (PAM) and found the most discriminatory 16-gene expression signature for the classification of anaplastic astrocytoma and glioblastoma. The 16-gene signature obtained in the training set was validated in the test set with diagnostic accuracy of 89%. Additionally, validation of the 16-gene signature in multiple independent cohorts revealed that the signature predicted anaplastic astrocytoma and glioblastoma samples with accuracy rates of 99%, 88%, and 92% in TCGA, GSE1993 and GSE4422 datasets, respectively. The protein-protein interaction network and pathway analysis suggested that the 16-genes of the signature identified epithelial-mesenchymal transition (EMT) pathway as the most differentially regulated pathway in glioblastoma compared to anaplastic astrocytoma. In addition to identifying 16 gene classification signature, we also demonstrated that genes involved in epithelial-mesenchymal transition may play an important role in distinguishing glioblastoma from anaplastic astrocytoma. PMID:24475040

  7. Ubiquitin C-terminal electrophiles are activity-based probes for identification and mechanistic study of ubiquitin conjugating machinery.

    PubMed

    Love, Kerry Routenberg; Pandya, Renuka K; Spooner, Eric; Ploegh, Hidde L

    2009-04-17

    Protein modification by ubiquitin (Ub) and ubiquitin-like modifiers (Ubl) requires the action of activating (E1), conjugating (E2), and ligating (E3) enzymes and is a key step in the specific destruction of proteins. Deubiquitinating enzymes (DUBs) deconjugate substrates modified with Ub/Ubl's and recycle Ub inside the cell. Genome mining based on sequence homology to proteins with known function has assigned many enzymes to this pathway without confirmation of either conjugating or DUB activity. Function-dependent methodologies are still the most useful for rapid identification or assessment of biological activity of expressed proteins from cells. Activity-based protein profiling uses chemical probes that are active-site-directed for the classification of protein activities in complex mixtures. Here we show that the design and use of an expanded set of Ub-based electrophilic probes allowed us to recover and identify members of each enzyme class in the ubiquitin-proteasome system, including E3 ligases and DUBs with previously unverified activity. We show that epitope-tagged Ub-electrophilic probes can be used as activity-based probes for E3 ligase identification by in vitro labeling and activity studies of purified enzymes identified from complex mixtures in cell lysate. Furthermore, the reactivity of our probe with the HECT domain of the E3 Ub ligase ARF-BP1 suggests that multiple cysteines may be in the vicinity of the E2-binding site and are capable of the transfer of Ub to self or to a substrate protein.

  8. Review, evaluation, and discussion of the challenges of missing value imputation for mass spectrometry-based label-free global proteomics

    DOE PAGES

    Webb-Robertson, Bobbie-Jo M.; Wiberg, Holli K.; Matzke, Melissa M.; ...

    2015-04-09

    In this review, we apply selected imputation strategies to label-free liquid chromatography–mass spectrometry (LC–MS) proteomics datasets to evaluate the accuracy with respect to metrics of variance and classification. We evaluate several commonly used imputation approaches for individual merits and discuss the caveats of each approach with respect to the example LC–MS proteomics data. In general, local similarity-based approaches, such as the regularized expectation maximization and least-squares adaptive algorithms, yield the best overall performances with respect to metrics of accuracy and robustness. However, no single algorithm consistently outperforms the remaining approaches, and in some cases, performing classification without imputation sometimes yielded the most accurate classification. Thus, because of the complex mechanisms of missing data in proteomics, which also vary from peptide to protein, no individual method is a single solution for imputation. In summary, on the basis of the observations in this review, the goal for imputation in the field of computational proteomics should be to develop new approaches that work generically for this data type and new strategies to guide users in the selection of the best imputation for their dataset and analysis objectives.
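    As a rough illustration of what a local similarity-based imputation does, the sketch below fills each missing value with the mean of that column in the k most similar rows. This is plain k-nearest-neighbour imputation, swapped in for simplicity; it is not the regularized expectation maximization or least-squares adaptive algorithm named in the abstract, and the function name `knn_impute` and the toy data are assumptions made for this example.

```python
import math

def knn_impute(rows, k=2):
    """Fill None entries with the mean of the same column in the k nearest rows.

    Hypothetical helper for illustration; rows is a list of equal-length lists.
    """
    def distance(a, b):
        # Root-mean-square difference over the columns observed in both rows.
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return math.inf
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    result = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate donors: other rows that have column j observed.
            donors = sorted(
                (distance(row, other), other[j])
                for m, other in enumerate(rows)
                if m != i and other[j] is not None
            )
            donors = [d for d in donors if d[0] != math.inf][:k]
            if donors:  # leave the gap if no comparable row exists
                result[i][j] = sum(v for _, v in donors) / len(donors)
    return result

# Toy intensity table (made-up numbers): row 0 is missing its third value.
toy = [
    [1.0, 2.0, None],
    [1.0, 2.0, 3.0],
    [1.0, 2.0, 5.0],
    [9.0, 9.0, 100.0],
]
filled = knn_impute(toy, k=2)
```

    Distances are computed from the original table rather than from partially filled values, so the result does not depend on the order in which gaps are visited.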

  9. Characterization of new DsbB-like thiol-oxidoreductases of Campylobacter jejuni and Helicobacter pylori and classification of the DsbB family based on phylogenomic, structural and functional criteria.

    PubMed

    Raczko, Anna M; Bujnicki, Janusz M; Pawlowski, Marcin; Godlewska, Renata; Lewandowska, Magdalena; Jagusztyn-Krynicka, Elzbieta K

    2005-01-01

    In Gram-negative bacterial cells, disulfide bond formation occurs in the oxidative environment of the periplasm and is catalysed by Dsb (disulfide bond) proteins found in the periplasm and in the inner membrane. In this report the identification of a new subfamily of disulfide oxidoreductases encoded by a gene denoted dsbI, and functional characterization of DsbI proteins from Campylobacter jejuni and Helicobacter pylori, as well as DsbB from C. jejuni, are described. The N-terminal domain of DsbI is related to DsbB proteins and comprises five predicted transmembrane segments, while the C-terminal domain is predicted to locate to the periplasm and to fold into a beta-propeller structure. The dsbI gene is co-transcribed with a small ORF designated dba (dsbI-accessory). Based on a series of deletion and complementation experiments it is proposed that DsbB can complement the lack of DsbI but not the converse. In the presence of DsbB, the activity of DsbI was undetectable, hence it probably acts only on a subset of possible substrates of DsbB. To reconstruct the principal events in the evolution of DsbB and DsbI proteins, sequences of all their homologues identifiable in databases were analysed. In the course of this study, previously undetected variations on the common thiol-oxidoreductase theme were identified, such as development of an additional transmembrane helix and loss or migration of the second pair of Cys residues between two distinct periplasmic loops. In conjunction with the experimental characterization of two members of the DsbI lineage, this analysis has resulted in the first comprehensive classification of the DsbB/DsbI family based on structural, functional and evolutionary criteria.

  10. LenVarDB: database of length-variant protein domains.

    PubMed

    Mutt, Eshita; Mathew, Oommen K; Sowdhamini, Ramanathan

    2014-01-01

    Protein domains are functionally and structurally independent modules, which add to the functional variety of proteins. This array of functional diversity has been enabled by evolutionary changes, such as amino acid substitutions or insertions or deletions, occurring in these protein domains. Length variations (indels) can introduce changes at structural, functional and interaction levels. LenVarDB (freely available at http://caps.ncbs.res.in/lenvardb/) traces these length variations, starting from structure-based sequence alignments in our Protein Alignments organized as Structural Superfamilies (PASS2) database, across 731 structural classification of proteins (SCOP)-based protein domain superfamilies connected to 2 730 625 sequence homologues. Alignment of sequence homologues corresponding to a structural domain is available, starting from a structure-based sequence alignment of the superfamily. Orientation of the length-variant (indel) regions in protein domains can be visualized by mapping them on the structure and on the alignment. Knowledge about location of length variations within protein domains and their visual representation will be useful in predicting changes within structurally or functionally relevant sites, which may ultimately regulate protein function. Non-technical summary: Evolutionary changes bring about natural changes to proteins that may be found in many organisms. Such changes could be reflected as amino acid substitutions or insertions-deletions (indels) in protein sequences. LenVarDB is a database that provides an early overview of observed length variations that were set among 731 protein families and after examining >2 million sequences. Indels are followed up to observe if they are close to the active site such that they can affect the activity of proteins. Inclusion of such information can aid the design of bioengineering experiments.

  11. Classification and disease prediction via mathematical programming

    NASA Astrophysics Data System (ADS)

    Lee, Eva K.; Wu, Tsung-Lin

    2007-11-01

    In this chapter, we present classification models based on mathematical programming approaches. We first provide an overview on various mathematical programming approaches, including linear programming, mixed integer programming, nonlinear programming and support vector machines. Next, we present our effort of novel optimization-based classification models that are general purpose and suitable for developing predictive rules for large heterogeneous biological and medical data sets. Our predictive model simultaneously incorporates (1) the ability to classify any number of distinct groups; (2) the ability to incorporate heterogeneous types of attributes as input; (3) a high-dimensional data transformation that eliminates noise and errors in biological data; (4) the ability to incorporate constraints to limit the rate of misclassification, and a reserved-judgment region that provides a safeguard against over-training (which tends to lead to high misclassification rates from the resulting predictive rule) and (5) successive multi-stage classification capability to handle data points placed in the reserved judgment region. To illustrate the power and flexibility of the classification model and solution engine, and its multigroup prediction capability, application of the predictive model to a broad class of biological and medical problems is described. 
    Applications include: the differential diagnosis of the type of erythemato-squamous diseases; predicting presence/absence of heart disease; genomic analysis and prediction of aberrant CpG island methylation in human cancer; discriminant analysis of motility and morphology data in human lung carcinoma; prediction of ultrasonic cell disruption for drug delivery; identification of tumor shape and volume in treatment of sarcoma; multistage discriminant analysis of biomarkers for prediction of early atherosclerosis; fingerprinting of native and angiogenic microvascular networks for early diagnosis of diabetes, aging, macular degeneration and tumor metastasis; prediction of protein localization sites; and pattern recognition of satellite images in classification of soil types. In all these applications, the predictive model yields correct classification rates ranging from 80% to 100%. This provides motivation for pursuing its use as a medical diagnostic, monitoring and decision-making tool.

  12. Molecular classification of fatty liver by high-throughput profiling of protein post-translational modifications.

    PubMed

    Urasaki, Yasuyo; Fiscus, Ronald R; Le, Thuc T

    2016-04-01

    We describe an alternative approach to classifying fatty liver by profiling protein post-translational modifications (PTMs) with high-throughput capillary isoelectric focusing (cIEF) immunoassays. Four strains of mice were studied, with fatty livers induced by different causes, such as ageing, genetic mutation, acute drug usage, and high-fat diet. Nutrient-sensitive PTMs of a panel of 12 liver metabolic and signalling proteins were simultaneously evaluated with cIEF immunoassays, using nanograms of total cellular protein per assay. Changes to liver protein acetylation, phosphorylation, and O-N-acetylglucosamine glycosylation were quantified and compared between normal and diseased states. Fatty liver tissues could be distinguished from one another by distinctive protein PTM profiles. Fatty liver is currently classified by morphological assessment of lipid droplets, without identifying the underlying molecular causes. In contrast, high-throughput profiling of protein PTMs has the potential to provide molecular classification of fatty liver. Copyright © 2016 Pathological Society of Great Britain and Ireland. Published by John Wiley & Sons, Ltd.

  13. CDD/SPARCLE: functional classification of proteins via subfamily domain architectures.

    PubMed

    Marchler-Bauer, Aron; Bo, Yu; Han, Lianyi; He, Jane; Lanczycki, Christopher J; Lu, Shennan; Chitsaz, Farideh; Derbyshire, Myra K; Geer, Renata C; Gonzales, Noreen R; Gwadz, Marc; Hurwitz, David I; Lu, Fu; Marchler, Gabriele H; Song, James S; Thanki, Narmada; Wang, Zhouxi; Yamashita, Roxanne A; Zhang, Dachuan; Zheng, Chanjuan; Geer, Lewis Y; Bryant, Stephen H

    2017-01-04

    NCBI's Conserved Domain Database (CDD) aims at annotating biomolecular sequences with the location of evolutionarily conserved protein domain footprints, and functional sites inferred from such footprints. An archive of pre-computed domain annotation is maintained for proteins tracked by NCBI's Entrez database, and live search services are offered as well. CDD curation staff supplements a comprehensive collection of protein domain and protein family models, which have been imported from external providers, with representations of selected domain families that are curated in-house and organized into hierarchical classifications of functionally distinct families and sub-families. CDD also supports comparative analyses of protein families via conserved domain architectures, and a recent curation effort focuses on providing functional characterizations of distinct subfamily architectures using SPARCLE: Subfamily Protein Architecture Labeling Engine. CDD can be accessed at https://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml. Published by Oxford University Press on behalf of Nucleic Acids Research 2016. This work is written by (a) US Government employee(s) and is in the public domain in the US.

  14. Integrated computational biology analysis to evaluate target genes for chronic myelogenous leukemia.

    PubMed

    Zheng, Yu; Wang, Yu-Ping; Cao, Hongbao; Chen, Qiusheng; Zhang, Xi

    2018-06-05

    Although hundreds of genes have been linked to chronic myelogenous leukemia (CML), many of the results lack reproducibility. In the present study, data across multiple modalities were integrated to evaluate 579 CML candidate genes, including literature‑based CML‑gene relation data, Gene Expression Omnibus RNA expression data and pathway‑based gene‑gene interaction data. The expression data included samples from 76 patients with CML and 73 healthy controls. For each target gene, four metrics were proposed and tested with case/control classification. The effectiveness of the four metrics presented was demonstrated by the high classification accuracy (94.63%; P<2×10^-4). Cross metric analysis suggested nine top candidate genes for CML: Epidermal growth factor receptor, tumor protein p53, catenin β 1, janus kinase 2, tumor necrosis factor, abelson murine leukemia viral oncogene homolog 1, vascular endothelial growth factor A, B‑cell lymphoma 2 and proto‑oncogene tyrosine‑protein kinase. In addition, 145 CML candidate pathways enriched with 485 out of 579 genes were identified (P<8.2×10^-11; q=0.005). In conclusion, weighted genetic networks generated using computational biology may be complementary to biological experiments for the evaluation of known or novel CML target genes.

  15. Classification of ligand molecules in PDB with fast heuristic graph match algorithm COMPLIG.

    PubMed

    Saito, Mihoko; Takemura, Naomi; Shirai, Tsuyoshi

    2012-12-14

    A fast heuristic graph-matching algorithm, COMPLIG, was devised to classify the small-molecule ligands in the Protein Data Bank (PDB), which are currently not properly classified on structure basis. By concurrently classifying proteins and ligands, we determined the most appropriate parameter for categorizing ligands to be more than 60% identity of atoms and bonds between molecules, and we classified 11,585 types of ligands into 1946 clusters. Although the large clusters were composed of nucleotides or amino acids, a significant presence of drug compounds was also observed. Application of the system to classify the natural ligand status of human proteins in the current database suggested that, at most, 37% of the experimental structures of human proteins were in complex with natural ligands. However, protein homology- and/or ligand similarity-based modeling was implied to provide models of natural interactions for an additional 28% of the total, which might be used to increase the knowledge of intrinsic protein-metabolite interactions. Copyright © 2012 Elsevier Ltd. All rights reserved.
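    The thresholding step described above (grouping ligands whose atoms and bonds are more than 60% identical) can be sketched as follows. This is only a toy under stated assumptions: identity is computed from multisets of atom and bond labels rather than by graph matching, single-linkage grouping is assumed, and the names `identity`, `cluster_ligands` and the three toy molecules (heavy atoms only) are hypothetical. It is not the COMPLIG algorithm.

```python
from collections import Counter

def identity(a: Counter, b: Counter) -> float:
    """Shared atom/bond labels divided by the size of the larger molecule."""
    shared = sum((a & b).values())          # multiset intersection
    larger = max(sum(a.values()), sum(b.values()))
    return shared / larger if larger else 0.0

def cluster_ligands(molecules: dict, threshold: float = 0.6) -> list:
    """Single-linkage grouping: link two ligands when identity > threshold."""
    names = sorted(molecules)
    parent = {n: n for n in names}

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if identity(molecules[a], molecules[b]) > threshold:
                parent[find(a)] = find(b)

    groups = {}
    for n in names:
        groups.setdefault(find(n), []).append(n)
    return sorted(groups.values())

# Toy ligands described by counts of heavy atoms and bonds (assumed encoding).
ligands = {
    "ethanol": Counter({"C": 2, "O": 1, "C-C": 1, "C-O": 1}),
    "propanol": Counter({"C": 3, "O": 1, "C-C": 2, "C-O": 1}),
    "water": Counter({"O": 1}),
}
clusters = cluster_ligands(ligands)
```

    With these toy inputs, ethanol and propanol share 5 of 7 labels and fall into one cluster, while water stays alone. A label-count comparison ignores connectivity, which is exactly what a real graph-matching method would add.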

  16. Phylogenetic classification and the universal tree.

    PubMed

    Doolittle, W F

    1999-06-25

    From comparative analyses of the nucleotide sequences of genes encoding ribosomal RNAs and several proteins, molecular phylogeneticists have constructed a "universal tree of life," taking it as the basis for a "natural" hierarchical classification of all living things. Although confidence in some of the tree's early branches has recently been shaken, new approaches could still resolve many methodological uncertainties. More challenging is evidence that most archaeal and bacterial genomes (and the inferred ancestral eukaryotic nuclear genome) contain genes from multiple sources. If "chimerism" or "lateral gene transfer" cannot be dismissed as trivial in extent or limited to special categories of genes, then no hierarchical universal classification can be taken as natural. Molecular phylogeneticists will have failed to find the "true tree," not because their methods are inadequate or because they have chosen the wrong genes, but because the history of life cannot properly be represented as a tree. However, taxonomies based on molecular sequences will remain indispensable, and understanding of the evolutionary process will ultimately be enriched, not impoverished.

  17. Highly multiplexed and quantitative cell-surface protein profiling using genetically barcoded antibodies.

    PubMed

    Pollock, Samuel B; Hu, Amy; Mou, Yun; Martinko, Alexander J; Julien, Olivier; Hornsby, Michael; Ploder, Lynda; Adams, Jarrett J; Geng, Huimin; Müschen, Markus; Sidhu, Sachdev S; Moffat, Jason; Wells, James A

    2018-03-13

    Human cells express thousands of different surface proteins that can be used for cell classification, or to distinguish healthy and disease conditions. A method capable of profiling a substantial fraction of the surface proteome simultaneously and inexpensively would enable more accurate and complete classification of cell states. We present a highly multiplexed and quantitative surface proteomic method using genetically barcoded antibodies called phage-antibody next-generation sequencing (PhaNGS). Using 144 preselected antibodies displayed on filamentous phage (Fab-phage) against 44 receptor targets, we assess changes in B cell surface proteins after the development of drug resistance in a patient with acute lymphoblastic leukemia (ALL) and in adaptation to oncogene expression in a Myc-inducible Burkitt lymphoma model. We further show PhaNGS can be applied at the single-cell level. Our results reveal that a common set of proteins including FLT3, NCR3LG1, and ROR1 dominate the response to similar oncogenic perturbations in B cells. Linking high-affinity, selective, genetically encoded binders to NGS enables direct and highly multiplexed protein detection, comparable to RNA-sequencing for mRNA. PhaNGS has the potential to profile a substantial fraction of the surface proteome simultaneously and inexpensively to enable more accurate and complete classification of cell states. Copyright © 2018 the Author(s). Published by PNAS.

  18. Improving Virus Taxonomy by Recontextualizing Sequence-Based Classification with Biologically Relevant Data: the Case of the Alphacoronavirus 1 Species

    PubMed Central

    André, Nicole M.

    2018-01-01

    The difficulties related to virus taxonomy have been amplified by recent advances in next-generation sequencing and metagenomics, prompting the field to revisit the question of what constitutes a useful viral classification. Here, taking a challenging classification found in coronaviruses, we argue that consideration of biological properties in addition to sequence-based demarcations is critical for generating useful taxonomy that recapitulates complex evolutionary histories. Within the Alphacoronavirus genus, the Alphacoronavirus 1 species encompasses several biologically distinct viruses. We carried out functionally based phylogenetic analysis, centered on the spike gene, which encodes the main surface antigen and primary driver of tropism and pathogenesis. Within the Alphacoronavirus 1 species, we identify clade A (encompassing serotype I feline coronavirus [FCoV] and canine coronavirus [CCoV]) and clade B (grouping serotype II FCoV and CCoV and transmissible gastroenteritis virus [TGEV]-like viruses). We propose this clade designation, along with the newly proposed Alphacoronavirus 2 species, as an improved way to classify the Alphacoronavirus genus. IMPORTANCE Our work focuses on improving the classification of the Alphacoronavirus genus. The Alphacoronavirus 1 species groups viruses of veterinary importance that infect distinct mammalian hosts and includes canine and feline coronaviruses and transmissible gastroenteritis virus. It is the prototype species of the Alphacoronavirus genus; however, it encompasses biologically distinct viruses. To better characterize this prototypical species, we performed phylogenetic analyses based on the sequences of the spike protein, one of the main determinants of tropism and pathogenesis, and reveal the existence of two subgroups or clades that fit with previously established serotype demarcations. We propose a new clade designation to better classify Alphacoronavirus 1 members. PMID:29299531

  19. Improving Virus Taxonomy by Recontextualizing Sequence-Based Classification with Biologically Relevant Data: the Case of the Alphacoronavirus 1 Species.

    PubMed

    Whittaker, Gary R; André, Nicole M; Millet, Jean Kaoru

    2018-01-01

    The difficulties related to virus taxonomy have been amplified by recent advances in next-generation sequencing and metagenomics, prompting the field to revisit the question of what constitutes a useful viral classification. Here, taking a challenging classification found in coronaviruses, we argue that consideration of biological properties in addition to sequence-based demarcations is critical for generating useful taxonomy that recapitulates complex evolutionary histories. Within the Alphacoronavirus genus, the Alphacoronavirus 1 species encompasses several biologically distinct viruses. We carried out functionally based phylogenetic analysis, centered on the spike gene, which encodes the main surface antigen and primary driver of tropism and pathogenesis. Within the Alphacoronavirus 1 species, we identify clade A (encompassing serotype I feline coronavirus [FCoV] and canine coronavirus [CCoV]) and clade B (grouping serotype II FCoV and CCoV and transmissible gastroenteritis virus [TGEV]-like viruses). We propose this clade designation, along with the newly proposed Alphacoronavirus 2 species, as an improved way to classify the Alphacoronavirus genus. IMPORTANCE Our work focuses on improving the classification of the Alphacoronavirus genus. The Alphacoronavirus 1 species groups viruses of veterinary importance that infect distinct mammalian hosts and includes canine and feline coronaviruses and transmissible gastroenteritis virus. It is the prototype species of the Alphacoronavirus genus; however, it encompasses biologically distinct viruses. To better characterize this prototypical species, we performed phylogenetic analyses based on the sequences of the spike protein, one of the main determinants of tropism and pathogenesis, and reveal the existence of two subgroups or clades that fit with previously established serotype demarcations. We propose a new clade designation to better classify Alphacoronavirus 1 members.

  20. Quantitative Cell Cycle Analysis Based on an Endogenous All-in-One Reporter for Cell Tracking and Classification.

    PubMed

    Zerjatke, Thomas; Gak, Igor A; Kirova, Dilyana; Fuhrmann, Markus; Daniel, Katrin; Gonciarz, Magdalena; Müller, Doris; Glauche, Ingmar; Mansfeld, Jörg

    2017-05-30

    Cell cycle kinetics are crucial to cell fate decisions. Although live imaging has provided extensive insights into this relationship at the single-cell level, the limited number of fluorescent markers that can be used in a single experiment has hindered efforts to link the dynamics of individual proteins responsible for decision making directly to cell cycle progression. Here, we present fluorescently tagged endogenous proliferating cell nuclear antigen (PCNA) as an all-in-one cell cycle reporter that allows simultaneous analysis of cell cycle progression, including the transition into quiescence, and the dynamics of individual fate determinants. We also provide an image analysis pipeline for automated segmentation, tracking, and classification of all cell cycle phases. Combining the all-in-one reporter with labeled endogenous cyclin D1 and p21 as prime examples of cell-cycle-regulated fate determinants, we show how cell cycle and quantitative protein dynamics can be simultaneously extracted to gain insights into G1 phase regulation and responses to perturbations. Copyright © 2017 The Author(s). Published by Elsevier Inc. All rights reserved.

  1. LncRNApred: Classification of Long Non-Coding RNAs and Protein-Coding Transcripts by the Ensemble Algorithm with a New Hybrid Feature.

    PubMed

    Pian, Cong; Zhang, Guangle; Chen, Zhi; Chen, Yuanyuan; Zhang, Jin; Yang, Tao; Zhang, Liangyun

    2016-01-01

    As a novel class of noncoding RNAs, long noncoding RNAs (lncRNAs) have been verified to be associated with various diseases. As large numbers of transcripts are generated every year, it is important to accurately and quickly identify lncRNAs from thousands of assembled transcripts. To accurately discover new lncRNAs, we develop a random forest (RF) classification tool named LncRNApred based on a new hybrid feature. This hybrid feature set includes three newly proposed features: MaxORF, RMaxORF and SNR. LncRNApred is effective for classifying lncRNAs and protein-coding transcripts accurately and quickly. Moreover, our RF model requires training only on human coding and non-coding transcripts. Transcripts from other species can also be classified using LncRNApred. The results show that our method is more effective than the Coding Potential Calculator (CPC). The web server of LncRNApred is available for free at http://mm20132014.wicp.net:57203/LncRNApred/home.jsp.
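    The abstract names a MaxORF feature without defining it. The sketch below shows one plausible reading, the length of the longest open reading frame in a transcript, which tends to be larger for protein-coding transcripts. The definitions used here (forward frames only, ATG start, stop codon counted in the length) and the function names `longest_orf_length` and `orf_ratio` are assumptions for illustration, not the published feature definitions.

```python
# Illustrative sketch only; not the LncRNApred implementation.
STOP_CODONS = {"TAA", "TAG", "TGA"}

def longest_orf_length(seq: str) -> int:
    """Length (nt) of the longest ATG...stop ORF in the three forward frames.

    The stop codon is included in the length; ORFs without a stop are ignored.
    """
    seq = seq.upper().replace("U", "T")  # accept RNA or DNA alphabets
    best = 0
    for frame in range(3):
        start = None  # position of the ATG that opened the current ORF
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                best = max(best, i + 3 - start)
                start = None
    return best

def orf_ratio(seq: str) -> float:
    """Longest ORF length relative to transcript length (assumed definition)."""
    return longest_orf_length(seq) / len(seq) if seq else 0.0

# Toy transcript: the only complete ORF is ATG AAA TAG in the third frame.
example = "CCATGAAATAGGG"
orf_len = longest_orf_length(example)
```

    Features of this kind could then be passed, alongside others, to any random forest implementation; the choice of library is left open here.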

  2. Comprehensive Genome-Wide Classification Reveals That Many Plant-Specific Transcription Factors Evolved in Streptophyte Algae

    PubMed Central

    Wilhelmsson, Per K I; Mühlich, Cornelia; Ullrich, Kristian K

    2017-01-01

    Plant genomes encode many lineage-specific, unique transcription factors. Expansion of such gene families has been previously found to coincide with the evolution of morphological complexity, although comparative analyses have been hampered by severe sampling bias. Here, we make use of the recently increased availability of plant genomes. We have updated and expanded previous rule sets for domain-based classification of transcription associated proteins (TAPs), comprising transcription factors and transcriptional regulators. The genome-wide annotation of these protein families has been analyzed and made available via the novel TAPscan web interface. We find that many TAP families previously thought to be specific for land plants actually evolved in streptophyte (charophyte) algae; 26 out of 36 TAP family gains are inferred to have occurred in the common ancestor of the Streptophyta (uniting the land plants—Embryophyta—with their closest algal relatives). In contrast, expansions of TAP families were found to occur throughout streptophyte evolution. 17 out of 76 expansion events were found to be common to all land plants and thus probably evolved concomitant with the water-to-land-transition. PMID:29216360

  3. The Complete Set of Genes Encoding Major Intrinsic Proteins in Arabidopsis Provides a Framework for a New Nomenclature for Major Intrinsic Proteins in Plants

    PubMed Central

    Johanson, Urban; Karlsson, Maria; Johansson, Ingela; Gustavsson, Sofia; Sjövall, Sara; Fraysse, Laure; Weig, Alfons R.; Kjellbom, Per

    2001-01-01

    Major intrinsic proteins (MIPs) facilitate the passive transport of small polar molecules across membranes. MIPs constitute a very old family of proteins and different forms have been found in all kinds of living organisms, including bacteria, fungi, animals, and plants. In the genomic sequence of Arabidopsis, we have identified 35 different MIP-encoding genes. Based on sequence similarity, these 35 proteins are divided into four different subfamilies: plasma membrane intrinsic proteins, tonoplast intrinsic proteins, NOD26-like intrinsic proteins also called NOD26-like MIPs, and the recently discovered small basic intrinsic proteins. In Arabidopsis, there are 13 plasma membrane intrinsic proteins, 10 tonoplast intrinsic proteins, nine NOD26-like intrinsic proteins, and three small basic intrinsic proteins. The gene structure in general is conserved within each subfamily, although there is a tendency to lose introns. Based on phylogenetic comparisons of maize (Zea mays) and Arabidopsis MIPs (AtMIPs), it is argued that the general intron patterns in the subfamilies were formed before the split of monocotyledons and dicotyledons. Although the gene structure is unique for each subfamily, there is a common pattern in how transmembrane helices are encoded on the exons in three of the subfamilies. The nomenclature for plant MIPs varies widely between different species but also between subfamilies in the same species. Based on the phylogeny of all AtMIPs, a new and more consistent nomenclature is proposed. The complete set of AtMIPs, together with the new nomenclature, will facilitate the isolation, classification, and labeling of plant MIPs from other species. PMID:11500536

  4. A Feature and Algorithm Selection Method for Improving the Prediction of Protein Structural Class.

    PubMed

    Ni, Qianwu; Chen, Lei

    2017-01-01

    Correct prediction of protein structural class is beneficial to investigation on protein functions, regulations and interactions. In recent years, several computational methods have been proposed in this regard. However, based on various features, it is still a great challenge to select a proper classification algorithm and extract essential features to participate in classification. In this study, a feature and algorithm selection method was presented for improving the accuracy of protein structural class prediction. The amino acid compositions and physicochemical features were adopted to represent features and thirty-eight machine learning algorithms collected in Weka were employed. All features were first analyzed by a feature selection method, minimum redundancy maximum relevance (mRMR), producing a feature list. Then, several feature sets were constructed by adding features in the list one by one. For each feature set, thirty-eight algorithms were executed on a dataset, in which proteins were represented by features in the set. The predicted classes yielded by these algorithms and true class of each protein were collected to construct a dataset, which was analyzed by the mRMR method, yielding an algorithm list. From the algorithm list, algorithms were taken one by one to build an ensemble prediction model. Finally, we selected the ensemble prediction model with the best performance as the optimal ensemble prediction model. Experimental results indicate that the constructed model is much superior to models using a single algorithm and to other models that adopt only the feature selection procedure or the algorithm selection procedure. Both the feature selection and algorithm selection procedures are helpful for building an ensemble prediction model that can yield a better performance. Copyright© Bentham Science Publishers; For any queries, please email at epub@benthamscience.org.
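    The step of constructing feature sets "by adding features in the list one by one" and keeping the best-performing one can be sketched as below. The ranked list is assumed to come from a ranking method such as mRMR, which is not implemented here; the function name `incremental_selection`, the feature names and the toy scoring function are all hypothetical stand-ins, and a real run would score each set by cross-validating a classifier.

```python
from typing import Callable, List, Sequence, Tuple

def incremental_selection(
    ranked: Sequence[str],
    evaluate: Callable[[List[str]], float],
) -> Tuple[List[str], float]:
    """Score the nested sets ranked[:1], ranked[:2], ... and return the best.

    Ties are resolved in favour of the smaller feature set.
    """
    best_set: List[str] = []
    best_score = float("-inf")
    for k in range(1, len(ranked) + 1):
        candidate = list(ranked[:k])
        score = evaluate(candidate)
        if score > best_score:  # strict '>' keeps the earliest (smallest) set
            best_set, best_score = candidate, score
    return best_set, best_score

# Toy stand-in for a classifier's performance: each feature adds or removes
# a fixed number of points (made-up values, for illustration only).
TOY_GAIN = {"f1": 3, "f2": 2, "f3": -1, "f4": 0}

def toy_evaluate(features: List[str]) -> float:
    return sum(TOY_GAIN[f] for f in features)

selected, score = incremental_selection(["f1", "f2", "f3", "f4"], toy_evaluate)
```

    The same loop could be reused for the algorithm list described in the abstract, with `ranked` holding algorithm names and `evaluate` scoring an ensemble of the first k algorithms.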

  5. Expression of Y-box-binding protein YB-1 allows stratification into long- and short-term survivors of head and neck cancer patients.

    PubMed

    Kolk, A; Jubitz, N; Mengele, K; Mantwill, K; Bissinger, O; Schmitt, M; Kremer, M; Holm, P S

    2011-12-06

    Histology-based classifications and clinical parameters of head and neck squamous cell carcinoma (HNSCC) are limited in their clinical capacity to provide information on prognosis and treatment choice of HNSCC. The primary aim of this study was to analyse Y-box-binding protein-1 (YB-1) protein expression in different grading groups of HNSCC patients, and to correlate these findings with the disease-specific survival (DSS). We investigated the expression and cellular localisation of the oncogenic transcription/translation factor YB-1 by immunohistochemistry on tissue micro arrays in a total of 365 HNSCC specimens and correlated expression data with clinico-pathological parameters including DSS. Compared with control tissue from healthy individuals, a significantly (P<0.01) increased YB-1 protein expression was observed in high-grade HNSCC patients. By univariate survival data analysis, HNSCC patients with elevated YB-1 protein expression had a significantly (P<0.01) decreased DSS. By multivariate Cox regression analysis, high YB-1 expression and nuclear localisation retained its significance as a statistically independent (P<0.002) prognostic marker for DSS. Within grade 2 group of HNSCC patients, a subgroup defined by high nuclear and cytoplasmic YB-1 levels (co-expression pattern) in the cells of the tumour invasion front had a significantly poorer 5-year DSS rate of only 38% compared with overall 55% for grade 2 patients. Vice versa, the DSS rate was markedly increased to 74% for grade 2 cancer patients with low YB-1 protein expression at the same localisation. Our findings point to the fact that YB-1 expression in combination with histological classification in a double stratification strategy is superior to classical grading in the prediction of tumour progression in HNSCC.

  6. Expression of Y-box-binding protein YB-1 allows stratification into long- and short-term survivors of head and neck cancer patients

    PubMed Central

    Kolk, A; Jubitz, N; Mengele, K; Mantwill, K; Bissinger, O; Schmitt, M; Kremer, M; Holm, P S

    2011-01-01

    Background: Histology-based classifications and clinical parameters of head and neck squamous cell carcinoma (HNSCC) are limited in their clinical capacity to provide information on prognosis and treatment choice of HNSCC. The primary aim of this study was to analyse Y-box-binding protein-1 (YB-1) protein expression in different grading groups of HNSCC patients, and to correlate these findings with the disease-specific survival (DSS). Methods: We investigated the expression and cellular localisation of the oncogenic transcription/translation factor YB-1 by immunohistochemistry on tissue micro arrays in a total of 365 HNSCC specimens and correlated expression data with clinico-pathological parameters including DSS. Results: Compared with control tissue from healthy individuals, a significantly (P<0.01) increased YB-1 protein expression was observed in high-grade HNSCC patients. By univariate survival data analysis, HNSCC patients with elevated YB-1 protein expression had a significantly (P<0.01) decreased DSS. By multivariate Cox regression analysis, high YB-1 expression and nuclear localisation retained its significance as a statistically independent (P<0.002) prognostic marker for DSS. Within grade 2 group of HNSCC patients, a subgroup defined by high nuclear and cytoplasmic YB-1 levels (co-expression pattern) in the cells of the tumour invasion front had a significantly poorer 5-year DSS rate of only 38% compared with overall 55% for grade 2 patients. Vice versa, the DSS rate was markedly increased to 74% for grade 2 cancer patients with low YB-1 protein expression at the same localisation. Conclusion: Our findings point to the fact that YB-1 expression in combination with histological classification in a double stratification strategy is superior to classical grading in the prediction of tumour progression in HNSCC. PMID:22095225

  7. Spatial cluster analysis of nanoscopically mapped serotonin receptors for classification of fixed brain tissue

    NASA Astrophysics Data System (ADS)

    Sams, Michael; Silye, Rene; Göhring, Janett; Muresan, Leila; Schilcher, Kurt; Jacak, Jaroslaw

    2014-01-01

    We present a cluster spatial analysis method using nanoscopic dSTORM images to determine changes in protein cluster distributions within brain tissue. Such methods are suitable to investigate human brain tissue and will help to achieve a deeper understanding of brain disease along with aiding drug development. Human brain tissue samples are usually treated postmortem via standard fixation protocols, which are established in clinical laboratories. Therefore, our localization microscopy-based method was adapted to characterize protein density and protein cluster localization in samples fixed using different protocols followed by common fluorescent immunohistochemistry techniques. The localization microscopy allows nanoscopic mapping of serotonin 5-HT1A receptor groups within a two-dimensional image of a brain tissue slice. These nanoscopically mapped proteins can be confined to clusters by applying the proposed statistical spatial analysis. Selected features of such clusters were subsequently used to characterize and classify the tissue. Samples were obtained from different types of patients, fixed with different preparation methods, and finally stored in a human tissue bank. To verify the proposed method, samples of a cryopreserved healthy brain have been compared with epitope-retrieved and paraffin-fixed tissues. Furthermore, samples of healthy brain tissues were compared with data obtained from patients suffering from mental illnesses (e.g., major depressive disorder). Our work demonstrates the applicability of localization microscopy and image analysis methods for comparison and classification of human brain tissues at a nanoscopic level. Furthermore, the presented workflow marks a unique technological advance in the characterization of protein distributions in brain tissue sections.

  8. ECOD: new developments in the evolutionary classification of domains

    PubMed Central

    Schaeffer, R. Dustin; Liao, Yuxing; Cheng, Hua; Grishin, Nick V.

    2017-01-01

    Evolutionary Classification Of protein Domains (ECOD) (http://prodata.swmed.edu/ecod) comprehensively classifies proteins with known spatial structures maintained by the Protein Data Bank (PDB) into evolutionary groups of protein domains. ECOD relies on a combination of automatic and manual weekly updates to achieve its high accuracy and coverage with a short update cycle. ECOD classifies the approximately 120 000 depositions of the PDB into more than 500 000 domains in ∼3400 homologous groups. We show the performance of the weekly update pipeline since the release of ECOD, describe improvements to the ECOD website and available search options, and discuss novel structures and homologous groups that have been classified in the recent updates. Finally, we discuss the future directions of ECOD and further improvements planned for the hierarchy and update process. PMID:27899594

  9. FRAN and RBF-PSO as two components of a hyper framework to recognize protein folds.

    PubMed

    Abbasi, Elham; Ghatee, Mehdi; Shiri, M E

    2013-09-01

    In this paper, an intelligent hyper framework is proposed to recognize protein folds from their amino acid sequences, which is a fundamental problem in bioinformatics. This framework includes some statistical and intelligent algorithms for protein classification. The main components of the proposed framework are the Fuzzy Resource-Allocating Network (FRAN) and the Radial Basis Function based on Particle Swarm Optimization (RBF-PSO). FRAN applies a dynamic method to tune up the RBF network parameters. Due to the complexity of the patterns captured in the protein dataset, FRAN classifies the proteins under fuzzy conditions. Also, RBF-PSO applies PSO to tune up the RBF classifier. Experimental results demonstrate that FRAN improves prediction accuracy up to 51% and achieves acceptable multi-class results for protein fold prediction. Although RBF-PSO provides reasonable results for protein fold recognition up to 48%, it is weaker than FRAN in some cases. However, the proposed hyper framework provides an opportunity to use a great range of intelligent methods and can learn from previous experiences. Thus it can avoid the weakness of some intelligent methods in terms of memory, computational time and static structure. Furthermore, the performance of this system can be enhanced throughout the system life-cycle. Copyright © 2013 Elsevier Ltd. All rights reserved.

  10. Classification of Bacillus and Brevibacillus species using rapid analysis of lipids by mass spectrometry.

    PubMed

    AlMasoud, Najla; Xu, Yun; Trivedi, Drupad K; Salivo, Simona; Abban, Tom; Rattray, Nicholas J W; Szula, Ewa; AlRabiah, Haitham; Sayqal, Ali; Goodacre, Royston

    2016-11-01

    Bacillus are aerobic spore-forming bacteria that are known to lead to specific diseases, such as anthrax and food poisoning. This study focuses on the characterization of these bacteria by the detection of lipids extracted from 33 well-characterized strains from the Bacillus and Brevibacillus genera, with the aim to discriminate between the different species. For the purpose of analysing the lipids extracted from these bacterial samples, two rapid physicochemical techniques were used: matrix-assisted laser desorption/ionisation time-of-flight mass spectrometry (MALDI-TOF-MS) and liquid chromatography in conjunction with mass spectrometry (LC-MS). The findings of this investigation confirmed that MALDI-TOF-MS could be used to identify different bacterial lipids and, in combination with appropriate chemometrics, allowed for the discrimination between these different bacterial species, which was supported by LC-MS. The average correct classification rates for the seven species of bacteria were 62.23 and 77.03 % based on MALDI-TOF-MS and LC-MS data, respectively. The Procrustes distance for the two datasets was 0.0699, indicating that the results from the two techniques were very similar. In addition, we also compared these bacterial lipid MALDI-TOF-MS profiles to protein profiles also collected by MALDI-TOF-MS on the same bacteria (Procrustes distance, 0.1006). The level of discrimination between lipids and proteins was equivalent, and this further indicated the potential of MALDI-TOF-MS analysis as a rapid, robust and reliable method for the classification of bacteria based on different bacterial chemical components. Graphical abstract MALDI-MS has been successfully developed for the characterization of bacteria at the subspecies level using lipids and benchmarked against HPLC.

  11. 15 years of research on Oral-Facial-Digital syndromes: from 1 to 16 causal genes

    PubMed Central

    Bruel, Ange-Line; Franco, Brunella; Duffourd, Yannis; Thevenon, Julien; Jego, Laurence; Lopez, Estelle; Deleuze, Jean-François; Doummar, Diane; Giles, Rachel H.; Johnson, Colin A.; Huynen, Martijn A.; Chevrier, Véronique; Burglen, Lydie; Morleo, Manuela; Desguerres, Isabelle; Pierquin, Geneviève; Doray, Bérénice; Gilbert-Dussardier, Brigitte; Reversade, Bruno; Steichen-Gersdorf, Elisabeth; Baumann, Clarisse; Panigrahi, Inusha; Fargeot-Espaliat, Anne; Dieux, Anne; David, Albert; Goldenberg, Alice; Bongers, Ernie; Gaillard, Dominique; Argente, Jesús; Aral, Bernard; Gigot, Nadège; St-Onge, Judith; Birnbaum, Daniel; Phadke, Shubha R.; Cormier-Daire, Valérie; Eguether, Thibaut; Pazour, Gregory J.; Herranz-Pérez, Vicente; Lee, Jaclyn S.; Pasquier, Laurent; Loget, Philippe; Saunier, Sophie; Mégarbané, André; Rosnet, Olivier; Leroux, Michel R.; Wallingford, John B.; Blacque, Oliver E.; Nachury, Maxence V.; Attie-Bitach, Tania; Rivière, Jean-Baptiste; Faivre, Laurence; Thauvin-Robinet, Christel

    2017-01-01

    Oral-facial-digital syndromes (OFDS) gather rare genetic disorders characterized by facial, oral and digital abnormalities associated with a wide range of additional features (polycystic kidney disease, cerebral malformations and several others) to delineate a growing list of OFD subtypes. The most frequent, OFD type I, is caused by a heterozygous mutation in the OFD1 gene encoding a centrosomal protein. The wide clinical heterogeneity of OFDS suggests the involvement of other ciliary genes. For 15 years, we have aimed to identify the molecular bases of OFDS. This effort has been greatly helped by the recent development of whole exome sequencing (WES). Here, we present all our published and unpublished results for WES in 24 OFDS cases. We identified causal variants in five new genes (C2CD3, TMEM107, INTU, KIAA0753, IFT57) and related the clinical spectrum of four genes in other ciliopathies (C5orf42, TMEM138, TMEM231, WDPCP) to OFDS. Mutations were also detected in two genes previously implicated in OFDS. Functional studies revealed the involvement of centriole elongation, transition zone and intraflagellar transport defects in OFDS, thus characterizing three ciliary protein modules: the complex KIAA0753-FOPNL-OFD1, a regulator of centriole elongation; the MKS module, a major component of the transition zone; and the CPLANE complex necessary for IFT-A assembly. OFDS now appear to be a distinct subgroup of ciliopathies with wide heterogeneity, which makes the initial classification obsolete. A clinical classification restricted to the three frequent/well-delineated subtypes could be proposed, and for patients who do not fit one of these 3 main subtypes, a further classification could be based on the genotype. PMID:28289185

  12. Ribosome-inactivating proteins

    PubMed Central

    Walsh, Matthew J; Dodd, Jennifer E; Hautbergue, Guillaume M

    2013-01-01

    Ribosome-inactivating proteins (RIPs) were first isolated over a century ago and have been shown to be catalytic toxins that irreversibly inactivate protein synthesis. Elucidation of atomic structures and molecular mechanism has revealed these proteins to be a diverse group subdivided into two classes. RIPs have been shown to exhibit RNA N-glycosidase activity and depurinate the 28S rRNA of the eukaryotic 60S ribosomal subunit. In this review, we compare archetypal RIP family members with other potent toxins that abolish protein synthesis: the fungal ribotoxins which directly cleave the 28S rRNA and the newly discovered Burkholderia lethal factor 1 (BLF1). BLF1 presents additional challenges to the current classification system since, like the ribotoxins, it does not possess RNA N-glycosidase activity but does irreversibly inactivate ribosomes. We further discuss whether the RIP classification should be broadened to include toxins achieving irreversible ribosome inactivation with similar turnovers to RIPs, but through different enzymatic mechanisms. PMID:24071927

  13. Pfarao: a web application for protein family analysis customized for cytoskeletal and motor proteins (CyMoBase).

    PubMed

    Odronitz, Florian; Kollmar, Martin

    2006-11-29

    Annotation of protein sequences of eukaryotic organisms is crucial for the understanding of their function in the cell. Manual annotation is still by far the most accurate way to correctly predict genes. The classification of protein sequences, their phylogenetic relation and the assignment of function involves information from various sources. This often leads to a collection of heterogeneous data, which is hard to track. Cytoskeletal and motor proteins consist of large and diverse superfamilies comprising up to several dozen members per organism. To date, there is no integrated tool available to assist in the manual large-scale comparative genomic analysis of protein families. Pfarao (Protein Family Application for Retrieval, Analysis and Organisation) is a database-driven online working environment for the analysis of manually annotated protein sequences and their relationship. Currently, the system can store and interrelate a wide range of information about protein sequences, species, phylogenetic relations and sequencing projects as well as links to literature and domain predictions. Sequences can be imported from multiple sequence alignments that are generated during the annotation process. A web interface allows users to conveniently browse the database and to compile tabular and graphical summaries of its content. We implemented a protein sequence-centric web application to store, organize, interrelate, and present heterogeneous data that is generated in manual genome annotation and comparative genomics. The application has been developed for the analysis of cytoskeletal and motor proteins (CyMoBase) but can easily be adapted for any protein.

  14. Predictive value of C-reactive protein/albumin ratio in acute pancreatitis.

    PubMed

    Kaplan, Mustafa; Ates, Ihsan; Akpinar, Muhammed Yener; Yuksel, Mahmut; Kuzu, Ufuk Baris; Kacar, Sabite; Coskun, Orhan; Kayacetin, Ertugrul

    2017-08-15

    Serum C-reactive protein (CRP) increases and albumin decreases in patients with inflammation and infection. However, their role in patients with acute pancreatitis is not clear. The present study aimed to investigate the predictive significance of the CRP/albumin ratio for the prognosis and mortality in acute pancreatitis patients. This study was performed retrospectively with 192 acute pancreatitis patients between January 2002 and June 2015. Ranson scores, Atlanta classification and CRP/albumin ratios of the patients were calculated. The CRP/albumin ratio was higher in deceased patients compared to survivors. The CRP/albumin ratio was positively correlated with Ranson score and Atlanta classification in particular and with important prognostic markers such as hospitalization time, CRP and erythrocyte sedimentation rate. In addition to the CRP/albumin ratio, necrotizing pancreatitis type, moderately severe and severe Atlanta classification, and total Ranson score were independent risk factors of mortality. It was found that an increase of 1 unit in the CRP/albumin ratio resulted in an increase of 1.52 times in mortality risk. A CRP/albumin ratio cut-off of >16.28 was found to be a significant marker in predicting mortality with 92.1% sensitivity and 58.0% specificity. Ranson score and Atlanta classification were higher in patients with CRP/albumin ratio >16.28 compared with those with CRP/albumin ratio ≤16.28. Patients with CRP/albumin ratio >16.28 had a 19.3 times higher chance of death. The CRP/albumin ratio is a novel but promising, easy-to-measure, repeatable, non-invasive inflammation-based prognostic score in acute pancreatitis. Copyright © 2017 The Editorial Board of Hepatobiliary & Pancreatic Diseases International. Published by Elsevier B.V. All rights reserved.
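
    The cut-off rule described in this abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the abstract does not state the measurement units (the inputs are assumed to be in whatever units the study used), and the flags and outcomes in the usage note are invented toy data, not study data. Only the 16.28 cut-off comes from the abstract.

```python
CUTOFF = 16.28  # cut-off reported in the abstract; units are not stated there


def crp_albumin_ratio(crp, albumin):
    """Return CRP divided by albumin (units assumed to match the study's)."""
    if albumin <= 0:
        raise ValueError("albumin must be positive")
    return crp / albumin


def is_high_risk(crp, albumin, cutoff=CUTOFF):
    """Flag a patient whose ratio is strictly above the cut-off."""
    return crp_albumin_ratio(crp, albumin) > cutoff


def sensitivity_specificity(flags, died):
    """Compute (sensitivity, specificity) of the flag against observed deaths."""
    tp = sum(1 for f, d in zip(flags, died) if f and d)
    fn = sum(1 for f, d in zip(flags, died) if not f and d)
    tn = sum(1 for f, d in zip(flags, died) if not f and not d)
    fp = sum(1 for f, d in zip(flags, died) if f and not d)
    return tp / (tp + fn), tn / (tn + fp)
```

    For instance, with toy flags [True, True, False, False, True] and toy outcomes [True, False, False, False, True], the flag catches both deaths and misclassifies one of three survivors. This mirrors the pattern in the abstract of high sensitivity paired with lower specificity.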

  15. mLASSO-Hum: A LASSO-based interpretable human-protein subcellular localization predictor.

    PubMed

    Wan, Shibiao; Mak, Man-Wai; Kung, Sun-Yuan

    2015-10-07

    Knowing the subcellular compartments of human proteins is essential to shed light on the mechanisms of a broad range of human diseases. In computational methods for protein subcellular localization, knowledge-based methods (especially gene ontology (GO) based methods) are known to perform better than sequence-based methods. However, existing GO-based predictors often lack interpretability and suffer from overfitting due to the high dimensionality of feature vectors. To address these problems, this paper proposes an interpretable multi-label predictor, namely mLASSO-Hum, which can yield sparse and interpretable solutions for large-scale prediction of human protein subcellular localization. By using the one-vs-rest LASSO-based classifiers, 87 out of more than 8000 GO terms are found to play more significant roles in determining the subcellular localization. Based on these 87 essential GO terms, we can decide not only where a protein resides within a cell, but also why it is located there. To further exploit information from the remaining GO terms, a method based on the GO hierarchical information derived from the depth distance of GO terms is proposed. Experimental results show that mLASSO-Hum performs significantly better than state-of-the-art predictors. We also found that in addition to the GO terms from the cellular component category, GO terms from the other two categories also play important roles in the final classification decisions. For readers' convenience, the mLASSO-Hum server is available online at http://bioinfo.eie.polyu.edu.hk/mLASSOHumServer/. Copyright © 2015 Elsevier Ltd. All rights reserved.

  16. QPROT: Statistical method for testing differential expression using protein-level intensity data in label-free quantitative proteomics.

    PubMed

    Choi, Hyungwon; Kim, Sinae; Fermin, Damian; Tsou, Chih-Chiang; Nesvizhskii, Alexey I

    2015-11-03

    We introduce QPROT, a statistical framework and computational tool for differential protein expression analysis using protein intensity data. QPROT is an extension of the QSPEC suite, originally developed for spectral count data, adapted for the analysis using continuously measured protein-level intensity data. QPROT offers a new intensity normalization procedure and model-based differential expression analysis, both of which account for missing data. Determination of differential expression of each protein is based on the standardized Z-statistic based on the posterior distribution of the log fold change parameter, guided by the false discovery rate estimated by a well-known Empirical Bayes method. We evaluated the classification performance of QPROT using the quantification calibration data from the clinical proteomic technology assessment for cancer (CPTAC) study and a recently published Escherichia coli benchmark dataset, with evaluation of FDR accuracy in the latter. QPROT is a statistical framework with computational software tool for comparative quantitative proteomics analysis. It features various extensions of QSPEC method originally built for spectral count data analysis, including probabilistic treatment of missing values in protein intensity data. With the increasing popularity of label-free quantitative proteomics data, the proposed method and accompanying software suite will be immediately useful for many proteomics laboratories. This article is part of a Special Issue entitled: Computational Proteomics. Copyright © 2015 Elsevier B.V. All rights reserved.

  17. Improving predictions of protein-protein interfaces by combining amino acid-specific classifiers based on structural and physicochemical descriptors with their weighted neighbor averages.

    PubMed

    de Moraes, Fábio R; Neshich, Izabella A P; Mazoni, Ivan; Yano, Inácio H; Pereira, José G C; Salim, José A; Jardine, José G; Neshich, Goran

    2014-01-01

    Protein-protein interactions are involved in nearly all regulatory processes in the cell and are considered one of the most important issues in molecular biology and pharmaceutical sciences but are still not fully understood. Structural and computational biology contributed greatly to the elucidation of the mechanism of protein interactions. In this paper, we present a collection of the physicochemical and structural characteristics that distinguish interface-forming residues (IFR) from free surface residues (FSR). We formulated a linear discriminative analysis (LDA) classifier to assess whether chosen descriptors from the BlueStar STING database (http://www.cbi.cnptia.embrapa.br/SMS/) are suitable for such a task. Receiver operating characteristic (ROC) analysis indicates that the particular physicochemical and structural descriptors used for building the linear classifier perform much better than a random classifier and in fact, successfully outperform some of the previously published procedures, whose performance indicators were recently compared by other research groups. The results presented here show that the selected set of descriptors can be utilized to predict IFRs, even when homologue proteins are missing (particularly important for orphan proteins where no homologue is available for comparative analysis/indication) or, when certain conformational changes accompany interface formation. The development of amino acid type specific classifiers is shown to increase IFR classification performance. Also, we found that the addition of an amino acid conservation attribute did not improve the classification prediction. This result indicates that the increase in predictive power associated with amino acid conservation is exhausted by adequate use of an extensive list of independent physicochemical and structural parameters that, by themselves, fully describe the nano-environment at protein-protein interfaces. 
    The IFR classifier developed in this study is now integrated into the BlueStar STING suite of programs. Consequently, the prediction of protein-protein interfaces for all proteins available in the PDB is possible through the STING_interfaces module, accessible at the following website: (http://www.cbi.cnptia.embrapa.br/SMS/predictions/index.html).

  18. Improving Predictions of Protein-Protein Interfaces by Combining Amino Acid-Specific Classifiers Based on Structural and Physicochemical Descriptors with Their Weighted Neighbor Averages

    PubMed Central

    de Moraes, Fábio R.; Neshich, Izabella A. P.; Mazoni, Ivan; Yano, Inácio H.; Pereira, José G. C.; Salim, José A.; Jardine, José G.; Neshich, Goran

    2014-01-01

    Protein-protein interactions are involved in nearly all regulatory processes in the cell and are considered one of the most important issues in molecular biology and pharmaceutical sciences but are still not fully understood. Structural and computational biology contributed greatly to the elucidation of the mechanism of protein interactions. In this paper, we present a collection of the physicochemical and structural characteristics that distinguish interface-forming residues (IFR) from free surface residues (FSR). We formulated a linear discriminative analysis (LDA) classifier to assess whether chosen descriptors from the BlueStar STING database (http://www.cbi.cnptia.embrapa.br/SMS/) are suitable for such a task. Receiver operating characteristic (ROC) analysis indicates that the particular physicochemical and structural descriptors used for building the linear classifier perform much better than a random classifier and in fact, successfully outperform some of the previously published procedures, whose performance indicators were recently compared by other research groups. The results presented here show that the selected set of descriptors can be utilized to predict IFRs, even when homologue proteins are missing (particularly important for orphan proteins where no homologue is available for comparative analysis/indication) or, when certain conformational changes accompany interface formation. The development of amino acid type specific classifiers is shown to increase IFR classification performance. Also, we found that the addition of an amino acid conservation attribute did not improve the classification prediction. This result indicates that the increase in predictive power associated with amino acid conservation is exhausted by adequate use of an extensive list of independent physicochemical and structural parameters that, by themselves, fully describe the nano-environment at protein-protein interfaces. 
    The IFR classifier developed in this study is now integrated into the BlueStar STING suite of programs. Consequently, the prediction of protein-protein interfaces for all proteins available in the PDB is possible through the STING_interfaces module, accessible at the following website: (http://www.cbi.cnptia.embrapa.br/SMS/predictions/index.html). PMID:24489849

  19. A fully automatic evolutionary classification of protein folds: Dali Domain Dictionary version 3

    PubMed Central

    Dietmann, Sabine; Park, Jong; Notredame, Cedric; Heger, Andreas; Lappe, Michael; Holm, Liisa

    2001-01-01

    The Dali Domain Dictionary (http://www.ebi.ac.uk/dali/domain) is a numerical taxonomy of all known structures in the Protein Data Bank (PDB). The taxonomy is derived fully automatically from measurements of structural, functional and sequence similarities. Here, we report the extension of the classification to match the traditional four hierarchical levels corresponding to: (i) supersecondary structural motifs (attractors in fold space), (ii) the topology of globular domains (fold types), (iii) remote homologues (functional families) and (iv) homologues with sequence identity above 25% (sequence families). The computational definitions of attractors and functional families are new. In September 2000, the Dali classification contained 10 531 PDB entries comprising 17 101 chains, which were partitioned into five attractor regions, 1375 fold types, 2582 functional families and 3724 domain sequence families. Sequence families were further associated with 99 582 unique homologous sequences in the HSSP database, which increases the number of effectively known structures several-fold. The resulting database contains the description of protein domain architecture, the definition of structural neighbours around each known structure, the definition of structurally conserved cores and a comprehensive library of explicit multiple alignments of distantly related protein families. PMID:11125048
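
    The fourth hierarchical level in this abstract groups homologues with sequence identity above 25%. The sketch below shows one way such a threshold could be checked for a pair of already aligned sequences. It is an assumption-laden illustration, not the Dali procedure: the function names are hypothetical, and defining identity over gap-free alignment columns is one common convention among several.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of identical residues over columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != gap and b != gap]
    if not pairs:
        return 0.0
    matches = sum(1 for a, b in pairs if a == b)
    return 100.0 * matches / len(pairs)


def same_sequence_family(aligned_a, aligned_b, threshold=25.0):
    """Apply the 25% identity threshold quoted in the abstract (strictly above)."""
    return percent_identity(aligned_a, aligned_b) > threshold
```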

  20. A simple and fast heuristic for protein structure comparison

    PubMed Central

    Pelta, David A; González, Juan R; Moreno Vega, Marcos

    2008-01-01

    Background: Protein structure comparison is a key problem in bioinformatics. Several methods exist for protein comparison, one of which is solving the Maximum Contact Map Overlap problem (MAX-CMO). Although this problem may be solved using exact algorithms, researchers require approximate algorithms that obtain good-quality solutions using fewer computational resources than the former. Results: We propose a variable neighborhood search metaheuristic for solving MAX-CMO. We analyze this strategy in two aspects: 1) from an optimization point of view, the strategy is tested on two different datasets, obtaining an error of 3.5% (over 2702 pairs) and 1.7% (over 161 pairs) with respect to optimal values, thus leading to highly accurate solutions in a simpler and less expensive way than exact algorithms; 2) in terms of protein structure classification, we conduct experiments on three datasets and show that it is feasible to detect structural similarities at SCOP's family and CATH's architecture levels using normalized overlap values. Some limitations and the role of normalization are outlined for doing classification at SCOP's fold level. Conclusion: We designed, implemented and tested a new tool for solving MAX-CMO, based on a well-known metaheuristic technique. The good balance between solution quality and computational effort makes it a valuable tool. Moreover, to the best of our knowledge, this is the first time the MAX-CMO measure is tested at SCOP's fold and CATH's architecture levels with encouraging results. Software is available for download at . PMID:18366735
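
    To make the quantity being optimized concrete, the following sketch evaluates the contact map overlap for one fixed alignment. It does not implement the variable neighborhood search from the paper; MAX-CMO would search over alignments for the largest such overlap. All names are hypothetical, the 7.5 distance threshold is an assumed value, and the normalization shown (twice the overlap divided by the total number of contacts) is one possible choice, since the abstract does not give the formula used.

```python
import math
from itertools import combinations


def contact_map(coords, threshold=7.5):
    """Return the set of residue index pairs (i, j), i < j, closer than threshold."""
    contacts = set()
    for i, j in combinations(range(len(coords)), 2):
        if math.dist(coords[i], coords[j]) <= threshold:
            contacts.add((i, j))
    return contacts


def overlap(contacts_a, contacts_b, alignment):
    """Count contacts of A that map onto contacts of B under a residue alignment.

    alignment maps residue indices of A to residue indices of B.
    """
    count = 0
    for i, j in contacts_a:
        if i in alignment and j in alignment:
            a, b = alignment[i], alignment[j]
            if (min(a, b), max(a, b)) in contacts_b:
                count += 1
    return count


def normalized_overlap(contacts_a, contacts_b, alignment):
    """Assumed normalization: 2 * overlap / (|contacts A| + |contacts B|)."""
    total = len(contacts_a) + len(contacts_b)
    if total == 0:
        return 0.0
    return 2 * overlap(contacts_a, contacts_b, alignment) / total
```

    With contacts {(0, 1), (1, 2), (0, 2)} in A, contacts {(0, 1), (1, 3)} in B, and the alignment {0: 0, 1: 1, 2: 3}, two of the three contacts in A are preserved, so the overlap is 2 and the normalized value under the assumed formula is 0.8.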

  1. Comparing ensemble learning methods based on decision tree classifiers for protein fold recognition.

    PubMed

    Bardsiri, Mahshid Khatibi; Eftekhari, Mahdi

    2014-01-01

    In this paper, some methods for ensemble learning of protein fold recognition based on a decision tree (DT) are compared and contrasted against each other over three datasets taken from the literature. According to previously reported studies, the features of the datasets are divided into some groups. Then, for each of these groups, three ensemble classifiers, namely, random forest, rotation forest and AdaBoost.M1 are employed. Also, some fusion methods are introduced for combining the ensemble classifiers obtained in the previous step. After this step, three classifiers are produced based on the combination of classifiers of types random forest, rotation forest and AdaBoost.M1. Finally, the three different classifiers achieved are combined to make an overall classifier. Experimental results show that the overall classifier obtained by the genetic algorithm (GA) weighting fusion method, is the best one in comparison to previously applied methods in terms of classification accuracy.
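
    The final fusion step described here, combining several classifiers with weights, can be illustrated by a weighted vote. The sketch assumes the weights are already available (in the paper they are obtained with a genetic algorithm, which is not reproduced here). The function name and the fold labels in the usage note are hypothetical.

```python
from collections import defaultdict


def weighted_vote(predictions, weights):
    """Return the label with the largest total weight.

    predictions: one predicted label per classifier.
    weights: one non-negative weight per classifier (assumed given).
    Ties are broken by choosing the smallest label in sorted order.
    """
    if len(predictions) != len(weights):
        raise ValueError("need exactly one weight per prediction")
    scores = defaultdict(float)
    for label, weight in zip(predictions, weights):
        scores[label] += weight
    return max(sorted(scores), key=scores.get)
```

    For example, if three classifiers predict "alpha", "beta" and "beta" with weights 0.6, 0.2 and 0.3, the result is "alpha", because its single vote outweighs the two lower-weighted votes for "beta".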

  2. EnzyNet: enzyme classification using 3D convolutional neural networks on spatial representation

    PubMed Central

    Amidi, Afshine; Megalooikonomou, Vasileios; Paragios, Nikos

    2018-01-01

    During the past decade, with the significant progress of computational power as well as ever-rising data availability, deep learning techniques became increasingly popular due to their excellent performance on computer vision problems. The size of the Protein Data Bank (PDB) has increased more than 15-fold since 1999, which enabled the expansion of models that aim at predicting enzymatic function via their amino acid composition. Amino acid sequence, however, is less conserved in nature than protein structure and therefore considered a less reliable predictor of protein function. This paper presents EnzyNet, a novel 3D convolutional neural networks classifier that predicts the Enzyme Commission number of enzymes based only on their voxel-based spatial structure. The spatial distribution of biochemical properties was also examined as complementary information. The two-layer architecture was investigated on a large dataset of 63,558 enzymes from the PDB and achieved an accuracy of 78.4% by exploiting only the binary representation of the protein shape. Code and datasets are available at https://github.com/shervinea/enzynet. PMID:29740518
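
    The binary representation of protein shape mentioned in this abstract can be pictured as an occupancy grid built from atom coordinates. The sketch below is a generic illustration and not the EnzyNet preprocessing (the published code is at the repository linked above). The function name, the default grid size of 32, and the centering and scaling scheme are all assumptions.

```python
import math


def voxelize(coords, grid_size=32):
    """Map 3D points to a binary occupancy grid of shape grid_size^3.

    Points are centered on their centroid and scaled so that the farthest
    point lies on the grid boundary (an assumed scheme).
    """
    n = len(coords)
    cx = sum(p[0] for p in coords) / n
    cy = sum(p[1] for p in coords) / n
    cz = sum(p[2] for p in coords) / n
    centered = [(x - cx, y - cy, z - cz) for x, y, z in coords]
    # Fall back to 1.0 when all points coincide, to avoid dividing by zero.
    radius = max(math.sqrt(x * x + y * y + z * z) for x, y, z in centered) or 1.0
    half = grid_size / 2

    def index(value):
        # Map [-radius, radius] to [0, grid_size) and clip to valid indices.
        return max(0, min(grid_size - 1, int((value / radius + 1) * half)))

    grid = [[[0] * grid_size for _ in range(grid_size)] for _ in range(grid_size)]
    for x, y, z in centered:
        grid[index(x)][index(y)][index(z)] = 1
    return grid
```

    A grid of this kind could then be passed to a 3D convolutional network as a single-channel volume. Additional channels for biochemical properties, which the abstract mentions as complementary information, are not covered by this sketch.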

  3. EnzyNet: enzyme classification using 3D convolutional neural networks on spatial representation.

    PubMed

    Amidi, Afshine; Amidi, Shervine; Vlachakis, Dimitrios; Megalooikonomou, Vasileios; Paragios, Nikos; Zacharaki, Evangelia I

    2018-01-01

    During the past decade, with the significant progress of computational power as well as ever-rising data availability, deep learning techniques became increasingly popular due to their excellent performance on computer vision problems. The size of the Protein Data Bank (PDB) has increased more than 15-fold since 1999, which enabled the expansion of models that aim at predicting enzymatic function via their amino acid composition. Amino acid sequence, however, is less conserved in nature than protein structure and therefore considered a less reliable predictor of protein function. This paper presents EnzyNet, a novel 3D convolutional neural networks classifier that predicts the Enzyme Commission number of enzymes based only on their voxel-based spatial structure. The spatial distribution of biochemical properties was also examined as complementary information. The two-layer architecture was investigated on a large dataset of 63,558 enzymes from the PDB and achieved an accuracy of 78.4% by exploiting only the binary representation of the protein shape. Code and datasets are available at https://github.com/shervinea/enzynet.
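
    The binary voxel representation of protein shape mentioned in this record can be sketched as follows. This is not EnzyNet's actual preprocessing (see the repository linked above for that); it is a simplified illustration that assumes atoms are given as (x, y, z) tuples, centers them on their centroid, scales them to a cubic grid, and marks occupied cells. The function name and the default grid size are assumptions.

```python
from typing import Iterable, List, Tuple

Coord = Tuple[float, float, float]


def voxelize(coords: Iterable[Coord], grid_size: int = 32) -> List[List[List[int]]]:
    """Map atom coordinates onto a binary occupancy grid of side grid_size."""
    points = list(coords)
    if not points:
        raise ValueError("at least one coordinate is required")

    # Center the structure on its centroid.
    n = len(points)
    centroid = [sum(p[i] for p in points) / n for i in range(3)]
    centered = [tuple(p[i] - centroid[i] for i in range(3)) for p in points]

    # Largest absolute offset along any axis; fall back to 1.0 for a single point.
    radius = max(max(abs(v) for v in p) for p in centered) or 1.0

    grid = [[[0] * grid_size for _ in range(grid_size)] for _ in range(grid_size)]
    half = grid_size / 2
    for p in centered:
        # Scale each axis from [-radius, radius] to [0, grid_size) and clamp.
        i, j, k = (min(grid_size - 1, int(v / radius * half + half)) for v in p)
        grid[i][j][k] = 1
    return grid


# Hypothetical three-atom example on a small grid.
grid = voxelize([(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0)], grid_size=8)
occupied = sum(cell for plane in grid for row in plane for cell in row)
```

    A grid of this kind is the sort of input a 3D convolutional network consumes; the choice of grid resolution trades spatial detail against memory.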

  4. Automated Gene Ontology annotation for anonymous sequence data.

    PubMed

    Hennig, Steffen; Groth, Detlef; Lehrach, Hans

    2003-07-01

    Gene Ontology (GO) is the most widely accepted attempt to construct a unified and structured vocabulary for the description of genes and their products in any organism. Annotation by GO terms is performed in most of the current genome projects, which besides generality has the advantage of being very convenient for computer based classification methods. However, direct use of GO in small sequencing projects is not easy, especially for species not commonly represented in public databases. We present a software package (GOblet), which performs annotation based on GO terms for anonymous cDNA or protein sequences. It uses the species independent GO structure and vocabulary together with a series of protein databases collected from various sites, to perform a detailed GO annotation by sequence similarity searches. The sensitivity and the reference protein sets can be selected by the user. GOblet runs automatically and is available as a public service on our web server. The paper also addresses the reliability of automated GO annotations by using a reference set of more than 6000 human proteins. The GOblet server is accessible at http://goblet.molgen.mpg.de.

  5. Development of a fast and simple test system for the semiquantitative protein detection in cerebrospinal liquids based on gold nanoparticles.

    PubMed

    Göbel, Gero; Lange, Robert; Hollidt, Jörg-Michael; Lisdat, Fred

    2016-01-01

    The fast and simple detection of increased protein concentrations in cerebrospinal liquids is desirable in emergency medicine, and it can help to avoid unnecessary laboratory work through an early classification of neurological diseases. Here a test system is developed which is based on the electrostatic interaction between negatively charged gold nanoparticles and proteins at pH values around 5. The test system can be adjusted in such a way that protein/nanoparticle aggregates are formed, leading to a red-shift in the absorption spectrum of the nanoparticle suspension. At concentrations above 500 mg/l the color of the suspension changes from red via violet toward blue in a rather small concentration range from 500 to 1000 mg/l. Furthermore, the influence of various parameters such as gold nanoparticle concentration, pH value and varying ion concentration in the sample on the test system is examined. Finally, cerebrospinal liquids of a larger number of patients have been analyzed. Copyright © 2015 Elsevier B.V. All rights reserved.

  6. From cheminformatics to structure-based design: Web services and desktop applications based on the NAOMI library.

    PubMed

    Bietz, Stefan; Inhester, Therese; Lauck, Florian; Sommer, Kai; von Behren, Mathias M; Fährrolfes, Rainer; Flachsenberg, Florian; Meyder, Agnes; Nittinger, Eva; Otto, Thomas; Hilbig, Matthias; Schomburg, Karen T; Volkamer, Andrea; Rarey, Matthias

    2017-11-10

    Nowadays, computational approaches are an integral part of life science research. Problems related to interpretation of experimental results, data analysis, or visualization tasks highly benefit from the achievements of the digital era. Simulation methods facilitate predictions of physicochemical properties and can assist in understanding macromolecular phenomena. Here, we will give an overview of the methods developed in our group that aim at supporting researchers from all life science areas. Based on state-of-the-art approaches from structural bioinformatics and cheminformatics, we provide software covering a wide range of research questions. Our all-in-one web service platform ProteinsPlus (http://proteins.plus) offers solutions for pocket and druggability prediction, hydrogen placement, structure quality assessment, ensemble generation, protein-protein interaction classification, and 2D-interaction visualization. Additionally, we provide a software package that contains tools targeting cheminformatics problems like file format conversion, molecule data set processing, SMARTS editing, fragment space enumeration, and ligand-based virtual screening. Furthermore, it also includes structural bioinformatics solutions for inverse screening, binding site alignment, and searching interaction patterns across structure libraries. The software package is available at http://software.zbh.uni-hamburg.de. Copyright © 2017 The Authors. Published by Elsevier B.V. All rights reserved.

  7. Variability of the protein sequences of lcrV between epidemic and atypical rhamnose-positive strains of Yersinia pestis.

    PubMed

    Anisimov, Andrey P; Panfertsev, Evgeniy A; Svetoch, Tat'yana E; Dentovskaya, Svetlana V

    2007-01-01

    Sequencing of lcrV genes and comparison of the deduced amino acid sequences from ten Y. pestis strains belonging mostly to the group of atypical rhamnose-positive isolates (non-pestis subspecies or pestoides group) showed that the LcrV proteins analyzed could be classified into five sequence types. This classification was based on major amino acid polymorphisms among LcrV proteins in the four "hot points" of the protein sequences. Some additional minor polymorphisms were found throughout these sequence types. The "hot points" corresponded to amino acids 18 (Lys --> Asn), 72 (Lys --> Arg), 273 (Cys --> Ser), and 324-326 (Ser-Gly-Lys --> Arg) in the LcrV sequence of the reference Y. pestis strain CO92. One possible explanation for polymorphism in amino acid sequences of LcrV among different strains is that strain-specific variation resulted from adaptation of the plague pathogen to different rodent and lagomorph hosts.

  8. An overview of the structures of protein-DNA complexes

    PubMed Central

    Luscombe, Nicholas M; Austin, Susan E; Berman, Helen M; Thornton, Janet M

    2000-01-01

    On the basis of a structural analysis of 240 protein-DNA complexes contained in the Protein Data Bank (PDB), we have classified the DNA-binding proteins involved into eight different structural/functional groups, which are further classified into 54 structural families. Here we present this classification and review the functions, structures and binding interactions of these protein-DNA complexes. PMID:11104519

  9. The COG database: a tool for genome-scale analysis of protein functions and evolution

    PubMed Central

    Tatusov, Roman L.; Galperin, Michael Y.; Natale, Darren A.; Koonin, Eugene V.

    2000-01-01

    Rational classification of proteins encoded in sequenced genomes is critical for making the genome sequences maximally useful for functional and evolutionary studies. The database of Clusters of Orthologous Groups of proteins (COGs) is an attempt at a phylogenetic classification of the proteins encoded in 21 complete genomes of bacteria, archaea and eukaryotes (http://www.ncbi.nlm.nih.gov/COG). The COGs were constructed by applying the criterion of consistency of genome-specific best hits to the results of an exhaustive comparison of all protein sequences from these genomes. The database comprises 2091 COGs that include 56–83% of the gene products from each of the complete bacterial and archaeal genomes and ~35% of those from the yeast Saccharomyces cerevisiae genome. The COG database is accompanied by the COGNITOR program that is used to fit new proteins into the COGs and can be applied to functional and phylogenetic annotation of newly sequenced genomes. PMID:10592175
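
    One simplified reading of the "consistency of genome-specific best hits" criterion is sketched below. It assumes a precomputed table of best hits per (protein, target genome) and looks for triples of proteins from three different genomes that are pairwise mutual best hits. The data structures and names are hypothetical, and the actual COG construction procedure is more permissive and includes further steps not shown here.

```python
from itertools import combinations
from typing import Dict, List, Tuple


def best_hit_triangles(
    best_hit: Dict[Tuple[str, str], str],
    genome_of: Dict[str, str],
) -> List[Tuple[str, str, str]]:
    """Find protein triples from three genomes that are pairwise mutual best hits."""

    def mutual(a: str, b: str) -> bool:
        # a's best hit in b's genome is b, and b's best hit in a's genome is a.
        return (best_hit.get((a, genome_of[b])) == b
                and best_hit.get((b, genome_of[a])) == a)

    triangles = []
    for a, b, c in combinations(sorted(genome_of), 3):
        three_genomes = len({genome_of[a], genome_of[b], genome_of[c]}) == 3
        if three_genomes and mutual(a, b) and mutual(b, c) and mutual(a, c):
            triangles.append((a, b, c))
    return triangles


# Hypothetical toy data: four proteins across genomes A, B and C.
genome_of = {"a1": "A", "b1": "B", "c1": "C", "c2": "C"}
best_hit = {
    ("a1", "B"): "b1", ("b1", "A"): "a1",
    ("a1", "C"): "c1", ("c1", "A"): "a1",
    ("b1", "C"): "c1", ("c1", "B"): "b1",
    ("c2", "A"): "a1",  # one-directional hit, so c2 joins no triangle
}

clusters = best_hit_triangles(best_hit, genome_of)
```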

  10. Sensitivity of Breast Tumors to Oncolytic Viruses

    DTIC Science & Technology

    2006-08-01

    therapies for breast cancer based on the oncolytic virus, vesicular stomatitis virus (VSV). Studies have shown that matrix (M) protein mutants of VSV, such...more resistant to VSV-induced cytopathic effect than breast cancer cells. However, in syngeneic breast cancer system in vivo, rM51R-M virus is only...interleukin 12, breast cancer, interferon

  11. Structural Organization and Strain Variation in the Genome of Varicella Zoster Virus

    DTIC Science & Technology

    1984-10-23

    Zoster 6 Growth of VZV in tissue culture 9 Structure and proteins of VZV 15 Structure of HSV DNA 20 Classification of herpesviruses based on DNA...structure 28 Strain variation in herpesvirus DNA 31 VZV DNA 33 Specific aims 36 II. MATERIALS AND METHODS 38 Cells and viruses 38 Isolation of virus...endonuclease fragments by colony hybridization 106 21. Selected methods of restriction endonuclease mapping .... 109 22. Identification of

  12. Rapid on-line detection and grading of wooden breast myopathy in chicken fillets by near-infrared spectroscopy

    PubMed Central

    Veiseth-Kent, Eva; Høst, Vibeke; Løvland, Atle

    2017-01-01

    The main objective of this work was to develop a method for rapid and non-destructive detection and grading of wooden breast (WB) syndrome in chicken breast fillets. Near-infrared (NIR) spectroscopy was chosen as detection method, and an industrial NIR scanner was applied and tested for large scale on-line detection of the syndrome. Two approaches were evaluated for discrimination of WB fillets: 1) Linear discriminant analysis based on NIR spectra only, and 2) a regression model for protein was made based on NIR spectra and the estimated concentrations of protein were used for discrimination. A sample set of 197 fillets was used for training and calibration. A test set was recorded under industrial conditions and contained spectra from 79 fillets. The classification methods obtained 99.5–100% correct classification of the calibration set and 100% correct classification of the test set. The NIR scanner was then installed in a commercial chicken processing plant and could detect incidence rates of WB in large batches of fillets. Examples of incidence are shown for three broiler flocks where a high number of fillets (9063, 6330 and 10483) were effectively measured. Prevalence of WB of 0.1%, 6.6% and 8.5% were estimated for these flocks based on the complete sample volumes. Such an on-line system can be used to alleviate the challenges WB represents to the poultry meat industry. It enables automatic quality sorting of chicken fillets to different product categories. Manual laborious grading can be avoided. Incidences of WB from different farms and flocks can be tracked and information can be used to understand and point out main causes for WB in the chicken production. This knowledge can be used to improve the production procedures and reduce today’s extensive occurrence of WB. PMID:28278170

  13. EBprot: Statistical analysis of labeling-based quantitative proteomics data.

    PubMed

    Koh, Hiromi W L; Swa, Hannah L F; Fermin, Damian; Ler, Siok Ghee; Gunaratne, Jayantha; Choi, Hyungwon

    2015-08-01

    Labeling-based proteomics is a powerful method for detection of differentially expressed proteins (DEPs). The current data analysis platform typically relies on protein-level ratios, which is obtained by summarizing peptide-level ratios for each protein. In shotgun proteomics, however, some proteins are quantified with more peptides than others, and this reproducibility information is not incorporated into the differential expression (DE) analysis. Here, we propose a novel probabilistic framework EBprot that directly models the peptide-protein hierarchy and rewards the proteins with reproducible evidence of DE over multiple peptides. To evaluate its performance with known DE states, we conducted a simulation study to show that the peptide-level analysis of EBprot provides better receiver-operating characteristic and more accurate estimation of the false discovery rates than the methods based on protein-level ratios. We also demonstrate superior classification performance of peptide-level EBprot analysis in a spike-in dataset. To illustrate the wide applicability of EBprot in different experimental designs, we applied EBprot to a dataset for lung cancer subtype analysis with biological replicates and another dataset for time course phosphoproteome analysis of EGF-stimulated HeLa cells with multiplexed labeling. Through these examples, we show that the peptide-level analysis of EBprot is a robust alternative to the existing statistical methods for the DE analysis of labeling-based quantitative datasets. The software suite is freely available on the Sourceforge website http://ebprot.sourceforge.net/. All MS data have been deposited in the ProteomeXchange with identifier PXD001426 (http://proteomecentral.proteomexchange.org/dataset/PXD001426/). © 2015 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.

  14. SVM-Prot 2016: A Web-Server for Machine Learning Prediction of Protein Functional Families from Sequence Irrespective of Similarity.

    PubMed

    Li, Ying Hong; Xu, Jing Yu; Tao, Lin; Li, Xiao Feng; Li, Shuang; Zeng, Xian; Chen, Shang Ying; Zhang, Peng; Qin, Chu; Zhang, Cheng; Chen, Zhe; Zhu, Feng; Chen, Yu Zong

    2016-01-01

    Knowledge of protein function is important for biological, medical and therapeutic studies, but many proteins are still unknown in function. There is a need for improved functional prediction methods. Our SVM-Prot web-server employed a machine learning method for predicting protein functional families from protein sequences irrespective of similarity, which complemented those similarity-based and other methods in predicting diverse classes of proteins including the distantly-related proteins and homologous proteins of different functions. Since its publication in 2003, we made major improvements to SVM-Prot with (1) expanded coverage from 54 to 192 functional families, (2) more diverse protein descriptors for protein representation, (3) improved predictive performances due to the use of more enriched training datasets and a greater variety of protein descriptors, (4) a newly integrated BLAST analysis option for assessing proteins in the SVM-Prot predicted functional families that were similar in sequence to a query protein, and (5) a newly added batch submission option for supporting the classification of multiple proteins. Moreover, 2 more machine learning approaches, K nearest neighbor and probabilistic neural networks, were added for facilitating collective assessment of protein functions by multiple methods. SVM-Prot can be accessed at http://bidd2.nus.edu.sg/cgi-bin/svmprot/svmprot.cgi.

  15. [Histological diagnosis and complications of celiac disease. Update according to the new S2k guidelines].

    PubMed

    Aust, D E; Bläker, H

    2015-03-01

    Celiac disease is a relatively common immunological systemic disease triggered by the protein gluten in genetically predisposed individuals. Classical symptoms like chronic diarrhea, steatorrhea, weight loss and growth retardation are nowadays relatively uncommon. Diagnostic workup includes serological tests for IgA antibodies against tissue transglutaminase 2 (anti-TG2-IgA) and total IgA and histology of duodenal biopsies. Histomorphological classification should be done according to the modified Marsh-Oberhuber classification. Diagnosis of celiac disease should be based on serological, clinical, and histological findings. The only treatment is a life-long gluten-free diet. Unchanged or recurrent symptoms under gluten-free diet may indicate refractory celiac disease. Enteropathy-associated T-cell lymphoma and adenocarcinomas of the small intestine are known complications of celiac disease.

  16. Mycobacteriophage genome database.

    PubMed

    Joseph, Jerrine; Rajendran, Vasanthi; Hassan, Sameer; Kumar, Vanaja

    2011-01-01

    Mycobacteriophage genome database (MGDB) is an exclusive repository of the 64 completely sequenced mycobacteriophages with annotated information. It is a comprehensive compilation of the various gene parameters captured from several databases pooled together to empower mycobacteriophage researchers. The MGDB (Version No.1.0) comprises 6086 genes from 64 mycobacteriophages classified into 72 families based on the ACLAME database. Manual curation was aided by information available from public databases, which was enriched further by analysis. Its web interface allows browsing as well as querying the classification. The main objective is to collect and organize the complexity inherent to mycobacteriophage protein classification in a rational way. The other objective is to browse the existing and new genomes and describe their functional annotation. The database is available for free at http://mpgdb.ibioinformatics.org/mpgdb.php.

  17. Improving prediction of heterodimeric protein complexes using combination with pairwise kernel.

    PubMed

    Ruan, Peiying; Hayashida, Morihiro; Akutsu, Tatsuya; Vert, Jean-Philippe

    2018-02-19

    Since many proteins become functional only after they interact with their partner proteins and form protein complexes, it is essential to identify the sets of proteins that form complexes. Therefore, several computational methods have been proposed to predict complexes from the topology and structure of experimental protein-protein interaction (PPI) networks. These methods work well to predict complexes involving at least three proteins, but generally fail at identifying complexes involving only two different proteins, called heterodimeric complexes or heterodimers. There is however an urgent need for efficient methods to predict heterodimers, since the majority of known protein complexes are precisely heterodimers. In this paper, we use three promising kernel functions, Min kernel and two pairwise kernels, which are Metric Learning Pairwise Kernel (MLPK) and Tensor Product Pairwise Kernel (TPPK). We also consider the normalization forms of Min kernel. Then, we combine Min kernel or its normalization form and one of the pairwise kernels by plugging. We applied kernels based on PPI, domain, phylogenetic profile, and subcellular localization properties to predicting heterodimers. Then, we evaluate our method by employing C-Support Vector Classification (C-SVC), carrying out 10-fold cross-validation, and calculating the average F-measures. The results suggest that the combination of normalized-Min-kernel and MLPK leads to the best F-measure and improved the performance of our previous work, which had been the best existing method so far. We propose new methods to predict heterodimers, using a machine learning-based approach. We train a support vector machine (SVM) to discriminate interacting vs non-interacting protein pairs, based on information extracted from PPI, domain, phylogenetic profiles and subcellular localization. We evaluate in detail new kernel functions to encode these data, and report prediction performance that outperforms the state-of-the-art.
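
    The kernels named in this record can be illustrated with a short sketch. It uses the standard textbook forms of TPPK and MLPK and assumes the Min kernel is the sum of element-wise minima over non-negative feature vectors; the paper's exact definitions and feature encodings may differ. "Plugging" is read here as passing a base kernel over single proteins into a pairwise kernel over protein pairs. All function names and feature vectors are hypothetical.

```python
import math
from typing import Callable, Sequence, Tuple

Vec = Sequence[float]
Kernel = Callable[[Vec, Vec], float]
Pair = Tuple[Vec, Vec]


def min_kernel(x: Vec, y: Vec) -> float:
    """Sum of element-wise minima (assumed form of the Min kernel)."""
    return sum(min(a, b) for a, b in zip(x, y))


def normalized(k: Kernel) -> Kernel:
    """Wrap a kernel so that k(x, x) == 1 for any non-zero x."""
    def k_norm(x: Vec, y: Vec) -> float:
        denom = math.sqrt(k(x, x) * k(y, y))
        return k(x, y) / denom if denom > 0 else 0.0
    return k_norm


def tppk(k: Kernel, p1: Pair, p2: Pair) -> float:
    """Tensor Product Pairwise Kernel built on base kernel k."""
    (a, b), (c, d) = p1, p2
    return k(a, c) * k(b, d) + k(a, d) * k(b, c)


def mlpk(k: Kernel, p1: Pair, p2: Pair) -> float:
    """Metric Learning Pairwise Kernel built on base kernel k."""
    (a, b), (c, d) = p1, p2
    return (k(a, c) - k(a, d) - k(b, c) + k(b, d)) ** 2


# Hypothetical non-negative feature vectors for four proteins.
p, q = [1, 0, 2], [0, 1, 1]
r, s = [1, 1, 0], [2, 0, 1]

tppk_value = tppk(min_kernel, (p, q), (r, s))
mlpk_value = mlpk(normalized(min_kernel), (p, q), (r, s))
```

    Both pairwise kernels are symmetric in the order of the proteins within a pair, which is why they suit unordered pairs such as candidate heterodimers.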

  18. Atomic analysis of protein-protein interfaces with known inhibitors: the 2P2I database.

    PubMed

    Bourgeas, Raphaël; Basse, Marie-Jeanne; Morelli, Xavier; Roche, Philippe

    2010-03-09

    In the last decade, the inhibition of protein-protein interactions (PPIs) has emerged from both academic and private research as a new way to modulate the activity of proteins. Inhibitors of these original interactions are certainly the next generation of highly innovative drugs that will reach the market in the next decade. However, in silico design of such compounds still remains challenging. Here we describe this particular PPI chemical space through the presentation of 2P2I(DB), a hand-curated database dedicated to the structure of PPIs with known inhibitors. We have analyzed protein/protein and protein/inhibitor interfaces in terms of geometrical parameters, atom and residue properties, buried accessible surface area and other biophysical parameters. The interfaces found in 2P2I(DB) were then compared to those of representative datasets of heterodimeric complexes. We propose a new classification of PPIs with known inhibitors into two classes depending on the number of segments present at the interface and corresponding to either a single secondary structure element or to a more globular interacting domain. 2P2I(DB) complexes share global shape properties with standard transient heterodimer complexes, but their accessible surface areas are significantly smaller. No major conformational changes are seen between the different states of the proteins. The interfaces are more hydrophobic than general PPI interfaces, with fewer charged residues and more non-polar atoms. Finally, fifty percent of the complexes in the 2P2I(DB) dataset possess more hydrogen bonds than typical protein-protein complexes. Potential areas of study for the future are proposed, which include a new classification system consisting of specific families and the identification of PPI targets with high druggability potential based on key descriptors of the interaction. The 2P2I database stores structural information about PPIs with known inhibitors and provides a useful tool for biologists to assess the potential druggability of their interfaces. The database can be accessed at http://2p2idb.cnrs-mrs.fr.

  19. Regulation of IAP (Inhibitor of Apoptosis) Gene Expression by the p53 Tumor Suppressor Protein

    DTIC Science & Technology

    2005-05-01

    adenovirus, gene therapy, polymorphism...averaged results of three independent experiments, with standard error. Right panel: Level of p53 in infected cells using the antibody Ab-6 (Calbiochem...with highly purified mitochondria as described in (2). The arrow marks oligomerized BAK. The right panel depicts the purity of BMH Crosslinked Mito

  20. 21 CFR 866.5425 - Alpha-2-glycoproteins immunological test system.

    Code of Federal Regulations, 2010 CFR

    2010-04-01

    ... the alpha-2-glycoproteins (a group of plasma proteins found in the alpha-2 group when subjected to... some cancers and genetically inherited deficiencies of these plasma proteins. (b) Classification. Class...

  1. 21 CFR 866.5425 - Alpha-2-glycoproteins immunological test system.

    Code of Federal Regulations, 2012 CFR

    2012-04-01

    ... the alpha-2-glycoproteins (a group of plasma proteins found in the alpha-2 group when subjected to... some cancers and genetically inherited deficiencies of these plasma proteins. (b) Classification. Class...

  2. 21 CFR 866.5425 - Alpha-2-glycoproteins immunological test system.

    Code of Federal Regulations, 2011 CFR

    2011-04-01

    ... the alpha-2-glycoproteins (a group of plasma proteins found in the alpha-2 group when subjected to... some cancers and genetically inherited deficiencies of these plasma proteins. (b) Classification. Class...

  3. 21 CFR 866.5425 - Alpha-2-glycoproteins immunological test system.

    Code of Federal Regulations, 2013 CFR

    2013-04-01

    ... the alpha-2-glycoproteins (a group of plasma proteins found in the alpha-2 group when subjected to... some cancers and genetically inherited deficiencies of these plasma proteins. (b) Classification. Class...

  4. 21 CFR 866.5425 - Alpha-2-glycoproteins immunological test system.

    Code of Federal Regulations, 2014 CFR

    2014-04-01

    ... the alpha-2-glycoproteins (a group of plasma proteins found in the alpha-2 group when subjected to... some cancers and genetically inherited deficiencies of these plasma proteins. (b) Classification. Class...

  5. Identification of serum proteins discriminating colorectal cancer patients and healthy controls using surface-enhanced laser desorption ionisation-time of flight mass spectrometry.

    PubMed

    Engwegen, Judith Y M N; Helgason, Helgi H; Cats, Annemieke; Harris, Nathan; Bonfrer, Johannes M G; Schellens, Jan H M; Beijnen, Jos H

    2006-03-14

    To detect new serum biomarkers for colorectal cancer (CRC) by serum protein profiling with surface-enhanced laser desorption ionisation-time of flight mass spectrometry (SELDI-TOF MS). Two independent serum sample sets were analysed separately with the ProteinChip technology (set A: 40 CRC+49 healthy controls; set B: 37 CRC+31 healthy controls), using chips with a weak cation exchange moiety and buffer pH 5. Discriminative power of differentially expressed proteins was assessed with a classification tree algorithm. Sensitivities and specificities of the generated classification trees were obtained by blindly applying data from set A to the generated trees from set B and vice versa. CRC serum protein profiles were also compared with those from breast, ovarian, prostate, and non-small cell lung cancer. Mass-to-charge ratios (m/z) 3.1×10³, 3.3×10³, 4.5×10³, 6.6×10³ and 28×10³ were used as classifiers in the best-performing classification trees. Tree sensitivities and specificities were between 65% and 90%. Most of these discriminative m/z values were also different in the other tumour types investigated. M/z 3.3×10³, the main classifier in most trees, was a doubly charged form of the 6.6×10³-Da protein. The latter was identified as apolipoprotein C-I. M/z 3.1×10³ was identified as an N-terminal fragment of albumin, and m/z 28×10³ as apolipoprotein A-I. SELDI-TOF MS followed by classification tree pattern analysis is a suitable technique for finding new serum markers for CRC. Biomarkers can be identified and reproducibly detected in independent sample sets with high sensitivities and specificities. Although not specific for CRC, these biomarkers have a potential role in disease and treatment monitoring.
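
    The relationship between the peak at an m/z of about 3,300 and the protein of about 6,600 Da follows from simple arithmetic, sketched below. The sketch assumes positive-mode ions carrying one proton per charge and uses the rounded mass from the abstract rather than a measured value; the function name is hypothetical.

```python
PROTON_MASS_DA = 1.007276  # mass of a proton in daltons


def mass_to_charge(neutral_mass_da: float, charge: int) -> float:
    """m/z of a protonated ion [M + zH]^z+ for a molecule of the given mass."""
    if charge < 1:
        raise ValueError("charge must be a positive integer")
    return (neutral_mass_da + charge * PROTON_MASS_DA) / charge


# Rounded mass taken from the abstract (about 6.6 kDa).
singly_charged = mass_to_charge(6600.0, 1)
doubly_charged = mass_to_charge(6600.0, 2)
```

    With these inputs the doubly charged ion lands near half the singly charged value, which is why both peaks can act as classifiers for the same protein.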

  6. Classification of Phylogenetic Profiles for Protein Function Prediction: An SVM Approach

    NASA Astrophysics Data System (ADS)

    Kotaru, Appala Raju; Joshi, Ramesh C.

    Predicting the function of an uncharacterized protein is a major challenge in the post-genomic era due to the problem's complexity and scale. Knowledge of protein function is a crucial link in the development of new drugs, better crops, and even biochemicals such as biofuels. Recently, numerous high-throughput experimental procedures have been invented to investigate the mechanisms leading to the accomplishment of a protein’s function, and the phylogenetic profile is one of them. A phylogenetic profile is a way of representing a protein that encodes its evolutionary history. In this paper we propose a method for classifying phylogenetic profiles using a supervised machine learning method, support vector machine classification with a radial basis function kernel, to identify functionally linked proteins. We experimentally evaluated the performance of the classifier with the linear and polynomial kernels and compared the results with the existing tree kernel. In our study we used proteins of the budding yeast Saccharomyces cerevisiae genome. We generated the phylogenetic profiles of 2465 yeast genes and used the functional annotations available in the MIPS database. Our experiments show that the radial basis kernel performs similarly to the polynomial kernel in some functional classes, that both are better than the linear and tree kernels, and that overall the radial basis kernel outperformed the polynomial, linear and tree kernels. In analyzing these results we show that it is feasible to use an SVM classifier with a radial basis function kernel to predict gene functionality from phylogenetic profiles.
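
    The kernels compared in this record can be illustrated on toy phylogenetic profiles. The sketch assumes a profile is a binary vector marking the presence (1) or absence (0) of a homolog in each reference genome; the profiles, the parameter values and the function names are hypothetical, and the tree kernel is not shown.

```python
import math
from typing import Sequence

Profile = Sequence[int]


def linear_kernel(p: Profile, q: Profile) -> float:
    """Dot product; for binary profiles, the number of shared genomes."""
    return float(sum(a * b for a, b in zip(p, q)))


def polynomial_kernel(p: Profile, q: Profile, degree: int = 2, coef0: float = 1.0) -> float:
    """Polynomial kernel (dot + coef0) ** degree."""
    return (linear_kernel(p, q) + coef0) ** degree


def rbf_kernel(p: Profile, q: Profile, gamma: float = 0.5) -> float:
    """Radial basis function kernel exp(-gamma * squared distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(p, q))
    return math.exp(-gamma * squared_distance)


# Hypothetical profiles over six reference genomes.
gene_a = [1, 1, 0, 1, 0, 0]
gene_b = [1, 1, 0, 0, 0, 1]  # differs from gene_a in two genomes

similarity_linear = linear_kernel(gene_a, gene_b)
similarity_poly = polynomial_kernel(gene_a, gene_b)
similarity_rbf = rbf_kernel(gene_a, gene_b)
```

    For binary profiles the squared Euclidean distance equals the Hamming distance, so the RBF kernel decays with the number of genomes in which two profiles disagree.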

  7. Impact of genomics on the understanding of microbial evolution and classification: the importance of Darwin's views on classification.

    PubMed

    Gupta, Radhey S

    2016-07-01

    Analyses of genome sequences, by some approaches, suggest that the widespread occurrence of horizontal gene transfers (HGTs) in prokaryotes disguises their evolutionary relationships and have led to questioning of the Darwinian model of evolution for prokaryotes. These inferences are critically examined in the light of comparative genome analysis, characteristic synapomorphies, phylogenetic trees and Darwin's views on examining evolutionary relationships. Genome sequences are enabling discovery of numerous molecular markers (synapomorphies) such as conserved signature indels (CSIs) and conserved signature proteins (CSPs), which are distinctive characteristics of different prokaryotic taxa. Based on these molecular markers, exhibiting high degree of specificity and predictive ability, numerous prokaryotic taxa of different ranks, currently identified based on the 16S rRNA gene trees, can now be reliably demarcated in molecular terms. Within all studied groups, multiple CSIs and CSPs have been identified for successive nested clades providing reliable information regarding their hierarchical relationships and these inferences are not affected by HGTs. These results strongly support Darwin's views on evolution and classification and supplement the current phylogenetic framework based on 16S rRNA in important respects. The identified molecular markers provide important means for developing novel diagnostics, therapeutics and for functional studies providing important insights regarding prokaryotic taxa. © FEMS 2016. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com.

  8. Classification of infectious bursal disease virus into genogroups.

    PubMed

    Michel, Linda O; Jackwood, Daral J

    2017-12-01

    Infectious bursal disease virus (IBDV) causes infectious bursal disease (IBD), an immunosuppressive disease of poultry. The current classification scheme of IBDV is confusing because it is based on antigenic types (variant and classical) as well as pathotypes. Many of the amino acid changes differentiating these various classifications are found in a hypervariable region of the capsid protein VP2 (hvVP2), the major host protective antigen. Data from this study were used to propose a new classification scheme for IBDV based solely on genogroups identified from phylogenetic analysis of the hvVP2 of strains worldwide. Seven major genogroups were identified, some of which are geographically restricted and others that have global dispersion, such as genogroup 1. Genogroup 2 viruses are predominately distributed in North America, while genogroup 3 viruses are most often identified on other continents. Additionally, we have identified a population of genogroup 3 vvIBDV isolates that have an amino acid change from alanine to threonine at position 222 while maintaining other residues conserved in this genogroup (I242, I256 and I294). A222T is an important mutation because amino acid 222 is located in the first of four surface loops of hvVP2. A similar shift from proline to threonine at 222 is believed to play a role in the significant antigenic change of the genogroup 2 IBDV strains, suggesting that antigenic drift may be occurring in genogroup 3, possibly in response to antigenic pressure from vaccination.

  9. MATtrack: A MATLAB-Based Quantitative Image Analysis Platform for Investigating Real-Time Photo-Converted Fluorescent Signals in Live Cells.

    PubMed

    Courtney, Jane; Woods, Elena; Scholz, Dimitri; Hall, William W; Gautier, Virginie W

    2015-01-01

    We introduce here MATtrack, an open source MATLAB-based computational platform developed to process multi-Tiff files produced by a photo-conversion time lapse protocol for live cell fluorescent microscopy. MATtrack automatically performs a series of steps required for image processing, including extraction and import of numerical values from Multi-Tiff files, red/green image classification using gating parameters, noise filtering, background extraction, contrast stretching and temporal smoothing. MATtrack also integrates a series of algorithms for quantitative image analysis enabling the construction of mean and standard deviation images, clustering and classification of subcellular regions and injection point approximation. In addition, MATtrack features a simple user interface, which enables monitoring of Fluorescent Signal Intensity in multiple Regions of Interest, over time. The latter encapsulates a region growing method to automatically delineate the contours of Regions of Interest selected by the user, and performs background and regional Average Fluorescence Tracking, and automatic plotting. Finally, MATtrack computes convenient visualization and exploration tools including a migration map, which provides an overview of the protein intracellular trajectories and accumulation areas. In conclusion, MATtrack is an open source MATLAB-based software package tailored to facilitate the analysis and visualization of large data files derived from real-time live cell fluorescent microscopy using photoconvertible proteins. It is flexible, user friendly, compatible with Windows, Mac, and Linux, and a wide range of data acquisition software. MATtrack is freely available for download at eleceng.dit.ie/courtney/MATtrack.zip.

  10. MATtrack: A MATLAB-Based Quantitative Image Analysis Platform for Investigating Real-Time Photo-Converted Fluorescent Signals in Live Cells

    PubMed Central

    Courtney, Jane; Woods, Elena; Scholz, Dimitri; Hall, William W.; Gautier, Virginie W.

    2015-01-01

    We introduce here MATtrack, an open source MATLAB-based computational platform developed to process multi-Tiff files produced by a photo-conversion time lapse protocol for live cell fluorescent microscopy. MATtrack automatically performs a series of steps required for image processing, including extraction and import of numerical values from Multi-Tiff files, red/green image classification using gating parameters, noise filtering, background extraction, contrast stretching and temporal smoothing. MATtrack also integrates a series of algorithms for quantitative image analysis enabling the construction of mean and standard deviation images, clustering and classification of subcellular regions and injection point approximation. In addition, MATtrack features a simple user interface, which enables monitoring of Fluorescent Signal Intensity in multiple Regions of Interest, over time. The latter encapsulates a region growing method to automatically delineate the contours of Regions of Interest selected by the user, and performs background and regional Average Fluorescence Tracking, and automatic plotting. Finally, MATtrack computes convenient visualization and exploration tools including a migration map, which provides an overview of the protein intracellular trajectories and accumulation areas. In conclusion, MATtrack is an open source MATLAB-based software package tailored to facilitate the analysis and visualization of large data files derived from real-time live cell fluorescent microscopy using photoconvertible proteins. It is flexible, user friendly, compatible with Windows, Mac, and Linux, and a wide range of data acquisition software. MATtrack is freely available for download at eleceng.dit.ie/courtney/MATtrack.zip. PMID:26485569

  11. Functional Proteomic Analysis of Human Nucleolus

    PubMed Central

    Scherl, Alexander; Couté, Yohann; Déon, Catherine; Callé, Aleth; Kindbeiter, Karine; Sanchez, Jean-Charles; Greco, Anna; Hochstrasser, Denis; Diaz, Jean-Jacques

    2002-01-01

    The notion of a “plurifunctional” nucleolus is now well established. However, molecular mechanisms underlying the biological processes occurring within this nuclear domain remain only partially understood. As a first step in elucidating these mechanisms we have carried out a proteomic analysis to draw up a list of proteins present within nucleoli of HeLa cells. This analysis allowed the identification of 213 different nucleolar proteins. This catalog complements that of the 271 proteins obtained recently by others, giving a total of ∼350 different nucleolar proteins. Functional classification of these proteins allowed outlining several biological processes taking place within nucleoli. Bioinformatic analyses permitted the assignment of hypothetical functions for 43 proteins for which no functional information is available. Notably, a role in ribosome biogenesis was proposed for 31 proteins. More generally, this functional classification reinforces the plurifunctional nature of nucleoli and provides convincing evidence that nucleoli may play a central role in the control of gene expression. Finally, this analysis supports the recent demonstration of a coupling of transcription and translation in higher eukaryotes. PMID:12429849

  12. Structural classification of small, disulfide-rich protein domains.

    PubMed

    Cheek, Sara; Krishna, S Sri; Grishin, Nick V

    2006-05-26

    Disulfide-rich domains are small protein domains whose global folds are stabilized primarily by the formation of disulfide bonds and, to a much lesser extent, by secondary structure and hydrophobic interactions. Disulfide-rich domains perform a wide variety of roles functioning as growth factors, toxins, enzyme inhibitors, hormones, pheromones, allergens, etc. These domains are commonly found both as independent (single-domain) proteins and as domains within larger polypeptides. Here, we present a comprehensive structural classification of approximately 3000 small, disulfide-rich protein domains. We find that these domains can be arranged into 41 fold groups on the basis of structural similarity. Our fold groups, which describe broader structural relationships than existing groupings of these domains, bring together representatives with previously unacknowledged similarities; 18 of the 41 fold groups include domains from several SCOP folds. Within the fold groups, the domains are assembled into families of homologs. We define 98 families of disulfide-rich domains, some of which include newly detected homologs, particularly among knottin-like domains. On the basis of this classification, we have examined cases of convergent and divergent evolution of functions performed by disulfide-rich proteins. Disulfide bonding patterns in these domains are also evaluated. Reducible disulfide bonding patterns are much less frequent, while symmetric disulfide bonding patterns are more common than expected from random considerations. Examples of variations in disulfide bonding patterns found within families and fold groups are discussed.

  13. Classification and Lineage Tracing of SH2 Domains Throughout Eukaryotes.

    PubMed

    Liu, Bernard A

    2017-01-01

    Today there exists a rapidly expanding number of sequenced genomes. Cataloging protein interaction domains such as the Src Homology 2 (SH2) domain across these various genomes can be accomplished with ease due to existing algorithms and prediction models. An evolutionary analysis of SH2 domains provides a step towards understanding how SH2 proteins integrated with existing signaling networks to position phosphotyrosine signaling as a crucial driver of robust cellular communication networks in metazoans. However, organizing and tracing SH2 domains across organisms and understanding their evolutionary trajectory remains a challenge. This chapter describes several methodologies towards analyzing the evolutionary trajectory of SH2 domains including a global SH2 domain classification system, which facilitates annotation of new SH2 sequences essential for tracing the lineage of SH2 domains throughout eukaryote evolution. This classification utilizes a combination of sequence homology, protein domain architecture and the boundary positions between introns and exons within the SH2 domain or genes encoding these domains. Discrete SH2 families can then be traced across various genomes to provide insight into their origins. Furthermore, additional methods for examining potential mechanisms for divergence of SH2 domains from structural changes to alterations in the protein domain content and genome duplication will be discussed. Therefore a better understanding of SH2 domain evolution may enhance our insight into the emergence of phosphotyrosine signaling and the expansion of protein interaction domains.

  14. FastRNABindR: Fast and Accurate Prediction of Protein-RNA Interface Residues.

    PubMed

    El-Manzalawy, Yasser; Abbas, Mostafa; Malluhi, Qutaibah; Honavar, Vasant

    2016-01-01

    A wide range of biological processes, including regulation of gene expression, protein synthesis, and replication and assembly of many viruses are mediated by RNA-protein interactions. However, experimental determination of the structures of protein-RNA complexes is expensive and technically challenging. Hence, a number of computational tools have been developed for predicting protein-RNA interfaces. Some of the state-of-the-art protein-RNA interface predictors rely on position-specific scoring matrix (PSSM)-based encoding of the protein sequences. The computational efforts needed for generating PSSMs severely limit the practical utility of protein-RNA interface prediction servers. In this work, we experiment with two approaches, random sampling and sequence similarity reduction, for extracting a representative reference database of protein sequences from more than 50 million protein sequences in UniRef100. Our results suggest that randomly sampled databases produce better PSSM profiles (in terms of the number of hits used to generate the profile and the distance of the generated profile to the corresponding profile generated using the entire UniRef100 data as well as the accuracy of the machine learning classifier trained using these profiles). Based on our results, we developed FastRNABindR, an improved version of RNABindR for predicting protein-RNA interface residues using PSSM profiles generated using 1% of the UniRef100 sequences sampled uniformly at random. To the best of our knowledge, FastRNABindR is the only protein-RNA interface residue prediction online server that requires generation of PSSM profiles for query sequences and accepts hundreds of protein sequences per submission. Our approach for determining the optimal BLAST database for a protein-RNA interface residue classification task has the potential of substantially speeding up, and hence increasing the practical utility of, other amino acid sequence based predictors of protein-protein and protein-DNA interfaces.

  15. 21 CFR 866.5270 - C-reactive protein immuno-logical test system.

    Code of Federal Regulations, 2013 CFR

    2013-04-01

    ... the C-reactive protein in serum and other body fluids. Measurement of C-reactive protein aids in evaluation of the amount of injury to body tissues. (b) Classification. Class II (performance standards). ....5270 Section 866.5270 Food and Drugs FOOD AND DRUG ADMINISTRATION, DEPARTMENT OF HEALTH AND HUMAN...

  16. 21 CFR 866.5270 - C-reactive protein immuno-logical test system.

    Code of Federal Regulations, 2011 CFR

    2011-04-01

    ... the C-reactive protein in serum and other body fluids. Measurement of C-reactive protein aids in evaluation of the amount of injury to body tissues. (b) Classification. Class II (performance standards). ....5270 Section 866.5270 Food and Drugs FOOD AND DRUG ADMINISTRATION, DEPARTMENT OF HEALTH AND HUMAN...

  17. 21 CFR 866.5270 - C-reactive protein immuno-logical test system.

    Code of Federal Regulations, 2012 CFR

    2012-04-01

    ... the C-reactive protein in serum and other body fluids. Measurement of C-reactive protein aids in evaluation of the amount of injury to body tissues. (b) Classification. Class II (performance standards). ....5270 Section 866.5270 Food and Drugs FOOD AND DRUG ADMINISTRATION, DEPARTMENT OF HEALTH AND HUMAN...

  18. 21 CFR 866.5270 - C-reactive protein immuno-logical test system.

    Code of Federal Regulations, 2014 CFR

    2014-04-01

    ... the C-reactive protein in serum and other body fluids. Measurement of C-reactive protein aids in evaluation of the amount of injury to body tissues. (b) Classification. Class II (performance standards). ....5270 Section 866.5270 Food and Drugs FOOD AND DRUG ADMINISTRATION, DEPARTMENT OF HEALTH AND HUMAN...

  19. Automatic annotation of protein motif function with Gene Ontology terms.

    PubMed

    Lu, Xinghua; Zhai, Chengxiang; Gopalakrishnan, Vanathi; Buchanan, Bruce G

    2004-09-02

    Conserved protein sequence motifs are short stretches of amino acid sequence patterns that potentially encode the function of proteins. Several sequence pattern searching algorithms and programs exist for identifying candidate protein motifs at the whole genome level. However, a much needed and important task is to determine the functions of the newly identified protein motifs. The Gene Ontology (GO) project is an endeavor to annotate the function of genes or protein sequences with terms from a dynamic, controlled vocabulary and these annotations serve well as a knowledge base. This paper presents methods to mine the GO knowledge base and use the association between the GO terms assigned to a sequence and the motifs matched by the same sequence as evidence for predicting the functions of novel protein motifs automatically. The task of assigning GO terms to protein motifs is viewed as both a binary classification and information retrieval problem, where PROSITE motifs are used as samples for model training and functional prediction. The mutual information of a motif and a GO term association is found to be a very useful feature. We take advantage of the known motifs to train a logistic regression classifier, which allows us to combine mutual information with other frequency-based features and obtain a probability of correct association. The trained logistic regression model has intuitively meaningful and logically plausible parameter values, and performs very well empirically according to our evaluation criteria. In this research, different methods for automatic annotation of protein motifs have been investigated. Empirical results demonstrated that the methods have great potential for detecting and augmenting information about the functions of newly discovered candidate protein motifs.
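
    The mutual-information feature mentioned in the abstract above can be illustrated with a minimal sketch. It assumes the motif/GO-term association is summarised as a 2x2 table of sequence counts; the function name and argument names are hypothetical, and this is not the authors' implementation.

```python
import math


def mutual_information(n11, n10, n01, n00):
    """Mutual information (in bits) between two binary events.

    Hypothetical count layout (an assumption for this sketch):
      n11: sequences matching the motif and annotated with the GO term
      n10: sequences matching the motif but lacking the GO term
      n01: sequences annotated with the GO term but not matching the motif
      n00: sequences with neither
    """
    total = n11 + n10 + n01 + n00
    motif_yes, motif_no = n11 + n10, n01 + n00
    term_yes, term_no = n11 + n01, n10 + n00
    # Each entry: (joint count, motif marginal, GO-term marginal)
    cells = [
        (n11, motif_yes, term_yes),
        (n10, motif_yes, term_no),
        (n01, motif_no, term_yes),
        (n00, motif_no, term_no),
    ]
    mi = 0.0
    for joint, row, col in cells:
        if joint > 0:  # empty cells contribute nothing (0 * log 0 := 0)
            mi += (joint / total) * math.log2(joint * total / (row * col))
    return mi
```

    A value near zero indicates that motif matches and the GO annotation are statistically independent; larger values indicate a stronger association, which could then serve as one input feature to a classifier such as logistic regression.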

  20. Protein fold recognition using geometric kernel data fusion.

    PubMed

    Zakeri, Pooya; Jeuris, Ben; Vandebril, Raf; Moreau, Yves

    2014-07-01

    Various approaches based on features extracted from protein sequences and often machine learning methods have been used in the prediction of protein folds. Finding an efficient technique for integrating these different protein features has received increasing attention. In particular, kernel methods are an interesting class of techniques for integrating heterogeneous data. Various methods have been proposed to fuse multiple kernels. Most techniques for multiple kernel learning focus on learning a convex linear combination of base kernels. In addition to the limitation of linear combinations, working with such approaches could cause a loss of potentially useful information. We design several techniques to combine kernel matrices by taking more involved, geometry inspired means of these matrices instead of convex linear combinations. We consider various sequence-based protein features including information extracted directly from position-specific scoring matrices and local sequence alignment. We evaluate our methods for classification on the SCOP PDB-40D benchmark dataset for protein fold recognition. The best overall accuracy on the protein fold recognition test set obtained by our methods is ∼86.7%. This is an improvement over the results of the best existing approach. Moreover, our computational model has been developed by incorporating the functional domain composition of proteins through a hybridization model. It is observed that by using our proposed hybridization model, the protein fold recognition accuracy is further improved to 89.30%. Furthermore, we investigate the performance of our approach on the protein remote homology detection problem by fusing multiple string kernels. The MATLAB code used for our proposed geometric kernel fusion frameworks is publicly available at http://people.cs.kuleuven.be/~raf.vandebril/homepage/software/geomean.php?menu=5/. © The Author 2014. Published by Oxford University Press.
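
    As an illustration of what a geometry-inspired mean of kernel matrices can look like, the sketch below computes the standard geometric mean of two symmetric positive-definite matrices, A # B = A^(1/2) (A^(-1/2) B A^(-1/2))^(1/2) A^(1/2). This is a textbook formula chosen for illustration; the abstract does not specify which means the authors use, and the function names here are hypothetical. Both inputs are assumed to be strictly positive definite.

```python
import numpy as np


def _spd_power(m, p):
    """Raise a symmetric positive-definite matrix to a real power
    using its eigendecomposition."""
    w, v = np.linalg.eigh(m)
    return (v * w**p) @ v.T


def geometric_mean(a, b):
    """Geometric mean A # B of two symmetric positive-definite matrices."""
    a_half = _spd_power(a, 0.5)
    a_inv_half = _spd_power(a, -0.5)
    inner = a_inv_half @ b @ a_inv_half
    inner = (inner + inner.T) / 2.0  # symmetrise against round-off
    return a_half @ _spd_power(inner, 0.5) @ a_half
```

    Unlike a convex combination such as 0.5 * A + 0.5 * B, this mean G satisfies G A^(-1) G = B, so it treats the two kernels symmetrically on the manifold of positive-definite matrices. Kernel matrices that are only positive semi-definite would need regularisation (for example, adding a small multiple of the identity) before this formula applies.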

  1. Acyl carrier protein structural classification and normal mode analysis

    PubMed Central

    Cantu, David C; Forrester, Michael J; Charov, Katherine; Reilly, Peter J

    2012-01-01

    All acyl carrier protein primary and tertiary structures were gathered into the ThYme database. They are classified into 16 families by amino acid sequence similarity, with members of the different families having sequences with statistically highly significant differences. These classifications are supported by tertiary structure superposition analysis. Tertiary structures from a number of families are very similar, suggesting that these families may come from a single distant ancestor. Normal vibrational mode analysis was conducted on experimentally determined freestanding structures, showing greater fluctuations at chain termini and loops than in most helices. Their modes overlap more so within families than between different families. The tertiary structures of three acyl carrier protein families that lacked any known structures were predicted as well. PMID:22374859

  2. Early vertebrate origin and diversification of small transmembrane regulators of cellular ion transport.

    PubMed

    Pirkmajer, Sergej; Kirchner, Henriette; Lundell, Leonidas S; Zelenin, Pavel V; Zierath, Juleen R; Makarova, Kira S; Wolf, Yuri I; Chibalin, Alexander V

    2017-07-15

    Small transmembrane proteins such as FXYDs, which interact with Na+,K+-ATPase, and the micropeptides that interact with sarco/endoplasmic reticulum Ca2+-ATPase play fundamental roles in regulation of ion transport in vertebrates. Uncertain evolutionary origins and phylogenetic relationships among these regulators of ion transport have led to inconsistencies in their classification across vertebrate species, thus hampering comparative studies of their functions. We discovered the first FXYD homologue in sea lamprey, a basal jawless vertebrate, which suggests small transmembrane regulators of ion transport emerged early in the vertebrate lineage. We also identified 13 gene subfamilies of FXYDs and propose a revised, phylogeny-based FXYD classification that is consistent across vertebrate species. These findings provide an improved framework for investigating physiological and pathophysiological functions of small transmembrane regulators of ion transport. Small transmembrane proteins are important for regulation of cellular ion transport. The most prominent among these are members of the FXYD family (FXYD1-12), which regulate Na+,K+-ATPase, and phospholamban, sarcolipin, myoregulin and DWORF, which regulate the sarco/endoplasmic reticulum Ca2+-ATPase (SERCA). FXYDs and regulators of SERCA are present in fishes, as well as terrestrial vertebrates; however, their evolutionary origins and phylogenetic relationships are obscure, thus hampering comparative physiological studies. Here we discovered that sea lamprey (Petromyzon marinus), a representative of extant jawless vertebrates (Cyclostomata), expresses an FXYD homologue, which strongly suggests that FXYDs predate the emergence of fishes and other jawed vertebrates (Gnathostomata). Using a combination of sequence-based phylogenetic analysis and conservation of local chromosome context, we determined that FXYDs markedly diversified in the lineages leading to cartilaginous fishes (Chondrichthyes) and bony vertebrates (Euteleostomi). Diversification of SERCA regulators was much less extensive, indicating they operate under different evolutionary constraints. Finally, we found that FXYDs in extant vertebrates can be classified into 13 gene subfamilies, which do not always correspond to the established FXYD classification. We therefore propose a revised classification that is based on evolutionary history of FXYDs and that is consistent across vertebrate species. Collectively, our findings provide an improved framework for investigating the function of ion transport in health and disease. © 2017 The Authors. The Journal of Physiology © 2017 The Physiological Society.

  3. Information theory applications for biological sequence analysis.

    PubMed

    Vinga, Susana

    2014-05-01

    Information theory (IT) addresses the analysis of communication systems and has been widely applied in molecular biology. In particular, alignment-free sequence analysis and comparison greatly benefited from concepts derived from IT, such as entropy and mutual information. This review covers several aspects of IT applications, ranging from genome global analysis and comparison, including block-entropy estimation and resolution-free metrics based on iterative maps, to local analysis, comprising the classification of motifs, prediction of transcription factor binding sites and sequence characterization based on linguistic complexity and entropic profiles. IT has also been applied to high-level correlations that combine DNA, RNA or protein features with sequence-independent properties, such as gene mapping and phenotype analysis, and has also provided models based on communication systems theory to describe information transmission channels at the cell level and also during evolutionary processes. While not exhaustive, this review attempts to categorize existing methods and to indicate their relation with broader transversal topics such as genomic signatures, data compression and complexity, time series analysis and phylogenetic classification, providing a resource for future developments in this promising area.
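
    To make the entropy concept named in the abstract above concrete, here is a minimal sketch of Shannon block entropy for a biological sequence, computed over overlapping k-mers. The function name and the choice of overlapping windows are assumptions made for illustration; the review covers many estimators and this is only the simplest plug-in version.

```python
import math
from collections import Counter


def block_entropy(seq, k=1):
    """Plug-in Shannon entropy (in bits) of the overlapping k-mer
    distribution of a sequence. Returns 0.0 when the sequence is shorter
    than k."""
    blocks = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    if not blocks:
        return 0.0
    n = len(blocks)
    counts = Counter(blocks)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())
```

    For k = 1 on a DNA sequence the value ranges from 0 bits (a single repeated nucleotide) to 2 bits (all four nucleotides equally frequent); low-complexity regions show up as low entropy.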

  4. Differential sensing using proteins: exploiting the cross-reactivity of serum albumin to pattern individual terpenes and terpenes in perfume.

    PubMed

    Adams, Michelle M; Anslyn, Eric V

    2009-12-02

    There has been a growing interest in the use of differential sensing for analyte classification. In an effort to mimic the mammalian senses of taste and smell, which utilize protein-based receptors, we have introduced serum albumins as nonselective receptors for recognition of small hydrophobic molecules. Herein, we employ a sensing ensemble consisting of serum albumins, a hydrophobic fluorescent indicator (PRODAN), and a hydrophobic additive (deoxycholate) to detect terpenes. With the aid of linear discriminant analysis, we successfully applied our system to differentiate five terpenes. We then extended our terpene analysis and utilized our sensing ensemble for terpene discrimination within the complex mixtures found in perfume.

  5. Note: An automated image analysis method for high-throughput classification of surface-bound bacterial cell motions.

    PubMed

    Shen, Simon; Syal, Karan; Tao, Nongjian; Wang, Shaopeng

    2015-12-01

    We present a Single-Cell Motion Characterization System (SiCMoCS) to automatically extract bacterial cell morphological features from microscope images and use those features to automatically classify cell motion for rod shaped motile bacterial cells. In some imaging based studies, bacteria cells need to be attached to the surface for time-lapse observation of cellular processes such as cell membrane-protein interactions and membrane elasticity. These studies often generate large volumes of images. Extracting accurate bacterial cell morphology features from these images is critical for quantitative assessment. Using SiCMoCS, we demonstrated simultaneous and automated motion tracking and classification of hundreds of individual cells in an image sequence of several hundred frames. This is a significant improvement from traditional manual and semi-automated approaches to segmenting bacterial cells based on empirical thresholds, and a first attempt to automatically classify bacterial motion types for motile rod shaped bacterial cells, which enables rapid and quantitative analysis of various types of bacterial motion.

  6. Horror Autoinflammaticus: The Molecular Pathophysiology of Autoinflammatory Disease*

    PubMed Central

    Masters, Seth L.; Simon, Anna; Aksentijevich, Ivona; Kastner, Daniel L.

    2010-01-01

    The autoinflammatory diseases are characterized by seemingly unprovoked episodes of inflammation, without high-titer autoantibodies or antigen-specific T cells. The concept was proposed ten years ago with the identification of the genes underlying hereditary periodic fever syndromes. This nosology has taken root because of the dramatic advances in our knowledge of the genetic basis of both mendelian and complex autoinflammatory diseases, and with the recognition that these illnesses derive from genetic variants of the innate immune system. Herein we propose an updated classification scheme based on the molecular insights garnered over the past decade, supplanting a clinical classification that has served well but is opaque to the genetic, immunologic, and therapeutic interrelationships now before us. We define six categories of autoinflammatory disease: IL-1β activation disorders (inflammasomopathies), NF-κB activation syndromes, protein misfolding disorders, complement regulatory diseases, disturbances in cytokine signaling, and macrophage activation syndromes. A system based on molecular pathophysiology will bring greater clarity to our discourse while catalyzing new hypotheses both at the bench and at the bedside. PMID:19302049

  7. Practical application of self-organizing maps to interrelate biodiversity and functional data in NGS-based metagenomics

    PubMed Central

    Weber, Marc; Teeling, Hanno; Huang, Sixing; Waldmann, Jost; Kassabgy, Mariette; Fuchs, Bernhard M; Klindworth, Anna; Klockow, Christine; Wichels, Antje; Gerdts, Gunnar; Amann, Rudolf; Glöckner, Frank Oliver

    2011-01-01

    Next-generation sequencing (NGS) technologies have enabled the application of broad-scale sequencing in microbial biodiversity and metagenome studies. Biodiversity is usually targeted by classifying 16S ribosomal RNA genes, while metagenomic approaches target metabolic genes. However, both approaches remain isolated, as long as the taxonomic and functional information cannot be interrelated. Techniques like self-organizing maps (SOMs) have been applied to cluster metagenomes into taxon-specific bins in order to link biodiversity with functions, but have not been applied to broad-scale NGS-based metagenomics yet. Here, we provide a novel implementation, demonstrate its potential and practicability, and provide a web-based service for public usage. Evaluation with published data sets mimicking varyingly complex habitats resulted in classification specificities and sensitivities of close to 100% to above 90% from phylum to genus level for assemblies exceeding 8 kb for low and medium complexity data. When applied to five real-world metagenomes of medium complexity from direct pyrosequencing of marine subsurface waters, classifications of assemblies above 2.5 kb were in good agreement with fluorescence in situ hybridizations, indicating that biodiversity was mostly retained within the metagenomes, and confirming high classification specificities. This was validated by two protein-based classification (PBC) methods. SOMs were able to retrieve the relevant taxa down to the genus level, while surpassing PBCs in resolution. In order to make the approach accessible to a broad audience, we implemented a feature-rich web-based SOM application named TaxSOM, which is freely available at http://www.megx.net/toolbox/taxsom. TaxSOM can classify reads or assemblies exceeding 2.5 kb with high accuracy and thus assists in linking biodiversity and functions in metagenome studies, which is a precondition to study microbial ecology in a holistic fashion. PMID:21160538

  8. ZifBASE: a database of zinc finger proteins and associated resources.

    PubMed

    Jayakanthan, Mannu; Muthukumaran, Jayaraman; Chandrasekar, Sanniyasi; Chawla, Konika; Punetha, Ankita; Sundar, Durai

    2009-09-09

    Information on the occurrence of zinc finger protein motifs in genomes is crucial to the developing field of molecular genome engineering. The knowledge of their target DNA-binding sequences is vital to develop chimeric proteins for targeted genome engineering and site-specific gene correction. There is a need to develop a computational resource of zinc finger proteins (ZFP) to identify the potential binding sites and their locations, which reduces the time required for in vivo tasks and overcomes the difficulties in selecting the specific type of zinc finger protein and the target site in the DNA sequence. ZifBASE provides an extensive collection of various natural and engineered ZFP. It uses standard names and a genetic and structural classification scheme to present data retrieved from UniProtKB, GenBank, Protein Data Bank, ModBase, Protein Model Portal and the literature. It also incorporates specialized features of ZFP including finger sequences and positions, number of fingers, physicochemical properties, classes, framework, PubMed citations with links to experimental structures (PDB, if available) and modeled structures of natural zinc finger proteins. ZifBASE provides information on zinc finger proteins (both natural and engineered ones), the number of finger units in each of the zinc finger proteins (with multiple fingers), the synergy between the adjacent fingers and their positions. Additionally, it gives the individual finger sequences and the target DNA sites to which they bind, for a clearer understanding of the interactions of adjacent fingers. The current version of ZifBASE contains 139 entries of which 89 are engineered ZFPs, containing 3-7F totaling 296 fingers. There are 50 natural zinc finger protein entries ranging from 2-13F, totaling 307 fingers. It has sequences and structures from literature, Protein Data Bank, ModBase and Protein Model Portal. The interface is cross-linked to other public databases like UniProtKB, PDB, ModBase, Protein Model Portal and PubMed to make it more informative. A database is established to maintain the information of the sequence features, including the class, framework, number of fingers, residues, position, recognition site and physicochemical properties (molecular weight, isoelectric point) of both natural and engineered zinc finger proteins, and the dissociation constants of a few. ZifBASE can provide a more effective and efficient way of accessing the zinc finger protein sequences and their target binding sites with the links to their three-dimensional structures. All the data and functions are available at the advanced web-based search interface http://web.iitd.ac.in/~sundar/zifbase.

  9. Lung tumor diagnosis and subtype discovery by gene expression profiling.

    PubMed

    Wang, Lu-yong; Tu, Zhuowen

    2006-01-01

    The optimal treatment of patients with complex diseases, such as cancers, depends on accurate diagnosis using a combination of clinical and histopathological data. In many scenarios, this becomes tremendously difficult because of the limitations in clinical presentation and histopathology. To accurately diagnose complex diseases, molecular classification based on gene or protein expression profiles is indispensable for modern medicine. Moreover, many heterogeneous diseases consist of various potential subtypes on a molecular basis that differ remarkably in their response to therapies. It is critical to accurately predict subgroups from disease gene expression profiles. More fundamental knowledge of the molecular basis and classification of disease could aid in the prediction of patient outcome, the informed selection of therapies, and identification of novel molecular targets for therapy. In this paper, we propose a new disease diagnostic method, the probabilistic boosting tree (PB tree) method, on gene expression profiles of lung tumors. It enables accurate disease classification and subtype discovery in disease. It automatically constructs a tree in which each node combines a number of weak classifiers into a strong classifier. Also, subtype discovery is naturally embedded in the learning process. Our algorithm achieves excellent diagnostic performance, and meanwhile it is capable of detecting the disease subtype based on gene expression profile.

  10. Metaproteomics analyses as diagnostic tool for differentiation of Escherichia coli strains in outbreaks

    NASA Astrophysics Data System (ADS)

    Jabbour, Rabih E.; Wright, James D.; Deshpande, Samir V.; Wade, Mary; McCubbin, Patrick; Bevilacqua, Vicky

    2013-05-01

    The secreted proteins of the enterohemorrhagic and enteropathogenic E. coli (EHEC and EPEC) are the most common cause of hemorrhagic colitis, a bloody diarrhea with EHEC infection, which often can lead to life-threatening hemolytic-uremic syndrome (HUS). We are employing a metaproteomic approach as an effective and complementary technique to the current genomic-based approaches. This metaproteomic approach will evaluate the secreted proteins associated with pathogenicity and utilize their signatures as differentiation biomarkers between EHEC and EPEC strains. The results showed that the identified tryptic peptides of the secreted proteins extracted from different EHEC and EPEC growths differ in their amino acid sequences and could potentially be utilized as biomarkers for the studied E. coli strains. Analysis of extract from EHEC O104:H4 resulted in identification of a multidrug efflux protein, which belongs to the family of fusion proteins that are responsible for cell transportation. The experimental peptides identified lie in the region of the HlyD haemolysin secretion protein-D that is responsible for transporting the haemolysin A toxin. Moreover, the taxonomic classification of EHEC O104:H4 showed closest match with E. coli E55989, which is in agreement with genomic sequencing studies that were done extensively on the mentioned strain. The taxonomic results showed strain-level classification for the studied strains and distinctive separation among the strains. Comparative proteomic calculations showed separation between EHEC O157:H7 and O104:H4 in replicate samples using cluster analysis. There are no reported studies addressing the characterization of secreted proteins in various enhanced growth media and utilizing them as biomarkers for strain differentiation. The FY-2012 results are promising and warrant further experimentation to statistically validate them and to further explore the impact of environmental conditions on the nature of the secreted biomarkers in various E. coli strains that are of public health concern in various sectors.

  11. Pfarao: a web application for protein family analysis customized for cytoskeletal and motor proteins (CyMoBase)

    PubMed Central

    Odronitz, Florian; Kollmar, Martin

    2006-01-01

    Background: Annotation of protein sequences of eukaryotic organisms is crucial for the understanding of their function in the cell. Manual annotation is still by far the most accurate way to correctly predict genes. The classification of protein sequences, their phylogenetic relation and the assignment of function involve information from various sources. This often leads to a collection of heterogeneous data, which is hard to track. Cytoskeletal and motor proteins consist of large and diverse superfamilies comprising up to several dozen members per organism. To date, there is no integrated tool available to assist in the manual large-scale comparative genomic analysis of protein families. Description: Pfarao (Protein Family Application for Retrieval, Analysis and Organisation) is a database-driven online working environment for the analysis of manually annotated protein sequences and their relationships. Currently, the system can store and interrelate a wide range of information about protein sequences, species, phylogenetic relations and sequencing projects as well as links to literature and domain predictions. Sequences can be imported from multiple sequence alignments that are generated during the annotation process. A web interface allows users to conveniently browse the database and to compile tabular and graphical summaries of its content. Conclusion: We implemented a protein sequence-centric web application to store, organize, interrelate, and present heterogeneous data that are generated in manual genome annotation and comparative genomics. The application has been developed for the analysis of cytoskeletal and motor proteins (CyMoBase) but can easily be adapted for any protein. PMID:17134497

  12. Label-free high-throughput imaging flow cytometry

    NASA Astrophysics Data System (ADS)

    Mahjoubfar, A.; Chen, C.; Niazi, K. R.; Rabizadeh, S.; Jalali, B.

    2014-03-01

    Flow cytometry is an optical method for studying cells based on their individual physical and chemical characteristics. It is widely used in clinical diagnosis, medical research, and biotechnology for analysis of blood cells and other cells in suspension. Conventional flow cytometers aim a laser beam at a stream of cells and measure the elastic scattering of light at forward and side angles. They also perform single-point measurements of fluorescent emissions from labeled cells. However, many reagents used in cell labeling reduce cellular viability or change the behavior of the target cells through the activation of undesired cellular processes or inhibition of normal cellular activity. Therefore, labeled cells are not completely representative of their unaltered form nor are they fully reliable for downstream studies. To remove the requirement of cell labeling in flow cytometry, while still meeting the classification sensitivity and specificity goals, measurement of additional biophysical parameters is essential. Here, we introduce an interferometric imaging flow cytometer based on the world's fastest continuous-time camera. Our system simultaneously measures cellular size, scattering, and protein concentration as supplementary biophysical parameters for label-free cell classification. It exploits the wide bandwidth of ultrafast laser pulses to perform blur-free quantitative phase and intensity imaging at flow speeds as high as 10 meters per second and achieves nanometer-scale optical path length resolution for precise measurements of cellular protein concentration.

  13. ACLAME: a CLAssification of Mobile genetic Elements, update 2010.

    PubMed

    Leplae, Raphaël; Lima-Mendez, Gipsi; Toussaint, Ariane

    2010-01-01

    The ACLAME database is dedicated to the collection, analysis and classification of sequenced mobile genetic elements (MGEs, in particular phages and plasmids). In addition to providing information on the MGEs content, classifications are available at various levels of organization. At the gene/protein level, families group similar sequences that are expected to share the same function. Families of four or more proteins are manually assigned with a functional annotation using the GeneOntology and the locally developed ontology MeGO dedicated to MGEs. At the genome level, evolutionary cohesive modules group sets of protein families shared among MGEs. At the population level, networks display the reticulate evolutionary relationships among MGEs. To increase the coverage of the phage sequence space, ACLAME version 0.4 incorporates 760 high-quality predicted prophages selected from the Prophinder database. Most of the data can be downloaded from the freely accessible ACLAME web site (http://aclame.ulb.ac.be). The BLAST interface for querying the database has been extended and numerous tools for in-depth analysis of the results have been added.

  14. Classification of protein quaternary structure by functional domain composition

    PubMed Central

    Yu, Xiaojing; Wang, Chuan; Li, Yixue

    2006-01-01

    Background: The number and the arrangement of subunits that form a protein are referred to as quaternary structure. Quaternary structure is an important protein attribute that is closely related to its function. Proteins with quaternary structure are called oligomeric proteins. Oligomeric proteins are involved in various biological processes, such as metabolism, signal transduction, and chromosome replication. Thus, it is highly desirable to develop computational methods to automatically classify the quaternary structure of proteins from their sequences. Results: To explore this problem, we adopted an approach based on the functional domain composition of proteins. Every protein was represented by a vector calculated from the domains in the Pfam database. The nearest neighbor algorithm (NNA) was used for classifying the quaternary structure of proteins from this information. The jackknife cross-validation test was performed on a non-redundant protein dataset in which the sequence identity was less than 25%. The overall success rate obtained was 75.17%. Additionally, to demonstrate the effectiveness of this method, we predicted the proteins in an independent dataset and achieved an overall success rate of 84.11%. Conclusion: Compared with the amino acid composition method and BLAST, the results indicate that the domain composition approach may be a more effective and promising high-throughput method in dealing with this complicated problem in bioinformatics. PMID:16584572
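
    The domain-composition approach described in this record can be illustrated with a minimal sketch: each protein becomes a binary vector over a domain vocabulary, a nearest-neighbor rule assigns the label of the most similar training protein, and a jackknife (leave-one-out) test estimates the success rate. The domain names (`domA` to `domD`), the class labels, and the choice of cosine similarity are assumptions made for illustration; the record does not state which similarity measure or vector encoding the authors used.

```python
import math
from typing import List, Sequence, Set, Tuple

# Hypothetical domain vocabulary; real work would use Pfam identifiers.
VOCABULARY = ["domA", "domB", "domC", "domD"]


def domain_vector(domains: Set[str]) -> List[int]:
    """Encode a protein as a 0/1 vector: 1 if the domain is present."""
    return [1 if name in domains else 0 for name in VOCABULARY]


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; defined as 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def nearest_neighbor_label(query: Sequence[int],
                           training: List[Tuple[List[int], str]]) -> str:
    """Return the label of the training vector most similar to the query."""
    best_vector, best_label = max(
        training, key=lambda item: cosine_similarity(query, item[0]))
    return best_label


def jackknife_accuracy(samples: List[Tuple[List[int], str]]) -> float:
    """Leave-one-out test: predict each sample from all the others."""
    correct = 0
    for i, (vector, label) in enumerate(samples):
        rest = samples[:i] + samples[i + 1:]
        if nearest_neighbor_label(vector, rest) == label:
            correct += 1
    return correct / len(samples)


# Toy dataset with made-up domain content and labels.
samples = [
    (domain_vector({"domA", "domB"}), "monomer"),
    (domain_vector({"domA"}), "monomer"),
    (domain_vector({"domC", "domD"}), "homodimer"),
    (domain_vector({"domC"}), "homodimer"),
]
accuracy = jackknife_accuracy(samples)
```

    On this toy dataset every held-out protein shares a domain only with a protein of its own class, so the jackknife success rate is 1.0; real datasets with shared domains across classes would give lower values.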

  15. Soft Computing Methods for Disulfide Connectivity Prediction.

    PubMed

    Márquez-Chamorro, Alfonso E; Aguilar-Ruiz, Jesús S

    2015-01-01

    The problem of protein structure prediction (PSP) is one of the main challenges in structural bioinformatics. To tackle this problem, PSP can be divided into several subproblems. One of these subproblems is the prediction of disulfide bonds. The disulfide connectivity prediction problem consists of identifying which nonadjacent cysteines would be cross-linked from all possible candidates. Determining the disulfide bond connectivity between the cysteines of a protein is desirable as a preliminary step for 3D PSP, as the protein conformational search space is greatly reduced. The most representative soft computing approaches for the disulfide bond connectivity prediction problem of the last decade are summarized in this paper. Certain aspects, such as the different methodologies based on soft computing approaches (artificial neural network or support vector machine) or features of the algorithms, are used for the classification of these methods.

  16. A reference map of the Arabidopsis thaliana mature pollen proteome

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Noir, Sandra; Braeutigam, Anne; Colby, Thomas

    The male gametophyte (or pollen) plays an obligatory role during sexual reproduction of higher plants. The extremely reduced complexity of this organ renders pollen a valuable experimental system for studying fundamental aspects of plant biology such as cell fate determination, cell-cell interactions, cell polarity, and tip-growth. Here, we present the first reference map of the mature pollen proteome of the dicotyledonous model plant species, Arabidopsis thaliana. Based on two-dimensional gel electrophoresis, matrix-assisted laser desorption/ionization time-of-flight, and electrospray quadrupole time-of-flight mass spectrometry, we reproducibly identified 121 different proteins in 145 individual spots. The presence, subcellular localization, and functional classification of the identified proteins are discussed in relation to the pollen transcriptome and the full protein complement encoded by the nuclear Arabidopsis genome.

  17. Land Cover Classification in a Complex Urban-Rural Landscape with Quickbird Imagery

    PubMed Central

    Moran, Emilio Federico.

    2010-01-01

    High spatial resolution images have been increasingly used for urban land use/cover classification, but the high spectral variation within the same land cover, the spectral confusion among different land covers, and the shadow problem often lead to poor classification performance based on the traditional per-pixel spectral-based classification methods. This paper explores approaches to improve urban land cover classification with Quickbird imagery. Traditional per-pixel spectral-based supervised classification, incorporation of textural images and multispectral images, spectral-spatial classifier, and segmentation-based classification are examined in a relatively new developing urban landscape, Lucas do Rio Verde in Mato Grosso State, Brazil. This research shows that use of spatial information during the image classification procedure, either through the integrated use of textural and spectral images or through the use of segmentation-based classification method, can significantly improve land cover classification performance. PMID:21643433

  18. Noninvasive diagnosis of intraamniotic infection: proteomic biomarkers in vaginal fluid.

    PubMed

    Hitti, Jane; Lapidus, Jodi A; Lu, Xinfang; Reddy, Ashok P; Jacob, Thomas; Dasari, Surendra; Eschenbach, David A; Gravett, Michael G; Nagalla, Srinivasa R

    2010-07-01

    We analyzed the vaginal fluid proteome to identify biomarkers of intraamniotic infection among women in preterm labor. Proteome analysis was performed on vaginal fluid specimens from women with preterm labor, using multidimensional liquid chromatography, tandem mass spectrometry, and label-free quantification. Enzyme immunoassays were used to quantify candidate proteins. Classification accuracy for intraamniotic infection (positive amniotic fluid bacterial culture and/or interleukin-6 >2 ng/mL) was evaluated using receiver-operator characteristic curves obtained by logistic regression. Of 170 subjects, 30 (18%) had intraamniotic infection. Vaginal fluid proteome analysis revealed 338 unique proteins. Label-free quantification identified 15 proteins differentially expressed in intraamniotic infection, including acute-phase reactants, immune modulators, high-abundance amniotic fluid proteins and extracellular matrix-signaling factors; these findings were confirmed by enzyme immunoassay. A multi-analyte algorithm showed accurate classification of intraamniotic infection. Vaginal fluid proteome analyses identified proteins capable of discriminating between patients with and without intraamniotic infection. Copyright (c) 2010 Mosby, Inc. All rights reserved.
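
    The classification accuracy in this record is summarized with receiver-operator characteristic curves. As a minimal sketch of how the area under such a curve can be computed from classifier scores, the function below uses the pairwise (Mann-Whitney) formulation: the fraction of positive/negative pairs in which the positive case receives the higher score, with ties counted as half. The scores and labels are made-up values, not data from the study, and the study's own software and procedure are not described in the abstract.

```python
from typing import Sequence


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Area under the ROC curve via pairwise comparison of scores.

    labels: 1 for a positive case (e.g. infection), 0 for a negative case.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # tie counts as half
    return wins / (len(positives) * len(negatives))


# Hypothetical predicted probabilities from a fitted model.
perfect = roc_auc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0])
chance = roc_auc([0.9, 0.1, 0.8, 0.3], [1, 1, 0, 0])
```

    An area of 1.0 means every infected subject scored above every uninfected one, while 0.5 corresponds to no discrimination.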

  19. Ribosome-inactivating proteins: potent poisons and molecular tools.

    PubMed

    Walsh, Matthew J; Dodd, Jennifer E; Hautbergue, Guillaume M

    2013-11-15

    Ribosome-inactivating proteins (RIPs) were first isolated over a century ago and have been shown to be catalytic toxins that irreversibly inactivate protein synthesis. Elucidation of atomic structures and molecular mechanism has revealed these proteins to be a diverse group subdivided into two classes. RIPs have been shown to exhibit RNA N-glycosidase activity and depurinate the 28S rRNA of the eukaryotic 60S ribosomal subunit. In this review, we compare archetypal RIP family members with other potent toxins that abolish protein synthesis: the fungal ribotoxins which directly cleave the 28S rRNA and the newly discovered Burkholderia lethal factor 1 (BLF1). BLF1 presents additional challenges to the current classification system since, like the ribotoxins, it does not possess RNA N-glycosidase activity but does irreversibly inactivate ribosomes. We further discuss whether the RIP classification should be broadened to include toxins achieving irreversible ribosome inactivation with similar turnovers to RIPs, but through different enzymatic mechanisms.

  20. Multiscale Persistent Functions for Biomolecular Structure Characterization

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Xia, Kelin; Li, Zhiming; Mu, Lin

    In this paper, we introduce multiscale persistent functions for biomolecular structure characterization. The essential idea is to combine our multiscale rigidity functions (MRFs) with persistent homology analysis, so as to construct a series of multiscale persistent functions, particularly multiscale persistent entropies, for structure characterization. To clarify the fundamental idea of our method, the multiscale persistent entropy (MPE) model is discussed in great detail. Mathematically, unlike the previous persistent entropy (Chintakunta et al. in Pattern Recognit 48(2):391–401, 2015; Merelli et al. in Entropy 17(10):6872–6892, 2015; Rucco et al. in: Proceedings of ECCS 2014, Springer, pp 117–128, 2016), a special resolution parameter is incorporated into our model. Various scales can be achieved by tuning its value. Physically, our MPE can be used in conformational entropy evaluation. More specifically, it is found that our method incorporates a natural classification scheme. This is achieved through a density filtration of an MRF built from angular distributions. To further validate our model, a systematic comparison with the traditional entropy evaluation model is done. Additionally, it is found that our model is able to preserve the intrinsic topological features of biomolecular data much better than traditional approaches, particularly for resolutions in the intermediate range. Moreover, by comparing with traditional entropies from various grid sizes, bond angle-based methods and a persistent homology-based support vector machine method (Cang et al. in Mol Based Math Biol 3:140–162, 2015), we find that our MPE method gives the best results in terms of average true positive rate in a classic protein structure classification test. More interestingly, all-alpha and all-beta protein classes can be clearly separated from each other with zero error only in our model. Finally, a special protein structure index (PSI) is proposed, for the first time, to describe the “regularity” of protein structures. Basically, a protein structure is deemed regular if it has a consistent and orderly configuration. Our PSI model is tested on a database of 110 proteins; we find that structures with larger portions of loops and intrinsically disordered regions are always associated with larger PSI, meaning an irregular configuration, while proteins with larger portions of secondary structures, i.e., alpha-helix or beta-sheet, have smaller PSI. Essentially, PSI can be used to describe the “regularity” information in any system.
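
    For orientation, the plain (single-scale) persistent entropy that this record builds on can be sketched in a few lines. Given a barcode of (birth, death) intervals with lengths l_i and total length L, the entropy is E = -sum_i (l_i / L) ln(l_i / L). This is the standard definition from the earlier literature cited in the abstract, not the multiscale model with a resolution parameter proposed by the authors, and the barcodes below are made-up values.

```python
import math
from typing import List, Tuple


def persistent_entropy(bars: List[Tuple[float, float]]) -> float:
    """Shannon entropy of the normalized bar lengths of a barcode.

    bars: list of (birth, death) pairs; zero-length bars are ignored.
    """
    lengths = [death - birth for birth, death in bars if death > birth]
    total = sum(lengths)
    if total == 0:
        return 0.0
    return -sum((l / total) * math.log(l / total) for l in lengths)


# Hypothetical barcodes: two equal bars versus one dominant bar.
even = persistent_entropy([(0.0, 1.0), (0.0, 1.0)])
skewed = persistent_entropy([(0.0, 9.0), (0.0, 1.0)])
```

    Two bars of equal length give the maximum value for two bars, ln 2, while a barcode dominated by one long bar gives a smaller entropy.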

  1. River reach classification for the Greater Mekong Region at high spatial resolution

    NASA Astrophysics Data System (ADS)

    Ouellet Dallaire, C.; Lehner, B.

    2014-12-01

    River classifications have been used in river health and ecological assessments as coarse proxies to represent aquatic biodiversity when comprehensive biological and/or species data are unavailable. Currently there are no river classifications or biological data available in a consistent format for the extent of the Greater Mekong Region (GMR; including the Irrawaddy, the Salween, the Chao Praya, the Mekong and the Red River basins). The current project proposes a new river habitat classification for the region, facilitated by the HydroSHEDS (HYDROlogical SHuttle Elevation Derivatives at multiple Scales) database at 500 m pixel resolution. The classification project is based on the Global River Classification framework, which relies on the creation of multiple sub-classifications based on different disciplines. The resulting classes from the sub-classifications are later combined into final classes to create a holistic river reach classification. For the GMR, a final habitat classification was created based on three sub-classifications: a hydrological sub-classification based only on discharge indices (river size and flow variability); a physio-climatic sub-classification based on large-scale indices of climate and elevation (biomes, ecoregions and elevation); and a geomorphological sub-classification based on local morphology (presence of floodplains, reach gradient and sand transport). Key variables and thresholds were identified in collaboration with local experts to ensure that regional knowledge was included. The final classification is composed of 54 unique final classes based on 3 sub-classifications with fewer than 15 classes each. The resulting classifications are driven by abiotic variables and do not include biological data, but they represent a state-of-the-art product based on the best available data (mostly global data). The most common river habitat type is the "dry broadleaf, low gradient, very small river". These classifications could be applied in a wide range of hydro-ecological assessments and could be useful to a variety of stakeholders such as NGOs, governments and researchers.

  2. Beta Atomic Contacts: Identifying Critical Specific Contacts in Protein Binding Interfaces

    PubMed Central

    Liu, Qian; Kwoh, Chee Keong; Hoi, Steven C. H.

    2013-01-01

    Specific binding between proteins plays a crucial role in molecular functions and biological processes. In previous studies, protein binding interfaces and their atomic contacts have typically been defined by simple criteria, such as distance-based definitions that use only a threshold of spatial distance. These definitions neglect the nearby atomic organization of contact atoms, and thus detect predominantly contacts that are interrupted by other atoms. It is questionable whether such interrupted contacts are as important as other contacts in protein binding. To tackle this challenge, we propose a new definition called beta (β) atomic contacts. Our definition, founded on the β-skeletons in computational geometry, requires that there is no other atom in the contact spheres defined by two contact atoms; this sphere is similar to the van der Waals spheres of atoms. The statistical analysis on a large dataset shows that β contacts are only a small fraction of conventional distance-based contacts. To empirically quantify the importance of β contacts, we design βACV, an SVM classifier with β contacts as input, to distinguish homodimers from crystal packing. We found that our βACV is able to achieve state-of-the-art classification performance, superior to that of SVM classifiers with distance-based contacts as input. Our βACV also outperforms several existing methods when evaluated on several datasets from previous works. The promising empirical performance suggests that β contacts can truly identify critical specific contacts in protein binding interfaces. β contacts thus provide a new model for a more precise description of atomic organization in protein quaternary structures than distance-based contacts. PMID:23630569
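
    The idea of an uninterrupted contact can be sketched with the simplest member of the β-skeleton family. The function below assumes β = 1 (the Gabriel-graph criterion) and treats atoms as points: two atoms within a distance cutoff are a contact only if no third atom lies strictly inside the sphere whose diameter is the segment joining them. The value of β, the 5.0 Å cutoff, and the coordinates are illustrative assumptions; the abstract does not give the exact contact-sphere construction used by the authors.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def is_uninterrupted_contact(i: int, j: int, atoms: Sequence[Point],
                             cutoff: float = 5.0) -> bool:
    """Gabriel-style test (beta = 1) layered on a plain distance cutoff."""
    a, b = atoms[i], atoms[j]
    distance = math.dist(a, b)
    if distance > cutoff:
        return False                      # fails the conventional criterion
    center = tuple((x + y) / 2.0 for x, y in zip(a, b))
    radius = distance / 2.0
    for k, other in enumerate(atoms):
        if k in (i, j):
            continue
        if math.dist(center, other) < radius:
            return False                  # a third atom interrupts the contact
    return True


# Hypothetical coordinates in angstroms.
blocked = [(0.0, 0.0, 0.0), (4.0, 0.0, 0.0), (2.0, 0.5, 0.0)]
clear = [(0.0, 0.0, 0.0), (4.0, 0.0, 0.0), (2.0, 3.0, 0.0)]
blocked_result = is_uninterrupted_contact(0, 1, blocked)
clear_result = is_uninterrupted_contact(0, 1, clear)
```

    Both atom pairs pass the distance cutoff, but only the second is kept, which mirrors the record's observation that such contacts are a subset of distance-based contacts.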

  3. A novel chemometric classification for FTIR spectra of mycotoxin-contaminated maize and peanuts at regulatory limits.

    PubMed

    Kos, Gregor; Sieger, Markus; McMullin, David; Zahradnik, Celine; Sulyok, Michael; Öner, Tuba; Mizaikoff, Boris; Krska, Rudolf

    2016-10-01

    The rapid identification of mycotoxins such as deoxynivalenol and aflatoxin B1 in agricultural commodities is an ongoing concern for food importers and processors. While sophisticated chromatography-based methods are well established for regulatory testing by food safety authorities, few techniques exist to provide a rapid assessment for traders. This study advances the development of a mid-infrared spectroscopic method, recording spectra with little sample preparation. Spectral data were classified using a bootstrap-aggregated (bagged) decision tree method, evaluating the protein and carbohydrate absorption regions of the spectrum. The method was able to classify 79% of 110 maize samples at the European Union regulatory limit for deoxynivalenol of 1750 µg kg⁻¹ and, for the first time, 77% of 92 peanut samples at 8 µg kg⁻¹ of aflatoxin B1. A subset model revealed a dependency on variety and type of fungal infection. The employed CRC and SBL maize varieties could be pooled in the model with a reduction of classification accuracy from 90% to 79%. Samples infected with Fusarium verticillioides were removed, leaving samples infected with F. graminearum and F. culmorum in the dataset, improving classification accuracy from 73% to 79%. A 500 µg kg⁻¹ classification threshold for deoxynivalenol in maize performed even better with 85% accuracy. This is assumed to be due to a larger number of samples around the threshold increasing representativity. Comparison with established principal component analysis classification, which consistently showed overlapping clusters, confirmed the superior performance of bagged decision tree classification.

  4. A framework for biomedical figure segmentation towards image-based document retrieval

    PubMed Central

    2013-01-01

    The figures included in many biomedical publications play an important role in understanding the biological experiments and facts described within. Recent studies have shown that it is possible to integrate the information that is extracted from figures in classical document classification and retrieval tasks in order to improve their accuracy. One important observation about the figures included in biomedical publications is that they are often composed of multiple subfigures or panels, each describing different methodologies or results. The use of these multimodal figures is a common practice in bioscience, as experimental results are graphically validated via multiple methodologies or procedures. Thus, for a better use of multimodal figures in document classification or retrieval tasks, as well as for providing the evidence source for derived assertions, it is important to automatically segment multimodal figures into subfigures and panels. This is a challenging task, however, as different panels can contain similar objects (e.g., bar charts and line charts) with multiple layouts. Also, certain types of biomedical figures are text-heavy (e.g., images of DNA sequences and protein sequences) and differ from traditional images. As a result, classical image segmentation techniques based on low-level image features, such as edges or color, are not directly applicable to robustly partition multimodal figures into single-modal panels. In this paper, we describe a robust solution for automatically identifying and segmenting unimodal panels from a multimodal figure. Our framework starts by robustly harvesting figure-caption pairs from biomedical articles. We base our approach on the observation that the document layout can be used to identify encoded figures and figure boundaries within PDF files. Taking the document layout into consideration allows us to correctly extract figures from the PDF document and associate them with their corresponding captions. We combine pixel-level representations of the extracted images with information gathered from their corresponding captions to estimate the number of panels in the figure. Thus, our approach simultaneously identifies the number of panels and the layout of figures. In order to evaluate the approach described here, we applied our system to documents containing protein-protein interactions (PPIs) and compared the results against a gold standard that was annotated by biologists. Experimental results showed that our automatic figure segmentation approach surpasses pure caption-based and image-based approaches, achieving 96.64% accuracy. To allow for efficient retrieval of information, as well as to provide the basis for integration into document classification and retrieval systems, among others, we further developed a web-based interface that lets users easily retrieve panels containing the terms specified in the user queries. PMID:24565394

  5. Proteomic analysis of Medulloblastoma reveals functional biology with translational potential.

    PubMed

    Rivero-Hinojosa, Samuel; Lau, Ling San; Stampar, Mojca; Staal, Jerome; Zhang, Huizhen; Gordish-Dressman, Heather; Northcott, Paul A; Pfister, Stefan M; Taylor, Michael D; Brown, Kristy J; Rood, Brian R

    2018-06-07

    Genomic characterization has begun to redefine diagnostic classifications of cancers. However, it remains a challenge to infer disease phenotypes from genomic alterations alone. To help realize the promise of genomics, we have performed a quantitative proteomics investigation using Stable Isotope Labeling by Amino Acids in Cell Culture (SILAC) and 41 tissue samples spanning the 4 genomically based subgroups of medulloblastoma and control cerebellum. We have identified and quantitated thousands of proteins across these groups and find that we are able to recapitulate the genomic subgroups based upon subgroup restricted and differentially abundant proteins while also identifying subgroup specific protein isoforms. Integrating our proteomic measurements with genomic data, we calculate a poor correlation between mRNA and protein abundance. Using EPIC 850 k methylation array data on the same tissues, we also investigate the influence of copy number alterations and DNA methylation on the proteome in an attempt to characterize the impact of these genetic features on the proteome. Reciprocally, we are able to use the proteome to identify which genomic alterations result in altered protein abundance and thus are most likely to impact biology. Finally, we are able to assemble protein-based pathways yielding potential avenues for clinical intervention. From these, we validate the EIF4F cap-dependent translation pathway as a novel druggable pathway in medulloblastoma. Thus, quantitative proteomics complements genomic platforms to yield a more complete understanding of functional tumor biology and identify novel therapeutic targets for medulloblastoma.

  6. Comparative genomic analysis of the eight-membered ring cystine knot-containing bone morphogenetic protein antagonists.

    PubMed

    Avsian-Kretchmer, Orna; Hsueh, Aaron J W

    2004-01-01

    TGF-beta family proteins with a cystine knot motif serve as ligands for diverse families of plasma membrane receptors. Bone morphogenetic protein (BMP) antagonists represent a subgroup of these proteins, some of which bind BMPs and antagonize their actions during development and morphogenesis. Availability of completed genome sequences from diverse organisms allows bioinformatic analysis of the evolution of BMP antagonists and facilitates their classification. Using a regular expression algorithm (http://BioRegEx.stanford.edu), an exhaustive search of the human genome identified all cystine knot-containing BMP antagonists. Based on the size of the cystine ring, these proteins were divided into three subfamilies: CAN (eight-membered ring), twisted gastrulation (nine-membered ring), as well as chordin and noggin (10-membered ring). The CAN family can be divided further into four subgroups based on a conserved arrangement of additional cysteine residues: gremlin and PRDC, cerberus and coco, and DAN, together with USAG-1 and sclerostin. We searched for orthologs of human BMP antagonists in the genomes of model organisms and analyzed their phylogenetic relationship. New human paralogs were identified together with the verification of orthologous relationships of known genes. We also discuss the physiological roles of the CAN subfamily of BMP antagonists and the associated genetic defects. Based on the known three-dimensional structure of key cystine knot proteins, we postulated disulfide bondings for eight-membered ring BMP antagonists to predict their potential folding and dimerization.

  7. [A prediction model for the activity of insecticidal crystal proteins from Bacillus thuringiensis based on support vector machine].

    PubMed

    Lin, Yi; Cai, Fu-Ying; Zhang, Guang-Ya

    2007-01-01

    A quantitative structure-property relationship (QSPR) model relating amino acid composition to the activity of Bacillus thuringiensis insecticidal crystal proteins was established. Support vector machine (SVM) is a novel general machine-learning tool based on the structural risk minimization principle that exhibits good generalization when samples are few; it is especially suitable for classification, forecasting, and estimation in cases where only small numbers of samples are available, such as fault diagnosis. However, some SVM parameters are selected based on the experience of the operator, which has reduced the efficiency of SVM in practical applications. The uniform design (UD) method was applied to optimize the running parameters of SVM. It was found that the average accuracy rate approached 73% when the penalty factor was 0.01, the epsilon 0.2, the gamma 0.05, and the range 0.5. The results indicated that UD might be used as an effective method to optimize the parameters of SVM, and that SVM could be used as an alternative powerful modeling tool for QSPR studies of the activity of Bacillus thuringiensis (Bt) insecticidal crystal proteins. Therefore, a novel method for predicting the insecticidal activity of Bt insecticidal crystal proteins was proposed by the authors of this study.
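
    The input representation named in this record, amino acid composition, is straightforward to compute, and a minimal sketch may help make it concrete. The function returns the fraction of each of the 20 standard residues in a sequence, giving a fixed-length feature vector that could be passed to an SVM or any other learner. The example sequence is made up, and the handling of non-standard letters (they are skipped) is an assumption; the abstract does not describe the authors' preprocessing.

```python
from typing import List

# The 20 standard amino acids in alphabetical one-letter order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence: str) -> List[float]:
    """Return the 20-dimensional composition vector of a protein sequence.

    Characters outside the 20 standard residues (e.g. X, gaps) are skipped.
    """
    counts = {aa: 0 for aa in AMINO_ACIDS}
    for residue in sequence.upper():
        if residue in counts:
            counts[residue] += 1
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / total for aa in AMINO_ACIDS]


# Hypothetical four-residue fragment: two Ala, one Cys, one Asp.
composition = amino_acid_composition("AACD")
```

    Because every sequence maps to the same 20 features regardless of length, proteins of different sizes can be compared directly, at the cost of discarding residue order.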

  8. A novel one-class SVM based negative data sampling method for reconstructing proteome-wide HTLV-human protein interaction networks.

    PubMed

    Mei, Suyu; Zhu, Hao

    2015-01-26

    Protein-protein interaction (PPI) prediction is generally treated as a problem of binary classification wherein negative data sampling is still an open problem to be addressed. The commonly used random sampling is prone to yield less representative negative data with considerable false negatives. Meanwhile rational constraints are seldom exerted on model selection to reduce the risk of false positive predictions for most of the existing computational methods. In this work, we propose a novel negative data sampling method based on one-class SVM (support vector machine, SVM) to predict proteome-wide protein interactions between HTLV retrovirus and Homo sapiens, wherein one-class SVM is used to choose reliable and representative negative data, and two-class SVM is used to yield proteome-wide outcomes as predictive feedback for rational model selection. Computational results suggest that one-class SVM is more suited to be used as negative data sampling method than two-class PPI predictor, and the predictive feedback constrained model selection helps to yield a rational predictive model that reduces the risk of false positive predictions. Some predictions have been validated by the recent literature. Lastly, gene ontology based clustering of the predicted PPI networks is conducted to provide valuable cues for the pathogenesis of HTLV retrovirus.

  9. Exploring the free-energy landscape of carbohydrate-protein complexes: development and validation of scoring functions considering the binding-site topology

    NASA Astrophysics Data System (ADS)

    Eid, Sameh; Saleh, Noureldin; Zalewski, Adam; Vedani, Angelo

    2014-12-01

    Carbohydrates play a key role in a variety of physiological and pathological processes and, hence, represent a rich source for the development of novel therapeutic agents. Being able to predict binding mode and binding affinity is an essential, yet lacking, aspect of the structure-based design of carbohydrate-based ligands. We assembled a diverse data set comprising 273 carbohydrate-protein crystal structures with known binding affinity and evaluated the prediction accuracy of a large collection of well-established scoring and free-energy functions, as well as combinations thereof. Unfortunately, the tested functions were not capable of reproducing binding affinities in the studied complexes. To simplify the complex free-energy surface of carbohydrate-protein systems, we classified the studied proteins according to the topology and solvent exposure of the carbohydrate-binding site into five distinct categories. A free-energy model based on the proposed classification scheme reproduced binding affinities in the carbohydrate data set with an r² of 0.71 and root-mean-squared-error of 1.25 kcal/mol (N = 236). The improvement in model performance underlines the significance of the differences in the local micro-environments of carbohydrate-binding sites and demonstrates the usefulness of calibrating free-energy functions individually according to binding-site topology and solvent exposure.

  10. Prediction of change in protein unfolding rates upon point mutations in two state proteins.

    PubMed

    Chaudhary, Priyashree; Naganathan, Athi N; Gromiha, M Michael

    2016-09-01

    Studies on protein unfolding rates are limited and challenging due to the complexity of unfolding mechanism and the larger dynamic range of the experimental data. Though attempts have been made to predict unfolding rates using protein sequence-structure information there is no available method for predicting the unfolding rates of proteins upon specific point mutations. In this work, we have systematically analyzed a set of 790 single mutants and developed a robust method for predicting protein unfolding rates upon mutations (Δlnku) in two-state proteins by combining amino acid properties and knowledge-based classification of mutants with multiple linear regression technique. We obtain a mean absolute error (MAE) of 0.79/s and a Pearson correlation coefficient (PCC) of 0.71 between predicted unfolding rates and experimental observations using jack-knife test. We have developed a web server for predicting protein unfolding rates upon mutation and it is freely available at https://www.iitm.ac.in/bioinfo/proteinunfolding/unfoldingrace.html. Prominent features that determine unfolding kinetics as well as plausible reasons for the observed outliers are also discussed. Copyright © 2016 Elsevier B.V. All rights reserved.
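
    The two figures of merit reported in this abstract, mean absolute error and Pearson correlation coefficient, can be computed as in the minimal sketch below. The function names and the toy numbers in the usage lines are illustrative assumptions and are unrelated to the authors' data.

```python
import math


def mean_absolute_error(predicted, observed):
    """Average of |predicted - observed| over paired values."""
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - o) for p, o in zip(predicted, observed)) / len(predicted)


def pearson_correlation(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (spread_x * spread_y)


if __name__ == "__main__":
    predicted = [1.0, 2.0, 3.0]  # toy values only
    observed = [1.0, 3.0, 5.0]
    print(mean_absolute_error(predicted, observed))
    print(pearson_correlation(predicted, observed))
```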

  11. Interactomic approach for evaluating nucleophosmin-binding proteins as biomarkers for Ewing's sarcoma.

    PubMed

    Haga, Ayako; Ogawara, Yoko; Kubota, Daisuke; Kitabayashi, Issay; Murakami, Yasufumi; Kondo, Tadashi

    2013-06-01

    Nucleophosmin (NPM) is a novel prognostic biomarker for Ewing's sarcoma. To evaluate the prognostic utility of NPM, we conducted an interactomic approach to characterize the NPM protein complex in Ewing's sarcoma cells. A gene suppression assay revealed that NPM promoted cell proliferation and the invasive properties of Ewing's sarcoma cells. FLAG-tag-based affinity purification coupled with liquid chromatography-tandem mass spectrometry identified 106 proteins in the NPM protein complex. The functional classification suggested that the NPM complex participates in critical biological events, including ribosome biogenesis, regulation of transcription and translation, and protein folding, that are mediated by these proteins. In addition to JAK1, a candidate prognostic biomarker for Ewing's sarcoma, the NPM complex, includes 11 proteins known as prognostic biomarkers for other malignancies. Meta-analysis of gene expression profiles of 32 patients with Ewing's sarcoma revealed that 6 of 106 were significantly and independently associated with survival period. These observations suggest a functional role as well as prognostic value of these NPM complex proteins in Ewing's sarcoma. Further, our study suggests the potential applications of interactomics in conjunction with meta-analysis for biomarker discovery. © 2013 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.

  12. Prediction protein structural classes with pseudo-amino acid composition: approximate entropy and hydrophobicity pattern.

    PubMed

    Zhang, Tong-Liang; Ding, Yong-Sheng; Chou, Kuo-Chen

    2008-01-07

    Compared with the conventional amino acid (AA) composition, the pseudo-amino acid (PseAA) composition as originally introduced for protein subcellular location prediction can incorporate much more information of a protein sequence, so as to remarkably enhance the power of using a discrete model to predict various attributes of a protein. In this study, based on the concept of PseAA composition, the approximate entropy and hydrophobicity pattern of a protein sequence are used to characterize the PseAA components. Also, the immune genetic algorithm (IGA) is applied to search the optimal weight factors in generating the PseAA composition. Thus, for a given protein sequence sample, a 27-D (dimensional) PseAA composition is generated as its descriptor. The fuzzy K nearest neighbors (FKNN) classifier is adopted as the prediction engine. The results thus obtained in predicting protein structural classification are quite encouraging, indicating that the current approach may also be used to improve the prediction quality of other protein attributes, or at least can play a complementary role to the existing methods in the relevant areas. Our algorithm is written in Matlab and is available by contacting the corresponding author.
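
    A fuzzy K nearest neighbors classifier of the kind named in this abstract can be sketched as follows. This is a common textbook formulation with crisp training labels and inverse-distance weighting controlled by a fuzzifier m; whether it matches the authors' Matlab implementation is not stated in the abstract, and the feature vectors and class names are toy assumptions.

```python
import math


def fuzzy_knn(train, query, k=3, m=2.0):
    """Return {label: membership} for a query vector.

    train is a list of (feature_vector, label) pairs; m > 1 is the fuzzifier.
    Memberships of the k nearest neighbours are weighted by 1 / d**(2/(m-1)).
    """
    if m <= 1.0:
        raise ValueError("fuzzifier m must be greater than 1")
    neighbours = sorted(
        ((math.dist(vec, query), label) for vec, label in train),
        key=lambda pair: pair[0],
    )[:k]
    exponent = 2.0 / (m - 1.0)
    # A small floor on the distance avoids division by zero for exact matches.
    weights = [(1.0 / max(d, 1e-12) ** exponent, label) for d, label in neighbours]
    total = sum(w for w, _ in weights)
    memberships = {}
    for w, label in weights:
        memberships[label] = memberships.get(label, 0.0) + w / total
    return memberships


if __name__ == "__main__":
    toy_train = [
        ([0.0, 0.0], "alpha"),
        ([0.1, 0.0], "alpha"),
        ([1.0, 1.0], "beta"),
        ([1.1, 1.0], "beta"),
    ]
    print(fuzzy_knn(toy_train, [0.05, 0.0], k=3))
```

    The predicted class is the label with the largest membership; the membership values themselves give a rough indication of how clear-cut the assignment is.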

  13. MIPS: analysis and annotation of proteins from whole genomes.

    PubMed

    Mewes, H W; Amid, C; Arnold, R; Frishman, D; Güldener, U; Mannhaupt, G; Münsterkötter, M; Pagel, P; Strack, N; Stümpflen, V; Warfsmann, J; Ruepp, A

    2004-01-01

    The Munich Information Center for Protein Sequences (MIPS-GSF), Neuherberg, Germany, provides protein sequence-related information based on whole-genome analysis. The main focus of the work is directed toward the systematic organization of sequence-related attributes as gathered by a variety of algorithms, primary information from experimental data together with information compiled from the scientific literature. MIPS maintains automatically generated and manually annotated genome-specific databases, develops systematic classification schemes for the functional annotation of protein sequences and provides tools for the comprehensive analysis of protein sequences. This report updates the information on the yeast genome (CYGD), the Neurospora crassa genome (MNCDB), the database of complete cDNAs (German Human Genome Project, NGFN), the database of mammalian protein-protein interactions (MPPI), the database of FASTA homologies (SIMAP), and the interface for the fast retrieval of protein-associated information (QUIPOS). The Arabidopsis thaliana database, the rice database, the plant EST databases (MATDB, MOsDB, SPUTNIK), as well as the databases for the comprehensive set of genomes (PEDANT genomes) are described elsewhere in the 2003 and 2004 NAR database issues, respectively. All databases described, and the detailed descriptions of our projects can be accessed through the MIPS web server (http://mips.gsf.de).

  14. Object-Based Random Forest Classification of Land Cover from Remotely Sensed Imagery for Industrial and Mining Reclamation

    NASA Astrophysics Data System (ADS)

    Chen, Y.; Luo, M.; Xu, L.; Zhou, X.; Ren, J.; Zhou, J.

    2018-04-01

    The RF method based on grid-search parameter optimization could achieve a classification accuracy of 88.16 % in the classification of images with multiple feature variables. This classification accuracy was higher than that of SVM and ANN under the same feature variables. In terms of efficiency, the RF classification method performs better than SVM and ANN and is more capable of handling multidimensional feature variables. The RF method combined with an object-based analysis approach could improve the classification accuracy further. The multiresolution segmentation approach based on ESP scale parameter optimization was used to obtain six scales for image segmentation; when the segmentation scale was 49, the classification accuracy reached its highest value of 89.58 %. The classification accuracy of object-based RF classification was thus 1.42 percentage points higher than that of pixel-based classification (88.16 %). Therefore, the RF classification method combined with an object-based analysis approach could achieve relatively high accuracy in the classification and extraction of land use information for industrial and mining reclamation areas. Moreover, the interpretation of remotely sensed imagery using the proposed method could provide technical support and theoretical reference for remotely sensed monitoring of land reclamation.

  15. Land Cover Analysis by Using Pixel-Based and Object-Based Image Classification Method in Bogor

    NASA Astrophysics Data System (ADS)

    Amalisana, Birohmatin; Rokhmatullah; Hernina, Revi

    2017-12-01

    The advantage of image classification is that it provides information on the earth’s surface, such as landcover and time-series changes. Nowadays, pixel-based image classification is commonly performed with a variety of algorithms such as minimum distance, parallelepiped, maximum likelihood, and Mahalanobis distance. On the other hand, landcover classification can also be obtained by using an object-based image classification technique. Object-based classification uses image segmentation based on parameters such as scale, form, colour, smoothness and compactness. This research aims to compare the results of landcover classification and change detection between the parallelepiped pixel-based and object-based classification methods. The research location is Bogor, with a 20-year observation range from 1996 to 2016. This region is known as an urban area that changes continuously due to its rapid development, so time-series landcover information for this region is of interest.

  16. DOE Office of Scientific and Technical Information (OSTI.GOV)

    Webb-Robertson, Bobbie-Jo M.; Wiberg, Holli K.; Matzke, Melissa M.

    In this review, we apply selected imputation strategies to label-free liquid chromatography–mass spectrometry (LC–MS) proteomics datasets to evaluate the accuracy with respect to metrics of variance and classification. We evaluate several commonly used imputation approaches for individual merits and discuss the caveats of each approach with respect to the example LC–MS proteomics data. In general, local similarity-based approaches, such as the regularized expectation maximization and least-squares adaptive algorithms, yield the best overall performances with respect to metrics of accuracy and robustness. However, no single algorithm consistently outperforms the remaining approaches, and in some cases, performing classification without imputation yielded the most accurate classification. Thus, because of the complex mechanisms of missing data in proteomics, which also vary from peptide to protein, no individual method is a single solution for imputation. In summary, on the basis of the observations in this review, the goal for imputation in the field of computational proteomics should be to develop new approaches that work generically for this data type and new strategies to guide users in the selection of the best imputation for their dataset and analysis objectives.

  17. URS DataBase: universe of RNA structures and their motifs.

    PubMed

    Baulin, Eugene; Yacovlev, Victor; Khachko, Denis; Spirin, Sergei; Roytberg, Mikhail

    2016-01-01

    The Universe of RNA Structures DataBase (URSDB) stores information obtained from all RNA-containing PDB entries (2935 entries in October 2015). The content of the database is updated regularly. The database consists of 51 tables containing indexed data on various elements of the RNA structures. The database provides a web interface allowing the user to select a subset of structures with desired features and to obtain various statistical data for a selected subset of structures or for all structures. In particular, one can easily obtain statistics on geometric parameters of base pairs, on structural motifs (stems, loops, etc.) or on different types of pseudoknots. The user can also view and get information on an individual structure or its selected parts, e.g. RNA-protein hydrogen bonds. URSDB employs a new original definition of loops in RNA structures. That definition fits both pseudoknot-free and pseudoknotted secondary structures and coincides with the classical definition in case of pseudoknot-free structures. To our knowledge, URSDB is the first database supporting searches based on topological classification of pseudoknots and on extended loop classification. Database URL: http://server3.lpm.org.ru/urs/. © The Author(s) 2016. Published by Oxford University Press.

  18. URS DataBase: universe of RNA structures and their motifs

    PubMed Central

    Baulin, Eugene; Yacovlev, Victor; Khachko, Denis; Spirin, Sergei; Roytberg, Mikhail

    2016-01-01

    The Universe of RNA Structures DataBase (URSDB) stores information obtained from all RNA-containing PDB entries (2935 entries in October 2015). The content of the database is updated regularly. The database consists of 51 tables containing indexed data on various elements of the RNA structures. The database provides a web interface allowing the user to select a subset of structures with desired features and to obtain various statistical data for a selected subset of structures or for all structures. In particular, one can easily obtain statistics on geometric parameters of base pairs, on structural motifs (stems, loops, etc.) or on different types of pseudoknots. The user can also view and get information on an individual structure or its selected parts, e.g. RNA–protein hydrogen bonds. URSDB employs a new original definition of loops in RNA structures. That definition fits both pseudoknot-free and pseudoknotted secondary structures and coincides with the classical definition in case of pseudoknot-free structures. To our knowledge, URSDB is the first database supporting searches based on topological classification of pseudoknots and on extended loop classification. Database URL: http://server3.lpm.org.ru/urs/ PMID:27242032

  19. Vitamin D3 Analogues with Low Vitamin D Receptor Binding Affinity Regulate Chondrocyte Proliferation, Proteoglycan Synthesis, and Protein Kinase C Activity

    DTIC Science & Technology

    1997-07-11


  20. Prediction of Protein-Protein Interaction Sites with Machine-Learning-Based Data-Cleaning and Post-Filtering Procedures.

    PubMed

    Liu, Guang-Hui; Shen, Hong-Bin; Yu, Dong-Jun

    2016-04-01

    Accurately predicting protein-protein interaction sites (PPIs) is currently a hot topic because it has been demonstrated to be very useful for understanding disease mechanisms and designing drugs. Machine-learning-based computational approaches have been broadly utilized and demonstrated to be useful for PPI prediction. However, directly applying traditional machine learning algorithms, which often assume that samples in different classes are balanced, often leads to poor performance because of the severe class imbalance that exists in the PPI prediction problem. In this study, we propose a novel method for improving PPI prediction performance by relieving the severity of class imbalance using a data-cleaning procedure and reducing predicted false positives with a post-filtering procedure: First, a machine-learning-based data-cleaning procedure is applied to remove those marginal targets, which may potentially have a negative effect on training a model with a clear classification boundary, from the majority samples to relieve the severity of class imbalance in the original training dataset; then, a prediction model is trained on the cleaned dataset; finally, an effective post-filtering procedure is further used to reduce potential false positive predictions. Stringent cross-validation and independent validation tests on benchmark datasets demonstrated the efficacy of the proposed method, which exhibits highly competitive performance compared with existing state-of-the-art sequence-based PPIs predictors and should supplement existing PPI prediction methods.

  1. Growth condition dependency is the major cause of non-responsiveness upon genetic perturbation

    PubMed Central

    Amini, Saman; Holstege, Frank C. P.

    2017-01-01

    Investigating the role and interplay between individual proteins in biological processes is often performed by assessing the functional consequences of gene inactivation or removal. Depending on the sensitivity of the assay used for determining phenotype, between 66% (growth) and 53% (gene expression) of Saccharomyces cerevisiae gene deletion strains show no defect when analyzed under a single condition. Although it is well known that this non-responsive behavior is caused by different types of redundancy mechanisms or by growth condition/cell type dependency, it is not known what the relative contribution of these different causes is. Understanding the underlying causes of and their relative contribution to non-responsive behavior upon genetic perturbation is extremely important for designing efficient strategies aimed at elucidating gene function and unraveling complex cellular systems. Here, we provide a systematic classification of the underlying causes of and their relative contribution to non-responsive behavior upon gene deletion. The overall contribution of redundancy to non-responsive behavior is estimated at 29%, of which approximately 17% is due to homology-based redundancy and 12% is due to pathway-based redundancy. The major determinant of non-responsiveness is condition dependency (71%). For approximately 14% of protein complexes, just-in-time assembly can be put forward as a potential mechanistic explanation for how proteins can be regulated in a condition dependent manner. Taken together, the results underscore the large contribution of growth condition requirement to non-responsive behavior, which needs to be taken into account for strategies aimed at determining gene function. The classification provided here, can also be further harnessed in systematic analyses of complex cellular systems. PMID:28257504

  2. [Electrophoretic patterns of cell wall protein as a criterion for the identification and classification of Corynebacteria].

    PubMed

    Mykhal's'kyĭ, L O; Furtat, I M; Dem'ianenko, F P; Kostiuchyk, A A

    2001-01-01

    Electrophoretic patterns of cell wall proteins of three industrial strains used for the production of lysine and of eight collection strains from the genus Corynebacterium were studied to analyze their similarity and to assess whether this parameter could serve as an additional criterion for the identification and classification of corynebacteria. Similarity coefficients of the overall and main cell wall protein electrophoretic patterns were determined by a specially created computer program. Electrophoretic analysis showed that every species had an individual protein profile. Among the overall major and minor proteins, biopolymers common to the species, to the genus, and to the individual strain were identified. The results showed that the patterns of main proteins were more conservative and informative than those of overall proteins. The similarity coefficient determined from the main protein patterns correlated with the protein profile characteristics of every analyzed strain and made it possible to distribute the strains into separate groups. The similarity coefficient based on the main protein patterns makes it possible to separate one species or strain from another, which suggests that this parameter could be used as an additional criterion for differentiating corynebacteria and assigning them to a certain taxonomic group.

  3. Computational prediction of virus-human protein-protein interactions using embedding kernelized heterogeneous data.

    PubMed

    Nourani, Esmaeil; Khunjush, Farshad; Durmuş, Saliha

    2016-05-24

    Pathogenic microorganisms exploit host cellular mechanisms and evade host defense mechanisms through molecular pathogen-host interactions (PHIs). Therefore, comprehensive analysis of these PHI networks should be an initial step for developing effective therapeutics against infectious diseases. Computational prediction of PHI data is in increasing demand because of the scarcity of experimental data. Prediction of protein-protein interactions (PPIs) within PHI systems can be formulated as a classification problem, which requires the knowledge of non-interacting protein pairs. This is a restricting requirement since we lack datasets that report non-interacting protein pairs. In this study, we formulated the "computational prediction of PHI data" problem using kernel embedding of heterogeneous data. This eliminates the abovementioned requirement and enables us to predict new interactions without randomly labeling protein pairs as non-interacting. Domain-domain associations are used to filter the predicted results, leading to 175 novel PHIs between 170 human proteins and 105 viral proteins. To compare our results with the state-of-the-art studies that use a binary classification formulation, we modified our settings to consider the same formulation. Detailed evaluations are conducted and our results provide more than 10 percent improvements for accuracy and AUC (area under the receiver operating characteristic curve) results in comparison with state-of-the-art methods.

  4. Fold independent structural comparisons of protein-ligand binding sites for exploring functional relationships.

    PubMed

    Gold, Nicola D; Jackson, Richard M

    2006-02-03

    The rapid growth in protein structural data and the emergence of structural genomics projects have increased the need for automatic structure analysis and tools for function prediction. Small molecule recognition is critical to the function of many proteins; therefore, determination of ligand binding site similarity is important for understanding ligand interactions and may allow their functional classification. Here, we present a binding sites database (SitesBase) that, given a known protein-ligand binding site, allows rapid retrieval of other binding sites with similar structure independent of overall sequence or fold similarity. However, each match is also annotated with sequence similarity and fold information to aid interpretation of structure and functional similarity. Similarity in ligand binding sites can indicate common binding modes and recognition of similar molecules, allowing potential inference of function for an uncharacterised protein or providing additional evidence of common function where sequence or fold similarity is already known. Alternatively, the resource can provide valuable information for detailed studies of molecular recognition including structure-based ligand design and in understanding ligand cross-reactivity. Here, we show examples of atomic similarity between superfamily or more distant fold relatives as well as between seemingly unrelated proteins. Assignment of unclassified proteins to structural superfamilies is also undertaken and in most cases substantiates assignments made using sequence similarity. Correct assignment is also possible where sequence similarity fails to find significant matches, illustrating the potential use of binding site comparisons for newly determined proteins.

  5. PSSP-RFE: accurate prediction of protein structural class by recursive feature extraction from PSI-BLAST profile, physical-chemical property and functional annotations.

    PubMed

    Li, Liqi; Cui, Xiang; Yu, Sanjiu; Zhang, Yuan; Luo, Zhong; Yang, Hua; Zhou, Yue; Zheng, Xiaoqi

    2014-01-01

    Protein structure prediction is critical to functional annotation of the massively accumulated biological sequences, which prompts an imperative need for the development of high-throughput technologies. As a first and key step in protein structure prediction, protein structural class prediction becomes an increasingly challenging task. Amongst most homology-based approaches, the accuracies of protein structural class prediction are sufficiently high for high similarity datasets, but still far from being satisfactory for low similarity datasets, i.e., below 40% in pairwise sequence similarity. Therefore, we present a novel method for accurate and reliable protein structural class prediction for both high and low similarity datasets. This method is based on Support Vector Machine (SVM) in conjunction with integrated features from position-specific score matrix (PSSM), PROFEAT and Gene Ontology (GO). A feature selection approach, SVM-RFE, is also used to rank the integrated feature vectors through recursively removing the feature with the lowest ranking score. The definitive top features selected by SVM-RFE are input into the SVM engines to predict the structural class of a query protein. To validate our method, jackknife tests were applied to seven widely used benchmark datasets, reaching overall accuracies between 84.61% and 99.79%, which are significantly higher than those achieved by state-of-the-art tools. These results suggest that our method could serve as an accurate and cost-effective alternative to existing methods in protein structural classification, especially for low similarity datasets.
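
    The jackknife test used for validation in this abstract holds out each sample in turn, trains on the rest, and counts correct predictions. The minimal sketch below illustrates the procedure with a 1-nearest-neighbour classifier standing in for the SVM; that substitution, the function names, and the toy data are assumptions made to keep the example self-contained.

```python
import math


def nearest_neighbour_label(train, query):
    """Label of the training vector closest to query (stand-in classifier)."""
    return min(train, key=lambda item: math.dist(item[0], query))[1]


def jackknife_accuracy(samples, classify=nearest_neighbour_label):
    """Leave-one-out accuracy: each sample is predicted from all the others."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    correct = 0
    for i, (vector, label) in enumerate(samples):
        rest = samples[:i] + samples[i + 1:]  # training set without sample i
        if classify(rest, vector) == label:
            correct += 1
    return correct / len(samples)


if __name__ == "__main__":
    toy_samples = [([0.0], "a"), ([0.1], "a"), ([1.0], "b"), ([1.1], "b")]
    print(jackknife_accuracy(toy_samples))
```

    Any classifier with the same (train, query) signature can be passed as classify, so the stand-in can be swapped for a trained model wrapper.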

  6. Current Nomenclature of Pseudohypoparathyroidism: Inactivating Parathyroid Hormone/Parathyroid Hormone-Related Protein Signaling Disorder

    PubMed Central

    Turan, Serap

    2017-01-01

    Disorders related to parathyroid hormone (PTH) resistance and PTH signaling pathway impairment are historically classified under the term of pseudohypoparathyroidism (PHP). The disease was first described and named by Fuller Albright and colleagues in 1942. Albright hereditary osteodystrophy (AHO) is described as an associated clinical entity with PHP, characterized by brachydactyly, subcutaneous ossifications, round face, short stature and a stocky build. The classification of PHP is further divided into PHP-Ia, pseudo-PHP (pPHP), PHP-Ib, PHP-Ic and PHP-II according to the presence or absence of AHO, together with an in vivo response to exogenous PTH and the measurement of Gsα protein activity in peripheral erythrocyte membranes in vitro. However, PHP classification fails to differentiate all patients with different clinical and molecular findings for PHP subtypes, and classification has become more complicated with more recent molecular characterization and new forms having been identified. So far, new classifications have been established by the EuroPHP network to cover all disorders of the PTH receptor and its signaling pathway. Inactivating PTH/PTH-related protein signaling disorder (iPPSD) is the new name proposed for a group of these disorders, which can be further divided into subtypes iPPSD1 to iPPSD6. These are termed, starting from PTH receptor inactivation mutation (Eiken and Blomstrand dysplasia) as iPPSD1, inactivating Gsα mutations (PHP-Ia, PHP-Ic and pPHP) as iPPSD2, loss of methylation of GNAS DMRs (PHP-Ib) as iPPSD3, PRKAR1A mutations (acrodysostosis type 1) as iPPSD4, PDE4D mutations (acrodysostosis type 2) as iPPSD5 and PDE3A mutations (autosomal dominant hypertension with brachydactyly) as iPPSD6. iPPSDx is reserved for unknown molecular defects and iPPSDn+1 for new molecular defects which are yet to be described.
    With these new classifications, the aim is to clarify the borders of each different subtype of disease and make the classification according to molecular pathology. The iPPSD group is designed to be expandable and new classifications will readily fit into it as necessary. PMID:29280743

  7. Current Nomenclature of Pseudohypoparathyroidism: Inactivating Parathyroid Hormone/Parathyroid Hormone-Related Protein Signaling Disorder.

    PubMed

    Turan, Serap

    2017-12-30

    Disorders related to parathyroid hormone (PTH) resistance and PTH signaling pathway impairment are historically classified under the term of pseudohypoparathyroidism (PHP). The disease was first described and named by Fuller Albright and colleagues in 1942. Albright hereditary osteodystrophy (AHO) is described as an associated clinical entity with PHP, characterized by brachydactyly, subcutaneous ossifications, round face, short stature and a stocky build. The classification of PHP is further divided into PHP-Ia, pseudo-PHP (pPHP), PHP-Ib, PHP-Ic and PHP-II according to the presence or absence of AHO, together with an in vivo response to exogenous PTH and the measurement of Gsα protein activity in peripheral erythrocyte membranes in vitro. However, PHP classification fails to differentiate all patients with different clinical and molecular findings for PHP subtypes, and classification has become more complicated with more recent molecular characterization and new forms having been identified. So far, new classifications have been established by the EuroPHP network to cover all disorders of the PTH receptor and its signaling pathway. Inactivating PTH/PTH-related protein signaling disorder (iPPSD) is the new name proposed for a group of these disorders, which can be further divided into subtypes iPPSD1 to iPPSD6. These are termed, starting from PTH receptor inactivation mutation (Eiken and Blomstrand dysplasia) as iPPSD1, inactivating Gsα mutations (PHP-Ia, PHP-Ic and pPHP) as iPPSD2, loss of methylation of GNAS DMRs (PHP-Ib) as iPPSD3, PRKAR1A mutations (acrodysostosis type 1) as iPPSD4, PDE4D mutations (acrodysostosis type 2) as iPPSD5 and PDE3A mutations (autosomal dominant hypertension with brachydactyly) as iPPSD6. iPPSDx is reserved for unknown molecular defects and iPPSDn+1 for new molecular defects which are yet to be described.
    With these new classifications, the aim is to clarify the borders of each different subtype of disease and make the classification according to molecular pathology. The iPPSD group is designed to be expandable and new classifications will readily fit into it as necessary.

  8. Effects of gross motor function and manual function levels on performance-based ADL motor skills of children with spastic cerebral palsy.

    PubMed

    Park, Myoung-Ok

    2017-02-01

    [Purpose] The purpose of this study was to determine effects of Gross Motor Function Classification System and Manual Ability Classification System levels on performance-based motor skills of children with spastic cerebral palsy. [Subjects and Methods] Twenty-three children with cerebral palsy were included. The Assessment of Motor and Process Skills was used to evaluate performance-based motor skills in daily life. Gross motor function was assessed using the Gross Motor Function Classification System, and manual function was measured using the Manual Ability Classification System. [Results] Motor skills in daily activities differed significantly by Gross Motor Function Classification System level and Manual Ability Classification System level. According to the results of multiple regression analysis, children categorized as Gross Motor Function Classification System level III scored lower in terms of performance-based motor skills than Gross Motor Function Classification System level I children. Also, when analyzed with respect to Manual Ability Classification System level, level II was lower than level I, and level III was lower than level II in terms of performance-based motor skills. [Conclusion] The results of this study indicate that performance-based motor skills differ among children categorized based on Gross Motor Function Classification System and Manual Ability Classification System levels of cerebral palsy.

  9. Classification of self-assembling protein nanoparticle architectures for applications in vaccine design

    NASA Astrophysics Data System (ADS)

    Indelicato, G.; Burkhard, P.; Twarock, R.

    2017-04-01

    We introduce here a mathematical procedure for the structural classification of a specific class of self-assembling protein nanoparticles (SAPNs) that are used as a platform for repetitive antigen display systems. These SAPNs have distinctive geometries as a consequence of the fact that their peptide building blocks are formed from two linked coiled coils that are designed to assemble into trimeric and pentameric clusters. This allows a mathematical description of particle architectures in terms of bipartite (3,5)-regular graphs. Exploiting the relation with fullerene graphs, we provide a complete atlas of SAPN morphologies. The classification enables a detailed understanding of the spectrum of possible particle geometries that can arise in the self-assembly process. Moreover, it provides a toolkit for a systematic exploitation of SAPNs in bioengineering in the context of vaccine design, predicting the density of B-cell epitopes on the SAPN surface, which is critical for a strong humoral immune response.
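
    As a rough illustration of the graph description above, the following sketch checks whether a graph is bipartite with one node class of degree 3 and the other of degree 5. The adjacency-dict representation and the names `is_bipartite_3_5_regular` and `complete_bipartite` are assumptions made for illustration, not the authors' procedure, and the complete bipartite graph on 5 + 3 nodes is used only as a small test case, not as a claimed particle morphology.

```python
def is_bipartite_3_5_regular(adj):
    """Check a symmetric adjacency dict {node: set_of_neighbours}.

    Returns True if every node has degree 3 or 5 and every edge joins a
    degree-3 node to a degree-5 node. Under that condition the graph is
    bipartite, with the two parts given by the node degrees.
    """
    for node, neighbours in adj.items():
        degree = len(neighbours)
        if degree not in (3, 5):
            return False
        for other in neighbours:
            # 3 + 5 = 8, so an edge between two nodes of the same class fails.
            if len(adj[other]) + degree != 8:
                return False
    return True


def complete_bipartite(n_left, n_right):
    """Build a complete bipartite graph as a symmetric adjacency dict."""
    left = [f"a{i}" for i in range(n_left)]
    right = [f"b{j}" for j in range(n_right)]
    adj = {u: set(right) for u in left}
    adj.update({v: set(left) for v in right})
    return adj


# Five degree-3 nodes and three degree-5 nodes: 5 * 3 = 3 * 5 = 15 edges.
example = complete_bipartite(5, 3)
```

    In any such graph the edge count equals three times the number of degree-3 nodes and five times the number of degree-5 nodes, which is a quick sanity check on candidate architectures.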

  10. PDB-Ligand: a ligand database based on PDB for the automated and customized classification of ligand-binding structures.

    PubMed

    Shin, Jae-Min; Cho, Doo-Ho

    2005-01-01

    PDB-Ligand (http://www.idrtech.com/PDB-Ligand/) is a three-dimensional structure database of small molecular ligands that are bound to larger biomolecules deposited in the Protein Data Bank (PDB). It is also a database tool that allows one to browse, classify, superimpose and visualize these structures. As of May 2004, there are about 4870 types of small molecular ligands, experimentally determined as a complex with protein or DNA in the PDB. The proteins that a given ligand binds are often homologous and present the same binding structure to the ligand. However, there are also many instances wherein a given ligand binds to two or more unrelated proteins, or to the same or homologous protein in different binding environments. PDB-Ligand serves as an interactive structural analysis and clustering tool for all the ligand-binding structures in the PDB. PDB-Ligand also provides an easier way to obtain a number of different structure alignments of many related ligand-binding structures based on a simple and flexible ligand clustering method. PDB-Ligand will be a good resource for both a better interpretation of ligand-binding structures and the development of better scoring functions to be used in many drug discovery applications.

  11. Automated Sample Preparation (ASP): Development of a Rapid Method to Sequentially Isolate Nucleic Acids and Protein from Any Sample Type by a Cartridge-Based System

    DTIC Science & Technology

    2013-11-27

    CUBRC has developed an in-line, multi-analyte isolation technology that utilizes solid phase extraction chemistries to purify...goals. Specifically, CUBRC will design and manufacture a prototype cartridge(s) and test the prototype cartridge for its ability to isolate each... CUBRC, Inc., P.O. Box 400, Buffalo, NY 14225-1955

  12. Comparison Between Self-Guided Langevin Dynamics and Molecular Dynamics Simulations for Structure Refinement of Protein Loop Conformations

    DTIC Science & Technology

    2011-01-01

    ...sampling is based on...atom distance-scaled ideal-gas reference state (DFIRE-AA) statistical potential function.[18] The third approach is the Rosetta all-atom energy func

  13. JNK Signaling: Regulation and Functions Based on Complex Protein-Protein Partnerships

    PubMed Central

    Zeke, András; Misheva, Mariya

    2016-01-01

    SUMMARY The c-Jun N-terminal kinases (JNKs), as members of the mitogen-activated protein kinase (MAPK) family, mediate eukaryotic cell responses to a wide range of abiotic and biotic stress insults. JNKs also regulate important physiological processes, including neuronal functions, immunological actions, and embryonic development, via their impact on gene expression, cytoskeletal protein dynamics, and cell death/survival pathways. Although the JNK pathway has been under study for >20 years, its complexity is still perplexing, with multiple protein partners of JNKs underlying the diversity of actions. Here we review the current knowledge of JNK structure and isoforms as well as the partnerships of JNKs with a range of intracellular proteins. Many of these proteins are direct substrates of the JNKs. We analyzed almost 100 of these target proteins in detail within a framework of their classification based on their regulation by JNKs. Examples of these JNK substrates include a diverse assortment of nuclear transcription factors (Jun, ATF2, Myc, Elk1), cytoplasmic proteins involved in cytoskeleton regulation (DCX, Tau, WDR62) or vesicular transport (JIP1, JIP3), cell membrane receptors (BMPR2), and mitochondrial proteins (Mcl1, Bim). In addition, because upstream signaling components impact JNK activity, we critically assessed the involvement of signaling scaffolds and the roles of feedback mechanisms in the JNK pathway. Despite a clarification of many regulatory events in JNK-dependent signaling during the past decade, many other structural and mechanistic insights are just beginning to be revealed. These advances open new opportunities to understand the role of JNK signaling in diverse physiological and pathophysiological states. PMID:27466283

  14. Classification of antibiotics by neural network analysis of optical resonance data of whispering gallery modes in dielectric microspheres

    NASA Astrophysics Data System (ADS)

    Saetchnikov, Vladimir A.; Tcherniavskaia, Elina A.; Schweiger, Gustav; Ostendorf, Andreas

    2012-04-01

    A novel technique for the label-free analysis of nanoparticles and biomolecules in liquids using optical microcavity resonance of whispering-gallery-type modes is being developed. A scheme based on polymer microspheres fixed by adhesive on the evanescent-wave coupling element has been used. We demonstrated that the spectral shift alone cannot be used for identification of biological agents with the developed approach, so a neural network classifier for biological agents and micro/nanoparticles has been developed. The technique is as follows. While the laser wavelength was tuned, images were recorded as an AVI file. All sequences were broken into single frames, and the location of the resonance was determined in each frame. The image was filtered for noise reduction and integrated over two coordinates to evaluate the integrated energy of the measured signal. The normalized resonance shift of the whispering-gallery modes and the relative efficiency of whispering-gallery mode excitation were used as input data. Other parameters, such as the polarization of the excitation light and the "center of gravity" of the resonance spectra, were also tested as input data for the probabilistic neural network. After designing and training the network, we estimated the accuracy of classification. The classification of antibiotics such as penicillin and cephasolin was performed with an accuracy of not less than 97%. The developed techniques can be used for lab-on-chip sensor-based diagnostic tools, for identification of different biological molecules (e.g. proteins, oligonucleotides, oligosaccharides, lipids, small molecules, viral particles, cells), and for monitoring the dynamics of drug delivery to the body.
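
    The abstract names a probabilistic neural network but gives no implementation details. The sketch below shows the textbook Parzen-window form of such a classifier on two input features; the function name `pnn_classify`, the smoothing parameter `sigma`, the class labels, and all feature values are made up for illustration and are not taken from the study.

```python
import math


def pnn_classify(x, training, sigma=0.1):
    """Classify x with a Parzen-window probabilistic neural network.

    training maps each class label to a list of feature vectors. Each class
    score is the mean Gaussian kernel value between x and that class's
    training vectors; the class with the highest score wins.
    """
    scores = {}
    for label, samples in training.items():
        total = 0.0
        for s in samples:
            sq_dist = sum((a - b) ** 2 for a, b in zip(x, s))
            total += math.exp(-sq_dist / (2.0 * sigma ** 2))
        scores[label] = total / len(samples)
    return max(scores, key=scores.get)


# Made-up toy vectors: (normalized resonance shift, relative excitation efficiency).
toy_training = {
    "class_a": [(0.10, 0.80), (0.12, 0.78), (0.09, 0.83)],
    "class_b": [(0.40, 0.30), (0.42, 0.33), (0.38, 0.28)],
}

print(pnn_classify((0.11, 0.80), toy_training))
```

    The only tunable quantity in this form is `sigma`, which controls how far the influence of each training vector extends.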

  15. A prognostic classifier for patients with colorectal cancer liver metastasis, based on AURKA, PTGS2 and MMP9.

    PubMed

    Goos, Jeroen A C M; Coupé, Veerle M H; van de Wiel, Mark A; Diosdado, Begoña; Delis-Van Diemen, Pien M; Hiemstra, Annemieke C; de Cuba, Erienne M V; Beliën, Jeroen A M; Menke-van der Houven van Oordt, C Willemien; Geldof, Albert A; Meijer, Gerrit A; Hoekstra, Otto S; Fijneman, Remond J A

    2016-01-12

    Prognosis of patients with colorectal cancer liver metastasis (CRCLM) is estimated based on clinicopathological models. Stratifying patients based on tumor biology may have additional value. Tissue micro-arrays (TMAs), containing resected CRCLM and corresponding primary tumors from a multi-institutional cohort of 507 patients, were immunohistochemically stained for 18 candidate biomarkers. Cross-validated hazard rate ratios (HRRs) for overall survival (OS) and the proportion of HRRs with opposite effect (P(HRR < 1) or P(HRR > 1)) were calculated. A classifier was constructed by classification and regression tree (CART) analysis and its prognostic value determined by permutation analysis. Correlations between protein expression in primary tumor-CRCLM pairs were calculated. Based on their putative prognostic value, EGFR (P(HRR < 1) = .02), AURKA (P(HRR < 1) = .02), VEGFA (P(HRR < 1) = .02), PTGS2 (P(HRR < 1) = .01), SLC2A1 (P(HRR > 1) < .01), HIF1α (P(HRR > 1) = .06), KCNQ1 (P(HRR > 1) = .09), CEA (P(HRR > 1) = .05) and MMP9 (P(HRR < 1) = .07) were included in the CART analysis (n = 201). The resulting classifier was based on AURKA, PTGS2 and MMP9 expression and was associated with OS (HRR 2.79, p < .001), also after multivariate analysis (HRR 3.57, p < .001). The prognostic value of the biomarker-based classifier was superior to the clinicopathological model (p = .001). Prognostic value was highest for colon cancer patients (HRR 5.71, p < .001) and patients not treated with systemic therapy (HRR 3.48, p < .01). Classification based on protein expression in primary tumors could be based on AURKA expression only (HRR 2.59, p = .04). A classifier was generated for patients with CRCLM with improved prognostic value compared to the standard clinicopathological prognostic parameters, which may aid selection of patients who may benefit from adjuvant systemic therapy.

  16. Bacterial cell identification in differential interference contrast microscopy images.

    PubMed

    Obara, Boguslaw; Roberts, Mark A J; Armitage, Judith P; Grau, Vicente

    2013-04-23

    Microscopy image segmentation lays the foundation for shape analysis, motion tracking, and classification of biological objects. Despite its importance, automated segmentation remains challenging for several widely used non-fluorescence, interference-based microscopy imaging modalities. One example is differential interference contrast microscopy, which plays an important role in modern bacterial cell biology. Therefore, new revolutions in the field require the development of tools, technologies and work-flows to extract and exploit information from interference-based imaging data so as to achieve new fundamental biological insights and understanding. We have developed and evaluated a high-throughput image analysis and processing approach to detect and characterize bacterial cells and chemotaxis proteins. Its performance was evaluated using differential interference contrast and fluorescence microscopy images of Rhodobacter sphaeroides. Results demonstrate that the proposed approach provides a fast and robust method for detection and analysis of the spatial relationship between bacterial cells and their chemotaxis proteins.

  17. Ratiometric Array of Conjugated Polymers-Fluorescent Protein Provides a Robust Mammalian Cell Sensor.

    PubMed

    Rana, Subinoy; Elci, S Gokhan; Mout, Rubul; Singla, Arvind K; Yazdani, Mahdieh; Bender, Markus; Bajaj, Avinash; Saha, Krishnendu; Bunz, Uwe H F; Jirik, Frank R; Rotello, Vincent M

    2016-04-06

    Supramolecular complexes of a family of positively charged conjugated polymers (CPs) and green fluorescent protein (GFP) create a fluorescence resonance energy transfer (FRET)-based ratiometric biosensor array. Selective multivalent interactions of the CPs with mammalian cell surfaces caused differential change in FRET signals, providing a fingerprint signature for each cell type. The resulting fluorescence signatures allowed the identification of 16 different cell types and discrimination between healthy, cancerous, and metastatic cells, with the same genetic background. While the CP-GFP sensor array completely differentiated between the cell types, only partial classification was achieved for the CPs alone, validating the effectiveness of the ratiometric sensor. The utility of the biosensor was further demonstrated in the detection of blinded unknown samples, where 121 of 128 samples were correctly identified. Notably, this selectivity-based sensor stratified diverse cell types in minutes, using only 2000 cells, without requiring specific biomarkers or cell labeling.

  18. Comparative homology agreement search: An effective combination of homology-search methods

    PubMed Central

    Alam, Intikhab; Dress, Andreas; Rehmsmeier, Marc; Fuellen, Georg

    2004-01-01

    Many methods have been developed to search for homologous members of a protein family in databases, and the reliability of results and conclusions may be compromised if only one method is used, neglecting the others. Here we introduce a general scheme for combining such methods. Based on this scheme, we implemented a tool called comparative homology agreement search (chase) that integrates different search strategies to obtain a combined “E value.” Our results show that a consensus method integrating distinct strategies easily outperforms any of its component algorithms. More specifically, an evaluation based on the Structural Classification of Proteins database reveals that, on average, a coverage of 47% can be obtained in searches for distantly related homologues (i.e., members of the same superfamily but not the same family, which is a very difficult task), accepting only 10 false positives, whereas the individual methods obtain a coverage of 28–38%. PMID:15367730
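
    The evaluation above reports coverage at a fixed number of accepted false positives. One plausible reading of that metric is sketched below: walk down a hit list ranked by score and count true homologues until the allowed number of false positives is exceeded. The function name, the boolean-list input format, and the toy ranking are assumptions for illustration; the abstract does not specify how chase computes this figure.

```python
def coverage_at_false_positives(ranked_is_true, total_true, max_false=10):
    """Fraction of true homologues recovered before exceeding max_false errors.

    ranked_is_true: booleans for hits sorted from best to worst score,
                    True meaning the hit is a genuine homologue.
    total_true:     number of genuine homologues in the whole database.
    """
    found = 0
    false_seen = 0
    for is_true in ranked_is_true:
        if is_true:
            found += 1
        else:
            false_seen += 1
            if false_seen > max_false:
                break  # the tolerated number of false positives is used up
    return found / total_true


# Made-up ranking: three of five true homologues appear before the third error.
toy_ranking = [True, True, False, True, False, False, True]
print(coverage_at_false_positives(toy_ranking, total_true=5, max_false=2))
```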

  19. Prediction of interface residue based on the features of residue interaction network.

    PubMed

    Jiao, Xiong; Ranganathan, Shoba

    2017-11-07

    Protein-protein interaction plays a crucial role in cellular biological processes. Interface prediction can improve our understanding of the molecular mechanisms of the related processes and functions. In this work, we propose a classification method to recognize interface residues based on the features of a weighted residue interaction network. The random forest algorithm is used for the prediction, and 16 network parameters and the B-factor serve as the elements of the input feature vector. Compared with other similar work, the method is feasible and effective. The relative importance of these features is also analyzed to identify the key features for the prediction. Some biological meaning of the important features is explained. The results of this work can be used for related work on structure-function relationship analysis via a residue interaction network model.

  20. The generalization ability of online SVM classification based on Markov sampling.

    PubMed

    Xu, Jie; Yan Tang, Yuan; Zou, Bin; Xu, Zongben; Li, Luoqing; Lu, Yang

    2015-03-01

    In this paper, we consider online support vector machine (SVM) classification learning algorithms with uniformly ergodic Markov chain (u.e.M.c.) samples. We establish a bound on the misclassification error of an online SVM classification algorithm with u.e.M.c. samples based on reproducing kernel Hilbert spaces and obtain a satisfactory convergence rate. We also introduce a novel online SVM classification algorithm based on Markov sampling, and present numerical studies on the learning ability of online SVM classification based on Markov sampling for a benchmark repository. The numerical studies show that the learning performance of the online SVM classification algorithm based on Markov sampling is better than that of classical online SVM classification based on random sampling when the training sample size is larger.
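
    For orientation, a generic online linear SVM can be sketched as one hinge-loss subgradient step per incoming sample. This is a standard Pegasos-style update swapped in for illustration; it is not the kernel-based algorithm or the Markov sampling scheme of the paper, and the names `online_svm`, `predict`, the regularization value `lam`, and the toy stream are all assumptions.

```python
def online_svm(stream, lam=0.01):
    """Online linear SVM: one hinge-loss subgradient step per sample.

    stream yields (x, y) pairs with x a tuple of floats and y in {-1, +1}.
    """
    w = None
    for t, (x, y) in enumerate(stream, start=1):
        if w is None:
            w = [0.0] * len(x)
        eta = 1.0 / (lam * t)  # decaying step size
        margin = y * sum(wi * xi for wi, xi in zip(w, x))
        # Shrink the weights (regularization term of the objective).
        w = [(1.0 - eta * lam) * wi for wi in w]
        # Correct the weights only when the sample violates the margin.
        if margin < 1.0:
            w = [wi + eta * y * xi for wi, xi in zip(w, x)]
    return w


def predict(w, x):
    """Sign of the linear score, with ties assigned to the positive class."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else -1


# Made-up, linearly separable toy stream, presented five times in sequence.
toy_stream = [
    ((2.0, 2.0), 1), ((-3.0, -1.0), -1), ((2.0, 3.0), 1),
    ((-2.0, -2.0), -1), ((3.0, 1.0), 1), ((-2.0, -3.0), -1),
] * 5

weights = online_svm(toy_stream)
```

    In this sketch the order of `toy_stream` is fixed; a sampling scheme such as the one studied in the paper would change how the next sample is chosen, not the update itself.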

  1. Prediction of cancer proteins by integrating protein interaction, domain frequency, and domain interaction data using machine learning algorithms.

    PubMed

    Huang, Chien-Hung; Peng, Huai-Shun; Ng, Ka-Lok

    2015-01-01

    Many proteins are known to be associated with cancer diseases. Quite often, their precise functional role in disease pathogenesis remains unclear. A strategy to gain a better understanding of the function of these proteins is to make use of a combination of different aspects of proteomics data types. In this study, we extended Aragues's method by employing the protein-protein interaction (PPI) data, domain-domain interaction (DDI) data, weighted domain frequency score (DFS), and cancer linker degree (CLD) data to predict cancer proteins. Performances were benchmarked based on three kinds of experiments as follows: (I) using individual algorithms, (II) combining algorithms, and (III) combining the same classification types of algorithms. When compared with Aragues's method, our proposed methods, that is, machine learning algorithm and voting with the majority, are significantly superior in all seven performance measures. We demonstrated the accuracy of the proposed method on two independent datasets. The best algorithm can achieve a hit ratio of 89.4% and 72.8% for the lung cancer dataset and the lung cancer microarray study, respectively. It is anticipated that the current research could help understand disease mechanisms and diagnosis.
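
    The "voting with the majority" step mentioned above can be illustrated with a minimal sketch that combines the labels produced by several classifiers for one protein. The function name `majority_vote` and the example labels are hypothetical; the abstract does not say which classifiers were combined or how ties were handled.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the label predicted by the largest number of classifiers."""
    if not predictions:
        raise ValueError("at least one prediction is required")
    counts = Counter(predictions)
    return counts.most_common(1)[0][0]


# Hypothetical outputs of three classifiers for a single protein.
votes = ["cancer", "non-cancer", "cancer"]
print(majority_vote(votes))
```

    Using an odd number of binary classifiers, as in this example, avoids ties altogether.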

  2. Prediction of Cancer Proteins by Integrating Protein Interaction, Domain Frequency, and Domain Interaction Data Using Machine Learning Algorithms

    PubMed Central

    2015-01-01

    Many proteins are known to be associated with cancer diseases. Quite often, their precise functional role in disease pathogenesis remains unclear. A strategy to gain a better understanding of the function of these proteins is to make use of a combination of different aspects of proteomics data types. In this study, we extended Aragues's method by employing the protein-protein interaction (PPI) data, domain-domain interaction (DDI) data, weighted domain frequency score (DFS), and cancer linker degree (CLD) data to predict cancer proteins. Performances were benchmarked based on three kinds of experiments as follows: (I) using individual algorithms, (II) combining algorithms, and (III) combining the same classification types of algorithms. When compared with Aragues's method, our proposed methods, that is, machine learning algorithm and voting with the majority, are significantly superior in all seven performance measures. We demonstrated the accuracy of the proposed method on two independent datasets. The best algorithm can achieve a hit ratio of 89.4% and 72.8% for the lung cancer dataset and the lung cancer microarray study, respectively. It is anticipated that the current research could help understand disease mechanisms and diagnosis. PMID:25866773

  3. MIPS: analysis and annotation of proteins from whole genomes

    PubMed Central

    Mewes, H. W.; Amid, C.; Arnold, R.; Frishman, D.; Güldener, U.; Mannhaupt, G.; Münsterkötter, M.; Pagel, P.; Strack, N.; Stümpflen, V.; Warfsmann, J.; Ruepp, A.

    2004-01-01

    The Munich Information Center for Protein Sequences (MIPS-GSF), Neuherberg, Germany, provides protein sequence-related information based on whole-genome analysis. The main focus of the work is directed toward the systematic organization of sequence-related attributes as gathered by a variety of algorithms, primary information from experimental data together with information compiled from the scientific literature. MIPS maintains automatically generated and manually annotated genome-specific databases, develops systematic classification schemes for the functional annotation of protein sequences and provides tools for the comprehensive analysis of protein sequences. This report updates the information on the yeast genome (CYGD), the Neurospora crassa genome (MNCDB), the database of complete cDNAs (German Human Genome Project, NGFN), the database of mammalian protein–protein interactions (MPPI), the database of FASTA homologies (SIMAP), and the interface for the fast retrieval of protein-associated information (QUIPOS). The Arabidopsis thaliana database, the rice database, the plant EST databases (MATDB, MOsDB, SPUTNIK), as well as the databases for the comprehensive set of genomes (PEDANT genomes) are described elsewhere in the 2003 and 2004 NAR database issues, respectively. All databases described, and the detailed descriptions of our projects can be accessed through the MIPS web server (http://mips.gsf.de). PMID:14681354

  4. Sirius PSB: a generic system for analysis of biological sequences.

    PubMed

    Koh, Chuan Hock; Lin, Sharene; Jedd, Gregory; Wong, Limsoon

    2009-12-01

    Computational tools are essential components of modern biological research. For example, BLAST searches can be used to identify related proteins based on sequence homology, or when a new genome is sequenced, prediction models can be used to annotate functional sites such as transcription start sites, translation initiation sites and polyadenylation sites and to predict protein localization. Here we present Sirius Prediction Systems Builder (PSB), a new computational tool for sequence analysis, classification and searching. Sirius PSB has four main operations: (1) building a classifier, (2) deploying a classifier, (3) searching for proteins similar to query proteins, and (4) preliminary and post-prediction analysis. Sirius PSB supports all these operations via a simple and interactive graphical user interface. Besides being a convenient tool, Sirius PSB has also introduced two novelties in sequence analysis. Firstly, a genetic algorithm is used to identify interesting features in the feature space. Secondly, instead of the conventional method of searching for similar proteins via sequence similarity, we introduced searching via features' similarity. To demonstrate the capabilities of Sirius PSB, we have built two prediction models - one for the recognition of Arabidopsis polyadenylation sites and another for the subcellular localization of proteins. Both systems are competitive against current state-of-the-art models based on evaluation of public datasets. More notably, the time and effort required to build each model are greatly reduced with the assistance of Sirius PSB. Furthermore, we show that under certain conditions when BLAST is unable to find related proteins, Sirius PSB can identify functionally related proteins based on their biophysical similarities. Sirius PSB and its related supplements are available at: http://compbio.ddns.comp.nus.edu.sg/~sirius.

  5. 3D flexible alignment using 2D maximum common substructure: dependence of prediction accuracy on target-reference chemical similarity.

    PubMed

    Kawabata, Takeshi; Nakamura, Haruki

    2014-07-28

    A protein-bound conformation of a target molecule can be predicted by aligning the target molecule on the reference molecule obtained from the 3D structure of the compound-protein complex. This strategy is called "similarity-based docking". For this purpose, we develop the flexible alignment program fkcombu, which aligns the target molecule based on atomic correspondences with the reference molecule. The correspondences are obtained by the maximum common substructure (MCS) of 2D chemical structures, using our program kcombu. The prediction performance was evaluated using many target-reference pairs of superimposed ligand 3D structures on the same protein in the PDB, with different ranges of chemical similarity. The details of atomic correspondence largely affected the prediction success. We found that topologically constrained disconnected MCS (TD-MCS) with the simple element-based atomic classification provides the best prediction. The crashing potential energy with the receptor protein improved the performance. We also found that the RMSD between the predicted and correct target conformations significantly correlates with the chemical similarities between target-reference molecules. Generally speaking, if the reference and target compounds have more than 70% chemical similarity, then the average RMSD of 3D conformations is <2.0 Å. We compared the performance with a rigid-body molecular alignment program based on volume-overlap scores (ShaEP). Our MCS-based flexible alignment program performed better than the rigid-body alignment program, especially when the target and reference molecules were sufficiently similar.
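
    The RMSD figure quoted above compares a predicted conformation with the correct one. A minimal sketch of that calculation follows, assuming both conformations list the same atoms in the same order and are already expressed in the same receptor frame, so no superposition is performed. The function name `rmsd` and the toy coordinates are illustrative and are not part of fkcombu.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally ordered atom lists.

    Each argument is a list of (x, y, z) tuples in angstroms.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("conformations must be non-empty and of equal length")
    total = 0.0
    for p, q in zip(coords_a, coords_b):
        total += sum((a - b) ** 2 for a, b in zip(p, q))
    return math.sqrt(total / len(coords_a))


# Made-up two-atom example: every atom is displaced by 1 angstrom along x.
predicted = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
reference = [(1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
print(rmsd(predicted, reference))
```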

  6. Identification of Proteins Involved in Carbohydrate Metabolism and Energy Metabolism Pathways and Their Regulation of Cytoplasmic Male Sterility in Wheat.

    PubMed

    Geng, Xingxia; Ye, Jiali; Yang, Xuetong; Li, Sha; Zhang, Lingli; Song, Xiyue

    2018-01-23

    Cytoplasmic male sterility (CMS) where no functional pollen is produced has important roles in wheat breeding. The anther is a unique organ for male gametogenesis and its abnormal development can cause male sterility. However, the mechanisms and regulatory networks related to plant male sterility are poorly understood. In this study, we conducted comparative analyses using isobaric tags for relative and absolute quantification (iTRAQ) of the pollen proteins in a CMS line and its wheat maintainer. Differentially abundant proteins (DAPs) were analyzed based on Gene Ontology classifications, metabolic pathways and transcriptional regulation networks using Blast2GO. We identified 5570 proteins based on 23,277 peptides, which matched with 73,688 spectra, including proteins in key pathways such as glyceraldehyde-3-phosphate dehydrogenase, pyruvate kinase and 6-phosphofructokinase 1 in the glycolysis pathway, isocitrate dehydrogenase and citrate synthase in the tricarboxylic acid cycle and nicotinamide adenine dinucleotide (NADH)-dehydrogenase and adenosine-triphosphate (ATP) synthases in the oxidative phosphorylation pathway. These proteins may comprise a network that regulates male sterility in wheat. Quantitative real time polymerase chain reaction (qRT-PCR) analysis, ATP assays and total sugar assays validated the iTRAQ results. These DAPs could be associated with abnormal pollen grain formation and male sterility. Our findings provide insights into the molecular mechanism related to male sterility in wheat.

  7. Columba: an integrated database of proteins, structures, and annotations.

    PubMed

    Trissl, Silke; Rother, Kristian; Müller, Heiko; Steinke, Thomas; Koch, Ina; Preissner, Robert; Frömmel, Cornelius; Leser, Ulf

    2005-03-31

    Structural and functional research often requires the computation of sets of protein structures based on certain properties of the proteins, such as sequence features, fold classification, or functional annotation. Compiling such sets using current web resources is tedious because the necessary data are spread over many different databases. To facilitate this task, we have created COLUMBA, an integrated database of annotations of protein structures. COLUMBA currently integrates twelve different databases, including PDB, KEGG, Swiss-Prot, CATH, SCOP, the Gene Ontology, and ENZYME. The database can be searched using either keyword search or data source-specific web forms. Users can thus quickly select and download PDB entries that, for instance, participate in a particular pathway, are classified as containing a certain CATH architecture, are annotated as having a certain molecular function in the Gene Ontology, and whose structures have a resolution under a defined threshold. The results of queries are provided in both machine-readable extensible markup language and human-readable format. The structures themselves can be viewed interactively on the web. The COLUMBA database facilitates the creation of protein structure data sets for many structure-based studies. It allows queries to be combined on a number of structure-related databases not covered by other projects at present. Thus, information on both many and few protein structures can be used efficiently. The web interface for COLUMBA is available at http://www.columba-db.de.

  8. Automated classification of immunostaining patterns in breast tissue from the human protein atlas.

    PubMed

    Swamidoss, Issac Niwas; Kårsnäs, Andreas; Uhlmann, Virginie; Ponnusamy, Palanisamy; Kampf, Caroline; Simonsson, Martin; Wählby, Carolina; Strand, Robin

    2013-01-01

    The Human Protein Atlas (HPA) is an effort to map the location of all human proteins (http://www.proteinatlas.org/). It contains a large number of histological images of sections from human tissue. Tissue micro arrays (TMA) are imaged by a slide scanning microscope, and each image represents a thin slice of a tissue core with a dark brown antibody specific stain and a blue counter stain. When generating antibodies for protein profiling of the human proteome, an important step in the quality control is to compare staining patterns of different antibodies directed towards the same protein. This comparison is an ultimate control that the antibody recognizes the right protein. In this paper, we propose and evaluate different approaches for classifying sub-cellular antibody staining patterns in breast tissue samples. The proposed methods include the computation of various features including gray level co-occurrence matrix (GLCM) features, complex wavelet co-occurrence matrix (CWCM) features, and weighted neighbor distance using compound hierarchy of algorithms representing morphology (WND-CHARM)-inspired features. The extracted features are fed into two different multivariate classifiers (support vector machine (SVM) and linear discriminant analysis (LDA) classifier). Before extracting features, we use color deconvolution to separate different tissue components, such as the brown-stained positive regions and the blue cellular regions, in the immuno-stained TMA images of breast tissue. We present classification results based on combinations of feature measurements. The proposed complex wavelet features and the WND-CHARM features have accuracy similar to that of a human expert. Both human experts and the proposed automated methods have difficulties discriminating between nuclear and cytoplasmic staining patterns. This is to a large extent due to mixed staining of nucleus and cytoplasm. Methods for quantification of staining patterns in histopathology have many applications, ranging from antibody quality control to tumor grading.
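
    The gray level co-occurrence matrix (GLCM) named in the abstract above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`glcm`, `contrast`, `energy`), the single horizontal pixel offset, and the toy two-level image are assumptions made purely for illustration, and the two statistics shown are only common examples of GLCM-derived texture features.

```python
from collections import Counter


def glcm(image, dx=1, dy=0):
    """Return normalized co-occurrence probabilities of gray-level pairs.

    image is a list of rows of integer gray levels; (dx, dy) is the pixel
    offset between the two pixels of each pair (an illustrative choice).
    """
    counts = Counter()
    rows, cols = len(image), len(image[0])
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dy, c + dx
            if 0 <= r2 < rows and 0 <= c2 < cols:
                counts[(image[r][c], image[r2][c2])] += 1
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()}


def contrast(p):
    """Contrast: squared gray-level difference weighted by pair probability."""
    return sum(prob * (a - b) ** 2 for (a, b), prob in p.items())


def energy(p):
    """Energy (angular second moment): sum of squared pair probabilities."""
    return sum(prob ** 2 for prob in p.values())


if __name__ == "__main__":
    toy_image = [[0, 0, 1],
                 [0, 1, 1]]  # hypothetical 2x3 image with two gray levels
    p = glcm(toy_image)
    print(p, contrast(p), energy(p))
```

    In practice such features would be computed per color-deconvolved channel and over several offsets before being passed to a classifier; those choices are not specified in the abstract.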

  9. Comparative proteomic analysis reveals the positive effect of exogenous spermidine on photosynthesis and salinity tolerance in cucumber seedlings.

    PubMed

    Sang, Ting; Shan, Xi; Li, Bin; Shu, Sheng; Sun, Jin; Guo, Shirong

    2016-08-01

    Our results, based on proteomics data and physiological alterations, suggest a putative mechanism by which exogenous Spd enhances salinity tolerance in cucumber seedlings. Current studies showed that exogenous spermidine (Spd) could alleviate harmful effects of salinity. It is important to increase our understanding of the beneficial physiological responses of exogenous Spd treatment, and to determine the molecular responses underlying these responses. Here, we combined a physiological analysis with iTRAQ-based comparative proteomics of cucumber (Cucumis sativus L.) leaves, treated with 0.1 mM exogenous Spd, 75 mM NaCl and/or exogenous Spd. A total of 221 differentially expressed proteins were found, involved in 30 metabolic pathways, such as photosynthesis, carbohydrate metabolism, amino acid metabolism, stress response, signal transduction and antioxidant. Based on functional classification of the differentially expressed proteins and the physiological responses, we found cucumber seedlings treated with Spd under salt stress had higher photosynthesis efficiency, upregulated tetrapyrrole synthesis, stronger ROS scavenging ability and more protein biosynthesis activity than NaCl treatment, suggesting that these pathways may promote salt tolerance under high salinity. This study provided insights into how exogenous Spd protects photosynthesis and enhances salt tolerance in cucumber seedlings.

  10. Partial cooperative unfolding in proteins as observed by hydrogen exchange mass spectrometry

    PubMed Central

    Engen, John R.; Wales, Thomas E.; Chen, Shugui; Marzluff, Elaine M.; Hassell, Kerry M.; Weis, David D.; Smithgall, Thomas E.

    2013-01-01

    Many proteins do not exist in a single rigid conformation. Protein motions, or dynamics, exist and in many cases are important for protein function. The analysis of protein dynamics relies on biophysical techniques that can distinguish simultaneously existing populations of molecules and their rates of interconversion. Hydrogen exchange (HX) detected by mass spectrometry (MS) is contributing to our understanding of protein motions by revealing unfolding and dynamics on a wide timescale, ranging from seconds to hours to days. In this review we discuss HX MS-based analyses of protein dynamics, using our studies of multi-domain kinases as examples. Using HX MS, we have successfully probed protein dynamics and unfolding in the isolated SH3, SH2 and kinase domains of the c-Src and Abl kinase families, as well as the role of inter- and intra-molecular interactions in the global control of kinase function. Coupled with high-resolution structural information, HX MS has proved to be a powerful and versatile tool for the analysis of the conformational dynamics in these kinase systems, and has provided fresh insight regarding the regulatory control of these important signaling proteins. HX MS studies of dynamics are applicable not only to the proteins we illustrate here, but to a very wide range of proteins and protein systems, and should play a role in both classification of and greater understanding of the prevalence of protein motion. PMID:23682200

  11. EXTENDING AQUATIC CLASSIFICATION TO THE LANDSCAPE SCALE HYDROLOGY-BASED STRATEGIES

    EPA Science Inventory

    Aquatic classification of single water bodies (lakes, wetlands, estuaries) is often based on geologic origin, while stream classification has relied on multiple factors related to landform, geomorphology, and soils. We have developed an approach to aquatic classification based o...

  12. Proteomics Versus Clinical Data and Stochastic Local Search Based Feature Selection for Acute Myeloid Leukemia Patients' Classification.

    PubMed

    Chebouba, Lokmane; Boughaci, Dalila; Guziolowski, Carito

    2018-06-04

    The use of data from high-throughput technologies in drug target problems has become widespread over the last decades. This study proposes a meta-heuristic framework using stochastic local search (SLS) combined with random forest (RF) where the aim is to specify the most important genes and proteins leading to the best classification of Acute Myeloid Leukemia (AML) patients. First we use a stochastic local search meta-heuristic as a feature selection technique to select the most significant proteins to be used in the classification task step. Then we apply RF to classify new patients into their corresponding classes. The evaluation technique is to run the RF classifier on the training data to get a model. Then, we apply this model on the test data to find the appropriate class. We use as metrics the balanced accuracy (BAC) and the area under the receiver operating characteristic curve (AUROC) to measure the performance of our model. The proposed method is evaluated on the dataset issued from DREAM 9 challenge. The comparison is done with a pure random forest (without feature selection), and with the two best ranked results of the DREAM 9 challenge. We used three types of data: only clinical data, only proteomics data, and finally clinical and proteomics data combined. The numerical results show that the highest scores are obtained when using clinical data alone, and the lowest is obtained when using proteomics data alone. Further, our method succeeds in finding promising results compared to the methods presented in the DREAM challenge.
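
    The two metrics named in the abstract above, balanced accuracy (BAC) and AUROC, can be written down compactly for the binary case. The sketch below uses the standard definitions (mean of sensitivity and specificity; fraction of positive/negative pairs ranked correctly, with ties counted as one half). The function names and the toy labels are assumptions for illustration and are not taken from the study or from the DREAM 9 scoring code.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    pos = sum(1 for t in y_true if t == 1)
    neg = len(y_true) - pos
    # Assumes both classes are present; otherwise a ratio is undefined.
    return 0.5 * (tp / pos + tn / neg)


def auroc(y_true, scores):
    """AUROC as the probability that a positive outscores a negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    labels = [1, 1, 0, 0]             # hypothetical patient classes
    predicted = [1, 0, 0, 0]          # hypothetical hard predictions
    scores = [0.9, 0.4, 0.5, 0.1]     # hypothetical classifier scores
    print(balanced_accuracy(labels, predicted), auroc(labels, scores))
```

    Balanced accuracy is often preferred over plain accuracy when the classes are of unequal size, since each class contributes equally to the score.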

  13. Classifications of Acute Scaphoid Fractures: A Systematic Literature Review.

    PubMed

    Ten Berg, Paul W; Drijkoningen, Tessa; Strackee, Simon D; Buijze, Geert A

    2016-05-01

    Background In the absence of consensus, surgeon-based preference determines how acute scaphoid fractures are classified. There is a great variety of classification systems with considerable controversies. Purposes The purpose of this study was to provide an overview of the different classification systems, clarifying their subgroups and analyzing their popularity by comparing citation indexes. The intention was to improve data comparison between studies using heterogeneous fracture descriptions. Methods We performed a systematic review of the literature based on a search of medical literature from 1950 to 2015, and a manual search using the reference lists in relevant book chapters. Only original descriptions of classifications of acute scaphoid fractures in adults were included. Popularity was based on citation index as reported in the databases of Web of Science (WoS) and Google Scholar. Articles that were cited <10 times in WoS were excluded. Results Our literature search resulted in 308 potentially eligible descriptive reports of which 12 reports met the inclusion criteria. We distinguished 13 different (sub) classification systems based on (1) fracture location, (2) fracture plane orientation, and (3) fracture stability/displacement. Based on citation numbers, the Herbert classification was most popular, followed by the Russe and Mayo classifications. All classification systems were based on plain radiography. Conclusions Most classification systems were based on fracture location, displacement, or stability. Based on the controversy and limited reliability of current classification systems, suggested research areas for an updated classification include three-dimensional fracture pattern etiology and fracture fragment mobility assessed by dynamic imaging.

  14. 7 CFR 27.36 - Classification and Micronaire determinations based on official standards.

    Code of Federal Regulations, 2010 CFR

    2010-01-01

    ... 7 Agriculture 2 2010-01-01 2010-01-01 false Classification and Micronaire determinations based on... COMMODITY STANDARDS AND STANDARD CONTAINER REGULATIONS COTTON CLASSIFICATION UNDER COTTON FUTURES LEGISLATION Regulations Classification and Micronaire Determinations § 27.36 Classification and Micronaire...

  15. 7 CFR 27.36 - Classification and Micronaire determinations based on official standards.

    Code of Federal Regulations, 2011 CFR

    2011-01-01

    ... 7 Agriculture 2 2011-01-01 2011-01-01 false Classification and Micronaire determinations based on... COMMODITY STANDARDS AND STANDARD CONTAINER REGULATIONS COTTON CLASSIFICATION UNDER COTTON FUTURES LEGISLATION Regulations Classification and Micronaire Determinations § 27.36 Classification and Micronaire...

  16. Community of protein complexes impacts disease association

    PubMed Central

    Wang, Qianghu; Liu, Weisha; Ning, Shangwei; Ye, Jingrun; Huang, Teng; Li, Yan; Wang, Peng; Shi, Hongbo; Li, Xia

    2012-01-01

    One important challenge in the post-genomic era is uncovering the relationships among distinct pathophenotypes by using molecular signatures. Given the complex functional interdependencies between cellular components, a disease is seldom the consequence of a defect in a single gene product, instead reflecting the perturbations of a group of closely related gene products that carry out specific functions together. Therefore, it is meaningful to explore how the community of protein complexes impacts disease associations. Here, by integrating a large amount of information from protein complexes and the cellular basis of diseases, we built a human disease network in which two diseases are linked if they share a common disease-related protein complex. A systemic analysis revealed that linked disease pairs exhibit higher comorbidity than those that have no links, and that the stronger the association between two diseases based on protein complexes, the higher the comorbidity they tend to display. Moreover, more connected diseases tend to be malignant, which have high prevalence. We provide novel disease associations that cannot be identified through previous analysis. These findings will potentially provide biologists and clinicians new insights into the etiology, classification and treatment of diseases. PMID:22549411
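
    The linking rule stated in the abstract above (two diseases are connected if they share a disease-related protein complex) can be sketched as follows. The function name, the disease and complex identifiers, and the use of the number of shared complexes as an edge weight are all assumptions for illustration; the abstract does not say how association strength was actually measured.

```python
from itertools import combinations


def build_disease_network(disease_to_complexes):
    """Link every pair of diseases that share at least one protein complex.

    Returns a dict mapping (disease_a, disease_b) to the number of shared
    complexes, used here as a stand-in edge weight (an assumption).
    """
    edges = {}
    for d1, d2 in combinations(sorted(disease_to_complexes), 2):
        shared = disease_to_complexes[d1] & disease_to_complexes[d2]
        if shared:
            edges[(d1, d2)] = len(shared)
    return edges


if __name__ == "__main__":
    # Hypothetical disease -> complex annotations.
    annotations = {
        "disease_A": {"complex_1", "complex_2"},
        "disease_B": {"complex_2"},
        "disease_C": {"complex_3"},
    }
    print(build_disease_network(annotations))
```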

  17. A novel feature ranking method for prediction of cancer stages using proteomics data

    PubMed Central

    Saghapour, Ehsan; Sehhati, Mohammadreza

    2017-01-01

    Proteomic analysis of cancers' stages has provided new opportunities for the development of novel, highly sensitive diagnostic tools that help early detection of cancer. This paper introduces a new feature ranking approach called FRMT. FRMT is based on the Technique for Order of Preference by Similarity to Ideal Solution (TOPSIS) method and selects the most discriminative proteins from proteomics data for cancer staging. In this approach, outcomes of 10 feature selection techniques were combined by the TOPSIS method to select the final discriminative proteins from seven different proteomic databases of protein expression profiles. In the proposed workflow, feature selection methods and protein expressions have been considered as criteria and alternatives in TOPSIS, respectively. The proposed method is tested on seven different classifier models in a 10-fold cross validation procedure that was repeated 30 times on the seven cancer datasets. The obtained results proved the higher stability and superior classification performance of the method in comparison with other methods, and it is less sensitive to the applied classifier. Moreover, the final introduced proteins are informative and have the potential for application in real medical practice. PMID:28934234
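
    The abstract above treats proteins as TOPSIS alternatives and feature selection methods as TOPSIS criteria. A minimal sketch of textbook TOPSIS under that framing is given below; it is not the FRMT implementation. The function name, the equal weights, the assumption that every criterion is a "higher is better" score, and the toy matrix are all illustrative.

```python
import math


def topsis(matrix, weights):
    """Rank alternatives (rows) against benefit criteria (columns).

    Returns one closeness score per row in [0, 1]; higher is better.
    Assumes every column has a non-zero norm and the rows are not all equal.
    """
    n_crit = len(weights)
    # Vector-normalize each column, then apply the criterion weights.
    norms = [math.sqrt(sum(row[j] ** 2 for row in matrix)) for j in range(n_crit)]
    v = [[weights[j] * row[j] / norms[j] for j in range(n_crit)] for row in matrix]
    # Ideal and anti-ideal points for benefit criteria.
    best = [max(row[j] for row in v) for j in range(n_crit)]
    worst = [min(row[j] for row in v) for j in range(n_crit)]
    scores = []
    for row in v:
        d_best = math.dist(row, best)
        d_worst = math.dist(row, worst)
        scores.append(d_worst / (d_best + d_worst))
    return scores


if __name__ == "__main__":
    # Rows: hypothetical proteins; columns: scores from two hypothetical
    # feature selection methods.
    protein_scores = [[0.9, 0.8],
                      [0.5, 0.5],
                      [0.1, 0.2]]
    print(topsis(protein_scores, weights=[0.5, 0.5]))
```

    Sorting proteins by the returned closeness score yields a single consensus ranking from several feature selection outputs, which is the role TOPSIS plays in the workflow the abstract describes.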

  18. Protein Sub-Nuclear Localization Based on Effective Fusion Representations and Dimension Reduction Algorithm LDA.

    PubMed

    Wang, Shunfang; Liu, Shuhui

    2015-12-19

    An effective representation of a protein sequence plays a crucial role in protein sub-nuclear localization. The existing representations, such as dipeptide composition (DipC), pseudo-amino acid composition (PseAAC) and position specific scoring matrix (PSSM), are insufficient to represent protein sequence due to their single perspectives. Thus, this paper proposes two fusion feature representations of DipPSSM and PseAAPSSM to integrate PSSM with DipC and PseAAC, respectively. When constructing each fusion representation, we introduce the balance factors to value the importance of its components. The optimal values of the balance factors are sought by genetic algorithm. Due to the high dimensionality of the proposed representations, linear discriminant analysis (LDA) is used to find its important low dimensional structure, which is essential for classification and location prediction. The numerical experiments on two public datasets with KNN classifier and cross-validation tests showed that in terms of the common indexes of sensitivity, specificity, accuracy and MCC, the proposed fusing representations outperform the traditional representations in protein sub-nuclear localization, and the representation treated by LDA outperforms the untreated one.
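
    Of the sequence representations listed in the abstract above, dipeptide composition (DipC) is the simplest to state: the relative frequency of each of the 400 ordered pairs of standard amino acids among adjacent residues. The sketch below shows only that component. The function name and the decision to skip pairs containing non-standard residues are assumptions, and the fusion with PSSM via balance factors is not reproduced because the abstract does not specify its form.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered dipeptides in a fixed order, so vectors are comparable.
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def dipeptide_composition(sequence):
    """Return a 400-dimensional vector of adjacent-pair frequencies."""
    counts = dict.fromkeys(DIPEPTIDES, 0)
    total = 0
    for i in range(len(sequence) - 1):
        pair = sequence[i:i + 2]
        if pair in counts:  # skip pairs with non-standard residues (assumption)
            counts[pair] += 1
            total += 1
    return [counts[d] / total if total else 0.0 for d in DIPEPTIDES]


if __name__ == "__main__":
    vec = dipeptide_composition("ACAC")  # hypothetical toy sequence
    print(len(vec), vec[DIPEPTIDES.index("AC")], vec[DIPEPTIDES.index("CA")])
```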

  19. Protein Sub-Nuclear Localization Based on Effective Fusion Representations and Dimension Reduction Algorithm LDA

    PubMed Central

    Wang, Shunfang; Liu, Shuhui

    2015-01-01

    An effective representation of a protein sequence plays a crucial role in protein sub-nuclear localization. The existing representations, such as dipeptide composition (DipC), pseudo-amino acid composition (PseAAC) and position specific scoring matrix (PSSM), are insufficient to represent protein sequence due to their single perspectives. Thus, this paper proposes two fusion feature representations of DipPSSM and PseAAPSSM to integrate PSSM with DipC and PseAAC, respectively. When constructing each fusion representation, we introduce the balance factors to value the importance of its components. The optimal values of the balance factors are sought by genetic algorithm. Due to the high dimensionality of the proposed representations, linear discriminant analysis (LDA) is used to find its important low dimensional structure, which is essential for classification and location prediction. The numerical experiments on two public datasets with KNN classifier and cross-validation tests showed that in terms of the common indexes of sensitivity, specificity, accuracy and MCC, the proposed fusing representations outperform the traditional representations in protein sub-nuclear localization, and the representation treated by LDA outperforms the untreated one. PMID:26703574

  20. Meconium proteins as a source of biomarkers for the assessment of the intrauterine environment of the fetus.

    PubMed

    Lisowska-Myjak, B; Skarżyńska, E; Bakun, M

    2018-06-01

    Intrauterine environmental factors can be associated with perinatal complications and long-term health outcomes although the underlying mechanisms remain poorly defined. Meconium formed exclusively in utero and passed naturally by a neonate may contain proteins which characterise the intrauterine environment. The aim of the study was proteomic analysis of the composition of meconium proteins and their classification by biological function. Proteomic techniques combining isoelectrofocussing fractionation and LC-MS/MS analysis were used to study the protein composition of a meconium sample obtained by pooling 50 serial meconium portions from 10 healthy full-term neonates. The proteins were classified by function based on the literature search for each protein in the PubMed database. A total of 946 proteins were identified in the meconium, including 430 proteins represented by two or more peptides. When the proteins were classified by their biological function the following were identified: immunoglobulin fragments and enzymatic, neutrophil-derived, structural and fetal intestine-specific proteins. Meconium is a rich source of proteins deposited in the fetal intestine during its development in utero. A better understanding of their specific biological functions in the intrauterine environment may help to identify these proteins which may serve as biomarkers associated with specific clinical conditions/diseases with the possible impact on the fetal development and further health consequences in infants, older children and adults.

  1. Fifteen years of research on oral-facial-digital syndromes: from 1 to 16 causal genes.

    PubMed

    Bruel, Ange-Line; Franco, Brunella; Duffourd, Yannis; Thevenon, Julien; Jego, Laurence; Lopez, Estelle; Deleuze, Jean-François; Doummar, Diane; Giles, Rachel H; Johnson, Colin A; Huynen, Martijn A; Chevrier, Véronique; Burglen, Lydie; Morleo, Manuela; Desguerres, Isabelle; Pierquin, Geneviève; Doray, Bérénice; Gilbert-Dussardier, Brigitte; Reversade, Bruno; Steichen-Gersdorf, Elisabeth; Baumann, Clarisse; Panigrahi, Inusha; Fargeot-Espaliat, Anne; Dieux, Anne; David, Albert; Goldenberg, Alice; Bongers, Ernie; Gaillard, Dominique; Argente, Jesús; Aral, Bernard; Gigot, Nadège; St-Onge, Judith; Birnbaum, Daniel; Phadke, Shubha R; Cormier-Daire, Valérie; Eguether, Thibaut; Pazour, Gregory J; Herranz-Pérez, Vicente; Goldstein, Jaclyn S; Pasquier, Laurent; Loget, Philippe; Saunier, Sophie; Mégarbané, André; Rosnet, Olivier; Leroux, Michel R; Wallingford, John B; Blacque, Oliver E; Nachury, Maxence V; Attie-Bitach, Tania; Rivière, Jean-Baptiste; Faivre, Laurence; Thauvin-Robinet, Christel

    2017-06-01

    Oral-facial-digital syndromes (OFDS) gather rare genetic disorders characterised by facial, oral and digital abnormalities associated with a wide range of additional features (polycystic kidney disease, cerebral malformations and several others) to delineate a growing list of OFDS subtypes. The most frequent, OFD type I, is caused by a heterozygous mutation in the OFD1 gene encoding a centrosomal protein. The wide clinical heterogeneity of OFDS suggests the involvement of other ciliary genes. For 15 years, we have aimed to identify the molecular bases of OFDS. This effort has been greatly helped by the recent development of whole-exome sequencing (WES). Here, we present all our published and unpublished results for WES in 24 cases with OFDS. We identified causal variants in five new genes (C2CD3, TMEM107, INTU, KIAA0753 and IFT57) and related the clinical spectrum of four genes in other ciliopathies (C5orf42, TMEM138, TMEM231 and WDPCP) to OFDS. Mutations were also detected in two genes previously implicated in OFDS. Functional studies revealed the involvement of centriole elongation, transition zone and intraflagellar transport defects in OFDS, thus characterising three ciliary protein modules: the complex KIAA0753-FOPNL-OFD1, a regulator of centriole elongation; the Meckel-Gruber syndrome module, a major component of the transition zone; and the CPLANE complex necessary for IFT-A assembly. OFDS now appear to be a distinct subgroup of ciliopathies with wide heterogeneity, which makes the initial classification obsolete. A clinical classification restricted to the three frequent/well-delineated subtypes could be proposed, and for patients who do not fit one of these three main subtypes, a further classification could be based on the genotype.

  2. Molecular classification of liver cirrhosis in a rat model by proteomics and bioinformatics.

    PubMed

    Xu, Xiu-Qin; Leow, Chon K; Lu, Xin; Zhang, Xuegong; Liu, Jun S; Wong, Wing-Hung; Asperger, Arndt; Deininger, Sören; Eastwood Leung, Hon-Chiu

    2004-10-01

    Liver cirrhosis is a worldwide health problem. Reliable, noninvasive methods for early detection of liver cirrhosis are not available. Using a three-step approach, we classified sera from rats with liver cirrhosis following different treatment insults. The approach consisted of: (i) protein profiling using surface-enhanced laser desorption/ionization (SELDI) technology; (ii) selection of a statistically significant serum biomarker set using machine learning algorithms; and (iii) identification of selected serum biomarkers by peptide sequencing. We generated serum protein profiles from three groups of rats: (i) normal (n=8), (ii) thioacetamide-induced liver cirrhosis (n=22), and (iii) bile duct ligation-induced liver fibrosis (n=5) using a weak cation exchanger surface. Profiling data were further analyzed by a recursive support vector machine algorithm to select a panel of statistically significant biomarkers for class prediction. Sensitivity and specificity of classification using the selected protein marker set were higher than 92%. A consistently down-regulated 3495 Da protein in cirrhosis samples was one of the selected significant biomarkers. This 3495 Da protein was purified on-chip and trypsin digested. Further structural characterization of this biomarkers candidate was done by using cross-platform matrix-assisted laser desorption/ionization mass spectrometry (MALDI-MS) peptide mass fingerprinting (PMF) and matrix-assisted laser desorption/ionization time of flight/time of flight (MALDI-TOF/TOF) tandem mass spectrometry (MS/MS). Combined data from PMF and MS/MS spectra of two tryptic peptides suggested that this 3495 Da protein shared homology to a histidine-rich glycoprotein. These results demonstrated a novel approach to discovery of new biomarkers for early detection of liver cirrhosis and classification of liver diseases.

  3. The Protein Information Resource: an integrated public resource of functional annotation of proteins

    PubMed Central

    Wu, Cathy H.; Huang, Hongzhan; Arminski, Leslie; Castro-Alvear, Jorge; Chen, Yongxing; Hu, Zhang-Zhi; Ledley, Robert S.; Lewis, Kali C.; Mewes, Hans-Werner; Orcutt, Bruce C.; Suzek, Baris E.; Tsugita, Akira; Vinayaka, C. R.; Yeh, Lai-Su L.; Zhang, Jian; Barker, Winona C.

    2002-01-01

    The Protein Information Resource (PIR) serves as an integrated public resource of functional annotation of protein data to support genomic/proteomic research and scientific discovery. The PIR, in collaboration with the Munich Information Center for Protein Sequences (MIPS) and the Japan International Protein Information Database (JIPID), produces the PIR-International Protein Sequence Database (PSD), the major annotated protein sequence database in the public domain, containing about 250 000 proteins. To improve protein annotation and the coverage of experimentally validated data, a bibliography submission system is developed for scientists to submit, categorize and retrieve literature information. Comprehensive protein information is available from iProClass, which includes family classification at the superfamily, domain and motif levels, structural and functional features of proteins, as well as cross-references to over 40 biological databases. To provide timely and comprehensive protein data with source attribution, we have introduced a non-redundant reference protein database, PIR-NREF. The database consists of about 800 000 proteins collected from PIR-PSD, SWISS-PROT, TrEMBL, GenPept, RefSeq and PDB, with composite protein names and literature data. To promote database interoperability, we provide XML data distribution and open database schema, and adopt common ontologies. The PIR web site (http://pir.georgetown.edu/) features data mining and sequence analysis tools for information retrieval and functional identification of proteins based on both sequence and annotation information. The PIR databases and other files are also available by FTP (ftp://nbrfa.georgetown.edu/pir_databases). PMID:11752247

  4. Association algorithm to mine the rules that govern enzyme definition and to classify protein sequences.

    PubMed

    Chiu, Shih-Hau; Chen, Chien-Chi; Yuan, Gwo-Fang; Lin, Thy-Hou

    2006-06-15

    The number of sequences compiled in many genome projects is growing exponentially, but most of them have not been characterized experimentally. An automatic annotation scheme is urgently needed to reduce the gap between the amount of new sequences produced and reliable functional annotation. This work proposes rules for automatically classifying the fungus genes. The approach involves elucidating the enzyme classifying rule that is hidden in UniProt protein knowledgebase and then applying it for classification. The association algorithm, Apriori, is utilized to mine the relationship between the enzyme class and significant InterPro entries. The candidate rules are evaluated for their classificatory capacity. There were five datasets collected from the Swiss-Prot for establishing the annotation rules. These were treated as the training sets. The TrEMBL entries were treated as the testing set. A correct enzyme classification rate of 70% was obtained for the prokaryote datasets and a similar rate of about 80% was obtained for the eukaryote datasets. The fungus training dataset which lacks an enzyme class description was also used to evaluate the fungus candidate rules. A total of 88 out of 5085 test entries were matched with the fungus rule set. These were otherwise poorly annotated using their functional descriptions. The feasibility of using the method presented here to classify enzyme classes based on the enzyme domain rules is evident. The rules may be also employed by the protein annotators in manual annotation or implemented in an automatic annotation flowchart.
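
    The kind of rule the abstract above describes, linking an InterPro entry to an enzyme class, can be illustrated with the two measures that association rule mining relies on: support and confidence. The sketch below is a deliberately simplified stand-in, not Apriori itself: it only considers rules with a single domain on the left-hand side, and the function name, thresholds, and identifiers are hypothetical.

```python
from collections import Counter


def mine_rules(records, min_support=0.1, min_confidence=0.8):
    """Find single-domain rules of the form 'domain -> enzyme class'.

    records is a list of (set_of_domain_ids, enzyme_class) tuples.
    Returns sorted (domain, enzyme_class, support, confidence) tuples.
    """
    n = len(records)
    domain_counts = Counter()
    pair_counts = Counter()
    for domains, enzyme_class in records:
        for domain in domains:
            domain_counts[domain] += 1
            pair_counts[(domain, enzyme_class)] += 1
    rules = []
    for (domain, enzyme_class), count in pair_counts.items():
        support = count / n                       # how often the rule fires
        confidence = count / domain_counts[domain]  # how often it is right
        if support >= min_support and confidence >= min_confidence:
            rules.append((domain, enzyme_class, support, confidence))
    return sorted(rules)


if __name__ == "__main__":
    # Hypothetical training entries: domain annotations plus enzyme class.
    training = [
        ({"IPR_A"}, "EC_1"),
        ({"IPR_A", "IPR_B"}, "EC_1"),
        ({"IPR_B"}, "EC_2"),
        ({"IPR_C"}, "EC_2"),
    ]
    print(mine_rules(training, min_support=0.5, min_confidence=0.8))
```

    Full Apriori extends this idea to multi-item left-hand sides by growing candidate itemsets level by level and pruning those whose subsets fall below the support threshold.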

  5. Neural network and SVM classifiers accurately predict lipid binding proteins, irrespective of sequence homology.

    PubMed

    Bakhtiarizadeh, Mohammad Reza; Moradi-Shahrbabak, Mohammad; Ebrahimi, Mansour; Ebrahimie, Esmaeil

    2014-09-07

    Due to the central roles of lipid binding proteins (LBPs) in many biological processes, sequence based identification of LBPs is of great interest. The major challenge is that LBPs are diverse in sequence, structure, and function which results in low accuracy of sequence homology based methods. Therefore, there is a need for developing alternative functional prediction methods irrespective of sequence similarity. To identify LBPs from non-LBPs, the performances of support vector machine (SVM) and neural network were compared in this study. Comprehensive protein features and various techniques were employed to create datasets. Five-fold cross-validation (CV) and independent evaluation (IE) tests were used to assess the validity of the two methods. The results indicated that SVM outperforms neural network. SVM achieved 89.28% (CV) and 89.55% (IE) overall accuracy in identification of LBPs from non-LBPs and 92.06% (CV) and 92.90% (IE) (on average) for classification of different LBPs classes. Increasing the number and the range of extracted protein features as well as optimization of the SVM parameters significantly increased the efficiency of LBPs class prediction in comparison to the only previous report in this field. Altogether, the results showed that the SVM algorithm can be run on broad, computationally calculated protein features and offers a promising tool in detection of LBPs classes. The proposed approach has the potential to integrate and improve the common sequence alignment based methods.

  6. [Biologics - nomenclature and classification].

    PubMed

    Eichbaum, Christine; Haefeli, Walter E

    2011-11-01

    Biological medicines are a heterogeneous group of drugs that are produced by living organisms using genetic or biological technology. Unlike chemically derived small molecules biologics are structurally complex making characterization and manufacturing difficult. Moreover, biological medicines show a great variety concerning their clinical use. To appropriately consider these particularities, there are other standards and guidelines for approval of similar derivatives of biologics, the so-called biosimilars or follow-on biologics. In contrast to a generic medicinal product containing a chemically identical active ingredient, a biosimilar is only expected to be similar to the innovator drug. Nowadays, monoclonal antibodies, fragments of antibodies, and fusion proteins manufactured by recombinant procedures play an important role. They have been used in many specialties for diagnostic and therapeutic purposes and are subject to continuous further development and improvement. Their nomenclature is based on a classification by the WHO which allows drawing conclusions for class of substance, origin, and pharmacological target.

  7. Familial or Sporadic Idiopathic Scoliosis – classification based on artificial neural network and GAPDH and ACTB transcription profile

    PubMed Central

    2013-01-01

    Background Importance of hereditary factors in the etiology of Idiopathic Scoliosis is widely accepted. In clinical practice some of the IS patients present with positive familial history of the deformity and some do not. Traditionally about 90% of patients have been considered as sporadic cases without familial recurrence. However the exact proportion of Familial and Sporadic Idiopathic Scoliosis is still unknown. Housekeeping genes encode proteins that are usually essential for the maintenance of basic cellular functions. ACTB and GAPDH are two housekeeping genes encoding respectively a cytoskeletal protein β-actin, and glyceraldehyde-3-phosphate dehydrogenase, an enzyme of glycolysis. Although their expression levels can fluctuate between different tissues and persons, human housekeeping genes seem to exhibit a preserved tissue-wide expression ranking order. It was hypothesized that expression ranking order of two representative housekeeping genes ACTB and GAPDH might be disturbed in the tissues of patients with Familial Idiopathic Scoliosis (with positive family history of idiopathic scoliosis) as opposed to the patients with no family members affected (Sporadic Idiopathic Scoliosis). An artificial neural network (ANN) was developed that could serve to differentiate between familial and sporadic cases of idiopathic scoliosis based on the expression levels of ACTB and GAPDH in different tissues of scoliotic patients. The aim of the study was to investigate whether the expression levels of ACTB and GAPDH in different tissues of idiopathic scoliosis patients could be used as a source of data for specially developed artificial neural network in order to predict the positive family history of index patient. Results The comparison of developed models showed that the most satisfactory classification accuracy was achieved for ANN model with 18 nodes in the first hidden layer and 16 nodes in the second hidden layer. The classification accuracy for positive Idiopathic Scoliosis anamnesis only with the expression measurements of ACTB and GAPDH with the use of ANN based on 6-18-16-1 architecture was 8 of 9 (88%). Only in one case the prediction was ambiguous. Conclusions Specially designed artificial neural network model proved possible association between expression level of ACTB, GAPDH and positive familial history of Idiopathic Scoliosis. PMID:23289769

  8. Read-Through Proteins of Group 4 RNA Bacteriophages TW19 and TW28

    PubMed Central

    Aoi, Takeshi; Kaesberg, Paul

    1976-01-01

    Group 4 phages TW19 and TW28 of Escherichia coli possess a “read-through” (IIb) protein, although group 2 phage GA does not. This may have implications concerning the evolution and classification of RNA phages. PMID:978795

  9. Accelerating the Original Profile Kernel.

    PubMed

    Hamp, Tobias; Goldberg, Tatyana; Rost, Burkhard

    2013-01-01

    One of the most accurate multi-class protein classification systems continues to be the profile-based SVM kernel introduced by the Leslie group. Unfortunately, its CPU requirements render it too slow for practical applications of large-scale classification tasks. Here, we introduce several software improvements that enable significant acceleration. Using various non-redundant data sets, we demonstrate that our new implementation reaches a maximal speed-up as high as 14-fold for calculating the same kernel matrix. Some predictions are over 200 times faster and render the kernel as possibly the top contender in a low ratio of speed/performance. Additionally, we explain how to parallelize various computations and provide an integrative program that reduces creating a production-quality classifier to a single program call. The new implementation is available as a Debian package under a free academic license and does not depend on commercial software. For non-Debian based distributions, the source package ships with a traditional Makefile-based installer. Download and installation instructions can be found at https://rostlab.org/owiki/index.php/Fast_Profile_Kernel. Bugs and other issues may be reported at https://rostlab.org/bugzilla3/enter_bug.cgi?product=fastprofkernel.

  10. Assignment of protein sequences to existing domain and family classification systems: Pfam and the PDB.

    PubMed

    Xu, Qifang; Dunbrack, Roland L

    2012-11-01

    Automating the assignment of existing domain and protein family classifications to new sets of sequences is an important task. Current methods often miss assignments because remote relationships fail to achieve statistical significance. Some assignments are not as long as the actual domain definitions because local alignment methods often cut alignments short. Long insertions in query sequences often erroneously result in two copies of the domain assigned to the query. Divergent repeat sequences in proteins are often missed. We have developed a multilevel procedure to produce nearly complete assignments of protein families of an existing classification system to a large set of sequences. We apply this to the task of assigning Pfam domains to sequences and structures in the Protein Data Bank (PDB). We found that HHsearch alignments frequently scored more remotely related Pfams in Pfam clans higher than closely related Pfams, thus, leading to erroneous assignment at the Pfam family level. A greedy algorithm allowing for partial overlaps was, thus, applied first to sequence/HMM alignments, then HMM-HMM alignments and then structure alignments, taking care to join partial alignments split by large insertions into single-domain assignments. Additional assignment of repeat Pfams with weaker E-values was allowed after stronger assignments of the repeat HMM. Our database of assignments, presented in a database called PDBfam, contains Pfams for 99.4% of chains >50 residues. The Pfam assignment data in PDBfam are available at http://dunbrack2.fccc.edu/ProtCid/PDBfam, which can be searched by PDB codes and Pfam identifiers. They will be updated regularly.
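
    The greedy step described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' PDBfam procedure: the `Hit` record, the `greedy_assign` function, and the `max_overlap` tolerance of 10 residues are hypothetical names and values chosen for the example. Hits are taken in order of increasing E-value, and a hit is kept only if it overlaps every already accepted hit by no more than the tolerance.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    """One candidate domain assignment on a query sequence (hypothetical record)."""
    family: str
    start: int      # 1-based, inclusive
    end: int        # 1-based, inclusive
    evalue: float


def overlap(a: Hit, b: Hit) -> int:
    """Number of residues shared by two hits (0 if they do not overlap)."""
    return max(0, min(a.end, b.end) - max(a.start, b.start) + 1)


def greedy_assign(hits: List[Hit], max_overlap: int = 10) -> List[Hit]:
    """Accept hits from best to worst E-value, tolerating partial overlaps.

    max_overlap is an assumed tolerance, not a value taken from the abstract.
    """
    accepted: List[Hit] = []
    for hit in sorted(hits, key=lambda h: h.evalue):
        if all(overlap(hit, kept) <= max_overlap for kept in accepted):
            accepted.append(hit)
    # Report assignments in sequence order.
    return sorted(accepted, key=lambda h: h.start)


if __name__ == "__main__":
    candidates = [
        Hit("A", 1, 100, 1e-30),
        Hit("B", 95, 200, 1e-20),   # overlaps A by 6 residues: tolerated
        Hit("C", 50, 150, 1e-10),   # overlaps A by 51 residues: rejected
    ]
    print([h.family for h in greedy_assign(candidates)])
```

    In the multilevel procedure of the abstract, such a pass would presumably be run once per alignment type (sequence/HMM, HMM-HMM, structure); joining alignments split by large insertions is not covered by this sketch.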

  11. Relationship between global structural parameters and Enzyme Commission hierarchy: implications for function prediction.

    PubMed

    Boareto, Marcelo; Yamagishi, Michel E B; Caticha, Nestor; Leite, Vitor B P

    2012-10-01

    In protein databases there is a substantial number of proteins structurally determined but without function annotation. Understanding the relationship between function and structure can be useful to predict function on a large scale. We have analyzed the similarities in global physicochemical parameters for a set of enzymes which were classified according to the four Enzyme Commission (EC) hierarchical levels. Using relevance theory we introduced a distance between proteins in the space of physicochemical characteristics. This was done by minimizing a cost function of the metric tensor built to reflect the EC classification system. Using an unsupervised clustering method on a set of 1025 enzymes, we obtained no relevant clustering formation compatible with EC classification. The distance distributions between enzymes from the same EC group and from different EC groups were compared by histograms. Such analysis was also performed using sequence alignment similarity as a distance. Our results suggest that global structure parameters are not sufficient to segregate enzymes according to EC hierarchy. This indicates that features essential for function are rather local than global. Consequently, methods for predicting function based on global attributes should not obtain high accuracy in main EC classes prediction without relying on similarities between enzymes from training and validation datasets. Furthermore, these results are consistent with a substantial number of studies suggesting that function evolves fundamentally by recruitment, i.e., a same protein motif or fold can be used to perform different enzymatic functions and a few specific amino acids (AAs) are actually responsible for enzyme activity. These essential amino acids should belong to active sites and an effective method for predicting function should be able to recognize them. Copyright © 2012 Elsevier Ltd. All rights reserved.

  12. SUPERFAMILY 1.75 including a domain-centric gene ontology method.

    PubMed

    de Lima Morais, David A; Fang, Hai; Rackham, Owen J L; Wilson, Derek; Pethica, Ralph; Chothia, Cyrus; Gough, Julian

    2011-01-01

    The SUPERFAMILY resource provides protein domain assignments at the structural classification of protein (SCOP) superfamily level for over 1400 completely sequenced genomes, over 120 metagenomes and other gene collections such as UniProt. All models and assignments are available to browse and download at http://supfam.org. A new hidden Markov model library based on SCOP 1.75 has been created and a previously ignored class of SCOP, coiled coils, is now included. Our scoring component now uses HMMER3, which is orders of magnitude faster and produces superior results. A cloud-based pipeline was implemented and is publicly available on the Amazon Web Services Elastic Compute Cloud. The SUPERFAMILY reference tree of life has been improved, allowing the user to highlight a chosen superfamily, family or domain architecture on the tree of life. The most significant advance in SUPERFAMILY is that it now contains a domain-based gene ontology (GO) at the superfamily and family levels. A new methodology was developed to ensure a high quality GO annotation. The new methodology is general purpose and has been used to produce domain-based phenotypic ontologies in addition to GO.

  13. Comprehensive inventory of protein complexes in the Protein Data Bank from consistent classification of interfaces.

    PubMed

    Bordner, Andrew J; Gorin, Andrey A

    2008-05-12

    Protein-protein interactions are ubiquitous and essential for all cellular processes. High-resolution X-ray crystallographic structures of protein complexes can reveal the details of their function and provide a basis for many computational and experimental approaches. Differentiation between biological and non-biological contacts and reconstruction of the intact complex is a challenging computational problem. A successful solution can provide additional insights into the fundamental principles of biological recognition and reduce errors in many algorithms and databases utilizing interaction information extracted from the Protein Data Bank (PDB). We have developed a method for identifying protein complexes in the PDB X-ray structures by a four step procedure: (1) comprehensively collecting all protein-protein interfaces; (2) clustering similar protein-protein interfaces together; (3) estimating the probability that each cluster is relevant based on a diverse set of properties; and (4) combining these scores for each PDB entry in order to predict the complex structure. The resulting clusters of biologically relevant interfaces provide a reliable catalog of evolutionary conserved protein-protein interactions. These interfaces, as well as the predicted protein complexes, are available from the Protein Interface Server (PInS) website (see Availability and requirements section). Our method demonstrates an almost two-fold reduction of the annotation error rate as evaluated on a large benchmark set of complexes validated from the literature. We also estimate relative contributions of each interface property to the accurate discrimination of biologically relevant interfaces and discuss possible directions for further improving the prediction method.

  14. Rama: a machine learning approach for ribosomal protein prediction in plants.

    PubMed

    Carvalho, Thales Francisco Mota; Silva, José Cleydson F; Calil, Iara Pinheiro; Fontes, Elizabeth Pacheco Batista; Cerqueira, Fabio Ribeiro

    2017-11-24

    Ribosomal proteins (RPs) play a fundamental role within all types of cells, as they are major components of ribosomes, which are essential for translation of mRNAs. Furthermore, these proteins are involved in various physiological and pathological processes. The intrinsic biological relevance of RPs motivated advanced studies for the identification of unrevealed RPs. In this work, we propose a new computational method, termed Rama, for the prediction of RPs, based on machine learning techniques, with a particular interest in plants. To perform an effective classification, Rama uses a set of fundamental attributes of the amino acid side chains and applies a two-step procedure to classify proteins with unknown function as RPs. The evaluation of the resultant predictive models showed that Rama could achieve mean sensitivity, precision, and specificity of 0.91, 0.91, and 0.82, respectively. Furthermore, a list of proteins that have no annotation in Phytozome v.10, and are annotated as RPs in Phytozome v.12, were correctly classified by our models. Additional computational experiments have also shown that Rama presents high accuracy to differentiate ribosomal proteins from RNA-binding proteins. Finally, two novel proteins of Arabidopsis thaliana were validated in biological experiments. Rama is freely available at http://inctipp.bioagro.ufv.br:8080/Rama.
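
    For readers unfamiliar with the three measures quoted above, the following minimal sketch shows how sensitivity, precision, and specificity are derived from the counts of a binary confusion matrix. The function name `binary_metrics` and the counts in the usage example are hypothetical and are not taken from the Rama evaluation.

```python
def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Standard binary classification measures from confusion-matrix counts.

    tp/fp/tn/fn are true positives, false positives, true negatives,
    and false negatives, respectively.
    """
    return {
        "sensitivity": tp / (tp + fn),   # fraction of real positives recovered
        "precision": tp / (tp + fp),     # fraction of predicted positives that are real
        "specificity": tn / (tn + fp),   # fraction of real negatives rejected
    }


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    print(binary_metrics(tp=8, fp=2, tn=8, fn=2))
```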

  15. Protein dynamics in a broad frequency range: Dielectric spectroscopy studies

    DOE PAGES

    Nakanishi, Masahiro; Sokolov, Alexei P.

    2014-09-17

    We present detailed dielectric spectroscopy studies of dynamics in two hydrated proteins, lysozyme and myoglobin. We emphasize the importance of explicit account for possible Maxwell-Wagner (MW) polarization effects in protein powder samples. Combining our data with earlier literature results, we demonstrate the existence of three major relaxation processes in globular proteins. To understand the mechanisms of these relaxations we involve literature data on neutron scattering, simulations and NMR studies. The faster process is ascribed to coupled protein-hydration water motions and has relaxation time ~10-50 ps at room temperature. The intermediate process is ~10^2-10^3 times slower than the faster process and might be strongly affected by MW polarizations. Based on the analysis of data obtained by different experimental techniques and simulations, we ascribe this process to large scale domain-like motions of proteins. The slowest observed process is ~10^6-10^7 times slower than the faster process and has anomalously large dielectric amplitude Δε ~ 10^2-10^4. The microscopic nature of this process is not clear, but it seems to be related to the glass transition of hydrated proteins. The presented results suggest a general classification of the relaxation processes in hydrated proteins. (c) 2014 Elsevier B.V. All rights reserved.

  16. Rule-based land use/land cover classification in coastal areas using seasonal remote sensing imagery: a case study from Lianyungang City, China.

    PubMed

    Yang, Xiaoyan; Chen, Longgao; Li, Yingkui; Xi, Wenjia; Chen, Longqian

    2015-07-01

    Land use/land cover (LULC) inventory provides an important dataset in regional planning and environmental assessment. To efficiently obtain the LULC inventory, we compared the LULC classifications based on single satellite imagery with a rule-based classification based on multi-seasonal imagery in Lianyungang City, a coastal city in China, using CBERS-02 (the 2nd China-Brazil Environmental Resource Satellites) images. The overall accuracies of the classification based on single imagery are 78.9, 82.8, and 82.0% in winter, early summer, and autumn, respectively. The rule-based classification improves the accuracy to 87.9% (kappa 0.85), suggesting that combining multi-seasonal images can considerably improve the classification accuracy over any single image-based classification. This method could also be used to analyze seasonal changes of LULC types, especially for those associated with tidal changes in coastal areas. The distribution and inventory of LULC types with an overall accuracy of 87.9% and a spatial resolution of 19.5 m can assist regional planning and environmental assessment efficiently in Lianyungang City. This rule-based classification provides a guidance to improve accuracy for coastal areas with distinct LULC temporal spectral features.
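
    The overall accuracy and kappa figures reported above are both computed from a classification confusion matrix. A minimal sketch of that calculation follows, assuming a square matrix whose rows are reference classes and whose columns are mapped classes; the function name `accuracy_and_kappa` and the example matrix are hypothetical and unrelated to the Lianyungang data.

```python
from typing import List, Tuple


def accuracy_and_kappa(matrix: List[List[int]]) -> Tuple[float, float]:
    """Return (overall accuracy, Cohen's kappa) for a square confusion matrix."""
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    # Observed agreement: share of samples on the diagonal.
    observed = sum(matrix[i][i] for i in range(n)) / total
    # Chance agreement: sum over classes of (row total * column total) / total^2.
    expected = sum(
        sum(matrix[i]) * sum(row[i] for row in matrix) for i in range(n)
    ) / (total * total)
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa


if __name__ == "__main__":
    # Hypothetical two-class matrix, for illustration only.
    print(accuracy_and_kappa([[40, 10], [10, 40]]))
```

    Kappa discounts the agreement expected by chance, which is why it is lower than the overall accuracy for the same map.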

  17. Towards comprehensive structural motif mining for better fold annotation in the "twilight zone" of sequence dissimilarity

    PubMed Central

    Jia, Yi; Huan, Jun; Buhr, Vincent; Zhang, Jintao; Carayannopoulos, Leonidas N

    2009-01-01

    Background: Automatic identification of structure fingerprints from a group of diverse protein structures is challenging, especially for proteins whose divergent amino acid sequences may fall into the "twilight-" or "midnight-" zones where pair-wise sequence identities to known sequences fall below 25% and sequence-based functional annotations often fail. Results: Here we report a novel graph database mining method and demonstrate its application to protein structure pattern identification and structure classification. The biologic motivation of our study is to recognize common structure patterns in "immunoevasins", proteins mediating virus evasion of host immune defense. Our experimental study, using both viral and non-viral proteins, demonstrates the efficiency and efficacy of the proposed method. Conclusion: We present a theoretic framework, offer a practical software implementation for incorporating prior domain knowledge, such as substitution matrices as studied here, and devise an efficient algorithm to identify approximate matched frequent subgraphs. By doing so, we significantly expanded the analytical power of sophisticated data mining algorithms in dealing with large volumes of complicated and noisy protein structure data. Without loss of generality, the choice of appropriate compatibility matrices allows our method to be easily employed in domains where subgraph labels have some uncertainty. PMID:19208148

  18. Proline: The Distribution, Frequency, Positioning, and Common Functional Roles of Proline and Polyproline Sequences in the Human Proteome

    PubMed Central

    Morgan, Alexander A.; Rubenstein, Edward

    2013-01-01

    Proline is an anomalous amino acid. Its nitrogen atom is covalently locked within a ring, thus it is the only proteinogenic amino acid with a constrained phi angle. Sequences of three consecutive prolines can fold into polyproline helices, structures that join alpha helices and beta pleats as architectural motifs in protein configuration. Triproline helices are participants in protein-protein signaling interactions. Longer spans of repeat prolines also occur, containing as many as 27 consecutive proline residues. Little is known about the frequency, positioning, and functional significance of these proline sequences. Therefore we have undertaken a systematic bioinformatics study of proline residues in proteins. We analyzed the distribution and frequency of 687,434 proline residues among 18,666 human proteins, identifying single residues, dimers, trimers, and longer repeats. Proline accounts for 6.3% of the 10,882,808 protein amino acids. Of all proline residues, 4.4% are in trimers or longer spans. We detected patterns that influence function based on proline location, spacing, and concentration. We propose a classification based on proline-rich, polyproline-rich, and proline-poor status. Whereas singlet proline residues are often found in proteins that display recurring architectural patterns, trimers or longer proline sequences tend to be associated with the absence of repetitive structural motifs. Spans of 6 or more are associated with DNA/RNA processing, actin, and developmental processes. We also suggest a role for proline in Kruppel-type zinc finger protein control of DNA expression, and in the nucleation and translocation of actin by the formin complex. PMID:23372670
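
    The kind of tally described above (single prolines, dimers, trimers, and longer repeats) can be illustrated with a short sketch. It assumes one-letter amino acid sequences in which proline is `P`; the function names and the toy sequence are hypothetical and do not reproduce the authors' pipeline or data.

```python
import re
from collections import Counter
from typing import Iterable


def proline_run_lengths(sequence: str) -> Counter:
    """Map run length -> number of maximal runs of consecutive prolines."""
    return Counter(len(m.group()) for m in re.finditer(r"P+", sequence.upper()))


def fraction_in_long_runs(sequences: Iterable[str], min_run: int = 3) -> float:
    """Fraction of all proline residues that sit in runs of at least min_run."""
    total = in_long = 0
    for seq in sequences:
        for length, count in proline_run_lengths(seq).items():
            total += length * count
            if length >= min_run:
                in_long += length * count
    return in_long / total if total else 0.0


if __name__ == "__main__":
    toy = "APPGPPPAP"  # toy sequence: one dimer, one trimer, one singlet
    print(proline_run_lengths(toy))
    print(fraction_in_long_runs([toy]))
```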

  19. JNK Signaling: Regulation and Functions Based on Complex Protein-Protein Partnerships.

    PubMed

    Zeke, András; Misheva, Mariya; Reményi, Attila; Bogoyevitch, Marie A

    2016-09-01

    The c-Jun N-terminal kinases (JNKs), as members of the mitogen-activated protein kinase (MAPK) family, mediate eukaryotic cell responses to a wide range of abiotic and biotic stress insults. JNKs also regulate important physiological processes, including neuronal functions, immunological actions, and embryonic development, via their impact on gene expression, cytoskeletal protein dynamics, and cell death/survival pathways. Although the JNK pathway has been under study for >20 years, its complexity is still perplexing, with multiple protein partners of JNKs underlying the diversity of actions. Here we review the current knowledge of JNK structure and isoforms as well as the partnerships of JNKs with a range of intracellular proteins. Many of these proteins are direct substrates of the JNKs. We analyzed almost 100 of these target proteins in detail within a framework of their classification based on their regulation by JNKs. Examples of these JNK substrates include a diverse assortment of nuclear transcription factors (Jun, ATF2, Myc, Elk1), cytoplasmic proteins involved in cytoskeleton regulation (DCX, Tau, WDR62) or vesicular transport (JIP1, JIP3), cell membrane receptors (BMPR2), and mitochondrial proteins (Mcl1, Bim). In addition, because upstream signaling components impact JNK activity, we critically assessed the involvement of signaling scaffolds and the roles of feedback mechanisms in the JNK pathway. Despite a clarification of many regulatory events in JNK-dependent signaling during the past decade, many other structural and mechanistic insights are just beginning to be revealed. These advances open new opportunities to understand the role of JNK signaling in diverse physiological and pathophysiological states. Copyright © 2016, American Society for Microbiology. All Rights Reserved.

  20. MPact: the MIPS protein interaction resource on yeast.

    PubMed

    Güldener, Ulrich; Münsterkötter, Martin; Oesterheld, Matthias; Pagel, Philipp; Ruepp, Andreas; Mewes, Hans-Werner; Stümpflen, Volker

    2006-01-01

    In recent years, the Munich Information Center for Protein Sequences (MIPS) yeast protein-protein interaction (PPI) dataset has been used in numerous analyses of protein networks and has been called a gold standard because of its quality and comprehensiveness [H. Yu, N. M. Luscombe, H. X. Lu, X. Zhu, Y. Xia, J. D. Han, N. Bertin, S. Chung, M. Vidal and M. Gerstein (2004) Genome Res., 14, 1107-1118]. MPact and the yeast protein localization catalog provide information related to the proximity of proteins in yeast. Beside the integration of high-throughput data, information about experimental evidence for PPIs in the literature was compiled by experts adding up to 4300 distinct PPIs connecting 1500 proteins in yeast. As the interaction data is a complementary part of CYGD, interactive mapping of data on other integrated data types such as the functional classification catalog [A. Ruepp, A. Zollner, D. Maier, K. Albermann, J. Hani, M. Mokrejs, I. Tetko, U. Güldener, G. Mannhaupt, M. Münsterkötter and H. W. Mewes (2004) Nucleic Acids Res., 32, 5539-5545] is possible. A survey of signaling proteins and comparison with pathway data from KEGG demonstrates that based on these manually annotated data only an extensive overview of the complexity of this functional network can be obtained in yeast. The implementation of a web-based PPI-analysis tool allows analysis and visualization of protein interaction networks and facilitates integration of our curated data with high-throughput datasets. The complete dataset as well as user-defined sub-networks can be retrieved easily in the standardized PSI-MI format. The resource can be accessed through http://mips.gsf.de/genre/proj/mpact.

  1. Functional annotation from the genome sequence of the giant panda.

    PubMed

    Huo, Tong; Zhang, Yinjie; Lin, Jianping

    2012-08-01

    The giant panda is one of the most critically endangered species due to the fragmentation and loss of its habitat. Studying the functions of proteins in this animal, especially specific trait-related proteins, is therefore necessary to protect the species. In this work, the functions of these proteins were investigated using the genome sequence of the giant panda. Data on 21,001 proteins and their functions were stored in the Giant Panda Protein Database, in which the proteins were divided into two groups: 20,179 proteins whose functions can be predicted by GeneScan formed the known-function group, whereas 822 proteins whose functions cannot be predicted by GeneScan comprised the unknown-function group. For the known-function group, we further classified the proteins by molecular function, biological process, cellular component, and tissue specificity. For the unknown-function group, we developed a strategy in which the proteins were filtered by cross-Blast to identify panda-specific proteins under the assumption that proteins related to the panda-specific traits in the unknown-function group exist. After this filtering procedure, we identified 32 proteins (2 of which are membrane proteins) specific to the giant panda genome as compared against the dog and horse genomes. Based on their amino acid sequences, these 32 proteins were further analyzed by functional classification using SVM-Prot, motif prediction using MyHits, and interacting protein prediction using the Database of Interacting Proteins. Nineteen proteins were predicted to be zinc-binding proteins, thus affecting the activities of nucleic acids. The 32 panda-specific proteins will be further investigated by structural and functional analysis.

  2. Object-based land cover classification based on fusion of multifrequency SAR data and THAICHOTE optical imagery

    NASA Astrophysics Data System (ADS)

    Sukawattanavijit, Chanika; Srestasathiern, Panu

    2017-10-01

    Land Use and Land Cover (LULC) information is important for observing and evaluating environmental change. LULC classification using remotely sensed data is a technique widely employed at global and local scales, particularly in urban areas, which have diverse land cover types. These are essential components of the urban terrain and ecosystem. At present, object-based image analysis (OBIA) is becoming widely popular for land cover classification using high-resolution images. COSMO-SkyMed SAR data was fused with THAICHOTE (namely, THEOS: Thailand Earth Observation Satellite) optical data for land cover classification using an object-based approach. This paper presents a comparison between object-based and pixel-based approaches in image fusion. The per-pixel method, support vector machines (SVM), was applied to the fused image based on Principal Component Analysis (PCA). The object-based classification was applied to the fused images to separate land cover classes using a nearest neighbor (NN) classifier. Finally, the accuracy assessment was performed by comparing the land cover classifications generated from the fused image dataset and from the THAICHOTE image. The object-based fusion of COSMO-SkyMed with THAICHOTE images demonstrated the best classification accuracies, well over 85%. As a result, object-based data fusion provides higher land cover classification accuracy than per-pixel data fusion.

  3. Carbohydrate terminology and classification.

    PubMed

    Cummings, J H; Stephen, A M

    2007-12-01

    Dietary carbohydrates are a group of chemically defined substances with a range of physical and physiological properties and health benefits. As with other macronutrients, the primary classification of dietary carbohydrate is based on chemistry, that is, the character of individual monomers, degree of polymerization (DP) and type of linkage (alpha or beta), as agreed at the Food and Agriculture Organization/World Health Organization Expert Consultation in 1997. This divides carbohydrates into three main groups, sugars (DP 1-2), oligosaccharides (short-chain carbohydrates) (DP 3-9) and polysaccharides (DP ≥ 10). Within this classification, a number of terms are used such as mono- and disaccharides, polyols, oligosaccharides, starch, modified starch, non-starch polysaccharides, total carbohydrate, sugars, etc. While effects of carbohydrates are ultimately related to their primary chemistry, they are modified by their physical properties. These include water solubility, hydration, gel formation, crystalline state, association with other molecules such as protein, lipid and divalent cations and aggregation into complex structures in cell walls and other specialized plant tissues. A classification based on chemistry is essential for a system of measurement, prediction of properties and estimation of intakes, but does not allow a simple translation into nutritional effects since each class of carbohydrate has overlapping physiological properties and effects on health. This dichotomy has led to the use of a number of terms to describe carbohydrate in foods, for example intrinsic and extrinsic sugars, prebiotic, resistant starch, dietary fibre, available and unavailable carbohydrate, complex carbohydrate, glycaemic and whole grain. This paper reviews these terms and suggests that some are more useful than others. A clearer understanding of what is meant by any particular word used to describe carbohydrate is essential to progress in translating the growing knowledge of the physiological properties of carbohydrate into public health messages.
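
    The three main groups named above are defined purely by degree of polymerization, so the primary classification reduces to a threshold lookup. The sketch below encodes only the DP ranges stated in the abstract (1-2, 3-9, 10 and above); the function name `classify_by_dp` is hypothetical, and monomer character and linkage type are deliberately left out.

```python
def classify_by_dp(dp: int) -> str:
    """Return the main carbohydrate group for a degree of polymerization (DP)."""
    if dp < 1:
        raise ValueError("degree of polymerization must be at least 1")
    if dp <= 2:
        return "sugars"
    if dp <= 9:
        return "oligosaccharides"
    return "polysaccharides"


if __name__ == "__main__":
    for dp in (1, 2, 3, 9, 10, 500):
        print(dp, classify_by_dp(dp))
```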

  4. Molecular Diagnostics of Gliomas Using Next Generation Sequencing of a Glioma-Tailored Gene Panel.

    PubMed

    Zacher, Angela; Kaulich, Kerstin; Stepanow, Stefanie; Wolter, Marietta; Köhrer, Karl; Felsberg, Jörg; Malzkorn, Bastian; Reifenberger, Guido

    2017-03-01

    Current classification of gliomas is based on histological criteria according to the World Health Organization (WHO) classification of tumors of the central nervous system. Over the past years, characteristic genetic profiles have been identified in various glioma types. These can refine tumor diagnostics and provide important prognostic and predictive information. We report on the establishment and validation of gene panel next generation sequencing (NGS) for the molecular diagnostics of gliomas. We designed a glioma-tailored gene panel covering 660 amplicons derived from 20 genes frequently aberrant in different glioma types. Sensitivity and specificity of glioma gene panel NGS for detection of DNA sequence variants and copy number changes were validated by single gene analyses. NGS-based mutation detection was optimized for application on formalin-fixed paraffin-embedded tissue specimens including small stereotactic biopsy samples. NGS data obtained in a retrospective analysis of 121 gliomas allowed for their molecular classification into distinct biological groups, including (i) isocitrate dehydrogenase gene (IDH) 1 or 2 mutant astrocytic gliomas with frequent α-thalassemia/mental retardation syndrome X-linked (ATRX) and tumor protein p53 (TP53) gene mutations, (ii) IDH mutant oligodendroglial tumors with 1p/19q codeletion, telomerase reverse transcriptase (TERT) promoter mutation and frequent Drosophila homolog of capicua (CIC) gene mutation, as well as (iii) IDH wildtype glioblastomas with frequent TERT promoter mutation, phosphatase and tensin homolog (PTEN) mutation and/or epidermal growth factor receptor (EGFR) amplification. Oligoastrocytic gliomas were genetically assigned to either of these groups. Our findings implicate gene panel NGS as a promising diagnostic technique that may facilitate integrated histological and molecular glioma classification. © 2016 International Society of Neuropathology.

  5. Systematic Analysis of Primary Sequence Domain Segments for the Discrimination Between Class C GPCR Subtypes.

    PubMed

    König, Caroline; Alquézar, René; Vellido, Alfredo; Giraldo, Jesús

    2018-03-01

    G-protein-coupled receptors (GPCRs) are a large and diverse super-family of eukaryotic cell membrane proteins that play an important physiological role as transmitters of extracellular signal. In this paper, we investigate Class C, a member of this super-family that has attracted much attention in pharmacology. The limited knowledge about the complete 3D crystal structure of Class C receptors makes necessary the use of their primary amino acid sequences for analytical purposes. Here, we provide a systematic analysis of distinct receptor sequence segments with regard to their ability to differentiate between seven class C GPCR subtypes according to their topological location in the extracellular, transmembrane, or intracellular domains. We build on the results from the previous research that provided preliminary evidence of the potential use of separated domains of complete class C GPCR sequences as the basis for subtype classification. The use of the extracellular N-terminus domain alone was shown to result in a minor decrease in subtype discrimination in comparison with the complete sequence, despite discarding much of the sequence information. In this paper, we describe the use of Support Vector Machine-based classification models to evaluate the subtype-discriminating capacity of the specific topological sequence segments.

  6. A Proposed Genus Boundary for the Prokaryotes Based on Genomic Insights

    PubMed Central

    Qin, Qi-Long; Xie, Bin-Bin; Zhang, Xi-Ying; Chen, Xiu-Lan; Zhou, Bai-Cheng; Zhou, Jizhong; Oren, Aharon

    2014-01-01

    Genomic information has already been applied to prokaryotic species definition and classification. However, the contribution of the genome sequence to prokaryotic genus delimitation has been less studied. To gain insights into genus definition for the prokaryotes, we attempted to reveal the genus-level genomic differences in the current prokaryotic classification system and to delineate the boundary of a genus on the basis of genomic information. The average nucleotide sequence identity between two genomes can be used for prokaryotic species delineation, but it is not suitable for genus demarcation. We used the percentage of conserved proteins (POCP) between two strains to estimate their evolutionary and phenotypic distance. A comprehensive genomic survey indicated that the POCP can serve as a robust genomic index for establishing the genus boundary for prokaryotic groups. Basically, two species belonging to the same genus would share at least half of their proteins. In a specific lineage, the genus and family/order ranks showed slight or no overlap in terms of POCP values. A prokaryotic genus can be defined as a group of species with all pairwise POCP values higher than 50%. Integration of whole-genome data into the current taxonomy system can provide comprehensive information for prokaryotic genus definition and delimitation. PMID:24706738
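
    The proposed genus criterion (all pairwise POCP values above 50%) can be checked mechanically once the pairwise values are known. The sketch below assumes those values have already been computed and are supplied in a dictionary keyed by unordered species pairs; the function name `same_genus`, the data layout, and the example values are hypothetical.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Iterable


def same_genus(
    species: Iterable[str],
    pocp: Dict[FrozenSet[str], float],
    threshold: float = 50.0,
) -> bool:
    """True if every pair of species has a POCP value above the threshold (%)."""
    return all(
        pocp[frozenset(pair)] > threshold for pair in combinations(species, 2)
    )


if __name__ == "__main__":
    # Hypothetical POCP percentages, for illustration only.
    values = {
        frozenset({"sp1", "sp2"}): 72.0,
        frozenset({"sp1", "sp3"}): 65.5,
        frozenset({"sp2", "sp3"}): 48.0,
    }
    print(same_genus(["sp1", "sp2"], values))
    print(same_genus(["sp1", "sp2", "sp3"], values))
```

    Because the criterion requires every pair to pass, a single low pairwise value is enough to split a candidate group.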

  7. An Optimization-Based Framework for the Transformation of Incomplete Biological Knowledge into a Probabilistic Structure and Its Application to the Utilization of Gene/Protein Signaling Pathways in Discrete Phenotype Classification.

    PubMed

    Esfahani, Mohammad Shahrokh; Dougherty, Edward R

    2015-01-01

    Phenotype classification via genomic data is hampered by small sample sizes that negatively impact classifier design. Utilization of prior biological knowledge in conjunction with training data can improve both classifier design and error estimation via the construction of the optimal Bayesian classifier. In the genomic setting, gene/protein signaling pathways provide a key source of biological knowledge. Although these pathways are neither complete, nor regulatory, with no timing associated with them, they are capable of constraining the set of possible models representing the underlying interaction between molecules. The aim of this paper is to provide a framework and the mathematical tools to transform signaling pathways to prior probabilities governing uncertainty classes of feature-label distributions used in classifier design. Structural motifs extracted from the signaling pathways are mapped to a set of constraints on a prior probability on a Multinomial distribution. Being the conjugate prior for the Multinomial distribution, we propose optimization paradigms to estimate the parameters of a Dirichlet distribution in the Bayesian setting. The performance of the proposed methods is tested on two widely studied pathways: mammalian cell cycle and a p53 pathway model.
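
    The conjugacy mentioned above is what makes a Dirichlet prior convenient: observing multinomial counts simply adds them to the prior parameters. The sketch below illustrates only that textbook update and the resulting posterior mean. It does not implement the paper's optimization paradigms for deriving the prior from pathway constraints, and the function names and numbers are hypothetical.

```python
from typing import List, Sequence


def dirichlet_posterior(alpha: Sequence[float], counts: Sequence[int]) -> List[float]:
    """Posterior Dirichlet parameters after observing multinomial counts."""
    if len(alpha) != len(counts):
        raise ValueError("alpha and counts must have the same length")
    return [a + n for a, n in zip(alpha, counts)]


def dirichlet_mean(alpha: Sequence[float]) -> List[float]:
    """Expected category probabilities under a Dirichlet distribution."""
    total = sum(alpha)
    return [a / total for a in alpha]


if __name__ == "__main__":
    prior = [1.0, 1.0, 1.0]      # hypothetical uniform prior over three states
    observed = [2, 0, 1]         # hypothetical sample counts
    posterior = dirichlet_posterior(prior, observed)
    print(posterior, dirichlet_mean(posterior))
```

    With small samples, as in the setting the abstract describes, the prior parameters carry much of the weight in the posterior mean, which is why an informative, pathway-derived prior can matter.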

  8. Osteogenesis imperfecta.

    PubMed

    Marini, Joan C; Forlino, Antonella; Bächinger, Hans Peter; Bishop, Nick J; Byers, Peter H; Paepe, Anne De; Fassier, Francois; Fratzl-Zelman, Nadja; Kozloff, Kenneth M; Krakow, Deborah; Montpetit, Kathleen; Semler, Oliver

    2017-08-18

    Skeletal deformity and bone fragility are the hallmarks of the brittle bone dysplasia osteogenesis imperfecta. The diagnosis of osteogenesis imperfecta usually depends on family history and clinical presentation characterized by a fracture (or fractures) during the prenatal period, at birth or in early childhood; genetic tests can confirm diagnosis. Osteogenesis imperfecta is caused by dominant autosomal mutations in the type I collagen coding genes (COL1A1 and COL1A2) in about 85% of individuals, affecting collagen quantity or structure. In the past decade, (mostly) recessive, dominant and X-linked defects in a wide variety of genes encoding proteins involved in type I collagen synthesis, processing, secretion and post-translational modification, as well as in proteins that regulate the differentiation and activity of bone-forming cells have been shown to cause osteogenesis imperfecta. The large number of causative genes has complicated the classic classification of the disease, and although a new genetic classification system is widely used, it is still debated. Phenotypic manifestations in many organs, in addition to bone, are reported, such as abnormalities in the cardiovascular and pulmonary systems, skin fragility, muscle weakness, hearing loss and dentinogenesis imperfecta. Management involves surgical and medical treatment of skeletal abnormalities, and treatment of other complications. More innovative approaches based on gene and cell therapy, and signalling pathway alterations, are under investigation.

  9. Object based image analysis for the classification of the growth stages of Avocado crop, in Michoacán State, Mexico

    NASA Astrophysics Data System (ADS)

    Gao, Yan; Marpu, Prashanth; Morales Manila, Luis M.

    2014-11-01

    This paper assesses the suitability of 8-band Worldview-2 (WV2) satellite data and an object-based random forest algorithm for the classification of avocado growth stages in Mexico. We tested both pixel-based classification with minimum distance (MD) and maximum likelihood (MLC) and object-based classification with the Random Forest (RF) algorithm for this task. Training samples and verification data were selected by visually interpreting the WV2 images for seven thematic classes: fully grown, middle stage, and early stage of avocado crops, bare land, two types of natural forests, and water body. To examine the contribution of the four new spectral bands of the WV2 sensor, all the tested classifications were carried out with and without the four new spectral bands. Classification accuracy assessment results show that object-based classification with the RF algorithm obtained higher overall accuracy (93.06%) than the pixel-based MD (69.37%) and MLC (64.03%) methods. For both pixel-based and object-based methods, the classifications with the four new spectral bands obtained higher overall accuracy than those without (object-based RF: 93.06% vs 83.59%; pixel-based MD: 69.37% vs 67.2%; pixel-based MLC: 64.03% vs 36.05%), suggesting that the four new spectral bands in the WV2 sensor contributed to the increase in classification accuracy.

  10. An accurate computational method for an order parameter with a Markov state model constructed using a manifold-learning technique

    NASA Astrophysics Data System (ADS)

    Ito, Reika; Yoshidome, Takashi

    2018-01-01

    Markov state models (MSMs) are a powerful approach for analyzing the long-time behaviors of protein motion using molecular dynamics simulation data. However, their quantitative performance with respect to the physical quantities is poor. We believe that this poor performance is caused by the failure to appropriately classify protein conformations into states when constructing MSMs. Herein, we show that the quantitative performance of an order parameter is improved when a manifold-learning technique is employed for the classification in the MSM. The MSM construction using the K-center method, which has been previously used for classification, has a poor quantitative performance.

  11. Innovative vehicle classification strategies : using LIDAR to do more for less.

    DOT National Transportation Integrated Search

    2012-06-23

    This study examines LIDAR (light detection and ranging) based vehicle classification and classification performance monitoring. First, we develop a portable LIDAR based vehicle classification system that can be rapidly deployed, and then we use t...

  12. Solidago Virgaurea for Prostate Cancer Therapy

    DTIC Science & Technology

    2010-04-01

    CLASSIFICATION OF: 17. LIMITATION OF ABSTRACT 18 . NUMBER OF PAGES 19a. NAME OF RESPONSIBLE PERSON Kounosuke Watabe Ph.D. a. REPORT Unclassified b...We have purified 48KD protein as described above and examined the efficacy of this protein on tumorigenesis using a xenograft ...clinically more relevant than xenograft model, and most importantly, we will examine the preventive effect of this protein. For this purpose, we need to

  13. Probing Enzyme-Surface Interactions via Protein Engineering and Single-Molecule Techniques

    DTIC Science & Technology

    2017-06-26

    SECURITY CLASSIFICATION OF: The overall objective of this research was to exploit protein engineering and fluorescence single-molecule methods to... Engineering and Single-Molecule Techniques The views, opinions and/or findings contained in this report are those of the author(s) and should not...Status: Technology Transfer: Report Date: 1 FINAL REPORT Project Title: Probing Enzyme-Surface Interactions via Protein Engineering and

  14. Synthesis of Stable Microcapsules from Trematode Eggshell Components

    DTIC Science & Technology

    1990-06-30

    NO Arlington, VA 22217-5000 61153N RR4106 11 TITLE (Include Security Classification) (u) Synthesis of Stable Microcapsules from Trematode Eggshell...Continue on reverse if necessary and identify by block number) The trematode Fasciola hepatica produces a unique protein eggshell or microcapsule the...proteins to produce a hard quinone tanned microcapsule with unusual properties. The focus of this project is to i) characterize the protein components

  15. Structural Characterisation of Proteins from the Peroxiredoxin Family

    DTIC Science & Technology

    2014-01-01

    SECURITY CLASSIFICATION OF: The oligomerisation of protein subunits is an area of much research interest, in particular the relationship to protein...or decision, unless so designated by other documentation. 9. SPONSORING/MONITORING AGENCY NAME(S) AND ADDRESS (ES) U.S. Army Research Office P.O...Box 12211 Research Triangle Park, NC 27709-2211 peroxiredoxin, tecton, supramolecular assembly, TEM REPORT DOCUMENTATION PAGE 11. SPONSOR/MONITOR’S

  16. Compressed learning and its applications to subcellular localization.

    PubMed

    Zheng, Zhong-Long; Guo, Li; Jia, Jiong; Xie, Chen-Mao; Zeng, Wen-Cai; Yang, Jie

    2011-09-01

    One of the main challenges faced by biological applications is to predict protein subcellular localization automatically and accurately. To achieve this, a wide variety of machine learning methods have been proposed in recent years. Most of them focus on finding the optimal classification scheme, and fewer take into account simplifying the complexity of biological systems. Traditionally, such bio-data are analyzed by first performing a feature selection before classification. Motivated by CS (Compressed Sensing) theory, we propose a methodology which performs compressed learning with a sparseness criterion such that feature selection and dimension reduction are merged into one analysis. The proposed methodology decreases the complexity of the biological system while increasing protein subcellular localization accuracy. Experimental results are quite encouraging, indicating that the aforementioned sparse methods are quite promising in dealing with complicated biological problems, such as predicting the subcellular localization of Gram-negative bacterial proteins.

  17. Taxonomy of rare genetic metabolic bone disorders.

    PubMed

    Masi, L; Agnusdei, D; Bilezikian, J; Chappard, D; Chapurlat, R; Cianferotti, L; Devolgelaer, J-P; El Maghraoui, A; Ferrari, S; Javaid, M K; Kaufman, J-M; Liberman, U A; Lyritis, G; Miller, P; Napoli, N; Roldan, E; Papapoulos, S; Watts, N B; Brandi, M L

    2015-10-01

    This article reports a taxonomic classification of rare skeletal diseases based on metabolic phenotypes. It was prepared by The Skeletal Rare Diseases Working Group of the International Osteoporosis Foundation (IOF) and includes 116 OMIM phenotypes with 86 affected genes. Rare skeletal metabolic diseases comprise a group of diseases commonly associated with severe clinical consequences. In recent years, the description of the clinical phenotypes and radiographic features of several genetic bone disorders was paralleled by the discovery of key molecular pathways involved in the regulation of bone and mineral metabolism. Including this information in the description and classification of rare skeletal diseases may improve the recognition and management of affected patients. IOF recognized this need and formed a Skeletal Rare Diseases Working Group (SRD-WG) of basic and clinical scientists who developed a taxonomy of rare skeletal diseases based on their metabolic pathogenesis. This taxonomy of rare genetic metabolic bone disorders (RGMBDs) comprises 116 OMIM phenotypes, with 86 affected genes related to bone and mineral homeostasis. The diseases were divided into four major groups, namely, disorders due to altered osteoclast, osteoblast, or osteocyte activity; disorders due to altered bone matrix proteins; disorders due to altered bone microenvironmental regulators; and disorders due to deranged calciotropic hormonal activity. This article provides the first comprehensive taxonomy of rare metabolic skeletal diseases based on deranged metabolic activity. This classification will help in the development of common and shared diagnostic and therapeutic pathways for these patients and also in the creation of international registries of rare skeletal diseases, the first step for the development of genetic tests based on next generation sequencing and for performing large intervention trials to assess efficacy of orphan drugs.

  18. Identification of Protein Components of Yeast Telomerase

    DTIC Science & Technology

    2000-09-01

    cells past this limit senesce, or stop growing (reviewed in Hayflick 1997). This limit is imposed by the inactivity of telomerase, which results in...CLASSIFICATION OF THIS PAGE Unclassified 19. SECURITY CLASSIFICATION OF ABSTRACT Unclassified 15. NUMBER OF PAGES 55 16. PRICE CODE 20. LIMITATION ...one of which is the acquired capability of limitless replicative potential. Normal mammalian cells have an intrinsic limit to cellular division, and

  19. [Food allergy or food intolerance?].

    PubMed

    Maître, S; Maniu, C-M; Buss, G; Maillard, M H; Spertini, F; Ribi, C

    2014-04-16

    Adverse food reactions can be classified into two main categories depending on whether an immune mechanism is involved or not. The first category includes immune mediated reactions like IgE mediated food allergy, eosinophilic oesophagitis, food protein-induced enterocolitis syndrome and celiac disease. The second category comprises non-immune mediated adverse food reactions, also called food intolerances. Intoxications, pharmacologic reactions, metabolic reactions, physiologic, psychologic or reactions with an unknown mechanism belong to this category. We present a classification of adverse food reactions based on the pathophysiologic mechanism that can be useful for both diagnostic approach and management.

  20. Molecular mechanisms of action of bacterial exotoxins.

    PubMed

    Balfanz, J; Rautenberg, P; Ullmann, U

    1996-07-01

    Toxins are one of the inventive strategies that bacteria have developed in order to survive. As virulence factors, they play a major role in the pathogenesis of infectious diseases. Recent discoveries have once more highlighted the effectiveness of these precisely adjusted bacterial weapons. Furthermore, toxins have become an invaluable tool in the investigation of fundamental cell processes, including regulation of cellular functions by various G proteins, cytoskeletal dynamics and neural transmission. In this review, the bacterial toxins are presented in a rational classification based on the molecular mechanisms of action.

  1. Deep Learning Accurately Predicts Estrogen Receptor Status in Breast Cancer Metabolomics Data.

    PubMed

    Alakwaa, Fadhl M; Chaudhary, Kumardeep; Garmire, Lana X

    2018-01-05

    Metabolomics holds the promise as a new technology to diagnose highly heterogeneous diseases. Conventionally, metabolomics data analysis for diagnosis is done using various statistical and machine learning based classification methods. However, it remains unknown if deep neural network, a class of increasingly popular machine learning methods, is suitable to classify metabolomics data. Here we use a cohort of 271 breast cancer tissues, 204 positive estrogen receptor (ER+), and 67 negative estrogen receptor (ER-) to test the accuracies of feed-forward networks, a deep learning (DL) framework, as well as six widely used machine learning models, namely random forest (RF), support vector machines (SVM), recursive partitioning and regression trees (RPART), linear discriminant analysis (LDA), prediction analysis for microarrays (PAM), and generalized boosted models (GBM). DL framework has the highest area under the curve (AUC) of 0.93 in classifying ER+/ER- patients, compared to the other six machine learning algorithms. Furthermore, the biological interpretation of the first hidden layer reveals eight commonly enriched significant metabolomics pathways (adjusted P-value <0.05) that cannot be discovered by other machine learning methods. Among them, protein digestion and absorption and ATP-binding cassette (ABC) transporters pathways are also confirmed in integrated analysis between metabolomics and gene expression data in these samples. In summary, deep learning method shows advantages for metabolomics based breast cancer ER status classification, with both the highest prediction accuracy (AUC = 0.93) and better revelation of disease biology. We encourage the adoption of feed-forward networks based deep learning method in the metabolomics research community for classification.

  2. Partitioning the Genetic Diversity of a Virus Family: Approach and Evaluation through a Case Study of Picornaviruses

    PubMed Central

    Lauber, Chris

    2012-01-01

    The recent advent of genome sequences as the only source available to classify many newly discovered viruses challenges the development of virus taxonomy by expert virologists who traditionally rely on extensive virus characterization. In this proof-of-principle study, we address this issue by presenting a computational approach (DEmARC) to classify viruses of a family into groups at hierarchical levels using a sole criterion—intervirus genetic divergence. To quantify genetic divergence, we used pairwise evolutionary distances (PEDs) estimated by maximum likelihood inference on a multiple alignment of family-wide conserved proteins. PEDs were calculated for all virus pairs, and the resulting distribution was modeled via a mixture of probability density functions. The model enables the quantitative inference of regions of distance discontinuity in the family-wide PED distribution, which define the levels of hierarchy. For each level, a limit on genetic divergence, below which two viruses join the same group, was objectively selected among a set of candidates by minimizing violations of intragroup PEDs to the limit. In a case study, we applied the procedure to hundreds of genome sequences of picornaviruses and extensively evaluated it by modulating four key parameters. It was found that the genetics-based classification largely tolerates variations in virus sampling and multiple alignment construction but is affected by the choice of protein and the measure of genetic divergence. In an accompanying paper (C. Lauber and A. E. Gorbalenya, J. Virol. 86:3905–3915, 2012), we analyze the substantial insight gained with the genetics-based classification approach by comparing it with the expert-based picornavirus taxonomy. PMID:22278230
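
    The threshold-selection idea described above (choose, among candidate limits, the one that minimizes violations of intragroup distances) can be sketched in a heavily simplified form. This is not the DEmARC implementation: the sketch assumes single-linkage grouping, takes the candidate limits as given rather than inferring them from a mixture model, and uses hypothetical virus names and distances.

```python
from itertools import combinations


def group_by_limit(items, dist, limit):
    """Single-linkage groups: two items are linked when their distance is below limit."""
    parent = {item: item for item in items}

    def find(item):
        while parent[item] != item:
            parent[item] = parent[parent[item]]
            item = parent[item]
        return item

    for a, b in combinations(items, 2):
        if dist[frozenset((a, b))] < limit:
            parent[find(a)] = find(b)

    groups = {}
    for item in items:
        groups.setdefault(find(item), []).append(item)
    return list(groups.values())


def count_violations(groups, dist, limit):
    """Number of intragroup pairs whose distance is not below the limit."""
    return sum(
        1
        for group in groups
        for a, b in combinations(group, 2)
        if dist[frozenset((a, b))] >= limit
    )


# Hypothetical pairwise evolutionary distances between four viruses.
viruses = ["v1", "v2", "v3", "v4"]
ped = {
    frozenset(("v1", "v2")): 0.10,
    frozenset(("v3", "v4")): 0.12,
    frozenset(("v2", "v3")): 0.28,
    frozenset(("v1", "v3")): 0.50,
    frozenset(("v1", "v4")): 0.55,
    frozenset(("v2", "v4")): 0.60,
}
candidates = [0.2, 0.3]  # assumed candidate limits, not inferred from data

best_limit = min(
    candidates,
    key=lambda t: count_violations(group_by_limit(viruses, ped, t), ped, t),
)
print(best_limit)
```

    In this toy data the limit 0.3 chains all four viruses into one group containing three pairs that exceed the limit, so the limit 0.2 is preferred.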

  3. Extracytoplasmic function σ factors of the widely distributed group ECF41 contain a fused regulatory domain

    PubMed Central

    Wecke, Tina; Halang, Petra; Staroń, Anna; Dufour, Yann S; Donohue, Timothy J; Mascher, Thorsten

    2012-01-01

    Bacteria need signal transducing systems to respond to environmental changes. Next to one- and two-component systems, alternative σ factors of the extra-cytoplasmic function (ECF) protein family represent the third fundamental mechanism of bacterial signal transduction. A comprehensive classification of these proteins identified more than 40 phylogenetically distinct groups, most of which are not experimentally investigated. Here, we present the characterization of such a group with unique features, termed ECF41. Among analyzed bacterial genomes, ECF41 σ factors are widely distributed with about 400 proteins from 10 different phyla. They lack obvious anti-σ factors that typically control activity of other ECF σ factors, but their structural genes are often predicted to be cotranscribed with carboxymuconolactone decarboxylases, oxidoreductases, or epimerases based on genomic context conservation. We demonstrate for Bacillus licheniformis and Rhodobacter sphaeroides that the corresponding genes are preceded by a highly conserved promoter motif and are the only detectable targets of ECF41-dependent gene regulation. In contrast to other ECF σ factors, proteins of group ECF41 contain a large C-terminal extension, which is crucial for σ factor activity. Our data demonstrate that ECF41 σ factors are regulated by a novel mechanism based on the presence of a fused regulatory domain. PMID:22950025

  4. A new multidimensional stoichiometric classification of compounds: moving beyond the van Krevelen diagram.

    NASA Astrophysics Data System (ADS)

    Rivas-Ubach, A.; Liu, Y.; Bianchi, T. S.; Tolic, N.; Jansson, C.; Paša-Tolić, L.

    2017-12-01

    The role of nutrients in organisms, especially primary producers, has been a topic of special interest in ecosystem research for understanding the ecosystem structure and function. The majority of macro-elements in organisms, such as C, H, O, N and P, do not act as single elements but are components of organic compounds (lipids, peptides, carbohydrates, etc), which are more directly related to the physiology of organisms and thus to the ecosystem function. However, accurately deciphering the overall content of the main compound classes (lipids, proteins, carbohydrates,…) in organisms is still a major challenge. van Krevelen (vK) diagrams have been widely used as an estimation of the main compound categories present in environmental samples based on O:C vs H:C molecular ratios, but a stoichiometric classification based exclusively on O:C and H:C ratios is feeble. Different compound classes show large O:C and H:C ratio overlapping and other heteroatoms, such as N and P, should be considered to robustly distinguish the different classes. We propose a new compound classification for biological/environmental samples based on the C:H:O:N:P stoichiometric ratios of thousands of molecular formulas of characterized compounds from 6 different main categories: lipids, peptides, amino-sugars, carbohydrates, nucleotides and phytochemical compounds (oxy-aromatic compounds). This new multidimensional stoichiometric compound constraints classification (MSCC) can be applied to data obtained with high resolution mass spectrometry (HRMS), allowing an accurate overview of the relative abundances of the main compound categories present in organismal samples. The MSCC has been optimized for plants, but it could be also applied to different organisms and serve as a strong starting point to further investigate other environmental complex matrices (soils, aerosols, etc). 
    The proposed MSCC advances environmental research, especially eco-metabolomics, ecophysiology and ecological stoichiometry studies, providing a new tool to understand the ecosystem structure and function at the molecular level.
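
    The stoichiometric ratios on which such a classification rests can be computed directly from a molecular formula. The sketch below only derives O:C, H:C, N:C and P:C ratios; the actual MSCC constraint ranges for each compound category are not given in the abstract and are not reproduced here. The parser is a simplification that handles plain formulas without parentheses or charges.

```python
import re


def element_counts(formula):
    """Count atoms per element in a simple formula such as 'C6H12O6'."""
    counts = {}
    for symbol, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] = counts.get(symbol, 0) + (int(number) if number else 1)
    return counts


def stoichiometric_ratios(formula):
    """Return O:C, H:C, N:C and P:C ratios for a carbon-containing formula."""
    counts = element_counts(formula)
    carbon = counts.get("C", 0)
    if carbon == 0:
        raise ValueError("formula contains no carbon: " + formula)
    return {
        element + ":C": counts.get(element, 0) / carbon
        for element in ("O", "H", "N", "P")
    }


print(stoichiometric_ratios("C6H12O6"))  # glucose
print(stoichiometric_ratios("C3H7NO2"))  # alanine
```

    Glucose and alanine have similar H:C ratios and differ mainly in N:C, which illustrates the abstract's point that heteroatoms help separate compound classes that overlap on a van Krevelen diagram.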

  5. A web-based land cover classification system based on ontology model of different classification systems

    NASA Astrophysics Data System (ADS)

    Lin, Y.; Chen, X.

    2016-12-01

    Land cover classification systems used in remote sensing image data have been developed to meet the needs for depicting land covers in scientific investigations and policy decisions. However, accuracy assessments of many data sets demonstrate that, compared with the real physiognomy, each thematic map of a specific land cover classification system contains some unavoidable flaws and unintended deviations. This work proposes a web-based land cover classification system, an integrated prototype, based on an ontology model of various classification systems, each of which is assigned the same weight in the final determination of land cover type. Ontology, a formal explication of specific concepts and relations, is employed in this prototype to build up the connections among different systems and resolve naming conflicts. The process is initialized by measuring semantic similarity between terminologies in the systems and the search key to produce a set of satisfactory classifications, and continues by searching the predefined relations among concepts of all classification systems to generate classification maps with the user-specified land cover type highlighted, based on probabilities calculated from votes by data sets that adopt different classification systems. The present system is verified and validated by comparing the classification results with those of the most common systems. Owing to the full consideration and meaningful expression of each classification system using ontology, and the convenience of the web, this system, as a preliminary model, proposes a flexible and extensible architecture for classification system integration and data fusion, thereby providing a strong foundation for future work.
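
    The equal-weight voting described above can be sketched for a single map location. The system names, class labels, and the label-to-concept mapping are all hypothetical; in the described prototype that mapping would come from the ontology model, which is not reproduced here.

```python
from collections import Counter

# Hypothetical mapping from system-specific class names to shared concepts.
CONCEPT_OF = {
    "Evergreen Forest": "forest",
    "Forest": "forest",
    "Tree Cover": "forest",
    "Cropland": "cropland",
    "Cultivated Land": "cropland",
}


def vote_probabilities(labels_by_system):
    """Equal-weight vote: each classification system contributes one vote."""
    votes = Counter(CONCEPT_OF[label] for label in labels_by_system.values())
    total = sum(votes.values())
    return {concept: count / total for concept, count in votes.items()}


# Labels assigned to the same location by three hypothetical data sets.
location_labels = {
    "system_a": "Evergreen Forest",
    "system_b": "Forest",
    "system_c": "Cropland",
}

print(vote_probabilities(location_labels))
```

    Mapping labels to shared concepts before counting is what lets systems with different vocabularies vote on the same question.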

  6. Comprehensive analysis of CpG island methylator phenotype (CIMP)-high, -low, and -negative colorectal cancers based on protein marker expression and molecular features.

    PubMed

    Zlobec, Inti; Bihl, Michel; Foerster, Anja; Rufle, Alex; Lugli, Alessandro

    2011-11-01

    CpG island methylator phenotype (CIMP) is being investigated for its role in the molecular and prognostic classification of colorectal cancer patients but is also emerging as a factor with the potential to influence clinical decision-making. We report a comprehensive analysis of clinico-pathological and molecular features (KRAS, BRAF and microsatellite instability, MSI) as well as of selected tumour- and host-related protein markers characterizing CIMP-high (CIMP-H), -low, and -negative colorectal cancers. Immunohistochemical analysis for 48 protein markers and molecular analysis of CIMP (CIMP-H: ≥ 4/5 methylated genes), MSI (MSI-H: ≥ 2 instable genes), KRAS, and BRAF were performed on 337 colorectal cancers. Simple and multiple regression analysis and receiver operating characteristic (ROC) curve analysis were performed. CIMP-H was found in 24 cases (7.1%) and linked (p < 0.0001) to more proximal tumour location, BRAF mutation, MSI-H, MGMT methylation (p = 0.022), advanced pT classification (p = 0.03), mucinous histology (p = 0.069), and less frequent KRAS mutation (p = 0.067) compared to CIMP-low or -negative cases. Of the 48 protein markers, decreased levels of RKIP (p = 0.0056), EphB2 (p = 0.0045), CK20 (p = 0.002), and Cdx2 (p < 0.0001) and increased numbers of CD8+ intra-epithelial lymphocytes (p < 0.0001) were related to CIMP-H, independently of MSI status. In addition to the expected clinico-pathological and molecular associations, CIMP-H colorectal cancers are characterized by a loss of protein markers associated with differentiation, and metastasis suppression, and have increased CD8+ T-lymphocytes regardless of MSI status. In particular, Cdx2 loss seems to strongly predict CIMP-H in both microsatellite-stable (MSS) and MSI-H colorectal cancers. Cdx2 is proposed as a surrogate marker for CIMP-H. Copyright © 2011 Pathological Society of Great Britain and Ireland. Published by John Wiley & Sons, Ltd.

  7. Insights into an original pocket-ligand pair classification: a promising tool for ligand profile prediction.

    PubMed

    Pérot, Stéphanie; Regad, Leslie; Reynès, Christelle; Spérandio, Olivier; Miteva, Maria A; Villoutreix, Bruno O; Camproux, Anne-Claude

    2013-01-01

    Pockets are today at the cornerstones of modern drug discovery projects and at the crossroad of several research fields, from structural biology to mathematical modeling. Being able to predict if a small molecule could bind to one or more protein targets or if a protein could bind to some given ligands is very useful for drug discovery endeavors, anticipation of binding to off- and anti-targets. To date, several studies explore such questions from chemogenomic approach to reverse docking methods. Most of these studies have been performed either from the viewpoint of ligands or targets. However it seems valuable to use information from both ligands and target binding pockets. Hence, we present a multivariate approach relating ligand properties with protein pocket properties from the analysis of known ligand-protein interactions. We explored and optimized the pocket-ligand pair space by combining pocket and ligand descriptors using Principal Component Analysis and developed a classification engine on this paired space, revealing five main clusters of pocket-ligand pairs sharing specific and similar structural or physico-chemical properties. These pocket-ligand pair clusters highlight correspondences between pocket and ligand topological and physico-chemical properties and capture relevant information with respect to protein-ligand interactions. Based on these pocket-ligand correspondences, a protocol of prediction of clusters sharing similarity in terms of recognition characteristics is developed for a given pocket-ligand complex and gives high performances. It is then extended to cluster prediction for a given pocket in order to acquire knowledge about its expected ligand profile or to cluster prediction for a given ligand in order to acquire knowledge about its expected pocket profile. 
    This prediction approach shows promising results and could contribute to predict some ligand properties critical for binding to a given pocket, and conversely, some key pocket properties for ligand binding.

  8. Insights into an Original Pocket-Ligand Pair Classification: A Promising Tool for Ligand Profile Prediction

    PubMed Central

    Reynès, Christelle; Spérandio, Olivier; Miteva, Maria A.; Villoutreix, Bruno O.; Camproux, Anne-Claude

    2013-01-01

    Pockets are today at the cornerstones of modern drug discovery projects and at the crossroad of several research fields, from structural biology to mathematical modeling. Being able to predict if a small molecule could bind to one or more protein targets or if a protein could bind to some given ligands is very useful for drug discovery endeavors, anticipation of binding to off- and anti-targets. To date, several studies explore such questions from chemogenomic approach to reverse docking methods. Most of these studies have been performed either from the viewpoint of ligands or targets. However it seems valuable to use information from both ligands and target binding pockets. Hence, we present a multivariate approach relating ligand properties with protein pocket properties from the analysis of known ligand-protein interactions. We explored and optimized the pocket-ligand pair space by combining pocket and ligand descriptors using Principal Component Analysis and developed a classification engine on this paired space, revealing five main clusters of pocket-ligand pairs sharing specific and similar structural or physico-chemical properties. These pocket-ligand pair clusters highlight correspondences between pocket and ligand topological and physico-chemical properties and capture relevant information with respect to protein-ligand interactions. Based on these pocket-ligand correspondences, a protocol of prediction of clusters sharing similarity in terms of recognition characteristics is developed for a given pocket-ligand complex and gives high performances. It is then extended to cluster prediction for a given pocket in order to acquire knowledge about its expected ligand profile or to cluster prediction for a given ligand in order to acquire knowledge about its expected pocket profile. 
    This prediction approach shows promising results and could contribute to predict some ligand properties critical for binding to a given pocket, and conversely, some key pocket properties for ligand binding. PMID:23840299

  9. Comprehensive and quantitative proteomic analyses of zebrafish plasma reveals conserved protein profiles between genders and between zebrafish and human.

    PubMed

    Li, Caixia; Tan, Xing Fei; Lim, Teck Kwang; Lin, Qingsong; Gong, Zhiyuan

    2016-04-13

    Omic approaches have been increasingly used in the zebrafish model for holistic understanding of molecular events and mechanisms of tissue functions. However, plasma is rarely used for omic profiling because of the technical challenges in collecting sufficient blood. In this study, we employed two mass spectrometric (MS) approaches for a comprehensive characterization of zebrafish plasma proteome, i.e. conventional shotgun liquid chromatography-tandem mass spectrometry (LC-MS/MS) for an overview study and quantitative SWATH (Sequential Window Acquisition of all THeoretical fragment-ion spectra) for comparison between genders. 959 proteins were identified in the shotgun profiling with estimated concentrations spanning almost five orders of magnitude. Other than the presence of a few highly abundant female egg yolk precursor proteins (vitellogenins), the proteomic profiles of male and female plasmas were very similar in both number and abundance and there were basically no other highly gender-biased proteins. The types of plasma proteins based on IPA (Ingenuity Pathway Analysis) classification and tissue sources of production were also very similar. Furthermore, the zebrafish plasma proteome shares significant similarities with human plasma proteome, in particular in top abundant proteins including apolipoproteins and complements. Thus, the current study provided a valuable dataset for future evaluation of plasma proteins in zebrafish.

  10. Comprehensive and quantitative proteomic analyses of zebrafish plasma reveals conserved protein profiles between genders and between zebrafish and human

    PubMed Central

    Li, Caixia; Tan, Xing Fei; Lim, Teck Kwang; Lin, Qingsong; Gong, Zhiyuan

    2016-01-01

    Omic approaches have been increasingly used in the zebrafish model for holistic understanding of molecular events and mechanisms of tissue functions. However, plasma is rarely used for omic profiling because of the technical challenges in collecting sufficient blood. In this study, we employed two mass spectrometric (MS) approaches for a comprehensive characterization of zebrafish plasma proteome, i.e. conventional shotgun liquid chromatography-tandem mass spectrometry (LC-MS/MS) for an overview study and quantitative SWATH (Sequential Window Acquisition of all THeoretical fragment-ion spectra) for comparison between genders. 959 proteins were identified in the shotgun profiling with estimated concentrations spanning almost five orders of magnitude. Other than the presence of a few highly abundant female egg yolk precursor proteins (vitellogenins), the proteomic profiles of male and female plasmas were very similar in both number and abundance and there were basically no other highly gender-biased proteins. The types of plasma proteins based on IPA (Ingenuity Pathway Analysis) classification and tissue sources of production were also very similar. Furthermore, the zebrafish plasma proteome shares significant similarities with human plasma proteome, in particular in top abundant proteins including apolipoproteins and complements. Thus, the current study provided a valuable dataset for future evaluation of plasma proteins in zebrafish. PMID:27071722

  11. Self-organized neural maps of human protein sequences.

    PubMed Central

    Ferrán, E. A.; Pflugfelder, B.; Ferrara, P.

    1994-01-01

    We have recently described a method based on artificial neural networks to cluster protein sequences into families. The network was trained with Kohonen's unsupervised learning algorithm using, as inputs, the matrix patterns derived from the dipeptide composition of the proteins. We present here a large-scale application of that method to classify the 1,758 human protein sequences stored in the SwissProt database (release 19.0), whose lengths are greater than 50 amino acids. In the final 2-dimensional topologically ordered map of 15 x 15 neurons, proteins belonging to known families were associated with the same neuron or with neighboring ones. Also, as an attempt to reduce the time-consuming learning procedure, we compared 2 learning protocols: one of 500 epochs (100 SUN CPU-hours [CPU-h]), and another one of 30 epochs (6.7 CPU-h). A further reduction of learning-computing time, by a factor of about 3.3, with similar protein clustering results, was achieved using a matrix of 11 x 11 components to represent the sequences. Although network training is time consuming, the classification of a new protein in the final ordered map is very fast (14.6 CPU-seconds). We also show a comparison between the artificial neural network approach and conventional methods of biosequence analysis. PMID:8019421
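
    The approach in this abstract (dipeptide composition as input to a Kohonen map) can be illustrated with a minimal sketch. The abstract does not give the encoding details or training schedule, so the 20 x 20 frequency matrix, the 3 x 3 grid, the Gaussian neighborhood, and the linear decay of learning rate and radius below are all assumptions chosen for illustration, not the authors' settings.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def dipeptide_composition(sequence):
    """Return 400 dipeptide frequencies (a 20 x 20 matrix, row-major)."""
    counts = [0.0] * 400
    total = 0
    for a, b in zip(sequence, sequence[1:]):
        # Skip pairs containing non-standard residue codes.
        if a in INDEX and b in INDEX:
            counts[INDEX[a] * 20 + INDEX[b]] += 1.0
            total += 1
    if total:
        counts = [c / total for c in counts]
    return counts


def best_matching_unit(weights, vector):
    """Return (row, col) of the neuron closest to `vector` (squared Euclidean)."""
    best, best_dist = None, float("inf")
    for r, row in enumerate(weights):
        for c, w in enumerate(row):
            d = sum((wi - vi) ** 2 for wi, vi in zip(w, vector))
            if d < best_dist:
                best, best_dist = (r, c), d
    return best


def train_som(vectors, size=3, epochs=20, seed=0):
    """Train a size x size Kohonen map; schedule values are illustrative."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    weights = [[[rng.random() / dim for _ in range(dim)]
                for _ in range(size)] for _ in range(size)]
    for epoch in range(epochs):
        frac = 1.0 - epoch / epochs           # decays linearly from 1 toward 0
        rate = 0.5 * frac                     # assumed learning-rate schedule
        radius = max(1.0, (size / 2.0) * frac)
        for v in vectors:
            br, bc = best_matching_unit(weights, v)
            for r in range(size):
                for c in range(size):
                    dist2 = (r - br) ** 2 + (c - bc) ** 2
                    influence = math.exp(-dist2 / (2.0 * radius ** 2))
                    w = weights[r][c]
                    for i in range(dim):
                        w[i] += rate * influence * (v[i] - w[i])
    return weights
```

    Once trained, classifying a new sequence only requires one call to `best_matching_unit`, which is consistent with the abstract's remark that placement in the final map is fast compared with training.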

  12. LC-MS analysis of Hep-2 and Hek-293 cell lines treated with Brazilian red propolis reveals differences in protein expression.

    PubMed

    da Silva Frozza, Caroline O; da Silva Brum, Emyle; Alving, Anjali; Moura, Sidnei; Henriques, João A P; Roesch-Ely, Mariana

    2016-08-01

    Red propolis, an exclusive variety of propolis found in the northeast of Brazil, has been shown to present antitumour activity, among several other biological properties. This article aimed to help to evaluate the underlying molecular mechanisms of the potential anticancer effects of red propolis on tumour cells, Hep-2, and non-tumour cells, Hek-293. Differentially expressed proteins in human cell lines were identified through a label-free quantitative MS-based proteomic platform, and cells were stained with Giemsa to show morphological changes. A total of 1336 and 773 proteins were identified for Hep-2 and Hek-293, respectively. Among the proteins identified here, 16 were regulated in the Hep-2 cell line and 4 in the Hek-293 line. In total, over 2000 proteins were identified by MS analysis, and approximately 1% presented differential expression patterns. The GO annotation using the Protein Analysis THrough Evolutionary Relationships (PANTHER) classification system revealed a predominant molecular function of catalytic activity, and among the biological processes, the most prominent was associated with cell metabolism. The proteomic profile presented here should help to elucidate further molecular mechanisms involved in inhibition of cancer cell proliferation by red propolis, which remain unclear to date. © 2016 Royal Pharmaceutical Society.

  13. A Method of Spatial Mapping and Reclassification for High-Spatial-Resolution Remote Sensing Image Classification

    PubMed Central

    Wang, Guizhou; Liu, Jianbo; He, Guojin

    2013-01-01

    This paper presents a new classification method for high-spatial-resolution remote sensing images based on a strategic mechanism of spatial mapping and reclassification. The proposed method includes four steps. First, the multispectral image is classified by a traditional pixel-based classification method (support vector machine). Second, the panchromatic image is subdivided by watershed segmentation. Third, the pixel-based multispectral image classification result is mapped to the panchromatic segmentation result based on a spatial mapping mechanism and the area dominant principle. During the mapping process, an area proportion threshold is set, and the regional property is defined as unclassified if the maximum area proportion does not surpass the threshold. Finally, unclassified regions are reclassified based on spectral information using the minimum distance to mean algorithm. Experimental results show that the classification method for high-spatial-resolution remote sensing images based on the spatial mapping mechanism and reclassification strategy can make use of both panchromatic and multispectral information, integrate the pixel- and object-based classification methods, and improve classification accuracy. PMID:24453808
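
    The third and fourth steps of this abstract (area-dominant mapping with a proportion threshold, then minimum-distance-to-mean reclassification) can be sketched as follows. This is a minimal illustration under stated assumptions: the flat list inputs, the default threshold of 0.6, and the function names are hypothetical and do not come from the paper.

```python
from collections import Counter, defaultdict

UNCLASSIFIED = None  # marker for segments that fail the threshold test


def map_labels_to_segments(pixel_labels, segment_ids, threshold=0.6):
    """Give each segment its dominant pixel class, or UNCLASSIFIED.

    pixel_labels[i] is the pixel-based class of pixel i and segment_ids[i]
    is the segment that pixel falls in. A segment keeps its dominant class
    only if that class's area proportion surpasses `threshold`.
    """
    votes = defaultdict(Counter)
    for label, seg in zip(pixel_labels, segment_ids):
        votes[seg][label] += 1
    result = {}
    for seg, counter in votes.items():
        label, count = counter.most_common(1)[0]
        proportion = count / sum(counter.values())
        result[seg] = label if proportion > threshold else UNCLASSIFIED
    return result


def minimum_distance_to_mean(spectrum, class_means):
    """Return the class whose mean spectrum is nearest (squared Euclidean)."""
    return min(
        class_means,
        key=lambda name: sum((a - b) ** 2
                             for a, b in zip(spectrum, class_means[name])),
    )
```

    In use, segments left as `UNCLASSIFIED` by `map_labels_to_segments` would be passed, with their mean spectrum, to `minimum_distance_to_mean` to obtain a final label.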

  14. Phylogenetic Analysis and Classification of the Fungal bHLH Domain

    PubMed Central

    Sailsbery, Joshua K.; Atchley, William R.; Dean, Ralph A.

    2012-01-01

    The basic Helix-Loop-Helix (bHLH) domain is an essential, highly conserved DNA-binding domain found in many transcription factors in all eukaryotic organisms. The bHLH domain has been well studied in the Animal and Plant Kingdoms but has yet to be characterized within Fungi. Herein, we obtained and evaluated the phylogenetic relationship of 490 fungal-specific bHLH-containing proteins from 55 whole genome projects composed of 49 Ascomycota and 6 Basidiomycota organisms. We identified 12 major groupings within Fungi (F1–F12), identifying conserved motifs and functions specific to each group. Several classification models were built to distinguish the 12 groups and elucidate the most discerning sites in the domain. Performance testing on these models, for correct group classification, resulted in a maximum sensitivity and specificity of 98.5% and 99.8%, respectively. We identified 12 highly discerning sites and incorporated those into a set of rules (simplified model) to classify sequences into the correct group. Conservation of amino acid sites and phylogenetic analyses established that, like plant bHLH proteins, fungal bHLH-containing proteins are most closely related to animal Group B. The models used in these analyses were incorporated into a software package, the source code for which is available at www.fungalgenomics.ncsu.edu. PMID:22114358

  15. Java Web Start based software for automated quantitative nuclear analysis of prostate cancer and benign prostate hyperplasia.

    PubMed

    Singh, Swaroop S; Kim, Desok; Mohler, James L

    2005-05-11

    Androgen acts via androgen receptor (AR) and accurate measurement of the levels of AR protein expression is critical for prostate research. The expression of AR in paired specimens of benign prostate and prostate cancer from 20 African and 20 Caucasian Americans was compared to demonstrate an application of this system. A set of 200 immunopositive and 200 immunonegative nuclei were collected from the images using a macro developed in Image Pro Plus. Linear Discriminant and Logistic Regression analyses were performed on the data to generate classification coefficients. Classification coefficients render the automated image analysis software independent of the type of immunostaining or image acquisition system used. The image analysis software performs local segmentation and uses nuclear shape and size to detect prostatic epithelial nuclei. AR expression is described by (a) percentage of immunopositive nuclei; (b) percentage of immunopositive nuclear area; and (c) intensity of AR expression among immunopositive nuclei or areas. The percent positive nuclei and percent nuclear area were similar by race in both benign prostate hyperplasia and prostate cancer. In prostate cancer epithelial nuclei, African Americans exhibited 38% higher levels of AR immunostaining than Caucasian Americans (two-sided Student's t-tests; P < 0.05). Intensity of AR immunostaining was similar between races in benign prostate. The differences measured in the intensity of AR expression in prostate cancer were consistent with previous studies. Classification coefficients are required due to non-standardized immunostaining and image collection methods across medical institutions and research laboratories and help customize the software for the specimen under study. The availability of a free, automated system creates new opportunities for testing, evaluation and use of this image analysis system by many research groups who study nuclear protein expression.

  16. MIPS: a database for protein sequences, homology data and yeast genome information.

    PubMed Central

    Mewes, H W; Albermann, K; Heumann, K; Liebl, S; Pfeiffer, F

    1997-01-01

    The MIPS group (Martinsried Institute for Protein Sequences) at the Max-Planck-Institute for Biochemistry, Martinsried near Munich, Germany, collects, processes and distributes protein sequence data within the framework of the tripartite association of the PIR-International Protein Sequence Database. MIPS contributes nearly 50% of the data input to the PIR-International Protein Sequence Database. The database is distributed on CD-ROM together with PATCHX, an exhaustive supplement of unique, unverified protein sequences from external sources compiled by MIPS. Through its WWW server (http://www.mips.biochem.mpg.de/) MIPS permits internet access to sequence databases, homology data and to yeast genome information. (i) Sequence similarity results from the FASTA program are stored in the FASTA database for all proteins from PIR-International and PATCHX. The database is dynamically maintained and permits instant access to FASTA results. (ii) Starting with FASTA database queries, proteins have been classified into families and superfamilies (PROT-FAM). (iii) The HPT (hashed position tree) data structure developed at MIPS is a new approach for rapid sequence and pattern searching. (iv) MIPS provides access to the sequence and annotation of the complete yeast genome, the functional classification of yeast genes (FunCat) and its graphical display, the 'Genome Browser'. A CD-ROM based on the JAVA programming language providing dynamic interactive access to the yeast genome and the related protein sequences has been compiled and is available on request. PMID:9016498

  17. Proteome Exploration to Provide a Resource for the Investigation of Ganoderma lucidum

    PubMed Central

    Yu, Guo-Jun; Yin, Ya-Lin; Yu, Wen-Hui; Liu, Wei; Jin, Yan-Xia; Shrestha, Alok; Yang, Qing; Ye, Xiang-Dong; Sun, Hui

    2015-01-01

    Ganoderma lucidum is a basidiomycete white rot fungus that has been used for medicinal purposes worldwide. Although information concerning its genome and transcriptome has recently been reported, relatively little information is available for G. lucidum at the proteomic level. In this study, protein fractions from G. lucidum at three developmental stages (16-day mycelia, and fruiting bodies at 60 and 90 days) were prepared and subjected to LC-MS/MS analysis. A search against the G. lucidum genome database identified 803 proteins. Among these proteins, 61 lignocellulose degrading proteins were detected, most of which (49 proteins) were found in the 90-day fruiting bodies. Fourteen TCA-cycle related proteins, 17 peptidases, two argonaute-like proteins, and two immunomodulatory proteins were also detected. A majority (470) of the 803 proteins had GO annotations and were classified into 36 GO terms, with “binding”, “catalytic activity”, and “hydrolase activity” having high percentages. Additionally, 357 out of the 803 proteins were assigned to at least one COG functional category and grouped into 22 COG classifications. Based on the results from the proteomic and sequence alignment analyses, a potentially new immunomodulatory protein (GL18769) was expressed and shown to have high immunomodulatory activity. In this study, proteomic and biochemical analyses of G. lucidum were performed for the first time, revealing that proteins from this fungus can play significant bioactive roles and providing a new foundation for the further functional investigations that this fungus merits. PMID:25756518

  18. Classification of weld defect based on information fusion technology for radiographic testing system

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Jiang, Hongquan; Liang, Zeming, E-mail: heavenlzm@126.com; Gao, Jianmin

    Improving the efficiency and accuracy of weld defect classification is an important technical problem in developing the radiographic testing system. This paper proposes a novel weld defect classification method based on information fusion technology, Dempster–Shafer evidence theory. First, to characterize weld defects and improve the accuracy of their classification, 11 weld defect features were defined based on the sub-pixel level edges of radiographic images, four of which are presented for the first time in this paper. Second, we applied information fusion technology to combine different features for weld defect classification, including a mass function defined based on the weld defect feature information and the quartile-method-based calculation of standard weld defect class which is to solve a sample problem involving a limited number of training samples. A steam turbine weld defect classification case study is also presented herein to illustrate our technique. The results show that the proposed method can increase the correct classification rate with limited training samples and address the uncertainties associated with weld defect classification.

  19. Classification of weld defect based on information fusion technology for radiographic testing system.

    PubMed

    Jiang, Hongquan; Liang, Zeming; Gao, Jianmin; Dang, Changying

    2016-03-01

    Improving the efficiency and accuracy of weld defect classification is an important technical problem in developing the radiographic testing system. This paper proposes a novel weld defect classification method based on information fusion technology, Dempster-Shafer evidence theory. First, to characterize weld defects and improve the accuracy of their classification, 11 weld defect features were defined based on the sub-pixel level edges of radiographic images, four of which are presented for the first time in this paper. Second, we applied information fusion technology to combine different features for weld defect classification, including a mass function defined based on the weld defect feature information and the quartile-method-based calculation of standard weld defect class which is to solve a sample problem involving a limited number of training samples. A steam turbine weld defect classification case study is also presented herein to illustrate our technique. The results show that the proposed method can increase the correct classification rate with limited training samples and address the uncertainties associated with weld defect classification.
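
    The fusion step named in this abstract rests on Dempster's rule of combination, which can be sketched in a few lines. The abstract does not specify how the mass functions are built from the 11 features, so the example masses and the defect class names ("crack", "porosity") used in the usage note are hypothetical; only the combination rule itself is standard.

```python
def combine(m1, m2):
    """Combine two mass functions with Dempster's rule.

    Each mass function maps a frozenset of hypotheses to its mass, with the
    masses summing to 1. Products whose focal sets do not intersect are
    counted as conflict and normalized away.
    """
    combined = {}
    conflict = 0.0
    for a, mass_a in m1.items():
        for b, mass_b in m2.items():
            intersection = a & b
            if intersection:
                combined[intersection] = (
                    combined.get(intersection, 0.0) + mass_a * mass_b
                )
            else:
                conflict += mass_a * mass_b
    if 1.0 - conflict <= 1e-12:
        raise ValueError("sources are in total conflict; rule is undefined")
    return {k: v / (1.0 - conflict) for k, v in combined.items()}
```

    For instance, one feature might give `{frozenset({"crack"}): 0.6, frozenset({"crack", "porosity"}): 0.4}` and another `{frozenset({"porosity"}): 0.5, frozenset({"crack", "porosity"}): 0.5}`; passing both to `combine` yields a single fused mass function over the same hypotheses, and further features can be folded in by calling `combine` again on the result.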

  20. A Computerized English-Spanish Correlation Index to Five Biomedical Library Classification Schemes Based on MeSH*

    PubMed Central

    Muench, Eugene V.

    1971-01-01

    A computerized English/Spanish correlation index to five biomedical library classification schemes and computerized English/Spanish and Spanish/English listings of MeSH are described. The index was accomplished by supplying appropriate classification numbers of five classification schemes (National Library of Medicine; Library of Congress; Dewey Decimal; Cunningham; Boston Medical) to MeSH and a Spanish translation of MeSH. The data were keypunched, merged on magnetic tape, and sorted in a computer alphabetically by English and Spanish subject headings and sequentially by classification number. Some benefits and uses of the index are: a complete index to classification schemes based on MeSH terms; a tool for conversion of classification numbers when reclassifying collections; a Spanish index and a crude Spanish translation of five classification schemes; a data base for future applications, e.g., automatic classification. Other classification schemes, such as the UDC, and translations of MeSH into other languages can be added. PMID:5172471
