Sample records for protein sequence classification

  1. Visualizing and Clustering Protein Similarity Networks: Sequences, Structures, and Functions.

    PubMed

    Mai, Te-Lun; Hu, Geng-Ming; Chen, Chi-Ming

    2016-07-01

    Research in the recent decade has demonstrated the usefulness of protein network knowledge in furthering the study of molecular evolution of proteins, understanding the robustness of cells to perturbation, and annotating new protein functions. In this study, we aimed to provide a general clustering approach to visualize the sequence-structure-function relationship of protein networks, and investigate possible causes for inconsistency in the protein classifications based on sequences, structures, and functions. Such visualization of protein networks could facilitate our understanding of the overall relationship among proteins and help researchers comprehend various protein databases. As a demonstration, we clustered 1437 enzymes by their sequences and structures using the minimum span clustering (MSC) method. The general structure of this protein network was delineated at two clustering resolutions, and the second level MSC clustering was found to be highly similar to existing enzyme classifications. The clustering of these enzymes based on sequence, structure, and function information is consistent with each other. For proteases, the Jaccard's similarity coefficient is 0.86 between sequence and function classifications, 0.82 between sequence and structure classifications, and 0.78 between structure and function classifications. From our clustering results, we discussed possible examples of divergent evolution and convergent evolution of enzymes. Our clustering approach provides a panoramic view of the sequence-structure-function network of proteins, helps visualize the relation between related proteins intuitively, and is useful in predicting the structure and function of newly determined protein sequences.

  2. Nucleic and Amino Acid Sequences Support Structure-Based Viral Classification.

    PubMed

    Sinclair, Robert M; Ravantti, Janne J; Bamford, Dennis H

    2017-04-15

    Viral capsids ensure viral genome integrity by protecting the enclosed nucleic acids. Interactions between the genome and capsid and between individual capsid proteins (i.e., capsid architecture) are intimate and are expected to be characterized by strong evolutionary conservation. For this reason, a capsid structure-based viral classification has been proposed as a way to bring order to the viral universe. The seeming lack of sufficient sequence similarity to reproduce this classification has made it difficult to reject structural convergence as the basis for the classification. We reinvestigate whether the structure-based classification for viral coat proteins making icosahedral virus capsids is in fact supported by previously undetected sequence similarity. Since codon choices can influence nascent protein folding cotranslationally, we searched for both amino acid and nucleotide sequence similarity. To demonstrate the sensitivity of the approach, we identify a candidate gene for the pandoravirus capsid protein. We show that the structure-based classification is strongly supported by amino acid and also nucleotide sequence similarities, suggesting that the similarities are due to common descent. The correspondence between structure-based and sequence-based analyses of the same proteins shown here allow them to be used in future analyses of the relationship between linear sequence information and macromolecular function, as well as between linear sequence and protein folds. IMPORTANCE Viral capsids protect nucleic acid genomes, which in turn encode capsid proteins. This tight coupling of protein shell and nucleic acids, together with strong functional constraints on capsid protein folding and architecture, leads to the hypothesis that capsid protein-coding nucleotide sequences may retain signatures of ancient viral evolution. We have been able to show that this is indeed the case, using the major capsid proteins of viruses forming icosahedral capsids. Importantly, we detected similarity at the nucleotide level between capsid protein-coding regions from viruses infecting cells belonging to all three domains of life, reproducing a previously established structure-based classification of icosahedral viral capsids. Copyright © 2017 Sinclair et al.

  3. Nucleic and Amino Acid Sequences Support Structure-Based Viral Classification

    PubMed Central

    Sinclair, Robert M.; Ravantti, Janne J.

    2017-01-01

    ABSTRACT Viral capsids ensure viral genome integrity by protecting the enclosed nucleic acids. Interactions between the genome and capsid and between individual capsid proteins (i.e., capsid architecture) are intimate and are expected to be characterized by strong evolutionary conservation. For this reason, a capsid structure-based viral classification has been proposed as a way to bring order to the viral universe. The seeming lack of sufficient sequence similarity to reproduce this classification has made it difficult to reject structural convergence as the basis for the classification. We reinvestigate whether the structure-based classification for viral coat proteins making icosahedral virus capsids is in fact supported by previously undetected sequence similarity. Since codon choices can influence nascent protein folding cotranslationally, we searched for both amino acid and nucleotide sequence similarity. To demonstrate the sensitivity of the approach, we identify a candidate gene for the pandoravirus capsid protein. We show that the structure-based classification is strongly supported by amino acid and also nucleotide sequence similarities, suggesting that the similarities are due to common descent. The correspondence between structure-based and sequence-based analyses of the same proteins shown here allow them to be used in future analyses of the relationship between linear sequence information and macromolecular function, as well as between linear sequence and protein folds. IMPORTANCE Viral capsids protect nucleic acid genomes, which in turn encode capsid proteins. This tight coupling of protein shell and nucleic acids, together with strong functional constraints on capsid protein folding and architecture, leads to the hypothesis that capsid protein-coding nucleotide sequences may retain signatures of ancient viral evolution. We have been able to show that this is indeed the case, using the major capsid proteins of viruses forming icosahedral capsids. Importantly, we detected similarity at the nucleotide level between capsid protein-coding regions from viruses infecting cells belonging to all three domains of life, reproducing a previously established structure-based classification of icosahedral viral capsids. PMID:28122979

  4. Elman RNN based classification of proteins sequences on account of their mutual information.

    PubMed

    Mishra, Pooja; Nath Pandey, Paras

    2012-10-21

    In the present work we have employed the method of estimating residue correlation within the protein sequences, by using the mutual information (MI) of adjacent residues, based on structural and solvent accessibility properties of amino acids. The long range correlation between nonadjacent residues is improved by constructing a mutual information vector (MIV) for a single protein sequence, like this each protein sequence is associated with its corresponding MIVs. These MIVs are given to Elman RNN to obtain the classification of protein sequences. The modeling power of MIV was shown to be significantly better, giving a new approach towards alignment free classification of protein sequences. We also conclude that sequence structural and solvent accessible property based MIVs are better predictor. Copyright © 2012 Elsevier Ltd. All rights reserved.

  5. Protein Sequence Classification with Improved Extreme Learning Machine Algorithms

    PubMed Central

    2014-01-01

    Precisely classifying a protein sequence from a large biological protein sequences database plays an important role for developing competitive pharmacological products. Comparing the unseen sequence with all the identified protein sequences and returning the category index with the highest similarity scored protein, conventional methods are usually time-consuming. Therefore, it is urgent and necessary to build an efficient protein sequence classification system. In this paper, we study the performance of protein sequence classification using SLFNs. The recent efficient extreme learning machine (ELM) and its invariants are utilized as the training algorithms. The optimal pruned ELM is first employed for protein sequence classification in this paper. To further enhance the performance, the ensemble based SLFNs structure is constructed where multiple SLFNs with the same number of hidden nodes and the same activation function are used as ensembles. For each ensemble, the same training algorithm is adopted. The final category index is derived using the majority voting method. Two approaches, namely, the basic ELM and the OP-ELM, are adopted for the ensemble based SLFNs. The performance is analyzed and compared with several existing methods using datasets obtained from the Protein Information Resource center. The experimental results show the priority of the proposed algorithms. PMID:24795876

  6. A topological approach for protein classification

    DOE PAGES

    Cang, Zixuan; Mu, Lin; Wu, Kedi; ...

    2015-11-04

    Here, protein function and dynamics are closely related to its sequence and structure. However, prediction of protein function and dynamics from its sequence and structure is still a fundamental challenge in molecular biology. Protein classification, which is typically done through measuring the similarity between proteins based on protein sequence or physical information, serves as a crucial step toward the understanding of protein function and dynamics.

  7. Optimizing high performance computing workflow for protein functional annotation.

    PubMed

    Stanberry, Larissa; Rekepalli, Bhanu; Liu, Yuan; Giblock, Paul; Higdon, Roger; Montague, Elizabeth; Broomall, William; Kolker, Natali; Kolker, Eugene

    2014-09-10

    Functional annotation of newly sequenced genomes is one of the major challenges in modern biology. With modern sequencing technologies, the protein sequence universe is rapidly expanding. Newly sequenced bacterial genomes alone contain over 7.5 million proteins. The rate of data generation has far surpassed that of protein annotation. The volume of protein data makes manual curation infeasible, whereas a high compute cost limits the utility of existing automated approaches. In this work, we present an improved and optmized automated workflow to enable large-scale protein annotation. The workflow uses high performance computing architectures and a low complexity classification algorithm to assign proteins into existing clusters of orthologous groups of proteins. On the basis of the Position-Specific Iterative Basic Local Alignment Search Tool the algorithm ensures at least 80% specificity and sensitivity of the resulting classifications. The workflow utilizes highly scalable parallel applications for classification and sequence alignment. Using Extreme Science and Engineering Discovery Environment supercomputers, the workflow processed 1,200,000 newly sequenced bacterial proteins. With the rapid expansion of the protein sequence universe, the proposed workflow will enable scientists to annotate big genome data.

  8. Optimizing high performance computing workflow for protein functional annotation

    PubMed Central

    Stanberry, Larissa; Rekepalli, Bhanu; Liu, Yuan; Giblock, Paul; Higdon, Roger; Montague, Elizabeth; Broomall, William; Kolker, Natali; Kolker, Eugene

    2014-01-01

    Functional annotation of newly sequenced genomes is one of the major challenges in modern biology. With modern sequencing technologies, the protein sequence universe is rapidly expanding. Newly sequenced bacterial genomes alone contain over 7.5 million proteins. The rate of data generation has far surpassed that of protein annotation. The volume of protein data makes manual curation infeasible, whereas a high compute cost limits the utility of existing automated approaches. In this work, we present an improved and optmized automated workflow to enable large-scale protein annotation. The workflow uses high performance computing architectures and a low complexity classification algorithm to assign proteins into existing clusters of orthologous groups of proteins. On the basis of the Position-Specific Iterative Basic Local Alignment Search Tool the algorithm ensures at least 80% specificity and sensitivity of the resulting classifications. The workflow utilizes highly scalable parallel applications for classification and sequence alignment. Using Extreme Science and Engineering Discovery Environment supercomputers, the workflow processed 1,200,000 newly sequenced bacterial proteins. With the rapid expansion of the protein sequence universe, the proposed workflow will enable scientists to annotate big genome data. PMID:25313296

  9. Efficient Feature Selection and Classification of Protein Sequence Data in Bioinformatics

    PubMed Central

    Faye, Ibrahima; Samir, Brahim Belhaouari; Md Said, Abas

    2014-01-01

    Bioinformatics has been an emerging area of research for the last three decades. The ultimate aims of bioinformatics were to store and manage the biological data, and develop and analyze computational tools to enhance their understanding. The size of data accumulated under various sequencing projects is increasing exponentially, which presents difficulties for the experimental methods. To reduce the gap between newly sequenced protein and proteins with known functions, many computational techniques involving classification and clustering algorithms were proposed in the past. The classification of protein sequences into existing superfamilies is helpful in predicting the structure and function of large amount of newly discovered proteins. The existing classification results are unsatisfactory due to a huge size of features obtained through various feature encoding methods. In this work, a statistical metric-based feature selection technique has been proposed in order to reduce the size of the extracted feature vector. The proposed method of protein classification shows significant improvement in terms of performance measure metrics: accuracy, sensitivity, specificity, recall, F-measure, and so forth. PMID:25045727

  10. Graph pyramids for protein function prediction

    PubMed Central

    2015-01-01

    Background Uncovering the hidden organizational characteristics and regularities among biological sequences is the key issue for detailed understanding of an underlying biological phenomenon. Thus pattern recognition from nucleic acid sequences is an important affair for protein function prediction. As proteins from the same family exhibit similar characteristics, homology based approaches predict protein functions via protein classification. But conventional classification approaches mostly rely on the global features by considering only strong protein similarity matches. This leads to significant loss of prediction accuracy. Methods Here we construct the Protein-Protein Similarity (PPS) network, which captures the subtle properties of protein families. The proposed method considers the local as well as the global features, by examining the interactions among 'weakly interacting proteins' in the PPS network and by using hierarchical graph analysis via the graph pyramid. Different underlying properties of the protein families are uncovered by operating the proposed graph based features at various pyramid levels. Results Experimental results on benchmark data sets show that the proposed hierarchical voting algorithm using graph pyramid helps to improve computational efficiency as well the protein classification accuracy. Quantitatively, among 14,086 test sequences, on an average the proposed method misclassified only 21.1 sequences whereas baseline BLAST score based global feature matching method misclassified 362.9 sequences. With each correctly classified test sequence, the fast incremental learning ability of the proposed method further enhances the training model. Thus it has achieved more than 96% protein classification accuracy using only 20% per class training data. PMID:26044522

  11. Graph pyramids for protein function prediction.

    PubMed

    Sandhan, Tushar; Yoo, Youngjun; Choi, Jin; Kim, Sun

    2015-01-01

    Uncovering the hidden organizational characteristics and regularities among biological sequences is the key issue for detailed understanding of an underlying biological phenomenon. Thus pattern recognition from nucleic acid sequences is an important affair for protein function prediction. As proteins from the same family exhibit similar characteristics, homology based approaches predict protein functions via protein classification. But conventional classification approaches mostly rely on the global features by considering only strong protein similarity matches. This leads to significant loss of prediction accuracy. Here we construct the Protein-Protein Similarity (PPS) network, which captures the subtle properties of protein families. The proposed method considers the local as well as the global features, by examining the interactions among 'weakly interacting proteins' in the PPS network and by using hierarchical graph analysis via the graph pyramid. Different underlying properties of the protein families are uncovered by operating the proposed graph based features at various pyramid levels. Experimental results on benchmark data sets show that the proposed hierarchical voting algorithm using graph pyramid helps to improve computational efficiency as well the protein classification accuracy. Quantitatively, among 14,086 test sequences, on an average the proposed method misclassified only 21.1 sequences whereas baseline BLAST score based global feature matching method misclassified 362.9 sequences. With each correctly classified test sequence, the fast incremental learning ability of the proposed method further enhances the training model. Thus it has achieved more than 96% protein classification accuracy using only 20% per class training data.

  12. Structural classification of proteins using texture descriptors extracted from the cellular automata image.

    PubMed

    Kavianpour, Hamidreza; Vasighi, Mahdi

    2017-02-01

    Nowadays, having knowledge about cellular attributes of proteins has an important role in pharmacy, medical science and molecular biology. These attributes are closely correlated with the function and three-dimensional structure of proteins. Knowledge of protein structural class is used by various methods for better understanding the protein functionality and folding patterns. Computational methods and intelligence systems can have an important role in performing structural classification of proteins. Most of protein sequences are saved in databanks as characters and strings and a numerical representation is essential for applying machine learning methods. In this work, a binary representation of protein sequences is introduced based on reduced amino acids alphabets according to surrounding hydrophobicity index. Many important features which are hidden in these long binary sequences can be clearly displayed through their cellular automata images. The extracted features from these images are used to build a classification model by support vector machine. Comparing to previous studies on the several benchmark datasets, the promising classification rates obtained by tenfold cross-validation imply that the current approach can help in revealing some inherent features deeply hidden in protein sequences and improve the quality of predicting protein structural class.

  13. ECOD: An Evolutionary Classification of Protein Domains

    PubMed Central

    Kinch, Lisa N.; Pei, Jimin; Shi, Shuoyong; Kim, Bong-Hyun; Grishin, Nick V.

    2014-01-01

    Understanding the evolution of a protein, including both close and distant relationships, often reveals insight into its structure and function. Fast and easy access to such up-to-date information facilitates research. We have developed a hierarchical evolutionary classification of all proteins with experimentally determined spatial structures, and presented it as an interactive and updatable online database. ECOD (Evolutionary Classification of protein Domains) is distinct from other structural classifications in that it groups domains primarily by evolutionary relationships (homology), rather than topology (or “fold”). This distinction highlights cases of homology between domains of differing topology to aid in understanding of protein structure evolution. ECOD uniquely emphasizes distantly related homologs that are difficult to detect, and thus catalogs the largest number of evolutionary links among structural domain classifications. Placing distant homologs together underscores the ancestral similarities of these proteins and draws attention to the most important regions of sequence and structure, as well as conserved functional sites. ECOD also recognizes closer sequence-based relationships between protein domains. Currently, approximately 100,000 protein structures are classified in ECOD into 9,000 sequence families clustered into close to 2,000 evolutionary groups. The classification is assisted by an automated pipeline that quickly and consistently classifies weekly releases of PDB structures and allows for continual updates. This synchronization with PDB uniquely distinguishes ECOD among all protein classifications. Finally, we present several case studies of homologous proteins not recorded in other classifications, illustrating the potential of how ECOD can be used to further biological and evolutionary studies. PMID:25474468

  14. ECOD: an evolutionary classification of protein domains.

    PubMed

    Cheng, Hua; Schaeffer, R Dustin; Liao, Yuxing; Kinch, Lisa N; Pei, Jimin; Shi, Shuoyong; Kim, Bong-Hyun; Grishin, Nick V

    2014-12-01

    Understanding the evolution of a protein, including both close and distant relationships, often reveals insight into its structure and function. Fast and easy access to such up-to-date information facilitates research. We have developed a hierarchical evolutionary classification of all proteins with experimentally determined spatial structures, and presented it as an interactive and updatable online database. ECOD (Evolutionary Classification of protein Domains) is distinct from other structural classifications in that it groups domains primarily by evolutionary relationships (homology), rather than topology (or "fold"). This distinction highlights cases of homology between domains of differing topology to aid in understanding of protein structure evolution. ECOD uniquely emphasizes distantly related homologs that are difficult to detect, and thus catalogs the largest number of evolutionary links among structural domain classifications. Placing distant homologs together underscores the ancestral similarities of these proteins and draws attention to the most important regions of sequence and structure, as well as conserved functional sites. ECOD also recognizes closer sequence-based relationships between protein domains. Currently, approximately 100,000 protein structures are classified in ECOD into 9,000 sequence families clustered into close to 2,000 evolutionary groups. The classification is assisted by an automated pipeline that quickly and consistently classifies weekly releases of PDB structures and allows for continual updates. This synchronization with PDB uniquely distinguishes ECOD among all protein classifications. Finally, we present several case studies of homologous proteins not recorded in other classifications, illustrating the potential of how ECOD can be used to further biological and evolutionary studies.

  15. Automatic classification of protein structures using physicochemical parameters.

    PubMed

    Mohan, Abhilash; Rao, M Divya; Sunderrajan, Shruthi; Pennathur, Gautam

    2014-09-01

    Protein classification is the first step to functional annotation; SCOP and Pfam databases are currently the most relevant protein classification schemes. However, the disproportion in the number of three dimensional (3D) protein structures generated versus their classification into relevant superfamilies/families emphasizes the need for automated classification schemes. Predicting function of novel proteins based on sequence information alone has proven to be a major challenge. The present study focuses on the use of physicochemical parameters in conjunction with machine learning algorithms (Naive Bayes, Decision Trees, Random Forest and Support Vector Machines) to classify proteins into their respective SCOP superfamily/Pfam family, using sequence derived information. Spectrophores™, a 1D descriptor of the 3D molecular field surrounding a structure was used as a benchmark to compare the performance of the physicochemical parameters. The machine learning algorithms were modified to select features based on information gain for each SCOP superfamily/Pfam family. The effect of combining physicochemical parameters and spectrophores on classification accuracy (CA) was studied. Machine learning algorithms trained with the physicochemical parameters consistently classified SCOP superfamilies and Pfam families with a classification accuracy above 90%, while spectrophores performed with a CA of around 85%. Feature selection improved classification accuracy for both physicochemical parameters and spectrophores based machine learning algorithms. Combining both attributes resulted in a marginal loss of performance. Physicochemical parameters were able to classify proteins from both schemes with classification accuracy ranging from 90-96%. These results suggest the usefulness of this method in classifying proteins from amino acid sequences.

  16. Mining sequential patterns for protein fold recognition.

    PubMed

    Exarchos, Themis P; Papaloukas, Costas; Lampros, Christos; Fotiadis, Dimitrios I

    2008-02-01

    Protein data contain discriminative patterns that can be used in many beneficial applications if they are defined correctly. In this work sequential pattern mining (SPM) is utilized for sequence-based fold recognition. Protein classification in terms of fold recognition plays an important role in computational protein analysis, since it can contribute to the determination of the function of a protein whose structure is unknown. Specifically, one of the most efficient SPM algorithms, cSPADE, is employed for the analysis of protein sequence. A classifier uses the extracted sequential patterns to classify proteins in the appropriate fold category. For training and evaluating the proposed method we used the protein sequences from the Protein Data Bank and the annotation of the SCOP database. The method exhibited an overall accuracy of 25% in a classification problem with 36 candidate categories. The classification performance reaches up to 56% when the five most probable protein folds are considered.

  17. Mining for class-specific motifs in protein sequence classification

    PubMed Central

    2013-01-01

    Background In protein sequence classification, identification of the sequence motifs or n-grams that can precisely discriminate between classes is a more interesting scientific question than the classification itself. A number of classification methods aim at accurate classification but fail to explain which sequence features indeed contribute to the accuracy. We hypothesize that sequences in lower denominations (n-grams) can be used to explore the sequence landscape and to identify class-specific motifs that discriminate between classes during classification. Discriminative n-grams are short peptide sequences that are highly frequent in one class but are either minimally present or absent in other classes. In this study, we present a new substitution-based scoring function for identifying discriminative n-grams that are highly specific to a class. Results We present a scoring function based on discriminative n-grams that can effectively discriminate between classes. The scoring function, initially, harvests the entire set of 4- to 8-grams from the protein sequences of different classes in the dataset. Similar n-grams of the same size are combined to form new n-grams, where the similarity is defined by positive amino acid substitution scores in the BLOSUM62 matrix. Substitution has resulted in a large increase in the number of discriminatory n-grams harvested. Due to the unbalanced nature of the dataset, the frequencies of the n-grams are normalized using a dampening factor, which gives more weightage to the n-grams that appear in fewer classes and vice-versa. After the n-grams are normalized, the scoring function identifies discriminative 4- to 8-grams for each class that are frequent enough to be above a selection threshold. By mapping these discriminative n-grams back to the protein sequences, we obtained contiguous n-grams that represent short class-specific motifs in protein sequences. Our method fared well compared to an existing motif finding method known as Wordspy. We have validated our enriched set of class-specific motifs against the functionally important motifs obtained from the NLSdb, Prosite and ELM databases. We demonstrate that this method is very generic; thus can be widely applied to detect class-specific motifs in many protein sequence classification tasks. Conclusion The proposed scoring function and methodology is able to identify class-specific motifs using discriminative n-grams derived from the protein sequences. The implementation of amino acid substitution scores for similarity detection, and the dampening factor to normalize the unbalanced datasets have significant effect on the performance of the scoring function. Our multipronged validation tests demonstrate that this method can detect class-specific motifs from a wide variety of protein sequence classes with a potential application to detecting proteome-specific motifs of different organisms. PMID:23496846

  18. The Classification of Protein Domains.

    PubMed

    Dawson, Natalie; Sillitoe, Ian; Marsden, Russell L; Orengo, Christine A

    2017-01-01

    The significant expansion in protein sequence and structure data that we are now witnessing brings with it a pressing need to bring order to the protein world. Such order enables us to gain insights into the evolution of proteins, their function and the extent to which the functional repertoire can vary across the three kingdoms of life. This has lead to the creation of a wide range of protein family classifications that aim to group proteins based upon their evolutionary relationships.In this chapter we discuss the approaches and methods that are frequently used in the classification of proteins, with a specific emphasis on the classification of protein domains. The construction of both domain sequence and domain structure databases is considered and we show how the use of domain family annotations to assign structural and functional information is enhancing our understanding of genomes.

  19. PrionHome: a database of prions and other sequences relevant to prion phenomena.

    PubMed

    Harbi, Djamel; Parthiban, Marimuthu; Gendoo, Deena M A; Ehsani, Sepehr; Kumar, Manish; Schmitt-Ulms, Gerold; Sowdhamini, Ramanathan; Harrison, Paul M

    2012-01-01

    Prions are units of propagation of an altered state of a protein or proteins; prions can propagate from organism to organism, through cooption of other protein copies. Prions contain no necessary nucleic acids, and are important both as both pathogenic agents, and as a potential force in epigenetic phenomena. The original prions were derived from a misfolded form of the mammalian Prion Protein PrP. Infection by these prions causes neurodegenerative diseases. Other prions cause non-Mendelian inheritance in budding yeast, and sometimes act as diseases of yeast. We report the bioinformatic construction of the PrionHome, a database of >2000 prion-related sequences. The data was collated from various public and private resources and filtered for redundancy. The data was then processed according to a transparent classification system of prionogenic sequences (i.e., sequences that can make prions), prionoids (i.e., proteins that propagate like prions between individual cells), and other prion-related phenomena. There are eight PrionHome classifications for sequences. The first four classifications are derived from experimental observations: prionogenic sequences, prionoids, other prion-related phenomena, and prion interactors. The second four classifications are derived from sequence analysis: orthologs, paralogs, pseudogenes, and candidate-prionogenic sequences. Database entries list: supporting information for PrionHome classifications, prion-determinant areas (where relevant), and disordered and compositionally-biased regions. Also included are literature references for the PrionHome classifications, transcripts and genomic coordinates, and structural data (including comparative models made for the PrionHome from manually curated alignments). We provide database usage examples for both vertebrate and fungal prion contexts. Using the database data, we have performed a detailed analysis of the compositional biases in known budding-yeast prionogenic sequences, showing that the only abundant bias pattern is for asparagine bias with subsidiary serine bias. We anticipate that this database will be a useful experimental aid and reference resource. It is freely available at: http://libaio.biol.mcgill.ca/prion.

  20. PrionHome: A Database of Prions and Other Sequences Relevant to Prion Phenomena

    PubMed Central

    Harbi, Djamel; Parthiban, Marimuthu; Gendoo, Deena M. A.; Ehsani, Sepehr; Kumar, Manish; Schmitt-Ulms, Gerold; Sowdhamini, Ramanathan; Harrison, Paul M.

    2012-01-01

    Prions are units of propagation of an altered state of a protein or proteins; prions can propagate from organism to organism, through cooption of other protein copies. Prions contain no necessary nucleic acids, and are important both as both pathogenic agents, and as a potential force in epigenetic phenomena. The original prions were derived from a misfolded form of the mammalian Prion Protein PrP. Infection by these prions causes neurodegenerative diseases. Other prions cause non-Mendelian inheritance in budding yeast, and sometimes act as diseases of yeast. We report the bioinformatic construction of the PrionHome, a database of >2000 prion-related sequences. The data was collated from various public and private resources and filtered for redundancy. The data was then processed according to a transparent classification system of prionogenic sequences (i.e., sequences that can make prions), prionoids (i.e., proteins that propagate like prions between individual cells), and other prion-related phenomena. There are eight PrionHome classifications for sequences. The first four classifications are derived from experimental observations: prionogenic sequences, prionoids, other prion-related phenomena, and prion interactors. The second four classifications are derived from sequence analysis: orthologs, paralogs, pseudogenes, and candidate-prionogenic sequences. Database entries list: supporting information for PrionHome classifications, prion-determinant areas (where relevant), and disordered and compositionally-biased regions. Also included are literature references for the PrionHome classifications, transcripts and genomic coordinates, and structural data (including comparative models made for the PrionHome from manually curated alignments). We provide database usage examples for both vertebrate and fungal prion contexts. Using the database data, we have performed a detailed analysis of the compositional biases in known budding-yeast prionogenic sequences, showing that the only abundant bias pattern is for asparagine bias with subsidiary serine bias. We anticipate that this database will be a useful experimental aid and reference resource. It is freely available at: http://libaio.biol.mcgill.ca/prion. PMID:22363733

  1. PANTHER: a browsable database of gene products organized by biological function, using curated protein family and subfamily classification.

    PubMed

    Thomas, Paul D; Kejariwal, Anish; Campbell, Michael J; Mi, Huaiyu; Diemer, Karen; Guo, Nan; Ladunga, Istvan; Ulitsky-Lazareva, Betty; Muruganujan, Anushya; Rabkin, Steven; Vandergriff, Jody A; Doremieux, Olivier

    2003-01-01

    The PANTHER database was designed for high-throughput analysis of protein sequences. One of the key features is a simplified ontology of protein function, which allows browsing of the database by biological functions. Biologist curators have associated the ontology terms with groups of protein sequences rather than individual sequences. Statistical models (Hidden Markov Models, or HMMs) are built from each of these groups. The advantage of this approach is that new sequences can be automatically classified as they become available. To ensure accurate functional classification, HMMs are constructed not only for families, but also for functionally distinct subfamilies. Multiple sequence alignments and phylogenetic trees, including curator-assigned information, are available for each family. The current version of the PANTHER database includes training sequences from all organisms in the GenBank non-redundant protein database, and the HMMs have been used to classify gene products across the entire genomes of human, and Drosophila melanogaster. The ontology terms and protein families and subfamilies, as well as Drosophila gene c;assifications, can be browsed and searched for free. Due to outstanding contractual obligations, access to human gene classifications and to protein family trees and multiple sequence alignments will temporarily require a nominal registration fee. PANTHER is publicly available on the web at http://panther.celera.com.

  2. HAMAP in 2013, new developments in the protein family classification and annotation system

    PubMed Central

    Pedruzzi, Ivo; Rivoire, Catherine; Auchincloss, Andrea H.; Coudert, Elisabeth; Keller, Guillaume; de Castro, Edouard; Baratin, Delphine; Cuche, Béatrice A.; Bougueleret, Lydie; Poux, Sylvain; Redaschi, Nicole; Xenarios, Ioannis; Bridge, Alan

    2013-01-01

    HAMAP (High-quality Automated and Manual Annotation of Proteins—available at http://hamap.expasy.org/) is a system for the classification and annotation of protein sequences. It consists of a collection of manually curated family profiles for protein classification, and associated annotation rules that specify annotations that apply to family members. HAMAP was originally developed to support the manual curation of UniProtKB/Swiss-Prot records describing microbial proteins. Here we describe new developments in HAMAP, including the extension of HAMAP to eukaryotic proteins, the use of HAMAP in the automated annotation of UniProtKB/TrEMBL, providing high-quality annotation for millions of protein sequences, and the future integration of HAMAP into a unified system for UniProtKB annotation, UniRule. HAMAP is continuously updated by expert curators with new family profiles and annotation rules as new protein families are characterized. The collection of HAMAP family classification profiles and annotation rules can be browsed and viewed on the HAMAP website, which also provides an interface to scan user sequences against HAMAP profiles. PMID:23193261

  3. An information-based network approach for protein classification

    PubMed Central

    Wan, Xiaogeng; Zhao, Xin; Yau, Stephen S. T.

    2017-01-01

    Protein classification is one of the critical problems in bioinformatics. Early studies used geometric distances and polygenetic-tree to classify proteins. These methods use binary trees to present protein classification. In this paper, we propose a new protein classification method, whereby theories of information and networks are used to classify the multivariate relationships of proteins. In this study, protein universe is modeled as an undirected network, where proteins are classified according to their connections. Our method is unsupervised, multivariate, and alignment-free. It can be applied to the classification of both protein sequences and structures. Nine examples are used to demonstrate the efficiency of our new method. PMID:28350835

  4. Classification of G-protein coupled receptors based on a rich generation of convolutional neural network, N-gram transformation and multiple sequence alignments.

    PubMed

    Li, Man; Ling, Cheng; Xu, Qi; Gao, Jingyang

    2018-02-01

    Sequence classification is crucial in predicting the function of newly discovered sequences. In recent years, the prediction of the incremental large-scale and diversity of sequences has heavily relied on the involvement of machine-learning algorithms. To improve prediction accuracy, these algorithms must confront the key challenge of extracting valuable features. In this work, we propose a feature-enhanced protein classification approach, considering the rich generation of multiple sequence alignment algorithms, N-gram probabilistic language model and the deep learning technique. The essence behind the proposed method is that if each group of sequences can be represented by one feature sequence, composed of homologous sites, there should be less loss when the sequence is rebuilt, when a more relevant sequence is added to the group. On the basis of this consideration, the prediction becomes whether a query sequence belonging to a group of sequences can be transferred to calculate the probability that the new feature sequence evolves from the original one. The proposed work focuses on the hierarchical classification of G-protein Coupled Receptors (GPCRs), which begins by extracting the feature sequences from the multiple sequence alignment results of the GPCRs sub-subfamilies. The N-gram model is then applied to construct the input vectors. Finally, these vectors are imported into a convolutional neural network to make a prediction. The experimental results elucidate that the proposed method provides significant performance improvements. The classification error rate of the proposed method is reduced by at least 4.67% (family level I) and 5.75% (family Level II), in comparison with the current state-of-the-art methods. The implementation program of the proposed work is freely available at: https://github.com/alanFchina/CNN .

  5. Predicting Flavonoid UGT Regioselectivity

    PubMed Central

    Jackson, Rhydon; Knisley, Debra; McIntosh, Cecilia; Pfeiffer, Phillip

    2011-01-01

    Machine learning was applied to a challenging and biologically significant protein classification problem: the prediction of avonoid UGT acceptor regioselectivity from primary sequence. Novel indices characterizing graphical models of residues were proposed and found to be widely distributed among existing amino acid indices and to cluster residues appropriately. UGT subsequences biochemically linked to regioselectivity were modeled as sets of index sequences. Several learning techniques incorporating these UGT models were compared with classifications based on standard sequence alignment scores. These techniques included an application of time series distance functions to protein classification. Time series distances defined on the index sequences were used in nearest neighbor and support vector machine classifiers. Additionally, Bayesian neural network classifiers were applied to the index sequences. The experiments identified improvements over the nearest neighbor and support vector machine classifications relying on standard alignment similarity scores, as well as strong correlations between specific subsequences and regioselectivities. PMID:21747849

  6. DOE Office of Scientific and Technical Information (OSTI.GOV)

    Cang, Zixuan; Mu, Lin; Wu, Kedi

    Here, protein function and dynamics are closely related to its sequence and structure. However, prediction of protein function and dynamics from its sequence and structure is still a fundamental challenge in molecular biology. Protein classification, which is typically done through measuring the similarity between proteins based on protein sequence or physical information, serves as a crucial step toward the understanding of protein function and dynamics.

  7. Protein Information Resource: a community resource for expert annotation of protein data

    PubMed Central

    Barker, Winona C.; Garavelli, John S.; Hou, Zhenglin; Huang, Hongzhan; Ledley, Robert S.; McGarvey, Peter B.; Mewes, Hans-Werner; Orcutt, Bruce C.; Pfeiffer, Friedhelm; Tsugita, Akira; Vinayaka, C. R.; Xiao, Chunlin; Yeh, Lai-Su L.; Wu, Cathy

    2001-01-01

    The Protein Information Resource, in collaboration with the Munich Information Center for Protein Sequences (MIPS) and the Japan International Protein Information Database (JIPID), produces the most comprehensive and expertly annotated protein sequence database in the public domain, the PIR-International Protein Sequence Database. To provide timely and high quality annotation and promote database interoperability, the PIR-International employs rule-based and classification-driven procedures based on controlled vocabulary and standard nomenclature and includes status tags to distinguish experimentally determined from predicted protein features. The database contains about 200 000 non-redundant protein sequences, which are classified into families and superfamilies and their domains and motifs identified. Entries are extensively cross-referenced to other sequence, classification, genome, structure and activity databases. The PIR web site features search engines that use sequence similarity and database annotation to facilitate the analysis and functional identification of proteins. The PIR-Inter­national databases and search tools are accessible on the PIR web site at http://pir.georgetown.edu/ and at the MIPS web site at http://www.mips.biochem.mpg.de. The PIR-International Protein Sequence Database and other files are also available by FTP. PMID:11125041

  8. TIM Barrel Protein Structure Classification Using Alignment Approach and Best Hit Strategy

    NASA Astrophysics Data System (ADS)

    Chu, Jia-Han; Lin, Chun Yuan; Chang, Cheng-Wen; Lee, Chihan; Yang, Yuh-Shyong; Tang, Chuan Yi

    2007-11-01

    The classification of protein structures is essential for their function determination in bioinformatics. It has been estimated that around 10% of all known enzymes have TIM barrel domains from the Structural Classification of Proteins (SCOP) database. With its high sequence variation and diverse functionalities, TIM barrel protein becomes to be an attractive target for protein engineering and for the evolution study. Hence, in this paper, an alignment approach with the best hit strategy is proposed to classify the TIM barrel protein structure in terms of superfamily and family levels in the SCOP. This work is also used to do the classification for class level in the Enzyme nomenclature (ENZYME) database. Two testing data sets, TIM40D and TIM95D, both are used to evaluate this approach. The resulting classification has an overall prediction accuracy rate of 90.3% for the superfamily level in the SCOP, 89.5% for the family level in the SCOP and 70.1% for the class level in the ENZYME. These results demonstrate that the alignment approach with the best hit strategy is a simple and viable method for the TIM barrel protein structure classification, even only has the amino acid sequences information.

  9. A fully automatic evolutionary classification of protein folds: Dali Domain Dictionary version 3

    PubMed Central

    Dietmann, Sabine; Park, Jong; Notredame, Cedric; Heger, Andreas; Lappe, Michael; Holm, Liisa

    2001-01-01

    The Dali Domain Dictionary (http://www.ebi.ac.uk/dali/domain) is a numerical taxonomy of all known structures in the Protein Data Bank (PDB). The taxonomy is derived fully automatically from measurements of structural, functional and sequence similarities. Here, we report the extension of the classification to match the traditional four hierarchical levels corresponding to: (i) supersecondary structural motifs (attractors in fold space), (ii) the topology of globular domains (fold types), (iii) remote homologues (functional families) and (iv) homologues with sequence identity above 25% (sequence families). The computational definitions of attractors and functional families are new. In September 2000, the Dali classification contained 10 531 PDB entries comprising 17 101 chains, which were partitioned into five attractor regions, 1375 fold types, 2582 functional families and 3724 domain sequence families. Sequence families were further associated with 99 582 unique homologous sequences in the HSSP database, which increases the number of effectively known structures several-fold. The resulting database contains the description of protein domain architecture, the definition of structural neighbours around each known structure, the definition of structurally conserved cores and a comprehensive library of explicit multiple alignments of distantly related protein families. PMID:11125048

  10. SeqRate: sequence-based protein folding type classification and rates prediction

    PubMed Central

    2010-01-01

    Background Protein folding rate is an important property of a protein. Predicting protein folding rate is useful for understanding protein folding process and guiding protein design. Most previous methods of predicting protein folding rate require the tertiary structure of a protein as an input. And most methods do not distinguish the different kinetic nature (two-state folding or multi-state folding) of the proteins. Here we developed a method, SeqRate, to predict both protein folding kinetic type (two-state versus multi-state) and real-value folding rate using sequence length, amino acid composition, contact order, contact number, and secondary structure information predicted from only protein sequence with support vector machines. Results We systematically studied the contributions of individual features to folding rate prediction. On a standard benchmark dataset, the accuracy of folding kinetic type classification is 80%. The Pearson correlation coefficient and the mean absolute difference between predicted and experimental folding rates (sec-1) in the base-10 logarithmic scale are 0.81 and 0.79 for two-state protein folders, and 0.80 and 0.68 for three-state protein folders. SeqRate is the first sequence-based method for protein folding type classification and its accuracy of fold rate prediction is improved over previous sequence-based methods. Its performance can be further enhanced with additional information, such as structure-based geometric contacts, as inputs. Conclusions Both the web server and software of predicting folding rate are publicly available at http://casp.rnet.missouri.edu/fold_rate/index.html. PMID:20438647

  11. Protein classification using modified n-grams and skip-grams.

    PubMed

    Islam, S M Ashiqul; Heil, Benjamin J; Kearney, Christopher Michel; Baker, Erich J

    2018-05-01

    Classification by supervised machine learning greatly facilitates the annotation of protein characteristics from their primary sequence. However, the feature generation step in this process requires detailed knowledge of attributes used to classify the proteins. Lack of this knowledge risks the selection of irrelevant features, resulting in a faulty model. In this study, we introduce a supervised protein classification method with a novel means of automating the work-intensive feature generation step via a Natural Language Processing (NLP)-dependent model, using a modified combination of n-grams and skip-grams (m-NGSG). A meta-comparison of cross-validation accuracy with twelve training datasets from nine different published studies demonstrates a consistent increase in accuracy of m-NGSG when compared to contemporary classification and feature generation models. We expect this model to accelerate the classification of proteins from primary sequence data and increase the accessibility of protein characteristic prediction to a broader range of scientists. m-NGSG is freely available at Bitbucket: https://bitbucket.org/sm_islam/mngsg/src. A web server is available at watson.ecs.baylor.edu/ngsg. erich_baker@baylor.edu. Supplementary data are available at Bioinformatics online.

  12. Signal peptide discrimination and cleavage site identification using SVM and NN.

    PubMed

    Kazemian, H B; Yusuf, S A; White, K

    2014-02-01

    About 15% of all proteins in a genome contain a signal peptide (SP) sequence, at the N-terminus, that targets the protein to intracellular secretory pathways. Once the protein is targeted correctly in the cell, the SP is cleaved, releasing the mature protein. Accurate prediction of the presence of these short amino-acid SP chains is crucial for modelling the topology of membrane proteins, since SP sequences can be confused with transmembrane domains due to similar composition of hydrophobic amino acids. This paper presents a cascaded Support Vector Machine (SVM)-Neural Network (NN) classification methodology for SP discrimination and cleavage site identification. The proposed method utilises a dual phase classification approach using SVM as a primary classifier to discriminate SP sequences from Non-SP. The methodology further employs NNs to predict the most suitable cleavage site candidates. In phase one, a SVM classification utilises hydrophobic propensities as a primary feature vector extraction using symmetric sliding window amino-acid sequence analysis for discrimination of SP and Non-SP. In phase two, a NN classification uses asymmetric sliding window sequence analysis for prediction of cleavage site identification. The proposed SVM-NN method was tested using Uni-Prot non-redundant datasets of eukaryotic and prokaryotic proteins with SP and Non-SP N-termini. Computer simulation results demonstrate an overall accuracy of 0.90 for SP and Non-SP discrimination based on Matthews Correlation Coefficient (MCC) tests using SVM. For SP cleavage site prediction, the overall accuracy is 91.5% based on cross-validation tests using the novel SVM-NN model. © 2013 Published by Elsevier Ltd.

  13. Benchmarking protein classification algorithms via supervised cross-validation.

    PubMed

    Kertész-Farkas, Attila; Dhir, Somdutta; Sonego, Paolo; Pacurar, Mircea; Netoteia, Sergiu; Nijveen, Harm; Kuzniar, Arnold; Leunissen, Jack A M; Kocsor, András; Pongor, Sándor

    2008-04-24

    Development and testing of protein classification algorithms are hampered by the fact that the protein universe is characterized by groups vastly different in the number of members, in average protein size, similarity within group, etc. Datasets based on traditional cross-validation (k-fold, leave-one-out, etc.) may not give reliable estimates on how an algorithm will generalize to novel, distantly related subtypes of the known protein classes. Supervised cross-validation, i.e., selection of test and train sets according to the known subtypes within a database has been successfully used earlier in conjunction with the SCOP database. Our goal was to extend this principle to other databases and to design standardized benchmark datasets for protein classification. Hierarchical classification trees of protein categories provide a simple and general framework for designing supervised cross-validation strategies for protein classification. Benchmark datasets can be designed at various levels of the concept hierarchy using a simple graph-theoretic distance. A combination of supervised and random sampling was selected to construct reduced size model datasets, suitable for algorithm comparison. Over 3000 new classification tasks were added to our recently established protein classification benchmark collection that currently includes protein sequence (including protein domains and entire proteins), protein structure and reading frame DNA sequence data. We carried out an extensive evaluation based on various machine-learning algorithms such as nearest neighbor, support vector machines, artificial neural networks, random forests and logistic regression, used in conjunction with comparison algorithms, BLAST, Smith-Waterman, Needleman-Wunsch, as well as 3D comparison methods DALI and PRIDE. The resulting datasets provide lower, and in our opinion more realistic estimates of the classifier performance than do random cross-validation schemes. A combination of supervised and random sampling was used to construct model datasets, suitable for algorithm comparison.

  14. Different evolutionary patterns of SNPs between domains and unassigned regions in human protein-coding sequences.

    PubMed

    Pang, Erli; Wu, Xiaomei; Lin, Kui

    2016-06-01

    Protein evolution plays an important role in the evolution of each genome. Because of their functional nature, in general, most of their parts or sites are differently constrained selectively, particularly by purifying selection. Most previous studies on protein evolution considered individual proteins in their entirety or compared protein-coding sequences with non-coding sequences. Less attention has been paid to the evolution of different parts within each protein of a given genome. To this end, based on PfamA annotation of all human proteins, each protein sequence can be split into two parts: domains or unassigned regions. Using this rationale, single nucleotide polymorphisms (SNPs) in protein-coding sequences from the 1000 Genomes Project were mapped according to two classifications: SNPs occurring within protein domains and those within unassigned regions. With these classifications, we found: the density of synonymous SNPs within domains is significantly greater than that of synonymous SNPs within unassigned regions; however, the density of non-synonymous SNPs shows the opposite pattern. We also found there are signatures of purifying selection on both the domain and unassigned regions. Furthermore, the selective strength on domains is significantly greater than that on unassigned regions. In addition, among all of the human protein sequences, there are 117 PfamA domains in which no SNPs are found. Our results highlight an important aspect of protein domains and may contribute to our understanding of protein evolution.

  15. The COG database: a tool for genome-scale analysis of protein functions and evolution

    PubMed Central

    Tatusov, Roman L.; Galperin, Michael Y.; Natale, Darren A.; Koonin, Eugene V.

    2000-01-01

    Rational classification of proteins encoded in sequenced genomes is critical for making the genome sequences maximally useful for functional and evolutionary studies. The database of Clusters of Orthologous Groups of proteins (COGs) is an attempt on a phylogenetic classification of the proteins encoded in 21 complete genomes of bacteria, archaea and eukaryotes (http://www.ncbi.nlm.nih.gov/COG ). The COGs were constructed by applying the criterion of consistency of genome-specific best hits to the results of an exhaustive comparison of all protein sequences from these genomes. The database comprises 2091 COGs that include 56–83% of the gene products from each of the complete bacterial and archaeal genomes and ~35% of those from the yeast Saccharomyces cerevisiae genome. The COG database is accompanied by the COGNITOR program that is used to fit new proteins into the COGs and can be applied to functional and phylogenetic annotation of newly sequenced genomes. PMID:10592175

  16. Improving protein complex classification accuracy using amino acid composition profile.

    PubMed

    Huang, Chien-Hung; Chou, Szu-Yu; Ng, Ka-Lok

    2013-09-01

    Protein complex prediction approaches are based on the assumptions that complexes have dense protein-protein interactions and high functional similarity between their subunits. We investigated those assumptions by studying the subunits' interaction topology, sequence similarity and molecular function for human and yeast protein complexes. Inclusion of amino acids' physicochemical properties can provide better understanding of protein complex properties. Principal component analysis is carried out to determine the major features. Adopting amino acid composition profile information with the SVM classifier serves as an effective post-processing step for complexes classification. Improvement is based on primary sequence information only, which is easy to obtain. Copyright © 2013 Elsevier Ltd. All rights reserved.

  17. IDM-PhyChm-Ens: intelligent decision-making ensemble methodology for classification of human breast cancer using physicochemical properties of amino acids.

    PubMed

    Ali, Safdar; Majid, Abdul; Khan, Asifullah

    2014-04-01

    Development of an accurate and reliable intelligent decision-making method for the construction of cancer diagnosis system is one of the fast growing research areas of health sciences. Such decision-making system can provide adequate information for cancer diagnosis and drug discovery. Descriptors derived from physicochemical properties of protein sequences are very useful for classifying cancerous proteins. Recently, several interesting research studies have been reported on breast cancer classification. To this end, we propose the exploitation of the physicochemical properties of amino acids in protein primary sequences such as hydrophobicity (Hd) and hydrophilicity (Hb) for breast cancer classification. Hd and Hb properties of amino acids, in recent literature, are reported to be quite effective in characterizing the constituent amino acids and are used to study protein foldings, interactions, structures, and sequence-order effects. Especially, using these physicochemical properties, we observed that proline, serine, tyrosine, cysteine, arginine, and asparagine amino acids offer high discrimination between cancerous and healthy proteins. In addition, unlike traditional ensemble classification approaches, the proposed 'IDM-PhyChm-Ens' method was developed by combining the decision spaces of a specific classifier trained on different feature spaces. The different feature spaces used were amino acid composition, split amino acid composition, and pseudo amino acid composition. Consequently, we have exploited different feature spaces using Hd and Hb properties of amino acids to develop an accurate method for classification of cancerous protein sequences. We developed ensemble classifiers using diverse learning algorithms such as random forest (RF), support vector machines (SVM), and K-nearest neighbor (KNN) trained on different feature spaces. We observed that ensemble-RF, in case of cancer classification, performed better than ensemble-SVM and ensemble-KNN. Our analysis demonstrates that ensemble-RF, ensemble-SVM and ensemble-KNN are more effective than their individual counterparts. The proposed 'IDM-PhyChm-Ens' method has shown improved performance compared to existing techniques.

  18. Classification of proteins with shared motifs and internal repeats in the ECOD database

    PubMed Central

    Kinch, Lisa N.; Liao, Yuxing

    2016-01-01

    Abstract Proteins and their domains evolve by a set of events commonly including the duplication and divergence of small motifs. The presence of short repetitive regions in domains has generally constituted a difficult case for structural domain classifications and their hierarchies. We developed the Evolutionary Classification Of protein Domains (ECOD) in part to implement a new schema for the classification of these types of proteins. Here we document the ways in which ECOD classifies proteins with small internal repeats, widespread functional motifs, and assemblies of small domain‐like fragments in its evolutionary schema. We illustrate the ways in which the structural genomics project impacted the classification and characterization of new structural domains and sequence families over the decade. PMID:26833690

  19. Protein classification using sequential pattern mining.

    PubMed

    Exarchos, Themis P; Papaloukas, Costas; Lampros, Christos; Fotiadis, Dimitrios I

    2006-01-01

    Protein classification in terms of fold recognition can be employed to determine the structural and functional properties of a newly discovered protein. In this work sequential pattern mining (SPM) is utilized for sequence-based fold recognition. One of the most efficient SPM algorithms, cSPADE, is employed for protein primary structure analysis. Then a classifier uses the extracted sequential patterns for classifying proteins of unknown structure in the appropriate fold category. The proposed methodology exhibited an overall accuracy of 36% in a multi-class problem of 17 candidate categories. The classification performance reaches up to 65% when the three most probable protein folds are considered.

  20. Assignment of protein sequences to existing domain and family classification systems: Pfam and the PDB.

    PubMed

    Xu, Qifang; Dunbrack, Roland L

    2012-11-01

    Automating the assignment of existing domain and protein family classifications to new sets of sequences is an important task. Current methods often miss assignments because remote relationships fail to achieve statistical significance. Some assignments are not as long as the actual domain definitions because local alignment methods often cut alignments short. Long insertions in query sequences often erroneously result in two copies of the domain assigned to the query. Divergent repeat sequences in proteins are often missed. We have developed a multilevel procedure to produce nearly complete assignments of protein families of an existing classification system to a large set of sequences. We apply this to the task of assigning Pfam domains to sequences and structures in the Protein Data Bank (PDB). We found that HHsearch alignments frequently scored more remotely related Pfams in Pfam clans higher than closely related Pfams, thus, leading to erroneous assignment at the Pfam family level. A greedy algorithm allowing for partial overlaps was, thus, applied first to sequence/HMM alignments, then HMM-HMM alignments and then structure alignments, taking care to join partial alignments split by large insertions into single-domain assignments. Additional assignment of repeat Pfams with weaker E-values was allowed after stronger assignments of the repeat HMM. Our database of assignments, presented in a database called PDBfam, contains Pfams for 99.4% of chains >50 residues. The Pfam assignment data in PDBfam are available at http://dunbrack2.fccc.edu/ProtCid/PDBfam, which can be searched by PDB codes and Pfam identifiers. They will be updated regularly.

  1. Integrating alignment-based and alignment-free sequence similarity measures for biological sequence classification.

    PubMed

    Borozan, Ivan; Watt, Stuart; Ferretti, Vincent

    2015-05-01

    Alignment-based sequence similarity searches, while accurate for some type of sequences, can produce incorrect results when used on more divergent but functionally related sequences that have undergone the sequence rearrangements observed in many bacterial and viral genomes. Here, we propose a classification model that exploits the complementary nature of alignment-based and alignment-free similarity measures with the aim to improve the accuracy with which DNA and protein sequences are characterized. Our model classifies sequences using a combined sequence similarity score calculated by adaptively weighting the contribution of different sequence similarity measures. Weights are determined independently for each sequence in the test set and reflect the discriminatory ability of individual similarity measures in the training set. Because the similarity between some sequences is determined more accurately with one type of measure rather than another, our classifier allows different sets of weights to be associated with different sequences. Using five different similarity measures, we show that our model significantly improves the classification accuracy over the current composition- and alignment-based models, when predicting the taxonomic lineage for both short viral sequence fragments and complete viral sequences. We also show that our model can be used effectively for the classification of reads from a real metagenome dataset as well as protein sequences. All the datasets and the code used in this study are freely available at https://collaborators.oicr.on.ca/vferretti/borozan_csss/csss.html. ivan.borozan@gmail.com Supplementary data are available at Bioinformatics online. © The Author 2015. Published by Oxford University Press.

  2. Integrating alignment-based and alignment-free sequence similarity measures for biological sequence classification

    PubMed Central

    Borozan, Ivan; Watt, Stuart; Ferretti, Vincent

    2015-01-01

    Motivation: Alignment-based sequence similarity searches, while accurate for some type of sequences, can produce incorrect results when used on more divergent but functionally related sequences that have undergone the sequence rearrangements observed in many bacterial and viral genomes. Here, we propose a classification model that exploits the complementary nature of alignment-based and alignment-free similarity measures with the aim to improve the accuracy with which DNA and protein sequences are characterized. Results: Our model classifies sequences using a combined sequence similarity score calculated by adaptively weighting the contribution of different sequence similarity measures. Weights are determined independently for each sequence in the test set and reflect the discriminatory ability of individual similarity measures in the training set. Because the similarity between some sequences is determined more accurately with one type of measure rather than another, our classifier allows different sets of weights to be associated with different sequences. Using five different similarity measures, we show that our model significantly improves the classification accuracy over the current composition- and alignment-based models, when predicting the taxonomic lineage for both short viral sequence fragments and complete viral sequences. We also show that our model can be used effectively for the classification of reads from a real metagenome dataset as well as protein sequences. Availability and implementation: All the datasets and the code used in this study are freely available at https://collaborators.oicr.on.ca/vferretti/borozan_csss/csss.html. Contact: ivan.borozan@gmail.com Supplementary information: Supplementary data are available at Bioinformatics online. PMID:25573913

  3. Re-visiting protein-centric two-tier classification of existing DNA-protein complexes

    PubMed Central

    2012-01-01

    Background Precise DNA-protein interactions play most important and vital role in maintaining the normal physiological functioning of the cell, as it controls many high fidelity cellular processes. Detailed study of the nature of these interactions has paved the way for understanding the mechanisms behind the biological processes in which they are involved. Earlier in 2000, a systematic classification of DNA-protein complexes based on the structural analysis of the proteins was proposed at two tiers, namely groups and families. With the advancement in the number and resolution of structures of DNA-protein complexes deposited in the Protein Data Bank, it is important to revisit the existing classification. Results On the basis of the sequence analysis of DNA binding proteins, we have built upon the protein centric, two-tier classification of DNA-protein complexes by adding new members to existing families and making new families and groups. While classifying the new complexes, we also realised the emergence of new groups and families. The new group observed was where β-propeller was seen to interact with DNA. There were 34 SCOP folds which were observed to be present in the complexes of both old and new classifications, whereas 28 folds are present exclusively in the new complexes. Some new families noticed were NarL transcription factor, Z-α DNA binding proteins, Forkhead transcription factor, AP2 protein, Methyl CpG binding protein etc. Conclusions Our results suggest that with the increasing number of availability of DNA-protein complexes in Protein Data Bank, the number of families in the classification increased by approximately three fold. The folds present exclusively in newly classified complexes is suggestive of inclusion of proteins with new function in new classification, the most populated of which are the folds responsible for DNA damage repair. The proposed re-visited classification can be used to perform genome-wide surveys in the genomes of interest for the presence of DNA-binding proteins. Further analysis of these complexes can aid in developing algorithms for identifying DNA-binding proteins and their family members from mere sequence information. PMID:22800292

  4. Re-visiting protein-centric two-tier classification of existing DNA-protein complexes.

    PubMed

    Malhotra, Sony; Sowdhamini, Ramanathan

    2012-07-16

    Precise DNA-protein interactions play most important and vital role in maintaining the normal physiological functioning of the cell, as it controls many high fidelity cellular processes. Detailed study of the nature of these interactions has paved the way for understanding the mechanisms behind the biological processes in which they are involved. Earlier in 2000, a systematic classification of DNA-protein complexes based on the structural analysis of the proteins was proposed at two tiers, namely groups and families. With the advancement in the number and resolution of structures of DNA-protein complexes deposited in the Protein Data Bank, it is important to revisit the existing classification. On the basis of the sequence analysis of DNA binding proteins, we have built upon the protein centric, two-tier classification of DNA-protein complexes by adding new members to existing families and making new families and groups. While classifying the new complexes, we also realised the emergence of new groups and families. The new group observed was where β-propeller was seen to interact with DNA. There were 34 SCOP folds which were observed to be present in the complexes of both old and new classifications, whereas 28 folds are present exclusively in the new complexes. Some new families noticed were NarL transcription factor, Z-α DNA binding proteins, Forkhead transcription factor, AP2 protein, Methyl CpG binding protein etc. Our results suggest that with the increasing number of availability of DNA-protein complexes in Protein Data Bank, the number of families in the classification increased by approximately three fold. The folds present exclusively in newly classified complexes is suggestive of inclusion of proteins with new function in new classification, the most populated of which are the folds responsible for DNA damage repair. The proposed re-visited classification can be used to perform genome-wide surveys in the genomes of interest for the presence of DNA-binding proteins. Further analysis of these complexes can aid in developing algorithms for identifying DNA-binding proteins and their family members from mere sequence information.

  5. PANTHER version 11: expanded annotation data from Gene Ontology and Reactome pathways, and data analysis tool enhancements.

    PubMed

    Mi, Huaiyu; Huang, Xiaosong; Muruganujan, Anushya; Tang, Haiming; Mills, Caitlin; Kang, Diane; Thomas, Paul D

    2017-01-04

    The PANTHER database (Protein ANalysis THrough Evolutionary Relationships, http://pantherdb.org) contains comprehensive information on the evolution and function of protein-coding genes from 104 completely sequenced genomes. PANTHER software tools allow users to classify new protein sequences, and to analyze gene lists obtained from large-scale genomics experiments. In the past year, major improvements include a large expansion of classification information available in PANTHER, as well as significant enhancements to the analysis tools. Protein subfamily functional classifications have more than doubled due to progress of the Gene Ontology Phylogenetic Annotation Project. For human genes (as well as a few other organisms), PANTHER now also supports enrichment analysis using pathway classifications from the Reactome resource. The gene list enrichment tools include a new 'hierarchical view' of results, enabling users to leverage the structure of the classifications/ontologies; the tools also allow users to upload genetic variant data directly, rather than requiring prior conversion to a gene list. The updated coding single-nucleotide polymorphisms (SNP) scoring tool uses an improved algorithm. The hidden Markov model (HMM) search tools now use HMMER3, dramatically reducing search times and improving accuracy of E-value statistics. Finally, the PANTHER Tree-Attribute Viewer has been implemented in JavaScript, with new views for exploring protein sequence evolution. © The Author(s) 2016. Published by Oxford University Press on behalf of Nucleic Acids Research.

  6. A novel Multi-Agent Ada-Boost algorithm for predicting protein structural class with the information of protein secondary structure.

    PubMed

    Fan, Ming; Zheng, Bin; Li, Lihua

    2015-10-01

    Knowledge of the structural class of a given protein is important for understanding its folding patterns. Although a lot of efforts have been made, it still remains a challenging problem for prediction of protein structural class solely from protein sequences. The feature extraction and classification of proteins are the main problems in prediction. In this research, we extended our earlier work regarding these two aspects. In protein feature extraction, we proposed a scheme by calculating the word frequency and word position from sequences of amino acid, reduced amino acid, and secondary structure. For an accurate classification of the structural class of protein, we developed a novel Multi-Agent Ada-Boost (MA-Ada) method by integrating the features of Multi-Agent system into Ada-Boost algorithm. Extensive experiments were taken to test and compare the proposed method using four benchmark datasets in low homology. The results showed classification accuracies of 88.5%, 96.0%, 88.4%, and 85.5%, respectively, which are much better compared with the existing methods. The source code and dataset are available on request.

  7. A novel and efficient technique for identification and classification of GPCRs.

    PubMed

    Gupta, Ravi; Mittal, Ankush; Singh, Kuldip

    2008-07-01

    G-protein coupled receptors (GPCRs) play a vital role in different biological processes, such as regulation of growth, death, and metabolism of cells. GPCRs are the focus of significant amount of current pharmaceutical research since they interact with more than 50% of prescription drugs. The dipeptide-based support vector machine (SVM) approach is the most accurate technique to identify and classify the GPCRs. However, this approach has two major disadvantages. First, the dimension of dipeptide-based feature vector is equal to 400. The large dimension makes the classification task computationally and memory wise inefficient. Second, it does not consider the biological properties of protein sequence for identification and classification of GPCRs. In this paper, we present a novel-feature-based SVM classification technique. The novel features are derived by applying wavelet-based time series analysis approach on protein sequences. The proposed feature space summarizes the variance information of seven important biological properties of amino acids in a protein sequence. In addition, the dimension of the feature vector for proposed technique is equal to 35. Experiments were performed on GPCRs protein sequences available at GPCRs Database. Our approach achieves an accuracy of 99.9%, 98.06%, 97.78%, and 94.08% for GPCR superfamily, families, subfamilies, and subsubfamilies (amine group), respectively, when evaluated using fivefold cross-validation. Further, an accuracy of 99.8%, 97.26%, and 97.84% was obtained when evaluated on unseen or recall datasets of GPCR superfamily, families, and subfamilies, respectively. Comparison with dipeptide-based SVM technique shows the effectiveness of our approach.

  8. Assignment of protein sequences to existing domain and family classification systems: Pfam and the PDB

    PubMed Central

    Dunbrack, Roland L.

    2012-01-01

    Motivation: Automating the assignment of existing domain and protein family classifications to new sets of sequences is an important task. Current methods often miss assignments because remote relationships fail to achieve statistical significance. Some assignments are not as long as the actual domain definitions because local alignment methods often cut alignments short. Long insertions in query sequences often erroneously result in two copies of the domain assigned to the query. Divergent repeat sequences in proteins are often missed. Results: We have developed a multilevel procedure to produce nearly complete assignments of protein families of an existing classification system to a large set of sequences. We apply this to the task of assigning Pfam domains to sequences and structures in the Protein Data Bank (PDB). We found that HHsearch alignments frequently scored more remotely related Pfams in Pfam clans higher than closely related Pfams, thus, leading to erroneous assignment at the Pfam family level. A greedy algorithm allowing for partial overlaps was, thus, applied first to sequence/HMM alignments, then HMM–HMM alignments and then structure alignments, taking care to join partial alignments split by large insertions into single-domain assignments. Additional assignment of repeat Pfams with weaker E-values was allowed after stronger assignments of the repeat HMM. Our database of assignments, presented in a database called PDBfam, contains Pfams for 99.4% of chains >50 residues. Availability: The Pfam assignment data in PDBfam are available at http://dunbrack2.fccc.edu/ProtCid/PDBfam, which can be searched by PDB codes and Pfam identifiers. They will be updated regularly. Contact: Roland.Dunbracks@fccc.edu PMID:22942020

  9. Sequence-structure relationship study in all-α transmembrane proteins using an unsupervised learning approach.

    PubMed

    Esque, Jérémy; Urbain, Aurélie; Etchebest, Catherine; de Brevern, Alexandre G

    2015-11-01

    Transmembrane proteins (TMPs) are major drug targets, but the knowledge of their precise topology structure remains highly limited compared with globular proteins. In spite of the difficulties in obtaining their structures, an important effort has been made these last years to increase their number from an experimental and computational point of view. In view of this emerging challenge, the development of computational methods to extract knowledge from these data is crucial for the better understanding of their functions and in improving the quality of structural models. Here, we revisit an efficient unsupervised learning procedure, called Hybrid Protein Model (HPM), which is applied to the analysis of transmembrane proteins belonging to the all-α structural class. HPM method is an original classification procedure that efficiently combines sequence and structure learning. The procedure was initially applied to the analysis of globular proteins. In the present case, HPM classifies a set of overlapping protein fragments, extracted from a non-redundant databank of TMP 3D structure. After fine-tuning of the learning parameters, the optimal classification results in 65 clusters. They represent at best similar relationships between sequence and local structure properties of TMPs. Interestingly, HPM distinguishes among the resulting clusters two helical regions with distinct hydrophobic patterns. This underlines the complexity of the topology of these proteins. The HPM classification enlightens unusual relationship between amino acids in TMP fragments, which can be useful to elaborate new amino acids substitution matrices. Finally, two challenging applications are described: the first one aims at annotating protein functions (channel or not), the second one intends to assess the quality of the structures (X-ray or models) via a new scoring function deduced from the HPM classification.

  10. Classification of Complete Proteomes of Different Organisms and Protein Sets Based on Their Protein Distributions in Terms of Some Key Attributes of Proteins

    PubMed Central

    Ma, Yue; Tuskan, Gerald A.

    2018-01-01

    The existence of complete genome sequences makes it important to develop different approaches for classification of large-scale data sets and to make extraction of biological insights easier. Here, we propose an approach for classification of complete proteomes/protein sets based on protein distributions on some basic attributes. We demonstrate the usefulness of this approach by determining protein distributions in terms of two attributes: protein lengths and protein intrinsic disorder contents (ID). The protein distributions based on L and ID are surveyed for representative proteome organisms and protein sets from the three domains of life. The two-dimensional maps (designated as fingerprints here) from the protein distribution densities in the LD space defined by ln(L) and ID are then constructed. The fingerprints for different organisms and protein sets are found to be distinct with each other, and they can therefore be used for comparative studies. As a test case, phylogenetic trees have been constructed based on the protein distribution densities in the fingerprints of proteomes of organisms without performing any protein sequence comparison and alignments. The phylogenetic trees generated are biologically meaningful, demonstrating that the protein distributions in the LD space may serve as unique phylogenetic signals of the organisms at the proteome level. PMID:29686995

  11. Classification of Ancient Mammal Individuals Using Dental Pulp MALDI-TOF MS Peptide Profiling

    PubMed Central

    Tran, Thi-Nguyen-Ny; Aboudharam, Gérard; Gardeisen, Armelle; Davoust, Bernard; Bocquet-Appel, Jean-Pierre; Flaudrops, Christophe; Belghazi, Maya; Raoult, Didier; Drancourt, Michel

    2011-01-01

    Background The classification of ancient animal corpses at the species level remains a challenging task for forensic scientists and anthropologists. Severe damage and mixed, tiny pieces originating from several skeletons may render morphological classification virtually impossible. Standard approaches are based on sequencing mitochondrial and nuclear targets. Methodology/Principal Findings We present a method that can accurately classify mammalian species using dental pulp and mass spectrometry peptide profiling. Our work was organized into three successive steps. First, after extracting proteins from the dental pulp collected from 37 modern individuals representing 13 mammalian species, trypsin-digested peptides were used for matrix-assisted laser desorption/ionization time-of-flight mass spectrometry analysis. The resulting peptide profiles accurately classified every individual at the species level in agreement with parallel cytochrome b gene sequencing gold standard. Second, using a 279–modern spectrum database, we blindly classified 33 of 37 teeth collected in 37 modern individuals (89.1%). Third, we classified 10 of 18 teeth (56%) collected in 15 ancient individuals representing five mammal species including human, from five burial sites dating back 8,500 years. Further comparison with an upgraded database comprising ancient specimen profiles yielded 100% classification in ancient teeth. Peptide sequencing yield 4 and 16 different non-keratin proteins including collagen (alpha-1 type I and alpha-2 type I) in human ancient and modern dental pulp, respectively. Conclusions/Significance Mass spectrometry peptide profiling of the dental pulp is a new approach that can be added to the arsenal of species classification tools for forensics and anthropology as a complementary method to DNA sequencing. The dental pulp is a new source for collagen and other proteins for the species classification of modern and ancient mammal individuals. PMID:21364886

  12. A framework for classification of prokaryotic protein kinases.

    PubMed

    Tyagi, Nidhi; Anamika, Krishanpal; Srinivasan, Narayanaswamy

    2010-05-26

    Overwhelming majority of the Serine/Threonine protein kinases identified by gleaning archaeal and eubacterial genomes could not be classified into any of the well known Hanks and Hunter subfamilies of protein kinases. This is owing to the development of Hanks and Hunter classification scheme based on eukaryotic protein kinases which are highly divergent from their prokaryotic homologues. A large dataset of prokaryotic Serine/Threonine protein kinases recognized from genomes of prokaryotes have been used to develop a classification framework for prokaryotic Ser/Thr protein kinases. We have used traditional sequence alignment and phylogenetic approaches and clustered the prokaryotic kinases which represent 72 subfamilies with at least 4 members in each. Such a clustering enables classification of prokaryotic Ser/Thr kinases and it can be used as a framework to classify newly identified prokaryotic Ser/Thr kinases. After series of searches in a comprehensive sequence database we recognized that 38 subfamilies of prokaryotic protein kinases are associated to a specific taxonomic level. For example 4, 6 and 3 subfamilies have been identified that are currently specific to phylum proteobacteria, cyanobacteria and actinobacteria respectively. Similarly subfamilies which are specific to an order, sub-order, class, family and genus have also been identified. In addition to these, we also identify organism-diverse subfamilies. Members of these clusters are from organisms of different taxonomic levels, such as archaea, bacteria, eukaryotes and viruses. Interestingly, occurrence of several taxonomic level specific subfamilies of prokaryotic kinases contrasts with classification of eukaryotic protein kinases in which most of the popular subfamilies of eukaryotic protein kinases occur diversely in several eukaryotes. Many prokaryotic Ser/Thr kinases exhibit a wide variety of modular organization which indicates a degree of complexity and protein-protein interactions in the signaling pathways in these microbes.

  13. Stratification of co-evolving genomic groups using ranked phylogenetic profiles

    PubMed Central

    Freilich, Shiri; Goldovsky, Leon; Gottlieb, Assaf; Blanc, Eric; Tsoka, Sophia; Ouzounis, Christos A

    2009-01-01

    Background Previous methods of detecting the taxonomic origins of arbitrary sequence collections, with a significant impact to genome analysis and in particular metagenomics, have primarily focused on compositional features of genomes. The evolutionary patterns of phylogenetic distribution of genes or proteins, represented by phylogenetic profiles, provide an alternative approach for the detection of taxonomic origins, but typically suffer from low accuracy. Herein, we present rank-BLAST, a novel approach for the assignment of protein sequences into genomic groups of the same taxonomic origin, based on the ranking order of phylogenetic profiles of target genes or proteins across the reference database. Results The rank-BLAST approach is validated by computing the phylogenetic profiles of all sequences for five distinct microbial species of varying degrees of phylogenetic proximity, against a reference database of 243 fully sequenced genomes. The approach - a combination of sequence searches, statistical estimation and clustering - analyses the degree of sequence divergence between sets of protein sequences and allows the classification of protein sequences according to the species of origin with high accuracy, allowing taxonomic classification of 64% of the proteins studied. In most cases, a main cluster is detected, representing the corresponding species. Secondary, functionally distinct and species-specific clusters exhibit different patterns of phylogenetic distribution, thus flagging gene groups of interest. Detailed analyses of such cases are provided as examples. Conclusion Our results indicate that the rank-BLAST approach can capture the taxonomic origins of sequence collections in an accurate and efficient manner. The approach can be useful both for the analysis of genome evolution and the detection of species groups in metagenomics samples. PMID:19860884

  14. Efficient use of unlabeled data for protein sequence classification: a comparative study.

    PubMed

    Kuksa, Pavel; Huang, Pai-Hsi; Pavlovic, Vladimir

    2009-04-29

    Recent studies in computational primary protein sequence analysis have leveraged the power of unlabeled data. For example, predictive models based on string kernels trained on sequences known to belong to particular folds or superfamilies, the so-called labeled data set, can attain significantly improved accuracy if this data is supplemented with protein sequences that lack any class tags-the unlabeled data. In this study, we present a principled and biologically motivated computational framework that more effectively exploits the unlabeled data by only using the sequence regions that are more likely to be biologically relevant for better prediction accuracy. As overly-represented sequences in large uncurated databases may bias the estimation of computational models that rely on unlabeled data, we also propose a method to remove this bias and improve performance of the resulting classifiers. Combined with state-of-the-art string kernels, our proposed computational framework achieves very accurate semi-supervised protein remote fold and homology detection on three large unlabeled databases. It outperforms current state-of-the-art methods and exhibits significant reduction in running time. The unlabeled sequences used under the semi-supervised setting resemble the unpolished gemstones; when used as-is, they may carry unnecessary features and hence compromise the classification accuracy but once cut and polished, they improve the accuracy of the classifiers considerably.

  15. ACLAME: a CLAssification of Mobile genetic Elements, update 2010.

    PubMed

    Leplae, Raphaël; Lima-Mendez, Gipsi; Toussaint, Ariane

    2010-01-01

    The ACLAME database is dedicated to the collection, analysis and classification of sequenced mobile genetic elements (MGEs, in particular phages and plasmids). In addition to providing information on the MGEs content, classifications are available at various levels of organization. At the gene/protein level, families group similar sequences that are expected to share the same function. Families of four or more proteins are manually assigned with a functional annotation using the GeneOntology and the locally developed ontology MeGO dedicated to MGEs. At the genome level, evolutionary cohesive modules group sets of protein families shared among MGEs. At the population level, networks display the reticulate evolutionary relationships among MGEs. To increase the coverage of the phage sequence space, ACLAME version 0.4 incorporates 760 high-quality predicted prophages selected from the Prophinder database. Most of the data can be downloaded from the freely accessible ACLAME web site (http://aclame.ulb.ac.be). The BLAST interface for querying the database has been extended and numerous tools for in-depth analysis of the results have been added.

  16. GOLabeler: Improving Sequence-based Large-scale Protein Function Prediction by Learning to Rank.

    PubMed

    You, Ronghui; Zhang, Zihan; Xiong, Yi; Sun, Fengzhu; Mamitsuka, Hiroshi; Zhu, Shanfeng

    2018-03-07

    Gene Ontology (GO) has been widely used to annotate functions of proteins and understand their biological roles. Currently only <1% of more than 70 million proteins in UniProtKB have experimental GO annotations, implying the strong necessity of automated function prediction (AFP) of proteins, where AFP is a hard multilabel classification problem due to one protein with a diverse number of GO terms. Most of these proteins have only sequences as input information, indicating the importance of sequence-based AFP (SAFP: sequences are the only input). Furthermore homology-based SAFP tools are competitive in AFP competitions, while they do not necessarily work well for so-called difficult proteins, which have <60% sequence identity to proteins with annotations already. Thus the vital and challenging problem now is how to develop a method for SAFP, particularly for difficult proteins. The key of this method is to extract not only homology information but also diverse, deep- rooted information/evidence from sequence inputs and integrate them into a predictor in a both effective and efficient manner. We propose GOLabeler, which integrates five component classifiers, trained from different features, including GO term frequency, sequence alignment, amino acid trigram, domains and motifs, and biophysical properties, etc., in the framework of learning to rank (LTR), a paradigm of machine learning, especially powerful for multilabel classification. The empirical results obtained by examining GOLabeler extensively and thoroughly by using large-scale datasets revealed numerous favorable aspects of GOLabeler, including significant performance advantage over state-of-the-art AFP methods. http://datamining-iip.fudan.edu.cn/golabeler. zhusf@fudan.edu.cn. Supplementary data are available at Bioinformatics online.

  17. Classification of Complete Proteomes of Different Organisms and Protein Sets Based on Their Protein Distributions in Terms of Some Key Attributes of Proteins

    DOE PAGES

    Guo, Hao-Bo; Ma, Yue; Tuskan, Gerald A.; ...

    2018-01-01

    The existence of complete genome sequences makes it important to develop different approaches for classification of large-scale data sets and to make extraction of biological insights easier. Here, we propose an approach for classification of complete proteomes/protein sets based on protein distributions on some basic attributes. We demonstrate the usefulness of this approach by determining protein distributions in terms of two attributes: protein lengths and protein intrinsic disorder contents (ID). The protein distributions based on L and ID are surveyed for representative proteome organisms and protein sets from the three domains of life. The two-dimensional maps (designated as fingerprints here)more » from the protein distribution densities in the LD space defined by ln( L ) and ID are then constructed. The fingerprints for different organisms and protein sets are found to be distinct with each other, and they can therefore be used for comparative studies. As a test case, phylogenetic trees have been constructed based on the protein distribution densities in the fingerprints of proteomes of organisms without performing any protein sequence comparison and alignments. The phylogenetic trees generated are biologically meaningful, demonstrating that the protein distributions in the LD space may serve as unique phylogenetic signals of the organisms at the proteome level.« less

  18. Classification of Complete Proteomes of Different Organisms and Protein Sets Based on Their Protein Distributions in Terms of Some Key Attributes of Proteins

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Guo, Hao-Bo; Ma, Yue; Tuskan, Gerald A.

    The existence of complete genome sequences makes it important to develop different approaches for classification of large-scale data sets and to make extraction of biological insights easier. Here, we propose an approach for classification of complete proteomes/protein sets based on protein distributions on some basic attributes. We demonstrate the usefulness of this approach by determining protein distributions in terms of two attributes: protein lengths and protein intrinsic disorder contents (ID). The protein distributions based on L and ID are surveyed for representative proteome organisms and protein sets from the three domains of life. The two-dimensional maps (designated as fingerprints here)more » from the protein distribution densities in the LD space defined by ln( L ) and ID are then constructed. The fingerprints for different organisms and protein sets are found to be distinct with each other, and they can therefore be used for comparative studies. As a test case, phylogenetic trees have been constructed based on the protein distribution densities in the fingerprints of proteomes of organisms without performing any protein sequence comparison and alignments. The phylogenetic trees generated are biologically meaningful, demonstrating that the protein distributions in the LD space may serve as unique phylogenetic signals of the organisms at the proteome level.« less

  19. Evaluating the efficacy of a structure-derived amino acid substitution matrix in detecting protein homologs by BLAST and PSI-BLAST.

    PubMed

    Goonesekere, Nalin Cw

    2009-01-01

    The large numbers of protein sequences generated by whole genome sequencing projects require rapid and accurate methods of annotation. The detection of homology through computational sequence analysis is a powerful tool in determining the complex evolutionary and functional relationships that exist between proteins. Homology search algorithms employ amino acid substitution matrices to detect similarity between proteins sequences. The substitution matrices in common use today are constructed using sequences aligned without reference to protein structure. Here we present amino acid substitution matrices constructed from the alignment of a large number of protein domain structures from the structural classification of proteins (SCOP) database. We show that when incorporated into the homology search algorithms BLAST and PSI-blast, the structure-based substitution matrices enhance the efficacy of detecting remote homologs.

  20. Mouse Vk gene classification by nucleic acid sequence similarity.

    PubMed

    Strohal, R; Helmberg, A; Kroemer, G; Kofler, R

    1989-01-01

    Analyses of immunoglobulin (Ig) variable (V) region gene usage in the immune response, estimates of V gene germline complexity, and other nucleic acid hybridization-based studies depend on the extent to which such genes are related (i.e., sequence similarity) and their organization in gene families. While mouse Igh heavy chain V region (VH) gene families are relatively well-established, a corresponding systematic classification of Igk light chain V region (Vk) genes has not been reported. The present analysis, in the course of which we reviewed the known extent of the Vk germline gene repertoire and Vk gene usage in a variety of responses to foreign and self antigens, provides a classification of mouse Vk genes in gene families composed of members with greater than 80% overall nucleic acid sequence similarity. This classification differed in several aspects from that of VH genes: only some Vk gene families were as clearly separated (by greater than 25% sequence dissimilarity) as typical VH gene families; most Vk gene families were closely related and, in several instances, members from different families were very similar (greater than 80%) over large sequence portions; frequently, classification by nucleic acid sequence similarity diverged from existing classifications based on amino-terminal protein sequence similarity. Our data have implications for Vk gene analyses by nucleic acid hybridization and describe potentially important differences in sequence organization between VH and Vk genes.

  1. Sequence-based protein superfamily classification using computational intelligence techniques: a review.

    PubMed

    Vipsita, Swati; Rath, Santanu Kumar

    2015-01-01

    Protein superfamily classification deals with the problem of predicting the family membership of newly discovered amino acid sequence. Although many trivial alignment methods are already developed by previous researchers, but the present trend demands the application of computational intelligent techniques. As there is an exponential growth in size of biological database, retrieval and inference of essential knowledge in the biological domain become a very cumbersome task. This problem can be easily handled using intelligent techniques due to their ability of tolerance for imprecision, uncertainty, approximate reasoning, and partial truth. This paper discusses the various global and local features extracted from full length protein sequence which are used for the approximation and generalisation of the classifier. The various parameters used for evaluating the performance of the classifiers are also discussed. Therefore, this review article can show right directions to the present researchers to make an improvement over the existing methods.

  2. A Bioinformatics Classifier and Database for Heme-Copper Oxygen Reductases

    PubMed Central

    Sousa, Filipa L.; Alves, Renato J.; Pereira-Leal, José B.; Teixeira, Miguel; Pereira, Manuela M.

    2011-01-01

    Background Heme-copper oxygen reductases (HCOs) are the last enzymatic complexes of most aerobic respiratory chains, reducing dioxygen to water and translocating up to four protons across the inner mitochondrial membrane (eukaryotes) or cytoplasmatic membrane (prokaryotes). The number of completely sequenced genomes is expanding exponentially, and concomitantly, the number and taxonomic distribution of HCO sequences. These enzymes were initially classified into three different types being this classification recently challenged. Methodology We reanalyzed the classification scheme and developed a new bioinformatics classifier for the HCO and Nitric oxide reductases (NOR), which we benchmark against a manually derived gold standard sequence set. It is able to classify any given sequence of subunit I from HCO and NOR with a global recall and precision both of 99.8%. We use this tool to classify this protein family in 552 completely sequenced genomes. Conclusions We concluded that the new and broader data set supports three functional and evolutionary groups of HCOs. Homology between NORs and HCOs is shown and NORs closest relationship with C Type HCOs demonstrated. We established and made available a classification web tool and an integrated Heme-Copper Oxygen reductase and NOR protein database (www.evocell.org/hco). PMID:21559461

  3. Computational intelligence techniques for biological data mining: An overview

    NASA Astrophysics Data System (ADS)

    Faye, Ibrahima; Iqbal, Muhammad Javed; Said, Abas Md; Samir, Brahim Belhaouari

    2014-10-01

    Computational techniques have been successfully utilized for a highly accurate analysis and modeling of multifaceted and raw biological data gathered from various genome sequencing projects. These techniques are proving much more effective to overcome the limitations of the traditional in-vitro experiments on the constantly increasing sequence data. However, most critical problems that caught the attention of the researchers may include, but not limited to these: accurate structure and function prediction of unknown proteins, protein subcellular localization prediction, finding protein-protein interactions, protein fold recognition, analysis of microarray gene expression data, etc. To solve these problems, various classification and clustering techniques using machine learning have been extensively used in the published literature. These techniques include neural network algorithms, genetic algorithms, fuzzy ARTMAP, K-Means, K-NN, SVM, Rough set classifiers, decision tree and HMM based algorithms. Major difficulties in applying the above algorithms include the limitations found in the previous feature encoding and selection methods while extracting the best features, increasing classification accuracy and decreasing the running time overheads of the learning algorithms. The application of this research would be potentially useful in the drug design and in the diagnosis of some diseases. This paper presents a concise overview of the well-known protein classification techniques.

  4. BayesMotif: de novo protein sorting motif discovery from impure datasets.

    PubMed

    Hu, Jianjun; Zhang, Fan

    2010-01-18

    Protein sorting is the process that newly synthesized proteins are transported to their target locations within or outside of the cell. This process is precisely regulated by protein sorting signals in different forms. A major category of sorting signals are amino acid sub-sequences usually located at the N-terminals or C-terminals of protein sequences. Genome-wide experimental identification of protein sorting signals is extremely time-consuming and costly. Effective computational algorithms for de novo discovery of protein sorting signals is needed to improve the understanding of protein sorting mechanisms. We formulated the protein sorting motif discovery problem as a classification problem and proposed a Bayesian classifier based algorithm (BayesMotif) for de novo identification of a common type of protein sorting motifs in which a highly conserved anchor is present along with a less conserved motif regions. A false positive removal procedure is developed to iteratively remove sequences that are unlikely to contain true motifs so that the algorithm can identify motifs from impure input sequences. Experiments on both implanted motif datasets and real-world datasets showed that the enhanced BayesMotif algorithm can identify anchored sorting motifs from pure or impure protein sequence dataset. It also shows that the false positive removal procedure can help to identify true motifs even when there is only 20% of the input sequences containing true motif instances. We proposed BayesMotif, a novel Bayesian classification based algorithm for de novo discovery of a special category of anchored protein sorting motifs from impure datasets. Compared to conventional motif discovery algorithms such as MEME, our algorithm can find less-conserved motifs with short highly conserved anchors. Our algorithm also has the advantage of easy incorporation of additional meta-sequence features such as hydrophobicity or charge of the motifs which may help to overcome the limitations of PWM (position weight matrix) motif model.

  5. Camps 2.0: exploring the sequence and structure space of prokaryotic, eukaryotic, and viral membrane proteins.

    PubMed

    Neumann, Sindy; Hartmann, Holger; Martin-Galiano, Antonio J; Fuchs, Angelika; Frishman, Dmitrij

    2012-03-01

    Structural bioinformatics of membrane proteins is still in its infancy, and the picture of their fold space is only beginning to emerge. Because only a handful of three-dimensional structures are available, sequence comparison and structure prediction remain the main tools for investigating sequence-structure relationships in membrane protein families. Here we present a comprehensive analysis of the structural families corresponding to α-helical membrane proteins with at least three transmembrane helices. The new version of our CAMPS database (CAMPS 2.0) covers nearly 1300 eukaryotic, prokaryotic, and viral genomes. Using an advanced classification procedure, which is based on high-order hidden Markov models and considers both sequence similarity as well as the number of transmembrane helices and loop lengths, we identified 1353 structurally homogeneous clusters roughly corresponding to membrane protein folds. Only 53 clusters are associated with experimentally determined three-dimensional structures, and for these clusters CAMPS is in reasonable agreement with structure-based classification approaches such as SCOP and CATH. We therefore estimate that ∼1300 structures would need to be determined to provide a sufficient structural coverage of polytopic membrane proteins. CAMPS 2.0 is available at http://webclu.bio.wzw.tum.de/CAMPS2.0/. Copyright © 2011 Wiley Periodicals, Inc.

  6. Acyl carrier protein structural classification and normal mode analysis

    PubMed Central

    Cantu, David C; Forrester, Michael J; Charov, Katherine; Reilly, Peter J

    2012-01-01

    All acyl carrier protein primary and tertiary structures were gathered into the ThYme database. They are classified into 16 families by amino acid sequence similarity, with members of the different families having sequences with statistically highly significant differences. These classifications are supported by tertiary structure superposition analysis. Tertiary structures from a number of families are very similar, suggesting that these families may come from a single distant ancestor. Normal vibrational mode analysis was conducted on experimentally determined freestanding structures, showing greater fluctuations at chain termini and loops than in most helices. Their modes overlap more so within families than between different families. The tertiary structures of three acyl carrier protein families that lacked any known structures were predicted as well. PMID:22374859

  7. Efficient use of unlabeled data for protein sequence classification: a comparative study

    PubMed Central

    Kuksa, Pavel; Huang, Pai-Hsi; Pavlovic, Vladimir

    2009-01-01

    Background Recent studies in computational primary protein sequence analysis have leveraged the power of unlabeled data. For example, predictive models based on string kernels trained on sequences known to belong to particular folds or superfamilies, the so-called labeled data set, can attain significantly improved accuracy if this data is supplemented with protein sequences that lack any class tags–the unlabeled data. In this study, we present a principled and biologically motivated computational framework that more effectively exploits the unlabeled data by only using the sequence regions that are more likely to be biologically relevant for better prediction accuracy. As overly-represented sequences in large uncurated databases may bias the estimation of computational models that rely on unlabeled data, we also propose a method to remove this bias and improve performance of the resulting classifiers. Results Combined with state-of-the-art string kernels, our proposed computational framework achieves very accurate semi-supervised protein remote fold and homology detection on three large unlabeled databases. It outperforms current state-of-the-art methods and exhibits significant reduction in running time. Conclusion The unlabeled sequences used under the semi-supervised setting resemble the unpolished gemstones; when used as-is, they may carry unnecessary features and hence compromise the classification accuracy but once cut and polished, they improve the accuracy of the classifiers considerably. PMID:19426450

  8. Classification and Lineage Tracing of SH2 Domains Throughout Eukaryotes.

    PubMed

    Liu, Bernard A

    2017-01-01

    Today there exists a rapidly expanding number of sequenced genomes. Cataloging protein interaction domains such as the Src Homology 2 (SH2) domain across these various genomes can be accomplished with ease due to existing algorithms and predictions models. An evolutionary analysis of SH2 domains provides a step towards understanding how SH2 proteins integrated with existing signaling networks to position phosphotyrosine signaling as a crucial driver of robust cellular communication networks in metazoans. However organizing and tracing SH2 domain across organisms and understanding their evolutionary trajectory remains a challenge. This chapter describes several methodologies towards analyzing the evolutionary trajectory of SH2 domains including a global SH2 domain classification system, which facilitates annotation of new SH2 sequences essential for tracing the lineage of SH2 domains throughout eukaryote evolution. This classification utilizes a combination of sequence homology, protein domain architecture and the boundary positions between introns and exons within the SH2 domain or genes encoding these domains. Discrete SH2 families can then be traced across various genomes to provide insight into its origins. Furthermore, additional methods for examining potential mechanisms for divergence of SH2 domains from structural changes to alterations in the protein domain content and genome duplication will be discussed. Therefore a better understanding of SH2 domain evolution may enhance our insight into the emergence of phosphotyrosine signaling and the expansion of protein interaction domains.

  9. SVM-Fold: a tool for discriminative multi-class protein fold and superfamily recognition

    PubMed Central

    Melvin, Iain; Ie, Eugene; Kuang, Rui; Weston, Jason; Stafford, William Noble; Leslie, Christina

    2007-01-01

    Background Predicting a protein's structural class from its amino acid sequence is a fundamental problem in computational biology. Much recent work has focused on developing new representations for protein sequences, called string kernels, for use with support vector machine (SVM) classifiers. However, while some of these approaches exhibit state-of-the-art performance at the binary protein classification problem, i.e. discriminating between a particular protein class and all other classes, few of these studies have addressed the real problem of multi-class superfamily or fold recognition. Moreover, there are only limited software tools and systems for SVM-based protein classification available to the bioinformatics community. Results We present a new multi-class SVM-based protein fold and superfamily recognition system and web server called SVM-Fold, which can be found at . Our system uses an efficient implementation of a state-of-the-art string kernel for sequence profiles, called the profile kernel, where the underlying feature representation is a histogram of inexact matching k-mer frequencies. We also employ a novel machine learning approach to solve the difficult multi-class problem of classifying a sequence of amino acids into one of many known protein structural classes. Binary one-vs-the-rest SVM classifiers that are trained to recognize individual structural classes yield prediction scores that are not comparable, so that standard "one-vs-all" classification fails to perform well. Moreover, SVMs for classes at different levels of the protein structural hierarchy may make useful predictions, but one-vs-all does not try to combine these multiple predictions. To deal with these problems, our method learns relative weights between one-vs-the-rest classifiers and encodes information about the protein structural hierarchy for multi-class prediction. In large-scale benchmark results based on the SCOP database, our code weighting approach significantly improves on the standard one-vs-all method for both the superfamily and fold prediction in the remote homology setting and on the fold recognition problem. Moreover, our code weight learning algorithm strongly outperforms nearest-neighbor methods based on PSI-BLAST in terms of prediction accuracy on every structure classification problem we consider. Conclusion By combining state-of-the-art SVM kernel methods with a novel multi-class algorithm, the SVM-Fold system delivers efficient and accurate protein fold and superfamily recognition. PMID:17570145

  10. HMMER Cut-off Threshold Tool (HMMERCTTER): Supervised classification of superfamily protein sequences with a reliable cut-off threshold.

    PubMed

    Pagnuco, Inti Anabela; Revuelta, María Victoria; Bondino, Hernán Gabriel; Brun, Marcel; Ten Have, Arjen

    2018-01-01

    Protein superfamilies can be divided into subfamilies of proteins with different functional characteristics. Their sequences can be classified hierarchically, which is part of sequence function assignation. Typically, there are no clear subfamily hallmarks that would allow pattern-based function assignation by which this task is mostly achieved based on the similarity principle. This is hampered by the lack of a score cut-off that is both sensitive and specific. HMMER Cut-off Threshold Tool (HMMERCTTER) adds a reliable cut-off threshold to the popular HMMER. Using a high quality superfamily phylogeny, it clusters a set of training sequences such that the cluster-specific HMMER profiles show cluster or subfamily member detection with 100% precision and recall (P&R), thereby generating a specific threshold as inclusion cut-off. Profiles and thresholds are then used as classifiers to screen a target dataset. Iterative inclusion of novel sequences to groups and the corresponding HMMER profiles results in high sensitivity while specificity is maintained by imposing 100% P&R self detection. In three presented case studies of protein superfamilies, classification of large datasets with 100% precision was achieved with over 95% recall. Limits and caveats are presented and explained. HMMERCTTER is a promising protein superfamily sequence classifier provided high quality training datasets are used. It provides a decision support system that aids in the difficult task of sequence function assignation in the twilight zone of sequence similarity. All relevant data and source codes are available from the Github repository at the following URL: https://github.com/BBCMdP/HMMERCTTER.

  11. HMMER Cut-off Threshold Tool (HMMERCTTER): Supervised classification of superfamily protein sequences with a reliable cut-off threshold

    PubMed Central

    Pagnuco, Inti Anabela; Revuelta, María Victoria; Bondino, Hernán Gabriel; Brun, Marcel

    2018-01-01

    Background Protein superfamilies can be divided into subfamilies of proteins with different functional characteristics. Their sequences can be classified hierarchically, which is part of sequence function assignation. Typically, there are no clear subfamily hallmarks that would allow pattern-based function assignation by which this task is mostly achieved based on the similarity principle. This is hampered by the lack of a score cut-off that is both sensitive and specific. Results HMMER Cut-off Threshold Tool (HMMERCTTER) adds a reliable cut-off threshold to the popular HMMER. Using a high quality superfamily phylogeny, it clusters a set of training sequences such that the cluster-specific HMMER profiles show cluster or subfamily member detection with 100% precision and recall (P&R), thereby generating a specific threshold as inclusion cut-off. Profiles and thresholds are then used as classifiers to screen a target dataset. Iterative inclusion of novel sequences to groups and the corresponding HMMER profiles results in high sensitivity while specificity is maintained by imposing 100% P&R self detection. In three presented case studies of protein superfamilies, classification of large datasets with 100% precision was achieved with over 95% recall. Limits and caveats are presented and explained. Conclusions HMMERCTTER is a promising protein superfamily sequence classifier provided high quality training datasets are used. It provides a decision support system that aids in the difficult task of sequence function assignation in the twilight zone of sequence similarity. All relevant data and source codes are available from the Github repository at the following URL: https://github.com/BBCMdP/HMMERCTTER. PMID:29579071

  12. Variability of the protein sequences of lcrV between epidemic and atypical rhamnose-positive strains of Yersinia pestis.

    PubMed

    Anisimov, Andrey P; Panfertsev, Evgeniy A; Svetoch, Tat'yana E; Dentovskaya, Svetlana V

    2007-01-01

    Sequencing of lcrV genes and comparison of the deduced amino acid sequences from ten Y. pestis strains belonging mostly to the group of atypical rhamnose-positive isolates (non-pestis subspecies or pestoides group) showed that the LcrV proteins analyzed could be classified into five sequence types. This classification was based on major amino acid polymorphisms among LcrV proteins in the four "hot points" of the protein sequences. Some additional minor polymorphisms were found throughout these sequence types. The "hot points" corresponded to amino acids 18 (Lys --> Asn), 72 (Lys --> Arg), 273 (Cys --> Ser), and 324-326 (Ser-Gly-Lys --> Arg) in the LcrV sequence of the reference Y. pestis strain CO92. One possible explanation for polymorphism in amino acid sequences of LcrV among different strains is that strain-specific variation resulted from adaptation of the plague pathogen to different rodent and lagomorph hosts.

  13. MIPS: a database for genomes and protein sequences.

    PubMed Central

    Mewes, H W; Heumann, K; Kaps, A; Mayer, K; Pfeiffer, F; Stocker, S; Frishman, D

    1999-01-01

    The Munich Information Center for Protein Sequences (MIPS-GSF), Martinsried near Munich, Germany, develops and maintains genome oriented databases. It is commonplace that the amount of sequence data available increases rapidly, but not the capacity of qualified manual annotation at the sequence databases. Therefore, our strategy aims to cope with the data stream by the comprehensive application of analysis tools to sequences of complete genomes, the systematic classification of protein sequences and the active support of sequence analysis and functional genomics projects. This report describes the systematic and up-to-date analysis of genomes (PEDANT), a comprehensive database of the yeast genome (MYGD), a database reflecting the progress in sequencing the Arabidopsis thaliana genome (MATD), the database of assembled, annotated human EST clusters (MEST), and the collection of protein sequence data within the framework of the PIR-International Protein Sequence Database (described elsewhere in this volume). MIPS provides access through its WWW server (http://www.mips.biochem.mpg.de) to a spectrum of generic databases, including the above mentioned as well as a database of protein families (PROTFAM), the MITOP database, and the all-against-all FASTA database. PMID:9847138

  14. Rapid identification and classification of Mycobacterium spp. using whole-cell protein barcodes with matrix assisted laser desorption ionization time of flight mass spectrometry in comparison with multigene phylogenetic analysis.

    PubMed

    Wang, Jun; Chen, Wen Feng; Li, Qing X

    2012-02-24

    The need of quick diagnostics and increasing number of bacterial species isolated necessitate development of a rapid and effective phenotypic identification method. Mass spectrometry (MS) profiling of whole cell proteins has potential to satisfy the requirements. The genus Mycobacterium contains more than 154 species that are taxonomically very close and require use of multiple genes including 16S rDNA for phylogenetic identification and classification. Six strains of five Mycobacterium species were selected as model bacteria in the present study because of their 16S rDNA similarity (98.4-99.8%) and the high similarity of the concatenated 16S rDNA, rpoB and hsp65 gene sequences (95.9-99.9%), requiring high identification resolution. The classification of the six strains by MALDI TOF MS protein barcodes was consistent with, but at much higher resolution than, that of the multi-locus sequence analysis of using 16S rDNA, rpoB and hsp65. The species were well differentiated using MALDI TOF MS and MALDI BioTyper™ software after quick preparation of whole-cell proteins. Several proteins were selected as diagnostic markers for species confirmation. An integration of MALDI TOF MS, MALDI BioTyper™ software and diagnostic protein fragments provides a robust phenotypic approach for bacterial identification and classification. Copyright © 2011 Elsevier B.V. All rights reserved.

  15. An alternative view of protein fold space.

    PubMed

    Shindyalov, I N; Bourne, P E

    2000-02-15

    Comparing and subsequently classifying protein structures information has received significant attention concurrent with the increase in the number of experimentally derived 3-dimensional structures. Classification schemes have focused on biological function found within protein domains and on structure classification based on topology. Here an alternative view is presented that groups substructures. Substructures are long (50-150 residue) highly repetitive near-contiguous pieces of polypeptide chain that occur frequently in a set of proteins from the PDB defined as structurally non-redundant over the complete polypeptide chain. The substructure classification is based on a previously reported Combinatorial Extension (CE) algorithm that provides a significantly different set of structure alignments than those previously described, having, for example, only a 40% overlap with FSSP. Qualitatively the algorithm provides longer contiguous aligned segments at the price of a slightly higher root-mean-square deviation (rmsd). Clustering these alignments gives a discreet and highly repetitive set of substructures not detectable by sequence similarity alone. In some cases different substructures represent all or different parts of well known folds indicative of the Russian doll effect--the continuity of protein fold space. In other cases they fall into different structure and functional classifications. It is too early to determine whether these newly classified substructures represent new insights into the evolution of a structural framework important to many proteins. What is apparent from on-going work is that these substructures have the potential to be useful probes in finding remote sequence homology and in structure prediction studies. The characteristics of the complete all-by-all comparison of the polypeptide chains present in the PDB and details of the filtering procedure by pair-wise structure alignment that led to the emergent substructure gallery are discussed. Substructure classification, alignments, and tools to analyze them are available at http://cl.sdsc.edu/ce.html.

  16. Protein classification based on text document classification techniques.

    PubMed

    Cheng, Betty Yee Man; Carbonell, Jaime G; Klein-Seetharaman, Judith

    2005-03-01

    The need for accurate, automated protein classification methods continues to increase as advances in biotechnology uncover new proteins. G-protein coupled receptors (GPCRs) are a particularly difficult superfamily of proteins to classify due to extreme diversity among its members. Previous comparisons of BLAST, k-nearest neighbor (k-NN), hidden markov model (HMM) and support vector machine (SVM) using alignment-based features have suggested that classifiers at the complexity of SVM are needed to attain high accuracy. Here, analogous to document classification, we applied Decision Tree and Naive Bayes classifiers with chi-square feature selection on counts of n-grams (i.e. short peptide sequences of length n) to this classification task. Using the GPCR dataset and evaluation protocol from the previous study, the Naive Bayes classifier attained an accuracy of 93.0 and 92.4% in level I and level II subfamily classification respectively, while SVM has a reported accuracy of 88.4 and 86.3%. This is a 39.7 and 44.5% reduction in residual error for level I and level II subfamily classification, respectively. The Decision Tree, while inferior to SVM, outperforms HMM in both level I and level II subfamily classification. For those GPCR families whose profiles are stored in the Protein FAMilies database of alignments and HMMs (PFAM), our method performs comparably to a search against those profiles. Finally, our method can be generalized to other protein families by applying it to the superfamily of nuclear receptors with 94.5, 97.8 and 93.6% accuracy in family, level I and level II subfamily classification respectively. Copyright 2005 Wiley-Liss, Inc.

  17. Sequentially distant but structurally similar proteins exhibit fold specific patterns based on their biophysical properties.

    PubMed

    Rajendran, Senthilnathan; Jothi, Arunachalam

    2018-05-16

    The Three-dimensional structure of a protein depends on the interaction between their amino acid residues. These interactions are in turn influenced by various biophysical properties of the amino acids. There are several examples of proteins that share the same fold but are very dissimilar at the sequence level. For proteins to share a common fold some crucial interactions should be maintained despite insignificant sequence similarity. Since the interactions are because of the biophysical properties of the amino acids, we should be able to detect descriptive patterns for folds at such a property level. In this line, the main focus of our research is to analyze such proteins and to characterize them in terms of their biophysical properties. Protein structures with sequence similarity lesser than 40% were selected for ten different subfolds from three different mainfolds (according to CATH classification) and were used for this analysis. We used the normalized values of the 49 physio-chemical, energetic and conformational properties of amino acids. We characterize the folds based on the average biophysical property values. We also observed a fold specific correlational behavior of biophysical properties despite a very low sequence similarity in our data. We further trained three different binary classification models (Naive Bayes-NB, Support Vector Machines-SVM and Bayesian Generalized Linear Model-BGLM) which could discriminate mainfold based on the biophysical properties. We also show that among the three generated models, the BGLM classifier model was able to discriminate protein sequences coming under all beta category with 81.43% accuracy and all alpha, alpha-beta proteins with 83.37% accuracy. Copyright © 2018 Elsevier Ltd. All rights reserved.

  18. Classifying proteins into functional groups based on all-versus-all BLAST of 10 million proteins.

    PubMed

    Kolker, Natali; Higdon, Roger; Broomall, William; Stanberry, Larissa; Welch, Dean; Lu, Wei; Haynes, Winston; Barga, Roger; Kolker, Eugene

    2011-01-01

    To address the monumental challenge of assigning function to millions of sequenced proteins, we completed the first of a kind all-versus-all sequence alignments using BLAST for 9.9 million proteins in the UniRef100 database. Microsoft Windows Azure produced over 3 billion filtered records in 6 days using 475 eight-core virtual machines. Protein classification into functional groups was then performed using Hive and custom jars implemented on top of Apache Hadoop utilizing the MapReduce paradigm. First, using the Clusters of Orthologous Genes (COG) database, a length normalized bit score (LNBS) was determined to be the best similarity measure for classification of proteins. LNBS achieved sensitivity and specificity of 98% each. Second, out of 5.1 million bacterial proteins, about two-thirds were assigned to significantly extended COG groups, encompassing 30 times more assigned proteins. Third, the remaining proteins were classified into protein functional groups using an innovative implementation of a single-linkage algorithm on an in-house Hadoop compute cluster. This implementation significantly reduces the run time for nonindexed queries and optimizes efficient clustering on a large scale. The performance was also verified on Amazon Elastic MapReduce. This clustering assigned nearly 2 million proteins to approximately half a million different functional groups. A similar approach was applied to classify 2.8 million eukaryotic sequences resulting in over 1 million proteins being assign to existing KOG groups and the remainder clustered into 100,000 functional groups.

  19. An ensemble approach to protein fold classification by integration of template-based assignment and support vector machine classifier.

    PubMed

    Xia, Jiaqi; Peng, Zhenling; Qi, Dawei; Mu, Hongbo; Yang, Jianyi

    2017-03-15

    Protein fold classification is a critical step in protein structure prediction. There are two possible ways to classify protein folds. One is through template-based fold assignment and the other is ab-initio prediction using machine learning algorithms. Combination of both solutions to improve the prediction accuracy was never explored before. We developed two algorithms, HH-fold and SVM-fold for protein fold classification. HH-fold is a template-based fold assignment algorithm using the HHsearch program. SVM-fold is a support vector machine-based ab-initio classification algorithm, in which a comprehensive set of features are extracted from three complementary sequence profiles. These two algorithms are then combined, resulting to the ensemble approach TA-fold. We performed a comprehensive assessment for the proposed methods by comparing with ab-initio methods and template-based threading methods on six benchmark datasets. An accuracy of 0.799 was achieved by TA-fold on the DD dataset that consists of proteins from 27 folds. This represents improvement of 5.4-11.7% over ab-initio methods. After updating this dataset to include more proteins in the same folds, the accuracy increased to 0.971. In addition, TA-fold achieved >0.9 accuracy on a large dataset consisting of 6451 proteins from 184 folds. Experiments on the LE dataset show that TA-fold consistently outperforms other threading methods at the family, superfamily and fold levels. The success of TA-fold is attributed to the combination of template-based fold assignment and ab-initio classification using features from complementary sequence profiles that contain rich evolution information. http://yanglab.nankai.edu.cn/TA-fold/. yangjy@nankai.edu.cn or mhb-506@163.com. Supplementary data are available at Bioinformatics online. © The Author 2016. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com

  20. MIPS: a database for protein sequences, homology data and yeast genome information.

    PubMed Central

    Mewes, H W; Albermann, K; Heumann, K; Liebl, S; Pfeiffer, F

    1997-01-01

    The MIPS group (Martinsried Institute for Protein Sequences) at the Max-Planck-Institute for Biochemistry, Martinsried near Munich, Germany, collects, processes and distributes protein sequence data within the framework of the tripartite association of the PIR-International Protein Sequence Database (,). MIPS contributes nearly 50% of the data input to the PIR-International Protein Sequence Database. The database is distributed on CD-ROM together with PATCHX, an exhaustive supplement of unique, unverified protein sequences from external sources compiled by MIPS. Through its WWW server (http://www.mips.biochem.mpg.de/ ) MIPS permits internet access to sequence databases, homology data and to yeast genome information. (i) Sequence similarity results from the FASTA program () are stored in the FASTA database for all proteins from PIR-International and PATCHX. The database is dynamically maintained and permits instant access to FASTA results. (ii) Starting with FASTA database queries, proteins have been classified into families and superfamilies (PROT-FAM). (iii) The HPT (hashed position tree) data structure () developed at MIPS is a new approach for rapid sequence and pattern searching. (iv) MIPS provides access to the sequence and annotation of the complete yeast genome (), the functional classification of yeast genes (FunCat) and its graphical display, the 'Genome Browser' (). A CD-ROM based on the JAVA programming language providing dynamic interactive access to the yeast genome and the related protein sequences has been compiled and is available on request. PMID:9016498

  1. Problems of classification in the family Paramyxoviridae.

    PubMed

    Rima, Bert; Collins, Peter; Easton, Andrew; Fouchier, Ron; Kurath, Gael; Lamb, Robert A; Lee, Benhur; Maisner, Andrea; Rota, Paul; Wang, Lin-Fa

    2018-05-01

    A number of unassigned viruses in the family Paramyxoviridae need to be classified either as a new genus or placed into one of the seven genera currently recognized in this family. Furthermore, numerous new paramyxoviruses continue to be discovered. However, attempts at classification have highlighted the difficulties that arise by applying historic criteria or criteria based on sequence alone to the classification of the viruses in this family. While the recent taxonomic change that elevated the previous subfamily Pneumovirinae into a separate family Pneumoviridae is readily justified on the basis of RNA dependent -RNA polymerase (RdRp or L protein) sequence motifs, using RdRp sequence comparisons for assignment to lower level taxa raises problems that would require an overhaul of the current criteria for assignment into genera in the family Paramyxoviridae. Arbitrary cut off points to delineate genera and species would have to be set if classification was based on the amino acid sequence of the RdRp alone or on pairwise analysis of sequence complementarity (PASC) of all open reading frames (ORFs). While these cut-offs cannot be made consistent with the current classification in this family, resorting to genus-level demarcation criteria with additional input from the biological context may afford a way forward. Such criteria would reflect the increasingly dynamic nature of virus taxonomy even if it would require a complete revision of the current classification.

  2. Hierarchical learning architecture with automatic feature selection for multiclass protein fold classification.

    PubMed

    Huang, Chuen-Der; Lin, Chin-Teng; Pal, Nikhil Ranjan

    2003-12-01

    The structure classification of proteins plays a very important role in bioinformatics, since the relationships and characteristics among those known proteins can be exploited to predict the structure of new proteins. The success of a classification system depends heavily on two things: the tools being used and the features considered. For the bioinformatics applications, the role of appropriate features has not been paid adequate importance. In this investigation we use three novel ideas for multiclass protein fold classification. First, we use the gating neural network, where each input node is associated with a gate. This network can select important features in an online manner when the learning goes on. At the beginning of the training, all gates are almost closed, i.e., no feature is allowed to enter the network. Through the training, gates corresponding to good features are completely opened while gates corresponding to bad features are closed more tightly, and some gates may be partially open. The second novel idea is to use a hierarchical learning architecture (HLA). The classifier in the first level of HLA classifies the protein features into four major classes: all alpha, all beta, alpha + beta, and alpha/beta. And in the next level we have another set of classifiers, which further classifies the protein features into 27 folds. The third novel idea is to induce the indirect coding features from the amino-acid composition sequence of proteins based on the N-gram concept. This provides us with more representative and discriminative new local features of protein sequences for multiclass protein fold classification. The proposed HLA with new indirect coding features increases the protein fold classification accuracy by about 12%. Moreover, the gating neural network is found to reduce the number of features drastically. Using only half of the original features selected by the gating neural network can reach comparable test accuracy as that using all the original features. The gating mechanism also helps us to get a better insight into the folding process of proteins. For example, tracking the evolution of different gates we can find which characteristics (features) of the data are more important for the folding process. And, of course, it also reduces the computation time.

  3. Highly multiplexed and quantitative cell-surface protein profiling using genetically barcoded antibodies.

    PubMed

    Pollock, Samuel B; Hu, Amy; Mou, Yun; Martinko, Alexander J; Julien, Olivier; Hornsby, Michael; Ploder, Lynda; Adams, Jarrett J; Geng, Huimin; Müschen, Markus; Sidhu, Sachdev S; Moffat, Jason; Wells, James A

    2018-03-13

    Human cells express thousands of different surface proteins that can be used for cell classification, or to distinguish healthy and disease conditions. A method capable of profiling a substantial fraction of the surface proteome simultaneously and inexpensively would enable more accurate and complete classification of cell states. We present a highly multiplexed and quantitative surface proteomic method using genetically barcoded antibodies called phage-antibody next-generation sequencing (PhaNGS). Using 144 preselected antibodies displayed on filamentous phage (Fab-phage) against 44 receptor targets, we assess changes in B cell surface proteins after the development of drug resistance in a patient with acute lymphoblastic leukemia (ALL) and in adaptation to oncogene expression in a Myc-inducible Burkitt lymphoma model. We further show PhaNGS can be applied at the single-cell level. Our results reveal that a common set of proteins including FLT3, NCR3LG1, and ROR1 dominate the response to similar oncogenic perturbations in B cells. Linking high-affinity, selective, genetically encoded binders to NGS enables direct and highly multiplexed protein detection, comparable to RNA-sequencing for mRNA. PhaNGS has the potential to profile a substantial fraction of the surface proteome simultaneously and inexpensively to enable more accurate and complete classification of cell states. Copyright © 2018 the Author(s). Published by PNAS.

  4. Comprehensive comparative analysis and identification of RNA-binding protein domains: multi-class classification and feature selection.

    PubMed

    Jahandideh, Samad; Srinivasasainagendra, Vinodh; Zhi, Degui

    2012-11-07

    RNA-protein interaction plays an important role in various cellular processes, such as protein synthesis, gene regulation, post-transcriptional gene regulation, alternative splicing, and infections by RNA viruses. In this study, using Gene Ontology Annotated (GOA) and Structural Classification of Proteins (SCOP) databases an automatic procedure was designed to capture structurally solved RNA-binding protein domains in different subclasses. Subsequently, we applied tuned multi-class SVM (TMCSVM), Random Forest (RF), and multi-class ℓ1/ℓq-regularized logistic regression (MCRLR) for analysis and classifying RNA-binding protein domains based on a comprehensive set of sequence and structural features. In this study, we compared prediction accuracy of three different state-of-the-art predictor methods. From our results, TMCSVM outperforms the other methods and suggests the potential of TMCSVM as a useful tool for facilitating the multi-class prediction of RNA-binding protein domains. On the other hand, MCRLR by elucidating importance of features for their contribution in predictive accuracy of RNA-binding protein domains subclasses, helps us to provide some biological insights into the roles of sequences and structures in protein-RNA interactions.

  5. SCOPe: Manual Curation and Artifact Removal in the Structural Classification of Proteins - extended Database.

    PubMed

    Chandonia, John-Marc; Fox, Naomi K; Brenner, Steven E

    2017-02-03

    SCOPe (Structural Classification of Proteins-extended, http://scop.berkeley.edu) is a database of relationships between protein structures that extends the Structural Classification of Proteins (SCOP) database. SCOP is an expert-curated ordering of domains from the majority of proteins of known structure in a hierarchy according to structural and evolutionary relationships. SCOPe classifies the majority of protein structures released since SCOP development concluded in 2009, using a combination of manual curation and highly precise automated tools, aiming to have the same accuracy as fully hand-curated SCOP releases. SCOPe also incorporates and updates the ASTRAL compendium, which provides several databases and tools to aid in the analysis of the sequences and structures of proteins classified in SCOPe. SCOPe continues high-quality manual classification of new superfamilies, a key feature of SCOP. Artifacts such as expression tags are now separated into their own class, in order to distinguish them from the homology-based annotations in the remainder of the SCOPe hierarchy. SCOPe 2.06 contains 77,439 Protein Data Bank entries, double the 38,221 structures classified in SCOP. Copyright © 2016 The Author(s). Published by Elsevier Ltd.. All rights reserved.

  6. The history of the CATH structural classification of protein domains.

    PubMed

    Sillitoe, Ian; Dawson, Natalie; Thornton, Janet; Orengo, Christine

    2015-12-01

    This article presents a historical review of the protein structure classification database CATH. Together with the SCOP database, CATH remains comprehensive and reasonably up-to-date with the now more than 100,000 protein structures in the PDB. We review the expansion of the CATH and SCOP resources to capture predicted domain structures in the genome sequence data and to provide information on the likely functions of proteins mediated by their constituent domains. The establishment of comprehensive function annotation resources has also meant that domain families can be functionally annotated allowing insights into functional divergence and evolution within protein families. Copyright © 2015 The Authors. Published by Elsevier B.V. All rights reserved.

  7. PCR and RFLP analyses based on the ribosomal protein operon

    USDA-ARS?s Scientific Manuscript database

    Differentiation and classification of phytoplasmas have been primarily based on the highly conserved 16Sr RNA gene. RFLP analysis of 16Sr RNA gene sequences has identified 31 16Sr RNA (16Sr) groups and more than 100 16Sr subgroups. Classification of phytoplasma strains can however, become more refin...

  8. Label noise in subtype discrimination of class C G protein-coupled receptors: A systematic approach to the analysis of classification errors.

    PubMed

    König, Caroline; Cárdenas, Martha I; Giraldo, Jesús; Alquézar, René; Vellido, Alfredo

    2015-09-29

    The characterization of proteins in families and subfamilies, at different levels, entails the definition and use of class labels. When the adscription of a protein to a family is uncertain, or even wrong, this becomes an instance of what has come to be known as a label noise problem. Label noise has a potentially negative effect on any quantitative analysis of proteins that depends on label information. This study investigates class C of G protein-coupled receptors, which are cell membrane proteins of relevance both to biology in general and pharmacology in particular. Their supervised classification into different known subtypes, based on primary sequence data, is hampered by label noise. The latter may stem from a combination of expert knowledge limitations and the lack of a clear correspondence between labels that mostly reflect GPCR functionality and the different representations of the protein primary sequences. In this study, we describe a systematic approach, using Support Vector Machine classifiers, to the analysis of G protein-coupled receptor misclassifications. As a proof of concept, this approach is used to assist the discovery of labeling quality problems in a curated, publicly accessible database of this type of proteins. We also investigate the extent to which physico-chemical transformations of the protein sequences reflect G protein-coupled receptor subtype labeling. The candidate mislabeled cases detected with this approach are externally validated with phylogenetic trees and against further trusted sources such as the National Center for Biotechnology Information, Universal Protein Resource, European Bioinformatics Institute and Ensembl Genome Browser information repositories. In quantitative classification problems, class labels are often by default assumed to be correct. Label noise, though, is bound to be a pervasive problem in bioinformatics, where labels may be obtained indirectly through complex, many-step similarity modelling processes. In the case of G protein-coupled receptors, methods capable of singling out and characterizing those sequences with consistent misclassification behaviour are required to minimize this problem. A systematic, Support Vector Machine-based method has been proposed in this study for such purpose. The proposed method enables a filtering approach to the label noise problem and might become a support tool for database curators in proteomics.

  9. Automated hierarchical classification of protein domain subfamilies based on functionally-divergent residue signatures

    PubMed Central

    2012-01-01

    Background The NCBI Conserved Domain Database (CDD) consists of a collection of multiple sequence alignments of protein domains that are at various stages of being manually curated into evolutionary hierarchies based on conserved and divergent sequence and structural features. These domain models are annotated to provide insights into the relationships between sequence, structure and function via web-based BLAST searches. Results Here we automate the generation of conserved domain (CD) hierarchies using a combination of heuristic and Markov chain Monte Carlo (MCMC) sampling procedures and starting from a (typically very large) multiple sequence alignment. This procedure relies on statistical criteria to define each hierarchy based on the conserved and divergent sequence patterns associated with protein functional-specialization. At the same time this facilitates the sequence and structural annotation of residues that are functionally important. These statistical criteria also provide a means to objectively assess the quality of CD hierarchies, a non-trivial task considering that the protein subgroups are often very distantly related—a situation in which standard phylogenetic methods can be unreliable. Our aim here is to automatically generate (typically sub-optimal) hierarchies that, based on statistical criteria and visual comparisons, are comparable to manually curated hierarchies; this serves as the first step toward the ultimate goal of obtaining optimal hierarchical classifications. A plot of runtimes for the most time-intensive (non-parallelizable) part of the algorithm indicates a nearly linear time complexity so that, even for the extremely large Rossmann fold protein class, results were obtained in about a day. Conclusions This approach automates the rapid creation of protein domain hierarchies and thus will eliminate one of the most time consuming aspects of conserved domain database curation. At the same time, it also facilitates protein domain annotation by identifying those pattern residues that most distinguish each protein domain subgroup from other related subgroups. PMID:22726767

  10. Identification and classification of conopeptides using profile Hidden Markov Models.

    PubMed

    Laht, Silja; Koua, Dominique; Kaplinski, Lauris; Lisacek, Frédérique; Stöcklin, Reto; Remm, Maido

    2012-03-01

    Conopeptides are small toxins produced by predatory marine snails of the genus Conus. They are studied with increasing intensity due to their potential in neurosciences and pharmacology. The number of existing conopeptides is estimated to be 1 million, but only about 1000 have been described to date. Thanks to new high-throughput sequencing technologies the number of known conopeptides is likely to increase exponentially in the near future. There is therefore a need for a fast and accurate computational method for identification and classification of the novel conopeptides in large data sets. 62 profile Hidden Markov Models (pHMMs) were built for prediction and classification of all described conopeptide superfamilies and families, based on the different parts of the corresponding protein sequences. These models showed very high specificity in detection of new peptides. 56 out of 62 models do not give a single false positive in a test with the entire UniProtKB/Swiss-Prot protein sequence database. Our study demonstrates the usefulness of mature peptide models for automatic classification with accuracy of 96% for the mature peptide models and 100% for the pro- and signal peptide models. Our conopeptide profile HMMs can be used for finding and annotation of new conopeptides from large datasets generated by transcriptome or genome sequencing. To our knowledge this is the first time this kind of computational method has been applied to predict all known conopeptide superfamilies and some conopeptide families. Copyright © 2012 Elsevier B.V. All rights reserved.

  11. The ASTRAL Compendium in 2004

    DOE R&D Accomplishments Database

    Chandonia, John-Marc; Hon, Gary; Walker, Nigel S.; Lo Conte, Loredana; Koehl, Patrice; Levitt, Michael; Brenner, Steven E.

    2003-09-15

    The ASTRAL compendium provides several databases and tools to aid in the analysis of protein structures, particularly through the use of their sequences. Partially derived from the SCOP database of protein structure domains, it includes sequences for each domain and other resources useful for studying these sequences and domain structures. The current release of ASTRAL contains 54,745 domains, more than three times as many as the initial release four years ago. ASTRAL has undergone major transformations in the past two years. In addition to several complete updates each year, ASTRAL is now updated on a weekly basis with preliminary classifications of domains from newly released PDB structures. These classifications are available as a stand-alone database, as well as available integrated into other ASTRAL databases such as representative subsets. To enhance the utility of ASTRAL to structural biologists, all SCOP domains are now made available as PDB-style coordinate files as well as sequences. In addition to sequences and representative subsets based on SCOP domains, sequences and subsets based on PDB chains are newly included in ASTRAL. Several search tools have been added to ASTRAL to facilitate retrieval of data by individual users and automated methods.

  12. A novel alignment-free method to classify protein folding types by combining spectral graph clustering with Chou's pseudo amino acid composition.

    PubMed

    Tripathi, Pooja; Pandey, Paras N

    2017-07-07

    The present work employs pseudo amino acid composition (PseAAC) for encoding the protein sequences in their numeric form. Later this will be arranged in the similarity matrix, which serves as input for spectral graph clustering method. Spectral methods are used previously also for clustering of protein sequences, but they uses pair wise alignment scores of protein sequences, in similarity matrix. The alignment score depends on the length of sequences, so clustering short and long sequences together may not good idea. Therefore the idea of introducing PseAAC with spectral clustering algorithm came into scene. We extensively tested our method and compared its performance with other existing machine learning methods. It is consistently observed that, the number of clusters that we obtained for a given set of proteins is close to the number of superfamilies in that set and PseAAC combined with spectral graph clustering shows the best classification results. Copyright © 2017 Elsevier Ltd. All rights reserved.

  13. PlantTribes: a gene and gene family resource for comparative genomics in plants

    PubMed Central

    Wall, P. Kerr; Leebens-Mack, Jim; Müller, Kai F.; Field, Dawn; Altman, Naomi S.; dePamphilis, Claude W.

    2008-01-01

    The PlantTribes database (http://fgp.huck.psu.edu/tribe.html) is a plant gene family database based on the inferred proteomes of five sequenced plant species: Arabidopsis thaliana, Carica papaya, Medicago truncatula, Oryza sativa and Populus trichocarpa. We used the graph-based clustering algorithm MCL [Van Dongen (Technical Report INS-R0010 2000) and Enright et al. (Nucleic Acids Res. 2002; 30: 1575–1584)] to classify all of these species’ protein-coding genes into putative gene families, called tribes, using three clustering stringencies (low, medium and high). For all tribes, we have generated protein and DNA alignments and maximum-likelihood phylogenetic trees. A parallel database of microarray experimental results is linked to the genes, which lets researchers identify groups of related genes and their expression patterns. Unified nomenclatures were developed, and tribes can be related to traditional gene families and conserved domain identifiers. SuperTribes, constructed through a second iteration of MCL clustering, connect distant, but potentially related gene clusters. The global classification of nearly 200 000 plant proteins was used as a scaffold for sorting ∼4 million additional cDNA sequences from over 200 plant species. All data and analyses are accessible through a flexible interface allowing users to explore the classification, to place query sequences within the classification, and to download results for further study. PMID:18073194

  14. Pfarao: a web application for protein family analysis customized for cytoskeletal and motor proteins (CyMoBase).

    PubMed

    Odronitz, Florian; Kollmar, Martin

    2006-11-29

    Annotation of protein sequences of eukaryotic organisms is crucial for the understanding of their function in the cell. Manual annotation is still by far the most accurate way to correctly predict genes. The classification of protein sequences, their phylogenetic relation and the assignment of function involves information from various sources. This often leads to a collection of heterogeneous data, which is hard to track. Cytoskeletal and motor proteins consist of large and diverse superfamilies comprising up to several dozen members per organism. Up to date there is no integrated tool available to assist in the manual large-scale comparative genomic analysis of protein families. Pfarao (Protein Family Application for Retrieval, Analysis and Organisation) is a database driven online working environment for the analysis of manually annotated protein sequences and their relationship. Currently, the system can store and interrelate a wide range of information about protein sequences, species, phylogenetic relations and sequencing projects as well as links to literature and domain predictions. Sequences can be imported from multiple sequence alignments that are generated during the annotation process. A web interface allows to conveniently browse the database and to compile tabular and graphical summaries of its content. We implemented a protein sequence-centric web application to store, organize, interrelate, and present heterogeneous data that is generated in manual genome annotation and comparative genomics. The application has been developed for the analysis of cytoskeletal and motor proteins (CyMoBase) but can easily be adapted for any protein.

  15. LZW-Kernel: fast kernel utilizing variable length code blocks from LZW compressors for protein sequence classification.

    PubMed

    Filatov, Gleb; Bauwens, Bruno; Kertész-Farkas, Attila

    2018-05-07

    Bioinformatics studies often rely on similarity measures between sequence pairs, which often pose a bottleneck in large-scale sequence analysis. Here, we present a new convolutional kernel function for protein sequences called the LZW-Kernel. It is based on code words identified with the Lempel-Ziv-Welch (LZW) universal text compressor. The LZW-Kernel is an alignment-free method, it is always symmetric, is positive, always provides 1.0 for self-similarity and it can directly be used with Support Vector Machines (SVMs) in classification problems, contrary to normalized compression distance (NCD), which often violates the distance metric properties in practice and requires further techniques to be used with SVMs. The LZW-Kernel is a one-pass algorithm, which makes it particularly plausible for big data applications. Our experimental studies on remote protein homology detection and protein classification tasks reveal that the LZW-Kernel closely approaches the performance of the Local Alignment Kernel (LAK) and the SVM-pairwise method combined with Smith-Waterman (SW) scoring at a fraction of the time. Moreover, the LZW-Kernel outperforms the SVM-pairwise method when combined with BLAST scores, which indicates that the LZW code words might be a better basis for similarity measures than local alignment approximations found with BLAST. In addition, the LZW-Kernel outperforms n-gram based mismatch kernels, hidden Markov model based SAM and Fisher kernel, and protein family based PSI-BLAST, among others. Further advantages include the LZW-Kernel's reliance on a simple idea, its ease of implementation, and its high speed, three times faster than BLAST and several magnitudes faster than SW or LAK in our tests. LZW-Kernel is implemented as a standalone C code and is a free open-source program distributed under GPLv3 license and can be downloaded from https://github.com/kfattila/LZW-Kernel. akerteszfarkas@hse.ru. Supplementary data are available at Bioinformatics Online.

  16. HMMBinder: DNA-Binding Protein Prediction Using HMM Profile Based Features.

    PubMed

    Zaman, Rianon; Chowdhury, Shahana Yasmin; Rashid, Mahmood A; Sharma, Alok; Dehzangi, Abdollah; Shatabda, Swakkhar

    2017-01-01

    DNA-binding proteins often play important role in various processes within the cell. Over the last decade, a wide range of classification algorithms and feature extraction techniques have been used to solve this problem. In this paper, we propose a novel DNA-binding protein prediction method called HMMBinder. HMMBinder uses monogram and bigram features extracted from the HMM profiles of the protein sequences. To the best of our knowledge, this is the first application of HMM profile based features for the DNA-binding protein prediction problem. We applied Support Vector Machines (SVM) as a classification technique in HMMBinder. Our method was tested on standard benchmark datasets. We experimentally show that our method outperforms the state-of-the-art methods found in the literature.

  17. Systematic Analysis of Primary Sequence Domain Segments for the Discrimination Between Class C GPCR Subtypes.

    PubMed

    König, Caroline; Alquézar, René; Vellido, Alfredo; Giraldo, Jesús

    2018-03-01

    G-protein-coupled receptors (GPCRs) are a large and diverse super-family of eukaryotic cell membrane proteins that play an important physiological role as transmitters of extracellular signal. In this paper, we investigate Class C, a member of this super-family that has attracted much attention in pharmacology. The limited knowledge about the complete 3D crystal structure of Class C receptors makes necessary the use of their primary amino acid sequences for analytical purposes. Here, we provide a systematic analysis of distinct receptor sequence segments with regard to their ability to differentiate between seven class C GPCR subtypes according to their topological location in the extracellular, transmembrane, or intracellular domains. We build on the results from the previous research that provided preliminary evidence of the potential use of separated domains of complete class C GPCR sequences as the basis for subtype classification. The use of the extracellular N-terminus domain alone was shown to result in a minor decrease in subtype discrimination in comparison with the complete sequence, despite discarding much of the sequence information. In this paper, we describe the use of Support Vector Machine-based classification models to evaluate the subtype-discriminating capacity of the specific topological sequence segments.

  18. MIPS: analysis and annotation of proteins from whole genomes.

    PubMed

    Mewes, H W; Amid, C; Arnold, R; Frishman, D; Güldener, U; Mannhaupt, G; Münsterkötter, M; Pagel, P; Strack, N; Stümpflen, V; Warfsmann, J; Ruepp, A

    2004-01-01

    The Munich Information Center for Protein Sequences (MIPS-GSF), Neuherberg, Germany, provides protein sequence-related information based on whole-genome analysis. The main focus of the work is directed toward the systematic organization of sequence-related attributes as gathered by a variety of algorithms, primary information from experimental data together with information compiled from the scientific literature. MIPS maintains automatically generated and manually annotated genome-specific databases, develops systematic classification schemes for the functional annotation of protein sequences and provides tools for the comprehensive analysis of protein sequences. This report updates the information on the yeast genome (CYGD), the Neurospora crassa genome (MNCDB), the database of complete cDNAs (German Human Genome Project, NGFN), the database of mammalian protein-protein interactions (MPPI), the database of FASTA homologies (SIMAP), and the interface for the fast retrieval of protein-associated information (QUIPOS). The Arabidopsis thaliana database, the rice database, the plant EST databases (MATDB, MOsDB, SPUTNIK), as well as the databases for the comprehensive set of genomes (PEDANT genomes) are described elsewhere in the 2003 and 2004 NAR database issues, respectively. All databases described, and the detailed descriptions of our projects can be accessed through the MIPS web server (http://mips.gsf.de).

  19. Pfarao: a web application for protein family analysis customized for cytoskeletal and motor proteins (CyMoBase)

    PubMed Central

    Odronitz, Florian; Kollmar, Martin

    2006-01-01

    Background Annotation of protein sequences of eukaryotic organisms is crucial for the understanding of their function in the cell. Manual annotation is still by far the most accurate way to correctly predict genes. The classification of protein sequences, their phylogenetic relation and the assignment of function involves information from various sources. This often leads to a collection of heterogeneous data, which is hard to track. Cytoskeletal and motor proteins consist of large and diverse superfamilies comprising up to several dozen members per organism. Up to date there is no integrated tool available to assist in the manual large-scale comparative genomic analysis of protein families. Description Pfarao (Protein Family Application for Retrieval, Analysis and Organisation) is a database driven online working environment for the analysis of manually annotated protein sequences and their relationship. Currently, the system can store and interrelate a wide range of information about protein sequences, species, phylogenetic relations and sequencing projects as well as links to literature and domain predictions. Sequences can be imported from multiple sequence alignments that are generated during the annotation process. A web interface allows to conveniently browse the database and to compile tabular and graphical summaries of its content. Conclusion We implemented a protein sequence-centric web application to store, organize, interrelate, and present heterogeneous data that is generated in manual genome annotation and comparative genomics. The application has been developed for the analysis of cytoskeletal and motor proteins (CyMoBase) but can easily be adapted for any protein. PMID:17134497

  20. MIPS: analysis and annotation of proteins from whole genomes

    PubMed Central

    Mewes, H. W.; Amid, C.; Arnold, R.; Frishman, D.; Güldener, U.; Mannhaupt, G.; Münsterkötter, M.; Pagel, P.; Strack, N.; Stümpflen, V.; Warfsmann, J.; Ruepp, A.

    2004-01-01

    The Munich Information Center for Protein Sequences (MIPS-GSF), Neuherberg, Germany, provides protein sequence-related information based on whole-genome analysis. The main focus of the work is directed toward the systematic organization of sequence-related attributes as gathered by a variety of algorithms, primary information from experimental data together with information compiled from the scientific literature. MIPS maintains automatically generated and manually annotated genome-specific databases, develops systematic classification schemes for the functional annotation of protein sequences and provides tools for the comprehensive analysis of protein sequences. This report updates the information on the yeast genome (CYGD), the Neurospora crassa genome (MNCDB), the database of complete cDNAs (German Human Genome Project, NGFN), the database of mammalian protein–protein interactions (MPPI), the database of FASTA homologies (SIMAP), and the interface for the fast retrieval of protein-associated information (QUIPOS). The Arabidopsis thaliana database, the rice database, the plant EST databases (MATDB, MOsDB, SPUTNIK), as well as the databases for the comprehensive set of genomes (PEDANT genomes) are described elsewhere in the 2003 and 2004 NAR database issues, respectively. All databases described, and the detailed descriptions of our projects can be accessed through the MIPS web server (http://mips.gsf.de). PMID:14681354

  1. The SUPERFAMILY database in 2004: additions and improvements.

    PubMed

    Madera, Martin; Vogel, Christine; Kummerfeld, Sarah K; Chothia, Cyrus; Gough, Julian

    2004-01-01

    The SUPERFAMILY database provides structural assignments to protein sequences and a framework for analysis of the results. At the core of the database is a library of profile Hidden Markov Models that represent all proteins of known structure. The library is based on the SCOP classification of proteins: each model corresponds to a SCOP domain and aims to represent an entire superfamily. We have applied the library to predicted proteins from all completely sequenced genomes (currently 154), the Swiss-Prot and TrEMBL databases and other sequence collections. Close to 60% of all proteins have at least one match, and one half of all residues are covered by assignments. All models and full results are available for download and online browsing at http://supfam.org. Users can study the distribution of their superfamily of interest across all completely sequenced genomes, investigate with which other superfamilies it combines and retrieve proteins in which it occurs. Alternatively, concentrating on a particular genome as a whole, it is possible first, to find out its superfamily composition, and secondly, to compare it with that of other genomes to detect superfamilies that are over- or under-represented. In addition, the webserver provides the following standard services: sequence search; keyword search for genomes, superfamilies and sequence identifiers; and multiple alignment of genomic, PDB and custom sequences.

  2. MIPS: a database for genomes and protein sequences

    PubMed Central

    Mewes, H. W.; Frishman, D.; Güldener, U.; Mannhaupt, G.; Mayer, K.; Mokrejs, M.; Morgenstern, B.; Münsterkötter, M.; Rudd, S.; Weil, B.

    2002-01-01

    The Munich Information Center for Protein Sequences (MIPS-GSF, Neuherberg, Germany) continues to provide genome-related information in a systematic way. MIPS supports both national and European sequencing and functional analysis projects, develops and maintains automatically generated and manually annotated genome-specific databases, develops systematic classification schemes for the functional annotation of protein sequences, and provides tools for the comprehensive analysis of protein sequences. This report updates the information on the yeast genome (CYGD), the Neurospora crassa genome (MNCDB), the databases for the comprehensive set of genomes (PEDANT genomes), the database of annotated human EST clusters (HIB), the database of complete cDNAs from the DHGP (German Human Genome Project), as well as the project specific databases for the GABI (Genome Analysis in Plants) and HNB (Helmholtz–Netzwerk Bioinformatik) networks. The Arabidospsis thaliana database (MATDB), the database of mitochondrial proteins (MITOP) and our contribution to the PIR International Protein Sequence Database have been described elsewhere [Schoof et al. (2002) Nucleic Acids Res., 30, 91–93; Scharfe et al. (2000) Nucleic Acids Res., 28, 155–158; Barker et al. (2001) Nucleic Acids Res., 29, 29–32]. All databases described, the protein analysis tools provided and the detailed descriptions of our projects can be accessed through the MIPS World Wide Web server (http://mips.gsf.de). PMID:11752246

  3. MIPS: a database for genomes and protein sequences.

    PubMed

    Mewes, H W; Frishman, D; Güldener, U; Mannhaupt, G; Mayer, K; Mokrejs, M; Morgenstern, B; Münsterkötter, M; Rudd, S; Weil, B

    2002-01-01

    The Munich Information Center for Protein Sequences (MIPS-GSF, Neuherberg, Germany) continues to provide genome-related information in a systematic way. MIPS supports both national and European sequencing and functional analysis projects, develops and maintains automatically generated and manually annotated genome-specific databases, develops systematic classification schemes for the functional annotation of protein sequences, and provides tools for the comprehensive analysis of protein sequences. This report updates the information on the yeast genome (CYGD), the Neurospora crassa genome (MNCDB), the databases for the comprehensive set of genomes (PEDANT genomes), the database of annotated human EST clusters (HIB), the database of complete cDNAs from the DHGP (German Human Genome Project), as well as the project specific databases for the GABI (Genome Analysis in Plants) and HNB (Helmholtz-Netzwerk Bioinformatik) networks. The Arabidospsis thaliana database (MATDB), the database of mitochondrial proteins (MITOP) and our contribution to the PIR International Protein Sequence Database have been described elsewhere [Schoof et al. (2002) Nucleic Acids Res., 30, 91-93; Scharfe et al. (2000) Nucleic Acids Res., 28, 155-158; Barker et al. (2001) Nucleic Acids Res., 29, 29-32]. All databases described, the protein analysis tools provided and the detailed descriptions of our projects can be accessed through the MIPS World Wide Web server (http://mips.gsf.de).

  4. Large-scale gene function analysis with the PANTHER classification system.

    PubMed

    Mi, Huaiyu; Muruganujan, Anushya; Casagrande, John T; Thomas, Paul D

    2013-08-01

    The PANTHER (protein annotation through evolutionary relationship) classification system (http://www.pantherdb.org/) is a comprehensive system that combines gene function, ontology, pathways and statistical analysis tools that enable biologists to analyze large-scale, genome-wide data from sequencing, proteomics or gene expression experiments. The system is built with 82 complete genomes organized into gene families and subfamilies, and their evolutionary relationships are captured in phylogenetic trees, multiple sequence alignments and statistical models (hidden Markov models or HMMs). Genes are classified according to their function in several different ways: families and subfamilies are annotated with ontology terms (Gene Ontology (GO) and PANTHER protein class), and sequences are assigned to PANTHER pathways. The PANTHER website includes a suite of tools that enable users to browse and query gene functions, and to analyze large-scale experimental data with a number of statistical tests. It is widely used by bench scientists, bioinformaticians, computer scientists and systems biologists. In the 2013 release of PANTHER (v.8.0), in addition to an update of the data content, we redesigned the website interface to improve both user experience and the system's analytical capability. This protocol provides a detailed description of how to analyze genome-wide experimental data with the PANTHER classification system.

  5. Probabilistic grammatical model for helix‐helix contact site classification

    PubMed Central

    2013-01-01

    Background Hidden Markov Models power many state‐of‐the‐art tools in the field of protein bioinformatics. While excelling in their tasks, these methods of protein analysis do not convey directly information on medium‐ and long‐range residue‐residue interactions. This requires an expressive power of at least context‐free grammars. However, application of more powerful grammar formalisms to protein analysis has been surprisingly limited. Results In this work, we present a probabilistic grammatical framework for problem‐specific protein languages and apply it to classification of transmembrane helix‐helix pairs configurations. The core of the model consists of a probabilistic context‐free grammar, automatically inferred by a genetic algorithm from only a generic set of expert‐based rules and positive training samples. The model was applied to produce sequence based descriptors of four classes of transmembrane helix‐helix contact site configurations. The highest performance of the classifiers reached AUCROC of 0.70. The analysis of grammar parse trees revealed the ability of representing structural features of helix‐helix contact sites. Conclusions We demonstrated that our probabilistic context‐free framework for analysis of protein sequences outperforms the state of the art in the task of helix‐helix contact site classification. However, this is achieved without necessarily requiring modeling long range dependencies between interacting residues. A significant feature of our approach is that grammar rules and parse trees are human‐readable. Thus they could provide biologically meaningful information for molecular biologists. PMID:24350601

  6. DWARF – a data warehouse system for analyzing protein families

    PubMed Central

    Fischer, Markus; Thai, Quan K; Grieb, Melanie; Pleiss, Jürgen

    2006-01-01

    Background The emerging field of integrative bioinformatics provides the tools to organize and systematically analyze vast amounts of highly diverse biological data and thus allows to gain a novel understanding of complex biological systems. The data warehouse DWARF applies integrative bioinformatics approaches to the analysis of large protein families. Description The data warehouse system DWARF integrates data on sequence, structure, and functional annotation for protein fold families. The underlying relational data model consists of three major sections representing entities related to the protein (biochemical function, source organism, classification to homologous families and superfamilies), the protein sequence (position-specific annotation, mutant information), and the protein structure (secondary structure information, superimposed tertiary structure). Tools for extracting, transforming and loading data from public available resources (ExPDB, GenBank, DSSP) are provided to populate the database. The data can be accessed by an interface for searching and browsing, and by analysis tools that operate on annotation, sequence, or structure. We applied DWARF to the family of α/β-hydrolases to host the Lipase Engineering database. Release 2.3 contains 6138 sequences and 167 experimentally determined protein structures, which are assigned to 37 superfamilies 103 homologous families. Conclusion DWARF has been designed for constructing databases of large structurally related protein families and for evaluating their sequence-structure-function relationships by a systematic analysis of sequence, structure and functional annotation. It has been applied to predict biochemical properties from sequence, and serves as a valuable tool for protein engineering. PMID:17094801

  7. The Protein Information Resource: an integrated public resource of functional annotation of proteins

    PubMed Central

    Wu, Cathy H.; Huang, Hongzhan; Arminski, Leslie; Castro-Alvear, Jorge; Chen, Yongxing; Hu, Zhang-Zhi; Ledley, Robert S.; Lewis, Kali C.; Mewes, Hans-Werner; Orcutt, Bruce C.; Suzek, Baris E.; Tsugita, Akira; Vinayaka, C. R.; Yeh, Lai-Su L.; Zhang, Jian; Barker, Winona C.

    2002-01-01

    The Protein Information Resource (PIR) serves as an integrated public resource of functional annotation of protein data to support genomic/proteomic research and scientific discovery. The PIR, in collaboration with the Munich Information Center for Protein Sequences (MIPS) and the Japan International Protein Information Database (JIPID), produces the PIR-International Protein Sequence Database (PSD), the major annotated protein sequence database in the public domain, containing about 250 000 proteins. To improve protein annotation and the coverage of experimentally validated data, a bibliography submission system is developed for scientists to submit, categorize and retrieve literature information. Comprehensive protein information is available from iProClass, which includes family classification at the superfamily, domain and motif levels, structural and functional features of proteins, as well as cross-references to over 40 biological databases. To provide timely and comprehensive protein data with source attribution, we have introduced a non-redundant reference protein database, PIR-NREF. The database consists of about 800 000 proteins collected from PIR-PSD, SWISS-PROT, TrEMBL, GenPept, RefSeq and PDB, with composite protein names and literature data. To promote database interoperability, we provide XML data distribution and open database schema, and adopt common ontologies. The PIR web site (http://pir.georgetown.edu/) features data mining and sequence analysis tools for information retrieval and functional identification of proteins based on both sequence and annotation information. The PIR databases and other files are also available by FTP (ftp://nbrfa.georgetown.edu/pir_databases). PMID:11752247

  8. Towards comprehensive structural motif mining for better fold annotation in the "twilight zone" of sequence dissimilarity

    PubMed Central

    Jia, Yi; Huan, Jun; Buhr, Vincent; Zhang, Jintao; Carayannopoulos, Leonidas N

    2009-01-01

    Background Automatic identification of structure fingerprints from a group of diverse protein structures is challenging, especially for proteins whose divergent amino acid sequences may fall into the "twilight-" or "midnight-" zones where pair-wise sequence identities to known sequences fall below 25% and sequence-based functional annotations often fail. Results Here we report a novel graph database mining method and demonstrate its application to protein structure pattern identification and structure classification. The biologic motivation of our study is to recognize common structure patterns in "immunoevasins", proteins mediating virus evasion of host immune defense. Our experimental study, using both viral and non-viral proteins, demonstrates the efficiency and efficacy of the proposed method. Conclusion We present a theoretic framework, offer a practical software implementation for incorporating prior domain knowledge, such as substitution matrices as studied here, and devise an efficient algorithm to identify approximate matched frequent subgraphs. By doing so, we significantly expanded the analytical power of sophisticated data mining algorithms in dealing with large volume of complicated and noisy protein structure data. And without loss of generality, choice of appropriate compatibility matrices allows our method to be easily employed in domains where subgraph labels have some uncertainty. PMID:19208148

  9. Application of a 5-tiered scheme for standardized classification of 2,360 unique mismatch repair gene variants in the InSiGHT locus-specific database.

    PubMed

    Thompson, Bryony A; Spurdle, Amanda B; Plazzer, John-Paul; Greenblatt, Marc S; Akagi, Kiwamu; Al-Mulla, Fahd; Bapat, Bharati; Bernstein, Inge; Capellá, Gabriel; den Dunnen, Johan T; du Sart, Desiree; Fabre, Aurelie; Farrell, Michael P; Farrington, Susan M; Frayling, Ian M; Frebourg, Thierry; Goldgar, David E; Heinen, Christopher D; Holinski-Feder, Elke; Kohonen-Corish, Maija; Robinson, Kristina Lagerstedt; Leung, Suet Yi; Martins, Alexandra; Moller, Pal; Morak, Monika; Nystrom, Minna; Peltomaki, Paivi; Pineda, Marta; Qi, Ming; Ramesar, Rajkumar; Rasmussen, Lene Juel; Royer-Pokora, Brigitte; Scott, Rodney J; Sijmons, Rolf; Tavtigian, Sean V; Tops, Carli M; Weber, Thomas; Wijnen, Juul; Woods, Michael O; Macrae, Finlay; Genuardi, Maurizio

    2014-02-01

    The clinical classification of hereditary sequence variants identified in disease-related genes directly affects clinical management of patients and their relatives. The International Society for Gastrointestinal Hereditary Tumours (InSiGHT) undertook a collaborative effort to develop, test and apply a standardized classification scheme to constitutional variants in the Lynch syndrome-associated genes MLH1, MSH2, MSH6 and PMS2. Unpublished data submission was encouraged to assist in variant classification and was recognized through microattribution. The scheme was refined by multidisciplinary expert committee review of the clinical and functional data available for variants, applied to 2,360 sequence alterations, and disseminated online. Assessment using validated criteria altered classifications for 66% of 12,006 database entries. Clinical recommendations based on transparent evaluation are now possible for 1,370 variants that were not obviously protein truncating from nomenclature. This large-scale endeavor will facilitate the consistent management of families suspected to have Lynch syndrome and demonstrates the value of multidisciplinary collaboration in the curation and classification of variants in public locus-specific databases.

  10. Application of a five-tiered scheme for standardized classification of 2,360 unique mismatch repair gene variants lodged on the InSiGHT locus-specific database

    PubMed Central

    Plazzer, John-Paul; Greenblatt, Marc S.; Akagi, Kiwamu; Al-Mulla, Fahd; Bapat, Bharati; Bernstein, Inge; Capellá, Gabriel; den Dunnen, Johan T.; du Sart, Desiree; Fabre, Aurelie; Farrell, Michael P.; Farrington, Susan M.; Frayling, Ian M.; Frebourg, Thierry; Goldgar, David E.; Heinen, Christopher D.; Holinski-Feder, Elke; Kohonen-Corish, Maija; Robinson, Kristina Lagerstedt; Leung, Suet Yi; Martins, Alexandra; Moller, Pal; Morak, Monika; Nystrom, Minna; Peltomaki, Paivi; Pineda, Marta; Qi, Ming; Ramesar, Rajkumar; Rasmussen, Lene Juel; Royer-Pokora, Brigitte; Scott, Rodney J.; Sijmons, Rolf; Tavtigian, Sean V.; Tops, Carli M.; Weber, Thomas; Wijnen, Juul; Woods, Michael O.; Macrae, Finlay; Genuardi, Maurizio

    2015-01-01

    Clinical classification of sequence variants identified in hereditary disease genes directly affects clinical management of patients and their relatives. The International Society for Gastrointestinal Hereditary Tumours (InSiGHT) undertook a collaborative effort to develop, test and apply a standardized classification scheme to constitutional variants in the Lynch Syndrome genes MLH1, MSH2, MSH6 and PMS2. Unpublished data submission was encouraged to assist variant classification, and recognized by microattribution. The scheme was refined by multidisciplinary expert committee review of clinical and functional data available for variants, applied to 2,360 sequence alterations, and disseminated online. Assessment using validated criteria altered classifications for 66% of 12,006 database entries. Clinical recommendations based on transparent evaluation are now possible for 1,370 variants not obviously protein-truncating from nomenclature. This large-scale endeavor will facilitate consistent management of suspected Lynch Syndrome families, and demonstrates the value of multidisciplinary collaboration for curation and classification of variants in public locus-specific databases. PMID:24362816

  11. Can-CSC-GBE: Developing Cost-sensitive Classifier with Gentleboost Ensemble for breast cancer classification using protein amino acids and imbalanced data.

    PubMed

    Ali, Safdar; Majid, Abdul; Javed, Syed Gibran; Sattar, Mohsin

    2016-06-01

    Early prediction of breast cancer is important for effective treatment and survival. We developed an effective Cost-Sensitive Classifier with GentleBoost Ensemble (Can-CSC-GBE) for the classification of breast cancer using protein amino acid features. In this work, first, discriminant information of the protein sequences related to breast tissue is extracted. Then, the physicochemical properties hydrophobicity and hydrophilicity of amino acids are employed to generate molecule descriptors in different feature spaces. For comparison, we obtained results by combining Cost-Sensitive learning with conventional ensemble of AdaBoostM1 and Bagging. The proposed Can-CSC-GBE system has effectively reduced the misclassification costs and thereby improved the overall classification performance. Our novel approach has highlighted promising results as compared to the state-of-the-art ensemble approaches. Copyright © 2016 Elsevier Ltd. All rights reserved.

  12. Effective Feature Selection for Classification of Promoter Sequences.

    PubMed

    K, Kouser; P G, Lavanya; Rangarajan, Lalitha; K, Acharya Kshitish

    2016-01-01

    Exploring novel computational methods in making sense of biological data has not only been a necessity, but also productive. A part of this trend is the search for more efficient in silico methods/tools for analysis of promoters, which are parts of DNA sequences that are involved in regulation of expression of genes into other functional molecules. Promoter regions vary greatly in their function based on the sequence of nucleotides and the arrangement of protein-binding short-regions called motifs. In fact, the regulatory nature of the promoters seems to be largely driven by the selective presence and/or the arrangement of these motifs. Here, we explore computational classification of promoter sequences based on the pattern of motif distributions, as such classification can pave a new way of functional analysis of promoters and to discover the functionally crucial motifs. We make use of Position Specific Motif Matrix (PSMM) features for exploring the possibility of accurately classifying promoter sequences using some of the popular classification techniques. The classification results on the complete feature set are low, perhaps due to the huge number of features. We propose two ways of reducing features. Our test results show improvement in the classification output after the reduction of features. The results also show that decision trees outperform SVM (Support Vector Machine), KNN (K Nearest Neighbor) and ensemble classifier LibD3C, particularly with reduced features. The proposed feature selection methods outperform some of the popular feature transformation methods such as PCA and SVD. Also, the methods proposed are as accurate as MRMR (feature selection method) but much faster than MRMR. Such methods could be useful to categorize new promoters and explore regulatory mechanisms of gene expressions in complex eukaryotic species.

  13. Phenotype classification of single cells using SRS microscopy, RNA sequencing, and microfluidics (Conference Presentation)

    NASA Astrophysics Data System (ADS)

    Streets, Aaron M.; Cao, Chen; Zhang, Xiannian; Huang, Yanyi

    2016-03-01

    Phenotype classification of single cells reveals biological variation that is masked in ensemble measurement. This heterogeneity is found in gene and protein expression as well as in cell morphology. Many techniques are available to probe phenotypic heterogeneity at the single cell level, for example quantitative imaging and single-cell RNA sequencing, but it is difficult to perform multiple assays on the same single cell. In order to directly track correlation between morphology and gene expression at the single cell level, we developed a microfluidic platform for quantitative coherent Raman imaging and immediate RNA sequencing (RNA-Seq) of single cells. With this device we actively sort and trap cells for analysis with stimulated Raman scattering microscopy (SRS). The cells are then processed in parallel pipelines for lysis, and preparation of cDNA for high-throughput transcriptome sequencing. SRS microscopy offers three-dimensional imaging with chemical specificity for quantitative analysis of protein and lipid distribution in single cells. Meanwhile, the microfluidic platform facilitates single-cell manipulation, minimizes contamination, and furthermore, provides improved RNA-Seq detection sensitivity and measurement precision, which is necessary for differentiating biological variability from technical noise. By combining coherent Raman microscopy with RNA sequencing, we can better understand the relationship between cellular morphology and gene expression at the single-cell level.

  14. Data mining of enzymes using specific peptides

    PubMed Central

    2009-01-01

    Background Predicting the function of a protein from its sequence is a long-standing challenge of bioinformatic research, typically addressed using either sequence-similarity or sequence-motifs. We employ the novel motif method that consists of Specific Peptides (SPs) that are unique to specific branches of the Enzyme Commission (EC) functional classification. We devise the Data Mining of Enzymes (DME) methodology that allows for searching SPs on arbitrary proteins, determining from its sequence whether a protein is an enzyme and what the enzyme's EC classification is. Results We extract novel SP sets from Swiss-Prot enzyme data. Using a training set of July 2006, and test sets of July 2008, we find that the predictive power of SPs, both for true-positives (enzymes) and true-negatives (non-enzymes), depends on the coverage length of all SP matches (the number of amino-acids matched on the protein sequence). DME is quite different from BLAST. Comparing the two on an enzyme test set of July 2008, we find that DME has lower recall. On the other hand, DME can provide predictions for proteins regarded by BLAST as having low homologies with known enzymes, thus supplying complementary information. We test our method on a set of proteins belonging to 10 bacteria, dated July 2008, establishing the usefulness of the coverage-length cutoff to determine true-negatives. Moreover, sifting through our predictions we find that some of them have been substantiated by Swiss-Prot annotations by July 2009. Finally we extract, for production purposes, a novel SP set trained on all Swiss-Prot enzymes as of July 2009. This new set increases considerably the recall of DME. The new SP set is being applied to three metagenomes: Sargasso Sea with over 1,000,000 proteins, producing predictions of over 220,000 enzymes, and two human gut metagenomes. The outcome of these analyses can be characterized by the enzymatic profile of the metagenomes, describing the relative numbers of enzymes observed for different EC categories. Conclusions Employing SPs for predicting enzymatic activity of proteins works well once one utilizes coverage-length criteria. In our analysis, L ≥ 7 has led to highly accurate results. PMID:20034383

  15. A Functional-Phylogenetic Classification System for Transmembrane Solute Transporters

    PubMed Central

    Saier, Milton H.

    2000-01-01

    A comprehensive classification system for transmembrane molecular transporters has been developed and recently approved by the transport panel of the nomenclature committee of the International Union of Biochemistry and Molecular Biology. This system is based on (i) transporter class and subclass (mode of transport and energy coupling mechanism), (ii) protein phylogenetic family and subfamily, and (iii) substrate specificity. Almost all of the more than 250 identified families of transporters include members that function exclusively in transport. Channels (115 families), secondary active transporters (uniporters, symporters, and antiporters) (78 families), primary active transporters (23 families), group translocators (6 families), and transport proteins of ill-defined function or of unknown mechanism (51 families) constitute distinct categories. Transport mode and energy coupling prove to be relatively immutable characteristics and therefore provide primary bases for classification. Phylogenetic grouping reflects structure, function, mechanism, and often substrate specificity and therefore provides a reliable secondary basis for classification. Substrate specificity and polarity of transport prove to be more readily altered during evolutionary history and therefore provide a tertiary basis for classification. With very few exceptions, a phylogenetic family of transporters includes members that function by a single transport mode and energy coupling mechanism, although a variety of substrates may be transported, sometimes with either inwardly or outwardly directed polarity. In this review, I provide cross-referencing of well-characterized constituent transporters according to (i) transport mode, (ii) energy coupling mechanism, (iii) phylogenetic grouping, and (iv) substrates transported. The structural features and distribution of recognized family members throughout the living world are also evaluated. The tabulations should facilitate familial and functional assignments of newly sequenced transport proteins that will result from future genome sequencing projects. PMID:10839820

  16. Function-based classification of carbohydrate-active enzymes by recognition of short, conserved peptide motifs.

    PubMed

    Busk, Peter Kamp; Lange, Lene

    2013-06-01

    Functional prediction of carbohydrate-active enzymes is difficult due to low sequence identity. However, similar enzymes often share a few short motifs, e.g., around the active site, even when the overall sequences are very different. To exploit this notion for functional prediction of carbohydrate-active enzymes, we developed a simple algorithm, peptide pattern recognition (PPR), that can divide proteins into groups of sequences that share a set of short conserved sequences. When this method was used on 118 glycoside hydrolase 5 proteins with 9% average pairwise identity and representing four characterized enzymatic functions, 97% of the proteins were sorted into groups correlating with their enzymatic activity. Furthermore, we analyzed 8,138 glycoside hydrolase 13 proteins including 204 experimentally characterized enzymes with 28 different functions. There was a 91% correlation between group and enzyme activity. These results indicate that the function of carbohydrate-active enzymes can be predicted with high precision by finding short, conserved motifs in their sequences. The glycoside hydrolase 61 family is important for fungal biomass conversion, but only a few proteins of this family have been functionally characterized. Interestingly, PPR divided 743 glycoside hydrolase 61 proteins into 16 subfamilies useful for targeted investigation of the function of these proteins and pinpointed three conserved motifs with putative importance for enzyme activity. Furthermore, the conserved sequences were useful for cloning of new, subfamily-specific glycoside hydrolase 61 proteins from 14 fungi. In conclusion, identification of conserved sequence motifs is a new approach to sequence analysis that can predict carbohydrate-active enzyme functions with high precision.

  17. FastRNABindR: Fast and Accurate Prediction of Protein-RNA Interface Residues.

    PubMed

    El-Manzalawy, Yasser; Abbas, Mostafa; Malluhi, Qutaibah; Honavar, Vasant

    2016-01-01

    A wide range of biological processes, including regulation of gene expression, protein synthesis, and replication and assembly of many viruses are mediated by RNA-protein interactions. However, experimental determination of the structures of protein-RNA complexes is expensive and technically challenging. Hence, a number of computational tools have been developed for predicting protein-RNA interfaces. Some of the state-of-the-art protein-RNA interface predictors rely on position-specific scoring matrix (PSSM)-based encoding of the protein sequences. The computational efforts needed for generating PSSMs severely limits the practical utility of protein-RNA interface prediction servers. In this work, we experiment with two approaches, random sampling and sequence similarity reduction, for extracting a representative reference database of protein sequences from more than 50 million protein sequences in UniRef100. Our results suggest that random sampled databases produce better PSSM profiles (in terms of the number of hits used to generate the profile and the distance of the generated profile to the corresponding profile generated using the entire UniRef100 data as well as the accuracy of the machine learning classifier trained using these profiles). Based on our results, we developed FastRNABindR, an improved version of RNABindR for predicting protein-RNA interface residues using PSSM profiles generated using 1% of the UniRef100 sequences sampled uniformly at random. To the best of our knowledge, FastRNABindR is the only protein-RNA interface residue prediction online server that requires generation of PSSM profiles for query sequences and accepts hundreds of protein sequences per submission. Our approach for determining the optimal BLAST database for a protein-RNA interface residue classification task has the potential of substantially speeding up, and hence increasing the practical utility of, other amino acid sequence based predictors of protein-protein and protein-DNA interfaces.

  18. Granular support vector machines with association rules mining for protein homology prediction.

    PubMed

    Tang, Yuchun; Jin, Bo; Zhang, Yan-Qing

    2005-01-01

    Protein homology prediction between protein sequences is one of critical problems in computational biology. Such a complex classification problem is common in medical or biological information processing applications. How to build a model with superior generalization capability from training samples is an essential issue for mining knowledge to accurately predict/classify unseen new samples and to effectively support human experts to make correct decisions. A new learning model called granular support vector machines (GSVM) is proposed based on our previous work. GSVM systematically and formally combines the principles from statistical learning theory and granular computing theory and thus provides an interesting new mechanism to address complex classification problems. It works by building a sequence of information granules and then building support vector machines (SVM) in some of these information granules on demand. A good granulation method to find suitable granules is crucial for modeling a GSVM with good performance. In this paper, we also propose an association rules-based granulation method. For the granules induced by association rules with high enough confidence and significant support, we leave them as they are because of their high "purity" and significant effect on simplifying the classification task. For every other granule, a SVM is modeled to discriminate the corresponding data. In this way, a complex classification problem is divided into multiple smaller problems so that the learning task is simplified. The proposed algorithm, here named GSVM-AR, is compared with SVM by KDDCUP04 protein homology prediction data. The experimental results show that finding the splitting hyperplane is not a trivial task (we should be careful to select the association rules to avoid overfitting) and GSVM-AR does show significant improvement compared to building one single SVM in the whole feature space. Another advantage is that the utility of GSVM-AR is very good because it is easy to be implemented. More importantly and more interestingly, GSVM provides a new mechanism to address complex classification problems.

  19. Accurate Identification of Cancerlectins through Hybrid Machine Learning Technology.

    PubMed

    Zhang, Jieru; Ju, Ying; Lu, Huijuan; Xuan, Ping; Zou, Quan

    2016-01-01

    Cancerlectins are cancer-related proteins that function as lectins. They have been identified through computational identification techniques, but these techniques have sometimes failed to identify proteins because of sequence diversity among the cancerlectins. Advanced machine learning identification methods, such as support vector machine and basic sequence features (n-gram), have also been used to identify cancerlectins. In this study, various protein fingerprint features and advanced classifiers, including ensemble learning techniques, were utilized to identify this group of proteins. We improved the prediction accuracy of the original feature extraction methods and classification algorithms by more than 10% on average. Our work provides a basis for the computational identification of cancerlectins and reveals the power of hybrid machine learning techniques in computational proteomics.

  20. LenVarDB: database of length-variant protein domains.

    PubMed

    Mutt, Eshita; Mathew, Oommen K; Sowdhamini, Ramanathan

    2014-01-01

    Protein domains are functionally and structurally independent modules, which add to the functional variety of proteins. This array of functional diversity has been enabled by evolutionary changes, such as amino acid substitutions or insertions or deletions, occurring in these protein domains. Length variations (indels) can introduce changes at structural, functional and interaction levels. LenVarDB (freely available at http://caps.ncbs.res.in/lenvardb/) traces these length variations, starting from structure-based sequence alignments in our Protein Alignments organized as Structural Superfamilies (PASS2) database, across 731 structural classification of proteins (SCOP)-based protein domain superfamilies connected to 2 730 625 sequence homologues. Alignment of sequence homologues corresponding to a structural domain is available, starting from a structure-based sequence alignment of the superfamily. Orientation of the length-variant (indel) regions in protein domains can be visualized by mapping them on the structure and on the alignment. Knowledge about location of length variations within protein domains and their visual representation will be useful in predicting changes within structurally or functionally relevant sites, which may ultimately regulate protein function. Non-technical summary: Evolutionary changes bring about natural changes to proteins that may be found in many organisms. Such changes could be reflected as amino acid substitutions or insertions-deletions (indels) in protein sequences. LenVarDB is a database that provides an early overview of observed length variations that were set among 731 protein families and after examining >2 million sequences. Indels are followed up to observe if they are close to the active site such that they can affect the activity of proteins. Inclusion of such information can aid the design of bioengineering experiments.

  1. FARME DB: a functional antibiotic resistance element database

    PubMed Central

    Wallace, James C.; Port, Jesse A.; Smith, Marissa N.; Faustman, Elaine M.

    2017-01-01

    Antibiotic resistance (AR) is a major global public health threat but few resources exist that catalog AR genes outside of a clinical context. Current AR sequence databases are assembled almost exclusively from genomic sequences derived from clinical bacterial isolates and thus do not include many microbial sequences derived from environmental samples that confer resistance in functional metagenomic studies. These environmental metagenomic sequences often show little or no similarity to AR sequences from clinical isolates using standard classification criteria. In addition, existing AR databases provide no information about flanking sequences containing regulatory or mobile genetic elements. To help address this issue, we created an annotated database of DNA and protein sequences derived exclusively from environmental metagenomic sequences showing AR in laboratory experiments. Our Functional Antibiotic Resistant Metagenomic Element (FARME) database is a compilation of publically available DNA sequences and predicted protein sequences conferring AR as well as regulatory elements, mobile genetic elements and predicted proteins flanking antibiotic resistant genes. FARME is the first database to focus on functional metagenomic AR gene elements and provides a resource to better understand AR in the 99% of bacteria which cannot be cultured and the relationship between environmental AR sequences and antibiotic resistant genes derived from cultured isolates. Database URL: http://staff.washington.edu/jwallace/farme PMID:28077567

  2. CDD/SPARCLE: functional classification of proteins via subfamily domain architectures.

    PubMed

    Marchler-Bauer, Aron; Bo, Yu; Han, Lianyi; He, Jane; Lanczycki, Christopher J; Lu, Shennan; Chitsaz, Farideh; Derbyshire, Myra K; Geer, Renata C; Gonzales, Noreen R; Gwadz, Marc; Hurwitz, David I; Lu, Fu; Marchler, Gabriele H; Song, James S; Thanki, Narmada; Wang, Zhouxi; Yamashita, Roxanne A; Zhang, Dachuan; Zheng, Chanjuan; Geer, Lewis Y; Bryant, Stephen H

    2017-01-04

    NCBI's Conserved Domain Database (CDD) aims at annotating biomolecular sequences with the location of evolutionarily conserved protein domain footprints, and functional sites inferred from such footprints. An archive of pre-computed domain annotation is maintained for proteins tracked by NCBI's Entrez database, and live search services are offered as well. CDD curation staff supplements a comprehensive collection of protein domain and protein family models, which have been imported from external providers, with representations of selected domain families that are curated in-house and organized into hierarchical classifications of functionally distinct families and sub-families. CDD also supports comparative analyses of protein families via conserved domain architectures, and a recent curation effort focuses on providing functional characterizations of distinct subfamily architectures using SPARCLE: Subfamily Protein Architecture Labeling Engine. CDD can be accessed at https://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml. Published by Oxford University Press on behalf of Nucleic Acids Research 2016. This work is written by (a) US Government employee(s) and is in the public domain in the US.

  3. SVM-Prot 2016: A Web-Server for Machine Learning Prediction of Protein Functional Families from Sequence Irrespective of Similarity.

    PubMed

    Li, Ying Hong; Xu, Jing Yu; Tao, Lin; Li, Xiao Feng; Li, Shuang; Zeng, Xian; Chen, Shang Ying; Zhang, Peng; Qin, Chu; Zhang, Cheng; Chen, Zhe; Zhu, Feng; Chen, Yu Zong

    2016-01-01

    Knowledge of protein function is important for biological, medical and therapeutic studies, but many proteins are still unknown in function. There is a need for more improved functional prediction methods. Our SVM-Prot web-server employed a machine learning method for predicting protein functional families from protein sequences irrespective of similarity, which complemented those similarity-based and other methods in predicting diverse classes of proteins including the distantly-related proteins and homologous proteins of different functions. Since its publication in 2003, we made major improvements to SVM-Prot with (1) expanded coverage from 54 to 192 functional families, (2) more diverse protein descriptors protein representation, (3) improved predictive performances due to the use of more enriched training datasets and more variety of protein descriptors, (4) newly integrated BLAST analysis option for assessing proteins in the SVM-Prot predicted functional families that were similar in sequence to a query protein, and (5) newly added batch submission option for supporting the classification of multiple proteins. Moreover, 2 more machine learning approaches, K nearest neighbor and probabilistic neural networks, were added for facilitating collective assessment of protein functions by multiple methods. SVM-Prot can be accessed at http://bidd2.nus.edu.sg/cgi-bin/svmprot/svmprot.cgi.

  4. HIPPI: highly accurate protein family classification with ensembles of HMMs.

    PubMed

    Nguyen, Nam-Phuong; Nute, Michael; Mirarab, Siavash; Warnow, Tandy

    2016-11-11

    Given a new biological sequence, detecting membership in a known family is a basic step in many bioinformatics analyses, with applications to protein structure and function prediction and metagenomic taxon identification and abundance profiling, among others. Yet family identification of sequences that are distantly related to sequences in public databases or that are fragmentary remains one of the more difficult analytical problems in bioinformatics. We present a new technique for family identification called HIPPI (Hierarchical Profile Hidden Markov Models for Protein family Identification). HIPPI uses a novel technique to represent a multiple sequence alignment for a given protein family or superfamily by an ensemble of profile hidden Markov models computed using HMMER. An evaluation of HIPPI on the Pfam database shows that HIPPI has better overall precision and recall than blastp, HMMER, and pipelines based on HHsearch, and maintains good accuracy even for fragmentary query sequences and for protein families with low average pairwise sequence identity, both conditions where other methods degrade in accuracy. HIPPI provides accurate protein family identification and is robust to difficult model conditions. Our results, combined with observations from previous studies, show that ensembles of profile Hidden Markov models can better represent multiple sequence alignments than a single profile Hidden Markov model, and thus can improve downstream analyses for various bioinformatic tasks. Further research is needed to determine the best practices for building the ensemble of profile Hidden Markov models. HIPPI is available on GitHub at https://github.com/smirarab/sepp .

  5. A phylogenomic approach to bacterial subspecies classification: proof of concept in Mycobacterium abscessus.

    PubMed

    Tan, Joon Liang; Khang, Tsung Fei; Ngeow, Yun Fong; Choo, Siew Woh

    2013-12-13

    Mycobacterium abscessus is a rapidly growing mycobacterium that is often associated with human infections. The taxonomy of this species has undergone several revisions and is still being debated. In this study, we sequenced the genomes of 12 M. abscessus strains and used phylogenomic analysis to perform subspecies classification. A data mining approach was used to rank and select informative genes based on the relative entropy metric for the construction of a phylogenetic tree. The resulting tree topology was similar to that generated using the concatenation of five classical housekeeping genes: rpoB, hsp65, secA, recA and sodA. Additional support for the reliability of the subspecies classification came from the analysis of erm41 and ITS gene sequences, single nucleotide polymorphisms (SNPs)-based classification and strain clustering demonstrated by a variable number tandem repeat (VNTR) assay and a multilocus sequence analysis (MLSA). We subsequently found that the concatenation of a minimal set of three median-ranked genes: DNA polymerase III subunit alpha (polC), 4-hydroxy-2-ketovalerate aldolase (Hoa) and cell division protein FtsZ (ftsZ), is sufficient to recover the same tree topology. PCR assays designed specifically for these genes showed that all three genes could be amplified in the reference strain of M. abscessus ATCC 19977T. This study provides proof of concept that whole-genome sequence-based data mining approach can provide confirmatory evidence of the phylogenetic informativeness of existing markers, as well as lead to the discovery of a more economical and informative set of markers that produces similar subspecies classification in M. abscessus. The systematic procedure used in this study to choose the informative minimal set of gene markers can potentially be applied to species or subspecies classification of other bacteria.

  6. On Utilizing Optimal and Information Theoretic Syntactic Modeling for Peptide Classification

    NASA Astrophysics Data System (ADS)

    Aygün, Eser; Oommen, B. John; Cataltepe, Zehra

    Syntactic methods in pattern recognition have been used extensively in bioinformatics, and in particular, in the analysis of gene and protein expressions, and in the recognition and classification of bio-sequences. These methods are almost universally distance-based. This paper concerns the use of an Optimal and Information Theoretic (OIT) probabilistic model [11] to achieve peptide classification using the information residing in their syntactic representations. The latter has traditionally been achieved using the edit distances required in the respective peptide comparisons. We advocate that one can model the differences between compared strings as a mutation model consisting of random Substitutions, Insertions and Deletions (SID) obeying the OIT model. Thus, in this paper, we show that the probability measure obtained from the OIT model can be perceived as a sequence similarity metric, using which a Support Vector Machine (SVM)-based peptide classifier, referred to as OIT_SVM, can be devised.

  7. The complete genome sequence and genetic analysis of ΦCA82 a novel uncultured microphage from the turkey gastrointestinal system

    PubMed Central

    2011-01-01

    The genomic DNA sequence of a novel enteric uncultured microphage, ΦCA82 from a turkey gastrointestinal system was determined utilizing metagenomics techniques. The entire circular, single-stranded nucleotide sequence of the genome was 5,514 nucleotides. The ΦCA82 genome is quite different from other microviruses as indicated by comparisons of nucleotide similarity, predicted protein similarity, and functional classifications. Only three genes showed significant similarity to microviral proteins as determined by local alignments using BLAST analysis. ORF1 encoded a predicted phage F capsid protein that was phylogenetically most similar to the Microviridae ΦMH2K member's major coat protein. The ΦCA82 genome also encoded a predicted minor capsid protein (ORF2) and putative replication initiation protein (ORF3) most similar to the microviral bacteriophage SpV4. The distant evolutionary relationship of ΦCA82 suggests that the divergence of this novel turkey microvirus from other microviruses may reflect unique evolutionary pressures encountered within the turkey gastrointestinal system. PMID:21714899

  8. Automated Gene Ontology annotation for anonymous sequence data.

    PubMed

    Hennig, Steffen; Groth, Detlef; Lehrach, Hans

    2003-07-01

    Gene Ontology (GO) is the most widely accepted attempt to construct a unified and structured vocabulary for the description of genes and their products in any organism. Annotation by GO terms is performed in most of the current genome projects, which besides generality has the advantage of being very convenient for computer based classification methods. However, direct use of GO in small sequencing projects is not easy, especially for species not commonly represented in public databases. We present a software package (GOblet), which performs annotation based on GO terms for anonymous cDNA or protein sequences. It uses the species independent GO structure and vocabulary together with a series of protein databases collected from various sites, to perform a detailed GO annotation by sequence similarity searches. The sensitivity and the reference protein sets can be selected by the user. GOblet runs automatically and is available as a public service on our web server. The paper also addresses the reliability of automated GO annotations by using a reference set of more than 6000 human proteins. The GOblet server is accessible at http://goblet.molgen.mpg.de.

  9. Self-organized neural maps of human protein sequences.

    PubMed Central

    Ferrán, E. A.; Pflugfelder, B.; Ferrara, P.

    1994-01-01

    We have recently described a method based on artificial neural networks to cluster protein sequences into families. The network was trained with Kohonen's unsupervised learning algorithm using, as inputs, the matrix patterns derived from the dipeptide composition of the proteins. We present here a large-scale application of that method to classify the 1,758 human protein sequences stored in the SwissProt database (release 19.0), whose lengths are greater than 50 amino acids. In the final 2-dimensional topologically ordered map of 15 x 15 neurons, proteins belonging to known families were associated with the same neuron or with neighboring ones. Also, as an attempt to reduce the time-consuming learning procedure, we compared 2 learning protocols: one of 500 epochs (100 SUN CPU-hours [CPU-h]), and another one of 30 epochs (6.7 CPU-h). A further reduction of learning-computing time, by a factor of about 3.3, with similar protein clustering results, was achieved using a matrix of 11 x 11 components to represent the sequences. Although network training is time consuming, the classification of a new protein in the final ordered map is very fast (14.6 CPU-seconds). We also show a comparison between the artificial neural network approach and conventional methods of biosequence analysis. PMID:8019421

  10. Deep Learning and Its Applications in Biomedicine.

    PubMed

    Cao, Chensi; Liu, Feng; Tan, Hai; Song, Deshou; Shu, Wenjie; Li, Weizhong; Zhou, Yiming; Bo, Xiaochen; Xie, Zhi

    2018-02-01

    Advances in biological and medical technologies have been providing us explosive volumes of biological and physiological data, such as medical images, electroencephalography, genomic and protein sequences. Learning from these data facilitates the understanding of human health and disease. Developed from artificial neural networks, deep learning-based algorithms show great promise in extracting features and learning patterns from complex data. The aim of this paper is to provide an overview of deep learning techniques and some of the state-of-the-art applications in the biomedical field. We first introduce the development of artificial neural network and deep learning. We then describe two main components of deep learning, i.e., deep learning architectures and model optimization. Subsequently, some examples are demonstrated for deep learning applications, including medical image classification, genomic sequence analysis, as well as protein structure classification and prediction. Finally, we offer our perspectives for the future directions in the field of deep learning. Copyright © 2018. Production and hosting by Elsevier B.V.

  11. Prediction protein structural classes with pseudo-amino acid composition: approximate entropy and hydrophobicity pattern.

    PubMed

    Zhang, Tong-Liang; Ding, Yong-Sheng; Chou, Kuo-Chen

    2008-01-07

    Compared with the conventional amino acid (AA) composition, the pseudo-amino acid (PseAA) composition as originally introduced for protein subcellular location prediction can incorporate much more information of a protein sequence, so as to remarkably enhance the power of using a discrete model to predict various attributes of a protein. In this study, based on the concept of PseAA composition, the approximate entropy and hydrophobicity pattern of a protein sequence are used to characterize the PseAA components. Also, the immune genetic algorithm (IGA) is applied to search the optimal weight factors in generating the PseAA composition. Thus, for a given protein sequence sample, a 27-D (dimensional) PseAA composition is generated as its descriptor. The fuzzy K nearest neighbors (FKNN) classifier is adopted as the prediction engine. The results thus obtained in predicting protein structural classification are quite encouraging, indicating that the current approach may also be used to improve the prediction quality of other protein attributes, or at least can play a complimentary role to the existing methods in the relevant areas. Our algorithm is written in Matlab that is available by contacting the corresponding author.

  12. Rapid detection, classification and accurate alignment of up to a million or more related protein sequences.

    PubMed

    Neuwald, Andrew F

    2009-08-01

    The patterns of sequence similarity and divergence present within functionally diverse, evolutionarily related proteins contain implicit information about corresponding biochemical similarities and differences. A first step toward accessing such information is to statistically analyze these patterns, which, in turn, requires that one first identify and accurately align a very large set of protein sequences. Ideally, the set should include many distantly related, functionally divergent subgroups. Because it is extremely difficult, if not impossible for fully automated methods to align such sequences correctly, researchers often resort to manual curation based on detailed structural and biochemical information. However, multiply-aligning vast numbers of sequences in this way is clearly impractical. This problem is addressed using Multiply-Aligned Profiles for Global Alignment of Protein Sequences (MAPGAPS). The MAPGAPS program uses a set of multiply-aligned profiles both as a query to detect and classify related sequences and as a template to multiply-align the sequences. It relies on Karlin-Altschul statistics for sensitivity and on PSI-BLAST (and other) heuristics for speed. Using as input a carefully curated multiple-profile alignment for P-loop GTPases, MAPGAPS correctly aligned weakly conserved sequence motifs within 33 distantly related GTPases of known structure. By comparison, the sequence- and structurally based alignment methods hmmalign and PROMALS3D misaligned at least 11 and 23 of these regions, respectively. When applied to a dataset of 65 million protein sequences, MAPGAPS identified, classified and aligned (with comparable accuracy) nearly half a million putative P-loop GTPase sequences. A C++ implementation of MAPGAPS is available at http://mapgaps.igs.umaryland.edu. Supplementary data are available at Bioinformatics online.

  13. Fold independent structural comparisons of protein-ligand binding sites for exploring functional relationships.

    PubMed

    Gold, Nicola D; Jackson, Richard M

    2006-02-03

    The rapid growth in protein structural data and the emergence of structural genomics projects have increased the need for automatic structure analysis and tools for function prediction. Small molecule recognition is critical to the function of many proteins; therefore, determination of ligand binding site similarity is important for understanding ligand interactions and may allow their functional classification. Here, we present a binding sites database (SitesBase) that given a known protein-ligand binding site allows rapid retrieval of other binding sites with similar structure independent of overall sequence or fold similarity. However, each match is also annotated with sequence similarity and fold information to aid interpretation of structure and functional similarity. Similarity in ligand binding sites can indicate common binding modes and recognition of similar molecules, allowing potential inference of function for an uncharacterised protein or providing additional evidence of common function where sequence or fold similarity is already known. Alternatively, the resource can provide valuable information for detailed studies of molecular recognition including structure-based ligand design and in understanding ligand cross-reactivity. Here, we show examples of atomic similarity between superfamily or more distant fold relatives as well as between seemingly unrelated proteins. Assignment of unclassified proteins to structural superfamiles is also undertaken and in most cases substantiates assignments made using sequence similarity. Correct assignment is also possible where sequence similarity fails to find significant matches, illustrating the potential use of binding site comparisons for newly determined proteins.

  14. Automatic annotation of protein motif function with Gene Ontology terms.

    PubMed

    Lu, Xinghua; Zhai, Chengxiang; Gopalakrishnan, Vanathi; Buchanan, Bruce G

    2004-09-02

    Conserved protein sequence motifs are short stretches of amino acid sequence patterns that potentially encode the function of proteins. Several sequence pattern searching algorithms and programs exist foridentifying candidate protein motifs at the whole genome level. However, a much needed and important task is to determine the functions of the newly identified protein motifs. The Gene Ontology (GO) project is an endeavor to annotate the function of genes or protein sequences with terms from a dynamic, controlled vocabulary and these annotations serve well as a knowledge base. This paper presents methods to mine the GO knowledge base and use the association between the GO terms assigned to a sequence and the motifs matched by the same sequence as evidence for predicting the functions of novel protein motifs automatically. The task of assigning GO terms to protein motifs is viewed as both a binary classification and information retrieval problem, where PROSITE motifs are used as samples for mode training and functional prediction. The mutual information of a motif and aGO term association is found to be a very useful feature. We take advantage of the known motifs to train a logistic regression classifier, which allows us to combine mutual information with other frequency-based features and obtain a probability of correct association. The trained logistic regression model has intuitively meaningful and logically plausible parameter values, and performs very well empirically according to our evaluation criteria. In this research, different methods for automatic annotation of protein motifs have been investigated. Empirical result demonstrated that the methods have a great potential for detecting and augmenting information about the functions of newly discovered candidate protein motifs.

  15. Predicting membrane protein types by fusing composite protein sequence features into pseudo amino acid composition.

    PubMed

    Hayat, Maqsood; Khan, Asifullah

    2011-02-21

    Membrane proteins are vital type of proteins that serve as channels, receptors, and energy transducers in a cell. Prediction of membrane protein types is an important research area in bioinformatics. Knowledge of membrane protein types provides some valuable information for predicting novel example of the membrane protein types. However, classification of membrane protein types can be both time consuming and susceptible to errors due to the inherent similarity of membrane protein types. In this paper, neural networks based membrane protein type prediction system is proposed. Composite protein sequence representation (CPSR) is used to extract the features of a protein sequence, which includes seven feature sets; amino acid composition, sequence length, 2 gram exchange group frequency, hydrophobic group, electronic group, sum of hydrophobicity, and R-group. Principal component analysis is then employed to reduce the dimensionality of the feature vector. The probabilistic neural network (PNN), generalized regression neural network, and support vector machine (SVM) are used as classifiers. A high success rate of 86.01% is obtained using SVM for the jackknife test. In case of independent dataset test, PNN yields the highest accuracy of 95.73%. These classifiers exhibit improved performance using other performance measures such as sensitivity, specificity, Mathew's correlation coefficient, and F-measure. The experimental results show that the prediction performance of the proposed scheme for classifying membrane protein types is the best reported, so far. This performance improvement may largely be credited to the learning capabilities of neural networks and the composite feature extraction strategy, which exploits seven different properties of protein sequences. The proposed Mem-Predictor can be accessed at http://111.68.99.218/Mem-Predictor. Copyright © 2010 Elsevier Ltd. All rights reserved.

  16. PRED-CLASS: cascading neural networks for generalized protein classification and genome-wide applications.

    PubMed

    Pasquier, C; Promponas, V J; Hamodrakas, S J

    2001-08-15

    A cascading system of hierarchical, artificial neural networks (named PRED-CLASS) is presented for the generalized classification of proteins into four distinct classes-transmembrane, fibrous, globular, and mixed-from information solely encoded in their amino acid sequences. The architecture of the individual component networks is kept very simple, reducing the number of free parameters (network synaptic weights) for faster training, improved generalization, and the avoidance of data overfitting. Capturing information from as few as 50 protein sequences spread among the four target classes (6 transmembrane, 10 fibrous, 13 globular, and 17 mixed), PRED-CLASS was able to obtain 371 correct predictions out of a set of 387 proteins (success rate approximately 96%) unambiguously assigned into one of the target classes. The application of PRED-CLASS to several test sets and complete proteomes of several organisms demonstrates that such a method could serve as a valuable tool in the annotation of genomic open reading frames with no functional assignment or as a preliminary step in fold recognition and ab initio structure prediction methods. Detailed results obtained for various data sets and completed genomes, along with a web sever running the PRED-CLASS algorithm, can be accessed over the World Wide Web at http://o2.biol.uoa.gr/PRED-CLASS.

  17. Large-scale analysis of intrinsic disorder flavors and associated functions in the protein sequence universe.

    PubMed

    Necci, Marco; Piovesan, Damiano; Tosatto, Silvio C E

    2016-12-01

    Intrinsic disorder (ID) in proteins has been extensively described for the last decade; a large-scale classification of ID in proteins is mostly missing. Here, we provide an extensive analysis of ID in the protein universe on the UniProt database derived from sequence-based predictions in MobiDB. Almost half the sequences contain an ID region of at least five residues. About 9% of proteins have a long ID region of over 20 residues which are more abundant in Eukaryotic organisms and most frequently cover less than 20% of the sequence. A small subset of about 67,000 (out of over 80 million) proteins is fully disordered and mostly found in Viruses. Most proteins have only one ID, with short ID evenly distributed along the sequence and long ID overrepresented in the center. The charged residue composition of Das and Pappu was used to classify ID proteins by structural propensities and corresponding functional enrichment. Swollen Coils seem to be used mainly as structural components and in biosynthesis in both Prokaryotes and Eukaryotes. In Bacteria, they are confined in the nucleoid and in Viruses provide DNA binding function. Coils & Hairpins seem to be specialized in ribosome binding and methylation activities. Globules & Tadpoles bind antigens in Eukaryotes but are involved in killing other organisms and cytolysis in Bacteria. The Undefined class is used by Bacteria to bind toxic substances and mediate transport and movement between and within organisms in Viruses. Fully disordered proteins behave similarly, but are enriched for glycine residues and extracellular structures. © 2016 The Protein Society.

  18. Tandem Repeats in Proteins: Prediction Algorithms and Biological Role.

    PubMed

    Pellegrini, Marco

    2015-01-01

    Tandem repetitions in protein sequence and structure is a fascinating subject of research which has been a focus of study since the late 1990s. In this survey, we give an overview on the multi-faceted aspects of research on protein tandem repeats (PTR for short), including prediction algorithms, databases, early classification efforts, mechanisms of PTR formation and evolution, and synthetic PTR design. We also touch on the rather open issue of the relationship between PTR and flexibility (or disorder) in proteins. Detection of PTR either from protein sequence or structure data is challenging due to inherent high (biological) signal-to-noise ratio that is a key feature of this problem. As early in silico analytic tools have been key enablers for starting this field of study, we expect that current and future algorithmic and statistical breakthroughs will have a high impact on the investigations of the biological role of PTR.

  19. Support Vector Machine-based classification of protein folds using the structural properties of amino acid residues and amino acid residue pairs.

    PubMed

    Shamim, Mohammad Tabrez Anwar; Anwaruddin, Mohammad; Nagarajaram, H A

    2007-12-15

    Fold recognition is a key step in the protein structure discovery process, especially when traditional sequence comparison methods fail to yield convincing structural homologies. Although many methods have been developed for protein fold recognition, their accuracies remain low. This can be attributed to insufficient exploitation of fold discriminatory features. We have developed a new method for protein fold recognition using structural information of amino acid residues and amino acid residue pairs. Since protein fold recognition can be treated as a protein fold classification problem, we have developed a Support Vector Machine (SVM) based classifier approach that uses secondary structural state and solvent accessibility state frequencies of amino acids and amino acid pairs as feature vectors. Among the individual properties examined secondary structural state frequencies of amino acids gave an overall accuracy of 65.2% for fold discrimination, which is better than the accuracy by any method reported so far in the literature. Combination of secondary structural state frequencies with solvent accessibility state frequencies of amino acids and amino acid pairs further improved the fold discrimination accuracy to more than 70%, which is approximately 8% higher than the best available method. In this study we have also tested, for the first time, an all-together multi-class method known as Crammer and Singer method for protein fold classification. Our studies reveal that the three multi-class classification methods, namely one versus all, one versus one and Crammer and Singer method, yield similar predictions. Dataset and stand-alone program are available upon request.

  20. Evolution and classification of the CRISPR-Cas systems

    PubMed Central

    S. Makarova, Kira; H. Haft, Daniel; Barrangou, Rodolphe; J. J. Brouns, Stan; Charpentier, Emmanuelle; Horvath, Philippe; Moineau, Sylvain; J. M. Mojica, Francisco; I. Wolf, Yuri; Yakunin, Alexander F.; van der Oost, John; V. Koonin, Eugene

    2012-01-01

    The CRISPR–Cas (clustered regularly interspaced short palindromic repeats–CRISPR-associated proteins) modules are adaptive immunity systems that are present in many archaea and bacteria. These defence systems are encoded by operons that have an extraordinarily diverse architecture and a high rate of evolution for both the cas genes and the unique spacer content. Here, we provide an updated analysis of the evolutionary relationships between CRISPR–Cas systems and Cas proteins. Three major types of CRISPR–Cas system are delineated, with a further division into several subtypes and a few chimeric variants. Given the complexity of the genomic architectures and the extremely dynamic evolution of the CRISPR–Cas systems, a unified classification of these systems should be based on multiple criteria. Accordingly, we propose a `polythetic' classification that integrates the phylogenies of the most common cas genes, the sequence and organization of the CRISPR repeats and the architecture of the CRISPR–cas loci. PMID:21552286

  1. The proteome: structure, function and evolution

    PubMed Central

    Fleming, Keiran; Kelley, Lawrence A; Islam, Suhail A; MacCallum, Robert M; Muller, Arne; Pazos, Florencio; Sternberg, Michael J.E

    2006-01-01

    This paper reports two studies to model the inter-relationships between protein sequence, structure and function. First, an automated pipeline to provide a structural annotation of proteomes in the major genomes is described. The results are stored in a database at Imperial College, London (3D-GENOMICS) that can be accessed at www.sbg.bio.ic.ac.uk. Analysis of the assignments to structural superfamilies provides evolutionary insights. 3D-GENOMICS is being integrated with related proteome annotation data at University College London and the European Bioinformatics Institute in a project known as e-protein (http://www.e-protein.org/). The second topic is motivated by the developments in structural genomics projects in which the structure of a protein is determined prior to knowledge of its function. We have developed a new approach PHUNCTIONER that uses the gene ontology (GO) classification to supervise the extraction of the sequence signal responsible for protein function from a structure-based sequence alignment. Using GO we can obtain profiles for a range of specificities described in the ontology. In the region of low sequence similarity (around 15%), our method is more accurate than assignment from the closest structural homologue. The method is also able to identify the specific residues associated with the function of the protein family. PMID:16524832

  2. Rebelling for a Reason: Protein Structural “Outliers”

    PubMed Central

    Arumugam, Gandhimathi; Nair, Anu G.; Hariharaputran, Sridhar; Ramanathan, Sowdhamini

    2013-01-01

    Analysis of structural variation in domain superfamilies can reveal constraints in protein evolution which aids protein structure prediction and classification. Structure-based sequence alignment of distantly related proteins, organized in PASS2 database, provides clues about structurally conserved regions among different functional families. Some superfamily members show large structural differences which are functionally relevant. This paper analyses the impact of structural divergence on function for multi-member superfamilies, selected from the PASS2 superfamily alignment database. Functional annotations within superfamilies, with structural outliers or ‘rebels’, are discussed in the context of structural variations. Overall, these data reinforce the idea that functional similarities cannot be extrapolated from mere structural conservation. The implication for fold-function prediction is that the functional annotations can only be inherited with very careful consideration, especially at low sequence identities. PMID:24073209

  3. Viral phylogenomics using an alignment-free method: A three-step approach to determine optimal length of k-mer

    DOE PAGES

    Zhang, Qian; Jun, Se -Ran; Leuze, Michael; ...

    2017-01-19

    The development of rapid, economical genome sequencing has shed new light on the classification of viruses. As of October 2016, the National Center for Biotechnology Information (NCBI) database contained >2 million viral genome sequences and a reference set of ~4000 viral genome sequences that cover a wide range of known viral families. Whole-genome sequences can be used to improve viral classification and provide insight into the viral tree of life . However, due to the lack of evolutionary conservation amongst diverse viruses, it is not feasible to build a viral tree of life using traditional phylogenetic methods based on conservedmore » proteins. In this study, we used an alignment-free method that uses k-mers as genomic features for a large-scale comparison of complete viral genomes available in RefSeq. To determine the optimal feature length, k (an essential step in constructing a meaningful dendrogram), we designed a comprehensive strategy that combines three approaches: (1) cumulative relative entropy, (2) average number of common features among genomes, and (3) the Shannon diversity index. This strategy was used to determine k for all 3,905 complete viral genomes in RefSeq. Lastly, the resulting dendrogram shows consistency with the viral taxonomy of the ICTV and the Baltimore classification of viruses.« less

  4. Viral phylogenomics using an alignment-free method: A three-step approach to determine optimal length of k-mer

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Zhang, Qian; Jun, Se -Ran; Leuze, Michael

    The development of rapid, economical genome sequencing has shed new light on the classification of viruses. As of October 2016, the National Center for Biotechnology Information (NCBI) database contained >2 million viral genome sequences and a reference set of ~4000 viral genome sequences that cover a wide range of known viral families. Whole-genome sequences can be used to improve viral classification and provide insight into the viral tree of life . However, due to the lack of evolutionary conservation amongst diverse viruses, it is not feasible to build a viral tree of life using traditional phylogenetic methods based on conservedmore » proteins. In this study, we used an alignment-free method that uses k-mers as genomic features for a large-scale comparison of complete viral genomes available in RefSeq. To determine the optimal feature length, k (an essential step in constructing a meaningful dendrogram), we designed a comprehensive strategy that combines three approaches: (1) cumulative relative entropy, (2) average number of common features among genomes, and (3) the Shannon diversity index. This strategy was used to determine k for all 3,905 complete viral genomes in RefSeq. Lastly, the resulting dendrogram shows consistency with the viral taxonomy of the ICTV and the Baltimore classification of viruses.« less

  5. Viral Phylogenomics Using an Alignment-Free Method: A Three-Step Approach to Determine Optimal Length of k-mer

    PubMed Central

    Zhang, Qian; Jun, Se-Ran; Leuze, Michael; Ussery, David; Nookaew, Intawat

    2017-01-01

    The development of rapid, economical genome sequencing has shed new light on the classification of viruses. As of October 2016, the National Center for Biotechnology Information (NCBI) database contained >2 million viral genome sequences and a reference set of ~4000 viral genome sequences that cover a wide range of known viral families. Whole-genome sequences can be used to improve viral classification and provide insight into the viral “tree of life”. However, due to the lack of evolutionary conservation amongst diverse viruses, it is not feasible to build a viral tree of life using traditional phylogenetic methods based on conserved proteins. In this study, we used an alignment-free method that uses k-mers as genomic features for a large-scale comparison of complete viral genomes available in RefSeq. To determine the optimal feature length, k (an essential step in constructing a meaningful dendrogram), we designed a comprehensive strategy that combines three approaches: (1) cumulative relative entropy, (2) average number of common features among genomes, and (3) the Shannon diversity index. This strategy was used to determine k for all 3,905 complete viral genomes in RefSeq. The resulting dendrogram shows consistency with the viral taxonomy of the ICTV and the Baltimore classification of viruses. PMID:28102365

  6. The DynaMine webserver: predicting protein dynamics from sequence.

    PubMed

    Cilia, Elisa; Pancsa, Rita; Tompa, Peter; Lenaerts, Tom; Vranken, Wim F

    2014-07-01

    Protein dynamics are important for understanding protein function. Unfortunately, accurate protein dynamics information is difficult to obtain: here we present the DynaMine webserver, which provides predictions for the fast backbone movements of proteins directly from their amino-acid sequence. DynaMine rapidly produces a profile describing the statistical potential for such movements at residue-level resolution. The predicted values have meaning on an absolute scale and go beyond the traditional binary classification of residues as ordered or disordered, thus allowing for direct dynamics comparisons between protein regions. Through this webserver, we provide molecular biologists with an efficient and easy to use tool for predicting the dynamical characteristics of any protein of interest, even in the absence of experimental observations. The prediction results are visualized and can be directly downloaded. The DynaMine webserver, including instructive examples describing the meaning of the profiles, is available at http://dynamine.ibsquare.be. © The Author(s) 2014. Published by Oxford University Press on behalf of Nucleic Acids Research.

  7. Evolution, substrate specificity and subfamily classification of glycoside hydrolase family 5 (GH5).

    PubMed

    Aspeborg, Henrik; Coutinho, Pedro M; Wang, Yang; Brumer, Harry; Henrissat, Bernard

    2012-09-20

    The large Glycoside Hydrolase family 5 (GH5) groups together a wide range of enzymes acting on β-linked oligo- and polysaccharides, and glycoconjugates from a large spectrum of organisms. The long and complex evolution of this family of enzymes and its broad sequence diversity limits functional prediction. With the objective of improving the differentiation of enzyme specificities in a knowledge-based context, and to obtain new evolutionary insights, we present here a new, robust subfamily classification of family GH5. About 80% of the current sequences were assigned into 51 subfamilies in a global analysis of all publicly available GH5 sequences and associated biochemical data. Examination of subfamilies with catalytically-active members revealed that one third are monospecific (containing a single enzyme activity), although new functions may be discovered with biochemical characterization in the future. Furthermore, twenty subfamilies presently have no characterization whatsoever and many others have only limited structural and biochemical data. Mapping of functional knowledge onto the GH5 phylogenetic tree revealed that the sequence space of this historical and industrially important family is far from well dispersed, highlighting targets in need of further study. The analysis also uncovered a number of GH5 proteins which have lost their catalytic machinery, indicating evolution towards novel functions. Overall, the subfamily division of GH5 provides an actively curated resource for large-scale protein sequence annotation for glycogenomics; the subfamily assignments are openly accessible via the Carbohydrate-Active Enzyme database at http://www.cazy.org/GH5.html.

  8. T-RMSD: a web server for automated fine-grained protein structural classification.

    PubMed

    Magis, Cedrik; Di Tommaso, Paolo; Notredame, Cedric

    2013-07-01

    This article introduces the T-RMSD web server (tree-based on root-mean-square deviation), a service allowing the online computation of structure-based protein classification. It has been developed to address the relation between structural and functional similarity in proteins, and it allows a fine-grained structural clustering of a given protein family or group of structurally related proteins using distance RMSD (dRMSD) variations. These distances are computed between all pairs of equivalent residues, as defined by the ungapped columns within a given multiple sequence alignment. Using these generated distance matrices (one per equivalent position), T-RMSD produces a structural tree with support values for each cluster node, reminiscent of bootstrap values. These values, associated with the tree topology, allow a quantitative estimate of structural distances between proteins or group of proteins defined by the tree topology. The clusters thus defined have been shown to be structurally and functionally informative. The T-RMSD web server is a free website open to all users and available at http://tcoffee.crg.cat/apps/tcoffee/do:trmsd.

  9. T-RMSD: a web server for automated fine-grained protein structural classification

    PubMed Central

    Magis, Cedrik; Di Tommaso, Paolo; Notredame, Cedric

    2013-01-01

    This article introduces the T-RMSD web server (tree-based on root-mean-square deviation), a service allowing the online computation of structure-based protein classification. It has been developed to address the relation between structural and functional similarity in proteins, and it allows a fine-grained structural clustering of a given protein family or group of structurally related proteins using distance RMSD (dRMSD) variations. These distances are computed between all pairs of equivalent residues, as defined by the ungapped columns within a given multiple sequence alignment. Using these generated distance matrices (one per equivalent position), T-RMSD produces a structural tree with support values for each cluster node, reminiscent of bootstrap values. These values, associated with the tree topology, allow a quantitative estimate of structural distances between proteins or group of proteins defined by the tree topology. The clusters thus defined have been shown to be structurally and functionally informative. The T-RMSD web server is a free website open to all users and available at http://tcoffee.crg.cat/apps/tcoffee/do:trmsd. PMID:23716642

  10. Hidden Markov models of biological primary sequence information.

    PubMed Central

    Baldi, P; Chauvin, Y; Hunkapiller, T; McClure, M A

    1994-01-01

    Hidden Markov model (HMM) techniques are used to model families of biological sequences. A smooth and convergent algorithm is introduced to iteratively adapt the transition and emission parameters of the models from the examples in a given family. The HMM approach is applied to three protein families: globins, immunoglobulins, and kinases. In all cases, the models derived capture the important statistical characteristics of the family and can be used for a number of tasks, including multiple alignments, motif detection, and classification. For K sequences of average length N, this approach yields an effective multiple-alignment algorithm which requires O(KN2) operations, linear in the number of sequences. PMID:8302831

  11. Sirius PSB: a generic system for analysis of biological sequences.

    PubMed

    Koh, Chuan Hock; Lin, Sharene; Jedd, Gregory; Wong, Limsoon

    2009-12-01

    Computational tools are essential components of modern biological research. For example, BLAST searches can be used to identify related proteins based on sequence homology, or when a new genome is sequenced, prediction models can be used to annotate functional sites such as transcription start sites, translation initiation sites and polyadenylation sites and to predict protein localization. Here we present Sirius Prediction Systems Builder (PSB), a new computational tool for sequence analysis, classification and searching. Sirius PSB has four main operations: (1) Building a classifier, (2) Deploying a classifier, (3) Search for proteins similar to query proteins, (4) Preliminary and post-prediction analysis. Sirius PSB supports all these operations via a simple and interactive graphical user interface. Besides being a convenient tool, Sirius PSB has also introduced two novelties in sequence analysis. Firstly, genetic algorithm is used to identify interesting features in the feature space. Secondly, instead of the conventional method of searching for similar proteins via sequence similarity, we introduced searching via features' similarity. To demonstrate the capabilities of Sirius PSB, we have built two prediction models - one for the recognition of Arabidopsis polyadenylation sites and another for the subcellular localization of proteins. Both systems are competitive against current state-of-the-art models based on evaluation of public datasets. More notably, the time and effort required to build each model is greatly reduced with the assistance of Sirius PSB. Furthermore, we show that under certain conditions when BLAST is unable to find related proteins, Sirius PSB can identify functionally related proteins based on their biophysical similarities. Sirius PSB and its related supplements are available at: http://compbio.ddns.comp.nus.edu.sg/~sirius.

  12. Neural network and SVM classifiers accurately predict lipid binding proteins, irrespective of sequence homology.

    PubMed

    Bakhtiarizadeh, Mohammad Reza; Moradi-Shahrbabak, Mohammad; Ebrahimi, Mansour; Ebrahimie, Esmaeil

    2014-09-07

    Due to the central roles of lipid binding proteins (LBPs) in many biological processes, sequence based identification of LBPs is of great interest. The major challenge is that LBPs are diverse in sequence, structure, and function which results in low accuracy of sequence homology based methods. Therefore, there is a need for developing alternative functional prediction methods irrespective of sequence similarity. To identify LBPs from non-LBPs, the performances of support vector machine (SVM) and neural network were compared in this study. Comprehensive protein features and various techniques were employed to create datasets. Five-fold cross-validation (CV) and independent evaluation (IE) tests were used to assess the validity of the two methods. The results indicated that SVM outperforms neural network. SVM achieved 89.28% (CV) and 89.55% (IE) overall accuracy in identification of LBPs from non-LBPs and 92.06% (CV) and 92.90% (IE) (in average) for classification of different LBPs classes. Increasing the number and the range of extracted protein features as well as optimization of the SVM parameters significantly increased the efficiency of LBPs class prediction in comparison to the only previous report in this field. Altogether, the results showed that the SVM algorithm can be run on broad, computationally calculated protein features and offers a promising tool in detection of LBPs classes. The proposed approach has the potential to integrate and improve the common sequence alignment based methods. Copyright © 2014 Elsevier Ltd. All rights reserved.

  13. Prediction of apoptosis protein locations with genetic algorithms and support vector machines through a new mode of pseudo amino acid composition.

    PubMed

    Kandaswamy, Krishna Kumar; Pugalenthi, Ganesan; Möller, Steffen; Hartmann, Enno; Kalies, Kai-Uwe; Suganthan, P N; Martinetz, Thomas

    2010-12-01

    Apoptosis is an essential process for controlling tissue homeostasis by regulating a physiological balance between cell proliferation and cell death. The subcellular locations of proteins performing the cell death are determined by mostly independent cellular mechanisms. The regular bioinformatics tools to predict the subcellular locations of such apoptotic proteins do often fail. This work proposes a model for the sorting of proteins that are involved in apoptosis, allowing us to both the prediction of their subcellular locations as well as the molecular properties that contributed to it. We report a novel hybrid Genetic Algorithm (GA)/Support Vector Machine (SVM) approach to predict apoptotic protein sequences using 119 sequence derived properties like frequency of amino acid groups, secondary structure, and physicochemical properties. GA is used for selecting a near-optimal subset of informative features that is most relevant for the classification. Jackknife cross-validation is applied to test the predictive capability of the proposed method on 317 apoptosis proteins. Our method achieved 85.80% accuracy using all 119 features and 89.91% accuracy for 25 features selected by GA. Our models were examined by a test dataset of 98 apoptosis proteins and obtained an overall accuracy of 90.34%. The results show that the proposed approach is promising; it is able to select small subsets of features and still improves the classification accuracy. Our model can contribute to the understanding of programmed cell death and drug discovery. The software and dataset are available at http://www.inb.uni-luebeck.de/tools-demos/apoptosis/GASVM.

  14. Introduction to bioinformatics.

    PubMed

    Can, Tolga

    2014-01-01

    Bioinformatics is an interdisciplinary field mainly involving molecular biology and genetics, computer science, mathematics, and statistics. Data intensive, large-scale biological problems are addressed from a computational point of view. The most common problems are modeling biological processes at the molecular level and making inferences from collected data. A bioinformatics solution usually involves the following steps: Collect statistics from biological data. Build a computational model. Solve a computational modeling problem. Test and evaluate a computational algorithm. This chapter gives a brief introduction to bioinformatics by first providing an introduction to biological terminology and then discussing some classical bioinformatics problems organized by the types of data sources. Sequence analysis is the analysis of DNA and protein sequences for clues regarding function and includes subproblems such as identification of homologs, multiple sequence alignment, searching sequence patterns, and evolutionary analyses. Protein structures are three-dimensional data and the associated problems are structure prediction (secondary and tertiary), analysis of protein structures for clues regarding function, and structural alignment. Gene expression data is usually represented as matrices and analysis of microarray data mostly involves statistics analysis, classification, and clustering approaches. Biological networks such as gene regulatory networks, metabolic pathways, and protein-protein interaction networks are usually modeled as graphs and graph theoretic approaches are used to solve associated problems such as construction and analysis of large-scale networks.

  15. Phylogenetic classification and the universal tree.

    PubMed

    Doolittle, W F

    1999-06-25

    From comparative analyses of the nucleotide sequences of genes encoding ribosomal RNAs and several proteins, molecular phylogeneticists have constructed a "universal tree of life," taking it as the basis for a "natural" hierarchical classification of all living things. Although confidence in some of the tree's early branches has recently been shaken, new approaches could still resolve many methodological uncertainties. More challenging is evidence that most archaeal and bacterial genomes (and the inferred ancestral eukaryotic nuclear genome) contain genes from multiple sources. If "chimerism" or "lateral gene transfer" cannot be dismissed as trivial in extent or limited to special categories of genes, then no hierarchical universal classification can be taken as natural. Molecular phylogeneticists will have failed to find the "true tree," not because their methods are inadequate or because they have chosen the wrong genes, but because the history of life cannot properly be represented as a tree. However, taxonomies based on molecular sequences will remain indispensable, and understanding of the evolutionary process will ultimately be enriched, not impoverished.

  16. The Staphylococcus aureus Two-Component System AgrAC Displays Four Distinct Genomic Arrangements That Delineate Genomic Virulence Factor Signatures

    PubMed Central

    Choudhary, Kumari S.; Mih, Nathan; Monk, Jonathan; Kavvas, Erol; Yurkovich, James T.; Sakoulas, George; Palsson, Bernhard O.

    2018-01-01

    Two-component systems (TCSs) consist of a histidine kinase and a response regulator. Here, we evaluated the conservation of the AgrAC TCS among 149 completely sequenced Staphylococcus aureus strains. It is composed of four genes: agrBDCA. We found that: (i) AgrAC system (agr) was found in all but one of the 149 strains, (ii) the agr positive strains were further classified into four agr types based on AgrD protein sequences, (iii) the four agr types not only specified the chromosomal arrangement of the agr genes but also the sequence divergence of AgrC histidine kinase protein, which confers signal specificity, (iv) the sequence divergence was reflected in distinct structural properties especially in the transmembrane region and second extracellular binding domain, and (v) there was a strong correlation between the agr type and the virulence genomic profile of the organism. Taken together, these results demonstrate that bioinformatic analysis of the agr locus leads to a classification system that correlates with the presence of virulence factors and protein structural properties. PMID:29887846

  17. Association algorithm to mine the rules that govern enzyme definition and to classify protein sequences.

    PubMed

    Chiu, Shih-Hau; Chen, Chien-Chi; Yuan, Gwo-Fang; Lin, Thy-Hou

    2006-06-15

    The number of sequences compiled in many genome projects is growing exponentially, but most of them have not been characterized experimentally. An automatic annotation scheme must be in an urgent need to reduce the gap between the amount of new sequences produced and reliable functional annotation. This work proposes rules for automatically classifying the fungus genes. The approach involves elucidating the enzyme classifying rule that is hidden in UniProt protein knowledgebase and then applying it for classification. The association algorithm, Apriori, is utilized to mine the relationship between the enzyme class and significant InterPro entries. The candidate rules are evaluated for their classificatory capacity. There were five datasets collected from the Swiss-Prot for establishing the annotation rules. These were treated as the training sets. The TrEMBL entries were treated as the testing set. A correct enzyme classification rate of 70% was obtained for the prokaryote datasets and a similar rate of about 80% was obtained for the eukaryote datasets. The fungus training dataset which lacks an enzyme class description was also used to evaluate the fungus candidate rules. A total of 88 out of 5085 test entries were matched with the fungus rule set. These were otherwise poorly annotated using their functional descriptions. The feasibility of using the method presented here to classify enzyme classes based on the enzyme domain rules is evident. The rules may be also employed by the protein annotators in manual annotation or implemented in an automatic annotation flowchart.

  18. Annotation and Classification of CRISPR-Cas Systems

    PubMed Central

    Makarova, Kira S.; Koonin, Eugene V.

    2018-01-01

    The clustered regularly interspaced short palindromic repeats (CRISPR)-Cas (CRISPR-associated proteins) is a prokaryotic adaptive immune system that is represented in most archaea and many bacteria. Among the currently known prokaryotic defense systems, the CRISPR-Cas genomic loci show unprecedented complexity and diversity. Classification of CRISPR-Cas variants that would capture their evolutionary relationships to the maximum possible extent is essential for comparative genomic and functional characterization of this theoretically and practically important system of adaptive immunity. To this end, a multipronged approach has been developed that combines phylogenetic analysis of the conserved Cas proteins with comparison of gene repertoires and arrangements in CRISPR-Cas loci. This approach led to the current classification of CRISPR-Cas systems into three distinct types and ten subtypes for each of which signature genes have been identified. Comparative genomic analysis of the CRISPR-Cas systems in new archaeal and bacterial genomes performed over the 3 years elapsed since the development of this classification makes it clear that new types and subtypes of CRISPR-Cas need to be introduced. Moreover, this classification system captures only part of the complexity of CRISPR-Cas organization and evolution, due to the intrinsic modularity and evolutionary mobility of these immunity systems, resulting in numerous recombinant variants. Moreover, most of the cas genes evolve rapidly, complicating the family assignment for many Cas proteins and the use of family profiles for the recognition of CRISPR-Cas subtype signatures. Further progress in the comparative analysis of CRISPR-Cas systems requires integration of the most sensitive sequence comparison tools, protein structure comparison, and refined approaches for comparison of gene neighborhoods. PMID:25981466

  19. Annotation and Classification of CRISPR-Cas Systems.

    PubMed

    Makarova, Kira S; Koonin, Eugene V

    2015-01-01

    The clustered regularly interspaced short palindromic repeats (CRISPR)-Cas (CRISPR-associated proteins) is a prokaryotic adaptive immune system that is represented in most archaea and many bacteria. Among the currently known prokaryotic defense systems, the CRISPR-Cas genomic loci show unprecedented complexity and diversity. Classification of CRISPR-Cas variants that would capture their evolutionary relationships to the maximum possible extent is essential for comparative genomic and functional characterization of this theoretically and practically important system of adaptive immunity. To this end, a multipronged approach has been developed that combines phylogenetic analysis of the conserved Cas proteins with comparison of gene repertoires and arrangements in CRISPR-Cas loci. This approach led to the current classification of CRISPR-Cas systems into three distinct types and ten subtypes for each of which signature genes have been identified. Comparative genomic analysis of the CRISPR-Cas systems in new archaeal and bacterial genomes performed over the 3 years elapsed since the development of this classification makes it clear that new types and subtypes of CRISPR-Cas need to be introduced. Moreover, this classification system captures only part of the complexity of CRISPR-Cas organization and evolution, due to the intrinsic modularity and evolutionary mobility of these immunity systems, resulting in numerous recombinant variants. Moreover, most of the cas genes evolve rapidly, complicating the family assignment for many Cas proteins and the use of family profiles for the recognition of CRISPR-Cas subtype signatures. Further progress in the comparative analysis of CRISPR-Cas systems requires integration of the most sensitive sequence comparison tools, protein structure comparison, and refined approaches for comparison of gene neighborhoods.

  20. EUCLID: automatic classification of proteins in functional classes by their database annotations.

    PubMed

    Tamames, J; Ouzounis, C; Casari, G; Sander, C; Valencia, A

    1998-01-01

    A tool is described for the automatic classification of sequences in functional classes using their database annotations. The Euclid system is based on a simple learning procedure from examples provided by human experts. Euclid is freely available for academics at http://www.gredos.cnb.uam.es/EUCLID, with the corresponding dictionaries for the generation of three, eight and 14 functional classes. E-mail: valencia@cnb.uam.es The results of the EUCLID classification of different genomes are available at http://www.sander.ebi.ac. uk/genequiz/. A detailed description of the different applications mentioned in the text is available at http://www.gredos.cnb.uam. es/EUCLID/Full_Paper

  1. NMRDSP: an accurate prediction of protein shape strings from NMR chemical shifts and sequence data.

    PubMed

    Mao, Wusong; Cong, Peisheng; Wang, Zhiheng; Lu, Longjian; Zhu, Zhongliang; Li, Tonghua

    2013-01-01

    Shape string is structural sequence and is an extremely important structure representation of protein backbone conformations. Nuclear magnetic resonance chemical shifts give a strong correlation with the local protein structure, and are exploited to predict protein structures in conjunction with computational approaches. Here we demonstrate a novel approach, NMRDSP, which can accurately predict the protein shape string based on nuclear magnetic resonance chemical shifts and structural profiles obtained from sequence data. The NMRDSP uses six chemical shifts (HA, H, N, CA, CB and C) and eight elements of structure profiles as features, a non-redundant set (1,003 entries) as the training set, and a conditional random field as a classification algorithm. For an independent testing set (203 entries), we achieved an accuracy of 75.8% for S8 (the eight states accuracy) and 87.8% for S3 (the three states accuracy). This is higher than only using chemical shifts or sequence data, and confirms that the chemical shift and the structure profile are significant features for shape string prediction and their combination prominently improves the accuracy of the predictor. We have constructed the NMRDSP web server and believe it could be employed to provide a solid platform to predict other protein structures and functions. The NMRDSP web server is freely available at http://cal.tongji.edu.cn/NMRDSP/index.jsp.

  2. ANCAC: amino acid, nucleotide, and codon analysis of COGs--a tool for sequence bias analysis in microbial orthologs.

    PubMed

    Meiler, Arno; Klinger, Claudia; Kaufmann, Michael

    2012-09-08

    The COG database is the most popular collection of orthologous proteins from many different completely sequenced microbial genomes. Per definition, a cluster of orthologous groups (COG) within this database exclusively contains proteins that most likely achieve the same cellular function. Recently, the COG database was extended by assigning to every protein both the corresponding amino acid and its encoding nucleotide sequence resulting in the NUCOCOG database. This extended version of the COG database is a valuable resource connecting sequence features with the functionality of the respective proteins. Here we present ANCAC, a web tool and MySQL database for the analysis of amino acid, nucleotide, and codon frequencies in COGs on the basis of freely definable phylogenetic patterns. We demonstrate the usefulness of ANCAC by analyzing amino acid frequencies, codon usage, and GC-content in a species- or function-specific context. With respect to amino acids we, at least in part, confirm the cognate bias hypothesis by using ANCAC's NUCOCOG dataset as the largest one available for that purpose thus far. Using the NUCOCOG datasets, ANCAC connects taxonomic, amino acid, and nucleotide sequence information with the functional classification via COGs and provides a GUI for flexible mining for sequence-bias. Thereby, to our knowledge, it is the only tool for the analysis of sequence composition in the light of physiological roles and phylogenetic context without requirement of substantial programming-skills.

  3. ANCAC: amino acid, nucleotide, and codon analysis of COGs – a tool for sequence bias analysis in microbial orthologs

    PubMed Central

    2012-01-01

    Background The COG database is the most popular collection of orthologous proteins from many different completely sequenced microbial genomes. Per definition, a cluster of orthologous groups (COG) within this database exclusively contains proteins that most likely achieve the same cellular function. Recently, the COG database was extended by assigning to every protein both the corresponding amino acid and its encoding nucleotide sequence resulting in the NUCOCOG database. This extended version of the COG database is a valuable resource connecting sequence features with the functionality of the respective proteins. Results Here we present ANCAC, a web tool and MySQL database for the analysis of amino acid, nucleotide, and codon frequencies in COGs on the basis of freely definable phylogenetic patterns. We demonstrate the usefulness of ANCAC by analyzing amino acid frequencies, codon usage, and GC-content in a species- or function-specific context. With respect to amino acids we, at least in part, confirm the cognate bias hypothesis by using ANCAC’s NUCOCOG dataset as the largest one available for that purpose thus far. Conclusions Using the NUCOCOG datasets, ANCAC connects taxonomic, amino acid, and nucleotide sequence information with the functional classification via COGs and provides a GUI for flexible mining for sequence-bias. Thereby, to our knowledge, it is the only tool for the analysis of sequence composition in the light of physiological roles and phylogenetic context without requirement of substantial programming-skills. PMID:22958836

  4. PSOFuzzySVM-TMH: identification of transmembrane helix segments using ensemble feature space by incorporated fuzzy support vector machine.

    PubMed

    Hayat, Maqsood; Tahir, Muhammad

    2015-08-01

    Membrane protein is a central component of the cell that manages intra and extracellular processes. Membrane proteins execute a diversity of functions that are vital for the survival of organisms. The topology of transmembrane proteins describes the number of transmembrane (TM) helix segments and its orientation. However, owing to the lack of its recognized structures, the identification of TM helix and its topology through experimental methods is laborious with low throughput. In order to identify TM helix segments reliably, accurately, and effectively from topogenic sequences, we propose the PSOFuzzySVM-TMH model. In this model, evolutionary based information position specific scoring matrix and discrete based information 6-letter exchange group are used to formulate transmembrane protein sequences. The noisy and extraneous attributes are eradicated using an optimization selection technique, particle swarm optimization, from both feature spaces. Finally, the selected feature spaces are combined in order to form ensemble feature space. Fuzzy-support vector Machine is utilized as a classification algorithm. Two benchmark datasets, including low and high resolution datasets, are used. At various levels, the performance of the PSOFuzzySVM-TMH model is assessed through 10-fold cross validation test. The empirical results reveal that the proposed framework PSOFuzzySVM-TMH outperforms in terms of classification performance in the examined datasets. It is ascertained that the proposed model might be a useful and high throughput tool for academia and research community for further structure and functional studies on transmembrane proteins.

  5. Complete genome sequences and comparative genome analysis of Lactobacillus plantarum strain 5-2 isolated from fermented soybean.

    PubMed

    Liu, Chen-Jian; Wang, Rui; Gong, Fu-Ming; Liu, Xiao-Feng; Zheng, Hua-Jun; Luo, Yi-Yong; Li, Xiao-Ran

    2015-12-01

    Lactobacillus plantarum is an important probiotic and is mostly isolated from fermented foods. We sequenced the genome of L. plantarum strain 5-2, which was derived from fermented soybean isolated from Yunnan province, China. The strain was determined to contain 3114 genes. Fourteen complete insertion sequence (IS) elements were found in 5-2 chromosome. There were 24 DNA replication proteins and 76 DNA repair proteins in the 5-2 genome. Consistent with the classification of L. plantarum as a facultative heterofermentative lactobacillus, the 5-2 genome encodes key enzymes required for the EMP (Embden-Meyerhof-Parnas) and phosphoketolase (PK) pathways. Several components of the secretion machinery are found in the 5-2 genome, which was compared with L. plantarum ST-III, JDM1 and WCFS1. Most of the specific proteins in the four genomes appeared to be related to their prophage elements. Copyright © 2015 Elsevier Inc. All rights reserved.

  6. Application of Wavelet Transform for PDZ Domain Classification

    PubMed Central

    Daqrouq, Khaled; Alhmouz, Rami; Balamesh, Ahmed; Memic, Adnan

    2015-01-01

    PDZ domains have been identified as part of an array of signaling proteins that are often unrelated, except for the well-conserved structural PDZ domain they contain. These domains have been linked to many disease processes including common Avian influenza, as well as very rare conditions such as Fraser and Usher syndromes. Historically, based on the interactions and the nature of bonds they form, PDZ domains have most often been classified into one of three classes (class I, class II and others - class III), that is directly dependent on their binding partner. In this study, we report on three unique feature extraction approaches based on the bigram and trigram occurrence and existence rearrangements within the domain's primary amino acid sequences in assisting PDZ domain classification. Wavelet packet transform (WPT) and Shannon entropy denoted by wavelet entropy (WE) feature extraction methods were proposed. Using 115 unique human and mouse PDZ domains, the existence rearrangement approach yielded a high recognition rate (78.34%), which outperformed our occurrence rearrangements based method. The recognition rate was (81.41%) with validation technique. The method reported for PDZ domain classification from primary sequences proved to be an encouraging approach for obtaining consistent classification results. We anticipate that by increasing the database size, we can further improve feature extraction and correct classification. PMID:25860375

  7. Information theory applications for biological sequence analysis.

    PubMed

    Vinga, Susana

    2014-05-01

    Information theory (IT) addresses the analysis of communication systems and has been widely applied in molecular biology. In particular, alignment-free sequence analysis and comparison greatly benefited from concepts derived from IT, such as entropy and mutual information. This review covers several aspects of IT applications, ranging from genome global analysis and comparison, including block-entropy estimation and resolution-free metrics based on iterative maps, to local analysis, comprising the classification of motifs, prediction of transcription factor binding sites and sequence characterization based on linguistic complexity and entropic profiles. IT has also been applied to high-level correlations that combine DNA, RNA or protein features with sequence-independent properties, such as gene mapping and phenotype analysis, and has also provided models based on communication systems theory to describe information transmission channels at the cell level and also during evolutionary processes. While not exhaustive, this review attempts to categorize existing methods and to indicate their relation with broader transversal topics such as genomic signatures, data compression and complexity, time series analysis and phylogenetic classification, providing a resource for future developments in this promising area.

  8. Predicting residue-wise contact orders in proteins by support vector regression.

    PubMed

    Song, Jiangning; Burrage, Kevin

    2006-10-03

    The residue-wise contact order (RWCO) describes the sequence separations between the residues of interest and its contacting residues in a protein sequence. It is a new kind of one-dimensional protein structure that represents the extent of long-range contacts and is considered as a generalization of contact order. Together with secondary structure, accessible surface area, the B factor, and contact number, RWCO provides comprehensive and indispensable important information to reconstructing the protein three-dimensional structure from a set of one-dimensional structural properties. Accurately predicting RWCO values could have many important applications in protein three-dimensional structure prediction and protein folding rate prediction, and give deep insights into protein sequence-structure relationships. We developed a novel approach to predict residue-wise contact order values in proteins based on support vector regression (SVR), starting from primary amino acid sequences. We explored seven different sequence encoding schemes to examine their effects on the prediction performance, including local sequence in the form of PSI-BLAST profiles, local sequence plus amino acid composition, local sequence plus molecular weight, local sequence plus secondary structure predicted by PSIPRED, local sequence plus molecular weight and amino acid composition, local sequence plus molecular weight and predicted secondary structure, and local sequence plus molecular weight, amino acid composition and predicted secondary structure. When using local sequences with multiple sequence alignments in the form of PSI-BLAST profiles, we could predict the RWCO distribution with a Pearson correlation coefficient (CC) between the predicted and observed RWCO values of 0.55, and root mean square error (RMSE) of 0.82, based on a well-defined dataset with 680 protein sequences. Moreover, by incorporating global features such as molecular weight and amino acid composition we could further improve the prediction performance with the CC to 0.57 and an RMSE of 0.79. In addition, combining the predicted secondary structure by PSIPRED was found to significantly improve the prediction performance and could yield the best prediction accuracy with a CC of 0.60 and RMSE of 0.78, which provided at least comparable performance compared with the other existing methods. The SVR method shows a prediction performance competitive with or at least comparable to the previously developed linear regression-based methods for predicting RWCO values. In contrast to support vector classification (SVC), SVR is very good at estimating the raw value profiles of the samples. The successful application of the SVR approach in this study reinforces the fact that support vector regression is a powerful tool in extracting the protein sequence-structure relationship and in estimating the protein structural profiles from amino acid sequences.

  9. Sequence composition and environment effects on residue fluctuations in protein structures

    NASA Astrophysics Data System (ADS)

    Ruvinsky, Anatoly M.; Vakser, Ilya A.

    2010-10-01

    Structure fluctuations in proteins affect a broad range of cell phenomena, including stability of proteins and their fragments, allosteric transitions, and energy transfer. This study presents a statistical-thermodynamic analysis of relationship between the sequence composition and the distribution of residue fluctuations in protein-protein complexes. A one-node-per-residue elastic network model accounting for the nonhomogeneous protein mass distribution and the interatomic interactions through the renormalized inter-residue potential is developed. Two factors, a protein mass distribution and a residue environment, were found to determine the scale of residue fluctuations. Surface residues undergo larger fluctuations than core residues in agreement with experimental observations. Ranking residues over the normalized scale of fluctuations yields a distinct classification of amino acids into three groups: (i) highly fluctuating-Gly, Ala, Ser, Pro, and Asp, (ii) moderately fluctuating-Thr, Asn, Gln, Lys, Glu, Arg, Val, and Cys, and (iii) weakly fluctuating-Ile, Leu, Met, Phe, Tyr, Trp, and His. The structural instability in proteins possibly relates to the high content of the highly fluctuating residues and a deficiency of the weakly fluctuating residues in irregular secondary structure elements (loops), chameleon sequences, and disordered proteins. Strong correlation between residue fluctuations and the sequence composition of protein loops supports this hypothesis. Comparing fluctuations of binding site residues (interface residues) with other surface residues shows that, on average, the interface is more rigid than the rest of the protein surface and Gly, Ala, Ser, Cys, Leu, and Trp have a propensity to form more stable docking patches on the interface. The findings have broad implications for understanding mechanisms of protein association and stability of protein structures.

  10. A comprehensive analysis of the Omp85/TpsB protein superfamily structural diversity, taxonomic occurrence, and evolution

    PubMed Central

    Heinz, Eva; Lithgow, Trevor

    2014-01-01

    Members of the Omp85/TpsB protein superfamily are ubiquitously distributed in Gram-negative bacteria, and function in protein translocation (e.g., FhaC) or the assembly of outer membrane proteins (e.g., BamA). Several recent findings are suggestive of a further level of variation in the superfamily, including the identification of the novel membrane protein assembly factor TamA and protein translocase PlpD. To investigate the diversity and the causal evolutionary events, we undertook a comprehensive comparative sequence analysis of the Omp85/TpsB proteins. A total of 10 protein subfamilies were apparent, distinguished in their domain structure and sequence signatures. In addition to the proteins FhaC, BamA, and TamA, for which structural and functional information is available, are families of proteins with so far undescribed domain architectures linked to the Omp85 β-barrel domain. This study brings a classification structure to a dynamic protein superfamily of high interest given its essential function for Gram-negative bacteria as well as its diverse domain architecture, and we discuss several scenarios of putative functions of these so far undescribed proteins. PMID:25101071

  11. Assessing the Pathogenicity of Insertion and Deletion Variants with the Variant Effect Scoring Tool (VEST‐Indel)

    PubMed Central

    Douville, Christopher; Masica, David L.; Stenson, Peter D.; Cooper, David N.; Gygax, Derek M.; Kim, Rick; Ryan, Michael

    2015-01-01

    ABSTRACT Insertion/deletion variants (indels) alter protein sequence and length, yet are highly prevalent in healthy populations, presenting a challenge to bioinformatics classifiers. Commonly used features—DNA and protein sequence conservation, indel length, and occurrence in repeat regions—are useful for inference of protein damage. However, these features can cause false positives when predicting the impact of indels on disease. Existing methods for indel classification suffer from low specificities, severely limiting clinical utility. Here, we further develop our variant effect scoring tool (VEST) to include the classification of in‐frame and frameshift indels (VEST‐indel) as pathogenic or benign. We apply 24 features, including a new “PubMed” feature, to estimate a gene's importance in human disease. When compared with four existing indel classifiers, our method achieves a drastically reduced false‐positive rate, improving specificity by as much as 90%. This approach of estimating gene importance might be generally applicable to missense and other bioinformatics pathogenicity predictors, which often fail to achieve high specificity. Finally, we tested all possible meta‐predictors that can be obtained from combining the four different indel classifiers using Boolean conjunctions and disjunctions, and derived a meta‐predictor with improved performance over any individual method. PMID:26442818

  12. Assessing the Pathogenicity of Insertion and Deletion Variants with the Variant Effect Scoring Tool (VEST-Indel).

    PubMed

    Douville, Christopher; Masica, David L; Stenson, Peter D; Cooper, David N; Gygax, Derek M; Kim, Rick; Ryan, Michael; Karchin, Rachel

    2016-01-01

    Insertion/deletion variants (indels) alter protein sequence and length, yet are highly prevalent in healthy populations, presenting a challenge to bioinformatics classifiers. Commonly used features--DNA and protein sequence conservation, indel length, and occurrence in repeat regions--are useful for inference of protein damage. However, these features can cause false positives when predicting the impact of indels on disease. Existing methods for indel classification suffer from low specificities, severely limiting clinical utility. Here, we further develop our variant effect scoring tool (VEST) to include the classification of in-frame and frameshift indels (VEST-indel) as pathogenic or benign. We apply 24 features, including a new "PubMed" feature, to estimate a gene's importance in human disease. When compared with four existing indel classifiers, our method achieves a drastically reduced false-positive rate, improving specificity by as much as 90%. This approach of estimating gene importance might be generally applicable to missense and other bioinformatics pathogenicity predictors, which often fail to achieve high specificity. Finally, we tested all possible meta-predictors that can be obtained from combining the four different indel classifiers using Boolean conjunctions and disjunctions, and derived a meta-predictor with improved performance over any individual method. © 2015 The Authors. **Human Mutation published by Wiley Periodicals, Inc.

  13. Functional annotation from the genome sequence of the giant panda.

    PubMed

    Huo, Tong; Zhang, Yinjie; Lin, Jianping

    2012-08-01

    The giant panda is one of the most critically endangered species due to the fragmentation and loss of its habitat. Studying the functions of proteins in this animal, especially specific trait-related proteins, is therefore necessary to protect the species. In this work, the functions of these proteins were investigated using the genome sequence of the giant panda. Data on 21,001 proteins and their functions were stored in the Giant Panda Protein Database, in which the proteins were divided into two groups: 20,179 proteins whose functions can be predicted by GeneScan formed the known-function group, whereas 822 proteins whose functions cannot be predicted by GeneScan comprised the unknown-function group. For the known-function group, we further classified the proteins by molecular function, biological process, cellular component, and tissue specificity. For the unknown-function group, we developed a strategy in which the proteins were filtered by cross-Blast to identify panda-specific proteins under the assumption that proteins related to the panda-specific traits in the unknown-function group exist. After this filtering procedure, we identified 32 proteins (2 of which are membrane proteins) specific to the giant panda genome as compared against the dog and horse genomes. Based on their amino acid sequences, these 32 proteins were further analyzed by functional classification using SVM-Prot, motif prediction using MyHits, and interacting protein prediction using the Database of Interacting Proteins. Nineteen proteins were predicted to be zinc-binding proteins, thus affecting the activities of nucleic acids. The 32 panda-specific proteins will be further investigated by structural and functional analysis.

  14. A Bayesian taxonomic classification method for 16S rRNA gene sequences with improved species-level accuracy.

    PubMed

    Gao, Xiang; Lin, Huaiying; Revanna, Kashi; Dong, Qunfeng

    2017-05-10

    Species-level classification for 16S rRNA gene sequences remains a serious challenge for microbiome researchers, because existing taxonomic classification tools for 16S rRNA gene sequences either do not provide species-level classification, or their classification results are unreliable. The unreliable results are due to the limitations in the existing methods which either lack solid probabilistic-based criteria to evaluate the confidence of their taxonomic assignments, or use nucleotide k-mer frequency as the proxy for sequence similarity measurement. We have developed a method that shows significantly improved species-level classification results over existing methods. Our method calculates true sequence similarity between query sequences and database hits using pairwise sequence alignment. Taxonomic classifications are assigned from the species to the phylum levels based on the lowest common ancestors of multiple database hits for each query sequence, and further classification reliabilities are evaluated by bootstrap confidence scores. The novelty of our method is that the contribution of each database hit to the taxonomic assignment of the query sequence is weighted by a Bayesian posterior probability based upon the degree of sequence similarity of the database hit to the query sequence. Our method does not need any training datasets specific for different taxonomic groups. Instead only a reference database is required for aligning to the query sequences, making our method easily applicable for different regions of the 16S rRNA gene or other phylogenetic marker genes. Reliable species-level classification for 16S rRNA or other phylogenetic marker genes is critical for microbiome research. Our software shows significantly higher classification accuracy than the existing tools and we provide probabilistic-based confidence scores to evaluate the reliability of our taxonomic classification assignments based on multiple database matches to query sequences. Despite its higher computational costs, our method is still suitable for analyzing large-scale microbiome datasets for practical purposes. Furthermore, our method can be applied for taxonomic classification of any phylogenetic marker gene sequences. Our software, called BLCA, is freely available at https://github.com/qunfengdong/BLCA .

  15. Protein structure determination by exhaustive search of Protein Data Bank derived databases.

    PubMed

    Stokes-Rees, Ian; Sliz, Piotr

    2010-12-14

    Parallel sequence and structure alignment tools have become ubiquitous and invaluable at all levels in the study of biological systems. We demonstrate the application and utility of this same parallel search paradigm to the process of protein structure determination, benefitting from the large and growing corpus of known structures. Such searches were previously computationally intractable. Through the method of Wide Search Molecular Replacement, developed here, they can be completed in a few hours with the aide of national-scale federated cyberinfrastructure. By dramatically expanding the range of models considered for structure determination, we show that small (less than 12% structural coverage) and low sequence identity (less than 20% identity) template structures can be identified through multidimensional template scoring metrics and used for structure determination. Many new macromolecular complexes can benefit significantly from such a technique due to the lack of known homologous protein folds or sequences. We demonstrate the effectiveness of the method by determining the structure of a full-length p97 homologue from Trichoplusia ni. Example cases with the MHC/T-cell receptor complex and the EmoB protein provide systematic estimates of minimum sequence identity, structure coverage, and structural similarity required for this method to succeed. We describe how this structure-search approach and other novel computationally intensive workflows are made tractable through integration with the US national computational cyberinfrastructure, allowing, for example, rapid processing of the entire Structural Classification of Proteins protein fragment database.

  16. Protein Kinase Classification with 2866 Hidden Markov Models and One Support Vector Machine

    NASA Technical Reports Server (NTRS)

    Weber, Ryan; New, Michael H.; Fonda, Mark (Technical Monitor)

    2002-01-01

    The main application considered in this paper is predicting true kinases from randomly permuted kinases that share the same length and amino acid distributions as the true kinases. Numerous methods already exist for this classification task, such as HMMs, motif-matchers, and sequence comparison algorithms. We build on some of these efforts by creating a vector from the output of thousands of structurally based HMMs, created offline with Pfam-A seed alignments using SAM-T99, which then must be combined into an overall classification for the protein. Then we use a Support Vector Machine for classifying this large ensemble Pfam-Vector, with a polynomial and chisquared kernel. In particular, the chi-squared kernel SVM performs better than the HMMs and better than the BLAST pairwise comparisons, when predicting true from false kinases in some respects, but no one algorithm is best for all purposes or in all instances so we consider the particular strengths and weaknesses of each.

  17. Large‐scale analysis of intrinsic disorder flavors and associated functions in the protein sequence universe

    PubMed Central

    Necci, Marco; Piovesan, Damiano

    2016-01-01

    Abstract Intrinsic disorder (ID) in proteins has been extensively described for the last decade; a large‐scale classification of ID in proteins is mostly missing. Here, we provide an extensive analysis of ID in the protein universe on the UniProt database derived from sequence‐based predictions in MobiDB. Almost half the sequences contain an ID region of at least five residues. About 9% of proteins have a long ID region of over 20 residues which are more abundant in Eukaryotic organisms and most frequently cover less than 20% of the sequence. A small subset of about 67,000 (out of over 80 million) proteins is fully disordered and mostly found in Viruses. Most proteins have only one ID, with short ID evenly distributed along the sequence and long ID overrepresented in the center. The charged residue composition of Das and Pappu was used to classify ID proteins by structural propensities and corresponding functional enrichment. Swollen Coils seem to be used mainly as structural components and in biosynthesis in both Prokaryotes and Eukaryotes. In Bacteria, they are confined in the nucleoid and in Viruses provide DNA binding function. Coils & Hairpins seem to be specialized in ribosome binding and methylation activities. Globules & Tadpoles bind antigens in Eukaryotes but are involved in killing other organisms and cytolysis in Bacteria. The Undefined class is used by Bacteria to bind toxic substances and mediate transport and movement between and within organisms in Viruses. Fully disordered proteins behave similarly, but are enriched for glycine residues and extracellular structures. PMID:27636733

  18. sc-PDB: a database for identifying variations and multiplicity of 'druggable' binding sites in proteins.

    PubMed

    Meslamani, Jamel; Rognan, Didier; Kellenberger, Esther

    2011-05-01

    The sc-PDB database is an annotated archive of druggable binding sites extracted from the Protein Data Bank. It contains all-atoms coordinates for 8166 protein-ligand complexes, chosen for their geometrical and physico-chemical properties. The sc-PDB provides a functional annotation for proteins, a chemical description for ligands and the detailed intermolecular interactions for complexes. The sc-PDB now includes a hierarchical classification of all the binding sites within a functional class. The sc-PDB entries were first clustered according to the protein name indifferent of the species. For each cluster, we identified dissimilar sites (e.g. catalytic and allosteric sites of an enzyme). SCOPE AND APPLICATIONS: The classification of sc-PDB targets by binding site diversity was intended to facilitate chemogenomics approaches to drug design. In ligand-based approaches, it avoids comparing ligands that do not share the same binding site. In structure-based approaches, it permits to quantitatively evaluate the diversity of the binding site definition (variations in size, sequence and/or structure). The sc-PDB database is freely available at: http://bioinfo-pharma.u-strasbg.fr/scPDB.

  19. MannDB: A microbial annotation database for protein characterization

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Zhou, C; Lam, M; Smith, J

    2006-05-19

    MannDB was created to meet a need for rapid, comprehensive automated protein sequence analyses to support selection of proteins suitable as targets for driving the development of reagents for pathogen or protein toxin detection. Because a large number of open-source tools were needed, it was necessary to produce a software system to scale the computations for whole-proteome analysis. Thus, we built a fully automated system for executing software tools and for storage, integration, and display of automated protein sequence analysis and annotation data. MannDB is a relational database that organizes data resulting from fully automated, high-throughput protein-sequence analyses using open-sourcemore » tools. Types of analyses provided include predictions of cleavage, chemical properties, classification, features, functional assignment, post-translational modifications, motifs, antigenicity, and secondary structure. Proteomes (lists of hypothetical and known proteins) are downloaded and parsed from Genbank and then inserted into MannDB, and annotations from SwissProt are downloaded when identifiers are found in the Genbank entry or when identical sequences are identified. Currently 36 open-source tools are run against MannDB protein sequences either on local systems or by means of batch submission to external servers. In addition, BLAST against protein entries in MvirDB, our database of microbial virulence factors, is performed. A web client browser enables viewing of computational results and downloaded annotations, and a query tool enables structured and free-text search capabilities. When available, links to external databases, including MvirDB, are provided. MannDB contains whole-proteome analyses for at least one representative organism from each category of biological threat organism listed by APHIS, CDC, HHS, NIAID, USDA, USFDA, and WHO. MannDB comprises a large number of genomes and comprehensive protein sequence analyses representing organisms listed as high-priority agents on the websites of several governmental organizations concerned with bio-terrorism. MannDB provides the user with a BLAST interface for comparison of native and non-native sequences and a query tool for conveniently selecting proteins of interest. In addition, the user has access to a web-based browser that compiles comprehensive and extensive reports.« less

  20. Conserved thioredoxin fold is present in Pisum sativum L. sieve element occlusion-1 protein

    PubMed Central

    Umate, Pavan; Tuteja, Renu

    2010-01-01

    Homology-based three-dimensional model for Pisum sativum sieve element occlusion 1 (Ps.SEO1) (forisomes) protein was constructed. A stretch of amino acids (residues 320 to 456) which is well conserved in all known members of forisomes proteins was used to model the 3D structure of Ps.SEO1. The structural prediction was done using Protein Homology/analogY Recognition Engine (PHYRE) web server. Based on studies of local sequence alignment, the thioredoxin-fold containing protein [Structural Classification of Proteins (SCOP) code d1o73a_], a member of the glutathione peroxidase family was selected as a template for modeling the spatial structure of Ps.SEO1. Selection was based on comparison of primary sequence, higher match quality and alignment accuracy. Motif 1 (EVF) is conserved in Ps.SEO1, Vicia faba (Vf.For1) and Medicago truncatula (MT.SEO3); motif 2 (KKED) is well conserved across all forisomes proteins and motif 3 (IGYIGNP) is conserved in Ps.SEO1 and Vf.For1. PMID:20404566

  1. A galaxy of folds.

    PubMed

    Alva, Vikram; Remmert, Michael; Biegert, Andreas; Lupas, Andrei N; Söding, Johannes

    2010-01-01

    Many protein classification systems capture homologous relationships by grouping domains into families and superfamilies on the basis of sequence similarity. Superfamilies with similar 3D structures are further grouped into folds. In the absence of discernable sequence similarity, these structural similarities were long thought to have originated independently, by convergent evolution. However, the growth of databases and advances in sequence comparison methods have led to the discovery of many distant evolutionary relationships that transcend the boundaries of superfamilies and folds. To investigate the contributions of convergent versus divergent evolution in the origin of protein folds, we clustered representative domains of known structure by their sequence similarity, treating them as point masses in a virtual 2D space which attract or repel each other depending on their pairwise sequence similarities. As expected, families in the same superfamily form tight clusters. But often, superfamilies of the same fold are linked with each other, suggesting that the entire fold evolved from an ancient prototype. Strikingly, some links connect superfamilies with different folds. They arise from modular peptide fragments of between 20 and 40 residues that co-occur in the connected folds in disparate structural contexts. These may be descendants of an ancestral pool of peptide modules that evolved as cofactors in the RNA world and from which the first folded proteins arose by amplification and recombination. Our galaxy of folds summarizes, in a single image, most known and many yet undescribed homologous relationships between protein superfamilies, providing new insights into the evolution of protein domains.

  2. fRMSDPred: Predicting Local RMSD Between Structural Fragments Using Sequence Information

    DTIC Science & Technology

    2007-04-04

    machine learning approaches for estimating the RMSD value of a pair of protein fragments. These estimated fragment-level RMSD values can be used to construct the alignment, assess the quality of an alignment, and identify high-quality alignment segments. We present algorithms to solve this fragment-level RMSD prediction problem using a supervised learning framework based on support vector regression and classification that incorporates protein profiles, predicted secondary structure, effective information encoding schemes, and novel second-order pairwise exponential kernel

  3. Sequence analysis of serum albumins reveals the molecular evolution of ligand recognition properties.

    PubMed

    Fanali, Gabriella; Ascenzi, Paolo; Bernardi, Giorgio; Fasano, Mauro

    2012-01-01

    Serum albumin (SA) is a circulating protein providing a depot and carrier for many endogenous and exogenous compounds. At least seven major binding sites have been identified by structural and functional investigations mainly in human SA. SA is conserved in vertebrates, with at least 49 entries in protein sequence databases. The multiple sequence analysis of this set of entries leads to the definition of a cladistic tree for the molecular evolution of SA orthologs in vertebrates, thus showing the clustering of the considered species, with lamprey SAs (Lethenteron japonicum and Petromyzon marinus) in a separate outgroup. Sequence analysis aimed at searching conserved domains revealed that most SA sequences are made up by three repeated domains (about 600 residues), as extensively characterized for human SA. On the contrary, lamprey SAs are giant proteins (about 1400 residues) comprising seven repeated domains. The phylogenetic analysis of the SA family reveals a stringent correlation with the taxonomic classification of the species available in sequence databases. A focused inspection of the sequences of ligand binding sites in SA revealed that in all sites most residues involved in ligand binding are conserved, although the versatility towards different ligands could be peculiar of higher organisms. Moreover, the analysis of molecular links between the different sites suggests that allosteric modulation mechanisms could be restricted to higher vertebrates.

  4. Efficient algorithms for accurate hierarchical clustering of huge datasets: tackling the entire protein space.

    PubMed

    Loewenstein, Yaniv; Portugaly, Elon; Fromer, Menachem; Linial, Michal

    2008-07-01

    UPGMA (average linking) is probably the most popular algorithm for hierarchical data clustering, especially in computational biology. However, UPGMA requires the entire dissimilarity matrix in memory. Due to this prohibitive requirement, UPGMA is not scalable to very large datasets. We present a novel class of memory-constrained UPGMA (MC-UPGMA) algorithms. Given any practical memory size constraint, this framework guarantees the correct clustering solution without explicitly requiring all dissimilarities in memory. The algorithms are general and are applicable to any dataset. We present a data-dependent characterization of hardness and clustering efficiency. The presented concepts are applicable to any agglomerative clustering formulation. We apply our algorithm to the entire collection of protein sequences, to automatically build a comprehensive evolutionary-driven hierarchy of proteins from sequence alone. The newly created tree captures protein families better than state-of-the-art large-scale methods such as CluSTr, ProtoNet4 or single-linkage clustering. We demonstrate that leveraging the entire mass embodied in all sequence similarities allows to significantly improve on current protein family clusterings which are unable to directly tackle the sheer mass of this data. Furthermore, we argue that non-metric constraints are an inherent complexity of the sequence space and should not be overlooked. The robustness of UPGMA allows significant improvement, especially for multidomain proteins, and for large or divergent families. A comprehensive tree built from all UniProt sequence similarities, together with navigation and classification tools will be made available as part of the ProtoNet service. A C++ implementation of the algorithm is available on request.

  5. Center for Regenerative Biology and Medicine at Mount Desert Island Biological Laboratory

    DTIC Science & Technology

    2012-06-01

    Code Axolotl microRNAs Zebrafish Polypterus 16. SECURITY CLASSIFICATION OF: 17. LIMITATION OF...controlled in both Polypterus and axolotl samples. These comparisons revealed a total of 2779 shared genes that are significantly upregulated during...UPREGULATED DOWNREGULATED Figure 1: Venn diagram of UniProt protein sequence IDs among Axolotl and Polypterus contigs that were up-regulated

  6. Classification of viral zoonosis through receptor pattern analysis.

    PubMed

    Bae, Se-Eun; Son, Hyeon Seok

    2011-04-13

    Viral zoonosis, the transmission of a virus from its primary vertebrate reservoir species to humans, requires ubiquitous cellular proteins known as receptor proteins. Zoonosis can occur not only through direct transmission from vertebrates to humans, but also through intermediate reservoirs or other environmental factors. Viruses can be categorized according to genotype (ssDNA, dsDNA, ssRNA and dsRNA viruses). Among them, the RNA viruses exhibit particularly high mutation rates and are especially problematic for this reason. Most zoonotic viruses are RNA viruses that change their envelope proteins to facilitate binding to various receptors of host species. In this study, we sought to predict zoonotic propensity through the analysis of receptor characteristics. We hypothesized that the major barrier to interspecies virus transmission is that receptor sequences vary among species--in other words, that the specific amino acid sequence of the receptor determines the ability of the viral envelope protein to attach to the cell. We analysed host-cell receptor sequences for their hydrophobicity/hydrophilicity characteristics. We then analysed these properties for similarities among receptors of different species and used a statistical discriminant analysis to predict the likelihood of transmission among species. This study is an attempt to predict zoonosis through simple computational analysis of receptor sequence differences. Our method may be useful in predicting the zoonotic potential of newly discovered viral strains.

  7. Reticulate classification of mosaic microbial genomes using NeAT website.

    PubMed

    Lima-Mendez, Gipsi

    2012-01-01

    The tree of life is the classical representation of the evolutionary relationships between existent species. A tree is appropriate to display the divergence of species through mutation, i.e., by vertical descent. However, lateral gene transfer (LGT) is excluded from such representations. When LGT contribution to genome evolution cannot be neglected (e.g., for prokaryotes and mobile genetic elements), the tree becomes misleading. Networks appear as an intuitive way to represent both vertical and horizontal relationships, while overlapping groups within such graphs are more suitable for their classification. Here, we describe a method to represent both vertical and horizontal relationships. We start with a set of genomes whose coded proteins have been grouped into families based on sequence similarity. Next, all pairs of genomes are compared, counting the number of proteins classified into the same family. From this comparison, we derive a weighted graph where genomes with a significant number of similar proteins are linked. Finally, we apply a two-step clustering of this graph to produce a classification where nodes can be assigned to multiple clusters. The procedure can be performed using the Network Analysis Tools (NeAT) website.

  8. Proline: The Distribution, Frequency, Positioning, and Common Functional Roles of Proline and Polyproline Sequences in the Human Proteome

    PubMed Central

    Morgan, Alexander A.; Rubenstein, Edward

    2013-01-01

    Proline is an anomalous amino acid. Its nitrogen atom is covalently locked within a ring, thus it is the only proteinogenic amino acid with a constrained phi angle. Sequences of three consecutive prolines can fold into polyproline helices, structures that join alpha helices and beta pleats as architectural motifs in protein configuration. Triproline helices are participants in protein-protein signaling interactions. Longer spans of repeat prolines also occur, containing as many as 27 consecutive proline residues. Little is known about the frequency, positioning, and functional significance of these proline sequences. Therefore we have undertaken a systematic bioinformatics study of proline residues in proteins. We analyzed the distribution and frequency of 687,434 proline residues among 18,666 human proteins, identifying single residues, dimers, trimers, and longer repeats. Proline accounts for 6.3% of the 10,882,808 protein amino acids. Of all proline residues, 4.4% are in trimers or longer spans. We detected patterns that influence function based on proline location, spacing, and concentration. We propose a classification based on proline-rich, polyproline-rich, and proline-poor status. Whereas singlet proline residues are often found in proteins that display recurring architectural patterns, trimers or longer proline sequences tend be associated with the absence of repetitive structural motifs. Spans of 6 or more are associated with DNA/RNA processing, actin, and developmental processes. We also suggest a role for proline in Kruppel-type zinc finger protein control of DNA expression, and in the nucleation and translocation of actin by the formin complex. PMID:23372670

  9. Feature selection and classification of protein-protein complexes based on their binding affinities using machine learning approaches.

    PubMed

    Yugandhar, K; Gromiha, M Michael

    2014-09-01

    Protein-protein interactions are intrinsic to virtually every cellular process. Predicting the binding affinity of protein-protein complexes is one of the challenging problems in computational and molecular biology. In this work, we related sequence features of protein-protein complexes with their binding affinities using machine learning approaches. We set up a database of 185 protein-protein complexes for which the interacting pairs are heterodimers and their experimental binding affinities are available. On the other hand, we have developed a set of 610 features from the sequences of protein complexes and utilized Ranker search method, which is the combination of Attribute evaluator and Ranker method for selecting specific features. We have analyzed several machine learning algorithms to discriminate protein-protein complexes into high and low affinity groups based on their Kd values. Our results showed a 10-fold cross-validation accuracy of 76.1% with the combination of nine features using support vector machines. Further, we observed accuracy of 83.3% on an independent test set of 30 complexes. We suggest that our method would serve as an effective tool for identifying the interacting partners in protein-protein interaction networks and human-pathogen interactions based on the strength of interactions. © 2014 Wiley Periodicals, Inc.

  10. Association algorithm to mine the rules that govern enzyme definition and to classify protein sequences

    PubMed Central

    Chiu, Shih-Hau; Chen, Chien-Chi; Yuan, Gwo-Fang; Lin, Thy-Hou

    2006-01-01

    Background The number of sequences compiled in many genome projects is growing exponentially, but most of them have not been characterized experimentally. An automatic annotation scheme must be in an urgent need to reduce the gap between the amount of new sequences produced and reliable functional annotation. This work proposes rules for automatically classifying the fungus genes. The approach involves elucidating the enzyme classifying rule that is hidden in UniProt protein knowledgebase and then applying it for classification. The association algorithm, Apriori, is utilized to mine the relationship between the enzyme class and significant InterPro entries. The candidate rules are evaluated for their classificatory capacity. Results There were five datasets collected from the Swiss-Prot for establishing the annotation rules. These were treated as the training sets. The TrEMBL entries were treated as the testing set. A correct enzyme classification rate of 70% was obtained for the prokaryote datasets and a similar rate of about 80% was obtained for the eukaryote datasets. The fungus training dataset which lacks an enzyme class description was also used to evaluate the fungus candidate rules. A total of 88 out of 5085 test entries were matched with the fungus rule set. These were otherwise poorly annotated using their functional descriptions. Conclusion The feasibility of using the method presented here to classify enzyme classes based on the enzyme domain rules is evident. The rules may be also employed by the protein annotators in manual annotation or implemented in an automatic annotation flowchart. PMID:16776838

  11. SCPRED: accurate prediction of protein structural class for sequences of twilight-zone similarity with predicting sequences.

    PubMed

    Kurgan, Lukasz; Cios, Krzysztof; Chen, Ke

    2008-05-01

    Protein structure prediction methods provide accurate results when a homologous protein is predicted, while poorer predictions are obtained in the absence of homologous templates. However, some protein chains that share twilight-zone pairwise identity can form similar folds and thus determining structural similarity without the sequence similarity would be desirable for the structure prediction. The folding type of a protein or its domain is defined as the structural class. Current structural class prediction methods that predict the four structural classes defined in SCOP provide up to 63% accuracy for the datasets in which sequence identity of any pair of sequences belongs to the twilight-zone. We propose SCPRED method that improves prediction accuracy for sequences that share twilight-zone pairwise similarity with sequences used for the prediction. SCPRED uses a support vector machine classifier that takes several custom-designed features as its input to predict the structural classes. Based on extensive design that considers over 2300 index-, composition- and physicochemical properties-based features along with features based on the predicted secondary structure and content, the classifier's input includes 8 features based on information extracted from the secondary structure predicted with PSI-PRED and one feature computed from the sequence. Tests performed with datasets of 1673 protein chains, in which any pair of sequences shares twilight-zone similarity, show that SCPRED obtains 80.3% accuracy when predicting the four SCOP-defined structural classes, which is superior when compared with over a dozen recent competing methods that are based on support vector machine, logistic regression, and ensemble of classifiers predictors. The SCPRED can accurately find similar structures for sequences that share low identity with sequence used for the prediction. The high predictive accuracy achieved by SCPRED is attributed to the design of the features, which are capable of separating the structural classes in spite of their low dimensionality. We also demonstrate that the SCPRED's predictions can be successfully used as a post-processing filter to improve performance of modern fold classification methods.

  12. SCPRED: Accurate prediction of protein structural class for sequences of twilight-zone similarity with predicting sequences

    PubMed Central

    Kurgan, Lukasz; Cios, Krzysztof; Chen, Ke

    2008-01-01

    Background Protein structure prediction methods provide accurate results when a homologous protein is predicted, while poorer predictions are obtained in the absence of homologous templates. However, some protein chains that share twilight-zone pairwise identity can form similar folds and thus determining structural similarity without the sequence similarity would be desirable for the structure prediction. The folding type of a protein or its domain is defined as the structural class. Current structural class prediction methods that predict the four structural classes defined in SCOP provide up to 63% accuracy for the datasets in which sequence identity of any pair of sequences belongs to the twilight-zone. We propose SCPRED method that improves prediction accuracy for sequences that share twilight-zone pairwise similarity with sequences used for the prediction. Results SCPRED uses a support vector machine classifier that takes several custom-designed features as its input to predict the structural classes. Based on extensive design that considers over 2300 index-, composition- and physicochemical properties-based features along with features based on the predicted secondary structure and content, the classifier's input includes 8 features based on information extracted from the secondary structure predicted with PSI-PRED and one feature computed from the sequence. Tests performed with datasets of 1673 protein chains, in which any pair of sequences shares twilight-zone similarity, show that SCPRED obtains 80.3% accuracy when predicting the four SCOP-defined structural classes, which is superior when compared with over a dozen recent competing methods that are based on support vector machine, logistic regression, and ensemble of classifiers predictors. Conclusion The SCPRED can accurately find similar structures for sequences that share low identity with sequence used for the prediction. The high predictive accuracy achieved by SCPRED is attributed to the design of the features, which are capable of separating the structural classes in spite of their low dimensionality. We also demonstrate that the SCPRED's predictions can be successfully used as a post-processing filter to improve performance of modern fold classification methods. PMID:18452616

  13. SigniSite: Identification of residue-level genotype-phenotype correlations in protein multiple sequence alignments.

    PubMed

    Jessen, Leon Eyrich; Hoof, Ilka; Lund, Ole; Nielsen, Morten

    2013-07-01

    Identifying which mutation(s) within a given genotype is responsible for an observable phenotype is important in many aspects of molecular biology. Here, we present SigniSite, an online application for subgroup-free residue-level genotype-phenotype correlation. In contrast to similar methods, SigniSite does not require any pre-definition of subgroups or binary classification. Input is a set of protein sequences where each sequence has an associated real number, quantifying a given phenotype. SigniSite will then identify which amino acid residues are significantly associated with the data set phenotype. As output, SigniSite displays a sequence logo, depicting the strength of the phenotype association of each residue and a heat-map identifying 'hot' or 'cold' regions. SigniSite was benchmarked against SPEER, a state-of-the-art method for the prediction of specificity determining positions (SDP) using a set of human immunodeficiency virus protease-inhibitor genotype-phenotype data and corresponding resistance mutation scores from the Stanford University HIV Drug Resistance Database, and a data set of protein families with experimentally annotated SDPs. For both data sets, SigniSite was found to outperform SPEER. SigniSite is available at: http://www.cbs.dtu.dk/services/SigniSite/.

  14. Specificity and non-specificity in RNA–protein interactions

    PubMed Central

    Jankowsky, Eckhard; Harris, Michael E.

    2016-01-01

    Gene expression is regulated by complex networks of interactions between RNAs and proteins. Proteins that interact with RNA have been traditionally viewed as either specific or non-specific; specific proteins interact preferentially with defined RNA sequence or structure motifs, whereas non-specific proteins interact with RNA sites devoid of such characteristics. Recent studies indicate that the binary “specific vs. non-specific” classification is insufficient to describe the full spectrum of RNA–protein interactions. Here, we review new methods that enable quantitative measurements of protein binding to large numbers of RNA variants, and the concepts aimed as describing resulting binding spectra: affinity distributions, comprehensive binding models and free energy landscapes. We discuss how these new methodologies and associated concepts enable work towards inclusive, quantitative models for specific and non-specific RNA–protein interactions. PMID:26285679

  15. Improving Virus Taxonomy by Recontextualizing Sequence-Based Classification with Biologically Relevant Data: the Case of the Alphacoronavirus 1 Species

    PubMed Central

    André, Nicole M.

    2018-01-01

    ABSTRACT The difficulties related to virus taxonomy have been amplified by recent advances in next-generation sequencing and metagenomics, prompting the field to revisit the question of what constitutes a useful viral classification. Here, taking a challenging classification found in coronaviruses, we argue that consideration of biological properties in addition to sequence-based demarcations is critical for generating useful taxonomy that recapitulates complex evolutionary histories. Within the Alphacoronavirus genus, the Alphacoronavirus 1 species encompasses several biologically distinct viruses. We carried out functionally based phylogenetic analysis, centered on the spike gene, which encodes the main surface antigen and primary driver of tropism and pathogenesis. Within the Alphacoronavirus 1 species, we identify clade A (encompassing serotype I feline coronavirus [FCoV] and canine coronavirus [CCoV]) and clade B (grouping serotype II FCoV and CCoV and transmissible gastroenteritis virus [TGEV]-like viruses). We propose this clade designation, along with the newly proposed Alphacoronavirus 2 species, as an improved way to classify the Alphacoronavirus genus. IMPORTANCE Our work focuses on improving the classification of the Alphacoronavirus genus. The Alphacoronavirus 1 species groups viruses of veterinary importance that infect distinct mammalian hosts and includes canine and feline coronaviruses and transmissible gastroenteritis virus. It is the prototype species of the Alphacoronavirus genus; however, it encompasses biologically distinct viruses. To better characterize this prototypical species, we performed phylogenetic analyses based on the sequences of the spike protein, one of the main determinants of tropism and pathogenesis, and reveal the existence of two subgroups or clades that fit with previously established serotype demarcations. We propose a new clade designation to better classify Alphacoronavirus 1 members. PMID:29299531

  16. Improving Virus Taxonomy by Recontextualizing Sequence-Based Classification with Biologically Relevant Data: the Case of the Alphacoronavirus 1 Species.

    PubMed

    Whittaker, Gary R; André, Nicole M; Millet, Jean Kaoru

    2018-01-01

    The difficulties related to virus taxonomy have been amplified by recent advances in next-generation sequencing and metagenomics, prompting the field to revisit the question of what constitutes a useful viral classification. Here, taking a challenging classification found in coronaviruses, we argue that consideration of biological properties in addition to sequence-based demarcations is critical for generating useful taxonomy that recapitulates complex evolutionary histories. Within the Alphacoronavirus genus, the Alphacoronavirus 1 species encompasses several biologically distinct viruses. We carried out functionally based phylogenetic analysis, centered on the spike gene, which encodes the main surface antigen and primary driver of tropism and pathogenesis. Within the Alphacoronavirus 1 species, we identify clade A (encompassing serotype I feline coronavirus [FCoV] and canine coronavirus [CCoV]) and clade B (grouping serotype II FCoV and CCoV and transmissible gastroenteritis virus [TGEV]-like viruses). We propose this clade designation, along with the newly proposed Alphacoronavirus 2 species, as an improved way to classify the Alphacoronavirus genus. IMPORTANCE Our work focuses on improving the classification of the Alphacoronavirus genus. The Alphacoronavirus 1 species groups viruses of veterinary importance that infect distinct mammalian hosts and includes canine and feline coronaviruses and transmissible gastroenteritis virus. It is the prototype species of the Alphacoronavirus genus; however, it encompasses biologically distinct viruses. To better characterize this prototypical species, we performed phylogenetic analyses based on the sequences of the spike protein, one of the main determinants of tropism and pathogenesis, and reveal the existence of two subgroups or clades that fit with previously established serotype demarcations. We propose a new clade designation to better classify Alphacoronavirus 1 members.

  17. New technologies accelerate the exploration of non-coding RNAs in horticultural plants

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Liu, Degao; Mewalal, Ritesh; Hu, Rongbin

    Non-coding RNAs (ncRNAs), that is, RNAs not translated into proteins, are crucial regulators of a variety of biological processes in plants. While protein-encoding genes have been relatively well-annotated in sequenced genomes, accounting for a small portion of the genome space in plants, the universe of plant ncRNAs is rapidly expanding. Recent advances in experimental and computational technologies have generated a great momentum for discovery and functional characterization of ncRNAs. Here we summarize the classification and known biological functions of plant ncRNAs, review the application of next-generation sequencing (NGS) technology and ribosome profiling technology to ncRNA discovery in horticultural plants andmore » discuss the application of new technologies, especially the new genome-editing tool clustered regularly interspaced short palindromic repeat (CRISPR)/CRISPR-associated protein 9 (Cas9) systems, to functional characterization of plant ncRNAs.« less

  18. New technologies accelerate the exploration of non-coding RNAs in horticultural plants

    PubMed Central

    Liu, Degao; Mewalal, Ritesh; Hu, Rongbin; Tuskan, Gerald A; Yang, Xiaohan

    2017-01-01

    Non-coding RNAs (ncRNAs), that is, RNAs not translated into proteins, are crucial regulators of a variety of biological processes in plants. While protein-encoding genes have been relatively well-annotated in sequenced genomes, accounting for a small portion of the genome space in plants, the universe of plant ncRNAs is rapidly expanding. Recent advances in experimental and computational technologies have generated a great momentum for discovery and functional characterization of ncRNAs. Here we summarize the classification and known biological functions of plant ncRNAs, review the application of next-generation sequencing (NGS) technology and ribosome profiling technology to ncRNA discovery in horticultural plants and discuss the application of new technologies, especially the new genome-editing tool clustered regularly interspaced short palindromic repeat (CRISPR)/CRISPR-associated protein 9 (Cas9) systems, to functional characterization of plant ncRNAs. PMID:28698797

  19. Numerical classification of coding sequences

    NASA Technical Reports Server (NTRS)

    Collins, D. W.; Liu, C. C.; Jukes, T. H.

    1992-01-01

    DNA sequences coding for protein may be represented by counts of nucleotides or codons. A complete reading frame may be abbreviated by its base count, e.g. A76C158G121T74, or with the corresponding codon table, e.g. (AAA)0(AAC)1(AAG)9 ... (TTT)0. We propose that these numerical designations be used to augment current methods of sequence annotation. Because base counts and codon tables do not require revision as knowledge of function evolves, they are well-suited to act as cross-references, for example to identify redundant GenBank entries. These descriptors may be compared, in place of DNA sequences, to extract homologous genes from large databases. This approach permits rapid searching with good selectivity.

  20. An Automatic Segmentation Method Combining an Active Contour Model and a Classification Technique for Detecting Polycomb-group Proteinsin High-Throughput Microscopy Images.

    PubMed

    Gregoretti, Francesco; Cesarini, Elisa; Lanzuolo, Chiara; Oliva, Gennaro; Antonelli, Laura

    2016-01-01

    The large amount of data generated in biological experiments that rely on advanced microscopy can be handled only with automated image analysis. Most analyses require a reliable cell image segmentation eventually capable of detecting subcellular structures.We present an automatic segmentation method to detect Polycomb group (PcG) proteins areas isolated from nuclei regions in high-resolution fluorescent cell image stacks. It combines two segmentation algorithms that use an active contour model and a classification technique serving as a tool to better understand the subcellular three-dimensional distribution of PcG proteins in live cell image sequences. We obtained accurate results throughout several cell image datasets, coming from different cell types and corresponding to different fluorescent labels, without requiring elaborate adjustments to each dataset.

  1. Improving Protein Fold Recognition by Deep Learning Networks.

    PubMed

    Jo, Taeho; Hou, Jie; Eickholt, Jesse; Cheng, Jianlin

    2015-12-04

    For accurate recognition of protein folds, a deep learning network method (DN-Fold) was developed to predict if a given query-template protein pair belongs to the same structural fold. The input used stemmed from the protein sequence and structural features extracted from the protein pair. We evaluated the performance of DN-Fold along with 18 different methods on Lindahl's benchmark dataset and on a large benchmark set extracted from SCOP 1.75 consisting of about one million protein pairs, at three different levels of fold recognition (i.e., protein family, superfamily, and fold) depending on the evolutionary distance between protein sequences. The correct recognition rate of ensembled DN-Fold for Top 1 predictions is 84.5%, 61.5%, and 33.6% and for Top 5 is 91.2%, 76.5%, and 60.7% at family, superfamily, and fold levels, respectively. We also evaluated the performance of single DN-Fold (DN-FoldS), which showed the comparable results at the level of family and superfamily, compared to ensemble DN-Fold. Finally, we extended the binary classification problem of fold recognition to real-value regression task, which also show a promising performance. DN-Fold is freely available through a web server at http://iris.rnet.missouri.edu/dnfold.

  2. Classification of protein quaternary structure by functional domain composition

    PubMed Central

    Yu, Xiaojing; Wang, Chuan; Li, Yixue

    2006-01-01

    Background The number and the arrangement of subunits that form a protein are referred to as quaternary structure. Quaternary structure is an important protein attribute that is closely related to its function. Proteins with quaternary structure are called oligomeric proteins. Oligomeric proteins are involved in various biological processes, such as metabolism, signal transduction, and chromosome replication. Thus, it is highly desirable to develop some computational methods to automatically classify the quaternary structure of proteins from their sequences. Results To explore this problem, we adopted an approach based on the functional domain composition of proteins. Every protein was represented by a vector calculated from the domains in the PFAM database. The nearest neighbor algorithm (NNA) was used for classifying the quaternary structure of proteins from this information. The jackknife cross-validation test was performed on the non-redundant protein dataset in which the sequence identity was less than 25%. The overall success rate obtained is 75.17%. Additionally, to demonstrate the effectiveness of this method, we predicted the proteins in an independent dataset and achieved an overall success rate of 84.11% Conclusion Compared with the amino acid composition method and Blast, the results indicate that the domain composition approach may be a more effective and promising high-throughput method in dealing with this complicated problem in bioinformatics. PMID:16584572

  3. MIPS: analysis and annotation of genome information in 2007

    PubMed Central

    Mewes, H. W.; Dietmann, S.; Frishman, D.; Gregory, R.; Mannhaupt, G.; Mayer, K. F. X.; Münsterkötter, M.; Ruepp, A.; Spannagl, M.; Stümpflen, V.; Rattei, T.

    2008-01-01

    The Munich Information Center for Protein Sequences (MIPS-GSF, Neuherberg, Germany) combines automatic processing of large amounts of sequences with manual annotation of selected model genomes. Due to the massive growth of the available data, the depth of annotation varies widely between independent databases. Also, the criteria for the transfer of information from known to orthologous sequences are diverse. To cope with the task of global in-depth genome annotation has become unfeasible. Therefore, our efforts are dedicated to three levels of annotation: (i) the curation of selected genomes, in particular from fungal and plant taxa (e.g. CYGD, MNCDB, MatDB), (ii) the comprehensive, consistent, automatic annotation employing exhaustive methods for the computation of sequence similarities and sequence-related attributes as well as the classification of individual sequences (SIMAP, PEDANT and FunCat) and (iii) the compilation of manually curated databases for protein interactions based on scrutinized information from the literature to serve as an accepted set of reliable annotated interaction data (MPACT, MPPI, CORUM). All databases and tools described as well as the detailed descriptions of our projects can be accessed through the MIPS web server (http://mips.gsf.de). PMID:18158298

  4. MIPS: analysis and annotation of genome information in 2007.

    PubMed

    Mewes, H W; Dietmann, S; Frishman, D; Gregory, R; Mannhaupt, G; Mayer, K F X; Münsterkötter, M; Ruepp, A; Spannagl, M; Stümpflen, V; Rattei, T

    2008-01-01

    The Munich Information Center for Protein Sequences (MIPS-GSF, Neuherberg, Germany) combines automatic processing of large amounts of sequences with manual annotation of selected model genomes. Due to the massive growth of the available data, the depth of annotation varies widely between independent databases. Also, the criteria for the transfer of information from known to orthologous sequences are diverse. To cope with the task of global in-depth genome annotation has become unfeasible. Therefore, our efforts are dedicated to three levels of annotation: (i) the curation of selected genomes, in particular from fungal and plant taxa (e.g. CYGD, MNCDB, MatDB), (ii) the comprehensive, consistent, automatic annotation employing exhaustive methods for the computation of sequence similarities and sequence-related attributes as well as the classification of individual sequences (SIMAP, PEDANT and FunCat) and (iii) the compilation of manually curated databases for protein interactions based on scrutinized information from the literature to serve as an accepted set of reliable annotated interaction data (MPACT, MPPI, CORUM). All databases and tools described as well as the detailed descriptions of our projects can be accessed through the MIPS web server (http://mips.gsf.de).

  5. GPU-Based Point Cloud Superpositioning for Structural Comparisons of Protein Binding Sites.

    PubMed

    Leinweber, Matthias; Fober, Thomas; Freisleben, Bernd

    2018-01-01

    In this paper, we present a novel approach to solve the labeled point cloud superpositioning problem for performing structural comparisons of protein binding sites. The solution is based on a parallel evolution strategy that operates on large populations and runs on GPU hardware. The proposed evolution strategy reduces the likelihood of getting stuck in a local optimum of the multimodal real-valued optimization problem represented by labeled point cloud superpositioning. The performance of the GPU-based parallel evolution strategy is compared to a previously proposed CPU-based sequential approach for labeled point cloud superpositioning, indicating that the GPU-based parallel evolution strategy leads to qualitatively better results and significantly shorter runtimes, with speed improvements of up to a factor of 1,500 for large populations. Binary classification tests based on the ATP, NADH, and FAD protein subsets of CavBase, a database containing putative binding sites, show average classification rate improvements from about 92 percent (CPU) to 96 percent (GPU). Further experiments indicate that the proposed GPU-based labeled point cloud superpositioning approach can be superior to traditional protein comparison approaches based on sequence alignments.

  6. Intelligent data analysis to model and understand live cell time-lapse sequences.

    PubMed

    Paterson, Allan; Ashtari, M; Ribé, D; Stenbeck, G; Tucker, A

    2012-01-01

    One important aspect of cellular function, which is at the basis of tissue homeostasis, is the delivery of proteins to their correct destinations. Significant advances in live cell microscopy have allowed tracking of these pathways by following the dynamics of fluorescently labelled proteins in living cells. This paper explores intelligent data analysis techniques to model the dynamic behavior of proteins in living cells as well as to classify different experimental conditions. We use a combination of decision tree classification and hidden Markov models. In particular, we introduce a novel approach to "align" hidden Markov models so that hidden states from different models can be cross-compared. Our models capture the dynamics of two experimental conditions accurately with a stable hidden state for control data and multiple (less stable) states for the experimental data recapitulating the behaviour of particle trajectories within live cell time-lapse data. In addition to having successfully developed an automated framework for the classification of protein transport dynamics from live cell time-lapse data our model allows us to understand the dynamics of a complex trafficking pathway in living cells in culture.

  7. The property distance index PD predicts peptides that cross-react with IgE antibodies

    PubMed Central

    Ivanciuc, Ovidiu; Midoro-Horiuti, Terumi; Schein, Catherine H.; Xie, Liping; Hillman, Gilbert R.; Goldblum, Randall M.; Braun, Werner

    2009-01-01

    Similarities in the sequence and structure of allergens can explain clinically observed cross-reactivities. Distinguishing sequences that bind IgE in patient sera can be used to identify potentially allergenic protein sequences and aid in the design of hypo-allergenic proteins. The property distance index PD, incorporated in our Structural Database of Allergenic Proteins (SDAP, http://fermi.utmb.edu/SDAP/), may identify potentially cross-reactive segments of proteins, based on their similarity to known IgE epitopes. We sought to obtain experimental validation of the PD index as a quantitative predictor of IgE cross-reactivity, by designing peptide variants with predetermined PD scores relative to three linear IgE epitopes of Jun a 1, the dominant allergen from mountain cedar pollen. For each of the three epitopes, 60 peptides were designed with increasing PD values (decreasing physicochemical similarity) to the starting sequence. The peptides synthesized on a derivatized cellulose membrane were probed with sera from patients who were allergic to Jun a 1, and the experimental data were interpreted with a PD classification method. Peptides with low PD values relative to a given epitope were more likely to bind IgE from the sera than were those with PD values larger than 6. Control sequences, with PD values between 18 and 20 to all the three epitopes, did not bind patient IgE, thus validating our procedure for identifying negative control peptides. The PD index is a statistically validated method to detect discrete regions of proteins that have a high probability of cross-reacting with IgE from allergic patients. PMID:18950868

  8. Delineating slowly and rapidly evolving fractions of the Drosophila genome.

    PubMed

    Keith, Jonathan M; Adams, Peter; Stephen, Stuart; Mattick, John S

    2008-05-01

    Evolutionary conservation is an important indicator of function and a major component of bioinformatic methods to identify non-protein-coding genes. We present a new Bayesian method for segmenting pairwise alignments of eukaryotic genomes while simultaneously classifying segments into slowly and rapidly evolving fractions. We also describe an information criterion similar to the Akaike Information Criterion (AIC) for determining the number of classes. Working with pairwise alignments enables detection of differences in conservation patterns among closely related species. We analyzed three whole-genome and three partial-genome pairwise alignments among eight Drosophila species. Three distinct classes of conservation level were detected. Sequences comprising the most slowly evolving component were consistent across a range of species pairs, and constituted approximately 62-66% of the D. melanogaster genome. Almost all (>90%) of the aligned protein-coding sequence is in this fraction, suggesting much of it (comprising the majority of the Drosophila genome, including approximately 56% of non-protein-coding sequences) is functional. The size and content of the most rapidly evolving component was species dependent, and varied from 1.6% to 4.8%. This fraction is also enriched for protein-coding sequence (while containing significant amounts of non-protein-coding sequence), suggesting it is under positive selection. We also classified segments according to conservation and GC content simultaneously. This analysis identified numerous sub-classes of those identified on the basis of conservation alone, but was nevertheless consistent with that classification. Software, data, and results available at www.maths.qut.edu.au/-keithj/. Genomic segments comprising the conservation classes available in BED format.

  9. Efficient algorithms for accurate hierarchical clustering of huge datasets: tackling the entire protein space

    PubMed Central

    Loewenstein, Yaniv; Portugaly, Elon; Fromer, Menachem; Linial, Michal

    2008-01-01

    Motivation: UPGMA (average linking) is probably the most popular algorithm for hierarchical data clustering, especially in computational biology. However, UPGMA requires the entire dissimilarity matrix in memory. Due to this prohibitive requirement, UPGMA is not scalable to very large datasets. Application: We present a novel class of memory-constrained UPGMA (MC-UPGMA) algorithms. Given any practical memory size constraint, this framework guarantees the correct clustering solution without explicitly requiring all dissimilarities in memory. The algorithms are general and are applicable to any dataset. We present a data-dependent characterization of hardness and clustering efficiency. The presented concepts are applicable to any agglomerative clustering formulation. Results: We apply our algorithm to the entire collection of protein sequences, to automatically build a comprehensive evolutionary-driven hierarchy of proteins from sequence alone. The newly created tree captures protein families better than state-of-the-art large-scale methods such as CluSTr, ProtoNet4 or single-linkage clustering. We demonstrate that leveraging the entire mass embodied in all sequence similarities allows to significantly improve on current protein family clusterings which are unable to directly tackle the sheer mass of this data. Furthermore, we argue that non-metric constraints are an inherent complexity of the sequence space and should not be overlooked. The robustness of UPGMA allows significant improvement, especially for multidomain proteins, and for large or divergent families. Availability: A comprehensive tree built from all UniProt sequence similarities, together with navigation and classification tools will be made available as part of the ProtoNet service. A C++ implementation of the algorithm is available on request. Contact: lonshy@cs.huji.ac.il PMID:18586742

  10. GenoMycDB: a database for comparative analysis of mycobacterial genes and genomes.

    PubMed

    Catanho, Marcos; Mascarenhas, Daniel; Degrave, Wim; Miranda, Antonio Basílio de

    2006-03-31

    Several databases and computational tools have been created with the aim of organizing, integrating and analyzing the wealth of information generated by large-scale sequencing projects of mycobacterial genomes and those of other organisms. However, with very few exceptions, these databases and tools do not allow for massive and/or dynamic comparison of these data. GenoMycDB (http://www.dbbm.fiocruz.br/GenoMycDB) is a relational database built for large-scale comparative analyses of completely sequenced mycobacterial genomes, based on their predicted protein content. Its central structure is composed of the results obtained after pair-wise sequence alignments among all the predicted proteins coded by the genomes of six mycobacteria: Mycobacterium tuberculosis (strains H37Rv and CDC1551), M. bovis AF2122/97, M. avium subsp. paratuberculosis K10, M. leprae TN, and M. smegmatis MC2 155. The database stores the computed similarity parameters of every aligned pair, providing for each protein sequence the predicted subcellular localization, the assigned cluster of orthologous groups, the features of the corresponding gene, and links to several important databases. Tables containing pairs or groups of potential homologs between selected species/strains can be produced dynamically by user-defined criteria, based on one or multiple sequence similarity parameters. In addition, searches can be restricted according to the predicted subcellular localization of the protein, the DNA strand of the corresponding gene and/or the description of the protein. Massive data search and/or retrieval are available, and different ways of exporting the result are offered. GenoMycDB provides an on-line resource for the functional classification of mycobacterial proteins as well as for the analysis of genome structure, organization, and evolution.

  11. Insights into rubber biosynthesis from transcriptome analysis of Hevea brasiliensis latex.

    PubMed

    Chow, Keng-See; Wan, Kiew-Lian; Isa, Mohd Noor Mat; Bahari, Azlina; Tan, Siang-Hee; Harikrishna, K; Yeang, Hoong-Yeet

    2007-01-01

    Hevea brasiliensis is the most widely cultivated species for commercial production of natural rubber (cis-polyisoprene). In this study, 10,040 expressed sequence tags (ESTs) were generated from the latex of the rubber tree, which represents the cytoplasmic content of a single cell type, in order to analyse the latex transcription profile with emphasis on rubber biosynthesis-related genes. A total of 3,441 unique transcripts (UTs) were obtained after quality editing and assembly of EST sequences. Functional classification of UTs according to the Gene Ontology convention showed that 73.8% were related to genes of unknown function. Among highly expressed ESTs, a significant proportion encoded proteins related to rubber biosynthesis and stress or defence responses. Sequences encoding rubber particle membrane proteins (RPMPs) belonging to three protein families accounted for 12% of the ESTs. Characterization of these ESTs revealed nine RPMP variants (7.9-27 kDa) including the 14 kDa REF (rubber elongation factor) and 22 kDa SRPP (small rubber particle protein). The expression of multiple RPMP isoforms in latex was shown using antibodies against REF and SRPP. Both EST and quantitative reverse transcription-PCR (QRT-PCR) analyses demonstrated REF and SRPP to be the most abundant transcripts in latex. Besides rubber biosynthesis, comparative sequence analysis showed that the RPMPs are highly similar to sequences in the plant kingdom having stress-related functions. Implications of the RPMP function in cis-polyisoprene biosynthesis in the context of transcript abundance and differential gene expression are discussed.

  12. Availability of MudPIT data for classification of biological samples.

    PubMed

    Silvestre, Dario Di; Zoppis, Italo; Brambilla, Francesca; Bellettato, Valeria; Mauri, Giancarlo; Mauri, Pierluigi

    2013-01-14

    Mass spectrometry is an important analytical tool for clinical proteomics. Primarily employed for biomarker discovery, it is increasingly used for developing methods which may help to provide unambiguous diagnosis of biological samples. In this context, we investigated the classification of phenotypes by applying support vector machine (SVM) on experimental data obtained by MudPIT approach. In particular, we compared the performance capabilities of SVM by using two independent collection of complex samples and different data-types, such as mass spectra (m/z), peptides and proteins. Globally, protein and peptide data allowed a better discriminant informative content than experimental mass spectra (overall accuracy higher than 87% in both collection 1 and 2). These results indicate that sequencing of peptides and proteins reduces the experimental noise affecting the raw mass spectra, and allows the extraction of more informative features available for the effective classification of samples. In addition, proteins and peptides features selected by SVM matched for 80% with the differentially expressed proteins identified by the MAProMa software. These findings confirm the availability of the most label-free quantitative methods based on processing of spectral count and SEQUEST-based SCORE values. On the other hand, it stresses the usefulness of MudPIT data for a correct grouping of sample phenotypes, by applying both supervised and unsupervised learning algorithms. This capacity permit the evaluation of actual samples and it is a good starting point to translate proteomic methodology to clinical application.

  13. A domain-centric solution to functional genomics via dcGO Predictor

    PubMed Central

    2013-01-01

    Background Computational/manual annotations of protein functions are one of the first routes to making sense of a newly sequenced genome. Protein domain predictions form an essential part of this annotation process. This is due to the natural modularity of proteins with domains as structural, evolutionary and functional units. Sometimes two, three, or more adjacent domains (called supra-domains) are the operational unit responsible for a function, e.g. via a binding site at the interface. These supra-domains have contributed to functional diversification in higher organisms. Traditionally functional ontologies have been applied to individual proteins, rather than families of related domains and supra-domains. We expect, however, to some extent functional signals can be carried by protein domains and supra-domains, and consequently used in function prediction and functional genomics. Results Here we present a domain-centric Gene Ontology (dcGO) perspective. We generalize a framework for automatically inferring ontological terms associated with domains and supra-domains from full-length sequence annotations. This general framework has been applied specifically to primary protein-level annotations from UniProtKB-GOA, generating GO term associations with SCOP domains and supra-domains. The resulting 'dcGO Predictor', can be used to provide functional annotation to protein sequences. The functional annotation of sequences in the Critical Assessment of Function Annotation (CAFA) has been used as a valuable opportunity to validate our method and to be assessed by the community. The functional annotation of all completely sequenced genomes has demonstrated the potential for domain-centric GO enrichment analysis to yield functional insights into newly sequenced or yet-to-be-annotated genomes. This generalized framework we have presented has also been applied to other domain classifications such as InterPro and Pfam, and other ontologies such as mammalian phenotype and disease ontology. The dcGO and its predictor are available at http://supfam.org/SUPERFAMILY/dcGO including an enrichment analysis tool. Conclusions As functional units, domains offer a unique perspective on function prediction regardless of whether proteins are multi-domain or single-domain. The 'dcGO Predictor' holds great promise for contributing to a domain-centric functional understanding of genomes in the next generation sequencing era. PMID:23514627

  14. Protein Sub-Nuclear Localization Based on Effective Fusion Representations and Dimension Reduction Algorithm LDA.

    PubMed

    Wang, Shunfang; Liu, Shuhui

    2015-12-19

    An effective representation of a protein sequence plays a crucial role in protein sub-nuclear localization. The existing representations, such as dipeptide composition (DipC), pseudo-amino acid composition (PseAAC) and position specific scoring matrix (PSSM), are insufficient to represent protein sequence due to their single perspectives. Thus, this paper proposes two fusion feature representations of DipPSSM and PseAAPSSM to integrate PSSM with DipC and PseAAC, respectively. When constructing each fusion representation, we introduce the balance factors to value the importance of its components. The optimal values of the balance factors are sought by genetic algorithm. Due to the high dimensionality of the proposed representations, linear discriminant analysis (LDA) is used to find its important low dimensional structure, which is essential for classification and location prediction. The numerical experiments on two public datasets with KNN classifier and cross-validation tests showed that in terms of the common indexes of sensitivity, specificity, accuracy and MCC, the proposed fusing representations outperform the traditional representations in protein sub-nuclear localization, and the representation treated by LDA outperforms the untreated one.

  15. Protein Sub-Nuclear Localization Based on Effective Fusion Representations and Dimension Reduction Algorithm LDA

    PubMed Central

    Wang, Shunfang; Liu, Shuhui

    2015-01-01

    An effective representation of a protein sequence plays a crucial role in protein sub-nuclear localization. The existing representations, such as dipeptide composition (DipC), pseudo-amino acid composition (PseAAC) and position specific scoring matrix (PSSM), are insufficient to represent protein sequence due to their single perspectives. Thus, this paper proposes two fusion feature representations of DipPSSM and PseAAPSSM to integrate PSSM with DipC and PseAAC, respectively. When constructing each fusion representation, we introduce the balance factors to value the importance of its components. The optimal values of the balance factors are sought by genetic algorithm. Due to the high dimensionality of the proposed representations, linear discriminant analysis (LDA) is used to find its important low dimensional structure, which is essential for classification and location prediction. The numerical experiments on two public datasets with KNN classifier and cross-validation tests showed that in terms of the common indexes of sensitivity, specificity, accuracy and MCC, the proposed fusing representations outperform the traditional representations in protein sub-nuclear localization, and the representation treated by LDA outperforms the untreated one. PMID:26703574

  16. Prediction of type III secretion signals in genomes of gram-negative bacteria.

    PubMed

    Löwer, Martin; Schneider, Gisbert

    2009-06-15

    Pathogenic bacteria infecting both animals as well as plants use various mechanisms to transport virulence factors across their cell membranes and channel these proteins into the infected host cell. The type III secretion system represents such a mechanism. Proteins transported via this pathway ("effector proteins") have to be distinguished from all other proteins that are not exported from the bacterial cell. Although a special targeting signal at the N-terminal end of effector proteins has been proposed in literature its exact characteristics remain unknown. In this study, we demonstrate that the signals encoded in the sequences of type III secretion system effectors can be consistently recognized and predicted by machine learning techniques. Known protein effectors were compiled from the literature and sequence databases, and served as training data for artificial neural networks and support vector machine classifiers. Common sequence features were most pronounced in the first 30 amino acids of the effector sequences. Classification accuracy yielded a cross-validated Matthews correlation of 0.63 and allowed for genome-wide prediction of potential type III secretion system effectors in 705 proteobacterial genomes (12% predicted candidates protein), their chromosomes (11%) and plasmids (13%), as well as 213 Firmicute genomes (7%). We present a signal prediction method together with comprehensive survey of potential type III secretion system effectors extracted from 918 published bacterial genomes. Our study demonstrates that the analyzed signal features are common across a wide range of species, and provides a substantial basis for the identification of exported pathogenic proteins as targets for future therapeutic intervention. The prediction software is publicly accessible from our web server (www.modlab.org).

  17. Composite Structural Motifs of Binding Sites for Delineating Biological Functions of Proteins

    PubMed Central

    Kinjo, Akira R.; Nakamura, Haruki

    2012-01-01

    Most biological processes are described as a series of interactions between proteins and other molecules, and interactions are in turn described in terms of atomic structures. To annotate protein functions as sets of interaction states at atomic resolution, and thereby to better understand the relation between protein interactions and biological functions, we conducted exhaustive all-against-all atomic structure comparisons of all known binding sites for ligands including small molecules, proteins and nucleic acids, and identified recurring elementary motifs. By integrating the elementary motifs associated with each subunit, we defined composite motifs that represent context-dependent combinations of elementary motifs. It is demonstrated that function similarity can be better inferred from composite motif similarity compared to the similarity of protein sequences or of individual binding sites. By integrating the composite motifs associated with each protein function, we define meta-composite motifs each of which is regarded as a time-independent diagrammatic representation of a biological process. It is shown that meta-composite motifs provide richer annotations of biological processes than sequence clusters. The present results serve as a basis for bridging atomic structures to higher-order biological phenomena by classification and integration of binding site structures. PMID:22347478

  18. Protein structure database search and evolutionary classification.

    PubMed

    Yang, Jinn-Moon; Tung, Chi-Hua

    2006-01-01

    As more protein structures become available and structural genomics efforts provide structural models in a genome-wide strategy, there is a growing need for fast and accurate methods for discovering homologous proteins and evolutionary classifications of newly determined structures. We have developed 3D-BLAST, in part, to address these issues. 3D-BLAST is as fast as BLAST and calculates the statistical significance (E-value) of an alignment to indicate the reliability of the prediction. Using this method, we first identified 23 states of the structural alphabet that represent pattern profiles of the backbone fragments and then used them to represent protein structure databases as structural alphabet sequence databases (SADB). Our method enhanced BLAST as a search method, using a new structural alphabet substitution matrix (SASM) to find the longest common substructures with high-scoring structured segment pairs from an SADB database. Using personal computers with Intel Pentium4 (2.8 GHz) processors, our method searched more than 10 000 protein structures in 1.3 s and achieved a good agreement with search results from detailed structure alignment methods. [3D-BLAST is available at http://3d-blast.life.nctu.edu.tw].

  19. Classification of community types, successional sequences, and landscapes of the Copper River Delta, Alaska.

    Treesearch

    Keith. Boggs

    2000-01-01

    A classification of community types, successional sequences, and landscapes is presented for the piedmont of the Copper River Delta. The classification was based on a sampling of 471 sites. A total of 75 community types, 42 successional sequences, and 6 landscapes are described. The classification of community types reflects the existing vegetation communities on the...

  20. ZifBASE: a database of zinc finger proteins and associated resources.

    PubMed

    Jayakanthan, Mannu; Muthukumaran, Jayaraman; Chandrasekar, Sanniyasi; Chawla, Konika; Punetha, Ankita; Sundar, Durai

    2009-09-09

    Information on the occurrence of zinc finger protein motifs in genomes is crucial to the developing field of molecular genome engineering. The knowledge of their target DNA-binding sequences is vital to develop chimeric proteins for targeted genome engineering and site-specific gene correction. There is a need to develop a computational resource of zinc finger proteins (ZFP) to identify the potential binding sites and its location, which reduce the time of in vivo task, and overcome the difficulties in selecting the specific type of zinc finger protein and the target site in the DNA sequence. ZifBASE provides an extensive collection of various natural and engineered ZFP. It uses standard names and a genetic and structural classification scheme to present data retrieved from UniProtKB, GenBank, Protein Data Bank, ModBase, Protein Model Portal and the literature. It also incorporates specialized features of ZFP including finger sequences and positions, number of fingers, physiochemical properties, classes, framework, PubMed citations with links to experimental structures (PDB, if available) and modeled structures of natural zinc finger proteins. ZifBASE provides information on zinc finger proteins (both natural and engineered ones), the number of finger units in each of the zinc finger proteins (with multiple fingers), the synergy between the adjacent fingers and their positions. Additionally, it gives the individual finger sequence and their target DNA site to which it binds for better and clear understanding on the interactions of adjacent fingers. The current version of ZifBASE contains 139 entries of which 89 are engineered ZFPs, containing 3-7F totaling to 296 fingers. There are 50 natural zinc finger protein entries ranging from 2-13F, totaling to 307 fingers. It has sequences and structures from literature, Protein Data Bank, ModBase and Protein Model Portal. The interface is cross linked to other public databases like UniprotKB, PDB, ModBase and Protein Model Portal and PubMed for making it more informative. A database is established to maintain the information of the sequence features, including the class, framework, number of fingers, residues, position, recognition site and physio-chemical properties (molecular weight, isoelectric point) of both natural and engineered zinc finger proteins and dissociation constant of few. ZifBASE can provide more effective and efficient way of accessing the zinc finger protein sequences and their target binding sites with the links to their three-dimensional structures. All the data and functions are available at the advanced web-based search interface http://web.iitd.ac.in/~sundar/zifbase.

  1. Protein Aggregation/Folding: The Role of Deterministic Singularities of Sequence Hydrophobicity as Determined by Nonlinear Signal Analysis of Acylphosphatase and Aβ(1–40)

    PubMed Central

    Zbilut, Joseph P.; Colosimo, Alfredo; Conti, Filippo; Colafranceschi, Mauro; Manetti, Cesare; Valerio, MariaCristina; Webber, Charles L.; Giuliani, Alessandro

    2003-01-01

    The problem of protein folding vs. aggregation was investigated in acylphosphatase and the amyloid protein Aβ(1–40) by means of nonlinear signal analysis of their chain hydrophobicity. Numerical descriptors of recurrence patterns provided the basis for statistical evaluation of folding/aggregation distinctive features. Static and dynamic approaches were used to elucidate conditions coincident with folding vs. aggregation using comparisons with known protein secondary structure classifications, site-directed mutagenesis studies of acylphosphatase, and molecular dynamics simulations of amyloid protein, Aβ(1–40). The results suggest that a feature derived from principal component space characterized by the smoothness of singular, deterministic hydrophobicity patches plays a significant role in the conditions governing protein aggregation. PMID:14645049

  2. Deciphering the molecular and functional basis of Dbl family proteins: a novel systematic approach toward classification of selective activation of the Rho family proteins.

    PubMed

    Jaiswal, Mamta; Dvorsky, Radovan; Ahmadian, Mohammad Reza

    2013-02-08

    The diffuse B-cell lymphoma (Dbl) family of the guanine nucleotide exchange factors is a direct activator of the Rho family proteins. The Rho family proteins are involved in almost every cellular process that ranges from fundamental (e.g. the establishment of cell polarity) to highly specialized processes (e.g. the contraction of vascular smooth muscle cells). Abnormal activation of the Rho proteins is known to play a crucial role in cancer, infectious and cognitive disorders, and cardiovascular diseases. However, the existence of 74 Dbl proteins and 25 Rho-related proteins in humans, which are largely uncharacterized, has led to increasing complexity in identifying specific upstream pathways. Thus, we comprehensively investigated sequence-structure-function-property relationships of 21 representatives of the Dbl protein family regarding their specificities and activities toward 12 Rho family proteins. The meta-analysis approach provides an unprecedented opportunity to broadly profile functional properties of Dbl family proteins, including catalytic efficiency, substrate selectivity, and signaling specificity. Our analysis has provided novel insights into the following: (i) understanding of the relative differences of various Rho protein members in nucleotide exchange; (ii) comparing and defining individual and overall guanine nucleotide exchange factor activities of a large representative set of the Dbl proteins toward 12 Rho proteins; (iii) grouping the Dbl family into functionally distinct categories based on both their catalytic efficiencies and their sequence-structural relationships; (iv) identifying conserved amino acids as fingerprints of the Dbl and Rho protein interaction; and (v) defining amino acid sequences conserved within, but not between, Dbl subfamilies. Therefore, the characteristics of such specificity-determining residues identified the regions or clusters conserved within the Dbl subfamilies.

  3. Hybrid and Rogue Kinases Encoded in the Genomes of Model Eukaryotes

    PubMed Central

    Rakshambikai, Ramaswamy; Gnanavel, Mutharasu; Srinivasan, Narayanaswamy

    2014-01-01

    The highly modular nature of protein kinases generates diverse functional roles mediated by evolutionary events such as domain recombination, insertion and deletion of domains. Usually domain architecture of a kinase is related to the subfamily to which the kinase catalytic domain belongs. However outlier kinases with unusual domain architectures serve in the expansion of the functional space of the protein kinase family. For example, Src kinases are made-up of SH2 and SH3 domains in addition to the kinase catalytic domain. A kinase which lacks these two domains but retains sequence characteristics within the kinase catalytic domain is an outlier that is likely to have modes of regulation different from classical src kinases. This study defines two types of outlier kinases: hybrids and rogues depending on the nature of domain recombination. Hybrid kinases are those where the catalytic kinase domain belongs to a kinase subfamily but the domain architecture is typical of another kinase subfamily. Rogue kinases are those with kinase catalytic domain characteristic of a kinase subfamily but the domain architecture is typical of neither that subfamily nor any other kinase subfamily. This report provides a consolidated set of such hybrid and rogue kinases gleaned from six eukaryotic genomes–S.cerevisiae, D. melanogaster, C.elegans, M.musculus, T.rubripes and H.sapiens–and discusses their functions. The presence of such kinases necessitates a revisiting of the classification scheme of the protein kinase family using full length sequences apart from classical classification using solely the sequences of kinase catalytic domains. The study of these kinases provides a good insight in engineering signalling pathways for a desired output. Lastly, identification of hybrids and rogues in pathogenic protozoa such as P.falciparum sheds light on possible strategies in host-pathogen interactions. PMID:25255313

  4. Implementing a rational and consistent nomenclature for serine/arginine-rich protein splicing factors (SR proteins) in plants.

    PubMed

    Barta, Andrea; Kalyna, Maria; Reddy, Anireddy S N

    2010-09-01

    Growing interest in alternative splicing in plants and the extensive sequencing of new plant genomes necessitate more precise definition and classification of genes coding for splicing factors. SR proteins are a family of RNA binding proteins, which function as essential factors for constitutive and alternative splicing. We propose a unified nomenclature for plant SR proteins, taking into account the newly revised nomenclature of the mammalian SR proteins and a number of plant-specific properties of the plant proteins. We identify six subfamilies of SR proteins in Arabidopsis thaliana and rice (Oryza sativa), three of which are plant specific. The proposed subdivision of plant SR proteins into different subfamilies will allow grouping of paralogous proteins and simple assignment of newly discovered SR orthologs from other plant species and will promote functional comparisons in diverse plant species.

  5. Phylogenetic Analysis and Classification of the Fungal bHLH Domain

    PubMed Central

    Sailsbery, Joshua K.; Atchley, William R.; Dean, Ralph A.

    2012-01-01

    The basic Helix-Loop-Helix (bHLH) domain is an essential highly conserved DNA-binding domain found in many transcription factors in all eukaryotic organisms. The bHLH domain has been well studied in the Animal and Plant Kingdoms but has yet to be characterized within Fungi. Herein, we obtained and evaluated the phylogenetic relationship of 490 fungal-specific bHLH containing proteins from 55 whole genome projects composed of 49 Ascomycota and 6 Basidiomycota organisms. We identified 12 major groupings within Fungi (F1–F12); identifying conserved motifs and functions specific to each group. Several classification models were built to distinguish the 12 groups and elucidate the most discerning sites in the domain. Performance testing on these models, for correct group classification, resulted in a maximum sensitivity and specificity of 98.5% and 99.8%, respectively. We identified 12 highly discerning sites and incorporated those into a set of rules (simplified model) to classify sequences into the correct group. Conservation of amino acid sites and phylogenetic analyses established that like plant bHLH proteins, fungal bHLH–containing proteins are most closely related to animal Group B. The models used in these analyses were incorporated into a software package, the source code for which is available at www.fungalgenomics.ncsu.edu. PMID:22114358

  6. The Center for Regenerative Biology and Medicine at Mount Desert Island Biological Laboratory

    DTIC Science & Technology

    2015-08-01

    TERMS limb regeneration Positional Memory Code Axolotl microRNAs Zebrafish Polypterus 16. SECURITY CLASSIFICATION OF: 17. LIMITATION OF ABSTRACT 18...injury induced regeneration of limbs/appendage tissues in Ambystoma ( axolotl ) and Polypterus animals. The defining feature of limb/appendage...downregulated in UPREGULATED DOWNREGULATED Figure 1: Venn diagram of UniProt protein sequence IDs among Axolotl and Polypterus contigs that

  7. Mycofier: a new machine learning-based classifier for fungal ITS sequences.

    PubMed

    Delgado-Serrano, Luisa; Restrepo, Silvia; Bustos, Jose Ricardo; Zambrano, Maria Mercedes; Anzola, Juan Manuel

    2016-08-11

    The taxonomic and phylogenetic classification based on sequence analysis of the ITS1 genomic region has become a crucial component of fungal ecology and diversity studies. Nowadays, there is no accurate alignment-free classification tool for fungal ITS1 sequences for large environmental surveys. This study describes the development of a machine learning-based classifier for the taxonomical assignment of fungal ITS1 sequences at the genus level. A fungal ITS1 sequence database was built using curated data. Training and test sets were generated from it. A Naïve Bayesian classifier was built using features from the primary sequence with an accuracy of 87 % in the classification at the genus level. The final model was based on a Naïve Bayes algorithm using ITS1 sequences from 510 fungal genera. This classifier, denoted as Mycofier, provides similar classification accuracy compared to BLASTN, but the database used for the classification contains curated data and the tool, independent of alignment, is more efficient and contributes to the field, given the lack of an accurate classification tool for large data from fungal ITS1 sequences. The software and source code for Mycofier are freely available at https://github.com/ldelgado-serrano/mycofier.git .

  8. A deep insight into the sialotranscriptome of the mosquito, Psorophora albipes

    PubMed Central

    2013-01-01

    Background Psorophora mosquitoes are exclusively found in the Americas and have been associated with transmission of encephalitis and West Nile fever viruses, among other arboviruses. Mosquito salivary glands represent the final route of differentiation and transmission of many parasites. They also secrete molecules with powerful pharmacologic actions that modulate host hemostasis, inflammation, and immune response. Here, we employed next generation sequencing and proteome approaches to investigate for the first time the salivary composition of a mosquito member of the Psorophora genus. We additionally discuss the evolutionary position of this mosquito genus into the Culicidae family by comparing the identity of its secreted salivary compounds to other mosquito salivary proteins identified so far. Results Illumina sequencing resulted in 13,535,229 sequence reads, which were assembled into 3,247 contigs. All families were classified according to their in silico-predicted function/ activity. Annotation of these sequences allowed classification of their products into 83 salivary protein families, twenty (24.39%) of which were confirmed by our subsequent proteome analysis. Two protein families were deorphanized from Aedes and one from Ochlerotatus, while four protein families were described as novel to Psorophora genus because they had no match with any other known mosquito salivary sequence. Several protein families described as exclusive to Culicines were present in Psorophora mosquitoes, while we did not identify any member of the protein families already known as unique to Anophelines. Also, the Psorophora salivary proteins had better identity to homologs in Aedes (69.23%), followed by Ochlerotatus (8.15%), Culex (6.52%), and Anopheles (4.66%), respectively. Conclusions This is the first sialome (from the Greek sialo = saliva) catalog of salivary proteins from a Psorophora mosquito, which may be useful for better understanding the lifecycle of this mosquito and the role of its salivary secretion in arboviral transmission. PMID:24330624

  9. A high level interface to SCOP and ASTRAL implemented in python.

    PubMed

    Casbon, James A; Crooks, Gavin E; Saqi, Mansoor A S

    2006-01-10

    Benchmarking algorithms in structural bioinformatics often involves the construction of datasets of proteins with given sequence and structural properties. The SCOP database is a manually curated structural classification which groups together proteins on the basis of structural similarity. The ASTRAL compendium provides non redundant subsets of SCOP domains on the basis of sequence similarity such that no two domains in a given subset share more than a defined degree of sequence similarity. Taken together these two resources provide a 'ground truth' for assessing structural bioinformatics algorithms. We present a small and easy to use API written in python to enable construction of datasets from these resources. We have designed a set of python modules to provide an abstraction of the SCOP and ASTRAL databases. The modules are designed to work as part of the Biopython distribution. Python users can now manipulate and use the SCOP hierarchy from within python programs, and use ASTRAL to return sequences of domains in SCOP, as well as clustered representations of SCOP from ASTRAL. The modules make the analysis and generation of datasets for use in structural genomics easier and more principled.

  10. A novel, extremely alkaliphilic and cold-active esterase from Antarctic desert soil.

    PubMed

    Hu, Xiao Ping; Heath, Caroline; Taylor, Mark Paul; Tuffin, Marla; Cowan, Don

    2012-01-01

    A novel, cold-active and highly alkaliphilic esterase was isolated from an Antarctic desert soil metagenomic library by functional screening. The 1,044 bp gene sequence contained several conserved regions common to lipases/esterases, but lacked clear classification based on sequence analysis alone. Moderate (<40%) amino acid sequence similarity to known esterases was apparent (the closest neighbour being a hypothetical protein from Chitinophaga pinensis), despite phylogenetic distance to many of the lipolytic "families". The enzyme functionally demonstrated activity towards shorter chain p-nitrophenyl esters with the optimal activity recorded towards p-nitrophenyl propionate (C3). The enzyme possessed an apparent T(opt) at 20°C and a pH optimum at pH 11. Esterases possessing such extreme alkaliphily are rare and so this enzyme represents an intriguing novel locus in protein sequence space. A metagenomic approach has been shown, in this case, to yield an enzyme with quite different sequential/structural properties to known lipases. It serves as an excellent candidate for analysis of the molecular mechanisms responsible for both cold and alkaline activity and novel structure-function relationships of esterase activity.

  11. Evolutionary relationships of a plant-pathogenic mycoplasmalike organism and Acholeplasma laidlawii deduced from two ribosomal protein gene sequences.

    PubMed Central

    Lim, P O; Sears, B B

    1992-01-01

    The families within the class Mollicutes are distinguished by their morphologies, nutritional requirements, and abilities to metabolize certain compounds. Biosystematic classification of the plant-pathogenic mycoplasmalike organisms (MLOs) has been difficult because these organisms have not been cultured in vitro, and hence their nutritional requirements have not been determined nor have physiological characterizations been possible. To investigate the evolutionary relationship of the MLOs to other members of the class Mollicutes, a segment of a ribosomal protein operon was cloned and sequenced from an aster yellows-type MLO which is pathogenic for members of the genus Oenothera and from Acholeplasma laidlawii. The deduced amino acid sequence data from the rpl22 and rps3 genes indicate that the MLOs are more closely related to A. laidlawii than to animal mycoplasmas, confirming previous results from 16S rRNA sequence comparisons. This conclusion is also supported by the finding that the UGA codon is not read as a tryptophan codon in the MLO and A. laidlawii, in contrast to its usage in Mycoplasma capricolum. PMID:1556079

  12. Clinical multiplexed exome sequencing distinguishes adult oligodendroglial neoplasms from astrocytic and mixed lineage gliomas.

    PubMed

    Cryan, Jane B; Haidar, Sam; Ramkissoon, Lori A; Bi, Wenya Linda; Knoff, David S; Schultz, Nikolaus; Abedalthagafi, Malak; Brown, Loreal; Wen, Patrick Y; Reardon, David A; Dunn, Ian F; Folkerth, Rebecca D; Santagata, Sandro; Lindeman, Neal I; Ligon, Azra H; Beroukhim, Rameen; Hornick, Jason L; Alexander, Brian M; Ligon, Keith L; Ramkissoon, Shakti H

    2014-09-30

    Classifying adult gliomas remains largely a histologic diagnosis based on morphology; however astrocytic, oligodendroglial and mixed lineage tumors can display overlapping histologic features. We used multiplexed exome sequencing (OncoPanel) on 108 primary or recurrent adult gliomas, comprising 65 oligodendrogliomas, 28 astrocytomas and 15 mixed oligoastrocytomas to identify lesions that could enhance lineage classification. Mutations in TP53 (20/28, 71%) and ATRX (15/28, 54%) were enriched in astrocytic tumors compared to oligodendroglial tumors of which 4/65 (6%) had mutations in TP53 and 2/65 (3%) had ATRX mutations. We found that oligoastrocytomas harbored mutations in TP53 (80%, 12/15) and ATRX (60%, 9/15) at frequencies similar to pure astrocytic tumors, suggesting that oligoastrocytomas and astrocytomas may represent a single genetic or biological entity. p53 protein expression correlated with mutation status and showed significant increases in astrocytomas and oligoastrocytomas compared to oligodendrogliomas, a finding that also may facilitate accurate classification. Furthermore our OncoPanel analysis revealed that 15% of IDH1/2 mutant gliomas would not be detected by traditional IDH1 (p.R132H) antibody testing, supporting the use of genomic technologies in providing clinically relevant data. In all, our results demonstrate that multiplexed exome sequencing can support evaluation and classification of adult low-grade gliomas with a single clinical test.

  13. Origin and diversification of leucine-rich repeat receptor-like protein kinase (LRR-RLK) genes in plants.

    PubMed

    Liu, Ping-Li; Du, Liang; Huang, Yuan; Gao, Shu-Min; Yu, Meng

    2017-02-07

    Leucine-rich repeat receptor-like protein kinases (LRR-RLKs) are the largest group of receptor-like kinases in plants and play crucial roles in development and stress responses. The evolutionary relationships among LRR-RLK genes have been investigated in flowering plants; however, no comprehensive studies have been performed for these genes in more ancestral groups. The subfamily classification of LRR-RLK genes in plants, the evolutionary history and driving force for the evolution of each LRR-RLK subfamily remain to be understood. We identified 119 LRR-RLK genes in the Physcomitrella patens moss genome, 67 LRR-RLK genes in the Selaginella moellendorffii lycophyte genome, and no LRR-RLK genes in five green algae genomes. Furthermore, these LRR-RLK sequences, along with previously reported LRR-RLK sequences from Arabidopsis thaliana and Oryza sativa, were subjected to evolutionary analyses. Phylogenetic analyses revealed that plant LRR-RLKs belong to 19 subfamilies, eighteen of which were established in early land plants, and one of which evolved in flowering plants. More importantly, we found that the basic structures of LRR-RLK genes for most subfamilies are established in early land plants and conserved within subfamilies and across different plant lineages, but divergent among subfamilies. In addition, most members of the same subfamily had common protein motif compositions, whereas members of different subfamilies showed variations in protein motif compositions. The unique gene structure and protein motif compositions of each subfamily differentiate the subfamily classifications and, more importantly, provide evidence for functional divergence among LRR-RLK subfamilies. Maximum likelihood analyses showed that some sites within four subfamilies were under positive selection. Much of the diversity of plant LRR-RLK genes was established in early land plants. Positive selection contributed to the evolution of a few LRR-RLK subfamilies.

  14. Using simple artificial intelligence methods for predicting amyloidogenesis in antibodies

    PubMed Central

    2010-01-01

    Background All polypeptide backbones have the potential to form amyloid fibrils, which are associated with a number of degenerative disorders. However, the likelihood that amyloidosis would actually occur under physiological conditions depends largely on the amino acid composition of a protein. We explore using a naive Bayesian classifier and a weighted decision tree for predicting the amyloidogenicity of immunoglobulin sequences. Results The average accuracy based on leave-one-out (LOO) cross validation of a Bayesian classifier generated from 143 amyloidogenic sequences is 60.84%. This is consistent with the average accuracy of 61.15% for a holdout test set comprised of 103 AM and 28 non-amyloidogenic sequences. The LOO cross validation accuracy increases to 81.08% when the training set is augmented by the holdout test set. In comparison, the average classification accuracy for the holdout test set obtained using a decision tree is 78.64%. Non-amyloidogenic sequences are predicted with average LOO cross validation accuracies between 74.05% and 77.24% using the Bayesian classifier, depending on the training set size. The accuracy for the holdout test set was 89%. For the decision tree, the non-amyloidogenic prediction accuracy is 75.00%. Conclusions This exploratory study indicates that both classification methods may be promising in providing straightforward predictions on the amyloidogenicity of a sequence. Nevertheless, the number of available sequences that satisfy the premises of this study are limited, and are consequently smaller than the ideal training set size. Increasing the size of the training set clearly increases the accuracy, and the expansion of the training set to include not only more derivatives, but more alignments, would make the method more sound. The accuracy of the classifiers may also be improved when additional factors, such as structural and physico-chemical data, are considered. The development of this type of classifier has significant applications in evaluating engineered antibodies, and may be adapted for evaluating engineered proteins in general. PMID:20144194

  15. Phylogenetic and expression analysis of the NPR1-like gene family from Persea americana (Mill.).

    PubMed

    Backer, Robert; Mahomed, Waheed; Reeksting, Bianca J; Engelbrecht, Juanita; Ibarra-Laclette, Enrique; van den Berg, Noëlani

    2015-01-01

    The NONEXPRESSOR OF PATHOGENESIS-RELATED GENES1 (NPR1) forms an integral part of the salicylic acid (SA) pathway in plants and is involved in cross-talk between the SA and jasmonic acid/ethylene (JA/ET) pathways. Therefore, NPR1 is essential to the effective response of plants to pathogens. Avocado (Persea americana) is a commercially important crop worldwide. Significant losses in production result from Phytophthora root rot, caused by the hemibiotroph, Phytophthora cinnamomi. This oomycete infects the feeder roots of avocado trees leading to an overall decline in health and eventual death. The interaction between avocado and P. cinnamomi is poorly understood and as such limited control strategies exist. Thus uncovering the role of NPR1 in avocado could provide novel insights into the avocado - P. cinnamomi interaction. A total of five NPR1-like sequences were identified. These sequences were annotated using FGENESH and a maximum-likelihood tree was constructed using 34 NPR1-like protein sequences from other plant species. The conserved protein domains and functional motifs of these sequences were predicted. Reverse transcription quantitative PCR was used to analyze the expression of the five NPR1-like sequences in the roots of avocado after treatment with salicylic and jasmonic acid, P. cinnamomi infection, across different tissues and in P. cinnamomi infected tolerant and susceptible rootstocks. Of the five NPR1-like sequences three have strong support for a defensive role while two are most likely involved in development. Significant differences in the expression profiles of these five NPR1-like genes were observed, assisting in functional classification. Understanding the interaction of avocado and P. cinnamomi is essential to developing new control strategies. This work enables further classification of these genes by means of functional annotation and is a crucial step in understanding the role of NPR1 during P. cinnamomi infection.

  16. Phylogenetic and expression analysis of the NPR1-like gene family from Persea americana (Mill.)

    PubMed Central

    Backer, Robert; Mahomed, Waheed; Reeksting, Bianca J.; Engelbrecht, Juanita; Ibarra-Laclette, Enrique; van den Berg, Noëlani

    2015-01-01

    The NONEXPRESSOR OF PATHOGENESIS-RELATED GENES1 (NPR1) forms an integral part of the salicylic acid (SA) pathway in plants and is involved in cross-talk between the SA and jasmonic acid/ethylene (JA/ET) pathways. Therefore, NPR1 is essential to the effective response of plants to pathogens. Avocado (Persea americana) is a commercially important crop worldwide. Significant losses in production result from Phytophthora root rot, caused by the hemibiotroph, Phytophthora cinnamomi. This oomycete infects the feeder roots of avocado trees leading to an overall decline in health and eventual death. The interaction between avocado and P. cinnamomi is poorly understood and as such limited control strategies exist. Thus uncovering the role of NPR1 in avocado could provide novel insights into the avocado – P. cinnamomi interaction. A total of five NPR1-like sequences were identified. These sequences were annotated using FGENESH and a maximum-likelihood tree was constructed using 34 NPR1-like protein sequences from other plant species. The conserved protein domains and functional motifs of these sequences were predicted. Reverse transcription quantitative PCR was used to analyze the expression of the five NPR1-like sequences in the roots of avocado after treatment with salicylic and jasmonic acid, P. cinnamomi infection, across different tissues and in P. cinnamomi infected tolerant and susceptible rootstocks. Of the five NPR1-like sequences three have strong support for a defensive role while two are most likely involved in development. Significant differences in the expression profiles of these five NPR1-like genes were observed, assisting in functional classification. Understanding the interaction of avocado and P. cinnamomi is essential to developing new control strategies. This work enables further classification of these genes by means of functional annotation and is a crucial step in understanding the role of NPR1 during P. cinnamomi infection. PMID:25972890

  17. Using simple artificial intelligence methods for predicting amyloidogenesis in antibodies.

    PubMed

    David, Maria Pamela C; Concepcion, Gisela P; Padlan, Eduardo A

    2010-02-08

    All polypeptide backbones have the potential to form amyloid fibrils, which are associated with a number of degenerative disorders. However, the likelihood that amyloidosis would actually occur under physiological conditions depends largely on the amino acid composition of a protein. We explore using a naive Bayesian classifier and a weighted decision tree for predicting the amyloidogenicity of immunoglobulin sequences. The average accuracy based on leave-one-out (LOO) cross validation of a Bayesian classifier generated from 143 amyloidogenic sequences is 60.84%. This is consistent with the average accuracy of 61.15% for a holdout test set comprised of 103 AM and 28 non-amyloidogenic sequences. The LOO cross validation accuracy increases to 81.08% when the training set is augmented by the holdout test set. In comparison, the average classification accuracy for the holdout test set obtained using a decision tree is 78.64%. Non-amyloidogenic sequences are predicted with average LOO cross validation accuracies between 74.05% and 77.24% using the Bayesian classifier, depending on the training set size. The accuracy for the holdout test set was 89%. For the decision tree, the non-amyloidogenic prediction accuracy is 75.00%. This exploratory study indicates that both classification methods may be promising in providing straightforward predictions on the amyloidogenicity of a sequence. Nevertheless, the number of available sequences that satisfy the premises of this study are limited, and are consequently smaller than the ideal training set size. Increasing the size of the training set clearly increases the accuracy, and the expansion of the training set to include not only more derivatives, but more alignments, would make the method more sound. The accuracy of the classifiers may also be improved when additional factors, such as structural and physico-chemical data, are considered. The development of this type of classifier has significant applications in evaluating engineered antibodies, and may be adapted for evaluating engineered proteins in general.

  18. AllergenFP: allergenicity prediction by descriptor fingerprints.

    PubMed

    Dimitrov, Ivan; Naneva, Lyudmila; Doytchinova, Irini; Bangov, Ivan

    2014-03-15

    Allergenicity, like antigenicity and immunogenicity, is a property encoded linearly and non-linearly, and therefore the alignment-based approaches are not able to identify this property unambiguously. A novel alignment-free descriptor-based fingerprint approach is presented here and applied to identify allergens and non-allergens. The approach was implemented into a four step algorithm. Initially, the protein sequences are described by amino acid principal properties as hydrophobicity, size, relative abundance, helix and β-strand forming propensities. Then, the generated strings of different length are converted into vectors with equal length by auto- and cross-covariance (ACC). The vectors were transformed into binary fingerprints and compared in terms of Tanimoto coefficient. The approach was applied to a set of 2427 known allergens and 2427 non-allergens and identified correctly 88% of them with Matthews correlation coefficient of 0.759. The descriptor fingerprint approach presented here is universal. It could be applied for any classification problem in computational biology. The set of E-descriptors is able to capture the main structural and physicochemical properties of amino acids building the proteins. The ACC transformation overcomes the main problem in the alignment-based comparative studies arising from the different length of the aligned protein sequences. The conversion of protein ACC values into binary descriptor fingerprints allows similarity search and classification. The algorithm described in the present study was implemented in a specially designed Web site, named AllergenFP (FP stands for FingerPrint). AllergenFP is written in Python, with GIU in HTML. It is freely accessible at http://ddg-pharmfac.net/Allergen FP. idoytchinova@pharmfac.net or ivanbangov@shu-bg.net.

  19. Identification of a novel species of papillomavirus in giraffe lesions using nanopore sequencing.

    PubMed

    Vanmechelen, Bert; Bertelsen, Mads Frost; Rector, Annabel; Van den Oord, Joost J; Laenen, Lies; Vergote, Valentijn; Maes, Piet

    2017-03-01

    Papillomaviridae form a large family of viruses that are known to infect a variety of vertebrates, including mammals, reptiles, birds and fish. Infections usually give rise to minor skin lesions but can in some cases lead to the development of malignant neoplasia. In this study, we identified a novel species of papillomavirus (PV), isolated from warts of four giraffes (Giraffa camelopardalis). The sequence of the L1 gene was determined and found to be identical for all isolates. Using nanopore sequencing, the full sequence of the PV genome could be determined. The coding region of the genome was found to contain seven open reading frames (ORF), encoding the early proteins E1, E2 and E5-E7 as well as the late proteins L1 and L2. In addition to these ORFs, a region located within the E2 gene is thought, based on sequence similarities to other papillomaviruses, to encode an E4 protein, although no start codon could be identified. Based on the sequence of the L1 gene, this novel PV was found to be most similar to Capreolus capreolus papillomavirus 1 (CcaPV1), with 67.96% nucleotide identity. We therefore suggest that the virus identified here is given the name Giraffa camelopardalis papillomavirus 1 (GcPV1) and is classified as a novel species within the genus Deltapapillomavirus, in line with the current guidelines for the nomenclature and classification of PVs. Copyright © 2017 Elsevier B.V. All rights reserved.

  20. Predicting Protein Relationships to Human Pathways through a Relational Learning Approach Based on Simple Sequence Features.

    PubMed

    García-Jiménez, Beatriz; Pons, Tirso; Sanchis, Araceli; Valencia, Alfonso

    2014-01-01

    Biological pathways are important elements of systems biology and in the past decade, an increasing number of pathway databases have been set up to document the growing understanding of complex cellular processes. Although more genome-sequence data are becoming available, a large fraction of it remains functionally uncharacterized. Thus, it is important to be able to predict the mapping of poorly annotated proteins to original pathway models. We have developed a Relational Learning-based Extension (RLE) system to investigate pathway membership through a function prediction approach that mainly relies on combinations of simple properties attributed to each protein. RLE searches for proteins with molecular similarities to specific pathway components. Using RLE, we associated 383 uncharacterized proteins to 28 pre-defined human Reactome pathways, demonstrating relative confidence after proper evaluation. Indeed, in specific cases manual inspection of the database annotations and the related literature supported the proposed classifications. Examples of possible additional components of the Electron transport system, Telomere maintenance and Integrin cell surface interactions pathways are discussed in detail. All the human predicted proteins in the 2009 and 2012 releases 30 and 40 of Reactome are available at http://rle.bioinfo.cnio.es.

  1. Complete genomic sequence and taxonomic position of eel virus European X (EVEX), a rhabdovirus of European eel.

    PubMed

    Galinier, Richard; van Beurden, Steven; Amilhat, Elsa; Castric, Jeannette; Schoehn, Guy; Verneau, Olivier; Fazio, Géraldine; Allienne, Jean-François; Engelsma, Marc; Sasal, Pierre; Faliex, Elisabeth

    2012-06-01

    Eel virus European X (EVEX) was first isolated from diseased European eel Anguilla anguilla in Japan at the end of seventies. The virus was tentatively classified into the Rhabdoviridae family on the basis of morphology and serological cross reactivity. This family of viruses is organized into six genera and currently comprises approximately 200 members, many of which are still unassigned because of the lack of molecular data. This work presents the morphological, biochemical and genetic characterizations of EVEX, and proposes a taxonomic classification for this virus. We provide its complete genome sequence, plus a comprehensive sequence comparison between isolates from different geographical origins. The genome encodes the five classical structural proteins plus an overlapping open reading frame in the phosphoprotein gene, coding for a putative C protein. Phylogenic relationship with other rhabdoviruses indicates that EVEX is most closely related to the Vesiculovirus genus and shares the highest identity with trout rhabdovirus 903/87. Copyright © 2012 Elsevier B.V. All rights reserved.

  2. Genetic Characterization of the Tick-Borne Orbiviruses

    PubMed Central

    Belaganahalli, Manjunatha N.; Maan, Sushila; Maan, Narender S.; Brownlie, Joe; Tesh, Robert; Attoui, Houssam; Mertens, Peter P. C.

    2015-01-01

    The International Committee for Taxonomy of Viruses (ICTV) recognizes four species of tick-borne orbiviruses (TBOs): Chenuda virus, Chobar Gorge virus, Wad Medani virus and Great Island virus (genus Orbivirus, family Reoviridae). Nucleotide (nt) and amino acid (aa) sequence comparisons provide a basis for orbivirus detection and classification, however full genome sequence data were only available for the Great Island virus species. We report representative genome-sequences for the three other TBO species (virus isolates: Chenuda virus (CNUV); Chobar Gorge virus (CGV) and Wad Medani virus (WMV)). Phylogenetic comparisons show that TBOs cluster separately from insect-borne orbiviruses (IBOs). CNUV, CGV, WMV and GIV share low level aa/nt identities with other orbiviruses, in ‘conserved’ Pol, T2 and T13 proteins/genes, identifying them as four distinct virus-species. The TBO genome segment encoding cell attachment, outer capsid protein 1 (OC1), is approximately half the size of the equivalent segment from insect-borne orbiviruses, helping to explain why tick-borne orbiviruses have a ~1 kb smaller genome. PMID:25928203

  3. Genetic characterization of the tick-borne orbiviruses.

    PubMed

    Belaganahalli, Manjunatha N; Maan, Sushila; Maan, Narender S; Brownlie, Joe; Tesh, Robert; Attoui, Houssam; Mertens, Peter P C

    2015-04-28

    The International Committee for Taxonomy of Viruses (ICTV) recognizes four species of tick-borne orbiviruses (TBOs): Chenuda virus, Chobar Gorge virus, Wad Medani virus and Great Island virus (genus Orbivirus, family Reoviridae). Nucleotide (nt) and amino acid (aa) sequence comparisons provide a basis for orbivirus detection and classification, however full genome sequence data were only available for the Great Island virus species. We report representative genome-sequences for the three other TBO species (virus isolates: Chenuda virus (CNUV); Chobar Gorge virus (CGV) and Wad Medani virus (WMV)). Phylogenetic comparisons show that TBOs cluster separately from insect-borne orbiviruses (IBOs). CNUV, CGV, WMV and GIV share low level aa/nt identities with other orbiviruses, in 'conserved' Pol, T2 and T13 proteins/genes, identifying them as four distinct virus-species. The TBO genome segment encoding cell attachment, outer capsid protein 1 (OC1), is approximately half the size of the equivalent segment from insect-borne orbiviruses, helping to explain why tick-borne orbiviruses have a ~1 kb smaller genome.

  4. Specificity of molecular interactions in transient protein-protein interaction interfaces.

    PubMed

    Cho, Kyu-il; Lee, KiYoung; Lee, Kwang H; Kim, Dongsup; Lee, Doheon

    2006-11-15

    In this study, we investigate what types of interactions are specific to their biological function, and what types of interactions are persistent regardless of their functional category in transient protein-protein heterocomplexes. This is the first approach to analyze protein-protein interfaces systematically at the molecular interaction level in the context of protein functions. We perform systematic analysis at the molecular interaction level using classification and feature subset selection technique prevalent in the field of pattern recognition. To represent the physicochemical properties of protein-protein interfaces, we design 18 molecular interaction types using canonical and noncanonical interactions. Then, we construct input vector using the frequency of each interaction type in protein-protein interface. We analyze the 131 interfaces of transient protein-protein heterocomplexes in PDB: 33 protease-inhibitors, 52 antibody-antigens, 46 signaling proteins including 4 cyclin dependent kinase and 26 G-protein. Using kNN classification and feature subset selection technique, we show that there are specific interaction types based on their functional category, and such interaction types are conserved through the common binding mechanism, rather than through the sequence or structure conservation. The extracted interaction types are C(alpha)-- H...O==C interaction, cation...anion interaction, amine...amine interaction, and amine...cation interaction. With these four interaction types, we achieve the classification success rate up to 83.2% with leave-one-out cross-validation at k = 15. Of these four interaction types, C(alpha)--H...O==C shows binding specificity for protease-inhibitor complexes, while cation-anion interaction is predominant in signaling complexes. The amine ... amine and amine...cation interaction give a minor contribution to the classification accuracy. When combined with these two interactions, they increase the accuracy by 3.8%. In the case of antibody-antigen complexes, the sign is somewhat ambiguous. From the evolutionary perspective, while protease-inhibitors and sig-naling proteins have optimized their interfaces to suit their biological functions, antibody-antigen interactions are the happenstance, implying that antibody-antigen complexes do not show distinctive interaction types. Persistent interaction types such as pi...pi, amide-carbonyl, and hydroxyl-carbonyl interaction, are also investigated. Analyzing the structural orientations of the pi...pi stacking interactions, we find that herringbone shape is a major configuration in transient protein-protein interfaces. This result is different from that of protein core, where parallel-displaced configurations are the major configuration. We also analyze overall trend of amide-carbonyl and hydroxyl-carbonyl interactions. It is noticeable that nearly 82% of the interfaces have at least one hydroxyl-carbonyl interactions. (c) 2006 Wiley-Liss, Inc.

  5. Literature classification for semi-automated updating of biological knowledgebases

    PubMed Central

    2013-01-01

    Background As the output of biological assays increase in resolution and volume, the body of specialized biological data, such as functional annotations of gene and protein sequences, enables extraction of higher-level knowledge needed for practical application in bioinformatics. Whereas common types of biological data, such as sequence data, are extensively stored in biological databases, functional annotations, such as immunological epitopes, are found primarily in semi-structured formats or free text embedded in primary scientific literature. Results We defined and applied a machine learning approach for literature classification to support updating of TANTIGEN, a knowledgebase of tumor T-cell antigens. Abstracts from PubMed were downloaded and classified as either "relevant" or "irrelevant" for database update. Training and five-fold cross-validation of a k-NN classifier on 310 abstracts yielded classification accuracy of 0.95, thus showing significant value in support of data extraction from the literature. Conclusion We here propose a conceptual framework for semi-automated extraction of epitope data embedded in scientific literature using principles from text mining and machine learning. The addition of such data will aid in the transition of biological databases to knowledgebases. PMID:24564403

  6. A proteome view of structural, functional, and taxonomic characteristics of major protein domain clusters.

    PubMed

    Sun, Chia-Tsen; Chiang, Austin W T; Hwang, Ming-Jing

    2017-10-27

    Proteome-scale bioinformatics research is increasingly conducted as the number of completely sequenced genomes increases, but analysis of protein domains (PDs) usually relies on similarity in their amino acid sequences and/or three-dimensional structures. Here, we present results from a bi-clustering analysis on presence/absence data for 6,580 unique PDs in 2,134 species with a sequenced genome, thus covering a complete set of proteins, for the three superkingdoms of life, Bacteria, Archaea, and Eukarya. Our analysis revealed eight distinctive PD clusters, which, following an analysis of enrichment of Gene Ontology functions and CATH classification of protein structures, were shown to exhibit structural and functional properties that are taxa-characteristic. For examples, the largest cluster is ubiquitous in all three superkingdoms, constituting a set of 1,472 persistent domains created early in evolution and retained in living organisms and characterized by basic cellular functions and ancient structural architectures, while an Archaea and Eukarya bi-superkingdom cluster suggests its PDs may have existed in the ancestor of the two superkingdoms, and others are single superkingdom- or taxa (e.g. Fungi)-specific. These results contribute to increase our appreciation of PD diversity and our knowledge of how PDs are used in species, yielding implications on species evolution.

  7. Fuzzy cluster analysis of simple physicochemical properties of amino acids for recognizing secondary structure in proteins.

    PubMed Central

    Mocz, G.

    1995-01-01

    Fuzzy cluster analysis has been applied to the 20 amino acids by using 65 physicochemical properties as a basis for classification. The clustering products, the fuzzy sets (i.e., classical sets with associated membership functions), have provided a new measure of amino acid similarities for use in protein folding studies. This work demonstrates that fuzzy sets of simple molecular attributes, when assigned to amino acid residues in a protein's sequence, can predict the secondary structure of the sequence with reasonable accuracy. An approach is presented for discriminating standard folding states, using near-optimum information splitting in half-overlapping segments of the sequence of assigned membership functions. The method is applied to a nonredundant set of 252 proteins and yields approximately 73% matching for correctly predicted and correctly rejected residues with approximately 60% overall success rate for the correctly recognized ones in three folding states: alpha-helix, beta-strand, and coil. The most useful attributes for discriminating these states appear to be related to size, polarity, and thermodynamic factors. Van der Waals volume, apparent average thickness of surrounding molecular free volume, and a measure of dimensionless surface electron density can explain approximately 95% of prediction results. hydrogen bonding and hydrophobicity induces do not yet enable clear clustering and prediction. PMID:7549882

  8. Improving Protein Fold Recognition by Deep Learning Networks

    NASA Astrophysics Data System (ADS)

    Jo, Taeho; Hou, Jie; Eickholt, Jesse; Cheng, Jianlin

    2015-12-01

    For accurate recognition of protein folds, a deep learning network method (DN-Fold) was developed to predict if a given query-template protein pair belongs to the same structural fold. The input used stemmed from the protein sequence and structural features extracted from the protein pair. We evaluated the performance of DN-Fold along with 18 different methods on Lindahl’s benchmark dataset and on a large benchmark set extracted from SCOP 1.75 consisting of about one million protein pairs, at three different levels of fold recognition (i.e., protein family, superfamily, and fold) depending on the evolutionary distance between protein sequences. The correct recognition rate of ensembled DN-Fold for Top 1 predictions is 84.5%, 61.5%, and 33.6% and for Top 5 is 91.2%, 76.5%, and 60.7% at family, superfamily, and fold levels, respectively. We also evaluated the performance of single DN-Fold (DN-FoldS), which showed the comparable results at the level of family and superfamily, compared to ensemble DN-Fold. Finally, we extended the binary classification problem of fold recognition to real-value regression task, which also show a promising performance. DN-Fold is freely available through a web server at http://iris.rnet.missouri.edu/dnfold.

  9. Metaproteomics analyses as diagnostic tool for differentiation of Escherichia coli strains in outbreaks

    NASA Astrophysics Data System (ADS)

    Jabbour, Rabih E.; Wright, James D.; Deshpande, Samir V.; Wade, Mary; McCubbin, Patrick; Bevilacqua, Vicky

    2013-05-01

    The secreted proteins of the enterohemorrhagic and enteropathogenic E. coli (EHEC and EPEC) are the most common cause of hemorrhagic colitis, a bloody diarrhea with EHEC infection, which often can lead to life threatening hemolytic-uremic syndrome (HUS).We are employing a metaproteomic approach as an effective and complimentary technique to the current genomic based approaches. This metaproteomic approach will evaluate the secreted proteins associated with pathogenicity and utilize their signatures as differentiation biomarkers between EHEC and EPEC strains. The result showed that the identified tryptic peptides of the secreted proteins extracted from different EHEC and EPEC growths have difference in their amino acids sequences and could potentially utilized as biomarkers for the studied E. coli strains. Analysis of extract from EHEC O104:H4 resulted in identification of a multidrug efflux protein, which belongs to the family of fusion proteins that are responsible of cell transportation. Experimental peptides identified lies in the region of the HlyD haemolysin secretion protein-D that is responsible for transporting the haemolysin A toxin. Moreover, the taxonomic classification of EHEC O104:H4 showed closest match with E. coli E55989, which is in agreement with genomic sequencing studies that were done extensively on the mentioned strain. The taxonomic results showed strain level classification for the studied strains and distinctive separation among the strains. Comparative proteomic calculations showed separation between EHEC O157:H7 and O104:H4 in replicate samples using cluster analysis. There are no reported studies addressing the characterization of secreted proteins in various enhanced growth media and utilizing them as biomarkers for strain differentiation. The results of FY-2012 are promising to pursue further experimentation to statistically validate the results and to further explore the impact of environmental conditions on the nature of the secreted biomarkers in various E. coli strains that are of public health concerns in various sectors.

  10. Phylogeny of Anophelinae using mitochondrial protein coding genes

    PubMed Central

    de Oliveira, Tatiane Marques Porangaba; Bergo, Eduardo S.; Conn, Jan E.; Sant’Ana, Denise Cristina; Nagaki, Sandra Sayuri; Nihei, Silvio; Lamas, Carlos Einicker; González, Christian; Moreira, Caio Cesar; Sallum, Maria Anice Mureb

    2017-01-01

    Malaria is a vector-borne disease that is a great burden on the poorest and most marginalized communities of the tropical and subtropical world. Approximately 41 species of Anopheline mosquitoes can effectively spread species of Plasmodium parasites that cause human malaria. Proposing a natural classification for the subfamily Anophelinae has been a continuous effort, addressed using both morphology and DNA sequence data. The monophyly of the genus Anopheles, and phylogenetic placement of the genus Bironella, subgenera Kerteszia, Lophopodomyia and Stethomyia within the subfamily Anophelinae, remain in question. To understand the classification of Anophelinae, we inferred the phylogeny of all three genera (Anopheles, Bironella, Chagasia) and major subgenera by analysing the amino acid sequences of the 13 protein coding genes of 150 newly sequenced mitochondrial genomes of Anophelinae and 18 newly sequenced Culex species as outgroup taxa, supplemented with 23 mitogenomes from GenBank. Our analyses generally place genus Bironella within the genus Anopheles, which implies that the latter as it is currently defined is not monophyletic. With some inconsistencies, Bironella was placed within the major clade that includes Anopheles, Cellia, Kerteszia, Lophopodomyia, Nyssorhynchus and Stethomyia, which were found to be monophyletic groups within Anophelinae. Our findings provided robust evidence for elevating the monophyletic groupings Kerteszia, Lophopodomyia, Nyssorhynchus and Stethomyia to genus level; genus Anopheles to include subgenera Anopheles, Baimaia, Cellia and Christya; Anopheles parvus to be placed into a new genus; Nyssorhynchus to be elevated to genus level; the genus Nyssorhynchus to include subgenera Myzorhynchella and Nyssorhynchus; Anopheles atacamensis and Anopheles pictipennis to be transferred from subgenus Nyssorhynchus to subgenus Myzorhynchella; and subgenus Nyssorhynchus to encompass the remaining species of Argyritarsis and Albimanus Sections. PMID:29291068

  11. Taxonomic distribution and origins of the extended LHC (light-harvesting complex) antenna protein superfamily

    PubMed Central

    2010-01-01

    Background The extended light-harvesting complex (LHC) protein superfamily is a centerpiece of eukaryotic photosynthesis, comprising the LHC family and several families involved in photoprotection, like the LHC-like and the photosystem II subunit S (PSBS). The evolution of this complex superfamily has long remained elusive, partially due to previously missing families. Results In this study we present a meticulous search for LHC-like sequences in public genome and expressed sequence tag databases covering twelve representative photosynthetic eukaryotes from the three primary lineages of plants (Plantae): glaucophytes, red algae and green plants (Viridiplantae). By introducing a coherent classification of the different protein families based on both, hidden Markov model analyses and structural predictions, numerous new LHC-like sequences were identified and several new families were described, including the red lineage chlorophyll a/b-binding-like protein (RedCAP) family from red algae and diatoms. The test of alternative topologies of sequences of the highly conserved chlorophyll-binding core structure of LHC and PSBS proteins significantly supports the independent origins of LHC and PSBS families via two unrelated internal gene duplication events. This result was confirmed by the application of cluster likelihood mapping. Conclusions The independent evolution of LHC and PSBS families is supported by strong phylogenetic evidence. In addition, a possible origin of LHC and PSBS families from different homologous members of the stress-enhanced protein subfamily, a diverse and anciently paralogous group of two-helix proteins, seems likely. The new hypothesis for the evolution of the extended LHC protein superfamily proposed here is in agreement with the character evolution analysis that incorporates the distribution of families and subfamilies across taxonomic lineages. Intriguingly, stress-enhanced proteins, which are universally found in the genomes of green plants, red algae, glaucophytes and in diatoms with complex plastids, could represent an important and previously missing link in the evolution of the extended LHC protein superfamily. PMID:20673336

  12. Freiburg RNA tools: a central online resource for RNA-focused research and teaching.

    PubMed

    Raden, Martin; Ali, Syed M; Alkhnbashi, Omer S; Busch, Anke; Costa, Fabrizio; Davis, Jason A; Eggenhofer, Florian; Gelhausen, Rick; Georg, Jens; Heyne, Steffen; Hiller, Michael; Kundu, Kousik; Kleinkauf, Robert; Lott, Steffen C; Mohamed, Mostafa M; Mattheis, Alexander; Miladi, Milad; Richter, Andreas S; Will, Sebastian; Wolff, Joachim; Wright, Patrick R; Backofen, Rolf

    2018-05-21

    The Freiburg RNA tools webserver is a well established online resource for RNA-focused research. It provides a unified user interface and comprehensive result visualization for efficient command line tools. The webserver includes RNA-RNA interaction prediction (IntaRNA, CopraRNA, metaMIR), sRNA homology search (GLASSgo), sequence-structure alignments (LocARNA, MARNA, CARNA, ExpaRNA), CRISPR repeat classification (CRISPRmap), sequence design (antaRNA, INFO-RNA, SECISDesign), structure aberration evaluation of point mutations (RaSE), and RNA/protein-family models visualization (CMV), and other methods. Open education resources offer interactive visualizations of RNA structure and RNA-RNA interaction prediction as well as basic and advanced sequence alignment algorithms. The services are freely available at http://rna.informatik.uni-freiburg.de.

  13. Natural Variation of Epstein-Barr Virus Genes, Proteins, and Primary MicroRNA.

    PubMed

    Correia, Samantha; Palser, Anne; Elgueta Karstegl, Claudio; Middeldorp, Jaap M; Ramayanti, Octavia; Cohen, Jeffrey I; Hildesheim, Allan; Fellner, Maria Dolores; Wiels, Joelle; White, Robert E; Kellam, Paul; Farrell, Paul J

    2017-08-01

    Viral gene sequences from an enlarged set of about 200 Epstein-Barr virus (EBV) strains, including many primary isolates, have been used to investigate variation in key viral genetic regions, particularly LMP1, Zp, gp350, EBNA1, and the BART microRNA (miRNA) cluster 2. Determination of type 1 and type 2 EBV in saliva samples from people from a wide range of geographic and ethnic backgrounds demonstrates a small percentage of healthy white Caucasian British people carrying predominantly type 2 EBV. Linkage of Zp and gp350 variants to type 2 EBV is likely to be due to their genes being adjacent to the EBNA3 locus, which is one of the major determinants of the type 1/type 2 distinction. A novel classification of EBNA1 DNA binding domains, named QCIGP, results from phylogeny analysis of their protein sequences but is not linked to the type 1/type 2 classification. The BART cluster 2 miRNA region is classified into three major variants through single-nucleotide polymorphisms (SNPs) in the primary miRNA outside the mature miRNA sequences. These SNPs can result in altered levels of expression of some miRNAs from the BART variant frequently present in Chinese and Indonesian nasopharyngeal carcinoma (NPC) samples. The EBV genetic variants identified here provide a basis for future, more directed analysis of association of specific EBV variations with EBV biology and EBV-associated diseases. IMPORTANCE Incidence of diseases associated with EBV varies greatly in different parts of the world. Thus, relationships between EBV genome sequence variation and health, disease, geography, and ethnicity of the host may be important for understanding the role of EBV in diseases and for development of an effective EBV vaccine. This paper provides the most comprehensive analysis so far of variation in specific EBV genes relevant to these diseases and proposed EBV vaccines. By focusing on variation in LMP1, Zp, gp350, EBNA1, and the BART miRNA cluster 2, new relationships with the known type 1/type 2 strains are demonstrated, and a novel classification of EBNA1 and the BART miRNAs is proposed. Copyright © 2017 Correia et al.

  14. Genome-wide identification, classification, and expression analysis of the arabinogalactan protein gene family in rice (Oryza sativa L.)

    PubMed Central

    Zhao, Jie

    2010-01-01

    Arabinogalactan proteins (AGPs) comprise a family of hydroxyproline-rich glycoproteins that are implicated in plant growth and development. In this study, 69 AGPs are identified from the rice genome, including 13 classical AGPs, 15 arabinogalactan (AG) peptides, three non-classical AGPs, three early nodulin-like AGPs (eNod-like AGPs), eight non-specific lipid transfer protein-like AGPs (nsLTP-like AGPs), and 27 fasciclin-like AGPs (FLAs). The results from expressed sequence tags, microarrays, and massively parallel signature sequencing tags are used to analyse the expression of AGP-encoding genes, which is confirmed by real-time PCR. The results reveal that several rice AGP-encoding genes are predominantly expressed in anthers and display differential expression patterns in response to abscisic acid, gibberellic acid, and abiotic stresses. Based on the results obtained from this analysis, an attempt has been made to link the protein structures and expression patterns of rice AGP-encoding genes to their functions. Taken together, the genome-wide identification and expression analysis of the rice AGP gene family might facilitate further functional studies of rice AGPs. PMID:20423940

  15. The C. elegans rab family: identification, classification and toolkit construction.

    PubMed

    Gallegos, Maria E; Balakrishnan, Sanjeev; Chandramouli, Priya; Arora, Shaily; Azameera, Aruna; Babushekar, Anitha; Bargoma, Emilee; Bokhari, Abdulmalik; Chava, Siva Kumari; Das, Pranti; Desai, Meetali; Decena, Darlene; Saramma, Sonia Dev Devadas; Dey, Bodhidipra; Doss, Anna-Louise; Gor, Nilang; Gudiputi, Lakshmi; Guo, Chunyuan; Hande, Sonali; Jensen, Megan; Jones, Samantha; Jones, Norman; Jorgens, Danielle; Karamchedu, Padma; Kamrani, Kambiz; Kolora, Lakshmi Divya; Kristensen, Line; Kwan, Kelly; Lau, Henry; Maharaj, Pranesh; Mander, Navneet; Mangipudi, Kalyani; Menakuru, Himabindu; Mody, Vaishali; Mohanty, Sandeepa; Mukkamala, Sridevi; Mundra, Sheena A; Nagaraju, Sudharani; Narayanaswamy, Rajhalutshimi; Ndungu-Case, Catherine; Noorbakhsh, Mersedeh; Patel, Jigna; Patel, Puja; Pendem, Swetha Vandana; Ponakala, Anusha; Rath, Madhusikta; Robles, Michael C; Rokkam, Deepti; Roth, Caroline; Sasidharan, Preeti; Shah, Sapana; Tandon, Shweta; Suprai, Jagdip; Truong, Tina Quynh Nhu; Uthayaruban, Rubatharshini; Varma, Ajitha; Ved, Urvi; Wang, Zeran; Yu, Zhe

    2012-01-01

    Rab monomeric GTPases regulate specific aspects of vesicle transport in eukaryotes including coat recruitment, uncoating, fission, motility, target selection and fusion. Moreover, individual Rab proteins function at specific sites within the cell, for example the ER, golgi and early endosome. Importantly, the localization and function of individual Rab subfamily members are often conserved underscoring the significant contributions that model organisms such as Caenorhabditis elegans can make towards a better understanding of human disease caused by Rab and vesicle trafficking malfunction. With this in mind, a bioinformatics approach was first taken to identify and classify the complete C. elegans Rab family placing individual Rabs into specific subfamilies based on molecular phylogenetics. For genes that were difficult to classify by sequence similarity alone, we did a comparative analysis of intron position among specific subfamilies from yeast to humans. This two-pronged approach allowed the classification of 30 out of 31 C. elegans Rab proteins identified here including Rab31/Rab50, a likely member of the last eukaryotic common ancestor (LECA). Second, a molecular toolset was created to facilitate research on biological processes that involve Rab proteins. Specifically, we used Gateway-compatible C. elegans ORFeome clones as starting material to create 44 full-length, sequence-verified, dominant-negative (DN) and constitutive active (CA) rab open reading frames (ORFs). Development of this toolset provided independent research projects for students enrolled in a research-based molecular techniques course at California State University, East Bay (CSUEB).

  16. The C. elegans Rab Family: Identification, Classification and Toolkit Construction

    PubMed Central

    Gallegos, Maria E.; Balakrishnan, Sanjeev; Chandramouli, Priya

    2012-01-01

    Rab monomeric GTPases regulate specific aspects of vesicle transport in eukaryotes including coat recruitment, uncoating, fission, motility, target selection and fusion. Moreover, individual Rab proteins function at specific sites within the cell, for example the ER, golgi and early endosome. Importantly, the localization and function of individual Rab subfamily members are often conserved underscoring the significant contributions that model organisms such as Caenorhabditis elegans can make towards a better understanding of human disease caused by Rab and vesicle trafficking malfunction. With this in mind, a bioinformatics approach was first taken to identify and classify the complete C. elegans Rab family placing individual Rabs into specific subfamilies based on molecular phylogenetics. For genes that were difficult to classify by sequence similarity alone, we did a comparative analysis of intron position among specific subfamilies from yeast to humans. This two-pronged approach allowed the classification of 30 out of 31 C. elegans Rab proteins identified here including Rab31/Rab50, a likely member of the last eukaryotic common ancestor (LECA). Second, a molecular toolset was created to facilitate research on biological processes that involve Rab proteins. Specifically, we used Gateway-compatible C. elegans ORFeome clones as starting material to create 44 full-length, sequence-verified, dominant-negative (DN) and constitutive active (CA) rab open reading frames (ORFs). Development of this toolset provided independent research projects for students enrolled in a research-based molecular techniques course at California State University, East Bay (CSUEB). PMID:23185324

  17. A Proposed Genus Boundary for the Prokaryotes Based on Genomic Insights

    PubMed Central

    Qin, Qi-Long; Xie, Bin-Bin; Zhang, Xi-Ying; Chen, Xiu-Lan; Zhou, Bai-Cheng; Zhou, Jizhong; Oren, Aharon

    2014-01-01

    Genomic information has already been applied to prokaryotic species definition and classification. However, the contribution of the genome sequence to prokaryotic genus delimitation has been less studied. To gain insights into genus definition for the prokaryotes, we attempted to reveal the genus-level genomic differences in the current prokaryotic classification system and to delineate the boundary of a genus on the basis of genomic information. The average nucleotide sequence identity between two genomes can be used for prokaryotic species delineation, but it is not suitable for genus demarcation. We used the percentage of conserved proteins (POCP) between two strains to estimate their evolutionary and phenotypic distance. A comprehensive genomic survey indicated that the POCP can serve as a robust genomic index for establishing the genus boundary for prokaryotic groups. Basically, two species belonging to the same genus would share at least half of their proteins. In a specific lineage, the genus and family/order ranks showed slight or no overlap in terms of POCP values. A prokaryotic genus can be defined as a group of species with all pairwise POCP values higher than 50%. Integration of whole-genome data into the current taxonomy system can provide comprehensive information for prokaryotic genus definition and delimitation. PMID:24706738

  18. Entropyology: the application of bioinformatics and data modeling to digital virus and malware recognition

    NASA Astrophysics Data System (ADS)

    Jaenisch, Holger M.; Handley, James W.

    2010-04-01

    Malware are analogs of viruses. Viruses are comprised of large numbers of polypeptide proteins. The shape and function of the protein strands determines the functionality of the segment, similar to a subroutine in malware. The full combination of subroutines is the malware organism, in analogous fashion as a collection of polypeptides forms protein structures that are information bearing. We propose to apply the methods of Bioinformatics to analyze malware to provide a rich feature set for creating a unique and novel detection and classification scheme that is originally applied to Bioinformatics amino acid sequencing. Our proposed methods enable real time in situ (in contrast to in vivo) detection applications.

  19. Identification of high-confidence RNA regulatory elements by combinatorial classification of RNA-protein binding sites.

    PubMed

    Li, Yang Eric; Xiao, Mu; Shi, Binbin; Yang, Yu-Cheng T; Wang, Dong; Wang, Fei; Marcia, Marco; Lu, Zhi John

    2017-09-08

    Crosslinking immunoprecipitation sequencing (CLIP-seq) technologies have enabled researchers to characterize transcriptome-wide binding sites of RNA-binding protein (RBP) with high resolution. We apply a soft-clustering method, RBPgroup, to various CLIP-seq datasets to group together RBPs that specifically bind the same RNA sites. Such combinatorial clustering of RBPs helps interpret CLIP-seq data and suggests functional RNA regulatory elements. Furthermore, we validate two RBP-RBP interactions in cell lines. Our approach links proteins and RNA motifs known to possess similar biochemical and cellular properties and can, when used in conjunction with additional experimental data, identify high-confidence RBP groups and their associated RNA regulatory elements.

  20. Identification and substrate prediction of new Fragaria x ananassa aquaporins and expression in different tissues and during strawberry fruit development.

    PubMed

    Merlaen, Britt; De Keyser, Ellen; Van Labeke, Marie-Christine

    2018-01-01

    The newly identified aquaporin coding sequences presented here pave the way for further insights into the plant-water relations in the commercial strawberry ( Fragaria x ananassa ). Aquaporins are water channel proteins that allow water to cross (intra)cellular membranes. In Fragaria x ananassa , few of them have been identified hitherto, hampering the exploration of the water transport regulation at cellular level. Here, we present new aquaporin coding sequences belonging to different subclasses: plasma membrane intrinsic proteins subtype 1 and subtype 2 (PIP1 and PIP2) and tonoplast intrinsic proteins (TIP). The classification is based on phylogenetic analysis and is confirmed by the presence of conserved residues. Substrate-specific signature sequences (SSSSs) and specificity-determining positions (SDPs) predict the substrate specificity of each new aquaporin. Expression profiling in leaves, petioles and developing fruits reveals distinct patterns, even within the same (sub)class. Expression profiles range from leaf-specific expression over constitutive expression to fruit-specific expression. Both upregulation and downregulation during fruit ripening occur. Substrate specificity and expression profiles suggest that functional specialization exists among aquaporins belonging to a different but also to the same (sub)class.

  1. Molecular classification of liver cirrhosis in a rat model by proteomics and bioinformatics.

    PubMed

    Xu, Xiu-Qin; Leow, Chon K; Lu, Xin; Zhang, Xuegong; Liu, Jun S; Wong, Wing-Hung; Asperger, Arndt; Deininger, Sören; Eastwood Leung, Hon-Chiu

    2004-10-01

    Liver cirrhosis is a worldwide health problem. Reliable, noninvasive methods for early detection of liver cirrhosis are not available. Using a three-step approach, we classified sera from rats with liver cirrhosis following different treatment insults. The approach consisted of: (i) protein profiling using surface-enhanced laser desorption/ionization (SELDI) technology; (ii) selection of a statistically significant serum biomarker set using machine learning algorithms; and (iii) identification of selected serum biomarkers by peptide sequencing. We generated serum protein profiles from three groups of rats: (i) normal (n=8), (ii) thioacetamide-induced liver cirrhosis (n=22), and (iii) bile duct ligation-induced liver fibrosis (n=5) using a weak cation exchanger surface. Profiling data were further analyzed by a recursive support vector machine algorithm to select a panel of statistically significant biomarkers for class prediction. Sensitivity and specificity of classification using the selected protein marker set were higher than 92%. A consistently down-regulated 3495 Da protein in cirrhosis samples was one of the selected significant biomarkers. This 3495 Da protein was purified on-chip and trypsin digested. Further structural characterization of this biomarkers candidate was done by using cross-platform matrix-assisted laser desorption/ionization mass spectrometry (MALDI-MS) peptide mass fingerprinting (PMF) and matrix-assisted laser desorption/ionization time of flight/time of flight (MALDI-TOF/TOF) tandem mass spectrometry (MS/MS). Combined data from PMF and MS/MS spectra of two tryptic peptides suggested that this 3495 Da protein shared homology to a histidine-rich glycoprotein. These results demonstrated a novel approach to discovery of new biomarkers for early detection of liver cirrhosis and classification of liver diseases.

  2. Protein fold recognition using geometric kernel data fusion.

    PubMed

    Zakeri, Pooya; Jeuris, Ben; Vandebril, Raf; Moreau, Yves

    2014-07-01

    Various approaches based on features extracted from protein sequences and often machine learning methods have been used in the prediction of protein folds. Finding an efficient technique for integrating these different protein features has received increasing attention. In particular, kernel methods are an interesting class of techniques for integrating heterogeneous data. Various methods have been proposed to fuse multiple kernels. Most techniques for multiple kernel learning focus on learning a convex linear combination of base kernels. In addition to the limitation of linear combinations, working with such approaches could cause a loss of potentially useful information. We design several techniques to combine kernel matrices by taking more involved, geometry inspired means of these matrices instead of convex linear combinations. We consider various sequence-based protein features including information extracted directly from position-specific scoring matrices and local sequence alignment. We evaluate our methods for classification on the SCOP PDB-40D benchmark dataset for protein fold recognition. The best overall accuracy on the protein fold recognition test set obtained by our methods is ∼ 86.7%. This is an improvement over the results of the best existing approach. Moreover, our computational model has been developed by incorporating the functional domain composition of proteins through a hybridization model. It is observed that by using our proposed hybridization model, the protein fold recognition accuracy is further improved to 89.30%. Furthermore, we investigate the performance of our approach on the protein remote homology detection problem by fusing multiple string kernels. The MATLAB code used for our proposed geometric kernel fusion frameworks are publicly available at http://people.cs.kuleuven.be/∼raf.vandebril/homepage/software/geomean.php?menu=5/. © The Author 2014. Published by Oxford University Press.

  3. Proteins with an Euonymus lectin-like domain are ubiquitous in Embryophyta

    PubMed Central

    2009-01-01

    Background Cloning of the Euonymus lectin led to the discovery of a novel domain that also occurs in some stress-induced plant proteins. The distribution and the diversity of proteins with an Euonymus lectin (EUL) domain were investigated using detailed analysis of sequences in publicly accessible genome and transcriptome databases. Results Comprehensive in silico analyses indicate that the recently identified Euonymus europaeus lectin domain represents a conserved structural unit of a novel family of putative carbohydrate-binding proteins, which will further be referred to as the Euonymus lectin (EUL) family. The EUL domain is widespread among plants. Analysis of retrieved sequences revealed that some sequences consist of a single EUL domain linked to an unrelated N-terminal domain whereas others comprise two in tandem arrayed EUL domains. A new classification system for these lectins is proposed based on the overall domain architecture. Evolutionary relationships among the sequences with EUL domains are discussed. Conclusion The identification of the EUL family provides the first evidence for the occurrence in terrestrial plants of a highly conserved plant specific domain. The widespread distribution of the EUL domain strikingly contrasts the more limited or even narrow distribution of most other lectin domains found in plants. The apparent omnipresence of the EUL domain is indicative for a universal role of this lectin domain in plants. Although there is unambiguous evidence that several EUL domains possess carbohydrate-binding activity further research is required to corroborate the carbohydrate-binding properties of different members of the EUL family. PMID:19930663

  4. Origin and Diversification of Basic-Helix-Loop-Helix Proteins in Plants

    PubMed Central

    Pires, Nuno; Dolan, Liam

    2010-01-01

    Basic helix-loop-helix (bHLH) proteins are a class of transcription factors found throughout eukaryotic organisms. Classification of the complete sets of bHLH proteins in the sequenced genomes of Arabidopsis thaliana and Oryza sativa (rice) has defined the diversity of these proteins among flowering plants. However, the evolutionary relationships of different plant bHLH groups and the diversity of bHLH proteins in more ancestral groups of plants are currently unknown. In this study, we use whole-genome sequences from nine species of land plants and algae to define the relationships between these proteins in plants. We show that few (less than 5) bHLH proteins are encoded in the genomes of chlorophytes and red algae. In contrast, many bHLH proteins (100–170) are encoded in the genomes of land plants (embryophytes). Phylogenetic analyses suggest that plant bHLH proteins are monophyletic and constitute 26 subfamilies. Twenty of these subfamilies existed in the common ancestors of extant mosses and vascular plants, whereas six further subfamilies evolved among the vascular plants. In addition to the conserved bHLH domains, most subfamilies are characterized by the presence of highly conserved short amino acid motifs. We conclude that much of the diversity of plant bHLH proteins was established in early land plants, over 440 million years ago. PMID:19942615

  5. A PDB-wide, evolution-based assessment of protein-protein interfaces.

    PubMed

    Baskaran, Kumaran; Duarte, Jose M; Biyani, Nikhil; Bliven, Spencer; Capitani, Guido

    2014-10-18

    Thanks to the growth in sequence and structure databases, more than 50 million sequences are now available in UniProt and 100,000 structures in the PDB. Rich information about protein-protein interfaces can be obtained by a comprehensive study of protein contacts in the PDB, their sequence conservation and geometric features. An automated computational pipeline was developed to run our Evolutionary Protein-Protein Interface Classifier (EPPIC) software on the entire PDB and store the results in a relational database, currently containing > 800,000 interfaces. This allows the analysis of interface data on a PDB-wide scale. Two large benchmark datasets of biological interfaces and crystal contacts, each containing about 3000 entries, were automatically generated based on criteria thought to be strong indicators of interface type. The BioMany set of biological interfaces includes NMR dimers solved as crystal structures and interfaces that are preserved across diverse crystal forms, as catalogued by the Protein Common Interface Database (ProtCID) from Xu and Dunbrack. The second dataset, XtalMany, is derived from interfaces that would lead to infinite assemblies and are therefore crystal contacts. BioMany and XtalMany were used to benchmark the EPPIC approach. The performance of EPPIC was also compared to classifications from the Protein Interfaces, Surfaces, and Assemblies (PISA) program on a PDB-wide scale, finding that the two approaches give the same call in about 88% of PDB interfaces. By comparing our safest predictions to the PDB author annotations, we provide a lower-bound estimate of the error rate of biological unit annotations in the PDB. Additionally, we developed a PyMOL plugin for direct download and easy visualization of EPPIC interfaces for any PDB entry. Both the datasets and the PyMOL plugin are available at http://www.eppic-web.org/ewui/\\#downloads. Our computational pipeline allows us to analyze protein-protein contacts and their sequence conservation across the entire PDB. Two new benchmark datasets are provided, which are over an order of magnitude larger than existing manually curated ones. These tools enable the comprehensive study of several aspects of protein-protein contacts in the PDB and represent a basis for future, even larger scale studies of protein-protein interactions.

  6. Factors That Affect Large Subunit Ribosomal DNA Amplicon Sequencing Studies of Fungal Communities: Classification Method, Primer Choice, and Error

    PubMed Central

    Porter, Teresita M.; Golding, G. Brian

    2012-01-01

    Nuclear large subunit ribosomal DNA is widely used in fungal phylogenetics and to an increasing extent also amplicon-based environmental sequencing. The relatively short reads produced by next-generation sequencing, however, makes primer choice and sequence error important variables for obtaining accurate taxonomic classifications. In this simulation study we tested the performance of three classification methods: 1) a similarity-based method (BLAST + Metagenomic Analyzer, MEGAN); 2) a composition-based method (Ribosomal Database Project naïve Bayesian classifier, NBC); and, 3) a phylogeny-based method (Statistical Assignment Package, SAP). We also tested the effects of sequence length, primer choice, and sequence error on classification accuracy and perceived community composition. Using a leave-one-out cross validation approach, results for classifications to the genus rank were as follows: BLAST + MEGAN had the lowest error rate and was particularly robust to sequence error; SAP accuracy was highest when long LSU query sequences were classified; and, NBC runs significantly faster than the other tested methods. All methods performed poorly with the shortest 50–100 bp sequences. Increasing simulated sequence error reduced classification accuracy. Community shifts were detected due to sequence error and primer selection even though there was no change in the underlying community composition. Short read datasets from individual primers, as well as pooled datasets, appear to only approximate the true community composition. We hope this work informs investigators of some of the factors that affect the quality and interpretation of their environmental gene surveys. PMID:22558215

  7. Application of Machine Learning to Proteomics Data: Classification and Biomarker Identification in Postgenomics Biology

    PubMed Central

    Swan, Anna Louise; Mobasheri, Ali; Allaway, David; Liddell, Susan

    2013-01-01

    Abstract Mass spectrometry is an analytical technique for the characterization of biological samples and is increasingly used in omics studies because of its targeted, nontargeted, and high throughput abilities. However, due to the large datasets generated, it requires informatics approaches such as machine learning techniques to analyze and interpret relevant data. Machine learning can be applied to MS-derived proteomics data in two ways. First, directly to mass spectral peaks and second, to proteins identified by sequence database searching, although relative protein quantification is required for the latter. Machine learning has been applied to mass spectrometry data from different biological disciplines, particularly for various cancers. The aims of such investigations have been to identify biomarkers and to aid in diagnosis, prognosis, and treatment of specific diseases. This review describes how machine learning has been applied to proteomics tandem mass spectrometry data. This includes how it can be used to identify proteins suitable for use as biomarkers of disease and for classification of samples into disease or treatment groups, which may be applicable for diagnostics. It also includes the challenges faced by such investigations, such as prediction of proteins present, protein quantification, planning for the use of machine learning, and small sample sizes. PMID:24116388

  8. Molecular Diagnostics of Gliomas Using Next Generation Sequencing of a Glioma-Tailored Gene Panel.

    PubMed

    Zacher, Angela; Kaulich, Kerstin; Stepanow, Stefanie; Wolter, Marietta; Köhrer, Karl; Felsberg, Jörg; Malzkorn, Bastian; Reifenberger, Guido

    2017-03-01

    Current classification of gliomas is based on histological criteria according to the World Health Organization (WHO) classification of tumors of the central nervous system. Over the past years, characteristic genetic profiles have been identified in various glioma types. These can refine tumor diagnostics and provide important prognostic and predictive information. We report on the establishment and validation of gene panel next generation sequencing (NGS) for the molecular diagnostics of gliomas. We designed a glioma-tailored gene panel covering 660 amplicons derived from 20 genes frequently aberrant in different glioma types. Sensitivity and specificity of glioma gene panel NGS for detection of DNA sequence variants and copy number changes were validated by single gene analyses. NGS-based mutation detection was optimized for application on formalin-fixed paraffin-embedded tissue specimens including small stereotactic biopsy samples. NGS data obtained in a retrospective analysis of 121 gliomas allowed for their molecular classification into distinct biological groups, including (i) isocitrate dehydrogenase gene (IDH) 1 or 2 mutant astrocytic gliomas with frequent α-thalassemia/mental retardation syndrome X-linked (ATRX) and tumor protein p53 (TP53) gene mutations, (ii) IDH mutant oligodendroglial tumors with 1p/19q codeletion, telomerase reverse transcriptase (TERT) promoter mutation and frequent Drosophila homolog of capicua (CIC) gene mutation, as well as (iii) IDH wildtype glioblastomas with frequent TERT promoter mutation, phosphatase and tensin homolog (PTEN) mutation and/or epidermal growth factor receptor (EGFR) amplification. Oligoastrocytic gliomas were genetically assigned to either of these groups. Our findings implicate gene panel NGS as a promising diagnostic technique that may facilitate integrated histological and molecular glioma classification. © 2016 International Society of Neuropathology.

  9. Lamin-like analogues in plants: the characterization of NMCP1 in Allium cepa

    PubMed Central

    Moreno Díaz de la Espina, Susana

    2013-01-01

    The nucleoskeleton of plants contains a peripheral lamina (also called plamina) and, even though lamins are absent in plants, their roles are still fulfilled in plant nuclei. One of the most intriguing topics in plant biology concerns the identity of lamin protein analogues in plants. Good candidates to play lamin functions in plants are the members of the NMCP (nuclear matrix constituent protein) family, which exhibit the typical tripartite structure of lamins. This paper describes a bioinformatics analysis and classification of the NMCP family based on phylogenetic relationships, sequence similarity and the distribution of conserved regions in 76 homologues. In addition, NMCP1 in the monocot Allium cepa characterized by its sequence and structure, biochemical properties, and subnuclear distribution and alterations in its expression throughout the root were identified. The results demonstrate that these proteins exhibit many similarities to lamins (structural organization, conserved regions, subnuclear distribution, and solubility) and that they may fulfil the functions of lamins in plants. These findings significantly advance understanding of the structural proteins of the plant lamina and nucleoskeleton and provide a basis for further investigation of the protein networks forming these structures. PMID:23378381

  10. Lamin-like analogues in plants: the characterization of NMCP1 in Allium cepa.

    PubMed

    Ciska, Malgorzata; Masuda, Kiyoshi; Moreno Díaz de la Espina, Susana

    2013-04-01

    The nucleoskeleton of plants contains a peripheral lamina (also called plamina) and, even though lamins are absent in plants, their roles are still fulfilled in plant nuclei. One of the most intriguing topics in plant biology concerns the identity of lamin protein analogues in plants. Good candidates to play lamin functions in plants are the members of the NMCP (nuclear matrix constituent protein) family, which exhibit the typical tripartite structure of lamins. This paper describes a bioinformatics analysis and classification of the NMCP family based on phylogenetic relationships, sequence similarity and the distribution of conserved regions in 76 homologues. In addition, NMCP1 in the monocot Allium cepa characterized by its sequence and structure, biochemical properties, and subnuclear distribution and alterations in its expression throughout the root were identified. The results demonstrate that these proteins exhibit many similarities to lamins (structural organization, conserved regions, subnuclear distribution, and solubility) and that they may fulfil the functions of lamins in plants. These findings significantly advance understanding of the structural proteins of the plant lamina and nucleoskeleton and provide a basis for further investigation of the protein networks forming these structures.

  11. Predicting permanent and transient protein-protein interfaces.

    PubMed

    La, David; Kong, Misun; Hoffman, William; Choi, Youn Im; Kihara, Daisuke

    2013-05-01

    Protein-protein interactions (PPIs) are involved in diverse functions in a cell. To optimize functional roles of interactions, proteins interact with a spectrum of binding affinities. Interactions are conventionally classified into permanent and transient, where the former denotes tight binding between proteins that result in strong complexes, whereas the latter compose of relatively weak interactions that can dissociate after binding to regulate functional activity at specific time point. Knowing the type of interactions has significant implications for understanding the nature and function of PPIs. In this study, we constructed amino acid substitution models that capture mutation patterns at permanent and transient type of protein interfaces, which were found to be different with statistical significance. Using the substitution models, we developed a novel computational method that predicts permanent and transient protein binding interfaces (PBIs) in protein surfaces. Without knowledge of the interacting partner, the method uses a single query protein structure and a multiple sequence alignment of the sequence family. Using a large dataset of permanent and transient proteins, we show that our method, BindML+, performs very well in protein interface classification. A very high area under the curve (AUC) value of 0.957 was observed when predicted protein binding sites were classified. Remarkably, near prefect accuracy was achieved with an AUC of 0.991 when actual binding sites were classified. The developed method will be also useful for protein design of permanent and transient PBIs. Copyright © 2013 Wiley Periodicals, Inc.

  12. Studying long 16S rDNA sequences with ultrafast-metagenomic sequence classification using exact alignments (Kraken).

    PubMed

    Valenzuela-González, Fabiola; Martínez-Porchas, Marcel; Villalpando-Canchola, Enrique; Vargas-Albores, Francisco

    2016-03-01

    Ultrafast-metagenomic sequence classification using exact alignments (Kraken) is a novel approach to classify 16S rDNA sequences. The classifier is based on mapping short sequences to the lowest ancestor and performing alignments to form subtrees with specific weights in each taxon node. This study aimed to evaluate the classification performance of Kraken with long 16S rDNA random environmental sequences produced by cloning and then Sanger sequenced. A total of 480 clones were isolated and expanded, and 264 of these clones formed contigs (1352 ± 153 bp). The same sequences were analyzed using the Ribosomal Database Project (RDP) classifier. Deeper classification performance was achieved by Kraken than by the RDP: 73% of the contigs were classified up to the species or variety levels, whereas 67% of these contigs were classified no further than the genus level by the RDP. The results also demonstrated that unassembled sequences analyzed by Kraken provide similar or inclusively deeper information. Moreover, sequences that did not form contigs, which are usually discarded by other programs, provided meaningful information when analyzed by Kraken. Finally, it appears that the assembly step for Sanger sequences can be eliminated when using Kraken. Kraken cumulates the information of both sequence senses, providing additional elements for the classification. In conclusion, the results demonstrate that Kraken is an excellent choice for use in the taxonomic assignment of sequences obtained by Sanger sequencing or based on third generation sequencing, of which the main goal is to generate larger sequences. Copyright © 2016 Elsevier B.V. All rights reserved.

  13. HMM-ModE: implementation, benchmarking and validation with HMMER3

    PubMed Central

    2014-01-01

    Background HMM-ModE is a computational method that generates family specific profile HMMs using negative training sequences. The method optimizes the discrimination threshold using 10 fold cross validation and modifies the emission probabilities of profiles to reduce common fold based signals shared with other sub-families. The protocol depends on the program HMMER for HMM profile building and sequence database searching. The recent release of HMMER3 has improved database search speed by several orders of magnitude, allowing for the large scale deployment of the method in sequence annotation projects. We have rewritten our existing scripts both at the level of parsing the HMM profiles and modifying emission probabilities to upgrade HMM-ModE using HMMER3 that takes advantage of its probabilistic inference with high computational speed. The method is benchmarked and tested on GPCR dataset as an accurate and fast method for functional annotation. Results The implementation of this method, which now works with HMMER3, is benchmarked with the earlier version of HMMER, to show that the effect of local-local alignments is marked only in the case of profiles containing a large number of discontinuous match states. The method is tested on a gold standard set of families and we have reported a significant reduction in the number of false positive hits over the default HMM profiles. When implemented on GPCR sequences, the results showed an improvement in the accuracy of classification compared with other methods used to classify the familyat different levels of their classification hierarchy. Conclusions The present findings show that the new version of HMM-ModE is a highly specific method used to differentiate between fold (superfamily) and function (family) specific signals, which helps in the functional annotation of protein sequences. The use of modified profile HMMs of GPCR sequences provides a simple yet highly specific method for classification of the family, being able to predict the sub-family specific sequences with high accuracy even though sequences share common physicochemical characteristics between sub-families. PMID:25073805

  14. Cluster analysis of S. Cerevisiae nucleosome binding sites

    NASA Astrophysics Data System (ADS)

    Suvorova, Y.; Korotkov, E.

    2017-12-01

    It is well known that major part of a eukaryotic genome is wrapped around histone proteins forming nucleosomes. It was also demonstrated that the DNA sequence itself is playing an important role in the nucleosome positioning process. In this work, a cluster analysis of 67 517 nucleosome binding sites from the S. Cerevisiae genome was carried out. The classification method is based on the self-adjusting dinucleotides position weight matrix. As a result, 135 significant clusters were discovered that contain 43225 sequences (which constitutes 64% of the initial set). The meaning of the found classes is discussed, as well as the possibility of the further usage.

  15. Whale song analyses using bioinformatics sequence analysis approaches

    NASA Astrophysics Data System (ADS)

    Chen, Yian A.; Almeida, Jonas S.; Chou, Lien-Siang

    2005-04-01

    Animal songs are frequently analyzed using discrete hierarchical units, such as units, themes and songs. Because animal songs and bio-sequences may be understood as analogous, bioinformatics analysis tools DNA/protein sequence alignment and alignment-free methods are proposed to quantify the theme similarities of the songs of false killer whales recorded off northeast Taiwan. The eighteen themes with discrete units that were identified in an earlier study [Y. A. Chen, masters thesis, University of Charleston, 2001] were compared quantitatively using several distance metrics. These metrics included the scores calculated using the Smith-Waterman algorithm with the repeated procedure; the standardized Euclidian distance and the angle metrics based on word frequencies. The theme classifications based on different metrics were summarized and compared in dendrograms using cluster analyses. The results agree with earlier classifications derived by human observation qualitatively. These methods further quantify the similarities among themes. These methods could be applied to the analyses of other animal songs on a larger scale. For instance, these techniques could be used to investigate song evolution and cultural transmission quantifying the dissimilarities of humpback whale songs across different seasons, years, populations, and geographic regions. [Work supported by SC Sea Grant, and Ilan County Government, Taiwan.

  16. Protein interface classification by evolutionary analysis

    PubMed Central

    2012-01-01

    Background Distinguishing biologically relevant interfaces from lattice contacts in protein crystals is a fundamental problem in structural biology. Despite efforts towards the computational prediction of interface character, many issues are still unresolved. Results We present here a protein-protein interface classifier that relies on evolutionary data to detect the biological character of interfaces. The classifier uses a simple geometric measure, number of core residues, and two evolutionary indicators based on the sequence entropy of homolog sequences. Both aim at detecting differential selection pressure between interface core and rim or rest of surface. The core residues, defined as fully buried residues (>95% burial), appear to be fundamental determinants of biological interfaces: their number is in itself a powerful discriminator of interface character and together with the evolutionary measures it is able to clearly distinguish evolved biological contacts from crystal ones. We demonstrate that this definition of core residues leads to distinctively better results than earlier definitions from the literature. The stringent selection and quality filtering of structural and sequence data was key to the success of the method. Most importantly we demonstrate that a more conservative selection of homolog sequences - with relatively high sequence identities to the query - is able to produce a clearer signal than previous attempts. Conclusions An evolutionary approach like the one presented here is key to the advancement of the field, which so far was missing an effective method exploiting the evolutionary character of protein interfaces. Its coverage and performance will only improve over time thanks to the incessant growth of sequence databases. Currently our method reaches an accuracy of 89% in classifying interfaces of the Ponstingl 2003 datasets and it lends itself to a variety of useful applications in structural biology and bioinformatics. We made the corresponding software implementation available to the community as an easy-to-use graphical web interface at http://www.eppic-web.org. PMID:23259833

  17. 32 CFR 2001.12 - Duration of classification.

    Code of Federal Regulations, 2011 CFR

    2011-07-01

    ... classification authority shall follow the sequence listed in paragraphs (a)(1)(i), (ii), and (iii) of this... to the sequence in paragraph (a)(1) of this section are as follows: (i) If an original classification... and shall be designated with the following marking, “50X1-HUM;” or (ii) If an original classification...

  18. 32 CFR 2001.12 - Duration of classification.

    Code of Federal Regulations, 2010 CFR

    2010-07-01

    ... classification authority shall follow the sequence listed in paragraphs (a)(1)(i), (ii), and (iii) of this... to the sequence in paragraph (a)(1) of this section are as follows: (i) If an original classification... and shall be designated with the following marking, “50X1-HUM;” or (ii) If an original classification...

  19. Preliminary Classification of Novel Hemorrhagic Fever-Causing Viruses Using Sequence-Based PAirwise Sequence Comparison (PASC) Analysis.

    PubMed

    Bào, Yīmíng; Kuhn, Jens H

    2018-01-01

    During the last decade, genome sequence-based classification of viruses has become increasingly prominent. Viruses can be even classified based on coding-complete genome sequence data alone. Nevertheless, classification remains arduous as experts are required to establish phylogenetic trees to depict the evolutionary relationships of such sequences for preliminary taxonomic placement. Pairwise sequence comparison (PASC) of genomes is one of several novel methods for establishing relationships among viruses. This method, provided by the US National Center for Biotechnology Information as an open-access tool, circumvents phylogenetics, and yet PASC results are often in agreement with those of phylogenetic analyses. Computationally inexpensive, PASC can be easily performed by non-taxonomists. Here we describe how to use the PASC tool for the preliminary classification of novel viral hemorrhagic fever-causing viruses.

  20. Comparing K-mer based methods for improved classification of 16S sequences.

    PubMed

    Vinje, Hilde; Liland, Kristian Hovde; Almøy, Trygve; Snipen, Lars

    2015-07-01

    The need for precise and stable taxonomic classification is highly relevant in modern microbiology. Parallel to the explosion in the amount of sequence data accessible, there has also been a shift in focus for classification methods. Previously, alignment-based methods were the most applicable tools. Now, methods based on counting K-mers by sliding windows are the most interesting classification approach with respect to both speed and accuracy. Here, we present a systematic comparison on five different K-mer based classification methods for the 16S rRNA gene. The methods differ from each other both in data usage and modelling strategies. We have based our study on the commonly known and well-used naïve Bayes classifier from the RDP project, and four other methods were implemented and tested on two different data sets, on full-length sequences as well as fragments of typical read-length. The difference in classification error obtained by the methods seemed to be small, but they were stable and for both data sets tested. The Preprocessed nearest-neighbour (PLSNN) method performed best for full-length 16S rRNA sequences, significantly better than the naïve Bayes RDP method. On fragmented sequences the naïve Bayes Multinomial method performed best, significantly better than all other methods. For both data sets explored, and on both full-length and fragmented sequences, all the five methods reached an error-plateau. We conclude that no K-mer based method is universally best for classifying both full-length sequences and fragments (reads). All methods approach an error plateau indicating improved training data is needed to improve classification from here. Classification errors occur most frequent for genera with few sequences present. For improving the taxonomy and testing new classification methods, the need for a better and more universal and robust training data set is crucial.

  1. Kinase Pathway Database: An Integrated Protein-Kinase and NLP-Based Protein-Interaction Resource

    PubMed Central

    Koike, Asako; Kobayashi, Yoshiyuki; Takagi, Toshihisa

    2003-01-01

    Protein kinases play a crucial role in the regulation of cellular functions. Various kinds of information about these molecules are important for understanding signaling pathways and organism characteristics. We have developed the Kinase Pathway Database, an integrated database involving major completely sequenced eukaryotes. It contains the classification of protein kinases and their functional conservation, ortholog tables among species, protein–protein, protein–gene, and protein–compound interaction data, domain information, and structural information. It also provides an automatic pathway graphic image interface. The protein, gene, and compound interactions are automatically extracted from abstracts for all genes and proteins by natural-language processing (NLP).The method of automatic extraction uses phrase patterns and the GENA protein, gene, and compound name dictionary, which was developed by our group. With this database, pathways are easily compared among species using data with more than 47,000 protein interactions and protein kinase ortholog tables. The database is available for querying and browsing at http://kinasedb.ontology.ims.u-tokyo.ac.jp/. PMID:12799355

  2. Complete genome sequence of Geobacillus thermoglucosidasius C56-YS93, a novel biomass degrader isolated from obsidian hot spring in Yellowstone National Park.

    PubMed

    Brumm, Phillip J; Land, Miriam L; Mead, David A

    2015-01-01

    Geobacillus thermoglucosidasius C56-YS93 was one of several thermophilic organisms isolated from Obsidian Hot Spring, Yellowstone National Park, Montana, USA under permit from the National Park Service. Comparison of 16 S rRNA sequences confirmed the classification of the strain as a G. thermoglucosidasius species. The genome was sequenced, assembled, and annotated by the DOE Joint Genome Institute and deposited at the NCBI in December 2011 (CP002835). The genome of G. thermoglucosidasius C56-YS93 consists of one circular chromosome of 3,893,306 bp and two circular plasmids of 80,849 and 19,638 bp and an average G + C content of 43.93 %. G. thermoglucosidasius C56-YS93 possesses a xylan degradation cluster not found in the other G. thermoglucosidasius sequenced strains. This cluster appears to be related to the xylan degradation cluster found in G. stearothermophilus. G. thermoglucosidasius C56-YS93 possesses two plasmids not found in the other two strains. One plasmid contains a novel gene cluster coding for proteins involved in proline degradation and metabolism, the other contains a collection of mostly hypothetical proteins.

  3. Molecular Characterization and Comparative Sequence Analysis of Defense-Related Gene, Oryza rufipogon Receptor-Like Protein Kinase 1

    PubMed Central

    Law, Yee-Song; Gudimella, Ranganath; Song, Beng-Kah; Ratnam, Wickneswari; Harikrishna, Jennifer Ann

    2012-01-01

    Many of the plant leucine rich repeat receptor-like kinases (LRR-RLKs) have been found to regulate signaling during plant defense processes. In this study, we selected and sequenced an LRR-RLK gene, designated as Oryza rufipogon receptor-like protein kinase 1 (OrufRPK1), located within yield QTL yld1.1 from the wild rice Oryza rufipogon (accession IRGC105491). A 2055 bp coding region and two exons were identified. Southern blotting determined OrufRPK1 to be a single copy gene. Sequence comparison with cultivated rice orthologs (OsI219RPK1, OsI9311RPK1 and OsJNipponRPK1, respectively derived from O. sativa ssp. indica cv. MR219, O. sativa ssp. indica cv. 9311 and O. sativa ssp. japonica cv. Nipponbare) revealed the presence of 12 single nucleotide polymorphisms (SNPs) with five non-synonymous substitutions, and 23 insertion/deletion sites. The biological role of the OrufRPK1 as a defense related LRR-RLK is proposed on the basis of cDNA sequence characterization, domain subfamily classification, structural prediction of extra cellular domains, cluster analysis and comparative gene expression. PMID:22942769

  4. Genome sequence of the Thermotoga thermarum type strain (LA3(T)) from an African solfataric spring.

    PubMed

    Göker, Markus; Spring, Stefan; Scheuner, Carmen; Anderson, Iain; Zeytun, Ahmet; Nolan, Matt; Lucas, Susan; Tice, Hope; Del Rio, Tijana Glavina; Cheng, Jan-Fang; Han, Cliff; Tapia, Roxanne; Goodwin, Lynne A; Pitluck, Sam; Liolios, Konstantinos; Mavromatis, Konstantinos; Pagani, Ioanna; Ivanova, Natalia; Mikhailova, Natalia; Pati, Amrita; Chen, Amy; Palaniappan, Krishna; Land, Miriam; Hauser, Loren; Chang, Yun-Juan; Jeffries, Cynthia D; Rohde, Manfred; Detter, John C; Woyke, Tanja; Bristow, James; Eisen, Jonathan A; Markowitz, Victor; Hugenholtz, Philip; Kyrpides, Nikos C; Klenk, Hans-Peter; Lapidus, Alla

    2014-06-15

    Thermotoga thermarum Windberger et al. 1989 is a member to the genomically well characterized genus Thermotoga in the phylum 'Thermotogae'. T. thermarum is of interest for its origin from a continental solfataric spring vs. predominantly marine oil reservoirs of other members of the genus. The genome of strain LA3T also provides fresh data for the phylogenomic positioning of the (hyper-)thermophilic bacteria. T. thermarum strain LA3(T) is the fourth sequenced genome of a type strain from the genus Thermotoga, and the sixth in the family Thermotogaceae to be formally described in a publication. Phylogenetic analyses do not reveal significant discrepancies between the current classification of the group, 16S rRNA gene data and whole-genome sequences. Nevertheless, T. thermarum significantly differs from other Thermotoga species regarding its iron-sulfur cluster synthesis, as it contains only a minimal set of the necessary proteins. Here we describe the features of this organism, together with the complete genome sequence and annotation. The 2,039,943 bp long chromosome with its 2,015 protein-coding and 51 RNA genes is a part of the Genomic Encyclopedia of Bacteria and Archaea project.

  5. Genome sequence of the Thermotoga thermarum type strain (LA 3 T) from an African solfataric spring

    DOE PAGES

    Goker, Markus; Spring, Stefan; Scheuner, Carmen; ...

    2014-06-15

    Thermotoga thermarum Windberger et al. 1989 is a member to the genomically well characterized genus Thermotoga in the phylum ' Thermotogae'. T. thermarum is of interest for its origin from a continental solfataric spring vs. predominantly marine oil reservoirs of other members of the genus. The genome of strain LA3T also provides fresh data for the phylogenomic positioning of the (hyper-)thermophilic bacteria. T. thermarum strain LA3 T is the fourth sequenced genome of a type strain from the genus Thermotoga, and the sixth in the family Thermotogaceae to be formally described in a publication. Phylogenetic analyses do not reveal significantmore » discrepancies between the current classification of the group, 16S rRNA gene data and whole-genome sequences. Nevertheless, T. thermarum significantly differs from other Thermotoga species regarding its iron-sulfur cluster synthesis, as it contains only a minimal set of the necessary proteins. Here we describe the features of this organism, together with the complete genome sequence and annotation. The 2,039,943 bp long chromosome with its 2,015 protein-coding and 51 RNA genes is a part of the Genomic Encyclopedia of Bacteria and Archaea project.« less

  6. Phylodynamic analysis and molecular diversity of the avian infectious bronchitis virus of chickens in Brazil.

    PubMed

    Fraga, Aline Padilha de; Gräf, Tiago; Pereira, Cleiton Schneider; Ikuta, Nilo; Fonseca, André Salvador Kazantzi; Lunge, Vagner Ricardo

    2018-07-01

    Avian infectious bronchitis virus (IBV) is the etiological agent of a highly contagious disease, which results in severe economic losses to the poultry industry. The spike protein (S1 subunit) is responsible for the molecular diversity of the virus and many sero/genotypes are described around the world. Recently a new standardized classification of the IBV molecular diversity was conducted, based on phylogenetic analysis of the S1 gene sequences sampled worldwide. Brazil is one of the biggest poultry producers in the world and the present study aimed to review the molecular diversity and reconstruct the evolutionary history of IBV in the country. All IBV S1 gene sequences, with local and year of collection information available on GenBank, were retrieved. Phylogenetic analyses were carried out based on a maximum likelihood method for the classification of genotypes occurring in Brazil, according to the new classification. Bayesian phylogenetic analyses were performed with the Brazilian clade and related international sequences to determine the evolutionary history of IBV in Brazil. A total of 143 Brazilian sequences were classified as GI-11 and 46 as GI-1 (Mass). Within the GI-11 clade, we have identified a potential recombinant strain circulating in Brazil. Phylodynamic analysis demonstrated that IBV GI-11 lineage was introduced in Brazil in the 1950s (1951, 1917-1975 95% HPD) and population dynamics was mostly constant throughout the time. Despite the national vaccination protocols, our results show the widespread dissemination and maintenance of the IBV GI-11 lineage in Brazil and highlight the importance of continuous surveillance to evaluate the impact of currently used vaccine strains on the observed viral diversity of the country. Copyright © 2018 Elsevier B.V. All rights reserved.

  7. De novo sequencing and analysis of the transcriptome of Panax ginseng in the leaf-expansion period.

    PubMed

    Liu, Shichao; Wang, Siming; Liu, Meichen; Yang, Fei; Zhang, Hui; Liu, Shiyang; Wang, Qun; Zhao, Yu

    2016-08-01

    Panax ginseng, a traditional Chinese medicine, is used worldwide for its variety of health benefits and its treatment efficacy. However, it is difficult to cultivate due to its vulnerability to environmental stresses. The present study provided the first report, to the best of our knowledge, of transcriptome analysis of ginseng at the leaf‑expansion stage. Using the Illumina sequencing platform, >40,000,000 high‑quality paired‑end reads were obtained and assembled into 100,533 unique sequences. When the sequences were searched against the publicly available National Center for Biotechnology Information protein database using The Basic Local Alignment Search Tool, 61,599 sequences exhibited similarity to known proteins. Functional annotation and classification, including use of the Gene Ontology, Clusters of Orthologous Groups, and Kyoto Encyclopedia of Genes and Genomes databases, revealed that the activated genes in ginseng were predominantly ribonuclease‑like storage genes, environmental stress genes, pathogenesis-related genes and other antioxidant genes. A number of candidate genes in environmental stress‑associated pathways were also identified. These novel data provide useful information on the growth and development stages of ginseng, and serve as an important public information platform for further understanding of the molecular mechanisms and functional genomics of ginseng.

  8. Optimizing taxonomic classification of marker-gene amplicon sequences with QIIME 2's q2-feature-classifier plugin.

    PubMed

    Bokulich, Nicholas A; Kaehler, Benjamin D; Rideout, Jai Ram; Dillon, Matthew; Bolyen, Evan; Knight, Rob; Huttley, Gavin A; Gregory Caporaso, J

    2018-05-17

    Taxonomic classification of marker-gene sequences is an important step in microbiome analysis. We present q2-feature-classifier ( https://github.com/qiime2/q2-feature-classifier ), a QIIME 2 plugin containing several novel machine-learning and alignment-based methods for taxonomy classification. We evaluated and optimized several commonly used classification methods implemented in QIIME 1 (RDP, BLAST, UCLUST, and SortMeRNA) and several new methods implemented in QIIME 2 (a scikit-learn naive Bayes machine-learning classifier, and alignment-based taxonomy consensus methods based on VSEARCH, and BLAST+) for classification of bacterial 16S rRNA and fungal ITS marker-gene amplicon sequence data. The naive-Bayes, BLAST+-based, and VSEARCH-based classifiers implemented in QIIME 2 meet or exceed the species-level accuracy of other commonly used methods designed for classification of marker gene sequences that were evaluated in this work. These evaluations, based on 19 mock communities and error-free sequence simulations, including classification of simulated "novel" marker-gene sequences, are available in our extensible benchmarking framework, tax-credit ( https://github.com/caporaso-lab/tax-credit-data ). Our results illustrate the importance of parameter tuning for optimizing classifier performance, and we make recommendations regarding parameter choices for these classifiers under a range of standard operating conditions. q2-feature-classifier and tax-credit are both free, open-source, BSD-licensed packages available on GitHub.

  9. Sequence and Structure Analysis of Distantly-Related Viruses Reveals Extensive Gene Transfer between Viruses and Hosts and among Viruses

    PubMed Central

    Caprari, Silvia; Metzler, Saskia; Lengauer, Thomas; Kalinina, Olga V.

    2015-01-01

    The origin and evolution of viruses is a subject of ongoing debate. In this study, we provide a full account of the evolutionary relationships between proteins of significant sequence and structural similarity found in viruses that belong to different classes according to the Baltimore classification. We show that such proteins can be found in viruses from all Baltimore classes. For protein families that include these proteins, we observe two patterns of the taxonomic spread. In the first pattern, they can be found in a large number of viruses from all implicated Baltimore classes. In the other pattern, the instances of the corresponding protein in species from each Baltimore class are restricted to a few compact clades. Proteins with the first pattern of distribution are products of so-called viral hallmark genes reported previously. Additionally, this pattern is displayed by the envelope glycoproteins from Flaviviridae and Bunyaviridae and helicases of superfamilies 1 and 2 that have homologs in cellular organisms. The second pattern can often be explained by horizontal gene transfer from the host or between viruses, an example being Orthomyxoviridae and Coronaviridae hemagglutinin esterases. Another facet of horizontal gene transfer comprises multiple independent introduction events of genes from cellular organisms into otherwise unrelated viruses. PMID:26492264

  10. Phylogenetic Characterization of Transport Protein Superfamilies: Superiority of SuperfamilyTree Programs over Those Based on Multiple Alignments

    PubMed Central

    Chen, Jonathan S.; Reddy, Vamsee; Chen, Joshua H.; Shlykov, Maksim A.; Zheng, Wei Hao; Cho, Jaehoon; Yen, Ming Ren; Saier, Milton H.

    2012-01-01

    Transport proteins function in the translocation of ions, solutes and macromolecules across cellular and organellar membranes. These integral membrane proteins fall into >600 families as tabulated in the Transporter Classification Database (www.tcdb.org). Recent studies, some of which are reported here, define distant phylogenetic relationships between families with the creation of superfamilies. Several of these are analyzed using a novel set of programs designed to allow reliable prediction of phylogenetic trees when sequence divergence is too great to allow the use of multiple alignments. These new programs, called SuperfamilyTree1 and 2 (SFT1 and 2), allow display of protein and family relationships, respectively, based on thousands of comparative BLAST scores rather than multiple alignments. Superfamilies analyzed include: (1) Aerolysins, (2) RTX Toxins, (3) Defensins, (4) Ion Transporters, (5) Bile/Arsenite/Riboflavin Transporters, (6) Cation: Proton Antiporters, and (7) the Glucose/Fructose/Lactose superfamily within the prokaryotic phosphoenol pyruvate-dependent Phosphotransferase System. In addition to defining the phylogenetic relationships of the proteins and families within these seven superfamilies, evidence is provided showing that the SFT programs outperform programs that are based on multiple alignments whenever sequence divergence of superfamily members is extensive. The SFT programs should be applicable to virtually any superfamily of proteins or nucleic acids. PMID:22286036

  11. 7TMRmine: a Web server for hierarchical mining of 7TMR proteins

    PubMed Central

    Lu, Guoqing; Wang, Zhifang; Jones, Alan M; Moriyama, Etsuko N

    2009-01-01

    Background Seven-transmembrane region-containing receptors (7TMRs) play central roles in eukaryotic signal transduction. Due to their biomedical importance, thorough mining of 7TMRs from diverse genomes has been an active target of bioinformatics and pharmacogenomics research. The need for new and accurate 7TMR/GPCR prediction tools is paramount with the accelerated rate of acquisition of diverse sequence information. Currently available and often used protein classification methods (e.g., profile hidden Markov Models) are highly accurate for identifying their membership information among already known 7TMR subfamilies. However, these alignment-based methods are less effective for identifying remote similarities, e.g., identifying proteins from highly divergent or possibly new 7TMR families. In this regard, more sensitive (e.g., alignment-free) methods are needed to complement the existing protein classification methods. A better strategy would be to combine different classifiers, from more specific to more sensitive methods, to identify a broader spectrum of 7TMR protein candidates. Description We developed a Web server, 7TMRmine, by integrating alignment-free and alignment-based classifiers specifically trained to identify candidate 7TMR proteins as well as transmembrane (TM) prediction methods. This new tool enables researchers to easily assess the distribution of GPCR functionality in diverse genomes or individual newly-discovered proteins. 7TMRmine is easily customized and facilitates exploratory analysis of diverse genomes. Users can integrate various alignment-based, alignment-free, and TM-prediction methods in any combination and in any hierarchical order. Sixteen classifiers (including two TM-prediction methods) are available on the 7TMRmine Web server. Not only can the 7TMRmine tool be used for 7TMR mining, but also for general TM-protein analysis. Users can submit protein sequences for analysis, or explore pre-analyzed results for multiple genomes. The server currently includes prediction results and the summary statistics for 68 genomes. Conclusion 7TMRmine facilitates the discovery of 7TMR proteins. By combining prediction results from different classifiers in a multi-level filtering process, prioritized sets of 7TMR candidates can be obtained for further investigation. 7TMRmine can be also used as a general TM-protein classifier. Comparisons of TM and 7TMR protein distributions among 68 genomes revealed interesting differences in evolution of these protein families among major eukaryotic phyla. PMID:19538753

  12. Classification of proteins: available structural space for molecular modeling.

    PubMed

    Andreeva, Antonina

    2012-01-01

    The wealth of available protein structural data provides unprecedented opportunity to study and better understand the underlying principles of protein folding and protein structure evolution. A key to achieving this lies in the ability to analyse these data and to organize them in a coherent classification scheme. Over the past years several protein classifications have been developed that aim to group proteins based on their structural relationships. Some of these classification schemes explore the concept of structural neighbourhood (structural continuum), whereas other utilize the notion of protein evolution and thus provide a discrete rather than continuum view of protein structure space. This chapter presents a strategy for classification of proteins with known three-dimensional structure. Steps in the classification process along with basic definitions are introduced. Examples illustrating some fundamental concepts of protein folding and evolution with a special focus on the exceptions to them are presented.

  13. The COG database: an updated version includes eukaryotes

    PubMed Central

    Tatusov, Roman L; Fedorova, Natalie D; Jackson, John D; Jacobs, Aviva R; Kiryutin, Boris; Koonin, Eugene V; Krylov, Dmitri M; Mazumder, Raja; Mekhedov, Sergei L; Nikolskaya, Anastasia N; Rao, B Sridhar; Smirnov, Sergei; Sverdlov, Alexander V; Vasudevan, Sona; Wolf, Yuri I; Yin, Jodie J; Natale, Darren A

    2003-01-01

    Background The availability of multiple, essentially complete genome sequences of prokaryotes and eukaryotes spurred both the demand and the opportunity for the construction of an evolutionary classification of genes from these genomes. Such a classification system based on orthologous relationships between genes appears to be a natural framework for comparative genomics and should facilitate both functional annotation of genomes and large-scale evolutionary studies. Results We describe here a major update of the previously developed system for delineation of Clusters of Orthologous Groups of proteins (COGs) from the sequenced genomes of prokaryotes and unicellular eukaryotes and the construction of clusters of predicted orthologs for 7 eukaryotic genomes, which we named KOGs after eukaryotic orthologous groups. The COG collection currently consists of 138,458 proteins, which form 4873 COGs and comprise 75% of the 185,505 (predicted) proteins encoded in 66 genomes of unicellular organisms. The eukaryotic orthologous groups (KOGs) include proteins from 7 eukaryotic genomes: three animals (the nematode Caenorhabditis elegans, the fruit fly Drosophila melanogaster and Homo sapiens), one plant, Arabidopsis thaliana, two fungi (Saccharomyces cerevisiae and Schizosaccharomyces pombe), and the intracellular microsporidian parasite Encephalitozoon cuniculi. The current KOG set consists of 4852 clusters of orthologs, which include 59,838 proteins, or ~54% of the analyzed eukaryotic 110,655 gene products. Compared to the coverage of the prokaryotic genomes with COGs, a considerably smaller fraction of eukaryotic genes could be included into the KOGs; addition of new eukaryotic genomes is expected to result in substantial increase in the coverage of eukaryotic genomes with KOGs. Examination of the phyletic patterns of KOGs reveals a conserved core represented in all analyzed species and consisting of ~20% of the KOG set. This conserved portion of the KOG set is much greater than the ubiquitous portion of the COG set (~1% of the COGs). In part, this difference is probably due to the small number of included eukaryotic genomes, but it could also reflect the relative compactness of eukaryotes as a clade and the greater evolutionary stability of eukaryotic genomes. Conclusion The updated collection of orthologous protein sets for prokaryotes and eukaryotes is expected to be a useful platform for functional annotation of newly sequenced genomes, including those of complex eukaryotes, and genome-wide evolutionary studies. PMID:12969510

  14. Implementation of Objective PASC-Derived Taxon Demarcation Criteria for Official Classification of Filoviruses.

    PubMed

    Bào, Yīmíng; Amarasinghe, Gaya K; Basler, Christopher F; Bavari, Sina; Bukreyev, Alexander; Chandran, Kartik; Dolnik, Olga; Dye, John M; Ebihara, Hideki; Formenty, Pierre; Hewson, Roger; Kobinger, Gary P; Leroy, Eric M; Mühlberger, Elke; Netesov, Sergey V; Patterson, Jean L; Paweska, Janusz T; Smither, Sophie J; Takada, Ayato; Towner, Jonathan S; Volchkov, Viktor E; Wahl-Jensen, Victoria; Kuhn, Jens H

    2017-05-11

    The mononegaviral family Filoviridae has eight members assigned to three genera and seven species. Until now, genus and species demarcation were based on arbitrarily chosen filovirus genome sequence divergence values (≈50% for genera, ≈30% for species) and arbitrarily chosen phenotypic virus or virion characteristics. Here we report filovirus genome sequence-based taxon demarcation criteria using the publicly accessible PAirwise Sequencing Comparison (PASC) tool of the US National Center for Biotechnology Information (Bethesda, MD, USA). Comparison of all available filovirus genomes in GenBank using PASC revealed optimal genus demarcation at the 55-58% sequence diversity threshold range for genera and at the 23-36% sequence diversity threshold range for species. Because these thresholds do not change the current official filovirus classification, these values are now implemented as filovirus taxon demarcation criteria that may solely be used for filovirus classification in case additional data are absent. A near-complete, coding-complete, or complete filovirus genome sequence will now be required to allow official classification of any novel "filovirus." Classification of filoviruses into existing taxa or determining the need for novel taxa is now straightforward and could even become automated using a presented algorithm/flowchart rooted in RefSeq (type) sequences.

  15. Prediction of small molecule binding property of protein domains with Bayesian classifiers based on Markov chains.

    PubMed

    Bulashevska, Alla; Stein, Martin; Jackson, David; Eils, Roland

    2009-12-01

    Accurate computational methods that can help to predict biological function of a protein from its sequence are of great interest to research biologists and pharmaceutical companies. One approach to assume the function of proteins is to predict the interactions between proteins and other molecules. In this work, we propose a machine learning method that uses a primary sequence of a domain to predict its propensity for interaction with small molecules. By curating the Pfam database with respect to the small molecule binding ability of its component domains, we have constructed a dataset of small molecule binding and non-binding domains. This dataset was then used as training set to learn a Bayesian classifier, which should distinguish members of each class. The domain sequences of both classes are modelled with Markov chains. In a Jack-knife test, our classification procedure achieved the predictive accuracies of 77.2% and 66.7% for binding and non-binding classes respectively. We demonstrate the applicability of our classifier by using it to identify previously unknown small molecule binding domains. Our predictions are available as supplementary material and can provide very useful information to drug discovery specialists. Given the ubiquitous and essential role small molecules play in biological processes, our method is important for identifying pharmaceutically relevant components of complete proteomes. The software is available from the author upon request.

  16. Methods and statistics for combining motif match scores.

    PubMed

    Bailey, T L; Gribskov, M

    1998-01-01

    Position-specific scoring matrices are useful for representing and searching for protein sequence motifs. A sequence family can often be described by a group of one or more motifs, and an effective search must combine the scores for matching a sequence to each of the motifs in the group. We describe three methods for combining match scores and estimating the statistical significance of the combined scores and evaluate the search quality (classification accuracy) and the accuracy of the estimate of statistical significance of each. The three methods are: 1) sum of scores, 2) sum of reduced variates, 3) product of score p-values. We show that method 3) is superior to the other two methods in both regards, and that combining motif scores indeed gives better search accuracy. The MAST sequence homology search algorithm utilizing the product of p-values scoring method is available for interactive use and downloading at URL http:/(/)www.sdsc.edu/MEME.

  17. A nearest neighbor approach for automated transporter prediction and categorization from protein sequences.

    PubMed

    Li, Haiquan; Dai, Xinbin; Zhao, Xuechun

    2008-05-01

    Membrane transport proteins play a crucial role in the import and export of ions, small molecules or macromolecules across biological membranes. Currently, there are a limited number of published computational tools which enable the systematic discovery and categorization of transporters prior to costly experimental validation. To approach this problem, we utilized a nearest neighbor method which seamlessly integrates homologous search and topological analysis into a machine-learning framework. Our approach satisfactorily distinguished 484 transporter families in the Transporter Classification Database, a curated and representative database for transporters. A five-fold cross-validation on the database achieved a positive classification rate of 72.3% on average. Furthermore, this method successfully detected transporters in seven model and four non-model organisms, ranging from archaean to mammalian species. A preliminary literature-based validation has cross-validated 65.8% of our predictions on the 11 organisms, including 55.9% of our predictions overlapping with 83.6% of the predicted transporters in TransportDB.

  18. Draft genome sequence for virulent and avirulent strains of Xanthomonas arboricola isolated from Prunus spp. in Spain.

    PubMed

    Garita-Cambronero, Jerson; Palacio-Bielsa, Ana; López, María M; Cubero, Jaime

    2016-01-01

    Xanthomonas arboricola is a species in genus Xanthomonas which is mainly comprised of plant pathogens. Among the members of this taxon, X. arboricola pv. pruni, the causal agent of bacterial spot disease of stone fruits and almond, is distributed worldwide although it is considered a quarantine pathogen in the European Union. Herein, we report the draft genome sequence, the classification, the annotation and the sequence analyses of a virulent strain, IVIA 2626.1, and an avirulent strain, CITA 44, of X. arboricola associated with Prunus spp. The draft genome sequence of IVIA 2626.1 consists of 5,027,671 bp, 4,720 protein coding genes and 50 RNA encoding genes. The draft genome sequence of strain CITA 44 consists of 4,760,482 bp, 4,250 protein coding genes and 56 RNA coding genes. Initial comparative analyses reveals differences in the presence of structural and regulatory components of the type IV pilus, the type III secretion system, the type III effectors as well as variations in the number of the type IV secretion systems. The genome sequence data for these strains will facilitate the development of molecular diagnostics protocols that differentiate virulent and avirulent strains. In addition, comparative genome analysis will provide insights into the plant-pathogen interaction during the bacterial spot disease process.

  19. A novel strategy for the determination of a rhabdovirus genome and its application to sequencing of Eggplant mottled dwarf virus.

    PubMed

    Pappi, Polyxeni G; Dovas, Chrysostomos I; Efthimiou, Konstantinos E; Maliogka, Varvara I; Katis, Nikolaos I

    2013-08-01

    A novel strategy employing the rhabdovirus untranslated conserved intergenic regions was developed and applied successfully for the determination of the complete nucleotide sequence of Eggplant mottled dwarf virus (EMDV). The EMDV genome contains seven open reading frames with the same organization as Potato yellow dwarf virus (PYDV), the type species of the genus Nucleorhabdovirus. These two species encode five core genes [nucleocapsid (N), phosphoprotein (P), matrix (M), glycoprotein (G), and the polymerase (L)] like other viruses of the genus and an additional one (X), located between N and P, giving rise to a protein with currently unknown function. Furthermore, both EMDV and PYDV contain a gene (Y), inserted between P and M, which probably encodes the virus movement protein, in concordance with the rest of the plant-infecting rhabdoviruses. Phylogenetic analysis of the polymerase gene confirmed the classification of EMDV within the genus Nucleorhabdovirus and showed a close evolutionary relationship to PYDV. The novel sequencing strategy developed is a useful tool for the genome determination of yet uncharacterized rhabdoviruses.

  20. PhosphoregDB: The tissue and sub-cellular distribution of mammalian protein kinases and phosphatases

    PubMed Central

    Forrest, Alistair RR; Taylor, Darrin F; Fink, J Lynn; Gongora, M Milena; Flegg, Cameron; Teasdale, Rohan D; Suzuki, Harukazu; Kanamori, Mutsumi; Kai, Chikatoshi; Hayashizaki, Yoshihide; Grimmond, Sean M

    2006-01-01

    Background Protein kinases and protein phosphatases are the fundamental components of phosphorylation dependent protein regulatory systems. We have created a database for the protein kinase-like and phosphatase-like loci of mouse that integrates protein sequence, interaction, classification and pathway information with the results of a systematic screen of their sub-cellular localization and tissue specific expression data mined from the GNF tissue atlas of mouse. Results The database lets users query where a specific kinase or phosphatase is expressed at both the tissue and sub-cellular levels. Similarly the interface allows the user to query by tissue, pathway or sub-cellular localization, to reveal which components are co-expressed or co-localized. A review of their expression reveals 30% of these components are detected in all tissues tested while 70% show some level of tissue restriction. Hierarchical clustering of the expression data reveals that expression of these genes can be used to separate the samples into tissues of related lineage, including 3 larger clusters of nervous tissue, developing embryo and cells of the immune system. By overlaying the expression, sub-cellular localization and classification data we examine correlations between class, specificity and tissue restriction and show that tyrosine kinases are more generally expressed in fewer tissues than serine/threonine kinases. Conclusion Together these data demonstrate that cell type specific systems exist to regulate protein phosphorylation and that for accurate modelling and for determination of enzyme substrate relationships the co-location of components needs to be considered. PMID:16504016

  1. Probabilistic topic modeling for the analysis and classification of genomic sequences

    PubMed Central

    2015-01-01

    Background Studies on genomic sequences for classification and taxonomic identification have a leading role in the biomedical field and in the analysis of biodiversity. These studies are focusing on the so-called barcode genes, representing a well defined region of the whole genome. Recently, alignment-free techniques are gaining more importance because they are able to overcome the drawbacks of sequence alignment techniques. In this paper a new alignment-free method for DNA sequences clustering and classification is proposed. The method is based on k-mers representation and text mining techniques. Methods The presented method is based on Probabilistic Topic Modeling, a statistical technique originally proposed for text documents. Probabilistic topic models are able to find in a document corpus the topics (recurrent themes) characterizing classes of documents. This technique, applied on DNA sequences representing the documents, exploits the frequency of fixed-length k-mers and builds a generative model for a training group of sequences. This generative model, obtained through the Latent Dirichlet Allocation (LDA) algorithm, is then used to classify a large set of genomic sequences. Results and conclusions We performed classification of over 7000 16S DNA barcode sequences taken from Ribosomal Database Project (RDP) repository, training probabilistic topic models. The proposed method is compared to the RDP tool and Support Vector Machine (SVM) classification algorithm in a extensive set of trials using both complete sequences and short sequence snippets (from 400 bp to 25 bp). Our method reaches very similar results to RDP classifier and SVM for complete sequences. The most interesting results are obtained when short sequence snippets are considered. In these conditions the proposed method outperforms RDP and SVM with ultra short sequences and it exhibits a smooth decrease of performance, at every taxonomic level, when the sequence length is decreased. PMID:25916734

  2. MultiSeq: unifying sequence and structure data for evolutionary analysis

    PubMed Central

    Roberts, Elijah; Eargle, John; Wright, Dan; Luthey-Schulten, Zaida

    2006-01-01

    Background Since the publication of the first draft of the human genome in 2000, bioinformatic data have been accumulating at an overwhelming pace. Currently, more than 3 million sequences and 35 thousand structures of proteins and nucleic acids are available in public databases. Finding correlations in and between these data to answer critical research questions is extremely challenging. This problem needs to be approached from several directions: information science to organize and search the data; information visualization to assist in recognizing correlations; mathematics to formulate statistical inferences; and biology to analyze chemical and physical properties in terms of sequence and structure changes. Results Here we present MultiSeq, a unified bioinformatics analysis environment that allows one to organize, display, align and analyze both sequence and structure data for proteins and nucleic acids. While special emphasis is placed on analyzing the data within the framework of evolutionary biology, the environment is also flexible enough to accommodate other usage patterns. The evolutionary approach is supported by the use of predefined metadata, adherence to standard ontological mappings, and the ability for the user to adjust these classifications using an electronic notebook. MultiSeq contains a new algorithm to generate complete evolutionary profiles that represent the topology of the molecular phylogenetic tree of a homologous group of distantly related proteins. The method, based on the multidimensional QR factorization of multiple sequence and structure alignments, removes redundancy from the alignments and orders the protein sequences by increasing linear dependence, resulting in the identification of a minimal basis set of sequences that spans the evolutionary space of the homologous group of proteins. Conclusion MultiSeq is a major extension of the Multiple Alignment tool that is provided as part of VMD, a structural visualization program for analyzing molecular dynamics simulations. Both are freely distributed by the NIH Resource for Macromolecular Modeling and Bioinformatics and MultiSeq is included with VMD starting with version 1.8.5. The MultiSeq website has details on how to download and use the software: PMID:16914055

  3. ESTuber db: an online database for Tuber borchii EST sequences.

    PubMed

    Lazzari, Barbara; Caprera, Andrea; Cosentino, Cristian; Stella, Alessandra; Milanesi, Luciano; Viotti, Angelo

    2007-03-08

    The ESTuber database (http://www.itb.cnr.it/estuber) includes 3,271 Tuber borchii expressed sequence tags (EST). The dataset consists of 2,389 sequences from an in-house prepared cDNA library from truffle vegetative hyphae, and 882 sequences downloaded from GenBank and representing four libraries from white truffle mycelia and ascocarps at different developmental stages. An automated pipeline was prepared to process EST sequences using public software integrated by in-house developed Perl scripts. Data were collected in a MySQL database, which can be queried via a php-based web interface. Sequences included in the ESTuber db were clustered and annotated against three databases: the GenBank nr database, the UniProtKB database and a third in-house prepared database of fungi genomic sequences. An algorithm was implemented to infer statistical classification among Gene Ontology categories from the ontology occurrences deduced from the annotation procedure against the UniProtKB database. Ontologies were also deduced from the annotation of more than 130,000 EST sequences from five filamentous fungi, for intra-species comparison purposes. Further analyses were performed on the ESTuber db dataset, including tandem repeats search and comparison of the putative protein dataset inferred from the EST sequences to the PROSITE database for protein patterns identification. All the analyses were performed both on the complete sequence dataset and on the contig consensus sequences generated by the EST assembly procedure. The resulting web site is a resource of data and links related to truffle expressed genes. The Sequence Report and Contig Report pages are the web interface core structures which, together with the Text search utility and the Blast utility, allow easy access to the data stored in the database.

  4. Molecular Cloning and Characterization of G Alpha Proteins from the Western Tarnished Plant Bug, Lygus hesperus

    PubMed Central

    Hull, J. Joe; Wang, Meixian

    2014-01-01

    The Gα subunits of heterotrimeric G proteins play critical roles in the activation of diverse signal transduction cascades. However, the role of these genes in chemosensation remains to be fully elucidated. To initiate a comprehensive survey of signal transduction genes, we used homology-based cloning methods and transcriptome data mining to identity Gα subunits in the western tarnished plant bug (Lygus hesperus Knight). Among the nine sequences identified were single variants of the Gαi, Gαo, Gαs, and Gα12 subfamilies and five alternative splice variants of the Gαq subfamily. Sequence alignment and phylogenetic analyses of the putative L. hesperus Gα subunits support initial classifications and are consistent with established evolutionary relationships. End-point PCR-based profiling of the transcripts indicated head specific expression for LhGαq4, and largely ubiquitous expression, albeit at varying levels, for the other LhGα transcripts. All subfamilies were amplified from L. hesperus chemosensory tissues, suggesting potential roles in olfaction and/or gustation. Immunohistochemical staining of cultured insect cells transiently expressing recombinant His-tagged LhGαi, LhGαs, and LhGαq1 revealed plasma membrane targeting, suggesting the respective sequences encode functional G protein subunits. PMID:26463065

  5. RepeatsDB-lite: a web server for unit annotation of tandem repeat proteins.

    PubMed

    Hirsh, Layla; Paladin, Lisanna; Piovesan, Damiano; Tosatto, Silvio C E

    2018-05-09

    RepeatsDB-lite (http://protein.bio.unipd.it/repeatsdb-lite) is a web server for the prediction of repetitive structural elements and units in tandem repeat (TR) proteins. TRs are a widespread but poorly annotated class of non-globular proteins carrying heterogeneous functions. RepeatsDB-lite extends the prediction to all TR types and strongly improves the performance both in terms of computational time and accuracy over previous methods, with precision above 95% for solenoid structures. The algorithm exploits an improved TR unit library derived from the RepeatsDB database to perform an iterative structural search and assignment. The web interface provides tools for analyzing the evolutionary relationships between units and manually refine the prediction by changing unit positions and protein classification. An all-against-all structure-based sequence similarity matrix is calculated and visualized in real-time for every user edit. Reviewed predictions can be submitted to RepeatsDB for review and inclusion.

  6. Self-Organizing Hidden Markov Model Map (SOHMMM).

    PubMed

    Ferles, Christos; Stafylopatis, Andreas

    2013-12-01

    A hybrid approach combining the Self-Organizing Map (SOM) and the Hidden Markov Model (HMM) is presented. The Self-Organizing Hidden Markov Model Map (SOHMMM) establishes a cross-section between the theoretic foundations and algorithmic realizations of its constituents. The respective architectures and learning methodologies are fused in an attempt to meet the increasing requirements imposed by the properties of deoxyribonucleic acid (DNA), ribonucleic acid (RNA), and protein chain molecules. The fusion and synergy of the SOM unsupervised training and the HMM dynamic programming algorithms bring forth a novel on-line gradient descent unsupervised learning algorithm, which is fully integrated into the SOHMMM. Since the SOHMMM carries out probabilistic sequence analysis with little or no prior knowledge, it can have a variety of applications in clustering, dimensionality reduction and visualization of large-scale sequence spaces, and also, in sequence discrimination, search and classification. Two series of experiments based on artificial sequence data and splice junction gene sequences demonstrate the SOHMMM's characteristics and capabilities. Copyright © 2013 Elsevier Ltd. All rights reserved.

  7. Integrated Proteomic and Transcriptomic-Based Approaches to Identifying Signature Biomarkers and Pathways for Elucidation of Daoy and UW228 Subtypes.

    PubMed

    Higdon, Roger; Kala, Jessie; Wilkins, Devan; Yan, Julia Fangfei; Sethi, Manveen K; Lin, Liang; Liu, Siqi; Montague, Elizabeth; Janko, Imre; Choiniere, John; Kolker, Natali; Hancock, William S; Kolker, Eugene; Fanayan, Susan

    2017-02-03

    Medulloblastoma (MB) is the most common malignant pediatric brain tumor. Patient survival has remained largely the same for the past 20 years, with therapies causing significant health, cognitive, behavioral and developmental complications for those who survive the tumor. In this study, we profiled the total transcriptome and proteome of two established MB cell lines, Daoy and UW228, using high-throughput RNA sequencing (RNA-Seq) and label-free nano-LC-MS/MS-based quantitative proteomics, coupled with advanced pathway analysis. While Daoy has been suggested to belong to the sonic hedgehog (SHH) subtype, the exact UW228 subtype is not yet clearly established. Thus, a goal of this study was to identify protein markers and pathways that would help elucidate their subtype classification. A number of differentially expressed genes and proteins, including a number of adhesion, cytoskeletal and signaling molecules, were observed between the two cell lines. While several cancer-associated genes/proteins exhibited similar expression across the two cell lines, upregulation of a number of signature proteins and enrichment of key components of SHH and WNT signaling pathways were uniquely observed in Daoy and UW228, respectively. The novel information on differentially expressed genes/proteins and enriched pathways provide insights into the biology of MB, which could help elucidate their subtype classification.

  8. A 3D sequence-independent representation of the protein data bank.

    PubMed

    Fischer, D; Tsai, C J; Nussinov, R; Wolfson, H

    1995-10-01

    Here we address the following questions. How many structurally different entries are there in the Protein Data Bank (PDB)? How do the proteins populate the structural universe? To investigate these questions a structurally non-redundant set of representative entries was selected from the PDB. Construction of such a dataset is not trivial: (i) the considerable size of the PDB requires a large number of comparisons (there were more than 3250 structures of protein chains available in May 1994); (ii) the PDB is highly redundant, containing many structurally similar entries, not necessarily with significant sequence homology, and (iii) there is no clear-cut definition of structural similarity. The latter depend on the criteria and methods used. Here, we analyze structural similarity ignoring protein topology. To date, representative sets have been selected either by hand, by sequence comparison techniques which ignore the three-dimensional (3D) structures of the proteins or by using sequence comparisons followed by linear structural comparison (i.e. the topology, or the sequential order of the chains, is enforced in the structural comparison). Here we describe a 3D sequence-independent automated and efficient method to obtain a representative set of protein molecules from the PDB which contains all unique structures and which is structurally non-redundant. The method has two novel features. The first is the use of strictly structural criteria in the selection process without taking into account the sequence information. To this end we employ a fast structural comparison algorithm which requires on average approximately 2 s per pairwise comparison on a workstation. The second novel feature is the iterative application of a heuristic clustering algorithm that greatly reduces the number of comparisons required. We obtain a representative set of 220 chains with resolution better than 3.0 A, or 268 chains including lower resolution entries, NMR entries and models. The resulting set can serve as a basis for extensive structural classification and studies of 3D recurring motifs and of sequence-structure relationships. The clustering algorithm succeeds in classifying into the same structural family chains with no significant sequence homology, e.g. all the globins in one single group, all the trypsin-like serine proteases in another or all the immunoglobulin-like folds into a third. In addition, unexpected structural similarities of interest have been automatically detected between pairs of chains. A cluster analysis of the representative structures demonstrates the way the "structural universe' is populated.

  9. Phylogenetic and bioinformatic analysis of gap junction-related proteins, innexins, pannexins and connexins.

    PubMed

    Fushiki, Daisuke; Hamada, Yasuo; Yoshimura, Ryoichi; Endo, Yasuhisa

    2010-04-01

    All multi-cellular animals, including hydra, insects and vertebrates, develop gap junctions, which communicate directly with neighboring cells. Gap junctions consist of protein families called connexins in vertebrates and innexins in invertebrates. Connexins and innexins have no homology in their amino acid sequence, but both are thought to have some similar characteristics, such as a tetra-membrane-spanning structure, formation of a channel by hexamer, and transmission of small molecules (e.g. ions) to neighboring cells. Pannexins were recently identified as a homolog of innexins in vertebrate genomes. Although pannexins are thought to share the function of intercellular communication with connexins and innexins, there is little information about the relationship among these three protein families of gap junctions. We phylgenetically and bioinformatically examined these protein families and other tetra-membrane-spanning proteins using a database and three analytical softwares. The clades formed by pannexin families do not belong to the species classification but do to paralogs of each member of pannexins. Amino acid sequences of pannexins are closely related to those of innexins but less to those of connexins. These data suggest that innexins and pannexins have a common origin, but the relationship between innexins/pannexins and connexins is as slight as that of other tetra-membrane-spanning members.

  10. Proteomic analysis of Toxocara canis excretory and secretory (TES) proteins.

    PubMed

    Sperotto, Rita Leal; Kremer, Frederico Schmitt; Aires Berne, Maria Elisabeth; Costa de Avila, Luciana F; da Silva Pinto, Luciano; Monteiro, Karina Mariante; Caumo, Karin Silva; Ferreira, Henrique Bunselmeyer; Berne, Natália; Borsuk, Sibele

    2017-01-01

    Toxocariasis is a neglected disease, and its main etiological agent is the nematode Toxocara canis. Serological diagnosis is performed by an enzyme-linked immunosorbent assay using T. canis excretory and secretory (TES) antigens produced by in vitro cultivation of larvae. Identification of TES proteins can be useful for the development of new diagnostic strategies since few TES components have been described so far. Herein, we report the results obtained by proteomic analysis of TES proteins using a liquid chromatography-tandem mass spectrometry (LC-MS/MS) approach. TES fractions were separated by one-dimensional SDS-PAGE and analyzed by LC-MS/MS. The MS/MS spectra were compared with a database of protein sequences deduced from the genome sequence of T. canis, and a total of 19 proteins were identified. Classification according to the signal peptide prediction using the SignalP server showed that seven of the identified proteins were extracellular, 10 had cytoplasmic or nuclear localization, while the subcellular localization of two proteins was unknown. Analysis of molecular functions by BLAST2GO showed that the majority of the gene ontology (GO) terms associated with the proteins present in the TES sample were associated with binding functions, including but not limited to protein binding (GO:0005515), inorganic ion binding (GO:0043167), and organic cyclic compound binding (GO:0097159). This study provides additional information about the exoproteome of T. canis, which can lead to the development of new strategies for diagnostics or vaccination. Copyright © 2016 Elsevier B.V. All rights reserved.

  11. Feature-based classification of amino acid substitutions outside conserved functional protein domains.

    PubMed

    Gemovic, Branislava; Perovic, Vladimir; Glisic, Sanja; Veljkovic, Nevena

    2013-01-01

    There are more than 500 amino acid substitutions in each human genome, and bioinformatics tools irreplaceably contribute to determination of their functional effects. We have developed feature-based algorithm for the detection of mutations outside conserved functional domains (CFDs) and compared its classification efficacy with the most commonly used phylogeny-based tools, PolyPhen-2 and SIFT. The new algorithm is based on the informational spectrum method (ISM), a feature-based technique, and statistical analysis. Our dataset contained neutral polymorphisms and mutations associated with myeloid malignancies from epigenetic regulators ASXL1, DNMT3A, EZH2, and TET2. PolyPhen-2 and SIFT had significantly lower accuracies in predicting the effects of amino acid substitutions outside CFDs than expected, with especially low sensitivity. On the other hand, only ISM algorithm showed statistically significant classification of these sequences. It outperformed PolyPhen-2 and SIFT by 15% and 13%, respectively. These results suggest that feature-based methods, like ISM, are more suitable for the classification of amino acid substitutions outside CFDs than phylogeny-based tools.

  12. Molecular characterization of African orthobunyaviruses.

    PubMed

    Yandoko, E Nakouné; Gribaldo, S; Finance, C; Le Faou, A; Rihn, B H

    2007-06-01

    The genus Orthobunyavirus is composed of segmented, negative-sense RNA viruses that are responsible for mild to severe human diseases. To date, no molecular studies of bunyaviruses in the genus Orthobunyavirus from central Africa have been reported, and their classification relies on serological testing. Four new primer pairs for RT-PCR amplification and sequencing of the complete genomic small (S) RNA segments of 10 orthobunyaviruses isolated from the Central African Republic and pertaining to five different serogroups have been designed and evaluated. Phylogenetic analysis showed that these 10 viruses belong to the Bunyamwera serogroup. The S segment sequences differ from those of the Bunyamwera virus reference strain by 5-15 % at the nucleotide level, and both overlapping reading frames, encoding the nucleocapsid (N) and non-structural (NS) proteins, were evident in sequenced genomes. This study should improve diagnosis and surveillance of African bunyaviruses.

  13. A Partial Least Squares Based Procedure for Upstream Sequence Classification in Prokaryotes.

    PubMed

    Mehmood, Tahir; Bohlin, Jon; Snipen, Lars

    2015-01-01

    The upstream region of coding genes is important for several reasons, for instance locating transcription factor, binding sites, and start site initiation in genomic DNA. Motivated by a recently conducted study, where multivariate approach was successfully applied to coding sequence modeling, we have introduced a partial least squares (PLS) based procedure for the classification of true upstream prokaryotic sequence from background upstream sequence. The upstream sequences of conserved coding genes over genomes were considered in analysis, where conserved coding genes were found by using pan-genomics concept for each considered prokaryotic species. PLS uses position specific scoring matrix (PSSM) to study the characteristics of upstream region. Results obtained by PLS based method were compared with Gini importance of random forest (RF) and support vector machine (SVM), which is much used method for sequence classification. The upstream sequence classification performance was evaluated by using cross validation, and suggested approach identifies prokaryotic upstream region significantly better to RF (p-value < 0.01) and SVM (p-value < 0.01). Further, the proposed method also produced results that concurred with known biological characteristics of the upstream region.

  14. Extension of the classical classification of β-turns

    PubMed Central

    de Brevern, Alexandre G.

    2016-01-01

    The functional properties of a protein primarily depend on its three-dimensional (3D) structure. These properties have classically been assigned, visualized and analysed on the basis of protein secondary structures. The β-turn is the third most important secondary structure after helices and β-strands. β-turns have been classified according to the values of the dihedral angles φ and ψ of the central residue. Conventionally, eight different types of β-turns have been defined, whereas those that cannot be defined are classified as type IV β-turns. This classification remains the most widely used. Nonetheless, the miscellaneous type IV β-turns represent 1/3rd of β-turn residues. An unsupervised specific clustering approach was designed to search for recurrent new turns in the type IV category. The classical rules of β-turn type assignment were central to the approach. The four most frequently occurring clusters defined the new β-turn types. Unexpectedly, these types, designated IV1, IV2, IV3 and IV4, represent half of the type IV β-turns and occur more frequently than many of the previously established types. These types show convincing particularities, in terms of both structures and sequences that allow for the classical β-turn classification to be extended for the first time in 25 years. PMID:27627963

  15. Extension of the classical classification of β-turns.

    PubMed

    de Brevern, Alexandre G

    2016-09-15

    The functional properties of a protein primarily depend on its three-dimensional (3D) structure. These properties have classically been assigned, visualized and analysed on the basis of protein secondary structures. The β-turn is the third most important secondary structure after helices and β-strands. β-turns have been classified according to the values of the dihedral angles φ and ψ of the central residue. Conventionally, eight different types of β-turns have been defined, whereas those that cannot be defined are classified as type IV β-turns. This classification remains the most widely used. Nonetheless, the miscellaneous type IV β-turns represent 1/3(rd) of β-turn residues. An unsupervised specific clustering approach was designed to search for recurrent new turns in the type IV category. The classical rules of β-turn type assignment were central to the approach. The four most frequently occurring clusters defined the new β-turn types. Unexpectedly, these types, designated IV1, IV2, IV3 and IV4, represent half of the type IV β-turns and occur more frequently than many of the previously established types. These types show convincing particularities, in terms of both structures and sequences that allow for the classical β-turn classification to be extended for the first time in 25 years.

  16. The Classification and Evolution of Enzyme Function

    PubMed Central

    Martínez Cuesta, Sergio; Rahman, Syed Asad; Furnham, Nicholas; Thornton, Janet M.

    2015-01-01

    Enzymes are the proteins responsible for the catalysis of life. Enzymes sharing a common ancestor as defined by sequence and structure similarity are grouped into families and superfamilies. The molecular function of enzymes is defined as their ability to catalyze biochemical reactions; it is manually classified by the Enzyme Commission and robust approaches to quantitatively compare catalytic reactions are just beginning to appear. Here, we present an overview of studies at the interface of the evolution and function of enzymes. PMID:25986631

  17. MicRhoDE: a curated database for the analysis of microbial rhodopsin diversity and evolution

    PubMed Central

    Boeuf, Dominique; Audic, Stéphane; Brillet-Guéguen, Loraine; Caron, Christophe; Jeanthon, Christian

    2015-01-01

    Microbial rhodopsins are a diverse group of photoactive transmembrane proteins found in all three domains of life and in viruses. Today, microbial rhodopsin research is a flourishing research field in which new understandings of rhodopsin diversity, function and evolution are contributing to broader microbiological and molecular knowledge. Here, we describe MicRhoDE, a comprehensive, high-quality and freely accessible database that facilitates analysis of the diversity and evolution of microbial rhodopsins. Rhodopsin sequences isolated from a vast array of marine and terrestrial environments were manually collected and curated. To each rhodopsin sequence are associated related metadata, including predicted spectral tuning of the protein, putative activity and function, taxonomy for sequences that can be linked to a 16S rRNA gene, sampling date and location, and supporting literature. The database currently covers 7857 aligned sequences from more than 450 environmental samples or organisms. Based on a robust phylogenetic analysis, we introduce an operational classification system with multiple phylogenetic levels ranging from superclusters to species-level operational taxonomic units. An integrated pipeline for online sequence alignment and phylogenetic tree construction is also provided. With a user-friendly interface and integrated online bioinformatics tools, this unique resource should be highly valuable for upcoming studies of the biogeography, diversity, distribution and evolution of microbial rhodopsins. Database URL: http://micrhode.sb-roscoff.fr. PMID:26286928

  18. MicRhoDE: a curated database for the analysis of microbial rhodopsin diversity and evolution.

    PubMed

    Boeuf, Dominique; Audic, Stéphane; Brillet-Guéguen, Loraine; Caron, Christophe; Jeanthon, Christian

    2015-01-01

    Microbial rhodopsins are a diverse group of photoactive transmembrane proteins found in all three domains of life and in viruses. Today, microbial rhodopsin research is a flourishing research field in which new understandings of rhodopsin diversity, function and evolution are contributing to broader microbiological and molecular knowledge. Here, we describe MicRhoDE, a comprehensive, high-quality and freely accessible database that facilitates analysis of the diversity and evolution of microbial rhodopsins. Rhodopsin sequences isolated from a vast array of marine and terrestrial environments were manually collected and curated. To each rhodopsin sequence are associated related metadata, including predicted spectral tuning of the protein, putative activity and function, taxonomy for sequences that can be linked to a 16S rRNA gene, sampling date and location, and supporting literature. The database currently covers 7857 aligned sequences from more than 450 environmental samples or organisms. Based on a robust phylogenetic analysis, we introduce an operational classification system with multiple phylogenetic levels ranging from superclusters to species-level operational taxonomic units. An integrated pipeline for online sequence alignment and phylogenetic tree construction is also provided. With a user-friendly interface and integrated online bioinformatics tools, this unique resource should be highly valuable for upcoming studies of the biogeography, diversity, distribution and evolution of microbial rhodopsins. Database URL: http://micrhode.sb-roscoff.fr. © The Author(s) 2015. Published by Oxford University Press.

  19. A k-mer-based barcode DNA classification methodology based on spectral representation and a neural gas network.

    PubMed

    Fiannaca, Antonino; La Rosa, Massimo; Rizzo, Riccardo; Urso, Alfonso

    2015-07-01

    In this paper, an alignment-free method for DNA barcode classification that is based on both a spectral representation and a neural gas network for unsupervised clustering is proposed. In the proposed methodology, distinctive words are identified from a spectral representation of DNA sequences. A taxonomic classification of the DNA sequence is then performed using the sequence signature, i.e., the smallest set of k-mers that can assign a DNA sequence to its proper taxonomic category. Experiments were then performed to compare our method with other supervised machine learning classification algorithms, such as support vector machine, random forest, ripper, naïve Bayes, ridor, and classification tree, which also consider short DNA sequence fragments of 200 and 300 base pairs (bp). The experimental tests were conducted over 10 real barcode datasets belonging to different animal species, which were provided by the on-line resource "Barcode of Life Database". The experimental results showed that our k-mer-based approach is directly comparable, in terms of accuracy, recall and precision metrics, with the other classifiers when considering full-length sequences. In addition, we demonstrate the robustness of our method when a classification is performed task with a set of short DNA sequences that were randomly extracted from the original data. For example, the proposed method can reach the accuracy of 64.8% at the species level with 200-bp fragments. Under the same conditions, the best other classifier (random forest) reaches the accuracy of 20.9%. Our results indicate that we obtained a clear improvement over the other classifiers for the study of short DNA barcode sequence fragments. Copyright © 2015 Elsevier B.V. All rights reserved.

  20. PSSP-RFE: accurate prediction of protein structural class by recursive feature extraction from PSI-BLAST profile, physical-chemical property and functional annotations.

    PubMed

    Li, Liqi; Cui, Xiang; Yu, Sanjiu; Zhang, Yuan; Luo, Zhong; Yang, Hua; Zhou, Yue; Zheng, Xiaoqi

    2014-01-01

    Protein structure prediction is critical to functional annotation of the massively accumulated biological sequences, which prompts an imperative need for the development of high-throughput technologies. As a first and key step in protein structure prediction, protein structural class prediction becomes an increasingly challenging task. Amongst most homological-based approaches, the accuracies of protein structural class prediction are sufficiently high for high similarity datasets, but still far from being satisfactory for low similarity datasets, i.e., below 40% in pairwise sequence similarity. Therefore, we present a novel method for accurate and reliable protein structural class prediction for both high and low similarity datasets. This method is based on Support Vector Machine (SVM) in conjunction with integrated features from position-specific score matrix (PSSM), PROFEAT and Gene Ontology (GO). A feature selection approach, SVM-RFE, is also used to rank the integrated feature vectors through recursively removing the feature with the lowest ranking score. The definitive top features selected by SVM-RFE are input into the SVM engines to predict the structural class of a query protein. To validate our method, jackknife tests were applied to seven widely used benchmark datasets, reaching overall accuracies between 84.61% and 99.79%, which are significantly higher than those achieved by state-of-the-art tools. These results suggest that our method could serve as an accurate and cost-effective alternative to existing methods in protein structural classification, especially for low similarity datasets.

  1. Complete genome sequence of Geobacillus thermoglucosidasius C56-YS93, a novel biomass degrader isolated from obsidian hot spring in Yellowstone National Park

    DOE PAGES

    Brumm, Phillip J.; Land, Miriam L.; Mead, David A.

    2015-10-05

    Geobacillus thermoglucosidasius C56-YS93 was one of several thermophilic organisms isolated from Obsidian Hot Spring, Yellowstone National Park, Montana, USA under permit from the National Park Service. Comparison of 16 S rRNA sequences confirmed the classification of the strain as a G. thermoglucosidasius species. We sequenced the genome, assembled, and annotated by the DOE Joint Genome Institute and deposited at the NCBI in December 2011 (CP002835). Moreover, the genome of G. thermoglucosidasius C56-YS93 consists of one circular chromosome of 3,893,306 bp and two circular plasmids of 80,849 and 19,638 bp and an average G + C content of 43.93 %. G.more » thermoglucosidasius C56-YS93 possesses a xylan degradation cluster not found in the other G. thermoglucosidasius sequenced strains. Furthermore this cluster appears to be related to the xylan degradation cluster found in G. stearothermophilus. G. thermoglucosidasius C56-YS93 possesses two plasmids not found in the other two strains. One plasmid contains a novel gene cluster coding for proteins involved in proline degradation and metabolism, the other contains a collection of mostly hypothetical proteins.« less

  2. Complete genome sequence of Geobacillus thermoglucosidasius C56-YS93, a novel biomass degrader isolated from obsidian hot spring in Yellowstone National Park

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Brumm, Phillip J.; Land, Miriam L.; Mead, David A.

    Geobacillus thermoglucosidasius C56-YS93 was one of several thermophilic organisms isolated from Obsidian Hot Spring, Yellowstone National Park, Montana, USA under permit from the National Park Service. Comparison of 16 S rRNA sequences confirmed the classification of the strain as a G. thermoglucosidasius species. We sequenced the genome, assembled, and annotated by the DOE Joint Genome Institute and deposited at the NCBI in December 2011 (CP002835). Moreover, the genome of G. thermoglucosidasius C56-YS93 consists of one circular chromosome of 3,893,306 bp and two circular plasmids of 80,849 and 19,638 bp and an average G + C content of 43.93 %. G.more » thermoglucosidasius C56-YS93 possesses a xylan degradation cluster not found in the other G. thermoglucosidasius sequenced strains. Furthermore this cluster appears to be related to the xylan degradation cluster found in G. stearothermophilus. G. thermoglucosidasius C56-YS93 possesses two plasmids not found in the other two strains. One plasmid contains a novel gene cluster coding for proteins involved in proline degradation and metabolism, the other contains a collection of mostly hypothetical proteins.« less

  3. A Bioinformatic Strategy for the Detection, Classification and Analysis of Bacterial Autotransporters

    PubMed Central

    Celik, Nermin; Webb, Chaille T.; Leyton, Denisse L.; Holt, Kathryn E.; Heinz, Eva; Gorrell, Rebecca; Kwok, Terry; Naderer, Thomas; Strugnell, Richard A.; Speed, Terence P.; Teasdale, Rohan D.; Likić, Vladimir A.; Lithgow, Trevor

    2012-01-01

    Autotransporters are secreted proteins that are assembled into the outer membrane of bacterial cells. The passenger domains of autotransporters are crucial for bacterial pathogenesis, with some remaining attached to the bacterial surface while others are released by proteolysis. An enigma remains as to whether autotransporters should be considered a class of secretion system, or simply a class of substrate with peculiar requirements for their secretion. We sought to establish a sensitive search protocol that could identify and characterize diverse autotransporters from bacterial genome sequence data. The new sequence analysis pipeline identified more than 1500 autotransporter sequences from diverse bacteria, including numerous species of Chlamydiales and Fusobacteria as well as all classes of Proteobacteria. Interrogation of the proteins revealed that there are numerous classes of passenger domains beyond the known proteases, adhesins and esterases. In addition the barrel-domain-a characteristic feature of autotransporters-was found to be composed from seven conserved sequence segments that can be arranged in multiple ways in the tertiary structure of the assembled autotransporter. One of these conserved motifs overlays the targeting information required for autotransporters to reach the outer membrane. Another conserved and diagnostic motif maps to the linker region between the passenger domain and barrel-domain, indicating it as an important feature in the assembly of autotransporters. PMID:22905239

  4. Fold assessment for comparative protein structure modeling.

    PubMed

    Melo, Francisco; Sali, Andrej

    2007-11-01

    Accurate and automated assessment of both geometrical errors and incompleteness of comparative protein structure models is necessary for an adequate use of the models. Here, we describe a composite score for discriminating between models with the correct and incorrect fold. To find an accurate composite score, we designed and applied a genetic algorithm method that searched for a most informative subset of 21 input model features as well as their optimized nonlinear transformation into the composite score. The 21 input features included various statistical potential scores, stereochemistry quality descriptors, sequence alignment scores, geometrical descriptors, and measures of protein packing. The optimized composite score was found to depend on (1) a statistical potential z-score for residue accessibilities and distances, (2) model compactness, and (3) percentage sequence identity of the alignment used to build the model. The accuracy of the composite score was compared with the accuracy of assessment by single and combined features as well as by other commonly used assessment methods. The testing set was representative of models produced by automated comparative modeling on a genomic scale. The composite score performed better than any other tested score in terms of the maximum correct classification rate (i.e., 3.3% false positives and 2.5% false negatives) as well as the sensitivity and specificity across the whole range of thresholds. The composite score was implemented in our program MODELLER-8 and was used to assess models in the MODBASE database that contains comparative models for domains in approximately 1.3 million protein sequences.

  5. Mycobacteriophage genome database.

    PubMed

    Joseph, Jerrine; Rajendran, Vasanthi; Hassan, Sameer; Kumar, Vanaja

    2011-01-01

    Mycobacteriophage genome database (MGDB) is an exclusive repository of the 64 completely sequenced mycobacteriophages with annotated information. It is a comprehensive compilation of the various gene parameters captured from several databases pooled together to empower mycobacteriophage researchers. The MGDB (Version No.1.0) comprises of 6086 genes from 64 mycobacteriophages classified into 72 families based on ACLAME database. Manual curation was aided by information available from public databases which was enriched further by analysis. Its web interface allows browsing as well as querying the classification. The main objective is to collect and organize the complexity inherent to mycobacteriophage protein classification in a rational way. The other objective is to browse the existing and new genomes and describe their functional annotation. The database is available for free at http://mpgdb.ibioinformatics.org/mpgdb.php.

  6. Genomic characterisation of the feline sarcoid-associated papillomavirus and proposed classification as Bos taurus papillomavirus type 14.

    PubMed

    Munday, John S; Thomson, Neroli; Dunowska, Magda; Knight, Cameron G; Laurie, Rebecca E; Hills, Simon

    2015-06-12

    Feline sarcoids are rare mesenchymal neoplasms of domestic and exotic cats. Previous studies have consistently detected short DNA sequences from a papillomavirus (PV), designated feline sarcoid-associated papillomavirus (FeSarPV), in these neoplasms. The FeSarPV sequence has never been detected in any non-sarcoid sample from cats but has been amplified from the skin of cattle suggesting that feline sarcoids are caused by cross-species infection by a bovine papillomavirus (BPV). The aim of the present study was to determine the genome of the PV that contains the FeSarPV sequence. Using the circular nature of PV DNA, four specifically designed 'outward facing' primers were used to amplify two approximately 4,000 bp DNA segments from a feline sarcoid. The two PCR products were sequenced using next generation sequencing and the full genome of the PV, consisting 7,966 bp, was assembled and analysed. Phylogenetic analysis revealed the PV was closely related to the species 4 delta BPVs-1, -2, and -13, but distantly related to any carnivoran PV genus. These results are consistent with feline sarcoids being caused by a BPV type and we propose a classification of BPV-14 for this novel PV. Initial analysis suggests that, like other delta BPVs, the BPV-14 E5 protein could cause mesenchymal proliferation by binding to the platelet derived growth factor beta receptor. Interestingly BPV-14 has not been detected in any equine sarcoid suggesting that BPV-14 has a host range that is limited to bovids and felids. Copyright © 2015 Elsevier B.V. All rights reserved.

  7. A simple atomic-level hydrophobicity scale reveals protein interfacial structure.

    PubMed

    Kapcha, Lauren H; Rossky, Peter J

    2014-01-23

    Many amino acid residue hydrophobicity scales have been created in an effort to better understand and rapidly characterize water-protein interactions based only on protein structure and sequence. There is surprisingly low consistency in the ranking of residue hydrophobicity between scales, and their ability to provide insightful characterization varies substantially across subject proteins. All current scales characterize hydrophobicity based on entire amino acid residue units. We introduce a simple binary but atomic-level hydrophobicity scale that allows for the classification of polar and non-polar moieties within single residues, including backbone atoms. This simple scale is first shown to capture the anticipated hydrophobic character for those whole residues that align in classification among most scales. Examination of a set of protein binding interfaces establishes good agreement between residue-based and atomic-level descriptions of hydrophobicity for five residues, while the remaining residues produce discrepancies. We then show that the atomistic scale properly classifies the hydrophobicity of functionally important regions where residue-based scales fail. To illustrate the utility of the new approach, we show that the atomic-level scale rationalizes the hydration of two hydrophobic pockets and the presence of a void in a third pocket within a single protein and that it appropriately classifies all of the functionally important hydrophilic sites within two otherwise hydrophobic pores. We suggest that an atomic level of detail is, in general, necessary for the reliable depiction of hydrophobicity for all protein surfaces. The present formulation can be implemented simply in a manner no more complex than current residue-based approaches. © 2013.

  8. From sequence to enzyme mechanism using multi-label machine learning.

    PubMed

    De Ferrari, Luna; Mitchell, John B O

    2014-05-19

    In this work we predict enzyme function at the level of chemical mechanism, providing a finer granularity of annotation than traditional Enzyme Commission (EC) classes. Hence we can predict not only whether a putative enzyme in a newly sequenced organism has the potential to perform a certain reaction, but how the reaction is performed, using which cofactors and with susceptibility to which drugs or inhibitors, details with important consequences for drug and enzyme design. Work that predicts enzyme catalytic activity based on 3D protein structure features limits the prediction of mechanism to proteins already having either a solved structure or a close relative suitable for homology modelling. In this study, we evaluate whether sequence identity, InterPro or Catalytic Site Atlas sequence signatures provide enough information for bulk prediction of enzyme mechanism. By splitting MACiE (Mechanism, Annotation and Classification in Enzymes database) mechanism labels to a finer granularity, which includes the role of the protein chain in the overall enzyme complex, the method can predict at 96% accuracy (and 96% micro-averaged precision, 99.9% macro-averaged recall) the MACiE mechanism definitions of 248 proteins available in the MACiE, EzCatDb (Database of Enzyme Catalytic Mechanisms) and SFLD (Structure Function Linkage Database) databases using an off-the-shelf K-Nearest Neighbours multi-label algorithm. We find that InterPro signatures are critical for accurate prediction of enzyme mechanism. We also find that incorporating Catalytic Site Atlas attributes does not seem to provide additional accuracy. The software code (ml2db), data and results are available online at http://sourceforge.net/projects/ml2db/ and as supplementary files.

  9. Functional Classification of Immune Regulatory Proteins

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Rubinstein, Rotem; Ramagopal, Udupi A.; Nathenson, Stanley G.

    2013-05-01

    Members of the immunoglobulin superfamily (IgSF) control innate and adaptive immunity and are prime targets for the treatment of autoimmune diseases, infectious diseases, and malignancies. We describe a computational method, termed the Brotherhood algorithm, which utilizes intermediate sequence information to classify proteins into functionally related families. This approach identifies functional relationships within the IgSF and predicts additional receptor-ligand interactions. As a specific example, we examine the nectin/nectin-like family of cell adhesion and signaling proteins and propose receptor-ligand interactions within this family. We were guided by the Brotherhood approach and present the high-resolution structural characterization of a homophilic interaction involving themore » class-I MHC-restricted T-cell-associated molecule, which we now classify as a nectin-like family member. The Brotherhood algorithm is likely to have a significant impact on structural immunology by identifying those proteins and complexes for which structural characterization will be particularly informative.« less

  10. Predictors of natively unfolded proteins: unanimous consensus score to detect a twilight zone between order and disorder in generic datasets.

    PubMed

    Deiana, Antonio; Giansanti, Andrea

    2010-04-21

    Natively unfolded proteins lack a well defined three dimensional structure but have important biological functions, suggesting a re-assignment of the structure-function paradigm. To assess that a given protein is natively unfolded requires laborious experimental investigations, then reliable sequence-only methods for predicting whether a sequence corresponds to a folded or to an unfolded protein are of interest in fundamental and applicative studies. Many proteins have amino acidic compositions compatible both with the folded and unfolded status, and belong to a twilight zone between order and disorder. This makes difficult a dichotomic classification of protein sequences into folded and natively unfolded ones. In this work we propose an operational method to identify proteins belonging to the twilight zone by combining into a consensus score good performing single predictors of folding. In this methodological paper dichotomic folding indexes are considered: hydrophobicity-charge, mean packing, mean pairwise energy, Poodle-W and a new global index, that is called here gVSL2, based on the local disorder predictor VSL2. The performance of these indexes is evaluated on different datasets, in particular on a new dataset composed by 2369 folded and 81 natively unfolded proteins. Poodle-W, gVSL2 and mean pairwise energy have good performance and stability in all the datasets considered and are combined into a strictly unanimous combination score SSU, that leaves proteins unclassified when the consensus of all combined indexes is not reached. The unclassified proteins: i) belong to an overlap region in the vector space of amino acidic compositions occupied by both folded and unfolded proteins; ii) are composed by approximately the same number of order-promoting and disorder-promoting amino acids; iii) have a mean flexibility intermediate between that of folded and that of unfolded proteins. Our results show that proteins unclassified by SSU belong to a twilight zone. Proteins left unclassified by the consensus score SSU have physical properties intermediate between those of folded and those of natively unfolded proteins and their structural properties and evolutionary history are worth to be investigated.

  11. Predictors of natively unfolded proteins: unanimous consensus score to detect a twilight zone between order and disorder in generic datasets

    PubMed Central

    2010-01-01

    Background Natively unfolded proteins lack a well defined three dimensional structure but have important biological functions, suggesting a re-assignment of the structure-function paradigm. To assess that a given protein is natively unfolded requires laborious experimental investigations, then reliable sequence-only methods for predicting whether a sequence corresponds to a folded or to an unfolded protein are of interest in fundamental and applicative studies. Many proteins have amino acidic compositions compatible both with the folded and unfolded status, and belong to a twilight zone between order and disorder. This makes difficult a dichotomic classification of protein sequences into folded and natively unfolded ones. In this work we propose an operational method to identify proteins belonging to the twilight zone by combining into a consensus score good performing single predictors of folding. Results In this methodological paper dichotomic folding indexes are considered: hydrophobicity-charge, mean packing, mean pairwise energy, Poodle-W and a new global index, that is called here gVSL2, based on the local disorder predictor VSL2. The performance of these indexes is evaluated on different datasets, in particular on a new dataset composed by 2369 folded and 81 natively unfolded proteins. Poodle-W, gVSL2 and mean pairwise energy have good performance and stability in all the datasets considered and are combined into a strictly unanimous combination score SSU, that leaves proteins unclassified when the consensus of all combined indexes is not reached. The unclassified proteins: i) belong to an overlap region in the vector space of amino acidic compositions occupied by both folded and unfolded proteins; ii) are composed by approximately the same number of order-promoting and disorder-promoting amino acids; iii) have a mean flexibility intermediate between that of folded and that of unfolded proteins. Conclusions Our results show that proteins unclassified by SSU belong to a twilight zone. Proteins left unclassified by the consensus score SSU have physical properties intermediate between those of folded and those of natively unfolded proteins and their structural properties and evolutionary history are worth to be investigated. PMID:20409339

  12. InterProScan 5: genome-scale protein function classification

    PubMed Central

    Jones, Philip; Binns, David; Chang, Hsin-Yu; Fraser, Matthew; Li, Weizhong; McAnulla, Craig; McWilliam, Hamish; Maslen, John; Mitchell, Alex; Nuka, Gift; Pesseat, Sebastien; Quinn, Antony F.; Sangrador-Vegas, Amaia; Scheremetjew, Maxim; Yong, Siew-Yit; Lopez, Rodrigo; Hunter, Sarah

    2014-01-01

    Motivation: Robust large-scale sequence analysis is a major challenge in modern genomic science, where biologists are frequently trying to characterize many millions of sequences. Here, we describe a new Java-based architecture for the widely used protein function prediction software package InterProScan. Developments include improvements and additions to the outputs of the software and the complete reimplementation of the software framework, resulting in a flexible and stable system that is able to use both multiprocessor machines and/or conventional clusters to achieve scalable distributed data analysis. InterProScan is freely available for download from the EMBl-EBI FTP site and the open source code is hosted at Google Code. Availability and implementation: InterProScan is distributed via FTP at ftp://ftp.ebi.ac.uk/pub/software/unix/iprscan/5/ and the source code is available from http://code.google.com/p/interproscan/. Contact: http://www.ebi.ac.uk/support or interhelp@ebi.ac.uk or mitchell@ebi.ac.uk PMID:24451626

  13. Classification of circulation type sequences applied to snow avalanches over the eastern Pyrenees (Andorra and Catalonia)

    NASA Astrophysics Data System (ADS)

    Esteban, Pere; Beck, Christoph; Philipp, Andreas

    2010-05-01

    Using data associated with accidents or damages caused by snow avalanches over the eastern Pyrenees (Andorra and Catalonia) several atmospheric circulation type catalogues have been obtained. For this purpose, different circulation type classification methods based on Principal Component Analysis (T-mode and S-mode using the extreme scores) and on optimization procedures (Improved K-means and SANDRA) were applied . Considering the characteristics of the phenomena studied, not only single day circulation patterns were taken into account but also sequences of circulation types of varying length. Thus different classifications with different numbers of types and for different sequence lengths were obtained using the different classification methods. Simple between type variability, within type variability, and outlier detection procedures have been applied for selecting the best result concerning snow avalanches type classifications. Furthermore, days without occurrence of the hazards were also related to the avalanche centroids using pattern-correlations, facilitating the calculation of the anomalies between hazardous and no hazardous days, and also frequencies of occurrence of hazardous events for each circulation type. Finally, the catalogues statistically considered the best results are evaluated using the avalanche forecaster expert knowledge. Consistent explanation of snow avalanches occurrence by means of circulation sequences is obtained, but always considering results from classifications with different sequence length. This work has been developed in the framework of the COST Action 733 (Harmonisation and Applications of Weather Type Classifications for European regions).

  14. Accurate Prediction of Protein Contact Maps by Coupling Residual Two-Dimensional Bidirectional Long Short-Term Memory with Convolutional Neural Networks.

    PubMed

    Hanson, Jack; Paliwal, Kuldip; Litfin, Thomas; Yang, Yuedong; Zhou, Yaoqi

    2018-06-19

    Accurate prediction of a protein contact map depends greatly on capturing as much contextual information as possible from surrounding residues for a target residue pair. Recently, ultra-deep residual convolutional networks were found to be state-of-the-art in the latest Critical Assessment of Structure Prediction techniques (CASP12, (Schaarschmidt et al., 2018)) for protein contact map prediction by attempting to provide a protein-wide context at each residue pair. Recurrent neural networks have seen great success in recent protein residue classification problems due to their ability to propagate information through long protein sequences, especially Long Short-Term Memory (LSTM) cells. Here we propose a novel protein contact map prediction method by stacking residual convolutional networks with two-dimensional residual bidirectional recurrent LSTM networks, and using both one-dimensional sequence-based and two-dimensional evolutionary coupling-based information. We show that the proposed method achieves a robust performance over validation and independent test sets with the Area Under the receiver operating characteristic Curve (AUC)>0.95 in all tests. When compared to several state-of-the-art methods for independent testing of 228 proteins, the method yields an AUC value of 0.958, whereas the next-best method obtains an AUC of 0.909. More importantly, the improvement is over contacts at all sequence-position separations. Specifically, a 8.95%, 5.65% and 2.84% increase in precision were observed for the top L∕10 predictions over the next best for short, medium and long-range contacts, respectively. This confirms the usefulness of ResNets to congregate the short-range relations and 2D-BRLSTM to propagate the long-range dependencies throughout the entire protein contact map 'image'. SPOT-Contact server url: http://sparks-lab.org/jack/server/SPOT-Contact/. Supplementary data is available at Bioinformatics online.

  15. Impact of genomics on the understanding of microbial evolution and classification: the importance of Darwin's views on classification.

    PubMed

    Gupta, Radhey S

    2016-07-01

    Analyses of genome sequences, by some approaches, suggest that the widespread occurrence of horizontal gene transfers (HGTs) in prokaryotes disguises their evolutionary relationships and have led to questioning of the Darwinian model of evolution for prokaryotes. These inferences are critically examined in the light of comparative genome analysis, characteristic synapomorphies, phylogenetic trees and Darwin's views on examining evolutionary relationships. Genome sequences are enabling discovery of numerous molecular markers (synapomorphies) such as conserved signature indels (CSIs) and conserved signature proteins (CSPs), which are distinctive characteristics of different prokaryotic taxa. Based on these molecular markers, exhibiting high degree of specificity and predictive ability, numerous prokaryotic taxa of different ranks, currently identified based on the 16S rRNA gene trees, can now be reliably demarcated in molecular terms. Within all studied groups, multiple CSIs and CSPs have been identified for successive nested clades providing reliable information regarding their hierarchical relationships and these inferences are not affected by HGTs. These results strongly support Darwin's views on evolution and classification and supplement the current phylogenetic framework based on 16S rRNA in important respects. The identified molecular markers provide important means for developing novel diagnostics, therapeutics and for functional studies providing important insights regarding prokaryotic taxa. © FEMS 2016. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com.

  16. HIV classification using the coalescent theory

    PubMed Central

    Bulla, Ingo; Schultz, Anne-Kathrin; Schreiber, Fabian; Zhang, Ming; Leitner, Thomas; Korber, Bette; Morgenstern, Burkhard; Stanke, Mario

    2010-01-01

    Motivation: Existing coalescent models and phylogenetic tools based on them are not designed for studying the genealogy of sequences like those of HIV, since in HIV recombinants with multiple cross-over points between the parental strains frequently arise. Hence, ambiguous cases in the classification of HIV sequences into subtypes and circulating recombinant forms (CRFs) have been treated with ad hoc methods in lack of tools based on a comprehensive coalescent model accounting for complex recombination patterns. Results: We developed the program ARGUS that scores classifications of sequences into subtypes and recombinant forms. It reconstructs ancestral recombination graphs (ARGs) that reflect the genealogy of the input sequences given a classification hypothesis. An ARG with maximal probability is approximated using a Markov chain Monte Carlo approach. ARGUS was able to distinguish the correct classification with a low error rate from plausible alternative classifications in simulation studies with realistic parameters. We applied our algorithm to decide between two recently debated alternatives in the classification of CRF02 of HIV-1 and find that CRF02 is indeed a recombinant of Subtypes A and G. Availability: ARGUS is implemented in C++ and the source code is available at http://gobics.de/software Contact: ibulla@uni-goettingen.de Supplementary Information: Supplementary data are available at Bioinformatics online. PMID:20400454

  17. New progress in snake mitochondrial gene rearrangement.

    PubMed

    Chen, Nian; Zhao, Shujin

    2009-08-01

    To further understand the evolution of snake mitochondrial genomes, the complete mitochondrial DNA (mtDNA) sequences were determined for representative species from two snake families: the Many-banded krait, the Banded krait, the Chinese cobra, the King cobra, the Hundred-pace viper, the Short-tailed mamushi, and the Chain viper. Thirteen protein-coding genes, 22-23 tRNA genes, 2 rRNA genes, and 2 control regions were identified in these mtDNAs. Duplication of the control region and translocation of the tRNAPro gene were two notable features of the snake mtDNAs. These results from the gene rearrangement comparisons confirm the correctness of traditional classification schemes and validate the utility of comparing complete mtDNA sequences for snake phylogeny reconstruction.

  18. The Complete Set of Genes Encoding Major Intrinsic Proteins in Arabidopsis Provides a Framework for a New Nomenclature for Major Intrinsic Proteins in Plants1

    PubMed Central

    Johanson, Urban; Karlsson, Maria; Johansson, Ingela; Gustavsson, Sofia; Sjövall, Sara; Fraysse, Laure; Weig, Alfons R.; Kjellbom, Per

    2001-01-01

    Major intrinsic proteins (MIPs) facilitate the passive transport of small polar molecules across membranes. MIPs constitute a very old family of proteins and different forms have been found in all kinds of living organisms, including bacteria, fungi, animals, and plants. In the genomic sequence of Arabidopsis, we have identified 35 different MIP-encoding genes. Based on sequence similarity, these 35 proteins are divided into four different subfamilies: plasma membrane intrinsic proteins, tonoplast intrinsic proteins, NOD26-like intrinsic proteins also called NOD26-like MIPs, and the recently discovered small basic intrinsic proteins. In Arabidopsis, there are 13 plasma membrane intrinsic proteins, 10 tonoplast intrinsic proteins, nine NOD26-like intrinsic proteins, and three small basic intrinsic proteins. The gene structure in general is conserved within each subfamily, although there is a tendency to lose introns. Based on phylogenetic comparisons of maize (Zea mays) and Arabidopsis MIPs (AtMIPs), it is argued that the general intron patterns in the subfamilies were formed before the split of monocotyledons and dicotyledons. Although the gene structure is unique for each subfamily, there is a common pattern in how transmembrane helices are encoded on the exons in three of the subfamilies. The nomenclature for plant MIPs varies widely between different species but also between subfamilies in the same species. Based on the phylogeny of all AtMIPs, a new and more consistent nomenclature is proposed. The complete set of AtMIPs, together with the new nomenclature, will facilitate the isolation, classification, and labeling of plant MIPs from other species. PMID:11500536

  19. Using random forests for assistance in the curation of G-protein coupled receptor databases.

    PubMed

    Shkurin, Aleksei; Vellido, Alfredo

    2017-08-18

    Biology is experiencing a gradual but fast transformation from a laboratory-centred science towards a data-centred one. As such, it requires robust data engineering and the use of quantitative data analysis methods as part of database curation. This paper focuses on G protein-coupled receptors, a large and heterogeneous super-family of cell membrane proteins of interest to biology in general. One of its families, Class C, is of particular interest to pharmacology and drug design. This family is quite heterogeneous on its own, and the discrimination of its several sub-families is a challenging problem. In the absence of known crystal structure, such discrimination must rely on their primary amino acid sequences. We are interested not as much in achieving maximum sub-family discrimination accuracy using quantitative methods, but in exploring sequence misclassification behavior. Specifically, we are interested in isolating those sequences showing consistent misclassification, that is, sequences that are very often misclassified and almost always to the same wrong sub-family. Random forests are used for this analysis due to their ensemble nature, which makes them naturally suited to gauge the consistency of misclassification. This consistency is here defined through the voting scheme of their base tree classifiers. Detailed consistency results for the random forest ensemble classification were obtained for all receptors and for all data transformations of their unaligned primary sequences. Shortlists of the most consistently misclassified receptors for each subfamily and transformation, as well as an overall shortlist including those cases that were consistently misclassified across transformations, were obtained. The latter should be referred to experts for further investigation as a data curation task. The automatic discrimination of the Class C sub-families of G protein-coupled receptors from their unaligned primary sequences shows clear limits. This study has investigated in some detail the consistency of their misclassification using random forest ensemble classifiers. Different sub-families have been shown to display very different discrimination consistency behaviors. The individual identification of consistently misclassified sequences should provide a tool for quality control to GPCR database curators.

  20. Defining functional distance using manifold embeddings of gene ontology annotations

    PubMed Central

    Lerman, Gilad; Shakhnovich, Boris E.

    2007-01-01

    Although rigorous measures of similarity for sequence and structure are now well established, the problem of defining functional relationships has been particularly daunting. Here, we present several manifold embedding techniques to compute distances between Gene Ontology (GO) functional annotations and consequently estimate functional distances between protein domains. To evaluate accuracy, we correlate the functional distance to the well established measures of sequence, structural, and phylogenetic similarities. Finally, we show that manual classification of structures into folds and superfamilies is mirrored by proximity in the newly defined function space. We show how functional distances place structure–function relationships in biological context resulting in insight into divergent and convergent evolution. The methods and results in this paper can be readily generalized and applied to a wide array of biologically relevant investigations, such as accuracy of annotation transference, the relationship between sequence, structure, and function, or coherence of expression modules. PMID:17595300

  1. New tools for classification and monitoring of autoimmune diseases

    PubMed Central

    Maecker, Holden T.; Lindstrom, Tamsin M.; Robinson, William H.; Utz, Paul J.; Hale, Matthew; Boyd, Scott D.; Shen-Orr, Shai S.; Fathman, C. Garrison

    2012-01-01

    Rheumatologists see patients with a range of autoimmune diseases. Phenotyping these diseases for diagnosis, prognosis and selection of therapies is an ever increasing problem. Advances in multiplexed assay technology at the gene, protein, and cellular level have enabled the identification of `actionable biomarkers'; that is, biological metrics that can inform clinical practice. Not only will such biomarkers yield insight into the development, remission, and exacerbation of a disease, they will undoubtedly improve diagnostic sensitivity and accuracy of classification, and ultimately guide treatment. This Review provides an introduction to these powerful technologies that could promote the identification of actionable biomarkers, including mass cytometry, protein arrays, and immunoglobulin and T-cell receptor high-throughput sequencing. In our opinion, these technologies should become part of routine clinical practice for the management of autoimmune diseases. The use of analytical tools to deconvolve the data obtained from use of these technologies is also presented here. These analyses are revealing a more comprehensive and interconnected view of the immune system than ever before and should have an important role in directing future treatment approaches for autoimmune diseases. PMID:22647780

  2. Computational intelligence techniques in bioinformatics.

    PubMed

    Hassanien, Aboul Ella; Al-Shammari, Eiman Tamah; Ghali, Neveen I

    2013-12-01

    Computational intelligence (CI) is a well-established paradigm with current systems having many of the characteristics of biological computers and capable of performing a variety of tasks that are difficult to do using conventional techniques. It is a methodology involving adaptive mechanisms and/or an ability to learn that facilitate intelligent behavior in complex and changing environments, such that the system is perceived to possess one or more attributes of reason, such as generalization, discovery, association and abstraction. The objective of this article is to present to the CI and bioinformatics research communities some of the state-of-the-art in CI applications to bioinformatics and motivate research in new trend-setting directions. In this article, we present an overview of the CI techniques in bioinformatics. We will show how CI techniques including neural networks, restricted Boltzmann machine, deep belief network, fuzzy logic, rough sets, evolutionary algorithms (EA), genetic algorithms (GA), swarm intelligence, artificial immune systems and support vector machines, could be successfully employed to tackle various problems such as gene expression clustering and classification, protein sequence classification, gene selection, DNA fragment assembly, multiple sequence alignment, and protein function prediction and its structure. We discuss some representative methods to provide inspiring examples to illustrate how CI can be utilized to address these problems and how bioinformatics data can be characterized by CI. Challenges to be addressed and future directions of research are also presented and an extensive bibliography is included. Copyright © 2013 Elsevier Ltd. All rights reserved.

  3. Detailed tail proteomic analysis of axolotl (Ambystoma mexicanum) using an mRNA-seq reference database.

    PubMed

    Demircan, Turan; Keskin, Ilknur; Dumlu, Seda Nilgün; Aytürk, Nilüfer; Avşaroğlu, Mahmut Erhan; Akgün, Emel; Öztürk, Gürkan; Baykal, Ahmet Tarık

    2017-01-01

    Salamander axolotl has been emerging as an important model for stem cell research due to its powerful regenerative capacity. Several advantages, such as the high capability of advanced tissue, organ, and appendages regeneration, promote axolotl as an ideal model system to extend our current understanding on the mechanisms of regeneration. Acknowledging the common molecular pathways between amphibians and mammals, there is a great potential to translate the messages from axolotl research to mammalian studies. However, the utilization of axolotl is hindered due to the lack of reference databases of genomic, transcriptomic, and proteomic data. Here, we introduce the proteome analysis of the axolotl tail section searched against an mRNA-seq database. We translated axolotl mRNA sequences to protein sequences and annotated these to process the LC-MS/MS data and identified 1001 nonredundant proteins. Functional classification of identified proteins was performed by gene ontology searches. The presence of some of the identified proteins was validated by in situ antibody labeling. Furthermore, we have analyzed the proteome expressional changes postamputation at three time points to evaluate the underlying mechanisms of the regeneration process. Taken together, this work expands the proteomics data of axolotl to contribute to its establishment as a fully utilized model. © 2016 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.

  4. An updated version of NPIDB includes new classifications of DNA–protein complexes and their families

    PubMed Central

    Zanegina, Olga; Kirsanov, Dmitriy; Baulin, Eugene; Karyagina, Anna; Alexeevski, Andrei; Spirin, Sergey

    2016-01-01

    The recent upgrade of nucleic acid–protein interaction database (NPIDB, http://npidb.belozersky.msu.ru/) includes a newly elaborated classification of complexes of protein domains with double-stranded DNA and a classification of families of related complexes. Our classifications are based on contacting structural elements of both DNA: the major groove, the minor groove and the backbone; and protein: helices, beta-strands and unstructured segments. We took into account both hydrogen bonds and hydrophobic interaction. The analyzed material contains 1942 structures of protein domains from 748 PDB entries. We have identified 97 interaction modes of individual protein domain–DNA complexes and 17 DNA–protein interaction classes of protein domain families. We analyzed the sources of diversity of DNA–protein interaction modes in different complexes of one protein domain family. The observed interaction mode is sometimes influenced by artifacts of crystallization or diversity in secondary structure assignment. The interaction classes of domain families are more stable and thus possess more biological sense than a classification of single complexes. Integration of the classification into NPIDB allows the user to browse the database according to the interacting structural elements of DNA and protein molecules. For each family, we present average DNA shape parameters in contact zones with domains of the family. PMID:26656949

  5. SCOWLP classification: Structural comparison and analysis of protein binding regions

    PubMed Central

    Teyra, Joan; Paszkowski-Rogacz, Maciej; Anders, Gerd; Pisabarro, M Teresa

    2008-01-01

    Background Detailed information about protein interactions is critical for our understanding of the principles governing protein recognition mechanisms. The structures of many proteins have been experimentally determined in complex with different ligands bound either in the same or different binding regions. Thus, the structural interactome requires the development of tools to classify protein binding regions. A proper classification may provide a general view of the regions that a protein uses to bind others and also facilitate a detailed comparative analysis of the interacting information for specific protein binding regions at atomic level. Such classification might be of potential use for deciphering protein interaction networks, understanding protein function, rational engineering and design. Description Protein binding regions (PBRs) might be ideally described as well-defined separated regions that share no interacting residues one another. However, PBRs are often irregular, discontinuous and can share a wide range of interacting residues among them. The criteria to define an individual binding region can be often arbitrary and may differ from other binding regions within a protein family. Therefore, the rational behind protein interface classification should aim to fulfil the requirements of the analysis to be performed. We extract detailed interaction information of protein domains, peptides and interfacial solvent from the SCOWLP database and we classify the PBRs of each domain family. For this purpose, we define a similarity index based on the overlapping of interacting residues mapped in pair-wise structural alignments. We perform our classification with agglomerative hierarchical clustering using the complete-linkage method. Our classification is calculated at different similarity cut-offs to allow flexibility in the analysis of PBRs, feature especially interesting for those protein families with conflictive binding regions. The hierarchical classification of PBRs is implemented into the SCOWLP database and extends the SCOP classification with three additional family sub-levels: Binding Region, Interface and Contacting Domains. SCOWLP contains 9,334 binding regions distributed within 2,561 families. In 65% of the cases we observe families containing more than one binding region. Besides, 22% of the regions are forming complex with more than one different protein family. Conclusion The current SCOWLP classification and its web application represent a framework for the study of protein interfaces and comparative analysis of protein family binding regions. This comparison can be performed at atomic level and allows the user to study interactome conservation and variability. The new SCOWLP classification may be of great utility for reconstruction of protein complexes, understanding protein networks and ligand design. SCOWLP will be updated with every SCOP release. The web application is available at . PMID:18182098

  6. Amyloid fibril proteins and amyloidosis: chemical identification and clinical classification International Society of Amyloidosis 2016 Nomenclature Guidelines.

    PubMed

    Sipe, Jean D; Benson, Merrill D; Buxbaum, Joel N; Ikeda, Shu-Ichi; Merlini, Giampaolo; Saraiva, Maria J M; Westermark, Per

    2016-12-01

    The Nomenclature Committee of the International Society of Amyloidosis (ISA) met during the XVth Symposium of the Society, 3 July-7 July 2016, Uppsala, Sweden, to assess and formulate recommendations for nomenclature for amyloid fibril proteins and the clinical classification of the amyloidoses. An amyloid fibril must exhibit affinity for Congo red and with green, yellow or orange birefringence when the Congo red-stained deposits are viewed with polarized light. While congophilia and birefringence remain the gold standard for demonstration of amyloid deposits, new staining and imaging techniques are proving useful. To be included in the nomenclature list, in addition to congophilia and birefringence, the chemical identity of the protein must be unambiguously characterized by protein sequence analysis when possible. In general, it is insufficient to identify a mutation in the gene of a candidate amyloid protein without confirming the variant changes in the amyloid fibril protein. Each distinct form of amyloidosis is uniquely characterized by the chemical identity of the amyloid fibril protein that deposits in the extracellular spaces of tissues and organs and gives rise to the disease syndrome. The fibril proteins are designated as protein A followed by a suffix that is an abbreviation of the parent or precursor protein name. To date, there are 36 known extracellular fibril proteins in humans, 2 of which are iatrogenic in nature and 9 of which have also been identified in animals. Two newly recognized fibril proteins, AApoCII derived from apolipoprotein CII and AApoCIII derived from apolipoprotein CIII, have been added. AApoCII amyloidosis and AApoCIII amyloidosis are hereditary systemic amyloidoses. Intracellular protein inclusions displaying some of the properties of amyloid, "intracellular amyloid" have been reported. Two proteins which were previously characterized as intracellular inclusions, tau and α-synuclein, are now recognized to form extracellular deposits upon cell death and thus have been included in Table 1 as ATau and AαSyn.

  7. Recognition of functional sites in protein structures.

    PubMed

    Shulman-Peleg, Alexandra; Nussinov, Ruth; Wolfson, Haim J

    2004-06-04

    Recognition of regions on the surface of one protein, that are similar to a binding site of another is crucial for the prediction of molecular interactions and for functional classifications. We first describe a novel method, SiteEngine, that assumes no sequence or fold similarities and is able to recognize proteins that have similar binding sites and may perform similar functions. We achieve high efficiency and speed by introducing a low-resolution surface representation via chemically important surface points, by hashing triangles of physico-chemical properties and by application of hierarchical scoring schemes for a thorough exploration of global and local similarities. We proceed to rigorously apply this method to functional site recognition in three possible ways: first, we search a given functional site on a large set of complete protein structures. Second, a potential functional site on a protein of interest is compared with known binding sites, to recognize similar features. Third, a complete protein structure is searched for the presence of an a priori unknown functional site, similar to known sites. Our method is robust and efficient enough to allow computationally demanding applications such as the first and the third. From the biological standpoint, the first application may identify secondary binding sites of drugs that may lead to side-effects. The third application finds new potential sites on the protein that may provide targets for drug design. Each of the three applications may aid in assigning a function and in classification of binding patterns. We highlight the advantages and disadvantages of each type of search, provide examples of large-scale searches of the entire Protein Data Base and make functional predictions.

  8. Profiling the orphan enzymes

    PubMed Central

    2014-01-01

    The emergence of Next Generation Sequencing generates an incredible amount of sequence and great potential for new enzyme discovery. Despite this huge amount of data and the profusion of bioinformatic methods for function prediction, a large part of known enzyme activities is still lacking an associated protein sequence. These particular activities are called “orphan enzymes”. The present review proposes an update of previous surveys on orphan enzymes by mining the current content of public databases. While the percentage of orphan enzyme activities has decreased from 38% to 22% in ten years, there are still more than 1,000 orphans among the 5,000 entries of the Enzyme Commission (EC) classification. Taking into account all the reactions present in metabolic databases, this proportion dramatically increases to reach nearly 50% of orphans and many of them are not associated to a known pathway. We extended our survey to “local orphan enzymes” that are activities which have no representative sequence in a given clade, but have at least one in organisms belonging to other clades. We observe an important bias in Archaea and find that in general more than 30% of the EC activities have incomplete sequence information in at least one superkingdom. To estimate if candidate proteins for local orphans could be retrieved by homology search, we applied a simple strategy based on the PRIAM software and noticed that candidates may be proposed for an important fraction of local orphan enzymes. Finally, by studying relation between protein domains and catalyzed activities, it appears that newly discovered enzymes are mostly associated with already known enzyme domains. Thus, the exploration of the promiscuity and the multifunctional aspect of known enzyme families may solve part of the orphan enzyme issue. We conclude this review with a presentation of recent initiatives in finding proteins for orphan enzymes and in extending the enzyme world by the discovery of new activities. Reviewers This article was reviewed by Michael Galperin, Daniel Haft and Daniel Kahn. PMID:24906382

  9. Profiling the orphan enzymes.

    PubMed

    Sorokina, Maria; Stam, Mark; Médigue, Claudine; Lespinet, Olivier; Vallenet, David

    2014-06-06

    The emergence of Next Generation Sequencing generates an incredible amount of sequence and great potential for new enzyme discovery. Despite this huge amount of data and the profusion of bioinformatic methods for function prediction, a large part of known enzyme activities is still lacking an associated protein sequence. These particular activities are called "orphan enzymes". The present review proposes an update of previous surveys on orphan enzymes by mining the current content of public databases. While the percentage of orphan enzyme activities has decreased from 38% to 22% in ten years, there are still more than 1,000 orphans among the 5,000 entries of the Enzyme Commission (EC) classification. Taking into account all the reactions present in metabolic databases, this proportion dramatically increases to reach nearly 50% of orphans and many of them are not associated to a known pathway. We extended our survey to "local orphan enzymes" that are activities which have no representative sequence in a given clade, but have at least one in organisms belonging to other clades. We observe an important bias in Archaea and find that in general more than 30% of the EC activities have incomplete sequence information in at least one superkingdom. To estimate if candidate proteins for local orphans could be retrieved by homology search, we applied a simple strategy based on the PRIAM software and noticed that candidates may be proposed for an important fraction of local orphan enzymes. Finally, by studying relation between protein domains and catalyzed activities, it appears that newly discovered enzymes are mostly associated with already known enzyme domains. Thus, the exploration of the promiscuity and the multifunctional aspect of known enzyme families may solve part of the orphan enzyme issue. We conclude this review with a presentation of recent initiatives in finding proteins for orphan enzymes and in extending the enzyme world by the discovery of new activities.

  10. Evolution and Classification of Myosins, a Paneukaryotic Whole-Genome Approach

    PubMed Central

    Sebé-Pedrós, Arnau; Grau-Bové, Xavier; Richards, Thomas A.; Ruiz-Trillo, Iñaki

    2014-01-01

    Myosins are key components of the eukaryotic cytoskeleton, providing motility for a broad diversity of cargoes. Therefore, understanding the origin and evolutionary history of myosin classes is crucial to address the evolution of eukaryote cell biology. Here, we revise the classification of myosins using an updated taxon sampling that includes newly or recently sequenced genomes and transcriptomes from key taxa. We performed a survey of eukaryotic genomes and phylogenetic analyses of the myosin gene family, reconstructing the myosin toolkit at different key nodes in the eukaryotic tree of life. We also identified the phylogenetic distribution of myosin diversity in terms of number of genes, associated protein domains and number of classes in each taxa. Our analyses show that new classes (i.e., paralogs) and domain architectures were continuously generated throughout eukaryote evolution, with a significant expansion of myosin abundance and domain architectural diversity at the stem of Holozoa, predating the origin of animal multicellularity. Indeed, single-celled holozoans have the most complex myosin complement among eukaryotes, with paralogs of most myosins previously considered animal specific. We recover a dynamic evolutionary history, with several lineage-specific expansions (e.g., the myosin III-like gene family diversification in choanoflagellates), convergence in protein domain architectures (e.g., fungal and animal chitin synthase myosins), and important secondary losses. Overall, our evolutionary scheme demonstrates that the ancestral eukaryote likely had a complex myosin repertoire that included six genes with different protein domain architectures. Finally, we provide an integrative and robust classification, useful for future genomic and functional studies on this crucial eukaryotic gene family. PMID:24443438

  11. Sequencing, Analysis, and Annotation of Expressed Sequence Tags for Camelus dromedarius

    PubMed Central

    Al-Swailem, Abdulaziz M.; Shehata, Maher M.; Abu-Duhier, Faisel M.; Al-Yamani, Essam J.; Al-Busadah, Khalid A.; Al-Arawi, Mohammed S.; Al-Khider, Ali Y.; Al-Muhaimeed, Abdullah N.; Al-Qahtani, Fahad H.; Manee, Manee M.; Al-Shomrani, Badr M.; Al-Qhtani, Saad M.; Al-Harthi, Amer S.; Akdemir, Kadir C.; Otu, Hasan H.

    2010-01-01

    Despite its economical, cultural, and biological importance, there has not been a large scale sequencing project to date for Camelus dromedarius. With the goal of sequencing complete DNA of the organism, we first established and sequenced camel EST libraries, generating 70,272 reads. Following trimming, chimera check, repeat masking, cluster and assembly, we obtained 23,602 putative gene sequences, out of which over 4,500 potentially novel or fast evolving gene sequences do not carry any homology to other available genomes. Functional annotation of sequences with similarities in nucleotide and protein databases has been obtained using Gene Ontology classification. Comparison to available full length cDNA sequences and Open Reading Frame (ORF) analysis of camel sequences that exhibit homology to known genes show more than 80% of the contigs with an ORF>300 bp and ∼40% hits extending to the start codons of full length cDNAs suggesting successful characterization of camel genes. Similarity analyses are done separately for different organisms including human, mouse, bovine, and rat. Accompanying web portal, CAGBASE (http://camel.kacst.edu.sa/), hosts a relational database containing annotated EST sequences and analysis tools with possibility to add sequences from public domain. We anticipate our results to provide a home base for genomic studies of camel and other comparative studies enabling a starting point for whole genome sequencing of the organism. PMID:20502665

  12. Novel dehydrins lacking complete K-segments in Pinaceae. The exception rather than the rule

    PubMed Central

    Perdiguero, Pedro; Collada, Carmen; Soto, Álvaro

    2014-01-01

    Dehydrins are thought to play an essential role in the plant response, acclimation and tolerance to different abiotic stresses, such as cold and drought. These proteins contain conserved and repeated segments in their amino acid sequence, used for their classification. Thus, dehydrins from angiosperms present different repetitions of the segments Y, S, and K, while gymnosperm dehydrins show A, E, S, and K segments. The only fragment present in all the dehydrins described to date is the K-segment. Different works suggest the K-segment is involved in key protective functions during dehydration stress, mainly stabilizing membranes. In this work, we describe for the first time two Pinus pinaster proteins with truncated K-segments and a third one completely lacking K-segments, but whose sequence homology leads us to consider them still as dehydrins. qRT-PCR expression analysis show a significant induction of these dehydrins during a severe and prolonged drought stress. By in silico analysis we confirmed the presence of these dehydrins in other Pinaceae species, breaking the convention regarding the compulsory presence of K-segments in these proteins. The way of action of these unusual dehydrins remains unrevealed. PMID:25520734

  13. Trade-offs between enzyme fitness and solubility illuminated by deep mutational scanning

    PubMed Central

    Bacik, John-Paul; Wrenbeck, Emily E.; Michalczyk, Ryszard; Whitehead, Timothy A.

    2017-01-01

    Proteins are marginally stable, and an understanding of the sequence determinants for improved protein solubility is highly desired. For enzymes, it is well known that many mutations that increase protein solubility decrease catalytic activity. These competing effects frustrate efforts to design and engineer stable, active enzymes without laborious high-throughput activity screens. To address the trade-off between enzyme solubility and activity, we performed deep mutational scanning using two different screens/selections that purport to gauge protein solubility for two full-length enzymes. We assayed a TEM-1 beta-lactamase variant and levoglucosan kinase (LGK) using yeast surface display (YSD) screening and a twin-arginine translocation pathway selection. We then compared these scans with published experimental fitness landscapes. Results from the YSD screen could explain 37% of the variance in the fitness landscapes for one enzyme. Five percent to 10% of all single missense mutations improve solubility, matching theoretical predictions of global protein stability. For a given solubility-enhancing mutation, the probability that it would retain wild-type fitness was correlated with evolutionary conservation and distance to active site, and anticorrelated with contact number. Hybrid classification models were developed that could predict solubility-enhancing mutations that maintain wild-type fitness with an accuracy of 90%. The downside of using such classification models is the removal of rare mutations that improve both fitness and solubility. To reveal the biophysical basis of enhanced protein solubility and function, we determined the crystallographic structure of one such LGK mutant. Beyond fundamental insights into trade-offs between stability and activity, these results have potential biotechnological applications. PMID:28196882

  14. Comparative study of classification algorithms for immunosignaturing data

    PubMed Central

    2012-01-01

    Background High-throughput technologies such as DNA, RNA, protein, antibody and peptide microarrays are often used to examine differences across drug treatments, diseases, transgenic animals, and others. Typically one trains a classification system by gathering large amounts of probe-level data, selecting informative features, and classifies test samples using a small number of features. As new microarrays are invented, classification systems that worked well for other array types may not be ideal. Expression microarrays, arguably one of the most prevalent array types, have been used for years to help develop classification algorithms. Many biological assumptions are built into classifiers that were designed for these types of data. One of the more problematic is the assumption of independence, both at the probe level and again at the biological level. Probes for RNA transcripts are designed to bind single transcripts. At the biological level, many genes have dependencies across transcriptional pathways where co-regulation of transcriptional units may make many genes appear as being completely dependent. Thus, algorithms that perform well for gene expression data may not be suitable when other technologies with different binding characteristics exist. The immunosignaturing microarray is based on complex mixtures of antibodies binding to arrays of random sequence peptides. It relies on many-to-many binding of antibodies to the random sequence peptides. Each peptide can bind multiple antibodies and each antibody can bind multiple peptides. This technology has been shown to be highly reproducible and appears promising for diagnosing a variety of disease states. However, it is not clear what is the optimal classification algorithm for analyzing this new type of data. Results We characterized several classification algorithms to analyze immunosignaturing data. We selected several datasets that range from easy to difficult to classify, from simple monoclonal binding to complex binding patterns in asthma patients. We then classified the biological samples using 17 different classification algorithms. Using a wide variety of assessment criteria, we found ‘Naïve Bayes’ far more useful than other widely used methods due to its simplicity, robustness, speed and accuracy. Conclusions ‘Naïve Bayes’ algorithm appears to accommodate the complex patterns hidden within multilayered immunosignaturing microarray data due to its fundamental mathematical properties. PMID:22720696

  15. Bioinformatic prediction of G protein-coupled receptor encoding sequences from the transcriptome of the foreleg, including the Haller’s organ, of the cattle tick, Rhipicephalus australis

    PubMed Central

    Munoz, Sergio; Guerrero, Felix D.; Kellogg, Anastasia; Heekin, Andrew M.

    2017-01-01

    The cattle tick of Australia, Rhipicephalus australis, is a vector for microbial parasites that cause serious bovine diseases. The Haller’s organ, located in the tick’s forelegs, is crucial for host detection and mating. To facilitate the development of new technologies for better control of this agricultural pest, we aimed to sequence and annotate the transcriptome of the R. australis forelegs and associated tissues, including the Haller's organ. As G protein-coupled receptors (GPCRs) are an important family of eukaryotic proteins studied as pharmaceutical targets in humans, we prioritized the identification and classification of the GPCRs expressed in the foreleg tissues. The two forelegs from adult R. australis were excised, RNA extracted, and pyrosequenced with 454 technology. Reads were assembled into unigenes and annotated by sequence similarity. Python scripts were written to find open reading frames (ORFs) from each unigene. These ORFs were analyzed by different GPCR prediction approaches based on sequence alignments, support vector machines, hidden Markov models, and principal component analysis. GPCRs consistently predicted by multiple methods were further studied by phylogenetic analysis and 3D homology modeling. From 4,782 assembled unigenes, 40,907 possible ORFs were predicted. Using Blastp, Pfam, GPCRpred, TMHMM, and PCA-GPCR, a basic set of 46 GPCR candidates were compiled and a phylogenetic tree was constructed. With further screening of tertiary structures predicted by RaptorX, 6 likely GPCRs emerged and the strongest candidate was classified by PCA-GPCR to be a GABAB receptor. PMID:28231302

  16. Bioinformatic prediction of G protein-coupled receptor encoding sequences from the transcriptome of the foreleg, including the Haller's organ, of the cattle tick, Rhipicephalus australis.

    PubMed

    Munoz, Sergio; Guerrero, Felix D; Kellogg, Anastasia; Heekin, Andrew M; Leung, Ming-Ying

    2017-01-01

    The cattle tick of Australia, Rhipicephalus australis, is a vector for microbial parasites that cause serious bovine diseases. The Haller's organ, located in the tick's forelegs, is crucial for host detection and mating. To facilitate the development of new technologies for better control of this agricultural pest, we aimed to sequence and annotate the transcriptome of the R. australis forelegs and associated tissues, including the Haller's organ. As G protein-coupled receptors (GPCRs) are an important family of eukaryotic proteins studied as pharmaceutical targets in humans, we prioritized the identification and classification of the GPCRs expressed in the foreleg tissues. The two forelegs from adult R. australis were excised, RNA extracted, and pyrosequenced with 454 technology. Reads were assembled into unigenes and annotated by sequence similarity. Python scripts were written to find open reading frames (ORFs) from each unigene. These ORFs were analyzed by different GPCR prediction approaches based on sequence alignments, support vector machines, hidden Markov models, and principal component analysis. GPCRs consistently predicted by multiple methods were further studied by phylogenetic analysis and 3D homology modeling. From 4,782 assembled unigenes, 40,907 possible ORFs were predicted. Using Blastp, Pfam, GPCRpred, TMHMM, and PCA-GPCR, a basic set of 46 GPCR candidates were compiled and a phylogenetic tree was constructed. With further screening of tertiary structures predicted by RaptorX, 6 likely GPCRs emerged and the strongest candidate was classified by PCA-GPCR to be a GABAB receptor.

  17. Transporter taxonomy - a comparison of different transport protein classification schemes.

    PubMed

    Viereck, Michael; Gaulton, Anna; Digles, Daniela; Ecker, Gerhard F

    2014-06-01

    Currently, there are more than 800 well characterized human membrane transport proteins (including channels and transporters) and there are estimates that about 10% (approx. 2000) of all human genes are related to transport. Membrane transport proteins are of interest as potential drug targets, for drug delivery, and as a cause of side effects and drug–drug interactions. In light of the development of Open PHACTS, which provides an open pharmacological space, we analyzed selected membrane transport protein classification schemes (Transporter Classification Database, ChEMBL, IUPHAR/BPS Guide to Pharmacology, and Gene Ontology) for their ability to serve as a basis for pharmacology driven protein classification. A comparison of these membrane transport protein classification schemes by using a set of clinically relevant transporters as use-case reveals the strengths and weaknesses of the different taxonomy approaches.

  18. Computational approaches for the classification of seed storage proteins.

    PubMed

    Radhika, V; Rao, V Sree Hari

    2015-07-01

    Seed storage proteins comprise a major part of the protein content of the seed and have an important role on the quality of the seed. These storage proteins are important because they determine the total protein content and have an effect on the nutritional quality and functional properties for food processing. Transgenic plants are being used to develop improved lines for incorporation into plant breeding programs and the nutrient composition of seeds is a major target of molecular breeding programs. Hence, classification of these proteins is crucial for the development of superior varieties with improved nutritional quality. In this study we have applied machine learning algorithms for classification of seed storage proteins. We have presented an algorithm based on nearest neighbor approach for classification of seed storage proteins and compared its performance with decision tree J48, multilayer perceptron neural (MLP) network and support vector machine (SVM) libSVM. The model based on our algorithm has been able to give higher classification accuracy in comparison to the other methods.

  19. Phylogenetic analyses of the genus Aeromonas based on housekeeping gene sequencing and its influence on systematics.

    PubMed

    Navarro, Aaron; Martínez-Murcia, Antonio

    2018-04-19

    The phylogenies derived from housekeeping gene sequence alignments, although mere evolutionary hypotheses, have increased our knowledge about the Aeromonas genetic diversity, providing a robust species delineation framework invaluable for reliable, easy and fast species identification. Previous classifications of Aeromonas, have been fully surpassed by recently developed phylogenetic (natural) classification obtained from the analysis of so-called "molecular chronometers". Despite ribosomal RNAs cannot split all known Aeromonas species, the conserved nature of 16S rRNA offers reliable alignments containing mosaics of sequence signatures which may serve as targets of genus-specific oligonucleotides for subsequent identification/detection tests in samples without culturing. On the contrary, some housekeeping genes coding for proteins show a much better chronometric capacity to discriminate highly related strains. Although both, species and loci, do not all evolve at exactly the same rate, published Aeromonas phylogenies were congruent to each other, indicating that, phylogenetic markers are synchronized and a concatenated multi-gene phylogeny, may be "the mirror" of the entire genomic relationships. Thanks to MLPA approaches, the discovery of new Aeromonas species and strains of rarely isolated species is today more frequent and, consequently, should be extensively promoted for isolate screening and species identification. Although, accumulated data still should be carefully catalogued to inherit a reliable database. This article is protected by copyright. All rights reserved. This article is protected by copyright. All rights reserved.

  20. Characterization and Prediction of Protein Phosphorylation Hotspots in Arabidopsis thaliana.

    PubMed

    Christian, Jan-Ole; Braginets, Rostyslav; Schulze, Waltraud X; Walther, Dirk

    2012-01-01

    The regulation of protein function by modulating the surface charge status via sequence-locally enriched phosphorylation sites (P-sites) in so called phosphorylation "hotspots" has gained increased attention in recent years. We set out to identify P-hotspots in the model plant Arabidopsis thaliana. We analyzed the spacing of experimentally detected P-sites within peptide-covered regions along Arabidopsis protein sequences as available from the PhosPhAt database. Confirming earlier reports (Schweiger and Linial, 2010), we found that, indeed, P-sites tend to cluster and that distributions between serine and threonine P-sites to their respected closest next P-site differ significantly from those for tyrosine P-sites. The ability to predict P-hotspots by applying available computational P-site prediction programs that focus on identifying single P-sites was observed to be severely compromised by the inevitable interference of nearby P-sites. We devised a new approach, named HotSPotter, for the prediction of phosphorylation hotspots. HotSPotter is based primarily on local amino acid compositional preferences rather than sequence position-specific motifs and uses support vector machines as the underlying classification engine. HotSPotter correctly identified experimentally determined phosphorylation hotspots in A. thaliana with high accuracy. Applied to the Arabidopsis proteome, HotSPotter-predicted 13,677 candidate P-hotspots in 9,599 proteins corresponding to 7,847 unique genes. Hotspot containing proteins are involved predominantly in signaling processes confirming the surmised modulating role of hotspots in signaling and interaction events. Our study provides new bioinformatics means to identify phosphorylation hotspots and lays the basis for further investigating novel candidate P-hotspots. All phosphorylation hotspot annotations and predictions have been made available as part of the PhosPhAt database at http://phosphat.mpimp-golm.mpg.de.

  1. A cilevirus infects ornamental hibiscus in Hawaii

    PubMed Central

    Melzer, Michael J.; Simbajon, Nelson; Carillo, James; Borth, Wayne B.; Freitas-Astúa, Juliana; Kitajima, Elliot W.; Neupane, Kabi R.; Hu, John S.

    2013-01-01

    The complete nucleotide sequence of a virus infecting ornamental hibiscus (Hibiscus sp.) in Hawaii with symptoms of green ringspots on senescing leaves was determined from double-stranded RNA isolated from symptomatic tissue. Excluding polyadenylated regions at the 3′ termini, the bipartite RNA genome was 8748 and 5019 nt in length for RNA1 and RNA2, respectively. The genome organization was typical of a cilevirus: RNA1 encoded a large replication-associated protein with methyltransferase, protease, helicase and RNA-dependent RNA polymerase domains as well as a 29-kDa protein of unknown function. RNA2 possessed five open reading frames that potentially encoded proteins with molecular masses of 15, 7, 62, 32, and 24 kDa. The 32-kDa protein is homologous to 3A movement proteins of RNA viruses; the other proteins are of unknown function. A proteome comparison revealed that this virus was 92% identical to citrus leprosis virus cytoplasmic type 2 (CiLV-C2), a recently characterized cilevirus infecting citrus with leprosis-like symptoms in Colombia. The high sequence similarity suggests that the virus described in this study could be a strain of CiLV-C2, but since the new genus Cilevirus does not have species demarcation criteria established at present, the classification of this virus infecting hibiscus is open to interpretation. This study represents the first documented case of a cilevirus established in the United States and provides insight into the diversity within the genus Cilevirus. PMID:23732930

  2. [Entification of the Rubella virus genotype 1H in Western Siberia].

    PubMed

    Seregin, S V; Babkin, I V; Petrova, I D; Iashina, L N; Malkova, E M; Petrov, V S

    2011-01-01

    Molecular epidemiological study of novel strain of Rubella virus isolated during the outbreak in Western Siberia in 2004 was described. Detailed phylogenetic analysis performed based upon entire SP-region, which encodes all three Rubella structural proteins (C, E2, and E1), was implemented. This analysis provides characterization of this strain and classifies it as 1H genotype, thereby correcting previous classification of this strain based upon shorter nucleotide sequence, only encoding E1 protein. Therefore, this study identified the genotype of the Rubella virus not previously detected in Western Siberia (and even entire Russian Federation), which highlights the importance of more extensive characterization of genetic variability of the Rubella virus, especially with regard to potential influence of vaccination on the Rubella virus mutagenesis.

  3. The Classification and Evolution of Enzyme Function.

    PubMed

    Martínez Cuesta, Sergio; Rahman, Syed Asad; Furnham, Nicholas; Thornton, Janet M

    2015-09-15

    Enzymes are the proteins responsible for the catalysis of life. Enzymes sharing a common ancestor as defined by sequence and structure similarity are grouped into families and superfamilies. The molecular function of enzymes is defined as their ability to catalyze biochemical reactions; it is manually classified by the Enzyme Commission and robust approaches to quantitatively compare catalytic reactions are just beginning to appear. Here, we present an overview of studies at the interface of the evolution and function of enzymes. Copyright © 2015 Biophysical Society. Published by Elsevier Inc. All rights reserved.

  4. NetTurnP – Neural Network Prediction of Beta-turns by Use of Evolutionary Information and Predicted Protein Sequence Features

    PubMed Central

    Petersen, Bent; Lundegaard, Claus; Petersen, Thomas Nordahl

    2010-01-01

    β-turns are the most common type of non-repetitive structures, and constitute on average 25% of the amino acids in proteins. The formation of β-turns plays an important role in protein folding, protein stability and molecular recognition processes. In this work we present the neural network method NetTurnP, for prediction of two-class β-turns and prediction of the individual β-turn types, by use of evolutionary information and predicted protein sequence features. It has been evaluated against a commonly used dataset BT426, and achieves a Matthews correlation coefficient of 0.50, which is the highest reported performance on a two-class prediction of β-turn and not-β-turn. Furthermore NetTurnP shows improved performance on some of the specific β-turn types. In the present work, neural network methods have been trained to predict β-turn or not and individual β-turn types from the primary amino acid sequence. The individual β-turn types I, I', II, II', VIII, VIa1, VIa2, VIba and IV have been predicted based on classifications by PROMOTIF, and the two-class prediction of β-turn or not is a superset comprised of all β-turn types. The performance is evaluated using a golden set of non-homologous sequences known as BT426. Our two-class prediction method achieves a performance of: MCC  = 0.50, Qtotal = 82.1%, sensitivity  = 75.6%, PPV  = 68.8% and AUC  = 0.864. We have compared our performance to eleven other prediction methods that obtain Matthews correlation coefficients in the range of 0.17 – 0.47. For the type specific β-turn predictions, only type I and II can be predicted with reasonable Matthews correlation coefficients, where we obtain performance values of 0.36 and 0.31, respectively. Conclusion The NetTurnP method has been implemented as a webserver, which is freely available at http://www.cbs.dtu.dk/services/NetTurnP/. NetTurnP is the only available webserver that allows submission of multiple sequences. PMID:21152409

  5. NetTurnP--neural network prediction of beta-turns by use of evolutionary information and predicted protein sequence features.

    PubMed

    Petersen, Bent; Lundegaard, Claus; Petersen, Thomas Nordahl

    2010-11-30

    β-turns are the most common type of non-repetitive structures, and constitute on average 25% of the amino acids in proteins. The formation of β-turns plays an important role in protein folding, protein stability and molecular recognition processes. In this work we present the neural network method NetTurnP, for prediction of two-class β-turns and prediction of the individual β-turn types, by use of evolutionary information and predicted protein sequence features. It has been evaluated against a commonly used dataset BT426, and achieves a Matthews correlation coefficient of 0.50, which is the highest reported performance on a two-class prediction of β-turn and not-β-turn. Furthermore NetTurnP shows improved performance on some of the specific β-turn types. In the present work, neural network methods have been trained to predict β-turn or not and individual β-turn types from the primary amino acid sequence. The individual β-turn types I, I', II, II', VIII, VIa1, VIa2, VIba and IV have been predicted based on classifications by PROMOTIF, and the two-class prediction of β-turn or not is a superset comprised of all β-turn types. The performance is evaluated using a golden set of non-homologous sequences known as BT426. Our two-class prediction method achieves a performance of: MCC=0.50, Qtotal=82.1%, sensitivity=75.6%, PPV=68.8% and AUC=0.864. We have compared our performance to eleven other prediction methods that obtain Matthews correlation coefficients in the range of 0.17-0.47. For the type specific β-turn predictions, only type I and II can be predicted with reasonable Matthews correlation coefficients, where we obtain performance values of 0.36 and 0.31, respectively. The NetTurnP method has been implemented as a webserver, which is freely available at http://www.cbs.dtu.dk/services/NetTurnP/. NetTurnP is the only available webserver that allows submission of multiple sequences.

  6. Structural and Functional Insights into WRKY3 and WRKY4 Transcription Factors to Unravel the WRKY–DNA (W-Box) Complex Interaction in Tomato (Solanum lycopersicum L.). A Computational Approach

    PubMed Central

    Aamir, Mohd; Singh, Vinay K.; Meena, Mukesh; Upadhyay, Ram S.; Gupta, Vijai K.; Singh, Surendra

    2017-01-01

    The WRKY transcription factors (TFs), play crucial role in plant defense response against various abiotic and biotic stresses. The role of WRKY3 and WRKY4 genes in plant defense response against necrotrophic pathogens is well-reported. However, their functional annotation in tomato is largely unknown. In the present work, we have characterized the structural and functional attributes of the two identified tomato WRKY transcription factors, WRKY3 (SlWRKY3), and WRKY4 (SlWRKY4) using computational approaches. Arabidopsis WRKY3 (AtWRKY3: NP_178433) and WRKY4 (AtWRKY4: NP_172849) protein sequences were retrieved from TAIR database and protein BLAST was done for finding their sequential homologs in tomato. Sequence alignment, phylogenetic classification, and motif composition analysis revealed the remarkable sequential variation between, these two WRKYs. The tomato WRKY3 and WRKY4 clusters with Solanum pennellii showing the monophyletic origin and evolution from their wild homolog. The functional domain region responsible for sequence specific DNA-binding occupied in both proteins were modeled [using AtWRKY4 (PDB ID:1WJ2) and AtWRKY1 (PDBID:2AYD) as template protein structures] through homology modeling using Discovery Studio 3.0. The generated models were further evaluated for their accuracy and reliability based on qualitative and quantitative parameters. The modeled proteins were found to satisfy all the crucial energy parameters and showed acceptable Ramachandran statistics when compared to the experimentally resolved NMR solution structures and/or X-Ray diffracted crystal structures (templates). The superimposition of the functional WRKY domains from SlWRKY3 and SlWRKY4 revealed remarkable structural similarity. The sequence specific DNA binding for two WRKYs was explored through DNA-protein interaction using Hex Docking server. The interaction studies found that SlWRKY4 binds with the W-box DNA through WRKYGQK with Tyr408, Arg409, and Lys419 with the initial flanking sequences also get involved in binding. In contrast, the SlWRKY3 made interaction with RKYGQK along with the residues from zinc finger motifs. Protein-protein interactions studies were done using STRING version 10.0 to explore all the possible protein partners involved in associative functional interaction networks. The Gene ontology enrichment analysis revealed the functional dimension and characterized the identified WRKYs based on their functional annotation. PMID:28611792

  7. iHSP-PseRAAAC: Identifying the heat shock protein families using pseudo reduced amino acid alphabet composition.

    PubMed

    Feng, Peng-Mian; Chen, Wei; Lin, Hao; Chou, Kuo-Chen

    2013-11-01

    Heat shock proteins (HSPs) are a type of functionally related proteins present in all living organisms, both prokaryotes and eukaryotes. They play essential roles in protein-protein interactions such as folding and assisting in the establishment of proper protein conformation and prevention of unwanted protein aggregation. Their dysfunction may cause various life-threatening disorders, such as Parkinson's, Alzheimer's, and cardiovascular diseases. Based on their functions, HSPs are usually classified into six families: (i) HSP20 or sHSP, (ii) HSP40 or J-class proteins, (iii) HSP60 or GroEL/ES, (iv) HSP70, (v) HSP90, and (vi) HSP100. Although considerable progress has been achieved in discriminating HSPs from other proteins, it is still a big challenge to identify HSPs among their six different functional types according to their sequence information alone. With the avalanche of protein sequences generated in the post-genomic age, it is highly desirable to develop a high-throughput computational tool in this regard. To take up such a challenge, a predictor called iHSP-PseRAAAC has been developed by incorporating the reduced amino acid alphabet information into the general form of pseudo amino acid composition. One of the remarkable advantages of introducing the reduced amino acid alphabet is being able to avoid the notorious dimension disaster or overfitting problem in statistical prediction. It was observed that the overall success rate achieved by iHSP-PseRAAAC in identifying the functional types of HSPs among the aforementioned six types was more than 87%, which was derived by the jackknife test on a stringent benchmark dataset in which none of HSPs included has ≥40% pairwise sequence identity to any other in the same subset. It has not escaped our notice that the reduced amino acid alphabet approach can also be used to investigate other protein classification problems. As a user-friendly web server, iHSP-PseRAAAC is accessible to the public at http://lin.uestc.edu.cn/server/iHSP-PseRAAAC. Copyright © 2013 Elsevier Inc. All rights reserved.

  8. Adaptive Evolution of Eel Fluorescent Proteins from Fatty Acid Binding Proteins Produces Bright Fluorescence in the Marine Environment.

    PubMed

    Gruber, David F; Gaffney, Jean P; Mehr, Shaadi; DeSalle, Rob; Sparks, John S; Platisa, Jelena; Pieribone, Vincent A

    2015-01-01

    We report the identification and characterization of two new members of a family of bilirubin-inducible fluorescent proteins (FPs) from marine chlopsid eels and demonstrate a key region of the sequence that serves as an evolutionary switch from non-fluorescent to fluorescent fatty acid-binding proteins (FABPs). Using transcriptomic analysis of two species of brightly fluorescent Kaupichthys eels (Kaupichthys hyoproroides and Kaupichthys n. sp.), two new FPs were identified, cloned and characterized (Chlopsid FP I and Chlopsid FP II). We then performed phylogenetic analysis on 210 FABPs, spanning 16 vertebrate orders, and including 163 vertebrate taxa. We show that the fluorescent FPs diverged as a protein family and are the sister group to brain FABPs. Our results indicate that the evolution of this family involved at least three gene duplication events. We show that fluorescent FABPs possess a unique, conserved tripeptide Gly-Pro-Pro sequence motif, which is not found in non-fluorescent fatty acid binding proteins. This motif arose from a duplication event of the FABP brain isoforms and was under strong purifying selection, leading to the classification of this new FP family. Residues adjacent to the motif are under strong positive selection, suggesting a further refinement of the eel protein's fluorescent properties. We present a phylogenetic reconstruction of this emerging FP family and describe additional fluorescent FABP members from groups of distantly related eels. The elucidation of this class of fish FPs with diverse properties provides new templates for the development of protein-based fluorescent tools. The evolutionary adaptation from fatty acid-binding proteins to fluorescent fatty acid-binding proteins raises intrigue as to the functional role of bright green fluorescence in this cryptic genus of reclusive eels that inhabit a blue, nearly monochromatic, marine environment.

  9. A comprehensive bioinformatic analysis of hepatitis D virus full-length genomes.

    PubMed

    Delfino, C M; Cerrudo, C S; Biglione, M; Oubiña, J R; Ghiringhelli, P D; Mathet, V L

    2018-02-06

    In association with hepatitis B virus (HBV), hepatitis delta virus (HDV) is a subviral agent that may promote severe acute and chronic forms of liver disease. Based on the percentage of nucleotide identity of the genome, HDV was initially classified into three genotypes. However, since 2006, the original classification has been further expanded into eight clades/genotypes. The intergenotype divergence may be as high as 35%-40% over the entire RNA genome, whereas sequence heterogeneity among the isolates of a given genotype is <20%; furthermore, HDV recombinants have been clearly demonstrated. The genetic diversity of HDV is related to the geographic origin of the isolates. This study shows the first comprehensive bioinformatic analysis of the complete available set of HDV sequences, using both nucleotide and protein phylogenies (based on an evolutionary model selection, gamma distribution estimation, tree inference and phylogenetic distance estimation), protein composition analysis and comparison (based on the presence of invariant residues, molecular signatures, amino acid frequencies and mono- and di-amino acid compositional distances), as well as amino acid changes in sequence evolution. Taking into account the congruent and consistent results of both nucleotide and amino acid analyses of GenBank available sequences (recorded as of January, 2017), we propose that the eight hepatitis D virus genotypes may be grouped into three large genogroups fully supported by their shared characteristics. © 2018 John Wiley & Sons Ltd.

  10. The value of protein structure classification information—Surveying the scientific literature

    PubMed Central

    Fox, Naomi K.; Brenner, Steven E.

    2015-01-01

    ABSTRACT The Structural Classification of Proteins (SCOP) and Class, Architecture, Topology, Homology (CATH) databases have been valuable resources for protein structure classification for over 20 years. Development of SCOP (version 1) concluded in June 2009 with SCOP 1.75. The SCOPe (SCOP–extended) database offers continued development of the classic SCOP hierarchy, adding over 33,000 structures. We have attempted to assess the impact of these two decade old resources and guide future development. To this end, we surveyed recent articles to learn how structure classification data are used. Of 571 articles published in 2012–2013 that cite SCOP, 439 actually use data from the resource. We found that the type of use was fairly evenly distributed among four top categories: A) study protein structure or evolution (27% of articles), B) train and/or benchmark algorithms (28% of articles), C) augment non‐SCOP datasets with SCOP classification (21% of articles), and D) examine the classification of one protein/a small set of proteins (22% of articles). Most articles described computational research, although 11% described purely experimental research, and a further 9% included both. We examined how CATH and SCOP were used in 158 articles that cited both databases: while some studies used only one dataset, the majority used data from both resources. Protein structure classification remains highly relevant for a diverse range of problems and settings. Proteins 2015; 83:2025–2038. © 2015 The Authors. Proteins: Structure, Function, and Bioinformatics Published by Wiley Periodicals, Inc. PMID:26313554

  11. Characterization of the Complete Mitochondrial Genome Sequence of Spirometra erinaceieuropaei (Cestoda: Diphyllobothriidae) from China

    PubMed Central

    Liu, Guo-Hua; Li, Chun; Li, Jia-Yuan; Zhou, Dong-Hui; Xiong, Rong-Chuan; Lin, Rui-Qing; Zou, Feng-Cai; Zhu, Xing-Quan

    2012-01-01

    Sparganosis, caused by the plerocercoid larvae of members of the genus Spirometra, can cause significant public health problem and considerable economic losses. In the present study, the complete mitochondrial DNA (mtDNA) sequence of Spirometra erinaceieuropaei from China was determined, characterized and compared with that of S. erinaceieuropaei from Japan. The gene arrangement in the mt genome sequences of S. erinaceieuropaei from China and Japan is identical. The identity of the mt genomes was 99.1% between S. erinaceieuropaei from China and Japan, and the complete mtDNA sequence of S. erinaceieuropaei from China is slightly shorter (2 bp) than that from Japan. Phylogenetic analysis of S. erinaceieuropaei with other representative cestodes using two different computational algorithms [Bayesian inference (BI) and maximum likelihood (ML)] based on concatenated amino acid sequences of 12 protein-coding genes, revealed that S. erinaceieuropaei is closely related to Diphyllobothrium spp., supporting classification based on morphological features. The present study determined the complete mtDNA sequences of S. erinaceieuropaei from China that provides novel genetic markers for studying the population genetics and molecular epidemiology of S. erinaceieuropaei in humans and animals. PMID:22553464

  12. Supervised DNA Barcodes species classification: analysis, comparisons and results

    PubMed Central

    2014-01-01

    Background Specific fragments, coming from short portions of DNA (e.g., mitochondrial, nuclear, and plastid sequences), have been defined as DNA Barcode and can be used as markers for organisms of the main life kingdoms. Species classification with DNA Barcode sequences has been proven effective on different organisms. Indeed, specific gene regions have been identified as Barcode: COI in animals, rbcL and matK in plants, and ITS in fungi. The classification problem assigns an unknown specimen to a known species by analyzing its Barcode. This task has to be supported with reliable methods and algorithms. Methods In this work the efficacy of supervised machine learning methods to classify species with DNA Barcode sequences is shown. The Weka software suite, which includes a collection of supervised classification methods, is adopted to address the task of DNA Barcode analysis. Classifier families are tested on synthetic and empirical datasets belonging to the animal, fungus, and plant kingdoms. In particular, the function-based method Support Vector Machines (SVM), the rule-based RIPPER, the decision tree C4.5, and the Naïve Bayes method are considered. Additionally, the classification results are compared with respect to ad-hoc and well-established DNA Barcode classification methods. Results A software that converts the DNA Barcode FASTA sequences to the Weka format is released, to adapt different input formats and to allow the execution of the classification procedure. The analysis of results on synthetic and real datasets shows that SVM and Naïve Bayes outperform on average the other considered classifiers, although they do not provide a human interpretable classification model. Rule-based methods have slightly inferior classification performances, but deliver the species specific positions and nucleotide assignments. On synthetic data the supervised machine learning methods obtain superior classification performances with respect to the traditional DNA Barcode classification methods. On empirical data their classification performances are at a comparable level to the other methods. Conclusions The classification analysis shows that supervised machine learning methods are promising candidates for handling with success the DNA Barcoding species classification problem, obtaining excellent performances. To conclude, a powerful tool to perform species identification is now available to the DNA Barcoding community. PMID:24721333

  13. Practical application of self-organizing maps to interrelate biodiversity and functional data in NGS-based metagenomics.

    PubMed

    Weber, Marc; Teeling, Hanno; Huang, Sixing; Waldmann, Jost; Kassabgy, Mariette; Fuchs, Bernhard M; Klindworth, Anna; Klockow, Christine; Wichels, Antje; Gerdts, Gunnar; Amann, Rudolf; Glöckner, Frank Oliver

    2011-05-01

    Next-generation sequencing (NGS) technologies have enabled the application of broad-scale sequencing in microbial biodiversity and metagenome studies. Biodiversity is usually targeted by classifying 16S ribosomal RNA genes, while metagenomic approaches target metabolic genes. However, both approaches remain isolated, as long as the taxonomic and functional information cannot be interrelated. Techniques like self-organizing maps (SOMs) have been applied to cluster metagenomes into taxon-specific bins in order to link biodiversity with functions, but have not been applied to broad-scale NGS-based metagenomics yet. Here, we provide a novel implementation, demonstrate its potential and practicability, and provide a web-based service for public usage. Evaluation with published data sets mimicking varyingly complex habitats resulted into classification specificities and sensitivities of close to 100% to above 90% from phylum to genus level for assemblies exceeding 8 kb for low and medium complexity data. When applied to five real-world metagenomes of medium complexity from direct pyrosequencing of marine subsurface waters, classifications of assemblies above 2.5 kb were in good agreement with fluorescence in situ hybridizations, indicating that biodiversity was mostly retained within the metagenomes, and confirming high classification specificities. This was validated by two protein-based classifications (PBCs) methods. SOMs were able to retrieve the relevant taxa down to the genus level, while surpassing PBCs in resolution. In order to make the approach accessible to a broad audience, we implemented a feature-rich web-based SOM application named TaxSOM, which is freely available at http://www.megx.net/toolbox/taxsom. TaxSOM can classify reads or assemblies exceeding 2.5 kb with high accuracy and thus assists in linking biodiversity and functions in metagenome studies, which is a precondition to study microbial ecology in a holistic fashion.

  14. mirVAFC: A Web Server for Prioritizations of Pathogenic Sequence Variants from Exome Sequencing Data via Classifications.

    PubMed

    Li, Zhongshan; Liu, Zhenwei; Jiang, Yi; Chen, Denghui; Ran, Xia; Sun, Zhong Sheng; Wu, Jinyu

    2017-01-01

    Exome sequencing has been widely used to identify the genetic variants underlying human genetic disorders for clinical diagnoses, but the identification of pathogenic sequence variants among the huge amounts of benign ones is complicated and challenging. Here, we describe a new Web server named mirVAFC for pathogenic sequence variants prioritizations from clinical exome sequencing (CES) variant data of single individual or family. The mirVAFC is able to comprehensively annotate sequence variants, filter out most irrelevant variants using custom criteria, classify variants into different categories as for estimated pathogenicity, and lastly provide pathogenic variants prioritizations based on classifications and mutation effects. Case studies using different types of datasets for different diseases from publication and our in-house data have revealed that mirVAFC can efficiently identify the right pathogenic candidates as in original work in each case. Overall, the Web server mirVAFC is specifically developed for pathogenic sequence variant identifications from family-based CES variants using classification-based prioritizations. The mirVAFC Web server is freely accessible at https://www.wzgenomics.cn/mirVAFC/. © 2016 WILEY PERIODICALS, INC.

  15. Characterization and Construction of Functional cDNA Clones of Pariacoto Virus, the First Alphanodavirus Isolated outside Australasia

    PubMed Central

    Johnson, Karyn N.; Zeddam, Jean-Louis; Ball, L. Andrew

    2000-01-01

    Pariacoto virus (PaV) was recently isolated in Peru from the Southern armyworm (Spodoptera eridania). PaV particles are isometric, nonenveloped, and about 30 nm in diameter. The virus has a bipartite RNA genome and a single major capsid protein with a molecular mass of 39.0 kDa, features that support its classification as a Nodavirus. As such, PaV is the first Alphanodavirus to have been isolated from outside Australasia. Here we report that PaV replicates in wax moth larvae and that PaV genomic RNAs replicate when transfected into cultured baby hamster kidney cells. The complete nucleotide sequences of both segments of the bipartite RNA genome were determined. The larger genome segment, RNA1, is 3,011 nucleotides long and contains a 973-amino-acid open reading frame (ORF) encoding protein A, the viral contribution to the RNA replicase. During replication, a 414-nucleotide long subgenomic RNA (RNA3) is synthesized which is coterminal with the 3′ end of RNA1. RNA3 contains a small ORF which could encode a protein of 90 amino acids similar to the B2 protein of other alphanodaviruses. RNA2 contains 1,311 nucleotides and encodes the 401 amino acids of the capsid protein precursor α. The amino acid sequences of the PaV capsid protein and the replicase subunit share 41 and 26% identity with homologous proteins of Flock house virus, the best characterized of the alphanodaviruses. These and other sequence comparisons indicate that PaV is evolutionarily the most distant of the alphanodaviruses described to date, consistent with its novel geographic origin. Although the PaV capsid precursor is cleaved into the two mature capsid proteins β and γ, the amino acid sequence at the cleavage site, which is Asn/Ala in all other alphanodaviruses, is Asn/Ser in PaV. To facilitate the investigation of PaV replication in cultured cells, we constructed plasmids that transcribed full-length PaV RNAs with authentic 5′ and 3′ termini. Transcription of these plasmids in cells recreated the replication of PaV RNA1 and RNA2, synthesis of subgenomic RNA3, and translation of viral proteins A and α. PMID:10799587

  16. Characterization and construction of functional cDNA clones of Pariacoto virus, the first Alphanodavirus isolated outside Australasia.

    PubMed

    Johnson, K N; Zeddam, J L; Ball, L A

    2000-06-01

    Pariacoto virus (PaV) was recently isolated in Peru from the Southern armyworm (Spodoptera eridania). PaV particles are isometric, nonenveloped, and about 30 nm in diameter. The virus has a bipartite RNA genome and a single major capsid protein with a molecular mass of 39.0 kDa, features that support its classification as a Nodavirus. As such, PaV is the first Alphanodavirus to have been isolated from outside Australasia. Here we report that PaV replicates in wax moth larvae and that PaV genomic RNAs replicate when transfected into cultured baby hamster kidney cells. The complete nucleotide sequences of both segments of the bipartite RNA genome were determined. The larger genome segment, RNA1, is 3,011 nucleotides long and contains a 973-amino-acid open reading frame (ORF) encoding protein A, the viral contribution to the RNA replicase. During replication, a 414-nucleotide long subgenomic RNA (RNA3) is synthesized which is coterminal with the 3' end of RNA1. RNA3 contains a small ORF which could encode a protein of 90 amino acids similar to the B2 protein of other alphanodaviruses. RNA2 contains 1,311 nucleotides and encodes the 401 amino acids of the capsid protein precursor alpha. The amino acid sequences of the PaV capsid protein and the replicase subunit share 41 and 26% identity with homologous proteins of Flock house virus, the best characterized of the alphanodaviruses. These and other sequence comparisons indicate that PaV is evolutionarily the most distant of the alphanodaviruses described to date, consistent with its novel geographic origin. Although the PaV capsid precursor is cleaved into the two mature capsid proteins beta and gamma, the amino acid sequence at the cleavage site, which is Asn/Ala in all other alphanodaviruses, is Asn/Ser in PaV. To facilitate the investigation of PaV replication in cultured cells, we constructed plasmids that transcribed full-length PaV RNAs with authentic 5' and 3' termini. Transcription of these plasmids in cells recreated the replication of PaV RNA1 and RNA2, synthesis of subgenomic RNA3, and translation of viral proteins A and alpha.

  17. The Hsp40 proteins of Plasmodium falciparum and other apicomplexa: regulating chaperone power in the parasite and the host.

    PubMed

    Botha, M; Pesce, E-R; Blatch, G L

    2007-01-01

    Extensive structural and functional remodelling of Plasmodium falciparum (malaria)-infected erythrocytes follows the export of a range of proteins of parasite origin (exportome) across the parasitophorous vacuole into the host erythrocyte. The genome of P. falciparum encodes a diverse chaperone complement including at least 43 members of the heat shock protein 40kDa (Hsp40) family, and six members of the heat shock protein 70kDa (Hsp70) family. Nearly half of the Hsp40 proteins of P. falciparum are predicted to contain a PEXEL/HT (Plasmodium export element/host targeting signal) sequence motif, and hence are likely to be part of the exportome. In this review we critically evaluate the classification, sequence similarity and clustering, and possible interactors of the P. falciparum Hsp40 chaperone machinery. In addition to the types I, II and III Hsp40 proteins all exhibiting the signature J-domain, the P. falciparum genome also encodes a number of specialized Hsp40 proteins with a J-like domain, which we have categorized as type IV Hsp40 proteins. Analysis of the potential P. falciparum Hsp40 protein interaction network revealed connections predominantly with cytoskeletal and membrane proteins, transcriptional machinery, DNA repair and replication machinery, translational machinery, the proteasome and proteolytic enzymes, and enzymes involved in cellular physiology. Comparison of the Hsp40 proteins of P. falciparum to those of other apicomplexa reveals that most of the proteins (especially the PEXEL/HT-containing proteins) are unique to P. falciparum. Furthermore, very few of the P. falciparum Hsp40 proteins have human homologs, except for those proteins implicated in fundamental biological processes. Our analysis suggests that P. falciparum has evolved an expanded and specialized Hsp40 protein machinery to enable it successfully to invade and remodel the human erythrocyte, and we propose a model in which these proteins are involved in chaperone-mediated translocation, folding, assembly and regulation of parasite and host proteins.

  18. A survey of transposable element classification systems--a call for a fundamental update to meet the challenge of their diversity and complexity.

    PubMed

    Piégu, Benoît; Bire, Solenne; Arensburger, Peter; Bigot, Yves

    2015-05-01

    The increase of publicly available sequencing data has allowed for rapid progress in our understanding of genome composition. As new information becomes available we should constantly be updating and reanalyzing existing and newly acquired data. In this report we focus on transposable elements (TEs) which make up a significant portion of nearly all sequenced genomes. Our ability to accurately identify and classify these sequences is critical to understanding their impact on host genomes. At the same time, as we demonstrate in this report, problems with existing classification schemes have led to significant misunderstandings of the evolution of both TE sequences and their host genomes. In a pioneering publication Finnegan (1989) proposed classifying all TE sequences into two classes based on transposition mechanisms and structural features: the retrotransposons (class I) and the DNA transposons (class II). We have retraced how ideas regarding TE classification and annotation in both prokaryotic and eukaryotic scientific communities have changed over time. This has led us to observe that: (1) a number of TEs have convergent structural features and/or transposition mechanisms that have led to misleading conclusions regarding their classification, (2) the evolution of TEs is similar to that of viruses by having several unrelated origins, (3) there might be at least 8 classes and 12 orders of TEs including 10 novel orders. In an effort to address these classification issues we propose: (1) the outline of a universal TE classification, (2) a set of methods and classification rules that could be used by all scientific communities involved in the study of TEs, and (3) a 5-year schedule for the establishment of an International Committee for Taxonomy of Transposable Elements (ICTTE). Copyright © 2015 Elsevier Inc. All rights reserved.

  19. Sequence variations in RepMP2/3 and RepMP4 elements reveal intragenomic homologous DNA recombination events in Mycoplasma pneumoniae.

    PubMed

    Spuesens, Emiel B M; Oduber, Minoushka; Hoogenboezem, Theo; Sluijter, Marcel; Hartwig, Nico G; van Rossum, Annemarie M C; Vink, Cornelis

    2009-07-01

    The gene encoding major adhesin protein P1 of Mycoplasma pneumoniae, MPN141, contains two DNA sequence stretches, designated RepMP2/3 and RepMP4, which display variation among strains. This variation allows strains to be differentiated into two major P1 genotypes (1 and 2) and several variants. Interestingly, multiple versions of the RepMP2/3 and RepMP4 elements exist at other sites within the bacterial genome. Because these versions are closely related in sequence, but not identical, it has been hypothesized that they have the capacity to recombine with their counterparts within MPN141, and thereby serve as a source of sequence variation of the P1 protein. In order to determine the variation within the RepMP2/3 and RepMP4 elements, both within the bacterial genome and among strains, we analysed the DNA sequences of all RepMP2/3 and RepMP4 elements within the genomes of 23 M. pneumoniae strains. Our data demonstrate that: (i) recombination is likely to have occurred between two RepMP2/3 elements in four of the strains, and (ii) all previously described P1 genotypes can be explained by inter-RepMP recombination events. Moreover, the difference between the two major P1 genotypes was reflected in all RepMP elements, such that subtype 1 and 2 strains can be differentiated on the basis of sequence variation in each RepMP element. This implies that subtype 1 and subtype 2 strains represent evolutionarily diverged strain lineages. Finally, a classification scheme is proposed in which the P1 genotype of M. pneumoniae isolates can be described in a sequence-based, universal fashion.

  20. Impact of recent molecular phylogenetic studies on classification of ascomycete yeasts

    USDA-ARS?s Scientific Manuscript database

    Analyses of concatenated gene sequences as well as whole genome sequences are resolving relationships among the ascomycete yeasts (Saccharomycotina), thus allowing classification of members of this subphylum to be based on phylogeny. In addition, changes implemented in the new Botanical Code [Intern...

  1. Structural parameterization and functional prediction of antigenic polypeptome sequences with biological activity through quantitative sequence-activity models (QSAM) by molecular electronegativity edge-distance vector (VMED).

    PubMed

    Li, ZhiLiang; Wu, ShiRong; Chen, ZeCong; Ye, Nancy; Yang, ShengXi; Liao, ChunYang; Zhang, MengJun; Yang, Li; Mei, Hu; Yang, Yan; Zhao, Na; Zhou, Yuan; Zhou, Ping; Xiong, Qing; Xu, Hong; Liu, ShuShen; Ling, ZiHua; Chen, Gang; Li, GenRong

    2007-10-01

    Only from the primary structures of peptides, a new set of descriptors called the molecular electronegativity edge-distance vector (VMED) was proposed and applied to describing and characterizing the molecular structures of oligopeptides and polypeptides, based on the electronegativity of each atom or electronic charge index (ECI) of atomic clusters and the bonding distance between atom-pairs. Here, the molecular structures of antigenic polypeptides were well expressed in order to propose the automated technique for the computerized identification of helper T lymphocyte (Th) epitopes. Furthermore, a modified MED vector was proposed from the primary structures of polypeptides, based on the ECI and the relative bonding distance of the fundamental skeleton groups. The side-chains of each amino acid were here treated as a pseudo-atom. The developed VMED was easy to calculate and able to work. Some quantitative model was established for 28 immunogenic or antigenic polypeptides (AGPP) with 14 (1-14) A(d) and 14 other restricted activities assigned as "1"(+) and "0"(-), respectively. The latter comprised 6 A(b)(15-20), 3 A(k)(21-23), 2 E(k)(24-26), 2 H-2(k)(27 and 28) restricted sequences. Good results were obtained with 90% correct classification (only 2 wrong ones for 20 training samples) and 100% correct prediction (none wrong for 8 testing samples); while contrastively 100% correct classification (none wrong for 20 training samples) and 88% correct classification (1 wrong for 8 testing samples). Both stochastic samplings and cross validations were performed to demonstrate good performance. The described method may also be suitable for estimation and prediction of classes I and II for major histocompatibility antigen (MHC) epitope of human. It will be useful in immune identification and recognition of proteins and genes and in the design and development of subunit vaccines. Several quantitative structure activity relationship (QSAR) models were developed for various oligopeptides and polypeptides including 58 dipeptides and 31 pentapeptides with angiotensin converting enzyme (ACE) inhibition by multiple linear regression (MLR) method. In order to explain the ability to characterize molecular structure of polypeptides, a molecular modeling investigation on QSAR was performed for functional prediction of polypeptide sequences with antigenic activity and heptapeptide sequences with tachykinin activity through quantitative sequence-activity models (QSAMs) by the molecular electronegativity edge-distance vector (VMED). The results showed that VMED exhibited both excellent structural selectivity and good activity prediction. Moreover, the results showed that VMED behaved quite well for both QSAR and QSAM of poly-and oligopeptides, which exhibited both good estimation ability and prediction power, equal to or better than those reported in the previous references. Finally, a preliminary conclusion was drawn: both classical and modified MED vectors were very useful structural descriptors. Some suggestions were proposed for further studies on QSAR/QSAM of proteins in various fields.

  2. A consensus view of fold space: Combining SCOP, CATH, and the Dali Domain Dictionary

    PubMed Central

    Day, Ryan; Beck, David A.C.; Armen, Roger S.; Daggett, Valerie

    2003-01-01

    We have determined consensus protein-fold classifications on the basis of three classification methods, SCOP, CATH, and Dali. These classifications make use of different methods of defining and categorizing protein folds that lead to different views of protein-fold space. Pairwise comparisons of domains on the basis of their fold classifications show that much of the disagreement between the classification systems is due to differing domain definitions rather than assigning the same domain to different folds. However, there are significant differences in the fold assignments between the three systems. These remaining differences can be explained primarily in terms of the breadth of the fold classifications. Many structures may be defined as having one fold in one system, whereas far fewer are defined as having the analogous fold in another system. By comparing these folds for a nonredundant set of proteins, the consensus method breaks up broad fold classifications and combines restrictive fold classifications into metafolds, creating, in effect, an averaged view of fold space. This averaged view requires that the structural similarities between proteins having the same metafold be recognized by multiple classification systems. Thus, the consensus map is useful for researchers looking for fold similarities that are relatively independent of the method used to compare proteins. The 30 most populated metafolds, representing the folds of about half of a nonredundant subset of the PDB, are presented here. The full list of metafolds is presented on the Web. PMID:14500873

  3. A consensus view of fold space: combining SCOP, CATH, and the Dali Domain Dictionary.

    PubMed

    Day, Ryan; Beck, David A C; Armen, Roger S; Daggett, Valerie

    2003-10-01

    We have determined consensus protein-fold classifications on the basis of three classification methods, SCOP, CATH, and Dali. These classifications make use of different methods of defining and categorizing protein folds that lead to different views of protein-fold space. Pairwise comparisons of domains on the basis of their fold classifications show that much of the disagreement between the classification systems is due to differing domain definitions rather than assigning the same domain to different folds. However, there are significant differences in the fold assignments between the three systems. These remaining differences can be explained primarily in terms of the breadth of the fold classifications. Many structures may be defined as having one fold in one system, whereas far fewer are defined as having the analogous fold in another system. By comparing these folds for a nonredundant set of proteins, the consensus method breaks up broad fold classifications and combines restrictive fold classifications into metafolds, creating, in effect, an averaged view of fold space. This averaged view requires that the structural similarities between proteins having the same metafold be recognized by multiple classification systems. Thus, the consensus map is useful for researchers looking for fold similarities that are relatively independent of the method used to compare proteins. The 30 most populated metafolds, representing the folds of about half of a nonredundant subset of the PDB, are presented here. The full list of metafolds is presented on the Web.

  4. Classification of Dynamical Diffusion States in Single Molecule Tracking Microscopy

    PubMed Central

    Bosch, Peter J.; Kanger, Johannes S.; Subramaniam, Vinod

    2014-01-01

    Single molecule tracking of membrane proteins by fluorescence microscopy is a promising method to investigate dynamic processes in live cells. Translating the trajectories of proteins to biological implications, such as protein interactions, requires the classification of protein motion within the trajectories. Spatial information of protein motion may reveal where the protein interacts with cellular structures, because binding of proteins to such structures often alters their diffusion speed. For dynamic diffusion systems, we provide an analytical framework to determine in which diffusion state a molecule is residing during the course of its trajectory. We compare different methods for the quantification of motion to utilize this framework for the classification of two diffusion states (two populations with different diffusion speed). We found that a gyration quantification method and a Bayesian statistics-based method are the most accurate in diffusion-state classification for realistic experimentally obtained datasets, of which the gyration method is much less computationally demanding. After classification of the diffusion, the lifetime of the states can be determined, and images of the diffusion states can be reconstructed at high resolution. Simulations validate these applications. We apply the classification and its applications to experimental data to demonstrate the potential of this approach to obtain further insights into the dynamics of cell membrane proteins. PMID:25099798

  5. The developmental transcriptome of the bamboo snout beetle Cyrtotrachelus buqueti and insights into candidate pheromone-binding proteins

    PubMed Central

    Yang, Wei; Yang, Chunping; Lu, Lin; Chen, Zhangming

    2017-01-01

    Cyrtotrachelus buqueti is an extremely harmful bamboo borer, and the larvae of this pest attack clumping bamboo shoots. Pheromone-binding proteins (PBPs) play an important role in identifying insect sex pheromones, but the C. buqueti genome is not readily available for PBP analysis. Developmental transcriptomes of eggs, larvae from the first instar to the prepupal stage, pupae, and adults (females and males) from emergence to mating were built by RNA sequencing (RNA-Seq) in the present study to establish a sequence background of C. buqueti to help understand PBPs. Approximately 164.8 million clean reads were obtained and annotated into 108,854 transcripts. These were assembled into 24,338, 21,597, 24,798, 21,886, 24,642, and 83,115 unigenes for eggs, larvae, pupae, females, males, and the combined datasets, respectively. Unigenes were annotated against NCBI non-redundant protein sequences, NCBI non-redundant nucleotide sequences, Gene Ontology (GO), Protein family, Clusters of Orthologous Groups of Proteins/ Clusters of Eukaryotic Orthologous Groups (KOG), Swiss-Prot, and KEGG Orthology databases. A total of 17,213 unigenes were annotated into 55 sub-categories belonging to three main GO categories; 10,672 unigenes were classified into 26 functional categories by KOG classification, and 8,063 unigenes were classified into five functional KEGG categories. RSEM software for RNA sequencing showed that 4,816, 3,176, 3,661, 2,898, 4,316, 8,019, 7,273, 5,922, 5,844, and 4,570 genes were differentially expressed between larvae and males, larvae and eggs, larvae and pupae, larvae and females, males and females, males and eggs, males and pupae, females and eggs, females and pupae, and eggs and pupae, respectively. Of these, three were confirmed to be significantly differentially expressed between larvae, females, and males. Furthermore, PBP Cbuq7577_g1 was highly expressed in the antenna of males. A comprehensive sequence resource of a desirable quality was constructed from developmental transcriptomes of C. buqueti eggs, larvae, pupae, and adults. This work enriches the genomic data of C. buqueti, and facilitates our understanding of its metamorphosis, development, and response to environmental change. The identified candidate PBP Cbuq7577_g1 might play a crucial role in identifying sex pheromones, and could be used as a targeted gene to control C. buqueti numbers by disrupting sex pheromone communication. PMID:28662071

  6. Iterative non-sequential protein structural alignment.

    PubMed

    Salem, Saeed; Zaki, Mohammed J; Bystroff, Christopher

    2009-06-01

    Structural similarity between proteins gives us insights into their evolutionary relationships when there is low sequence similarity. In this paper, we present a novel approach called SNAP for non-sequential pair-wise structural alignment. Starting from an initial alignment, our approach iterates over a two-step process consisting of a superposition step and an alignment step, until convergence. We propose a novel greedy algorithm to construct both sequential and non-sequential alignments. The quality of SNAP alignments were assessed by comparing against the manually curated reference alignments in the challenging SISY and RIPC datasets. Moreover, when applied to a dataset of 4410 protein pairs selected from the CATH database, SNAP produced longer alignments with lower rmsd than several state-of-the-art alignment methods. Classification of folds using SNAP alignments was both highly sensitive and highly selective. The SNAP software along with the datasets are available online at http://www.cs.rpi.edu/~zaki/software/SNAP.

  7. The value of protein structure classification information-Surveying the scientific literature

    DOE PAGES

    Fox, Naomi K.; Brenner, Steven E.; Chandonia, John -Marc

    2015-08-27

    The Structural Classification of Proteins (SCOP) and Class, Architecture, Topology, Homology (CATH) databases have been valuable resources for protein structure classification for over 20 years. Development of SCOP (version 1) concluded in June 2009 with SCOP 1.75. The SCOPe (SCOP-extended) database offers continued development of the classic SCOP hierarchy, adding over 33,000 structures. We have attempted to assess the impact of these two decade old resources and guide future development. To this end, we surveyed recent articles to learn how structure classification data are used. Of 571 articles published in 2012-2013 that cite SCOP, 439 actually use data from themore » resource. We found that the type of use was fairly evenly distributed among four top categories: A) study protein structure or evolution (27% of articles), B) train and/or benchmark algorithms (28% of articles), C) augment non-SCOP datasets with SCOP classification (21% of articles), and D) examine the classification of one protein/a small set of proteins (22% of articles). Most articles described computational research, although 11% described purely experimental research, and a further 9% included both. We examined how CATH and SCOP were used in 158 articles that cited both databases: while some studies used only one dataset, the majority used data from both resources. Protein structure classification remains highly relevant for a diverse range of problems and settings.« less

  8. The value of protein structure classification information-Surveying the scientific literature

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Fox, Naomi K.; Brenner, Steven E.; Chandonia, John -Marc

    The Structural Classification of Proteins (SCOP) and Class, Architecture, Topology, Homology (CATH) databases have been valuable resources for protein structure classification for over 20 years. Development of SCOP (version 1) concluded in June 2009 with SCOP 1.75. The SCOPe (SCOP-extended) database offers continued development of the classic SCOP hierarchy, adding over 33,000 structures. We have attempted to assess the impact of these two decade old resources and guide future development. To this end, we surveyed recent articles to learn how structure classification data are used. Of 571 articles published in 2012-2013 that cite SCOP, 439 actually use data from themore » resource. We found that the type of use was fairly evenly distributed among four top categories: A) study protein structure or evolution (27% of articles), B) train and/or benchmark algorithms (28% of articles), C) augment non-SCOP datasets with SCOP classification (21% of articles), and D) examine the classification of one protein/a small set of proteins (22% of articles). Most articles described computational research, although 11% described purely experimental research, and a further 9% included both. We examined how CATH and SCOP were used in 158 articles that cited both databases: while some studies used only one dataset, the majority used data from both resources. Protein structure classification remains highly relevant for a diverse range of problems and settings.« less

  9. An updated phylogenetic analysis of vertebrate melatonin receptor sequences: reflection on the melatonin receptor nomenclature by the Nomenclature Subcommittee of the International Union of Pharmacology.

    PubMed

    Shiu, S Y; Pang, S F

    1998-01-01

    In the past few years, significant progress on melatonin receptor research has led to the discovery of a family of genetically related but pharmacologically distinctive G-protein-coupled receptors in the vertebrates. With increasing number of receptor clones being identified, there is a need for a system of classification and nomenclature for these receptor subtypes. Recently, an updated nomenclature system, which has renamed the existing mammalian melatonin receptor clones, has been proposed by the relevant subcommittee of the International Union of Pharmacology (NC-IUPHAR). However, the majority of receptor clones which have been identified in non-mammalian vertebrates are not clearly defined by this system. By performing phylogenetic analysis of both mammalian and non-mammalian melatonin receptor clones, we would like to propose a classification-nomenclature system for vertebrate melatonin receptors. Hopefully, our system, which incorporates genetic data as well as the pharmacological criteria that have been adopted by the NC-IUPHAR nomenclature system, will provide the framework for future development of a unified scheme of classification and nomenclature for melatonin receptors.

  10. Positive selection and functional divergence of farnesyl pyrophosphate synthase genes in plants.

    PubMed

    Qian, Jieying; Liu, Yong; Chao, Naixia; Ma, Chengtong; Chen, Qicong; Sun, Jian; Wu, Yaosheng

    2017-02-04

    Farnesyl pyrophosphate synthase (FPS) belongs to the short-chain prenyltransferase family, and it performs a conserved and essential role in the terpenoid biosynthesis pathway. However, its classification, evolutionary history, and the forces driving the evolution of FPS genes in plants remain poorly understood. Phylogeny and positive selection analysis was used to identify the evolutionary forces that led to the functional divergence of FPS in plants, and recombinant detection was undertaken using the Genetic Algorithm for Recombination Detection (GARD) method. The dataset included 68 FPS variation pattern sequences (2 gymnosperms, 10 monocotyledons, 54 dicotyledons, and 2 outgroups). This study revealed that the FPS gene was under positive selection in plants. No recombinant within the FPS gene was found. Therefore, it was inferred that the positive selection of FPS had not been influenced by a recombinant episode. The positively selected sites were mainly located in the catalytic center and functional areas, which indicated that the 98S and 234D were important positively selected sites for plant FPS in the terpenoid biosynthesis pathway. They were located in the FPS conserved domain of the catalytic site. We inferred that the diversification of FPS genes was associated with functional divergence and could be driven by positive selection. It was clear that protein sequence evolution via positive selection was able to drive adaptive diversification in plant FPS proteins. This study provides information on the classification and positive selection of plant FPS genes, and the results could be useful for further research on the regulation of triterpenoid biosynthesis.

  11. Thermostable, salt tolerant, wide pH range novel chitobiase from Vibrio parahemolyticus: isolation, characterization, molecular cloning, and expression.

    PubMed

    Zhu, B C; Lo, J Y; Li, Y T; Li, S C; Jaynes, J M; Gildemeister, O S; Laine, R A; Ou, C Y

    1992-07-01

    A chitobiase gene from Vibrio parahemolyticus was cloned into plasmid pUC18 in Escherichia coli strain DH5 alpha. The plasmid construct, pC120, contained a 6.4 kb Vibrio DNA insert. The recombinant gene expressed chitobiase [EC 3.2.1.30] activity similar to that found in the native Vibrio. The enzyme was purified by ion exchange, hydroxylapatite and gel permeation chromatographies, and exhibited an apparent molecular weight of 80 kDa on SDS-polyacrylamide gel electrophoresis. Chitobiose and 6 more substrates, including beta-N-acetyl galactosamine glycosides, were hydrolyzed by the recombinant chitobiase, indicating its putative classification as an hexosaminidase [EC 3.2.1.52]. The enzyme was resistant to denaturation by 2 M NaCl, thermostable at 45 degrees C and active over a very unusual (for glycosyl hydrolases) pH range, from 4 to 10. The purified cloned chitobiase gave 4 closely focussed bands on an isoelectric focusing gel, at pH 4 to 6.5. The N-terminal 43 amino acid sequence shows no homology with other proteins in commercial databanks or in the literature, and from its N-terminal sequence, appears to be a novel protein, unrelated in sequence to chitobiases from other Vibrios reported and unrelated to hexosaminidases from other organisms.

  12. Identification and Classification of bcl Genes and Proteins of Bacillus cereus Group Organisms and Their Application in Bacillus anthracis Detection and Fingerprinting▿ †

    PubMed Central

    Leski, Tomasz A.; Caswell, Clayton C.; Pawlowski, Marcin; Klinke, David J.; Bujnicki, Janusz M.; Hart, Sean J.; Lukomski, Slawomir

    2009-01-01

    The Bacillus cereus group includes three closely related species, B. anthracis, B. cereus, and B. thuringiensis, which form a highly homogeneous subdivision of the genus Bacillus. One of these species, B. anthracis, has been identified as one of the most probable bacterial biowarfare agents. Here, we evaluate the sequence and length polymorphisms of the Bacillus collagen-like protein bcl genes as a basis for B. anthracis detection and fingerprinting. Five genes, designated bclA to bclE, are present in B. anthracis strains. Examination of bclABCDE sequences identified polymorphisms in bclB alleles of the B. cereus group organisms. These sequence polymorphisms allowed specific detection of B. anthracis strains by PCR using both genomic DNA and purified Bacillus spores in reactions. By exploiting the length variation of the bcl alleles it was demonstrated that the combined bclABCDE PCR products generate markedly different fingerprints for the B. anthracis Ames and Sterne strains. Moreover, we predict that bclABCDE length polymorphism creates unique signatures for B. anthracis strains, which facilitates identification of strains with specificity and confidence. Thus, we present a new diagnostic concept for B. anthracis detection and fingerprinting, which can be used alone or in combination with previously established typing platforms. PMID:19767469

  13. Diversity of pneumococcal surface protein A (PspA) and relation to sequence typing in Streptococcus pneumoniae causing invasive disease in Chinese children.

    PubMed

    Qian, J; Yao, K; Xue, L; Xie, G; Zheng, Y; Wang, C; Shang, Y; Wang, H; Wan, L; Liu, L; Li, C; Ji, W; Wang, Y; Xu, P; Yu, S; Tang, Y-W; Yang, Y

    2012-03-01

    The objective of this paper was to investigate the sequence types (STs) and diversity of surface antigen pneumococcal surface protein A (PspA) in 171 invasive Streptococcus pneumoniae isolates from Chinese children. A total of 171 pneumococci isolates were isolated from Chinese children with invasive pneumococcal diseases (IPD) in 11 hospitals between 2006 and 2008. The pneumococci samples were characterized by serotyping, PspA classification, and multilocus sequence typing (MLST). The PspA of these strains could be assigned to two families. The PspA family 2 was the most common (120/171, 70.1%). No PspA family 3 isolates were detected. Family 1 could be subdivided into two clades, with 42 strains in clade 1 and 9 strains in clade 2, and family 2 could be subdivided into clades 3, 4, and 5, which respectively contained 5, 21, and 14 strains. In total, 65 STs were identified, of which ST320 (30/171, 17.5%), ST271 (23/171, 13.5%), and ST876 (18/171, 10.5%) were the most common types. PspA family 2 and family 1 were dominant among pneumococcal clones isolated from Chinese children with invasive disease. The strains with the same ST always presented in the same PspA family.

  14. Complete mitochondrial genome of the Asian paddle crab Charybdis japonica (Crustacea: Decapoda: Portunidae): gene rearrangement of the marine brachyurans and phylogenetic considerations of the decapods.

    PubMed

    Liu, Yuan; Cui, Zhaoxia

    2010-06-01

    Given the commercial and ecological importance of the Asian paddle crab, Charybdis japonica, there is a clearly need for genetic and molecular research on this species. Here, we present the complete mitochondrial genome sequence of C. japonica, determined by the long-polymerase chain reaction and primer walking sequencing method. The entire genome is 15,738 bp in length, encoding a standard set of 13 protein-coding genes, two ribosomal RNA genes, and 22 transfer RNA genes, plus the putative control region, which is typical for metazoans. The total A+T content of the genome is 69.2%, lower than the other brachyuran crabs except for Callinectes sapidus. The gene order is identical to the published marine brachyurans and differs from the ancestral pancrustacean order by only the position of the tRNA ( His ) gene. Phylogenetic analyses using the concatenated nucleotide and amino acid sequences of 13 protein-coding genes strongly support the monophyly of Dendrobranchiata and Pleocyemata, which is consistent with the previous taxonomic classification. However, the systematic status of Charybdis within subfamily Thalamitinae of family Portunidae is not supported. C. japonica, as the first species of Charybdis with complete mitochondrial genome available, will provide important information on both genomics and molecular ecology of the group.

  15. A multigene phylogenetic synthesis for the class Lecanoromycetes (Ascomycota): 1307 fungi representing 1139 infrageneric taxa, 317 genera and 66 families

    PubMed Central

    Miadlikowska, Jolanta; Kauff, Frank; Högnabba, Filip; Oliver, Jeffrey C.; Molnár, Katalin; Fraker, Emily; Gaya, Ester; Hafellner, Josef; Hofstetter, Valérie; Gueidan, Cécile; Otálora, Mónica A.G.; Hodkinson, Brendan; Kukwa, Martin; Lücking, Robert; Björk, Curtis; Sipman, Harrie J.M.; Burgaz, Ana Rosa; Thell, Arne; Passo, Alfredo; Myllys, Leena; Goward, Trevor; Fernández-Brime, Samantha; Hestmark, Geir; Lendemer, James; Lumbsch, H. Thorsten; Schmull, Michaela; Schoch, Conrad; Sérusiaux, Emmanuël; Maddison, David R.; Arnold, A. Elizabeth; Lutzoni, François; Stenroos, Soili

    2014-01-01

    The Lecanoromycetes is the largest class of lichenized Fungi, and one of the most species-rich classes in the kingdom. Here we provide a multigene phylogenetic synthesis (using three ribosomal RNA-coding and two protein-coding genes) of the Lecanoromycetes based on 642 newly generated and 3329 publicly available sequences representing 1139 taxa, 317 genera, 66 families, 17 orders and five subclasses (four currently recognized: Acarosporomycetidae, Lecanoromycetidae, Ostropomycetidae, Umbilicariomycetidae; and one provisionarily recognized, ‘Candelariomycetidae’). Maximum likelihood phylogenetic analyses on four multigene datasets assembled using a cumulative supermatrix approach with a progressively higher number of species and missing data (5-gene, 5+4-gene, 5+4+3-gene and 5+4+3+2-gene datasets) show that the current classification includes non-monophyletic taxa at various ranks, which need to be recircumscribed and require revisionary treatments based on denser taxon sampling and more loci. Two newly circumscribed orders (Arctomiales and Hymeneliales in the Ostropomycetidae) and three families (Ramboldiaceae and Psilolechiaceae in the Lecanorales, and Strangosporaceae in the Lecanoromycetes inc. sed.) are introduced. The potential resurrection of the families Eigleraceae and Lopadiaceae is considered here to alleviate phylogenetic and classification disparities. An overview of the photobionts associated with the main fungal lineages in the Lecanoromycetes based on available published records is provided. A revised schematic classification at the family level in the phylogenetic context of widely accepted and newly revealed relationships across Lecanoromycetes is included. The cumulative addition of taxa with an increasing amount of missing data (i.e., a cumulative supermatrix approach, starting with taxa for which sequences were available for all five targeted genes and ending with the addition of taxa for which only two genes have been sequenced) revealed relatively stable relationships for many families and orders. However, the increasing number of taxa without the addition of more loci also resulted in an expected substantial loss of phylogenetic resolving power and support (especially for deep phylogenetic relationships), potentially including the misplacements of several taxa. Future phylogenetic analyses should include additional single copy protein-coding markers in order to improve the tree of the Lecanoromycetes. As part of this study, a new module (“Hypha”) of the freely available Mesquite software was developed to compare and display the internodal support values derived from this cumulative supermatrix approach. PMID:24747130

  16. Evaluating Functional Annotations of Enzymes Using the Gene Ontology.

    PubMed

    Holliday, Gemma L; Davidson, Rebecca; Akiva, Eyal; Babbitt, Patricia C

    2017-01-01

    The Gene Ontology (GO) (Ashburner et al., Nat Genet 25(1):25-29, 2000) is a powerful tool in the informatics arsenal of methods for evaluating annotations in a protein dataset. From identifying the nearest well annotated homologue of a protein of interest to predicting where misannotation has occurred to knowing how confident you can be in the annotations assigned to those proteins is critical. In this chapter we explore what makes an enzyme unique and how we can use GO to infer aspects of protein function based on sequence similarity. These can range from identification of misannotation or other errors in a predicted function to accurate function prediction for an enzyme of entirely unknown function. Although GO annotation applies to any gene products, we focus here a describing our approach for hierarchical classification of enzymes in the Structure-Function Linkage Database (SFLD) (Akiva et al., Nucleic Acids Res 42(Database issue):D521-530, 2014) as a guide for informed utilisation of annotation transfer based on GO terms.

  17. Accurate, Rapid Taxonomic Classification of Fungal Large-Subunit rRNA Genes

    PubMed Central

    Liu, Kuan-Liang; Porras-Alfaro, Andrea; Eichorst, Stephanie A.

    2012-01-01

    Taxonomic and phylogenetic fingerprinting based on sequence analysis of gene fragments from the large-subunit rRNA (LSU) gene or the internal transcribed spacer (ITS) region is becoming an integral part of fungal classification. The lack of an accurate and robust classification tool trained by a validated sequence database for taxonomic placement of fungal LSU genes is a severe limitation in taxonomic analysis of fungal isolates or large data sets obtained from environmental surveys. Using a hand-curated set of 8,506 fungal LSU gene fragments, we determined the performance characteristics of a naïve Bayesian classifier across multiple taxonomic levels and compared the classifier performance to that of a sequence similarity-based (BLASTN) approach. The naïve Bayesian classifier was computationally more rapid (>460-fold with our system) than the BLASTN approach, and it provided equal or superior classification accuracy. Classifier accuracies were compared using sequence fragments of 100 bp and 400 bp and two different PCR primer anchor points to mimic sequence read lengths commonly obtained using current high-throughput sequencing technologies. Accuracy was higher with 400-bp sequence reads than with 100-bp reads. It was also significantly affected by sequence location across the 1,400-bp test region. The highest accuracy was obtained across either the D1 or D2 variable region. The naïve Bayesian classifier provides an effective and rapid means to classify fungal LSU sequences from large environmental surveys. The training set and tool are publicly available through the Ribosomal Database Project (http://rdp.cme.msu.edu/classifier/classifier.jsp). PMID:22194300

  18. Classification of HCV and HIV-1 Sequences with the Branching Index

    PubMed Central

    Hraber, Peter; Kuiken, Carla; Waugh, Mark; Geer, Shaun; Bruno, William J.; Leitner, Thomas

    2009-01-01

    SUMMARY Classification of viral sequences should be fast, objective, accurate, and reproducible. Most methods that classify sequences use either pairwise distances or phylogenetic relations, but cannot discern when a sequence is unclassifiable. The branching index (BI) combines distance and phylogeny methods to compute a ratio that quantifies how closely a query sequence clusters with a subtype clade. In the hypothesis-testing framework of statistical inference, the BI is compared with a threshold to test whether sufficient evidence exists for the query sequence to be classified among known sequences. If above the threshold, the null hypothesis of no support for the subtype relation is rejected and the sequence is taken as belonging to the subtype clade with which it clusters on the tree. This study evaluates statistical properties of the branching index for subtype classification in HCV and HIV-1. Pairs of BI values with known positive and negative test results were computed from 10,000 random fragments of reference alignments. Sampled fragments were of sufficient length to contain phylogenetic signal that groups reference sequences together properly into subtype clades. For HCV, a threshold BI of 0.71 yields 95.1% agreement with reference subtypes, with equal false positive and false negative rates. For HIV-1, a threshold of 0.66 yields 93.5% agreement. Higher thresholds can be used where lower false positive rates are required. In synthetic recombinants, regions without breakpoints are recognized accurately; regions with breakpoints do not uniquely represent any known subtype. Web-based services for viral subtype classification with the branching index are available online. PMID:18753218

  19. A core phylogeny of Dictyostelia inferred from genomes representative of the eight major and minor taxonomic divisions of the group.

    PubMed

    Singh, Reema; Schilde, Christina; Schaap, Pauline

    2016-11-17

    Dictyostelia are a well-studied group of organisms with colonial multicellularity, which are members of the mostly unicellular Amoebozoa. A phylogeny based on SSU rDNA data subdivided all Dictyostelia into four major groups, but left the position of the root and of six group-intermediate taxa unresolved. Recent phylogenies inferred from 30 or 213 proteins from sequenced genomes, positioned the root between two branches, each containing two major groups, but lacked data to position the group-intermediate taxa. Since the positions of these early diverging taxa are crucial for understanding the evolution of phenotypic complexity in Dictyostelia, we sequenced six representative genomes of early diverging taxa. We retrieved orthologs of 47 housekeeping proteins with an average size of 890 amino acids from six newly sequenced and eight published genomes of Dictyostelia and unicellular Amoebozoa and inferred phylogenies from single and concatenated protein sequence alignments. Concatenated alignments of all 47 proteins, and four out of five subsets of nine concatenated proteins all produced the same consensus phylogeny with 100% statistical support. Trees inferred from just two out of the 47 proteins, individually reproduced the consensus phylogeny, highlighting that single gene phylogenies will rarely reflect correct species relationships. However, sets of two or three concatenated proteins again reproduced the consensus phylogeny, indicating that a small selection of genes suffices for low cost classification of as yet unincorporated or newly discovered dictyostelid and amoebozoan taxa by gene amplification. The multi-locus consensus phylogeny shows that groups 1 and 2 are sister clades in branch I, with the group-intermediate taxon D. polycarpum positioned as outgroup to group 2. Branch II consists of groups 3 and 4, with the group-intermediate taxon Polysphondylium violaceum positioned as sister to group 4, and the group-intermediate taxon Dictyostelium polycephalum branching at the base of that whole clade. Given the data, the approximately unbiased test rejects all alternative topologies favoured by SSU rDNA and individual proteins with high statistical support. The test also rejects monophyletic origins for the genera Acytostelium, Polysphondylium and Dictyostelium. The current position of Acytostelium ellipticum in the consensus phylogeny indicates that somatic cells were lost twice in Dictyostelia.

  20. Analysis of deep learning methods for blind protein contact prediction in CASP12.

    PubMed

    Wang, Sheng; Sun, Siqi; Xu, Jinbo

    2018-03-01

    Here we present the results of protein contact prediction achieved in CASP12 by our RaptorX-Contact server, which is an early implementation of our deep learning method for contact prediction. On a set of 38 free-modeling target domains with a median family size of around 58 effective sequences, our server obtained an average top L/5 long- and medium-range contact accuracy of 47% and 44%, respectively (L = length). A complete implementation has an average accuracy of 59% and 57%, respectively. Our deep learning method formulates contact prediction as a pixel-level image labeling problem and simultaneously predicts all residue pairs of a protein using a combination of two deep residual neural networks, taking as input the residue conservation information, predicted secondary structure and solvent accessibility, contact potential, and coevolution information. Our approach differs from existing methods mainly in (1) formulating contact prediction as a pixel-level image labeling problem instead of an image-level classification problem; (2) simultaneously predicting all contacts of an individual protein to make effective use of contact occurrence patterns; and (3) integrating both one-dimensional and two-dimensional deep convolutional neural networks to effectively learn complex sequence-structure relationship including high-order residue correlation. This paper discusses the RaptorX-Contact pipeline, both contact prediction and contact-based folding results, and finally the strength and weakness of our method. © 2017 Wiley Periodicals, Inc.

  1. Classification and evolution of EF-hand proteins

    NASA Technical Reports Server (NTRS)

    Kawasaki, H.; Nakayama, S.; Kretsinger, R. H.

    1998-01-01

    Forty-five distinct subfamilies of EF-hand proteins have been identified. They contain from two to eight EF-hands that are recognizable by amino acid sequence as being statistically similar to other EF-hand domains. All proteins within one subfamily are congruent to one another, i.e. the dendrogram computed from one of the EF-hand domains is similar, within statistical error, to the dendrogram computed from another(s) domain. Thirteen subfamilies--including Calmodulin, Troponin C, Essential light chain, Regulatory light chain--referred to collectively as CTER, are congruent with one another. They appear to have evolved from a single ur-domain by two cycles of gene duplication and fusion. The subfamilies of CTER subsequently evolved by gene duplications and speciations. The remaining 32 subfamilies do not show such general patterns of congruence; however, some--such as S100, intestinal calcium binding protein (calbindin 9 kd), and trichohylin--do not form congruent clusters of subfamilies. Nearly all of the domains 1, 3, 5, and 7 are most similar to other ODD domains. Correspondingly the EVEN numbered domains of all 45 subfamilies most closely resemble EVEN domains of other subfamilies. Many sequence and chemical characteristics do not show systemic trends by subfamily or species of host organisms; such homoplasy is widespread. Eighteen of the subfamilies are heterochimeric; in addition to multiple EF-hands they contain domains of other evolutionary origins.

  2. Analyzing Student Inquiry Data Using Process Discovery and Sequence Classification

    ERIC Educational Resources Information Center

    Emond, Bruno; Buffett, Scott

    2015-01-01

    This paper reports on results of applying process discovery mining and sequence classification mining techniques to a data set of semi-structured learning activities. The main research objective is to advance educational data mining to model and support self-regulated learning in heterogeneous environments of learning content, activities, and…

  3. Experimental approaches to investigate effector translocation into host cells in the Ustilago maydis/maize pathosystem.

    PubMed

    Tanaka, Shigeyuki; Djamei, Armin; Presti, Libera Lo; Schipper, Kerstin; Winterberg, Sarah; Amati, Simone; Becker, Dirk; Büchner, Heike; Kumlehn, Jochen; Reissmann, Stefanie; Kahmann, Regine

    2015-01-01

    The fungus Ustilago maydis is a pathogen that establishes a biotrophic interaction with Zea mays. The interaction with the plant host is largely governed by more than 300 novel, secreted protein effectors, of which only four have been functionally characterized. Prerequisite to examine effector function is to know where effectors reside after secretion. Effectors can remain in the extracellular space, i.e. the plant apoplast (apoplastic effectors), or can cross the plant plasma membrane and exert their function inside the host cell (cytoplasmic effectors). The U. maydis effectors lack conserved motifs in their primary sequences that could allow a classification of the effectome into apoplastic/cytoplasmic effectors. This represents a significant obstacle in functional effector characterization. Here we describe our attempts to establish a system for effector classification into apoplastic and cytoplasmic members, using U. maydis for effector delivery. Copyright © 2015 Elsevier GmbH. All rights reserved.

  4. WRF-TMH: predicting transmembrane helix by fusing composition index and physicochemical properties of amino acids.

    PubMed

    Hayat, Maqsood; Khan, Asifullah

    2013-05-01

    Membrane protein is the prime constituent of a cell, which performs a role of mediator between intra and extracellular processes. The prediction of transmembrane (TM) helix and its topology provides essential information regarding the function and structure of membrane proteins. However, prediction of TM helix and its topology is a challenging issue in bioinformatics and computational biology due to experimental complexities and lack of its established structures. Therefore, the location and orientation of TM helix segments are predicted from topogenic sequences. In this regard, we propose WRF-TMH model for effectively predicting TM helix segments. In this model, information is extracted from membrane protein sequences using compositional index and physicochemical properties. The redundant and irrelevant features are eliminated through singular value decomposition. The selected features provided by these feature extraction strategies are then fused to develop a hybrid model. Weighted random forest is adopted as a classification approach. We have used two benchmark datasets including low and high-resolution datasets. tenfold cross validation is employed to assess the performance of WRF-TMH model at different levels including per protein, per segment, and per residue. The success rates of WRF-TMH model are quite promising and are the best reported so far on the same datasets. It is observed that WRF-TMH model might play a substantial role, and will provide essential information for further structural and functional studies on membrane proteins. The accompanied web predictor is accessible at http://111.68.99.218/WRF-TMH/ .

  5. Domain organization, genomic structure, evolution, and regulation of expression of the aggrecan gene family.

    PubMed

    Schwartz, N B; Pirok, E W; Mensch, J R; Domowicz, M S

    1999-01-01

    Proteoglycans are complex macromolecules, consisting of a polypeptide backbone to which are covalently attached one or more glycosaminoglycan chains. Molecular cloning has allowed identification of the genes encoding the core proteins of various proteoglycans, leading to a better understanding of the diversity of proteoglycan structure and function, as well as to the evolution of a classification of proteoglycans on the basis of emerging gene families that encode the different core proteins. One such family includes several proteoglycans that have been grouped with aggrecan, the large aggregating chondroitin sulfate proteoglycan of cartilage, based on a high number of sequence similarities within the N- and C-terminal domains. Thus far these proteoglycans include versican, neurocan, and brevican. It is now apparent that these proteins, as a group, are truly a gene family with shared structural motifs on the protein and nucleotide (mRNA) levels, and with nearly identical genomic organizations. Clearly a common ancestral origin is indicated for the members of the aggrecan family of proteoglycans. However, differing patterns of amplification and divergence have also occurred within certain exons across species and family members, leading to the class-characteristic protein motifs in the central carbohydrate-rich region exclusively. Thus the overall domain organization strongly suggests that sequence conservation in the terminal globular domains underlies common functions, whereas differences in the central portions of the genes account for functional specialization among the members of this gene family.

  6. Prediction of change in protein unfolding rates upon point mutations in two state proteins.

    PubMed

    Chaudhary, Priyashree; Naganathan, Athi N; Gromiha, M Michael

    2016-09-01

    Studies on protein unfolding rates are limited and challenging due to the complexity of unfolding mechanism and the larger dynamic range of the experimental data. Though attempts have been made to predict unfolding rates using protein sequence-structure information there is no available method for predicting the unfolding rates of proteins upon specific point mutations. In this work, we have systematically analyzed a set of 790 single mutants and developed a robust method for predicting protein unfolding rates upon mutations (Δlnku) in two-state proteins by combining amino acid properties and knowledge-based classification of mutants with multiple linear regression technique. We obtain a mean absolute error (MAE) of 0.79/s and a Pearson correlation coefficient (PCC) of 0.71 between predicted unfolding rates and experimental observations using jack-knife test. We have developed a web server for predicting protein unfolding rates upon mutation and it is freely available at https://www.iitm.ac.in/bioinfo/proteinunfolding/unfoldingrace.html. Prominent features that determine unfolding kinetics as well as plausible reasons for the observed outliers are also discussed. Copyright © 2016 Elsevier B.V. All rights reserved.

  7. Bioinformatic and Comparative Localization of Rab Proteins Reveals Functional Insights into the Uncharacterized GTPases Ypt10p and Ypt11p†

    PubMed Central

    Buvelot Frei, Stéphanie; Rahl, Peter B.; Nussbaum, Maria; Briggs, Benjamin J.; Calero, Monica; Janeczko, Stephanie; Regan, Andrew D.; Chen, Catherine Z.; Barral, Yves; Whittaker, Gary R.; Collins, Ruth N.

    2006-01-01

    A striking characteristic of a Rab protein is its steady-state localization to the cytosolic surface of a particular subcellular membrane. In this study, we have undertaken a combined bioinformatic and experimental approach to examine the evolutionary conservation of Rab protein localization. A comprehensive primary sequence classification shows that 10 out of the 11 Rab proteins identified in the yeast (Saccharomyces cerevisiae) genome can be grouped within a major subclass, each comprising multiple Rab orthologs from diverse species. We compared the locations of individual yeast Rab proteins with their localizations following ectopic expression in mammalian cells. Our results suggest that green fluorescent protein-tagged Rab proteins maintain localizations across large evolutionary distances and that the major known player in the Rab localization pathway, mammalian Rab-GDI, is able to function in yeast. These findings enable us to provide insight into novel gene functions and classify the uncharacterized Rab proteins Ypt10p (YBR264C) as being involved in endocytic function and Ypt11p (YNL304W) as being localized to the endoplasmic reticulum, where we demonstrate it is required for organelle inheritance. PMID:16980630

  8. The Complete Mitochondrial Genome of Galba pervia (Gastropoda: Mollusca), an Intermediate Host Snail of Fasciola spp

    PubMed Central

    Huang, Wei-Yi; Zhao, Guang-Hui; Wei, Shu-Jun; Song, Hui-Qun; Xu, Min-Jun; Lin, Rui-Qing; Zhou, Dong-Hui; Zhu, Xing-Quan

    2012-01-01

    Complete mitochondrial (mt) genomes and the gene rearrangements are increasingly used as molecular markers for investigating phylogenetic relationships. Contributing to the complete mt genomes of Gastropoda, especially Pulmonata, we determined the mt genome of the freshwater snail Galba pervia, which is an important intermediate host for Fasciola spp. in China. The complete mt genome of G. pervia is 13,768 bp in length. Its genome is circular, and consists of 37 genes, including 13 genes for proteins, 2 genes for rRNA, 22 genes for tRNA. The mt gene order of G. pervia showed novel arrangement (tRNA-His, tRNA-Gly and tRNA-Tyr change positions and directions) when compared with mt genomes of Pulmonata species sequenced to date, indicating divergence among different species within the Pulmonata. A total of 3655 amino acids were deduced to encode 13 protein genes. The most frequently used amino acid is Leu (15.05%), followed by Phe (11.24%), Ser (10.76%) and IIe (8.346%). Phylogenetic analyses using the concatenated amino acid sequences of the 13 protein-coding genes, with three different computational algorithms (maximum parsimony, maximum likelihood and Bayesian analysis), all revealed that the families Lymnaeidae and Planorbidae are closely related two snail families, consistent with previous classifications based on morphological and molecular studies. The complete mt genome sequence of G. pervia showed a novel gene arrangement and it represents the first sequenced high quality mt genome of the family Lymnaeidae. These novel mtDNA data provide additional genetic markers for studying the epidemiology, population genetics and phylogeographics of freshwater snails, as well as for understanding interplay between the intermediate snail hosts and the intra-mollusca stages of Fasciola spp.. PMID:22844544

  9. Widespread occurrence of lysine methylation in Plasmodium falciparum proteins at asexual blood stages.

    PubMed

    Kaur, Inderjeet; Zeeshan, Mohammad; Saini, Ekta; Kaushik, Abhinav; Mohmmed, Asif; Gupta, Dinesh; Malhotra, Pawan

    2016-10-20

    Post-transcriptional and post-translational modifications play a major role in Plasmodium life cycle regulation. Lysine methylation of histone proteins is well documented in several organisms, however in recent years lysine methylation of proteins outside histone code is emerging out as an important post-translational modification (PTM). In the present study we have performed global analysis of lysine methylation of proteins in asexual blood stages of Plasmodium falciparum development. We immunoprecipitated stage specific Plasmodium lysates using anti-methyl lysine specific antibodies that immunostained the asexual blood stage parasites. Using liquid chromatography and tandem mass spectrometry analysis, 570 lysine methylated proteins at three different blood stages were identified. Analysis of the peptide sequences identified 605 methylated sites within 422 proteins. Functional classification of the methylated proteins revealed that the proteins are mainly involved in nucleotide metabolic processes, chromatin organization, transport, homeostatic processes and protein folding. The motif analysis of the methylated lysine peptides reveals novel motifs. Many of the identified lysine methylated proteins are also interacting partners/substrates of PfSET domain proteins as revealed by STRING database analysis. Our findings suggest that the protein methylation at lysine residues is widespread in Plasmodium and plays an important regulatory role in diverse set of the parasite pathways.

  10. Texture analysis of common renal masses in multiple MR sequences for prediction of pathology

    NASA Astrophysics Data System (ADS)

    Hoang, Uyen N.; Malayeri, Ashkan A.; Lay, Nathan S.; Summers, Ronald M.; Yao, Jianhua

    2017-03-01

    This pilot study performs texture analysis on multiple magnetic resonance (MR) images of common renal masses for differentiation of renal cell carcinoma (RCC). Bounding boxes are drawn around each mass on one axial slice in T1 delayed sequence to use for feature extraction and classification. All sequences (T1 delayed, venous, arterial, pre-contrast phases, T2, and T2 fat saturated sequences) are co-registered and texture features are extracted from each sequence simultaneously. Random forest is used to construct models to classify lesions on 96 normal regions, 87 clear cell RCCs, 8 papillary RCCs, and 21 renal oncocytomas; ground truths are verified through pathology reports. The highest performance is seen in random forest model when data from all sequences are used in conjunction, achieving an overall classification accuracy of 83.7%. When using data from one single sequence, the overall accuracies achieved for T1 delayed, venous, arterial, and pre-contrast phase, T2, and T2 fat saturated were 79.1%, 70.5%, 56.2%, 61.0%, 60.0%, and 44.8%, respectively. This demonstrates promising results of utilizing intensity information from multiple MR sequences for accurate classification of renal masses.

  11. Improving Classification of Protein Interaction Articles Using Context Similarity-Based Feature Selection.

    PubMed

    Chen, Yifei; Sun, Yuxing; Han, Bing-Qing

    2015-01-01

    Protein interaction article classification is a text classification task in the biological domain to determine which articles describe protein-protein interactions. Since the feature space in text classification is high-dimensional, feature selection is widely used for reducing the dimensionality of features to speed up computation without sacrificing classification performance. Many existing feature selection methods are based on the statistical measure of document frequency and term frequency. One potential drawback of these methods is that they treat features separately. Hence, first we design a similarity measure between the context information to take word cooccurrences and phrase chunks around the features into account. Then we introduce the similarity of context information to the importance measure of the features to substitute the document and term frequency. Hence we propose new context similarity-based feature selection methods. Their performance is evaluated on two protein interaction article collections and compared against the frequency-based methods. The experimental results reveal that the context similarity-based methods perform better in terms of the F1 measure and the dimension reduction rate. Benefiting from the context information surrounding the features, the proposed methods can select distinctive features effectively for protein interaction article classification.

  12. Integrative View of the Diversity and Evolution of SWEET and SemiSWEET Sugar Transporters

    PubMed Central

    Jia, Baolei; Zhu, Xiao Feng; Pu, Zhong Ji; Duan, Yu Xi; Hao, Lu Jiang; Zhang, Jie; Chen, Li-Qing; Jeon, Che Ok; Xuan, Yuan Hu

    2017-01-01

    Sugars Will Eventually be Exported Transporter (SWEET) and SemiSWEET are recently characterized families of sugar transporters in eukaryotes and prokaryotes, respectively. SemiSWEETs contain 3 transmembrane helices (TMHs), while SWEETs contain 7. Here, we performed sequence-based comprehensive analyses for SWEETs and SemiSWEETs across the biosphere. In total, 3,249 proteins were identified and ≈60% proteins were found in green plants and Oomycota, which include a number of important plant pathogens. Protein sequence similarity networks indicate that proteins from different organisms are significantly clustered. Of note, SemiSWEETs with 3 or 4 TMHs that may fuse to SWEET were identified in plant genomes. 7-TMH SWEETs were found in bacteria, implying that SemiSWEET can be fused directly in prokaryote. 15-TMH extraSWEET and 25-TMH superSWEET were also observed in wild rice and oomycetes, respectively. The transporters can be classified into 4, 2, 2, and 2 clades in plants, Metazoa, unicellular eukaryotes, and prokaryotes, respectively. The consensus and coevolution of amino acids in SWEETs were identified by multiple sequence alignments. The functions of the highly conserved residues were analyzed by molecular dynamics analysis. The 19 most highly conserved residues in the SWEETs were further confirmed by point mutagenesis using SWEET1 from Arabidopsis thaliana. The results proved that the conserved residues located in the extrafacial gate (Y57, G58, G131, and P191), the substrate binding pocket (N73, N192, and W176), and the intrafacial gate (P43, Y83, F87, P145, M161, P162, and Q202) play important roles for substrate recognition and transport processes. Taken together, our analyses provide a foundation for understanding the diversity, classification, and evolution of SWEETs and SemiSWEETs using large-scale sequence analysis and further show that gene duplication and gene fusion are important factors driving the evolution of SWEETs. PMID:29326750

  13. Integrative View of the Diversity and Evolution of SWEET and SemiSWEET Sugar Transporters.

    PubMed

    Jia, Baolei; Zhu, Xiao Feng; Pu, Zhong Ji; Duan, Yu Xi; Hao, Lu Jiang; Zhang, Jie; Chen, Li-Qing; Jeon, Che Ok; Xuan, Yuan Hu

    2017-01-01

    Sugars Will Eventually be Exported Transporter (SWEET) and SemiSWEET are recently characterized families of sugar transporters in eukaryotes and prokaryotes, respectively. SemiSWEETs contain 3 transmembrane helices (TMHs), while SWEETs contain 7. Here, we performed sequence-based comprehensive analyses for SWEETs and SemiSWEETs across the biosphere. In total, 3,249 proteins were identified and ≈60% proteins were found in green plants and Oomycota, which include a number of important plant pathogens. Protein sequence similarity networks indicate that proteins from different organisms are significantly clustered. Of note, SemiSWEETs with 3 or 4 TMHs that may fuse to SWEET were identified in plant genomes. 7-TMH SWEETs were found in bacteria, implying that SemiSWEET can be fused directly in prokaryote. 15-TMH extraSWEET and 25-TMH superSWEET were also observed in wild rice and oomycetes, respectively. The transporters can be classified into 4, 2, 2, and 2 clades in plants, Metazoa, unicellular eukaryotes, and prokaryotes, respectively. The consensus and coevolution of amino acids in SWEETs were identified by multiple sequence alignments. The functions of the highly conserved residues were analyzed by molecular dynamics analysis. The 19 most highly conserved residues in the SWEETs were further confirmed by point mutagenesis using SWEET1 from Arabidopsis thaliana . The results proved that the conserved residues located in the extrafacial gate (Y57, G58, G131, and P191), the substrate binding pocket (N73, N192, and W176), and the intrafacial gate (P43, Y83, F87, P145, M161, P162, and Q202) play important roles for substrate recognition and transport processes. Taken together, our analyses provide a foundation for understanding the diversity, classification, and evolution of SWEETs and SemiSWEETs using large-scale sequence analysis and further show that gene duplication and gene fusion are important factors driving the evolution of SWEETs.

  14. AutoFACT: An Automatic Functional Annotation and Classification Tool

    PubMed Central

    Koski, Liisa B; Gray, Michael W; Lang, B Franz; Burger, Gertraud

    2005-01-01

    Background Assignment of function to new molecular sequence data is an essential step in genomics projects. The usual process involves similarity searches of a given sequence against one or more databases, an arduous process for large datasets. Results We present AutoFACT, a fully automated and customizable annotation tool that assigns biologically informative functions to a sequence. Key features of this tool are that it (1) analyzes nucleotide and protein sequence data; (2) determines the most informative functional description by combining multiple BLAST reports from several user-selected databases; (3) assigns putative metabolic pathways, functional classes, enzyme classes, GeneOntology terms and locus names; and (4) generates output in HTML, text and GFF formats for the user's convenience. We have compared AutoFACT to four well-established annotation pipelines. The error rate of functional annotation is estimated to be only between 1–2%. Comparison of AutoFACT to the traditional top-BLAST-hit annotation method shows that our procedure increases the number of functionally informative annotations by approximately 50%. Conclusion AutoFACT will serve as a useful annotation tool for smaller sequencing groups lacking dedicated bioinformatics staff. It is implemented in PERL and runs on LINUX/UNIX platforms. AutoFACT is available at . PMID:15960857

  15. The complete mitochondrial genome of the desert darkling beetle Asbolus verrucosus (Coleoptera, Tenebrionidae).

    PubMed

    Rider, Stanley Dean

    2016-07-01

    The complete mitochondrial genome of the desert darkling beetle Asbolus verrucosus (LeConte, 1851) was sequenced using paired-end technology to an average depth of 42,111× and assembled using De Bruijn graph-based methods. The genome is 15,828 bp in length and conforms to the basal arthropod mitochondrial gene composition with the same gene orders and orientations as other darkling beetle mitochondria. This arrangement includes a control region, 22 tRNA genes, 2 rRNA genes and 13 protein-coding genes. The main coding strand is probably replicated as the lagging strand (GC skew of -0.36 and AT skew of +0.19). Phylogenomics analyses are consistent with taxonomic classifications and indicate that Tenebrio molitor is the closest relative that has a completely sequenced mitochondrial genome available for analysis. This is the first fully assembled mitogenome sequence for a darkling beetle in the subfamily Pimeliinae and will be useful for population studies on members of this ecologically important group of beetles.

  16. 5 CFR 1312.6 - Duration of classification.

    Code of Federal Regulations, 2010 CFR

    2010-01-01

    ... classification authority shall follow the following sequence: (i) He/She shall attempt to determine a date or... 5 Administrative Personnel 3 2010-01-01 2010-01-01 false Duration of classification. 1312.6 Section 1312.6 Administrative Personnel OFFICE OF MANAGEMENT AND BUDGET OMB DIRECTIVES CLASSIFICATION...

  17. Computational methods using weighed-extreme learning machine to predict protein self-interactions with protein evolutionary information.

    PubMed

    An, Ji-Yong; Zhang, Lei; Zhou, Yong; Zhao, Yu-Jun; Wang, Da-Fu

    2017-08-18

    Self-interactions Proteins (SIPs) is important for their biological activity owing to the inherent interaction amongst their secondary structures or domains. However, due to the limitations of experimental Self-interactions detection, one major challenge in the study of prediction SIPs is how to exploit computational approaches for SIPs detection based on evolutionary information contained protein sequence. In the work, we presented a novel computational approach named WELM-LAG, which combined the Weighed-Extreme Learning Machine (WELM) classifier with Local Average Group (LAG) to predict SIPs based on protein sequence. The major improvement of our method lies in presenting an effective feature extraction method used to represent candidate Self-interactions proteins by exploring the evolutionary information embedded in PSI-BLAST-constructed position specific scoring matrix (PSSM); and then employing a reliable and robust WELM classifier to carry out classification. In addition, the Principal Component Analysis (PCA) approach is used to reduce the impact of noise. The WELM-LAG method gave very high average accuracies of 92.94 and 96.74% on yeast and human datasets, respectively. Meanwhile, we compared it with the state-of-the-art support vector machine (SVM) classifier and other existing methods on human and yeast datasets, respectively. Comparative results indicated that our approach is very promising and may provide a cost-effective alternative for predicting SIPs. In addition, we developed a freely available web server called WELM-LAG-SIPs to predict SIPs. The web server is available at http://219.219.62.123:8888/WELMLAG/ .

  18. Statistical models for detecting differential chromatin interactions mediated by a protein.

    PubMed

    Niu, Liang; Li, Guoliang; Lin, Shili

    2014-01-01

    Chromatin interactions mediated by a protein of interest are of great scientific interest. Recent studies show that protein-mediated chromatin interactions can have different intensities in different types of cells or in different developmental stages of a cell. Such differences can be associated with a disease or with the development of a cell. Thus, it is of great importance to detect protein-mediated chromatin interactions with different intensities in different cells. A recent molecular technique, Chromatin Interaction Analysis by Paired-End Tag Sequencing (ChIA-PET), which uses formaldehyde cross-linking and paired-end sequencing, is able to detect genome-wide chromatin interactions mediated by a protein of interest. Here we proposed two models (One-Step Model and Two-Step Model) for two sample ChIA-PET count data (one biological replicate in each sample) to identify differential chromatin interactions mediated by a protein of interest. Both models incorporate the data dependency and the extent to which a fragment pair is related to a pair of DNA loci of interest to make accurate identifications. The One-Step Model makes use of the data more efficiently but is more computationally intensive. An extensive simulation study showed that the models can detect those differentially interacted chromatins and there is a good agreement between each classification result and the truth. Application of the method to a two-sample ChIA-PET data set illustrates its utility. The two models are implemented as an R package MDM (available at http://www.stat.osu.edu/~statgen/SOFTWARE/MDM).

  19. Statistical Models for Detecting Differential Chromatin Interactions Mediated by a Protein

    PubMed Central

    Niu, Liang; Li, Guoliang; Lin, Shili

    2014-01-01

    Chromatin interactions mediated by a protein of interest are of great scientific interest. Recent studies show that protein-mediated chromatin interactions can have different intensities in different types of cells or in different developmental stages of a cell. Such differences can be associated with a disease or with the development of a cell. Thus, it is of great importance to detect protein-mediated chromatin interactions with different intensities in different cells. A recent molecular technique, Chromatin Interaction Analysis by Paired-End Tag Sequencing (ChIA-PET), which uses formaldehyde cross-linking and paired-end sequencing, is able to detect genome-wide chromatin interactions mediated by a protein of interest. Here we proposed two models (One-Step Model and Two-Step Model) for two sample ChIA-PET count data (one biological replicate in each sample) to identify differential chromatin interactions mediated by a protein of interest. Both models incorporate the data dependency and the extent to which a fragment pair is related to a pair of DNA loci of interest to make accurate identifications. The One-Step Model makes use of the data more efficiently but is more computationally intensive. An extensive simulation study showed that the models can detect those differentially interacted chromatins and there is a good agreement between each classification result and the truth. Application of the method to a two-sample ChIA-PET data set illustrates its utility. The two models are implemented as an R package MDM (available at http://www.stat.osu.edu/~statgen/SOFTWARE/MDM). PMID:24835279

  20. Detection of distorted frames in retinal video-sequences via machine learning

    NASA Astrophysics Data System (ADS)

    Kolar, Radim; Liberdova, Ivana; Odstrcilik, Jan; Hracho, Michal; Tornow, Ralf P.

    2017-07-01

    This paper describes detection of distorted frames in retinal sequences based on set of global features extracted from each frame. The feature vector is consequently used in classification step, in which three types of classifiers are tested. The best classification accuracy 96% has been achieved with support vector machine approach.

  1. Concurrent Validity and Classification Accuracy of Curriculum-Based Measurement for Written Expression

    ERIC Educational Resources Information Center

    Furey, William M.; Marcotte, Amanda M.; Hintze, John M.; Shackett, Caroline M.

    2016-01-01

    The study presents a critical analysis of written expression curriculum-based measurement (WE-CBM) metrics derived from 3- and 10-min test lengths. Criterion validity and classification accuracy were examined for Total Words Written (TWW), Correct Writing Sequences (CWS), Percent Correct Writing Sequences (%CWS), and Correct Minus Incorrect…

  2. Validity Evidence in Scale Development: The Application of Cross Validation and Classification-Sequencing Validation

    ERIC Educational Resources Information Center

    Acar, Tu¨lin

    2014-01-01

    In literature, it has been observed that many enhanced criteria are limited by factor analysis techniques. Besides examinations of statistical structure and/or psychological structure, such validity studies as cross validation and classification-sequencing studies should be performed frequently. The purpose of this study is to examine cross…

  3. Rapid Classification and Identification of Microcystis aeruginosa Strains Using MALDI-TOF MS and Polygenetic Analysis.

    PubMed

    Sun, Li-Wei; Jiang, Wen-Jing; Sato, Hiroaki; Kawachi, Masanobu; Lu, Xi-Wu

    2016-01-01

    Matrix-assisted laser desorption-ionization-time-of-flight mass spectrometry (MALDI-TOF MS) was used to establish a rapid, simple, and accurate method to differentiate among strains of Microcystis aeruginosa, one of the most prevalent types of bloom-forming cyanobacteria. M. aeruginosa NIES-843, for which a complete genome has been sequenced, was used to characterize ribosomal proteins as biomarkers and to optimize conditions for observing ribosomal proteins as major peaks in a given mass spectrum. Thirty-one of 52 ribosomal subunit proteins were detected and identified along the mass spectrum. Fifty-five strains of M. aeruginosa from different habitats were analyzed using MALDI-TOF MS; among these samples, different ribosomal protein types were observed. A polygenetic analysis was performed using an unweighted pair-group method with arithmetic means and different ribosomal protein types to classify the strains into five major clades. Two clades primarily contained toxic strains, and the other three clades contained exclusively non-toxic strains. This is the first study to differentiate cyanobacterial strains using MALDI-TOF MS.

  4. Impact of training sets on classification of high-throughput bacterial 16s rRNA gene surveys

    PubMed Central

    Werner, Jeffrey J; Koren, Omry; Hugenholtz, Philip; DeSantis, Todd Z; Walters, William A; Caporaso, J Gregory; Angenent, Largus T; Knight, Rob; Ley, Ruth E

    2012-01-01

    Taxonomic classification of the thousands–millions of 16S rRNA gene sequences generated in microbiome studies is often achieved using a naïve Bayesian classifier (for example, the Ribosomal Database Project II (RDP) classifier), due to favorable trade-offs among automation, speed and accuracy. The resulting classification depends on the reference sequences and taxonomic hierarchy used to train the model; although the influence of primer sets and classification algorithms have been explored in detail, the influence of training set has not been characterized. We compared classification results obtained using three different publicly available databases as training sets, applied to five different bacterial 16S rRNA gene pyrosequencing data sets generated (from human body, mouse gut, python gut, soil and anaerobic digester samples). We observed numerous advantages to using the largest, most diverse training set available, that we constructed from the Greengenes (GG) bacterial/archaeal 16S rRNA gene sequence database and the latest GG taxonomy. Phylogenetic clusters of previously unclassified experimental sequences were identified with notable improvements (for example, 50% reduction in reads unclassified at the phylum level in mouse gut, soil and anaerobic digester samples), especially for phylotypes belonging to specific phyla (Tenericutes, Chloroflexi, Synergistetes and Candidate phyla TM6, TM7). Trimming the reference sequences to the primer region resulted in systematic improvements in classification depth, and greatest gains at higher confidence thresholds. Phylotypes unclassified at the genus level represented a greater proportion of the total community variation than classified operational taxonomic units in mouse gut and anaerobic digester samples, underscoring the need for greater diversity in existing reference databases. PMID:21716311

  5. Analysis and functional classification of transcripts from the nematode Meloidogyne incognita

    PubMed Central

    McCarter, James P; Dautova Mitreva, Makedonka; Martin, John; Dante, Mike; Wylie, Todd; Rao, Uma; Pape, Deana; Bowers, Yvette; Theising, Brenda; Murphy, Claire V; Kloek, Andrew P; Chiapelli, Brandi J; Clifton, Sandra W; Bird, David Mck; Waterston, Robert H

    2003-01-01

    Background Plant parasitic nematodes are major pathogens of most crops. Molecular characterization of these species as well as the development of new techniques for control can benefit from genomic approaches. As an entrée to characterizing plant parasitic nematode genomes, we analyzed 5,700 expressed sequence tags (ESTs) from second-stage larvae (L2) of the root-knot nematode Meloidogyne incognita. Results From these, 1,625 EST clusters were formed and classified by function using the Gene Ontology (GO) hierarchy and the Kyoto KEGG database. L2 larvae, which represent the infective stage of the life cycle before plant invasion, express a diverse array of ligand-binding proteins and abundant cytoskeletal proteins. L2 are structurally similar to Caenorhabditis elegans dauer larva and the presence of transcripts encoding glyoxylate pathway enzymes in the M. incognita clusters suggests that root-knot nematode larvae metabolize lipid stores while in search of a host. Homology to other species was observed in 79% of translated cluster sequences, with the C. elegans genome providing more information than any other source. In addition to identifying putative nematode-specific and Tylenchida-specific genes, sequencing revealed previously uncharacterized horizontal gene transfer candidates in Meloidogyne with high identity to rhizobacterial genes including homologs of nodL acetyltransferase and novel cellulases. Conclusions With sequencing from plant parasitic nematodes accelerating, the approaches to transcript characterization described here can be applied to more extensive datasets and also provide a foundation for more complex genome analyses. PMID:12702207

  6. Machine learning to design integral membrane channelrhodopsins for efficient eukaryotic expression and plasma membrane localization.

    PubMed

    Bedbrook, Claire N; Yang, Kevin K; Rice, Austin J; Gradinaru, Viviana; Arnold, Frances H

    2017-10-01

    There is growing interest in studying and engineering integral membrane proteins (MPs) that play key roles in sensing and regulating cellular response to diverse external signals. A MP must be expressed, correctly inserted and folded in a lipid bilayer, and trafficked to the proper cellular location in order to function. The sequence and structural determinants of these processes are complex and highly constrained. Here we describe a predictive, machine-learning approach that captures this complexity to facilitate successful MP engineering and design. Machine learning on carefully-chosen training sequences made by structure-guided SCHEMA recombination has enabled us to accurately predict the rare sequences in a diverse library of channelrhodopsins (ChRs) that express and localize to the plasma membrane of mammalian cells. These light-gated channel proteins of microbial origin are of interest for neuroscience applications, where expression and localization to the plasma membrane is a prerequisite for function. We trained Gaussian process (GP) classification and regression models with expression and localization data from 218 ChR chimeras chosen from a 118,098-variant library designed by SCHEMA recombination of three parent ChRs. We use these GP models to identify ChRs that express and localize well and show that our models can elucidate sequence and structure elements important for these processes. We also used the predictive models to convert a naturally occurring ChR incapable of mammalian localization into one that localizes well.

  7. Machine learning to design integral membrane channelrhodopsins for efficient eukaryotic expression and plasma membrane localization

    PubMed Central

    Rice, Austin J.; Gradinaru, Viviana; Arnold, Frances H.

    2017-01-01

    There is growing interest in studying and engineering integral membrane proteins (MPs) that play key roles in sensing and regulating cellular response to diverse external signals. A MP must be expressed, correctly inserted and folded in a lipid bilayer, and trafficked to the proper cellular location in order to function. The sequence and structural determinants of these processes are complex and highly constrained. Here we describe a predictive, machine-learning approach that captures this complexity to facilitate successful MP engineering and design. Machine learning on carefully-chosen training sequences made by structure-guided SCHEMA recombination has enabled us to accurately predict the rare sequences in a diverse library of channelrhodopsins (ChRs) that express and localize to the plasma membrane of mammalian cells. These light-gated channel proteins of microbial origin are of interest for neuroscience applications, where expression and localization to the plasma membrane is a prerequisite for function. We trained Gaussian process (GP) classification and regression models with expression and localization data from 218 ChR chimeras chosen from a 118,098-variant library designed by SCHEMA recombination of three parent ChRs. We use these GP models to identify ChRs that express and localize well and show that our models can elucidate sequence and structure elements important for these processes. We also used the predictive models to convert a naturally occurring ChR incapable of mammalian localization into one that localizes well. PMID:29059183

  8. Assessing fluctuating evolutionary pressure in yeast and mammal evolutionary rate covariation using bioinformatics of meiotic protein genetic sequences

    NASA Astrophysics Data System (ADS)

    Dehipawala, Sunil; Nguyen, A.; Tremberger, G.; Cheung, E.; Holden, T.; Lieberman, D.; Cheung, T.

    2013-09-01

    The evolutionary rate co-variation in meiotic proteins has been reported for yeast and mammal using phylogenic branch lengths which assess retention, duplication and mutation. The bioinformatics of the corresponding DNA sequences could be classified as a diagram of fractal dimension and Shannon entropy. Results from biomedical gene research provide examples on the diagram methodology. The identification of adaptive selection using entropy marker and functional-structural diversity using fractal dimension would support a regression analysis where the coefficient of determination would serve as evolutionary pathway marker for DNA sequences and be an important component in the astrobiology community. Comparisons between biomedical genes such as EEF2 (elongation factor 2 human, mouse, etc), WDR85 in epigenetics, HAR1 in human specificity, clinical trial targeted cancer gene CD47, SIRT6 in spermatogenesis, and HLA-C in mosquito bite immunology demonstrate the diagram classification methodology. Comparisons to the SEPT4-XIAP pair in stem cell apoptosis, testesexpressed taste genes TAS1R3-GNAT3 pair, and amyloid beta APLP1-APLP2 pair with the yeast-mammal DNA sequences for meiotic proteins RAD50-MRE11 pair and NCAPD2-ICK pair have accounted for the observed fluctuating evolutionary pressure systematically. Regression with high R-sq values or a triangular-like cluster pattern for concordant pairs in co-variation among the studied species could serve as evidences for the possible location of common ancestors in the entropy-fractal dimension diagram, consistent with an example of the human-chimp common ancestor study using the FOXP2 regulated genes reported in human fetal brain study. The Deinococcus radiodurans R1 Rad-A could be viewed as an outlier in the RAD50 diagram and also in the free energy versus fractal dimension regression Cook's distance, consistent with a non-Earth source for this radiation resistant bacterium. Convergent and divergent fluctuating evolutionary pressure could be studied with extension to genetic sequences in organisms in possible astrobiology conditions, with the assumption that the continuation of a book of life would require meiotic proteins everywhere in the universe.

  9. Boosted classification trees result in minor to modest improvement in the accuracy in classifying cardiovascular outcomes compared to conventional classification trees

    PubMed Central

    Austin, Peter C; Lee, Douglas S

    2011-01-01

    Purpose: Classification trees are increasingly being used to classifying patients according to the presence or absence of a disease or health outcome. A limitation of classification trees is their limited predictive accuracy. In the data-mining and machine learning literature, boosting has been developed to improve classification. Boosting with classification trees iteratively grows classification trees in a sequence of reweighted datasets. In a given iteration, subjects that were misclassified in the previous iteration are weighted more highly than subjects that were correctly classified. Classifications from each of the classification trees in the sequence are combined through a weighted majority vote to produce a final classification. The authors' objective was to examine whether boosting improved the accuracy of classification trees for predicting outcomes in cardiovascular patients. Methods: We examined the utility of boosting classification trees for classifying 30-day mortality outcomes in patients hospitalized with either acute myocardial infarction or congestive heart failure. Results: Improvements in the misclassification rate using boosted classification trees were at best minor compared to when conventional classification trees were used. Minor to modest improvements to sensitivity were observed, with only a negligible reduction in specificity. For predicting cardiovascular mortality, boosted classification trees had high specificity, but low sensitivity. Conclusions: Gains in predictive accuracy for predicting cardiovascular outcomes were less impressive than gains in performance observed in the data mining literature. PMID:22254181

  10. TRAP: automated classification, quantification and annotation of tandemly repeated sequences.

    PubMed

    Sobreira, Tiago José P; Durham, Alan M; Gruber, Arthur

    2006-02-01

    TRAP, the Tandem Repeats Analysis Program, is a Perl program that provides a unified set of analyses for the selection, classification, quantification and automated annotation of tandemly repeated sequences. TRAP uses the results of the Tandem Repeats Finder program to perform a global analysis of the satellite content of DNA sequences, permitting researchers to easily assess the tandem repeat content for both individual sequences and whole genomes. The results can be generated in convenient formats such as HTML and comma-separated values. TRAP can also be used to automatically generate annotation data in the format of feature table and GFF files.

  11. Centrifuge: rapid and sensitive classification of metagenomic sequences

    PubMed Central

    Song, Li; Breitwieser, Florian P.

    2016-01-01

    Centrifuge is a novel microbial classification engine that enables rapid, accurate, and sensitive labeling of reads and quantification of species on desktop computers. The system uses an indexing scheme based on the Burrows-Wheeler transform (BWT) and the Ferragina-Manzini (FM) index, optimized specifically for the metagenomic classification problem. Centrifuge requires a relatively small index (4.2 GB for 4078 bacterial and 200 archaeal genomes) and classifies sequences at very high speed, allowing it to process the millions of reads from a typical high-throughput DNA sequencing run within a few minutes. Together, these advances enable timely and accurate analysis of large metagenomics data sets on conventional desktop computers. Because of its space-optimized indexing schemes, Centrifuge also makes it possible to index the entire NCBI nonredundant nucleotide sequence database (a total of 109 billion bases) with an index size of 69 GB, in contrast to k-mer-based indexing schemes, which require far more extensive space. PMID:27852649

  12. A Fly-Inspired Mushroom Bodies Model for Sensory-Motor Control Through Sequence and Subsequence Learning.

    PubMed

    Arena, Paolo; Calí, Marco; Patané, Luca; Portera, Agnese; Strauss, Roland

    2016-09-01

    Classification and sequence learning are relevant capabilities used by living beings to extract complex information from the environment for behavioral control. The insect world is full of examples where the presentation time of specific stimuli shapes the behavioral response. On the basis of previously developed neural models, inspired by Drosophila melanogaster, a new architecture for classification and sequence learning is here presented under the perspective of the Neural Reuse theory. Classification of relevant input stimuli is performed through resonant neurons, activated by the complex dynamics generated in a lattice of recurrent spiking neurons modeling the insect Mushroom Bodies neuropile. The network devoted to context formation is able to reconstruct the learned sequence and also to trace the subsequences present in the provided input. A sensitivity analysis to parameter variation and noise is reported. Experiments on a roving robot are reported to show the capabilities of the architecture used as a neural controller.

  13. vConTACT: an iVirus tool to classify double-stranded DNA viruses that infect Archaea and Bacteria.

    PubMed

    Bolduc, Benjamin; Jang, Ho Bin; Doulcier, Guilhem; You, Zhi-Qiang; Roux, Simon; Sullivan, Matthew B

    2017-01-01

    Taxonomic classification of archaeal and bacterial viruses is challenging, yet also fundamental for developing a predictive understanding of microbial ecosystems. Recent identification of hundreds of thousands of new viral genomes and genome fragments, whose hosts remain unknown, requires a paradigm shift away from traditional classification approaches and towards the use of genomes for taxonomy. Here we revisited the use of genomes and their protein content as a means for developing a viral taxonomy for bacterial and archaeal viruses. A network-based analytic was evaluated and benchmarked against authority-accepted taxonomic assignments and found to be largely concordant. Exceptions were manually examined and found to represent areas of viral genome 'sequence space' that are under-sampled or prone to excessive genetic exchange. While both cases are poorly resolved by genome-based taxonomic approaches, the former will improve as viral sequence space is better sampled and the latter are uncommon. Finally, given the largely robust taxonomic capabilities of this approach, we sought to enable researchers to easily and systematically classify new viruses. Thus, we established a tool, vConTACT, as an app at iVirus, where it operates as a fast, highly scalable, user-friendly app within the free and powerful CyVerse cyberinfrastructure.

  14. vConTACT: an iVirus tool to classify double-stranded DNA viruses that infect Archaea and Bacteria

    PubMed Central

    Doulcier, Guilhem; You, Zhi-Qiang; Roux, Simon

    2017-01-01

    Taxonomic classification of archaeal and bacterial viruses is challenging, yet also fundamental for developing a predictive understanding of microbial ecosystems. Recent identification of hundreds of thousands of new viral genomes and genome fragments, whose hosts remain unknown, requires a paradigm shift away from traditional classification approaches and towards the use of genomes for taxonomy. Here we revisited the use of genomes and their protein content as a means for developing a viral taxonomy for bacterial and archaeal viruses. A network-based analytic was evaluated and benchmarked against authority-accepted taxonomic assignments and found to be largely concordant. Exceptions were manually examined and found to represent areas of viral genome ‘sequence space’ that are under-sampled or prone to excessive genetic exchange. While both cases are poorly resolved by genome-based taxonomic approaches, the former will improve as viral sequence space is better sampled and the latter are uncommon. Finally, given the largely robust taxonomic capabilities of this approach, we sought to enable researchers to easily and systematically classify new viruses. Thus, we established a tool, vConTACT, as an app at iVirus, where it operates as a fast, highly scalable, user-friendly app within the free and powerful CyVerse cyberinfrastructure. PMID:28480138

  15. vConTACT: an iVirus tool to classify double-stranded DNA viruses that infect Archaea and Bacteria

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Bolduc, Benjamin; Jang, Ho Bin; Doulcier, Guilhem

    Taxonomic classification of archaeal and bacterial viruses is challenging, yet also fundamental for developing a predictive understanding of microbial ecosystems. Recent identification of hundreds of thousands of new viral genomes and genome fragments, whose hosts remain unknown, requires a paradigm shift away from traditional classification approaches and towards the use of genomes for taxonomy. Here we revisited the use of genomes and their protein content as a means for developing a viral taxonomy for bacterial and archaeal viruses. A network-based analytic was evaluated and benchmarked against authority-accepted taxonomic assignments and found to be largely concordant. Exceptions were manually examined andmore » found to represent areas of viral genome ‘sequence space’ that are under-sampled or prone to excessive genetic exchange. While both cases are poorly resolved by genome-based taxonomic approaches, the former will improve as viral sequence space is better sampled and the latter are uncommon. Finally, given the largely robust taxonomic capabilities of this approach, we sought to enable researchers to easily and systematically classify new viruses. Thus, we established a tool, vConTACT, as an app at iVirus, where it operates as a fast, highly scalable, user-friendly app within the free and powerful CyVerse cyberinfrastructure.« less

  16. vConTACT: an iVirus tool to classify double-stranded DNA viruses that infect Archaea and Bacteria

    DOE PAGES

    Bolduc, Benjamin; Jang, Ho Bin; Doulcier, Guilhem; ...

    2017-05-03

    Taxonomic classification of archaeal and bacterial viruses is challenging, yet also fundamental for developing a predictive understanding of microbial ecosystems. Recent identification of hundreds of thousands of new viral genomes and genome fragments, whose hosts remain unknown, requires a paradigm shift away from traditional classification approaches and towards the use of genomes for taxonomy. Here we revisited the use of genomes and their protein content as a means for developing a viral taxonomy for bacterial and archaeal viruses. A network-based analytic was evaluated and benchmarked against authority-accepted taxonomic assignments and found to be largely concordant. Exceptions were manually examined andmore » found to represent areas of viral genome ‘sequence space’ that are under-sampled or prone to excessive genetic exchange. While both cases are poorly resolved by genome-based taxonomic approaches, the former will improve as viral sequence space is better sampled and the latter are uncommon. Finally, given the largely robust taxonomic capabilities of this approach, we sought to enable researchers to easily and systematically classify new viruses. Thus, we established a tool, vConTACT, as an app at iVirus, where it operates as a fast, highly scalable, user-friendly app within the free and powerful CyVerse cyberinfrastructure.« less

  17. Classification and phylogeny of the cyanobiont Anabaena azollae Strasburger: an answered question?

    PubMed

    Pereira, Ana L; Vasconcelos, Vitor

    2014-06-01

    The symbiosis Azolla-Anabaena azollae, with a worldwide distribution in pantropical and temperate regions, is one of the most studied, because of its potential application as a biofertilizer, especially in rice fields, but also as an animal food and in phytoremediation. The cyanobiont is a filamentous, heterocystic cyanobacterium that inhabits the foliar cavities of the pteridophyte and the indusium on the megasporocarp (female reproductive structure). The classification and phylogeny of the cyanobiont is very controversial: from its morphology, it has been named Nostoc azollae, Anabaena azollae, Anabaena variabilis status azollae and recently Trichormus azollae, but, from its 16S rRNA gene sequence, it has been assigned to Nostoc and/or Anabaena, and from its phycocyanin gene sequence, it has been assigned as non-Nostoc and non-Anabaena. The literature also points to a possible co-evolution between the cyanobiont and the Azolla host, since dendrograms and phylogenetic trees of fatty acids, short tandemly repeated repetitive (STRR) analysis and restriction fragment length polymorphism (RFLP) analysis of nif genes and the 16S rRNA gene give a two-cluster association that matches the two-section ranking of the host (Azolla). Another controversy surrounds the possible existence of more than one genus or more than one species strain. The use of freshly isolated or cultured cyanobionts is an additional problem, since their morphology and protein profiles are different. This review gives an overview of how morphological, chemical and genetic analyses influence the classification and phylogeny of the cyanobiont and future research. © 2014 IUMS.

  18. Genome-Wide Analyses and Functional Classification of Proline Repeat-Rich Proteins: Potential Role of eIF5A in Eukaryotic Evolution

    PubMed Central

    Mandal, Ajeet; Mandal, Swati; Park, Myung Hee

    2014-01-01

    The eukaryotic translation factor, eIF5A has been recently reported as a sequence-specific elongation factor that facilitates peptide bond formation at consecutive prolines in Saccharomyces cerevisiae, as its ortholog elongation factor P (EF-P) does in bacteria. We have searched the genome databases of 35 representative organisms from six kingdoms of life for PPP (Pro-Pro-Pro) and/or PPG (Pro-Pro-Gly)-encoding genes whose expression is expected to depend on eIF5A. We have made detailed analyses of proteome data of 5 selected species, Escherichia coli, Saccharomyces cerevisiae, Drosophila melanogaster, Mus musculus and Homo sapiens. The PPP and PPG motifs are low in the prokaryotic proteomes. However, their frequencies markedly increase with the biological complexity of eukaryotic organisms, and are higher in newly derived proteins than in those orthologous proteins commonly shared in all species. Ontology classifications of S. cerevisiae and human genes encoding the highest level of polyprolines reveal their strong association with several specific biological processes, including actin/cytoskeletal associated functions, RNA splicing/turnover, DNA binding/transcription and cell signaling. Previously reported phenotypic defects in actin polarity and mRNA decay of eIF5A mutant strains are consistent with the proposed role for eIF5A in the translation of the polyproline-containing proteins. Of all the amino acid tandem repeats (≥3 amino acids), only the proline repeat frequency correlates with functional complexity of the five organisms examined. Taken together, these findings suggest the importance of proline repeat-rich proteins and a potential role for eIF5A and its hypusine modification pathway in the course of eukaryotic evolution. PMID:25364902

  19. Toward genetics-based virus taxonomy: comparative analysis of a genetics-based classification and the taxonomy of picornaviruses.

    PubMed

    Lauber, Chris; Gorbalenya, Alexander E

    2012-04-01

    Virus taxonomy has received little attention from the research community despite its broad relevance. In an accompanying paper (C. Lauber and A. E. Gorbalenya, J. Virol. 86:3890-3904, 2012), we have introduced a quantitative approach to hierarchically classify viruses of a family using pairwise evolutionary distances (PEDs) as a measure of genetic divergence. When applied to the six most conserved proteins of the Picornaviridae, it clustered 1,234 genome sequences in groups at three hierarchical levels (to which we refer as the "GENETIC classification"). In this study, we compare the GENETIC classification with the expert-based picornavirus taxonomy and outline differences in the underlying frameworks regarding the relation of virus groups and genetic diversity that represent, respectively, the structure and content of a classification. To facilitate the analysis, we introduce two novel diagrams. The first connects the genetic diversity of taxa to both the PED distribution and the phylogeny of picornaviruses. The second depicts a classification and the accommodated genetic diversity in a standardized manner. Generally, we found striking agreement between the two classifications on species and genus taxa. A few disagreements concern the species Human rhinovirus A and Human rhinovirus C and the genus Aphthovirus, which were split in the GENETIC classification. Furthermore, we propose a new supergenus level and universal, level-specific PED thresholds, not reached yet by many taxa. Since the species threshold is approached mostly by taxa with large sampling sizes and those infecting multiple hosts, it may represent an upper limit on divergence, beyond which homologous recombination in the six most conserved genes between two picornaviruses might not give viable progeny.

  20. Novel methodologies for spectral classification of exon and intron sequences

    NASA Astrophysics Data System (ADS)

    Kwan, Hon Keung; Kwan, Benjamin Y. M.; Kwan, Jennifer Y. Y.

    2012-12-01

    Digital processing of a nucleotide sequence requires it to be mapped to a numerical sequence in which the choice of nucleotide to numeric mapping affects how well its biological properties can be preserved and reflected from nucleotide domain to numerical domain. Digital spectral analysis of nucleotide sequences unfolds a period-3 power spectral value which is more prominent in an exon sequence as compared to that of an intron sequence. The success of a period-3 based exon and intron classification depends on the choice of a threshold value. The main purposes of this article are to introduce novel codes for 1-sequence numerical representations for spectral analysis and compare them to existing codes to determine appropriate representation, and to introduce novel thresholding methods for more accurate period-3 based exon and intron classification of an unknown sequence. The main findings of this study are summarized as follows: Among sixteen 1-sequence numerical representations, the K-Quaternary Code I offers an attractive performance. A windowed 1-sequence numerical representation (with window length of 9, 15, and 24 bases) offers a possible speed gain over non-windowed 4-sequence Voss representation which increases as sequence length increases. A winner threshold value (chosen from the best among two defined threshold values and one other threshold value) offers a top precision for classifying an unknown sequence of specified fixed lengths. An interpolated winner threshold value applicable to an unknown and arbitrary length sequence can be estimated from the winner threshold values of fixed length sequences with a comparable performance. In general, precision increases as sequence length increases. The study contributes an effective spectral analysis of nucleotide sequences to better reveal embedded properties, and has potential applications in improved genome annotation.

  1. By their genes ye shall know them: genomic signatures of predatory bacteria

    PubMed Central

    Pasternak, Zohar; Pietrokovski, Shmuel; Rotem, Or; Gophna, Uri; Lurie-Weinberger, Mor N; Jurkevitch, Edouard

    2013-01-01

    Predatory bacteria are taxonomically disparate, exhibit diverse predatory strategies and are widely distributed in varied environments. To date, their predatory phenotypes cannot be discerned in genome sequence data thereby limiting our understanding of bacterial predation, and of its impact in nature. Here, we define the ‘predatome,' that is, sets of protein families that reflect the phenotypes of predatory bacteria. The proteomes of all sequenced 11 predatory bacteria, including two de novo sequenced genomes, and 19 non-predatory bacteria from across the phylogenetic and ecological landscapes were compared. Protein families discriminating between the two groups were identified and quantified, demonstrating that differences in the proteomes of predatory and non-predatory bacteria are large and significant. This analysis allows predictions to be made, as we show by confirming from genome data an over-looked bacterial predator. The predatome exhibits deficiencies in riboflavin and amino acids biosynthesis, suggesting that predators obtain them from their prey. In contrast, these genomes are highly enriched in adhesins, proteases and particular metabolic proteins, used for binding to, processing and consuming prey, respectively. Strikingly, predators and non-predators differ in isoprenoid biosynthesis: predators use the mevalonate pathway, whereas non-predators, like almost all bacteria, use the DOXP pathway. By defining predatory signatures in bacterial genomes, the predatory potential they encode can be uncovered, filling an essential gap for measuring bacterial predation in nature. Moreover, we suggest that full-genome proteomic comparisons are applicable to other ecological interactions between microbes, and provide a convenient and rational tool for the functional classification of bacteria. PMID:23190728

  2. Partitioning the Genetic Diversity of a Virus Family: Approach and Evaluation through a Case Study of Picornaviruses

    PubMed Central

    Lauber, Chris

    2012-01-01

    The recent advent of genome sequences as the only source available to classify many newly discovered viruses challenges the development of virus taxonomy by expert virologists who traditionally rely on extensive virus characterization. In this proof-of-principle study, we address this issue by presenting a computational approach (DEmARC) to classify viruses of a family into groups at hierarchical levels using a sole criterion—intervirus genetic divergence. To quantify genetic divergence, we used pairwise evolutionary distances (PEDs) estimated by maximum likelihood inference on a multiple alignment of family-wide conserved proteins. PEDs were calculated for all virus pairs, and the resulting distribution was modeled via a mixture of probability density functions. The model enables the quantitative inference of regions of distance discontinuity in the family-wide PED distribution, which define the levels of hierarchy. For each level, a limit on genetic divergence, below which two viruses join the same group, was objectively selected among a set of candidates by minimizing violations of intragroup PEDs to the limit. In a case study, we applied the procedure to hundreds of genome sequences of picornaviruses and extensively evaluated it by modulating four key parameters. It was found that the genetics-based classification largely tolerates variations in virus sampling and multiple alignment construction but is affected by the choice of protein and the measure of genetic divergence. In an accompanying paper (C. Lauber and A. E. Gorbalenya, J. Virol. 86:3905–3915, 2012), we analyze the substantial insight gained with the genetics-based classification approach by comparing it with the expert-based picornavirus taxonomy. PMID:22278230

  3. Top-Down LESA Mass Spectrometry Protein Analysis of Gram-Positive and Gram-Negative Bacteria

    NASA Astrophysics Data System (ADS)

    Kocurek, Klaudia I.; Stones, Leanne; Bunch, Josephine; May, Robin C.; Cooper, Helen J.

    2017-10-01

    We have previously shown that liquid extraction surface analysis (LESA) mass spectrometry (MS) is a technique suitable for the top-down analysis of proteins directly from intact colonies of the Gram-negative bacterium Escherichia coli K-12. Here we extend the application of LESA MS to Gram-negative Pseudomonas aeruginosa PS1054 and Gram-positive Staphylococcus aureus MSSA476, as well as two strains of E. coli (K-12 and BL21 mCherry) and an unknown species of Staphylococcus. Moreover, we demonstrate the discrimination between three species of Gram-positive Streptococcus ( Streptococcus pneumoniae D39, and the viridans group Streptococcus oralis ATCC 35037 and Streptococcus gordonii ATCC35105), a recognized challenge for matrix-assisted laser desorption ionization time-of-flight MS. A range of the proteins detected were selected for top-down LESA MS/MS. Thirty-nine proteins were identified by top-down LESA MS/MS, including 16 proteins that have not previously been observed by any other technique. The potential of LESA MS for classification and characterization of novel species is illustrated by the de novo sequencing of a new protein from the unknown species of Staphylococcus. [Figure not available: see fulltext.

  4. Long Non-coding RNAs and Their Biological Roles in Plants

    PubMed Central

    Liu, Xue; Hao, Lili; Li, Dayong; Zhu, Lihuang; Hu, Songnian

    2015-01-01

    With the development of genomics and bioinformatics, especially the extensive applications of high-throughput sequencing technology, more transcriptional units with little or no protein-coding potential have been discovered. Such RNA molecules are called non-protein-coding RNAs (npcRNAs or ncRNAs). Among them, long npcRNAs or ncRNAs (lnpcRNAs or lncRNAs) represent diverse classes of transcripts longer than 200 nucleotides. In recent years, the lncRNAs have been considered as important regulators in many essential biological processes. In plants, although a large number of lncRNA transcripts have been predicted and identified in few species, our current knowledge of their biological functions is still limited. Here, we have summarized recent studies on their identification, characteristics, classification, bioinformatics, resources, and current exploration of their biological functions in plants. PMID:25936895

  5. Bacteriocins from Lactobacillus plantarum – production, genetic organization and mode of action

    PubMed Central

    Todorov, Svetoslav D.

    2009-01-01

    Bacteriocins are biologically active proteins or protein complexes that display a bactericidal mode of action towards usually closely related species. Numerous strains of bacteriocin producing Lactobacillus plantarum have been isolated in the last two decades from different ecological niches including meat, fish, fruits, vegetables, and milk and cereal products. Several of these plantaricins have been characterized and the aminoacid sequence determined. Different aspects of the mode of action, fermentation optimization and genetic organization of the bacteriocin operon have been studied. However, numerous of bacteriocins produced by different Lactobacillus plantarum strains have not been fully characterized. In this article, a brief overview of the classification, genetics, characterization, including mode of action and production optimization for bacteriocins from Lactic Acid Bacteria in general, and where appropriate, with focus on bacteriocins produced by Lactobacillus plantarum, is presented. PMID:24031346

  6. Clusters of orthologous genes for 41 archaeal genomes and implications for evolutionary genomics of archaea.

    PubMed

    Makarova, Kira S; Sorokin, Alexander V; Novichkov, Pavel S; Wolf, Yuri I; Koonin, Eugene V

    2007-11-27

    An evolutionary classification of genes from sequenced genomes that distinguishes between orthologs and paralogs is indispensable for genome annotation and evolutionary reconstruction. Shortly after multiple genome sequences of bacteria, archaea, and unicellular eukaryotes became available, an attempt on such a classification was implemented in Clusters of Orthologous Groups of proteins (COGs). Rapid accumulation of genome sequences creates opportunities for refining COGs but also represents a challenge because of error amplification. One of the practical strategies involves construction of refined COGs for phylogenetically compact subsets of genomes. New Archaeal Clusters of Orthologous Genes (arCOGs) were constructed for 41 archaeal genomes (13 Crenarchaeota, 27 Euryarchaeota and one Nanoarchaeon) using an improved procedure that employs a similarity tree between smaller, group-specific clusters, semi-automatically partitions orthology domains in multidomain proteins, and uses profile searches for identification of remote orthologs. The annotation of arCOGs is a consensus between three assignments based on the COGs, the CDD database, and the annotations of homologs in the NR database. The 7538 arCOGs, on average, cover approximately 88% of the genes in a genome compared to a approximately 76% coverage in COGs. The finer granularity of ortholog identification in the arCOGs is apparent from the fact that 4538 arCOGs correspond to 2362 COGs; approximately 40% of the arCOGs are new. The archaeal gene core (protein-coding genes found in all 41 genome) consists of 166 arCOGs. The arCOGs were used to reconstruct gene loss and gene gain events during archaeal evolution and gene sets of ancestral forms. The Last Archaeal Common Ancestor (LACA) is conservatively estimated to possess 996 genes compared to 1245 and 1335 genes for the last common ancestors of Crenarchaeota and Euryarchaeota, respectively. It is inferred that LACA was a chemoautotrophic hyperthermophile that, in addition to the core archaeal functions, encoded more idiosyncratic systems, e.g., the CASS systems of antivirus defense and some toxin-antitoxin systems. The arCOGs provide a convenient, flexible framework for functional annotation of archaeal genomes, comparative genomics and evolutionary reconstructions. Genomic reconstructions suggest that the last common ancestor of archaea might have been (nearly) as advanced as the modern archaeal hyperthermophiles. ArCOGs and related information are available at: ftp://ftp.ncbi.nih.gov/pub/koonin/arCOGs/.

  7. Evidence of Divergent Amino Acid Usage in Comparative Analyses of R5- and X4-Associated HIV-1 Vpr Sequences

    PubMed Central

    Antell, Gregory C.; Zhong, Wen; Kercher, Katherine; Passic, Shendra; Williams, Jean; Liu, Yucheng; James, Tony; Jacobson, Jeffrey M.; Szep, Zsofia

    2017-01-01

    Vpr is an HIV-1 accessory protein that plays numerous roles during viral replication, and some of which are cell type dependent. To test the hypothesis that HIV-1 tropism extends beyond the envelope into the vpr gene, studies were performed to identify the associations between coreceptor usage and Vpr variation in HIV-1-infected patients. Colinear HIV-1 Env-V3 and Vpr amino acid sequences were obtained from the LANL HIV-1 sequence database and from well-suppressed patients in the Drexel/Temple Medicine CNS AIDS Research and Eradication Study (CARES) Cohort. Genotypic classification of Env-V3 sequences as X4 (CXCR4-utilizing) or R5 (CCR5-utilizing) was used to group colinear Vpr sequences. To reveal the sequences associated with a specific coreceptor usage genotype, Vpr amino acid sequences were assessed for amino acid diversity and Jensen-Shannon divergence between the two groups. Five amino acid alphabets were used to comprehensively examine the impact of amino acid substitutions involving side chains with similar physiochemical properties. Positions 36, 37, 41, 89, and 96 of Vpr were characterized by statistically significant divergence across multiple alphabets when X4 and R5 sequence groups were compared. In addition, consensus amino acid switches were found at positions 37 and 41 in comparisons of the R5 and X4 sequence populations. These results suggest an evolutionary link between Vpr and gp120 in HIV-1-infected patients. PMID:28620613

  8. Revisiting the taxonomical classification of Porcine Circovirus type 2 (PCV2): still a real challenge.

    PubMed

    Franzo, Giovanni; Cortey, Martí; Olvera, Alex; Novosel, Dinko; Castro, Alessandra Marnie Martins Gomes De; Biagini, Philippe; Segalés, Joaquim; Drigo, Michele

    2015-08-28

    PCV2 has emerged as one of the most devastating viral infections of swine farming, causing a relevant economic impact due to direct losses and control strategies expenses. Epidemiological and experimental studies have evidenced that genetic diversity is potentially affecting the virulence of PVC2. The growing number of PCV2 complete genomes and partial sequences available at GenBank questioned the accepted PCV2 classification. Nine hundred seventy five PCV2 complete genomes and 1,270 ORF2 sequences available from GenBank were subjected to recombination, PASC and phylogenetic analyses and results were used for comparison with previous classification scheme. The outcome of these analyses favors the recognition of four genotypes on the basis of ORF2 sequences, namely PCV2a, PCV2b, PCV2c and PCV2d-mPCV2b. To deal with the difficulty of founding an unambiguous classification and accounting the impossibility to define a p-distance cut-off, a set of reference sequences that could be used in further phylogenetic studies for PCV2 genotyping was established. Being aware that extensive phylogenetic analyses are time-consuming and often impracticable during routine diagnostic activity, ORF2 nucleotide positions adequately conserved in the reference sequences were identified and reported to allow a quick genotype differentiation. Globally, the present work provides an updated scenario of PCV2 genotypes distribution and, based on the limits of the previous classification criteria, proposes new rapid and effective schemes for differentiating the four defined PCV2 genotypes.

  9. Genetic variability of HEV isolates: inconsistencies of current classification.

    PubMed

    Oliveira-Filho, Edmilson F; König, Matthias; Thiel, Heinz-Jürgen

    2013-07-26

    Many HEV and HEV-like sequences have been reported during the last years, including isolates which may represent a number of potential new genera, new genotypes or new subtypes within the family Hepeviridae. Using the most common classification system, difficulties in the establishment of subtypes have been reported. Moreover the relevance of subtype classification for epidemiology can be questioned. In this study we have performed phylogenetic analyses based on whole capsid gene and complete HEV genomic sequences in order to evaluate the current classification of HEV at genotype and subtype levels. The results of our analyses modify the current taxonomy of genotype 3 and refine the established system for typing of HEV. In addition we suggest a classification for hepeviruses recently isolated from bats, ferrets, rats and wild boar. Copyright © 2013 Elsevier B.V. All rights reserved.

  10. Defining Electron Bifurcation in the Electron-Transferring Flavoprotein Family.

    PubMed

    Garcia Costas, Amaya M; Poudel, Saroj; Miller, Anne-Frances; Schut, Gerrit J; Ledbetter, Rhesa N; Fixen, Kathryn R; Seefeldt, Lance C; Adams, Michael W W; Harwood, Caroline S; Boyd, Eric S; Peters, John W

    2017-11-01

    Electron bifurcation is the coupling of exergonic and endergonic redox reactions to simultaneously generate (or utilize) low- and high-potential electrons. It is the third recognized form of energy conservation in biology and was recently described for select electron-transferring flavoproteins (Etfs). Etfs are flavin-containing heterodimers best known for donating electrons derived from fatty acid and amino acid oxidation to an electron transfer respiratory chain via Etf-quinone oxidoreductase. Canonical examples contain a flavin adenine dinucleotide (FAD) that is involved in electron transfer, as well as a non-redox-active AMP. However, Etfs demonstrated to bifurcate electrons contain a second FAD in place of the AMP. To expand our understanding of the functional variety and metabolic significance of Etfs and to identify amino acid sequence motifs that potentially enable electron bifurcation, we compiled 1,314 Etf protein sequences from genome sequence databases and subjected them to informatic and structural analyses. Etfs were identified in diverse archaea and bacteria, and they clustered into five distinct well-supported groups, based on their amino acid sequences. Gene neighborhood analyses indicated that these Etf group designations largely correspond to putative differences in functionality. Etfs with the demonstrated ability to bifurcate were found to form one group, suggesting that distinct conserved amino acid sequence motifs enable this capability. Indeed, structural modeling and sequence alignments revealed that identifying residues occur in the NADH- and FAD-binding regions of bifurcating Etfs. Collectively, a new classification scheme for Etf proteins that delineates putative bifurcating versus nonbifurcating members is presented and suggests that Etf-mediated bifurcation is associated with surprisingly diverse enzymes. IMPORTANCE Electron bifurcation has recently been recognized as an electron transfer mechanism used by microorganisms to maximize energy conservation. Bifurcating enzymes couple thermodynamically unfavorable reactions with thermodynamically favorable reactions in an overall spontaneous process. Here we show that the electron-transferring flavoprotein (Etf) enzyme family exhibits far greater diversity than previously recognized, and we provide a phylogenetic analysis that clearly delineates bifurcating versus nonbifurcating members of this family. Structural modeling of proteins within these groups reveals key differences between the bifurcating and nonbifurcating Etfs. Copyright © 2017 American Society for Microbiology.

  11. Defining Electron Bifurcation in the Electron-Transferring Flavoprotein Family

    PubMed Central

    Garcia Costas, Amaya M.; Poudel, Saroj; Miller, Anne-Frances; Schut, Gerrit J.; Ledbetter, Rhesa N.; Seefeldt, Lance C.; Adams, Michael W. W.

    2017-01-01

    ABSTRACT Electron bifurcation is the coupling of exergonic and endergonic redox reactions to simultaneously generate (or utilize) low- and high-potential electrons. It is the third recognized form of energy conservation in biology and was recently described for select electron-transferring flavoproteins (Etfs). Etfs are flavin-containing heterodimers best known for donating electrons derived from fatty acid and amino acid oxidation to an electron transfer respiratory chain via Etf-quinone oxidoreductase. Canonical examples contain a flavin adenine dinucleotide (FAD) that is involved in electron transfer, as well as a non-redox-active AMP. However, Etfs demonstrated to bifurcate electrons contain a second FAD in place of the AMP. To expand our understanding of the functional variety and metabolic significance of Etfs and to identify amino acid sequence motifs that potentially enable electron bifurcation, we compiled 1,314 Etf protein sequences from genome sequence databases and subjected them to informatic and structural analyses. Etfs were identified in diverse archaea and bacteria, and they clustered into five distinct well-supported groups, based on their amino acid sequences. Gene neighborhood analyses indicated that these Etf group designations largely correspond to putative differences in functionality. Etfs with the demonstrated ability to bifurcate were found to form one group, suggesting that distinct conserved amino acid sequence motifs enable this capability. Indeed, structural modeling and sequence alignments revealed that identifying residues occur in the NADH- and FAD-binding regions of bifurcating Etfs. Collectively, a new classification scheme for Etf proteins that delineates putative bifurcating versus nonbifurcating members is presented and suggests that Etf-mediated bifurcation is associated with surprisingly diverse enzymes. IMPORTANCE Electron bifurcation has recently been recognized as an electron transfer mechanism used by microorganisms to maximize energy conservation. Bifurcating enzymes couple thermodynamically unfavorable reactions with thermodynamically favorable reactions in an overall spontaneous process. Here we show that the electron-transferring flavoprotein (Etf) enzyme family exhibits far greater diversity than previously recognized, and we provide a phylogenetic analysis that clearly delineates bifurcating versus nonbifurcating members of this family. Structural modeling of proteins within these groups reveals key differences between the bifurcating and nonbifurcating Etfs. PMID:28808132

  12. Origins and challenges of viral dark matter.

    PubMed

    Krishnamurthy, Siddharth R; Wang, David

    2017-07-15

    The accurate classification of viral dark matter - metagenomic sequences that originate from viruses but do not align to any reference virus sequences - is one of the major obstacles in comprehensively defining the virome. Depending on the sample, viral dark matter can make up from anywhere between 40 and 90% of sequences. This review focuses on the specific nature of dark matter as it relates to viral sequences. We identify three factors that contribute to the existence of viral dark matter: the divergence and length of virus sequences, the limitations of alignment based classification, and limited representation of viruses in reference sequence databases. We then discuss current methods that have been developed to at least partially circumvent these limitations and thereby reduce the extent of viral dark matter. Copyright © 2017 Elsevier B.V. All rights reserved.

  13. Simrank: Rapid and sensitive general-purpose k-mer search tool

    PubMed Central

    2011-01-01

    Background Terabyte-scale collections of string-encoded data are expected from consortia efforts such as the Human Microbiome Project http://nihroadmap.nih.gov/hmp. Intra- and inter-project data similarity searches are enabled by rapid k-mer matching strategies. Software applications for sequence database partitioning, guide tree estimation, molecular classification and alignment acceleration have benefited from embedded k-mer searches as sub-routines. However, a rapid, general-purpose, open-source, flexible, stand-alone k-mer tool has not been available. Results Here we present a stand-alone utility, Simrank, which allows users to rapidly identify database strings the most similar to query strings. Performance testing of Simrank and related tools against DNA, RNA, protein and human-languages found Simrank 10X to 928X faster depending on the dataset. Conclusions Simrank provides molecular ecologists with a high-throughput, open source choice for comparing large sequence sets to find similarity. PMID:21524302

  14. Coevolutionary constraints in the sequence-space of macromolecular complexes reflect their self-assembly pathways.

    PubMed

    Mallik, Saurav; Kundu, Sudip

    2017-07-01

    Is the order in which biomolecular subunits self-assemble into functional macromolecular complexes imprinted in their sequence-space? Here, we demonstrate that the temporal order of macromolecular complex self-assembly can be efficiently captured using the landscape of residue-level coevolutionary constraints. This predictive power of coevolutionary constraints is irrespective of the structural, functional, and phylogenetic classification of the complex and of the stoichiometry and quaternary arrangement of the constituent monomers. Combining this result with a number of structural attributes estimated from the crystal structure data, we find indications that stronger coevolutionary constraints at interfaces formed early in the assembly hierarchy probably promotes coordinated fixation of mutations that leads to high-affinity binding with higher surface area, increased surface complementarity and elevated number of molecular contacts, compared to those that form late in the assembly. Proteins 2017; 85:1183-1189. © 2017 Wiley Periodicals, Inc. © 2017 Wiley Periodicals, Inc.

  15. Comprehensive evolutionary and phylogenetic analysis of Hepacivirus N (HNV).

    PubMed

    da Silva, M S; Junqueira, D M; Baumbach, L F; Cibulski, S P; Mósena, A C S; Weber, M N; Silveira, S; de Moraes, G M; Maia, R D; Coimbra, V C S; Canal, C W

    2018-05-24

    Hepaciviruses (HVs) have been detected in several domestic and wild animals and present high genetic diversity. The actual classification divides the genus Hepacivirus into 14 species (A-N), according to their phylogenetic relationships, including the bovine hepacivirus [Hepacivirus N (HNV)]. In this study, we confirmed HNV circulation in Brazil and sequenced the whole genome of two strains. Based on the current classification of HCV, which is divided into genotypes and subtypes, we analysed all available bovine hepacivirus sequences in the GenBank database and proposed an HNV classification. All of the sequences were grouped into a single genotype, putatively named 'genotype 1'. This genotype can be clearly divided into four subtypes: A and D containing sequences from Germany and Brazil, respectively, and B and C containing Ghanaian sequences. In addition, the NS3-coding region was used to estimate the time to the most recent common ancestor (TMRCA) of each subtype, using a Bayesian approach and a relaxed molecular clock model. The analyses indicated a common origin of the virus circulating in Germany and Brazil. Ghanaian sequences seemed to have an older TMRCA, indicating a long time of circulation of these viruses in the African continent.

  16. Support vector machine based classification of fast Fourier transform spectroscopy of proteins

    NASA Astrophysics Data System (ADS)

    Lazarevic, Aleksandar; Pokrajac, Dragoljub; Marcano, Aristides; Melikechi, Noureddine

    2009-02-01

    Fast Fourier transform spectroscopy has proved to be a powerful method for study of the secondary structure of proteins since peak positions and their relative amplitude are affected by the number of hydrogen bridges that sustain this secondary structure. However, to our best knowledge, the method has not been used yet for identification of proteins within a complex matrix like a blood sample. The principal reason is the apparent similarity of protein infrared spectra with actual differences usually masked by the solvent contribution and other interactions. In this paper, we propose a novel machine learning based method that uses protein spectra for classification and identification of such proteins within a given sample. The proposed method uses principal component analysis (PCA) to identify most important linear combinations of original spectral components and then employs support vector machine (SVM) classification model applied on such identified combinations to categorize proteins into one of given groups. Our experiments have been performed on the set of four different proteins, namely: Bovine Serum Albumin, Leptin, Insulin-like Growth Factor 2 and Osteopontin. Our proposed method of applying principal component analysis along with support vector machines exhibits excellent classification accuracy when identifying proteins using their infrared spectra.

  17. Automatic Classification of Protein Structure Using the Maximum Contact Map Overlap Metric

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Andonov, Rumen; Djidjev, Hristo Nikolov; Klau, Gunnar W.

    In this paper, we propose a new distance measure for comparing two protein structures based on their contact map representations. We show that our novel measure, which we refer to as the maximum contact map overlap (max-CMO) metric, satisfies all properties of a metric on the space of protein representations. Having a metric in that space allows one to avoid pairwise comparisons on the entire database and, thus, to significantly accelerate exploring the protein space compared to no-metric spaces. We show on a gold standard superfamily classification benchmark set of 6759 proteins that our exact k-nearest neighbor (k-NN) scheme classifiesmore » up to 224 out of 236 queries correctly and on a larger, extended version of the benchmark with 60; 850 additional structures, up to 1361 out of 1369 queries. Finally, our k-NN classification thus provides a promising approach for the automatic classification of protein structures based on flexible contact map overlap alignments.« less

  18. Automatic Classification of Protein Structure Using the Maximum Contact Map Overlap Metric

    DOE PAGES

    Andonov, Rumen; Djidjev, Hristo Nikolov; Klau, Gunnar W.; ...

    2015-10-09

    In this paper, we propose a new distance measure for comparing two protein structures based on their contact map representations. We show that our novel measure, which we refer to as the maximum contact map overlap (max-CMO) metric, satisfies all properties of a metric on the space of protein representations. Having a metric in that space allows one to avoid pairwise comparisons on the entire database and, thus, to significantly accelerate exploring the protein space compared to no-metric spaces. We show on a gold standard superfamily classification benchmark set of 6759 proteins that our exact k-nearest neighbor (k-NN) scheme classifiesmore » up to 224 out of 236 queries correctly and on a larger, extended version of the benchmark with 60; 850 additional structures, up to 1361 out of 1369 queries. Finally, our k-NN classification thus provides a promising approach for the automatic classification of protein structures based on flexible contact map overlap alignments.« less

  19. Study design requirements for RNA sequencing-based breast cancer diagnostics.

    PubMed

    Mer, Arvind Singh; Klevebring, Daniel; Grönberg, Henrik; Rantalainen, Mattias

    2016-02-01

    Sequencing-based molecular characterization of tumors provides information required for individualized cancer treatment. There are well-defined molecular subtypes of breast cancer that provide improved prognostication compared to routine biomarkers. However, molecular subtyping is not yet implemented in routine breast cancer care. Clinical translation is dependent on subtype prediction models providing high sensitivity and specificity. In this study we evaluate sample size and RNA-sequencing read requirements for breast cancer subtyping to facilitate rational design of translational studies. We applied subsampling to ascertain the effect of training sample size and the number of RNA sequencing reads on classification accuracy of molecular subtype and routine biomarker prediction models (unsupervised and supervised). Subtype classification accuracy improved with increasing sample size up to N = 750 (accuracy = 0.93), although with a modest improvement beyond N = 350 (accuracy = 0.92). Prediction of routine biomarkers achieved accuracy of 0.94 (ER) and 0.92 (Her2) at N = 200. Subtype classification improved with RNA-sequencing library size up to 5 million reads. Development of molecular subtyping models for cancer diagnostics requires well-designed studies. Sample size and the number of RNA sequencing reads directly influence accuracy of molecular subtyping. Results in this study provide key information for rational design of translational studies aiming to bring sequencing-based diagnostics to the clinic.

  20. Practical application of self-organizing maps to interrelate biodiversity and functional data in NGS-based metagenomics

    PubMed Central

    Weber, Marc; Teeling, Hanno; Huang, Sixing; Waldmann, Jost; Kassabgy, Mariette; Fuchs, Bernhard M; Klindworth, Anna; Klockow, Christine; Wichels, Antje; Gerdts, Gunnar; Amann, Rudolf; Glöckner, Frank Oliver

    2011-01-01

    Next-generation sequencing (NGS) technologies have enabled the application of broad-scale sequencing in microbial biodiversity and metagenome studies. Biodiversity is usually targeted by classifying 16S ribosomal RNA genes, while metagenomic approaches target metabolic genes. However, both approaches remain isolated, as long as the taxonomic and functional information cannot be interrelated. Techniques like self-organizing maps (SOMs) have been applied to cluster metagenomes into taxon-specific bins in order to link biodiversity with functions, but have not been applied to broad-scale NGS-based metagenomics yet. Here, we provide a novel implementation, demonstrate its potential and practicability, and provide a web-based service for public usage. Evaluation with published data sets mimicking varyingly complex habitats resulted into classification specificities and sensitivities of close to 100% to above 90% from phylum to genus level for assemblies exceeding 8 kb for low and medium complexity data. When applied to five real-world metagenomes of medium complexity from direct pyrosequencing of marine subsurface waters, classifications of assemblies above 2.5 kb were in good agreement with fluorescence in situ hybridizations, indicating that biodiversity was mostly retained within the metagenomes, and confirming high classification specificities. This was validated by two protein-based classifications (PBCs) methods. SOMs were able to retrieve the relevant taxa down to the genus level, while surpassing PBCs in resolution. In order to make the approach accessible to a broad audience, we implemented a feature-rich web-based SOM application named TaxSOM, which is freely available at http://www.megx.net/toolbox/taxsom. TaxSOM can classify reads or assemblies exceeding 2.5 kb with high accuracy and thus assists in linking biodiversity and functions in metagenome studies, which is a precondition to study microbial ecology in a holistic fashion. PMID:21160538

  1. Molecular Diagnostics of Arthroconidial Yeasts, Frequent Pulmonary Opportunists.

    PubMed

    Kaplan, Engin; Al-Hatmi, Abdullah M S; Ilkit, Macit; Gerrits van den Ende, A H G; Hagen, Ferry; Meis, Jacques F; de Hoog, G Sybren

    2018-01-01

    Magnusiomyces capitatus and Saprochaete clavata are members of the clade of arthroconidial yeasts that represent emerging opportunistic pulmonary pathogens in immunocompromised patients. Given that standard ribosomal DNA (rDNA) identification often provides confusing results, in this study, we analyzed 34 isolates with the goal of finding new genetic markers for classification using multilocus sequencing and amplified fragment length polymorphism (AFLP). The interspecific similarity obtained using rDNA markers (the internal transcribed spacer [ITS] and large subunit regions) was in the range of 96 to 99%, whereas that obtained using protein-coding loci ( Rbp2 , Act , and Tef1α ) was lower at 89.4 to 95.2%. Ultimately, Rbp2 was selected as the best marker for species distinction. On the basis of cloned ITS data, some strains proved to be misidentified in comparison with the identities obtained with phenotypic characters, protein sequences, and AFLP profiles, indicating that different copies of the ribosomal operon were present in a single species. Antifungal susceptibility testing revealed that voriconazole had the lowest MIC against M. capitatus , while amphotericin B had the lowest MIC against S. clavata Both species exhibited in vitro resistance to fluconazole and micafungin. Copyright © 2017 American Society for Microbiology.

  2. RAFP-Pred: Robust Prediction of Antifreeze Proteins Using Localized Analysis of n-Peptide Compositions.

    PubMed

    Khan, Shujaat; Naseem, Imran; Togneri, Roberto; Bennamoun, Mohammed

    2018-01-01

    In extreme cold weather, living organisms produce Antifreeze Proteins (AFPs) to counter the otherwise lethal intracellular formation of ice. Structures and sequences of various AFPs exhibit a high degree of heterogeneity, consequently the prediction of the AFPs is considered to be a challenging task. In this research, we propose to handle this arduous manifold learning task using the notion of localized processing. In particular, an AFP sequence is segmented into two sub-segments each of which is analyzed for amino acid and di-peptide compositions. We propose to use only the most significant features using the concept of information gain (IG) followed by a random forest classification approach. The proposed RAFP-Pred achieved an excellent performance on a number of standard datasets. We report a high Youden's index (sensitivity+specificity-1) value of 0.75 on the standard independent test data set outperforming the AFP-PseAAC, AFP_PSSM, AFP-Pred, and iAFP by a margin of 0.05, 0.06, 0.14, and 0.68, respectively. The verification rate on the UniProKB dataset is found to be 83.19 percent which is substantially superior to the 57.18 percent reported for the iAFP method.

  3. The mitochondrial genome of Cethosia biblis (Drury) (Lepidoptera: Nymphalidae).

    PubMed

    Xin, Tianrong; Li, Lei; Yao, Chengyi; Wang, Yayu; Zou, Zhiwen; Wang, Jing; Xia, Bin

    2016-07-01

    We present the complete mitogenome of Cethosia biblis (Drury) (Lepidoptera: Nymphalidae) in this article. The mitogenome was a circle molecular consisting of 15,286 nucleotides, 37 genes, and an A + T-rich region. The order of 37 genes was typical of insect mitochondrial DNA sequences described to date. The overall base composition of the genome is A (37.41%), T (42.80%), C (11.87%), and G (7.91%) with an A + T-rich hallmark as that of other invertebrate mitochondrial genomes. The start codon was mainly ATA in most of the mitochondrial protein-coding genes such as ND2, COI, ATP8, ND3, ND5, ND4, ND6, and ND1, but COII, ATP6, COIII, ND4L, and Cob genes employing ATG. The stop codon was TAA in all the protein-coding genes. The A + T region is located between 12S rRNA and tRNA(M)(et). The phylogenetic relationships of Lepidoptera species were constructed based on the nucleotides sequences of 13 PCGs of mitogenomes using the neighbor-joining method. The molecular-based phylogeny supported the traditional morphological classification on relationships within Lepidoptera species.

  4. Identification and Analysis of Mitogen-Activated Protein Kinase (MAPK) Cascades in Fragaria vesca.

    PubMed

    Zhou, Heying; Ren, Suyue; Han, Yuanfang; Zhang, Qing; Qin, Ling; Xing, Yu

    2017-08-13

    Mitogen-activated protein kinase (MAPK) cascades are highly conserved signaling modules in eukaryotes, including yeasts, plants and animals. MAPK cascades are responsible for protein phosphorylation during signal transduction events, and typically consist of three protein kinases: MAPK, MAPK kinase, and MAPK kinase kinase. In this current study, we identified a total of 12 FvMAPK , 7 FvMAPKK , 73 FvMAPKKK , and one FvMAPKKKK genes in the recently published Fragaria vesca genome sequence. This work reported the classification, annotation and phylogenetic evaluation of these genes and an assessment of conserved motifs and the expression profiling of members of the gene family were also analyzed here. The expression profiles of the MAPK and MAPKK genes in different organs and fruit developmental stages were further investigated using quantitative real-time reverse transcription PCR (qRT-PCR). Finally, the MAPK and MAPKK expression patterns in response to hormone and abiotic stresses (salt, drought, and high and low temperature) were investigated in fruit and leaves of F. vesca . The results provide a platform for further characterization of the physiological and biochemical functions of MAPK cascades in strawberry.

  5. EnzyNet: enzyme classification using 3D convolutional neural networks on spatial representation

    PubMed Central

    Amidi, Afshine; Megalooikonomou, Vasileios; Paragios, Nikos

    2018-01-01

    During the past decade, with the significant progress of computational power as well as ever-rising data availability, deep learning techniques became increasingly popular due to their excellent performance on computer vision problems. The size of the Protein Data Bank (PDB) has increased more than 15-fold since 1999, which enabled the expansion of models that aim at predicting enzymatic function via their amino acid composition. Amino acid sequence, however, is less conserved in nature than protein structure and therefore considered a less reliable predictor of protein function. This paper presents EnzyNet, a novel 3D convolutional neural networks classifier that predicts the Enzyme Commission number of enzymes based only on their voxel-based spatial structure. The spatial distribution of biochemical properties was also examined as complementary information. The two-layer architecture was investigated on a large dataset of 63,558 enzymes from the PDB and achieved an accuracy of 78.4% by exploiting only the binary representation of the protein shape. Code and datasets are available at https://github.com/shervinea/enzynet. PMID:29740518

  6. EnzyNet: enzyme classification using 3D convolutional neural networks on spatial representation.

    PubMed

    Amidi, Afshine; Amidi, Shervine; Vlachakis, Dimitrios; Megalooikonomou, Vasileios; Paragios, Nikos; Zacharaki, Evangelia I

    2018-01-01

    During the past decade, with the significant progress of computational power as well as ever-rising data availability, deep learning techniques became increasingly popular due to their excellent performance on computer vision problems. The size of the Protein Data Bank (PDB) has increased more than 15-fold since 1999, which enabled the expansion of models that aim at predicting enzymatic function via their amino acid composition. Amino acid sequence, however, is less conserved in nature than protein structure and therefore considered a less reliable predictor of protein function. This paper presents EnzyNet, a novel 3D convolutional neural networks classifier that predicts the Enzyme Commission number of enzymes based only on their voxel-based spatial structure. The spatial distribution of biochemical properties was also examined as complementary information. The two-layer architecture was investigated on a large dataset of 63,558 enzymes from the PDB and achieved an accuracy of 78.4% by exploiting only the binary representation of the protein shape. Code and datasets are available at https://github.com/shervinea/enzynet.

  7. Haplogroup relationships between domestic and wild sheep resolved using a mitogenome panel.

    PubMed

    Meadows, J R S; Hiendleder, S; Kijas, J W

    2011-04-01

    Five haplogroups have been identified in domestic sheep through global surveys of mitochondrial (mt) sequence variation, however these group classifications are often based on small fragments of the complete mtDNA sequence; partial control region or the cytochrome B gene. This study presents the complete mitogenome from representatives of each haplogroup identified in domestic sheep, plus a sample of their wild relatives. Comparison of the sequence successfully resolved the relationships between each haplogroup and provided insight into the relationship with wild sheep. The five haplogroups were characterised as branching independently, a radiation that shared a common ancestor 920,000 ± 190,000 years ago based on protein coding sequence. The utility of various mtDNA components to inform the true relationship between sheep was also examined with Bayesian, maximum likelihood and partitioned Bremmer support analyses. The control region was found to be the mtDNA component, which contributed the highest amount of support to the tree generated using the complete data set. This study provides the nucleus of a mtDNA mitogenome panel, which can be used to assess additional mitogenomes and serve as a reference set to evaluate small fragments of the mtDNA.

  8. Haplogroup relationships between domestic and wild sheep resolved using a mitogenome panel

    PubMed Central

    Meadows, J R S; Hiendleder, S; Kijas, J W

    2011-01-01

    Five haplogroups have been identified in domestic sheep through global surveys of mitochondrial (mt) sequence variation, however these group classifications are often based on small fragments of the complete mtDNA sequence; partial control region or the cytochrome B gene. This study presents the complete mitogenome from representatives of each haplogroup identified in domestic sheep, plus a sample of their wild relatives. Comparison of the sequence successfully resolved the relationships between each haplogroup and provided insight into the relationship with wild sheep. The five haplogroups were characterised as branching independently, a radiation that shared a common ancestor 920 000±190 000 years ago based on protein coding sequence. The utility of various mtDNA components to inform the true relationship between sheep was also examined with Bayesian, maximum likelihood and partitioned Bremmer support analyses. The control region was found to be the mtDNA component, which contributed the highest amount of support to the tree generated using the complete data set. This study provides the nucleus of a mtDNA mitogenome panel, which can be used to assess additional mitogenomes and serve as a reference set to evaluate small fragments of the mtDNA. PMID:20940734

  9. MALDI Top-Down sequencing: calling N- and C-terminal protein sequences with high confidence and speed.

    PubMed

    Suckau, Detlev; Resemann, Anja

    2009-12-01

    The ability to match Top-Down protein sequencing (TDS) results by MALDI-TOF to protein sequences by classical protein database searching was evaluated in this work. Resulting from these analyses were the protein identity, the simultaneous assignment of the N- and C-termini and protein sequences of up to 70 residues from either terminus. In combination with de novo sequencing using the MALDI-TDS data, even fusion proteins were assigned and the detailed sequence around the fusion site was elucidated. MALDI-TDS allowed to efficiently match protein sequences quickly and to validate recombinant protein structures-in particular, protein termini-on the level of undigested proteins.

  10. Prediction and Identification of Krüppel-Like Transcription Factors by Machine Learning Method.

    PubMed

    Liao, Zhijun; Wang, Xinrui; Chen, Xingyong; Zou, Quan

    2017-01-01

    The Krüppel-like factors (KLFs) are a family of containing Zn finger(ZF) motif transcription factors with 18 members in human genome, among them, KLF18 is predicted by bioinformatics. KLFs possess various physiological function involving in a number of cancers and other diseases. Here we perform a binary-class classification of KLFs and non-KLFs by machine learning methods. The protein sequences of KLFs and non-KLFs were searched from UniProt and randomly separate them into training dataset(containing positive and negative sequences) and test dataset(containing only negative sequences), after extracting the 188-dimensional(188D) feature vectors we carry out category with four classifiers(GBDT, libSVM, RF, and k-NN). On the human KLFs, we further dig into the evolutionary relationship and motif distribution, and finally we analyze the conserved amino acid residue of three zinc fingers. The classifier model from training dataset were well constructed, and the highest specificity(Sp) was 99.83% from a library for support vector machine(libSVM) and all the correctly classified rates were over 70% for 10-fold cross-validation on test dataset. The 18 human KLFs can be further divided into 7 groups and the zinc finger domains were located at the carboxyl terminus, and many conserved amino acid residues including Cysteine and Histidine, and the span and interval between them were consistent in the three ZF domains. Two classification models for KLFs prediction have been built by novel machine learning methods. Copyright© Bentham Science Publishers; For any queries, please email at epub@benthamscience.org.

  11. Generation and Analysis of Expressed Sequence Tags (ESTs) from Halophyte Atriplex canescens to Explore Salt-Responsive Related Genes

    PubMed Central

    Li, Jingtao; Sun, Xinhua; Yu, Gang; Jia, Chengguo; Liu, Jinliang; Pan, Hongyu

    2014-01-01

    Little information is available on gene expression profiling of halophyte A. canescens. To elucidate the molecular mechanism for stress tolerance in A. canescens, a full-length complementary DNA library was generated from A. canescens exposed to 400 mM NaCl, and provided 343 high-quality ESTs. In an evaluation of 343 valid EST sequences in the cDNA library, 197 unigenes were assembled, among which 190 unigenes (83.1% ESTs) were identified according to their significant similarities with proteins of known functions. All the 343 EST sequences have been deposited in the dbEST GenBank under accession numbers JZ535802 to JZ536144. According to Arabidopsis MIPS functional category and GO classifications, we identified 193 unigenes of the 311 annotations EST, representing 72 non-redundant unigenes sharing similarities with genes related to the defense response. The sets of ESTs obtained provide a rich genetic resource and 17 up-regulated genes related to salt stress resistance were identified by qRT-PCR. Six of these genes may contribute crucially to earlier and later stage salt stress resistance. Additionally, among the 343 unigenes sequences, 22 simple sequence repeats (SSRs) were also identified contributing to the study of A. canescens resources. PMID:24960361

  12. Sequence variation of the glycoprotein gene identifies three distinct lineages within field isolates of viral hemorrhagic septicemia virus, a fish rhabdovirus

    USGS Publications Warehouse

    Benmansour, A.; Bascuro, B.; Monnier, A.F.; Vende, P.; Winton, J.R.; de Kinkelin, P.

    1997-01-01

    To evaluate the genetic diversity of viral haemorrhagic septicaemia virus (VHSV), the sequence of the glycoprotein genes (G) of 11 North American and European isolates were determined. Comparison with the G protein of representative members of the family Rhabdoviridae suggested that VHSV was a different virus species from infectious haemorrhagic necrosis virus (IHNV) and Hirame rhabdovirus (HIRRV). At a higher taxonomic level, VHSV, IHNV and HIRRV formed a group which was genetically closest to the genus Lyssavirus. Compared with each other, the G genes of VHSV displayed a dissimilar overall genetic diversity which correlated with differences in geographical origin. The multiple sequence alignment of the complete G protein, showed that the divergent positions were not uniformly distributed along the sequence. A central region (amino acid position 245-300) accumulated substitutions and appeared to be highly variable. The genetic heterogeneity within a single isolate was high, with an apparent internal mutation frequency of 1.2 x 10(-3) per nucleotide site, attesting the quasispecies nature of the viral population. The phylogeny separated VHSV strains according to the major geographical area of isolation: genotype I for continental Europe, genotype II for the British Isles, and genotype III for North America. Isolates from continental Europe exhibited the highest genetic variability, with sub-groups correlated partially with the serological classification. Neither neutralizing polyclonal sera, nor monoclonal antibodies, were able to discriminate between the genotypes. The overall structure of the phylogenetic tree suggests that VHSV genetic diversity and evolution fit within the model of random change and positive selection operating on quasispecies.

  13. Centrifuge: rapid and sensitive classification of metagenomic sequences.

    PubMed

    Kim, Daehwan; Song, Li; Breitwieser, Florian P; Salzberg, Steven L

    2016-12-01

    Centrifuge is a novel microbial classification engine that enables rapid, accurate, and sensitive labeling of reads and quantification of species on desktop computers. The system uses an indexing scheme based on the Burrows-Wheeler transform (BWT) and the Ferragina-Manzini (FM) index, optimized specifically for the metagenomic classification problem. Centrifuge requires a relatively small index (4.2 GB for 4078 bacterial and 200 archaeal genomes) and classifies sequences at very high speed, allowing it to process the millions of reads from a typical high-throughput DNA sequencing run within a few minutes. Together, these advances enable timely and accurate analysis of large metagenomics data sets on conventional desktop computers. Because of its space-optimized indexing schemes, Centrifuge also makes it possible to index the entire NCBI nonredundant nucleotide sequence database (a total of 109 billion bases) with an index size of 69 GB, in contrast to k-mer-based indexing schemes, which require far more extensive space. © 2016 Kim et al.; Published by Cold Spring Harbor Laboratory Press.

  14. Visualization of genome signatures of eukaryote genomes by batch-learning self-organizing map with a special emphasis on Drosophila genomes.

    PubMed

    Abe, Takashi; Hamano, Yuta; Ikemura, Toshimichi

    2014-01-01

    A strategy of evolutionary studies that can compare vast numbers of genome sequences is becoming increasingly important with the remarkable progress of high-throughput DNA sequencing methods. We previously established a sequence alignment-free clustering method "BLSOM" for di-, tri-, and tetranucleotide compositions in genome sequences, which can characterize sequence characteristics (genome signatures) of a wide range of species. In the present study, we generated BLSOMs for tetra- and pentanucleotide compositions in approximately one million sequence fragments derived from 101 eukaryotes, for which almost complete genome sequences were available. BLSOM recognized phylotype-specific characteristics (e.g., key combinations of oligonucleotide frequencies) in the genome sequences, permitting phylotype-specific clustering of the sequences without any information regarding the species. In our detailed examination of 12 Drosophila species, the correlation between their phylogenetic classification and the classification on the BLSOMs was observed to visualize oligonucleotides diagnostic for species-specific clustering.

  15. Pathological and Molecular Evaluation of Pancreatic Neoplasms

    PubMed Central

    Rishi, Arvind; Goggins, Michael; Wood, Laura D.; Hruban, Ralph H.

    2015-01-01

    Pancreatic neoplasms are morphologically and genetically heterogeneous and include wide variety of neoplasms ranging from benign to malignant with an extremely poor clinical outcome. Our understanding of these pancreatic neoplasms has improved significantly with recent advances in cancer sequencing. Awareness of molecular pathogenesis brings in new opportunities for early detection, improved prognostication, and personalized gene-specific therapies. Here we review the pathological classification of pancreatic neoplasms from their molecular and genetic perspective. All of the major tumor types that arise in the pancreas have been sequenced, and a new classification that incorporates molecular findings together with pathological findings is now possible (Table 1). This classification has significant implications for our understanding of why tumors aggregate in some families, for the development of early detection tests, and for the development of personalized therapies for patients with established cancers. Here we describe this new classification using the framework of the standard histological classification. PMID:25726050

  16. Note: An automated image analysis method for high-throughput classification of surface-bound bacterial cell motions.

    PubMed

    Shen, Simon; Syal, Karan; Tao, Nongjian; Wang, Shaopeng

    2015-12-01

    We present a Single-Cell Motion Characterization System (SiCMoCS) to automatically extract bacterial cell morphological features from microscope images and use those features to automatically classify cell motion for rod shaped motile bacterial cells. In some imaging based studies, bacteria cells need to be attached to the surface for time-lapse observation of cellular processes such as cell membrane-protein interactions and membrane elasticity. These studies often generate large volumes of images. Extracting accurate bacterial cell morphology features from these images is critical for quantitative assessment. Using SiCMoCS, we demonstrated simultaneous and automated motion tracking and classification of hundreds of individual cells in an image sequence of several hundred frames. This is a significant improvement from traditional manual and semi-automated approaches to segmenting bacterial cells based on empirical thresholds, and a first attempt to automatically classify bacterial motion types for motile rod shaped bacterial cells, which enables rapid and quantitative analysis of various types of bacterial motion.

  17. Changing techniques in crop plant classification: molecularization at the National Institute of Agricultural Botany during the 1980s.

    PubMed

    Holmes, Matthew

    2017-04-01

    Modern methods of analysing biological materials, including protein and DNA sequencing, are increasingly the objects of historical study. Yet twentieth-century taxonomic techniques have been overlooked in one of their most important contexts: agricultural botany. This paper addresses this omission by harnessing unexamined archival material from the National Institute of Agricultural Botany (NIAB), a British plant science organization. During the 1980s the NIAB carried out three overlapping research programmes in crop identification and analysis: electrophoresis, near infrared spectroscopy (NIRS) and machine vision systems. For each of these three programmes, contemporary economic, statutory and scientific factors behind their uptake by the NIAB are discussed. This approach reveals significant links between taxonomic practice at the NIAB and historical questions around agricultural research, intellectual property and scientific values. Such links are of further importance given that the techniques developed by researchers at the NIAB during the 1980s remain part of crop classification guidelines issued by international bodies today.

  18. FRAN and RBF-PSO as two components of a hyper framework to recognize protein folds.

    PubMed

    Abbasi, Elham; Ghatee, Mehdi; Shiri, M E

    2013-09-01

    In this paper, an intelligent hyper framework is proposed to recognize protein folds from its amino acid sequence which is a fundamental problem in bioinformatics. This framework includes some statistical and intelligent algorithms for proteins classification. The main components of the proposed framework are the Fuzzy Resource-Allocating Network (FRAN) and the Radial Bases Function based on Particle Swarm Optimization (RBF-PSO). FRAN applies a dynamic method to tune up the RBF network parameters. Due to the patterns complexity captured in protein dataset, FRAN classifies the proteins under fuzzy conditions. Also, RBF-PSO applies PSO to tune up the RBF classifier. Experimental results demonstrate that FRAN improves prediction accuracy up to 51% and achieves acceptable multi-class results for protein fold prediction. Although RBF-PSO provides reasonable results for protein fold recognition up to 48%, it is weaker than FRAN in some cases. However the proposed hyper framework provides an opportunity to use a great range of intelligent methods and can learn from previous experiences. Thus it can avoid the weakness of some intelligent methods in terms of memory, computational time and static structure. Furthermore, the performance of this system can be enhanced throughout the system life-cycle. Copyright © 2013 Elsevier Ltd. All rights reserved.

  19. Differentiation and classification of phytoplasmas in the pigeon pea witches'-broom group (16SrIX): an update based on multiple gene sequence analysis.

    PubMed

    Lee, I-M; Bottner-Parker, K D; Zhao, Y; Bertaccini, A; Davis, R E

    2012-09-01

    The pigeon pea witches'-broom phytoplasma group (16SrIX) comprises diverse strains that cause numerous diseases in leguminous trees and herbaceous crops, vegetables, a fruit, a nut tree and a forest tree. At least 14 strains have been reported worldwide. Comparative phylogenetic analyses of the highly conserved 16S rRNA gene and the moderately conserved rplV (rpl22)-rpsC (rps3) and secY genes indicated that the 16SrIX group consists of at least six distinct genetic lineages. Some of these lineages cannot be readily differentiated based on analysis of 16S rRNA gene sequences alone. The relative genetic distances among these closely related lineages were better assessed by including more variable genes [e.g. ribosomal protein (rp) and secY genes]. The present study demonstrated that virtual RFLP analyses using rp and secY gene sequences allowed unambiguous identification of such lineages. A coding system is proposed to designate each distinct rp and secY subgroup in the 16SrIX group.

  20. Probing the Structures of Viral RNA Regulatory Elements with SHAPE and Related Methodologies

    PubMed Central

    Rausch, Jason W.; Sztuba-Solinska, Joanna; Le Grice, Stuart F. J.

    2018-01-01

    Viral RNAs were selected by evolution to possess maximum functionality in a minimal sequence. Depending on the classification of the virus and the type of RNA in question, viral RNAs must alternately be replicated, spliced, transcribed, transported from the nucleus into the cytoplasm, translated and/or packaged into nascent virions, and in most cases, provide the sequence and structural determinants to facilitate these processes. One consequence of this compact multifunctionality is that viral RNA structures can be exquisitely complex, often involving intermolecular interactions with RNA or protein, intramolecular interactions between sequence segments separated by several thousands of nucleotides, or specialized motifs such as pseudoknots or kissing loops. The fluidity of viral RNA structure can also present a challenge when attempting to characterize it, as genomic RNAs especially are likely to sample numerous conformations at various stages of the virus life cycle. Here we review advances in chemoenzymatic structure probing that have made it possible to address such challenges with respect to cis-acting elements, full-length viral genomes and long non-coding RNAs that play a major role in regulating viral gene expression. PMID:29375504

  1. Systematic analysis of snake neurotoxins' functional classification using a data warehousing approach.

    PubMed

    Siew, Joyce Phui Yee; Khan, Asif M; Tan, Paul T J; Koh, Judice L Y; Seah, Seng Hong; Koo, Chuay Yeng; Chai, Siaw Ching; Armugam, Arunmozhiarasi; Brusic, Vladimir; Jeyaseelan, Kandiah

    2004-12-12

    Sequence annotations, functional and structural data on snake venom neurotoxins (svNTXs) are scattered across multiple databases and literature sources. Sequence annotations and structural data are available in the public molecular databases, while functional data are almost exclusively available in the published articles. There is a need for a specialized svNTXs database that contains NTX entries, which are organized, well annotated and classified in a systematic manner. We have systematically analyzed svNTXs and classified them using structure-function groups based on their structural, functional and phylogenetic properties. Using conserved motifs in each phylogenetic group, we built an intelligent module for the prediction of structural and functional properties of unknown NTXs. We also developed an annotation tool to aid the functional prediction of newly identified NTXs as an additional resource for the venom research community. We created a searchable online database of NTX proteins sequences (http://research.i2r.a-star.edu.sg/Templar/DB/snake_neurotoxin). This database can also be found under Swiss-Prot Toxin Annotation Project website (http://www.expasy.org/sprot/).

  2. A story of chelatase evolution: identification and characterization of a small 13-15-kDa "ancestral" cobaltochelatase (CbiXS) in the archaea.

    PubMed

    Brindley, Amanda A; Raux, Evelyne; Leech, Helen K; Schubert, Heidi L; Warren, Martin J

    2003-06-20

    The cobaltochelatase required for the synthesis of vitamin B12 (cobalamin) in the archaeal kingdom has been identified as CbiX through similarity searching with the CbiX from Bacillus megaterium. However, the CbiX proteins in the archaea are much shorter than the CbiX proteins found in eubacteria, typically containing less than half the number of amino acids in their primary structure. For this reason the shorter CbiX proteins have been termed CbiXS and the longer versions CbiXL. The CbiXS proteins from Methanosarcina barkeri and Methanobacter thermoautotrophicum were overproduced in Escherichia coli as recombinant proteins and characterized. Through complementation studies of a defined chelatase-deficient strain of E. coli and by direct in vitro assays the function of CbiXS as a sirohydrochlorin cobaltochelatase has been demonstrated. On the basis of sequence alignments and conserved active site residues we suggest that CbiXS may represent a primordial chelatase, giving rise to larger chelatases such as CbiXL, SirB, CbiK, and HemH through gene duplication and subsequent variation and selection. A classification scheme for chelatases is proposed.

  3. Machine learning methods can replace 3D profile method in classification of amyloidogenic hexapeptides.

    PubMed

    Stanislawski, Jerzy; Kotulska, Malgorzata; Unold, Olgierd

    2013-01-17

    Amyloids are proteins capable of forming fibrils. Many of them underlie serious diseases, like Alzheimer disease. The number of amyloid-associated diseases is constantly increasing. Recent studies indicate that amyloidogenic properties can be associated with short segments of aminoacids, which transform the structure when exposed. A few hundreds of such peptides have been experimentally found. Experimental testing of all possible aminoacid combinations is currently not feasible. Instead, they can be predicted by computational methods. 3D profile is a physicochemical-based method that has generated the most numerous dataset - ZipperDB. However, it is computationally very demanding. Here, we show that dataset generation can be accelerated. Two methods to increase the classification efficiency of amyloidogenic candidates are presented and tested: simplified 3D profile generation and machine learning methods. We generated a new dataset of hexapeptides, using more economical 3D profile algorithm, which showed very good classification overlap with ZipperDB (93.5%). The new part of our dataset contains 1779 segments, with 204 classified as amyloidogenic. The dataset of 6-residue sequences with their binary classification, based on the energy of the segment, was applied for training machine learning methods. A separate set of sequences from ZipperDB was used as a test set. The most effective methods were Alternating Decision Tree and Multilayer Perceptron. Both methods obtained area under ROC curve of 0.96, accuracy 91%, true positive rate ca. 78%, and true negative rate 95%. A few other machine learning methods also achieved a good performance. The computational time was reduced from 18-20 CPU-hours (full 3D profile) to 0.5 CPU-hours (simplified 3D profile) to seconds (machine learning). We showed that the simplified profile generation method does not introduce an error with regard to the original method, while increasing the computational efficiency. Our new dataset proved representative enough to use simple statistical methods for testing the amylogenicity based only on six letter sequences. Statistical machine learning methods such as Alternating Decision Tree and Multilayer Perceptron can replace the energy based classifier, with advantage of very significantly reduced computational time and simplicity to perform the analysis. Additionally, a decision tree provides a set of very easily interpretable rules.

  4. Large-scale analysis of full-length cDNAs from the tomato (Solanum lycopersicum) cultivar Micro-Tom, a reference system for the Solanaceae genomics.

    PubMed

    Aoki, Koh; Yano, Kentaro; Suzuki, Ayako; Kawamura, Shingo; Sakurai, Nozomu; Suda, Kunihiro; Kurabayashi, Atsushi; Suzuki, Tatsuya; Tsugane, Taneaki; Watanabe, Manabu; Ooga, Kazuhide; Torii, Maiko; Narita, Takanori; Shin-I, Tadasu; Kohara, Yuji; Yamamoto, Naoki; Takahashi, Hideki; Watanabe, Yuichiro; Egusa, Mayumi; Kodama, Motoichiro; Ichinose, Yuki; Kikuchi, Mari; Fukushima, Sumire; Okabe, Akiko; Arie, Tsutomu; Sato, Yuko; Yazawa, Katsumi; Satoh, Shinobu; Omura, Toshikazu; Ezura, Hiroshi; Shibata, Daisuke

    2010-03-30

    The Solanaceae family includes several economically important vegetable crops. The tomato (Solanum lycopersicum) is regarded as a model plant of the Solanaceae family. Recently, a number of tomato resources have been developed in parallel with the ongoing tomato genome sequencing project. In particular, a miniature cultivar, Micro-Tom, is regarded as a model system in tomato genomics, and a number of genomics resources in the Micro-Tom-background, such as ESTs and mutagenized lines, have been established by an international alliance. To accelerate the progress in tomato genomics, we developed a collection of fully-sequenced 13,227 Micro-Tom full-length cDNAs. By checking redundant sequences, coding sequences, and chimeric sequences, a set of 11,502 non-redundant full-length cDNAs (nrFLcDNAs) was generated. Analysis of untranslated regions demonstrated that tomato has longer 5'- and 3'-untranslated regions than most other plants but rice. Classification of functions of proteins predicted from the coding sequences demonstrated that nrFLcDNAs covered a broad range of functions. A comparison of nrFLcDNAs with genes of sixteen plants facilitated the identification of tomato genes that are not found in other plants, most of which did not have known protein domains. Mapping of the nrFLcDNAs onto currently available tomato genome sequences facilitated prediction of exon-intron structure. Introns of tomato genes were longer than those of Arabidopsis and rice. According to a comparison of exon sequences between the nrFLcDNAs and the tomato genome sequences, the frequency of nucleotide mismatch in exons between Micro-Tom and the genome-sequencing cultivar (Heinz 1706) was estimated to be 0.061%. The collection of Micro-Tom nrFLcDNAs generated in this study will serve as a valuable genomic tool for plant biologists to bridge the gap between basic and applied studies. The nrFLcDNA sequences will help annotation of the tomato whole-genome sequence and aid in tomato functional genomics and molecular breeding. Full-length cDNA sequences and their annotations are provided in the database KaFTom http://www.pgb.kazusa.or.jp/kaftom/ via the website of the National Bioresource Project Tomato http://tomato.nbrp.jp.

  5. Toward Genetics-Based Virus Taxonomy: Comparative Analysis of a Genetics-Based Classification and the Taxonomy of Picornaviruses

    PubMed Central

    Lauber, Chris

    2012-01-01

    Virus taxonomy has received little attention from the research community despite its broad relevance. In an accompanying paper (C. Lauber and A. E. Gorbalenya, J. Virol. 86:3890–3904, 2012), we have introduced a quantitative approach to hierarchically classify viruses of a family using pairwise evolutionary distances (PEDs) as a measure of genetic divergence. When applied to the six most conserved proteins of the Picornaviridae, it clustered 1,234 genome sequences in groups at three hierarchical levels (to which we refer as the “GENETIC classification”). In this study, we compare the GENETIC classification with the expert-based picornavirus taxonomy and outline differences in the underlying frameworks regarding the relation of virus groups and genetic diversity that represent, respectively, the structure and content of a classification. To facilitate the analysis, we introduce two novel diagrams. The first connects the genetic diversity of taxa to both the PED distribution and the phylogeny of picornaviruses. The second depicts a classification and the accommodated genetic diversity in a standardized manner. Generally, we found striking agreement between the two classifications on species and genus taxa. A few disagreements concern the species Human rhinovirus A and Human rhinovirus C and the genus Aphthovirus, which were split in the GENETIC classification. Furthermore, we propose a new supergenus level and universal, level-specific PED thresholds, not reached yet by many taxa. Since the species threshold is approached mostly by taxa with large sampling sizes and those infecting multiple hosts, it may represent an upper limit on divergence, beyond which homologous recombination in the six most conserved genes between two picornaviruses might not give viable progeny. PMID:22278238

  6. Identification of a novel calcium binding motif based on the detection of sequence insertions in the animal peroxidase domain of bacterial proteins.

    PubMed

    Santamaría-Hernando, Saray; Krell, Tino; Ramos-González, María-Isabel

    2012-01-01

    Proteins of the animal heme peroxidase (ANP) superfamily differ greatly in size since they have either one or two catalytic domains that match profile PS50292. The orf PP_2561 of Pseudomonas putida KT2440 that we have called PepA encodes a two-domain ANP. The alignment of these domains with those of PepA homologues revealed a variable number of insertions with the consensus G-x-D-G-x-x-[GN]-[TN]-x-D-D. This motif has also been detected in the structure of pseudopilin (pdb 3G20), where it was found to be involved in Ca(2+) coordination although a sequence analysis did not reveal the presence of any known calcium binding motifs in this protein. Isothermal titration calorimetry revealed that a peptide containing this consensus motif bound specifically calcium ions with affinities ranging between 33-79 µM depending on the pH. Microcalorimetric titrations of the purified N-terminal ANP-like domain of PepA revealed Ca(2+) binding with a K(D) of 12 µM and stoichiometry of 1.25 calcium ions per protein monomer. This domain exhibited peroxidase activity after its reconstitution with heme. These data led to the definition of a novel calcium binding motif that we have termed PERCAL and which was abundantly present in animal peroxidase-like domains of bacterial proteins. Bacterial heme peroxidases thus possess two different types of calcium binding motifs, namely PERCAL and the related hemolysin type calcium binding motif, with the latter being located outside the catalytic domains and in their C-terminal end. A phylogenetic tree of ANP-like catalytic domains of bacterial proteins with PERCAL motifs, including single domain peroxidases, was divided into two major clusters, representing domains with and without PERCAL motif containing insertions. We have verified that the recently reported classification of bacterial heme peroxidases in two families (cd09819 and cd09821) is unrelated to these insertions. Sequences matching PERCAL were detected in all kingdoms of life.

  7. Features analysis for identification of date and party hubs in protein interaction network of Saccharomyces Cerevisiae.

    PubMed

    Mirzarezaee, Mitra; Araabi, Babak N; Sadeghi, Mehdi

    2010-12-19

    It has been understood that biological networks have modular organizations which are the sources of their observed complexity. Analysis of networks and motifs has shown that two types of hubs, party hubs and date hubs, are responsible for this complexity. Party hubs are local coordinators because of their high co-expressions with their partners, whereas date hubs display low co-expressions and are assumed as global connectors. However there is no mutual agreement on these concepts in related literature with different studies reporting their results on different data sets. We investigated whether there is a relation between the biological features of Saccharomyces Cerevisiae's proteins and their roles as non-hubs, intermediately connected, party hubs, and date hubs. We propose a classifier that separates these four classes. We extracted different biological characteristics including amino acid sequences, domain contents, repeated domains, functional categories, biological processes, cellular compartments, disordered regions, and position specific scoring matrix from various sources. Several classifiers are examined and the best feature-sets based on average correct classification rate and correlation coefficients of the results are selected. We show that fusion of five feature-sets including domains, Position Specific Scoring Matrix-400, cellular compartments level one, and composition pairs with two and one gaps provide the best discrimination with an average correct classification rate of 77%. We study a variety of known biological feature-sets of the proteins and show that there is a relation between domains, Position Specific Scoring Matrix-400, cellular compartments level one, composition pairs with two and one gaps of Saccharomyces Cerevisiae's proteins, and their roles in the protein interaction network as non-hubs, intermediately connected, party hubs and date hubs. This study also confirms the possibility of predicting non-hubs, party hubs and date hubs based on their biological features with acceptable accuracy. If such a hypothesis is correct for other species as well, similar methods can be applied to predict the roles of proteins in those species.

  8. Improved protein-protein interactions prediction via weighted sparse representation model combining continuous wavelet descriptor and PseAA composition.

    PubMed

    Huang, Yu-An; You, Zhu-Hong; Chen, Xing; Yan, Gui-Ying

    2016-12-23

    Protein-protein interactions (PPIs) are essential to most biological processes. Since bioscience has entered into the era of genome and proteome, there is a growing demand for the knowledge about PPI network. High-throughput biological technologies can be used to identify new PPIs, but they are expensive, time-consuming, and tedious. Therefore, computational methods for predicting PPIs have an important role. For the past years, an increasing number of computational methods such as protein structure-based approaches have been proposed for predicting PPIs. The major limitation in principle of these methods lies in the prior information of the protein to infer PPIs. Therefore, it is of much significance to develop computational methods which only use the information of protein amino acids sequence. Here, we report a highly efficient approach for predicting PPIs. The main improvements come from the use of a novel protein sequence representation by combining continuous wavelet descriptor and Chou's pseudo amino acid composition (PseAAC), and from adopting weighted sparse representation based classifier (WSRC). This method, cross-validated on the PPIs datasets of Saccharomyces cerevisiae, Human and H. pylori, achieves an excellent results with accuracies as high as 92.50%, 95.54% and 84.28% respectively, significantly better than previously proposed methods. Extensive experiments are performed to compare the proposed method with state-of-the-art Support Vector Machine (SVM) classifier. The outstanding results yield by our model that the proposed feature extraction method combing two kinds of descriptors have strong expression ability and are expected to provide comprehensive and effective information for machine learning-based classification models. In addition, the prediction performance in the comparison experiments shows the well cooperation between the combined feature and WSRC. Thus, the proposed method is a very efficient method to predict PPIs and may be a useful supplementary tool for future proteomics studies.

  9. The Evolution of the Observed Hubble Sequence over the past 6Gyr

    NASA Astrophysics Data System (ADS)

    Delgado-Serrano, R.; Hammer, F.; Yang, Y. B.; Puech, M.; Flores, H.; Rodrigues, M.

    2011-10-01

    During the past years we have confronted serious problems of methodology concerning the morphological and kinematic classification of distant galaxies. This has forced us to create a new simple and effective morphological classification methodology, in order to guarantee a morpho-kinematic correlation, make the reproducibility easier and restrict the classification subjectivity. Giving the characteristic of our morphological classification, we have thus been able to apply the same methodology, using equivalent observations, to representative samples of local and distant galaxies. It has allowed us to derive, for the first time, the distant Hubble sequence (~6 Gyr ago), and determine a morphological evolution of galaxies over the past 6 Gyr. Our results strongly suggest that more than half of the present-day spirals had peculiar morphologies, 6 Gyr ago.

  10. Evidence for a Pneumocystis carinii Flo8-like transcription factor: insights into organism adhesion.

    PubMed

    Kottom, Theodore J; Limper, Andrew H

    2016-02-01

    Pneumocystis carinii (Pc) adhesion to alveolar epithelial cells is well established and is thought to be a prerequisite for the initiation of Pneumocystis pneumonia. Pc binding events occur in part through the major Pc surface glycoprotein Msg, as well as an integrin-like molecule termed PcInt1. Recent data from the Pc sequencing project also demonstrate DNA sequences homologous to other genes important in Candida spp. binding to mammalian host cells, as well as organism binding to polystyrene surfaces and in biofilm formation. One of these genes, flo8, a transcription factor needed for downstream cAMP/PKA-pathway-mediated activation of the major adhesion/flocculin Flo11 in yeast, was cloned from a Pc cDNA library utilizing a partial sequence available in the Pc genome database. A CHEF blot of Pc genomic DNA yielded a single band providing evidence this gene is present in the organism. BLASTP analysis of the predicted protein demonstrated 41 % homology to the Saccharomyces cerevisiae Flo8. Northern blotting demonstrated greatest expression at pH 6.0-8.0, pH comparable to reported fungal biofilm milieu. Western blot and immunoprecipitation assays of PcFlo8 protein in isolated cyst and tropic life forms confirmed the presence of the cognate protein in these Pc life forms. Heterologous expression of Pcflo8 cDNA in flo8Δ-deficient yeast strains demonstrated that the Pcflo8 was able to restore yeast binding to polystyrene and invasive growth of yeast flo8Δ cells. Furthermore, Pcflo8 promoted yeast binding to HEK293 human epithelial cells, strengthening its functional classification as a Flo8 transcription factor. Taken together, these data suggest that PcFlo8 is expressed by Pc and may exert activity in organism adhesion and biofilm formation.

  11. Evidence for a Pneumocystis carinii Flo8-like Transcription Factor: Insights into Organism Adhesion

    PubMed Central

    Kottom, Theodore J.; Limper, Andrew H.

    2015-01-01

    Pneumocystis carinii (Pc) adhesion to alveolar epithelial cells is well established and is thought to be a prerequisite for initiation of Pneumocystis pneumonia. Pc binding events occur in part through the major Pc surface glycoprotein Msg, as well as an integrin-like molecule termed PcInt1. Recent data from the Pc sequencing project also demonstrate DNA sequences homologous to other genes important in Candida spp. binding to mammalian host cells, as well as organism binding to polystyrene surfaces and in biofilm formation. One of these genes, flo8, a transcription factor needed for downstream cAMP/PKA-pathway-mediated activation of the major adhesin/flocculin Flo11 in yeast, was cloned from a Pc cDNA library utilizing a partial sequence available in the Pc genome database. A CHEF blot of Pc genomic DNA yielded a single band providing evidence this gene is present in the organism. BLASTP analysis of the predicted protein demonstrated 41% homology to the Saccharomyces cerevisiae Flo8. Northern blotting demonstrated greatest expression at pH 6.0–8.0, pH comparable to reported fungal biofilm milieu. Western blot and immunoprecipitation assays of PcFlo8 protein in isolated cyst and tropic life forms confirmed the presence of the cognate protein in these Pc life forms. Heterologous expression of Pcflo8 cDNA in flo8Δ (deficient) yeast strains demonstrated the Pcflo8 was able to restore yeast binding to polystyrene and invasive growth of yeast flo8Δ cells. Furthermore, Pcflo8 promoted yeast binding to HEK293 human epithelial cells, strengthening its functional classification as a Flo8 transcription factor. Taken together these data suggests that PcFlo8 is expressed by Pc and may exert activity in organism adhesion and biofilm formation. PMID:26215665

  12. Long non-coding RNAs and their biological roles in plants.

    PubMed

    Liu, Xue; Hao, Lili; Li, Dayong; Zhu, Lihuang; Hu, Songnian

    2015-06-01

    With the development of genomics and bioinformatics, especially the extensive applications of high-throughput sequencing technology, more transcriptional units with little or no protein-coding potential have been discovered. Such RNA molecules are called non-protein-coding RNAs (npcRNAs or ncRNAs). Among them, long npcRNAs or ncRNAs (lnpcRNAs or lncRNAs) represent diverse classes of transcripts longer than 200 nucleotides. In recent years, the lncRNAs have been considered as important regulators in many essential biological processes. In plants, although a large number of lncRNA transcripts have been predicted and identified in few species, our current knowledge of their biological functions is still limited. Here, we have summarized recent studies on their identification, characteristics, classification, bioinformatics, resources, and current exploration of their biological functions in plants. Copyright © 2015 The Authors. Production and hosting by Elsevier Ltd.. All rights reserved.

  13. Bacteriocins from Lactobacillus plantarum - production, genetic organization and mode of action: produção, organização genética e modo de ação.

    PubMed

    Todorov, Svetoslav D

    2009-04-01

    Bacteriocins are biologically active proteins or protein complexes that display a bactericidal mode of action towards usually closely related species. Numerous strains of bacteriocin producing Lactobacillus plantarum have been isolated in the last two decades from different ecological niches including meat, fish, fruits, vegetables, and milk and cereal products. Several of these plantaricins have been characterized and the aminoacid sequence determined. Different aspects of the mode of action, fermentation optimization and genetic organization of the bacteriocin operon have been studied. However, numerous of bacteriocins produced by different Lactobacillus plantarum strains have not been fully characterized. In this article, a brief overview of the classification, genetics, characterization, including mode of action and production optimization for bacteriocins from Lactic Acid Bacteria in general, and where appropriate, with focus on bacteriocins produced by Lactobacillus plantarum, is presented.

  14. Linear regression models for solvent accessibility prediction in proteins.

    PubMed

    Wagner, Michael; Adamczak, Rafał; Porollo, Aleksey; Meller, Jarosław

    2005-04-01

    The relative solvent accessibility (RSA) of an amino acid residue in a protein structure is a real number that represents the solvent exposed surface area of this residue in relative terms. The problem of predicting the RSA from the primary amino acid sequence can therefore be cast as a regression problem. Nevertheless, RSA prediction has so far typically been cast as a classification problem. Consequently, various machine learning techniques have been used within the classification framework to predict whether a given amino acid exceeds some (arbitrary) RSA threshold and would thus be predicted to be "exposed," as opposed to "buried." We have recently developed novel methods for RSA prediction using nonlinear regression techniques which provide accurate estimates of the real-valued RSA and outperform classification-based approaches with respect to commonly used two-class projections. However, while their performance seems to provide a significant improvement over previously published approaches, these Neural Network (NN) based methods are computationally expensive to train and involve several thousand parameters. In this work, we develop alternative regression models for RSA prediction which are computationally much less expensive, involve orders-of-magnitude fewer parameters, and are still competitive in terms of prediction quality. In particular, we investigate several regression models for RSA prediction using linear L1-support vector regression (SVR) approaches as well as standard linear least squares (LS) regression. Using rigorously derived validation sets of protein structures and extensive cross-validation analysis, we compare the performance of the SVR with that of LS regression and NN-based methods. In particular, we show that the flexibility of the SVR (as encoded by metaparameters such as the error insensitivity and the error penalization terms) can be very beneficial to optimize the prediction accuracy for buried residues. We conclude that the simple and computationally much more efficient linear SVR performs comparably to nonlinear models and thus can be used in order to facilitate further attempts to design more accurate RSA prediction methods, with applications to fold recognition and de novo protein structure prediction methods.

  15. Relationship between global structural parameters and Enzyme Commission hierarchy: implications for function prediction.

    PubMed

    Boareto, Marcelo; Yamagishi, Michel E B; Caticha, Nestor; Leite, Vitor B P

    2012-10-01

    In protein databases there is a substantial number of proteins structurally determined but without function annotation. Understanding the relationship between function and structure can be useful to predict function on a large scale. We have analyzed the similarities in global physicochemical parameters for a set of enzymes which were classified according to the four Enzyme Commission (EC) hierarchical levels. Using relevance theory we introduced a distance between proteins in the space of physicochemical characteristics. This was done by minimizing a cost function of the metric tensor built to reflect the EC classification system. Using an unsupervised clustering method on a set of 1025 enzymes, we obtained no relevant clustering formation compatible with EC classification. The distance distributions between enzymes from the same EC group and from different EC groups were compared by histograms. Such analysis was also performed using sequence alignment similarity as a distance. Our results suggest that global structure parameters are not sufficient to segregate enzymes according to EC hierarchy. This indicates that features essential for function are rather local than global. Consequently, methods for predicting function based on global attributes should not obtain high accuracy in main EC classes prediction without relying on similarities between enzymes from training and validation datasets. Furthermore, these results are consistent with a substantial number of studies suggesting that function evolves fundamentally by recruitment, i.e., a same protein motif or fold can be used to perform different enzymatic functions and a few specific amino acids (AAs) are actually responsible for enzyme activity. These essential amino acids should belong to active sites and an effective method for predicting function should be able to recognize them. Copyright © 2012 Elsevier Ltd. All rights reserved.

  16. Matrix-Assisted Laser Desorption Ionization (MALDI)-Time of Flight Mass Spectrometry- and MALDI Biotyper-Based Identification of Cultured Biphenyl-Metabolizing Bacteria from Contaminated Horseradish Rhizosphere Soil▿

    PubMed Central

    Uhlik, Ondrej; Strejcek, Michal; Junkova, Petra; Sanda, Miloslav; Hroudova, Miluse; Vlcek, Cestmir; Mackova, Martina; Macek, Tomas

    2011-01-01

    Bacteria that are able to utilize biphenyl as a sole source of carbon were extracted and isolated from polychlorinated biphenyl (PCB)-contaminated soil vegetated by horseradish. Isolates were identified using matrix-assisted laser desorption ionization-time of flight mass spectrometry (MALDI-TOF MS). The usage of MALDI Biotyper for the classification of isolates was evaluated and compared to 16S rRNA gene sequence analysis. A wide spectrum of bacteria was isolated, with Arthrobacter, Serratia, Rhodococcus, and Rhizobium being predominant. Arthrobacter isolates also represented the most diverse group. The use of MALDI Biotyper in many cases permitted the identification at the level of species, which was not achieved by 16S rRNA gene sequence analyses. However, some isolates had to be identified by 16S rRNA gene analyses if MALDI Biotyper-based identification was at the level of probable or not reliable identification, usually due to a lack of reference spectra included in the database. Overall, this study shows the possibility of using MALDI-TOF MS and MALDI Biotyper for the fast and relatively nonlaborious identification/classification of soil isolates. At the same time, it demonstrates the dominant role of employing 16S rRNA gene analyses for the identification of recently isolated strains that can later fill the gaps in the protein-based identification databases. PMID:21821747

  17. Highly efficient classification and identification of human pathogenic bacteria by MALDI-TOF MS.

    PubMed

    Hsieh, Sen-Yung; Tseng, Chiao-Li; Lee, Yun-Shien; Kuo, An-Jing; Sun, Chien-Feng; Lin, Yen-Hsiu; Chen, Jen-Kun

    2008-02-01

    Accurate and rapid identification of pathogenic microorganisms is of critical importance in disease treatment and public health. Conventional work flows are time-consuming, and procedures are multifaceted. MS can be an alternative but is limited by low efficiency for amino acid sequencing as well as low reproducibility for spectrum fingerprinting. We systematically analyzed the feasibility of applying MS for rapid and accurate bacterial identification. Directly applying bacterial colonies without further protein extraction to MALDI-TOF MS analysis revealed rich peak contents and high reproducibility. The MS spectra derived from 57 isolates comprising six human pathogenic bacterial species were analyzed using both unsupervised hierarchical clustering and supervised model construction via the Genetic Algorithm. Hierarchical clustering analysis categorized the spectra into six groups precisely corresponding to the six bacterial species. Precise classification was also maintained in an independently prepared set of bacteria even when the numbers of m/z values were reduced to six. In parallel, classification models were constructed via Genetic Algorithm analysis. A model containing 18 m/z values accurately classified independently prepared bacteria and identified those species originally not used for model construction. Moreover bacteria fewer than 10(4) cells and different species in bacterial mixtures were identified using the classification model approach. In conclusion, the application of MALDI-TOF MS in combination with a suitable model construction provides a highly accurate method for bacterial classification and identification. The approach can identify bacteria with low abundance even in mixed flora, suggesting that a rapid and accurate bacterial identification using MS techniques even before culture can be attained in the near future.

  18. Taxonomic and functional assignment of cloned sequences from high Andean forest soil metagenome.

    PubMed

    Montaña, José Salvador; Jiménez, Diego Javier; Hernández, Mónica; Angel, Tatiana; Baena, Sandra

    2012-02-01

    Total metagenomic DNA was isolated from high Andean forest soil and subjected to taxonomical and functional composition analyses by means of clone library generation and sequencing. The obtained yield of 1.7 μg of DNA/g of soil was used to construct a metagenomic library of approximately 20,000 clones (in the plasmid p-Bluescript II SK+) with an average insert size of 4 Kb, covering 80 Mb of the total metagenomic DNA. Metagenomic sequences near the plasmid cloning site were sequenced and them trimmed and assembled, obtaining 299 reads and 31 contigs (0.3 Mb). Taxonomic assignment of total sequences was performed by BLASTX, resulting in 68.8, 44.8 and 24.5% classification into taxonomic groups using the metagenomic RAST server v2.0, WebCARMA v1.0 online system and MetaGenome Analyzer v3.8 software, respectively. Most clone sequences were classified as Bacteria belonging to phlya Actinobacteria, Proteobacteria and Acidobacteria. Among the most represented orders were Actinomycetales (34% average), Rhizobiales, Burkholderiales and Myxococcales and with a greater number of sequences in the genus Mycobacterium (7% average), Frankia, Streptomyces and Bradyrhizobium. The vast majority of sequences were associated with the metabolism of carbohydrates, proteins, lipids and catalytic functions, such as phosphatases, glycosyltransferases, dehydrogenases, methyltransferases, dehydratases and epoxide hydrolases. In this study we compared different methods of taxonomic and functional assignment of metagenomic clone sequences to evaluate microbial diversity in an unexplored soil ecosystem, searching for putative enzymes of biotechnological interest and generating important information for further functional screening of clone libraries.

  19. Splice-mediated Variants of Proteins (SpliVaP) - data and characterization of changes in signatures among protein isoforms due to alternative splicing.

    PubMed

    Floris, Matteo; Orsini, Massimiliano; Thanaraj, Thangavel Alphonse

    2008-10-02

    It is often the case that mammalian genes are alternatively spliced; the resulting alternate transcripts often encode protein isoforms that differ in amino acid sequences. Changes among the protein isoforms can alter the cellular properties of proteins. The effect can range from a subtle modulation to a complete loss of function. (i) We examined human splice-mediated protein isoforms (as extracted from a manually curated data set, and from a computationally predicted data set) for differences in the annotation for protein signatures (Pfam domains and PRINTS fingerprints) and we characterized the differences & their effects on protein functionalities. An important question addressed relates to the extent of protein isoforms that may lack any known function in the cell. (ii) We present a database that reports differences in protein signatures among human splice-mediated protein isoform sequences. (i) Characterization: The work points to distinct sets of alternatively spliced genes with varying degrees of annotation for the splice-mediated protein isoforms. Protein molecular functions seen to be often affected are those that relate to: binding, catalytic, transcription regulation, structural molecule, transporter, motor, and antioxidant; and the processes that are often affected are nucleic acid binding, signal transduction, and protein-protein interactions. Signatures are often included/excluded and truncated in length among protein isoforms; truncation is seen as the predominant type of change. Analysis points to the following novel aspects: (a) Analysis using data from the manually curated Vega indicates that one in 8.9 genes can lead to a protein isoform of no "known" function; and one in 18 expressed protein isoforms can be such an "orphan" isoform; the corresponding numbers as seen with computationally predicted ASD data set are: one in 4.9 genes and one in 9.8 isoforms. (b) When swapping of signatures occurs, it is often between those of same functional classifications. (c) Pfam domains can occur in varying lengths, and PRINTS fingerprints can occur with varying number of constituent motifs among isoforms - since such a variation is seen in large number of genes, it could be a general mechanism to modulate protein function. (ii) The reported resource (at http://www.bioinformatica.crs4.org/tools/dbs/splivap/) provides the community ability to access data on splice-mediated protein isoforms (with value-added annotation such as association with diseases) through changes in protein signatures.

  20. Algorithm, applications and evaluation for protein comparison by Ramanujan Fourier transform.

    PubMed

    Zhao, Jian; Wang, Jiasong; Hua, Wei; Ouyang, Pingkai

    2015-12-01

    The amino acid sequence of a protein determines its chemical properties, chain conformation and biological functions. Protein sequence comparison is of great importance to identify similarities of protein structures and infer their functions. Many properties of a protein correspond to the low-frequency signals within the sequence. Low frequency modes in protein sequences are linked to the secondary structures, membrane protein types, and sub-cellular localizations of the proteins. In this paper, we present Ramanujan Fourier transform (RFT) with a fast algorithm to analyze the low-frequency signals of protein sequences. The RFT method is applied to similarity analysis of protein sequences with the Resonant Recognition Model (RRM). The results show that the proposed fast RFT method on protein comparison is more efficient than commonly used discrete Fourier transform (DFT). RFT can detect common frequencies as significant feature for specific protein families, and the RFT spectrum heat-map of protein sequences demonstrates the information conservation in the sequence comparison. The proposed method offers a new tool for pattern recognition, feature extraction and structural analysis on protein sequences. Copyright © 2015 Elsevier Ltd. All rights reserved.

Top